# Learning-Based Hardware Design for Data Acquisition Systems

THESIS No. 8693 (2018)

PRESENTED ON 24 AUGUST 2018 AT THE FACULTY OF ENGINEERING SCIENCES AND TECHNIQUES, LABORATORY FOR INFORMATION AND INFERENCE SYSTEMS, DOCTORAL PROGRAM IN ELECTRICAL ENGINEERING

# ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

FOR THE DEGREE OF DOCTOR OF SCIENCES

BY

# Cosimo APRILE

accepted on the proposal of the jury:

Dr J.-M. Sallese, jury president; Prof. V. Cevher and Prof. Y. Leblebici, thesis directors; Prof. A. Emami, examiner; Dr C. Menolfi, examiner; Prof. C. Dehollain, examiner



Nullius in verba. (Horace, Epistles)

To my mother, my brother, and to the memory of my father

# Acknowledgements

This Ph.D. thesis work has been supported by many people throughout these years. First and foremost, I would like to thank my advisors, Prof. Yusuf Leblebici and Prof. Volkan Cevher, for giving me the opportunity to pursue my doctoral degree at EPFL under their constant guidance. In particular, I would like to thank Prof. Leblebici for his endless support, kindness and patience, and for giving me motivation, enthusiasm and guidance. He was my advisor at the time of my Master's thesis, and I sincerely thank him for giving me the opportunity to continue my studies in his lab, which opened the doors to a drastic improvement of my personal and professional profile through work on many exciting projects. Before I started my Ph.D., he introduced me to Prof. Cevher, who became my advisor during these years. I am very thankful to Prof. Cevher for many reasons. First of all, he constantly showed me the way to reach the top, giving me much fruitful advice and sometimes testing me with challenges that pushed my motivation. Without his advice and his will for excellence, this work would not have been possible. I am also very thankful to Prof. Yusuf Leblebici and Prof. Volkan Cevher for allowing and enabling many collaborations during these years, in particular with IBM Zurich Research Laboratory and the RFIC group at EPFL.

I am very grateful to the members of my thesis committee, Prof. Azita Emami, Dr. Christian Menolfi, Prof. Jean-Michel Sallese and Prof. Catherine Dehollain, for their useful comments, patience and time in reading this dissertation. Moreover, I am sincerely thankful to Prof. Catherine Dehollain and Dr. Thomas Toifl for the very fruitful collaboration established between our groups. This work would not have been possible without their outstanding project ideas and commitment.

I would like to thank Dr. Alain Vachoux for his tireless support on CAD tools and the nice time spent together, and Dr. Alexandre Schmid for his support with the computing infrastructure.

My deepest gratitude goes to Dr. Luca Baldassarre for all his extraordinary support. Luca has been one of the most important people during these years. In a few words, Luca inspired me. He showed me how to work hard and be professional while still enjoying life with many activities. I really appreciated all the brainstorming we had, even after he left EPFL. And I cannot forget all the times we challenged ourselves in the climbing gym and running, waiting for Elias to grow and climb with us... Luca, thank you, and thanks to your lovely family!

I am extremely grateful to Dr. Alessandro Cevrero, who has been a constant guide during these years. Thanks to his positive mood and determination, we managed to reach great results during the collaboration at IBM Zurich Research Laboratory. Thank you so much Ale, you are the man!

I cannot forget to thank Dr. Jury Sandrini. I sincerely thank him for being a great colleague and friend during our Ph.D. We shared great times, nice coffee breaks, climbs and my few skiing experiences. It has been a pleasure to share the office space with you!

I acknowledge Dr. Kiarash Gharibdoust for his friendship, guidance and assistance on circuit modeling. We were office mates for 2 amazing years. Thank you *doostam*!

I would like to express my gratitude to Prof. Mahsa Shoaran. Our constant discussions and brainstorming have been a source of new ideas and solutions to problems.

Many thanks to Kerim Ture for the great collaboration we had during these years. I really enjoyed working with such a great engineer and kind person as you, Kerim!

I would like to thank all former and current members of LSM and lions@epfl laboratories. Especially, I am very thankful to Jonathan Narinx for his friendship and his positive behavior. I really appreciate his help on CAD tools and his constant support in the different phases of my last 2 years in the lab. I have been very lucky to find a true friend on top of a great colleague and engineer. Thank you, Jonny!

Thanks a lot to Arda Uran for his support and kindness. In these months I have been truly impressed by his preparation and positive attitude and I am sure Arda will reach great results during his research.

Thanks to Dr. Tugba Demirci for her constant support, especially during the tape-out periods. Thanks to Sylvain Hauser for his kind support and the nice time shared together during all these years. Thanks to Selman, Mustafa, Radisav, Nikola, Vladan, Clemens, Kadir, Giulia, Sebastian, Gulperi, Ayca, Firat, Can, Irem, Bilal, Wen-Yang, Duygu, Omer, Seniz and Kerem for the fun and joyful moments we shared. Special thanks to Elmira Shahrabi and Reza Ranjandish (*navidinho*) for their constant support and friendship. Thanks to all the *lions* people, especially to Dr. Anastasios Kyrillidis (*Tasos*) for his friendship; the amazing Prof. Bah Bubacarr, who supported me a lot during my complicated first year; Prof. Tran-Dinh Quoc for sharing his knowledge and kindness; Baran Gozcu for all the good times we spent together; Dr. Marwa El Halabi for her friendship; Alp Yurtsever, Ilija Bogunovic and Dr. Yen-Huan Li.

Special thanks go to Dr. Mariazel M. Lopez. Throughout these years I really enjoyed our discussions and all the time spent together. Thank you so much for your unconditional support, encouragement and inspiration, zellina!

Many thanks go to Mattia D'Agostino, Matteo Cossale and Yari Ferrante. Even though we are far away from each other, we have managed to keep our great friendship. Life would have been tougher without you, guys.

Thanks to Dr. Enrica Montinaro, who (luckily) was my housemate during my first 2 years of Ph.D. We had amazing times together in our villa. Thanks for your true friendship, Enri!

I am also thankful to the secretaries of LSM and lions@epfl laboratories, Melinda Mischler, Patricia Vonlanthen and Gosia Baltaian as well as the secretary of the doctoral program Vanessa Maier.

I would like to thank my colleagues from IBM Zurich Research Laboratory for their useful technical discussions: Lukas Kull, Pier Andrea Francese, Christian Menolfi, Matthias Brandli, Gain Kim, Ilter Oezkaya and Marcel Kossel. I am also thankful to my colleagues from the TCL laboratory, Andrea Bonetti and Prof. Andreas Burg.

Last but not least, I am hugely indebted to my parents, my brother and my girlfriend. I thank my mother for her unconditional support, her infinite love, and for being my anchor. My father, for teaching me the right values and for pushing me to go beyond my limits and overcome my fears.

My brother Antonio, an example of strength and determination, an inspiration of grit and motivation. Thanks to aunt Lucia for her comfort and her wonderful presence. Thanks to uncle Gigi for teaching me so much and for all his support over the years. Thanks to aunt Ada and uncle Tonino...

Thanks to Andrea F. Re, for having always been a brotherly friend. Thanks to my girlfriend Rachele, for always being there for me, for helping me through the most difficult moments, and for all her patience and encouragement.

Lausanne, 24 August 2018

Cosimo Aprile

# Abstract

This multidisciplinary research project investigates optimized information extraction from signals or data volumes and develops tailored hardware implementations that trade off the complexity of data acquisition against that of data processing, conceptually allowing radically new device designs. The mathematical results in classical *Compressive Sampling* (CS) support the paradigm of *Analog-to-Information Conversion* (AIC) as a replacement for conventional ADC technologies. AICs perform data acquisition and compression simultaneously, seeking to sample signals directly for specific tasks, as opposed to acquiring a full signal at the Nyquist rate only to throw most of it away via compression. Our contention is that, in order for CS to live up to its name, both theory and practice must leverage concepts from learning. This work demonstrates our contention in hardware prototypes, with key trade-offs, for two different fields of application: edge and big-data computing.

In the framework of edge-data computing, such as wearable and implantable ecosystems, the power budget is defined by the battery capacity, which generally limits the device performance and usability. This is more evident in very challenging fields, such as medical monitoring, where high performance is required for the device to process the information with high accuracy. Furthermore, in applications like implantable medical monitoring, the system has to combine this performance with small area and low power consumption, in order to facilitate the bio-compatibility of the implant and avoid rejection by the human body. Based on our new mathematical foundations, we built different prototypes to obtain a neural signal acquisition chip that not only rigorously trades off its area, energy consumption, and the quality of its signal output, but also significantly outperforms the state-of-the-art in all aspects.

In the framework of big-data and high-performance computation, such as in high-end server applications, the RF circuits meant to transmit data from chip-to-chip or chip-to-memory are defined by low power requirements, since the heat generated by the integrated circuits is only partially dissipated by the chip package. Hence, the overall system power budget is defined by its affordable cooling capacity. For this reason, application-specific architectures and innovative techniques are used for low-power implementation. In this work, we have developed a single-ended multi-lane receiver for high-speed I/O links in server applications. The receiver operates at 7 Gbps by learning inter-symbol interference and electromagnetic coupling noise in chip-to-chip communication systems. A learning-based approach allows a versatile receiver circuit which not only copes with large channel attenuation but also implements novel crosstalk reduction techniques, to allow single-ended multi-line transmission without sacrificing its overall bandwidth for a given area within the interconnect's data-path.

#### Key words:

Implantable integrated circuit, area-efficient, low-power, compressive sensing, neural signals, learning-based digital signal processing, signal recovery, medical monitoring, adaptive compression. Far-end crosstalk, Decision-Feedback Equalizer, Inter-Symbol Interference, source-synchronous architecture, Continuous Time Linear Equalizer.

# Résumé

Ce projet de recherche multidisciplinaire vise à étudier l'extraction d'informations optimisée à partir de signaux ou de volumes de données et à développer des implémentations matérielles dédiées qui transforment la complexité de l'acquisition de données en traitement de données permettant de concevoir des dispositifs radicalement nouveaux. Les résultats mathématiques de l'*Acquisition comprimée* (*Compressive Sampling*, CS) classique prennent en charge le paradigme de la *Conversion analogique-à-information* (*Analog-to-Information Conversion*, AIC) en remplacement des technologies ADC classiques. Les AICs effectuent simultanément l'acquisition et la compression des données, en cherchant à échantillonner directement les signaux pour réaliser des tâches spécifiques, par opposition à l'acquisition d'un signal complet uniquement à la fréquence de Nyquist pour en éliminer la plus grande partie par compression. Notre thèse est que, pour que le CS suive son nom, la théorie et la pratique doivent tirer parti des concepts de l'apprentissage. Ce travail démontre notre prétention dans les prototypes de matériel, avec des compromis clés, pour deux domaines d'application différents comme le calcul de bord et de big-data.

Dans le cadre de l'informatique de bord, tel que les écosystèmes portables et implantables, le budget de puissance est défini par la capacité de la batterie, ce qui limite généralement les performances et la facilité d'utilisation de l'appareil. Ceci est plus évident dans les domaines très exigeants, tels que la surveillance médicale, où des exigences de haute performance sont nécessaires pour que l'appareil traite l'information avec une grande précision. En outre, dans des applications telles que la surveillance médicale implantable, les performances du système doivent résulter en une petite taille tout en répondant aux exigences de faible puissance, afin de faciliter la biocompatibilité de l'implant en évitant le rejet du corps humain. Sur la base de nos nouvelles bases mathématiques, nous avons construit différents prototypes pour obtenir une puce électronique d'acquisition de signaux neuronaux qui non seulement présente des compromis rigoureux entre sa surface, sa consommation d'énergie et la qualité de son signal de sortie, mais qui surpasse également l'état de l'art.

Dans le cadre du calcul de données volumineuses et hautes performances, comme dans l'application des serveurs haut de gamme, les circuits RF destinés à transmettre des données de puce à puce ou de puce à mémoire sont définis par des exigences de faible consommation, car la chaleur générée par les circuits intégrés est partiellement distribuée par le package des puces. Par conséquent, le budget de puissance global du système est défini par sa capacité de refroidissement. Pour cette raison, des architectures spécifiques aux applications et des techniques innovantes sont utilisées pour une implémentation à faible consommation. Dans ce travail, nous avons développé un récepteur multi-voies à extrémité unique pour la liaison entrée/sortie à haute vitesse dans l'application des serveurs. Le récepteur fonctionne à 7 Gbps en apprenant l'interférence entre symboles et le bruit de couplage électromagnétique dans les systèmes de communication puce à puce. Une approche basée sur l'apprentissage permet au circuit récepteur polyvalent de, non seulement gérer l'atténuation des grands canaux, mais également mettre en œuvre de nouvelles techniques d'annulation de diaphonie pour permettre une transmission à plusieurs lignes sans sacrifier sa bande passante globale pour une zone donnée dans le chemin de données de l'interconnexion.

#### Mots clés :

Circuit intégré implantable, taille efficace, faible puissance, acquisition comprimée, signaux neuronaux, traitement de signal numérique basé sur l'apprentissage, reconstruction de signal, surveillance médicale, compression adaptative. Far-end crosstalk, égaliseur de décision-rétroaction, interférence inter-symbole, architecture source-synchrone, Continuous Time Linear Equalizer.

# Sommario

Questo progetto di ricerca multidisciplinare mira a studiare l'estrazione ottimizzata delle informazioni da segnali o volumi di dati e a sviluppare implementazioni hardware su misura, che bilanciano la complessità dell'acquisizione dei dati con quella dell'elaborazione dei dati, consentendo concettualmente di progettare dispositivi radicalmente nuovi. I risultati matematici nel settore di *Compressive Sampling* (CS) supportano il nuovo paradigma della conversione da *Analogico a Information* (AIC) in sostituzione delle tecnologie ADC convenzionali. Le AIC eseguono simultaneamente l'acquisizione e la compressione dei dati, cercando di campionare direttamente i segnali per ottenere compiti specifici anziché acquisire un segnale completo alla frequenza di Nyquist, per poi buttarne via la maggior parte tramite la compressione. La nostra tesi è che, affinché l'approccio CS mantenga il suo nome, sia la teoria che la pratica devono sfruttare i concetti dell'apprendimento. In questo lavoro sono stati sviluppati diversi prototipi di hardware, con trade-offs chiave, implementati su due diversi campi di applicazione come edge e big-data computing.

Nell'ambito dell'edge-data computing, come sono le applicazioni wearable o impiantabili, il budget energetico è definito dalla capacità della batteria, che generalmente limita le prestazioni e l'usabilità del dispositivo. Ciò è più evidente in un settore molto impegnativo, come il monitoraggio medico, in cui sono necessari requisiti ad alte prestazioni del dispositivo per elaborare le informazioni con elevata precisione. Inoltre, in applicazioni come il monitoraggio medico in dispositivi impiantabili, le prestazioni del sistema devono essere raggiunte in un'area minima così come a bassa potenza, al fine di facilitare la biocompatibilità dell'impianto, evitando il rifiuto da parte del corpo. Sulla base delle nostre nuove basi matematiche, abbiamo costruito diversi prototipi per il chip di acquisizione del segnale neuronale, che non si limita a compattare e minimizzare la sua area, il consumo di energia e ottimizzare la qualità del segnale ricostruito, ma supera anche in modo significativo lo stato dell'arte in tutti gli aspetti.

Nell'ambito dei big-data e del calcolo ad alte prestazioni, come nelle applicazioni di server di fascia alta, i circuiti RF dedicati a trasmettere dati da chip a chip o da chip a memoria sono definiti da requisiti di bassa potenza, dal momento che il calore generato dai circuiti integrati è solo parzialmente dissipato dal package del chip. Quindi, il budget complessivo di energia del sistema è definito dalla sua capacità di raffreddamento. Per questo motivo, per l'implementazione a bassa potenza vengono utilizzate architetture specifiche dell'applicazione e tecniche innovative. In questo lavoro, abbiamo sviluppato un ricevitore a più linee, single-ended, per il collegamento I/O ad alta velocità per applicazioni server. Il ricevitore funziona a 7 Gbps riducendo l'interferenza inter-symbol e il rumore di accoppiamento elettromagnetico nei sistemi di comunicazione chip-to-chip. Un approccio basato sull'apprendimento consente un circuito ricevitore versatile che non solo abbatte l'importante attenuazione dovuta alla trasmissione nei canali, ma implementa anche nuove tecniche di riduzione del crosstalk, per consentire la trasmissione di linee multiple single-ended, senza sacrificare la larghezza di banda complessiva.

#### Parole chiave:

Circuiti integrati impiantabili, bassa potenza, area minima, compressive sensing, segnali neuronali, learning-based digital signal processing, ricostruzione dei segnali, monitoraggio medico, compressione adattiva. Far-end crosstalk, Decision-Feedback Equalizer, Inter-Symbol Interference, source-synchronous architecture, Continuous Time Linear Equalizer.

# Contents

| Section | Title |
|---------|-------|
|         | Acknowledgements |
|         | Abstract (English/Français/Italiano) |
|         | List of Figures |
|         | List of Tables |
| 1       | Introduction |
| 1.1     | Mobile computing and autonomous sensing systems |
| 1.2     | High performance computing |
| 1.3     | Thesis Goal |
| 1.4     | Organization and Thesis Overview |
| 1.4.1   | Part One: Wireless Implantable Device for Medical Monitoring Brain |
| 1.4.2   | Part Two: Multi-lane Single-Ended High Speed I/O Receiver |
| 1.4.3   | Last Part: Conclusion and Appendix |
| Part I  | Wireless Implantable Device for Medical Monitoring Brain |
| 2       | Implantable ecosystem |
| 2.1     | Bio-compatible requirements of the implant |
| 2.2     | Neuronal bioelectricity and biocompatible electrodes |
| 2.2.1   | Macro and Micro-electrodes iEEG recording |
| 2.3     | Implantable System on Chip |
| 2.3.1   | Wireless recording System-on-Chip |
| 2.3.2   | Data telemetry |
| 2.3.3   | Power management |
| 3       | Data compression for autonomous sensing systems |
| 3.1     | Compressive Sensing |
| 3.1.1   | Signal Sparsity |
| 3.1.2   | Compressive Signal Measurements |
| 3.1.3   | Signal Recovery |
| 3.2     | Structured Sparsity, Sampling and Recovery |
| 3.2.1   | Structured Sparsity |
| 3.2.2   | Structured Sampling |
| 3.2.3   | Structured Recovery |
| 3.3     | Learning Based Compressive Sampling |
| 3.3.1   | Optimal encoding |
| 3.3.2   | LBCS performance evaluation |
| 3.4     | Summary |
| 4       | LBCS based hardware implementation and validation |
| 4.1     | System level overview |
| 4.1.1   | Analog to compressed data stream |
| 4.1.2   | Wireless Communication |
| 4.1.3   | Implanted System Powering |
| 4.2     | Learning based sampling implementations |
| 4.2.1   | LBCS-Had Implementation |
| 4.2.2   | LBCS-DCT implementation |
| 4.2.3   | Optimal vs LBCS encoders |
| 4.3     | Single channel Adaptive LBCS-Had implementation |
| 4.3.1   | Adaptive LBCS |
| 4.3.2   | Implantable Architecture |
| 4.3.3   | Measurement results |
| 4.4     | Multichannel Adaptive LBCS-Had implementation |
| 4.4.1   | Multichannel Implantable Architecture |
| 4.4.2   | Multichannel Layout |
| 4.5     | Summary |
| Part II | Multi-lane Single-Ended High Speed I/O Receiver |
| 5       | High speed IOs ecosystem |
| 5.1     | System overview |
| 5.1.1   | Channel boards environment |
| 5.2     | Crosstalk cancellation state-of-art |
| 6       | System level analysis for high speed RX |
| 6.1     | Crosstalk cancellation considerations |
| 6.2     | Boards characteristics |
| 6.2.1   | Ch1 board |
| 6.2.2   | Ch2 board |
| 6.3     | Mathematical formulation for ideally coupled lanes |
| 6.4     | System level simulations |
| 6.4.1   | Ch1 board crosstalk reduction |
| 6.4.2   | Ch2 board crosstalk reduction |
| 6.5     | Crosstalk cancellation over skewed lanes |
| 7       | High speed receiver hardware implementation and validation |
| 7.1     | Receiver Architecture and Circuits |
| 7.1.1   | Clock generation |
| 7.1.2   | CTXC and CTLE |
| 7.1.3   | DFE and DFXC |
| 7.2     | Measurement Results |
| 7.2.1   | Ch1 measurement results |
| 7.2.2   | Ch2 measurement results |
| 7.3     | Summary |
| Part III | Conclusions |
| 8       | Conclusion and future work |
| 8.1     | Future Work |
| A       | Appendix: Da… |
| A.1     | I001-P0… |
| A.2     | Study 0… |
| A.3     | Experime… |
| A.4     | … |

| 1.1 | The first 7 nm node test chip wafer from IBM Research [3].                                                                                                                                                                                                                      | 2  |
|-----|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|
| 1.2 | Big-data and instant data in the cloud era [4]                                                                                                                                                                                                                                  | 2  |
| 1.3 | Wireless implantable devices currently available in the medical market [5]                                                                                                                                                                                                      | 4  |
| 1.4 | The Watson supercomputer, based on IBM Power7 servers                                                                                                                                                                                                                           | 5  |
| 1.5 | Designed and tested prototypes in this thesis: (a) Learning-based CS hardware design in 180nm CMOS technology; (b) Multichannel LBCS-based neuronal sensing system; (c) high-speed, 8-lanes single-ended RX in 32nm SOI technology.                                             | 7  |
| 2.1 | European brain disorders costs in 2010, reprinted from [10]                                                                                                                                                                                                                     | 12 |
| 2.2 | Biocompatible electrodes, reprinted from [25].                                                                                                                                                                                                                                  | 14 |
| 2.3 | Hybrid electrode grid containing macro- and micro-electrode arrays (a) for iEEG signal recordings, reprinted from [21]. Signals recorded from micro and macro electrodes in (b), with a highlight on micro-electrode 27, which records a seizure onset seconds before the macro-electrodes. | 16 |
| 2.4 | Block diagram of the implantable integrated system (on the left side), wirelessly linked with an external base station (on the right), where the data is reconstructed for medical monitoring and stored. No battery is used in the implanted system.                           | 17 |
| 3.1 | Electrocardiography gives a time-sparse representation of the heart electrical activity.                                                                                                                                                                                        | 22 |
| 3.2 | A multi-tone sine in the non-sparse time domain (left) and its sparser representation in Fourier domain (right). | 23 |
| 3.3 | Electrocardiography signal on top with the threshold level; its sparser representation at the bottom. | 24 |
| 3.4 | Dimensionality reduction applying Compressive Sensing technique.                                                                                                                                                                                                                | 25 |
| 3.5 | Shape of the $\ell_p^2$ minimization for $p = 1$ and $p = 2$, while the thick straight line represents all the solutions to $\mathbf{y} = \mathbf{A}\mathbf{x}$. | 27 |
| 3.6 | Empirical performance of simple and structured sparsity recovery of natural images, reprinted from [61].                                                                                                                                                                        | 30 |

| 3.7 | (left) Coherence between the Hadamard and the Wavelet bases. The coherence decreases for higher frequencies (higher coefficients). (right) Probability functions used for sampling the indices of the Fast Walsh-Hadamard Transform for 4x and 32x compression factors. | 32 |
|------|------|------|
| 3.8 | Tree structure in one signal from iEEG.org dataset I001 P034 D01 (channel 6, first annotated seizure, first 1024 samples window) and in three reconstructions obtained via Bernoulli sampling (BERN) and structured Hadamard sampling (SHS). The tree structure can be enforced via a specific tree regularizer or mostly captured via structured sampling. | 33 |
| 3.9 | First 64 Wavelet coefficients of the micro-electrode signals from two datasets from the iEEG.org portal. (left) 7 channels from dataset I001 P034 D01. (right) 32 channels from dataset Study 040. The group structure is evident among the correlated channels in both datasets, however, there remain outlier channels which do not abide to the group structure. | 34 |
| 3.10 | Dataset 1 | 36 |
| 3.11 | Example of micro-electrode signals from iEEG.org dataset I001 P034 D01 (first seizure, first 1024 samples window). Channel 1 is inactive, since it simply jumps between $-1\mu V$ and $1\mu V$. Channel 2 to 6 record normal activity, which is not much correlated. Channel 7 exhibits strong AC components, possibly picked up from the power sources. | 37 |
| 3.12 | Example of micro-electrode signals from iEEG.org dataset Study 040 (first seizure, first 1024 samples window). Channel 26 seems completely inactive, it sends a constant signal of approximately $-131 mV$. Channels 3 and 28, among others, are highly correlated. Channel 1 is an example of a channel which does not exhibit the smaller oscillations of channels 3 and 28 | 38 |
| 3.13 | I001-P034-D01 Reconstruction example for channel Grid28 on four windows of length 256 each. | 41 |
| 3.14 | Study 040 Reconstruction example for channel LG50 on four windows of length 256 each. | 43 |
| 3.15 | Trade-off between bit-rate, memory size and reconstruction performance | 45 |
| 4.1 | Typical wireless sensor system, with highlight in a battery-powered multiple lanes TX (a) and its RX counterpart (b). | 48 |
| 4.2 | Different ADC operating range, considering the sampling rate and the bit resolution, adapted from [71]. | 49 |
| 4.3 | Downlink data communication at (a) 500 kbit/s and (b) 50 kbit/s (waveforms from top to bottom; turquoise: modulator input (5V/div), purple: demodulator input ((a)2V/div (b)5V/div), and green: demodulator output (1V/div), respectively) [85] | 52 |
| 4.4 | Lumped circuit model of the 3-coil inductive link [88] | 53 |
| 4.5 | (Left top) Half-wave active rectifier composed of a pass transistor, comparator, and a multiplexer; (right) the low drop-out voltage regulator with its cascoded bootstrapped current source; and (left bottom) connection of rectifier and the regulator. [89] | 53 |

| 4.6 | One channel block diagram showing the LBCS encoder and the matrix sequence generation logic. | 55 |
|------|------|------|
| 4.7 | Accumulator block diagram. | 55 |
| 4.8 | One channel encoder layout showing the LBCS encoding circuit and the matrix sequence generation logic for $N = 256$ and $CR = 16$. | 57 |
| 4.9 | One channel block diagram showing the LBCS encoder and the matrix sequence generation logic. | 59 |
| 4.10 | One-channel DCT-LBCS encoder layout for $N = 256$ and $CR = 32$ | 60 |
| 4.11 | Variable CR block diagram, defined by the threshold level (Thr). | 64 |
| 4.12 | SNR analysis for adaptive approach | 65 |
| 4.13 | One channel block diagram showing the LBCS encoder and the matrix sequence generation logic. | 66 |
| 4.14 | Hadamard bit generator block diagram. | 67 |
| 4.15 | Schematic of the LC cross-coupled voltage controlled oscillator [93] | 68 |
| 4.16 | Schematic of the IR-UWB transmitter [93] | 69 |
| 4.17 | Block diagram of the proposed implanted electronics for wireless power transmission [93] | 70 |
| 4.18 | Layout (on the left) and micrograph (on the right) of the tested chip | 71 |
| 4.19 | Measurement setup, highlighting the FPGA and PCB link. | 71 |
| 4.20 | Measured compressed values with low threshold (on the left) and high threshold (on the right). | 73 |
| 4.21 | Spectrum of the LC cross-coupled voltage controlled oscillator [93] | 75 |
| 4.22 | Transient pulses of the IR-UWB transmitter at 250 Mpps [93] | 76 |
| 4.23 | Power spectral density of the IR-UWB transmitter [93] | 77 |
| 4.24 | Layout of the designed multichannel implementation. | 79 |
| 5.1 | Chip-to-chip backplane link, SE 4.8 Gb/s [100]. | 84 |
| 5.2 | Pin data rate evolution across most common I/O standards [101] | 85 |
| 5.3 | Chip-to-chip block diagram (top), depicting the transmitted signal before and after the attenuation due to the channel link. Section of a typical backplane system, highlighting the signalling paths [100] (bottom) | 86 |
| 5.4 | ISI and crosstalk highlight in a multilane high speed I/O link [105] | 86 |
| 5.5 | Highlight of the pulse response and its derived crosstalk pulse response in a 2 lanes single ended I/O link, reprinted from [105] | 87 |
| 6.1 | Crosstalk cancellation using CTXC front-end on 3 lanes channel | 90 |
| 6.2 | Forward and FEXT frequency responses (magnitude) for the Ch1 (a) and Ch2 (b) PCB board | 92 |
| 6.3 | Single-lane transceiver block diagram with crosstalk compensation scheme combining CTXC on the front-end. | 94 |

| 6.4 | Simulated RX data eye for Ch1 board, with all aggressors switched (a) off and on (b) without crosstalk compensation scheme (CTLE and DFE on, in both cases). (c) Data eye and (d) bathtub plot with optimally calibrated CTXC front-end. All aggressors are transmitting. | 95 |
|------|------|------|
| 6.5 | FEXT pulse response from the aggressor to victim lane before and after CTXC. | 95 |
| 6.6 | Simulated RX data eye for Ch2 board, with all aggressors switched off (a) and on (b) without crosstalk compensation (CTLE and DFE on, in both cases). (c) Data eye and (d) bathtub plot with optimally calibrated CTXC front-end with the two nearest aggressors transmitting. | 96 |
| 6.7 | Highlight of the vertical eye aperture (a) and *signal*, crosstalk and ISI (b) evolution for different CTLE peaking settings. | 97 |
| 6.8 | Probability distribution function of the crosstalk pulse-response spanned over all postcursor taps without crosstalk cancellation (a), with only CTXC on (b), with CTXC off and DFXC on (c) and with both CTXC-DFXC activated (d) | 100 |
| 6.9 | Skewed (a) and un-skewed (b) impulse responses at the TX side | 100 |
| 6.10 | Qualitative highlight of CTXC effects for un-skewed (a) and skewed (b) board lanes. | 101 |
| 6.11 | Vertical eye aperture versus different lane skews at the RX side, with different number *n* of taps activated on the DFXC. The simulations are performed with Ch2 channel board. | 102 |
| 7.1 | 8-lane single-ended receiver architecture. | 103 |
| 7.2 | CTXC stage with single-ended passive differentiator, variable gain amplifier and current summation. The two high pass RC differentiators are highlighted in the boxes | 104 |
| 7.3 | Simulated AC response of main signal path VGA with maximum gain setting. | 105 |
| 7.4 | CTLE stage with negative-C bandwidth enhancement. Reprinted from [121]. | 106 |
| 7.5 | Integrating DFE using SC feedback. | 107 |
| 7.6 | DFE and DFXC core, with fast tap-1 feedback, including 8-tap DFE and 7×8 DFXC SC cells. | 108 |
| 7.7 | Layout of RX macro (center), detail of the SC-DFE cells (on top) and the die micrograph (bottom). | 109 |
| 7.8 | On the left, the chip is flip-chip mounted on the LCP PCB. On the right, the LCP is packaged in a rigid metallic frame. | 110 |
| 7.9 | Measurement setup: clock generators on top left, PARBERT for PRBS generation on bottom left, test board Ch2 on bottom right and the RX in the middle | 111 |
| 7.10 | Measured bathtub plots for Ch1 board with CTXC switched off (a) and switched | 112 |
| 7.11 | Board-Ch2: measured correlation with postcursor taps with and without DFXC, on the left; measured bathtub plots, on the right. | 112 |

| 7.12 | Received eye diagrams with silent aggressors (top-left), crosstalk cancellation off (top-right), crosstalk cancellation activated (bottom-left) with related bathtub plot (bottom-right) | 113 |
|------|------|------|

# List of Tables

| 1.1 | Power budget for different applications                                                        | 3   |
|-----|------------------------------------------------------------------------------------------------|-----|
| 2.1 | Neuronal signals characteristics.                                                              | 15  |
| 3.1 | iEEG.org portal dataset I001 P034 D01. Mean SNR over channels 2-6                              | 36  |
| 3.2 | iEEG.org portal dataset I001 P034 D01. Mean SNR over channels 2-6                              | 36  |
| 3.3 | iEEG.org portal dataset Study 040. Mean SNR over all channels                                  | 39  |
| 3.4 | I001-P034-D01 N = 256, $B_i = 10$                                                              | 42  |
| 3.5 | Study 040 N = 256, $B_i = 10$                                                                  | 42  |
| 3.6 | Reconstruction performance (in dB) N = 32 - $B_i = 10$                                         | 42  |
| 3.7 | Performance (dB) N = 256, $B_i$ = 10, $B_{DCT}$ = 8                                            | 44  |
| 4.1 | Comparison With Published Work                                                                 | 58  |
| 4.2 | Comparison With Published Work                                                                 | 61  |
| 4.3 | Recovery performance comparison with published work (N = 256, $B_i = 10$)                      | 74  |
| 4.4 | Recovery performance summary for this work (N = 64, $B_i = 8$)                                 | 74  |
| 4.5 | Compression hardware comparison with published work                                            | 74  |
| 6.1 | Crosstalk boards key parameters                                                                | 91  |
| 6.2 | Crosstalk Cancellation Performances                                                            | 98  |
| 7.1 | RX power distribution                                                                          | 114 |
| 7.2 | Comparison of 8 lanes × 7 Gb/s RX macro with prior art                                         | 114 |

# 1 Introduction

Since the advent of *Integrated Circuits* (ICs) in 1958 [1], semiconductor manufacturing technology has improved constantly to meet the need for integrating ever more complex functions into a single chip. In the last few decades, information technology has undergone a revolution in which *Very Large Scale Integration* (VLSI) technology has been the key to developing systems capable of addressing the challenges of many applications, spanning telecommunications, imaging, high-speed transceivers, home automation, and environmental and medical monitoring. Continuous technology scaling, known as Moore's law [2], under which the number of transistors on a chip has roughly doubled every two years, has fueled continuous innovation in system performance. Indeed, the trend predicted by Moore has resulted in more complex system-on-chip architectures, with a natural increase in system bandwidth. Recently, IBM Research, in collaboration with GlobalFoundries and Samsung, successfully produced the first 7 nm node test chips at wafer scale, shown in Fig. 1.1. This new manufacturing technique has the potential to pack 20 billion working transistors into a chip the size of a fingernail.

Although the miniaturization trend has been followed rigidly for half a century, the 2015 *International Technology Roadmap for Semiconductors* (ITRS) [4] predicts that transistors could stop shrinking within the next few years. The report forecasts that, after 2021, it will no longer be economically viable for companies to continue the traditional effort in transistor miniaturization, sacrificing chip speed gains for energy savings. Instead, manufacturing technology will move towards other ways of increasing chip density, turning chip design towards vertical geometries that stack multiple layers of circuitry on top of one another, that is, 3D microprocessor structures.

Moreover, system scaling is challenged by limits on area, power and interconnect bandwidth. Since the advent of cloud computing, data generation has mainly been of two kinds: *big-data*, which requires heavy computation and memory resources, and *instant data*, which is produced by always-on low-power devices, as depicted in Fig. 1.2. In this framework, the industry is currently facing a new trend, named *More Moore* (MM), in which value is added to devices by integrating optimized solutions that do not scale following Moore's



Figure 1.1 – The first 7 nm node test chip wafer from IBM Research [3].



Figure 1.2 – Big-data and instant data in the cloud era [4].

| Application           | Sensors                      | Wireless Interfaces                | Power Consumption   | Battery Lifetime |
|-----------------------|------------------------------|------------------------------------|---------------------|------------------|
| Pacemaker             | Pacing leads                 | Inductive link                     | $10 \mu \mathrm{W}$ | Several Years    |
| Human body monitoring | ECG, heart rate, Temperature | 900 MHz ISM                        | 1-8 mW              | Several Hours    |
| Smart Phone           | Multiple sensors             | Bluetooth, WiFi, GSM, HSDPA, LTE   | 1 W                 | Few Hours        |

Table 1.1 – Power budget for different applications.

law. MM technologies enable applications that range from mobile computing and *autonomous sensing and monitoring systems* (targeting reduced energy and area costs) to *high performance computing*, which requires higher performance and operating frequency.

In this work, we are concerned with optimized information extraction from signals or data volumes. We therefore develop mathematical theory, computational methods and their hardware implementations for information recovery from highly incomplete data. Our approach trades off the complexity of data acquisition against that of data processing, conceptually allowing radically new device designs. Our contention is that both theory and practice must leverage concepts from learning processes, in order to validate the merging of mathematical algorithms and circuit design. This work has been demonstrated in two new hardware prototypes with key trade-offs in the More Moore technologies.

# 1.1 Mobile computing and autonomous sensing systems

In mobile applications the power budget is defined by the battery limits which, unfortunately, do not improve from one technology node to the next, as the number of logic gates in the IC does. Table 1.1 gives an overview of the power budgets of some electronic devices currently used in daily life.

Among all the autonomous sensing applications, one of the most critical and challenging fields is medical monitoring, in which various biological signals have to be processed with relatively high accuracy in order to extract reliable medical information for disease diagnosis or therapy. The implantable medical sector is nowadays a highly consolidated market, dominated by a few companies (e.g., Medtronic, St. Jude Medical and Boston Scientific). According to a research report [6], the implantable medical market is forecast to grow from a valuation of US\$ 32.3 billion in 2015 to US\$ 49.8 billion by the end of 2024.

In the last few decades, new health-oriented devices and wireless technologies have been



Figure 1.3 – Wireless implantable devices currently available in the medical market [5].

proposed, spanning from low-power implants that harvest energy from the body, to wireless sensors for in-house medical monitoring. In particular, implantable medical devices, including pacemakers, cardiac defibrillators, insulin pumps, and neurostimulators (shown in Fig. 1.3), feature wireless communication, enabling remote personal health monitoring, and facilitate the treatment procedures provided by health care systems.

Usually, the implant, also named *sensor*, is characterised by limited energy resources, due to the limits of the battery. The power consumed by the wireless *Transmitter* (TX) unit in the sensor node is usually higher than that required by all the other blocks in the signal acquisition system of the implanted chip. For this reason, some data processing on the sensor node is crucial to reduce the amount of data sent by the *Radio Frequency* (RF) TX while keeping a relatively high information content, which is restored by a tailored signal reconstruction at the receiver node. Data compression thus becomes crucial to reduce the power cost of data telemetry without losing any critical information in the signal. To address this challenge, a mathematical approach named *Compressive Sensing* [7] or *Compressed Sensing* [8] (CS) has been exploited in many applications, spanning from remote control to imaging systems. In a nutshell, CS allows the signal of interest to be sampled at a lower rate than dictated by the Shannon-Nyquist theorem, while the recovered signal remains robust. Such a mechanism is possible because, in natural signals, the information content is often much lower than the raw data content.

Overall, CS reduces the costs on the sensor node, requiring fewer linear samples than standard systems. However, the receiver has to deal with fewer data and must perform non-linear



Figure 1.4 – The Watson supercomputer, based on IBM Power7 servers.

operations to obtain the reconstructed signal. This means that the receiver will exhibit some data latency and high energy costs. In this thesis, we propose hardware implementations of different CS-based approaches, capable of boosting the performance of autonomous sensing systems in terms of both area and power. Finally, an adaptive learning-based CS approach, named LBCS, allows linear sampling and linear recovery, resulting in real-time, high-quality signal reconstruction at compression rates of up to $64 \times$, as quantitatively demonstrated on different datasets.
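The idea of linear sampling followed by linear recovery can be illustrated with a toy example. The following is a minimal sketch, not the LBCS implementation developed in this thesis: it assumes an orthonormal DCT basis, synthetic training signals whose energy is concentrated by construction on 16 transform coefficients, and a simple rule that keeps the coefficients with the largest average training energy. All function names and parameter values are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; each row is one basis vector."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    psi = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    psi[0, :] /= np.sqrt(2.0)
    return psi

def learn_indices(training, psi, m):
    """Keep the m transform coefficients with the largest average training energy."""
    energy = np.mean((training @ psi.T) ** 2, axis=0)
    return np.sort(np.argsort(energy)[-m:])

def compress(x, psi, idx):
    """Linear sampling: keep only the selected transform coefficients."""
    return (psi @ x)[idx]

def recover(y, psi, idx):
    """Linear recovery: zero-fill the unsampled coefficients, then invert the transform."""
    coeffs = np.zeros(psi.shape[0])
    coeffs[idx] = y
    return psi.T @ coeffs

# Hypothetical sizes: 256-sample windows compressed to 16 coefficients (16x).
N, M = 256, 16
rng = np.random.default_rng(0)
psi = dct_matrix(N)

def synth(count):
    """Synthetic signals whose energy lies on the first M DCT coefficients (assumption)."""
    c = np.zeros((count, N))
    c[:, :M] = rng.normal(size=(count, M))
    return c @ psi

idx = learn_indices(synth(200), psi, M)
x = synth(1)[0]
x_hat = recover(compress(x, psi, idx), psi, idx)
rel_err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
```

In this contrived setting the learned indices capture all of the signal energy, so the reconstruction is essentially exact. On real data the error depends on how well the training set represents the signals to be acquired.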

# 1.2 High performance computing

High performance computing is needed to perform massive-scale and complex computing at server nodes. The targets of such technology are energy efficiency, real-time responsiveness and the ability to meet a huge demand for processing power. As an example, the IBM Power Systems, shown in Fig. 1.4, are servers designed for critical applications and massive workloads needed for advanced machine learning, deep learning, advanced analytics and high performance computing.

The data processing capability of such big data computational infrastructures is highly dependent on how fast the system performs its operations. In such a framework, high speed *transceivers* (TRX) play a crucial role in connecting short range chip-to-chip or chip-to-memory links, enabling high performance big data processing. High speed TRX capability has continuously grown following the IC technology trend. However, the channel board in which the signal propagates has not improved accordingly, resulting in an intricate path in which the data deteriorates. The electrical link through PCB or backplane channels is characterised by signal losses, due to the natural low-pass filter characteristic of the board. In these interconnects, the need for complex equalization grows as the channel loss increases, reducing the energy efficiency of the link. Single-ended signaling architectures increase the bandwidth per pin, doubling the performance with respect to a differential implementation. However, as the operating frequency of the system increases, the electromagnetic coupling between the PCB traces, named *crosstalk* (XTK or xtalk), becomes a significant noise source in single-ended parallel links. Combining high speed signal processing algorithms with the technological challenges of circuit implementation brought by transistor scaling thus becomes crucial to reduce the signal degradation caused by the aforementioned noise sources. Moreover, the challenge is compounded by the stringent power constraints that must be met alongside the high performance requirements of the application.

In this work, we propose a low power, multi-lane single-ended RX for high loss source-synchronous links. In such a system, a learning-based approach allows us to effectively cancel insertion loss and electromagnetic coupling noise in hardware and to increase communication speed without sacrificing the overall bandwidth for a given area within the interconnect's datapath. Moreover, the receiver macro can be adapted to different board channels that present high insertion losses and intricate crosstalk patterns.

# 1.3 Thesis Goal

The More Moore technologies impose an unavoidable and increasingly tight trade-off among the chip's main requirements, namely device area, power and operating frequency. The goal of this work is to improve system performance by exploiting the information available from the signal and/or the environment in which the device operates. In this thesis we developed different learning-based approaches specifically tailored to the applications, defining new architectures, efficient circuit implementations and silicon prototypes of integrated signal processing algorithms. Fig. 1.5 depicts the die micrographs of the prototypes that have been designed, fabricated and tested in the frame of this work. In particular, Fig. 1.5 (a) shows the adaptive learning-based chip used for autonomous sensing systems, while Fig. 1.5 (b) depicts the chip micrograph of the 8-lane receiver for high performance computing applications. Overall, particular emphasis is given to the efficient design and implementation of adaptive architectures, in order to cope with the multiple scenarios in which the proposed systems are used.

# 1.4 Organization and Thesis Overview

## 1.4.1 Part One: Wireless Implantable Device for Medical Monitoring Brain

### Chapter 2 - Implantable ecosystem



Figure 1.5 – Designed and tested prototypes in this thesis: (a) Learning-based CS hardware design in 180 nm CMOS technology; (b) Multichannel LBCS-based neuronal sensing system; (c) high-speed, 8-lane single-ended RX in 32 nm SOI technology.

In Chapter 2, we give an overview of the implantable ecosystem, discussing the general requirements of the implant. This will be followed by a discussion on neuronal bioelectricity and biocompatible electrodes. Afterwards, this chapter presents the basic information needed about the overall implantable system on chip.

### Chapter 3 - Data compression for autonomous sensing systems

Chapter 3 describes the data compression algorithms developed during this work. We describe the main concepts of compressive sensing, followed by a discussion on structure-aware sparsity, sampling and recovery methods. The final part of the chapter focuses on Learning-based compressive sampling, describing the main advantages of this method. For all the CS-based methods described in this work, a performance evaluation is given, based on iEEG human datasets (defined in the Appendix).

### Chapter 4 - LBCS based hardware implementation and validation

In Chapter 4, we describe the different hardware prototypes based on the LBCS algorithm. In particular, we show LBCS implementations applying different measurement schemes, analysing the pros and cons of each one. Then, we show the global system on chip for a single channel implementation, where we adopt an adaptive LBCS compression technique. Afterwards, we give the silicon electrical measurements of the single channel implementation. The last part of the chapter discusses a multichannel implementation that, at the time of writing, is being fabricated.

## 1.4.2 Part Two: Multi-lane Single-Ended High Speed I/O Receiver

### Chapter 5 - High speed I/Os ecosystem

Chapter 5 describes the high speed Input/Output link interconnection. This chapter gives a general system level overview, discussing the channel board environment and the signal loss caused by attenuation in high frequency wired links. The state of the art in crosstalk reduction for high speed serial links is then described in the last part of the chapter.

### Chapter 6 - System level analysis for high speed receiver

In Chapter 6, we describe the system level analysis of the high speed link, motivating our crosstalk cancellation technique on the receiver side only. We give the boards characteristics before describing the crosstalk mathematical formulation for ideally coupled lanes. In the last part of the chapter we discuss the system level simulations, followed by the analysis of crosstalk cancellation over skewed lanes.

### Chapter 7 - High speed receiver hardware implementation and validation

Chapter 7 describes the receiver architecture and gives the circuit design of its main components. The validation of the overall receiver design is given by the measurement results over two single-ended channel boards, presenting high insertion loss and crosstalk.

## 1.4.3 Last Part: Conclusion and Appendix

The conclusion of the work is presented in Chapter 8. The main results and the contributions of this work are summarized in this chapter, and a perspective on future works is given.

# Part I: Wireless Implantable Device for Medical Monitoring Brain

# 2 Implantable ecosystem

Among all the autonomous sensing system applications, one of the most critical and challenging fields is medical monitoring, in which biological signals have to be processed with relatively high accuracy in order to extract reliable medical information.

A research study conducted across the high-income countries has shown that brain disorders are the major health problem [9]. Di Luca et al. [10] estimate that brain disorders cost the EU economy around 900 billion US\$, with 179 million people afflicted in 2010. Fig. 2.1 shows the cost distribution of the main brain diseases in Europe in 2010 [10]. A tentative comparison can be made with other major human disorders, such as around 200 billion US\$ [11] for cardiovascular diseases and from 150 to 250 billion US\$ [9, 10], giving a global picture of how important brain health is as a social and economic burden in Europe and the rest of the world.

For several decades, many scientists have tried to understand brain activity. Since the 1990s, clinicians have been able to implant devices capable of monitoring neuronal activity [12]. The micro/nano electromechanical systems (M(N)EMS) fabrication industry is currently improving the capability to interface with the brain. A multitude of applications are related to these systems, from research experiments to personal health monitoring and in-house treatments. In particular, electrodes and micro-fabricated electrodes have enabled efficient electrical or optical links, enhancing the functionality of neuronal interfaces. Since 1997, the use of prostheses has been approved to provide medical treatment for some brain diseases, such as Parkinson's disease and epilepsy, and since 2005 also for depression [13]. Nowadays, over 5% of the population worldwide has had at least one epileptic seizure during their lifetime and around 50 million people worldwide are actively treated for epilepsy [14, 15]. Moreover, in 30% of the cases (around 20 million people) pharmaco-resistant epilepsy is diagnosed and standard medical drugs are not sufficient to cope with this problem. Currently, the only available solution, when applicable, requires long term hospitalization in order to record and localize the epileptic seizures, using a bulky system directly connected to the brain with cables through the skull and scalp. After the localization of the epileptic hotspots, an invasive surgical procedure is required, with the aim of physically removing the brain cortex where the seizure starts. This





Figure 2.1 – European brain disorders costs in 2010, reprinted from [10].

would make it possible to propose wearable medical devices that are possibly autonomous and have minimal maintenance requirements. According to the vision of the *Body Area Network* (BAN), such a set of bio-electrical devices attached to the human body can serve either to carry information to a medical host or to provide some feedback as first aid treatment.

In the following, we first discuss the fundamentals of bioelectricity, giving an overview on the different available biocompatible electrodes, focusing then on the sensor ecosystem.

# 2.1 Bio-compatible requirements of the implant

A medical device implanted under the skull gives rise to very strict constraints on area and power. The implant is meant to be physically placed between the cortex and the skull, thus it needs to be small, with a maximum size of a few dozen square millimeters.

One of the most important safety issues of using an implantable biomedical device inside the body is the temperature elevation in the surrounding tissues due to the operation of the implant. Temperature elevation may disturb the natural behavior of the cells near the implant or may even cause cell death. Regulations allow a maximum temperature elevation of 1 °C for body implants [16]. This temperature rise corresponds to a power outflux density of 40 mW/cm<sup>2</sup> [17].

The implant, containing the power management and data communication system, can be implanted in the burr hole which is opened in the skull for the neurosurgical treatment of epilepsy. This hole can be approximated as a cylinder with a height of approximately 10 mm (average skull thickness) and a diameter of 15 mm (subject to change depending on the drill size) [17]. These numbers determine the size limits for the system proposed in this project, as well as the dimension limit of the communication antenna.
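To get a feel for the orders of magnitude involved, the thermal limit and the burr-hole geometry quoted above can be combined into a rough upper bound on the dissipated power. The sketch below is only an illustration: it assumes, optimistically, that the implant fills the whole cylinder and dissipates heat uniformly over its entire surface, which is not a claim made in this thesis. The function and constant names are hypothetical.

```python
import math

MAX_FLUX_MW_PER_CM2 = 40.0   # power outflux density for a 1 degC rise [17]
HOLE_DIAMETER_MM = 15.0      # burr hole diameter [17]
HOLE_HEIGHT_MM = 10.0        # average skull thickness [17]

def cylinder_surface_cm2(diameter_mm, height_mm):
    """Total surface (two end caps plus side wall) of a cylinder, in cm^2."""
    r_cm = diameter_mm / 20.0   # mm diameter -> cm radius
    h_cm = height_mm / 10.0     # mm -> cm
    return 2.0 * math.pi * r_cm * (r_cm + h_cm)

# Assumption: heat leaves uniformly through the full cylinder surface.
area_cm2 = cylinder_surface_cm2(HOLE_DIAMETER_MM, HOLE_HEIGHT_MM)
power_ceiling_mw = MAX_FLUX_MW_PER_CM2 * area_cm2

print(f"surface: {area_cm2:.2f} cm^2, power ceiling: {power_ceiling_mw:.0f} mW")
```

A real implant occupies only part of this volume and does not dissipate evenly, so the usable power budget is expected to be well below this ceiling.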

The packaging of cortical implants is one of the most critical challenges in the design of fully implantable cortical recording devices. The requirements for the packaging include hermetic sealing, bio-compatibility, transparency to magnetic fields, size, and weight. However, it has been demonstrated that these challenges can feasibly be overcome with the current state of the art. In this regard, Yilmaz et al. successfully tested the hermetic sealing capability of a package composed of epoxy and Parylene-C for one month to evaluate the implant's short-term performance [18].

# 2.2 Neuronal bioelectricity and biocompatible electrodes

The early studies on electricity in living tissues can be traced back to the 1600s. Since 1791, when the Italian scientist Galvani discovered the electrical nature of the nerve impulse in a frog muscle, bioelectricity in the human body has been studied and observed.

The electrical activity of the neural cells of the brain is classified into three types, as depicted in Fig. 2.2, depending on the setup required to measure it:

• *Electroencephalography* (EEG) measures potential fluctuations with non-invasive electrodes placed along the scalp, over a period of time. EEG measures voltage variations due to ionic current flows within the neurons of the brain. EEG signals represent the superposition of millions of individual neural events, demonstrating the group behaviour of neurons in a specific area of the brain. The EEG signals are characterized by amplitudes as high as $300 \,\mu V$ and frequency content up to $100 \,\text{Hz}$. Because the EEG electrodes are placed relatively far from the neural cells, the artefacts associated with this technique are significant. Indeed, during EEG recording, the neural electrical information is attenuated and distorted by the scalp and the blood circulation.

• *Electrocorticographic* (ECoG) or *intracranial EEG* (iEEG) techniques measure the voltage fluctuations with electrodes placed directly on the brain surface, named cortex. In this way, the quality of ECoG (iEEG) signals is improved, at the cost of a minimally invasive procedure, which requires brain surgery to bypass the scalp and place flat electrodes on the brain's surface. The ECoG records neural events associated with a small group of neurons or with a single neuron cell, depending on the size of the implanted electrodes. Since a large signal amplitude and a wide frequency spectrum are related to this technique, the ECoG signals are generally used for mapping of cortical functions [19, 20] and localization of seizure onsets [21].

• Needle-like Micro-Electrode Arrays (MEA) are used to sense extracellular Action Potentials



Figure 2.2 – Biocompatible electrodes, reprinted from [25].

(AP). The penetrating electrodes are placed into the cerebrum and measure the activity of a defined group of neurons. Such technique is used for detection and sorting of neural spikes [22, 23]. A minimal risk of haemorrhage is related to the micro-electrodes implanted in the brain's motor cortex [24]. A related *Local Field Potential* (LFP), which is an electro-physiological signal generated by the electrical contribution given by multiple nearby neurons within a small volume of nervous tissue, is also measured with this technique.

Table 2.1 reports a general overview of the bio-compatible electrodes employed to collect the neuronal signals. The electrode's geometry has to be considered in conjunction with the application: for measuring single neuron activity, the electrode size has to be in the order of micrometers, matching the neuron size, while for studying the behaviour of a population of neurons, the size may be larger. The micro-electrodes and their related read-out circuits are designed according to the implant typology and the targeted neural activity. For instance, in *Brain-Computer Interfaces* (BCI) needle-shaped micro-electrodes are preferred [26], while flat electrodes are mostly used for motor cortex recordings [27]. Moreover, active electrodes are preferred to increase the quality of the information, but a power supply for the neural probes is then needed. Wireless monitoring of neural activity enables a change in medical procedures: patients with implanted cortical systems will be allowed to safely leave the hospital environment during a monitoring period extending over several months.

In this work, the implemented low-power IC acquires and wirelessly transmits the neuronal data collected from iEEG electrodes, for epileptic seizure detection. For this application, we decided to take into account micro-sized iEEG electrodes, in order to improve the seizure detection capability. Such important choice is discussed in the following subsection.

| Signal    | Amplitude                  | Bandwidth      | Electrode       | Invasiveness       | Risk                       |
|-----------|----------------------------|----------------|-----------------|--------------------|----------------------------|
| EEG       | 5-300 $\mu\mathrm{V}$      | 1-100 Hz       | scalp           | non-invasive       | none                       |
| ECoG/iEEG | $\leq 5\,\mathrm{mV}$      | 0.5-200 Hz     | cortical        | minimally invasive | minimal                    |
| LFP       | $\leq 1\,\mathrm{mV}$      | $\leq 200$ Hz  | microelectrodes | invasive           | possible local haemorrhage |
| spikes/AP | $\leq 500\,\mu\mathrm{V}$  | 1-7 kHz        | needles         | invasive           | possible local haemorrhage |

Table 2.1 – Neuronal signals characteristics.

## 2.2.1 Macro and Micro-electrodes iEEG recording

Recordings from micro-electrodes of diameter less than $100 \,\mu$m in the epileptic human hippocampus and neocortex have enabled the identification of several classes of electrographic activity localized to sub-millimeter-scale tissue volumes, inaccessible to standard iEEG technology with macro-electrodes [28]. Moreover, Stead and colleagues [21] have observed that epileptic seizures identified on the macro-electrodes are often preceded by seizure-like activity on the micro-electrodes. In particular, some of the micro-electrodes record an ongoing micro-periodic epileptiform spiking discharge, which starts minutes before the onset of the seizure itself [21].

Furthermore, the same researchers have also found that the signals recorded by adjacent micro-electrodes can be uncorrelated, despite their spatial vicinity. Thus, the sub-millimeter scale of high frequency oscillations involved in seizure generation motivates the wide-band iEEG using micro-electrodes for monitoring epileptic patients. The number of recording channels is predicted to exceed thousands in the near future and the major bottlenecks of monitoring systems will be the power consumption of data telemetry and the large circuit area requirement.

# 2.3 Implantable System on Chip

The hardware component that handles the neuronal signal collected by the electrodes is a *System-on-Chip* (SoC), which is integrated in the implanted device and collects, amplifies, digitizes, processes and transmits the signal to an external receiver, named *base station*. A high level view of the integrated SoC is depicted on the top left side of Fig. 2.4. On the external base station (on the right side of Fig. 2.4), the transmitted data is reconstructed for medical monitoring and storage.

In the following, we give an overview of the implantable neural recording system, mainly divided into three blocks: the wireless recording System-on-Chip, the wireless powering and the data communication units.



Figure 2.3 – Hybrid electrode grid containing macro- and micro-electrode arrays (a) for iEEG signal recordings, reprinted from [21]. Signals recorded from micro- and macro-electrodes in (b), with a highlight on micro-electrode 27, which records a seizure onset seconds before the macro-electrodes.



Figure 2.4 – Block diagram of the implantable integrated system (on the left side), wirelessly linked with an external base station (on the right), where the data is reconstructed for medical monitoring and stored. No battery is used in the implanted system.

## 2.3.1 Wireless recording System-on-Chip

Generally, the implanted SoC includes a neural amplifier, which collects the neural information recorded by the active electrodes placed in contact with the brain surface. An *Analog to Digital Converter* (ADC) samples and digitises the amplified neural signals; the ADC output is processed by the *Digital Signal Processor* (DSP), which processes the digital information with the aim of reducing the amount of data sent by the wireless RF transmitter. Indeed, the transmitter power budget in typical wireless monitoring systems is usually one order of magnitude higher than that of any other block on the chip [29, 30]. In this Section, we discuss each of the SoC blocks in more detail, from a system level perspective.

For each sampling electrode, the collected signal is amplified by a *Low-Noise Amplifier* (LNA). Then, the ADC samples and digitises the analog neural information. Before data transmission, the digitised data are processed in order to reduce the wireless TX power requirements. Data compression is usually employed to reduce the data packets transmitted from the implant to the external base station. This saves telemetry power, especially in multichannel neural signal acquisition systems.

## 2.3.2 Data telemetry

In addition to the neural data acquisition and processing, the data has to be transmitted from the implanted device to an external base station. Such communication link is named *uplink communication*, and is required to transmit the digitized neural data to an external receiver device. A *downlink communication* is also required, in order to allow data transfer from the external station to the implant. Such link is required for the calibration and configuration of the sensor and processing parameters, such as sampling coefficient selections.

The epilepsy monitoring system proposed in this project implements both uplink and downlink communications. Since the downlink communication is only used for setting the system parameters, there is no need for a high data rate. Thus, a downlink receiver at the implanted SoC communicating at a data rate of 10 kbps is sufficient. However, the uplink communication requires a very high data rate, since the number of monitoring channels and their sampling rate are high. For neural monitoring applications with tens of electrodes, the uplink communication should provide a data rate at least in the order of 10 Mbps. Accordingly, the design of an uplink transmitter is challenging in such applications. The minimum distance for both communication types is the average human skull thickness of about 10 mm.
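The order of magnitude of the uplink requirement can be checked with a one-line budget: channels times sampling rate times bits per sample, divided by any on-chip compression factor. The sketch below is illustrative only. The channel count, sampling rate and ADC resolution are hypothetical values chosen to land near the 10 Mbps figure quoted above, not specifications of the proposed system; the $64 \times$ factor is the upper end of the compression rates quoted in the introduction.

```python
def uplink_rate_bps(channels, sample_rate_hz, bits_per_sample, compression=1.0):
    """Average uplink data rate in bit/s for a multichannel recorder."""
    if compression <= 0:
        raise ValueError("compression must be positive")
    return channels * sample_rate_hz * bits_per_sample / compression

# Hypothetical operating point (assumption, not taken from the thesis).
CHANNELS = 32
SAMPLE_RATE_HZ = 30_000
BITS_PER_SAMPLE = 10

raw_bps = uplink_rate_bps(CHANNELS, SAMPLE_RATE_HZ, BITS_PER_SAMPLE)
compressed_bps = uplink_rate_bps(
    CHANNELS, SAMPLE_RATE_HZ, BITS_PER_SAMPLE, compression=64
)

print(f"raw: {raw_bps / 1e6:.2f} Mbps, compressed: {compressed_bps / 1e3:.1f} kbps")
```

Because the transmitter dominates the power budget, any reduction in the transmitted bit rate translates fairly directly into telemetry power savings, which is the motivation for compressing on the sensor node.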

## 2.3.3 Power management

The powering of the implanted system can be managed exploiting the following solutions:

• implementation of medical grade batteries;

• wireless power transfer;

• ambient energy harvesting.

The most appropriate powering solution is determined mainly by the application scenario and ecosystem. Typically, implantable devices meant for medical applications require large energy reservoirs which allow the system to operate over a long period of time. A possible alternative might be harvesting energy from the sources surrounding the implant. Such an approach might extend the implant life and, if sufficient energy is available, allow the implant to operate autonomously. For biomedical applications, there are different types of energy harvesters such as piezoelectric [31], thermal [32], light [33] and infrared light [34], with power densities of microwatts per cm<sup>2</sup>. Although the energy harvested by these types of devices is limited, they can be used for ultra-low power implants. Mercier *et al.* demonstrated an electronic system that extracted a minimum of 1.12 nW from the endocochlear potential (EP) of a guinea pig for up to 5 h, enabling a 2.4 GHz radio to transmit a measurement of the EP every 40–360 s [35].

Concerning batteries for long-term implants, either an additional surgery is required to replace the exhausted battery, or the battery must be recharged. The charge capacity of the batteries and the power required for a given operating time should also be considered when selecting the battery, as well as the size restrictions of the implant. As an example, Miranda et al. present a wireless biomedical system with 32 channels for recording and transmitting the neural activity of the brain. The power consumption is low enough to operate continuously for 33 hours, using two 3.6-V/1200-mAh Li-SOCl<sub>2</sub> batteries [36]. A rechargeable battery can be used in order to extend the duration of operation. However, wireless power transfer is then required for recharging the implanted battery.

For long-term operations consuming milliwatts, such as the monitoring of neural signals, the most appropriate solution is wireless power transfer. Remote powering can be divided into two categories, near field and far field, depending on the distance between the implanted device and the external power delivery station. The boundary between near field and far field is defined by $d = \lambda/2\pi$, where $d$ and $\lambda$ are the distance and the wavelength of the signal, respectively. For long distance remote powering (a few meters), the far field properties are generally exploited, such as the radiation properties of antennas at frequencies of several hundred MHz [37]. Hence, such an implementation is suited to applications that require high mobility. For short distance powering (a few centimeters), reactive coupling techniques, such as capacitive and inductive coupling at frequencies of several MHz, are implemented. Capacitive coupling relies on electrical coupling, while inductive coupling exploits magnetic coupling in the link. Capacitive coupling requires a dielectric medium that allows strong coupling and is more sensitive to distance variations. On the other hand, the inductive coupling method exploits the mutual inductance between coupled inductors. In the literature, there are numerous examples of wireless power transfer by means of electromagnetic (EM) radiation [38, 39], magnetic coupling [18, 40, 41], ultrasonic coupling [42], and infrared radiation [43]. For a chosen wavelength, if the distance between the coils


is smaller than $d$, magnetic coupling gives more efficient wireless power transfer [44]. Accordingly, magnetic coupling is preferable for powering implanted devices. It is fair to claim that systems based on EM radiation and magnetic coupling dominate the literature, especially for neural implant powering applications. Recently, Lee et al. presented an inductively-powered wireless integrated neural recording system for wireless and battery-less neural recording from freely-behaving animal subjects inside a wirelessly powered standard homecage [45]. The proposed system consumes 51.4 mW and is powered by an inductive link at 13.56 MHz.
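The near-field boundary $d = \lambda/2\pi$ is easy to evaluate for a given carrier. The sketch below does so for the 13.56 MHz link mentioned above and compares the result with a link distance of about 10 mm. It assumes free-space propagation, so the shortening of the wavelength inside biological tissue is ignored; the function names are hypothetical.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0  # free-space value (tissue effects ignored)

def near_field_boundary_m(frequency_hz):
    """Near/far-field boundary d = lambda / (2*pi), in meters."""
    wavelength_m = SPEED_OF_LIGHT_M_S / frequency_hz
    return wavelength_m / (2.0 * math.pi)

def is_near_field(distance_m, frequency_hz):
    """True when the link distance lies inside the near-field region."""
    return distance_m < near_field_boundary_m(frequency_hz)

CARRIER_HZ = 13.56e6      # inductive link frequency reported in [45]
LINK_DISTANCE_M = 0.010   # about 10 mm between implant and external unit

boundary_m = near_field_boundary_m(CARRIER_HZ)
link_is_near_field = is_near_field(LINK_DISTANCE_M, CARRIER_HZ)

print(f"boundary: {boundary_m:.2f} m, near field: {link_is_near_field}")
```

Under this assumption the boundary is on the order of meters, far larger than a millimeter-scale transcranial link, which is consistent with the preference for magnetic (inductive) coupling stated above.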

For several biomedical applications such as hearing aids and pacemakers, batteries occupy a significant amount of volume. However, the volume allocated for a neural implant is very small compared to these applications. Moreover, neural recording applications consume a higher amount of power, which reduces the duration of operation. Considering the continuous power demand of neural implants aiming for continuous data transmission and the estimated power budget, current ambient energy harvesters are insufficient for this task. A wireless power transfer link by means of inductive coupling is a good solution as the power source of the implanted system, since the distance between the implant and the external units is in the order of millimeters (human scalp thickness ~10 mm) and sending the required power to the implant is feasible with current technology.

# **3** Data compression for autonomous sensing systems

In this Chapter, we discuss the optimized extraction of information from signals or data volumes. Following an overview of the compressive sensing paradigm, we analyse the mathematical theory and computational methods developed for information recovery from highly incomplete data. In the vast majority of applications, including imaging systems, home automation, environmental remote controlling and real-time medical monitoring/treatment, data compression becomes indispensable to reduce the amount of processed/transmitted information. The data compression *Integrated Circuit* (IC) has to be integrated in a compact and low-power device, such that its related costs are minimal with respect to the benefits for the overall sensing system.

The remainder of this Chapter is organized as follows. First, Section 3.1 gives a brief overview of the CS approach, highlighting its main advantages and disadvantages. Afterward, a recently proposed structure-aware compressive strategy, named *Structured Sampling* [46], is discussed in Section 3.2. The structured recovery and optimization details are followed by numerical results that motivate this method. Then, a new compression architecture based on Machine Learning, named *Learning Based Compressive Subsampling* (LBCS) [47], is described in Section 3.3. The LBCS method is compared with state-of-the-art schemes, using human iEEG datasets.

# 3.1 Compressive Sensing

The *Compressive Sensing* (CS) technique has recently emerged as a very efficient compression method, which allows easy on-chip integration and reduces the sampling rate at the sensor node. In this work, the CS technique has been the seed from which different algorithms, tailored to hardware requirements, have been developed. In a nutshell, CS consists of taking fewer linear samples than dictated by the Shannon-Nyquist theorem, while still allowing robust off-line signal reconstruction. This is possible by exploiting the fact that the information content of a signal is often much lower than its raw data content.





Figure 3.1 – Electrocardiography gives a time-sparse representation of the heart electrical activity.

The CS system employs compressive measurements, obtained by linear projections of the signal of interest. Signal reconstruction is the process that recovers the original signal from the compressed measurements. One condition for accurate reconstruction is that the signal needs to be *sparse*, i.e., the signal has few non-zero components. If it is not, it can be projected onto a *sparsifying* signal domain (named basis). Other requirements and conditions are discussed later in this chapter.

CS is a relatively novel field, but similar methods have been proposed for a long time. One of the first methods to reduce the sampling rate of sparse signals was proposed by Prony in 1795 [48]; it estimates the non-zero amplitudes in a series of complex exponentials and has been used for different applications [49]. In 1989, Donoho and Stark proposed a method for signal recovery with missing data, based on sparsity and band-limited measured signals [50]. In 2002, Vetterli et al. proposed a method to reduce the sampling rate based on the finite rate of innovation of the sampled signals [51].

#### 3.1.1 Signal Sparsity

If a band-limited signal occupies the entire bandwidth, the Shannon-Nyquist theorem sets the minimum sampling frequency to at least twice its bandwidth [52, 53]. However, if the signal does not occupy all the available bandwidth, CS states that it can be recovered from fewer samples. This is the case for *sparse* signals, which are represented by few non-zero components in a certain domain or basis. Some natural signals are sparse in their native domain, such as the time representation. A classical example is the electrical activity of the heart, such as the one depicted in Fig. 3.1, where an ideal *electrocardiogram* (ECG) is characterised by few non-zero coefficients in the time domain.

Given an input signal $\boldsymbol{\alpha} \in \mathbb{R}^N$ which has $K$ non-zero coefficients, this signal is named $K$-sparse. The sparsity of the signal is then defined as $p = \frac{K}{N}$. Thus, if a signal is very sparse, $K \ll N$ and $p \ll 1$. If the input signal $\boldsymbol{\alpha}$ is not sparse in the given domain, another domain (e.g., Fourier, Wavelet, etc.) may allow for a sparse representation of the signal. This transformation mechanism, named transform coding, is very often used in CS and in compression



Figure 3.2 – A multi-tone sine in the non-sparse time domain (left) and its sparser representation in Fourier domain (right).

algorithms in general. A very intuitive example to depict this concept can be given taking into account a multi-tone sinusoidal signal, formulated as follows

$$y(t) = \sum_{n=1}^{L} A_n \sin(2\pi f_n t) , \tag{3.1}$$

where $A_n$ and $f_n$ are the amplitude and the frequency of the $n$-th sine, respectively. Fig. 3.2 depicts the multi-tone sine wave formulated in equation (3.1), with $L = 3$, both in the non-sparse time domain (left) and in its sparser representation in the Fourier domain (right).
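The multi-tone example can be reproduced numerically. The following is a minimal sketch, assuming a three-tone signal whose amplitudes, frequencies and sampling rate are hypothetical values chosen so that each tone falls exactly on an FFT bin; it is not the signal used to generate Fig. 3.2.

```python
import numpy as np

fs = 1024                      # sampling frequency in Hz (assumed)
N = 1024                       # number of samples, i.e. one second of signal
t = np.arange(N) / fs

amps = [1.0, 0.5, 0.25]        # A_n (hypothetical values)
freqs = [50.0, 120.0, 300.0]   # f_n in Hz (hypothetical, on the FFT grid)

# Multi-tone sine with L = 3 terms.
y = sum(a * np.sin(2 * np.pi * f * t) for a, f in zip(amps, freqs))

# One-sided amplitude spectrum: a sine of amplitude A gives a bin of height A.
spectrum = np.abs(np.fft.rfft(y)) / (N / 2)
significant = np.flatnonzero(spectrum > 1e-6)

print("non-negligible time samples :", np.count_nonzero(np.abs(y) > 1e-6))
print("non-negligible Fourier bins :", significant)
```

Almost every time sample is non-zero, whereas only three Fourier bins carry energy, which is the sense in which the signal is sparse in the Fourier domain.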

Assuming that  $\mathbf{x} \in \mathbb{R}^N$  is not sparse in the identity basis, it can be projected over a sparsifying basis  $\Phi \in \mathbb{R}^{N \times N}$ 

$$\boldsymbol{\alpha} = \boldsymbol{\Phi}^T \mathbf{x} \,, \tag{3.2}$$

such that $\boldsymbol{\alpha}$ is the $K$-sparse representation of $\mathbf{x}$ in the transformed domain. Since $\Phi$ is an orthonormal basis, $\Phi \Phi^T = I$ and

$$\mathbf{x} = \boldsymbol{\Phi}\boldsymbol{\alpha} \,. \tag{3.3}$$

The support of the input signal is mathematically defined as the subset of the domain containing the elements which are non-zero. Then, the cardinality of the signal's support gives the sparsity of the input signal. Moreover, the  $\ell_0$  "norm"  $(\|.\|_0)^1$  gives the number of non-zero

<sup>&</sup>lt;sup>1</sup> the  $\ell_0$  is not a norm by definition, but can be seen as the limit of the  $\ell_p$  norms, with  $(p \to \infty)$ .





Figure 3.3 – Electrocardiography signal on top with the threshold level; its sparser representation at the bottom.

entries of the input signal, thus

$$K = |\operatorname{supp}(\boldsymbol{\alpha})| = \|\boldsymbol{\alpha}\|_0. \tag{3.4}$$

In general, natural signals are not perfectly sparse, but they can be approximated as sparse, since they can be represented by a small number of coefficients while all the others are negligible. The error introduced by sparsifying these signals, named *compressible* signals, can be negligible depending on the chosen sparsifying level. Taking into account the example shown in Fig. 3.3, a threshold level can be defined such that only the main signal variations are taken into account (e.g., the ECG spikes), while the rest is neglected, as highlighted in Fig. 3.3-(bottom).
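The effect of the sparsifying level on a compressible signal can be sketched as follows. The example keeps the $K$ largest-magnitude coefficients and measures the relative approximation error; the test vector (magnitudes decaying as $1/i$) and the helper name `best_k_term` are assumptions made for illustration only.

```python
import numpy as np


def best_k_term(alpha: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude entries of alpha and set the rest to zero."""
    approx = np.zeros_like(alpha)
    if k <= 0:
        return approx
    keep = np.argsort(np.abs(alpha))[-k:]
    approx[keep] = alpha[keep]
    return approx


rng = np.random.default_rng(0)
N = 256
# Hypothetical compressible vector: magnitudes decay as 1/i, with random signs.
alpha = rng.choice([-1.0, 1.0], size=N) / np.arange(1, N + 1)

errors = {}
for k in (4, 16, 64):
    residual = alpha - best_k_term(alpha, k)
    errors[k] = np.linalg.norm(residual) / np.linalg.norm(alpha)
    print(f"K = {k:3d}  p = K/N = {k / N:.3f}  relative error = {errors[k]:.3f}")
```

Raising $K$ (a lower threshold) reduces the approximation error at the cost of a less sparse representation.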

# 3.1.2 Compressive Signal Measurements

Given the input signal $\mathbf{x} \in \mathbb{R}^N$, its compressed representation $\mathbf{y} \in \mathbb{R}^M$ is obtained by sampling $\mathbf{x}$ through a dense measurement matrix $\mathbf{A} \in \mathbb{R}^{M \times N}$, with $M < N$, as

$$\mathbf{y} = \mathbf{A}\mathbf{x} \,. \tag{3.5}$$

It is worth noticing that the ratio between $N$ and $M$ defines the compression ratio (CR)

$$CR = \frac{N}{M}. \tag{3.6}$$




Figure 3.4 – Dimensionality reduction applying Compressive Sensing technique.

A compressed measurement is valid if each sample contains information from the non-zero values of the input signal. In order to allow a successful compressed sampling, as discussed in Section 3.1.1 with equation (3.3), the measurement needs to be performed in the sparsifying basis. Combining equation (3.5) with (3.3),

$$\mathbf{y} = \mathbf{A}\mathbf{x} = \mathbf{A}\boldsymbol{\Phi}\boldsymbol{\alpha} = \boldsymbol{\Psi}\boldsymbol{\alpha} \,, \tag{3.7}$$

where $\Psi = \mathbf{A}\Phi$ is defined as the measurement matrix and $\boldsymbol{\alpha}$ is the sparse representation of $\mathbf{x}$ in the orthonormal basis $\Phi$.
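The chain $\mathbf{y} = \mathbf{A}\mathbf{x} = \mathbf{A}\Phi\boldsymbol{\alpha} = \Psi\boldsymbol{\alpha}$ can be checked numerically. This is a minimal sketch under stated assumptions: the orthonormal basis is a random orthogonal matrix obtained from a QR factorisation (a wavelet or Fourier basis would be used in practice), and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, K = 128, 32, 5

# Orthonormal sparsifying basis Phi (assumption: random orthogonal matrix).
Phi, _ = np.linalg.qr(rng.standard_normal((N, N)))

# K-sparse coefficient vector alpha and the signal x = Phi @ alpha, eq. (3.3).
alpha = np.zeros(N)
support = rng.choice(N, size=K, replace=False)
alpha[support] = rng.standard_normal(K)
x = Phi @ alpha

# Dense Gaussian measurement matrix with zero mean and variance 1/M.
A = rng.standard_normal((M, N)) / np.sqrt(M)

y = A @ x        # eq. (3.5)
Psi = A @ Phi    # eq. (3.7)

print("compression ratio N/M  :", N / M)
print("y == Psi @ alpha       :", np.allclose(y, Psi @ alpha))
print("alpha == Phi.T @ x     :", np.allclose(alpha, Phi.T @ x))
```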

#### 3.1.3 Signal Recovery

For simplicity, let us assume that the input signal **x** is already sparse in the given domain. Fig. 3.4 illustrates this scenario, where **x** is sparse and is sampled through a dense measurement matrix **A**, which gives a compressed output **y**. In this case, the sparsifying matrix is the identity matrix ($\Phi = I$); thus, following equation (3.7), **x** = $\boldsymbol{\alpha}$.

In order to recover the input signal **x** from $\mathbf{y} = \mathbf{A}\mathbf{x}$, an under-determined linear system of equations needs to be solved. In general, this system has no unique solution, and we look for a solution with specific properties. In order to clarify this point, we consider a two-dimensional reconstruction problem, meaning that $\mathbf{x} \in \mathbb{R}^2$. In this case, the system that we want to solve is defined as

$$\mathbf{y} = \mathbf{A}\mathbf{x} = a_1 x_1 + a_2 x_2 \,. \tag{3.8}$$

The set of all possible solutions of (3.8) is depicted by the thick line in Fig. 3.5. Our goal is to find the sparsest solution of the system, with the minimum norm. Rewriting (3.8) as the line

$$x_{2} = \frac{y}{a_{2}} - \frac{a_{1}}{a_{2}}x_{1} = cx_{1} + d, \tag{3.9}$$

with $c = -a_1/a_2$ and $d = y/a_2$, the example shown in Fig. 3.5 requires finding the point $(0, x_2)$, which is one sparse solution of the system defined in (3.8). It is closer to the origin, so its norm is smaller than that of the other sparse solution $(x_1, 0)$.

Sparse signal recovery requires solving

$$\hat{\mathbf{x}} = \underset{\mathbf{x} \in \mathbb{R}^{N}}{\operatorname{argmin}} \|\mathbf{x}\|_{0} \quad \text{subject to} \quad \mathbf{y} = \mathbf{A}\mathbf{x}. \tag{3.10}$$

However, this is an NP-hard combinatorial problem, for which the complexity of finding the exact solution grows exponentially with $N$.

The $\ell_1$ norm of a vector is the sum of the absolute values of its entries; its minimization is also known as *Least Absolute Deviations* (LAD) when it is applied to the differences between target and estimated values. A possible way to circumvent the NP-hardness of (3.10) is to replace the $\ell_0$ minimization by $\ell_1$ minimization:

$$\hat{\mathbf{x}} = \underset{\mathbf{x} \in \mathbb{R}^{N}}{\operatorname{argmin}} \|\mathbf{x}\|_{1} \quad \text{subject to} \quad \mathbf{y} = \mathbf{A}\mathbf{x}. \tag{3.11}$$

Under various conditions, this $\ell_1$ optimization method, named *Basis Pursuit* (BP), yields the same unique solution as (3.10).

Fig. 3.5 shows the $\ell_p^n$ minimization for a two-dimensional reconstruction problem ($n = 2$). In particular, it shows how, with $\ell_1$ optimization ($p = 1$), one of the corners of the $\ell_1$ rhombus intersects the line defined by equation (3.8), and the intersection corresponds to the one-sparse solution $(0, x_2)$. The $\ell_2$ optimization ($p = 2$) instead gives a solution that does not touch the constraint line on one of the axes, meaning that the solution is not sparse.
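The two-dimensional picture can be verified with a few lines of code. The sketch below uses hypothetical coefficients with $|a_2| > |a_1|$, so that the sparse solution lies on the $x_2$ axis as in the discussion of Fig. 3.5. For a single linear equation, the minimum $\ell_2$-norm solution is the projection of the origin onto the line, while the minimum $\ell_1$-norm solution lies on the axis whose coefficient has the largest magnitude.

```python
import numpy as np

a = np.array([1.0, 2.0])   # hypothetical coefficients a_1, a_2 with |a_2| > |a_1|
y = 2.0                    # hypothetical scalar measurement

# Minimum l2-norm solution: orthogonal projection of the origin onto the line.
x_l2 = a * y / np.dot(a, a)

# Minimum l1-norm solution: all the weight on the axis with the largest |a_i|.
i = int(np.argmax(np.abs(a)))
x_l1 = np.zeros(2)
x_l1[i] = y / a[i]

for name, x in (("l2", x_l2), ("l1", x_l1)):
    print(f"{name}: x = {x}, a.x = {a @ x:.1f}, "
          f"||x||_1 = {np.abs(x).sum():.2f}, ||x||_2 = {np.linalg.norm(x):.2f}")
```

Both points satisfy the constraint, but only the $\ell_1$ solution has a zero entry.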

The measurements may be corrupted by some noise $\mathbf{w} \in \mathbb{R}^M$:

$$\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{w} \,. \tag{3.12}$$

*Basis Pursuit De-Noising* (BPDN) recovers the input signal while tolerating a measurement mismatch $\epsilon$, as formulated by:

$$\hat{\mathbf{x}} = \underset{\mathbf{x} \in \mathbb{R}^{N}}{\operatorname{argmin}} \|\mathbf{x}\|_{1} \quad \text{subject to} \quad \|\mathbf{A}\mathbf{x} - \mathbf{y}\|_{2}^{2} \le \epsilon. \tag{3.13}$$



Figure 3.5 – Shape of the  $\ell_p^2$  minimization for p = 1 and p = 2, while the thick straight line represents all the solutions to  $\mathbf{y} = \mathbf{A}\mathbf{x}$ .

The conditions under which the signal **x** can be efficiently and accurately recovered from the **y** measurements, discussed by Candès et al. [7, 54, 55], can be listed as:

• The *Null Space* or *Kernel* of the measurement matrix **A** contains all vectors **x** that are mapped to **0**:

$$\mathcal{N}(\mathbf{A}) = \{\mathbf{x} : \mathbf{A}\mathbf{x} = \mathbf{0}\}. \tag{3.14}$$

The *Null Space Property* (NSP) is satisfied if there is a constant $\gamma \in (0, 1)$ such that

$$\|\mathbf{n}_{\mathscr{S}}\|_{2} \leq \gamma \|\mathbf{n}_{\mathscr{S}^{C}}\|_{1}, \quad \forall \mathbf{n} \in \mathcal{N}(\mathbf{A}), \tag{3.15}$$

for all sets $\mathcal{S} \subset \{1, ..., N\}$ with cardinality $K$ and their complements $\mathcal{S}^C$ (the set of elements not in $\mathcal{S}$).

• *Restricted Isometry Property* (RIP) of the measurement matrix. **A** fulfils the RIP if

$$(1 - \delta_K) \|\mathbf{x}\|_2^2 \le \|\mathbf{A}\mathbf{x}\|_2^2 \le (1 + \delta_K) \|\mathbf{x}\|_2^2, \quad \forall\, \mathbf{x} \ K\text{-sparse}, \tag{3.16}$$

where $\delta_K \in (0, 1)$ is the isometry constant.

• *Incoherence* of the projection basis **A** and the sparsifying basis $\Phi$. The coherence is defined as

$$\mu = \max_{i,j} |\langle \mathbf{a}_i, \boldsymbol{\phi}_j \rangle|, \tag{3.17}$$

where $\mathbf{a}_i$ denotes the $i$-th row of **A** and $\boldsymbol{\phi}_j$ the $j$-th column of $\Phi$. When $\mu$ is close to 0, **A** and $\Phi$ are incoherent and the reconstruction performs better. A consequence of this definition is that if a signal is sparsely represented in one basis, it is not sparse in the other one. The time and Fourier pair of bases is a good example of incoherence, as previously illustrated in Fig. 3.2, where a signal that is sparse in the Fourier basis is not sparse in the time basis.
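The incoherence of the time and Fourier bases can be checked numerically. The sketch below assumes that the coherence is taken as the largest absolute inner product between unit-norm rows of the projection basis and unit-norm columns of the sparsifying basis; the function name `coherence` and the size $N = 64$ are illustrative choices.

```python
import numpy as np


def coherence(A: np.ndarray, Phi: np.ndarray) -> float:
    """Largest |<a_i, phi_j>| over unit-norm rows a_i of A and columns phi_j of Phi."""
    rows = A / np.linalg.norm(A, axis=1, keepdims=True)
    cols = Phi / np.linalg.norm(Phi, axis=0, keepdims=True)
    return float(np.max(np.abs(rows.conj() @ cols)))


N = 64
spikes = np.eye(N)                               # time (canonical) basis
fourier = np.fft.fft(np.eye(N)) / np.sqrt(N)     # unitary DFT matrix

print("time vs Fourier :", coherence(spikes, fourier))   # equals 1/sqrt(N)
print("time vs time    :", coherence(spikes, spikes))    # maximal coherence
```

For this pair the coherence equals $1/\sqrt{N}$, the smallest value attainable by two orthonormal bases, whereas a basis compared with itself gives the maximal value 1.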

As demonstrated by Candès et al. [56], randomly chosen measurement matrices perform very well with high probability. Examples of random matrices are the *random Gaussian* matrix, where each entry of **A** is independently drawn from a normal distribution with zero mean and variance $1/M$, and the *random Binary* matrix, where the entries are drawn from a Bernoulli distribution in which the values $\pm 1/\sqrt{M}$ have the same probability. Thus, from a theoretical point of view, the matrix **A** can be generated with random coefficients, since *independent and identically distributed* (i.i.d.) Gaussian matrices are incoherent and also satisfy the RIP condition. Moreover, they are universal, i.e., the RIP or the incoherence of **A**$\Phi$ is the same as that of the original **A** [57], where the matrix $\Phi$ (which needs to be unitary, i.e., $\Phi\Phi^T = I$) is used to move to a sparser representation of the signal **x**. However, Gaussian matrices are prohibitively expensive to use in practice, since they require $\mathcal{O}(MN)$ space and time.
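The near-isometric behaviour of these random matrices can be observed empirically. The following sketch draws a Gaussian and a Bernoulli matrix with the normalisations given above and evaluates the energy ratio $\|\mathbf{A}\mathbf{x}\|_2^2 / \|\mathbf{x}\|_2^2$ on random $K$-sparse vectors. The dimensions and the number of trials are arbitrary, and testing random vectors only illustrates the property: it does not certify the RIP, which must hold for all $K$-sparse vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 1024, 256, 10

A_gauss = rng.standard_normal((M, N)) / np.sqrt(M)           # N(0, 1/M) entries
A_bern = rng.choice([-1.0, 1.0], size=(M, N)) / np.sqrt(M)   # +-1/sqrt(M), equiprobable


def energy_ratios(A: np.ndarray, n_trials: int = 200) -> np.ndarray:
    """Return ||A x||^2 / ||x||^2 for random K-sparse test vectors x."""
    ratios = np.empty(n_trials)
    for trial in range(n_trials):
        x = np.zeros(N)
        x[rng.choice(N, size=K, replace=False)] = rng.standard_normal(K)
        ratios[trial] = np.sum((A @ x) ** 2) / np.sum(x ** 2)
    return ratios


r_gauss = energy_ratios(A_gauss)
r_bern = energy_ratios(A_bern)
for name, r in (("Gaussian", r_gauss), ("Bernoulli", r_bern)):
    print(f"{name:9s} min ratio = {r.min():.2f}  max ratio = {r.max():.2f}")
```

Both matrices store $M \times N$ coefficients, which is the $\mathcal{O}(MN)$ cost mentioned above.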

In the remainder of this section, we briefly discuss some of the most common recovery algorithms used in place of the $\ell_0$ minimization problem (3.10) in order to reduce the complexity. The most practical algorithms are the convex $\ell_1$ minimization (3.11) and greedy approximations.

The convex relaxation formulation allows solving the BP problem (3.11) or, considering noisy data, the BPDN problem (3.13) [58]. In general, the convex relaxation problem is solved by applying an iterative algorithm that converges to the optimal solution. Popular choices to solve the convex relaxation problem are primal-dual interior point methods [58], fixed-point continuation, which applies a soft-thresholding approach [59], and the *Least Absolute Shrinkage and Selection Operator* (LASSO) [60].
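To make the soft-thresholding idea concrete, the sketch below implements a plain iterative soft-thresholding loop for the Lagrangian (LASSO-type) form $\min_{\mathbf{x}} \tfrac{1}{2}\|\mathbf{A}\mathbf{x} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{x}\|_1$. It is a generic textbook scheme rather than the fixed-point continuation method of [59] or the solver used in this work; the regularisation weight, the problem sizes and the function names are assumptions.

```python
import numpy as np


def soft_threshold(v: np.ndarray, tau: float) -> np.ndarray:
    """Entry-wise soft-thresholding: shrink magnitudes by tau, clip at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def lasso_objective(A, y, x, lam):
    return 0.5 * np.sum((A @ x - y) ** 2) + lam * np.sum(np.abs(x))


def ista(A: np.ndarray, y: np.ndarray, lam: float, n_iter: int = 500) -> np.ndarray:
    """Minimise 0.5*||A x - y||_2^2 + lam*||x||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)             # gradient of the quadratic term
        x = soft_threshold(x - step * grad, step * lam)
    return x


# Hypothetical noiseless test problem with a 5-sparse signal.
rng = np.random.default_rng(3)
N, M, K, lam = 128, 64, 5, 0.05
A = rng.standard_normal((M, N)) / np.sqrt(M)
x_true = np.zeros(N)
x_true[rng.choice(N, size=K, replace=False)] = rng.choice([-1.0, 1.0], size=K)
y = A @ x_true

x_hat = ista(A, y, lam, n_iter=3000)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"relative reconstruction error: {rel_err:.3f}")
```

Each iteration needs only two matrix-vector products and a thresholding step, although many iterations may be required to converge.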

Even though convex relaxation methods can solve large problems, recovering **x** usually requires solving a non-linear optimization problem that demands high numerical precision, which makes these methods unsuitable for dedicated hardware implementation. To overcome this limitation, *greedy algorithms* are implemented in VLSI designs, since they are generally simpler and faster than convex relaxation methods. On the other hand, this simplicity is generally traded off against sub-optimal solutions. Greedy algorithms may be divided into three main groups:

- Serial greedy pursuits: such as *Matching Pursuit* (MP), *Orthogonal Matching Pursuit* (OMP) and *Gradient Pursuit* (GP);
- Parallel greedy pursuits: *Compressive Sampling Matching Pursuit* (CoSaMP) and *Subspace Pursuit* (SP);
- Thresholding Algorithms: *Approximate Message Passing* (AMP), *Iterative Hard Thresholding* (IHT) and *Iterative Soft Thresholding* (IST).
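As an example of a serial greedy pursuit, the following is a minimal sketch of Orthogonal Matching Pursuit in its generic textbook form. It is not a description of any specific VLSI implementation; the stopping rule (a fixed number of iterations equal to the assumed sparsity $k$) and the test problem are assumptions.

```python
import numpy as np


def omp(A: np.ndarray, y: np.ndarray, k: int) -> np.ndarray:
    """Greedy recovery of a k-sparse x from y = A x (Orthogonal Matching Pursuit)."""
    residual = y.astype(float).copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Select the column most correlated with the current residual.
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        # Least-squares fit on the selected columns, then update the residual.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x


# Hypothetical noiseless test problem with a 4-sparse signal.
rng = np.random.default_rng(4)
N, M, K = 128, 80, 4
A = rng.standard_normal((M, N)) / np.sqrt(M)
x_true = np.zeros(N)
x_true[rng.choice(N, size=K, replace=False)] = rng.choice([-1.0, 1.0], size=K)
y = A @ x_true

x_hat = omp(A, y, K)
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```

The loop runs only $k$ times and each pass is dominated by one correlation step and one small least-squares problem, which is why this family is attractive for hardware.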

# 3.2 Structured Sparsity, Sampling and Recovery

As discussed in the previous section, the sparsity of the input signal is of significant importance for compression and signal sampling. Moreover, sparsity is widely used in machine learning, optimization and signal analysis. This is motivated by the fact that most natural signals are characterized by sparse representations in an appropriate basis.

In this section, we introduce the concept of structured sparsity in subsection 3.2.1, which aims to improve the simple sparsity model by exploiting the relationship between the non-zero components of the input signal. Then, we propose our CS sampling approach in subsection 3.2.2 [46], where sampling schemes aware of the signal's structure are adopted to outperform the standard Gaussian or Bernoulli schemes. Furthermore, we also exploit the signal structure in the reconstruction phase, as discussed in subsection 3.2.3.

Overall, we reap the benefits of both structured sampling and structured recovery to yield state-of-the-art compression of up to 32×, while maintaining high signal reconstruction quality, as numerically demonstrated on two clinical iEEG datasets.

# 3.2.1 Structured Sparsity

As discussed by Kyrillidis et al. in Chapter 12 of [61], the true underlying structure of many signal processing and machine learning problems is often more sophisticated than sparsity alone.

In recent years, researchers have explored *Structured Sparsity*, with models that go beyond the simple sparsity idea by finding relationships between the non-zero coefficients. In general, the main advantages of structured sparsity models are easier interpretation of the solutions and better recovery performance from a reduced number of samples. Many approaches have been applied to define the selection process: greedy algorithms for signal approximation under a tree-structured model, group Lasso, etc. [61]. To highlight how the outcomes of this approach differ from the general CS case, Fig. 3.6 (reprinted from [61]) compares the general CS approach and the structured sparsity recovery of images, with a number of measurements $M$ of approximately 5% of $N$ for the landscape picture (top row, original image on the left side, with dimension $2048 \times 2048$) and $M$ = 10% of $N$ for the MRI image (bottom row, original image on the left side, with dimension $512 \times 512$).

# 3.2.2 Structured Sampling

More efficient types of sampling are being successfully used in real applications, such as subsampled fast transforms, like the *Fast Fourier* (FFT), the *Discrete Cosine* (DCT) or the *Fast Walsh-Hadamard* (FWHT) *Transforms*, which can be computed in $\mathcal{O}(N\log N)$ time. Although these transforms are not universal, a recent theoretical approach discussed by Adcock et al. [62] has explained why they work with some bases such as wavelets, introducing the concepts of multi-level sampling, asymptotic sparsity and incoherence, as shown in Fig. 3.7 (left).

Additional structures in the signal **x**, such as interdependencies between its non-zero coefficients or constraints on its support, make it possible to reduce the number of samples required for exact or stable recovery (see [63] and [64]). Many of these structures can be encoded via linear inequalities that admit tight and tractable convex relaxations [65]. Interestingly, natural signals are often characterized by sparse and structured representations in time-frequency (or space-frequency) domains, such as those provided by wavelets [66].

Following the asymptotic sparsity discussed in [62], our sampling scheme selects the indices of the chosen transformation basis according to a probability function that favours the low frequencies of the signal, which carry most of its energy. This probability function is defined according to the compression factor: it always samples the low frequencies, while the higher frequencies are selected with rapidly decreasing probability, as depicted in Figure 3.7 (right). We named this approach *Structured Sampling* [46].

The structured sampling has been developed tailored to the hardware requirements: the sampling probability function selects the indices of the Walsh-Hadamard Transform, which has the advantage of only requiring binary operations. For this reason, we call this approach *Structured Hadamard Sampling* (SHS).
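The sampling step can be sketched as follows. This is a minimal illustration, not the scheme of [46]: the exact probability function is not reproduced here, so the sketch assumes that the first `n_low` indices are always kept and that the remaining ones are drawn with a weight decaying as a power of the index (`decay` is a hypothetical parameter). The transform is computed in natural (Hadamard) order; a sequency reordering, omitted here, would be needed for the index to track frequency.

```python
import numpy as np


def fwht(x) -> np.ndarray:
    """Fast Walsh-Hadamard transform (natural order), O(N log N), N a power of two."""
    y = np.asarray(x, dtype=float).copy()
    n = y.size
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = y[i:i + h].copy()
            b = y[i + h:i + 2 * h].copy()
            y[i:i + h] = a + b            # only additions and subtractions
            y[i + h:i + 2 * h] = a - b
        h *= 2
    return y


def structured_indices(n, m, n_low, decay=2.0, rng=None) -> np.ndarray:
    """Pick m of n indices: the first n_low always, the rest with decaying probability."""
    rng = np.random.default_rng() if rng is None else rng
    low = np.arange(n_low)
    rest = np.arange(n_low, n)
    weights = 1.0 / (rest + 1.0) ** decay
    extra = rng.choice(rest, size=m - n_low, replace=False, p=weights / weights.sum())
    return np.sort(np.concatenate([low, extra]))


rng = np.random.default_rng(5)
N, M, n_low = 1024, 256, 64              # 4x compression (hypothetical sizes)
x = np.cumsum(rng.standard_normal(N))    # smooth-ish random-walk test signal

omega = structured_indices(N, M, n_low, rng=rng)
y = (fwht(x) / np.sqrt(N))[omega]        # compressed measurements
print("kept", y.size, "of", N, "Hadamard coefficients")
```

Because the transform uses only additions and subtractions, the encoder avoids multipliers, which is the hardware advantage mentioned above.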

# 3.2.3 Structured Recovery

In order to reconstruct the original signal **x** from its compressive samples **y**, most structured-sparsity methods resort to solving the following optimization problem on the wavelet coefficients $\boldsymbol{\alpha}$,

$$\begin{array}{ll} \underset{\boldsymbol{\alpha} \in \mathcal{A}}{\text{minimize}} & f(\boldsymbol{\alpha}) \\ \text{subject to} & \mathbf{A} \Phi \boldsymbol{\alpha} - \mathbf{y} \in \mathcal{K} \end{array} \tag{3.18}$$

where *f* is a Gauge function that promotes the structure we expect in  $\alpha$ ,  $\mathcal{K}$  encodes our information about the noise and  $\mathcal{A}$  is a constraint set that specifies further assumptions about



Figure 3.7 – (left) Coherence between the Hadamard and the Wavelet bases. The coherence decreases for higher frequencies (higher coefficients). (right) Probability functions used for sampling the indices of the Fast Walsh-Hadamard Transform for 4x and 32x compression factors.

the signal, e.g. boundedness. We reconstruct the signal as  $\tilde{\mathbf{x}} = \Phi \tilde{\boldsymbol{\alpha}}$ , where  $\tilde{\boldsymbol{\alpha}}$  is the solution to (3.18) and  $\Phi$  is the wavelet transformation matrix.

For sparse signals, it is common to use the  $\ell_1$  norm,  $\|\mathbf{x}\|_1 := \sum_{i=1}^n |x_i|$ , leading to the *Basis Pursuit* (BP) optimization problem.

It is well-known that biological signals are not only sparse in the wavelet domain, but their wavelet coefficients can be naturally arranged on a dyadic tree with the coefficients decaying from root to leaves. This type of structure can be promoted by a tree regularizer that gradually penalizes the coefficients closer to the leaves. In order to do so, we define a group structure  $\mathcal{T} = \{\mathcal{G}_1, \ldots, \mathcal{G}_n\}$  where each group  $\mathcal{G}_i \subseteq \{1, \ldots, n\}$  contains the node *i* in the tree and all its descendants. Such approach is named *Hierarchical Group Lasso* (HGL). Let  $\mathbf{x}_{|\mathcal{G}|}$  be the restriction of the vector  $\mathbf{x}$  to the coefficients indexed by  $\mathcal{G}$ . The tree norm is then defined as [67]

$$\|\mathbf{x}\|_{\mathcal{T}} := \sum_{\mathcal{G} \in \mathcal{T}} \|\mathbf{x}_{|_{\mathcal{G}}}\|. \tag{3.19}$$
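The tree norm can be evaluated directly from its definition. The sketch below makes two assumptions that are not fixed by the equation above: the coefficients are arranged on a heap-ordered binary tree (the children of node $i$ are $2i$ and $2i+1$), and the norm applied to each group is the Euclidean norm.

```python
import numpy as np


def tree_groups(n: int) -> list:
    """Groups G_i = {node i and all its descendants} for nodes 1..n of a
    heap-ordered binary tree, returned as arrays of zero-based indices."""
    groups = []
    for i in range(1, n + 1):
        members, frontier = [], [i]
        while frontier:
            node = frontier.pop()
            if node <= n:
                members.append(node)
                frontier.extend((2 * node, 2 * node + 1))
        groups.append(np.array(sorted(members)) - 1)
    return groups


def tree_norm(x: np.ndarray, groups: list) -> float:
    """Sum over all groups of the Euclidean norm of x restricted to the group."""
    return float(sum(np.linalg.norm(x[g]) for g in groups))


n = 7
groups = tree_groups(n)

root_only = np.zeros(n)
root_only[0] = 1.0          # unit coefficient at the root
leaf_only = np.zeros(n)
leaf_only[3] = 1.0          # unit coefficient at a leaf (node 4)

print("tree norm, root coefficient:", tree_norm(root_only, groups))
print("tree norm, leaf coefficient:", tree_norm(leaf_only, groups))
```

A coefficient at a leaf belongs to the groups of all its ancestors, so it is penalised once per level, whereas the same coefficient at the root is penalised once. This is how the regularizer favours coefficients that decay from root to leaves.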

Given a sampling strategy that is not aware of the signal structure (e.g., MCS, discussed later in subsection 3.2.3), in Figure 3.8, we show how this structure emerges in the reconstructed signals when using the tree norm, but not when using the  $\ell_1$  norm. In the same figure, it can also be noted that when using SHS, the hierarchical structure emerges even if it is not imposed during reconstruction, because it is already mostly captured during sampling.

Considering the real application for which this encoding scheme is tailored, we can exploit the signal structure derived from the spatial distance between the micro-electrodes of the implanted system. Indeed, when the micro-electrodes are very close to each other, due to the




Figure 3.8 – Tree structure in one signal from iEEG.org dataset I001 P034 D01 (channel 6, first annotated seizure, first 1024 samples window) and in three reconstructions obtained via Bernoulli sampling (BERN) and structured Hadamard sampling (SHS). The tree structure can be enforced via a specific tree regularizer or mostly captured via structured sampling.




Figure 3.9 – First 64 Wavelet coefficients of the micro-electrode signals from two datasets from the iEEG.org portal. (left) 7 channels from dataset I001 P034 D01. (right) 32 channels from dataset Study 040. The group structure is evident among the correlated channels in both datasets; however, there remain outlier channels which do not abide by the group structure.

high correlation among the signals, their time-frequency domain coefficients tend to be group sparse. That is, when a certain coefficient is zero for one signal, it is likely to be zero for the correlated signals as well, and vice versa. Let $\mathbf{X} \in \mathbb{R}^{N \times n}$ be the signal matrix, each row containing the signal of one of the $N$ channels. In order to promote group sparsity, [68] proposed to use the $\ell_{2,1}$ mixed norm, $\|\mathbf{X}\|_{2,1} := \sum_{i=1}^{n} \sqrt{\sum_{j=1}^{N} X_{j,i}^2}$. In Figure 3.9, we report the first 64 wavelet coefficients for the signals from two datasets, exhibiting the group structure among correlated channels.

#### **Optimization algorithm**

In order to solve the constrained convex optimization problem (3.18), we use the primal-dual algorithm proposed in [69], named DecOpt. This algorithm is very flexible in handling different problem types, scalable and guaranteed to converge at an optimal rate [69]. Its iterations require computing the proximity operator of $f$ and applying $\mathbf{A}\Phi$ or its adjoint, which for the FWHT and the wavelet transform requires only $\mathcal{O}(n \log n)$ time. The proximity operator of $f$ is defined as

$$\operatorname{prox}_{f}(\mathbf{z}) = \operatorname{argmin}_{\mathbf{x} \in \mathbb{R}^{n}} \frac{1}{2} \|\mathbf{x} - \mathbf{z}\|_{2}^{2} + f(\mathbf{x})$$

The proximity operators of the  $\ell_1$  norm and of the  $\ell_{2,1}$  mixed norm can be computed in closed form via soft-thresholding or group soft-thresholding, while the proximity operator for the tree norm can be computed in a finite number of steps via an active set algorithm [67]. In practice, there is almost no computational difference between the three approaches, thus we can take advantage of additional structure at almost no increase in computational cost.
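The two closed-form proximity operators can be written in a few lines. This is a minimal sketch, assuming the matrix layout used above (channels along the rows, so that each column forms one group); the function names are illustrative and do not come from the DecOpt implementation.

```python
import numpy as np


def prox_l1(z: np.ndarray, tau: float) -> np.ndarray:
    """Proximity operator of tau*||.||_1: entry-wise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)


def prox_l21(Z: np.ndarray, tau: float) -> np.ndarray:
    """Proximity operator of tau*||.||_{2,1}: group soft-thresholding,
    with one group per column of Z (same coefficient across all channels)."""
    norms = np.linalg.norm(Z, axis=0, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return Z * scale


z = np.array([3.0, -0.5, 1.0])
print(prox_l1(z, 1.0))        # small entries are set to zero, large ones shrink

Z = np.array([[3.0, 0.1],
              [4.0, 0.1]])
print(prox_l21(Z, 1.0))       # the weak column is removed as a whole
```

Both operators cost one pass over the data, which is consistent with the observation that the additional structure comes at almost no extra computational cost.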

## SHS performance evaluation

The performance evaluation has been carried out on two human iEEG datasets, named I001-P034-D01 and Study 040, from the iEEG.org portal. More details on the datasets are given in Appendix A.

In the following, we compare Structured Hadamard Sampling applied to each channel independently, with the same subsampling for all the channels, to two other sampling approaches:

- The first, named BERN, uses the same random Bernoulli {±1} matrix to sample each channel independently [29];
- The second, named Multi-Channel Sampling (MCS) [68], designed to be highly power-efficient, uses a Bernoulli {0, 1} matrix to sample across the channels at each time step. The compression achievable by this method depends on the number of samples taken at each time step, with a minimum of one sample, yielding a compression factor equal to the number of channels.

SHS and BERN sampling strategies are limited only by the length of the considered time window.

We also compare the three structured recovery methods described in the previous section, namely Basis Pursuit (BP) using the $\ell_1$ norm, L2L1 using the $\ell_{2,1}$ mixed norm, and the TREE method, which uses the tree norm $\|\mathbf{x}\|_{\mathcal{T}}$. As sparsifying basis, we use the Daubechies-4 Wavelet basis as provided by the Rice Wavelet Toolbox<sup>2</sup>. We pursue the following protocol for the experiments:

1. Sample all channels in each window according to the sampling method chosen: MCS, SHS or BERN.
2. Reconstruct using DecOpt [69] via BP, L2L1 or TREE.
3. Compute the signal-to-noise ratio (SNR) of the reconstructed signals.
4. Average over 20 different randomizations of the sampling scheme.
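The evaluation metric of the protocol above can be sketched as follows. The exact SNR definition is not restated in this section, so the sketch assumes the common form $20\log_{10}(\|\mathbf{x}\|_2 / \|\mathbf{x} - \hat{\mathbf{x}}\|_2)$; the signal and the simulated reconstructions are hypothetical stand-ins for the output of the sampling and recovery steps.

```python
import numpy as np


def snr_db(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Reconstruction SNR in dB, assumed as 20*log10(||x|| / ||x - x_hat||)."""
    return float(20.0 * np.log10(np.linalg.norm(x) / np.linalg.norm(x - x_hat)))


rng = np.random.default_rng(6)
x = np.sin(2 * np.pi * np.arange(1024) / 64.0)    # hypothetical window of one channel

# Stand-in for 20 randomizations: each "reconstruction" is x plus a small error.
snrs = [snr_db(x, x + 0.01 * rng.standard_normal(x.size)) for _ in range(20)]
print(f"mean SNR over 20 randomizations: {np.mean(snrs):.1f} dB")
```

In the actual protocol, the perturbed copies would be replaced by the signals reconstructed from each random draw of the sampling scheme.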

Tables 3.1 and 3.2 report the results on the first dataset, averaged over 157 windows and channels 2 to 6. A posteriori, we excluded channels 1 and 7 from the performance analysis because these channels are either inactive or not recording the neurological signal. An advantage of channel-wise sampling (SHS or BERN) over MCS sampling is that the former does not suffer from mixing "noisy" channels with "clean" ones. In an embedded system, a sub-circuit may be required for MCS in order to detect whether a channel is recording properly. Furthermore, the compression factor of MCS is limited by the number of channels, while SHS and

<sup>&</sup>lt;sup>2</sup>http://dsp.rice.edu/software/rice-wavelet-toolbox



Figure 3.10 - Dataset 1

BERN can achieve much higher compression rates. However, only SHS yields an acceptable reconstruction performance, above 16 dB at 32× compression, with 10 dB considered as the minimum required performance in order to retain diagnostically relevant information [70].

Table 3.3 contains the reconstruction SNR for the second dataset. In this case, the SNRs for all methods are very high, with an advantage for SHS with structured recovery, either L2L1 or TREE. The high SNRs can be explained by the fact that the signals in this dataset are quite regular, so their wavelet coefficients are very sparse and far fewer linear measurements are required for robust recovery.

The running time of the optimization algorithm is, in general, less than 10 seconds per time window for recovering the signals from all the channels simultaneously on an Intel Xeon E5-2630 @ 2.40GHz.

| Sampling | Recovery | Compression factor |      |      |      |
|----------|----------|--------------------|------|------|------|
|          |          | 7/4                | 7/3  | 7/2  | 7    |
| MCS      | BP       | 29.3               | 25.7 | 21.4 | 15.2 |
|          | L2L1     | 31.3               | 28.1 | 24.3 | 17.3 |
|          | TREE     | 34.1               | 30.6 | 26.8 | 21.2 |

Table 3.2 – iEEG.org portal dataset I001 P034 D01. Mean SNR over channels 2-6.

| Sampling        | Recovery | Compression factor |      |      |      |      |
|-----------------|----------|--------------------|------|------|------|------|
|                 |          | 2                  | 4    | 8    | 16   | 32   |
| SHS (this work) | BP       | 34.1               | 27.6 | 23.7 | 21.0 | 16.7 |
|                 | L2L1     | 35.3               | 28.4 | 24.0 | 21.2 | 16.8 |
|                 | TREE     | 35.6               | 28.8 | 24.6 | 22.2 | 17.6 |
| BERN            | BP       | 33.1               | 24.4 | 16.7 | 10.8 | 5.7  |
|                 | L2L1     | 35.8               | 27.3 | 18.5 | 11.8 | 6.6  |
|                 | TREE     | 36.9               | 29.5 | 23.0 | 17.8 | 13.5 |




Figure 3.11 – Example of micro-electrode signals from iEEG.org dataset I001 P034 D01 (first seizure, first 1024 samples window). Channel 1 is inactive, since it simply jumps between $-1\mu V$ and $1\mu V$. Channels 2 to 6 record normal activity, which is not strongly correlated. Channel 7 exhibits strong AC components, possibly picked up from the power sources.




Figure 3.12 – Example of micro-electrode signals from iEEG.org dataset Study 040 (first seizure, first 1024 samples window). Channel 26 seems completely inactive, it sends a constant signal of approximately -131 mV. Channels 3 and 28, among others, are highly correlated. Channel 1 is an example of a channel which does not exhibit the smaller oscillations of channels 3 and 28.

| Sampling        | Recovery | Compression factor |      |      |      |      |
|-----------------|----------|--------------------|------|------|------|------|
|                 |          | 2                  | 4    | 8    | 16   | 32   |
| MCS             | BP       | 79.6               | 72.4 | 66.5 | 61.2 | 55.6 |
|                 | L2L1     | 77.4               | 71.6 | 64.6 | 60.4 | 58.9 |
|                 | TREE     | 86.6               | 77.4 | 70.8 | 59.6 | 12.3 |
| SHS (this work) | BP       | 91.1               | 84.0 | 80.0 | 77.5 | 73.8 |
|                 | L2L1     | 93.6               | 86.3 | 82.1 | 79.5 | 74.8 |
|                 | TREE     | 92.8               | 85.3 | 81.3 | 78.7 | 74.7 |
| BERN            | BP       | 104                | 85.9 | 70.0 | 63.4 | 60.9 |
|                 | L2L1     | 73.4               | 69.4 | 63.7 | 59.4 | 57.1 |
|                 | TREE     | 83.3               | 76.4 | 61.7 | 54.3 | 32.4 |

Table 3.3 - iEEG.org portal dataset Study 040. Mean SNR over all channels.

# 3.3 Learning Based Compressive Sampling

In the previous Section, three structured-sparsity recovery methods were compared for reconstructing iEEG signals sampled via the SHS, MCS and BERN approaches. The best performance was obtained using a gauge function that exploits the natural tree representation of the wavelet coefficients in order to penalize the coefficients closer to the tree leaves [46].

The compression architecture described in this Section is based on the idea of *Learning-Based Compressive Subsampling* (LBCS) [47], which consists of *linear encoding* and *linear decoding* with respect to a given orthonormal basis, resulting in a much simpler and faster solution than the approaches described in Section 3.1.

LBCS can be summarized as follows. Given a signal  $\mathbf{x} \in \mathbb{R}^N$ , we consider the compression model

$$\mathbf{y} = \mathbf{P}_{\Omega} \boldsymbol{\Psi} \mathbf{x} \,, \tag{3.20}$$

where  $\Psi \in \mathbb{R}^{N \times N}$  is an orthonormal basis and  $\mathbf{P}_{\Omega} \in \mathbb{R}^{M \times N}$  is a subsampling matrix whose rows are canonical basis vectors. The effect of applying  $\mathbf{P}_{\Omega}$  to  $\Psi \mathbf{x}$  is to retain only the coefficients indexed by the set  $\Omega$ , also known as the *subsampling map*. The vector  $\mathbf{y} \in \mathbb{R}^{M}$  is the compressed version of  $\mathbf{x}$ , with a nominal compression rate (CR) of  $\frac{N}{M}$ . The signal  $\mathbf{x}$  is then approximately recovered via the fast linear decoder

$$\hat{\mathbf{x}} = \boldsymbol{\Psi}^* \mathbf{P}_{\boldsymbol{\Omega}}^T \mathbf{y}. \tag{3.21}$$
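A short derivation, relying only on the orthonormality of $\boldsymbol{\Psi}$, makes explicit what this linear decoder discards. Writing $\boldsymbol{\psi}_i$ for the *i*-th row of $\boldsymbol{\Psi}$ and noting that $\mathbf{P}_{\Omega}^T \mathbf{P}_{\Omega}$ is a diagonal matrix with ones at the indices in $\Omega$ and zeros elsewhere, we have

$$\|\mathbf{x} - \hat{\mathbf{x}}\|_2^2 = \|\boldsymbol{\Psi}^* (\mathbf{I} - \mathbf{P}_{\Omega}^T \mathbf{P}_{\Omega}) \boldsymbol{\Psi} \mathbf{x}\|_2^2 = \sum_{i \notin \Omega} |\langle \boldsymbol{\psi}_i, \mathbf{x} \rangle|^2 = \|\mathbf{x}\|_2^2 - \sum_{i \in \Omega} |\langle \boldsymbol{\psi}_i, \mathbf{x} \rangle|^2.$$

For a fixed signal, reducing the reconstruction error is therefore equivalent to increasing the energy captured by the retained coefficients.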

Given a training set  $\mathcal{D} = \{\mathbf{x}_1, \dots, \mathbf{x}_m\}$  of *m* fully sampled signals of unit norm, we learn the optimal subsampling map  $\Omega$  by choosing the indices that capture most of the average energy in the transform domain:

$$\hat{\Omega} = \underset{\Omega, |\Omega|=M}{\operatorname{argmax}} \frac{1}{m} \sum_{j=1}^{m} \sum_{i \in \Omega} |\langle \boldsymbol{\psi}_i, \mathbf{x}_j \rangle|^2, \tag{3.22}$$

where  $\boldsymbol{\psi}_i$  is the *i*-th row of  $\boldsymbol{\Psi}$ .  $\hat{\Omega}$  can be exactly found by selecting the *M* indices whose values of  $\frac{1}{m}\sum_{j=1}^{m} |\langle \boldsymbol{\psi}_i, \mathbf{x}_j \rangle|^2$  are the largest [47]. The learnt sampling scheme is then used to directly sample only those transform coefficients indexed by  $\hat{\Omega}$  for all signals **x**.
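The learning, encoding and decoding steps in (3.20) to (3.22) can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the random-walk training signals, the sizes `N` and `M`, and all function names are assumptions chosen for illustration, and the Hadamard matrix is built with the Sylvester construction and scaled to be orthonormal.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Sylvester-Hadamard matrix of size n (n must be a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def learn_subsampling_map(psi, train, m_keep):
    """Pick the m_keep indices with the largest average energy, as in (3.22)."""
    coeffs = train @ psi.T                      # row j holds Psi x_j
    avg_energy = np.mean(coeffs ** 2, axis=0)   # average energy per coefficient
    return np.sort(np.argsort(avg_energy)[-m_keep:])

def encode(psi, omega, x):
    """y = P_Omega Psi x, as in (3.20)."""
    return psi[omega] @ x

def decode(psi, omega, y):
    """x_hat = Psi^* P_Omega^T y, as in (3.21); psi is real, so Psi^* = Psi^T."""
    return psi[omega].T @ y

rng = np.random.default_rng(0)
N, M = 64, 16                                   # hypothetical sizes, CR = N / M = 4

def toy_signal():
    """Unit-norm random walk, used here only as a stand-in for real recordings."""
    x = np.cumsum(rng.standard_normal(N))
    return x / np.linalg.norm(x)

train = np.stack([toy_signal() for _ in range(200)])
psi = hadamard(N)
omega = learn_subsampling_map(psi, train, M)

x = toy_signal()
y = encode(psi, omega, x)
x_hat = decode(psi, omega, y)
err = np.sum((x - x_hat) ** 2)
print("learnt indices:", omega)
print("squared error:", err)
```

Because the basis is orthonormal and the test signal has unit norm, the squared error equals one minus the energy of the retained coefficients, which is the quantity that the learnt index set targets on average.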

#### 3.3.1 Optimal encoding

Given a basis  $\Psi$  and a desired number of samples M, the optimal linear encoding of each  $\mathbf{x}$  is obtained by retaining only the M largest coefficients of  $\Psi \mathbf{x}$  in absolute value. However, this optimal encoding requires first computing all the coefficients  $\Psi \mathbf{x}$ , which is prohibitive under tight area and power constraints, as discussed in Section 4.2.3.

In Section 3.3.2, we use the optimal encoding approach to show the upper limit in terms of quality of the reconstructed signal and we compare the results with the LBCS method.
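As a rough illustration of the optimal encoding, the following NumPy sketch keeps the M largest-magnitude Hadamard coefficients of a single signal and compares the resulting error with that of an arbitrary fixed index set. The toy signal, the sizes and the function names are assumptions made for this example only.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Sylvester-Hadamard matrix of size n (n must be a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def optimal_encode(psi, x, m_keep):
    """Keep the m_keep largest coefficients of Psi x in absolute value."""
    coeffs = psi @ x                            # all N coefficients are needed first
    support = np.sort(np.argsort(np.abs(coeffs))[-m_keep:])
    return support, coeffs[support]

def linear_decode(psi, support, values):
    """Linear decoder of (3.21) applied to an arbitrary index set."""
    return psi[support].T @ values

rng = np.random.default_rng(1)
N, M = 64, 8                                    # hypothetical sizes
psi = hadamard(N)
x = np.cumsum(rng.standard_normal(N))           # toy signal, not real iEEG data
x /= np.linalg.norm(x)

support, values = optimal_encode(psi, x, M)
err_opt = np.sum((x - linear_decode(psi, support, values)) ** 2)

fixed = np.arange(M)                            # arbitrary signal-independent index set
err_fixed = np.sum((x - linear_decode(psi, fixed, psi[fixed] @ x)) ** 2)
print("optimal:", err_opt, "fixed set:", err_fixed)
```

In this sketch the selected support changes from signal to signal, so it would have to be made available to the decoder together with the coefficient values, whereas a learnt index set is fixed in advance.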

# 3.3.2 LBCS performance evaluation

In this section, we first give the details related to the human iEEG datasets used in the experiments and then we compare the numerical results obtained applying the LBCS encoder against the other approaches described in Section 3.1.

#### Hadamard based LBCS performance evaluation

We conducted numerical experiments with all the methods described in this chapter on both datasets (discussed in Appendix A). We varied the length of the signal window N, the number of bits  $B_i$  of the input A/D converter, and the compression rate CR. We observed that the LBCS approach is not very sensitive to the window length N; therefore, we present only results for N = 256 and  $B_i = 10$  bits, which offered a relatively high reconstruction quality with limited area and power consumption. A more detailed discussion of the hardware implementation is given in Chapter 4.

Tables 3.4 and 3.5 report the reconstruction quality, in dB, obtained on the I001–P034–D01 and the Study 040 datasets respectively. As expected, Optimal compression (subsection 3.3.1) sets the upper limit on the achievable performance. Among the remaining methods, LBCS offers the best reconstruction quality at any compression rate, with an increase in the SNR of several dB compared to the other methods. The SHS approach offers the second best performance at most compression rates, as its variable density is adapted to the signals, but still fails at capturing as much structure as LBCS. The BERN and MCS methods offer a much inferior performance at high compression rates, because imposing structure only during reconstruction does not fully compensate for the limitations of their structure-unaware sampling mechanisms. Figures 3.13 and 3.14 show some reconstructions obtained with each method on both datasets. The LBCS reconstructions are much smoother and follow the original signal more closely.

Figure 3.13 – I001-P034-D01 Reconstruction example for channel Grid28 on four windows of length 256 each.

The linear decoder (3.21) yields reconstructions at a fraction of the computational cost of the other methods. Indeed, solving a single optimization problem with the HGL norm, using DecOpt [69], requires approximately 0.1 s on average, while the linear decoder requires only approximately  $10^{-5}$ s for a 256-sample signal.

| Method   | Compression rate |       |       |       |       |       |  |  |
|----------|------------------|-------|-------|-------|-------|-------|--|--|
| Method   | 2                | 4     | 8     | 16    | 32    | 64    |  |  |
| Optimal  | 41.60            | 39.86 | 36.38 | 31.40 | 25.42 | 19.43 |  |  |
| LBCS     | 40.79            | 37.64 | 33.27 | 28.48 | 23.27 | 18.06 |  |  |
| SHS HGL  | 36.92            | 27.96 | 23.89 | 20.26 | 18.53 | 14.49 |  |  |
| BERN HGL | 37.48            | 26.69 | 20.49 | 16.87 | 13.53 | 11.15 |  |  |
| MCS HGL  | 28.96            | 24.40 | 20.92 | 17.48 | n.a.  | n.a.  |  |  |

Table 3.4 – I001–P034–D01 N = 256,  $B_i = 10$ 

Table 3.5 – Study 040 N = 256,  $B_i = 10$

| Method   | Compression rate |       |       |       |       |       |  |  |
|----------|------------------|-------|-------|-------|-------|-------|--|--|
| Method   | 2                | 4     | 8     | 16    | 32    | 64    |  |  |
| Optimal  | 40.79            | 40.05 | 38.11 | 35.28 | 32.07 | 28.61 |  |  |
| LBCS     | 40.55            | 38.90 | 35.77 | 33.09 | 30.28 | 27.28 |  |  |
| SHS HGL  | 37.58            | 33.67 | 31.75 | 29.21 | 27.73 | 24.75 |  |  |
| BERN HGL | 38.23            | 33.57 | 29.59 | 26.62 | 24.03 | 22.08 |  |  |
| MCS HGL  | 37.20            | 34.22 | 30.82 | 27.03 | 23.00 | 18.45 |  |  |

Table 3.6 – Reconstruction performance (in dB) N = 32 -  $B_{\rm i}$  = 10

| Method  | Compression rate |       |       |       |       |  |  |  |
|---------|------------------|-------|-------|-------|-------|--|--|--|
| Method  | 2                | 4     | 8     | 16    | 32    |  |  |  |
| Optimal | 41.51            | 39.39 | 35.08 | 28.61 | 23.27 |  |  |  |
| LBCS    | 40.98            | 38.06 | 33.27 | 28.48 | 23.27 |  |  |  |



Figure 3.14 – Study 040 Reconstruction example for channel LG50 on four windows of length 256 each.

#### DCT based LBCS performance evaluation

The numerical experiments were carried out with all the methods described in this chapter, varying the length of the signal window N, the ADC resolution  $B_i$  and the compression rate CR.

The DCT-based LBCS approach has been evaluated with N = 256 and an ADC resolution of  $B_i = 10$  bits, as in the Hadamard-based evaluation above. Moreover, the coefficients of the DCT transformation matrix are represented with a resolution of  $B_{DCT} = 8$  bits.

Table 3.7 reports the reconstruction quality, in dB, obtained on the I001-P034-D01 dataset. As in the Hadamard case, Optimal DCT compression sets the upper limit on the achievable performance. DCT-LBCS offers the best reconstruction quality at any compression rate, with an increase in the SNR of several dB compared to the other methods. The Optimal Hadamard encoding yields the next best performance and sets the upper limit for the Hadamard-based approach. Interestingly, the DCT-LBCS method performs at least as well as the Optimal Hadamard encoding at every compression rate in Table 3.7, including the higher ones. In the SHS approach the variable density is adapted to the signals, but it still fails at capturing as much structure as LBCS. The BERN and MCS methods offer a much inferior performance at high compression rates, because imposing structure only during reconstruction does not fully compensate for the limitations of their structure-unaware sampling mechanisms.

Table 3.7 – Performance (dB) N = 256,  $B_i = 10$ ,  $B_{DCT} = 8$

| Method      | Compression rate |       |       |       |       |       |
|-------------|------------------|-------|-------|-------|-------|-------|
| Method      | 2                | 4     | 8     | 16    | 32    | 64    |
| DCT Optimal | 42.03            | 41.96 | 40.16 | 37.36 | 32.88 | 25.63 |
| DCT LBCS    | 41.65            | 40.66 | 38.59 | 35.55 | 31.00 | 23.9  |
| Had-Optimal | 41.60            | 39.86 | 36.38 | 31.40 | 25.42 | 19.43 |
| Had-LBCS    | 40.79            | 37.64 | 33.27 | 28.48 | 23.27 | 18.06 |
| SHS HGL     | 36.92            | 27.96 | 23.89 | 20.26 | 18.53 | 14.49 |
| BERN HGL    | 37.48            | 26.69 | 20.49 | 16.87 | 13.53 | 11.15 |
| MCS HGL     | 28.96            | 24.40 | 20.92 | 17.48 | n.a.  | n.a.  |

As for the Hadamard-based LBCS, the linear decoder (3.21) yields reconstructions at a lower computational cost than the other methods.

#### **Exploring trade-offs**

In order to understand the impact of various design choices on the power and area consumption, and on the quality of the compression, we vary the number of samples considered in each window, N = 256, 512 or 1024. To understand the impact of the quantization, we simulate an ADC with resolution  $B_i = 8, 9, 10$  or 11 bits. Finally, we vary the CR from  $2 \times$  to  $64 \times$  in powers of two.

Figure 3.15 summarizes the results obtained with the proposed learning-based encoder and linear decoder, for both the Hadamard-based LBCS (left) and the DCT-based LBCS. The x-axis represents the transmission bit rate, TBR, computed as TBR =  $\frac{M B_o}{N} = \frac{B_i + \log_2 N}{CR}$, where  $B_o = B_i + \log_2 N$  is the necessary bit resolution after the transformation (more details in Chapter 4). The y-axis measures the memory area size (MA), which is directly proportional to the number of entries in the transformation matrix needed for sampling, MA =  $M \times N \times B_{entry} = \frac{N^2 B_{entry}}{CR}$, where  $B_{entry}$  is 1 bit for the Hadamard transform and 8 bits for the DCT. For a given compression rate, the memory area scales quadratically with the window length N; thus we report MA on a log scale. The color and size of the dots represent the reconstruction performance in SNR, measured as described in the previous subsections.
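The two axis quantities can be evaluated with a few lines of Python. The sketch below follows the definitions TBR = $M B_o / N$ with $B_o = B_i + \log_2 N$ and MA = $M \times N \times B_{entry}$ with $M = N / CR$; the function names and the chosen parameter grid are illustrative assumptions.

```python
import math

def transmission_bit_rate(n, b_in, cr):
    """TBR = M * B_o / N = (B_i + log2 N) / CR, in bits per input sample."""
    b_out = b_in + math.log2(n)       # bit resolution after the transformation
    return b_out / cr

def memory_area_bits(n, cr, bits_per_entry):
    """MA = M * N * B_entry with M = N / CR, in bits."""
    return (n / cr) * n * bits_per_entry

# Hypothetical sweep over window lengths at B_i = 10 bits and CR = 16
for n in (256, 512, 1024):
    tbr = transmission_bit_rate(n, b_in=10, cr=16)
    ma_had = memory_area_bits(n, cr=16, bits_per_entry=1)   # Hadamard: 1 bit per entry
    ma_dct = memory_area_bits(n, cr=16, bits_per_entry=8)   # DCT: 8 bits per entry
    print(f"N={n}: TBR={tbr:.3f} bit/sample, MA_Had={ma_had:.0f} bit, MA_DCT={ma_dct:.0f} bit")
```

For N = 256, $B_i = 10$ and CR = 16, this gives a TBR of 1.125 bits per sample and a memory of 4096 bits for the Hadamard case, against 32768 bits for an 8-bit DCT.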



Figure 3.15 – Trade-off between bit-rate, memory size and reconstruction performance.

# 3.4 Summary

In this section we summarize the information presented in this chapter.

After an introduction to the main concepts of CS in the first part of the chapter, we presented the mathematical theory for low-dimensional signal models. In particular, we showed that structure-aware signal sampling and reconstruction outperforms the standard CS techniques. The proposed CS sampling scheme, adapted to the structure of intracranial EEG signals, consists in taking random components of the Hadamard transform of the input signal, where randomness is controlled by a probability function that favors the lower frequencies.

Afterwards, a learning-based approach, named LBCS, was applied to the data compression scheme. This scheme has been tailored for reduced area and power costs for neural signal encoding in wireless implantable devices. LBCS consists of linear encoding and linear decoding with respect to a given orthonormal basis, resulting in a much simpler and faster solution than the standard CS approaches. The set of indices is learnt from a training set of fully sampled signals, by selecting the ones that capture most of the signals' average energy. The LBCS approach allows a more faithful reconstruction of the original signals than state-of-the-art schemes.

# **4** LBCS based hardware implementation and validation

This chapter describes the different prototypes we have developed for the neuronal signal acquisition system, motivating the different circuit and system choices made to design the dedicated ASICs. The integrated circuits proposed in this work are based on the Learning-Based CS technique described in Chapter 3.

The remainder of this chapter is organized as follows. In Section 4.1 we give an overview of the overall implantable system. Then, Section 4.2 describes the LBCS implementations, based on the Hadamard and Discrete Cosine Transformations. Afterwards, the complete single channel implementation is described in Section 4.3. The multichannel implantable design is described in Section 4.4, which is followed by the Chapter summary in Section 4.5.

# 4.1 System level overview

A typical wireless system used in a multi-channel scenario is depicted in Fig. 4.1. In applications like monitoring systems, the *transmitter* (TX) side is powered by a battery, as shown in Fig. 4.1 (a), and comprises the signal sensors, the *Analog Front End* (AFE), the *Analog to Digital Converter* (ADC) and the *Digital Signal Processing* (DSP) block, before the *Radio Frequency* (RF) module. The TX node sends the signals to a remote system, where the data are received, processed and stored, as shown in Fig. 4.1 (b). Usually, the TX side is characterised by limited energy resources, due to the limits of the battery. Moreover, the power consumed by the RF transmitter is usually higher than that of the signal acquisition system on the chip [29, 30]. For this reason, some data treatment on the sensor node is crucial to reduce the amount of data sent by the RF TX, while keeping a relatively high information content, which is recovered through a tailored signal reconstruction at the receiver node.

# 4.1.1 Analog to compressed data stream

In this work, we consider neural signals collected and processed from every micro-electrode node (discussed in Chapter 2), in order to accurately estimate the seizure onset using an implantable monitoring device. The focus of this work is on compressive sampling and wireless telemetry, while seizure detection algorithms are beyond the scope of this thesis. For each sampling electrode, the recorded signal is boosted by a *Low-Noise Amplifier* (LNA) (not present in this work). Then, the ADC samples and digitises the analog neural signal, which is then processed and transmitted by the RF unit.

Figure 4.1 – Typical wireless sensor system, highlighting a battery-powered multi-lane TX (a) and its RX counterpart (b).

## Analog to Digital converter

The goal of an ADC is to convert a continuous-time signal into a digital representation of its amplitude. According to the Shannon-Nyquist sampling theorem, a conventional ADC samples the input signal at a frequency of at least twice its bandwidth [52, 53]. The conversion then involves the quantization of the input. The result is a sequence of digital values that represents the continuous-time input signal in the discrete-time, discrete-amplitude digital domain. An ADC is generally characterized by its *Signal to Noise Ratio* (SNR), its bandwidth (or sampling rate) and its dynamic range (summarized in terms of the effective number of bits of resolution, ENOB) [72]. There are many types of ADCs, each with a range of application defined by its main characteristics (e.g., speed, resolution, area and power consumption). The most common ways of implementing an electronic *Analog to Digital* (A2D) converter are:

- Flash ADC: the fastest type of ADC, but usually limited to 8 bits of resolution or fewer, since the number of comparators needed is $2^N - 1$, where N is the number of bits. This number roughly doubles with each additional bit, requiring a large and expensive circuit. ADCs of this type have a large die size, a high input capacitance and high power dissipation;
- Successive Approximation ADC: this topology requires just one comparator; an N-bit SAR ADC requires N+1 comparison periods and is not ready for the next conversion until the current one is complete. At each step in this process, the approximation is stored in a successive approximation register (SAR). This topology is expected to allow the lowest power dissipation, but is also characterized by a slow sampling rate;
- Pipeline ADC: combines the merits of the successive approximation and flash ADCs. This architecture is fast, offers high resolution, and requires a relatively small die size, while the power consumption is relatively high;
- Delta-Sigma ADC: (or Sigma-Delta ADC) consists of a modulator and a decimator. The modulator converts the analog input signal into a digital bit stream; the decimator receives this bit stream and, depending on the oversampling ratio (OSR), produces one N-bit digital output every OSR clock edges. The main advantage of the delta-sigma ADC is that, thanks to oversampling, it suppresses noise, including quantization noise, near the signal frequency, so it can reach SNR levels that other ADC topologies generally cannot. The main drawback of this architecture is that it requires amplifiers, which consume power and area, and it is relatively slow.

Figure 4.2 – Different ADC operating range, considering the sampling rate and the bit resolution, adapted from [71].

Fig. 4.2 depicts the different ADC operating range, considering the sampling rate and the bit resolution.
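The different scaling of the flash and SAR topologies with resolution can be made concrete with a small Python sketch. It only evaluates the two counts quoted in the list above, $2^N - 1$ comparators for a flash converter and N+1 comparison periods for a SAR converter; the function names are illustrative.

```python
def flash_comparators(n_bits):
    """Number of comparators in an N-bit flash ADC: 2^N - 1."""
    return 2 ** n_bits - 1

def sar_comparison_periods(n_bits):
    """Comparison periods of an N-bit SAR ADC, using the N + 1 figure quoted in the text."""
    return n_bits + 1

for n_bits in (6, 8, 10, 12):
    print(f"{n_bits} bits: flash needs {flash_comparators(n_bits)} comparators, "
          f"SAR needs 1 comparator and {sar_comparison_periods(n_bits)} periods")
```

At 10 bits, for example, the flash count already exceeds one thousand comparators, while the SAR converter trades this hardware for a conversion time that grows only linearly with the resolution.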


In the proposed system, to meet the stringent area and power constraints of the SoC, we have designed and implemented a *Successive Approximation Analog to Digital Converter* (SAR ADC), which yields medium bit resolutions, while requiring low-power data conversion [71, 72]. More details on the ADC are given in Section 4.3.

## **Compression implementations**

Before data transmission, the digitized data is processed in order to reduce the power requirements of the wireless TX. In many recently proposed implantable systems (e.g., [73, 29, 68, 74] and references therein), *Compressive Sampling* (CS) [7, 8] has been exploited to drastically reduce the amount of data transmitted, while still allowing robust (and complex) off-line signal reconstruction. Indeed, CS allows taking fewer linear samples by exploiting the natural information content of the signal, which is often lower than the data content itself.

In the proposed work, we implement fully digital DSPs, which implement the Learning-Based CS described in Section 3.3. All the encoding prototypes developed during this thesis work are described in Section 4.2.

## 4.1.2 Wireless Communication

In the literature, numerous methods have been presented for wireless data communication targeting implanted biosensors [75, 76, 77, 78]. They differ in the modulation scheme and in the number of communication channels. Modulation schemes are mainly based on three techniques, namely amplitude shift keying (ASK), frequency shift keying (FSK), and phase shift keying (PSK). Modern digital communication schemes employ modified and improved versions of these basic schemes. In addition, the communication can be either half-duplex or full-duplex. In half-duplex communication, data transfer in both directions is performed on the same link, but in only one direction at a time. Full-duplex communication refers to continuous, simultaneous data flow in both directions. As expected, full-duplex communication requires two channels in the basic case.

To target even higher data rates than what can be achieved with conventional, narrowband communication schemes at the expense of a shorter transmission range (i.e., tens of centimeters to several meters), pulse-based and in particular impulse radio ultra-wideband (IR-UWB) transmission techniques have recently garnered much attention for a wide range of wearable and implantable medical sensor applications [79, 80, 81, 82].

Wireless data communication solutions can be classified into two groups: data communication on the power line by changing parameters of the wireless power transfer link, or employing a dedicated transceiver on both sides. Downlink communication can be directly performed by modulating the signal source in amplitude, frequency or phase. Uplink communication performed by perturbing the characteristics of the power line is called load modulation for magnetic coupling based power transfer links [78] and backscattering for electromagnetic radiation based links [83]. Using a dedicated transceiver isolates the power and data transmission channels, allowing these two links to be designed independently [84]. A compromise between these two solutions can be formulated with respect to the power budget and data rate requirement of the recording application. Moreover, additional components such as antennas occupy a non-negligible volume that may violate size restrictions. In both cases, the operating frequencies have to be selected in careful consideration of the bandwidth requirement imposed by the data rate of the application.

The choice of the uplink communication scheme is a trade-off between a high data rate and the power absorbed by the tissue. At low MHz frequencies, tissues absorb less power than at a few hundred MHz. However, the communication data rate and bandwidth are limited at these frequencies. The additional loss in the tissue is accepted, and communication at high frequency is selected in order to send the data at higher speed. For the uplink communication, a dedicated UWB transmitter is selected, since data communication over the power channel is limited in terms of data rate. Pulse UWB transmission is particularly suited for this application, since it does not require the generation of a carrier signal and the circuitry can be operated in a very low duty cycle regime. The circuit topology proposed in [80] is selected, since each cycle of the RF pulse is digitally programmable in amplitude and duration, enabling a very flexible shaping of the transmitted signal PSD without the use of an output filter.

The downlink communication is built on the power transfer link. Yilmaz and Dehollain showed that a 500 kbps data rate can be reached for a downlink communication superposed on the wireless power transfer link [85]. In that work, the data is fed to the power source to modulate it in ASK mode. Fig. 4.3 shows the downlink communication over the powering channel for 500 kbps and 50 kbps data rates. The top square-wave signal corresponds to the data to be sent from the external base station. The purple wave shows the induced AC voltage at the implanted system. The bottom green signal shows the demodulator output, which recovers the downlink data from the purple wave.
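The principle of an ASK downlink superposed on a power carrier can be sketched behaviourally in Python. This is not the circuit of [85]: the modulation depth, the number of carrier cycles per bit, the sampling grid and the rectify-and-average demodulator are all assumptions chosen to keep the example short.

```python
import numpy as np

def ask_modulate(bits, samples_per_bit, cycles_per_bit, depth):
    """Scale a sinusoidal carrier to 1.0 for a '1' and to (1 - depth) for a '0'."""
    n = len(bits) * samples_per_bit
    t = np.arange(n) / samples_per_bit                 # time in units of one bit period
    carrier = np.sin(2 * np.pi * cycles_per_bit * t)
    levels = np.where(np.asarray(bits) == 1, 1.0, 1.0 - depth)
    return np.repeat(levels, samples_per_bit) * carrier

def ask_demodulate(signal, samples_per_bit):
    """Rectify, average over each bit period, and compare with a mid-level threshold."""
    level = np.abs(signal).reshape(-1, samples_per_bit).mean(axis=1)
    threshold = 0.5 * (level.max() + level.min())      # needs both symbols in the frame
    return (level > threshold).astype(int)

bits = [1, 0, 1, 1, 0, 0, 1, 0]
tx = ask_modulate(bits, samples_per_bit=200, cycles_per_bit=20, depth=0.3)
rx = ask_demodulate(tx, samples_per_bit=200)
print("sent:    ", bits)
print("received:", rx.tolist())
```

A partial modulation depth is used in this sketch so that the carrier never vanishes, which reflects the fact that the same link also has to keep delivering power while data is being sent.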

By applying these two methodologies, an ASIC for a neural monitoring system can reach data rates of 24 Mbps and 500 kbps for uplink and downlink communication, respectively.

# 4.1.3 Implanted System Powering

For most of the powering methods mentioned in Section 2.3.3, the power source generates an AC signal. However, a DC supply is required to power the electronic circuits inside the implant. Therefore, this AC signal is first converted to a DC signal with ripples by means of a rectifier, and this signal is then converted to a stable DC signal using a regulator. For power sources that generate a DC voltage, this voltage generally does not match the required power supply level. In such cases, a DC-DC converter block is needed. In both approaches, high conversion efficiency is required, since it directly determines the system efficiency. In order to keep the system alive during short power interruptions, a load capacitor or a super-capacitor can be used.

Figure 4.3 – Downlink data communication at (a) 500 kbit/s and (b) 50 kbit/s (waveforms from top to bottom; turquoise: modulator input (5V/div), purple: demodulator input ((a)2V/div (b)5V/div), and green: demodulator output (1V/div), respectively) [85].

Inductive links are commonly realized with two coils placed in the vicinity of each other to create mutually coupled inductors. Several studies have been conducted to maximize the efficiency of wireless power transfer systems incorporating 2 coils [86]. In addition to the 2-coil inductive links, a 4-coil power transmission link was proposed in [87] to further increase the *Power Transfer Efficiency* (PTE), particularly at large distances. Kiani et al. have proposed a 3-coil inductive power transfer link with a PTE comparable to that of its 4-coil counterpart at large coupling distances, which can also achieve high power delivered to the load [88]. This structure is considered suitable for the target application of powering a neural implant. The simplified schematic diagram of the 3-coil inductive link is shown in Fig. 4.4. The implanted coil in this work is designed with an outer diameter of 10 mm, which is within the size limit of the target application.

The AC signal received from the 3-coil structure can be converted to a DC voltage by using a half-wave rectifier. An active rectifier composed of a pass transistor with dynamic bulk biasing, a comparator, and a multiplexer is proposed in [89]. The PMOS pass transistor works as a switch and is controlled by the comparator and the multiplexer according to the input and output voltage levels. It is turned on when the input is higher than the output, and turned off otherwise. Therefore, the reservoir capacitance at the output is charged when the switch is on, and reverse leakage to the input is minimized when the switch is off. The comparator decides whether to pull up or pull down the gate voltage of the PMOS pass transistor according to the difference between the input and output voltages, and the multiplexer changes the bias voltage accordingly. In order to eliminate ripple at the operating frequency and provide a DC voltage independent of the input voltage, a regulator needs to be employed. Fig. 4.5 depicts the half-wave rectifier and regulator proposed in [89]. By using a large load capacitance or a super-capacitor at the output of the regulator, the effect of a short interruption in the power link on the supply voltage of the system can be eliminated.

Figure 4.4 – Lumped circuit model of the 3-coil inductive link [88].

Figure 4.5 – (Left top) Half-wave active rectifier composed of a pass transistor, comparator, and a multiplexer; (right) the low drop-out voltage regulator with its cascoded bootstrapped current source; and (left bottom) connection of rectifier and the regulator [89].

# 4.2 Learning based sampling implementations

In the proposed work, we implement a fully digital signal processor, which implements the Learning-Based CS described in Section 3.3. In the following, a first LBCS-based implementation is described, highlighting the main *Hadamard-based LBCS* (LBCS-Had) encoding scheme design. Moreover, the optimal encoding system (discussed in subsection 3.3.1) is compared with the LBCS-based Hadamard one, motivating our choice for the hardware implementation. Furthermore, we demonstrate that the *Discrete Cosine Transform based LBCS* (LBCS-DCT) achieves better signal recovery performance than the LBCS-Had, but at a significant hardware cost, making it a weak solution for implantable devices, while it becomes very interesting for applications like imaging. This analysis is then followed by two LBCS-Had designs, in which both the system and the circuits are improved through design strategies that allow the dynamic generation of the transformation coefficients and a variable compression ratio. This results in an *adaptive LBCS* implementation, well suited for neural signal acquisition systems, that not only rigorously trades off area, energy consumption and output signal quality, but also significantly outperforms the state of the art in all aspects.

# 4.2.1 LBCS-Had Implementation

The Walsh-Hadamard transformation has been used in recent publications [90, 91] because of its hardware-friendly implementation: each transformation coefficient requires only one bit of resolution, which simplifies the related computations. In particular, the authors of [91] propose a threshold-based Walsh-Hadamard compression to sample the Action Potentials (AP) of neuronal signals for brain-machine interfaces. They apply a butterfly scheme to transform the input signal samples into the Hadamard domain. However, such a butterfly-based method can only be applied to a very small number of consecutive samples (8 samples in [91]), which limits any kind of learning approach because of the poor signal statistics. For this reason, that work is used for AP signal detection, with limited applicability to continuous medical monitoring for applications like epilepsy, where the whole signal behaviour is required by clinicians. The authors of [90] propose the generation of the full Hadamard matrix  $\Psi \in \mathbb{R}^{16\times 16}$ for a parallel neural recording system. However, that implementation does not apply any compression mechanism and therefore requires a significant power budget.

In our work, we have applied the Hadamard-based LBCS compression algorithm, performing the transformation from the temporal to the Hadamard domain through different implementations [92, 93]. In the first version [92], described in this subsection, the whole Hadamard transformation matrix is stored in static memories. In a more advanced encoding system, described in subsection 4.3.1, the LBCS-based compression algorithm performs the transformation through Hadamard coefficients generated on the fly. That implementation also allows adaptive compression rates based on an energy threshold method, as discussed in subsection 4.3.1.

## Sampling procedure

In this section, we propose an architecture that allows embedded sampling and compression of the neural input signal, based on the LBCS approach described in Section 3.3.

Figure 4.6 – One channel block diagram showing the LBCS encoder and the matrix sequence generation logic.

Figure 4.7 – Accumulator block diagram.

In the following, we fix $\Psi$ equal to the Hadamard matrix **H**, which has the advantage of requiring only a single bit to represent each matrix entry and also minimizes the matrix multiplication operations. Let $\mathbf{H}_{\Omega} = \mathbf{P}_{\Omega}\mathbf{H}$ be the matrix composed of the rows of **H** indexed by $\Omega$. We sequentially compute $\mathbf{y} = \mathbf{H}_{\Omega} \mathbf{x}$, in order to obtain:

$$\begin{bmatrix} h_{11} & h_{12} & h_{13} & \dots & h_{1N} \\ h_{21} & h_{22} & h_{23} & \dots & h_{2N} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ h_{M1} & h_{M2} & h_{M3} & \dots & h_{MN} \end{bmatrix} \times \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix}. \tag{4.1}$$

Looking at each component of **y**, we have

$$y_k = \sum_{j=1}^N h_{kj} x_j, \ k \in \{1, \dots, M\}, \tag{4.2}$$

where  $h_{kj}$  is the (k, j)-entry of  $\mathbf{H}_{\Omega}$ .


Figure 4.6 shows the block diagram of the LBCS architecture proposed in this work for one-channel sampling. The *Matrix Sequence Generator Logic* is an on-chip memory that stores the entries of $\mathbf{H}_{\Omega}$ used for the sub-sampling procedure performed by the *LBCS Encoder* block. The entries are loaded into the memory sequentially through the *Matrix Input*. The sampling procedure starts once the memory is loaded, and a *serializer* is used to send the $h_{kj}$ weights sequentially to the summation node.

The input signal $x_j$ is the digital output of an A/D converter with a resolution of $B_i$ bits. At the beginning of each window of length N, we set $\mathbf{y} = 0$ and then, at each time step j, $x_j$ is added to or subtracted from the $B_o$-bit accumulator value $y_k$ depending on the one-bit Hadamard entry $h_{kj}$, updating each component via the rule:

$$y'_{k} = y_{k} + h_{kj} x_{j}, \ k \in \{1, \dots, M\}. \tag{4.3}$$

Instead of performing the subtraction through a dedicated subtractor, the $B_o$-bit signal $y'_k$ is formed with a single $B_o$-bit ripple carry adder, and the $h_{kj}$ input defines the polarity of $y_k$. This also avoids any multiplier in the weighting phase when $y_k$ is fed back to the summation node. Each accumulator has to be updated before the next sample $x_j$ arrives; therefore we use an *enable* signal to drive the multiplexer of the accumulator block, shown in Figure 4.7, so that only one register is updated at a time. With this design choice, we avoid having one adder per accumulator lane, but require an internal digital clock frequency

$$f_{encoder} = M \times f_s, \tag{4.4}$$

where  $f_s$  is the signal sampling frequency<sup>1</sup>.
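The sequential update of Eq. (4.3) can be illustrated with a minimal behavioural sketch in Python. It is not the circuit described above: the function names, the toy window length N = 8 and the row set `OMEGA` are hypothetical, and the Hadamard matrix is built with the Sylvester recursion purely for illustration.

```python
def sylvester_hadamard(n):
    """Return the 2**n x 2**n Hadamard matrix (entries +1/-1) as a list of rows."""
    h = [[1]]
    for _ in range(n):
        # Block structure [[H, H], [H, -H]] of the Sylvester construction.
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def lbcs_encode_stream(samples, h_omega):
    """Accumulate y_k <- y_k + h_kj * x_j one input sample at a time (Eq. 4.3)."""
    acc = [0] * len(h_omega)                 # M accumulators, reset at window start
    for j, x in enumerate(samples):          # one new sample every 1/f_s
        for k, row in enumerate(h_omega):    # M updates per sample: f_encoder = M * f_s
            # The one-bit entry only selects "add" or "subtract"; no multiplier is used.
            acc[k] = acc[k] + x if row[j] > 0 else acc[k] - x
    return acc


def lbcs_encode_direct(samples, h_omega):
    """Reference result: the plain matrix-vector product y = H_Omega x."""
    return [sum(h * x for h, x in zip(row, samples)) for row in h_omega]


H = sylvester_hadamard(3)                    # toy window length N = 8
OMEGA = [0, 1, 5]                            # hypothetical learnt row indices
H_OMEGA = [H[i] for i in OMEGA]
X = [3, -1, 4, 1, -5, 9, 2, -6]              # hypothetical ADC samples
Y_STREAM = lbcs_encode_stream(X, H_OMEGA)
```

The inner loop over `k` makes the clock requirement visible: each incoming sample triggers M accumulator updates through the single shared adder.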

When $M = \frac{N}{CR}$ is large, the internal clock frequency may become a limiting factor, requiring additional digital blocks to synchronize the clock. However, with the sampling frequency of 5 kHz for the considered datasets, choosing N = 256 and a hypothetical compression rate of $16 \times$, the LBCS encoder frequency is 5 kHz $\times \frac{256}{16} = 80$ kHz, which is still in a relatively low frequency range.

## **Circuit implementation**

To implement the proposed architecture, we have set our target signal quality close to 30 dB. Then, considering a sampling time window of 256 samples and assuming an ADC resolution $B_i = 10$ bits, we have set the compression ratio CR = 16 following the numerical results reported in Tables 3.7 and 3.5. The internal encoder core clock frequency is then $f_{encoder} = M \times f_s = 80$ kHz and the accumulator resolution is set as $B_o = B_i + \log_2(N)$ to avoid overflow.
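The design-point arithmetic above can be reproduced with a short Python sketch. The helper name `encoder_parameters` and its `extra_bits` argument are hypothetical conveniences; only the formulas $M = N/CR$, $f_{encoder} = M \times f_s$ and $B_o = B_i + \log_2(N)$ come from the text.

```python
import math


def encoder_parameters(n, cr, fs_hz, adc_bits, extra_bits=0):
    """Return (M, f_encoder in Hz, accumulator width B_o in bits)."""
    m = n // cr                                   # number of accumulators, M = N / CR
    f_encoder_hz = m * fs_hz                      # Eq. (4.4)
    acc_bits = adc_bits + int(math.log2(n)) + extra_bits
    return m, f_encoder_hz, acc_bits


# Design point of this subsection: N = 256, CR = 16, f_s = 5 kHz, B_i = 10 bits.
M, F_ENCODER_HZ, B_O = encoder_parameters(n=256, cr=16, fs_hz=5_000, adc_bits=10)
```

With these inputs the sketch yields 16 accumulators, an 80 kHz encoder clock and 18-bit accumulators, in agreement with the values quoted in the text.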

The architecture shown in Figure 4.6 has been implemented in a 1P9M 90 nm CMOS technology. Table 4.1 compares the LBCS-Had implementation with the state of the art and with our SHS results, previously described in Section 3.2.2. The design is fully digital and the layout of a one-channel encoder is shown in Fig. 4.8. To verify the functionality of the digital encoder, the digitized neuronal data is given directly as input to the LBCS block. A post place-and-route simulation has verified that the *M* outputs given by the encoder are equal to the expected values computed in MATLAB. The simulation has been run considering a worst-case scenario, with the slow-slow process corner operating at 0.9 V, which results in an estimated power consumption of the LBCS encoder of around $1\ \mu W$. The silicon area of the encoder block is $210\ \mu m \times 210\ \mu m$. Considering that the electrode pitch in a typical Utah-MEA is $400\ \mu m$, the resulting size of the encoder is fully suitable for such embedded applications.

Figure 4.8 – One channel encoder layout showing the LBCS encoding circuit and the matrix sequence generation logic for N = 256 and CR = 16.

<sup>1</sup>The $h_{kj}$-serializer works at frequency $f_{encoder}$ too.

# 4.2.2 LBCS-DCT implementation

The LBCS technique has also been applied to a circuit implementation with a DCT-based transform [94]. Even though it shows great signal reconstruction performance, the actual hardware implementation requires a relatively larger area and power consumption than its LBCS-Hadamard counterpart, which makes it more suitable for different applications, such as image processing.

The one-channel sampling DCT-LBCS architecture proposed in this work is depicted in Figure 4.9. The embedded sampling and compression of the neural input signal follows the description presented in Section 3.3.

| Parameter                           | [29]  | [68]               | This Work I | This Work II |
|-------------------------------------|-------|--------------------|-------------|--------------|
| Compression Method                  | BERN  | MCS                | SHS         | LBCS-Had     |
| Compression Rate                    | 10    | 16                 | 16          | 16           |
| Technology [µm CMOS]                | 0.09  | 0.18               | -           | 0.09         |
| Compression Power $[\mu W]$         | 1.9   | 17.83 <sup>a</sup> | -           | 1.0          |
| Compression Area [mm <sup>2</sup>]  | 0.090 | 0.090              | -           | 0.044        |
| Recovered Signal [dB] <sup>b</sup>  | 21.7  | 22.2               | 24.7        | 30.8         |

Table 4.1 - Comparison With Published Work

<sup>a</sup> Compression power cost over 16 channels.

<sup>b</sup> Average SNR calculated from Tables 3.7 and 3.5, considering CR=16 for all the compression methods.

In the following, we fix  $\Psi$  equal to the DCT matrix. Let  $\mathbf{D}_{\Omega} = \mathbf{P}_{\Omega} \Psi$  be the matrix composed of the rows of  $\Psi$  indexed by  $\Omega$ . We sequentially compute  $\mathbf{y} = \mathbf{D}_{\Omega} \mathbf{x}$ : looking at each component of  $\mathbf{y}$ , we have

$$y_k = \sum_{j=1}^N d_{kj} x_j, \ k \in \{1, \dots, M\}, \tag{4.5}$$

where $d_{kj}$ is the (k, j)-entry of $\mathbf{D}_{\Omega}$.

The DCT transformation matrix $\mathbf{D}_{\Omega}$ contains real-valued coefficients (positive and negative), which are stored in an SRAM, shown in Figure 4.9, with $N \times M$ cells of size $B_{DCT}$.

## Sampling procedure

A finite state machine (FSM) drives the LBCS encoder sub-sampling procedure. The entries  $d_{kj}$  are stored into the chip memory in a sequential fashion through the *DCTCoef* input. The input signal  $x_j$  is the digital output of an A/D converter with a resolution of  $B_i$  bits. The sampling procedure starts once the memory is loaded and the operations are carried out by a single multiplier and an adder, which are used in a time-multiplexed manner to accumulate the *M* output values into the registers.

At the beginning of each window of length N, the registers are reset ($\mathbf{y} = 0$). Then, at each time step j, $x_j$ is multiplied by the DCT entry $d_{kj}$ and added to the $B_o$-bit accumulator value $y_k$, updating each component following the rule $y'_k = y_k + d_{kj}x_j$, $k \in \{1, ..., M\}$. The enable signal drives the digital registers so that each accumulator is updated before the next sample $x_j$ arrives. This design choice avoids having one multiplier-adder per accumulator lane, but requires an internal digital clock frequency $f_{encoder} = M \times f_s$, where $f_s$ is the signal sampling frequency.
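The time-multiplexed multiply-accumulate procedure can be sketched in Python as follows. This is a floating-point behavioural model only: the text does not state which DCT variant or coefficient quantization is used, so the orthonormal DCT-II below, the toy window length and the row set are assumptions made for illustration.

```python
import math


def dct_rows(n, omega):
    """Rows of an orthonormal DCT-II matrix (assumed variant) indexed by omega."""
    rows = []
    for k in omega:
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * j + 1) * k / (2 * n))
                     for j in range(n)])
    return rows


def dct_lbcs_encode(samples, d_omega):
    """Apply y_k <- y_k + d_kj * x_j with one shared multiplier and adder."""
    acc = [0.0] * len(d_omega)           # registers reset at the start of the window
    for j, x in enumerate(samples):      # one new sample every 1/f_s
        for k in range(len(d_omega)):    # M multiply-accumulate steps per sample
            acc[k] += d_omega[k][j] * x
    return acc


N = 8                                    # toy window length
D_OMEGA = dct_rows(N, [0, 1, 3])         # hypothetical learnt row indices
X = [1.0] * N                            # constant input: only the DC row responds
Y = dct_lbcs_encode(X, D_OMEGA)
```

Unlike the Hadamard case, each update here needs a true multiplication by a multi-bit coefficient, which is consistent with the larger area and power reported for the DCT design.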

The input data sampling frequency for the considered dataset is 5 kHz. Choosing a window length N = 256 with a compression rate of $32 \times$, the DCT-LBCS encoder frequency is $5 \text{ kHz} \times \frac{256}{32} = 40 \text{ kHz}$, which is a relatively low frequency. If $M = \frac{N}{CR}$ is large, however, the internal clock frequency may become a limiting factor, requiring additional digital blocks for clock synchronization.

Figure 4.9 – One channel block diagram showing the LBCS encoder and the matrix sequence generation logic.

### **Circuit implementation**

The circuit implementation has been defined following the experimental results discussed in Section 3.3.2 and considering the trade-off between area and power requirements. The target signal reconstruction quality is set to 30 dB. Considering a sampling window length of 256 samples and assuming an ADC resolution of $B_i = 10$ bits, the Had-LBCS method reaches 30 dB performance with a compression ratio CR = 16. As reported in Table 3.7, with the DCT-LBCS approach a compression ratio CR = 32 still yields a performance higher than 30 dB (and improved with respect to the Had-LBCS design). Thus, we can reduce the number of bits to transmit, which is directly related to the RF data transmission cost. The internal encoder core clock frequency is $f_{encoder} = M \times f_s = 40$ kHz, with the accumulator resolution set as $B_o = B_i + \log_2(N) + 1$ to avoid overflow. This leads us to define an *effective compression ratio* as

$$CR_{eff} = CR \times \frac{B_i}{B_o}, \tag{4.6}$$

which takes into account the actual number of bits per accumulator, after the compression.
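As a worked illustration of Eq. (4.6), the short Python sketch below evaluates the effective compression ratio for the DCT design point of this subsection. Applying the same formula to the Hadamard design of Section 4.2.1 is an extrapolation added here for comparison and is not stated in the text; the function name is hypothetical.

```python
def effective_cr(cr, adc_bits, acc_bits):
    """Eq. (4.6): CR_eff = CR * B_i / B_o."""
    return cr * adc_bits / acc_bits


# DCT-LBCS: CR = 32, B_i = 10, B_o = B_i + log2(256) + 1 = 19 bits.
CR_EFF_DCT = effective_cr(32, 10, 10 + 8 + 1)
# Had-LBCS (extrapolated): CR = 16, B_i = 10, B_o = B_i + log2(256) = 18 bits.
CR_EFF_HAD = effective_cr(16, 10, 10 + 8)
```

Under these assumptions the wider accumulators roughly halve the nominal compression ratio in both cases, which is why the number of bits per transmitted coefficient matters for the telemetry cost.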




Figure 4.10 – One-channel DCT-LBCS encoder layout for N = 256 and CR = 32.

Table 4.2 reports the performance of the system and presents a comparison with recently published work. The table summarizes the compression power and area requirements of each method considered. It also reports the simulated recovered signal quality and transmitter performance, highlighting how the DCT-LBCS approach reduces the RF data telemetry cost while improving the performance by almost 3 dB with respect to the best approach presented in [92]. On the other hand, the area requirement is higher because of the increased bit resolution per DCT matrix entry and because of the different CMOS technology node. However, in a multi-channel application the memory content is shared among all the channels, which reduces the impact of the storage area on the overall chip.

The architecture shown in Figure 4.9 has been implemented in a 1P6M 0.18 $\mu m$ CMOS technology. The layout of the fully digital one-channel encoder is shown in Figure 4.10. To verify the functionality of the digital encoder, the digitized neuronal data is given directly as input to the DCT-LBCS block. A post place-and-route simulation has verified that the *M* outputs given by the encoder are equal to the expected values computed in MATLAB. The simulation has been run considering a worst-case scenario, with the slow-slow process corner operating at 1.8 V, which results in an estimated power consumption of the DCT-LBCS encoder of around 2 $\mu W$. The silicon area of the encoder block is 490 $\mu m$ × 650 $\mu m$.

| Parameter                           | [29]  | [68]        | [92]     | This Work |
|-------------------------------------|-------|-------------|----------|-----------|
| Compression Method                  | BERN  | MCS         | Had-LBCS | DCT-LBCS  |
| Technology [µm CMOS]                | 0.09  | 0.18        | 0.09     | 0.18      |
| Compression Rate                    | 20    | 16          | 16       | 32        |
| Compression Power $[\mu W]$         | 1.9   | $17.83^{*}$ | 1.0      | 2.0       |
| Compression Area [mm <sup>2</sup>]  | 0.090 | 0.090       | 0.044    | 0.3       |
| Recovered Signal [dB]               | 15.76 | 17.48       | 28.48    | 31.00     |
| TX-Power @ $f_s$ [ $\mu$ W]         | 1.5   | 0.94        | 1.7      | 0.85      |

Table 4.2 - Comparison With Published Work

\* Compression power cost over 16 channels.

# 4.2.3 Optimal vs LBCS encoders

Section 3.3.1 shows that the best linear encoder, for a fixed compression rate, is obtained by sampling the coefficients that capture most of the energy of each signal in each sampling window; we refer to this approach as optimal encoding. We now analyze the power and area costs of LBCS and optimal encoding, respectively.

## LBCS encoding power and area analysis

• *Power cost*: as shown in Figure 4.6, $M$ $B_o$-bit accumulators are used to store the Hadamard coefficients. This leads to a dynamic power consumption of:

$$P_{LBCS} \propto M \cdot B_o \cdot f_s \cdot V_{DD}^2 \cdot C_{ref}, \tag{4.7}$$

where  $V_{DD}$  is the operating voltage of the digital block and  $C_{ref}$  is the reference capacitance defined by the technology.

• *Area cost*: since a single adder is used for sampling, the area of the digital encoder block is proportional to the number *M* of accumulators:

$$Area_{LBCS} \propto M. \tag{4.8}$$

# Optimal encoding power and area analysis

• *Power cost*: considering a similar architecture, the adaptive encoder requires *N* accumulators, leading to a dynamic power consumption:

$$P_{Optimal} \propto N \cdot B_o \cdot f_s \cdot V_{DD}^2 \cdot C_{ref}. \tag{4.9}$$

• *Area cost*: the area cost is proportional to the number of accumulators used to store all the Hadamard coefficients:

$$Area_{Optimal} \propto N. \tag{4.10}$$

# Comparison

Comparing the area-power costs of the two approaches, we obtain

$$\frac{P_{Optimal}}{P_{LBCS}} \ge \frac{N}{M} = CR,$$
$$\frac{Area_{Optimal}}{Area_{LBCS}} \ge \frac{N}{M} = CR.$$

Combining these observations with Tables 3.7 and 3.5, we conclude that LBCS yields reconstructions almost as good as the ones obtained with the adaptive encoder, but at a fraction of its power and area cost. The advantage is more significant the higher the compression ratio.

# 4.3 Single channel Adaptive LBCS-Had implementation

In the previous section, two different LBCS prototypes have been discussed. In particular, it has been highlighted that the Hadamard-based LBCS encoding scheme is better suited to implantable devices than the LBCS-DCT scheme.

In this section we first discuss some techniques we adopt to improve the LBCS-Had hardware implementation, in terms of area and overall performance. Then, we describe the complete single channel system architecture and circuit implementation, including the ADC, the Hadamard based LBCS and the RF parts (developed in collaboration with RFIC group at EPFL), for power and data wireless link. Afterwards, we present the electrical measurements.

# 4.3.1 Adaptive LBCS

As previously discussed, Hadamard transform is particularly suited for hardware implementation since each coefficient can be computed by performing only simple additions or subtractions.

Embedded compression only requires the selected rows of the Hadamard matrix (defined by $\hat{\Omega}$), which can be generated on the fly, so that the coefficients used to apply the LBCS approach are produced dynamically. This technique drastically reduces the encoder memory requirements of the previous LBCS-Hadamard implementation (described in Section 4.2.1), while preserving the signal reconstruction quality within a low-power chip implementation. Moreover, a variable compression rate based on the signal energy content allows a recovery mechanism that adapts its performance to the information content of the data. This approach, named adaptive LBCS, varies the compression rate, and hence the overall device power consumption, based on the input signal behaviour.

## **Dynamic Hadamard entries generation**

The hardware area of the Had-based LBCS encoder described in Section 4.2.1 can be reduced by replacing the SRAM dedicated to storing the Hadamard coefficients with a direct computation of each matrix entry [90]. Such computation is feasible thanks to the intrinsic structure of the Hadamard matrix, which is summarized as follows. The non-normalized Hadamard transformation matrix $\hat{H}_n \in \{-1, 1\}^{N \times N}$ of size *n*, with $N = 2^n$, is expressed as a recursive Kronecker product of two matrices

$$\hat{H}_n = \hat{H}_1 \otimes \hat{H}_{n-1}, \quad \text{where } \hat{H}_1 \triangleq \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}. \tag{4.11}$$

The row and column indices k and j of each matrix entry can be expressed in binary representation as

$$k = \sum_{i=0}^{n-1} k_i 2^i, \ j = \sum_{i=0}^{n-1} j_i 2^i, \ \text{with } k_i, j_i \in \{0,1\}. \tag{4.12}$$

Each Hadamard entry $h_{k,j}$ can then be expressed as

$$h_{k,j} = (-1)^{\sum_{i=0}^{n-1} k_i j_i} \equiv (-1)^{mod_2(\sum_{i=0}^{n-1} k_i j_i)}. \tag{4.13}$$

In particular, mapping the values (1, -1) to (0, 1), each Hadamard entry can be derived by

$$h_{k,j} = mod_2\left(\sum_{i=0}^{n-1} k_i j_i\right). \tag{4.14}$$

This expression can be efficiently implemented in hardware: logic AND gates compute the products $k_i j_i$, while the modulo-2 sum is obtained with logic XOR gates. Thus, the circuit takes the row and column indices k and j and computes the Hadamard coefficient in the binary map (0, 1).
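The AND/XOR rule can be checked against the recursive Kronecker definition with a brief Python sketch. The function names are hypothetical, and the bitwise AND followed by a parity count is a software stand-in for the gate-level circuit.

```python
def hadamard_bit(k, j):
    """Eq. (4.14): parity of the bitwise AND of the row and column indices.

    Returns 0 for the entry +1 and 1 for the entry -1.
    """
    return bin(k & j).count("1") & 1     # AND per bit, then modulo-2 sum (XOR tree)


def hadamard_entry(k, j):
    """Map the binary value back to the +1/-1 entry of Eq. (4.13)."""
    return 1 - 2 * hadamard_bit(k, j)


def sylvester_hadamard(n):
    """Reference matrix built with the recursion of Eq. (4.11)."""
    h = [[1]]
    for _ in range(n):
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


SIZE = 16                                # N = 2**4
REFERENCE = sylvester_hadamard(4)
GENERATED = [[hadamard_entry(k, j) for j in range(SIZE)] for k in range(SIZE)]
```

Both constructions produce the same matrix when rows and columns are indexed from 0 to N - 1, so no entry needs to be stored.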

## Adaptive Hadamard compression

The simulation results shown in Fig. 4.12-(a) depict the energy content of the N samples in the Hadamard domain for a particular sampling window. As described in Sec. 3.3, the learning-based algorithm identifies the coefficients that, on average, contribute the most energy. However, depending on the signal evolution in the sampling window, the coefficients selected by the learning process might have a low energy content. This analysis is useful to define the system's trade-offs and a variable compression rate, which adapts from window to window depending on the energy levels defined by the evolution of the neural signal in time.

Figure 4.11 - Variable CR block diagram, defined by the threshold level (Thr).

At the system level, for a window length of N=64, a compression rate of CR = 8 (i.e., at most $\frac{N}{CR} = 8$ transmitted coefficients) has been defined, in order to allow a relatively high SNR after signal reconstruction. Since the energy of some of these $\frac{N}{CR} = 8$ Hadamard coefficients might be below a certain level, a threshold is also defined during the learning process, in order to transmit only the most relevant coefficients, enabling a dynamic compression. The dynamic detection of the Hadamard coefficients results in an easy hardware implementation and allows a variable CR from window to window. Fig. 4.11 shows the block diagram of the variable CR implementation: depending on its energy content, each coefficient value $y_k$ is either transmitted or replaced by a $B_o$-bit stream of zeros by means of a multiplexer, which is summarized mathematically as:

$$y'_{k} = \begin{cases} 0, & |y_{k}| < \text{Threshold} \\ y_{k}, & \text{otherwise.} \end{cases} \tag{4.15}$$

In such a design, the SoC features a compression rate that varies from CR=8 to CR=64 and allows the TX to transmit fewer coefficients, thus drastically reducing its power consumption. Fig. 4.12-(b) shows the trade-off between the mean signal reconstruction SNR and the mean CR over the whole dataset, as the threshold varies. Fig. 4.12-(c) and Fig. 4.12-(d) show, respectively, the mean signal recovery quality and the mean window compression rate with respect to the threshold level. It is worth highlighting that a relatively small Hadamard energy threshold (e.g., below 100) reduces the number of transmitted coefficients (and thus raises the CR), while the SNR remains relatively high (above 28 dB).
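The thresholding rule of Eq. (4.15) and its effect on the per-window compression rate can be sketched in Python. The way the window CR is computed below (N divided by the number of surviving coefficients, with at least one coefficient counted so that the CR saturates at 64) is an assumption chosen to match the stated range CR = 8 to CR = 64; the coefficient values are invented for illustration.

```python
def apply_threshold(coeffs, threshold):
    """Eq. (4.15): zero every coefficient whose magnitude is below the threshold."""
    return [0 if abs(y) < threshold else y for y in coeffs]


def window_compression_rate(coeffs, threshold, n):
    """Assumed per-window CR: N over the number of transmitted coefficients."""
    kept = sum(1 for y in coeffs if abs(y) >= threshold)
    return n / max(kept, 1)              # assumption: CR saturates at N when none survive


N = 64
Y_WINDOW = [120, -30, 5, 250, -90, 12, 0, 60]   # hypothetical M = 8 coefficients
THRESHOLDED = apply_threshold(Y_WINDOW, threshold=50)
WINDOW_CR = window_compression_rate(Y_WINDOW, threshold=50, n=N)
```

In this example four of the eight coefficients survive, so the window is compressed at 16× instead of the baseline 8×.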

## 4.3.2 Implantable Architecture

The implantable single channel chip architecture is described in this Section. The SoC designed in this work consists of the analog to digital converter, followed by the encoder which compresses the sampled data, implementing the Learning-based CS algorithm described in Section 3.3. The compressed bit stream is then serialized and wirelessly sent out by the RF transmitter. The circuit can be powered wirelessly through an inductive link between the implant and a power delivery unit.



Figure 4.12 - SNR analysis for adaptive approach.

## Analog to compressed data stream design

The neural signal digitization is realized by a successive-approximation-register analog-to-digital converter (SAR ADC). Such an ADC design results in a compact and low-power implementation, which matches the stringent area and power constraints of our implantable SoC. The SAR ADC has 8-bit resolution and runs at 45 kHz in order to match the 5 kS/s rate of the input signal from the iEEG dataset. A compact ADC implementation is achieved by a binary-weighted capacitive array with an attenuation capacitor [68]. In particular, the ADC requires 9 clock cycles to complete the digitization of each input sample (at 5 kHz), and thus runs at 45 kHz. Since the neural signal bandwidth is relatively low, the compression computations are carried out by the DSP at the same frequency as the ADC: the DSP core runs at the same speed while performing the data compression.
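The relation between the conversion cycles and the ADC clock can be illustrated with an idealized behavioural model in Python. The binary search below is the generic SAR principle for a unipolar input, not the capacitive-array circuit of this work, and `sar_convert` with its `vref` argument is a hypothetical name; only the figure of 9 cycles per conversion and the 5 kS/s rate come from the text.

```python
def sar_convert(vin, vref, bits=8):
    """Ideal SAR conversion: decide one bit per step, from MSB to LSB."""
    code = 0
    for b in reversed(range(bits)):
        trial = code | (1 << b)                   # tentatively set the next bit
        if vin >= trial * vref / (1 << bits):     # comparator decision
            code = trial                          # keep the bit
    return code


CYCLES_PER_CONVERSION = 9                         # stated in the text
SAMPLE_RATE_HZ = 5_000                            # 5 kS/s input rate
ADC_CLOCK_HZ = CYCLES_PER_CONVERSION * SAMPLE_RATE_HZ

MID_SCALE_CODE = sar_convert(0.5, 1.0)            # half of the reference voltage
```

The clock value obtained this way is 45 kHz, matching the frequency shared by the ADC and the DSP core.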

The Hadamard-based LBCS encoder block diagram is depicted in Fig. 4.13, which shows the input data path from the Analog to Digital Converter (ADC), through the LBCS Digital Signal Processor (DSP), to the encoded data transmitter. The *Finite State Machine* (FSM) of the DSP drives the *Had-block* and the main DSP core, where the encoding process is executed. The Had-block generates the Hadamard bit streams and replaces the SRAM used in the previous implementation [92], reducing the encoder area. The Had-block is mainly composed of the Row-Index Look-up Table (LuT) and the Hadamard bit generator. The Row-Index LuT stores the learnt indices of the sub-sampling matrix $\mathbf{P}_{\Omega}$, described in Section 3.3. Assuming that only M rows of the full Hadamard matrix $\mathbf{H} \in \mathbb{R}^{N \times N}$ are used to apply the LBCS-based compression, we can define a mapping function $w(k) \in [0, N - 1]$, where $k \in [0, M - 1]$ is the index of the output value, such that the encoder entry $h_{k,j}$ is the $(w(k), j)$-entry of $\mathbf{H}$. The LuT implements this mapping function w(k).

Figure 4.13 – One channel block diagram showing the LBCS encoder and the matrix sequence generation logic.

The LuT coefficients, driven by the FSM, are sent to the Hadamard bit generator, which produces the transformation entries $h_{k,j}$ following the description in subsection 4.3.1. Fig. 4.14 shows the block diagram of the Hadamard bit generator, highlighting the logic gates used to generate the $h_{k,j}$ entries [90]. During a calibration phase, the learnt Hadamard row indices, provided through the RowIDX input ($\log_2(N)$ bits wide, to code all the possible Hadamard matrix indices), are loaded into the LuT. As soon as the program enable (Pr\_en) is active, the initialization starts and the FSM programs the M indices into the LuT, following the RowIDX and the k signals used to correctly address the register. The FSM also generates the enable and reset commands sent to the DSP, to correctly synchronize the encoding procedure and to reset the accumulator registers (*Accum* in Fig. 4.13) at the end of each encoding window.



Figure 4.14 – Hadamard bit generator block diagram.

The encoder input signal $x_j$, digitized by the ADC with $B_i$-bit resolution, is added to or subtracted from the previous accumulator register values at each sampling instant j in the sampling window of length N. The LBCS-DSP block performs the embedded compression, defined as

$$y_k = \sum_{j=1}^N h_{k,j} x_j, \ k \in \{1, \dots, M\}, \tag{4.16}$$

where  $h_{k,j}$  is the (k, j)-entry of  $\mathbf{H}_{\Omega} = \mathbf{P}_{\Omega}\mathbf{H}$ ; the Hadamard matrix  $\mathbf{H}$  (= $\Psi$  described in subsection 3.3), requires a single bit per entry, minimizing the computation costs in the transformation process. The encoder processing frequency is M times faster than the input signal frequency, in order to update each of the accumulator registers, where the transformation coefficients are stored.
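Putting the Row-Index LuT and the Hadamard bit generator together, the on-the-fly encoder can be sketched behaviourally in Python. Indices are zero-based here (j from 0 to N - 1), whereas Eq. (4.16) counts from 1; the LuT contents and the sample values are hypothetical, and the multiplication by ±1 stands in for the add/subtract selection of the hardware.

```python
def hadamard_sign(row, col):
    """+1 or -1 entry of H, generated from the indices instead of read from memory."""
    return -1 if bin(row & col).count("1") & 1 else 1


def encode_window(samples, row_lut):
    """Compute y_k = sum_j H[w(k), j] * x_j, where w(k) = row_lut[k]."""
    acc = [0] * len(row_lut)                 # accumulators reset for each window
    for j, x in enumerate(samples):          # one ADC sample per step
        for k, row in enumerate(row_lut):    # M updates per sample
            acc[k] += hadamard_sign(row, j) * x
    return acc


ROW_LUT = [0, 3, 6]                          # hypothetical learnt row indices, M = 3
X = [2, 7, 1, 8, 2, 8, 1, 8]                 # hypothetical samples, toy window N = 8
Y = encode_window(X, ROW_LUT)
```

Only the M row indices are stored, i.e. M words of $\log_2(N)$ bits each, instead of the $M \times N$ one-bit entries kept in memory by the earlier design.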

The previous Hadamard-based LBCS implementation shown in [92] was designed for a sampling window of 256 samples (N = 256), with a fixed CR of 16×. In this work, we propose a hardware implementation with on-the-fly Hadamard generation, a sampling window length of N=64 and a compression rate of CR=8. The same dataset as in [92] has been used to validate the proposed hardware implementation. The N=64 and CR=8 combination achieves a similar average reconstruction quality, while the LBCS encoder frequency $f_{encoder} = M \times f_s$ is halved, resulting in a lower power consumption. Indeed, since M is defined as N/CR, the larger the number of Hadamard rows M, the higher the core LBCS clock frequency, which might become a limiting factor. On the other hand, a further reduction of the number of samples N would degrade the signal statistics on which the learning approach is based.

Figure 4.15 - Schematic of the LC cross-coupled voltage controlled oscillator [93].

## Wireless data transmitter

Two different wireless transmitters are designed and implemented with different data rates, operating frequency, and transmission distance, in order to cover different applications. The narrowband transmitter which operates in the MedRadio band at 416 MHz is designed for low data rate and indoor communication. The other transmitter is based on impulse-radio ultra-wideband (IR-UWB) in the 3.1-10.6 GHz frequency range and utilized for high data rate and very short distance transmission. The two transmitters provide the flexibility of sending compressed or raw data.

**Narrowband Transmitter** The proposed on-off keying (OOK) modulated narrowband transmitter is based on turning a voltage controlled oscillator (VCO) on and off. The VCO, shown in Fig. 4.15, is composed of NMOS and PMOS cross-coupled pairs, and the data is applied to the bias current for modulation. Reusing the current in the PMOS and NMOS pairs provides a higher transconductance and a higher voltage swing on the inductor. To set the resonance frequency of the VCO, a bank of three capacitors is used for coarse tuning and varactors are used for fine tuning. An off-chip loop antenna is connected to the differential output of the VCO to transmit the signal and to provide the inductance required by the LC tank [95].



Figure 4.16 – Schematic of the IR-UWB transmitter [93].

**Ultra-wideband Transmitter** IR-UWB is a promising technique based on the transmission of short pulses, and it is very efficient for short-range applications that require a high data rate. In 2002, the Federal Communications Commission (FCC) approved UWB operation and limited the maximum effective isotropic radiated power (EIRP) to -41.3 dBm/MHz for the band between 3.1 and 10.6 GHz [96].

In this work, in addition to the narrowband transmitter, we present a high data-rate, energy and area efficient, and low complexity IR-UWB transmitter. Fig. 4.16 shows the schematic block diagram of the IR-UWB transmitter. The core of the transmitter is based on the current starved ring oscillator (RO) which generates output in the range of 3.5-4.5 GHz frequency. The pulse generator (PG) block creates short pulses at the rising edges of the data signal. The output of the RO and PG is mixed with cascode connected transistors. The drain of the transistor driven by the RO is connected to external resonator circuit formed by an inductor and a capacitor. Before the 50  $\Omega$  UWB antenna, a band-pass filter (BPF) centered at 4 GHz is used in order to satisfy the FCC regulation.

#### Wireless power transfer

To design an implantable system, a wireless power transfer (WPT) method is chosen, since batteries increase the total weight and dimensions of the device. Considering the power required by the implant and the power transmission distance, which is in the order of millimeters, an inductive link is selected for power transmission. The losses due to remote powering are a critical concern, since they can cause a temperature elevation that may damage the tissue. Hence, a power-efficient transmission link composed of a 4-coil inductive link, an active half-wave rectifier, and a low drop-out voltage regulator is designed and shown in Fig. 4.17.

Different approaches are used for various applications, but the average power consumption of the implant is usually considered nearly constant among the system parameters. However, in some applications, such as neural monitoring with a variable number of active electrodes, the power consumption of the implant is not always the same. Hence, the power efficiency of the WPT and the dimensions of the implanted coil become the major limitations in designing the coils for remote powering. In the fundamental approach with two coupled coils, there is a direct relation between the power delivered to the load and the efficiency. The variation in the load power requires an additional approach to keep the power transfer efficiency (PTE) at its maximum for different activity rates. A modified version of the inductive link, with 4 coils instead of 2, has been introduced for remote powering at a distance of 2 meters [87], and the structure was adapted for implant powering applications [18]. The results show a significant improvement in the efficiency. The low coupling coefficient and the low quality factor of the coils in the 2-coil link are compensated by the two high-quality-factor coils introduced between them [97]. Moreover, the introduced coils transform different load impedances to the optimal impedance at the input of the inductive link, so that the efficiency does not change significantly with the load power. Therefore, a 4-coil inductive link is implemented to take advantage of its high PTE and its tolerance to variable load power.

Figure 4.17 – Block diagram of the proposed implanted electronics for wireless power transmission [93].

The AC voltage induced by the 4-coil inductive link needs to be rectified to a DC voltage. To achieve a high conversion efficiency, an active half-wave rectifier is selected, at the price of the losses in the comparison and decision blocks in Fig. 4.17. In this study, the half-wave rectifier is designed based on the work published in [98]. A pass transistor with dynamic bulk biasing constitutes the core of the rectifier. To prevent leakage from the capacitor to the input, the n-wells of the two PMOS transistors are dynamically biased. Hence, the transistor conducts current only when the input voltage is higher than the voltage at the terminals of the capacitor. The comparator decides the state of the PMOS pass transistor by comparing the input voltage with the voltage stored on the capacitor. The timing and control block applies the decision of the comparator with an optimum switching time, such that it is fast enough compared to the operating frequency and minimizes the switching power losses. The low drop-out voltage regulator eliminates the ripple at the output of the rectifier and generates a clean supply voltage for the other circuits in the implant. The capacitors at the output of the rectifier and of the regulator are implemented externally.



Figure 4.18 – Layout (on the left) and micrograph (on the right) of the tested chip.



Figure 4.19 – Measurement setup, highlighting the FPGA and PCB link.

# 4.3.3 Measurement results

The chip, fabricated in UMC 180 nm 1P6M MM/RF process technology, has been packaged and bonded to a dedicated PCB. A Xilinx development board, providing a Virtex 5 FPGA [99], is linked to the PCB through rigid headers, as shown in Fig. 4.19. The board is used to configure and program the SoC blocks from a PC station.

# Sampling and data compression

Each block of the SoC has been independently connected to dedicated pads on the chip, in order to validate each design. The analog input of ADC and the DSP digital bit streams are connected to ESD protection circuits, to reduce any possible damage due to electrostatic discharges during the measurements.

As shown in Fig. 4.18-left, the ADC and the two encoder versions (the variable CR on top and the non-variable version on the right side of the ASIC) do not share the power-grids, in order to separate the analog and digital domains. The power-grid has been designed in a very dense manner, with capacitors that surround the SoC blocks, stabilizing the VDD to ground fluctuations. Fig. 4.18-right shows the micrograph of the tested chip.

The 8-bit SAR ADC with a sampling rate of 45 kHz requires an area of $230\,\mu$m × $150\,\mu$m, with a power consumption of 0.46 $\mu$W. The low power requirement of the ADC is mainly dictated by the medium resolution of 8 bits and the low sampling frequency of the neural signals.

Verilog code, implemented with the Xilinx ISE tool, has been developed to program the encoder registers, to provide the 45 kHz clock to the SoC, and to send the input bit stream to the encoders through the FPGA. The compressed data sequences at the output of the DSPs are fed back to the FPGA and analyzed with the Xilinx ChipScope tool. The measurement setup is shown in Fig. 4.19.

The measured compressed bit streams have been plotted by an oscilloscope and are highlighted in Fig. 4.20. Both plots have been generated with the variable CR encoder version, in order to show, on the same plot, the dynamic generation of the transformation coefficients, and the different outputs due to low threshold (on Fig. 4.20 top-left) and high threshold (on Fig. 4.20 top-right) settings. The reconstructed signal versus the original data is plotted for 4 sampling windows, at the bottom of Fig. 4.20.

Figure 4.20 – Measured compressed values with low threshold (on the left) and high threshold (on the right).

Table 4.3 reports the numerical results of the recovered signal for the different compression methods discussed in this work, with fixed compression rates. In particular, this table shows that the LBCS-based signal recovery performs better than Bernoulli [29], Multi-channel [68] or Structured Hadamard Sampling [46]. The comparison of reconstruction performance has been done considering $N=256$ and an ADC resolution $B_i = 10$, for the iEEG dataset described in Appendix A. Furthermore, the LBCS signal recovery requires the linear decoder (3.21), which yields the reconstructions at a fraction of the computational cost of the other methods [92].

Since the actual hardware implementation of this work has been developed with $N=64$ and $B_i = 8$, Table 4.4 summarizes the recovery performance of the variable encoder design for different fixed energy thresholds (the reported CRs are averages over the whole dataset). For this reason, Table 4.4 gives an *energy content based* comparison, while Table 4.3 reports a *CR-based* comparison.

| Method   | CR = 8 | CR = 16 | CR = 32 | CR = 64 |
|----------|--------|---------|---------|---------|
| LBCS     | 33.27  | 28.48   | 23.27   | 18.06   |
| SHS HGL  | 23.89  | 20.26   | 18.53   | 14.49   |
| BERN HGL | 20.49  | 16.87   | 13.53   | 11.15   |
| MCS HGL  | 20.92  | 17.48   | n.a.    | n.a.    |

Table 4.3 – Recovery performance comparison with published work (N = 256,  $B_i = 10$ )

| Method | CR = 8 <sup>a</sup> | CR = 16 <sup>a</sup> | CR = 32 <sup>a</sup> | CR = 64 <sup>a</sup> |
|--------|------|------|------|------|
| LBCS   | 30.4 | 29.5 | 26.1 | 15.7 |

Table 4.4 – Recovery performance summary for this work ($N = 64$, $B_i = 8$)

<sup>a</sup> Average compression rate over the whole dataset.

The learning-based compression algorithm with dynamic generation of the transformation coefficients requires an area of $230\,\mu$m × $230\,\mu$m. A comparable area of $230\,\mu$m × $265\,\mu$m is required for the adaptive DSP design, which consumes only 0.47 $\mu$W at 0.8 V. Table 4.5 reports the hardware comparison with other published works.

# Wireless Power Transfer

The resonance frequency of each LC tank in the 4-coil inductive link is fixed at 8 MHz. A power transfer efficiency of 55% is obtained for the inductive link when the separation between the coils is 10 mm and the load power is 10 mW. The performance of the rectifier and the regulator is also characterized for a 10 mW load, and their efficiencies reach 82% and 78%, respectively. As a result, wireless power transmission from the signal generator to the implant load is achieved at 36% efficiency.
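The end-to-end figure can be cross-checked with simple arithmetic: for stages connected in series, the overall efficiency is the product of the stage efficiencies. The sketch below is only an illustration. It assumes that the inductive link, the rectifier and the regulator are the only loss stages, and the function name `cascade_efficiency` is hypothetical.

```python
from math import prod

def cascade_efficiency(stage_efficiencies):
    """Overall efficiency of power stages connected in series."""
    return prod(stage_efficiencies)

# Rounded stage efficiencies quoted in the text for a 10 mW load.
LINK, RECTIFIER, REGULATOR = 0.55, 0.82, 0.78

total = cascade_efficiency([LINK, RECTIFIER, REGULATOR])
print(f"end-to-end efficiency ~ {total:.1%}")
```

With the rounded stage values the product is about 35%, close to the reported 36%; the small gap is presumably due to rounding of the individual measurements.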

| Parameter                           | [29]         | [68]                        | This work                                          |
|-------------------------------------|--------------|-----------------------------|----------------------------------------------------|
| Compression Method                  | BERN         | MCS                         | LBCS                                               |
| Compression Rate                    | 10           | 16                          | Variable CR from 64 <sup>b</sup> to 8 <sup>b</sup> |
| Technology [µm CMOS]                | 0.09         | 0.18                        | 0.18                                               |
| Compression Power [$\mu$W]          | 1.9 at 0.6 V | 17.83 <sup>a</sup> at 1.2 V | 0.47 at 0.8 V                                      |
| Compression Area [mm<sup>2</sup>]   | 0.090        | 0.090                       | 0.054                                              |

Table 4.5 - Compression hardware comparison with published work

<sup>a</sup> Compression power cost over 16 channels.

<sup>b</sup> Average compression rate over the whole dataset.





Figure 4.21 – Spectrum of the LC cross-coupled voltage controlled oscillator [93].

## Narrowband Transmitter

The VCO is supplied with an internally generated 1.8 V, and the measured average power consumption during operation is 248.4 $\mu$W. Thanks to the discrete and fine tuning capacitors, the VCO covers the two MedRadio bands (401-406 MHz and 413-419 MHz). Fig. 4.21 shows the frequency spectrum of the OOK transmitter at the highest data rate of 2 Mbps. During the spectrum measurement, the distance between the transmitter antenna and the receiver antenna (Taoglas Limited-TI.10.0112), which was directly connected to the spectrum analyzer, was fixed at 60 cm. A custom-made OOK receiver board based on discrete components is used to demodulate the transmitted data.

## Ultra-wideband Transmitter

Figure 4.22 – Transient pulses of the IR-UWB transmitter at 250 Mpps [93].

The proposed IR-UWB transmitter has been fabricated and occupies an area of 60 $\mu$m × 30 $\mu$m. Fig. 4.22 shows the measured output waveform of the implemented IR-UWB transmitter at a 250 MHz pulse repetition rate. The maximum peak-to-peak amplitude of the measured pulse is 111 mV, while its duration is 2.2 ns. Fig. 4.23 depicts the measured power spectral density of the transmitter together with the FCC regulation mask. The triangular envelope of the output waveform suppresses the side-lobes, and the measured spectrum fully meets the FCC mask. When the pulse repetition frequency is 250 Mpps, the complete IR-UWB transmitter consumes 11.3 mW, which corresponds to 45.2 pJ/pulse. The high throughput of the IR-UWB transmitter makes it possible to buffer the raw data and transmit it in several bursts.
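The energy-per-pulse figure follows directly from the average power and the pulse repetition rate. A minimal sketch of that arithmetic, using only the two values quoted in the text (the function name is illustrative):

```python
def energy_per_pulse_j(avg_power_w, pulse_rate_pps):
    """Average energy spent per transmitted pulse, in joules."""
    return avg_power_w / pulse_rate_pps

POWER_W = 11.3e-3        # measured transmitter power
PULSE_RATE_PPS = 250e6   # pulse repetition frequency

energy = energy_per_pulse_j(POWER_W, PULSE_RATE_PPS)
print(f"{energy * 1e12:.1f} pJ/pulse")
```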

# 4.4 Multichannel Adaptive LBCS-Had implementation

Based on the findings described in the previous sections, a multichannel implementation has been designed in the last part of this work. This design is discussed in this section.

# 4.4.1 Multichannel Implantable Architecture

In this design, we have 8 independent neural recording channels, whose compressed outputs are serialized and wirelessly transmitted by the RF blocks.

Each channel of the multichannel SoC features a dedicated 8-bit SAR ADC, which digitizes the input signal. The ADC output bit stream is compressed by the adaptive Hadamard-based LBCS implementation. Then, the compressed data streams of all channels are serialized before being transmitted. As for the single-channel implementation, this chip features an inductive link between the implant and the power delivery unit.

Figure 4.23 – Power spectral density of the IR-UWB transmitter [93].

The dedicated ADC per channel is a design choice that avoids a time-multiplexed implementation. In this design, we accept the higher area requirement of one ADC per channel in exchange for the advantages described below.

A time-multiplexed ADC shared among all the channels would require a higher neural amplifier bandwidth. This is because, at each sampling time, the input signal comes from a different neural electrode, requiring the amplifier to settle to a new level each time the electrode address changes. Furthermore, a multiplexed multichannel implementation suffers from noise aliasing. Indeed, while each channel has a limited bandwidth $f_{BW}$, the multiplexed output of $N$ channels requires the ADC to sample the signal at a frequency $\geq 2Nf_{BW}$. For this reason, each channel is subjected to the thermal noise spread over the whole bandwidth, which then folds into the first Nyquist zone.
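To give a feeling for the numbers involved, the sketch below evaluates the sampling-rate bound $2Nf_{BW}$ and a first-order estimate of the folded noise. Both the value of $f_{BW}$ (taken here as half of the 45 kHz single-channel sampling rate) and the noise model (white thermal noise, amplifier noise bandwidth growing linearly with $N$) are assumptions made for illustration, not measured quantities.

```python
import math

def multiplexed_adc_rate(n_channels, f_bw_hz):
    """Minimum sampling rate of one ADC shared by n_channels: fs >= 2 * N * f_BW."""
    return 2 * n_channels * f_bw_hz

def folded_noise_penalty_db(n_channels):
    """First-order in-band noise power increase caused by aliasing.

    Assumes white thermal noise and an amplifier noise bandwidth that grows
    linearly with the number of multiplexed channels (illustrative assumption).
    """
    return 10 * math.log10(n_channels)

N_CHANNELS = 8       # channels in the multichannel design
F_BW_HZ = 22.5e3     # assumed: half of the 45 kHz single-channel sampling rate

shared_rate = multiplexed_adc_rate(N_CHANNELS, F_BW_HZ)
penalty_db = folded_noise_penalty_db(N_CHANNELS)
print(f"shared ADC rate >= {shared_rate / 1e3:.0f} kHz")
print(f"noise folding penalty ~ {penalty_db:.1f} dB")
```

Under these assumptions a shared ADC would have to run at 360 kHz or more, and each channel would see roughly 9 dB more in-band thermal noise than with a dedicated ADC.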

# 4.4.2 Multichannel Layout

The layout of the 8-channel implantable device is depicted in Fig. 4.24. The 8 channels are placed on the top side of the design. The ADC and DSP blocks of each channel have been placed together to minimize the area. A semi-custom block, placed in the middle of the 8 channels, collects the output of each DSP and serializes the bit streams, which are then transmitted by the RF TX, placed at the bottom of the design together with the power delivery block.

# 4.5 Summary

In this chapter we have proposed different sampling methods, based on the new mathematical foundations described in the previous chapter. We built different prototypes of neural signal acquisition systems that not only rigorously trade off area, energy consumption, and output signal quality, but also significantly outperform the state of the art in all these aspects.

Our learning-based digital encoder scheme leverages the benefits of structured linear sampling and linear recovery to yield state-of-the-art compression performance, maintaining a high signal reconstruction quality up to  $64 \times$  compression, as quantitatively demonstrated on two human iEEG datasets. We designed different digital encoders for neuronal signals where both the system architecture and the circuit design have been developed to reduce the overall implantable chip's power and area requirements.

Figure 4.24 - Layout of the designed multichannel implementation.

Among all the designed prototypes, the best encoding scheme for the implantable application, in terms of area, power and performance, is the adaptive LBCS-based implementation. The proposed encoding solution enables dynamic generation of the transformation coefficients, allowing on-the-fly compression with faster and improved off-line signal recovery compared with Random Bernoulli [29], Multi-channel [68] or Structured Hadamard Sampling [46]. Moreover, a variable compression rate is achieved by an energy-based threshold method. The proposed data compression reduces the amount of data transmitted wirelessly, and thus lowers the power requirements of the TX and of the whole implantable system. This learning-based encoder scheme has been implemented in both the single-channel and the multichannel designs.

In the proposed implementation, the threshold that defines the coefficients to be transmitted is set during the off-line learning process. A further development of the current chip implementation can include an on-chip calibration, which sets the threshold level of the encoder in the implanted device.

# Part II Multi-lane Single-Ended High Speed I/O Receiver

### 5 High speed IOs ecosystem

Advances in CMOS process technologies have led to an exponential increase in the digital processing power of high-performance microprocessor units, and in turn to an increase in the data transfer bandwidth between local chips. CMOS fabrication allows the design of RF circuits for transmitting data from chip to chip over relatively short distances, at reduced cost with respect to other technologies.

These RF circuits are defined by low power requirements, since the heat generated by the chip is only partially dissipated through the chip package. More generally, the overall system power budget is limited by its affordable cooling capacity, as in high-end server applications. Thus, low-power designs allow more circuits to be integrated in the same chip. For this reason, application-specific architectures and innovative techniques are used for low-power implementation.

Every electrical signal travelling in a medium, such as a *Printed Circuit Board* (PCB), suffers from attenuation. This signal loss generally increases with the signal frequency. Fig. 5.1 depicts a classical chip-to-chip link over a printed circuit board, named backplane, highlighting the signalling path. For signal frequencies above 1 GHz, the skin effect and dielectric losses are the main contributions to signal attenuation. Reflections due to connectors and via stubs in the PCB, shown in Fig. 5.1, further deteriorate the transmitted signal. Moreover, electromagnetic coupling between different lines, named *crosstalk* (XTK or xtalk), also affects the signal integrity.

In order to minimize the signal attenuation and to preserve the data integrity, equalization and coding are generally implemented. Consequently, these techniques trade off system complexity against higher power and area requirements.

In this work, we propose a versatile receiver circuit which not only copes with large channel attenuation but also implements novel crosstalk cancellation techniques, to allow single-ended multiple lines transmission.

Figure 5.1 – Chip-to-chip backplane link, SE 4.8 Gb/s [100].

In the remainder of this Chapter, we motivate single-ended signalling. Then, we briefly discuss the channel board characteristics and the noise and interference sources that define the overall signal attenuation in a transceiver link.

#### 5.1 System overview

Over the past years, the data rate required per pin has almost doubled every four years across different I/O standards, as depicted in Fig. 5.2 [101]. However, due to packaging constraints as well as chip size limitations, the number of package pins is increasing only slightly, while the number of transistors served by one I/O approximately doubles with every new CMOS technology node [102].

At the same time, low power consumption is a first-order design constraint for I/O circuits. ITRS assumes that high-performance serial transceivers can consume at most 10% of the chip power and that I/O links should occupy at most 20% of the entire chip area.

In combination with innovative circuit techniques, adopting single-ended signalling doubles the performance (bandwidth per pin) with respect to similar channel boards that use a differential pair per signal, such as *QuickPath Interconnect* (QPI) by Intel [103] and HyperTransport by AMD [104]. The main limitation of single-ended PCB traces is the increase in crosstalk noise caused by electromagnetic coupling at higher wire density. As the data rate increases, crosstalk becomes the most significant noise source in single-ended parallel links.



Figure 5.2 - Pin data rate evolution across most common I/O standards [101].

#### 5.1.1 Channel boards environment

In this subsection, we give an overview of the main limiting factors that define the link performance.

The channel is the communication link from one chip to the other. The on-chip 50 $\Omega$ termination resistor, together with the device capacitance, defines a parasitic low-pass filter that degrades the transmitted signal. This signal has to traverse different traces before being collected by the receiver, as shown in Fig. 5.3. Skin effect and dielectric loss, along with the long backplane traces, increase the line attenuation at higher frequencies. Further attenuation is due to shorter structures, such as vias or connectors to extension boards, that connect different components in the backplane link. These structures can introduce large impedance mismatches and cause signal reflections that affect the signal integrity.

#### Noise and interferences

The number of bits that can be transmitted across a channel is limited by the signal-to-noise-and-interference ratio at the receiver. The larger this ratio, the more distinguishable levels one can transmit in each symbol, increasing the effective bit rate. Unfortunately, at high symbol rates the interference levels are often quite high. In backplane systems, interference occurs between symbols that travel on the same wire, due to the limited bandwidth of the wire, and also between different wires, due to electromagnetic coupling of signals travelling in a densely spaced channel bundle, as depicted in Fig. 5.4. These two kinds of interference are described in the following paragraphs.



Figure 5.3 – Chip-to-chip block diagram (top), depicting the transmitted signal before and after the attenuation due to the channel link. Section of a typical backplane system, highlighting the signalling paths [100] (bottom).



Figure 5.4 – ISI and crosstalk highlight in a multilane high speed I/O link [105].

**Inter-Symbol Interference (ISI)** Dispersion and reflections of the main signal in the channel link define the overall *Inter-Symbol Interference* (ISI). These two phenomena are based on different mechanisms.

• Dispersion: for frequencies above 1 GHz, the *skin-effect* and the *dielectric loss* are the main contributions to the losses of a transmission line.

Indeed, at high frequency, the current flow is distributed in the wire with a higher density near the conductor surface, reducing the effective cross-section of the wire. This phenomenon is named skin-effect.

The dielectric loss is attributed to the energy lost in the dielectric surrounding the transmission line. This attenuation strongly depends on the insulator material and depends linearly on frequency. For this reason, the dielectric loss dominates over the skin-effect at very high frequency [100].

Figure 5.5 – Highlight of the pulse response and its derived crosstalk pulse response in a 2 lanes single ended I/O link, reprinted from [105].

• Reflection: a signal that travels in the transmission line suffers from reflections at several distinct points. This effect is due to the impedance discontinuities caused by the interconnections in the backplane link and the frequency dependent impedance discontinuity due to parasitic device capacitance at both the transmitter and receiver and the via stubs.
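The dispersion item above can be made concrete with a simple loss model. The sketch below assumes the common first-order behaviour in which the skin-effect loss (in dB) grows with the square root of frequency while the dielectric loss grows linearly, as stated in the text. The coefficients `K_SKIN` and `K_DIEL` are hypothetical and chosen only to show where the two terms cross.

```python
import math

def line_loss_db(f_hz, k_skin, k_diel):
    """Return (skin-effect loss, dielectric loss) in dB at frequency f_hz."""
    return k_skin * math.sqrt(f_hz), k_diel * f_hz

def crossover_frequency_hz(k_skin, k_diel):
    """Frequency above which the dielectric term exceeds the skin-effect term."""
    return (k_skin / k_diel) ** 2

K_SKIN = 2e-4   # dB / sqrt(Hz), hypothetical coefficient
K_DIEL = 2e-9   # dB / Hz, hypothetical coefficient

f_cross = crossover_frequency_hz(K_SKIN, K_DIEL)
for f in (1e9, f_cross, 4 * f_cross):
    skin, diel = line_loss_db(f, K_SKIN, K_DIEL)
    print(f"{f / 1e9:6.1f} GHz: skin {skin:6.2f} dB, dielectric {diel:6.2f} dB")
```

Below the crossover frequency the square-root term is the larger one; above it the linear dielectric term dominates, which is the behaviour described for very high frequencies.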

**Inter-Channel Interference (Crosstalk)** The *Inter-Channel Interference*, known as crosstalk, is caused by the electromagnetic coupling of signals travelling in a parallel channel link. Crosstalk occurs at points with dense wiring and can be divided into far-end (FEXT) and near-end (NEXT) crosstalk. *Near-End Crosstalk* (NEXT) does not affect the signal integrity in unidirectional links [106], while *Far-End Crosstalk* (FEXT) heavily affects single-ended PCB traces.

This section introduces the far-end crosstalk (FEXT) channel model for both single-ended and differential I/Os [105]. As previously mentioned, if an active signal is transmitted on one of the lines of the channel bundle, the far end of the neighbouring channel collects the coupled FEXT signal, as illustrated in Fig. 5.5. If the adjacent channel is transmitting another independent signal in the same direction, it receives both its own original signal and the FEXT coupled from the adjacent channel. Since these two signals are uncorrelated, both the horizontal and the vertical eye opening of the original signal are degraded.

In a homogeneous channel, like a strip-line, the inductive and capacitive couplings are well balanced and the FEXT becomes negligible [107], but in an inhomogeneous channel, like a micro-strip line, significant crosstalk energy couples through the asymmetrical field [108]. As PCBs are required to host more and more channels in a limited board area for higher data throughput, the physical spacing between channels is reduced and crosstalk is rapidly becoming the dominant factor affecting signal integrity. Long channel lengths and reduced channel spacing lead to a higher coupling coefficient, so more crosstalk is transferred onto the adjacent channel. Fig. 5.5 presents the physical parameters used to formulate the crosstalk model in single-ended I/Os: $L$ is the channel length, $W$ is the channel width and $d$ is the center-line distance between channels. When the aggressor signal $V_{in}(\omega)$ is transmitted on closely spaced channels, the FEXT signal $V_{FEXT}(\omega)$ appears at the adjacent channel output as

$$V_{FEXT}(\omega) = -j\omega\tau H(\omega)V_{in}(\omega) = -j\omega\left(\frac{u}{d^k}\right)H(\omega)V_{in}(\omega), \tag{5.1}$$

where $H(\omega)$ is the channel transfer function, $\tau = u/d^k$ is the forward coupling strength and $u$ is a function of channel length and channel height [105].

As discussed in [109] and [107], the crosstalk energy diminishes approximately by a factor of $d^k$, where, for single-ended I/Os, the nominal value of $k$ is between 1 and 2 depending on channel conditions. Interestingly, as derived in equation (5.1), in most low-impedance micro-strip lines used in portable electronics the inductive coupling component is dominant, and the FEXT pulse response is approximately the negative derivative of the channel pulse response. This model has been validated by means of measurements on physical PCB channels fabricated at IBM Zurich Research Laboratory.
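Two consequences of the coupling model can be read off with a few lines of arithmetic: how the coupling strength $u/d^k$ scales with lane spacing, and how large the FEXT is relative to the through signal, which the model gives as $\omega$ times the coupling strength. The sketch below uses normalized spacing and a hypothetical coupling strength of 10 ps; neither value comes from the measured boards.

```python
import math

def coupling_strength(u, d, k):
    """Forward coupling strength u / d**k of the FEXT model."""
    return u / d ** k

def fext_to_through_ratio(f_hz, tau_s):
    """|V_FEXT| / |H * V_in| = omega * tau at frequency f_hz."""
    return 2 * math.pi * f_hz * tau_s

# Effect of doubling the lane spacing d at both ends of the quoted range of k.
for k in (1, 2):
    ratio = coupling_strength(1.0, 2.0, k) / coupling_strength(1.0, 1.0, k)
    print(f"k = {k}: coupling scales by {ratio:.2f} when d doubles")

TAU_S = 10e-12   # hypothetical coupling strength, 10 ps
print(f"omega * tau at 3.5 GHz: {fext_to_through_ratio(3.5e9, TAU_S):.2f}")
```

Because the ratio grows linearly with frequency, the same board geometry produces relatively more FEXT as the data rate increases.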

#### 5.2 Crosstalk cancellation state-of-art

Conventionally, FEXT is handled with board-level techniques, for instance by increasing the distance between channels or by adding shielding [110, 111]. However, these techniques require additional PCB space and are rarely implemented in high-density, high-speed links.

On the circuit side, there is a lack of crosstalk cancellation schemes that simultaneously handle a multichannel board. Most previous work focuses on crosstalk compensation circuits for memory channels. Crosstalk-induced timing distortion is reduced by adjusting the timing delay of data transitions according to the state of the data [112]. However, the challenge is in knowing the correct timing compensation, which also depends on process variation [113]. Crosstalk in the memory interface has also been addressed by Bae et al. [114], whose scheme limits the maximum number of transitioning lanes but does not compensate for the distorted signals. Other approaches to compensate for crosstalk noise include the use of staggered I/Os combined with a glitch suppression scheme to improve the vertical eye opening, or a slew-rate-controlled driver at the transmitter. Sham et al. proposed to cancel the FEXT injected by neighbouring aggressor lanes by using a *Finite Impulse Response* (FIR) filter at the transmitter [115]. Nazari et al. [116] used a switched-capacitor technique that linearly combines two analog signals to reduce crosstalk, where the amount of FEXT is controlled by attenuating a passive filter output.

## 6 System level analysis for high speed RX

In this work, we propose a versatile *receiver* (RX) circuit that copes with large insertion loss and minimizes crosstalk among multiple channels with identical lane spacing and a significant channel attenuation of 28 dB and 30 dB at the Nyquist frequency. This study has been published in [117] and extends our previous work [118] by considering different crosstalk reduction methodologies tailored to the channel board characteristics (Section 6.2).

In high-loss single-ended communication links, the main signal path needs to be equalized to cope with *Inter-Symbol Interference* (ISI). Moreover, the precursor, cursor and postcursor FEXT components need to be minimized to improve the signal integrity.

This Chapter is organized as follows. We first discuss the crosstalk cancellation strategy adopted in the RX macro (Section 6.1), then we present the characteristics of the channel boards used to test the chip (Section 6.2). The mathematical formulation (Section 6.3) discusses the crosstalk reduction in ideally coupled lanes, highlighting its limitations. Section 6.4 shows the system-level simulations of the Continuous Time and Decision Feedback based crosstalk canceller blocks, highlighting how their co-existence has to be tailored to the channel board characteristics. The last part of this Chapter (Section 6.5) analyses the crosstalk cancellation techniques proposed in this work in the case of skewed board lanes.

#### 6.1 Crosstalk cancellation considerations

Analog filters can be used at the RX side to remove crosstalk components. The compensation scheme relies on the fact that, in ideally coupled lanes, FEXT is proportional to the derivative of the crosstalk source signal. A differentiation (easily implemented with analog filters), with appropriate gain $\beta$, can then reproduce the FEXT and subtract it from the forward signal component to effectively remove far-end crosstalk [119]. It is possible to replace the RX analog filters with a *Feed-Forward Equalizer* (FFE) at the transmitter side. However, such an architecture is unable to prevent jitter amplification in the transmitted signal and imposes stringent linearity specifications on the output drivers.
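The cancellation principle can be illustrated in the time domain. The sketch below is an idealized model, not the implemented circuit: it assumes the derivative FEXT model holds exactly, uses Gaussian pulses as stand-ins for the received waveforms, replaces the analog differentiator with a numerical derivative, and ignores second-order coupling. The coupling strength `TAU` and all helper names are hypothetical.

```python
import math

def gaussian(t, t0, sigma):
    return math.exp(-0.5 * ((t - t0) / sigma) ** 2)

def derivative(x, dt):
    """Numerical time derivative (central differences, one-sided at the ends)."""
    n = len(x)
    out = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        out.append((x[hi] - x[lo]) / ((hi - lo) * dt))
    return out

DT = 1e-12      # time step, 1 ps
TAU = 10e-12    # coupling strength, hypothetical
t = [i * DT for i in range(1000)]

victim_clean = [gaussian(x, 300e-12, 60e-12) for x in t]   # victim's own pulse
aggressor_rx = [gaussian(x, 500e-12, 60e-12) for x in t]   # aggressor at its RX

# Derivative model: FEXT on the victim is -TAU * d/dt of the aggressor waveform.
fext = [-TAU * d for d in derivative(aggressor_rx, DT)]
victim_rx = [v + f for v, f in zip(victim_clean, fext)]

def canceller(victim, aggressor, beta, dt):
    """Add beta * d/dt(aggressor) to the victim, i.e. subtract the emulated FEXT."""
    return [v + beta * d for v, d in zip(victim, derivative(aggressor, dt))]

def worst_error(beta):
    cleaned = canceller(victim_rx, aggressor_rx, beta, DT)
    return max(abs(c - v) for c, v in zip(cleaned, victim_clean))

print(f"no cancellation : {worst_error(0.0):.4f}")
print(f"beta = tau      : {worst_error(TAU):.2e}")
print(f"beta = 0.8 tau  : {worst_error(0.8 * TAU):.4f}")
```

In this idealized setting the error vanishes when the gain matches the coupling strength, and a gain mismatch leaves a proportional residue, which is why the gain has to be calibrated against the actual board.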



Figure 6.1 – Crosstalk cancellation using CTXC front-end on 3 lanes channel.

For these reasons, in the proposed I/O link, the data received from adjacent lanes are processed in the analog domain by means of a *Continuous Time Crosstalk Canceller* (CTXC)<sup>1</sup> to generate precursor and cursor FEXT cancellation signals [105], [118], [120]. The CTXC concept for a 3-lane channel is shown in Fig. 6.1, where a passive differentiator block is used to emulate the FEXT signal.

In the presence of an 8-lane single-ended bus, extending the scheme introduced for the 3-lane system would mean differentiating the received signals from the 7 aggressors (the crosstalk sources) and adding them to the forward signal lane with appropriate gain. However, processing the received signals from all aggressor lanes to remove FEXT in a given victim lane (the crosstalk recipient) is not a practical solution, since it would increase the capacitance at the summation node, limiting the bandwidth of the overall system.

For this reason, our CTXC implementation receives the signals from the two adjacent channels only, which have the greatest impact on the signal integrity of the victim lane. By doing so, the large FEXT cursor and precursor components can be reduced.

The residual crosstalk noise is then treated and effectively removed by means of a decision-feedback based block, cross-connected between each lane of the channel board [118], [120]. The analog correction of this *Decision Feedback Crosstalk Canceller* (DFXC)<sup>2</sup> is based on the switched-capacitor approach proposed in [121]. As a result, the transmitted signal over each lane is corrected by the CTLE and DFE for the ISI distortion, while the CTXC and 7 DFXCs minimize the crosstalk noise.

<sup>1</sup> Named XCTLE in reference [118].

<sup>2</sup> Named XDFE in reference [118].

| Name | Extensions & Stubs | Length | Geometry       |
|------|--------------------|--------|----------------|
| Ch1  | none               | 995 mm | s=1.5×w=142 μm |
| Ch2  | 2 via stubs        | 720 mm | s=1.5×w=142 μm |

Table 6.1 - Crosstalk boards key parameters

The CTXC and DFXC systems complement each other and perform the crosstalk cancellation entirely on the RX side, so that the RX circuit can be paired with transmitters sourced from different vendors.

#### 6.2 Boards characteristics

In this work, two different boards have been used to emulate multi-lane single-ended legacy channels for server applications. For consistency we consider two complementary cases, labelled *Ch1* and *Ch2*, which are defined as follows.

#### 6.2.1 Ch1 board

The crosstalk board Ch1 consists of a Rogers-PCB mother card, which hosts eight clean channels (no notches in the frequency response) thanks to the absence of extension boards, vias and connectors. The signal travels for 995 mm on the mother card, in a lane defined by its trace width $w=95\,\mu$m and lane-to-lane spacing $s=142\,\mu$m. Ch1 is an example of a channel with large attenuation, -30 dB at the Nyquist frequency, and a significant FEXT contribution. Fig. 6.2 (a) displays the S-parameters (insertion loss and FEXT from all switching lanes) with respect to lane 3, in each channel bundle.

#### 6.2.2 Ch2 board

Channel Ch2 consists of a 720 mm Rogers-PCB mother card with Rogers-PCB extension boards mounted on top with two Erni MicroSpeed connectors. The signal travels for 100 mm on the mother card, then goes to the first extension board, travels back to the mother card for 100 mm, travels through the second extension board and finally arrives at the RX. In this channel $s=1.5 \times w=142\,\mu$m. The FEXT of board Ch2 does not follow the ideal derivative model, due to the presence of connectors and via arrays in the signal path. Compared with Ch1, Ch2 is a more severe channel, with a significant attenuation of around 28 dB and a stronger FEXT contribution. Fig. 6.2 (b) displays the S-parameters for lane 3 in each channel bundle.





Figure 6.2 – Forward and FEXT frequency responses (magnitude) for the Ch1 (a) and Ch2 (b) PCB board.

#### 6.3 Mathematical formulation for ideally coupled lanes

This section provides an overview of the crosstalk contribution within *N* ideally coupled lanes (e.g., Ch1 board), where the FEXT follows the derivative model [119].

The frequency domain representation of the received vector signal  $\mathbf{Z} \in \mathbb{R}^N$  is given by  $\mathbf{Z} = \mathbf{G} \cdot \mathbf{H} \cdot \mathbf{X}$  [105], where  $\mathbf{G} \in \mathbb{R}^{N \times N}$  is given by the CTXC contribution,  $\mathbf{H} \in \mathbb{R}^{N \times N}$  is the frequency channel response matrix and  $\mathbf{X} \in \mathbb{R}^N$  is the input signal vector.

As addressed in Oh et al. [105], the setting that ensures zero crosstalk contribution from the nearest neighbour is $G_x = \beta G_0$, where $G_x$ and $G_0$ define the CTXC analog gain for the crosstalk cancellation component and the forward received strength, respectively. Under this scenario, the product of the CTXC matrix $\mathbf{G}$ and the channel response matrix $\mathbf{H}$ shows the additional reused crosstalk energy $(2\omega^2\beta^2)G_0H$. Nonetheless, it also shows an error contribution at each lane from the $2^{nd}$ neighbours, revealing additional uncompensated noise terms $\omega^2\beta^2G_0H$. In [105], it is proposed to solve this issue by pairing up every two lanes and maintaining sufficient distance between the bundles, thereby trading board area for a smaller residual error term. However, this reduces the PCB area efficiency and may not even be possible in applications where dense PCB routing is required. It should be noted that, with such an analog front-end, the residual error terms can never be forced to zero.
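The terms quoted above can be reproduced numerically with a small matrix product. The sketch below is an illustration under explicit assumptions: only nearest-neighbour coupling is modelled, the channel matrix has the forward response $H$ on the diagonal and the derivative FEXT term $-j\omega\beta H$ next to it, and the canceller matrix has $G_0$ on the diagonal and an ideal differentiator $j\omega G_x$ next to it. The matrix structure, the normalized values and the helper names are not taken from [105].

```python
def tridiagonal(n, diag, off):
    """n x n matrix with `diag` on the main diagonal and `off` on the two adjacent ones."""
    return [[diag if i == j else off if abs(i - j) == 1 else 0j
             for j in range(n)] for i in range(n)]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Normalized, hypothetical values evaluated at a single frequency.
W, BETA, G0, H0 = 2.0, 0.1, 1.0, 1.0 + 0j
GX = BETA * G0                                  # zero-forcing setting G_x = beta * G_0

H = tridiagonal(5, H0, -1j * W * BETA * H0)     # channel: forward + nearest-neighbour FEXT
G = tridiagonal(5, G0, 1j * W * GX)             # canceller: forward + differentiated neighbours
T = matmul(G, H)

victim = T[2]                                   # middle lane of the 5-lane bundle
print("forward term      :", victim[2])         # G0*H*(1 + 2*w^2*beta^2)
print("1st neighbour term:", victim[1])         # cancelled
print("2nd neighbour term:", victim[0])         # residual w^2*beta^2*G0*H
```

With these values the first-neighbour entry is zero, the forward entry is increased by $2\omega^2\beta^2 G_0 H$, and the second-neighbour entry equals $\omega^2\beta^2 G_0 H$, matching the terms discussed in the text.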

In this work, instead of zero forcing the FEXT from adjacent lanes (setting  $G_x = \beta G_0$ ) and trying to minimize the error term by increased board spacing, we optimize the gain settings  $G_x$  and  $G_0$  in the CTXC with the goal of maximizing the vertical and horizontal eye opening. Paragraph 6.4.1 discusses this crosstalk reduction technique, applied for channel Ch1, where CTXC is sufficient to open the eye diagram in each lane.

However, if the board presents connectors and via arrays in the signal path (e.g., Ch2 board), the crosstalk patterns are more intricate and do not follow the ideally coupled lanes model. In that case, the CTXC alone is not sufficient to ensure operation at BER= $10^{-12}$ , and the DFXC is needed to reduce the FEXT postcursors. This crosstalk cancellation technique is addressed in Paragraph 6.4.2.

#### 6.4 System level simulations

A system level analysis is performed to investigate the optimized crosstalk cancellation strategies for the channels described in section 6.2.

Fig. 6.3 highlights one of the 8 single-ended lanes within the channel bundle. The frequency domain data (forward and FEXT responses) of the 8-lane channel bundle have been collected in a 16-port S-parameter file, which models the interconnect. On the RX side, each lane features a CTXC followed by a CTLE, an 8-tap DFE and a 7×8-tap DFXC. In the 8-lane topology, it is assumed that the data patterns of different lanes are uncorrelated.



Figure 6.3 – Single-lane transceiver block diagram with crosstalk compensation scheme combining CTXC on the front-end.

#### 6.4.1 Ch1 board crosstalk reduction

In this section, the crosstalk reduction technique applied for Ch1 board (described in Section 6.2.1) is addressed.

The calibration is performed over *R*, *C*,  $G_x$,  $G_0$  and CTLE settings. Due to the large channel attenuation, both CTLE and DFE are required to effectively remove ISI. The simulation results are presented in Fig. 6.4. All simulations include a 5 ps TX random uncorrelated jitter (roughly 3.5% UI for 7 Gb/s data rate). The data eye is completely closed when all aggressors are transmitting. The CTXC front-end is able to open the closed data eye, reaching 35% UI eye width and 75 mV eye height. Fig. 6.5 shows the FEXT pulse response between the aggressor and victim at 7 Gb/s, before and after the crosstalk compensation scheme (with optimal filter settings calibration). This analysis highlights that the CTXC makes the system less sensitive to jitter noise, since it flattens the derivative of the crosstalk pulse response.
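As a quick check of the quoted jitter figure, the unit interval at 7 Gb/s and the corresponding jitter fraction are:

$$UI = \frac{1}{7\ \mathrm{Gb/s}} \approx 142.9\ \mathrm{ps}, \qquad \frac{5\ \mathrm{ps}}{142.9\ \mathrm{ps}} \approx 3.5\%\ UI.$$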

#### 6.4.2 Ch2 board crosstalk reduction

This section discusses the crosstalk reduction technique adopted for channel board Ch2 (defined in Section 6.2.2). The limits of the CTXC with this particular board are illustrated in Fig. 6.6, which shows a completely closed eye at 7 Gb/s with only the two nearest aggressor lanes switching. Thus, a different crosstalk minimization technique is required for this type of channel board. First, the CTXC-CTLE strength calibration targets the reduction of the precursor and cursor crosstalk contributions. Then, the crosstalk terms are determined from detected



Figure 6.4 – Simulated RX data eye for Ch1 board, with all aggressors switched (a) off and on (b) without crosstalk compensation scheme (CTLE and DFE on, in both cases). (c) Data eye and (d) bathtub plot with optimally calibrated CTXC front-end. All aggressors are transmitting.



Figure 6.5 – FEXT pulse response from the aggressor to victim lane before and after CTXC.



Figure 6.6 – Simulated RX data eye for Ch2 board, with all aggressors switched off (a) and on (b) without crosstalk compensation (CTLE and DFE on, in both cases). (c) Data eye and (d) bathtub plot with optimally calibrated CTXC front-end with the two nearest aggressors transmitting.

bits in the aggressor lanes and the derived voltage is subtracted from the victim signal by means of the DFXC filters. A statistical analysis has been performed in MATLAB to validate the CTXC-DFXC effect. A *Probability Distribution Function* (PDF) of the ISI and crosstalk pulse-response spanned over all postcursor taps has been developed, where the values have been found by convolving the ISI and crosstalk terms. The crosstalk PDF makes it possible to analyse all the possible combinations of postcursor ISI and FEXT taps. Such analysis is valid under the assumption that the data on the victim and on the aggressors are white and uncorrelated.

To perform such analysis, the pulse-response on the victim lane (only the victim lane TX is transmitting, while all the aggressors are silent) is combined with the crosstalk pulse-responses (only the aggressor transmits a pulse, while the victim TX is silent) of the aggressor lanes. During this phase, the CTLE is activated, while DFE is off. Given a sampling window



Figure 6.7 – Highlight of the vertical eye aperture (a) and *signal*, crosstalk and ISI (b) evolution for different CTLE peaking settings.

 $\mathcal{D} = \{x_1, \dots, x_m\}$  of m=32 sampling points in one UI, the optimal  $h_0$  has been found by choosing the index of the sampling point in  $\mathcal{D}$  that maximizes the eye aperture  $V_{eye}$ , which is defined as:

$$V_{eye} = Signal - CDF_{NIX}^{-1} - Sensitivity,$$
(6.1)

where  $Signal = h_0 - |h_{-1}|$  is the signal amplitude, given by the cursor  $h_0$  minus the absolute value of the first precursor  $h_{-1}$  (which has the same polarity as the cursor value in these types of channels), and  $CDF_{NIX}^{-1}$  is the inverse of the *Cumulative Distribution Function* of the *Noise, ISI and crosstalk* (NIX) at BER=10<sup>-12</sup>. The CDF is computed by convolving the ISI, crosstalk and noise distributions given by the PDF analysis. The input referred noise includes CTLE, comparator and jitter noise. The term *Sensitivity* = 5 mV is the minimum comparator voltage sensitivity.

The precise PDF approach for analyzing ISI and crosstalk has been necessary, since a simpler RMS summation of the ISI and crosstalk components was found to give overly pessimistic results, owing to the non-Gaussian nature of the ISI and crosstalk distributions.
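The difference between the two approaches can be illustrated with a minimal sketch. It assumes white, uncorrelated binary data, so that each postcursor tap contributes either plus or minus its magnitude with equal probability. The tap values are hypothetical, and the Gaussian estimate simply scales the RMS sum by the normal quantile at the target BER.

```python
from statistics import NormalDist

def tap_pdf(tap_mv):
    """PDF of one tap's contribution: +tap or -tap, each with probability 1/2."""
    pdf = {}
    for value in (tap_mv, -tap_mv):
        pdf[value] = pdf.get(value, 0.0) + 0.5
    return pdf

def convolve(pdf_a, pdf_b):
    """PDF of the sum of two independent discrete random variables."""
    out = {}
    for va, pa in pdf_a.items():
        for vb, pb in pdf_b.items():
            out[va + vb] = out.get(va + vb, 0.0) + pa * pb
    return out

def amplitude_at_ber(pdf, ber):
    """Smallest amplitude that is exceeded with probability at most `ber`."""
    tail = 0.0
    for value in sorted(pdf, reverse=True):
        tail += pdf[value]
        if tail > ber:
            return value
    return min(pdf)

isi_taps_mv = [10, 5]   # hypothetical residual ISI postcursors
xtk_taps_mv = [3, 2]    # hypothetical FEXT postcursors
all_taps_mv = isi_taps_mv + xtk_taps_mv

total_pdf = {0: 1.0}
for tap in all_taps_mv:
    total_pdf = convolve(total_pdf, tap_pdf(tap))

ber = 1e-12
pdf_bound_mv = amplitude_at_ber(total_pdf, ber)

# Gaussian (RMS) estimate: root-sum-square of the taps times the normal quantile.
sigma_mv = sum(t * t for t in all_taps_mv) ** 0.5
rms_bound_mv = -NormalDist().inv_cdf(ber) * sigma_mv
```

With these hypothetical taps the PDF-based amplitude is bounded by the sum of the tap magnitudes (20 mV), whereas the Gaussian estimate is several times larger, because the true distribution has no tails beyond that bound.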

Fig. 6.7 (a) shows the vertical eye opening versus CTLE peaking settings, including 8-tap DFE equalization co-optimized with CTXC and 56-tap DFXC at BER=10<sup>-12</sup>. The trend reveals that high peaking settings provide the maximum vertical eye opening, even in the presence of crosstalk. This counterintuitive trend is explained by Fig. 6.7 (b), where the *signal*, ISI and crosstalk contributions (derived from the PDF analysis) are plotted independently. Even though the CTLE peaking is generated by lowering the DC gain, larger CTLE peaking settings increase the *signal*. This is because lower peaking results in more ISI, which requires the CTLE output to be scaled to meet the linearity requirements of the DFE equalization. Moreover, crosstalk components increase with CTLE peaking, since the CTLE tends to amplify crosstalk. Overall, when increasing the CTLE

| Board Ch2                                   |      |     |     |    |
|---------------------------------------------|------|-----|-----|----|
| $Signal = h_0 - \lvert h_{-1} \rvert$ [mV]  | 138  |     |     |    |
| $\Delta_{ISI}$ [mV]                         | 26   |     |     |    |
| $\sigma_{Noise}$ [mV]                       | 4    |     |     |    |
| Sensitivity [mV]                            | 5    |     |     |    |
| CTXC                                        | OFF  | ON  | OFF | ON |
| DFXC                                        | OFF  | OFF | ON  | ON |
| $\Delta_{XTK}$ [mV]                         | 200  | 101 | 115 | 74 |
| $\Delta_{ISI+XTK}$ [mV]                     | 211  | 114 | 126 | 88 |
| $\Delta_{ISI+XTK+Noise}$ [mV]               | 214  | 121 | 131 | 95 |
| $V_{eye}$ [mV]                              | -105 | -7  | 2   | 38 |

Table 6.2 – Crosstalk cancellation performance

peaking, the *signal* grows faster than the crosstalk does and the ISI is reduced; hence, the net eye opening is larger with high peaking settings.

The crosstalk PDF obtained with the statistical analysis is reported in Fig. 6.8 for four different scenarios, with the maximum CTLE peaking setting. In particular, Fig. 6.8 (a) shows the crosstalk PDF with no FEXT cancellation, while in Fig. 6.8 (b) only the CTXC is activated, reducing the crosstalk noise amplitude at BER= $10^{-12}$  from 200 mV to 113 mV. The PDF for the DFXC-only case is reported in Fig. 6.8 (c). When CTXC is combined with DFXC, as shown in Fig. 6.8 (d), the crosstalk error term  $\Delta_{XTK}$  at BER= $10^{-12}$  is equal to 73.5 mV, showing a significant improvement in vertical eye opening. The results from the PDF analysis are reported in Table 6.2, which highlights the crosstalk cancellation strength for all the CTXC-DFXC combinations. Without crosstalk reduction, the FEXT contribution exceeds the cursor  $h_0$  amplitude, resulting in a closed eye. The vertical eye aperture is improved once both the CTXC and the DFXC crosstalk canceller blocks are optimally calibrated, resulting in 38.7 mV vertical eye opening.
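As a worked instance of (6.1), the Table 6.2 entries for the two configurations with the DFXC enabled can be inserted directly, taking  $\Delta_{ISI+XTK+Noise}$  as the value of  $CDF_{NIX}^{-1}$  (an interpretation of the table, not something stated explicitly in the text). The results agree with the tabulated  $V_{eye}$  within rounding:

$$V_{eye}^{CTXC+DFXC} = 138 - 95 - 5 = 38\ \mathrm{mV}, \qquad V_{eye}^{DFXC} = 138 - 131 - 5 = 2\ \mathrm{mV}.$$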

#### 6.5 Crosstalk cancellation over skewed lanes

Differences in lane length, due to manufacturing tolerances, can cause skew in the NRZ signals travelling through the channel bundle, both on the transmitter and on the receiver side. Fig. 6.9 (a) shows the skewed impulse responses for all the 8 single-ended lanes of channel board Ch2, when the signals are launched at the same time at the TX. This issue can be solved by forcing delay adjustments on the transmitter side, resulting in aligned impulse responses, as depicted in Fig. 6.9 (b).

However, the skew adjustment on the TX for each forward path does not solve the skew of the crosstalk pulse responses on the RX side. Fig. 6.10 (a) shows two signals travelling through a multi-lane board with identical channel lengths. The crosstalk coupling from the aggressor to the victim lane is then perfectly corrected by the CTXC signal, given by  $-\beta d/dt$  of the aggressor

pulse response, at the time  $t_0$ .

In the case of different channel lengths, the crosstalk coupling signal arrives at the RX terminal at a different instant with respect to its correction signal, because of the channel skew. If the aggressor lane is shorter than the victim lane, as highlighted in Fig. 6.10 (b), the crosstalk coupling arrives at time  $t_0 - t_{skew}$ , while the FEXT cancellation signal, generated in the CTXC, is ready at time  $t_0$ . This produces a residual crosstalk signal, which might be partially reduced by the DFXC. For this reason, the CTXC of each lane has to be adapted accordingly.
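The effect of the skew on the residual can be sketched with a toy model. The code below assumes a Gaussian aggressor pulse (a hypothetical shape chosen only for illustration), a FEXT given by $\beta\, d/dt$ of that pulse arriving `skew_ps` early, and a CTXC correction of $-\beta\, d/dt$ aligned at $t_0 = 0$. The values of `sigma_ps` and `beta_ps` are hypothetical.

```python
import math

def pulse_derivative(t_ps, sigma_ps=60.0):
    """Time derivative of a unit-amplitude Gaussian pulse (hypothetical pulse shape)."""
    return -t_ps / sigma_ps**2 * math.exp(-t_ps**2 / (2 * sigma_ps**2))

def peak_residual(skew_ps, beta_ps=20.0, span_ps=400, step_ps=1):
    """Peak of |FEXT + CTXC correction| when the FEXT arrives `skew_ps` early."""
    peak = 0.0
    for t in range(-span_ps, span_ps + 1, step_ps):
        fext = beta_ps * pulse_derivative(t + skew_ps)   # coupled signal, early by skew
        correction = -beta_ps * pulse_derivative(t)      # CTXC output, aligned at t0
        peak = max(peak, abs(fext + correction))
    return peak

residual_no_skew = peak_residual(0)
residual_10ps = peak_residual(10)
residual_20ps = peak_residual(20)
```

In this toy model the cancellation is exact without skew and the peak residual grows with the skew, which is the qualitative behaviour sketched in Fig. 6.10.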

An analysis is performed to verify how the DFXC system interacts with the RX system sensitivity. Fig. 6.11 shows the vertical eye aperture for different lane skews at the RX and for different numbers of activated DFXC taps, over a single lane of the Ch2 board. The DFXC contribution to the vertical eye aperture is already evident when moving from no crosstalk reduction (i.e., n=0 curve) to only tap-1 of the DFXC activated (n=1 curve). Interestingly, the DFXC also reduces the sensitivity to skew. For instance, considering in Fig. 6.11 the curve with the first 4 taps activated (i.e., n=4 curve), the vertical eye opening is flatter than the one without DFXC contribution. Moreover, the useful DFXC contribution is limited to the first 2 to 4 taps, since only marginal crosstalk reduction is obtained with more taps activated.





Figure 6.8 – Probability distribution function of the crosstalk pulse-response spanned over all postcursor taps without crosstalk cancellation (a), with only CTXC on (b), with CTXC off and DFXC on (c) and with both CTXC-DFXC activated (d).



Figure 6.9 – Skewed (a) and un-skewed (b) impulse responses at the TX side.



Figure 6.10 – Qualitative highlight of CTXC effects for un-skewed (a) and skewed (b) board lanes.




Figure 6.11 – Vertical eye aperture versus different lane skews at the RX side, with different number n of taps activated on the DFXC. The simulations are performed with Ch2 channel board.

# 7 High speed receiver hardware implementation and validation

In this Chapter we describe the RX architecture and circuit details. Furthermore, we provide the electrical characterization, which is aligned with the system-level analysis discussed in Chapter 6 and demonstrates the ISI equalization and crosstalk reduction strength of the overall RX circuit.

This Chapter is organized as follows. The receiver macro architecture and its functional units are described in Section 7.1. Section 7.2 gives the electrical measurements and Section 7.3 discusses the results and concludes the Chapter.



#### 7.1 Receiver Architecture and Circuits

Figure 7.1 – 8-lane single-ended receiver architecture.



Figure 7.2 – CTXC stage with single-ended passive differentiator, variable gain amplifier and current summation. The two high pass RC differentiators are highlighted in the boxes.

The architecture of the source synchronous RX is shown in Fig. 7.1. It consists of 8 single-ended data lanes and 1 shared differential clock lane. Each data-path starts with the termination front end, followed by product-level ESD protection combined with a T-coil for bandwidth extension. The CTXC processes the input signal together with the nearest aggressors. The CTXC output goes to a 2-stage CTLE followed by a direct feedback 8-tap DFE and 56-tap DFXC running at full rate. The equalized full-rate output is then deserialized to quarter rate and sampled by a digital engine, used for adaptation and BER check.

#### 7.1.1 Clock generation

A full rate clock supplied off chip with 1 Vpp swing and 750 mV CM is terminated differentially before being amplified by a CML buffer [122]. The reference voltage  $V_{ref}$  is extracted directly from the input clock common mode, without the need for a dedicated pin as in many single-ended standards such as DDRX. The buffered input clock is then converted to CMOS level and buffered to the local clock distribution within each lane.

#### 7.1.2 CTXC and CTLE

The CTXC is located after the impedance matching network and presents a FEXT reduced signal to the CTLE. Fig. 7.2 shows the circuit implementation of the proposed CTXC circuit.

The CTXC consists of two passive differentiators followed by a current domain adder. The differentiators produce a single-ended crosstalk cancellation signal from the two adjacent lanes. The values of R=972  $\Omega$  and C=30 fF have been chosen to provide a return loss below -10 dB up to 4 GHz at each of the broadband 50  $\Omega$  RX inputs. In [105], a resistor-capacitor replica circuit is added in the forward path to equalize the phase delays between the forward and crosstalk cancellation paths. In this way, the transfer function of the differentiator differs from that of the replica circuit by *sRC*, providing a 90° phase shift at all frequencies. However, this creates a parasitic pole on the main signal path. In this design, only the resistor is added in the main path, while the capacitor is provided directly by the loading of the CTXC input stage. Circuit



Figure 7.3 – Simulated AC response of main signal path VGA with maximum gain setting.

simulations across corners resulted in acceptable distortion with marginal impact on crosstalk cancellation.
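For reference, the first-order relations behind this discussion can be written out, assuming the standard high-pass form for the passive differentiator and low-pass form for the replica (the exact circuit transfer functions are not given in the text):

$$H_{diff}(s) = \frac{sRC}{1+sRC}, \qquad H_{rep}(s) = \frac{1}{1+sRC}, \qquad \frac{H_{diff}(s)}{H_{rep}(s)} = sRC,$$

$$RC = 972\ \Omega \times 30\ \mathrm{fF} \approx 29.2\ \mathrm{ps}, \qquad f_c = \frac{1}{2\pi RC} \approx 5.5\ \mathrm{GHz}.$$

Under this assumption the corner frequency lies above the 3.5 GHz Nyquist frequency of a 7 Gb/s NRZ stream, so the RC network behaves approximately as a differentiator over most of the signal band.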

A current domain adder with programmable gain combines the signals from the three paths. Three digitally programmed bias currents allow the gains of the forward and crosstalk cancellation paths to be adjusted independently. The VGA bias currents are binary weighted and can be adjusted with 4-bit resolution, enabling crosstalk cancellation over a wide range. The forward path uses a degenerated differential pair to improve linearity. Since the differentiated signals have a small amplitude, because FEXT is typically much smaller than the main signal component, there is no degeneration resistor in the crosstalk cancellation VGA. The single-ended to differential conversion is obtained by connecting the differential pair to  $V_{ref}$  on one side and to the differentiator/compensator on the other.

Fig. 7.3 displays the simulated (after RC extraction) frequency response of the forward path VGA. The DC gain is 3.9 dB with a 3 dB bandwidth of 4.19 GHz. The bandwidth limitation comes from the large capacitance at the current summation node, which amounts to 16 fF. This is still acceptable for 7-8 Gb/s, hence no architecture change is needed.

The CTLE circuit, depicted in Fig. 7.4, is a differential buffer stage with programmable capacitive and resistive source degeneration [121]. A negative capacitance is in parallel with the





Figure 7.4 – CTLE stage with negative-C bandwidth enhancement. Reprinted from [121].

differential pair and, if enabled, is used to enhance the bandwidth of the circuit. The programmable resistive degeneration is controlled with 9 thermometer-coded steps, providing 17 settings in total. The degeneration capacitance is binary programmable with 4-bit resolution. Each capacitance step is implemented with two anti-parallel connected varactors. Two CTLE stages are cascaded to provide up to 17 dB peaking at 3.5 GHz with -3.7 dB DC gain.

#### 7.1.3 DFE and DFXC

The DFE core, shown in Fig. 7.6, includes 8-tap DFE and  $7 \times 8$  DFXC switched-capacitor cells. A current integrating stage amplifies the CTLE output for 1/2 UI. A track and hold stage is avoided to limit the kT/C noise, at the cost of a 0.9 dB loss due to the half-UI integration window. The DFE core loop is based on a direct feedback full rate DFE, where the critical timing loop is the tap-1  $(h_1)$  equalization feedback. Digitally programmable switched-capacitor DACs (SC-DACs) add charge on the integration node. This approach enables a fast DFE feedback, thanks to the instantaneous effect of the charge injection on the summation node, and relaxes the DFE timing loop compared with a current summation DFE [121]. Each capacitive DAC is programmable with 6-bit resolution, with 1 LSB=250 aF ( $C_{max}$ =15.75 fF), and is realized with finger capacitors in the M1 and M2 metal layers. Correction tap  $h_1$  uses 3 SC cells connected in parallel, allowing a wider correction range. The less critical DFE taps  $h_2$  to  $h_8$  and the remaining 48 DFXC taps are driven by FIFO data. The implemented DFE, with capacitive charge feedback, is shown in Fig. 7.5. A dynamic differential latch receives the digital data resolved by the strongARM data-latch [121] and samples them at the falling clock edge. The DCVS and dynamic latch together implement the function of a flip-flop. The dynamic latch avoids charge injection before the integration period: in an SC-DFE, no DFE correction is performed if the charge injection occurs during the reset phase.
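The SC-DAC range quoted above follows directly from the LSB size and the resolution. The short sketch below reproduces it; the function name is hypothetical, and the tap-1 figure assumes that the three parallel cells are identical and driven with the same code, which the text does not state explicitly.

```python
LSB_AF = 250             # 1 LSB = 250 aF (from the text)
N_BITS = 6               # 6-bit resolution (from the text)
TAP1_PARALLEL_CELLS = 3  # tap h1 uses 3 SC cells in parallel (from the text)

def sc_dac_capacitance_af(code, cells=1):
    """Capacitance in aF selected by a DAC code, for `cells` identical cells in parallel."""
    if not 0 <= code < 2 ** N_BITS:
        raise ValueError("code must fit in the DAC resolution")
    return cells * code * LSB_AF

full_scale_code = 2 ** N_BITS - 1                       # 63
c_max_ff = sc_dac_capacitance_af(full_scale_code) / 1000
# Assumption: all three tap-1 cells share the same full-scale code.
tap1_max_ff = sc_dac_capacitance_af(full_scale_code, TAP1_PARALLEL_CELLS) / 1000
```

The single-cell full scale evaluates to the 15.75 fF given in the text, and under the stated assumption the tap-1 range is three times larger.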



Figure 7.5 – Integrating DFE using SC feedback.

The data format is kept in pre-charged dynamic logic from the data-latch to each SC-DAC input. In this way, it is possible to close the DFE tap-1 timing with reasonable margin, since a conversion step to static CMOS logic is avoided.

Each lane includes an additional offset-programmable latch (amplitude path), shown at the top part of Fig. 7.6, for RX internal eye measurement and DFE tap calibration. It consists of a DCVS latch with integrated voltage offset followed by a *Set-Reset* (SR) latch. The amplitude bit is fed into the digital calibration block, where the information is processed to find a correlation between the received amplitude samples and previous data bits indicating the presence of ISI or FEXT.
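The update rule of the calibration block is not detailed here. As an illustration only, the sketch below uses a generic sign-sign LMS loop, a common choice for this kind of correlation-based adaptation and not necessarily the procedure used in this work. The residual model, the tap values, the step size `mu` and all names are hypothetical.

```python
import random

def sign(x):
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)

def adapt_taps(h1, x1, n_bits=5000, mu=0.001, seed=1):
    """Sign-sign LMS on one ISI postcursor (h1) and one FEXT postcursor (x1).

    The residual seen by the amplitude path is modelled as the uncancelled
    part of both postcursors; each coefficient moves in the direction that
    drives the correlation between the residual and the previous bit to zero.
    """
    rng = random.Random(seed)
    c_dfe, c_dfxc = 0.0, 0.0
    for _ in range(n_bits):
        prev_victim = rng.choice((-1, 1))   # previous bit on the victim lane
        prev_aggr = rng.choice((-1, 1))     # previous bit on the aggressor lane
        residual = (h1 - c_dfe) * prev_victim + (x1 - c_dfxc) * prev_aggr
        c_dfe += mu * sign(residual) * prev_victim
        c_dfxc += mu * sign(residual) * prev_aggr
    return c_dfe, c_dfxc

c_dfe, c_dfxc = adapt_taps(h1=0.3, x1=0.2)
```

In this toy model both coefficients settle within a few step sizes of the true postcursor values, at which point the residual no longer correlates with the previous victim or aggressor bits.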

#### 7.2 Measurement Results

The layout of the fabricated circuit, whose RX macro measures  $300 \times 350 \mu m^2$ , is shown in Fig. 7.7. The chip, fabricated in 32 nm SOI CMOS, has been flip-chip mounted on a high-frequency, low-loss substrate, a *Liquid Crystal Polymer* (LCP) PCB, shown in Fig. 7.8 (left). The LCP itself is embedded in a rigid metallic frame which includes impedance-matched high-frequency coaxial connectors, as shown in Fig. 7.8 (right).

The RX performance has been tested with both channels described in Section 6.2. The characterization has been performed using high frequency probing cables connected to an Agilent PARBERT. Fig. 7.9 shows the measurement setup. Read/write operations are performed through a bidirectional digital interface, which connects the RX chip to a PC. An on-chip error counter (PRBS checker) and correlator, running at quarter rate, has been



Figure 7.6 – DFE and DFXC core, with fast tap-1 feedback, including 8-tap DFE and  $7 \times 8$  DFXC SC cells.

exploited to run the electrical characterization for latch offset correction, timing adjustment and tuning of the CTXC-CTLE and DFXC-DFE coefficients. A 3-lane measurement was performed owing to limitations of the measurement equipment. The data streams sent over the three adjacent lanes were PRBS7 on the aggressors and PRBS11 on the victim, thus uncorrelated bit sequences.

#### 7.2.1 Ch1 measurement results

The calibration of the internal registers proceeds as follows. As a first step, we calibrate the forward signal path only, with the aggressor transmitters switched off. The RX output, read by the on-chip amplitude path, is sent to the correlator and analysed on a PC using MATLAB. The new CTLE-DFE coefficients are written to the internal registers in order to reduce the ISI. Following this step, we switch on one of the two nearest TX lanes and sweep the CTXC parameters through the PC. The same process is repeated for the other nearest aggressor lane, calibrating the second branch of the CTXC. Once the two CTXC set-points have been defined, the forward signal calibration is repeated to reduce the impact of the CTLE on the reduced crosstalk pulse responses, trading off the CTLE and CTXC contributions.

Fig. 7.10 shows the measured BER bathtub curves for board Ch1, generated internally by a horizontal sweep of the data through the Agilent PARBERT phase generator (32



Figure 7.7 – Layout of RX macro (center), detail of the SC-DFE cells (on top) and the die micrograph (bottom).




Figure 7.8 – On the left, the chip is flip-chip mounted on the LCP PCB. On the right, the LCP is packaged in a rigid metallic frame.

steps per UI). Once the aggressor lanes are transmitting, the bathtub curve shows a horizontal aperture of 12% UI, which rises to above 25% UI once the CTXC is activated.

#### 7.2.2 Ch2 measurement results

The measurement of the correlation between the two aggressors and the victim postcursors is necessary for tuning the DFXC taps over board Ch2. The correlation values were read by the PC through the on-chip amplitude path, and the updated coefficients were re-written to the circuit registers, driving the correlation with the postcursor channel taps to zero (Fig. 7.11 (left)). The BER bathtub curves are shown in Fig. 7.11 (right). With silent aggressors, the RX eye is open with a horizontal margin of 40% UI at  $10^{-12}$  BER. Once the 2 adjacent aggressor lanes are transmitting, the link does not operate error free, since the bathtub curve reaches only  $10^{-4}$  BER. After switching on the crosstalk cancellation blocks, the eye is reasonably open with a 12.5% UI margin (highlighted in Fig. 7.12 (d)), showing that both CTXC and DFXC are necessary to ensure error-free operation of the RX. Fig. 7.12 (a), Fig. 7.12 (b) and Fig. 7.12 (c) display the measured eye diagrams, generated internally by a horizontal sweep of the data through the Agilent PARBERT phase generator and a vertical sweep of the amplitude programmable latch offset. The measured vertical eye margins are 22.4 mV<sub>ppdiff</sub> and 64 mV<sub>ppdiff</sub> at  $10^{-8}$  BER with and without crosstalk, respectively.

A power breakdown for 7 Gb/s operation is shown in Table 7.1, which reports the power consumed by one lane. The power of the clock generation circuit is amortized over the 8 lanes. The DFE core data-path includes the integrating amplifier, DCVS latches, dynamic datapath and digital FIFO. The total power dissipation once the CTLE, CTXC, 8-tap DFE and 56-tap DFXC are active amounts to



Figure 7.9 – Measurement setup: clock generators on top left, PARBERT for PRBS generation on bottom left, test board Ch2 on bottom right and the RX in the middle.




Figure 7.10 – Measured bathtub plots for Ch1 board with CTXC switched off (a) and switched on (b), with the two nearest aggressor lanes transmitting.



Figure 7.11 – Board-Ch2: measured correlation with postcursor taps with and without DFXC, on the left; measured bathtub plots, on the right.

5.9 mW/Gb/s with a 1 V supply at the package, of which 3.9 mW/Gb/s are used in the 64-tap DFE+DFXC SC-cells and core data-path.
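The entries of the per-lane power breakdown table can be cross-checked against the totals quoted above. The snippet below only sums the tabulated values and converts the normalised figure into an absolute per-lane power at 7 Gb/s; the variable names are illustrative.

```python
# Per-lane power contributions in uW/Gb/s, copied from the power breakdown table.
power_uw_per_gbps = {
    "clk path (amortized by 8 lanes)": 66,
    "local clock distribution": 260,
    "CTLE-CTXC": 1150,
    "DFE-DFXC": 3894,
    "digital correlator": 250,
    "1:4 demux": 280,
}

DATA_RATE_GBPS = 7

total_uw_per_gbps = sum(power_uw_per_gbps.values())
lane_power_mw = total_uw_per_gbps * DATA_RATE_GBPS / 1000
dfe_dfxc_share = power_uw_per_gbps["DFE-DFXC"] / total_uw_per_gbps
```

The sum reproduces the 5900 µW/Gb/s total, which corresponds to roughly 41 mW per lane at 7 Gb/s, with about two thirds of it spent in the DFE-DFXC block.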

Table 7.2 shows a comparison of the RX macro with prior art. The power overhead compared to the prior art mainly comes from the DFXC function. Moreover, the power number reported here covers the complete RX macro, including the digital correlation logic. Finally, the proposed scheme proves to be a highly flexible FEXT compensation scheme, which can be adapted to different single-ended boards.



Figure 7.12 – Received eye diagrams with silent aggressors (top-left), crosstalk cancellation off (top-right), crosstalk cancellation activated (bottom-left) with related bathtub plot (bottom-right).

#### 7.3 Summary

In this work we reported an 8-lane single-ended receiver circuit for source-synchronous links over high loss channels affected by FEXT. Each lane performs ISI equalization and FEXT cancellation based on a CTXC and a 7×8-tap DFXC, ensuring robust operation. Unlike previous literature [119], [116], where crosstalk cancellation schemes were tested on channels with moderate insertion loss, the proposed RX macro can equalize both a 30 dB insertion loss single-ended channel with a signal-to-crosstalk ratio of 0 dB from the nearest lanes at Nyquist, and a channel with 28 dB attenuation and reflections due to via stubs with a signal-to-crosstalk ratio of 6 dB. The crosstalk reduction strategy can be used across a variety of channels with different crosstalk patterns, due to board geometry. This trend, demonstrated with measurements, shows good agreement with the system level analysis. Interestingly, it has been shown that the vertical eye opening improves by increasing the CTLE peaking, even with severe crosstalk.

Table 7.1 – Power breakdown of one lane at 7 Gb/s operation

| Block                           | $\mu$W/Gb/s |
|---------------------------------|-------------|
| clk path (amortized by 8 lanes) | 66          |
| local clock distribution        | 260         |
| CTLE-CTXC                       | 1150        |
| DFE-DFXC                        | 3894        |
| digital correlator              | 250         |
| 1:4 demux                       | 280         |
| Total                           | 5900        |

Table 7.2 – Comparison of 8 lanes  $\times$  7 Gb/s RX macro with prior art

| Reference                    | [119]        | [116]         | [123]        | [105]        | This work                |
|------------------------------|--------------|---------------|--------------|--------------|--------------------------|
| XTC type                     | CTXC         | Rx passive SC | TX FIR       | CTXC         | CTXC, 7x8 DFXC           |
| I/O type                     | Single-ended | Differential  | Single-ended | Single-ended | Single-ended             |
| Multi-channel num.           | 2            | 2             | 2            | 4            | 8                        |
| Data-rate (Gb/s)             | 6            | 15            | 7            | 12           | 7                        |
| Channel attenuation          | 9 dB         | 14.5 dB       | N/A          | 11 dB        | 30 dB (Ch1), 28 dB (Ch2) |
| Signal-to-crosstalk ratio    | N/A          | 1 dB          | N/A          | 0 dB         | 0 dB (Ch1), 6 dB (Ch2)   |
| Eq. power (pJ/bit/lane)      | 2.4          | 0.033         | N/A          | 0.96         | 8 (full RX)              |
| Process node                 | 130 nm       | 45 nm SOI     | 40 nm        | 65 nm        | 32 nm SOI                |
| Area (mm<sup>2</sup>/lane)   | 0.03         | N/A           | N/A          | 0.036        | 0.012                    |

Moreover, it has been demonstrated that the first 2 to 4 DFXC taps are sufficient to reduce the crosstalk even in presence of skew between lanes in the channel bundle.

# **Conclusions Part III**

# 8 Conclusion and future work

In this thesis we discussed the algorithm development, architectural design, system/circuit level implementation and silicon validation of learning-based hardware for edge and big-data computing. The main goal of this work has been to fill the gap between algorithm development and actual prototypes, by means of dedicated hardware designs that trade off area, power and overall performance in their field of application.

Regarding the implantable device for medical applications, this has been obtained by developing a tailored algorithm that can be easily implemented in hardware, while still allowing relatively high signal reconstruction quality and enabling efficient medical monitoring. In particular, a structured sampling approach has been discussed, showing how a probability function that favours the low frequencies in the sparse domain can be exploited to enhance the sampling procedure, thus improving the sensing performance. Afterwards, a learning-based compressive sampling (LBCS) algorithm has been described. This compressive scheme is based on the simple idea of sampling a fixed set of coefficients that preserve as much of the signal's energy as possible. The set of indices is learnt from a training set of fully sampled signals, by selecting the ones that capture most of the signals' average energy. LBCS offers a pair of highly efficient linear encoder and decoder, thus challenging the conventional recovery approach in CS, where non-linear decoding procedures such as basis pursuit are necessary for reliable signal reconstruction.
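The index-learning idea summarised above can be sketched in a few lines. The example below is a minimal illustration, not the implementation discussed in this thesis: it uses an unnormalised Walsh-Hadamard transform in natural ordering, picks the `k` indices with the largest average energy over a toy training set, and decodes by zero-filling followed by the inverse transform, which is an assumed form of the linear decoder. All function names and signals are hypothetical.

```python
def fwht(x):
    """Unnormalised fast Walsh-Hadamard transform of a list whose length is a power of two."""
    y = list(x)
    h = 1
    while h < len(y):
        for i in range(0, len(y), 2 * h):
            for j in range(i, i + h):
                y[j], y[j + h] = y[j] + y[j + h], y[j] - y[j + h]
        h *= 2
    return y

def learn_indices(training, k):
    """Indices of the k transform coefficients with the largest summed energy over the training set."""
    n = len(training[0])
    energy = [0.0] * n
    for x in training:
        for i, c in enumerate(fwht(x)):
            energy[i] += c * c
    best = sorted(range(n), key=lambda i: energy[i], reverse=True)[:k]
    return sorted(best)

def encode(x, indices):
    """Keep only the learnt transform coefficients."""
    coeffs = fwht(x)
    return [coeffs[i] for i in indices]

def decode(samples, indices, n):
    """Linear decoder (assumed form): zero-fill the missing coefficients and invert the transform."""
    coeffs = [0.0] * n
    for i, s in zip(indices, samples):
        coeffs[i] = s
    return [v / n for v in fwht(coeffs)]

# Toy training set: a constant signal and an alternating one (hypothetical data).
training = [[1] * 8, [1, -1] * 4]
indices = learn_indices(training, k=2)

test_signal = [3, 1] * 4          # lies in the span of the two training signals
reconstruction = decode(encode(test_signal, indices), indices, n=8)
```

Because the toy test signal concentrates all of its energy on the two learnt coefficients, it is recovered exactly from 2 of the 8 coefficients; real signals are only approximately captured.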

The different learning-based hardware prototypes have been described in the chapter on LBCS hardware. In particular, it has been highlighted that the Hadamard measurement matrix in the LBCS encoding scheme is better suited to implantable applications than the LBCS-DCT scheme, which gives better signal recovery quality in terms of SNR while requiring more area. Then, the complete single-channel system architecture and circuit implementation have been described, including the ADC, an adaptive compression rate Hadamard-based LBCS and the RF parts for the wireless power and data link, followed by the electrical measurements on silicon. Finally, a multichannel implementation has been discussed that, at the time of writing, is being fabricated.

The second subject of this work has been the design of a multi-lane single-ended high speed

I/O receiver for high-end server applications. In Chapter 5 we have described the high speed Input/Output link interconnection, giving the main information about the different signal losses in the channel board bundle. Then we have described the system level analysis of the high speed receiver, motivating our crosstalk cancellation technique on the receiver side only. The crosstalk mathematical formulation for ideally coupled lanes introduced the system level simulations. The receiver architecture, circuit details and measurements have been given in Chapter 7, demonstrating a versatile receiver circuit which can be adapted to different channel bundle characteristics by learning the ISI and crosstalk contributions.

#### 8.1 Future Work

Concerning edge-data computing, wireless implantable devices capable of monitoring the brain's activity are becoming an important tool for understanding mental diseases and potentially treating some mental disorders or restoring motor functions lost due to central nervous system disorders, such as spinal cord injury. Innovative machine learning based approaches are used to design very efficient hardware data encoders, which are signal-structure aware. With this premise in mind, we can significantly improve the encoder/decoder combination, tailoring their design to boost the overall system performance. A possible future step of this work is to implement a neural network system that exploits the signal structure and enhances the performance of the macro, using only minor assumptions on the signal of interest.

A further improvement to this work might consider security and privacy issues in implantable medical devices. A product-level implementation must include a security system that fits within the inherent constraints of the implantable application: limited area and low power consumption. The implantable device has to combine the safety of the patient with an adequate level of security.

# A Appendix: Dataset details

The iEEG.org portal contains several datasets of EEG and iEEG data which are manually annotated by expert clinicians. We focus on the following two datasets.

## A.1 I001-P034-D01

The I001-P034-D01 dataset consists of approximately 1 day, 8 hours and 10 minutes of recordings at 5 kHz, or approximately $6 \cdot 10^8$ samples. In order to reduce the dataset size, we use samples only from the 12th and 13th seizures, and an equal number of samples before the seizure onset, for training and testing respectively.

We consider the 32 active grid electrodes which, on a first visual inspection, most clearly show significant differences between the samples annotated as seizures and the rest. In order to allow a better comparison with the sampling strategy that combines samples across the channels (MCS), we consider only a sub-grid of $4 \times 4$ electrodes.

## A.2 Study 040

The Study 040 dataset consists of approximately 2 days, 23 hours and 50 minutes of recordings at 5 kHz, or approximately $1.3 \cdot 10^9$ samples. In order to reduce the dataset size, we use samples only from the 1st and 3rd seizures, and an equal number of samples before the seizure onset, for training and testing respectively. We consider all 64 active grid electrodes.
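The sample counts quoted for the two datasets can be checked from the recording durations and the sampling rate. The short sketch below does this arithmetic; the helper name `samples_from_duration` is hypothetical, and it assumes the quoted counts refer to a single channel.

```python
def samples_from_duration(days, hours, minutes, rate_hz):
    """Number of samples in a recording of the given duration at rate_hz."""
    seconds = ((days * 24 + hours) * 60 + minutes) * 60
    return seconds * rate_hz


RATE_HZ = 5_000  # 5 kHz sampling rate stated for both datasets

# Assumption: the counts in the text are per channel.
n_i001 = samples_from_duration(1, 8, 10, RATE_HZ)    # I001-P034-D01
n_s040 = samples_from_duration(2, 23, 50, RATE_HZ)   # Study 040

print(f"I001-P034-D01: {n_i001:.2e} samples")  # about 6e8
print(f"Study 040:     {n_s040:.2e} samples")  # about 1.3e9
```

Both results agree with the rounded figures given in the text.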

## A.3 Experimental protocol

The training sets of both datasets are used to learn the sampling pattern for the LBCS approach and also to tune the variable density parameters for the SHS method. Once the sampling pattern is fixed, LBCS uses it to compress all the signal windows in the test set. The reconstruction is then performed with the linear decoder (3.21). For the randomized methods, MCS, BERN and SHS, we draw 20 different sampling patterns from the respective distributions for each signal window in the test set and reconstruct using the tree-based HGL norm (3.19), which was shown in [46] to yield the best results.
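The LBCS part of this protocol can be sketched in a few lines. The sketch below is a toy illustration, not the thesis implementation. It assumes an orthonormal Walsh-Hadamard transform as the measurement basis. It assumes the sampling pattern is learned by keeping the indices with the largest average training energy, and that the linear decoder zero-fills the unsampled coefficients and inverts the transform. The function names and the toy windows are hypothetical. The randomized baselines and the HGL-norm decoder are not sketched.

```python
def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2)."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = n ** -0.5
    return [v * scale for v in y]


def learn_pattern(train_windows, n_keep):
    """Assumed learning rule: keep the n_keep indices with the largest
    transform-domain energy summed over the training windows."""
    p = len(train_windows[0])
    energy = [0.0] * p
    for window in train_windows:
        for k, c in enumerate(fwht(window)):
            energy[k] += c * c
    best = sorted(range(p), key=lambda k: -energy[k])[:n_keep]
    return sorted(best)


def compress(window, pattern):
    """Keep only the transform coefficients selected by the fixed pattern."""
    coeffs = fwht(window)
    return [coeffs[k] for k in pattern]


def linear_decode(measurements, pattern, p):
    """Assumed linear decoder: zero-fill, then invert the transform
    (the orthonormal Walsh-Hadamard transform is its own inverse)."""
    coeffs = [0.0] * p
    for k, m in zip(pattern, measurements):
        coeffs[k] = m
    return fwht(coeffs)


# Toy data: windows that are constant on each half (hypothetical values).
train = [[1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0],
         [0.5, 0.5, 0.5, 0.5, 1.5, 1.5, 1.5, 1.5]]
test_window = [1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0]

pattern = learn_pattern(train, n_keep=2)          # learned once, on training data
measurements = compress(test_window, pattern)     # 2 values instead of 8
x_hat = linear_decode(measurements, pattern, len(test_window))
print(pattern, [round(v, 6) for v in x_hat])
```

In this toy case the test window shares the structure of the training windows, so two coefficients are enough to recover it. Real iEEG windows are only approximately captured by the learned pattern, which is why the reconstruction quality is measured separately on the test set.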

## A.4 Performance Evaluation

We concatenate all reconstructed windows for each channel *j* together, forming the entire reconstructed signal $\hat{\mathbf{x}}_j$ for the test seizure. We then compute the SNR for each channel as $\text{SNR}_j = 20\log_{10}\left(\frac{\|\mathbf{x}_j\|_2}{\|\mathbf{x}_j - \hat{\mathbf{x}}_j\|_2}\right)$, where $\mathbf{x}_j$ is the recorded signal for channel *j*, and average these SNRs to obtain our final measure of performance, $\text{SNR} = \frac{1}{\#ch} \sum_{j=1}^{\#ch} \text{SNR}_j$. For the randomized methods, we also average over the 20 draws.
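As a minimal sketch of this metric, the per-channel SNR and its average over channels can be computed as follows. The function names and the two-channel toy signals are hypothetical, and the sketch assumes a nonzero reconstruction error on every channel.

```python
import math


def snr_db(x, x_hat):
    """Per-channel SNR in dB: 20 * log10(||x||_2 / ||x - x_hat||_2)."""
    signal_norm = math.sqrt(sum(v * v for v in x))
    error_norm = math.sqrt(sum((v - w) ** 2 for v, w in zip(x, x_hat)))
    return 20.0 * math.log10(signal_norm / error_norm)


def average_snr_db(channels, reconstructions):
    """Mean of the per-channel SNRs (the averaging is done in dB)."""
    snrs = [snr_db(x, x_hat) for x, x_hat in zip(channels, reconstructions)]
    return sum(snrs) / len(snrs)


# Toy example with two channels (hypothetical values).
recorded = [[10.0, 0.0], [3.0, 4.0]]
reconstructed = [[9.0, 0.0], [3.0, 3.0]]
print(round(average_snr_db(recorded, reconstructed), 2))
```

For the randomized methods, the same function would be applied to each of the 20 draws and the results averaged again.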

## Bibliography

- [1] J. S. Kilby, "Invention of the integrated circuit," *IEEE Transactions on electron devices*, vol. 23, no. 7, pp. 648–654, 1976.
- [2] R. R. Schaller, "Moore's law: past, present and future," *IEEE spectrum*, vol. 34, no. 6, pp. 52–59, 1997.
- [3] D. Bautista, "7nm ibm wafer," Feature Photo Service, IBM, Tech. Rep., 2015.
- [4] "2015 international technology roadmap for semiconductors (itrs)," ITRS, Tech. Rep., 2015.
- [5] I. S. Caveats, "Imd shield: Securing implantable medical devices."
- [6] "Implantable medical devices market. global industry analysis, size, share, growth, trends, and forecast 2016 - 2024," Transparency Market Research, Tech. Rep. TM-RGL13946, 2016.
- [7] E. J. Candès, "Compressive sampling," in *Proceedings on the International Congress of Mathematicians: Madrid, August 22-30, 2006: invited lectures,* 2006, pp. 1433–1452.
- [8] D. Donoho, "Compressed sensing," *IEEE Transactions on Information Theory*, vol. 52, no. 4, pp. 1289–1306, 2006.
- [9] W. H. Organization, "Global burden of disease study," WHO, Tech. Rep., 2008.
- [10] M. DiLuca and J. Olesen, "The cost of brain diseases: a burden or a challenge?" *Neuron*, vol. 82, no. 6, pp. 1205–1208, 2014.
- [11] "Cardiovascular diseases statistics," European heart health charter, Tech. Rep., 2016.
- [12] A. C. Hoogerwerf and K. D. Wise, "A three-dimensional microelectrode array for chronic neural recording," *Biomedical Engineering, IEEE Transactions on*, vol. 41, no. 12, pp. 1136–1146, 1994.
- [13] C. B. Nemeroff, H. S. Mayberg, S. E. Krahl, J. McNamara, A. Frazer, T. R. Henry, M. S. George, D. S. Charney, and S. K. Brannan, "Vns therapy in treatment-resistant depression: clinical evidence and putative neurobiological mechanisms," *Neuropsychopharmacology*, vol. 31, no. 7, pp. 1345–1355, 2006.

- [14] M. Leonardi and T. B. Ustun, "The global burden of epilepsy," *Epilepsia*, vol. 43, no. s6, pp. 21–25, 2002.
- [15] W. H. Organization, "Epilepsy-fact sheet," WHO, Tech. Rep., February 2017.
- [16] O. N.-I. R. H. IEEE Standards Coordinating Committee 28, IEEE Standard for Safety Levels with Respect to Human Exposure to Radio Frequency Electromagnetic Fields, 3kHz to 300 GHz. IEEE, 1992.
- [17] G. Yilmaz, "Wireless power transfer and data communication for intracranial neural implants case study," 2014.
- [18] G. Yilmaz, O. Atasoy, and C. Dehollain, "Wireless energy and data transfer for in-vivo epileptic focus localization," *Sensors Journal, IEEE*, vol. 13, no. 11, pp. 4172–4179, 2013.
- [19] P. Shenoy, K. J. Miller, J. G. Ojemann, and R. P. Rao, "Generalized features for electrocorticographic bcis," *IEEE Transactions on Biomedical Engineering*, vol. 55, no. 1, pp. 273–280, 2008.
- [20] F. Zhang, A. Mishra, A. G. Richardson, and B. Otis, "A low-power ecog/eeg processing ic with integrated multiband energy extractor," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 58, no. 9, pp. 2069–2082, 2011.
- [21] M. Stead, M. Bower, B. Brinkmann, K. Lee, W. Marsh, F. Meyer, B. Litt, J. Van Gompel, and G. Worrell, "Microseizures and the spatiotemporal scales of human partial epilepsy," *Brain*, vol. 133, no. 9, 2010.
- [22] T. F. Collura, "History and evolution of electroencephalographic instruments and techniques." *Journal of clinical neurophysiology*, vol. 10, no. 4, pp. 476–504, 1993.
- [23] G. Kreiman, "Neural coding: computational and biophysical perspectives," *Physics of Life Reviews*, vol. 1, no. 2, pp. 71–102, 2004.
- [24] A. Gorgulho, A. A. De Salles, L. Frighetto, and E. Behnke, "Incidence of hemorrhage associated with electrophysiological studies performed using macroelectrodes and microelectrodes in functional neurosurgery," *Journal of neurosurgery*, vol. 102, no. 5, pp. 888–896, 2005.
- [25] "Neuronal implants," lifescience.ieee.org, Tech. Rep.
- [26] E. M. Maynard, C. T. Nordhausen, and R. A. Normann, "The utah intracortical electrode array: a recording structure for potential brain-computer interfaces," *Electroencephalography and clinical neurophysiology*, vol. 102, no. 3, pp. 228–239, 1997.
- [27] D. Yoshor, W. H. Bosking, G. M. Ghose, and J. H. Maunsell, "Receptive fields in human visual cortex mapped with surface electrodes," *Cerebral cortex*, vol. 17, no. 10, pp. 2293–2302, 2006.

- [28] G. A. Worrell, A. B. Gardner, S. M. Stead, S. Hu, S. Goerss, G. J. Cascino, F. B. Meyer, R. Marsh, and B. Litt, "High-frequency oscillations in human temporal lobe: simultaneous microwire and clinical macroelectrode recordings," *Brain*, vol. 131, no. 4, pp. 928–937, 2008.
- [29] F. Chen, A. P. Chandrakasan, and V. M. Stojanovic, "Design and analysis of a hardware-efficient compressed sensing architecture for data compression in wireless sensors," *IEEE Journal of Solid-State Circuits*, vol. 47, no. 3, pp. 744–756, 2012.
- [30] S. Ha, A. Akinin, J. Park, C. Kim, H. Wang, C. Maier, P. P. Mercier, and G. Cauwenberghs, "Silicon-integrated high-density electrocortical interfaces," *Proceedings of the IEEE*, vol. 105, no. 1, pp. 11–33, 2017.
- [31] D. Kwon and G. A. Rincón-Mora, "A 2-μm bicmos rectifier-free ac-dc piezoelectric energy harvester-charger ic," *Biomedical Circuits and Systems, IEEE Transactions on*, vol. 4, no. 6, pp. 400–409, 2010.
- [32] Y. Zhang, F. Zhang, Y. Shakhsheer, J. D. Silver, A. Klinefelter, M. Nagaraju, J. Boley, J. Pandey, A. Shrivastava, E. J. Carlson *et al.*, "A batteryless 19 μw mics/ism-band energy harvesting body sensor node soc for exg applications," *Solid-State Circuits, IEEE Journal of*, vol. 48, no. 1, pp. 199–213, 2013.
- [33] S. Ayazian and A. Hassibi, "Delivering optical power to subcutaneous implanted devices," in *Engineering in Medicine and Biology Society, EMBC, 2011 Annual International Conference of the IEEE.* IEEE, 2011, pp. 2874–2877.
- [34] K. Goto, T. Nakagawa, O. Nakamura, and S. Kawata, "An implantable power supply with an optically rechargeable lithium battery," *Biomedical Engineering, IEEE Transactions on*, vol. 48, no. 7, pp. 830–833, 2001.
- [35] P. P. Mercier, A. C. Lysaght, S. Bandyopadhyay, A. P. Chandrakasan, and K. M. Stankovic, "Energy extraction from the biologic battery in the inner ear," *Nature biotechnology*, vol. 30, no. 12, pp. 1240–1243, 2012.
- [36] H. Miranda, V. Gilja, C. A. Chestek, K. V. Shenoy, and T. H. Meng, "HermesD: A high-rate long-range wireless transmission system for simultaneous multichannel neural recording applications," *Biomedical Circuits and Systems, IEEE Transactions on*, vol. 4, no. 3, pp. 181–191, 2010.
- [37] P. V. Nikitin, K. Rao, and S. Lazar, "An overview of near field uhf rfid," in *IEEE international Conference on RFID*, vol. 167. Citeseer, 2007.
- [38] E. Y. Chow, C.-L. Yang, Y. Ouyang, A. L. Chlebowski, P. P. Irazoqui, and W. J. Chappell, "Wireless powering and the study of rf propagation through ocular tissue for development of implantable sensors," *Antennas and Propagation, IEEE Transactions on*, vol. 59, no. 6, pp. 2379–2387, 2011.

- [39] J. S. Ho, S. Kim, and A. S. Poon, "Midfield wireless powering for implantable systems," *Proceedings of the IEEE*, vol. 101, no. 6, pp. 1369–1378, 2013.
- [40] C. Sauer, M. Stanaćević, G. Cauwenberghs, and N. Thakor, "Power harvesting and telemetry in cmos for implanted devices," *Circuits and Systems I: Regular Papers, IEEE Transactions on*, vol. 52, no. 12, pp. 2605–2613, 2005.
- [41] M. Catrysse, B. Hermans, and R. Puers, "An inductive power system with integrated bi-directional data-transmission," *Sensors and Actuators A: Physical*, vol. 115, no. 2, pp. 221–229, 2004.
- [42] F. Mazzilli, P. E. Thoppay, V. Praplan, and C. Dehollain, "Ultrasound energy harvesting system for deep implanted-medical-devices (imds)," in *Circuits and Systems (ISCAS), 2012 IEEE International Symposium on*. IEEE, 2012, pp. 2865–2868.
- [43] K. Mathieson, J. Loudin, G. Goetz, P. Huie, L. Wang, T. I. Kamins, L. Galambos, R. Smith, J. S. Harris, A. Sher *et al.*, "Photovoltaic retinal prosthesis with high pixel density," *Nature photonics*, vol. 6, no. 6, pp. 391–397, 2012.
- [44] E. G. Kilinc, C. Dehollain, and F. Maloberti, *Remote Powering and Data Communication for Implanted Biomedical Systems.* Springer, 2016.
- [45] S. B. Lee, B. Lee, M. Kiani, B. Mahmoudi, R. Gross, and M. Ghovanloo, "An inductively-powered wireless neural recording system with a charge sampling analog front-end," *Sensors Journal, IEEE*, vol. 16, no. 2, pp. 475–484, 2016.
- [46] L. Baldassarre, C. Aprile, M. Shoaran, Y. Leblebici, and V. Cevher, "Structured sampling and recovery of ieeg signals," in 6th IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 2015.
- [47] L. Baldassarre, Y.-H. Li, J. Scarlett, B. Gözcü, I. Bogunovic, and V. Cevher, "Learning-based compressive subsampling," *IEEE Journal of Selected Topics in Signal Processing*, vol. 10, no. 4, pp. 809–822, 2016.
- [48] R. Prony, "Essai experimental et analytique sur les lois de la dilatabilite de fluides elastiques et sur celles da la force expansion de la vapeur de l'alcool, a differentes temperatures," Ecole Polytechnique de Paris, Tech. Rep., 1795.
- [49] J. F. Hauer, C. Demeure, and L. Scharf, "Initial results in prony analysis of power system response signals," *IEEE Transactions on power systems*, vol. 5, no. 1, pp. 80–89, 1990.
- [50] D. L. Donoho and P. B. Stark, "Uncertainty principles and signal recovery," *SIAM Journal on Applied Mathematics*, vol. 49, no. 3, pp. 906–931, 1989.
- [51] M. Vetterli, P. Marziliano, and T. Blu, "Sampling signals with finite rate of innovation," *IEEE transactions on Signal Processing*, vol. 50, no. 6, pp. 1417–1428, 2002.

- [52] H. Nyquist, "Certain topics in telegraph transmission theory," *Transactions of the American Institute of Electrical Engineers*, vol. 47, no. 2, pp. 617–644, 1928.
- [53] C. E. Shannon, "Communication in the presence of noise," *Proceedings of the IRE*, vol. 37, no. 1, pp. 10–21, 1949.
- [54] E. J. Candes and T. Tao, "Decoding by linear programming," *IEEE transactions on information theory*, vol. 51, no. 12, pp. 4203–4215, 2005.
- [55] E. J. Candes, "The restricted isometry property and its implications for compressed sensing," *Comptes rendus mathematique*, vol. 346, no. 9-10, pp. 589–592, 2008.
- [56] E. J. Candes and T. Tao, "Near-optimal signal recovery from random projections: Universal encoding strategies?" *IEEE transactions on information theory*, vol. 52, no. 12, pp. 5406–5425, 2006.
- [57] S. Foucart and H. Rauhut, *A mathematical introduction to compressive sensing*. Springer, 2013.
- [58] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge university press, 2004.
- [59] E. T. Hale, W. Yin, and Y. Zhang, "A fixed-point continuation method for l1-regularized minimization with applications to compressed sensing," *CAAM TR07-07, Rice University*, vol. 43, p. 44, 2007.
- [60] R. Tibshirani, "Regression shrinkage and selection via the lasso," *Journal of the Royal Statistical Society. Series B (Methodological)*, pp. 267–288, 1996.
- [61] H. Boche, R. Calderbank, G. Kutyniok, and J. Vybral, *Compressed Sensing and Its Applications: MATHEON Workshop 2013*, 1st ed. Birkhäuser Basel, 2015.
- [62] B. Adcock, A. C. Hansen, C. Poon, and B. Roman, "Breaking the coherence barrier: A new theory for compressed sensing," *arXiv preprint arXiv:1302.0561*, 2013.
- [63] R. Baraniuk, V. Cevher, M. Duarte, and C. Hegde, "Model-based compressive sensing," *IEEE Transactions on Information Theory*, vol. 56, no. 4, pp. 1982–2001, 2010.
- [64] A. Kyrillidis, L. Baldassarre, M. El Halabi, Q. Tran-Dinh, and V. Cevher, "Structured sparsity: Discrete and convex approaches," in *Compressed Sensing and its Applications*. Springer, 2015, pp. 341–387.
- [65] M. E. Halabi and V. Cevher, "A totally unimodular view of structured sparsity," in *AISTATS*, 2015.
- [66] S. Mallat, A wavelet tour of signal processing. Academic press, 1999.
- [67] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach, "Proximal methods for hierarchical sparse coding," *Journal of Machine Learning Research*, vol. 12, pp. 2297–2334, 2011.

- [68] M. Shoaran, M. H. Kamal, C. Pollo, P. Vandergheynst, and A. Schmid, "Compact low-power cortical recording architecture for compressive multichannel data acquisition," *IEEE Transactions on Biomedical Circuits and Systems*, vol. 8, no. 6, pp. 857–870, December 2014.
- [69] Q. Tran-Dinh and V. Cevher, "A primal-dual algorithmic framework for constrained convex minimization," *arXiv preprint arXiv:1406.5403*, 2014.
- [70] G. Higgins, S. Faul, R. P. McEvoy, B. McGinley, M. Glavin, W. P. Marnane, and E. Jones, "Eeg compression using jpeg2000: How much loss is too much?" in *Engineering in Medicine and Biology Society (EMBC), 2010 Annual International Conference of the IEEE*. IEEE, 2010, pp. 614–617.
- [71] B. Murmann, "Adc performance survey 1997-2017 (isscc and vlsi symposium)," Tech. Rep., 2017.
- [72] F. Maloberti, "Data converters," Tech. Rep., 2007.
- [73] J. N. Laska, S. Kirolos, M. F. Duarte, T. S. Ragheb, R. G. Baraniuk, and Y. Massoud, "Theory and implementation of an analog-to-information converter using random demodulation," in *IEEE International Symposium on Circuits and Systems*, 2007, pp. 1959–1962.
- [74] M. Shoaran, M. Shahshahani, M. Farivar, J. Almajano, A. Shahshahani, A. Schmid, A. Bragin, Y. Leblebici, and A. Emami, "A 16-channel 1.1 mm$^2$ implantable seizure control soc with sub-$\mu$W/channel consumption and closed-loop stimulation in 0.18 $\mu$m cmos," in *VLSI Circuits (VLSI-Circuits), 2016 IEEE Symposium on*. IEEE, 2016, pp. 1–2.
- [75] M. Ghovanloo and K. Najafi, "A high data transfer rate frequency shift keying demodulator chip for the wireless biomedical implants," in *Circuits and Systems, 2002. MWSCAS-2002. The 2002 45th Midwest Symposium on*, vol. 3. IEEE, 2002, pp. III–433.
- [76] Z. Lu and M. Sawan, "An 8 mbps data rate transmission by inductive link dedicated to implantable devices," in *Circuits and Systems, 2008. ISCAS 2008. IEEE International Symposium on.* IEEE, 2008, pp. 3057–3060.
- [77] D. J. Young, "Wireless powering and data telemetry for biomedical implants," in *Engineering in Medicine and Biology Society, 2009. EMBC 2009. Annual International Conference of the IEEE*, 2009, pp. 3221–3224.
- [78] S. Mandal and R. Sarpeshkar, "Power-efficient impedance-modulation wireless data links for biomedical implants," *Biomedical Circuits and Systems, IEEE Transactions on*, vol. 2, no. 4, pp. 301–315, 2008.
- [79] F. Inanlou and M. Ghovanloo, "Wideband near-field data transmission using pulse harmonic modulation," *Circuits and Systems I: Regular Papers, IEEE Transactions on*, vol. 58, no. 1, pp. 186–195, 2011.

- [80] H. Miranda and T. H. Meng, "A programmable pulse uwb transmitter with 34% energy efficiency for multichannel neuro-recording systems," in *Custom Integrated Circuits Conference (CICC), 2010 IEEE.* IEEE, 2010, pp. 1–4.
- [81] M. S. Chae, Z. Yang, M. R. Yuce, L. Hoang, and W. Liu, "A 128-channel 6 mw wireless neural recording ic with spike feature extraction and uwb transmitter," *Neural Systems and Rehabilitation Engineering, IEEE Transactions on*, vol. 17, no. 4, pp. 312–321, 2009.
- [82] A. Ebrazeh and P. Mohseni, "30 pj/b, 67 mbps, centimeter-to-meter range data telemetry with an ir-uwb wireless link," *Biomedical Circuits and Systems, IEEE Transactions on*, vol. 9, no. 3, pp. 362–369, 2015.
- [83] P. V. Nikitin and K. S. Rao, "Theory and measurement of backscattering from rfid tags," *Antennas and Propagation Magazine, IEEE*, vol. 48, no. 6, pp. 212–218, 2006.
- [84] J. Pandey and B. P. Otis, "A sub-100 μW mics/ism band transmitter based on injection-locking and frequency multiplication," *Solid-State Circuits, IEEE Journal of*, vol. 46, no. 5, pp. 1049–1058, 2011.
- [85] G. Yilmaz and C. Dehollain, "Wireless energy and data transfer for neural recording and stimulation applications," in *Ph. D. Research in Microelectronics and Electronics (PRIME), 2013 9th Conference on*. IEEE, 2013, pp. 209–212.
- [86] R. R. Harrison, "Designing efficient inductive power links for implantable devices," in *Circuits and Systems, 2007. ISCAS 2007. IEEE International Symposium on.* IEEE, 2007, pp. 2080–2083.
- [87] A. Kurs, A. Karalis, R. Moffatt, J. D. Joannopoulos, P. Fisher, and M. Soljačić, "Wireless power transfer via strongly coupled magnetic resonances," *Science*, vol. 317, no. 5834, pp. 83–86, 2007.
- [88] M. Kiani, U.-M. Jow, and M. Ghovanloo, "Design and optimization of a 3-coil inductive link for efficient wireless power transmission," *Biomedical Circuits and Systems, IEEE Transactions on*, vol. 5, no. 6, pp. 579–591, 2011.
- [89] G. Yilmaz and C. Dehollain, "Single frequency wireless power transfer and full-duplex communication system for intracranial epilepsy monitoring," *Microelectronics Journal*, vol. 45, no. 12, pp. 1595–1602, 2014.
- [90] V. Majidzadeh, A. Schmid, and Y. Leblebici, "A 16-channel, 359 μw, parallel neural recording system using walsh-hadamard coding," in *Custom Integrated Circuits Conference (CICC), 2013 IEEE*. IEEE, 2013, pp. 1–4.
- [91] H. Hosseini-Nejad, A. Jannesari, and A. M. Sodagar, "Data compression in brain-machine/computer interfaces based on the walsh–hadamard transform," *IEEE transactions on biomedical circuits and systems*, vol. 8, no. 1, pp. 129–137, 2014.

- [92] C. Aprile, L. Baldassarre, V. Gupta, J. Yoo, M. Shoaran, Y. Leblebici, and V. Cevher, "Learning-based near-optimal area-power trade-offs in hardware design for neural signal acquisition," in *Proceedings of the 26th edition of Great Lakes Symposium on VLSI*. ACM, 2016, pp. 433–438.
- [93] C. Aprile, K. Ture, L. Baldassarre, M. Shoaran, G. Yilmaz, F. Maloberti, C. Dehollain, Y. Leblebici, and V. Cevher, "Adaptive learning-based compressive sampling for low-power wireless implants," 2018, under review.
- [94] C. Aprile, J. Wüthrich, L. Baldassarre, Y. Leblebici, and V. Cevher, "Dct learning-based hardware design for neural signal acquisition systems," in *Proceedings of the Computing Frontiers Conference*. ACM, 2017, pp. 391–394.
- [95] J. L. Bohorquez, A. P. Chandrakasan, and J. L. Dawson, "A 350 μW cmos msk transmitter and 400 μW OOK super-regenerative receiver for medical implant communications," *IEEE Journal of Solid-State Circuits*, vol. 44, no. 4, pp. 1248–1259, 2009.
- [96] F. C. Commission *et al.*, "Revision of part 15 of the commissions rules regarding ultra-wideband transmission systems. first report and order, et docket 98-153, fcc 02-48; adopted: February 14, 2002; released: April 22, 2002," 2002.
- [97] A. K. RamRakhyani, S. Mirabbasi, and M. Chiao, "Design and optimization of resonance-based efficient wireless power delivery systems for biomedical implants," *IEEE Transactions on Biomedical Circuits and Systems*, vol. 5, no. 1, pp. 48–63, 2011.
- [98] K. M. Silay, C. Dehollain, and M. Declercq, "Inductive power link for a wireless cortical implant with two-body packaging," *IEEE Sensors Journal*, vol. 11, no. 11, pp. 2825–2833, 2011.
- [99] S. J. Xilinx, "Virtex-5 lx fpga ml501 evaluation platform."
- [100] V. Stojanovic, "Channel-limited high-speed links: Modeling, analysis and design," Ph.D. dissertation, Stanford University, 2004.
- [101] IEEE International Solid-States Circuits Conference Tech Trends, 2017.
- [102] "International technology roadmap for semiconductors," Semiconduct. Ind. Assoc., Tech. Rep., 2005.
- [103] Intel, "Quickpath interconnect," January 2009.
- [104] C. HyperTransport Technology Consortium, Sunnyvale, "Hyper-Transport Link Specification, Rev. 3.10c," HTC20051222-0046-0003, 2010.
- [105] T. Oh and R. Harjani, "A 12-Gb/s multichannel I/O using MIMO crosstalk cancellation and signal reutilization in 65-nm CMOS," *Solid-State Circuits, IEEE Journal of*, vol. 48, no. 6, pp. 1383–1397, 2013.

- [106] S.-K. Lee, B. Kim, H.-J. Park, and J.-Y. Sim, "A 5 Gb/s single-ended parallel receiver with adaptive crosstalk-induced jitter cancellation," *Solid-State Circuits, IEEE Journal of*, vol. 48, no. 9, pp. 2118–2127, 2013.
- [107] F. D. Mbairi, W. P. Siebert, and H. Hesselbom, "High-frequency transmission lines crosstalk reduction using spacing rules," *IEEE transactions on components and packaging technologies*, vol. 31, no. 3, pp. 601–610, 2008.
- [108] J. F. Buckwalter and A. Hajimiri, "Cancellation of crosstalk-induced jitter," *IEEE journal of solid-state circuits*, vol. 41, no. 3, pp. 621–632, 2006.
- [109] W. Eisenstadt and D. E. Bockelman, "Common and differential crosstalk characterization on the silicon substrate," *IEEE microwave and guided wave letters*, vol. 9, no. 1, pp. 25–27, 1999.
- [110] W. Guggenb and G. Morbach, "Forward crosstalk compensation on bus lines," *Circuits and Systems I: Fundamental Theory and Applications, IEEE Transactions on*, vol. 40, no. 8, pp. 523–527, 1993.
- [111] K. Lee, H.-K. Jung, H.-J. Chi, H.-J. Kwon, J.-Y. Sim, and H.-J. Park, "Serpentine microstrip lines with zero far-end crosstalk for parallel high-speed DRAM interfaces," *Advanced Packaging, IEEE Transactions on*, vol. 33, no. 2, pp. 552–558, 2010.
- [112] J. Buckwalter and A. Hajimiri, "Cancellation of crosstalk-induced jitter," *Solid-State Circuits, IEEE Journal of*, vol. 41, no. 3, pp. 621–632, 2006.
- [113] K.-I. Oh, L.-S. Kim, K.-I. Park, Y.-H. Jun, J. S. Choi, and K. Kim, "A 5-gb/s/pin transceiver for ddr memory interface with a crosstalk suppression scheme," *IEEE Journal of Solid-State Circuits*, vol. 44, no. 8, pp. 2222–2232, 2009.
- [114] S.-J. Bae, K.-I. Park, J.-D. Ihm, H.-Y. Song, W.-J. Lee, H.-J. Kim, K.-H. Kim, Y.-S. Park, M.-S. Park, H.-K. Lee, S.-Y. Bang, G.-S. Moon, S.-W. Hwang, Y.-C. Cho, S.-J. Hwang, D.-H. Kim, J.-H. Lim, J.-S. Kim, S.-H. Kim, S.-J. Jang, J.-S. Choi, Y.-H. Jun, K. Kim, and S.-I. Cho, "An 80 nm 4 Gb/s/pin 32 bit 512 Mb GDDR4 graphics DRAM with low power and low noise data bus inversion," *Solid-State Circuits, IEEE Journal of*, vol. 43, no. 1, pp. 121–131, 2008.
- [115] K.-J. Sham, M. Ahmadi, S. Talbot, and R. Harjani, "Fext crosstalk cancellation for high-speed serial link design," in *Custom Integrated Circuits Conference, 2006. CICC '06. IEEE*, 2006, pp. 405–408.
- [116] M. Nazari and A. Emami-Neyestanak, "A 15-Gb/s 0.5-mW/Gbps two-tap DFE receiver with far-end crosstalk cancellation," *Solid-State Circuits, IEEE Journal of*, vol. 47, no. 10, pp. 2420–2432, 2012.

- [117] C. Aprile, A. Cevrero, P. A. Francese, C. Menolfi, M. Braendli, M. Kossel, T. Morf, L. Kull, I. Oezkaya, Y. Leblebici *et al.*, "An eight-lane 7-gb/s/pin source synchronous single-ended rx with equalization and far-end crosstalk cancellation for backplane channels," *IEEE Journal of Solid-State Circuits*, 2018.
- [118] A. Cevrero, C. Aprile, P. A. Francese, U. Bapst, C. Menolfi, M. Braendli, M. Kossel, T. Morf, L. Kull, H. Yueksel *et al.*, "A 5.9 mw/gb/s 7gb/s/pin 8-lane single-ended rx with crosstalk cancellation scheme using a xctle and 56-tap xdfe in 32nm soi cmos," in *VLSI Circuits (VLSI Circuits), 2015 Symposium on*. IEEE, 2015, pp. C228–C229.
- [119] T. Oh and R. Harjani, "A 6-Gb/s MIMO crosstalk cancellation scheme for high-speed I/Os," *Solid-State Circuits, IEEE Journal of*, vol. 46, no. 8, pp. 1843–1856, 2011.
- [120] A. Cevrero, "Advanced cmos circuits for multi-gb/s links and 3d i/o based on through silicon via technology," Ph.D. dissertation, EPFL, 2014.
- [121] T. Toifl, C. Menolfi, M. Ruegg, R. Reutemann, D. Dreps, T. Beukema, A. Prati, D. Gardellini, M. Kossel, P. Buchmann, M. Brandli, P. Francese, and T. Morf, "A 2.6 mW/Gbps 12.5 Gbps RX with 8-tap switched-capacitor DFE in 32 nm CMOS," *Solid-State Circuits, IEEE Journal of*, vol. 47, no. 4, pp. 897–910, 2012.
- [122] M. Kossel, C. Menolfi, J. Weiss, P. Buchmann, G. von Bueren, L. Rodoni, T. Morf, T. Toifl, and M. Schmatz, "A T-Coil-enhanced 8.5 Gb/s high-swing SST transmitter in 65 nm bulk CMOS with < -16 dB return loss over 10 GHz bandwidth," *Solid-State Circuits, IEEE Journal of*, vol. 43, no. 12, pp. 2905–2920, 2008.
- [123] S.-J. Bae, Y.-S. Sohn, T.-Y. Oh, S.-H. Kim, Y.-S. Yang, D.-H. Kim, S.-H. Kwak, H.-S. Seol, C.-H. Shin, M.-S. Park, G.-H. Han, B.-C. Kim, Y.-K. Cho, H.-R. Kim, S.-Y. Doo, Y.-S. Kim, D.-S. Kang, Y.-R. Choi, S.-Y. Bang, S.-Y. Park, Y.-J. Shin, G.-S. Moon, C.-G. Park, W.-S. Kim, H.-J. Yang, J.-D. Lim, K.-I. Park, J. S. Choi, and Y.-H. Jun, "A 40nm 2Gb 7Gb/s/pin GDDR5 SDRAM with a programmable DQ ordering crosstalk equalizer and adjustable clock-tracking BW," in *Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2011 IEEE International*, 2011, pp. 498–500.

## **Cosimo APRILE**

Avenue de la Rochelle 6 1008 Prilly, Switzerland +41 (0) 78 976 29 76 cosimo.aprile1988@gmail.com



#### Summary

Analog/Mixed-signal integrated circuit designer, with passion for innovation and new technologies.

#### PROFESSIONAL EXPERIENCE

#### Contractor (2014 - 2017)

IBM research, high speed serial I/O circuits group – Zurich, Switzerland

Circuit design, implementation and electrical characterization of a 7 Gbps/pin 8-lane single-ended receiver in 32 nm SOI CMOS, where the chip-to-chip data rate is boosted by advanced, low-power far-end crosstalk cancellation schemes.

#### Doctoral Researcher (April 2013 - Present)

**EPFL, Lions Lab and Microelectronic Systems Laboratory – Lausanne, Switzerland** Learning-based hardware design for data acquisition systems:

- Circuit design and silicon validation of algorithms applied for learning-based sampling techniques;
- Teaching assistant for the M.Sc. and B.Sc. courses: Test of VLSI, EDA-based design and IC Design;
- Advisor of one M.Sc. project, involving the hardware implementation of compression algorithms;
- Involved in project definitions and proposal writing for research funding.

#### Internship (July 2012 - March 2013)

#### IBM research, high speed serial I/O circuits group - Zurich, Switzerland

Design of a low-power receiver in 32 nm SOI CMOS able to efficiently remove far-end crosstalk in single-ended wireline communication systems.

#### Internship (June - September 2011)

IIT and CNR – Lecce, Italy

Development and characterization of planar organic electro-luminescence devices (OLETs).

#### **EDUCATION**

Doctor of Philosophy (PhD) in Electrical Engineering (April 2013 - Expected: August 2018)

#### EPFL – Switzerland

Advisors: Prof. Volkan Cevher and Prof. Yusuf Leblebici.

Master in Micro and Nanotechnologies for the Integrated Systems (2010 – 2012) EPFL – Switzerland, INP Grenoble-France, Politecnico di Torino – Italy

#### Bachelor in Electronic Engineering (2007 – 2010) INSA Lyon – France, Politecnico di Torino – Italy

#### **OTHER EXPERIENCES**

#### **CTI Entrepreneurship Training, Business Concept**

#### EPFL – Switzerland

Training on professional tools for transforming ideas into business projects.

#### Academia-Industry Training Camp

#### Swissnex, Rio de Janeiro - Brazil, Venturelab, Zurich - Switzerland

Connection to a wide network of peers, mentors and industry experts and acquisition of tools and skills to analyze the application of research-based ideas, to bridge the gap between academia and industry domains.

#### **SKILLS / INTERESTS**

- Languages: English: fluent (C1), French: fluent (C1), Italian (native speaker).
- Electrical characterization of integrated circuits/semiconductor devices using Automatic Test Equipment
- Knowledge of micro/nano fabrication techniques.
- CAD tools: Analog/Digital Design: Cadence Virtuoso IC, Encounter Digital Implementation (EDI), Synopsys IC/Design Compiler, Modelsim; PCB Design: Altium Designer;
- Programming languages: C, VHDL, Verilog-A, Matlab, Octave, Python. FPGA programming: ISE and Vivado tools;
- Personal Information: Italian nationality. I love running, climbing and cycling, and I have participated in national athletic competitions.