Thesis No. 8452

# EPFL

### Energy-Efficient Design Techniques for High-Speed Wireline Serial Links

Presented on 23 April 2021

School of Engineering, Microelectronic Systems Laboratory, Doctoral Program in Microsystems and Microelectronics

for the degree of Doctor of Science (Docteur ès Sciences)

by

### Fırat ÇELİK

Accepted on the proposal of the jury:

Prof. G. De Micheli, president of the jury; Prof. Y. Leblebici and Prof. A. P. Burg, thesis directors; Prof. F. Maloberti, Prof. A. Tajalli, and Prof. C. Dehollain, examiners

To my family...

## Acknowledgements

First of all, I would like to thank my thesis supervisor, Prof. Yusuf Leblebici, for giving me the opportunity to work at the Microelectronic Systems Laboratory during my M.Sc. and Ph.D. His encouragement, support, and trust helped me greatly throughout these years. I would also like to thank Prof. Andreas Burg for agreeing to be my thesis co-advisor; I am very thankful for the advice and guidance he has given me on my publications.

I am also grateful to the committee members, Prof. Giovanni De Micheli, Prof. Franco Maloberti, Prof. Armin Tajalli, and Prof. Catherine Dehollain, for evaluating my thesis and for their valuable comments.

I am thankful to Dr. Alain Vachoux and Dr. Alexandre Schmid for their support with the software tools and measurement equipment. I must also thank my colleagues at LSM for their friendship and collaboration, especially Duygu Kostak, Mustafa Kilic, Selman Ergunay, and Tugba Demirci, who taught me a lot and worked with me day and night.

I have made great friends during these years and I would like to thank them for sharing wonderful and memorable moments in Switzerland: Isinsu Katircioglu, Seniz & Deniz Eroglu, Nergiz Sahin & Cem Solmaz, Burak Hasircioglu, Cenk Ibrahim Ozdemir, Behnoush Attarimashalkoubeh, Gain Kim, Bilal Demir, Elmira Shahrabi, and Ahmet Caner Yuzuguler.

Most importantly, I cannot thank my family enough for their unlimited love and lifelong support in every aspect of my life. I would like to thank my mother Ayten Celik, my father Halil Ibrahim Celik, my brother Murat Celik, his wife Hulya Moral Celik, and my nephew Mert Celik.

Finally, I would like to express my gratitude to my love Ayca Akkaya for being the best friend, the best colleague, and the best girlfriend at the same time. Without a doubt, her endless support and encouragement made this thesis possible.

Lausanne, 15 January 2021

Firat Celik

## Abstract

The exponential growth in computing power and multimedia services has caused a tremendous increase in data traffic in recent years. This growth creates a strong demand for bandwidth in electrical input/output (I/O) links and continuously pushes up the data rates of interconnect standards. Technology scaling has improved I/O data rates and data processing power over the last decade; however, the bandwidth of copper links has not scaled accordingly. Therefore, advanced equalization techniques and higher modulation orders are increasingly needed to mitigate the severe intersymbol interference (ISI) that channel loss causes at high frequencies. All these techniques increase the power consumption of serial links, which makes energy-efficiency a key design target for today's wireline links.

In this thesis, we have investigated different design techniques to improve the energy-efficiency of high-speed wireline serial links. First, the energy-efficiency of current-mode and voltage-mode transmitters (TXs) is compared. For this comparison, two TX prototypes, one with a low-voltage differential signaling (LVDS) driver and one with a source-series-terminated (SST) driver, are designed in 28 nm FD-SOI for use in a multi-channel analog-to-digital converter (ADC) system, where they transfer the ADC output data to a field-programmable gate array (FPGA). The prototype TXs operate at 12.5 Gb/s, the maximum data rate supported by the JESD204B standard, and their power consumption is optimized for this data rate. Second, a comparative study analyzing the potential of pulse amplitude modulation (PAM) for high-speed copper wireline links is presented. Different modulation orders (PAM-2, PAM-4, and PAM-8) are simulated, and their eye diagrams are compared for different channels and data rates to understand the inherent limitations of each modulation scheme.
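The trade-off behind this comparison can be sketched with two idealized relations. The sketch assumes a fixed peak-to-peak swing and ignores channel loss, noise, and jitter, so it is only a first-order illustration, not the simulation method used in the thesis. For a bit rate $R_b$, PAM-M runs at a symbol rate $R_b/\log_2 M$, which lowers the Nyquist frequency the channel must support, while each of the $M-1$ stacked eyes is only $1/(M-1)$ as tall as the NRZ eye. The helper `pam_tradeoff` is a hypothetical name introduced here for illustration.

```python
import math

def pam_tradeoff(bit_rate_gbps: float, levels: int) -> dict:
    """Idealized PAM-M figures for a given bit rate (no ISI, noise, or jitter).

    Assumption: every modulation order uses the same peak-to-peak swing.
    """
    bits_per_symbol = math.log2(levels)
    baud_gbd = bit_rate_gbps / bits_per_symbol   # symbol rate in Gbaud
    nyquist_ghz = baud_gbd / 2                   # Nyquist frequency in GHz
    # The swing is split into M-1 stacked eyes, each 1/(M-1) of the NRZ eye height.
    eye_height_ratio = 1 / (levels - 1)
    # Equivalent amplitude penalty relative to NRZ, in dB.
    snr_penalty_db = 20 * math.log10(levels - 1)
    return {
        "baud_gbd": baud_gbd,
        "nyquist_ghz": nyquist_ghz,
        "eye_height_ratio": eye_height_ratio,
        "snr_penalty_db": snr_penalty_db,
    }

# Compare PAM-2, PAM-4, and PAM-8 at one bit rate.
for m in (2, 4, 8):
    r = pam_tradeoff(84.0, m)
    print(f"PAM-{m}: {r['baud_gbd']:.1f} Gbaud, Nyquist {r['nyquist_ghz']:.1f} GHz, "
          f"eye height x{r['eye_height_ratio']:.3f}, penalty {r['snr_penalty_db']:.1f} dB")
```

At 84 Gb/s, PAM-4 halves the Nyquist frequency of NRZ (21 GHz instead of 42 GHz) at the cost of about 9.5 dB of eye amplitude, and PAM-8 lowers it to 14 GHz at about 16.9 dB. Whether the lower Nyquist frequency outweighs the smaller eyes depends on the channel loss and equalization, which is what the full simulation study evaluates.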

Then, a high-impedance driver technique is proposed for high-speed PAM-4 SST TXs. The proposed technique significantly reduces the capacitive load, mitigates the high dynamic power consumption that is the main disadvantage of SST TXs, and reduces the power consumption of the whole TX by 20% compared to the conventional design. Measurement results show that the prototype PAM-4 TX with a 4-tap feed-forward equalizer (FFE), implemented in 28 nm FD-SOI, achieves an energy-efficiency of 2.4 pJ/bit at a data rate of 32 Gb/s.

Finally, a TX and receiver analog front-end (RX AFE) system compatible with very high-order modulation is presented to investigate the effect of the modulation order on energy-efficiency. Considering the equalization capability and the target moderate-loss channel, the optimum modulation order is determined through a modeling study. Based on this result, a PAM-16 compatible SST TX and an ADC-based RX AFE are designed in 28 nm FD-SOI with the objective of minimizing power consumption. The equalizer blocks, a continuous-time linear equalizer (CTLE) and a 2-tap FFE embedded in the ADC, both operate in the analog domain to avoid the drawbacks that high-order modulation brings. At 8 Gbaud, the TX consumes 26.85 mW and the RX AFE consumes 49.36 mW. The corresponding energy-efficiency of the whole system is 2.38 pJ/bit with PAM-16 at a data rate of 32 Gb/s.
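The energy-efficiency figures above follow from energy per bit = total power / data rate, where the data rate of PAM-M signaling is the symbol rate times $\log_2 M$. A minimal check using only the power and rate values stated in this abstract is sketched below; the function names are hypothetical and introduced only for illustration.

```python
import math

def data_rate_gbps(baud_gbd: float, levels: int) -> float:
    """Bit rate of PAM-M signaling: symbol rate times bits per symbol."""
    return baud_gbd * math.log2(levels)

def energy_per_bit_pj(power_mw: float, rate_gbps: float) -> float:
    """Energy efficiency in pJ/bit; (1e-3 J/s) / (1e9 b/s) = 1e-12 J/b, so mW / Gb/s = pJ/bit."""
    return power_mw / rate_gbps

# PAM-16 TX + RX AFE system at 8 Gbaud (power values as stated in the abstract).
rate = data_rate_gbps(8.0, 16)        # 4 bits/symbol -> 32 Gb/s
total_mw = 26.85 + 49.36              # TX + RX AFE power in mW
print(f"{rate:.0f} Gb/s, {total_mw:.2f} mW, {energy_per_bit_pj(total_mw, rate):.2f} pJ/bit")

# PAM-4 TX: back-calculated from the rounded 2.4 pJ/bit figure, not a measured power value.
print(f"PAM-4 TX power ~ {2.4 * 32:.1f} mW")
```

Because mW divided by Gb/s gives pJ/bit directly, the 76.21 mW of the PAM-16 system at 32 Gb/s reproduces the quoted 2.38 pJ/bit, and the PAM-4 TX figure corresponds to roughly 77 mW at the same data rate.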

**Keywords:** Wireline serial link, transmitter, TX, receiver, RX, transceiver, TRX, source-series-terminated, SST, pulse amplitude modulation, PAM, feed-forward equalizer, FFE, decision feedback equalizer, DFE, continuous-time linear equalizer, CTLE, analog-to-digital converter, ADC, ADC-based receiver, embedded equalizer.

## Résumé

La croissance exponentielle de la puissance de calcul et des services multimédias a entraîné une augmentation considérable du trafic de données ces dernières années. Cette augmentation du trafic de données entraîne une forte demande de bande passante de données pour les liaisons d'entrée/sortie (E/S) électriques et pousse en permanence les débits de données des normes d'interconnexion. La mise à l'échelle de la technologie a amélioré les débits de données d'E/S et la puissance de traitement des données au cours de la dernière décennie. Cependant, la bande passante des liaisons cuivre n'a pas été mise à l'échelle de la même manière. Par conséquent, le besoin de techniques d'égalisation avancées et d'ordres de modulation élevés a augmenté pour pouvoir atténuer les interférences inter-symboles élevées qui surviennent à des fréquences élevées en raison de la perte de canal. Toutes ces techniques apportent une consommation d'énergie accrue pour les liaisons série; par conséquent, l'efficacité énergétique des liaisons filaires est une nécessité importante sur laquelle se concentrer dans le monde d'aujourd'hui.

Dans cette thèse, nous avons étudié différentes techniques de conception pour améliorer l'efficacité énergétique des liaisons série filaires à haut débit. Tout d'abord, l'efficacité énergétique des émetteurs (TX) en mode courant et en mode tension est comparée. Pour cette comparaison, deux prototypes TX qui ont des pilotes de signalisation différentielle basse tension et de terminaison série source (SST), à utiliser dans un système de convertisseur analogique-numérique (CAN) multicanal pour transférer les données de sortie du CAN vers un réseau de portes programmables in situ, sont conçus en 28 nm FD-SOI. Les prototypes de TX fonctionnent à 12,5 Gb/s, ce qui est le débit de données maximal pris en charge par la norme JESD204B, et la consommation d'énergie des TX est optimisée pour ce débit de données. Deuxièmement, une étude comparative analysant les potentiels de la modulation d'impulsions en amplitude (PAM) pour la mise en œuvre de liaisons filaires en cuivre à grande vitesse est présentée. Différents ordres de modulation (PAM-2, PAM-4 et PAM-8) ont été simulés et leurs diagrammes oculaires sont comparés pour différents canaux et différents débits de données pour comprendre les limites inhérentes de ces schémas de modulation.

Ensuite, une technique de circuit d'attaque à haute impédance est proposée pour les TX PAM-4 SST haute vitesse. La technique de circuit d'attaque à haute impédance proposée fournit une réduction significative de la charge capacitive, diminue l'inconvénient de consommation d'énergie dynamique élevée des TX SST et introduit 20% de consommation d'énergie en moins pour l'ensemble du TX par rapport à la conception conventionnelle. Les résultats de mesure montrent que le prototype PAM-4 TX avec égaliseur à action directe (FFE) à 4 prises en 28 nm FD-SOI atteint une efficacité énergétique de 2,4 pJ/bit à un débit de données de 32 Gb/s.

Enfin, un système composé d'un TX et d'un frontal analogique de réception (RX AFE) compatibles avec une modulation d'ordre très élevé est présenté pour étudier l'effet de l'ordre de modulation sur l'efficacité énergétique. Compte tenu de la capacité d'égalisation et du canal cible à perte modérée, l'ordre de modulation optimal est déterminé par une étude de modélisation. Ensuite, le système composé du TX SST compatible PAM-16 et du RX AFE basé sur un CAN est conçu en 28 nm FD-SOI dans le but de minimiser la consommation d'énergie. Les blocs d'égalisation, qui sont un égaliseur linéaire en temps continu et un FFE à 2 prises intégré dans le CAN, fonctionnent tous deux dans le domaine analogique pour contourner les inconvénients qu'apporte la modulation d'ordre élevé. Le TX consomme 26,85 mW tandis que le RX AFE consomme 49,36 mW à 8 Gbauds. L'efficacité énergétique correspondante est de 2,38 pJ/bit pour l'ensemble du système avec la PAM-16 à un débit de données de 32 Gb/s.

**Mots-clés :** liaison série filaire, émetteur, TX, récepteur, RX, émetteur-récepteur, TRX, source-série-terminé, SST, modulation d'impulsions en amplitude, PAM, égaliseur par anticipation, FFE, égaliseur de rétroaction de décision, DFE, égaliseur linéaire en temps continu, CTLE, convertisseur analogique-numérique, CAN, récepteur CAN, égaliseur intégré.

## Contents

- Acknowledgements
- Abstract (English/Français)
- 1 Introduction
  - 1.1 Thesis Goal
  - 1.2 Organization and Content of the Thesis
- 2 Theory Review
  - 2.1 Technical Terms
  - 2.2 High-Speed Wireline Serial Link Overview
  - 2.3 Wireline Transmitter Basics
    - 2.3.1 Serializer
    - 2.3.2 Output Driver
  - 2.4 Wireline Receiver Basics
  - 2.5 Equalizer Circuits
    - 2.5.1 Channel Limitations
    - 2.5.2 FFE
    - 2.5.3 CTLE
    - 2.5.4 DFE
  - 2.6 Signaling Methods
- 3 JESD204B Compliant LVDS and SST Transmitters
  - 3.1 JESD204B Overview
  - 3.2 Time-Interleaved SAR ADC System Overview
  - 3.3 TX Circuit Implementations
    - 3.3.1 Serializer
    - 3.3.2 LVDS Driver
    - 3.3.3 SST Driver
  - 3.4 Simulation Results
  - 3.5 Measurement Results
  - 3.6 Conclusion
- 4 ISI Sensitivity of PAM Signaling
  - 4.1 Modulation Trend Overview
  - 4.2 Analysis Methodology
  - 4.3 Simulation Results
    - 4.3.1 Without Equalization
    - 4.3.2 With Ideal Equalization
    - 4.3.3 With Frequency Limited Equalization
  - 4.4 Summary and Discussion
- 5 Energy-Efficient PAM-4 SST Transmitter
  - 5.1 Transmitter Overview
  - 5.2 SST Driver Termination Study
  - 5.3 Transmitter Architecture
    - 5.3.1 Clock Generation
    - 5.3.2 Four-Tap FFE
    - 5.3.3 4:1 Serializer and SST Driver
    - 5.3.4 Output Pad-Network
    - 5.3.5 Power Saving Analysis of the Proposed Technique
  - 5.4 Measurement Results
  - 5.5 Conclusion
- 6 PAM-16 Transmitter and Receiver Analog Front-End
  - 6.1 High-Order Modulation Overview
  - 6.2 Comparative Modulation Order Study
  - 6.3 Transmitter
  - 6.4 ADC-Based RX AFE
    - 6.4.1 Overview
    - 6.4.2 CTLE
    - 6.4.3 8 GS/s Time-Interleaved SAR ADC
    - 6.4.4 1 GS/s Single-Channel SAR ADC and Embedded FFE Implementation
  - 6.5 Simulation Results
  - 6.6 Conclusion
- 7 Conclusion
  - 7.1 Future Work
- Bibliography
- List of Acronyms
- Curriculum Vitae
- List of Publications

## List of Figures

- 1.1 Global mobile data traffic forecast from 2017 to 2022 [1].
- 1.2 A Google data center in Oklahoma, USA [2].
- 1.3 Per-pin data rates of different wireline standards by year [3].
- 1.4 Data rates of wireline transceiver publications over the years [4].
- 1.5 Energy-efficiencies of wireline transceiver publications over the years [4].
- 2.1 General block diagram of a wireline serial link.
- 2.2 4-to-1 D flip-flop-based serializer schematic.
- 2.3 4-to-1 multiplexer-based serializer schematic.
- 2.4 Push-pull current-mode (LVDS) driver schematic.
- 2.5 Current-mode logic (CML) driver schematic.
- 2.6 Voltage-mode (a) high-swing source-series-terminated (SST) and (b) low-swing driver schematics.
- 2.7 StrongARM latch schematic.
- 2.8 The transfer function of a typical FR4-based stripline and loss contributors [5].
- 2.9 Input and output waveforms of an electrical channel when a pulse is applied.
- 2.10 FFE implementation at the transmitter.
- 2.11 Differential continuous-time linear equalizer schematic.
- 2.12 Transfer function of the CTLE that has one zero and two poles.
- 2.13 DFE implementation at the receiver.
- 2.14 Example waveforms of NRZ and PAM-4 signaling, showing twice the data rate for the same baud rate with PAM-4.
- 2.15 Eye diagrams of NRZ and PAM-4 signaling for the same data rate.
- 2.16 Power spectral density of NRZ and PAM-4.
- 3.1 Representation of the JESD204B standard showing the support of multiple lanes and deterministic latency at high speed [6].
- 3.2 Top-level block diagram of the 16-channel TI SAR ADC system.
- 3.3 Layout of the TI-ADC system.
- 3.4 Block diagram of the proposed JESD204B transmitter.
- 3.5 Serializer circuit.
- 3.6 LVDS driver circuit.
- 3.7 SST driver circuit.
- 3.8 Eye diagrams simulated at 12.5 Gb/s.
- 3.9 Power breakdown of the transmitters.
- 3.10 Chip micrograph.
- 3.11 Custom measurement PCB of the TXs.
- 3.12 Measurement setup of the TXs.
- 3.13 Eye diagrams measured at 12.1 Gb/s.
- 3.14 Eye diagram of the LVDS-P output with clock crosstalk effect.
- 4.1 Data rate and modulation type of recently published TX and TRX systems by year.
- 4.2 Structure of the test setup used to analyze the performance of the channel in response to different types of PAM.
- 4.3 Channel responses used for this study.
- 4.4 Example eye diagrams of PAM-2, PAM-4, and PAM-8 signaling at 84 Gb/s for channel A when no equalization is applied.
- 4.5 Vertical eye opening with increasing data rate for channel A when no equalization is applied.
- 4.6 Horizontal eye opening with increasing data rate for channel A when no equalization is applied.
- 4.7 Example eye diagrams of PAM-2, PAM-4, and PAM-8 signaling at 112 Gb/s for channel B when ideal equalization is applied.
- 4.8 Eye openings for channel A with equalization (CTLE, FFE, and DFE).
- 4.9 Eye openings for channel B with equalization (CTLE, FFE, and DFE).
- 4.10 Eye openings for channel C with equalization (CTLE, FFE, and DFE).
- 4.11 Eye openings for channel D with equalization (CTLE, FFE, and DFE).
- 4.12 Eye openings for channel A with EQ (limited-bandwidth CTLE, FFE, and DFE).
- 4.13 Eye openings for channel B with EQ (limited-bandwidth CTLE, FFE, and DFE).
- 4.14 Eye openings for channel C with EQ (limited-bandwidth CTLE, FFE, and DFE).
- 4.15 Eye openings for channel D with EQ (limited-bandwidth CTLE, FFE, and DFE).
- 5.1 Input capacitance of the SST driver for different $R_{MOS}$ values.
- 5.2 Proposed impedance termination structure for SST drivers with additional resistors ($R_{par}$).
- 5.3 SST driver resistance with respect to $R_{par}$ for 50 $\Omega$ termination.
- 5.4 Voltage swing at the receiver input with respect to $R_{par}$ for 50 $\Omega$ termination, considering a 1 V supply voltage.
- 5.5 Block diagram of the proposed PAM-4 SST TX.
- 5.6 Schematic of the 4-phase I/Q clock generation circuit.
- 5.7 Schematic of the 4-tap FFE block that creates 1-UI delay intervals.
- 5.8 Block diagram of the SST slices.
- 5.9 Schematic of the final 4:1 serializer stage.
- 5.10 Waveforms of the final 4:1 serializer.
- 5.11 Schematic of the SST driver.
- 5.12 Ratio of level mismatch versus $R_{par}$.
- 5.13 Simulation results showing the comparison between RC network and T-coil network in bandwidth.
- 5.14 Simulation results showing the comparison between RC network and T-coil network in return loss.
- 5.15 Chip micrograph.
- 5.16 Custom measurement PCB of the PAM-4 SST TX.
- 5.17 Cross section of the cavity in the custom measurement PCB.
- 5.18 The photograph of the wirebonded chip and the cavity.
- 5.19 Measurement setup.
- 5.20 Measured eye diagrams with 1-tap pre-cursor and 2-tap post-cursor equalization for (a) 32 Gb/s PAM-4 and (b) 16 Gb/s NRZ.
- 5.21 Power breakdown of the TX.
- 6.1 Energy-efficiency of the published TRX systems versus channel loss [3].
- 6.2 Structure of the test setup used to analyze the performance of different PAM orders.
- 6.3 Frequency response of the channel.
| 6.4 Horizontal opening trend for different PAM orders                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| 6.5 Block diagram of the TX                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| 6.6 Schematic of the SST driver                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| 6.7 Layout of an SST slice                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| 6.8 Block diagram of the ADC-based RX AFE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| 6.9 Schematic of the CTLE followed by the ADC input buffer 81                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| 6.10 Resistance trim between nodes A and B of CTLE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| 6.11 Capacitor trim between nodes A and B of CTLE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| 6.12 (a) Resistance trim settings and (b) capacitance trim settings of the CTLE 83                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| 6.13 Pulse response of the channel, CTLE, and FFE outputs                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| 6.14 Block diagram of the single-channel 1 GS/s 7-bit SAR ADC with 2-tap embedded                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| analog FFE and the clock signal timing diagram                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| 6.15 Schematic of the Comparator                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| 6.16 Layout of TX and RX AFE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| 6.17 TX output eye diagram                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| 6.18 CTLE+Buffer output eye diagram                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| 6.19 FFE output eye diagram                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| 6.20 FFE output timing bathtub curve (UI = 125ps)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| 6.21 FFE output horizontal histogram                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| 6.22 FFE output horizontal openings of each eye                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| 6.23 Histogram of the ADC output codes (a) for PAM-16 at 32 Gb/s and (b) PAM-8 at                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| 24 Gb/s                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| 6.24 Extrapolated probability distribution of the ADC output codes (a) for PAM-16 at                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| 32 Gb/s and (b) PAM-8 at 24 Gb/s                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| 6.25 Power breakdown of the TX                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |

## List of Tables

| No. | Title |
|-----|-------|
| 3.1 | Performance comparison of the JESD204B-compliant transmitters with other similar works |
| 5.1 | Current consumption comparison between the proposed (simulated) and conventional (estimated) designs for critical blocks in the data path |
| 5.2 | Performance comparison with other similar SST TXs |
| 6.1 | Performance comparison with state-of-the-art PAM-4 RX or TRX systems |

## **1** Introduction

The exponential growth in computing power, multimedia services, cloud services, and the Internet of Things (IoT) has caused a tremendous increase in global IP network data traffic in recent years. Global mobile data traffic is estimated to reach 77 exabytes per month by 2022, a seven-fold increase over the traffic in 2017 [1]. The compound annual growth rate (CAGR) between 2017 and 2022 is expected to be 46%, and the trend is shown in Figure 1.1. Another similar white paper forecasts that the total number of Internet users will increase from 3.4 billion in 2017 to 4.8 billion by 2022 [7]. Moreover, the share of global IP traffic due to video viewing is expected to rise from 75% in 2017 to 82% by 2022. This dominance of video content in global data traffic is fueled by the rapid increase in the pixel resolution of the content and, consequently, of display devices such as monitors and televisions. Given such a dramatic increase in data traffic, and the fact that the large majority of this traffic comes from consumer video content, improving the performance of service providers' data centers is essential.

A data center is a facility that houses storage and computing resources to allow the sharing of data and applications. A typical data center consists of multiple server racks that include storage systems, processors, routers, and switches, interconnected by wireline connections such as electrical or optical cables. An example of a large data center, located in Oklahoma, USA, is shown in Figure 1.2. To satisfy the huge demand for video content, the data communication bandwidth in data centers must be increased aggressively. Faster data transfer can be achieved by employing faster transceiver (TRX) blocks on the server boards, which communicate with other modules over electrical links following various communication standards. Moreover, the energy consumption of data centers is at least as important as their bandwidth. Since data centers are highly energy-intensive facilities, responsible for 1% of the electricity use worldwide [8], their energy usage must be carefully taken into account. Although their energy demand might at first glance be expected to grow sharply, the actual trend is less dramatic thanks to improvements in their energy-efficiency. The ever-increasing bandwidth, and the resulting energy demand, must therefore be met by continuously improving energy-efficient design techniques for wireline serial links.



Figure 1.1: Global mobile data traffic forecast from 2017 to 2022 [1].



Figure 1.2: A Google data center in Oklahoma, USA [2].

In the last decades, electrical links have been the key blocks for meeting the increasing demand for data bandwidth across electronic systems. Advancements in wireline input/output (I/O) systems have enabled great technological innovations in electronic components by increasing the per-pin data rate. To deal with the challenges of high-speed wireline data transmission, many different standards have been developed by various institutions. Figure 1.3 shows the per-pin data rate of common I/O standards over the years. The per-pin data rate doubles approximately every four years across all the I/O standards shown [3]. One of the standards has already reached 100 Gb/s, and the others are expected to follow soon. Although point-to-point interconnect standards such as Common Electrical I/O (CEI), QuickPath Interconnect/Keizer Technology Interconnect (QPI/KTI), and Peripheral Component Interconnect Express (PCIe) generally have higher per-pin data rates than the other standards, the exponential growth (a straight line on a logarithmic scale) is common to all of them.



Figure 1.3: Per pin data rates of different wireline standards by years [3].



Figure 1.4: Data rates of wireline transceiver publications over years [4].



Figure 1.5: Energy-efficiencies of wireline transceiver publications over the years [4].


While aggressive technology scaling has increased the demand for I/O speed, it has also contributed significantly to the improvement in the data rate and power efficiency of wireline links. In the last decade, wireline I/O speed has benefited significantly from advanced technology nodes. Figure 1.4 shows the data rates of wireline TRX publications over the years; the data rates have risen dramatically in the last five years. Several studies in the literature have achieved data rates above 100 Gb/s with standalone transmitter (TX) or receiver (RX) systems, which suggests that 100 Gb/s will soon be achieved with complete TRX systems as well. Figure 1.5 shows the corresponding energy-efficiency improvement over the years.

Technology scaling has improved I/O data rates and data processing power over the last decade. However, the bandwidth of copper links has not scaled similarly. Therefore, advanced equalization techniques and higher modulation orders are increasingly needed to compensate for the severe inter-symbol interference (ISI) that arises at high frequencies. Binary non-return-to-zero (NRZ) signaling has been widely used to implement links at data rates up to 56 Gb/s [9]. The limited bandwidth of copper links has initiated a transition to higher-order modulations such as 4-level pulse-amplitude modulation (PAM-4) [10], which has been used extensively in recent TX designs [11, 12, 13, 14, 15, 16, 17] thanks to its spectral efficiency. NRZ (PAM-2) is less efficient at high data rates mainly because, for the same data rate, it requires twice the symbol rate of PAM-4, and hence more channel bandwidth and faster silicon. There are also various studies on alternative modulation schemes such as correlated non-return-to-zero (CNRZ) [18], discrete multi-tone (DMT) [19], and NRZ/multi-tone [20].
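The bandwidth argument can be made concrete with two textbook relations for PAM-M signaling. The sketch below assumes ideal symbols, equally spaced levels, and a fixed peak-to-peak swing, and ignores noise, jitter, and equalization: the Nyquist frequency is half the symbol rate $R_b/\log_2 M$, and each of the $M-1$ stacked eyes is $1/(M-1)$ of the NRZ eye height. The function names are hypothetical and only illustrate the trade-off.

```python
import math

def nyquist_hz(data_rate_bps: float, levels: int) -> float:
    """Nyquist frequency of PAM-M: half of the symbol rate R_b / log2(M)."""
    return data_rate_bps / math.log2(levels) / 2

def eye_penalty_db(levels: int) -> float:
    """Eye-height reduction relative to NRZ at equal peak-to-peak swing.

    M equally spaced levels split the swing into M - 1 stacked eyes,
    so each eye is 1 / (M - 1) of the NRZ eye height.
    """
    return 20 * math.log10(levels - 1)

for m in (2, 4, 8, 16):
    print(f"PAM-{m:<2d} at 56 Gb/s: Nyquist = {nyquist_hz(56e9, m) / 1e9:5.2f} GHz, "
          f"eye penalty = {eye_penalty_db(m):5.2f} dB")
```

For 56 Gb/s, NRZ has a Nyquist frequency of 28 GHz while PAM-4 needs only 14 GHz, at the cost of roughly 9.5 dB of vertical eye opening.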

#### 1.1 Thesis Goal

The recent increase in the data rates of wireline serial links has benefited many electronic systems. This development must be supported by continuous improvement in the energy-efficiency of wireline links. Therefore, the goal of this thesis is to propose new circuit techniques and systems that improve the energy-efficiency of high-speed wireline links. To this end, we studied and compared different state-of-the-art TX and RX architectures, as well as the main limiting factors of high-speed wireline communication systems. Moreover, we propose an optimal modulation scheme for different cases, new circuit design techniques, and complete transceiver systems that achieve better energy-efficiency figures of merit (FoM) while maintaining a high data rate. For this purpose, several systems have been designed and taped out in 28 nm FD-SOI CMOS technology.

#### 1.2 Organization and Content of the Thesis

#### **Chapter Two**

Chapter 2 reviews the fundamental circuits and techniques that are widely used in state-of-the-art serial wireline links. For this purpose, a typical complete TRX is shown and its building blocks are detailed. Frequently used equalizer circuits are shown and their operation is explained, along with common TX and RX systems. The channel characteristics and the concept of inter-symbol interference (ISI) are also briefly reviewed.

#### **Chapter Three**

In Chapter 3, we briefly explain a multi-channel SAR ADC system and the JESD204B standard to motivate the work. We then present an LVDS TX and an SST TX, both compliant with the JESD204B standard and implemented in 28 nm FD-SOI CMOS technology. We explain the design choices made to improve the energy-efficiency and experimentally compare the two designs in terms of design complexity, power consumption, area usage, and signal integrity performance.

#### **Chapter Four**

In this chapter, we present a comparative architectural study to analyze the potential and ISI sensitivity of PAM signaling for next-generation high-speed copper links. This modeling work focuses on the inherent limitations of different PAM orders for channels with different loss characteristics at different data rates. To make the conclusions more practical, the study also includes equalization blocks and circuit bandwidth limitations.

#### **Chapter Five**

In this chapter, we propose a high-impedance driver technique to decrease the power consumption of SST TXs. The proposed technique is employed in a 32 Gb/s quarter-rate PAM-4 SST TX with a four-tap feed-forward equalizer (FFE) implemented in 28 nm FD-SOI CMOS technology. The analysis of the proposed technique and the details of the designed system are shown. The contribution of the technique is supported by the power consumption and energy-efficiency results of the system.

#### **Chapter Six**

In Chapter 6, we propose a complete TX and RX AFE system using PAM-16 signaling for a moderate-loss channel. The presented system includes a 32 Gb/s PAM-16 SST TX, a continuous-time linear equalizer (CTLE), and an 8-channel 8 GS/s 7-bit time-interleaved (TI) SAR ADC with an embedded 2-tap analog FFE. The modulation order is selected based on an architectural study. The objective is to improve the figure of merit, and the design choices were made accordingly. The reasons behind these choices are explained, and a comparison with state-of-the-art systems is provided.

#### Conclusion

Finally, Chapter 7 gives the main contributions of this thesis and discusses future work.

## **2** Theory Review

Technological developments in IC fabrication, along with innovative circuit design techniques, have brought exponential growth in the performance of ICs. The bandwidth of wireline communication systems must keep pace with these advancements to improve the overall system performance. To provide continuous bandwidth improvement, designers have been working on different TX and RX architectures, equalization techniques, and signaling schemes. In this chapter, we introduce the technical terms used in I/O links and give an overview of high-speed wireline links and their building blocks. Different circuit topologies are explained, and the main limiting factors are presented. Finally, methods to overcome these limitations, such as equalizer circuits and signaling schemes, are explained.

#### 2.1 Technical Terms

In this section, some of the fundamental terms are defined. The remaining essential terms are introduced in the following sections of this chapter.

**Data rate** is the transfer speed of information through a link. Its unit is usually bits per second (b/s).

**Baud rate** is the number of symbols transferred per second through a link, where a symbol can carry more than one bit. Its unit is baud (Bd), i.e., symbols per second.

**Unit interval (UI)** is the time between two consecutive symbols, i.e., the inverse of the baud rate.

**Energy-efficiency** is the energy consumed to send one bit, i.e., the total power divided by the data rate. It is a key figure of merit in I/O systems, and its unit is joule per bit (J/b), typically expressed in pJ/b.

**Bit error rate (BER)** is the ratio of the number of erroneous bits to the total number of transferred bits.
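The definitions above are related by simple arithmetic, sketched here for PAM-M signaling (each symbol carries $\log_2 M$ bits). The 32 Gb/s PAM-16 case matches the 125 ps UI used later in the thesis; the 100 mW power and the error count are hypothetical values chosen only to illustrate the units, and all function names are illustrative.

```python
import math

def baud_rate(data_rate_bps: float, levels: int) -> float:
    """Symbols per second for PAM-M: each symbol carries log2(M) bits."""
    return data_rate_bps / math.log2(levels)

def unit_interval_s(baud_rate_bd: float) -> float:
    """Unit interval: time between two consecutive symbols (1 / baud rate)."""
    return 1.0 / baud_rate_bd

def energy_per_bit_j(power_w: float, data_rate_bps: float) -> float:
    """Energy-efficiency in J/b: total power divided by data rate."""
    return power_w / data_rate_bps

def bit_error_rate(error_bits: int, total_bits: int) -> float:
    """Ratio of erroneous bits to all transferred bits."""
    return error_bits / total_bits

# 32 Gb/s with PAM-16 (4 bits/symbol): 8 GBd, UI = 125 ps.
fb = baud_rate(32e9, 16)
print(f"{fb / 1e9:.0f} GBd, UI = {unit_interval_s(fb) * 1e12:.0f} ps")
# Hypothetical power of 100 mW at 32 Gb/s corresponds to 3.125 pJ/b.
print(f"{energy_per_bit_j(0.1, 32e9) * 1e12:.3f} pJ/b")
# Hypothetical measurement: 3 errors observed in 1e12 transferred bits.
print(f"BER = {bit_error_rate(3, 10**12):.1e}")
```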

**Pseudo-random bit stream (PRBS)** is a bit sequence generated with a deterministic algorithm. It is commonly used as test data, in place of the circuit that normally feeds data to a wireline I/O system, to measure the performance of the link. Although it is deterministic, the stream is similar enough to a truly random sequence to assess the link performance.
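A PRBS is typically produced by a linear-feedback shift register (LFSR). The sketch below assumes a 7-bit Fibonacci LFSR with taps at stages 7 and 6, i.e., the commonly quoted PRBS7 polynomial $x^7 + x^6 + 1$; the thesis does not state which PRBS order its test setups use, and the function name is hypothetical. Any nonzero seed yields a maximal-length sequence that repeats every $2^7 - 1 = 127$ bits and contains 64 ones and 63 zeros per period.

```python
from typing import List

def prbs7(n_bits: int, seed: int = 0x7F) -> List[int]:
    """PRBS7 from a 7-bit Fibonacci LFSR with taps at stages 7 and 6.

    The output repeats every 2**7 - 1 = 127 bits for any nonzero seed.
    """
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be nonzero (the all-zero state locks up)")
    bits: List[int] = []
    for _ in range(n_bits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1  # XOR of stages 7 and 6
        state = ((state << 1) | new_bit) & 0x7F       # shift the feedback bit in
        bits.append(new_bit)
    return bits

print("".join(str(b) for b in prbs7(32)))
```

Higher orders (for example PRBS15 or PRBS31) follow the same structure with a longer register and different taps, giving longer runs of identical bits that stress the link more.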



Figure 2.1: General block diagram of a wireline serial link.

#### 2.2 High-Speed Wireline Serial Link Overview

A typical wireline serial link is shown in Figure 2.1. The transmitter takes the low-speed parallel input, serializes it into high-speed data, and passes it to an output driver. The output driver sends the data to the channel in a given signaling format. On the receiver side, the received signal is converted back to digital data and deserialized to recover the transmitted parallel data. For clocking, the TX can include a clock generation and distribution circuit. On the RX side, the clock signal can be forwarded from the TX or recovered internally from the received data using a clock and data recovery (CDR) circuit.

In wireline transceivers, data is sent through a transmission line (T-line). When the wire delay is comparable to the transition time of the carried signal, the wire must be treated as a transmission line. Assume that the voltage generated by the TX driver is applied to a T-line with a characteristic impedance $Z_0$. If the other end of the T-line is terminated by a load impedance $Z_{RX}$ that is not equal to the characteristic impedance of the channel ($Z_{RX} \neq Z_0$), signal reflection occurs at the load (RX) side. In other words, a portion of the signal is reflected back toward the TX, creating an unwanted voltage at the input of the receiver. The reflection coefficient is defined as the ratio of the voltage of the reflected wave to the voltage of the incident wave:

$$\Gamma = \frac{V_{reflected}}{V_{incident}} = \frac{Z_{RX} - Z_0}{Z_{RX} + Z_0}$$

where $\Gamma$ is the reflection coefficient, $Z_{RX}$ is the termination impedance at the receiver input, and $Z_0$ is the characteristic impedance of the T-line.
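The reflection coefficient can be evaluated directly for a few purely resistive terminations. This is a minimal sketch assuming a real-valued $Z_{RX}$ (no parasitic capacitance) and an illustrative $Z_0 = 50\ \Omega$; the function name is hypothetical.

```python
import math

def reflection_coefficient(z_load: float, z0: float) -> float:
    """Gamma = (Z_RX - Z0) / (Z_RX + Z0) for a real (resistive) termination."""
    if math.isinf(z_load):
        return 1.0  # open circuit: the whole wave is reflected with the same sign
    return (z_load - z0) / (z_load + z0)

z0 = 50.0  # ohms, illustrative single-ended characteristic impedance
for z_rx in (0.0, 25.0, 50.0, 100.0, math.inf):
    print(f"Z_RX = {z_rx:>5} ohm -> Gamma = {reflection_coefficient(z_rx, z0):+.3f}")
```

A short circuit gives $\Gamma = -1$, a matched load gives $\Gamma = 0$, and an open end gives $\Gamma = +1$; a 2:1 mismatch in either direction reflects one third of the incident wave.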

The reflected wave travels back to the TX side and, if the TX is not properly terminated either, it creates a second reflection there. Therefore, without proper termination, the signal is reflected back and forth, creating unwanted signals at both ends.

If the transmitter output impedance ($Z_{TX}$) is not equal to $Z_0$, reflection occurs at the transmitter side as well. Each end of the T-line has its own reflection coefficient, and the voltage at any point can be found as the sum of the initial condition and all the waves that have passed through that point. If the RX side is properly terminated, the incident signal is absorbed at the RX and, in principle, no reflection is generated. However, the termination impedances are not constant over the whole frequency range because of parasitic capacitances, so reflections cannot be completely eliminated at all frequencies. That is why double termination is preferred in high-speed wireline links: with the TX also matched to $Z_0$, any residual reflection from the RX is absorbed at the TX instead of bouncing back. In the remainder of this chapter and in the following chapters, double termination is assumed.



Figure 2.2: 4-to-1 D Flip-Flop-based serializer schematic.
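The sum-of-waves view of reflections can be sketched as a simple lattice (bounce-diagram) calculation. The sketch assumes a lossless line, purely resistive terminations, and an ideal voltage step $V_{src}$ applied behind $Z_{TX}$ at $t = 0$; it only tracks the RX-end voltage after each arrival of the travelling wave, and the function name is hypothetical.

```python
from typing import List

def rx_voltage_steps(v_src: float, z_tx: float, z0: float, z_rx: float,
                     n_arrivals: int = 10) -> List[float]:
    """RX-end voltage after each arrival of a step launched at t = 0.

    Each arrival adds the incident wave plus its reflection,
    v_inc * (1 + Gamma_RX); the reflected part then bounces off the TX
    (Gamma_TX) and returns one round trip later.
    """
    g_rx = (z_rx - z0) / (z_rx + z0)
    g_tx = (z_tx - z0) / (z_tx + z0)
    v_inc = v_src * z0 / (z0 + z_tx)  # wave launched by the TX divider
    v_rx = 0.0
    history: List[float] = []
    for _ in range(n_arrivals):
        v_rx += v_inc * (1 + g_rx)    # incident + reflected wave at the RX
        history.append(v_rx)
        v_inc *= g_rx * g_tx          # wave arriving after the next round trip
    return history

# Double termination: the RX settles to v_src / 2 on the first arrival.
print(rx_voltage_steps(1.0, 50.0, 50.0, 50.0, n_arrivals=3))
# Low-impedance driver into a high-impedance RX: ringing around the final value.
print([round(v, 3) for v in rx_voltage_steps(1.0, 10.0, 50.0, 1e6, n_arrivals=6)])
```

The geometric series converges to the DC value $V_{src} Z_{RX}/(Z_{TX}+Z_{RX})$, but only the matched case reaches it without ringing, which is the motivation for double termination.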

#### 2.3 Wireline Transmitter Basics

The two main building blocks of a wireline transmitter are the serializer and the output driver. Therefore, these two circuits are reviewed in the following subsections. The serializer structures that are used in TXs are very similar to the deserializer structures used in RXs. Therefore, only serializer schematics are given and reviewed.

#### 2.3.1 Serializer

#### **DFF-Based Serializer**

A 4-to-1 DFF-based serializer schematic is shown in Figure 2.2. In this circuit, the parallel data is loaded into the internal nodes of a shift register using the load signal. The data is then shifted through the register by a fast serial clock, which in this example runs at four times the frequency of the parallel clock. With the arrival of each parallel data word, the load signal is activated and the multiplexer in front of each flip-flop selects its parallel data input. The main disadvantage of the DFF-based serializer is that it requires a high-speed shift register chain, which in high-speed links means very fast flip-flops and distribution of the full-rate clock. On the other hand, the advantage of this structure is that the number of parallel bits at the serializer input is very flexible. In other words, the serializer can be N-to-1, where N is any integer (not necessarily a power of two).

Figure 2.3: 4-to-1 multiplexer-based serializer schematic.
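
To make the load-and-shift operation concrete, the following Python sketch models an N-to-1 DFF-based serializer at the bit level. It is only a behavioral illustration under our own assumptions, not the circuit of Figure 2.2: the function name `dff_serialize` is hypothetical, bit 0 of each word is assumed to leave first, and clock timing is ignored.

```python
def dff_serialize(words, n=4):
    """Behavioral N-to-1 DFF-based serializer (load, then shift).

    Hypothetical helper: bit 0 of each word is assumed to be sent first.
    """
    serial = []
    for word in words:
        if len(word) != n:
            raise ValueError("each parallel word must contain exactly n bits")
        # Load: the multiplexers in front of the flip-flops select the
        # parallel inputs, so the whole word enters the register at once.
        register = list(word)
        # Shift: one bit leaves per fast (serial) clock cycle, so the serial
        # clock must run n times faster than the parallel clock.
        for _ in range(n):
            serial.append(register[0])
            register = register[1:] + [0]
    return serial


# Two 4-bit words become one 8-bit serial stream.
print(dff_serialize([[1, 0, 1, 1], [0, 0, 1, 0]]))  # [1, 0, 1, 1, 0, 0, 1, 0]
# The same model works for any N, e.g. a 5-to-1 serializer.
print(dff_serialize([[1, 1, 0, 0, 1]], n=5))        # [1, 1, 0, 0, 1]
```

Nothing in the model restricts `n` to a power of two, which mirrors the flexibility noted above.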

#### **MUX-Based Serializer**

There are different types of MUX-based serializers. A tree-type MUX-based 4-to-1 serializer architecture is shown in Figure 2.3. It consists of three 2-to-1 serializers and can be extended with more stages if needed. The circuit uses full-rate, half-rate, and quarter-rate clock signals; Figure 2.3 shows the full-rate architecture. In high-speed links, the full-rate architecture is usually avoided because it requires full-rate clock distribution and a full-rate flip-flop, which can limit the achievable data rate. If the final flip-flop is removed, the architecture becomes half-rate: both the full-rate clock signal and the flip-flop operating at full rate are eliminated. However, the serial output then depends on the duty cycle of the clock signal.
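
A bit-level sketch of the half-rate tree (without the final retiming flip-flop) is given below. It is our own illustration: the helper names are hypothetical, timing and duty-cycle effects are not modelled, and the pairing of inputs in the first stage is an assumption chosen so that each parallel word leaves in the order d0, d1, d2, d3.

```python
def mux2(first, second):
    """One 2-to-1 stage: alternate between its two inputs.

    In hardware the select is a clock at half of this stage's output rate,
    passing `first` in one clock phase and `second` in the other, which is
    why the serial output depends on the clock duty cycle.
    """
    out = []
    for a, b in zip(first, second):
        out.extend([a, b])
    return out


def tree_serialize_4to1(d0, d1, d2, d3):
    """Tree-type 4-to-1 serializer built from three 2-to-1 stages.

    Each argument is one parallel input lane (one bit per parallel clock
    cycle). Pairing (d0, d2) and (d1, d3) in the first stage (an assumption)
    makes the serial order of every word d0, d1, d2, d3.
    """
    even = mux2(d0, d2)      # first stage, quarter-rate select clock
    odd = mux2(d1, d3)       # first stage, quarter-rate select clock
    return mux2(even, odd)   # final stage, half-rate select clock


# The four lanes carry the words [1, 0, 1, 1] and [0, 0, 1, 0].
print(tree_serialize_4to1([1, 0], [0, 0], [1, 1], [1, 0]))  # [1, 0, 1, 1, 0, 0, 1, 0]
```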

A multiplexer-based serializer can also be designed with transmission-gate-based multiplexers. In that case, a separate select signal must be generated for each bit. Moreover, structures larger than 4-to-1 are not practical, because the capacitive load at the shared output node grows with the number of inputs.

#### 2.3.2 Output Driver

There are two main driver types, namely current-mode and voltage-mode. In a current-mode driver, the data is represented by a current value. If the data is logic-1 the output is the current  $I_{Bias}$ . If the data is logic-0, the output is represented as the current  $-I_{Bias}$ . Therefore, current-mode drivers need current steering switches. On the other hand, in voltage-mode drivers,



Figure 2.4: Push-pull current-mode (LVDS) driver schematic.

data is represented by a voltage value. Logic-1 becomes  $V_{DD}$  and logic-0 becomes GND. Therefore, the voltage-mode drivers need low-impedance switches with series impedance termination. In this subsection, different current-mode and voltage-mode driver architectures will be reviewed.

#### **Current-Mode**

The first current-mode driver architecture reviewed is the push-pull driver. This driver is widely used in the low-voltage differential signaling (LVDS) standard, so it is also called the LVDS driver; its schematic is given in Figure 2.4. It consists of a PMOS and an NMOS current source and two switches similar to inverters. Because of the current-mode structure, parallel termination is employed and the termination resistors are placed between the output nodes. The currents of the PMOS and NMOS current sources are constant, so the LVDS driver has a good power supply rejection ratio (PSRR) and low current noise. However, the voltage swing is limited by the headroom needed by the two stacked current sources, especially in advanced technologies with low supply voltages. For the double termination case, the peak-to-peak differential voltage swing is $V_{ppd} = 2IZ_0$, where $I$ is the current of the current sources and $Z_0$ is the characteristic impedance of the target channel. The LVDS driver also includes a common-mode feedback (CMFB) loop to set the common-mode voltage of the output nodes, because a current mismatch between the current sources may shift the DC levels of the outputs. To prevent this, one of the bias voltages of the current sources is provided by an OTA in the CMFB loop, as shown in Figure 2.4. This addition increases the circuit complexity and area. Finally, the output impedance can be trimmed by adjusting the termination resistors.



Figure 2.5: Current-mode logic (CML) driver schematic.

Second, the current-mode logic (CML) driver schematic is given in Figure 2.5. Since it is also a current-mode driver, the parallel termination technique is used. The termination resistors are placed between the output nodes and the supply voltage, so the common-mode voltage of the output signal is higher than $V_{DD}/2$. This high output common-mode voltage also helps keep the current source in the saturation region. Because it uses only one current source, this architecture is more suitable for lower supply voltages than the LVDS driver; however, its PSRR is lower. For the double termination case, the peak-to-peak differential voltage swing is $V_{ppd} = IZ_0$, where $I$ is the current of the current source and $Z_0$ is the characteristic impedance of the target channel. In other words, the CML driver needs twice the current of the LVDS driver to reach the same voltage swing. Finally, the output impedance of the CML architecture can also be trimmed by changing the resistors.

#### Voltage-Mode

Two different voltage-mode drivers are reviewed. The first is the source-series-terminated (SST) driver shown in Figure 2.6(a). This driver is also called a CMOS driver because it is fundamentally a CMOS inverter. Without termination, its voltage swing is naturally rail-to-rail. For proper termination, the NMOS and PMOS transistors are ideally sized such that their ON-resistances equal the characteristic impedance of the channel. For the double termination case, the peak-to-peak differential voltage swing is $V_{ppd} = 4IZ_0$, where $I$ is the DC current of the driver and $Z_0$ is the characteristic impedance of the target channel. The swing/current efficiency is therefore four times that of the CML driver and two times that of the LVDS driver. The difference comes from the series termination of voltage-mode drivers: there is only one current path, and all the current passes through the load termination. The low-swing voltage-mode driver schematic is given in Figure 2.6(b). In the low-swing driver, all the transistors are NMOS and the swing is dependent on the threshold voltage



Figure 2.6: Voltage-mode (a) high-swing source-series-terminated (SST) and (b) low-swing driver schematics.

of the NMOS transistors. Nevertheless, its swing/current efficiency is the same as that of the SST driver.
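
The swing expressions above follow from the resistance that the driver current sees in each topology. The Python sketch below (our illustration; the function names are hypothetical, and ideal switches and purely resistive terminations are assumed) derives the three peak-to-peak swings from those resistances instead of hard-coding the results, which makes the 1 : 2 : 4 ratio between CML, LVDS, and SST explicit.

```python
def parallel(r1, r2):
    """Equivalent resistance of two resistors in parallel."""
    return r1 * r2 / (r1 + r2)


def vppd_lvds(i, z0):
    # Push-pull current mode: I flows through the TX and RX differential
    # terminations (2*Z0 each) in parallel; the data flips its direction.
    return 2 * i * parallel(2 * z0, 2 * z0)


def vppd_cml(i, z0):
    # Pull-only current mode: the differential output amplitude is
    # I * (Z0 || Z0) = I * Z0 / 2 (TX resistor to VDD || RX half-termination).
    return 2 * i * parallel(z0, z0)


def vppd_sst(i, z0):
    # Voltage mode: one series path VDD -> Z0 -> RX 2*Z0 -> Z0 -> GND, so the
    # whole current I flows through the RX termination; data reverses it.
    return 2 * i * (2 * z0)


for name, f in [("LVDS", vppd_lvds), ("CML", vppd_cml), ("SST", vppd_sst)]:
    print(name, round(f(4e-3, 50.0) * 1e3, 1), "mVppd at 4 mA")
# LVDS 400.0 mVppd at 4 mA
# CML 200.0 mVppd at 4 mA
# SST 800.0 mVppd at 4 mA
```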

There are different impedance control methods for these drivers. One method is to use series-stacked active resistors: a PMOS bank that adjusts the impedance of the PMOS path and an NMOS bank that adjusts the impedance of the NMOS path are placed in series with the main driver transistors. Another method is to use a segmented driver, in which individual segments can be enabled or disabled; changing the number of active segments changes the termination impedance accordingly. Finally, the gate voltage of the driver transistors can be tuned for impedance trimming. This can be realized by preceding the driver with a digital stage whose supply voltages are controlled. In this way, the on-resistance of the transistors can be set equal to the target termination impedance.
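
For the segmented approach, enabled segments act as resistors in parallel, so the calibration reduces to choosing how many segments to turn on. The sketch below is a minimal illustration under stated assumptions: each segment is modelled as a fixed resistor, and the per-corner segment resistances (480, 600, and 780 Ω) and the 16-segment limit are hypothetical values, not data from a real design.

```python
def pick_segments(r_segment, r_target, n_max):
    """Choose how many identical driver segments to enable.

    Enabled segments are in parallel, so n segments give r_segment / n.
    Returns the count whose impedance is closest to r_target.
    """
    n = min(range(1, n_max + 1), key=lambda k: abs(r_segment / k - r_target))
    return n, r_segment / n


# Hypothetical per-segment ON-resistance at three process corners.
for corner, r_seg in [("fast", 480.0), ("typical", 600.0), ("slow", 780.0)]:
    n, r = pick_segments(r_seg, 50.0, 16)
    print(f"{corner}: {n} segments -> {r:.1f} ohm")
```

The residual error at the fast and slow corners shows why the segment granularity must be chosen together with the return-loss target.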

#### 2.4 Wireline Receiver Basics

The most critical block in a receiver system is the slicer, as its task is to decide whether the received data is logic-1 or logic-0. The slicer output is given to a deserializer circuit, and the timing between these two blocks is critical. An ideal slicer should have high speed, low noise, low offset, and no memory effect. The strong-arm latch is widely used as a slicer in wireline links; its schematic is given in Figure 2.7.

The strong-arm latch does not consume static power thanks to its dynamic nature, which makes it a low-power latch. Unlike CML latches, it also produces full CMOS output voltage levels. In addition, this circuit has a very small memory effect. On the other hand, the



Figure 2.7: Strong-Arm Latch schematic.

circuit is very sensitive to the input common-mode voltage, because the input transistors lose their gain significantly when they enter the triode region. The circuit has different operation phases. When the clock signal goes high, the input differential pair is activated and operates as an amplifier and an integrator. After that, first the NMOS pair and then the PMOS pair of the latch turn on, and the regeneration phase starts. A decision is made depending on the charge pulled from the dynamic nodes, and the outputs are driven to VDD and GND. Finally, the reset phase starts and returns all the nodes to their initial values to prepare the circuit for the next sample.

In a multi-level signaling scheme, more than one slicer can be used, and that structure corresponds to a flash ADC. Increasing the energy-efficiency of the receiver is key in such systems; therefore, power-efficient high-speed ADC architectures are widely used in the RX of wireline links with multi-level signaling such as four-level pulse-amplitude modulation (PAM-4). ADC-based RX systems also make it possible to implement strong equalization in the digital domain.



Figure 2.8: The transfer function of a typical FR4-based stripline and loss contributors [5].

#### 2.5 Equalizer Circuits

#### 2.5.1 Channel Limitations

The data rate of high-speed wireline links can be limited either by the operating frequency of the circuits or by the channel used for wireline communication. Thanks to rapid technology scaling, the performance metrics of integrated circuits have improved significantly. On the other hand, copper wires and even optical fibers have limited bandwidth, and this is usually the main factor limiting the data rate in wireline links.

Figure 2.8 shows the forward transfer function of a typical FR4-based stripline and the main contributors to its loss. One of the loss contributors is dielectric loss, the inherent dissipation in the dielectric material that attenuates the signal energy. Another contributor shown in the same figure is conductor loss. At RF frequencies, the sheet resistance of the conductor metal can be much higher than its DC sheet resistance because of the skin effect, which forces the current to flow near the outer surface (skin) of the conductor at high frequencies. As a result, the effective thickness of the conductor decreases and its effective resistance increases. Together, these frequency-dependent effects give electrical channels an overall low-pass characteristic, as shown in Figure 2.8.
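
The skin depth $\delta = \sqrt{\rho / (\pi f \mu_0 \mu_r)}$ quantifies how thin the conducting layer becomes. The sketch below evaluates it for copper; the resistivity value (about 1.68e-8 Ω·m at room temperature) and the frequencies are illustrative assumptions. Because $\delta$ falls as $1/\sqrt{f}$, the conductor loss grows roughly with $\sqrt{f}$ once $\delta$ is smaller than the trace thickness.

```python
import math

MU0 = 4 * math.pi * 1e-7   # vacuum permeability (H/m), classical value
RHO_CU = 1.68e-8           # approximate copper resistivity at room temperature (ohm*m)


def skin_depth(f_hz, rho=RHO_CU, mu_r=1.0):
    """Skin depth delta = sqrt(rho / (pi * f * mu0 * mu_r)), in meters."""
    return math.sqrt(rho / (math.pi * f_hz * MU0 * mu_r))


for f in (1e9, 4e9, 16e9):
    print(f"{f / 1e9:>4.0f} GHz: {skin_depth(f) * 1e6:.2f} um")
```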

Intersymbol interference (ISI) is the distortion that a bit causes on neighboring bits, mainly on the subsequent ones. The main source of ISI is the limited channel bandwidth. ISI reduces the voltage and timing margins and degrades the bit error rate (BER) performance of communication systems. The time-domain pulse response of an electrical channel is given in Figure 2.9, where the pulse applied to the channel is shown in black. The signal amplitude



Figure 2.9: Input and output waveforms of an electrical channel when a pulse is applied.

observed at the output of the channel is shown in red. The signal is delayed and is no longer a clean pulse. The peak of the response is called the main cursor. The samples before the peak (taken at 1 UI intervals) are called pre-cursors, while the samples after the peak are called post-cursors. This typical pulse response shows that post-cursor ISI is usually a bigger problem than pre-cursor ISI. To improve the data rates of electrical links, different systems that compensate for the loss and equalize the channel response have therefore been designed. Equalization is the most popular technique to extend the usable bandwidth of wireline channels and resolve the ISI problem. Ideally, the transfer function of the equalizer is the inverse of the transfer function of the channel. These equalization techniques are reviewed in the following subsections.
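
The cursor terminology can be made concrete with UI-spaced samples of a pulse response. In the sketch below, the sample values are hypothetical (chosen to resemble Figure 2.9, with two pre-cursors and five post-cursors), and the worst-case eye is computed for ±1 NRZ data as the main cursor minus the sum of the absolute ISI terms, i.e., the smallest distance of a sample from the decision threshold.

```python
def split_cursors(pulse):
    """Split a UI-spaced pulse response into pre-cursors, main cursor, post-cursors.

    The main cursor is taken as the largest sample.
    """
    k = max(range(len(pulse)), key=lambda i: pulse[i])
    return pulse[:k], pulse[k], pulse[k + 1:]


def worst_case_eye(pulse):
    """Worst-case distance from the threshold for +/-1 NRZ data:
    main cursor minus the sum of the absolute ISI terms."""
    pre, main, post = split_cursors(pulse)
    return main - sum(abs(c) for c in pre + post)


# Hypothetical UI-spaced samples of a lossy-channel pulse response (V).
pulse = [0.02, 0.08, 0.45, 0.22, 0.10, 0.05, 0.03, 0.01]
pre, main, post = split_cursors(pulse)
print(len(pre), main, len(post))        # 2 0.45 5
print(round(worst_case_eye(pulse), 2))  # -0.06, i.e. the eye is closed
```

A negative result means that some bit patterns cross the threshold even without noise, which is exactly the situation equalization has to fix.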

#### 2.5.2 FFE

Feed-forward equalization (FFE) is a finite impulse response (FIR) filter that can be used on both the TX side and the RX side. The main reason for implementing FFE on the TX side is that a high-speed DAC for the TX is easier to design than a high-speed ADC for the RX [21]. In this subsection, the FFE implementation on the TX side is discussed. FFE is the most common equalization technique employed in TXs. The FFE pre-distorts the signal with the inverse of the channel distortion. This creates a high-pass response that ideally cancels the low-pass behavior of the channel and provides a flat overall frequency response. The disadvantage of FFE is that it reduces the SNR: since the TX output swing is limited, the relative high-frequency boost is obtained by attenuating the low-frequency part of the signal. This attenuation of the low-frequency content, also called de-emphasis, reduces the signal power and thereby degrades the SNR.



Figure 2.10: FFE implementation at the transmitter.

The block diagram of the TX-side FFE implementation is shown in Figure 2.10. The FIR filter operates as a high-speed DAC. The filter length should be chosen considering the pulse response and the number of pre-cursors and post-cursors to be cancelled. For simplicity, 2 pre-cursors and 2 post-cursors are considered, as shown. Using flip-flops, versions of the TX data delayed by multiples of one unit interval (UI) are created. These delayed data are then multiplied by different weights, which must be determined beforehand by considering the waveform at the channel output, because the TX has no information about the signal at the RX input. The result is no longer a binary signal but a multi-level signal that cancels the analog pre-cursor and post-cursor values. For example, for the channel response shown in Figure 2.9, analog voltages lower than the logic-0 value should be sent for both the pre-cursor and the post-cursor taps. Also, if the remaining ISI is still not acceptable, the number of post-cursor taps should be increased, since five post-cursors are observed in the pulse response. In the frequency domain, FFE attenuates the low-frequency components by decreasing the amplitude of long runs of identical bits, while it keeps the amplitude of the high-frequency components by increasing the pulse amplitude at transitions. As a result, a high-frequency boosting characteristic is obtained.
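
The FIR operation described above can be sketched in a few lines of Python. This is our illustration, not the circuit of Figure 2.10: the tap values are hypothetical, symbols outside the sequence are treated as zero, and a pre-cursor tap is assumed to weight the next (future) symbol. The example shows the de-emphasis effect: a long run of identical bits is sent at 0.4 of the peak amplitude, while a toggling pattern keeps 0.8.

```python
def tx_ffe(bits, taps, n_pre):
    """Transmit-side FFE: y[n] = sum_k c_k * x[n - k], k = -n_pre ... n_post.

    bits : list of 0/1, mapped to NRZ symbols -1/+1.
    taps : [c_-n_pre, ..., c_0, ..., c_n_post]; c_0 is the main tap.
    Pre-cursor taps (k < 0) weight future symbols, post-cursor taps past ones.
    Symbols outside the sequence are treated as 0.
    """
    x = [1.0 if b else -1.0 for b in bits]
    y = []
    for n in range(len(x)):
        acc = 0.0
        for j, c in enumerate(taps):
            m = n - (j - n_pre)          # index of the symbol seen by this tap
            if 0 <= m < len(x):
                acc += c * x[m]
        y.append(acc)
    return y


# Hypothetical 5-tap setting (2 pre, main, 2 post) with |taps| summing to 1.
TAPS = [-0.05, -0.10, 0.70, -0.10, -0.05]
run = tx_ffe([1] * 9, TAPS, n_pre=2)         # low-frequency pattern
toggle = tx_ffe([1, 0] * 5, TAPS, n_pre=2)   # highest-frequency pattern
print(round(run[4], 2), round(toggle[4], 2))  # 0.4 0.8
```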

#### 2.5.3 CTLE

The continuous-time linear equalizer (CTLE) circuit is given in Figure 2.11. In this circuit, a differential pair with resistive and capacitive source degeneration is employed. The load is a resistor ($R_L$) in parallel with a capacitor ($C_L$). The circuit operates as follows. If the input signals are assumed to be differential, which is the standard case at the RX input, the circuit can be split in half. The half circuit consists of an NMOS transistor, a source degeneration resistor $R_S/2$ with a capacitor $2C_S$ in parallel, a load capacitor $C_L$, and a resistor $R_L$. Since the half circuit processes half of the differential input and produces half of the differential output, its transfer function is identical to that of the complete CTLE [22].



Figure 2.11: Differential continuous-time linear equalizer schematic.

When a resistor is placed at the source terminal, the gain is decreased. At low frequencies the capacitor behaves as an open circuit, so the full degeneration applies and the gain is low. As the frequency increases, the impedance of the capacitor decreases and it progressively shorts the resistor. This reduces the source degeneration and increases the gain at high frequencies, which provides the high-frequency gain boosting. Ignoring channel-length modulation and assuming a sufficiently large output impedance for the differential pair, the voltage gain can be written as [22]:

$$\frac{V_{out}}{V_{in}} = \frac{-g_m R_L}{\alpha} \times \frac{1 + sC_s R_s}{1 + sC_s R_s/\alpha} \times \frac{1}{1 + sC_L R_L} \tag{2.1}$$

where $\alpha$ is the degeneration factor, given by:

$$\alpha = 1 + g_m R_s / 2 \tag{2.2}$$

Equation 2.1 shows that the system has one zero and two poles, whose frequencies are:

$$f_z = \frac{1}{2\pi C_s R_s} \tag{2.3}$$


Figure 2.12: Transfer function of the CTLE that has one zero and two poles.

$$f_{p1} = \frac{\alpha}{2\pi C_s R_s} \tag{2.4}$$

$$f_{p2} = \frac{1}{2\pi C_L R_L} \tag{2.5}$$

As can be seen from Equations 2.3 and 2.4, the zero and the first pole frequencies can be changed by tuning $C_s$ and $R_s$. Therefore, these components are usually made tunable to cover a wider range of channel characteristics. When a more complex equalization function is needed, multiple CTLE stages can be cascaded. One zero and two poles create the transfer function shown in Figure 2.12. It has a high-pass characteristic that compensates for the low-pass response of the electrical channel in the frequency range of interest.
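
Equations 2.1 to 2.5 can be evaluated directly, as in the sketch below. All component values are hypothetical and chosen only for illustration. Since the gain approaches $g_m R_L$ between $f_{p1}$ and $f_{p2}$ while the DC gain is $g_m R_L/\alpha$, the ideal peaking is bounded by $\alpha$ (here 5, i.e. about 14 dB); the actual peak is lower when $f_{p1}$ and $f_{p2}$ are close, as they are with these values.

```python
import math


def ctle_response(f, gm, rs, cs, rl, cl):
    """Complex voltage gain of Eq. (2.1) at frequency f (Hz)."""
    s = 2j * math.pi * f
    alpha = 1 + gm * rs / 2                                   # Eq. (2.2)
    return ((-gm * rl / alpha)
            * (1 + s * cs * rs) / (1 + s * cs * rs / alpha)
            / (1 + s * cl * rl))


# Hypothetical component values, chosen only for illustration.
GM, RS, CS, RL, CL = 20e-3, 400.0, 200e-15, 500.0, 20e-15
alpha = 1 + GM * RS / 2
f_z = 1 / (2 * math.pi * CS * RS)         # Eq. (2.3)
f_p1 = alpha / (2 * math.pi * CS * RS)    # Eq. (2.4)
f_p2 = 1 / (2 * math.pi * CL * RL)        # Eq. (2.5)
print(f"fz = {f_z / 1e9:.2f} GHz, fp1 = {f_p1 / 1e9:.2f} GHz, fp2 = {f_p2 / 1e9:.2f} GHz")
print(f"DC gain = {abs(ctle_response(0, GM, RS, CS, RL, CL)):.2f}, "
      f"ideal peaking bound = {20 * math.log10(alpha):.1f} dB")
```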

In conclusion, the CTLE can approximate the inverse transfer function of the electrical channel by boosting the high-frequency components. If boosting is desired at very high frequencies, the inductive peaking technique is frequently used to extend the bandwidth of the CTLE. Although the CTLE is often said to equalize both pre-cursor and post-cursor ISI, our simulations have shown that it mainly equalizes the post-cursors and has a minimal effect on the pre-cursor. Programmable R-C degeneration allows covering different electrical channel characteristics.

### 2.5.4 DFE

The decision feedback equalizer (DFE) is another equalizer circuit that is frequently employed at the receiver side. The block diagram of a DFE structure that cancels out three post-cursors



Figure 2.13: DFE implementation at the receiver.

is given in Figure 2.13. The DFE system includes a slicer, delay elements, weight multipliers, and summers/subtractors in a feedback loop. The slicer decides the bit polarity; the decision is then delayed, multiplied by the proper weight, and fed back to the input. Because only past decisions are available in the loop, the DFE can only equalize post-cursors.
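
The feedback loop can be illustrated with the noise-free Python sketch below. The channel coefficients (a main cursor of 0.4 and three post-cursors) and the data pattern are hypothetical, and the DFE weights are simply set equal to the post-cursors. Plain slicing makes an error on this pattern, while the DFE recovers every bit because it subtracts the ISI predicted from its own past decisions.

```python
def channel(x, h):
    """UI-spaced channel: y[n] = sum_k h[k] * x[n - k]; h[0] is the main cursor."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def dfe(y, weights):
    """DFE with len(weights) post-cursor taps operating on +/-1 decisions."""
    d = []
    for n, sample in enumerate(y):
        # Feedback: weighted past decisions d[n-1], d[n-2], ...
        fb = sum(w * d[n - 1 - k] for k, w in enumerate(weights) if n - 1 - k >= 0)
        d.append(1 if sample - fb >= 0 else -1)   # slicer
    return d


H = [0.4, 0.3, 0.2, 0.1]                  # hypothetical: main cursor + 3 post-cursors
tx = [1, 1, 1, -1, -1, 1, -1, 1, 1, -1]
rx = channel(tx, H)
no_eq = [1 if v >= 0 else -1 for v in rx]
print(no_eq == tx, dfe(rx, H[1:]) == tx)  # False True
```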

The DFE is similar to an IIR filter, but there are a few differences. The DFE is a non-linear system, whereas an IIR filter is linear. Moreover, because the DFE includes a quantization stage (the slicer) in its loop and the slicer output is noiseless, no noise is fed back, and consequently the DFE does not amplify noise. An IIR filter, in contrast, has no quantization stage and enhances noise, because an analog voltage is delayed, multiplied, and subtracted from the input. The most challenging part of high-speed DFE design is the limited time for closing the feedback loop, as all operations must be completed within only 1 UI.

## 2.6 Signaling Methods

Non-return-to-zero (NRZ) is a modulation technique that has been dominant in wireline communication systems for decades. It uses two voltage levels to represent logic-1 and logic-0, so it is also called 2-level pulse-amplitude modulation (PAM-2), and it transmits 1 bit/symbol. The main reason for the popularity of NRZ signaling is its simplicity: the signal can naturally be generated by CMOS logic. The receiver is equally simple; with differential signaling, the only required operation is a comparison of the two signals to detect the transmitted bit polarity. However, as the demand for higher-speed data transmission has increased, using NRZ has become more and more difficult, considering the limited bandwidth of



Figure 2.14: Example waveforms of NRZ and PAM-4 signaling, showing twice the data rate for the same Baud rate with PAM-4.

electrical channels.

4-level pulse amplitude modulation (PAM-4) is a very popular modulation scheme that is widely employed to overcome the limited channel bandwidth. In this signaling, each symbol takes one of four possible voltage levels, and each level corresponds to two consecutive bits, so the signaling carries 2 bits/symbol. Example waveforms for NRZ and PAM-4 are given in Figure 2.14. Although this PAM-4 waveform is not Gray-coded, Gray coding is usually employed in current wireline systems, because adjacent levels then differ in only one bit and an incorrect decision between neighboring levels causes only one bit error per symbol. With linear (natural binary) coding, an error across the middle threshold flips both bits, so, averaged over the three decision thresholds, a symbol error causes 4/3 bit errors. In other words, linear coding produces about 33% more bit errors than Gray coding; equivalently, Gray coding reduces the BER by about 25%.
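
The 4/3 figure can be checked by counting how many bits differ between adjacent levels in each mapping. The sketch below assumes equiprobable symbols and uniformly spaced levels, so that each of the three decision thresholds is crossed equally often; the mapping tables are the standard natural-binary and Gray assignments.

```python
# Bit pairs assigned to PAM-4 levels 0 (lowest) to 3 (highest).
BINARY = [(0, 0), (0, 1), (1, 0), (1, 1)]
GRAY = [(0, 0), (0, 1), (1, 1), (1, 0)]


def bit_errors_per_symbol_error(mapping):
    """Average bit errors when a symbol is detected as an adjacent level,
    assuming each of the three thresholds is crossed equally often."""
    flips = [sum(a != b for a, b in zip(mapping[i], mapping[i + 1]))
             for i in range(3)]
    return sum(flips) / len(flips)


b = bit_errors_per_symbol_error(BINARY)
g = bit_errors_per_symbol_error(GRAY)
print(round(b, 3), g, f"{1 - g / b:.0%}")   # 1.333 1.0 25%
```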

It can be seen from Figure 2.14 that by using PAM-4 signaling, the same bit stream can be transmitted in half the time of NRZ signaling. Therefore, ideally, the data rate of a wireline system can be doubled with the same operating frequency. Moreover, if the target data rate is constant, the same data rate can be achieved with circuits operating at half the frequency with PAM-4, compared to NRZ signaling. In this way, the operating frequency and the corresponding channel loss can be decreased.

The eye diagram is the most common indicator of signal integrity in wireline links. It is generated by overlaying short segments (typically one or two UIs long) of a long data stream. The eye diagrams of NRZ and PAM-4 signals are given in Figure 2.15. For the same data rate, the UI of PAM-4 is double that of the NRZ signal. For the same voltage swing, and assuming uniformly spaced levels, each PAM-4 eye has 1/3 of the NRZ eye height. Therefore, the SNR loss of PAM-4 relative to NRZ is:

$$\text{SNR Loss} = 20 \times \log_{10}\left(\frac{1}{3}\right) \approx -9.5\ \text{dB} \tag{2.6}$$
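
The same reasoning extends to PAM-M: with equal peak swing and uniform levels, each of the M-1 eyes is 1/(M-1) as tall as the NRZ eye. The short sketch below (our generalization; the function name is hypothetical) reproduces Equation 2.6 for M = 4.

```python
import math


def pam_snr_loss_db(m):
    """Eye-height (SNR) penalty of PAM-M vs NRZ at equal peak swing and
    uniformly spaced levels: each of the M-1 eyes is 1/(M-1) as tall."""
    return 20 * math.log10(1 / (m - 1))


print(f"{pam_snr_loss_db(4):.2f} dB")   # -9.54 dB
```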



Figure 2.15: Eye diagrams of NRZ and PAM-4 signaling for the same data rate.



Figure 2.16: Power spectral density of NRZ and PAM-4.

The power spectral densities (PSDs) of amplitude-modulated NRZ and PAM-4 signals are given in [23] as:

$$S_{NRZ}(f) = 10 \times \log_{10}\left(\left|\text{sinc}^2(Tf)\right|\right) \tag{2.7}$$

$$S_{PAM4}(f) = 10 \times \log_{10}\left(\left|\text{sinc}^2(2Tf)\right|\right) \tag{2.8}$$

where $T$ is the bit period (1/data rate). The PSDs of NRZ and PAM-4 signals with the same data rate ($2f$ Gb/s) are shown in Figure 2.16. In wireline links, the Nyquist frequency is the fundamental tone of the spectrum of a signal that toggles between the highest and lowest voltage levels at the highest possible rate. The plot shows that PAM-4 requires half the bandwidth of NRZ, and the corresponding Nyquist frequency is $f/2$ for PAM-4 instead of $f$.
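
Equations 2.7 and 2.8 can be evaluated numerically as below, assuming the normalized sinc, $\text{sinc}(x) = \sin(\pi x)/(\pi x)$, and a hypothetical data rate of 28 Gb/s (so $f$ = 14 GHz). Both signals are about 3.9 dB down at their own Nyquist frequency, but for PAM-4 that frequency is half as high, and its first spectral null falls where NRZ still has its Nyquist tone.

```python
import math


def sinc(x):
    """Normalized sinc: sin(pi x) / (pi x)."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def psd_db(f, t_bit, bits_per_symbol):
    """Eqs. (2.7)/(2.8): the symbol period is bits_per_symbol * T."""
    val = sinc(bits_per_symbol * t_bit * f) ** 2
    return 10 * math.log10(max(val, 1e-30))   # floor avoids log(0) at the nulls


T = 1 / 28e9   # hypothetical 28 Gb/s data rate, i.e. 2f with f = 14 GHz
print(f"NRZ at its Nyquist (14 GHz):  {psd_db(14e9, T, 1):.2f} dB")
print(f"PAM-4 at its Nyquist (7 GHz): {psd_db(7e9, T, 2):.2f} dB")
```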

# **3** JESD204B Compliant LVDS and SST Transmitters<sup>1</sup>

This chapter presents an LVDS TX and an SST TX, both compliant with the JESD204B standard and designed in 28 nm FD-SOI CMOS technology, to transmit the output of a 16-channel 10-bit 12 GS/s SAR ADC to a field-programmable gate array (FPGA). The objective of this work is to examine the differences between current-mode and voltage-mode TX architectures in terms of design complexity, power consumption, area, and signal integrity. The measured prototype LVDS and SST TXs achieve energy efficiencies of 1.1 pJ/bit and 1.7 pJ/bit at 12.5 Gb/s, respectively. In both designs, an external 6.25 GHz single-ended clock, terminated internally with a mid-common-mode termination, is used for half-rate operation, and both designs achieve open eye diagrams at 12.5 Gb/s. From architecture selection to circuit design, power consumption is minimized while maintaining the maximum data rate supported by the JESD204B standard. High-speed standard-cell electrostatic discharge (ESD) diodes are employed at the pads to achieve >1 kV HBM ESD protection while adding 200-250 fF of loading capacitance.

This chapter is organized as follows. A brief background on the JESD204B standard is given in Section 3.1. The 16-channel 10-bit 12 GS/s SAR ADC system is summarized in Section 3.2. The motivation behind the architecture choices and the design considerations are given in Section 3.3. Simulation and measurement results are then given in Sections 3.4 and 3.5, respectively. Finally, Section 3.6 concludes the chapter.

## 3.1 JESD204B Overview

JESD204B is a serial interface standard for data converters, whose first version was released in April 2006. The standard has received updates to improve its compatibility with data converter interfaces. Recently, the sampling rate and the precision of data converters have increased

<sup>1</sup>This chapter is based on: F. Celik, A. Akkaya, A. Tajalli, A. Burg, and Y. Leblebici, "JESD204B Compliant 12.5 Gb/s LVDS and SST Transmitters in 28 nm FD-SOI CMOS," *2019 15th Conference on Ph.D Research in Microelectronics and Electronics (PRIME)*, Lausanne, Switzerland, 2019, pp. 101-104, (©IEEE) [24].





significantly. Because of these advancements, older interfaces have become insufficient. The two main advantages of the JESD204B standard over previous interfaces are a much higher supported data rate and a much lower pin count. The JESD204 standard has received two revisions, namely JESD204A and JESD204B, to improve multi-channel data converter compatibility [25, 6].

The two predecessor versions of JESD204B, which are JESD204 and JESD204A, support data rates up to 3.125 Gb/s. JESD204B was released in 2011 as the third version of the same standard with very important improvements such as a higher data rate of 12.5 Gb/s, and support for deterministic latency of the lanes [6]. The block diagram of the JESD204B standard is given in Figure 3.1. Since this standard is widely used by FPGA manufacturers, one of the primary application areas is to transfer multi-channel ADC data to FPGAs. A JESD204B compliant transmitter circuit is needed alongside the ADC for such applications.

The standard has some specifications for the TX such as 360 mVppd minimum differential swing, between 80  $\Omega$  and 120  $\Omega$  differential impedance, and minimum 8 dB differential output return loss. Different types of transmitters with current-mode [26], [27] or voltage-mode [28], [29] topologies can be found in the literature. To find the most suitable architecture, both current-mode and voltage-mode driver architectures are examined in this study. Two different JESD204B compliant transmitters are designed and compared in terms of various parameters.

## 3.2 Time-Interleaved SAR ADC System Overview

The transmitters described in this chapter are intended to be used in a 16-channel time-interleaved 10-bit SAR ADC system to transmit the ADC outputs to an FPGA. This section gives an overview of the 16-channel TI SAR ADC system. Even though this TI SAR ADC system was completed after the standalone design and test of the LVDS and SST transmitters



Figure 3.2: Top level block diagram of the 16-channel TI SAR ADC system.

explained in this chapter, the system overview is given here to better show the motivation. The block diagram of the TI ADC system is given in Figure 3.2.

The differential input signal of the ADC is received by 50 $\Omega$ internal termination resistors. The termination voltage for the input signal is an external common-mode voltage, ideally $V_{DD}/2$ for this design. Using an input distribution network, the differential input signal is distributed to 16 identical SAR ADC channels with the same delay and parasitics. Each sub-ADC has a sampling rate of 750 MS/s, and the total sampling rate of the 16-channel TI ADC is 12 GS/s. Therefore, the supported input signal frequency can reach 6 GHz (the Nyquist frequency). The direct interleaving technique is employed because of its simplicity. The differential clock signal for the ADC is generated externally and received by 50 $\Omega$ internal termination resistors, similar to the input signal. The clock signal is then received by a DLL block that generates the sampling clocks of the ADC channels. These sampling clocks are generated such that only two of the channels sample the input at the same time, to limit the bandwidth reduction. The JESD204B protocol is applied to the ADC output by a digital block. In addition, the ADC data is connected to a 4K SRAM so that the standalone ADC performance can be tested without the JESD204B protocol and transmitter circuits.

The 10-bit 12 GS/s ADC generates a data throughput of 120 Gb/s. With 8b/10b encoding, the total throughput becomes 150 Gb/s. The transmitter must therefore support a 150 Gb/s aggregate data rate, and twenty parallel transmitter channels are used for this purpose. The external high-speed (3.75 GHz) clock signal is received by internal termination resistors and used in the transmitter blocks. A clock distribution tree is carefully designed to minimize the deterministic latency between the channels. Each transmitter has its own synchronization block because the input of the serializer comes from another, slower (750 MHz) clock domain. In this top-level system integration, the SST transmitter structure is employed. The layout of the TI



Figure 3.3: Layout of the TI-ADC system.

SAR ADC system, including the 150 Gb/s transmitter channels, is shown in Figure 3.3. Most of the pads on three edges of the chip are used by the transmitter channels. Supply voltage pads are placed between the transmitter pads to minimize the noise on the supply voltage and ground.
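
The throughput budget of this section can be checked with simple arithmetic, as in the sketch below. The per-lane rate is derived by us from the stated numbers, and the clock relations assume a half-rate serializer (as in Section 3.3.1) and 10-bit parallel words; under these assumptions, the 3.75 GHz transmitter clock and the 750 MHz parallel clock domain mentioned above come out consistently.

```python
ADC_BITS = 10
SAMPLE_RATE = 12e9        # 12 GS/s (16 channels x 750 MS/s)
LANES = 20

payload = ADC_BITS * SAMPLE_RATE    # 120 Gb/s
line_rate = payload * 10 / 8        # 8b/10b overhead -> 150 Gb/s
per_lane = line_rate / LANES        # 7.5 Gb/s per lane
half_rate_clock = per_lane / 2      # 3.75 GHz, assuming a half-rate serializer
word_clock = per_lane / 10          # 750 MHz, assuming 10-bit parallel words

for name, v in [("payload", payload), ("line rate", line_rate),
                ("per lane", per_lane), ("half-rate clock", half_rate_clock),
                ("parallel clock", word_clock)]:
    print(f"{name}: {v / 1e9:g} G")
```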

## 3.3 TX Circuit Implementations

In general, when designing I/O link systems, the aim is to maximize the data rate to reduce the number of pins. In this design, however, the data rate is limited by the need to comply with the standard. Thus, the primary objective of this work is to find the optimal structure in terms of performance parameters such as power consumption and jitter, while maintaining the highest data rate supported by the standard. The block diagram of the proposed TX is shown in Figure 3.4. There are several topology options for the driver block of this TX. The differential peak-to-peak voltage swings of two widely used drivers, the current-mode low-voltage differential signaling (LVDS) driver (Figure 3.6) and the voltage-mode source-series-terminated (SST) driver (Figure 3.7), are given below. These swing values are observed at the



Figure 3.4: Block diagram of the proposed JESD204B transmitter.

input nodes of the receiver (RX) block, considering differential 100  $\Omega$  termination on the RX side.

$$V_{ppd,LVDS} = 2 \times 2Z_0 (I_{LVDS}/2) = 2Z_0 I_{LVDS} \tag{3.1}$$

$$V_{ppd,SST} = 2 \times 2Z_0 I_{SST} = 4Z_0 I_{SST} \tag{3.2}$$

As the equations above show, the SST driver is ideally two times more power-efficient than the LVDS driver; however, its dynamic power consumption scales up rapidly with frequency, so the SST loses its power advantage in high-speed applications. Therefore, both driver architectures have been designed and measured in this study to compare their performances. In high-speed interconnects, current-mode logic (CML) drivers are also commonly used [27], mainly because they support high data rates. However, when the speed is limited by the maximum data rate of the standard, the energy efficiency of the TX architecture, rather than speed, becomes the primary consideration. The CML driver, another current-mode driver, is a pull-only circuit. As a result, it is half as power-efficient as the LVDS topology, which is a push-pull driver [26]. Therefore, the LVDS driver is preferred over the CML driver, because this work aims to achieve lower power consumption at a given data rate.
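
Inverting Equations 3.1 and 3.2 gives the minimum DC current each driver needs to meet the 360 mVppd JESD204B swing with a 100 $\Omega$ differential RX termination ($Z_0$ = 50 $\Omega$). The sketch below is our illustration; it covers only the static current through the termination, not the dynamic power that the text identifies as the SST's weakness at high speed.

```python
Z0 = 50.0          # channel impedance; RX differential termination = 2*Z0 = 100 ohm
VPPD_MIN = 0.360   # JESD204B minimum differential swing (V)


def i_min_lvds(vppd):
    return vppd / (2 * Z0)   # inverse of Eq. (3.1)


def i_min_sst(vppd):
    return vppd / (4 * Z0)   # inverse of Eq. (3.2)


print(f"LVDS: {i_min_lvds(VPPD_MIN) * 1e3:.1f} mA, SST: {i_min_sst(VPPD_MIN) * 1e3:.1f} mA")
# LVDS: 3.6 mA, SST: 1.8 mA
print(f"LVDS swing with a 4 mA tail: {2 * Z0 * 4e-3 * 1e3:.0f} mVppd")
# LVDS swing with a 4 mA tail: 400 mVppd
```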

A 6.25 GHz sinusoidal single-ended clock signal is received on chip with an internal mid-common-mode termination. This type of termination provides a 50 $\Omega$ termination and sets the common-mode voltage to $V_{DD}/2$ simultaneously. This DC bias voltage is appropriate for the following CMOS buffers, giving the best duty cycle while they amplify the clock signal toward a square wave. Since the common-mode voltage is set



Figure 3.5: Serializer circuit.

in the chip, it needs to be separated from the common-mode voltage of the clock source to prevent a DC current path. To have this separation and to keep the 50  $\Omega$  characteristic impedance continuity of the trace, a DC block element is used instead of a series capacitor in the measurements. High-speed standard cell ESD diodes, which provide >1kV HBM ESD protection, while adding about 200 fF parasitic capacitance, are used for all the critical pads, such as clock input and TX outputs. ESD diodes that provide >2kV HBM protection are used for the other pads.
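
One common way to realize such a mid-common-mode termination is a resistive divider between the supply rails; the text does not give the actual implementation, so the two 100 $\Omega$ resistors and the 1.0 V supply in the sketch below are assumptions. Its Thevenin equivalent is 50 $\Omega$ to $V_{DD}/2$, which is why any DC offset on the source side would drive a DC current into the chip unless it is blocked.

```python
def thevenin(r_top, r_bottom, vdd):
    """Thevenin equivalent of a resistor r_top to VDD and r_bottom to GND."""
    r_th = r_top * r_bottom / (r_top + r_bottom)
    v_th = vdd * r_bottom / (r_top + r_bottom)
    return r_th, v_th


# Assumed realization: 100 ohm to VDD and 100 ohm to GND, hypothetical 1.0 V supply.
print(thevenin(100.0, 100.0, 1.0))   # (50.0, 0.5)
```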

### 3.3.1 Serializer

The JESD204B standard uses 8b/10b encoding to keep the lines DC-balanced and to support clock recovery and data synchronization [25]. For this reason, the serializer is designed as a 10:1 MUX. A standard tree-type $2^{N}$:1 MUX-based serializer could also be used, with the bits shared among different lanes; however, for simplicity, each data package in the designs considered in this chapter is connected to only one TX lane. The 10:1 MUX used in both transmitters is shown in Figure 3.5. Two 5:1 D flip-flop-based serializers similar to [30] are followed by a final multiplexer stage that provides half-rate operation, which reduces the required clock frequency, the operating frequency of the flip-flops, and consequently the power consumption. As a drawback, the structure becomes sensitive to duty-cycle distortion of the clock signal. The data bits are written into the registers using an internally generated select signal and are then shifted out with the high-speed clock. The serializer layout is full-custom but uses standard-cell logic gates; thus, the design is very compact and the routing lengths of the critical lines are carefully optimized, with much less layout effort than designing custom logic gates.
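The half-rate 10:1 structure can be illustrated with a behavioral model. This is a sketch, not the circuit of Figure 3.5. It assumes that even-indexed bits are loaded into one 5:1 register and odd-indexed bits into the other, and that the final 2:1 MUX selects one branch on each clock phase. The second function shows why duty-cycle distortion matters: with a half-rate clock, the two bits in each clock period last for the high and the low phase, respectively.

```python
from typing import List, Tuple


def serialize_10to1(word: List[int]) -> List[int]:
    """Behavioral half-rate 10:1 serializer: two 5:1 branches shifting at half
    the bit rate, interleaved by a 2:1 MUX driven by the half-rate clock."""
    if len(word) != 10 or any(b not in (0, 1) for b in word):
        raise ValueError("expected a 10-bit word of 0/1 values")
    even = word[0::2]  # loaded into the first 5:1 register
    odd = word[1::2]   # loaded into the second 5:1 register
    out: List[int] = []
    for k in range(5):       # one half-rate clock period per iteration
        out.append(even[k])  # clock high: MUX selects the even branch
        out.append(odd[k])   # clock low: MUX selects the odd branch
    return out


def bit_widths_ps(bit_rate_gbps: float, duty_cycle: float) -> Tuple[float, float]:
    """Widths of the even/odd bits when the half-rate clock has a given duty cycle."""
    ui_ps = 1e3 / bit_rate_gbps  # nominal unit interval
    period_ps = 2 * ui_ps        # one half-rate clock period holds two bits
    return period_ps * duty_cycle, period_ps * (1 - duty_cycle)


if __name__ == "__main__":
    word = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
    assert serialize_10to1(word) == word  # bit order is preserved
    print(bit_widths_ps(12.5, 0.45))      # ~72 ps and ~88 ps instead of 80 ps each
```

At 12.5 Gb/s, each 5:1 branch runs at 6.25 Gb/s from the 6.25 GHz clock, and a 45% duty cycle already makes alternate bits 8 ps shorter than nominal.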



Figure 3.6: LVDS driver circuit.

### 3.3.2 LVDS Driver

The LVDS driver consists of two current sources ($M_P$, $M_N$), four core (H-bridge) transistors ($M_1$, $M_2$, $M_3$, $M_4$), and a common-mode feedback loop, as shown in Figure 3.6. The swing can be controlled by changing the bias voltage of the current source. Moreover, the output impedance can be controlled independently by trimming the poly resistors connected between the two output nodes, so that the return loss specifications are met across process corners. To satisfy the 360 mVppd differential swing requirement of the JESD204B standard, a 4 mA tail current is selected. With this bias, and taking the RX-side termination into account, the output nodes swing between $V_{CM}$ + 100 mV and $V_{CM}$ − 100 mV. Another critical design consideration for the return loss specifications is that the core transistors must remain in saturation. The worst-case saturation condition for $M_1$ occurs when the data input is logic-1, i.e., when its gate is at $V_{DD}$ while its drain sits at the lower output level $V_{CM}$ − 100 mV:

$$V_{DS} \ge V_{GS} - V_{TH} \tag{3.3}$$

$$V_{CM} - 0.1V - V_{dsat,M_N} \ge V_{DD} - V_{dsat,M_N} - V_{TH} \tag{3.4}$$

$$V_{TH} \ge V_{DD} - V_{CM} + 0.1V \tag{3.5}$$

According to (3.5), the threshold voltage of the core transistors must be at least $V_{DD} - V_{CM} + 0.1$ V. Therefore, unlike most of the system, this design uses regular-threshold transistors instead of low-threshold ones.
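The two design constraints of this section can be evaluated numerically. The sketch below assumes ideal current sources and a double-terminated 100 $\Omega$ differential load. It computes the differential swing produced by a given tail current and the minimum threshold voltage from (3.5). The $V_{CM}$ value in the example is hypothetical, since it is not given in this section; $V_{DD}$ = 1.0 V is the supply used for the power breakdown in Section 3.4.

```python
R_DIFF_LOAD = 50.0  # 100-ohm TX termination || 100-ohm RX termination [ohm]


def lvds_swing_mvppd(i_tail_ma: float) -> float:
    """Differential peak-to-peak swing for an ideal push-pull LVDS driver."""
    v_peak = i_tail_ma * 1e-3 * R_DIFF_LOAD  # differential peak [V]
    return 2 * v_peak * 1e3


def min_vth(vdd: float, vcm: float, v_se: float = 0.1) -> float:
    """Eq. (3.5): V_TH >= V_DD - V_CM + 0.1 V (v_se = single-ended swing)."""
    return vdd - vcm + v_se


if __name__ == "__main__":
    swing = lvds_swing_mvppd(4.0)
    print(f"swing = {swing:.0f} mVppd, meets 360 mVppd: {swing >= 360}")
    # Hypothetical common-mode voltage, for illustration only.
    print(f"minimum V_TH = {min_vth(vdd=1.0, vcm=0.6):.2f} V")
```

A 4 mA tail current gives a 200 mV differential peak (±100 mV per node), i.e., 400 mVppd. The bound in (3.5) relaxes as $V_{CM}$ rises.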

The LVDS driver has a common-mode feedback loop that sets the output nodes to a defined DC voltage. An operational transconductance amplifier (OTA) takes the average of the outputs and compares it with an external reference voltage. The output of the OTA drives the gate of an NMOS current source to keep the common mode at the reference voltage. The precision of the common-mode voltage is not very critical in this application; nevertheless, the OTA offset should be as low as possible and its unity-gain bandwidth as high as possible. Therefore, a symmetrical OTA architecture is selected and designed for an offset standard deviation of 1.28 mV (rms). To ensure the stability of the common-mode feedback loop, a compensation capacitor is placed on the node with the dominant pole.

Figure 3.7: SST driver circuit.

### 3.3.3 SST Driver

The SST driver, similar to [28], is designed as shown in Figure 3.7. The ratio of the nonlinear MOS resistance to the total resistance is kept at 10%, at the cost of a higher parasitic capacitance at the driver input and higher power consumption. The low-threshold NMOS and PMOS transistors are sized for a 100 $\Omega$ on-resistance, and 900 $\Omega$ poly resistors are placed in series, giving a 1 k$\Omega$ on-resistance per SST slice. Therefore, twenty of these slices are used in parallel for impedance matching in typical conditions. To cover the process corners as well, the number of active slices in parallel can be trimmed between 16 and 24. For trimming, disabled slices must present a high impedance at their output; this is done by a logic circuit placed just before each slice. To minimize deterministic jitter (DJ), fixed slices are not separated from tunable slices, and the same logic circuitry is used for all of them. Fixed slices are always enabled, while the tunable slices can be enabled or disabled.
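The slice arithmetic above can be written out as follows. 1 k$\Omega$ slices in parallel give 1000/20 = 50 $\Omega$ nominally, and enabling 16 to 24 slices keeps the output at 50 $\Omega$ for slice resistances between 800 $\Omega$ and 1.2 k$\Omega$, i.e., about ±20% around nominal. The selection rule (pick the count whose parallel resistance is closest to 50 $\Omega$) and the corner values in the example are illustrative assumptions, not the trimming procedure used in the design.

```python
def parallel_impedance(r_slice: float, n_active: int) -> float:
    """Output impedance of n identical slices in parallel."""
    return r_slice / n_active


def pick_slice_count(r_slice: float, target: float = 50.0,
                     n_min: int = 16, n_max: int = 24) -> int:
    """Number of enabled slices whose parallel impedance is closest to target."""
    return min(range(n_min, n_max + 1),
               key=lambda n: abs(parallel_impedance(r_slice, n) - target))


if __name__ == "__main__":
    # Typical slice: 100-ohm MOS on-resistance + 900-ohm poly resistor = 1 kohm.
    for r_slice in (850.0, 1000.0, 1150.0):  # hypothetical corner values
        n = pick_slice_count(r_slice)
        print(f"R_slice = {r_slice:.0f} ohm -> {n} slices, "
              f"{parallel_impedance(r_slice, n):.1f} ohm")
```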

Trimming the voltage swing of the SST driver requires extra circuitry and power. Therefore, the swing is not trimmed, and the ideal swing at the RX input is 1 Vppd, equal to the 1 V supply voltage of the technology used.



Figure 3.8: Eye diagrams simulated at 12.5 Gb/s.

## 3.4 Simulation Results

Eye diagrams of the transmitters obtained from post-layout simulations including transient noise are shown in Figure 3.8. As expected, the vertical eye openings differ considerably while the horizontal openings are similar. To mimic the losses of the trace on the measurement board, connectors, and cables, a differential PCB trace is modeled and its S-parameters are used in these simulations. The complete channel consists of the ESD diode and pad capacitance, a 2 mm bond-wire inductance, the modeled PCB trace with 7 dB insertion loss at the 6.25 GHz Nyquist frequency, and the termination resistors at the RX. Since the eye openings are sufficient for both LVDS and SST, no equalizer blocks such as an FFE are implemented on the TX side. However, a higher data rate or a more aggressive channel may require an equalizer. In that case, the vertical opening advantage of the SST becomes more important, since some of its slices could be dedicated to cancelling pre- or post-cursors [28].

Figure 3.9: Power breakdown of the transmitters.

Figure 3.10: Chip micrograph.

The power consumption of the blocks at a 1.0 V supply voltage is shown in Figure 3.9. The clock termination, clock buffers, and serializers consume the same power in both transmitters; the difference comes from the driver itself and the pre-driver stage. Including the 5 mA DC current drawn by the 50 $\Omega$ termination of the clock input, the energy efficiencies at 12.5 Gb/s are 1.1 pJ/bit for the LVDS transmitter and 1.7 pJ/bit for the SST transmitter. The LVDS driver occupies 0.0196 $mm^2$, while the SST driver occupies 0.0104 $mm^2$. Most of the LVDS driver area is taken by the large current-mirror transistors and by the compensation capacitors that stabilize the OTA and the common-mode feedback loop. The SST driver is smaller, and its layout is much more regular thanks to its sliced structure. The circuit is also almost entirely digital and is therefore better suited to modern processes.
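The energy-efficiency figures follow directly from power divided by data rate, since 1 mW at 1 Gb/s corresponds to 1 pJ/bit. The short check below recomputes them from the power values listed in Table 3.1.

```python
def energy_per_bit_pj(power_mw: float, rate_gbps: float) -> float:
    """Energy per bit [pJ/bit]: 1 mW / 1 Gb/s = 1e-3 J/s / 1e9 bit/s = 1 pJ/bit."""
    return power_mw / rate_gbps


# (power [mW], data rate [Gb/s]) taken from Table 3.1
DESIGNS = {
    "[27] @ 10 Gb/s": (17.0, 10.0),
    "[27] @ 15 Gb/s": (34.0, 15.0),
    "[28]": (96.0, 8.5),
    "[29]": (36.0, 12.5),
    "This work, LVDS": (13.82, 12.5),
    "This work, SST": (21.49, 12.5),
}

if __name__ == "__main__":
    for name, (p_mw, r_gbps) in DESIGNS.items():
        print(f"{name:16s} {energy_per_bit_pj(p_mw, r_gbps):5.2f} pJ/bit")
```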



Figure 3.11: Custom measurement PCB of the TXs.



Figure 3.12: Measurement setup of the TXs.

## 3.5 Measurement Results

Figure 3.10 shows the chip micrograph with the two transmitters, which were fabricated in a 28 nm FD-SOI CMOS technology. A six-layer custom measurement PCB, shown in Figure 3.11, is designed for measuring the LVDS and SST transmitters. Low-noise LDOs placed on the bottom side of the PCB provide multiple clean supply voltages. The block diagram of the measurement setup is given in Figure 3.12. Figure 3.13 shows the eye diagrams of both transmitters measured at 12.1 Gb/s with a 4 cm RO4003C channel after 1.5 to 2 mm bond wires. The LVDS TX shows a 135 mV vertical and a 55.5 ps horizontal eye opening, while the SST TX shows a 703 mV vertical and a 54.6 ps horizontal eye opening.

Figure 3.13: Eye diagrams measured at 12.1 Gb/s.

Figure 3.14: Eye diagram of the LVDS-P output with clock crosstalk effect.

The eye diagram of the LVDS TX has a different shape from the simulations because clock crosstalk affects one of the LVDS outputs. The crosstalk pushes the signal up in one clock cycle and down in the next, so that, depending on the data level (one or zero), four different voltage levels appear. To show this effect more clearly, Figure 3.14 shows the eye diagram of the single LVDS output (LVDS-P) closer to the clock signal at a lower data rate. The jumps at the clock transitions degrade the eye diagram, and this effect grows stronger at high data rates, while the other LVDS output gives a clean eye diagram. The reason is that the LVDS-P and clock bond wires are only 200 $\mu$m apart and run in parallel over 1.5 mm. Moreover, the PCB traces come as close as 0.65 mm at their closest point, without a ground plane in between.

| Ref.                | [27]      | [28]  | [29]  | This work (LVDS) | This work (SST) |
|---------------------|-----------|-------|-------|------------------|-----------------|
| Technology          | 65 nm     | 65 nm | 40 nm | 28 nm            | 28 nm           |
| Data rate [Gb/s]    | 10 / 15   | 8.5   | 12.5  | 12.5             | 12.5            |
| Driver topology     | CML       | SST   | SST   | LVDS             | SST             |
| Power [mW]          | 17 / 34   | 96    | 36    | 13.82            | 21.49           |
| Efficiency [pJ/bit] | 1.7 / 2.3 | 11.3  | 2.88  | 1.1              | 1.7             |

Table 3.1: Performance comparison of the JESD204B-compliant transmitters with other similar works

The LVDS TX has DJ(d-d) = 9.55 ps, RJ(rms) = 1.2 ps (random jitter), and TJ(BER=1e-12) = 27 ps (total jitter), while the SST TX has DJ(d-d) = 19.87 ps, RJ(rms) = 0.58 ps, and TJ(BER=1e-12) = 28 ps. The LVDS TX has a slightly lower TJ, and this value would be much lower if the clock crosstalk on the LVDS output were eliminated, since the crosstalk is the main source of DJ in the LVDS TX. Setting the crosstalk problem aside, the signal integrity performance of the two drivers is similar. However, for a more aggressive channel or higher data rates, equalization capability becomes important, and the SST is much more capable of equalizing channel loss thanks to its high swing; the LVDS TX would lose its power advantage if it had to provide such a swing. Table 3.1 compares the presented transmitters with other designs at similar data rates, without considering channel loss and equalization capability.
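The reported total jitter values are consistent with the common dual-Dirac model, $TJ(BER) = DJ(\delta\delta) + 2Q(BER)\,RJ_{rms}$, where $Q(10^{-12}) \approx 7.03$. The sketch below assumes this model; the section does not state how the measurement equipment extrapolated TJ. It gives about 26.4 ps for the LVDS TX and about 28.0 ps for the SST TX, close to the reported 27 ps and 28 ps.

```python
from statistics import NormalDist


def q_factor(ber: float) -> float:
    """Gaussian Q such that the one-sided tail probability equals ber."""
    return -NormalDist().inv_cdf(ber)


def total_jitter_ps(dj_dd_ps: float, rj_rms_ps: float, ber: float = 1e-12) -> float:
    """Dual-Dirac estimate: TJ = DJ(d-d) + 2 * Q(BER) * RJ(rms)."""
    return dj_dd_ps + 2 * q_factor(ber) * rj_rms_ps


if __name__ == "__main__":
    for name, dj, rj in (("LVDS", 9.55, 1.2), ("SST", 19.87, 0.58)):
        print(f"{name}: TJ(1e-12) ~ {total_jitter_ps(dj, rj):.1f} ps")
```

The model also makes the trade-off visible: the SST TX has the smaller RJ, but its larger DJ dominates at BER = 1e-12.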

## 3.6 Conclusion

JESD204B-compliant LVDS and SST transmitters have been designed in 28 nm FD-SOI CMOS technology and tested. The LVDS and SST transmitters achieve FoMs of 1.1 pJ/bit and 1.7 pJ/bit, respectively, at 12.5 Gb/s, which compare favorably with other designs at similar data rates. The LVDS TX has lower power consumption and slightly lower total jitter than the SST TX despite the crosstalk observed in our measurement setup, while the SST TX has a higher voltage swing, which translates into more equalization potential.

## **4** ISI Sensitivity of PAM Signaling<sup>1</sup>

This chapter presents a comparative study analyzing the potential of pulse amplitude modulation (PAM) for implementing next-generation high-speed copper wireline links. The study analyzes the sensitivity of PAM-2, PAM-4, and PAM-8 to inter-symbol interference (ISI) using four channels with progressively increasing loss, at data rates of 28 Gb/s, 56 Gb/s, 112 Gb/s, and 224 Gb/s. Each case is examined with and without equalization to properly assess the inherent limitations of the modulation schemes in the presence of ISI. The study shows that higher-order modulation requires more equalization. Moreover, although higher-order PAM offers better spectral efficiency, this does not translate into the best signal integrity performance. Depending on the severity of ISI and the data rate, a lower-order PAM can outperform a higher-order PAM, especially in low-loss cases.

This chapter is organized as follows. The motivation behind the modulation orders and data rates selected for this study is given in Section 4.1. Section 4.2 describes the system and the analysis methodology, and Section 4.3 presents the analysis results obtained for different cases. Finally, Section 4.4 summarizes the study and discusses the choice of modulation for different channels and data rates.

## 4.1 Modulation Trend Overview

Figure 4.1 shows recently published TX and TRX systems. The data rates of TX and TRX systems have recently reached 112 Gb/s and beyond; at this rate of advancement, the next data rate target of high-speed signaling systems is expected to become 224 Gb/s very soon. Binary non-return-to-zero (NRZ) signaling has been used very widely to implement links at data rates up to 56 Gb/s [9]. At 56 Gb/s and beyond (e.g., 112 Gb/s), the main trend is to use higher-order pulse amplitude modulation (PAM). Binary NRZ (PAM-2) is less efficient at such high data rates, mainly because of the required channel bandwidth and silicon operating speed. Most recent publications operating at 112 Gb/s (e.g., [32], [11]) use PAM-4 modulation to relax the required clock speed and lower the Nyquist bandwidth. Hence, PAM-4 is becoming very attractive for implementing very high-speed copper links [33].

<sup>1</sup>This chapter is based on: F. Celik, A. Akkaya, A. Tajalli, and Y. Leblebici, "ISI Sensitivity of PAM Signaling for Very High-Speed Short-Reach Copper Links," *2019 17th IEEE International New Circuits and Systems Conference (NEWCAS)*, Munich, Germany, 2019, pp. 1-4, (©IEEE) [31].

Figure 4.1: Data rate and modulation type of recently published TX and TRX systems by year.

This work studies the inter-symbol interference (ISI) sensitivity of PAM signaling methods. Four different short-reach channels have been examined with PAM-2, PAM-4, and PAM-8 modulation, and the achievable horizontal and vertical eye openings of each signaling method have been analyzed over all four channels for data rates between 28 Gb/s and 224 Gb/s. Other modulation techniques, such as multi-tone [34] and duo-binary [35], can be used on top of PAM signaling; however, they are not considered in this study. Moreover, the main focus of this study is the performance of PAM signaling at high data rates; implementation imperfections are outside its scope.

## 4.2 Analysis Methodology

PAM-2 has proven to be a very robust signaling method for copper links operating at data rates up to 56 Gb/s [36], [37]. However, at higher data rates (56 Gb/s and above), PAM-4 signaling has become more popular, mainly because of its spectral efficiency and its lower clock rate requirement compared to PAM-2. As the data rate keeps increasing, PAM-8 might become a candidate for links operating beyond 112 Gb/s. PAM-8 carries three bits per symbol, whereas PAM-2 and PAM-4 carry one and two bits per symbol, respectively. Thus, for a given data rate, the Nyquist frequency scales down for higher-order PAM, which therefore requires less signal bandwidth and experiences lower channel loss. The higher the modulation order, the longer the symbol period, and hence the wider the horizontal eye opening available at the transmitter side. However, the vertical eye opening degrades relative to PAM-2, because PAM-4 and PAM-8 stack 3 and 7 eyes, respectively, within the same signal swing.
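The scaling described above can be made concrete. For a data rate $R$ and $N$ levels, the symbol period is $\log_2 N / R$, the Nyquist frequency is $R/(2\log_2 N)$, and with a fixed total swing each of the $N-1$ stacked eyes has at most $1/(N-1)$ of it. At 224 Gb/s this reproduces the 112 GHz, 56 GHz, and 37.3 GHz Nyquist frequencies quoted in Section 4.3.2. The eye heights here are ideal ISI-free upper bounds, an assumption of this sketch.

```python
import math


def symbol_period_ps(rate_gbps: float, levels: int) -> float:
    """Symbol duration (UI) in ps for a PAM-N signal at the given bit rate."""
    return 1e3 * math.log2(levels) / rate_gbps


def nyquist_ghz(rate_gbps: float, levels: int) -> float:
    """Nyquist frequency = half the symbol rate."""
    return rate_gbps / (2 * math.log2(levels))


def ideal_eye_height_mv(total_swing_mv: float, levels: int) -> float:
    """ISI-free height of each of the (levels - 1) stacked eyes."""
    return total_swing_mv / (levels - 1)


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(f"PAM-{n}: UI = {symbol_period_ps(224, n):.2f} ps, "
              f"f_Nyq = {nyquist_ghz(224, n):.1f} GHz, "
              f"eye <= {ideal_eye_height_mv(1000, n):.0f} mV")
```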



Figure 4.2: Structure of the test setup used to analyze the performance of the channel in response to different types of PAM.



Figure 4.3: Channel responses used for this study.

Figure 4.2 shows the block diagram of the test setup used in this study. The total signal swing at the output of the transmitter (TX) is set to $1 V_{ppd}$, independent of the PAM signaling type (PAM<sub>N</sub>, where N = 2, 4, and 8). A 3-tap feed-forward equalizer (FFE) for pre-cursor cancellation is placed on the TX side. The receiver incorporates a continuous-time linear equalizer (CTLE) with 2 poles and 1 zero, and a 10-tap decision feedback equalizer (DFE). At the beginning of each simulation, a pulse generator applies a pulse, and the "tap coefficient decider" block determines the FFE and DFE taps from the resulting response. Once the FFE and DFE tap coefficients are obtained, a modulated pseudo-random binary sequence (PRBS) data pattern is applied through the TX block. The S<sub>21</sub> parameters of the four short-reach channels are shown in Figure 4.3. The modeled channels are named A, B, C, and D, and their losses are approximately 1.9 dB, 3.9 dB, 5.9 dB, and 7.9 dB, respectively, all at 10 GHz.
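The section does not detail how the "tap coefficient decider" works. One common approach is shown here only as an assumed zero-forcing sketch. It samples the single-pulse response once per UI, takes the peak as the main cursor, and uses the following post-cursor samples as DFE tap weights; residual pre-cursor ISI is left to the TX FFE. A binary DFE then subtracts the ISI of past decisions before slicing. The pulse values in the test are hypothetical.

```python
from typing import List, Tuple


def dfe_taps_from_pulse(pulse: List[float], samples_per_ui: int,
                        n_taps: int = 10) -> Tuple[float, List[float]]:
    """Return (main cursor, DFE taps) from a sampled pulse response.

    DFE tap k equals the pulse amplitude k UIs after the main cursor, i.e. the
    post-cursor ISI that the DFE subtracts using past decisions."""
    i_main = max(range(len(pulse)), key=lambda i: pulse[i])
    taps = []
    for k in range(1, n_taps + 1):
        idx = i_main + k * samples_per_ui
        taps.append(pulse[idx] if idx < len(pulse) else 0.0)
    return pulse[i_main], taps


def dfe_decide(rx: List[float], main_index: int, taps: List[float]) -> List[int]:
    """Binary (+1/-1) DFE on one sample per UI.

    rx[m] is assumed to hold symbol n = m - main_index at its main cursor."""
    decisions: List[int] = []
    for m in range(main_index, len(rx)):
        # Post-cursor ISI from previously decided symbols.
        isi = sum(t * decisions[-1 - k]
                  for k, t in enumerate(taps) if k < len(decisions))
        decisions.append(1 if rx[m] - isi >= 0 else -1)
    return decisions
```

Without the DFE, a strong first post-cursor can exceed the main cursor margin and cause errors; with the extracted taps, only the small pre-cursor term remains.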

To study the inherent sensitivity of the different signaling methods to ISI, the eye openings at the receiver are first studied without applying any equalization. In the second step of this study, the equalizer blocks are activated, and conventional methods are used to find the optimal equalizer settings for each data point [38].

Figure 4.4: Example eye diagrams of PAM-2, PAM-4, and PAM-8 signaling at 84 Gb/s for channel A when no equalization is applied.

Figure 4.5: Vertical eye opening with increasing data rate for channel A when no equalization is applied.

Figure 4.6: Horizontal eye opening with increasing data rate for channel A when no equalization is applied.

## 4.3 Simulation Results

### 4.3.1 Without Equalization

In this subsection, none of the equalizer blocks are activated, and the eye opening trends are examined for channel A (see Figure 4.3) between 42 Gb/s and 168 Gb/s. The least aggressive channel is used so that the trend lines can be explored over a wider frequency range, as the general characteristics of the eye opening trend are similar for all channels. Example eye diagrams of PAM-2, PAM-4, and PAM-8 signaling at 84 Gb/s are given in Figure 4.4. As can be seen in Figure 4.5, the vertical eye opening is highest for PAM-2, as expected, and decreases for PAM-4 and PAM-8. The slopes of the vertical opening plots are −5.23 mV/Gbps, −1.87 mV/Gbps, and −0.81 mV/Gbps, respectively. Thus, the vertical opening of the PAM-2 link decreases faster than those of PAM-4 and PAM-8 as the data rate increases; however, PAM-2 still has the highest opening at 168 Gb/s, where the vertical eye is already closed for PAM-8.

As depicted in Figure 4.6, higher-order PAM results in wider horizontal eye openings, benefiting from the lower required signal bandwidth. However, at higher data rates, the PAM-8 eye closes faster than the other two. A higher-order PAM still provides a wider horizontal opening up to a moderately high data rate (e.g., 90 Gb/s) for this channel, whereas the opposite holds at higher data rates (>112 Gb/s). This behavior shows that PAM-8 and PAM-4 will need more equalization than PAM-2 to open the eye, because PAM-2 has both better horizontal and better vertical openings at high data rates. As a result, up to a certain data rate (e.g., up to 90 Gb/s for channel A), ISI creates a trade-off between horizontal and vertical eye openings, while it favors PAM-2 at higher data rates. Clearly, practical issues such as the required clock frequency prevent the implementation of energy-efficient PAM-2 transceivers at such high data rates.

### 4.3.2 With Ideal Equalization

In this subsection, the vertical and horizontal eye openings for channels A, B, C, and D are examined at 28 Gb/s, 56 Gb/s, 112 Gb/s, and 224 Gb/s, with the equalizers activated. Implementing such complex circuits at very high data rates is very challenging and would further limit the achievable link performance. However, no practical limitations due to circuit implementation are considered in this study, in order to analyze the achievable link performance in the presence of equalizers. Example eye diagrams of PAM-2, PAM-4, and PAM-8 signaling at 112 Gb/s using channel B are given in Figure 4.7.

For channel A, the horizontal openings of the different signaling methods converge at around 112 Gb/s and beyond, as shown in Figure 4.8. PAM-2 has the highest vertical opening over the whole range; therefore, if silicon limitations are ignored, it would make sense to employ PAM-2 beyond the data rate at which the horizontal openings converge. At 28 Gb/s and 56 Gb/s, the choice becomes a trade-off between vertical and horizontal opening. The trend is similar for the other channels, but the threshold data rates at which the horizontal openings become comparable are lower: approximately 80 Gb/s and 56 Gb/s for channels B and C, respectively (Figure 4.9 and Figure 4.10). Beyond these data rates, PAM-2 offers a better opening. Below that critical rate, the choice depends on the voltage swing, the equalization capability, the available power budget, and the circuit operating speed.

For channel D (highest loss), the horizontal opening is wider for PAM-8 at all data rates from 28 Gb/s upward, as depicted in Figure 4.11. For PAM-2 and PAM-4, the eye closes before 224 Gb/s, while it is still slightly open for PAM-8 at 224 Gb/s. At this data rate, the Nyquist frequencies are 112 GHz, 56 GHz, and 37.3 GHz, and the corresponding channel losses are 65 dB, 44 dB, and 33 dB for PAM-2, PAM-4, and PAM-8, respectively. When the channel loss is too high to equalize, it may be advantageous to use a higher-order modulation to lower the Nyquist frequency, the corresponding channel loss, and the silicon operating speed.

### 4.3.3 With Frequency Limited Equalization

In the previous subsection, the high-frequency pole of the CTLE is placed at $2 \times f_{Nyquist}$ without considering practical implementation limitations. In this subsection, the maximum pole frequency of the CTLE is limited to 35 GHz to observe the effect of more realistic equalization, especially at high data rates. Since the CTLE has one zero and two poles, the 35 GHz limit can also be considered approximately equal to the maximum peaking frequency. Figures 4.12, 4.13, 4.14, and 4.15 show the eye openings for channels A, B, C, and D, respectively, under this CTLE bandwidth limitation.
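The effect of the 35 GHz cap can be tabulated by computing where the ideal high-frequency pole ($2 f_{Nyquist}$) would sit and whether the cap moves it. The sketch also evaluates a generic one-zero, two-pole CTLE response, $H(s) = A(1 + s/\omega_z)/((1 + s/\omega_{p1})(1 + s/\omega_{p2}))$, assuming that the second pole is the one being capped. The zero and first-pole frequencies in the example are hypothetical, as they are not specified in this section.

```python
import math

F_LIMIT_GHZ = 35.0  # cap on the CTLE high-frequency pole (Section 4.3.3)


def ctle_high_pole_ghz(rate_gbps: float, levels: int, limited: bool) -> float:
    """High-frequency pole: 2 * f_Nyquist, optionally capped at 35 GHz."""
    f_nyq = rate_gbps / (2 * math.log2(levels))
    f_p2 = 2 * f_nyq  # ideal placement used in Section 4.3.2
    return min(f_p2, F_LIMIT_GHZ) if limited else f_p2


def ctle_gain_db(f_ghz: float, fz_ghz: float, fp1_ghz: float, fp2_ghz: float,
                 dc_gain: float = 1.0) -> float:
    """|H(j*2*pi*f)| in dB for a one-zero, two-pole CTLE (all freqs in GHz)."""
    s = 1j * 2 * math.pi * f_ghz
    wz, wp1, wp2 = (2 * math.pi * f for f in (fz_ghz, fp1_ghz, fp2_ghz))
    h = dc_gain * (1 + s / wz) / ((1 + s / wp1) * (1 + s / wp2))
    return 20 * math.log10(abs(h))


if __name__ == "__main__":
    for rate, n in ((28, 8), (112, 4), (224, 2), (224, 8)):
        print(f"{rate} Gb/s PAM-{n}: ideal {ctle_high_pole_ghz(rate, n, False):6.1f} GHz, "
              f"limited {ctle_high_pole_ghz(rate, n, True):5.1f} GHz")
    # Hypothetical zero at 1 GHz and first pole at 10 GHz, second pole at the cap.
    print(f"peaking at 10 GHz: {ctle_gain_db(10.0, 1.0, 10.0, F_LIMIT_GHZ):.1f} dB")
```

The cap leaves low-rate PAM-8 untouched (about 9.3 GHz at 28 Gb/s) but moves the pole from 224 GHz to 35 GHz for 224 Gb/s PAM-2, which is consistent with PAM-2 suffering most in Figures 4.12 to 4.15.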

Especially for channels C and D, the vertical eye opening advantage of PAM-2 quickly disappears as the data rate increases, mainly because of its higher Nyquist frequency. For channel C, PAM-2 loses its advantage first, and PAM-4 then decays quickly at higher data rates. At 224 Gb/s, PAM-8 has the highest vertical and horizontal eye openings, while the PAM-2 eye is completely closed. Even for channel B, PAM-2 has the worst vertical eye opening at 224 Gb/s, while PAM-4 has a slightly larger vertical opening than PAM-8 and their horizontal openings are equal. In conclusion, under the CTLE bandwidth limitation, higher-order modulations provide better eye openings at very high data rates.



Figure 4.7: Example eye diagrams of PAM-2, PAM-4, and PAM-8 signaling at 112 Gb/s for channel B when ideal equalization is applied.



Figure 4.8: Eye openings for channel A with equalization (CTLE, FFE, and DFE).



Figure 4.9: Eye openings for channel B with equalization (CTLE, FFE, and DFE).



Figure 4.10: Eye openings for channel C with equalization (CTLE, FFE, and DFE).



Figure 4.11: Eye openings for channel D with equalization (CTLE, FFE, and DFE).



Figure 4.12: Eye openings for channel A with EQ (limited bandwidth CTLE, FFE, and DFE).



Figure 4.13: Eye openings for channel B with EQ (limited bandwidth CTLE, FFE, and DFE).



Figure 4.14: Eye openings for channel C with EQ (limited bandwidth CTLE, FFE, and DFE).



Figure 4.15: Eye openings for channel D with EQ (limited bandwidth CTLE, FFE, and DFE).

## 4.4 Summary and Discussion

In this study, the inherent limitations of PAM-2, PAM-4, and PAM-8 modulation are studied and compared. The ISI sensitivity of these modulations is analyzed for four short-reach channels with different attenuation levels, and the optimum modulation scheme can be chosen depending on the channel loss and data rate. To allow a fair comparison between the cases in terms of BER, all transient simulations are run for 10000 UI, so the simulated time differs from case to case depending on the data rate and modulation type.

In the first set of simulations, the system is simulated without any equalization, showing that PAM-2 has better inherent horizontal and vertical eye openings at high data rates. This alone does not prove that PAM-2 is the better choice; however, it shows that higher-order modulations need more equalization and are more sensitive to ISI, despite their higher spectral efficiency.

In the second set of simulations, identical equalizer systems with individually optimized settings are used for each modulation order, without any frequency limitation. These simulations show that, up to a certain channel loss at the Nyquist frequency, PAM-2 still has advantages over the others. In the third set, the frequency limitation is applied to the CTLE, and results similar to those of the second set are obtained; however, under the CTLE bandwidth limitation, PAM-2 starts losing its advantage as the channel loss increases. This result shows that when the loss is too high to equalize, PAM-2 quickly becomes disadvantageous, and a low-order modulation such as PAM-2 is affected the most by circuit limitations.

Finally, the effect of white noise is examined by injecting a 500 $\mu V_{\text{RMS}}$ random signal at the CTLE input for the cases with a 14 GHz Nyquist frequency. For higher Nyquist frequencies, the injected RMS noise is increased as well, since the noise bandwidth of the system is higher. The observation is that noise does not change the relative performance in favor of any modulation type, because ISI is the dominant factor; this is why the noise simulation results are not included in Section 4.3. One more practical issue is that a 3-tap FFE and a 10-tap DFE do not have the same circuit complexity, area, or power consumption for different modulations. Moreover, the latency of forward error correction is higher for higher-order modulations.

It can be concluded that, owing to its lower ISI sensitivity, PAM-2 is the best choice for channels A and B at almost all the data rates exercised in this study, as long as silicon limitations are neglected. For channels C and D, as the data rate increases, higher-order PAM results in better eye openings than PAM-2. Although the maximum achievable circuit frequency may affect the selection of the optimal modulation scheme, this study considers only first-order effects in order to determine the maximum achievable link performance.

## **5** Energy-Efficient PAM-4 SST Transmitter

The previous chapter showed that high-order PAM signaling is more sensitive to ISI than low-order PAM signaling. Equalization is therefore more important, and residual ISI is a more critical problem, in high-order modulation. With increasing modulation order, a larger number of parallel SST segments is required to implement precise FFE tap-weight control in an SST TX. Because of the data routing complexity of this segmented structure, the dynamic power consumption of the SST TX becomes much more significant, degrading its energy efficiency. This chapter presents a 32 Gb/s quarter-rate PAM-4 SST TX with a four-tap FFE to minimize residual ISI, implemented in 28 nm FD-SOI CMOS technology, and proposes a high-impedance driver technique that decreases the gate loading of the SST TX data path. The output of the complete TX is kept matched to the standard characteristic impedance of the system even though the output impedance of the SST driver alone is high. The proposed technique trades voltage swing for power consumption and, by significantly reducing the capacitive load, decreases the total power consumption by 20% compared to the conventional design. Measurement results show that the prototype TX consumes 77.9 mW and achieves an energy efficiency of 2.4 pJ/bit at 32 Gb/s for PAM-4 signaling.

This chapter is organized as follows. Section 5.1 reviews the driver options for wireline TXs, and the details of the proposed technique are given in Section 5.2. The TX architecture and key design choices are discussed in Section 5.3, along with the design and implementation details of the TX circuits, including the clock generation, four-tap FFE, 4:1 serializer, SST driver, and output pad network. The measurement results of the prototype TX are given in Section 5.4. Finally, Section 5.5 concludes the chapter.

## 5.1 Transmitter Overview

In high-speed I/O systems, there are two popular driver choices for wireline TXs: current-mode logic (CML) drivers [12, 13, 14] and source-series terminated (SST) drivers [11, 15, 16, 17, 39, 40, 41]. CML and SST drivers are the most commonly used current-mode and voltage-mode drivers, respectively. In terms of static power consumption, the CML driver consumes four times the power of the SST driver for the same voltage swing, due to its pull-only operation and the parallel termination of current-mode drivers. Therefore, SST drivers are generally known to be more power-efficient than CML drivers [39].

However, the higher sensitivity of PAM-4 to residual inter-symbol interference compared to non-return-to-zero (NRZ) signaling [31], and the resulting need for precise FFE tap-weight control in TXs, have increased the number of required SST segments. This need for many segments increases the circuit and routing complexity of SST drivers. In addition, the state-of-the-art wireline TX in [12] employs a CML driver consisting of three equally weighted segments, thereby avoiding the high dynamic power consumption that comes with a large number of SST driver segments.

Moreover, the quarter-rate architecture, which is widely used in recent high-speed links [12, 13, 14, 11, 42], requires four-phase clock generation, and distributing these clock signals to the segments increases the power consumption significantly. Considering these factors, the dynamic power consumption becomes dominant at high speed, where it outweighs the static power advantage of SST drivers over CML drivers [12, 11, 41].

Nevertheless, it has recently been shown experimentally that the voltage-mode driver is intrinsically faster and provides better vertical and horizontal eye openings than its current-mode counterpart for a similar voltage swing, especially at high speed [42]. Therefore, this study aims to preserve the speed and static power advantages of the SST driver while reducing the critical high-speed dynamic power consumption to improve energy efficiency. The number of driver segments cannot be reduced, because precise FFE tap-weight control is critical in a PAM-4 system. However, the capacitive load that each SST segment presents to the TX data path can be decreased by using smaller transistors. Smaller transistors in the SST driver result in a higher series termination impedance, yet impedance matching must still be ensured to meet the return loss specifications. This could be solved by decreasing the ratio of the linear poly resistance to the total driver resistance; however, that solution would degrade linearity, which is a critical design consideration in PAM-4 signaling. Therefore, a new path is added by placing a passive resistor between the output terminals of the TX [43]. In this way, the standard output impedance of the TX can be maintained while using a high-impedance SST driver. In [43], the additional path is introduced to decrease the swing, which also saves static power in the driver. In contrast, this chapter addresses the high dynamic power consumption of the SST TX at high speed: the main objective is to reduce the capacitive load of the SST driver and thus the dynamic power consumption of the preceding blocks in a PAM-4 TX with a large number of segments.
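The matching condition for the added path can be sketched under an assumed topology: a single differential resistor $R_{par}$ across the TX outputs, which in differential mode places $R_{par}/2$ from each output to a virtual ground. The driver resistance must then satisfy $R_{drv} \parallel (R_{par}/2) = 50\ \Omega$, and the RX swing follows from a differential voltage divider driven by a 1 V full-swing SST driver. This only illustrates the kind of trade-off plotted in Figures 5.3 and 5.4; the actual termination network of Figure 5.2 may differ.

```python
R0 = 50.0          # target single-ended (differential-mode) output impedance [ohm]
R_RX_DIFF = 100.0  # differential RX termination [ohm]


def driver_resistance(r_par: float) -> float:
    """Per-side SST driver resistance R_drv such that R_drv || (r_par / 2) = R0.

    Assumes r_par is one differential resistor across the TX outputs, so each
    output sees r_par / 2 to the differential-mode virtual ground."""
    r_half = r_par / 2
    if r_half <= R0:
        raise ValueError("r_par must exceed 2 * R0 for a realizable match")
    return R0 * r_half / (r_half - R0)


def _loop(r_par: float):
    r_drv = driver_resistance(r_par)
    r_load = r_par * R_RX_DIFF / (r_par + R_RX_DIFF)  # r_par || RX termination
    return r_drv, r_load


def rx_swing_vppd(r_par: float, vdd: float = 1.0) -> float:
    """Differential peak-to-peak swing at the RX for a full-swing SST driver."""
    r_drv, r_load = _loop(r_par)
    v_peak = vdd * r_load / (2 * r_drv + r_load)  # differential divider
    return 2 * v_peak


def static_current_ma(r_par: float, vdd: float = 1.0) -> float:
    """DC supply current of the driver loop [mA]."""
    r_drv, r_load = _loop(r_par)
    return 1e3 * vdd / (2 * r_drv + r_load)


if __name__ == "__main__":
    for r_par in (1e9, 800.0, 400.0, 200.0 + 1e-9 + 100.0):
        print(f"R_par = {r_par:10.1f} ohm: R_drv = {driver_resistance(r_par):6.1f} ohm, "
              f"{rx_swing_vppd(r_par):.3f} Vppd, {static_current_ma(r_par):.2f} mA")
```

For example, $R_{par}$ = 400 $\Omega$ requires $R_{drv} \approx$ 66.7 $\Omega$ per side and gives about 0.75 Vppd at the RX while drawing about 4.7 mA instead of 5 mA; the larger $R_{drv}$ is what allows smaller driver transistors.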


Figure 5.1: Input capacitance of the SST driver for different  $R_{MOS}$  values.



Figure 5.2: Proposed impedance termination structure for SST drivers with additional resistors (R<sub>par</sub>).

## 5.2 SST Driver Termination Study

One of the main challenges in PAM-4 TX design is maintaining high linearity, since any distortion of the DC levels of the modulated signal reduces the vertical eye opening much more in PAM-4 than in an NRZ TX. Non-linearity is less critical in a CML driver than in an SST driver, because the high common-mode voltage at the output nodes keeps the tail current source in saturation. In SST drivers, on the other hand, the termination impedance is a combination of a non-linear MOS resistance and a linear passive resistor. Therefore, the share of the non-linear MOS resistance is usually kept low in SST drivers.

A low share of the non-linear MOS resistance ($R_{MOS}$) requires a low ON-resistance and therefore wide MOSFETs, which increases the input capacitance and the dynamic power consumption. To illustrate this effect, a conventional SST driver is simulated in 28 nm FD-SOI and its input capacitance is observed for different $R_{MOS}$ values. The ratio of the



Figure 5.3: SST driver resistance with respect to  $R_{par}$  for 50  $\Omega$  termination.



Figure 5.4: Voltage swing at the receiver input with respect to  $R_{par}$  for 50  $\Omega$  termination considering 1 V supply voltage.

passive poly resistor ($R_{poly}$) to the total driver resistance ($R_{MOS} + R_{poly}$) is kept constant over the whole sweep range. The average input capacitance for different $R_{MOS}$ values is plotted in Figure 5.1. Only the input capacitance of the positive side of the driver is shown; the negative side presents the same amount of input capacitance. The plot shows that increasing $R_{MOS}$ decreases the input capacitance significantly: for example, increasing $R_{MOS}$ from 12.5 $\Omega$ to 25 $\Omega$ approximately halves it. However, raising the driver resistance much further yields diminishing returns, since the absolute reduction in input capacitance becomes smaller as the resistance keeps increasing.
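The trend in Figure 5.1 can be reasoned about with a first-order model. As a minimal sketch, assume the driver transistors operate in the triode region, so that their ON-resistance scales as $1/W$ and their gate capacitance scales as $W$. Then $C_{in} \propto 1/R_{MOS}$. The constant `k_cap` and the function name below are hypothetical. The constant is normalized so that the 12.5 $\Omega$ case equals 1. The model reproduces the halving from 12.5 $\Omega$ to 25 $\Omega$ and the shrinking absolute savings at higher resistance, but it is not fitted to the simulated data.

```python
def input_capacitance(r_mos_ohm: float, k_cap: float = 12.5) -> float:
    """Normalized driver input capacitance under the first-order model C_in ~ 1/R_MOS.

    k_cap is a hypothetical scaling constant chosen so that R_MOS = 12.5 ohm gives 1.0.
    """
    return k_cap / r_mos_ohm


previous = None
for r_mos in (12.5, 25.0, 37.5, 50.0):
    c_in = input_capacitance(r_mos)
    step = "" if previous is None else f" (saved vs. previous step: {previous - c_in:.3f})"
    print(f"R_MOS = {r_mos:4.1f} ohm -> C_in = {c_in:.3f}{step}")
    previous = c_in
```

Each additional 12.5 $\Omega$ step saves less capacitance than the one before. This is why a moderately higher driver resistance captures most of the benefit.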

Increasing the driver impedance requires placing two additional resistors ($2R_{par}$) between the output terminals of the driver, as shown in Figure 5.2, to maintain proper termination. In the figure, the SST driver resistance is shown as a combination of a non-linear MOS resistance ($R_{MOS}$) and a linear polysilicon resistor ($R_{poly}$). In the conventional design, proper termination requires $R_{MOS} + R_{poly} = Z_0$, where



Figure 5.5: Block diagram of the proposed PAM-4 SST TX.

$Z_0$ is the characteristic impedance of the channel. In the proposed design, the MOS width is reduced to decrease the input capacitance of the driver. Because the minimum possible channel length is already used, this increases the ON-resistance of the MOS. To restore proper termination, a resistor is added between the output terminals of the driver, while the ratio of the linear poly resistance to the non-linear MOS resistance is kept high enough to preserve linearity.

The parallel combination of $R_{par}$ and $R_{MOS} + R_{poly}$ in Figure 5.2 should equal $Z_0$, so that the total impedance seen from the TX output remains equal to the characteristic impedance. For a target TX output impedance of 50 $\Omega$, Figure 5.3 shows the required SST driver resistance as a function of the parallel resistance. The plot shows that decreasing the value of the parallel resistor $R_{par}$ allows the driver resistance to be increased, and consequently the input capacitance to be reduced significantly.
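Solving this termination condition for the driver resistance gives the curve of Figure 5.3 in closed form. This is a single-ended view that treats the midpoint of the two series $R_{par}$ resistors in Figure 5.2 as an AC ground; that reading of the figure is an assumption made here:

$$\frac{R_{par}\,(R_{MOS}+R_{poly})}{R_{par}+R_{MOS}+R_{poly}} = Z_0 \quad\Longrightarrow\quad R_{MOS}+R_{poly} = \frac{R_{par}\,Z_0}{R_{par}-Z_0}, \qquad R_{par} > Z_0$$

With $Z_0 = 50\ \Omega$ and $R_{par} = 150\ \Omega$, this gives $R_{MOS}+R_{poly} = 7500/100 = 75\ \Omega$, the value used in Section 5.3. Letting $R_{par} \to \infty$ recovers the conventional condition $R_{MOS}+R_{poly} = Z_0$. The expression also shows why the required driver resistance rises steeply as $R_{par}$ approaches $Z_0$.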

A higher driver resistance decreases the static current consumed by the driver, and therefore also the current flowing through the receiver (RX) termination resistor. Moreover, the parallel resistors between the output terminals of the driver create an additional current path. This path reduces the swing/current efficiency of the SST driver and consequently the voltage swing across the RX termination resistor. Figure 5.4 shows the resulting voltage swing at the RX, assuming a 1 V supply and a 50 $\Omega$ characteristic impedance and RX input termination ($R_{Term}$). The technique therefore trades voltage swing for power consumption. Using a 75 $\Omega$ driver ($R_{MOS} + R_{poly} = 75$ $\Omega$) reduces the swing/current efficiency of the standard 50 $\Omega$ SST driver by 25%, but the efficiency remains three times (200% higher than) that of a CML driver. With this technique, we aim to reduce the dynamic power consumption of the SST driver, which is a much larger portion of the total than the static consumption.
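The 25% and 200% figures follow from a simple DC model. The minimal sketch below rests on several assumptions: a lossless channel; the switch resistance folded into $R_{drv} = R_{MOS} + R_{poly}$; a differential load of $2R_{par}$ in parallel with $2R_{Term}$; and, for the CML reference, 50 $\Omega$ on-chip loads with a fully steered tail current. The function names are hypothetical. Efficiency here means the differential peak-to-peak RX swing divided by the static supply current.

```python
import math


def parallel(*resistances: float) -> float:
    """Parallel combination; math.inf represents an open branch."""
    return 1.0 / sum(1.0 / r for r in resistances)


def sst_dc_metrics(r_par: float, r_drv: float, r_term: float = 50.0, vdd: float = 1.0):
    """Static current, RX swing, and swing/current efficiency of a differential SST driver.

    DC path: VDD -> R_drv -> (2*R_par || 2*R_term) -> R_drv -> GND, lossless channel.
    Returns (current in A, differential peak-to-peak RX swing in V, efficiency in V/A).
    """
    r_load_diff = parallel(2 * r_par, 2 * r_term)
    i_static = vdd / (2 * r_drv + r_load_diff)
    v_rx_ppd = 2 * i_static * r_load_diff  # the data toggles the polarity of the drop
    return i_static, v_rx_ppd, v_rx_ppd / i_static


def cml_efficiency(r_load: float = 50.0, r_term: float = 50.0) -> float:
    """CML with tail current I: differential peak-to-peak swing = 2 * I * (R_load || R_term)."""
    return 2 * parallel(r_load, r_term)


for label, r_par, r_drv in (("conventional", math.inf, 50.0), ("proposed", 150.0, 75.0)):
    i, v, eff = sst_dc_metrics(r_par, r_drv)
    print(f"{label:12s}: I = {i * 1e3:.2f} mA, V_RX = {v:.3f} Vppd, V/I = {eff:.0f} ohm")
print(f"CML reference: V/I = {cml_efficiency():.0f} ohm")
```

Under these assumptions, the proposed driver draws 4.44 mA and delivers about 0.67 $V_{ppd}$. This matches the static current and launch amplitude quoted later in this chapter.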

## 5.3 Transmitter Architecture

Based on the driver termination study in Section 5.2, an example PAM-4 SST TX is designed with two series resistors placed between the output terminals of the driver. The block diagram of the TX is shown in Figure 5.5. The additional poly resistor $R_{par}$ is approximately 150 $\Omega$, and the characteristic impedance of the target channel is 50 $\Omega$. Therefore, the impedance of the driver (excluding $R_{par}$) is 75 $\Omega$, and the launch amplitude of the TX is approximately 0.67 $V_{ppd}$ instead of 1 $V_{ppd}$. Note that previous TX publications have shown that output swings well below 0.67 $V_{ppd}$ are feasible for multi-level modulation [44, 45, 46].

One of the most important choices in TX design is the clock rate of the final data multiplexing stage. Half-rate and quarter-rate architectures, which require two and four clock phases, respectively, are commonly used. The quarter-rate architecture is usually preferred in high-speed TXs, as it avoids the high clock-distribution power of the half-rate structure. For the target data rate of 32 Gb/s and the corresponding jitter requirements, both options are feasible. The quarter-rate architecture is used in the proposed design so that its complexity and power consumption are representative of higher-speed architectures, which demonstrates that the technique is applicable over a wide data-rate range.
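For the 32 Gb/s PAM-4 operating point, the clock frequencies implied by the two options can be worked out directly. In this minimal sketch, `tx_clocking` is a hypothetical helper, and `rate_divider` denotes how many UIs fit in one period of the final-MUX clock. Each phase of the multi-phase clock is spaced one UI apart.

```python
import math


def tx_clocking(data_rate_gbps: float, pam_levels: int, rate_divider: int) -> dict:
    """Clock plan for a TX whose final MUX clock runs at 1/rate_divider of the symbol rate.

    rate_divider = 2 models a half-rate TX, rate_divider = 4 a quarter-rate TX.
    """
    symbol_rate_gbd = data_rate_gbps / math.log2(pam_levels)
    return {
        "symbol_rate_gbd": symbol_rate_gbd,
        "ui_ps": 1e3 / symbol_rate_gbd,
        "clock_ghz": symbol_rate_gbd / rate_divider,
        "clock_phases": rate_divider,  # one clock phase per UI of the clock period
    }


for name, divider in (("half-rate", 2), ("quarter-rate", 4)):
    print(name, tx_clocking(32.0, 4, divider))
```

At 16 GBd, the half-rate C2 clock runs at 8 GHz, which coincides with the Nyquist frequency. The quarter-rate C4 clock runs at 4 GHz, with its four phases spaced 62.5 ps (1 UI) apart.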

The data path includes a PRBS generator, a 16:8 serializer, and a 4-tap FFE, followed by a 4:1 multiplexer (MUX) merged with the SST driver. The combined 4:1 MUX and SST driver consists of 48 identical driver segments (slices) for precise FFE tap-weight control. Since it would be very difficult to distribute the full-rate signal to all 48 SST slices with equal delay, the final serializer is replicated in each slice. ESD devices with >1 kV HBM protection are used at the TX outputs; therefore, the output network is optimized with a T-coil to maximize the bandwidth.

The clock path includes an external half-rate clock input, quadrature-error correction (QEC), a CML-to-CMOS converter, and frequency dividers. The dividers create the 4-phase quarter-rate clock and the slower clocks used in other blocks.

#### 5.3.1 Clock Generation

An external half-rate differential sinusoidal clock is received directly on the termination resistors. The common-mode voltage can be adjusted externally by changing the mid-node voltage of the termination resistors. Changing the common-mode voltage creates an offset voltage in the differential clock signal and thus changes the duty cycle at the output of the CML-to-CMOS converter. Because the quadrature phases are derived from both edges of this clock, adjusting the duty cycle before quadrature clock generation translates into adjusting the quadrature error of the four-phase C4 clock. In other words, the quadrature error is eliminated by inserting a differential voltage offset into the received differential clock signal. Duty-cycle distortion (DCD) is corrected by DCD cleanup circuits, which are buffers with cross-coupled inverters, in the frequency division and distribution of all the clock



Figure 5.6: Schematic of the 4-phase I/Q clock generation circuit.

frequencies used.

After the CML-to-CMOS converter, the rail-to-rail C2 clock is routed to an I/Q divider block for quadrature clock generation. The clock dividers and reset synchronization circuits for I/Q clock generation are shown in Figure 5.6. First, the external reset signal is sampled by the C2 clock for synchronization. Second, two different reset signals and their inverses are created by sampling the synchronized reset signal with the differential C2 clock. These sampled reset signals are used in D-flip-flop-based frequency dividers to set their initial phases properly. In this way, four-phase quadrature C4 clocks are created at the outputs of four different frequency dividers. A TSPC architecture is used for the D-flip-flops in the reset synchronization and frequency division circuits to support high external clock frequencies.
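The following behavioral sketch shows how divide-by-two dividers clocked on opposite edges of C2 yield quadrature C4 phases. It simplifies the circuit under stated assumptions. Two toggling dividers and their complements stand in for the four reset-aligned dividers of Figure 5.6. Time advances in C2 half-periods, which equal 1 UI. Both dividers start from 0, as if released by the synchronized resets. Because CK90 comes from the falling C2 edge in this model, a C2 duty-cycle error would move it away from 90°. This is the mechanism behind the duty-cycle-based quadrature-error correction described in Section 5.3.1.

```python
def quadrature_from_half_rate(num_half_periods: int):
    """Behavioral divide-by-2 quadrature generation from a half-rate clock C2.

    Even steps are C2 rising edges (divider I toggles); odd steps are C2 falling edges
    (divider Q toggles). CK180 and CK270 are taken as complements of CK0 and CK90.
    """
    i_state, q_state = 0, 0
    ck0, ck90, ck180, ck270 = [], [], [], []
    for step in range(num_half_periods):
        if step % 2 == 0:
            i_state ^= 1  # rising edge of C2
        else:
            q_state ^= 1  # falling edge of C2
        ck0.append(i_state)
        ck90.append(q_state)
        ck180.append(1 - i_state)
        ck270.append(1 - q_state)
    return ck0, ck90, ck180, ck270


for name, wave in zip(("CK0", "CK90", "CK180", "CK270"), quadrature_from_half_rate(8)):
    print(f"{name:5s}", wave)
```

Each phase lags the previous one by exactly one C2 half-period, that is, 90° of the C4 period.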

#### 5.3.2 Four-Tap FFE

The TX employs a four-tap FFE to compensate for inter-symbol interference. Besides the main-cursor tap, one tap is assigned to pre-cursor equalization and two taps to post-cursor equalization. For precise FFE tap-weight control, the SST driver consists of 48



Figure 5.7: Schematic of the 4-tap FFE block that creates 1-UI delay intervals.



Figure 5.8: Block diagram of the SST slices.

slices. Of these, 24 are assigned to the main cursor, 6 to the pre-cursor, 12 to the first post-cursor, and 6 to the second post-cursor. When less FFE tap weight is needed, pre/post-cursor slices can be reassigned to complement the 24 dedicated main-cursor slices. This is done by selecting the main-cursor bits instead of the pre/post-cursor bits as the inputs of those slices.
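The slice bookkeeping can be sketched as follows. `allocate_slices` and `tap_weights` are hypothetical helpers, not part of the design. They assume that each slice contributes an equal 1/48 share of the full swing and that tap signs are handled separately by the FFE sign function.

```python
TOTAL_SLICES = 48
DEDICATED = {"pre": 6, "main": 24, "post1": 12, "post2": 6}


def allocate_slices(pre: int, post1: int, post2: int) -> dict:
    """Assign the 48 driver slices to FFE taps (hypothetical helper).

    Any pre/post-cursor slice that is not needed drives the main-cursor bit instead,
    so the main cursor receives all remaining slices.
    """
    requested = {"pre": pre, "post1": post1, "post2": post2}
    for tap, count in requested.items():
        if not 0 <= count <= DEDICATED[tap]:
            raise ValueError(f"{tap} tap supports 0..{DEDICATED[tap]} slices, got {count}")
    return {"pre": pre, "main": TOTAL_SLICES - pre - post1 - post2, "post1": post1, "post2": post2}


def tap_weights(allocation: dict) -> dict:
    """Magnitude of each tap as a fraction of the full swing (signs not modeled)."""
    return {tap: count / TOTAL_SLICES for tap, count in allocation.items()}


print(tap_weights(allocate_slices(pre=2, post1=8, post2=0)))
```

For example, using 2 pre-cursor slices and 8 first-post-cursor slices leaves 38 slices on the main cursor. The tap-weight resolution is one slice, or about 2.1% of the full swing.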

After the data is serialized into 2×4-bit words for the MSB and LSB, each bit must be delayed in 1-UI steps to create the different phases for FFE operation, namely pre, main, post1, and post2. To this end, we use the circuit in Figure 5.7, which is carefully designed to respect the setup-time limitations of the sampling flip-flops. For example, to create a 90-degree shift, data previously sampled with CK0 cannot be resampled immediately with CK90, since that would violate the setup-time specification. The proposed FFE circuit, based on [13], ensures a setup-time margin of at least 3 UI at each stage. Compared to the prior work in [13], the circuit in Figure 5.7 supports a four-tap rather than a three-tap FFE and adds a sign function implemented by a final XOR gate. The drawback of the proposed FFE circuit is its additional latency.
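Functionally, the 1-UI spacing of the four FFE phases amounts to shifting the serial bit stream by one UI per tap while it is still in 4-bit parallel form. The sketch below is purely behavioral. It ignores the clock phases and the setup-time margins that the circuit of Figure 5.7 is designed to guarantee, and both helper names are hypothetical. Lane 0 of each word is assumed to be transmitted first.

```python
def ffe_phase_words(words: list, shift: int) -> list:
    """Return the 4-bit quarter-rate words of the stream shifted by `shift` UI.

    words: list of 4-element lists; lane 0 of each word is transmitted first.
    shift: 0 for main, +1 for post1, +2 for post2, -1 for pre (one UI earlier).
    Bits that fall outside the given data are filled with 0.
    """
    serial = [bit for word in words for bit in word]
    n = len(serial)
    shifted = [serial[i - shift] if 0 <= i - shift < n else 0 for i in range(n)]
    return [shifted[i:i + 4] for i in range(0, n, 4)]


def apply_sign(words: list, negative: bool) -> list:
    """Model of the final XOR: invert every bit of a tap whose weight is negative."""
    return [[bit ^ int(negative) for bit in word] for word in words]


main = [[0, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
for name, shift in (("pre", -1), ("main", 0), ("post1", 1), ("post2", 2)):
    print(f"{name:5s}", ffe_phase_words(main, shift))
```

The example shows the cross-word dependency that makes the timing delicate. Lane 0 of a post1 word comes from lane 3 of the previous main word, and lane 3 of a pre word comes from lane 0 of the next main word.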

Buffers at the output of the FFE block create both the positive and negative phases of the cursor bits and drive long routes, owing to the sliced structure of the driver. In particular, the main-cursor bit is routed to all slices so that any slice can be assigned to the main cursor. Therefore, each bit is selected by the clock phase preceding the one that generates the same bit in the final 4:1 serializer, which provides the largest possible delay margin for the routing.



Figure 5.9: Schematic of the final 4:1 serializer stage.



Figure 5.10: Waveforms of the final 4:1 serializer.

#### 5.3.3 4:1 Serializer and SST Driver

The structure of the SST slices is shown in Figure 5.8. Each slice consists of a data selector for FFE tap-weight control, a final 4:1 serializer, and an SST driver. After the FFE block, the cursor bits are sent to the 48 identical slices, and the correct bit for each slice is selected by a MUX at its input. For PAM-4 operation, one-third of the slices are assigned to the LSBs and the other two-thirds to the MSBs.
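The 2:1 MSB-to-LSB slice ratio is what produces four equally spaced PAM-4 levels. A minimal sketch, assuming ideal linear summation of the slice outputs, all slices on the main cursor (no FFE), and a plain binary bit-to-level mapping. The function name is hypothetical.

```python
from fractions import Fraction

MSB_SLICES = 32  # two-thirds of the 48 slices
LSB_SLICES = 16  # one-third of the 48 slices


def pam4_level(msb: int, lsb: int) -> Fraction:
    """Normalized differential output when all slices drive the main-cursor symbol.

    A slice whose input bit is 1 contributes +1 and a slice whose bit is 0 contributes -1;
    the sum is divided by the total number of slices.
    """
    total = MSB_SLICES * (2 * msb - 1) + LSB_SLICES * (2 * lsb - 1)
    return Fraction(total, MSB_SLICES + LSB_SLICES)


for msb, lsb in ((0, 0), (0, 1), (1, 0), (1, 1)):
    print(msb, lsb, pam4_level(msb, lsb))
```

The four levels come out as −1, −1/3, 1/3, and 1. Any deviation from this even spacing in the real driver shows up as level mismatch.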

In a quarter-rate TX design, the 4-to-1 serializer is one of the most critical blocks. In the proposed design shown in Figure 5.9, select signals that have 1-UI width are created using quadrature CK4 clock signals, prior to the serialization operation. The waveforms of the 4-to-1 serializer are given in Figure 5.10. Each bit is selected by a NAND gate using the select signal



Figure 5.11: Schematic of the SST driver.



Figure 5.12: Ratio of level mismatch versus R<sub>par</sub>.

with the corresponding delay, considering the routing delay between the FFE block and SST slices. When a bit is selected, it is propagated to the next stage which is another NAND gate. Finally, odd and even bits are combined by a NOR gate. NAND/NOR gates are used in the serializer instead of AND/OR gates to decrease the logic depth, and consequently the latency.
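The NAND-NAND-NOR structure can be checked with a small logic model. It rests on several assumptions not specified in the text. The 1-UI select pulses are decoded as $sel_k$ = CK($k\cdot90°$) AND CK($(k-1)\cdot90°$). Bits 0 and 2 share one second-stage NAND, and bits 1 and 3 share the other. Gate delays and the per-bit clock alignment for routing are ignored. Under this reading, the NOR output is the complement of the serialized word, and the inversion would have to be absorbed downstream. All function names are hypothetical.

```python
from itertools import product


def nand(a: int, b: int) -> int:
    return 1 - (a & b)


def nor(a: int, b: int) -> int:
    return 1 - (a | b)


def quarter_rate_clocks(ui: int) -> list:
    """Levels of CK0, CK90, CK180, CK270 during UI `ui` (0..3); each is high for 2 UI."""
    return [1 if (ui - k) % 4 in (0, 1) else 0 for k in range(4)]


def select_signals(ui: int) -> list:
    """Assumed decoding: sel_k = CK(k*90) AND CK((k-1)*90), high only during UI k."""
    ck = quarter_rate_clocks(ui)
    return [ck[k] & ck[(k - 1) % 4] for k in range(4)]


def serialize_4to1(word) -> list:
    """One quarter-rate period of the NAND-NAND-NOR serializer; returns the NOR output per UI."""
    out = []
    for ui in range(4):
        sel = select_signals(ui)
        n = [nand(word[k], sel[k]) for k in range(4)]  # stage 1: select bit k (active low)
        even = nand(n[0], n[2])  # stage 2: bit 0 or bit 2, whichever is selected
        odd = nand(n[1], n[3])   # stage 2: bit 1 or bit 3
        out.append(nor(even, odd))  # stage 3: merge even/odd; output is inverted
    return out


print(serialize_4to1([1, 0, 1, 1]))  # -> [0, 1, 0, 0]
# Exhaustive check: the NOR output is the complement of the serialized word.
assert all(serialize_4to1(w) == [1 - b for b in w] for w in product((0, 1), repeat=4))
```

The model needs three gate levels in total, compared with AND/OR gates, each of which would internally add an inverter. This is consistent with the latency argument above.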

Following the 4:1 serializer, a pre-driver ensures sharp transitions at the input of the SST driver. The structure of the SST driver is given in Figure 5.11. The series resistance $R_S$ is shared between the PMOS and NMOS current paths. The impedance of each path can be trimmed independently using the tunen[2:0] and tunep[2:0] control bits, which are shared among all slices. With the binary-weighted 3-bit impedance trim, different NMOS/PMOS transistors are activated, and each of the eight codes yields a different output impedance. The MOS transistor that ensures impedance



Figure 5.13: Simulation results showing the comparison between RC network and T-coil network in bandwidth.

matching in the fast process corner is always active. Together with that transistor, activating the other transistors provides matching over the whole range up to the slow process corner, with approximately linear steps in output impedance. In PAM-4 operation, the linearity of the driver is also an important design consideration. Figure 5.12 plots the ratio of level mismatch ($R_{LM}$) versus $R_{par}$ for two different $R_{poly}/(R_{MOS} + R_{poly})$ ratios. As $R_{par}$ decreases, both the output swing and the voltage swing across the MOSFETs in the SST driver decrease, which improves linearity; as a result, $R_{LM}$ improves as $R_{par}$ decreases. Figure 5.12 also shows the effect of $R_{poly}/(R_{MOS} + R_{poly})$ on $R_{LM}$. In state-of-the-art high-speed SST TX designs, the $R_{poly}$ ratio is typically chosen around 75% to keep linearity high [11]. Since the parallel resistor improves linearity, $R_{LM}$ improves significantly even with a 50% $R_{poly}/(R_{MOS} + R_{poly})$ ratio. Therefore, the $R_{poly}$ ratio is set to 67% in the proposed design. This halves the input capacitance of the SST driver while still providing sufficient swing compared to the conventional design, since a conventional 50 $\Omega$ driver with a 75% poly ratio has $R_{MOS} = 12.5\ \Omega$. Following the technique explained in Section 5.2, the driver output impedance is designed as approximately $R_{MOS} + R_{poly} = 75\ \Omega$, with $R_{MOS}$ and $R_{poly}$ equal to 25 $\Omega$ and 50 $\Omega$, respectively.
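The text does not define $R_{LM}$. The sketch below assumes the commonly used IEEE 802.3 level-separation-mismatch definition: $V_{mid} = (V_0 + V_3)/2$, $ES_1 = (V_1 - V_{mid})/(V_0 - V_{mid})$, $ES_2 = (V_2 - V_{mid})/(V_3 - V_{mid})$, and $R_{LM} = \min(3ES_1, 3ES_2, 2 - 3ES_1, 2 - 3ES_2)$. The level values in the example are illustrative, not measured data.

```python
def ratio_level_mismatch(levels) -> float:
    """R_LM of four PAM-4 levels ordered from lowest to highest (assumed IEEE 802.3-style definition)."""
    v0, v1, v2, v3 = levels
    v_mid = (v0 + v3) / 2
    es1 = (v1 - v_mid) / (v0 - v_mid)
    es2 = (v2 - v_mid) / (v3 - v_mid)
    return min(3 * es1, 3 * es2, 2 - 3 * es1, 2 - 3 * es2)


print(ratio_level_mismatch([-1.0, -1 / 3, 1 / 3, 1.0]))  # equally spaced levels
print(ratio_level_mismatch([-1.0, -0.30, 0.30, 1.0]))    # unequal spacing
```

Perfectly equidistant levels give $R_{LM} = 1$. Any compression or expansion of the inner or outer eyes lowers it, which is the effect the parallel resistor mitigates by reducing the swing across the MOSFETs.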

#### 5.3.4 Output Pad-Network

In addition to the inherent output capacitance of the SST driver, the TX load includes the capacitances of the ESD protection diodes, pads, bondwires, PCB traces, and termination resistors. The ESD protection diodes provide >1 kV HBM protection while contributing a 250 fF load capacitance; therefore, a T-coil output network is designed to extend the TX bandwidth. The T-coil has approximately 600 pH inductance, and its middle point is connected to the ESD diode to cancel the effect of the diode's capacitive load on the bandwidth. Figures 5.13 and 5.14 compare a standard RC network with the T-coil network.
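As a rough plausibility check on the inductance value, consider a commonly used first-order sizing rule for a bridged T-coil that absorbs a load capacitance $C$ into a resistive termination $R$. The rule sets the total coil inductance, meaning both halves plus twice their mutual inductance $M$, to $R^2 C$. Applying it with only the 250 fF ESD capacitance and $R = 50\ \Omega$ is an assumption made here for illustration; the actual design must also account for pad and driver capacitance and for the T-coil parasitics:

$$L_1 + L_2 + 2M \approx R^2\,C_{ESD} = (50\ \Omega)^2 \times 250\ \text{fF} = 625\ \text{pH}$$

This is of the same order as the approximately 600 pH used in this design.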

In the first set of simulations, the bandwidth at the TX output is examined. The T-coil output



Figure 5.14: Simulation results showing the comparison between RC network and T-coil network in return loss.

network provides 2.3 times higher 1-dB bandwidth than the RC network, and the signal at the 8 GHz Nyquist frequency is boosted by 1.3 dB. In the second set of simulations, the return loss of the RC and T-coil networks is compared. The T-coil network improves the return loss by 4.2 dB at the 8 GHz Nyquist frequency. It also keeps the return loss below -6.4 dB up to 20 GHz, a 5.3 dB improvement at that frequency.

#### 5.3.5 Power Saving Analysis of the Proposed Technique

Thanks to the technique explained in Section 5.2, the width of the SST driver transistors can be halved compared to a conventional SST driver. Together with the higher driver resistance, this reduces not only the static current consumption of the driver but also the power consumption of the preceding blocks, since their load capacitance decreases as well. This subsection quantifies the potential power savings of the technique in the data-path blocks. It compares the simulation results of the proposed TX with the estimated power consumption of a conventional TX design.

Schematic-level simulations include only gate capacitances, while post-layout simulations also include routing parasitics. Comparing the two sets of simulations, both performed in the typical corner at 80 °C, therefore allows the power consumption to be broken down into gate-load and routing-load components. Table 5.1 compares the simulated current consumption of the proposed TX with the estimated current consumption of the same TX architecture with a conventional 50 $\Omega$ output impedance. Although the gate size affects both gate and routing parasitics, the reduction in routing parasitics due to the proposed technique is negligible compared to that in gate parasitics. Therefore, the power consumption due to routing load is kept the same, and only the power

| Circuit | Proposed: Gate Load [mA] | Proposed: Routing Load [mA] | Proposed: Total [mA] | Conventional: Gate Load [mA] | Conventional: Routing Load [mA] | Conventional: Total [mA] | Power Saving |
|---|---|---|---|---|---|---|---|
| SST Driver | 4.83 | 1.12 | 5.95 | 5.39 | 1.12 | 6.51 | 8.6% |
| Pre-Driver | 2.16 | 0.58 | 2.74 | 4.32 | 0.58 | 4.90 | 44.1% |
| 4:1 Serializer | 17.97 | 4.85 | 22.82 | 32.14 | 4.85 | 36.99 | 38.3% |
| Bit Select | 1.71 | 0.46 | 2.17 | 2.77 | 0.46 | 3.23 | 32.8% |
| FFE | 3.35 | 1.34 | 4.69 | 4.99 | 1.34 | 6.33 | 25.9% |
| Total | 30.02 | 8.35 | 38.37 | 49.61 | 8.35 | 57.96 | 33.8% |

Table 5.1: Current consumption comparison between the proposed (simulated) and conventional (estimated) designs for critical blocks in the data path.

consumption due to gate load is scaled for all the blocks in the table.

First, we examine the current consumption of the SST driver. Simulation results show that the proposed SST driver consumes 5.95 mA in total. The static current consumption of the proposed 75  $\Omega$  SST driver is 4.44 mA for 1 V supply voltage, while it is 5 mA for the conventional 50 $\Omega$  design. Assuming that the current consumption due to the routing is the same and the loads of the TXs are equal for the two designs, the estimated total current consumption of the conventional SST driver is calculated by adding the static current difference of 0.56 mA to the current consumption of the proposed design.

Second, the current consumption of the pre-driver block is analyzed. The pre-driver block in the conventional design would consume double the current due to the gate load compared to the proposed design because the size of the SST driver transistors is halved with the proposed technique. Considering that the current consumption due to routing is equal for the two designs, the proposed design offers 44.1% power saving for the pre-driver block.

To estimate the power consumption of the 4:1 serializer, note that for the same target frequency and the same rise/fall times, the transistor size of the pre-driver scales down by approximately its power-saving ratio. Therefore, the gate-load current of the 4:1 serializer is also 44.1% lower in the proposed design than in the conventional design. The gate-load current of each further preceding block in the conventional design is estimated in the same way, using the power-saving ratio of the block it drives.

As a result, employing the high-impedance driver technique yields a total power saving of 33.8% for the critical data-path blocks in Table 5.1. The technique has a much smaller effect on the blocks not included in Table 5.1, so their power consumption is assumed to be the same in both designs. Consequently, the 33.8% power saving in the blocks listed in Table 5.1 corresponds to a 20% power saving for the whole TX.
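The estimation procedure can be replayed in a few lines. The sketch takes the simulated proposed-design currents from Table 5.1 and applies the rules stated in the text:
- the static-current difference (5 mA − 4.44 mA) is added for the SST driver;
- the pre-driver gate current is doubled;
- every earlier block is scaled by the saving ratio of the block it drives;
- routing current is unchanged.

The last scaling rule is our reading of "calculated in the same way", and it reproduces the table to within rounding. The names are hypothetical.

```python
# Proposed design (simulated), in mA: (gate-load current, routing-load current),
# ordered from the SST driver back towards the FFE.
PROPOSED = {
    "SST Driver": (4.83, 1.12),
    "Pre-Driver": (2.16, 0.58),
    "4:1 Serializer": (17.97, 4.85),
    "Bit Select": (1.71, 0.46),
    "FFE": (3.35, 1.34),
}
STATIC_CURRENT_DELTA_MA = 5.00 - 4.44  # 50-ohm vs. 75-ohm driver static current at 1 V


def estimate_conventional(proposed: dict):
    """Return per-block conventional (gate, routing, total) currents and power-saving ratios."""
    conventional, saving = {}, {}
    prev_saving = None
    for block, (gate, routing) in proposed.items():
        if block == "SST Driver":
            conv_gate = gate + STATIC_CURRENT_DELTA_MA
        elif block == "Pre-Driver":
            conv_gate = 2 * gate  # its load, the SST driver gates, is twice as large
        else:
            conv_gate = gate / (1 - prev_saving)  # scaled by the saving of the block it drives
        conv_total = conv_gate + routing  # routing load assumed unchanged
        conventional[block] = (conv_gate, routing, conv_total)
        saving[block] = 1 - (gate + routing) / conv_total
        prev_saving = saving[block]
    return conventional, saving


conv, saving = estimate_conventional(PROPOSED)
for block, (g, r, t) in conv.items():
    print(f"{block:15s} gate {g:6.2f} mA, total {t:6.2f} mA, saving {saving[block]:.1%}")
total_prop = sum(g + r for g, r in PROPOSED.values())
total_conv = sum(t for _, _, t in conv.values())
print(f"Total: {total_prop:.2f} mA vs. {total_conv:.2f} mA, saving {1 - total_prop / total_conv:.1%}")
```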



Figure 5.15: Chip micrograph.

## 5.4 Measurement Results

The TX is fabricated in 28 nm FD-SOI CMOS technology. The chip micrograph and the dimensions of the different blocks are shown in Figure 5.15. A custom PCB, shown in Figure 5.16, is designed for the measurements. The chip is placed close to the corner of the PCB, so that the traces for high-speed signals such as the clock input and the TX outputs remain short while edge-mount SMA connectors are used. The measurement PCB includes low-noise LDOs for the different supply domains, such as the I/O and core supplies. It also includes inputs for the trim bits and SMA connectors at the ends of the high-speed signal traces. In addition to the length of the PCB traces, the length of the bondwires is also critical for high-speed signals in a chip-on-board assembly.

For the chip-on-board assembly, the die is placed in a laser-formed PCB cavity whose depth approximately equals the chip thickness. The cross-section of the PCB and the cavity is given in Figure 5.17. The chip is around 775 $\mu$m thick, and the cavity depth, excluding the conductor thickness, is chosen as 800 $\mu$m. Matching the chip thickness and the cavity depth allows the shortest possible bondwires, which minimizes their negative effect on the high-speed signals. The laser-formed cavity process removes the dielectric down to the next copper layer. In other words, the cavity depth cannot be adjusted freely, because dielectric layers are removed entirely. Therefore, the thicknesses of the RO4003C and FR-4 dielectric layers are selected as 300 $\mu$m and 500 $\mu$m, respectively,



Figure 5.16: Custom measurement PCB of the PAM-4 SST TX.



Figure 5.17: Cross section of the cavity in the custom measurement PCB.

to remove the first two layers and obtain an 800 $\mu$m cavity. The conductor widths of the high-speed signal traces are calculated for the given dielectric types and thicknesses to achieve impedance matching. Normally, the second conductor layer is assigned to ground and the third to the power plane. To place a ground layer underneath the chip, the power-plane connection is removed and several vias are placed close to the cavity, as shown in Figure 5.17. A PCB photograph showing the chip placed in the cavity and the bondwires is given in Figure 5.18. The area of the cavity is kept close to the



Figure 5.18: The photograph of the wirebonded chip and the cavity.



Figure 5.19: Measurement setup.

surface area of the die to minimize the spacing between the die pads and the PCB pads. However, the corners of the cavity may be rounded due to imperfections in the cavity-opening process, in which case the chip might not fit. To prevent this, ears have been added to the corners of the cavity, as can be seen in Figure 5.18.

The block diagram of the measurement setup is shown in Figure 5.19. The half-rate external C2 clock is generated by an analog signal generator, and the single-ended clock signal is converted into a differential signal by a high-frequency balun. Since the common-mode voltage of the clock is set and adjusted on chip, external DC blocks are used before the clock enters the measurement board. The differential output passes through two identical





Figure 5.20: Measured eye diagrams with 1-tap pre-cursor and 2-tap post-cursor equalization for (a) 32 Gb/s PAM-4 and (b) 16 Gb/s NRZ.

PCB traces and SMA cables connected to a high-bandwidth oscilloscope.

Figure 5.20 shows the measured eye diagrams. Figure 5.20(a) shows the 32 Gb/s PAM-4 eye diagram obtained with 1-tap pre-cursor and 2-tap post-cursor FFE. It has a worst-case vertical eye opening of 38 mV and a horizontal eye opening of 0.16 UI. Figure 5.20(b) shows the NRZ eye diagram at the same baud rate as the PAM-4 signal (16 Gb/s).

Separate supply pads allow the power consumption of the digital and analog blocks to be measured separately. All measured power values are approximately equal to those estimated from post-layout simulations; therefore, the percentage distribution of power across the components is taken from the post-layout simulations. The resulting power breakdown of the TX is given in Figure 5.21. The total power consumption of the TX is measured as 77.9 mW, which corresponds to an energy efficiency of 2.4 pJ/bit. The TX is designed and optimized for PAM-4 at 64 Gb/s; however, open eye diagrams could be obtained only up to 32 Gb/s in the measurements. Because the high-speed circuitry is overdesigned for the measured data rate, the measured energy efficiency is worse than expected. The reduced eye opening in the measurement is most likely caused by reflections due to the ultra-thin bondwires, which had to be used because of the very small die pads in the



Figure 5.21: Power breakdown of the TX.

| Reference | [41] | [40] | [16] | [17] | This Work |
|---|---|---|---|---|---|
| Technology | 16nm FinFET | 32nm SOI | 65nm CMOS | 28nm FDSOI | 28nm FDSOI |
| Driver Topology | SST | SST | SST DAC | SST | SST |
| Modulation | NRZ | NRZ | PAM-4 | PAM-4 | PAM-4 |
| TX FFE | 3-Tap | 4-Tap | 2-Tap | 4-Tap | 4-Tap |
| Data Rate [Gb/s] | 32.75 | 28 | 34 | 45 | 32 |
| Power [mW] | 120.8 | 217 | 91.8 | 120 | 77.9 |
| Energy-Efficiency [pJ/bit] | 3.69 | 7.75 | 2.7 | 2.6 | 2.4 |

Table 5.2: Performance comparison with other similar SST TXs.

chip-on-board design. Considering the power saving analysis given in Section 5.3.5, the total power consumption without the proposed technique would be 97.5 mW which corresponds to 3.05 pJ/bit energy-efficiency. Therefore, 20% saving in the total power consumption of the TX is achieved with the proposed technique, even without considering the effect of the technique on further preceding blocks or routing.
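The 97.5 mW and 3.05 pJ/bit figures can be reproduced with a minimal sketch. It adds the data-path current difference from Table 5.1 to the measured power and divides by the data rate. The 1 V supply for those blocks is an assumption stated only for the driver in the text, and the constant names are hypothetical.

```python
DATA_RATE_GBPS = 32.0
MEASURED_POWER_MW = 77.9
# Data-path current saved according to Table 5.1 (conventional 57.96 mA vs. proposed 38.37 mA).
DATA_PATH_DELTA_MA = 57.96 - 38.37
SUPPLY_V = 1.0  # assumption: the data-path blocks of Table 5.1 run from a 1 V supply


def energy_per_bit_pj(power_mw: float, data_rate_gbps: float) -> float:
    """mW divided by Gb/s gives pJ/bit (1e-3 J/s / 1e9 bit/s = 1e-12 J/bit)."""
    return power_mw / data_rate_gbps


conventional_power_mw = MEASURED_POWER_MW + DATA_PATH_DELTA_MA * SUPPLY_V
saving = 1 - MEASURED_POWER_MW / conventional_power_mw
print(f"proposed:     {MEASURED_POWER_MW:.1f} mW, {energy_per_bit_pj(MEASURED_POWER_MW, DATA_RATE_GBPS):.2f} pJ/bit")
print(f"conventional: {conventional_power_mw:.1f} mW, {energy_per_bit_pj(conventional_power_mw, DATA_RATE_GBPS):.2f} pJ/bit")
print(f"saving: {saving:.1%}")
```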

Finally, Table 5.2 gives the performance summary and comparison of this study with other published SST TXs that have a similar data rate. The measured energy-efficiency of the proposed TX is 2.4 pJ/bit which is better than or comparable with the designs listed in the table.

## 5.5 Conclusion

In this chapter, we first recognize that, because of its segmented structure, the dynamic power of the SST TX becomes dominant at high data rates. Precise FFE tap-weight control is critical in a PAM-4 TX, so the number of segments cannot be reduced, and the power consumed by routing data and clocks to the segments cannot be eliminated. However, the capacitive load of each segment can be reduced significantly by decreasing the width of the driver transistors, which raises the driver resistance above the characteristic impedance of the system. We propose a technique that employs a high-resistance SST driver to decrease the load of the SST segments. A parallel resistance between the output terminals then matches the overall TX output impedance to the characteristic impedance of the system, at the cost of reduced voltage swing. The presented PAM-4 TX with 4-tap FFE employs this technique: its SST driver has a 75 $\Omega$ output impedance, while the whole TX keeps a 50 $\Omega$ output impedance. Measurements show that the prototype TX achieves 2.4 pJ/bit at 32 Gb/s, which is better than or comparable with other SST TX publications at similar data rates. Together, the analysis and the measurement results show that the proposed technique reduces the power consumption of this TX design by 20%. This chapter demonstrates that the proposed technique can significantly reduce the dynamic power disadvantage of voltage-mode drivers.

# 6 PAM-16 Transmitter and Receiver Analog Front-End<sup>1</sup>

This chapter presents a 32 Gb/s PAM-16 transceiver in 28 nm FD-SOI, consisting of an SST TX and a receiver based on a time-interleaved SAR ADC with embedded analog FFE. The purpose of this work is to optimize the energy efficiency (pJ/bit) for the target data rate over a channel with moderate loss. To this end, a modeling study determines the optimum modulation order, i.e., the one with the least ISI sensitivity for the given equalization capability. All equalization is done in the analog domain to avoid the circuit complexity and power consumption of digital equalization at the selected high modulation order. Moreover, thanks to the 7-bit resolution of the ADC, additional digital equalization can be implemented for more aggressive channels. Owing to the spectral efficiency of PAM-16 and the analog-only equalization, the figure of merit is improved significantly. The design choices for each block are explained, and noise is included in the simulations. Post-layout simulation results at 32 Gb/s with PAM-16 show that the TX consumes 26.85 mW and the analog front-end and ADC consume 49.36 mW. This corresponds to 2.38 pJ/bit for the whole system.

This chapter is organized as follows. First, Section 6.1 gives the motivation for using very-high-order modulation. Section 6.2 then presents a modeling study that determines the optimum modulation order for the target channel at 32 Gb/s, together with the corresponding analysis and results. The circuit details of the proposed PAM-16-compatible source-series-terminated (SST) transmitter are explained in Section 6.3. Section 6.4 describes the ADC-based receiver, consisting of a CTLE and an 8× time-interleaved SAR ADC with 2-tap embedded analog FFE. In the proposed system, all equalization is performed on the receiver side in the analog domain to avoid the complexity and high power consumption of digital equalization at higher modulation orders. Section 6.5 presents the post-layout simulation results of the proposed transceiver, and Section 6.6 concludes the chapter.

<sup>1</sup>This chapter is based on: F. Celik, A. Akkaya, and Y. Leblebici, "A 32 Gb/s PAM-16 TX and ADC-Based RX AFE with 2-Tap Embedded Analog FFE in 28 nm FDSOI," *Microelectronics Journal*, vol. 108, 2021, 104967, [47].



Figure 6.1: Energy-efficiency of the published TRX systems versus channel loss [3].

## 6.1 High-Order Modulation Overview

As data rates and the Nyquist frequency of the signal increase, channel losses introduce significant intersymbol interference (ISI). Complex equalization is required to mitigate this ISI at the expense of power consumption. Equalization capability has been improved by using feed-forward equalizers (FFE) and decision-feedback equalizers (DFE) with a higher number of taps. However, the timing constraints of the DFE feedback loop became harder to meet at high frequencies, and as a result, techniques such as loop-unrolling and look-ahead came forward [48]. Even though the extensive use of equalization techniques helped to achieve higher data rates, silicon speed became a limiting factor. Therefore, most designs above 40 Gb/s reduce the operating frequency, the Nyquist frequency, and the corresponding channel loss by using a higher-order modulation, namely four-level pulse amplitude modulation (PAM-4), instead of NRZ, also known as PAM-2 [49].

PAM-4 has been widely used in many studies [15, 50, 51, 52, 46, 53, 54, 55], and this modulation allowed designers to reach higher data rates over the same target channel. Because PAM-4 transfers 2 bits per symbol, the silicon operating frequency and the Nyquist frequency of the signal are halved compared to NRZ at the same data rate. However, moving to higher-order modulation also increases the sensitivity to residual ISI [31], as shown in Chapter 4. Furthermore, circuit complexity increases significantly on both the transmitter (TX) and receiver (RX) sides with higher-order modulation. In particular, if the equalization is performed in the digital domain, the digital equalizer becomes significantly larger in PAM-4 because each symbol carries 2 bits.



Figure 6.2: Structure of the test setup used to analyze the performance of different PAM orders.

With the use of higher-order modulation in wireline links, analog-to-digital converter (ADC) based receivers have recently gained popularity [56, 57, 58, 59, 60]. ADC-based receivers also offer more sophisticated equalization for high-loss channels, because a powerful equalizer can be implemented in a digital signal processor (DSP) following the ADC. In addition, some publications show that embedding an equalizer inside the ADC significantly decreases the required ADC resolution [61] by mitigating the effect of the ADC quantization noise on RX performance. On the other hand, both the high-speed ADC itself and the DSP logic consume significant power, which makes ADC-based receivers more power-hungry and consequently less energy-efficient architectures.

The purpose of this study is to find and implement architectures and design techniques that achieve the best possible energy-efficiency. Figure 6.1 shows the energy-efficiency of published TRX designs versus the channel loss at the Nyquist frequency. The plot shows that each 30 dB decrease in the channel loss at Nyquist brings approximately a 10 times improvement in energy-efficiency. Therefore, employing the highest possible modulation order, which minimizes the Nyquist frequency and hence the channel loss, could be the key to achieving the best energy-efficiency.

Moving from PAM-4 to an even higher modulation order decreases the Nyquist frequency, and thus the loss that needs to be equalized, even further for the same target data rate, but the sensitivity to residual ISI and noise increases. The spectral efficiency provided by higher-order modulations (e.g., PAM-8 and PAM-16) could potentially offer better energy-efficiency as well.

In this study, we investigate the effect of the very high modulation order on the energy efficiency of a complete TX and RX AFE system. This chapter presents a 32 Gb/s PAM-16 TX and RX AFE that aims to achieve the best energy-efficiency (pJ/bit).



Figure 6.3: Frequency response of the channel.

## 6.2 Comparative Modulation Order Study

While PAM-2 uses two voltage levels to transfer the data, PAM-4 uses four levels, PAM-8 uses eight levels, and so on. As a result, PAM-2 transfers one bit per symbol, while PAM-4 and PAM-8 transfer two and three bits per symbol, respectively. Transferring more bits per symbol enables operation at a lower baud rate and gives a wider horizontal eye opening (a longer unit interval) for the same data rate. However, as the modulation order increases, the vertical eye opening decreases, which makes the symbols more sensitive to noise and residual ISI, as shown in Chapter 4. Therefore, equalization is a very critical part of high-order wireline communication systems.
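These relations can be made concrete with a short calculation. The Python sketch below is illustrative only: the function name `pam_parameters` is hypothetical, and it assumes an ideal PAM-M signal whose M equally spaced levels span a 1 V<sub>ppd</sub> swing. It reproduces the Nyquist frequencies quoted in this chapter for 32 Gb/s (16 GHz for PAM-2, 8 GHz for PAM-4, 4 GHz for PAM-16) and the 1/(M-1) shrinkage of the level spacing that reduces the vertical eye opening.

```python
import math


def pam_parameters(data_rate_gbps, order, swing_vppd=1.0):
    """Return (bits/symbol, baud rate in GBd, Nyquist frequency in GHz, level spacing in V)."""
    bits = math.log2(order)              # bits carried per symbol
    baud = data_rate_gbps / bits         # symbol rate needed for the target data rate
    nyquist = baud / 2                   # Nyquist frequency is half the symbol rate
    spacing = swing_vppd / (order - 1)   # M levels span the swing with M - 1 equal gaps
    return bits, baud, nyquist, spacing


for m in (2, 4, 8, 16, 32, 64):
    bits, baud, fn, dv = pam_parameters(32.0, m)
    print(f"PAM-{m:<2d}: {bits:.0f} b/sym, {baud:5.2f} GBd, "
          f"f_Nyq = {fn:5.2f} GHz, level spacing = {dv * 1e3:6.1f} mV")
```

The same numbers show why PAM-16 lowers the symbol rate by 25% relative to PAM-8 (8 GBd instead of about 10.67 GBd) at 32 Gb/s.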

In this section, we provide a systematic comparison of different modulation orders. The test setup used in this study is shown in Figure 6.2. We consider a target data rate of 32 Gb/s with PAM-2/4/8/16/32/64 modulation and a total output swing of 1 V<sub>ppd</sub> for all modulation orders. The receiver side consists of a CTLE with one zero and two poles, followed by a 2-tap FFE performed in the analog domain.

The loss characteristic of the modeled target channel is given in Figure 6.3. The marked loss values in the figure correspond to the Nyquist frequencies of the modulation orders from PAM-64 to PAM-2, in order of increasing frequency. For example, at 32 Gb/s, the channel loss is 21.58 dB at the 16 GHz Nyquist frequency of PAM-2, while it is only 11.31 dB at the 8 GHz Nyquist frequency of PAM-4. For each modulation order, the CTLE parameters are optimized for the target Nyquist frequency. At the beginning of each simulation, the pulse generator block first sends a single full-scale pulse, and the tap coefficient decider block then determines the optimal FFE taps from the pulse response at the CTLE output. These tap coefficients are given to the FFE block, so that FFE parameters that optimally equalize the post-cursor are used. After the pulse is sent and the FFE taps are decided, a modulated pseudo-random binary sequence (PRBS) is sent to observe the eye opening at the FFE output.

Figure 6.4: Horizontal opening trend for different PAM orders.
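The internal algorithm of the tap coefficient decider block is not detailed here. A minimal zero-forcing sketch, assuming the CTLE-output pulse response is sampled once per UI starting at the main cursor and that the main tap is fixed to 1, chooses the post-cursor tap so that the first post-cursor is nulled. The function names and the pulse values are hypothetical.

```python
def decide_ffe_taps(cursors):
    """Zero-forcing 2-tap FFE weights (c0, c1) for a UI-sampled pulse response.

    cursors[0] is the main cursor h0, cursors[1] the first post-cursor h1, ...
    The main tap is fixed to 1 (assumption); c1 nulls the first post-cursor.
    """
    h0, h1 = cursors[0], cursors[1]
    return 1.0, -h1 / h0


def apply_ffe(cursors, c0, c1):
    """Equalized pulse response g[k] = c0*h[k] + c1*h[k-1], with h[-1] = 0."""
    h = list(cursors)
    g = []
    for k in range(len(h) + 1):
        current = h[k] if k < len(h) else 0.0
        previous = h[k - 1] if k >= 1 else 0.0
        g.append(c0 * current + c1 * previous)
    return g


# Hypothetical UI-spaced pulse response at the CTLE output (volts).
pulse = [0.60, 0.20, 0.05]
c0, c1 = decide_ffe_taps(pulse)
equalized = apply_ffe(pulse, c0, c1)
print([round(v, 4) for v in equalized])
```

The first post-cursor is cancelled, but the second one becomes h2 - h1²/h0 and is not controlled by the FFE, so a 2-tap FFE relies on the CTLE to keep the later post-cursors small.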

In the first set of simulations, the trend of the worst-case horizontal eye opening is obtained without adding any noise to the system. As shown in Figure 6.4, the eye opening increases with the modulation order up to PAM-8. Beyond that point, the eye opening shrinks until the eye is completely closed for PAM-64. However, for high-order modulations, the signal-to-noise ratio (SNR) may limit the performance as well. To evaluate the effect of the SNR more realistically,  $1 \text{ mV}_{\text{RMS}}$  white noise is injected at the input of the CTLE. As shown in Figure 6.4, this does not change the relative performance of the modulation orders. Therefore, we can conclude that the main limiting factor is the ISI sensitivity of each modulation. This study shows that, for the given channel, data rate, and equalization capability, PAM-8 and PAM-16 provide the best tolerance to residual ISI.
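The ISI sensitivity of high orders can be illustrated with a peak-distortion estimate of the worst vertical sub-eye. This is a textbook worst-case bound, not the simulation flow of Figure 6.2, and it looks at vertical rather than horizontal opening. The residual-ISI values are hypothetical and held fixed for every order, whereas in the real link the residual ISI also shrinks as the Nyquist loss decreases with higher M; that competing effect is what produces the optimum in Figure 6.4. Under these assumptions, the eye stays open only while the sum of |residual ISI| divided by the main cursor is below 1/(M-1).

```python
def worst_case_eye(cursors, order, swing=1.0):
    """Peak-distortion estimate of the worst vertical sub-eye opening (V).

    Symbols take M equally spaced levels in [-swing/2, +swing/2].
    cursors[0] is the main cursor; the rest are residual ISI taps
    (normalized pulse response samples at UI spacing).
    """
    spacing = swing / (order - 1)            # distance between adjacent levels
    h0, isi = cursors[0], cursors[1:]
    # Worst case: every interfering symbol sits at a full-scale level
    # (amplitude swing/2) with the sign that shrinks the eye.
    worst_isi = (swing / 2) * sum(abs(h) for h in isi)
    return h0 * spacing - 2 * worst_isi      # negative means the eye is closed


residual = [1.0, 0.02, -0.01]   # hypothetical main cursor and residual ISI
for m in (2, 4, 8, 16, 32, 64):
    print(f"PAM-{m:<2d}: worst vertical eye = {worst_case_eye(residual, m) * 1e3:7.1f} mV")
```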

PAM-16 performs similarly to PAM-8, as shown in Figure 6.4. At the same time, PAM-16 reduces the operating frequency of the transmitter and of the TI ADC, which is usually one of the most power-hungry blocks in transceiver (TRX) systems, by 25% compared to PAM-8 (8 GBd instead of 10.67 GBd at 32 Gb/s). Therefore, PAM-16 is chosen to maximize energy-efficiency.

## 6.3 Transmitter

Based on the study in Section 6.2, PAM-16 is selected as the modulation of the transceiver. In a transmitter compatible with very high-order modulations such as PAM-16, some design choices become very critical. The voltage swing should be high enough to provide sufficient vertical eye opening at the receiver input, because the vertical opening is reduced by a factor of fifteen in PAM-16 compared to PAM-2. Moreover, the linearity of the transmitter is very critical, because any nonlinearity shifts the output levels from their ideal values, and this distortion significantly degrades the eye openings. On the other hand, the main goal is to reach the best energy-efficiency, which means that very large transistors are not preferable due to their large gate capacitance.



Figure 6.5: Block diagram of the TX.

First, the voltage swing is maximized for a given power consumption by using a voltage-mode driver such as SST, because it is four times more power-efficient than a current-mode logic (CML) driver with an equal swing if only the driver itself is considered [24]. Second, the output impedance of an SST driver is the series combination of a linear resistor and a non-linear MOS resistance. A smaller MOS share improves linearity but requires wider transistors, whose gate capacitance increases the power consumption. Therefore, the share of the non-linear part is kept slightly below 50% as a compromise between non-linearity and power consumption.
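The factor of four for the driver alone can be checked with an idealized current calculation. The sketch assumes ideal switches, matched 50 Ω TX and RX terminations, a CML driver whose load resistors and RX terminations both return to AC ground, and equal supply voltages, so the current ratio equals the power ratio; pre-driver power is ignored and the function names are hypothetical.

```python
def vm_driver_current(v_diff_pp, r=50.0):
    """Supply current of an ideal voltage-mode (SST) driver.

    The current path is TX R -> RX R -> RX R -> TX R, so I = V_DD / (4R), and the
    differential voltage across the 2R RX termination swings between +/- I * 2R,
    i.e. a peak-to-peak differential swing of 4 * I * R.
    """
    return v_diff_pp / (4 * r)


def cml_driver_current(v_diff_pp, r=50.0):
    """Tail current of an ideal double-terminated CML driver with R loads.

    Each output sees R || R = R/2, so steering I_tail to one side gives a
    differential swing of +/- I_tail * R / 2, i.e. I_tail * R peak-to-peak.
    """
    return v_diff_pp / r


i_vm = vm_driver_current(1.0)    # 1 V_ppd swing
i_cml = cml_driver_current(1.0)
print(f"SST: {i_vm * 1e3:.1f} mA, CML: {i_cml * 1e3:.1f} mA, ratio = {i_cml / i_vm:.1f}")
```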

The block diagram of the transmitter is shown in Figure 6.5. The external half-rate differential clock is received by the termination and duty-cycle correction block, which corrects the duty cycle by adjusting the common mode of the termination resistors. In addition, the following CML-to-CMOS converter has cross-coupled inverters between the positive and negative paths to clean up duty-cycle distortion. The half-rate clock is used in the final multiplexer stage as well as in the frequency divider that creates the C4 clocks for the PRBS generator and the 16-to-8 multiplexer. The PAMN selector block selects which bit is transferred to which slice group. Then, a 2-to-1 multiplexer creates a full-rate signal at its output, followed by the pre-driver embedded in the SST slices and the SST driver itself.

The output driver of the transmitter is designed to provide PAM-2, PAM-4, PAM-8, and PAM-16 signals. To support this multi-mode operation, the number of slices and the assignment of bits to slices must be chosen carefully. In PAM-2, all the SST slices receive the same bit, as the modulation transfers 1 bit/symbol. In PAM-4, the MSB is sent to twice as many slices as the LSB, so the total number of slices must be a multiple of 3. Similarly, to create equally spaced voltage levels, the number of slices must be a multiple of 7 for PAM-8 and a multiple of 15 for PAM-16. Consequently, the number of slices is selected as 105, the least common multiple of 3, 7, and 15, which gives the most efficient and compact bit routing in the multi-mode transmitter. The slices are grouped accordingly, with different numbers of slices per group, as shown in the block diagram.
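The slice count follows from a least-common-multiple argument. The sketch below assumes the binary-weighted split implied by the PAM-4 rule (each more significant bit drives twice as many slices as the next lower one); the physical grouping in Figure 6.5 may differ, and the function name is hypothetical. Python 3.9 or later is assumed for the multi-argument `math.lcm`.

```python
from math import lcm


def slices_per_bit(order, total_slices):
    """Slices assigned to each bit, LSB first, for binary-weighted PAM levels."""
    n_bits = order.bit_length() - 1          # log2(order) for a power-of-two order
    if total_slices % (order - 1):
        raise ValueError("total slices must be a multiple of order - 1")
    unit = total_slices // (order - 1)       # slices driven by the LSB
    return [unit << b for b in range(n_bits)]


total = lcm(3, 7, 15)                        # PAM-4, PAM-8, and PAM-16 constraints
for m in (2, 4, 8, 16):
    print(f"PAM-{m}: {slices_per_bit(m, total)}")
```

With PAM-16, a symbol value d (0 to 15) turns on 7d of the 105 slices, so the 16 output levels are equally spaced by 1/15 of the full swing.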



Figure 6.6: Schematic of the SST driver.



Figure 6.7: Layout of an SST slice.

The schematic of the SST driver, similar to [11], is shown in Figure 6.6. The SST driver consists of two identical parts, namely driver-P and driver-N. The series resistance  $R_S$  is shared by the PMOS and NMOS paths. The output impedance can be adjusted independently in the PMOS and NMOS branches. The eight trim-select signals are encoded into 3 control bits, and exactly one NMOS/PMOS trim transistor is activated in each trim combination. These tune bits are shared among all 105 slices. The trim transistors have different widths and/or lengths; therefore, each combination yields a different output impedance. To keep each slice small and regular, the width is increased by adding fingers, and the length is increased by adding series transistors or by increasing the channel length. The layout of an SST slice and its dimensions are shown in Figure 6.7. Because of the large number of slices used in parallel, the height of each slice is minimized. The height is set by the PMOS transistors, which must provide the desired ON-resistance. DC trim bits and poly resistors are also included along with the MOS transistors. With the maximum and minimum trim codes, the SS and FF corners are covered, and a 50  $\Omega$  output impedance is maintained in all process corners. For sharp transitions, a pre-driver is included in each SST slice.

Figure 6.8: Block diagram of the ADC-based RX AFE.

## 6.4 ADC-Based RX AFE

#### 6.4.1 Overview

The ADC-based RX AFE is shown in Figure 6.8. The RX AFE includes a CTLE, an input buffer, and an 8-channel time-interleaved SAR ADC with an embedded 2-tap analog FFE, so all the equalization is done on the RX side. Moreover, the equalization is kept solely in the analog domain to avoid the complexity that 4-bit-per-symbol digital processing would bring with PAM-16 signaling. The first stage of equalization is done by the CTLE. Following the CTLE, the 2-tap FFE embedded in the ADC cancels out the remaining ISI. Finally, the output codes of the ADC are observed to evaluate the signal integrity.

In this subsection, the advantages and disadvantages of high-order PAM signaling are examined. Since the operating frequency is reduced significantly thanks to the high modulation order, the horizontal eye opening is expected to be large. Therefore, a clock and data recovery (CDR) circuit is not a critical block and is not included in this system.

Figure 6.9: Schematic of the CTLE followed by the ADC input buffer.

#### 6.4.2 CTLE

CTLE and ADC input buffer schematics are shown in Figure 6.9. The CTLE has independently programmable 4-bit degeneration resistor and capacitor digital-to-analog converters (DACs) to adjust the equalization parameters to their optimum values for the target channel of Figure 6.3, which has 6 dB loss at the 4 GHz Nyquist frequency of PAM-16 at 32 Gb/s. The resistor trim changes the DC gain and the effective peaking, while the capacitor trim changes the peaking at high frequencies without changing the DC gain, as shown in Figure 6.12. In addition, the variable-gain function of the CTLE is used to adjust the swing so that the signal amplitude at the ADC input equals the full-scale input voltage.
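The trim behavior described above matches the textbook first-order model of a source-degenerated differential CTLE with one zero and two poles, in which the degeneration resistance $R_S$ and capacitance $C_S$ sit between the two source nodes (nodes A and B): $H(s) = \frac{g_m}{C_L}\,\frac{s + 1/(R_S C_S)}{\left(s + (1 + g_m R_S/2)/(R_S C_S)\right)\left(s + 1/(R_D C_L)\right)}$, with DC gain $g_m R_D/(1 + g_m R_S/2)$. The sketch assumes this model; the $R_S$ and $C_S$ values are the trim limits given in this section (86 to 135 Ω, 543 to 891 fF), but $g_m$, $R_D$, and $C_L$ are hypothetical, so the printed numbers only illustrate the trends.

```python
import math


def ctle_gain(f_hz, gm, r_s, c_s, r_d, c_l):
    """|H(j*2*pi*f)| of an ideal source-degenerated CTLE (one zero, two poles)."""
    s = 2j * math.pi * f_hz
    w_zero = 1.0 / (r_s * c_s)                       # zero set by the R_S*C_S product
    w_pole1 = (1.0 + gm * r_s / 2.0) / (r_s * c_s)   # pole pushed up by the degeneration
    w_pole2 = 1.0 / (r_d * c_l)                      # output pole
    return abs((gm / c_l) * (s + w_zero) / ((s + w_pole1) * (s + w_pole2)))


def summarize(r_s, c_s, gm=20e-3, r_d=250.0, c_l=50e-15):
    """DC gain (dB), peak gain (dB), and peak frequency over a 10 MHz to 100 GHz sweep."""
    freqs = [10 ** (7 + 4 * i / 400) for i in range(401)]
    gains = [ctle_gain(f, gm, r_s, c_s, r_d, c_l) for f in freqs]
    dc = ctle_gain(0.0, gm, r_s, c_s, r_d, c_l)
    peak = max(gains)
    f_peak = freqs[gains.index(peak)]
    return 20 * math.log10(dc), 20 * math.log10(peak), f_peak


for r_s in (86.0, 135.0):
    for c_s in (543e-15, 891e-15):
        dc_db, peak_db, f_peak = summarize(r_s, c_s)
        print(f"R_S={r_s:5.0f} ohm, C_S={c_s * 1e15:3.0f} fF: DC {dc_db:5.2f} dB, "
              f"peaking {peak_db - dc_db:5.2f} dB at {f_peak / 1e9:5.2f} GHz")
```

In this model, a larger $R_S$ lowers the DC gain and raises the peaking, while $C_S$ only moves the zero, leaving the DC gain unchanged.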

The implementations of the resistor and capacitor trim functions are shown in Figure 6.10 and Figure 6.11. The maximum desired resistance between node A and node B is placed as a standalone  $R_{constant}$  of  $135 \Omega$ . In parallel to  $R_{constant}$ , resistor DAC segments are placed, each consisting of a trim resistor with a T-gate switch on each side. The poly resistor in each segment is  $3.45 \text{ k}\Omega$ , and the non-linear ON-resistance of the series T-gates is kept relatively small. The average voltage of nodes A and B is around 530 mV during operation; therefore, the T-gate ON-resistance is calculated under this operating condition. The T-gates are designed to have an ON-resistance of around 10% of the poly resistor: using a 2  $\mu$m NMOS and a 4.8  $\mu$m PMOS with minimum length, an ON-resistance of 345  $\Omega$  is achieved. As a result, the resistance between nodes A and B can be trimmed between  $86 \Omega$  and  $135 \Omega$ . Capacitor trimming is done similarly. The minimum desired capacitance between node A and node B is placed as a standalone 543 fF capacitor. The trimmable capacitor segments use the same T-gates with a 24.8 fF capacitor placed in between. In this way, the total capacitance between node A and node B can be trimmed between 543 fF and 891 fF, without considering the effect of the T-gate ON-resistance on the effective capacitance.

Figure 6.10: Resistance trim between nodes A and B of CTLE.

Figure 6.11: Capacitor trim between nodes A and B of CTLE.

Figure 6.12: (a) Resistance trim settings and (b) capacitance trim settings of the CTLE.

PAM-16 is very sensitive to residual ISI, and in our system almost all the channel loss must be equalized by the CTLE and the analog FFE. Therefore, while designing the CTLE, the 2-tap analog FFE model used in Section 6.2 is placed after the CTLE and the ADC input buffer to imitate the behavior of the FFE embedded in the ADC. In this way, the optimum CTLE parameters are found in the presence of the 2-tap analog FFE.

Figure 6.13: Pulse response of the channel, CTLE, and FFE outputs.

Pulse responses at the channel output, CTLE output, and FFE model output are shown in Figure 6.13. The CTLE is designed to compensate for almost all the channel loss at the Nyquist frequency. Because the 8-channel TI-ADC uses direct sampling and only one ADC channel loads the input buffer at a time, a small ADC buffer suffices; this reduces the output load and provides high bandwidth even without inductive peaking. Consequently, the CTLE area is much smaller, since inductors usually occupy a large area. The CTLE has a peak frequency of around 3× the Nyquist frequency. The main cursor of the pulse response should reach the voltage level of the settled (DC) logic-1 response. With the optimum CTLE and FFE parameters, the pulse response at the CTLE output shows that the loss in the main cursor is compensated, but there is a significant undershoot after the pulse. Increasing the peaking increases this undershoot, which causes residual ISI beyond the first post-cursor: the 2-tap FFE cancels the first post-cursor, but it cannot remove the second post-cursor. Therefore, the lower bound of the CTLE gain peaking is set by the peak voltage of the main cursor in the pulse response, while the upper bound is set by the second post-cursor voltage.

Moreover, the linearity of the CTLE is an important design consideration, because its gain is not constant across the whole input voltage range. To minimize the non-linearity, the TX output is AC-coupled to the RX and the common-mode voltage is set to 1 V. In addition, the supply voltage of the CTLE and the buffer is raised from the nominal 1 V to 1.25 V. This relaxes the saturation conditions of the current sources and yields much more linear operation. The output common-mode voltage of the CTLE is also set to 1 V so that the ADC input buffer can provide the 500 mV output common-mode voltage required by the ADC.



Figure 6.14: Block diagram of the single-channel 1 GS/s 7-bit SAR ADC with 2-tap embedded analog FFE and the clock signal timing diagram.

#### 6.4.3 8 GS/s Time-Interleaved SAR ADC

The block diagram of the TI-ADC is shown in the gray box of Figure 6.8. The 8 GS/s time-interleaved ADC consists of 8 identical 7-bit 1 GS/s SAR ADC channels operating in parallel. A buffer is used at the input of the TI-ADC to drive all the channels and the routing; the schematic of this source-follower buffer, following the CTLE, is shown in Figure 6.9. An external 4 GHz differential clock is received and applied to a frequency divider block, which passes on or generates clock signals of different frequencies and applies them to the TI clock generator blocks. Each TI clock generator block receives these 4 GHz, 2 GHz, and 1 GHz clock signals with multiple phases and creates two sampling signals for each SAR ADC channel. To embed the FFE function in the ADC, post-cursor equalization requires two consecutive sampling phases: one for the current input signal and one for the previous input sample. The sampling clocks used in each ADC channel are named  $CK_{samp}$  and  $CK_{sampffe}$. The buffer outputs and the sampling clocks are applied to the bootstrapped sampling switches placed at the input of each ADC channel.
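The role of the two sampling phases can be sketched as a schedule. This assumes PAM-16 at 32 Gb/s (8 GBd, one symbol every 125 ps) and that the $CK_{sampffe}$ edge of a channel captures the symbol one unit interval before the one captured by its $CK_{samp}$ edge. Under that assumption, symbol n is converted by channel n mod 8, whose FFE CDAC holds symbol n - 1. The function and field names are hypothetical; the actual clock generator derives these edges from the 4, 2, and 1 GHz clocks.

```python
def ti_sampling_schedule(n_symbols, n_channels=8, ui_ps=125.0):
    """Sampling events of an n-channel TI-ADC with a 2-tap embedded FFE.

    Each channel runs at 1/n_channels of the symbol rate. Its CK_samp edge
    captures the current symbol; its CK_sampffe edge captures the previous
    symbol (one UI earlier) for the post-cursor tap.
    """
    schedule = []
    for n in range(1, n_symbols):        # symbol 0 has no predecessor
        schedule.append({
            "symbol": n,
            "channel": n % n_channels,
            "t_samp_ps": n * ui_ps,
            "t_sampffe_ps": (n - 1) * ui_ps,
        })
    return schedule


for event in ti_sampling_schedule(11):
    print(event)
```

Consecutive symbols handled by the same channel are 8 × 125 ps = 1 ns apart, which is the 1 GS/s per-channel rate.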

#### 6.4.4 1 GS/s Single-Channel SAR ADC and Embedded FFE Implementation

The block diagram of the single-channel SAR ADC is shown in Figure 6.14. The 7-bit SAR ADC achieves a 1 GS/s sampling rate while also providing a 2-tap embedded analog FFE function. To improve the sampling rate, the SAR ADC employs multiple comparators, each dedicated to one comparison, similar to [62, 63]. In this way, the comparator reset time is eliminated and the total delay of the critical path is reduced.

Figure 6.15: Schematic of the Comparator.

Unlike the standard SAR ADC architecture, this design has two different CDACs. The main CDAC,  $CDAC_{SAR}$ , is a standard CDAC that samples the current input signal and is dedicated to the SAR conversion steps. The second CDAC,  $CDAC_{FFE}$ , is dedicated to the FFE function of the ADC. The input signal is sampled onto  $CDAC_{SAR}$  by  $CK_{samp}$  and the standard SAR operation is run, while the previous input sample is sampled by  $CK_{sampffe}$ , held, and scaled in  $CDAC_{FFE}$ .

The FFE operation, based on [58], adds/subtracts the scaled previous sample to/from the current sample. The previous sample held in  $CDAC_{FFE}$  is multiplied by a user-controlled factor, which is possible because the effective CDAC size can be changed. Since bottom-plate sampling is used in  $CDAC_{FFE}$ , the effective CDAC size onto which the input is sampled can be changed by connecting each bottom plate to either the input signal or the common-mode signal during the sampling phase. When the sampling phase is over, the input voltage is transferred to the top plate, scaled by the fraction of the total capacitance that was connected to the input signal during sampling. In this way, the top-plate voltage depends on the 5-bit tap control word, and the FFE tap weight can be adjusted with these bits.
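The scaling by bottom-plate sampling amounts to a capacitive divider ratio. As a sketch, assume a binary-weighted 5-bit  $CDAC_{FFE}$  (1, 2, 4, 8, and 16 units) plus one termination unit that always stays at the common mode, and ignore top-plate parasitics; the transferred magnitude is then the connected capacitance divided by the total. The sizing and the function name are assumptions, and the sign of the transferred voltage (add versus subtract) is assumed to be handled by the differential connection.

```python
def ffe_tap_weight(code, n_bits=5):
    """Fraction of the sampled input transferred to the CDAC_FFE top plate.

    Bit i of `code` connects the bottom plate of the 2**i-unit capacitor to the
    input during sampling; unselected capacitors and one termination unit sit at
    the common mode. The total capacitance is 2**n_bits units.
    """
    if not 0 <= code < 2 ** n_bits:
        raise ValueError("tap code out of range")
    caps = [1 << i for i in range(n_bits)]
    c_total = sum(caps) + 1                                   # + termination unit
    c_input = sum(c for i, c in enumerate(caps) if (code >> i) & 1)
    return c_input / c_total


for code in (0, 4, 16, 31):
    print(code, ffe_tap_weight(code))
```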

On the other hand,  $CDAC_{SAR}$  employs top-plate sampling, so the first comparison is made directly on the sampled voltage before any capacitor is switched. Therefore, a 6-bit binary-weighted CDAC is sufficient for the 7-bit SAR operation. To keep the common mode constant during the whole conversion cycle except for the last switching, the splitting monotonic switching technique [64] is used. The last switching is done single-sided instead of fully differential, which halves the unit capacitor size and decreases the CDAC area. The common-mode change caused by this final single-sided switching has a negligible effect on the comparator performance.

In the SAR operation, the input signal is first sampled onto  $CDAC_{SAR}$  by  $CK_{samp}$ , and the first comparison starts. Then, depending on the decision of the first comparison, the corresponding bottom plates of  $CDAC_{SAR}$  are switched, creating a new voltage at the comparator inputs. At the same time, the first comparator asynchronously generates the clock signal of the next comparator. Thanks to this asynchronous clocking technique, the sampling rate of the ADC is improved significantly. Before the clock signal of the next comparator is pulled up, the clock signal of the current comparator is pulled down, which eliminates the voltage drop due to comparator kickback. Consequently, the common-mode voltage is kept at the same level during the SAR conversion for precise decisions.
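The decision sequence can be captured by an ideal behavioral model. The sketch below uses the basic monotonic switching scheme with top-plate sampling, where the first comparison needs no capacitor switching, so six switching steps (a 6-bit CDAC) yield seven bits. It is a simplification under stated assumptions: the splitting variant [64] used in this design ideally produces the same decisions while keeping the common mode constant, whereas the common mode in this basic model drifts down; comparator offset, noise, and capacitor mismatch are ignored, and the function name is hypothetical.

```python
def sar_monotonic(v_in_diff, v_ref=1.0, n_bits=7):
    """Ideal differential SAR conversion with top-plate sampling and monotonic switching.

    Valid input range: -v_ref < v_in_diff < +v_ref. Returns a code in 0 .. 2**n_bits - 1.
    """
    v_p, v_n = v_in_diff / 2, -v_in_diff / 2    # top plates after sampling (CM = 0)
    code = 0
    for k in range(n_bits):
        bit = 1 if v_p > v_n else 0             # comparator k decides
        code = (code << 1) | bit
        if k == n_bits - 1:
            break                                # last decision: nothing left to switch
        # Switching a 2**(n_bits-2-k)-unit capacitor (out of 2**(n_bits-1) units)
        # from V_ref to ground moves that side down by v_ref / 2**(k+1).
        step = v_ref / 2 ** (k + 1)
        if bit:
            v_p -= step                          # only the higher side moves down
        else:
            v_n -= step
    return code


for v in (-0.99, -0.3, 0.3, 0.99):
    print(v, sar_monotonic(v))
```

Away from decision thresholds, the result equals the ideal code floor((v_in + V_ref) / (2 V_ref) × 2<sup>7</sup>).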

The schematic of the ADC comparator with the FFE and background calibration functions is given in Figure 6.15. The second differential pair is connected to  $V_{ffep}$  and  $V_{ffen}$ , the top-plate voltages of  $CDAC_{FFE}$ . This voltage, which corresponds to the previous input sample scaled in  $CDAC_{FFE}$ , is added to or subtracted from the current sample in the comparator. In other words, the FFE operation creates an offset voltage in the SAR conversion that is a function of the previous input sample. Because the addition/subtraction is done in the analog domain, before quantization, the FFE operates on unquantized samples, and the quantization noise of the ADC is added only once, after equalization.
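This property can be illustrated numerically. The sketch compares the embedded analog FFE (combine, then quantize once) with a hypothetical digital FFE operating on quantized samples (quantize both, then combine). It assumes an ideal 7-bit quantizer over a ±1 V differential range, a tap weight of 0.4, and no clipping; these values are illustrative and not the design's. The analog approach is bounded by 0.5 LSB, while the digital one can reach (1 + w) × 0.5 LSB because the rounding error of the previous sample is scaled and added.

```python
import random

LSB = 2.0 / 2 ** 7   # 7-bit ADC over an assumed +/-1 V differential range


def quantize(v, lsb=LSB):
    """Ideal quantizer: round to the nearest multiple of lsb (no clipping)."""
    return lsb * round(v / lsb)


def ffe_before_quantization(x, x_prev, w):
    # Embedded analog FFE: the comparator sees x - w*x_prev, quantized once.
    return quantize(x - w * x_prev)


def ffe_after_quantization(x, x_prev, w):
    # Digital FFE: both samples are quantized first, then combined.
    return quantize(x) - w * quantize(x_prev)


random.seed(0)
w = 0.4
err_a = err_d = 0.0
for _ in range(100_000):
    x, xp = random.uniform(-0.6, 0.6), random.uniform(-0.6, 0.6)
    ideal = x - w * xp
    err_a = max(err_a, abs(ffe_before_quantization(x, xp, w) - ideal))
    err_d = max(err_d, abs(ffe_after_quantization(x, xp, w) - ideal))
print(f"max error, FFE before quantization: {err_a / LSB:.2f} LSB")
print(f"max error, FFE after quantization:  {err_d / LSB:.2f} LSB")
```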

After the ADC output bits are decided, the background calibration phase starts. In each sampling phase, one of the seven comparators is calibrated using the auxiliary differential pair and the calibration voltages ( $V_{calp}$  and  $V_{caln}$ ). Since the offset of each comparator is zeroed in its calibration phase, the offset mismatch between comparators is eliminated. Bandwidth mismatch between the ADC channels is also negligible, because PAM-16 operation reduces the Nyquist frequency to 4 GHz, which is much lower than the bandwidth of the channels.

## 6.5 Simulation Results

Layouts of the TX and RX AFE including the  $8 \times$ TI ADC are given in Figure 6.16. The simulation results presented in this section are derived from post-layout simulations. First, noise simulations of each block in the transceiver are run at 80 °C. The input-referred noise of the CTLE is 303  $\mu$V<sub>RMS</sub>, and 338  $\mu$V<sub>RMS</sub> for the CTLE and buffer together. The input-referred noise of the ADC itself is 1.46 mV<sub>RMS</sub>. Combining the noise of all blocks and referring it to the RX input results in 749  $\mu$V<sub>RMS</sub> at the input of the CTLE. Assuming that the system has supply noise approximately equal to the device noise, 1 mV<sub>RMS</sub> white noise is injected at the input of the CTLE, and all the simulations are run in the presence of this noise source. Moreover, capacitive loads equivalent to the ESD diode and pad capacitances at the TX output and the RX input are included in the simulation setup. The moderate-loss channel shown in Figure 6.3 is used in the simulations.

Figure 6.16: Layout of TX and RX AFE.
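Uncorrelated noise sources add in power, so their RMS values combine as a root-sum-square. The sketch applies this to the numbers above: adding supply noise equal to the device noise gives about 1.06 mV<sub>RMS</sub>, consistent with the 1 mV<sub>RMS</sub> injected in the simulations. The second part is only an inference and not a reported result: it assumes the 749 µV figure is the root-sum-square of just the CTLE+buffer noise and the ADC noise divided by an effective front-end gain G, and back-computes G.

```python
import math


def rss(*rms_values):
    """Root-sum-square of uncorrelated RMS noise contributions."""
    return math.sqrt(sum(v * v for v in rms_values))


device_noise_uv = 749.0            # total device noise referred to the CTLE input
supply_noise_uv = device_noise_uv  # assumption stated in the text
total_uv = rss(device_noise_uv, supply_noise_uv)
print(f"device + supply noise: {total_uv:.0f} uV_RMS")

# Inference (assumption): 749^2 = 338^2 + (1460 / G)^2, with 338 uV from the
# CTLE + buffer and 1460 uV from the ADC at its own input.
adc_at_input_uv = math.sqrt(device_noise_uv ** 2 - 338.0 ** 2)
print(f"ADC share at RX input: {adc_at_input_uv:.0f} uV_RMS, "
      f"implied G = {1460.0 / adc_at_input_uv:.2f}")
```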

The eye diagram at the output of the TX is shown in Figure 6.17. The transmitter does not offer any equalization; it drives its own self-load, the ESD/pad capacitances, and the target channel together with the termination impedance at the RX side. The eye diagram at the channel output is completely closed due to ISI. The eye diagram at the output of the ADC buffer following the CTLE is shown in Figure 6.18. As explained in the CTLE subsection, the CTLE has an intentionally high peaking that compensates the channel loss but creates a significant undershoot/overshoot in the post-cursor, which is later corrected by the analog FFE integrated in the ADC. The continuous FFE output cannot be observed, because the FFE operation is applied to the sampled signal in the ADC. To observe the equivalent continuous eye diagram at the FFE output, the FFE model used in Section 6.2 imitates the same FFE behavior, with its tap weights set to the values used in the ADC. The eye diagram at the output of the FFE model following the CTLE and buffer circuits is shown in Figure 6.19. The corresponding timing bathtub of the middle eye of the FFE output is given in Figure 6.20, and the histogram of the same eye is given in Figure 6.21. The two peaks in the histogram correspond to the jumps caused by the FFE operation. The horizontal openings of all the eyes, starting from the top eye, are shown in Figure 6.22. The behavioral FFE model is used only in this simulation, to illustrate the continuous FFE output eye diagram; the circuit-level results of the whole system, including the ADC with embedded FFE, are presented next.

The post-layout simulations of the whole system, including the ADC with 2-tap embedded analog FFE, give the histograms of the ADC output codes shown in Figure 6.23. For PAM-16 at 32 Gb/s, the eye opening is 2 codes in three of the eyes, including the top and bottom eyes, which are the worst cases, while the other eyes have at least a 3-code opening. For PAM-8 at 24 Gb/s, the worst eye openings are 11 codes, at the top and bottom eyes. To estimate the bit error rate (BER), the extrapolated probability distribution of the ADC output codes is plotted on a logarithmic scale, as shown in Figure 6.24. For PAM-16, all the extrapolated lines cross below a probability of  $10^{-6}$ . For PAM-8, the crossing points of the extrapolated lines are well below a probability of  $10^{-12}$ . The probability of error for each PAM level is calculated from the area under its probability distribution beyond the decision thresholds, and the total probability of bit error is obtained from these. The BER is calculated as  $< 10^{-5}$  for PAM-16 at 32 Gb/s and  $< 10^{-12}$  for PAM-8 at 24 Gb/s. The total power consumption is 26.85 mW for the TX and 49.36 mW for the RX AFE including the CTLE and ADC, for PAM-16 at 32 Gb/s. The corresponding power breakdowns are given in Figure 6.25 for the TX and in Figure 6.26 for the RX. As a result, the energy-efficiency of the whole transceiver is 2.38 pJ/bit at 32 Gb/s, which is better than that of the other publications using PAM-4 listed in Table 6.1.

Figure 6.19: FFE output eye diagram.

Figure 6.21: FFE output horizontal histogram.

Figure 6.22: FFE output horizontal openings of each eye.
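The energy-efficiency figure is total power divided by data rate. For the BER, this chapter extrapolates the tails of the simulated code histograms; a common textbook alternative, shown here only as an assumption-based cross-check and not as the method used above, models the noise on each level as Gaussian. For Gray-coded PAM-M with half-eye opening d and noise σ, the symbol error rate is 2(1 - 1/M)Q(d/σ), and BER ≈ SER/log2(M). The function names and the example d/σ ratios are hypothetical.

```python
import math


def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Energy per bit: mW divided by Gb/s gives pJ/bit."""
    return power_mw / data_rate_gbps


def q_function(x):
    """Gaussian tail probability Q(x) = P(N(0, 1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def pam_ber_gaussian(order, half_eye, sigma):
    """Approximate BER of Gray-coded PAM-M with Gaussian noise on every level.

    Inner levels can err in two directions and the two outer levels in one,
    giving SER = 2 (1 - 1/M) Q(half_eye / sigma); with Gray coding most symbol
    errors flip a single bit, so BER is roughly SER / log2(M).
    """
    ser = 2 * (1 - 1 / order) * q_function(half_eye / sigma)
    return ser / math.log2(order)


print(f"{energy_per_bit_pj(26.85 + 49.36, 32.0):.2f} pJ/bit")   # TX + RX AFE
for ratio in (4.0, 5.0, 6.0, 7.0):      # hypothetical half-eye / sigma ratios
    print(f"d/sigma = {ratio}: BER ~ {pam_ber_gaussian(16, ratio, 1.0):.1e}")
```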

## 6.6 Conclusion

This chapter has presented a 32 Gb/s PAM-16 SST TX and an RX AFE consisting of a CTLE and an 8× TI SAR ADC with 2-tap embedded analog FFE. With a modeling study, the ISI sensitivities of different modulations are shown, and the optimum modulation order to maximize the energy-efficiency is determined. Very high-order modulation, such as PAM-16, is used to minimize the power consumption for the target data rate. Thanks to PAM-16, the Nyquist frequency, the corresponding channel loss, the operating frequency, and consequently the power consumption are decreased. The RX compensates for 6.01 dB channel loss at the 4 GHz Nyquist frequency of PAM-16 with the CTLE and the embedded analog FFE in the ADC. For comparison, the loss of the same channel is 21.58 dB at the 16 GHz Nyquist frequency of PAM-2 at the same 32 Gb/s data rate. All the equalization is done in the analog domain; therefore, the complexity and power penalties of digital equalization at high modulation orders are avoided. Moreover, the 7-bit resolution of the ADC makes it possible to add digital equalization for more aggressive channels. An energy-efficiency of 2.38 pJ/bit at 32 Gb/s with BER < 10<sup>-5</sup> is achieved for the whole system consisting of a TX, CTLE, and ADC with embedded analog FFE. The energy-efficiency achieved with PAM-16 is better than that of other publications that use PAM-4. All the simulations are done with parasitic extraction and in the presence of the estimated noise injected at the input of the RX.

| Ref. | Technology | Power Supply | Max. Data Rate | Modulation Type | Architecture | TX Equalization | RX Equalization | Channel Loss | Power | Energy Efficiency |
|------|------------|--------------|----------------|-----------------|--------------|-----------------|-----------------|--------------|-------|-------------------|
| [56] | 28 nm | N/A | 32 Gb/s | PAM-4 | ADC-Based RX | - | CTLE | 31 dB@8 GHz | AFE+ADC: 320 mW | AFE+ADC: 10 pJ/bit |
| [57] | 65 nm | 1 V and 1.2 V | 25.6 Gb/s | PAM-4 | ADC-Based RX | 4-tap | Embedded IIR Filter | 9.6 dB@6.4 GHz | AFE+ADC: 62 mW | AFE+ADC: 2.43 pJ/bit |
| [50] | 16 nm FinFET | 0.9 V, 1.2 V, and 1.8 V | 56 Gb/s | PAM-4 | Slicer-Based RX | - | CTLE, 10-tap DFE | 10 dB@14 GHz | RX: 230 mW | RX: 4.1 pJ/bit |
| [58] | 65 nm | 1 V, 1.2 V, and 1.8 V | 52 Gb/s | PAM-4 | ADC-Based RX | 3-tap | CTLE, 3-tap emb. FFE, DSP | 31 dB@13 GHz | AFE+ADC: 236 mW<br>DSP: 183 mW | AFE+ADC: 4.54 pJ/bit<br>DSP: 3.52 pJ/bit |
| [59] | 16 nm FinFET | 0.9 V, 1.2 V, and 1.8 V | 56 Gb/s | PAM-4 | TX & ADC-Based RX | - | CTLE, DSP | 31 dB@14 GHz | AFE+ADC: 370 mW<br>TX: 140 mW | AFE+ADC: 6.61 pJ/bit<br>TX: 2.5 pJ/bit |
| This work | 28 nm FDSOI | 1.25 V and 1.8 V | 32 Gb/s | PAM-16 | TX & ADC-Based RX AFE | - | CTLE, 2-tap emb. FFE | 6 dB@4 GHz<br>(21.6 dB@16 GHz) | AFE+ADC: 49 mW<br>TX: 27 mW | AFE+ADC: 1.53 pJ/bit<br>TX: 0.84 pJ/bit |

Table 6.1: Performance comparison with state of the art PAM-4 RX or TRX systems.
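The Energy Efficiency column of Table 6.1 follows from dividing each power figure by the maximum data rate, because 1 mW per 1 Gb/s is exactly 1 pJ/bit. The short Python sketch below reproduces a few entries from the power and data-rate values in the table; the helper name `energy_per_bit_pj` is hypothetical, and small rounding differences with respect to published figures are to be expected.

```python
def energy_per_bit_pj(power_mw: float, data_rate_gbps: float) -> float:
    """Return energy efficiency in pJ/bit.

    1 mW / (1 Gb/s) = 1e-3 J/s / 1e9 bit/s = 1e-12 J/bit = 1 pJ/bit,
    so the plain ratio in these units is already in pJ/bit.
    """
    if data_rate_gbps <= 0:
        raise ValueError("data rate must be positive")
    return power_mw / data_rate_gbps


# (reference, block, power in mW, data rate in Gb/s) taken from Table 6.1
TABLE_ENTRIES = [
    ("[56]", "AFE+ADC", 320.0, 32.0),
    ("[59]", "AFE+ADC", 370.0, 56.0),
    ("[59]", "TX", 140.0, 56.0),
    ("This work", "AFE+ADC", 49.0, 32.0),
    ("This work", "TX", 27.0, 32.0),
]

for ref, block, p_mw, rate in TABLE_ENTRIES:
    print(f"{ref:10s} {block:8s} {energy_per_bit_pj(p_mw, rate):5.2f} pJ/bit")

# TX + RX AFE total for this work, reported as 2.38 pJ/bit in Chapter 7
total = energy_per_bit_pj(49.0 + 27.0, 32.0)
print(f"This work, TX + RX AFE: {total:.2f} pJ/bit")
```

Summing the TX and AFE+ADC powers of this work before dividing gives 2.375 pJ/bit, consistent with the 2.38 pJ/bit quoted in the conclusion.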



Figure 6.23: Histogram of the ADC output codes (a) for PAM-16 at 32 Gb/s and (b) for PAM-8 at 24 Gb/s.



Figure 6.24: Extrapolated probability distribution of the ADC output codes (a) for PAM-16 at 32 Gb/s and (b) for PAM-8 at 24 Gb/s.



Figure 6.25: Power breakdown of the TX.



Figure 6.26: Power breakdown of the RX consisting of an AFE and an ADC.

## 7 Conclusion

In this thesis, we investigated different design techniques for high-speed, energy-efficient wireline communication systems. The main goal of this work is to analyze and propose design techniques that reduce the power consumption of I/O systems for copper links with low to medium loss.

In Chapter 3, we describe a 16-channel 10-bit SAR ADC system that operates at a 12 GS/s sampling rate. The total ADC output data rate is 150 Gb/s (with 8b/10b encoding), and the DSP function is realized in an FPGA. A high-speed transmitter is therefore designed to transfer the ADC outputs to the FPGA through the JESD204B standard, which the FPGA supports, so the TX data rate is limited by the maximum rate of this standard. Under these constraints, two transmitters are designed in 28 nm FD-SOI technology, both to minimize the power consumption of the system and to compare the two output driver types experimentally. Both the LVDS and the SST TX prototypes achieve open eye diagrams at 12.5 Gb/s, the maximum speed supported by the standard, and both are benchmarked against other TX designs with similar data rates. Comparing the two designs, the LVDS TX consumes less power and has better jitter performance, while the SST TX is slightly faster, has lower design complexity, and delivers a much higher voltage swing.
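The data-rate budget behind this interface choice can be checked with a few lines of arithmetic. The Python sketch below assumes that 12 GS/s is the aggregate sampling rate of all 16 channels, that every 10-bit sample is transmitted, and that 8b/10b coding is the only overhead; under these assumptions the coded stream is 150 Gb/s, and at the 12.5 Gb/s JESD204B maximum lane rate at least 12 lanes are needed. The lane count is derived here, not stated in the text, and the function name `jesd204b_lane_budget` is illustrative; it does not model JESD204B framing.

```python
import math


def jesd204b_lane_budget(fs_gsps: float, bits_per_sample: int,
                         max_lane_rate_gbps: float = 12.5) -> tuple[float, float, int]:
    """Return (payload rate in Gb/s, 8b/10b line rate in Gb/s, minimum lane count).

    Assumes fs_gsps is the aggregate sampling rate, all sample bits are sent,
    and 8b/10b coding (10 line bits per 8 payload bits) is the only overhead.
    """
    payload_gbps = fs_gsps * bits_per_sample   # raw ADC output rate
    line_gbps = payload_gbps * 10 / 8          # 8b/10b adds 25 % overhead
    # Small tolerance so an exact ratio such as 12.0 is not rounded up by float noise
    lanes = math.ceil(line_gbps / max_lane_rate_gbps - 1e-9)
    return payload_gbps, line_gbps, lanes


payload, line_rate, lanes = jesd204b_lane_budget(fs_gsps=12.0, bits_per_sample=10)
print(f"payload {payload:.0f} Gb/s, 8b/10b line rate {line_rate:.0f} Gb/s, "
      f"at least {lanes} lanes at 12.5 Gb/s")
```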

Increasing the operating frequency is a straightforward way to achieve higher data rates in a wireline link. However, both reaching very high circuit speeds in silicon and coping with the correspondingly higher channel loss are challenging. Therefore, PAM-4 has been widely adopted in place of NRZ in high-speed I/O systems in recent years, and even higher-order PAM may be adopted to reach higher speeds in the future. Chapter 4 presents a comparative study of the ISI sensitivity of different PAM orders for different channel characteristics and data rates. The inherent limitations of PAM-2, PAM-4, and PAM-8 signaling are studied and compared for four channels with different attenuation characteristics at four data rates. For this purpose, a complete PAM-N compatible TRX system is built and examined through architectural modeling. This study shows that higher-order PAM signaling is more sensitive to residual ISI and requires more equalization power.
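One way to see why residual ISI hurts higher-order PAM more is a peak-distortion estimate: with the total swing fixed, each of the N-1 sub-eyes of PAM-N is only 2/(N-1) tall, while the worst-case residual ISI closes every sub-eye by the same absolute amount. The Python sketch below is illustrative and is not the architectural model used in Chapter 4. It computes the worst-case normalized sub-eye opening 1 - D(N-1), where D is the sum of residual cursor magnitudes relative to the main cursor. The residual cursor values are hypothetical, and noise, jitter, and nonlinearity are ignored.

```python
def pam_levels(n: int) -> list[float]:
    """N equally spaced PAM levels normalized to the range [-1, +1]."""
    return [-1.0 + 2.0 * k / (n - 1) for k in range(n)]


def worst_case_eye(n: int, residual_cursors: list[float]) -> float:
    """Worst-case vertical opening of a PAM-N sub-eye, normalized to its ISI-free height.

    residual_cursors: residual pre/post-cursor magnitudes after equalization,
    relative to a main cursor of 1. Peak-distortion model: every ISI term
    takes the largest symbol magnitude with the most harmful sign.
    """
    levels = pam_levels(n)
    spacing = levels[1] - levels[0]           # ISI-free sub-eye height, 2/(n-1)
    max_symbol = max(abs(a) for a in levels)  # 1.0 after normalization
    d = max_symbol * sum(abs(h) for h in residual_cursors)
    # ISI can push the upper level down by d and the lower level up by d
    return max(0.0, (spacing - 2.0 * d) / spacing)


residual = [0.05, 0.03]  # hypothetical residual cursors after equalization
for n in (2, 4, 8, 16):
    print(f"PAM-{n:<2d}: worst-case sub-eye opening = {worst_case_eye(n, residual):.2f}")
```

With the same 8% total residual ISI, this model gives openings of 0.92, 0.76, and 0.44 for PAM-2, PAM-4, and PAM-8, and a fully closed worst-case eye for PAM-16.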

Further analyses are carried out with equalization, both without and with practical bandwidth limitations. Up to a certain channel loss at the corresponding Nyquist frequency, low-order PAM keeps its advantage over higher orders; however, once a bandwidth limitation is applied to the equalizer block, low-order PAM quickly loses this advantage, especially for high-loss channels.

Chapter 5 presents a 32 Gb/s PAM-4 SST TX implemented in 28 nm FD-SOI technology that reduces power consumption by means of a proposed high-impedance driver technique. Precise tap-weight control is essential in PAM-4 TXs, but it significantly increases the number of SST segments required and hence the dynamic power consumption. The proposed technique uses a high-impedance SST driver to reduce the capacitive load of the SST slices, while a parallel resistance placed between the output nodes keeps the output impedance of the whole TX matched to the standard characteristic impedance of the system. Since the ISI sensitivity analysis showed that higher-order PAM is more vulnerable to residual ISI, the equalization must be strong and precise enough to compensate a significant portion of the channel loss. The TX therefore employs a 4-tap FFE with one pre-cursor and two post-cursor taps. Analysis shows that the proposed technique reduces the power consumption of the whole TX by 20%. The measured prototype achieves an energy efficiency of 2.4 pJ/bit at 32 Gb/s, which is better than or comparable to other published SST TXs with similar data rates.
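The impedance bookkeeping of the high-impedance driver can be illustrated with a simple resistive model. The Python sketch below assumes a differential SST driver whose two legs each present a resistance `r_leg`, a shunt resistor R_p across the differential outputs, and a 100 Ω differential (2 × 50 Ω) termination. It solves for the R_p that restores matching and reports, to first order, the relative driver width (taking on-resistance as inversely proportional to width) and the resulting swing relative to a conventional 50 Ω SST. The 100 Ω leg resistance in the example is hypothetical, not the value used in Chapter 5, and the model ignores capacitances, pre-driver power, and FFE segmentation.

```python
def parallel(r1: float, r2: float) -> float:
    """Resistance of r1 in parallel with r2."""
    return r1 * r2 / (r1 + r2)


def shunt_for_match(r_leg: float, z0: float = 50.0) -> float:
    """Differential shunt R_p such that (2 * r_leg) || R_p equals 2 * z0."""
    r_diff, z_diff = 2.0 * r_leg, 2.0 * z0
    if r_diff <= z_diff:
        raise ValueError("driver legs must be higher impedance than z0")
    return r_diff * z_diff / (r_diff - z_diff)


def relative_swing(r_leg: float, r_p: float, z0: float = 50.0) -> float:
    """Load swing relative to a conventional SST with r_leg = z0 and no shunt.

    The driver is modeled as an ideal differential source (unit open-circuit
    voltage) with series resistance 2 * r_leg, loaded by r_p || (2 * z0).
    """
    r_diff, z_diff = 2.0 * r_leg, 2.0 * z0
    r_load = parallel(r_p, z_diff)
    v_load = r_load / (r_diff + r_load)  # resistive divider onto the load
    v_conventional = 0.5                 # matched SST: half the source voltage
    return v_load / v_conventional


r_leg = 100.0  # hypothetical leg resistance, twice the 50-ohm line impedance
r_p = shunt_for_match(r_leg)
print(f"R_p = {r_p:.0f} ohm, output impedance = {parallel(2 * r_leg, r_p):.0f} ohm (diff.)")
print(f"relative driver width ~ {50.0 / r_leg:.2f}, "
      f"relative swing = {relative_swing(r_leg, r_p):.2f}")
```

In this toy example the shunt must be 200 Ω, the driver transistors can be roughly half as wide, and the output swing halves. The model only shows the direction of the trade-off between capacitive load and swing; the actual operating point and the 20% saving come from the analysis in Chapter 5, not from this sketch.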

Finally, Chapter 6 proposes a complete 32 Gb/s PAM-16 TX and ADC-based RX AFE system. First, an architectural modeling study is conducted for a target medium-loss channel to find the optimal modulation order given the equalization capability of the system. It is shown that PAM-16 provides both the best eye opening and the best energy efficiency. Then, a PAM-2/4/8/16 compatible SST TX consisting of 105 identical segments is designed on the TX side, and a CTLE and an 8× time-interleaved (TI), 8 GS/s, 7-bit SAR ADC with a 2-tap embedded analog FFE are designed on the RX side, all in 28 nm FD-SOI technology. Thanks to PAM-16 signaling, the Nyquist frequency is reduced to only 4 GHz, where the channel loss is only 6.01 dB. All equalization is realized in the analog domain to avoid the disadvantages of multi-bit operation. Post-layout simulations show that the total energy efficiency of the TX and RX AFE system is 2.38 pJ/bit at 32 Gb/s with BER < 10<sup>-5</sup>, the best among state-of-the-art systems using an ADC-based RX.
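The rate arithmetic behind the PAM-16 choice is summarized by the Python sketch below: each symbol carries log2(N) bits, so the symbol rate is the bit rate divided by log2(N), and the Nyquist frequency is half the symbol rate. The sketch also divides the symbol rate by the interleaving factor to obtain the rate of each sub-ADC, assuming one ADC sample per symbol (baud-rate sampling); this per-channel figure is derived here rather than stated in the text, and the function name `pam_rates` is illustrative.

```python
import math


def pam_rates(bit_rate_gbps: float, n_levels: int) -> tuple[float, float]:
    """Return (symbol rate in GBd, Nyquist frequency in GHz) for PAM-N."""
    bits_per_symbol = math.log2(n_levels)
    baud_gbd = bit_rate_gbps / bits_per_symbol
    return baud_gbd, baud_gbd / 2.0


for n in (2, 4, 8, 16):
    baud, f_nyq = pam_rates(32.0, n)
    print(f"PAM-{n:<2d} at 32 Gb/s: {baud:5.2f} GBd, Nyquist {f_nyq:5.2f} GHz")

# Baud-rate sampling (assumption): the aggregate ADC rate equals the PAM-16
# symbol rate and is shared by 8 time-interleaved sub-ADCs.
baud16, _ = pam_rates(32.0, 16)
ti_ways = 8
print(f"aggregate ADC rate {baud16:.0f} GS/s, per sub-ADC {baud16 / ti_ways:.0f} GS/s")
```

At 32 Gb/s, PAM-16 runs at 8 GBd with a 4 GHz Nyquist frequency, compared with 16 GHz for NRZ (PAM-2), which is why the channel loss seen by the link drops so sharply.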

#### 7.1 Future Work

Although the measured prototype SST TX presented in Chapter 5 achieves an open eye diagram only up to 32 Gb/s, the TX was designed, optimized, and successfully simulated at 64 Gb/s. The high-speed circuitry is therefore optimized for twice the measured operating frequency, and this overdesign degrades the energy efficiency of the system, which is why the measured energy efficiency is worse than expected. The reduced data rate in the measurement is believed to be caused by reflections from the ultra-thin bondwires that had to be used with the very small die pads. Therefore, another tapeout could use either larger pads, which allow thicker bondwires, or flip-chip packaging.

Moreover, although Chapter 6 shows a significant improvement in energy efficiency, these results have not been verified by measurement, unlike those of the other chapters. Since the layout of the TX and RX AFE system is complete, a prototype can be taped out and measured. The scope of this thesis is limited to design techniques that improve the energy efficiency of wireline links; consequently, a CDR algorithm and its implementation, which are essential for wireline RX systems, are not considered in this system. Adding a CDR implementation would turn the TX and RX AFE into a complete TRX.

### **Bibliography**

- [1] Cisco, "Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2017-2022," [Online]. Available: "https://www.cisco.com/c/en/us/solutions/executive-perspectives/annual-internet-report/index.html", February 2019.
- [2] Google, [Online]. Available: "https://www.google.com/about/datacenters/gallery/".
- [3] D. C. Daly, L. C. Fujino, and K. C. Smith, "Through the Looking Glass-2020 Edition: Trends in Solid-State Circuits From ISSCC," *IEEE Solid-State Circuits Magazine*, vol. 12, no. 1, pp. 8–24, 2020.
- [4] T. Anand, "Wireline Link Performance Survey," [Online]. Available: "https://web.engr.oregonstate.edu/~anandt/linksurvey/".
- [5] J-M. Patenaude, "High-Speed Backplanes Pose New Challenges to Comms Designers," [Online]. Available: "https://www.edn.com/high-speed-backplanes-pose-new-challenges-to-comms-designers/", January 2004.
- [6] Analog Devices Inc., "JESD204B Survival Guide," [Online]. Available: "https://www.analog.com/media/en/technical-documentation/technical-articles/JESD204B-Survival-Guide.pdf", 2014, accessed: 20.03.2019.
- [7] Cisco, "Cisco Visual Networking Index: Complete Forecast Update, 2017-2022," [Online]. Available: "https://www.cisco.com/c/dam/m/en\_us/network-intelligence/service-provider/digital-transformation/knowledge-network-webinars/pdfs/1211\_BUSINESS\_SERVICES\_CKN\_PDF.pdf", December 2018.
- [8] E. Masanet, A. Shehabi, N. Lei, S. Smith, and J. Koomey, "Recalibrating global data center energy-use estimates," *Science*, vol. 367, no. 6481, pp. 984–986, 2020.
- [9] T. Beukema, "Design considerations for high-data-rate chip interconnect systems," *IEEE Communications Magazine*, vol. 48, no. 10, pp. 174–183, 2010.
- [10] Optical Internetworking Forum (OIF), "Common Electrical I/O (CEI)-112G," [Online]. Available: "https://www.oiforum.com/technical-work/hot-topics/common-electrical-interface-cei-112g-2/", 2019, accessed: 09.09.2020.
- [11] C. Menolfi, M. Braendli, P. A. Francese, T. Morf, A. Cevrero, M. Kossel, L. Kull, D. Luu, I. Ozkaya, and T. Toifl, "A 112Gb/S 2.6pJ/b 8-Tap FFE PAM-4 SST TX in 14nm CMOS," in 2018 IEEE International Solid - State Circuits Conference - (ISSCC), 2018, pp. 104–106.
- [12] Z. Toprak-Deniz, J. E. Proesel, J. F. Bulzacchelli, H. A. Ainspan, T. O. Dickson, M. P. Beakes, and M. Meghelli, "A 128-Gb/s 1.3-pJ/b PAM-4 Transmitter With Reconfigurable 3-Tap FFE in 14-nm CMOS," *IEEE Journal of Solid-State Circuits*, vol. 55, no. 1, pp. 19–26, 2020.
- [13] J. Kim, A. Balankutty, R. K. Dokania, A. Elshazly, H. S. Kim, S. Kundu, D. Shi, S. Weaver, K. Yu, and F. O'Mahony, "A 112 Gb/s PAM-4 56 Gb/s NRZ Reconfigurable Transmitter With Three-Tap FFE in 10-nm FinFET," *IEEE Journal of Solid-State Circuits*, vol. 54, no. 1, pp. 29–42, 2019.
- [14] G. Steffan, E. Depaoli, E. Monaco, N. Sabatino, W. Audoglio, A. A. Rossi, S. Erba, M. Bassi, and A. Mazzanti, "6.4 A 64Gb/s PAM-4 transmitter with 4-Tap FFE and 2.26pJ/b energy efficiency in 28nm CMOS FDSOI," in 2017 IEEE International Solid-State Circuits Conference (ISSCC), 2017, pp. 116–117.
- [15] D. Kwon, M. Kim, S. Kim, and W. Choi, "A Low-Power 40-Gb/s Pre-Emphasis PAM-4 Transmitter With Toggling Serializers," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 67, no. 3, pp. 430–434, 2020.
- [16] B. Hu, Y. Du, R. Huang, J. Lee, Y. Chen, and M. F. Chang, "An R2R-DAC-Based Architecture for Equalization-Equipped Voltage-Mode PAM-4 Wireline Transmitter Design," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 25, no. 11, pp. 3260–3264, 2017.
- [17] M. Bassi, F. Radice, M. Bruccoleri, S. Erba, and A. Mazzanti, "A High-Swing 45 Gb/s Hybrid Voltage and Current-Mode PAM-4 Transmitter in 28 nm CMOS FDSOI," *IEEE Journal of Solid-State Circuits*, vol. 51, no. 11, pp. 2702–2715, 2016.
- [18] A. Tajalli, M. Bastani Parizi, D. A. Carnelli, C. Cao, K. Gharibdoust, D. Gorret, A. Gupta, C. Hall, A. Hassanin, K. L. Hofstra, B. Holden, A. Hormati, J. Keay, Y. Mogentale, V. Perrin, J. Phillips, S. Raparthy, A. Shokrollahi, D. Stauffer, R. Simpson, A. Stewart, G. Surace, O. Talebi Amiri, E. Truffa, A. Tschank, R. Ulrich, C. Walter, and A. Singh, "A 1.02-pJ/b 20.83-Gb/s/Wire USR Transceiver Using CNRZ-5 in 16-nm FinFET," *IEEE Journal of Solid-State Circuits*, vol. 55, no. 4, pp. 1108–1123, 2020.
- [19] G. Kim, L. Kull, D. Luu, M. Braendli, C. Menolfi, P. Francese, H. Yueksel, C. Aprile, T. Morf, M. Kossel, A. Cevrero, I. Ozkaya, A. Burg, T. Toifl, and Y. Leblebici, "A 161-mW 56-Gb/s ADC-Based Discrete Multitone Wireline Receiver Data-Path in 14-nm FinFET," *IEEE Journal of Solid-State Circuits*, vol. 55, no. 1, pp. 38–48, 2020.
- [20] K. Gharibdoust, A. Tajalli, and Y. Leblebici, "Hybrid NRZ/Multi-Tone Serial Data Transceiver for Multi-Drop Memory Interfaces," *IEEE Journal of Solid-State Circuits*, vol. 50, no. 12, pp. 3133–3144, 2015.
- [21] S. Palermo, "High-Speed Serial I/O Design for Channel-Limited and Power-Constrained Systems," in CMOS Nanoelectronics Analog and RF VLSI Circuits. McGraw-Hill, 2011, ch. 9.
- [22] A. Sheikholeslami, "Equalizer Circuit [Circuit Intuitions]," *IEEE Solid-State Circuits Magazine*, vol. 12, no. 1, pp. 6–7, 2020.
- [23] H. Zhang, B. Jiao, Y. Liao, and G. Zhang, "PAM4 signaling for 56G serial link applications—A tutorial," *DesignCon*, 2016.
- [24] F. Celik, A. Akkaya, A. Tajalli, A. Burg, and Y. Leblebici, "JESD204B Compliant 12.5 Gb/s LVDS and SST Transmitters in 28 nm FD-SOI CMOS," in 2019 15th Conference on Ph.D Research in Microelectronics and Electronics (PRIME), 2019, pp. 101–104.
- [25] JEDEC Solid State Technology Association, "JEDEC Standard: Serial Interface for Data Converters JESD204B," [Online]. Available: "https://www.jedec.org/sites/default/files/docs/JESD204B.pdf", July 2011, accessed: 20.03.2019.
- [26] A. Tajalli and Y. Leblebici, "A Slew Controlled LVDS Output Driver Circuit in 0.18 μm CMOS Technology," *IEEE Journal of Solid-State Circuits*, vol. 44, no. 2, pp. 538–548, 2009.
- [27] G. Balamurugan, J. Kennedy, G. Banerjee, J. E. Jaussi, M. Mansuri, F. O'Mahony, B. Casper, and R. Mooney, "A Scalable 5-15Gbps, 14-75mW Low Power I/O Transceiver in 65nm CMOS," in 2007 IEEE Symposium on VLSI Circuits, 2007, pp. 270–271.
- [28] M. Kossel, C. Menolfi, J. Weiss, P. Buchmann, G. von Bueren, L. Rodoni, T. Morf, T. Toifl, and M. Schmatz, "A T-Coil-Enhanced 8.5 Gb/s High-Swing SST Transmitter in 65 nm Bulk CMOS With <-16 dB Return Loss Over 10 GHz Bandwidth," *IEEE Journal of Solid-State Circuits*, vol. 43, no. 12, pp. 2905–2920, 2008.
- [29] B. Chattopadhyay, S. N. Bhat, G. Nayak, and R. Mehta, "A 12.5Gbps Transmitter for Multi-standard SERDES in 40nm Low Leakage CMOS Process," in 2018 31st International Conference on VLSI Design and 2018 17th International Conference on Embedded Systems (VLSID), 2018, pp. 13–18.
- [30] Yawei Guo, Zhanpeng Zhang, Wei Hu, and Lianxing Yang, "CMOS multiplexer and demultiplexer for gigabit Ethernet," in *IEEE 2002 International Conference on Communications, Circuits and Systems and West Sino Expositions*, vol. 1, 2002, pp. 819–823 vol.1.
- [31] F. Celik, A. Akkaya, A. Tajalli, and Y. Leblebici, "ISI Sensitivity of PAM Signaling for Very High-Speed Short-Reach Copper Links," in 2019 17th IEEE International New Circuits and Systems Conference (NEWCAS), 2019, pp. 1–4.
- [32] J. Kim, A. Balankutty, R. Dokania, A. Elshazly, H. S. Kim, S. Kundu, S. Weaver, K. Yu, and F. O'Mahony, "A 112Gb/s PAM-4 transmitter with 3-Tap FFE in 10nm CMOS," in 2018 IEEE International Solid - State Circuits Conference - (ISSCC), 2018, pp. 102–104.
- [33] T. Toifl, M. Brändli, A. Cevrero, P. A. Francese, M. Kossel, L. Kull, D. Luu, C. Menolfi, T. Morf, I. Özkaya, and H. Yueksel, "Design considerations for 50G+ backplane links," in *ESSCIRC Conference 2016: 42nd European Solid-State Circuits Conference*, 2016, pp. 477–482.
- [34] K. Gharibdoust, A. Tajalli, and Y. Leblebici, "10.3 A 7.5mW 7.5Gb/s mixed NRZ/multi-tone serial-data transceiver for multi-drop memory interfaces in 40nm CMOS," in 2015 IEEE International Solid-State Circuits Conference - (ISSCC) Digest of Technical Papers, 2015, pp. 1–3.
- [35] J. Lee, M. Chen, and H. Wang, "Design and Comparison of Three 20-Gb/s Backplane Transceivers for Duobinary, PAM4, and NRZ Data," *IEEE Journal of Solid-State Circuits*, vol. 43, no. 9, pp. 2120–2133, 2008.
- [36] T. Shibasaki, T. Danjo, Y. Ogata, Y. Sakai, H. Miyaoka, F. Terasawa, M. Kudo, H. Kano, A. Matsuda, S. Kawai, T. Arai, H. Higashi, N. Naka, H. Yamaguchi, T. Mori, Y. Koyanagi, and H. Tamura, "3.5 A 56Gb/s NRZ-electrical 247mW/lane serial-link transceiver in 28nm CMOS," in 2016 IEEE International Solid-State Circuits Conference (ISSCC), 2016, pp. 64–65.
- [37] J. Han, N. Sutardja, Y. Lu, and E. Alon, "Design Techniques for a 60-Gb/s 288-mW NRZ Transceiver With Adaptive Equalization and Baud-Rate Clock and Data Recovery in 65nm CMOS Technology," *IEEE Journal of Solid-State Circuits*, vol. 52, no. 12, pp. 3474–3485, 2017.
- [38] L. Zhang and T. Kwasniewski, "Optimal equalization for reducing the impact of channel group delay distortion on high-speed backplane data transmission," *AEU International Journal of Electronics and Communications*, vol. 64, pp. 671–681, Jul. 2010.
- [39] G. Jeong, S. Chu, Y. Kim, S. Jang, S. Kim, W. Bae, S. Cho, H. Ju, and D. Jeong, "A 20 Gb/s 0.4 pJ/b Energy-Efficient Transmitter Driver Utilizing Constant-G<sub>m</sub> Bias," *IEEE Journal of Solid-State Circuits*, vol. 51, no. 10, pp. 2312–2327, 2016.
- [40] C. Menolfi, J. Hertle, T. Toifl, T. Morf, D. Gardellini, M. Braendli, P. Buchmann, and M. Kossel, "A 28Gb/s source-series terminated TX in 32nm CMOS SOI," in 2012 IEEE International Solid-State Circuits Conference, 2012, pp. 334–336.
- [41] K. L. Chan, K. H. Tan, Y. Frans, J. Im, P. Upadhyaya, S. W. Lim, A. Roldan, N. Narang, C. Y. Koay, H. Zhao, P. Chiang, and K. Chang, "A 32.75-Gb/s Voltage-Mode Transmitter With Three-Tap FFE in 16-nm CMOS," *IEEE Journal of Solid-State Circuits*, vol. 52, no. 10, pp. 2663–2678, 2017.
- [42] E. Depaoli, H. Zhang, M. Mazzini, W. Audoglio, A. A. Rossi, G. Albasini, M. Pozzoni, S. Erba,
  E. Temporiti, and A. Mazzanti, "A 64 Gb/s Low-Power Transceiver for Short-Reach PAM-4
  Electrical Links in 28-nm FDSOI CMOS," *IEEE Journal of Solid-State Circuits*, vol. 54, no. 1, pp. 6–17, 2019.
- [43] H. Song, S. Kim, and D. K. Jeong, "A Reduced-Swing Voltage-Mode Driver for Low-Power Multi-Gb/s Transmitters," *JSTS: Journal of Semiconductor Technology and Science*, vol. 9, no. 2, pp. 104–109, June 2009.
- [44] Y. Chen, P. Mak, C. C. Boon, and R. P. Martins, "A 36-Gb/s 1.3-mW/Gb/s Duobinary-Signal Transmitter Exploiting Power-Efficient Cross-Quadrature Clocking Multiplexers With Maximized Timing Margin," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 65, no. 9, pp. 3014–3026, 2018.
- [45] H. Yang, A. Roshan-Zamir, Y. Song, and S. Palermo, "A low-power dual-mode 20-Gb/s NRZ and 28-Gb/s PAM-4 voltage-mode transmitter," in 2017 IEEE Asian Solid-State Circuits Conference (A-SSCC), 2017, pp. 261–264.
- [46] H. Ju, M. Choi, G. Jeong, W. Bae, and D. Jeong, "A 28 Gb/s 1.6 pJ/b PAM-4 Transmitter Using Fractionally Spaced 3-Tap FFE and G<sub>m</sub>-Regulated Resistive-Feedback Driver," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 64, no. 12, pp. 1377–1381, 2017.
- [47] F. Celik, A. Akkaya, and Y. Leblebici, "A 32 Gb/s PAM-16 TX and ADC-Based RX AFE with 2-Tap Embedded Analog FFE in 28 nm FDSOI," *Microelectronics Journal*, vol. 108, 2021, 104967.
- [48] S. Ibrahim and B. Razavi, "Low-Power CMOS Equalizer Design for 20-Gb/s Systems," *IEEE Journal of Solid-State Circuits*, vol. 46, no. 6, pp. 1321–1336, 2011.
- [49] M. Hossain, "Recent trend in high-speed wireline link design," *Electrical and Electronic Technology Open Access Journal*, vol. 1, no. 1, pp. 16–18, 2017.
- [50] J. Im, D. Freitas, A. B. Roldan, R. Casey, S. Chen, C. A. Chou, T. Cronin, K. Geary, S. McLeod, L. Zhou, I. Zhuang, J. Han, S. Lin, P. Upadhyaya, G. Zhang, Y. Frans, and K. Chang, "A 40-to-56 Gb/s PAM-4 Receiver With Ten-Tap Direct Decision-Feedback Equalization in 16-nm FinFET," *IEEE Journal of Solid-State Circuits*, vol. 52, no. 12, pp. 3486–3502, 2017.
- [51] B. Min, K. Lee, and S. Palermo, "A 20Gb/s triple-mode (PAM-2, PAM-4, and duobinary) transmitter," *Microelectronics Journal*, vol. 43, no. 10, pp. 687–696, 2012.
- [52] B. Song, K. Kim, J. Lee, J. Chung, Y. Choi, and J. Burm, "A 13.5-mW 10-Gb/s 4-PAM Serial Link Transmitter in 0.13-μm CMOS Technology," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 61, no. 9, pp. 646–650, 2014.
- [53] C. Fan, W. Yu, P. Mak, and R. P. Martins, "A 40-Gb/s PAM-4 Transmitter Using a 0.16-pJ/bit SST-CML-Hybrid (SCH) Output Driver and a Hybrid-Path 3-Tap FFE Scheme in 28-nm CMOS," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 66, no. 12, pp. 4850–4861, 2019.
- [54] H. Do, J. Hwang, H. S. Choi, and D. K. Jeong, "A 48 Gb/s PAM-4 Transmitter With 3-Tap FFE Based on Double-Shielded Coplanar Waveguide in 65-nm CMOS," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 67, no. 9, pp. 1569–1573, 2020.
- [55] G. Byun and M. M. Navidi, "A Low-Power 4-PAM Transceiver Using a Dual-Sampling Technique for Heterogeneous Latency-Sensitive Network-on-Chip," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 62, no. 6, pp. 613–617, 2015.
- [56] D. Cui, H. Zhang, N. Huang, A. Nazemi, B. Catli, H. G. Rhew, B. Zhang, A. Momtaz, and J. Cao, "3.2 A 320mW 32Gb/s 8b ADC-based PAM-4 analog front-end with programmable gain control and analog peaking in 28nm CMOS," in 2016 IEEE International Solid-State Circuits Conference (ISSCC), 2016, pp. 58–59.
- [57] J. Nam and M. S. Chen, "A 12.8-Gbaud ADC-based NRZ/PAM4 Receiver with Embedded Tunable IIR Equalization Filter Achieving 2.43-pJ/b in 65nm CMOS," in *2019 IEEE Custom Integrated Circuits Conference (CICC)*, 2019, pp. 1–4.
- [58] S. Kiran, S. Cai, Y. Luo, S. Hoyos, and S. Palermo, "A 52-Gb/s ADC-Based PAM-4 Receiver With Comparator-Assisted 2-bit/Stage SAR ADC and Partially Unrolled DFE in 65-nm CMOS," *IEEE Journal of Solid-State Circuits*, vol. 54, no. 3, pp. 659–671, 2019.
- [59] Y. Frans, J. Shin, L. Zhou, P. Upadhyaya, J. Im, V. Kireev, M. Elzeftawi, H. Hedayati, T. Pham, S. Asuncion, C. Borrelli, G. Zhang, H. Zhang, and K. Chang, "A 56-Gb/s PAM4 Wireline Transceiver Using a 32-Way Time-Interleaved SAR ADC in 16-nm FinFET," *IEEE Journal of Solid-State Circuits*, vol. 52, no. 4, pp. 1101–1110, 2017.
- [60] B. Song, K. Kim, J. Lee, and J. Burm, "A 0.18-μm CMOS 10-Gb/s Dual-Mode 10-PAM Serial Link Transceiver," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 60, no. 2, pp. 457–468, 2013.
- [61] S. Palermo, S. Hoyos, A. Shafik, E. Z. Tabasy, S. Cai, S. Kiran, and K. Lee, "CMOS ADC-based receivers for high-speed electrical and optical links," *IEEE Communications Magazine*, vol. 54, no. 10, pp. 168–175, 2016.
- [62] L. Kull, D. Luu, C. Menolfi, M. Brändli, P. A. Francese, T. Morf, M. Kossel, A. Cevrero, I. Ozkaya, and T. Toifl, "A 24–72-GS/s 8-b Time-Interleaved SAR ADC With 2.0–3.3pJ/Conversion and >30 dB SNDR at Nyquist in 14-nm CMOS FinFET," *IEEE Journal of Solid-State Circuits*, vol. 53, no. 12, pp. 3508–3516, 2018.
- [63] T. Jiang, W. Liu, F. Y. Zhong, C. Zhong, K. Hu, and P. Y. Chiang, "A Single-Channel, 1.25-GS/s, 6-bit, 6.08-mW Asynchronous Successive-Approximation ADC With Improved Feedback Delay in 40-nm CMOS," *IEEE Journal of Solid-State Circuits*, vol. 47, no. 10, pp. 2444–2453, 2012.
- [64] C. Liu, S. Chang, G. Huang, Y. Lin, and C. Huang, "A 1V 11fJ/conversion-step 10bit 10MS/s asynchronous SAR ADC in 0.18μm CMOS," in 2010 Symposium on VLSI Circuits, 2010, pp. 241–242.

# List of Acronyms

| Acronym | Definition                              |
|---------|-----------------------------------------|
| AC      | Alternating Current                     |
| ADC    | Analog-to-Digital Converter             |
| AFE    | Analog Front-End                        |
| BER    | Bit Error Rate                          |
| CAGR   | Compound Annual Growth Rate             |
| CDAC   | Capacitive Digital-to-Analog Converter  |
| CDR    | Clock-Data Recovery                     |
| CEI    | Common Electrical Input/Output          |
| CMFB   | Common-Mode Feedback                    |
| CML    | Current-Mode Logic                      |
| CMOS   | Complementary Metal-Oxide-Semiconductor |
| CNRZ   | Correlated Non-Return-to-Zero           |
| CTLE   | Continuous-Time Linear Equalizer        |
| DAC    | Digital-to-Analog Converter             |
| DC     | Direct Current                          |
| DCD    | Duty-Cycle Distortion                   |
| DFE    | Decision-Feedback Equalizer             |
| DFF    | D Flip-Flop                             |
| DLL    | Delay-Locked Loop                       |
| DMT    | Discrete Multi-tone                     |
| DSP    | Digital Signal Processor                |
| ESD    | Electrostatic Discharge                 |
| FD-SOI | Fully Depleted Silicon On Insulator     |
| FFE    | Feed-Forward Equalizer                  |
| FIR    | Finite Impulse Response                 |
| FoM    | Figure of Merit                         |
| FPGA   | Field-Programmable Gate Array           |
| GND    | Ground                                  |
| HBM    | Human-Body Model                        |
| I/O    | Input/Output                            |
| IIR    | Infinite Impulse Response               |
| IoT    | Internet of Things                      |
| ISI             | Inter-Symbol Interference                 |
| KTI             | Keizer Technology Interconnect            |
| LDO             | Low-Dropout                               |
| LSB             | Least Significant Bit                     |
| LVDS            | Low-Voltage Differential Signaling        |
| MOS             | Metal-Oxide-Semiconductor                 |
| MSB             | Most Significant Bit                      |
| MUX             | Multiplexer                               |
| NAND            | NOT AND                                   |
| NMOS            | N-Channel Metal-Oxide-Semiconductor       |
| NOR             | NOT OR                                    |
| NRZ             | Non-Return-to-Zero                        |
| PAM             | Pulse Amplitude Modulation                |
| PCB             | Printed Circuit Board                     |
| PCIe            | Peripheral Component Interconnect Express |
| PMOS            | P-Channel Metal-Oxide-Semiconductor       |
| PRBS            | Pseudo-Random Binary Sequence             |
| PSD             | Power Spectral Density                    |
| PSRR            | Power Supply Rejection Ratio              |
| QEC             | Quadrature Error Correction               |
| QPI             | Quick Path Interconnect                   |
| RF              | Radio Frequency                           |
| RX              | Receiver                                  |
| SAR             | Successive-Approximation-Register         |
| SERDES          | Serializer-Deserializer                   |
| SMA             | SubMiniature version A                    |
| SNR             | Signal-to-Noise Ratio                     |
| SRAM            | Static Random-Access Memory               |
| SST             | Source-Series Terminated                  |
| TI              | Time-Interleaved                          |
| TRX             | Transceiver                               |
| TSPC            | True Single-Phase Clock                   |
| TX              | Transmitter                               |
| UI              | Unit Interval                             |
| V<sub>DD</sub>  | Supply Voltage                            |
| XOR             | Exclusive OR                              |

## **Curriculum Vitae**

### Firat Celik

firatcelik09@gmail.com

#### **Research Interests**

Analog/Mixed-Signal IC Design, High-speed SERDES (e.g. TX, RX, equalizer), SAR ADC

#### Education

| 2016-2021 | <b>École Polytechnique Fédérale de Lausanne (EPFL)</b><br>Ph.D. Candidate, Microsystems and Microelectronics (EDMI) |
|-----------|---------------------------------------------------------------------------------------------------------------------|
| 2014-2016 | <b>École Polytechnique Fédérale de Lausanne (EPFL)</b><br>M.Sc., Electrical and Electronics Engineering             |
| 2009-2014 | <b>Istanbul Technical University</b><br>B.Sc., Electronics Engineering                                              |

#### **Professional Experience**

| 2016-2020 | École Polytechnique Fédérale de Lausanne (EPFL)                             |
|-----------|-----------------------------------------------------------------------------|
|           | Research assistant at the Microelectronic Systems Laboratory (LSM).         |
|           | Energy-efficient TX and TRX design with NRZ, PAM-4, and PAM-16 signaling    |
|           | for high-speed wireline links.                                              |
| 2016      | Kandou Bus S.A., Lausanne, Switzerland                                      |
|           | Intern.                                                                     |
|           | Low jitter clock distribution design and layout, CML2CMOS converter design. |
| 2013      | Analog Devices Inc. (formerly Hittite Microwave), Istanbul, Turkey          |
|           | Intern.                                                                     |
|           | Analog IC design and layout.                                                |

### **List of Publications**

#### **Journal Articles**

- **F. Celik**, A. Akkaya, A. Tajalli, and Y. Leblebici, "A 32 Gb/s PAM-4 SST Transmitter with 4-Tap FFE Using High-Impedance Driver in 28 nm FDSOI," *Journal article, under review,* 2021.
- **F. Celik**, A. Akkaya, and Y. Leblebici, "A 32 Gb/s PAM-16 TX and ADC-Based RX AFE with 2-Tap Embedded Analog FFE in 28 nm FDSOI," *Microelectronics Journal*, vol. 108, 2021, 104967.
- A. Akkaya, **F. Celik**, and Y. Leblebici, "An 8-bit 800 MS/s Loop-Unrolled SAR ADC with Common-Mode Adaptive Background Offset Calibration in 28 nm FDSOI," *Journal article, under review,* 2021.
- A. C. Yüzügüler, **F. Celik**, M. Drumond, B. Falsafi and P. Frossard, "Analog Neural Networks With Deep-Submicrometer Nonlinear Synapses," in *IEEE Micro*, vol. 39, no. 5, pp. 55-63, 1 Sept.-Oct. 2019.

#### **Conference Proceedings**

- **F. Celik**, A. Akkaya, A. Tajalli, and Y. Leblebici, "ISI Sensitivity of PAM Signaling for Very High-Speed Short-Reach Copper Links," *2019 17th IEEE International New Circuits and Systems Conference (NEWCAS)*, Munich, Germany, 2019, pp. 1-4.
- **F. Celik**, A. Akkaya, A. Tajalli, A. Burg, and Y. Leblebici, "JESD204B Compliant 12.5 Gb/s LVDS and SST Transmitters in 28 nm FD-SOI CMOS," *2019 15th Conference on Ph.D. Research in Microelectronics and Electronics (PRIME)*, Lausanne, Switzerland, 2019, pp. 101-104.
- A. Akkaya, **F. Celik**, and Y. Leblebici, "A Low-Power 9-Bit 222 MS/s Asynchronous SAR ADC in 65 nm CMOS," *2020 IEEE International Symposium on Circuits and Systems (ISCAS)*, Sevilla, Spain, 2020, pp. 1-5.
- A. Akkaya, **F. Celik**, and Y. Leblebici, "Self-Calibrated Delay-Based LSB Extraction for Resolution Improvement in SAR ADCs," *2019 IEEE Nordic Circuits and Systems Conference (NORCAS): NORCHIP and International Symposium of System-on-Chip (SoC),* Helsinki, Finland, 2019, pp. 1-7.
- A. Akkaya, **F. Celik**, A. Tajalli, and Y. Leblebici, "A 10b SAR ADC with Widely Scalable Sampling Rate and AGC Amplifier Front-End," *2018 IEEE Nordic Circuits and Systems Conference (NORCAS): NORCHIP and International Symposium of System-on-Chip (SoC),* Tallinn, Estonia, 2018, pp. 1-6.