Ultra-Low Power 32-bit Pipelined Adder
Using Subthreshold Source-Coupled Logic
with 5fJ/stage PDP

Armin Tajalli\textsuperscript{a} Elizabeth J. Brauer\textsuperscript{b} Yusuf Leblebici\textsuperscript{a}

\textsuperscript{a}Microelectronic Systems Lab. (LSM),
Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
\textsuperscript{b}Electrical Eng. Dept., Northern Arizona Univ., Flagstaff AZ 86011, USA

Abstract

This article presents a new approach for improving the power-delay performance of subthreshold source-coupled logic (STSCL) circuits. Using a simple two-phase pipelining technique, it is possible to increase the activity rate of STSCL gates at negligible additional cost, and hence reduce the total system energy consumption per operation. In the proposed pipelined topology, each STSCL gate is followed by a simple cross-coupled differential pair operating as a state keeper with very low power consumption and small area overhead. Measurement results on a 32-bit pipelined adder chain fabricated in 0.18\textmu m CMOS technology show that the proposed approach can reduce the power-delay product (PDP) to 5fJ/stage.

Key words: CMOS integrated circuits, ultra low power circuit design, source-coupled logic (SCL), current-mode logic (CML), subthreshold SCL (STSCL), pipelined SCL.

1 Introduction

The demand for ultra low power circuit building blocks in emerging applications where energy consumption is critical has made subthreshold CMOS circuit techniques very attractive [1]. Applications such as wearable computing and implantable systems require very low power circuits with low sensitivity to supply voltage variations, so that circuit performance is ideally independent of the supply voltage [2].
While the power consumption of conventional CMOS digital circuits can be reduced substantially by proper biasing in the subthreshold regime [3]-[5], such circuits generally require very careful control of their supply voltage ($V_{DD}$), since both the speed of operation and the power consumption depend critically on $V_{DD}$ [6].

Source-coupled logic (SCL) circuits, shown in Fig. 1, exhibit very low sensitivity to supply voltage variations [7]. Indeed, the speed of operation in SCL circuits is independent of the supply voltage and can be adjusted through the tail bias current. SCL circuits also exhibit good immunity to substrate and supply noise [7]. Recent results by the authors have shown that this type of circuit can be used in the subthreshold regime, where the power-delay product can be reduced to 1fJ/gate and even less [8], [9]. These properties make subthreshold SCL (STSCL) circuits a good choice for low voltage and low power applications.

Building on this foundation, we present in this article a new approach for improving the performance of STSCL circuits in terms of power-delay product (PDP). Using a simple two-phase pipelining technique, it is possible to increase the activity rate in STSCL circuits and hence use the static power consumption of the gates more efficiently. A dramatic improvement of PDP (by a factor of 14) is demonstrated for a 32-bit adder that is shown to operate with 5fJ/gate PDP, independent of the supply voltage.

In Section 2, a short overview of STSCL circuits is presented. The proposed pipelining technique is introduced in Section 3, and Section 4 presents the simulation and measurement results.

2 Subthreshold SCL Circuits

2.1 Conventional SCL

In an SCL circuit, the logic operation takes place in the switching network, which is composed of NMOS differential pairs as illustrated in Fig. 1. In this configuration, the constant tail bias current $I_{SS}$ is switched between the two NMOS transistors in each stage and is finally steered to one of the output branches. This current is converted to an output voltage by the load resistances ($R_L$), which determine the output logic levels [10]. Generally, PMOS devices biased in the triode region are used as the load resistances. The output voltage swing ($V_{SW} = R_L I_{SS}$) should be high enough to switch the NMOS transistors of the following SCL stages. A replica bias circuit can control the swing so that it remains high enough over process, supply voltage, and temperature (PVT) variations [9].

The main speed limiting factor in SCL topology arises from the circuit output time constant. Hence, the propagation delay of each gate can be estimated by:

\[ t_d \approx \ln 2 \cdot R_L C_L = \ln 2 \cdot \frac{V_{SW} C_L}{I_{SS}} \] \hspace{1cm} (1)

where $C_L$ stands for the total equivalent output capacitance seen by the SCL gate.
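
As a rough numerical illustration of (1), the short Python sketch below evaluates the gate delay for a few tail currents. The component values ($V_{SW}=0.2$ V, $C_L=10$ fF) and the function name `scl_gate_delay` are assumptions chosen only for illustration; they are not measured values from this work.

```python
import math


def scl_gate_delay(v_sw, c_l, i_ss):
    """Gate delay from Eq. (1): t_d = ln(2) * V_SW * C_L / I_SS (seconds)."""
    return math.log(2) * v_sw * c_l / i_ss


# Assumed example values, not measurements: 0.2 V swing, 10 fF load.
V_SW = 0.2
C_L = 10e-15

for i_ss in (10e-12, 1e-9, 100e-9):
    t_d = scl_gate_delay(V_SW, C_L, i_ss)
    print(f"I_SS = {i_ss:.0e} A  ->  t_d = {t_d:.3e} s")
```

Under these assumptions the delay scales inversely with $I_{SS}$ and does not involve $V_{DD}$ at all, which is the property the rest of the paper relies on.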

2.2 Subthreshold SCL

To maintain the desired output voltage swing at very low bias current levels, it is necessary to increase the load resistance value in inverse proportion to the reducing tail bias current as

\[ R_L = V_{SW} / I_{SS}. \] \hspace{1cm} (2)

In subthreshold operation, the tail bias current is in the range of a few nA or even less. Therefore, to obtain a reasonable output voltage swing, the load resistance should be in the range of hundreds of MΩ. Meanwhile, this resistance should be controlled very accurately based on the $I_{SS}$ value. Hence, a well-controlled, high-resistivity load device with a very small area is required. For this range of resistance, conventional PMOS devices biased in the triode region cannot be used, since the required channel length of the transistor would be impractically large. The conventional bulk-source connected PMOS load configuration (Fig. 1) results in a current source with almost infinite impedance, even for deep sub-micron devices, so neither the gain nor the amplitude would be limited. In contrast, the load configuration proposed in Fig. 2 produces a finite and controllable differential resistance which, together with the transconductance of the differential pair, provides a controlled, limited gain and amplitude. Hence, it is possible to implement a very high resistivity load device using a single minimum-size PMOS transistor [11].

As shown in Fig. 2, a replica bias circuit produces the proper gate bias voltage ($V_{BP}$) for the PMOS load devices to control the output voltage swing [9]. The voltage swing must be larger than $4n_nU_T$ ($n_n$ is the subthreshold slope factor of the NMOS differential pair devices and $U_T$ is the thermal voltage) to make sure that the NMOS differential pair devices switch completely [12].
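
The two sizing constraints of this subsection, the load resistance from (2) and the minimum swing of $4n_nU_T$, can be checked numerically with the sketch below. The slope factor $n_n = 1.3$, the temperature of 300 K, and the helper names are assumptions made for illustration only.

```python
K_B = 1.380649e-23     # Boltzmann constant, J/K
Q_E = 1.602176634e-19  # elementary charge, C


def thermal_voltage(temp_k=300.0):
    """Thermal voltage U_T = k*T/q in volts."""
    return K_B * temp_k / Q_E


def load_resistance(v_sw, i_ss):
    """Eq. (2): load resistance needed for swing v_sw at tail current i_ss."""
    return v_sw / i_ss


def min_swing(n_n, temp_k=300.0):
    """Minimum swing 4 * n_n * U_T for complete switching of the pair."""
    return 4.0 * n_n * thermal_voltage(temp_k)


# Assumed values: n_n = 1.3, T = 300 K, V_SW = 0.2 V.
print(f"minimum swing  : {min_swing(1.3) * 1e3:.0f} mV")
for i_ss in (1e-9, 100e-12, 10e-12):
    r_l = load_resistance(0.2, i_ss)
    print(f"I_SS = {i_ss:.0e} A -> R_L = {r_l / 1e6:.0f} Mohm")
```

With these assumed numbers a 0.2 V swing sits above the $4n_nU_T$ limit, and a 1 nA tail current already calls for a load of about 200 MΩ, consistent with the range quoted in the text.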

Measurement results show that the tail bias current of the STSCL circuit built using the topology of Fig. 2 can be reduced down to 10pA with a supply voltage of as low as 350mV and still maintain an output voltage swing of 150mV and a PDP of less than 0.1fJ/gate [9].

2.3 Power-Delay Performance

Unlike conventional CMOS gates, SCL circuits draw a constant bias current from the supply. This bias current should be kept high enough to obtain an acceptable delay in each gate. From (1), the power-delay product (PDP) [13] of an STSCL gate is

\[ PDP_{SCL} = \ln 2 \cdot V_{DD}V_{SW}C_L. \] \hspace{1cm} (3)

Using $V_{DD}=0.5\text{V}$ and $V_{SW}=0.2\text{V}$, for example, the PDP of an SCL gate can be as low as about 70aJ per fF of load capacitance per gate. However, compared to conventional CMOS digital circuits, an SCL circuit with a logic depth of $N > V_{DD}/V_{SW}$ exhibits a higher PDP, mainly due to the static current consumption of SCL gates [10]. In a digital SCL circuit with logic depth $N$, the total delay is $t_{d,N} = N \cdot t_{d}$ and the total power consumption is $P = N V_{DD} I_{SS}$. Therefore, for an SCL digital circuit with a logic depth of $N$, the maximum operating frequency would be:

$$f_{op,N} \approx \frac{1}{t_{d,N}} = \frac{I_{SS}}{\ln 2 \cdot N V_{SW} C_L}$$ \hspace{1cm} (4)

which is $N$ times less than the maximum possible operating frequency of each SCL gate:

$$f_{op,Max} \approx \frac{1}{t_{d}} = \frac{I_{SS}}{\ln 2 \cdot V_{SW} C_L}.$$ \hspace{1cm} (5)

Here, we are neglecting the effect of incomplete settling when $N$ is small. The main reason for this reduction is that the activity rate in a digital circuit with the logic depth of $N$ is reduced by a factor of $N$ while the power consumption of each gate remains the same.

Defining the activity rate (or duty rate) as:

$$\alpha = \frac{f_{op}}{f_{op,Max}}$$ \hspace{1cm} (6)

and regarding (3), one can show that the power-delay product with logic depth of $N$ is:

$$PDP_{SCL,N} = \ln 2 \cdot \frac{N}{\alpha} V_{DD} V_{SW} C_L.$$ \hspace{1cm} (7)

Therefore, by increasing the activity rate it is possible to reduce the power-delay product of the proposed SCL circuit with a logic depth of $N$. Comparing this result with the PDP of CMOS gates [6], [10]:

$$PDP_{CMOS,N} = \ln 2 \cdot N V_{DD}^2 C_L$$ \hspace{1cm} (8)

it can be seen that increasing the activity rate of the STSCL topology can help to achieve a PDP that is at least as good as that of the conventional CMOS topology, with the additional benefit of keeping the output swing and the delay completely independent of the supply voltage.

From (4), one can conclude that the delay (or the maximum operating frequency) of an STSCL gate depends on the tail bias current ($I_{SS}$), but not on $V_{DD}$. Therefore, the delay of a logic block can be controlled without influencing the PDP, which is not possible in conventional CMOS topologies. More importantly, the speed and the supply voltage can be effectively decoupled in STSCL circuits.
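
To make the comparison between the SCL and CMOS expressions concrete, the following Python sketch evaluates both for several activity rates. The numbers ($V_{DD}=0.5$ V, $V_{SW}=0.2$ V, $C_L=1$ fF, $N=32$) and the function names are illustrative assumptions, and the CMOS expression is simply the one quoted in the text, evaluated at the same $V_{DD}$.

```python
import math

LN2 = math.log(2)


def pdp_scl(n, alpha, v_dd, v_sw, c_l):
    """PDP of an SCL block with logic depth n and activity rate alpha."""
    return LN2 * (n / alpha) * v_dd * v_sw * c_l


def pdp_cmos(n, v_dd, c_l):
    """CMOS expression quoted in the text: ln(2) * N * V_DD^2 * C_L."""
    return LN2 * n * v_dd ** 2 * c_l


# Assumed example values: V_DD = 0.5 V, V_SW = 0.2 V, C_L = 1 fF.
V_DD, V_SW, C_L = 0.5, 0.2, 1e-15

# A single gate at full activity (n = 1, alpha = 1) reduces to Eq. (3).
print(f"single gate: {pdp_scl(1, 1.0, V_DD, V_SW, C_L) * 1e18:.1f} aJ")

N = 32
for alpha in (1 / N, 0.1, 0.5, 1.0):
    ratio = pdp_scl(N, alpha, V_DD, V_SW, C_L) / pdp_cmos(N, V_DD, C_L)
    print(f"alpha = {alpha:.3f}: PDP_SCL / PDP_CMOS = {ratio:.2f}")
```

In this simplified model the ratio reduces to $V_{SW}/(\alpha V_{DD})$, so the SCL block matches the CMOS expression once $\alpha$ reaches $V_{SW}/V_{DD}$, which is the same condition as the logic depth bound $N > V_{DD}/V_{SW}$ stated earlier when $\alpha = 1/N$.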

Meanwhile, to reduce the PDP of STSCL circuits as predicted in (7), $\alpha$ should be kept as large as possible. This observation does not contradict similar results for conventional CMOS, where

$$\left(\frac{P}{f}\right)_{CMOS} = C_L V_{DD}^2 \left(1 + \frac{2}{\alpha} e^{-\frac{V_{DD}}{\alpha e}}\right)$$ \hspace{1cm} (9)

as shown in [1]. Here, power-to-frequency is defined as:

$$\left(\frac{P}{f}\right) = \frac{P_{\text{diss}}}{f_{op}}.$$  \hspace{1cm} (10)

However, the influence of $V_{DD}$ on $(P/f)$ is quite different in conventional CMOS, where an optimum $V_{DD}$ value to minimize $(P/f)$ can be found, especially for small $\alpha$ values, due to significant leakage in CMOS topology.

Therefore, assuming that the system clock frequency is dictated by the longest delay path between two consecutive register stages, and assuming that the activity rate depends inversely on the maximum logic depth between two registers, it is most beneficial to keep the logic depth as shallow as possible, and thus, increase $\alpha$. This calls for very short (ideally one stage) pipelining in STSCL systems, which is demonstrated with an example in the next Section.
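
For reference, the dependence of the power-delay product on $N$ and $\alpha$ follows directly from the definitions already given. The steps below are a sketch that uses only quantities defined in this section, taking the energy per operation as the total power divided by the operating frequency, with $f_{op} = \alpha f_{op,Max}$:

$$\frac{P}{f_{op}} = \frac{N V_{DD} I_{SS}}{\alpha f_{op,Max}} = \frac{N V_{DD} I_{SS}}{\alpha} \cdot \frac{\ln 2 \cdot V_{SW} C_L}{I_{SS}} = \ln 2 \cdot \frac{N}{\alpha} V_{DD} V_{SW} C_L .$$

Without pipelining, the operating frequency is $f_{op,Max}/N$, so $\alpha = 1/N$ and the PDP grows as $\ln 2 \cdot N^2 V_{DD} V_{SW} C_L$. If registers are instead placed so that $\alpha$ no longer depends on $N$, the PDP grows only linearly with $N$.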

3 Pipelined STSCL Topology with Compound Gates

In this Section, some techniques for improving the performance of STSCL circuits will be introduced. First, the performance of stacked STSCL gates will be analyzed and then the proposed pipelining technique will be introduced.

3.1 Compound STSCL Structure

Compound SCL gates, which merge two or more logic operations into a single gate, make it possible to reduce the power consumption and improve the speed of operation simultaneously. Figure 3 shows an example in which an AND gate and an XOR gate are merged to construct a compound logic operation. With this technique, only one pair of output load devices and a single tail biasing transistor are needed, which reduces the area in addition to halving the total current consumption.

Assuming that the time constant at the output nodes of each SCL gate is equal to

$$\tau_L = R_L C_L = \frac{V_{SW} C_L}{I_{SS}}$$ \hspace{1cm} (11)

then the total equivalent time constant of a simple two stage SCL gate will be:

$$\tau_{tot,A} \approx 2 \times \frac{V_{SW} C_L}{I_{SS}}$$ \hspace{1cm} (12)

Assuming a compound STSCL gate with $M$ stacked levels of NMOS differential pairs (for example in Fig. 3: $M=3$), then the total time constant of the circuit will be

$$\tau_{tot,B} \approx \left( \frac{V_{SW} C_L}{I_{SS}} \right) + M \left( \frac{C_S}{g_{ms}} \right)$$ \hspace{1cm} (13)

where $g_{ms} = I_{SS}/(n_n U_T)$ and $C_S$ is the parasitic capacitance seen from the source of each NMOS differential pair. Here, it is assumed that the time constant at the intermediate nodes of a compound SCL gate is $\tau_i = C_S/g_{ms}$ (see Fig. 3) and that the total time constant can be calculated as $\tau_{tot} = \tau_L + \sum_{i=1}^{M} \tau_i$ [12]. Comparing (12) and (13), it can be concluded that as long as $M U_T C_S \ll V_{SW} C_L$, or

$$M \ll \frac{V_{SW} C_L}{U_T C_S}$$ \hspace{1cm} (14)

stacking differential pair stages will not degrade the speed of operation. Simulations show that the proposed technique can reduce the power dissipation of an $8\times8$ multiplier by about 40% and at the same time improve the speed of operation. Figure 4 depicts this improvement for different operating frequencies.
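
The comparison between two cascaded simple gates, (12), and one compound gate with stacked differential pairs, (13), can be explored with the sketch below. The capacitances ($C_L = 10$ fF, $C_S = 2$ fF), the slope factor, the thermal voltage, and the function names are assumed values for illustration, not data from the test chip.

```python
def tau_two_stage(v_sw, c_l, i_ss):
    """Eq. (12): two cascaded simple gates, each with V_SW * C_L / I_SS."""
    return 2.0 * v_sw * c_l / i_ss


def tau_compound(m, v_sw, c_l, c_s, i_ss, n_n, u_t):
    """Eq. (13): one output time constant plus m internal source-node terms."""
    g_ms = i_ss / (n_n * u_t)
    return v_sw * c_l / i_ss + m * c_s / g_ms


# Assumed example values (not from the paper).
params = dict(v_sw=0.2, c_l=10e-15, i_ss=1e-9)
C_S, N_N, U_T = 2e-15, 1.3, 0.0259

tau_a = tau_two_stage(**params)
print(f"two simple gates : {tau_a * 1e6:.2f} us")
for m in (1, 2, 3, 4):
    tau_b = tau_compound(m, c_s=C_S, n_n=N_N, u_t=U_T, **params)
    print(f"compound, M = {m}  : {tau_b * 1e6:.2f} us")
```

With these assumptions each stacked level adds only a small term compared with the output time constant, so the compound gate stays well below the two-gate cascade for small $M$.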

3.2 Two-Phase Pipelining Techniques

An effective approach for increasing the activity rate is to use a simple two-phase pipelining technique [13], [14]. Figure 5 shows one possible way to implement two-phase latch-based pipelining, in which the output of each gate is latched during one clock phase and passed on to the next stage during the next clock phase, effectively reducing the maximum logic depth to two consecutive gates.

The topology of a single stage pipelined gate is shown in Fig. 5(a). When the clock is low, the latch is disabled and the gate evaluates its output value based on the input data. During this period, while the gate is evaluating the output, the input data should remain constant.

When the clock is high, on the other hand, the output is latched and the following stages can start their evaluation step. Since the output of this stage is kept constant by the latch during this period, the input data can attain its new value. Therefore, the input data rate can be increased practically to $f_D = 1/(2t_d)$. This input data rate does not decrease if the logic depth increases (Fig. 5(b)), since during the evaluation phase of each gate its input is kept constant by the latch of the previous stage. Without pipelining, the entire system consisting of $N$ stages needs to wait until all the gates in the chain complete their evaluation, hence the maximum data rate is limited to $f_D = 1/(Nt_d)$. In conclusion, pipelining can theoretically help to improve the speed by a factor of $N/2$.
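
The timing argument above can be illustrated with a small behavioural model. The Python sketch below is not a circuit simulation: it assumes that each stage is an idealised gate followed by a latch, that one half-cycle equals one gate delay $t_d$, and that every gate simply adds 1 to its input so that words can be traced. The function name `simulate_two_phase` and the toy gate are hypothetical.

```python
def simulate_two_phase(inputs, n_stages, gate=lambda x: x + 1):
    """Return (half_cycle, value) pairs produced by the last stage.

    Stage i evaluates in half-cycles whose parity equals i % 2 and holds
    its output otherwise. A new input word is applied every two half-cycles.
    """
    held = [None] * n_stages
    outputs = []
    for h in range(2 * len(inputs) + n_stages):
        word = h // 2
        data_in = inputs[word] if word < len(inputs) else None
        phase = h % 2
        new_held = list(held)
        for i in range(n_stages):
            if i % 2 != phase:
                continue  # this stage is holding in the current half-cycle
            src = data_in if i == 0 else held[i - 1]
            new_held[i] = None if src is None else gate(src)
            if i == n_stages - 1 and src is not None:
                outputs.append((h, new_held[i]))
        held = new_held
    return outputs


print(simulate_two_phase([10, 20, 30], n_stages=4))
```

For three input words and four stages this prints `[(3, 14), (5, 24), (7, 34)]`: half-cycles are counted from zero, so each word emerges after four half-cycles (a latency of $N$ half-cycles), and a new result appears every two half-cycles regardless of the number of stages, which corresponds to $f_D = 1/(2t_d)$.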

Instead of using explicit latch stages, such two-phase pipelining can be achieved by alternately increasing and reducing the tail bias current of successive stages, using the gate terminal of the tail current bias transistor of each stage as the clock input [$V_{BN}$ in Fig. 6(a)]. In this approach, as illustrated in Fig. 6(a) for the example of an STSCL full adder (FA) gate, the current bias of odd stages is reduced to a low (yet non-zero) level to retain (hold) their output while the current bias of even stages is raised to the nominal operating value to enable evaluation. Very simple cross-coupled keeper stages connected to each gate output ensure that the output levels do not degrade significantly during the hold phase. Since the keeper stage is used only to maintain the latest state of the output of each gate, it does not need to be very fast. Therefore, the bias current of the keeper stage ($I_{SS,L}$) can be chosen as low as 1% of the bias current of the main gate ($I_{SS}$). This means that the power overhead of the keeper stages is virtually negligible. Meanwhile, since the bias current of half of the gates is almost zero in each clock phase, the overall power consumption of the system is reduced by a factor of two. Figure 6(b) shows the transient simulation results for the output of an adder stage in a chain of adders. In this figure it is possible to see the hold and evaluation phases for $I_{SS,L} = 0.01I_{SS}$ and $V_{SW}=0.2V$.

Fig. 6. (a) STSCL full adder and keeper stage. Here, the tail current bias $V_{BN}$ is switched according to $CK$ (or $\overline{CK}$) while $V_{BN0}$ is kept as a constant bias. (b) Simulated output of the pipelined FA chain showing the holding and tracking modes of operation.
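
A back-of-the-envelope view of the keeper overhead is sketched below. It rests on an assumed, simplified current model that is not taken from the measurements: the gate draws $I_{SS}$ during its evaluation half-period, a fraction `hold_ratio` of $I_{SS}$ during the hold half-period (zero by default), and the keeper draws $I_{SS,L}$ continuously. All names are hypothetical.

```python
def average_stage_current(i_ss, keeper_ratio, hold_ratio=0.0):
    """Average supply current of one pipelined stage over a clock period.

    Assumed model: the gate draws i_ss for half the period (evaluate),
    hold_ratio * i_ss for the other half (hold), and the keeper draws
    keeper_ratio * i_ss continuously.
    """
    gate = 0.5 * i_ss + 0.5 * hold_ratio * i_ss
    keeper = keeper_ratio * i_ss
    return gate + keeper


I_SS = 1e-9  # assumed tail current, 1 nA
for keeper_ratio in (0.1, 0.01):
    i_avg = average_stage_current(I_SS, keeper_ratio)
    print(f"I_SS,L = {keeper_ratio:.2f} * I_SS -> average {i_avg / I_SS:.2f} * I_SS")
```

Under this model a keeper biased at 1% of $I_{SS}$ adds about 2% to the average stage current, which is in line with the statement that its power overhead is virtually negligible.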

Assuming that the delay of each gate is $t_d$, it is theoretically possible to increase the input data rate in Fig. 5 to approximately $1/(2t_d)$. Therefore, the power-delay product of a pipelined STSCL system can be calculated as

\[ PDP_{SCL,N,Pipe} = 2 \ln 2 \cdot N V_{DD} V_{SW} C_L. \] \hspace{1cm} (15)

Comparing (7) and (15), it can be seen that pipelining helps to reduce the system power-delay product by a factor of approximately \( N/2 \), which is a considerable improvement, especially in deeper pipelines with a large number of stages. In practice, the improvement in power-delay product is less than this value because of increased loading at the output nodes as well as the power consumption of the keeper stage.
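
The ideal improvement factor can be tabulated for a few chain lengths with the sketch below. It assumes an unpipelined chain running at $\alpha = 1/N$ and an ideal pipelined chain with no keeper or loading overhead; the supply, swing, and capacitance values are placeholders that cancel in the ratio.

```python
import math

LN2 = math.log(2)


def pdp_unpipelined(n, v_dd, v_sw, c_l):
    """Eq. (7) with alpha = 1/n, i.e. no registers inside the n-gate chain."""
    return LN2 * n * n * v_dd * v_sw * c_l


def pdp_pipelined(n, v_dd, v_sw, c_l):
    """Eq. (15): two-phase pipelining, ideal case without keeper overhead."""
    return 2.0 * LN2 * n * v_dd * v_sw * c_l


V_DD, V_SW, C_L = 0.5, 0.2, 1e-15  # assumed example values
for n in (4, 8, 16, 32):
    gain = pdp_unpipelined(n, V_DD, V_SW, C_L) / pdp_pipelined(n, V_DD, V_SW, C_L)
    print(f"N = {n:2d}: ideal PDP improvement = {gain:.0f}x")
```

For $N = 32$ the ideal factor is 16, which is the upper bound against which the measured improvement should be read.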

4 Experimental Results

A test chip has been fabricated in a digital 0.18\(\mu\)m CMOS technology. It consists of a 32-bit pipelined adder chain and a conventional (non-pipelined) 32-bit ripple-carry adder as the comparison block, both designed with the STSCL topology. Figure 7 shows the test chip photomicrograph. Internal current mirrors are used to control the bias current of the gates and the keeper stage separately. Each adder chain is followed by an SCL-to-CMOS level converter circuit and an output driver.

Figure 8 shows the measured output of the pipelined FA chain in comparison to the input data and clock. The latency is equal to $NT_{CK}/2$, which in this figure is 320µs. It is possible to measure the total delay of the simple non-pipelined 32-bit adder and also the delay of a single gate of the pipelined 32-bit adder. The measurement results are shown in Fig. 9 as delay versus tail bias current. The delay of both circuits can be adjusted linearly by changing their tail bias current over a very wide range, about three orders of magnitude in these measurements. Note that the time delay between two consecutive inputs can be reduced by a factor of 14 with pipelining (the maximum theoretical improvement would have been a factor of $N/2=16$, as explained above).

Fig. 9. Measured delay versus tail bias current: total delay of simple adder chain, and stage delay in pipelined adder chain. In both cases, the delay figure corresponds to the time period between two consecutive inputs. The effective operating frequency improves by a factor of 14 with pipelining.
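
The timing figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below assumes that the chain has $N = 32$ pipeline stages (one per bit) and uses only the latency and speed-up values stated in the text; the variable names are hypothetical.

```python
N_STAGES = 32          # assumed: one pipeline stage per bit of the adder chain
LATENCY = 320e-6       # s, latency quoted in the text for Fig. 8

t_ck = 2 * LATENCY / N_STAGES      # from latency = N * T_CK / 2
half_period = t_ck / 2             # time available to each stage
ideal_gain = N_STAGES / 2          # theoretical speed-up from pipelining
measured_gain = 14                 # factor reported in the text

print(f"T_CK = {t_ck * 1e6:.0f} us, per-stage time = {half_period * 1e6:.0f} us")
print(f"measured / ideal speed-up = {measured_gain / ideal_gain:.3f}")
```

Under these assumptions the clock period in that measurement is 20 µs, and the measured speed-up reaches 87.5% of the theoretical $N/2$ bound.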

The measured power-delay products of the two topologies are shown in Fig. 10. Both topologies show a relatively constant PDP over their tuning range. The average PDPs of the simple and pipelined FA chains are 2.6pJ and 0.18pJ, respectively, which corresponds to an improvement factor of about 14. This result is very close to the estimation made in (15).
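
The reported averages translate into the headline numbers as follows. The sketch only divides the two PDP values given in the text and assumes 32 stages in the pipelined chain.

```python
PDP_SIMPLE = 2.6e-12      # J, average for the non-pipelined chain (from the text)
PDP_PIPELINED = 0.18e-12  # J, average for the pipelined chain (from the text)
N_STAGES = 32             # assumed number of stages in the pipelined chain

improvement = PDP_SIMPLE / PDP_PIPELINED
per_stage = PDP_PIPELINED / N_STAGES

print(f"improvement = {improvement:.1f}x")
print(f"PDP per stage = {per_stage * 1e15:.1f} fJ")
```

The per-stage value obtained this way is of the same order as the 5fJ/stage figure quoted for the pipelined adder.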

Measurements of the pipelined adder chain have been performed for two different keeper bias currents: $I_{SS,L} = I_{SS}/10$ and $I_{SS,L} = I_{SS}/100$. As can be seen in Fig. 10, the results for the two keeper bias currents are very close. Therefore, it is possible to reduce the bias current of the keeper stage to $I_{SS}/100$ and hence minimize the power overhead of this stage.

Fig. 10. Measured power-delay product for the two adder topologies. The pipelined adder topology achieves a very significant reduction of PDP, over a wide range of operating frequencies.

Figure 11 shows more clearly the improvement in power consumption at iso-speed operation or speed improvement at iso-power operation.

Fig. 11. Power-Frequency improvement achieved by pipelining technique.

It can be seen that the PDP of the proposed STSCL adder circuit with a deep pipeline can be as low as the PDP of static CMOS adders reported in [15]-[18]. This means that, using the pipelining technique, it is possible to improve the performance of STSCL circuits and make it comparable to that of static CMOS circuits even with high logic depth.

5 Conclusion

A simple two-phase pipelining technique has been demonstrated to improve the performance of subthreshold source-coupled logic circuits. It is shown that the proposed approach can significantly increase the activity rate of logic circuits while reducing logic depth, and hence make more efficient use of the static power consumption in source-coupled logic circuits, with minimum overhead. Measurement results obtained with a 32-bit pipelined adder chain structure show that the PDP can be improved by a factor of 14 compared to the non-pipelined topology, achieving a very low PDP of 5fJ/stage.

Acknowledgment

The authors would like to thank M. Alioto, F. K. Gürkaynak and S. Badel for their help in test chip design and S. Hauser for preparing the test setup.

References


