So far, the performance and reliability of circuits have been determined by worst-case characterisation of silicon and of environmental conditions such as noise and temperature. As nanometer technologies exacerbate process variations and reduce noise margins, worst-case design will eventually fail to meet an aggressive combination of performance, reliability, and power objectives. To circumvent these difficulties, researchers have recently proposed a new design paradigm: self-calibrating circuits. The design parameters (e.g., operating points) of a self-calibrating circuit are tuned at run-time by a controller, which receives feedback from a checker that monitors correct operation of the circuit. A self-calibrating circuit can thus dynamically trade reliability for power or performance, depending on actual silicon capabilities and noise conditions. This thesis pioneers the use of digital self-calibration techniques to dynamically tune the operating points of an on-chip link based on the detection of run-time transfer errors. In particular, we show that the energy overhead induced by the checker and the operating point controller is offset by operating the link at sub-critical voltage. Such a system-level study strengthens the interest in self-calibrated links by demonstrating their feasibility. The primary focus of the thesis is the development of robust, low-overhead checkers for a self-calibrating on-chip data link subject to errors caused by operation at sub-critical voltage. Such errors, which we call timing errors, may be numerous and cause error rates as large as 100%. We abstract timing errors as failures of bit transitions and propose ad hoc coding techniques to detect them reliably.
We emphasise the originality of the coding requirements by showing that (i) traditional error-detecting codes (such as CRCs) fail to detect timing errors under over-aggressive operation of the link, and (ii) asynchronous codes such as dual-rail detect all timing errors, but incur a significant bandwidth overhead in the synchronous context of our problem. Next, we introduce a novel code-based checker that satisfies these requirements and features unique detection capabilities for both timing and additive errors. Then, we contrast the error detection capabilities of the code-based checker with those of double sampling flip-flops. We stress the complementarity of the two approaches and show how to optimally combine them into an even more robust checker with very limited wiring and circuitry overhead. Finally, we extend our work to computing elements by giving preliminary research directions on the detection of timing errors resulting from the self-calibration of the operating points of an adder. The main contribution of this work is a set of novel checker architectures, based on codes and/or double sampling flip-flops, that detect massive timing errors caused by self-calibration of the link operating points. A requirement that makes our work unique is that reliable operation of the checker should be ensured over the whole range of bit error rates, from 0 to 100%. Furthermore, we have developed a unified framework that brings fundamental insights into the timing error detection capabilities of various practical encoding schemes.
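The abstraction of timing errors as failed bit transitions, and the reason a stateless code check can miss them at high error rates, can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names, the 8-bit word width, and the use of a single even-parity bit as a stand-in for a code such as a CRC are all assumptions made for illustration. The point it shows is that when every transition fails, the receiver samples the previous word, which is itself a valid codeword.

```python
def transmit(prev_word: int, next_word: int, fail_mask: int) -> int:
    """Model a timing error as a failed bit transition (hypothetical model).

    A wire whose bit is set in fail_mask does not switch in time, so the
    receiver samples the value that wire carried in the previous cycle.
    Wires that were not supposed to toggle are unaffected.
    """
    toggling = prev_word ^ next_word      # wires that should change value
    failed = toggling & fail_mask         # transitions that did not complete
    return next_word ^ failed             # failed wires keep the old value


def with_parity(data: int) -> int:
    """Append an even-parity bit (illustrative stand-in for a CRC)."""
    parity = bin(data).count("1") % 2
    return (data << 1) | parity


def parity_ok(word: int) -> bool:
    """Stateless check: accept any word with an even number of ones."""
    return bin(word).count("1") % 2 == 0


prev_word = with_parity(0b1010101)
next_word = with_parity(0b0110011)

# One failed transition: the received word is no longer a codeword.
one_failure = transmit(prev_word, next_word, 0b00000100)
print("single failure detected:", not parity_ok(one_failure))

# Every transition fails (100% error rate): the receiver sees the previous
# codeword again, so the check passes even though the data is wrong.
all_fail = transmit(prev_word, next_word, 0xFF)
print("all-fail word accepted:", parity_ok(all_fail))
print("all-fail word is stale:", all_fail == prev_word and all_fail != next_word)
```

Any checker that only tests whether the sampled word belongs to the code shares this blind spot, which is one way to read the requirement that detection remain reliable up to a 100% bit error rate.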