## **Optimization of Reliability and Power Consumption in SoCs**

Tajana Šimunić Rosing, UC San Diego; Kresimir Mihic, Cypress Semiconductors; Giovanni De Micheli, EPF Lausanne

## **Integrated System Technology Issues**

- Extremely small size
  - Thinner interconnect -> more chance of EM failure
  - Thinner dielectric -> more chance of TDDB failure
  - Narrower design margins
- Extremely large scale
  - High transistor density
    - Causes more failures
    - Enables redundancy
- Energy consumption
  - Increased energy consumption is a hurdle to modular redundancy
  - Power and thermal management are critical
    - Reliability is exponentially related to temperature

Designing reliable integrated systems requires techniques that integrate with power management and tie to the underlying technology surface

## Reliability

- Reliability is the probability function R(t) that a system works correctly in [0, t] without repairs
- The mean time to failure (MTTF) is $E[t] = \int_0^\infty R(t)\,dt$
- Assuming a unit works correctly in [0, t], the failure rate λ(t) is the conditional probability that the unit fails in [t, t+Δt]
  - It depends on temperature, environmental exposure, mechanical and thermal stress
  - The component failure rate is often assumed to be constant during the useful lifetime of the device: $\lambda(t) = \lambda$
    - Then $R(t) = e^{-\lambda t}$ and $MTTF = 1/\lambda$
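The constant-failure-rate case can be sketched in a few lines of Python. The function names and the 10-year figure used in the example are illustrative assumptions, not part of the original slides.

```python
import math

def reliability(failure_rate: float, t: float) -> float:
    """R(t) = exp(-lambda * t), valid for a constant failure rate."""
    return math.exp(-failure_rate * t)

def mttf(failure_rate: float) -> float:
    """MTTF = 1 / lambda, valid for a constant failure rate."""
    return 1.0 / failure_rate

# Hypothetical component with a failure rate of 0.1 failures per year.
lam = 0.1
print(f"MTTF: {mttf(lam):.1f} years")
print(f"R(MTTF): {reliability(lam, mttf(lam)):.3f}")
```

Note that under this model the probability of surviving until the MTTF is $e^{-1}$, about 0.37, so the MTTF is not a guaranteed lifetime.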



- Two types of failures can be defined in integrated systems:
  - Soft failures: transient malfunctions
  - Hard failures: permanent malfunctions

## **Related Work – Reliability for SoCs**

- Reliability at the architecture level
  - Integrated simulation of power and reliability at the microarchitecture level: RAMP [Srinivasan'03]
  - Redundancy tradeoffs [Shivakumar'03]
- Dynamic Thermal Management (DTM)
  - HotSpot [Skadron'03], ThermalHerd [Shang'04]
    - Simulate and reduce thermal hotspots
  - Thermal management for multimedia [Srinivasan'03]
- Dynamic Voltage Scaling (DVS) as related to reliability
  - Routing and DVS for reduction of hotspots [Shang'04]
- Dynamic Power Management (DPM)
  - Primarily focused on lowering energy consumption
- Soft errors studied by many, e.g.:
  - Ultra-low power systems [Maheshwari'02]
  - Sensing systems [Marculescu'03]
- Hard failure mechanisms studied at length in the past, e.g.:
  - Temperature cycling [Huang'00]
  - TDDB [Degraeve'98]

## **Reliable low-power design**



- Simulate system-level reliability as a function of a power management policy
  - Model three sources of hard errors:
    - Electromigration (EM), time-dependent dielectric breakdown (TDDB), and temperature cycling (TC)

- Design and optimize a system management policy
  - Maximize reliability and minimize energy consumption
  - Combined dynamic reliability management (DRM) with dynamic power management (DPM) optimization
    - Markov, semi-Markov and TISM models

## Hard failures

- Defects in silicon or package, permanent once present
- Expected lifetime decreases with hard error rate
  - Extrinsic
    - Caused by process and manufacturing defects
    - Usually screened out before shipping a product
  - Intrinsic
    - Occur during operation
    - Depends on materials, process parameters, system design and operating conditions
    - Should occur after device passes its useful lifetime
    - Examples: electromigration, time dependent dielectric breakdown, thermal cycling

## **Electromigration (EM)**

- Result of momentum transfer from electrons to the ions that make up the interconnect lattice
- Leads to opening of metal lines/contacts, shorting between adjacent metal lines, shorting between metal levels, increased resistance of metal lines/contacts, or junction shorting
- Described by Black's model, where A<sub>o</sub> is an empirically determined constant, J is the current density in the interconnect, J<sub>crit</sub> is the threshold current density, k is Boltzmann's constant, T is the temperature, and E<sub>a</sub> and n are 0.7 and 2, respectively



$$MTTF_{EM} = A_o (J - J_{crit})^{-n} e^{\frac{E_a}{kT}}$$

The failure rate due to EM is modeled only in the active and idle states, because in the sleep state the leakage current is not yet large enough to cause migration:

$$\lambda_{core,s}^{EM} = A'_o (J_s - J_{crit})^n e^{-\frac{E_a}{kT_s}} = \lambda_{m,s}^{EM} e^{-\frac{E_a}{kT_s}}; \quad \forall s \in \{active, idle\}$$
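The EM failure-rate expression can be evaluated with a short Python sketch. Several things here are assumptions made for illustration: the activation energy is taken to be in eV (so Boltzmann's constant is used in eV/K), the empirical constant $A'_o$ is replaced by a placeholder of 1.0, the rate is set to zero when the current density does not exceed $J_{crit}$, and the current densities and temperatures in the example are arbitrary.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K

def em_failure_rate(j, j_crit, temp_k, a_o=1.0, n=2.0, e_a_ev=0.7):
    """EM failure rate A'_o * (J - J_crit)^n * exp(-E_a / (k * T)).

    a_o is a placeholder for the empirical constant A'_o; e_a_ev assumes
    the activation energy is given in eV; temp_k is in kelvin.
    """
    if j <= j_crit:
        # Assumption: no electromigration below the threshold current density.
        return 0.0
    arrhenius = math.exp(-e_a_ev / (BOLTZMANN_EV_PER_K * temp_k))
    return a_o * (j - j_crit) ** n * arrhenius

# Relative comparison only: the unknown constant A'_o cancels in the ratio.
cool = em_failure_rate(j=2.0, j_crit=1.0, temp_k=330.0)
hot = em_failure_rate(j=2.0, j_crit=1.0, temp_k=360.0)
print(f"failure-rate ratio hot/cool: {hot / cool:.2f}")
```

Working with ratios sidesteps the need to know $A'_o$, which has to be fitted to measured data for a given technology.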

## Time Dependent Dielectric Breakdown (TDDB)

- Wear-out mechanism of the dielectric due to electric field and temperature; causes formation of conductive paths through the dielectric
- MTTF is a function of the empirically determined constant  $A_o$ , the field acceleration parameter  $\gamma$ , the electric field across the dielectric  $E_{ox}$ , the activation energy  $E_a$  and the temperature  $T$



$$MTTF_{TDDB} = A_o e^{-\gamma E_{ox}} e^{\frac{E_a}{kT}}$$

Failure rate due to TDDB:

$$\lambda_{core,s}^{TDDB} = A'_o e^{\gamma E_{ox,s}} e^{-\frac{E_a}{kT_s}} = \lambda_{m,s}^{TDDB} e^{-\frac{E_a}{kT_s}}; \quad \forall s \in \{active, idle, sleep\}$$
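As a worked step, the TDDB failure rates of two states can be compared without knowing the empirical constant, assuming $A'_o$, $\gamma$ and $E_a$ are the same in both states. The subscripts $act$ and $slp$ are shorthand introduced here for the active and sleep states:

$$\frac{\lambda_{core,act}^{TDDB}}{\lambda_{core,slp}^{TDDB}} = e^{\gamma \left( E_{ox,act} - E_{ox,slp} \right)} \, e^{-\frac{E_a}{k} \left( \frac{1}{T_{act}} - \frac{1}{T_{slp}} \right)}$$

Both factors exceed one when the active state has the higher oxide field and the higher temperature, so the active state dominates TDDB wear-out in that case.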

## **Temperature Cycling (TC)**

- Caused by thermal cycles that occur during power state changes
  - Slow and fast thermal cycles
- Induces plastic deformations in materials, which lead to cracks, short circuits and other failures of metal films and interlayer dielectrics
- Depends on temperature range and average temperature:

$$N_{f} = C_{o} \left[ C_{1} \left( T_{\max} - T_{\min} \right) - C_{2} \left( T_{avg} - T_{mold} \right) \right]^{q}$$

Failure rate due to TC:

$$\lambda_{core,s}^{TC} = C_o' \left[ C_1 \left( T_{active} - T_{sleep} \right) - C_2 \left( T_{avg,s} - T_{mold} \right) \right]^q f_s; \quad \forall s = sleep$$
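A minimal Python sketch of the TC failure-rate expression follows. The constants $C'_o$, $C_1$, $C_2$ and $q$ are technology dependent and not given in the slides, so the values in the example are arbitrary placeholders; treating a non-positive stress term as zero failure rate is also an assumption made here to keep the power well defined.

```python
def tc_failure_rate(t_active, t_sleep, t_avg, t_mold, cycle_freq,
                    c_o, c1, c2, q):
    """TC failure rate C'_o * [C1*(T_active - T_sleep) - C2*(T_avg - T_mold)]^q * f_s.

    cycle_freq is f_s, the frequency of thermal cycles caused by
    transitions into the sleep state. All constants are placeholders.
    """
    stress = c1 * (t_active - t_sleep) - c2 * (t_avg - t_mold)
    if stress <= 0:
        # Assumption: no thermal-cycling damage without a positive stress term.
        return 0.0
    return c_o * stress ** q * cycle_freq

# Arbitrary illustrative values, not taken from the slides.
rate = tc_failure_rate(t_active=80.0, t_sleep=40.0, t_avg=60.0, t_mold=25.0,
                       cycle_freq=2.0, c_o=1.0, c1=1.0, c2=0.0, q=2.0)
print(rate)
```

The rate scales linearly with the cycle frequency, which is why a policy that enters the sleep state very often can hurt reliability even while saving power.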

## **Reliability of complex systems**

- A system is a connection of components
- System reliability depends on the topology
  - Series/parallel configurations
  - N out of K configurations
  - General topologies

Series:

$$R_{system}(t) = \prod_{i=0}^{n} R_{i}(t) \implies R_{system}(t) = e^{-\sum_{i=0}^{n} \lambda_{f_{i}} t}$$

Parallel:

$$R_{system}(t) = 1 - \prod_{i=0}^{n} (1 - R_{i}(t))$$

- Examples:
  - CPU, memory and interconnect form a series reliability network as all three are necessary for the correct functioning of the system
  - Dual CPU system could be viewed as a parallel reliability combination as only one CPU is needed in order for the system to function
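The series and parallel combinations can be sketched as follows. The component reliabilities in the example are hypothetical values chosen for illustration, and components are assumed to fail independently.

```python
import math

def series_reliability(component_reliabilities):
    """All components must work: product of the individual reliabilities."""
    return math.prod(component_reliabilities)

def parallel_reliability(component_reliabilities):
    """At least one component must work: 1 - product of failure probabilities."""
    return 1.0 - math.prod(1.0 - r for r in component_reliabilities)

# Hypothetical reliabilities at some fixed time t.
cpu, memory, interconnect = 0.9, 0.9, 0.9
print(series_reliability([cpu, memory, interconnect]))  # CPU + memory + interconnect
print(parallel_reliability([cpu, cpu]))                 # dual CPU
```

With these numbers the series chain is less reliable than any single component, while the dual-CPU pair is more reliable than either CPU alone.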

## **Basic Reliability Configurations**

- Active parallel configuration has all redundant components working concurrently
  - Energy consumption is high
  - Time to transition on failure is very low
  - Failure rate is higher than standby parallel
  - E.g. identical controllers for aircraft guidance
- Standby parallel configuration has redundant components in low-power mode until failure of the active component
  - Energy consumption is lower
  - Time to transition on failure is higher
  - Failure rate is low
  - E.g. dual CPU platform
- Series combination has the highest failure rate
  - E.g. CPU, memory, interconnect











Markov processes model memoryless systems with constant failure rates


## **DPM&DRM - Power management modeling**



## **DPM&DRM System Model Details**

- Combine:
  - Power-state machine model: TISMDP
  - Reliability model: Markov process
- Represent overall system as combination of components' PSMs where failure rates depend on system state
- System control aims to increase energy efficiency and enhance reliability



## **DPM&DRM Policy Optimization**

Minimize average energy consumed under reliability and performance constraints; the result is a randomized policy.

$$\begin{aligned}
\min \quad & \sum cost_{energy,c} \\
\text{s.t.} \quad & \sum_{a \in A} f(s,a) - \sum_{a \in A} \sum_{s' \in S} M(s|s',a) f(s',a) = 0; \quad \forall s, \forall c_s \\
& \sum_{a \in A} \sum_{s \in S} T(s,a) f(s,a) = 1; \quad \forall c_s \\
& \sum_{n=1}^{N} cost_{perf,c} < Perf_{const}; \quad \forall c \\
& Tpl(\lambda_c) \leq Rel_{const}; \quad \forall c_s \\
& \lambda_c = \sum \sum \sum \lambda_{core}^{i}(s,a) \, y(s,a) \, f(s,a)
\end{aligned}$$

Variable definitions:

- $cost(s,a)$: average cost incurred while in state s given action a
- $f(s,a)$: frequency of executing action a while in state s
- $M(s'|s,a)$: probability of arriving in state s' given action a taken in state s
- $T(s,a)$: expected time spent in state s given action a
- $Tpl(\lambda_c)$: reliability constraint as a function of the network topology Tpl
- $\lambda_c$: core failure rate, used as the core reliability measure

Obtain the globally optimal policy using linear programming.

The policy is obtained from the state-action frequencies f(s,a) as a table of probabilities of issuing command a when the system is in state s:

$$p(s,a) = \frac{f(s,a)}{\sum_{a' \in A} f(s,a')}$$
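Turning state-action frequencies into a policy table can be sketched as below. This assumes the normalization runs over all actions available in a state, so that the probabilities for each state sum to one. The frequencies, state names and action names are hypothetical; in practice the frequencies would come from the LP solution.

```python
def policy_from_frequencies(f):
    """Convert state-action frequencies f[(s, a)] into probabilities p[(s, a)].

    Each frequency is divided by the total frequency of its state, giving
    the probability of issuing action a when the system is in state s.
    """
    totals = {}
    for (state, _action), freq in f.items():
        totals[state] = totals.get(state, 0.0) + freq
    return {
        (state, action): freq / totals[state]
        for (state, action), freq in f.items()
        if totals[state] > 0.0  # skip states that are never visited
    }

# Hypothetical frequencies for illustration only.
frequencies = {
    ("active", "continue"): 0.6,
    ("idle", "continue"): 0.3,
    ("idle", "go_to_sleep"): 0.1,
}
policy = policy_from_frequencies(frequencies)
for key, prob in sorted(policy.items()):
    print(key, round(prob, 3))
```

The resulting policy is randomized: in the idle state of this example the controller would issue "go to sleep" a quarter of the time rather than always or never.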

## **DPM Constraint Formulation**

Energy and performance cost:

- $k(s_i, a_i)$: lump sum cost
- $c(s_{i+1}, s_i, a_i)$: cost rate (e.g. power or performance penalty)
- $F(t_i | s_i, a_i)$: probability distribution of next event occurrence
- $p(s_{i+1} | t_i, s_i, a_i)$: probability of transition into next state  $s_{i+1}$

$$Cost(s_{i}, a_{i}) = \begin{cases} k(s_{i}, a_{i}) + \int_{0}^{\infty} \left[ F(du \mid s_{i}, a_{i}) \sum_{s_{i+1} \in S_{i+1}} \int_{0}^{u} c(s_{i+1}, s_{i}, a_{i}) \, p(s_{i+1} \mid t_{i}, s_{i}, a_{i}) \, dt \right] & \forall dt \\ k(s_{i}, a_{i}) + \sum_{s_{i+1} \in S_{i+1}} c(s_{i+1}, s_{i}, a_{i}) \, T(s_{i}, a_{i}) & \forall \Delta t \end{cases}$$

Expected time spent in each state:

$$T(s_{i}, a_{i}) = \begin{cases} \int_{0}^{\infty} t \sum_{s_{i+1} \in S_{i+1}} p(s_{i+1} \mid t_{i}, s_{i}, a_{i}) F(dt \mid s_{i}, a_{i}) & \forall dt \\ \int_{t_{i}}^{t_{i} + \Delta t} \frac{(1 - F(t))dt}{1 - F(t_{i})} & \forall \Delta t \end{cases}$$
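As a worked instance of the second case, assume purely for illustration that the time to the next event is exponentially distributed with a hypothetical rate $\mu$, so that $F(t) = 1 - e^{-\mu t}$:

$$T(s_i, a_i) = \frac{\int_{t_i}^{t_i + \Delta t} e^{-\mu t} \, dt}{e^{-\mu t_i}} = \frac{1 - e^{-\mu \Delta t}}{\mu}$$

The result does not depend on $t_i$, which reflects the memoryless property of the exponential distribution, and it approaches $\Delta t$ when $\mu \Delta t$ is small. For other distributions the expected time generally depends on how long the system has already spent in the state.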

Probability of arrival into each state:

$$M(s_{i+1} | s_i, a_i) = \begin{cases} \int_{0}^{\infty} p(s_{i+1} | t_i, s_i, a_i) F(dt | s_i, a_i) & \forall dt \\ p(s_{i+1} | t_i, s_i, a_i) & \forall \Delta t \end{cases}$$

## **Reliability Constraint Formulation**

- Failure rate of each state is a sum of the failure rates due to all mechanisms (EM, TDDB, TC) acting in that state
- Expected temperature in a state needs to be calculated

$$T_{state} = (T_{active} - T_{state,ss})e^{-\frac{y(s,a)}{\tau}} + T_{state,ss}$$
$$T_{active} \propto P_{active}(R_{th \, die} + R_{th \, package})$$
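The exponential temperature expression can be sketched in Python as follows. Reading $y(s,a)$ as the time spent in the state and $\tau$ as a thermal time constant is an interpretation assumed here, and the temperatures and time constant in the example are arbitrary.

```python
import math

def state_temperature(t_active, t_steady, time_in_state, tau):
    """T_state = (T_active - T_state,ss) * exp(-y / tau) + T_state,ss.

    time_in_state plays the role of y(s, a) and tau that of the thermal
    time constant; both interpretations are assumptions of this sketch.
    """
    return (t_active - t_steady) * math.exp(-time_in_state / tau) + t_steady

# Arbitrary example: cooling from 80 C toward a 40 C steady state.
for elapsed in (0.0, 5.0, 50.0):
    print(elapsed, round(state_temperature(80.0, 40.0, elapsed, tau=5.0), 2))
```

Short stays in a low-power state leave the core close to its active temperature, so the failure rate of that state should be evaluated at this expected temperature rather than at the steady-state value.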

Total failure rate of a core is a weighted sum of state failure rates, for example:

- Core has three power states: active (A), idle (I) and sleep (S)
- Two actions: "go to sleep" (S) and "continue" (C)

$$\begin{split} \lambda_A y(A,C) f(A,C) + \\ \lambda_I y(I,C) f(I,C) + \lambda_I y(I,S) f(I,S) + \\ \lambda_S y(S,C) f(S,C) \leq Rel_{const} \end{split}$$
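The weighted sum for a single core can be sketched as below. All numeric values, including the constraint bound, are hypothetical placeholders; the dictionary layout is just one convenient way to hold $\lambda_s$, $y(s,a)$ and $f(s,a)$.

```python
def core_failure_rate(state_rate, y, f):
    """Sum of lambda_s * y(s, a) * f(s, a) over all state-action pairs in f."""
    return sum(state_rate[s] * y[(s, a)] * f[(s, a)] for (s, a) in f)

# States: A (active), I (idle), S (sleep). Actions: C (continue), S (go to sleep).
state_rate = {"A": 3.0, "I": 2.0, "S": 1.0}           # hypothetical lambda_s
y = {("A", "C"): 1.0, ("I", "C"): 1.0,
     ("I", "S"): 1.0, ("S", "C"): 1.0}                # hypothetical y(s, a)
f = {("A", "C"): 0.5, ("I", "C"): 0.2,
     ("I", "S"): 0.1, ("S", "C"): 0.2}                # hypothetical f(s, a)

rel_const = 2.5  # hypothetical bound on the core failure rate
total = core_failure_rate(state_rate, y, f)
print(total, total <= rel_const)
```

Because the expression is linear in $f(s,a)$, it fits directly into the linear program as one more constraint row.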

 System failure rate is calculated based on system topology as a function of series and parallel combinations

## **Optimization example**

- 95nm technology
- Five cores; standard workloads (audio, video, www, email)
- MTTF constraint set to 10 years; minimized power consumption



| IP block                    | P<sub>active</sub> [W] | P<sub>idle</sub> [W] | P<sub>sleep</sub> [W] | t<sub>ts</sub> [s] | t<sub>ta</sub> [s] |
|-----------------------------|------------------------|----------------------|-----------------------|--------------------|--------------------|
| DSP (TMS6211) [22]          | 1.1                    | 0.5                  | 0.01                  | 250u               | 100n               |
| Video (SAF7113H) [23]       | 0.44                   | N/A                  | 0.07                  | 110m               | 0.9                |
| Audio (SST-Melody-DAA) [24] | 0.11                   | 0.03                 | 3.00E-03              | 6u                 | 0.13               |
| I/O (MSP43011x2) [25]       | 1.00E-03               | N/A                  | 6.00E-06              | 100n               | 6u                 |
| DRAM (Rambus 512M) [26]     | 1.58                   | 0.37                 | 1.00E-02              | 16n                | 16n                |

## **Single Core Design**





- Maximum power savings achievable given an MTTF of 10 years are 90% for all cores and temperature ranges, except for DSP, Video and Audio at 90 °C due to the TC mechanism
- Effect of a design change: widening metal lines
  - Current density down by 20%, core area up by 5%, temperature down by 2%, but TC up by 10%

## **Design with redundancy**

- Standby-off and standby-sleep redundancy model
- Power savings with MTTF set to 10 years



System meets MTTF of 10 years when one more redundant core in standby off mode is added to DSP, Audio and I/O; power savings of 40% are achieved

## Redundancy

- Using redundancy helps improve reliability, but at the cost of increased area and power consumption
  - Instead of spare cores use functional redundancy & dynamic reconfiguration



## **DVS, DPM and Reliability**

- Simulate using a "typical day" workload, consisting of video, audio, www and telnet traffic interspersed throughout the day
- 95nm technology, power/performance properties of XScale PXA270

| State  | Active (mW) | Idle (mW) | Freq (MHz) |
|--------|-------------|-----------|------------|
| P1     | 925         | 260       | 624        |
| P2     | 747         | 222       | 520        |
| P3     | 279         | 129       | 208        |
| P4     | 116         | 64        | 104        |
| Psleep | 0.163       | 0.163     | 0          |

- Aggressive DPM:
  - Large power savings, but reliability loss due to TC
- DVS only:
  - Smaller power savings, but longer MTTF due to EM/TDDB
- Combining DVS and DPM gives the best tradeoff



## **Power and MTTF with DVS/DPM**

DVS/DPM improves MTTF by 45%, with 61% power savings

| Policy      | Power savings | MTTF change |
|-------------|---------------|-------------|
| None        | 0%            | 0%          |
| DVS         | 35%           | 42%         |
| DPM (Rmax)  | 16%           | 6%          |
| DPM (ave)   | 47%           | -12%        |
| DPM (Pmax)  | 99%           | -34%        |
| both (Rmax) | 46%           | 47%         |
| both (ave)  | 61%           | 45%         |
| both (Pmax) | 99%           | 34%         |



## Summary

- Reliability is strongly affected by both DVS and DPM
- Integrated methodology for analysis, optimization and management of reliability and power consumption:
  - Simulator gives fast feedback on topology design and system characteristics for a wide range of operating conditions
  - Optimizer provides a policy capable of giving an optimal implementation of reliability and power management control
- Results obtained for a number of integrated systems implemented in 95nm technology show:
  - Large dependence between power management policy and reliability due to tradeoff between EM, TDDB and TC effects
  - 40% power savings on top of meeting MTTF of 10 years for an integrated system consisting of five cores with redundancy