Design Technologies for Networks-on-Chip

#### Giovanni De Micheli

#### EPFL

#### Federico Angiolini, Srinivasan Murali, Luca Benini, David Atienza, Antonio Pullini









Vangal et al. ISSCC 2007

# **Application domains**

- Multiprocessors on chip
  - Homogeneous fabric
  - Designed for performance
  - General purpose
- Application-specific SoCs
  - Heterogeneous structure
  - QoS and power constraints
  - Domain specific software

#### Embedded SoC Trend



1\$/GMAC

#### Architecture Evolution



- Roadmap continues:  $90 \rightarrow 65 \rightarrow 45$  nm
- "Traditional" Bus-based SoCs fit in one tile !!
- Communication demand is staggering, but unevenly distributed, because of architectural heterogeneity

#### Interconnect Bottleneck

Power consumption: unidirectional link (38 bits + flow control)



- 65 nm low-power library
- Low V<sub>t</sub> library, high V<sub>DD</sub>: power/performance tradeoff
- Very high frequencies or very long links are infeasible
- But even some feasible links burn up to 30 mW!!
  - Heavy buffer insertion

#### Interconnect Bottleneck

Power consumption: unidirectional link (38 bits + flow control)



- 65 nm low-power library
- High V<sub>t</sub> library, low V<sub>DD</sub>: absolute min power
- Even at 250 MHz, > 2 mm link length infeasible

#### Addressing Interconnect Issues

- High-end industrial solutions:
  - Evolutionary path from shared busses



- Challenges
  - Complexity: how to analyze, verify "spaghetti interconnects"?
  - Scalability: bus is bandwidth-limited, Xbar is size-limited
  - Predictability: how to tie interconnects with floorplanning

#### The Network-on-Chip Paradigm

#### The "power of NoCs":

- Clean separation at session layer
  - Cores issue end-to-end transactions
  - Network deals with transport, network, link, physical
- Modularity at HW level: only 2 building blocks
  - Network interface
  - Switch (router)
- Physical design aware (floorplan global routing)



#### Scalability is supported from the ground up!

### SoC and NoC Characteristics

- Typical applications targeted by SoCs
  - Complex
  - Highly heterogeneous (component specialization)
  - Communication intensive
- Tailor-made interconnects for applications
- NoCs are resource constrained:
  - Power, area constraints $\rightarrow$ low buffering available
- Large available wire bandwidth
  - But tapping it with modular, structured design is key

# New design challenges

- From multiprocessor field
  - Assigning tasks to processors
  - Synchronization, consistency, coherency
- Networking
  - Network topology, routing, flow control
  - Quality of Service (QoS) needs
- VLSI
  - Floorplan in 2D, wire lengths
  - Power, area, performance





# The Big Picture



Orthogonalize computation from communication

## Why Design Automation ?

Large design space, several steps



1. Capturing application traffic

- 2. What topology?
- 3. Mapping?
- 4. Routes to use?

[Figure: application cores (DDR SDRAM, SRAM, Audio DSP, RISC CPU, Media CPU) connected through switches of different sizes (3x3 to 8x8)]
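Steps 1 to 4 can be made concrete with a toy example. The sketch below is not the method of any tool named in these slides: it assumes a hypothetical four-core traffic graph (bandwidth figures invented for illustration), fixes the topology to a 2x2 mesh, searches all mappings exhaustively, and uses dimension-ordered (XY) routing.

```python
from itertools import permutations

# Step 1: hypothetical core graph, (source, destination) -> bandwidth in MB/s.
# Core names echo the slide figure; the numbers are made up.
TRAFFIC = {
    ("CPU", "SRAM"): 400,
    ("CPU", "DSP"): 100,
    ("DSP", "SDRAM"): 300,
    ("SDRAM", "SRAM"): 50,
}
CORES = ["CPU", "SRAM", "DSP", "SDRAM"]
MESH_W, MESH_H = 2, 2  # Step 2: topology fixed to a 2x2 mesh (assumption)


def xy_route(src, dst):
    """Step 4: XY routing; returns the directed links from tile src to tile dst."""
    (x, y), (dx, dy) = src, dst
    links = []
    while x != dx:  # travel along X first
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:  # then along Y
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def cost(mapping):
    """Bandwidth-weighted hop count of a mapping (core name -> tile)."""
    return sum(bw * len(xy_route(mapping[s], mapping[d]))
               for (s, d), bw in TRAFFIC.items())


def best_mapping():
    """Step 3: exhaustive search, feasible only for a handful of cores."""
    tiles = [(x, y) for y in range(MESH_H) for x in range(MESH_W)]
    candidates = (dict(zip(CORES, p)) for p in permutations(tiles, len(CORES)))
    return min(candidates, key=cost)


def link_loads(mapping):
    """Aggregate bandwidth per directed link under XY routing."""
    loads = {}
    for (s, d), bw in TRAFFIC.items():
        for link in xy_route(mapping[s], mapping[d]):
            loads[link] = loads.get(link, 0) + bw
    return loads
```

With only 24 candidate mappings the search is trivial here; the number of mappings grows factorially with the core count, which is one reason the slides argue for automation.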

## More Steps !

- 5. Tuning communication architecture parameters (link width, buffer sizes)
- 6. Verification for correctness, performance
- 7. Build simulation, synthesis, emulation models
- 8. Reliable operation under unreliable conditions
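As an illustration of step 5, the following sketch picks the narrowest link width that covers a given peak link load. It assumes raw link capacity equals width times clock frequency and ignores packet headers and flow-control overhead; the function name and the candidate widths are hypothetical.

```python
def min_link_width(peak_load_mbytes_s, freq_mhz, widths=(8, 16, 32, 64, 128)):
    """Return the narrowest candidate width (in bits) whose raw capacity
    covers the peak link load, or None if no candidate is wide enough."""
    for width in sorted(widths):
        # bytes per cycle * million cycles per second = MB/s
        capacity = width / 8 * freq_mhz
        if capacity >= peak_load_mbytes_s:
            return width
    return None
```

For example, a 500 MB/s peak load at 200 MHz needs a 32-bit link under these assumptions, since a 16-bit link offers only 400 MB/s.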



Should ensure design closure (fast time-to-market)

#### Automating and integrating the steps is essential!

# Layered Design Flow

|                                    | Design phases                                              | Models/effects                                          | Key issues                                             |
|------------------------------------|------------------------------------------------------------|---------------------------------------------------------|--------------------------------------------------------|
| High-level specification           | Topology design, mapping, routing, refine arch. parameters | Analytical models, static effects, large solution space | Accurate traffic modeling, performance, power modeling |
| Stochastic packet-level simulation | Buffer sizing, arbitration policy, dynamic routing         | Dynamic, fast C++ simulations, stochastic traffic       | Traffic generator models, accurate network models      |
| Transaction simulation             | Further refine arch. params, key topology changes          | Dependencies in communication                           | Reflect cycle-accuracy, speed                          |
| Cycle-accurate simulation          | Performance test, very few arch., topology changes         | Completely accurate                                     | Speed, FPGA emulation                                  |
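To give a flavor of the stochastic packet-level row, here is a minimal buffer-sizing sketch. It models a single output-port FIFO with Bernoulli arrivals and departures, which is a strong simplification of a real NoC; all names and default parameters are assumptions. The slides mention fast C++ simulators; Python is used here only for brevity.

```python
import random


def simulate_queue(arrival_prob, service_prob, capacity, cycles=100_000, seed=1):
    """Single FIFO with Bernoulli flit arrivals and departures.

    Returns (fraction of arriving flits rejected, peak occupancy)."""
    rng = random.Random(seed)
    occupancy = peak = arrivals = drops = 0
    for _ in range(cycles):
        if occupancy and rng.random() < service_prob:
            occupancy -= 1  # one flit leaves this cycle
        if rng.random() < arrival_prob:
            arrivals += 1
            if occupancy < capacity:
                occupancy += 1
                peak = max(peak, occupancy)
            else:
                drops += 1  # buffer full: flit is rejected (or stalled upstream)
    return (drops / arrivals if arrivals else 0.0), peak


def smallest_buffer(arrival_prob, service_prob, max_drop=0.01, limit=64):
    """Smallest FIFO depth whose simulated rejection rate is within max_drop."""
    for depth in range(1, limit + 1):
        drop_rate, _ = simulate_queue(arrival_prob, service_prob, depth)
        if drop_rate <= max_drop:
            return depth
    return None
```

A real flow would replace the Bernoulli source with traffic generators fitted to the application and would model back-pressure rather than rejection.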

#### Research Teams

- Technion
- KTH, Sweden
- Brazil
- Princeton
- University of Bologna
- University of Cambridge
- Stanford
- University of Cagliari
- Tampere University of Technology
- University of Southampton

All omissions are purely accidental ...

# SunFloor Design Flow




# Front-End Design

Design application-specific custom topologies



Achieves design closure, bridging design gaps across different steps

# Input Models

#### Traffic Models



- Consider bursty traffic, criticality of streams
- Obtained from initial simulations, application knowledge
- Hardware monitors to obtain traffic characteristics
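One possible way to turn monitored traffic into model parameters is sketched below. It assumes a trace of `(timestamp_ns, bytes)` records, a format invented for this example, and reports the average bandwidth together with the peak bandwidth over fixed windows; their ratio is a crude burstiness indicator.

```python
def characterize(trace, window_ns):
    """trace: list of (timestamp_ns, bytes) records (hypothetical monitor output).

    Returns (average bandwidth, peak windowed bandwidth), both in bytes/ns."""
    if not trace:
        return 0.0, 0.0
    start = min(t for t, _ in trace)
    end = max(t for t, _ in trace)
    span = max(end - start, window_ns)  # avoid dividing by a zero-length span
    total = sum(b for _, b in trace)
    buckets = {}
    for t, b in trace:
        idx = (t - start) // window_ns  # window this record falls into
        buckets[idx] = buckets.get(idx, 0) + b
    peak = max(buckets.values()) / window_ns
    return total / span, peak


# Three back-to-back 32-byte bursts followed by a long idle period.
avg, peak = characterize([(0, 32), (10, 32), (20, 32), (1000, 32)], window_ns=100)
burstiness = peak / avg
```

Stream criticality is not captured here; in practice it would come from application knowledge, as the slide notes.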

### **Back-End Flow**



# Æthereal Design Flow

### **Architecture Specification**









#### architecture.xml



[Kees Goossens, NXP]

### **Application specification**



Spreadsheet `small.xls` (columns grouped into Read and Write):

| Initiator port | Target port | Read Bandwidth (MBytes/sec) | Read BurstSize (Bytes) | Read Latency (nano sec) | Write Bandwidth (MBytes/sec) | Write BurstSize (Bytes) | Write Latency (nano sec) | QoS (GT/BE) |
|----------------|-------------|-----------------------------|------------------------|-------------------------|------------------------------|-------------------------|--------------------------|-------------|
| input_c1       | filter1_c1  | 40                          | 32                     | 100                     | 24                           | 32                      | 100                      | BE          |
| input_c1       | filter2_c1  | 50                          | 32                     | 100                     | 70                           | 32                      | 100                      | BE          |
| input_p2       | filter1_p1  | 0                           | 0                      | 0                       | 240                          | 32                      | 0                        | GT          |
| filter1_p2     | memory_p1   | 500                         | 32                     | 0                       | 0                            | 0                       | 0                        | GT          |
| filter2_p1     | memory_p1   | 500                         | 32                     | 0                       | 0                            | 0                       | 0                        | BE          |
| filter2_p2     | output_p1   | 0                           | 0                      | 0                       | 240                          | 32                      | 0                        |             |

[Kees Goossens, NXP]
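A specification like this is easy to process programmatically. The sketch below transcribes the bandwidth and QoS columns of the spreadsheet (burst size and latency omitted) and aggregates them; the tuple layout and function names are assumptions, not the Æthereal tool's input format. GT and BE are read here as guaranteed-throughput and best-effort connections.

```python
# (initiator, target, read MB/s, write MB/s, QoS) transcribed from the spreadsheet.
# The last row has no QoS entry in the source, so it is kept as None.
CONNECTIONS = [
    ("input_c1", "filter1_c1", 40, 24, "BE"),
    ("input_c1", "filter2_c1", 50, 70, "BE"),
    ("input_p2", "filter1_p1", 0, 240, "GT"),
    ("filter1_p2", "memory_p1", 500, 0, "GT"),
    ("filter2_p1", "memory_p1", 500, 0, "BE"),
    ("filter2_p2", "output_p1", 0, 240, None),
]


def load_per_target(connections):
    """Total read + write bandwidth (MB/s) requested at each target port."""
    loads = {}
    for _, target, read_bw, write_bw, _ in connections:
        loads[target] = loads.get(target, 0) + read_bw + write_bw
    return loads


def guaranteed_bandwidth(connections):
    """Total bandwidth (MB/s) of connections marked GT."""
    return sum(r + w for _, _, r, w, qos in connections if qos == "GT")
```

Summing per target immediately shows that `memory_p1` is the hot spot, with 1000 MB/s requested by two initiators.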

- Split large optimization problem into smaller pieces
  - may fail (feedback)
  - back annotation



# **UMARS:** Multiple applications

- SoCs typically support multiple applications
- Applications can run in parallel: compound modes
- UMARS supports multiple applications
  - Supports NoC reconfiguration across compound modes



[Kees Goossens, NXP]
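The idea of compound modes can be illustrated with a small sketch. This is not the UMARS algorithm: it assumes hypothetical applications and flows, sums bandwidth per flow over the applications active in each mode, and then takes the per-flow maximum across modes, which is what a single static configuration would have to cover.

```python
# Hypothetical per-application flows: (source, destination) -> MB/s.
APPS = {
    "video": {("cpu", "mem"): 200, ("dsp", "mem"): 300},
    "audio": {("dsp", "mem"): 50},
    "net": {("cpu", "mem"): 100},
}
# Hypothetical modes: each application alone, plus two compound modes.
MODES = [("video",), ("audio",), ("net",), ("video", "audio"), ("audio", "net")]


def mode_requirements(mode):
    """Sum the bandwidth of every flow over the applications active in a mode."""
    req = {}
    for app in mode:
        for flow, bw in APPS[app].items():
            req[flow] = req.get(flow, 0) + bw
    return req


def worst_case(modes):
    """Per-flow maximum over all modes."""
    worst = {}
    for mode in modes:
        for flow, bw in mode_requirements(mode).items():
            worst[flow] = max(worst.get(flow, 0), bw)
    return worst
```

Reconfiguring the NoC when the mode changes lets each mode be served with its own requirements instead of this worst case.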

#### Several NoC CAD efforts

#### Nostrum simulation environment

[Screenshot: Nostrum traffic configuration dialog, with spatial and temporal specification panels]

#### NoC buffering with queueing theory [Hu]



#### OEDIPUS design system [Ahonen]



#### Case Study 1: Custom Vs Regular NoCs

# SUNFLOOR vs Manual design

On the 30-core multimedia benchmark







Hand-design (custom mesh)

SUNFLOOR Design

From Cadence SoC Encounter

# SUNFLOOR vs Hand-Mapped

|                                  | Hand-mapped design                      | SunFloor                                |
|----------------------------------|-----------------------------------------|-----------------------------------------|
| Topology                         | 5x3 mesh (15 switches)                  | Custom (8 switches)                     |
| Operating frequency (constraint) | 885 MHz (post-layout)                   | 885 MHz (post-layout)                   |
| Power consumption                | 368 mW                                  | 277 mW (-25%)                           |
| Area                             | 35.4 mm<sup>2</sup> (floorplan area)    | 37 mm<sup>2</sup> (cell area, +4%)      |
| Design time                      | Weeks                                   | 4 hours, design to layout               |
| Technology                       | 0.13 µm                                 | 0.13 µm                                 |

Benchmark execution times comply with application requirements and are even 10% better on the SunFloor topology.

#### Custom Vs Regular Topologies

| Application      | Topology | Power (mW) | Avg. nr. hops |
|------------------|----------|------------|---------------|
| VPROC (42 cores) | Custom   | 79.64      | 1.67          |
|                  | Mesh     | 301.8      | 2.58          |
|                  | Opt-mesh | 136.1      | 2.58          |
| MPEG4 (12 cores) | Custom   | 27.24      | 1.50          |
|                  | Mesh     | 96.82      | 2.17          |
|                  | Opt-mesh | 60.97      | 2.17          |
| VOPD (12 cores)  | Custom   | 30.00      | 1.33          |
|                  | Mesh     | 95.94      | 2.00          |
|                  | Opt-mesh | 46.48      | 2.00          |
| MWD (12 cores)   | Custom   | 20.53      | 1.15          |
|                  | Mesh     | 90.17      | 2.00          |
|                  | Opt-mesh | 38.60      | 2.00          |

 On average, SunFloor custom topologies:

- 2.75x less power consumption
- 1.55x less hop delay

Despite the large design space, the maximum run time is a few hours (for VPROC)
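The per-benchmark ratios behind such averages can be recomputed from the four rows of the table. The sketch below does only that; the averages quoted on the slide may have been taken over a different benchmark set or baseline, so the power figure in particular need not match. Function and variable names are hypothetical.

```python
# (power in mW, average hops) per topology, copied from the table above.
RESULTS = {
    "VPROC": {"Custom": (79.64, 1.67), "Mesh": (301.8, 2.58), "Opt-mesh": (136.1, 2.58)},
    "MPEG4": {"Custom": (27.24, 1.50), "Mesh": (96.82, 2.17), "Opt-mesh": (60.97, 2.17)},
    "VOPD": {"Custom": (30.00, 1.33), "Mesh": (95.94, 2.00), "Opt-mesh": (46.48, 2.00)},
    "MWD": {"Custom": (20.53, 1.15), "Mesh": (90.17, 2.00), "Opt-mesh": (38.60, 2.00)},
}


def mean_ratio(baseline, metric):
    """Arithmetic mean over benchmarks of baseline / custom for one metric."""
    idx = {"power": 0, "hops": 1}[metric]
    ratios = [row[baseline][idx] / row["Custom"][idx] for row in RESULTS.values()]
    return sum(ratios) / len(ratios)


hop_gain = mean_ratio("Mesh", "hops")           # mesh and opt-mesh hops are identical
power_vs_mesh = mean_ratio("Mesh", "power")
power_vs_opt_mesh = mean_ratio("Opt-mesh", "power")
```

On these four rows the mean hop ratio comes out close to the 1.55x quoted above.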

#### Case Study 2: Technology Scaling Effects

#### Effect of Technology Scaling



#### Network Synthesis Results

|        | Library | Frequency | Switch Count | Largest Switch | Total NoC Power | Avg. head flit latency |
|--------|---------|-----------|--------------|----------------|-----------------|------------------------|
| dVOPD  | 90nm HP |           |              |                |                 |                        |
|        | 90nm LP |           |              |                |                 |                        |
|        | 65nm HP |           |              |                |                 |                        |
|        | 65nm LP |           |              |                |                 |                        |
| dVOPD2 | 65nm HP | 800 MHz   | 6            | 7x6            | 129.36 mW       | 4.24 cycles [3,7]      |
| tVOPD2 | 65nm HP | 800 MHz   | 10           | 7x7            | 196.40 mW       | 4.35 cycles [3,9]      |

Observations:

- Lower power in 65 nm for the same design
- 65 nm supports 2x BW, at lower power!
- NoC for a big design (38 cores) operates at 800 MHz

#### Case Study 3: NoCs for low power applications?

# Parallel Encryption Engine

- 18 cores



#### Low Bandwidth & Power Application

| Library    | Frequency | Switch<br>Count | Largest<br>Switch | Total NoC<br>Power | Avg. head flit<br>latency |
|------------|-----------|-----------------|-------------------|--------------------|---------------------------|
| 90nm<br>HP | 50 MHz    | 2               | 11x11             | 10.4 mW            | 3.94 cycles<br>[3,5]      |
| 90nm<br>LP | 50 MHz    | 2               | 11x11             | 4.1 mW             | 3.94 cycles<br>[3,5]      |
| 65nm<br>HP | 50 MHz    | 2               | 11x11             | 4.72 mW            | 3.94 cycles<br>[3,5]      |
| 65nm<br>LP | 50 MHz    | 5               | 9x9               | 3.1 mW             | 4.38 cycles<br>[3,7]      |

Energy efficiency: 2.2 Gb/s per mW $\rightarrow$ 2.5x better than high-perf NoC
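As a quick consistency check, the two figures on this line imply the efficiency of the high-performance NoC used as reference, assuming that "2.5x better" is a ratio of throughput per unit power:

```latex
\eta_{\text{high-perf}} \approx \frac{2.2\ \text{Gb/s/mW}}{2.5} = 0.88\ \text{Gb/s/mW}
```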

## Custom Topology Layout



### Conclusions

Design flows and CAD tools are critical for NoCs

- Layered design flow
  - Tackle problems from several levels
- Several key steps
  - Traffic analysis, mapping, topology design, routing,...
- Integrated approach is critical
  - Interact with existing back-end tools
- Fertile ground for more R&D work:
  - Run-time configurability
  - Robustness w.r.t. static/dynamic variations, errors
  - Tackle floorplan and layout issues