Title: NoC Design and Implementation in 65nm Technology
1. NoC Design and Implementation in 65nm Technology
Antonio Pullini (1), Federico Angiolini (2), Paolo Meloni (3), David Atienza (4,5), Srinivasan Murali (6), Luigi Raffo (3), Giovanni De Micheli (4), Luca Benini (2)
(1) DAUIN, Politecnico di Torino, Italy
(2) DEIS, University of Bologna, Italy
(3) DIEE, University of Cagliari, Italy
(4) LSI, EPFL, Switzerland
(5) DACYA, Complutense University, Spain
(6) CSL, Stanford University, USA
2. Bringing NoCs to Success
Software: services mapping, QoS, middleware...
CAD Tools
Architecture: packeting, buffering, flow control...
Physical Implementation: synchronization, wires, power...
- All these items are key opportunities and challenges
- Strict interaction/feedback mandatory!
3. NoC Physical Implementation
4. A Typical ASIC Back-End Flow
Design Space Exploration
RTL Coding
Logic Synthesis
Placement
Routing
- How does this affect NoCs?
- Locally: synthesis, placement and routing of NoC IP blocks?
- Globally: constraints on NoC links?
5. Placement Strategies
- Virtual flat
- Placer is able to work on whole design
- Theoretically optimal results
- Impractical for chip-sized designs (e.g. NoCs)
- Soft macro
- Fences define areas where placer can operate
- If fine grain, fastest possible placement
- If fine grain, too much designer effort
- Identify tradeoff among effectiveness,
performance and designer effort
6. xpipes Placement Approach
- Floorplan: mix of
- hard macros for IP cores (may or may not allow over-the-cell routing)
- soft macros for NoC blocks
7. Wireload Models and 65nm
- Logic synthesis tools do not know about placement yet
- Thus, loads and timing are uncertain!
- Wireload models
- Quite inaccurate. 130nm [TCAD07]: 6% to 23% off from actual achievable post-placement timing
- In 65nm, problem is dramatically worse
- No timing closure after placement (-50% frequency, huge runtimes...)
- Traditional logic synthesis tools (e.g. Synopsys Design Compiler) insufficient
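To make the wireload problem concrete, the sketch below contrasts a fanout-only wireload lookup with an estimate based on placed pin coordinates (half-perimeter wirelength). Every number in it (the table, the capacitance per mm, the pin positions) is a hypothetical placeholder, not data from a real library or from these slides.

```python
# Hypothetical values throughout; none come from a real technology library.
WIRELOAD_TABLE = {1: 0.010, 2: 0.018, 3: 0.027, 4: 0.036}  # fanout -> wire cap (pF)
CAP_PER_MM = 0.20  # assumed wire capacitance in pF per mm


def wireload_cap(fanout):
    """Pre-placement estimate: capacitance depends on fanout only."""
    key = min(fanout, max(WIRELOAD_TABLE))  # clamp to the largest table entry
    return WIRELOAD_TABLE[key]


def placed_cap(pins):
    """Post-placement estimate from the half-perimeter wirelength (HPWL).

    pins: list of (x_mm, y_mm) coordinates of all pins on the net.
    """
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    hpwl_mm = (max(xs) - min(xs)) + (max(ys) - min(ys))
    return hpwl_mm * CAP_PER_MM


def relative_error(estimate, actual):
    """Signed relative error of the estimate with respect to the actual value."""
    return (estimate - actual) / actual


# One driver and two sinks, i.e. fanout 2, spread over a fraction of a mm.
net_pins = [(0.0, 0.0), (0.3, 0.1), (0.1, 0.4)]
pre = wireload_cap(2)
post = placed_cap(net_pins)
print(f"wireload: {pre:.3f} pF, placed: {post:.3f} pF, "
      f"error: {relative_error(pre, post):+.0%}")
```

The point of the toy is only that a fanout-indexed table cannot see how far apart the pins end up, which is the information a placement-aware flow adds.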
8. Placement-Aware Synthesis
- Synopsys Physical Compiler flow:
RTL → Quick logic synthesis → Initial Netlist → Placement → Initial Placed Netlist → Thorough logic synthesis → Final Placed Netlist
Observation 1: Use placement-aware tools to get accurate estimations of design speed and to reach timing closure. Traditional logic synthesis tools may not be suitable.
9. Area Modeling
- In our experiments, placement & routing is extremely sensitive to soft macro area
- Fences too tight: flow fails
- Fences too wide: tool produces bad results
- Solution: accurate component area models
- Involves work since area depends on architectural parameters (cardinality, buffering...)
Observation 2: Thorough characterization of the components may be key to the convergence of the flow for a whole topology.
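As an illustration of what a parametric area model and the resulting fence sizing could look like, here is a minimal sketch. The linear model form, the coefficients `k_xbar`, `k_buf`, `k_ctrl`, and the 0.7 target utilization are all assumptions made for the example; real coefficients would come from characterizing synthesized blocks.

```python
import math


def switch_area_um2(n_in, n_out, flit_bits, buf_depth,
                    k_xbar=2.0, k_buf=12.0, k_ctrl=400.0):
    """Hypothetical linear area model of a switch, in um^2.

    The three k_* coefficients are placeholders to be fitted against
    post-synthesis area reports; they are not measured values.
    """
    crossbar = k_xbar * n_in * n_out * flit_bits
    buffers = k_buf * n_out * buf_depth * flit_bits
    control = k_ctrl * (n_in + n_out)
    return crossbar + buffers + control


def fence_side_um(cell_area_um2, utilization=0.7):
    """Side of a square fence holding the cells at the given utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return math.sqrt(cell_area_um2 / utilization)


# Parameters of the 6x6, 38-bit, 6-buffer switch mentioned in the slides.
area = switch_area_um2(n_in=6, n_out=6, flit_bits=38, buf_depth=6)
print(f"estimated cell area: {area:.0f} um^2")
print(f"fence side at 70% utilization: {fence_side_um(area):.1f} um")
```

Lowering `utilization` widens the fence and raising it tightens the fence, which is the knob the "too tight / too wide" bullets refer to.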
10. Technology Scaling on Modules
6x6 switch, 38 bits, 6 buffers
- Within modules, scaling looks great
- +25% frequency
- -46% area
- -52% power
11. 65nm Degrees of Freedom
Observation 3: There is no such thing as "a 65nm library". Power/performance degrees of freedom span across one order of magnitude. It is the designer's (or the tools') responsibility to pick the right cells.
- Libraries differ in gate design, VT, VDD...
12Link Design Constraints
65nm lowest power
65nm power/ performance
- Power to drive a 38-bit (plus flow control)
unidirectional link
Observation 4 Long links (unless custom
designed) become either infeasible, or too
power-hungry. Keep them segmented.
13. Link Repeaters/Relay Stations
- Wire segmentation by topology design
- Put more switches, closer
- Adds a lot of overhead
- Wire segmentation by repeater insertion
- Flops/relay stations to break links
- Details are strictly related to flow control
- Could force design iterations for QoS provisioning!
- Need for awareness in high-level CAD tooling!
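The segmentation-by-repeater idea can be sketched as a simple budget calculation: given a clock frequency and a wire delay per mm, how many relay stations does a link need so that every segment fits in one cycle? The delay-per-mm and overhead figures below are hypothetical placeholders, and the model ignores repeater sizing, crosstalk, and routing detours.

```python
import math


def max_segment_mm(freq_mhz, delay_ps_per_mm, overhead_ps=0.0):
    """Longest wire segment that fits in one clock cycle.

    overhead_ps lumps together flop setup, clock-to-q and margin (assumed).
    """
    period_ps = 1e6 / freq_mhz
    budget_ps = period_ps - overhead_ps
    if budget_ps <= 0:
        raise ValueError("overhead exceeds the clock period")
    return budget_ps / delay_ps_per_mm


def relay_stations_needed(link_mm, freq_mhz, delay_ps_per_mm, overhead_ps=0.0):
    """Number of relay stations needed to split a link into legal segments."""
    segment_mm = max_segment_mm(freq_mhz, delay_ps_per_mm, overhead_ps)
    segments = math.ceil(link_mm / segment_mm)
    return max(segments - 1, 0)


# Hypothetical example: 1 GHz clock, 400 ps/mm wire delay, 200 ps overhead.
for length_mm in (1.2, 4.0, 5.0):
    n = relay_stations_needed(length_mm, 1000.0, 400.0, overhead_ps=200.0)
    print(f"{length_mm} mm link: {n} relay station(s)")
```

Each flop-based stage adds a clock cycle to the link latency, which is one reason the slide ties the details to flow control and to QoS provisioning.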
14. Technology Scaling on Topologies
- Three designs for max frequency
65 nm, 1 mm² cores
65 nm, 0.4 mm² cores
15. Mesh Scaling
- Links
- Always short (<1.2 mm) → non-pipelined
- However
- 90 nm, 1 mm² cores: 3.1 mW
- 65 nm, 1 mm² cores: 3.6 mW (tightest fit → more buffering)
- 65 nm, 0.4 mm² cores: 2.2 mW
- Power shifting from switches/NIs to links (buffering)
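For a quick feel of the three power figures listed above, the snippet below expresses the two 65 nm values relative to the 90 nm one. Only the three mW values come from the slide; the helper name and the choice of the 90 nm design as baseline are made for illustration.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


# Power values in mW, as listed on the slide.
mesh_power_mw = {
    "90 nm, 1 mm2 cores": 3.1,
    "65 nm, 1 mm2 cores": 3.6,
    "65 nm, 0.4 mm2 cores": 2.2,
}

baseline = mesh_power_mw["90 nm, 1 mm2 cores"]
for name, power in mesh_power_mw.items():
    print(f"{name}: {power} mW ({percent_change(baseline, power):+.1f}% vs 90 nm)")
```

Under this reading, the same-size 65 nm design draws about 16% more than the 90 nm one, while the shrunk 0.4 mm² variant draws about 29% less.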
16. Updates to the xpipes Design Flow
17. A Complete NoC Design Flow
[Flow diagram labels]
Application
Codesign, Simulation
User objectives: power, hop delay
Constraints: area, power, hop delay, wire length
NoC component library
FPGA Emulation
Input traffic model
IP Core models
Constraint graph, Comm graph
Topology Synthesis (includes Floorplanner, NoC Router)
Platform Generation (xpipesCompiler)
Synthesis
Placement & Routing
System specs
SystemC code
To fab
NoC Area models
RTL Architectural Simulation
NoC Power models
SunFloor
Floorplanning specifications
Area, power characterization
18. Example Layout
- Floorplan is automatically generated
- Black areas: IP cores
- Colored areas: NoC
- Over-the-cell routing allowed in this example
19. Studies on Task Graphs
20. dVOPD Application
21. Technology Scaling: dVOPD
- Low Power libraries cannot meet BW requirements of the application
- Best topology features high-radix switches
- Best topology does not change when moving to 65nm
- But power improves a lot...
22. Complexity Scaling: 65nm HP
- Switch frequency must go up
- Back-end cannot support switches of arbitrary cardinality
- Therefore, more, smaller switches are instantiated
- Link frequency/length must go up
- Pipelined links get instantiated
- Only 500 MHz would be achievable otherwise
23. Communication Efficiency
- Much more efficiency when...
- Moving from 90nm to 65nm (+100%)
- Moving to low-frequency designs (+50%)
- Moving to low-power libraries (+100%)
24. Conclusions and Future Work
- NoCs in 65nm are feasible and perform well
- 65nm design presents opportunities and challenges
- Tool flows are key
- Block pre-characterization is important
- More degrees of freedom for block implementation
- Pay attention to long links
- A complete flow to map applications onto NoCs with back-end awareness
- In-depth study on high-performance vs. low-power
- Leakage studies
- Alternate link designs and optimizations
25. Questions Welcome!
Thank You