Title: Interconnect-Oriented Architecture and Circuits
1Interconnect-Oriented Architecture and Circuits
- William J. Dally
- Computer Systems Laboratory
- Stanford University
- February 12, 1998
2On-chip wires
[Figure: minimum-width wire in a 0.35 µm process, on a scale from 0.0 mm to 10.0 mm]
3On-chip wires are getting slower
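Scaling linear dimensions by s (factors shown for s = 0.5):
- $x_2 = s\,x_1$ (0.5x)
- $R_2 = R_1/s^2$ (4x)
- $C_2 = C_1$ (1x)
- $t_{w2} = R_2 C_2 y^2 = t_{w1}/s^2$ (4x)
- $t_{w2}/t_{g2} = t_{w1}/(t_{g1} s^3)$ (8x)
- $v = 0.5\,(t_g R C)^{-1/2}$ (m/s); $v_2 = v_1 s^{1/2}$ (0.7x)
- $v t_g = 0.5\,(t_g/RC)^{1/2}$ (m/gate); $v_2 t_{g2} = v_1 t_{g1} s^{3/2}$ (0.35x)
[Figure: wires of length y with widths x_1 and x_2; labels $t_w = RCy^2$ and $t_g$]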
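The scaling factors on this slide can be checked numerically. The following is a minimal sketch, assuming R and C are per-unit-length values, that the wire length y is held fixed, and that gate delay scales with s (which the s^3 term on the slide implies). The function name `wire_scaling` and the dictionary keys are illustrative, not from the talk.

```python
def wire_scaling(s: float) -> dict:
    """Scaling factors (new / old) when linear dimensions shrink by s."""
    r = 1 / s**2   # resistance per unit length: cross-section shrinks by s^2
    c = 1.0        # capacitance per unit length: unchanged
    tw = r * c     # delay of a fixed-length wire, t_w = R*C*y^2
    tg = s         # gate delay, assumed to scale with s
    return {
        "x": s,
        "R": r,
        "C": c,
        "tw": tw,
        "tw_over_tg": tw / tg,          # wire delay measured in gate delays
        "v": (tg * r * c) ** -0.5,      # signal velocity (m/s)
        "v_tg": (tg / (r * c)) ** 0.5,  # reach per gate delay (m/gate)
    }

factors = wire_scaling(0.5)
for name, f in factors.items():
    print(f"{name}: {f:.2f}x")
```

With s = 0.5 this reproduces the 4x, 8x, 0.7x, and 0.35x entries listed on the slide.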
4Technology scaling makes communication the scarce resource
- 1998: 0.35 µm, 64 Mb DRAM, 16 64b FP Proc, 400 MHz; 18 mm, 12,000 tracks, 1 clock, repeaters every 3 mm
- 2008: 0.10 µm, 4 Gb DRAM, 1K 64b FP Proc, 2.5 GHz; 32 mm, 90,000 tracks, 20 clocks, repeaters every 0.4 mm
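A rough way to see why communication becomes the scarce resource is to compare how fast the processor count grows against how fast the wiring tracks grow. The sketch below uses only the 1998 and 2008 figures from this slide. Reading "1K" as 1024 processors and dividing tracks evenly among processors are assumptions made for illustration.

```python
# Figures taken from the slide; "1K" is read as 1024 (an assumption).
chips = {
    1998: {"fp_procs": 16, "tracks": 12_000, "clock_mhz": 400},
    2008: {"fp_procs": 1024, "tracks": 90_000, "clock_mhz": 2500},
}

def growth(key: str) -> float:
    """Ratio of the 2008 value to the 1998 value."""
    return chips[2008][key] / chips[1998][key]

def tracks_per_proc(year: int) -> float:
    """Wiring tracks available per processor, if shared evenly (assumption)."""
    return chips[year]["tracks"] / chips[year]["fp_procs"]

proc_growth = growth("fp_procs")
track_growth = growth("tracks")
print(f"processors grow {proc_growth:.1f}x, tracks grow {track_growth:.1f}x")
print(f"tracks per processor: {tracks_per_proc(1998):.0f} -> {tracks_per_proc(2008):.0f}")
```

Under these assumptions the number of processors that fit on a chip grows far faster than the wiring available to connect them.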
5Architecture Must Evolve to Fit the Landscape
[Figure labels: 20 clocks, 90,000 tracks]
- Global operations: low bandwidth, high latency, high power
- Local, parallel operations: high bandwidth, low latency, low power
6Architecture Today Depends on Fast Global Communication
- All instructions issued from a single global instruction unit
- All data passes through a global register file
- This won't work when global accesses cost 20 clocks of latency
[Figure labels: I-Unit, Regs]
7Tomorrow's Architectures Must Exploit Locality and Expose Communication
- Multiple elements (clusters) with
- local instruction dispatch
- local register files
- co-located with arithmetic elements
- Explicit communication between elements through a switch or network
- Fast synchronization between instruction units
[Figure label: Switch]
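The organization described on this slide can be illustrated with a toy model: each cluster owns a register file next to its ALU, and a value crosses between clusters only through an explicit transfer. This is a sketch under stated assumptions, not the design from the talk. The class `Cluster`, the function `send`, and both cycle costs are hypothetical.

```python
LOCAL_OP_CYCLES = 1  # assumed cost of an ALU op on local registers
SWITCH_CYCLES = 3    # assumed cost of one explicit inter-cluster transfer

class Cluster:
    """One element: a local register file co-located with its ALU."""
    def __init__(self, name: str):
        self.name = name
        self.regs = {}   # local register file
        self.cycles = 0  # local time, in clock cycles

    def add(self, dst: str, a: str, b: str) -> None:
        self.regs[dst] = self.regs[a] + self.regs[b]
        self.cycles += LOCAL_OP_CYCLES

    def mul(self, dst: str, a: str, b: str) -> None:
        self.regs[dst] = self.regs[a] * self.regs[b]
        self.cycles += LOCAL_OP_CYCLES

def send(src: Cluster, src_reg: str, dst: Cluster, dst_reg: str) -> None:
    """Explicit communication: the only way a value leaves a cluster."""
    dst.regs[dst_reg] = src.regs[src_reg]
    # The value arrives SWITCH_CYCLES after both sides are ready.
    dst.cycles = max(dst.cycles, src.cycles) + SWITCH_CYCLES

a, b = Cluster("a"), Cluster("b")
a.regs.update(r0=2, r1=3)
b.regs.update(r0=4)

a.add("r2", "r0", "r1")   # local work in cluster a
send(a, "r2", b, "r1")    # communication is visible in the program
b.mul("r2", "r0", "r1")   # local work in cluster b
print(b.regs["r2"], b.cycles)
```

Because the transfer is a separate step with its own cost, a compiler or programmer can see it and schedule around it, which is the point of exposing communication.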
8Multi-ALU Processor Chip
9Crafted-Cell Design
[Figure: Area and Performance comparison of Standard-Cell, Full-Custom, and Crafted-Cell designs; labels: 80, 7, and 17 different cells; 1x, 1.64x, 5.25x]
- Results courtesy of Andrew Chang
10Interconnect repeaters with switching
- Need repeaters every 1mm or less
- Easy to insert switching
- zero-cost reconfiguration
- Can't afford decision time
- static routing
- fixed or regular pattern
- source routing
- on-demand
- requires arbitration and fanout
- Queuing and flow-control
- Pipelining control
[Figure: repeaters at 1 mm spacing, with Arb and LUT blocks]
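The difference between static routing and source routing at a repeater switch can be sketched in a few lines. In both cases the switch makes no decision of its own: with static routing its output port is fixed at configuration time, and with source routing the packet header carries one port choice per hop. The topology, names, and functions below are hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical 4-switch topology: each switch lists its neighbours by output port.
TOPOLOGY = {
    "A": ["B", "C"],
    "B": ["D", "A"],
    "C": ["D", "A"],
    "D": ["B", "C"],
}

def source_route(topology, start, ports):
    """Source routing: the sender chooses every output port in advance."""
    header = deque(ports)
    here = start
    visited = [here]
    while header:
        port = header.popleft()      # each switch consumes one header entry
        here = topology[here][port]
        visited.append(here)
    return visited

def static_route(topology, config, start, hops):
    """Static routing: each switch's output port is fixed by configuration."""
    here = start
    visited = [here]
    for _ in range(hops):
        here = topology[here][config[here]]
        visited.append(here)
    return visited

print(source_route(TOPOLOGY, "A", [0, 0]))
print(source_route(TOPOLOGY, "A", [1, 0]))
print(static_route(TOPOLOGY, {"A": 0, "B": 0, "C": 0, "D": 0}, "A", 2))
```

On-demand routing would add arbitration and fanout at each switch, which this sketch deliberately leaves out.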
12Bandwidth Hierarchy
- Provide lots of bandwidth where it's inexpensive
- short wires between ALUs
- Moderate bandwidth with intermediate cost
- local RAM associated with each ALU cluster
- Low bandwidth where it's expensive
- Global RAM with long wires
- Very low bandwidth off chip
[Figure: four ALU clusters, each with local RAM, and a global on-chip RAM; labels: local 1mm, medium 4mm, global 30mm, off chip]
13Bandwidth Hierarchy
- A key problem is to match the demands of an application to the bandwidth available at each level of the hierarchy
- Casting applications in a streaming model exposes much of the locality necessary to exploit the hierarchy
[Figure: four ALU clusters, each with local RAM, and a global on-chip RAM]
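The effect of a streaming model on global bandwidth demand can be illustrated by counting memory words moved at each level. The sketch below compares running a chain of kernels one at a time, with every intermediate result written back to global RAM, against streaming chunks through a cluster so that intermediates stay in local RAM. The function names, the word-counting cost model, and the example kernels are assumptions for illustration only.

```python
def run_kernel_at_a_time(data, kernels):
    """Each kernel reads its input from global RAM and writes its output back."""
    global_words = 0
    for k in kernels:
        data = [k(x) for x in data]
        global_words += 2 * len(data)        # one read and one write per element
    return data, global_words

def run_streamed(data, kernels, chunk):
    """Chunks are loaded once, pass through every kernel locally, then stored."""
    global_words = 0
    local_words = 0
    out = []
    for i in range(0, len(data), chunk):
        block = data[i:i + chunk]
        global_words += len(block)           # load chunk from global RAM
        for k in kernels:
            block = [k(x) for x in block]
            local_words += 2 * len(block)    # intermediates stay in local RAM
        out.extend(block)
        global_words += len(block)           # store final result to global RAM
    return out, global_words, local_words

kernels = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
data = list(range(8))

out_a, g_a = run_kernel_at_a_time(data, kernels)
out_b, g_b, l_b = run_streamed(data, kernels, chunk=4)
print(g_a, g_b, l_b)
```

Both versions compute the same result, but in this model the streamed version shifts the intermediate traffic from the global RAM to the local RAM, where bandwidth is inexpensive.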
14Architecture Research Issues
- Processor architecture
- configuration of ALUs
- clustered vs distributed
- method for controlling ALUs
- distributed control, VLIW, SIMD
- communication-aware instruction sets
- how to hide details while exposing communication
- Memory architecture
- methods for exploiting 2D spatial locality
- communication-aware cache organizations
- Communication Architecture
- on-chip interconnection networks
- the use of repeaters with switching
- the use of hierarchy and selective fat wires
15Circuit Challenges of Slow Interconnect
- The clock cycle is dominated by wire delay
- novel circuits to improve effective signal velocity
- Power is largely used to drive wires
- low-swing on-chip signaling methods
- reject rather than overpower noise
- It's difficult to distribute a global clock
- locally synchronous design methods
- fast synchronizers
- no wait for metastable decay
16Overdrive gives 3x improvement in RC wire latency
17Low-Swing Overdrive Signaling
1V Swing at Source
300mV Swing at Receiver
Recovered Signal
18Conclusion: Exploit, Don't Fight, the Technology
- Interconnect is rapidly dominating the delay, power, and area of ICs
- Traditional architectures rely on global communication
- they are ill-suited for an interconnect-dominated technology
- Emerging architectures expose communication and exploit locality
- distributed register files and instruction dispatch
- bandwidth hierarchy
- Novel circuits can mitigate effects of slow wires
- overdrive, low-swing signaling, locally synchronous design