Interconnect-Oriented Architecture and Circuits - PowerPoint PPT Presentation

About This Presentation
Title:

Interconnect-Oriented Architecture and Circuits

Description:

Interconnect-Oriented Architecture and Circuits. William J. Dally, Computer Systems Laboratory, Stanford University, February 12, 1998.

Slides: 19

Transcript and Presenter's Notes

Title: Interconnect-Oriented Architecture and Circuits


1
Interconnect-Oriented Architecture and Circuits
  • William J. Dally
  • Computer Systems Laboratory
  • Stanford University
  • February 12, 1998

2
On-chip wires
[Figure: minimum-width wire in a 0.35µm process, with labels at 0.0mm, 2.5mm, 5.0mm, 7.5mm, and 10.0mm]
3
On-chip wires are getting slower
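Scaling from generation 1 to generation 2 by a factor $s$ (values shown for $s = 0.5$):
  • $x_2 = s\,x_1$ (0.5x)
  • $R_2 = R_1/s^2$ (4x)
  • $C_2 = C_1$ (1x)
  • $t_{w2} = R_2 C_2 y^2 = t_{w1}/s^2$ (4x)
  • $t_{w2}/t_{g2} = t_{w1}/(t_{g1} s^3)$ (8x)
  • $v = 0.5\,(t_g R C)^{-1/2}$ (m/s), so $v_2 = v_1 s^{1/2}$ (0.7x)
  • $v\,t_g = 0.5\,(t_g/RC)^{1/2}$ (m/gate), so $v_2 t_{g2} = v_1 t_{g1} s^{3/2}$ (0.35x)
[Figure: two wires of length $y$, labeled $x_1$ and $x_2$, each with wire delay $t_w = RCy^2$ shown against gate delay $t_g$]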
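The scaling relations on this slide can be checked numerically. The sketch below is not from the talk; the function name `wire_scaling` is made up for illustration, and it assumes gate delay scales as $t_{g2} = s\,t_{g1}$, which is what the slide's $t_w/t_g$ ratio implies.

```python
def wire_scaling(s: float) -> dict:
    """Factor by which each quantity changes when dimensions scale by s.

    Assumes R and C are per unit length, wire length y is held fixed,
    and gate delay scales linearly with s (implied by the slide's ratios).
    """
    r = 1.0 / s**2               # R2 = R1 / s^2
    c = 1.0                      # C2 = C1
    tg = s                       # tg2 = s * tg1 (assumption)
    tw = r * c                   # tw = R C y^2 with y fixed
    v = (tg * r * c) ** -0.5     # v = 0.5 (tg R C)^(-1/2); the 0.5 cancels in the ratio
    return {
        "x": s,
        "R": r,
        "C": c,
        "tw": tw,
        "tw_over_tg": tw / tg,
        "v": v,
        "v_tg": v * tg,
    }


factors = wire_scaling(0.5)
for name, value in factors.items():
    print(f"{name}: {value:.2f}x")
```

For s = 0.5 this reproduces the slide's 4x wire delay, 8x wire delay in gate delays, and roughly 0.7x and 0.35x for the two velocity measures.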
4
Technology scaling makes communication the scarce resource
  • 1998: 0.35µm process, 64Mb DRAM, 16 64b FP processors, 400MHz; 18mm, 12,000 tracks, 1 clock, repeaters every 3mm
  • 2008: 0.10µm process, 4Gb DRAM, 1K 64b FP processors, 2.5GHz; 32mm, 90,000 tracks, 20 clocks, repeaters every 0.4mm
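A rough way to read these two data points is to turn them into repeater counts and absolute latencies. The sketch below assumes, without confirmation from the slide, that 18mm and 32mm are the distances a global signal must span and that "1 clock" and "20 clocks" are the times to cross them. The helper names are hypothetical.

```python
def segments_per_crossing(span_um: int, spacing_um: int) -> int:
    """Repeated wire segments needed to span the distance (integer ceiling)."""
    return -(-span_um // spacing_um)


def crossing_latency_ns(clocks: int, freq_mhz: int) -> float:
    """Crossing time in nanoseconds for a given clock count and frequency."""
    return clocks * 1000 / freq_mhz


# Figures taken from the slide; their interpretation is an assumption.
points = {
    1998: {"span_um": 18_000, "spacing_um": 3_000, "clocks": 1, "freq_mhz": 400},
    2008: {"span_um": 32_000, "spacing_um": 400, "clocks": 20, "freq_mhz": 2_500},
}

summary = {
    year: (
        segments_per_crossing(p["span_um"], p["spacing_um"]),
        crossing_latency_ns(p["clocks"], p["freq_mhz"]),
    )
    for year, p in points.items()
}

for year, (segments, latency) in summary.items():
    print(f"{year}: {segments} segments, {latency} ns to cross")
```

Under these assumptions a crossing goes from 6 repeated segments and 2.5 ns to 80 segments and 8 ns, so the cost grows in absolute time as well as in clock cycles.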
5
Architecture Must Evolve to Fit the Landscape
  • Global operations: low bandwidth, high latency, high power
  • Local, parallel operations: high bandwidth, low latency, low power
[Figure labels: 20 clocks, 90,000 tracks]
6
Architecture Today Depends on Fast Global Communication
  • All instructions issued from single global
    instruction unit
  • All data passes through global register file
  • This won't work when global accesses cost 20
    clocks of latency

[Figure labels: I-Unit, Regs]
7
Tomorrow's Architectures Must Exploit Locality and Expose Communication
  • Multiple elements (clusters) with
    • local instruction dispatch
    • local register files
    • co-located with arithmetic elements
  • Explicit communication between elements through a
    switch or network
  • Fast synchronization between instruction units

[Figure label: Switch]
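The clustered organization described on this slide can be illustrated with a toy model. This is a minimal sketch, not the architecture from the talk: the `Cluster` and `Switch` classes, their methods, and the register names are all hypothetical. The point is only that arithmetic touches local registers, while any movement between clusters is a separate, visible operation.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical cluster with its own register file."""
    name: str
    regs: dict = field(default_factory=dict)

    def add(self, dst: str, a: str, b: str) -> None:
        # Local operation: reads and writes only this cluster's registers.
        self.regs[dst] = self.regs[a] + self.regs[b]


class Switch:
    """Hypothetical switch; every inter-cluster transfer is an explicit call."""

    def __init__(self) -> None:
        self.transfers = 0

    def send(self, src: Cluster, src_reg: str, dst: Cluster, dst_reg: str) -> None:
        dst.regs[dst_reg] = src.regs[src_reg]
        self.transfers += 1  # communication is counted, so its cost is exposed


c0 = Cluster("c0", {"r0": 2, "r1": 3})
c1 = Cluster("c1", {"r0": 10})
switch = Switch()

c0.add("r2", "r0", "r1")         # local to c0
switch.send(c0, "r2", c1, "r1")  # explicit communication through the switch
c1.add("r2", "r0", "r1")         # local to c1

print(c1.regs["r2"], switch.transfers)
```

Counting transfers in the switch is one simple way a compiler or programmer could see how much global communication a schedule uses.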
8
Multi-ALU Processor Chip
9
Crafted-Cell Design
[Figure: axes labeled Area and Performance, comparing Standard-Cell, Full-Custom, and Crafted-Cell designs. Labels in transcript order: 80 Different Cells, 7 Different Cells, 17 Different Cells; 1x, 1.64x, 5.25x.]
Results courtesy of Andrew Chang
10
Interconnect repeaters with switching
  • Need repeaters every 1mm or less
  • Easy to insert switching
    • zero-cost reconfiguration
  • Can't afford decision time
    • static routing
    • fixed or regular pattern
    • source routing
    • on-demand
    • requires arbitration and fanout
  • Queuing and flow-control
  • Pipelining control

[Figure labels: 1mm, 1mm, Arb, LUT]
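Source routing, one of the options listed on this slide, moves the routing decision to the sender so that each switching repeater only performs a lookup. The sketch below is an illustration under assumed names: the `route` function, the port names, and the four-node topology are hypothetical and do not come from the talk.

```python
def route(path, start, ports):
    """Follow a source route through switching repeaters.

    path:  output-port names chosen in advance by the sender.
    ports: ports[node][port] gives the node reached through that port.
    """
    node = start
    hops = [node]
    for port in path:
        # Each switch consumes one header entry; it makes no decision itself.
        node = ports[node][port]
        hops.append(node)
    return hops


# Hypothetical topology of four switching repeaters.
ports = {
    "A": {"east": "B"},
    "B": {"east": "C", "south": "D"},
    "C": {},
    "D": {},
}

hops = route(["east", "south"], "A", ports)
print(hops)
```

Static routing would correspond to fixing the `path` ahead of time, while an on-demand scheme would additionally need arbitration when two routes request the same port.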
11
(No Transcript)
12
Bandwidth Hierarchy
  • Provide lots of bandwidth where it's inexpensive
    • short wires between ALUs
  • Moderate bandwidth with intermediate cost
    • local RAM associated with each ALU cluster
  • Low bandwidth where it's expensive
    • Global RAM with long wires
  • Very low bandwidth off chip

[Figure: four ALU clusters, each with a local RAM, and a global on-chip RAM. Wire lengths: local 1mm, medium 4mm, global 30mm, then off chip.]
13
Bandwidth Hierarchy
  • A key problem is to match the demands of an
    application to the bandwidth available at each
    level of the hierarchy
  • Casting applications in a streaming model exposes
    much of the locality necessary to exploit the
    hierarchy

[Figure: four ALU clusters, each with a local RAM, and a global on-chip RAM]
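Matching an application to the hierarchy amounts to finding which level's bandwidth limits it. The sketch below shows one simple way to frame that check. The function `bottleneck`, the level names, and every number are illustrative assumptions, not measurements or figures from the talk.

```python
def bottleneck(demand, bandwidth):
    """Return (level, cycles) for the level that takes longest.

    demand[level]:    words the application moves at that level.
    bandwidth[level]: words per cycle that level can supply.
    """
    times = {level: demand[level] / bandwidth[level] for level in demand}
    worst = max(times, key=times.get)
    return worst, times[worst]


# Illustrative values only: bandwidth falls as wires get longer.
bandwidth = {"local": 64.0, "cluster_ram": 8.0, "global_ram": 2.0, "off_chip": 0.5}

# Hypothetical application: most traffic stays local, little goes off chip.
demand = {"local": 6400, "cluster_ram": 800, "global_ram": 100, "off_chip": 100}

level, cycles = bottleneck(demand, bandwidth)
print(level, cycles)
```

In this made-up case the off-chip level is the limit even though it carries the least traffic, which is the kind of mismatch a streaming formulation tries to avoid by keeping data in the lower levels.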
14
Architecture Research Issues
  • Processor architecture
    • configuration of ALUs
    • clustered vs distributed
    • method for controlling ALUs
    • distributed control, VLIW, SIMD
    • communication-aware instruction sets
    • how to hide details while exposing communication
  • Memory architecture
    • methods for exploiting 2D spatial locality
    • communication-aware cache organizations
  • Communication Architecture
    • on-chip interconnection networks
    • the use of repeaters with switching
    • the use of hierarchy and selective fat wires

15
Circuit Challenges of Slow Interconnect
  • The clock cycle is dominated by wire delay
    • novel circuits to improve effective signal
      velocity
  • Power is largely used to drive wires
    • low-swing on-chip signaling methods
    • reject rather than overpower noise
  • It's difficult to distribute a global clock
    • locally synchronous design methods
    • fast synchronizers
    • no wait for metastable decay

16
Overdrive gives 3x improvement in RC wire latency
17
Low-Swing Overdrive Signaling
1V Swing at Source
300mV Swing at Receiver
Recovered Signal
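The power benefit of a reduced swing can be estimated with the standard charge argument: a driver that charges capacitance C through a swing V_swing from a supply at V_supply draws energy C · V_swing · V_supply. The sketch below applies that formula to the 1V source swing on this slide. The 3.3V supply and the 1pF wire capacitance are assumptions chosen for illustration; the slide states neither.

```python
def switching_energy(c_farads: float, v_swing: float, v_supply: float) -> float:
    """Energy drawn from the supply to charge c_farads through v_swing."""
    return c_farads * v_swing * v_supply


C_WIRE = 1e-12    # assumed wire capacitance, 1 pF (not from the slide)
V_SUPPLY = 3.3    # assumed supply voltage (not from the slide)
V_SOURCE = 1.0    # 1V swing at the source, from the slide

full_swing = switching_energy(C_WIRE, V_SUPPLY, V_SUPPLY)
low_swing = switching_energy(C_WIRE, V_SOURCE, V_SUPPLY)
ratio = low_swing / full_swing

print(f"full swing: {full_swing:.3e} J")
print(f"low swing:  {low_swing:.3e} J")
print(f"ratio:      {ratio:.2f}")
```

If the low-swing driver were instead fed from a dedicated 1V supply, the energy would be C · V_swing², and the saving relative to full swing would be larger.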
18
Conclusion: Exploit, Don't Fight, the Technology
  • Interconnect is rapidly dominating the delay,
    power, and area of ICs
  • Traditional architectures rely on global
    communication
    • they are ill-suited for an interconnect-dominated
      technology
  • Emerging architectures expose communication and
    exploit locality
    • distributed register files and instruction
      dispatch
    • bandwidth hierarchy
  • Novel circuits can mitigate effects of slow wires
    • overdrive, low-swing signaling, locally
      synchronous design