Title: Task 1: System Architecture and Circuit Innovations
1 Task 1: System Architecture and Circuit Innovations
- Anantha Chandrakasan, MIT
- Bill Dally, Stanford
- Mark Horowitz, Stanford
- Kunle Olukotun, Stanford
- Scott Wills, Georgia Tech
Interconnect Focus Center
2 Interconnect and Gate Delays
From ITRS 1999
3 Overcoming Interconnect Limitations
- Improve Interconnect Performance
- Improve Interconnect Utilization
- Reduce Interconnect Demand
4 Improve Interconnect Performance
- delay: maximize propagation velocity
  - repeaters
  - overdrive
  - controlled transmission waveforms
- power-efficient signaling
  - low swing
  - signal coding
  - alternative synchronization and clock
- noise
  - noise characterization and cancellation
Connect with Task 2
5 Driving Long Regular Wires (part of an on-chip network)
Uniform, well-characterized lines enable custom circuits: 0.1x power, 3x velocity
Long, lossy RC lines
Regenerative Repeaters
H-bridge driver, 100 mV swing
Connect with Task 2
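Why repeaters help on long, lossy RC lines can be illustrated with a rough first-order model. The sketch below is not taken from the slides: it assumes the textbook distributed-RC delay estimate (about 0.38 x R x C for the 50% point), ignores repeater drive resistance and input capacitance, and uses hypothetical per-millimetre values for wire resistance, wire capacitance, and repeater delay.

```python
def wire_delay(length_mm, r_per_mm, c_per_mm):
    """Approximate 50% delay of an unrepeated distributed RC wire.

    Total R and total C both scale with length, so delay scales with length squared.
    """
    return 0.38 * (r_per_mm * length_mm) * (c_per_mm * length_mm)


def repeated_delay(length_mm, segments, r_per_mm, c_per_mm, t_repeater):
    """Delay when the wire is cut into equal segments, each followed by a repeater.

    Simplification (assumption): each repeater adds a fixed delay t_repeater and
    fully isolates the segments from one another.
    """
    seg_len = length_mm / segments
    return segments * (wire_delay(seg_len, r_per_mm, c_per_mm) + t_repeater)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    R_PER_MM = 100.0      # ohms per mm
    C_PER_MM = 0.2e-12    # farads per mm
    T_REPEATER = 20e-12   # seconds per repeater

    for segments in (1, 2, 5, 10):
        d = repeated_delay(10.0, segments, R_PER_MM, C_PER_MM, T_REPEATER)
        print(f"10 mm wire, {segments:2d} segment(s): {d * 1e12:.1f} ps")
```

In this model the unrepeated delay grows quadratically with length, while the repeated delay grows roughly linearly once the segment length is fixed, which is the usual argument for inserting repeaters at regular intervals.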
6 Transition Pattern Coding
Input Data
Recovered Data
Data Transformations
Decoder
Extra Bits
Control (Coding Scheme Selection)
Possible memory
Bits
Drivers
Receivers
Extended Data Bus (includes extra control lines)
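The block diagram above (data transformation, extra bits, decoder) does not say which coding scheme is used. As one well-known instance of the general idea, the sketch below uses bus-invert coding, which is a swapped-in example and not necessarily the scheme studied in this task. It adds a single extra control line and inverts a word whenever sending it unchanged would toggle more than half of the bus lines. The 8-bit width and the all-zero initial bus state are assumptions.

```python
def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def encode(words, width=8):
    """Bus-invert encode a list of words; returns (bus_value, invert_bit) pairs."""
    mask = (1 << width) - 1
    prev = 0  # assumed initial state of the data lines
    coded = []
    for w in words:
        w &= mask
        # Lines that would toggle if the word were driven unchanged.
        if popcount(w ^ prev) > width // 2:
            bus, inv = (~w) & mask, 1   # drive the complement, raise the extra line
        else:
            bus, inv = w, 0
        coded.append((bus, inv))
        prev = bus
    return coded


def decode(coded, width=8):
    """Recover the original words from (bus_value, invert_bit) pairs."""
    mask = (1 << width) - 1
    return [((~bus) & mask) if inv else bus for bus, inv in coded]


def transitions(values, width):
    """Total number of line toggles when values are driven in sequence from all zeros."""
    mask = (1 << width) - 1
    prev, total = 0, 0
    for v in values:
        total += popcount((v ^ prev) & mask)
        prev = v
    return total


if __name__ == "__main__":
    data = [0x00, 0xFF, 0x00, 0xFF]
    coded = encode(data)
    assert decode(coded) == data
    raw = transitions(data, 8)
    # Count the extra invert line as a ninth wire on the extended bus.
    ext = transitions([(inv << 8) | bus for bus, inv in coded], 9)
    print("raw toggles:", raw, "coded toggles (incl. extra line):", ext)
```

The decoder only needs the extra line to undo the transformation, which matches the "extended data bus with extra control lines" structure in the diagram.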
7 Move Longest Global Wires Off-Chip
- Move longest global interconnects off-chip
- Improve interconnect quality and speed
Connect with Task 4
8 Improve Interconnect Performance
- off-chip I/O
  - dense electrical I/O
  - optical I/O
  - RF I/O
- clock distribution
  - distributed clocking
  - optical clock distribution
Connect with Task 3 and Task 4
9 Distributed Clocking (MIT)
- Synchronized clock generated at multiple points
10 Test Chip Results (ISSCC 2000)
- 16 oscillators (40 µm x 40 µm)
- 24 phase detectors (20 µm x 40 µm)
- 0.35 µm, single-poly, triple-metal CMOS
- Total power 450 mW at 3 V, 1.3 GHz
11 Optical Clocking
Local Electrical Distribution
Optical GCLK
Optical Receiver
waveguides
Connect with Task 2 and Task 3
12 Baseline Optical Receiver Fabrication
- Test chip fabrication
- 0.35 µm MOSIS
- Simple Si diode detector
- 4 receivers at corners of 2 mm x 2 mm chip
- Test chip is functional (currently tested at 300 MHz)
- Process and environmental variations had a big impact on clock skew
- Future work will focus on reducing the impact of variations
13 Overcoming Interconnect Limitations
- Improve Interconnect Performance
- Improve Interconnect Utilization
- Reduce Interconnect Demand
14 Increase Interconnect Utilization
- Replace dedicated global wiring with a shared network
Dedicated wiring
Network
15 Most Wires are Idle Most of the Time
- Don't dedicate wires to signals, share them
- Route packets, not wires
- Organize global wiring as an on-chip interconnection network
  - increases global wire utilization
  - allows more flexible use of global wiring resource
  - offers regular, highly optimized global wiring
16 Dedicated Wires versus On-Chip Network
17 Network Routers: Repeaters with Switching
- Need repeaters every 1 mm or less
- Easy to insert switching
  - zero-cost reconfiguration
- Minimize decision time
  - static routing
  - fixed or regular pattern
  - source routing
  - on-demand
  - requires arbitration and fanout
  - can be pipelined
- Minimize buffering
1mm
1mm
Arb
LUT
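Source routing, listed above as one way to minimize decision time, can be sketched as follows. The example is illustrative only: it assumes a 2D mesh of routers and a simple X-then-Y route, neither of which is specified on the slide, and the names `build_route` and `deliver` are hypothetical. The point it shows is that the sender computes the whole path once, so each router only has to read the next hop from the packet header.

```python
# Direction of travel for each output port on an assumed 2D mesh.
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def build_route(src, dst):
    """Compute the full list of output ports at the source (X first, then Y)."""
    (sx, sy), (dx, dy) = src, dst
    route = ["E" if dx > sx else "W"] * abs(dx - sx)
    route += ["N" if dy > sy else "S"] * abs(dy - sy)
    return route


def deliver(src, route):
    """Walk a packet through the mesh; each router just pops the next hop."""
    pos = src
    hops = list(route)      # header carried with the packet
    path = [pos]
    while hops:
        port = hops.pop(0)  # the only routing "decision" made at a router
        dx, dy = STEP[port]
        pos = (pos[0] + dx, pos[1] + dy)
        path.append(pos)
    return path


if __name__ == "__main__":
    route = build_route((0, 0), (2, 1))
    print(route)
    print(deliver((0, 0), route))
```

Because no router computes anything beyond reading a header field, the per-hop logic stays small enough to sit alongside a repeater; contention between packets would still need the arbitration mentioned above, which this sketch omits.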
18 Architecture Reduces Impact of Slow Wires; Circuits Make Wires More Efficient
- Locality
- Eliminate implicit global communication
- Expose and optimize the communication
- Clustered architecture
- Partitioned register file
- Data migration
- Networking
- Route packets, not wires
- Improves duty factor of wires
- Single, regular, highly-optimized design
19 Overcoming Interconnect Limitations
- Improve Interconnect Performance
- Improve Interconnect Utilization
- Reduce Interconnect Demand
20 SIMD Instruction Broadcast
- Uses long broadcasting wires
- Long-wire delays limit system clock
Global instruction broadcast
Plot extracted from 1997 NTRS
21 Short-Wire Instruction Broadcast
- Exclusive use of short-wire interconnects
- Systolic-style instruction distribution
- Reduced wiring demand
- Some nearest neighbor communication issues
ACU
22 Making Broadcast Systolic
A communication instruction is composed of two mini-operations:
1. Read data from the source register file and put it in the inter-PE buffer
2. Read data from the inter-PE buffer and write it to the destination register file
xfer r2 r1 East: r1 -> east port, then west port -> r2
xfer r2 r1 West: r1 -> west port, NOP, then east port -> r2
[Figure: r1, r2, and the inter-PE buffer, with the mini-operations shown over numbered steps 1 to 4]
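The split of a communication instruction into two mini-operations can be modeled with a small sketch. It is a simplified interpretation, not the authors' implementation: it assumes a one-dimensional row of PEs, treats "East" as moving data from PE i to PE i+1, leaves the destination register unchanged in a PE with no neighbor on the sending side, and uses a hypothetical `xfer` helper that mirrors the `xfer r2 r1 East` form on the slide.

```python
def xfer(pes, dst, src, direction):
    """Model `xfer dst src direction` across a row of PEs.

    pes is a list of dicts, one register file per PE.
    """
    # Mini-operation 1: every PE reads src and places it in its inter-PE buffer.
    buffers = [pe[src] for pe in pes]

    # Mini-operation 2: every PE reads the buffer filled by its neighbor
    # and writes the value to dst.
    shift = 1 if direction == "East" else -1
    for i, pe in enumerate(pes):
        j = i - shift  # index of the PE whose buffer this PE reads
        if 0 <= j < len(pes):
            pe[dst] = buffers[j]
    return pes


if __name__ == "__main__":
    row = [{"r1": 10, "r2": 0}, {"r1": 20, "r2": 0}, {"r1": 30, "r2": 0}]
    xfer(row, "r2", "r1", "East")
    print([pe["r2"] for pe in row])
```

Separating the buffer write from the buffer read is what lets the two halves be issued in different cycles, so PEs that receive an instruction at different times in a systolic broadcast can still exchange data through the buffer.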
23 Cycle Count Penalty for Systolic Instruction Broadcast
24 Impact of Breaking Long Global Wires for 2-way CP ILP-SIMD
25 Architecture Today Depends on Fast Global Communication
- All instructions issued from single global instruction unit
- All data passes through global register file
- This won't work when global accesses cost 16 clocks of latency (each way)
I-Unit
Regs
26 Clustered Architecture
- Multiple elements (clusters) with
  - local instruction dispatch
  - local register files
  - co-located with arithmetic elements
- Explicit communication between elements through a switch or network
- Fast synchronization between instruction units
Sync
Switch
27 Multi-ALU Processor Chip
- Exploits ILP and thread-level parallelism across clusters
- Single-cycle mechanisms
  - communication
  - synchronization
  - thread creation
- Low-overhead inter-node mechanisms
28 Reduced Communication has Minimum Impact on Performance
29 Register File Organization
- Register files serve two functions
  - Short-term storage for intermediate results
  - Communication between multiple function units
- Which dominates area, delay, and power?
- Global register files grow as N^3, where N is the number of ALUs
  - Need more registers to hold more results (grows with N)
  - Need more ports to connect all of the units (grows with N^2)
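The N^3 growth can be reproduced with a back-of-the-envelope calculation. The sketch below is an illustration under stated assumptions, not the model behind the slides: it assumes the number of registers and the number of ports are each proportional to N, and that the area of one register cell grows with the square of its port count (each port adding one wordline and one bitline). The constants `regs_per_alu` and `ports_per_alu` are hypothetical.

```python
def relative_area(n_alus, regs_per_alu=1, ports_per_alu=3):
    """Relative area of a central register file shared by n_alus ALUs."""
    registers = regs_per_alu * n_alus   # storage grows with N
    ports = ports_per_alu * n_alus      # port count grows with N
    cell_area = ports ** 2              # assumed: cell area grows with ports squared
    return registers * cell_area        # overall growth: N * N^2 = N^3


def growth(n_from, n_to):
    """How many times larger the register file gets when scaling the ALU count."""
    return relative_area(n_to) / relative_area(n_from)


if __name__ == "__main__":
    for n in (2, 4, 8, 48):
        print(f"{n:2d} ALUs: {growth(1, n):.0f}x the area of a single-ALU register file")
```

Under these assumptions the per-ALU constants cancel in the ratio, so doubling the number of ALUs multiplies the central register file area by eight regardless of the values chosen.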
30 Register Cells are Mostly Switch
31 SIMD and Distributed Register Files
Scalar
SIMD
Central
DRF
32 Organizations
- 48 ALUs (32-bit), 500 MHz
- Stream organization improves on central organization: area 195x, delay 20x, power 430x
33 Performance
16% Performance Drop (8% with latency constraints)
180x Improvement
34 Hierarchical Register Organization
35 Much Locality is Data Dependent
- Applications have data/time-dependent graph structure
- Sparse-matrix solution
  - non-zero and fill-in structure
- Logic simulation
  - circuit topology and activity
- PIC codes
  - structure changes as particles move
- Sort-middle polygon rendering
  - structure changes as viewpoint moves
36 Fine-Grain Data Migration: Drift and Diffusion
- Run-time relocation based on pointer use
  - move data at both ends of pointer
  - move control and data
- Each relocation cycle
  - compute drift vector based on pointer use
  - compute diffusion vector based on density potential (resource usage)
  - need to avoid oscillations
- Should data be replicated?
  - not just update vs. invalidate
  - need to duplicate computation to avoid communication
37 Results for Logic Simulation
38 Using Applications and Technology Models to Explore Architectural Design Space
39 Power Efficiency Ranking
40 Most Power Efficient Configuration
41 Area Efficiency Ranking
42 Most Area Efficient Configuration
43 Best Overall Configuration
44 Improving Reliability, Availability, and Scalability (RAS)
- In 2007 CMOS technology, transient failures will regularly occur in logic latches and some combinatorial logic. Interconnect crosstalk will also cause significant transient failures. (IBM Journal of Research and Development)
- Leverage multiple identical components in chip multiprocessor and speculative memory support to provide flexible RAS
- Turn on RAS for higher reliability, turn off RAS for higher performance
- Pair up processors within a Hydra quad
- Processors compare results and retry if they disagree
- Processors do not have to be in lockstep
- Compare does not affect single-thread performance
- Temporary state is kept in speculative store buffers
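The compare-and-retry idea can be sketched in software terms. This is a loose behavioral model, not the Hydra hardware design: the functions `run_checked`, `good_cpu`, and `make_flaky_cpu` are hypothetical, a dictionary stands in for committed memory, and each processor's speculative store buffer is modeled as a dictionary of pending writes that is either committed or discarded as a whole.

```python
def good_cpu(chunk, memory):
    """Execute a chunk of (address, delta) updates; return buffered writes only."""
    writes = {}
    for addr, delta in chunk:
        current = writes.get(addr, memory.get(addr, 0))
        writes[addr] = current + delta
    return writes


def make_flaky_cpu(faults):
    """Return a CPU model that corrupts one buffered write on its first `faults` runs."""
    remaining = [faults]

    def cpu(chunk, memory):
        writes = good_cpu(chunk, memory)
        if remaining[0] > 0 and writes:
            remaining[0] -= 1
            addr = next(iter(writes))
            writes[addr] ^= 1  # simulated transient bit flip
        return writes

    return cpu


def run_checked(chunk, memory, cpu_a, cpu_b, max_retries=3):
    """Run the chunk on a processor pair; commit only if their buffers agree.

    Returns the number of retries that were needed.
    """
    for attempt in range(max_retries + 1):
        buf_a = cpu_a(chunk, memory)
        buf_b = cpu_b(chunk, memory)
        if buf_a == buf_b:
            memory.update(buf_a)  # agreement: commit the buffered writes
            return attempt
        # Disagreement: both buffers are dropped, memory is untouched, retry.
    raise RuntimeError("processors still disagree after retries")


if __name__ == "__main__":
    mem = {"x": 1}
    retries = run_checked([("x", 2), ("y", 5)], mem, good_cpu, make_flaky_cpu(1))
    print(retries, mem)
```

Because nothing reaches committed memory until the two buffers match, a retry simply re-executes from the last committed state, and the two processors only need to agree at the comparison points rather than cycle by cycle.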
45 Hydra Speculation Support
- Speculation coprocessor to sequence speculative threads
- Additional L1 D-cache bits to track speculative reads and writes
- Write buffers at L2 cache to hold speculative writes
- Write bus has additional bits to support speculative writes
46 RAS Design
Processor 1
Processor 2
Processor 3
Processor 4
- Compares happen
- 100 K insts with 32 line L2 buffer
L2 Buffers
L2 Buffers
L2 Buffers
L2 Buffers
Regs
Regs
Regs
Regs
Error
Error
L2 Cache
47 Task 1 Objectives Summary
- Interface with other tasks to improve the performance of available interconnect
- Employ global wire connect architectures to maximize the utilization of global interconnect
- Explore new architectures to reduce the requirement for global interconnect