Task 1: System Architecture and Circuit Innovations - PowerPoint PPT Presentation

Provided by: scott384

1
Task 1: System Architecture and Circuit Innovations

Anantha Chandrakasan MIT
Bill Dally Stanford
Mark Horowitz Stanford
Kunle Olukotun Stanford
Scott Wills Georgia Tech

Interconnect Focus Center
2
Interconnect and Gate Delays
from ITRS99
3
Overcoming Interconnect Limitations

Improve Interconnect Performance
Improve Interconnect Utilization
Reduce Interconnect Demand

4
Improve Interconnect Performance

delay: maximize propagation velocity
repeaters
overdrive
controlled transmission waveforms
power efficient signaling
low swing
signal coding
alternative synchronization and clock
noise
noise characterization and cancellation

Connect with Task 2
5
Driving Long Regular Wires (part of an on-chip network)
Uniform, well-characterized lines enable custom circuits: 0.1x power, 3x velocity
Long, lossy RC lines
Regenerative Repeaters
H-bridge driver, 100 mV swing
Connect with Task 2
6
Transition Pattern Coding
Input Data
Recovered Data
Data Transformations
Decoder
Extra Bits
Control (Coding Scheme Selection)
Possible memory
Bits
Drivers
Receivers
Extended Data Bus (includes extra control lines)
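
The slide does not name a particular coding scheme. As one illustration of how an extra bus line can cut switching activity, here is a minimal sketch of bus-invert coding, a well-known low-power code that is swapped in here as an assumption and is not necessarily what the presenters used. The function names and the 8-bit bus width are hypothetical.

```python
def popcount(x: int) -> int:
    """Number of set bits in x."""
    return bin(x).count("1")


def bus_invert_encode(words, width=8):
    """Encode words as (bus_value, invert_bit) pairs.

    If sending a word as-is would toggle more than half of the bus
    lines, its complement is sent instead and the extra line is set.
    """
    mask = (1 << width) - 1
    prev = 0  # assumption: bus lines start at all zeros
    encoded = []
    for w in words:
        w &= mask
        if popcount(prev ^ w) > width // 2:
            bus, inv = (~w) & mask, 1
        else:
            bus, inv = w, 0
        encoded.append((bus, inv))
        prev = bus
    return encoded


def bus_invert_decode(encoded, width=8):
    """Recover the original words from (bus_value, invert_bit) pairs."""
    mask = (1 << width) - 1
    return [((~bus) & mask) if inv else bus for bus, inv in encoded]


def count_transitions(values, start=0):
    """Total number of line toggles when values are driven in sequence."""
    total, prev = 0, start
    for v in values:
        total += popcount(prev ^ v)
        prev = v
    return total
```

For a worst-case alternating pattern such as 0x00, 0xFF, 0x00, 0xFF, the coded bus never toggles its data lines and only the extra invert line switches.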
7
Move Longest Global Wires Off-Chip

Move longest global interconnects off-chip
Improve interconnect quality and speed

Connect with Task 4
8
Improve Interconnect Performance

off-chip I/O
dense electrical I/O
optical I/O
RF I/O
clock distribution
distributed clocking
optical clock distribution

Connect with Task 3 and Task 4
9
Distributed Clocking (MIT)

Synchronized clock generated at multiple points

10
Test Chip Results (ISSCC '00)

16 oscillators (40 µm x 40 µm)
24 phase detectors (20 µm x 40 µm)
0.35 µm, single-poly, triple-metal CMOS
Total power 450 mW at 3 V, 1.3 GHz

11
Optical Clocking
Local Electrical Distribution
Optical GCLK
Optical Receiver
waveguides
Connect with Task 2 and Task 3
12
Baseline Optical Receiver Fabrication

Test chip fabrication
0.35 µm MOSIS
Simple Si diode detector
4 receivers at corners of 2mm x 2mm chip
Test chip is functional (currently tested at 300 MHz)
Process and environmental variations had a big
impact on clock skew
Future work will focus on reducing the impact
of variations

13
Overcoming Interconnect Limitations

Improve Interconnect Performance
Improve Interconnect Utilization
Reduce Interconnect Demand

14
Increase Interconnect Utilization

Replace dedicated global wiring with a shared
network

Dedicated wiring
Network
15
Most Wires are Idle Most of the Time

Don't dedicate wires to signals, share them
Route packets, not wires
Organize global wiring as an on-chip
interconnection network
increases global wire utilization
allows more flexible use of global wiring
resource
offers regular, highly optimized global wiring

16
Dedicated Wires versus On-Chip Network
17
Network Routers: Repeaters with Switching

Need repeaters every 1mm or less
Easy to insert switching
zero-cost reconfiguration
Minimize decision time
static routing
fixed or regular pattern
source routing
on-demand
requires arbitration and fanout
can be pipelined
Minimize buffering

[Figure labels: 1 mm, 1 mm, Arb, LUT]
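
To make the "source routing" option concrete, here is a minimal sketch in which the sender computes the whole path and each router only pops the next hop from the packet header, so the per-hop decision time stays small. The 2D mesh topology, the X-then-Y ordering, and all names are assumptions made for illustration; the slides do not specify them.

```python
from collections import deque

# Hypothetical direction encoding for a 2D mesh of routers.
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def compute_route(src, dst):
    """Computed once at the source: travel along X first, then along Y."""
    (sx, sy), (dx, dy) = src, dst
    hops = ["E" if dx > sx else "W"] * abs(dx - sx)
    hops += ["N" if dy > sy else "S"] * abs(dy - sy)
    return deque(hops)


def forward(packet):
    """One router hop: pop the next direction from the header and move."""
    direction = packet["route"].popleft()
    ox, oy = STEP[direction]
    x, y = packet["pos"]
    packet["pos"] = (x + ox, y + oy)
    return direction


def deliver(src, dst, payload):
    """Carry a packet from src to dst; return final position and hop trace."""
    packet = {"pos": src, "route": compute_route(src, dst), "payload": payload}
    trace = []
    while packet["route"]:
        trace.append(forward(packet))
    return packet["pos"], trace
```

Calling `deliver((0, 0), (2, 1), "data")` walks the packet two hops east and one hop north. Arbitration between packets competing for the same link, which the slide lists as a requirement of on-demand routing, is left out of this sketch.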
18
Architecture Reduces Impact of Slow Wires; Circuits Make Wires More Efficient

Locality
Eliminate implicit global communication
Expose and optimize the communication
Clustered architecture
Partitioned register file
Data migration
Networking
Route packets, not wires
Improves duty factor of wires
Single, regular, highly-optimized design

19
Overcoming Interconnect Limitations

Improve Interconnect Performance
Improve Interconnect Utilization
Reduce Interconnect Demand

20
SIMD Instruction Broadcast

Uses long broadcasting wires
Long-wire delays limit system clock

Global instruction broadcast
Plot extracted from 1997 NTRS
21
Short-Wire Instruction Broadcast

Exclusive use of short-wire interconnects
Systolic-fashion instruction distribution
Reduced wiring demand
Some nearest neighbor communication issues

ACU
22
Making Broadcast Systolic
A communication instruction is composed of two mini operations:
1- Read data from source register file and put in inter-PE buffer
2- Read data from inter-PE buffer and write to destination register file
[Figure: xfer r2 r1 East and xfer r2 r1 West, moving r1 to r2 through the inter-PE buffer. Labels: r1, r2, east port, west port, NOP, Inter-PE buffer; steps 1 to 4.]
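
The two mini operations of a communication instruction can be sketched as follows. This is an illustrative model only: the linear array of PEs, the dictionary register files, the `xfer` function signature, and the choice to leave edge PEs unchanged are all assumptions. Splitting the transfer around a buffer appears to be what lets neighboring PEs, which receive a systolically distributed instruction on different cycles, perform their halves at different times.

```python
def xfer(pes, dst, src, direction):
    """Model `xfer dst src East|West` on a linear array of PEs.

    pes: list of dicts acting as register files; index 0 is the westmost PE.
    """
    offset = {"East": 1, "West": -1}[direction]

    # Mini operation 1: every PE reads its source register and places
    # the value in the inter-PE buffer on the chosen side.
    buffers = [pe[src] for pe in pes]

    # Mini operation 2: every PE reads the buffer filled by its neighbor
    # and writes the value to its destination register.
    for i, value in enumerate(buffers):
        j = i + offset
        if 0 <= j < len(pes):  # assumption: edge PEs have no neighbor to feed
            pes[j][dst] = value
    return pes
```

With three PEs holding r1 = 10, 20, 30, an East transfer leaves r2 = 10 and 20 in the second and third PEs, and the westmost PE keeps its old r2.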
23
Cycle Count Penalty for Systolic Instruction Broadcast
24
Impact of Breaking Long Global Wires
for 2-way CP ILP-SIMD
25
Architecture Today Depends on Fast Global
Communication

All instructions issued from single global
instruction unit
All data passes through global register file
This won't work when global accesses cost 16 clocks of latency (each way)

I-Unit
Regs
26
Clustered Architecture

Multiple elements (clusters) with
local instruction dispatch
local register files
co-located with arithmetic elements
Explicit communication between elements through a
switch or network
Fast synchronization between instruction units

Sync
Switch
27
Multi-ALU Processor Chip

Exploits ILP and thread-level parallelism across
clusters
Single cycle mechanisms
communication
synchronization
thread creation
Low-overhead inter-node mechanisms

28
Reduced Communication has Minimum Impact on
Performance
29
Register File Organization

Register files serve two functions
Short term storage for intermediate results
Communication between multiple function units
Which dominates area, delay, and power?
Global register files grow as N^3, where N is the number of ALUs
Need more registers to hold more results (grows with N)
Need more ports to connect all of the units (grows with N^2)
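
The N^3 figure follows from multiplying the two growth terms on this slide. A minimal sketch of that arithmetic is below; the function name and the unit constants are hypothetical, and the reading that the N^2 term is per-register area (each port adding wiring in both cell dimensions) is an assumption about what the slide means.

```python
def central_register_file_area(n_alus, regs_per_alu=1.0, ports_per_alu=1.0):
    """Relative area of one global register file shared by n_alus ALUs."""
    registers = regs_per_alu * n_alus   # storage grows with N
    ports = ports_per_alu * n_alus      # port count grows with N
    cell_area = ports ** 2              # assumed: cell area grows with ports^2
    return registers * cell_area        # total grows with N^3
```

Under this model, doubling the number of ALUs multiplies the area of the central register file by eight.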

30
Register Cells are Mostly Switch
31
SIMD and Distributed Register Files
Scalar
SIMD
Central
DRF
32
Organizations

48 ALUs (32-bit), 500 MHz
Stream organization improves on central organization: area 195x, delay 20x, power 430x

33
Performance
16% Performance Drop (8% with latency constraints)
180x Improvement
34
Hierarchical Register Organization
35
Much Locality is Data Dependent

Applications have data/time-dependent graph
structure
Sparse-matrix solution
non-zero and fill-in structure
Logic simulation
circuit topology and activity
PIC codes
structure changes as particles move
Sort-middle polygon rendering
structure changes as viewpoint moves

36
Fine-Grain Data Migration: Drift and Diffusion

Run-time relocation based on pointer use
move data at both ends of pointer
move control and data
Each relocation cycle
compute drift vector based on pointer use
compute diffusion vector based on density
potential (resource usage)
need to avoid oscillations
Should data be replicated?
not just update vs. invalidate
need to duplicate computation to avoid
communication
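
One relocation cycle as outlined above could look like the sketch below. The slides give no formulas, so everything here is an assumption for illustration: objects sit on integer grid coordinates, the drift vector is a use-weighted average offset toward pointer targets, the diffusion vector is a central-difference estimate pointing away from crowded nodes, and gains below 1 stand in for the oscillation damping the slide calls for. It moves only one end of each pointer, whereas the slide suggests moving both.

```python
def drift_vector(pos, uses):
    """uses: list of (target_pos, use_count). Points toward heavily used targets."""
    total = sum(count for _, count in uses)
    if total == 0:
        return (0.0, 0.0)
    dx = sum((t[0] - pos[0]) * c for t, c in uses) / total
    dy = sum((t[1] - pos[1]) * c for t, c in uses) / total
    return (dx, dy)


def diffusion_vector(pos, density):
    """density: dict (x, y) -> resource usage. Points down the density gradient."""
    x, y = pos

    def d(p):
        return density.get(p, 0.0)

    return ((d((x - 1, y)) - d((x + 1, y))) / 2.0,
            (d((x, y - 1)) - d((x, y + 1))) / 2.0)


def relocate(pos, uses, density, k_drift=0.5, k_diff=0.5):
    """One relocation cycle: damped sum of drift and diffusion, snapped to the grid."""
    dr = drift_vector(pos, uses)
    df = diffusion_vector(pos, density)
    return (round(pos[0] + k_drift * dr[0] + k_diff * df[0]),
            round(pos[1] + k_drift * dr[1] + k_diff * df[1]))
```

An object at (0, 0) whose only pointer target is at (4, 0) moves halfway, to (2, 0), when no density pressure is present.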

37
Results for Logic Simulation
38
Using Applications and Technology Models to
Explore Architectural Design Space
39
Power Efficiency Ranking
40
Most Power Efficient Configuration
41
Area Efficiency Ranking
42
Most Area Efficient Configuration
43
Best Overall Configuration
44
Improving Reliability, Availability, and Scalability (RAS)

In 2007 CMOS technology, transient failures will
regularly occur in logic latches and some
combinatorial logic. Interconnect crosstalk will
also cause significant transient failures. (IBM
Journal of R&D)
Leverage multiple identical components in chip
multiprocessor and speculative memory support to
provide flexible RAS
Turn on RAS for higher reliability, turn off RAS
for higher performance
Pair up processors within a Hydra quad
Processors compare results and retry if they
disagree
Processors do not have to be in lockstep
Compare does not affect single thread performance
Temporary state is kept in speculative store
buffers
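
The pair, compare, and retry scheme in these bullets can be sketched in software terms as follows. This is only an analogy for the hardware mechanism: the function names, the dictionary standing in for committed state, the retry limit, and the single-bit fault model are all hypothetical. Each processor's writes stay in a private buffer (playing the role of the speculative store buffers) until the two results agree.

```python
def run_checked(chunk, state, cpu_a, cpu_b, max_retries=3):
    """Run chunk on two processors; commit only if their results agree.

    cpu_a, cpu_b: callables (chunk, state_copy) -> dict of buffered writes.
    Returns the number of retries that were needed.
    """
    for attempt in range(max_retries + 1):
        writes_a = cpu_a(chunk, dict(state))  # each CPU sees a private copy
        writes_b = cpu_b(chunk, dict(state))
        if writes_a == writes_b:
            state.update(writes_a)            # agree: commit buffered writes
            return attempt
        # disagree: buffered writes are discarded and the chunk is retried
    raise RuntimeError("processors still disagree after retries")


def good_cpu(chunk, state):
    """Fault-free processor: accumulates the chunk into 'acc'."""
    return {"acc": state["acc"] + sum(chunk)}


def make_flaky_cpu(fail_times):
    """Processor that suffers a transient single-bit error fail_times times."""
    remaining = [fail_times]

    def cpu(chunk, state):
        result = good_cpu(chunk, state)
        if remaining[0] > 0:
            remaining[0] -= 1
            result["acc"] ^= 1  # injected transient fault
        return result

    return cpu
```

With one injected fault, the first comparison fails, nothing is committed, and the second attempt commits the correct result. Because the comparison is made on buffered results and not cycle by cycle, the two processors need not run in lockstep.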

45
Hydra Speculation Support

Speculation coprocessor to sequence speculative
threads
Additional L1 D cache bits to track speculative
reads and writes
Write buffers at L2 cache to hold speculative
writes
Write bus has additional bits to support
speculative writes

46
RAS Design
[Figure: Processor 1 through Processor 4, each with Regs and L2 Buffers, sharing an L2 Cache; two Error signals]

Compares happen 100 K insts with 32 line L2 buffer
47
Task 1 Objectives Summary

Interface with other tasks to improve the
performance of available interconnect
Employ global wire connect architectures to
maximize the utilization of global interconnect
Explore new architectures to reduce the
requirement for global interconnect
