Title: VLSI Architecture: Past, Present, and Future
VLSI Architecture: Past, Present, and Future
- William J. Dally
- Computer Systems Laboratory, Stanford University
- March 23, 1999
Past, Present, and Future
- The last 20 years have seen a 1000-fold increase in grids per chip and a 20-fold reduction in gate delay
- We expect this trend to continue for the next 20 years
- For the past 20 years, these devices have been applied to implicit parallelism
- We will see a shift toward explicit parallelism over the next 20 years
Technology Evolution
Technology Evolution (2)
Architecture Evolution
Incremental Returns
[Figure: Performance vs. Processor Cost (Die Area) - pipelined RISC, dual-issue in-order, and quad-issue out-of-order designs show diminishing incremental returns]
Efficiency and Granularity
[Figure: Peak Performance vs. System Cost (Die Area) for PM, 2PM, and 2P2M configurations]
VLSI in 1979
VLSI Architecture in 1979
- 5µm NMOS technology
- 6mm die size
- 100,000 grids per chip, 10,000 transistors
- 8086 microprocessor
- 0.5 MIPS
1979-1989: Attack of the Killer Micros
- 50% per year improvement in performance
- Transistors applied to implicit parallelism
- pipeline the processor (10 CPI -> 1 CPI)
- shorten the clock cycle (67 gates/clock -> 30 gates/clock)
- in 1989, a 32-bit processor with floating point and caches fits on one chip
- e.g., i860: 40 MIPS, 40 MFLOPS
- 5,000,000 grids, 1M transistors (mostly memory)
1989-1999: The Era of Diminishing Returns
- 50% per year increase in performance through 1996, but
- projects delayed, performance below expectations
- 50% increase in grids, 15% increase in frequency (72% total)
- Squeezing out the last implicit parallelism
- 2-way to 6-way issue, out-of-order issue, branch prediction
- 1 CPI -> 0.5 CPI, 30 gates/clock -> 20 gates/clock
- convert data parallelism to ILP
- Examples
- Intel Pentium II (3-way out-of-order)
- Compaq Alpha 21264 (4-way out-of-order)
1979-1999: Why Implicit Parallelism?
- Opportunity
- large gap between micros and the fastest processors
- Compatibility
- software pool ready to run on implicitly parallel machines
- Technology
- not available for fine-grain explicitly parallel machines
1999-2019: Explicit Parallelism Takes Over
- Opportunity
- no more processor gap
- Technology
- interconnection, interaction, and shared-memory technologies have been proven
Technology for Fine-Grain Parallel Machines
- A collection of workstations does not make a good parallel machine; a good one needs BLAGG (sketched as a checker after this list):
- Bandwidth - a large fraction (0.1) of local memory bandwidth
- LAtency - a small multiple (3x) of local memory latency
- Global mechanisms - sync, fetch-and-op
- Granularity - of tasks (100 instructions) and memory (8MB)
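The four criteria can be restated as a simple checker. This is a hedged sketch: the function and parameter names are mine, and only the thresholds come from the list above.

    def passes_blagg(bw_fraction, latency_multiple, has_global_ops,
                     min_task_insts, node_mem_mb):
        return (bw_fraction >= 0.1           # Bandwidth: >= 0.1 of local memory BW
                and latency_multiple <= 3.0  # LAtency: <= 3x local memory latency
                and has_global_ops           # Global mechanisms: sync, fetch-and-op
                and min_task_insts <= 100    # Granularity: ~100-instruction tasks...
                and node_mem_mb <= 8)        # ...on nodes of ~8MB memory

    # A loosely coupled workstation cluster fails the first two outright:
    print(passes_blagg(0.01, 100, False, 50_000, 256))   # False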
Technology for Parallel Machines: Three Components
- Networks
- 2 clocks/hop latency
- 8GB/s global bandwidth
- Interaction mechanisms
- single-cycle communication and synchronization
- Software
k-ary n-cubes
- Link bandwidth, B, depends on the radix, k, for both wire- and pin-limited networks.
- Select the radix to trade off diameter, D, against B (see the sketch below).
[Figure: Latency vs. Dimension, 4K nodes, L = 256, Bs = 16K]
Dally, "Performance Analysis of k-ary n-cube Interconnection Networks," IEEE Transactions on Computers, 1990
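To make the trade-off concrete, here is a minimal sketch of a fixed-bisection latency model in the spirit of the cited analysis. It is a sketch under assumptions: the hop-count and bisection formulas are standard torus approximations rather than the paper's exact expressions, and the defaults echo the plot's 4K nodes and L = 256.

    def torus_latency(n, nodes=4096, msg_bits=256, bisection_wires=16384):
        k = round(nodes ** (1.0 / n))       # radix k such that k**n ~= nodes
        hops = n * k / 4.0                  # average distance, bidirectional torus
        channels = 2 * nodes / k            # channels crossing the bisection
        width = bisection_wires / channels  # wires per channel at fixed bisection
        return hops + msg_bits / width      # routing delay + serialization delay

    # Low dimension keeps channels wide; high dimension cuts hop count but
    # starves each channel of wires. Latency bottoms out at low n (n = 3
    # here), consistent with the later summary's 3-D mesh/torus claim.
    for n in (1, 2, 3, 6, 12):
        print(n, round(torus_latency(n), 1))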
Delay of Express Channels
The Torus Routing Chip
- k-ary n-cube topology
- 2D Torus Network
- 8-bit x 20MHz channels
- Hardware routing
- Wormhole routing
- Virtual channels
- Fully Self-Timed Design
- Internal Crossbar Architecture
Dally and Seitz, "The Torus Routing Chip," Distributed Computing, 1986
The Reliable Router
- Fault-tolerant
- Adaptive routing (an adaptation of Duato's algorithm)
- Link-level retry
- Unique token protocol
- 32-bit x 200MHz channels
- Simultaneous bidirectional signalling
- Low-latency plesiochronous synchronizers
- Optimistic routing
Dally, Dennison, Harris, Kan, and Xanthopoulos, "Architecture and Implementation of the Reliable Router," Hot Interconnects II, 1994
Dally, Dennison, and Xanthopoulos, "Low-Latency Plesiochronous Data Retiming," ARVLSI 1995
Dennison, Lee, and Dally, "High Performance Bidirectional Signalling in VLSI Systems," SIS 1993
Equalized 4Gb/s Signaling
End-to-End Latency
- Software sees ~10µs latency on a 500ns network
- Heavy compute load is associated with sending a message
- system call
- buffer allocation
- synchronization
- Solution: treat the network like memory, not like an I/O device
- hardware formatting, addressing, and buffer allocation
[Diagram: message path - registers and send on the Tx node, across the network, to buffer and dispatch on the Rx node]
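As a back-of-the-envelope illustration of why the solution targets software overhead: the network number below is the slide's, while the per-step software costs are assumptions chosen only to sum to roughly the observed latency.

    network_ns = 500                  # hardware network latency (from the slide)
    software_ns = {                   # per-message software costs (assumed)
        "system call": 3_000,
        "buffer allocation": 3_000,
        "synchronization": 3_500,
    }

    total_ns = network_ns + sum(software_ns.values())
    print(f"end-to-end: {total_ns / 1000:.1f} us")   # ~10 us, software-dominated

    # Treating the network like memory moves formatting, addressing, and
    # buffer allocation into hardware, leaving only a few register writes
    # on the software side of each send.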
Network Summary
- We can build networks with 2-4 clocks/hop latency (12-24 clocks across a 512-node 3-cube)
- networks faster than the main memory access of modern machines
- need end-to-end hardware support to see this, not libraries
- With high-speed signaling, bandwidth of 4GB/s or more per channel (512GB/s bisection) is easy to achieve
- nearly flat memory bandwidth
- Topology is a matter of matching pin and bisection constraints to the packaging technology
- it's hard to beat a 3-D mesh or torus
- This gives us the B and LA (of BLAGG)
The Importance of Mechanisms
The Importance of Mechanisms (2)
The Importance of Mechanisms (3)
Granularity and Cost-Effectiveness
- Parallel computers are built for
- Capability - run problems that are too big or take too long to solve any other way
- absolute performance at any cost
- Capacity - get throughput on lots of small problems
- performance/cost
- A parallel computer built from workstation-size nodes will always have lower perf/cost than a workstation
- sublinear speedup
- economies of scale
- A parallel computer with less memory per node can have better perf/cost than a workstation (see the sketch below)
[Diagram: node configurations pairing processors (P) with memories (M)]
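The comparison in the figure can be captured in a toy perf/cost model. The area ratio and the sublinear speedup below are assumptions chosen only to reproduce the argument, not measurements.

    P_AREA, M_AREA = 1.0, 4.0   # processor vs. workstation-size memory area

    def perf_per_cost(n_proc, n_mem, speedup):
        return speedup / (n_proc * P_AREA + n_mem * M_AREA)

    pm   = perf_per_cost(1, 1, speedup=1.0)  # workstation: one P, one M
    p2m2 = perf_per_cost(2, 2, speedup=1.8)  # doubled node, sublinear speedup
    p2m  = perf_per_cost(2, 1, speedup=1.8)  # same compute, less memory per P

    # Sublinear speedup makes 2P2M worse than the workstation, but cutting
    # memory per processor (2PM) comes out ahead on perf/cost:
    print(f"PM={pm:.2f}  2P2M={p2m2:.2f}  2PM={p2m:.2f}")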
MIT J-Machine (1991)
Exploiting Fine-Grain Threads
- Where will the parallelism come from to keep all of these processors busy?
- ILP - limited to about 5
- Outer-loop parallelism
- e.g., domain decomposition
- requires big problems to get lots of parallelism
- Fine threads
- make communication and synchronization very fast (1 cycle)
- break the problem into smaller pieces
- more parallelism
Mechanism and Granularity Summary
- Fast communication and synchronization mechanisms enable fine-grain task decomposition
- simplifies programming
- exposes parallelism
- facilitates load balance
- Have demonstrated
- 1-cycle communication and synchronization locally
- 10-cycle communication, synchronization, and task dispatch across a network
- Physically fine-grain machines have better performance/cost than sequential machines
A 2009 Multicomputer
Challenges for the Explicitly Parallel Era
- Compatibility
- Managing locality
- Parallel software
Compatibility
- Almost no fine-grain parallel software exists
- Writing parallel software is easy
- with good mechanisms
- Parallelizing sequential software is hard
- needs to be designed from the ground up
- An incremental migration path
- run sequential codes with acceptable performance
- parallelize selected applications for considerable speedup
Performance Depends on Locality
- Applications have data- and time-dependent graph structure
- Sparse-matrix solution
- non-zero and fill-in structure
- Logic simulation
- circuit topology and activity
- PIC codes
- structure changes as particles move
- Sort-middle polygon rendering
- structure changes as viewpoint moves
Fine-Grain Data Migration: Drift and Diffusion
- Run-time relocation based on pointer use
- move data at both ends of a pointer
- move control and data
- Each relocation cycle (see the sketch after this list)
- compute a drift vector based on pointer use
- compute a diffusion vector based on density potential (Taylor)
- need to avoid oscillations
- Should data be replicated?
- not just update vs. invalidate
- need to duplicate computation to avoid communication
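Since the slides name only the two vector components, here is a toy sketch of one relocation cycle on a 1-D line of nodes. The data structures, weights, and damping are all assumptions, not the actual scheme.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        load: float = 0.0        # occupancy, input to the density potential

    @dataclass
    class Obj:
        node: int                                  # current home node
        users: list = field(default_factory=list)  # nodes that dereferenced
                                                   # pointers to this object

    def relocation_cycle(obj, nodes, alpha=0.5, beta=0.5):
        # Drift: pull the object toward the nodes using pointers to it.
        drift = (sum(obj.users) / len(obj.users) - obj.node) if obj.users else 0.0
        # Diffusion: push down the local gradient of node occupancy.
        lo = nodes[max(obj.node - 1, 0)].load
        hi = nodes[min(obj.node + 1, len(nodes) - 1)].load
        diffusion = (lo - hi) / 2.0
        # Damped combination (alpha, beta < 1) to avoid oscillation.
        new = obj.node + alpha * drift + beta * diffusion
        obj.node = int(min(max(new, 0), len(nodes) - 1))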
Migration and Locality
Parallel Software: Focus on the Real Problems
- Almost all demanding problems have ample parallelism
- Need to focus on the fundamental problems
- extracting parallelism
- load balance
- locality
- load balance and locality can be covered by excess parallelism
- aggregating tasks to avoid overhead
- manually managing data movement and replication
- oversynchronization
Parallel Software: Design Strategy
- A program must be designed for parallelism from the ground up
- no bottlenecks in the data structures
- e.g., arrays instead of linked lists
- Data parallelism (see the sketch after this list)
- many for loops (over data, not time) can be forall
- break dependencies out of the loop
- synchronize on natural units (no barriers)
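A minimal sketch of the forall rule above, using Python's standard concurrent.futures as a stand-in for a fine-grain parallel runtime; the loop body and data are illustrative.

    from concurrent.futures import ThreadPoolExecutor

    def body(x):
        return 2 * x + 1      # independent per-element work, no carried dependence

    data = list(range(1_000))

    # The sequential loop runs over data, not time, so it can become a forall:
    with ThreadPoolExecutor() as pool:
        out = list(pool.map(body, data))

    # The accumulation that would have serialized the loop is broken out
    # and done once afterward, with no per-iteration barrier:
    print(sum(out))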
Conclusion: We Are on the Threshold of the Explicitly Parallel Era
- As in 1979, we expect a 1000-fold increase in grids per chip over the next 20 years
- Unlike 1979, these grids are best applied to explicitly parallel machines
- Diminishing returns from sequential processors (ILP) leave no alternative to explicit parallelism
- Enabling technologies have been proven
- interconnection networks, mechanisms, cache coherence
- Fine-grain machines are more efficient than sequential machines
- Fine-grain machines will be constructed from multi-processor/DRAM chips
- Incremental migration to parallel software