Title: VLSI Architecture: Past, Present, and Future
VLSI Architecture: Past, Present, and Future
- William J. Dally
- Computer Systems Laboratory, Stanford University
- March 23, 1999
Past, Present, and Future
- The last 20 years have seen a 1000-fold increase in grids per chip and a 20-fold reduction in gate delay
- We expect this trend to continue for the next 20 years
- For the past 20 years, these devices have been applied to implicit parallelism
- We will see a shift toward explicit parallelism over the next 20 years
Technology Evolution
Technology Evolution (2)
Architecture Evolution
Incremental Returns
[Figure: Performance vs. Processor Cost (Die Area) - pipelined RISC, dual-issue in-order, and quad-issue out-of-order designs show diminishing incremental returns]
Efficiency and Granularity
[Figure: Peak Performance vs. System Cost (Die Area) for PM, 2PM, and 2P2M configurations]
VLSI in 1979
VLSI Architecture in 1979
- 5µm NMOS technology
- 6mm die size
- 100,000 grids per chip, 10,000 transistors
- 8086 microprocessor
- 0.5 MIPS
1979-1989: Attack of the Killer Micros
- 50% per year improvement in performance
- Transistors applied to implicit parallelism
- pipeline the processor (10 CPI -> 1 CPI)
- shorten the clock cycle (67 gates/clock -> 30 gates/clock)
- in 1989, a 32-bit processor with floating point and caches fits on one chip
- e.g., i860: 40 MIPS, 40 MFLOPS
- 5,000,000 grids, 1M transistors (mostly memory)
1989-1999: The Era of Diminishing Returns
- 50% per year increase in performance through 1996, but
- projects delayed, performance below expectations
- 50% increase in grids, 15% increase in frequency (72% total)
- Squeezing out the last implicit parallelism
- 2-way to 6-way issue, out-of-order issue, branch prediction
- 1 CPI -> 0.5 CPI, 30 gates/clock -> 20 gates/clock
- convert data parallelism to ILP
- Examples
- Intel Pentium II (3-way out-of-order)
- Compaq Alpha 21264 (4-way out-of-order)
1979-1999: Why Implicit Parallelism?
- Opportunity
- large gap between micros and the fastest processors
- Compatibility
- software pool ready to run on implicitly parallel machines
- Technology
- not available for fine-grain explicitly parallel machines
1999-2019: Explicit Parallelism Takes Over
- Opportunity
- no more processor gap
- Technology
- interconnection, interaction, and shared-memory technologies have been proven
Technology for Fine-Grain Parallel Machines
- A collection of workstations does not make a good parallel machine; a good one needs BLAGG (sketched as a checker after this list):
- Bandwidth - a large fraction (0.1) of local memory bandwidth
- LAtency - a small multiple (3x) of local memory latency
- Global mechanisms - sync, fetch-and-op
- Granularity - of tasks (100 instructions) and memory (8MB)
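The four criteria can be restated as a simple checker. This is a hedged sketch: the function and parameter names are mine, and only the thresholds come from the list above.

    def passes_blagg(bw_fraction, latency_multiple, has_global_ops,
                     min_task_insts, node_mem_mb):
        return (bw_fraction >= 0.1           # Bandwidth: >= 0.1 of local memory BW
                and latency_multiple <= 3.0  # LAtency: <= 3x local memory latency
                and has_global_ops           # Global mechanisms: sync, fetch-and-op
                and min_task_insts <= 100    # Granularity: ~100-instruction tasks...
                and node_mem_mb <= 8)        # ...on nodes of ~8MB memory

    # A loosely coupled workstation cluster fails the first two outright:
    print(passes_blagg(0.01, 100, False, 50_000, 256))   # False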
Technology for Parallel Machines: Three Components
- Networks
- 2 clocks/hop latency
- 8GB/s global bandwidth
- Interaction mechanisms
- single-cycle communication and synchronization
- Software
k-ary n-cubes
- Link bandwidth, B, depends on the radix, k, for both wire- and pin-limited networks.
- Select the radix to trade off diameter, D, against B (see the sketch below).
[Figure: Latency vs. Dimension, 4K nodes, L = 256, Bs = 16K]
Dally, "Performance Analysis of k-ary n-cube Interconnection Networks," IEEE Transactions on Computers, 1990
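To make the trade-off concrete, here is a minimal sketch of a fixed-bisection latency model in the spirit of the cited analysis. It is a sketch under assumptions: the hop-count and bisection formulas are standard torus approximations rather than the paper's exact expressions, and the defaults echo the plot's 4K nodes and L = 256.

    def torus_latency(n, nodes=4096, msg_bits=256, bisection_wires=16384):
        k = round(nodes ** (1.0 / n))       # radix k such that k**n ~= nodes
        hops = n * k / 4.0                  # average distance, bidirectional torus
        channels = 2 * nodes / k            # channels crossing the bisection
        width = bisection_wires / channels  # wires per channel at fixed bisection
        return hops + msg_bits / width      # routing delay + serialization delay

    # Low dimension keeps channels wide; high dimension cuts hop count but
    # starves each channel of wires. Latency bottoms out at low n (n = 3
    # here), consistent with the later summary's 3-D mesh/torus claim.
    for n in (1, 2, 3, 6, 12):
        print(n, round(torus_latency(n), 1))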
Delay of Express Channels
The Torus Routing Chip
- k-ary n-cube topology
- 2D Torus Network
- 8-bit x 20MHz channels
- Hardware routing
- Wormhole routing
- Virtual channels
- Fully Self-Timed Design
- Internal Crossbar Architecture
Dally and Seitz, "The Torus Routing Chip," Distributed Computing, 1986
The Reliable Router
- Fault-tolerant
- Adaptive routing (an adaptation of Duato's algorithm)
- Link-level retry
- Unique token protocol
- 32-bit x 200MHz channels
- Simultaneous bidirectional signalling
- Low-latency plesiochronous synchronizers
- Optimistic routing
Dally, Dennison, Harris, Kan, and Xanthopoulos, "Architecture and Implementation of the Reliable Router," Hot Interconnects II, 1994
Dally, Dennison, and Xanthopoulos, "Low-Latency Plesiochronous Data Retiming," ARVLSI 1995
Dennison, Lee, and Dally, "High Performance Bidirectional Signalling in VLSI Systems," SIS 1993
Equalized 4Gb/s Signaling
End-to-End Latency
- Software sees ~10µs latency on a 500ns network
- Heavy compute load is associated with sending a message
- system call
- buffer allocation
- synchronization
- Solution: treat the network like memory, not like an I/O device
- hardware formatting, addressing, and buffer allocation
[Diagram: message path - registers and send on the Tx node, across the network, to buffer and dispatch on the Rx node]
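As a back-of-the-envelope illustration of why the solution targets software overhead: the network number below is the slide's, while the per-step software costs are assumptions chosen only to sum to roughly the observed latency.

    network_ns = 500                  # hardware network latency (from the slide)
    software_ns = {                   # per-message software costs (assumed)
        "system call": 3_000,
        "buffer allocation": 3_000,
        "synchronization": 3_500,
    }

    total_ns = network_ns + sum(software_ns.values())
    print(f"end-to-end: {total_ns / 1000:.1f} us")   # ~10 us, software-dominated

    # Treating the network like memory moves formatting, addressing, and
    # buffer allocation into hardware, leaving only a few register writes
    # on the software side of each send.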
Network Summary
- We can build networks with 2-4 clocks/hop latency (12-24 clocks across a 512-node 3-cube)
- networks faster than the main memory access of modern machines
- need end-to-end hardware support to see this, not libraries
- With high-speed signaling, bandwidth of 4GB/s or more per channel (512GB/s bisection) is easy to achieve
- nearly flat memory bandwidth
- Topology is a matter of matching pin and bisection constraints to the packaging technology
- it's hard to beat a 3-D mesh or torus
- This gives us the B and LA (of BLAGG)
The Importance of Mechanisms
The Importance of Mechanisms (2)
The Importance of Mechanisms (3)
Granularity and Cost-Effectiveness
- Parallel computers are built for
- Capability - run problems that are too big or take too long to solve any other way
- absolute performance at any cost
- Capacity - get throughput on lots of small problems
- performance/cost
- A parallel computer built from workstation-size nodes will always have lower perf/cost than a workstation
- sublinear speedup
- economies of scale
- A parallel computer with less memory per node can have better perf/cost than a workstation (see the sketch below)
[Diagram: node configurations pairing processors (P) with memories (M)]
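The comparison in the figure can be captured in a toy perf/cost model. The area ratio and the sublinear speedup below are assumptions chosen only to reproduce the argument, not measurements.

    P_AREA, M_AREA = 1.0, 4.0   # processor vs. workstation-size memory area

    def perf_per_cost(n_proc, n_mem, speedup):
        return speedup / (n_proc * P_AREA + n_mem * M_AREA)

    pm   = perf_per_cost(1, 1, speedup=1.0)  # workstation: one P, one M
    p2m2 = perf_per_cost(2, 2, speedup=1.8)  # doubled node, sublinear speedup
    p2m  = perf_per_cost(2, 1, speedup=1.8)  # same compute, less memory per P

    # Sublinear speedup makes 2P2M worse than the workstation, but cutting
    # memory per processor (2PM) comes out ahead on perf/cost:
    print(f"PM={pm:.2f}  2P2M={p2m2:.2f}  2PM={p2m:.2f}")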
MIT J-Machine (1991)
Exploiting Fine-Grain Threads
- Where will the parallelism come from to keep all of these processors busy?
- ILP - limited to about 5
- Outer-loop parallelism
- e.g., domain decomposition
- requires big problems to get lots of parallelism
- Fine threads
- make communication and synchronization very fast (1 cycle)
- break the problem into smaller pieces
- more parallelism
Mechanism and Granularity Summary
- Fast communication and synchronization mechanisms enable fine-grain task decomposition
- simplifies programming
- exposes parallelism
- facilitates load balance
- Have demonstrated
- 1-cycle communication and synchronization locally
- 10-cycle communication, synchronization, and task dispatch across a network
- Physically fine-grain machines have better performance/cost than sequential machines
A 2009 Multicomputer
Challenges for the Explicitly Parallel Era
- Compatibility
- Managing locality
- Parallel software
Compatibility
- Almost no fine-grain parallel software exists
- Writing parallel software is easy
- with good mechanisms
- Parallelizing sequential software is hard
- needs to be designed from the ground up
- An incremental migration path
- run sequential codes with acceptable performance
- parallelize selected applications for considerable speedup
Performance Depends on Locality
- Applications have data- and time-dependent graph structure
- Sparse-matrix solution
- non-zero and fill-in structure
- Logic simulation
- circuit topology and activity
- PIC codes
- structure changes as particles move
- Sort-middle polygon rendering
- structure changes as viewpoint moves
Fine-Grain Data Migration: Drift and Diffusion
- Run-time relocation based on pointer use
- move data at both ends of a pointer
- move control and data
- Each relocation cycle (see the sketch after this list)
- compute a drift vector based on pointer use
- compute a diffusion vector based on density potential (Taylor)
- need to avoid oscillations
- Should data be replicated?
- not just update vs. invalidate
- need to duplicate computation to avoid communication
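Since the slides name only the two vector components, here is a toy sketch of one relocation cycle on a 1-D line of nodes. The data structures, weights, and damping are all assumptions, not the actual scheme.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        load: float = 0.0        # occupancy, input to the density potential

    @dataclass
    class Obj:
        node: int                                  # current home node
        users: list = field(default_factory=list)  # nodes that dereferenced
                                                   # pointers to this object

    def relocation_cycle(obj, nodes, alpha=0.5, beta=0.5):
        # Drift: pull the object toward the nodes using pointers to it.
        drift = (sum(obj.users) / len(obj.users) - obj.node) if obj.users else 0.0
        # Diffusion: push down the local gradient of node occupancy.
        lo = nodes[max(obj.node - 1, 0)].load
        hi = nodes[min(obj.node + 1, len(nodes) - 1)].load
        diffusion = (lo - hi) / 2.0
        # Damped combination (alpha, beta < 1) to avoid oscillation.
        new = obj.node + alpha * drift + beta * diffusion
        obj.node = int(min(max(new, 0), len(nodes) - 1))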
Migration and Locality
Parallel Software: Focus on the Real Problems
- Almost all demanding problems have ample parallelism
- Need to focus on the fundamental problems
- extracting parallelism
- load balance
- locality
- load balance and locality can be covered by excess parallelism
- aggregating tasks to avoid overhead
- manually managing data movement and replication
- oversynchronization
Parallel Software: Design Strategy
- A program must be designed for parallelism from the ground up
- no bottlenecks in the data structures
- e.g., arrays instead of linked lists
- Data parallelism (see the sketch after this list)
- many for loops (over data, not time) can be forall
- break dependencies out of the loop
- synchronize on natural units (no barriers)
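A minimal sketch of the forall rule above, using Python's standard concurrent.futures as a stand-in for a fine-grain parallel runtime; the loop body and data are illustrative.

    from concurrent.futures import ThreadPoolExecutor

    def body(x):
        return 2 * x + 1      # independent per-element work, no carried dependence

    data = list(range(1_000))

    # The sequential loop runs over data, not time, so it can become a forall:
    with ThreadPoolExecutor() as pool:
        out = list(pool.map(body, data))

    # The accumulation that would have serialized the loop is broken out
    # and done once afterward, with no per-iteration barrier:
    print(sum(out))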
Conclusion: We Are on the Threshold of the Explicitly Parallel Era
- As in 1979, we expect a 1000-fold increase in grids per chip over the next 20 years
- Unlike 1979, these grids are best applied to explicitly parallel machines
- Diminishing returns from sequential processors (ILP) leave no alternative to explicit parallelism
- Enabling technologies have been proven
- interconnection networks, mechanisms, cache coherence
- Fine-grain machines are more efficient than sequential machines
- Fine-grain machines will be constructed from multi-processor/DRAM chips
- Incremental migration to parallel software