VLSI Architecture: Past, Present, and Future

Transcript and Presenter's Notes

1
VLSI Architecture: Past, Present, and Future
  • William J. Dally
    Computer Systems Laboratory, Stanford University
  • March 23, 1999

2
Past, Present, and Future
  • The last 20 years have seen a 1000-fold increase
    in grids per chip and a 20-fold reduction in gate
    delay (annual rates worked out below)
  • We expect this trend to continue for the next 20
    years
  • For the past 20 years, these devices have been
    applied to implicit parallelism
  • We will see a shift toward explicit parallelism
    over the next 20 years
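Worked out as annual rates (a check added here, not on the original
slide), those 20-year factors correspond roughly to

  \[
    1000^{1/20} \approx 1.41 \quad\text{(about 41\% more grids per year)},
    \qquad
    20^{1/20} \approx 1.16 \quad\text{(gate delay falling about 14\% per year)}.
  \]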

3
Technology Evolution
4
Technology Evolution (2)
5
Architecture Evolution
6
Incremental Returns
[Figure: performance vs. processor cost (die area), showing diminishing
returns in moving from a pipelined RISC to dual-issue in-order to
quad-issue out-of-order processors]
7
Efficiency and Granularity
[Figure: peak performance vs. system cost (die area) for configurations
labeled PM, 2PM, and 2P2M]
8
VLSI in 1979
9
VLSI Architecture in 1979
  • 5µm NMOS technology
  • 6mm die size
  • 100,000 grids per chip, 10,000 transistors
  • 8086 microprocessor
  • 0.5MIPS

10
1979-1989 Attack of the Killer Micros
  • 50% per year improvement in performance
  • Transistors applied to implicit parallelism
  • pipelined processors (10 CPI → 1 CPI)
  • shorter clock cycles (67 gates/clock → 30
    gates/clock)
  • in 1989 a 32-bit processor w/ floating point and
    caches fits on one chip
  • e.g., i860: 40 MIPS, 40 MFLOPS
  • 5,000,000 grids, 1M transistors (much of it memory)

11
1989-1999 The Era of Diminishing Returns
  • 50% per year increase in performance through
    1996, but
  • projects delayed, performance below expectations
  • 50% increase in grids, 15% increase in frequency
    (72% total; arithmetic below)
  • Squeezing out the last implicit parallelism
  • 2-way to 6-way issue, out-of-order issue, branch
    prediction
  • 1 CPI → 0.5 CPI, 30 gates/clock → 20
    gates/clock
  • Convert data parallelism to ILP
  • Examples
  • Intel Pentium II (3-way o-o-o)
  • Compaq 21264 (4-way o-o-o)
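The parenthetical 72% is the product of the two annual factors (worked
out here, not on the original slide):

  \[
    1.50 \times 1.15 \approx 1.72,
    \qquad\text{i.e., about a 72\% potential annual gain.}
  \]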

12
1979-1999 Why Implicit Parallelism?
  • Opportunity
  • large gap between micros and fastest processors
  • Compatibility
  • software pool ready to run on implicitly parallel
    machines
  • Technology
  • not available for fine-grain explicitly parallel
    machines

13
1999-2019 Explicit Parallelism Takes Over
  • Opportunity
  • no more processor gap
  • Technology
  • interconnection, interaction, and shared memory
    technologies have been proven

14
Technology for Fine-Grain Parallel Machines
  • A collection of workstations does not make a good
    parallel machine. (BLAGG)
  • Bandwidth - large fraction (0.1) of local memory
    BW
  • LAtency - small multiple (3) of local memory
    latency
  • Global mechanisms - sync, fetch-and-op
  • Granularity - of tasks (100 inst) and memory (8MB)

15
Technology for Parallel Machines: Three Components
  • Networks
  • 2 clocks/hop latency
  • 8GB/s global bandwidth
  • Interaction mechanisms
  • single-cycle communication and synchronization
  • Software

16
k-ary n-cubes
  • Link bandwidth, B, depends on radix, k, for both
    wire- and pin-limited networks.
  • Select the radix to trade off diameter, D,
    against B (model sketched below).

[Figure: latency vs. dimension for 4K nodes, L = 256, Bs = 16K]
Dally, Performance Analysis of k-ary n-cube
Interconnection Networks, IEEE TC, 1990
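A minimal sketch of the trade-off, using a zero-load latency model along
the lines of the cited paper (the normalization here is an assumption,
not the paper's exact expressions): for N = k^n nodes, a message of L
flits on channels W bits wide sees

  \[
    T \;\approx\; t_c\!\left(D + \frac{L}{W}\right),
    \qquad
    D_{\text{avg}} \;=\; n\cdot\frac{k}{4}
    \quad\text{(torus, bidirectional channels, even } k\text{)}.
  \]

With a fixed wire bisection, W grows roughly in proportion to k, so a
higher radix (lower dimension) shrinks the serialization term L/W while
stretching the hop term D; the radix is chosen where the two balance,
which is what the latency-vs-dimension plot above shows.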
17
Delay of Express Channels
18
The Torus Routing Chip
  • k-ary n-cube topology
  • 2D Torus Network
  • 8bit x 20MHz Channels
  • Hardware routing
  • Wormhole routing
  • Virtual channels
  • Fully Self-Timed Design
  • Internal Crossbar Architecture

Dally and Seitz, The Torus Routing Chip,
Distributed Computing, 1986
19
The Reliable Router
  • Fault-tolerant
  • Adaptive routing (adaptation of Duato's
    algorithm)
  • Link-level retry
  • Unique token protocol
  • 32bit x 200MHz channels
  • Simultaneous bidirectional signalling
  • Low latency plesiochronous synchronizers
  • Optimistic routing

Dally, Dennison, Harris, Kan, and Xanthopoulos,
Architecture and Implementation of the Reliable
Router, Hot Interconnects II, 1994.
Dally, Dennison, and Xanthopoulos, Low-Latency
Plesiochronous Data Retiming, ARVLSI 1995.
Dennison, Lee, and Dally, High Performance
Bidirectional Signalling in VLSI Systems, SIS 1993.
20
Equalized 4Gb/s Signaling
21
End-to-End Latency
  • Software sees 10µs latency with a 500ns network
  • Heavy compute load associated with sending a
    message
  • system call
  • buffer allocation
  • synchronization
  • Solution: treat the network like memory, not like
    an I/O device (sketched below)
  • hardware formatting, addressing, and buffer
    allocation

[Figure: message path from the Tx node (registers, send) through the
network to the Rx node (buffer, dispatch)]
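A minimal sketch of "treat the network like memory", assuming a
hypothetical register-/memory-mapped network interface (the MMIO
address, struct layout, and field names below are invented for
illustration, not the hardware described in the talk): a send becomes a
handful of stores, with formatting, addressing, and buffering handled by
the interface rather than by a system call.

    /* Hypothetical memory-mapped network interface. */
    #include <stdint.h>

    typedef struct {
        volatile uint64_t dest;     /* destination node                  */
        volatile uint64_t handler;  /* dispatch address at the receiver  */
        volatile uint64_t data[2];  /* payload words                     */
        volatile uint64_t go;       /* writing here launches the message */
    } net_if_t;

    #define NET_IF ((net_if_t *)0x40000000u)  /* assumed MMIO base address */

    /* Sending is a few ordinary stores, comparable in cost to a memory
     * access; no system call, buffer allocation, or locking. */
    static inline void net_send(uint64_t node, uint64_t handler,
                                uint64_t w0, uint64_t w1)
    {
        NET_IF->dest    = node;
        NET_IF->handler = handler;
        NET_IF->data[0] = w0;
        NET_IF->data[1] = w1;
        NET_IF->go      = 1;
    }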
22
Network Summary
  • We can build networks with 2-4 clocks/hop latency
    (12-24 clocks for a 512-node 3-cube)
  • networks faster than main memory access of modern
    machines
  • need end-to-end hardware support to see this, no
    libraries
  • With high-speed signaling, bandwidth of 4GB/s or
    more per channel (512GB/s bisection) is easy to
    achieve
  • nearly flat memory bandwidth
  • Topology is a matter of matching pin and
    bisection constraints to the packaging technology
  • it's hard to beat a 3-D mesh or torus
  • This gives us B and LA (of BLAGG)
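The parenthetical hop count above follows from the average torus
distance (worked out here, not on the original slide): a 512-node 3-cube
is an 8-ary 3-cube, so

  \[
    D_{\text{avg}} = 3 \times \tfrac{8}{4} = 6\ \text{hops},
    \qquad
    6 \times (2\ \text{to}\ 4)\ \text{clocks/hop} = 12\ \text{to}\ 24\ \text{clocks}.
  \]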

23
The Importance of Mechanisms
24
The Importance of Mechanisms
25
The Importance of Mechanisms
26
Granularity and Cost Effectiveness
  • Parallel Computers Built for
  • Capability - run problems that are too big or
    take too long to solve any other way
  • absolute performance at any cost
  • Capacity - get throughput on lots of small
    problems
  • performance/cost
  • A parallel computer built from workstation-size
    nodes will always have lower perf/cost than a
    workstation
  • sublinear speedup
  • economies of scale
  • A parallel computer with less memory per node can
    have better perf/cost than a workstation
    (illustrated below)

[Figure: processor (P) and memory (M) node configurations, contrasting a
processor with a large memory against processors each paired with a
small memory]
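A hedged numeric illustration of the last two bullets (the cost split
and speedup are invented numbers, not data from the talk): suppose a
processor costs 1 unit, a full workstation memory costs 9 units, and 10
nodes achieve a sublinear speedup of 8 on a problem that still fits in a
reduced, 1-unit memory per node. Then

  \[
    \left.\frac{\text{perf}}{\text{cost}}\right|_{\text{workstation}}
      = \frac{1}{1+9} = 0.10,
    \qquad
    \left.\frac{\text{perf}}{\text{cost}}\right|_{\text{10 nodes, full memory}}
      = \frac{8}{10\,(1+9)} = 0.08,
    \qquad
    \left.\frac{\text{perf}}{\text{cost}}\right|_{\text{10 nodes, small memory}}
      = \frac{8}{10\,(1+1)} = 0.40 .
  \]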
27
MIT J-Machine (1991)
28
Exploiting fine-grain threads
  • Where will the parallelism come from to keep all
    of these processors busy?
  • ILP - limited to about 5
  • Outer-loop parallelism
  • e.g., domain decomposition
  • requires big problems to get lots of parallelism
  • Fine threads
  • make communication and synchronization very fast
    (1 cycle)
  • break the problem into smaller pieces
  • more parallelism

29
Mechanism and Granularity Summary
  • Fast communication and synchronization mechanisms
    enable fine-grain task decomposition
  • simplifies programming
  • exposes parallelism
  • facilitates load balance
  • Have demonstrated
  • 1-cycle communication and synchronization locally
  • 10-cycle communication, synchronization, and task
    dispatch across a network
  • Physically fine-grain machines have better
    performance/cost than sequential machines

30
A 2009 Multicomputer
31
Challenges for the Explicitly Parallel Era
  • Compatibility
  • Managing locality
  • Parallel software

32
Compatibility
  • Almost no fine-grain parallel software exists
  • Writing parallel software is easy
  • with good mechanisms
  • Parallelizing sequential software is hard
  • needs to be designed from the ground up
  • An incremental migration path
  • run sequential codes with acceptable performance
  • parallelize selected applications for
    considerable speedup

33
Performance Depends on Locality
  • Applications have data/time-dependent graph
    structure
  • Sparse-matrix solution
  • non-zero and fill-in structure
  • Logic simulation
  • circuit topology and activity
  • PIC (particle-in-cell) codes
  • structure changes as particles move
  • Sort-middle polygon rendering
  • structure changes as viewpoint moves

34
Fine-Grain Data Migration: Drift and Diffusion
  • Run-time relocation based on pointer use
  • move data at both ends of pointer
  • move control and data
  • Each relocation cycle
  • compute drift vector based on pointer use
  • compute diffusion vector based on density
    potential (Taylor)
  • need to avoid oscillations
  • Should data be replicated?
  • not just update vs. invalidate
  • need to duplicate computation to avoid
    communication
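A minimal sketch of the relocation cycle above, assuming a 2-D grid of
nodes, invented weights k_drift and k_diff, and a damping factor to
limit oscillation (the data layout and the density-gradient stub are
illustrative, not the algorithm from the talk):

    #include <stddef.h>

    typedef struct { double x, y; } vec2;    /* position on a 2-D node grid */

    typedef struct object {
        vec2 pos;                 /* current placement                  */
        struct object **refs;     /* objects reached via its pointers   */
        size_t nrefs;
    } object;

    /* Density gradient at a position, supplied by the runtime; a stub
     * here so the sketch stays self-contained. */
    static vec2 density_gradient(vec2 p) { (void)p; vec2 g = {0.0, 0.0}; return g; }

    /* One relocation step: drift toward pointer targets, diffuse down
     * the density gradient, damp the result to avoid oscillation. */
    static vec2 relocation_step(const object *o, double k_drift,
                                double k_diff, double damping)
    {
        vec2 drift = {0.0, 0.0};
        for (size_t i = 0; i < o->nrefs; i++) {
            drift.x += o->refs[i]->pos.x - o->pos.x;
            drift.y += o->refs[i]->pos.y - o->pos.y;
        }
        if (o->nrefs > 0) {
            drift.x /= (double)o->nrefs;
            drift.y /= (double)o->nrefs;
        }

        vec2 g = density_gradient(o->pos);
        vec2 step = {
            damping * (k_drift * drift.x - k_diff * g.x),
            damping * (k_drift * drift.y - k_diff * g.y)
        };
        return step;   /* caller adds this to o->pos and rounds to a node */
    }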

35
Migration and Locality
36
Parallel Software: Focus on the Real Problems
  • Almost all demanding problems have ample
    parallelism
  • Need to focus on fundamental problems
  • extracting parallelism
  • load balance
  • locality
  • load balance and locality can be covered by
    excess parallelism
  • Avoid incidental issues
  • aggregating tasks to avoid overhead
  • manually managing data movement and replication
  • oversynchronization

37
Parallel Software: Design Strategy
  • A program must be designed for parallelism from
    the ground up
  • no bottlenecks in the data structures
  • e.g., arrays instead of linked lists
  • Data parallelism
  • many for loops (over data, not time) can be forall
  • break dependencies out of the loop (see the sketch
    after this slide)
  • synchronize on natural units (no barriers)
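A minimal sketch of the last three bullets, in C (the example loop is
added here, not from the talk): the element-wise work becomes a forall
with no cross-iteration dependence, and the accumulated sum is pulled
out into a separate reduction that can itself be parallelized as a tree.

    #include <stddef.h>

    /* Sequential form: the running sum serializes the iterations. */
    double scale_and_sum_seq(double *a, const double *b, size_t n, double s)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            a[i] = s * b[i];   /* independent work        */
            sum += a[i];       /* loop-carried dependency */
        }
        return sum;
    }

    /* Restructured: the element-wise loop is a forall (each iteration
     * touches only its own index); the dependency is isolated in a
     * separate reduction. */
    double scale_and_sum_par(double *a, const double *b, size_t n, double s)
    {
        for (size_t i = 0; i < n; i++)   /* forall: no cross-iteration deps */
            a[i] = s * b[i];

        double sum = 0.0;                /* reduction, separable from the forall */
        for (size_t i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }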

38
Conclusion: We are on the threshold of the
explicitly parallel era
  • As in 1979, we expect a 1000-fold increase in
    grids per chip in the next 20 years
  • Unlike 1979, these grids are best applied to
    explicitly parallel machines
  • Diminishing returns from sequential processors
    (ILP) - no alternative to explicit parallelism
  • Enabling technologies have been proven
  • interconnection networks, mechanisms, cache
    coherence
  • Fine-grain machines are more efficient than
    sequential machines
  • Fine-grain machines will be constructed from
    multi-processor/DRAM chips
  • Incremental migration to parallel software