Title: How to Build a Petaflops Computer
1. How to Build a Petaflops Computer
Keynote address to the 3rd Workshop on The Petaflops Frontier
- Thomas Sterling
- California Institute of Technology
- NASA Jet Propulsion Laboratory
- February 22, 1999
3. Comparison to Present Technology
5. The High Cs of Crossing to Petaflops Computing
- Capability
- Computation rate
- Capacity of storage
- Communication bandwidth
- Cost
- Component count
- Connection complexity
- Consumption of power
- Concurrency
- Cycles of latency
- Customers and Ciller-applications
- Confidence
6. POWR Workshop Overview
- Petaflops initiative context
- Objectives
- Charter Guidelines
- 3 Pflops system classes
- COTS clusters
- MPP system architecture
- Hybrid-technology custom architecture
- Specific group results
- Summary findings
- Open issues
- Recommendations
- Conclusions
7. MPP Petaflops System
- COTS chips and industry standard interfaces
- Custom glue-logic ASICs and SAN
- New systems architecture
- Distributed shared memory and cache-based latency management (see the sketch after this list)
- Algorithm/application methodologies
- Specialized compile-time and runtime software
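The latency-management bullet above leans on a directory-style cache coherence protocol (see the MPP summary slide below). A minimal Python sketch of the idea; the Directory class and its method names are illustrative, not from the talk:

```python
# Minimal sketch of directory-based coherence, the kind of mechanism the MPP
# design relies on for latency management over a distributed shared memory.

from collections import defaultdict

class Directory:
    """Home-node directory: tracks which nodes hold a copy of each line."""
    def __init__(self):
        self.sharers = defaultdict(set)   # line address -> nodes with a copy
        self.owner = {}                   # line address -> exclusive owner

    def read(self, node, addr):
        # Read miss: an exclusive owner is downgraded to a sharer, then the
        # requester joins the sharer set and can cache the line.
        if addr in self.owner:
            self.sharers[addr].add(self.owner.pop(addr))
        self.sharers[addr].add(node)
        return self.sharers[addr]

    def write(self, node, addr):
        # Write miss: every other copy is invalidated before the requester is
        # granted exclusive ownership. The returned set stands in for the
        # invalidation messages a real protocol would send across the SAN.
        victims = self.sharers[addr] - {node}
        if self.owner.get(addr, node) != node:
            victims.add(self.owner.pop(addr))
        self.sharers[addr] = {node}
        self.owner[addr] = node
        return victims

d = Directory()
d.read(node=0, addr=0x100)
d.read(node=1, addr=0x100)
print(d.write(node=2, addr=0x100))   # -> {0, 1}: the copies to invalidate
```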
8. MPP Breakout Group
Rudolf Eigenmann, Jose Fortes, David Frye, Kent Koeninger, Vipin Kumar, John May, Paul Messina, Merrell Patrick, Paul Smith, Rick Stevens, Valerie Taylor, Josep Torrellas, Paul Woodward
9. Summary of MPP
- processor: 3 GHz, 10 Gflops
- processors: 100,000
- memory: 32 Tbytes DRAM, 40 ns local access time
- interconnect: frame switched, 128 Gbps/channel
- secondary storage: 1 Pbyte, 1 ms access time
- distributed shared memory
- latency management: cache coherence protocol
10. COTS Clustered Petaflops System
- NO specialized hardware
- Leverages mass-market economy of scale
- Distributed memory model with message passing (a minimal sketch follows this list)
- Incorporates desktop/server mainstream component systems
- Integrated by means of COTS networking technology
- Augmented by new application algorithm methodologies and system software
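The programming model implied by this slide is plain distributed memory with explicit message passing. A minimal sketch of that model, using mpi4py purely for illustration (the slide does not prescribe MPI or any particular library):

```python
# Minimal sketch of the cluster programming model the COTS design assumes:
# distributed memory, explicit message passing, all resource management in
# user code.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each node owns a private slice of the data -- there is no shared memory.
local = np.full(1_000_000, float(rank))

# Nearest-neighbor exchange: every byte that crosses node boundaries is moved
# by an explicit, user-managed message.
right = (rank + 1) % size
left = (rank - 1) % size
recv_buf = np.empty_like(local)
comm.Sendrecv(local, dest=right, recvbuf=recv_buf, source=left)

# Global reduction, again explicit.
total = comm.allreduce(local.sum(), op=MPI.SUM)
if rank == 0:
    print(f"{size} ranks, global sum = {total:.0f}")
```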
11. COTS Cluster Breakout Group
David H. Bailey, James Bieda, Remy Evard, Robert Clay, Al Geist, Carl Kesselman, David E. Keyes, Andrew Lumsdaine, James R. McGraw, Piyush Mehrotra, Daniel Savarese, Bob Voigt, Michael S. Warren
12. Summary of COTS Cluster
- processor: 3 GHz, 10 Gflops
- processors: 100,000
- memory: 32 Tbytes DRAM, 40 ns access time
- interconnect: degree-12 n-cube, 20 Gbps/channel
- secondary storage: 1 Pbyte, 1 ms access time
- distributed memory, 3-level cache, 1-level DRAM
- latency management: software (see the overlap sketch after this list)
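Because latency management here is left to software, the burden of hiding network latency falls on the application, typically by overlapping communication with computation. A hedged sketch of that pattern with non-blocking operations (mpi4py again used only as an illustration):

```python
# Sketch of what "latency management in software" means on a COTS cluster:
# the application overlaps communication with computation by hand, here with
# non-blocking MPI calls.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

halo_out = np.full(4096, float(rank))
halo_in = np.empty_like(halo_out)
interior = np.random.rand(1_000_000)

# 1. Start the halo exchange, but do not wait for it.
reqs = [comm.Isend(halo_out, dest=(rank + 1) % size),
        comm.Irecv(halo_in, source=(rank - 1) % size)]

# 2. Compute on data that does not depend on the incoming halo,
#    hiding the network latency behind useful work.
interior_result = np.sqrt(interior).sum()

# 3. Only now block until the halo has arrived, then use it.
MPI.Request.Waitall(reqs)
print(rank, interior_result + halo_in.sum())
```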
13. Hybrid Technology Petaflops System
- New device technologies
- New component designs
- New subsystem architecture
- New system architecture
- New latency management paradigm and mechanisms
- New algorithms/applications
- New compile time and runtime software
14. HTMT Breakout Group
Larry Bergman, Nikos Chrisochoides, Vincent Freeh, Guang R. Gao, Peter Kogge, Phil Merkey, John Van Rosendale, John Salmon, Burton Smith, Thomas Sterling
15. Summary of HTMT
- processor: 150 GHz, 600 Gflops
- processors: 2,048
- memory: 16 Tbytes PIM-DRAM, 80 ns access time
- interconnect: Data Vortex, 500 Gbps/channel, > 10 Pbps bisection bandwidth
- 3/2 storage: 1 Pbyte, 10 µs access time
- shared memory, 4-level hierarchy
- latency management: multithreaded with percolation
16. Summary Findings
- Architecture is important
- Bandwidth requirements dominate hardware structures
- Latency management determines the runtime resource management strategy
- Efficient mechanisms for overhead services
- Generality of application workload is dependent on interconnect throughput and response time
- COTS processors will not hide system latency, even if multithreading is adopted
- More memory than earlier thought may be needed
- MPP problem is very difficult; unclear which direction to take
17. Summary Findings (cont.)
- COTS clusters will provide a safe migration path at the best price-performance, but must rely on user management of all system resources
- Inter-process load balancing too expensive on clusters
- New formalism required to expose diverse modes of parallelism
- Compilers can't ever make all performance decisions; they must be combined with collaborative runtime software
- Critical-path performance decision tree requires new internal protocols
- User must describe application properties, not means
18. Open Issues
- Is a network of processor/memories the best use of multi-billion-transistor chips?
- Is convergence real, or only a point of inflexion?
- Will semiconductors continue to push beyond 0.15 micron? Do market costs support it?
- Can alternative technology fabrication be supported/avoided?
- Can orders-of-magnitude latency be managed?
- What will the computer languages of the Pflops era look like?
- Processor granularity: fine and many, or fat and few?
19. Return to a Single Node (but highly parallel)
- Emergence of a new class of high-end computer
- Return to a single world API image
- Eliminate (virtualize) processors from the name space
- Unburden application programs from direct resource management
- Latency management an intrinsic architecture responsibility (with compiler assist)
- Enable adaptive system operation at hyper speeds
- Leap-frog the conventional price-performance-power curves for a wide market
21. HTMT Objectives
- Scalable architecture with high sustained performance in the presence of disparate cycle times and latencies
- Exploit diverse device technologies to achieve a substantially superior operating point
- Execution model to simplify parallel system programming and expand generality and applicability
22. Hybrid Technology MultiThreaded Architecture
23. Summary of HTMT
- processor: 150 GHz, 600 Gflops
- processors: 2,048
- memory: 16 Tbytes PIM-DRAM, 80 ns access time
- interconnect: Data Vortex, 500 Gbps/channel, > 10 Pbps bisection bandwidth
- 3/2 storage: 1 Pbyte, 10 µs access time
- shared memory, 4-level hierarchy
- latency management: multithreaded with percolation
25. Storage Capacity by Subsystem (2007 Design Point)
27. HTMT Strategy
- High performance
- Superconductor RSFQ logic
- Data Vortex optical interconnect network
- PIM smart memory
- Low power
- Superconductor RSFQ logic
- Optical holographic storage
- PIM smart memory
28. HTMT Strategy (cont.)
- Low cost
- reduce wire count through chip-to-chip fiber
- reduce processor count through x100 clock speed
- reduce memory chips through the 3/2 holographic memory layer
- Efficiency
- processor-level multithreading
- smart-memory-managed second-stage context-pushing multithreading
- fine grain regular and irregular data parallelism exploited in memory
- high memory bandwidth and low latency ops through PIM
- memory-to-memory interactions without processor intervention
- hardware mechanisms for synchronization, scheduling, data/context migration, gather/scatter
29. HTMT Strategy (cont.)
- Programmability
- global shared name space
- hierarchical parallel thread flow control model
- no explicit processor naming
- automatic latency management
- automatic processor load balancing
- runtime fine grain multithreading
- automatic context pushing for process migration (percolation)
- configuration transparent, runtime scalable
30. HTMT Organization
- NSA: G. Cotter, Doc Bedard, W. Carlson (IDA)
- NASA: E. Tu, W. Johnston
- DARPA: J. Muñoz
- PI: T. Sterling; Project Manager: L. Bergman
- Steering Committee: P. Messina
- Project AA: D. Crawford
- Project Secretary: A. Smythe
- System Engineer: S. Monacos
- Tech Publishing: M. MacDonald
- Princeton Co-I: K. Bergman, C. Reed (IDA)
- University of Delaware Co-I: G. Gao
- CACR Co-I: P. Messina
- Notre Dame Co-I: P. Kogge
- SUNY Co-I: K. Likharev
- Tera Co-I: B. Smith
- Caltech Co-I: D. Psaltis
- Argonne Co-I: R. Stevens
- UCSB Co-I: M. Rodwell, M. Melliar-Smith
- JPL Co-I: D. Curkendall, H. Siegel
- JPL Co-I: T. Cwik
- TI Co-I: G. Armstrong
- TRW Co-I: A. Silver
- HYPRES Co-I: E. Track
- RPI Co-I: J. McDonald
- Univ. of Rochester Co-I: M. Feldman
31. Areas of Accomplishments
- Concepts and Structures
- approach strategy
- device technologies
- subsystem design
- efficiency, productivity, generality
- System Architecture
- size, cost, complexity, power
- System Software
- resource management
- multiprocessor emulator
- Applications
- multithreaded codes
- scaling models
- Evaluation
- feasibility
- cost
- performance
- Future Directions
- Phase 3 prototype
- Phase 4 petaflops system
- Proposals
32. RSFQ Roadmap (VLSI Circuit Clock Frequency)
34. Advantages
- x100 clock speeds achievable
- x100 power efficiency advantage
- Easier fabrication
- Leverage semiconductor fabrication tools
- First technology to encounter ultra-high speed operation
35. Superconductor Processor
- 100 GHz clock, 33 GHz inter-chip
- 0.8 micron Niobium on Silicon
- 100K gates per chip
- 0.05 watts per processor
- 100 kW per petaflops (see the rough check below)
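A rough, back-of-the-envelope check of how the per-processor and per-petaflops power figures could fit together, assuming (the slide does not say this) that the 100 kW figure is wall-plug power including a cryocooler overhead on the order of 1000x for 4 K operation:

```python
# Rough consistency check (assumption: the 100 kW figure is wall-plug power,
# including ~1000x cryocooler overhead for 4 K operation; neither factor is
# stated on the slide).

procs_per_pflops = 1e15 / 600e9        # ~1,667 processors at 600 Gflops each
power_at_4k = procs_per_pflops * 0.05  # watts dissipated in the cold stage
cryo_overhead = 1000                   # assumed wall-plug watts per cold watt

print(f"{procs_per_pflops:.0f} processors, "
      f"{power_at_4k:.0f} W at 4 K, "
      f"~{power_at_4k * cryo_overhead / 1e3:.0f} kW at the wall")
# -> roughly 1,667 processors, ~83 W at 4 K, ~83 kW at the wall --
#    the same order as the 100 kW quoted above.
```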
36. Accomplishments - Processor
- SPELL Architecture
- Detailed circuit design for critical paths
- CRAM Memory design initiated
- 1st network design and analysis/simulation
- 750 GHz logic demonstrated
- Detailed sizing, cost, and power analysis
- Estimate for fabrication facilities investment
- Barriers and path to 0.4-0.25 micron regime
- Sizing for Phase 3 50 Gflops processor
37. Data Vortex Optical Interconnect
39. Data Vortex Latency Distribution (network height 1024)
40. Single-mode rib waveguides on silicon-on-insulator wafers; hybrid sources and detectors; mix of CMOS-like and micromachining-type processes for fabrication
e.g. R. A. Soref, J. Schmidtchen, K. Petermann, IEEE J. Quantum Electron. 27, p. 1971 (1991); A. Rickman, G. T. Reed, B. L. Weiss, F. Namavar, IEEE Photonics Technol. Lett. 4, p. 633 (1992); B. Jalali, P. D. Trinh, S. Yegnanarayanan, F. Coppinger, IEE Proc. Optoelectron. 143, p. 307 (1996)
41. Data Vortex Parameters for Petaflops in 2007
- Bisection sustained bandwidth: 4,000 Tbps (see the consistency check after this list)
- Per-port data rate: 640 Gbps
- Single wavelength channel rate: 10 Gbps
- Level of WDM: 64 colors
- Number of input ports: 6,250
- Angle nodes: 7
- Network node height: 4,096
- Number of nodes per cylinder: 28,672
- Number of cylinders: 13
- Total node number: 372,736
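The parameters above are internally consistent; a quick arithmetic check (the relationships used are my reading of the slide, not an authoritative derivation):

```python
# Quick consistency check of the Data Vortex parameters quoted above.

colors = 64                  # level of WDM
channel_rate_gbps = 10       # single-wavelength channel rate
ports = 6250                 # number of input ports
height = 4096                # network node height
angle_nodes = 7
cylinders = 13

port_rate_gbps = colors * channel_rate_gbps           # 640 Gbps per port
bisection_tbps = ports * port_rate_gbps / 1000        # 4,000 Tbps sustained
nodes_per_cylinder = height * angle_nodes             # 28,672
total_nodes = nodes_per_cylinder * cylinders          # 372,736

print(port_rate_gbps, bisection_tbps, nodes_per_cylinder, total_nodes)
# -> 640 4000.0 28672 372736, matching the slide's figures.
```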
42. Accomplishments - Data Vortex
- Implemented and tested optical device technology
- Prototyped electro-optical butterfly switch
- Design study of electro-optic integrated switch
- Implemented and tested most of end-to-end path
- Design of topology to size
- Simulation of network behavior under load
- Modified structure for ease of packaging
- Size, complexity, power studies
- Initial interface design
43. PIM Provides Smart Memory
- Merge logic and memory
- Integrate multiple logic/memory stacks on a single chip
- Exposes high intrinsic memory bandwidth
- Reduction of memory access latency
- Low overhead for memory-oriented operations
- Manages data structure manipulation, context coordination, and percolation
44. Multithreaded PIM DRAM
- Multithreaded control of PIM functions
- multiple operation sequences with low context switching overhead
- maximize memory utilization and efficiency
- maximize processor and I/O utilization
- multiple banks of row buffers to hold data, instructions, and addresses
- data parallel basic operations at the row buffer
- manages shared resources such as the FP unit
- Direct PIM-to-PIM interaction (see the parcel sketch after this list)
- memory communicates with memory, within and across chip boundaries, by parcels, without external control processor intervention
- exposes fine grain parallelism intrinsic to vector and irregular data structures
- e.g. pointer chasing, block moves, synchronization, data balancing
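A toy model of the parcel-driven, PIM-to-PIM interaction described above: a small work descriptor travels to the memory that owns the next link of a data structure, instead of the data being dragged back to a central processor. All class and field names are illustrative:

```python
# Toy model of parcel-driven, PIM-to-PIM pointer chasing.

from dataclasses import dataclass

@dataclass
class Parcel:
    """A small message carrying an operation and its running state."""
    addr: int          # global address of the node to visit next
    hops: int = 0      # how many links have been chased so far

class PIMChip:
    """One smart-memory chip owning a slice of the global address space."""
    def __init__(self, cells):
        self.cells = cells                 # addr -> next addr (or None)

    def owns(self, addr):
        return addr in self.cells

    def handle(self, parcel):
        """Chase links locally; hand the parcel on when the chain leaves the chip."""
        while parcel.addr is not None and self.owns(parcel.addr):
            parcel.addr = self.cells[parcel.addr]
            parcel.hops += 1
        return parcel

# A linked list 0 -> 1 -> ... -> 7 -> None, striped across two chips.
chain = {i: (i + 1 if i < 7 else None) for i in range(8)}
chips = [PIMChip({a: n for a, n in chain.items() if a % 2 == 0}),
         PIMChip({a: n for a, n in chain.items() if a % 2 == 1})]

p = Parcel(addr=0)
while p.addr is not None:                          # parcel hops chip to chip
    owner = next(c for c in chips if c.owns(p.addr))
    p = owner.handle(p)
print("links traversed:", p.hops)                  # -> 8
```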
45. Accomplishments - PIM DRAM
- Establish operational opportunity and requirements
- Win $12.2M DARPA contract for DIVA
- USC ISI prime
- Caltech, Notre Dame, U. of Delaware
- Deliver 8 Mbyte part in FY01 at 0.25 micron
- Architecture concept design complete
- parcel message-driven computation
- multithreaded resource management
- Analysis of size, power, bandwidth
- DIVA to be used directly in Phase 3 testbed
46. Holographic 3/2 Memory
Performance Scaling
- Advantages
- petabyte memory
- competitive cost
- 10 µsec access time
- low power
- efficient interface to DRAM
- Disadvantages
- recording rate is slower than the readout rate for LiNbO3
- recording must be done in GB chunks
- long-term trend favors DRAM unless new materials and lasers are used
47. Accomplishments - HoloStore
- Detailed study of two optical storage technologies
- photorefractive
- spectral hole burning
- Operational photorefractive read/write storage
- Access approaches explored for the 10 µsec regime
- pixel array
- wavelength multiplexing
- Packaging studies
- power, size, cost analysis
48. Multilevel Multithreaded Execution Model
- Extends the latency hiding of multithreading
- Hierarchy of logical threads
- Delineates threads and thread ensembles
- Action sequences, state, and precedence constraints
- Fine grain single-cycle thread switching (see the sketch after this list)
- Processor level; hides pipeline and time-of-flight latency
- Coarse grain context "percolation"
- Memory level; in-memory synchronization
- Ready contexts move toward processors, pending contexts toward big memory
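A toy cycle-level model of the fine-grain thread switching described above, showing how issuing from another ready thread on every miss hides long memory latency. The latency, miss rate, and thread counts are made-up illustrative numbers:

```python
# Toy model of fine-grain multithreading: when the running thread issues a
# long-latency memory reference, the processor switches to another ready
# thread in a single cycle instead of stalling.

import random
random.seed(0)

MEM_LATENCY = 100          # cycles for a remote/memory reference
MISS_RATE = 0.02           # fraction of instructions that miss
WORK = 5000                # instructions each thread must retire

def run(num_threads):
    remaining = [WORK] * num_threads       # instructions left per thread
    stalled_until = [0] * num_threads      # cycle at which each thread wakes
    cycle, busy = 0, 0
    while any(r > 0 for r in remaining):
        ready = [t for t in range(num_threads)
                 if remaining[t] > 0 and stalled_until[t] <= cycle]
        if ready:                          # issue one instruction this cycle
            t = ready[0]
            remaining[t] -= 1
            busy += 1
            if random.random() < MISS_RATE:
                stalled_until[t] = cycle + MEM_LATENCY
        cycle += 1
    return busy / cycle                    # fraction of cycles doing work

for n in (1, 4, 16):
    print(f"{n:2d} threads: utilization {run(n):.0%}")
# With one thread the pipeline idles through every miss; with enough threads
# the misses are almost entirely hidden.
```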
49. HTMT Thread Activation State Diagram (percolation of threads)
50. Percolation of Active Tasks
- Multiple-stage latency management methodology (see the staging sketch after this list)
- Augmented multithreaded resource scheduling
- Hierarchy of task contexts
- Coarse-grain contexts coordinate in PIM memory
- Ready contexts migrate to SRAM under PIM control, releasing threads for scheduling
- Threads pushed into SRAM/CRAM frame buffers
- Strands loaded into register banks on a space-available basis
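A minimal sketch of the percolation staging described above: contexts gather operands in PIM memory and move up toward the fast processors only when they are ready to run. The stage names follow the slides; the code itself is purely illustrative:

```python
# Minimal sketch of percolation: task contexts are staged upward through the
# memory hierarchy (DRAM-PIM -> SRAM-PIM -> CRAM frames) as their data becomes
# ready, so the fast processors only ever see work that can run immediately.

from collections import deque

class Context:
    def __init__(self, name, missing_operands):
        self.name = name
        self.missing = missing_operands    # unsatisfied dependences

dram, sram, cram = deque(), deque(), deque()
dram.extend(Context(f"task{i}", missing_operands=i % 3) for i in range(6))

def percolate_step():
    # PIM logic in DRAM gathers operands; ready contexts move up to SRAM.
    for _ in range(len(dram)):
        c = dram.popleft()
        c.missing = max(0, c.missing - 1)          # an operand arrives
        (sram if c.missing == 0 else dram).append(c)
    # Ready contexts in SRAM are pushed into CRAM frame buffers.
    while sram:
        cram.append(sram.popleft())
    # The processor drains CRAM: every context it sees can run at full speed.
    while cram:
        print("executing", cram.popleft().name)

for step in range(3):
    print("-- percolation step", step)
    percolate_step()
```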
51. HTMT Percolation Model (diagram)
Figure components: cryogenic area; DMA to CRAM; split-phase synchronization to SRAM; start/done; C-Buffer; I-Queue; A-Queue; T-Queue; D-Queue; parcel invocation/termination; parcel assembly/disassembly; parcel dispatcher/dispenser; re-use; run-time system; SRAM-PIM; DMA to DRAM-PIM.
52. HTMT Machine Side View (diagram)
Figure labels: 4 K stage (50 W), 77 K stage, fiber/wire interconnects; overall dimensions on the order of 0.3 to 3 m.
53. Top-Down View of HTMT Machine (2007 Design Point)
54. HTMT Facility Side View (diagram)
Figure labels: nitrogen and helium cryostats (4 K stage at 50 W, 77 K stage); fiber/wire interconnects; tape silo array (400 silos); hard disk array (40 cabinets); front-end computer server; console; cable tray assembly; 220 V feeds and generators; WDM source; 980 nm pumps (20 cabinets); optical amplifiers.
55. HTMT Facility (Top View)
56. Floor Area
57. Power Dissipation by Subsystem (Petaflops Design Point)
58. Subsystem Interfaces (2007 Design Point)
- Same colors indicate a connection between subsystems
- Horizontal lines group interfaces within a subsystem
59. Accomplishments - Systems
- System architecture completed
- Physical structure design
- Parts count, power, and interconnect complexity analysis
- Infrastructure requirements and impact
- Feasibility assessment
60. Distributed Isomorphic Simulator
- Executable specification (see the sketch after this list)
- subsystem functional/operational description
- inter-subsystem interface protocol definition
- Distributed low-cost cluster of processors
- Cluster partitioned and allocated to separate subsystems
- Subsystem development groups own cluster partitions and develop functional specifications
- Subsystem partitions interact by agreed-upon interface protocols
- Runtime percolation and thread scheduling system software put on top of the emulation software