Title: HTMT-class Latency Tolerant Parallel Architecture for Petaflops-scale Computation
1. HTMT-class Latency Tolerant Parallel Architecture for Petaflops-scale Computation
- Dr. Thomas Sterling
- California Institute of Technology and NASA Jet Propulsion Laboratory
- October 1, 1999
2. HTMT-class Latency Tolerant Parallel Architecture for Petaflops-scale Computation
A Presentation at IBM Almaden
- Dr. Thomas Sterling
- California Institute of Technology and NASA Jet Propulsion Laboratory
- September 10, 1999
5. 2nd Conference on Enabling Technologies for Peta(fl)ops Computing
6. Rational Drug Design
Nanotechnology
Tomographic Reconstruction
Phylogenetic Trees
Biomolecular Dynamics
Neural Networks
Crystallography
Fracture Mechanics
MRI Imaging
Reservoir Modelling
Molecular Modelling
Biosphere/Geosphere
Diffraction Inversion Problems
Distribution Networks
Chemical Dynamics
Atomic Scattering
Electrical Grids
Flow in Porous Media
Pipeline Flows
Data Assimilation
Signal Processing
Condensed Matter Electronic Structure
Plasma Processing
Chemical Reactors
Cloud Physics
Electronic Structure
Boilers
Combustion
Actinide Chemistry
Radiation
Fourier Methods
Graph Theoretic
CVD
Quantum Chemistry
Reaction-Diffusion
Chemical Reactors
Cosmology
Transport
n-body
Astrophysics
Multiphase Flow
Manufacturing Systems
CFD
Basic Algorithms & Numerical Methods
Discrete Events
PDE
Weather and Climate
Air Traffic Control
Military Logistics
Structural Mechanics
Seismic Processing
Population Genetics
Monte Carlo
ODE
Multibody Dynamics
Geophysical Fluids
VLSI Design
Transportation Systems
Aerodynamics
Raster Graphics
Economics
Fields
Orbital Mechanics
Nuclear Structure
Ecosystems
QCD
Pattern Matching
Symbolic Processing
Neutron Transport
Economics Models
Genome Processing
Virtual Reality
Astrophysics
Cryptography
Electromagnetics
Computer Vision
Virtual Prototypes
Intelligent Search
Multimedia Collaboration Tools
Computer Algebra
Databases
Magnet Design
Computational Steering
Scientific Visualization
Data Mining
Automated Deduction
Number Theory
CAD
Intelligent Agents
9. A 10 Gflops Beowulf
- Center for Advanced Computing Research, California Institute of Technology
- 172 Intel Pentium Pro microprocessors
10. Emergence of Beowulf Clusters
11. 1st printing May 1999, 2nd printing Aug. 1999, MIT Press
13. Beowulf Scalability
14. Integrated SMP - WDM
(Diagram) DRAM: 4 GBytes, highly interleaved; multi-lambda AON crossbar with coherence, 640 GBytes/sec; 2nd level cache: 96 MBytes, 64 bytes wide, 160 GBytes/sec; VLIW/RISC core: 24 GFlops at 6 GHz
15. COTS PetaFlop System
(Diagram) 128 die/box, 4 CPU/die; multi-die multi-processor boxes (numbered 1-64) connected through an all-optical switch; shared I/O; 10 meters of fiber gives about 50 ns of delay
16. COTS PetaFlops System
- 8192 dies (4 CPU/die minimum)
- Each die is 120 GFlops
- 1 PetaFlops peak
- Power: 8192 x 200 Watts ~ 1.6 MegaWatts
- Extra main memory: >3 MegaWatts (512 TBytes)
- 15.36 TFlops/rack (128 die)
- 30 KWatts/rack, thus 64 racks (30 inch)
- Common system I/O
- 2-level main memory
- Optical interconnect
  - OC-768 channels (40 GHz)
  - 128 channels per die (DWDM) = 5.12 THz
  - all-optical switching
- Bisection bandwidth of 50 TBytes/sec (15 TFlops/rack x 0.1 bytes/flop/sec x 32 racks)
- Rack bandwidth: 15 TFlops x 0.1 bytes/flop/sec ~ 12 Tbits/sec (12 THz)
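A quick back-of-envelope check of the numbers above (an illustrative sketch in Python; the 0.1 bytes/flop/sec ratio is the one used in the bisection-bandwidth line):

    # Arithmetic check of the COTS PetaFlops figures quoted on this slide.
    dies = 8192
    gflops_per_die = 120
    peak_pflops = dies * gflops_per_die / 1e6        # ~0.98 PFlops peak
    processor_power_mw = dies * 200 / 1e6            # ~1.64 MW for the dies alone
    racks = dies // 128                              # 64 racks at 128 dies per rack
    tflops_per_rack = 128 * gflops_per_die / 1e3     # 15.36 TFlops per rack
    bisection_tbytes_s = tflops_per_rack * 0.1 * 32  # ~49 TBytes/sec across 32 racks
    print(peak_pflops, processor_power_mw, racks, tflops_per_rack, bisection_tbytes_s)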
17. The SIA CMOS Roadmap
18. Requirements for High End Systems
- Bulk capabilities
  - performance
  - storage capacities
  - throughput/bandwidth
  - cost, power, complexity
- Efficiency
  - overhead
  - latency
  - contention
  - starvation/parallelism
- Usability
  - generality
  - programmability
  - reliability
19. Points of Inflection in the History of Computing
- Heroic Era (1950)
  - technology: vacuum tubes, mercury delay lines, pulse transformers
  - architecture: accumulator based
  - model: von Neumann, sequential instruction execution
  - examples: Whirlwind, EDSAC
- Mainframe (1960)
  - technology: transistors, core memory, disk drives
  - architecture: register bank based
  - model: virtual memory
  - examples: IBM 7090, PDP-1
20. Points of Inflection in the History of Computing
- Supercomputers (1980)
  - technology: ECL, semiconductor integration, RAM
  - architecture: pipelined
  - model: vector
  - example: Cray-1
- Massively Parallel Processing (1990)
  - technology: VLSI, microprocessor
  - architecture: MIMD
  - model: Communicating Sequential Processes, message passing
  - examples: TMC CM-5, Intel Paragon
- ? (2000)
22. HTMT Objectives
- Scalable architecture with high sustained performance in the presence of disparate cycle times and latencies
- Exploit diverse device technologies to achieve a substantially superior operating point
- Execution model to simplify parallel system programming and expand generality and applicability
23. Hybrid Technology MultiThreaded Architecture
25. Storage Capacity by Subsystem, 2007 Design Point
27. HTMT Strategy
- High performance
  - Superconductor RSFQ logic
  - Data Vortex optical interconnect network
  - PIM smart memory
- Low power
  - Superconductor RSFQ logic
  - Optical holographic storage
  - PIM smart memory
28. HTMT Strategy (cont.)
- Low cost
  - reduce wire count through chip-to-chip fiber
  - reduce processor count through x100 clock speed
  - reduce memory chips by 3/2 holographic memory layer
- Efficiency
  - processor level multithreading
  - smart memory managed second stage context pushing multithreading
  - fine grain regular & irregular data parallelism exploited in memory
  - high memory bandwidth and low latency ops through PIM
  - memory to memory interactions without processor intervention
  - hardware mechanisms for synchronization, scheduling, data/context migration, gather/scatter (a sketch of in-memory synchronization follows below)
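As an illustration of the in-memory synchronization mentioned above, here is a minimal sketch of full/empty-bit style synchronization in plain Python; the HTMT PIMs would provide this in hardware, and the class and method names here are illustrative, not part of the design:

    import threading

    class SyncCell:
        """A memory word with a full/empty tag, sketching in-memory synchronization."""
        def __init__(self):
            self._cv = threading.Condition()
            self._full = False
            self._value = None

        def write_full(self, value):
            # Producer side: wait until the cell is empty, then fill it and mark it full.
            with self._cv:
                while self._full:
                    self._cv.wait()
                self._value, self._full = value, True
                self._cv.notify_all()

        def read_empty(self):
            # Consumer side: wait until the cell is full, then empty it and return the value.
            with self._cv:
                while not self._full:
                    self._cv.wait()
                self._full = False
                self._cv.notify_all()
                return self._value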
29. HTMT Strategy (cont.)
- Programmability
  - global shared name space
  - hierarchical parallel thread flow control model
  - no explicit processor naming
  - automatic latency management
  - automatic processor load balancing
  - runtime fine grain multithreading
  - automatic context pushing for process migration (percolation)
  - configuration transparent, runtime scalable
30. RSFQ Roadmap (VLSI Circuit Clock Frequency)
31. RSFQ Building Block
33. Advantages
- X100 clock speeds achievable
- X100 power efficiency advantage
- Easier fabrication
- Leverage semiconductor fabrication tools
- First technology to encounter ultra-high speed operation
34. Superconductor Processor
- 100 GHz clock, 33 GHz inter-chip
- 0.8 micron Niobium on Silicon
- 100K gates per chip
- 0.05 watts per processor
- 100 KWatts per Petaflops
38. Data Vortex Optical Interconnect
40. Data Vortex Latency Distribution (network height 1024)
41. Single-mode rib waveguides on silicon-on-insulator wafers; hybrid sources and detectors; mix of CMOS-like and micromachining-type processes for fabrication
e.g. R. A. Soref, J. Schmidtchen & K. Petermann, IEEE J. Quantum Electron. 27, p. 1971 (1991); A. Rickman, G. T. Reed, B. L. Weiss & F. Namavar, IEEE Photonics Technol. Lett. 4, p. 633 (1992); B. Jalali, P. D. Trinh, S. Yegnanarayanan & F. Coppinger, IEE Proc. Optoelectron. 143, p. 307 (1996)
42. PIM Provides Smart Memory
- Merge logic and memory
- Integrate multiple logic/memory stacks on a single chip
- Exposes high intrinsic memory bandwidth
- Reduction of memory access latency
- Low overhead for memory oriented operations
- Manages data structure manipulation, context coordination, and percolation
43. Multithreaded PIM DRAM
- Multithreaded control of PIM functions
  - multiple operation sequences with low context switching overhead
  - maximize memory utilization and efficiency
  - maximize processor and I/O utilization
  - multiple banks of row buffers to hold data, instructions, and addresses
  - data parallel basic operations at the row buffer
  - manages shared resources such as the FP unit
- Direct PIM-to-PIM interaction (see the parcel sketch below)
  - memory communicates with memory within and across chip boundaries, without external control processor intervention, by parcels
  - exposes fine grain parallelism intrinsic to vector and irregular data structures
  - e.g. pointer chasing, block moves, synchronization, data balancing
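To make the parcel idea concrete, here is a minimal software sketch of a parcel-driven pointer chase: each PIM node owns a slice of linked cells, and the parcel (address plus partial result) is forwarded to whichever node owns the next link, with no external processor involved. The classes, the address-to-node mapping, and the scheduling loop are illustrative assumptions, not the HTMT parcel format:

    from collections import deque

    class PimNode:
        """One PIM memory node owning a slice of linked-list cells keyed by address."""
        def __init__(self, cells):
            self.cells = cells      # addr -> (value, next_addr or None)
            self.inbox = deque()    # parcels waiting to be processed at this node

    def owner(addr, nodes):
        return addr % len(nodes)    # illustrative address-to-node mapping

    def pointer_chase(nodes, start_addr):
        """Chase a linked structure by forwarding a parcel from PIM node to PIM node."""
        nodes[owner(start_addr, nodes)].inbox.append((start_addr, []))
        while True:
            active = [n for n in nodes if nodes[n].inbox]   # stand-in for parallel hardware
            if not active:
                return None
            for n in active:
                addr, acc = nodes[n].inbox.popleft()
                value, nxt = nodes[n].cells[addr]
                acc.append(value)
                if nxt is None:
                    return acc                               # reached the end of the chain
                # Forward the parcel (address + partial result), not the data structure.
                nodes[owner(nxt, nodes)].inbox.append((nxt, acc))

For example, with nodes = {0: PimNode({0: ('a', 3)}), 1: PimNode({3: ('b', None)})}, pointer_chase(nodes, 0) returns ['a', 'b'] after hopping from node 0 to node 1.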
44. Silicon Budget for HTMT DRAM PIM
- Designed to provide the proper balance of memory support for fiber bandwidth
- Different Vortex configurations => different numbers
- In 2004: 16 TB in 4096 groups of 64 chips
- Each chip, by area: memory, logic, SuperScalar core, fiber WDM optical receiver interface, HRAM Vortex output
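As a check on the 2004 numbers above: 4096 groups x 64 chips is 262,144 PIM DRAM chips, and if the 16 TB is spread evenly across them, each chip carries 64 MB (512 Mbit) of memory plus its logic:

    chips = 4096 * 64                           # 262,144 PIM DRAM chips
    mb_per_chip = 16 * 2**40 / chips / 2**20    # 16 TB spread evenly -> 64.0 MB per chip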
45. Holographic 3/2 Memory
Performance Scaling
- Advantages
  - petabyte memory
  - competitive cost
  - 10 µsec access time
  - low power
  - efficient interface to DRAM
- Disadvantages
  - recording rate is slower than the readout rate for LiNbO3
  - recording must be done in GB chunks
  - long term trend favors DRAM unless new materials and lasers are used
46. Side view (diagram): 4 K stage (50 W) inside 77 K stage; fiber/wire interconnects; dimensions 0.3 m to 3 m
47. Side view (diagram): helium (4 K, 50 W) and nitrogen (77 K) cooling; fiber/wire interconnects; tape silo array (400 silos); hard disk array (40 cabinets); front end computer server; console; cable tray assembly; 220 Volt feeds and generators; WDM source; 980 nm pumps (20 cabinets); optical amplifiers; dimensions 0.5 m to 3 m
48. HTMT Facility (Top View)
49. Floor Area
50. Power Dissipation by Subsystem, Petaflops Design Point
51. Subsystem Interfaces, 2007 Design Point
- Same colors indicate a connection between subsystems
- Horizontal lines group interfaces within a subsystem
53. Getting Efficiency
- Contention
  - hardware for bandwidth, logic throughput, hardware arbitration
- Latency (see the estimate below)
  - multithreaded processor with hardware context switching
  - percolation for proactive prestaging of executables
  - PIM-DRAM & PIM-SRAM provide smart data oriented mechanisms
- Overhead
  - hardware context switching
  - in-PIM smart synchronization and context management
  - proactive percolation performed in PIM
- Starvation
  - dynamic load balancing
  - high speed processor for reduced parallelism
  - expose/exploit fine grain parallelism
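A rough way to see the latency problem these mechanisms address (an illustrative estimate; the cycle counts below are assumptions, not HTMT figures): a processor stays busy only if it has roughly latency/work extra contexts to switch into while each thread waits.

    # Threads needed to hide a memory latency by context switching, assuming each
    # thread runs `work_cycles` before waiting `latency_cycles` for its next operand.
    def threads_to_hide(latency_cycles, work_cycles):
        return 1 + -(-latency_cycles // work_cycles)   # the running thread + ceil(L/W)

    # Example: a 100 GHz core waiting ~10 microseconds on far memory (1,000,000 cycles)
    # with 1,000 cycles of useful work per thread needs on the order of 1,000 contexts,
    # which is why HTMT percolates ready contexts close to the processors rather than
    # relying on processor-level thread switching alone.
    print(threads_to_hide(1_000_000, 1_000))    # -> 1001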
54. Multilevel Multithreaded Execution Model
- Extend latency hiding of multithreading
- Hierarchy of logical threads
- Delineates threads and thread ensembles
- Action sequences, state, and precedence constraints
- Fine grain single cycle thread switching
  - processor level; hides pipeline and time-of-flight latency
- Coarse grain context "percolation"
  - memory level; in-memory synchronization
  - ready contexts move toward processors, pending contexts toward big memory
55. Tera MTA & Friends
56. Percolation of Active Tasks
- Multiple stage latency management methodology
- Augmented multithreaded resource scheduling
- Hierarchy of task contexts
- Coarse-grain contexts coordinate in PIM memory
- Ready contexts migrate to SRAM under PIM control, releasing threads for scheduling
- Threads pushed into SRAM/CRAM frame buffers
- Strands loaded in register banks on a space-available basis
57. HTMT Percolation Model
(Diagram) Cryogenic area: DMA to CRAM, split-phase synchronization to SRAM, start/done signals. SRAM-PIM run time system: C-Buffer, I-Queue, A-Queue, T-Queue, D-Queue; parcel invocation & termination; parcel assembly & disassembly; parcel dispatcher & dispenser; re-use. DMA to DRAM-PIM.
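A minimal sketch of the percolation flow implied by this diagram, in plain Python: the queue names follow the slide, but what each stage does here (and the data shapes) are assumptions for illustration only.

    from collections import deque

    # Queues named after the diagram; their exact roles here are assumed.
    a_queue = deque()   # contexts whose operands have been assembled (ready to move up)
    i_queue = deque()   # contexts invoked and staged toward CRAM
    t_queue = deque()   # contexts that have terminated in the cryogenic area
    d_queue = deque()   # results being disassembled back toward DRAM-PIM

    def percolate_up(context):
        """DRAM-PIM side: operands gathered, hand the assembled context upward."""
        context["operands_ready"] = True
        a_queue.append(context)

    def dispatch_to_cram():
        """SRAM-PIM run time system: move ready contexts toward the cryogenic processors."""
        while a_queue:
            i_queue.append(a_queue.popleft())   # stand-in for DMA to CRAM

    def retire_from_cram():
        """On 'done': pull finished contexts back and queue their results for write-back."""
        while t_queue:
            d_queue.append(t_queue.popleft())   # stand-in for DMA back to DRAM-PIM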
58. HTMT Execution Model
(Diagram) Contexts in SRAM; data structures
59. DRAM PIM Functions
- Initialize data structures
- Stride thru regular data structures, transferring to/from SRAM
- Pointer chase thru linked data structures
- Join-like operations
- Reorderings
- Prefix operations
- I/O transfer management
  - DMA, compress/decompress, ...
60. SRAM PIM Functions
- Initiate gather/scatter to/from DRAM
- Recognize when sufficient operands arrive in SRAM context block (see the sketch below)
- Enqueue/dequeue SRAM block addresses
- Initiate DMA transfers to/from CRAM context block
- Signal SPELL re task initiation
- Prefix operations like floating point sum
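As an illustration of the "recognize when sufficient operands arrive" function (a sketch only; the counter-based scheme and all names are assumptions rather than the HTMT SRAM-PIM design):

    class ContextBlock:
        """An SRAM context block that becomes runnable once all operands have arrived."""
        def __init__(self, block_addr, operands_needed):
            self.block_addr = block_addr
            self.missing = operands_needed
            self.operands = {}

        def deliver(self, name, value):
            """Record one gathered operand; return True when the block is complete."""
            if name not in self.operands:
                self.operands[name] = value
                self.missing -= 1
            return self.missing == 0

    ready_blocks = []   # block addresses queued for DMA to CRAM and a SPELL start signal

    def on_operand(block, name, value):
        # When the last operand lands in SRAM, enqueue the block for transfer upward.
        if block.deliver(name, value):
            ready_blocks.append(block.block_addr)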
61. StrawMan Prototype for Phase 4
62. Side view (diagram, same figure as slide 46): 4 K stage (50 W) inside 77 K stage; fiber/wire interconnects; dimensions 0.3 m to 3 m