Title: HTMT-class Latency Tolerant Parallel Architecture for Petaflops-scale Computation
1. HTMT-class Latency Tolerant Parallel Architecture for Petaflops-scale Computation
- Dr. Thomas Sterling
- California Institute of Technology and NASA Jet Propulsion Laboratory
- October 1, 1999
2. HTMT-class Latency Tolerant Parallel Architecture for Petaflops-scale Computation
A Presentation at IBM Almaden
- Dr. Thomas Sterling
- California Institute of Technology and NASA Jet Propulsion Laboratory
- September 10, 1999
5. 2nd Conference on Enabling Technologies for Peta(fl)ops Computing
6. Rational Drug Design
Nanotechnology
Tomographic Reconstruction
Phylogenetic Trees
Biomolecular Dynamics
Neural Networks
Crystallography
Fracture Mechanics
MRI Imaging
Reservoir Modelling
Molecular Modelling
Biosphere/Geosphere
Diffraction Inversion Problems
Distribution Networks
Chemical Dynamics
Atomic Scattering
Electrical Grids
Flow in Porous Media
Pipeline Flows
Data Assimilation
Signal Processing
Condensed Matter Electronic Structure
Plasma Processing
Chemical Reactors
Cloud Physics
Electronic Structure
Boilers
Combustion
Actinide Chemistry
Radiation
Fourier Methods
Graph Theoretic
CVD
Quantum Chemistry
Reaction-Diffusion
Chemical Reactors
Cosmology
Transport
n-body
Astrophysics
Multiphase Flow
Manufacturing Systems
CFD
Basic Algorithms & Numerical Methods
Discrete Events
PDE
Weather and Climate
Air Traffic Control
Military Logistics
Structural Mechanics
Seismic Processing
Population Genetics
Monte Carlo
ODE
Multibody Dynamics
Geophysical Fluids
VLSI Design
Transportation Systems
Aerodynamics
Raster Graphics
Economics
Fields
Orbital Mechanics
Nuclear Structure
Ecosystems
QCD
Pattern Matching
Symbolic Processing
Neutron Transport
Economics Models
Genome Processing
Virtual Reality
Astrophysics
Cryptography
Electromagnetics
Computer Vision
Virtual Prototypes
Intelligent Search
Multimedia Collaboration Tools
Computer Algebra
Databases
Magnet Design
Computational Steering
Scientific Visualization
Data Mining
Automated Deduction
Number Theory
CAD
Intelligent Agents
9. A 10 Gflops Beowulf
- Center for Advanced Computing Research, California Institute of Technology
- 172 Intel Pentium Pro microprocessors
10. Emergence of Beowulf Clusters
11. 1st printing May 1999, 2nd printing Aug. 1999, MIT Press
13. Beowulf Scalability
14. Integrated SMP - WDM
(Diagram) DRAM: 4 GBytes, highly interleaved; multi-lambda AON crossbar with coherence, 640 GBytes/sec; 2nd level cache: 96 MBytes, 64 bytes wide, 160 GBytes/sec; VLIW/RISC core: 24 GFlops at 6 GHz
15. COTS PetaFlop System
(Diagram) 128 die/box, 4 CPU/die; multi-die multi-processor boxes (numbered 1-64) connected through an all-optical switch; shared I/O; 10 meters of fiber gives about 50 ns of delay
16. COTS PetaFlops System
- 8192 dies (4 CPU/die minimum)
- Each die is 120 GFlops
- 1 PetaFlops peak
- Power: 8192 x 200 Watts ~ 1.6 MegaWatts
- Extra main memory: >3 MegaWatts (512 TBytes)
- 15.36 TFlops/rack (128 die)
- 30 KWatts/rack, thus 64 racks (30 inch)
- Common system I/O
- 2-level main memory
- Optical interconnect
  - OC-768 channels (40 GHz)
  - 128 channels per die (DWDM) = 5.12 THz
  - all-optical switching
- Bisection bandwidth of 50 TBytes/sec (15 TFlops/rack x 0.1 bytes/flop/sec x 32 racks)
- Rack bandwidth: 15 TFlops x 0.1 bytes/flop/sec ~ 12 Tbits/sec (12 THz)
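A quick back-of-envelope check of the numbers above (an illustrative sketch in Python; the 0.1 bytes/flop/sec ratio is the one used in the bisection-bandwidth line):

    # Arithmetic check of the COTS PetaFlops figures quoted on this slide.
    dies = 8192
    gflops_per_die = 120
    peak_pflops = dies * gflops_per_die / 1e6        # ~0.98 PFlops peak
    processor_power_mw = dies * 200 / 1e6            # ~1.64 MW for the dies alone
    racks = dies // 128                              # 64 racks at 128 dies per rack
    tflops_per_rack = 128 * gflops_per_die / 1e3     # 15.36 TFlops per rack
    bisection_tbytes_s = tflops_per_rack * 0.1 * 32  # ~49 TBytes/sec across 32 racks
    print(peak_pflops, processor_power_mw, racks, tflops_per_rack, bisection_tbytes_s)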
17. The SIA CMOS Roadmap
18. Requirements for High End Systems
- Bulk capabilities
  - performance
  - storage capacities
  - throughput/bandwidth
  - cost, power, complexity
- Efficiency
  - overhead
  - latency
  - contention
  - starvation/parallelism
- Usability
  - generality
  - programmability
  - reliability
19. Points of Inflection in the History of Computing
- Heroic Era (1950)
  - technology: vacuum tubes, mercury delay lines, pulse transformers
  - architecture: accumulator based
  - model: von Neumann, sequential instruction execution
  - examples: Whirlwind, EDSAC
- Mainframe (1960)
  - technology: transistors, core memory, disk drives
  - architecture: register bank based
  - model: virtual memory
  - examples: IBM 7090, PDP-1
20. Points of Inflection in the History of Computing
- Supercomputers (1980)
  - technology: ECL, semiconductor integration, RAM
  - architecture: pipelined
  - model: vector
  - example: Cray-1
- Massively Parallel Processing (1990)
  - technology: VLSI, microprocessor
  - architecture: MIMD
  - model: Communicating Sequential Processes, message passing
  - examples: TMC CM-5, Intel Paragon
- ? (2000)
22. HTMT Objectives
- Scalable architecture with high sustained performance in the presence of disparate cycle times and latencies
- Exploit diverse device technologies to achieve a substantially superior operating point
- Execution model to simplify parallel system programming and expand generality and applicability
23. Hybrid Technology MultiThreaded Architecture
25. Storage Capacity by Subsystem, 2007 Design Point
27. HTMT Strategy
- High performance
  - Superconductor RSFQ logic
  - Data Vortex optical interconnect network
  - PIM smart memory
- Low power
  - Superconductor RSFQ logic
  - Optical holographic storage
  - PIM smart memory
28. HTMT Strategy (cont.)
- Low cost
  - reduce wire count through chip-to-chip fiber
  - reduce processor count through x100 clock speed
  - reduce memory chips by 3/2 holographic memory layer
- Efficiency
  - processor level multithreading
  - smart memory managed second stage context pushing multithreading
  - fine grain regular & irregular data parallelism exploited in memory
  - high memory bandwidth and low latency ops through PIM
  - memory to memory interactions without processor intervention
  - hardware mechanisms for synchronization, scheduling, data/context migration, gather/scatter (a sketch of in-memory synchronization follows below)
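As an illustration of the in-memory synchronization mentioned above, here is a minimal sketch of full/empty-bit style synchronization in plain Python; the HTMT PIMs would provide this in hardware, and the class and method names here are illustrative, not part of the design:

    import threading

    class SyncCell:
        """A memory word with a full/empty tag, sketching in-memory synchronization."""
        def __init__(self):
            self._cv = threading.Condition()
            self._full = False
            self._value = None

        def write_full(self, value):
            # Producer side: wait until the cell is empty, then fill it and mark it full.
            with self._cv:
                while self._full:
                    self._cv.wait()
                self._value, self._full = value, True
                self._cv.notify_all()

        def read_empty(self):
            # Consumer side: wait until the cell is full, then empty it and return the value.
            with self._cv:
                while not self._full:
                    self._cv.wait()
                self._full = False
                self._cv.notify_all()
                return self._value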
29. HTMT Strategy (cont.)
- Programmability
  - global shared name space
  - hierarchical parallel thread flow control model
  - no explicit processor naming
  - automatic latency management
  - automatic processor load balancing
  - runtime fine grain multithreading
  - automatic context pushing for process migration (percolation)
  - configuration transparent, runtime scalable
30. RSFQ Roadmap (VLSI Circuit Clock Frequency)
31. RSFQ Building Block
33. Advantages
- X100 clock speeds achievable
- X100 power efficiency advantage
- Easier fabrication
- Leverage semiconductor fabrication tools
- First technology to encounter ultra-high speed operation
34. Superconductor Processor
- 100 GHz clock, 33 GHz inter-chip
- 0.8 micron Niobium on Silicon
- 100K gates per chip
- 0.05 watts per processor
- 100 KWatts per Petaflops
38. Data Vortex Optical Interconnect
40. Data Vortex Latency Distribution (network height 1024)
41. Single-mode rib waveguides on silicon-on-insulator wafers; hybrid sources and detectors; mix of CMOS-like and micromachining-type processes for fabrication
e.g. R. A. Soref, J. Schmidtchen & K. Petermann, IEEE J. Quantum Electron. 27, p. 1971 (1991); A. Rickman, G. T. Reed, B. L. Weiss & F. Namavar, IEEE Photonics Technol. Lett. 4, p. 633 (1992); B. Jalali, P. D. Trinh, S. Yegnanarayanan & F. Coppinger, IEE Proc. Optoelectron. 143, p. 307 (1996)
42. PIM Provides Smart Memory
- Merge logic and memory
- Integrate multiple logic/memory stacks on a single chip
- Exposes high intrinsic memory bandwidth
- Reduction of memory access latency
- Low overhead for memory oriented operations
- Manages data structure manipulation, context coordination, and percolation
43. Multithreaded PIM DRAM
- Multithreaded control of PIM functions
  - multiple operation sequences with low context switching overhead
  - maximize memory utilization and efficiency
  - maximize processor and I/O utilization
  - multiple banks of row buffers to hold data, instructions, and addresses
  - data parallel basic operations at the row buffer
  - manages shared resources such as the FP unit
- Direct PIM-to-PIM interaction (see the parcel sketch below)
  - memory communicates with memory within and across chip boundaries, without external control processor intervention, by parcels
  - exposes fine grain parallelism intrinsic to vector and irregular data structures
  - e.g. pointer chasing, block moves, synchronization, data balancing
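To make the parcel idea concrete, here is a minimal software sketch of a parcel-driven pointer chase: each PIM node owns a slice of linked cells, and the parcel (address plus partial result) is forwarded to whichever node owns the next link, with no external processor involved. The classes, the address-to-node mapping, and the scheduling loop are illustrative assumptions, not the HTMT parcel format:

    from collections import deque

    class PimNode:
        """One PIM memory node owning a slice of linked-list cells keyed by address."""
        def __init__(self, cells):
            self.cells = cells      # addr -> (value, next_addr or None)
            self.inbox = deque()    # parcels waiting to be processed at this node

    def owner(addr, nodes):
        return addr % len(nodes)    # illustrative address-to-node mapping

    def pointer_chase(nodes, start_addr):
        """Chase a linked structure by forwarding a parcel from PIM node to PIM node."""
        nodes[owner(start_addr, nodes)].inbox.append((start_addr, []))
        while True:
            active = [n for n in nodes if nodes[n].inbox]   # stand-in for parallel hardware
            if not active:
                return None
            for n in active:
                addr, acc = nodes[n].inbox.popleft()
                value, nxt = nodes[n].cells[addr]
                acc.append(value)
                if nxt is None:
                    return acc                               # reached the end of the chain
                # Forward the parcel (address + partial result), not the data structure.
                nodes[owner(nxt, nodes)].inbox.append((nxt, acc))

For example, with nodes = {0: PimNode({0: ('a', 3)}), 1: PimNode({3: ('b', None)})}, pointer_chase(nodes, 0) returns ['a', 'b'] after hopping from node 0 to node 1.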
44. Silicon Budget for HTMT DRAM PIM
- Designed to provide the proper balance of memory support for fiber bandwidth
- Different Vortex configurations => different numbers
- In 2004: 16 TB in 4096 groups of 64 chips
- Each chip, by area: memory, logic, SuperScalar core, fiber WDM optical receiver interface, HRAM Vortex output
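As a check on the 2004 numbers above: 4096 groups x 64 chips is 262,144 PIM DRAM chips, and if the 16 TB is spread evenly across them, each chip carries 64 MB (512 Mbit) of memory plus its logic:

    chips = 4096 * 64                           # 262,144 PIM DRAM chips
    mb_per_chip = 16 * 2**40 / chips / 2**20    # 16 TB spread evenly -> 64.0 MB per chip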
45. Holographic 3/2 Memory
Performance Scaling
- Advantages
  - petabyte memory
  - competitive cost
  - 10 µsec access time
  - low power
  - efficient interface to DRAM
- Disadvantages
  - recording rate is slower than the readout rate for LiNbO3
  - recording must be done in GB chunks
  - long term trend favors DRAM unless new materials and lasers are used
46. Side view (diagram): 4 K stage (50 W) inside 77 K stage; fiber/wire interconnects; dimensions 0.3 m to 3 m
47. Side view (diagram): helium (4 K, 50 W) and nitrogen (77 K) cooling; fiber/wire interconnects; tape silo array (400 silos); hard disk array (40 cabinets); front end computer server; console; cable tray assembly; 220 Volt feeds and generators; WDM source; 980 nm pumps (20 cabinets); optical amplifiers; dimensions 0.5 m to 3 m
48. HTMT Facility (Top View)
49. Floor Area
50. Power Dissipation by Subsystem, Petaflops Design Point
51. Subsystem Interfaces, 2007 Design Point
- Same colors indicate a connection between subsystems
- Horizontal lines group interfaces within a subsystem
53. Getting Efficiency
- Contention
  - hardware for bandwidth, logic throughput, hardware arbitration
- Latency (see the estimate below)
  - multithreaded processor with hardware context switching
  - percolation for proactive prestaging of executables
  - PIM-DRAM & PIM-SRAM provide smart data oriented mechanisms
- Overhead
  - hardware context switching
  - in-PIM smart synchronization and context management
  - proactive percolation performed in PIM
- Starvation
  - dynamic load balancing
  - high speed processor for reduced parallelism
  - expose/exploit fine grain parallelism
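A rough way to see the latency problem these mechanisms address (an illustrative estimate; the cycle counts below are assumptions, not HTMT figures): a processor stays busy only if it has roughly latency/work extra contexts to switch into while each thread waits.

    # Threads needed to hide a memory latency by context switching, assuming each
    # thread runs `work_cycles` before waiting `latency_cycles` for its next operand.
    def threads_to_hide(latency_cycles, work_cycles):
        return 1 + -(-latency_cycles // work_cycles)   # the running thread + ceil(L/W)

    # Example: a 100 GHz core waiting ~10 microseconds on far memory (1,000,000 cycles)
    # with 1,000 cycles of useful work per thread needs on the order of 1,000 contexts,
    # which is why HTMT percolates ready contexts close to the processors rather than
    # relying on processor-level thread switching alone.
    print(threads_to_hide(1_000_000, 1_000))    # -> 1001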
54. Multilevel Multithreaded Execution Model
- Extend latency hiding of multithreading
- Hierarchy of logical threads
- Delineates threads and thread ensembles
- Action sequences, state, and precedence constraints
- Fine grain single cycle thread switching
  - processor level; hides pipeline and time-of-flight latency
- Coarse grain context "percolation"
  - memory level; in-memory synchronization
  - ready contexts move toward processors, pending contexts toward big memory
55. Tera MTA & Friends
56. Percolation of Active Tasks
- Multiple stage latency management methodology
- Augmented multithreaded resource scheduling
- Hierarchy of task contexts
- Coarse-grain contexts coordinate in PIM memory
- Ready contexts migrate to SRAM under PIM control, releasing threads for scheduling
- Threads pushed into SRAM/CRAM frame buffers
- Strands loaded in register banks on a space-available basis
57. HTMT Percolation Model
(Diagram) Cryogenic area: DMA to CRAM, split-phase synchronization to SRAM, start/done signals. SRAM-PIM run time system: C-Buffer, I-Queue, A-Queue, T-Queue, D-Queue; parcel invocation & termination; parcel assembly & disassembly; parcel dispatcher & dispenser; re-use. DMA to DRAM-PIM.
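A minimal sketch of the percolation flow implied by this diagram, in plain Python: the queue names follow the slide, but what each stage does here (and the data shapes) are assumptions for illustration only.

    from collections import deque

    # Queues named after the diagram; their exact roles here are assumed.
    a_queue = deque()   # contexts whose operands have been assembled (ready to move up)
    i_queue = deque()   # contexts invoked and staged toward CRAM
    t_queue = deque()   # contexts that have terminated in the cryogenic area
    d_queue = deque()   # results being disassembled back toward DRAM-PIM

    def percolate_up(context):
        """DRAM-PIM side: operands gathered, hand the assembled context upward."""
        context["operands_ready"] = True
        a_queue.append(context)

    def dispatch_to_cram():
        """SRAM-PIM run time system: move ready contexts toward the cryogenic processors."""
        while a_queue:
            i_queue.append(a_queue.popleft())   # stand-in for DMA to CRAM

    def retire_from_cram():
        """On 'done': pull finished contexts back and queue their results for write-back."""
        while t_queue:
            d_queue.append(t_queue.popleft())   # stand-in for DMA back to DRAM-PIM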
58. HTMT Execution Model
(Diagram) Contexts in SRAM; data structures
59. DRAM PIM Functions
- Initialize data structures
- Stride thru regular data structures, transferring to/from SRAM
- Pointer chase thru linked data structures
- Join-like operations
- Reorderings
- Prefix operations
- I/O transfer management
  - DMA, compress/decompress, ...
60. SRAM PIM Functions
- Initiate gather/scatter to/from DRAM
- Recognize when sufficient operands arrive in SRAM context block (see the sketch below)
- Enqueue/dequeue SRAM block addresses
- Initiate DMA transfers to/from CRAM context block
- Signal SPELL re task initiation
- Prefix operations like floating point sum
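As an illustration of the "recognize when sufficient operands arrive" function (a sketch only; the counter-based scheme and all names are assumptions rather than the HTMT SRAM-PIM design):

    class ContextBlock:
        """An SRAM context block that becomes runnable once all operands have arrived."""
        def __init__(self, block_addr, operands_needed):
            self.block_addr = block_addr
            self.missing = operands_needed
            self.operands = {}

        def deliver(self, name, value):
            """Record one gathered operand; return True when the block is complete."""
            if name not in self.operands:
                self.operands[name] = value
                self.missing -= 1
            return self.missing == 0

    ready_blocks = []   # block addresses queued for DMA to CRAM and a SPELL start signal

    def on_operand(block, name, value):
        # When the last operand lands in SRAM, enqueue the block for transfer upward.
        if block.deliver(name, value):
            ready_blocks.append(block.block_addr)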
61. StrawMan Prototype for Phase 4
62. Side view (diagram, same figure as slide 46): 4 K stage (50 W) inside 77 K stage; fiber/wire interconnects; dimensions 0.3 m to 3 m