HTMT-class Latency Tolerant Parallel Architecture for Petaflops-scale Computation - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: HTMT-class Latency Tolerant Parallel Architecture for Petaflops-scale Computation


1
HTMT-class Latency Tolerant Parallel Architecture
for Petaflops-scale Computation
  • Dr. Thomas Sterling
  • California Institute of Technology
  • and
  • NASA Jet Propulsion Laboratory
  • October 1, 1999

2
HTMT-class Latency Tolerant Parallel Architecture
for Petaflops-scale Computation
A Presentation at IBM Almaden
  • Dr. Thomas Sterling
  • California Institute of Technology
  • and
  • NASA Jet Propulsion Laboratory
  • September 10, 1999

3
(No Transcript)
4
(No Transcript)
5
2nd Conference on Enabling Technologies for
Peta(fl)ops Computing
6
Rational Drug Design
Nanotechnology
Tomographic Reconstruction
Phylogenetic Trees
Biomolecular Dynamics
Neural Networks
Crystallography
Fracture Mechanics
MRI Imaging
Reservoir Modelling
Molecular Modelling
Biosphere/Geosphere
Diffraction Inversion Problems
Distribution Networks
Chemical Dynamics
Atomic Scattering
Electrical Grids
Flow in Porous Media
Pipeline Flows
Data Assimilation
Signal Processing
Condensed Matter Electronic Structure
Plasma Processing
Chemical Reactors
Cloud Physics
Electronic Structure
Boilers
Combustion
Actinide Chemistry
Radiation
Fourier Methods
Graph Theoretic
CVD
Quantum Chemistry
Reaction-Diffusion
Chemical Reactors
Cosmology
Transport
n-body
Astrophysics
Multiphase Flow
Manufacturing Systems
CFD
Basic Algorithms Numerical Methods
Discrete Events
PDE
Weather and Climate
Air Traffic Control
Military Logistics
Structural Mechanics
Seismic Processing
Population Genetics
Monte Carlo
ODE
Multibody Dynamics
Geophysical Fluids
VLSI Design
Transportation Systems
Aerodynamics
Raster Graphics
Economics
Fields
Orbital Mechanics
Nuclear Structure
Ecosystems
QCD
Pattern Matching
Symbolic Processing
Neutron Transport
Economics Models
Genome Processing
Virtual Reality
Astrophysics
Cryptography
Electromagnetics
Computer Vision
Virtual Prototypes
Intelligent Search
Multimedia Collaboration Tools
Computer Algebra
Databases
Magnet Design
Computational Steering
Scientific Visualization
Data Mining
Automated Deduction
Number Theory
CAD
Intelligent Agents
7
(No Transcript)
8
(No Transcript)
9
A 10 Gflops Beowulf
Center for Advanced Computing Research
172 Intel Pentium Pro microprocessors
California Institute of Technology
10
Emergence of Beowulf Clusters
11
1st printing May 1999; 2nd printing Aug. 1999. MIT Press
12
(No Transcript)
13
Beowulf Scalability
14
Integrated SMP node (block diagram): VLIW/RISC core, 24 GFLOPS at 6 GHz; 96-MByte second-level cache, 64 bytes wide, 160 GBytes/sec; highly interleaved 4-GByte DRAM behind a coherent crossbar at 640 GBytes/sec; multi-lambda all-optical network (AON) attached via WDM.
15
COTS PetaFlop System (topology diagram): 128 die/box, 4 CPU/die; 64 multi-die multi-processor boxes (numbered 1-64) connected through an all-optical switch with shared I/O; 10 meters of fiber corresponds to about 50 ns of delay.
16
COTS PetaFlops System
  • 8192 Dies (4 CPU/die minimum)
  • Each Die is 120 GFlops
  • 1 PetaFlop Peak
  • Power: 8192 x 200 Watts = 1.6 MegaWatts
  • Extra Main Memory: >3 MegaWatts (512 TBytes)
  • 15.36 TFlops/Rack (128 die)
  • 30 KWatts/Rack, thus 64 racks (30 inch)
  • Common System I/O
  • 2-Level Main Memory
  • Optical Interconnect
  • OC768 Channels (40 GHz)
  • 128 Channels per Die (DWDM) = 5.12 THz
  • All-Optical Switching
  • Bisection Bandwidth of 50 TBytes/sec
  • 15 TFlops/rack x 0.1 bytes/flop/sec x 32 racks
  • Rack Bandwidth: 15 TFlops x 0.1 bytes/flop/sec = 1.5 TBytes/sec (12 THz)
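The bulk figures in the list above follow from a few multiplications; here is a minimal Python sketch re-deriving them. The 0.1 bytes/flop/sec communication ratio, the 32-rack bisection count, and the 40 Gbit/s OC768 line rate are taken from the bullets; nothing else is assumed.

```python
# Back-of-the-envelope check of the COTS PetaFlops system numbers above.
dies, gflops_per_die = 8192, 120e9
print(f"peak: {dies * gflops_per_die / 1e15:.2f} PFlops")        # ~0.98 PFlops

watts_per_die = 200
print(f"compute power: {dies * watts_per_die / 1e6:.1f} MW")     # 1.6 MW

dies_per_rack = 128
rack_flops = dies_per_rack * gflops_per_die
print(f"per rack: {rack_flops / 1e12:.2f} TFlops")               # 15.36 TFlops

channels_per_die, oc768_hz = 128, 40e9
print(f"DWDM per die: {channels_per_die * oc768_hz / 1e12:.2f} THz")  # 5.12 THz

bytes_per_flop, bisection_racks = 0.1, 32
bisection = rack_flops * bytes_per_flop * bisection_racks
print(f"bisection: {bisection / 1e12:.0f} TBytes/sec")           # ~49 TBytes/sec
```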

17
The SIA CMOS Roadmap
18
Requirements for High End Systems
  • Bulk capabilities
  • performance
  • storage capacities
  • throughput/bandwidth
  • cost, power, complexity
  • Efficiency
  • overhead
  • latency
  • contention
  • starvation/parallelism
  • Usability
  • generality
  • programmability
  • reliability

19
Points of Inflection in the History of Computing
  • Heroic Era (1950)
  • technology: vacuum tubes, mercury delay lines,
    pulse transformers
  • architecture: accumulator based
  • model: von Neumann, sequential instruction
    execution
  • examples: Whirlwind, EDSAC
  • Mainframe (1960)
  • technology: transistors, core memory, disk drives
  • architecture: register bank based
  • model: virtual memory
  • examples: IBM 7090, PDP-1

20
Points of Inflection in the History of Computing
  • Supercomputers (1980)
  • technology: ECL, semiconductor integration, RAM
  • architecture: pipelined
  • model: vector
  • example: Cray-1
  • Massively Parallel Processing (1990)
  • technology: VLSI, microprocessor
  • architecture: MIMD
  • model: Communicating Sequential Processes,
    message passing
  • examples: TMC CM-5, Intel Paragon
  • ? (2000)

21
(No Transcript)
22
HTMT Objectives
  • Scalable architecture with high sustained
    performance in the presence of disparate cycle
    times and latencies
  • Exploit diverse device technologies to achieve
    substantially superior operating point
  • Execution model to simplify parallel system
    programming and expand generality and
    applicability

23
Hybrid Technology MultiThreaded Architecture
24
(No Transcript)
25
Storage Capacity by Subsystem 2007 Design Point
26
(No Transcript)
27
HTMT Strategy
  • High performance
  • Superconductor RSFQ logic
  • Data Vortex optical interconnect network
  • PIM smart memory
  • Low power
  • Superconductor RSFQ logic
  • Optical holographic storage
  • PIM smart memory

28
HTMT Strategy (cont)
  • Low cost
  • reduce wire count through chip-to-chip fiber
  • reduce processor count through x100 clock speed
  • reduce memory chips by 3/2 holographic memory
    layer
  • Efficiency
  • processor level multithreading
  • smart memory managed second stage context pushing
    multithreading
  • fine grain regular and irregular data parallelism
    exploited in memory
  • high memory bandwidth and low latency ops through
    PIM
  • memory to memory interactions without processor
    intervention
  • hardware mechanisms for synchronization,
    scheduling, data/context migration, gather/scatter

29
HTMT Strategy (cont)
  • Programmability
  • Global shared name space
  • hierarchical parallel thread flow control model
  • no explicit processor naming
  • automatic latency management
  • automatic processor load balancing
  • runtime fine grain multithreading
  • automatic context pushing for process migration
    (percolation)
  • configuration transparent, runtime scalable

30
RSFQ Roadmap (VLSI Circuit Clock Frequency)
31
RSFQ Building Block
32
(No Transcript)
33
Advantages
  • X100 clock speeds achievable
  • X100 power efficiency advantage
  • Easier fabrication
  • Leverage semiconductor fabrication tools
  • First technology to encounter ultra-high speed
    operation

34
Superconductor Processor
  • 100 GHz clock, 33 GHz inter-chip
  • 0.8 micron Niobium on Silicon
  • 100K gates per chip
  • 0.05 watts per processor
  • 100 KWatts per Petaflops

35
(No Transcript)
36
(No Transcript)
37
(No Transcript)
38
Data Vortex Optical Interconnect
39
(No Transcript)
40
Data Vortex Latency Distribution (plot; network height 1024)
41
Single-mode rib waveguides on silicon-on-insulator wafers; hybrid sources and detectors; mix of CMOS-like and micromachining-type processes for fabrication.
e.g. R. A. Soref, J. Schmidtchen, and K. Petermann, IEEE J. Quantum Electron. 27, p. 1971 (1991); A. Rickman, G. T. Reed, B. L. Weiss, and F. Namavar, IEEE Photonics Technol. Lett. 4, p. 633 (1992); B. Jalali, P. D. Trinh, S. Yegnanarayanan, and F. Coppinger, IEE Proc. Optoelectron. 143, p. 307 (1996)
42
PIM Provides Smart Memory
  • Merge logic and memory
  • Integrate multiple logic/mem stacks on single
    chip
  • Exposes high intrinsic memory bandwidth
  • Reduction of memory access latency
  • Low overhead for memory oriented operations
  • Manages data structure manipulation, context
    coordination and percolation

43
Multithreaded PIM DRAM
  • Multithreaded Control of PIM Functions
  • multiple operation sequences with low context
    switching overhead
  • maximize memory utilization and efficiency
  • maximize processor and I/O utilization
  • multiple banks of row buffers to hold data,
    instructions, and addr
  • data parallel basic operations at row buffer
  • manages shared resources such as FP
  • Direct PIM to PIM Interaction
  • memory communicates with memory, via parcels, within
    and across chip boundaries without external control
    processor intervention
  • exposes fine grain parallelism intrinsic to
    vector and irregular data structures
  • e.g. pointer chasing, block moves,
    synchronization, data balancing
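
To make the parcel idea above concrete, here is a minimal Python sketch (not HTMT code): a parcel carrying an operation is routed to the PIM node that owns the addressed data, so a linked list scattered across chips is traversed memory-to-memory with no control processor touching the elements. The Parcel and PIMNode classes and the address interleaving are invented for illustration.

```python
# Toy model of parcel-based PIM-to-PIM interaction (illustrative only).
from dataclasses import dataclass

@dataclass
class Parcel:
    op: str          # operation to perform at the destination, e.g. "chase"
    addr: int        # global address the parcel is sent to
    payload: dict    # operation arguments / accumulated results

class PIMNode:
    """One smart-memory chip: local storage plus logic to execute parcels."""
    def __init__(self, node_id, mesh):
        self.node_id, self.mesh, self.mem = node_id, mesh, {}

    def owns(self, addr):
        return addr // 1000 == self.node_id       # toy address interleaving

    def execute(self, parcel):
        if parcel.op == "chase":                  # pointer chasing in memory
            value, next_addr = self.mem[parcel.addr]
            parcel.payload["visited"].append(value)
            if next_addr is None:                 # end of list: done
                return parcel.payload["visited"]
            # Forward the parcel to whichever node owns the next element;
            # no external control processor is involved.
            return self.mesh.route(Parcel("chase", next_addr, parcel.payload))

class Mesh:
    def __init__(self, n_nodes):
        self.nodes = [PIMNode(i, self) for i in range(n_nodes)]
    def route(self, parcel):
        owner = next(n for n in self.nodes if n.owns(parcel.addr))
        return owner.execute(parcel)

# A 3-node mesh with a linked list scattered across it: 5 -> 1042 -> 2007.
mesh = Mesh(3)
mesh.nodes[0].mem[5]    = ("a", 1042)
mesh.nodes[1].mem[1042] = ("b", 2007)
mesh.nodes[2].mem[2007] = ("c", None)
print(mesh.route(Parcel("chase", 5, {"visited": []})))   # ['a', 'b', 'c']
```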

44
Silicon Budget for HTMT DRAM PIM
  • Designed to provide proper balance of memory
    support for fiber bandwidth
  • Different Vortex configurations → different numbers
  • In 2004, 16 TB = 4096 groups of 64 chips
  • Each Chip:

(Chip floorplan, by area: fiber WDM optical receiver interface, HRAM / Vortex output, superscalar core, memory, logic)
45
Holographic 3/2 Memory
Performance Scaling
  • Advantages
  • petabyte memory
  • competitive cost
  • 10 µsec access time
  • low power
  • efficient interface to DRAM
  • Disadvantages
  • recording rate is slower than the readout rate
    for LiNbO3
  • recording must be done in GB chunks
  • long term trend favors DRAM unless new materials
    and lasers are used

46
(Side view of the cryogenic assembly: 4 K stage dissipating 50 W inside a 77 K enclosure, with fiber/wire interconnects; labeled dimensions range from 0.3 m to 3 m.)
47
(Side view of the full installation: the 4 K / 50 W stage inside its 77 K enclosure with helium and nitrogen cooling plants; fiber/wire interconnects; front-end computer server and console; cable tray assembly; WDM source, optical amplifiers, and 980 nm pumps (20 cabinets); generators and 220 V feeds; tape silo array (400 silos); hard disk array (40 cabinets); labeled dimensions of 0.5 m and 3 m.)
48
HTMT Facility (Top View)
49
Floor Area
50
Power Dissipation by Subsystem Petaflops Design
Point
51
Subsystem Interfaces 2007 Design Point
  • Same colors indicate a connection between
    subsystems
  • Horizontal lines group interfaces within a
    subsystem

52
(No Transcript)
53
Getting Efficiency
  • Contention
  • hardware for bandwidth, logic throughput,
    hardware arbitration
  • Latency
  • multithreaded processor with hardware context
    switching
  • percolation for proactive prestaging of
    executables
  • PIM-DRAM and PIM-SRAM provide smart data-oriented
    mechanisms
  • Overhead
  • hardware context switching
  • in PIM smart synchronization and context
    management
  • proactive percolation performed in PIM
  • Starvation
  • dynamic load balancing
  • high speed processor for reduced parallelism
  • expose/exploit fine grain parallelism

54
Multilevel Multithreaded Execution Model
  • Extend latency hiding of multithreading
  • Hierarchy of logical threads
  • Delineates threads and thread ensembles
  • Action sequences, state, and precedence
    constraints
  • Fine grain single cycle thread switching
  • Processor level, hides pipeline and time of
    flight latency
  • Coarse grain context "percolation"
  • Memory level, in memory synchronization
  • Ready contexts move toward processors, pending
    contexts towards big memory
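
A minimal sketch of the fine-grain level described above, under the assumption (mine, for illustration) that each thread yields the latency of the load it just issued: the scheduler switches to a different ready thread every cycle, so with enough resident threads the load latency disappears from the issue stream.

```python
# Toy model of fine-grain, single-cycle thread switching hiding load latency.
import heapq

def worker(name, n_loads, latency):
    """Thread body: one unit of compute per item, then a load that stalls the
    thread for `latency` cycles; each yield reports that stall."""
    for i in range(n_loads):
        print(f"  {name}: compute on item {i}")
        yield latency

def run(threads):
    """Each machine cycle, issue from some thread whose outstanding load has
    completed; with enough resident threads the stalls are fully hidden."""
    ready = [(0, tid) for tid in range(len(threads))]   # (cycle ready, thread id)
    heapq.heapify(ready)
    cycle = issued = 0
    while ready:
        wake, tid = heapq.heappop(ready)
        try:
            stall = next(threads[tid])                  # run until the next load
        except StopIteration:
            continue                                    # thread retired
        cycle = max(cycle + 1, wake)                    # one issue slot per cycle
        issued += 1
        heapq.heappush(ready, (cycle + stall, tid))     # re-ready after the load
    print(f"{issued} compute steps issued in {cycle} machine cycles")

# Eight resident thread contexts cover a 5-cycle load latency completely:
# 24 compute steps issue in 24 machine cycles.
run([worker(f"T{i}", n_loads=3, latency=5) for i in range(8)])
```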

55
Tera MTA & Friends
56
Percolation of Active Tasks
  • Multiple stage latency management methodology
  • Augmented multithreaded resource scheduling
  • Hierarchy of task contexts
  • Coarse-grain contexts coordinate in PIM memory
  • Ready contexts migrate to SRAM under PIM control
    releasing threads for scheduling
  • Threads pushed into SRAM/CRAM frame buffers
  • Strands loaded in register banks on space
    available basis
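
A minimal sketch of this staging, with the memory levels modeled as plain Python lists; the readiness test (all operands gathered) and the frame-buffer capacities are illustrative assumptions, not HTMT parameters.

```python
# Toy model of context percolation: ready work moves toward the fast cryogenic
# processors, pending work stays out in the large, slow PIM memory.
from dataclasses import dataclass

@dataclass
class Context:
    task: str
    needed: int                    # operands the task requires
    arrived: int = 0               # operands gathered so far by the PIM
    def ready(self):
        return self.arrived >= self.needed

dram_pim, sram_frames, cram_frames = [], [], []   # coarse -> fine staging levels
SRAM_CAPACITY, CRAM_CAPACITY = 4, 2               # illustrative sizes

def percolate():
    """One percolation step, as run by the PIM-resident runtime."""
    # 1. DRAM PIM gathers operands; ready contexts migrate up into SRAM frames.
    for ctx in list(dram_pim):
        ctx.arrived = ctx.needed                  # stand-in for gather/scatter
        if ctx.ready() and len(sram_frames) < SRAM_CAPACITY:
            dram_pim.remove(ctx)
            sram_frames.append(ctx)
    # 2. Staged contexts are pushed on into CRAM frame buffers, where the
    #    superconducting processors pick up strands on a space-available basis.
    while sram_frames and len(cram_frames) < CRAM_CAPACITY:
        cram_frames.append(sram_frames.pop(0))

dram_pim.extend(Context(f"task{i}", needed=3) for i in range(6))
percolate()
print([c.task for c in cram_frames])   # ['task0', 'task1']
print([c.task for c in sram_frames])   # ['task2', 'task3']
```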

57
HTMT Percolation Model
(Diagram of the percolation pipeline run by the SRAM-PIM runtime system: parcel dispatcher/dispenser, parcel assembly/disassembly, and parcel invocation/termination stages; a C-Buffer plus A-, I-, T-, and D-Queues; DMA to CRAM in the cryogenic area with start/done signals and split-phase synchronization to SRAM; DMA to DRAM-PIM; context re-use.)
58
HTMT Execution Model
(Diagram: contexts in SRAM and data structures.)
59
DRAM PIM Functions
  • Initialize data structures
  • Stride thru regular data structures, transferring
    to/from SRAM
  • Pointer chase thru linked data structures
  • Join-like operations
  • Reorderings
  • Prefix operations
  • I/O transfer management
  • DMA, compress/decompress, ...
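
A minimal sketch of two of these functions, with the DRAM modeled as a flat Python list; the record layout and argument names are illustrative, not the PIM instruction set.

```python
# Toy DRAM-PIM: operations executed next to the memory they touch.
dram = list(range(100))                  # stand-in for a DRAM-PIM chip's rows

def stride_gather(base, stride, count):
    """Stride through a regular structure, producing a packed block
    that would be shipped to SRAM."""
    return [dram[base + i * stride] for i in range(count)]

def pointer_chase(head, next_offset):
    """Walk a linked structure whose 'next' pointer sits at a fixed offset
    from each record's base address; stop on a null (negative) pointer."""
    visited = []
    while head >= 0:
        visited.append(dram[head])       # record payload
        head = dram[head + next_offset]  # follow the link, still inside DRAM
    return visited

# A regular array at stride 4, and a 3-record linked list at 60 -> 70 -> 80.
dram[61], dram[71], dram[81] = 70, 80, -1
print(stride_gather(base=0, stride=4, count=5))   # [0, 4, 8, 12, 16]
print(pointer_chase(head=60, next_offset=1))      # [60, 70, 80]
```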

60
SRAM PIM Functions
  • Initiate Gather/Scatter to/from DRAM
  • Recognize when sufficient operands arrive in SRAM
    context block
  • Enqueue/Dequeue SRAM block addresses
  • Initiate DMA transfers to/from CRAM context block
  • Signal SPELL re task initiation
  • Prefix operations like Flt Pt Sum
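
A minimal sketch of the operand-counting behaviour in the second bullet, modeled as a counter attached to an SRAM context block; the dispatch callback standing in for "enqueue the block, DMA it to CRAM, signal the SPELL" is an assumption for illustration.

```python
# Toy SRAM-PIM synchronization: a context block fires only when every
# operand gathered from DRAM has landed in it.
class ContextBlock:
    def __init__(self, name, n_operands, on_ready):
        self.name = name
        self.slots = [None] * n_operands   # operand slots in the SRAM block
        self.missing = n_operands          # counter maintained by the SRAM PIM
        self.on_ready = on_ready           # e.g. start DMA to CRAM, signal SPELL

    def deliver(self, slot, value):
        """Called as each gathered operand arrives from DRAM."""
        if self.slots[slot] is None:
            self.slots[slot] = value
            self.missing -= 1
        if self.missing == 0:              # sufficient operands: block is ready
            self.on_ready(self)

def dispatch(block):
    # Stand-in for: enqueue block address, DMA it to a CRAM frame, signal SPELL.
    print(f"{block.name} ready with operands {block.slots}; dispatching to CRAM")

ctx = ContextBlock("ctx42", n_operands=3, on_ready=dispatch)
ctx.deliver(0, 3.14)      # nothing happens yet
ctx.deliver(2, 2.71)
ctx.deliver(1, 1.41)      # third operand arrives -> block fires
```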

61
StrawMan Prototype for Phase 4
62
(Side view of the cryogenic assembly, repeated from slide 46: 4 K stage dissipating 50 W inside a 77 K enclosure, with fiber/wire interconnects; labeled dimensions range from 0.3 m to 3 m.)