Title: Critical Factors and Directions for Petaflops-scale Supercomputers
1. Critical Factors and Directions for Petaflops-scale Supercomputers
Presentation to IFIP WG10.3 e-Seminar Series
- Thomas Sterling
- California Institute of Technology
- and
- NASA Jet Propulsion Laboratory
- January 4, 2005
2. IBM BG/L: Fastest Computer in the World
3. Blue Gene/L: 71 Teraflops Linpack Performance
- IBM BlueGene/L DD2 beta-System
- Peak Performance 91.75 Tflops
- Linpack Performance 70.72 Tflops
- Based on the IBM 0.7 GHz PowerPC 440
- 2.8 Gflops peak per processor (2 processors per ASIC)
- 32768 processors
- 128 MB DDR per processor, 4 TB in the system
- 3D torus network plus a combining tree network
- 100 Tbytes disk storage
- Power consumption of 500 kW
4. (No transcript)
5. Where Does Performance Come From?
- Device Technology
- Logic switching speed and device density
- Memory capacity and access time
- Communications bandwidth and latency
- Computer Architecture
- Instruction issue rate
- Execution pipelining
- Reservation stations
- Branch prediction
- Cache management
- Parallelism
- Parallelism: number of operations per cycle per processor
- Instruction level parallelism (ILP)
- Vector processing
- Parallelism: number of processors per node
- Parallelism: number of nodes in a system (these levels multiply; see the worked example below)
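As a quick sanity check of how these levels multiply, take the Blue Gene/L figures from slide 3 (the arithmetic is mine; the 4 flops per cycle assumes the PowerPC 440's two fused multiply-add pipes):

\[
P_{peak} = \underbrace{4\ \tfrac{\text{flops}}{\text{cycle}}}_{\text{per processor}} \times \underbrace{0.7\ \text{GHz}}_{\text{clock}} \times \underbrace{32768}_{\text{processors}} \approx 91.75\ \text{Tflops}
\]

The reported Linpack result of 70.72 Tflops is then roughly 77% of that peak.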
6. A Growth Factor of a Billion in Performance in a Single Lifetime
Machines spanning this growth:
- 1823 Babbage Difference Engine
- 1943 Harvard Mark 1
- 1949 EDSAC
- 1951 Univac 1
- 1959 IBM 7094
- 1964 CDC 6600
- 1976 Cray 1
- 1982 Cray XMP
- 1988 Cray YMP
- 1991 Intel Delta
- 1996 T3E
- 1997 ASCI Red
- 2001 Earth Simulator
- 2003 Cray X1
7. Moore's Law: an opportunity missed
8. Microprocessor Clock Speed
9. (No transcript)
10. Classes of Architecture for High Performance Computers
- Parallel Vector Processors (PVP)
- NEC Earth Simulator, SX-6
- Cray 1, 2, XMP, YMP, C90, T90, X1
- Fujitsu 5000 series
- Massively Parallel Processors (MPP)
- Intel Touchstone Delta and Paragon
- TMC CM-5
- IBM SP-2 and SP-3, Blue Gene/L
- Cray T3D, T3E, Red Storm/Strider
- Distributed Shared Memory (DSM)
- SGI Origin
- HP Superdome
- Single Instruction stream, Multiple Data stream (SIMD)
- Goodyear MPP, MasPar 1 and 2, TMC CM-2
- Commodity Clusters
- Beowulf-class PC/Linux clusters
- Constellations
- HP/Compaq SC, Linux NetworX MCR
11. (No transcript)
12. Beowulf Project
- Wiglaf - 1994
- 16 Intel 80486, 100 MHz
- VESA Local bus
- 256 Mbytes memory
- 6.4 Gbytes of disk
- Dual 10 base-T Ethernet
- 72 Mflops sustained
- $40K
- Hrothgar - 1995
- 16 Intel Pentium, 100 MHz
- PCI
- 1 Gbyte memory
- 6.4 Gbytes of disk
- 100 base-T Fast Ethernet (hub)
- 240 Mflops sustained
- $46K
- Hyglac - 1996 (Caltech)
- 16 Pentium Pro 200 MHz
- PCI
- 2 Gbytes memory
- 49.6 Gbytes of disk
- 100 base-T Fast Ethernet (switch)
- 1.25 Gflops sustained
- $50K
13. HPC Paths
14. (No transcript)
15. Why Fast Machines Run Slow
- Latency
- Waiting for access to memory or other parts of the system
- Overhead
- Extra work that has to be done to manage program concurrency and parallel resources, beyond the real work you want to perform
- Starvation
- Not enough work to do, due to insufficient parallelism or poor load balancing among distributed resources
- Contention
- Delays due to fighting over which task gets to use a shared resource next; network bandwidth is a major constraint (a rough model combining these terms follows)
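One way to fold the four factors into a single picture (an illustrative model, not taken from the slides): for W operations of useful work on P processors,

\[
T_{exec} \approx \frac{W}{P} + T_{starvation} + T_{overhead} + T_{latency} + T_{contention},
\qquad
\text{efficiency} = \frac{W/P}{T_{exec}}
\]

where the starvation term is idle time from insufficient parallelism or imbalance and the others are the stalls listed above. A fast machine runs slow whenever the last four terms dominate the useful-work term.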
16. The SIA CMOS Roadmap
17. Latency in a Single System
[Chart: ratio of memory access time to CPU cycle time over successive generations, labeled "The Wall"; an illustrative calculation follows.]
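For scale, with representative mid-2000s numbers that are my own illustration rather than values read off the chart:

\[
\frac{t_{memory\ access}}{t_{CPU\ cycle}} \approx \frac{70\ \text{ns}}{0.33\ \text{ns (3 GHz clock)}} \approx 200\ \text{cycles}
\]

so a single off-chip memory access costs on the order of a couple of hundred instruction issue slots.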
18. Microprocessors no longer realize the full potential of VLSI technology
[Chart: growth rates of roughly 52%/year versus 19%/year and 74%/year, opening gaps on the order of 30:1, 1,000:1, and 30,000:1.]
19. Opportunities for Future Custom MPP Architectures for Petaflops Computing
- ALU proliferation
- Lower ALU utilization can still improve performance per dollar (flops/$)
- Streaming (e.g. Bill Dally)
- Overhead mechanisms supported in hardware
- ISA for atomic compound operations on complex data
- Synchronization
- Communications
- Reconfigurable Logic
- Processor in Memory (PIM)
- 100X memory bandwidth
- Supports low/no temporal locality execution
- Latency hiding
- Multithreading
- Parcel-driven transaction processing
- Percolation (prestaging)
20. High Productivity Computing Systems
- Goal
- Provide a new generation of economically viable, high productivity computing systems for the national security and industrial user community (2009-2010)
- Impact
- Performance (time-to-solution): speed up critical national security applications by a factor of 10X to 40X
- Programmability (idea-to-first-solution): reduce the cost and time of developing application solutions
- Portability (transparency): insulate research and operational application software from the system
- Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors
HPCS Program Focus Areas
- Applications
- Intelligence/surveillance, reconnaissance,
cryptanalysis, weapons analysis, airborne
contaminant modeling and biotechnology
Fill the critical technology and capability gap between today (late-80s HPC technology) and the future (quantum/bio computing)
21. Cray Cascade: High Productivity Petaflops-scale Computer - 2010
- DARPA High Productivity Computing Systems Program
- Deliver sustained Petaflops performance by 2010
- Aggressively attacks causes of performance degradation
- Reduces contention through a high bandwidth network
- Latency hiding by vectors, multithreading, parcel-driven computation, and processor in memory
- Low overhead with efficient remote memory access and thread creation, PIM acquiring overhead tasks from the main processors, and hardware support for communications
- Starvation lowered by exposing fine-grain data parallelism
- Greatly simplifies user programming
- Distributed shared memory
- Hierarchical multithreaded execution model
- Low performance penalties for distributed execution
- Hardware support for performance tuning and correctness debugging
22. Cascade Architecture (logical view)
- Interconnection Network
- High bandwidth, low latency
- High radix routers
- Programming Environment
- Mixed UMA/NUMA programming model
- High productivity programming language
- Operating System
- Highly robust
- Highly scalable
- Global file system
- HWP (heavyweight processor)
- Clustered vectors
- Coarse-grained multithreading
- Compiler assisted cache
- LWP (lightweight processor)
- Highly concurrent scalar
- Fine-grained multithreading
- Remote thread creation
- System Technology
- Opto-electrical interconnect
- Cooling
[Diagram: locales connected by a locale interconnect; each locale contains an HWP with cache plus LWPs, with I/O connections to RAID, TCP/IP, and graphics.]
23. Processor in Memory (PIM)
- PIM merges logic with memory
- Wide ALUs next to the row buffer
- Optimized for memory throughput, not ALU utilization
- PIM has the potential of riding Moore's law while:
- greatly increasing effective memory bandwidth,
- providing many more concurrent execution threads,
- reducing latency,
- reducing power, and
- increasing overall system efficiency
- It may also simplify programming and system design (a row-buffer bandwidth estimate follows)
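A back-of-envelope version of the "wide ALUs next to the row buffer" argument, with all numbers being illustrative assumptions rather than figures from the slides: a 2 KB DRAM row available to on-chip logic every ~50 ns, against an 8-byte external bus at 400 MT/s, gives

\[
B_{internal} \approx \frac{2048\ \text{B}}{50\ \text{ns}} \approx 40\ \text{GB/s per bank},
\qquad
B_{external} \approx 8\ \text{B} \times 400\ \text{MT/s} = 3.2\ \text{GB/s}
\]

With several banks per chip operating concurrently, the internal advantage grows further, which is the source of the ~100X memory bandwidth claim on slide 19.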
24. Why is PIM Inevitable?
- Separation between memory and logic is artificial
- von Neumann bottleneck
- Imposed by technology limitations
- Not a desirable property of computer architecture
- Technology now brings down barrier
- We didn't do it because we couldn't do it
- We can do it, so we will do it
- What to do with a billion transistors
- Complexity cannot be extended indefinitely
- Synthesis of simple elements through replication
- Means to fault tolerance, lower power
- Normalize memory touch time by scaling bandwidth with capacity (see the note after this list)
- Without it, it takes ever longer to touch each block of memory
- Will be a mass-market commodity in the commercial market
- Drivers outside of HPC thrust
- Cousin to embedded computing
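The memory-touch-time point above is simple arithmetic (my notation): for total capacity C and aggregate memory bandwidth B, touching every block of memory once takes

\[
t_{touch} = \frac{C}{B}
\]

so when capacity grows faster than bandwidth from generation to generation, this time grows without bound. PIM scales B along with C, since every added memory part brings its own ALUs and row-buffer bandwidth, holding the touch time roughly constant.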
25. Roles for PIM
- Perform in-place operations on zero-reuse data (a per-node kernel is sketched after this list)
- Exploit high degree data parallelism
- Rapid updates on contiguous data blocks
- Rapid associative searches through contiguous data blocks
- Gather-scatters
- Tree/graph walking
- Enables efficient and concurrent array transpose
- Permits fine-grain manipulation of sparse and irregular data structures
- Parallel prefix operations
- In-memory data movement
- Memory management overhead work
- Engage in prestaging of data for HWT processors
- Fault monitoring, detection, and cleanup
- Manage 3/2 memory layer
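A minimal sketch in C of the first role above: an in-place associative search over a contiguous block with no data reuse. The pim_node_search name and the idea that each PIM node runs this over its own local rows are my illustration, not a MIND API.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node kernel: scan this node's local memory block for a key.
 * Every word is touched exactly once (zero reuse), so staging it through a
 * host cache is wasted effort; running beside the row buffer uses the full
 * row width and keeps the data in place. */
size_t pim_node_search(const uint64_t *local_block, size_t nwords,
                       uint64_t key, size_t *matches, size_t max_matches)
{
    size_t found = 0;
    for (size_t i = 0; i < nwords && found < max_matches; i++) {
        if (local_block[i] == key)
            matches[found++] = i;     /* record the offset of each hit */
    }
    return found;                     /* the host gathers per-node counts */
}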
26. Strategic Concepts of the MIND Architecture
- Virtual to physical address translation in memory (a directory-lookup sketch follows this list)
- Global distributed shared memory through a distributed directory table
- Dynamic page migration
- Wide registers serve as context sensitive TLB
- Multithreaded control
- Unified dynamic mechanism for resource management
- Latency hiding
- Real time response
- Parcel (active message) driven computing
- Decoupled split-transaction execution
- System wide latency hiding
- Move work to data instead of data to work
- Caching of external DRAM
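A minimal sketch of the distributed-directory translation named in the first item of this list; the types, page size, and home-node hashing are my assumptions, not the MIND design.

#include <stdint.h>

#define PAGE_BITS 12                     /* assume 4 KB pages */

typedef struct {                         /* one directory entry per virtual page */
    uint16_t home_node;                  /* PIM node currently holding the page  */
    uint32_t frame;                      /* physical frame within that node      */
} dir_entry_t;

/* The directory itself is distributed: each node owns the entries for the
 * pages that hash to it, so a translation may itself cost one parcel hop. */
static inline uint16_t directory_owner(uint64_t vpage, uint16_t nnodes)
{
    return (uint16_t)(vpage % nnodes);
}

/* Translate a virtual address once the owning node's entry has been fetched. */
static inline uint64_t translate(uint64_t vaddr, const dir_entry_t *e)
{
    uint64_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    return ((uint64_t)e->frame << PAGE_BITS) | offset;
}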
27. MIND Node
[Diagram: MIND node block diagram, including the memory address buffer and the parcel interface.]
28. Microprocessor with PIMs
[Diagram: a microprocessor (registers, cache, ALU, control) attached to PIM nodes 1..N, each containing memory and a PIM processor; annotated with model parameters such as cycle times, hit and miss times, load/store mix, and miss probability.]
29. Threads Timeline
[Diagram: timeline of heavyweight threads executing on the heavyweight thread processor while many lightweight threads run concurrently on the lightweight thread processors.]
30. Simulation of Performance Gain
[Chart: performance gain versus PIM workload.]
31. Simulation of PIM Execution Time
[Chart: time to execution versus number of PIM nodes.]
32. Analytical Expression for Relative Execution Time
[Equation slide; the expression itself is not reproduced in this transcript. An illustrative stand-in follows.]
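As a placeholder only, one conventional Amdahl-style form such a model could take (my reconstruction, not necessarily the authors' expression): with f the fraction of work offloaded to N PIM nodes, each a factor s slower than the heavyweight processor on that work,

\[
\frac{T_{PIM}}{T_{base}} \;=\; (1-f) \;+\; \frac{f\,s}{N}
\]

which reproduces the qualitative behavior of slides 31 and 33: relative execution time falls as N grows and saturates at 1 - f.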
33. Effect of PIM on Execution Time with Normalized Runtime
[Chart: relative time to execution versus number of PIM nodes.]
34. Parcels
- Parcels
- Enable lightweight communication between LWPs, or between the HWP and an LWP
- Contribute to system-wide latency management
- Support split-transaction message-driven computing
- Low overhead for efficient communication
- Implementation of remote thread creation (rtc)
- Implementation of remote memory references (a sketch of a parcel's fields follows this list)
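A minimal sketch of what a parcel might carry, pieced together from the field names on slide 35 (destination locale, target operand, target action code, methods/data, and a return parcel); the struct layout is my illustration, not the Cascade/MIND wire format.

#include <stdint.h>

#define PARCEL_MAX_PAYLOAD 64             /* illustrative fixed payload size */

typedef struct parcel {
    uint32_t dest_locale;                 /* locale holding the target data       */
    uint64_t target_addr;                 /* target operand: global virtual addr  */
    uint32_t action;                      /* target action code / method id       */
    uint32_t return_locale;               /* where the return parcel is sent      */
    uint64_t continuation;                /* thread or future to resume on reply  */
    uint8_t  payload[PARCEL_MAX_PAYLOAD]; /* argument data for the method         */
    uint32_t payload_len;
} parcel_t;

/* On arrival, the node decodes the action and spawns a lightweight thread at
 * the target address: the work moves to the data rather than data to work. */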
35. Parcels for remote threads
[Diagram: a remote-thread-create parcel travels to the destination locale carrying the target operand, target action code, methods, and data; a return parcel goes back to the source locale.]
36. Parcel Simulation: Latency Hiding Experiment
[Diagram: control experiment versus test experiment, each with groups of nodes connected by a flat network.]
37. Latency Hiding with Parcels, with respect to System Diameter in cycles
38. Latency Hiding with Parcels: Idle Time with respect to Degree of Parallelism
39. Multithreading in PIMs
- MIND must respond asynchronously to service requests from multiple sources
- Parcel-driven computing requires rapid response to incident packets
- Hardware supports multitasking for multiple concurrent method instantiations
- High memory bandwidth utilization by overlapping computation with access operations
- Manages shared on-chip resources
- Provides fine-grain context switching
- Latency hiding
40. Parcels, Multithreading, and Multiport Memory
41. MPI: The Failed Success
- A 10 year odyssey
- Community wide standard for parallel programming
- A proven, natural model for the distributed fragmented-memory class of systems
- User responsible for locality management
- User responsible for minimizing overhead
- User responsible for resource allocation
- User responsible for exposing parallelism
- Relied on ILP and OpenMP for more parallelism
- Mediocre scaling demands problem-size expansion for greater performance
- We are now constrained to legacy MPI codes (the fragment below shows the style of programming this locks in)
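As a reminder of what the bullets above mean in practice, a minimal MPI fragment in C: every data placement, transfer, and matching receive is spelled out by the programmer. The calls are standard MPI; the halo-exchange framing is only an example.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double boundary[1024] = {0}, halo[1024];
    int right = (rank + 1) % size, left = (rank + size - 1) % size;

    /* The user explicitly chooses what to send, to whom, and when; the
     * matching receive must be posted on the other side or the code hangs. */
    MPI_Sendrecv(boundary, 1024, MPI_DOUBLE, right, 0,
                 halo,     1024, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}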
42. What is required
- Global name spaces, for both data and active tasks
- Rich parallelism semantics and granularity
- Diversity of forms
- Tremendous increase in amount
- Support for sparse data parallelism
- Latency hiding
- Low overhead mechanisms
- Synchronization
- Scheduling
- Affinity semantics
- Do not rely on
- Direct control of hardware mechanisms
- Direct management and allocation of hardware resources
- Direct choreographing of physical data and task locality
43. ParalleX: a Parallel Programming Model
- Exposes parallelism in diverse forms and granularities
- Greatly increases available parallelism for speedup
- Matches more algorithms
- Exploits intrinsic parallelism of sparse data
- Exploits split transaction processing
- Decouples computation and communication
- Moves work to data, not just data to work
- Intrinsics for latency hiding
- Multithreading
- Message driven computation
- Efficient, lightweight synchronization with low overhead (a minimal future sketch follows this list)
- Register synchronization
- Futures with hardware support
- Lightweight objects
- Fine grain mutual exclusion
- Provides for global data and task name spaces
- Efficient remote memory accesses (e.g. shmem)
- Lightweight atomic memory operations
- Affinity attribute specifiers
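A minimal sketch of the "futures" idea from the list above, using C11 atomics; ParalleX defines no C API here, so this shows only the concept of a split-transaction value that a consumer waits on.

#include <stdatomic.h>
#include <stdint.h>

/* A lightweight future: a value slot plus a ready flag.  A consumer touching
 * an unready future waits locally (here it spins) instead of stalling the
 * whole processor, which is the split-transaction, latency-hiding idea. */
typedef struct {
    _Atomic int ready;
    uint64_t    value;
} future_t;

static inline void future_set(future_t *f, uint64_t v)
{
    f->value = v;
    atomic_store_explicit(&f->ready, 1, memory_order_release);
}

static inline uint64_t future_get(future_t *f)
{
    while (!atomic_load_explicit(&f->ready, memory_order_acquire))
        ;   /* a real runtime would switch to another lightweight thread here */
    return f->value;
}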
44. Agincourt: A Latency Tolerant Parallel Programming Language
- Bridges the cluster gap
- Eliminates constraints of message passing model
- Reduces the need for inefficient global barrier synchronization
- Mitigates the local-to-remote access time disparity
- Removes OS from critical path
- Greatly simplifies programming
- Global single system image
- Manipulates sparse, irregular, time-varying metadata
- Facilitates dynamic adaptive applications
- Dramatic performance advantage
- Lower overhead
- Latency hiding
- Load balancing
45. This could be a very bad idea
- New languages almost always fail
- Fancy-assed languages usually do not match the needs of system hardware
- Compilers take forever to bring to maturity
- People, quite reasonably, like what they do; they don't want to change
- People feel threatened by others who want to impose silly, naive, expensive, impractical, unilateral ideas
- Acceptance is a big issue
- And then there's the legacy problem
46. Real-World Practical Petaflops Computer Systems
- Sustained Petaflops performance on a wide range of applications.
- Full peta-scale system resources of a Petaflops computer routinely allocated to real-world users, not just for demos before SCXY.
- There are many Petaflops computers available throughout the nation, not just at a couple of National Laboratories.
- Size, power, cooling, and cost not prohibitive.
- Programming is tractable, so that a scientist can
use it and not change professions in the process.
47. 1 Petaflops is only the beginning