Title: Critical Factors and Directions for Petaflops-scale Supercomputers
1. Critical Factors and Directions for Petaflops-scale Supercomputers
Presentation to IFIP WG10.3 e-Seminar Series
- Thomas Sterling
- California Institute of Technology
- and
- NASA Jet Propulsion Laboratory
- January 4, 2005
2. IBM BG/L: Fastest Computer in the World
3. Blue Gene/L: 71 Teraflops Linpack Performance
- IBM BlueGene/L DD2 beta-System
- Peak Performance 91.75 Tflops
- Linpack Performance 70.72 Tflops
- Based on the IBM 0.7 GHz PowerPC 440
- 2.8 Gflops peak per processor (2 processors per ASIC)
- 32768 processors
- 128 MB DDR per processor, 4 TB in the system
- 3D torus network plus a combining tree network
- 100 Tbytes disk storage
- Power consumption of 500 kW
4. (No transcript)
5. Where Does Performance Come From?
- Device Technology
- Logic switching speed and device density
- Memory capacity and access time
- Communications bandwidth and latency
- Computer Architecture
- Instruction issue rate
- Execution pipelining
- Reservation stations
- Branch prediction
- Cache management
- Parallelism
- Parallelism: number of operations per cycle per processor
- Instruction level parallelism (ILP)
- Vector processing
- Parallelism: number of processors per node
- Parallelism: number of nodes in a system (these levels multiply; see the worked example below)
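As a quick sanity check of how these levels multiply, take the Blue Gene/L figures from slide 3 (the arithmetic is mine; the 4 flops per cycle assumes the PowerPC 440's two fused multiply-add pipes):

\[
P_{peak} = \underbrace{4\ \tfrac{\text{flops}}{\text{cycle}}}_{\text{per processor}} \times \underbrace{0.7\ \text{GHz}}_{\text{clock}} \times \underbrace{32768}_{\text{processors}} \approx 91.75\ \text{Tflops}
\]

The reported Linpack result of 70.72 Tflops is then roughly 77% of that peak.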
6. A Growth Factor of a Billion in Performance in a Single Lifetime
Machines spanning this growth:
- 1823 Babbage Difference Engine
- 1943 Harvard Mark 1
- 1949 EDSAC
- 1951 Univac 1
- 1959 IBM 7094
- 1964 CDC 6600
- 1976 Cray 1
- 1982 Cray XMP
- 1988 Cray YMP
- 1991 Intel Delta
- 1996 T3E
- 1997 ASCI Red
- 2001 Earth Simulator
- 2003 Cray X1
7. Moore's Law: an opportunity missed
8. Microprocessor Clock Speed
9. (No transcript)
10. Classes of Architecture for High Performance Computers
- Parallel Vector Processors (PVP)
- NEC Earth Simulator, SX-6
- Cray 1, 2, XMP, YMP, C90, T90, X1
- Fujitsu 5000 series
- Massively Parallel Processors (MPP)
- Intel Touchstone Delta and Paragon
- TMC CM-5
- IBM SP-2 and SP-3, Blue Gene/L
- Cray T3D, T3E, Red Storm/Strider
- Distributed Shared Memory (DSM)
- SGI Origin
- HP Superdome
- Single Instruction stream, Multiple Data stream (SIMD)
- Goodyear MPP, MasPar 1 and 2, TMC CM-2
- Commodity Clusters
- Beowulf-class PC/Linux clusters
- Constellations
- HP/Compaq SC, Linux NetworX MCR
11. (No transcript)
12. Beowulf Project
- Wiglaf - 1994
- 16 Intel 80486, 100 MHz
- VESA Local bus
- 256 Mbytes memory
- 6.4 Gbytes of disk
- Dual 10 base-T Ethernet
- 72 Mflops sustained
- $40K
- Hrothgar - 1995
- 16 Intel Pentium, 100 MHz
- PCI
- 1 Gbyte memory
- 6.4 Gbytes of disk
- 100 base-T Fast Ethernet (hub)
- 240 Mflops sustained
- $46K
- Hyglac - 1996 (Caltech)
- 16 Pentium Pro 200 MHz
- PCI
- 2 Gbytes memory
- 49.6 Gbytes of disk
- 100 base-T Fast Ethernet (switch)
- 1.25 Gflops sustained
- $50K
13. HPC Paths
14. (No transcript)
15. Why Fast Machines Run Slow
- Latency
- Waiting for access to memory or other parts of the system
- Overhead
- Extra work that has to be done to manage program concurrency and parallel resources, beyond the real work you want to perform
- Starvation
- Not enough work to do, due to insufficient parallelism or poor load balancing among distributed resources
- Contention
- Delays due to fighting over which task gets to use a shared resource next; network bandwidth is a major constraint (a rough model combining these terms follows)
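One way to fold the four factors into a single picture (an illustrative model, not taken from the slides): for W operations of useful work on P processors,

\[
T_{exec} \approx \frac{W}{P} + T_{starvation} + T_{overhead} + T_{latency} + T_{contention},
\qquad
\text{efficiency} = \frac{W/P}{T_{exec}}
\]

where the starvation term is idle time from insufficient parallelism or imbalance and the others are the stalls listed above. A fast machine runs slow whenever the last four terms dominate the useful-work term.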
16. The SIA CMOS Roadmap
17. Latency in a Single System
[Chart: ratio of memory access time to CPU cycle time over successive generations, labeled "The Wall"; an illustrative calculation follows.]
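For scale, with representative mid-2000s numbers that are my own illustration rather than values read off the chart:

\[
\frac{t_{memory\ access}}{t_{CPU\ cycle}} \approx \frac{70\ \text{ns}}{0.33\ \text{ns (3 GHz clock)}} \approx 200\ \text{cycles}
\]

so a single off-chip memory access costs on the order of a couple of hundred instruction issue slots.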
18. Microprocessors no longer realize the full potential of VLSI technology
[Chart: growth rates of roughly 52%/year versus 19%/year and 74%/year, opening gaps on the order of 30:1, 1,000:1, and 30,000:1.]
19. Opportunities for Future Custom MPP Architectures for Petaflops Computing
- ALU proliferation
- Lower ALU utilization can still improve performance per dollar (flops/$)
- Streaming (e.g. Bill Dally)
- Overhead mechanisms supported in hardware
- ISA for atomic compound operations on complex data
- Synchronization
- Communications
- Reconfigurable Logic
- Processor in Memory (PIM)
- 100X memory bandwidth
- Supports low/no temporal locality execution
- Latency hiding
- Multithreading
- Parcel-driven transaction processing
- Percolation (prestaging)
20. High Productivity Computing Systems
- Goal
- Provide a new generation of economically viable, high productivity computing systems for the national security and industrial user community (2009-2010)
- Impact
- Performance (time-to-solution): speed up critical national security applications by a factor of 10X to 40X
- Programmability (idea-to-first-solution): reduce the cost and time of developing application solutions
- Portability (transparency): insulate research and operational application software from the system
- Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors
HPCS Program Focus Areas
- Applications
- Intelligence/surveillance, reconnaissance,
cryptanalysis, weapons analysis, airborne
contaminant modeling and biotechnology
Fill the critical technology and capability gap between today (late-80s HPC technology) and the future (quantum/bio computing)
21. Cray Cascade: High Productivity Petaflops-scale Computer - 2010
- DARPA High Productivity Computing Systems Program
- Deliver sustained Petaflops performance by 2010
- Aggressively attacks causes of performance degradation
- Reduces contention through a high bandwidth network
- Latency hiding by vectors, multithreading, parcel-driven computation, and processor in memory
- Low overhead with efficient remote memory access and thread creation, PIM acquiring overhead tasks from the main processors, and hardware support for communications
- Starvation lowered by exposing fine-grain data parallelism
- Greatly simplifies user programming
- Distributed shared memory
- Hierarchical multithreaded execution model
- Low performance penalties for distributed execution
- Hardware support for performance tuning and correctness debugging
22. Cascade Architecture (logical view)
- Interconnection Network
- High bandwidth, low latency
- High radix routers
- Programming Environment
- Mixed UMA/NUMA programming model
- High productivity programming language
- Operating System
- Highly robust
- Highly scalable
- Global file system
- HWP (heavyweight processor)
- Clustered vectors
- Coarse-grained multithreading
- Compiler assisted cache
- LWP (lightweight processor)
- Highly concurrent scalar
- Fine-grained multithreading
- Remote thread creation
- System Technology
- Opto-electrical interconnect
- Cooling
[Diagram: locales connected by a locale interconnect; each locale contains an HWP with cache plus LWPs, with I/O connections to RAID, TCP/IP, and graphics.]
23. Processor in Memory (PIM)
- PIM merges logic with memory
- Wide ALUs next to the row buffer
- Optimized for memory throughput, not ALU utilization
- PIM has the potential of riding Moore's law while:
- greatly increasing effective memory bandwidth,
- providing many more concurrent execution threads,
- reducing latency,
- reducing power, and
- increasing overall system efficiency
- It may also simplify programming and system design (a row-buffer bandwidth estimate follows)
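A back-of-envelope version of the "wide ALUs next to the row buffer" argument, with all numbers being illustrative assumptions rather than figures from the slides: a 2 KB DRAM row available to on-chip logic every ~50 ns, against an 8-byte external bus at 400 MT/s, gives

\[
B_{internal} \approx \frac{2048\ \text{B}}{50\ \text{ns}} \approx 40\ \text{GB/s per bank},
\qquad
B_{external} \approx 8\ \text{B} \times 400\ \text{MT/s} = 3.2\ \text{GB/s}
\]

With several banks per chip operating concurrently, the internal advantage grows further, which is the source of the ~100X memory bandwidth claim on slide 19.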
24. Why is PIM Inevitable?
- Separation between memory and logic is artificial
- von Neumann bottleneck
- Imposed by technology limitations
- Not a desirable property of computer architecture
- Technology now brings down barrier
- We didn't do it because we couldn't do it
- We can do it, so we will do it
- What to do with a billion transistors
- Complexity cannot be extended indefinitely
- Synthesis of simple elements through replication
- Means to fault tolerance, lower power
- Normalize memory touch time by scaling bandwidth with capacity (see the note after this list)
- Without it, it takes ever longer to touch each block of memory
- Will be a mass-market commodity in the commercial market
- Drivers outside of HPC thrust
- Cousin to embedded computing
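The memory-touch-time point above is simple arithmetic (my notation): for total capacity C and aggregate memory bandwidth B, touching every block of memory once takes

\[
t_{touch} = \frac{C}{B}
\]

so when capacity grows faster than bandwidth from generation to generation, this time grows without bound. PIM scales B along with C, since every added memory part brings its own ALUs and row-buffer bandwidth, holding the touch time roughly constant.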
25. Roles for PIM
- Perform in-place operations on zero-reuse data (a per-node kernel is sketched after this list)
- Exploit high degree data parallelism
- Rapid updates on contiguous data blocks
- Rapid associative searches through contiguous data blocks
- Gather-scatters
- Tree/graph walking
- Enables efficient and concurrent array transpose
- Permits fine-grain manipulation of sparse and irregular data structures
- Parallel prefix operations
- In-memory data movement
- Memory management overhead work
- Engage in prestaging of data for HWT processors
- Fault monitoring, detection, and cleanup
- Manage 3/2 memory layer
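A minimal sketch in C of the first role above: an in-place associative search over a contiguous block with no data reuse. The pim_node_search name and the idea that each PIM node runs this over its own local rows are my illustration, not a MIND API.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node kernel: scan this node's local memory block for a key.
 * Every word is touched exactly once (zero reuse), so staging it through a
 * host cache is wasted effort; running beside the row buffer uses the full
 * row width and keeps the data in place. */
size_t pim_node_search(const uint64_t *local_block, size_t nwords,
                       uint64_t key, size_t *matches, size_t max_matches)
{
    size_t found = 0;
    for (size_t i = 0; i < nwords && found < max_matches; i++) {
        if (local_block[i] == key)
            matches[found++] = i;     /* record the offset of each hit */
    }
    return found;                     /* the host gathers per-node counts */
}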
26. Strategic Concepts of the MIND Architecture
- Virtual to physical address translation in memory (a directory-lookup sketch follows this list)
- Global distributed shared memory through a distributed directory table
- Dynamic page migration
- Wide registers serve as context sensitive TLB
- Multithreaded control
- Unified dynamic mechanism for resource management
- Latency hiding
- Real time response
- Parcel (active message) driven computing
- Decoupled split-transaction execution
- System wide latency hiding
- Move work to data instead of data to work
- Caching of external DRAM
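A minimal sketch of the distributed-directory translation named in the first item of this list; the types, page size, and home-node hashing are my assumptions, not the MIND design.

#include <stdint.h>

#define PAGE_BITS 12                     /* assume 4 KB pages */

typedef struct {                         /* one directory entry per virtual page */
    uint16_t home_node;                  /* PIM node currently holding the page  */
    uint32_t frame;                      /* physical frame within that node      */
} dir_entry_t;

/* The directory itself is distributed: each node owns the entries for the
 * pages that hash to it, so a translation may itself cost one parcel hop. */
static inline uint16_t directory_owner(uint64_t vpage, uint16_t nnodes)
{
    return (uint16_t)(vpage % nnodes);
}

/* Translate a virtual address once the owning node's entry has been fetched. */
static inline uint64_t translate(uint64_t vaddr, const dir_entry_t *e)
{
    uint64_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    return ((uint64_t)e->frame << PAGE_BITS) | offset;
}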
27. MIND Node
[Diagram: MIND node block diagram, including the memory address buffer and the parcel interface.]
28. Microprocessor with PIMs
[Diagram: a microprocessor (registers, cache, ALU, control) attached to PIM nodes 1..N, each containing memory and a PIM processor; annotated with model parameters such as cycle times, hit and miss times, load/store mix, and miss probability.]
29. Threads Timeline
[Diagram: timeline of heavyweight threads executing on the heavyweight thread processor while many lightweight threads run concurrently on the lightweight thread processors.]
30. Simulation of Performance Gain
[Chart: performance gain versus PIM workload.]
31. Simulation of PIM Execution Time
[Chart: time to execution versus number of PIM nodes.]
32. Analytical Expression for Relative Execution Time
[Equation slide; the expression itself is not reproduced in this transcript. An illustrative stand-in follows.]
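As a placeholder only, one conventional Amdahl-style form such a model could take (my reconstruction, not necessarily the authors' expression): with f the fraction of work offloaded to N PIM nodes, each a factor s slower than the heavyweight processor on that work,

\[
\frac{T_{PIM}}{T_{base}} \;=\; (1-f) \;+\; \frac{f\,s}{N}
\]

which reproduces the qualitative behavior of slides 31 and 33: relative execution time falls as N grows and saturates at 1 - f.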
33. Effect of PIM on Execution Time with Normalized Runtime
[Chart: relative time to execution versus number of PIM nodes.]
34. Parcels
- Parcels
- Enable lightweight communication between LWPs, or between the HWP and an LWP
- Contribute to system-wide latency management
- Support split-transaction message-driven computing
- Low overhead for efficient communication
- Implementation of remote thread creation (rtc)
- Implementation of remote memory references (a sketch of a parcel's fields follows this list)
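A minimal sketch of what a parcel might carry, pieced together from the field names on slide 35 (destination locale, target operand, target action code, methods/data, and a return parcel); the struct layout is my illustration, not the Cascade/MIND wire format.

#include <stdint.h>

#define PARCEL_MAX_PAYLOAD 64             /* illustrative fixed payload size */

typedef struct parcel {
    uint32_t dest_locale;                 /* locale holding the target data       */
    uint64_t target_addr;                 /* target operand: global virtual addr  */
    uint32_t action;                      /* target action code / method id       */
    uint32_t return_locale;               /* where the return parcel is sent      */
    uint64_t continuation;                /* thread or future to resume on reply  */
    uint8_t  payload[PARCEL_MAX_PAYLOAD]; /* argument data for the method         */
    uint32_t payload_len;
} parcel_t;

/* On arrival, the node decodes the action and spawns a lightweight thread at
 * the target address: the work moves to the data rather than data to work. */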
35. Parcels for remote threads
[Diagram: a remote-thread-create parcel travels to the destination locale carrying the target operand, target action code, methods, and data; a return parcel goes back to the source locale.]
36. Parcel Simulation: Latency Hiding Experiment
[Diagram: control experiment versus test experiment, each with groups of nodes connected by a flat network.]
37. Latency Hiding with Parcels, with respect to System Diameter in cycles
38. Latency Hiding with Parcels: Idle Time with respect to Degree of Parallelism
39. Multithreading in PIMs
- MIND must respond asynchronously to service requests from multiple sources
- Parcel-driven computing requires rapid response to incident packets
- Hardware supports multitasking for multiple concurrent method instantiations
- High memory bandwidth utilization by overlapping computation with access operations
- Manages shared on-chip resources
- Provides fine-grain context switching
- Latency hiding
40. Parcels, Multithreading, and Multiport Memory
41. MPI: The Failed Success
- A 10 year odyssey
- Community wide standard for parallel programming
- A proven, natural model for the distributed fragmented-memory class of systems
- User responsible for locality management
- User responsible for minimizing overhead
- User responsible for resource allocation
- User responsible for exposing parallelism
- Relied on ILP and OpenMP for more parallelism
- Mediocre scaling demands problem-size expansion for greater performance
- We are now constrained to legacy MPI codes (the fragment below shows the style of programming this locks in)
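As a reminder of what the bullets above mean in practice, a minimal MPI fragment in C: every data placement, transfer, and matching receive is spelled out by the programmer. The calls are standard MPI; the halo-exchange framing is only an example.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double boundary[1024] = {0}, halo[1024];
    int right = (rank + 1) % size, left = (rank + size - 1) % size;

    /* The user explicitly chooses what to send, to whom, and when; the
     * matching receive must be posted on the other side or the code hangs. */
    MPI_Sendrecv(boundary, 1024, MPI_DOUBLE, right, 0,
                 halo,     1024, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}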
42. What is required
- Global name spaces, for both data and active tasks
- Rich parallelism semantics and granularity
- Diversity of forms
- Tremendous increase in amount
- Support for sparse data parallelism
- Latency hiding
- Low overhead mechanisms
- Synchronization
- Scheduling
- Affinity semantics
- Do not rely on
- Direct control of hardware mechanisms
- Direct management and allocation of hardware resources
- Direct choreographing of physical data and task locality
43. ParalleX: a Parallel Programming Model
- Exposes parallelism in diverse forms and granularities
- Greatly increases available parallelism for speedup
- Matches more algorithms
- Exploits intrinsic parallelism of sparse data
- Exploits split transaction processing
- Decouples computation and communication
- Moves work to data, not just data to work
- Intrinsics for latency hiding
- Multithreading
- Message driven computation
- Efficient, lightweight synchronization with low overhead (a minimal future sketch follows this list)
- Register synchronization
- Futures with hardware support
- Lightweight objects
- Fine grain mutual exclusion
- Provides for global data and task name spaces
- Efficient remote memory accesses (e.g. shmem)
- Lightweight atomic memory operations
- Affinity attribute specifiers
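A minimal sketch of the "futures" idea from the list above, using C11 atomics; ParalleX defines no C API here, so this shows only the concept of a split-transaction value that a consumer waits on.

#include <stdatomic.h>
#include <stdint.h>

/* A lightweight future: a value slot plus a ready flag.  A consumer touching
 * an unready future waits locally (here it spins) instead of stalling the
 * whole processor, which is the split-transaction, latency-hiding idea. */
typedef struct {
    _Atomic int ready;
    uint64_t    value;
} future_t;

static inline void future_set(future_t *f, uint64_t v)
{
    f->value = v;
    atomic_store_explicit(&f->ready, 1, memory_order_release);
}

static inline uint64_t future_get(future_t *f)
{
    while (!atomic_load_explicit(&f->ready, memory_order_acquire))
        ;   /* a real runtime would switch to another lightweight thread here */
    return f->value;
}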
44. Agincourt: A Latency Tolerant Parallel Programming Language
- Bridges the cluster gap
- Eliminates constraints of message passing model
- Reduces the need for inefficient global barrier synchronization
- Mitigates the local-to-remote access time disparity
- Removes OS from critical path
- Greatly simplifies programming
- Global single system image
- Manipulates sparse, irregular, time-varying metadata
- Facilitates dynamic adaptive applications
- Dramatic performance advantage
- Lower overhead
- Latency hiding
- Load balancing
45. This could be a very bad idea
- New languages almost always fail
- Fancy-assed languages usually do not match the needs of system hardware
- Compilers take forever to bring to maturity
- People, quite reasonably, like what they do; they don't want to change
- People feel threatened by others who want to impose silly, naive, expensive, impractical, unilateral ideas
- Acceptance is a big issue
- And then there's the legacy problem
46. Real-World Practical Petaflops Computer Systems
- Sustained Petaflops performance on a wide range of applications.
- Full peta-scale system resources of a Petaflops computer routinely allocated to real-world users, not just for demos before SCXY.
- There are many Petaflops computers available throughout the nation, not just at a couple of National Laboratories.
- Size, power, cooling, and cost not prohibitive.
- Programming is tractable, so that a scientist can
use it and not change professions in the process.
47. 1 Petaflops is only the beginning