Critical Factors and Directions for Petaflops-scale Supercomputers - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Critical Factors and Directions for Petaflops-scale Supercomputers


1
Critical Factors and Directions for
Petaflops-scale Supercomputers
Presentation to IFIP WG10.3 e-Seminar Series
  • Thomas Sterling
  • California Institute of Technology
  • and
  • NASA Jet Propulsion Laboratory
  • January 4, 2005

2
IBM BG/L: Fastest Computer in the World
3
Blue Gene/L: 71 Teraflops Linpack Performance
  • IBM BlueGene/L DD2 beta-System
  • Peak Performance 91.75 Tflops
  • Linpack Performance 70.72 Tflops
  • Based on the IBM 0.7 GHz PowerPC 440
  • 2.8 Gflops peak per processor, 2 processors per ASIC
  • 32768 processors
  • 128 MB/processor DDR, 4 TB system
  • 3D torus network plus combining-tree network
  • 100 Tbytes disk storage
  • Power consumption of 500 kilowatts

4
(No Transcript)
5
Where Does Performance Come From?
  • Device Technology
  • Logic switching speed and device density
  • Memory capacity and access time
  • Communications bandwidth and latency
  • Computer Architecture
  • Instruction issue rate
  • Execution pipelining
  • Reservation stations
  • Branch prediction
  • Cache management
  • Parallelism
  • Parallelism: number of operations per cycle per
    processor
  • Instruction level parallelism (ILP)
  • Vector processing
  • Parallelism: number of processors per node
  • Parallelism: number of nodes in a system

6
A Growth-Factor of a Billion in Performance in a
Single Lifetime
1823 Babbage Difference Engine
1943 Harvard Mark 1
1949 EDSAC
1951 Univac 1
1959 IBM 7094
1964 CDC 6600
1976 Cray 1
1982 Cray XMP
1988 Cray YMP
1991 Intel Delta
1996 T3E
1997 ASCI Red
2001 Earth Simulator
2003 Cray X1
7
Moore's Law: an opportunity missed
8
Microprocessor Clock Speed
9
(No Transcript)
10
Classes of Architecture for High Performance
Computers
  • Parallel Vector Processors (PVP)
  • NEC Earth Simulator, SX-6
  • Cray 1, 2, XMP, YMP, C90, T90, X1
  • Fujitsu 5000 series
  • Massively Parallel Processors (MPP)
  • Intel Touchstone Delta, Paragon
  • TMC CM-5
  • IBM SP-2 and SP-3, Blue Gene/L
  • Cray T3D, T3E, Red Storm/Strider
  • Distributed Shared Memory (DSM)
  • SGI Origin
  • HP Superdome
  • Single Instruction stream, Multiple Data stream
    (SIMD)
  • Goodyear MPP, MasPar 1 and 2, TMC CM-2
  • Commodity Clusters
  • Beowulf-class PC/Linux clusters
  • Constellations
  • HP/Compaq SC, Linux NetworX MCR

11
(No Transcript)
12
Beowulf Project
  • Wiglaf - 1994
  • 16 Intel 80486 100 MHz
  • VESA Local bus
  • 256 Mbytes memory
  • 6.4 Gbytes of disk
  • Dual 10 base-T Ethernet
  • 72 Mflops sustained
  • $40K
  • Hrothgar - 1995
  • 16 Intel Pentium 100 MHz
  • PCI
  • 1 Gbyte memory
  • 6.4 Gbytes of disk
  • 100 base-T Fast Ethernet (hub)
  • 240 Mflops sustained
  • $46K
  • Hyglac - 1996 (Caltech)
  • 16 Pentium Pro 200 MHz
  • PCI
  • 2 Gbytes memory
  • 49.6 Gbytes of disk
  • 100 base-T Fast Ethernet (switch)
  • 1.25 Gflops sustained
  • $50K

13
HPC Paths
14
(No Transcript)
15
Why Fast Machines Run Slow
  • Latency
  • Waiting for access to memory or other parts of
    the system
  • Overhead
  • Extra work that has to be done to manage program
    concurrency and parallel resources, beyond the
    real work you want to perform
  • Starvation
  • Not enough work to do due to insufficient
    parallelism or poor load balancing among
    distributed resources
  • Contention
  • Delays due to fighting over which task gets to use
    a shared resource next. Network bandwidth is a
    major constraint. (A toy cost model combining all
    four factors is sketched below.)
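To make the four factors above concrete, here is a toy cost model in C. The formula and all parameter values are illustrative assumptions for this transcript, not a model from the original slides: ideal compute time is inflated by load imbalance (starvation), per-task management work (overhead), stalled memory references (latency), and a multiplier on shared-network traffic (contention).

```c
#include <stdio.h>

/* Toy cost model (illustrative only): estimates wall-clock time from the
 * four degradation sources named on the slide. */
typedef struct {
    double work_flops;         /* total useful operations                 */
    double flops_per_sec;      /* peak rate of one processor              */
    int    procs;              /* number of processors                    */
    double imbalance;          /* starvation: 0 = perfect load balance    */
    double tasks;              /* tasks/messages that must be managed     */
    double overhead_per_task;  /* seconds of bookkeeping per task         */
    double mem_refs;           /* references that stall on the critical path */
    double latency;            /* seconds per stalled reference           */
    double net_bytes;          /* bytes pushed through the shared network */
    double net_bandwidth;      /* bytes/sec available when uncontended    */
    double contention;         /* >1 inflates network time under load     */
} cost_model;

static double estimated_time(const cost_model *m)
{
    double compute    = m->work_flops / (m->flops_per_sec * m->procs);
    double starvation = compute * m->imbalance;            /* idle waiting */
    double overhead   = m->tasks * m->overhead_per_task;   /* management   */
    double lat_stall  = m->mem_refs * m->latency;          /* latency      */
    double net        = (m->net_bytes / m->net_bandwidth) * m->contention;
    return compute + starvation + overhead + lat_stall + net;
}

int main(void)
{
    /* Made-up numbers, chosen only to show how quickly efficiency erodes. */
    cost_model m = { 1e12, 2.8e9, 1024, 0.15, 1e6, 5e-6,
                     1e7, 200e-9, 1e9, 1e8, 1.5 };
    double t     = estimated_time(&m);
    double ideal = m.work_flops / (m.flops_per_sec * m.procs);
    printf("estimated time %.3f s, efficiency %.1f%%\n", t, 100.0 * ideal / t);
    return 0;
}
```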

16
The SIA CMOS Roadmap
17
Latency in a Single System
[Chart: ratio of memory access time to CPU time, rising over the years toward "THE WALL".]
18
Microprocessors no longer realize the full
potential of VLSI technology
[Chart: growth rates of 52%/year, 74%/year, and 19%/year, and the resulting gaps of 30:1, 1,000:1, and 30,000:1.]
19
Opportunities for Future Custom MPP Architectures
for Petaflops Computing
  • ALU proliferation
  • Lower ALU utilization, but improved performance
    per dollar (flops/$)
  • Streaming (e.g. Bill Dally)
  • Overhead mechanisms supported in hardware
  • ISA for atomic compound operations on complex
    data
  • Synchronization
  • Communications
  • Reconfigurable Logic
  • Processor in Memory (PIM)
  • 100X memory bandwidth
  • Supports low/no temporal locality execution
  • Latency hiding
  • Multithreading
  • Parcel driven transaction processing
  • Percolation (prestaging)

20
High Productivity Computing Systems
  • Goal
  • Provide a new generation of economically viable
    high productivity computing systems for the
    national security and industrial user community
    (2009-2010)
  • Impact
  • Performance (time-to-solution): speed up critical
    national security applications by a factor of 10X
    to 40X
  • Programmability (idea-to-first-solution): reduce
    cost and time of developing application solutions
  • Portability (transparency): insulate research and
    operational application software from the system
  • Robustness (reliability): apply all known
    techniques to protect against outside attacks,
    hardware faults, and programming errors

HPCS Program Focus Areas
  • Applications
  • Intelligence/surveillance, reconnaissance,
    cryptanalysis, weapons analysis, airborne
    contaminant modeling and biotechnology

Fill the critical technology and capability gap:
from today (late-80s HPC technology) to the future
(quantum/bio computing)
21
Cray Cascade High Productivity Petaflops-scale
Computer - 2010
  • DARPA High Productivity Computing Systems Program
  • Deliver sustained Petaflops performance by 2010
  • Aggressively attacks causes of performance
    degradation
  • Reduces contention through high bandwidth network
  • Latency hiding by vectors, multithreading,
    parcel-driven computation, and processor in
    memory
  • Low overhead with efficient remote memory access
    and thread creation, PIM acquiring overhead tasks
    from main processors, hardware support for
    communications
  • Starvation lowered by exposing fine grain data
    parallelism
  • Greatly simplifies user programming
  • Distributed shared memory
  • Hierarchical multithreaded execution model
  • Low performance penalties for distributed
    execution
  • Hardware support for performance tuning and
    correctness debugging

22
Cascade Architecture (logical view)
  • Interconnection Network
  • High bandwidth, low latency
  • High radix routers
  • Programming Environment
  • Mixed UMA/NUMA programming model
  • High productivity programming language
  • Operating System
  • Highly robust
  • Highly scalable
  • Global file system
  • HWP (heavyweight processor)
  • Clustered vectors
  • Coarse-grained multithreading
  • Compiler-assisted cache
  • LWP (lightweight processor)
  • Highly concurrent scalar
  • Fine-grained multithreading
  • Remote thread creation
  • System Technology
  • Opto-electrical interconnect
  • Cooling

[Diagram: many locales connected by a locale interconnect; each locale pairs an HWP and cache with lightweight processors, with I/O attachments to RAID storage, TCP/IP networking, and graphics.]
23
Processor in Memory (PIM)
  • PIM merges logic with memory
  • Wide ALUs next to the row buffer
  • Optimized for memory throughput, not ALU
    utilization
  • PIM has the potential of riding Moore's law while
  • greatly increasing effective memory bandwidth,
  • providing many more concurrent execution threads,
  • reducing latency,
  • reducing power, and
  • increasing overall system efficiency
  • It may also simplify programming and system design

24
Why is PIM Inevitable?
  • Separation between memory and logic is artificial
  • von Neumann bottleneck
  • Imposed by technology limitations
  • Not a desirable property of computer architecture
  • Technology now brings down the barrier
  • We didn't do it because we couldn't do it
  • We can do it, so we will do it
  • What to do with a billion transistors?
  • Complexity cannot be extended indefinitely
  • Synthesis of simple elements through replication
  • Means to fault tolerance, lower power
  • Normalize memory touch time by scaling bandwidth
    with capacity
  • Without it, it takes ever longer to touch each
    memory block
  • Will be a mass-market commodity
  • Commercial market drivers outside of the HPC thrust
  • Cousin to embedded computing

25
Roles for PIM
  • Perform in-place operations on zero-reuse data
  • Exploit high degree data parallelism
  • Rapid updates on contiguous data blocks
  • Rapid associative searches through contiguous
    data blocks
  • Gather-scatters
  • Tree/graph walking
  • Enables efficient and concurrent array transpose
  • Permits fine grain manipulation of sparse and
    irregular data structures
  • Parallel prefix operations
  • In-memory data movement
  • Memory management overhead work
  • Engage in prestaging of data for HWT processors
  • Fault monitoring, detection, and cleanup
  • Manage 3/2 memory layer
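As a concrete illustration of the first role above, in-place operations on zero-reuse data, the C sketch below emulates what a PIM node's wide ALU next to the row buffer might do: an in-place update and an associative search over a contiguous block, touching each word exactly once. The row width and function names are assumptions for illustration, not MIND hardware parameters.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed row-buffer width for illustration: 32 x 64-bit words. */
#define ROW_WORDS 32

/* In-place update of a contiguous block, one "row" at a time, as a PIM
 * node might apply its wide ALU next to the row buffer (zero data reuse). */
static void pim_row_increment(uint64_t *mem, size_t nwords, uint64_t delta)
{
    for (size_t row = 0; row < nwords; row += ROW_WORDS) {
        size_t end = row + ROW_WORDS < nwords ? row + ROW_WORDS : nwords;
        for (size_t i = row; i < end; i++)   /* conceptually one wide op */
            mem[i] += delta;
    }
}

/* Associative search through a contiguous block: index of the first word
 * equal to key, or -1; again each word is touched exactly once. */
static ptrdiff_t pim_row_search(const uint64_t *mem, size_t nwords, uint64_t key)
{
    for (size_t row = 0; row < nwords; row += ROW_WORDS) {
        size_t end = row + ROW_WORDS < nwords ? row + ROW_WORDS : nwords;
        for (size_t i = row; i < end; i++)
            if (mem[i] == key)
                return (ptrdiff_t)i;
    }
    return -1;
}

int main(void)
{
    uint64_t block[100] = {0};
    block[42] = 7;
    pim_row_increment(block, 100, 1);
    printf("found 8 at index %td\n", pim_row_search(block, 100, 8));
    return 0;
}
```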

26
Strategic Concepts of the MIND Architecture
  • Virtual to physical address translation in memory
  • Global distributed shared memory through a
    distributed directory table
  • Dynamic page migration
  • Wide registers serve as context sensitive TLB
  • Multithreaded control
  • Unified dynamic mechanism for resource management
  • Latency hiding
  • Real time response
  • Parcel (active message) driven computing
  • Decoupled split-transaction execution
  • System wide latency hiding
  • Move work to data instead of data to work
  • Caching of external DRAM

27
MIND Node
[Diagram: MIND node, including the memory address buffer and the parcel interface.]
28
Microprocessor with PIMs
[Diagram: a microprocessor (registers, ALU, control, and cache) connected to PIM nodes 1 through N, each coupling a PIM processor to its memory; the labels (cycle times, memory access times, load/store mix, and cache miss probability) are the metrics of the associated analytic model.]
29
Threads Timeline
[Timeline diagram: a heavyweight thread processor executes a sequence of HW threads while the lightweight thread processors run many LW threads concurrently.]
30
Simulation of Performance Gain
[Chart: performance gain versus PIM workload.]
31
Simulation of PIM Execution Time
[Chart: time to execution versus number of PIM nodes.]
32
Analytical Expression for Relative Execution Time
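The expression on the original slide was presented graphically; the sketch below is only a plausible stand-in, assuming a simple Amdahl-style model in which a fraction w of the work is offloaded to n PIM nodes running at relative rate r, giving T_rel = (1 - w) + w/(n*r). The symbols and values are illustrative assumptions, not the author's formula.

```c
#include <stdio.h>

/* Hypothetical relative-execution-time model (NOT the slide's formula):
 * a fraction w of the work runs on n PIM nodes at relative rate r, and
 * the remainder stays on the heavyweight processor at rate 1. */
static double relative_time(double w, int n, double r)
{
    return (1.0 - w) + w / ((double)n * r);
}

int main(void)
{
    for (int n = 1; n <= 256; n *= 4)
        printf("n=%3d  T_rel=%.3f\n", n, relative_time(0.8, n, 0.25));
    return 0;
}
```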
33
Effect of PIM on Execution Time with Normalized
Runtime
[Chart: relative time to execution versus number of PIM nodes.]
34
Parcels
  • Parcels
  • Enable lightweight communication between LWPs or
    between HWP and LWP.
  • Contribute to system-wide latency management
  • Support split-transaction message-driven
    computing
  • Low overhead for efficient communication
  • Implementation of remote thread creation (rtc).
  • Implementation of remote memory references.
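One way to picture a parcel is as a small self-describing message carrying a destination address, an action code, a payload, and a reply target. The C struct and handler below are a hypothetical sketch for illustration; the field names and dispatch are invented here and are not the MIND or Cascade parcel format.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical parcel layout: enough to name a target operand, an action
 * to perform there, a small payload, and where the reply should go. */
typedef enum { ACT_REMOTE_LOAD, ACT_REMOTE_STORE, ACT_THREAD_CREATE } action_t;

typedef struct {
    uint64_t dest_addr;     /* global virtual address of target operand */
    action_t action;        /* target action code                       */
    uint64_t payload;       /* operand value or thread argument         */
    uint32_t reply_locale;  /* where the return parcel should go        */
} parcel;

/* A lightweight processor handles an incoming parcel by dispatching on
 * the action code; a real node would spawn a thread or touch memory and
 * then emit a return parcel. */
static void handle_parcel(const parcel *p, uint64_t *local_mem)
{
    switch (p->action) {
    case ACT_REMOTE_LOAD:
        printf("load [%llu] -> reply to locale %u\n",
               (unsigned long long)p->dest_addr, p->reply_locale);
        break;
    case ACT_REMOTE_STORE:
        local_mem[p->dest_addr] = p->payload;   /* in-memory side effect */
        break;
    case ACT_THREAD_CREATE:
        printf("spawn thread at %llu with arg %llu\n",
               (unsigned long long)p->dest_addr,
               (unsigned long long)p->payload);
        break;
    }
}

int main(void)
{
    uint64_t mem[16] = {0};
    parcel store = { 3, ACT_REMOTE_STORE, 42, 0 };
    parcel rtc   = { 8, ACT_THREAD_CREATE, 7, 1 };
    handle_parcel(&store, mem);
    handle_parcel(&rtc, mem);
    printf("mem[3] = %llu\n", (unsigned long long)mem[3]);
    return 0;
}
```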

35
Parcels for remote threads
[Diagram: a remote-thread-create parcel travels from the source locale to the destination locale, naming a target action code and target operand; methods run against the data there, and a return parcel goes back to the source locale.]
36
Parcel Simulation Latency Hiding Experiment
[Diagram: control and test experiments, each consisting of groups of nodes connected by a flat network.]
37
Latency Hiding with Parcels with respect to
System Diameter in cycles
38
Latency Hiding with Parcels: Idle Time with
respect to Degree of Parallelism
39
Multithreading in PIMs
  • MIND must respond asynchronously to service
    requests from multiple sources
  • Parcel-driven computing requires rapid response
    to incident packets
  • Hardware supports multitasking for multiple
    concurrent method instantiations
  • High memory bandwidth utilization by overlapping
    computation with access ops
  • Manages shared on-chip resources
  • Provides fine-grain context switching
  • Latency hiding
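The fine-grain context switching listed above can be illustrated with a toy cooperative scheduler in C: one shared pipeline issues from whichever thread is ready, while threads that have touched memory sit out a fixed latency, so their wait is overlapped with other threads' work. The cycle counts and issue rules are invented for illustration, not the MIND mechanism.

```c
#include <stdio.h>

#define NTHREADS    4
#define NINSTR      8       /* instructions per thread (toy)       */
#define MEM_LATENCY 5       /* assumed cycles for a memory access  */

/* Per-thread context: progress counter and cycles remaining on an
 * outstanding memory access (0 means ready to issue). */
typedef struct { int pc, stall; } context;

int main(void)
{
    context t[NTHREADS] = {{0}};
    int finished = 0, cycles = 0, issued = 0;

    /* Each cycle, issue one instruction from the first ready thread, so
     * stalled threads' memory latency hides behind other threads' work. */
    while (finished < NTHREADS) {
        int issued_this_cycle = 0;
        for (int i = 0; i < NTHREADS; i++) {
            if (t[i].pc >= NINSTR || t[i].stall > 0)
                continue;                     /* done or still waiting */
            if (!issued_this_cycle) {
                t[i].pc++;
                issued++;
                if (t[i].pc % 2 == 0)         /* every 2nd op touches  */
                    t[i].stall = MEM_LATENCY; /* memory and suspends   */
                if (t[i].pc == NINSTR)
                    finished++;
                issued_this_cycle = 1;
            }
        }
        for (int i = 0; i < NTHREADS; i++)    /* latency elapses       */
            if (t[i].stall > 0)
                t[i].stall--;
        cycles++;
    }
    printf("%d instructions from %d threads in %d cycles\n",
           issued, NTHREADS, cycles);
    return 0;
}
```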

40
Parcels, Multithreading, and Multiport Memory
41
MPI: The Failed Success
  • A 10 year odyssey
  • Community wide standard for parallel programming
  • A proven natural model for the distributed,
    fragmented-memory class of systems
  • User responsible for locality management
  • User responsible for minimizing overhead
  • User responsible for resource allocation
  • User responsible for exposing parallelism
  • Relied on ILP and OpenMP for more parallelism
  • Mediocre scaling demands problem size expansion
    for greater performance
  • We are now constrained by legacy MPI codes (see
    the sketch below)
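The point of the bullets above is how much MPI leaves to the programmer; the minimal C fragment below uses standard MPI-1 calls to show the user explicitly choosing partners, tags, and message ordering for a boundary exchange. The decomposition itself is a made-up example.

```c
#include <mpi.h>
#include <stdio.h>

/* Each rank owns a slice of a 1-D domain and explicitly exchanges its
 * boundary value with its neighbor: locality, overhead, and ordering
 * are all the programmer's responsibility, as the slide argues. */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double my_edge = (double)rank;     /* stand-in for real boundary data */
    double neighbor_edge = -1.0;
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    /* User-chosen partners, tags, and ordering; MPI_Sendrecv avoids the
     * deadlock a naive blocking send/recv pair could produce. */
    MPI_Sendrecv(&my_edge, 1, MPI_DOUBLE, right, 0,
                 &neighbor_edge, 1, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %.1f from rank %d\n", rank, neighbor_edge, left);
    MPI_Finalize();
    return 0;
}
```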

42
What is required
  • Global name spaces for both data and active tasks
  • Rich parallelism semantics and granularity
  • Diversity of forms
  • Tremendous increase in amount
  • Support for sparse data parallelism
  • Latency hiding
  • Low overhead mechanisms
  • Synchronization
  • Scheduling
  • Affinity semantics
  • Do not rely on
  • Direct control of hardware mechanisms
  • Direct management and allocation of hardware
    resources
  • Direct choreographing of physical data and task
    locality

43
ParalleX: a Parallel Programming Model
  • Exposes parallelism in diverse forms and
    granularities
  • Greatly increases available parallelism for
    speedup
  • Matches more algorithms
  • Exploits intrinsic parallelism of sparse data
  • Exploits split transaction processing
  • Decouples computation and communication
  • Moves work to data, not just data to work
  • Intrinsics for latency hiding
  • Multithreading
  • Message driven computation
  • Efficient, low-overhead lightweight synchronization
  • Register synchronization
  • Futures with hardware support
  • Lightweight objects
  • Fine grain mutual exclusion
  • Provides for global data and task name spaces
  • Efficient remote memory accesses (e.g. shmem)
  • Lightweight atomic memory operations
  • Affinity attribute specifiers
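One of the intrinsics listed above, futures, can be sketched in C as a value guarded by a ready flag: a consumer blocks (here it simply yields) until a producer resolves it. This is a toy illustration of the synchronization idea only, written with POSIX threads and C11 atomics; it is not ParalleX semantics or an HPX API.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sched.h>

/* Toy future: a value plus a "ready" flag. Real lightweight
 * synchronization would suspend the consumer instead of spinning. */
typedef struct {
    atomic_int ready;
    double value;
} future;

static void future_set(future *f, double v)
{
    f->value = v;
    atomic_store_explicit(&f->ready, 1, memory_order_release);
}

static double future_get(future *f)
{
    while (!atomic_load_explicit(&f->ready, memory_order_acquire))
        sched_yield();                  /* stand-in for a context switch */
    return f->value;
}

static void *producer(void *arg)
{
    future_set((future *)arg, 3.14);    /* resolve the future */
    return NULL;
}

int main(void)
{
    future f;
    atomic_init(&f.ready, 0);
    f.value = 0.0;

    pthread_t tid;
    pthread_create(&tid, NULL, producer, &f);
    printf("future resolved to %.2f\n", future_get(&f));  /* waits until set */
    pthread_join(tid, NULL);
    return 0;
}
```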

44
Agincourt: A Latency-Tolerant Parallel
Programming Language
  • Bridges the cluster gap
  • Eliminates constraints of message passing model
  • Reduces need for inefficient global barrier
    synchronization
  • Mitigates local to remote access time disparity
  • Removes OS from critical path
  • Greatly simplifies programming
  • Global single system image
  • Manipulates sparse, irregular, time-varying
    metadata
  • Facilitates dynamic adaptive applications
  • Dramatic performance advantage
  • Lower overhead
  • Latency hiding
  • Load balancing

45
This could be a very bad idea
  • New languages almost always fail
  • Fancy-assed languages usually do not match the
    needs of the system hardware
  • Compilers take forever to bring to maturity
  • People, quite reasonably, like what they do; they
    don't want to change
  • People feel threatened by others who want to
    impose silly, naive, expensive, impractical,
    unilateral ideas
  • Acceptance is a big issue
  • And then there's the legacy problem

46
Real-World Practical Petaflops Computer Systems
  • Sustained Petaflops performance on wide range of
    applications.
  • Full peta-scale system resources of a Petaflops
    computer routinely allocated to real-world users,
    not just for demos before SCXY.
  • There are many Petaflops computers available
    throughout the nation, not just at a couple of
    national laboratories.
  • Size, power, cooling, and cost not prohibitive.
  • Programming is tractable, so that a scientist can
    use it and not change professions in the process.

47
1 Petaflops is only the beginning