Title: Thor's Hammer / Red Storm
1. Bill Camp, Sandia Labs
Second feature: Hints on MPP Computing
Featured attraction: Computers for Doing Big Science
2. Sandia MPPs (since 1987)
- 1987: 1024-processor nCUBE 10, 512 Mflops
- 1990--1992: two 1024-processor nCUBE-2 machines @ 2 Gflops each
- 1988--1990: 16384-processor CM-200
- 1991: 64-processor Intel iPSC/860
- 1993--1996: 3700-processor Intel Paragon, 180 Gflops
- 1996--present: 9400-processor Intel TFLOPS (ASCI Red), 3.2 Tflops
- 1997--present: 400 -> 2800 processors in the Cplant Linux Cluster, 3 Tflops
- 2003: 1280-processor IA-32 Linux cluster, 7 Tflops
- 2004: Red Storm, 11600-processor Opteron-based MPP, >40 Tflops
- 2005: 1280-processor 64-bit Linux cluster, 10 TF
- 2006: Red Storm upgrade, 20K nodes, 160 TF
- 2008--9: Red Widow, 50K nodes, 1000 TF (?)
3. Computing domains at Sandia (Big Science)
Domains by processor count (1 up to 10^4 procs):
- Red Storm: spans three decades of processor counts, up to the full 10^4 scale
- Cplant Linux Supercluster: spans three decades of processor counts
- Beowulf clusters: span three decades of processor counts
- Desktop: single processors
- Red Storm is targeting the highest-end market but has real advantages for
  the mid-range market (from 1 cabinet on up)
4. RS node architecture
[Figure: node diagram]
- DRAM: 1 (or 2) GByte or more
- CPU: AMD Opteron
- Six links to other nodes in X, Y, and Z
- ASIC: Application-Specific Integrated Circuit, i.e., a custom chip
5. 3-D mesh topology (Z direction is a torus)
[Figure: 27 (X) x 16 (Y) x 24 (Z) machine layout; torus interconnect in Z;
10,368-node compute mesh; 640 visualization, service, and I/O nodes at each
end of the mesh]
6. Comparison of ASCI Red and Red Storm

                                       ASCI Red                     Red Storm
  Full System Operational Time Frame   June 1997 (processor and     August 2004
                                       memory upgrade in 1999)
  Theoretical Peak (TF),               3.15                         41.47
  compute partition alone
  MP-Linpack Performance (TF)          2.38                         >30 (estimated)
  Architecture                         Distributed Memory MIMD      Distributed Memory MIMD
  Number of Compute Node Processors    9,460                        10,368
  Processor                            Intel Pentium II @ 333 MHz   AMD Opteron @ 2 GHz
  Total Memory                         1.2 TB                       10.4 TB (up to 80 TB)
  System Memory Bandwidth              2.5 TB/s                     55 TB/s
  Disk Storage                         12.5 TB                      240 TB
  Parallel File System Bandwidth       2.0 GB/s                     100.0 GB/s
  External Network Bandwidth           0.4 GB/s                     50 GB/s
7. Comparison of ASCI Red and Red Storm (continued)

                                  ASCI Red                          Red Storm
  Interconnect Topology           3D mesh (x, y, z), 38 x 32 x 2    3D mesh (x, y, z), 27 x 16 x 24
  Interconnect Performance
    MPI Latency                   15 μs (1 hop), 20 μs max          2.0 μs (1 hop), 5 μs max
    Bi-Directional Bandwidth      800 MB/s                          6.4 GB/s
    Minimum Bi-section Bandwidth  51.2 GB/s                         2.3 TB/s
  Full System RAS
    RAS Network                   10 Mbit Ethernet                  100 Mbit Ethernet
    RAS Processors                1 for each 32 CPUs                1 for each 4 CPUs
  Operating System
    Compute Nodes                 Cougar                            Catamount
    Service and I/O Nodes         TOS (OSF1 UNIX)                   LINUX
    RAS Nodes                     VX-Works                          LINUX
  Red/Black Switching             2260 - 4940 - 2260                2688 - 4992 - 2688
  System Foot Print               2500 ft^2                         3000 ft^2
  Power Requirement               850 KW                            1.7 MW
8. Red Storm Project
- Goal: 23 months, design to First Product Shipment!
- System software is a joint project between Cray and Sandia
  - Sandia is supplying the Catamount LWK and the service node run-time system
  - Cray is responsible for Linux, the NIC software interface, RAS software,
    file system software, and the Totalview port
- Initial software development was done on a cluster of workstations with a
  commodity interconnect. The second stage involved an FPGA implementation of
  the SEASTAR NIC/Router (Starfish). Final checkout is on the real
  SEASTAR-based system.
- System engineering is wrapping up!
  - Cabinets exist
  - SEASTAR NIC/Router: RTAT back from fabrication at IBM late last month
- Full system to be installed and turned over to Sandia in stages, culminating
  in August--December 2004
9. Designing for scalable scientific supercomputing
Challenges in:
- Design
- Integration
- Management
- Use
10. Design: SUREty for very large parallel computer systems
- Scalability: full system hardware and system software
- Usability: required functionality only
- Reliability: hardware and system software
- Expense minimization: use commodity, high-volume parts
SURE poses computer system requirements.
11. SURE architectural tradeoffs
- Processor and memory sub-system balance
- Compute vs. interconnect balance
- Topology choices
- Software choices
- RAS
- Commodity vs. custom technology
- Geometry and mechanical design
12. Sandia strategies
- Build on commodity
- Leverage Open Source (e.g., Linux)
- Add to commodity selectively (in Red Storm there is basically one truly
  custom part!)
- Leverage experience with previous scalable supercomputers
13. System scalability driven requirements
- Overall system scalability: complex scientific applications such as molecular
  dynamics, hydrodynamics, and radiation transport should achieve scaled
  parallel efficiencies greater than 50% on the full system (~20,000
  processors).
14. Scalability: system software
- System software performance scales nearly perfectly with the number of
  processors, to the full size of the computer (~30,000 processors). This means
  that system software time (overhead) remains nearly constant with the size of
  the system, or scales at most logarithmically with the system size.
- Full re-boot time scales logarithmically with the system size.
- Job loading is logarithmic in the number of processors (see the sketch after
  this list).
- Parallel I/O performance is not sensitive to the # of PEs doing I/O.
- Communication network software must be scalable:
  - Prefer no connection-based protocols among compute nodes.
  - Message buffer space independent of the # of processors.
  - Compute node OS gets out of the way of the application.
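Of these requirements, job loading is the easiest to make concrete. A minimal
sketch of why it can scale logarithmically, assuming a simple binary fan-out
broadcast of the executable (my illustration; the function name and fan-out
are assumptions, not Sandia's actual launcher):

    import math

    def rounds_to_load(num_nodes: int) -> int:
        """Rounds of a binary fan-out broadcast: every node that already holds
        the executable forwards it to one new node per round, so the set of
        loaded nodes doubles each round."""
        return math.ceil(math.log2(num_nodes)) if num_nodes > 1 else 0

    for nodes in (100, 1_000, 10_000, 30_000):
        print(f"{nodes:>6} nodes: {rounds_to_load(nodes):2d} rounds")
    # 30,000 nodes are covered in about 15 rounds: launch overhead grows like
    # log2(P) rather than P, which is the scaling target stated above.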
15. Hardware scalability
- Balance in the node hardware
  - Memory BW must match CPU speed
    - Ideally 24 Bytes/flop (never yet done)
  - Communications speed must match CPU speed
  - I/O must match CPU speeds
- Scalable system SW (OS and libraries)
- Scalable applications
16. SW scalability: kernel tradeoffs
- Unix/Linux has more functionality, but impairs efficiency on the compute
  partition at scale.
- Say breaks (OS interruptions) are uncorrelated, last 50 μs, and occur once
  per second on each processor (a minimal model is sketched below):
  - On 100 CPUs, wasted time is 5 ms every second
    - Negligible: 0.5% impact
  - On 1,000 CPUs, wasted time is 50 ms every second
    - Moderate: 5% impact
  - On 10,000 CPUs, wasted time is 500 ms every second
    - Significant: 50% impact
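A minimal sketch of the arithmetic behind these three bullets, assuming, as the
slide does, one uncorrelated 50 μs interruption per processor per second and a
tightly synchronized application that ends up waiting out each interruption in
turn (my illustration of the argument, not code from the talk):

    NOISE_S = 50e-6   # assumed duration of one OS interruption (50 microseconds)
    RATE_HZ = 1.0     # assumed interruptions per processor per second

    def wasted_fraction(num_cpus: int) -> float:
        """Worst-case fraction of each second lost when uncorrelated
        interruptions serialize across a synchronizing application."""
        return num_cpus * NOISE_S * RATE_HZ

    for cpus in (100, 1_000, 10_000):
        lost = wasted_fraction(cpus)
        print(f"{cpus:>6} CPUs: {lost*1e3:6.1f} ms lost per second ({lost:.1%} impact)")
    # Reproduces the slide's numbers: 5 ms (0.5%), 50 ms (5%), 500 ms (50%).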
17. Usability
- Application code support: software that supports scalability of the computer
  system
  - Math libraries
  - MPI support for full system size
  - Parallel I/O library
  - Compilers
- Tools that scale to the full size of the computer system
  - Debuggers
  - Performance monitors
- Full-featured LINUX OS support at the user interface
18. Reliability
- Light Weight Kernel (LWK) OS on the compute partition
  - Much less code fails much less often
- Monitoring of correctable errors
  - Fix soft errors before they become hard
- Hot swapping of components
  - Overall system keeps running during maintenance
- Redundant power supplies and memories
- Completely independent RAS system monitors virtually every component in the
  system
19. Economy
- Use high-volume parts where possible
- Minimize power requirements
  - Cuts operating costs
  - Reduces need for new capital investment
- Minimize system volume
  - Reduces need for large new capital facilities
- Use standard manufacturing processes where possible; minimize customization
- Maximize reliability and availability per dollar
- Maximize scalability per dollar
- Design for integrability
20. Economy
- Red Storm leverages economies of scale
  - AMD Opteron microprocessor, standard memory
  - Air cooled
  - Electrical interconnect based on InfiniBand physical devices
  - Linux operating system
- Selected use of custom components
  - System chip ASIC
    - Critical for communication-intensive applications
  - Light Weight Kernel
    - Truly custom, but we already have it (4th generation)
21. Cplant on a slide
- Goal: MPP look and feel
- Start 1997, upgrades 1999--2001
- Alpha + Myrinet, mesh topology
- ~3000 procs (3 Tf) in 7 systems
- Configurable to 1700 procs
- Red/Black switching
- Linux w/ custom runtime mgmt.
- Production operation for several yrs.
22. IA-32 Cplant on a slide
- Goal: mid-range capacity
- Started 2003, upgraded annually
- Pentium-4 + Myrinet, Clos network
- 1280 procs (7 Tf) in 3 systems
- Currently configurable to 512 procs
- Linux w/ custom runtime mgmt.
- Production operation for several yrs.
23. Observation
For most large scientific and engineering applications, performance is
determined more by parallel scalability than by the speed of individual CPUs.
There must be balance between processor, interconnect, and I/O performance to
achieve overall performance. To date, only a few tightly-coupled parallel
computer systems have been able to demonstrate a high level of scalability on
a broad set of scientific and engineering applications.
24. Let's compare balance in parallel systems
25. Comparing Red Storm and BGL

                       Blue Gene Light    Red Storm
  Node speed           5.6 GF             5.6 GF         (1x)
  Node memory          0.25--0.5 GB       2 (1--8) GB    (4x nominal)
  Network latency      7 μs               2 μs           (2/7 x)
  Network link BW      0.28 GB/s          6.0 GB/s       (22x)
  BW Bytes/Flop        0.05               1.1            (22x)
  Bi-section B/F       0.0016             0.038          (24x)
  Nodes/problem        40,000             10,000         (1/4 x)

  (Comparison is between the 100 TF version of Red Storm and the 360 TF version
  of BGL. The bytes/flop figures follow from the rows above, as sketched below.)
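A short sketch of where the balance figure of merit comes from: bandwidth
bytes per flop is simply network link bandwidth divided by node peak speed
(my own arithmetic on the table's numbers, not code from the talk):

    # name: (node peak in GF, network link bandwidth in GB/s), from the table above
    machines = {
        "Blue Gene Light": (5.6, 0.28),
        "Red Storm":       (5.6, 6.0),
    }

    for name, (peak_gf, link_gbs) in machines.items():
        bytes_per_flop = link_gbs / peak_gf   # (GB/s) / GF = bytes per flop
        print(f"{name:15s}: {bytes_per_flop:.2f} bytes/flop")
    # Blue Gene Light: 0.05 bytes/flop; Red Storm: ~1.07, i.e. the 1.1 value
    # and the 22x ratio quoted in the table.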
26. Fixed-problem performance
[Figure: molecular dynamics problem (LJ liquid)]
27. Scalable computing works
28. Balance is critical to scalability
[Figure: curves labeled Peak and Linpack]
29. Relating scalability and cost
[Figure: regions where the MPP is more cost effective vs. where the cluster is
more cost effective; crossover at efficiency ratio = cost ratio = 1.8]
Average efficiency ratio over the five codes that consume >80% of Sandia's
cycles.
30. Scalability determines cost effectiveness
Sandia's top-priority computing workload
[Figure: regions where the cluster is more cost effective vs. where the MPP is
more cost effective; annotations: 380M node-hrs, 55M node-hrs, 256]
31. Scalability also limits capability
[Figure: annotation "3x processors"]
32. Commodity nearly everywhere; customization drives cost
- Earth Simulator and Cray X-1 are fully custom vector systems with good
  balance
  - This drives their high cost (and their high performance)
- Clusters are nearly entirely high-volume with no truly custom parts
  - Which drives their low cost (and their low scalability)
- Red Storm uses custom parts only where they are critical to performance and
  reliability
  - High scalability at minimal cost/performance
33. [No transcript]
34. "Honey, it's not one of those..." or Hints on MPP Computing
- Excerpted from a talk with this title given by Bill Camp at CUG-Tours in
  October 1994
35. Issues in MPP computing
- Physically shared memory does not scale
- Data must be distributed
  - No single data layout may be optimal
  - The optimal data layout may change during the computation
- Communications are expensive
- The single control stream in SIMD computing makes it simple, at the cost of a
  severe loss in performance due to load-balancing problems
- In data-parallel computing (a la CM-5) there can be multiple control streams,
  but with global synchronization
  - Less simple, but overhead remains an issue
- In MIMD computing there are many control streams, loosely synchronized (e.g.,
  with messages)
  - Powerful, flexible, and complex
36. Why doesn't shared memory scale?
[Figure: many CPUs, each with its own cache, connected through a switch to a
set of memory banks]
- Bank conflicts: about a 40% hit for a large # of banks and CPUs
- Memory coherency: who has the data, and can I access it?
- High, non-deterministic latencies
37. Amdahl's Law
Time on a single processor:   T1 = Tser + Tpar
Time on P processors:         Tp = Tser + Tpar/P + Tcomms
Ignoring communications, the speedup Sp = T1/Tp is then
    Sp = [fser + (1 - fser)/P]^(-1),  where fser = Tser/(Tser + Tpar)
So Sp < 1/fser.
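A brief numerical sketch of the bound (my illustration of the formula above,
not code from the talk):

    def amdahl_speedup(fser: float, procs: int) -> float:
        """Sp = 1 / (fser + (1 - fser)/P), ignoring communication cost."""
        return 1.0 / (fser + (1.0 - fser) / procs)

    for fser in (0.01, 0.001):
        for procs in (100, 1_000, 10_000):
            sp = amdahl_speedup(fser, procs)
            print(f"fser={fser:<6} P={procs:<6} Sp={sp:8.1f} (bound 1/fser = {1/fser:.0f})")
    # Even a 0.1% serial fraction caps the speedup at 1000x no matter how many
    # processors are added; hence Hint 1 below: purge serialism.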
38. The Axioms
Axiom 1: Amdahl's Law is inviolate (Sp < 1/fser)
Axiom 2: Amdahl's Law doesn't matter for MPP if you know what you are doing
         (comms dominate)
Axiom 3: Nature is parallel
Axiom 4: Nature is (mostly) local
Axiom 5: Physical shared memory does not scale
Axiom 6: Physically distributed memory does
Axiom 7: Nevertheless, a global address space is nice to have
39. The Axioms
Axiom 8: Like solar energy, automatic parallelism is the technology of the
         future
Axiom 9: Successful parallelism requires the near-total suppression of
         serialism
Axiom 10: The best thing you can do with a processor is serial execution
Axiom 11: Axioms 9 and 10 are not contradictory
Axiom 12: MPPs are for doing large problems fast (if you need to do a small
          problem fast, look elsewhere)
Axiom 13: Generals build weapons to win the last war (so do computer
          scientists)
40. The Axioms
Axiom 14: First find coarse-grained, then medium-grained, then fine-grained
          parallelism
Axiom 15: Done correctly, the gain from these is multiplicative
Axiom 16: Life's a balancing act; so is MPP computing
Axiom 17: Be an introvert: never communicate needlessly
Axiom 18: Be independent: never synchronize needlessly
Axiom 19: Parallel computing is a cold world; bundle up well
41. The Axioms
Axiom 20: I/O should only be done under medical supervision
Axiom 21: If MPP computin' is easy, it ain't cheap
Axiom 22: If MPP computin' is cheap, it ain't easy
Axiom 23: The difficulty of programming an MPP effectively is directly
          proportional to latency
Axiom 24: The parallelism is in the problem, not in the code
42. The Axioms
Axiom 25: There are an infinite number of parallel algorithms
Axiom 26: There are no parallel algorithms (Simon's theorem); it's almost true
Axiom 27: The best parallel algorithm is almost always a parallel
          implementation of the best serial algorithm (what Horst really meant)
Axiom 28: Amdahl's Law DOES limit vector speedup!
Axiom 18: Work in teams (sometimes SIMD constructs are just what the Doctor
          ordered)!
Axiom 29: Do try this at home!
43. (Some of) the Hints
Hint 1: Any amount of serial computing is death. So:
  1) Make the problem large
  2) Look everywhere for serialism and purge it from your code
  3) Never, ever, ever add serial statements
44. (Some of) the Hints
Hint 2: Keep communications in the noise! So:
  1) Don't do little problems on big computers
  2) Change algorithms when profitable
  3) Bundle up! Avoid small messages on high-latency interconnects (a simple
     cost model is sketched below)
  4) Don't waste memory: using all the memory on a node minimizes the ratio of
     communications to useful work
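A minimal cost-model sketch of why "bundle up" matters, assuming the usual
time = latency + size/bandwidth model; the 5 μs latency and 6 GB/s bandwidth
are assumptions chosen for illustration, not measurements from the talk:

    LATENCY_S   = 5e-6    # assumed per-message latency (5 microseconds)
    BANDWIDTH_B = 6.0e9   # assumed link bandwidth (bytes/second)

    def transfer_time(total_bytes: float, num_messages: int) -> float:
        """Time to move total_bytes split evenly across num_messages messages."""
        per_message = total_bytes / num_messages
        return num_messages * (LATENCY_S + per_message / BANDWIDTH_B)

    total = 1_000_000  # 1 MB of data to exchange with a neighbor
    for n in (1, 100, 10_000):
        print(f"{n:>6} messages: {transfer_time(total, n)*1e3:8.3f} ms")
    # One 1 MB message takes ~0.17 ms; the same megabyte sent as 10,000 small
    # messages takes ~50 ms, almost all of it latency.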
45. (Some of) the Hints
Hint 3: The parallelism is in the problem! E.g., SAR, Monte Carlo, direct
sparse solvers, molecular dynamics. So:
  1) Look first at the problem
  2) Look second at algorithms
  3) Look at data structures in the code
  4) Don't waste cycles on line-by-line parallelism
46. (Some of) the Hints
Hint 4: Incremental parallelism is too inefficient! Don't fiddle with the
Fortran. Look at the problem and identify the kinds of parallelism it contains:
  1) Multi-program
  2) Multi-task
  4) Data parallelism
  5) Inner-loop parallelism (e.g., vectors)
[Figure annotation: "Time into effort"]
47. (Some of) the Hints
Hint 5: Often, with Explicit Message Passing (EMP) or gets/puts, you can re-use
virtually all of your code (changes and additions are few). With data-parallel
languages, you re-write your code; it can be easy, but performance is usually
unacceptable.
48. (Some of) the Hints
Hint 6: Load balancing (use existing libraries and technology)
- Easy in EMP!
- Hard (or impossible) in HPF, F90, CMF, ...
- Only load balance if dTnew + dTbal < dTold (see the sketch below)
- Static or dynamic; graph-based, geometry-based, particle-based, hierarchical,
  master-slave
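A tiny sketch of the rebalancing criterion above, amortizing the one-time
rebalance cost dTbal over the remaining time steps (my reading of the
inequality; the timings are hypothetical):

    def should_rebalance(dt_old: float, dt_new: float, dt_bal: float,
                         remaining_steps: int) -> bool:
        """Rebalance only if paying dt_bal now plus dt_new per step beats
        continuing at dt_old per step for the remaining steps."""
        return dt_bal + dt_new * remaining_steps < dt_old * remaining_steps

    # Hypothetical numbers: 1.00 s/step now, 0.80 s/step after a 50 s rebalance.
    for steps in (100, 500):
        print(f"{steps} steps remaining -> rebalance: "
              f"{should_rebalance(1.00, 0.80, 50.0, steps)}")
    # With 100 steps left the rebalance does not pay for itself; with 500 it does.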
49. (Some of) the Hints
Hint 7: Synchronization is expensive. So:
- Don't do it unless you have to
- Never, ever put in synchronization just to get rid of a bug, else you'll be
  stuck with it for the life of the code!
50. (Some of) the Hints
Hint 8: I/O can ruin your whole afternoon. It is amazing how many people will
create wonderfully scalable codes only to spoil them with needless, serial, or
non-balanced I/O.
- Use I/O sparingly
- Stage I/O carefully
51. (Some of) the Hints
Hint 9: Religious prejudice is the bane of computing.
- Caches aren't inherently bad
- Vectors aren't inherently good
- Small SMPs will not ruin your life
- Single-processor nodes are not killers
52. La Fin (The End)
53. Scaling data for some key engineering codes
[Figure annotations: random variation at small proc. counts; large differential
in efficiency at large proc. counts]
54. Scaling data for some key physics codes
[Figure: Los Alamos radiation transport code]
55. Parallel Sn Neutronics (provided by LANL)