Title: Architectural Considerations for Petaflops and beyond
1. Architectural Considerations for Petaflops and Beyond
Bill Camp, Sandia National Labs
March 4, 2003, SOS7, Durango, CO, USA
2. Programming Models: A historical perspective
- 1948--53: Machine language rules
- 1953--1973: Single-threaded Fortran
- 1973--1980: Single-threaded vector Fortran
- 1978--1995: Shared-memory parallel vector Fortran; directives: multi-, auto- and microtasking
- 1987--present: Massively parallel, message-passing Fortran and C
- 1995--present: Threads-based, shared-memory parallelism
- 1996--present: Hybrid threads and message passing
3. Programming Models: Some false starts
- Late 80s--early 90s: SIMD Fortran for heterogeneous problems
- Mid-eighties--present: Dataflow parallelism and functional programming
- Mid-eighties--late eighties: AI-based languages, e.g., LISP
- Mid-nineties: CRAFT-90 (shared-memory approach to MPPs)
- Early nineties to 2000: MPP threads
4. Programming Models: Observations
- Shared-memory programming models have never scaled well.
- Directives-based approaches lead to code explosion and are not effective at dealing with Amdahl's Law.
- Outer-loop, distributed-memory parallelism requires a physics-centric approach. That is, it changed the way we think about parallelism but (largely) preserved our code base, didn't lead to code explosion, and made it easier to marginalize the effects of Amdahl's Law.
- People will change approaches only for a huge perceived gain.
5. Petaflops: Can we get there with what we have now?
YES
6. What's Important? SURE
- Scalability
- Usability
- Reliability
- Expense minimization
7. A more REAListic Amdahlian Law
The actual scaled speedup is more like
$S(N) = S_{\mathrm{Amdahl}}(N) / (1 + f_{\mathrm{comm}} \times R_{p/c})$,
where $f_{\mathrm{comm}}$ is the fraction of work devoted to communications and $R_{p/c}$ is the ratio of processor speed to communications speed.
8. REAL Law Implications: $S_{\mathrm{real}}(N) / S_{\mathrm{Amdahl}}(N)$
Let's consider three cases on two computers. The two computers are identical except that one has an $R_{p/c}$ of 1 and the second an $R_{p/c}$ of 0.05. The three cases are $f_{\mathrm{comm}}$ = 0.01, 0.05 and 0.10.
9. REAL Law Implications: $S(N) / S_{\mathrm{Amdahl}}(N)$

Rp/c     fcomm = 0.01   fcomm = 0.05   fcomm = 0.10
1.0      0.99           0.95           0.9
0.05     0.83           0.50           0.33
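The ratio tabulated here can be checked with a minimal sketch. It assumes the penalty term is the communications fraction multiplied by how many times faster the processor is than the network. The function and variable names are illustrative and do not come from the talk. One assumption deserves to be stated loudly: the values for the second machine are reproduced only if its 0.05 is read as communications speed relative to processor speed, that is, a processor-to-communications ratio of 20. That reading is an inference from the numbers, not something the slides state.

```python
def real_over_amdahl(f_comm, proc_to_comm_ratio):
    """Return S(N) / S_Amdahl(N) = 1 / (1 + f_comm * ratio)."""
    return 1.0 / (1.0 + f_comm * proc_to_comm_ratio)


# Communication fractions considered in the talk.
F_COMM = (0.01, 0.05, 0.10)

# Assumption: the second machine's "0.05" is treated as a
# processor-to-communications speed ratio of 20 (= 1 / 0.05).
MACHINES = {"balanced": 1.0, "weak communications": 20.0}


def ratio_table():
    """Build the slide's table, rounded to two decimals."""
    return {
        name: [round(real_over_amdahl(f, ratio), 2) for f in F_COMM]
        for name, ratio in MACHINES.items()
    }


if __name__ == "__main__":
    for name, row in ratio_table().items():
        print(name, row)
```

Under these assumptions the sketch gives 0.91 rather than 0.9 for the balanced machine at a communications fraction of 0.10, which suggests the slide truncated that entry.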
10. Bottom line
A well-balanced architecture is nearly insensitive to communications overhead. By contrast, a system with weak communications can lose over half its power for applications in which communications is important.
11. Petaflops: Why can we get there with what we have now?
- We only need 3 more spins of Moore's Law:
  - Today's 6-GF Hammer becomes a 48-GF processor by 2009
  - 10-Gigabit Ethernet becomes 40- or 80-Gbit Ethernet
  - Memory capacities and prices continue to improve on the current trend until 2009
- Disk technology continues on its current trajectory for 6 more years
- We use small optical switches to give us 40--80 Gbyte/sec interconnects
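The arithmetic behind "3 more spins" can be sketched as follows, assuming a spin means one doubling and taking 2003 (the date of the talk) as the starting year. The helper name `after_spins` is hypothetical.

```python
def after_spins(value, spins):
    """Double `value` once per spin of Moore's Law (assumed: 1 spin = 2x)."""
    return value * 2 ** spins


if __name__ == "__main__":
    # 6-GF processor after three doublings.
    print(after_spins(6, 3))  # 48
    # 10-Gbit Ethernet after two or three doublings.
    print(after_spins(10, 2), after_spins(10, 3))  # 40 80
    # Implied doubling period between 2003 and 2009.
    print((2009 - 2003) / 3)  # 2.0
```

Read this way, the projection implies one doubling roughly every two years, and the "40 or 80" Ethernet figure corresponds to two or three doublings.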
12. Petaflops: Why can we get there with what we have now?
- We need 12,000--25,000 processors to get a peak PETAFLOP.
- It will have 250--1000 TB of memory.
- It will have several hundred petabytes of disk storage.
- It will sustain about half a terabyte/sec of I/O (more costs more).
- It will have about 30 TB/sec of XC bandwidth.
- It will have about 5--10 PB/sec of memory bandwidth.
- BALANCE REMAINS ESSENTIALLY LIKE THAT IN THE RED STORM DESIGN.
- COST in 2009: 100M--250M in then-year dollars.
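The sizing figures above can be turned into per-flop balance ratios with a short sketch. It assumes decimal units (1 TB = 1e12 bytes, 1 PB = 1e15 bytes) and uses the 48-GF processor projected on the previous slide. The function names are illustrative, not from the talk.

```python
import math

PEAK_FLOPS = 1e15  # one petaflop/s peak, the target on the slide


def processors_needed(gflops_per_processor):
    """Processors required to reach PEAK_FLOPS at the given per-processor peak."""
    return math.ceil(PEAK_FLOPS / (gflops_per_processor * 1e9))


def per_peak_flop(amount):
    """Resource amount (bytes or bytes/s) per peak flop/s of the machine."""
    return amount / PEAK_FLOPS


if __name__ == "__main__":
    # 48 GF is the projected 2009 processor (assumed decimal giga).
    print("processors at 48 GF each:", processors_needed(48))

    # Slide figures as (low, high) ranges, in bytes or bytes/s.
    resources = {
        "memory capacity (B per flop/s)": (250e12, 1000e12),
        "memory bandwidth (B/s per flop/s)": (5e15, 10e15),
        "XC bandwidth (B/s per flop/s)": (30e12, 30e12),
        "I/O bandwidth (B/s per flop/s)": (0.5e12, 0.5e12),
    }
    for label, (low, high) in resources.items():
        print(f"{label}: {per_peak_flop(low):g} to {per_peak_flop(high):g}")
```

At 48 GF per processor the count comes to 20,834, inside the quoted 12,000--25,000 range. The endpoints of that range correspond to per-processor peaks of roughly 83 GF and 40 GF respectively.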
13. Petaflops: Design issues
- It will use commodity processors with multiple cores per chip.
- It will run a partitioned OS based on Linux.
- It could have partitions with fast vector processors in a mix-and-match architecture.
- It won't look like the Earth Simulator.
- It won't run IA-64, based on current Intel design intent.
- It will probably run PowerPC or Hammer follow-ons.
14. Petaflops: Why not Earth Simulator?
- On our codes, commodity processors are nearly as fast as the ES nodes, and they have a cost/performance advantage of 1.5--2.0 orders of magnitude (roughly 30x to 100x).
- BTW, this is also true, though with a smaller difference, for the McKinley versus the Pentium 4.
- Example: The geometric mean of the Livermore Loops on ES is only 60% faster than on a 2 GHz Pentium 4.
- Example: A real CTH problem is about as fast on that P4 as it is on the ES.
15. Petaflops: Why not Earth Simulator?
- Amdahl's Law and the high cost of custom processors
16. Why not Earth Simulator?
- Amdahl's Law
- $S = T_S / T_V$
- $S = 1 / \{ [\, pW/(sN) + (1-p)W/(s/M) \,] / (W/s) \}$
- $S = [\, p/N + M(1-p) \,]^{-1}$
- Let $N = M = 4$:
- $S = 1 / [\, p/4 + 4(1-p) \,]$
17. Why not Earth Simulator?
- Amdahl's Law ($p$ = vector fraction of work)
- $S = [\, p/N + M(1-p) \,]^{-1}$
- Let $N = M = 4$:
- $S = 1 / [\, p/4 + 4(1-p) \,]$
- $p$ must be greater than or equal to 0.8 for breakeven!
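The breakeven claim can be verified with a small sketch. It assumes the following reading of the symbols, which the slides do not spell out: the vector processor runs the vectorizable fraction p of the work N times faster than the commodity processor and the remaining fraction M times slower. Setting S = 1 in the formula above gives p = (M - 1) / (M - 1/N). The function names are hypothetical.

```python
def vector_speedup(p, n=4.0, m=4.0):
    """S = 1 / (p/N + M(1-p)): vector machine relative to commodity processor."""
    return 1.0 / (p / n + m * (1.0 - p))


def breakeven_fraction(n=4.0, m=4.0):
    """Vector fraction p at which S = 1, from p/N + M(1-p) = 1."""
    return (m - 1.0) / (m - 1.0 / n)


if __name__ == "__main__":
    print("breakeven p:", breakeven_fraction())
    for p in (0.5, 0.8, 0.9, 1.0):
        print(f"p = {p}: S = {vector_speedup(p):.3f}")
```

With N = M = 4 the breakeven fraction is 3 / 3.75 = 0.8, matching the slide. Below that fraction the vector processor is slower than the commodity one, and even a fully vectorized code (p = 1) gains only the factor N.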
18. Petaflops: Why not IA-64?
- Heat
- Size
- Complexity
- Cost
- High latency/ low BW
- Difficulty in Compilability
- Competition from Intel
20. The Bad News
- Somewhere between a petaflop and an exaflop, we will run the string out on this approach to computing.
21. The Good News
- For exaflops computing, there is lots of potential for innovation. New approaches:
  - DNA computers
  - New memory-centric technologies (e.g., spin computers)
  - (Not) quantum computers
  - Very low-power semiconductor-based systems
22. The Good News
- For exaflops computing, there is lots of potential for innovation.
- The requirements for SURE will not change!
23. The Good News
- I'll be gone fishing!
- The END (almost)