Title: 100 TF Sustained on Cray X Series
100 TF Sustained on Cray X Series
- SOS 8
- April 13, 2004
- James B. White III (Trey)
- trey@ornl.gov
Disclaimer
- The opinions expressed here do not necessarily
represent those of the CCS, ORNL, DOE, the
Executive Branch of the Federal Government of the
United States of America, or even UT-Battelle.
Disclaimer (cont.)
- Graph-free, chart-free environment
- For graphs and charts: http://www.csm.ornl.gov/evaluation/PHOENIX/
100 Real TF on Cray Xn
- Who needs capability computing?
- Application requirements
- Why Xn?
- Laundry, Clean and Otherwise
- Rants
- Custom vs. Commodity
- MPI
- CAF
- Cray
Who needs capability computing?
- OMB?
- Politicians?
- Vendors?
- Center directors?
- Computer scientists?
Who needs capability computing? (cont.)
- Application scientists
- According to scientists themselves
Personal Communications
- Fusion
- General Atomics, Iowa, ORNL, PPPL, Wisconsin
- Climate
- LANL, NCAR, ORNL, PNNL
- Materials
- Cincinnati, Florida, NC State, ORNL, Sandia, Wisconsin
- Biology
- NCI, ORNL, PNNL
- Chemistry
- Auburn, LANL, ORNL, PNNL
- Astrophysics
- Arizona, Chicago, NC State, ORNL, Tennessee
Scientists Need Capability
- Climate scientists need simulation fidelity to support policy decisions
- All we can say now is that humans cause warming
- Fusion scientists need to simulate fusion devices
- All we can do now is model decoupled subprocesses at disparate time scales
- Materials scientists need to design new materials
- Just starting to reproduce known materials
Scientists Need Capability (cont.)
- Biologists need to simulate proteins and protein pathways
- Baby steps with smaller molecules
- Chemists need similar increases in complexity
- Astrophysicists need to simulate nucleogenesis (high-res, 3D CFD, 6D neutrinos, long times)
- Low-res, 3D CFD, approximate 3D neutrinos, short times
Why Scientists Might Resist
- Capacity also needed
- Software isn't ready
- Coerced to run capability-sized jobs on
inappropriate systems
Capability Requirements
- Sample DOE SC applications
- Climate: POP, CAM
- Fusion: AORSA, Gyro
- Materials: LSMS, DCA-QMC
Parallel Ocean Program (POP)
- Baroclinic
- 3D, nearest neighbor, scalable
- Memory-bandwidth limited
- Barotropic
- 2D implicit system, latency bound (see the sketch below)
- Ocean-only simulation
- Higher resolution
- Faster time steps
- As ocean component for CCSM
- Atmosphere dominates
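A minimal sketch (hypothetical, not POP source code) of why the barotropic solve is latency bound: each solver iteration needs a global dot product, an MPI_Allreduce of a single double, so the per-iteration cost is set by interconnect latency rather than bandwidth.

    /* Sketch: global dot product inside an iterative 2D implicit solve.
     * The local arithmetic is trivial; the MPI_Allreduce of one double
     * per iteration is what dominates at scale. */
    #include <mpi.h>
    #include <stdio.h>

    static double global_dot(const double *x, const double *y, int n,
                             MPI_Comm comm)
    {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < n; i++)
            local += x[i] * y[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        double x[4] = {1, 2, 3, 4};       /* stand-in for local ocean points */
        double r = 0.0;
        for (int it = 0; it < 1000; it++) /* many iterations per time step */
            r = global_dot(x, x, 4, MPI_COMM_WORLD);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) printf("dot = %g\n", r);
        MPI_Finalize();
        return 0;
    }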
Community Atmosphere Model (CAM)
- Atmosphere component for CCSM
- Higher resolution?
- Physics changes, parameterization must be retuned, model must be revalidated
- Major effort, rare event
- Spectral transform not dominant
- Dramatic increases in computation per grid point
- Dynamic vegetation, carbon cycle, atmospheric chemistry, ...
- Faster time steps
All-Orders Spectral Algorithm (AORSA)
- Radio-frequency fusion-plasma simulation
- Highly scalable
- Dominated by ScaLAPACK
- Still in weak-scaling regime
- But
- Expanded physics reducing ScaLAPACK dominance
- Developing sparse formulation
Gyro
- Continuum gyrokinetic simulation of fusion-plasma microturbulence
- 1D data decomposition
- Spectral method - high communication volume (see the sketch below)
- Some need for increased resolution
- More iterations
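A minimal sketch (hypothetical, not Gyro source) of why a spectral method on a 1D decomposition has high communication volume: applying the transform along the distributed direction requires transposing the field, an all-to-all that moves essentially the whole array across the interconnect every step (local packing into per-destination blocks is omitted here).

    /* Sketch: the data movement of a distributed transpose.  Each process
     * sends an nloc x nloc block to every other process, so one transpose
     * moves roughly the entire n x n field through the interconnect. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int nloc = 8;                    /* rows owned locally (toy size) */
        int n = nloc * nprocs;           /* global field is n x n */
        double *slab  = malloc((size_t)nloc * n * sizeof *slab);
        double *trans = malloc((size_t)nloc * n * sizeof *trans);
        for (int i = 0; i < nloc * n; i++)
            slab[i] = 0.001 * i;         /* stand-in field data */

        MPI_Alltoall(slab,  nloc * nloc, MPI_DOUBLE,
                     trans, nloc * nloc, MPI_DOUBLE, MPI_COMM_WORLD);

        free(slab); free(trans);
        MPI_Finalize();
        return 0;
    }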
Locally Self-Consistent Multiple Scattering (LSMS)
- Calculates electronic structure of large systems
- One atom per processor
- Dominated by local DGEMM
- First real application to sustain a TF
- But moving to sparse formulation with a
distributed solve for each atom
Dynamic Cluster Approximation (DCA-QMC)
- Simulates high-temp superconductors
- Dominated by DGER (BLAS2)
- Memory-bandwidth limited (see the sketch below)
- Quantum Monte Carlo, but
- Fixed start-up per process
- Favors fewer, faster processors
- Needs powerful processors to avoid parallelizing
each Monte-Carlo stream
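A sketch (assuming a CBLAS header is available; not application source) of why the BLAS level separates the two materials codes: LSMS's DGEMM does on the order of n flops per matrix element it touches and can run near peak, while DCA-QMC's DGER rank-1 update does only about 2 flops per element streamed through memory, so it runs at memory-bandwidth speed.

    /* Sketch: arithmetic intensity of BLAS2 vs BLAS3. */
    #include <cblas.h>
    #include <stdlib.h>

    int main(void)
    {
        int n = 512;
        double *A = calloc((size_t)n * n, sizeof *A);
        double *B = calloc((size_t)n * n, sizeof *B);
        double *C = calloc((size_t)n * n, sizeof *C);
        double *x = calloc((size_t)n, sizeof *x);
        double *y = calloc((size_t)n, sizeof *y);

        /* BLAS2 (DCA-QMC): A += x * y^T, ~2 flops per element of A
         * streamed through memory -- bandwidth limited. */
        cblas_dger(CblasRowMajor, n, n, 1.0, x, 1, y, 1, A, n);

        /* BLAS3 (LSMS): C += A * B, ~2n flops per element touched --
         * limited by the floating-point units, not memory. */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 1.0, C, n);

        free(A); free(B); free(C); free(x); free(y);
        return 0;
    }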
Few DOE SC Applications
- Weak-ish scaling
- Dense linear algebra
- But moving to sparse
Many DOE SC Applications
- Strong-ish scaling
- Limited increase in gridpoints
- Major increase in expense per gridpoint
- Major increase in time steps
- Fewer, more-powerful processors
- High memory bandwidth
- High-bandwidth, low-latency communication
Why X1?
- Strong-ish scaling
- Limited increase in gridpoints
- Major increase in expense per gridpoint
- Major increase in time steps
- Fewer, more-powerful processors
- High memory bandwidth
- High-bandwidth, low-latency communication
Tangent: Strongish Scaling
Greg Lindahl, Vendor Scum
- Firm
- Semistrong
- Unweak
- Strongoidal
- MSTW (More Strong Than Weak)
- JTSoS (Just This Side of Strong)
- WNS (Well-Nigh Strong)
- Seak, Steak, Streak, Stroak, Stronk
- Weag, Weng, Wong, Wrong, Twong
X1 for 100 TF Sustained?
- Uh, no
- OS not scalable or fault-resilient enough for 10^4 processors
- That price/performance thing
- That power/cooling thing
Xn for 100 TF Sustained
- For DOE SC applications, YES
- Most-promising candidate
- -or-
- Least-implausible candidate
Why X, again?
- Most-powerful processors
- Reduce need for scalability
- Obey Amdahl's Law (see the sketch below)
- High memory bandwidth
- See above
- Globally addressable memory
- Lowest, most hide-able latency
- Scale latency-bound applications
- High interconnect bandwidth
- Scale bandwidth-bound applications
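A worked example (not from the talk) of the Amdahl's Law point: with serial, or unvectorized, fraction s, speedup on P processors is 1 / (s + (1 - s) / P), so a 1% serial fraction caps speedup at 100 no matter how large P gets. More powerful processors attack s directly instead of relying on P.

    /* Sketch: Amdahl's Law with a 1% serial fraction. */
    #include <stdio.h>

    static double amdahl_speedup(double s, double p)
    {
        return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void)
    {
        double s = 0.01;                      /* 1% serial/unvectorized work */
        double procs[] = {64, 1024, 16384};
        for (int i = 0; i < 3; i++)           /* prints ~39, ~91, ~99 */
            printf("P = %6.0f  speedup = %6.1f\n",
                   procs[i], amdahl_speedup(s, procs[i]));
        return 0;
    }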
The Bad News
- Scalar performance
- Some tuning required
- Ho-hum MPI latency
- See Rants
Scalar Performance
- Compilation is slow
- Amdahl's Law for single processes
- Parallelization - Vectorization
- Hard to port GNU tools
- GCC? Are you kidding?
- GCC compatibility, on the other hand
- Black Widow will be better
Some Tuning Required
- Vectorization requires
- Independent operations
- Dependence information
- Mapping to vector instructions
- Applications take a wide spectrum of steps to inhibit this
- May need a couple of compiler directives (see the sketch below)
- May need extensive rewriting
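A compile-only illustration (not taken from any of the applications above) of the dependence information a vectorizing compiler needs: with plain pointer arguments the compiler cannot prove the arrays do not overlap and must assume a possible dependence, while the C99 restrict qualifier, or a vendor ivdep-style directive, supplies the missing independence guarantee.

    /* May not vectorize: dst and src could overlap. */
    void scale_maybe_aliased(double *dst, const double *src, double a, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = a * src[i];
    }

    /* Vectorizable: the programmer asserts the accesses are independent. */
    void scale_independent(double *restrict dst, const double *restrict src,
                           double a, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = a * src[i];
    }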
Application Results
- Awesome
- Indifferent
- Recalcitrant
- Hopeless
Awesome Results
- 256-MSP X1 already showing unique capability
- Apps bound by memory bandwidth, interconnect bandwidth, interconnect latency
- POP, Gyro, DCA-QMC, AGILE-BOLTZTRAN, VH1, Amber, ...
- Many examples from DoD
Indifferent Results
- Cray X1 is brute-force fast, but not cost-effective
- Dense linear algebra
- Linpack, AORSA, LSMS
Recalcitrant Results
- Inherent algorithms are fine
- Source code or ongoing code mods don't vectorize
- Significant code rewriting done, ongoing, or needed
- CLM, CAM, Nimrod, M3D
Aside: How to Avoid Vectorization
- Use pointers to add false dependencies (see the sketch below)
- Put deep call stacks inside loops
- Put debug I/O operations inside compute loops
- Did I mention using pointers?
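A hypothetical loop that follows the recipe above: pointer arguments the compiler must assume may alias, an opaque call in the loop body, and debug I/O on every iteration; any one of these is enough to keep a vectorizing compiler away (compile-only sketch).

    #include <stdio.h>

    double opaque_flux(double x);  /* defined elsewhere: cannot be inlined */

    void update(double *u, double *u_old, int n)
    {
        for (int i = 1; i < n - 1; i++) {
            /* possible aliasing: u and u_old may overlap */
            u[i] = 0.5 * (u_old[i - 1] + u_old[i + 1]);
            /* call with unknown side effects inside the loop */
            u[i] += opaque_flux(u_old[i]);
            /* debug I/O inside the compute loop */
            printf("i=%d u=%g\n", i, u[i]);
        }
    }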
Aside: Software Design
- In general, we don't know how to systematically design efficient, maintainable HPC software
- Vectorization imposes constraints on software design
- Bad: Existing software must be rewritten
- Good: Resulting software often faster on modern superscalar systems
- Some tuning required for X series
- Bad: You must tune
- Good: Tuning is systematic, not a Black Art
- Vectorization constraints may help us develop effective design patterns for HPC software
Hopeless Results
- Dominated by unvectorizable algorithms
- Some benchmark kernels of questionable relevance
- No known DOE SC applications
Summary
- DOE SC scientists do need 100 TF and beyond of sustained application performance
- Cray X series is the least-implausible option for scaling DOE SC applications to 100 TF of sustained performance and beyond
Custom Rant
- Custom vs. Commodity is a Red Herring
- CMOS is commodity
- Memory is commodity
- Wires are commodity
- Cooling is independent of vector vs. scalar
- PNNL liquid-cooling clusters
- Vector systems may move to air-cooling
- All vendors do custom packaging
- Real issue: Software
MPI Rant
- Latency-bound apps often limited by MPI_Allreduce(..., MPI_SUM, ...)
- Not ping pong!
- An excellent abstraction that is eminently optimizable
- Some apps are limited by point-to-point
- Remote load/store implementations (CAF, UPC) have performance advantages over MPI
- But MPI could be implemented using load/store, inlined, and optimized
- On the other hand, it is easier to avoid pack/unpack with a load/store model (see the sketch below)
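A hypothetical halo exchange (not from any particular application) showing the pack/unpack cost: the east-west halo of a row-major block is strided in memory, so with plain MPI it is copied into a contiguous buffer, sent, and copied back out, whereas a load/store model can store the strided elements directly into the neighbor's memory. Here west and east are neighbor ranks (MPI_PROC_NULL at domain boundaries).

    #include <mpi.h>

    #define NX 64
    #define NY 64

    void exchange_west(double u[NX][NY + 2], int west, int east, MPI_Comm comm)
    {
        double sendbuf[NX], recvbuf[NX];

        for (int i = 0; i < NX; i++)        /* pack: strided -> contiguous */
            sendbuf[i] = u[i][1];

        MPI_Sendrecv(sendbuf, NX, MPI_DOUBLE, west, 0,
                     recvbuf, NX, MPI_DOUBLE, east, 0,
                     comm, MPI_STATUS_IGNORE);

        for (int i = 0; i < NX; i++)        /* unpack: contiguous -> strided */
            u[i][NY + 1] = recvbuf[i];
    }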
Co-Array Fortran Rant
- No such thing as one-sided communication
- It's all two-sided: send-receive, sync-put-sync, sync-get-sync (see the sketch below)
- Same parallel algorithms
- CAF mods can be highly nonlocal
- Adding CAF in a subroutine can have implications for the argument types, and thus for the callers, the callers' callers, etc.
- Rarely the case for MPI
- We use CAF to avoid MPI-implementation performance inadequacies
- Avoiding nonlocality by cheating with Cray pointers
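A sketch of the two-sided point using MPI-2 one-sided operations (standing in for CAF, since the pattern is the same): even a "one-sided" put becomes visible to the target only after both sides synchronize, so the exchange is really sync-put-sync.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double buf = -1.0;                      /* remotely accessible word */
        MPI_Win win;
        MPI_Win_create(&buf, sizeof buf, sizeof buf, MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                  /* sync: both sides */
        if (rank == 0 && nprocs > 1) {
            double value = 42.0;
            MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win); /* put */
        }
        MPI_Win_fence(0, win);                  /* sync: makes the put visible */

        if (rank == 1)
            printf("rank 1 sees %g\n", buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }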
Cray Rant
- Cray XD1 (OctigaBay) follows in tradition of T3E
- Very promising architecture
- Dumb name
- Interesting competitor with Red Storm
Questions?
- James B. White III (Trey)
- trey@ornl.gov
- http://www.csm.ornl.gov/evaluation/PHOENIX/