Title: 100 TF Sustained on Cray X Series
100 TF Sustained on Cray X Series
- SOS 8
- April 13, 2004
- James B. White III (Trey)
- trey@ornl.gov
Disclaimer
- The opinions expressed here do not necessarily
represent those of the CCS, ORNL, DOE, the
Executive Branch of the Federal Government of the
United States of America, or even UT-Battelle.
Disclaimer (cont.)
- Graph-free, chart-free environment
- For graphs and charts: http://www.csm.ornl.gov/evaluation/PHOENIX/
100 Real TF on Cray Xn
- Who needs capability computing?
- Application requirements
- Why Xn?
- Laundry, Clean and Otherwise
- Rants
- Custom vs. Commodity
- MPI
- CAF
- Cray
Who needs capability computing?
- OMB?
- Politicians?
- Vendors?
- Center directors?
- Computer scientists?
Who needs capability computing? (cont.)
- Application scientists
- According to scientists themselves
Personal Communications
- Fusion
- General Atomics, Iowa, ORNL, PPPL, Wisconsin
- Climate
- LANL, NCAR, ORNL, PNNL
- Materials
- Cincinnati, Florida, NC State, ORNL, Sandia, Wisconsin
- Biology
- NCI, ORNL, PNNL
- Chemistry
- Auburn, LANL, ORNL, PNNL
- Astrophysics
- Arizona, Chicago, NC State, ORNL, Tennessee
Scientists Need Capability
- Climate scientists need simulation fidelity to support policy decisions
- All we can say now is that humans cause warming
- Fusion scientists need to simulate fusion devices
- All we can do now is model decoupled subprocesses at disparate time scales
- Materials scientists need to design new materials
- Just starting to reproduce known materials
Scientists Need Capability (cont.)
- Biologists need to simulate proteins and protein pathways
- Baby steps with smaller molecules
- Chemists need similar increases in complexity
- Astrophysicists need to simulate nucleogenesis (high-res, 3D CFD, 6D neutrinos, long times)
- Low-res, 3D CFD, approximate 3D neutrinos, short times
Why Scientists Might Resist
- Capacity also needed
- Software isn't ready
- Coerced to run capability-sized jobs on
inappropriate systems
Capability Requirements
- Sample DOE SC applications
- Climate: POP, CAM
- Fusion: AORSA, Gyro
- Materials: LSMS, DCA-QMC
Parallel Ocean Program (POP)
- Baroclinic
- 3D, nearest neighbor, scalable
- Memory-bandwidth limited
- Barotropic
- 2D implicit system, latency bound (see the sketch below)
- Ocean-only simulation
- Higher resolution
- Faster time steps
- As ocean component for CCSM
- Atmosphere dominates
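A minimal sketch (hypothetical, not POP source code) of why the barotropic solve is latency bound: each solver iteration needs a global dot product, an MPI_Allreduce of a single double, so the per-iteration cost is set by interconnect latency rather than bandwidth.

    /* Sketch: global dot product inside an iterative 2D implicit solve.
     * The local arithmetic is trivial; the MPI_Allreduce of one double
     * per iteration is what dominates at scale. */
    #include <mpi.h>
    #include <stdio.h>

    static double global_dot(const double *x, const double *y, int n,
                             MPI_Comm comm)
    {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < n; i++)
            local += x[i] * y[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        double x[4] = {1, 2, 3, 4};       /* stand-in for local ocean points */
        double r = 0.0;
        for (int it = 0; it < 1000; it++) /* many iterations per time step */
            r = global_dot(x, x, 4, MPI_COMM_WORLD);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) printf("dot = %g\n", r);
        MPI_Finalize();
        return 0;
    }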
Community Atmosphere Model (CAM)
- Atmosphere component for CCSM
- Higher resolution?
- Physics changes, parameterization must be retuned, model must be revalidated
- Major effort, rare event
- Spectral transform not dominant
- Dramatic increases in computation per grid point
- Dynamic vegetation, carbon cycle, atmospheric chemistry, ...
- Faster time steps
All-Orders Spectral Algorithm (AORSA)
- Radio-frequency fusion-plasma simulation
- Highly scalable
- Dominated by ScaLAPACK
- Still in weak-scaling regime
- But
- Expanded physics reducing ScaLAPACK dominance
- Developing sparse formulation
Gyro
- Continuum gyrokinetic simulation of fusion-plasma microturbulence
- 1D data decomposition
- Spectral method - high communication volume (see the sketch below)
- Some need for increased resolution
- More iterations
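A minimal sketch (hypothetical, not Gyro source) of why a spectral method on a 1D decomposition has high communication volume: applying the transform along the distributed direction requires transposing the field, an all-to-all that moves essentially the whole array across the interconnect every step (local packing into per-destination blocks is omitted here).

    /* Sketch: the data movement of a distributed transpose.  Each process
     * sends an nloc x nloc block to every other process, so one transpose
     * moves roughly the entire n x n field through the interconnect. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int nloc = 8;                    /* rows owned locally (toy size) */
        int n = nloc * nprocs;           /* global field is n x n */
        double *slab  = malloc((size_t)nloc * n * sizeof *slab);
        double *trans = malloc((size_t)nloc * n * sizeof *trans);
        for (int i = 0; i < nloc * n; i++)
            slab[i] = 0.001 * i;         /* stand-in field data */

        MPI_Alltoall(slab,  nloc * nloc, MPI_DOUBLE,
                     trans, nloc * nloc, MPI_DOUBLE, MPI_COMM_WORLD);

        free(slab); free(trans);
        MPI_Finalize();
        return 0;
    }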
Locally Self-Consistent Multiple Scattering (LSMS)
- Calculates electronic structure of large systems
- One atom per processor
- Dominated by local DGEMM
- First real application to sustain a TF
- But moving to sparse formulation with a
distributed solve for each atom
Dynamic Cluster Approximation (DCA-QMC)
- Simulates high-temp superconductors
- Dominated by DGER (BLAS2)
- Memory-bandwidth limited (see the sketch below)
- Quantum Monte Carlo, but
- Fixed start-up per process
- Favors fewer, faster processors
- Needs powerful processors to avoid parallelizing
each Monte-Carlo stream
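A sketch (assuming a CBLAS header is available; not application source) of why the BLAS level separates the two materials codes: LSMS's DGEMM does on the order of n flops per matrix element it touches and can run near peak, while DCA-QMC's DGER rank-1 update does only about 2 flops per element streamed through memory, so it runs at memory-bandwidth speed.

    /* Sketch: arithmetic intensity of BLAS2 vs BLAS3. */
    #include <cblas.h>
    #include <stdlib.h>

    int main(void)
    {
        int n = 512;
        double *A = calloc((size_t)n * n, sizeof *A);
        double *B = calloc((size_t)n * n, sizeof *B);
        double *C = calloc((size_t)n * n, sizeof *C);
        double *x = calloc((size_t)n, sizeof *x);
        double *y = calloc((size_t)n, sizeof *y);

        /* BLAS2 (DCA-QMC): A += x * y^T, ~2 flops per element of A
         * streamed through memory -- bandwidth limited. */
        cblas_dger(CblasRowMajor, n, n, 1.0, x, 1, y, 1, A, n);

        /* BLAS3 (LSMS): C += A * B, ~2n flops per element touched --
         * limited by the floating-point units, not memory. */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 1.0, C, n);

        free(A); free(B); free(C); free(x); free(y);
        return 0;
    }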
Few DOE SC Applications
- Weak-ish scaling
- Dense linear algebra
- But moving to sparse
Many DOE SC Applications
- Strong-ish scaling
- Limited increase in gridpoints
- Major increase in expense per gridpoint
- Major increase in time steps
- Fewer, more-powerful processors
- High memory bandwidth
- High-bandwidth, low-latency communication
Why X1?
- Strong-ish scaling
- Limited increase in gridpoints
- Major increase in expense per gridpoint
- Major increase in time steps
- Fewer, more-powerful processors
- High memory bandwidth
- High-bandwidth, low-latency communication
Tangent: Strongish Scaling
Greg Lindahl, Vendor Scum
- Firm
- Semistrong
- Unweak
- Strongoidal
- MSTW (More Strong Than Weak)
- JTSoS (Just This Side of Strong)
- WNS (Well-Nigh Strong)
- Seak, Steak, Streak, Stroak, Stronk
- Weag, Weng, Wong, Wrong, Twong
X1 for 100 TF Sustained?
- Uh, no
- OS not scalable or fault-resilient enough for 10^4 processors
- That price/performance thing
- That power/cooling thing
Xn for 100 TF Sustained
- For DOE SC applications, YES
- Most-promising candidate
- -or-
- Least-implausible candidate
Why X, again?
- Most-powerful processors
- Reduce need for scalability
- Obey Amdahl's Law (see the sketch below)
- High memory bandwidth
- See above
- Globally addressable memory
- Lowest, most hide-able latency
- Scale latency-bound applications
- High interconnect bandwidth
- Scale bandwidth-bound applications
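A worked example (not from the talk) of the Amdahl's Law point: with serial, or unvectorized, fraction s, speedup on P processors is 1 / (s + (1 - s) / P), so a 1% serial fraction caps speedup at 100 no matter how large P gets. More powerful processors attack s directly instead of relying on P.

    /* Sketch: Amdahl's Law with a 1% serial fraction. */
    #include <stdio.h>

    static double amdahl_speedup(double s, double p)
    {
        return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void)
    {
        double s = 0.01;                      /* 1% serial/unvectorized work */
        double procs[] = {64, 1024, 16384};
        for (int i = 0; i < 3; i++)           /* prints ~39, ~91, ~99 */
            printf("P = %6.0f  speedup = %6.1f\n",
                   procs[i], amdahl_speedup(s, procs[i]));
        return 0;
    }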
The Bad News
- Scalar performance
- Some tuning required
- Ho-hum MPI latency
- See Rants
Scalar Performance
- Compilation is slow
- Amdahl's Law for single processes
- Parallelization - Vectorization
- Hard to port GNU tools
- GCC? Are you kidding?
- GCC compatibility, on the other hand
- Black Widow will be better
Some Tuning Required
- Vectorization requires
- Independent operations
- Dependence information
- Mapping to vector instructions
- Applications take a wide spectrum of steps to inhibit this
- May need a couple of compiler directives (see the sketch below)
- May need extensive rewriting
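A compile-only illustration (not taken from any of the applications above) of the dependence information a vectorizing compiler needs: with plain pointer arguments the compiler cannot prove the arrays do not overlap and must assume a possible dependence, while the C99 restrict qualifier, or a vendor ivdep-style directive, supplies the missing independence guarantee.

    /* May not vectorize: dst and src could overlap. */
    void scale_maybe_aliased(double *dst, const double *src, double a, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = a * src[i];
    }

    /* Vectorizable: the programmer asserts the accesses are independent. */
    void scale_independent(double *restrict dst, const double *restrict src,
                           double a, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = a * src[i];
    }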
Application Results
- Awesome
- Indifferent
- Recalcitrant
- Hopeless
Awesome Results
- 256-MSP X1 already showing unique capability
- Apps bound by memory bandwidth, interconnect bandwidth, interconnect latency
- POP, Gyro, DCA-QMC, AGILE-BOLTZTRAN, VH1, Amber, ...
- Many examples from DoD
Indifferent Results
- Cray X1 is brute-force fast, but not cost-effective
- Dense linear algebra
- Linpack, AORSA, LSMS
Recalcitrant Results
- Inherent algorithms are fine
- Source code or ongoing code mods don't vectorize
- Significant code rewriting done, ongoing, or needed
- CLM, CAM, Nimrod, M3D
Aside: How to Avoid Vectorization
- Use pointers to add false dependencies (see the sketch below)
- Put deep call stacks inside loops
- Put debug I/O operations inside compute loops
- Did I mention using pointers?
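A hypothetical loop that follows the recipe above: pointer arguments the compiler must assume may alias, an opaque call in the loop body, and debug I/O on every iteration; any one of these is enough to keep a vectorizing compiler away (compile-only sketch).

    #include <stdio.h>

    double opaque_flux(double x);  /* defined elsewhere: cannot be inlined */

    void update(double *u, double *u_old, int n)
    {
        for (int i = 1; i < n - 1; i++) {
            /* possible aliasing: u and u_old may overlap */
            u[i] = 0.5 * (u_old[i - 1] + u_old[i + 1]);
            /* call with unknown side effects inside the loop */
            u[i] += opaque_flux(u_old[i]);
            /* debug I/O inside the compute loop */
            printf("i=%d u=%g\n", i, u[i]);
        }
    }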
Aside: Software Design
- In general, we don't know how to systematically design efficient, maintainable HPC software
- Vectorization imposes constraints on software design
- Bad: Existing software must be rewritten
- Good: Resulting software often faster on modern superscalar systems
- Some tuning required for X series
- Bad: You must tune
- Good: Tuning is systematic, not a Black Art
- Vectorization constraints may help us develop effective design patterns for HPC software
Hopeless Results
- Dominated by unvectorizable algorithms
- Some benchmark kernels of questionable relevance
- No known DOE SC applications
Summary
- DOE SC scientists do need 100 TF and beyond of sustained application performance
- Cray X series is the least-implausible option for scaling DOE SC applications to 100 TF of sustained performance and beyond
Custom Rant
- Custom vs. Commodity is a Red Herring
- CMOS is commodity
- Memory is commodity
- Wires are commodity
- Cooling is independent of vector vs. scalar
- PNNL liquid-cooling clusters
- Vector systems may move to air-cooling
- All vendors do custom packaging
- Real issue: Software
MPI Rant
- Latency-bound apps often limited by MPI_Allreduce(..., MPI_SUM, ...)
- Not ping pong!
- An excellent abstraction that is eminently optimizable
- Some apps are limited by point-to-point
- Remote load/store implementations (CAF, UPC) have performance advantages over MPI
- But MPI could be implemented using load/store, inlined, and optimized
- On the other hand, it is easier to avoid pack/unpack with a load/store model (see the sketch below)
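A hypothetical halo exchange (not from any particular application) showing the pack/unpack cost: the east-west halo of a row-major block is strided in memory, so with plain MPI it is copied into a contiguous buffer, sent, and copied back out, whereas a load/store model can store the strided elements directly into the neighbor's memory. Here west and east are neighbor ranks (MPI_PROC_NULL at domain boundaries).

    #include <mpi.h>

    #define NX 64
    #define NY 64

    void exchange_west(double u[NX][NY + 2], int west, int east, MPI_Comm comm)
    {
        double sendbuf[NX], recvbuf[NX];

        for (int i = 0; i < NX; i++)        /* pack: strided -> contiguous */
            sendbuf[i] = u[i][1];

        MPI_Sendrecv(sendbuf, NX, MPI_DOUBLE, west, 0,
                     recvbuf, NX, MPI_DOUBLE, east, 0,
                     comm, MPI_STATUS_IGNORE);

        for (int i = 0; i < NX; i++)        /* unpack: contiguous -> strided */
            u[i][NY + 1] = recvbuf[i];
    }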
Co-Array Fortran Rant
- No such thing as one-sided communication
- It's all two-sided: send-receive, sync-put-sync, sync-get-sync (see the sketch below)
- Same parallel algorithms
- CAF mods can be highly nonlocal
- Adding CAF in a subroutine can have implications for the argument types, and thus for the callers, the callers' callers, etc.
- Rarely the case for MPI
- We use CAF to avoid MPI-implementation performance inadequacies
- Avoiding nonlocality by cheating with Cray pointers
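A sketch of the two-sided point using MPI-2 one-sided operations (standing in for CAF, since the pattern is the same): even a "one-sided" put becomes visible to the target only after both sides synchronize, so the exchange is really sync-put-sync.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double buf = -1.0;                      /* remotely accessible word */
        MPI_Win win;
        MPI_Win_create(&buf, sizeof buf, sizeof buf, MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                  /* sync: both sides */
        if (rank == 0 && nprocs > 1) {
            double value = 42.0;
            MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win); /* put */
        }
        MPI_Win_fence(0, win);                  /* sync: makes the put visible */

        if (rank == 1)
            printf("rank 1 sees %g\n", buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }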
Cray Rant
- Cray XD1 (OctigaBay) follows in tradition of T3E
- Very promising architecture
- Dumb name
- Interesting competitor with Red Storm
Questions?
- James B. White III (Trey)
- trey@ornl.gov
- http://www.csm.ornl.gov/evaluation/PHOENIX/