1
100 TF Sustained on Cray X Series
  • SOS 8
  • April 13, 2004
  • James B. White III (Trey)
  • trey@ornl.gov

2
Disclaimer
  • The opinions expressed here do not necessarily
    represent those of the CCS, ORNL, DOE, the
    Executive Branch of the Federal Government of the
    United States of America, or even UT-Battelle.

3
Disclaimer (cont.)
  • Graph-free, chart-free environment
  • For graphs and charts:
    http://www.csm.ornl.gov/evaluation/PHOENIX/

4
100 Real TF on Cray Xn
  • Who needs capability computing?
  • Application requirements
  • Why Xn?
  • Laundry, Clean and Otherwise
  • Rants
  • Custom vs. Commodity
  • MPI
  • CAF
  • Cray

5
Who needs capability computing?
  • OMB?
  • Politicians?
  • Vendors?
  • Center directors?
  • Computer scientists?

6
Who needs capability computing?
  • Application scientists
  • According to scientists themselves

7
Personal Communications
  • Fusion: General Atomics, Iowa, ORNL, PPPL, Wisconsin
  • Climate: LANL, NCAR, ORNL, PNNL
  • Materials: Cincinnati, Florida, NC State, ORNL,
    Sandia, Wisconsin
  • Biology: NCI, ORNL, PNNL
  • Chemistry: Auburn, LANL, ORNL, PNNL
  • Astrophysics: Arizona, Chicago, NC State, ORNL,
    Tennessee

8
Scientists Need Capability
  • Climate scientists need simulation fidelity to
    support policy decisions
  • All we can say now is that humans cause warming
  • Fusion scientists need to simulate fusion devices
  • All we can do now is model decoupled subprocesses
    at disparate time scales
  • Materials scientists need to design new materials
  • Just starting to reproduce known materials

9
Scientists Need Capability
  • Biologists need to simulate proteins and protein
    pathways
  • Baby steps with smaller molecules
  • Chemists need similar increases in complexity
  • Astrophysicists need to simulate nucleogenesis
    (high-res, 3D CFD, 6D neutrinos, long times)
  • Low-res, 3D CFD, approximate 3D neutrinos, short
    times

10
Why Scientists Might Resist
  • Capacity also needed
  • Software isn't ready
  • Coerced to run capability-sized jobs on
    inappropriate systems

11
Capability Requirements
  • Sample DOE SC applications
  • Climate: POP, CAM
  • Fusion: AORSA, Gyro
  • Materials: LSMS, DCA-QMC

12
Parallel Ocean Program (POP)
  • Baroclinic
  • 3D, nearest neighbor, scalable
  • Memory-bandwidth limited
  • Barotropic
  • 2D implicit system, latency bound
  • Ocean-only simulation
  • Higher resolution
  • Faster time steps
  • As ocean component for CCSM
  • Atmosphere dominates

13
Community Atmospheric Model (CAM)
  • Atmosphere component for CCSM
  • Higher resolution?
  • Physics changes, parameterization must be
    retuned, model must be revalidated
  • Major effort, rare event
  • Spectral transform not dominant
  • Dramatic increases in computation per grid point
  • Dynamic vegetation, carbon cycle, atmospheric
    chemistry, ...
  • Faster time steps

14
All-Orders Spectral Algorithm (AORSA)
  • Radio-frequency fusion-plasma simulation
  • Highly scalable
  • Dominated by ScaLAPACK
  • Still in weak-scaling regime
  • But
  • Expanded physics reducing ScaLAPACK dominance
  • Developing sparse formulation

15
Gyro
  • Continuum gyrokinetic simulation of fusion-plasma
    microturbulence
  • 1D data decomposition
  • Spectral method - high communication volume
  • Some need for increased resolution
  • More iterations

16
Locally Self-Consistent Multiple Scattering (LSMS)
  • Calculates electronic structure of large systems
  • One atom per processor
  • Dominated by local DGEMM
  • First real application to sustain a TF
  • But moving to sparse formulation with a
    distributed solve for each atom

17
Dynamic Cluster Approximation (DCA-QMC)
  • Simulates high-temp superconductors
  • Dominated by DGER (BLAS2)
  • Memory-bandwidth limited (see the sketch below)
  • Quantum Monte Carlo, but
  • Fixed start-up per process
  • Favors fewer, faster processors
  • Needs powerful processors to avoid parallelizing
    each Monte-Carlo stream
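
A rough sketch of why the BLAS level matters here, using hypothetical sizes and a generic BLAS library (not code from the slides): DGEMM (BLAS3), which dominates LSMS, reuses each operand many times and can run near peak, while DGER (BLAS2), which dominates DCA-QMC, touches every element of the updated matrix exactly once and is paced by memory bandwidth.

  ! Sketch with hypothetical sizes; link against any BLAS library.
  program blas_sketch
    implicit none
    integer, parameter :: n = 1000
    real(8), allocatable :: a(:,:), b(:,:), c(:,:), x(:), y(:)
    allocate (a(n,n), b(n,n), c(n,n), x(n), y(n))
    a = 1.0d0; b = 1.0d0; c = 0.0d0; x = 1.0d0; y = 1.0d0
    ! DGEMM (BLAS3): 2*n**3 flops over O(n**2) data, so each
    ! element is reused about n times -- compute-bound.
    call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 1.0d0, c, n)
    ! DGER (BLAS2): only 2*n**2 flops over n**2 data, with no
    ! reuse of C, so the update is memory-bandwidth-bound.
    call dger(n, n, 1.0d0, x, 1, y, 1, c, n)
  end program blas_sketch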

18
Few DOE SC Applications
  • Weak-ish scaling
  • Dense linear algebra
  • But moving to sparse

19
Many DOE SC Applications
  • Strong-ish scaling
  • Limited increase in gridpoints
  • Major increase in expense per gridpoint
  • Major increase in time steps
  • Fewer, more-powerful processors
  • High memory bandwidth
  • High-bandwidth, low-latency communication

20
Why X1?
  • Strong-ish scaling
  • Limited increase in gridpoints
  • Major increase in expense per gridpoint
  • Major increase in time steps
  • Fewer, more-powerful processors
  • High memory bandwidth
  • High-bandwidth, low-latency communication

21
Tangent: Strongish Scaling
Greg Lindahl, Vendor Scum
  • Firm
  • Semistrong
  • Unweak
  • Strongoidal
  • MSTW (More Strong Than Weak)
  • JTSoS (Just This Side of Strong)
  • WNS (Well-Nigh Strong)
  • Seak, Steak, Streak, Stroak, Stronk
  • Weag, Weng, Wong, Wrong, Twong

22
X1 for 100 TF Sustained?
  • Uh, no
  • OS not scalable or fault-resilient enough for 10^4
    processors
  • That price/performance thing
  • That power/cooling thing

23
Xn for 100 TF Sustained
  • For DOE SC applications, YES
  • Most-promising candidate
  • -or-
  • Least-implausible candidate

24
Why X, again?
  • Most-powerful processors
  • Reduce need for scalability
  • Obey Amdahl's Law (see the note below)
  • High memory bandwidth
  • See above
  • Globally addressable memory
  • Lowest, most hide-able latency
  • Scale latency-bound applications
  • High interconnect bandwidth
  • Scale bandwidth-bound applications
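
A one-line refresher on the Amdahl's Law point above (the standard formula, not from the slides): with parallel fraction p on N processors,

  S(N) = 1 / ((1 - p) + p/N)

so even p = 0.99 caps speedup at 100 no matter how large N grows. A more powerful processor shrinks the serial term itself, which is why fewer, faster processors reduce the scalability burden.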

25
The Bad News
  • Scalar performance
  • Some tuning required
  • Ho-hum MPI latency
  • See Rants

26
Scalar Performance
  • Compilation is slow
  • Amdahl's Law for single processes
  • Parallelization - Vectorization
  • Hard to port GNU tools
  • GCC? Are you kidding?
  • GCC compatibility, on the other hand
  • Black Widow will be better

27
Some Tuning Required
  • Vectorization requires
  • Independent operations
  • Dependence information
  • Mapping to vector instructions
  • Applications take a wide spectrum of steps to
    inhibit this
  • May need a couple of compiler directives (sketch below)
  • May need extensive rewriting
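
A minimal sketch of the directive case (hypothetical routine; directive spelling varies by compiler, a Cray-style !DIR$ IVDEP is shown): the compiler cannot prove that the index list never repeats, so it assumes a dependence and keeps the loop scalar unless told otherwise.

  ! Hypothetical scatter-add; the directive asserts that ix(:) has no
  ! repeated values, so the iterations are independent and can vectorize.
  subroutine scatter_add(n, ix, src, dst)
    implicit none
    integer, intent(in) :: n, ix(n)
    real(8), intent(in) :: src(n)
    real(8), intent(inout) :: dst(*)
    integer :: i
  !DIR$ IVDEP
    do i = 1, n
       dst(ix(i)) = dst(ix(i)) + src(i)
    end do
  end subroutine scatter_add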

28
Application Results
  • Awesome
  • Indifferent
  • Recalcitrant
  • Hopeless

29
Awesome Results
  • 256-MSP X1 already showing unique capability
  • Apps bound by memory bandwidth, interconnect
    bandwidth, interconnect latency
  • POP, Gyro, DCA-QMC, AGILE-BOLTZTRAN, VH1, Amber, ...
  • Many examples from DoD

30
Indifferent Results
  • Cray X1 is brute-force fast, but not cost
    effective
  • Dense linear algebra
  • Linpack, AORSA, LSMS

31
Recalcitrant Results
  • Inherent algorithms are fine
  • Source code or ongoing code mods don't vectorize
  • Significant code rewriting done, ongoing, or
    needed
  • CLM, CAM, Nimrod, M3D

32
Aside: How to Avoid Vectorization
  • Use pointers to add false dependencies
  • Put deep call stacks inside loops
  • Put debug I/O operations inside compute loops (sketch below)
  • Did I mention using pointers?
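
A sketch of two of the anti-patterns above in one hypothetical loop (illustrative only, not from any of the named codes): the pointer arguments may alias, so the compiler must assume a dependence, and the debug write is an ordered side effect, so the loop stays scalar as written.

  ! Hypothetical update loop that defeats vectorization two ways.
  subroutine update_unvectorizable(n, a, x, y)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in) :: a
    real(8), pointer :: x(:), y(:)   ! pointers: possible aliasing, assumed dependence
    integer :: i
    do i = 1, n
       y(i) = y(i) + a * x(i)
       write (*,*) 'debug: i =', i, ' y =', y(i)   ! I/O inside the compute loop
    end do
  end subroutine update_unvectorizable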

33
Aside: Software Design
  • In general, we don't know how to systematically
    design efficient, maintainable HPC software
  • Vectorization imposes constraints on software
    design
  • Bad: Existing software must be rewritten
  • Good: Resulting software often faster on modern
    superscalar systems
  • Some tuning required for X series
  • Bad: You must tune
  • Good: Tuning is systematic, not a Black Art
  • Vectorization constraints may help us develop
    effective design patterns for HPC software

34
Hopeless Results
  • Dominated by unvectorizable algorithms
  • Some benchmark kernels of questionable relevance
  • No known DOE SC applications

35
Summary
  • DOE SC scientists do need 100 TF and beyond of
    sustained application performance
  • Cray X series is the least-implausible option for
    scaling DOE SC applications to 100 TF of
    sustained performance and beyond

36
Custom Rant
  • Custom vs. Commodity is a Red Herring
  • CMOS is commodity
  • Memory is commodity
  • Wires are commodity
  • Cooling is independent of vector vs. scalar
  • PNNL is liquid-cooling clusters
  • Vector systems may move to air-cooling
  • All vendors do custom packaging
  • Real issue Software

37
MPI Rant
  • Latency-bound apps often limited by
    MPI_Allreduce(..., MPI_SUM, ...) (see the sketch below)
  • Not ping-pong!
  • An excellent abstraction that is eminently
    optimizable
  • Some apps are limited by point-to-point
  • Remote load/store implementations (CAF, UPC) have
    performance advantages over MPI
  • But MPI could be implemented using load/store,
    inlined, and optimized
  • On the other hand, easier to avoid pack/unpack
    with load/store model
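
A minimal MPI sketch of the pattern the first bullet refers to (hypothetical variable names): one tiny global sum per solver iteration. The payload is a single double, so the cost is almost entirely per-call latency rather than bandwidth, which is why ping-pong bandwidth numbers say little about it.

  ! Minimal latency-bound global sum; compile with an MPI Fortran wrapper.
  program allreduce_sketch
    use mpi
    implicit none
    integer :: ierr, rank
    real(8) :: local_dot, global_dot
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    local_dot = real(rank + 1, 8)   ! stand-in for a local partial dot product
    call MPI_Allreduce(local_dot, global_dot, 1, MPI_DOUBLE_PRECISION, &
                       MPI_SUM, MPI_COMM_WORLD, ierr)
    if (rank == 0) write (*,*) 'global sum =', global_dot
    call MPI_Finalize(ierr)
  end program allreduce_sketch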

38
Co-Array-Fortran Rant
  • No such thing as one-sided communication
  • It's all two-sided: send/receive, sync-put-sync,
    sync-get-sync (sketch below)
  • Same parallel algorithms
  • CAF mods can be highly nonlocal
  • Adding CAF in a subroutine can have implications
    on the argument types, and thus on the callers,
    the callers' callers, etc.
  • Rarely the case for MPI
  • We use CAF to avoid MPI-implementation
    performance inadequacies
  • Avoiding nonlocality by cheating with Cray
    pointers
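
A minimal coarray sketch of the sync-put-sync pattern above, written in Fortran 2008 coarray syntax (an assumption for illustration; the 2004-era Cray CAF spelled the barriers as library calls such as sync_all()). The slide's point shows up directly: the communication is "one-sided" only in who names the remote data, since both images still have to reach both synchronization points.

  program caf_sketch
    implicit none
    real(8) :: halo(8)[*]      ! coarray: addressable from every image
    real(8) :: work(8)
    integer :: right
    work = real(this_image(), 8)
    right = merge(1, this_image() + 1, this_image() == num_images())
    sync all                   ! everyone is ready to be written into
    halo(:)[right] = work(:)   ! "put": remote store into the neighbor's halo
    sync all                   ! every put has landed before anyone reads
    write (*,*) 'image', this_image(), 'holds', halo(1)
  end program caf_sketch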

39
Cray Rant
  • Cray XD1 (OctigaBay) follows in tradition of T3E

40
Cray Rant
  • Cray XD1 (OctigaBay) follows in tradition of T3E
  • Very promising architecture
  • Dumb name
  • Interesting competitor with Red Storm

41
Questions?
  • James B. White III (Trey)
  • trey@ornl.gov
  • http://www.csm.ornl.gov/evaluation/PHOENIX/