The Politics and Economics of Parallel Computing Performance


1
The Politics and Economics of Parallel Computing
Performance
  • Allan Snavely
  • UCSD Computer Science Dept.
  • SDSC

2
Computnik
  • Not many of us (not even me) are old enough to
    remember Sputnik
  • But recently U.S. technology received a similar
    shock

3
Japanese Earth Simulator
  • The world's most powerful computer

4
Top500.org
  • HIGHLIGHTS FROM THE TOP 10
  • The Earth Simulator, built by NEC, remains the
    unchallenged #1 at > 30 TFlop/s
  • The cost is conservatively $500M

5
  • ASCI Q at Los Alamos is at #2 with 13.88 TFlop/s.
  • The third system ever to exceed the 10 TFlop/s
    mark is Virginia Tech's X, measured at 10.28
    TFlop/s. This cluster is built with the Apple G5
    as its building block and is often referred to as
    the 'SuperMac'.
  • The fourth system is also a cluster. The Tungsten
    cluster at NCSA is a Dell PowerEdge-based system
    using a Myrinet interconnect. It just missed the
    10 TFlop/s mark with a measured 9.82 TFlop/s.

6
More top 500
  • The list of clusters in the TOP10 continues with
    the upgraded Itanium2-based Hewlett-Packard
    system, located at DOE's Pacific Northwest
    National Laboratory, which uses a Quadrics
    interconnect.
  • #6 is the first system in the TOP500 based on
    AMD's Opteron chip. It was installed by Linux
    Networx at the Los Alamos National Laboratory and
    also uses a Myrinet interconnect.
  • With the exception of the leading Earth
    Simulator, all other TOP10 systems are installed
    in the U.S.
  • The performance of the #10 system jumped to 6.6
    TFlop/s.

7
The fine print
  • But how is performance measured?
  • Linpack is very compute intensive, not very
    memory- or communications-intensive, and it
    scales perfectly!
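To see why "compute intensive" matters, compare arithmetic intensities. This sketch uses standard textbook flop counts, not anything from the slides; the function names and printed figures are illustrative.

```python
# Sketch: why Linpack flatters machines with fast FPUs.
# Dense LU factorization (Linpack's kernel) does ~(2/3)n^3 flops over
# ~8n^2 bytes of matrix data, so its flops-per-byte ratio grows with n.
# A memory-bound kernel like the STREAM triad stays constant and tiny.

def linpack_intensity(n: int) -> float:
    """Flops per byte of matrix data for dense LU factorization."""
    flops = (2.0 / 3.0) * n**3
    bytes_moved = 8.0 * n**2          # one 8-byte double per matrix element
    return flops / bytes_moved

def stream_triad_intensity() -> float:
    """Flops per byte for a[i] = b[i] + s*c[i]: 2 flops, 24 bytes moved."""
    return 2.0 / 24.0

print(linpack_intensity(10_000))   # ~833 flops/byte: compute-bound
print(stream_triad_intensity())    # ~0.083 flops/byte: memory-bound
```

A benchmark with intensity in the hundreds barely exercises the memory system, which is exactly the fine print the slide is pointing at.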

8
Axiom: You get what you ask for (or what you
measure for)
  • Measures of goodness:
    • Macho image
    • Big gas tank
    • Cargo space
    • Drive it offroad
    • Arnold drives one
  • Measures of goodness:
    • Trendy Euro image
    • Fuel efficiency
    • Parking space
    • Drive it on narrow streets
    • Herr Schroeder drives one

9
HPC Users Forum and metrics
  • From the beginning we dealt with
  • Political issues:
    • You get what you ask for (Top500 Macho Flops)
    • Policy makers need a number (Macho Flops)
    • You measure what makes you look good (Macho
      Flops)
  • Technical issues:
    • Recent reports (HECRTF, SCALES) echo our earlier
      consensus that time-to-solution (TTS) is the HPC
      metric
    • But TTS is complicated and problem-dependent
      (and policy makers need a number)
    • Is it even technically feasible to encompass TTS
      in one or a few low-level metrics?

10
A science of performance
  • A model is a calculable explanation of why a
    program (an application, input tuple) performs as
    it does
  • Should yield a prediction (quantifiable
    objective)
  • Accurate predictions of observable performance
    points give you some confidence in methods (as
    for example to allay fears of perturbation via
    intrusion)
  • Performance models embody understanding of the
    factors that affect performance
  • Inform the tuning process (of application and
    machine)
  • Guide applications to the best machine
  • Enable applications driven architecture design
  • Extrapolate to the performance of future systems

PMaC
11
Goals for performance modeling tools and methods
  • Performance should map back to a small set of
    orthogonal benchmarks
  • Generation of performance models should be
    automated, or at least as regular and systemized
    as possible
  • Performance models must be time-tractable
  • Error is acceptable if it is bounded and allows
    meeting these objectives
  • Taking these principles to extremes would allow
    dynamic, automatic performance improvement via
    adaption (this is open research)

PMaC
12
A useful framework
  • Machine Profiles - characterizations of the rates
    at which a machine can (or is projected to) carry
    out fundamental operations, abstracted from any
    particular application
  • Application Signature - detailed summaries of the
    fundamental operations to be carried out by the
    application independent of any particular machine
  • Combine Machine Profile and Application
    Signature using
  • Convolution Methods - algebraic mappings of the
    Application Signatures on to the Machine profiles
    to arrive at a performance prediction

PMaC
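The framework above can be sketched in a few lines. This is a minimal illustration of the convolution idea, not PMaC's actual tooling; all rates, counts, and names below are hypothetical.

```python
# Minimal sketch of the Machine Profile / Application Signature convolution:
# the profile gives rates for fundamental operations, the signature gives
# counts of those operations, and the convolution maps one onto the other
# to produce a predicted execution time.

machine_profile = {            # operations per second (hypothetical machine)
    "flop":        2.0e9,
    "mem_stride1": 5.0e8,      # stride-1 loads/stores
    "mem_random":  5.0e7,      # random-access loads/stores
}

app_signature = {              # operation counts (hypothetical application)
    "flop":        1.0e11,
    "mem_stride1": 4.0e10,
    "mem_random":  2.0e9,
}

def predict_time(profile, signature):
    """Convolve: each operation class contributes count / rate seconds."""
    return sum(signature[op] / profile[op] for op in signature)

print(f"predicted time: {predict_time(machine_profile, app_signature):.1f} s")
```

The key property is the separation of concerns: re-run the same signature against a different machine's profile and you get that machine's prediction without re-measuring the application.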
13
PMaC HPC Benchmark Suite
  • The goal is to develop means to infer the
    execution time of full applications at scale from
    low-level metrics taken on (smaller) prototype
    systems
  • To do this in a systematic, even automated way
  • To be able to compare apples and oranges
  • To enable wide workload characterizations
  • To keep number of metrics compact
  • Add metrics only to increase resolution
  • Go to web page www.sdsc.edu/PMaC

14
Machine Profiles Single Processor Component
MAPS
  • Machine Profiles are useful for:
    • revealing the underlying capability of the
      machine
    • comparing machines
  • Machine Profiles are produced by MAPS (Memory
    Access Pattern Signature), which, along with the
    rest of the PMaC HPC Benchmark Suite, is
    available at www.sdsc.edu/PMaC
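A toy probe in the spirit of MAPS, not the actual benchmark: time the same array traversal at different strides so the measured rate reflects the memory access pattern, which is what a machine profile captures. (Pure Python blunts cache effects; the structure, not the numbers, is the point.)

```python
# Toy memory-access-pattern probe: measure achieved bandwidth as a
# function of stride. Real MAPS-style probes do this in compiled code
# across many patterns and working-set sizes.

import array
import time

def probe_bandwidth(n=1 << 20, stride=1):
    """Return MB/s achieved touching an 8-byte-element array at a stride."""
    a = array.array("d", range(n))
    t0 = time.perf_counter()
    s = 0.0
    for i in range(0, n, stride):
        s += a[i]
    elapsed = time.perf_counter() - t0
    touched = (n // stride) * 8        # bytes actually read
    return touched / elapsed / 1e6

for stride in (1, 8, 64):
    print(f"stride {stride:3d}: {probe_bandwidth(stride=stride):8.1f} MB/s")
```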

15
Convolutions put the two together: modeling deep
memory hierarchies
[Figure: MetaSim trace collected on PETSc
matrix-vector code, 4 CPUs, with user-supplied
memory parameters for PSC's TCSini]
  • Single-processor or per-processor performance
  • Machine profile for processor (Machine A)
  • Application Signature for application (App. 1)
  • The relative per-processor performance of App. 1
    on Machine A is represented as the MetaSim Number

16
MetaSim CPU events convolver: pick simple models
to apply to each basic block
Output: 5 different convolutions.
  • Meta1 = Mem. time
  • Meta2 = Mem. time + FP time
  • Meta3 = MAX(Mem. time, FP time)
  • Meta4 = 0.5 × Mem. time + 0.5 × FP time
  • Meta5 = 0.9 × Mem. time + 0.1 × FP time
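The five convolutions are a direct transcription of the slide's formulas; each takes the predicted memory time and floating-point time for one basic block and models a different degree of memory/compute overlap. The function name is illustrative.

```python
# The five MetaSim convolutions for one basic block, given its predicted
# memory time and floating-point time (in seconds).

def metasim_convolutions(mem_time: float, fp_time: float) -> dict:
    return {
        "Meta1": mem_time,                          # memory time only
        "Meta2": mem_time + fp_time,                # no overlap at all
        "Meta3": max(mem_time, fp_time),            # perfect overlap
        "Meta4": 0.5 * mem_time + 0.5 * fp_time,    # half overlap
        "Meta5": 0.9 * mem_time + 0.1 * fp_time,    # mostly memory-bound
    }

print(metasim_convolutions(mem_time=8.0, fp_time=2.0))
```

Comparing which convolution best matches measured times tells you how much a given machine actually overlaps memory access with computation.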
17

Dimemas communications events convolver: simple
communication models applied to each
communication event
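The classic simple model for a point-to-point communication event is latency plus size over bandwidth. This is a generic sketch of that model, not Dimemas's implementation, and the default parameters are illustrative, not any real machine's.

```python
# Minimal latency/bandwidth model for one communication event:
# time = per-message latency + message size / sustained bandwidth.

def message_time(size_bytes: float,
                 latency_s: float = 5e-6,        # per-message latency (s)
                 bandwidth_bps: float = 1e9):    # sustained bandwidth (B/s)
    """Predicted time for one point-to-point message."""
    return latency_s + size_bytes / bandwidth_bps

# Small messages are latency-dominated, large ones bandwidth-dominated.
print(message_time(100))         # latency term dominates
print(message_time(1_000_000))   # bandwidth term dominates
```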
18
POP results graphically
  • Seconds per simulation day

PMaC
19
Quality of model predictions for POP
PMaC
20
Explaining Relative Performance of POP
21
POP Performance Sensitivity
[Figure: 1/execution time vs. processor
performance, latency (normalized), and bandwidth
(normalized)]
22
Practical uses
  • DoD HPCMO procurement cycle
  • Identify strategic applications
  • Identify candidate machines
  • Run PMaC HPC Benchmark Probes on (prototypes of)
    machines
  • Use tools to model applications on exemplary
    inputs
  • Generate performance expectations
  • Input to a solver that factors in performance,
    cost, architectural diversity, and the whim of
    the program director?
  • DARPA HPCS program
  • Help vendors evaluate performance impacts of
    proposed architectural features

23
Acknowledgments
  • This work was sponsored in part by the Department
    of Energy Office of Science through SciDAC award
    "High-End Computer System Performance: Science
    and Engineering." This work was sponsored in part
    by the Department of Defense High Performance
    Computing Modernization Program office through
    award "HPC Applications Benchmarking." This
    research was sponsored in part by DARPA through
    award "HEC Metrics." This research was supported
    in part by NSF cooperative agreement ACI-9619020
    through computing resources provided by the
    National Partnership for Advanced Computational
    Infrastructure at the San Diego Supercomputer
    Center. Computer time was provided by the
    Pittsburgh Supercomputing Center, the Texas
    Advanced Computing Center, Oak Ridge National
    Laboratory, and ERDC. We would like to thank
    Francesc Escale of CEPBA for all his help with
    Dimemas, and Pat Worley for all his help with POP.