The Politics and Economics of Parallel Computing Performance

Slides: 24

Provided by: alic142

Learn more at: https://cseweb.ucsd.edu

Transcript and Presenter's Notes

1
The Politics and Economics of Parallel Computing
Performance

Allan Snavely
UCSD Computer Science Dept.
SDSC

2
Computnik

Not many of us (not even me) are old enough to
remember Sputnik
But recently U.S. technology received a similar
shock

3
Japanese Earth Simulator

The world's most powerful computer

4
Top500.org

HIGHLIGHTS FROM THE TOP 10
The Earth Simulator, built by NEC, remains the
unchallenged #1, at > 30 TFlop/s
The cost is conservatively $500M

ASCI Q at Los Alamos is at #2 at 13.88 TFlop/s.
The third system ever to exceed the 10 TFlop/s
mark is Virginia Tech's X, measured at 10.28
TFlop/s. This cluster is built with the Apple G5
as building blocks and is often referred to as
the 'SuperMac'.
The fourth system is also a cluster. The Tungsten
cluster at NCSA is a Dell PowerEdge-based system
using a Myrinet interconnect. It just missed the
10 TFlop/s mark with a measured 9.82 TFlop/s.

6
More top 500

The list of clusters in the TOP10 continues with
the upgraded Itanium2-based Hewlett-Packard
system, located at DOE's Pacific Northwest
National Laboratory, which uses a Quadrics
interconnect.
#6 is the first system in the TOP500 based on
AMD's Opteron chip. It was installed by Linux
Networx at the Los Alamos National Laboratory and
also uses a Myrinet interconnect.
With the exception of the leading Earth
Simulator, all other TOP10 systems are installed
in the U.S.
The performance of the #10 system jumped to 6.6
TFlop/s.

7
The fine print

But how is performance measured?
Linpack is very compute intensive, not very
memory or communications intensive, and it scales
perfectly!
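A back-of-the-envelope sketch of why Linpack stresses arithmetic far more than memory: dense LU factorization of an n x n matrix performs roughly (2/3)n^3 floating-point operations on n^2 double-precision values, so the ratio of flops to matrix footprint grows linearly with n. The function name and problem sizes are illustrative assumptions, not part of the talk, and the ratio ignores actual cache traffic.

```python
def linpack_flops_per_byte(n: int) -> float:
    """Approximate flops per byte of matrix footprint for dense LU on an n x n matrix."""
    flops = (2.0 / 3.0) * n ** 3      # leading-order operation count of LU factorization
    matrix_bytes = 8.0 * n ** 2       # n*n double-precision (8-byte) entries
    return flops / matrix_bytes       # simplifies to n / 12


for n in (1_000, 10_000, 100_000):    # illustrative problem sizes (assumed)
    print(n, round(linpack_flops_per_byte(n), 1))
```

Because the ratio keeps rising with problem size, a large Linpack run can keep the floating-point units busy even on a machine with modest memory bandwidth, which is not true of many real applications.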

8
Axiom: You get what you ask for (or what you
measure for)

Measures of goodness
Macho image
Big gas tank
Cargo space
Drive it offroad
Arnold drives one

Measures of goodness
Trendy Euro image
Fuel efficiency
Parking space
Drive it on narrow streets
Herr Schroeder drives one

9
HPC Users Forum and metrics

From the beginning we dealt with
Political issues
You get what you ask for (Top500 Macho Flops)
Policy makers need a number (Macho Flops)
You measure what makes you look good (Macho
Flops)
Technical issues
Recent reports (HECRTF, SCALES) echo our earlier
consensus that time-to-solution (TTS) is the HPC
metric
But TTS is complicated and problem dependent
(and policy makers need a number)
Is it even technically feasible to encompass TTS
in one or a few low-level metrics?

10
A science of performance

A model is a calculable explanation of why a
(program, application, input) tuple performs as
it does
Should yield a prediction (quantifiable
objective)
Accurate predictions of observable performance
points give you some confidence in methods (as
for example to allay fears of perturbation via
intrusion)
Performance models embody understanding of the
factors that affect performance
Inform the tuning process (of application and
machine)
Guide applications to the best machine
Enable applications driven architecture design
Extrapolate to the performance of future systems

PMaC
11
Goals for performance modeling tools and methods

Performance should map back to a small set of
orthogonal benchmarks
Generation of performance models should be
automated, or at least as regular and systemized
as possible
Performance models must be time-tractable
Error is acceptable if it is bounded and allows
meeting these objectives
Taking these principles to extremes would allow
dynamic, automatic performance improvement via
adaption (this is open research)

12
A useful framework

Machine Profiles - characterizations of the rates
at which a machine can (or is projected to) carry
out fundamental operations, abstracted from any
particular application
Application Signatures - detailed summaries of the
fundamental operations to be carried out by the
application, independent of any particular machine
Combine Machine Profile and Application
Signature using
Convolution Methods - algebraic mappings of the
Application Signatures onto the Machine Profiles
to arrive at a performance prediction
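The framework above can be sketched in a few lines, assuming the simplest possible convolution: predicted time is the sum, over operation types, of the application's operation count divided by the machine's rate for that operation. The operation names, rates, and counts are hypothetical, and the no-overlap sum is only one of many possible mappings.

```python
# Hypothetical machine profile: sustained rates in operations per second.
machine_profile = {
    "flop": 2.0e9,       # floating-point operations
    "mem_l1": 4.0e9,     # loads/stores served from L1 cache
    "mem_main": 2.0e8,   # loads/stores served from main memory
}

# Hypothetical application signature: operation counts for one run,
# collected independently of any particular machine.
application_signature = {
    "flop": 6.0e9,
    "mem_l1": 8.0e9,
    "mem_main": 4.0e8,
}


def convolve(signature: dict, profile: dict) -> float:
    """Map a signature onto a profile: sum of count / rate, assuming no overlap."""
    return sum(count / profile[op] for op, count in signature.items())


predicted_seconds = convolve(application_signature, machine_profile)
print(predicted_seconds)
```

Keeping the profile and the signature as separate inputs is the point of the design: the same signature can be convolved with the profile of a different (or not yet built) machine without re-tracing the application.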

13
PMaC HPC Benchmark Suite

The goal is to develop means to infer execution time
of full applications at scale from low-level
metrics taken on (smaller) prototype systems
To do this in a systematic, even automated way
To be able to compare apples and oranges
To enable wide workload characterizations
To keep number of metrics compact
Add metrics only to increase resolution
Go to web page www.sdsc.edu/PMaC

14
Machine Profiles: Single Processor Component
(MAPS)

Machine Profiles are useful for
revealing the underlying capability of the machine
comparing machines
Machine Profiles are produced by
MAPS (Memory Access Pattern Signature), which, along
with the rest of the PMaC HPC Benchmark Suite, is
available at www.sdsc.edu/PMaC

15
Convolutions put the two together: modeling deep
memory hierarchies
MetaSim trace collected on PETSc Matrix-Vector
code, 4 CPUs, with user-supplied memory parameters
for PSC's TCSini

Single-processor or per-processor performance
Machine profile for processor (Machine A)
Application Signature for application (App. 1)
The relative per-processor performance of
App. 1 on Machine A is represented as the
MetaSim Number

16
MetaSim CPU events convolver: pick simple models
to apply to each basic block
Output 5 different convolutions:
Meta1 = Mem. time
Meta2 = Mem. time + FP time
Meta3 = MAX(Mem. time, FP time)
Meta4 = 0.5 * Mem. time + 0.5 * FP time
Meta5 = 0.9 * Mem. time + 0.1 * FP time
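A minimal sketch of the five per-basic-block convolutions on this slide, reading them as: memory time alone, memory plus FP time, the maximum of the two, and two weighted sums (0.5/0.5 and 0.9/0.1). The plus signs are an assumption where the transcript lost its operators, and the function name and per-block times are hypothetical.

```python
def metasim_convolutions(mem_time: float, fp_time: float) -> dict:
    """Return the five candidate time estimates for one basic block."""
    return {
        "Meta1": mem_time,                          # memory time only
        "Meta2": mem_time + fp_time,                # no overlap of memory and FP work
        "Meta3": max(mem_time, fp_time),            # perfect overlap
        "Meta4": 0.5 * mem_time + 0.5 * fp_time,    # equal weighting
        "Meta5": 0.9 * mem_time + 0.1 * fp_time,    # memory-dominated weighting
    }


# Hypothetical basic blocks as (memory time, FP time) pairs, in seconds.
blocks = [(2.0, 1.0), (0.5, 1.5)]

# Accumulate each convolution over all basic blocks.
totals = {}
for mem_time, fp_time in blocks:
    for name, seconds in metasim_convolutions(mem_time, fp_time).items():
        totals[name] = totals.get(name, 0.0) + seconds

print(totals)
```

Producing all five estimates side by side lets the modeler see which overlap assumption best matches measured run time for a given application and machine.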
17

Dimemas communications events convolver: simple
communication models applied to each
communication event
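One way to picture a simple per-event communication model is the common latency-plus-bandwidth form, time = latency + bytes / bandwidth, summed over the traced events. This is an illustrative assumption, not a description of what Dimemas actually implements; the class, function, and parameter values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CommEvent:
    """One traced point-to-point message (hypothetical record layout)."""
    bytes_sent: int


def comm_time(event: CommEvent, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Linear cost model: fixed latency plus size-proportional transfer time."""
    return latency_s + event.bytes_sent / bandwidth_bytes_per_s


# Hypothetical trace and network parameters.
events = [CommEvent(bytes_sent=1_000_000), CommEvent(bytes_sent=8)]
latency_s = 5e-6                 # 5 microseconds per message (assumed)
bandwidth_bytes_per_s = 250e6    # 250 MB/s (assumed)

total_comm_seconds = sum(
    comm_time(e, latency_s, bandwidth_bytes_per_s) for e in events
)
print(total_comm_seconds)
```

With these numbers the large message is bandwidth-bound and the 8-byte message is almost pure latency, which is why latency and bandwidth are worth treating as separate machine parameters.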
18
POP results graphically

Seconds per simulation day

19
Quality of model predictions for POP
20
Explaining Relative Performance of POP
21
POP Performance Sensitivity
[Figure: POP sensitivity plot. Labels: 1/Execution Time, Processor Performance, Latency Performance (Normalized), BW Performance (Normalized)]
22
Practical uses

DoD HPCMO procurement cycle
Identify strategic applications
Identify candidate machines
Run PMaC HPC Benchmark Probes on (prototypes of)
machines
Use tools to model applications on exemplary
inputs
Generate performance expectations
Input to a solver that factors in performance,
cost, architectural diversity, whim of program
director?
DARPA HPCS program
Help vendors evaluate performance impacts of
proposed architectural features

23
Acknowledgments

This work was sponsored in part by the Department
of Energy Office of Science through SciDAC award
"High-End Computer System Performance: Science
and Engineering". This work was sponsored in part
by the Department of Defense High Performance
Computing Modernization Program office through
award "HPC Applications Benchmarking". This
research was sponsored in part by DARPA through
award "HEC Metrics". This research was supported
in part by NSF cooperative agreement ACI-9619020
through computing resources provided by the
National Partnership for Advanced Computational
Infrastructure at the San Diego Supercomputer
Center. Computer time was provided by the
Pittsburgh Supercomputing Center, the Texas
Advanced Computing Center, Oak Ridge National
Laboratory, and ERDC. We would like to thank
Francesc Escale of CEPBA for all his help with
Dimemas, and Pat Worley for all his help with POP.