Engineering Analysis of High Performance Parallel Programs

1
Engineering Analysis of High Performance Parallel
Programs

David Culler
Computer Science Division
U.C.Berkeley
http://www.cs.berkeley.edu/~culler

2
Traditional Parallel Programming Tools

Focus on showing what the program did and when it did it
microscopic analysis of deterministic events
oriented towards initial development of small programs on small data sets and small machines
Instrumentation
traces, counters, profiles
Visualization
Examples
AIMS, PTOOLS, PPP
Pablo, Paradyn, ... => Delphi
ACTS TAU - Tuning and Analysis Utilities

3
Example: Pablo
4
Beyond Zeroth-order Analysis

Basic level: to get to a system design that is reasonable and behaves properly under ideal conditions
Subject the system to various stresses to understand its operating regime and gain deeper insight into its dynamic behavior
Combine empirical data with analytical models
Iterate
From "What?" to "What if?"

[Figure: max displacement plotted against wind speed]
5
Approach: Framework for Parameterized Sensitivity Analysis

framework performs analysis over numerous runs
statistical filtering
vary parameter of interest
provides means of combining data to isolate
effects of interest
=> ROBUSTNESS

[Diagram components: Problem Data Set Generator; Well-developed Parallel Program; Instrumentation Tools; Study Parameter; Machine Characterizers; parameters: Procs, Comm. perf., Cache, Scheduling, ...; outputs: visualization, modeling]
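
A minimal sketch of such a framework in Python, assuming a hypothetical run(value) callable that executes the program once for a given parameter value and returns its run time. The names sweep, filtered_median and fake_run are illustrative only, and filtering by median absolute deviation is one possible choice of statistical filter, not necessarily the one used in this study.

```python
import itertools
import statistics


def sweep(run, values, repeats=5):
    """Run the program `repeats` times at each value of the study parameter."""
    return {v: [run(v) for _ in range(repeats)] for v in values}


def filtered_median(samples, k=3.0):
    """Median after dropping samples more than k MADs away from the median."""
    med = statistics.median(samples)
    mad = statistics.median(abs(s - med) for s in samples)
    if mad == 0:
        return med
    kept = [s for s in samples if abs(s - med) <= k * mad]
    return statistics.median(kept)


# Hypothetical stand-in for launching the real program on `procs` processors.
# Every fifth call is perturbed to imitate interference from other activity.
_calls = itertools.count()


def fake_run(procs):
    n = next(_calls)
    noise = 50.0 if n % 5 == 0 else 0.01 * (n % 3)
    return 100.0 / procs + 2.0 + noise


results = sweep(fake_run, values=[1, 2, 4], repeats=5)
summary = {p: filtered_median(samples) for p, samples in results.items()}
```

Keeping the raw samples in results, rather than only the summary, leaves the variance available for later study.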
6
Example: NAS Parallel Benchmarks

Fix problem size (NPB2.2 class A)
Two different Architectures
NOW Ultrasparc Cluster (170 MHz)
SGI Origin (250 MHz)
Six application kernels
BT - Block Tridiagonal Solve
SP -
LU - Sparse LU
MG - Multigrid
IS - Integer sort
FT - 3D FFT
Examine sensitivity to P (number of procs)
time(P), speedup(P) = Time(1)/Time(P)
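
The speedup definition above can be sketched in Python as follows. The timings are made-up placeholders, not measurements from the NPB runs, and parallel_efficiency (speedup divided by P) is a commonly used derived metric added here for illustration.

```python
def speedup(times):
    """times maps processor count P to run time; it must include P = 1."""
    t1 = times[1]
    return {p: t1 / t for p, t in sorted(times.items())}


def parallel_efficiency(times):
    """Speedup divided by the processor count; 1.0 means perfect speedup."""
    return {p: s / p for p, s in speedup(times).items()}


# Hypothetical run times in seconds for P = 1, 2, 4, 8.
times = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0}
s = speedup(times)
e = parallel_efficiency(times)
```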

7
Single Processor Performance
8
Simplest Example: Performance(P)

NPB2.2 on NOW and Origin 2000 (250)

9
Understanding Speedup

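$$\mathrm{SpeedUp}(p) = \frac{T_1}{\max_p \left( T_{\mathrm{compute}} + T_{\mathrm{comm}} + T_{\mathrm{wait}} \right)}$$
$$T_{\mathrm{compute}} = \left( \frac{\mathrm{work}}{p} + \mathrm{extra} \right) \times \mathrm{efficiency}$$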
With message passing (e.g., MPI) communication
time and wait time are indistinguishable
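
As an illustration of the speedup expression on this slide, the sketch below takes the slowest processor's sum of compute, communication and wait time as the parallel run time. The function name and the per-processor numbers are hypothetical.

```python
def model_speedup(t1, per_proc):
    """per_proc holds one (t_compute, t_comm, t_wait) tuple per processor."""
    slowest = max(t_compute + t_comm + t_wait
                  for t_compute, t_comm, t_wait in per_proc)
    return t1 / slowest


# Hypothetical breakdown in seconds for a 4-processor run of a 100 s job.
breakdown = [(25.0, 3.0, 2.0), (26.0, 5.0, 1.0), (24.0, 3.0, 3.0), (25.0, 4.0, 1.0)]
modeled = model_speedup(100.0, breakdown)
```

Because only the maximum matters, a single overloaded processor caps the speedup even when the others are balanced.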

10
A more austere metric...

Time spent doing thing X
$$\mathrm{TotalTime}_X(P) = \sum_{i=1}^{P} \mathrm{Time}_X(i)$$
Constant for perfect speedup
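
A small sketch of this metric, assuming each processor reports its time per category as a dictionary. The category names and all numbers are invented for illustration.

```python
from collections import defaultdict


def total_time_by_category(per_proc):
    """Sum the time spent in each category over all processors."""
    totals = defaultdict(float)
    for proc in per_proc:
        for category, seconds in proc.items():
            totals[category] += seconds
    return dict(totals)


# Hypothetical per-processor profiles (seconds) at three machine sizes.
profiles = {
    1: [{"compute": 100.0, "comm": 0.0}],
    4: [{"compute": 25.0, "comm": 2.0}] * 4,
    16: [{"compute": 6.5, "comm": 1.5}] * 16,
}
totals = {p: total_time_by_category(procs) for p, procs in profiles.items()}
```

In this made-up data the total compute time stays near 100 s while total communication time grows with P, which is the kind of departure from the constant line that the metric is meant to expose.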
11
Where Time is Spent ( P )

Reveal basic Processor and network loading (vs P)

12
Where Time is Spent ( P )

Reveal basic Processor and network loading (vs P)
Basis for model derivation - comm(P)

13
Why do comm. costs increase?

total volume?
volume per processor?
message overhead?
contention?

14
Communication Volume ( P )
15
Communication Structure ( P )
16
Understanding Efficiency ( P, M )

Want to understand both what load the program is placing on the system and how well the system is handling that load
=> characterize the capability of the system via simple benchmarks (rather than advertised peaks)
=> combine with measured load for predictive model, compare

[Figure annotations: 30 MB/s, 150 MB/s]
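
One way to combine a measured load with a benchmarked capability is sketched below. The linear cost model (per-message overhead plus bytes over bandwidth) is an assumption made for this example, not a model stated on the slides, and every number is a placeholder.

```python
def predicted_comm_time(n_messages, n_bytes, overhead_s, bandwidth_bytes_per_s):
    """Assumed linear model: fixed cost per message plus transfer time."""
    return n_messages * overhead_s + n_bytes / bandwidth_bytes_per_s


def comm_efficiency(predicted_s, measured_s):
    """Fraction of the measured communication time that the model explains."""
    return predicted_s / measured_s


# Hypothetical load (from instrumentation) and capability (from a microbenchmark).
predicted = predicted_comm_time(
    n_messages=10_000,
    n_bytes=200e6,
    overhead_s=20e-6,
    bandwidth_bytes_per_s=40e6,
)
efficiency = comm_efficiency(predicted, measured_s=8.0)
```

An efficiency well below 1.0 would suggest costs the simple model misses, such as contention or wait time.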
17
Communication Efficiency
18
Tools => Improvements in Run Time

Efficiency analysis (vs parameters) gives insight
into where to improve the system or the program
use traditional profiling to see where in the program the bad stuff happens
or go back and tune the system to do better

19
Why does comp. time decrease?

Combining trace generation with simulation
provides new structural insight
Here: clear knees in the program working set () shift with machine size (P)

20
Constant Problem Size Scaling
[Figure; series labels: 4, 8, 16, 32, 64, 128, 256]
21
LU Working Sets

Sharp drop in miss rate from 512 to 1024 KB
Working set captured at 1024 KB per processor
As cache size increases (< 32 KB), miss rate decreases at a constant rate
New effect: 100s of KB to MB

22
LU Working Sets

CPS scaling means smaller and smaller problem per
processor
Smaller WS requirement
Miss rate curve moves to the left with P
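
The leftward shift can be reproduced with a toy trace-driven simulation. The sketch below assumes a fully associative LRU cache and a synthetic trace that sweeps cyclically over each processor's share of the data; real traces give smoother curves, and the names lru_miss_rate and knee are illustrative.

```python
from collections import OrderedDict


def lru_miss_rate(trace, capacity):
    """Miss rate of a fully associative LRU cache holding `capacity` blocks."""
    cache = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return misses / len(trace)


def knee(total_blocks, procs, sizes, sweeps=10):
    """Smallest cache size (in blocks) whose miss rate falls below 50%."""
    trace = list(range(total_blocks // procs)) * sweeps
    return min(c for c in sizes if lru_miss_rate(trace, c) < 0.5)


# Constant problem size of 256 blocks, split evenly over P processors.
sizes = [16, 32, 64, 128, 256]
knees = {p: knee(256, p, sizes) for p in (1, 2, 4, 8, 16)}
```

With this synthetic trace the knee sits exactly at the per-processor working set, so it halves each time P doubles.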

23
LU Working Sets

Given a fixed machine, we only observe a vertical
slice of the graph

24
LU Working Sets
[Figure panels: Cluster, Origin]
25
Working Sets
[Figure panels: LU, IS, BT, FT, MG, SP]

There is a cost to scaling when the miss rate increases at larger machine size
There is a benefit to scaling when the miss rate decreases at larger machine size
Processing efficiency is determined by the interaction between the changes in working set and the size of the machine

26
Sensitivity to Multiprogramming

Parallel machines are increasingly general purpose
multiprogramming, or at least interrupts and daemons
Many ideal programs are very sensitive to perturbations
Message passing is loosely coupled, but the implementation may not be!

27
Tools => Improvements in Run Time

MPI implementation spin-waits on send until the network is available (or the queue is not full), or on recv-complete
Should use two-phase spin-block
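
A minimal illustration of two-phase spin-block waiting, written with Python threads rather than inside an MPI library. The function name and the spin budget are assumptions; a common rule of thumb is to spin for roughly the cost of a context switch before blocking.

```python
import threading
import time


def spin_then_block(event, spin_seconds, timeout=None):
    """Wait for `event`: poll briefly, then give up the processor and block."""
    deadline = time.perf_counter() + spin_seconds
    # Phase 1: spin, which is cheap if the event arrives almost immediately.
    while time.perf_counter() < deadline:
        if event.is_set():
            return True
    # Phase 2: block, so a competing process can use the processor meanwhile.
    return event.wait(timeout)


# Usage: a hypothetical "message arrival" signalled by a timer thread.
arrived = threading.Event()
threading.Timer(0.02, arrived.set).start()
got_message = spin_then_block(arrived, spin_seconds=0.001, timeout=2.0)
```

Pure spinning wastes the processor when another job could run, while pure blocking pays a wake-up cost on every message; the two-phase scheme bounds both.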

28
Sensitivity to Seemingly Unrelated Activity

The mechanism for doing parameter studies is
naturally extended to get statistically valid
data through multiple samples at each point
tend to get crisp, fast results in the wee hours
Extend study outside the app
Example: two programs on big Origin
                        alone        together on 64 P
8 processor IS run      4.71 sec     6.18 sec
36 processor SP run     26.36 sec    65.28 sec
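
The slowdown implied by these timings can be worked out directly, assuming the first number on each line is the time running alone and the second is the time when both programs run together.

```python
def slowdown(alone_s, together_s):
    """Ratio of the run time under interference to the run time alone."""
    return together_s / alone_s


# Timings in seconds from the slide: (alone, together).
runs = {"8 processor IS": (4.71, 6.18), "36 processor SP": (26.36, 65.28)}
slowdowns = {name: slowdown(alone, together) for name, (alone, together) in runs.items()}
```

Under that reading, the IS run slows down by about 1.31x and the SP run by about 2.48x.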

29
Repeatability

The variance for the repeated runs is a key
result for production codes - the real world is
not ideal
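
A short sketch of how the spread over repeated runs might be summarized. The sample values are invented, and the choice of statistics (min, median, max, coefficient of variation) is an assumption for illustration.

```python
import statistics


def repeatability(samples):
    """Summarize the spread of run times from repeated runs of one experiment."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
        "mean": mean,
        "stdev": stdev,
        "cv": stdev / mean,  # coefficient of variation: spread relative to mean
    }


# Hypothetical run times in seconds; one run hit interference.
report = repeatability([10.0, 10.2, 9.9, 10.1, 14.8])
```

Reporting the minimum alongside the median separates what the machine can do from what a production user is likely to see.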

30
Understanding the Platform

A very simple example: broadcast(M, P)
vary M, P
repeat

[Timing diagram labels: start time, end time, MPI bcast, MPI barrier (x2)]
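
A sketch of a timing loop in this style, assuming the order barrier, start time, bcast, barrier, end time (the exact ordering on the original diagram is not recoverable from the transcript). The function accepts any object with mpi4py-style Barrier() and Bcast(buf, root) methods; LocalComm is a hypothetical single-process stand-in so the sketch runs without MPI.

```python
import time


class LocalComm:
    """Single-process stand-in offering the two communicator calls used below."""

    def Barrier(self):
        pass

    def Bcast(self, buf, root=0):
        pass


def time_bcast(comm, nbytes, reps, clock=time.perf_counter):
    """Time `reps` broadcasts of `nbytes` bytes, bracketing each with barriers."""
    buf = bytearray(nbytes)
    samples = []
    for _ in range(reps + 1):
        comm.Barrier()              # line everyone up before starting the clock
        start = clock()
        comm.Bcast(buf, root=0)
        comm.Barrier()              # wait until every process has the data
        samples.append(clock() - start)
    return samples[1:]              # discard the first (warm-up) iteration


samples = time_bcast(LocalComm(), nbytes=1024, reps=5)
```

With mpi4py installed, the same function could be called as time_bcast(MPI.COMM_WORLD, 1024, 100, clock=MPI.Wtime) under mpirun. Keeping every sample, rather than only the mean, preserves the run-to-run variation.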
31
NOW bcast (m, p)
32
Origin mean bcast (m, p)
33
NOW bcast (1024, p)
34
Origin bcast (1024, p)
35
NOW bcast(1042, 16) repetitions
discarded first iteration
36
Origin bcast(1042, 16) repetitions
discarded first iteration
37
Origin bcast(1042, 16) repetitions - 10x
38
Origin bcast(1042, 16) repetitions
39
Origin bcast(1M, 16) repetitions
40
Discussion

Apply engineering analysis to your parallel
engineering analysis codes!
Isolate components
Introduce controlled variations
processors
data set
communication rate
repetition
Identify trouble spots

41
To read more

Parallel Computer Architecture: A Hardware/Software Approach, Culler and Singh, Morgan Kaufmann
Architectural Requirements and Scalability of the NAS Parallel Benchmarks, Wong, Martin, Arpaci-Dusseau, and Culler, Proc. of SC99
Building MPI for Multi-Programming Systems using Implicit Information, Wong, Arpaci-Dusseau, and Culler, 6th European PVM/MPI User's Group Meeting
http://www.cs.berkeley.edu/~culler/papers