Engineering Analysis of High Performance Parallel Programs - PowerPoint PPT Presentation

1
Engineering Analysis of High Performance Parallel
Programs
  • David Culler
  • Computer Science Division
  • U.C.Berkeley
  • http://www.cs.berkeley.edu/~culler

2
Traditional Parallel Programming Tools
  • Focus on showing what program did and when it
    did it
  • microscopic analysis of deterministic events
  • oriented towards initial development of small
    programs on small data sets and small machines
  • Instrumentation
  • traces, counters, profiles
  • Visualization
  • Examples
  • AIMS, PTOOLS, PPP
  • Pablo, Paradyn, ... => Delphi
  • ACTS TAU - Tuning and Analysis Utilities

3
Example: Pablo
4
Beyond Zeroth-order Analysis
  • Basic level: get to a system design that is
    reasonable and behaves properly under ideal
    conditions
  • Subject the system to various stresses to
    understand its operating regime and gain deeper
    insight into its dynamic behavior
  • Combine empirical data with analytical models
  • Iterate
  • from What? to What if?

[Figure: max displacement vs. wind speed]
5
Approach: Framework for Parameterized Sensitivity
Analysis
  • framework performs analysis over numerous runs
  • statistical filtering
  • vary parameter of interest
  • provides means of combining data to isolate
    effects of interest
  • => ROBUSTNESS

[Diagram boxes: Problem Data Set Generator; Well-developed Parallel Program; Instrumentation Tools; Study Parameter; Machine Characterizers; visualization, modeling]
[Diagram list items: Procs; Comm. perf.; Cache; Scheduling; ...]
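The framework bullets on this slide (many runs, statistical filtering, one varied parameter) can be sketched in Python roughly as follows. This is a minimal illustration, not the framework from the talk: the names `sweep` and `fake_run` are hypothetical, the median stands in for whatever filtering was actually used, and `fake_run` is a synthetic stand-in for launching a real program.

```python
import statistics


def sweep(run_once, values, repeats=5):
    """Run run_once(value) `repeats` times per value; return {value: median}."""
    results = {}
    for v in values:
        samples = [run_once(v) for _ in range(repeats)]
        # The median is one simple statistical filter against outlier runs.
        results[v] = statistics.median(samples)
    return results


def fake_run(procs):
    """Synthetic run time in seconds (placeholder, not measured data)."""
    return 64.0 / procs + 0.5


# Vary the parameter of interest (here: processor count).
for procs, seconds in sweep(fake_run, [1, 2, 4, 8]).items():
    print(f"P={procs}: {seconds:.2f} s")
```

In a real study `run_once` would launch the instrumented program and return the measured quantity; the same loop then works for any study parameter.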
6
Example NAS Parallel Benchmarks
  • Fix problem size (NPB2.2 class A)
  • Two different Architectures
  • NOW Ultrasparc Cluster (170 MHz)
  • SGI Origin (250 MHz)
  • Six application kernels
  • BT - Block Tridiagonal Solve
  • SP - Scalar Pentadiagonal Solve
  • LU - Sparse LU
  • MG - Multigrid
  • IS - Integer sort
  • FT - 3D FFT
  • Examine sensitivity to P (# procs)
  • time(P), speedup(P) = Time(1)/Time(P)
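As a small illustration of the speedup definition on this slide, the sketch below computes speedup(P) as Time(1)/Time(P) from a table of run times. The function name `speedup_table` is hypothetical, the timings are placeholders rather than measurements from the talk, and the efficiency column (speedup divided by P) is a standard derived quantity that the slide does not list.

```python
def speedup_table(times):
    """Map P -> (speedup, efficiency) given run times keyed by P.

    `times` must include an entry for P = 1, which serves as Time(1).
    """
    t1 = times[1]
    table = {}
    for p in sorted(times):
        s = t1 / times[p]        # speedup(P) = Time(1) / Time(P)
        table[p] = (s, s / p)    # efficiency = speedup / P
    return table


# Placeholder timings in seconds (illustrative only, not measured data).
example_times = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0}
for p, (s, e) in speedup_table(example_times).items():
    print(f"P={p:2d}  speedup={s:5.2f}  efficiency={e:4.2f}")
```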

7
Single Processor Performance
8
Simplest Example: Performance(P)
  • NPB2.2 on NOW and Origin 2000 (250 MHz)

9
Understanding Speedup
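  • $\mathrm{SpeedUp}(p) = \dfrac{T_1}{\max_p \left( T_{\mathrm{compute}} + T_{\mathrm{comm}} + T_{\mathrm{wait}} \right)}$
  • $T_{\mathrm{compute}} = (\mathrm{work}/p + \mathrm{extra}) \times \mathrm{efficiency}$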
  • With message passing (e.g., MPI) communication
    time and wait time are indistinguishable
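Reading the slide's formula as T1 divided by the maximum, over processors, of compute plus communication plus wait time, a minimal Python sketch looks like this. The function names and the sample numbers are hypothetical, and the operators (sum inside the max, work/p plus extra) are an interpretation of the slide text.

```python
def modeled_compute(work, p, extra, efficiency):
    """T_compute = (work / p + extra) * efficiency, as read from the slide."""
    return (work / p + extra) * efficiency


def modeled_speedup(t1, per_proc):
    """Speedup = T1 / max over processors of (compute + comm + wait).

    per_proc: list of (t_compute, t_comm, t_wait) tuples, one per processor.
    """
    slowest = max(tc + tm + tw for tc, tm, tw in per_proc)
    return t1 / slowest


# Placeholder per-processor breakdown in seconds (not measured data).
breakdown = [(20.0, 3.0, 2.0), (22.0, 2.0, 1.0), (19.0, 4.0, 2.0), (21.0, 3.0, 1.0)]
print(modeled_speedup(100.0, breakdown))
```

The slowest processor sets the run time, so any imbalance in one of the three terms shows up directly as lost speedup.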

10
A more austere metric...
  • Time spent doing thing X
  • $\mathrm{TotalTime}_X(P) = \sum_{i=1}^{P} \mathrm{Time}_X(i)$
  • Constant for perfect speedup
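The total-time metric on this slide sums, over all P processors, the time each one spends on activity X. The sketch below (hypothetical function name, placeholder numbers) shows why the total stays constant under perfect speedup: each processor's share shrinks as 1/P while the number of terms grows as P.

```python
def total_time(per_proc_times):
    """Sum over processors i = 1..P of the time spent on one activity X."""
    return sum(per_proc_times)


# Perfect speedup: 80 s of work split evenly, so the total never changes.
for p in (1, 2, 4, 8):
    per_proc = [80.0 / p] * p
    print(f"P={p}: total = {total_time(per_proc):.1f} s")
```

Any growth of this total with P is overhead added by parallel execution, which is what makes it a stricter metric than speedup alone.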
11
Where Time is Spent ( P )
  • Reveal basic Processor and network loading (vs P)

12
Where Time is Spent ( P )
  • Reveal basic Processor and network loading (vs P)
  • Basis for model derivation - comm(P)

13
Why do comm. costs increase?
  • total volume?
  • volume per processor?
  • message overhead?
  • contention?

14
Communication Volume ( P )
15
Communication Structure ( P )
16
Understanding Efficiency ( P, M )
  • Want to understand both what load the program is
    placing on the system
  • and how well the system is handling that load
  • => characterize the capability of the system via
    simple benchmarks (rather than advertised peaks)
  • => combine with measured load for predictive
    model, compare

[Figure annotations: 30 MB/s, 150 MB/s]
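One way to combine a benchmarked system capability with a measured load, as this slide suggests, is a simple linear cost model. The sketch below is an assumption, not the model from the talk: the function names, the per-message overhead term, and the 30 MB volume are hypothetical, and the 30 MB/s and 150 MB/s figures are used only as example bandwidths because the transcript does not say what they measure.

```python
def predicted_comm_time(bytes_sent, bandwidth_mb_s, messages=0, overhead_s=0.0):
    """Linear model: per-message overhead plus volume / benchmarked bandwidth."""
    return messages * overhead_s + bytes_sent / (bandwidth_mb_s * 1e6)


def comm_efficiency(predicted_s, measured_s):
    """Ratio of predicted to measured time; 1.0 means the model explains the cost."""
    return predicted_s / measured_s


# Example: the same 30 MB of traffic at two example bandwidths.
for bw in (30.0, 150.0):
    print(f"{bw:5.0f} MB/s -> {predicted_comm_time(30e6, bw):.2f} s")
```

Comparing the prediction with measured communication time shows whether the system is handling the load as well as its benchmarked capability implies.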
17
Communication Efficiency
18
Tools => Improvements in Run Time
  • Efficiency analysis (vs parameters) gives insight
    into where to improve the system or the program
  • use traditional profiling to see where in the
    program the bad stuff happens
  • or go back and tune the system to do better

19
Why does comp. time decrease?
  • Combining trace generation with simulation
    provides new structural insight
  • Here: clear knees in the program working set
    shift with machine size (P)

20
Constant Problem Size Scaling
[Figure: series labeled 4, 8, 16, 32, 64, 128, 256]
21
LU Working Sets
  • Sharp drop in miss rate from 512 KB to 1024 KB
  • WS captured at 1024 KB per processor
  • As size increases (< 32 KB), miss rate decreases
    at a constant rate
  • New effect, 100s of KB to MB

22
LU Working Sets
  • CPS scaling means smaller and smaller problem per
    processor
  • Smaller WS requirement
  • Miss rate curve moves to the left with P
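The leftward shift of the miss-rate curve under constant problem size scaling can be illustrated with a toy cache simulation. Everything here is an assumption made for illustration: a fully associative LRU cache, a trace that sweeps cyclically over each processor's share of the data, and the names `lru_miss_rate` and `per_processor_trace`. It is not the trace-driven simulator used in the talk.

```python
from collections import OrderedDict


def lru_miss_rate(trace, cache_blocks):
    """Miss rate of a fully associative LRU cache holding `cache_blocks` blocks."""
    cache = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return misses / len(trace)


def per_processor_trace(total_blocks, procs, sweeps=4):
    """Each processor repeatedly sweeps its total_blocks / procs share."""
    local = total_blocks // procs
    return [b for _ in range(sweeps) for b in range(local)]


# Fixed total problem size: the knee moves to smaller caches as P grows.
for procs in (4, 8, 16):
    trace = per_processor_trace(1024, procs)
    rates = {c: lru_miss_rate(trace, c) for c in (32, 64, 128, 256)}
    print(f"P={procs:2d}", rates)
```

In this toy model the miss rate drops sharply once the cache holds the per-processor working set, and that threshold halves each time P doubles.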

23
LU Working Sets
  • Given a fixed machine, we only observe a vertical
    slice of the graph

24
LU Working Sets
[Figure panels: Cluster, Origin]
25
Working Sets
[Figure panels: LU, IS, BT, FT, MG, SP]
  • There is a cost to scaling when, at larger
    machine size, miss rate increases
  • There is a benefit to scaling when, at larger
    machine size, miss rate decreases
  • Processing efficiency is determined by the
    interaction between the changes in working set
    and the size of the machine

26
Sensitivity to Multiprogramming
  • Parallel machines are increasingly general
    purpose
  • multiprogramming, at least interrupts and daemons
  • Many ideal programs are very sensitive to
    perturbations
  • Msg Passing is loosely coupled, but
    implementation may not be!

27
Tools => Improvements in Run Time
  • MPI implementation spin-waits on send till
    network available (or queue not full) or on
    recv-complete
  • Should use two-phase spin-block
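The two-phase spin-block idea can be sketched with Python threads: spin for a bounded time in case the event arrives quickly, then give up the CPU and block. This is only an illustration of the waiting policy, not code from any MPI implementation; the function name `two_phase_wait` and the 1 ms spin budget are hypothetical choices.

```python
import threading
import time


def two_phase_wait(event, spin_seconds=0.001):
    """Spin briefly on `event`; if it is still unset, block until it is set.

    Returns "spin" or "block" to report which phase observed the event.
    """
    deadline = time.perf_counter() + spin_seconds
    while time.perf_counter() < deadline:   # phase 1: bounded spin
        if event.is_set():
            return "spin"
    event.wait()                            # phase 2: yield the CPU and block
    return "block"


# The event arrives after 50 ms, well past the 1 ms spin budget.
done = threading.Event()
threading.Timer(0.05, done.set).start()
print(two_phase_wait(done))
```

Bounding the spin keeps latency low when the partner responds promptly, while blocking frees the processor for other jobs when it does not, which matters on a multiprogrammed machine.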

28
Sensitivity to Seemingly Unrelated Activity
  • The mechanism for doing parameter studies is
    naturally extended to get statistically valid
    data through multiple samples at each point
  • tend to get crisp, fast results in the wee hours
  • Extend study outside the app
  • Example: two programs on big Origin
                            alone        together on 64 P
    8-processor IS run      4.71 sec     6.18 sec
    36-processor SP run     26.36 sec    65.28 sec
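Reading the two numbers given for each program as its run time alone and its run time when both programs share the machine, the slowdown can be computed directly. The interpretation of the columns is an assumption, and the helper name `slowdown` is hypothetical.

```python
def slowdown(alone_s, together_s):
    """Factor by which a run gets slower when sharing the machine."""
    return together_s / alone_s


# (alone, together) run times in seconds, as given on the slide.
runs = {
    "8-processor IS": (4.71, 6.18),
    "36-processor SP": (26.36, 65.28),
}
for name, (alone, together) in runs.items():
    print(f"{name}: {slowdown(alone, together):.2f}x slower")
```

Under that reading, IS slows down by roughly 1.31x and SP by roughly 2.48x, even though the two jobs together request only 44 of the 64 processors.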

29
Repeatability
  • The variance for the repeated runs is a key
    result for production codes - the real world is
    not ideal
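A minimal way to report repeatability is to summarize repeated runs with their mean, standard deviation, and spread. The sketch below uses Python's standard `statistics` module; the function name `repeatability`, the choice of summary statistics, and the sample timings are illustrative assumptions, not data from the talk.

```python
import statistics


def repeatability(samples):
    """Summarize repeated run times: mean, sample stdev, relative spread, range."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)       # sample standard deviation
    return {
        "mean": mean,
        "stdev": stdev,
        "cv": stdev / mean,                 # coefficient of variation
        "min": min(samples),
        "max": max(samples),
    }


# Placeholder run times in seconds (not measured data).
summary = repeatability([4.7, 4.8, 4.7, 6.2, 4.9])
print({k: round(v, 3) for k, v in summary.items()})
```

Reporting the spread alongside the mean makes a single slow outlier visible instead of hiding it in an average.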

30
Understanding the Platform
  • A very simple example: broadcast(M, P)
  • vary M, P
  • repeat

[Diagram labels: start time, MPI barrier, MPI bcast, MPI barrier, end time]
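The broadcast microbenchmark on this slide can be sketched as a small timing harness. The ordering used here (barrier, start clock, broadcast, barrier, stop clock) is an assumption about the diagram, and `time_collective` and `bcast_timings_mpi` are hypothetical names. The harness takes the barrier and collective as callables so it can be exercised without MPI; the second function shows one way to connect it to mpi4py, if that package is installed.

```python
import time


def time_collective(barrier, collective, clock=time.perf_counter, reps=10):
    """Time `collective` reps + 1 times between barriers; drop the first sample."""
    samples = []
    for _ in range(reps + 1):
        barrier()                  # line every rank up before starting the clock
        start = clock()
        collective()
        barrier()                  # wait until every rank has finished
        samples.append(clock() - start)
    return samples[1:]             # first iteration discarded as warm-up


def bcast_timings_mpi(nbytes, reps=10):
    """Connect the harness to MPI. Requires mpi4py; launch under mpiexec."""
    from mpi4py import MPI         # imported lazily so the harness works without MPI
    comm = MPI.COMM_WORLD
    buf = bytearray(nbytes)        # message of M = nbytes bytes
    return time_collective(comm.Barrier,
                           lambda: comm.Bcast(buf, root=0),
                           clock=MPI.Wtime, reps=reps)
```

Varying `nbytes` and the number of ranks, and keeping every repetition rather than only the mean, gives the bcast(m, p) and repetition data that the slides after this one plot.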
31
NOW bcast (m, p)
32
Origin mean bcast (m, p)
33
NOW bcast (1024, p)
34
Origin bcast (1024, p)
35
NOW bcast(1042, 16) repetitions
discarded first iteration
36
Origin bcast(1042, 16) repetitions
discarded first iteration
37
Origin bcast(1042, 16) repetitions - 10x
38
Origin bcast(1042, 16) repetitions
39
Origin bcast(1M, 16) repetitions
40
Discussion
  • Apply engineering analysis to your parallel
    engineering analysis codes!
  • Isolate components
  • Introduce controlled variations
  • processors
  • data set
  • communication rate
  • repetition
  • Identify trouble spots

41
To read more
  • Parallel Computer Architecture - a
    hardware/software approach, Culler and Singh,
    Morgan-Kaufmann
  • Architectural Requirements and Scalability of the
    NAS Parallel Benchmarks, Wong, Martin,
    Arpaci-Dusseau, and Culler, Proc. of SC99
  • Building MPI for Multi-Programming Systems using
    Implicit Information, Wong, Arpaci-Dusseau,
    Culler, 6th European PVM/MPI User's Group Meeting
  • http://www.cs.berkeley.edu/~culler/papers