Title: IPDPS 2005, slide 1
Slide 1: Automatic Construction and Evaluation of Performance Skeletons (Predicting Performance in an Unpredictable World)
Sukhdeep Sodhi, Microsoft
Jaspal Subhlok, University of Houston
IPDPS 2005
Slide 2: What Is a Performance Skeleton Anyway?
A short-running program that mimics the execution behavior of a given application.
GOAL: the execution time of a performance skeleton is a fixed fraction of the application execution time, say 1:1000, then...
Sounds vaguely interesting, but:
- Who cares?
- How to do it?
- Is it even possible to build one?
Slide 3: Who Cares?
Anyone who needs a performance estimate when it cannot be modeled well:
- Applications distributed on networks: resource selection, mapping, adapting
  - Which nodes offer the best performance?
  [Figure: an application to be mapped onto a network]
- Performance testing of a future architecture under simulation
  - Large applications cannot be tested because simulation is 1000X slower
Slide 4: Mapping Distributed Applications on Networks: State of the Art
Mapping for best performance:
- Measure and model network and application characteristics (NWS is popular)
- Find the best match of nodes for execution
But the approach has significant limitations:
- Knowing network status is not the same as knowing how an application will perform
- Frequent measurements are expensive; less frequent measurements mean stale data
Slide 5: Mapping Distributed Applications on Networks: Our Approach
[Figure: an application (component labels: Model, Data, Sim 2, Vis, Sim 1, Pre, Stream) to be mapped onto a network]
Predict performance and select nodes by actual execution of performance skeletons on groups of nodes.
Slide 6: How to Construct a Performance Skeleton?
- Central challenge in this research
- Common sense dictates that an application and its skeleton must be similar in
  - Computation behavior
  - Communication behavior
  - Memory behavior
  - I/O behavior
- All execution behavior is to be captured in a short program
[Figure: application -> "How?" -> skeleton]
Slide 7: How to Construct a Performance Skeleton?
[Figure: Run application -> Record Execution Trace -> Compress execution trace into Execution Signature -> Construct Performance Skeleton]
- Execution trace: a record of all system activity during execution, such as memory accesses, communication messages, and CPU events.
- Execution signature: a compressed, summarized record of execution.
- Performance skeleton: a program based on the execution signature.
Slide 8: Limitations of Work Presented Today
- Only model the coarse application computation and communication patterns to build the performance skeleton
  - Ignore memory and I/O behavior
  - Ignore specific instructions; only consider whether the CPU is computing, communicating, or idle
- Somewhat intrusive: link with a profiling library
- Limited to MPI programs
- But these are not limitations of the approach.
- Most are being addressed in the project.
Slide 9: Constructing a Performance Skeleton
[Figure: Run application -> Record Execution Trace -> Compress execution trace into Execution Signature -> Construct Performance Skeleton program from execution signature]
Slide 10: Recording Execution Trace
- Link the MPI application with a PMPI-based profiling library
  - No source code modification or analysis required
- Execute on a dedicated testbed
- Record all MPI function calls
  - Call name, start time, stop time, parameters
  - Timing done to microsecond granularity
- Record CPU busy time between consecutive MPI calls
- Result is a (long) execution sequence of computation and communication events and their durations/parameters
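The recorded fields can be pictured as a flat list of events. The following is a minimal Python sketch, not the profiling library itself (which is a PMPI wrapper linked into the MPI application). The `MpiCall` record layout, the function name, and the timestamps are assumptions made for illustration. It shows how gaps between consecutive MPI calls become compute events in the execution sequence.

```python
from dataclasses import dataclass


@dataclass
class MpiCall:
    """One recorded MPI call (hypothetical record layout)."""
    name: str        # e.g. "MPI_Send"
    start_us: int    # start time in microseconds
    stop_us: int     # stop time in microseconds
    params: tuple    # e.g. (peer rank, message size in bytes)


def to_event_sequence(calls):
    """Interleave compute gaps with the recorded MPI calls.

    Returns a list of (kind, duration_us, params) tuples, where kind is
    "compute" or the MPI call name. The gap between the end of one call and
    the start of the next is treated as CPU busy time, which assumes a
    dedicated testbed.
    """
    events = []
    prev_stop = None
    for call in calls:
        if prev_stop is not None and call.start_us > prev_stop:
            events.append(("compute", call.start_us - prev_stop, ()))
        events.append((call.name, call.stop_us - call.start_us, call.params))
        prev_stop = call.stop_us
    return events


# Made-up trace fragment for one process.
trace = [
    MpiCall("MPI_Send", 0, 40, (1, 800)),
    MpiCall("MPI_Recv", 100_040, 100_100, (1, 800)),
]
events = to_event_sequence(trace)
```

Here `events` holds a send, a 100 ms compute interval, and a receive, in that order.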
Slide 11: Constructing a Simple Performance Skeleton
[Figure: Run application -> Record Execution Trace -> Compress execution trace into Execution Signature -> Construct Performance Skeleton program from execution signature]
Slide 12: Compress Execution Trace -> Execution Signature
- Application execution typically follows cyclic patterns
- Goal: form a loop structure by identifying repeating execution behavior
- Step 1: Execution trace to symbol strings
  - Identify similar (not necessarily identical) execution events
  - Each event in such a cluster of similar events is replaced by a representative and assigned a symbol
  - The execution trace is replaced by a symbol string
  - [Example symbol string; the symbols are not recoverable from the source]
  - where, say, one symbol stands for "compute for 100 ms" and another for "MPI call to send 800 bytes to a neighbor node"
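Step 1 can be illustrated with a small Python sketch. The talk does not say how similar events are clustered, so the greedy first-fit rule, the relative tolerance `tol`, and the function name `events_to_symbols` are all assumptions made for illustration. Integer symbol ids stand in for the symbols used on the slide.

```python
def events_to_symbols(events, tol=0.1):
    """Map events to symbols; similar events share a symbol.

    events: list of (kind, value) pairs, where value is a duration or a
    message size. Two events count as similar if they have the same kind
    and the value differs from the cluster representative (the first event
    seen in that cluster) by at most `tol` in relative terms.
    This clustering rule is an assumption, not the method from the talk.
    """
    representatives = []   # position in this list is the symbol id
    symbols = []
    for kind, value in events:
        for sym, (rkind, rvalue) in enumerate(representatives):
            if kind == rkind and abs(value - rvalue) <= tol * rvalue:
                symbols.append(sym)
                break
        else:
            # No existing cluster is close enough: open a new one.
            representatives.append((kind, value))
            symbols.append(len(representatives) - 1)
    return symbols, representatives


# Made-up events: compute times in ms and send sizes in bytes.
events = [("compute", 100), ("send", 800), ("compute", 104),
          ("send", 800), ("compute", 250)]
symbols, reps = events_to_symbols(events)
```

With the default tolerance, the 100 ms and 104 ms compute events share one symbol, while the 250 ms event gets its own.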
Slide 13: Compress Execution Trace -> Execution Signature
- Step 2: Compress the string by identifying cycles
  - Build the loop structure recursively from symbol strings
  - e.g. [17-symbol string; the symbols are not recoverable from the source]
  - is replaced by
  - [folded string with repeat counts 3, 2, and 2; the symbols are not recoverable from the source]
  - Similar to the longest substring matching problem
- A typical execution signature is multiple orders of magnitude smaller than the trace
- Step 3: Adaptively increase the degree of compression (by managing a similarity parameter) until the signature is compact enough
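The idea of Step 2 can be sketched in Python as a greedy folding of immediately repeated blocks. This is an illustrative stand-in, not the algorithm used in the talk: the function names, the greedy choice of the repeat covering the most symbols, and the exact-match comparison are assumptions (similarity is taken to be handled when symbols are assigned).

```python
def fold_repeats(seq):
    """Greedily fold immediately repeated blocks into (body, count) pairs.

    Returns a list whose items are either plain symbols or tuples
    (folded_body, count), where folded_body is itself a folded list.
    """
    out = []
    i = 0
    n = len(seq)
    while i < n:
        best_len, best_count = 0, 1
        for length in range(1, (n - i) // 2 + 1):
            block = seq[i:i + length]
            count = 1
            while seq[i + count * length:i + (count + 1) * length] == block:
                count += 1
            # Keep the repeat covering the most symbols; ties go to the
            # shorter block because it is tried first.
            if count > 1 and count * length > best_len * best_count:
                best_len, best_count = length, count
        if best_count > 1:
            # Recurse so that loops nested inside the block are folded too.
            out.append((fold_repeats(seq[i:i + best_len]), best_count))
            i += best_len * best_count
        else:
            out.append(seq[i])
            i += 1
    return out


def expand(folded):
    """Inverse of fold_repeats: unroll all loops back into a flat list."""
    flat = []
    for item in folded:
        if isinstance(item, tuple):
            body, count = item
            flat.extend(expand(body) * count)
        else:
            flat.append(item)
    return flat


flat_example = list("abcbcbcdd")
folded_example = fold_repeats(flat_example)
nested_example = fold_repeats(list("ababcababc"))
```

`expand` is included only to check that folding loses no information when matching is exact.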
Slide 14: Constructing a Simple Performance Skeleton
[Figure: Run application -> Record Execution Trace -> Compress execution trace into Execution Signature -> Construct Performance Skeleton program from execution signature]
Slide 15: Generate Performance Skeleton Program
- Goal: execution time of the performance skeleton is 1/K of the application execution time (K given by user)
- Reduce the iterations of each loop in the application signature by a factor of K
- Heuristically process remaining iterations and events outside loops
- Replace symbols by C language statements
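The loop-reduction step can be sketched in Python as follows. The signature representation (a list of symbols and `(body, count)` loops), the function name, and the choice to scale only top-level loops are assumptions for illustration. The talk does not describe its heuristics for leftover iterations and events outside loops, so this sketch only collects them instead of processing them.

```python
def scale_signature(signature, k):
    """Divide top-level loop counts by k.

    signature: list of items, each either a symbol or a (body, count) loop.
    Returns (scaled, leftovers). `scaled` holds the loops with count // k
    iterations. `leftovers` holds the symbols outside loops and the
    (body, count % k) remainders that did not divide evenly; handling them
    is left open here.
    """
    scaled, leftovers = [], []
    for item in signature:
        if isinstance(item, tuple):
            body, count = item
            if count // k > 0:
                scaled.append((body, count // k))
            if count % k > 0:
                leftovers.append((body, count % k))
        else:
            leftovers.append(item)
    return scaled, leftovers


# Made-up signature: symbols are written as descriptive strings.
signature = ["init", (["compute", "send"], 1000), (["reduce"], 25), "finalize"]
scaled, leftovers = scale_signature(signature, k=10)
```

In this example the 1000-iteration loop shrinks to 100 iterations, and the 25-iteration loop shrinks to 2 with 5 iterations left over.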
Slide 16: Experimental Validation
- Skeletons constructed for Class B NAS MPI benchmarks, executed on 4 cluster nodes in the following sharing scenarios:
  - Dedicated nodes (defines the reference execution time ratio between skeleton and application)
  - Competing processes on one node / all nodes
  - Competing traffic on one link / all links
  - Competition as above on one node and one link
- Skeleton execution time is used to predict application execution time in the different scenarios
- Setup: Intel Xeon dual-CPU 1.7 GHz nodes running Linux 2.4.7. Gigabit crossbar switch. Simple CPU-intensive competing processes. iproute to simulate link sharing.
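The prediction step amounts to scaling the skeleton's time in a shared scenario by the ratio measured on dedicated nodes. A minimal Python sketch follows; the function names are hypothetical and all timings are made-up numbers, not results from the talk.

```python
def predict_app_time(skel_shared_s, skel_dedicated_s, app_dedicated_s):
    """Scale the skeleton's shared-scenario time by the dedicated-run ratio."""
    ratio = app_dedicated_s / skel_dedicated_s
    return skel_shared_s * ratio


def prediction_error_pct(predicted_s, measured_s):
    """Relative prediction error in percent."""
    return 100.0 * abs(predicted_s - measured_s) / measured_s


# Made-up timings: a 10 s skeleton of a 500 s application slows to 18 s
# under sharing, and the application is then measured at 1000 s.
predicted = predict_app_time(skel_shared_s=18.0, skel_dedicated_s=10.0,
                             app_dedicated_s=500.0)
error = prediction_error_pct(predicted, measured_s=1000.0)
```

The dedicated-node run fixes the ratio once; each sharing scenario then needs only a skeleton run to produce an estimate.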
Slide 17: Prediction Accuracy of Skeletons (Average Across All Sharing Scenarios)
- Average prediction error is 6%, max 18%: acceptable
- Longer skeletons are better, but even 0.5-second skeletons are meaningful (the tool issues a warning if the requested skeleton size is too small)
Slide 18: Prediction for Different Sharing Scenarios (10-Second Skeletons)
- Error is higher with network contention
  - Communication is harder to scale down and affects synchronization more directly
Slide 19: Comparison with Simple Prediction Methods
- Average Prediction: the average slowdown of the entire benchmark is used to predict the execution time for each program.
- Class S Prediction: Class S benchmark (1 sec) programs are used as skeletons for Class B (30-900 s) benchmarks.
Even the smallest skeletons are far superior!
Slide 20: Conclusions
- Promising approach to performance estimation for
  - Unpredictable environments (grids)
  - Non-existing architectures (under simulation)
  - ...
- It is work in progress; a lot more remains, such as
  - accurately reproducing memory behavior (some results in LCR 2004 workshop)
  - integration of memory and communicate/compute
  - validation on larger grid environments
  - accurate reproduction of CPU behavior (such as instruction types, etc.)
  - skeletons that scale to different numbers of nodes
Slide 21: For More Information
www.cs.uh.edu/jaspal
jaspal_at_uh.edu
Thanks to NSF and DOE!
End of Talk! Or Is It? Questions?
Slide 22: Discovered Communication Structure of NAS Benchmarks
[Figure: communication graphs over nodes 0-3 for the BT, CG, IS, LU, MG, SP, and EP benchmarks]
Slide 23: CPU Behavior of NAS Benchmarks