IPDPS 2005, slide 1 - PowerPoint PPT Presentation

About This Presentation

Title:

IPDPS 2005, slide 1

Description:

Automatic Construction and Evaluation of 'Performance Skeletons' ... Construct Performance Skeleton ... Skeletons constructed for Class B NAS MPI benchmarks. ...

Slides: 24

Provided by: camp203

Learn more at: https://www2.cs.uh.edu

Transcript and Presenter's Notes

1
Automatic Construction and Evaluation of Performance Skeletons
(Predicting Performance in an Unpredictable World)
Sukhdeep Sodhi (Microsoft), Jaspal Subhlok (University of Houston)
IPDPS 2005
2
What is a Performance Skeleton anyway?
A short-running program that mimics the execution behavior of a given application.
GOAL: the execution time of a performance skeleton is a fixed fraction of the application execution time, say 1/1000, then...
Sounds vaguely interesting, but: Who cares? How to do it? Is it even possible to build one?
3
Who Cares? Anyone who needs a performance estimate when it cannot be modeled well.

Applications distributed on networks: resource selection, mapping, adapting.
Which nodes offer the best performance?
[Figure: an application to be mapped onto a network]

Performance testing of a future architecture under simulation: large applications cannot be tested because simulation is 1000X slower.

4
Mapping Distributed Applications on Networks: State of the Art
Mapping for best performance:
  Measure and model network and application characteristics (NWS is popular)
  Find the best match of nodes for execution
But the approach has significant limitations:
  Knowing network status is not the same as knowing how an application will perform
  Frequent measurements are expensive; less frequent measurements mean stale data

5
Mapping Distributed Applications on Networks: Our Approach
Predict performance and select nodes by actual execution of performance skeletons on groups of nodes.
[Figure: application components (Model, Data, Pre, Stream, Sim 1, Sim 2, Vis) to be mapped onto a network]
6
How to Construct a Performance Skeleton?

Central challenge in this research.
Common sense dictates that an application and its skeleton must be similar in:
  Computation behavior
  Communication behavior
  Memory behavior
  I/O behavior
All execution behavior is to be captured in a short program.
[Figure: application to skeleton. How?]
7
How to Construct a Performance Skeleton?
1. Run application
2. Record execution trace
3. Compress execution trace into execution signature
4. Construct performance skeleton
Execution trace: a record of all system activity during execution, such as memory accesses, communication messages, and CPU events.
Execution signature: a compressed, summarized record of execution.
Performance skeleton: a program based on the execution signature.
8
Limitations of Work Presented Today

Only the coarse application computation and communication patterns are modeled to build the performance skeleton
  Memory and I/O behavior are ignored
  Specific instructions are ignored; only whether the CPU is computing, communicating, or idle is considered
Somewhat intrusive: requires linking with a profiling library
Limited to MPI programs
But these are not limitations of the approach. Most are being addressed in the project.

9
Constructing a Performance Skeleton
1. Run application
2. Record execution trace
3. Compress execution trace into execution signature
4. Construct performance skeleton program from execution signature
10
Recording Execution Trace

Link the MPI application with a PMPI-based profiling library
  No source code modification or analysis required
Execute on a dedicated testbed
Record all MPI function calls
  Call name, start time, stop time, parameters
  Timing done to microsecond granularity
  CPU busy time between consecutive MPI calls
The result is a (long) execution sequence of computation and communication events and their durations/parameters
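
A minimal Python sketch of how recorded calls could be turned into the execution sequence described above. The record layout (`MpiCall`), the helper `to_event_sequence`, and the sample timings are all hypothetical; the slides do not give the profiling library's actual format. The gap between consecutive calls is treated as CPU busy time, which assumes a dedicated testbed.

```python
from dataclasses import dataclass


@dataclass
class MpiCall:
    name: str       # MPI function name, e.g. "MPI_Send"
    start_us: int   # call entry time, microseconds
    stop_us: int    # call return time, microseconds
    params: tuple   # parameters of interest (hypothetical: bytes, peer rank)


def to_event_sequence(calls):
    """Interleave compute gaps with the recorded MPI calls.

    Returns a list of (kind, duration_us, params) tuples.
    """
    events = []
    prev_stop = None
    for call in calls:
        # Time between the previous call returning and this one starting
        # is counted as computation (assumption: nothing else is running).
        if prev_stop is not None and call.start_us > prev_stop:
            events.append(("compute", call.start_us - prev_stop, ()))
        events.append((call.name, call.stop_us - call.start_us, call.params))
        prev_stop = call.stop_us
    return events


# Made-up trace: a send, 100 ms of computation, then a receive.
trace = [
    MpiCall("MPI_Send", 0, 40, (800, 1)),
    MpiCall("MPI_Recv", 100_040, 100_100, (800, 1)),
]
events = to_event_sequence(trace)
```

Each tuple in `events` is one computation or communication event with its duration, which is the input the later compression steps work on.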

11
Constructing a Simple Performance Skeleton
1. Run application
2. Record execution trace
3. Compress execution trace into execution signature
4. Construct performance skeleton program from execution signature
12
Compress Execution Trace → Execution Signature

Application execution typically follows cyclic patterns.
Goal: form a loop structure by identifying repeating execution behavior.
Step 1: Execution trace to symbol strings
  Identify similar (may not be identical) execution events
  Each event in such a cluster of similar events is replaced by a representative and assigned a symbol
  The execution trace is replaced by a symbol string
  [Example: a long string of symbols, one per event]
  where, say, one symbol stands for "compute for 100 ms" and another for "MPI call to send 800 bytes to a neighbor node"
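
Step 1 could be sketched in Python roughly as follows. This is an illustration under assumptions, not the authors' clustering method: events are hypothetical `(kind, duration_ms, bytes)` tuples, two events count as similar when they have the same kind and their numeric fields differ by at most a relative tolerance `tol`, and the first event of each cluster serves as its representative.

```python
import string


def similar(a, b, tol):
    """True if both events have the same kind and close numeric fields."""
    if a[0] != b[0]:
        return False
    return all(
        abs(x - y) <= tol * max(abs(x), abs(y), 1)
        for x, y in zip(a[1:], b[1:])
    )


def to_symbol_string(events, tol=0.1):
    """Greedy clustering: map each event to the symbol of the first
    similar representative, creating a new cluster if none matches.

    Sketch only: supports at most 26 clusters (one lowercase letter each).
    """
    reps = []   # cluster representatives; index i corresponds to letter i
    out = []
    for ev in events:
        for i, rep in enumerate(reps):
            if similar(ev, rep, tol):
                out.append(string.ascii_lowercase[i])
                break
        else:
            reps.append(ev)
            out.append(string.ascii_lowercase[len(reps) - 1])
    return "".join(out), reps


# Made-up events: compute phases of roughly 100 ms, 800-byte sends,
# and one much longer compute phase at the end.
events = [
    ("compute", 100, 0), ("send", 1, 800),
    ("compute", 104, 0), ("send", 1, 800),
    ("compute", 97, 0), ("send", 1, 800),
    ("compute", 250, 0),
]
symbols, reps = to_symbol_string(events, tol=0.1)
```

With `tol=0.1` the 250 ms phase gets its own symbol; a much looser tolerance would merge it with the other compute events, which is the kind of trade-off a similarity parameter controls.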

13
Compress Execution Trace → Execution Signature

Step 2: Compress the string by identifying cycles
  Build the loop structure recursively from symbol strings
  e.g. a symbol string with repeated substrings is replaced by a shorter string in which each repeated substring appears once, annotated with its repeat count (3, 2, and 2 in the slide's example)
  Similar to the longest substring matching problem
  A typical execution signature is multiple orders of magnitude smaller than the trace
Step 3: Adaptively increase the degree of compression (by managing a similarity parameter) until the signature is compact enough
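
One simple way to illustrate Step 2 is a greedy fold of adjacent repeats, applied recursively to loop bodies. The Python sketch below is an assumption made for illustration; the slides only say the real algorithm resembles longest substring matching. The names `fold_once`, `compress`, and `render` are hypothetical.

```python
def fold_once(seq):
    """One left-to-right pass folding adjacent repeats into (body, count)."""
    out, i, n = [], 0, len(seq)
    while i < n:
        best_p, best_r = 0, 1
        # Try every period p that could repeat at least twice from position i.
        for p in range(1, (n - i) // 2 + 1):
            r = 1
            while seq[i + r * p : i + (r + 1) * p] == seq[i : i + p]:
                r += 1
            # Prefer the repeat covering the most symbols (ties: shortest period).
            if r >= 2 and r * p > best_p * best_r:
                best_p, best_r = p, r
        if best_r >= 2:
            # Compress the loop body too, so nested loops are found.
            out.append((compress(seq[i : i + best_p]), best_r))
            i += best_p * best_r
        else:
            out.append(seq[i])
            i += 1
    return out


def compress(symbols):
    """Fold repeats until nothing changes; returns a nested loop structure."""
    seq = list(symbols)
    while True:
        folded = fold_once(seq)
        if folded == seq:
            return seq
        seq = folded


def render(sig):
    """Human-readable form, e.g. (a (b)^3)^2."""
    return " ".join(
        x if isinstance(x, str) else f"({render(x[0])})^{x[1]}" for x in sig
    )


signature = compress("abbbccabbbcc")
print(render(signature))  # (a (b)^3 (c)^2)^2
```

Step 3 would then wrap this in a loop that loosens the similarity parameter used to build the symbol string and recompresses until the rendered signature is short enough.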

14
Constructing a Simple Performance Skeleton
1. Run application
2. Record execution trace
3. Compress execution trace into execution signature
4. Construct performance skeleton program from execution signature
15
Generate Performance Skeleton Program

Goal: execution time of the performance skeleton is 1/K of the application execution time (K given by the user)
Reduce the iterations of each loop in the application signature by a factor of K
Heuristically process remaining iterations and events outside loops
Replace symbols by C language statements
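
The scaling step can be sketched in Python for a flat (non-nested) signature. The slides do not describe the heuristic for leftover iterations and events outside loops, so the one used here (shrinking their durations proportionally) is purely an assumption, as are the names `make_skeleton` and `total_time` and the sample durations. Emitting C statements is left out.

```python
def scale_events(events, factor):
    """Return copies of (name, duration_ms) events with scaled durations."""
    return [(name, dur * factor) for name, dur in events]


def make_skeleton(signature, k):
    """Shrink a flat signature so it runs in roughly 1/k of the time.

    Items are either an event (name, duration_ms) or a loop (body, count),
    where body is a list of events.
    """
    skeleton = []
    for item in signature:
        if isinstance(item[0], list):            # loop: (body, count)
            body, count = item
            full, rest = divmod(count, k)
            if full:
                skeleton.append((body, full))    # keep count // k iterations
            if rest:
                # Assumed heuristic: one extra pass, scaled by rest / k.
                skeleton.extend(scale_events(body, rest / k))
        else:                                    # event outside any loop
            # Assumed heuristic: shrink its duration by a factor of k.
            skeleton.extend(scale_events([item], 1 / k))
    return skeleton


def total_time(sig):
    """Total duration of a flat signature in milliseconds."""
    t = 0.0
    for item in sig:
        if isinstance(item[0], list):
            body, count = item
            t += count * sum(d for _, d in body)
        else:
            t += item[1]
    return t


# Made-up signature: setup, 1050 iterations of compute+send, teardown.
app = [
    ("init", 500.0),
    ([("compute", 100.0), ("send", 2.0)], 1050),
    ("finalize", 100.0),
]
skel = make_skeleton(app, k=100)
```

Scaling a communication event's duration is a simplification: a real skeleton would have to shrink message sizes or counts instead, and the results slides note that communication is the harder part to scale down.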

16
Experimental Validation

Skeletons constructed for Class B NAS MPI benchmarks. Executed on 4 cluster nodes in the following sharing scenarios:
  Dedicated nodes (defines the reference execution time ratio between skeleton and application)
  Competing processes on one node / all nodes
  Competing traffic on one link / all links
  Competition as above on one node and one link
Skeleton execution time is used to predict application execution time in the different scenarios.
Setup: Intel Xeon dual-CPU 1.7 GHz nodes running Linux 2.4.7. Gigabit crossbar switch. Simple CPU-intensive competing processes. iproute to simulate link sharing.
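
The prediction arithmetic implied by this setup can be written out as a short Python sketch. The function names and all numbers are hypothetical and are not results from the talk: the dedicated run fixes the application-to-skeleton time ratio, and that ratio is applied to the skeleton's time measured under sharing.

```python
def predict_app_time(skel_dedicated, app_dedicated, skel_shared):
    """Predict application time in a shared scenario from a skeleton run."""
    ratio = app_dedicated / skel_dedicated   # reference ratio (dedicated nodes)
    return ratio * skel_shared


def prediction_error_pct(predicted, measured):
    """Relative prediction error as a percentage of the measured time."""
    return abs(predicted - measured) / measured * 100.0


# Made-up timings in seconds: a 10 s skeleton for a 400 s application.
predicted = predict_app_time(skel_dedicated=10.0, app_dedicated=400.0,
                             skel_shared=15.0)
error = prediction_error_pct(predicted, measured=640.0)
```

In this invented case the skeleton slows down by 1.5x under sharing, so the application is predicted to take 600 s; against a measured 640 s that would be a 6.25% error.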

17
Prediction Accuracy of Skeletons (average across all sharing scenarios)
Average prediction error is 6%, max 18%: acceptable.
Longer skeletons are better, but even 0.5 sec. skeletons are meaningful (the tool issues a warning if the requested skeleton size is too small).
18
Prediction for Different Sharing Scenarios (10 second skeletons)

Error is higher with network contention: communication is harder to scale down and affects synchronization more directly.

19
Comparison with Simple Prediction Methods
Average Prediction: the average slowdown of the entire benchmark is used to predict execution time for each program.
Class S Prediction: Class S benchmark (1 sec) programs are used as skeletons for Class B (30-900 s) benchmarks.
Even the smallest skeletons are far superior!
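
For comparison, the "Average Prediction" baseline might look like the following Python sketch. The function name, the program labels, and the timings are hypothetical, and the slides do not say exactly how the average was formed; a plain mean of per-program slowdowns is assumed here.

```python
from statistics import mean


def average_slowdown_predictions(dedicated, shared):
    """Predict every program's shared-scenario time using one
    benchmark-wide average slowdown (shared time / dedicated time)."""
    avg = mean(shared[p] / dedicated[p] for p in dedicated)
    return {p: dedicated[p] * avg for p in dedicated}


# Made-up timings in seconds for two placeholder programs.
dedicated = {"prog1": 100.0, "prog2": 200.0}
shared = {"prog1": 150.0, "prog2": 500.0}
preds = average_slowdown_predictions(dedicated, shared)
```

Here the two programs slow down by 1.5x and 2.5x, so both are predicted with the 2.0x average; the baseline cannot tell which program is more sensitive to sharing, which is the information a per-application skeleton is meant to capture.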
20
Conclusions

Promising approach to performance estimation for:
  Unpredictable environments (grids)
  Non-existent architectures (under simulation)
It is work in progress; a lot more remains, such as:
  accurately reproducing memory behavior (some results in LCR 2004 workshop)
  integration of memory and communicate/compute
  validation on larger grid environments
  accurate reproduction of CPU behavior (such as instruction types etc.)
  skeletons that scale to different numbers of nodes

21
FOR MORE INFORMATION: www.cs.uh.edu/jaspal, jaspal_at_uh.edu
Thanks to NSF and DOE!
End of Talk! Or is it? Questions?
22
Discovered Communication Structure of NAS Benchmarks
[Figure: communication graphs among nodes 0-3 for the BT, CG, IS, LU, MG, SP, and EP benchmarks]
23
CPU Behavior of NAS Benchmarks
