Title: Interoperable Performance Tools
1 Interoperable Performance Tools
Nikhil Bhatia, Fengguang Song, Felix Wolf
University of Tennessee, Innovative Computing Laboratory
Bernd Mohr
Forschungszentrum Jülich, John von Neumann-Institut für Computing
2 Outline
- KOJAK
- Recent extensions
- Intermediate conclusion
- Two new interoperable components
- CUBE
- CONE
- Future directions
3 Low-Level View of Performance Behavior
4 KOJAK
- Automatic performance analysis
- Take event traces of MPI/OpenMP applications
- Search for execution patterns
- Calculate mapping
- (problem, call path, location) → time
- Display in performance browser
5 KOJAK Architecture
- Instrumentation
- Inserting extra code to generate trace
- Abstraction
- Abstract representation of event trace
- Precomputed relationships
- Simplified specification of performance properties
- Easy to extend
- Analysis
- Automatic classification and quantification of performance behavior
- Presentation
- Navigating / browsing through performance space
- Can be combined with time-line display
[Figure: layer stack, top to bottom: Presentation, Analysis, Abstraction, Instrumentation]
6 KOJAK Architecture (2)
[Figure: KOJAK tool chain, grouped by the labels in the diagram]
- Semiautomatic instrumentation
- Source code, OPARI / TAU, instrumented source code
- POMP + PMPI libraries, EPILOG library, PAPI library
- Compiler / linker, executable
- Run (DPCL), EPILOG trace file
- Automatic analysis
- EARL, EXPERT analyzer, analysis report, EXPERT presenter
- Manual analysis
- Trace converter, VTF3 trace file, VAMPIR
7 Parallelism vs. CPU and Memory Performance
- Interaction among different processes and threads
- How do my processes and threads perform individually?
- CPU performance
- Memory performance
- Integration of these performance aspects?
- Specification of parallelism-related properties
- Temporal and spatial relationships between run-time events
- Specification of CPU and memory-related properties
- Hardware counters
8 CPU and Memory Performance in KOJAK
- Event model and trace format
- Predefined and user-defined system metrics
- Including but not limited to hardware counters
- Metric values as part of ENTER / EXIT records
- Flexible interval semantics
- Run-time system
- Hardware-counter access with PAPI
- Portable access to hardware counters on most platforms
- Abstraction layer
- Additional event attributes
- Attribute name as defined in trace file
- print event['L1_D_CACHE']
9 CPU and Memory Performance in KOJAK (2)
- Analysis
- Identifies tuples (call path, thread) whose occurrence rate of a certain event is above / below a certain threshold
- Use entire execution time of tuple as severity (upper bound)
- Two experimental performance properties
- L1 data cache misses per time above average
- Floating point operations per time below average (25% peak)
- Main results
- Beneficial integration of parallelism with individual CPU performance
- Actual run-time penalty still unknown
- Better bound for severity
- Need to cover additional aspects of CPU performance
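The threshold analysis described on this slide can be sketched as follows. This is a minimal illustration, not KOJAK code: the types `TupleStats` and `Finding`, the function names, and the definition of "average" as total events divided by total time are all assumptions made for the example. As on the slide, the severity of a flagged tuple is its entire execution time, which is only an upper bound on the real penalty.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One (call path, thread) tuple with its accumulated measurements.
// Field names are illustrative and not taken from KOJAK.
struct TupleStats {
    std::string call_path;  // e.g. "main/solve"
    int thread_id;
    double exec_time;       // seconds spent in this tuple
    double event_count;     // e.g. number of L1 data cache misses
};

// A flagged tuple together with its severity.
struct Finding {
    std::string call_path;
    int thread_id;
    double severity;        // upper bound: entire execution time of the tuple
};

// Assumed definition of the average rate: total events / total time.
double average_rate(const std::vector<TupleStats>& tuples) {
    double events = 0.0, time = 0.0;
    for (const TupleStats& t : tuples) {
        events += t.event_count;
        time += t.exec_time;
    }
    return time > 0.0 ? events / time : 0.0;
}

// Report every tuple whose event rate lies above the given threshold.
std::vector<Finding> rate_above(const std::vector<TupleStats>& tuples,
                                double threshold) {
    std::vector<Finding> findings;
    for (const TupleStats& t : tuples) {
        assert(t.exec_time >= 0.0);
        if (t.exec_time == 0.0) continue;  // no time spent, no rate defined
        double rate = t.event_count / t.exec_time;
        if (rate > threshold)
            findings.push_back(Finding{t.call_path, t.thread_id, t.exec_time});
    }
    return findings;
}
```

Calling `rate_above(tuples, average_rate(tuples))` corresponds to the "above average" property; a "below average" property such as the floating-point one would flip the comparison.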
10 Intermediate Conclusion
- Manpower
- Most of the time only two people
- Demand for robust and portable software
- Tool components vs. monolithic tool
- KOJAK developed from and as a set of independent components
- No detailed initial design
- Use of third-party components (e.g., PAPI, TAU)
- Native KOJAK components more generally usable
- Synergy through interoperability
- Components with well-defined interfaces
- Portability
- Open source
11 Some Components and Interfaces
- Hardware monitoring
- HPM, PAPI, PCL
- Instrumentation
- DPCL, DPOMP, Dyninst, SCALEA, SDDF, SIR, TAU
- Experiment management
- ILab, Nimrod, ZENTURIO
- Tool infrastructure
- MRNet, TDP
- Databases and source-code analysis
- DUCTAPE, PDT, PerfDBF, PPerfDB
- Presentation
- Askalon, SvPablo, TAU
- Modeling and prediction
- MetaSim, PerformanceProphet/Teuta
12 Generic Presentation
- Conclusions drawn from EXPERT presenter
- Presentation independent of
- Specific performance properties
- Specific metric
- Presentation only based on
- Structure
- Hierarchical decomposition
- Relative weight (severity) of nodes
- Coloring
[Figure labels: Analyzer, Performance behavior, Presenter]
- Karavanic et al.: structural difference operator
- Miller et al.: hierarchical decomposition in Paradyn
13 CUBE Uniform Behavioral Encoding
- High-level data model of performance behavior
- Mapping (performance aspect, program entities) → metric
- Hierarchical decomposition
- Multidimensional aggregation
- Portable data format (XML)
- Generic presentation component
- Performance-data algebra (not yet supported)
[Figure: KOJAK, CONE, and Performance Tool 3 connected through CUBE (XML) to the Cube Tool]
14 CUBE Prototype
- Implemented in C++ by Fengguang Song
- Mapping (performance property, call tree, location) → metric
- Hierarchical dimensions
- Data format specified using XML Schema
- C++ class interface for reading / writing
- Tested with the CONE call-graph profiler
- Already some more features than EXPERT Presenter
- Absolute values
- Source-code display
[Figure: the three hierarchical dimensions]
- Performance property: general behavior, specific behavior
- Call tree: main, subroutine
- Location: grid, machine, SMP node, process, thread
15 CUBE Interface
  #include <string>

  class Cube {
  public:
    Cube();
    // property dimension
    int def_prop(std::string name, std::string uom, std::string descr, int parent_id);
    // call-tree dimension
    int def_module(std::string name, std::string path);
    int def_region(std::string name, long begln, long endln, std::string descr, int mod_id);
    int def_csite(int mod_id, int line, int callee_id);
    int def_cnode(int csite_id, int parent_id);
    // location dimension
    int def_grid(std::string name);
    int def_mach(std::string name, int grid_id);
    int def_node(std::string name, int mach_id);
    int def_proc(std::string name, int node_id);
    int def_thrd(std::string name, int proc_id);
    // severity of one (property, call-tree node, thread) tuple
    void set_sev(int prop_id, int cnode_id, int thrd_id, double value);
  };
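To illustrate how the ids returned by the `def_*` calls tie the dimensions together, here is a small in-memory stand-in for a subset of the interface. It is not the CUBE library: `MiniCube`, `get_sev`, and `inclusive` are hypothetical names, the convention that a negative `parent_id` marks a root is an assumption, and so is the choice to treat stored values as exclusive with respect to the property hierarchy.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical stand-in for a subset of the Cube interface shown on the slide.
class MiniCube {
public:
    // Property dimension. Assumption: parent_id < 0 marks a root property.
    int def_prop(std::string name, std::string uom, std::string descr, int parent_id) {
        assert(parent_id < static_cast<int>(props_.size()));
        props_.push_back(Prop{name, uom, descr, parent_id});
        return static_cast<int>(props_.size()) - 1;
    }
    // Call-tree dimension. The call-site id is stored but not validated here.
    int def_cnode(int csite_id, int parent_id) {
        cnodes_.push_back(std::make_pair(csite_id, parent_id));
        return static_cast<int>(cnodes_.size()) - 1;
    }
    // Location dimension, reduced to threads. The process id is not validated.
    int def_thrd(std::string name, int proc_id) {
        thrds_.push_back(std::make_pair(name, proc_id));
        return static_cast<int>(thrds_.size()) - 1;
    }
    void set_sev(int prop_id, int cnode_id, int thrd_id, double value) {
        assert(prop_id >= 0 && prop_id < static_cast<int>(props_.size()));
        assert(cnode_id >= 0 && cnode_id < static_cast<int>(cnodes_.size()));
        assert(thrd_id >= 0 && thrd_id < static_cast<int>(thrds_.size()));
        sev_[std::make_tuple(prop_id, cnode_id, thrd_id)] = value;
    }
    // Hypothetical reader: unset tuples count as zero.
    double get_sev(int prop_id, int cnode_id, int thrd_id) const {
        auto it = sev_.find(std::make_tuple(prop_id, cnode_id, thrd_id));
        return it == sev_.end() ? 0.0 : it->second;
    }
    // Hypothetical aggregation: a property plus all of its descendants,
    // summed over every call-tree node and thread.
    double inclusive(int prop_id) const {
        double sum = 0.0;
        for (const auto& entry : sev_)
            if (in_subtree(std::get<0>(entry.first), prop_id)) sum += entry.second;
        return sum;
    }

private:
    struct Prop { std::string name, uom, descr; int parent_id; };
    // Walk up the parent chain of `prop` and look for `root`.
    bool in_subtree(int prop, int root) const {
        for (int p = prop; p >= 0; p = props_[p].parent_id)
            if (p == root) return true;
        return false;
    }
    std::vector<Prop> props_;
    std::vector<std::pair<int, int> > cnodes_;
    std::vector<std::pair<std::string, int> > thrds_;
    std::map<std::tuple<int, int, int>, double> sev_;
};
```

A writer would first define the three dimensions, keep the returned ids, and then call `set_sev` once per (property, call-tree node, thread) tuple; a presenter can then aggregate along any hierarchy, as `inclusive` does for properties.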
16 CUBE Data Format
  <?xml version="1.0" encoding="UTF-8"?>
  <cube version="0.1">
    <behavior>
      <property id="0">
        <name>TIME</name>
        <uom>sec</uom>
        <description>Wall clock time</description>
        <property id="1">
          <name>USER_TIME</name>
          <uom>sec</uom>
          <description>User cpu time</description>
        </property>
        <property id="2">
          <name>SYSTEM_TIME</name>
          <uom>sec</uom>
          <description>System cpu time</description>
        </property>
      </property>
    </behavior>
  </cube>
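The nesting of property elements in this format maps naturally onto a recursive writer. The sketch below is only an illustration of that idea: the `Property` struct and the functions `xml_escape`, `write_property`, and `write_behavior` are hypothetical, ids are assumed to be assigned in pre-order (which matches the ids 0, 1, 2 in the example), and only the behavior section is produced.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical in-memory form of one property with its sub-properties.
struct Property {
    std::string name, uom, description;
    std::vector<Property> children;
};

// Escape the characters that are special in XML text content.
std::string xml_escape(const std::string& text) {
    std::string out;
    for (char c : text) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            default:  out += c;
        }
    }
    return out;
}

// Write one property element and, nested inside it, all of its children.
// Ids are handed out in pre-order through next_id.
void write_property(std::ostream& out, const Property& p, int& next_id, int depth) {
    assert(depth >= 0);
    const std::string pad(2 * depth, ' ');
    out << pad << "<property id=\"" << next_id++ << "\">\n";
    out << pad << "  <name>" << xml_escape(p.name) << "</name>\n";
    out << pad << "  <uom>" << xml_escape(p.uom) << "</uom>\n";
    out << pad << "  <description>" << xml_escape(p.description) << "</description>\n";
    for (const Property& child : p.children)
        write_property(out, child, next_id, depth + 1);
    out << pad << "</property>\n";
}

// Wrap the property forest in the cube / behavior elements.
std::string write_behavior(const std::vector<Property>& roots) {
    std::ostringstream out;
    out << "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    out << "<cube version=\"0.1\">\n";
    out << "  <behavior>\n";
    int next_id = 0;
    for (const Property& root : roots)
        write_property(out, root, next_id, 2);
    out << "  </behavior>\n";
    out << "</cube>\n";
    return out.str();
}
```

Passing a `TIME` property with `USER_TIME` and `SYSTEM_TIME` as children to `write_behavior` yields a document with the same element structure as the one on the slide.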
17 Performance-data algebra
- Comparative analysis
- Different program versions
- Different input data
- Different configuration
- Different random errors
- Performance-data algebra
- Perform arithmetic operations on CUBE instances
- Difference, mean
- Obtain CUBE instance as result
- Display it like ordinary CUBE instance
[Figure: an operator takes CUBE (XML) instances as operands and yields a CUBE (XML) instance as result]
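The two operations named on this slide, difference and mean, can be sketched on a simplified representation of a CUBE instance. The deck marks the algebra as not yet supported, so everything here is an assumption: an instance is reduced to a map from (property, call path, location) names to a severity value, names rather than ids are used so that instances from different experiments can be matched, and tuples missing from one operand count as zero.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical reduced form of a CUBE instance:
// (property, call path, location) -> severity.
typedef std::tuple<std::string, std::string, std::string> Key;
typedef std::map<Key, double> Severity;

// a - b, element by element. Tuples missing from one operand count as zero,
// so the result covers the union of both key sets.
Severity difference(const Severity& a, const Severity& b) {
    Severity result = a;
    for (const auto& entry : b)
        result[entry.first] -= entry.second;  // a missing key starts at 0.0
    return result;
}

// Element-wise mean over several runs, again treating missing tuples as zero.
Severity mean(const std::vector<Severity>& runs) {
    assert(!runs.empty());
    Severity result;
    for (const Severity& run : runs)
        for (const auto& entry : run)
            result[entry.first] += entry.second;
    for (auto& entry : result)
        entry.second /= static_cast<double>(runs.size());
    return result;
}
```

Because both functions return the same type they consume, results can be fed into further operations or displayed like any other instance, which is the closure property the slide asks for.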
18 CONE
COntrol flow Notification Engine
- Flexible call-graph profiler
- Implemented in C/C++ by Nikhil Bhatia
- Binary instrumentation (DPCL)
- Full call path including line numbers
- Large variety of performance data
- PAPI used for hardware monitoring
- Based on IBM's call-graph tracking algorithm
- CATCH profiler
- MPI and serial applications
- Presentation of data with CUBE
19 Online call-graph tracking
- Compute the static call graph in advance
- For each control flow maintain a pointer into call graph
- Start at root node
- Move the pointer upon every function call and return
- Call from call site n
- Move to child node n
- Recursive programs: push onto stack
- Return
- Move to previous node (parent)
- Recursive programs: pop from stack
The number of nodes directly reachable from a function only depends on that function, not on the current call path
20 Online call-graph tracking (2)
[Figure: example call graph with the nodes main(), A(), B(), C(), D(), W(), X(), Y(), Z()]
21 Instrumenting the application
- Every process holds a reference to current node
- Requires only constant overhead
- Note: recursive programs require maintaining a stack
Function C with calls to Y, Z, Y, and the probes inserted around each call site:
  C(...) {
    call(0); Y(...); return();
    call(1); Z(...); return();
    call(2); Y(...); return();
  }
Probes:
  call(int i) { current = current->children[i]; ... }
  return()    { current = current->parent; ... }
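The pointer movement behind the call and return probes can be sketched in compilable form. This is an illustration under stated assumptions, not CONE code: `CallGraphTracker`, `Node`, `add_call_site`, and the per-node `visits` counter are hypothetical names, the return probe is called `ret` because `return` is a C++ keyword, and the stack needed for recursive programs is left out, so the sketch covers the non-recursive case only.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One node of the call graph; children[i] is the node reached from call site i.
struct Node {
    std::string function;
    int parent;                 // index of the parent node, -1 for the root
    std::vector<int> children;
    long visits;                // example payload: how often the node was entered
};

// Hypothetical tracker for a single control flow in a non-recursive program.
class CallGraphTracker {
public:
    // The root node (index 0) is entered once when tracking starts.
    explicit CallGraphTracker(const std::string& root_function) : current_(0) {
        nodes_.push_back(Node{root_function, -1, {}, 1});
    }
    // Static phase: register the callee of the next call site of `parent`.
    int add_call_site(int parent, const std::string& callee) {
        assert(parent >= 0 && parent < static_cast<int>(nodes_.size()));
        nodes_.push_back(Node{callee, parent, {}, 0});
        const int id = static_cast<int>(nodes_.size()) - 1;
        nodes_[parent].children.push_back(id);
        return id;
    }
    // Probe before a call from call site `site`: move to that child. O(1).
    void call(int site) {
        const std::vector<int>& children = nodes_[current_].children;
        assert(site >= 0 && site < static_cast<int>(children.size()));
        current_ = children[site];
        ++nodes_[current_].visits;
    }
    // Probe after the call returns: move back to the parent. O(1).
    void ret() {
        assert(nodes_[current_].parent >= 0);
        current_ = nodes_[current_].parent;
    }
    int current() const { return current_; }
    const Node& node(int id) const { return nodes_[id]; }

private:
    std::vector<Node> nodes_;
    int current_;
};
```

Both probes touch a fixed number of fields, which is where the constant overhead comes from. Since every call site has its own child node, the two calls to Y in function C are accounted for separately.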
22 CONE Architecture
[Figure: CONE architecture]
- Components: CONE tool (monitoring manager, call-graph manager), DPCL, target application with probes, probe module, PAPI, CUBE file, CUBE
- Edge labels: starts, instruments, loads into, calls, writes, presents
23 Future Directions
- KOJAK
- Redesign of the analyzer component
- Improved integration of hardware counters
- CUBE will replace old presenter
- CONE
- Attaching to a running application
- Selective tracing
- More platforms (moving to Dyninst)
- CUBE
- Performance algebra
- Automatic tree expansion
- Rates (derived property)
- Property 1 / Property 2
- KOJAK-specific extensions
- Integration with VAMPIR