Title: Automatic Performance Analysis of SMP Cluster Applications
1 Automatic Performance Analysis of SMP Cluster Applications
- Felix Wolf, Bernd Mohr
- {f.wolf, b.mohr}@fz-juelich.de
- Forschungszentrum Jülich
- Zentralinstitut für Angewandte Mathematik
2 Forschungszentrum Jülich
- Interdisciplinary public research center
- Germany's largest national laboratory
- Focus
- Matter
- Energy
- Information
- Life
- Environment
- 4300 employees
- 1 square mile
3 Jülich
- 30,000 citizens
- First mentioned in 356
- Citadel (from Italian, "small city")
- Renaissance fortress, 1549
- High school, 1966
4 Aachen
- 250,000 citizens
- Borders Belgium and the Netherlands
- Favorite residence of Charlemagne
- Roman emperor, 800-814
5 Outline
- Introduction
- Approach
- Event Trace Generation
- Abstraction Mechanisms
- Analysis Process
- Performance Behavior
- Representation
- Specification
- Presentation
- Demo
- Summary
6 Projects
- APART
- Automatic Performance Analysis Resources and Tools
- Working group funded by the European Union
- http://www.fz-juelich.de/apart/
- Forum of tool experts and hardware and software vendors
- About 20 members worldwide
- Organizes international workshops
- KOJAK
- Research project of ZAM at Forschungszentrum Jülich
- http://www.fz-juelich.de/zam/kojak/
- Embedded in APART
- Development of tools for automatic performance analysis
- New tools, integration of existing tools
- Generic design
7 SMP Clusters
- Hierarchical architecture
- Shared memory within a node
- Distributed memory among nodes
- Standard programming models
- MPI
- OpenMP
- Hybrid
- MPI among nodes
- OpenMP within a node
- Advantage: best match to the underlying architecture
- Problem
- Complex performance behavior
- Lack of appropriate performance tools
8 EXPERT Tool Environment
- Complete tracing-based solution
- Automatic detection of performance problems
- Explanation on high abstraction level
- Close to underlying programming model
- Support of
- MPI
- OpenMP
- MPI/OpenMP
- Advanced graphical tree display of performance behavior
- Along three hierarchical dimensions
- Class of performance behavior
- Position within the dynamic call tree
- Location (e.g., node or process)
9 EXPERT Tool Environment (2)
- EXPERT performance tool
- Analysis of performance behavior
- Presentation of performance behavior
- EARL trace analysis language
- Maps event trace onto higher abstraction level
- Makes analysis process simple and easy to extend
- Event trace generation
- OPARI source code instrumentation of OpenMP directives
- EPILOG runtime library for event recording
10 Terminology
- Performance property: aspect or class of performance behavior
- E.g., execution dominated by point-to-point communication
- Specified as a condition over performance data
- Severity measure indicates the influence on performance behavior
- Performance bottleneck
- Performance property with
- High influence on the performance behavior
- High severity
11 Performance Data
- Different kinds of structured performance data
- Profiles
- Summary information
- Easy to create, low space requirements
- Show simple performance properties
- Event traces
- Single events and their spatial/temporal relationships
- Creation more difficult, high space requirements
- Expressive visualization
- Show complex performance properties
12 Approach
- Proof of performance properties using event traces
- Existence of compound events
- Compound event representing inefficient behavior
- Set of primitive events (constituents)
- Relationships among constituents
- Example
- Message dispatch, receipt
- Kind of relationships based on the programming model
- Advantage
- High-level explanation of inefficiencies
- Based on the vocabulary of the underlying programming model
13 Example: Late Sender (blocked receiver)
- Czochralski crystal growth
14 Example (2): Wait at N x N
15 Event Trace Generation
- Instrumentation
- MPI calls
- Wrapper library based on the MPI standard profiling interface
- OpenMP directives
- OPARI source code preprocessor (C, C++, Fortran)
- User functions
- Internal PGI compiler profiling interface (C, C++, Fortran)
- Event trace generation
- EPILOG runtime library (thread-safe)
- EPILOG binary trace data format
- Event types for MPI/OpenMP
- Hierarchical cluster hardware
- Event location is a tuple (machine, node, process, thread)
- Source code information and performance counter values
16 Abstraction Mechanisms
- Problem
- Low-level information in the event trace
- Sequential access to event records
- Mapping of the low-level trace onto a higher-level model
- Simpler specification of performance properties
- EARL trace analysis language
- Implements a high-level interface to the event trace
- C++ class embedded in a Python interpreter
- Random access
- Abstractions expressing programming-model-specific relationships
- Call tree management
17 Abstraction Mechanisms (2)
- Event trace
- Sequence of events in chronological order
- Event type: set of attributes
- Hierarchy of event types
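A minimal sketch of this event model in Python (illustrative names, not the EPILOG/EARL record layout): an event is a record whose attribute set depends on its type, and a trace is a time-ordered sequence of such records.

```python
# Illustrative sketch: shared attributes ('time', 'locid') model the base of
# the event-type hierarchy, type-specific attributes ('regid', 'dest', 'tag')
# model the subtypes.

def make_event(etype, time, locid, **attrs):
    event = {"type": etype, "time": time, "locid": locid}
    event.update(attrs)  # type-specific attributes
    return event

# A trace is a sequence of events in chronological order:
trace = [
    make_event("ENTER", 1.0, 0, regid="MPI_Send"),
    make_event("SEND", 1.2, 0, dest=1, tag=42),
    make_event("EXIT", 1.3, 0, regid="MPI_Send"),
]
assert all(a["time"] <= b["time"] for a, b in zip(trace, trace[1:]))
```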
18 Abstraction Mechanisms (3)
- Abstractions
- State of an event
- Links between related events (e.g., Send and Recv)
- State of an event
- State of the executing system: set of ongoing activities
- Defined by the set of events that caused the state
- Mapping of an event onto a set of events
- Defined inductively by transition rules
- Examples
- Send events of messages currently in transfer (message queue)
- Enter events of regions currently in execution (region stack)
- Exit events of collective operations just completed
- MPI collective operations (MPICExit)
- OpenMP parallel constructs (OMPCExit)
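The transition-rule idea can be sketched as follows (a toy model with assumed attribute names, not EARL's implementation): the state after an event is derived inductively from the state before it, here maintaining the message queue and the region stack.

```python
# Toy model of the transition rules (assumed attribute names): the state
# after an event is computed from the state before it.

def apply_event(state, event):
    queue, stack = state["queue"], state["stack"]
    kind = event["type"]
    if kind == "SEND":
        queue.append(event)            # message now in transfer
    elif kind == "RECV":
        # remove the matching Send; real matching also uses source,
        # destination, and communicator, not just the tag
        for send in queue:
            if send["tag"] == event["tag"]:
                queue.remove(send)
                break
    elif kind == "ENTER":
        stack.append(event)            # region instance begins
    elif kind == "EXIT":
        stack.pop()                    # innermost region instance ends
    return state

state = {"queue": [], "stack": []}     # state before the first event
for ev in [
    {"type": "ENTER", "regid": "main"},
    {"type": "SEND", "tag": 7},
    {"type": "ENTER", "regid": "MPI_Recv"},
    {"type": "RECV", "tag": 7},
    {"type": "EXIT"},
]:
    state = apply_event(state, ev)

# The message was consumed; only 'main' is still being executed.
assert state["queue"] == [] and state["stack"][0]["regid"] == "main"
```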
19 Abstraction Mechanisms (4)
- Links between related events
- Navigate along a path of related events
- Represented by pointer attributes
- Extension of event types
- Examples
- Pointer to the Enter event of the current region instance (enterptr)
- Pointer from a Recv event to the corresponding Send event (sendptr)
- Pointer from a lock event to the preceding lock event that modified the same lock
- Call tree access
- Associate Enter events with call tree nodes
- Divide the set of Enter events into equivalence classes
- Same call tree node
- Pointer attribute pointing to the representative (least recent event)
- Call tree node represents an execution phase, source code location
20 Analysis Process
- EXPERT analysis component
- Implemented in Python
- Proof of performance properties
- Calculation of severity measures
- Design principle: separation of
- Analysis process
- Specification of performance properties
- Abstractions (EARL)
- Advantage
- Arbitrary set of performance properties
- Short specification of performance properties
- Application-specific properties
21 Representation of Performance Behavior
- Three-dimensional matrix
- Class of performance behavior
- Performance property
- Call tree node
- Source code location, execution phase
- Location
- Machine, node, process, thread
- E.g., distribution of waiting times across processes
- Proof of load imbalance
- Each cell contains a performance metric
- Currently time, e.g., overhead, waiting time
- Each dimension is arranged in a hierarchy
- From general to specific performance properties
- From caller to callees
- Hierarchy of hardware and software components
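A rough sketch of such a three-dimensional matrix (illustrative names, not EXPERT's data structure): one cell per (performance property, call-tree node, location) triple holding a time value, with aggregation over dimensions feeding the upper levels of the hierarchical views.

```python
# Illustrative sketch: a sparse 3D severity matrix keyed by
# (performance property, call-tree node, location).
from collections import defaultdict

severity = defaultdict(float)

def add(prop, cnode, locid, time):
    severity[(prop, cnode, locid)] += time

# Location is the hierarchical tuple (machine, node, process, thread):
add("Late Sender", "main/recv_data", ("m0", "n0", "p1", "t0"), 1.5)
add("Late Sender", "main/recv_data", ("m0", "n0", "p2", "t0"), 0.4)

# Aggregating over dimensions (here: all call-tree nodes and locations)
# yields the numbers shown at upper hierarchy levels of each view:
total = sum(t for (p, c, l), t in severity.items() if p == "Late Sender")
assert abs(total - 1.9) < 1e-9
```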
22 Specification of Performance Behavior
- Proof of a performance property
- Existence of a compound event in the event trace
- Compound event: set of primitive events (constituents)
- Partition constituents into subsets, i.e., logical parts
- Define relationships among subsets using EARL abstractions
- Pattern classes specify compound events
- Python class
- Callback method for each event type
- Analysis process
- Looks for compound event instances in the event trace
- Walks sequentially through the event trace
- Invokes the corresponding callback method for each event
- Each pattern class computes a severity matrix
- Time losses due to the performance property
- Per location and call tree node
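The sequential walk with callback dispatch might be sketched like this (a toy driver with an invented ExitCounter pattern; names are assumptions, the dispatch-by-event-type mechanism is the point):

```python
# Toy analysis driver: one sequential pass over the trace, dispatching each
# event to the callback named after its type on every registered pattern.

class Pattern:
    def __init__(self):
        self.severity = {}             # per-location metric, as a stand-in

class ExitCounter(Pattern):
    """Toy pattern: counts Exit events per location."""
    def exit(self, event):
        loc = event["locid"]
        self.severity[loc] = self.severity.get(loc, 0) + 1

def analyze(trace, patterns):
    for event in trace:
        callback_name = event["type"].lower()   # e.g. "EXIT" -> "exit"
        for pattern in patterns:
            callback = getattr(pattern, callback_name, None)
            if callback is not None:
                callback(event)

pattern = ExitCounter()
analyze([{"type": "ENTER", "locid": 0},
         {"type": "EXIT", "locid": 0},
         {"type": "EXIT", "locid": 0}], [pattern])
assert pattern.severity == {0: 2}
```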
23 Example: Late Sender (blocked receiver)
[Time-line diagram: location A enters MPI_RECV and blocks while location B enters MPI_SEND only later; enterptr links connect the Send and Receive events to their Enter events, and the sendptr link connects the Receive to its Send. The gap at A is the waiting time.]
24 Example: Late Sender (2)
    class LateSender(Pattern):
        "Late Sender"

        def parent(self):
            return "P2P"

        def recv(self, recv):
            recv_start = self._trace.event(recv['enterptr'])
            if self._trace.region(recv_start['regid'])['name'] == "MPI_Recv":
                send = self._trace.event(recv['sendptr'])
                send_start = self._trace.event(send['enterptr'])
                if self._trace.region(send_start['regid'])['name'] == "MPI_Send":
                    idle_time = send_start['time'] - recv_start['time']
                    if idle_time > 0:
                        locid = recv_start['locid']
                        cnode = recv_start['cnodeptr']
                        self._severity.add(cnode, locid, idle_time)
25 Performance Properties (partial list)
- Severity is a percentage of CPU allocation time
- severity = 100 * t / ((time from first to last event) * number of CPUs)
- Upper-level properties
- Execution
- Time during which code is executed
- In MPI applications near 100%
- In OpenMP applications typically below 100%
- Idle Threads
- Time on unused CPUs during sequential regions in OpenMP applications
- Call tree mapping corresponds to that of the master thread
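The severity percentage, as a small function (a sketch; t is the time lost to a property, and CPU allocation time is the wall-clock duration of the run times the number of CPUs):

```python
# Sketch of the severity formula: a property's time loss t is normalized by
# total CPU allocation time, i.e. the wall-clock duration of the run
# (first to last event) times the number of CPUs.

def severity_percent(t, t_first, t_last, num_cpus):
    return 100.0 * t / ((t_last - t_first) * num_cpus)

# Example: 3 s of waiting time in a 10 s run on 4 CPUs is 7.5%.
assert severity_percent(3.0, 0.0, 10.0, 4) == 7.5
```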
26 Performance Properties (2)
[Time-line diagram over locations Thread 0.0-0.3 and Thread 1.0-1.3: during sequential execution on the master threads the remaining CPUs are idle, so Execution and Idle Threads together account for 100% of CPU allocation time.]
27 Performance Properties (3)
- MPI
- Communication
- Collective
- Early Reduce
- Late Broadcast
- Wait at N x N
- Point to Point
- Late Receiver
- Messages in Wrong Order
- Late Sender
- Messages in Wrong Order
- I/O
- Synchronization
28 Performance Properties (4)
- OpenMP
- Synchronization
- Barrier
- Implicit
- Load Imbalance at Parallel Do
- Not Enough Sections
- Explicit
- Lock Competition
- Idle Threads (sequential overhead)
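As an illustration of how a property such as Lock Competition could be evaluated with the lock-event pointers from the EARL abstractions, here is a heavily hedged sketch (invented event names and rule, not EXPERT's actual specification): time a thread spends between entering a lock acquisition and obtaining a lock last released by a different thread is counted as competition.

```python
# Hedged sketch, not EXPERT's rule: follow the pointer from each lock event
# to the preceding event on the same lock; if a thread acquires a lock last
# released by ANOTHER thread, count the wait from entering the acquire
# operation until that release.

def lock_competition(events):
    waiting = {}     # locid -> accumulated waiting time
    last = {}        # lock id -> previous lock event (the pointer target)
    for ev in events:
        if ev["type"] == "ALOCK":                    # acquire lock
            prev = last.get(ev["lkid"])
            if prev is not None and prev["type"] == "RLOCK" \
                    and prev["locid"] != ev["locid"]:
                # thread had to wait for another thread's release
                wait = max(0.0, prev["time"] - ev["enter_time"])
                waiting[ev["locid"]] = waiting.get(ev["locid"], 0.0) + wait
            last[ev["lkid"]] = ev
        elif ev["type"] == "RLOCK":                  # release lock
            last[ev["lkid"]] = ev
    return waiting

events = [
    {"type": "ALOCK", "lkid": 1, "locid": 0, "time": 1.0, "enter_time": 0.9},
    {"type": "RLOCK", "lkid": 1, "locid": 0, "time": 2.0},
    # thread 1 entered the acquire at 1.5 but got the lock only after the
    # release at 2.0 -> 0.5 s of competition
    {"type": "ALOCK", "lkid": 1, "locid": 1, "time": 2.0, "enter_time": 1.5},
]
assert lock_competition(events) == {1: 0.5}
```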
29 Plug-in Mechanism
- Application-specific performance properties
- Application-specific criteria
- E.g., based on iterations or updates per second
- Automatic GUI integration
- Based on Python module concept
30 Presentation of Performance Behavior
- Performance behavior
- Three-dimensional matrix
- Hierarchical dimensions
- Weighted tree
- Tree browser
- Each node has a weight
- Percentage of CPU allocation time
- E.g., time spent in a subtree of the call tree
- Displayed weight depends on the state of the node
- Collapsed (includes weight of descendants)
- Expanded (without weight of descendants)
- Displayed using
- Color
- Allows hot spots (bottlenecks) to be identified easily
- Numerical value
- Detailed comparison
- Example
- Collapsed node: 100 main
- Expanded node: 10 main, with children 30 foo and 60 bar
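The collapsed/expanded weights can be sketched as follows (assumed structure): a collapsed node displays its inclusive weight (own plus all descendants), an expanded node only its exclusive weight, so the visible numbers always sum to the total.

```python
# Sketch of the displayed weight: inclusive when collapsed (own weight plus
# all descendants), exclusive when expanded (own weight only).

def displayed_weight(node, expanded):
    if expanded:
        return node["own"]                              # exclusive weight
    return node["own"] + sum(displayed_weight(c, False)
                             for c in node["children"])  # inclusive weight

tree = {"own": 10, "children": [      # main: 10% of its own
    {"own": 30, "children": []},      # foo
    {"own": 60, "children": []},      # bar
]}

assert displayed_weight(tree, expanded=False) == 100   # collapsed: 100 main
assert displayed_weight(tree, expanded=True) == 10     # expanded: 10 main
```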
31 Presentation of Performance Behavior (2)
- Three views
- Performance property
- Call tree
- Locations
- Interconnected
- View refers to selection in left neighbor
- Two modes
- Absolute: percent of total CPU allocation time
- Relative: percent of the selection in the left neighbor
- Collapsing/expanding of nodes
- Analysis on all hierarchy levels
32 Case Study: TRACE (MPI)
- Simulation of subsurface water flow in variably saturated media
- Based on a parallelized CG algorithm
- Experiment
- 8 x 2 processes
- Communication: 16.7%
- Two major sources of inefficiency detected
- Wait at N x N
- trace → cgiteration → parallelcg → paralleldotproduct → globalsum_r1 → MPI_Allreduce
- Waiting time ca. 3.6%
- Late Sender
- trace → cgiteration → parallelcg → parallelfemultiply → exchangedata → exchangebufferswf → mrecv → MPI_Recv
- Waiting time 7.0%
33 Case Study: REMO (MPI + OpenMP)
- Weather forecast
- Based on the regional climate model
- Experiment
- 4 processes
- 4 threads each
- Sequential part outside the parallel regions is too large
- Idle Threads
- remo (whole program): 50%
- remo → ec4org → progec4: 37%
34 Summary
- EXPERT tool environment for automatic performance analysis
- Complete but still extensible solution
- Support of MPI, OpenMP, and hybrid applications
- Especially well suited for SMP clusters
- Performance properties
- High abstraction level
- Close to the terminology of the underlying programming model
- Specifications embedded in an extensible architecture
- Application-specific needs
- Performance behavior
- Three interconnected hierarchical dimensions
- Representation of the SMP cluster structure
- Hierarchical hardware and software components
- Scalable but still accurate tree display
35 Outlook
- More performance properties
- Comparative analysis of different trace files
- Additional presentation components
- Source code display
- Event pattern display (e.g., using VAMPIR)
- Performance behavior is represented by time values (losses)
- Alternative metrics, such as hardware performance counters
- Integration with TAU (University of Oregon)
- Instrumentation
- Trace generation