Title: Allen D. Malony, Sameer Shende, Alan Morris
1 Phase-Based Parallel Performance Profiling
- Allen D. Malony, Sameer Shende, Alan Morris
- {malony,sameer,amorris}@cs.uoregon.edu
- Department of Computer and Information Science
- Performance Research Laboratory
- NeuroInformatics Center
- University of Oregon
2 Outline of Talk
- Motivation
- Models in parallel scientific applications
- Phases and performance mapping
- Problem description
- Motivating example
- Profiling techniques
- Flat, callpath, phase profiling
- Approach and implementation
- Applications
- Future work and concluding remarks
3 Motivation
- Scientific applications designed based on models
- Computational: structural, logical, numerical models, ...
- Correctness: execution order, data consistency, ...
- Performance: expected, factors, parallelism/scalability, ...
- Computational models form developers' mental model
- How the program is intended to behave and perform
- Want to relate performance model to computation model
- View performance data with respect to mental model
- Better identify problems and guide tuning decisions
- Must link computational abstractions to performance
- Bridge semantic gap: measurements ↔ mental model
4 Computational Models
- Structural models
- Program organization and code relationships
- Language used, layout of application parts, ...
- Constructed generally and unfolds during execution
- Logical and numerical models
- Capture algorithmic characteristics of the application
- Semantic properties of the computation
- correct flow of operation and assertions on application state
- Numerical models
- Algorithms for simulating physical phenomena
- Accuracy properties from numerical calculations
- Structural and logical models implicit
5 Performance Mapping
- General problem of linking performance to computation
- Performance mapping (Irvin and Miller, 96; Shende, 01)
- Associate (map) measured performance data
- To higher-level, semantic representations
- Those with model significance to the user
- What is the difficulty of making the association?
- Depends on performance information
- performance events/state visible from instrumentation
- what performance data can be measured
- How the performance information is used in mapping
- Difficulty in how performance information is presented
- Model-based views (LeBlanc et al., 90)
6 Phases and Performance Mapping
- Like to support the association between model and data
- Concept of phases is common in scientific applications
- How developers think about structure, logic, numerics
- How performance can be interpreted (Worley, 92)
- Worthwhile to consider support for phases
- In performance measurement
- Bridge semantic gap in parallel performance mapping?
- tracing has long demonstrated the benefits! (Heath, 91)
- phase-based analysis and interpretation
- Main contribution
- Support for phases in parallel performance profiling
7 Problem Description
- Performance measured as a consequence of events
- Events represent actions that occur during execution
- Events of interest determine performance information
- Events have semantics and context (pragmatics)
- Semantics
- Defines what the event represents
- Example: subroutine entry
- Context
- Properties of the state in which event occurred
- Example: subroutine's calling parent
- Interrogate context to map event performance data
8 Motivating Example: Multi-Physics Application
- Assembly of physical objects
- Different shapes
- Different materials
- Calculate physics
- Heat transfer
- Mechanical stress
- Within / between objects
- Iterate to error tolerance
- How is performance attributed?
- Between events (e.g., routines) and execution components
- With respect to computational objects (e.g., data objects)
heat()
MPIrecv()
MPIsend()
other routines
9 Context and Standard Profiling
- Flat profiles
- Context is whole program (i.e., program code)
- Performance distribution across (static) program structure
- Cannot differentiate dynamics (e.g., callpath or objects)
- Callgraph / callpath profiles
- Identify parent-child calling relationships at execution
- Context is calling (event) parent / calling (event) path
- Extend event semantics to encode context
- create new event with callpath name
- requires dynamic event creation for complex callpaths
- burdens event mechanisms for context identification
- simple performance associations require many events
10 Context and Phase Profiling
- View the program execution as a collection of phases
- Transition between phases (sequenced, nested)
- easiest to think of as phase hierarchy (or phase graph)
- Phases are not events
- phase boundaries can mark entry/exit events
- Context is the current phase
- How do we know what phase we are in?
- Phases are identified separately from events
- phases are not encoded in event names
- event mechanisms are not overloaded
- A phase profile is event performance attributed to phases
- Phase-specific performance profiles (flat or callpath)
11 Approach (Flat Profile)
- Create a profile object for each entry/exit event
- Each profile object has a name
- Static profile object (static event)
- event has a single instance (single name)
- Dynamic profile object (dynamic event)
- event can have multiple instances (created dynamically)
- Inclusive and exclusive performance statistics
- Must maintain an event stack (or callstack)
- Contexts are generally thought of as code locations
- Dynamic events do allow for dynamic context awareness
- User code can check state and create new events
- BUT only see one level of event!
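The flat-profile bookkeeping described on this slide can be sketched in a few lines of Python. This is only an illustration of inclusive versus exclusive accounting with an event stack, not TAU's implementation; the class name `FlatProfiler`, its methods, and the scripted clock are all hypothetical, and recursion is deliberately not handled.

```python
from collections import defaultdict


class FlatProfiler:
    """Hypothetical flat profiler: one profile entry per event name."""

    def __init__(self, clock):
        self.clock = clock                    # callable returning the current time
        self.stack = []                       # event stack: [name, start, child_time]
        self.inclusive = defaultdict(float)   # time including nested events
        self.exclusive = defaultdict(float)   # time excluding nested events
        self.calls = defaultdict(int)

    def start(self, name):
        self.stack.append([name, self.clock(), 0.0])

    def stop(self, name):
        top_name, started, child_time = self.stack.pop()
        assert top_name == name, "entry/exit events must be properly nested"
        elapsed = self.clock() - started
        self.inclusive[name] += elapsed
        self.exclusive[name] += elapsed - child_time
        self.calls[name] += 1
        if self.stack:
            # Charge this event's inclusive time to its parent as child time.
            self.stack[-1][2] += elapsed


# Scripted clock so the example is deterministic (assumption: times in seconds).
ticks = iter([0.0, 1.0, 4.0, 10.0])
prof = FlatProfiler(lambda: next(ticks))
prof.start("main")   # t = 0
prof.start("heat")   # t = 1
prof.stop("heat")    # t = 4
prof.stop("main")    # t = 10
```

Note that the profile is keyed by event name alone, so every call to `heat` lands in the same entry regardless of who called it. That is the "one level of event" limitation the slide points out.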
12 Approach (Callpath Profile)
- Show event calling (nesting) relationships
- Create a profile object for each event calling context
- Each profile object has a name that encodes the callpath
- Static profile object
- callpath has a single instance (single name)
- Dynamic profile object
- callpath can have multiple instances (created dynamically)
- Reuse event mechanisms
- Interrogate the event stack to form event names
- main => f1 => f2 => MPI_Send
- Inclusive and exclusive performance statistics
- Callpath length and callgraph depth options
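As a rough illustration of forming event names by interrogating the event stack, the sketch below joins the names on the stack with " => " and optionally truncates to a maximum callpath length. The class `CallpathProfiler`, its `depth` parameter, and the counter-based clock are assumptions made for the example; they are not TAU's API.

```python
import itertools
from collections import defaultdict


class CallpathProfiler:
    """Hypothetical callpath profiler: one profile entry per callpath name."""

    def __init__(self, clock, depth=None):
        self.clock = clock
        self.depth = depth                    # max callpath length (>= 1); None = full path
        self.stack = []                       # event stack: [name, start_time]
        self.inclusive = defaultdict(float)   # inclusive time keyed by callpath name

    def _path_name(self):
        names = [frame[0] for frame in self.stack]
        if self.depth is not None:
            names = names[-self.depth:]       # keep only the innermost `depth` events
        return " => ".join(names)

    def start(self, name):
        self.stack.append([name, self.clock()])

    def stop(self):
        path = self._path_name()              # build the name before popping
        _, started = self.stack.pop()
        self.inclusive[path] += self.clock() - started
        return path


# Full callpath: every distinct path becomes its own profile entry.
full = CallpathProfiler(itertools.count().__next__)
for name in ["main", "f1", "f2", "MPI_Send"]:
    full.start(name)
deepest = full.stop()

# Depth-limited callpath (length 2): only the parent and the event are encoded.
short = CallpathProfiler(itertools.count().__next__, depth=2)
for name in ["main", "f1", "f2", "MPI_Send"]:
    short.start(name)
short_name = short.stop()
```

Because context is encoded in the name, the number of profile entries grows with the number of distinct paths. The depth option trades that growth against how much calling context is retained.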
13 Approach (Phase Profile)
- A phase is an execution abstraction
- Two questions
- How to inform the measurement system about phases?
- How to collect the performance data?
- Create a phase object when new phase is created
- Each phase object has a name
- Static and dynamic phase objects
- Phase relationships
- Phases may be nested (cannot overlap)
- Active phase object follows scoping rules
- Default (top-level) phase is outermost event (e.g., main)
14 Approach (Phase Profile - API)
- Phase creation
- TAU_PHASE_CREATE_STATIC(var, name, type, group)
- TAU_PHASE_CREATE_DYNAMIC(var, name, type, group)
- TAU_GLOBAL_PHASE(var, name, type, group)
- TAU_GLOBAL_PHASE_EXTERNAL(var)
- Global phases have global scope (accessible anywhere)
- External declarations for phases defined outside file scope
- Phase control
- TAU_PHASE_START(var)
- TAU_PHASE_STOP(var)
- TAU_GLOBAL_PHASE_START(var)
- TAU_GLOBAL_PHASE_STOP(var)
- Collects a callgraph profile (depth 2) PER PHASE!
- Phases default as standard events (when disabled)
15 Approach (Phase Profile - Data Collection)
- Leverages performance mapping and callpath profiling
- Phase entry
- Phase object pushed to measurement (event) callstack
- Phase / event entry
- Need to determine (event, phase) tuple
- traverse callstack to find enclosing phase
- construct key for (event, phase) tuple
- Maintain global map
- new keys for new (event, phase) tuples put into global map
- create new profile object for every (event, phase) tuple
- search global map to determine if tuple occurred before
- Use mapping support to store performance data on exit
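The data-collection steps above can be sketched as follows. On every phase or event entry the callstack is traversed to find the enclosing phase, an (event, phase) key is built, and a global map holds one profile entry per key. Everything named here (`PhaseProfiler`, `phase_start`, `event_start`, the implicit top-level phase "main", the counter clock) is a hypothetical stand-in for illustration, not the TAU implementation.

```python
import itertools


class PhaseProfiler:
    """Hypothetical phase profiler keyed by (event, enclosing phase)."""

    def __init__(self, clock, top_phase="main"):
        self.clock = clock
        self.top_phase = top_phase   # default phase when no phase is active
        self.stack = []              # frames: (name, is_phase, start_time, key)
        self.profiles = {}           # global map: (event, phase) -> inclusive time

    def _enclosing_phase(self):
        # Traverse the callstack from the top to find the innermost active phase.
        for name, is_phase, _, _ in reversed(self.stack):
            if is_phase:
                return name
        return self.top_phase

    def _enter(self, name, is_phase):
        key = (name, self._enclosing_phase())
        self.profiles.setdefault(key, 0.0)   # new profile entry only for a new tuple
        self.stack.append((name, is_phase, self.clock(), key))

    def phase_start(self, name):
        self._enter(name, True)

    def event_start(self, name):
        self._enter(name, False)

    def stop(self):
        # On exit, store the measured time under the key computed at entry.
        _, _, started, key = self.stack.pop()
        self.profiles[key] += self.clock() - started


p = PhaseProfiler(itertools.count().__next__)
for phase in ["heat phase", "stress phase"]:
    p.phase_start(phase)
    p.event_start("MPI_Recv")   # same event name, different enclosing phase
    p.stop()                    # leaves MPI_Recv
    p.stop()                    # leaves the phase
```

The event name `MPI_Recv` is never altered. The two entries differ only in the phase half of the key, which is how the phase context stays out of the event naming mechanism.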
16 Multi-Physics Example
Instrumentation
phases
iterate phase
events
heat phase
heat()
MPIrecv()
only two events!
stress phase
stress()
MPIsend()
other routines
17 Implementation
- Parallel profiling in the TAU performance system
- Flat profiling
- Callpath and callgraph (2-level callpath) profiling
- Phase profiling
- Multiple performance metrics
- Execution time
- Hardware performance counters (using PAPI)
- Scalable to tens of thousands of processors
- Profile analysis and data management tools
- ParaProf parallel profile analyzer / visualizer
- PerfDMF parallel profile database
18 Application: NAS Parallel Benchmarks
- Phase profiling can provide more refined profile results
- Specific to phase localities
- Defining phases is an application-specific issue
- Apply understanding of computational models
- Unfortunately, we were not the application developers
- How to decide on phases and phase instrumentation?
- Informed by application documentation and code
- Look at NAS parallel benchmark application suite
- Identify benchmarks with phase behavior
- SP, BT, LU (simulated CFD codes) and CG
- Focus on BT
19 NAS BT Phase Analysis
- Emulates a CFD application
- System of linear equations
- Implicit finite-difference discretization of Navier-Stokes
- Solve three sets of uncoupled systems of equations
- in X, Y, Z directions
- Block tridiagonal with 5x5 blocks
- Square number of processors
- Phase analysis
- Highlight performance for each solution direction
- Identified in code by three main functions
- x_solve, y_solve, z_solve
- Static phases
20 NAS BT Instrumentation
- call TAU_PHASE_CREATE_STATIC(xsolvephase, 'x_solve phase')
- call TAU_PHASE_START(xsolvephase)
- call x_solve
- call TAU_PHASE_STOP(xsolvephase)
- call TAU_PHASE_CREATE_STATIC(ysolvephase, 'y_solve phase')
- call TAU_PHASE_START(ysolvephase)
- call y_solve
- call TAU_PHASE_STOP(ysolvephase)
- call TAU_PHASE_CREATE_STATIC(zsolvephase, 'z_solve phase')
- call TAU_PHASE_START(zsolvephase)
- call z_solve
- call TAU_PHASE_STOP(zsolvephase)
21 NAS BT Flat Profile
How is MPI_Wait() distributed relative to solver direction?
Application routine names reflect phase semantics
22 NAS BT Phase Profile (Main and X, Y, Z)
Main phase shows nested phases and immediate events
23 Application: MFIX
- Multiphase Flow with Interphase eXchanges (MFIX)
- National Energy Technology Laboratory (NETL)
- Study physical/chemical properties in fluid-solid systems
- hydrodynamics, heat transfer, chemical reactions
- Characteristic of large-scale iterative simulations
- major loop executed as simulation advances in time
- Testcase
- Models ozone decomposition in a bubbling fluidized bed
- Flat profile
- Iterate phase profile
- Demonstrate dynamic phases
24 MFIX Phase Instrumentation (ITERATE)
- SUBROUTINE ITERATE(IER, NIT)
- character(11) taucharary
- integer tauiteration / 0 /
- integer profiler(2) / 0, 0 /
- save profiler, tauiteration
- write (taucharary, '(a8,i3)') 'ITERATE ', tauiteration
- tauiteration = tauiteration + 1
- call TAU_PHASE_CREATE_DYNAMIC(profiler, taucharary)
- call TAU_PHASE_START(profiler)
- ! WORK
- call TAU_PHASE_STOP(profiler)
- END SUBROUTINE ITERATE
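The key point of the dynamic phase is that its name is rebuilt on every call, so each iteration gets its own phase profile. A small Python analogue of the name construction, assuming the Fortran format `(a8,i3)` shown above (an 8-character label followed by a right-justified 3-digit integer), might look like this. The helper name `iterate_phase_name` is hypothetical.

```python
def iterate_phase_name(iteration):
    """Build a per-iteration phase name, mirroring Fortran format (a8,i3)."""
    # 'ITERATE ' fills the 8-character field; the iteration number is
    # right-justified in a 3-character field, giving 11 characters in total.
    return f"{'ITERATE ':<8}{iteration:3d}"


# One distinct phase name per iteration number.
names = [iterate_phase_name(i) for i in (1, 51, 92)]
```

Each distinct name would be passed to the dynamic phase creation call, which is why the resulting profile contains one phase per iteration rather than a single aggregated ITERATE entry.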
25 MFIX Phase Profile (MPI_Waitall)
In 51st iteration, time spent in MPI_Waitall was 85.81 secs
Dynamic phases: one per iteration
Total time spent in MPI_Waitall was 4137.9 secs across all 92 iterations
26 MFIX Iterate Phase Behavior
27 Concluding Discussion and Future Work
- Phase-based profiling can help to bridge semantic gap
- Computational models ↔ performance measurements
- Application-specific performance analysis
- Implemented phase profiling in TAU
- Demonstrated phase profiling
- NAS BT benchmark and MFIX application
- Also used in S3D, Uintah, Flash on large-scale platforms
- Requires application-specific knowledge
- Might be possible to link to auto phase identification
- Based on memory tracing or application state change
- Can this idea be extended to global parallel phases?
- Working on better ways to present phase performance
28 Support Acknowledgements
- Department of Energy (DOE)
- Office of Science contracts
- University of Utah ASCI Level 1 sub-contract
- ASC/NNSA Level 3 contract
- Department of Defense (DoD)
- HPC Modernization Office (HPCMO)
- Programming Environment and Training (PET)
- NSF
- Research Centre Juelich
- Los Alamos National Laboratory
- www.cs.uoregon.edu/research/paracomp/tau