Allen D. Malony, Sameer Shende, Alan Morris - PowerPoint PPT Presentation


1
Phase-Based Parallel Performance Profiling
  • Allen D. Malony, Sameer Shende, Alan Morris
  • {malony,sameer,amorris}@cs.uoregon.edu
  • Department of Computer and Information Science
  • Performance Research Laboratory
  • NeuroInformatics Center
  • University of Oregon

2
Outline of Talk
  • Motivation
  • Models in parallel scientific applications
  • Phases and performance mapping
  • Problem description
  • Motivating example
  • Profiling techniques
  • Flat, callpath, phase profiling
  • Approach and implementation
  • Applications
  • Future work and concluding remarks

3
Motivation
  • Scientific applications designed based on models
  • Computational: structural, logical, numerical
    models, ...
  • Correctness: execution order, data consistency, ...
  • Performance: expected, factors,
    parallelism/scalability, ...
  • Computational models form the developer's mental
    model
  • How the program is intended to behave and perform
  • Want to relate performance model to computation
    model
  • View performance data with respect to mental
    model
  • Better identify problems and guide tuning
    decisions
  • Must link computational abstractions to
    performance
  • Bridge semantic gap: measurements ↔ mental
    model

4
Computational Models
  • Structural models
  • Program organization and code relationships
  • Language used, layout of application parts, ...
  • Constructed generally and unfolds during
    execution
  • Logical and numerical models
  • Capture algorithmic characteristics of the
    application
  • Semantic properties of the computation
  • correct flow of operation and assertions on
    application state
  • Numerical models
  • Algorithms for simulating physical phenomena
  • Accuracy properties from numerical calculations
  • Structural and logical models implicit

5
Performance Mapping
  • General problem of linking performance to
    computation
  • Performance mapping (Irvin and Miller, 96;
    Shende, 01)
  • Associate (map) measured performance data
  • To higher level, semantic representations
  • Those with model significance to the user
  • What is the difficulty of making the association?
  • Depends on performance information
  • performance events/state visible from
    instrumentation
  • what performance data can be measured
  • How the performance information is used in
    mapping
  • Difficulty in how performance information is
    presented
  • Model-based views (LeBlanc et al., 90)

6
Phases and Performance Mapping
  • Like to support the association between model and
    data
  • Concept of phases is common in scientific
    applications
  • How developers think about structure, logic,
    numerics
  • How performance can be interpreted (Worley, 92)
  • Worthwhile to consider support for phases
  • In performance measurement
  • Bridge semantic gap in parallel performance
    mapping?
  • tracing has long demonstrated the benefits!
    (Heath, 91)
  • phase-based analysis and interpretation
  • Main contribution
  • Support for phases in parallel performance
    profiling

7
Problem Description
  • Performance measured as a consequence of events
  • Events represent actions that occur during
    execution
  • Events of interest determine performance
    information
  • Events have semantics and context (pragmatics)
  • Semantics
  • Defines what the event represents
  • Example: subroutine entry
  • Context
  • Properties of the state in which event occurred
  • Example: subroutine's calling parent
  • Interrogate context to map event performance data

8
Motivating Example: Multi-Physics Application
  • Assembly of physical objects
  • Different shapes
  • Different materials
  • Calculate physics
  • Heat transfer
  • Mechanical stress
  • Within / between objects
  • Iterate to error tolerance
  • How is performance attributed?
  • Between events (e.g., routines) and execution
    components
  • With respect to computational objects (e.g., data
    objects)

[Figure labels: heat(), MPIrecv(), MPIsend(), other routines]

9
Context and Standard Profiling
  • Flat profiles
  • Context is whole program (i.e., program code)
  • Performance distribution across (static) program
    structure
  • Cannot differentiate dynamics (e.g., callpath or
    objects)
  • Callgraph / callpath profiles
  • Identify parent-child calling relationships at
    execution
  • Context is calling (event) parent / calling
    (event) path
  • Extend event semantics to encode context
  • create new event with callpath name
  • requires dynamic event creation for complex
    callpaths
  • burdens event mechanisms for context
    identification
  • simple performance associations require many
    events

10
Context and Phase Profiling
  • View the program execution as collection of
    phases
  • Transition between phases (sequenced, nested)
  • easiest to think of as phase hierarchy (or phase
    graph)
  • Phases are not events
  • phase boundaries can mark entry/exit events
  • Context is the current phase
  • How do we know what phase we are in?
  • Phases are identified separately from events
  • phases are not encoded in event names
  • event mechanisms are not overloaded
  • A phase profile is event performance attributed
    to phases
  • Phase-specific performance profiles (flat or
    callpath)

11
Approach (Flat Profile)
  • Create a profile object for each entry/exit event
  • Each profile object has a name
  • Static profile object (static event)
  • event has a single instance (single name)
  • Dynamic profile object (dynamic event)
  • event can have multiple instances (created
    dynamically)
  • Inclusive and exclusive performance statistics
  • Must maintain an event stack (or callstack)
  • Contexts are generally thought of as code
    locations
  • Dynamic events do allow for dynamic context
    awareness
  • User code can check state and create new events
  • BUT only see one level of event!
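
The bookkeeping on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not TAU's implementation: FlatProfiler, enter, exit, and the injected clock are hypothetical names, and a scripted clock stands in for a real timer so the numbers are reproducible.

```python
from collections import defaultdict


class FlatProfiler:
    """Toy flat profiler: one profile entry per event name (hypothetical API)."""

    def __init__(self, clock):
        self.clock = clock                    # callable returning the current time
        self.stack = []                       # event stack: [name, start, child_time]
        self.inclusive = defaultdict(float)   # time in the event, children included
        self.exclusive = defaultdict(float)   # time in the event, children excluded
        self.calls = defaultdict(int)

    def enter(self, name):
        self.stack.append([name, self.clock(), 0.0])

    def exit(self, name):
        top, start, child_time = self.stack.pop()
        assert top == name, "entry/exit events must nest properly"
        elapsed = self.clock() - start
        self.inclusive[name] += elapsed
        self.exclusive[name] += elapsed - child_time
        self.calls[name] += 1
        if self.stack:
            # Report our elapsed time to the parent so it can compute
            # its own exclusive time when it exits.
            self.stack[-1][2] += elapsed


# Scripted clock: each call returns the next timestamp.
ticks = iter([0.0, 1.0, 4.0, 10.0])
prof = FlatProfiler(clock=lambda: next(ticks))
prof.enter("main")       # t = 0
prof.enter("MPI_Send")   # t = 1
prof.exit("MPI_Send")    # t = 4
prof.exit("main")        # t = 10
```

The event stack is needed only to separate inclusive from exclusive time. The profile itself stays keyed by a single event name, which is why only one level of event is visible. The sketch does not handle recursion, where inclusive time would be counted more than once.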

12
Approach (Callpath Profile)
  • Show event calling (nesting) relationships
  • Create a profile object for each event calling
    context
  • Each profile object has a name that encodes the
    callpath
  • Static profile object
  • callpath has a single instance (single name)
  • Dynamic profile object
  • callpath can have multiple instances (created
    dynamically)
  • Reuse event mechanisms
  • Interrogate the event stack to form event names
  • main => f1 => f2 => MPI_Send
  • Inclusive and exclusive performance statistics
  • Callpath length and callgraph depth options
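
Forming a callpath name by interrogating the event stack, with a depth option, might look like the sketch below. The names callpath_name and CallpathCounter are hypothetical, and the " => " separator is an assumption about how the path is written; only call counts are kept, to keep the example short.

```python
from collections import Counter


def callpath_name(event_stack, depth=None):
    """Name a callpath from the innermost `depth` entries of the event stack."""
    if depth is not None and depth < 1:
        raise ValueError("depth must be at least 1")
    path = event_stack if depth is None else event_stack[-depth:]
    return " => ".join(path)


class CallpathCounter:
    """Counts entries per callpath: one profile entry per distinct name."""

    def __init__(self, depth=None):
        self.depth = depth
        self.stack = []
        self.calls = Counter()

    def enter(self, event):
        self.stack.append(event)
        # The profile key encodes the calling context, not just the event.
        self.calls[callpath_name(self.stack, self.depth)] += 1

    def exit(self):
        self.stack.pop()


full = CallpathCounter()            # unlimited callpath length
shallow = CallpathCounter(depth=2)  # parent/child pairs only
for prof in (full, shallow):
    for event in ("main", "f1", "f2", "MPI_Send"):
        prof.enter(event)
    for _ in range(4):
        prof.exit()
```

With unlimited depth the innermost key is the full path from main to MPI_Send. With depth 2 it is only the pair of f2 and MPI_Send, which bounds the number of profile entries at the cost of losing the outer context.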

13
Approach (Phase Profile)
  • A phase is an execution abstraction
  • Two questions
  • How to inform the measurement systems about
    phases?
  • How to collect the performance data?
  • Create a phase object when new phase is created
  • Each phase object has a name
  • Static and dynamic phase objects
  • Phase relationships
  • Phases may be nested (cannot overlap)
  • Active phase object follows scoping rules
  • Default (top-level) phase is outermost event
    (e.g., main)
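
The nesting and scoping rules for phases can be modeled with a simple stack. The sketch below is an illustration only: PhaseTracker and its methods are hypothetical names, not part of TAU, and the default phase name "main" is taken from the example on the slide.

```python
from contextlib import contextmanager


class PhaseTracker:
    """Toy model of nested phases; the innermost started phase is active."""

    def __init__(self, default_phase="main"):
        self._active = [default_phase]   # bottom entry is the top-level phase

    @property
    def active(self):
        return self._active[-1]

    def start(self, name):
        self._active.append(name)

    def stop(self, name):
        # Phases may nest but not overlap: only the innermost phase can stop.
        if len(self._active) == 1 or self._active[-1] != name:
            raise RuntimeError(f"phase {name!r} is not the active phase")
        self._active.pop()

    @contextmanager
    def phase(self, name):
        self.start(name)
        try:
            yield
        finally:
            self.stop(name)


tracker = PhaseTracker()
seen = [tracker.active]
with tracker.phase("iterate"):
    seen.append(tracker.active)
    with tracker.phase("heat"):
        seen.append(tracker.active)
    seen.append(tracker.active)      # back in the enclosing phase
seen.append(tracker.active)          # back in the default phase
```

Stopping a phase restores the enclosing phase as the active one, which is the scoping behavior described above. An attempt to stop a phase that is not innermost is rejected, which enforces the no-overlap rule.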

14
Approach (Phase Profile - API)
  • Phase creation
  • TAU_PHASE_CREATE_STATIC(var, name, type, group)
  • TAU_PHASE_CREATE_DYNAMIC(var, name, type, group)
  • TAU_GLOBAL_PHASE(var, name, type, group)
  • TAU_GLOBAL_PHASE_EXTERNAL(var)
  • Global phases have global scope (accessible
    anywhere)
  • External declarations for phases defined outside
    file scope
  • Phase control
  • TAU_PHASE_START(var)
  • TAU_PHASE_STOP(var)
  • TAU_GLOBAL_PHASE_START(var)
  • TAU_GLOBAL_PHASE_STOP(var)
  • Collects a callgraph profile (depth 2) PER PHASE!
  • Phases default as standard events (when disabled)

15
Approach (Phase Profile - Data Collection)
  • Leverages performance mapping and callpath
    profiling
  • Phase entry
  • Phase object pushed to measurement (event)
    callstack
  • Phase / event entry
  • Need to determine (event, phase) tuple
  • traverse callstack to find enclosing phase
  • construct key for (event, phase) tuple
  • Maintain global map
  • new keys for new (event, phase) tuples put into
    global map
  • create new profile object for every (event,
    phase) tuple
  • search global map to determine if tuple has occurred
    before
  • Use mapping support to store performance data on
    exit
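
The data collection steps above can be sketched as follows. This is a simplified model, not TAU's implementation: PhaseProfiler, its methods, and the scripted clock are hypothetical, only inclusive time is recorded, and the event names are borrowed from the multi-physics example.

```python
class PhaseProfiler:
    """Toy phase profiler keyed by (event, phase) tuples (hypothetical API)."""

    def __init__(self, clock, default_phase="main"):
        self.clock = clock
        self.default_phase = default_phase
        self.stack = []      # measurement callstack: (name, is_phase, start, key)
        self.profiles = {}   # global map: (event, phase) -> inclusive time

    def _enclosing_phase(self):
        # Traverse the callstack from the top to find the enclosing phase.
        for name, is_phase, _, _ in reversed(self.stack):
            if is_phase:
                return name
        return self.default_phase

    def enter(self, name, is_phase=False):
        key = (name, self._enclosing_phase())
        self.profiles.setdefault(key, 0.0)   # new tuple -> new profile entry
        self.stack.append((name, is_phase, self.clock(), key))

    def exit(self, name):
        top, _, start, key = self.stack.pop()
        assert top == name, "entry/exit must nest properly"
        self.profiles[key] += self.clock() - start   # store data on exit


# Scripted clock: one timestamp per enter/exit call below.
ticks = iter([0.0, 1.0, 2.0, 5.0, 6.0, 6.0, 7.0, 8.0, 9.0, 10.0])
prof = PhaseProfiler(clock=lambda: next(ticks))
prof.enter("iterate", is_phase=True)
prof.enter("heat", is_phase=True)
prof.enter("MPI_Recv")
prof.exit("MPI_Recv")
prof.exit("heat")
prof.enter("stress", is_phase=True)
prof.enter("MPI_Recv")
prof.exit("MPI_Recv")
prof.exit("stress")
prof.exit("iterate")
```

The same event name ends up with separate entries for the heat and stress phases, without encoding the phase in the event name. A phase is itself attributed to its parent phase, because the enclosing phase is looked up before the new entry is pushed.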

16
Multi-Physics Example Instrumentation
[Figure: phases (iterate phase, heat phase, stress phase) and events (heat(), MPIrecv(), stress(), MPIsend(), other routines); annotation: only two events!]

17
Implementation
  • Parallel profiling in the TAU performance system
  • Flat profiling
  • Callpath and callgraph (2-level callpath)
    profiling
  • Phase profiling
  • Multiple performance metrics
  • Execution time
  • Hardware performance counters (using PAPI)
  • Scalable to tens of thousands of processors
  • Profile analysis and data management tools
  • ParaProf parallel profile analyzer / visualizer
  • PerfDMF parallel profile database

18
Application: NAS Parallel Benchmarks
  • Phase profiling can provide more refined profile
    results
  • Specific to phase localities
  • Defining phases is an application-specific issue
  • Apply understanding of computational models
  • Unfortunately, we were not the application
    developers
  • How to decide on phases and phase
    instrumentation?
  • Informed by application documentation and code
  • Look at NAS parallel benchmark application suite
  • Identify benchmarks with phase behavior
  • SP, BT, LU (simulated CFD codes) and CG
  • Focus on BT

19
NAS BT Phase Analysis
  • Emulates a CFD application
  • System of linear equations
  • Implicit finite-difference discretization of
    Navier-Stokes
  • Solve three sets of uncoupled systems of
    equations
  • in X, Y, Z directions
  • Block tridiagonal with 5x5 blocks
  • Square number of processors
  • Phase analysis
  • Highlight performance for each solution direction
  • Identified in code by three main functions
  • x_solve, y_solve, z_solve
  • Static phases

20
NAS BT Instrumentation
  • call TAU_PHASE_CREATE_STATIC(xsolvephase, 'x_solve phase')
  • call TAU_PHASE_START(xsolvephase)
  • call x_solve
  • call TAU_PHASE_STOP(xsolvephase)
  • call TAU_PHASE_CREATE_STATIC(ysolvephase, 'y_solve phase')
  • call TAU_PHASE_START(ysolvephase)
  • call y_solve
  • call TAU_PHASE_STOP(ysolvephase)
  • call TAU_PHASE_CREATE_STATIC(zsolvephase, 'z_solve phase')
  • call TAU_PHASE_START(zsolvephase)
  • call z_solve
  • call TAU_PHASE_STOP(zsolvephase)

21
NAS BT Flat Profile
How is MPI_Wait() distributed relative to solver
direction?
Application routine names reflect phase semantics
22
NAS BT Phase Profile (Main and X, Y, Z)
Main phase shows nested phases and immediate
events
23
Application: MFIX
  • Multiphase Flow with Interphase eXchanges (MFIX)
  • National Energy Technology Laboratory (NETL)
  • Study physical/chemistry properties in
    fluid-solid systems
  • hydrodynamics, heat transfer, chemical reactions
  • Characteristic of large-scale iterative
    simulations
  • major loop executed as simulation advances in
    time
  • Testcase
  • Models Ozone decomposition in a bubbling
    fluidized bed
  • Flat profile
  • Iterate phase profile
  • Demonstrate dynamic phases

24
MFIX Phase Instrumentation (ITERATE)
  • SUBROUTINE ITERATE(IER, NIT)
  • character(11) taucharary
  • integer tauiteration / 0 /
  • integer profiler(2) / 0, 0 /
  • save profiler, tauiteration
  • write (taucharary, '(a8,i3)') 'ITERATE ', tauiteration
  • tauiteration = tauiteration + 1
  • call TAU_PHASE_CREATE_DYNAMIC(profiler, taucharary)
  • call TAU_PHASE_START(profiler)
  • ! WORK
  • call TAU_PHASE_STOP(profiler)
  • END SUBROUTINE ITERATE
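
The subroutine builds a new phase name on every call from a fixed label and the iteration counter, using the format (a8,i3). The sketch below mimics that naming and shows how per-iteration values add up to a total. The helper dynamic_phase_name is hypothetical and the timing values are made up for illustration; they are not measurements.

```python
def dynamic_phase_name(iteration):
    """Mimic the Fortran format (a8,i3): an 8-character label, then a
    right-justified integer in a field of width 3."""
    # Note: Python widens the field for values of 1000 or more, whereas
    # Fortran would print asterisks for a number that does not fit in i3.
    return f"{'ITERATE ':<8}{iteration:3d}"


names = [dynamic_phase_name(i) for i in range(3)]

# One dynamic phase per iteration, each with its own profile entry.
# The timings below are invented for the example.
per_iteration = {}
for i, waited in enumerate([0.5, 0.25, 0.25]):
    per_iteration[dynamic_phase_name(i)] = waited

total = sum(per_iteration.values())
```

Because every iteration gets a distinct phase name, the profile can answer both questions: how much time a routine took in a particular iteration, and how much it took across all of them.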

25
MFIX Phase Profile (MPI_Waitall)
In 51st iteration, time spent in MPI_Waitall
was 85.81 secs
dynamic phases: one per iteration
Total time spent in MPI_Waitall was 4137.9 secs
across all 92 iterations
26
MFIX Iterate Phase Behavior
27
Concluding Discussion and Future Work
  • Phase-based profiling can help to bridge
    semantic gap
  • Computational models ↔ performance measurements
  • Application-specific performance analysis
  • Implemented phase profiling in TAU
  • Demonstrated phase profiling
  • NAS BT benchmark and MFIX application
  • Also used in S3D, Uintah, Flash on large-scale
    platforms
  • Requires application-specific knowledge
  • Might be possible to link to auto phase
    identification
  • Based on memory tracing or application state
    change
  • Can this idea be extended to global parallel
    phases?
  • Working on better ways to present phase
    performance

28
Support Acknowledgements
  • Department of Energy (DOE)
  • Office of Science contracts
  • University of Utah ASCI Level 1 sub-contract
  • ASC/NNSA Level 3 contract
  • Department of Defense (DoD)
  • HPC Modernization Office (HPCMO)
  • Programming Environment and Training (PET)
  • NSF
  • Research Centre Juelich
  • Los Alamos National Laboratory
  • www.cs.uoregon.edu/research/paracomp/tau