1 TAU Meets Dyninst and MRNet: A Long-term and Short-term Affair
- Allen D. Malony, Aroon Nataraj
- malony,anataraj_at_cs.uoregon.edu
- http://www.cs.uoregon.edu/research/tau
- Department of Computer and Information Science
- Performance Research Laboratory
- University of Oregon
2 Performance Research Lab
- Dr. Sameer Shende, Senior scientist
- Alan Morris, Senior software engineer
- Wyatt Spear, Software engineer
- Scott Biersdorff, Software engineer
- Li Li, Ph.D. student
- Model-based Automatic Performance Diagnosis
- Ph.D. thesis, January 2007
- Kevin Huck, Ph.D. student
- Aroon Nataraj, Ph.D. student
- Integrated kernel / application performance analysis
- Scalable performance monitoring
3 Outline
- What is TAU?
- Observation methodology
- Instrumentation, measurement, analysis tools
- Our affair with Dyninst
- Perspective
- MPI applications
- Integrated instrumentation
- Courting MRNet
- Initial results
- Future work
4 TAU Performance System
- Tuning and Analysis Utilities (14-year project effort)
- Performance system framework for HPC systems
- Integrated, scalable, flexible, and parallel
- Multiple parallel programming paradigms
- Parallel performance mapping methodology
- Portable (open source) parallel performance system
- Instrumentation, measurement, analysis, and visualization
- Portable performance profiling and tracing facility
- Performance data management and data mining
- Scalable (very large) parallel performance analysis
- Partners
- Research Center Jülich, LLNL, ANL, LANL, UTK
5 TAU Performance Observation Methodology
- Advocate event-based, direct performance observation
- Observe execution events
- Types: control flow, state-based, user-defined
- Modes: atomic, interval (enter/exit)
- Instrument program code directly (defines events)
- Modify program code at points of event occurrence
- Different code forms (source, library, object, binary, VM)
- Measurement code inserted (instantiates events)
- Make events visible
- Measures performance related to event occurrence
- Contrast with event-based sampling
6 TAU Performance System Architecture
7 TAU Performance System Architecture
8 Multi-Level Instrumentation and Mapping
- Multiple interfaces
- Information sharing
- Between interfaces
- Event selection
- Within levels
- Between levels
- Mapping
- Performance data is associated with high-level semantic abstractions
[Diagram: instrumentation at multiple levels - source code (via preprocessor and compiler), object code and libraries, executable, runtime image, and VM - with performance data produced by the instrumented run]
9 TAU Instrumentation Approach
- Support for standard program events
- Routines, classes and templates
- Statement-level blocks and loops
- Support for user-defined events (see the sketch after this list)
- Begin/End events (user-defined timers)
- Atomic events (e.g., size of memory allocated/freed)
- Selection of event statistics
- Support definition of semantic entities for mapping
- Support for event groups (aggregation, selection)
- Instrumentation selection and optimization
- Instrumentation enabling/disabling and runtime throttling
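A minimal sketch of what manual source instrumentation of these event types can look like, using TAU's C/C++ macro API (TAU_PROFILE_TIMER / TAU_PROFILE_START / TAU_PROFILE_STOP for interval events; TAU_REGISTER_EVENT / TAU_EVENT for atomic events). The routine and event names are made up for illustration, and exact macro usage may vary by TAU version.

#include <TAU.h>
#include <cstdlib>

void compute(std::size_t n) {
  // Interval (begin/end) event: a user-defined timer around this region.
  TAU_PROFILE_TIMER(t, "compute region", "", TAU_USER);
  TAU_PROFILE_START(t);
  void *buf = std::malloc(n);
  // Atomic event: record a value (here, bytes allocated) at a single point.
  TAU_REGISTER_EVENT(alloc_ev, "memory allocated (bytes)");
  TAU_EVENT(alloc_ev, static_cast<double>(n));
  /* ... work on buf ... */
  std::free(buf);
  TAU_PROFILE_STOP(t);
}

int main(int argc, char **argv) {
  TAU_PROFILE_INIT(argc, argv);  // initialize TAU measurement
  TAU_PROFILE_SET_NODE(0);       // single-process example
  compute(1 << 20);
  return 0;
}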
10 TAU Instrumentation Mechanisms
- Source code
- Manual (TAU API, TAU component API)
- Automatic (robust)
- C, C++, F77/90/95 (Program Database Toolkit (PDT))
- OpenMP (directive rewriting (Opari), POMP2 spec)
- Object code
- Pre-instrumented libraries (e.g., MPI using PMPI; see the wrapper sketch below)
- Statically-linked and dynamically-linked
- Executable code
- Dynamic instrumentation (pre-execution) (DyninstAPI)
- Virtual machine instrumentation (e.g., Java using JVMPI)
- TAU_COMPILER to automate instrumentation process
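The pre-instrumented MPI library relies on the MPI standard's PMPI name-shifted profiling interface. A hedged sketch of that interposition technique (not TAU's generated wrapper code): the wrapper times the call and forwards to the real implementation through PMPI_Send, so no application recompilation is needed.

#include <mpi.h>
#include <TAU.h>

// Wrapper intercepts MPI_Send; the MPI library's real entry point remains
// reachable under the shifted name PMPI_Send.
int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm) {
  TAU_PROFILE_TIMER(t, "MPI_Send()", "", TAU_DEFAULT);
  TAU_PROFILE_START(t);
  int rc = PMPI_Send(buf, count, type, dest, tag, comm);
  TAU_PROFILE_STOP(t);
  return rc;
}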
11 TAU Measurement Approach
- Portable and scalable parallel profiling solution
- Multiple profiling types and options
- Event selection and control (enabling/disabling, throttling)
- Online profile access and sampling
- Online performance profile overhead compensation
- Portable and scalable parallel tracing solution
- Trace translation to EPILOG, VTF3, and OTF
- Trace streams (OTF) and hierarchical trace merging
- Robust timing and hardware performance support
- Multiple counters (hardware, user-defined, system)
- Measurement specification separate from instrumentation
12 TAU Measurement Mechanisms
- Parallel profiling
- Function-level, block-level, statement-level
- Supports user-defined events and mapping events
- TAU parallel profile stored (dumped) during execution
- Support for flat, callgraph/callpath, phase profiling
- Support for memory profiling (headroom, leaks)
- Tracing
- All profile-level events
- Inter-process communication events
- Inclusion of multiple counter data in traced events
- Compile-time and runtime measurement selection
13 Performance Analysis and Visualization
- Analysis of parallel profile and trace measurement
- Parallel profile analysis
- ParaProf: parallel profile analysis and presentation
- ParaVis: parallel performance visualization package
- Profile generation from trace data (tau2pprof)
- Performance data management framework (PerfDMF)
- Parallel trace analysis
- Translation to VTF (V3.0), EPILOG, OTF formats
- Integration with VNG (Technical University of Dresden)
- Online parallel analysis and visualization
- Integration with CUBE browser (KOJAK, UTK, FZJ)
14 TAU and DyninstAPI
- TAU has had a long-term affair with Dyninst technology
- Dyninst offered a binary-level instrumentation tool
- Could help in cases when the source code is unavailable
- Could allow instrumentation without recompilation
- TAU requirements
- Instrument HPC applications with TAU measurements
- Multiple paradigms, languages, compilers, platforms
- Portability
- Tested Dyninst features as they were released
- Issues
- MPI, threading, availability, binary rewriting
- It has been an on/off, open relationship
15 Using DyninstAPI
- TAU uses DyninstAPI for binary code patching
- Pre-execution
- versus at any point during execution
- Methods
- runtime before the application begins
- binary rewriting
- tau_run (mutator)
- Loads TAU measurement library
- Uses DyninstAPI to instrument mutatee
- Can apply instrumentation selection
16 Using DyninstAPI with TAU
Configure TAU with Dyninst and build <taudir>/<arch>/bin/tau_run:
  configure -dyninst=/usr/local/dyninstAPI-5.0.1
  make clean; make install
tau_run command:
  tau_run [-o <outfile>] [-Xrun<libname>] [-f <select_inst_file>] [-v] <infile>
Instrument all events with the TAU measurement library and execute:
  tau_run klargest
Instrument all events with TAU+PAPI measurements (libTAUsh-papi.so) and execute:
  tau_run -XrunTAUsh-papi a.out
Instrument only events specified in the select.tau instrumentation specification file and execute:
  tau_run -f select.tau a.out
Binary rewriting:
  tau_run -o a.inst.out a.out
17 Runtime Instrumentation with DyninstAPI
- tau_run loads TAU's shared object in the address space
- Selects routines to be instrumented
- Calls DyninstAPI OneTimeCode
- Register a startup routine
- Pass a string of routine (event) names
- main foo bar
- IDs assigned to events
- TAU's hooks for entry/exit used for instrumentation (see the mutator sketch below)
- Invoked during execution
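A condensed sketch of the mutator pattern described above, written against the DyninstAPI BPatch interface; it is not tau_run's actual source. The routine name "foo" and the hook name "TauRoutineEntry" are placeholders for illustration.

#include <BPatch.h>
#include <BPatch_process.h>
#include <BPatch_image.h>
#include <BPatch_function.h>
#include <BPatch_point.h>
#include <BPatch_snippet.h>

int main(int argc, const char *argv[]) {
  BPatch bpatch;
  // Create the mutatee, stopped before it begins executing (pre-execution).
  BPatch_process *proc = bpatch.processCreate(argv[1], argv + 1);
  // Load the TAU measurement shared object into the mutatee's address space.
  proc->loadLibrary("libTAU.so");

  BPatch_image *image = proc->getImage();
  BPatch_Vector<BPatch_function *> targets, hooks;
  image->findFunction("foo", targets);             // selected routine (placeholder)
  image->findFunction("TauRoutineEntry", hooks);   // entry hook (placeholder name)

  // Insert a call to the hook at every entry point of the selected routine.
  BPatch_Vector<BPatch_snippet *> args;
  BPatch_funcCallExpr entryCall(*hooks[0], args);
  proc->insertSnippet(entryCall, *targets[0]->findPoint(BPatch_entry));

  proc->continueExecution();                       // run the instrumented mutatee
  while (!proc->isTerminated())
    bpatch.waitForStatusChange();
  return 0;
}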
18 Using DyninstAPI with MPI
- One mutator per mutatee
- Each mutator instruments mutatee prior to execution
- No central control
- Each mutatee writes its own performance data to disk

mpirun -np 4 ./run.sh

cat run.sh
#!/bin/sh
/usr/local/tau-2.x/x86_64/bin/tau_run <path>/a.out
19 Binary Rewriting with TAU
- Rewrite binary ("save the world") before executing
- No central control
- No need to re-instrument the code on all backend nodes
- Each mutatee writes its own performance data to disk

tau_run -o a.inst.out a.out
cd _dyninstsaved0
mpirun -np 4 ./a.inst.out
20 Example
- EK-SIMPLE benchmark
- CFD benchmark
- Andy Shaw, Kofi Fynn
- Adapted by Brad Chamberlain
- Experimentation
- Run on 4 CPUs
- Runtime instrumentation using DyninstAPI and tau_run
- Measure wallclock time and CPU time experiments
- Profiling and tracing modes of measurement
- Look at performance data with ParaProf and Vampir
21 ParaProf - Main Window (4 CPUs)
22 ParaProf - Individual Profile (n,c,t = 0,0,0)
23 ParaProf - Statistics Table (Mean)
24 ParaProf - net_recv (MPI rank 1)
25 Integrated Instrumentation (Source + Dyninst)
- Use source instrumentation for some events
- Use Dyninst for other events
- Access same TAU measurement infrastructure
- Demonstrate on matrix multiplication example
- Compare regular versus strip-mining versions (illustrated in the sketch below)
Source instrumented
Source + binary instrumented
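For reference, an illustrative sketch (assumed, not taken from the example's code) of the two variants being compared: the plain triple loop and a strip-mined version of the same multiply, with an assumed block size BS.

const int N = 512, BS = 64;

void matmul_plain(double A[N][N], double B[N][N], double C[N][N]) {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        C[i][j] += A[i][k] * B[k][j];
}

void matmul_stripmined(double A[N][N], double B[N][N], double C[N][N]) {
  // Strip-mine the k loop so each strip of B is reused across all i, j.
  for (int kk = 0; kk < N; kk += BS)
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        for (int k = kk; k < kk + BS && k < N; k++)
          C[i][j] += A[i][k] * B[k][j];
}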
26 TAU-over-MRNet (ToM) Project
- MRNet as a Transport Substrate in TAU
- (Reporting early work done in the last week.)
27 TAU Transport Substrate - Motivations
- Transport Substrate
- Enables movement of measurement-related data
- TAU, in the past, has relied on a shared file system
- Some Modes of Performance Observation
- Offline / post-mortem observation and analysis
- Fewest requirements for a specialized transport
- Online observation
- Long-running applications, especially at scale
- Dumping to the file system can be suboptimal
- Online observation with feedback into the application
- In addition, requires that the transport is bi-directional
- Performance observation problems and requirements are a function of the mode
28 Requirements
- Improve performance of transport
- NFS can be slow and variable
- Specialization and remoting of FS operations to the front-end
- Data Reduction
- At scale, the cost of moving data is too high
- Sample in a different domain (node-wise, event-wise)
- Control
- Selection of events, measurement technique, target nodes
- What data to output, how often, and in what form?
- Feedback into the measurement system, feedback into the application
- Online, distributed processing of generated performance data
- Use compute resources of transport nodes
- Global performance analyses within the topology
- Distribute statistical analyses
- Easy (mean, variance, histogram), challenging (clustering)
29 Approach and First Prototype
- Measurement and measured-data transport are separate concerns
- No such distinction in TAU
- Created an abstraction to separate and hide the transport: TauOutput
- Did not create a custom transport for TAU
- Use existing monitoring/transport capabilities
- Supermon (Sottile and Minnich, LANL)
- Piggy-backed TAU performance data on Supermon channels
- Correlate system-level metrics from Supermon with TAU application performance data
30 Rationale
- Moved away from NFS
- Separation of concerns
- Scalability, portability, robustness
- Addressed independently of TAU
- Re-use existing technologies where appropriate
- Multiple bindings
- Use different solutions best suited to a particular platform
- Implementation speed
- Easy and fast to create an adapter that binds to an existing transport
- MRNet support was added in about a week
- Says a lot about the usability of MRNet
31 ToM Architecture
- TAU Components
- Front-End (FE)
- Filters
- Back-End (BE)
- Over MRNet API (see the front-end sketch below)
- No-Backend-Instantiation mode
- Push-pull model of data retrieval
- No daemon
- Instrumented application contains TAU and the Back-End
- Two channels (streams)
- Data (BE to FE)
- Control (FE to BE)
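A hedged sketch of a ToM-style front-end written against the MRNet API (Network::CreateNetworkFE, new_Stream, send/recv). The tag value, packet format string, and control command are assumptions for illustration; API details vary across MRNet releases, and passing a NULL back-end executable is how recent releases support externally attaching back-ends, akin to the no-backend-instantiation mode above.

#include "mrnet/MRNet.h"
using namespace MRN;

int main(int argc, char **argv) {
  // Instantiate the tree from a topology file; back-ends attach later.
  Network *net = Network::CreateNetworkFE(argv[1], NULL, NULL);
  Communicator *comm = net->get_BroadcastCommunicator();

  // Control stream (FE to BE) and data stream (BE to FE).
  Stream *ctrl = net->new_Stream(comm, TFILTER_NULL, SFILTER_DONTWAIT);
  Stream *data = net->new_Stream(comm, TFILTER_NULL, SFILTER_DONTWAIT);

  int tag = 100;                    // assumed application-defined tag
  ctrl->send(tag, "%s", "dump");    // push a control command downstream
  ctrl->flush();

  PacketPtr pkt;
  while (data->recv(&tag, pkt) > 0) {                // pull profile packets upstream
    char *event_name; double excl_time;
    pkt->unpack("%s %lf", &event_name, &excl_time);  // assumed packet layout
    /* ... aggregate / display ... */
  }
  delete net;
  return 0;
}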
32 ToM Architecture
- Application calls into TAU
- Per-iteration explicit call to output routine
- Periodic calls using alarm
- TauOutput object invoked
- Configuration specific (compile time or runtime)
- One per thread
- TauOutput mimics a subset of FS-style operations (see the sketch below)
- Avoids changes to TAU code
- If required, the rest of TAU can be made aware of the output type
- Non-blocking recv for control
- Back-end pushes
- Sink pulls
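A hypothetical sketch of the kind of FS-style interface the TauOutput abstraction implies: the measurement side keeps calling open/write/close, while the backing can be a shared file system or the ToM data stream. Class and method names here are assumptions, not TAU's actual code.

#include <cstdio>
#include <string>

class TauOutput {                       // FS-style subset: open/write/close
public:
  virtual ~TauOutput() {}
  virtual bool open(const std::string &name) = 0;
  virtual void write(const char *buf, std::size_t len) = 0;
  virtual void close() = 0;
};

class FileOutput : public TauOutput {   // classic shared-file-system path
  std::FILE *fp = nullptr;
public:
  bool open(const std::string &name) override { fp = std::fopen(name.c_str(), "w"); return fp != nullptr; }
  void write(const char *buf, std::size_t len) override { std::fwrite(buf, 1, len, fp); }
  void close() override { if (fp) { std::fclose(fp); fp = nullptr; } }
};

// An MRNet-backed implementation would provide the same three calls by packing
// the buffer into packets and pushing them on the ToM data stream instead of fwrite().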
33 Simple Example (NPB LU - A, per 5 iterations): Exclusive time
34 Simple Example (NPB LU - A, per 5 iterations): Number of calls
35 Comparing ToM with NFS
- TAU-over-NFS versus TAU-over-MRNet
- 250 ssor iterations
- 251 TAU_DB_DUMP operations
- Significant advantages with a specialized transport substrate
- Similar when using Supermon as the substrate
- Remoting of expensive FS meta-data operations to the Front-End
36 Playing with Filters
- Downstream (FE to BE) multicast path
- Even without filters, it is very useful for control
- Data reduction filters are integral to the Upstream path (BE to FE)
- Without filters, loss-less data is reproduced D-1 times
- Unnecessarily large cost to the network
- Filter 1: Random Sampling Filter
- Very simplistic data reduction by node-wise sampling
- Accepts or rejects packets probabilistically (decision logic sketched below)
- TAU Front-End can control the probability P(accept)
- P(accept) = K/N (N leaves, K is a user constant)
- Bounds the number of packets per round to K
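A minimal sketch of the sampling decision described above, showing only the accept/reject logic and the P(accept) = K/N bound, not the MRNet filter entry point; names are illustrative.

#include <cstdlib>

// p_accept is pushed down from the TAU Front-End over the control stream.
static double p_accept = 1.0;

bool accept_packet() {
  double u = std::rand() / (RAND_MAX + 1.0);   // uniform in [0, 1)
  return u < p_accept;
}

void set_sampling(int K, int num_leaves) {     // e.g., invoked on a control packet
  p_accept = (num_leaves > 0 && K < num_leaves)
                 ? static_cast<double>(K) / num_leaves
                 : 1.0;                        // expected packets per round <= K
}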
37 Filter 1 in Action (Ring application)
- Compare different P(accept) values: 1, 1/4, 1/16
- Front-End unable to keep up
- Queuing delay propagated back
38 Other Filters
- Statistics filter
- Reduce raw performance data to a smaller set of statistics
- Distribute these statistical analyses from the Front-End to the filters
- Simple measures: mean, std. dev., histograms (see the merging sketch after this slide)
- More sophisticated measures: distributed clustering
- Controlling filters
- No direct way to control Upstream filters
- They are not on the control path
- Recommended solution
- Place upstream filters that work in concert with downstream filters to share control information
- Requires synchronization of state between upstream and downstream filters
- Our Echo hack
- Back-Ends transparently echo Filter-Control packets back upstream
- This is then interpreted by the filters
- Easier to implement
- Control response time may be greater
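One way the statistics filter's simple measures can be distributed is by merging per-event partial sums (count, sum, sum of squares) at each filter, so the Front-End receives one merged record per event rather than one per back-end. A hedged sketch of that standard streaming-moments technique, not TAU's actual filter code:

#include <cmath>

struct Moments {
  long   n    = 0;   // number of samples
  double sum  = 0;   // sum of values
  double sum2 = 0;   // sum of squared values
};

// Merge a child's partial moments into the running aggregate.
void merge(Moments &agg, const Moments &child) {
  agg.n    += child.n;
  agg.sum  += child.sum;
  agg.sum2 += child.sum2;
}

double mean(const Moments &m)   { return m.n ? m.sum / m.n : 0.0; }
double stddev(const Moments &m) {
  if (m.n == 0) return 0.0;
  double mu = mean(m);
  return std::sqrt(m.sum2 / m.n - mu * mu);
}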
39 Feedback / Suggestions
- Easy to integrate with MRNet
- Good examples and documentation, readable source code
- Setup phase
- Make MRNet intermediate nodes listen on a pre-specified port
- Allow arbitrary mrnet-ranks to connect and then set the IDs in the topology
- Relaxing strict a-priori ranks can make setup easier
- Setup in Job-Q environments is difficult
- Packetization API could be more flexible
- Current API is usable and simple (var-arg printf style)
- Composing a packet over a series of staggered stages is difficult
- Allow control over how buffering is performed
- Important in a push-pull model, as data injection points (rates) are independent of data retrieval
- Not a problem in a purely pull model
40 TAU-over-MRNet - Contrast with TAU-over-Supermon
- Supermon (cluster monitor) vs. MRNet (reduction network)
- Both are light-weight transport substrates
- Data format
- Supermon: ASCII s-expressions
- MRNet: packets with packed (binary?) data
- Supermon Setup
- Loose topology
- No support/help in setting up intermediate nodes
- Assumes Supermon is part of the environment
- MRNet Setup
- Strict topology
- Better support for starting intermediate nodes
- With/without Back-End instantiation (TAU uses the latter)
- Multiple Front-Ends (or sinks) possible with Supermon
- With MRNet, the front-end needs to program this functionality
- No existing pluggable filter support in Supermon
- Performing aggregation is more difficult with Supermon
- Supermon allows buffer-policy specification, MRNet does not
41 Future Work
- Dyninst
- Tighter integration of source and binary instrumentation
- Conveying source information to the binary level
- Enabling use of TAU's advanced measurement features
- Leveraging TAU's performance mapping support
- Want a robust and portable binary rewriting tool
- MRNet
- Development of more performance filters
- Evaluation of MRNet performance for different scenarios
- Testing at large scale
- Use in applications