Allen D. Malony, Aroon Nataraj - PowerPoint PPT Presentation


1
TAU Meets Dyninst and MRNet: A Long-term and Short-term Affair
  • Allen D. Malony, Aroon Nataraj
  • {malony,anataraj}@cs.uoregon.edu
  • http://www.cs.uoregon.edu/research/tau
  • Department of Computer and Information Science
  • Performance Research Laboratory
  • University of Oregon

2
Performance Research Lab
  • Dr. Sameer Shende, Senior scientist
  • Alan Morris, Senior software engineer
  • Wyatt Spear, Software engineer
  • Scott Biersdorff, Software engineer
  • Li Li, Ph.D. student
  • Model-based Automatic Performance Diagnosis
  • Ph.D. thesis, January 2007
  • Kevin Huck, Ph.D. student
  • Aroon Nataraj, Ph.D. student
  • Integrated kernel / application performance
    analysis
  • Scalable performance monitoring

3
Outline
  • What is TAU?
  • Observation methodology
  • Instrumentation, measurement, analysis tools
  • Our affair with Dyninst
  • Perspective
  • MPI applications
  • Integrated instrumentation
  • Courting MRNet
  • Initial results
  • Future work

4
TAU Performance System
  • Tuning and Analysis Utilities (14-year project
    effort)
  • Performance system framework for HPC systems
  • Integrated, scalable, flexible, and parallel
  • Multiple parallel programming paradigms
  • Parallel performance mapping methodology
  • Portable (open source) parallel performance
    system
  • Instrumentation, measurement, analysis, and
    visualization
  • Portable performance profiling and tracing
    facility
  • Performance data management and data mining
  • Scalable (very large) parallel performance
    analysis
  • Partners
  • Research Center Jülich, LLNL, ANL, LANL, UTK

5
TAU Performance Observation Methodology
  • Advocate event-based, direct performance
    observation
  • Observe execution events
  • Types: control flow, state-based, user-defined
  • Modes: atomic, interval (enter/exit)
  • Instrument program code directly (defines events)
  • Modify program code at points of event occurrence
  • Different code forms (source, library, object,
    binary, VM)
  • Measurement code inserted (instantiates events)
  • Make events visible
  • Measures performance related to event occurrence
  • Contrast with event-based sampling
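
The two event modes listed above can be sketched in a few lines of Python. This is an illustrative model only, not the TAU API: the names `Profiler`, `interval`, and `atomic` are hypothetical, and the injectable clock exists only to make the numbers reproducible.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Profiler:
    """Toy direct-measurement profiler (hypothetical, not the TAU API)."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.calls = defaultdict(int)        # interval event -> number of enters
        self.inclusive = defaultdict(float)  # interval event -> enter-to-exit time
        self.samples = defaultdict(list)     # atomic event -> recorded values

    @contextmanager
    def interval(self, name):
        """Interval event: measurement code runs at both enter and exit."""
        start = self.clock()
        self.calls[name] += 1
        try:
            yield
        finally:
            self.inclusive[name] += self.clock() - start

    def atomic(self, name, value):
        """Atomic event: one value recorded at the point of occurrence."""
        self.samples[name].append(value)


# Deterministic fake clock: successive readings are 0, 1, 3, 6.
ticks = iter([0.0, 1.0, 3.0, 6.0])
prof = Profiler(clock=lambda: next(ticks))

with prof.interval("main"):               # enter at t=0
    with prof.interval("solve"):          # enter at t=1, exit at t=3
        prof.atomic("malloc size", 4096)  # no duration, just a value
# "main" exits at t=6
```

In this model every occurrence of an instrumented event triggers measurement code, which is the contrast with sampling: a sampler would instead inspect the program state at intervals and could miss short events entirely.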

6
TAU Performance System Architecture
7
TAU Performance System Architecture
8
Multi-Level Instrumentation and Mapping
  • Multiple interfaces
  • Information sharing
  • Between interfaces
  • Event selection
  • Within levels
  • Between levels
  • Mapping
  • Performance data is associated with high-level
    semantic abstractions

[Figure: instrumentation can be applied at each stage of the code transformation path: source code, preprocessor, source code, compiler, object code and libraries, executable, runtime image, VM. Running the instrumented program produces performance data.]

9
TAU Instrumentation Approach
  • Support for standard program events
  • Routines, classes and templates
  • Statement-level blocks and loops
  • Support for user-defined events
  • Begin/End events (user-defined timers)
  • Atomic events (e.g., size of memory
    allocated/freed)
  • Selection of event statistics
  • Support definition of semantic entities for
    mapping
  • Support for event groups (aggregation, selection)
  • Instrumentation selection and optimization
  • Instrumentation enabling/disabling and runtime
    throttling
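
Runtime throttling can be pictured as a rule that stops measuring an event once it has proven to be both very frequent and very cheap. The sketch below is an assumption about how such a rule might look; the class name and both threshold values are hypothetical and are not taken from TAU.

```python
class ThrottledEvent:
    """Interval event that stops being measured once it looks cheap and frequent."""

    def __init__(self, name, max_calls=1000, min_usec_per_call=10.0):
        # Both thresholds are illustrative values, not TAU settings.
        self.name = name
        self.max_calls = max_calls
        self.min_usec_per_call = min_usec_per_call
        self.calls = 0
        self.total_usec = 0.0
        self.enabled = True

    def record(self, elapsed_usec):
        """Account for one completed call; return True if it was measured."""
        if not self.enabled:
            return False
        self.calls += 1
        self.total_usec += elapsed_usec
        too_many = self.calls > self.max_calls
        too_cheap = self.total_usec / self.calls < self.min_usec_per_call
        if too_many and too_cheap:
            self.enabled = False  # throttle: later calls skip measurement
        return True


# A tiny accessor called six times at 1 usec per call gets throttled.
tiny = ThrottledEvent("get_element", max_calls=3, min_usec_per_call=10.0)
measured = [tiny.record(1.0) for _ in range(6)]

# A heavy routine with the same call count stays enabled.
heavy = ThrottledEvent("solve", max_calls=3, min_usec_per_call=10.0)
heavy_measured = [heavy.record(500.0) for _ in range(6)]
```

The point of such a rule is that the measurement overhead of a short, hot routine can exceed the time spent in the routine itself, so dropping it improves the accuracy of everything else.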

10
TAU Instrumentation Mechanisms
  • Source code
  • Manual (TAU API, TAU component API)
  • Automatic (robust)
  • C, C++, F77/90/95 (Program Database Toolkit
    (PDT))
  • OpenMP (directive rewriting (Opari), POMP2 spec)
  • Object code
  • Pre-instrumented libraries (e.g., MPI using PMPI)
  • Statically-linked and dynamically-linked
  • Executable code
  • Dynamic instrumentation (pre-execution)
    (DyninstAPI)
  • Virtual machine instrumentation (e.g., Java using
    JVMPI)
  • TAU_COMPILER to automate instrumentation process

11
TAU Measurement Approach
  • Portable and scalable parallel profiling solution
  • Multiple profiling types and options
  • Event selection and control (enabling/disabling,
    throttling)
  • Online profile access and sampling
  • Online performance profile overhead compensation
  • Portable and scalable parallel tracing solution
  • Trace translation to EPILOG, VTF3, and OTF
  • Trace streams (OTF) and hierarchical trace
    merging
  • Robust timing and hardware performance support
  • Multiple counters (hardware, user-defined,
    system)
  • Measurement specification separate from
    instrumentation

12
TAU Measurement Mechanisms
  • Parallel profiling
  • Function-level, block-level, statement-level
  • Supports user-defined events and mapping events
  • TAU parallel profile stored (dumped) during
    execution
  • Support for flat, callgraph/callpath, phase
    profiling
  • Support for memory profiling (headroom, leaks)
  • Tracing
  • All profile-level events
  • Inter-process communication events
  • Inclusion of multiple counter data in traced
    events
  • Compile-time and runtime measurement selection

13
Performance Analysis and Visualization
  • Analysis of parallel profile and trace
    measurement
  • Parallel profile analysis
  • ParaProf parallel profile analysis and
    presentation
  • ParaVis parallel performance visualization
    package
  • Profile generation from trace data (tau2pprof)
  • Performance data management framework (PerfDMF)
  • Parallel trace analysis
  • Translation to VTF (V3.0), EPILOG, OTF formats
  • Integration with VNG (Technical University of
    Dresden)
  • Online parallel analysis and visualization
  • Integration with CUBE browser (KOJAK, UTK, FZJ)

14
TAU and DyninstAPI
  • TAU has had a long-term affair with Dyninst technology
  • Dyninst offered a binary-level instrumentation
    tool
  • Could help in cases when the source code is
    unavailable
  • Could allow instrumentation without recompilation
  • TAU requirements
  • Instrument HPC applications with TAU measurements
  • Multiple paradigms, languages, compilers,
    platforms
  • Portability
  • Tested Dyninst features as they were released
  • Issues
  • MPI, threading, availability, binary rewriting
  • It has been an on/off, open relationship

15
Using DyninstAPI
  • TAU uses DyninstAPI for binary code patching
  • Pre-execution
  • versus at any point during execution
  • Methods
  • runtime before the application begins
  • binary rewriting
  • tau_run (mutator)
  • Loads TAU measurement library
  • Uses DyninstAPI to instrument mutatee
  • Can apply instrumentation selection

16
Using DyninstAPI with TAU
  • Configure TAU with Dyninst and build
    <taudir>/<arch>/bin/tau_run
      % configure -dyninst=/usr/local/dyninstAPI-5.0.1
      % make clean
      % make install
  • tau_run command
      % tau_run [-o <outfile>] [-Xrun<libname>] [-f <select_inst_file>] [-v] <infile>
  • Instrument all events with TAU measurement library and execute
      % tau_run klargest
  • Instrument all events with TAU+PAPI measurements
    (libTAUsh-papi.so) and execute
      % tau_run -XrunTAUsh-papi a.out
  • Instrument only events specified in select.tau instrumentation
    specification file and execute
      % tau_run -f select.tau a.out
  • Binary rewriting
      % tau_run -o a.inst.out a.out
17
Runtime Instrumentation with DyninstAPI
  • tau_run loads TAU's shared object in the address
    space
  • Selects routines to be instrumented
  • Calls DyninstAPI OneTimeCode
  • Register a startup routine
  • Pass a string of routine (event) names
  • main foo bar
  • IDs assigned to events
  • TAU's hooks for entry/exit used for
    instrumentation
  • Invoked during execution
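
The sequence above (register a startup routine, pass a string of routine names, assign IDs, fire entry/exit hooks) can be mimicked in plain Python. Nothing here calls DyninstAPI or TAU: `MeasurementLibrary`, `instrument`, and the call table are hypothetical stand-ins, and the space-separated name string is an assumption about the format.

```python
class MeasurementLibrary:
    """Stand-in for a measurement library loaded into the target process."""

    def __init__(self):
        self.names = []  # event ID -> routine name
        self.log = []    # (hook, event ID) records, in invocation order

    def startup(self, routine_names):
        """One-time registration: split the name string and assign IDs."""
        self.names = routine_names.split()
        return {name: event_id for event_id, name in enumerate(self.names)}

    def entry_hook(self, event_id):
        self.log.append(("entry", event_id))

    def exit_hook(self, event_id):
        self.log.append(("exit", event_id))


def instrument(func, event_id, lib):
    """Wrap func so the hooks fire around every call (mimics inserted snippets)."""
    def wrapper():
        lib.entry_hook(event_id)
        try:
            return func()
        finally:
            lib.exit_hook(event_id)
    return wrapper


# The call table stands in for the mutatee's code, which is patched in place.
table = {}
table["bar"] = lambda: 1
table["foo"] = lambda: table["bar"]() + 1
table["main"] = lambda: table["foo"]() + 1

lib = MeasurementLibrary()
ids = lib.startup("main foo bar")  # the mutator passes the selected names
for name, event_id in ids.items():
    table[name] = instrument(table[name], event_id, lib)

result = table["main"]()  # hooks are invoked during execution
```

Passing small integer IDs to the hooks, rather than names, keeps the per-call work in the inserted code minimal; the name lookup happens once at startup.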

18
Using DyninstAPI with MPI
  • One mutator per mutatee
  • Each mutator instruments mutatee prior to
    execution
  • No central control
  • Each mutatee writes its own performance data to
    disk

% mpirun -np 4 ./run.sh
% cat run.sh
#!/bin/sh
/usr/local/tau-2.x/x86_64/bin/tau_run <path>/a.out
19
Binary Rewriting with TAU
  • Rewrite binary (Save the world) before executing
  • No central control
  • No need to re-instrument the code on all backend
    nodes
  • Each mutatee writes its own performance data to
    disk

% tau_run -o a.inst.out a.out
% cd _dyninstsaved0
% mpirun -np 4 ./a.inst.out
20
Example
  • EK-SIMPLE benchmark
  • CFD benchmark
  • Andy Shaw, Kofi Fynn
  • Adapted by Brad Chamberlain
  • Experimentation
  • Run on 4 cpus
  • Runtime instrumentation using DyninstAPI and
    tau_run
  • Measure wallclock time and CPU time experiments
  • Profiling and tracing modes of measurement
  • Look at performance data with ParaProf and Vampir

21
ParaProf - Main Window (4 cpus)
22
ParaProf - Individual Profile (n,c,t = 0,0,0)
23
ParaProf - Statistics Table (Mean)
24
ParaProf - net_recv (MPI rank 1)
25
Integrated Instrumentation (Source + Dyninst)
  • Use source instrumentation for some events
  • Use Dyninst for other events
  • Access same TAU measurement infrastructure
  • Demonstrate on matrix multiplication example
  • Compare regular versus strip-mining versions

Source instrumented
Source + binary instrumented
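
For readers unfamiliar with strip-mining, the sketch below contrasts a regular matrix multiplication with a strip-mined one. The slides do not show the benchmark code, so which loop is split and the strip length are assumptions made for illustration.

```python
def matmul(a, b):
    """Regular triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_strip_mined(a, b, strip=2):
    """Same product with the k loop split into strips of a fixed length."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for k0 in range(0, m, strip):           # loop over strips
        k_end = min(k0 + strip, m)          # last strip may be shorter
        for i in range(n):
            for j in range(p):
                for k in range(k0, k_end):  # loop within one strip
                    c[i][j] += a[i][k] * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
product = matmul(a, b)
```

Both versions compute the same result; strip-mining changes only the order in which memory is touched, which is why the two make a useful pair for comparing measured performance.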
26
TAU-over-MRNET (ToM) Project
  • MRNET as a Transport Substrate in TAU
  • (Reporting early work done in the last week.)

27
TAU Transport Substrate - Motivations
  • Transport Substrate
  • Enables movement of measurement-related data
  • TAU, in the past, has relied on shared
    file-system
  • Some Modes of Performance Observation
  • Offline / Post-mortem observation and analysis
  • least requirements for a specialized transport
  • Online observation
  • long running applications, especially at scale
  • dumping to file-system can be suboptimal
  • Online observation with feedback into application
  • in addition, requires that the transport is
    bi-directional
  • Performance observation problems and requirements
    are a function of the mode

28
Requirements
  • Improve performance of transport
  • NFS can be slow and variable
  • Specialization and remoting of FS-operations to
    front-end
  • Data Reduction
  • At scale, cost of moving data too high
  • Sample in different domain (node-wise,
    event-wise)
  • Control
  • Selection of events, measurement technique,
    target nodes
  • What data to output, how often and in what form?
  • Feedback into the measurement system, feedback
    into application
  • Online, distributed processing of generated
    performance data
  • Use compute resource of transport nodes
  • Global performance analyses within the topology
  • Distribute statistical analyses
  • easy (mean, variance, histogram), challenging
    (clustering)

29
Approach and First Prototype
  • Measurement and measured data transport are
    separate
  • No such distinction in TAU
  • Created abstraction to separate and hide
    transport
  • TauOutput
  • Did not create a custom transport for TAU
  • Use existing monitoring/transport capabilities
  • Supermon (Sottile and Minnich, LANL)
  • Piggy-backed TAU performance data on Supermon
    channels
  • Correlate system-level metrics from Supermon with
    TAU application performance data

30
Rationale
  • Moved away from NFS
  • Separation of concerns
  • Scalability, portability, robustness
  • Addressed independent of TAU
  • Re-use existing technologies where appropriate
  • Multiple bindings
  • Use different solutions best suited to particular
    platform
  • Implementation speed
  • Easy, fast to create adapter that binds to
    existing transport
  • MRNET support was added in about a week
  • Says a lot about usability of MRNET

31
ToM Architecture
  • TAU Components
  • Front-End (FE)
  • Filters
  • Back-End (BE)
  • Over MRNet API
  • No-Backend-Instantiation mode
  • Push-Pull model of data retrieval
  • No daemon
  • Instrumented application contains TAU and
    Back-End
  • Two channels (streams)
  • Data (BE to FE)
  • Control (FE to BE)

32
ToM Architecture
  • Application calls into TAU
  • Per-Iteration explicit call to output routine
  • Periodic calls using alarm
  • TauOutput object invoked
  • Configuration specific: compile or runtime
  • One per thread
  • TauOutput mimics subset of FS-style operations
  • Avoids changes to TAU code
  • If required, the rest of TAU can be made aware of
    output type
  • Non-blocking recv for control
  • Back-end pushes
  • Sink pulls

33
Simple Example (NPB LU - A, Per-5 iterations)
Exclusive time
34
Simple Example (NPB LU - A, Per-5 iterations)
Number of calls
35
Comparing ToM with NFS
  • TAUoverNFS versus TAUoverMRNET
  • 250 ssor iterations
  • 251 TAU_DB_DUMP operations
  • Significant advantages with specialized transport
    substrate
  • Similar when using Supermon as the substrate
  • Remoting of expensive FS meta-data operations to
    Front-End

36
Playing with Filters
  • Downstream (FE to BE) multicast path
  • Even without filters, is very useful for control
  • Data Reduction Filters are integral to Upstream
    path (BE to FE)
  • Without filters, loss-less data is reproduced D-1 times
  • Unnecessarily large cost to network
  • Filter 1: Random Sampling Filter
  • Very simplistic data reduction by node-wise
    sampling
  • Accepts or Rejects packets probabilistically
  • TAU Front-End can control probability P(accept)
  • P(accept) = K/N (N leaves, K is a user constant)
  • Bounds number of packets per-round to K
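
A minimal sketch of the sampling rule, assuming each packet is accepted independently with probability K/N. It does not use the MRNet filter API; the function names and the one-packet-per-leaf-per-round workload are hypothetical.

```python
import random


def sampling_filter(packets, k, n_leaves, rng):
    """Accept each packet independently with probability min(1, K/N)."""
    p_accept = min(1.0, k / n_leaves)
    return [p for p in packets if rng.random() < p_accept]


def mean_accepted(k, n_leaves, rounds, seed=0):
    """Average number of packets forwarded per round over many rounds."""
    rng = random.Random(seed)
    total = 0
    for _ in range(rounds):
        packets = list(range(n_leaves))  # one packet per leaf per round
        total += len(sampling_filter(packets, k, n_leaves, rng))
    return total / rounds


# With N = 64 leaves and K = 4, about 4 packets per round reach the front-end.
average = mean_accepted(k=4, n_leaves=64, rounds=2000)
```

Under this independent-acceptance assumption, K is the expected number of packets per round rather than a hard cap: an individual round can forward more or fewer than K.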

37
Filter 1 in Action (Ring application)
  • Compare different P(accept) values
  • 1, 1/4, 1/16
  • Front-End unable to keep up
  • Queuing delay propagated back

38
Other Filters
  • Statistics filter
  • Reduce raw performance data to smaller set of
    statistics
  • Distribute these statistical analyses from
    Front-End to the filters
  • Simple measures - mean, std.dev, histograms
  • More sophisticated measures - distributed
    clustering
  • Controlling filters
  • No direct way to control Upstream-filters
  • not on control path
  • Recommended solution
  • place upstream filters that work in concert with
    downstream filters to share control information
  • requires synchronization of state between
    upstream and downstream filters
  • Our Echo hack
  • Back-Ends transparently echo Filter-Control
    packets back upstream
  • this is then interpreted by the filters
  • easier to implement
  • control response time may be greater

39
Feedback / Suggestions
  • Easy to integrate with MRNET
  • Good examples, documentation, readable source code
  • Setup phase
  • Make MRNET intermediate nodes listen on
    pre-specified port
  • Allow arbitrary mrnet-ranks to connect and then
    set the Ids in the topology
  • Relaxing strict a priori ranks can make setup
    easier
  • Setup in Job-Q environments difficult
  • Packetization API can be more flexible
  • Current API is usable and simple (var-arg printf
    style)
  • Composing a packet over a series of staggered
    stages difficult
  • Allow control over how buffering is performed
  • Important in a push-pull model as data injection
    points (rates) independent of data retrieval
  • Is not a problem in a purely pull model

40
TAUoverMRNET - Contrast TAUoverSupermon
  • Supermon (cluster-monitor) vs. MRNet
    (reduction-network)
  • Both light-weight transport substrates
  • Data format
  • Supermon: ASCII s-expressions
  • MRNET: packets with packed (binary?) data
  • Supermon Setup
  • Loose topology
  • No support/help in setting up intermediate nodes
  • Assume Supermon is part of the environment
  • MRNET Setup
  • Strict topology
  • Better support for starting intermediate nodes
  • With/Without Back-End instantiation (TAU uses
    latter)
  • Multiple Front-Ends (or sinks) possible with
    Supermon
  • With MRNET, the front-end needs to program this
    functionality
  • No existing pluggable filter support in
    Supermon
  • Performing aggregation is more difficult with
    Supermon.
  • Supermon allows buffer-policy specification,
    MRNET does not

41
Future Work
  • Dyninst
  • Tighter integration of source and binary
    instrumentation
  • Conveying of source information to binary level
  • Enabling use of TAU's advanced measurement
    features
  • Leveraging TAU's performance mapping support
  • Want robust and portable binary rewriting tool
  • MRNet
  • Development of more performance filters
  • Evaluation of MRNet performance for different
    scenarios
  • Testing at large scale
  • Use in applications