Title: Performance Instrumentation and Measurement for Terascale Systems


1
Performance Instrumentation and Measurement for
Terascale Systems
  • Jack Dongarra, Shirley Moore, Philip Mucci
  • University of Tennessee
  • Sameer Shende and Allen Malony
  • University of Oregon

2
Requirements for Terascale Systems
  • Performance framework must support a wide range
    of
  • Performance problems (e.g., single-node
    performance, synchronization and communication
    overhead, load balancing)
  • Performance evaluation methods (e.g.,
    parameter-based modeling, bottleneck detection
    and diagnosis)
  • Programming environments (e.g., multiprocess
    and/or multithreaded, parallel and distributed,
    large-scale)
  • Need for flexible and extensible performance
    observation framework

3
Research Problems
  • Appropriate level and location for implementing
    instrumentation and measurement
  • How to make the framework modular and extensible
  • Appropriate compromise between level of
    detail/accuracy and instrumentation cost

4
Instrumentation Strategies
  • Source code instrumentation
  • Manual or using preprocessor
  • Library level instrumentation
  • e.g., MPI and OpenMP profiling interfaces
  • Binary rewriting
  • E.g., Pixie, ATOM, EEL, PAT
  • Dynamic instrumentation
  • DyninstAPI
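
Library-level instrumentation, as offered by the MPI profiling interface, relies on interposition: a wrapper takes the place of the library routine, records a measurement, and forwards the call to the real implementation. The following is a minimal Python sketch of that idea only. The `comm` namespace, its `send` routine, and the `instrument` helper are hypothetical names for illustration and are not part of MPI, OpenMP, or TAU.

```python
import functools
import time
import types

# Hypothetical stand-in for a communication library (not a real MPI binding).
comm = types.SimpleNamespace()
comm.send = lambda dest, payload: len(payload)

call_stats = {}  # routine name -> [call count, total seconds]

def instrument(namespace, name):
    """Replace namespace.name with a wrapper that counts and times each call."""
    original = getattr(namespace, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)   # forward to the real routine
        finally:
            entry = call_stats.setdefault(name, [0, 0.0])
            entry[0] += 1
            entry[1] += time.perf_counter() - start

    setattr(namespace, name, wrapper)

instrument(comm, "send")
comm.send(1, b"hello")
comm.send(2, b"terascale")
print(call_stats["send"][0])  # number of intercepted calls
```

The application code that calls `comm.send` is unchanged, which is the main attraction of this strategy compared with source code instrumentation.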

5
Types of Measurements
  • Profiling
  • Tracing
  • Real-time Analysis

6
Profiling
  • Recording of summary information during execution
  • inclusive and exclusive time, number of calls,
    hardware statistics, ...
  • Reflects performance behavior of program entities
  • functions, loops, basic blocks
  • user-defined semantic entities
  • Very good for low-cost performance assessment
  • Helps to expose performance bottlenecks and
    hotspots
  • Implemented through
  • sampling: periodic OS interrupts or hardware
    counter traps
  • instrumentation: direct insertion of measurement
    code
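
The difference between inclusive and exclusive time can be made concrete with a small instrumentation-based profiler. This is a sketch under stated assumptions, not how TAU implements profiling: the `Profiler` class and its injectable `clock` are hypothetical, and recursion is not handled specially.

```python
import functools

class Profiler:
    """Instrumentation-based profiler keeping summary data per function."""

    def __init__(self, clock):
        self.clock = clock          # callable returning the current time
        self.stats = {}             # name -> calls, inclusive, exclusive
        self._child_time = [0.0]    # time spent in callees, one slot per frame

    def profile(self, func):
        name = func.__name__

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            self._child_time.append(0.0)
            start = self.clock()
            try:
                return func(*args, **kwargs)
            finally:
                inclusive = self.clock() - start
                children = self._child_time.pop()
                self._child_time[-1] += inclusive   # charge the caller's frame
                s = self.stats.setdefault(
                    name, {"calls": 0, "inclusive": 0.0, "exclusive": 0.0})
                s["calls"] += 1
                s["inclusive"] += inclusive
                s["exclusive"] += inclusive - children
        return wrapper

# A deterministic fake clock (one tick per reading) keeps the example repeatable.
ticks = iter(range(100))
prof = Profiler(clock=lambda: next(ticks))

@prof.profile
def inner():
    pass

@prof.profile
def outer():
    inner()
    inner()

outer()
for name, summary in prof.stats.items():
    print(name, summary)
```

Exclusive time for `outer` is its inclusive time minus the time charged to the two `inner` calls, so only summary values are stored and no per-event record is kept.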

7
Tracing
  • Recording of information about significant points
    (events) during program execution
  • entering/exiting code region (function, loop,
    block, ...)
  • thread/process interactions (e.g., send/receive
    message)
  • Save information in event record
  • timestamp
  • CPU identifier, thread identifier
  • Event type and event-specific information
  • Event trace is a time-sequenced stream of event
    records
  • Can be used to reconstruct dynamic program
    behavior
  • Typically requires code instrumentation
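
The event record and the reconstruction of dynamic behavior described above can be sketched as follows. The `EventRecord` layout, the `record` helper, and the `reconstruct` function are hypothetical illustrations, not a real trace format. The process ID stands in for the CPU identifier, which Python does not expose portably.

```python
import os
import threading
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class EventRecord:
    timestamp: float
    process_id: int      # stand-in for the CPU identifier
    thread_id: int
    event_type: str      # e.g. "enter", "exit", "send", "recv"
    info: str            # event-specific information

trace = []  # the event trace: records appended in time order

def record(event_type, info, clock=time.perf_counter):
    trace.append(EventRecord(clock(), os.getpid(),
                             threading.get_ident(), event_type, info))

def reconstruct(events):
    """Rebuild the nesting of code regions from enter/exit events."""
    depth, lines = 0, []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.event_type == "enter":
            lines.append("  " * depth + ev.info)
            depth += 1
        elif ev.event_type == "exit":
            depth -= 1
    return lines

record("enter", "main")
record("enter", "solve")
record("exit", "solve")
record("enter", "report")
record("exit", "report")
record("exit", "main")
print("\n".join(reconstruct(trace)))
```

Unlike a profile, the trace grows with every event, which is why tracing costs more in storage and perturbation than summary profiling.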

8
Real-time Analysis
  • Allows evaluation of program performance during
    execution
  • Examples
  • Paradyn
  • Autopilot
  • Perfometer

9
TAU Performance System Architecture
[Figure: architecture diagram; recoverable labels: Paraver, EPILOG]
10
TAU Instrumentation
  • Manually using TAU instrumentation API
  • Automatically using
  • Program Database Toolkit (PDT)
  • MPI profiling library
  • Opari OpenMP rewriting tool
  • Uses PAPI to access hardware counter data

11
Program Database Toolkit (PDT)
  • Program code analysis framework for developing
    source-based tools
  • High-level interface to source code information
  • Integrated toolkit for source code parsing,
    database creation, and database query
  • commercial grade front end parsers
  • portable IL analyzer, database format, and access
    API
  • open software approach for tool development
  • Targets and integrates multiple source languages
  • Used in TAU to build automated performance
    instrumentation tools

12
PDT Components
  • Language front end
  • Edison Design Group (EDG): C, C++
  • Mutek Solutions Ltd.: F77, F90
  • creates an intermediate-language (IL) tree
  • IL Analyzer
  • processes the intermediate language (IL) tree
  • creates program database (PDB) formatted file
  • DUCTAPE (Bernd Mohr, ZAM, Germany)
  • C++ program Database Utilities and Conversion
    Tools APplication Environment
  • processes and merges PDB files
  • C++ library to access the PDB for PDT applications

13
OPARI Basic Usage (f90)
  • Reset OPARI state information
  • rm -f opari.rc
  • Call OPARI for each input source file
  • opari file1.f90 ... opari fileN.f90
  • Generate OPARI runtime table, compile it with
    ANSI C
  • opari -table opari.tab.c
  • cc -c opari.tab.c
  • Compile modified files *.mod.f90 using OpenMP
  • Link the resulting object files, the OPARI
    runtime table opari.tab.o and the TAU POMP RTL
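
The order of the steps above can be sketched as a small helper that lists the commands for a set of source files. This is an illustration only. `opari_build_commands` is a hypothetical helper, and the Fortran compile command, including its OpenMP flag, is supplied by the caller because it is compiler-specific. The final link step is left out because the location of the TAU POMP runtime library depends on the installation.

```python
def opari_build_commands(sources, fortran_compile):
    """Return the shell commands for the OPARI steps, in order.

    fortran_compile is the caller's Fortran 90 compile command including
    its OpenMP flag (compiler-specific, so it is not guessed here).
    """
    commands = ["rm -f opari.rc"]                       # reset OPARI state
    commands += [f"opari {src}" for src in sources]     # instrument each file
    commands += ["opari -table opari.tab.c",            # generate runtime table
                 "cc -c opari.tab.c"]                   # compile it as ANSI C
    for src in sources:
        stem = src[:-len(".f90")]
        commands.append(f"{fortran_compile} -c {stem}.mod.f90")
    return commands

# "f90 <openmp-flag>" is a placeholder, not a real compiler invocation.
for cmd in opari_build_commands(["file1.f90", "file2.f90"], "f90 <openmp-flag>"):
    print(cmd)
```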

14
TAU Analysis
  • Profile analysis
  • pprof
  • parallel profiler with text-based display
  • Racy / jRacy
  • graphical interface to pprof (Tcl/Tk)
  • jRacy is a Java implementation of Racy
  • ParaProf
  • Next-generation parallel profile analysis and
    display
  • Trace analysis and visualization
  • Trace merging and clock adjustment (if necessary)
  • Trace format conversion (ALOG, SDDF, Vampir)
  • Vampir (Pallas) trace visualization
  • Paraver (CEPBA) trace visualization
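
Trace merging and clock adjustment can be illustrated with a short sketch. It assumes each process has a constant clock offset relative to a reference clock, which is a simplification because real clocks also drift. `merge_traces` and the data layout are hypothetical and do not describe TAU's actual merge tool.

```python
import heapq

def merge_traces(traces, clock_offsets):
    """Merge per-process traces into one globally time-ordered trace.

    traces:        {process: [(local_timestamp, event), ...]}, sorted per process
    clock_offsets: {process: seconds to add to reach the reference clock}
    """
    def adjusted(process):
        offset = clock_offsets.get(process, 0.0)
        for timestamp, event in traces[process]:
            yield (timestamp + offset, process, event)

    return list(heapq.merge(*(adjusted(p) for p in traces)))

traces = {
    0: [(1.0, "send"), (4.0, "exit")],
    1: [(0.5, "recv"), (2.0, "exit")],   # this clock runs 1 s behind
}
merged = merge_traces(traces, clock_offsets={1: 1.0})
for timestamp, process, event in merged:
    print(timestamp, process, event)
```

Without the adjustment, the receive on process 1 would appear to happen before the matching send on process 0, which is the kind of inconsistency clock adjustment is meant to remove.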

15
TAU Pprof Display
16
jRacy (NAS Parallel Benchmark LU)
[Figure: jRacy displays showing global profiles (routine profile across all nodes) and an individual profile; legend: n = node, c = context, t = thread]
17
ParaProf Scalable Profiler
  • Re-implementation of jRacy tool
  • Target flexibility in profile input source
  • Profile files, performance database, online
  • Target scalability in profile size and display
  • Will include three-dimensional display support
  • Provide more robust analysis and extension
  • Derived performance statistics

18
ParaProf Architecture
19
512-Processor Profile (SAMRAI)
20
Three-dimensional Profile Displays
500-processor Uintah execution (University of
Utah)
21
Overview of PAPI
  • Performance Application Programming Interface
  • The purpose of the PAPI project is to design,
    standardize and implement a portable and
    efficient API to access the hardware performance
    monitor counters found on most modern
    microprocessors.
  • Parallel Tools Consortium project
  • Reference implementations for all major HPC
    platforms
  • Installed and in use at major government labs,
    academic sites
  • Becoming de facto industry standard
  • Incorporated into many performance analysis tools,
    e.g., HPCView, SvPablo, TAU, Vampir, Vprof

22
PAPI Counter Interfaces
  • PAPI provides three interfaces to the underlying
    counter hardware
  • The low level interface provides functions for
    setting options, accessing native events,
    callback on counter overflow, etc.
  • The high level interface simply provides the
    ability to start, stop and read the counters for
    a specified list of events.
  • Graphical tools to visualize information.

23
PAPI Implementation
24
PAPI Preset Events
  • Proposed standard set of events deemed most
    relevant for application performance tuning
  • Defined in papiStdEventDefs.h
  • Mapped to native events on a given platform
  • Run tests/avail to see list of PAPI preset events
    available on a platform
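
The mapping from preset events to native events can be pictured as a per-platform lookup table. In the sketch below the preset names are real PAPI presets, but the platform names and native event names are made up for illustration, and `available_presets` and `to_native` are hypothetical helpers rather than PAPI functions.

```python
# Preset names are real PAPI presets; the platforms and native event
# names are made up for illustration.
NATIVE_EVENTS = {
    "platform_a": {"PAPI_TOT_CYC": "CYCLES",
                   "PAPI_TOT_INS": "INSTR_RETIRED",
                   "PAPI_L1_DCM": "L1D_MISS"},
    "platform_b": {"PAPI_TOT_CYC": "CLK",
                   "PAPI_FP_INS": "FP_OPS"},
}

def available_presets(platform):
    """List the presets that have a native mapping on this platform."""
    return sorted(NATIVE_EVENTS[platform])

def to_native(platform, preset):
    """Translate a portable preset name into the platform's native event."""
    try:
        return NATIVE_EVENTS[platform][preset]
    except KeyError:
        raise LookupError(f"{preset} is not available on {platform}") from None

print(available_presets("platform_b"))
print(to_native("platform_a", "PAPI_TOT_CYC"))
```

A tool written against the preset names stays portable, but it still has to check availability, since not every preset maps to a native event on every platform.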

25
Scalability of PAPI Instrumentation
  • Overhead of library calls to read counters can be
    excessive.
  • Statistical sampling can reduce overhead.
  • PAPI substrate for Alpha Tru64 UNIX
  • Built on top of DADD/DCPI (Dynamic Access to DCPI
    Data/Digital Continuous Profiling Infrastructure)
  • Sampling approach supported in hardware
  • 1-2% overhead compared to 30% on other platforms
  • Using sampling and hardware profiling support on
    Itanium/Itanium2
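
A rough cost model shows why sampling scales better than reading counters on every call. All numbers below are assumed for illustration and are not measurements from the presentation. They are chosen only to show how per-call reads can reach tens of percent while fixed-rate sampling stays at a few percent.

```python
def relative_overhead(events, cost_us_per_event, runtime_us):
    """Fraction of the uninstrumented runtime spent on measurement."""
    return events * cost_us_per_event / runtime_us

RUNTIME_US = 10_000_000            # assumed 10 s uninstrumented run

# Direct instrumentation: one counter read per call of a hot routine
# (assumed one million calls at 3 microseconds per read).
direct = relative_overhead(events=1_000_000, cost_us_per_event=3,
                           runtime_us=RUNTIME_US)

# Statistical sampling: assumed 100 samples per second for 10 s at
# 150 microseconds per sample, independent of how often code is called.
sampled = relative_overhead(events=100 * 10, cost_us_per_event=150,
                            runtime_us=RUNTIME_US)

print(f"direct reads: {direct:.1%}, sampling: {sampled:.1%}")
```

The direct cost grows with the call count of the instrumented code, while the sampling cost depends only on the sampling rate and the run length.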

26
Vampir v3.x Hardware Counter Data
  • Counter Timeline Display

27
What is DynaProf?
  • A portable tool to instrument a running
    executable with Probes that monitor application
    performance.
  • Simple command line interface.
  • Open Source Software
  • A work in progress

No source code required
28
DynaProf Methodology
  • Make collection of run-time performance data easy
    by
  • Avoiding instrumentation and recompilation
  • Using the same tool with different probes
  • Providing useful and meaningful probe data
  • Providing different kinds of probes
  • Allowing custom probes

No source code required!
29
Why the Dyna?
  • Instrumentation is selectively inserted directly
    into the program's address space.
  • Why is this a better way?
  • No perturbation of compiler optimizations
  • Complete language independence
  • Multiple Insert/Remove instrumentation cycles

30
DynaProf Design
  • GUI, command line, and script-driven user interface
  • Uses GNU readline for command line editing and
    command completion.
  • Instrumentation is done using
  • Dyninst on Linux, Solaris and IRIX
  • DPCL on AIX

31
DynaProf Commands
  • load <executable>
  • list module pattern
  • use <probe> probe args
  • instr module <module> probe args
  • instr function <module> <function> probe args
  • stop
  • continue
  • run args
  • info
  • unload

32
DynaProf Probe Design
  • Probes provided with distribution
  • Wallclock probe
  • PAPI probe
  • Perfometer probe
  • Can be written in any compiled language
  • Probes export 3 functions with a standardized
    interface.
  • Easy to roll your own (<1 day)
  • Supports separate probes for MPI/OpenMP/Pthreads
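
The idea of a probe that exports three functions can be sketched as follows. The slides do not name the three functions, so `init`, `entry`, and `exit` are assumptions, as are the `WallclockProbe` class and the `attach` helper that stands in for the tool inserting probe calls. Real probes are written in a compiled language. Python is used here only to show the shape of the interface.

```python
import time

class WallclockProbe:
    """Hypothetical probe with an init, an entry, and an exit function."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.elapsed = {}      # function name -> accumulated time
        self._starts = {}      # function name -> stack of entry times

    def init(self, names):
        for name in names:
            self.elapsed[name] = 0.0
            self._starts[name] = []

    def entry(self, name):
        self._starts[name].append(self.clock())

    def exit(self, name):
        self.elapsed[name] += self.clock() - self._starts[name].pop()

def attach(probe, func):
    """Stand-in for the tool inserting probe calls around func."""
    name = func.__name__

    def instrumented(*args, **kwargs):
        probe.entry(name)
        try:
            return func(*args, **kwargs)
        finally:
            probe.exit(name)
    return instrumented

# A fake clock returning fixed readings keeps the example deterministic.
readings = iter([10.0, 12.5])
probe = WallclockProbe(clock=lambda: next(readings))
probe.init(["work"])

def work():
    return 42

work = attach(probe, work)
result = work()
print(result, probe.elapsed["work"])
```

Because the tool only ever calls the three entry points, a different probe (for example, one reading hardware counters) could be swapped in without changing how instrumentation is inserted.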

33
Future development
  • GUI development
  • Additional probes
  • Perfex probe
  • Vprof probe
  • TAU probe
  • Better support for parallel applications

34
Perfometer
  • Application is instrumented with PAPI
  • call perfometer()
  • call mark_perfometer(int color, char *label)
  • Application is started. At the call to
    perfometer, signal handler and a timer are set up
    to collect and send the information to a Java
    applet containing the graphical view.
  • Sections of code that are of interest can be
    designated with specific colors
  • Real-time display or trace file

35
Perfometer Display
36
Perfometer Parallel Interface
37
Conclusions
  • TAU and PAPI projects are addressing important
    research problems involved in constructing a
    flexible and extensible performance observation
    framework.
  • Widespread adoption of PAPI demonstrates the
    value of a portable interface to low-level
    architecture-specific performance monitoring
    hardware.
  • TAU framework provides flexible mechanisms for
    instrumentation and measurement.

38
Conclusions (cont.)
  • Terascale systems require scalable low-overhead
    means of collecting performance data.
  • Statistical sampling support in PAPI
  • TAU filtering and feedback schemes for focusing
    instrumentation
  • Real-time monitoring capabilities (DynaProf,
    Perfometer)
  • PAPI and TAU infrastructure is designed for
    interoperability, flexibility, and extensibility.

39
More Information
  • http://icl.cs.utk.edu/papi/
  • Software, documentation, mailing lists
  • TAU (http://www.acl.lanl.gov/tau)
  • PDT (http://www.acl.lanl.gov/pdtoolkit)
  • PAPI (http://icl.cs.utk.edu/projects/papi/)
  • OPARI (http://www.fz-juelich.de/zam