Dynaprof and PAPI: A Tool for Dynamic Runtime Instrumentation and Performance Analysis

Provided by: icl88
Learn more at: https://icl.utk.edu

1
Dynaprof and PAPI: A Tool for Dynamic Runtime
Instrumentation and Performance Analysis
  • Philip Mucci, Research Consultant
  • Innovative Computing Laboratory/LBNL
  • mucci_at_cs.utk.edu
  • http://icl.cs.utk.edu/projects/papi
  • http://www.cs.utk.edu/mucci/dynaprof

2
The ICL PAPI Team
  • Jack Dongarra
  • Kevin London
  • Shirley Moore
  • Philip Mucci
  • Keith Seymour
  • Dan Terpstra
  • Haihang You
  • Min Zhou
  • And a few of you spread throughout the globe

3
The Library Interface
  • PAPI provides two APIs to access the underlying
    counter hardware
  • The low level interface manages hardware events
    in user defined groups called EventSets.
  • The high level interface simply provides the
    ability to start, stop and read the counters for
    a specified list of events.

4
PAPI Implementation
5
Preset Events
  • Proposed standard set of event names deemed most
    relevant for application performance tuning
  • No standardization of the exact definition
  • Mapped to native events on a given platform

6
Preset Events 2
  • PAPI supports 92 preset events as well as native events.
  • Preset events are mappings from symbolic names to
    machine-specific definitions for a particular
    hardware resource.
  • Example: Total Cycles is PAPI_TOT_CYC
  • PAPI also supports presets that may be derived
    from the underlying hardware metrics
  • Example: Floating Point Instructions per Second
    is PAPI_FLOPS
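
The mapping from preset names to native events can be pictured as a per-platform lookup table. The following is a minimal sketch, not PAPI's actual tables: the names `PRESET_TABLE`, `resolve_preset` and `is_derived` are hypothetical, the Pentium II codes 0x79 and 0xc0 are the native codes printed in the PAPI probe output later in this talk, and the `hypothetical_cpu` entries are invented purely for illustration.

```python
# Hypothetical sketch of preset-to-native mapping; not PAPI's real tables.
PRESET_TABLE = {
    "pentium2": {
        # Native codes taken from the probe output shown later in this talk.
        "PAPI_TOT_CYC": [0x79],
        "PAPI_TOT_INS": [0xC0],
    },
    "hypothetical_cpu": {
        # Made-up codes, for illustration only.
        "PAPI_TOT_CYC": [0x01],
        # A derived preset: built from two native events.
        "PAPI_L1_TCM": [0x10, 0x11],
    },
}


def resolve_preset(platform, preset):
    """Return the native codes behind a preset, or None if it is unavailable."""
    return PRESET_TABLE.get(platform, {}).get(preset)


def is_derived(platform, preset):
    """Treat a preset as derived when it needs more than one native event."""
    codes = resolve_preset(platform, preset)
    return codes is not None and len(codes) > 1
```

A preset that has no entry for a platform simply resolves to `None`, which mirrors the "Avail: No" column in the preset listing.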

7
Native Events
  • An event countable by the CPU can be counted even
    if there is no matching preset PAPI event
  • Same interface as when setting up a preset event,
    but a CPU-specific bit pattern is used instead of
    the PAPI event definition

8
Sample Preset Listing
  > tests/avail
  Test case 8: Available events and hardware information.
  -------------------------------------------------------------------------
  Vendor string and code   : GenuineIntel (-1)
  Model string and code    : Celeron (Mendocino) (6)
  CPU revision             : 10.000000
  CPU Megahertz            : 366.504944
  -------------------------------------------------------------------------
  Name         Code        Avail  Deriv  Description (Note)
  PAPI_L1_DCM  0x80000000  Yes    No     Level 1 data cache misses
  PAPI_L1_ICM  0x80000001  Yes    No     Level 1 instruction cache misses
  PAPI_L2_DCM  0x80000002  No     No     Level 2 data cache misses
  PAPI_L2_ICM  0x80000003  No     No     Level 2 instruction cache misses
  PAPI_L3_DCM  0x80000004  No     No     Level 3 data cache misses
  PAPI_L3_ICM  0x80000005  No     No     Level 3 instruction cache misses
  PAPI_L1_TCM  0x80000006  Yes    Yes    Level 1 cache misses
  PAPI_L2_TCM  0x80000007  Yes    No     Level 2 cache misses
  PAPI_L3_TCM  0x80000008  No     No     Level 3 cache misses
  PAPI_CA_SNP  0x80000009  No     No     Requests for a snoop
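
The codes in this listing all start at 0x80000000 and count upward, which suggests that the high bit marks a preset and the low bits are an index. A small sketch under that assumption (the helper names `is_preset` and `preset_index` are hypothetical, not PAPI functions):

```python
# Assumption drawn from the listing: presets carry the high bit 0x80000000.
PRESET_MASK = 0x80000000


def is_preset(code):
    """True when the event code has the preset bit set."""
    return bool(code & PRESET_MASK)


def preset_index(code):
    """Position of a preset in the listing (0 for PAPI_L1_DCM, and so on)."""
    return code & 0x7FFFFFFF
```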

9
High-level Interface
  • Meant for application programmers wanting
    coarse-grained measurements
  • Not thread safe
  • Calls the lower level API
  • Allows only PAPI preset events
  • Easier to use and less setup (additional code)
    than low-level

10
High-level API Calls
  • PAPI_num_counters()
    Returns the number of available counters
  • PAPI_start_counters(int *cntrs, int alen)
    Starts the counters
  • PAPI_stop_counters(long_long *vals, int alen)
    Stops the counters and puts the counter values in the array
  • PAPI_accum_counters(long_long *vals, int alen)
    Accumulates the counters into the array and resets them
  • PAPI_read_counters(long_long *vals, int alen)
    Copies the counter values into the array and resets the counters
  • PAPI_flops(float *rtime, float *ptime,
    long_long *flpins, float *mflops)
    Wallclock time, process time, FP instructions since start,
    Mflop/s since last call
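
The difference between read, accumulate and stop is easy to miss in a list of signatures. The toy model below only imitates the semantics described on this slide. It is not the PAPI library, and `ToyCounters` and its `tick` method are hypothetical stand-ins for real hardware activity.

```python
class ToyCounters:
    """Toy model of start/read/accum/stop semantics; not the PAPI library."""

    def __init__(self, num_events):
        self._live = [0] * num_events
        self._started = False

    def start(self):
        """Zero the live counters and begin counting."""
        self._live = [0] * len(self._live)
        self._started = True

    def tick(self, increments):
        """Simulate hardware activity by adding to the live counters."""
        if self._started:
            self._live = [c + i for c, i in zip(self._live, increments)]

    def read(self, vals):
        """Copy the counter values into vals, then reset the counters."""
        vals[:] = self._live
        self._live = [0] * len(self._live)

    def accum(self, vals):
        """Add the counter values into vals, then reset the counters."""
        vals[:] = [v + c for v, c in zip(vals, self._live)]
        self._live = [0] * len(self._live)

    def stop(self, vals):
        """Copy the final counter values into vals and stop counting."""
        vals[:] = self._live
        self._started = False
```

With this model, a read followed by more work and an accumulate leaves the array holding the sum of both intervals, while a stop overwrites the array with the last interval only.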

11
Low-level Interface
  • Increased efficiency and functionality over the
    high level PAPI interface
  • Approximately 56 functions
    (http://icl.cs.utk.edu/projects/papi/files/html_man/papi.html#4)
  • Thread-safe (SMP, OpenMP, Pthreads)
  • Supports both presets and native events

12
Low-level Functionality
  • API Calls for
  • Counter multiplexing
  • Callbacks on counter overflow
  • SVR4 compatible profiling
  • Hardware information
  • Software information
  • Highly accurate and low latency timing functions
  • Hardware event inquiry functions
  • Eventset management functions
  • Simple locking operations

13
The Cost of Calling PAPI
  • PAPI includes an example program, cost, to measure
    latencies
  • Reading hardware counters is relatively cheap
  • Setup is a bit more expensive as it sometimes
    requires a system call

Total User Kernel Cycles        Linux/x86   Linux/IA64   IBM POWER3
PAPI start/stop (cycles/pair)   3524        22115        14199
PAPI read (cycles/call)         1299        6526         3126
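
To put these cycle counts in perspective, they can be converted to time and to a relative overhead. This is a sketch only: the clock rate passed to `cycles_to_microseconds` is an assumption supplied by the reader (the table does not state one), and both helper names are hypothetical.

```python
# Cycle costs copied from the table above.
COSTS = {
    "Linux/x86": {"start_stop": 3524, "read": 1299},
    "Linux/IA64": {"start_stop": 22115, "read": 6526},
    "IBM POWER3": {"start_stop": 14199, "read": 3126},
}


def cycles_to_microseconds(cycles, clock_mhz):
    """At clock_mhz million cycles per second, one microsecond is clock_mhz cycles."""
    return cycles / clock_mhz


def overhead_fraction(call_cycles, region_cycles):
    """Fraction of a measurement spent inside the PAPI call itself."""
    return call_cycles / (call_cycles + region_cycles)
```

For example, on an assumed 1000 MHz x86 machine a start/stop pair would take about 3.5 microseconds, and a read wrapped around a region 99 times longer than the read itself adds 1% overhead.
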
14
PAPI and Threads
  • A challenge: how to make one version of a library
    that works with any thread model?
  • After initializing the library, the user needs to
    enable thread detection
  • Each thread responsible for creation, start, stop
    and read of its own counters

15
PAPI and Multiplexing
  • Multiplexing allows simultaneous use of more
    counters than are supported by the hardware.
  • This is accomplished through timesharing the
    counter hardware and extrapolating the results.
  • Users can enable multiplexing with one API call
    and then use PAPI normally.
  • Implementation was based on MPX done by John May
    at LLNL.

16
PAPI and Multiplexing 2
  • Most platforms do not support multiplexing at the
    kernel level.
  • PAPI implements multiplexing in software at the
    user level.
  • The more events you multiplex, the larger the
    sampling error in the result.
  • Too short a measurement interval can result
    in counts of 0.
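
The timesharing and extrapolation idea can be sketched in a few lines. This is a simplified round-robin model, not PAPI's or MPX's implementation: the function name and the per-slice input format are assumptions made for illustration.

```python
def multiplex_estimates(true_counts, slices, counters_per_slice=1):
    """Round-robin events over time slices and extrapolate totals.

    true_counts[e][t] is how many events of type e really occur in slice t.
    Assumes counters_per_slice <= number of events.
    Returns the extrapolated total for each event.
    """
    num_events = len(true_counts)
    observed = [0] * num_events
    scheduled = [0] * num_events
    for t in range(slices):
        for k in range(counters_per_slice):
            e = (t * counters_per_slice + k) % num_events
            observed[e] += true_counts[e][t]
            scheduled[e] += 1
    # Scale each observed count by (total slices / slices actually counted).
    return [
        observed[e] * slices / scheduled[e] if scheduled[e] else 0
        for e in range(num_events)
    ]
```

With two events and a steady workload over eight slices, each event is counted in four slices and doubled, which recovers the true totals. With only one slice, the second event is never scheduled and its estimate is 0, which is the short-interval effect noted above.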

17
Interrupts on Counter Overflow
  • PAPI provides the ability to call user-defined
    handlers when a specified event exceeds a
    specified threshold.
  • For systems that do not support counter overflow
    at the hardware level, PAPI emulates this in
    software at the user level.
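
A user-level emulation of overflow can be pictured as polling the counter and invoking the handler once per threshold crossed. The sketch below is an assumption about how such emulation might look, not PAPI's code; `run_with_overflow` and the `(address, counter_value)` sample format are hypothetical.

```python
def run_with_overflow(samples, threshold, handler):
    """Poll a counter and call handler once for every threshold crossing.

    samples is an iterable of (address, counter_value) pairs whose
    counter_value never decreases. Returns how many times handler fired.
    """
    next_trigger = threshold
    fired = 0
    for address, value in samples:
        # One poll may have passed several thresholds; fire for each.
        while value >= next_trigger:
            handler(address, next_trigger)
            next_trigger += threshold
            fired += 1
    return fired
```

Because the counter is only polled, the address reported to the handler is where the crossing was noticed, not necessarily where it happened.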

18
Hardware Statistical Profiling
  • On overflow of hardware counter, dispatch a
    signal/interrupt.
  • Get the address at which the code was
    interrupted.
  • Store counts of interrupts for each address.
  • GNU prof and gprof (-p and -pg compiler options)
    use interval timers.

19
SVR4 Compatible Profiling
  • PAPI provides support for SVR4-compatible
    execution profiling based on any counter event.
  • PAPI_profil() creates a histogram of overflow
    counts for a specified region of the application
    code.
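
The histogram described here can be sketched with fixed-size address buckets. This is a simplification: SVR4 profil() uses an offset and a fixed-point scale factor, which this sketch does not reproduce, and `profile_histogram` is a hypothetical name.

```python
def profile_histogram(overflow_addresses, region_start, region_end, bucket_bytes):
    """Count overflow interrupts per address bucket in [region_start, region_end)."""
    num_buckets = (region_end - region_start + bucket_bytes - 1) // bucket_bytes
    hist = [0] * num_buckets
    for pc in overflow_addresses:
        # Interrupts outside the profiled region are ignored.
        if region_start <= pc < region_end:
            hist[(pc - region_start) // bucket_bytes] += 1
    return hist
```

Buckets with high counts are the code addresses where the chosen event overflowed most often.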

20
Results of Statistical Profiling
[Figure: histogram of event count versus program text addresses]
  • The result: a probabilistic distribution of where
    the code spent its time and why.

21
Some Tools that use PAPI
  • DEEP/PAPI (Pacific Sierra)
    http://www.psrv.com/deep_papi_top.html
  • TAU (Allen Malony, U of Oregon)
    http://www.cs.uoregon.edu/research/paracomp/tau/
  • SvPablo (Dan Reed, U of Illinois)
    http://vibes.cs.uiuc.edu/Software/SvPablo/svPablo.htm
  • Cactus (Ed Seidel, Max Planck/U of Illinois)
    http://www.aei-potsdam.mpg.de
  • Vprof (Curtis Janssen, Sandia Livermore Lab)
    http://aros.ca.sandia.gov/cljanss/perf/vprof/
  • Tool Gear/MPX (John M, John G, LLNL)
  • Cluster Tools (Al Geist, ORNL)
  • Paradyn (Barton Miller, U Wisc.)
    http://www.paradyn.org

22
For More Information
  • http://icl.cs.utk.edu/projects/papi/
  • Software and documentation
  • Reference materials
  • Papers and presentations
  • Third-party tools
  • Mailing lists

23
PAPI Around the World
24
IBM PAPI Release Platforms
  • 2.1 Release Platforms
  • IBM AIX 4.3.x pmtoolkit
  • PPC604, 604e, Power 3
  • X86 perfctr 2.3.x
  • Development version
  • Power 3, 604e AIX 5.1
  • Power 4
  • Itanium / Itanium 2 kernel 2.4.18 or higher
  • V3.0
  • Pentium 4

25
Upcoming PAPI 2.3 Release
  • Additional Platforms
  • Itanium
  • Itanium 2
  • Power 4
  • AIX 5, Power 3
  • AIX 5, PPC604e
  • PAPI 3.0 binary Pentium 4
  • Sample Tools
  • Perfometer
  • Trapper
  • Dynaprof

26
PAPI 3.0
  • Using lessons learned from years earlier
  • Substrate code: 90% used only 10% of the time
  • In practice, it was never used
  • Redesign for
  • Robustness
  • Feature set
  • Simplicity
  • Portability to new platforms

27
PAPI 3.0 Features
  • Multiway multiplexing
  • Use all available counter registers instead of
    one per time slice. (Just 1 additional register
    means 2x increase in accuracy)
  • Superb performance
  • On Pentium 4, a PAPI_read() costs 230 cycles.
  • Register access alone costs 100 cycles.
  • System level counting interface
  • Programmable events
  • Thresholding
  • Instruction matching
  • Per event counting domains
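
One way to read the multiway multiplexing claim above: if N events share k counter registers in round-robin fashion, each event is live for k/N of the run, so going from one register to two doubles the time each event is actually observed. A sketch of that arithmetic, where `live_fraction` is a hypothetical helper and the link between observation time and accuracy is an assumption:

```python
from fractions import Fraction


def live_fraction(num_events, registers_per_slice):
    """Share of the run during which any one multiplexed event is counted."""
    return min(Fraction(1), Fraction(registers_per_slice, num_events))
```

For eight multiplexed events, `live_fraction(8, 1)` is 1/8 and `live_fraction(8, 2)` is 1/4.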

28
PAPI 3.0 Features 2
  • Remote control interface
  • Allows PAPI to control counters in multiple
    threads/processes
  • High level API becomes thread safe
  • Internal timer/signal/thread abstractions
  • Additional internal layered API to support robust
    extensions like
  • MPX from Lawrence Livermore
  • Kevin London's memory extensions
  • Remote control interface from U. Wisc.

29
PAPI 3.0 Features 3
  • New language bindings
  • Java
  • Lisp
  • Matlab

30
PAPI 3.0 Release Targets
  • Supercomputing release for Pentium 4, possibly
    more
  • Future work
  • New platforms
  • Earth Simulator / SX-6
  • Blue Gene (BG/L 64k nodes)

31
What is DynaProf?
  • A portable tool to dynamically instrument serial
    and parallel programs for the purpose of
    performance analysis.
  • Simple and intuitive command line interface like
    GDB.
  • Java/Swing GUI.
  • Instrumentation is done through the run-time
    insertion of function calls to specially
    developed performance probes.

32
DynaProf Goals
  • Make collection of run-time performance data easy
    by
  • Avoiding instrumentation and recompilation
  • Avoiding perturbation of compiler optimizations
  • Using the same tool with different probes
  • Providing useful and meaningful probe data
  • Providing different kinds of probes
  • Allowing custom probes
  • Providing complete language independence
  • Allowing multiple insert/remove instrumentation
    cycles

No source code required!
33
A Brief History of Dynamic Instrumentation
  • Popularized by James Larus with EEL: An
    Executable Editor Library at U. Wisc.
    http://www.cs.wisc.edu/larus/eel.html
  • Technology matured by Dr. Bart Miller and (now
    Dr.) Jeff Hollingsworth at U. Wisc.
  • DynInst Project at U. Maryland
    http://www.dyninst.org/
  • IBM's DPCL: A Distributed DynInst
    http://oss.software.ibm.com/dpcl/

34
Dynamic Instrumentation
  • Operates on a running executable.
  • Identifies instrumentation points where code can
    be inserted.
  • Inserts code snippets at selected points.
  • Snippets can collect and monitor performance
    information.
  • Snippets can be removed and reinserted
    dynamically.
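
The insert/remove cycle can be illustrated by analogy in a scripting language. DynInst patches machine code in a running process; the sketch below merely rebinds function names in a Python namespace, so it shows the workflow rather than the mechanism. The `Instrumenter` class is hypothetical.

```python
import time


class Instrumenter:
    """Insert and remove timing probes around functions at run time."""

    def __init__(self, namespace):
        self.namespace = namespace  # a dict of name -> callable
        self.originals = {}
        self.calls = {}
        self.elapsed = {}

    def insert(self, name):
        """Wrap namespace[name] with entry and exit snippets."""
        if name in self.originals:
            return  # already instrumented
        original = self.namespace[name]
        self.originals[name] = original
        self.calls.setdefault(name, 0)
        self.elapsed.setdefault(name, 0.0)

        def probe(*args, **kwargs):
            start = time.perf_counter()  # entry snippet
            try:
                return original(*args, **kwargs)
            finally:  # exit snippet
                self.elapsed[name] += time.perf_counter() - start
                self.calls[name] += 1

        self.namespace[name] = probe

    def remove(self, name):
        """Restore the uninstrumented function."""
        self.namespace[name] = self.originals.pop(name)
```

Collected data survives removal, so a function can be instrumented, released, and instrumented again while the totals keep accumulating.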

35
Why the Dyna in DynaProf?
  • Built on DynInst and DPCL
  • Instrumentation is dynamically and selectively
    inserted directly into the program's address
    space.
  • Why is this a better way?
  • No perturbation of compiler optimizations
  • Complete language independence
  • Multiple Insert/Remove instrumentation cycles

36
DynaProf Commands
  • load
  • attach
  • list
  • use
  • instr module function
  • stop
  • continue
  • run
  • info
  • unload

37
Dynaprof Sample Session
./dynaprof
(dynaprof) load tests/swim
(dynaprof) list
DEFAULT_MODULE
swim.F
libm.so.6
libc.so.6
(dynaprof) list swim.F
MAIN__
inital_
calc1_
calc2_
calc3z_
calc3_
(dynaprof) list swim.F MAIN__
Entry
Call s_wsle
Call do_lio
Call e_wsle
Call s_wsle
Call do_lio
Call e_wsle
Call calc3_
(dynaprof) use probes/papiprobe
Module papiprobe.so was loaded.
Module libpapi.so was loaded.
Module libperfctr.so was loaded.
(dynaprof) instr module swim.F calc
swim.F, inserted 6 instrumentation points
(dynaprof) run
papiprobe output goes to /home/mucci/dynaprof/tests/swim.1671
38
DynaProf Probe Design
  • Probes export 2 functions with loosely
    standardized interfaces.
  • Very easy to roll your own.
  • Supports separate probes for MPI/OpenMP/Pthreads.
  • Probes do their own data collection and
    visualization.

39
Dynaprof v0.7 Probes
  • papiprobe
  • Measure any combination of PAPI presets and
    native events
  • wallclockprobe
  • Highly accurate elapsed wallclock time in
    microseconds.
  • These probes report
  • Inclusive
  • Exclusive
  • 1 Level Call Tree
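
The distinction between inclusive and exclusive totals can be made concrete with a small sketch. It assumes a probe that records the metric value at every function entry and exit; the event format and the `profile` function are hypothetical, and recursion is not handled (a recursive function's inclusive total would be counted more than once).

```python
def profile(events):
    """Compute inclusive and exclusive totals from (kind, name, counter) events.

    kind is "enter" or "exit"; counter is a non-decreasing metric value
    (cycles, instructions, microseconds) read at that point.
    """
    inclusive, exclusive = {}, {}
    stack = []  # entries: [name, counter_at_entry, metric_spent_in_children]
    for kind, name, counter in events:
        if kind == "enter":
            stack.append([name, counter, 0])
        else:
            top_name, start, in_children = stack.pop()
            assert top_name == name, "unbalanced enter/exit"
            total = counter - start
            inclusive[name] = inclusive.get(name, 0) + total
            # Exclusive excludes whatever the callees consumed.
            exclusive[name] = exclusive.get(name, 0) + total - in_children
            if stack:
                stack[-1][2] += total
    return inclusive, exclusive
```

A one-level call tree is the same bookkeeping keyed by (parent, child) pairs instead of by function name alone.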

40
Dynaprof v0.7 Release
  • Supported Platforms
  • Using DynInst
  • Linux 2.x
  • AIX 4.3
  • Solaris 2.8
  • IRIX 6.x
  • Using DPCL
  • AIX 4.3
  • AIX 5?
  • Available as a binary package from
  • http://www.cs.utk.edu/mucci/dynaprof
  • Perfapi-devel_at_ptools.org
  • No GUI included
  • User's Guide
  • All probe libraries included

41
PAPI Probe v0.7 Features
  • Can count any PAPI preset or Native event
    accessible through PAPI
  • Can count multiple events
  • Supports multiplexing
  • Supports multithreading
  • AIX SMP, OpenMP, Pthreads
  • Linux SMP, OpenMP, Pthreads

42
Wallclock Probe v0.7 Features
  • Counts microseconds using RTC
  • Supports multithreading
  • AIX SMP, OpenMP, Pthreads
  • Linux SMP, OpenMP, Pthreads

43
PAPI Probe v0.7 Output
Output file: /home/mucci/dynaprof/tests/swim.1385
Option string: PAPI_TOT_CYC,PAPI_TOT_INS
Processor: 363 Mhz GenuineIntel Intel Pentium II rev 0xa (1-way)
Total metrics measured: 2
Metric 1: PAPI_TOT_CYC, Total cycles (Native 0x79,0x79)
Metric 2: PAPI_TOT_INS, Instructions completed (Native 0xc0,0xc0)
Total functions: 6

Exclusive Profile of Metric PAPI_TOT_CYC.
Name           Percent   Total      Calls
-------------  -------   -----      -----
TOTAL          100       2.583e10   1
calc2_         32.02     8.271e09   120
calc3_         31.54     8.147e09   118
calc1_         30.84     7.966e09   120
unknown        2.759     7.125e08   1
inital_        2.503     6.465e08   1
calc3z_        0.1698    4.387e07   1
MAIN__         0.1639    4.235e07   1

Inclusive Profile of Metric PAPI_TOT_INS.
Name           Percent   Total      SubCalls
-------------  -------   -----      --------
TOTAL          100       2.408e10   0
MAIN__         100       2.408e10   424
calc1_         34.27     8.251e09   0
calc2_         33.48     8.06e09    0
calc3_         27.94     6.726e09   0
inital_        4.073     9.806e08   1.053e06
calc3z_        0.1257    3.027e07   0
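
The Percent column is each function's total divided by the TOTAL row, and the two profiles together allow a rough instructions-per-cycle figure for the whole run. The sketch below recomputes both from the numbers printed above. The IPC value is derived here and is not something the probe reports.

```python
# Values copied from the probe output above.
total_cycles = 2.583e10        # TOTAL row, exclusive PAPI_TOT_CYC profile
total_instructions = 2.408e10  # TOTAL row, inclusive PAPI_TOT_INS profile
exclusive_cycles = {"calc2_": 8.271e09, "calc3_": 8.147e09, "calc1_": 7.966e09}


def percent(part, whole):
    """Share of the whole, in percent."""
    return 100.0 * part / whole


percents = {
    name: round(percent(value, total_cycles), 2)
    for name, value in exclusive_cycles.items()
}
# Derived metric: instructions completed per cycle over the whole run.
ipc = total_instructions / total_cycles
```

The three calc routines account for roughly 94% of all cycles, and the run averages a little under one instruction per cycle.
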
44
PAPI Probe v0.7 Output
1-Level Inclusive Call Tree of Metric PAPI_TOT_INS.
Parent/-Child  Percent    Total     Calls
-------------  -------    -----     --------
TOTAL          100        2.408e10  1
MAIN__         100        2.408e10  1
- s_wsle       2.92e-06   703       1
- do_lio       3.14e-06   756       1
- e_wsle       4.515e-06  1087      1
- inital_      4.073      9.806e08  1
- s_wsfe       2.427e-05  5843      1
- do_fio       2.141e-05  5154      1
- do_fio       1.251e-05  3012      1
- e_wsfe       5.728e-06  1379      1
- calc1_       0.2856     6.876e07  120
- calc2_       0.279      6.717e07  120
- s_wsfe       8.278e-06  1993      2
- do_fio       2.676e-05  6443      2
- e_wsfe       7.385e-06  1778      2
- s_stop       0          0         1
- calc3z_      0.1257     3.027e07  1
- calc3_       0.2367     5.7e07    118
inital_        100        9.806e08  1
- atan         0.0001985  1946      1
- sin          0.0002003  1964      2.632e05
- sin          6.364e-05  624       2.632e05
- cos          0.0002101  2060      2.632e05
- cos          6.353e-05  623       2.632e05
calc1_         100        8.251e09  120
calc2_         100        8.06e09   120
calc3z_        100        3.027e07  1
calc3_         100        6.726e09  118
45
Dynaprof v0.8
  • 3 probes, including perfometerprobe
  • All support all threading models
  • Pthreads
  • OpenMP directives
  • SMP directives
  • GUI included
  • Same release targets

46
DynaProf GUI
  • Displays module tree for instrumentation
  • Simple selection of probes and instrumentation
    points
  • Single-click execution of common DynaProf
    commands
  • Coupling of probes and visualizers (e.g.
    perfometer)

47
DynaProf GUI Screenshot
48
Perfometer Probe v0.8
  • Graphically monitor performance in near real
    time.
  • To be rereleased in v0.8 with full thread support
    on all platforms.
  • Robust error handling.

49
Perfometer Screenshot
50
Perfometer Parallel Interface
51
Dynaprof Development
  • New instrumentation points
  • Loop level (DynInst only)
  • Arbitrary start/stop points (DynInst only)
  • Breakpoints (DynInst only)
  • New probes
  • Heartbeat
  • Vprof statistical histogram visualization tool
  • More GUI work
  • Ability to handle programs that use stdin
  • Integrated help and tutorial
  • Robust error handling
  • Robust handling of multiple instrumentation cycles

52
Heartbeat Probe
  • Statistical profiling is often static
  • Gprof, Quantify, Speedshop, Perfex, Workshop,
    Tprof
  • We want to understand all aspects of a program's
    performance
  • Programs have different phases
  • Initialization
  • Data input
  • Compute (many phases here)
  • Data output
  • Finalization
  • Workloads on hardware are wavelike or periodic in
    nature
  • How do we visualize this?
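
The talk does not describe how the heartbeat probe works. One possible reading, sketched here under that caveat, is to sample a cumulative counter at intervals and look at the per-interval rate, so that phases and periodic behavior show up as changes over time. `interval_rates` is a hypothetical helper.

```python
def interval_rates(samples):
    """Turn cumulative (time, count) samples into per-interval event rates.

    samples must be ordered by strictly increasing time.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rates.append((c1 - c0) / (t1 - t0))
    return rates
```

Plotting the returned rates against time gives a trace in which a quiet I/O phase and a busy compute phase are easy to tell apart, which a single end-of-run total would hide.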

53
Heartbeat Screenshot