Title: Dynaprof and PAPI: A Tool for Dynamic Runtime Instrumentation and Performance Analysis
1. Dynaprof and PAPI: A Tool for Dynamic Runtime Instrumentation and Performance Analysis
- Philip Mucci, Research Consultant
- Innovative Computing Laboratory/LBNL
- mucci_at_cs.utk.edu
- http://icl.cs.utk.edu/projects/papi
- http://www.cs.utk.edu/mucci/dynaprof
2. The ICL PAPI Team
- Jack Dongarra
- Kevin London
- Shirley Moore
- Philip Mucci
- Keith Seymour
- Dan Terpstra
- Haihang You
- Min Zhou
- And a few of you spread throughout the globe
3. The Library Interface
- PAPI provides two APIs to access the underlying counter hardware:
- The low-level interface manages hardware events in user-defined groups called EventSets.
- The high-level interface simply provides the ability to start, stop, and read the counters for a specified list of events.
4. PAPI Implementation
5. Preset Events
- Proposed standard set of event names deemed most relevant for application performance tuning
- No standardization of the exact definition
- Mapped to native events on a given platform
6. Preset Events 2
- PAPI supports 92 preset events and native events.
- Preset events are mappings from symbolic names to machine-specific definitions for a particular hardware resource.
- Example: Total Cycles is PAPI_TOT_CYC
- PAPI also supports presets that may be derived from the underlying hardware metrics.
- Example: Floating Point Instructions per Second is PAPI_FLOPS
7. Native Events
- An event countable by the CPU can be counted even if there is no matching preset PAPI event.
- Same interface as when setting up a preset event, but a CPU-specific bit pattern is used instead of the PAPI event definition.
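The relationship between presets, derived presets, and native events can be pictured as a per-platform lookup table. The Python sketch below is only an illustrative model, not PAPI code: the table layout, the platform key, and the `resolve` and `count` helpers are assumptions made for this example. The codes for PAPI_TOT_CYC and PAPI_TOT_INS are the native codes printed in the papiprobe output elsewhere in this deck; the two codes under PAPI_L1_TCM are hypothetical, and a derived preset is modeled here simply as a sum of native counts.

```python
# Illustrative model only: names, layout, and helpers are assumptions, not PAPI internals.
PRESET_TABLE = {
    "pentium2": {
        "PAPI_TOT_CYC": [0x79],        # native code as printed in the papiprobe output
        "PAPI_TOT_INS": [0xC0],        # native code as printed in the papiprobe output
        "PAPI_L1_TCM": [0x45, 0x81],   # hypothetical codes; derived from two native events
    },
}


def resolve(event, platform):
    """Return the native codes behind an event.

    A preset name is looked up in the platform table; an integer is treated
    as a CPU-specific native code and passed through unchanged.
    """
    if isinstance(event, int):
        return [event]
    try:
        return PRESET_TABLE[platform][event]
    except KeyError:
        raise ValueError(f"{event} is not available on {platform}")


def count(event, platform, native_counts):
    """Combine raw native counts into the value reported for one event."""
    return sum(native_counts[code] for code in resolve(event, platform))


raw = {0x79: 1000, 0xC0: 800, 0x45: 30, 0x81: 12}      # made-up counter readings
total_cycles = count("PAPI_TOT_CYC", "pentium2", raw)  # preset backed by one native event
l1_misses = count("PAPI_L1_TCM", "pentium2", raw)      # derived preset, two native events
native_direct = count(0xC0, "pentium2", raw)           # native event, no preset needed
```

A preset that has no entry for the platform raises an error in this model, which mirrors the "Avail: No" rows in the sample listing.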
8. Sample Preset Listing
    > tests/avail
    Test case 8: Available events and hardware information.
    -------------------------------------------------------------------------
    Vendor string and code   : GenuineIntel (-1)
    Model string and code    : Celeron (Mendocino) (6)
    CPU revision             : 10.000000
    CPU Megahertz            : 366.504944
    -------------------------------------------------------------------------
    Name         Code        Avail  Deriv  Description (Note)
    PAPI_L1_DCM  0x80000000  Yes    No     Level 1 data cache misses
    PAPI_L1_ICM  0x80000001  Yes    No     Level 1 instruction cache misses
    PAPI_L2_DCM  0x80000002  No     No     Level 2 data cache misses
    PAPI_L2_ICM  0x80000003  No     No     Level 2 instruction cache misses
    PAPI_L3_DCM  0x80000004  No     No     Level 3 data cache misses
    PAPI_L3_ICM  0x80000005  No     No     Level 3 instruction cache misses
    PAPI_L1_TCM  0x80000006  Yes    Yes    Level 1 cache misses
    PAPI_L2_TCM  0x80000007  Yes    No     Level 2 cache misses
    PAPI_L3_TCM  0x80000008  No     No     Level 3 cache misses
    PAPI_CA_SNP  0x80000009  No     No     Requests for a snoop
9. High-level Interface
- Meant for application programmers wanting coarse-grained measurements
- Not thread safe
- Calls the lower-level API
- Allows only PAPI preset events
- Easier to use and less setup (additional code) than low-level
10. High-level API Calls
- PAPI_num_counters()
- Returns the number of available counters
- PAPI_start_counters(int *cntrs, int alen)
- Start counters
- PAPI_stop_counters(long_long *vals, int alen)
- Stop counters and put counter values in array
- PAPI_accum_counters(long_long *vals, int alen)
- Accumulate counters into array and reset
- PAPI_read_counters(long_long *vals, int alen)
- Copy counter values into array and reset counters
- PAPI_flops(float *rtime, float *ptime, long_long *flpins, float *mflops)
- Wallclock time, process time, FP instructions since start, and Mflop/s since last call
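The difference between "copy and reset", "accumulate and reset", and "stop" is easy to blur, so here is a small toy model of those semantics. It is a Python sketch written for this explanation, not the PAPI API: the `ToyCounters` class, its method names, and the `tick` helper that stands in for real work are all hypothetical.

```python
class ToyCounters:
    """Toy stand-in for a set of running hardware counters (not the PAPI API)."""

    def __init__(self, num_events):
        self.live = [0] * num_events   # values currently held in the "hardware"
        self.running = False

    def start(self):
        self.live = [0] * len(self.live)
        self.running = True

    def tick(self, increments):
        """Simulate work: add per-event increments while counting is active."""
        if self.running:
            self.live = [c + d for c, d in zip(self.live, increments)]

    def read(self, vals):
        """Copy the live values into vals, then reset the live counters."""
        vals[:] = self.live
        self.live = [0] * len(self.live)

    def accum(self, vals):
        """Add the live values into vals, then reset the live counters."""
        vals[:] = [v + c for v, c in zip(vals, self.live)]
        self.live = [0] * len(self.live)

    def stop(self, vals):
        """Stop counting and copy the final values into vals."""
        vals[:] = self.live
        self.running = False


counters = ToyCounters(2)
counters.start()
counters.tick([100, 40])
snapshot = [0, 0]
counters.read(snapshot)          # snapshot receives the counts; live values return to 0
counters.tick([10, 5])
running_total = [100, 40]
counters.accum(running_total)    # the new counts are added onto the existing array
counters.tick([1, 1])
final = [0, 0]
counters.stop(final)             # final holds only what was counted since the last reset
```

In this model, a caller that wants totals across several regions would use the accumulate form, while a caller that wants per-region values would use the read form.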
11. Low-level Interface
- Increased efficiency and functionality over the high-level PAPI interface
- Approximately 56 functions (http://icl.cs.utk.edu/projects/papi/files/html_man/papi.html4)
- Thread-safe (SMP, OpenMP, Pthreads)
- Supports both presets and native events
12. Low-level Functionality
- API calls for:
- Counter multiplexing
- Callbacks on counter overflow
- SVR4 compatible profiling
- Hardware information
- Software information
- Highly accurate and low latency timing functions
- Hardware event inquiry functions
- Eventset management functions
- Simple locking operations
13. The Cost of Calling PAPI
- PAPI includes an example program, cost, to measure latencies.
- Reading hardware counters is relatively cheap.
- Setup is a bit more expensive, as it sometimes requires a system call.
    Total User Kernel Cycles         Linux/x86   Linux/IA64   IBM POWER3
    PAPI start/stop (cycles/pair)    3524        22115        14199
    PAPI read (cycles/call)          1299        6526         3126
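To put the cycle counts in the table into perspective, they can be converted to time and to a share of the CPU. The Python sketch below does that arithmetic for the Linux/x86 read cost. The clock rate and the sampling rate of 1000 reads per second are assumptions chosen for illustration, since the slide does not state them, and the helper names are made up for this example.

```python
def cycles_to_microseconds(cycles, clock_mhz):
    """Convert a cycle count to microseconds at a given clock rate in MHz."""
    return cycles / clock_mhz


def overhead_fraction(cycles_per_call, calls_per_second, clock_mhz):
    """Fraction of all CPU cycles spent inside the measurement calls."""
    return cycles_per_call * calls_per_second / (clock_mhz * 1e6)


READ_CYCLES_X86 = 1299      # PAPI read cost on Linux/x86, taken from the table
ASSUMED_CLOCK_MHZ = 500.0   # assumption: the slide does not give the clock rate

read_us = cycles_to_microseconds(READ_CYCLES_X86, ASSUMED_CLOCK_MHZ)
share = overhead_fraction(READ_CYCLES_X86, 1000, ASSUMED_CLOCK_MHZ)
print(f"one read: {read_us:.3f} us; 1000 reads/s: {share:.4%} of the CPU")
```

Under these assumed numbers a read costs a few microseconds, which is why coarse-grained instrumentation perturbs a program far less than reading counters inside a tight inner loop.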
14. PAPI and Threads
- A challenge: how to make one version of a library that works with any thread model?
- After initializing the library, the user needs to enable thread detection.
- Each thread is responsible for the creation, start, stop, and read of its own counters.
15. PAPI and Multiplexing
- Multiplexing allows simultaneous use of more counters than are supported by the hardware.
- This is accomplished through timesharing the counter hardware and extrapolating the results.
- Users can enable multiplexing with one API call and then use PAPI normally.
- Implementation was based on MPX done by John May at LLNL.
16. PAPI and Multiplexing 2
- Most platforms do not support multiplexing at the kernel level.
- PAPI implements multiplexing in software at the user level.
- The more events you multiplex, the larger the sampling error in the result.
- Too short a measurement interval will result in counts of 0.
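The timesharing and extrapolation idea, and both failure modes listed above, can be reproduced with a few lines of arithmetic. The Python sketch below is a simplified model, not PAPI's or MPX's implementation: it assumes a single physical counter, strict round-robin scheduling with equal time slices, and linear scaling of the observed count. The function name and its parameters are hypothetical.

```python
def multiplexed_estimate(per_slice_counts, num_events, slot):
    """Estimate one event's total count under round-robin multiplexing.

    per_slice_counts: the event's true count in each time slice.
    num_events: how many events share the single counter.
    slot: which position in the round-robin rotation this event holds.
    """
    observed = 0
    active_slices = 0
    for i, true_count in enumerate(per_slice_counts):
        if i % num_events == slot:     # the counter is ours during this slice
            observed += true_count
            active_slices += 1
    if active_slices == 0:             # interval ended before our turn came up
        return 0.0
    # Scale the observed count up to the whole measurement interval.
    return observed * len(per_slice_counts) / active_slices


# Steady behavior: extrapolation recovers the true total of 800.
steady = multiplexed_estimate([100] * 8, num_events=2, slot=0)

# Bursty behavior (true total 800): the estimate depends on which slices we saw.
burst = [0, 0, 0, 800, 0, 0, 0, 0]
bursty_missed = multiplexed_estimate(burst, num_events=2, slot=0)
bursty_caught = multiplexed_estimate(burst, num_events=2, slot=1)

# Measurement interval shorter than one rotation: the event is never counted.
too_short = multiplexed_estimate([100], num_events=4, slot=2)
```

In this model the bursty case is either missed entirely or overestimated by a factor of two, which illustrates why multiplexed counts are best trusted for long runs with reasonably uniform behavior.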
17. Interrupts on Counter Overflow
- PAPI provides the ability to call user-defined handlers when a specified event exceeds a specified threshold.
- For systems that do not support counter overflow at the hardware level, PAPI emulates this in software at the user level.
18. Hardware Statistical Profiling
- On overflow of a hardware counter, dispatch a signal/interrupt.
- Get the address at which the code was interrupted.
- Store counts of interrupts for each address.
- GNU prof and gprof (-p and -pg compiler options) use interval timers.
19. SVR4 Compatible Profiling
- PAPI provides support for SVR4-compatible execution profiling based on any counter event.
- PAPI_profil() creates a histogram of overflow counts for a specified region of the application code.
20. Results of Statistical Profiling
[Figure: event count plotted against program text addresses]
- The result: a probabilistic distribution of where the code spent its time and why.
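The profiling scheme described on the preceding slides (interrupt on overflow, record the interrupted address, count hits per address) can be sketched as a small histogram builder. The Python example below is an illustrative model only: the symbol table, the address values, the overflow threshold, and the `attribute_samples` helper are all hypothetical, and it does not reproduce the bucket scaling used by the SVR4 profil interface.

```python
import bisect
from collections import Counter

# Hypothetical symbol table: (start address, function name), sorted by address.
SYMBOLS = [(0x1000, "calc1"), (0x1400, "calc2"), (0x1900, "calc3")]
TEXT_END = 0x2000            # hypothetical end of the profiled text region
OVERFLOW_THRESHOLD = 10000   # hypothetical: one interrupt per 10000 events


def attribute_samples(sample_addresses):
    """Count interrupt addresses per function; addresses outside the region are 'unknown'."""
    starts = [start for start, _ in SYMBOLS]
    hits = Counter()
    for pc in sample_addresses:
        if pc < starts[0] or pc >= TEXT_END:
            hits["unknown"] += 1
            continue
        # Find the last symbol whose start address is <= pc.
        index = bisect.bisect_right(starts, pc) - 1
        hits[SYMBOLS[index][1]] += 1
    return hits


# Addresses captured by the overflow handler (made-up values).
samples = [0x1004, 0x1010, 0x1404, 0x1408, 0x140C, 0x1A00, 0x3000]
hits = attribute_samples(samples)

# Each sample stands for roughly one threshold's worth of events.
estimated_events = {name: n * OVERFLOW_THRESHOLD for name, n in hits.items()}
```

Because each hit represents about one threshold's worth of events, the histogram is a statistical estimate: lowering the threshold gives finer resolution at the cost of more interrupts.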
21. Some Tools that use PAPI
- DEEP/PAPI (Pacific Sierra) http://www.psrv.com/deep_papi_top.html
- TAU (Allen Malony, U of Oregon) http://www.cs.uoregon.edu/research/paracomp/tau/
- SvPablo (Dan Reed, U of Illinois) http://vibes.cs.uiuc.edu/Software/SvPablo/svPablo.htm
- Cactus (Ed Seidel, Max Planck/U of Illinois) http://www.aei-potsdam.mpg.de
- Vprof (Curtis Janssen, Sandia Livermore Lab) http://aros.ca.sandia.gov/cljanss/perf/vprof/
- Tool Gear/MPX (John M, John G, LLNL)
- Cluster Tools (Al Geist, ORNL)
- Paradyn (Barton Miller, U Wisc.)
- http://www.paradyn.org
22. For More Information
- http://icl.cs.utk.edu/projects/papi/
- Software and documentation
- Reference materials
- Papers and presentations
- Third-party tools
- Mailing lists
23. PAPI Around the World
24. IBM PAPI Release Platforms
- 2.1 Release Platforms
- IBM AIX 4.3.x pmtoolkit
- PPC604, 604e, Power 3
- X86 perfctr 2.3.x
- Development version
- Power 3, 604e AIX 5.1
- Power 4
- Itanium / Itanium 2 kernel 2.4.18 or higher
- V3.0
- Pentium 4
25. Upcoming PAPI 2.3 Release
- Additional Platforms
- Itanium
- Itanium 2
- Power 4
- AIX 5, Power 3
- AIX 5, PPC604e
- PAPI 3.0 binary Pentium 4
- Sample Tools
- Perfometer
- Trapper
- Dynaprof
26. PAPI 3.0
- Using lessons learned from years earlier
- Substrate code: 90% used only 10% of the time
- In practice, it was never used
- Redesign for
- Robustness
- Feature set
- Simplicity
- Portability to new platforms
27. PAPI 3.0 Features
- Multiway multiplexing
- Use all available counter registers instead of one per time slice. (Just 1 additional register means 2x increase in accuracy)
- Superb performance
- On Pentium 4, a PAPI_read() costs 230 cycles.
- Register access alone costs 100 cycles.
- System level counting interface
- Programmable events
- Thresholding
- Instruction matching
- Per event counting domains
28. PAPI 3.0 Features 2
- Remote control interface
- Allows PAPI to control counters in multiple threads/processes
- High-level API becomes thread safe
- Internal timer/signal/thread abstractions
- Additional internal layered API to support robust extensions like:
- MPX from Lawrence Livermore
- Kevin London's memory extensions
- Remote control interface from U. Wisc.
29. PAPI 3.0 Features 3
- New language bindings
- Java
- Lisp
- Matlab
30. PAPI 3.0 Release Targets
- Supercomputing release for Pentium 4, possibly more
- Future work
- New platforms
- Earth Simulator / SX-6
- Blue Gene (BG/L 64k nodes)
31. What is DynaProf?
- A portable tool to dynamically instrument serial and parallel programs for the purpose of performance analysis.
- Simple and intuitive command line interface like GDB.
- Java/Swing GUI.
- Instrumentation is done through the run-time insertion of function calls to specially developed performance probes.
32. DynaProf Goals
- Make collection of run-time performance data easy by:
- Avoiding instrumentation and recompilation
- Avoiding perturbation of compiler optimizations
- Using the same tool with different probes
- Providing useful and meaningful probe data
- Providing different kinds of probes
- Allowing custom probes
- Providing complete language independence
- Allowing multiple insert/remove instrumentation
cycles
No source code required!
33. A Brief History of Dynamic Instrumentation
- Popularized by James Larus with EEL: An Executable Editor Library at U. Wisc.
- http://www.cs.wisc.edu/larus/eel.html
- Technology matured by Dr. Bart Miller and (now Dr.) Jeff Hollingsworth at U. Wisc.
- DynInst Project at U. Maryland
- http://www.dyninst.org/
- IBM's DPCL: A Distributed DynInst
- http://oss.software.ibm.com/dpcl/
34. Dynamic Instrumentation
- Operates on a running executable.
- Identifies instrumentation points where code can be inserted.
- Inserts code snippets at selected points.
- Snippets can collect and monitor performance information.
- Snippets can be removed and reinserted dynamically.
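The insert, collect, remove, and reinsert cycle can be imitated at the language level to make the idea concrete. The Python sketch below is only an analogy: DynInst and DPCL patch machine code in a running process, whereas this example swaps function objects in a dictionary that stands in for the program's symbol table. The `Instrumenter` class and all names in it are hypothetical.

```python
class Instrumenter:
    """Insert and remove call-counting probes on functions held in a namespace dict."""

    def __init__(self, namespace):
        self.namespace = namespace   # stands in for the running program's symbols
        self.originals = {}          # functions currently replaced by a probe
        self.calls = {}              # data collected by the probes

    def insert(self, name):
        if name in self.originals:   # already instrumented
            return
        original = self.namespace[name]
        self.originals[name] = original
        self.calls.setdefault(name, 0)   # keep counts across insert/remove cycles

        def probed(*args, **kwargs):
            self.calls[name] += 1        # entry probe: record the call
            return original(*args, **kwargs)

        self.namespace[name] = probed

    def remove(self, name):
        if name in self.originals:
            self.namespace[name] = self.originals.pop(name)


def calc1(x):
    return x * x


program = {"calc1": calc1}
tool = Instrumenter(program)

tool.insert("calc1")
results = [program["calc1"](n) for n in (2, 3)]   # two counted calls
tool.remove("calc1")
program["calc1"](4)                               # probe removed: not counted
tool.insert("calc1")
program["calc1"](5)                               # counted again after reinsertion
```

The target function is never edited or recompiled in this model; only the call path to it changes, which is the property the slides emphasize.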
35. Why the "Dyna" in DynaProf?
- Built on DynInst and DPCL
- Instrumentation is dynamically and selectively inserted directly into the program's address space.
- Why is this a better way?
- No perturbation of compiler optimizations
- Complete language independence
- Multiple Insert/Remove instrumentation cycles
36. DynaProf Commands
- load
- attach
- list
- use
- instr module function
- stop
- continue
- run
- info
- unload
37. Dynaprof Sample Session
    ./dynaprof
    (dynaprof) load tests/swim
    (dynaprof) list
    DEFAULT_MODULE
    swim.F
    libm.so.6
    libc.so.6
    (dynaprof) list swim.F
    MAIN__
    inital_
    calc1_
    calc2_
    calc3z_
    calc3_
    (dynaprof) list swim.F MAIN__
    Entry
    Call s_wsle
    Call do_lio
    Call e_wsle
    Call s_wsle
    Call do_lio
    Call e_wsle
    Call calc3_
    (dynaprof) use probes/papiprobe
    Module papiprobe.so was loaded.
    Module libpapi.so was loaded.
    Module libperfctr.so was loaded.
    (dynaprof) instr module swim.F calc
    swim.F, inserted 6 instrumentation points
    (dynaprof) run
    papiprobe output goes to /home/mucci/dynaprof/tests/swim.1671
38. DynaProf Probe Design
- Probes export 2 functions with loosely standardized interfaces.
- Very easy to roll your own.
- Supports separate probes for MPI/OpenMP/Pthreads.
- Probes do their own data collection and visualization.
39. Dynaprof v0.7 Probes
- papiprobe
- Measure any combination of PAPI presets and native events
- wallclockprobe
- Highly accurate elapsed wallclock time in microseconds.
- These probes report:
- Inclusive
- Exclusive
- 1 Level Call Tree
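The difference between the inclusive and exclusive reports is a matter of simple bookkeeping: an inclusive value covers a function and everything it calls, while an exclusive value covers the function's own code only. The Python sketch below shows one way to derive exclusive values from inclusive ones. It is an illustration under stated assumptions, not the probes' actual code: the call tree, the counts, and the `exclusive_counts` helper are made up, and each function is assumed to have a single caller.

```python
def exclusive_counts(tree):
    """Derive exclusive counts from inclusive ones.

    tree maps a function name to (inclusive_count, [names of functions it calls]).
    Exclusive = inclusive minus the inclusive counts of the direct children.
    """
    return {
        name: inclusive - sum(tree[child][0] for child in children)
        for name, (inclusive, children) in tree.items()
    }


# Hypothetical inclusive measurements (for example, total cycles between entry and exit).
call_tree = {
    "MAIN":   (1000, ["inital", "calc1"]),
    "inital": (300,  ["sin"]),
    "sin":    (100,  []),
    "calc1":  (600,  []),
}

exclusive = exclusive_counts(call_tree)
```

In this example the exclusive values add up to the inclusive value of the root, so the exclusive profile partitions the total while the inclusive profile shows how much each subtree costs.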
40. Dynaprof v0.7 Release
- Supported Platforms
- Using DynInst
- Linux 2.x
- AIX 4.3
- Solaris 2.8
- IRIX 6.x
- Using DPCL
- AIX 4.3
- AIX 5?
- Available as a binary package from
- http://www.cs.utk.edu/mucci/dynaprof
- Perfapi-devel_at_ptools.org
- No GUI included
- User's Guide
- All probe libraries included
41. PAPI Probe v0.7 Features
- Can count any PAPI preset or native event accessible through PAPI
- Can count multiple events
- Supports multiplexing
- Supports multithreading
- AIX SMP, OpenMP, Pthreads
- Linux SMP, OpenMP, Pthreads
42. Wallclock Probe v0.7 Features
- Counts microseconds using RTC
- Supports multithreading
- AIX SMP, OpenMP, Pthreads
- Linux SMP, OpenMP, Pthreads
43. PAPI Probe v0.7 Output
    Output file: /home/mucci/dynaprof/tests/swim.1385
    Option string: PAPI_TOT_CYC,PAPI_TOT_INS
    Processor: 363 Mhz GenuineIntel Intel Pentium II rev 0xa (1-way)
    Total metrics measured: 2
    Metric 1: PAPI_TOT_CYC, Total cycles (Native 0x79,0x79)
    Metric 2: PAPI_TOT_INS, Instructions completed (Native 0xc0,0xc0)
    Total functions: 6

    Exclusive Profile of Metric PAPI_TOT_CYC.
    Name           Percent  Total     Calls
    -------------  -------  -----     -----
    TOTAL          100      2.583e10  1
    calc2_         32.02    8.271e09  120
    calc3_         31.54    8.147e09  118
    calc1_         30.84    7.966e09  120
    unknown        2.759    7.125e08  1
    inital_        2.503    6.465e08  1
    calc3z_        0.1698   4.387e07  1
    MAIN__         0.1639   4.235e07  1

    Inclusive Profile of Metric PAPI_TOT_INS.
    Name           Percent  Total     SubCalls
    -------------  -------  -----     --------
    TOTAL          100      2.408e10  0
    MAIN__         100      2.408e10  424
    calc1_         34.27    8.251e09  0
    calc2_         33.48    8.06e09   0
    calc3_         27.94    6.726e09  0
    inital_        4.073    9.806e08  1.053e06
    calc3z_        0.1257   3.027e07  0
44. PAPI Probe v0.7 Output
    1-Level Inclusive Call Tree of Metric PAPI_TOT_INS.
    Parent/-Child  Percent    Total     Calls
    -------------  -------    -----     --------
    TOTAL          100        2.408e10  1
    MAIN__         100        2.408e10  1
    - s_wsle       2.92e-06   703       1
    - do_lio       3.14e-06   756       1
    - e_wsle       4.515e-06  1087      1
    - inital_      4.073      9.806e08  1
    - s_wsfe       2.427e-05  5843      1
    - do_fio       2.141e-05  5154      1
    - do_fio       1.251e-05  3012      1
    - e_wsfe       5.728e-06  1379      1
    - calc1_       0.2856     6.876e07  120
    - calc2_       0.279      6.717e07  120
    - s_wsfe       8.278e-06  1993      2
    - do_fio       2.676e-05  6443      2
    - e_wsfe       7.385e-06  1778      2
    - s_stop       0          0         1
    - calc3z_      0.1257     3.027e07  1
    - calc3_       0.2367     5.7e07    118
    inital_        100        9.806e08  1
    - atan         0.0001985  1946      1
    - sin          0.0002003  1964      2.632e05
    - sin          6.364e-05  624       2.632e05
    - cos          0.0002101  2060      2.632e05
    - cos          6.353e-05  623       2.632e05
    calc1_         100        8.251e09  120
    calc2_         100        8.06e09   120
    calc3z_        100        3.027e07  1
    calc3_         100        6.726e09  118
45. Dynaprof v0.8
- 3 probes, including perfometerprobe
- All support all threading models
- Pthreads
- OpenMP directives
- SMP directives
- GUI included
- Same release targets
46. DynaProf GUI
- Displays module tree for instrumentation
- Simple selection of probes and instrumentation points
- Single-click execution of common DynaProf commands
- Coupling of probes and visualizers (e.g. perfometer)
47. DynaProf GUI Screenshot
48. Perfometer Probe v0.8
- Graphically monitor performance in near real time.
- To be re-released in v0.8 with full thread support on all platforms.
- Robust error handling.
49. Perfometer Screenshot
50. Perfometer Parallel Interface
51. Dynaprof Development
- New instrumentation points
- Loop level (DynInst only)
- Arbitrary start/stop points (DynInst only)
- Breakpoints (DynInst only)
- New probes
- Heartbeat
- Vprof statistical histogram visualization tool
- More GUI work
- Ability to handle programs that use stdin
- Integrated help and tutorial
- Robust error handling
- Robust handling of multiple instrumentation cycles
52. Heartbeat Probe
- Statistical profiling is often static
- Gprof, Quantify, Speedshop, Perfex, Workshop, Tprof
- We want to understand all aspects of a program's performance
- Programs have different phases
- Initialization
- Data input
- Compute (many phases here)
- Data output
- Finalization
- Workloads on hardware are wavelike or periodic in
nature - How do we visualize this?
53. Heartbeat Screenshot