Title: PAPI and Dynaprof
1 PAPI and Dynaprof
- Application Signatures and Performance Analysis of Scientific Applications
- Philip J. Mucci
- Innovative Computing Laboratory, UTK
- Performance Evaluation Research Center, LBL
- mucci_at_cs.utk.edu
- http://icl.cs.utk.edu/mucci/dynaprof/snapshots/sc2002.ppt
2 Goals
- Understanding the behavior of the application
- Identification of bottlenecks.
- Usage of the hardware resources.
- Effects of that usage on performance.
- Using Dynaprof to achieve that goal
- Command line usage
- 3 Dynaprof probes
- Wallclock Time
- Hardware performance counters
- Resource usage traces
3 Motivation
- Optimize the application's performance.
- Evaluate the algorithm's efficiency.
- Generate an application signature.
- A collection of data that represents the major terms in the performance model.
- Develop a performance model.
4 Overview of Hardware Counters
- Data is NOT PORTABLE, but PAPI is...
- Small number of registers dedicated for performance monitoring functions.
- AMD Athlon, 4 counters
- Pentium <= III, 2 counters
- Pentium IV, 18 counters
- IA64, 4 counters
- Alpha 21x64, 2 counters
- Power 3, 8 counters
- Power 4, 8 counters to a group
- UltraSparc II, 2 counters
- MIPS R14K, 2 counters
5 Applications used in this Tutorial
- Serial
- FSPX: A binary alloy solidification benchmark.
- SWIM: The SPEC shallow water benchmark.
- Parallel (MPI)
- Ex19 from the PETSc distribution.
- Solves nonlinear driven cavity with multigrid. A 2D driven cavity problem solved in a velocity-vorticity formulation.
6 FSPX Execution Environment
- Intel PIII, 1.2 GHz
- FP Results/Clock: 1 (1.2 Gflips)
- 4 SP/clk with SSE, 2 DP/clk with SSE2
- Caches: 16K/16K, 256K
- G77 version 2.96
- -g -O -malign-double -mpentiumpro -funroll-loops -fexpensive-optimizations
- Execution time
- > /bin/time fspx
- 115.370u 0.030s 1:58.17 97.6% 0+0k 0+0io 162pf+0w
7 swim Execution Environment
- IBM Nighthawk, 16-way Power 3, 375 MHz
- FP Results/Clock: 4 (1.5 Gflips)
- Caches: 32K/64K, 8MB
- MPI over TCP/IP via switch
- Xlc 5.0.2.1 built with -g -O3 -qstrict -qarch=pwr3 -qtune=pwr3
- Execution time
- > /bin/time poe swim -procs 2
- 0.4u 0.0s 0:15 3% 2173933k 0+0io 1pf+0w
8 ex19 Execution Environment
- IBM Nighthawk, 16-way Power 3, 375 MHz
- FP Results/Clock: 4 (1.5 Gflips)
- Caches: 32K/64K, 8MB
- Xlc 5.0.2.1 built with -g
- Execution time
- > /bin/time poe ex19 -procs 2 -da_grid_x 56 -da_grid_y 56
- 0.520u 0.200s 0:44.18 1.6% 2973580k 0+0io 0pf+0w
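The peak rates quoted on the execution-environment slides follow from FP results per clock multiplied by the clock frequency. A minimal sketch of that arithmetic; the function name peak_gflips is hypothetical and chosen only for illustration:

```python
def peak_gflips(results_per_clock, clock_hz):
    """Peak floating-point results per second, in units of 1e9.

    results_per_clock and clock_hz come from the hardware description;
    this is plain arithmetic, not a measurement.
    """
    return results_per_clock * clock_hz / 1e9


if __name__ == "__main__":
    # Intel PIII at 1.2 GHz, 1 FP result per clock
    print(peak_gflips(1, 1.2e9))
    # IBM Power 3 at 375 MHz, 4 FP results per clock
    print(peak_gflips(4, 375e6))
```

Measured rates from the hardware counters can then be compared against this ceiling.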
9 Gprof
- Builds a histogram of timer interrupts vs. text address.
- Recompile with the -pg option.
- Gprof profile is useful for a high level overview
- Does it tell us why?
10 Gprof Profile of FSPX
11 FSPX Top 4 functions
- Top 4 functions make up 50% of execution time
- In module update.F
- flux
- proflux
- pde
- In module phase.F
- phase
- Use the list command to explore modules and
functions
12 Gprof Profile of SWIM
13 Gprof Profile of ex19
14 Dynaprof Environment Variables
- LD_LIBRARY_PATH: Colon-separated list of directories to search for shared libraries. It must locate:
- DynInst library
- PAPI library
- Any dependencies of the above (libperfctr.so, libcpc.so)
- DYNINSTAPI_RT_LIB: Full pathname of the DynInst runtime library.
- No settings necessary for AIX/DPCL port
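A pre-flight check of these two variables could look like the sketch below. The function check_dynaprof_env and its messages are hypothetical and not part of Dynaprof; it only checks that the variables are set as the slide describes, and it does not apply to the AIX/DPCL port, which needs no settings.

```python
import os


def check_dynaprof_env(env, file_exists=os.path.isfile):
    """Return a list of problems found in the given environment mapping.

    env is any dict-like object such as os.environ. file_exists is
    injectable so the check can be exercised without touching the disk.
    """
    problems = []

    # LD_LIBRARY_PATH must list at least one directory.
    dirs = [d for d in env.get("LD_LIBRARY_PATH", "").split(":") if d]
    if not dirs:
        problems.append("LD_LIBRARY_PATH is empty or unset")

    # DYNINSTAPI_RT_LIB must be the full pathname of an existing file.
    rt_lib = env.get("DYNINSTAPI_RT_LIB", "")
    if not rt_lib:
        problems.append("DYNINSTAPI_RT_LIB is unset")
    elif not os.path.isabs(rt_lib):
        problems.append("DYNINSTAPI_RT_LIB is not a full pathname")
    elif not file_exists(rt_lib):
        problems.append("DYNINSTAPI_RT_LIB does not name an existing file")

    return problems


if __name__ == "__main__":
    for problem in check_dynaprof_env(os.environ):
        print("warning:", problem)
```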
15 Running Dynaprof
- Usage
- dynaprof -d serial_application
- -d enables debugging output
- Specifying an application automatically loads it
into the tool immediately after initialization.
16 Command Line Interface
- Uses GNU Readline library for input
- Full featured Command Line Editing
- File and command completion <Tab>
- History <Up>/<Down>
- Settings, macros and aliases in ~/.inputrc
- Allows Emacs or VI style bindings
- set editing-mode emacs
- set editing-mode vi
- See man page, Texinfo file or home page.
17 Load command
- Starts the application and stops it at the first instruction.
- Usage
- load <application> args
- > dynaprof
- (dynaprof) load tests/fspx
18 Poeload command
- For use with MPI applications on AIX and DPCL.
- DPCL < 3.2.5 requires full path
- Usage
- poeload <application> args
- (dynaprof) poeload tests/swim -procs 2
19 Mpiload command
- For use with MPI applications.
- Stops the application after it calls PMPI_Init().
- Mostly useful for script driven execution of MPI jobs
- Usage
- mpiload <application> args
- (dynaprof) mpiload tests/mpicount
20 Attach command
- Attaches to a running application (or poe process) and stops it.
- Usage
- attach <application> <pid>
- (dynaprof) ^Z
- > tests/fspx &
- [2] 17500
- > fg
- (dynaprof) attach tests/fspx 17500
21 Poeattach Command
- For use with MPI applications on AIX and DPCL.
- DPCL < 3.2.5 requires full path
- Usage
- poeattach <application> <pid_of_poe>
- (dynaprof) ^Z
- poe ex19 -da_grid_x 56 -da_grid_y 56 -procs 2 &
- [2] 17500
- > fg
- (dynaprof) poeattach ex19 17500
22 List command
- list
- List all modules in process
- list <pattern>
- List all matching modules
- list <module>
- List all functions in module
- list <module> <pattern>
- List all matching functions in module
- list <module> <function>
- List instrumentable points in function
23 Exploring FSPX
- G77's Fortran Runtime support
- Code compiled with g77 without -g ends up in the DEFAULT_MODULE
- Application Code
- Shared libraries
24 Exploring FSPX 2
- G77's Fortran Runtime support
- Code compiled with g77 without -g ends up in the DEFAULT_MODULE
25 Exploring FSPX 3
Function Calls
26 Use command
- Loads a probe shared library into address space
- (dynaprof) use probe args
- Use by itself displays current probe.
- To change options, respecify probe.
- 4 probes in this release
- Wallclock: Real time clock
- PAPI: Hardware metrics
- Perfometer: Real-time visualization of streaming hardware metrics
27 Instr command
- instr
- List all instrumented functions
- instr module <pattern> arg
- Instrument all functions in modules matching pattern
- instr function <module> <pattern> arg
- Instrument all functions matching pattern in module
28 Threads and Dynaprof Probes
- For threaded code, use the same probe!
- Dynaprof detects threads and loads a special version of the probe library.
- Each probe specifies what to do when a new thread is discovered.
- Each thread gets the same instrumentation.
29 Probe Warning
- Instrumentation is not free.
- Consider granularity of region being measured.
- Overhead for PAPI 2.3 is O(100) cycles.
- Between 500 and 2000 cycles for a 2 counter read.
- Overhead for Wallclock is O(100) cycles.
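The granularity warning can be made concrete by comparing probe cost to the size of the measured region. The sketch below assumes one probe on function entry and one on exit, each costing probe_cycles; the function name and the sample numbers are hypothetical, not measurements.

```python
def probe_overhead_ratio(region_cycles, probe_cycles, probes_per_call=2):
    """Instrumentation cycles relative to the uninstrumented region.

    Assumes probes_per_call probes fire per call (entry and exit by
    default) and that each costs probe_cycles cycles.
    """
    if region_cycles <= 0:
        raise ValueError("region_cycles must be positive")
    return probes_per_call * probe_cycles / region_cycles


if __name__ == "__main__":
    # Illustrative only: a 1000-cycle probe around regions of different sizes.
    for region in (10_000, 100_000, 1_000_000):
        print(region, probe_overhead_ratio(region, 1_000))
```

A ratio near or above 1 means the probes cost as much as the code they measure, so that region is too fine-grained to instrument usefully.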
30 Wallclock Probe
- High resolution, low latency timer
- Usage
- use wallclockprobe
- Reports time in microseconds, 1.0x10^-6 s.
31 PAPI Probe
- Count PAPI Presets or Native Events
- Usage
- use papiprobe event,event,...
- Default argument is PAPI_FP_INS, or PAPI_TOT_INS if the architecture doesn't support it.
- Available events can be obtained by using
- papi_avail -a
32 PAPI Probe and Multiplexing
- Requesting more metrics than there are physical counters automatically enables multiplexing.
- Minimum runtime of instrumented regions must be observed, such that all virtual counters get a chance to run at least once.
- Minimum run time = num_events x .01s
- Automatic warning functionality is being rolled into PAPI.
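The multiplexing rule can be sketched as follows, reading the slide's formula as a minimum run time of num_events times 0.01 s. The function names are hypothetical, and the 0.01 s slice is taken from the slide rather than queried from PAPI.

```python
def needs_multiplexing(num_events, num_counters):
    """True when more events are requested than physical counters exist."""
    return num_events > num_counters


def min_region_runtime(num_events, slice_seconds=0.01):
    """Shortest region run time (seconds) that lets every event be counted once.

    slice_seconds is the per-event time slice; 0.01 s is the value
    given on the slide.
    """
    return num_events * slice_seconds


if __name__ == "__main__":
    events, counters = 8, 2  # e.g. 8 metrics on a 2-counter processor
    if needs_multiplexing(events, counters):
        print("multiplexing on; regions should run at least",
              min_region_runtime(events), "seconds")
```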
33 PAPI Native Events
- Look in the PAPI distribution
- See the README file for your architecture in the src directory
- See the example program tests/native.c in the src/tests directory
34 Power 3 Events
35 Power 3 Events 2
36 Power 4 Events
37 Pentium III Events
38 Intel Pentium IV Events
(Arguments to perfex -e from PerfCtr distribution)
39 Sun UltraSparc II Events
40 Sun UltraSparc III Events
41 MIPS R12K Events
42 Alpha/DADD 21264 Events
seconds to the Perfometer GUI. - Functions can be colored at instrumentation time.
- Default color is white, 0xFFFFFF
- Usage
- use perfometerprobe 0xRRGGBB
- instr ltargsgt lt0xRRGGBBgt
44 Perfometer Probe 2
- Perfometer GUI is NOT launched automatically.
- showrgb in X11 lists colors and names.
- Run the Java GUI
- java -jar Perfometer.jar
- Connect to the specified hostname and port.
45 Instrumenting SWIM with perfometerprobe
46 Instrumenting FSPX for Instructions Per Cycle
47 Instrumenting SWIM for Instructions Per Cycle
48 Reporting Probe Data
- The wallclock and PAPI probes produce very similar data.
- Both use a parsing script written in Perl.
- wallclockrpt <file>
- papiproberpt <file>
- Produce 3 profiles
- Inclusive: Tfunction = Tself + Tchildren
- Exclusive: Tfunction = Tself
- 1-Level Call Tree: Tchild = Inclusive Tfunction
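The three profiles can be illustrated on a small call tree. This is a sketch of the definitions only, not the logic of the Perl report scripts; the function names, the call tree, and the timings are hypothetical, and the call graph is assumed to be a tree with no recursion.

```python
def inclusive(func, self_time, children):
    """Inclusive value: the function's own time plus that of all descendants."""
    return self_time[func] + sum(
        inclusive(child, self_time, children)
        for child in children.get(func, ())
    )


def exclusive(func, self_time):
    """Exclusive value: only the time spent in the function itself."""
    return self_time[func]


def one_level_call_tree(func, self_time, children):
    """Inclusive value of each direct child of func."""
    return {
        child: inclusive(child, self_time, children)
        for child in children.get(func, ())
    }


if __name__ == "__main__":
    # Hypothetical self times (any unit) and caller -> callees edges.
    self_time = {"main": 1.0, "solve": 2.0, "flux": 3.0, "pde": 4.0}
    children = {"main": ["solve"], "solve": ["flux", "pde"]}
    for name in self_time:
        print(name,
              inclusive(name, self_time, children),
              exclusive(name, self_time),
              one_level_call_tree(name, self_time, children))
```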
49 Fspx Cycles and Instrs.
50 fspx IPC
- proflux: 0.61
- phase: 0.63
- flux: 0.49
- pde: 0.46
51 Swim Cycles and Instrs.
52 Swim IPC
- calc2: 0.59
- calc1: 0.53
- calc3: 0.46
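Instructions per cycle is the ratio of completed instructions to elapsed cycles for a region. A minimal sketch, assuming the two counts were collected with the PAPI presets PAPI_TOT_INS and PAPI_TOT_CYC; the helper name and the sample counts are hypothetical.

```python
def ipc(counts):
    """Instructions per cycle from a mapping of PAPI event name to count."""
    cycles = counts["PAPI_TOT_CYC"]
    if cycles == 0:
        raise ValueError("cycle count is zero")
    return counts["PAPI_TOT_INS"] / cycles


if __name__ == "__main__":
    # Hypothetical per-function counts, not taken from the slides.
    profile = {
        "func_a": {"PAPI_TOT_INS": 610, "PAPI_TOT_CYC": 1000},
        "func_b": {"PAPI_TOT_INS": 460, "PAPI_TOT_CYC": 1000},
    }
    # Lowest IPC first: these are the regions making least use of the pipeline.
    for name in sorted(profile, key=lambda n: ipc(profile[n])):
        print(name, round(ipc(profile[name]), 2))
```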
53 Perfometer Screenshot
54 Dynaprof 0.8 SC Release
- Binary distribution for 4 Platforms on the website
- AIX 3.x / DPCL 3.2.5 on Power 3
- Linux / DynInst 3.0 on Pentium <= III
- Solaris 2.8 / DynInst 3.0 on UltraSparc II/III
- IRIX / DynInst 3.0 on MIPS R10/12/14k
- Power 4 and Pentium 4 are coming...
- Xdynaprof Java/Swing GUI included
- perfometerprobe and GUI included
- Updated documentation
55 References
- The Dynaprof Homepage
- http://www.cs.utk.edu/mucci/dynaprof
- The PAPI Homepage
- http://icl.cs.utk.edu/projects/papi
- The DynInst Homepage
- http://www.dyninst.org
- The DPCL Homepage
- http://oss.software.ibm.com/developerworks/opensource/dpcl
- The Vprof Homepage
- http://aros.ca.sandia.gov/cljanss/perf/vprof
- The GNU Readline Homepage
- http://cnswww.cns.cwru.edu/chet/readline/rltop.html