Title: Performance Monitoring Tools on TCS
1. Performance Monitoring Tools on TCS
- Roberto Gomez and Raghu Reddy
- Pittsburgh Supercomputing Center
- David O'Neal
- National Center for Supercomputing Applications
2. Objective
- Measure single PE performance
  - Operation counts, wall time, MFLOP rates
  - Cache utilization ratio
- Study scalability
  - Time spent in MPI calls vs. computation
  - Time spent in OpenMP parallel sections
3. Atom Tools
- atom(1)
- Various tools
- Low overhead
- No recompiling or re-linking in some cases
4. Useful Tools
- Flop2
  - Floating point operations count
- Timer5
  - Wall time (inclusive/exclusive) per routine
- Calltrace
  - Detailed statistics of calls and their arguments
- Developed by Dick Foster at Compaq
5Instrumentation
- setenv ATOMTOOLPATH rreddy/Atom/Tools
- nm g a.out awk if(5T) print 1 gt
routines - Edit routines
- place main routine first
- remove unwanted ones.
- Instrument executable
- cat routines atom tool flop2 a.out
- cat routines atom tool timer5 a.out
- Execute a.out.flop2,timer5 to create fprof.
and tprof.
6. Single PE Performance Analysis
- Sample Timer5 output file:

    Procedure              Calls    Self Time   Total Time
    null_evolnull_j_        3072     60596709     79880903
    null_ethnull_d1_       72458     45499161     45499161
    null_hyper_unull_u_     3328     39889655     44500045
    null_hyper_wnull_w_     3328     19195271     33769541
    ...                      ...          ...          ...
    Total                1961226    248258934    248258934
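- For example, null_evolnull_j_ shows 79880903 of Total Time but only 60596709 of Self Time; the difference of about 19.3 million time units is spent in the routines it calls, so Self Time is the exclusive figure to use when attributing cost to a routine's own code.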
7. Single PE Performance Analysis
- Sample Flop2 output file:

    Procedure              Calls           Fops
    null_evolnull_j_        3072    20406036288
    null_ethnull_d1_       72458    20220926518
    null_hyper_unull_u_     3328    14062774258
    null_hyper_wnull_w_     3328     3823795456
    ...                      ...            ...
    Total                1936818    70876179927

- Obtain MFLOPS = Fops / (Self Time)
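- For example, dividing the Fops entry by the Self Time entry for the top routine gives 20406036288 / 60596709 ≈ 337; if the Timer5 times are reported in microseconds (an assumption to verify against the tool's output), null_evolnull_j_ runs at roughly 337 MFLOPS.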
8. MPI calltrace
- setenv ATOMTOOLPATH rreddy/Atom/Tools
- cat rreddy/Atom/mpicalls | atom -tool calltrace a.out
- Execute a.out.calltrace to generate one trace file per PE
- Gather timings for the desired MPI routines (see the sketch below)
- Repeat for increasing numbers of processors
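- A hypothetical example of the gathering step, assuming the per-PE trace files are named a.out.calltrace.* and that each MPI routine's summary line ends with its accumulated time (check the actual calltrace file names and format):

    grep -h 'MPI_' a.out.calltrace.* | \
        awk '{ t[$1] += $NF } END { for (r in t) printf "%-16s %12.3f\n", r, t[r] }'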
9Sample calltrace statistics
Number of processors 8 PEs 128 PEs 256
PEs Processor grid 2x2x2 8x4x4
8x8x4 Total Run time 277.028 314.857
422.170 MPI_ISEND Statistics 1.250
1.498 2.265 MPI_RECV Statistics 4.349
19.779 26.537 MPI_WAIT Statistics
9.172 16.311 20.150 MPI_ALLTOALL Statistics
5.072 9.433 12.894 MPI_REDUCE Statistics
0.013 0.162 0.002 MPI_ALLREDUCE
Statistics 0.391 2.073 10.313 MPI_BCAST
Statistics 0.061 1.135
1.382 MPI_BARRIER Statistics 14.959 28.694
62.028 _________________________________________
___________ Total MPI Time 35.267
79.085 135.571
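- From these figures, MPI accounts for roughly 13% of the run time on 8 PEs (35.267 / 277.028), 25% on 128 PEs (79.085 / 314.857), and 32% on 256 PEs (135.571 / 422.170), with MPI_RECV and MPI_BARRIER contributing most of the growth.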
10. Calltrace timings graph
11. DCPI
- Digital Continuous Profiling Infrastructure
- Daemon and profiling utilities
- Very low overhead (1-2%)
- Aggregate or per-process data and analysis
- No code modifications
- Requires interactive access to compute nodes
12. DCPI Example
- Driver script (e.g. a PBS job)
  - Creates map file and host list
  - Calls daemon and profiling scripts
- Daemon startup script
  - Starts daemon with selected options
- Daemon shutdown script
  - Halts daemon
- Profiling script
  - Executes post-processing utility with selected options
13. DCPI Driver Script
- PBS job file
  - dcpi.pbs
- Creates map file and host list
  - Image map generated by dcpiscan(1)
  - Host list used by dsh(1) commands
- Executes daemon and profiling scripts
  - Start daemon, run test executable, stop daemon, post-process
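- A hypothetical skeleton of dcpi.pbs (the PBS directives, the exact dcpiscan(1) arguments, and the dsh(1) fan-out are site specific and left as comments; the script names and the MAP/WORK/EXE variables follow the slides):

    #!/bin/csh
    # dcpi.pbs -- hypothetical driver skeleton (PBS directives omitted)
    set EXE  = a.out
    set WORK = $PBS_O_WORKDIR
    set MAP  = ${WORK}/${EXE}.map
    cd $WORK
    # 1. build the image map with dcpiscan(1) and the host list for dsh(1)
    # 2. start the daemon on every compute node:
    #      dsh <host list> $WORK/dcpi_start.csh $MAP $WORK $EXE
    # 3. run the test executable as usual
    # 4. stop the daemon on every node:
    #      dsh <host list> $WORK/dcpi_stop.csh
    # 5. post-process the profile database:
    #      $WORK/dcpi_post.csh $MAP $WORK $EXE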
14. DCPI Startup Script
- C shell script
  - dcpi_start.csh
- Three arguments defined by the driver job
  - MAP, WORK, EXE
- Creates database directory (DCPIDB)
  - Derived from WORK and the hostname
- Starts the dcpid(1) process
  - Events of interest are specified here
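- A hypothetical sketch of dcpi_start.csh, assuming dcpid(1) accepts the database directory as an argument (as in the DCPI quick-start examples); the event-selection options mentioned above would be added to the dcpid line:

    #!/bin/csh
    # dcpi_start.csh -- hypothetical startup sketch
    set MAP  = $1            # image map (passed by the driver, unused here)
    set WORK = $2
    set EXE  = $3            # test executable (passed by the driver, unused here)
    # database directory derived from WORK and the host name
    setenv DCPIDB ${WORK}/dcpidb.`hostname`
    mkdir -p $DCPIDB
    chmod 755 $DCPIDB
    # start the daemon in the background with the events of interest
    dcpid $DCPIDB &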
15. DCPI Stop Script
- C shell script
  - dcpi_stop.csh
- No arguments
- dcpiquit(1) flushes buffers and halts the daemon process
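- A minimal sketch of dcpi_stop.csh, using only the dcpiquit(1) call described above:

    #!/bin/csh
    # dcpi_stop.csh -- flush sample buffers and halt the daemon
    dcpiquit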
16. DCPI Profiling Script
- C shell script
  - dcpi_post.csh
- Three arguments defined by the driver job
  - MAP, WORK, EXE
- Determines database location (as before)
- Uses dcpiprof(1) to post-process database files
  - Profile selection(s) must be consistent with daemon startup options
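- A hypothetical sketch of dcpi_post.csh, assuming the DCPI tools locate the database through the DCPIDB environment variable and that naming the executable image on the dcpiprof(1) command line yields a per-procedure breakdown; any event selections must match those passed to dcpid(1) in dcpi_start.csh:

    #!/bin/csh
    # dcpi_post.csh -- hypothetical post-processing sketch
    set MAP  = $1
    set WORK = $2
    set EXE  = $3
    # same database location that dcpi_start.csh created
    setenv DCPIDB ${WORK}/dcpidb.`hostname`
    # per-procedure profile of the test executable; add options matching
    # the events requested at daemon startup
    dcpiprof ${WORK}/${EXE}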
17. Common DCPI Problems
- Login denied (dsh)
  - Requires permission to log in on compute nodes
- Start the daemon in the background
- Set the filemode of the DCPIDB directory correctly
  - chmod 755 DCPIDB
- Mismatches between startup configuration and profiling specs
  - See dcpid(1), dcpiprof(1), and dcpiprofileme(1)
18. Summary
- Low-level interfaces provide access to hardware counters
  - Very effective, but require experience
- Minimal overhead costs
- Report timings, flop counts, MFLOP rates for user code and library calls, e.g. MPI
- More information available, e.g. message sizes, time variability, etc.