Title: A Dynamic Tracing Mechanism For Performance Analysis of OpenMP Applications - Caubet, Gimenez, Labarta, DeRose, Vetter (WOMPAT 2001)
1 A Dynamic Tracing Mechanism For Performance Analysis of OpenMP Applications
- Caubet, Gimenez, Labarta, DeRose, Vetter (WOMPAT 2001)
- Presented by Anita Nagarajan
2 Introduction
- OpenMP
- Standard for shared memory parallel programming
- Set of directives and library routines for Fortran and C/C++
- Performance Tools
- Need: Analyse parallel behaviour. Determine causes of OpenMP application performance problems.
- Properties: Minimize intrusion cost, maximize performance data captured
3 Introduction (Contd.)
- Dynamic Instrumentation
- Instrument the application while it is executing; recompilation is not required.
- Dynamic Probe Class Library (DPCL)
- Developed at IBM, built on top of the Dyninst API.
- Using DPCL, a performance tool attaches to the application, inserts code patches into the binary, and starts or continues its execution.
- Program instrumentation can be done at function entry points, exit points and call sites.
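The idea of patching entry and exit probes into a running program can be illustrated with a small Python analogy. This is not DPCL or Dyninst code: the helper `instrument`, the `app` namespace and the `work` function are hypothetical names chosen for the sketch, and it patches a Python attribute rather than a binary.

```python
import time
from types import SimpleNamespace


def instrument(namespace, name, events):
    """Replace namespace.<name> with a wrapper that records entry/exit events."""
    original = getattr(namespace, name)

    def probe(*args, **kwargs):
        events.append(("entry", name, time.perf_counter()))
        try:
            return original(*args, **kwargs)
        finally:
            # The exit probe runs even if the original function raises.
            events.append(("exit", name, time.perf_counter()))

    setattr(namespace, name, probe)
    return original  # handle that lets the caller remove the probe later


def work(n):
    return sum(i * i for i in range(n))


# Stand-in for the running application (hypothetical).
app = SimpleNamespace(work=work)

events = []
before = app.work(10)                      # runs uninstrumented
original = instrument(app, "work", events) # "attach" and insert the probes
after = app.work(10)                       # same call, now traced
app.work = original                        # remove the probes again
```

The application code itself is never edited or rebuilt; only the entry point it is reached through changes while the program is running.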
4 DPCL
- DPCL consists of
- Client library
- Runtime library
- Daemon
- Super-daemon
5 OMPtrace
- Built on top of DPCL
- IBM compiler translates OpenMP directives into
function calls.
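As a rough illustration of what "translating a directive into function calls" means, the sketch below outlines the body of a parallel loop into its own function and hands it to a runtime entry point. It is a Python analogy under stated assumptions: `outlined_region` and `run_parallel` are hypothetical names, not the functions the IBM compiler or its runtime actually use.

```python
import threading


def outlined_region(thread_id, num_threads, data, partial):
    """Loop body outlined into a function; each thread sums one contiguous chunk."""
    chunk = (len(data) + num_threads - 1) // num_threads
    lo = thread_id * chunk
    hi = min(lo + chunk, len(data))
    partial[thread_id] = sum(data[lo:hi])


def run_parallel(region, num_threads, *args):
    """Hypothetical stand-in for the runtime call that starts a parallel region."""
    threads = [
        threading.Thread(target=region, args=(tid, num_threads, *args))
        for tid in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # implicit barrier at the end of the region


data = list(range(100))
partial = [0] * 4
run_parallel(outlined_region, 4, data, partial)
total = sum(partial)
```

Because the parallel region becomes ordinary function calls, a tool that can instrument function entry and exit points can observe where each region starts and ends.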
6 Translation of OpenMP Directives
7 OMPtrace
8 OMPtrace (Contd.)
9 OMPtrace (Contd.)
- Hardware Counters
- OMPtrace can access hardware counters and provide statistics of the hardware events.
- E.g. L1/L2 hits, L1/L2 misses, number of instructions
- Paraver
- Computes derived metrics from hardware events.
- E.g. L1 misses per second
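A derived metric such as "L1 misses per second" is a counter difference divided by a time difference. The sketch below shows that arithmetic on made-up sample values; the function name `rate_per_second` and the sample data are illustrative assumptions, not Paraver's interface or measured results.

```python
def rate_per_second(samples):
    """samples: list of (time_in_seconds, cumulative_event_count), ordered by time.

    Returns the event rate for each interval between consecutive samples.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rates.append((c1 - c0) / (t1 - t0))
    return rates


# Illustrative values only: cumulative L1 miss counts read at four points in time.
samples = [(0.0, 0), (0.5, 1000), (1.0, 4000), (2.0, 5000)]
rates = rate_per_second(samples)
```

Here the middle interval stands out with the highest miss rate, which is the kind of pattern a derived-metric view is meant to expose.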
10 Case Study: Sweep3D
- Multidimensional wavefront algorithm for
discrete ordinates deterministic particle
transport simulation.
11 Sweep3D (Contd.)
- diag: original version of Sweep3D
- mkj: do idiag and do jkm loops replaced by a triple nested loop (do m, do k, do j)
- ccrit: based on mkj, outer loop parallelized, synchronization implemented using the CRITICAL directive.
- cpipe: based on mkj, outer loop parallelized, synchronization implemented using shared arrays and busy waiting.
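The cpipe style of synchronization, a shared array of progress counters polled in a busy-wait loop, can be sketched as follows. This is a minimal Python analogy, not the Sweep3D source: the names `done`, `stage`, `NUM_THREADS` and `NUM_BLOCKS` are hypothetical, and the dependence pattern (thread t may process block b only after thread t-1 has finished block b) is an assumed simplification of a pipelined sweep.

```python
import threading
import time

NUM_THREADS = 3
NUM_BLOCKS = 4

done = [0] * NUM_THREADS  # shared array: done[t] = blocks completed by thread t
log = []                  # order in which (thread, block) pairs were processed


def stage(tid):
    for block in range(NUM_BLOCKS):
        if tid > 0:
            # Busy wait until the previous thread has finished this block.
            while done[tid - 1] <= block:
                time.sleep(0)  # yield so the other threads can make progress
        log.append((tid, block))  # stands in for the real computation
        done[tid] = block + 1     # publish progress to the next thread


workers = [threading.Thread(target=stage, args=(tid,)) for tid in range(NUM_THREADS)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Each thread waits only on its predecessor's counter, so different threads can work on different blocks at the same time, whereas a single critical section serializes every thread that reaches it.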
12 Results from Experiments
version      1      2      3      4      5      6     12
Ccrit    28.26  24.41  26.84  26.47  29.28  30.34  30.43
Cpipe    25.63  18.45  13.01  12.53  10.06   7.67   7.76
Diag     17.28  13.09  11.40   9.64   8.50   7.78   6.55
Elapsed time in seconds for the different OpenMP versions
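The scaling behaviour in the table is easier to compare as speedup relative to the first column. The sketch below does that arithmetic, assuming the column headers (1 to 12) are processor or thread counts, which the slide does not state explicitly.

```python
# Elapsed times in seconds, copied from the table above.
counts = [1, 2, 3, 4, 5, 6, 12]  # assumed to be processor/thread counts
elapsed = {
    "Ccrit": [28.26, 24.41, 26.84, 26.47, 29.28, 30.34, 30.43],
    "Cpipe": [25.63, 18.45, 13.01, 12.53, 10.06, 7.67, 7.76],
    "Diag": [17.28, 13.09, 11.40, 9.64, 8.50, 7.78, 6.55],
}


def speedup(times):
    """Speedup of each column relative to the first column."""
    return [round(times[0] / t, 2) for t in times]


speedups = {version: speedup(times) for version, times in elapsed.items()}

for version, values in speedups.items():
    print(version, dict(zip(counts, values)))
```

On these numbers Ccrit ends below a speedup of 1 in the last column (it is slower than its own first-column time), while Cpipe and Diag reach roughly 3.3 and 2.6.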
13 Analysis of Results using Paraver
- Ccrit
- Not scalable
- Overhead of mutex lock and unlock, contention
Trace colours: Red = trying to obtain lock; Blue = using lock; Green = releasing lock; Light blue = execution outside critical section
14
- Cpipe
- Better performance than ccrit.
- Poor locality because the m loop has an
iteration count of 6.
15
- Diag
- Limited scalability due to high number of L2 misses
Trace colours: Blue = large values; Green = low values
16 Optimization
- kjmi
- Interchange loops
- Good performance, better scalability
version      1      2      3      4      5      6     12
kjmi     14.86  10.01   7.35   5.82   4.89   3.62   2.88
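To put the kjmi numbers in context, the sketch below divides the Diag times from the earlier results slide by the kjmi times, column by column. Only the figures shown on the slides are used; the variable names are illustrative.

```python
# Elapsed times in seconds, copied from the two results tables.
columns = [1, 2, 3, 4, 5, 6, 12]
diag = [17.28, 13.09, 11.40, 9.64, 8.50, 7.78, 6.55]
kjmi = [14.86, 10.01, 7.35, 5.82, 4.89, 3.62, 2.88]

# Ratio > 1 means kjmi is faster than diag in that column.
improvement = [round(d / k, 2) for d, k in zip(diag, kjmi)]

for col, ratio in zip(columns, improvement):
    print(f"{col:>2}: kjmi is {ratio}x faster than diag")
```

The ratio is above 1 in every column and is largest in the last one, which matches the slide's claim of both good performance and better scalability.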
17 Conclusions
- OMPtrace and Paraver form a useful tool for
performance analysis and optimization of OpenMP
applications.