Title: Parallel Performance Wizard: Measurement Module
1. Parallel Performance Wizard: Measurement Module
- Professor Alan D. George, Principal Investigator
- Mr. Hung-Hsun Su, Sr. Research Assistant
- Mr. Adam Leko, Sr. Research Assistant
- Mr. Bryan Golden, Research Assistant
- Mr. Hans Sherburne, Research Assistant
- Mr. Max Billingsley, Research Assistant
- Mr. Josh Hartman, Undergraduate Volunteer
- HCS Research Laboratory
- University of Florida
2. Measurement Topics Outline
- Measurement module overview
- Overview of measurement process
- I/O library overview (file formats)
- Clock synchronization algorithms
- Summary of PAPI support
- Preliminary overhead measurements
- Upcoming work
3. Measurement Module Overview
- Purpose of module
  - Record performance information at runtime based upon the analysis that the user requests
- Types of measurements
  - Profiling
    - Records statistical information about execution time or hardware counter values
    - Relates information to basic blocks (functions, upc_forall loops) in source code
  - Tracing
    - Records a full log of when events happen at runtime and how long they take
    - Gives very complete information about what happened at runtime
  - Sampling
    - Special low-overhead mode of profiling that attributes performance information via indirect measurement (samples)
    - Optional feature
4. Measurement Module Architecture
5. Measurement Data: What is Recorded?
- Basic information for each run (illustrative structs below)
  - Command-line arguments (if available)
  - Snapshot of environment variables
- Thread information: rank, pid, hostname
- Region information: filename, line, description
  - A region is an abstraction of a code block such as a function or upc_forall loop
  - Regions are used to relate performance information to a particular block of user code
- Callsite information: filename, line, column
  - Relates performance information to a particular line of code
- Metric information: name
  - Used to signify the metric used for measurements (e.g., hardware counters such as # of L2 misses)
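To make this data model concrete, the sketch below shows how such definition records might be declared in C. All struct and field names (ppw_thread_info, etc.) are illustrative assumptions; the actual record layouts are documented in the PPW source code.

/* Illustrative definition records for the data listed above; names and
 * field sizes are assumptions, not PPW's actual declarations. */
typedef struct {
    int  rank;             /* thread rank */
    int  pid;              /* OS process ID */
    char hostname[64];     /* node the thread ran on */
} ppw_thread_info;

typedef struct {
    int  id;               /* referenced by trace/profile records */
    char filename[256];    /* source file containing the region */
    int  line;             /* first line of the code block */
    char description[64];  /* e.g., "function" or "upc_forall loop" */
} ppw_region_info;

typedef struct {
    int  id;
    char filename[256];
    int  line, col;        /* exact source location of the call */
} ppw_callsite_info;

typedef struct {
    int  id;
    char name[64];         /* e.g., "PAPI_L2_TCM" for # of L2 misses */
} ppw_metric_info;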
6. Measurement Data: Trace Records
- All trace records are associated with a (see the record sketch below)
  - Time value (globally synchronized)
  - Thread
  - Event ID
  - Begin/end flag
- Depending upon the type of event, a record also has
  - Communication information
  - Metric information (hardware counter values)
  - Other language- and function-specific values
- Most of the SHMEM function type variants are mapped to a single event + data size
  - e.g., shmem_put_float and shmem_put_double are mapped to shmem_put with a size argument
  - Significantly simplifies analysis code (200 functions mapped down to a handful without losing information)
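A hedged sketch of what a fixed-size trace record with these fields might look like in C; the layout and field names are illustrative assumptions, not PPW's actual on-disk format (which is documented in the PPW source, per slide 18).

#include <stdint.h>

/* Illustrative trace record built from the fields listed above */
typedef struct {
    uint64_t timestamp;   /* globally synchronized time value */
    uint32_t thread;      /* rank of the recording thread */
    uint32_t event_id;    /* one ID for all variants, e.g. shmem_put */
    uint8_t  begin_flag;  /* 1 = event begin, 0 = event end */
    /* Event-dependent payload: */
    uint32_t peer;        /* communication partner thread, if any */
    uint64_t size;        /* data size (shmem_put_double -> 8 bytes) */
    uint64_t counter;     /* hardware counter value (metric info) */
} ppw_trace_record;       /* ~40 bytes, the per-record size cited on slide 13 */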
7. Measurement Data: Profile Statistics
- Profile statistics are collected for regions only
  - User regions are handled normally
  - For other language-specific statistics, stats are related to special regions
    - upc_get
    - shmem_get
    - etc.
- Each statistic has (see the sketch below)
  - # of times that region was called
  - # of subroutine calls (another region was entered inside this region)
  - Min/max/sum of inclusive time spent in region
  - Sum of exclusive time spent in region
  - Sum of (exclusive time)² spent in region (for standard deviation calculations)
  - Metric this statistic corresponds to
  - Callpath it is associated with
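These fields map naturally onto a small accumulator struct. The following is a minimal sketch in C; ppw_profile_stat and PPW_MAX_DEPTH are assumed names for illustration, not PPW's actual identifiers.

#include <stdint.h>

#define PPW_MAX_DEPTH 10   /* full-callpath depth limit (see slide 10) */

typedef struct {
    uint64_t count;                   /* # of times region was called */
    uint64_t subcalls;                /* # of regions entered inside it */
    uint64_t incl_min, incl_max;      /* min/max inclusive time */
    uint64_t incl_sum;                /* sum of inclusive time */
    uint64_t excl_sum;                /* sum of exclusive (self) time */
    double   excl_sumsq;              /* sum of (exclusive time)^2 */
    int      metric_id;               /* metric this statistic measures */
    int      callpath[PPW_MAX_DEPTH]; /* region IDs, root first */
    int      depth;                   /* 1 = flat ... 10 = full */
} ppw_profile_stat;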
8. Measurement Data: Profile Statistics (2)
- Inclusive vs. exclusive time
  - Inclusive time: time spent inside a region, including any subroutine calls
  - Exclusive time: time spent inside a region, excluding any subroutine calls (= self time)
- Tracking both inclusive and exclusive time helps differentiate between
  - Slow regions that do most of the work themselves
  - Regions that call other regions that take a long time
- Min/max/average (derived from sum and count) inclusive time helps determine if a region's time differs between invocations
- Standard deviation of exclusive time also helps determine this (derived below from the sums kept at runtime)
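Both the average and the standard deviation can be derived at analysis time from the runtime accumulators alone, via Var(x) = E[x²] − (E[x])². A sketch, reusing the illustrative ppw_profile_stat struct from slide 7:

#include <math.h>

/* Mean and standard deviation of exclusive time derived purely from
 * count, sum, and sum of squares; no per-invocation samples stored. */
static double ppw_excl_mean(const ppw_profile_stat *s) {
    return s->count ? (double)s->excl_sum / (double)s->count : 0.0;
}

static double ppw_excl_stddev(const ppw_profile_stat *s) {
    if (s->count == 0)
        return 0.0;
    double mean = ppw_excl_mean(s);
    double var  = s->excl_sumsq / (double)s->count - mean * mean;
    return var > 0.0 ? sqrt(var) : 0.0;  /* clamp tiny negative rounding */
}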
9. Measurement Data: Callpaths for Stats
- Based upon our tool evaluations last year, we picked three different methods for relating stats to regions of code
- Flat callpaths
  - Each statistic has a callpath of depth 1
  - Results in one statistic per region
  - e.g., will give the user the average and total time for all upc_memgets in their program
  - Can be efficiently recorded
- Parent callpaths
  - Each statistic has a callpath of depth 2: region, enclosing region
  - Results in several statistics per region, giving contextual information
  - e.g., will give the user the averages and total times for all upc_memgets in their program, including which region called that upc_memget
  - Can be fairly efficiently recorded
10. Measurement Data: Callpaths for Stats (2)
- Full callpaths
  - Callpaths are taken from the root down, up to 10 levels
  - Gives the user several statistical records per region, with a lot of contextual information
  - Can result in a large volume of stat records, but size usage is very modest compared to tracing
  - Naturally filters out deep function calls
  - Much more expensive to record at runtime (see the sketch below)
    - Have to do up to 10 comparison operations per timer struct lookup
    - Increased memory usage due to the larger possible # of callpaths
- Different profile modes are provided so the user can balance the amount of information recorded vs. runtime overhead
  - Represents a wider spectrum of information-vs.-overhead possibilities than just profiling + tracing
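The per-lookup cost quoted above comes from key matching: with full callpaths, every probe of a timer struct may compare up to 10 stored region IDs, versus a single comparison for flat callpaths. A hedged sketch, reusing the illustrative ppw_profile_stat from slide 7:

/* Full-callpath key matching: up to PPW_MAX_DEPTH (10) comparisons per
 * timer-struct probe */
static int ppw_callpath_equal(const ppw_profile_stat *s,
                              const int *path, int depth) {
    if (s->depth != depth)
        return 0;
    for (int i = 0; i < depth; i++)   /* up to 10 comparison operations */
        if (s->callpath[i] != path[i])
            return 0;
    return 1;
}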
11. Measurement Data: Visual Representation
- Simplified ER diagram (syntax severely abused for clarity's sake)
12. Measurement Process: Overview
- Goal: keep runtime overhead as low as possible
- Tracing
  - Each thread (execution unit) keeps a buffer of event records in raw format
  - Buffer is flushed to disk when full
  - On merge, raw data is sent to the merge master, which outputs data in final (sorted) format
- Profiling
  - Profile data is kept in memory inside timer structs
  - Each timer struct is associated with a region of code (user or system) and a callpath
- Side note: a requirement of our project is to provide switching between tracing + profiling without recompilation
13. Trace Implementation Details
- Importance of explicit buffering
  - Disk behavior is similar to network overhead
    - For small writes, seek time dominates data transfer time
    - Even for modern (expensive) disks, seek times range in the milliseconds
    - Cannot afford a seek on each trace entry
  - Most OSs will buffer small writes if free RAM is available
    - But most HPC applications slurp up all available RAM
  - Even a modest buffer size (64 KB) will buffer over 1,500 trace records at 40 bytes/record
- Buffer size will be user-tunable
  - Need to do extensive testing to determine the effect of buffer size on tracing overhead
  - Preliminary data seems to indicate that unless dealing with huge records, trace buffers of > 1 MB have limited benefits
- Additional techniques for overhead reduction (will implement as needed)
  - Use non-blocking I/O (aio_write, aio_read) with double buffering (sketched below)
  - Alternatively, use a producer/consumer circular buffer with pthreads
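A minimal sketch of the aio_write double-buffering idea under stated assumptions: a fixed 64 KB buffer and an illustrative entry point (trace_append is not a PPW function). Error handling is elided; on Linux this links with -lrt.

#include <aio.h>
#include <errno.h>
#include <string.h>

#define BUF_SIZE (64 * 1024)  /* user-tunable; 64 KB default from above */

static char         bufs[2][BUF_SIZE];
static size_t       used;       /* bytes filled in the active buffer */
static int          active;     /* index of the buffer being filled */
static off_t        file_off;   /* next write offset in the trace file */
static struct aiocb cb;         /* the in-flight flush request */
static int          in_flight;  /* has a flush been submitted yet? */

static void trace_append(int fd, const void *rec, size_t len) {
    if (used + len > BUF_SIZE) {
        /* Wait for the previous flush so its buffer can be reused */
        if (in_flight) {
            const struct aiocb *list[1] = { &cb };
            while (aio_error(&cb) == EINPROGRESS)
                aio_suspend(list, 1, NULL);
        }
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = bufs[active];
        cb.aio_nbytes = used;
        cb.aio_offset = file_off;
        aio_write(&cb);         /* kernel drains this buffer...        */
        in_flight = 1;
        file_off += used;
        active    = 1 - active; /* ...while we keep filling the other */
        used      = 0;
    }
    memcpy(bufs[active] + used, rec, len);
    used += len;
}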
14. Profile Implementation Details
- As with tracing, low overhead is key
- On each region entry, need to look up the associated timer object
  - Need a very efficient data structure for lookups of a moderate (100s to 1000s) number of timer objects
  - Hashing based upon region type (upc_get, upc_put, etc.) is currently used (see the sketch below)
    - Provides very efficient lookups when there is a small number of timers per event type
  - A separate hash is used for user events
  - Hash overflow buckets are handled with a simple array using a modest number of pre-allocated buckets
- Further possible performance optimizations
  - Use separate hash tables for very common events (UPC implicit communication region types)
  - Use more efficient lookup structures for overflow buckets (AVL trees)
  - Do cycle trimming on the callpath-matching code
  - Need further testing with realistic applications to identify bottlenecks first (avoid premature optimization)
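A sketch of this lookup scheme under stated assumptions: the event type indexes the hash directly, and timers that collide (e.g., the same event type with different callpaths) are chained through a small pre-allocated overflow pool. Names and sizes are illustrative.

#define PPW_NUM_EVENT_TYPES 64  /* upc_get, upc_put, shmem_get, ... */

typedef struct ppw_timer {
    int               event_type;
    int               callpath_key;  /* stand-in for full callpath match */
    struct ppw_timer *next;          /* chain through pre-allocated pool */
    /* ... accumulated profile statistics ... */
} ppw_timer;

static ppw_timer *buckets[PPW_NUM_EVENT_TYPES];

/* On region entry: one direct probe, plus a short scan only when several
 * timers share an event type (the common case is a chain of length 1). */
static ppw_timer *timer_lookup(int event_type, int callpath_key) {
    for (ppw_timer *t = buckets[event_type]; t != NULL; t = t->next)
        if (t->callpath_key == callpath_key)
            return t;
    return NULL;  /* caller allocates a timer from the overflow pool */
}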
15. I/O Library: File Formats
- We spent much time looking at existing trace formats
  - Trying to avoid reinventing the wheel!
- SLOG-2
  - Native format used by Jumpshot
  - Made to be very efficient for visualizing traces
  - However, the graphics-based format loses a lot of performance data
  - Format is also expensive to write
    - MPICH uses CLOG2 as an intermediary file format
    - The Jumpshot viewer converts CLOG2 files semi-automatically when a trace is opened for the first time
16. I/O Library: File Formats (2)
- STF (Intel Trace Collector)
  - Seems to be an efficient format, but is not open (undocumented, property of Intel)
- VTF3 (Vampir's format)
  - VTF has a free (but not open) library with support for many platforms
  - VTF library is not designed to be used at runtime
    - Rather, it is geared towards analysis + conversion utilities
- OTF: new trace library supported by Vampir-NG and TAU
  - OTF trace format was created under contract with LLNL
  - Open (documented) format; uses a lot of techniques to support massively parallel traces (1000s of processors)
  - Library is open source (http://www.tu-dresden.de/zih/otf)
  - Format does not have specific support for one-sided memory operations
  - Has become available only very recently (past few months)
17. I/O Library: File Formats (3)
- Unnamed trace format for IBM SP systems
  - Described in the paper by Wu et al., "From Trace Generation to Visualization: A Performance Framework for Distributed Parallel Systems"
  - No implementation available, but has some good ideas (framing + continuation records for efficient access)
- EPILOG
  - Trace format used by the KOJAK suite
  - Open source + documented
  - Portable (support for many platforms)
  - No explicit support for UPC operations
- Format survey findings
  - OTF and EPILOG were the most promising
  - Neither supports both UPC (lacking in EPILOG) and one-sided memory operations (lacking in OTF)
  - The analysis module needs this information, so we cannot use the libraries as-is
  - Due to the lack of a perfect candidate, we created our own simple trace format that takes the best ideas from each of the other libraries
18. I/O Library: PPW Format
- File consists of two pieces (separate files)
  - File header containing definition records and statistic information (filename.ppw)
  - Trace file containing trace records + a very short header (filename.ppw.0)
- Trace definition records + statistics are kept separate from the trace file
  - Can easily transport the header file around for initial viewing by visualization tools
- Gory details of the file format are documented in the PPW source code
  - Very simple overall file format, basically a dump of the information outlined on slides 5-11 in network byte-order (big-endian) format (see the sketch below)
  - File format is easily ported to other trace formats such as OTF, EPILOG, and VTF
    - A more featured trace conversion module is in the works
- Depending upon the needs of the analysis and presentation modules, will add more scalable features to the format
  - In particular, have looked closely at trace file framing and continuation records to efficiently support reading portions of a trace file without losing information
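For illustration, dumping in network byte order only requires a byte swap on little-endian hosts; a minimal sketch using the standard htonl() (the helper names are assumptions, not PPW's API):

#include <arpa/inet.h>  /* htonl() */
#include <stdint.h>
#include <stdio.h>

/* Write a 32-bit value in network byte order (big-endian), so the file
 * reads identically regardless of the producing host's endianness. */
static void write_u32_be(FILE *f, uint32_t v) {
    uint32_t be = htonl(v);
    fwrite(&be, sizeof be, 1, f);
}

/* 64-bit values as two big-endian words, high word first */
static void write_u64_be(FILE *f, uint64_t v) {
    write_u32_be(f, (uint32_t)(v >> 32));
    write_u32_be(f, (uint32_t)v);
}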
19. Clock Synchronization Algorithms
- In order for analysis and visualization to be useful and accurate, clocks need to be globally synchronized
- PPW uses a simple linear drift model
  - One clock is designated as the master clock
  - Parameters recorded for slave clocks
    - Local time 1 (t1), master time 1 (mt1)
    - Local time 2 (t2), master time 2 (mt2)
  - Based upon these parameters, slave timestamps are mapped to the master's time by the linear equation (see the sketch below)
    - globaltime = m*x + b
    - m = (mt2 - mt1) / (t2 - t1)
    - b = mt1 - m*t1
  - Timestamp values are adjusted during the merge phase
- F. Cristian's algorithm is used to estimate the remote clock's actual value
  - Probabilistic remote clock reading algorithm; very popular and straightforward to implement
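The linear mapping above, written out as code: the two (local, master) sample pairs collected for each slave clock give the slope m and offset b, which are then applied to every slave timestamp during the merge phase. A minimal sketch with assumed names:

typedef struct {
    double m;  /* relative drift: (mt2 - mt1) / (t2 - t1) */
    double b;  /* offset: mt1 - m*t1 */
} clock_model;

/* Fit the two-parameter model from two (local, master) time pairs */
static clock_model fit_clock(double t1, double mt1,
                             double t2, double mt2) {
    clock_model c;
    c.m = (mt2 - mt1) / (t2 - t1);
    c.b = mt1 - c.m * t1;
    return c;
}

/* Applied to every slave timestamp during the merge phase */
static double to_global_time(const clock_model *c, double local) {
    return c->m * local + c->b;  /* globaltime = m*x + b */
}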
20. Clock Synchronization Algorithms (2)
- Illustration of the linear model (figure): timer values plotted against runtime for Thread 1 and Thread 2, showing the offset (b) between them and their different slopes (m)
21. PAPI: Basic Information
- Name
  - Performance Application Programming Interface (PAPI)
- Developer
  - University of Tennessee, Knoxville
- Current version
  - PAPI 3.2.1
- Website
  - http://icl.cs.utk.edu/papi/
- Contact
  - Phil Mucci
- Collaborations
  - TAU, SCALEA, vprof, SvPablo, PerformanceBench, DynaProf
22. PAPI Implementation
23. HW Support
- Excellent hardware support
- Heterogeneity support
  - PAPI API code is highly portable, but one needs to be cognizant of semantic differences between platforms
24. PPW/PAPI Integration
- Not all metrics are available on each platform
  - More than 100 total PAPI metrics available
  - 42 metrics available for Opteron processors in our open platforms
  - PAPI does not support the Cray X1E
- Software multiplexing is used to extend functionality (i.e., track more metrics than there are physical HW counters)
- Used for PPW modules
  - A Unit: local delay analysis
  - P Unit: visualizations
    - Profiling mode: call tree, statistical visualizations
    - Tracing mode: timeline
- After the call to PAPI_library_init (see the sketch below)
  - PAPI_add_events defines what events PAPI should track
  - PAPI_start/stop_counters starts/stops counting the specified hardware events
  - PAPI_read_counters copies current counts to an array + resets counters
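A short sketch of this call sequence using PAPI's high-level counter API; the two preset events are examples only and, per the availability caveats above, may not exist on every platform.

#include <stdio.h>
#include <papi.h>

int main(void) {
    int       events[2] = { PAPI_TOT_CYC, PAPI_L2_TCM }; /* cycles, L2 misses */
    long_long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    if (PAPI_start_counters(events, 2) != PAPI_OK)  /* begin counting */
        return 1;

    /* ... first measured code region ... */

    PAPI_read_counters(counts, 2);   /* copy current counts + reset */

    /* ... second measured code region ... */

    PAPI_stop_counters(counts, 2);   /* stop; final counts in counts[] */
    printf("cycles=%lld  L2 misses=%lld\n", counts[0], counts[1]);
    return 0;
}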
25. Measurement: Future Work
- Performance tuning of the measurement library
  - Instrument + record data for applications that exercise the trace + profile functions
  - Identify inefficiencies via the valgrind tool
- Proprietary platform support (Cray SHMEM, etc.)
- Add source callsite support for SHMEM wrappers
  - Borrow code from mpiP
- Trace format improvements for efficient analysis/visualization
- Memory profiling minor mode
  - Request from Berkeley UPC users