Parallel Performance Wizard: Measurement Module
1
Parallel Performance Wizard: Measurement Module
  • Professor Alan D. George, Principal Investigator
  • Mr. Hung-Hsun Su, Sr. Research Assistant
  • Mr. Adam Leko, Sr. Research Assistant
  • Mr. Bryan Golden, Research Assistant
  • Mr. Hans Sherburne, Research Assistant
  • Mr. Max Billingsley, Research Assistant
  • Mr. Josh Hartman, Undergraduate Volunteer
  • HCS Research Laboratory
  • University of Florida

2
Measurement Topics Outline
  • Measurement module overview
  • Overview of measurement process
  • I/O library overview (file formats)
  • Clock synchronization algorithms
  • Summary of PAPI support
  • Preliminary overhead measurements
  • Upcoming work

3
Measurement Module Overview
  • Purpose of module
  • Record performance information at runtime based
    upon analysis that the user requests
  • Types of measurements
  • Profiling
  • Record statistical information about execution
    time or hardware counter values
  • Relate information to basic blocks (functions,
    upc_forall loops) in source code
  • Tracing
  • Record full log of when events happen at runtime
    and how long
  • Gives very complete information about what
    happened at runtime
  • Sampling
  • Special low-overhead mode of profiling that
    attributes performance information via indirect
    measurement (samples)
  • Optional feature

4
Measurement Module Architecture
5
Measurement Data: What is Recorded?
  • Basic information for each run
  • Command-line arguments (if available)
  • Snapshot of environment variables
  • Thread information: rank, pid, hostname
  • Region information: filename, line, description
  • Region is an abstraction of a code block such as
    function or upc_forall loop
  • Regions are used to relate performance
    information to a particular block of user code
  • Callsite information: filename, line, col
  • Relates performance information to a particular
    line of code
  • Metric information: name
  • Used to signify the metric used for measurements
    (e.g., hardware counters such as # of L2 misses,
    etc.)

6
Measurement Data: Trace Records
  • All trace records are associated with a
  • Time value (globally synchronized)
  • Thread
  • Event ID
  • Begin/end flag
  • Depending upon type of event, record also has
  • Communication information
  • Metric information (hardware counter values)
  • Other language- and function-specific values
  • Most of the SHMEM function-type variants are
    mapped to a single event + data size
  • e.g., shmem_put_float and shmem_put_double are
    mapped to shmem_put with a size argument
  • Significantly simplifies analysis code (200
    functions mapped down to a handful without losing
    information)
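The variant-collapsing idea above can be sketched as follows; the mapping table and function names are illustrative, not the actual PPW identifiers:

```python
# Illustrative sketch: typed SHMEM variants collapse to one generic
# event plus an element size, so analysis code only has to reason
# about "shmem_put" with a byte count. Not the real PPW tables.

VARIANT_MAP = {
    "shmem_put_float":  ("shmem_put", 4),
    "shmem_put_double": ("shmem_put", 8),
    "shmem_put_int":    ("shmem_put", 4),
    "shmem_get_float":  ("shmem_get", 4),
    "shmem_get_double": ("shmem_get", 8),
}

def normalize(call_name, nelems):
    """Map a typed variant to (generic_event, total_bytes)."""
    event, elem_size = VARIANT_MAP[call_name]
    return event, elem_size * nelems
```

With this scheme, a put of 10 doubles is recorded as a generic put of 80 bytes, so the ~200 typed functions reduce to a handful of events without losing information.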

7
Measurement Data: Profile Statistics
  • Profile statistics are collected for regions only
  • User regions handled normally
  • For other language-specific statistics, stats are
    related to special regions
  • upc_get
  • shmem_get
  • etc.
  • Each statistic has
  • # of times that region was called
  • # of subroutine calls (another region was entered
    inside this region)
  • Min/Max/Sum of inclusive time spent in region
  • Sum of exclusive time spent in region
  • Sum of (exclusive time)² spent in region (for
    standard deviation calculations)
  • Metric this statistic corresponds to
  • A callpath it is associated with
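A minimal sketch of such a statistic record, assuming the fields listed above (names are illustrative, not the actual PPW structs); the sum and sum-of-squares fields are enough to recover the standard deviation later:

```python
# Illustrative per-region statistic record (not the real PPW struct).
import math

class RegionStat:
    def __init__(self):
        self.count = 0            # # of times region was called
        self.subcalls = 0         # # of subroutine calls inside it
        self.incl_min = float("inf")
        self.incl_max = 0.0
        self.incl_sum = 0.0       # min/max/sum of inclusive time
        self.excl_sum = 0.0       # sum of exclusive time
        self.excl_sumsq = 0.0     # sum of (exclusive time)^2

    def record(self, incl, excl):
        self.count += 1
        self.incl_min = min(self.incl_min, incl)
        self.incl_max = max(self.incl_max, incl)
        self.incl_sum += incl
        self.excl_sum += excl
        self.excl_sumsq += excl * excl

    def excl_stddev(self):
        # stddev from count, sum, and sum of squares alone
        mean = self.excl_sum / self.count
        return math.sqrt(max(self.excl_sumsq / self.count - mean * mean, 0.0))
```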

8
Measurement Data: Profile Statistics (2)
  • Inclusive vs. exclusive time
  • Inclusive time: time spent inside a region,
    including any subroutine calls
  • Exclusive time: time spent inside a region,
    excluding any subroutine calls (= self time)
  • Tracking both inclusive and exclusive time helps
    differentiate between
  • Slow regions that do most of the work themselves
  • Regions that call other regions that take a long
    time
  • Min/max/average (derived from sum and count)
    inclusive time help determine if a region's time
    differs between invocations
  • Standard deviation of exclusive time also helps
    determine this variation
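A hedged sketch of the inclusive/exclusive distinction above: on region exit, the child's inclusive time is charged to the child and subtracted from the parent's self time. The event-list representation and names are illustrative, not the PPW implementation:

```python
# Illustrative sketch: derive inclusive and exclusive (self) time per
# region from a stream of enter/exit events.

def exclusive_times(events):
    """events: list of ("enter"/"exit", region, timestamp)."""
    stack, incl, excl = [], {}, {}
    for kind, region, ts in events:
        if kind == "enter":
            stack.append([region, ts, 0.0])   # region, entry time, child time
        else:
            region, entry, child = stack.pop()
            span = ts - entry                 # inclusive span of this call
            incl[region] = incl.get(region, 0.0) + span
            excl[region] = excl.get(region, 0.0) + span - child
            if stack:
                stack[-1][2] += span          # charge to parent as child time
    return incl, excl
```

For example, a region active from t=0 to t=10 that spends t=2..5 inside a subroutine has inclusive time 10 but exclusive time 7, which is exactly what distinguishes slow regions from regions that merely call slow regions.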

9
Measurement Data: Callpaths for Stats
  • Based upon our tool evaluations last year, we
    picked three different methods for relating stats
    to regions of code
  • Flat callpaths
  • Each statistic has a callpath of depth 1
  • Results in one statistic per region
  • e.g., will give user the average and total time
    for all upc_memgets in their program
  • Can be efficiently recorded
  • Parent callpaths
  • Each statistic has a callpath of depth 2:
    (region, enclosing region)
  • Results in several statistics per region giving
    contextual information
  • e.g., will give user the averages and total times
    for all upc_memgets in their program, including
    which region called that upc_memget
  • Can be fairly efficiently recorded

10
Measurement Data: Callpaths for Stats (2)
  • Full callpaths
  • Callpaths are taken from root down, up to 10
    levels
  • Gives user several statistical records per
    region, with a lot of contextual information
  • Can result in a large volume of stat records, but
    size usage is very modest compared to tracing
  • Naturally filters out deep function calls
  • Much more expensive to record at runtime
  • Have to do up to 10 comparison operations per
    timer struct lookup
  • Increased memory usage due to larger possible #
    of callpaths
  • Different profile modes provided so user can
    balance amount of information recorded vs.
    runtime overhead
  • Represents a wider spectrum of information vs.
    overhead possibilities than just profiling &
    tracing
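The three callpath modes described on slides 9-10 can be sketched as key-construction rules, assuming the current call stack is available as a list of region names (root first); everything beyond what the slides state is illustrative:

```python
# Illustrative sketch of flat / parent / full callpath keys for
# storing per-region statistics (not the actual PPW data structures).

MAX_DEPTH = 10  # full callpaths are truncated to 10 levels from the root

def callpath_key(stack, mode):
    """Return the key under which this region's statistic is stored."""
    if mode == "flat":        # depth 1: one statistic per region
        return (stack[-1],)
    if mode == "parent":      # depth 2: region plus its enclosing region
        return tuple(stack[-2:])
    if mode == "full":        # root-down, up to MAX_DEPTH levels,
        return tuple(stack[:MAX_DEPTH])  # naturally filtering deep calls
    raise ValueError(mode)
```

Flat mode yields one key per region (e.g., one statistic covering all upc_memget calls); parent mode splits that by caller; full mode keeps the whole truncated path, trading memory and lookup cost for context.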

11
Measurement Data: Visual Representation
  • Simplified ER diagram (syntax severely abused for
    clarity's sake)

12
Measurement Process Overview
  • Goal: keep runtime overhead as low as possible
  • Tracing
  • Each thread (execution unit) keeps a buffer of
    event records in raw format
  • Buffer is flushed to disk when full
  • On merge, raw data is sent to merge master, which
    outputs data in final (sorted) format
  • Profiling
  • Profile data is kept in memory inside timer
    structs
  • Each timer struct is associated with a region of
    code (user or system) and a callpath
  • Side note: a requirement of our project is to
    provide switching between tracing & profiling
    without recompilation

13
Trace Implementation Details
  • Importance of explicit buffering
  • Disk behavior is similar to network overhead
  • For small writes, seek time dominates data
    transfer time
  • Even for modern (expensive) disks, seek times
    range in the milliseconds
  • Cannot afford a seek on each trace entry
  • Most OSs will buffer small writes if free RAM is
    available
  • But, most HPC applications slurp up all available
    RAM
  • Even a modest buffer size (64KB) will buffer
    over 1500 trace records @ 40 bytes / record
  • Buffer size will be user-tunable
  • Need to do extensive testing to determine effect
    of buffer size on tracing overhead
  • Preliminary data seems to indicate that unless
    dealing with huge records, trace buffers of > 1MB
    have limited benefits
  • Additional techniques for overhead reduction
    (will implement as needed)
  • Use nonblocking I/O (aio_write, aio_read) with
    double buffering
  • Alternatively use producer/consumer and circular
    buffer with pthreads
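The explicit buffering scheme above can be sketched as follows: records accumulate in a fixed-size in-memory buffer and reach disk only in large sequential writes, so individual trace entries never pay a seek. The record layout and class are illustrative, not the actual PPW format:

```python
# Illustrative buffered trace writer (not the real PPW record layout).
import struct

# timestamp (double), thread, event id, begin/end flag; network byte order
RECORD = struct.Struct("!dIIB")

class TraceBuffer:
    def __init__(self, path, capacity=64 * 1024):
        self.fh = open(path, "ab")
        self.capacity = capacity       # user-tunable; 64KB shown here
        self.buf = bytearray()

    def emit(self, ts, thread, event, begin):
        self.buf += RECORD.pack(ts, thread, event, begin)
        if len(self.buf) >= self.capacity:
            self.flush()

    def flush(self):
        self.fh.write(self.buf)        # one large sequential write
        self.buf.clear()

    def close(self):
        self.flush()
        self.fh.close()
```

Raising `capacity` trades memory for fewer flushes, which matches the observation above that buffers beyond about 1MB show diminishing returns.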

14
Profile Implementation Details
  • As with tracing, low overhead is key
  • On each region entry, need to look up associated
    timer object
  • Need a very efficient data structure for lookups
    of a moderate (100s to 1000s) number of timer
    objects
  • Hashing based upon region type (upc_get, upc_put,
    etc) is currently used
  • Provides very efficient lookups when small number
    of timers per event type
  • Separate hash used for user events
  • Hash overflow buckets handled with a simple array
    using a modest number of pre-allocated buckets
  • Further possible performance optimizations
  • Use separate hash tables for very common events
    (UPC implicit communication region types)
  • Use more efficient lookup structures for overflow
    buckets (AVL trees)
  • Do cycle trimming on callpath matching code
  • Need further testing with realistic applications
    to identify bottlenecks (avoiding premature
    optimization)
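The lookup scheme above can be sketched as a hash on the region/event type followed by a linear scan of a small pre-allocated overflow bucket; with only a few timers per event type the scan stays cheap. All names and structures here are illustrative, not the PPW implementation:

```python
# Illustrative timer lookup: hash by event type, small overflow bucket
# scanned linearly for the entry matching the current callpath.

class TimerTable:
    def __init__(self, bucket_size=8):
        self.table = {}              # event type -> list of (callpath, timer)
        self.bucket_size = bucket_size  # modest pre-allocated bucket

    def lookup(self, event_type, callpath):
        bucket = self.table.setdefault(event_type, [])
        for cp, timer in bucket:     # usually only a few entries per type
            if cp == callpath:
                return timer
        timer = {"count": 0, "time": 0.0}
        if len(bucket) >= self.bucket_size:
            # real code would grow the bucket or switch structures (e.g. AVL)
            raise RuntimeError("overflow bucket full")
        bucket.append((callpath, timer))
        return timer
```

Separate tables for very common event types (the UPC implicit-communication regions mentioned above) would shorten the scanned buckets further.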

15
I/O Library: File Formats
  • We spent much time looking at existing trace
    formats
  • Trying to avoid reinventing the wheel!
  • SLOG-2
  • Native format used by Jumpshot
  • Made to be very efficient for visualizing traces
  • However, graphics-based format loses a lot of
    performance data
  • Format is also expensive to write
  • MPICH uses CLOG2 as an intermediary file format
  • Jumpshot viewer converts CLOG2 files
    semi-automatically when trace is opened for the
    first time

16
I/O Library: File Formats (2)
  • STF (Intel Trace Collector)
  • Seems to be an efficient format but is not open
    (undocumented; property of Intel)
  • VTF3 (Vampir's format)
  • VTF has free (but not open) library with support
    for many platforms
  • VTF library is not designed to be used at runtime
  • Rather, it is geared towards analysis &
    conversion utilities
  • OTF: new trace library supported by Vampir-NG and
    TAU
  • OTF trace format was created under contract with
    LLNL
  • Open (documented) format, uses a lot of
    techniques to support massively parallel traces
    (1000s of processors)
  • Library is open source
    (http://www.tu-dresden.de/zih/otf)
  • Format does not have specific support for
    one-sided memory operations
  • Has become available only very recently (past few
    months)

17
I/O Library: File Formats (3)
  • Unnamed trace format for IBM SP systems
  • Described in the paper by Wu et al., "From Trace
    Generation to Visualization: A Performance
    Framework for Distributed Parallel Systems"
  • No implementation available but has some good
    ideas (framing & continuation records for
    efficient access)
  • EPILOG
  • Trace format used by KOJAK suite
  • Open source & documented
  • Portable (support for many platforms)
  • No explicit support for UPC operations
  • Format survey findings
  • OTF and EPILOG were most promising
  • Neither has support for both UPC operations
    (missing from EPILOG) and one-sided memory
    operations (missing from OTF)
  • Analysis module needs this information, so cannot
    use these libraries as-is
  • Due to lack of a perfect candidate, created our
    own simple trace format that takes the best ideas
    from each existing library

18
I/O Library: PPW Format
  • File consists of two pieces (separate files)
  • File header containing definition records and
    statistic information (filename.ppw)
  • Trace file containing trace records + very short
    header (filename.ppw.0)
  • Trace definition records & statistics kept
    separate from trace file
  • Can easily transport the header file around for
    initial viewing by visualization tools
  • Gory details of file format documented in PPW
    source code
  • Very simple overall file format, basically a dump
    of information outlined on slides 5-11 in network
    byte-order format (big-endian)
  • File format is easily ported to other trace
    formats such as OTF, EPILOG, and VTF
  • A more fully featured trace conversion module is
    in the works
  • Depending upon needs of analysis and presentation
    modules, will add more scalable features to
    format
  • In particular, have strongly looked at trace file
    framing and continuation records to efficiently
    support reading portions of a trace file without
    losing information

19
Clock Synchronization Algorithms
  • In order for analysis and visualization to be
    useful and accurate, need clocks to be globally
    synchronized
  • PPW uses a simple linear drift model
  • One clock is designated as the master clock
  • Parameters recorded for slave clocks
  • Local time 1 (t1), master time 1 (mt1)
  • Local time 2 (t2), master time 2 (mt2)
  • Based upon parameters, slave timestamps are
    mapped to the master's time by the linear
    equation globaltime = m*x + b
  • m = (mt2 - mt1) / (t2 - t1)
  • b = mt1 - m * t1
  • Timestamp values adjusted during merge phase
  • F. Cristian's algorithm is used to estimate each
    remote clock's actual value
  • Probabilistic remote clock reading algorithm,
    very popular & straightforward to implement
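The linear drift model above can be sketched directly from the two (local, master) sample pairs; the sample values below are made up for illustration:

```python
# Sketch of the linear drift model: two (local, master) time pairs per
# slave clock give slope m and offset b, after which any slave
# timestamp maps onto the master's timeline.

def drift_params(t1, mt1, t2, mt2):
    m = (mt2 - mt1) / (t2 - t1)   # relative clock rate
    b = mt1 - m * t1              # offset at local time zero
    return m, b

def to_master(t, m, b):
    return m * t + b              # globaltime = m*x + b

# Made-up samples: slave reads 100 when master reads 205, and 300
# when master reads 407, so the slave runs slightly slow.
m, b = drift_params(t1=100.0, mt1=205.0, t2=300.0, mt2=407.0)
```

Applied during the merge phase, this maps every recorded slave timestamp onto the master clock, which is what makes cross-thread event ordering in the trace meaningful.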

20
Clock Synchronization Algorithms (2)
  • Illustration of linear model

(Figure: timer values vs. runtime for Thread 1 and Thread 2,
showing the offset b and the different slopes m of each clock)
21
PAPI: Basic Information
  • Name
  • Performance Application Programming Interface
    (PAPI)
  • Developer
  • University of Tennessee, Knoxville
  • Current version
  • PAPI 3.2.1
  • Website
  • http://icl.cs.utk.edu/papi/
  • Contact
  • Phil Mucci
  • Collaborations
  • TAU, SCALEA, vprof, SvPablo, PerformanceBench,
    DynaProf

22
PAPI Implementation
23
HW Support
  • Excellent hardware support
  • Heterogeneity support
  • PAPI API code is highly portable, but needs to be
    cognizant of semantic differences

24
PPW/PAPI Integration
  • Not all metrics available on each platform
  • More than 100 total PAPI metrics available
  • 42 metrics available for Opteron processors in
    our open platforms
  • PAPI does not support the Cray X1E
  • Software multiplexing used to extend
    functionality (i.e., track more metrics than
    there are physical HW counters)
  • Used for PPW modules
  • A Unit: local delay analysis
  • P Unit: visualizations
  • Profiling mode: call tree, statistical
    visualizations
  • Tracing mode: timeline
  • After the call to PAPI_library_init
  • PAPI_add_events: defines what events PAPI should
    track
  • PAPI_start/stop_counters: starts/stops counting
    specified hardware events
  • PAPI_read_counters: copies current counts to an
    array & resets counters

25
Measurement Future Work
  • Performance tuning of measurement library
  • Instrument & record data for applications that
    exercise trace & profile functions
  • Identify inefficiencies via valgrind tool
  • Proprietary platform support (Cray SHMEM, etc.)
  • Add source callsite support for SHMEM wrappers
  • Borrow code from mpiP
  • Trace format improvements for efficient analysis
    / visualization
  • Memory profiling minor mode
  • Request from Berkeley UPC users