Parallel Performance Wizard: Measurement Module
1
Parallel Performance Wizard: Measurement Module
  • Professor Alan D. George, Principal Investigator
  • Mr. Hung-Hsun Su, Sr. Research Assistant
  • Mr. Adam Leko, Sr. Research Assistant
  • Mr. Bryan Golden, Research Assistant
  • Mr. Hans Sherburne, Research Assistant
  • Mr. Max Billingsley, Research Assistant
  • Mr. Josh Hartman, Undergraduate Volunteer
  • HCS Research Laboratory
  • University of Florida

2
Measurement Topics Outline
  • Measurement module overview
  • Overview of measurement process
  • I/O library overview (file formats)
  • Clock synchronization algorithms
  • Summary of PAPI support
  • Preliminary overhead measurements
  • Upcoming work

3
Measurement Module Overview
  • Purpose of module
  • Record performance information at runtime based
    upon analysis that the user requests
  • Types of measurements
  • Profiling
  • Record statistical information about execution
    time or hardware counter values
  • Relate information to basic blocks (functions,
    upc_forall loops) in source code
  • Tracing
  • Record full log of when events happen at runtime
    and how long
  • Gives very complete information about what
    happened at runtime
  • Sampling
  • Special low-overhead mode of profiling that
    attributes performance information via indirect
    measurement (samples)
  • Optional feature

4
Measurement Module Architecture
5
Measurement Data: What is Recorded?
  • Basic information for each run
  • Command-line arguments (if available)
  • Snapshot of environment variables
  • Thread information: rank, pid, hostname
  • Region information: filename, line, description
  • Region is an abstraction of a code block such as
    function or upc_forall loop
  • Regions are used to relate performance
    information to a particular block of user code
  • Callsite information: filename, line, col
  • Relates performance information to a particular
    line of code
  • Metric information: name
  • Used to signify the metric used for measurements
    (e.g., hardware counters such as # of L2 misses,
    etc.)

6
Measurement Data: Trace Records
  • All trace records are associated with a
  • Time value (globally synchronized)
  • Thread
  • Event ID
  • Begin/end flag
  • Depending upon type of event, record also has
  • Communication information
  • Metric information (hardware counter values)
  • Other language- and function-specific values
  • Most of the SHMEM function-type variants are
    mapped to a single event + data size
  • e.g., shmem_put_float and shmem_put_double are
    mapped to shmem_put with a size argument
  • Significantly simplifies analysis code (200
    functions mapped down to a handful without losing
    information)
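The variant-collapsing idea above can be sketched as follows; the mapping table and function names are illustrative, not the actual PPW identifiers:

```python
# Illustrative sketch: typed SHMEM variants collapse to one generic
# event plus an element size, so analysis code only has to reason
# about "shmem_put" with a byte count. Not the real PPW tables.

VARIANT_MAP = {
    "shmem_put_float":  ("shmem_put", 4),
    "shmem_put_double": ("shmem_put", 8),
    "shmem_put_int":    ("shmem_put", 4),
    "shmem_get_float":  ("shmem_get", 4),
    "shmem_get_double": ("shmem_get", 8),
}

def normalize(call_name, nelems):
    """Map a typed variant to (generic_event, total_bytes)."""
    event, elem_size = VARIANT_MAP[call_name]
    return event, elem_size * nelems
```

With this scheme, a put of 10 doubles is recorded as a generic put of 80 bytes, so the ~200 typed functions reduce to a handful of events without losing information.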

7
Measurement Data: Profile Statistics
  • Profile statistics are collected for regions only
  • User regions handled normally
  • For other language-specific statistics, stats are
    related to special regions
  • upc_get
  • shmem_get
  • etc.
  • Each statistic has
  • # of times that region was called
  • # of subroutine calls (another region was entered
    inside this region)
  • Min/Max/Sum of inclusive time spent in region
  • Sum of exclusive time spent in region
  • Sum of (exclusive time)² spent in region (for
    standard deviation calculations)
  • Metric this statistic corresponds to
  • A callpath it is associated with
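A minimal sketch of such a statistic record, assuming the fields listed above (names are illustrative, not the actual PPW structs); the sum and sum-of-squares fields are enough to recover the standard deviation later:

```python
# Illustrative per-region statistic record (not the real PPW struct).
import math

class RegionStat:
    def __init__(self):
        self.count = 0            # # of times region was called
        self.subcalls = 0         # # of subroutine calls inside it
        self.incl_min = float("inf")
        self.incl_max = 0.0
        self.incl_sum = 0.0       # min/max/sum of inclusive time
        self.excl_sum = 0.0       # sum of exclusive time
        self.excl_sumsq = 0.0     # sum of (exclusive time)^2

    def record(self, incl, excl):
        self.count += 1
        self.incl_min = min(self.incl_min, incl)
        self.incl_max = max(self.incl_max, incl)
        self.incl_sum += incl
        self.excl_sum += excl
        self.excl_sumsq += excl * excl

    def excl_stddev(self):
        # stddev from count, sum, and sum of squares alone
        mean = self.excl_sum / self.count
        return math.sqrt(max(self.excl_sumsq / self.count - mean * mean, 0.0))
```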

8
Measurement Data: Profile Statistics (2)
  • Inclusive vs. exclusive time
  • Inclusive time: time spent inside a region,
    including any subroutine calls
  • Exclusive time: time spent inside a region,
    excluding any subroutine calls (= self time)
  • Tracking both inclusive and exclusive time helps
    differentiate between
  • Slow regions that do most of the work themselves
  • Regions that call other regions that take a long
    time
  • Min/max/average (derived from sum and count)
    inclusive time help determine if a region's time
    differs between invocations
  • Standard deviation of exclusive time also helps
    determine this variation
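A hedged sketch of the inclusive/exclusive distinction above: on region exit, the child's inclusive time is charged to the child and subtracted from the parent's self time. The event-list representation and names are illustrative, not the PPW implementation:

```python
# Illustrative sketch: derive inclusive and exclusive (self) time per
# region from a stream of enter/exit events.

def exclusive_times(events):
    """events: list of ("enter"/"exit", region, timestamp)."""
    stack, incl, excl = [], {}, {}
    for kind, region, ts in events:
        if kind == "enter":
            stack.append([region, ts, 0.0])   # region, entry time, child time
        else:
            region, entry, child = stack.pop()
            span = ts - entry                 # inclusive span of this call
            incl[region] = incl.get(region, 0.0) + span
            excl[region] = excl.get(region, 0.0) + span - child
            if stack:
                stack[-1][2] += span          # charge to parent as child time
    return incl, excl
```

For example, a region active from t=0 to t=10 that spends t=2..5 inside a subroutine has inclusive time 10 but exclusive time 7, which is exactly what distinguishes slow regions from regions that merely call slow regions.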

9
Measurement Data: Callpaths for Stats
  • Based upon our tool evaluations last year, we
    picked three different methods for relating stats
    to regions of code
  • Flat callpaths
  • Each statistic has a callpath of depth 1
  • Results in one statistic per region
  • e.g., will give user the average and total time
    for all upc_memgets in their program
  • Can be efficiently recorded
  • Parent callpaths
  • Each statistic has a callpath of depth 2:
    (region, enclosing region)
  • Results in several statistics per region giving
    contextual information
  • e.g., will give user the averages and total times
    for all upc_memgets in their program, including
    which region called that upc_memget
  • Can be fairly efficiently recorded

10
Measurement Data: Callpaths for Stats (2)
  • Full callpaths
  • Callpaths are taken from root down, up to 10
    levels
  • Gives user several statistical records per
    region, with a lot of contextual information
  • Can result in a large volume of stat records, but
    size usage is very modest compared to tracing
  • Naturally filters out deep function calls
  • Much more expensive to record at runtime
  • Have to do up to 10 comparison operations per
    timer struct lookup
  • Increased memory usage due to larger possible #
    of callpaths
  • Different profile modes provided so user can
    balance amount of information recorded vs.
    runtime overhead
  • Represents a wider spectrum of information vs.
    overhead possibilities than just profiling &
    tracing
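The three callpath modes described on slides 9-10 can be sketched as key-construction rules, assuming the current call stack is available as a list of region names (root first); everything beyond what the slides state is illustrative:

```python
# Illustrative sketch of flat / parent / full callpath keys for
# storing per-region statistics (not the actual PPW data structures).

MAX_DEPTH = 10  # full callpaths are truncated to 10 levels from the root

def callpath_key(stack, mode):
    """Return the key under which this region's statistic is stored."""
    if mode == "flat":        # depth 1: one statistic per region
        return (stack[-1],)
    if mode == "parent":      # depth 2: region plus its enclosing region
        return tuple(stack[-2:])
    if mode == "full":        # root-down, up to MAX_DEPTH levels,
        return tuple(stack[:MAX_DEPTH])  # naturally filtering deep calls
    raise ValueError(mode)
```

Flat mode yields one key per region (e.g., one statistic covering all upc_memget calls); parent mode splits that by caller; full mode keeps the whole truncated path, trading memory and lookup cost for context.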

11
Measurement Data: Visual Representation
  • Simplified ER diagram (syntax severely abused for
    clarity's sake)

12
Measurement Process Overview
  • Goal: keep runtime overhead as low as possible
  • Tracing
  • Each thread (execution unit) keeps a buffer of
    event records in raw format
  • Buffer is flushed to disk when full
  • On merge, raw data is sent to merge master, which
    outputs data in final (sorted) format
  • Profiling
  • Profile data is kept in memory inside timer
    structs
  • Each timer struct is associated with a region of
    code (user or system) and a callpath
  • Side note: a requirement of our project is to
    provide switching between tracing & profiling
    without recompilation

13
Trace Implementation Details
  • Importance of explicit buffering
  • Disk behavior is similar to network overhead
  • For small writes, seek time dominates data
    transfer time
  • Even for modern (expensive) disks, seek times
    range in the milliseconds
  • Cannot afford a seek on each trace entry
  • Most OSs will buffer small writes if free RAM is
    available
  • But, most HPC applications slurp up all available
    RAM
  • Even a modest buffer size (64KB) will buffer
    over 1500 trace records @ 40 bytes / record
  • Buffer size will be user-tunable
  • Need to do extensive testing to determine effect
    of buffer size on tracing overhead
  • Preliminary data seems to indicate that unless
    dealing with huge records, trace buffers of > 1MB
    have limited benefits
  • Additional techniques for overhead reduction
    (will implement as needed)
  • Use nonblocking I/O (aio_write, aio_read) with
    double buffering
  • Alternatively use producer/consumer and circular
    buffer with pthreads
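The explicit buffering scheme above can be sketched as follows: records accumulate in a fixed-size in-memory buffer and reach disk only in large sequential writes, so individual trace entries never pay a seek. The record layout and class are illustrative, not the actual PPW format:

```python
# Illustrative buffered trace writer (not the real PPW record layout).
import struct

# timestamp (double), thread, event id, begin/end flag; network byte order
RECORD = struct.Struct("!dIIB")

class TraceBuffer:
    def __init__(self, path, capacity=64 * 1024):
        self.fh = open(path, "ab")
        self.capacity = capacity       # user-tunable; 64KB shown here
        self.buf = bytearray()

    def emit(self, ts, thread, event, begin):
        self.buf += RECORD.pack(ts, thread, event, begin)
        if len(self.buf) >= self.capacity:
            self.flush()

    def flush(self):
        self.fh.write(self.buf)        # one large sequential write
        self.buf.clear()

    def close(self):
        self.flush()
        self.fh.close()
```

Raising `capacity` trades memory for fewer flushes, which matches the observation above that buffers beyond about 1MB show diminishing returns.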

14
Profile Implementation Details
  • As with tracing, low overhead is key
  • On each region entry, need to look up associated
    timer object
  • Need a very efficient data structure for lookups
    of a moderate (100s to 1000s) number of timer
    objects
  • Hashing based upon region type (upc_get, upc_put,
    etc) is currently used
  • Provides very efficient lookups when small number
    of timers per event type
  • Separate hash used for user events
  • Hash overflow buckets handled with a simple array
    using a modest number of pre-allocated buckets
  • Further possible performance optimizations
  • Use separate hash tables for very common events
    (UPC implicit communication region types)
  • Use more efficient lookup structures for overflow
    buckets (AVL trees)
  • Do cycle trimming on callpath matching code
  • Need further testing with realistic applications
    to identify bottlenecks (avoiding premature
    optimization)
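The lookup scheme above can be sketched as a hash on the region/event type followed by a linear scan of a small pre-allocated overflow bucket; with only a few timers per event type the scan stays cheap. All names and structures here are illustrative, not the PPW implementation:

```python
# Illustrative timer lookup: hash by event type, small overflow bucket
# scanned linearly for the entry matching the current callpath.

class TimerTable:
    def __init__(self, bucket_size=8):
        self.table = {}              # event type -> list of (callpath, timer)
        self.bucket_size = bucket_size  # modest pre-allocated bucket

    def lookup(self, event_type, callpath):
        bucket = self.table.setdefault(event_type, [])
        for cp, timer in bucket:     # usually only a few entries per type
            if cp == callpath:
                return timer
        timer = {"count": 0, "time": 0.0}
        if len(bucket) >= self.bucket_size:
            # real code would grow the bucket or switch structures (e.g. AVL)
            raise RuntimeError("overflow bucket full")
        bucket.append((callpath, timer))
        return timer
```

Separate tables for very common event types (the UPC implicit-communication regions mentioned above) would shorten the scanned buckets further.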

15
I/O Library: File Formats
  • We spent much time looking at existing trace
    formats
  • Trying to avoid reinventing the wheel!
  • SLOG-2
  • Native format used by Jumpshot
  • Made to be very efficient for visualizing traces
  • However, graphics-based format loses a lot of
    performance data
  • Format is also expensive to write
  • MPICH uses CLOG2 as an intermediary file format
  • Jumpshot viewer converts CLOG2 files
    semi-automatically when trace is opened for the
    first time

16
I/O Library: File Formats (2)
  • STF (Intel Trace Collector)
  • Seems to be an efficient format but is not open
    (undocumented; property of Intel)
  • VTF3 (Vampir's format)
  • VTF has free (but not open) library with support
    for many platforms
  • VTF library is not designed to be used at runtime
  • Rather, it is geared towards analysis &
    conversion utilities
  • OTF: new trace library supported by Vampir-NG and
    TAU
  • OTF trace format was created under contract with
    LLNL
  • Open (documented) format, uses a lot of
    techniques to support massively parallel traces
    (1000s of processors)
  • Library is open source
    (http://www.tu-dresden.de/zih/otf)
  • Format does not have specific support for
    one-sided memory operations
  • Has become available only very recently (past few
    months)

17
I/O Library: File Formats (3)
  • Unnamed trace format for IBM SP systems
  • Described in the paper by Wu et al., "From Trace
    Generation to Visualization: A Performance
    Framework for Distributed Parallel Systems"
  • No implementation available but has some good
    ideas (framing & continuation records for
    efficient access)
  • EPILOG
  • Trace format used by KOJAK suite
  • Open source & documented
  • Portable (support for many platforms)
  • No explicit support for UPC operations
  • Format survey findings
  • OTF and EPILOG were most promising
  • Neither has support for both UPC operations
    (missing from EPILOG) and one-sided memory
    operations (missing from OTF)
  • Analysis module needs this information, so cannot
    use these libraries as-is
  • Due to lack of a perfect candidate, created our
    own simple trace format that takes the best ideas
    from each existing library

18
I/O Library: PPW Format
  • File consists of two pieces (separate files)
  • File header containing definition records and
    statistic information (filename.ppw)
  • Trace file containing trace records + very short
    header (filename.ppw.0)
  • Trace definition records & statistics kept
    separate from trace file
  • Can easily transport the header file around for
    initial viewing by visualization tools
  • Gory details of file format documented in PPW
    source code
  • Very simple overall file format, basically a dump
    of information outlined on slides 5-11 in network
    byte-order format (big-endian)
  • File format is easily ported to other trace
    formats such as OTF, EPILOG, and VTF
  • A more fully featured trace conversion module is
    in the works
  • Depending upon needs of analysis and presentation
    modules, will add more scalable features to
    format
  • In particular, have strongly looked at trace file
    framing and continuation records to efficiently
    support reading portions of a trace file without
    losing information

19
Clock Synchronization Algorithms
  • In order for analysis and visualization to be
    useful and accurate, need clocks to be globally
    synchronized
  • PPW uses a simple linear drift model
  • One clock is designated as the master clock
  • Parameters recorded for slave clocks
  • Local time 1 (t1), master time 1 (mt1)
  • Local time 2 (t2), master time 2 (mt2)
  • Based upon parameters, slave timestamps are
    mapped to the master's time by the linear
    equation globaltime = m*x + b
  • m = (mt2 - mt1) / (t2 - t1)
  • b = mt1 - m * t1
  • Timestamp values adjusted during merge phase
  • F. Cristian's algorithm is used to estimate each
    remote clock's actual value
  • Probabilistic remote clock reading algorithm,
    very popular & straightforward to implement
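The linear drift model above can be sketched directly from the two (local, master) sample pairs; the sample values below are made up for illustration:

```python
# Sketch of the linear drift model: two (local, master) time pairs per
# slave clock give slope m and offset b, after which any slave
# timestamp maps onto the master's timeline.

def drift_params(t1, mt1, t2, mt2):
    m = (mt2 - mt1) / (t2 - t1)   # relative clock rate
    b = mt1 - m * t1              # offset at local time zero
    return m, b

def to_master(t, m, b):
    return m * t + b              # globaltime = m*x + b

# Made-up samples: slave reads 100 when master reads 205, and 300
# when master reads 407, so the slave runs slightly slow.
m, b = drift_params(t1=100.0, mt1=205.0, t2=300.0, mt2=407.0)
```

Applied during the merge phase, this maps every recorded slave timestamp onto the master clock, which is what makes cross-thread event ordering in the trace meaningful.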

20
Clock Synchronization Algorithms (2)
  • Illustration of linear model

(Figure: timer values vs. runtime for Thread 1 and Thread 2,
showing the offset b and the different slopes m of each clock)
21
PAPI: Basic Information
  • Name
  • Performance Application Programming Interface
    (PAPI)
  • Developer
  • University of Tennessee, Knoxville
  • Current version
  • PAPI 3.2.1
  • Website
  • http://icl.cs.utk.edu/papi/
  • Contact
  • Phil Mucci
  • Collaborations
  • TAU, SCALEA, vprof, SvPablo, PerformanceBench,
    DynaProf

22
PAPI Implementation
23
HW Support
  • Excellent hardware support
  • Heterogeneity support
  • PAPI API code is highly portable, but needs to be
    cognizant of semantic differences

24
PPW/PAPI Integration
  • Not all metrics available on each platform
  • More than 100 total PAPI metrics available
  • 42 metrics available for Opteron processors in
    our open platforms
  • PAPI does not support the Cray X1E
  • Software multiplexing used to extend
    functionality (i.e., track more metrics than
    there are physical HW counters)
  • Used for PPW modules
  • A Unit: local delay analysis
  • P Unit: visualizations
  • Profiling mode: call tree, statistical
    visualizations
  • Tracing mode: timeline
  • After the call to PAPI_library_init
  • PAPI_add_events: defines what events PAPI should
    track
  • PAPI_start/stop_counters: starts/stops counting
    specified hardware events
  • PAPI_read_counters: copies current counts to an
    array & resets counters

25
Measurement Future Work
  • Performance tuning of measurement library
  • Instrument & record data for applications that
    exercise trace & profile functions
  • Identify inefficiencies via valgrind tool
  • Proprietary platform support (Cray SHMEM, etc.)
  • Add source callsite support for SHMEM wrappers
  • Borrow code from mpiP
  • Trace format improvements for efficient analysis
    / visualization
  • Memory profiling minor mode
  • Request from Berkeley UPC users