Performance Instrumentation and Measurement for Terascale Systems

Slides: 39

Provided by: shirley52

Learn more at: http://www.cs.uoregon.edu


1
Performance Instrumentation and Measurement for
Terascale Systems

Jack Dongarra, Shirley Moore, Philip Mucci
University of Tennessee
Sameer Shende, and Allen Malony
University of Oregon

2
Requirements for Terascale Systems

Performance framework must support a wide range
of
Performance problems (e.g., single-node
performance, synchronization and communication
overhead, load balancing)
Performance evaluation methods (e.g.,
parameter-based modeling, bottleneck detection
and diagnosis)
Programming environments (e.g., multiprocess
and/or multithreaded, parallel and distributed,
large-scale)
Need for flexible and extensible performance
observation framework

3
Research Problems

Appropriate level and location for implementing
instrumentation and measurement
How to make the framework modular and extensible
Appropriate compromise between level of
detail/accuracy and instrumentation cost

4
Instrumentation Strategies

Source code instrumentation
Manual or using preprocessor
Library level instrumentation
e.g., MPI and OpenMP profiling interfaces
Binary rewriting
e.g., Pixie, ATOM, EEL, PAT
Dynamic instrumentation
DyninstAPI

5
Types of Measurements

Profiling
Tracing
Real-time Analysis

6
Profiling

Recording of summary information during execution
inclusive, exclusive time, calls, hardware
statistics, ...
Reflects performance behavior of program entities
functions, loops, basic blocks
user-defined semantic entities
Very good for low-cost performance assessment
Helps to expose performance bottlenecks and
hotspots
Implemented through
sampling: periodic OS interrupts or hardware
counter traps
instrumentation: direct insertion of measurement
code
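
The difference between inclusive and exclusive time on this slide can be sketched with a small instrumentation-based profiler. This is only an illustration, not how TAU is implemented: the Profiler class, its enter/exit methods, and the fake_clock helper are hypothetical names, and the clock is injectable so that the numbers are reproducible.

```python
import time
from collections import defaultdict


class Profiler:
    """Accumulates call counts plus inclusive and exclusive time per region."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock              # injectable clock (assumption, for testing)
        self.stack = []                 # frames: [name, start_time, time_in_children]
        self.calls = defaultdict(int)
        self.inclusive = defaultdict(float)
        self.exclusive = defaultdict(float)

    def enter(self, name):
        # Measurement code inserted at region entry.
        self.stack.append([name, self.clock(), 0.0])

    def exit(self):
        # Measurement code inserted at region exit.
        name, start, child_time = self.stack.pop()
        elapsed = self.clock() - start
        self.calls[name] += 1
        self.inclusive[name] += elapsed               # includes callees
        self.exclusive[name] += elapsed - child_time  # callees subtracted
        if self.stack:
            self.stack[-1][2] += elapsed              # charge the parent frame


def fake_clock(ticks):
    """Return a clock that replays a fixed list of timestamps."""
    it = iter(ticks)
    return lambda: next(it)


prof = Profiler(clock=fake_clock([0.0, 1.0, 4.0, 10.0]))
prof.enter("main")    # t = 0
prof.enter("solve")   # t = 1
prof.exit()           # t = 4, leaves solve
prof.exit()           # t = 10, leaves main
```

Here main has an inclusive time of 10 but an exclusive time of 7, because the 3 units spent in solve are subtracted. Only summary totals are kept, which is why profiling is cheap compared with tracing. The sketch double-counts inclusive time for recursive calls.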

7
Tracing

Recording of information about significant points
(events) during program execution
entering/exiting code region (function, loop,
block, ...)
thread/process interactions (e.g., send/receive
message)
Save information in event record
timestamp
CPU identifier, thread identifier
Event type and event-specific information
Event trace is a time-sequenced stream of event
records
Can be used to reconstruct dynamic program
behavior
Typically requires code instrumentation
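
As a rough sketch of the event record described above, the following uses a hypothetical Event type with the listed fields (timestamp, CPU identifier, thread identifier, event type, event-specific information) and replays a time-ordered trace to recover region intervals. The field names and the region_intervals helper are assumptions made for illustration and do not correspond to any particular trace format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float
    cpu: int
    thread: int
    kind: str        # "enter", "exit", "send", "recv"
    info: str = ""   # event-specific information (region name, peer, ...)


def region_intervals(trace):
    """Replay enter/exit events per (cpu, thread) and return
    (region, (cpu, thread), start, end) tuples in order of completion."""
    stacks = {}
    intervals = []
    for ev in sorted(trace, key=lambda e: e.timestamp):
        key = (ev.cpu, ev.thread)
        if ev.kind == "enter":
            stacks.setdefault(key, []).append(ev)
        elif ev.kind == "exit":
            start = stacks[key].pop()
            intervals.append((start.info, key, start.timestamp, ev.timestamp))
    return intervals


trace = [
    Event(0.0, 0, 0, "enter", "main"),
    Event(1.0, 0, 0, "enter", "exchange"),
    Event(1.5, 0, 0, "send", "to cpu 1"),
    Event(2.0, 0, 0, "exit", "exchange"),
    Event(5.0, 0, 0, "exit", "main"),
]
intervals = region_intervals(trace)
```

Unlike a profile, the trace keeps every event, so the nesting and ordering of regions can be reconstructed afterwards, at the cost of storing one record per event.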

8
Real-time Analysis

Allows evaluation of program performance during
execution
Examples
Paradyn
Autopilot
Perfometer

9
TAU Performance System Architecture
[Figure: architecture diagram; surviving labels: Paraver, EPILOG]

10
TAU Instrumentation

Manually using TAU instrumentation API
Automatically using
Program Database Toolkit (PDT)
MPI profiling library
Opari OpenMP rewriting tool
Uses PAPI to access hardware counter data

11
Program Database Toolkit (PDT)

Program code analysis framework for developing
source-based tools
High-level interface to source code information
Integrated toolkit for source code parsing,
database creation, and database query
commercial grade front end parsers
portable IL analyzer, database format, and access
API
open software approach for tool development
Targets and integrates multiple source languages
Used in TAU to build automated performance
instrumentation tools

12
PDT Components

Language front end
Edison Design Group (EDG): C, C++
Mutek Solutions Ltd.: F77, F90
creates an intermediate-language (IL) tree
IL Analyzer
processes the intermediate language (IL) tree
creates program database (PDB) formatted file
DUCTAPE (Bernd Mohr, ZAM, Germany)
C++ program Database Utilities and Conversion
Tools APplication Environment
processes and merges PDB files
C++ library to access the PDB for PDT applications

13
TAU Analysis

Profile analysis
pprof
parallel profiler with text-based display
Racy / jRacy
graphical interface to pprof (Tcl/Tk)
jRacy is a Java implementation of Racy
ParaProf
Next-generation parallel profile analysis and
display
Trace analysis and visualization
Trace merging and clock adjustment (if necessary)
Trace format conversion (ALOG, SDDF, Vampir)
Vampir (Pallas) trace visualization
Paraver (CEPBA) trace visualization
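
Trace merging with clock adjustment can be sketched as follows. This is a minimal illustration, assuming each process wrote a time-sorted trace of (timestamp, process, event) tuples and that a single constant clock offset per process is already known. The function names are hypothetical, and real tools may also have to correct for clock drift.

```python
import heapq


def adjust(trace, offset):
    """Shift every (timestamp, process, event) record by a constant offset."""
    return [(t + offset, proc, what) for (t, proc, what) in trace]


def merge_traces(traces, offsets):
    """Merge per-process traces into one time-ordered stream.

    traces[i] must already be sorted by timestamp; offsets[i] is added to
    every timestamp of traces[i] to map it onto the common time base.
    """
    adjusted = [adjust(tr, off) for tr, off in zip(traces, offsets)]
    return list(heapq.merge(*adjusted, key=lambda rec: rec[0]))


# Process 1's clock is assumed to run 10.0 time units ahead of process 0's.
p0 = [(0.0, 0, "enter main"), (2.0, 0, "send"), (6.0, 0, "exit main")]
p1 = [(10.5, 1, "enter main"), (13.0, 1, "recv"), (15.0, 1, "exit main")]

merged = merge_traces([p0, p1], offsets=[0.0, -10.0])
```

After adjustment the receive on process 1 (t = 3.0) follows the matching send on process 0 (t = 2.0), which would not be the case if the raw timestamps were merged directly.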

14
TAU Pprof Display
15
jracy (NAS Parallel Benchmark LU)
Routine profile across all nodes
Global profiles
n: node, c: context, t: thread
Individual profile
16
ParaProf Scalable Profiler

Re-implementation of jRacy tool
Target flexibility in profile input source
Profile files, performance database, online
Target scalability in profile size and display
Will include three-dimensional display support
Provide more robust analysis and extension
Derived performance statistics
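
A derived performance statistic is a value computed from measured metrics rather than measured directly. The sketch below derives a floating-point rate from a time metric and an operation count for each (node, context, thread) triple. The numbers are made-up inputs and the dictionary layout is an assumption chosen for illustration, not ParaProf's data model.

```python
from statistics import mean

# (node, context, thread) -> measured values for one routine (made-up inputs)
measured = {
    (0, 0, 0): {"time_s": 2.0, "fp_ops": 4.0e8},
    (0, 0, 1): {"time_s": 4.0, "fp_ops": 4.0e8},
    (1, 0, 0): {"time_s": 2.5, "fp_ops": 5.0e8},
}


def derive_mflops(values):
    """Derived metric: millions of floating-point operations per second."""
    return values["fp_ops"] / values["time_s"] / 1.0e6


mflops = {where: derive_mflops(v) for where, v in measured.items()}
mean_mflops = mean(mflops.values())
slowest = min(mflops, key=mflops.get)   # thread with the lowest rate
```

In this example the thread (0, 0, 1) performs the same number of operations as (0, 0, 0) in twice the time, so its derived rate is half as large, which a raw operation count alone would not reveal.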

17
ParaProf Architecture
18
512-Processor Profile (SAMRAI)
19
Three-dimensional Profile Displays
500-processor Uintah execution (University of
Utah)
20
Overview of PAPI

Performance Application Programming Interface
The purpose of the PAPI project is to design,
standardize and implement a portable and
efficient API to access the hardware performance
monitor counters found on most modern
microprocessors.
Parallel Tools Consortium project
Reference implementations for all major HPC
platforms
Installed and in use at major government labs,
academic sites
Becoming de facto industry standard
Incorporated into many performance analysis tools
e.g., HPCView, SvPablo, TAU, Vampir, Vprof

21
PAPI Counter Interfaces

PAPI provides three interfaces to the underlying
counter hardware
The low level interface provides functions for
setting options, accessing native events,
callback on counter overflow, etc.
The high level interface simply provides the
ability to start, stop and read the counters for
a specified list of events.
Graphical tools to visualize information.

22
PAPI Implementation
23
PAPI Preset Events

Proposed standard set of events deemed most
relevant for application performance tuning
Defined in papiStdEventDefs.h
Mapped to native events on a given platform
Run tests/avail to see list of PAPI preset events
available on a platform

24
Scalability of PAPI Instrumentation

Overhead of library calls to read counters can be
excessive.
Statistical sampling can reduce overhead.
PAPI substrate for Alpha Tru64 UNIX
Built on top of DADD/DCPI (Dynamic Access to DCPI
Data/Digital Continuous Profiling Interface)
Sampling approach supported in hardware
1-2% overhead compared to 30% on other platforms
Using sampling and hardware profiling support on
Itanium/Itanium2
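
Why statistical sampling reduces overhead can be illustrated with a small simulation. It is not how DCPI or PAPI implement sampling: the timeline of (region, duration) pairs, the function names, and the fixed sampling period are all assumptions. The point is that the number of measurements depends on the sampling period rather than on how often regions are entered, and the price is an approximate profile.

```python
from collections import Counter


def exact_profile(timeline):
    """timeline: (region, duration) pairs in execution order -> time per region."""
    totals = Counter()
    for region, duration in timeline:
        totals[region] += duration
    return totals


def sampled_profile(timeline, period):
    """Observe the running region once per period and charge it a whole period."""
    totals = Counter()
    next_sample = period // 2          # first sample in the middle of a period
    now = 0
    for region, duration in timeline:
        end = now + duration
        while next_sample < end:       # every sample that lands in this region
            totals[region] += period
            next_sample += period
        now = end
    return totals


timeline = [("compute", 700), ("comm", 250), ("io", 50)]
exact = exact_profile(timeline)
estimate = sampled_profile(timeline, period=100)
```

With a period of 100 time units the run is observed only 10 times. The estimate gets compute right but is off by 50 units for comm and io, which shows the trade-off between level of detail and measurement cost. A fixed period can also alias with periodic program behavior, which this sketch does not address.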

25
Vampir v3.x Hardware Counter Data

Counter Timeline Display

26
What is DynaProf?

A portable tool to instrument a running
executable with Probes that monitor application
performance.
Simple command line interface.
Open Source Software
A work in progress

No source code required
27
DynaProf Methodology

Make collection of run-time performance data easy
by
Avoiding instrumentation and recompilation
Using the same tool with different probes
Providing useful and meaningful probe data
Providing different kinds of probes
Allowing custom probes

No source code required!
28
Why the Dyna?

Instrumentation is selectively inserted directly
into the program's address space.
Why is this a better way?
No perturbation of compiler optimizations
Complete language independence
Multiple Insert/Remove instrumentation cycles
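
The idea of inserting and removing instrumentation without touching the source can be loosely illustrated in Python by rebinding a function at run time. This is only an analogy: Dyninst and DPCL patch machine code in a running process, whereas the hypothetical insert_probe helper below merely swaps a wrapper in and out of a namespace object.

```python
import functools
import types


def insert_probe(target, func_name, on_entry, on_exit):
    """Wrap target.func_name with entry/exit callbacks; return a remover."""
    original = getattr(target, func_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        on_entry(func_name)
        try:
            return original(*args, **kwargs)
        finally:
            on_exit(func_name)

    setattr(target, func_name, wrapper)

    def remove_probe():
        # Restore the uninstrumented function.
        setattr(target, func_name, original)

    return remove_probe


# Stand-in for a loaded application with one function of interest.
app = types.SimpleNamespace(work=lambda n: sum(range(n)))

log = []
remove = insert_probe(
    app, "work",
    on_entry=lambda name: log.append(("enter", name)),
    on_exit=lambda name: log.append(("exit", name)),
)
first = app.work(5)     # instrumented call, recorded in log
remove()
second = app.work(5)    # uninstrumented call, nothing recorded
```

The function body of work is never edited, and the insert/remove cycle can be repeated as often as needed while the application object stays alive.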

29
DynaProf Design

GUI, command-line, and script-driven user interfaces
Uses GNU readline for command line editing and
command completion.
Instrumentation is done using
Dyninst on Linux, Solaris and IRIX
DPCL on AIX

30
DynaProf Commands

load <executable>
list module pattern
use <probe> probe args
instr module <module> probe args
instr function <module> <function> probe args
stop
continue
run args
info
unload

31
DynaProf Probe Design

Probes provided with distribution
Wallclock probe
PAPI probe
Perfometer probe
Can be written in any compiled language
Probes export 3 functions with a standardized
interface.
Easy to roll your own (< 1 day)
Supports separate probes for MPI/OpenMP/Pthreads
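
The slide does not name the three functions a probe exports, so the following sketch assumes hypothetical init, entry, and exit hooks to show how one driver could work with different probes through a fixed interface. It is written in Python for brevity, is not DynaProf's actual probe API, and the run_with_probe driver is likewise invented for illustration.

```python
import time


class WallclockProbe:
    """Hypothetical probe that accumulates wall-clock time per function."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock   # injectable clock (assumption, for testing)

    def init(self, args):
        self.starts = {}
        self.totals = {}

    def entry(self, function):
        self.starts[function] = self.clock()

    def exit(self, function):
        elapsed = self.clock() - self.starts.pop(function)
        self.totals[function] = self.totals.get(function, 0.0) + elapsed


class CallCountProbe:
    """Hypothetical probe that only counts calls."""

    def init(self, args):
        self.counts = {}

    def entry(self, function):
        self.counts[function] = self.counts.get(function, 0) + 1

    def exit(self, function):
        pass


def run_with_probe(probe, calls, probe_args=""):
    """Stand-in for the tool: fire the probe hooks around each call."""
    probe.init(probe_args)
    for function in calls:
        probe.entry(function)
        probe.exit(function)
    return probe


ticks = iter([0.0, 2.0, 2.0, 5.0])
wall = run_with_probe(WallclockProbe(clock=lambda: next(ticks)), ["foo", "bar"])
count = run_with_probe(CallCountProbe(), ["foo", "foo", "bar"])
```

Because the driver only relies on the three hooks, a custom probe is just another object with the same interface, which is the property the slide describes as making it easy to roll your own.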

32
Future development

GUI development
Additional probes
Perfex probe
Vprof probe
TAU probe
Better support for parallel applications

33
Perfometer

Application is instrumented with PAPI
call perfometer()
call mark_perfometer(int color, char *label)
The application is started. At the call to
perfometer, a signal handler and a timer are set up
to collect the information and send it to a Java
applet containing the graphical view.
Sections of code that are of interest can be
designated with specific colors
Real-time display or trace file

34
Perfometer Display
35
Perfometer Parallel Interface
36
Conclusions

TAU and PAPI projects are addressing important
research problems involved in constructing a
flexible and extensible performance observation
framework.
Widespread adoption of PAPI demonstrates the
value of a portable interface to low-level
architecture-specific performance monitoring
hardware.
TAU framework provides flexible mechanisms for
instrumentation and measurement.

37
Conclusions (cont.)

Terascale systems require scalable low-overhead
means of collecting performance data.
Statistical sampling support in PAPI
TAU filtering and feedback schemes for focusing
instrumentation
Real-time monitoring capabilities (DynaProf,
Perfometer)
PAPI and TAU infrastructure is designed for
interoperability, flexibility, and extensibility.

38
More Information