1
The TAU Performance System

Allen D. Malony, Sameer S. Shende, Robert Bell
{malony,sameer,bell}@cs.uoregon.edu
Department of Computer and Information Science
Computational Science Institute
University of Oregon

2
Overview

Motivation and goals
TAU architecture and toolkit
Instrumentation
Measurement
Analysis
Performance mapping
Application case studies
TAU Integration
Work in progress
Conclusions

3
Motivation

Tools for performance problem solving
Empirical-based performance optimization process
Versatile performance technology
Portable performance analysis methods

[Figure: performance optimization cycle linking Performance Tuning, Performance Diagnosis, Performance Experimentation, and Performance Observation through hypotheses, properties, and characterization, supported by Performance Technology]
4
Problems

Diverse performance observability requirements
Multiple levels of software and hardware
Different types and detail of performance data
Alternative performance problem solving methods
Multiple targets of software and system
application
Demands more robust performance technology
Broad scope of performance observation
Flexible and configurable mechanisms
Technology integration and extension
Cross-platform portability
Open, layered, and modular framework architecture

5
Complexity Challenges for Performance Tools

Computing system environment complexity
Observation integration and optimization
Access, accuracy, and granularity constraints
Diverse/specialized observation
capabilities/technology
Restricted modes limit performance problem
solving
Sophisticated software development environments
Programming paradigms and performance models
Performance data mapping to software abstractions
Uniformity of performance abstraction across
platforms
Rich observation capabilities and flexible
configuration
Common performance problem solving methods

6
General Problems (Performance Technology)

How do we create robust and ubiquitous
performance technology for the analysis and
tuning of parallel and distributed software and
systems in the presence of (evolving) complexity
challenges?
How do we apply performance technology
effectively for the variety and diversity of
performance problems that arise in the context of
complex parallel and distributed computer systems?

7
Computation Model for Performance Technology

How to address dual performance technology goals?
Robust capabilities + widely available methods
Contend with problems of system diversity
Flexible tool composition/configuration/integration
Approaches
Restrict computation types / performance problems
machines, languages, instrumentation technique,
limited performance technology coverage and
application
Base technology on abstract computation model
general architecture and software execution
features
map features/methods to existing complex system
types
develop capabilities that can be adapted and
optimized

8
General Complex System Computation Model

Node: physically distinct shared memory machine
Message passing node interconnection network
Context: distinct virtual memory space within node
Thread: execution threads (user/system) in context

[Figure: physical view (SMP nodes, each with node memory, joined by an interconnection network carrying inter-node message communication) and model view (VM space, contexts, threads)]
9
TAU Performance System

Tuning and Analysis Utilities
Performance system framework for scalable
parallel and distributed high-performance
computing
Targets a general complex system computation
model
nodes / contexts / threads
Multi-level system / software / parallelism
Measurement and analysis abstraction
Integrated toolkit for performance
instrumentation, measurement, analysis, and
visualization
Portable performance profiling and tracing
facility
Open software approach with technology
integration
University of Oregon, Forschungszentrum Jülich, LANL

10
Definitions: Profiling

Profiling
Recording of summary information during execution
execution time, calls, hardware statistics, ...
Reflects performance behavior of program entities
functions, loops, basic blocks
user-defined semantic entities
Very good for low-cost performance assessment
Helps to expose performance bottlenecks and
hotspots
Implemented through
sampling: periodic OS interrupts or hardware counter traps
instrumentation: direct insertion of measurement code
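
As an illustration of instrumentation-based profiling, the following minimal C++ sketch inserts a scoped timer into a routine and accumulates summary information (call count and inclusive time) per named region. It is not TAU code: `ProfileEntry`, `profile_table`, `ScopedTimer`, and `sum_to` are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Summary record kept per program entity (hypothetical layout).
struct ProfileEntry {
    long calls = 0;             // number of times the region was entered
    double inclusive_usec = 0;  // total time spent inside the region
};

// Global profile table: one summary entry per named region.
inline std::map<std::string, ProfileEntry>& profile_table() {
    static std::map<std::string, ProfileEntry> table;
    return table;
}

// Scoped timer: construction marks region entry, destruction marks exit.
class ScopedTimer {
public:
    explicit ScopedTimer(const std::string& name)
        : name_(name), start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        auto end = std::chrono::steady_clock::now();
        ProfileEntry& e = profile_table()[name_];
        e.calls += 1;
        e.inclusive_usec +=
            std::chrono::duration<double, std::micro>(end - start_).count();
    }
private:
    std::string name_;
    std::chrono::steady_clock::time_point start_;
};

// Example routine with measurement code inserted directly at its top.
inline long sum_to(long n) {
    assert(n >= 0);
    ScopedTimer timer("long sum_to(long)");
    long s = 0;
    for (long i = 1; i <= n; ++i) s += i;
    return s;
}
```

Only the summary survives the run: after any number of calls, `profile_table()` holds one entry per region rather than one record per call, which is what keeps profiling low-cost.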

11
Definitions: Tracing

Tracing
Recording of information about significant points
(events) during program execution
entering/exiting code region (function, loop, block, ...)
thread/process interactions (e.g., send/receive
message)
Save information in event record
timestamp
CPU identifier, thread identifier
Event type and event-specific information
Event trace is a time-sequenced stream of event
records
Can be used to reconstruct dynamic program
behavior
Typically requires code instrumentation
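
The event record and the time-sequenced stream described above can be sketched in C++ as follows. This is a hedged illustration, not TAU's trace format: the `EventRecord` layout, the `EventType` values, and the helpers `merge_traces` and `time_in_region` are assumptions made for the example, and `time_in_region` assumes regions are not entered recursively.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

enum class EventType { Enter, Exit, Send, Recv };

// One event record (hypothetical layout).
struct EventRecord {
    std::uint64_t timestamp;  // e.g. microseconds since program start
    int node;                 // CPU / node identifier
    int thread;               // thread identifier
    EventType type;           // event type
    std::int64_t param;       // event-specific info: region id or message size
};

// Merge per-thread traces into one time-sequenced stream of event records.
inline std::vector<EventRecord> merge_traces(
        const std::vector<std::vector<EventRecord>>& traces) {
    std::vector<EventRecord> merged;
    for (const auto& t : traces) merged.insert(merged.end(), t.begin(), t.end());
    std::stable_sort(merged.begin(), merged.end(),
                     [](const EventRecord& a, const EventRecord& b) {
                         return a.timestamp < b.timestamp;
                     });
    return merged;
}

// Reconstruct dynamic behavior: total time one thread spent in one region,
// computed from matching Enter/Exit records.
inline std::uint64_t time_in_region(const std::vector<EventRecord>& trace,
                                    int thread, std::int64_t region) {
    std::uint64_t total = 0, entered = 0;
    bool inside = false;
    for (const auto& e : trace) {
        if (e.thread != thread || e.param != region) continue;
        if (e.type == EventType::Enter) {
            entered = e.timestamp;
            inside = true;
        } else if (e.type == EventType::Exit && inside) {
            assert(e.timestamp >= entered);
            total += e.timestamp - entered;
            inside = false;
        }
    }
    return total;
}
```

Unlike a profile, the merged stream keeps every event, so quantities such as per-region time can be derived after the run instead of being fixed at measurement time.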

12
TAU Performance System Architecture
13
TAU Performance System Goals

Multi-level performance instrumentation
Multi-language automatic source instrumentation
Flexible and configurable performance measurement
Widely-ported parallel performance profiling
system
Computer system architectures and operating
systems
Different programming languages and compilers
Support for multiple parallel programming
paradigms
Multi-threading, message passing, mixed-mode,
hybrid
Support for performance mapping
Support for object-oriented and generic
programming
Integration in complex software systems and
applications

14
How To Use TAU?

Instrumentation
Application code and libraries
Selective instrumentation
Install, compile, and link with TAU measurement
library
configure; make clean install
Multiple configurations for different measurement options
Does not require change in instrumentation
Selective measurement control
Execute experiments to produce performance data
Performance data generated at end or during
execution
Use analysis tools to look at performance results

15
TAU Instrumentation Approach

Support for standard program events
Routines
Classes and templates
Statement-level blocks
Support for user-defined events
Begin/End events (user-defined timers)
Atomic events
Selection of event statistics
Support definition of semantic entities for
mapping
Support for event groups
Instrumentation optimization

16
TAU Instrumentation

Flexible instrumentation mechanisms at multiple
levels
Source code
manual
automatic
C, C++, F77/90 (Program Database Toolkit (PDT))
OpenMP (directive rewriting (Opari))
Object code
pre-instrumented libraries (e.g., MPI using PMPI)
statically-linked and dynamically-linked
fast breakpoints (compiler generated)
Executable code
dynamic instrumentation (pre-execution)
(DynInstAPI)
virtual machine instrumentation (e.g., Java using
JVMPI)

17
Multi-Level Instrumentation

Targets common measurement interface
TAU API
Multiple instrumentation interfaces
Simultaneously active
Information sharing between interfaces
Utilizes instrumentation knowledge between levels
Selective instrumentation
Available at each level
Cross-level selection
Targets a common performance model
Presents a unified view of execution
Consistent performance events

18
Program Database Toolkit (PDT)

Program code analysis framework
develop source-based tools
High-level interface to source code information
Integrated toolkit for source code parsing,
database creation, and database query
Commercial grade front-end parsers
Portable IL analyzer, database format, and access
API
Open software approach for tool development
Multiple source languages
Implement automatic performance instrumentation
tools
tau_instrumentor

19
PDT Architecture and Tools
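[Figure: PDT architecture. Application / library source passes through the C / C++ parser and the Fortran 77/90 parser to produce IL; the C / C++ IL analyzer and the Fortran 77/90 IL analyzer create program database files; DUCTAPE gives access to them for PDBhtml (program documentation), SILOON (application component glue), CHASM (C++ / F90 interoperability), and TAU_instr (automatic source instrumentation)]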
20
PDT Components

Language front end
Edison Design Group (EDG) C, C++, Java
Mutek Solutions Ltd. F77, F90
IL Analyzer
Processes intermediate language (IL) tree from
front-end
Creates program database (PDB) formatted file
DUCTAPE (Bernd Mohr, FZJ/ZAM, Germany)
C++ program Database Utilities and Conversion Tools APplication Environment
Processes and merges PDB files
C++ library to access the PDB for PDT applications

21
Instrumentation Control

Selection of which performance events to observe
Could depend on scope, type, level of interest
Could depend on instrumentation overhead
How is selection supported in instrumentation
system?
No choice
Include / exclude lists (TAU)
Environment variables
Static vs. dynamic
Controlling the instrumentation of small routines
High relative measurement overhead
Significant intrusion and possible perturbation

22
Selective Instrumentation
tau_instrumentor
Usage: tau_instrumentor <pdbfile> <sourcefile> -o <outputfile> [-noinline] [-g groupname] [-i headerfile] [-c|-c++|-fortran] [-f <instr_req_file>]
For selective instrumentation, use the -f option
% cat selective.dat
# Selective instrumentation: Specify an exclude/include list.
BEGIN_EXCLUDE_LIST
void quicksort(int *, int, int)
void sort_5elements(int *)
void interchange(int *, int *)
END_EXCLUDE_LIST
# If an include list is specified, the routines in the list will be the only
# routines that are instrumented.
# To specify an include list (a list of routines that will be instrumented)
# remove the leading # to uncomment the following lines
#BEGIN_INCLUDE_LIST
#int main(int, char **)
#int select_
#END_INCLUDE_LIST
23
Overhead Analysis for Automatic Selection

Analyze the performance data to determine events
with high (relative) overhead performance
measurements
Create a select list for excluding those events
Rule grammar (used in tau_reduce tool)
[GroupName:] Field Operator Number
GroupName indicates rule applies to events in group
Field is an event metric attribute (from profile statistics)
numcalls, numsubs, percent, usec, cumusec, count, totalcount, stdev, usecs/call, counts/call
Operator is one of >, <, or =
Number is any number
Compound rules possible using & between simple rules

24
Example Rules

Exclude all events that are members of TAU_USER and use less than 1000 microseconds
TAU_USER:usec < 1000
Exclude all events that have less than 1000 microseconds and are called only once
usec < 1000 & numcalls = 1
Exclude all events that have less than 1000 usecs per call OR have a (total inclusive) percent less than 5
usecs/call < 1000
percent < 5
Scientific notation can be used
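
To make the rule grammar concrete, here is a minimal C++ sketch that parses a rule line of the form `[GroupName:] Field Operator Number`, with `&` joining simple rules, and tests it against one event's statistics. It is not the tau_reduce implementation: `EventStats`, `SimpleRule`, `parse_rule`, and `matches` are hypothetical names, and the sketch assumes well-formed input.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Profile statistics of one event (hypothetical layout).
struct EventStats {
    std::string group;                     // profile group, e.g. "TAU_USER"
    std::map<std::string, double> fields;  // "usec", "numcalls", "percent", ...
};

struct SimpleRule {
    std::string group;  // empty means the rule applies to any group
    std::string field;
    char op;            // '<', '>' or '='
    double number;
};

inline std::string trim(const std::string& s) {
    const char* ws = " \t";
    std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::string::size_type e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Parse one simple rule such as "TAU_USER:usec < 1000".
inline SimpleRule parse_simple(const std::string& text) {
    std::string::size_type pos = text.find_first_of("<>=");
    assert(pos != std::string::npos);
    SimpleRule r;
    std::string lhs = trim(text.substr(0, pos));
    std::string::size_type colon = lhs.find(':');
    if (colon != std::string::npos) {
        r.group = trim(lhs.substr(0, colon));
        lhs = trim(lhs.substr(colon + 1));
    }
    r.field = lhs;
    r.op = text[pos];
    r.number = std::stod(text.substr(pos + 1));  // accepts scientific notation
    return r;
}

// Split a compound rule on '&' into its simple rules.
inline std::vector<SimpleRule> parse_rule(const std::string& line) {
    std::vector<SimpleRule> parts;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type amp = line.find('&', start);
        std::string piece = (amp == std::string::npos)
                                ? line.substr(start)
                                : line.substr(start, amp - start);
        parts.push_back(parse_simple(piece));
        if (amp == std::string::npos) break;
        start = amp + 1;
    }
    return parts;
}

// An event matches a compound rule only if every simple rule holds.
inline bool matches(const std::vector<SimpleRule>& rule, const EventStats& ev) {
    for (const SimpleRule& r : rule) {
        if (!r.group.empty() && r.group != ev.group) return false;
        auto it = ev.fields.find(r.field);
        if (it == ev.fields.end()) return false;
        double v = it->second;
        bool ok = (r.op == '<') ? (v < r.number)
                : (r.op == '>') ? (v > r.number)
                                : (v == r.number);
        if (!ok) return false;
    }
    return true;
}
```

An event that matches any rule line would then be added to the exclude list; evaluating each line separately gives the OR behavior of the third example.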

25
TAU Measurement

Performance information
Performance events
High-resolution timer library (real-time /
virtual clocks)
General software counter library (user-defined
events)
Hardware performance counters
PCL (Performance Counter Library) (ZAM, Germany)
PAPI (Performance API) (UTK, Ptools Consortium)
consistent, portable API
Organization
Node, context, thread levels
Profile groups for collective events (runtime
selective)
Performance data mapping between software levels

26
TAU Measurement Options

Parallel profiling
Function-level, block-level, statement-level
Supports user-defined events
TAU parallel profile data stored during execution
Hardware counts values
Support for multiple counters
Support for callpath profiling
Tracing
All profile-level events
Inter-process communication events
Timestamp synchronization
Trace merging and format conversion

27
TAU Measurement System Configuration

configure OPTIONS
-c++=<CC>, -cc=<cc> Specify C++ and C compilers
-pthread, -sproc, -smarts Use pthread, SGI sproc, smarts threads
-openmp Use OpenMP threads
-opari=<dir> Specify location of Opari OpenMP tool
-papi=<dir>, -pcl=<dir> Specify location of PAPI or PCL
-pdt=<dir> Specify location of PDT
-mpiinc=<d>, -mpilib=<d> Specify MPI library instrumentation
-TRACE Generate TAU event traces
-PROFILE Generate TAU profiles
-PROFILECALLPATH Generate Callpath profiles (1-level)
-MULTIPLECOUNTERS Use more than one hardware counter
-CPUTIME Use usertime+system time
-PAPIWALLCLOCK Use PAPI to access wallclock time
-PAPIVIRTUAL Use PAPI for virtual (user) time

28
TAU Measurement API

Initialization and runtime configuration
TAU_PROFILE_INIT(argc, argv);
TAU_PROFILE_SET_NODE(myNode);
TAU_PROFILE_SET_CONTEXT(myContext);
TAU_PROFILE_EXIT(message);
TAU_REGISTER_THREAD();
Function and class methods
TAU_PROFILE(name, type, group);
Template
TAU_TYPE_STRING(variable, type);
TAU_PROFILE(name, type, group);
CT(variable);
User-defined timing
TAU_PROFILE_TIMER(timer, name, type, group);
TAU_PROFILE_START(timer);
TAU_PROFILE_STOP(timer);

29
TAU Measurement API (continued)

User-defined events
TAU_REGISTER_EVENT(variable, event_name);
TAU_EVENT(variable, value);
TAU_PROFILE_STMT(statement);
Mapping
TAU_MAPPING(statement, key);
TAU_MAPPING_OBJECT(funcIdVar);
TAU_MAPPING_LINK(funcIdVar, key);
TAU_MAPPING_PROFILE(funcIdVar);
TAU_MAPPING_PROFILE_TIMER(timer, funcIdVar);
TAU_MAPPING_PROFILE_START(timer);
TAU_MAPPING_PROFILE_STOP(timer);
Reporting
TAU_REPORT_STATISTICS();
TAU_REPORT_THREAD_STATISTICS();

30
Grouping Performance Data in TAU

Profile Groups
A group of related routines forms a profile group
Statically defined
TAU_DEFAULT, TAU_USER1-5, TAU_MESSAGE, TAU_IO, ...
Dynamically defined
group name based on string, such as adlib or
particles
runtime lookup in a map to get unique group
identifier
uses tau_instrumentor to instrument
Ability to change group names at runtime
Group-based instrumentation and measurement
control

31
TAU Group Instrumentation Control API

Enabling Profile Groups
TAU_ENABLE_INSTRUMENTATION()
TAU_ENABLE_GROUP(TAU_GROUP)
TAU_ENABLE_GROUP_NAME(group name)
TAU_ENABLE_ALL_GROUPS()
Disabling Profile Groups
TAU_DISABLE_INSTRUMENTATION()
TAU_DISABLE_GROUP(TAU_GROUP)
TAU_DISABLE_GROUP_NAME()
TAU_DISABLE_ALL_GROUPS()
Obtaining Profile Group Identifier
Runtime Switching of Profile Groups

32
TAU Pre-execution Control

Dynamic groups defined at file scope
Group names and group associations runtime
modifiable
Controlling groups at pre-execution time
--profile <group1+group2+...+groupN> option
tau_instrumentor app.pdb app.cpp \
-o app.i.cpp -g particles
mpirun -np 4 application \
--profile particles+field+mesh+io
Examples
POOMA (LANL) uses static groups
VTF (Caltech) uses dynamic group in Python-based
execution instrumentation control

33
Configuring TAU Measurement Library

Profiling with wallclock time (on a quad PIII
Linux machine)
configure -mpiinc=/usr/local/packages/mpich/include -mpilib=/usr/local/packages/mpich/lib -pdt=/usr/pkg/pdtoolkit/ -useropt=-O2 -LINUXTIMERS
Tracing
configure -TRACE -mpiinc=/usr/local/packages/mpich/include -mpilib=/usr/local/packages/mpich/lib -pdt=/usr/pkg/pdtoolkit -useropt=-O2 -LINUXTIMERS
Profiling with PAPI
configure -mpiinc=/usr/local/packages/mpich/include -mpilib=/usr/local/packages/mpich/lib -pdt=/usr/pkg/pdtoolkit/ -useropt=-O2 -papi=/usr/local/packages/papi
setenv PAPI_EVENT PAPI_FP_INS
setenv PAPI_EVENT PAPI_L1_DCM

34
Compiling with TAU Makefiles

Include TAU Stub Makefile (<arch>/lib) in the user's Makefile
Variables
TAU_CXX Specify the C++ compiler used by TAU
TAU_CC, TAU_F90 Specify the C, F90 compilers
TAU_DEFS Defines used by TAU. Add to CFLAGS
TAU_LDFLAGS Linker options. Add to LDFLAGS
TAU_INCLUDE Header files include path. Add to CFLAGS
TAU_LIBS Statically linked TAU library. Add to LIBS
TAU_SHLIBS Dynamically linked TAU library
TAU_MPI_LIBS TAU's MPI wrapper library for C/C++
TAU_MPI_FLIBS TAU's MPI wrapper library for F90
TAU_FORTRANLIBS Must be linked in with C++ linker for F90
TAU_DISABLE TAU's dummy F90 stub library

35
TAU Analysis

Parallel profile analysis
Pprof
parallel profiler with text-based display
Racy
graphical interface to pprof (Tcl/Tk)
jRacy
Java implementation of Racy
Trace analysis and visualization
Trace merging and clock adjustment (if necessary)
Trace format conversion (ALOG, SDDF, VTF,
Paraver)
Trace visualization using Vampir (Pallas)

36
Pprof Command

pprof [-c|-b|-m|-t|-e|-i] [-r] [-s] [-n num] [-f file] [-l] [nodes]
-c Sort according to number of calls
-b Sort according to number of subroutines called
-m Sort according to msecs (exclusive time total)
-t Sort according to total msecs (inclusive time
total)
-e Sort according to exclusive time per call
-i Sort according to inclusive time per call
-v Sort according to standard deviation
(exclusive usec)
-r Reverse sorting order
-s Print only summary profile information
-n num Print only first number of functions
-f file Specify full path and filename without
node ids
-l List all functions and exit
nodes Print only info about all contexts/threads of given node numbers

37
Pprof Output (NAS Parallel Benchmark LU)

Intel QuadPIII Xeon
F90 MPICH
Profile - Node - Context - Thread
Events - code - MPI

38
jRacy (NAS Parallel Benchmark LU)
Routine profile across all nodes
n node c context t thread
Global profiles
Event legend
Individual profile
39
Paraprof Profile Browser
40
Paraprof Profile Browser Main Window
41
Paraprof Profile Browser Node Window
42
Paraprof Profile Browser (Derived Metrics)
43
Paraprof Profile Browser Routine Window
44
TAU PAPI (NAS Parallel Benchmark LU )

Floating point operations
Re-link to alternate library
Can use multiple counter support

45
TAU Vampir (NAS Parallel Benchmark LU)
Callgraph display
Timeline display
Parallelism display
Communications display
46
tau_reduce Example

tau_reduce implements overhead reduction in TAU
Consider klargest example
Find kth largest element among N elements
Compare two methods: quicksort, select_kth_largest
Un-instrumented testcase: i = 2324, N = 1000000
quicksort (wall clock) 0.188511 secs
select_kth_largest (wall clock) 0.149594 secs
Total (PIII/1.2GHz time) 0.340u 0.020s 0:00.37
Execute with all routines instrumented
Execute with rule-based selective instrumentation
usec>1000 & numcalls>400000 & usecs/call<30 & percent>25

47
Simple sorting example on one processor
Before selective instrumentation reduction

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive        #Call       #Subrs  Inclusive Name
              msec   total msec                            usec/call
---------------------------------------------------------------------------------------
100.0           13        4,982            1            4    4982030 int main
 93.5        3,223        4,659  4.20241E+06  1.40268E+07          1 void quicksort
 62.9      0.00481        3,134            5            5     626839 int kth_largest_qs
 36.4          137        1,813           28       450057      64769 int select_kth_largest
 33.6          150        1,675       449978       449978          4 void sort_5elements
 28.8        1,435        1,435  1.02744E+07            0          0 void interchange
  0.4           20           20            1            0      20668 void setup
  0.0       0.0118       0.0118           49            0          0 int ceil

After selective instrumentation reduction
NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive        #Call       #Subrs  Inclusive Name
              msec   total msec                            usec/call
---------------------------------------------------------------------------------------
100.0           14          383            1            4     383333 int main
 50.9          195          195            5            0      39017 int kth_largest_qs
 40.0          153          153           28           79       5478 int select_kth_largest
  5.4           20           20            1            0      20611 void setup
  0.0         0.02         0.02           49            0          0 int ceil
48
TAU Performance System Status

Computing platforms
IBM SP / Power4, SGI Origin 2K/3K, ASCI Red, Cray T3E / SV-1 (X-1 planned), HP (Compaq) SC (Tru64), HP Superdome (HP-UX), Sun, Hitachi SR8000, NEC SX-5 (SX-6 underway), Linux clusters (IA-32/64, Alpha, PPC, PA-RISC, Power), Apple (OS X), Windows
Programming languages
C, C++, Fortran 77, F90, HPF, Java, OpenMP, Python
Communication libraries
MPI, PVM, Nexus, shmem, Tulip, ACLMPL, MPIJava
Thread libraries
pthreads, SGI sproc, Java, Windows, OpenMP, SMARTS

49
TAU Performance System Status (continued)

Compilers
Intel KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, Sun,
Microsoft, SGI, Cray, IBM, Compaq, Hitachi, NEC,
Intel
Application libraries (selected)
Blitz++, A++/P++, PETSc, SAMRAI, Overture, PAWS
Application frameworks (selected)
POOMA, MC++, Conejo, Uintah, VTF, UPS, GrACE
Performance projects using TAU
Aurora / SCALEA ACPC, University of Vienna
TAU full distribution (Version 2.12, web
download)
TAU performance system toolkit and user's guide
Automatic software installation and examples

50
PDT Status

Program Database Toolkit (Version 2.2, web
download)
EDG C++ front end (Version 2.45.2)
Mutek Fortran 90 front end (Version 2.4.1)
C++ and Fortran 90 IL Analyzer
DUCTAPE library
Standard C++ system header files (KCC Version 4.0f)
PDT-constructed tools
TAU instrumentor (C/C++/F90)
Program analysis support for SILOON and CHASM
Platforms
Same as for TAU with a few exceptions

51
Performance Mapping

High-level semantic abstractions
Associate performance measurements
Performance mapping
performance measurement system support to assign
data correctly

52
Semantic Entities/Attributes/Associations

New dynamic mapping scheme (SEAA)
Contrast with ParaMap (Miller and Irvin)
Entities defined at any level of abstraction
Attribute entity with semantic information
Entity-to-entity associations
Two association types (implemented in TAU API)
Embedded: extends associated object to store performance measurement entity
External: creates an external look-up table using address of object as key to locate performance measurement entity
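
The two association types can be sketched in C++ as follows. This is only an illustration under stated assumptions, not the TAU mapping API: `PerfEntity`, `WorkPacket`, `ExternalMap`, and `record` are hypothetical names. The embedded form adds a field to the object itself; the external form leaves the object untouched and looks the measurement entity up by the object's address.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Performance measurement entity (hypothetical layout).
struct PerfEntity {
    std::string name;
    long count = 0;
    double usec = 0;
};

// Embedded association: the associated object is extended with a field
// that stores its performance measurement entity.
struct WorkPacket {
    int face;
    PerfEntity* perf = nullptr;
};

// External association: a look-up table keyed by the object's address.
class ExternalMap {
public:
    void link(const void* object, PerfEntity* entity) { table_[object] = entity; }
    PerfEntity* find(const void* object) const {
        auto it = table_.find(object);
        return it == table_.end() ? nullptr : it->second;
    }
private:
    std::unordered_map<const void*, PerfEntity*> table_;
};

// Charge one measurement to whichever entity the association produced.
inline void record(PerfEntity* entity, double usec) {
    assert(entity != nullptr);
    entity->count += 1;
    entity->usec += usec;
}
```

The embedded form avoids a look-up on every measurement but requires changing the object's type; the external form works with objects whose definition cannot be modified, at the cost of a hash look-up.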

53
Hypothetical Mapping Example

Particles distributed on surfaces of a cube

Particle* P[MAX]; /* Array of particles */
int GenerateParticles() {
  /* distribute particles over all faces of the cube */
  for (int face=0, last=0; face < 6; face++) {
    /* particles on this face */
    int particles_on_this_face = num(face);
    for (int i=last; i < last + particles_on_this_face; i++) {
      /* particle properties are a function of face */
      P[i] = ... f(face);
      ...
    }
    last += particles_on_this_face;
  }
}
54
Hypothetical Mapping Example (continued)
int ProcessParticle(Particle *p) {
  /* perform some computation on p */
}
int main() {
  GenerateParticles();
  /* create a list of particles */
  for (int i = 0; i < N; i++)
    /* iterates over the list */
    ProcessParticle(P[i]);
}

work packets

engine

How much time is spent processing face i
particles?
What is the distribution of performance among
faces?

55
No Performance Mapping versus Mapping

Typical performance tools report performance with
respect to routines
Does not provide support for mapping

Performance tools with SEAA mapping can observe
performance with respect to the scientist's
programming and problem abstractions

TAU (w/ mapping)
TAU (no mapping)
56
Performance Mapping in Callpath Profiling

Consider callgraph (callpath) profiling
Measure time (metric) along an edge (path) of
callgraph
Incident edge gives parent / child view
Edge sequence (path) gives parent / descendant
view
Callpath profiling when callgraph is unknown
Must determine callgraph dynamically at runtime
Map performance measurement to dynamic call path
state
Callpath levels
0-level: current callgraph node
1-level: immediate parent (descendant)
k-level: kth calling parent (call descendant)

57
1-Level Callpath Implementation in TAU

TAU maintains a performance event (routine)
callstack
Profiled routine (child) looks in callstack for
parent
Previous profiled performance event is the parent
A callpath profile structure created first time
parent calls
TAU records parent in a callgraph map for child
String representing 1-level callpath used as its
key
"a( ) => b( )" is the name for time spent in b when called by a
Map returns pointer to callpath profile structure
1-level callpath is profiled using this profiling
data
Build upon TAU's performance mapping technology
Measurement is independent of instrumentation
Use -PROFILECALLPATH to configure TAU
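
The 1-level callpath scheme above can be sketched in C++ as follows. This is a simplified illustration rather than TAU's implementation: `CallpathProfiler` and `CallpathProfile` are hypothetical names, the measured time is passed in by the caller instead of being read from a timer, and the map stores profile structures by value rather than returning pointers.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Profile data kept for one callpath (hypothetical layout).
struct CallpathProfile {
    long calls = 0;
    double usec = 0;
};

class CallpathProfiler {
public:
    // Routine entry: the previously profiled event on the callstack is the
    // parent. Returns the 1-level callpath key for this invocation.
    std::string enter(const std::string& routine) {
        std::string key = make_key(routine);
        stack_.push_back(routine);
        return key;
    }

    // Routine exit: charge the measured time to the "parent => child" entry.
    // The entry is created the first time this parent calls this child.
    void exit(double usec) {
        assert(!stack_.empty());
        std::string child = stack_.back();
        stack_.pop_back();
        CallpathProfile& p = profiles_[make_key(child)];
        p.calls += 1;
        p.usec += usec;
    }

    const std::map<std::string, CallpathProfile>& profiles() const {
        return profiles_;
    }

private:
    // The string representing the 1-level callpath is used as the map key;
    // a routine with no profiled parent is keyed by its own name.
    std::string make_key(const std::string& routine) const {
        return stack_.empty() ? routine : stack_.back() + " => " + routine;
    }

    std::vector<std::string> stack_;                   // performance event callstack
    std::map<std::string, CallpathProfile> profiles_;  // callgraph map
};
```

Because the key is built from the callstack at measurement time, the same instrumented routine `b()` yields separate entries for `a() => b()` and `c() => b()` without any change to its instrumentation.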

58
Callpath Profiling Example (NAS LU v2.3)

configure -PROFILECALLPATH -SGITIMERS -arch=sgi64 -mpiinc=/usr/include -mpilib=/usr/lib64 -useropt=-O2

59
Callpath Parallel Profile Display

0-level and 1-level callpath grouping

1-Level Callpath
0-Level Callpath
60
Strategies for Empirical Performance Evaluation

Empirical performance evaluation as a series of
performance experiments
Experiment trials describing instrumentation and
measurement requirements
Where/When/How axes of empirical performance
space
where are performance measurements made in
program
when is performance instrumentation done
how are performance measurement/instrumentation
chosen
Strategies for achieving flexibility and
portability goals
Limited performance methods restrict evaluation
scope
Non-portable methods force use of different
techniques
Integration and combination of strategies

61
Case Study SIMPLE Performance Analysis

SIMPLE hydrodynamics benchmark
C code with MPI message communication
Multiple instrumentation methods
source-to-source translation (PDT)
MPI wrapper library level instrumentation (PMPI)
pre-execution binary instrumentation (DyninstAPI)
Alternative measurement strategies
statistical profiles of software actions
statistical profiles of hardware actions (PCL,
PAPI)
program event tracing
choice of time source
gettimeofday, high-res physical, CPU, process
virtual

62
SIMPLE Source Instrumentation (Preprocessed)

PDT automatically generates instrumentation code
names events with full function signatures
Similarly for all other routines in SIMPLE program

int compute_heat_conduction(double theta_hat[X][Y], double deltat,
    double new_r[X][Y], double new_z[X][Y], double new_alpha[X][Y],
    double new_rho[X][Y], double theta_l[X][Y], double Gamma_k[X][Y],
    double Gamma_l[X][Y])
{
  TAU_PROFILE("int compute_heat_conduction(double (*)[259], double, double (*)[259], double (*)[259], double (*)[259], double (*)[259], double (*)[259], double (*)[259], double (*)[259])", " ", TAU_USER);
  ...
63
MPI Library Instrumentation (MPI_Send)

Uses MPI profiling interposition library (PMPI)

int MPI_Send(...)
...
{
  int returnVal, typesize;
  TAU_PROFILE_TIMER(tautimer, "MPI_Send()", " ", TAU_MESSAGE);
  TAU_PROFILE_START(tautimer);
  if (dest != MPI_PROC_NULL) {
    PMPI_Type_size(datatype, &typesize);
    TAU_TRACE_SENDMSG(tag, dest, typesize*count);
  }
  returnVal = PMPI_Send(buf, count, datatype, dest, tag, comm);
  TAU_PROFILE_STOP(tautimer);
  return returnVal;
}
64
MPI Library Instrumentation (MPI_Recv)

int MPI_Recv(...)
...
{
  int returnVal, size;
  TAU_PROFILE_TIMER(tautimer, "MPI_Recv()", " ", TAU_MESSAGE);
  TAU_PROFILE_START(tautimer);
  returnVal = PMPI_Recv(buf, count, datatype, src, tag, comm, status);
  if (src != MPI_PROC_NULL && returnVal == MPI_SUCCESS) {
    PMPI_Get_count(status, MPI_BYTE, &size);
    TAU_TRACE_RECVMSG(status->MPI_TAG, status->MPI_SOURCE, size);
  }
  TAU_PROFILE_STOP(tautimer);
  return returnVal;
}

65
Multi-Level Instrumentation (Profiling)
four processes
event legend
Profile per process
global profile
66
Multi-Level Instrumentation (Tracing)

Relink with TAU library configured for tracing
No modification of source instrumentation
required!

TAU performance groups
67
Dynamic Instrumentation of SIMPLE

Uses DynInstAPI for runtime code patching
Mutator loads measurement library, instruments
mutatee
One mutator (tau_run) per executable image
mpirun -np <n> tau.shell

68
Case Study PETSc v2.1.3 (ANL)

Portable, Extensible Toolkit for Scientific
Computation
Scalable (parallel) PDE framework
Suite of data structures and routines (374,458
code lines)
Solution of scientific applications modeled by
PDEs
Parallel implementation
MPI used for inter-process communication
TAU instrumentation
PDT for C/C++ source instrumentation (100%, no manual)
MPI wrapper interposition library instrumentation
Example
Linear system of equations (Ax = b) (SLES) (ex2
test case)
Non-linear system of equations (SNES) (ex19 test
case)

69
PETSc ex2 (Profile - wallclock time)
Sorted with respect to exclusive time
70
PETSc ex2 (Profile - overall and message counts)

Observe load balance
Track messages

Capture with user-defined events
71
PETSc ex2 (Profile - percentages and time)

View per thread performance on individual routines

72
PETSc ex2 (Trace)
73
PETSc ex19

Non-linear solver (SNES)
2-D driven cavity code
Uses velocity-vorticity formulation
Finite difference discretization on a structured
grid
Problem size and measurements
56x56 mesh size on quad Pentium III (550 MHz,
Linux)
Executes for approximately one minute
MPI wrapper interposition library
PDT (tau_instrumentor)
Selective instrumentation (tau_reduce)
three routines identified with high
instrumentation overhead

74
PETSc ex19 (Profile - wallclock time)
Sorted by inclusive time
Sorted by exclusive time
75
PETSc ex19 (Profile - overall and percentages)
76
PETSc ex19 (Tracing)
Commonly seen communication behavior
77
PETSc ex19 (Tracing - callgraph)
78
PETSc ex19 (PAPI_FP_INS, PAPI_L1_DCM)

Uses multiple counter profile measurement

PAPI_FP_INS
PAPI_L1_DCM
79
Case Study Mixed-mode Parallel Programs

Portable mixed-mode parallel programming
Multi-threaded shared memory programming
Inter-node message passing
Performance measurement
Access to runtime system and communication events
Associate communication and application events
2-Dimensional Stommel model of ocean circulation
OpenMP for shared memory parallel programming
MPI for cross-box message-based parallelism
Jacobi iteration, 5-point stencil
Timothy Kaiser (San Diego Supercomputing Center)

80
Stommel Instrumentation

OpenMP directive instrumentation (uses OPARI)

pomp_for_enter(&omp_rd_2);
#line 252 "stommel.c"
#pragma omp for schedule(static) reduction(+: diff) private(j) firstprivate (a1,a2,a3,a4,a5) nowait
for( i=i1;i<=i2;i++) {
  for(j=j1;j<=j2;j++){
    new_psi[i][j]=a1*psi[i+1][j] + a2*psi[i-1][j] + a3*psi[i][j+1]
                  + a4*psi[i][j-1] - a5*the_for[i][j];
    diff=diff+fabs(new_psi[i][j]-psi[i][j]);
  }
}
pomp_barrier_enter(&omp_rd_2);
#pragma omp barrier
pomp_barrier_exit(&omp_rd_2);
pomp_for_exit(&omp_rd_2);
#line 261 "stommel.c"
81
OpenMP + MPI Ocean Modeling (Trace)
Thread-paired message passing
Integrated OpenMP + MPI events
82
OpenMP + MPI Ocean Modeling (HW Profile)
configure -papi=../packages/papi -openmp -c++=pgCC -cc=pgcc -mpiinc=../packages/mpich/include -mpilib=../packages/mpich/lib
Integrated OpenMP + MPI events
FP instructions
83
Case Study: C++ and Performance Mapping

Object-oriented programming
abstract data types, encapsulation, inheritance, ...
Domain-specific abstractions
Implemented by OO languages in form of class
libraries
Generic programming mechanisms
efficient coding abstractions, compile-time
transformations
Creates a semantic gap between the transformed code and what the user expects (as described in source code)
Need a mechanism to expose the nature of
high-level abstract computation to the
performance tools
Map low-level performance data to high-level
semantics

84
C++ Template Instrumentation (Blitz++, PETE)

High-level objects
Array classes
Templates (Blitz)
Optimizations
Array processing
Expressions (PETE)
Relate performance data to high-level statement
Complexity of template evaluation

Array expressions
Array expressions
85
Standard Template Instrumentation Difficulties

Instantiated templates result in mangled
identifiers
Standard profiling techniques / tools are
deficient
Integrated with proprietary compilers
Specific systems platforms and programming models

Uninterpretable routine names
Very long!
86
Blitz Library Instrumentation

Expression templates
embed the form of the expression in a template
name
Blitz describes structure of the expression
template
Present as pretty printed name to the profiling
toolkit
Create performance event associated with
expression type

Expression: B + C - 2.0 * D

BinOp<Add, B, <BinOp<Subtract, C, <BinOp<Multiply, Scalar<2.0>, D>>>
[Figure: expression tree over the operands B, C, 2.0, and D]
87
Blitz Library Instrumentation (example)

#ifdef BZ_TAU_PROFILING
  static string exprDescription;
  if (!exprDescription.length()) {
    exprDescription = "A";
    prettyPrintFormat format(_bz_true); // terse mode on
    format.nextArrayOperandSymbol();
    T_update::prettyPrint(exprDescription);
    expr.prettyPrint(exprDescription, format);
  }
  TAU_PROFILE(" ", exprDescription, TAU_BLITZ);
#endif

exprDescription is the event name
88
TAU Instrumentation and Profiling for C++
Profile of expression types
Performance data presented with respect to
high-level array expression types
89
Case Study C-SAFE / Uintah

Center for the Simulation of Accidental Fires and
Explosions
ASCI ASAP Level 1 center, University of Utah
PSE for multi-model simulation of high-energy
explosions
Coupled non-linear solvers, optimization,
computational steering, visualization, and
experimental data verification
Very large-scale simulations
Computer science problems
Coupling of multiple simulation codes
Software engineering across diverse expert teams
Achieving high performance on large-scale systems

90
Example C-SAFE Simulation Problems
Heptane fire simulation
Typical C-SAFE simulation with a billion degrees
of freedom and non-linear time dynamics
Material stress simulation
91
Uintah Computational Framework (UCF)

Execution model based on software (macro)
dataflow
Exposes parallelism and hides data transport
latency
Computations expressed as directed acyclic graphs
of tasks
Each task consumes input and produces output (input
to future tasks)
input/outputs specified for each patch in a
structured grid
Abstraction of global single-assignment memory
DataWarehouse
Directory mapping names to values (array
structured)
Write value once then communicate to awaiting
tasks
Task graph gets mapped to processing resources
Communication schedule approximates the global
optimum
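The execution model above (tasks that fire when their inputs exist, and a write-once store) can be sketched in a few lines. This is a simplified, serial illustration under stated assumptions: the `DataWarehouse`, `Task`, and `runTaskGraph` shown here are hypothetical stand-ins written for this example and do not reproduce the Uintah classes, patches, or scheduler.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical single-assignment store: each name may be written exactly once.
class DataWarehouse {
public:
    void put(const std::string& name, double value) {
        bool inserted = values_.emplace(name, value).second;
        if (!inserted) throw std::logic_error("value already assigned: " + name);
    }
    bool has(const std::string& name) const { return values_.count(name) != 0; }
    double get(const std::string& name) const { return values_.at(name); }
private:
    std::map<std::string, double> values_;
};

// Hypothetical task: lists the names it needs and writes its outputs in body.
struct Task {
    std::string name;
    std::vector<std::string> inputs;
    std::function<void(DataWarehouse&)> body;
};

// Repeatedly run the first task whose inputs are all available.
// Returns the order in which the tasks executed.
std::vector<std::string> runTaskGraph(std::vector<Task> pending, DataWarehouse& dw) {
    std::vector<std::string> order;
    while (!pending.empty()) {
        bool progressed = false;
        for (std::size_t i = 0; i < pending.size(); ++i) {
            bool ready = true;
            for (const auto& input : pending[i].inputs) ready = ready && dw.has(input);
            if (!ready) continue;
            assert(pending[i].body);  // a task must have something to run
            pending[i].body(dw);
            order.push_back(pending[i].name);
            pending.erase(pending.begin() + static_cast<std::ptrdiff_t>(i));
            progressed = true;
            break;
        }
        if (!progressed) throw std::logic_error("unsatisfied dependency or cycle");
    }
    return order;
}

// Example graph: "scale" consumes the value that "init" produces.
std::vector<std::string> exampleRun(DataWarehouse& dw) {
    std::vector<Task> tasks;
    tasks.push_back(Task{"scale", {"x"},
                         [](DataWarehouse& w) { w.put("y", 2.0 * w.get("x")); }});
    tasks.push_back(Task{"init", {},
                         [](DataWarehouse& w) { w.put("x", 21.0); }});
    return runTaskGraph(tasks, dw);
}
```

In `exampleRun`, the tasks are listed out of order, yet `init` executes first because `scale` is not ready until `x` has been written. The dependencies, not the listing order, determine execution.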

92
Performance Technology Integration

Uintah presents challenges to performance
integration
Software diversity and structure
UCF middleware, simulation code modules
component-based hierarchy
Portability objectives
cross-language and cross-platform
multi-parallelism: thread, message passing, mixed
Scalability objectives
High-level programming and execution abstractions
Requires flexible and robust performance
technology
Requires support for performance mapping

93
Task Execution in Uintah Parallel Scheduler

Profile methods and functions in scheduler and in
MPI library

Task execution time dominates (what task?)
Task execution time distribution
MPI communication overheads (where?)

Need to map performance data!

94
Uintah Task Performance Mapping

Uintah partitions individual particles across
processing elements (processes or threads)
Simulation tasks in task graph work on particles
Tasks have domain-specific character in the
computation
interpolate particles to grid in Material Point
Method
Task instances generated for each partitioned
particle set
Execution scheduled with respect to task
dependencies
How to attribute execution time among different
tasks?
Assign semantic name (task type) to a task
instance
SerialMPM::interpolateParticleToGrid
Map TAU timer object to (abstract) task (semantic
entity)
Look up timer object using task type (semantic
attribute)
Further partition along different domain-specific
axes
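The lookup step described above, where a timer is found by task type and shared by all instances of that type, can be sketched as follows. This is a minimal illustration: `TaskTimer`, `TaskTimerRegistry`, and `runAttributed` are hypothetical names invented for this example and are not the TAU mapping API.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical accumulator standing in for a TAU timer object.
struct TaskTimer {
    long calls = 0;
    std::chrono::nanoseconds total{0};
};

// Hypothetical registry: one timer per task type (the semantic attribute),
// shared by every task instance of that type.
class TaskTimerRegistry {
public:
    TaskTimer& lookup(const std::string& taskType) { return timers_[taskType]; }
    std::size_t size() const { return timers_.size(); }
private:
    std::map<std::string, TaskTimer> timers_;
};

// Run one task instance and charge its elapsed time to its task type.
template <class Body>
void runAttributed(TaskTimerRegistry& registry, const std::string& taskType, Body body) {
    assert(!taskType.empty());
    TaskTimer& timer = registry.lookup(taskType);
    const auto start = std::chrono::steady_clock::now();
    body();
    timer.total += std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::steady_clock::now() - start);
    ++timer.calls;
}
```

The scheduler code that calls `runAttributed` stays generic. Only the task type string decides which timer accumulates the time, so a profile can report per-task-type totals instead of a single scheduler entry.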

95
Mapping Instrumentation in UCF (example)

Use TAU performance mapping API

void MPIScheduler::execute(const ProcessorGroup * pc,
                           DataWarehouseP & old_dw,
                           DataWarehouseP & dw )
{
  ...
  TAU_MAPPING_CREATE(task->getName(), "MPIScheduler::execute()",
                     (TauGroup_t)(void*)task->getName(),
                     task->getName(), 0);
  ...
  TAU_MAPPING_OBJECT(tautimer)
  TAU_MAPPING_LINK(tautimer, (TauGroup_t)(void*)task->getName());
  // EXTERNAL ASSOCIATION
  ...
  TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0)
  TAU_MAPPING_PROFILE_START(doitprofiler, 0);
  task->doit(pc);
  TAU_MAPPING_PROFILE_STOP(0);
  ...
}
96
Task Performance Mapping (Profile)
Mapped task performance across processes
Performance mapping for different tasks
97
Work Packet to Task Mapping (Trace)
Work packet computation events colored by task
type
Distinct phases of computation can be identified
based on task
98
Comparing Uintah Traces for Scalability Analysis
32 processes
99
Online Performance Analysis for C-SAFE Apps
[Figure: online performance analysis architecture.
Components: Application, TAU Performance System,
Performance Data Reader, Performance Data Integrator,
Performance Analyzer, Performance Visualizer
(SCIRun, Univ. of Utah).
Labels: performance data streams, performance data
output, accumulated samples, file system, sample
sequencing, reader synchronization]
100
2D Field Performance Visualization in SCIRun
SCIRun program
101
Uintah Computational Framework (UCF)

UCF analysis
Scheduling
MPI library
Components
500 processes
Online and offline visualization
Performance steering
use SCIRun support

102
Case Study: SAMRAI (LLNL)

Structured Adaptive Mesh Refinement Application
Infrastructure (SAMRAI)
Programming
C++ and MPI
SPMD
Instrumentation
PDT for automatic instrumentation of routines
MPI interposition wrappers
SAMRAI timers for interesting code segments
timers classified in groups (apps, mesh, ...)
timer groups are managed by TAU groups

103
SAMRAI (Profile)

Euler (2D)

routine name
return type
104
SAMRAI Euler (Profile)
105
SAMRAI Euler (Trace)
106
Case Study: EVH1

Enhanced Virginia Hydrodynamics 1 (EVH1)
"TeraScale Simulations of Neutrino-Driven
Supernovae and Their Nucleosynthesis" SciDAC
project
Configured to run a simulation of the
Sedov-Taylor blast wave solution in 2D spherical
geometry
Performance study found EVH1 communication bound
for more than 64 processors
Predominant routine (>50% of execution time) at
this scale is MPI_ALLTOALL
Used in matrix transpose-like operations
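Why a transpose leads to an all-to-all exchange can be seen in a small in-process simulation. This sketch makes no MPI calls and is not taken from EVH1. It assumes a square matrix distributed by row blocks, and the function name `transposeAllToAll` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Simulate the transpose of an n x n matrix distributed by row blocks over
// `ranks` processes. Every rank hands one block to every other rank, which
// is the all-to-all pattern. `messages` receives the number of rank-to-rank
// block transfers (copies that stay on the same rank are not counted).
Matrix transposeAllToAll(const Matrix& a, std::size_t ranks, std::size_t& messages) {
    const std::size_t n = a.size();
    assert(ranks > 0 && n % ranks == 0);
    for (const auto& row : a) assert(row.size() == n);
    const std::size_t rowsPerRank = n / ranks;
    Matrix t(n, std::vector<double>(n, 0.0));
    messages = 0;
    for (std::size_t src = 0; src < ranks; ++src) {
        for (std::size_t dst = 0; dst < ranks; ++dst) {
            if (src != dst) ++messages;
            // Block made of the rows owned by src and the columns bound for dst.
            for (std::size_t i = src * rowsPerRank; i < (src + 1) * rowsPerRank; ++i) {
                for (std::size_t j = dst * rowsPerRank; j < (dst + 1) * rowsPerRank; ++j) {
                    t[j][i] = a[i][j];  // dst stores the block transposed
                }
            }
        }
    }
    return t;
}
```

In this naive exchange the number of block transfers is ranks x (ranks - 1), so it grows quadratically with the process count while each block shrinks. That is consistent with communication becoming the limiting factor as more processors are added.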

107
EVH1 Execution Profile
108
EVH1 Execution Trace
MPI_Alltoall is an execution bottleneck
109
TAU Integration (Selected)

SAMRAI (LLNL)
Overture (LLNL)
C-SAFE (ASCI ASAP)
VTF (ASCI ASAP)
SAGE (ASCI LANL)
POOMA, POOMA-II (LANL, CodeSourcery)
PETSc (ANL)
CCA (DOE SciDAC)
GrACE (Rutgers)
Aurora / SCALEA (University of Vienna)

110
Work in Progress

Trace visualization
Event traces with counters (Vampir 3.0 will
visualize)
EPILOG trace conversion
Runtime performance monitoring and analysis
Online performance data access
Performance analysis and visualization in SCIRun
Performance Database Framework
XML parallel profile representation of TAU
profiles
PostgreSQL performance database
Next-generation PDT
Performance analysis for component software (CCA)

111
Concluding Remarks

Complex software and parallel computing systems
pose challenging performance analysis problems
that require robust methodologies and tools
To build more sophisticated performance tools,
existing proven performance technology must be
utilized
Performance tools must be integrated with
software and systems models and technology
Performance engineered software
Function consistently and coherently in software
and system environments
TAU performance system offers robust performance
technology that can be broadly integrated so
USE IT!

112
Acknowledgements

Department of Energy (DOE)
MICS office
DOE 2000 ACTS contract
Performance Technology for Tera-class Parallel
Computer Systems: Evolution of the TAU
Performance System
PERC SciDAC project affiliate
University of Utah DOE ASCI Level 1 sub-contract
DOE ASCI Level 3 (LANL, LLNL)
NSF National Young Investigator (NYI) award
Research Centre Juelich
John von Neumann Institute for Computing
Dr. Bernd Mohr
Los Alamos National Laboratory