Title: Allen D. Malony Sameer S. Shende Robert Bell
1The TAU Performance System
- Allen D. Malony, Sameer S. Shende, Robert Bell
- malony,sameer,bell_at_cs.uoregon.edu
- Department of Computer and Information Science
- Computational Science Institute
- University of Oregon
2Overview
- Motivation and goals
- TAU architecture and toolkit
- Instrumentation
- Measurement
- Analysis
- Performance mapping
- Application case studies
- TAU Integration
- Work in progress
- Conclusions
3Motivation
- Tools for performance problem solving
- Empirical-based performance optimization process
- Versatile performance technology
- Portable performance analysis methods
[Figure: performance problem-solving cycle linking Performance Tuning, Performance Diagnosis, Performance Experimentation, and Performance Observation (edge labels: hypotheses, properties, characterization), supported by Performance Technology]
4Problems
- Diverse performance observability requirements
- Multiple levels of software and hardware
- Different types and detail of performance data
- Alternative performance problem solving methods
- Multiple targets of software and system application
- Demands more robust performance technology
- Broad scope of performance observation
- Flexible and configurable mechanisms
- Technology integration and extension
- Cross-platform portability
- Open, layered, and modular framework architecture
5Complexity Challenges for Performance Tools
- Computing system environment complexity
- Observation integration and optimization
- Access, accuracy, and granularity constraints
- Diverse/specialized observation capabilities/technology
- Restricted modes limit performance problem solving
- Sophisticated software development environments
- Programming paradigms and performance models
- Performance data mapping to software abstractions
- Uniformity of performance abstraction across platforms
- Rich observation capabilities and flexible configuration
- Common performance problem solving methods
6General Problems (Performance Technology)
- How do we create robust and ubiquitous performance technology for the analysis and tuning of parallel and distributed software and systems in the presence of (evolving) complexity challenges?
- How do we apply performance technology effectively for the variety and diversity of performance problems that arise in the context of complex parallel and distributed computer systems?
7Computation Model for Performance Technology
- How to address dual performance technology goals?
- Robust capabilities and widely available methods
- Contend with problems of system diversity
- Flexible tool composition/configuration/integration
- Approaches
- Restrict computation types / performance problems
- machines, languages, instrumentation technique, ...
- limited performance technology coverage and application
- Base technology on abstract computation model
- general architecture and software execution features
- map features/methods to existing complex system types
- develop capabilities that can be adapted and optimized
8General Complex System Computation Model
- Node: physically distinct shared memory machine
- Message passing node interconnection network
- Context: distinct virtual memory space within node
- Thread: execution threads (user/system) in context
[Figure: physical view (SMP nodes, each with node memory, joined by an interconnection network carrying inter-node message communication) beside the model view (node, context as VM space, threads)]
9TAU Performance System
- Tuning and Analysis Utilities
- Performance system framework for scalable parallel and distributed high-performance computing
- Targets a general complex system computation model
- nodes / contexts / threads
- Multi-level system / software / parallelism
- Measurement and analysis abstraction
- Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
- Portable performance profiling and tracing facility
- Open software approach with technology integration
- University of Oregon, Forschungszentrum Jülich, LANL
10Definitions Profiling
- Profiling
- Recording of summary information during execution
- execution time, calls, hardware statistics, ...
- Reflects performance behavior of program entities
- functions, loops, basic blocks
- user-defined semantic entities
- Very good for low-cost performance assessment
- Helps to expose performance bottlenecks and hotspots
- Implemented through
- sampling: periodic OS interrupts or hardware counter traps
- instrumentation: direct insertion of measurement code
11Definitions Tracing
- Tracing
- Recording of information about significant points (events) during program execution
- entering/exiting code region (function, loop, block, ...)
- thread/process interactions (e.g., send/receive message)
- Save information in event record
- timestamp
- CPU identifier, thread identifier
- Event type and event-specific information
- Event trace is a time-sequenced stream of event records
- Can be used to reconstruct dynamic program behavior
- Typically requires code instrumentation
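The event-record idea above can be sketched in a few lines of C++. This is only an illustration, not TAU's trace format: the names `TraceEvent`, `TraceBuffer`, and `max_nesting_depth` are hypothetical, and the timestamp unit is an assumption. The record fields follow the slide (timestamp, CPU and thread identifiers, event type, event-specific information), and the helper shows one way dynamic behavior can be reconstructed from the time-sequenced stream.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Kinds of events recorded at significant points of execution.
enum class EventType { Enter, Exit, Send, Recv };

// One event record: when, where, what, plus event-specific data.
// (Hypothetical layout for illustration; not TAU's trace format.)
struct TraceEvent {
    std::uint64_t timestamp;  // assumed unit: microseconds since start
    int cpu;                  // CPU identifier
    int thread;               // thread identifier
    EventType type;           // event type
    std::string info;         // region name or message details
};

// An event trace is a time-sequenced stream of event records.
class TraceBuffer {
public:
    void record(const TraceEvent& e) { events_.push_back(e); }

    // Return the records ordered by timestamp (stable for ties).
    std::vector<TraceEvent> sorted() const {
        std::vector<TraceEvent> out = events_;
        std::stable_sort(out.begin(), out.end(),
                         [](const TraceEvent& a, const TraceEvent& b) {
                             return a.timestamp < b.timestamp;
                         });
        return out;
    }

private:
    std::vector<TraceEvent> events_;
};

// Reconstruct one aspect of dynamic behavior from a sorted trace:
// the deepest nesting of entered-but-not-exited regions on a thread.
int max_nesting_depth(const std::vector<TraceEvent>& trace, int thread) {
    int depth = 0;
    int deepest = 0;
    for (const TraceEvent& e : trace) {
        if (e.thread != thread) continue;
        if (e.type == EventType::Enter) {
            ++depth;
            deepest = std::max(deepest, depth);
        } else if (e.type == EventType::Exit && depth > 0) {
            --depth;
        }
    }
    return deepest;
}
```

Records may arrive out of order (for example from several buffers), so the sketch sorts before analysis; a real tracer would also need the timestamp synchronization and trace merging steps mentioned later in the deck.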
12TAU Performance System Architecture
13TAU Performance Systems Goals
- Multi-level performance instrumentation
- Multi-language automatic source instrumentation
- Flexible and configurable performance measurement
- Widely-ported parallel performance profiling system
- Computer system architectures and operating systems
- Different programming languages and compilers
- Support for multiple parallel programming paradigms
- Multi-threading, message passing, mixed-mode, hybrid
- Support for performance mapping
- Support for object-oriented and generic programming
- Integration in complex software systems and applications
14How To Use TAU?
- Instrumentation
- Application code and libraries
- Selective instrumentation
- Install, compile, and link with TAU measurement library
- configure; make clean install
- Multiple configurations for different measurement options
- Does not require change in instrumentation
- Selective measurement control
- Execute experiments to produce performance data
- Performance data generated at end or during execution
- Use analysis tools to look at performance results
15TAU Instrumentation Approach
- Support for standard program events
- Routines
- Classes and templates
- Statement-level blocks
- Support for user-defined events
- Begin/End events (user-defined timers)
- Atomic events
- Selection of event statistics
- Support definition of semantic entities for mapping
- Support for event groups
- Instrumentation optimization
16TAU Instrumentation
- Flexible instrumentation mechanisms at multiple levels
- Source code
- manual
- automatic
- C, C++, F77/90 (Program Database Toolkit (PDT))
- OpenMP (directive rewriting (Opari))
- Object code
- pre-instrumented libraries (e.g., MPI using PMPI)
- statically-linked and dynamically-linked
- fast breakpoints (compiler generated)
- Executable code
- dynamic instrumentation (pre-execution) (DynInstAPI)
- virtual machine instrumentation (e.g., Java using JVMPI)
17Multi-Level Instrumentation
- Targets common measurement interface
- TAU API
- Multiple instrumentation interfaces
- Simultaneously active
- Information sharing between interfaces
- Utilizes instrumentation knowledge between levels
- Selective instrumentation
- Available at each level
- Cross-level selection
- Targets a common performance model
- Presents a unified view of execution
- Consistent performance events
18Program Database Toolkit (PDT)
- Program code analysis framework
- develop source-based tools
- High-level interface to source code information
- Integrated toolkit for source code parsing, database creation, and database query
- Commercial grade front-end parsers
- Portable IL analyzer, database format, and access API
- Open software approach for tool development
- Multiple source languages
- Implement automatic performance instrumentation tools
- tau_instrumentor
19PDT Architecture and Tools
[Figure: PDT architecture. Application / library source passes through the C / C++ parser and the Fortran 77/90 parser to IL, then through the C / C++ and Fortran 77/90 IL analyzers to program database files, which are accessed through DUCTAPE by PDBhtml (program documentation), SILOON (application component glue), CHASM (C++ / F90 interoperability), and TAU_instr (automatic source instrumentation)]
20PDT Components
- Language front end
- Edison Design Group (EDG): C, C++, Java
- Mutek Solutions Ltd.: F77, F90
- IL Analyzer
- Processes intermediate language (IL) tree from front-end
- Creates program database (PDB) formatted file
- DUCTAPE (Bernd Mohr, FZJ/ZAM, Germany)
- C++ program Database Utilities and Conversion Tools APplication Environment
- Processes and merges PDB files
- C++ library to access the PDB for PDT applications
21Instrumentation Control
- Selection of which performance events to observe
- Could depend on scope, type, level of interest
- Could depend on instrumentation overhead
- How is selection supported in instrumentation system?
- No choice
- Include / exclude lists (TAU)
- Environment variables
- Static vs. dynamic
- Controlling the instrumentation of small routines
- High relative measurement overhead
- Significant intrusion and possible perturbation
22Selective Instrumentation
% tau_instrumentor
Usage : tau_instrumentor <pdbfile> <sourcefile> [-o <outputfile>] [-noinline] [-g groupname] [-i headerfile] [-c|-c++|-fortran] [-f <instr_req_file>]
For selective instrumentation, use -f option
% cat selective.dat
# Selective instrumentation: Specify an exclude/include list.
BEGIN_EXCLUDE_LIST
void quicksort(int *, int, int)
void sort_5elements(int *)
void interchange(int *, int *)
END_EXCLUDE_LIST
# If an include list is specified, the routines in the list will be the only
# routines that are instrumented.
# To specify an include list (a list of routines that will be instrumented)
# remove the leading # to uncomment the following lines
#BEGIN_INCLUDE_LIST
#int main(int, char **)
#int select_
#END_INCLUDE_LIST
23Overhead Analysis for Automatic Selection
- Analyze the performance data to determine events with high (relative) overhead performance measurements
- Create a select list for excluding those events
- Rule grammar (used in tau_reduce tool)
- [GroupName:] Field Operator Number
- GroupName indicates rule applies to events in group
- Field is an event metric attribute (from profile statistics)
- numcalls, numsubs, percent, usec, cumusec, count, totalcount, stdev, usecs/call, counts/call
- Operator is one of >, <, or =
- Number is any number
- Compound rules possible using & between simple rules
24Example Rules
- Exclude all events that are members of TAU_USER and use less than 1000 microseconds
- TAU_USER:usec < 1000
- Exclude all events that have less than 1000 microseconds and are called only once
- usec < 1000 & numcalls = 1
- Exclude all events that have less than 1000 usecs per call OR have a (total inclusive) percent less than 5
- usecs/call < 1000
- percent < 5
- Scientific notation can be used
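To make the rule grammar concrete, here is a minimal C++ sketch of how such rules could be evaluated against profile statistics. It is not the tau_reduce implementation: `EventStats`, `SimpleRule`, `CompoundRule`, and `excluded` are hypothetical names, only four of the listed fields are modeled, and treating separate rules as alternatives (OR) is an assumption based on the third example above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Profile statistics for one event (a subset of the fields named above).
struct EventStats {
    std::string name;
    std::string group;
    double numcalls;
    double usec;            // exclusive time in microseconds
    double percent;         // total inclusive percent
    double usecs_per_call;
};

enum class Op { Less, Greater, Equal };

// A simple rule: [GroupName:] Field Operator Number.
struct SimpleRule {
    std::string group;      // empty means the rule applies to any group
    std::string field;      // "numcalls", "usec", "percent", "usecs/call"
    Op op;
    double number;
};

// Look up a field value by name; unknown fields are reported as missing.
bool field_value(const EventStats& e, const std::string& field, double& out) {
    if (field == "numcalls")   { out = e.numcalls; return true; }
    if (field == "usec")       { out = e.usec; return true; }
    if (field == "percent")    { out = e.percent; return true; }
    if (field == "usecs/call") { out = e.usecs_per_call; return true; }
    return false;
}

// True if the event belongs to the rule's group and satisfies the test.
bool matches(const SimpleRule& r, const EventStats& e) {
    if (!r.group.empty() && r.group != e.group) return false;
    double v = 0.0;
    if (!field_value(e, r.field, v)) return false;
    switch (r.op) {
        case Op::Less:    return v < r.number;
        case Op::Greater: return v > r.number;
        case Op::Equal:   return v == r.number;
    }
    return false;
}

// A compound rule is a conjunction (&) of simple rules.
using CompoundRule = std::vector<SimpleRule>;

// Assumption: an event is excluded if ANY compound rule matches.
bool excluded(const std::vector<CompoundRule>& rules, const EventStats& e) {
    for (const CompoundRule& rule : rules) {
        if (rule.empty()) continue;  // an empty rule matches nothing
        bool all = true;
        for (const SimpleRule& s : rule) {
            if (!matches(s, e)) { all = false; break; }
        }
        if (all) return true;
    }
    return false;
}
```

Events for which `excluded` returns true would then be written to the exclude list consumed by the instrumentor.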
25TAU Measurement
- Performance information
- Performance events
- High-resolution timer library (real-time / virtual clocks)
- General software counter library (user-defined events)
- Hardware performance counters
- PCL (Performance Counter Library) (ZAM, Germany)
- PAPI (Performance API) (UTK, Ptools Consortium)
- consistent, portable API
- Organization
- Node, context, thread levels
- Profile groups for collective events (runtime selective)
- Performance data mapping between software levels
26TAU Measurement Options
- Parallel profiling
- Function-level, block-level, statement-level
- Supports user-defined events
- TAU parallel profile data stored during execution
- Hardware counter values
- Support for multiple counters
- Support for callpath profiling
- Tracing
- All profile-level events
- Inter-process communication events
- Timestamp synchronization
- Trace merging and format conversion
27TAU Measurement System Configuration
- configure [OPTIONS]
- -c++=<CC>, -cc=<cc> Specify C++ and C compilers
- -pthread, -sproc, -smarts Use pthread, SGI sproc, smarts threads
- -openmp Use OpenMP threads
- -opari=<dir> Specify location of Opari OpenMP tool
- -papi=<dir>, -pcl=<dir> Specify location of PAPI or PCL
- -pdt=<dir> Specify location of PDT
- -mpiinc=<d>, -mpilib=<d> Specify MPI library instrumentation
- -TRACE Generate TAU event traces
- -PROFILE Generate TAU profiles
- -PROFILECALLPATH Generate Callpath profiles (1-level)
- -MULTIPLECOUNTERS Use more than one hardware counter
- -CPUTIME Use usertime+system time
- -PAPIWALLCLOCK Use PAPI to access wallclock time
- -PAPIVIRTUAL Use PAPI for virtual (user) time
28TAU Measurement API
- Initialization and runtime configuration
- TAU_PROFILE_INIT(argc, argv); TAU_PROFILE_SET_NODE(myNode); TAU_PROFILE_SET_CONTEXT(myContext); TAU_PROFILE_EXIT(message); TAU_REGISTER_THREAD();
- Function and class methods
- TAU_PROFILE(name, type, group);
- Template
- TAU_TYPE_STRING(variable, type); TAU_PROFILE(name, type, group); CT(variable);
- User-defined timing
- TAU_PROFILE_TIMER(timer, name, type, group); TAU_PROFILE_START(timer); TAU_PROFILE_STOP(timer);
29TAU Measurement API (continued)
- User-defined events
- TAU_REGISTER_EVENT(variable, event_name); TAU_EVENT(variable, value); TAU_PROFILE_STMT(statement);
- Mapping
- TAU_MAPPING(statement, key); TAU_MAPPING_OBJECT(funcIdVar); TAU_MAPPING_LINK(funcIdVar, key);
- TAU_MAPPING_PROFILE(funcIdVar); TAU_MAPPING_PROFILE_TIMER(timer, funcIdVar); TAU_MAPPING_PROFILE_START(timer); TAU_MAPPING_PROFILE_STOP(timer);
- Reporting
- TAU_REPORT_STATISTICS(); TAU_REPORT_THREAD_STATISTICS();
30Grouping Performance Data in TAU
- Profile Groups
- A group of related routines forms a profile group
- Statically defined
- TAU_DEFAULT, TAU_USER1-5, TAU_MESSAGE, TAU_IO, ...
- Dynamically defined
- group name based on string, such as "adlib" or "particles"
- runtime lookup in a map to get unique group identifier
- uses tau_instrumentor to instrument
- Ability to change group names at runtime
- Group-based instrumentation and measurement control
31TAU Group Instrumentation Control API
- Enabling Profile Groups
- TAU_ENABLE_INSTRUMENTATION()
- TAU_ENABLE_GROUP(TAU_GROUP)
- TAU_ENABLE_GROUP_NAME(group name)
- TAU_ENABLE_ALL_GROUPS()
- Disabling Profile Groups
- TAU_DISABLE_INSTRUMENTATION()
- TAU_DISABLE_GROUP(TAU_GROUP)
- TAU_DISABLE_GROUP_NAME()
- TAU_DISABLE_ALL_GROUPS()
- Obtaining Profile Group Identifier
- Runtime Switching of Profile Groups
32TAU Pre-execution Control
- Dynamic groups defined at file scope
- Group names and group associations runtime modifiable
- Controlling groups at pre-execution time
- --profile <group1+group2+...+groupN> option
- tau_instrumentor app.pdb app.cpp -o app.i.cpp -g particles
- mpirun -np 4 application --profile particles+field+mesh+io
- Examples
- POOMA (LANL) uses static groups
- VTF (Caltech) uses dynamic group in Python-based execution instrumentation control
33 Configuring TAU Measurement Library
- Profiling with wallclock time (on a quad PIII Linux machine)
- configure -mpiinc=/usr/local/packages/mpich/include -mpilib=/usr/local/packages/mpich/lib -pdt=/usr/pkg/pdtoolkit/ -useropt=-O2 -LINUXTIMERS
- Tracing
- configure -mpiinc=/usr/local/packages/mpich/include -mpilib=/usr/local/packages/mpich/lib -pdt=/usr/pkg/pdtoolkit -useropt=-O2 -LINUXTIMERS
- Profiling with PAPI
- configure -mpiinc=/usr/local/packages/mpich/include -mpilib=/usr/local/packages/mpich/lib -pdt=/usr/pkg/pdtoolkit/ -useropt=-O2 -papi=/usr/local/packages/papi
- setenv PAPI_EVENT PAPI_FP_INS
- setenv PAPI_EVENT PAPI_L1_DCM
34Compiling with TAU Makefiles
- Include TAU Stub Makefile (<arch>/lib) in the user's Makefile
- Variables
- TAU_CXX Specify the C++ compiler used by TAU
- TAU_CC, TAU_F90 Specify the C, F90 compilers
- TAU_DEFS Defines used by TAU. Add to CFLAGS
- TAU_LDFLAGS Linker options. Add to LDFLAGS
- TAU_INCLUDE Header files include path. Add to CFLAGS
- TAU_LIBS Statically linked TAU library. Add to LIBS
- TAU_SHLIBS Dynamically linked TAU library
- TAU_MPI_LIBS TAU's MPI wrapper library for C/C++
- TAU_MPI_FLIBS TAU's MPI wrapper library for F90
- TAU_FORTRANLIBS Must be linked in with C++ linker for F90
- TAU_DISABLE TAU's dummy F90 stub library
35TAU Analysis
- Parallel profile analysis
- Pprof
- parallel profiler with text-based display
- Racy
- graphical interface to pprof (Tcl/Tk)
- jRacy
- Java implementation of Racy
- Trace analysis and visualization
- Trace merging and clock adjustment (if necessary)
- Trace format conversion (ALOG, SDDF, VTF, Paraver)
- Trace visualization using Vampir (Pallas)
36Pprof Command
- pprof [-c|-b|-m|-t|-e|-i] [-r] [-s] [-n num] [-f file] [-l] [nodes]
- -c Sort according to number of calls
- -b Sort according to number of subroutines called
- -m Sort according to msecs (exclusive time total)
- -t Sort according to total msecs (inclusive time total)
- -e Sort according to exclusive time per call
- -i Sort according to inclusive time per call
- -v Sort according to standard deviation (exclusive usec)
- -r Reverse sorting order
- -s Print only summary profile information
- -n num Print only first number of functions
- -f file Specify full path and filename without node ids
- -l List all functions and exit
- nodes Print only info about all contexts/threads of given node numbers
37Pprof Output (NAS Parallel Benchmark LU)
- Intel QuadPIII Xeon
- F90 + MPICH
- Profile - Node - Context - Thread
- Events - code - MPI
38jRacy (NAS Parallel Benchmark LU)
Routine profile across all nodes
n: node, c: context, t: thread
Global profiles
Event legend
Individual profile
39Paraprof Profile Browser
40Paraprof Profile Browser Main Window
41Paraprof Profile Browser Node Window
42Paraprof Profile Browser (Derived Metrics)
43Paraprof Profile Browser Routine Window
44TAU + PAPI (NAS Parallel Benchmark LU)
- Floating point operations
- Re-link to alternate library
- Can use multiple counter support
45TAU + Vampir (NAS Parallel Benchmark LU)
Callgraph display
Timeline display
Parallelism display
Communications display
46tau_reduce Example
- tau_reduce implements overhead reduction in TAU
- Consider klargest example
- Find kth largest element in N elements
- Compare two methods: quicksort, select_kth_largest
- Un-instrumented testcase: i = 2324, N = 1000000
- quicksort (wall clock): 0.188511 secs
- select_kth_largest (wall clock): 0.149594 secs
- Total (PIII/1.2GHz time): 0.340u 0.020s 0:00.37
- Execute with all routines instrumented
- Execute with rule-based selective instrumentation
- usec>1000 & numcalls>400000 & usecs/call<30 & percent>25
47Simple sorting example on one processor
Before selective instrumentation reduction
NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time  Exclusive  Inclusive        #Call       #Subrs  Inclusive Name
            msec total msec                            usec/call
---------------------------------------------------------------------------------------
100.0         13      4,982            1            4    4982030 int main
 93.5      3,223      4,659  4.20241E+06  1.40268E+07          1 void quicksort
 62.9    0.00481      3,134            5            5     626839 int kth_largest_qs
 36.4        137      1,813           28       450057      64769 int select_kth_largest
 33.6        150      1,675       449978       449978          4 void sort_5elements
 28.8      1,435      1,435  1.02744E+07            0          0 void interchange
  0.4         20         20            1            0      20668 void setup
  0.0     0.0118     0.0118           49            0          0 int ceil
After selective instrumentation reduction
NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time  Exclusive  Inclusive        #Call       #Subrs  Inclusive Name
            msec total msec                            usec/call
---------------------------------------------------------------------------------------
100.0         14        383            1            4     383333 int main
 50.9        195        195            5            0      39017 int kth_largest_qs
 40.0        153        153           28           79       5478 int select_kth_largest
  5.4         20         20            1            0      20611 void setup
  0.0       0.02       0.02           49            0          0 int ceil
48TAU Performance System Status
- Computing platforms
- IBM SP / Power4, SGI Origin 2K/3K, ASCI Red, Cray T3E / SV-1 (X-1 planned), HP (Compaq) SC (Tru64), HP Superdome (HP-UX), Sun, Hitachi SR8000, NEC SX-5 (SX-6 underway), Linux clusters (IA-32/64, Alpha, PPC, PA-RISC, Power), Apple (OS X), Windows
- Programming languages
- C, C++, Fortran 77, F90, HPF, Java, OpenMP, Python
- Communication libraries
- MPI, PVM, Nexus, shmem, Tulip, ACLMPL, MPIJava
- Thread libraries
- pthreads, SGI sproc, Java, Windows, OpenMP, SMARTS
49TAU Performance System Status (continued)
- Compilers
- Intel KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray, IBM, Compaq, Hitachi, NEC, Intel
- Application libraries (selected)
- Blitz++, A++/P++, PETSc, SAMRAI, Overture, PAWS
- Application frameworks (selected)
- POOMA, MC++, Conejo, Uintah, VTF, UPS, GrACE
- Performance projects using TAU
- Aurora / SCALEA: ACPC, University of Vienna
- TAU full distribution (Version 2.12, web download)
- TAU performance system toolkit and user's guide
- Automatic software installation and examples
50PDT Status
- Program Database Toolkit (Version 2.2, web download)
- EDG C++ front end (Version 2.45.2)
- Mutek Fortran 90 front end (Version 2.4.1)
- C++ and Fortran 90 IL Analyzer
- DUCTAPE library
- Standard C++ system header files (KCC Version 4.0f)
- PDT-constructed tools
- TAU instrumentor (C/C++/F90)
- Program analysis support for SILOON and CHASM
- Platforms
- Same as for TAU with a few exceptions
51Performance Mapping
- High-level semantic abstractions
- Associate performance measurements
- Performance mapping
- performance measurement system support to assign
data correctly
52Semantic Entities/Attributes/Associations
- New dynamic mapping scheme (SEAA)
- Contrast with ParaMap (Miller and Irvin)
- Entities defined at any level of abstraction
- Attribute entity with semantic information
- Entity-to-entity associations
- Two association types (implemented in TAU API)
- Embedded: extends associated object to store performance measurement entity
- External: creates an external look-up table using address of object as key to locate performance measurement entity
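The difference between the two association types can be sketched in C++ as follows. This is an illustration under stated assumptions, not the TAU mapping API: `PerfEntity`, `ParticleEmbedded`, `Particle`, and `ExternalMap` are hypothetical names, and a plain accumulator stands in for the performance measurement entity.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

// Stand-in for a performance measurement entity (e.g., a timer).
struct PerfEntity {
    std::string name;
    double total_usec;
    long calls;
    explicit PerfEntity(std::string n)
        : name(std::move(n)), total_usec(0.0), calls(0) {}
    void add(double usec) { total_usec += usec; ++calls; }
};

// Embedded association: the application object is extended with a
// pointer to its performance entity.
struct ParticleEmbedded {
    int face;
    PerfEntity* perf;  // stored inside the object itself
};

// External association: the object type is left unchanged.
struct Particle {
    int face;
};

// A look-up table keyed by the object's address locates the entity.
class ExternalMap {
public:
    void link(const void* object, PerfEntity* entity) {
        table_[object] = entity;
    }
    PerfEntity* find(const void* object) const {
        auto it = table_.find(object);
        return it == table_.end() ? nullptr : it->second;
    }

private:
    std::unordered_map<const void*, PerfEntity*> table_;
};
```

The embedded form avoids a look-up on every measurement but requires changing the object's definition; the external form works with unmodified types at the cost of a hash look-up, and it relies on the object's address staying stable while the link is in use.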
53Hypothetical Mapping Example
- Particles distributed on surfaces of a cube
Particle* P[MAX]; /* Array of particles */
int GenerateParticles() {
  /* distribute particles over all faces of the cube */
  for (int face=0, last=0; face < 6; face++) {
    /* particles on this face */
    int particles_on_this_face = num(face);
    for (int i=last; i < particles_on_this_face; i++) {
      /* particle properties are a function of face */
      P[i] = ... f(face);
      ...
    }
    last += particles_on_this_face;
  }
}
54Hypothetical Mapping Example (continued)
int ProcessParticle(Particle *p) {
  /* perform some computation on p */
}
int main() {
  GenerateParticles(); /* create a list of particles */
  for (int i = 0; i < N; i++)
    /* iterates over the list */
    ProcessParticle(P[i]);
}
[Figure: work packets feeding an engine]
- How much time is spent processing face i particles?
- What is the distribution of performance among faces?
55No Performance Mapping versus Mapping
- Typical performance tools report performance with respect to routines
- Does not provide support for mapping
- Performance tools with SEAA mapping can observe performance with respect to the scientist's programming and problem abstractions
TAU (w/ mapping)
TAU (no mapping)
56Performance Mapping in Callpath Profiling
- Consider callgraph (callpath) profiling
- Measure time (metric) along an edge (path) of callgraph
- Incident edge gives parent / child view
- Edge sequence (path) gives parent / descendant view
- Callpath profiling when callgraph is unknown
- Must determine callgraph dynamically at runtime
- Map performance measurement to dynamic call path state
- Callpath levels
- 0-level: current callgraph node
- 1-level: immediate parent (descendant)
- k-level: kth calling parent (call descendant)
571-Level Callpath Implementation in TAU
- TAU maintains a performance event (routine) callstack
- Profiled routine (child) looks in callstack for parent
- Previous profiled performance event is the parent
- A callpath profile structure created first time parent calls
- TAU records parent in a callgraph map for child
- String representing 1-level callpath used as its key
- "a( )=>b( )": name for time spent in "b" when called by "a"
- Map returns pointer to callpath profile structure
- 1-level callpath is profiled using this profiling data
- Build upon TAU's performance mapping technology
- Measurement is independent of instrumentation
- Use -PROFILECALLPATH to configure TAU
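The 1-level callpath scheme above (an event callstack plus a map keyed by a "parent => child" string) can be sketched in C++ as follows. This is a simplified illustration, not TAU's implementation: `CallpathProfiler`, `CallpathProfile`, `enter`, `leave`, and `lookup` are hypothetical names, and inclusive time is passed in explicitly rather than measured so that the example stays deterministic.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Profile data kept for one 1-level callpath.
struct CallpathProfile {
    long calls = 0;
    double inclusive_usec = 0.0;
};

class CallpathProfiler {
public:
    // On routine entry: the previous event on the callstack is the
    // parent; the "parent => child" string is the map key.
    void enter(const std::string& routine) {
        std::string key = stack_.empty()
                              ? routine
                              : stack_.back().name + " => " + routine;
        CallpathProfile* prof = &map_[key];  // created on first call
        ++prof->calls;
        stack_.push_back(Frame{routine, prof});
    }

    // On routine exit: charge the (externally measured) inclusive
    // time to the callpath profile found at entry.
    void leave(double inclusive_usec) {
        if (stack_.empty()) return;
        stack_.back().profile->inclusive_usec += inclusive_usec;
        stack_.pop_back();
    }

    // Return the profile for a callpath key, or nullptr if unseen.
    const CallpathProfile* lookup(const std::string& key) const {
        auto it = map_.find(key);
        return it == map_.end() ? nullptr : &it->second;
    }

private:
    struct Frame {
        std::string name;
        CallpathProfile* profile;
    };
    std::vector<Frame> stack_;                    // event callstack
    std::map<std::string, CallpathProfile> map_;  // callgraph map
};
```

Because `std::map` never relocates its elements on insertion, the pointer stored in each stack frame remains valid while deeper calls add new callpaths. Time spent in `b()` is reported separately under `a() => b()` and `main() => b()`, which is the parent / child view the slide describes.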
58Callpath Profiling Example (NAS LU v2.3)
- configure -PROFILECALLPATH -SGITIMERS -arch=sgi64 -mpiinc=/usr/include -mpilib=/usr/lib64 -useropt=-O2
59Callpath Parallel Profile Display
- 0-level and 1-level callpath grouping
1-Level Callpath
0-Level Callpath
60Strategies for Empirical Performance Evaluation
- Empirical performance evaluation as a series of performance experiments
- Experiment trials describing instrumentation and measurement requirements
- Where/When/How axes of empirical performance space
- where are performance measurements made in program
- when is performance instrumentation done
- how are performance measurement/instrumentation chosen
- Strategies for achieving flexibility and portability goals
- Limited performance methods restrict evaluation scope
- Non-portable methods force use of different techniques
- Integration and combination of strategies
61Case Study SIMPLE Performance Analysis
- SIMPLE hydrodynamics benchmark
- C code with MPI message communication
- Multiple instrumentation methods
- source-to-source translation (PDT)
- MPI wrapper library level instrumentation (PMPI)
- pre-execution binary instrumentation (DyninstAPI)
- Alternative measurement strategies
- statistical profiles of software actions
- statistical profiles of hardware actions (PCL, PAPI)
- program event tracing
- choice of time source
- gettimeofday, high-res physical, CPU, process virtual
62SIMPLE Source Instrumentation (Preprocessed)
- PDT automatically generates instrumentation code
- names events with full function signatures
- Similarly for all other routines in SIMPLE program
int compute_heat_conduction(
    double theta_hat[X][Y], double deltat,
    double new_r[X][Y], double new_z[X][Y],
    double new_alpha[X][Y], double new_rho[X][Y],
    double theta_l[X][Y], double Gamma_k[X][Y],
    double Gamma_l[X][Y])
{
  TAU_PROFILE("int compute_heat_conduction(double (*)[259], double, double (*)[259], double (*)[259], double (*)[259], double (*)[259], double (*)[259], double (*)[259], double (*)[259])", " ", TAU_USER);
  ...
}
63MPI Library Instrumentation (MPI_Send)
- Uses MPI profiling interposition library (PMPI)
int MPI_Send(...)
...
{
  int returnVal, typesize;
  TAU_PROFILE_TIMER(tautimer, "MPI_Send()", " ", TAU_MESSAGE);
  TAU_PROFILE_START(tautimer);
  if (dest != MPI_PROC_NULL) {
    PMPI_Type_size(datatype, &typesize);
    TAU_TRACE_SENDMSG(tag, dest, typesize*count);
  }
  returnVal = PMPI_Send(buf, count, datatype, dest, tag, comm);
  TAU_PROFILE_STOP(tautimer);
  return returnVal;
}
64MPI Library Instrumentation (MPI_Recv)
int MPI_Recv(...)
...
{
  int returnVal, size;
  TAU_PROFILE_TIMER(tautimer, "MPI_Recv()", " ", TAU_MESSAGE);
  TAU_PROFILE_START(tautimer);
  returnVal = PMPI_Recv(buf, count, datatype, src, tag, comm, status);
  if (src != MPI_PROC_NULL && returnVal == MPI_SUCCESS) {
    PMPI_Get_count(status, MPI_BYTE, &size);
    TAU_TRACE_RECVMSG(status->MPI_TAG, status->MPI_SOURCE, size);
  }
  TAU_PROFILE_STOP(tautimer);
  return returnVal;
}
65Multi-Level Instrumentation (Profiling)
four processes
event legend
Profile per process
global profile
66Multi-Level Instrumentation (Tracing)
- Relink with TAU library configured for tracing
- No modification of source instrumentation
required!
TAU performance groups
67Dynamic Instrumentation of SIMPLE
- Uses DynInstAPI for runtime code patching
- Mutator loads measurement library, instruments mutatee
- One mutator (tau_run) per executable image
- mpirun -np <n> tau.shell
68Case Study PETSc v2.1.3 (ANL)
- Portable, Extensible Toolkit for Scientific Computation
- Scalable (parallel) PDE framework
- Suite of data structures and routines (374,458 code lines)
- Solution of scientific applications modeled by PDEs
- Parallel implementation
- MPI used for inter-process communication
- TAU instrumentation
- PDT for C/C++ source instrumentation (100%, no manual)
- MPI wrapper interposition library instrumentation
- Example
- Linear system of equations (Ax=b) (SLES) (ex2 test case)
- Non-linear system of equations (SNES) (ex19 test case)
69PETSc ex2 (Profile - wallclock time)
Sorted with respect to exclusive time
70PETSc ex2(Profile - overall and message counts)
- Observe load balance
- Track messages
Capture with user-defined events
71PETSc ex2 (Profile - percentages and time)
- View per-thread performance on individual routines
72PETSc ex2 (Trace)
73PETSc ex19
- Non-linear solver (SNES)
- 2-D driven cavity code
- Uses velocity-vorticity formulation
- Finite difference discretization on a structured grid
- Problem size and measurements
- 56x56 mesh size on quad Pentium III (550 MHz, Linux)
- Executes for approximately one minute
- MPI wrapper interposition library
- PDT (tau_instrumentor)
- Selective instrumentation (tau_reduce)
- three routines identified with high instrumentation overhead
74PETSc ex19 (Profile - wallclock time)
Sorted by inclusive time
Sorted by exclusive time
75PETSc ex19 (Profile - overall and percentages)
76PETSc ex19 (Tracing)
Commonly seen communication behavior
77PETSc ex19 (Tracing - callgraph)
78PETSc ex19 (PAPI_FP_INS, PAPI_L1_DCM)
- Uses multiple counter profile measurement
PAPI_FP_INS
PAPI_L1_DCM
79Case Study Mixed-mode Parallel Programs
- Portable mixed-mode parallel programming
- Multi-threaded shared memory programming
- Inter-node message passing
- Performance measurement
- Access to runtime system and communication events
- Associate communication and application events
- 2-Dimensional Stommel model of ocean circulation
- OpenMP for shared memory parallel programming
- MPI for cross-box message-based parallelism
- Jacobi iteration, 5-point stencil
- Timothy Kaiser (San Diego Supercomputing Center)
80Stommel Instrumentation
- OpenMP directive instrumentation (uses OPARI)
pomp_for_enter(&omp_rd_2);
#line 252 "stommel.c"
#pragma omp for schedule(static) reduction(+: diff) private(j) firstprivate (a1,a2,a3,a4,a5) nowait
for( i=i1;i<=i2;i++) {
  for(j=j1;j<=j2;j++){
    new_psi[i][j]=a1*psi[i+1][j] + a2*psi[i-1][j] + a3*psi[i][j+1] + a4*psi[i][j-1] - a5*the_for[i][j];
    diff=diff+fabs(new_psi[i][j]-psi[i][j]);
  }
}
pomp_barrier_enter(&omp_rd_2);
#pragma omp barrier
pomp_barrier_exit(&omp_rd_2);
pomp_for_exit(&omp_rd_2);
#line 261 "stommel.c"
81OpenMP + MPI Ocean Modeling (Trace)
Thread-paired message passing
Integrated OpenMP + MPI events
82OpenMP + MPI Ocean Modeling (HW Profile)
configure -papi=../packages/papi -openmp -c++=pgCC -cc=pgcc -mpiinc=../packages/mpich/include -mpilib=../packages/mpich/lib
Integrated OpenMP + MPI events
FP instructions
83Case Study C++ and Performance Mapping
- Object-oriented programming
- abstract data types, encapsulation, inheritance, ...
- Domain-specific abstractions
- Implemented by OO languages in form of class libraries
- Generic programming mechanisms
- efficient coding abstractions, compile-time transformations
- Creates a semantic gap between the transformed code and what the user expects (as described in source code)
- Need a mechanism to expose the nature of high-level abstract computation to the performance tools
- Map low-level performance data to high-level semantics
84C++ Template Instrumentation (Blitz++, PETE)
- High-level objects
- Array classes
- Templates (Blitz++)
- Optimizations
- Array processing
- Expressions (PETE)
- Relate performance data to high-level statement
- Complexity of template evaluation
Array expressions
85Standard Template Instrumentation Difficulties
- Instantiated templates result in mangled identifiers
- Standard profiling techniques / tools are deficient
- Integrated with proprietary compilers
- Specific systems platforms and programming models
Uninterpretable routine names
Very long!
86Blitz++ Library Instrumentation
- Expression templates
- embed the form of the expression in a template name
- Blitz++ describes structure of the expression template
- Present as pretty printed name to the profiling toolkit
- Create performance event associated with expression type
Expression: B + C - 2.0 * D
BinOp<Add, B, <BinOp<Subtract, C, <BinOp<Multiply, Scalar<2.0>, D>>>
[Figure: expression tree with operands B, C, 2.0, and D]
87Blitz++ Library Instrumentation (example)
#ifdef BZ_TAU_PROFILING
static string exprDescription;
if (!exprDescription.length()) {
  exprDescription = "A";
  prettyPrintFormat format(_bz_true); // terse mode on
  format.nextArrayOperandSymbol();
  T_update::prettyPrint(exprDescription);
  expr.prettyPrint(exprDescription, format);
}
TAU_PROFILE(" ", exprDescription, TAU_BLITZ);
#endif
exprDescription is the event name
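The mechanism on slides 86 and 87 can be sketched in standalone C++. This is a minimal illustration under stated assumptions, not Blitz code: `ArrayRef`, `Scalar`, `BinOp`, `makeBinOp`, `eventName`, and `exampleEventName` are hypothetical names introduced here, and the scalar is stored as text to keep printing trivial. Each node type pretty-prints itself, so the structure carried in the template type yields a readable event name instead of a mangled identifier. The nesting follows the template name shown on slide 86.

```cpp
#include <string>

// Hypothetical operator tags; each supplies the symbol used when printing.
struct Add      { static std::string symbol() { return "+"; } };
struct Subtract { static std::string symbol() { return "-"; } };
struct Multiply { static std::string symbol() { return "*"; } };

// Leaf node: a named array operand.
struct ArrayRef {
    std::string name;
    std::string prettyPrint() const { return name; }
};

// Leaf node: a scalar constant, kept as text for simplicity.
struct Scalar {
    std::string text;
    std::string prettyPrint() const { return text; }
};

// Interior node: the operator and both operand types are part of the type,
// which is what "embedding the expression in a template name" means.
template <class Op, class L, class R>
struct BinOp {
    L lhs;
    R rhs;
    std::string prettyPrint() const {
        return "(" + lhs.prettyPrint() + Op::symbol() + rhs.prettyPrint() + ")";
    }
};

// Helper so that only the operator tag has to be spelled out.
template <class Op, class L, class R>
BinOp<Op, L, R> makeBinOp(const L& lhs, const R& rhs) {
    return BinOp<Op, L, R>{lhs, rhs};
}

// Builds the readable event name for the assignment "target = expr".
template <class Expr>
std::string eventName(const std::string& target, const Expr& expr) {
    return target + "=" + expr.prettyPrint();
}

// Mirrors BinOp<Add, B, BinOp<Subtract, C, BinOp<Multiply, Scalar, D>>>.
std::string exampleEventName() {
    ArrayRef B{"B"}, C{"C"}, D{"D"};
    Scalar two{"2.0"};
    auto expr = makeBinOp<Add>(
        B, makeBinOp<Subtract>(C, makeBinOp<Multiply>(two, D)));
    return eventName("A", expr);
}
```

In a profiler, the returned string would play the role that `exprDescription` plays above: one event per distinct expression type, named in terms the programmer wrote.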
88TAU Instrumentation and Profiling for C++
Profile of expression types
Performance data presented with respect to
high-level array expression types
89Case Study: C-SAFE / Uintah
- Center for Simulation of Accidental Fires and Explosions
- ASCI ASAP Level 1 center, University of Utah
- PSE for multi-model simulation of high-energy explosions
- Coupled non-linear solvers, optimization, computational steering, visualization, and experimental data verification
- Very large-scale simulations
- Computer science problems
- Coupling of multiple simulation codes
- Software engineering across diverse expert teams
- Achieving high performance on large-scale systems
90Example C-SAFE Simulation Problems
Heptane fire simulation
Typical C-SAFE simulation with a billion degrees
of freedom and non-linear time dynamics
Material stress simulation
91Uintah Computational Framework (UCF)
- Execution model based on software (macro) dataflow
- Exposes parallelism and hides data transport latency
- Computations expressed as directed acyclic graphs of tasks
- each task consumes input and produces output (input to future task)
- input/outputs specified for each patch in a structured grid
- Abstraction of global single-assignment memory
- DataWarehouse
- Directory mapping names to values (array structured)
- Write value once then communicate to awaiting tasks
- Task graph gets mapped to processing resources
- Communications schedule approximates global optimal
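The execution model above can be illustrated with a small serial sketch. It is not Uintah code: the `DataWarehouse` class here is a hypothetical stand-in that only enforces write-once semantics, `Task` and `runTaskGraph` are names invented for this example, values are plain doubles rather than per-patch arrays, and no communication or parallel scheduling is modeled. A task runs as soon as every named input is present, which is the dataflow rule the slide describes.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical single-assignment store: each name may be written only once.
class DataWarehouse {
public:
    void put(const std::string& name, double value) {
        if (!values_.emplace(name, value).second)
            throw std::logic_error("variable written twice: " + name);
    }
    bool has(const std::string& name) const { return values_.count(name) != 0; }
    double get(const std::string& name) const { return values_.at(name); }

private:
    std::map<std::string, double> values_;
};

// A task names what it consumes and the single value it produces.
struct Task {
    std::string name;
    std::vector<std::string> inputs;
    std::string output;
    std::function<double(const DataWarehouse&)> compute;
};

// Repeatedly runs every task whose inputs are available.
// Returns the task names in the order they executed.
std::vector<std::string> runTaskGraph(std::vector<Task> tasks, DataWarehouse& dw) {
    std::vector<std::string> order;
    std::vector<bool> done(tasks.size(), false);
    std::size_t remaining = tasks.size();
    while (remaining > 0) {
        bool progressed = false;
        for (std::size_t i = 0; i < tasks.size(); ++i) {
            if (done[i]) continue;
            bool ready = true;
            for (const auto& in : tasks[i].inputs) ready = ready && dw.has(in);
            if (!ready) continue;
            dw.put(tasks[i].output, tasks[i].compute(dw));  // write once
            done[i] = true;
            --remaining;
            progressed = true;
            order.push_back(tasks[i].name);
        }
        // No runnable task means a missing input or a cycle in the graph.
        if (!progressed) throw std::logic_error("unsatisfiable task dependencies");
    }
    return order;
}
```

Because ordering is derived from the declared inputs and outputs, the order in which tasks are listed does not matter, which is what lets a real scheduler overlap independent tasks with data transport.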
92Performance Technology Integration
- Uintah presents challenges to performance integration
- Software diversity and structure
- UCF middleware, simulation code modules
- component-based hierarchy
- Portability objectives
- cross-language and cross-platform
- multi-parallelism: thread, message passing, mixed
- Scalability objectives
- High-level programming and execution abstractions
- Requires flexible and robust performance technology
- Requires support for performance mapping
93Task Execution in Uintah Parallel Scheduler
- Profile methods and functions in scheduler and in
MPI library
Task execution time dominates (what task?)
Task execution time distribution
MPI communication overheads (where?)
- Need to map performance data!
94Uintah Task Performance Mapping
- Uintah partitions individual particles across processing elements (processes or threads)
- Simulation tasks in task graph work on particles
- Tasks have domain-specific character in the computation
- "interpolate particles to grid" in Material Point Method
- Task instances generated for each partitioned particle set
- Execution scheduled with respect to task dependencies
- How to attribute execution time among different tasks?
- Assign semantic name (task type) to a task instance
- SerialMPM::interpolateParticleToGrid
- Map TAU timer object to (abstract) task (semantic entity)
- Look up timer object using task type (semantic attribute)
- Further partition along different domain-specific axes
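The lookup step can be sketched without TAU. The code below is an assumption-laden illustration, not the TAU mapping API: `MappedTimer`, `TaskTimerRegistry`, and `runMapped` are hypothetical names, and a `std::map` keyed by the task-type string stands in for the association between a semantic attribute and a timer object. Every task instance of the same type is charged to the same timer, regardless of which particle set it works on.

```cpp
#include <chrono>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a timer object; accumulates calls and time.
struct MappedTimer {
    long calls = 0;
    std::chrono::steady_clock::duration inclusive{};
};

// Associates a semantic attribute (the task type) with one timer.
class TaskTimerRegistry {
public:
    // Returns the timer for this task type, creating it on first use.
    MappedTimer& timerFor(const std::string& taskType) { return timers_[taskType]; }
    std::size_t size() const { return timers_.size(); }

private:
    std::map<std::string, MappedTimer> timers_;
};

// Runs one task instance and attributes its time to its task type.
template <class Work>
void runMapped(TaskTimerRegistry& registry, const std::string& taskType, Work&& work) {
    MappedTimer& timer = registry.timerFor(taskType);
    const auto start = std::chrono::steady_clock::now();
    work();
    timer.inclusive += std::chrono::steady_clock::now() - start;
    ++timer.calls;
}
```

A scheduler loop would call `runMapped(registry, task->getName(), ...)` around each task's work, so the profile is reported per task type such as `SerialMPM::interpolateParticleToGrid` rather than per scheduler routine.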
95Mapping Instrumentation in UCF (example)
- Use TAU performance mapping API
void MPIScheduler::execute(const ProcessorGroup * pc,
                           DataWarehouseP & old_dw,
                           DataWarehouseP & dw ) {
  ...
  TAU_MAPPING_CREATE(
    task->getName(), "MPIScheduler::execute()",
    (TauGroup_t)(void*)task->getName(), task->getName(), 0);
  ...
  TAU_MAPPING_OBJECT(tautimer);
  TAU_MAPPING_LINK(tautimer, (TauGroup_t)(void*)task->getName());
  // EXTERNAL ASSOCIATION
  ...
  TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0);
  TAU_MAPPING_PROFILE_START(doitprofiler, 0);
  task->doit(pc);
  TAU_MAPPING_PROFILE_STOP(0);
  ...
}
96Task Performance Mapping (Profile)
Mapped task performance across processes
Performance mapping for different tasks
97Work Packet to Task Mapping (Trace)
Work packet computation events colored by task
type
Distinct phases of computation can be identified
based on task
98Comparing Uintah Traces for Scalability Analysis
32 processes
99Online Performance Analysis for C-SAFE Apps
SCIRun (Univ. of Utah)
Performance Visualizer
Application
// performance data streams
TAU Performance System
Performance Analyzer
// performance data output
accumulated samples
Performance Data Reader
Performance Data Integrator
file system
sample sequencing reader synchronization
100 2D Field Performance Visualization in SCIRun
SCIRun program
101Uintah Computational Framework (UCF)
- UCF analysis
- Scheduling
- MPI library
- Components
- 500 processes
- Online and offline visualization
- Performance steering
- use SCIRun support
102Case Study: SAMRAI (LLNL)
- Structured Adaptive Mesh Refinement Application Infrastructure (SAMRAI)
- Programming
- C++ and MPI
- SPMD
- Instrumentation
- PDT for automatic instrumentation of routines
- MPI interposition wrappers
- SAMRAI timers for interesting code segments
- timers classified in groups (apps, mesh, ...)
- timer groups are managed by TAU groups
103SAMRAI (Profile)
routine name
return type
104SAMRAI Euler (Profile)
105SAMRAI Euler (Trace)
106Case Study: EVH1
- Enhanced Virginia Hydrodynamics 1 (EVH1)
- "TeraScale Simulations of Neutrino-Driven Supernovae and Their Nucleosynthesis" SciDAC project
- Configured to run a simulation of the Sedov-Taylor blast wave solution in 2D spherical geometry
- Performance study found EVH1 communication bound for more than 64 processors
- Predominant routine (>50% of execution time) at this scale is MPI_ALLTOALL
- Used in matrix transpose-like operations
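Why an all-to-all exchange behaves like a matrix transpose, and why it becomes expensive as the process count grows, can be shown with a serial model. This sketch does not call MPI and is not EVH1 code: `simulateAlltoall` and `messageCount` are hypothetical helpers, each block is reduced to a single `int`, and the message count assumes a naive implementation in which every rank sends directly to every other rank (real MPI libraries may use different algorithms).

```cpp
#include <cstddef>
#include <vector>

// send[r][d] is the block that rank r sends to rank d.
// The result satisfies recv[d][r] == send[r][d], which is the data movement
// MPI_Alltoall defines: block d of rank r lands in slot r on rank d.
std::vector<std::vector<int>> simulateAlltoall(
    const std::vector<std::vector<int>>& send) {
    const std::size_t p = send.size();
    std::vector<std::vector<int>> recv(p, std::vector<int>(p));
    for (std::size_t r = 0; r < p; ++r)
        for (std::size_t d = 0; d < p; ++d)
            recv[d][r] = send[r][d];
    return recv;
}

// Point-to-point messages in a direct exchange: each of p ranks sends to
// the other p - 1 ranks, so the total grows quadratically with p.
std::size_t messageCount(std::size_t p) { return p * (p - 1); }
```

Under this model, doubling the process count from 64 to 128 roughly quadruples the number of messages while each block shrinks, which is consistent with the exchange dominating execution time at larger scale.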
107EVH1 Execution Profile
108EVH1 Execution Trace
MPI_Alltoall is an execution bottleneck
109TAU Integration (Selected)
- SAMRAI (LLNL)
- Overture (LLNL)
- C-SAFE (ASCI ASAP)
- VTF (ASCI ASAP)
- SAGE (ASCI LANL)
- POOMA, POOMA-II (LANL, Code Sourcery)
- PETSc (ANL)
- CCA (DOE SciDAC)
- GrACE (Rutgers)
- Aurora / SCALEA (University of Vienna)
110Work in Progress
- Trace visualization
- Event traces with counters (Vampir 3.0 will visualize)
- EPILOG trace conversion
- Runtime performance monitoring and analysis
- Online performance data access
- Performance analysis and visualization in SCIRun
- Performance Database Framework
- XML parallel profile representation of TAU profiles
- PostgreSQL performance database
- Next-generation PDT
- Performance analysis for component software (CCA)
111Concluding Remarks
- Complex software and parallel computing systems pose challenging performance analysis problems that require robust methodologies and tools
- To build more sophisticated performance tools, existing proven performance technology must be utilized
- Performance tools must be integrated with software and systems models and technology
- Performance engineered software
- Function consistently and coherently in software and system environments
- TAU performance system offers robust performance technology that can be broadly integrated, so USE IT!
112Acknowledgements
- Department of Energy (DOE)
- MICS office
- DOE 2000 ACTS contract
- "Performance Technology for Tera-class Parallel Computer Systems: Evolution of the TAU Performance System"
- PERC SciDAC project affiliate
- University of Utah DOE ASCI Level 1 sub-contract
- DOE ASCI Level 3 (LANL, LLNL)
- NSF National Young Investigator (NYI) award
- Research Centre Juelich
- John von Neumann Institute for Computing
- Dr. Bernd Mohr
- Los Alamos National Laboratory
113Information
- TAU (http://www.acl.lanl.gov/tau)
- PDT (http://www.acl.lanl.gov/pdtoolkit)
- PAPI (http://icl.cs.utk.edu/projects/papi/)
- OPARI (http://www.fz-juelich.de/zam/kojak/)