Title: Allen D. Malony
1 TAU: A Framework for Parallel Performance Analysis
- Allen D. Malony
- malony_at_cs.uoregon.edu
- ParaDucks Research Group
- Computer and Information Science Department
- Computational Science Institute
- University of Oregon
2 Outline
- Goals and challenges
- Targeted research areas
- TAU (Tuning and Analysis Utilities)
- computation model, architecture, toolkit framework
- performance system technology
- examples of TAU use
- Tools associated with TAU
- PDT (Program Database Toolkit)
- distributed runtime monitoring
- Future plans
- Conclusions
3 Goal and Challenges
- Create robust (performance) technology for the analysis and tuning of parallel software and systems
- Challenges
- different scalable computing platforms
- different programming languages and systems
- common, portable framework for analysis
- extensible, retargetable tool technology
- complex set of requirements
4 Targeted Research Areas
- Performance analysis for scalable parallel systems targeting multiple programming and system levels and the mapping between levels
- Program code analysis for multiple languages enabling development of new source-based tools
- Integration and interoperation support for building analysis tool frameworks and environments
- Runtime tool interaction for dynamic applications
5 TAU (Tuning and Analysis Utilities)
- Performance analysis framework for scalable parallel and distributed high-performance computing
- Target a general parallel computation model
- computer nodes
- shared address space contexts
- threads of execution
- multi-level parallelism
- Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
- portable performance profiling/tracing facility
- open software approach
6 TAU Architecture
7 TAU Instrumentation
- Flexible, multiple instrumentation mechanisms
- source code
- manual
- automatic using PDT (tau_instrumentor)
- object code
- pre-instrumented libraries
- statically linked MPI wrapper library using the MPI Profiling Interface (libTauMpi.a)
- dynamically linked Java instrumentation using JVMPI and TAU shared object dynamically loaded in VM
- executable code
- dynamic instrumentation using DyninstAPI (tau_run)
8 TAU Instrumentation (continued)
- Common target measurement interface (TAU API)
- C++ (object-based) instrumentation
- macro-based, using constructor/destructor techniques
- functions, classes, and templates
- uniquely identify functions and templates
- name and type signature (name registration)
- static object creates performance entry
- dynamic object receives static object pointer
- runtime type identification for template instantiations
- with C and Fortran instrumentation variants
- Instrumentation optimization
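The constructor/destructor idea above can be illustrated with a scoped timer. This is a minimal Python sketch of the general technique, not the TAU API: a context manager stands in for the C++ object whose constructor starts a timer and whose destructor stops it, and the names `PROFILE`, `ScopedTimer`, and `dot` are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import ContextDecorator

# Hypothetical profile table: one entry per registered name + signature.
PROFILE = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


class ScopedTimer(ContextDecorator):
    """Start timing on scope entry and stop on scope exit.

    Python analogue of a scoped C++ object (assumption for illustration).
    """

    def __init__(self, name, signature=""):
        # Name registration: the name plus type signature identifies the entry.
        self.key = f"{name} {signature}".strip()
        self.starts = []  # stack of start times, so recursive calls nest

    def __enter__(self):
        self.starts.append(time.perf_counter())
        return self

    def __exit__(self, exc_type, exc, tb):
        elapsed = time.perf_counter() - self.starts.pop()
        entry = PROFILE[self.key]
        entry["calls"] += 1
        entry["seconds"] += elapsed
        return False  # never swallow exceptions from the timed region


@ScopedTimer("dot", "(list[float], list[float]) -> float")
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


result = dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

Because the timer stops in `__exit__`, the measurement is recorded even when the timed region raises, which is the same property the scoped-object approach gives on early returns.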
9 TAU Measurement
- Performance information
- high resolution timer library (real-time clock)
- generalized software counter library
- hardware performance counters
- PCL (Performance Counter Library) (ZAM, Germany)
- PAPI (Performance API) (UTK, Ptools)
- consistent, portable API
- Organization
- node, context, thread levels
- profile groups for collective events (runtime selective)
- mapping between software levels
10 TAU Measurement (continued)
- Profiling
- function-level, block-level, statement-level
- supports user-defined events
- TAU profile (function) database (PD)
- function callstack
- hardware counts instead of time
- Tracing
- profile-level events
- interprocess communication events
- timestamp synchronization
- User-controlled configuration (configure)
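How a function callstack yields per-function profile data can be sketched as follows. This is an illustrative toy under stated assumptions, not TAU's implementation: the `Profiler` class and its field names are hypothetical, and the `clock` argument may be any monotonically increasing source, so a counter could be substituted for time. A scripted clock is used to keep the numbers exact.

```python
import time
from collections import defaultdict


class Profiler:
    """Toy callstack profiler tracking inclusive and exclusive cost."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.stack = []  # each frame: [name, start value, cost of children]
        self.table = defaultdict(
            lambda: {"calls": 0, "inclusive": 0, "exclusive": 0}
        )

    def start(self, name):
        self.stack.append([name, self.clock(), 0])

    def stop(self):
        name, started, children = self.stack.pop()
        elapsed = self.clock() - started
        row = self.table[name]
        row["calls"] += 1
        row["inclusive"] += elapsed             # includes callees
        row["exclusive"] += elapsed - children  # callees subtracted
        if self.stack:
            # Charge this call's cost to the caller's child total.
            self.stack[-1][2] += elapsed


# Scripted clock values: main starts at 0, solve runs from 1 to 4, main ends at 10.
ticks = iter([0, 1, 4, 10])
profiler = Profiler(clock=lambda: next(ticks))
profiler.start("main")
profiler.start("solve")
profiler.stop()
profiler.stop()
```

With this script, `solve` has inclusive and exclusive cost 3, while `main` has inclusive cost 10 and exclusive cost 7. Recursive calls would double-count inclusive cost in this sketch.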
11 Timing of Multi-threaded Applications
- Capture timing information on a per-thread basis
- Two alternatives
- wall clock time
- works on all systems
- user-level measurement
- OS-maintained CPU time (e.g., Solaris, Linux)
- thread virtual time measurement
- TAU supports both alternatives
- CPUTIME module profiles user+system time
configure -pthread -CPUTIME
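The difference between the two alternatives can be seen in a small experiment. This sketch uses only the Python standard library and is not TAU code: `time.perf_counter()` plays the role of the wall clock and `time.thread_time()` the OS-maintained per-thread CPU time (user+system), assuming a platform that provides it (Python 3.7+ on Linux, for example). The function names are hypothetical.

```python
import threading
import time


def measure(work, results, label):
    """Run `work` in the calling thread and record both kinds of time."""
    wall_start = time.perf_counter()
    cpu_start = time.thread_time()  # CPU time of this thread only
    work()
    results[label] = {
        "wall": time.perf_counter() - wall_start,
        "cpu": time.thread_time() - cpu_start,
    }


def sleeper():
    time.sleep(0.05)  # blocked: wall time passes, almost no CPU is used


def spinner():
    total = 0
    for i in range(200_000):  # busy: CPU time accumulates
        total += i * i


results = {}
threads = [
    threading.Thread(target=measure, args=(fn, results, fn.__name__))
    for fn in (sleeper, spinner)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The sleeping thread reports roughly 50 ms of wall clock time but very little CPU time, which is why a blocked thread looks expensive under wall clock measurement and cheap under thread virtual time.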
12 TAU Analysis
- Profile analysis
- pprof
- parallel profiler with text-based display
- racy
- graphical interface to pprof
- Trace analysis
- trace merging and clock adjustment (if necessary)
- trace format conversion (ALOG, SDDF, PV, Vampir)
- Vampir
- trace analysis and visualization tool (Pallas)
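Trace merging with clock adjustment can be sketched as below. This is a simplified illustration, not the algorithm TAU's tools use: it assumes each node's trace is already sorted by local time and that each node's clock differs from the reference clock by a known constant offset (no drift). The record layout `(time, node, event)` and the function names are hypothetical.

```python
import heapq


def adjust(trace, offset):
    """Shift one node's timestamps onto the reference clock."""
    return [(t - offset, node, event) for (t, node, event) in trace]


def merge_traces(traces, offsets):
    """Merge per-node traces into one stream ordered by adjusted time."""
    adjusted = [adjust(trace, offsets[i]) for i, trace in enumerate(traces)]
    # Each adjusted trace is still sorted, so a k-way merge is sufficient.
    return list(heapq.merge(*adjusted))


node0 = [(0.0, 0, "enter main"), (2.0, 0, "send"), (5.0, 0, "exit main")]
# Assumed: node 1's clock runs 10 time units ahead of node 0's.
node1 = [(10.5, 1, "enter main"), (12.5, 1, "recv"), (14.0, 1, "exit main")]

merged = merge_traces([node0, node1], offsets=[0.0, 10.0])
```

After adjustment the `recv` on node 1 lands at time 2.5, just after the matching `send` at 2.0, which is the ordering a trace visualizer needs to pair messages correctly.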
13 TAU Status
- Usage
- platforms
- IBM SP, SGI Origin 2K, Intel Teraflop, Cray T3E, HP, Sun, Windows 95/98/NT, Alpha/Pentium Linux cluster
- languages
- C, C++, Fortran 77/90, HPF, pC++, HPC++, Java
- communication libraries
- MPI, PVM, Nexus, Tulip, ACLMPL
- thread libraries
- pthreads, Tulip, SMARTS, Java, Windows
- compilers
- KAI, PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray
14 TAU Status (continued)
- application libraries
- Blitz++, A++/P++, ACLVIS, PAWS
- application frameworks
- POOMA, POOMA-2, MC++, Conejo, PaRP
- other projects
- ACPC, University of Vienna: Aurora
- UC Berkeley (Culler): Millennium, sensitivity analysis
- KAI and Pallas
- TAU profiling and tracing toolkit (Version 2.7)
- LANL ACL Fall 1999 CD-ROM distributed at SC'99
- Extensive 70-page TAU Users Guide
- http://www.acl.lanl.gov/tau
15 TAU Examples
- Instrumentation
- C++ template profiling (PETE, Blitz++)
- Java and MPI
- PAPI
- Measurement
- mapping of asynchronous execution (SMARTS)
- hybrid execution (Opus/HPF)
- Analysis
- SMARTS scheduling
16 C++ Template Instrumentation (Blitz++, PETE)
- High-level objects
- array classes
- templates
- Optimizations
- array processing
- expressions (PETE)
- Relate performance data to high-level statement
- Complexity of template evaluation
Array expressions
17 Standard Template Instrumentation Difficulties
- Instantiated templates result in mangled identifiers
- Standard profiling techniques and tools are deficient
- integrated with proprietary compilers
- specific systems platforms and programming models
Uninterpretable routine names
18 TAU Template Instrumentation and Profiling
Profile of expression types
Performance data presented with respect to high-level array expression types
Graphical pprof
19 Parallel Java Performance Instrumentation
- Multi-language applications (Java, C++, C, Fortran)
- Hybrid execution models (Java threads, MPI)
- Java Virtual Machine Profiler Interface (JVMPI)
- event instrumentation in JVM
- profiler agent (libTAU.so) fields events
- Java Native Interface (JNI)
- invoke JVMPI control routines to control Java threads and access thread information
- MPI profiling interface
- Performance Tools for Parallel Java Environments, Java Workshop, ICS 2000, May 2000.
20 TAU Java Instrumentation Architecture
[Figure: architecture diagram. Labeled components: Java program, mpiJava package, TAU package, JNI, MPI profiling interface, event notification, TAU wrapper, TAU, native MPI library, JVMPI, profile DB]
21 Parallel Java Game of Life
- mpiJava testcase
- 4 nodes, 28 threads
- Node process grouping
- Thread message pairing
- Vampir display
- Multi-level event grouping
22 TAU and PAPI: NAS Parallel LU Benchmark
- SGI Power Onyx (4 processors, R10K), MPI
- Floating point operations
- Cross-node full / routine profiles
- Full FP profile for each node
Percentage profile
23 TAU and PAPI: Matrix Multiply
- Data cache miss comparison
- regular vs. strip-mining execution
- 512x512 matrices, 32 KB (P) and 2 MB (S) caches
- Regular causes 4.5 times more misses
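The two loop structures being compared can be sketched as follows. This is an illustration of the general transformation only, not the benchmark code from the slide: the strip size, function names, and the choice to strip-mine the `k` and `j` loops (with interchange) are assumptions. Pure Python lists do not reproduce the cache behavior measured with PAPI; the sketch only shows that both orderings compute the same product.

```python
def matmul_regular(a, b, n):
    """Textbook i-j-k loop nest; walks a column of b in the inner loop."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_strip_mined(a, b, n, strip=4):
    """Same product, with k and j split into strips so that a small
    block of b is reused before moving on."""
    c = [[0.0] * n for _ in range(n)]
    for kk in range(0, n, strip):
        for jj in range(0, n, strip):
            for i in range(n):
                for k in range(kk, min(kk + strip, n)):
                    aik = a[i][k]
                    for j in range(jj, min(jj + strip, n)):
                        c[i][j] += aik * b[k][j]
    return c


n = 6  # deliberately not a multiple of the strip size
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i * j % 5) for j in range(n)] for i in range(n)]
same = matmul_regular(a, b, n) == matmul_strip_mined(a, b, n, strip=4)
```

In a compiled language the strip size would be chosen so that the working block fits in the primary cache, which is where the reduction in data cache misses comes from.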
24 Asynchronous Performance Analysis (SMARTS)
- Scalable Multithreaded Asynchronous Runtime System
- user-level threads, light-weight virtual processors
- macro-dataflow, asynchronous execution interleaving iterates from data-parallel statements
- integrated with POOMA II
- TAU measurement of asynchronous parallel execution
- utilized the TAU mapping API
- associate iterate performance with data parallel statement
- evaluate different scheduling policies
- SMARTS: Exploiting Temporal Locality and Parallelism through Vertical Execution, ICS '99, August 1999.
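The idea of associating iterate performance with the originating data-parallel statement can be sketched as follows. This is a toy model under stated assumptions, not the TAU mapping API or SMARTS: each iterate carries a key naming the statement that created it, and the runtime charges the measured cost to that key. The class names, the statement strings, and the simulated clock are all hypothetical.

```python
from collections import defaultdict, deque


class FakeClock:
    """Deterministic clock so the example has exact, repeatable costs."""

    def __init__(self):
        self.now = 0

    def advance(self, amount):
        self.now += amount


class Iterate:
    def __init__(self, statement, cost, clock):
        self.statement = statement  # mapping key recorded at creation time
        self.cost = cost
        self.clock = clock

    def run(self):
        self.clock.advance(self.cost)  # stands in for real work


def execute(queue, clock):
    """Run interleaved iterates, recording cost both ways."""
    unmapped = defaultdict(int)  # all cost lands on the generic kernel
    mapped = defaultdict(int)    # cost lands on the originating statement
    while queue:
        iterate = queue.popleft()
        start = clock.now
        iterate.run()
        elapsed = clock.now - start
        unmapped["ExpressionKernel"] += elapsed
        mapped[iterate.statement] += elapsed
    return dict(unmapped), dict(mapped)


clock = FakeClock()
# Iterates from two array statements, interleaved by the scheduler.
queue = deque([
    Iterate("A = B + C", 1, clock),
    Iterate("D = A * 2", 3, clock),
    Iterate("A = B + C", 2, clock),
    Iterate("D = A * 2", 4, clock),
])
unmapped, mapped = execute(queue, clock)
```

Without the mapping key, all ten units of cost are lumped under one kernel entry; with it, the profile separates 3 units for the first statement from 7 for the second.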
25 TAU Mapping of Asynchronous Execution
Without mapping
Two threads executing
With mapping
POOMA / SMARTS
26 With and without mapping (Thread 0)
Without mapping
Thread 0 blocks waiting for iterates
Iterates get lumped together
With mapping
Iterates distinguished
27 With and without mapping (Thread 1)
Without mapping
Array initialization performance lumped
Performance associated with ExpressionKernel object
With mapping
Iterate performance mapped to array statement
Array initialization performance correctly separated
28 TAU and Hybrid Execution in Opus/HPF
- Fortran 77, Fortran 90, HPF
- Vienna Fortran Compiling System
- Opus / HPF
- combined data (HPF) and task (Opus) parallelism
- HPF compiler produces Fortran 90 modules
- processes interoperate using Opus runtime system
- producer / consumer model
- MPI and pthreads
- performance influence at multiple software levels
29 TAU Profiling of Opus/HPF Application
Multiple producers
Multiple consumers
Parallelism View
30 TAU Profiling of SMARTS
Iteration scheduling for two array expressions
31 SMARTS Tracing (SOR): Vampir Visualization
- SCVE scheduler used in Red/Black SOR running on 32 processors of SGI Origin 2000
Asynchronous, overlapped parallelism
32 Program Database Toolkit (PDT)
- Program code analysis framework for developing source-based tools
- High-level interface to source code information
- Integrated toolkit for source code parsing, database creation, and database query
- commercial grade front end parsers
- portable IL analyzer, database format, and access API
- open software approach for tool development
- Target and integrate multiple source languages
- http://www.acl.lanl.gov/pdtoolkit
33 PDT Architecture and Tools
34 PDT Summary
- Program Database Toolkit (Version 1.1)
- LANL ACL Fall 1999 CD-ROM distributed at SC'99
- EDG C++ Front End (Version 2.41.2)
- C++ IL Analyzer and DUCTAPE library
- tools: pdbmerge, pdbconv, pdbtree, pdbhtml
- standard C++ system header files (KAI KCC 3.4c)
- Fortran 90 IL Analyzer in progress
- Automated TAU performance instrumentation
- Program analysis support for SILOON (ACL CD)
- A Tool Framework for Static and Dynamic Analysis of Object-Oriented Software, submitted to SC'00.
35 Distributed Monitoring Framework
- Extend usability of TAU performance analysis
- Access TAU performance data during execution
- Framework model
- each application context is a performance data server
- monitor agent thread is created within each context
- client processes attach to agents and request data
- server thread synchronization for data consistency
- pull mode of interaction
- Distributed TAU performance data space
- A Runtime Monitoring Framework for the TAU Profiling System, ISCOPE '99, Nov. 1999.
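The framework model above can be sketched with threads and queues. This is a single-process illustration of the pull model under stated assumptions, not the monitoring framework itself: each `Context` owns a profile table and an agent thread, a lock provides the synchronization for data consistency, and a client pulls snapshots on request. All class, method, and event names are hypothetical.

```python
import queue
import threading


class Context:
    """An application context acting as a performance data server."""

    def __init__(self, name):
        self.name = name
        self.profile = {}
        self.lock = threading.Lock()   # guards the profile table
        self.requests = queue.Queue()  # client requests arrive here
        self.agent = threading.Thread(target=self._serve, daemon=True)
        self.agent.start()

    def record(self, event, seconds):
        """Called by application threads as measurements are made."""
        with self.lock:
            self.profile[event] = self.profile.get(event, 0.0) + seconds

    def _serve(self):
        """Monitor agent: answer each pull request with a snapshot."""
        while True:
            reply = self.requests.get()
            if reply is None:  # shutdown sentinel
                break
            with self.lock:
                snapshot = dict(self.profile)  # consistent copy
            reply.put((self.name, snapshot))

    def shutdown(self):
        self.requests.put(None)
        self.agent.join()


def pull(contexts):
    """Client side: ask every agent for data and collect the replies."""
    reply = queue.Queue()
    for ctx in contexts:
        ctx.requests.put(reply)
    return dict(reply.get() for _ in contexts)


contexts = [Context("node0/ctx0"), Context("node1/ctx0")]
contexts[0].record("solve", 1.5)
contexts[1].record("solve", 2.0)
contexts[1].record("exchange", 0.5)
view = pull(contexts)
for ctx in contexts:
    ctx.shutdown()
```

Data moves only when a client asks for it, so the application pays for a snapshot copy per request rather than for continuous streaming.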
36 TAU Distributed Monitor Architecture
TAU profile database
- Each context has a monitor agent
- Client in separate thread directs agent
- Pull model of interaction
- Initial HPC++ implementation
37 Java Implementation of TAU Monitor
- Motivations
- more portable monitor middleware system (RMI)
- more flexible and programmable server interface (JNI)
- more robust client development (EJB, JDBC, Swing)
38 Future Plans
- TAU
- platforms: SGI Itanium, Sun Starfire, IBM Linux, ...
- languages: Java (Java Grande), OpenMP
- instrumentation: automatic (F90, Java), Dyninst
- measurement: hardware counter, support PAPI
- displays: beyond bargraphs, performance views
- performance database and technology
- support for multiple runs
- open API for analysis tool development
- PDT
- complete F90 and Java IL Analyzer
- source browsers: function, class, template
- tools for aiding in data marshalling and translation
39 Future Plans (continued)
- Distributed monitoring framework
- application and system monitoring
- ACL Supermon and SGI Performance Co-Pilot
- scalable SMP clusters and distributed systems
- performance monitoring clients
- Performance evaluation
- numerical libraries and frameworks
- scalable runtime systems
- ASCI application developers (benchmark codes)
- Investigate performance issues in Linux kernel
- Investigate integration with CCA
40 Conclusions
- Complex parallel computing environments require robust program analysis tools
- portable, cross-platform, multi-level, integrated
- able to bridge and reuse existing technology
- technology savvy
- TAU offers a robust performance technology framework for complex parallel computing systems
- flexible instrumentation and measurement
- extendable profile and trace performance analysis
- integration with other performance technology
- Opportunities exist for open performance technology
41 Open Performance Technology (OPT)
- Performance problem is complex
- diverse platforms, software development, applications
- things evolve
- History of incompatible and competing tools
- instrumentation / measurement technology reinvention
- lack of common, reusable software foundations
- Need value added (open) approach
- technology for high-level performance tool development
- layered performance tool architecture
- portable, flexible, programmable, integrative technology
- Opportunity for Industry/National Labs/PACI sites