Allen D. Malony - PowerPoint PPT Presentation

1
TAU: A Framework for Parallel Performance Analysis
  • Allen D. Malony
  • malony_at_cs.uoregon.edu
  • ParaDucks Research Group
  • Computer and Information Science Department
  • Computational Science Institute
  • University of Oregon

2
Outline
  • Goals and challenges
  • Targeted research areas
  • TAU (Tuning and Analysis Utilities)
  • computation model, architecture, toolkit
    framework
  • performance system technology
  • examples of TAU use
  • Tools associated with TAU
  • PDT (Program Database Toolkit)
  • distributed runtime monitoring
  • Future plans
  • Conclusions

3
Goal and Challenges
  • Create robust (performance) technology for the
    analysis and tuning of parallel software and
    systems
  • Challenges
  • different scalable computing platforms
  • different programming languages and systems
  • common, portable framework for analysis
  • extensible, retargetable tool technology
  • complex set of requirements

4
Targeted Research Areas
  • Performance analysis for scalable parallel
    systems targeting multiple programming and system
    levels and the mapping between levels
  • Program code analysis for multiple languages
    enabling development of new source-based tools
  • Integration and interoperation support for
    building analysis tool frameworks and
    environments
  • Runtime tool interaction for dynamic applications

5
TAU (Tuning and Analysis Utilities)
  • Performance analysis framework for scalable
    parallel and distributed high-performance
    computing
  • Target a general parallel computation model
  • computer nodes
  • shared address space contexts
  • threads of execution
  • multi-level parallelism
  • Integrated toolkit for performance
    instrumentation, measurement, analysis, and
    visualization
  • portable performance profiling/tracing facility
  • open software approach

6
TAU Architecture
7
TAU Instrumentation
  • Flexible, multiple instrumentation mechanisms
  • source code
  • manual
  • automatic using PDT (tau_instrumentor)
  • object code
  • pre-instrumented libraries
  • statically linked MPI wrapper library using the
    MPI Profiling Interface (libTauMpi.a)
  • dynamically linked Java instrumentation using
    JVMPI and TAU shared object dynamically loaded in
    VM
  • executable code
  • dynamic instrumentation using DyninstAPI (tau_run)

8
TAU Instrumentation (continued)
  • Common target measurement interface (TAU API)
  • C++ (object-based) instrumentation
  • macro-based, using constructor/destructor
    techniques
  • functions, classes, and templates
  • uniquely identify functions and templates
  • name and type signature (name registration)
  • static object creates performance entry
  • dynamic object receives static object pointer
  • runtime type identification for template
    instantiations
  • with C and Fortran instrumentation variants
  • Instrumentation optimization
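
The constructor/destructor technique above can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not the TAU API: ProfileEntry, register_entry, ScopedTimer, and PROFILE_SCOPE are hypothetical names, and a static pointer stands in for the static object mentioned on the slide. The registered name includes typeid(T).name(), so each template instantiation gets its own entry.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <typeinfo>

// Hypothetical names throughout; this is a sketch, not the TAU API.
struct ProfileEntry {
    std::string name;      // routine name plus type information
    long calls = 0;        // number of completed invocations
    double seconds = 0.0;  // accumulated inclusive wall-clock time
};

// Name registration: one entry per unique name. std::map keeps the
// addresses of its elements stable, so pointers to entries stay valid.
inline std::map<std::string, ProfileEntry>& registry() {
    static std::map<std::string, ProfileEntry> entries;
    return entries;
}

inline ProfileEntry* register_entry(const std::string& name) {
    ProfileEntry& e = registry()[name];
    e.name = name;
    return &e;
}

// Dynamic object: starts timing in the constructor, stops in the destructor.
class ScopedTimer {
public:
    explicit ScopedTimer(ProfileEntry* entry)
        : entry_(entry), start_(std::chrono::steady_clock::now()) {
        assert(entry_ != nullptr);
    }
    ~ScopedTimer() {
        std::chrono::duration<double> d =
            std::chrono::steady_clock::now() - start_;
        entry_->calls += 1;
        entry_->seconds += d.count();
    }
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    ProfileEntry* entry_;
    std::chrono::steady_clock::time_point start_;
};

// The static pointer is initialized once per instrumented scope (and once
// per template instantiation); the timer object is created on every call.
#define PROFILE_SCOPE(name)                                      \
    static ProfileEntry* profile_entry_ = register_entry(name);  \
    ScopedTimer profile_timer_(profile_entry_)

template <typename T>
T sum_to(T n) {
    // typeid(T).name() is implementation-defined and often mangled.
    PROFILE_SCOPE(std::string("sum_to<") + typeid(T).name() + ">");
    T total = 0;
    for (T i = 1; i <= n; ++i) total += i;
    return total;
}
```

Calling sum_to<int> and sum_to<long> leaves two separate entries in registry(), one per instantiation, each with its own call count.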

9
TAU Measurement
  • Performance information
  • high resolution timer library (real-time clock)
  • generalized software counter library
  • hardware performance counters
  • PCL (Performance Counter Library) (ZAM, Germany)
  • PAPI (Performance API) (UTK, Ptools)
  • consistent, portable API
  • Organization
  • node, context, thread levels
  • profile groups for collective events (runtime
    selective)
  • mapping between software levels

10
TAU Measurement (continued)
  • Profiling
  • function-level, block-level, statement-level
  • supports user-defined events
  • TAU profile (function) database (PD)
  • function callstack
  • hardware counts instead of time
  • Tracing
  • profile-level events
  • interprocess communication events
  • timestamp synchronization
  • User-controlled configuration (configure)

11
Timing of Multi-threaded Applications
  • Capture timing information on per thread basis
  • Two alternatives
  • wall clock time
  • works on all systems
  • user-level measurement
  • OS-maintained CPU time (e.g., Solaris, Linux)
  • thread virtual time measurement
  • TAU supports both alternatives
  • CPUTIME module profiles user+system time

configure -pthread -CPUTIME
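
The difference between the two timing alternatives can be seen with a small sketch. It assumes a POSIX system that provides clock_gettime with CLOCK_THREAD_CPUTIME_ID; Sample, thread_cpu_seconds, and measure are hypothetical names, and this is not how TAU's CPUTIME module is implemented.

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <time.h>

// Hypothetical sketch: wall-clock time vs. OS-maintained per-thread CPU time.
struct Sample {
    double wall_seconds;  // elapsed real time
    double cpu_seconds;   // CPU time consumed by the calling thread
};

// Assumes POSIX clock_gettime with CLOCK_THREAD_CPUTIME_ID is available.
inline double thread_cpu_seconds() {
    timespec ts{};
    int rc = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    assert(rc == 0);
    (void)rc;
    return static_cast<double>(ts.tv_sec) + ts.tv_nsec * 1e-9;
}

// Runs work() on the calling thread and reports both measurements.
template <typename Work>
Sample measure(Work work) {
    auto wall0 = std::chrono::steady_clock::now();
    double cpu0 = thread_cpu_seconds();
    work();
    double cpu1 = thread_cpu_seconds();
    std::chrono::duration<double> wall =
        std::chrono::steady_clock::now() - wall0;
    return Sample{wall.count(), cpu1 - cpu0};
}

// A thread that blocks accumulates wall-clock time but almost no CPU time.
inline void sleep_50ms() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
}
```

Calling measure(sleep_50ms) gives a wall-clock value of at least the sleep duration and a much smaller CPU time, which is why the two alternatives can give very different profiles for threads that block.
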
12
TAU Analysis
  • Profile analysis
  • pprof
  • parallel profiler with text-based display
  • racy
  • graphical interface to pprof
  • Trace analysis
  • trace merging and clock adjustment (if necessary)
  • trace format conversion (ALOG, SDDF, PV, Vampir)
  • Vampir
  • trace analysis and visualization tool (Pallas)

13
TAU Status
  • Usage
  • platforms
  • IBM SP, SGI Origin 2K, Intel Teraflop, Cray T3E,
    HP, Sun, Windows 95/98/NT, Alpha/Pentium Linux
    cluster
  • languages
  • C, C++, Fortran 77/90, HPF, pC++, HPC++, Java
  • communication libraries
  • MPI, PVM, Nexus, Tulip, ACLMPL
  • thread libraries
  • pthreads, Tulip, SMARTS, Java, Windows
  • compilers
  • KAI, PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray

14
TAU Status (continued)
  • application libraries
  • Blitz++, A++/P++, ACLVIS, PAWS
  • application frameworks
  • POOMA, POOMA-2, MC++, Conejo, PaRP
  • other projects
  • ACPC, University of Vienna Aurora
  • UC Berkeley (Culler) Millennium, sensitivity
    analysis
  • KAI and Pallas
  • TAU profiling and tracing toolkit (Version 2.7)
  • LANL ACL Fall 1999 CD-ROM distributed at SC'99
  • Extensive 70-page TAU Users Guide
  • http://www.acl.lanl.gov/tau

15
TAU Examples
  • Instrumentation
  • C++ template profiling (PETE, Blitz++)
  • Java and MPI
  • PAPI
  • Measurement
  • mapping of asynchronous execution (SMARTS)
  • hybrid execution (Opus/HPF)
  • Analysis
  • SMARTS scheduling

16
C++ Template Instrumentation (Blitz++, PETE)
  • High-level objects
  • array classes
  • templates
  • Optimizations
  • array processing
  • expressions (PETE)
  • Relate performance data to high-level statement
  • Complexity of template evaluation

Array expressions
17
Standard Template Instrumentation Difficulties
  • Instantiated templates result in mangled
    identifiers
  • Standard profiling techniques and tools are
    deficient
  • integrated with proprietary compilers
  • specific systems platforms and programming models

Uninterpretable routine names
18
TAU Template Instrumentation and Profiling
Profile of expression types
Performance data presented with respect to high-level array expression types
Graphical pprof
19
Parallel Java Performance Instrumentation
  • Multi-language applications (Java, C, C++,
    Fortran)
  • Hybrid execution models (Java threads, MPI)
  • Java Virtual Machine Profiler Interface (JVMPI)
  • event instrumentation in JVM
  • profiler agent (libTAU.so) fields events
  • Java Native Interface (JNI)
  • invoke JVMPI control routines to control Java
    threads and access thread information
  • MPI profiling interface
  • Performance Tools for Parallel Java
    Environments, Java Workshop, ICS 2000, May 2000.

20
TAU Java Instrumentation Architecture
Java program
mpiJava package
TAU package
JNI
MPI profiling interface
Event notification
TAU wrapper
TAU
Native MPI library
JVMPI
Profile DB
21
Parallel Java Game of Life
  • mpiJava testcase
  • 4 nodes, 28 threads
  • Node process grouping
  • Thread message pairing
  • Vampir display
  • Multi-level event grouping

22
TAU and PAPI: NAS Parallel LU Benchmark
  • SGI Power Onyx (4 processors, R10K), MPI
  • Floating point operations
  • Cross-node full / routine profiles
  • Full FP profile for each node

Percentage profile
23
TAU and PAPI: Matrix Multiply
  • Data cache miss comparison: regular vs.
    strip-mining execution
  • 512x512, 32 KB (P), 2 MB (S)
  • Regular causes 4.5 times more misses
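
The two loop structures being compared can be sketched as below, assuming "strip-mining" refers to the usual loop tiling transformation. This is not the benchmark code from the slide: Matrix, multiply_regular, multiply_strip_mined, and the tile parameter are hypothetical, and no cache-miss counts are measured here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major n x n matrix stored in a flat vector (an assumption of this sketch).
using Matrix = std::vector<double>;

// Regular version: the innermost loop walks b column-wise, touching a new
// row of b on every step.
Matrix multiply_regular(const Matrix& a, const Matrix& b, std::size_t n) {
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// Strip-mined (tiled) version: the loops are split into tile-sized strips so
// that a small block of a, b, and c is reused before moving on.
Matrix multiply_strip_mined(const Matrix& a, const Matrix& b, std::size_t n,
                            std::size_t tile) {
    assert(tile > 0);
    Matrix c(n * n, 0.0);
    for (std::size_t ii = 0; ii < n; ii += tile)
        for (std::size_t kk = 0; kk < n; kk += tile)
            for (std::size_t jj = 0; jj < n; jj += tile)
                for (std::size_t i = ii; i < std::min(ii + tile, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + tile, n); ++k) {
                        const double aik = a[i * n + k];
                        for (std::size_t j = jj; j < std::min(jj + tile, n); ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
    return c;
}
```

Both functions compute the same product; only the memory access order differs, which is what a data cache miss counter would expose.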

24
Asynchronous Performance Analysis (SMARTS)
  • Scalable Multithreaded Asynchronous Runtime
    System
  • user-level threads, light-weight virtual
    processors
  • macro-dataflow, asynchronous execution
    interleaving iterates from data-parallel
    statements
  • integrated with POOMA II
  • TAU measurement of asynchronous parallel
    execution
  • utilized the TAU mapping API
  • associate iterate performance with data parallel
    statement
  • evaluate different scheduling policies
  • SMARTS: Exploiting Temporal Locality and
    Parallelism through Vertical Execution, ICS '99,
    August 1999.
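
The idea of associating iterate performance with the originating data-parallel statement can be sketched as follows. This only illustrates the concept; it does not use the TAU mapping API. Iterate, profile_unmapped, and profile_mapped are hypothetical names, and iterate times are supplied as data rather than measured.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of performance mapping; not the TAU mapping API.
struct Iterate {
    int statement_id;  // data-parallel statement that produced this iterate
    double seconds;    // execution time of the iterate (given, not measured)
};

using Profile = std::map<std::string, double>;

// Without mapping: every iterate is charged to the one generic routine that
// executes it, so all statements are lumped together.
Profile profile_unmapped(const std::vector<Iterate>& executed) {
    Profile p;
    for (const Iterate& it : executed) p["ExpressionKernel"] += it.seconds;
    return p;
}

// With mapping: each iterate's time is charged to the statement it came
// from, even though iterates of different statements run interleaved.
Profile profile_mapped(const std::vector<Iterate>& executed,
                       const std::map<int, std::string>& statements) {
    Profile p;
    for (const Iterate& it : executed)
        p[statements.at(it.statement_id)] += it.seconds;
    return p;
}
```

For an interleaved run of iterates from two statements, the unmapped profile has a single lumped entry, while the mapped profile reports a separate total per statement.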

25
TAU Mapping of Asynchronous Execution
Without mapping
Two threads executing
With mapping
POOMA / SMARTS
26
With and without mapping (Thread 0)
Without mapping
Thread 0 blocks waiting for iterates
Iterates get lumped together
With mapping
Iterates distinguished
27
With and without mapping (Thread 1)
Without mapping
Array initialization performance lumped
Performance associated with ExpressionKernel object
With mapping
Iterate performance mapped to array statement
Array initialization performance correctly separated
28
TAU and Hybrid Execution in Opus/HPF
  • Fortran 77, Fortran 90, HPF
  • Vienna Fortran Compiling System
  • Opus / HPF
  • combined data (HPF) and task (Opus) parallelism
  • HPF compiler produces Fortran 90 modules
  • processes interoperate using Opus runtime system
  • producer / consumer model
  • MPI and pthreads
  • performance influence at multiple software levels

29
TAU Profiling of Opus/HPF Application
Multiple producers
Multiple consumers
Parallelism View
30
TAU Profiling of SMARTS
Iteration scheduling for two array expressions
31
SMARTS Tracing (SOR): Vampir Visualization
  • SCVE scheduler used in Red/Black SOR running on
    32 processors of SGI Origin 2000

Asynchronous, overlapped parallelism
32
Program Database Toolkit (PDT)
  • Program code analysis framework for developing
    source-based tools
  • High-level interface to source code information
  • Integrated toolkit for source code parsing,
    database creation, and database query
  • commercial grade front end parsers
  • portable IL analyzer, database format, and access
    API
  • open software approach for tool development
  • Target and integrate multiple source languages
  • http://www.acl.lanl.gov/pdtoolkit

33
PDT Architecture and Tools
34
PDT Summary
  • Program Database Toolkit (Version 1.1)
  • LANL ACL Fall 1999 CD-ROM distributed at SC'99
  • EDG C++ Front End (Version 2.41.2)
  • C++ IL Analyzer and DUCTAPE library
  • tools: pdbmerge, pdbconv, pdbtree, pdbhtml
  • standard C++ system header files (KAI KCC 3.4c)
  • Fortran 90 IL Analyzer in progress
  • Automated TAU performance instrumentation
  • Program analysis support for SILOON (ACL CD)
  • A Tool Framework for Static and Dynamic Analysis
    of Object-Oriented Software, submitted to SC 00.

35
Distributed Monitoring Framework
  • Extend usability of TAU performance analysis
  • Access TAU performance data during execution
  • Framework model
  • each application context is a performance data
    server
  • monitor agent thread is created within each
    context
  • client processes attach to agents and request
    data
  • server thread synchronization for data
    consistency
  • pull mode of interaction
  • Distributed TAU performance data space
  • A Runtime Monitoring Framework for the TAU
    Profiling System, ISCOPE 99, Nov. 1999.
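
The data-consistency part of the framework model can be sketched with a mutex-protected profile that is copied on request. This is a minimal illustration, not the monitor described in the cited paper: ContextProfile, record, and snapshot are hypothetical names, and the agent thread and the client transport are left out.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Hypothetical sketch of one context acting as a performance data server.
class ContextProfile {
public:
    struct Entry {
        long calls = 0;
        double seconds = 0.0;
    };

    // Called by application threads as measurements complete.
    void record(const std::string& routine, double seconds) {
        std::lock_guard<std::mutex> lock(mutex_);
        Entry& e = table_[routine];
        e.calls += 1;
        e.seconds += seconds;
    }

    // Called on behalf of a client that pulls data (in the framework model,
    // by the monitor agent thread). The lock ensures the copy is consistent
    // with respect to concurrent record() calls.
    std::map<std::string, Entry> snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return table_;
    }

private:
    mutable std::mutex mutex_;
    std::map<std::string, Entry> table_;
};
```

Because snapshot() returns a copy, a client keeps a stable view of the data while the application continues to update the live profile; it pulls again when it wants newer values.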

36
TAU Distributed Monitor Architecture
TAU profile database
  • Each context has a monitor agent
  • Client in separate thread directs agent
  • Pull model of interaction
  • Initial HPC++ implementation

37
Java Implementation of TAU Monitor
  • Motivations
  • more portable monitor middleware system (RMI)
  • more flexible and programmable server interface
    (JNI)
  • more robust client development (EJB, JDBC, Swing)

38
Future Plans
  • TAU
  • platforms: SGI Itanium, Sun Starfire, IBM Linux,
    ...
  • languages: Java (Java Grande), OpenMP
  • instrument: automatic (F90, Java), Dyninst
  • measurement: hardware counter support (PAPI)
  • displays: beyond bargraphs, performance views
  • performance database and technology
  • support for multiple runs
  • open API for analysis tool development
  • PDT
  • complete F90 and Java IL Analyzer
  • source browsers: function, class, template
  • tools for aiding in data marshalling and
    translation

39
Future Plans (continued)
  • Distributed monitoring framework
  • application and system monitoring
  • ACL Supermon and SGI Performance Co-Pilot
  • scalable SMP clusters and distributed systems
  • performance monitoring clients
  • Performance evaluation
  • numerical libraries and frameworks
  • scalable runtime systems
  • ASCI application developers (benchmark codes)
  • Investigate performance issues in Linux kernel
  • Investigate integration with CCA

40
Conclusions
  • Complex parallel computing environments require
    robust program analysis tools
  • portable, cross-platform, multi-level, integrated
  • able to bridge and reuse existing technology
  • technology savvy
  • TAU offers a robust performance technology
    framework for complex parallel computing systems
  • flexible instrumentation and measurement
  • extendable profile and trace performance analysis
  • integration with other performance technology
  • Opportunities exist for open performance
    technology

41
Open Performance Technology (OPT)
  • Performance problem is complex
  • diverse platforms, software development,
    applications
  • things evolve
  • History of incompatible and competing tools
  • instrumentation / measurement technology
    reinvention
  • lack of common, reusable software foundations
  • Need value added (open) approach
  • technology for high-level performance tool
    development
  • layered performance tool architecture
  • portable, flexible, programmable, integrative
    technology
  • Opportunity for Industry/National Labs/PACI sites