1
Recent Advances in theTAU Performance System
  • Allen D. Malony, Sameer Shende
  • {malony,shende}@cs.uoregon.edu
  • Department of Computer and Information Science
  • Computational Science Institute
  • University of Oregon

2
Outline
  • Complexity and performance technology
  • What is the TAU performance system?
  • Problems currently being investigated
  • Instrumentation control and selection
  • Performance mapping and callpath profiling
  • Online performance analysis and visualization
  • Performance analysis for component software
  • Performance database framework
  • Concluding remarks

3
Complexity in Parallel and Distributed Systems
  • Complexity in computing system architecture
  • Diverse parallel and distributed system
    architectures
  • shared / distributed memory, cluster, hybrid,
    NOW, Grid, ...
  • Sophisticated processor / memory / network
    architectures
  • Complexity in parallel software environment
  • Diverse parallel programming paradigms
  • Optimizing compilers and sophisticated runtime
    systems
  • Advanced numerical libraries and application
    frameworks
  • Hierarchical, multi-level software architectures
  • Multi-component, coupled simulation models

4
Complexity Determines Performance Requirements
  • Performance observability requirements
  • Multiple levels of software and hardware
  • Different types and detail of performance data
  • Alternative performance problem solving methods
  • Multiple targets of software and system
    application
  • Performance technology requirements
  • Broad scope of performance observation
  • Flexible and configurable mechanisms
  • Technology integration and extension
  • Cross-platform portability
  • Open, layered, and modular framework architecture

5
Complexity Challenges for Performance Tools
  • Computing system environment complexity
  • Observation integration and optimization
  • Access, accuracy, and granularity constraints
  • Diverse/specialized observation
    capabilities/technology
  • Restricted modes limit performance problem
    solving
  • Sophisticated software development environments
  • Programming paradigms and performance models
  • Performance data mapping to software abstractions
  • Uniformity of performance abstraction across
    platforms
  • Rich observation capabilities and flexible
    configuration
  • Common performance problem solving methods

6
General Problems (Performance Technology)
  • How do we create robust and ubiquitous
    performance technology for the analysis and
    tuning of parallel and distributed software and
    systems in the presence of (evolving) complexity
    challenges?
  • How do we apply performance technology
    effectively for the variety and diversity of
    performance problems that arise in the context of
    complex parallel and distributed computer systems?

7
TAU Performance System Framework
  • Tuning and Analysis Utilities (aka Tools Are Us)
  • Performance system framework for scalable
    parallel and distributed high-performance
    computing
  • Targets a general complex system computation
    model
  • nodes / contexts / threads
  • Multi-level system / software / parallelism
  • Measurement and analysis abstraction
  • Integrated toolkit for performance
    instrumentation, measurement, analysis, and
    visualization
  • Portable performance profiling/tracing facility
  • Open software approach

8
TAU Performance System Architecture
[Architecture diagram, including integration with the Paraver and EPILOG trace tools]
9
Instrumentation Control and Selection
  • Selection of which performance events to observe
  • Could depend on scope, type, level of interest
  • Could depend on instrumentation overhead
  • How is selection supported in instrumentation
    system?
  • No choice
  • Include / exclude lists (TAU; see the example after
    this list)
  • Environment variables
  • Static vs. dynamic
  • Problem: Controlling instrumentation of small
    routines
  • High relative measurement overhead
  • Significant intrusion and possible perturbation
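For example, a TAU selective instrumentation file can name the routines to exclude. The format below (BEGIN_EXCLUDE_LIST / END_EXCLUDE_LIST, with # as a wildcard) is the one used by TAU's instrumentor in later releases, shown here as an illustrative assumption; the routine names are taken from the klargest example later in this talk:

    # select.tau -- exclude small, frequently called routines
    BEGIN_EXCLUDE_LIST
    void interchange#
    void sort_5elements#
    END_EXCLUDE_LIST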

10
Rule-Based Overhead Analysis (N. Trebon, UO)
  • Analyze the performance data to determine events
    with high (relative) overhead performance
    measurements
  • Create a select list for excluding those events
  • Rule grammar (used in TAUreduce tool)
  • [GroupName:] Field Operator Number
  • GroupName indicates the rule applies only to events
    in that group
  • Field is an event metric attribute (from profile
    statistics)
  • numcalls, numsubs, percent, usec, cumusec,
    totalcount, stdev, usecs/call, counts/call
  • Operator is one of > (gt), < (lt), or = (eq)
  • Number is any number
  • Compound rules possible using & between simple
    rules
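As a concrete illustration, a rule of this grammar can be represented and checked against one event's profile statistics roughly as follows (a minimal C++ sketch with illustrative names, not the actual TAUreduce code):

    #include <map>
    #include <string>

    // One simple rule: [GroupName:] Field Operator Number
    struct Rule {
      std::string group;   // empty means "applies to all groups"
      std::string field;   // e.g. "usec", "numcalls", "percent"
      char op;             // '>', '<', or '='
      double number;
    };

    // True if the event's statistics satisfy the rule.
    bool matches(const Rule& r,
                 const std::string& eventGroup,
                 const std::map<std::string, double>& stats) {
      if (!r.group.empty() && r.group != eventGroup) return false;
      double v = stats.at(r.field);
      switch (r.op) {
        case '>': return v > r.number;
        case '<': return v < r.number;
        default:  return v == r.number;
      }
    }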

11
Example Rules
  • Exclude all events that are members of TAU_USER
    and use less than 1000 microseconds
  • TAU_USER:usec < 1000
  • Exclude all events that have less than 1000
    microseconds and are called only once
  • usec < 1000 & numcalls = 1
  • Exclude all events that have less than 1000
    usecs per call OR have a (total inclusive)
    percent less than 5
  • usecs/call < 1000
  • percent < 5
  • Scientific notation can be used

12
TAUReduce Example
  • tau_reduce implements overhead reduction in TAU
  • Consider klargest example
  • Find the kth largest element among N elements
  • Compare two methods: quicksort,
    select_kth_largest
  • Test case: i = 2324, N = 1000000 (uninstrumented)
  • quicksort (wall clock): 0.188511 secs
  • select_kth_largest (wall clock): 0.149594 secs
  • Total (P3/1.2GHz time): 0.340u 0.020s 0:00.37
  • Execute with all routines instrumented
  • Execute with rule-based selective instrumentation
  • usec > 1000 & numcalls > 400000 & usecs/call < 30 &
    percent > 25

13
Simple sorting example on one processor
Before selective instrumentation reduction
    NODE 0; CONTEXT 0; THREAD 0
    ---------------------------------------------------------------------------------------
    %Time   Exclusive   Inclusive       #Call      #Subrs  Inclusive  Name
                 msec        msec                          usec/call
    ---------------------------------------------------------------------------------------
    100.0          13       4,982           1           4    4982030  int main
     93.5       3,223       4,659  4.20241E06  1.40268E07          1  void quicksort
     62.9     0.00481       3,134           5           5     626839  int kth_largest_qs
     36.4         137       1,813          28      450057      64769  int select_kth_largest
     33.6         150       1,675      449978      449978          4  void sort_5elements
     28.8       1,435       1,435  1.02744E07           0          0  void interchange
      0.4          20          20           1           0      20668  void setup
      0.0      0.0118      0.0118          49           0          0  int ceil

After selective instrumentation reduction
    NODE 0; CONTEXT 0; THREAD 0
    ---------------------------------------------------------------------------------------
    %Time   Exclusive   Inclusive       #Call      #Subrs  Inclusive  Name
                 msec  total msec                          usec/call
    ---------------------------------------------------------------------------------------
    100.0          14         383           1           4     383333  int main
     50.9         195         195           5           0      39017  int kth_largest_qs
     40.0         153         153          28          79       5478  int select_kth_largest
      5.4          20          20           1           0      20611  void setup
      0.0        0.02        0.02          49           0          0  int ceil
14
Performance Mapping
  • Associate performance with significant entities
    (events)
  • Source code points are important
  • Functions, regions, control flow events, user
    events
  • Execution process and thread entities are
    important
  • Some entities are more abstract, harder to
    measure
  • Consider callgraph (callpath) profiling
  • Measure time (metric) along an edge (path) of
    callgraph
  • incident edge gives parent / child view
  • edge sequence (path) gives parent / descendant
    view
  • Problem: Callpath profiling when the callgraph is
    unknown
  • Determine callgraph dynamically at runtime
  • Map performance measurement to dynamic call path
    state

15
Callgraph (Callpath) Profiling
  • 0-level callpath
  • Callgraph node
  • A
  • 1-level callpath
  • Immediate descendant
  • A→B, E→I, D→H
  • C→H ?
  • k-level callpath (k > 1)
  • k call descendant
  • 2-level: A→D, C→I
  • 2-level A→I ?
  • 3-level: A→H

16
1-Level Callpath Profiling in TAU (S. Shende, UO)
  • TAU maintains a performance event (routine)
    callstack
  • Profiled routine (child) looks in callstack for
    parent
  • Previous profiled performance event is the parent
  • A callpath profile structure created first time
    parent calls
  • TAU records parent in a callgraph map for child
  • String representing 1-level callpath used as its
    key
  • "a( ) => b( )" names the time spent in b when
    called by a
  • Map returns pointer to callpath profile structure
  • 1-level callpath is profiled using this profiling
    data
  • Builds upon TAU's performance mapping technology
    (sketched below)
  • Measurement is independent of instrumentation
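A minimal sketch of this mapping idea, assuming a simple string key per 1-level callpath (illustrative only; TAU's implementation uses its own map and profiler data structures):

    #include <map>
    #include <stack>
    #include <string>

    struct Profile { double inclusive = 0; long calls = 0; };

    static std::stack<std::string> callstack;            // event (routine) callstack
    static std::map<std::string, Profile> callpathMap;   // key: "parent => child"

    // On entry to event 'name': parent is whatever is on top of the stack.
    Profile* enterEvent(const std::string& name) {
      std::string key = callstack.empty()
          ? name : callstack.top() + " => " + name;      // 1-level callpath name
      callstack.push(name);
      return &callpathMap[key];   // created the first time parent calls child
    }

    void exitEvent(Profile* p, double elapsed) {
      p->inclusive += elapsed;
      p->calls++;
      callstack.pop();
    }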

17
Callpath Profiling Example (NAS LU v2.3)
  • configure -PROFILECALLPATH -SGITIMERS
    -arch=sgi64 -mpiinc=/usr/include
    -mpilib=/usr/lib64 -useropt=-O2

18
Callpath Parallel Profile Display
  • 0-level and 1-level callpath grouping

[Parallel profile displays: 1-Level Callpath view and 0-Level Callpath view]
19
Performance Monitoring and Steering
  • Desirable to monitor performance during execution
  • Long-running applications
  • Steering computations for improved performance
  • Large-scale parallel applications complicate
    solutions
  • More parallel threads of execution producing data
  • Large amount of performance data (relative) to
    access
  • Analysis and visualization more difficult
  • Problem: Online performance data access and
    analysis
  • Incremental profile sampling (based on files)
  • Integration in computational steering system
  • Dynamic performance measurement and access

20
Online Performance Analysis (K. Li, UO)
21
2D Field Performance Visualization in SCIRun
[Screenshot: SCIRun program]
22
Uintah Computational Framework (UCF)
  • University of Utah
  • UCF analysis
  • Scheduling
  • MPI library
  • Components
  • 500 processes
  • Use for online and offline visualization
  • Apply SCIRun steering

23
Performance Analysis of Component Software
  • Complexity in scientific problem solving
    addressed by
  • advances in software development environments
  • rich layered software middleware and libraries
  • Increases complexity in performance problem
    solving
  • Integration barriers for performance technology
  • Incompatible with advanced software technology
  • Inconsistent with software engineering process
  • Problem: Performance engineering for component
    systems
  • Respect software development methodology
  • Leverage software implementation technology
  • Look for opportunities for synergy and
    optimization

24
Focus on Component Technology and CCA
  • Emerging component technology for HPC and Grid
  • Component: software object embedding
    functionality
  • Component architecture (CA): how components
    connect
  • Component framework: implements a CA
  • Common Component Architecture (CCA)
  • Standard foundation for scientific component
    architecture
  • Component descriptions
  • Scientific Interface Description Language (SIDL)
  • CCA ports for component interactions (provides
    and uses)
  • CCA services: directory, registry, connection,
    event
  • High-performance components and interactions

25
Extend Component Design for Performance
[Diagram: generic component]
  • Compliant with component architecture
  • Component composition performance engineering
  • Utilize technology and services of component
    framework

26
Performance Knowledge
  • Describe and store a "known" component's
    performance
  • Benchmark characterizations in performance
    database
  • Models of performance
  • empirical-based
  • simulation-based
  • analytical-based
  • Saved information about component performance
  • Use for performance-guided selection and
    deployment
  • Use for runtime adaptation
  • Representation must be in common forms with
    standard means for accessing the performance
    information

27
Performance Knowledge Repository Component
  • Component performance repository
  • Implement in component architecture framework
  • Similar to CCA component repository
  • Access by component infrastructure
  • View performance knowledge as component (PKC)
  • PKC ports give access to performance knowledge
  • to other components, and back to the original
    component
  • Static/dynamic component control and composition
  • Component composition performance knowledge

28
Performance Observation
  • Ability to observe execution performance is
    important
  • Empirically-derived performance knowledge
    requires it
  • does not require measurement integration in
    component
  • Monitor during execution to make dynamic
    decisions
  • measurement integration is key
  • Performance observation integration
  • Component integration: core and variant
  • Runtime measurement and data collection
  • On-line and off-line performance analysis
  • Performance observation technology must be as
    portable and robust as component software

29
Performance Observation Component (POC)
  • Performance observation in a performance-engineered
    component model
  • Functional extension of original component design
  • Include new component methods and ports for other
    components to access measured performance data
  • Allow original component to access performance
    data
  • encapsulated as a tightly-coupled and co-resident
    performance observation object
  • POC provides ports that allow use of optimized
    interfaces to access "internal" performance
    observations (sketched below)
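A rough C++ picture of this arrangement (all names hypothetical; a sketch of the design, not CCA or TAU code):

    #include <map>
    #include <string>

    // Tightly-coupled, co-resident performance observation object.
    class PerfObserver {
     public:
      void record(const std::string& event, double value) { data_[event] = value; }
      double query(const std::string& event) const {     // optimized internal access
        std::map<std::string, double>::const_iterator it = data_.find(event);
        return it == data_.end() ? 0.0 : it->second;
      }
     private:
      std::map<std::string, double> data_;
    };

    // Performance-engineered component: original functionality plus a
    // new port method that lets other components read measured data.
    class IntegratorComponent {
     public:
      double integrate() {                               // original functionality
        observer_.record("integrate_time_usec", 42.0);   // placeholder measurement
        return 0.0;
      }
      double getMeasurement(const std::string& event) const {  // POC port method
        return observer_.query(event);
      }
     private:
      PerfObserver observer_;                            // encapsulated observer
    };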

30
Architecture of a Performance Component
  • Each component advertises its services
  • Performance component
  • Timer (start/stop)
  • Event (trigger)
  • Query (timers)
  • Knowledge (component performance model)
  • Prototype implementation of timer
  • CCAFFEINE reference framework
  • http://www.cca-forum.org/café.html
  • SIDL
  • Instantiate with TAU functionality

31
TimerPort Interface Declaration in CCAFFEINE
  • Create Timer port abstraction
    namespace performance {
    namespace ccaports {
      /*
       * This abstract class declares the Timer interface.
       * Inherit from this class to provide functionality.
       */
      class Timer                        /* implementation of port  */
        : public virtual gov::cca::Port  /* inherits from port spec */
      {
      public:
        virtual ~Timer ();
        /*
         * Start the Timer.  Implement this function in
         * a derived class to provide required functionality.
         */
        virtual void start(void) = 0;    /* virtual methods with  */
        virtual void stop(void) = 0;     /* null implementations  */
        ...
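A derived class can then bind this port to a concrete measurement library. The sketch below assumes TAU's C-level Tau_start()/Tau_stop() calls (an assumption about the measurement API; the actual TAU component implementation may differ):

    #include <TAU.h>        /* TAU measurement API (assumed) */
    #include <string>

    class TauTimer : public virtual performance::ccaports::Timer {
     public:
      virtual ~TauTimer() {}
      virtual void start(void) { Tau_start(name_.c_str()); }  /* enter TAU timer */
      virtual void stop(void)  { Tau_stop(name_.c_str()); }   /* exit TAU timer  */
      void setName(const std::string& n) { name_ = n; }       /* timer name      */
     private:
      std::string name_;
    };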

32
Using Performance Component Timer
  • Component uses framework services to get
    TimerPort
  • Use of this TimerPort interface is independent
    of TAU
    // Get Timer port from CCA framework services from CCAFFEINE
    port = frameworkServices->getPort("TimerPort");
    if (port) {
      timer_m = dynamic_cast<performance::ccaports::Timer *>(port);
      if (timer_m == 0) {
        cerr << "Connected to something, not a Timer port" << endl;
        return -1;
      }
    }
    string s = "IntegrateTimer";       // give name for timer
    timer_m->setName(s);               // assign name to timer
    timer_m->start();                  // start timer (independent of tool)
    for (int i = 0; i < count; i++) {
      double x = random_m->getRandomNumber();
      sum = sum + function_m->evaluate(x);
    }
    timer_m->stop();                   // stop timer
33
Using SIDL for Language Interoperability
  • Can create Timer interface in SIDL for creating
    stubs
    //
    // File: performance.sidl
    //
    version performance 1.0;
    package performance {
      class Timer {
        void start();
        void stop();
        void setName(in string name);
        string getName();
        void setType(in string name);
        string getType();
        void setGroupName(in string name);
        string getGroupName();
        void setGroupId(in long group);
        long getGroupId();
      }
    }

34
Using SIDL Interface for Timers
  • C program that uses the SIDL Timer interface
  • Again, independent of timer implementations
    (e.g., TAU)
    // SIDL
    #include "performance_Timer.hh"

    int main(int argc, char* argv[]) {
      performance::Timer t = performance::Timer::_create();
      ...
      t.setName("Integrate timer");
      t.start();
      // Computation
      for (int i = 0; i < count; i++) {
        double x = random_m->getRandomNumber();
        sum = sum + function_m->evaluate(x);
      }
      ...
      t.stop();
      return 0;
    }

35
Using TAU Component in CCAFFEINE
    repository get TauTimer              /* get TAU component from repository */
    repository get Driver                /* get application components */
    repository get MidpointIntegrator
    repository get MonteCarloIntegrator
    repository get RandomGenerator
    repository get LinearFunction
    repository get NonlinearFunction
    repository get PiFunction
    create LinearFunction lin_func       /* create component instances */
    create NonlinearFunction nonlin_func
    create PiFunction pi_func
    create MonteCarloIntegrator mc_integrator
    create RandomGenerator rand
    create TauTimer tau                  /* create TAU component instance */
    /* connecting components and running */
    connect mc_integrator RandomGeneratorPort rand RandomGeneratorPort
    connect mc_integrator FunctionPort nonlin_func FunctionPort

36
Component Composition Performance Engineering
  • Performance of component-based scientific
    applications depends on the interplay of
  • Component functions
  • Computational resources available
  • Management of component compositions throughout
    execution is critical to successful deployment
    and use
  • Identify key technological capabilities needed to
    support the performance engineering of component
    compositions
  • Two model concepts
  • Performance awareness
  • Performance attention

37
Performance Awareness of Component Ensembles
  • Composition performance knowledge and observation
  • Composition performance knowledge
  • Can come from empirical and analytical evaluation
  • Can utilize information provided at the component
    level
  • Can be stored in repositories for future review
  • Extends the notion of component observation to
    ensemble-level performance monitoring
  • Associate monitoring components with hierarchical
    component grouping
  • Build upon component-level observation support
  • Monitoring components act as performance
    integrators and routers
  • Use component framework mechanisms

38
Performance Databases
  • Focus on empirical performance optimization
    process
  • Necessary for multi-results performance analysis
  • Multiple experiments (codes, versions, platforms,
    ...)
  • Historical performance comparison
  • Integral component of performance analysis
    framework
  • Improved performance analysis architecture design
  • More flexible and open tool interfaces
  • Supports extensibility and foreign tool
    interaction
  • Performance analysis collaboration
  • Performance tool sharing
  • Performance data sharing and knowledge base

39
Empirical-Based Performance Optimization Process
40
TAU Performance Database Framework
  • profile data only
  • XML representation (PerfDML)
  • project / experiment / trial

41
PerfDBF Components
  • Performance Data Meta Language (PerfDML)
  • Common performance data representation
  • Performance meta-data description
  • Translators to common PerfDML data representation
  • Performance DataBase (PerfDB)
  • Standard database technology (SQL)
  • Free, robust database software (PostgreSQL)
  • Commonly available APIs
  • Performance DataBase Toolkit (PerfDBT)
  • Commonly used modules for query and analysis
  • Facilitate analysis tool development

42
Common and Extensible Profile Data Format
  • Goals
  • Capture data from profile tools in common
    representation
  • Implement representation in a standard format
  • Allow for extension of format for new profile
    data objects
  • Base on XML (obvious choice)
  • Leverage XML tools and APIs
  • XML parsers, Sun's Java SDK, ...
  • XML verification systems (DTD and schemas)
  • Target for profile data translation tools
  • eXtensible Stylesheet Language Transformations
    (XSLT)
  • Which performance profile data are of interest?
  • Focus on TAU and consider other profiling tools

43
Performance Profiling
  • Performance data about program entities and
    behaviors
  • Code regions functions, loops, basic blocks
  • Actions or states
  • Statistics data
  • Execution time, number of calls, number of FLOPS
    ...
  • Characterization data
  • Parallel profiles
  • Captured per process and/or per thread
  • Program-level summaries
  • Profiling tools
  • prof/gprof, ssrun, uprofile/dpci, cprof/vprof, ...

44
TAU Parallel Performance Profiles
45
PerfDBF Example
  • NAS Parallel Benchmark LU
  • configure -mpiinc=/usr/include
    -mpilib=/usr/lib64 -arch=sgi64 -fortran=sgi
    -SGITIMERS -useropt=-O2

[Data flow: NPB profiled with TAU → standard TAU output data →
TAU-to-XML converter → TAU XML format → database loader →
SQL database → analysis tool]
46
Scalability Analysis Process
  • Scalability study on LU
  • Vary number of processes 1, 2, 4, and 8
  • mpirun -np 1 lu.W1
  • mpirun -np 2 lu.W2
  • mpirun -np 4 lu.W4
  • mpirun -np 8 lu.W8
  • Populate the performance database
  • run Java translator to translate profiles into
    XML
  • run Java XML reader to write XML profiles to
    database
  • Read times for routines and program from
    experiments
  • Calculate scalability metrics
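The metric computed in the last step is just the ratio of one-process time to p-process mean time; e.g., from the applu inclusive times shown in the next slides (2.487E8 usec at 1 process, 5.169E7 usec mean at 4 processes), the speedup is about 4.81, matching the results table. A trivial sketch of the calculation (not the actual PerfDBT query module):

    // Mean speedup at p processes: speedup(p) = T(1) / T(p)
    double speedup(double t1_usec, double tp_usec) {
      // e.g. 2.487446665830078E8 / 5.169148940026855E7 = 4.812...
      return t1_usec / tp_usec;
    }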

47
Raw TAU Profile Data
  • Raw data output
  • One processor
  • "applu 1 15 2939.096923828125 248744666.5830078
    0 GROUP"applu
  • Four processors
  • "applu 1 15 2227.343994140625 51691412.17797852
    0 GROUP"applu
  • "applu 1 15 2227.343994140625 51691412.17797852
    0 GROUP"applu
  • "applu " 1 14 596.568115234375 51691519.34106445
    0 GROUP"applu
  • "applu " 1 14 616.833251953125 51691377.21313477
    0 GROUP"applu"

Fields, left to right: name, calls, subs, exclusive time, inclusive time, profile calls, group name
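A minimal sketch of reading one such line in the annotated field order (hypothetical helper; the actual framework uses a Java translator):

    #include <cstdio>

    // Parse: "name" calls subs exclusive inclusive profilecalls GROUP="group"
    bool parseProfileLine(const char* line) {
      char name[256], group[256];
      long calls, subs, profcalls;
      double excl, incl;
      int n = std::sscanf(line,
          " \"%255[^\"]\" %ld %ld %lf %lf %ld GROUP=\"%255[^\"]\"",
          name, &calls, &subs, &excl, &incl, &profcalls, group);
      return n == 7;   // all seven fields recovered
    }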
48
XML Profile Representation
  • One processor
    <instrumentedobj>
      <funcname>'applu '</funcname>
      <funcID>8</funcID>
      <inclperc>100.0</inclperc>
      <inclutime>2.487446665830078E8</inclutime>
      <exclperc>0.0</exclperc>
      <exclutime>2939.096923828125</exclutime>
      <call>1</call>
      <subrs>15</subrs>
      <inclutimePcall>2.487446665830078E8</inclutimePcall>
    </instrumentedobj>

49
XML Representation
  • Four processor mean
    <meanfunction>
      <funcname>'applu '</funcname>
      <funcID>12</funcID>
      <inclperc>100.0</inclperc>
      <inclutime>5.169148940026855E7</inclutime>
      <exclperc>0.0</exclperc>
      <exclutime>1044.487548828125</exclutime>
      <call>1</call>
      <subrs>14.25</subrs>
      <inclutimePcall>5.1691489E7</inclutimePcall>
    </meanfunction>

50
Contents of Performance Database
51
Scalability Analysis Results
  • Scalability of LU performance experiments
  • Four trial runs
    funname   processors   meanspeedup
    ...
    applu     2            2.0896117809566
    applu     4            4.812100975788783
    applu     8            8.168409581149514
    exact     2            1.95853126762839071803
    exact     4            4.03622321124616535446
    exact     8            7.193812137750623668346

52
Current PerfDBF Status and Future
  • PerfDBF prototype
  • TAU profile to XML translator
  • XML to PerfDB populator
  • PostgreSQL database
  • Java-based PostgreSQL query module
  • Use as a layer to support performance analysis
    tools
  • Make accessing the Performance Database quicker
  • Continue development
  • XML parallel profile representation
  • Basic specification
  • Opportunity for APART to define a common format

53
Performance Tracking and Reporting
  • Integrated performance measurement allows
    performance analysis throughout development
    lifetime
  • Applied performance engineering in software
    design and development (software engineering)
    process
  • Create performance portfolio from regular
    performance experimentation (couple with software
    testing)
  • Use performance knowledge in making key software
    design decisions, prior to major development
    stages
  • Use performance benchmarking and regression
    testing to identify irregularities
  • Support automatic reporting of performance bugs
  • Enable cross-platform (cross-generation)
    evaluation

54
XPARE - eXPeriment Alerting and REporting
  • Experiment launcher automates measurement /
    analysis
  • Configuration and compilation of performance
    tools
  • Instrumentation control for Uintah experiment
    type
  • Execution of multiple performance experiments
  • Performance data collection, analysis, and
    storage
  • Integrated in Uintah software testing harness
  • Reporting system conducts performance regression
    tests
  • Apply performance difference thresholds (alert
    ruleset)
  • Alerts users via email if thresholds have been
    exceeded
  • Web alerting setup and full performance data
    reporting
  • Historical performance data analysis

55
XPARE System Architecture
[Architecture diagram: Experiment Launch, Performance Database,
Performance Reporter, Comparison Tool, Regression Analyzer,
Alerting Setup]
56
Concluding Remarks
  • Complex software and parallel computing systems
    pose challenging performance analysis problems
    that require robust methodologies and tools
  • To build more sophisticated performance tools,
    existing proven performance technology must be
    utilized
  • Performance tools must be integrated with
    software and systems models and technology
  • Performance engineered software
  • Function consistently and coherently in software
    and system environments
  • TAU performance system offers robust performance
    technology that can be broadly integrated
