1. Recent Advances in the TAU Performance System
- Allen D. Malony, Sameer Shende
- {malony, shende}@cs.uoregon.edu
- Department of Computer and Information Science
- Computational Science Institute
- University of Oregon
2. Outline
- Complexity and performance technology
- What is the TAU performance system?
- Problems currently being investigated
- Instrumentation control and selection
- Performance mapping and callpath profiling
- Online performance analysis and visualization
- Performance analysis for component software
- Performance database framework
- Concluding remarks
3. Complexity in Parallel and Distributed Systems
- Complexity in computing system architecture
- Diverse parallel and distributed system architectures: shared / distributed memory, cluster, hybrid, NOW, Grid, ...
- Sophisticated processor / memory / network architectures
- Complexity in parallel software environment
- Diverse parallel programming paradigms
- Optimizing compilers and sophisticated runtime systems
- Advanced numerical libraries and application frameworks
- Hierarchical, multi-level software architectures
- Multi-component, coupled simulation models
4. Complexity Determines Performance Requirements
- Performance observability requirements
- Multiple levels of software and hardware
- Different types and detail of performance data
- Alternative performance problem solving methods
- Multiple targets of software and system application
- Performance technology requirements
- Broad scope of performance observation
- Flexible and configurable mechanisms
- Technology integration and extension
- Cross-platform portability
- Open, layered, and modular framework architecture
5. Complexity Challenges for Performance Tools
- Computing system environment complexity
- Observation integration and optimization
- Access, accuracy, and granularity constraints
- Diverse/specialized observation capabilities/technology
- Restricted modes limit performance problem solving
- Sophisticated software development environments
- Programming paradigms and performance models
- Performance data mapping to software abstractions
- Uniformity of performance abstraction across platforms
- Rich observation capabilities and flexible configuration
- Common performance problem solving methods
6. General Problems (Performance Technology)
- How do we create robust and ubiquitous performance technology for the analysis and tuning of parallel and distributed software and systems in the presence of (evolving) complexity challenges?
- How do we apply performance technology effectively for the variety and diversity of performance problems that arise in the context of complex parallel and distributed computer systems?
7. TAU Performance System Framework
- Tuning and Analysis Utilities (aka "Tools Are Us")
- Performance system framework for scalable parallel and distributed high-performance computing
- Targets a general complex system computation model: nodes / contexts / threads
- Multi-level: system / software / parallelism
- Measurement and analysis abstraction
- Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
- Portable performance profiling/tracing facility
- Open software approach
8. TAU Performance System Architecture
- [Architecture diagram; interfaces to Paraver and EPILOG trace tools shown]
9. Instrumentation Control and Selection
- Selection of which performance events to observe
- Could depend on scope, type, level of interest
- Could depend on instrumentation overhead
- How is selection supported in the instrumentation system?
- No choice
- Include / exclude lists (TAU)
- Environment variables
- Static vs. dynamic
- Problem: controlling instrumentation of small routines
- High relative measurement overhead
- Significant intrusion and possible perturbation
10. Rule-Based Overhead Analysis (N. Trebon, UO)
- Analyze the performance data to determine events with high (relative) measurement overhead
- Create a select list for excluding those events
- Rule grammar (used in TAUreduce tool):
- GroupName:Field Operator Number
- GroupName indicates the rule applies to events in that group
- Field is an event metric attribute (from profile statistics): numcalls, numsubs, percent, usec, cumusec, totalcount, stdev, usecs/call, counts/call
- Operator is one of gt (>), lt (<), or = (equality)
- Number is any number
- Compound rules possible using & between simple rules
11. Example Rules
- Exclude all events that are members of TAU_USER and use less than 1000 microseconds:
  TAU_USER:usec lt 1000
- Exclude all events that use less than 1000 microseconds and are called only once:
  usec lt 1000 & numcalls = 1
- Exclude all events that have less than 1000 usecs per call OR have a (total inclusive) percent less than 5 (rules on separate lines are OR'd):
  usecs/call lt 1000
  percent lt 5
- Scientific notation can be used
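As a rough sketch of how such exclusion rules might be evaluated (a simplified illustration, not the actual TAUreduce code; `EventStats`, `field`, and `matches_rule` are invented names, and only a subset of the fields and operators is handled):

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical per-event profile statistics (a subset of TAU's fields).
struct EventStats {
    double numcalls;
    double usec;     // exclusive time in microseconds
    double percent;  // inclusive percent of total time
};

// Fetch a named field from the statistics record.
static double field(const EventStats& e, const std::string& name) {
    if (name == "numcalls")   return e.numcalls;
    if (name == "usec")       return e.usec;
    if (name == "percent")    return e.percent;
    if (name == "usecs/call") return e.numcalls ? e.usec / e.numcalls : 0.0;
    return 0.0;
}

// Evaluate one compound rule, e.g. "usec lt 1000 & numcalls eq 1".
// All '&'-joined clauses must hold for the event to match (be excluded).
bool matches_rule(const EventStats& e, const std::string& rule) {
    std::stringstream ss(rule);
    std::string f, op, amp;
    double num;
    while (ss >> f >> op >> num) {
        double v = field(e, f);
        bool ok = (op == "lt") ? (v < num)
                : (op == "gt") ? (v > num)
                :                (v == num);  // treat anything else as equality
        if (!ok) return false;
        if (!(ss >> amp)) break;              // consume '&' between clauses
    }
    return true;
}
```

Since each rule in a select list is OR'd, an event matching any single rule would be excluded from instrumentation.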
12. TAUreduce Example
- tau_reduce implements overhead reduction in TAU
- Consider the klargest example
- Find the kth largest element among N elements
- Compare two methods: quicksort, select_kth_largest
- Testcase: i = 2324, N = 1000000 (uninstrumented)
- quicksort (wall clock): 0.188511 secs
- select_kth_largest (wall clock): 0.149594 secs
- Total (P3/1.2GHz time): 0.340u 0.020s 0:00.37
- Execute with all routines instrumented
- Execute with rule-based selective instrumentation:
  usec gt 1000 & numcalls gt 400000 & usecs/call lt 30 & percent gt 25
13. Simple Sorting Example on One Processor
Before selective instrumentation reduction:

  NODE 0; CONTEXT 0; THREAD 0
  ---------------------------------------------------------------------------------------
  %Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
                msec   total msec                          usec/call
  ---------------------------------------------------------------------------------------
  100.0           13        4,982           1           4    4982030 int main
   93.5        3,223        4,659  4.20241E06  1.40268E07          1 void quicksort
   62.9      0.00481        3,134           5           5     626839 int kth_largest_qs
   36.4          137        1,813          28      450057      64769 int select_kth_largest
   33.6          150        1,675      449978      449978          4 void sort_5elements
   28.8        1,435        1,435  1.02744E07           0          0 void interchange
    0.4           20           20           1           0      20668 void setup
    0.0       0.0118       0.0118          49           0          0 int ceil

After selective instrumentation reduction:

  NODE 0; CONTEXT 0; THREAD 0
  ---------------------------------------------------------------------------------------
  %Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
                msec   total msec                          usec/call
  ---------------------------------------------------------------------------------------
  100.0           14          383           1           4     383333 int main
   50.9          195          195           5           0      39017 int kth_largest_qs
   40.0          153          153          28          79       5478 int select_kth_largest
    5.4           20           20           1           0      20611 void setup
    0.0         0.02         0.02          49           0          0 int ceil
14. Performance Mapping
- Associate performance with significant entities (events)
- Source code points are important: functions, regions, control flow events, user events
- Execution process and thread entities are important
- Some entities are more abstract, harder to measure
- Consider callgraph (callpath) profiling
- Measure time (metric) along an edge (path) of the callgraph
- Incident edge gives parent / child view
- Edge sequence (path) gives parent / descendant view
- Problem: callpath profiling when the callgraph is unknown
- Determine the callgraph dynamically at runtime
- Map performance measurement to dynamic callpath state
15. Callgraph (Callpath) Profiling
- 0-level callpath: a callgraph node (e.g., A)
- 1-level callpath: an immediate descendant (e.g., A->B, E->I, D->H, C->H)
- k-level callpath (k > 1): a k-call descendant
- 2-level: A->D, C->I, A->I
- 3-level: A->H
- [Callgraph figure with highlighted paths]
16. 1-Level Callpath Profiling in TAU (S. Shende, UO)
- TAU maintains a performance event (routine) callstack
- Profiled routine (child) looks in the callstack for its parent
- Previous profiled performance event is the parent
- A callpath profile structure is created the first time the parent calls the child
- TAU records the parent in a callgraph map for the child
- String representing the 1-level callpath is used as its key
- "a( ) => b( )" names the time spent in b when called by a
- Map returns a pointer to the callpath profile structure
- 1-level callpath is profiled using this profiling data
- Builds upon TAU's performance mapping technology
- Measurement is independent of instrumentation
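The scheme above can be sketched with an explicit callstack and a map keyed by a "parent => child" string (a simplified illustration, not TAU's actual implementation; `CallpathProfiler` and `PathProfile` are invented names, and elapsed time is passed in rather than measured):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Accumulated data for one 0-level event or 1-level callpath.
struct PathProfile {
    double inclusive_usec = 0.0;
    long   calls = 0;
};

class CallpathProfiler {
    std::vector<std::string> callstack_;           // currently active events
    std::map<std::string, PathProfile> profiles_;  // keyed by "a" or "a => b"
public:
    // On routine entry: the top of the callstack is the parent; the first
    // call from that parent creates the 1-level callpath entry in the map.
    void enter(const std::string& name) {
        if (!callstack_.empty())
            profiles_[callstack_.back() + " => " + name].calls++;
        profiles_[name].calls++;
        callstack_.push_back(name);
    }
    // On routine exit: charge elapsed time to both the 0-level event and
    // the 1-level callpath, then pop the callstack.
    void exit(const std::string& name, double elapsed_usec) {
        callstack_.pop_back();
        profiles_[name].inclusive_usec += elapsed_usec;
        if (!callstack_.empty())
            profiles_[callstack_.back() + " => " + name].inclusive_usec += elapsed_usec;
    }
    const PathProfile& get(const std::string& key) { return profiles_[key]; }
};
```

After `enter("a"); enter("b"); exit("b", t); exit("a", T);` the map holds separate entries for `a`, `b`, and `a => b`, so time in `b` when called by `a` is distinguished from time in `b` under any other parent.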
17. Callpath Profiling Example (NAS LU v2.3)
- configure -PROFILECALLPATH -SGITIMERS -arch=sgi64 -mpiinc=/usr/include -mpilib=/usr/lib64 -useropt=-O2
18. Callpath Parallel Profile Display
- 0-level and 1-level callpath grouping
- [Screenshots: 1-level callpath and 0-level callpath profile displays]
19. Performance Monitoring and Steering
- Desirable to monitor performance during execution
- Long-running applications
- Steering computations for improved performance
- Large-scale parallel applications complicate solutions
- More parallel threads of execution producing data
- Large amount of performance data (relative) to access
- Analysis and visualization more difficult
- Problem: online performance data access and analysis
- Incremental profile sampling (based on files)
- Integration in a computational steering system
- Dynamic performance measurement and access
20. Online Performance Analysis (K. Li, UO)
21. 2D Field Performance Visualization in SCIRun
- [Screenshot: SCIRun program]
22. Uintah Computational Framework (UCF)
- University of Utah
- UCF analysis
- Scheduling
- MPI library
- Components
- 500 processes
- Use for online and offline visualization
- Apply SCIRun steering
23. Performance Analysis of Component Software
- Complexity in scientific problem solving addressed by:
- Advances in software development environments
- Rich layered software middleware and libraries
- Increases complexity in performance problem solving
- Integration barriers for performance technology
- Incompatible with advanced software technology
- Inconsistent with software engineering process
- Problem: performance engineering for component systems
- Respect software development methodology
- Leverage software implementation technology
- Look for opportunities for synergy and optimization
24. Focus on Component Technology and CCA
- Emerging component technology for HPC and Grid
- Component: software object embedding functionality
- Component architecture (CA): how components connect
- Component framework: implements a CA
- Common Component Architecture (CCA)
- Standard foundation for scientific component architecture
- Component descriptions: Scientific Interface Description Language (SIDL)
- CCA ports for component interactions (provides and uses)
- CCA services: directory, registry, connection, event
- High-performance components and interactions
25. Extend Component Design for Performance
- [Diagram: generic component]
- Compliant with component architecture
- Component composition performance engineering
- Utilize technology and services of the component framework
26. Performance Knowledge
- Describe and store known component performance
- Benchmark characterizations in a performance database
- Models of performance
- Empirical-based
- Simulation-based
- Analytical-based
- Saved information about component performance
- Use for performance-guided selection and deployment
- Use for runtime adaptation
- Representation must be in common forms with standard means for accessing the performance information
27. Performance Knowledge Repository Component
- Component performance repository
- Implement in component architecture framework
- Similar to CCA component repository
- Access by component infrastructure
- View performance knowledge as a component (PKC)
- PKC ports give access to performance knowledge
- To other components and back to the original component
- Static/dynamic component control and composition
- Component composition performance knowledge
28. Performance Observation
- Ability to observe execution performance is important
- Empirically-derived performance knowledge requires it
- Does not require measurement integration in the component
- Monitor during execution to make dynamic decisions
- Measurement integration is key
- Performance observation integration
- Component integration: core and variant
- Runtime measurement and data collection
- On-line and off-line performance analysis
- Performance observation technology must be as portable and robust as component software
29. Performance Observation Component (POC)
- Performance observation in a performance-engineered component model
- Functional extension of the original component design
- Include new component methods and ports for other components to access measured performance data
- Allow the original component to access performance data
- Encapsulate as a tightly-coupled and co-resident performance observation object
- POC provides a port to allow use of optimized interfaces to access "internal" performance observations
30. Architecture of a Performance Component
- Each component advertises its services
- Performance component
- Timer (start/stop)
- Event (trigger)
- Query (timers)
- Knowledge (component performance model)
- Prototype implementation of timer
- CCAFFEINE reference framework
- http://www.cca-forum.org/ccafe.html
- SIDL
- Instantiate with TAU functionality
31. TimerPort Interface Declaration in CCAFFEINE
- Create Timer port abstraction

  namespace performance {
    namespace ccaports {
      /*
       * This abstract class declares the Timer interface.
       * Inherit from this class to provide functionality.
       */
      class Timer :                     /* implementation of port */
          public virtual gov::cca::Port /* inherits from port spec */
      {
      public:
        virtual ~Timer() {}
        /*
         * Start the Timer. Implement this function in
         * a derived class to provide required functionality.
         */
        virtual void start(void) = 0;   /* virtual methods with */
        virtual void stop(void) = 0;    /* null implementations */
        ...
      };
    }
  }
32. Using Performance Component Timer
- Component uses framework services to get TimerPort
- Use of this TimerPort interface is independent of TAU

  // Get Timer port from CCA framework services from CCAFFEINE
  port = frameworkServices->getPort("TimerPort");
  if (port) {
    timer_m = dynamic_cast<performance::ccaports::Timer *>(port);
    if (timer_m == 0) {
      cerr << "Connected to something, not a Timer port" << endl;
      return -1;
    }
  }
  string s = "IntegrateTimer";  // give name for timer
  timer_m->setName(s);          // assign name to timer
  timer_m->start();             // start timer (independent of tool)
  for (int i = 0; i < count; i++) {
    double x = random_m->getRandomNumber();
    sum = sum + function_m->evaluate(x);
  }
  timer_m->stop();              // stop timer
33. Using SIDL for Language Interoperability
- Can create Timer interface in SIDL for creating stubs

  //
  // File: performance.sidl
  //
  version performance 1.0;
  package performance {
    class Timer {
      void start();
      void stop();
      void setName(in string name);
      string getName();
      void setType(in string name);
      string getType();
      void setGroupName(in string name);
      string getGroupName();
      void setGroupId(in long group);
      long getGroupId();
    }
  }
34. Using SIDL Interface for Timers
- C++ program that uses the SIDL Timer interface
- Again, independent of timer implementations (e.g., TAU)

  // SIDL
  #include "performance_Timer.hh"
  int main(int argc, char* argv[])
  {
    performance::Timer t = performance::Timer::_create();
    ...
    t.setName("Integrate timer");
    t.start();
    // Computation
    for (int i = 0; i < count; i++) {
      double x = random_m->getRandomNumber();
      sum = sum + function_m->evaluate(x);
    }
    ...
    t.stop();
    return 0;
  }
35. Using TAU Component in CCAFFEINE

  repository get TauTimer                   /* get TAU component from repository */
  repository get Driver                     /* get application components */
  repository get MidpointIntegrator
  repository get MonteCarloIntegrator
  repository get RandomGenerator
  repository get LinearFunction
  repository get NonlinearFunction
  repository get PiFunction
  create LinearFunction lin_func            /* create component instances */
  create NonlinearFunction nonlin_func
  create PiFunction pi_func
  create MonteCarloIntegrator mc_integrator
  create RandomGenerator rand
  create TauTimer tau                       /* create TAU component instance */
  /* connecting components and running */
  connect mc_integrator RandomGeneratorPort rand RandomGeneratorPort
  connect mc_integrator FunctionPort nonlin_func FunctionPort
36. Component Composition Performance Engineering
- Performance of component-based scientific applications depends on the interplay of:
- Component functions
- Computational resources available
- Management of component compositions throughout execution is critical to successful deployment and use
- Identify key technological capabilities needed to support the performance engineering of component compositions
- Two model concepts:
- Performance awareness
- Performance attention
37. Performance Awareness of Component Ensembles
- Composition performance knowledge and observation
- Composition performance knowledge
- Can come from empirical and analytical evaluation
- Can utilize information provided at the component level
- Can be stored in repositories for future review
- Extends the notion of component observation to ensemble-level performance monitoring
- Associate monitoring components with hierarchical component grouping
- Build upon component-level observation support
- Monitoring components act as performance integrators and routers
- Use component framework mechanisms
38. Performance Databases
- Focus on empirical performance optimization process
- Necessary for multi-results performance analysis
- Multiple experiments (codes, versions, platforms, ...)
- Historical performance comparison
- Integral component of performance analysis framework
- Improved performance analysis architecture design
- More flexible and open tool interfaces
- Supports extensibility and foreign tool interaction
- Performance analysis collaboration
- Performance tool sharing
- Performance data sharing and knowledge base
39. Empirical-Based Performance Optimization Process
40. TAU Performance Database Framework
- Profile data only
- XML representation (PerfDML)
- Project / experiment / trial
41. PerfDBF Components
- Performance Data Meta Language (PerfDML)
- Common performance data representation
- Performance meta-data description
- Translators to common PerfDML data representation
- Performance DataBase (PerfDB)
- Standard database technology (SQL)
- Free, robust database software (PostgreSQL)
- Commonly available APIs
- Performance DataBase Toolkit (PerfDBT)
- Commonly used modules for query and analysis
- Facilitate analysis tool development
42. Common and Extensible Profile Data Format
- Goals:
- Capture data from profile tools in a common representation
- Implement representation in a standard format
- Allow for extension of the format for new profile data objects
- Base on XML (obvious choice)
- Leverage XML tools and APIs
- XML parsers, Sun's Java SDK, ...
- XML verification systems (DTD and schemas)
- Target for profile data translation tools
- eXtensible Stylesheet Language Transformations (XSLT)
- Which performance profile data are of interest?
- Focus on TAU and consider other profiling tools
43. Performance Profiling
- Performance data about program entities and behaviors
- Code regions: functions, loops, basic blocks
- Actions or states
- Statistics data
- Execution time, number of calls, number of FLOPS, ...
- Characterization data
- Parallel profiles
- Captured per process and/or per thread
- Program-level summaries
- Profiling tools
- prof/gprof, ssrun, uprofile/dpci, cprof/vprof, ...
44. TAU Parallel Performance Profiles
45. PerfDBF Example
- NAS Parallel Benchmark LU
- configure -mpiinc=/usr/include -mpilib=/usr/lib64 -arch=sgi64 -fortran=sgi -SGITIMERS -useropt=-O2
- [Data flow diagram: NPB profiled with TAU, standard TAU output data, TAU to XML converter, TAU XML format, database loader, SQL database, analysis tool]
46. Scalability Analysis Process
- Scalability study on LU
- Vary number of processes: 1, 2, 4, and 8
- mpirun -np 1 lu.W.1
- mpirun -np 2 lu.W.2
- mpirun -np 4 lu.W.4
- mpirun -np 8 lu.W.8
- Populate the performance database
- Run Java translator to translate profiles into XML
- Run Java XML reader to write XML profiles to the database
- Read times for routines and program from experiments
- Calculate scalability metrics
47. Raw TAU Profile Data
- Raw data output
- One processor:
  "applu " 1 15 2939.096923828125 248744666.5830078 0 GROUP="applu"
- Four processors:
  "applu " 1 15 2227.343994140625 51691412.17797852 0 GROUP="applu"
  "applu " 1 15 2227.343994140625 51691412.17797852 0 GROUP="applu"
  "applu " 1 14 596.568115234375 51691519.34106445 0 GROUP="applu"
  "applu " 1 14 616.833251953125 51691377.21313477 0 GROUP="applu"
- Fields (left to right): name, calls, subs, exclusive time, inclusive time, profile calls, group name
48. XML Profile Representation
- One processor:

  <instrumentedobj>
    <funcname>'applu '</funcname>
    <funcID>8</funcID>
    <inclperc>100.0</inclperc>
    <inclutime>2.487446665830078E8</inclutime>
    <exclperc>0.0</exclperc>
    <exclutime>2939.096923828125</exclutime>
    <call>1</call>
    <subrs>15</subrs>
    <inclutimePcall>2.487446665830078E8</inclutimePcall>
  </instrumentedobj>
49. XML Representation
- Four processor mean:

  <meanfunction>
    <funcname>'applu '</funcname>
    <funcID>12</funcID>
    <inclperc>100.0</inclperc>
    <inclutime>5.169148940026855E7</inclutime>
    <exclperc>0.0</exclperc>
    <exclutime>1044.487548828125</exclutime>
    <call>1</call>
    <subrs>14.25</subrs>
    <inclutimePcall>5.1691489E7</inclutimePcall>
  </meanfunction>
50. Contents of Performance Database
51. Scalability Analysis Results
- Scalability of LU performance experiments
- Four trial runs

  Funcname  Processors  Mean speedup
  ...
  applu     2           2.0896117809566
  applu     4           4.812100975788783
  applu     8           8.168409581149514
  ...
  exact     2           1.95853126762839071803
  exact     4           4.03622321124616535446
  exact     8           7.193812137750623668346
52. Current PerfDBF Status and Future
- PerfDBF prototype
- TAU profile to XML translator
- XML to PerfDB populator
- PostgreSQL database
- Java-based PostgreSQL query module
- Use as a layer to support performance analysis tools
- Make accessing the performance database quicker
- Continue development
- XML parallel profile representation
- Basic specification
- Opportunity for APART to define a common format
53. Performance Tracking and Reporting
- Integrated performance measurement allows performance analysis throughout the development lifetime
- Applied performance engineering in the software design and development (software engineering) process
- Create a performance portfolio from regular performance experimentation (coupled with software testing)
- Use performance knowledge in making key software design decisions, prior to major development stages
- Use performance benchmarking and regression testing to identify irregularities
- Support automatic reporting of performance bugs
- Enable cross-platform (cross-generation) evaluation
54. XPARE - eXPeriment Alerting and REporting
- Experiment launcher automates measurement / analysis
- Configuration and compilation of performance tools
- Instrumentation control for Uintah experiment type
- Execution of multiple performance experiments
- Performance data collection, analysis, and storage
- Integrated in Uintah software testing harness
- Reporting system conducts performance regression tests
- Apply performance difference thresholds (alert ruleset)
- Alerts users via email if thresholds have been exceeded
- Web alerting setup and full performance data reporting
- Historical performance data analysis
55. XPARE System Architecture
- [Diagram: experiment launch, performance database, regression analyzer, comparison tool, performance reporter, alerting setup]
56. Concluding Remarks
- Complex software and parallel computing systems pose challenging performance analysis problems that require robust methodologies and tools
- To build more sophisticated performance tools, existing proven performance technology must be utilized
- Performance tools must be integrated with software and systems models and technology
- Performance-engineered software should function consistently and coherently in software and system environments
- The TAU performance system offers robust performance technology that can be broadly integrated