1 Parallel Performance Mapping, Diagnosis, and Data Mining
- Allen D. Malony, Sameer Shende, Li Li, Kevin Huck
- malony,sameer,lili,khuck_at_cs.uoregon.edu
- Department of Computer and Information Science
- Performance Research Laboratory
- University of Oregon
2 Research Motivation
- Tools for performance problem solving
- Empirical-based performance optimization process
- Performance technology concerns
3 Challenges in Performance Problem Solving
- How to make the process more effective (productive)?
- Process may depend on scale of parallel system
- What are the important events and performance metrics?
- Tied to application structure and computational model
- Tied to application domain and algorithms
- Process and tools can/must be more application-aware
- Tools have poor support for application-specific aspects
- What are the significant issues that will affect the technology used to support the process?
- Enhance application development and benchmarking
- New paradigm in performance process and technology
4 Large Scale Performance Problem Solving
- How does our view of this process change when we consider very large-scale parallel systems?
- What are the significant issues that will affect the technology used to support the process?
- Parallel performance observation is clearly needed
- In general, there is the concern for intrusion
- Seen as a tradeoff with performance diagnosis accuracy
- Scaling complicates observation and analysis
- Performance data size becomes a concern
- Analysis complexity increases
- Nature of application development may change
5 Role of Intelligence, Automation, and Knowledge
- Scale forces the process to become more intelligent
- Even with intelligent and application-specific tools, deciding what to analyze is difficult and intractable
- More automation and knowledge-based decision making
- Build automatic/autonomic capabilities into the tools
- Support broader experimentation methods and refinement
- Access and correlate data from several sources
- Automate performance data analysis / mining / learning
- Include predictive features and experiment refinement
- Knowledge-driven adaptation and optimization guidance
- Will allow scalability issues to be addressed in context
6 Outline of Talk
- Performance problem solving
- Scalability, productivity, and performance technology
- Application-specific and autonomic performance tools
- TAU parallel performance system (Bernd said No!)
- Parallel performance mapping
- Performance data management and data mining
- Performance Data Management Framework (PerfDMF)
- PerfExplorer
- Model-based parallel performance diagnosis
- Poirot and Hercule
- Conclusions
7 TAU Performance System
8 Semantics-Based Performance Mapping
- Associate performance measurements with high-level semantic abstractions
- Need mapping support in the performance measurement system to assign data correctly
9 Hypothetical Mapping Example
- Particles distributed on surfaces of a cube

Particle P[MAX];  /* Array of particles */
int GenerateParticles() {
  /* distribute particles over all faces of the cube */
  for (int face = 0, last = 0; face < 6; face++) {
    /* particles on this face */
    int particles_on_this_face = num(face);
    for (int i = last; i < particles_on_this_face; i++) {
      /* particle properties are a function of face */
      P[i] = ... f(face) ...;
    }
    last += particles_on_this_face;
  }
}
10 Hypothetical Mapping Example (continued)

int ProcessParticle(Particle p) {
  /* perform some computation on p */
}
int main() {
  GenerateParticles();  /* create a list of particles */
  for (int i = 0; i < N; i++)
    /* iterates over the list */
    ProcessParticle(P[i]);
}

(diagram: generated particles grouped into work packets and fed to a processing engine)

- How much time (flops) spent processing face i particles?
- What is the distribution of performance among faces?
- How is this determined if execution is parallel? (a mapping sketch follows below)
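To make the mapping idea concrete, here is a minimal sketch (not from the original slides) of how the particle code above could be instrumented with the TAU mapping macros shown later in the UCF example. The face_of[] array, the stub bodies for num(), f(), and ProcessParticle(), and the timer names are assumptions introduced for illustration; the point is the pattern of creating one mapped timer per face and charging each particle's processing time to its face through an external association keyed by the face index.

#include <TAU.h>
#include <cstdio>

struct Particle { double x, y, z; };             /* assumed particle type       */
const int MAX = 60000;
Particle P[MAX];                                  /* array of particles          */
int face_of[MAX];                                 /* face that produced P[i]     */

int num(int)                    { return 100; }          /* assumed stub         */
Particle f(int)                 { return Particle{}; }   /* assumed stub         */
void ProcessParticle(Particle&) { /* some computation */ }

int GenerateParticles() {
  int last = 0;
  for (int face = 0; face < 6; face++) {
    char name[32];
    std::sprintf(name, "face %d", face);
    /* one mapped timer per face; the face index (+1) serves as the external key */
    TAU_MAPPING_CREATE(name, "GenerateParticles", (TauGroup_t)(face + 1), name, 0);
    int particles_on_this_face = num(face);
    for (int i = last; i < last + particles_on_this_face; i++) {
      P[i] = f(face);                             /* properties depend on face   */
      face_of[i] = face;
    }
    last += particles_on_this_face;
  }
  return last;                                    /* total particles generated   */
}

int main(int argc, char** argv) {
  TAU_PROFILE_INIT(argc, argv);
  int total = GenerateParticles();
  for (int i = 0; i < total; i++) {
    /* look up the timer associated with this particle's face and time the work */
    TAU_MAPPING_OBJECT(facetimer);
    TAU_MAPPING_LINK(facetimer, (TauGroup_t)(face_of[i] + 1));
    TAU_MAPPING_PROFILE_TIMER(faceprofiler, facetimer, 0);
    TAU_MAPPING_PROFILE_START(faceprofiler, 0);
    ProcessParticle(P[i]);
    TAU_MAPPING_PROFILE_STOP(0);
  }
  return 0;
}

The resulting profile then reports time per face ("face 0" ... "face 5") rather than only per routine, which is what the distribution questions above ask for.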
11 No Performance Mapping versus Mapping
- Typical performance tools report performance with respect to routines
- Does not provide support for mapping
- TAU's performance mapping can observe performance with respect to the scientist's programming and problem abstractions

(profile screenshots: TAU with mapping vs. TAU without mapping)
12 Performance Mapping Approaches
- ParaMap (Miller and Irvin)
- Low-level performance to high-level source constructs
- Noun-Verb (NV) model to describe the mapping
- a noun is a program entity
- a verb represents an action performed on a noun
- sentences (nouns and verbs) map to other sentences
- Mappings: static, dynamic, set of active sentences (SAS)
- Semantic Entities / Abstractions / Associations (SEAA)
- Entities defined at any level of abstraction (user-level)
- Attribute entity with semantic information
- Entity-to-entity associations
- Target measurement layer and asynchronous operation
13 SEAA Implementation
- Two association types (implemented in TAU API)
- Embedded: extends the associated object to store the performance measurement entity
- External: creates an external look-up table using the address of the object as the key to locate the performance measurement entity
- Implemented in TAU API
- Applied to performance measurement problems
- callpath/phase profiling, C++ templates, ... (a conceptual sketch of the two association types follows below)
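As a rough, conceptual illustration of the two association types (not TAU's internal data structures), the fragment below contrasts an object extended to carry its own measurement entity with an external table keyed by the object's address; the Timer and Task types and all names are illustrative only.

#include <string>
#include <unordered_map>

struct Timer { std::string name; double inclusive_time = 0.0; };   /* illustrative */

/* Embedded association: the associated object is extended to store
   its performance measurement entity directly. */
struct TaskEmbedded {
  std::string name;
  Timer* timer = nullptr;        /* extra field added for measurement */
};

/* External association: an external look-up table, keyed by the address of
   the object, locates the measurement entity; the object itself is unchanged. */
struct TaskPlain { std::string name; };

std::unordered_map<const void*, Timer*> external_table;

Timer* timerFor(const TaskPlain* t) {
  auto it = external_table.find(t);
  return it == external_table.end() ? nullptr : it->second;
}

The embedded form avoids a look-up at measurement time but requires changing the application type; the external form (used in the UCF example on slide 17) leaves application objects untouched.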
14 Uintah Problem Solving Environment (PSE)
- Uintah component architecture for Utah C-SAFE project
- Application programmers provide
- description of computation (tasks and variables)
- code to perform a task on a single patch (sub-region of space)
- Components for scheduling, partitioning, load balancing, ...
- Uintah Computational Framework (UCF)
- Execution model based on software (macro) dataflow
- computations expressed as directed acyclic graphs of tasks
- inputs/outputs specified for each patch in a structured grid
- Abstraction of global single-assignment memory
- Task graph gets mapped to processing resources
- Communication schedule approximates the global optimum (a hypothetical task sketch follows below)
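To picture the macro-dataflow model described above, here is a hypothetical sketch of a task description: the variables it requires, the variables it computes for each patch, and the per-patch callback that the scheduler invokes. All class, method, and variable names here are illustrative, not the verbatim Uintah/UCF API.

#include <functional>
#include <string>
#include <vector>

struct Patch;          /* sub-region of the structured grid            */
struct DataWarehouse;  /* global single-assignment memory abstraction  */

struct Task {                                   /* illustrative stand-in for a UCF task */
  std::string name;
  std::vector<std::string> requires_vars;       /* inputs needed on each patch          */
  std::vector<std::string> computes_vars;       /* outputs produced on each patch       */
  std::function<void(const Patch&, DataWarehouse&, DataWarehouse&)> doit;
};

/* An application component contributes task descriptions like this one; the
   framework assembles them into a task graph and maps it to processors. */
Task describeInterpolateTask() {
  Task t;
  t.name          = "MPM::interpolateParticlesToGrid";          /* example name */
  t.requires_vars = { "particle.position", "particle.mass" };
  t.computes_vars = { "grid.mass", "grid.velocity" };
  t.doit = [](const Patch&, DataWarehouse& /*old_dw*/, DataWarehouse& /*new_dw*/) {
    /* per-patch computation written by the application programmer */
  };
  return t;
}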
15 Uintah Task Graph (Material Point Method)
- Diagram of named tasks (ovals) and data (edges)
- Imminent computation
- Dataflow-constrained
- MPM
- Newtonian material point motion time step
- Solid: values defined at material point (particle)
- Dashed: values defined at vertex (grid)
- Prime ('): values updated during time step
16 Task Execution in Uintah Parallel Scheduler
- Profile methods and functions in scheduler and in MPI library
Task execution time dominates (what task?)
Task execution time distribution
MPI communication overheads (where?)
- Need to map performance data!
17 Mapping Instrumentation in UCF (example)
- Use TAU performance mapping API

void MPIScheduler::execute(const ProcessorGroup* pc,
                           DataWarehouseP& old_dw,
                           DataWarehouseP& dw)
{
  ...
  TAU_MAPPING_CREATE(task->getName(), "MPIScheduler::execute()",
                     (TauGroup_t)(void*)task->getName(),
                     task->getName(), 0);
  ...
  TAU_MAPPING_OBJECT(tautimer);
  TAU_MAPPING_LINK(tautimer, (TauGroup_t)(void*)task->getName());  // EXTERNAL ASSOCIATION
  ...
  TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0);
  TAU_MAPPING_PROFILE_START(doitprofiler, 0);
  task->doit(pc);
  TAU_MAPPING_PROFILE_STOP(0);
  ...
}
18 Task Performance Mapping (Profile)
Mapped task performance across processes
Performance mapping for different tasks
19 Work Packet to Task Mapping (Trace)
Work packet computation events colored by task type
Distinct phases of computation can be identified based on task
20 Comparing Uintah Traces for Scalability Analysis
32 processes
32 processes
21 Important Questions for Application Developers
- How does performance vary with different compilers?
- Is poor performance correlated with certain OS features?
- Has a recent change caused unanticipated performance?
- How does performance vary with MPI variants?
- Why is one application version faster than another?
- What is the reason for the observed scaling behavior?
- Did two runs exhibit similar performance?
- How are performance data related to application events?
- Which machines will run my code the fastest and why?
- Which benchmarks predict my code performance best?
22 Performance Problem Solving Goals
- Answer questions at multiple levels of interest
- Data from low-level measurements and simulations
- use to predict application performance
- High-level performance data spanning dimensions
- machine, applications, code revisions, data sets
- examine broad performance trends
- Discover general correlations between application performance and features of the external environment
- Develop methods to predict application performance based on lower-level metrics
- Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system
23 Empirical-Based Performance Optimization Process
24 Performance Data Management Framework
25 PerfExplorer (K. Huck, Ph.D. student, UO)
- Performance knowledge discovery framework
- Use the existing TAU infrastructure
- TAU instrumentation data, PerfDMF
- Client-server based system architecture
- Data mining analysis applied to parallel performance data
- comparative, clustering, correlation, dimension reduction, ...
- Technology integration
- Relational Database Management Systems (RDBMS)
- Java API and toolkit
- R-project / Omegahat statistical analysis
- WEKA data mining package
- Web-based client
26 PerfExplorer Architecture
27 PerfExplorer Client GUI
28 Hierarchical and K-means Clustering (sPPM)
29 Miranda Clustering on 16K Processors
30 Parallel Performance Diagnosis
- Performance tuning process
- Process to find and report performance problems
- Performance diagnosis: detect and explain problems
- Performance optimization: performance problem repair
- Experts approach it systematically and use experience
- Hard to formulate and automate expertise
- Performance optimization is fundamentally hard
- Focus on the performance diagnosis problem
- Characterize diagnosis processes
- How it integrates with performance experimentation
- Understand the knowledge engineering
31 Parallel Performance Diagnosis Architecture
32 Performance Diagnosis System Architecture
33 Problems in Existing Diagnosis Approaches
- Low-level abstraction of properties/metrics
- Independent of program semantics
- Relate to component structure
- not algorithmic structure or parallelism model
- Insufficient explanation power
- Hard to interpret in the context of program semantics
- Performance behavior not tied to operational parallelism
- Low applicability and adaptability
- Difficult to apply in different contexts
- Hard to adapt to new requirements
34 Poirot Project
- Lack of a formal theory of diagnosis processes
- Compare and analyze performance diagnosis systems
- Use theory to create a system that is automated / adaptable
- Poirot performance diagnosis (theory, architecture)
- Survey of diagnosis methods / strategies in tools
- Heuristic classification approach (match to characteristics)
- Heuristic search approach (based on problem knowledge)
- Problems
- Descriptive results do not explain with respect to context
- users must reason about high-level causes
- Performance experimentation not guided by diagnosis
- Lacks automation
35 Model-Based Approach
- Knowledge-based performance diagnosis
- Capture knowledge about performance problems
- Capture knowledge about how to detect and explain them
- Where does the knowledge come from?
- Extract from parallel computational models
- Structural and operational characteristics
- Associate computational models with performance
- Do parallel computational models help in diagnosis?
- Enables better understanding of problems
- Enables more specific experimentation
- Enables more effective hypothesis testing and search
36 Implications for Performance Diagnosis
- Models benefit performance diagnosis
- Base instrumentation on program semantics
- Capture performance-critical features
- Enable explanations close to the user's understanding
- of computation operation
- of performance behavior
- Reuse performance analysis expertise
- on the commonly-used models
- Model examples
- Master-worker model, Pipeline
- Divide-and-conquer, Domain decomposition
- Phase-based, Compositional
37 Hercule Project
- Goals of automation, adaptability, validation
38 Approach
- Make use of model knowledge to diagnose performance
- Start with commonly-used computational models
- Engineering model knowledge
- Integrate model knowledge with performance measurement system
- Build a cause inference system (see the sketch after this list)
- define causes at the parallelism level
- build causality relation between the low-level effects and the causes
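The cause inference step can be pictured as a small rule base that relates observed low-level symptoms to parallelism-level causes. The sketch below is a minimal illustration of that idea, not Hercule's implementation (Hercule encodes its knowledge with the CLIPS tool, as noted on slide 41); the metric names, thresholds, and cause strings loosely follow the master-worker inference tree on the next slides and are placeholders.

#include <iostream>
#include <string>
#include <vector>

/* Illustrative low-level observations from one master-worker run. */
struct Observation {
  double speedup_fraction;       /* achieved / ideal speedup                          */
  double init_final_fraction;    /* time share of initialization/finalization         */
  double assign_task_fraction;   /* time share of master assigning tasks              */
  int    congested_intervals;    /* intervals with many requests queued at the master */
};

std::vector<std::string> inferCauses(const Observation& o) {
  std::vector<std::string> causes;
  const double theta1 = 0.10;    /* placeholder threshold on time shares    */
  const int    theta2 = 5;       /* placeholder threshold on interval count */
  if (o.speedup_fraction >= 0.8) return causes;            /* no low-speedup symptom */
  if (o.init_final_fraction > theta1)
    causes.push_back("insufficient parallelism (initialization/finalization dominates)");
  if (o.assign_task_fraction > theta1)
    causes.push_back("fine granularity (master task-assignment overhead)");
  if (o.congested_intervals > theta2)
    causes.push_back("master is a bottleneck (worker number saturation)");
  return causes;
}

int main() {
  Observation run{0.45, 0.02, 0.15, 12};
  for (const std::string& c : inferCauses(run)) std::cout << c << '\n';
  return 0;
}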
39 Master-Worker Parallel Computation Model
40 Performance Diagnosis Inference Tree (MW)
(Inference tree diagram. Observation: low speedup. Hypotheses, in priority order: 1. insufficient parallelism, 2. fine granularity, 3. master-being-bottleneck, 4. some workers noticeably inefficient. Causes, which may coexist, include worker starvation, worker number saturation, and time imbalance. Supporting low-level evidence: initialization or finalization time significant, master-assign-task time significant, a large amount of messages exchanged every time, workers waiting a long time for the master to assign each individual task, workers waiting in the master queue during some time intervals, the number of requests in the master queue exceeding θ1 in some intervals, and the number of such intervals being above or below θ2. The θi denote thresholds; hypothesis numbers indicate priority.)
41 Knowledge Engineering - Abstract Event (MW)
- Use CLIPS expert system building tool
42 Diagnosis Results Output (MW)
43 Experimental Diagnosis Results (MW)
44 Concluding Discussion
- Performance tools must be used effectively
- More intelligent performance systems for productive use
- Evolve to application-specific performance technology
- Deal with scale through full-range performance exploration
- Autonomic and integrated tools
- Knowledge-based and knowledge-driven process
- Performance observation methods do not necessarily need to change in a fundamental sense
- More automatically controlled and efficiently used
- Support model-driven performance diagnosis
- Develop next-generation tools and deliver them to the community
45 Support Acknowledgements
- Department of Energy (DOE)
- Office of Science contracts
- University of Utah ASCI Level 1 sub-contract
- ASC/NNSA Level 3 contract
- NSF
- High-End Computing Grant
- Research Centre Juelich
- John von Neumann Institute
- Dr. Bernd Mohr
- Los Alamos National Laboratory