1 Parallel Performance Mapping, Diagnosis, and Data Mining
- Allen D. Malony, Sameer Shende, Li Li, Kevin Huck
- malony,sameer,lili,khuck_at_cs.uoregon.edu
- Department of Computer and Information Science
- Performance Research Laboratory
- University of Oregon
2 Research Motivation
- Tools for performance problem solving
- Empirical-based performance optimization process
- Performance technology concerns
3 Challenges in Performance Problem Solving
- How to make the process more effective (productive)?
- Process may depend on scale of parallel system
- What are the important events and performance metrics?
- Tied to application structure and computational model
- Tied to application domain and algorithms
- Process and tools can/must be more application-aware
- Tools have poor support for application-specific aspects
- What are the significant issues that will affect the technology used to support the process?
- Enhance application development and benchmarking
- New paradigm in performance process and technology
4 Large Scale Performance Problem Solving
- How does our view of this process change when we consider very large-scale parallel systems?
- What are the significant issues that will affect the technology used to support the process?
- Parallel performance observation is clearly needed
- In general, there is the concern for intrusion
- Seen as a tradeoff with performance diagnosis accuracy
- Scaling complicates observation and analysis
- Performance data size becomes a concern
- Analysis complexity increases
- Nature of application development may change
5 Role of Intelligence, Automation, and Knowledge
- Scale forces the process to become more intelligent
- Even with intelligent and application-specific tools, deciding what to analyze is difficult and intractable
- More automation and knowledge-based decision making
- Build automatic/autonomic capabilities into the tools
- Support broader experimentation methods and refinement
- Access and correlate data from several sources
- Automate performance data analysis / mining / learning
- Include predictive features and experiment refinement
- Knowledge-driven adaptation and optimization guidance
- Will allow scalability issues to be addressed in context
6 Outline of Talk
- Performance problem solving
- Scalability, productivity, and performance technology
- Application-specific and autonomic performance tools
- TAU parallel performance system (Bernd said No!)
- Parallel performance mapping
- Performance data management and data mining
- Performance Data Management Framework (PerfDMF)
- PerfExplorer
- Model-based parallel performance diagnosis
- Poirot and Hercule
- Conclusions
7 TAU Performance System
8 Semantics-Based Performance Mapping
- Associate performance measurements with high-level semantic abstractions
- Need mapping support in the performance measurement system to assign data correctly
9 Hypothetical Mapping Example
- Particles distributed on surfaces of a cube

Particle P[MAX];  /* Array of particles */
int GenerateParticles() {
  /* distribute particles over all faces of the cube */
  for (int face = 0, last = 0; face < 6; face++) {
    /* particles on this face */
    int particles_on_this_face = num(face);
    for (int i = last; i < particles_on_this_face; i++) {
      /* particle properties are a function of face */
      P[i] = ... f(face) ...;
    }
    last += particles_on_this_face;
  }
}
10 Hypothetical Mapping Example (continued)

int ProcessParticle(Particle p) {
  /* perform some computation on p */
}
int main() {
  GenerateParticles();  /* create a list of particles */
  for (int i = 0; i < N; i++)
    /* iterates over the list */
    ProcessParticle(P[i]);
}

(diagram: generated particles grouped into work packets and fed to a processing engine)

- How much time (flops) spent processing face i particles?
- What is the distribution of performance among faces?
- How is this determined if execution is parallel? (a mapping sketch follows below)
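To make the mapping idea concrete, here is a minimal sketch (not from the original slides) of how the particle code above could be instrumented with the TAU mapping macros shown later in the UCF example. The face_of[] array, the stub bodies for num(), f(), and ProcessParticle(), and the timer names are assumptions introduced for illustration; the point is the pattern of creating one mapped timer per face and charging each particle's processing time to its face through an external association keyed by the face index.

#include <TAU.h>
#include <cstdio>

struct Particle { double x, y, z; };             /* assumed particle type       */
const int MAX = 60000;
Particle P[MAX];                                  /* array of particles          */
int face_of[MAX];                                 /* face that produced P[i]     */

int num(int)                    { return 100; }          /* assumed stub         */
Particle f(int)                 { return Particle{}; }   /* assumed stub         */
void ProcessParticle(Particle&) { /* some computation */ }

int GenerateParticles() {
  int last = 0;
  for (int face = 0; face < 6; face++) {
    char name[32];
    std::sprintf(name, "face %d", face);
    /* one mapped timer per face; the face index (+1) serves as the external key */
    TAU_MAPPING_CREATE(name, "GenerateParticles", (TauGroup_t)(face + 1), name, 0);
    int particles_on_this_face = num(face);
    for (int i = last; i < last + particles_on_this_face; i++) {
      P[i] = f(face);                             /* properties depend on face   */
      face_of[i] = face;
    }
    last += particles_on_this_face;
  }
  return last;                                    /* total particles generated   */
}

int main(int argc, char** argv) {
  TAU_PROFILE_INIT(argc, argv);
  int total = GenerateParticles();
  for (int i = 0; i < total; i++) {
    /* look up the timer associated with this particle's face and time the work */
    TAU_MAPPING_OBJECT(facetimer);
    TAU_MAPPING_LINK(facetimer, (TauGroup_t)(face_of[i] + 1));
    TAU_MAPPING_PROFILE_TIMER(faceprofiler, facetimer, 0);
    TAU_MAPPING_PROFILE_START(faceprofiler, 0);
    ProcessParticle(P[i]);
    TAU_MAPPING_PROFILE_STOP(0);
  }
  return 0;
}

The resulting profile then reports time per face ("face 0" ... "face 5") rather than only per routine, which is what the distribution questions above ask for.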
11 No Performance Mapping versus Mapping
- Typical performance tools report performance with respect to routines
- Does not provide support for mapping
- TAU's performance mapping can observe performance with respect to the scientist's programming and problem abstractions

(profile screenshots: TAU with mapping vs. TAU without mapping)
12 Performance Mapping Approaches
- ParaMap (Miller and Irvin)
- Low-level performance to high-level source constructs
- Noun-Verb (NV) model to describe the mapping
- a noun is a program entity
- a verb represents an action performed on a noun
- sentences (nouns and verbs) map to other sentences
- Mappings: static, dynamic, set of active sentences (SAS)
- Semantic Entities / Abstractions / Associations (SEAA)
- Entities defined at any level of abstraction (user-level)
- Attribute entity with semantic information
- Entity-to-entity associations
- Target measurement layer and asynchronous operation
13 SEAA Implementation
- Two association types (implemented in TAU API)
- Embedded: extends the associated object to store the performance measurement entity
- External: creates an external look-up table using the address of the object as the key to locate the performance measurement entity
- Implemented in TAU API
- Applied to performance measurement problems
- callpath/phase profiling, C++ templates, ... (a conceptual sketch of the two association types follows below)
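As a rough, conceptual illustration of the two association types (not TAU's internal data structures), the fragment below contrasts an object extended to carry its own measurement entity with an external table keyed by the object's address; the Timer and Task types and all names are illustrative only.

#include <string>
#include <unordered_map>

struct Timer { std::string name; double inclusive_time = 0.0; };   /* illustrative */

/* Embedded association: the associated object is extended to store
   its performance measurement entity directly. */
struct TaskEmbedded {
  std::string name;
  Timer* timer = nullptr;        /* extra field added for measurement */
};

/* External association: an external look-up table, keyed by the address of
   the object, locates the measurement entity; the object itself is unchanged. */
struct TaskPlain { std::string name; };

std::unordered_map<const void*, Timer*> external_table;

Timer* timerFor(const TaskPlain* t) {
  auto it = external_table.find(t);
  return it == external_table.end() ? nullptr : it->second;
}

The embedded form avoids a look-up at measurement time but requires changing the application type; the external form (used in the UCF example on slide 17) leaves application objects untouched.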
14 Uintah Problem Solving Environment (PSE)
- Uintah component architecture for Utah C-SAFE project
- Application programmers provide
- description of computation (tasks and variables)
- code to perform a task on a single patch (sub-region of space)
- Components for scheduling, partitioning, load balancing, ...
- Uintah Computational Framework (UCF)
- Execution model based on software (macro) dataflow
- computations expressed as directed acyclic graphs of tasks
- inputs/outputs specified for each patch in a structured grid
- Abstraction of global single-assignment memory
- Task graph gets mapped to processing resources
- Communication schedule approximates the global optimum (a hypothetical task sketch follows below)
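To picture the macro-dataflow model described above, here is a hypothetical sketch of a task description: the variables it requires, the variables it computes for each patch, and the per-patch callback that the scheduler invokes. All class, method, and variable names here are illustrative, not the verbatim Uintah/UCF API.

#include <functional>
#include <string>
#include <vector>

struct Patch;          /* sub-region of the structured grid            */
struct DataWarehouse;  /* global single-assignment memory abstraction  */

struct Task {                                   /* illustrative stand-in for a UCF task */
  std::string name;
  std::vector<std::string> requires_vars;       /* inputs needed on each patch          */
  std::vector<std::string> computes_vars;       /* outputs produced on each patch       */
  std::function<void(const Patch&, DataWarehouse&, DataWarehouse&)> doit;
};

/* An application component contributes task descriptions like this one; the
   framework assembles them into a task graph and maps it to processors. */
Task describeInterpolateTask() {
  Task t;
  t.name          = "MPM::interpolateParticlesToGrid";          /* example name */
  t.requires_vars = { "particle.position", "particle.mass" };
  t.computes_vars = { "grid.mass", "grid.velocity" };
  t.doit = [](const Patch&, DataWarehouse& /*old_dw*/, DataWarehouse& /*new_dw*/) {
    /* per-patch computation written by the application programmer */
  };
  return t;
}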
15 Uintah Task Graph (Material Point Method)
- Diagram of named tasks (ovals) and data (edges)
- Imminent computation
- Dataflow-constrained
- MPM
- Newtonian material point motion time step
- Solid: values defined at material point (particle)
- Dashed: values defined at vertex (grid)
- Prime ('): values updated during time step
16 Task Execution in Uintah Parallel Scheduler
- Profile methods and functions in scheduler and in MPI library
Task execution time dominates (what task?)
Task execution time distribution
MPI communication overheads (where?)
- Need to map performance data!
17 Mapping Instrumentation in UCF (example)
- Use TAU performance mapping API

void MPIScheduler::execute(const ProcessorGroup* pc,
                           DataWarehouseP& old_dw,
                           DataWarehouseP& dw)
{
  ...
  TAU_MAPPING_CREATE(task->getName(), "MPIScheduler::execute()",
                     (TauGroup_t)(void*)task->getName(),
                     task->getName(), 0);
  ...
  TAU_MAPPING_OBJECT(tautimer);
  TAU_MAPPING_LINK(tautimer, (TauGroup_t)(void*)task->getName());  // EXTERNAL ASSOCIATION
  ...
  TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0);
  TAU_MAPPING_PROFILE_START(doitprofiler, 0);
  task->doit(pc);
  TAU_MAPPING_PROFILE_STOP(0);
  ...
}
18 Task Performance Mapping (Profile)
Mapped task performance across processes
Performance mapping for different tasks
19 Work Packet to Task Mapping (Trace)
Work packet computation events colored by task type
Distinct phases of computation can be identified based on task
20 Comparing Uintah Traces for Scalability Analysis
32 processes
32 processes
21 Important Questions for Application Developers
- How does performance vary with different compilers?
- Is poor performance correlated with certain OS features?
- Has a recent change caused unanticipated performance?
- How does performance vary with MPI variants?
- Why is one application version faster than another?
- What is the reason for the observed scaling behavior?
- Did two runs exhibit similar performance?
- How are performance data related to application events?
- Which machines will run my code the fastest and why?
- Which benchmarks predict my code performance best?
22 Performance Problem Solving Goals
- Answer questions at multiple levels of interest
- Data from low-level measurements and simulations
- use to predict application performance
- High-level performance data spanning dimensions
- machine, applications, code revisions, data sets
- examine broad performance trends
- Discover general correlations between application performance and features of the external environment
- Develop methods to predict application performance based on lower-level metrics
- Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system
23 Empirical-Based Performance Optimization Process
24 Performance Data Management Framework
25 PerfExplorer (K. Huck, Ph.D. student, UO)
- Performance knowledge discovery framework
- Use the existing TAU infrastructure
- TAU instrumentation data, PerfDMF
- Client-server based system architecture
- Data mining analysis applied to parallel performance data
- comparative, clustering, correlation, dimension reduction, ...
- Technology integration
- Relational Database Management Systems (RDBMS)
- Java API and toolkit
- R-project / Omegahat statistical analysis
- WEKA data mining package
- Web-based client
26 PerfExplorer Architecture
27 PerfExplorer Client GUI
28 Hierarchical and K-means Clustering (sPPM)
29 Miranda Clustering on 16K Processors
30 Parallel Performance Diagnosis
- Performance tuning process
- Process to find and report performance problems
- Performance diagnosis: detect and explain problems
- Performance optimization: performance problem repair
- Experts approach it systematically and use experience
- Hard to formulate and automate expertise
- Performance optimization is fundamentally hard
- Focus on the performance diagnosis problem
- Characterize diagnosis processes
- How it integrates with performance experimentation
- Understand the knowledge engineering
31 Parallel Performance Diagnosis Architecture
32 Performance Diagnosis System Architecture
33 Problems in Existing Diagnosis Approaches
- Low-level abstraction of properties/metrics
- Independent of program semantics
- Relate to component structure
- not algorithmic structure or parallelism model
- Insufficient explanation power
- Hard to interpret in the context of program semantics
- Performance behavior not tied to operational parallelism
- Low applicability and adaptability
- Difficult to apply in different contexts
- Hard to adapt to new requirements
34 Poirot Project
- Lack of a formal theory of diagnosis processes
- Compare and analyze performance diagnosis systems
- Use theory to create a system that is automated / adaptable
- Poirot performance diagnosis (theory, architecture)
- Survey of diagnosis methods / strategies in tools
- Heuristic classification approach (match to characteristics)
- Heuristic search approach (based on problem knowledge)
- Problems
- Descriptive results do not explain with respect to context
- users must reason about high-level causes
- Performance experimentation not guided by diagnosis
- Lacks automation
35 Model-Based Approach
- Knowledge-based performance diagnosis
- Capture knowledge about performance problems
- Capture knowledge about how to detect and explain them
- Where does the knowledge come from?
- Extract from parallel computational models
- Structural and operational characteristics
- Associate computational models with performance
- Do parallel computational models help in diagnosis?
- Enables better understanding of problems
- Enables more specific experimentation
- Enables more effective hypothesis testing and search
36 Implications for Performance Diagnosis
- Models benefit performance diagnosis
- Base instrumentation on program semantics
- Capture performance-critical features
- Enable explanations close to the user's understanding
- of computation operation
- of performance behavior
- Reuse performance analysis expertise
- on the commonly-used models
- Model examples
- Master-worker model, Pipeline
- Divide-and-conquer, Domain decomposition
- Phase-based, Compositional
37 Hercule Project
- Goals of automation, adaptability, validation
38 Approach
- Make use of model knowledge to diagnose performance
- Start with commonly-used computational models
- Engineering model knowledge
- Integrate model knowledge with performance measurement system
- Build a cause inference system (see the sketch after this list)
- define causes at the parallelism level
- build causality relation between the low-level effects and the causes
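The cause inference step can be pictured as a small rule base that relates observed low-level symptoms to parallelism-level causes. The sketch below is a minimal illustration of that idea, not Hercule's implementation (Hercule encodes its knowledge with the CLIPS tool, as noted on slide 41); the metric names, thresholds, and cause strings loosely follow the master-worker inference tree on the next slides and are placeholders.

#include <iostream>
#include <string>
#include <vector>

/* Illustrative low-level observations from one master-worker run. */
struct Observation {
  double speedup_fraction;       /* achieved / ideal speedup                          */
  double init_final_fraction;    /* time share of initialization/finalization         */
  double assign_task_fraction;   /* time share of master assigning tasks              */
  int    congested_intervals;    /* intervals with many requests queued at the master */
};

std::vector<std::string> inferCauses(const Observation& o) {
  std::vector<std::string> causes;
  const double theta1 = 0.10;    /* placeholder threshold on time shares    */
  const int    theta2 = 5;       /* placeholder threshold on interval count */
  if (o.speedup_fraction >= 0.8) return causes;            /* no low-speedup symptom */
  if (o.init_final_fraction > theta1)
    causes.push_back("insufficient parallelism (initialization/finalization dominates)");
  if (o.assign_task_fraction > theta1)
    causes.push_back("fine granularity (master task-assignment overhead)");
  if (o.congested_intervals > theta2)
    causes.push_back("master is a bottleneck (worker number saturation)");
  return causes;
}

int main() {
  Observation run{0.45, 0.02, 0.15, 12};
  for (const std::string& c : inferCauses(run)) std::cout << c << '\n';
  return 0;
}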
39 Master-Worker Parallel Computation Model
40 Performance Diagnosis Inference Tree (MW)
(Inference tree diagram. Observation: low speedup. Hypotheses, in priority order: 1. insufficient parallelism, 2. fine granularity, 3. master-being-bottleneck, 4. some workers noticeably inefficient. Causes, which may coexist, include worker starvation, worker number saturation, and time imbalance. Supporting low-level evidence: initialization or finalization time significant, master-assign-task time significant, a large amount of messages exchanged every time, workers waiting a long time for the master to assign each individual task, workers waiting in the master queue during some time intervals, the number of requests in the master queue exceeding θ1 in some intervals, and the number of such intervals being above or below θ2. The θi denote thresholds; hypothesis numbers indicate priority.)
41 Knowledge Engineering - Abstract Event (MW)
- Use CLIPS expert system building tool
42 Diagnosis Results Output (MW)
43 Experimental Diagnosis Results (MW)
44 Concluding Discussion
- Performance tools must be used effectively
- More intelligent performance systems for productive use
- Evolve to application-specific performance technology
- Deal with scale through full-range performance exploration
- Autonomic and integrated tools
- Knowledge-based and knowledge-driven process
- Performance observation methods do not necessarily need to change in a fundamental sense
- More automatically controlled and efficiently used
- Support model-driven performance diagnosis
- Develop next-generation tools and deliver them to the community
45 Support Acknowledgements
- Department of Energy (DOE)
- Office of Science contracts
- University of Utah ASCI Level 1 sub-contract
- ASC/NNSA Level 3 contract
- NSF
- High-End Computing Grant
- Research Centre Juelich
- John von Neumann Institute
- Dr. Bernd Mohr
- Los Alamos National Laboratory