1
Parallel Performance Mapping, Diagnosis, and Data Mining
  • Allen D. Malony, Sameer Shende, Li Li, Kevin Huck
  • {malony,sameer,lili,khuck}@cs.uoregon.edu
  • Department of Computer and Information Science
  • Performance Research Laboratory
  • University of Oregon

2
Research Motivation
  • Tools for performance problem solving
  • Empirical-based performance optimization process
  • Performance technology concerns

3
Challenges in Performance Problem Solving
  • How to make the process more effective
    (productive)?
  • Process may depend on scale of parallel system
  • What are the important events and performance
    metrics?
  • Tied to application structure and computational
    model
  • Tied to application domain and algorithms
  • Process and tools can/must be more
    application-aware
  • Tools have poor support for application-specific
    aspects
  • What are the significant issues that will affect
    the technology used to support the process?
  • Enhance application development and benchmarking
  • New paradigm in performance process and technology

4
Large Scale Performance Problem Solving
  • How does our view of this process change when we
    consider very large-scale parallel systems?
  • What are the significant issues that will affect
    the technology used to support the process?
  • Parallel performance observation is clearly
    needed
  • In general, there is the concern for intrusion
  • Seen as a tradeoff with performance diagnosis
    accuracy
  • Scaling complicates observation and analysis
  • Performance data size becomes a concern
  • Analysis complexity increases
  • Nature of application development may change

5
Role of Intelligence, Automation, and Knowledge
  • Scale forces the process to become more
    intelligent
  • Even with intelligent and application-specific
    tools, deciding what to analyze is difficult and
    can be intractable
  • More automation and knowledge-based decision
    making
  • Build automatic/autonomic capabilities into the
    tools
  • Support broader experimentation methods and
    refinement
  • Access and correlate data from several sources
  • Automate performance data analysis / mining /
    learning
  • Include predictive features and experiment
    refinement
  • Knowledge-driven adaptation and optimization
    guidance
  • Will allow scalability issues to be addressed in
    context

6
Outline of Talk
  • Performance problem solving
  • Scalability, productivity, and performance
    technology
  • Application-specific and autonomic performance
    tools
  • TAU parallel performance system (Bernd said
    No!)
  • Parallel performance mapping
  • Performance data management and data mining
  • Performance Data Management Framework (PerfDMF)
  • PerfExplorer
  • Model-based parallel performance diagnosis
  • Poirot and Hercule
  • Conclusions

7
TAU Performance System
8
Semantics-Based Performance Mapping
  • Associate performance measurements with
    high-level semantic abstractions
  • Need mapping support in the performance
    measurement system to assign data correctly

9
Hypothetical Mapping Example
  • Particles distributed on surfaces of a cube

Particle P[MAX];                     /* Array of particles */
int GenerateParticles() {
  /* distribute particles over all faces of the cube */
  for (int face = 0, last = 0; face < 6; face++) {
    /* particles on this face */
    int particles_on_this_face = num(face);
    for (int i = last; i < particles_on_this_face; i++) {
      /* particle properties are a function of face */
      P[i] = ... f(face) ...
    }
    last += particles_on_this_face;
  }
}
10
Hypothetical Mapping Example (continued)
int ProcessParticle(Particle *p) {
  /* perform some computation on p */
}
int main() {
  GenerateParticles();               /* create a list of particles */
  for (int i = 0; i < N; i++)        /* iterates over the list */
    ProcessParticle(&P[i]);
}

[Diagram: particle work packets fed to a processing engine]
  • How much time (flops) spent processing face i
    particles?
  • What is the distribution of performance among
    faces?
  • How is this determined if execution is parallel?
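
To make these questions concrete, below is a minimal sketch of how per-face attribution might be expressed with the TAU mapping macros that appear later in the UCF example; the face keys, the num(face) helper, and the restructured processing loop are illustrative assumptions, not the instrumentation actually used in the talk.

#include <TAU.h>  /* assumed TAU header; macros as used in the UCF example later */

static const char *facename[6] =
  { "Face 0", "Face 1", "Face 2", "Face 3", "Face 4", "Face 5" };

void ProcessAllParticles() {
  /* create one mapping entity per cube face (keys here are illustrative) */
  for (int face = 0; face < 6; face++)
    TAU_MAPPING_CREATE(facename[face], "ProcessParticle()",
                       (TauGroup_t)face, facename[face], 0);
  for (int face = 0, last = 0; face < 6; face++) {
    TAU_MAPPING_OBJECT(facetimer)
    TAU_MAPPING_LINK(facetimer, (TauGroup_t)face)       // external association
    TAU_MAPPING_PROFILE_TIMER(faceprofiler, facetimer, 0)
    TAU_MAPPING_PROFILE_START(faceprofiler, 0)
    int n = num(face);                 /* particles on this face */
    for (int i = last; i < last + n; i++)
      ProcessParticle(&P[i]);          /* time charged to the face, not just the routine */
    TAU_MAPPING_PROFILE_STOP(0)
    last += n;
  }
}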

11
No Performance Mapping versus Mapping
  • Typical performance tools report performance with
    respect to routines
  • Does not provide support for mapping
  • TAU's performance mapping can observe performance
    with respect to the scientist's programming and
    problem abstractions

[Profile screenshots: TAU (no mapping) vs. TAU (w/ mapping)]
12
Performance Mapping Approaches
  • ParaMap (Miller and Irvin)
  • Low-level performance to high-level source
    constructs
  • Noun-Verb (NV) model to describe the mapping
  • noun is a program entity
  • verb represents an action performed on a noun
  • sentences (nouns and verb) map to other sentences
  • Mappings: static, dynamic, set of active
    sentences (SAS)
  • Semantic Entities / Abstractions / Associations
    (SEAA)
  • Entities defined at any level of abstraction
    (user-level)
  • Attribute entity with semantic information
  • Entity-to-entity associations
  • Target measurement layer and asynchronous
    operation

13
SEAA Implementation
  • Two association types
  • Embedded – extends the associated object to store
    the performance measurement entity
  • External – creates an external look-up table,
    using the address of the object as the key to
    locate the performance measurement entity
  • Implemented in TAU API
  • Applied to performance measurement problems
  • callpath/phase profiling, C++ templates, ...
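
As a rough illustration of the two association types (a sketch of the idea only, not TAU's internal implementation; the Timer struct and names here are hypothetical):

#include <map>

struct Timer { double inclusive_time; };   // stands in for a performance measurement entity

/* Embedded association: the application object is extended to carry a
   direct link to its measurement entity. */
struct Patch {
  int    id;
  Timer *timer;                            // embedded performance data link
};

/* External association: objects are left untouched; their addresses key an
   external look-up table that locates the measurement entity. */
static std::map<const void*, Timer*> lookup_table;

Timer *find_timer(const void *object_addr) {
  std::map<const void*, Timer*>::const_iterator it = lookup_table.find(object_addr);
  return (it == lookup_table.end()) ? 0 : it->second;
}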

14
Uintah Problem Solving Environment (PSE)
  • Uintah component architecture for Utah C-SAFE
    project
  • Application programmers provide
  • description of computation (tasks and variables)
  • code to perform task on single patch
    (sub-region of space)
  • Components for scheduling, partitioning, load
    balancing, ...
  • Uintah Computational Framework (UCF)
  • Execution model based on software (macro)
    dataflow
  • computations expressed as directed acyclic graphs
    of tasks
  • input/outputs specified for each patch in a
    structured grid
  • Abstraction of global single-assignment memory
  • Task graph gets mapped to processing resources
  • Communication schedule approximates the global
    optimum
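
As a schematic of the macro-dataflow idea (the type and field names below are illustrative, not the actual UCF interface), a task declares the per-patch variables it requires and computes, and the scheduler connects these declarations into the task graph:

#include <string>
#include <vector>

struct VariableLabel { std::string name; };

struct TaskDecl {
  std::string                name;       // e.g. a named computation step
  std::vector<VariableLabel> requires_;  // per-patch inputs  (DAG in-edges)
  std::vector<VariableLabel> computes_;  // per-patch outputs (DAG out-edges)
  void (*callback)(int patch);           // code to perform the task on one patch
};
/* A scheduler component matches computes_ to requires_ to build the task
   graph, then maps tasks to processing resources. */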

15
Uintah Task Graph (Material Point Method)
  • Diagram of named tasks (ovals) and data (edges)
  • Imminent computation
  • Dataflow-constrained
  • MPM
  • Newtonian material point motion time step
  • Solid: values defined at material point (particle)
  • Dashed: values defined at vertex (grid)
  • Prime (′): values updated during time step

16
Task Execution in Uintah Parallel Scheduler
  • Profile methods and functions in scheduler and in
    MPI library

Task execution time dominates (what task?)
Task execution time distribution
MPI communication overheads (where?)
  • Need to map performance data!

17
Mapping Instrumentation in UCF (example)
  • Use TAU performance mapping API

void MPIScheduler::execute(const ProcessorGroup * pc,
                           DataWarehouseP   & old_dw,
                           DataWarehouseP   & dw)
{
  ...
  TAU_MAPPING_CREATE(task->getName(), "MPIScheduler::execute()",
                     (TauGroup_t)(void*)task->getName(),
                     task->getName(), 0);
  ...
  TAU_MAPPING_OBJECT(tautimer)
  TAU_MAPPING_LINK(tautimer,
                   (TauGroup_t)(void*)task->getName());  // EXTERNAL ASSOCIATION
  ...
  TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0)
  TAU_MAPPING_PROFILE_START(doitprofiler, 0)
  task->doit(pc);
  TAU_MAPPING_PROFILE_STOP(0)
  ...
}
18
Task Performance Mapping (Profile)
Mapped task performance across processes
Performance mapping for different tasks
19
Work Packet to Task Mapping (Trace)
Work packet computation events colored by task
type
Distinct phases of computation can be identified
based on task
20
Comparing Uintah Traces for Scalability Analysis
32 processes
32 processes
21
Important Questions for Application Developers
  • How does performance vary with different
    compilers?
  • Is poor performance correlated with certain OS
    features?
  • Has a recent change caused unanticipated
    performance?
  • How does performance vary with MPI variants?
  • Why is one application version faster than
    another?
  • What is the reason for the observed scaling
    behavior?
  • Did two runs exhibit similar performance?
  • How are performance data related to application
    events?
  • Which machines will run my code the fastest and
    why?
  • Which benchmarks predict my code performance best?

22
Performance Problem Solving Goals
  • Answer questions at multiple levels of interest
  • Data from low-level measurements and simulations
  • use to predict application performance
  • High-level performance data spanning dimensions
  • machine, applications, code revisions, data sets
  • examine broad performance trends
  • Discover general correlations between application
    performance and features of the external
    environment
  • Develop methods to predict application
    performance based on lower-level metrics
  • Discover performance correlations between a small
    set of benchmarks and a collection of
    applications that represent a typical workload
    for a given system
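
One simple ingredient of such correlation studies is sketched below: a Pearson correlation between, say, a benchmark's score and an application's runtime measured across the same set of machines (illustrative code only, not part of the tools described here).

#include <cmath>
#include <vector>

/* Pearson correlation between two series of equal length. */
double pearson(const std::vector<double> &x, const std::vector<double> &y) {
  int n = x.size();
  double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
  for (int i = 0; i < n; i++) {
    sx += x[i]; sy += y[i];
    sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
  }
  double cov  = sxy - sx * sy / n;
  double varx = sxx - sx * sx / n;
  double vary = syy - sy * sy / n;
  return cov / std::sqrt(varx * vary);   // close to +1/-1: strong predictor
}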

23
Empirical-Based Performance Optimization Process
24
Performance Data Management Framework
  • ICPP 2005 paper

25
PerfExplorer (K. Huck, Ph.D. student, UO)
  • Performance knowledge discovery framework
  • Use the existing TAU infrastructure
  • TAU instrumentation data, PerfDMF
  • Client-server based system architecture
  • Data mining analysis applied to parallel
    performance data
  • comparative, clustering, correlation, dimension
    reduction, ...
  • Technology integration
  • Relational Database Management Systems (RDBMS)
  • Java API and toolkit
  • R-project / Omegahat statistical analysis
  • WEKA data mining package
  • Web-based client
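
To suggest the flavor of the clustering analyses PerfExplorer delegates to WEKA and R, here is a small, self-contained k-means sketch over per-process profile vectors (one value per instrumented event per process); it is an illustration only, not PerfExplorer code.

#include <vector>

typedef std::vector<double> Profile;   // one value per event, one Profile per process

/* Plain k-means: returns a cluster id per process (assumes data.size() >= k). */
std::vector<int> kmeans(const std::vector<Profile> &data, int k, int iters) {
  int n = data.size(), d = data[0].size();
  std::vector<Profile> centers(data.begin(), data.begin() + k);  // seed with first k profiles
  std::vector<int> assign(n, 0);
  for (int it = 0; it < iters; it++) {
    for (int i = 0; i < n; i++) {                 // assignment step
      double best = 1e300;
      for (int c = 0; c < k; c++) {
        double dist = 0;
        for (int j = 0; j < d; j++)
          dist += (data[i][j] - centers[c][j]) * (data[i][j] - centers[c][j]);
        if (dist < best) { best = dist; assign[i] = c; }
      }
    }
    for (int c = 0; c < k; c++) {                 // update step
      Profile mean(d, 0.0); int count = 0;
      for (int i = 0; i < n; i++)
        if (assign[i] == c) { count++; for (int j = 0; j < d; j++) mean[j] += data[i][j]; }
      if (count)
        for (int j = 0; j < d; j++) centers[c][j] = mean[j] / count;
    }
  }
  return assign;
}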

26
PerfExplorer Architecture
  • SC05 paper

27
PerfExplorer Client GUI
28
Hierarchical and K-means Clustering (sPPM)
29
Miranda Clustering on 16K Processors
30
Parallel Performance Diagnosis
  • Performance tuning process
  • Process to find and report performance problems
  • Performance diagnosis: detect and explain problems
  • Performance optimization: performance problem
    repair
  • Experts approach diagnosis systematically and use
    experience
  • Hard to formulate and automate expertise
  • Performance optimization is fundamentally hard
  • Focus on the performance diagnosis problem
  • Characterize diagnosis processes
  • How it integrates with performance
    experimentation
  • Understand the knowledge engineering

31
Parallel Performance Diagnosis Architecture
32
Performance Diagnosis System Architecture
33
Problems in Existing Diagnosis Approaches
  • Low-level abstraction of properties/metrics
  • Independent of program semantics
  • Relate to component structure
  • not algorithmic structure or parallelism model
  • Insufficient explanation power
  • Hard to interpret in the context of program
    semantics
  • Performance behavior not tied to operational
    parallelism
  • Low applicability and adaptability
  • Difficult to apply in different contexts
  • Hard to adapt to new requirements

34
Poirot Project
  • Lack of a formal theory of diagnosis processes
  • Compare and analyze performance diagnosis systems
  • Use theory to create system that is automated /
    adaptable
  • Poirot performance diagnosis (theory,
    architecture)
  • Survey of diagnosis methods / strategies in tools
  • Heuristic classification approach (match to
    characteristics)
  • Heuristic search approach (based on problem
    knowledge)
  • Problems
  • Descriptive results do not explain with respect
    to context
  • users must reason about high-level causes
  • Performance experimentation not guided by
    diagnosis
  • Lacks automation

35
Model-Based Approach
  • Knowledge-based performance diagnosis
  • Capture knowledge about performance problems
  • Capture knowledge about how to detect and explain
    them
  • Where does the knowledge come from?
  • Extract from parallel computational models
  • Structural and operational characteristics
  • Associate computational models with performance
  • Do parallel computational models help in
    diagnosis?
  • Enables better understanding of problems
  • Enables more specific experimentation
  • Enables more effective hypothesis testing and
    search

36
Implications for Performance Diagnosis
  • Models benefit performance diagnosis
  • Base instrumentation on program semantics
  • Capture performance-critical features
  • Enable explanations close to the user's understanding
  • of computation operation
  • of performance behavior
  • Reuse performance analysis expertise
  • on the commonly-used models
  • Model examples
  • Master-worker model, pipeline
  • Divide-and-conquer, domain decomposition
  • Phase-based, compositional

37
Hercule Project
  • Goals of automation, adaptability, and validation

38
Approach
  • Make use of model knowledge to diagnose
    performance
  • Start with commonly-used computational models
  • Engineering model knowledge
  • Integrate model knowledge with performance
    measurement system
  • Build a cause inference system
  • define causes at the parallelism level
  • build causality relations between low-level
    effects and the causes
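
A toy sketch of the cause-inference idea for the master-worker model discussed next (Hercule itself encodes such knowledge in CLIPS, as a later slide notes; the metric names and thresholds below are hypothetical):

#include <cstdio>

/* Hypothetical master-worker observations gathered from measurement. */
struct MWObservations {
  double master_assign_time_frac;   // fraction of time master spends assigning tasks
  double mean_task_granularity;     // average task execution time (seconds)
  double worker_wait_time_frac;     // fraction of time workers wait for work
};

/* Toy causality rules: map low-level effects to parallelism-level causes. */
void diagnose(const MWObservations &o) {
  if (o.master_assign_time_frac > 0.2 && o.worker_wait_time_frac > 0.3)
    std::printf("cause: master is a bottleneck (workers starved waiting for tasks)\n");
  if (o.mean_task_granularity < 1e-3)
    std::printf("cause: fine granularity (task overhead dominates useful work)\n");
  if (o.worker_wait_time_frac > 0.3 && o.master_assign_time_frac <= 0.2)
    std::printf("cause: insufficient parallelism or load imbalance among workers\n");
}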

39
Master-Worker Parallel Computation Model
40
Performance Diagnosis Inference Tree (MW)
[Inference tree diagram: the observed symptom "low speedup" is explained by
hypotheses such as (1) insufficient parallelism, (2) fine granularity,
(3) master-being-bottleneck, and (4) some workers noticeably inefficient.
Supporting observations include significant initialization/finalization time,
significant master-assign-task time, a large amount of messages exchanged each
time, workers waiting a long time in the master queue (number of requests in
the queue exceeding a threshold θ1 in some intervals, with the number of such
intervals compared against θ2), time imbalance, worker-number saturation, and
worker starvation; the θi denote tunable thresholds.]
41
Knowledge Engineering - Abstract Event (MW)
  • Use CLIPS expert system building tool

42
Diagnosis Results Output (MW)
43
Experimental Diagnosis Results (MW)
44
Concluding Discussion
  • Performance tools must be used effectively
  • More intelligent performance systems for
    productive use
  • Evolve to application-specific performance
    technology
  • Deal with scale through full-range performance
    exploration
  • Autonomic and integrated tools
  • Knowledge-based and knowledge-driven process
  • Performance observation methods do not
    necessarily need to change in a fundamental sense
  • More automatically controlled and efficiently used
  • Support model-driven performance diagnosis
  • Develop next-generation tools and deliver to
    community

45
Support Acknowledgements
  • Department of Energy (DOE)
  • Office of Science contracts
  • University of Utah ASCI Level 1 sub-contract
  • ASC/NNSA Level 3 contract
  • NSF
  • High-End Computing Grant
  • Research Centre Juelich
  • John von Neumann Institute
  • Dr. Bernd Mohr
  • Los Alamos National Laboratory