1 Integrating Performance Analysis in Complex Scientific Software: Experiences with the Uintah Computational Framework
- Allen D. Malony
- malony@cs.uoregon.edu
- Department of Computer and Information Science
- Computational Science Institute
- University of Oregon
2 Acknowledgements
- Sameer Shende, Robert Bell (University of Oregon)
- Steven Parker, J. Davison de St.-Germain, and Alan Morris (University of Utah)
- Department of Energy (DOE), ASCI Academic Strategic Alliances Program (ASAP)
- Center for Simulation of Accidental Fires and Explosions (C-SAFE), ASCI/ASAP Level 1 center, University of Utah, http://www.csafe.utah.edu
- Computational Science Institute, ASCI/ASAP Level 3 projects with LLNL / LANL, University of Oregon, http://www.csi.uoregon.edu
3 Complex Parallel Systems
- Complexity in computing system architecture
- Diverse parallel system architectures
- shared / distributed memory, cluster, hybrid, NOW, Grid, ...
- Sophisticated processor and memory architectures
- Advanced network interface and switching architecture
- Specialization of hardware components
- Complexity in parallel software environment
- Diverse parallel programming paradigms
- shared memory multi-threading, message passing, hybrid
- Hierarchical, multi-level software architectures
- Optimizing compilers and sophisticated runtime systems
- Advanced numerical libraries and application frameworks
4 Complexity Drives Performance Need / Technology
- Observe/analyze/understand performance behavior
- Multiple levels of software and hardware
- Different types and detail of performance data
- Alternative performance problem solving methods
- Multiple targets of software and system application
- Robust AND ubiquitous performance technology
- Broad scope of performance observability
- Flexible and configurable mechanisms
- Technology integration and extension
- Cross-platform portability
- Open, layered, and modular framework architecture
5 What is Parallel Performance Technology?
- Performance instrumentation tools
- Different program code levels
- Different system levels
- Performance measurement (observation) tools
- Profiling and tracing of SW/HW performance events
- Different software (SW) and hardware (HW) levels
- Performance analysis tools
- Performance data analysis and presentation
- Online and offline tools
- Performance experimentation and data management
- Performance modeling and prediction tools
6 Complexity Challenges for Performance Tools
- Computing system environment complexity
- Observation integration and optimization
- Access, accuracy, and granularity constraints
- Diverse/specialized observation capabilities/technology
- Restricted modes limit performance problem solving
- Sophisticated software development environments
- Programming paradigms and performance models
- Performance data mapping to software abstractions
- Uniformity of performance abstraction across platforms
- Rich observation capabilities and flexible configuration
- Common performance problem solving methods
7 General Problems
- How do we create robust and ubiquitous performance technology for the analysis and tuning of parallel and distributed software and systems in the presence of (evolving) complexity challenges?
- How do we apply performance technology effectively for the variety and diversity of performance problems that arise in the context of complex parallel and distributed computer systems?
8 Scientific Software Engineering
- Modern scientific simulation software is complex
- Large development teams of diverse expertise
- Simultaneous development on different system parts
- Iterative, multi-stage, long-term software development
- Need support for managing a complex software process
- Software engineering tools for revision control, automated testing, and bug tracking are commonplace
- Tools for HPC performance engineering are not
- evaluation (measurement, analysis, benchmarking)
- optimization (diagnosis, tracking, prediction, tuning)
- Incorporate performance engineering methodology, supported by flexible and robust performance tools
9 Computation Model for Performance Technology
- How to address dual performance technology goals?
- Robust capabilities, widely available methodologies
- Contend with problems of system diversity
- Flexible tool composition / configuration / integration
- Approaches
- Restrict computation types / performance problems
- limited performance technology coverage
- Base technology on abstract computation model
- general architecture and software execution features
- map features/methods to existing complex system types
- develop capabilities that can adapt and be optimized
10 General Complex System Computation Model
- Node: physically distinct shared memory machine
- Message passing node interconnection network
- Context: distinct virtual memory space within a node
- Thread: execution threads (user/system) in a context
(Diagram: physical view of SMP nodes, each with its own memory, joined by an interconnection network carrying inter-node message communication; model view maps each node to contexts (VM spaces) containing threads.)
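A toy rendering of this node / context / thread hierarchy as code may help fix the terminology (nodes contain contexts, contexts contain threads); the types below are illustrative only, not TAU data structures:

#include <vector>

struct Thread  { int id; };                        // user/system execution thread
struct Context { std::vector<Thread> threads; };   // distinct VM space within a node
struct Node    { std::vector<Context> contexts; }; // shared memory machine with its own memory
struct System  { std::vector<Node> nodes; };       // nodes joined by a message passing
                                                   // interconnection network

int main() {
  System s;
  s.nodes.resize(2);                           // two SMP nodes
  s.nodes[0].contexts.resize(1);               // one process (context) on node 0
  s.nodes[0].contexts[0].threads = {{0}, {1}}; // two threads in that context
}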
11 Framework for Performance Problem Solving
- Model-based performance technology
- Instrumentation / measurement / execution models
- performance observability constraints
- performance data types and events
- Analysis / presentation model
- performance data processing
- performance views and model mapping
- Integration model
- performance tool component configuration / integration
- Can a performance problem solving framework be designed based on a general complex system model and with a performance technology model approach?
12 TAU Performance System Framework
- Tuning and Analysis Utilities
- Performance system framework for scalable parallel and distributed high-performance computing
- Targets a general complex system computation model
- nodes / contexts / threads
- Multi-level system / software / parallelism
- Measurement and analysis abstraction
- Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
- Portable performance profiling/tracing facility
- Open software approach
13 TAU Performance System Architecture
(Architecture diagram; includes connections to external trace tools/formats such as Paraver and EPILOG.)
14 Pprof Output (NAS Parallel Benchmark LU)
- Intel Quad PIII Xeon, RedHat, PGI F90
- F90 + MPICH
- Profile for each node / context / thread
- Application events and MPI events
15 jRacy (NAS Parallel Benchmark LU)
Routine profile across all nodes
n: node, c: context, t: thread
Global profiles
Individual profile
16 TAU + PAPI (NAS Parallel Benchmark LU)
- Floating point operations
- Replaces execution time
- Only requires re-linking to a different TAU library
17 TAU + Vampir (NAS Parallel Benchmark LU)
Callgraph display
Timeline display
Parallelism display
Communications display
18 Utah ASCI/ASAP Level 1 Center (C-SAFE)
- C-SAFE was established to build a problem-solving environment (PSE) for the numerical simulation of accidental fires and explosions
- Fundamental chemistry and engineering physics models
- Coupled with non-linear solvers, optimization, computational steering, visualization, and experimental data verification
- Very large-scale simulations
- Computer science problems
- Coupling of multiple simulation codes
- Software engineering across diverse expert teams
- Achieving high performance on large-scale systems
19 Example C-SAFE Simulation Problems
Heptane fire simulation
Typical C-SAFE simulation with a billion degrees of freedom and non-linear time dynamics
Material stress simulation
20 Uintah Problem Solving Environment
- Enhanced SCIRun PSE
- Pure dataflow to component-based
- Shared memory to scalable multi-/mixed-mode parallelism
- Interactive only to interactive and standalone
- Design and implement the Uintah component architecture
- Application programmers provide (see the sketch after this slide)
- description of computation (tasks and variables)
- code to perform the task on a single patch (sub-region of space)
- Follow the Common Component Architecture (CCA) model
- Design and implement the Uintah Computational Framework (UCF) on top of the component architecture
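To make this division of labor concrete, here is a minimal sketch of a task description plus per-patch code; the names (TaskSpec, Patch, the variable labels) are hypothetical, not the actual UCF API:

#include <functional>
#include <string>
#include <vector>

struct Patch { int id; };                  // sub-region of the structured grid

struct TaskSpec {
  std::string name;                        // task description
  std::vector<std::string> inputs;         // variables the task consumes
  std::vector<std::string> outputs;        // variables the task produces
  std::function<void(const Patch&)> run;   // per-patch kernel written by the
                                           // application programmer
};

int main() {
  TaskSpec interpolate;
  interpolate.name    = "interpolateParticlesToGrid";
  interpolate.inputs  = {"p.mass", "p.velocity"};   // hypothetical variables
  interpolate.outputs = {"g.mass", "g.velocity"};
  interpolate.run = [](const Patch& patch) {
    // application code for one patch goes here
    (void)patch;
  };
  // The framework, not the programmer, schedules this task over all patches
  // and inserts communication where inputs cross patch boundaries.
}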
21 Uintah High-Level Component View
22 Uintah Parallel Component Architecture
23 Uintah Computational Framework
- Execution model based on software (macro) dataflow
- Exposes parallelism and hides data transport latency
- Computations expressed as directed acyclic graphs of tasks
- each task consumes input and produces output (input to a future task)
- inputs/outputs specified for each patch in a structured grid
- Abstraction of global single-assignment memory
- DataWarehouse
- directory mapping names to values (array structured)
- write a value once, then communicate it to awaiting tasks (sketched after this list)
- Task graph gets mapped to processing resources
- Communication schedule approximates the global optimum
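A minimal sketch of the single-assignment idea behind the DataWarehouse, assuming a simple map of named arrays; this is illustrative only, not the UCF implementation:

#include <map>
#include <stdexcept>
#include <string>
#include <vector>

class DataWarehouse {
  std::map<std::string, std::vector<double>> store_;
public:
  // Write-once: a second put() to the same name is an error.
  void put(const std::string& name, std::vector<double> value) {
    if (!store_.emplace(name, std::move(value)).second)
      throw std::runtime_error("variable already written: " + name);
  }
  // Any task whose inputs are present may read them freely.
  const std::vector<double>& get(const std::string& name) const {
    return store_.at(name);
  }
};

int main() {
  DataWarehouse dw;
  dw.put("g.mass", {1.0, 2.0});        // a producing task writes once
  double first = dw.get("g.mass")[0];  // awaiting tasks read the value
  return first > 0 ? 0 : 1;
}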
24 Uintah Task Graph (Material Point Method)
- Diagram of named tasks (ovals) and data (edges)
- Imminent computation
- Dataflow-constrained
- MPM
- Newtonian material point motion time step
- Solid: values defined at material points (particles)
- Dashed: values defined at vertices (grid)
- Primed (') values updated during the time step
25 Example Taskgraphs (MPM and Coupled)
26 Taskgraph Advantages
- Accommodates flexible integration needs
- Accommodates a wide range of unforeseen workloads
- Accommodates a mix of static and dynamic load balance
- Manages the complexity of mixed-mode programming
- Avoids unnecessary transport abstraction overheads
- Simulation time/space coupling
- Allows a uniform abstraction for coordinating coupled models' time and grid scales
- Allows application components and framework infrastructure (e.g., the scheduler) to evolve independently
27 Uintah PSE
- UCF automatically sets up
- Domain decomposition
- Inter-processor communication with aggregation/reduction
- Parallel I/O
- Checkpoint and restart
- Performance measurement and analysis (stay tuned)
- Software engineering
- Coding standards
- CVS (commits: Y3 - 26.6 files/day, Y4 - 29.9 files/day)
- Correctness regression testing with Bugzilla bug tracking
- Nightly build (parallel compiles)
- 170,000 lines of code (Fortran and C tasks supported)
28 Performance Technology Integration
- Uintah presents challenges to performance integration
- Software diversity and structure
- UCF middleware, simulation code modules
- component-based hierarchy
- Portability objectives
- cross-language and cross-platform
- multi-parallelism: threads, message passing, mixed
- Scalability objectives
- High-level programming and execution abstractions
- Requires flexible and robust performance technology
- Requires support for performance mapping
29 Performance Analysis Objectives for Uintah
- Micro tuning
- Optimization of simulation code (task) kernels for maximum serial performance
- Scalability tuning
- Identification of parallel execution bottlenecks
- overheads: scheduler, data warehouse, communication
- load imbalance
- Adjustment of task graph decomposition and scheduling
- Performance tracking
- Understand performance impacts of code modifications
- Throughout the course of software development
- C-SAFE application and UCF software
30 Uintah Performance Engineering Approach
- Contemporary performance methodology focuses on control flow (function) level measurement and analysis
- The C-SAFE application involves coupled models with task-based parallelism and dataflow control constraints
- Performance engineering on an algorithmic (task) basis
- Observe performance based on algorithm (task) semantics
- Analyze task performance characteristics in relation to other simulation tasks and UCF components
- scientific component developers can concentrate on performance improvement at the algorithmic level
- UCF developers can concentrate on bottlenecks not directly associated with simulation module code
31 Task Execution in Uintah Parallel Scheduler
- Profile methods and functions in the scheduler and in the MPI library (interposition sketch below)
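Profiling the MPI library is conventionally done through the MPI standard's name-shifted profiling (PMPI) interface, the mechanism TAU's MPI wrapper builds on. A bare-bones interposition of one call, with the actual timing logic elided, looks roughly like this (signature per modern MPI; compile and link as a wrapper library with the application):

#include <mpi.h>

// Intercept MPI_Send: record the event, then forward to the real
// implementation through its PMPI_ name-shifted entry point.
extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype datatype,
                        int dest, int tag, MPI_Comm comm) {
  // start timer / record "MPI_Send" event here
  int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
  // stop timer here
  return rc;
}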
Task execution time dominates (what task?)
Task execution time distribution
MPI communication overheads (where?)
- Need to map performance data!
32 Semantics-Based Performance Mapping
- Associate performance measurements with high-level semantic abstractions
- Need mapping support in the performance measurement system to assign data correctly
33 Hypothetical Mapping Example
- Particles distributed on surfaces of a cube

Particle P[MAX];                /* Array of particles */
int GenerateParticles() {
  /* distribute particles over all faces of the cube */
  for (int face = 0, last = 0; face < 6; face++) {
    /* particles on this face */
    int particles_on_this_face = num(face);
    for (int i = last; i < particles_on_this_face; i++) {
      /* particle properties are a function of face */
      P[i] = ... f(face) ...;
    }
    last = particles_on_this_face;
  }
}
34 Hypothetical Mapping Example (continued)

int ProcessParticle(Particle *p) {
  /* perform some computation on p */
}

int main() {
  GenerateParticles();          /* create a list of particles */
  for (int i = 0; i < N; i++)   /* iterate over the list */
    ProcessParticle(P[i]);
}

- How much time is spent processing face i particles?
- What is the distribution of performance among faces?
- How is this determined if execution is parallel?
35 Semantic Entities/Attributes/Associations (SEAA)
- New dynamic mapping scheme (S. Shende, Ph.D. thesis)
- Contrast with ParaMap (Miller and Irvin)
- Entities defined at any level of abstraction
- Attribute an entity with semantic information
- Entity-to-entity associations
- Two association types (implemented in the TAU API; sketched below)
- Embedded: extends the data structure of the associated object to store the performance measurement entity
- External: creates an external look-up table, using the address of the object as the key to locate the performance measurement entity
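A compact sketch of the two association types, with hypothetical names; the TAU mapping macros shown on a later slide use the external variant:

#include <map>
#include <string>

struct Timer { std::string name; /* accumulated time, counts, ... */ };

// Embedded association: the application object itself carries a pointer to
// its performance measurement entity.
struct ParticleEmbedded {
  double properties[3];
  Timer* timer;                        // extends the object's data structure
};

// External association: a look-up table keyed by the object's address.
std::map<const void*, Timer*> timer_table;

Timer* lookup(const void* object) {    // find the timer for this object
  auto it = timer_table.find(object);
  return it == timer_table.end() ? nullptr : it->second;
}

int main() {
  Timer t1{"embedded timer"}, t2{"external timer"};
  ParticleEmbedded p{{0.0, 0.0, 0.0}, &t1};  // embedded: object carries timer
  timer_table[&p] = &t2;                     // external: keyed by address
  return lookup(&p) == &t2 ? 0 : 1;
}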
36 No Performance Mapping versus Mapping
- Typical performance tools report performance with respect to routines
- Does not provide support for mapping
- Performance tools with SEAA mapping can observe performance with respect to the scientist's programming and problem abstractions
TAU (w/ mapping)
TAU (no mapping)
37 Uintah Task Performance Mapping
- Uintah partitions individual particles across processing elements (processes or threads)
- Simulation tasks in the task graph work on particles
- Tasks have domain-specific character in the computation
- interpolate particles to grid in the Material Point Method
- Task instances generated for each partitioned particle set
- Execution scheduled with respect to task dependencies
- How to attribute execution time among different tasks?
- Assign a semantic name (task type) to a task instance
- SerialMPM::interpolateParticleToGrid
- Map a TAU timer object to the (abstract) task (semantic entity)
- Look up the timer object using the task type (semantic attribute)
- Further partition along different domain-specific axes
38 Task Performance Mapping Instrumentation

void MPIScheduler::execute(const ProcessorGroup* pc,
                           DataWarehouseP& old_dw, DataWarehouseP& dw)
{
  ...
  TAU_MAPPING_CREATE(task->getName(), "MPIScheduler::execute()",
                     (TauGroup_t)(void*)task->getName(),
                     task->getName(), 0);
  ...
  TAU_MAPPING_OBJECT(tautimer)
  TAU_MAPPING_LINK(tautimer, (TauGroup_t)(void*)task->getName());
  // EXTERNAL ASSOCIATION
  ...
  TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0)
  TAU_MAPPING_PROFILE_START(doitprofiler, 0)
  task->doit(pc);
  TAU_MAPPING_PROFILE_STOP(0)
  ...
}
39 Task Performance Mapping (Profile)
Mapped task performance across processes
Performance mapping for different tasks
40 Task Performance Mapping (Trace)
Work packet computation events colored by task type
Distinct phases of computation can be identified based on task
41 Task Performance Mapping (Trace - Zoom)
Startup communication imbalance
42 Task Performance Mapping (Trace - Parallelism)
Communication / load imbalance
43 Comparing Uintah Traces for Scalability Analysis
44 Scaling Performance Optimizations
Last year: initial correct scheduler
Reduced communication by 10x
Reduced task graph overhead by 20x
ASCI Nirvana, SGI Origin 2000, Los Alamos National Laboratory
45 Scalability to 2000 Processors (Fall 2001)
ASCI Nirvana, SGI Origin 2000, Los Alamos National Laboratory
46 Performance Tracking and Reporting
- Integrated performance measurement allows performance analysis throughout the development lifetime
- Applied performance engineering in the software design and development (software engineering) process
- Create a performance portfolio from regular performance experimentation (coupled with software testing)
- Use performance knowledge in making key software design decisions, prior to major development stages
- Use performance benchmarking and regression testing to identify irregularities
- Support automatic reporting of performance bugs
- Cross-platform (cross-generation) evaluation
47 XPARE - eXPeriment Alerting and REporting
- Experiment launcher automates measurement / analysis
- Configuration and compilation of performance tools
- Uintah instrumentation control for experiment type
- Multiple experiment execution
- Performance data collection, analysis, and storage
- Integrated in the Uintah software testing harness
- Reporting system conducts performance regression tests
- Apply performance difference thresholds (alert ruleset; see the sketch after this list)
- Alert users via email if thresholds have been exceeded
- Web alerting setup and full performance data reporting
- Historical performance data analysis
48 XPARE System Architecture
Experiment Launch
Performance Database
Performance Reporter
Comparison Tool
Regression Analyzer
Alerting Setup
49 Alerting Setup
50 Experiment Results Viewing Selection
51 Web-Based Experiment Reporting
52 Web-Based Experiment Reporting (continued)
53 Web-Based Experiment Reporting (continued)
54 Performance Analysis Tool Integration
- Complex systems pose challenging performance analysis problems that require robust methodologies and tools
- New performance problems will arise
- Instrumentation and measurement
- Data analysis and presentation
- Diagnosis and tuning
- No one performance tool can address all concerns
- Look towards an integration of performance technologies
- Support to link technologies to create performance problem solving environments
- Performance engineering methodology and tool integration with the software design and development process
55 Integrated Performance Evaluation Environment
56 References
- A. Malony and S. Shende, "Performance Technology for Complex Parallel and Distributed Systems," Proc. 3rd Workshop on Parallel and Distributed Systems (DAPSYS), pp. 37-46, Aug. 2000.
- S. Shende, A. Malony, and R. Ansell-Bell, "Instrumentation and Measurement Strategies for Flexible and Portable Empirical Performance Evaluation," Proc. Intl. Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA), CSREA, pp. 1150-1156, July 2001.
- S. Shende, "The Role of Instrumentation and Mapping in Performance Measurement," Ph.D. Dissertation, Univ. of Oregon, Aug. 2001.
- J. de St. Germain, A. Morris, S. Parker, A. Malony, and S. Shende, "Integrating Performance Analysis in the Uintah Software Development Cycle," ISHPC 2002, Nara, Japan, May 2002.