Title: A Framework for Online Performance Analysis and Visualization of Large-Scale Parallel Applications

1. A Framework for Online Performance Analysis and Visualization of Large-Scale Parallel Applications
- Kai Li, Allen D. Malony, Robert Bell, Sameer Shende
- {likai, malony, bertie, sameer}_at_cs.uoregon.edu
- Department of Computer and Information Science
- Computational Science Institute, NeuroInformatics Center
- University of Oregon
2. Outline
- Problem description
- Scaling and performance observation
- Interest in online performance analysis
- General online performance system architecture
- Access models
- Profiling issues and control issues
- Framework for online performance analysis
- TAU performance system
- SCIRun computational and visualization environment
- Experiments
- Conclusions and future work
3. Problem Description
- Need for parallel performance observation
- Instrumentation, measurement, analysis, visualization
- In general, there is the concern for intrusion
- Seen as a tradeoff with accuracy of performance diagnosis
- Scaling complicates observation and analysis
- Issues of data size, processing time, and presentation
- Online approaches add capabilities as well as problems
- Performance interaction, but at what cost?
- Tools for large-scale performance observation online
- Supporting performance system architecture
- Tool integration, effective usage, and portability
4. Scaling and Performance Observation
- Consider traditional measurement methods
- Profiling: summary statistics calculated during execution
- Tracing: time-stamped sequence of execution events
- More parallelism → more performance data overall
- Performance specific to each thread of execution
- Possible increase in number of interactions between threads
- Harder to manage the data (memory, transfer, storage, ...)
- More parallelism / performance data → harder analysis
- More time-consuming to analyze
- More difficult to visualize (meaningful displays)
- Need techniques to address scaling at all levels
5. Why Complicate Matters with Online Methods?
- Adds interactivity to the performance analysis process
- Opportunity for dynamic performance observation
- Instrumentation change
- Measurement change
- Allows for control of performance data volume
- Post-mortem analysis may be too late
- View on status of long-running jobs
- Allow for early termination
- Computation steering to achieve better results
- Performance steering to achieve better performance
- Online performance observation may be intrusive
6. Related Ideas
- Computational steering
- Falcon (Schwan, Vetter): computational steering
- Dynamic instrumentation and performance search
- Paradyn (Miller): online performance bottleneck analysis
- Adaptive control and performance steering
- Active Harmony (Hollingsworth): automatic decision control
- Autopilot (Reed): actuator/sensor performance steering
- Scalable monitoring
- Peridot (Gerndt): automatic online performance analysis
- MRNet (Miller): multicast reduction network for access / control
- Scalable analysis and visualization
- VNG (Brunst): parallel trace analysis
7. General Online Performance Observation System
8. Models of Performance Data Access (Monitoring)
- Push model
- Producer/consumer style of access and transfer
- Application decides when/what/how much data to send
- External analysis tools only consume performance data
- Availability of new data is signaled passively or actively
- Pull model
- Client/server style of performance data access and transfer
- Application is a performance data server
- Access decisions are made externally by analysis tools
- Two-way communication is required
- Push/pull models
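The two access models can be sketched in Python. This is a minimal, hypothetical illustration of the two styles, not TAU's actual transport; all names here are invented for the sketch:

```python
import queue

# Push model: the application (producer) decides when to publish a
# profile sample; the analysis tool (consumer) only receives data.
class PushChannel:
    def __init__(self):
        self._q = queue.Queue()

    def publish(self, sample):       # called by the application
        self._q.put(sample)

    def consume(self):               # called by the analysis tool
        return self._q.get()

# Pull model: the application acts as a performance data server;
# the analysis tool makes explicit requests (two-way communication).
class ProfileServer:
    def __init__(self):
        self._current = {}

    def update(self, event, value):  # application-side bookkeeping
        self._current[event] = value

    def request(self, event):        # analysis-tool-initiated access
        return self._current.get(event)

push = PushChannel()
push.publish({"MPI_Recv": 1.5})
assert push.consume() == {"MPI_Recv": 1.5}

server = ProfileServer()
server.update("MPI_Recv", 1.5)
assert server.request("MPI_Recv") == 1.5
```

The asymmetry is the point: in the push sketch only the producer initiates transfers, while in the pull sketch every transfer is initiated by the external tool.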
9. Online Profiling Issues
- Profiles are summary statistics of performance
- Kept with respect to some unit of parallel execution
- Profiles are distributed across the machine (in memory)
- Must be gathered and delivered to the profile analysis tool
- Profile merging must take place (possibly in parallel)
- Consistency checking of profile data
- Callstack must be updated to generate correct profile data
- Correct communication statistics may require completion
- Event identification (not necessary if event names are saved)
- Sequence of profile samples allows interval analysis
- Interval frequency depends on profile collection delay
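Because profiles are cumulative summary statistics, a sequence of profile dumps supports interval analysis by differencing successive samples. A minimal sketch (the event names and values are hypothetical, not TAU's profile format):

```python
def interval_profile(prev, curr):
    """Per-event time spent in the interval between two cumulative dumps."""
    return {event: curr[event] - prev.get(event, 0.0) for event in curr}

# Two successive cumulative profile samples for one thread (seconds).
sample_t1 = {"MPI_Recv": 2.0, "compute": 10.0}
sample_t2 = {"MPI_Recv": 3.5, "compute": 14.0, "MPI_Reduce": 0.5}

delta = interval_profile(sample_t1, sample_t2)
assert delta == {"MPI_Recv": 1.5, "compute": 4.0, "MPI_Reduce": 0.5}
```

The achievable interval resolution is bounded by the profile collection delay noted above: intervals shorter than the time to dump and deliver a sample cannot be observed.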
10. Performance Control
- Instrumentation control
- Dynamic instrumentation
- Inserts / removes instrumentation at runtime
- Measurement control
- Dynamic measurement
- Enabling / disabling / changing of measurement code
- Dynamic instrumentation or measurement variables
- Data access control
- Selection of what performance data to access
- Control of frequency of access
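Measurement control, as opposed to instrumentation control, leaves the probe calls in place but toggles whether they measure anything. A hypothetical sketch (this is not TAU's API, just an illustration of the enable/disable idea):

```python
import time

class ControlledTimer:
    """Event timer whose measurement code can be toggled at runtime."""
    def __init__(self, name):
        self.name = name
        self.enabled = True          # dynamic measurement control flag
        self.total = 0.0
        self.calls = 0
        self._start = None

    def start(self):
        if self.enabled:
            self._start = time.perf_counter()

    def stop(self):
        if self.enabled and self._start is not None:
            self.total += time.perf_counter() - self._start
            self.calls += 1
            self._start = None

t = ControlledTimer("compute")
t.start(); t.stop()       # measured: one call recorded
t.enabled = False         # measurement control: disable without removing calls
t.start(); t.stop()       # instrumentation still executes, but records nothing
assert t.calls == 1
```

Disabling this way reduces measurement overhead and data volume while avoiding the cost and complexity of removing the instrumentation itself.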
11. TAU Performance System Framework
- Tuning and Analysis Utilities (aka Tools Are Us)
- Performance system framework for scalable parallel and distributed high-performance computing
- Targets a general complex system computation model
- Nodes / contexts / threads
- Multi-level: system / software / parallelism
- Measurement and analysis abstraction
- Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
- Portable performance profiling/tracing facility
- Open software approach
12. TAU Performance System Architecture
[Architecture diagram; analysis/visualization components shown include Paraver, EPILOG, and ParaProf]
13. Online Profile Measurement and Analysis in TAU
- Standard TAU profiling
- Per node/context/thread
- Profile dump routine
- Context-level
- Profile file per each thread in context
- Appends to profile file
- Selective event dumping
- Analysis tools access files through shared file system
- Application-level profile access routine
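The dump-and-read mechanism above can be sketched as follows. This is a hypothetical illustration of the append-to-file pattern using JSON lines, not TAU's actual profile file format or naming scheme:

```python
import json, os, tempfile

# Writer side (application): each call to the dump routine appends the
# current cumulative profile sample to a per-thread file.
def dump_profile(path, sample):
    with open(path, "a") as f:
        f.write(json.dumps(sample) + "\n")

# Reader side (analysis tool): read every sample accumulated so far,
# via the shared file system.
def read_samples(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

path = os.path.join(tempfile.mkdtemp(), "profile.0.0.0")  # node.context.thread
dump_profile(path, {"MPI_Recv": 2.0})
dump_profile(path, {"MPI_Recv": 3.5})   # later dump, larger cumulative value
assert [s["MPI_Recv"] for s in read_samples(path)] == [2.0, 3.5]
```

Appending rather than overwriting is what gives the reader a sequence of samples to work with, at the cost of file growth over a long run.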
14. Online Performance Analysis and Visualization
[Diagram: the application, instrumented with the TAU Performance System, writes performance data output (accumulated samples) to the file system; a Performance Data Reader and a Performance Data Integrator turn these into performance data streams for the Performance Analyzer and Performance Visualizer in SCIRun (Univ. of Utah); sample sequencing provides reader synchronization]
15. Profile Sample Data Structure in SCIRun
[Figure: profile samples organized hierarchically by node / context / thread]
16. Performance Analysis/Visualization in SCIRun
[Screenshot: SCIRun program]
17. Uintah Computational Framework (UCF)
- University of Utah
- UCF analysis
- Scheduling
- MPI library
- Components
- 500 processes
- Use for online and offline visualization
- Apply SCIRun steering
18. Terrain Performance Visualization
19. Scatterplot Displays
- Each point coordinate determined by three values:
- MPI_Reduce
- MPI_Recv
- MPI_Waitsome
- Min/max value range
- Effective for cluster analysis
- Relation between MPI_Recv and MPI_Waitsome
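Constructing such a display amounts to mapping each thread's three event values to one 3D coordinate, normalized by the min/max value range of each event. A sketch with hypothetical per-thread timings:

```python
def normalize(values):
    """Map values into [0, 1] using their min/max range."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0   # guard against a constant series
    return [(v - lo) / span for v in values]

# Hypothetical exclusive times (seconds) per thread for three MPI events.
reduce_t   = [0.2, 0.4, 0.6, 0.8]
recv_t     = [1.0, 1.0, 3.0, 3.0]
waitsome_t = [0.5, 1.5, 0.5, 1.5]

# One 3D scatter point per thread; nearby points form behavioral clusters.
points = list(zip(normalize(reduce_t),
                  normalize(recv_t),
                  normalize(waitsome_t)))
assert points[0] == (0.0, 0.0, 0.0)
assert points[3] == (1.0, 1.0, 1.0)
```

Threads with similar communication behavior land close together in this space, which is what makes the display effective for cluster analysis.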
20. Online Uintah Performance Profiling
- Demonstration of online profiling capability
- Colliding elastic disks
- Test material point method (MPM) code
- Executed on 512 processors of ASCI Blue Pacific at LLNL
- Example 1 (Terrain visualization)
- Exclusive execution time across event groups
- Multiple time steps
- Example 2 (Bargraph visualization)
- MPI execution time and performance mapping
- Example 3 (Domain visualization)
- Task time allocation to patches
21. Example 1 (Event Groups)
22. Example 2 (MPI Performance)
23. Example 3 (Domain-Specific Visualization)
24. ParaProf Framework Architecture
- Portable, extensible, and scalable tool for profile analysis
- Offers best-of-breed capabilities to performance analysts
- Built as a profile analysis framework for extensibility
25. ParaProf Profile Display (VTF)
- Virtual Testshock Facility (VTF), Caltech ASCI Center
- Dynamic measurement, online analysis, visualization
26. Full Profile Display (SAMRAI)
- Structured AMR toolkit (SAMRAI), LLNL
- 512 processes
27. Evaluation of Experimental Approaches
- Currently only supporting push model
- File system solution for moving performance data
- Is this a scalable solution?
- Robust solution that can leverage high-performance I/O
- May result in high intrusion
- However, does not require IPC
- Should be relatively portable
- Analysis and visualization only run sequentially
28. Possible Improvements
- Profile merging at context level to reduce number of files
- Merging at node level may require explicit processing
- Concurrent trace merging could also reduce files
- Hierarchical merge tree
- Will require explicit processing
- Could consider IPC transfer
- MPI (e.g., used in mpiP for profile merging)
- Create own communicators
- Sockets or PACX between compute server and analyzer
- Leverage large-scale systems infrastructure
- Parallel profile analysis
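The hierarchical merge tree idea can be sketched as a pairwise reduction over per-thread profiles, giving roughly log2(n) merge levels instead of one tool reading n files. A hypothetical serial sketch (in practice each level would run in parallel, e.g. over MPI):

```python
def merge(a, b):
    """Combine two profiles by summing per-event totals."""
    out = dict(a)
    for event, t in b.items():
        out[event] = out.get(event, 0.0) + t
    return out

def tree_merge(profiles):
    """Hierarchical merge tree: pairwise reduction over the profile list."""
    while len(profiles) > 1:
        nxt = [merge(profiles[i], profiles[i + 1])
               for i in range(0, len(profiles) - 1, 2)]
        if len(profiles) % 2:        # an odd leftover passes up unmerged
            nxt.append(profiles[-1])
        profiles = nxt
    return profiles[0]

per_thread = [{"MPI_Recv": 1.0}, {"MPI_Recv": 2.0},
              {"MPI_Recv": 3.0}, {"compute": 4.0}]
assert tree_merge(per_thread) == {"MPI_Recv": 6.0, "compute": 4.0}
```

Merging at the context or node level is the first rung of exactly this tree: it shrinks the number of files the analysis tool must read without requiring the tool itself to change.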
29. Concluding Remarks
- Interest in online performance monitoring, analysis, and visualization for large-scale parallel systems
- Need to use it intelligently
- Benefit from other scalability considerations of the system software and system architecture
- See it as an extension of the parallel system architecture
- Avoid solutions that have portability difficulties
- In part, this is an engineering problem
- Need to work with the system configuration you have
- Need to understand if the approach is applicable to the problem
- Not clear if there is a single solution
30. Future Work
- Build online support into the TAU performance system
- Extend to support pull-model capabilities
- Develop hierarchical data access solutions
- Performance studies of the full system
- Latency analysis
- Bandwidth analysis
- Integration with other performance tools
- System performance monitors
- ParaProf parallel profile analyzer
- Development of a 3D visualization library
- Portability focus