Title: Advanced Techniques for Performance Analysis
1. Advanced Techniques for Performance Analysis
- Michael Gerndt
- Technische Universität München
- gerndt@in.tum.de
2. Performance Analysis is Essential
3. Scaling Portability Profoundly Interesting
A high-level description of the performance of the cosmology code MADCAP on four well-known architectures.
Source: David Skinner, NERSC
4. 16-Way for 4 seconds
(About 20 timestamps per second per task) (14 contextual variables)
5. 64-Way for 12 seconds
6. Performance Analysis for Parallel Systems
- Development cycle
- Assumption: Reproducibility
- Instrumentation
- Static vs Dynamic
- Source-level vs binary-level
- Monitoring
- Software vs Hardware
- Statistical profiles vs event traces
- Analysis
- Source-based tools
- Visualization tools
- Automatic analysis tools
Development cycle: Coding → Performance Monitoring and Analysis → Program Tuning → Production
7. Overhead Analysis
- How to decide whether a code performs well?
- Comparison of measured MFLOPS with peak performance
- Comparison with a sequential version
- Estimate distance to ideal time via overhead classes (a formalization follows the figure below)
- tmem
- tcomm
- tsync
- tred
- ...
[Figure: measured speedup versus number of processors; the gap between measured and ideal speedup is broken down into the overhead classes tmem, tcomm, and tred.]
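One way to formalize this decomposition (an assumed reading of the overhead classes, not stated verbatim on the slide): with sequential time T_seq and p processors,

    T(p) = T_seq / p + t_mem + t_comm + t_sync + t_red + ...
    speedup(p) = T_seq / T(p)

so the distance of speedup(p) from the ideal value p is attributed to the individual overhead classes.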
8. IST Working Group on Automatic Performance Analysis: Real Tools (APART)
9. APART Definitions
- Performance property
- Condition
- Confidence
- Severity
- Performance problem
- Property with
- Condition TRUE
- Severity > Threshold
- Bottleneck
- Most severe performance problem
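A minimal sketch of how these definitions compose, in C++ with illustrative names (not ASL's or Periscope's actual interfaces): a property exposes condition, confidence, and severity; a problem is a property whose condition holds with severity above a threshold; the bottleneck is the most severe problem.

    #include <vector>

    // Illustrative only: a generic view of the APART definitions,
    // not Periscope's actual class hierarchy.
    struct Property {
        virtual bool   condition()  const = 0;  // does the property hold?
        virtual double confidence() const = 0;  // degree of certainty in [0,1]
        virtual double severity()   const = 0;  // importance of the property
        virtual ~Property() = default;
    };

    // A performance problem: the condition holds and the severity
    // exceeds a user-chosen threshold.
    bool isProblem(const Property& p, double threshold) {
        return p.condition() && p.severity() > threshold;
    }

    // The bottleneck is the most severe performance problem found.
    const Property* bottleneck(const std::vector<const Property*>& props,
                               double threshold) {
        const Property* worst = nullptr;
        for (const Property* p : props)
            if (isProblem(*p, threshold) && (!worst || p->severity() > worst->severity()))
                worst = p;
        return worst;
    }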
10. Performance Analysis
11. Why Automated Performance Analysis?
- Large number of processors → huge amounts of performance data
- Huge data sets are hard to analyze manually
- Performance analysis requires detailed knowledge of the system (hardware, communications middleware, ...)
- Automation allows novice programmers to use experts' knowledge
12. Automated Performance Analysis Examples
- Paradyn/Performance Consultant
- Dynamic instrumentation
- W3 Search model (why/where/when)
- www.paradyn.org
- Expert
- Searching program traces
- www.fz-juelich.de/zam/kojak
- Aksum
- JavaPSL, profile info in a database
- dps.uibk.ac.at/projects/aksum
- KappaPi 2
- Searching traces
- Static info
13. Periscope Project at Technische Universität München
- Previous projects (2001-2005)
- Peridot
- EP-Cache
- Periscope (2005-2006)
- Develop an automated distributed performance analysis tool
- Periscope approach
- Automated search based on ASL
- Online analysis
- Distributed search
14. APART Specification Language
- Performance-related data
- Static data
- Dynamic data
- Performance properties
- condition: checks existence
- confidence: degree of certainty
- severity: importance
15. MPI Data Model
class RegionSummary {
  Region reg;
  float duration;
  float commTime;
  float ioTime;
  float syncTime;
  float idleTime;
  int nrExecutions;
  setof ProcSummary processSums;
}
16. Communication Costs
property communication_costs(Region r, Experiment e, Region rank_basis) {
  LET
    float cost = summary(r,e).commTime;
  IN
  CONDITION : cost > 0;
  CONFIDENCE : 1;
  SEVERITY : cost / duration(rank_basis, e);
}
17. Periscope Design
[Architecture diagram: the Performance Cockpit sits on top of a Performance Analysis Agent Network, which accesses the monitors through the MRI; the Peridot Monitor focuses on OpenMP/MPI, the EP-Cache Monitor on the memory hierarchy.]
18. Agent Design Goals
- Flexibility with respect to software reuse, e.g., EP-Cache, PALMA
- Flexibility with respect to the performance data model
- Different monitoring systems - different performance data
- Flexibility in property database
- Extending / changing the set of available ASL properties does not require changing the performance tool
19. Node Agents
- Searching for local performance properties
- Properties and data model based on ASL specification
- Data model manually mapped to monitor
- Properties can be dynamically defined
- Transformation ASL → C classes currently done manually (a sketch follows below)
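Since the ASL-to-class translation is done by hand (previous bullet), here is a minimal sketch of what a hand-written class for the communication_costs property of slide 16 might look like. It is written in C++ on top of a simplified version of the RegionSummary data model from slide 15; none of these names are Periscope's actual code.

    // Simplified performance-data record, mirroring the ASL RegionSummary
    // from slide 15 (ProcSummary set omitted for brevity).
    struct RegionSummary {
        int   regionId;      // stands in for the ASL Region reference
        float duration;
        float commTime;
        float ioTime;
        float syncTime;
        float idleTime;
        int   nrExecutions;
    };

    // Hand-translated communication_costs property:
    //   CONDITION  : commTime > 0
    //   CONFIDENCE : 1
    //   SEVERITY   : commTime / duration of the ranking-basis region
    class CommunicationCosts {
    public:
        CommunicationCosts(const RegionSummary& region, const RegionSummary& rankBasis)
            : region_(region), rankBasis_(rankBasis) {}

        bool   condition()  const { return region_.commTime > 0.0f; }
        double confidence() const { return 1.0; }
        double severity()   const { return region_.commTime / rankBasis_.duration; }

    private:
        RegionSummary region_;
        RegionSummary rankBasis_;
    };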
20. Classes for Monitor Adaptation
21. Agent Structure
22. Agent Search Strategies
- Predefined search strategies in repository
- Current strategies
- Non phase-based
- Periodic evaluation of all properties
- Peridot monitor
- No monitor configuration
- Phase-based
- Region nesting and data structure-based refinement (illustrated in the sketch below)
- Property hierarchy-based refinement
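A minimal sketch of what region-nesting and data-structure-based refinement could look like (illustrative C++ under assumed interfaces, not Periscope's implementation): regions whose severity stays below the threshold are pruned, significant regions are attributed to their data structures, and their subregions become the candidates for the next phase execution, as the crystal growth example on slides 36-40 illustrates.

    #include <functional>
    #include <string>
    #include <vector>

    // Illustrative types only; Periscope's real interfaces differ.
    struct Region {
        std::string name;
        std::vector<Region> subregions;       // nested calls, loops, ...
        std::vector<std::string> dataStructs; // arrays accessed in the region
    };

    // Measurement hook: severity of one property for a region, optionally
    // restricted to a single data structure, taken during the next phase.
    using Measure = std::function<double(const Region&, const std::string&)>;

    // Region-nesting and data-structure-based refinement.
    void refine(const std::vector<Region>& candidates, double threshold,
                const Measure& severity) {
        std::vector<Region> next;
        for (const Region& r : candidates) {
            if (severity(r, "") <= threshold)
                continue;                              // below threshold: prune
            for (const std::string& ds : r.dataStructs)
                severity(r, ds);                       // e.g. LC1 misses per array
            next.insert(next.end(), r.subregions.begin(), r.subregions.end());
        }
        if (!next.empty())
            refine(next, threshold, severity);         // descend one nesting level
    }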
23. Application Phases
- Portion of the program's execution
- Phases are defined by program regions called phase regions
- Standard regions
- Full program
- Functions
- Parallel loops
- User regions
- Repetitive and execute-once phase regions (an example of a repetitive phase follows this list)
- Phase boundaries have to be global (SPMD programs)
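As an example of a repetitive phase region, here is a minimal SPMD sketch (illustrative C++/MPI, not taken from the slides): the body of the outer time loop forms the phase, and its boundary is reached by all processes in every iteration.

    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        const int timesteps = 100;
        for (int t = 0; t < timesteps; ++t) {
            // ---- repetitive phase region: one iteration of the time loop ----
            // compute and exchange data here (cf. CURR, VELO, TEMP, BOUND and
            // MPI_ALLREDUCE in the crystal growth example on slide 35)
            double local = 0.0, global = 0.0;
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
            // The phase boundary is global, so the agents can reconfigure
            // measurements between iterations.
        }

        MPI_Finalize();
        return 0;
    }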
24. Monitor Design: Distribution
- Offloading monitoring overhead
- Connected via shared memory (SysV Shmem, RDMA)
25. Monitor Design: Configurability
26. New Hardware Monitor for Address Range Monitoring
- Monitor Control Unit
- Provides access interface
- Controls operation
- Cache Level Monitor
- Single CLM per level
- Event- / counter logic
- Associative counter array
- Measurement Modes
- Static mode
- Dynamic mode
27. Static vs Dynamic Mode
28. Monitor Design: Proposed Standard Interface
[Diagram: the Monitoring Request Interface (MRI) as a standard interface between the monitor and tools such as the VAMPIR archiver and the Periscope agent.]
29. Monitoring Request Interface (MRI)
- Request specification
- Request runtime information for
- Code Regions
- Active Objects
- With some aggregation
- Sum over all instances in a single thread
- Sum over all instances in all threads
- ...
- Application Control
- Start
- Start/Stop at region
- Wait
- Interrupt
30. Example
- Monitor the LC3 Read Cache Misses
- in the DO-Loop (file_id 40, line_nr 13)
- for array A
- in form of a histogram
- with granularity of 200 Blocks and
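To make the request of slide 30 concrete, here is a hypothetical C++ encoding of the information it has to carry. Every type and field name is invented for illustration; this is not the actual MRI API.

    #include <string>

    // Hypothetical request record, invented for illustration only.
    struct MriRequest {
        std::string metric      = "LC3_READ_CACHE_MISSES"; // what to measure
        int         fileId      = 40;   // code region: DO-loop in file_id 40
        int         lineNr      = 13;   // line_nr 13
        std::string arrayName   = "A";  // active object: array A
        bool        histogram   = true; // aggregate as an address-range histogram
        int         granularity = 200;  // blocks per histogram bin
    };

    int main() {
        MriRequest req;   // corresponds to the example on slide 30
        // A tool would pass such a request to the monitor through the MRI and
        // later fetch the per-block miss histogram for array A.
        (void)req;
        return 0;
    }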
31. Distribution Strategies
- Analysis control
- Initial Information
- Selection of search strategy
- Hierarchy of agents
- Combination of properties (Coarsening)
- Handling distributed properties
32. Usage Scenario (1)
- Prepare the application
- Instrumentation (a wrapper/hook sketch follows this list)
- MPI: MPI Wrapper library
- OpenMP: Opari source-to-source instrumenter
- Functions (-finstrument-functions for the GNU compilers)
- F90inst source-level instrumenter
- Interactive startup of target application
- E.g., mpirun <application>
- Application is halted on the first call to the monitoring library
- Interactive startup of Periscope
- periscope <name of the application>
- A registry server is contacted for all nodes on which the target application is executed
- A NodeAgent is started on all nodes
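A minimal sketch of the two generic instrumentation routes mentioned above: an MPI wrapper via the standard PMPI interface and the hooks emitted by GCC's -finstrument-functions. The record_* functions are placeholders for calls into a monitoring library, not Periscope's actual interface; compile this wrapper file itself without -finstrument-functions to avoid recursive instrumentation.

    #include <mpi.h>

    // Placeholders for the monitoring library; names are illustrative only.
    extern "C" void record_comm_time(double seconds);
    extern "C" void record_region_enter(void* fn);
    extern "C" void record_region_exit(void* fn);

    // --- MPI wrapper library: intercept MPI calls via the PMPI interface ---
    // (MPI-3 signature; older MPI versions declare buf as void*)
    extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype dt,
                            int dest, int tag, MPI_Comm comm) {
        double start = MPI_Wtime();
        int rc = PMPI_Send(buf, count, dt, dest, tag, comm);  // real MPI call
        record_comm_time(MPI_Wtime() - start);
        return rc;
    }

    // --- Function instrumentation: hooks called on every function entry/exit
    //     when the application is compiled with -finstrument-functions ---
    extern "C" void __cyg_profile_func_enter(void* fn, void* callsite) {
        (void)callsite;
        record_region_enter(fn);
    }
    extern "C" void __cyg_profile_func_exit(void* fn, void* callsite) {
        (void)callsite;
        record_region_exit(fn);
    }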
33. Usage Scenario (3)
[Deployment diagram: Application, NodeAgent, Directory Service, and MasterAgent, spread over the interactive node and the compute nodes.]
34. Crystal Growth Simulation
- LC1MissesOverMemRef
- LC1ReadMissesOverMemRef
- LC1WriteMissesOverMemRef
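Reading these property names (an interpretation, not stated explicitly on the slides), the severity values reported in the following iterations are cache-miss ratios, e.g.

    severity(LC1MissesOverMemRef) = (L1 cache misses in the region) / (memory references in the region)

so a severity of 0.68 for the loop at bound.f, line 139 on slide 38 would mean that roughly two thirds of its memory references miss the L1 cache.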
35. Example: 1st Iteration of Time Loop
USER_REGION( USER_REGION, main.f, 56 )
__CURR( CALL, main.f, 58 )
__VELO( CALL, main.f, 60 )
__TEMP( CALL, main.f, 68 )
__BOUND( CALL, main.f, 69 )
__MPI_ALLREDUCE( CALL, main.f, 73 )
36. 1st Iteration of Time Loop
LC1MissesOverMemRef       Severity 0.156674   Region BOUND( CALL, main.f, 69 )
LC1WriteMissesOverMemRef  Severity 0.0941401  Region BOUND( CALL, main.f, 69 )
LC1ReadMissesOverMemRef   Severity 0.0625339  Region BOUND( CALL, main.f, 69 )
LC1MissesOverMemRef       Severity 0.0225174  Region TEMP( CALL, main.f, 68 )
37. 2nd Iteration
BOUND( CALL_REGION, main.f, 69 ) FOUND SUBROUTINE BOUND 114 Subregions
__BOUND( SUB_REGION, bound.f, 1 )
_____( LOOP_REGION, bound.f, 28 )
_____( LOOP_REGION, bound.f, 42 )
_____( LOOP_REGION, bound.f, 139 )
_____( LOOP_REGION, bound.f, 151 )
_____( LOOP_REGION, bound.f, 175 )
38. 2nd Iteration
LC1MissesOverMemRef       Severity 0.68164    Region ( LOOP, bound.f, 139 )
LC1WriteMissesOverMemRef  Severity 0.671389   Region ( LOOP, bound.f, 139 )
LC1MissesOverMemRef       Severity 0.216865   Region ( LOOP, bound.f, 673 )
LC1ReadMissesOverMemRef   Severity 0.108437   Region ( LOOP, bound.f, 673 )
39. 3rd Iteration
Looking for Subregions or Data Structures of already evaluated region ( LOOP, bound.f, 139 )
UN( DATA_STRUCTURE, bound.f, 1 )
LC1MissesOverMemRef       Severity 0.671016   Region ( LOOP, bound.f, 139 )  Data Structure UN
LC1WriteMissesOverMemRef  Severity 0.671016   Region ( LOOP, bound.f, 139 )  Data Structure UN
40. Final Result
LC1MissesOverMemRef       (LOOP, bound.f, 139)  Severity 0.68164
LC1WriteMissesOverMemRef  (LOOP, bound.f, 139)  Severity 0.671389
LC1WriteMissesOverMemRef  (LOOP, bound.f, 139)  Severity 0.671016  Data Structure UN
LC1ReadMissesOverMemRef   (LOOP, bound.f, 673)  Severity 0.108437
LC1WriteMissesOverMemRef  (LOOP, bound.f, 673)  Severity 0.108429
LC1MissesOverMemRef       (LOOP, bound.f, 28)   Severity 0.103128  Data Structure UN
41. NAS Parallel Benchmarks
42. Performance Cockpit
43. Summary and Status
- Design for automated and distributed online analysis
- Solves the scalability and complexity problem
- Implementation on
- Hitachi
- IBM Regatta
- SGI Altix
- x86 / Itanium Cluster
- Experiments
- NAS Benchmarks, SPEC Benchmarks, EP-Cache applications, APART Test Suite (ATS)
44. Future Work
- Distributed coordination
- Tools based on Periscope
- OpenMP profiler
- Periscope for Grids
- DAAD PALMA project: SLA processing for grid computing
45. Topics of interest
- Concepts and languages for parallel and Grid programming
- Supportive tools on system and application level
Submission Deadline: November 18th, 2005