Advanced Techniques for Performance Analysis

1
Advanced Techniques for Performance Analysis
  • Michael Gerndt
  • Technische Universität München
  • gerndt@in.tum.de

2
Performance Analysis is Essential
3
Scaling Portability Profoundly Interesting
A high-level description of the performance of the
cosmology code MADCAP on four well-known
architectures.
Source: David Skinner, NERSC
4
16-way for 4 seconds
(About 20 timestamps per second per task) (14 contextual variables)
5
64-way for 12 seconds
6
Performance Analysis for Parallel Systems
  • Development cycle
  • Assumption: Reproducibility
  • Instrumentation
  • Static vs Dynamic
  • Source-level vs binary-level
  • Monitoring
  • Software vs Hardware
  • Statistical profiles vs event traces
  • Analysis
  • Source-based tools
  • Visualization tools
  • Automatic analysis tools

[Development cycle: Coding → Performance Monitoring and Analysis → Program Tuning → Production]
7
Overhead Analysis
  • How to decide whether a code performs well?
  • Comparison of measured MFLOPS with peak
    performance
  • Comparison with a sequential version
  • Estimate distance to ideal time via overhead
    classes (see the sketch below)
  • t_mem
  • t_comm
  • t_sync
  • t_red
  • ...

[Figure: speedup vs. number of processors; the gap to ideal speedup is broken down into the overhead classes t_mem, t_comm, t_red]
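To make the overhead-class estimate concrete, here is a minimal sketch of the bookkeeping, assuming a sequential baseline t_seq and made-up example numbers (the RunTimes structure and all values are illustrative, not measurements from the slides):

#include <cstdio>

// Hypothetical per-run measurements: total wall-clock time on p processors
// plus the overhead classes named on this slide.
struct RunTimes {
    int    processors;
    double t_total;   // measured execution time on 'processors' CPUs
    double t_mem;     // memory access overhead
    double t_comm;    // communication overhead
    double t_sync;    // synchronization overhead
    double t_red;     // replicated/redundant computation overhead
};

int main() {
    double t_seq = 100.0;                       // assumed sequential baseline
    RunTimes run{4, 32.0, 2.0, 3.0, 1.0, 1.0};  // assumed example numbers

    double t_ideal   = t_seq / run.processors;  // perfect-scaling target
    double overhead  = run.t_total - t_ideal;   // distance to ideal time
    double explained = run.t_mem + run.t_comm + run.t_sync + run.t_red;

    std::printf("speedup              : %.2f\n", t_seq / run.t_total);
    std::printf("ideal time           : %.2f s\n", t_ideal);
    std::printf("total overhead       : %.2f s\n", overhead);
    std::printf("explained by classes : %.2f s (%.0f%% of the overhead)\n",
                explained, 100.0 * explained / overhead);
    return 0;
}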
8
  • IST Working Group on Automatic Performance
    Analysis: Real Tools (APART)

9
APART Definitions
  • Performance property
  • Condition
  • Confidence
  • Severity
  • Performance problem
  • Property with
  • Condition == TRUE
  • Severity > Threshold
  • Bottleneck
  • Most severe performance problem (see the sketch below)
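
A minimal C++ rendering of these definitions, assuming a simple Property interface (names are illustrative; this is not the actual Periscope/APART code):

#include <memory>
#include <vector>

// A property reports condition, confidence and severity; a problem is a
// property that holds with severity above a threshold; the bottleneck is
// the most severe problem found.
class Property {
public:
    virtual ~Property() = default;
    virtual bool   condition()  const = 0;  // does the property hold?
    virtual double confidence() const = 0;  // certainty in [0,1]
    virtual double severity()   const = 0;  // importance, e.g. fraction of runtime
};

inline bool isProblem(const Property& p, double threshold) {
    return p.condition() && p.severity() > threshold;
}

inline const Property* bottleneck(
        const std::vector<std::unique_ptr<Property>>& props, double threshold) {
    const Property* worst = nullptr;
    for (const auto& p : props)
        if (isProblem(*p, threshold) && (!worst || p->severity() > worst->severity()))
            worst = p.get();
    return worst;  // most severe performance problem, or nullptr if none
}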

10
Performance Analysis
11
Why Automated Performance Analysis?
  • Large number of processors → huge amounts of
    performance data
  • Huge data sets are hard to analyze manually
  • Performance analysis requires detailed knowledge
    of the system (hardware, communications
    middleware, ...)
  • Automation allows novice programmers to use
    experts' knowledge

12
Automated Performance Analysis Examples
  • Paradyn/Performance Consultant
  • Dynamic instrumentation
  • W3 Search model (why/where/when)
  • www.paradyn.org
  • Expert
  • Searching program traces
  • www.fz-juelich.de/zam/kojak
  • Aksum
  • JavaPSL, profile info in a database
  • dps.uibk.ac.at/projects/aksum
  • KappaPi 2
  • Searching traces
  • Static info

13
Periscope Project at Technische Universität
München
  • Previous projects (2001-2005)
  • Peridot
  • EP-Cache
  • Periscope (2005-2006)
  • Develop an automated distributed performance
    analysis tool.
  • Periscope approach
  • Automated search based on ASL
  • Online analysis
  • Distributed search

14
APART Specification Language
  • Performance-related data
  • Static data
  • Dynamic data
  • Performance properties
  • condition: checks existence
  • confidence: degree of certainty
  • severity: importance

15
MPI Data Model
class RegionSummary {
  Region reg;
  float duration;
  float commTime;
  float ioTime;
  float syncTime;
  float idleTime;
  int nrExecutions;
  setof ProcSummary processSums;
}
16
Communication Costs
property communication_costs(Region r, Experiment e, Region rank_basis) {
  LET
    float cost = summary(r,e).CommTime;
  IN
  CONDITION  : cost > 0;
  CONFIDENCE : 1;
  SEVERITY   : cost / duration(rank_basis, e);
}
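
As noted on slide 19, the translation from ASL to C classes is currently done by hand. A sketch of what such a hand translation of the property above might look like, with illustrative names and a trimmed-down RegionSummary (not the actual Periscope sources):

// Illustrative hand translation of the communication_costs ASL property.
struct RegionSummary {
    double duration;   // total time spent in the region
    double commTime;   // time spent in MPI communication
};

class CommunicationCosts /* : public Property */ {
public:
    CommunicationCosts(const RegionSummary& region, const RegionSummary& rankBasis)
        : cost_(region.commTime), basis_(rankBasis.duration) {}

    bool   condition()  const { return cost_ > 0.0; }  // CONDITION: cost > 0
    double confidence() const { return 1.0; }           // CONFIDENCE: 1
    double severity()   const {                         // SEVERITY: cost / duration(rank_basis,e)
        return basis_ > 0.0 ? cost_ / basis_ : 0.0;
    }

private:
    double cost_;   // summary(r,e).CommTime
    double basis_;  // duration(rank_basis,e)
};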
17
Periscope Design
  • Performance Cockpit
  • Performance Analysis Agent Network
  • MRI (Monitoring Request Interface)
  • Peridot Monitor (focus on OpenMP/MPI)
  • EP-Cache Monitor (focus on memory hierarchy)
18
Agent Design Goals
  • Flexibility with respect to software reuse, e.g.,
    EP-Cache, PALMA
  • Flexibility with respect to the performance data
    model
  • Different monitoring systems - different
    performance data
  • Flexibility in property database
  • Extending / changing the set of available ASL
    properties does not require changing the
    performance tool

19
Node Agents
  • Searching for local performance properties
  • Properties and data model based on ASL
    specification
  • Data model manually mapped to monitor
  • Properties can be dynamically defined
  • Transformation ASL -> C classes is currently done
    manually

20
Classes for Monitor Adaptation
21
Agent Structure
22
Agent Search Strategies
  • Predefined search strategies in repository
  • Current strategies
  • Non-phase-based
  • Periodic evaluation of all properties
  • Peridot monitor
  • No monitor configuration
  • Phase-based (see the sketch below)
  • Region nesting and data structure-based
    refinement
  • Property hierarchy-based refinement
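
A minimal sketch of the phase-based refinement loop: properties found above the severity threshold in one phase execution are refined into subregions and data structures for the next execution. All types and helper functions are hypothetical stand-ins, not the Periscope agent API:

#include <utility>
#include <vector>

struct Region { /* loop, call, user region, ... */ };
struct Candidate { const Region* region = nullptr; /* property id, data structure, ... */ };

// Stubs standing in for the monitor/agent interaction; in the real tool these
// would configure the monitor and read measurements for one phase execution.
std::vector<Candidate> initialCandidates(const Region*)   { return {}; }
std::vector<Candidate> refine(const Candidate&)           { return {}; }
bool evaluateDuringNextPhase(const Candidate&, double)    { return false; }

std::vector<Candidate> search(const Region* phase, double threshold, int maxPhases) {
    std::vector<Candidate> toCheck = initialCandidates(phase);
    std::vector<Candidate> problems;
    for (int i = 0; i < maxPhases && !toCheck.empty(); ++i) {
        std::vector<Candidate> next;
        for (const Candidate& c : toCheck) {
            if (evaluateDuringNextPhase(c, threshold)) {  // property found in this phase
                problems.push_back(c);
                auto deeper = refine(c);                  // subregions / data structures
                next.insert(next.end(), deeper.begin(), deeper.end());
            }
        }
        toCheck = std::move(next);                        // refined search set for next phase
    }
    return problems;
}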

23
Application Phases
  • Portion of a program's execution
  • Phases are defined by program regions called
    phase regions
  • Standard regions
  • Full program
  • Functions
  • Parallel loops
  • User regions
  • Repetitive and execute-once phase regions
  • Phase boundaries have to be global (SPMD programs)

24
Monitor Design: Distribution
  • Offloading monitoring overhead
  • Connected via shared memory (SysV Shmem, RDMA)

25
Monitor Design: Configurability
26
New Hardware Monitor for Address Range Monitoring
  • Monitor Control Unit
  • Provides access interface
  • Controls operation
  • Cache Level Monitor
  • Single CLM per level
  • Event- / counter logic
  • Associative counter array
  • Measurement Modes (see the sketch below)
  • Static mode
  • Dynamic mode
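
One way to read the associative counter array with static and dynamic modes, sketched with illustrative types (this is an interpretation, not the EP-Cache monitor design):

#include <cstddef>
#include <cstdint>
#include <vector>

struct RangeCounter {
    uint64_t lo = 0, hi = 0;   // monitored address range [lo, hi)
    uint64_t events = 0;       // e.g. cache misses observed in the range
    bool     bound = false;
};

class CounterArray {
public:
    explicit CounterArray(std::size_t n) : counters_(n) {}

    // Static mode: the tool fixes the address ranges before the measurement.
    void bindStatic(std::size_t i, uint64_t lo, uint64_t hi) {
        counters_[i] = {lo, hi, 0, true};
    }

    // Dynamic mode: bind a free counter to the block of the first event address.
    void record(uint64_t addr, uint64_t blockSize) {
        for (auto& c : counters_)
            if (c.bound && addr >= c.lo && addr < c.hi) { ++c.events; return; }
        for (auto& c : counters_)
            if (!c.bound) {                         // allocate a counter on first use
                uint64_t lo = addr - addr % blockSize;
                c = {lo, lo + blockSize, 1, true};
                return;
            }
        // all counters in use: the event falls outside the monitored ranges
    }

private:
    std::vector<RangeCounter> counters_;
};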

27
Static vs Dynamic Mode
28
Monitor Design: Proposed Standard Interface
Monitoring Request Interface (MRI)
VAMPIR Archiver
Periscope Agent
29
Monitoring Request Interface (MRI)
  • Request specification
  • Request runtime information for
  • Code Regions
  • Active Objects
  • With some aggregation
  • Sum over all instances in a single thread
  • Sum over all instances in all threads
  • ...
  • Application Control
  • Start
  • Start/Stop at region
  • Wait
  • Interrupt

30
Example
  • Monitor the LC3 read cache misses (request sketched below)
  • in the DO loop (file_id 40, line_nr 13)
  • for array A
  • in the form of a histogram
  • with a granularity of 200 blocks and ...
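
A hypothetical encoding of this request as a data structure; the actual MRI calls and names are not shown in the slides, so everything here is illustrative only:

#include <string>

enum class Metric    { LC3_READ_MISSES /* , ... */ };
enum class Aggregate { HISTOGRAM /* , SUM_PER_THREAD, SUM_ALL_THREADS, ... */ };

struct MriRequest {
    Metric      metric;
    int         fileId;          // code region: file id
    int         lineNr;          // code region: first line of the DO loop
    std::string dataStructure;   // active object, e.g. array A
    Aggregate   aggregation;
    int         granularity;     // histogram bucket size in blocks
};

int main() {
    // "Monitor the LC3 read cache misses in the DO loop (file_id 40, line_nr 13)
    //  for array A as a histogram with a granularity of 200 blocks."
    MriRequest req{Metric::LC3_READ_MISSES, 40, 13, "A",
                   Aggregate::HISTOGRAM, 200};
    (void)req;  // a real agent would now pass this request to the monitor
    return 0;
}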

31
Distribution Strategies
  • Analysis control
  • Initial Information
  • Selection of search strategy
  • Hierarchy of agents
  • Combination of properties (Coarsening)
  • Handling distributed properties

32
Usage Scenario (1)
  • Prepare the application
  • Instrumentation
  • MPI: MPI wrapper library
  • OpenMP: Opari source-to-source instrumenter
  • Functions (-finstrument-functions for the GNU
    compilers)
  • F90inst: source-level instrumenter
  • Interactive startup of target application
  • E.g., mpirun <application>
  • Application is halted on the first call to the
    monitoring library
  • Interactive startup of Periscope
  • periscope <name of the application>
  • A registry server is contacted for all nodes on
    which the target application is executed
  • A NodeAgent is started on all nodes

33
Usage Scenario (3)
[Diagram: MasterAgent on the interactive node communicating with the NodeAgents and the application on the compute nodes; a Directory Service registers the components]
34
Crystal Growth Simulation
  • LC1MissesOverMemRef
  • LC1ReadMissesOverMemRef
  • LC1WriteMissesOverMemRef

35
Example: 1st Iteration of Time Loop
USER_REGION( USER_REGION, main.f, 56 )
__CURR( CALL, main.f, 58 )
__VELO( CALL, main.f, 60 )
__TEMP( CALL, main.f, 68 )
__BOUND( CALL, main.f, 69 )
__MPI_ALLREDUCE( CALL, main.f, 73 )
36
1st Iteration of Time Loop
LC1MissesOverMemRef       Severity 0.156674   Region BOUND( CALL, main.f, 69 )
LC1WriteMissesOverMemRef  Severity 0.0941401  Region BOUND( CALL, main.f, 69 )
LC1ReadMissesOverMemRef   Severity 0.0625339  Region BOUND( CALL, main.f, 69 )
LC1MissesOverMemRef       Severity 0.0225174  Region TEMP( CALL, main.f, 68 )
37
2nd Iteration
BOUND( CALL_REGION, main.f, 69 ) FOUND SUBROUTINE BOUND
114 Subregions:
__BOUND( SUB_REGION, bound.f, 1 )
_____( LOOP_REGION, bound.f, 28 )
_____( LOOP_REGION, bound.f, 42 )
_____( LOOP_REGION, bound.f, 139 )
_____( LOOP_REGION, bound.f, 151 )
_____( LOOP_REGION, bound.f, 175 )
38
2nd Iteration
LC1MissesOverMemRef       Severity 0.68164    Region ( LOOP, bound.f, 139 )
LC1WriteMissesOverMemRef  Severity 0.671389   Region ( LOOP, bound.f, 139 )
LC1MissesOverMemRef       Severity 0.216865   Region ( LOOP, bound.f, 673 )
LC1ReadMissesOverMemRef   Severity 0.108437   Region ( LOOP, bound.f, 673 )
39
3rd Iteration
Looking for subregions or data structures of the already evaluated region ( LOOP, bound.f, 139 ):
UN( DATA_STRUCTURE, bound.f, 1 )
LC1MissesOverMemRef       Severity 0.671016   Region ( LOOP, bound.f, 139 )  Data Structure UN
LC1WriteMissesOverMemRef  Severity 0.671016   Region ( LOOP, bound.f, 139 )  Data Structure UN
40
Final Result
LC1MissesOverMemRef       ( LOOP, bound.f, 139 )  Severity 0.68164
LC1WriteMissesOverMemRef  ( LOOP, bound.f, 139 )  Severity 0.671389
LC1WriteMissesOverMemRef  ( LOOP, bound.f, 139 )  Severity 0.671016  Data Structure UN
LC1ReadMissesOverMemRef   ( LOOP, bound.f, 673 )  Severity 0.108437
LC1WriteMissesOverMemRef  ( LOOP, bound.f, 673 )  Severity 0.108429
LC1MissesOverMemRef       ( LOOP, bound.f, 28 )   Severity 0.103128  Data Structure UN
41
NAS Parallel Benchmarks
42
Performance Cockpit
43
Summary and Status
  • Design for automated and distributed online
    analysis
  • Solves the scalability and complexity problems
  • Implementation on
  • Hitachi
  • IBM Regatta
  • SGI Altix
  • x86 / Itanium Cluster
  • Experiments
  • NAS benchmarks, SPEC benchmarks, EP-Cache
    applications, APART Test Suite (ATS)

44
Future Work
  • Distributed coordination
  • Tools based on Periscope
  • OpenMP profiler
  • Periscope for Grids
  • DAAD PALMA project: SLA processing for grid
    computing

45
Topics of interest: Concepts and languages for parallel and Grid programming; supportive tools on system and application level.
Submission deadline: November 18th, 2005