Automatic Performance Analysis of SMP Cluster Applications

1
Automatic Performance Analysis of SMP Cluster
Applications
  • Felix Wolf, Bernd Mohr
  • {f.wolf, b.mohr}@fz-juelich.de
  • Forschungszentrum Jülich
  • Zentralinstitut für Angewandte Mathematik

2
Forschungszentrum Jülich
  • Interdisciplinary public research center
  • Germany's largest national laboratory
  • Focus
  • Matter
  • Energy
  • Information
  • Life
  • Environment
  • 4300 employees
  • 1 square mile

3
Jülich
  • 30,000 citizens
  • First mentioned in 356
  • Citadel (from Italian, 'small city')
  • Renaissance fortress, 1549
  • High school, 1966

4
Aachen
  • 250,000 citizens
  • Borders Belgium and the Netherlands
  • Favorite residence of Charlemagne
  • Roman emperor 800 - 814

5
Outline
  • Introduction
  • Approach
  • Event Trace Generation
  • Abstraction Mechanisms
  • Analysis Process
  • Performance Behavior
  • Representation
  • Specification
  • Presentation
  • Demo
  • Summary

6
Projects
  • APART
  • Automatic Performance Analysis Resources and
    Tools
  • Working group funded by European Union
  • http://www.fz-juelich.de/apart/
  • Forum of tool experts, hard- and software vendors
  • About 20 members worldwide
  • Organizes international workshops
  • KOJAK
  • Research project of ZAM at Forschungszentrum
    Jülich
  • http://www.fz-juelich.de/zam/kojak/
  • Embedded in APART
  • Development of tools for automatic performance
    analysis
  • New tools, integration of existing tools
  • Generic Design

7
SMP Clusters
  • Hierarchical architecture
  • Shared memory within a node
  • Distributed memory among nodes
  • Standard programming models
  • MPI
  • OpenMP
  • Hybrid
  • MPI among nodes
  • OpenMP within a node
  • Advantage: best match with the underlying architecture
  • Problem
  • Complex performance behavior
  • Lack of appropriate performance tools

8
EXPERT tool environment
  • Complete tracing-based solution
  • Automatic detection of performance problems
  • Explanation at a high abstraction level
  • Close to underlying programming model
  • Support of
  • MPI
  • OpenMP
  • MPI/OpenMP
  • Advanced graphical tree display of performance
    behavior
  • Along three hierarchical dimensions
  • Class of performance behavior
  • Position within the dynamic call tree
  • Location (e.g., node or process)

9
EXPERT Tool Environment (2)
  • EXPERT performance tool
  • Analysis of performance behavior
  • Presentation of performance behavior
  • EARL trace analysis language
  • Maps event trace onto higher abstraction level
  • Makes analysis process simple and easy to extend
  • Event trace generation
  • OPARI source code instrumentation of OpenMP
    directives
  • EPILOG runtime library for event recording

10
Terminology
  • Performance property: aspect or class of
    performance behavior
  • E.g., execution dominated by point-to-point
    communication
  • Specified as a condition over performance data
  • Severity measure indicates influence on
    performance behavior
  • Performance bottleneck
  • Performance property with
  • High influence on the performance behavior
  • High severity

11
Performance Data
  • Different kinds of structured performance data
  • Profiles
  • Summary information
  • Easy to create, low space requirements
  • Show simple performance properties
  • Event traces
  • Single events and their spatial/temporal
    relationship
  • Creation more difficult, high space requirements
  • Expressive visualization
  • Show complex performance properties

12
Approach
  • Proof of performance properties using event
    traces
  • Existence of compound events
  • Compound event representing inefficient behavior
  • Set of primitive events (constituents)
  • Relationships among constituents
  • Example
  • Message dispatch, receipt
  • Kind of relationships based on the programming
    model
  • Advantage
  • High-level explanation of inefficiencies
  • Based on vocabulary of the underlying programming
    model

13
Example: Late Sender (blocked receiver)
  • Czochralski crystal growth

14
Example (2): Wait at N x N
  • Jacobi iteration

15
Event Trace Generation
  • Instrumentation
  • MPI calls
  • Wrapper library based on MPI standard profiling
    interface
  • OpenMP directives
  • OPARI source code preprocessor (C, C++, Fortran)
  • User functions
  • Internal PGI compiler profiling interface (C,
    C++, Fortran)
  • Event trace generation
  • EPILOG runtime library (thread safe)
  • EPILOG binary trace data format
  • Event types for MPI/OpenMP
  • Hierarchical cluster hardware
  • Event location is tuple (machine, node, process,
    thread)
  • Source code information and performance counter
    values
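As a rough illustration of the record contents listed above, an event might be modeled in Python as follows; the field names are assumptions, not the actual EPILOG binary format:

# Illustrative event record (assumed field names, not the EPILOG format)
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class Event:
    etype: str                           # e.g. "ENTER", "EXIT", "SEND", "RECV"
    time: float                          # timestamp
    location: Tuple[int, int, int, int]  # (machine, node, process, thread)
    region: str = ""                     # source-code region, e.g. "MPI_Recv"
    counters: Dict[str, int] = field(default_factory=dict)  # performance counter values

# Thread 2 of process 1 on node 0 of machine 0 enters MPI_Recv
ev = Event("ENTER", 12.345, (0, 0, 1, 2), "MPI_Recv", {"PAPI_FP_OPS": 1024})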

16
Abstraction Mechanisms
  • Problem
  • Low-level information in event trace
  • Sequential access to event records
  • Mapping of low-level trace onto higher-level
    model
  • Simpler specification of performance properties
  • EARL trace analysis language
  • Implements high-level interface to event trace
  • C++ class embedded in the Python interpreter
  • Random access
  • Abstractions expressing programming model
    specific relationships
  • Call tree management

17
Abstraction Mechanisms (2)
  • Event Trace
  • Sequence of events in chronological order
  • Event type: set of attributes
  • Hierarchy of event types

18
Abstraction Mechanisms (3)
  • Abstractions
  • State of an event
  • Links between related events (e.g., Send and
    Recv)
  • State of an event
  • State of the executing system, set of ongoing
    activities
  • Defined by set of events that caused the state
  • Mapping event onto set of events
  • Defined inductively by transition rules
  • Examples
  • Send events of messages currently in transfer
    (message queue)
  • Enter events of regions currently in execution
    (region stack)
  • Exit events of collective operation just
    completed
  • MPI collective operations (MPICExit)
  • OpenMP parallel constructs (OMPCExit)
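A minimal Python sketch of such transition rules, independent of the actual EARL implementation, maintaining a region stack per location and a queue of messages currently in transfer (the event keys are assumptions):

# Illustrative transition rules over dictionary events (assumed keys)
def apply_transition(region_stacks, msg_queue, ev):
    if ev["type"] == "ENTER":
        region_stacks.setdefault(ev["loc"], []).append(ev)   # region entered
    elif ev["type"] == "EXIT":
        region_stacks[ev["loc"]].pop()                       # region left
    elif ev["type"] == "SEND":
        msg_queue.append(ev)                                 # message now in transfer
    elif ev["type"] == "RECV":
        # remove the matching Send event (matched here by source location and tag)
        match = next((s for s in msg_queue
                      if s["loc"] == ev["src"] and s["tag"] == ev["tag"]), None)
        if match is not None:
            msg_queue.remove(match)

# Fold the rules over a chronologically ordered trace
region_stacks, msg_queue = {}, []
trace = [{"type": "ENTER", "loc": 0}, {"type": "SEND", "loc": 0, "tag": 7},
         {"type": "ENTER", "loc": 1}, {"type": "RECV", "loc": 1, "src": 0, "tag": 7}]
for ev in trace:
    apply_transition(region_stacks, msg_queue, ev)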

19
Abstraction Mechanisms (4)
  • Links between related events
  • Navigate along path of related events
  • Represented by pointer attributes
  • Extension of event types
  • Examples
  • Pointer to Enter event of current region instance
    (enterptr)
  • Pointer from Recv event to corresponding Send
    event (sendptr)
  • Pointer from lock event to preceding lock event
    that modified the same lock
  • Call tree access
  • Associate Enter events with call tree node
  • Divide set of Enter events into equivalence
    classes
  • Same call tree node
  • Pointer attribute pointing to representative
    (least recent event)
  • Call tree node represents execution phase, source
    code location
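A minimal Python sketch of the call tree bookkeeping described above: Enter events with the same call path form an equivalence class, and each event points to the class representative, i.e. the least recent event (names and keys are assumptions, not the EARL interface):

# Illustrative call tree management over dictionary events (assumed keys)
def assign_cnodeptr(enter_events):
    representatives = {}              # call path -> least recent Enter event
    for ev in enter_events:           # events in chronological order
        rep = representatives.setdefault(ev["callpath"], ev)
        ev["cnodeptr"] = rep          # representative stands for the call tree node
    return representatives

enters = [{"callpath": ("main",)},
          {"callpath": ("main", "foo")},
          {"callpath": ("main", "foo")}]   # same call tree node as the previous event
assign_cnodeptr(enters)
assert enters[2]["cnodeptr"] is enters[1]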

20
Analysis Process
  • EXPERT analysis component
  • Implemented in Python
  • Proof of performance properties
  • Calculation of severity measure
  • Design principle: separation of
  • Analysis process
  • Specification of performance properties
  • Abstractions (EARL)
  • Advantage
  • Arbitrary set of performance properties
  • Short specification of performance properties
  • Application specific properties

21
Representation of Performance Behavior
  • Three dimensional matrix
  • Class of performance behavior
  • Performance property
  • Call tree node
  • Source code location, execution phase
  • Location
  • Machine, node, process, thread
  • E.g., distribution of waiting times across
    processes
  • Proof of load imbalance
  • Each cell contains a performance metric
  • Currently time, e.g., overhead, waiting time
  • Each dimension is arranged in a hierarchy
  • From general to specific performance properties
  • From caller to callees
  • Hierarchy of hard- and software components
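A minimal Python sketch of such a three-dimensional severity matrix, indexed by performance property, call tree node, and location, with an aggregation along the location dimension (the dict-based layout and names are assumptions):

# Illustrative dict-based severity matrix; values are times (e.g. waiting time)
from collections import defaultdict

severity = defaultdict(float)          # (property, cnode, location) -> time

def add_severity(prop, cnode, location, t):
    severity[(prop, cnode, location)] += t

# 1.2 s of Late Sender waiting time in call tree node "exchange",
# attributed to thread 3 of process 1 on node 0
add_severity("Late Sender", "exchange", (0, 1, 3), 1.2)

def by_location(prop):
    """Distribution of a property's time across locations (e.g. to show load imbalance)."""
    totals = defaultdict(float)
    for (p, cnode, loc), t in severity.items():
        if p == prop:
            totals[loc] += t
    return totals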

22
Specification of Performance Behavior
  • Proof of performance property
  • Existence of compound event in event trace
  • Compound event: set of primitive events
    (constituents)
  • Partition constituents into subsets, i.e.,
    logical parts
  • Define relationships among subsets using EARL
    abstractions
  • Pattern classes specify compound events
  • Python class
  • Callback method for each event type
  • Analysis process
  • Looks for compound event instances in event trace
  • Walks sequentially through event trace
  • Invokes the corresponding callback method for
    each event (see the sketch below)
  • Each pattern class computes severity matrix
  • Time losses due to performance property
  • Per location and call tree node
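A minimal Python sketch of this sequential walk, dispatching each event to the matching callback method of every pattern class; the driver shown here is an assumption modeled on the description above, not EXPERT's actual code:

# Illustrative analysis driver: one pass over the trace, one callback per event
def analyze(trace, patterns):
    for ev in trace:                              # sequential walk through the trace
        callback_name = ev["type"].lower()        # e.g. "recv" for a Recv event
        for pattern in patterns:
            callback = getattr(pattern, callback_name, None)
            if callback is not None:              # pattern handles this event type
                callback(ev)                      # pattern updates its severity matrix

A driver of this kind would be called with the event trace and a list of pattern instances such as the LateSender class shown on slide 24.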

23
Example: Late Sender (blocked receiver)
[Figure: time-line diagram (Location vs. Time) of the Late Sender pattern: one process enters MPI_RECV early and waits until the other process issues MPI_SEND; the enterptr and sendptr attributes link the Receive event to the corresponding Enter and Send events.]
24
Example: Late Sender (2)
class LateSender(Pattern):
    "Late Sender"

    def parent(self):
        return "P2P"

    # callback invoked for every Recv event in the trace
    def recv(self, recv):
        recv_start = self._trace.event(recv['enterptr'])
        if self._trace.region(recv_start['regid'])['name'] == "MPI_Recv":
            send = self._trace.event(recv['sendptr'])
            send_start = self._trace.event(send['enterptr'])
            if self._trace.region(send_start['regid'])['name'] == "MPI_Send":
                # receiver entered MPI_Recv before the sender entered MPI_Send
                idle_time = send_start['time'] - recv_start['time']
                if idle_time > 0:
                    locid = recv_start['locid']
                    cnode = recv_start['cnodeptr']
                    self._severity.add(cnode, locid, idle_time)

25
Performance Properties (partial list)
  • Severity is the percentage of CPU allocation
    time: 100 × (time attributed to the property) /
    ((time from first to last event) × number of
    CPUs); see the sketch at the end of this slide
  • Upper-level properties
  • Execution
  • Time during which code is executed
  • In MPI applications near 100%
  • In OpenMP applications typically below 100%
  • Idle Threads
  • Time on unused CPUs during sequential regions in
    OpenMP applications
  • Call tree mapping corresponds to that of master
    thread
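A minimal Python sketch of the severity calculation stated at the top of this slide, with illustrative numbers:

# Severity as a percentage of CPU allocation time
def severity_percent(property_time, first_time, last_time, num_cpus):
    cpu_allocation_time = (last_time - first_time) * num_cpus
    return 100.0 * property_time / cpu_allocation_time

# e.g. 20 s of waiting time in a 100 s run on 4 CPUs -> 5.0 (percent)
print(severity_percent(20.0, 0.0, 100.0, 4))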

26
Performance Properties (2)
[Figure: time-line view (Location vs. Time) contrasting Execution time with Idle Threads time, where 100% corresponds to the CPU allocation time, for the threads of two processes (threads 0.0-0.3 and 1.0-1.3).]
27
Performance Properties (3)
  • MPI
  • Communication
  • Collective
  • Early Reduce
  • Late Broadcast
  • Wait at N x N
  • Point to Point
  • Late Receiver
  • Messages in Wrong Order
  • Late Sender
  • Messages in Wrong Order
  • IO
  • Synchronization

28
Performance Properties (4)
  • OpenMP
  • Synchronization
  • Barrier
  • Implicit
  • Load Imbalance at Parallel Do
  • Not Enough Sections
  • Explicit
  • Lock Competition
  • Idle Threads (sequential overhead)

29
Plug-in Mechanism
  • Application-specific performance properties
  • Application-specific criteria
  • E.g. based on iterations or updates per second
  • Automatic GUI integration
  • Based on Python module concept
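A minimal Python sketch of a module-based plug-in loader for such application-specific properties; the directory layout and the PATTERNS attribute are assumptions, not EXPERT's actual interface:

# Illustrative plug-in loader based on the Python module concept
import importlib.util
import pathlib

def load_plugins(plugin_dir):
    patterns = []
    for path in pathlib.Path(plugin_dir).glob("*.py"):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # each plug-in module is assumed to export a list named PATTERNS
        patterns.extend(getattr(module, "PATTERNS", []))
    return patterns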

30
Presentation of Performance Behavior
  • Performance behavior
  • 3 dimensional matrix
  • Hierarchical dimensions
  • Weighted tree
  • Tree browser
  • Each node has a weight
  • Percentage of CPU allocation time
  • E.g. time spent in subtree of call tree
  • Displayed weight depends on state of node
  • Collapsed (including weight of descendants)
  • Expanded (without weight of descendants)
  • Displayed using
  • Color
  • Makes it easy to identify hot spots (bottlenecks)
  • Numerical value
  • Detailed comparison

[Figure: weighted tree example. Collapsed, main displays 100; expanded, main displays 10 with children foo (30) and bar (60).]
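A minimal Python sketch of the collapsed vs. expanded weights shown in the figure: a collapsed node displays its inclusive weight (own value plus descendants), an expanded node only its own value (the nested-dict tree is an assumed representation):

# Illustrative weighted tree as nested dicts
def displayed_weight(node, collapsed):
    if collapsed:
        return node["value"] + sum(displayed_weight(c, True) for c in node["children"])
    return node["value"]

main = {"value": 10, "children": [{"value": 30, "children": []},    # foo
                                  {"value": 60, "children": []}]}   # bar
print(displayed_weight(main, collapsed=True))    # 100 (collapsed main)
print(displayed_weight(main, collapsed=False))   # 10  (expanded main)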
31
Presentation of Performance Behavior (2)
  • Three views
  • Performance property
  • Call tree
  • Locations
  • Interconnected
  • View refers to selection in left neighbor
  • Two modes
  • Absolute: percent of total CPU allocation time
  • Relative: percent of the selection in the left
    neighbor
  • Collapsing/expanding of nodes
  • Analysis on all hierarchy levels

32
Case Study TRACE (MPI)
  • Simulation of subsurface water flow in variable
    saturated media
  • Based on a parallelized CG algorithm
  • Experiment
  • 8 x 2 processes
  • Communication: 16.7%
  • Two major sources of inefficiencies detected
  • Wait at N x N
  • trace → cgiteration → parallelcg →
    paralleldotproduct → globalsum_r1 → MPI_Allreduce
  • Waiting time ca. 3.6%
  • Late Sender
  • trace → cgiteration → parallelcg →
    parallelfemultiply → exchangedata →
    exchangebufferswf → mrecv → MPI_Recv
  • Waiting time 7.0%

33
Case Study REMO (MPI + OpenMP)
  • Weather forecast
  • Based on the regional climate model
  • Experiment
  • 4 processes
  • 4 threads each
  • Sequential part outside the parallel region is
    too large
  • Idle threads
  • remo (whole program): 50%
  • remo → ec4org → progec4: 37%

34
Summary
  • EXPERT tool environment for automatic performance
    analysis
  • Complete but still extensible solution
  • Support of MPI, OpenMP, and hybrid applications
  • Especially well suited for SMP clusters
  • Performance properties
  • High abstraction level
  • Close to terminology of the underlying
    programming model
  • Specifications embedded in extensible
    architecture
  • Application-specific needs
  • Performance behavior
  • Three interconnected hierarchical dimensions
  • Representation of SMP cluster structure
  • Hierarchical hard- and software components
  • Scalable but still accurate tree display

35
Outlook
  • More performance properties
  • Comparative analysis of different trace files
  • Additional presentation components
  • Source code display
  • Event pattern display (e.g., using VAMPIR)
  • Performance behavior is represented by time
    values (losses)
  • Alternative metrics such as hardware performance
    counters
  • Integration with TAU (University of Oregon)
  • Instrumentation
  • Trace generation