KOJAK Evaluation Report
1
KOJAK Evaluation Report
  • Adam Leko
  • Hans Sherburne
  • UPC Group
  • HCS Research Laboratory
  • University of Florida

Color encoding key: Blue = Information, Red =
Negative note, Green = Positive note
2
Basic Information
  • Name: KOJAK
  • Developers: Forschungszentrum Jülich, ICL @ UTK
  • Current versions
  • Stable: KOJAK v2.0
  • Development: KOJAK v2.1b1
  • Websites: http://icl.cs.utk.edu/kojak/,
    http://www.fz-juelich.de/zam/kojak/
  • Contacts
  • Felix Wolf (fwolf@cs.utk.edu)
  • Bernd Mohr (b.mohr@fz-juelich.de)
  • Generic email: kojak@cs.utk.edu

3
KOJAK Overview
  • A collection of tools for automated performance
    analysis
  • Instrumentation utilities: DUCTAPE, OPARI
  • Trace file format/library: EPILOG
  • High-level trace API: EARL
  • Pattern matching/performance knowledge
    representation: EXPERT
  • Visualization tool: CUBE
  • Can also export to Vampir's VTF3 format
  • Acronym soup
  • KOJAK: Kit for Objective Judgement and
    Knowledge-based detection of performance
    bottlenecks
  • DUCTAPE: C program Database Utilities and
    Conversion Tools APplication Environment
  • EPILOG: Event Processing, Investigating and
    LOGging
  • EARL: Event Analysis and Recognition Library
  • EXPERT: Extensible Performance Tool
  • OPARI: OpenMP Pragma And Region Instrumentor
  • CUBE: CUBE Uniform Behavioral Encoding

4
KOJAK Architecture
5
Instrumentation Overview
  • Automatic instrumentation (kinst)
  • Only available on a few platforms
  • Linux clusters, PGI compilers
  • Hitachi SR-8000
  • Solaris, Sun Fortran90 compiler
  • NEC SX
  • Based on undocumented compiler features
  • Manual instrumentation
  • MPI profiling interface
  • Just need to link against the elg.mpi library
  • Only instruments MPI calls
  • EPILOG API
  • Place macros at the start and end of every
    function (sketched at the end of this slide)
  • ELG_USER_START(function-name)
  • ELG_USER_END(function-name)
  • Compile with -DEPILOG
  • Binary instrumentation (elg_dpcl)
  • Uses IBM's DPCL library
  • Only available on AIX
  • OpenMP instrumentation (opari)
  • Accomplished via
  • Source-to-source transforms
  • Linking against POMP library
  • Only instruments OpenMP regions and constructs
  • Still need to manually instrument functions or
    other code regions

Note: the website mentions instrumentation via
DUCTAPE and TAU, but these have not been
integrated into the available versions of KOJAK
as of 3/05
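A minimal sketch of the manual EPILOG instrumentation
described above, assuming the user macros are provided by a
header named elg_user.h (the ELG_USER_* macro names and the
-DEPILOG flag are taken from this slide; everything else is
illustrative):

    /* Hypothetical example: bracket a region of interest with the
       EPILOG user macros listed above.  Compile with -DEPILOG to
       enable them, as stated on this slide. */
    #include "elg_user.h"   /* assumed header name for the user API */

    static void solve_step(void)
    {
        ELG_USER_START("solve_step");   /* record region entry */
        /* ... computational work ... */
        ELG_USER_END("solve_step");     /* record region exit */
    }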
6
Instrumentation Overhead CAMEL
  • Performed manual instrumentation of CAMEL
  • Attempt to get a rough estimate of overhead
  • Instrumented all functions
  • Ran CAMEL with 1/64th problem size
  • Execution was slowed down by an order of
    magnitude
  • Trace file size: 919 MB
  • CAMEL contains several hundred thousand function
    calls in a given execution
  • Instrumented two functions within an inner loop
  • Execution time increased by a factor of 2.2
  • Trace file size: 153 MB
  • Instrumented outside large loops
  • Execution time increased by a few percent
  • Trace file only 9.1 KB
  • Clearly the naïve approach of instrumenting all
    functions is too expensive for KOJAK (see the
    placement sketch below)
  • This behavior is common to any tracing approach,
    though
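Where those macros are placed drives the cost.  The
hypothetical sketch below (reusing the ELG_USER_* macros and
the assumed elg_user.h header from the previous slide)
contrasts instrumenting inside an inner loop with
instrumenting around it, matching the trace-size and slowdown
differences listed above:

    #include "elg_user.h"             /* assumed header name, as before */

    #define N_ITER 100000
    extern void inner_kernel(int i);  /* hypothetical work routine */

    /* Inside the loop: one enter/exit event pair per iteration, so
       hundreds of thousands of calls produce a huge trace and a
       large slowdown. */
    void instrument_inside(void)
    {
        for (int i = 0; i < N_ITER; i++) {
            ELG_USER_START("inner_kernel");
            inner_kernel(i);
            ELG_USER_END("inner_kernel");
        }
    }

    /* Around the loop: a single event pair covers the whole phase,
       keeping the trace tiny and the overhead to a few percent. */
    void instrument_outside(void)
    {
        ELG_USER_START("main_loop");
        for (int i = 0; i < N_ITER; i++)
            inner_kernel(i);
        ELG_USER_END("main_loop");
    }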

7
Instrumentation Overhead Test Suite
  • Instrumentation performed using MPI profiling
    interface
  • Overall, instrumentation overhead very low (one
    of the lowest seen thus far)
  • Instrumentation with PAPI enabled (FLOPS, L1 data
    miss rate) has no measurable extra overhead
  • Ping-pong has the highest reproducible overhead at
    10% (worst case for MPI)
  • Note: some of the benchmarks have high
    variability in runtimes

8
EPILOG Overview
  • Binary trace file format used by KOJAK
  • Supports OpenMP, MPI, or hybrid applications
  • Fairly compact
  • NAS LU, class W workload, 8 processors: 23 MB
  • Roughly on par with size of SLOG-2 files
  • Documented
  • Complete spec available on website
  • Has an existing API (open source) for reading,
    writing EPILOG files
  • Can also add information from hardware counters
  • PAPI supported
  • Can be converted to VAMPIR format using elg2vtf
  • Requires vptmerge
  • Does not work with updated Intel version of
    Cluster Tools (vptmerge not included)

9
EARL Overview
  • Provides high-level access to trace events
  • Random access to trace events
  • Also provides links between related events
  • API documented, spec available on website
  • Existing implementation also available (open
    source) for C++ and Python
  • Machine model: clusters of SMPs

10
EXPERT Overview
  • Performs automatic analysis of EPILOG traces
  • Main feature of KOJAK suite
  • Matches collection of performance problems
    (bottleneck patterns) against trace file
  • Bottlenecks are specified using EARL
  • User can add in their own patterns using Python
    or C
  • New C patterns have to be compiled back into
    EXPERT
  • Detection method
  • Pattern objects register for certain types of
    trace events
  • Event trace reader performs callbacks when
    requested events are encountered
  • Pattern objects receive callbacks and update
    their state information
  • If a pattern object's state matches its
    performance problem, a bottleneck is reported
    (see the sketch after this list)
  • Output from EXPERT is a .cube file which can be
    visualized using the CUBE tool
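A rough sketch of this callback-driven scheme is shown below.
All type and function names are invented for illustration;
EXPERT's real patterns are written against the EARL API, not
this code:

    /* Hypothetical sketch of EXPERT-style detection: pattern objects
       register for an event type, the trace reader performs callbacks
       as it replays the trace, and each pattern accumulates a
       "severity" that is reported as a bottleneck. */
    #include <stdio.h>

    enum event_type { EV_ENTER, EV_EXIT, EV_SEND, EV_RECV };

    struct event { enum event_type type; double time; int location; };

    struct pattern {
        const char *name;
        enum event_type wanted;   /* event type the pattern registered for */
        double severity;          /* cost attributed to this bottleneck */
        void (*on_event)(struct pattern *self, const struct event *ev);
    };

    /* Toy callback: just counts matching events.  A real pattern such
       as Late Sender would instead compare the timestamps of linked
       send/receive events obtained through EARL. */
    static void count_cb(struct pattern *self, const struct event *ev)
    {
        (void)ev;
        self->severity += 1.0;
    }

    static void replay(const struct event *trace, int nevents,
                       struct pattern *pats, int npats)
    {
        for (int i = 0; i < nevents; i++)           /* trace reader loop */
            for (int j = 0; j < npats; j++)
                if (pats[j].wanted == trace[i].type)
                    pats[j].on_event(&pats[j], &trace[i]);  /* callback */
    }

    int main(void)
    {
        struct event trace[] = { { EV_SEND, 0.1, 0 }, { EV_RECV, 0.4, 1 } };
        struct pattern p = { "toy receive counter", EV_RECV, 0.0, count_cb };
        replay(trace, 2, &p, 1);
        printf("%s: severity %.1f\n", p.name, p.severity);
        return 0;
    }

In the real tool, the accumulated severities are written per
call path and location into the .cube file mentioned above.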

11
EXPERT Bottleneck List
Grey boxes (leaf nodes) are bottlenecks that can
be currently detected
12
EXPERT Analysis Times
  • EXPERT scalability
  • Sequential tool; analysis time scales
    proportionally to trace file size
  • Balancing act
  • Try to detect too many or overly complex
    bottlenecks, and analysis time becomes intractable
  • Try to totally minimize analysis time, and useful
    bottlenecks are missed
  • Current analysis speed tractable for trace files
    up to a few hundred MB
  • Plans to parallelize the analysis phase, but no
    implementation available yet

13
CUBE Overview
  • Generic visualization tool
  • Used by KOJAK to visualize EXPERT's analyses
  • X-Windows application (uses wxWindows toolkit)
  • Buzzword description
  • Displays multidimensional data in a scalable
    fashion
  • Reduces all data to hierarchical display of 3
    dimensions (cube)
  • Data is aggregated across dimensions as needed
  • Dimension space
  • Set of metrics (M)
  • Set of call paths (C)
  • Set of locations (L)
  • Each data point (m, c, l) is mapped onto a number
    representing
  • the value of metric m (also referred to as severity)
  • incurred while the program was executing call path c
  • at location l (see the aggregation sketch after
    this list)
  • Browsers for each dimension are linked together
  • User views one dimension with respect to another
  • Uses documented XML format to represent data
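A minimal sketch of that data model follows.  The array layout
and function name are illustrative only (CUBE itself stores the
data in its XML format), but it shows how one severity value per
(metric, call path, location) triple is aggregated when a
dimension is collapsed:

    /* Hypothetical sketch of CUBE's (metric, call path, location)
       model: each triple maps to one severity number; the browsers
       aggregate over dimensions that are not currently expanded. */
    #define N_METRICS   4
    #define N_CALLPATHS 16
    #define N_LOCATIONS 8

    static double severity[N_METRICS][N_CALLPATHS][N_LOCATIONS];

    /* Value shown for call path c under metric m when the location
       (system) pane is collapsed: sum over all locations. */
    static double severity_over_locations(int m, int c)
    {
        double sum = 0.0;
        for (int l = 0; l < N_LOCATIONS; l++)
            sum += severity[m][c][l];
        return sum;
    }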

14
CUBE Overview Simple Description
  • Uses a 3-pane approach to display information
  • Metric pane
  • Module/calltree pane
  • Right-clicking brings up source code location
  • Location pane (system tree)
  • Each item is displayed along with a color to
    indicate severity of condition
  • Severity can be expressed 4 ways
  • Absolute (time)
  • Percentage
  • Relative percentage (changes the module and
    location panes)
  • Comparative percentage (differences between
    executions)
  • Despite the documentation, the interface is
    actually quite intuitive

15
CUBE Example CAMEL
After opening the .cube file (default metric
shown: absolute time taken, in seconds)
16
CUBE Example CAMEL
After expanding all 3 root nodes; the color shown
indicates metric severity (amount of time)
17
CUBE Example CAMEL
Selecting Execution shows execution time,
broken down by part of code and machine
18
CUBE Example CAMEL
Selecting mainloop adjusts the system tree to show
only the time spent in mainloop on each processor
19
CUBE Example CAMEL
Expanded nodes show exclusive metric (only time
spent by node)
20
CUBE Example CAMEL
Collapsed nodes show inclusive metric (time spent
by node and all children nodes)
21
CUBE Example CAMEL
The metric pane also shows detected bottlenecks;
here, it shows Late Sender in MPI_Recv within main,
spread across all nodes
22
Bottleneck Identification Test Suite
  • Testing metric: what did CUBE tell us after
    processing the trace file with EXPERT?
  • Excluding what can be accomplished with VAMPIR
    export
  • Program correctness was not affected by
    instrumentation
  • CAMEL PASSED
  • Not many problems detected
  • Late sender attributed to a few places in the code,
    due to CAMEL's unique communication pattern
  • LU TOSS-UP
  • No "too many small messages" bottleneck pattern
    exists
  • Late sender and wrong-order messages correctly
    identified, though
  • Big messages PASSED
  • Showed most time being spent in MPI_Send/MPI_Recv
  • Diffuse procedure FAILED
  • Just showed lots of time being spent in barriers
  • Hot procedure FAILED
  • Time incorrectly attributed to MPI_Init

23
Bottleneck Identification Test Suite (2)
  • Intensive server PASSED
  • Late sender bottleneck detected for overloaded
    server
  • Ping-pong PASSED
  • Late sender bottleneck detected
  • Indicates dependence of messages on each other
  • Random barrier PASSED
  • Detected wait at barrier bottleneck
  • Source code correlation allowed pinpointing where
    problem was in code
  • Small messages TOSS-UP
  • Showed a large amount of time spent in
    point-to-point MPI routines
  • Bottleneck incorrectly attributed to late
    receiver
  • System time FAILED
  • Time incorrectly attributed to MPI_Init
  • Wrong order PASSED
  • Correctly identified messages received in wrong
    order

24
KOJAK General Comments
  • Good things
  • Portable, automatic performance analysis
  • CUBE GUI uses a novel way to present metrics
  • Source code correlation!
  • Bottlenecks are shown according to which parts of
    code they occur in and which machines see them
  • Data is presented in a form that helps the user
    avoid becoming overwhelmed
  • Libraries are well-separated into APIs and
    documented
  • We have the opportunity to re-use their existing
    code!
  • Automatic instrumentation is available, although
    only for a limited number of platforms
  • Installation relatively easy
  • Code compiled pretty cleanly
  • Can still export data into VAMPIR format for more
    thorough user analysis
  • Tool very stable (no crashes, only a few bugs)

25
KOJAK General Comments (2)
  • Things that could use improvement
  • Only a few PAPI metrics shown in GUI
  • FLOPS and L1 data miss rates
  • No PAPI metrics used for bottleneck detection!
  • Could write new pattern in EARL though
  • When using PAPI, trace file creation fails
  • Complains about out-of-sync files
  • Some time at beginning of application gets
    incorrectly recorded under MPI_Init
  • CUBE does not correlate with source code unless
    automatic/binary instrumentation is used
  • Call tree in second pane turns into flat
    structure when only MPI profiling library
    interface is used
  • Impossible to see specific communication patterns
    in CUBE
  • Exporting to VAMPIR trace format possible, but
    relies on hard-to-find tool vptmerge
  • Effectiveness of automatic analysis on a
    day-to-day basis still unknown
  • However, very powerful tool when combined with
    VAMPIR

26
KOJAK: Adding UPC and SHMEM
  • SHMEM
  • Not much extra work needed
  • Need to create a SHMEM profiling interface that
    writes to EPILOG (see the wrapper sketch at the
    end of this slide)
  • Add a few extra SHMEM-specific bottleneck
    patterns
  • UPC
  • Could potentially be difficult
  • If we solve the UPC instrumentation problem, then
    we just need to use EPILOG instead of (other
    trace format)
  • Could use manual instrumentation for everything
    but implicit communication
  • Add (many?) UPC-specific bottleneck patterns
  • In either case, if manual (or source-to-source)
    instrumentation is used, not much additional code
    has to be written
  • Also, since the formats are documented (and existing
    API implementations are readily available), it
    should be relatively easy to export to EPILOG traces
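For the SHMEM case, a wrapper layer along the lines sketched
below could record events into EPILOG.  This is only an
illustration that reuses the user-level macros from slide 5
(elg_user.h and the wrapper name are assumptions); a real
profiling interface would emit proper SHMEM communication
records with message sizes and partner PEs:

    /* Hypothetical SHMEM wrapper sketch: bracket a put with EPILOG
       region events.  A real interface would record communication
       events, not just region enter/exit. */
    #include <stddef.h>
    #include <shmem.h>
    #include "elg_user.h"        /* assumed EPILOG user-API header */

    void traced_shmem_putmem(void *dest, const void *source,
                             size_t nelems, int pe)
    {
        ELG_USER_START("shmem_putmem");
        shmem_putmem(dest, source, nelems, pe);
        ELG_USER_END("shmem_putmem");
    }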

27
Evaluation (1)
  • Available metrics 4/5
  • Supports recording execution time (broken down
    into call trees)
  • Supports recording communication patterns and
    classification of events
  • Supports a few PAPI metrics
  • Cost 5/5
  • Free!
  • Documentation quality 4/5
  • Excellent USAGE file describes how to use the
    application
  • CUBE documentation overly technical in some areas
  • Extensibility 4/5
  • Can easily add new benchmark patterns
  • Open source, uses documented APIs
  • Filtering and aggregation 3/5
  • Simple filtering and aggregation functionality in
    the CUBE GUI
  • Not supported at the tracefile level, though
  • Cannot restrict analysis to only certain parts of
    trace
  • More complicated filtering is done based on
    bottleneck detection algorithms

28
Evaluation (2)
  • Hardware support 5/5
  • Many platforms supported
  • Instrumentation, Measurement, and Analysis
  • 64-bit Linux (Opteron and Itanium) with GNU, PGI,
    or Intel compilers; IBM SP (AIX); SGI MIPS-based
    clusters (O2k, O3k); SGI Altix; SPARC-based
    clusters; AlphaServer (Tru64)
  • Instrumentation and Measurement only
  • Cray X1 and T3E; IBM BlueGene/L; NEC SX; Hitachi
    SR-8000
  • Heterogeneity support 0/5 (not supported)
  • Installation 4.5/5
  • Comes in source form, but compilation and
    installation are very easy (no problems)
  • Interoperability 2/5
  • CUBE viewer uses simple XML-based format
  • Can only export to VAMPIR trace files
  • Learning curve 3.5/5
  • MPI trace library easy to use, EXPERT very easy
    to use
  • CUBE has a learning curve but is easy to use
    after some practice

29
Evaluation (3)
  • Manual overhead 3/5
  • Automatic instrumentation of MPI calls on all
    platforms
  • Automatic instrumentation of all functions, or a
    selected handful of functions, via DPCL
  • MPI and OpenMP instrumentation support
  • Measurement accuracy 5/5
  • CAMEL overhead
  • Binary instrumentation more accurate but only
    available on AIX
  • Very low overhead for instrumenting MPI calls
    only
  • Multiple executions 3/5
  • Can relate all metrics between two different runs
    (show percentage differences)
  • Can change code and still compare runs
  • Multiple analyses views 3.5/5
  • CUBE can show time-based metrics broken down by
    node and code locations
  • CUBE can also show bottleneck detection metrics
    broken down by node and code locations
  • Can export to VAMPIR to see trace

30
Evaluation (4)
  • Performance bottleneck identification 3.5/5
  • Bottleneck rules work pretty well (could use more
    though)
  • Lack of a built-in trace viewer makes
    identification of some bottlenecks impossible,
    but trace export means it can be combined with
    Vampir to cover most bases
  • Profiling/tracing support 3/5
  • Only performs tracing
  • Trace file format relatively compact
  • Profiling data shown in CUBE is extracted from
    trace data
  • Response time 1/5
  • Have to wait until after program finishes
    executing and EXPERT is done analyzing before you
    get any feedback
  • Software support 3.5/5
  • Supports OpenMP, MPI
  • Can support linking against any library, but does
    not instrument library functions
  • Source code correlation 4/5
  • Well-supported in CUBE, down to the source code
    line level for function definitions and function
    calls
  • Searching 0/5 (not supported)

31
Evaluation (5)
  • System stability 4.5/5
  • No program crashes encountered
  • A few minor bugs discovered
  • Technical support 4.5/5
  • Developers responded within 24 hours
  • Provided much useful information
  • Willing to work with us to add UPC and SHMEM
    support