Title: KOJAK Evaluation Report

1
KOJAK Evaluation Report

Adam Leko,
Hans Sherburne
UPC Group
HCS Research Laboratory
University of Florida

Color encoding key: Blue = Information, Red = Negative note, Green = Positive note
2
Basic Information

Name: KOJAK
Developer: Forschungszentrum Jülich, ICL @ UTK
Current versions
Stable: KOJAK-v2.0
Development: KOJAK v2.1b1
Websites: http://icl.cs.utk.edu/kojak/ and http://www.fz-juelich.de/zam/kojak/
Contacts
Felix Wolf (fwolf@cs.utk.edu)
Bernd Mohr (b.mohr@fz-juelich.de)
Generic email: kojak@cs.utk.edu

3
KOJAK Overview

A collection of tools for automated performance analysis
Instrumentation utilities: DUCTAPE, OPARI
Trace file format/library: EPILOG
High-level trace API: EARL
Pattern matching/performance knowledge representation: EXPERT
Visualization tool: CUBE
Can also export to Vampir's VTF3 format
Acronym soup
KOJAK: Kit for Objective Judgement and Knowledge-based detection of performance bottlenecks
DUCTAPE: C++ program Database Utilities and Conversion Tools APplication Environment
EPILOG: Event Processing, Investigating and LOGging
EARL: Event Analysis and Recognition Library
EXPERT: Extensible Performance Tool
OPARI: OpenMP Pragma And Region Instrumentor
CUBE: CUBE Uniform Behavioral Encoding

4
KOJAK Architecture
5
Instrumentation Overview

Automatic instrumentation (kinst)
Only available on a few platforms
Linux clusters, PGI compilers
Hitachi SR-8000
Solaris, Sun Fortran90 compiler
NEC SX
Based on undocumented compiler features
Manual instrumentation
MPI profiling interface
Just need to link against the elg.mpi library
Only instruments MPI calls
EPILOG API
Place macros at start and end of every function
ELG_USER_START(function-name)
ELG_USER_END(function-name)
Compile with -DEPILOG

Binary instrumentation (elg_dpcl)
Uses IBM's DPCL library
Only available on AIX
OpenMP instrumentation (opari)
Accomplished via
Source-to-source transforms
Linking against POMP library
Only instruments OpenMP regions and constructs
Still need to manually instrument functions or
other code regions

Note: the website mentions instrumentation via DUCTAPE and TAU, but these have not been integrated into the available versions of KOJAK as of 3/05
6
Instrumentation Overhead: CAMEL

Performed manual instrumentation of CAMEL to get a rough estimate of overhead
Instrumented all functions
Ran CAMEL with 1/64th problem size
Execution was slowed down by an order of magnitude
Trace file size: 919 MB
CAMEL contains several hundred thousand function calls in a given execution
Instrumented two functions within an inner loop
Execution time increased by a factor of 2.2
Trace file size: 153 MB
Instrumented outside large loops
Execution time increased by a few percent
Trace file size: only 9.1 KB
Clearly, the naïve approach of instrumenting all functions is too expensive for KOJAK
This behavior is common to any tracing approach, though
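
The effect of instrumentation placement on trace volume can be illustrated with a minimal sketch. It is not KOJAK or EPILOG code: the `traced` decorator, the `events` list, and the `kernel` and `run` functions are hypothetical stand-ins, assumed here only to show why instrumenting a function inside a loop produces far more trace records than instrumenting the region around the loop.

```python
import functools

events = []  # stand-in for a trace buffer (hypothetical, not the EPILOG format)


def traced(func):
    """Record an enter and an exit event around every call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        events.append(("enter", func.__name__))
        try:
            return func(*args, **kwargs)
        finally:
            events.append(("exit", func.__name__))
    return wrapper


def kernel(i):
    """Small piece of work that is called once per loop iteration."""
    return i * i


def run(n, inner):
    """Sum inner(i) over n iterations."""
    total = 0
    for i in range(n):
        total += inner(i)
    return total


def count_events(instrument_inner, n=1000):
    """Return how many trace events one run produces under each strategy."""
    events.clear()
    if instrument_inner:
        run(n, traced(kernel))   # instrument the function inside the loop
    else:
        traced(run)(n, kernel)   # instrument only the region around the loop
    return len(events)


inner_count = count_events(True)    # two events per loop iteration
outer_count = count_events(False)   # two events in total
```

With the inner function instrumented, the number of events grows with the iteration count, while instrumenting outside the loop keeps it constant. This mirrors the trend in the CAMEL measurements, although the actual sizes there depend on the record format and the application.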

7
Instrumentation Overhead: Test Suite

Instrumentation performed using the MPI profiling interface
Overall, instrumentation overhead is very low (one of the lowest seen thus far)
Instrumentation with PAPI enabled (FLOPS, L1 data miss rate) has no measurable extra overhead
Ping-pong has the highest reproducible overhead at 10% (worst case for MPI)
Note: marked benchmarks have high variability in runtimes

8
EPILOG Overview

Binary trace file format used by KOJAK
Supports OpenMP, MPI, or hybrid applications
Fairly compact
NAS LU, W workload, 8 processors: 23 MB
Roughly on par with size of SLOG-2 files
Documented
Complete spec available on website
An existing open-source API is available for reading and writing EPILOG files
Can also add information from hardware counters
PAPI supported
Can be converted to VAMPIR format using elg2vtf
Requires vptmerge
Does not work with updated Intel version of
Cluster Tools (vptmerge not included)

9
EARL Overview

Provides high-level access to trace events
Random access to trace events
Also provides links between related events
API documented, spec available on website
Existing open-source implementation also available for C++ and Python
Machine model: clusters of SMPs

10
EXPERT Overview

Performs automatic analysis of EPILOG traces
Main feature of KOJAK suite
Matches a collection of performance problems (bottleneck patterns) against the trace file
Bottlenecks are specified using EARL
Users can add their own patterns using Python or C++
New C++ patterns have to be compiled back into EXPERT
Detection method
Pattern objects register for certain types of trace events
Event trace reader performs callbacks when requested events are encountered
Pattern objects receive the callbacks and update their state information
If a pattern object's state matches its performance problem, a bottleneck is reported
Output from EXPERT is a .cube file, which can be visualized using the CUBE tool
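
The detection method described above can be sketched in a few lines of Python. This is an illustration of the register/callback idea only, not the EARL or EXPERT API: the `Event`, `TraceReader`, and `LateSender` classes, their fields, and the sample events are all hypothetical. The sketch assumes each event carries the time at which the MPI call was entered and that messages between a pair of ranks are matched in FIFO order.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    kind: str    # "send" or "recv": the MPI call that was entered
    time: float  # time at which the call was entered, in seconds
    rank: int    # process that recorded the event
    peer: int    # destination rank for a send, source rank for a recv


class TraceReader:
    """Walks events in time order and calls back the registered patterns."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def register(self, kind, pattern):
        self._subscribers[kind].append(pattern)

    def replay(self, events):
        for ev in sorted(events, key=lambda e: e.time):
            for pattern in self._subscribers[ev.kind]:
                pattern.on_event(ev)


class LateSender:
    """Reports receives that were entered before the matching send."""

    def __init__(self):
        self.pending_recvs = defaultdict(list)  # (src, dst) -> waiting recvs
        self.pending_sends = defaultdict(int)   # (src, dst) -> unmatched sends
        self.reports = []                       # (receiver rank, waiting time)

    def on_event(self, ev):
        if ev.kind == "recv":
            key = (ev.peer, ev.rank)
            if self.pending_sends[key] > 0:
                self.pending_sends[key] -= 1    # send came first: not late
            else:
                self.pending_recvs[key].append(ev)
        elif ev.kind == "send":
            key = (ev.rank, ev.peer)
            if self.pending_recvs[key]:
                recv = self.pending_recvs[key].pop(0)
                self.reports.append((recv.rank, ev.time - recv.time))
            else:
                self.pending_sends[key] += 1


reader = TraceReader()
late_sender = LateSender()
reader.register("send", late_sender)
reader.register("recv", late_sender)
reader.replay([
    Event("recv", 1.0, rank=1, peer=0),  # receiver starts waiting at t=1.0
    Event("send", 3.5, rank=0, peer=1),  # matching send starts at t=3.5
    Event("send", 4.0, rank=0, peer=1),  # this send precedes its receive
    Event("recv", 5.0, rank=1, peer=0),
])
```

After the replay, `late_sender.reports` holds one entry for rank 1, whose first receive was entered 2.5 seconds before the matching send. The second message is not reported because its send was entered first.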

11
EXPERT Bottleneck List
Grey boxes (leaf nodes) are bottlenecks that can currently be detected
12
EXPERT Analysis Times

EXPERT scalability
Sequential tool: analysis time scales proportionally to trace file size
Balancing act
Try to detect too many or too complex bottlenecks, and analysis time becomes intractable
Try to minimize analysis time completely, and useful bottlenecks are missed
Current analysis speed is tractable for trace files up to a few hundred MB
There are plans to parallelize the analysis phase, but no implementation is available yet

13
CUBE Overview

Generic visualization tool
Used by KOJAK to visualize EXPERT's analyses
X-Windows application (uses wxWindows toolkit)
Buzzword description
Displays multidimensional data in a scalable
fashion
Reduces all data to hierarchical display of 3
dimensions (cube)
Data is aggregated across dimensions as needed
Dimension space
Set of metrics (M)
Set of call paths (C)
Set of locations (L)
Each data point (m, c, l) is mapped onto a number representing
the actual measurement for metric m (also referred to as severity)
while the program was executing call path c
at location l
Browsers for each dimension are linked together
User views one dimension with respect to another
Uses documented XML format to represent data
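
The three-dimensional data model can be pictured as a mapping from (metric, call path, location) to a severity value that is summed over whichever dimensions are not being displayed. The sketch below is an assumption-laden illustration, not CUBE's implementation or its XML format: the `severity` table, the `aggregate` and `select` helpers, and all metric, call path, and rank names are invented.

```python
from collections import defaultdict

# severity[(metric, call_path, location)] -> value; all entries are made up
severity = {
    ("time", "main", "rank0"): 1.0,
    ("time", "main", "rank1"): 1.5,
    ("time", "main/MPI_Recv", "rank0"): 0.5,
    ("time", "main/MPI_Recv", "rank1"): 2.0,
    ("late_sender", "main/MPI_Recv", "rank1"): 1.25,
}

DIMENSIONS = {"metric": 0, "call_path": 1, "location": 2}


def aggregate(data, keep):
    """Sum severities over every dimension that is not listed in keep."""
    idx = [DIMENSIONS[name] for name in keep]
    totals = defaultdict(float)
    for key, value in data.items():
        totals[tuple(key[i] for i in idx)] += value
    return dict(totals)


def select(data, **fixed):
    """Keep only the data points whose coordinates match the fixed values."""
    return {
        key: value
        for key, value in data.items()
        if all(key[DIMENSIONS[name]] == want for name, want in fixed.items())
    }


# Metric view: everything aggregated over call paths and locations.
per_metric = aggregate(severity, ["metric"])

# Linked view: fix a metric and a call path, then break down by location.
per_location = aggregate(
    select(severity, metric="time", call_path="main/MPI_Recv"),
    ["location"],
)
```

Fixing a metric and a call path before aggregating by location corresponds to viewing one dimension with respect to another in the linked browsers.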

14
CUBE Overview: Simple Description

Uses a 3-pane approach to display information
Metric pane
Module/calltree pane
Right-clicking brings up source code location
Location pane (system tree)
Each item is displayed along with a color to
indicate severity of condition
Severity can be expressed in 4 ways
Absolute (time)
Percentage
Relative percentage (changes module and location panes)
Comparative percentage (differences between executions)
Despite documentation, interface is actually
quite intuitive

15
CUBE Example: CAMEL
After opening the .cube file (default metric shown: absolute time taken, in seconds)
16
CUBE Example: CAMEL
After expanding all 3 root nodes: color shown indicates metric severity (amount of time)
17
CUBE Example: CAMEL
Selecting "Execution" shows execution time, broken down by part of code and machine
18
CUBE Example: CAMEL
Selecting "mainloop" adjusts the system tree to only show time spent in mainloop for each processor
19
CUBE Example: CAMEL
Expanded nodes show exclusive metric (only time spent by node)
20
CUBE Example: CAMEL
Collapsed nodes show inclusive metric (time spent by node and all children nodes)
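
The difference between the exclusive and inclusive values shown for expanded and collapsed nodes can be made concrete with a small sketch. The `Node` class and the call tree below are hypothetical, with invented names and timings, and are not taken from CUBE or from the CAMEL trace.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    exclusive: float                      # time spent in this node only
    children: list = field(default_factory=list)

    def inclusive(self):
        """Time spent in this node plus all of its descendants."""
        return self.exclusive + sum(c.inclusive() for c in self.children)

    def displayed(self, expanded):
        """Value a tree browser would show next to this node."""
        return self.exclusive if expanded else self.inclusive()


# Hypothetical call tree; names and timings are made up for illustration.
tree = Node("main", 0.5, [
    Node("mainloop", 2.0, [Node("MPI_Recv", 1.5)]),
    Node("MPI_Init", 0.25),
])

collapsed_value = tree.displayed(expanded=False)  # main and everything below
expanded_value = tree.displayed(expanded=True)    # main on its own
```

Here the collapsed root shows 4.25 (its own 0.5 plus 3.5 for mainloop and 0.25 for MPI_Init), while the expanded root shows only 0.5, because its children now display their own values.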
21
CUBE Example: CAMEL
Metric pane also shows detected bottlenecks; here, it shows Late Sender in MPI_Recv within main, spread across all nodes
22
Bottleneck Identification Test Suite

Testing metric: what did CUBE tell us after processing the trace file with EXPERT?
Excluding what can be accomplished with VAMPIR export
Program correctness was not affected by instrumentation
CAMEL: PASSED
Not many problems detected
Late sender attributed to a few places in code, due to CAMEL's unique communication pattern
LU: TOSS-UP
No "too many small messages" bottleneck pattern
"Late sender" and "messages in wrong order" correctly identified, though
Big messages: PASSED
Showed most time being spent in MPI_Send/MPI_Recv
Diffuse procedure: FAILED
Just showed lots of time being spent in barriers
Hot procedure: FAILED
Time incorrectly attributed to MPI_Init

23
Bottleneck Identification Test Suite (2)

Intensive server: PASSED
Late sender bottleneck detected for overloaded server
Ping-pong: PASSED
Late sender bottleneck detected
Indicates dependence of messages on each other
Random barrier: PASSED
Detected "wait at barrier" bottleneck
Source code correlation allowed pinpointing where the problem was in the code
Small messages: TOSS-UP
Illustrated large time spent in point-to-point MPI routines
Bottleneck incorrectly attributed to late receiver
System time: FAILED
Incorrectly attributed to MPI_Init time
Wrong order: PASSED
Correctly identified messages received in wrong order

24
KOJAK General Comments

Good things
Portable, automatic performance analysis
CUBE GUI uses a novel way to present metrics
Source code correlation!
Bottlenecks are shown according to which parts of the code they occur in and which machines see them
Data is presented in a form that makes it easier for the user not to become overwhelmed
Libraries are well-separated into APIs and documented
We have the opportunity to re-use their existing code!
Automatic instrumentation is available, although only for a limited number of platforms
Installation relatively easy
Code compiled pretty cleanly
Can still export data into VAMPIR format for more thorough user analysis
Tool very stable (no crashes, only a few bugs)

25
KOJAK General Comments (2)

Things that could use improvement
Only a few PAPI metrics shown in GUI
FLOPS and L1 data miss rates
No PAPI metrics used for bottleneck detection!
Could write new pattern in EARL, though
When using PAPI, trace file creation fails
Complains about out-of-sync files
Some time at beginning of application gets incorrectly recorded under MPI_Init
CUBE does not correlate with source code unless automatic/binary instrumentation is used
Call tree in second pane turns into a flat structure when only the MPI profiling library interface is used
Impossible to see specific communication patterns in CUBE
Exporting to VAMPIR trace format is possible, but relies on the hard-to-find tool vptmerge
Effectiveness of automatic analysis on a day-to-day basis still unknown
However, very powerful tool when combined with VAMPIR

26
KOJAK: Adding UPC and SHMEM

SHMEM
Not much extra work needed
Need to create a SHMEM profiling interface that writes to EPILOG
Add a few extra SHMEM-specific bottleneck patterns
UPC
Could potentially be difficult
If we solve the UPC instrumentation problem, then we just need to use EPILOG instead of (other trace format)
Could use manual instrumentation for everything but implicit communication
Add (many?) UPC-specific bottleneck patterns
In either case, if manual (or source-to-source) instrumentation is used, not much additional code has to be written
Also, since the formats are defined (and existing API implementations are readily available), it should be relatively easy to export to EPILOG traces

27
Evaluation (1)

Available metrics: 4/5
Supports recording execution time (broken down into call trees)
Supports recording communication patterns and classification of events
Supports a few PAPI metrics
Cost: 5/5
Free!
Documentation quality: 4/5
Excellent USAGE file describes how to use application
CUBE documentation overly technical in some areas
Extensibility: 4/5
Can easily add new benchmark patterns
Open source, uses documented APIs
Filtering and aggregation: 3/5
Simple filtering and aggregation functionality in CUBE GUI
Not supported at the trace file level, though
Cannot restrict analysis to only certain parts of trace
More complicated filtering is done based on bottleneck detection algorithms

28
Evaluation (2)

Hardware support: 5/5
Many platforms supported
Instrumentation, Measurement, and Analysis
64-bit Linux (Opteron and Itanium) with GNU, PGI, or Intel compilers; IBM SP (AIX); SGI MIPS-based clusters (O2k, O3k); SGI Altix; SPARC-based clusters; AlphaServer (Tru64)
Instrumentation and Measurement only
Cray X1 and T3E; IBM BlueGene/L; NEC SX; Hitachi SR-8000
Heterogeneity support: 0/5 (not supported)
Installation: 4.5/5
Comes in source form, but very easy to compile and install (no problems)
Interoperability: 2/5
CUBE viewer uses simple XML-based format
Can only export to VAMPIR trace files
Learning curve: 3.5/5
MPI trace library easy to use, EXPERT very easy to use
CUBE has a learning curve but is easy to use after some use

29
Evaluation (3)

Manual overhead: 3/5
Automatic instrumentation of MPI calls on all platforms
Automatic instrumentation of all functions and a handful of functions via DPCL
MPI and OpenMP instrumentation support
Measurement accuracy: 5/5
CAMEL overhead
Binary instrumentation more accurate but only available on AIX
Very low overhead for instrumenting MPI calls only
Multiple executions: 3/5
Can relate all metrics between two different runs (show percentage differences)
Can change code and still compare runs
Multiple analyses and views: 3.5/5
CUBE can show time-based metrics broken down by node and code locations
CUBE can also show bottleneck detection metrics broken down by node and code locations
Can export to VAMPIR to see trace

30
Evaluation (4)

Performance bottleneck identification: 3.5/5
Bottleneck rules work pretty well (could use more, though)
Lack of a built-in trace viewer makes identification of some bottlenecks impossible, but trace export means KOJAK could be combined with Vampir to cover most bases
Profiling/tracing support: 3/5
Only performs tracing
Trace file format relatively compact
Profiling data shown in CUBE is extracted from trace data
Response time: 1/5
Have to wait until after the program finishes executing and EXPERT is done analyzing before you get any feedback
Software support: 3.5/5
Supports OpenMP, MPI
Can support linking against any library, but does not instrument library functions
Source code correlation: 4/5
Well-supported in CUBE, down to the source code line level for function definitions and function calls
Searching: 0/5 (not supported)