Title: Project F2: Application Performance Analysis
1. Project F2: Application Performance Analysis
- Seth Koehler
- John Curreri
- Rafael Garcia
2. Outline
- Introduction
- Performance analysis overview
- Historical background
- Performance analysis today
- Related research and tools
- RC performance analysis
- Motivation
- Instrumentation
- Framework
- Visualization
- User's perspective
- Case studies
- N-Queens
- Collatz (3x1) conjecture
- Conclusions
- References
3. Introduction
- Goals for performance analysis in RC
  - Productively identify and remedy performance bottlenecks in RC applications (CPUs and FPGAs)
- Motivations
  - Complex systems are difficult to analyze by hand
    - Manual instrumentation is unwieldy
    - Difficult to make sense of large volumes of raw data
  - Tools can help quickly locate performance problems
    - Collect and view performance data with little effort
    - Analyze performance data to indicate potential bottlenecks
  - Staple in HPC, limited in HPEC, and virtually non-existent in RC
- Challenges
  - How do we expand the notion of software performance analysis into the software-hardware realm of RC?
  - What are common bottlenecks for dual-paradigm applications?
  - What techniques are necessary to detect performance bottlenecks?
  - How do we analyze and present these bottlenecks to a user?
4. Historical Background
- gettimeofday and printf
  - VERY cumbersome, repetitive, manual, not optimized for speed
- Profilers date back to the 1970s with prof (gprof, 1982)
  - Provide the user with information about application behavior
    - Percentage of time spent in a function
    - How often a function calls another function
- Simulators / emulators
  - Too slow or too inaccurate
  - Require significant development time
- PAPI (Performance Application Programming Interface)
  - Portable interface to hardware performance counters on modern CPUs
  - Provides information about caches, CPU functional units, main memory, and more
Processor       HW counters
UltraSPARC II   2
Pentium 3       2
AMD Athlon      4
IA-64           4
POWER4          8
Pentium 4       18
(Source: Wikipedia)
5. Performance Analysis Today
- What does performance analysis look like today?
- Goals
- Low impact on application behavior
- High-fidelity performance data
- Flexible
- Portable
- Automated
- Concise visualization
- Techniques
  - Event-based vs. sample-based
  - Profile vs. trace
- Above all, we want to understand application
behavior in order to locate performance problems!
6. Related Research and Tools: Parallel Performance Wizard (PPW)
- Open-source tool developed by the UPC Group at the University of Florida
- Performance analysis and optimization (PGAS systems and MPI support)
- Performance data can be analyzed for bottlenecks
- Offers several ways of exploring performance data
  - Graphs and charts to quickly view high-level performance information at a glance (right, top)
  - In-depth execution statistics for identifying communication and computational bottlenecks
  - Interacts with popular trace viewers (e.g. Jumpshot; right, bottom) for detailed analysis of trace data
  - Comprehensive support for correlating performance back to original source code

Partitioned Global Address Space (PGAS) languages allow partitioned memory to be treated as global shared memory by software.
7. Motivation for RC Performance Analysis
- Dual-paradigm applications gaining more traction in HPC and HPEC
  - Design flexibility allows best use of FPGAs and traditional processors
  - Drawback: more challenging to design applications for dual-paradigm systems
  - Parallel application tuning and FPGA core debugging are hard enough!

(Figure: difficulty level, from Less to More)

- No existing holistic solutions for analyzing dual-paradigm applications
  - Software-only views leave out low-level details
  - Hardware-only views provide incomplete performance information
- Need complete system view for effective tuning of entire application
8. Motivation for RC Performance Analysis
- Q: Is my runtime load-balancing strategy working?
- A: ???

(Figure: ChipScope waveform)
9. Motivation for RC Performance Analysis
- Q: How well is my core's pipelining strategy working?
- A: ???
Flat profile. Each sample counts as 0.01 seconds.

  %   cumulative   self              self     total
 time   seconds   seconds   calls  ms/call  ms/call  name
51.52      2.55      2.55       5   510.04   510.04  USURP_Reg_poll
29.41      4.01      1.46      34    42.82    42.82  USURP_DMA_write
11.97      4.60      0.59      14    42.31    42.31  USURP_DMA_read
 4.06      4.80      0.20       1   200.80   200.80  USURP_Finalize
 2.23      4.91      0.11       5    22.09    22.09  localp
 1.22      4.97      0.06       5    12.05    12.05  USURP_Load
 0.00      4.97      0.00      10     0.00     0.00  USURP_Reg_write
 0.00      4.97      0.00       5     0.00     0.00  USURP_Set_clk
 0.00      4.97      0.00       5     0.00   931.73  rcwork
 0.00      4.97      0.00       1     0.00     0.00  USURP_Init

gprof output (N of these, one for each node!)
10. What to Instrument in Hardware?
- Control
  - Watch state machines, pipelines, etc.
- Replicated cores
  - Understand distribution and parallelism inside FPGA
- Communication
  - On-chip (components, Block RAMs, embedded processors)
  - On-board (on-board memory, other on-board FPGAs or processors)
  - Off-board (CPUs, off-board FPGAs, main memory)
11. Instrumentation Modifications
(Diagram: instrumented design; color legend distinguishes framework additions from the user application)

- Process is automatable!
- Additions are temporary!
12. Performance Analysis Framework
- Instrument VHDL source (vs. binary or intermediate levels)
  - Portable across devices
  - Flexible (access to signals)
  - Low change in area / speed (optimized)
  - Relatively easy
  - Must pass through place-and-route
  - Language-specific (VHDL vs. Verilog)
- Store data with CPU-initiated transfers (vs. CPU-assisted or FPGA-initiated)
  - Universally supported
  - Not portable across APIs
  - Inefficient (lock contention, wasteful)
  - Lower fidelity

(Figure: CPU issues request to FPGA; FPGA returns data)
13. Hardware Measurement Extraction Module
- Separate thread (HMM_Main) periodically transfers data from FPGA to memory
- Adaptive polling frequency can be employed to balance fidelity and overhead
- Measurement can be stopped and restarted (similar to a stopwatch)
(Diagram: HMM_Init → HMM_Start → HMM_Main (thread) ↔ Application → HMM_Stop → HMM_Finalize)
14. Instrumentation Modifications (cont.)
- New top-level file arbitrates between application and performance framework for off-chip communication
- Splice into communication scheme
  - Acquire address space in memory map
  - Acquire network address or other unique identifier
- Connect hardware together
- Signal analysis
- Challenges in automation
  - Custom APIs for FPGAs
  - Custom user schemes for communication
  - Application knowledge not available
15. Hardware Measurement Module
- Tracing, profiling, sampling with signal analysis
16. Visualization
- Need unified visualizations that accentuate important statistics
- Must be scalable to many nodes
17. Analysis
- Instrument and measure to locate common or expected bottlenecks
- Provide potential solutions or other aid to mitigate these bottlenecks
  - Best practices, common pitfalls, etc.
  - Hardware/platform-specific checks and solutions
Bottleneck pattern                                          Possible solution
FPGA idle waiting for data                                  Employ double-buffering
Frequent, small communication packets between CPU and FPGA  Buffer data on CPU or FPGA side
Some cores busy while others idle                           Improve distribution scheme / load-balancing
Cray XD1 reads slow on CPU                                  Use FPGA to write data
Heavy CPU/FPGA communication                                Modify partitioning of CPU and FPGA work/data
Excessive time spent in miscellaneous states                Combine states
18. Performance Flow (User's Perspective)
- Instrument hardware through VHDL Instrumenter GUI
  - Java/Perl program to simplify modifications to VHDL for performance analysis
- Must resynthesize / implement hardware
  - Requires adding the instrumented HDL file via standard tool flow
- Instrument software through PPW compiler scripts
  - Run software with ppwupcc instead of the standard compiler
  - Use the fpga-nallatech and inst-functions command-line options
19. Case Study: N-Queens
- Overview
  - Find the number of distinct ways n queens can be placed on an n×n board without attacking each other
- Performance analysis overhead
  - Sixteen 32-bit profile counters
  - One 96-bit trace buffer (completed cores)
- Main state machine optimized based on data
  - Improved speedup (from 34× to 37× vs. Xeon code)
N-Queens results for board size of 16

                                   XD1                    Xeon-H101
                                   Original  Instr.       Original  Instr.
Slices (% relative to device)      9,041     9,901 (+4%)  23,086    26,218 (+6%)
Block RAM (% relative to device)   11        15 (+2%)     21        22 (+0%)
Frequency (MHz) (% rel. to orig.)  124       123 (-1%)    101       101 (0%)
Communication (KB/s)               <1        33           <1        30

Standard backtracking algorithm employed on the FPGAs.
20. Case Study: Collatz Conjecture (3x+1)
- Application
  - Search for sequences that do not reach 1 under the following function
  - 3.2 GHz P4-Xeon CPU with Virtex-4 LX100 FPGA over PCI-X
  - Uses 88% of FPGA slices and 22 (53%) of block RAMs; runs at 100 MHz
- Setup
  - 17 counters monitored 3 state machines
  - No frequency degradation observed
- Results
  - Frequent, small FPGA communication
    - 31% performance improvement achieved by buffering data before sending to the FPGA
    - Unexpected... hardware was tuned to work longer to eliminate communication problems
  - Distribution of data inside FPGA
    - Expected performance increase not large enough to merit implementation
- Conclusions
  - Buffering data achieved a 31% increase in speed
(Figure: trace timeline showing FPGA write, FPGA read, FPGA data processing, and CPU computation phases)
21. Conclusions
- RC performance analysis is critical to understanding RC application behavior
- Need unified instrumentation, measurement, and visualization to handle diverse and massively parallel RC systems
- Automated analysis can be useful for locating common RC bottlenecks (though difficult to do)
- Framework developed
  - First RC performance concept and tool framework (per extensive literature review)
  - Automated instrumentation
  - Measurement via tracing, profiling, sampling
- Application case studies
  - Observed minimal overhead from tool
  - Speedup achieved due to performance analysis
22. References
- R. DeVille, I. Troxel, and A. George. Performance monitoring for run-time management of reconfigurable devices. Proc. of International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), pages 175-181, June 2005.
- Paul Graham, Brent Nelson, and Brad Hutchings. Instrumenting bitstreams for debugging FPGA circuits. In Proc. of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 41-50, Washington, DC, USA, Apr. 2001. IEEE Computer Society.
- Sameer S. Shende and Allen D. Malony. The TAU parallel performance system. International Journal of High Performance Computing Applications (HPCA), 20(2):287-311, May 2006.
- C. Eric Wu, Anthony Bolmarcich, Marc Snir, David Wootton, Farid Parpia, Anthony Chan, Ewing Lusk, and William Gropp. From trace generation to visualization: a performance framework for distributed parallel systems. In Proc. of the 2000 ACM/IEEE Conference on Supercomputing (CDROM) (SC), page 50, Washington, DC, USA, Nov. 2000. IEEE Computer Society.
- Adam Leko and Max Billingsley, III. Parallel Performance Wizard user manual. http://ppw.hcs.ufl.edu/docs/pdf/manual.pdf, 2007.
- S. Koehler, J. Curreri, and A. George, "Challenges for Performance Analysis in High-Performance Reconfigurable Computing," Proc. of Reconfigurable Systems Summer Institute 2007 (RSSI), Urbana, IL, July 17-20, 2007.