Title: Project F2: Application Performance Analysis
1. Project F2: Application Performance Analysis
- Seth Koehler
- John Curreri
- Rafael Garcia
2. Outline
- Introduction
- Performance analysis overview
- Historical background
- Performance analysis today
- Related research and tools
- RC performance analysis
- Motivation
- Instrumentation
- Framework
- Visualization
- User's perspective
- Case studies
- N-Queens
- Collatz (3x1) conjecture
- Conclusions
- References
3. Introduction
- Goals for performance analysis in RC
  - Productively identify and remedy performance bottlenecks in RC applications (CPUs and FPGAs)
- Motivations
  - Complex systems are difficult to analyze by hand
    - Manual instrumentation is unwieldy
    - Difficult to make sense of large volumes of raw data
  - Tools can help quickly locate performance problems
    - Collect and view performance data with little effort
    - Analyze performance data to indicate potential bottlenecks
  - Staple in HPC, limited in HPEC, and virtually non-existent in RC
- Challenges
  - How do we expand the notion of software performance analysis into the software-hardware realm of RC?
  - What are common bottlenecks for dual-paradigm applications?
  - What techniques are necessary to detect performance bottlenecks?
  - How do we analyze and present these bottlenecks to a user?
4. Historical Background
- gettimeofday and printf
  - VERY cumbersome, repetitive, manual, not optimized for speed
- Profilers date back to the 1970s with prof (gprof, 1982)
  - Provide the user with information about application behavior
    - Percentage of time spent in a function
    - How often a function calls another function
- Simulators / emulators
  - Too slow or too inaccurate
  - Require significant development time
- PAPI (Performance Application Programming Interface)
  - Portable interface to hardware performance counters on modern CPUs
  - Provides information about caches, CPU functional units, main memory, and more
Processor       HW counters
UltraSPARC II   2
Pentium 3       2
AMD Athlon      4
IA-64           4
POWER4          8
Pentium 4       18
(Source: Wikipedia)
5. Performance Analysis Today
- What does performance analysis look like today?
- Goals
- Low impact on application behavior
- High-fidelity performance data
- Flexible
- Portable
- Automated
- Concise visualization
- Techniques
  - Event-based vs. sample-based
  - Profile vs. trace
- Above all, we want to understand application
behavior in order to locate performance problems!
6. Related Research and Tools: Parallel Performance Wizard (PPW)
- Open-source tool developed by the UPC Group at the University of Florida
- Performance analysis and optimization (PGAS systems and MPI support)
- Performance data can be analyzed for bottlenecks
- Offers several ways of exploring performance data
  - Graphs and charts to quickly view high-level performance information at a glance (right, top)
  - In-depth execution statistics for identifying communication and computational bottlenecks
  - Interacts with popular trace viewers (e.g. Jumpshot; right, bottom) for detailed analysis of trace data
  - Comprehensive support for correlating performance back to original source code

Partitioned Global Address Space (PGAS) languages allow partitioned memory to be treated as global shared memory by software.
7. Motivation for RC Performance Analysis
- Dual-paradigm applications gaining more traction in HPC and HPEC
  - Design flexibility allows best use of FPGAs and traditional processors
  - Drawback: more challenging to design applications for dual-paradigm systems
  - Parallel application tuning and FPGA core debugging are hard enough!

(Figure: difficulty level, from Less to More)

- No existing holistic solutions for analyzing dual-paradigm applications
  - Software-only views leave out low-level details
  - Hardware-only views provide incomplete performance information
- Need complete system view for effective tuning of entire application
8. Motivation for RC Performance Analysis
- Q: Is my runtime load-balancing strategy working?
- A: ???

(Figure: ChipScope waveform)
9. Motivation for RC Performance Analysis
- Q: How well is my core's pipelining strategy working?
- A: ???
Flat profile. Each sample counts as 0.01 seconds.

  %   cumulative   self              self     total
 time   seconds   seconds   calls  ms/call  ms/call  name
51.52      2.55      2.55       5   510.04   510.04  USURP_Reg_poll
29.41      4.01      1.46      34    42.82    42.82  USURP_DMA_write
11.97      4.60      0.59      14    42.31    42.31  USURP_DMA_read
 4.06      4.80      0.20       1   200.80   200.80  USURP_Finalize
 2.23      4.91      0.11       5    22.09    22.09  localp
 1.22      4.97      0.06       5    12.05    12.05  USURP_Load
 0.00      4.97      0.00      10     0.00     0.00  USURP_Reg_write
 0.00      4.97      0.00       5     0.00     0.00  USURP_Set_clk
 0.00      4.97      0.00       5     0.00   931.73  rcwork
 0.00      4.97      0.00       1     0.00     0.00  USURP_Init

gprof output (N of these, one for each node!)
10. What to Instrument in Hardware?
- Control
  - Watch state machines, pipelines, etc.
- Replicated cores
  - Understand distribution and parallelism inside FPGA
- Communication
  - On-chip (components, Block RAMs, embedded processors)
  - On-board (on-board memory, other on-board FPGAs or processors)
  - Off-board (CPUs, off-board FPGAs, main memory)
11. Instrumentation Modifications
(Diagram: instrumented design; color legend distinguishes framework additions from the user application)

- Process is automatable!
- Additions are temporary!
12. Performance Analysis Framework
- Instrument VHDL source (vs. binary or intermediate levels)
  - Portable across devices
  - Flexible (access to signals)
  - Low change in area / speed (optimized)
  - Relatively easy
  - Must pass through place-and-route
  - Language-specific (VHDL vs. Verilog)
- Store data with CPU-initiated transfers (vs. CPU-assisted or FPGA-initiated)
  - Universally supported
  - Not portable across APIs
  - Inefficient (lock contention, wasteful)
  - Lower fidelity

(Figure: CPU issues request to FPGA; FPGA returns data)
13. Hardware Measurement Extraction Module
- Separate thread (HMM_Main) periodically transfers data from FPGA to memory
- Adaptive polling frequency can be employed to balance fidelity and overhead
- Measurement can be stopped and restarted (similar to a stopwatch)
(Diagram: HMM_Init → HMM_Start → HMM_Main (thread) ↔ Application → HMM_Stop → HMM_Finalize)
14. Instrumentation Modifications (cont.)
- New top-level file arbitrates between application and performance framework for off-chip communication
- Splice into communication scheme
  - Acquire address space in memory map
  - Acquire network address or other unique identifier
- Connect hardware together
- Signal analysis
- Challenges in automation
  - Custom APIs for FPGAs
  - Custom user schemes for communication
  - Application knowledge not available
15. Hardware Measurement Module
- Tracing, profiling, sampling with signal analysis
16. Visualization
- Need unified visualizations that accentuate important statistics
- Must be scalable to many nodes
17. Analysis
- Instrument and measure to locate common or expected bottlenecks
- Provide potential solutions or other aid to mitigate these bottlenecks
  - Best practices, common pitfalls, etc.
  - Hardware/platform-specific checks and solutions
Bottleneck pattern                                          Possible solution
FPGA idle waiting for data                                  Employ double-buffering
Frequent, small communication packets between CPU and FPGA  Buffer data on CPU or FPGA side
Some cores busy while others idle                           Improve distribution scheme / load-balancing
Cray XD1 reads slow on CPU                                  Use FPGA to write data
Heavy CPU/FPGA communication                                Modify partitioning of CPU and FPGA work/data
Excessive time spent in miscellaneous states                Combine states
18. Performance Flow (User's Perspective)
- Instrument hardware through VHDL Instrumenter GUI
  - Java/Perl program to simplify modifications to VHDL for performance analysis
- Must resynthesize / implement hardware
  - Requires adding the instrumented HDL file via standard tool flow
- Instrument software through PPW compiler scripts
  - Run software with ppwupcc instead of the standard compiler
  - Use the fpga-nallatech and inst-functions command-line options
19. Case Study: N-Queens
- Overview
  - Find the number of distinct ways n queens can be placed on an n×n board without attacking each other
- Performance analysis overhead
  - Sixteen 32-bit profile counters
  - One 96-bit trace buffer (completed cores)
- Main state machine optimized based on data
  - Improved speedup (from 34× to 37× vs. Xeon code)
N-Queens results for board size of 16

                                   XD1                    Xeon-H101
                                   Original  Instr.       Original  Instr.
Slices (% relative to device)      9,041     9,901 (+4%)  23,086    26,218 (+6%)
Block RAM (% relative to device)   11        15 (+2%)     21        22 (+0%)
Frequency (MHz) (% rel. to orig.)  124       123 (-1%)    101       101 (0%)
Communication (KB/s)               <1        33           <1        30

Standard backtracking algorithm employed on the FPGAs.
20. Case Study: Collatz Conjecture (3x+1)
- Application
  - Search for sequences that do not reach 1 under the following function
  - 3.2 GHz P4-Xeon CPU with Virtex-4 LX100 FPGA over PCI-X
  - Uses 88% of FPGA slices and 22 (53%) of block RAMs; runs at 100 MHz
- Setup
  - 17 counters monitored 3 state machines
  - No frequency degradation observed
- Results
  - Frequent, small FPGA communication
    - 31% performance improvement achieved by buffering data before sending to the FPGA
    - Unexpected... hardware was tuned to work longer to eliminate communication problems
  - Distribution of data inside FPGA
    - Expected performance increase not large enough to merit implementation
- Conclusions
  - Buffering data achieved a 31% increase in speed
(Figure: trace timeline showing FPGA write, FPGA read, FPGA data processing, and CPU computation phases)
21. Conclusions
- RC performance analysis is critical to understanding RC application behavior
- Need unified instrumentation, measurement, and visualization to handle diverse and massively parallel RC systems
- Automated analysis can be useful for locating common RC bottlenecks (though difficult to do)
- Framework developed
  - First RC performance concept and tool framework (per extensive literature review)
  - Automated instrumentation
  - Measurement via tracing, profiling, sampling
- Application case studies
  - Observed minimal overhead from tool
  - Speedup achieved due to performance analysis
22. References
- R. DeVille, I. Troxel, and A. George. Performance monitoring for run-time management of reconfigurable devices. Proc. of International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), pages 175-181, June 2005.
- Paul Graham, Brent Nelson, and Brad Hutchings. Instrumenting bitstreams for debugging FPGA circuits. In Proc. of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 41-50, Washington, DC, USA, Apr. 2001. IEEE Computer Society.
- Sameer S. Shende and Allen D. Malony. The TAU parallel performance system. International Journal of High Performance Computing Applications (HPCA), 20(2):287-311, May 2006.
- C. Eric Wu, Anthony Bolmarcich, Marc Snir, David Wootton, Farid Parpia, Anthony Chan, Ewing Lusk, and William Gropp. From trace generation to visualization: a performance framework for distributed parallel systems. In Proc. of the 2000 ACM/IEEE Conference on Supercomputing (CDROM) (SC), page 50, Washington, DC, USA, Nov. 2000. IEEE Computer Society.
- Adam Leko and Max Billingsley, III. Parallel Performance Wizard user manual. http://ppw.hcs.ufl.edu/docs/pdf/manual.pdf, 2007.
- S. Koehler, J. Curreri, and A. George, "Challenges for Performance Analysis in High-Performance Reconfigurable Computing," Proc. of Reconfigurable Systems Summer Institute 2007 (RSSI), Urbana, IL, July 17-20, 2007.