Project F2: Application Performance Analysis
12JAN07 Talk for I/UCRC Annual Meeting

1
Project F2: Application Performance Analysis
  • Seth Koehler
  • John Curreri
  • Rafael Garcia

2
Outline
  • Introduction
  • Performance analysis overview
  • Historical background
  • Performance analysis today
  • Related research and tools
  • RC performance analysis
  • Motivation
  • Instrumentation
  • Framework
  • Visualization
  • User's perspective
  • Case studies
  • N-Queens
  • Collatz (3x+1) conjecture
  • Conclusions
  • References

3
Introduction
  • Goals for performance analysis in RC
  • Productively identify and remedy performance
    bottlenecks in RC applications (CPUs and FPGAs)
  • Motivations
  • Complex systems are difficult to analyze by hand
  • Manual instrumentation is unwieldy
  • Difficult to make sense of large volume of raw
    data
  • Tools can help quickly locate performance
    problems
  • Collect and view performance data with little
    effort
  • Analyze performance data to indicate potential
    bottlenecks
  • Staple in HPC, limited in HPEC, and virtually
    non-existent in RC
  • Challenges
  • How do we expand the notion of software
    performance analysis into the software-hardware
    realm of RC?
  • What are common bottlenecks for dual-paradigm
    applications?
  • What techniques are necessary to detect
    performance bottlenecks?
  • How do we analyze and present these bottlenecks
    to a user?

4
Historical Background
  • Gettimeofday and printf
  • VERY cumbersome, repetitive, manual, not
    optimized for speed
  • Profilers date back to the 70s with prof (gprof,
    1982)
  • Provide user with information about application
    behavior
  • Percentage of time spent in a function
  • How often a function calls another function
  • Simulators / Emulators
  • Too slow or too inaccurate
  • Require significant development time
  • PAPI (Performance Application Programming
    Interface)
  • Portable interface to hardware performance
    counters on modern CPUs
  • Provides information about caches, CPU
    functional units, main memory, and more

Processor       HW counters
UltraSparc II    2
Pentium 3        2
AMD Athlon       4
IA-64            4
POWER4           8
Pentium 4       18
(Source: Wikipedia)
5
Performance Analysis Today
  • What does performance analysis look like today?
  • Goals
  • Low impact on application behavior
  • High-fidelity performance data
  • Flexible
  • Portable
  • Automated
  • Concise Visualization
  • Techniques
  • Event-based, sample-based
  • Profile, Trace
  • Above all, we want to understand application
    behavior in order to locate performance problems!
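
The profile/trace distinction above can be sketched with a toy recorder (a minimal illustration with a hypothetical event name; this is not the API of any tool mentioned in these slides):

```python
import time
from collections import defaultdict

class Recorder:
    """Records the same events two ways: an aggregate profile
    (event -> count and total time) and a timestamped trace."""
    def __init__(self):
        self.profile = defaultdict(lambda: [0, 0.0])  # event -> [count, total seconds]
        self.trace = []                               # (timestamp, event, duration)

    def record(self, event, duration):
        entry = self.profile[event]
        entry[0] += 1
        entry[1] += duration
        self.trace.append((time.time(), event, duration))

rec = Recorder()
for d in (0.5, 0.3, 0.2):
    rec.record("dma_write", d)  # "dma_write" is a made-up event name

# The profile answers "how much, in total?"; the trace preserves
# the ordering of events in time for detailed timeline views.
count, total = rec.profile["dma_write"]
```

A profile stays small no matter how long the run is; a trace grows with the number of events, which is the fidelity/overhead trade-off the slide alludes to.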

6
Related Research and Tools: Parallel Performance
Wizard (PPW)
  • Open-source tool developed by UPC Group at
    University of Florida
  • Performance analysis and optimization (PGAS
    systems and MPI support)
  • Performance data can be analyzed for bottlenecks
  • Offers several ways of exploring performance data
  • Graphs and charts to quickly view high-level
    performance information at a glance (right, top)
  • In-depth execution statistics for identifying
    communication and computational bottlenecks
  • Interacts with popular trace viewers (e.g.,
    Jumpshot; right, bottom) for detailed analysis
    of trace data
  • Comprehensive support for correlating performance
    back to original source code

Partitioned Global Address Space languages
allow partitioned memory to be treated as global
shared memory by software.
7
Motivation for RC Performance Analysis
  • Dual-paradigm applications gaining more traction
    in HPC and HPEC
  • Design flexibility allows best use of FPGAs and
    traditional processors
  • Drawback: more challenging to design applications
    for dual-paradigm systems
  • Parallel application tuning and FPGA core
    debugging are hard enough!

(Figure: difficulty-level scale, from less to more)
  • No existing holistic solutions for analyzing
    dual-paradigm applications
  • Software-only views leave out low-level details
  • Hardware-only views provide incomplete
    performance information
  • Need complete system view for effective tuning of
    entire application

8
Motivation for RC Performance Analysis
  • Q: Is my runtime load-balancing strategy working?
  • A: ???

ChipScope waveform
9
Motivation for RC Performance Analysis
  • Q: How well is my core's pipelining strategy
    working?
  • A: ???

Flat profile: Each sample counts as 0.01 seconds.
  %    cumulative   self              self     total
 time    seconds   seconds   calls   ms/call  ms/call  name
51.52      2.55      2.55        5    510.04   510.04  USURP_Reg_poll
29.41      4.01      1.46       34     42.82    42.82  USURP_DMA_write
11.97      4.60      0.59       14     42.31    42.31  USURP_DMA_read
 4.06      4.80      0.20        1    200.80   200.80  USURP_Finalize
 2.23      4.91      0.11        5     22.09    22.09  localp
 1.22      4.97      0.06        5     12.05    12.05  USURP_Load
 0.00      4.97      0.00       10      0.00     0.00  USURP_Reg_write
 0.00      4.97      0.00        5      0.00     0.00  USURP_Set_clk
 0.00      4.97      0.00        5      0.00   931.73  rcwork
 0.00      4.97      0.00        1      0.00     0.00  USURP_Init
gprof output (×N, one for each node!)
10
What to Instrument in Hardware?
  • Control
  • Watch state machines, pipelines, etc.
  • Replicated cores
  • Understand distribution and parallelism inside
    FPGA
  • Communication
  • On-chip (Components, Block RAMs, embedded
    processors)
  • On-board (On-board memory, other on-board FPGAs
    or processors)
  • Off-board (CPUs, off-board FPGAs, main memory)

11
Instrumentation Modifications
Color Legend
Framework
User Application
Process is automatable!
Additions are temporary!
12
Performance Analysis Framework
  • Instrument VHDL source (vs. binary or
    intermediate levels)
  • Portable across devices
  • Flexible (access to signals)
  • Low change in area / speed (optimized)
  • Relatively easy
  • Must pass through place-and-route
  • Language specific (VHDL vs. Verilog)
  • Store data with CPU-initiated transfers (vs.
    CPU-assisted or FPGA-initiated)
  • Universally supported
  • Not portable across APIs
  • Inefficient (lock contention, wasteful)
  • Lower fidelity

Request
CPU
FPGA
Data
13
Hardware Measurement Extraction Module
  • Separate thread (HMM_Main) periodically transfers
    data from FPGA to memory
  • Adaptive polling frequency can be employed to
    balance fidelity and overhead
  • Measurement can be stopped and restarted (similar
    to a stopwatch)

HMM_Init
HMM_Start
HMM_Main (thread)
Application
HMM_Stop
HMM_Finalize
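
The stopwatch pattern above can be sketched in software (the HMM_* roles come from the slide; the class body, including the read_fpga_counters placeholder, is a hypothetical illustration, not the framework's actual API):

```python
import threading
import time

class HardwareMeasurement:
    """Sketch of the HMM stopwatch pattern: a background thread
    periodically 'reads' FPGA counters while measurement is running.
    The polling interval is the knob for fidelity vs. overhead."""
    def __init__(self, interval=0.01):
        self.interval = interval
        self.samples = []
        self._run = threading.Event()    # set while measuring
        self._stop = threading.Event()   # set to shut the thread down
        self._thread = threading.Thread(target=self._main, daemon=True)
        self._thread.start()             # analogous to HMM_Init

    def _main(self):                     # analogous to HMM_Main
        while not self._stop.is_set():
            if self._run.is_set():
                self.samples.append(self.read_fpga_counters())
            time.sleep(self.interval)

    def read_fpga_counters(self):
        # Placeholder for a CPU-initiated transfer from the board.
        return time.monotonic()

    def start(self):                     # analogous to HMM_Start
        self._run.set()

    def stop(self):                      # analogous to HMM_Stop
        self._run.clear()

    def finalize(self):                  # analogous to HMM_Finalize
        self._stop.set()
        self._thread.join()

hmm = HardwareMeasurement(interval=0.005)
hmm.start()
time.sleep(0.05)   # application work would happen here
hmm.stop()
hmm.finalize()
```

An adaptive version would adjust `self.interval` inside `_main` based on how fast the counters are changing, rather than keeping it fixed.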
14
Instrumentation Modifications (cont.)
  • New top-level file arbitrates between application
    and performance framework for off-chip
    communication
  • Splice into communication scheme
  • Acquire address space in memory map
  • Acquire network address or other unique
    identifier
  • Connect hardware together
  • Signal analysis
  • Challenges in Automation
  • Custom APIs for FPGAs
  • Custom user schemes for communication
  • Application knowledge not available

15
Hardware Measurement Module
  • Tracing, profiling, sampling with signal
    analysis

16
Visualization
  • Need unified visualizations that accentuate
    important statistics
  • Must be scalable to many nodes

17
Analysis
  • Instrument and measure to locate common or
    expected bottlenecks
  • Provide potential solutions or other aid to
    mitigate these bottlenecks
  • Best practices, common pitfalls, etc.
  • Hardware/platform specific checks and solutions

Bottleneck Pattern                                  Possible Solution
FPGA idle waiting for data                          Employ double-buffering
Frequent, small packets between CPU and FPGA        Buffer data on CPU or FPGA side
Some cores busy while others idle                   Improve distribution scheme / load-balancing
Cray XD1 reads slow on CPU                          Use FPGA to write data
Heavy CPU/FPGA communication                        Modify partitioning of CPU and FPGA work/data
Excessive time spent in miscellaneous states        Combine states
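
A pattern table like this lends itself to simple rule-driven analysis. A minimal sketch, assuming hypothetical metric names (fpga_idle_frac, avg_packet_bytes, packets_per_sec, core_busy) and illustrative thresholds:

```python
# Hypothetical rule table: each entry pairs a symptom predicate over
# measured metrics with the corresponding suggestion from the table.
RULES = [
    (lambda m: m["fpga_idle_frac"] > 0.5,
     "FPGA idle waiting for data: employ double-buffering"),
    (lambda m: m["avg_packet_bytes"] < 64 and m["packets_per_sec"] > 1e4,
     "Frequent, small CPU-FPGA packets: buffer data on CPU or FPGA side"),
    (lambda m: max(m["core_busy"]) - min(m["core_busy"]) > 0.3,
     "Some cores busy while others idle: improve load-balancing"),
]

def diagnose(metrics):
    """Return the suggestions whose symptom predicate matches."""
    return [msg for pred, msg in RULES if pred(metrics)]

# Example metrics that trip all three rules.
metrics = {"fpga_idle_frac": 0.7, "avg_packet_bytes": 32,
           "packets_per_sec": 5e4, "core_busy": [0.9, 0.2, 0.8]}
hits = diagnose(metrics)
```

Real checks would of course be hardware/platform specific, as the slide notes; the point is only that "pattern plus possible solution" maps naturally onto a predicate table.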
18
Performance flow (user's perspective)
  • Instrument hardware through VHDL Instrumenter GUI
  • Java/Perl program to simplify modifications to
    VHDL for performance analysis
  • Must re-synthesize and implement hardware
  • Requires adding the instrumented HDL file via the
    standard tool flow
  • Instrument software through PPW compiler scripts
  • Run software with ppwupcc instead of standard
    compiler
  • Use the fpga-nallatech and inst-functions
    command-line options

19
Case Study: N-Queens
  • Overview
  • Find the number of distinct ways n queens can be
    placed on an n×n board without attacking each
    other
  • Performance analysis overhead
  • Sixteen 32-bit profile counters
  • One 96-bit trace buffer (completed cores)
  • Main state machine optimized based on data
  • Improved speedup (from 34× to 37× vs. Xeon code)

N-Queens results (board size 16)         XD1 Original  XD1 Instr.    Xeon-H101 Original  Xeon-H101 Instr.
Slices (% relative to device)            9,041         9,901 (+4%)   23,086              26,218 (+6%)
Block RAM (% relative to device)         11            15 (+2%)      21                  22 (+0%)
Frequency in MHz (% relative to orig.)   124           123 (-1%)     101                 101 (0%)
Communication (KB/s)                     <1            33            <1                  30
Standard backtracking algorithm employed on the FPGAs
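
The standard backtracking algorithm referenced above can be sketched in software (a bitmask formulation; small boards are shown for verification, since n = 16 takes far longer than a quick check):

```python
def count_nqueens(n):
    """Count placements of n queens on an n×n board with no two
    attacking, via standard backtracking: one queen per row, with
    bitmasks tracking attacked columns and both diagonal directions."""
    def place(row, cols, diag1, diag2):
        if row == n:
            return 1  # all rows filled: one valid placement
        total = 0
        # Bits set in `free` are columns safe for this row.
        free = ~(cols | diag1 | diag2) & ((1 << n) - 1)
        while free:
            bit = free & -free   # lowest free column
            free -= bit
            # Shift diagonal masks as we move down one row.
            total += place(row + 1, cols | bit,
                           (diag1 | bit) << 1, (diag2 | bit) >> 1)
        return total
    return place(0, 0, 0, 0)
```

For example, `count_nqueens(8)` gives the classic 92 solutions; the FPGA cores in the case study run many such subtrees in parallel.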
20
Case Study: Collatz Conjecture (3x+1)
  • Application
  • Search for sequences that do not reach 1 under
    the following function: f(n) = n/2 if n is even,
    3n + 1 if n is odd
  • 3.2GHz P4-Xeon CPU with Virtex-4 LX100 FPGA over
    PCI-X
  • Uses 88% of FPGA slices, 22 (53%) of block RAM,
    runs at 100 MHz
  • Setup
  • 17 counters monitored 3 state machines
  • No frequency degradation observed
  • Results
  • Frequent, small FPGA communication
  • 31% performance improvement achieved by buffering
    data before sending to the FPGA
  • Unexpected: the hardware had been tuned to work
    longer to eliminate communication problems
  • Distribution of data inside FPGA
  • Expected performance increase not large enough to
    merit implementation
  • Conclusions
  • Buffering data achieved a 31% increase in speed
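
The function in question is the usual Collatz map; a minimal software sketch of the per-number computation follows (the FPGA design itself is not reproduced here):

```python
def collatz_steps(n):
    """Number of applications of the 3x+1 map needed to reach 1:
    f(n) = n // 2 if n is even, else 3*n + 1.
    The conjecture is that this terminates for every n >= 1."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps
```

Each number's trajectory is independent, which is why the work distributes cleanly across FPGA cores, and why the per-number results are small enough that batching them (the buffering fix above) pays off.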

(Figure: communication timeline showing FPGA Write, FPGA Read, FPGA Data Processing, and Computation phases)
21
Conclusions
  • RC performance analysis is critical to
    understanding RC application behavior
  • Need unified instrumentation, measurement, and
    visualization to handle diverse and massively
    parallel RC systems
  • Automated analysis can be useful for locating
    common RC bottlenecks (though difficult to do)
  • Framework developed
  • First RC performance concept and tool framework
    (per extensive literature review)
  • Automated instrumentation
  • Measurement via tracing, profiling, sampling
  • Application case-studies
  • Observed minimal overhead from tool
  • Speedup achieved due to performance analysis

22
References
  • R. DeVille, I. Troxel, and A. George. Performance
    monitoring for run-time management of
    reconfigurable devices. Proc. of International
    Conference on Engineering of Reconfigurable
    Systems and Algorithms (ERSA), pages 175-181,
    June 2005.
  • Paul Graham, Brent Nelson, and Brad Hutchings.
    Instrumenting bitstreams for debugging FPGA
    circuits. In Proc. of the 9th Annual IEEE
    Symposium on Field-Programmable Custom Computing
    Machines (FCCM), pages 41-50, Washington, DC,
    USA, Apr. 2001. IEEE Computer Society.
  • Sameer S. Shende and Allen D. Malony. The TAU
    parallel performance system. International
    Journal of High Performance Computing
    Applications (HPCA), 20(2):287-311, May 2006.
  • C. Eric Wu, Anthony Bolmarcich, Marc Snir,
    David Wootton, Farid Parpia, Anthony Chan, Ewing
    Lusk, and William Gropp. From trace generation to
    visualization: a performance framework for
    distributed parallel systems. In Proc. of the
    2000 ACM/IEEE Conference on Supercomputing (SC),
    page 50, Washington, DC, USA, Nov. 2000. IEEE
    Computer Society.
  • Adam Leko and Max Billingsley, III. Parallel
    Performance Wizard user manual.
    http://ppw.hcs.ufl.edu/docs/pdf/manual.pdf, 2007.
  • S. Koehler, J. Curreri, and A. George,
    "Challenges for Performance Analysis in
    High-Performance Reconfigurable Computing," Proc.
    of Reconfigurable Systems Summer Institute 2007
    (RSSI), Urbana, IL, July 17-20, 2007.