John Curreri - PowerPoint PPT Presentation

Provided by: gstitt



Project F2 Application Performance Analysis
  • John Curreri
  • Seth Koehler
  • Rafael Garcia

  • Introduction
  • Application mappers
  • Historical background
  • Performance analysis today
  • HLL runtime performance analysis tool
  • Motivation
  • Instrumentation
  • Framework
  • Visualization
  • Case study
  • Molecular Dynamics
  • Conclusions and References

Application Mappers
  • Translates C code to HDL
  • Higher level of abstraction
  • Usually a subset of ANSI C
  • No pointers
  • No standard C libraries for FPGA
  • HDL is generated as a project file for Xilinx or
    Altera tools
  • Built-in communication
  • Separate C source files are made for the CPU and FPGA
  • Similar communication function calls between CPU and FPGA

Application Mappers (continued)
  • Computational parallelism
  • Pipelining of loops
  • for(), while(), etc.
  • Use of library functions
  • HDL coded functions called at HLL
  • FFT, Floating point operations
  • Replication of functions defined in hardware
  • Types of communication
  • DMA transfers
  • Efficient transfer of large chunks of data
  • Stream transfers
  • Steady flow of data
  • Buffered for transfer rate changes
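
The benefit of loop pipelining listed above can be estimated with the usual cycle-count model. The sketch below is only an illustration: the function names, the loop latency, and the initiation interval are assumptions chosen for the example, not figures from any particular application mapper.

```python
def unpipelined_cycles(iterations, latency):
    """Each iteration must finish before the next one starts."""
    return iterations * latency


def pipelined_cycles(iterations, latency, initiation_interval=1):
    """A new iteration starts every `initiation_interval` cycles.

    The first result appears after `latency` cycles; each later
    iteration adds one initiation interval.
    """
    if iterations == 0:
        return 0
    return latency + (iterations - 1) * initiation_interval


if __name__ == "__main__":
    n, depth = 1000, 10  # hypothetical loop: 1000 iterations, 10-cycle body
    print("unpipelined:", unpipelined_cycles(n, depth))
    print("pipelined:  ", pipelined_cycles(n, depth))
```

With these assumed numbers the unpipelined loop takes 10000 cycles and the pipelined loop 1009, which is why a stalled pipeline (an initiation interval above 1) is worth monitoring.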

Introduction to the F2 project
  • Goals for performance analysis in RC
  • Productively identify and remedy performance
    bottlenecks in RC applications (CPUs and FPGAs)
  • Motivations
  • Complex systems are difficult to analyze by hand
  • Manual instrumentation is unwieldy
  • Difficult to make sense of large volume of raw
  • Tools can help quickly locate performance
  • Collect and view performance data with little
  • Analyze performance data to indicate potential
  • Staple in HPC, limited in HPEC, and virtually
    non-existent in RC
  • Challenges
  • How do we expand notion of software performance
    analysis into software-hardware realm of RC?
  • What are common bottlenecks for dual-paradigm
  • What techniques are necessary to detect
    performance bottlenecks?
  • How do we analyze and present these bottlenecks
    to a user?

Historical Background
  • Gettimeofday and printf
  • VERY cumbersome, repetitive, manual, not
    optimized for speed
  • Profilers date back to 70s with prof (gprof,
  • Provide user with information about application
  • Percentage of time spent in a function
  • How often a function calls another function
  • Simulators / Emulators
  • Too slow or too inaccurate
  • Require significant development time
  • PAPI (Performance Application Programming Interface)
  • Portable interface to hardware performance
    counters on modern CPUs
  • Provides information about caches, CPU
    functional units, main memory, and more
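
The manual gettimeofday-and-printf style at the top of this list looks roughly like the sketch below, written in Python for brevity. `time.perf_counter` stands in for gettimeofday and `print` for printf; the helper name `timed_region` and the workload are hypothetical.

```python
import time


def timed_region(label, func, *args):
    """Time one call by hand and print the elapsed wall-clock time."""
    start = time.perf_counter()            # plays the role of gettimeofday()
    result = func(*args)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.6f} s")     # plays the role of printf()
    return result, elapsed


if __name__ == "__main__":
    # Every region of interest needs its own hand-written wrapper call.
    total, _ = timed_region("sum of 0..999999", sum, range(1_000_000))
    print("result:", total)
```

Each measured region needs its own start, stop, and print statement, which is what makes the approach repetitive and hard to maintain as an application grows.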

Source: Wikipedia
Performance Analysis Today
  • What does performance analysis look like today?
  • Goals
  • Low impact on application behavior
  • High-fidelity performance data
  • Flexible
  • Portable
  • Automated
  • Concise Visualization
  • Techniques
  • Event-based, sample-based
  • Profile, Trace
  • Above all, we want to understand application
    behavior in order to locate performance problems!

Related Research and Tools: Parallel Performance Wizard (PPW)
  • Open-source tool developed by UPC Group at
    University of Florida
  • Performance analysis and optimization (PGAS
    systems and MPI support)
  • Performance data can be analyzed for bottlenecks
  • Offers several ways of exploring performance data
  • Graphs and charts to quickly view high-level
    performance information at a glance (right, top)
  • In-depth execution statistics for identifying
    communication and computational bottlenecks
  • Interacts with popular trace viewers (e.g.,
    Jumpshot; right, bottom) for detailed analysis
    of trace data
  • Comprehensive support for correlating performance
    back to original source code

Partitioned Global Address Space languages
allow partitioned memory to be treated as global
shared memory by software.
Motivation for RC Performance Analysis
  • Dual-paradigm applications gaining more traction
    in HPC and HPEC
  • Design flexibility allows best use of FPGAs and
    traditional processors
  • Drawback: more challenging to design applications
    for dual-paradigm systems
  • Parallel application tuning and FPGA core
    debugging are hard enough!

  • No existing holistic solutions for analyzing
    dual-paradigm applications
  • Software-only views leave out low-level details
  • Hardware-only views provide incomplete
    performance information
  • Need complete system view for effective tuning of
    entire application

Motivation for RC Performance Analysis
  • Q: Is my runtime load-balancing strategy working?
  • A: ???

ChipScope waveform
Motivation for RC Performance Analysis
  • Q: How well is my core's pipelining strategy
  • A: ???

Flat profile: Each sample counts as 0.01 seconds.

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 51.52      2.55     2.55        5   510.04   510.04  USURP_Reg_poll
 29.41      4.01     1.46       34    42.82    42.82  USURP_DMA_write
 11.97      4.60     0.59       14    42.31    42.31  USURP_DMA_read
  4.06      4.80     0.20        1   200.80   200.80  USURP_Finalize
  2.23      4.91     0.11        5    22.09    22.09  localp
  1.22      4.97     0.06        5    12.05    12.05  USURP_Load
  0.00      4.97     0.00       10     0.00     0.00  USURP_Reg_write
  0.00      4.97     0.00        5     0.00     0.00  USURP_Set_clk
  0.00      4.97     0.00        5     0.00   931.73  rcwork
  0.00      4.97     0.00        1     0.00     0.00  USURP_Init

gprof output (N, one for each node!)
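
To see what this flat profile does and does not reveal, the short sketch below re-derives two numbers from the self-seconds and call counts shown in the gprof output. It assumes that the `USURP_`-prefixed functions are the FPGA interface calls, as their names suggest; that grouping is an interpretation, not something gprof reports.

```python
# (name, self seconds, calls) copied from the flat profile
ROWS = [
    ("USURP_Reg_poll", 2.55, 5),
    ("USURP_DMA_write", 1.46, 34),
    ("USURP_DMA_read", 0.59, 14),
    ("USURP_Finalize", 0.20, 1),
    ("localp", 0.11, 5),
    ("USURP_Load", 0.06, 5),
    ("USURP_Reg_write", 0.00, 10),
    ("USURP_Set_clk", 0.00, 5),
    ("rcwork", 0.00, 5),
    ("USURP_Init", 0.00, 1),
]


def self_ms_per_call(seconds, calls):
    """Average self time per call in milliseconds."""
    return 1000.0 * seconds / calls


total = sum(seconds for _, seconds, _ in ROWS)
# Assumption: the USURP_ prefix marks FPGA interface functions.
fpga_api = sum(s for name, s, _ in ROWS if name.startswith("USURP_"))
share = fpga_api / total

if __name__ == "__main__":
    print(f"total self time: {total:.2f} s")
    print(f"time inside USURP_* calls: {100 * share:.1f}%")
    print(f"USURP_Reg_poll: {self_ms_per_call(2.55, 5):.1f} ms/call")
```

Under that assumption almost all sampled CPU time is spent waiting on or talking to the FPGA, yet the profile says nothing about what the FPGA was doing during that time.
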
Instrumentation Level
  • High-level language (HLL)
  • Requires HLL timing functions
  • Application mapping disturbed by instrumentation
  • Hardware Description Language (HDL)
  • Portable across HLLs and types of FPGA families
  • Selected level for instrumentation
  • FPGA bit stream
  • Requires targeting specific FPGA family
  • Instrument in minutes

Instrumentation Selection
  • Automated - Computation
  • State machines
  • Used for preserving execution order in C
  • Used to control state of pipelines
  • Control and status signals
  • Used by library function
  • Automated - Communication
  • Control and status signals
  • Used for streaming communication
  • Used for DMA transfers
  • Application specific
  • Monitoring variables for meaningful values

Measurement Techniques
  • Profiling
  • Counters
  • Records number of occurrences of event
  • Low overhead
  • Normally uses registers
  • Block RAM can be used for state machines
  • Tracing
  • Timestamps
  • Indicating when event occurred
  • Data
  • Associated with each event
  • Greater overhead
  • Uses memory to store timestamps and data
  • Greater fidelity
  • Reconstruction of sequence of events
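
The difference between the two techniques can be shown with a small software model. This is a sketch only: the class names, the event names, and the trace capacity are hypothetical, and a real hardware measurement module would use registers and on-chip memory rather than Python objects.

```python
from collections import Counter


class EventProfiler:
    """Profiling: keep only a running count per event, like a counter register."""

    def __init__(self):
        self.counts = Counter()

    def record(self, event):
        self.counts[event] += 1


class EventTracer:
    """Tracing: store (timestamp, event, data) so the sequence can be rebuilt."""

    def __init__(self, capacity):
        self.capacity = capacity   # models a fixed-size trace memory
        self.records = []
        self.dropped = 0

    def record(self, cycle, event, data=None):
        if len(self.records) < self.capacity:
            self.records.append((cycle, event, data))
        else:
            self.dropped += 1      # trace memory full: the event is lost


# Hypothetical event stream: (cycle, event name, associated data)
events = [(0, "start", None), (3, "stall", 7), (4, "stall", 7), (9, "done", 42)]
profiler, tracer = EventProfiler(), EventTracer(capacity=3)
for cycle, name, data in events:
    profiler.record(name)
    tracer.record(cycle, name, data)

if __name__ == "__main__":
    print(dict(profiler.counts))
    print(tracer.records, "dropped:", tracer.dropped)
```

The profiler needs one counter per event type regardless of run length, while the tracer keeps ordering and data but fills its memory, which mirrors the overhead and fidelity trade-off in the list above.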

Zaki, O., Lusk, E., Gropp, W., and Swider, D.
1999. Toward Scalable Performance Visualization
with Jumpshot. Int. J. High Perform. Comput.
Appl. 13, 3 (Aug. 1999), 277-288.

Hardware Measurement Module
[Figure: hardware measurement module]

Adding Instrumentation and Measurement
[Figure: HLL tool flow. Boxes: Application (C source), software-hardware mapping, HLL API wrapper, compile software, HLL hardware wrapper, Application (HDL), hardware measurement module, implement hardware, finished design. Stages: uninstrumented project; instrumentation added to C source; C source for FPGA mapped to HDL; instrumentation added to HDL; implement hardware.]

Reverse Mapping and Analysis
  • Mapping of HDL data back to HLL
  • Variable name-matching
  • Observing scope and other patterns
  • Bottleneck detection
  • Load-balancing of replicated functions
  • Monitoring for pipeline stalls
  • Detecting streaming communication stalls
  • Finding shared-memory contention
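
Two of the bottleneck checks listed above reduce to simple ratios over profile counters. The sketch below assumes the measurement module exposes a busy-cycle count per replicated function and a stall-cycle count per pipeline; the function names and sample values are hypothetical.

```python
def load_imbalance(busy_cycles):
    """Ratio of the busiest replica to the average replica.

    1.0 means perfectly balanced; larger values mean one replica
    is doing a disproportionate share of the work.
    """
    mean = sum(busy_cycles) / len(busy_cycles)
    return max(busy_cycles) / mean


def stall_fraction(stall_cycles, total_cycles):
    """Fraction of cycles a pipeline spent stalled."""
    return stall_cycles / total_cycles


if __name__ == "__main__":
    replicas = [400, 100, 100, 200]   # hypothetical busy-cycle counters
    print("imbalance:", load_imbalance(replicas))
    print("stalled:  ", stall_fraction(250, 1000))
```

A tool can flag a replicated function when the imbalance exceeds a chosen threshold, and a pipeline or stream when its stall fraction does; the thresholds themselves are a design choice and are not given in the slides.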

Example RC Visualization
  • Need unified visualizations that accentuate
    important statistics
  • Must be scalable to many nodes

Molecular Dynamics
  • Simulation
  • Interactions between atoms and molecules
  • Discrete time intervals
  • Models forces
  • Newtonian physics
  • Van der Waals forces
  • Other interactions
  • Tracks molecules' positions and velocities
  • X, Y and Z directions

Case Study Setup
  • Impulse C v2.2
  • XD1000 platform
  • Opteron 2.2 GHz
  • XD1000 module with Altera Stratix-II EP2S180 FPGA
    in second processor socket
  • MD communication architecture
  • Chunks of MD data are read from SRAM
  • Data is streamed to multiple MD kernels that are
  • Results are stored back to SRAM

Impulse-C Profile Percentages
Output stream of Molecular Dynamics kernel is a
Stream buffer size was increased 32-fold,
raising application speedup from 6.2x to 7.8x
over the serial baseline.
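
The reported change corresponds to a relative gain of 7.8 / 6.2 ≈ 1.26. Why a deeper stream buffer can help is illustrated by the toy cycle-level model below. It is not a model of the actual MD kernel or of Impulse C streams: the bursty producer, the steady consumer, and every parameter value are assumptions made for illustration.

```python
def simulate_stream(total_items, fifo_depth, burst=32, pause=32, drain_period=2):
    """Return (cycles, producer_stalls) for a bursty producer and steady consumer.

    The producer tries to write one item per cycle for `burst` items, then
    pauses for `pause` cycles. The consumer reads one item every
    `drain_period` cycles. A write attempt on a full FIFO counts as a stall.
    """
    fifo = produced = consumed = stalls = cycles = 0
    burst_left, pause_left = burst, 0
    while consumed < total_items:
        # Consumer side: steady drain.
        if cycles % drain_period == 0 and fifo > 0:
            fifo -= 1
            consumed += 1
        # Producer side: bursty writes.
        if produced < total_items:
            if pause_left > 0:
                pause_left -= 1
            elif fifo < fifo_depth:
                fifo += 1
                produced += 1
                burst_left -= 1
                if burst_left == 0:
                    burst_left, pause_left = burst, pause
            else:
                stalls += 1
        cycles += 1
    return cycles, stalls


if __name__ == "__main__":
    for depth in (1, 32):
        cycles, stalls = simulate_stream(1024, depth)
        print(f"depth {depth:2d}: {cycles} cycles, {stalls} producer stalls")
```

In this model the shallow FIFO forces the producer to wait on the consumer during every burst, while the deeper FIFO absorbs each burst and removes the stalls, which is the kind of effect a stream-stall counter is meant to expose.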
Performance Analysis Overhead
  • Additional FPGA resource usage
  • Less than 4%
  • Frequency reduction
  • Less than 3%

Conclusions
  • Developed prototype HLL-oriented RC performance
    analysis tool
  • First such runtime performance analysis tool
    framework (per extensive literature review)
  • Tracing and profiling available
  • Automated instrumentation in progress
  • Application case study performed
  • Observed minimal overhead from tool
  • Speedup achieved due to performance analysis
  • Future work
  • SRC support, automated instrumentation and
    analysis, integration with software PAT, further
    case studies

References
  • Paul Graham, Brent Nelson, and Brad Hutchings.
    Instrumenting bitstreams for debugging FPGA
    circuits. In Proc. of the 9th Annual IEEE
    Symposium on Field-Programmable Custom Computing
    Machines (FCCM), pages 41-50, Washington, DC,
    USA, Apr. 2001. IEEE Computer Society.
  • Sameer S. Shende and Allen D. Malony. The Tau
    parallel performance system. International
    Journal of High Performance Computing
    Applications (HPCA), 20(2):287-311, May 2006.
  • C. Eric Wu, Anthony Bolmarcich, Marc Snir,
    David Wootton, Farid Parpia, Anthony Chan, Ewing
    Lusk, and William Gropp. From trace generation to
    visualization: a performance framework for
    distributed parallel systems. In Proc. of the
    2000 ACM/IEEE conference on Supercomputing
    (CDROM) (SC), page 50, Washington, DC, USA, Nov.
    2000. IEEE Computer Society.
  • Adam Leko and Max Billingsley, III. Parallel
    performance wizard user manual.
    http//, 2007.
  • S. Koehler, J. Curreri, and A. George,
    "Challenges for Performance Analysis in
    High-Performance Reconfigurable Computing," Proc.
    of Reconfigurable Systems Summer Institute 2007
    (RSSI), Urbana, IL, July 17-20, 2007.
  • J. Curreri, S. Koehler, B. Holland, and A.
    George, "Performance Analysis with High-Level
    Languages for High-Performance Reconfigurable
    Computing," Proc. of 16th IEEE Symposium on
    Field-Programmable Custom Computing Machines
    (FCCM), Palo Alto, CA, Apr. 14-15, 2008.