Title: Cycle Accurate Performance Measurement
1Cycle Accurate Performance Measurement
- Richard Hough
- Phillip Jones, Scott Friedman, Roger Chamberlain, Jason Fritts, John Lockwood, and Ron Cytron
- rh3_at_wustl.edu
- http://liquid.arl.wustl.edu/
Funded by NSF Grant ITR-0313203
2Outline
- Introduction
- Motivation
- Background
- Architecture
- Usage
- Results
- Future Work
- Related Work
- Conclusion
3Introduction What Are We Doing?
- Creating a module for capturing cycle-accurate
profiles of hardware events during the runtime of
programs on real systems
Statistics Module
8Background - FPX
- Designed and implemented on the FPX platform
- The FPX platform
  - Is designed for developing pluggable network circuits
  - Contains a Virtex 2000E FPGA for design deployment
  - Possesses a smaller FPGA used as a network interface device
  - Can potentially operate at gigabit line rates
9Background - LEON2
- Developed by Gaisler Research
- SPARC V8
- Open-source VHDL
- Widely used
  - European Space Agency, etc.
- Second in popularity only to the MicroBlaze
10Motivation Why Not Use Software?
- Software Profiling Is
  - Inaccurate
    - Many data points estimated
    - Time slices not absolute
    - Profiling affects results
  - Inefficient
    - Unreasonable for real-system deployment
  - Ineffective
    - Difficult to separate OS overhead
11Motivation Why Not Use Simulation?
- Simulation is
  - Slow
    - A simple simulation could require 100X more time than running the program
  - Bound by the quality of the model
    - The model used may be inaccurate
    - Processors often tweaked without updating the documentation [Larus]
12Motivation Why Use FPGAs?
- ASICs are expensive
- FPGAs provide good blend of cost and accuracy
- Software simulation of processors is incredibly slow
- Allows for easy prototyping
  - Test new caching methods, tweak the ISA, etc.
13Motivation Why Put Statsmod In An FPGA?
- The Statistics Module Allows You To
  - Pull event signals from anywhere
  - Evaluate both software and hardware optimizations
    - Tweak the architecture
    - Integrate hardware accelerated modules into software solutions
    - Adjust the software algorithm
  - Gather repeatable and reliable results
14Architecture Naïve Solution
- Interested in 10 events and counters
- Naïve solution implements a counter for each possibility
- 100 counters!
- Not scalable for large systems
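The count on this slide can be reproduced with one line of arithmetic. The sketch below assumes that "each possibility" means one (event, code region) pair, and that 10 code regions are watched alongside the 10 events. The function name and the region count are illustrative and do not come from the slides.

```python
def naive_counter_count(num_events: int, num_regions: int) -> int:
    """Counters needed when every (event, region) pair gets a dedicated counter."""
    return num_events * num_regions


# Assumed scenario: 10 events of interest, 10 code regions of interest.
print(naive_counter_count(10, 10))  # 100
```

Because the count is a product, doubling both the events and the regions quadruples the hardware, which is the scaling problem the slide points to.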
15Architecture Our Solution
- Better Approach
- Associate counters with events and methods at run time
- Covers the problem area, but uses less chip space
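To make the run-time association concrete, here is a minimal software model of the idea, not the actual VHDL design. It assumes each counter is bound to one event name and one inclusive address range, and that it increments on any clock cycle where the event is asserted while the program counter lies inside the range. The class names, the event names (`cycle`, `cache_miss`), and the addresses are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Counter:
    event: str      # event signal this counter listens to
    lo: int         # first address of the watched range (inclusive)
    hi: int         # last address of the watched range (inclusive)
    count: int = 0


class StatsModel:
    """Software model of a small pool of counters bound at run time."""

    def __init__(self, num_counters: int) -> None:
        self.num_counters = num_counters
        self.counters: List[Counter] = []

    def bind(self, event: str, lo: int, hi: int) -> Counter:
        """Claim a free counter for one (event, address range) pair."""
        if len(self.counters) >= self.num_counters:
            raise RuntimeError("no free counters")
        counter = Counter(event, lo, hi)
        self.counters.append(counter)
        return counter

    def tick(self, pc: int, active_events: Set[str]) -> None:
        """Model one clock cycle: bump every counter whose event and range match."""
        for c in self.counters:
            if c.event in active_events and c.lo <= pc <= c.hi:
                c.count += 1


stats = StatsModel(num_counters=4)
cycles = stats.bind("cycle", 0x1000, 0x1FFF)
misses = stats.bind("cache_miss", 0x1000, 0x1FFF)

# (program counter, events asserted during that cycle)
trace = [
    (0x1000, {"cycle"}),
    (0x1004, {"cycle", "cache_miss"}),
    (0x2000, {"cycle", "cache_miss"}),  # outside the watched range
]
for pc, events in trace:
    stats.tick(pc, events)

print(cycles.count, misses.count)  # 2 1
```

Only the pairs that are actually bound consume a counter, so the pool can stay much smaller than the full events-by-regions grid.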
16Architecture An In Depth Look
17Architecture Scalability
[Figure labels: Naïve Approach; Address Range Registers; Counters; Events]
18Usage
19Results What do we get?
- The next few slides contain data from the Linpack benchmark running on the FPGA
- Linpack is an FPU-intensive benchmark
- While the following slides focus on runtime, it is important to remember that the graphs could in principle be of any event
20Results
323,686,726 clock cycles
21Results
22Results
23Results
24Future Work Where can we go?
- As of a week ago, the StatsMod was successfully integrated into a Linux 2.6.11 OS running on LEON
- Changes have been made to allow a clear separation between process IDs
  - OS, background tasks, threads
- A device driver allows any program, including the program being profiled, to gather the statistics
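The separation between process IDs can be pictured as keeping one set of event totals per process. The sketch below is a software illustration only: the slides do not describe how the hardware learns the current process ID or what the device driver returns, so the `(pid, event)` sample format, the function name, and the event names are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def totals_by_pid(samples: Iterable[Tuple[int, str]]) -> Dict[int, Dict[str, int]]:
    """Sum event occurrences separately for each process ID."""
    totals = defaultdict(lambda: defaultdict(int))
    for pid, event in samples:
        totals[pid][event] += 1
    return {pid: dict(events) for pid, events in totals.items()}


# Hypothetical samples: PID 0 stands in for the OS, PID 42 for the profiled program.
samples = [
    (0, "cycle"),
    (42, "cycle"),
    (42, "cache_miss"),
    (42, "cycle"),
    (0, "cycle"),
]
per_pid = totals_by_pid(samples)
print(per_pid[42])  # {'cycle': 2, 'cache_miss': 1}
```

Keeping the totals keyed by process is what lets OS overhead and background tasks be excluded from the numbers reported for the program under test.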
25Future Work Where can we go?
- Programs could now potentially collect statistics on themselves and perform runtime introspection
  - Adjust operation to conserve power, memory accesses, etc.
- Deeper integration could occur at the kernel level to affect scheduler decisions
  - Adds a new dimension for slicing resources
  - Network activity, device activity, page faults, etc.
26Related Work
- SnoopP
  - Developed by Lesley Shannon and Paul Chow at the University of Toronto
  - Collects timing characteristics of programs running on a MicroBlaze processor
  - Focuses on clock cycles only
  - Integrated into the EDK
27Conclusion
- In closing, I would like to thank
- Phillip Jones for his hard work and support
- Ron Cytron for his mentoring and persistence
- Scott Friedman for his work on the web interface
- The rest of the Liquid Architecture team
- And WISA for the invitation to present
28Questions?
29Background Liquid
30Usage
- Connect to a secure web server controlling the FPGA hardware
- Upload the desired binary executable, associated map file, and desired programming bitfile
- A Perl script parses the map file and provides a graphical interface for selecting the desired address ranges and events
- Statistical results are tabulated at the end of the program's execution
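The map-file step above can be sketched as follows. The project's actual tool is a Perl script and the slides do not give the map file format, so this Python version assumes `nm`-style lines of the form `address type name` and treats each function as running up to the byte before the next function symbol. The addresses and the symbol list in the example are made up.

```python
from typing import Dict, Tuple


def parse_map(text: str) -> Dict[str, Tuple[int, int]]:
    """Return {function: (start, end)} from nm-style 'address type name' lines."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        # Keep only text-section (code) symbols, marked T or t.
        if len(parts) == 3 and parts[1] in ("T", "t"):
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()

    ranges = {}
    # A function ends one byte before the next function starts; the last
    # function is skipped because its end cannot be inferred this way.
    for (start, name), (next_start, _) in zip(symbols, symbols[1:]):
        ranges[name] = (start, next_start - 1)
    return ranges


example = """\
40001000 T main
40001080 T daxpy
40001200 T dgefa
40002000 D data_start
"""
ranges = parse_map(example)
for name, (lo, hi) in ranges.items():
    print(f"{name}: 0x{lo:08x}-0x{hi:08x}")
```

Each resulting range is the kind of address window a user would pick in the graphical interface and pair with an event.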