Title: Lessons Learned with Performance Prediction and Design Patterns on Molecular Dynamics
- Brian Holland
- Karthik Nagarajan
- Saumil Merchant
- Herman Lam
- Alan D. George
Outline of Algorithm Design Progression
- Algorithm decomposition
- Design flow challenges
- Performance prediction
- RC Amenability Test (RAT)
- Application case study
- Improvements to RAT
- Design patterns and methodology
- Introduction and related research
- Expanding pattern documentation
- Molecular dynamics case study
- Conclusions
Design Evolution timeline: Feb 07, Jun 07, Sept 07
Design Flow Challenges
- Original mission
- Create scientific applications for FPGAs as case studies to investigate topics such as portability and scalability
- Molecular dynamics is one such application
- Maximize performance and productivity using HLLs and high-performance reconfigurable computing (HPRC)
- Applications should have significant speedup over SW baseline
- Challenges
- Ensuring speedup over traditional implementations
- Particularly when the researcher is not RC-oriented
- Exploring the design space efficiently
- Several designs may achieve speedup, but which should be used?
Algorithm Performance
- Premises
- (Re)designing applications is expensive
- Only want to design once and, even then, do it efficiently
- Scientific applications can contain extra precision
- Floating point may not be necessary but is a SW standard
- Optimal design may overuse available FPGA resources
- Discovering resource exhaustion mid-development is expensive
- Need
- Performance prediction
- Quickly and with reasonable accuracy, estimate the performance of a particular algorithm on a specific FPGA platform
- Utilize simple analytic models to make prediction accessible to novices
Introduction
- RC Amenability Test (RAT)
- Methodology for rapidly analyzing a particular algorithm design's compatibility with a specific FPGA platform and projecting speedup
- Importance of RAT
- Design migration process is lengthy and costly
- Allows for detailed consideration with potential tradeoff analyses
- Creates a formal procedure, reducing the need for expert knowledge
- Scope of RAT
- RAT cannot make generalizations about applications
- Different algorithm choices will greatly affect application performance
- Different FPGA platform architectures will affect algorithm capabilities
RAT Methodology
- Throughput Test
- Algorithm and FPGA platform are parameterized
- Equations are used to predict speedup
- Numerical Precision Test
- RAT user should explicitly examine the impact of reducing precision on computation
- Interrelated with the throughput test
- The two tests essentially proceed simultaneously
- Resource Utilization Test
- FPGA resource usage is estimated to determine scalability on the FPGA platform
Overview of RAT Methodology
Related Work
- Performance prediction via parameterization
- "The performance analysis in this paper is not real performance prediction; rather it targets the general concern of whether or not an algorithm will fit within the memory subsystem that is designed to feed it." [1] (Illinois)
- Applications decomposed to determine total size and computational density
- Computational platforms characterized by memory size, bandwidth, and latency
- Parallel, heterogeneous shared RC resource modeling [2] (ORNL)
- System-level modeling for multi-FPGA augmented systems
- Other performance prediction research
- Performance Prediction Model (PPM) [3]
- Optimizing hardware function evaluation [4]
- Comparable models for conventional parallel processing systems
- Parallel Random Access Machine (PRAM) [5]
- LogP [6]
Throughput
- Methodology
- Parameterize key components of the algorithm to estimate runtime
- Use equations to determine execution time of the RC application
- Compare runtime with SW baseline to determine projected speedup
- Explore ranges of values to examine algorithm performance bounds
- Terminology
- Element
- Basic unit of data for the algorithm that determines the amount of computation
- e.g., each character (element) in a string-matching algorithm will require some number of computations to complete
- Operation
- Basic unit of work which helps complete a data element
- Granularity can vary greatly depending upon formulation
- e.g., 1 multiply or 16 shifts could represent 1 or 16 operations, respectively
RAT Input Parameters
Communication and Computation
- Communication is defined by reads and writes of the FPGA
- Note that this equation refers to a single iteration of the algorithm
- Read and write times are a function of the number of elements, the size of each element, and the FPGA/CPU interconnect transfer rate
- Similarly, computation is determined by the number of operations (a function of the number of elements), the parallelism/pipelining in the algorithm (throughput), and the clock frequency
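As a sketch, the per-iteration times described above can be written out directly; the function and parameter names here are assumed from the slide's terminology, not taken from an actual RAT implementation:

```c
#include <assert.h>
#include <math.h>

/* Communication: time to move N elements of B bytes each over an
   interconnect with ideal bandwidth 'ideal_bytes_per_sec', derated by
   an empirically measured efficiency factor alpha (0 < alpha <= 1). */
static double t_io(double n_elements, double bytes_per_element,
                   double alpha, double ideal_bytes_per_sec)
{
    return (n_elements * bytes_per_element) / (alpha * ideal_bytes_per_sec);
}

/* Computation: N elements, each requiring ops_per_element operations,
   processed at 'throughput' ops/cycle at clock frequency f_clock (Hz). */
static double t_comp(double n_elements, double ops_per_element,
                     double throughput, double f_clock)
{
    return (n_elements * ops_per_element) / (throughput * f_clock);
}
```

With the 1-D PDF numbers from the later case study (512 elements, 768 ops/element, 20 ops/cycle, and an assumed 100 MHz clock), t_comp comes to roughly 0.2 ms per iteration.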
RC Execution Time
Example Overlap Scenarios
- Total RC execution time
- Function of communication time, computation time, and number of iterations required to complete the algorithm
- Overlap of computation and communication
- Single Buffered (SB)
- No overlap; computation and communication are additive
- Double Buffered (DB)
- Complete overlap; execution time is dominated by the larger term
Performance Equations
- Speedup
- Compares predicted performance versus software baseline
- Shows performance as a function of total execution time
- Utilization
- Computation utilization shows effective idle time of the FPGA
- Communication utilization illustrates interconnect saturation
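Sketched directly (single-buffered utilization assumed; names are illustrative):

```c
#include <assert.h>
#include <math.h>

/* Speedup: software baseline time over predicted RC execution time. */
static double speedup(double t_soft, double t_rc)
{
    return t_soft / t_rc;
}

/* Computation utilization: fraction of single-buffered RC time spent
   computing; (1 - this) is the FPGA's effective idle time. */
static double util_comp(double t_comp, double t_comm)
{
    return t_comp / (t_comp + t_comm);
}
```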
Numerical Precision
- Applications should have the minimum level of precision necessary to remain within user tolerances
- SW applications will often have extra precision due to coarse-grain data types of general-purpose processors
- Extra precision can be wasteful in terms of performance and resource utilization on FPGAs
- Automated floating-point to fixed-point conversion
- Useful for exploring reduced precision in algorithm designs
- Often requires additional coding to explore options
- Ultimately, the user must make the final determination on precision
- RAT exists to help explore computation performance aspects of the application, just as it helps investigate other algorithmic tradeoffs
Resource Utilization
- Intended to prevent designs that cannot be physically realized in FPGAs
- On-Chip RAM
- Includes memory for application core and off-chip I/O
- Relatively simple to examine and scale prior to hardware design
- Hardware Multipliers
- Includes variety of vendor-specific multipliers and/or MAC units
- Simple to compute usage with sufficient device knowledge
- Logic Elements
- Includes look-up tables and other basic registering logic
- Extremely difficult to predict usage before hardware design
Probability Density Function Estimation
- Parzen window probability density function (PDF) estimation
- Computational complexity O(Nnd)
- N: number of discrete probability levels (i.e., bins)
- n: number of discrete points where probability is estimated
- d: number of dimensions
- Intended architecture
- Eight parallel kernels each compute the discrete points versus a subset of the bins
- Incoming data samples are processed against 256 bins
Chosen 1-D PDF algorithm architecture
1-D PDF Estimation Walkthrough
- Dataset Parameters
- N_elements,input
- 204,800 samples / 400 iterations = 512
- N_elements,output
- For 1-D PDF, output is negligible
- N_bytes/element
- Each data value is 4 bytes, the size of the Nallatech communication channel
- Communication Parameters
- Models a Nallatech H101-PCIXM card containing a Virtex-4 LX100 user FPGA connected via a 133MHz PCI-X bus
- Alpha parameters were established using a read/write microbenchmark for modeling transfer times
- Computation Parameters
- N_ops/element
- 256 bins × 3 ops each = 768
- throughput_proc
- 8 pipelines × 3 ops = 24, rounded down to 20
- f_clock
- Several values are considered
- Software Parameters
- t_soft
RAT Input Parameters of 1-D PDF
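Plugging the table's parameters into the throughput equations gives the predicted computation time; f_clock = 100 MHz is an assumed illustrative value (the slide explores several), and communication is omitted because the measured alpha values are not reproduced here:

```c
#include <assert.h>
#include <math.h>

/* Total predicted computation time for 1-D PDF across all iterations. */
static double pdf1d_t_comp_total(double f_clock)
{
    const double n_iter = 400.0;           /* 204,800 samples / 512 */
    const double n_elements = 512.0;       /* elements per iteration */
    const double ops_per_element = 768.0;  /* 256 bins x 3 ops */
    const double throughput = 20.0;        /* 24 ops/cycle rounded down */
    return n_iter * n_elements * ops_per_element / (throughput * f_clock);
}
```

At the assumed 100 MHz this predicts roughly 79 ms of total computation.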
1-D PDF Estimation Walkthrough
- Frequency
- Difficult to predict a priori
- Several possible values are explored
- Prediction accuracy
- Communication accuracy was low
- Despite microbenchmarking, communication was longer than expected
- Minor inaccuracies in timing for small transfers compounded over 400 iterations for 1-D PDF
- Computational accuracy was high
- Throughput was rounded from 24 ops/cycle to 20 ops/cycle
- Conservative parallelism was warranted due to unaccounted pipeline stalls
- Algorithm constructed in VHDL
Performance Parameters of 1-D PDF
Example Computations from RAT Analysis
2-D PDF Estimation
- Dimensionality
- PDF can extend to multiple dimensions
- Significantly increases computational complexity and volume of communication
- Algorithm
- Same construction as 1-D PDF
- Written in VHDL
- Targets Nallatech H101
- Xilinx V4LX100 FPGA
- PCI-X interconnect
- Prediction Accuracy
- Communication
- Similar to 1-D PDF, communication times were underestimated
- Computation
- Computation time was shorter than expected, balancing overall execution time
RAT Input Parameters of 2-D PDF
Performance Parameters of 2-D PDF
Molecular Dynamics
- Simulation of the physical interaction of a set of molecules over a given time interval
- Based upon code provided by Oak Ridge National Lab (ORNL)
- Algorithm
- 16,384-molecule data set
- Written in Impulse C
- XtremeData XD1000 platform
- Altera Stratix-II EP2S180 FPGA
- HyperTransport interconnect
- SW baseline on 2.4GHz Opteron
- Challenges for accurate prediction
- Nondeterministic runtime
- Molecules beyond a certain threshold are assumed to have zero impact
- Large datasets for MD
- Exhaust FPGA local memory
RAT Input Parameters of MD
Performance Parameters of MD
Conclusions
- RC Amenability Test
- Provides a simple, fast, and effective method for investigating the performance potential of a given application design for a given target FPGA platform
- Works with empirical knowledge of RC devices to create a more efficient and effective means for application design
- When RAT-projected speedups are found to be disappointing, the designer can quickly reevaluate the algorithm design and/or the RC platform selected as target
- Successes
- Allows for rapid algorithm analysis before any significant hardware coding
- Demonstrates reasonably accurate predictions despite coarse parameterization
- Applications
- Showcases effectiveness of RAT for deterministic algorithms like PDF estimation
- Provides valuable qualitative insight for nondeterministic algorithms such as MD
- Future Work
- Improve support for nondeterministic algorithms through pipelining
- Explore performance prediction with applications for multi-FPGA systems
- Expand methodology for numerical precision and resource utilization
Molecular Dynamics Revisited
Molecular Dynamics
- Algorithm
- 16,384-molecule data set
- Written in Impulse C
- XtremeData XD1000 platform
- Altera Stratix-II EP2S180 FPGA
- HyperTransport interconnect
- SW baseline on 2.4GHz Opteron
- Parameters
- Dataset Parameters
- Model volume of data used by FPGA
- Communication Parameters
- Model the HyperTransport interconnect
- Computation Parameters
- N_ops/element
- 164,000 / 16,384 ≈ 10 ops
- i.e., each molecule (element) takes 10 ops/iteration
- throughput_proc
- 50
RAT Input Parameters of MD
Performance Parameters of MD
Parameter Alterations for Pipelining
- MD Optimization
- Each molecular computation should be pipelined
- Focus becomes less on individual molecules and more on molecular interactions
- Parameters
- Computation Parameters
- N_ops/element
- 16,400
- Strictly the number of interactions per element
- throughput_pipeline
- 0.333
- Inverse of the number of cycles needed per interaction; i.e., the pipeline can only stall for 2 extra cycles
- N_pipelines
- 15
- Guess based on predicted area usage
Modified RAT Input Parameters of MD
Performance Parameters of MD
Pipelined Performance Prediction
- Molecular Dynamics
- If a pipeline is possible, certain parameters become obsolete
- The number of operations in the pipeline (i.e., depth) is not important
- The number of pipeline stalls becomes critical and is much more meaningful for nondeterministic apps
- Parameters
- N_elements
- 16,384^2
- Number of molecular pairs
- N_clks/element
- 3
- i.e., up to two cycles can be stalls
- N_pipelines
- 15
- Same number of kernels as before
Pipelined RAT Input Parameters of MD
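Under this pipelined model the computation-time estimate reduces to cycles per interaction spread over the pipeline count; a sketch using the table's parameters, with f_clock = 100 MHz assumed purely for illustration:

```c
#include <assert.h>
#include <math.h>

/* Pipelined MD computation-time estimate: N_elements molecular pairs,
   N_clks cycles per interaction (1 issue + up to 2 stalls), spread
   across N_pipelines replicated pipelines. */
static double md_pipelined_t_comp(double f_clock)
{
    const double n_elements = 16384.0 * 16384.0; /* molecular pairs */
    const double clks_per_element = 3.0;         /* issue + 2 stalls */
    const double n_pipelines = 15.0;             /* parallel kernels */
    return n_elements * clks_per_element / (n_pipelines * f_clock);
}
```

At the assumed 100 MHz this gives roughly 0.54 s of force computation per iteration.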
"And now for something completely different" - Monty Python
(Or is it?)
Leveraging Algorithm Designs
- Introduction
- Molecular dynamics provided several lessons learned
- Best design practices for coding in Impulse C
- Algorithm optimizations for maximum performance
- Memory staging for minimal footprint and delay
- Sacrificing computation efficiency for decreased memory accesses
- Motivations and Challenges
- Application design should educate the researcher
- Designs should also train other researchers
- Unfortunately, new design work can be expensive
- Collecting application knowledge into design patterns provides distilled lessons learned for efficient application design
Design Patterns
- Object-oriented software engineering
- "A design pattern names, abstracts, and identifies the key aspects of a common design structure that make it useful for creating a reusable object-oriented design" [1]
- Reconfigurable Computing
- "Design patterns offer us organizing and structuring principles that help us understand how to put building blocks (e.g., adders, multipliers, FIRs) together." [2]

[1] Gamma, Erich, et al., Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, Boston, 1995.
[2] DeHon, André, et al., "Design Patterns for Reconfigurable Computing," Proceedings of the 12th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'04), April 20-23, 2004, Napa, California.
Classification of Design Patterns: OO Textbook [1]
- Pattern categories
- Creational
- Abstract Factory
- Prototype
- Singleton
- Etc.
- Structural
- Adapter
- Bridge
- Proxy
- Etc.
- Behavioral
- Iterator
- Mediator
- Interpreter
- Etc.
- Describing Patterns
- Pattern name
- Intent
- Also known as
- Motivation
- Applicability
- Structure
- Participants
- Collaborations
- Consequences
- Implementation
- Sample code
- Known uses
- Related patterns
Sample Design Patterns: RC Paper [2]
- 14 pattern categories
- Area-Time Tradeoffs
- Expressing Parallelism
- Implementing Parallelism
- Processor-FPGA Integration
- Common-Case Optimization
- Re-using Hardware Efficiently
- Specialization
- Partial Reconfiguration
- Communications
- Synchronization
- Efficient Layout and Communications
- Implementing Communication
- Value-Added Memory Patterns
- Number Representation Patterns
- 89 patterns identified (samples)
- Coarse-Grained Time Multiplexing
- Synchronous Dataflow
- Multi-threaded
- Sequential vs. Parallel Design (hardware-software
partitioning) - SIMD
- Communicating FSMDs
- Instruction augmentation
- Exceptions
- Pipelining
- Worst-Case Footprint
- Streaming Data
- Shared Memory
- Synchronous Clocking
- Asynchronous Handshaking
- Cellular Automata
- Etc
Representing DPs for RC Engineering
- Design Section
- Structure
- Block diagram representation
- Reference to major RC building blocks (BRAM, SDRAM, compute modules, etc.)
- Rationale: compatibility with RAT
- Specification
- More formal representation
- Such as UML
- Possibly maps to HDL
- Implementation Section
- HDL language-specific information
- Platform specific information
- Sample code
- Description Section
- Pattern name and classification
- Intent
- Also known as
- Motivation
- Applicability
- Participants
- Collaborations
- Consequences
- Known uses
- Related patterns
Example: Time Multiplexing Pattern
Computational graph divided into smaller subgraphs
- Intent: Large designs on small or fixed-capacity platforms
- Motivation: Meet real-time needs or inadequate design space
- Applicability: For slow reconfiguration
- No feedback loops (acyclic dataflow)
- Participants: Subgraphs
- Collaborations: Control algorithm directs subgraph swapping
- Consequences: Slow reconfiguration time, large buffers, imperfect device resource utilization
- Known Uses: Video processing, target recognition
- Implementation: Conventional processor issues commands for reconfiguration and collaboration
Example: Datapath Duplication
Replicated computational structures for parallel processing
- Intent: Exploiting computation parallelism in sequential programming structures (loops)
- Motivation: Achieving faster performance through replication of computational structures
- Applicability: Data-independent computations
- No feedback loops (acyclic dataflow)
- Participants: Single computational kernel
- Collaborations: Control algorithm directs dataflow and synchronization
- Consequences: Area-time tradeoff; higher processing speed at the cost of increased implementation footprint in hardware
- Known Uses: PDF estimation, BbNN implementation, MD, etc.
- Implementation: Centralized controller orchestrates data movement and synchronization of parallel processing elements
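A minimal software sketch of the Datapath Duplication pattern; the kernel_op() stand-in, NKERNELS, and the slicing scheme are illustrative assumptions (in hardware the kernels are physical copies running concurrently under a centralized controller):

```c
#include <assert.h>

#define NKERNELS 4 /* number of replicated datapaths (illustrative) */

/* Stand-in for the replicated computational kernel. */
static double kernel_op(double x) { return x * x; }

/* Data-independent loop split into NKERNELS contiguous slices, one per
   replicated datapath; processed sequentially here for illustration. */
static void duplicated_datapath(const double *in, double *out, int n)
{
    int chunk = (n + NKERNELS - 1) / NKERNELS;
    for (int k = 0; k < NKERNELS; k++) {
        int lo = k * chunk;
        int hi = (lo + chunk < n) ? lo + chunk : n;
        for (int i = lo; i < hi; i++)
            out[i] = kernel_op(in[i]);
    }
}
```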
System-Level Patterns for MD
Visualization of Datapath Duplication
- When designing MD, the initial goal is to decompose the algorithm into parallel kernels
- Datapath duplication is a potential starting pattern
- MD will require additional modifications since the computational structure will not divide cleanly
[Figure: "On-line Shopping for Design Patterns" - a mock storefront: "What do customers buy after viewing this item? 67% use this pattern, 37% alternatively use . May we also recommend? Pipelining, Loop Fusion"]
Kernel-level optimization patterns for MD
Pattern Utilization
void ComputeAccel()
{
  double dr[3], f, fcVal, rrCut, rr, ri2, ri6, r1;
  int j1, j2, n, k;
  rrCut = RCUT*RCUT;
  for (n=0; n<nAtom; n++) for (k=0; k<3; k++) ra[n][k] = 0.0;
  potEnergy = 0.0;
  for (j1=0; j1<nAtom-1; j1++) {
    for (j2=j1+1; j2<nAtom; j2++) {
      for (rr=0.0, k=0; k<3; k++) {
        dr[k] = r[j1][k] - r[j2][k];
        dr[k] = dr[k] - SignR(RegionH[k], dr[k]-RegionH[k])
                      - SignR(RegionH[k], dr[k]+RegionH[k]);
        rr = rr + dr[k]*dr[k];
      }
      if (rr < rrCut) {
        ri2 = 1.0/rr; ri6 = ri2*ri2*ri2; r1 = sqrt(rr);
        fcVal = 48.0*ri2*ri6*(ri6-0.5) + Duc/r1;
        ...
- 2-D arrays
- SW addressing is handled by the C compiler
- HW addressing should be explicit
- Loop fusion
- Fairly straightforward in explicit languages
- Challenging to make efficient in other HLLs
- Memory dependencies
- Shared bank
- Repeat accesses in a pipeline cause stalls
- Write after read
- Double access, even of the same memory location, similarly causes stalls
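The 2-D array point above can be sketched as follows: rather than letting the compiler derive r[j1][k], hardware-oriented code flattens the coordinate array and computes the index explicitly, so each access maps predictably to one memory read. The helper name and layout are illustrative assumptions:

```c
#include <assert.h>

#define DIM 3 /* x, y, z coordinates per molecule */

/* Explicit row-major addressing into a flattened coordinate array,
   replacing compiler-managed 2-D indexing like r[atom][k]. */
static double coord_read(const double *r, int atom, int k)
{
    return r[atom * DIM + k];
}
```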
Design Pattern Effects on MD
for (i=0; i<num*(num-1); i++) {
  cg_count_ceil_32(1, 0, i==0, num-2, &k);
  cg_count_ceil_32(1, 0, i==0, num-2, &j2);
  cg_count_ceil_32(j2==0, 0, i==0, num, &j1);
  if (j2 >= j1) j2++;
  if (j2 == 0) rr = 0.0;
  split_64to32_flt_flt(AL[j1], &j1y, &j1x);
  split_64to32_flt_flt(BL[j1], &dummy, &j1z);
  split_64to32_flt_flt(CL[j2], &j2y, &j2x);
  split_64to32_flt_flt(DL[j2], &dummy, &j2z);
  if (j1 < j2) { dr0 = j1x - j2x; dr1 = j1y - j2y; dr2 = j1z - j2z; }
  else         { dr0 = j2x - j1x; dr1 = j2y - j1y; dr2 = j2z - j1z; }
  dr0 = dr0 - (dr0 > REGIONH0 ? REGIONH0 : MREGIONH0)
            - (dr0 > MREGIONH0 ? REGIONH0 : MREGIONH0);
  dr1 = dr1 - (dr1 > REGIONH1 ? REGIONH1 : MREGIONH1)
            - (dr1 > MREGIONH1 ? REGIONH1 : MREGIONH1);
  dr2 = dr2 - (dr2 > REGIONH2 ? REGIONH2 : MREGIONH2)
  ...
void ComputeAccel()
{
  double dr[3], f, fcVal, rrCut, rr, ri2, ri6, r1;
  int j1, j2, n, k;
  rrCut = RCUT*RCUT;
  for (n=0; n<nAtom; n++) for (k=0; k<3; k++) ra[n][k] = 0.0;
  potEnergy = 0.0;
  for (j1=0; j1<nAtom-1; j1++) {
    for (j2=j1+1; j2<nAtom; j2++) {
      for (rr=0.0, k=0; k<3; k++) {
        dr[k] = r[j1][k] - r[j2][k];
        dr[k] = dr[k] - SignR(RegionH[k], dr[k]-RegionH[k])
                      - SignR(RegionH[k], dr[k]+RegionH[k]);
        rr = rr + dr[k]*dr[k];
      }
      if (rr < rrCut) {
        ri2 = 1.0/rr; ri6 = ri2*ri2*ri2; r1 = sqrt(rr);
        fcVal = 48.0*ri2*ri6*(ri6-0.5) + Duc/r1;
        for (k=0; k<3; k++) {
          f = fcVal*dr[k];
          ra[j1][k] = ra[j1][k] + f;
          ra[j2][k] = ra[j2][k] - f;
        }
        potEnergy += 4.0*ri6*(ri6-1.0) - Uc - Duc*(r1-RCUT);
      }
    }
  }
}
Carte MD, fully pipelined, 282 cycle depth
C baseline code for MD
Conclusions
- Performance prediction is a powerful technique for improving the efficiency of RC application formulation
- Provides reasonable accuracy for a rough estimate
- Underscores the importance of numerical precision and resource utilization in performance prediction
- Design patterns provide lessons-learned documentation
- Records and disseminates algorithm design knowledge
- Allows for more effective formulation of future designs
- Future Work
- Improve connection between design patterns and performance prediction
- Expand design pattern methodology for better integration with RC
- Increase role of numerical precision in performance prediction