Title: Lessons Learned with Performance Prediction and Design Patterns on Molecular Dynamics
- Brian Holland
- Karthik Nagarajan
- Saumil Merchant
- Herman Lam
- Alan D. George
Outline of Algorithm Design Progression
- Algorithm decomposition
- Design flow challenges
- Performance prediction
- RC Amenability Test (RAT)
- Application case study
- Improvements to RAT
- Design patterns and methodology
- Introduction and related research
- Expanding pattern documentation
- Molecular dynamics case study
- Conclusions
Design Evolution timeline: Feb 07, Jun 07, Sept 07
Design Flow Challenges
- Original mission
- Create scientific applications for FPGAs as case studies to investigate topics such as portability and scalability
- Molecular dynamics is one such application
- Maximize performance and productivity using HLLs and high-performance reconfigurable computing (HPRC)
- Applications should have significant speedup over SW baseline
- Challenges
- Ensuring speedup over traditional implementations
- Particularly when the researcher is not RC-oriented
- Exploring the design space efficiently
- Several designs may achieve speedup, but which should be used?
Algorithm Performance
- Premises
- (Re)designing applications is expensive
- Only want to design once and, even then, do it efficiently
- Scientific applications can contain extra precision
- Floating point may not be necessary but is a SW standard
- Optimal design may overuse available FPGA resources
- Discovering resource exhaustion mid-development is expensive
- Need
- Performance prediction
- Quickly and with reasonable accuracy, estimate the performance of a particular algorithm on a specific FPGA platform
- Utilize simple analytic models to make prediction accessible to novices
Introduction
- RC Amenability Test (RAT)
- Methodology for rapidly analyzing a particular algorithm design's compatibility with a specific FPGA platform and projecting speedup
- Importance of RAT
- Design migration process is lengthy and costly
- Allows for detailed consideration with potential tradeoff analyses
- Creates a formal procedure, reducing the need for expert knowledge
- Scope of RAT
- RAT cannot make generalizations about applications
- Different algorithm choices will greatly affect application performance
- Different FPGA platform architectures will affect algorithm capabilities
RAT Methodology
- Throughput Test
- Algorithm and FPGA platform are parameterized
- Equations are used to predict speedup
- Numerical Precision Test
- RAT user should explicitly examine the impact of reducing precision on computation
- Interrelated with the throughput test
- The two tests essentially proceed simultaneously
- Resource Utilization Test
- FPGA resource usage is estimated to determine scalability on the FPGA platform
Overview of RAT Methodology
Related Work
- Performance prediction via parameterization
- "The performance analysis in this paper is not real performance prediction; rather it targets the general concern of whether or not an algorithm will fit within the memory subsystem that is designed to feed it." [1] (Illinois)
- Applications decomposed to determine total size and computational density
- Computational platforms characterized by memory size, bandwidth, and latency
- Parallel, heterogeneous shared RC resource modeling [2] (ORNL)
- System-level modeling for multi-FPGA augmented systems
- Other performance prediction research
- Performance Prediction Model (PPM) [3]
- Optimizing hardware function evaluation [4]
- Comparable models for conventional parallel processing systems
- Parallel Random Access Machine (PRAM) [5]
- LogP [6]
Throughput
- Methodology
- Parameterize key components of the algorithm to estimate runtime
- Use equations to determine execution time of the RC application
- Compare runtime with SW baseline to determine projected speedup
- Explore ranges of values to examine algorithm performance bounds
- Terminology
- Element
- Basic unit of data for the algorithm that determines the amount of computation
- e.g., each character (element) in a string-matching algorithm will require some number of computations to complete
- Operation
- Basic unit of work which helps complete a data element
- Granularity can vary greatly depending upon formulation
- e.g., 1 multiply or 16 shifts could represent 1 or 16 operations, respectively
RAT Input Parameters
Communication and Computation
- Communication is defined by reads and writes of the FPGA
- Note that this equation refers to a single iteration of the algorithm
- Read and write times are a function of the number of elements, the size of each element, and the FPGA/CPU interconnect transfer rate
- Similarly, computation is determined by the number of operations (a function of the number of elements), the parallelism/pipelining in the algorithm (throughput), and the clock frequency
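As a sketch, the per-iteration times described above can be written out directly; the function and parameter names here are assumed from the slide's terminology, not taken from an actual RAT implementation:

```c
#include <assert.h>
#include <math.h>

/* Communication: time to move N elements of B bytes each over an
   interconnect with ideal bandwidth 'ideal_bytes_per_sec', derated by
   an empirically measured efficiency factor alpha (0 < alpha <= 1). */
static double t_io(double n_elements, double bytes_per_element,
                   double alpha, double ideal_bytes_per_sec)
{
    return (n_elements * bytes_per_element) / (alpha * ideal_bytes_per_sec);
}

/* Computation: N elements, each requiring ops_per_element operations,
   processed at 'throughput' ops/cycle at clock frequency f_clock (Hz). */
static double t_comp(double n_elements, double ops_per_element,
                     double throughput, double f_clock)
{
    return (n_elements * ops_per_element) / (throughput * f_clock);
}
```

With the 1-D PDF numbers from the later case study (512 elements, 768 ops/element, 20 ops/cycle, and an assumed 100 MHz clock), t_comp comes to roughly 0.2 ms per iteration.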
RC Execution Time
Example Overlap Scenarios
- Total RC execution time
- Function of communication time, computation time, and number of iterations required to complete the algorithm
- Overlap of computation and communication
- Single Buffered (SB)
- No overlap; computation and communication are additive
- Double Buffered (DB)
- Complete overlap; execution time is dominated by the larger term
Performance Equations
- Speedup
- Compares predicted performance versus software baseline
- Shows performance as a function of total execution time
- Utilization
- Computation utilization shows effective idle time of the FPGA
- Communication utilization illustrates interconnect saturation
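Sketched directly (single-buffered utilization assumed; names are illustrative):

```c
#include <assert.h>
#include <math.h>

/* Speedup: software baseline time over predicted RC execution time. */
static double speedup(double t_soft, double t_rc)
{
    return t_soft / t_rc;
}

/* Computation utilization: fraction of single-buffered RC time spent
   computing; (1 - this) is the FPGA's effective idle time. */
static double util_comp(double t_comp, double t_comm)
{
    return t_comp / (t_comp + t_comm);
}
```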
Numerical Precision
- Applications should have the minimum level of precision necessary to remain within user tolerances
- SW applications will often have extra precision due to coarse-grain data types of general-purpose processors
- Extra precision can be wasteful in terms of performance and resource utilization on FPGAs
- Automated floating-point to fixed-point conversion
- Useful for exploring reduced precision in algorithm designs
- Often requires additional coding to explore options
- Ultimately, the user must make the final determination on precision
- RAT exists to help explore computation performance aspects of the application, just as it helps investigate other algorithmic tradeoffs
Resource Utilization
- Intended to prevent designs that cannot be physically realized in FPGAs
- On-Chip RAM
- Includes memory for application core and off-chip I/O
- Relatively simple to examine and scale prior to hardware design
- Hardware Multipliers
- Includes variety of vendor-specific multipliers and/or MAC units
- Simple to compute usage with sufficient device knowledge
- Logic Elements
- Includes look-up tables and other basic registering logic
- Extremely difficult to predict usage before hardware design
Probability Density Function Estimation
- Parzen window probability density function (PDF) estimation
- Computational complexity O(Nnd)
- N: number of discrete probability levels (i.e., bins)
- n: number of discrete points where probability is estimated
- d: number of dimensions
- Intended architecture
- Eight parallel kernels each compute the discrete points versus a subset of the bins
- Incoming data samples are processed against 256 bins
Chosen 1-D PDF algorithm architecture
1-D PDF Estimation Walkthrough
- Dataset Parameters
- N_elements,input
- 204,800 samples / 400 iterations = 512
- N_elements,output
- For 1-D PDF, output is negligible
- N_bytes/element
- Each data value is 4 bytes, the size of the Nallatech communication channel
- Communication Parameters
- Models a Nallatech H101-PCIXM card containing a Virtex-4 LX100 user FPGA connected via a 133MHz PCI-X bus
- Alpha parameters were established using a read/write microbenchmark for modeling transfer times
- Computation Parameters
- N_ops/element
- 256 bins × 3 ops each = 768
- throughput_proc
- 8 pipelines × 3 ops = 24, rounded down to 20
- f_clock
- Several values are considered
- Software Parameters
- t_soft
RAT Input Parameters of 1-D PDF
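Plugging the table's parameters into the throughput equations gives the predicted computation time; f_clock = 100 MHz is an assumed illustrative value (the slide explores several), and communication is omitted because the measured alpha values are not reproduced here:

```c
#include <assert.h>
#include <math.h>

/* Total predicted computation time for 1-D PDF across all iterations. */
static double pdf1d_t_comp_total(double f_clock)
{
    const double n_iter = 400.0;           /* 204,800 samples / 512 */
    const double n_elements = 512.0;       /* elements per iteration */
    const double ops_per_element = 768.0;  /* 256 bins x 3 ops */
    const double throughput = 20.0;        /* 24 ops/cycle rounded down */
    return n_iter * n_elements * ops_per_element / (throughput * f_clock);
}
```

At the assumed 100 MHz this predicts roughly 79 ms of total computation.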
1-D PDF Estimation Walkthrough
- Frequency
- Difficult to predict a priori
- Several possible values are explored
- Prediction accuracy
- Communication accuracy was low
- Despite microbenchmarking, communication was longer than expected
- Minor inaccuracies in timing for small transfers compounded over 400 iterations for 1-D PDF
- Computational accuracy was high
- Throughput was rounded from 24 ops/cycle to 20 ops/cycle
- Conservative parallelism was warranted due to unaccounted pipeline stalls
- Algorithm constructed in VHDL
Performance Parameters of 1-D PDF
Example Computations from RAT Analysis
2-D PDF Estimation
- Dimensionality
- PDF can extend to multiple dimensions
- Significantly increases computational complexity and volume of communication
- Algorithm
- Same construction as 1-D PDF
- Written in VHDL
- Targets Nallatech H101
- Xilinx V4LX100 FPGA
- PCI-X interconnect
- Prediction Accuracy
- Communication
- Similar to 1-D PDF, communication times were underestimated
- Computation
- Computation time was shorter than expected, balancing overall execution time
RAT Input Parameters of 2-D PDF
Performance Parameters of 2-D PDF
Molecular Dynamics
- Simulation of the physical interaction of a set of molecules over a given time interval
- Based upon code provided by Oak Ridge National Lab (ORNL)
- Algorithm
- 16,384-molecule data set
- Written in Impulse C
- XtremeData XD1000 platform
- Altera Stratix-II EP2S180 FPGA
- HyperTransport interconnect
- SW baseline on 2.4GHz Opteron
- Challenges for accurate prediction
- Nondeterministic runtime
- Molecules beyond a certain threshold are assumed to have zero impact
- Large datasets for MD
- Exhaust FPGA local memory
RAT Input Parameters of MD
Performance Parameters of MD
Conclusions
- RC Amenability Test
- Provides a simple, fast, and effective method for investigating the performance potential of a given application design for a given target FPGA platform
- Works with empirical knowledge of RC devices to create a more efficient and effective means for application design
- When RAT-projected speedups are found to be disappointing, the designer can quickly reevaluate the algorithm design and/or the RC platform selected as target
- Successes
- Allows for rapid algorithm analysis before any significant hardware coding
- Demonstrates reasonably accurate predictions despite coarse parameterization
- Applications
- Showcases effectiveness of RAT for deterministic algorithms like PDF estimation
- Provides valuable qualitative insight for nondeterministic algorithms such as MD
- Future Work
- Improve support for nondeterministic algorithms through pipelining
- Explore performance prediction with applications for multi-FPGA systems
- Expand methodology for numerical precision and resource utilization
Molecular Dynamics Revisited
Molecular Dynamics
- Algorithm
- 16,384-molecule data set
- Written in Impulse C
- XtremeData XD1000 platform
- Altera Stratix-II EP2S180 FPGA
- HyperTransport interconnect
- SW baseline on 2.4GHz Opteron
- Parameters
- Dataset Parameters
- Model volume of data used by FPGA
- Communication Parameters
- Model the HyperTransport interconnect
- Computation Parameters
- N_ops/element
- 164,000 / 16,384 ≈ 10 ops
- i.e., each molecule (element) takes 10 ops/iteration
- throughput_proc
- 50
RAT Input Parameters of MD
Performance Parameters of MD
Parameter Alterations for Pipelining
- MD Optimization
- Each molecular computation should be pipelined
- Focus becomes less on individual molecules and more on molecular interactions
- Parameters
- Computation Parameters
- N_ops/element
- 16,400
- Strictly the number of interactions per element
- throughput_pipeline
- 0.333
- Inverse of the number of cycles needed per interaction; i.e., the pipeline can only stall for 2 extra cycles
- N_pipelines
- 15
- Guess based on predicted area usage
Modified RAT Input Parameters of MD
Performance Parameters of MD
Pipelined Performance Prediction
- Molecular Dynamics
- If a pipeline is possible, certain parameters become obsolete
- The number of operations in the pipeline (i.e., depth) is not important
- The number of pipeline stalls becomes critical and is much more meaningful for nondeterministic apps
- Parameters
- N_elements
- 16,384^2
- Number of molecular pairs
- N_clks/element
- 3
- i.e., up to two cycles can be stalls
- N_pipelines
- 15
- Same number of kernels as before
Pipelined RAT Input Parameters of MD
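Under this pipelined model the computation-time estimate reduces to cycles per interaction spread over the pipeline count; a sketch using the table's parameters, with f_clock = 100 MHz assumed purely for illustration:

```c
#include <assert.h>
#include <math.h>

/* Pipelined MD computation-time estimate: N_elements molecular pairs,
   N_clks cycles per interaction (1 issue + up to 2 stalls), spread
   across N_pipelines replicated pipelines. */
static double md_pipelined_t_comp(double f_clock)
{
    const double n_elements = 16384.0 * 16384.0; /* molecular pairs */
    const double clks_per_element = 3.0;         /* issue + 2 stalls */
    const double n_pipelines = 15.0;             /* parallel kernels */
    return n_elements * clks_per_element / (n_pipelines * f_clock);
}
```

At the assumed 100 MHz this gives roughly 0.54 s of force computation per iteration.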
"And now for something completely different" - Monty Python
(Or is it?)
Leveraging Algorithm Designs
- Introduction
- Molecular dynamics provided several lessons learned
- Best design practices for coding in Impulse C
- Algorithm optimizations for maximum performance
- Memory staging for minimal footprint and delay
- Sacrificing computation efficiency for decreased memory accesses
- Motivations and Challenges
- Application design should educate the researcher
- Designs should also train other researchers
- Unfortunately, new design work can be expensive
- Collecting application knowledge into design patterns provides distilled lessons learned for efficient application design
Design Patterns
- Object-oriented software engineering
- "A design pattern names, abstracts, and identifies the key aspects of a common design structure that make it useful for creating a reusable object-oriented design" [1]
- Reconfigurable Computing
- "Design patterns offer us organizing and structuring principles that help us understand how to put building blocks (e.g., adders, multipliers, FIRs) together." [2]

[1] Gamma, Erich, et al., Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, Boston, 1995.
[2] DeHon, André, et al., "Design Patterns for Reconfigurable Computing," Proceedings of the 12th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'04), April 20-23, 2004, Napa, California.
Classification of Design Patterns: OO Textbook [1]
- Pattern categories
- Creational
- Abstract Factory
- Prototype
- Singleton
- Etc.
- Structural
- Adapter
- Bridge
- Proxy
- Etc.
- Behavioral
- Iterator
- Mediator
- Interpreter
- Etc.
- Describing Patterns
- Pattern name
- Intent
- Also known as
- Motivation
- Applicability
- Structure
- Participants
- Collaborations
- Consequences
- Implementation
- Sample code
- Known uses
- Related patterns
Sample Design Patterns: RC Paper [2]
- 14 pattern categories
- Area-Time Tradeoffs
- Expressing Parallelism
- Implementing Parallelism
- Processor-FPGA Integration
- Common-Case Optimization
- Re-using Hardware Efficiently
- Specialization
- Partial Reconfiguration
- Communications
- Synchronization
- Efficient Layout and Communications
- Implementing Communication
- Value-Added Memory Patterns
- Number Representation Patterns
- 89 patterns identified (samples)
- Coarse-Grained Time Multiplexing
- Synchronous Dataflow
- Multi-threaded
- Sequential vs. Parallel Design (hardware-software
partitioning) - SIMD
- Communicating FSMDs
- Instruction augmentation
- Exceptions
- Pipelining
- Worst-Case Footprint
- Streaming Data
- Shared Memory
- Synchronous Clocking
- Asynchronous Handshaking
- Cellular Automata
- Etc
Representing DPs for RC Engineering
- Design Section
- Structure
- Block diagram representation
- Reference to major RC building blocks (BRAM, SDRAM, compute modules, etc.)
- Rationale: compatibility with RAT
- Specification
- More formal representation
- Such as UML
- Possibly maps to HDL
- Implementation Section
- HDL language-specific information
- Platform specific information
- Sample code
- Description Section
- Pattern name and classification
- Intent
- Also known as
- Motivation
- Applicability
- Participants
- Collaborations
- Consequences
- Known uses
- Related patterns
Example: Time Multiplexing Pattern
Computational graph divided into smaller subgraphs
- Intent: Large designs on small or fixed-capacity platforms
- Motivation: Meet real-time needs or inadequate design space
- Applicability: For slow reconfiguration
- No feedback loops (acyclic dataflow)
- Participants: Subgraphs
- Collaborations: Control algorithm directs subgraph swapping
- Consequences: Slow reconfiguration time, large buffers, imperfect device resource utilization
- Known Uses: Video processing, target recognition
- Implementation: Conventional processor issues commands for reconfiguration and collaboration
Example: Datapath Duplication
Replicated computational structures for parallel processing
- Intent: Exploiting computation parallelism in sequential programming structures (loops)
- Motivation: Achieving faster performance through replication of computational structures
- Applicability: Data-independent computations
- No feedback loops (acyclic dataflow)
- Participants: Single computational kernel
- Collaborations: Control algorithm directs dataflow and synchronization
- Consequences: Area-time tradeoff; higher processing speed at the cost of increased implementation footprint in hardware
- Known Uses: PDF estimation, BbNN implementation, MD, etc.
- Implementation: Centralized controller orchestrates data movement and synchronization of parallel processing elements
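A minimal software sketch of the Datapath Duplication pattern; the kernel_op() stand-in, NKERNELS, and the slicing scheme are illustrative assumptions (in hardware the kernels are physical copies running concurrently under a centralized controller):

```c
#include <assert.h>

#define NKERNELS 4 /* number of replicated datapaths (illustrative) */

/* Stand-in for the replicated computational kernel. */
static double kernel_op(double x) { return x * x; }

/* Data-independent loop split into NKERNELS contiguous slices, one per
   replicated datapath; processed sequentially here for illustration. */
static void duplicated_datapath(const double *in, double *out, int n)
{
    int chunk = (n + NKERNELS - 1) / NKERNELS;
    for (int k = 0; k < NKERNELS; k++) {
        int lo = k * chunk;
        int hi = (lo + chunk < n) ? lo + chunk : n;
        for (int i = lo; i < hi; i++)
            out[i] = kernel_op(in[i]);
    }
}
```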
System-Level Patterns for MD
Visualization of Datapath Duplication
- When designing MD, the initial goal is to decompose the algorithm into parallel kernels
- Datapath duplication is a potential starting pattern
- MD will require additional modifications since the computational structure will not divide cleanly
[Figure: "On-line Shopping for Design Patterns" - a mock storefront: "What do customers buy after viewing this item? 67% use this pattern, 37% alternatively use . May we also recommend? Pipelining, Loop Fusion"]
Kernel-level optimization patterns for MD
Pattern Utilization
void ComputeAccel()
{
  double dr[3], f, fcVal, rrCut, rr, ri2, ri6, r1;
  int j1, j2, n, k;
  rrCut = RCUT*RCUT;
  for (n=0; n<nAtom; n++) for (k=0; k<3; k++) ra[n][k] = 0.0;
  potEnergy = 0.0;
  for (j1=0; j1<nAtom-1; j1++) {
    for (j2=j1+1; j2<nAtom; j2++) {
      for (rr=0.0, k=0; k<3; k++) {
        dr[k] = r[j1][k] - r[j2][k];
        dr[k] = dr[k] - SignR(RegionH[k], dr[k]-RegionH[k])
                      - SignR(RegionH[k], dr[k]+RegionH[k]);
        rr = rr + dr[k]*dr[k];
      }
      if (rr < rrCut) {
        ri2 = 1.0/rr; ri6 = ri2*ri2*ri2; r1 = sqrt(rr);
        fcVal = 48.0*ri2*ri6*(ri6-0.5) + Duc/r1;
        ...
- 2-D arrays
- SW addressing is handled by the C compiler
- HW addressing should be explicit
- Loop fusion
- Fairly straightforward in explicit languages
- Challenging to make efficient in other HLLs
- Memory dependencies
- Shared bank
- Repeat accesses in a pipeline cause stalls
- Write after read
- Double access, even of the same memory location, similarly causes stalls
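The 2-D array point above can be sketched as follows: rather than letting the compiler derive r[j1][k], hardware-oriented code flattens the coordinate array and computes the index explicitly, so each access maps predictably to one memory read. The helper name and layout are illustrative assumptions:

```c
#include <assert.h>

#define DIM 3 /* x, y, z coordinates per molecule */

/* Explicit row-major addressing into a flattened coordinate array,
   replacing compiler-managed 2-D indexing like r[atom][k]. */
static double coord_read(const double *r, int atom, int k)
{
    return r[atom * DIM + k];
}
```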
Design Pattern Effects on MD
for (i=0; i<num*(num-1); i++) {
  cg_count_ceil_32(1, 0, i==0, num-2, &k);
  cg_count_ceil_32(1, 0, i==0, num-2, &j2);
  cg_count_ceil_32(j2==0, 0, i==0, num, &j1);
  if (j2 >= j1) j2++;
  if (j2 == 0) rr = 0.0;
  split_64to32_flt_flt(AL[j1], &j1y, &j1x);
  split_64to32_flt_flt(BL[j1], &dummy, &j1z);
  split_64to32_flt_flt(CL[j2], &j2y, &j2x);
  split_64to32_flt_flt(DL[j2], &dummy, &j2z);
  if (j1 < j2) { dr0 = j1x - j2x; dr1 = j1y - j2y; dr2 = j1z - j2z; }
  else         { dr0 = j2x - j1x; dr1 = j2y - j1y; dr2 = j2z - j1z; }
  dr0 = dr0 - (dr0 > REGIONH0 ? REGIONH0 : MREGIONH0)
            - (dr0 > MREGIONH0 ? REGIONH0 : MREGIONH0);
  dr1 = dr1 - (dr1 > REGIONH1 ? REGIONH1 : MREGIONH1)
            - (dr1 > MREGIONH1 ? REGIONH1 : MREGIONH1);
  dr2 = dr2 - (dr2 > REGIONH2 ? REGIONH2 : MREGIONH2)
  ...
void ComputeAccel()
{
  double dr[3], f, fcVal, rrCut, rr, ri2, ri6, r1;
  int j1, j2, n, k;
  rrCut = RCUT*RCUT;
  for (n=0; n<nAtom; n++) for (k=0; k<3; k++) ra[n][k] = 0.0;
  potEnergy = 0.0;
  for (j1=0; j1<nAtom-1; j1++) {
    for (j2=j1+1; j2<nAtom; j2++) {
      for (rr=0.0, k=0; k<3; k++) {
        dr[k] = r[j1][k] - r[j2][k];
        dr[k] = dr[k] - SignR(RegionH[k], dr[k]-RegionH[k])
                      - SignR(RegionH[k], dr[k]+RegionH[k]);
        rr = rr + dr[k]*dr[k];
      }
      if (rr < rrCut) {
        ri2 = 1.0/rr; ri6 = ri2*ri2*ri2; r1 = sqrt(rr);
        fcVal = 48.0*ri2*ri6*(ri6-0.5) + Duc/r1;
        for (k=0; k<3; k++) {
          f = fcVal*dr[k];
          ra[j1][k] = ra[j1][k] + f;
          ra[j2][k] = ra[j2][k] - f;
        }
        potEnergy += 4.0*ri6*(ri6-1.0) - Uc - Duc*(r1-RCUT);
      }
    }
  }
}
Carte MD, fully pipelined, 282 cycle depth
C baseline code for MD
Conclusions
- Performance prediction is a powerful technique for improving the efficiency of RC application formulation
- Provides reasonable accuracy for a rough estimate
- Underscores the importance of numerical precision and resource utilization in performance prediction
- Design patterns provide lessons-learned documentation
- Records and disseminates algorithm design knowledge
- Allows for more effective formulation of future designs
- Future Work
- Improve connection between design patterns and performance prediction
- Expand design pattern methodology for better integration with RC
- Increase role of numerical precision in performance prediction