Lessons Learned with Performance Prediction and Design Patterns on Molecular Dynamics
12JAN07 Talk for I/UCRC Annual Meeting
1
Lessons Learned with Performance Prediction and
Design Patterns on Molecular Dynamics
  • Brian Holland
  • Karthik Nagarajan
  • Saumil Merchant
  • Herman Lam
  • Alan D. George

2
Outline of Algorithm Design Progression
  • Algorithm decomposition
  • Design flow challenges
  • Performance prediction
    • RC Amenability Test (RAT)
    • Application case study
    • Improvements to RAT
  • Design patterns and methodology
    • Introduction and related research
    • Expanding pattern documentation
    • Molecular dynamics case study
  • Conclusions

Design Evolution timeline: Feb 07, Jun 07, Sept 07
3
Design Flow Challenges
  • Original mission
  • Create scientific applications for FPGAs as case
    studies to investigate topics such as portability
    and scalability
  • Molecular dynamics is one such application
  • Maximize performance and productivity using HLLs
    and high-performance reconfigurable computing
    (HPRC)
  • Applications should have significant speedup over
    SW baseline
  • Challenges
  • Ensuring speedup over traditional implementations
  • Particularly when the researcher is not RC-oriented
  • Exploring the design space efficiently
  • Several designs may achieve speedup, but which
    should be used?

4
Algorithm Performance
  • Premises
  • (Re)designing applications is expensive
  • Only want to design once and even then, do it
    efficiently
  • Scientific applications can contain extra
    precision
  • Floating point may not be necessary but is a SW
    standard
  • Optimal design may overuse available FPGA
    resources
  • Discovering resource exhaustion mid-development
    is expensive
  • Need
  • Performance prediction
  • Quickly and with reasonable accuracy estimate the
    performance of a particular algorithm on a
    specific FPGA platform
  • Utilize simple analytic models to make prediction
    accessible to novices

5
Introduction
  • RC Amenability Test
  • Methodology for rapidly analyzing a particular
    algorithm's design compatibility with a specific
    FPGA platform and projecting speedup
  • Importance of RAT
  • Design migration process is lengthy and costly
  • Allows for detailed consideration with potential
    tradeoff analyses
  • Creates formal procedure reducing need for
    expert knowledge
  • Scope of RAT
  • RAT cannot make generalizations about
    applications
  • Different algorithm choices will greatly affect
    application performance
  • Different FPGA platform architectures will affect
    algorithm capabilities

6
RAT Methodology
  • Throughput Test
  • Algorithm and FPGA platform are parameterized
  • Equations are used to predict speedup
  • Numerical Precision Test
  • RAT user should explicitly examine the impact of
    reducing precision on computation
  • Interrelated with throughput test
  • Two tests essentially proceed simultaneously
  • Resource Utilization Test
  • FPGA resource usage is estimated to determine
    scalability on the FPGA platform

Overview of RAT Methodology
7
Related Work
  • Performance prediction via parameterization
  • "The performance analysis in this paper is not
    real performance prediction; rather it targets
    the general concern of whether or not an
    algorithm will fit within the memory subsystem
    that is designed to feed it." [1] (Illinois)
  • Applications decomposed to determine total size
    and computational density
  • Computational platforms characterized by memory
    size, bandwidth, and latency
  • Parallel, heterogeneous shared RC resource
    modeling [2] (ORNL)
  • System-level modeling for multi-FPGA augmented
    systems
  • Other performance prediction research
  • Performance Prediction Model (PPM) [3]
  • Optimizing hardware function evaluation [4]
  • Comparable models for conventional parallel
    processing systems
  • Parallel Random Access Machine (PRAM) [5]
  • LogP [6]

8
Throughput
  • Methodology
  • Parameterize key components of algorithm to
    estimate runtime
  • Use equations to determine execution time of RC
    application
  • Compare runtime with SW baseline to determine
    projected speedup
  • Explore ranges of values to examine algorithm
    performance bounds
  • Terminology
  • Element
  • Basic unit of data for the algorithm that
    determines the amount of computation
  • e.g. Each character (element) in a string
    matching algorithm will require some number of
    computations to complete
  • Operation
  • Basic unit of work which helps complete a data
    element
  • e.g. Granularity can vary greatly depending upon
    formulation.
  • 1 Multiply or 16 shifts could represent 1 or 16
    operations, respectively

RAT Input Parameters
9
Communication and Computation
  • Communication is defined by reads and writes of
    FPGA
  • Note that this equation refers to a single
    iteration of the algorithm
  • Read and write times are a function of the number
    of elements, the size of each element, and the
    FPGA/CPU interconnect transfer rate
  • Similarly, computation is determined by the
    number of operations (a function of the number of
    elements), the parallelism/pipelining in the
    algorithm (throughput), and the clock frequency
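The parameterization above can be captured in two small analytic helpers (a minimal sketch in Python; the parameter names follow the RAT terminology, and units of bytes/s for interconnect throughput and ops/cycle for processing throughput are assumptions):

```python
def t_comm(n_elements, bytes_per_element, alpha, throughput_ideal):
    # Read or write time: data volume divided by the effective
    # interconnect rate. alpha scales the ideal interconnect
    # throughput (bytes/s) down to what real transfers achieve.
    return (n_elements * bytes_per_element) / (alpha * throughput_ideal)

def t_comp(n_elements, ops_per_element, throughput_proc, f_clock):
    # Computation time: total operations divided by operations
    # retired per second (ops/cycle x cycles/s).
    return (n_elements * ops_per_element) / (throughput_proc * f_clock)
```

For example, `t_comm(512, 4, 0.25, 1e9)` models one iteration's read of 512 four-byte elements over a 1 GB/s link at 25% efficiency (illustrative numbers, not figures from the slides).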

10
RC Execution Time
Example Overlap Scenarios
  • Total RC execution time
  • Function of communication time, computation time,
    and number of iterations required to complete the
    algorithm
  • Overlap of computation and communication
  • Single Buffered (SB)
  • No overlap, computation and communication are
    additive
  • Double Buffered (DB)
  • Complete overlap, execution time is dominated by
    larger term
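The two buffering cases above reduce to a one-line model (a sketch; the per-iteration times passed in are illustrative values, not figures from the slides):

```python
def t_rc(t_comm, t_comp, n_iterations, double_buffered=False):
    # Total RC execution time. Single-buffered: communication and
    # computation are additive. Double-buffered: they overlap, so
    # the larger term dominates each iteration.
    per_iter = max(t_comm, t_comp) if double_buffered else t_comm + t_comp
    return n_iterations * per_iter

# Illustrative: a communication-bound iteration repeated 400 times
print(t_rc(2e-5, 1e-5, 400))        # single buffered: 400 * (t_comm + t_comp)
print(t_rc(2e-5, 1e-5, 400, True))  # double buffered: 400 * max(t_comm, t_comp)
```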

11
Performance Equations
  • Speedup
  • Compares predicted performance versus software
    baseline
  • Shows performance as a function of total
    execution time
  • Utilization
  • Computation utilization shows effective idle time
    of FPGA
  • Communication utilization illustrates
    interconnect saturation
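Stated as code, the performance equations look as follows (a sketch; the utilization definitions assume the single-buffered case, where communication and computation are additive):

```python
def speedup(t_soft, t_rc_total):
    # Projected speedup of the RC design over the software baseline.
    return t_soft / t_rc_total

def util_comp(t_comp, t_comm):
    # Fraction of total time the FPGA spends computing; the
    # remainder is effectively idle, waiting on the interconnect.
    return t_comp / (t_comp + t_comm)

def util_comm(t_comm, t_comp):
    # Fraction of total time the interconnect is busy - an
    # indicator of interconnect saturation.
    return t_comm / (t_comp + t_comm)
```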

12
Numerical Precision
  • Applications should have minimum level of
    precision necessary to remain within user
    tolerances
  • SW applications will often have extra precision
    due to coarse-grain data types of general-purpose
    processors
  • Extra precision can be wasteful in terms of
    performance and resource utilization on FPGAs
  • Automated fixed-point to floating-point
    conversion
  • Useful for exploring reduced precision in
    algorithm designs
  • Often requires additional coding to explore
    options
  • Ultimately, user must make final determination on
    precision
  • RAT exists to help explore computation
    performance aspects of application, just as it
    helps investigate other algorithmic tradeoffs

13
Resource Utilizations
  • Intended to prevent designs that cannot be
    physically realized in FPGAs
  • On-Chip RAM
  • Includes memory for application core and off-chip
    I/O
  • Relatively simple to examine and scale prior to
    hardware design
  • Hardware Multipliers
  • Includes variety of vendor-specific multipliers
    and/or MAC units
  • Simple to compute usage with sufficient device
    knowledge
  • Logic Elements
  • Includes look-up tables and other basic
    registering logic
  • Extremely difficult to predict usage before
    hardware design

14
Probability Density Function Estimation
  • Parzen window probability density function (PDF)
    estimation
  • Computation complexity: O(N·n·d)
  • N = number of discrete probability levels (i.e.,
    bins)
  • n = number of discrete points where probability
    is estimated
  • d = number of dimensions
  • Intended architecture
  • Eight parallel kernels each compute the discrete
    points versus a subset of the bins
  • Incoming data samples are processed against 256
    bins
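As a point of reference for the O(N·n·d) complexity, a 1-D Parzen-window estimate is just two nested loops over bins and samples (a pure-software sketch; the Gaussian window and the width h are illustrative choices, not taken from the slides):

```python
import math

def parzen_pdf_1d(samples, bin_centers, h):
    """Estimate p(x) at each bin center: O(N*n) for N bins, n samples."""
    n = len(samples)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    pdf = []
    for x in bin_centers:                 # N bins (probability levels)
        acc = 0.0
        for s in samples:                 # n sample points
            u = (x - s) / h
            acc += math.exp(-0.5 * u * u) # Gaussian window contribution
        pdf.append(norm * acc)
    return pdf
```

The hardware design maps the outer loop onto the eight parallel kernels, each covering a subset of the 256 bins.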

Chosen 1-D PDF algorithm architecture
15
1-D PDF Estimation Walkthrough
  • Dataset Parameters
  • Nelements, input
  • 204,800 samples / 400 iterations = 512
  • Nelements, output
  • For 1-D PDF, output is negligible
  • Nbytes/element
  • Each data value is 4 bytes, size of Nallatech
    communication channel
  • Communication Parameters
  • Models a Nallatech H101-PCIXM card containing a
    Virtex-4 LX100 user FPGA connected via 133MHz
    PCI-X bus
  • α parameters were established using a read/write
    microbenchmark for modeling transfer times
  • Computation parameters
  • Nops/element
  • 256 bins × 3 ops each = 768
  • throughputproc
  • 8 pipelines × 3 ops = 24, rounded down to 20
  • fclock
  • Several values are considered
  • Software Parameters
  • tsoft
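Plugging the parameters above into the throughput equations gives a quick sanity check (a sketch; the 100 MHz clock is one illustrative stand-in for the several frequencies explored, and the 1 GB/s effective link rate is an assumption, not a figure from the slides):

```python
# Dataset: 204,800 samples processed in 400 iterations
n_elements = 204800 // 400           # 512 elements per iteration
bytes_per_element = 4

# Computation: 256 bins x 3 ops each = 768 ops per element,
# conservatively 20 ops/cycle sustained (24 peak across 8 pipelines)
ops_per_element = 256 * 3
throughput_proc = 20
f_clock = 100e6                      # illustrative; several values explored

# Per-iteration compute and read times
t_comp = n_elements * ops_per_element / (throughput_proc * f_clock)
t_read = n_elements * bytes_per_element / 1e9   # assumed 1 GB/s effective

print(t_comp)  # per-iteration compute time in seconds
```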

RAT Input Parameters of 1-D PDF
16
1-D PDF Estimation Walkthrough
  • Frequency
  • Difficult to predict a priori
  • Several possible values are explored
  • Prediction accuracy
  • Communication accuracy was low
  • Despite microbenchmarking, communication was
    longer than expected
  • Minor inaccuracies in timing for small transfers
    compounded over 400 iterations for 1-D PDF
  • Computational accuracy was high
  • Throughput was rounded from 24 ops/cycle to 20
    ops/cycle
  • Conservative parallelism was warranted due to
    unaccounted pipeline stalls
  • Algorithm constructed in VHDL

Performance Parameters of 1-D PDF
Example Computations from RAT Analysis
17
2-D PDF Estimation
  • Dimensionality
  • PDF can extend to multiple dimensions
  • Significantly increases computational complexity
    and volume of communication
  • Algorithm
  • Same construction as 1-D PDF
  • Written in VHDL
  • Targets Nallatech H101
  • Xilinx V4LX100 FPGA
  • PCI-X interconnect
  • Prediction Accuracy
  • Communication
  • Similar to 1-D PDF, communication times were
    underestimated
  • Computation
  • Computation was smaller than expected, balancing
    overall execution time

RAT Input Parameters of 2-D PDF
Performance Parameters of 2-D PDF
18
Molecular Dynamics
  • Simulation of physical interaction of a set of
    molecules over a given time interval
  • Based upon code provided by Oak Ridge National
    Lab (ORNL)
  • Algorithm
  • 16,384 molecule data set
  • Written in Impulse C
  • XtremeData XD1000 platform
  • Altera Stratix II EP2S180 FPGA
  • HyperTransport interconnect
  • SW baseline on 2.4GHz Opteron
  • Challenges for accurate prediction
  • Nondeterministic runtime
  • Molecules beyond a certain threshold are assumed
    to have zero impact
  • Large datasets for MD
  • Exhausts FPGA local memory

RAT Input Parameters of MD
Performance Parameters of MD
19
Conclusions
  • RC Amenability Test
  • Provides simple, fast, and effective method for
    investigating performance potential of given
    application design for a given target FPGA
    platform
  • Works with empirical knowledge of RC devices to
    create more efficient and effective means for
    application design
  • When RAT-projected speedups are found to be
    disappointing, the designer can quickly reevaluate
    the algorithm design and/or the RC platform
    selected as the target
  • Successes
  • Allows for rapid algorithm analysis before any
    significant hardware coding
  • Demonstrates reasonably accurate predictions
    despite coarse parameterization
  • Applications
  • Showcases effectiveness of RAT for deterministic
    algorithms like PDF estimation
  • Provides valuable qualitative insight for
    nondeterministic algorithms such as MD
  • Future Work
  • Improve support for nondeterministic algorithms
    through pipelining
  • Explore performance prediction with applications
    for multi-FPGA systems
  • Expand methodology for numerical precision and
    resource utilization

20
  • Molecular Dynamics Revisited

21
Molecular Dynamics
  • Algorithm
  • 16,384 molecule data set
  • Written in Impulse C
  • XtremeData XD1000 platform
  • Altera Stratix II EP2S180 FPGA
  • HyperTransport interconnect
  • SW baseline on 2.4GHz Opteron
  • Parameters
  • Dataset Parameters
  • Model volume of data used by FPGA
  • Communication Parameters
  • Model the HyperTransport Interconnect
  • Computation Parameters
  • Model computational requirement of FPGA
  • Nops/element
  • 164,000 ≈ 16,384 molecules × 10 ops
  • i.e., each molecule (element) takes
    10 ops/iteration
  • Throughputproc
  • 50

RAT Input Parameters of MD
Performance Parameters of MD
22
Parameter Alterations for Pipelining
  • MD Optimization
  • Each molecular computation should be pipelined
  • Focus becomes less on individual molecules and
    more on molecular interactions
  • Parameters
  • Computation Parameters
  • Nops/element
  • 16,400
  • Strictly the number of interactions per element
  • Throughputpipeline
  • 0.333
  • Reciprocal of the number of cycles needed per
    interaction; i.e., the pipeline can stall for at
    most 2 extra cycles
  • Npipeline
  • 15
  • Guess based on predicted area usage

Modified RAT Input Parameters of MD
Performance Parameters of MD
23
Pipelined Performance Prediction
  • Molecular Dynamics
  • If a pipeline is possible, certain parameters
    become obsolete
  • The number of operations in the pipeline (i.e.,
    depth) is not important
  • The number of pipeline stalls becomes critical
    and is much more meaningful for non-deterministic
    apps
  • Parameters
  • Nelement
  • 16,384²
  • Number of molecular pairs
  • Nclks/element
  • 3
  • i.e., up to two cycles can be stalls
  • Npipelines
  • 15
  • Same number of kernels as before
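Under the pipelined formulation, computation time reduces to molecular pairs × cycles-per-interaction spread across the pipelines (a sketch; the 100 MHz clock frequency is an assumed figure for illustration only):

```python
n_elements = 16384 ** 2       # molecular pairs
clks_per_element = 3          # up to two of the three cycles may be stalls
n_pipelines = 15              # replicated interaction pipelines
f_clock = 100e6               # illustrative clock frequency

# Each pipeline retires one interaction every clks_per_element cycles
t_comp = n_elements * clks_per_element / (n_pipelines * f_clock)
print(t_comp)                 # compute time per MD timestep, in seconds
```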

Pipelined RAT Input Parameters of MD
24
  • "And now for something completely different"
  • – Monty Python
  • (Or is it?)

25
Leveraging Algorithm Designs
  • Introduction
  • Molecular dynamics provided several lessons
    learned
  • Best design practices for coding in Impulse C
  • Algorithm optimizations for maximum performance
  • Memory staging for minimal footprint and delay
  • Sacrificing computation efficiency for decreased
    memory accesses
  • Motivations and Challenges
  • Application design should educate the researcher
  • Designs should also train other researchers
  • Unfortunately, new designs can be expensive
  • Collecting application knowledge into design
    patterns provides distilled lessons learned for
    efficient application design

26
Design Patterns
  • Object-oriented software engineering
  • "A design pattern names, abstracts, and
    identifies the key aspects of a common design
    structure that make it useful for creating a
    reusable object-oriented design" (1)
  • Reconfigurable Computing
  • "Design patterns offer us organizing and
    structuring principles that help us understand
    how to put building blocks (e.g., adders,
    multipliers, FIRs) together." (2)

(1) Gamma, Erich, et al., Design Patterns:
Elements of Reusable Object-Oriented Software,
Addison-Wesley, Boston, 1995.
(2) DeHon, André, et al., "Design Patterns for
Reconfigurable Computing," Proceedings of the 12th
IEEE Symposium on Field-Programmable Custom
Computing Machines (FCCM'04), April 20-23, 2004,
Napa, California.
27
Classification of Design Patterns: OO Textbook (1)
  • Pattern categories
  • Creational
  • Abstract Factory
  • Prototype
  • Singleton
  • Etc.
  • Structural
  • Adapter
  • Bridge
  • Proxy
  • Etc.
  • Behavioral
  • Iterator
  • Mediator
  • Interpreter
  • Etc.
  • Describing Patterns
  • Pattern name
  • Intent
  • Also known as
  • Motivation
  • Applicability
  • Structure
  • Participants
  • Collaborations
  • Consequences
  • Implementation
  • Sample code
  • Known uses
  • Related patterns

28
Sample Design Patterns: RC Paper (2)
  • 14 pattern categories
  • Area-Time Tradeoffs
  • Expressing Parallelism
  • Implementing Parallelism
  • Processor-FPGA Integration
  • Common-Case Optimization
  • Re-using Hardware Efficiently
  • Specialization
  • Partial Reconfiguration
  • Communications
  • Synchronization
  • Efficient Layout and Communications
  • Implementing Communication
  • Value-Added Memory Patterns
  • Number Representation Patterns
  • 89 patterns identified (samples)
  • Coarse-Grained Time Multiplexing
  • Synchronous Dataflow
  • Multi-threaded
  • Sequential vs. Parallel Design (hardware-software
    partitioning)
  • SIMD
  • Communicating FSMDs
  • Instruction augmentation
  • Exceptions
  • Pipelining
  • Worst-Case Footprint
  • Streaming Data
  • Shared Memory
  • Synchronous Clocking
  • Asynchronous Handshaking
  • Cellular Automata
  • Etc

29
Representing DPs for RC Engineering
  • Design Section
  • Structure
  • Block diagram representation
  • Reference to major RC building blocks (BRAM,
    SDRAM, compute modules, etc.).
  • Rationale: compatibility with RAT
  • Specification
  • More formal representation
  • Such as UML
  • Possibly maps to HDL
  • Implementation Section
  • HDL language-specific information
  • Platform specific information
  • Sample code
  • Description Section
  • Pattern name and classification
  • Intent
  • Also known as
  • Motivation
  • Applicability
  • Participants
  • Collaborations
  • Consequences
  • Known uses
  • Related patterns

30
Example: Time Multiplexing Pattern
Computational graph divided into smaller
subgraphs
  • Intent: Large designs on small or fixed-capacity
    platforms
  • Motivation: Meet real-time needs or inadequate
    design space
  • Applicability: For slow reconfiguration
  • No feedback loops (acyclic dataflow)
  • Participants: Subgraphs
  • Collaborations: Control algorithm directs
    subgraph swapping
  • Consequences: Slow reconfiguration time; large
    buffers; imperfect device resource utilization
  • Known Uses: Video processing, target recognition
  • Implementation: Conventional processor issues
    commands for reconfiguration and collaboration

31
Example: Datapath Duplication
Replicated computational structures for
parallel processing
  • Intent: Exploiting computational parallelism in
    sequential programming structures (loops)
  • Motivation: Achieving faster performance through
    replication of computational structures
  • Applicability: Data-independent computations
  • No feedback loops (acyclic dataflow)
  • Participants: Single computational kernel
  • Collaborations: Control algorithm directs
    dataflow and synchronization
  • Consequences: Area-time tradeoff; higher
    processing speed at the cost of increased
    implementation footprint in hardware
  • Known Uses: PDF estimation, BbNN implementation,
    MD, etc.
  • Implementation: Centralized controller
    orchestrates data movement and synchronization of
    parallel processing elements
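In software terms, the pattern amounts to one controller distributing disjoint chunks of the input to N identical kernels and merging the results (a schematic sketch only; in hardware the "kernels" are replicated datapaths operating concurrently, not sequential calls):

```python
def kernel(chunk):
    # Single computational kernel: a stand-in accumulation here.
    return sum(x * x for x in chunk)

def duplicated_datapath(data, n_kernels):
    # Controller: split the input across replicated kernels, then
    # merge partial results. Area-time tradeoff: n_kernels copies
    # of the datapath buy roughly n_kernels times the throughput.
    chunks = [data[i::n_kernels] for i in range(n_kernels)]
    partials = [kernel(c) for c in chunks]   # concurrent in hardware
    return sum(partials)
```

This mirrors the 1-D PDF design earlier in the deck, where eight replicated kernels each cover a subset of the 256 bins.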

32
System-level patterns for MD
Visualization of Datapath Duplication
  • When designing MD, the initial goal is to
    decompose the algorithm into parallel kernels
  • Datapath duplication is a potential starting
    pattern
  • MD will require additional modifications since the
    computational structure will not divide cleanly

What do customers buy after viewing this
item? 67% use this pattern; 37% alternatively
use ... May we also recommend? Pipelining, Loop
Fusion
On-line Shopping for Design Patterns
33
Kernel-level optimization patterns for MD
Pattern Utilization
void ComputeAccel()
{
  double dr[3], f, fcVal, rrCut, rr, ri2, ri6, r1;
  int j1, j2, n, k;
  rrCut = RCUT*RCUT;
  for (n=0; n<nAtom; n++) for (k=0; k<3; k++) ra[n][k] = 0.0;
  potEnergy = 0.0;
  for (j1=0; j1<nAtom-1; j1++) {
    for (j2=j1+1; j2<nAtom; j2++) {
      for (rr=0.0, k=0; k<3; k++) {
        dr[k] = r[j1][k] - r[j2][k];
        dr[k] = dr[k] - SignR(RegionH[k], dr[k]-RegionH[k])
                      - SignR(RegionH[k], dr[k]+RegionH[k]);
        rr = rr + dr[k]*dr[k];
      }
      if (rr < rrCut) {
        ri2 = 1.0/rr; ri6 = ri2*ri2*ri2; r1 = sqrt(rr);
        fcVal = 48.0*ri2*ri6*(ri6-0.5) + Duc/r1;
        /* ... */
  • 2-D arrays
  • SW addressing is handled by C compiler
  • HW should be explicit
  • Loop fusion
  • Fairly straightforward in explicit languages
  • Challenging to make efficient in other HLLs
  • Memory dependencies
  • Shared bank
  • Repeat accesses in pipeline cause stalls
  • Write after read
  • Double access, even of same memory location,
    similarly causes stalls

34
Design Pattern Effects on MD
for (i=0; i<num*(num-1); i++) {
  cg_count_ceil_32(1, 0, i==0, num-2, &k);
  cg_count_ceil_32(1, 0, i==0, num-2, &j2);
  cg_count_ceil_32(j2==0, 0, i==0, num, &j1);
  if (j2 >= j1) j2++;
  if (j2 == 0) rr = 0.0;
  split_64to32_flt_flt(AL[j1], &j1y, &j1x);
  split_64to32_flt_flt(BL[j1], &dummy, &j1z);
  split_64to32_flt_flt(CL[j2], &j2y, &j2x);
  split_64to32_flt_flt(DL[j2], &dummy, &j2z);
  if (j1 < j2) { dr0 = j1x - j2x; dr1 = j1y - j2y; dr2 = j1z - j2z; }
  else         { dr0 = j2x - j1x; dr1 = j2y - j1y; dr2 = j2z - j1z; }
  dr0 = dr0 - ( dr0 > REGIONH0  ? REGIONH0 : MREGIONH0 )
            - ( dr0 > MREGIONH0 ? REGIONH0 : MREGIONH0 );
  dr1 = dr1 - ( dr1 > REGIONH1  ? REGIONH1 : MREGIONH1 )
            - ( dr1 > MREGIONH1 ? REGIONH1 : MREGIONH1 );
  dr2 = dr2 - ( dr2 > REGIONH2  ? REGIONH2 : MREGIONH2 )
  /* ... */

void ComputeAccel()
{
  double dr[3], f, fcVal, rrCut, rr, ri2, ri6, r1;
  int j1, j2, n, k;
  rrCut = RCUT*RCUT;
  for (n=0; n<nAtom; n++) for (k=0; k<3; k++) ra[n][k] = 0.0;
  potEnergy = 0.0;
  for (j1=0; j1<nAtom-1; j1++) {
    for (j2=j1+1; j2<nAtom; j2++) {
      for (rr=0.0, k=0; k<3; k++) {
        dr[k] = r[j1][k] - r[j2][k];
        dr[k] = dr[k] - SignR(RegionH[k], dr[k]-RegionH[k])
                      - SignR(RegionH[k], dr[k]+RegionH[k]);
        rr = rr + dr[k]*dr[k];
      }
      if (rr < rrCut) {
        ri2 = 1.0/rr; ri6 = ri2*ri2*ri2; r1 = sqrt(rr);
        fcVal = 48.0*ri2*ri6*(ri6-0.5) + Duc/r1;
        for (k=0; k<3; k++) {
          f = fcVal*dr[k];
          ra[j1][k] = ra[j1][k] + f;
          ra[j2][k] = ra[j2][k] - f;
        }
        potEnergy += 4.0*ri6*(ri6-1.0) - Uc - Duc*(r1-RCUT);
      }
    }
  }
}
Carte MD, fully pipelined, 282-cycle depth
C baseline code for MD
35
Conclusions
  • Performance prediction is a powerful technique
    for improving efficiency of RC application
    formulation
  • Provides reasonable accuracy for a rough
    estimate
  • Underscores the importance of numerical precision
    and resource utilization in performance prediction
  • Design patterns provide lessons learned
    documentation
  • Records and disseminates algorithm design
    knowledge
  • Allows for more effective formulation of future
    designs
  • Future Work
  • Improve the connection between design patterns
    and performance prediction
  • Expand design pattern methodology for better
    integration with RC
  • Increase role of numerical precision in
    performance prediction