Title: A Scalable FPGA-based Multiprocessor for Molecular Dynamics Simulation
1. A Scalable FPGA-based Multiprocessor for Molecular Dynamics Simulation
Arun Patel¹, Christopher A. Madill²,³, Manuel Saldaña¹, Christopher Comis¹, Régis Pomès²,³, Paul Chow¹
¹ Department of Electrical and Computer Engineering, University of Toronto
² Department of Structural Biology and Biochemistry, The Hospital for Sick Children
³ Department of Biochemistry, University of Toronto
Presented by Arun Patel (apatel@eecg.toronto.edu)
Connections 2006: The University of Toronto ECE Graduate Symposium
Toronto, Ontario, Canada
June 9th, 2006
2. Introduction
- FPGAs can accelerate many computing tasks by up to 2 or 3 orders of magnitude
- Supercomputers and computing clusters have been designed to improve computing performance
- Our work focuses on developing a computing cluster based on a scalable network of FPGAs
- The initial design is tailored for performing Molecular Dynamics simulations
3. Molecular Dynamics
- Combines empirical force calculations with Newton's equations of motion
- Predicts the time trajectory of small atomic systems
- Computationally demanding:
  - Calculate interatomic forces
  - Calculate the net force on each atom
  - Integrate the Newtonian equations of motion
7. Molecular Dynamics
[Figure: potential energy function U]
8. Why Molecular Dynamics?
1. Inherently Parallelizable
2. Computationally Demanding
9. Motivation for Architecture
- The majority of hardware accelerators achieve a 10²–10³× improvement over software by:
  - Pipelining a serially-executed algorithm, or
  - Performing operations in parallel
- Such techniques alone do not address large-scale computing applications (such as MD):
  - Much greater speedups are required (10⁴–10⁵×)
  - Not likely with a single hardware accelerator
- The ideal solution for large-scale computing combines:
  - The scalability of modern HPC platforms
  - The performance of hardware acceleration
10. The TMD Machine
- An investigation of an FPGA-based architecture
- Designed for applications that exhibit a high compute-to-communication ratio
- Made possible by the integration of microprocessors and high-speed communication interfaces into modern FPGA packages
11. Inter-Task Communication
- Based on the Message Passing Interface (MPI)
  - Popular message-passing standard for distributed applications
  - Implementations available for virtually every HPC platform
- TMD-MPI
  - Subset of the MPI standard developed for the TMD architecture
  - Software library for tasks implemented on embedded microprocessors
  - Hardware Message Passing Engine (MPE) for hardware computing tasks
12. MD Software Implementation
- Design Flow
- Testing and validation
- Parallel design
- Software-to-hardware transition
[Figure: software Force Engine processes, compiled with mpiCC, communicating over an interconnection network]
13. Current Work
- Replace software processes with hardware computing engines
[Figure: Atom Store and Force Engine tasks (C → HDL synthesis with the TMD-MPE) alongside TMD-MPI tasks on PPC-405 processors, mapped onto two XC2VP100 FPGAs]
14. Acknowledgements
TMD Group: Dr. Paul Chow, Dr. Régis Pomès, Christopher Madill, Arun Patel, Andrew House, Daniel Nunes, Manuel Saldaña, Emanuel Ramalho
Past Members: David Chui, Christopher Comis, Sam Lee, Lesley Shannon
15–17. Large-Scale Computing Solutions
- Class 1 Machines
  - Supercomputers or clusters of workstations
  - 10–10⁵ interconnected CPUs
- Class 2 Machines
  - Hybrid network of CPU and FPGA hardware
  - FPGA acts as an external co-processor to the CPU
  - Programming model still evolving
- Class 3 Machines
  - Network of FPGA-based computing nodes connected by an interconnection network
  - Recent area of academic and industrial focus
18–20. TMD Communication Infrastructure
- Tier 1: Intra-FPGA Communication
  - Point-to-point FIFOs are used as communication channels
  - Asynchronous FIFOs isolate clock domains
  - Application-specific network topologies can be defined
- Tier 2: Inter-FPGA Communication
  - Multi-gigabit serial transceivers used for inter-FPGA communication
  - Fully-interconnected network topology using 2N(N−1) pairs of traces
- Tier 3: Inter-Cluster Communication
  - Commercially-available switches interconnect cluster PCBs
  - Built-in features for large-scale computing: fault tolerance, scalability
21. TMD Computing Tasks (1/2)
- Computing Tasks
  - Applications are defined as a collection of computing tasks
  - Tasks communicate by passing messages
- Task Implementation Flexibility
  - Software processes executing on embedded microprocessors
  - Dedicated hardware computing engines
[Figure: a task may map to a computing engine or an embedded microprocessor (Class 3), or to a processor on a CPU node (Class 1)]
22. TMD Computing Tasks (2/2)
- Computing Task Granularity
  - Tasks can vary in size and complexity
  - Not restricted to one task per FPGA
[Figure: tasks A through M of varying sizes mapped onto several FPGAs]
23–26. TMD-MPI Software Implementation
Layer 4 (MPI Interface): all MPI functions implemented in TMD-MPI that are available to the application.
Layer 3 (Collective Operations): barrier synchronization, data gathering, and message broadcasts.
Layer 2 (Communication Primitives): MPI_Send and MPI_Recv methods are used to transmit data between processes.
Layer 1 (Hardware Interface): low-level methods to communicate with FSLs for both on-chip and off-chip communication.
[Figure: layer stack from the Application down through the MPI Application Interface, Point-to-Point MPI Functions, Send/Receive Implementation, and FSL Hardware Interface to the Hardware]
27–30. TMD Application Design Flow
- Step 1: Application Prototyping
  - A software prototype of the application is developed
  - Profiling identifies compute-intensive routines
- Step 2: Application Refinement
  - Partitioning into tasks communicating using MPI
  - Each task emulates a computing engine
  - Communication patterns are analyzed to determine the network topology
- Step 3: TMD Prototyping
  - Tasks are ported to soft processors on the TMD
  - Software is refined to utilize the TMD-MPI library
  - The on-chip communication network is verified
- Step 4: TMD Optimization
  - Compute-intensive tasks are replaced with hardware engines
  - The MPE handles communication for hardware engines
[Figure: the application prototype is partitioned into processes A, B, and C, ported onto the TMD, and finally process B is replaced by a hardware engine]
31. Future Work: Phase 2
TMD Version 2 Prototype
32. Future Work: Phase 3
The final TMD architecture will contain a hierarchical network of FPGA chips.