Title: Accelerating Molecular Dynamics on a GPU
1. Accelerating Molecular Dynamics on a GPU
- John Stone
- Theoretical and Computational Biophysics Group
- Beckman Institute for Advanced Science and Technology
- University of Illinois at Urbana-Champaign
- http://www.ks.uiuc.edu/Research/gpu/
- Careers in High-Performance Systems (CHiPS) Workshop
- National Center for Supercomputing Applications, July 25, 2009
2. Computational Biology's Insatiable Demand for Processing Power
- Simulations still fall short of biological timescales
- Large simulations are extremely difficult to prepare and analyze
- An order-of-magnitude increase in performance would allow use of more sophisticated models
Satellite Tobacco Mosaic Virus (STMV)
3. Programmable Graphics Hardware
- Groundbreaking research systems:
  - AT&T Pixel Machine (1989): 82 x DSP32 processors
  - UNC PixelFlow (1992-98): 64 x (PA-8000 + 8,192 bit-serial SIMD)
  - SGI RealityEngine (1990s): up to 12 i860-XP processors perform vertex operations (ucode), fixed-function fragment hardware
- All mainstream GPUs now incorporate fully programmable processors
[Photos: UNC PixelFlow rack; SGI RealityEngine i860 vertex processors]
4. GLSL Sphere Fragment Shader
- Written in the OpenGL Shading Language
- High-level C-like language with vector types and operations
- Compiled dynamically by the graphics driver at runtime
- Compiled machine code executes on the GPU
5. GPU Computing
- Commodity devices, omnipresent in modern computers (over a million sold per week)
- Massively parallel hardware: hundreds of processing units, throughput-oriented architecture
- Standard integer and floating point types supported
- Programming tools allow software to be written in dialects of familiar C/C++ and integrated into legacy software (a minimal sketch follows this list)
- GPU algorithms are often multicore friendly due to attention paid to data locality and data-parallel work decomposition
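As a minimal sketch of that programming model (a hypothetical example, not code from NAMD or VMD), a CUDA kernel is ordinary C/C++ with a few extensions and is launched from regular host code:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal CUDA kernel: each thread scales one element of the array.
__global__ void scale_array(float *data, float factor, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  if (i < n)                                      // guard against overrun
    data[i] *= factor;
}

int main() {
  const int n = 1 << 20;
  float *h = (float *) malloc(n * sizeof(float));
  for (int i = 0; i < n; i++) h[i] = 1.0f;

  float *d;
  cudaMalloc((void **) &d, n * sizeof(float));
  cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

  // Launch enough 256-thread blocks to cover all n elements.
  scale_array<<<(n + 255) / 256, 256>>>(d, 2.0f, n);

  cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
  printf("h[0] = %f\n", h[0]);   // expect 2.0
  cudaFree(d);
  free(h);
  return 0;
}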
6. What Speedups Can GPUs Achieve?
- Single-GPU speedups of 10x to 30x vs. one CPU core are common
- Best speedups can reach 100x or more, attained on codes dominated by floating point arithmetic, especially native GPU machine instructions, e.g. expf(), rsqrtf()
- Amdahl's Law can prevent legacy codes from achieving peak speedups with shallow GPU acceleration efforts (worked example below)
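To make the Amdahl's Law point concrete: if only a fraction f of the runtime is accelerated by a factor s, the overall speedup is 1 / ((1 - f) + f/s). A small sketch with hypothetical numbers:

#include <cstdio>

// Amdahl's Law: overall speedup when a fraction f of the runtime
// is accelerated by a factor s.
static double amdahl(double f, double s) {
  return 1.0 / ((1.0 - f) + f / s);
}

int main() {
  // Hypothetical example: a 100x kernel speedup applied to 80% of the
  // runtime yields only ~4.8x overall; applied to 99%, it yields ~50x.
  printf("f=0.80, s=100: %.1fx\n", amdahl(0.80, 100.0));
  printf("f=0.99, s=100: %.1fx\n", amdahl(0.99, 100.0));
  return 0;
}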
7. GPU Peak Single-Precision Performance: Exponential Trend
8. GPU Peak Memory Bandwidth: Linear Trend
9. Comparison of CPU and GPU Hardware Architecture
- CPU: cache heavy, focused on individual thread performance
- GPU: ALU heavy, massively parallel, throughput oriented
10. NVIDIA GT200
[Block diagram: the streaming processor array executes a grid of thread blocks; each texture processor cluster contains streaming multiprocessors (8 SPs and 2 SFUs each) sharing a texture unit; multiple thread blocks and many warps of threads run per multiprocessor, with individual threads mapped to the SPs.]
11. GPU Memory Accessible in CUDA
- Mapped host memory: up to 4GB, 5.7GB/sec bandwidth (PCIe), accessible by multiple GPUs
- Global memory: up to 4GB, high latency (600 clock cycles), 140GB/sec bandwidth, accessible by all threads, atomic operations (slow)
- Texture memory: read-only, cached, and interpolated/filtered access to global memory
- Constant memory: 64KB, read-only, cached, fast/low-latency if data elements are accessed in unison by peer threads
- Shared memory: 16KB, low-latency, accessible among threads in the same block, fast if accessed without bank conflicts
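A brief sketch of how several of these memory spaces appear in CUDA source (identifiers are illustrative, not taken from NAMD):

// Constant memory: small, read-only, cached, fastest when all threads in a
// warp read the same element.
__constant__ float coeffs[256];

// Global memory arrays are passed to the kernel by pointer; texture objects
// can provide cached/filtered reads of the same data (not shown here).
__global__ void example_kernel(const float *in, float *out, int n) {
  // Shared memory: 16KB per multiprocessor on GT200-class hardware,
  // visible to all threads of the block. Assumes 256-thread blocks.
  __shared__ float tile[256];

  int i = blockIdx.x * blockDim.x + threadIdx.x;
  tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // coalesced global read
  __syncthreads();                             // tile now visible block-wide

  if (i < n)
    out[i] = tile[threadIdx.x] * coeffs[threadIdx.x];
}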
12. An Approach to Writing CUDA Kernels
- Find an algorithm that exposes substantial parallelism: thousands of independent threads (sketch below)
- Identify appropriate GPU memory subsystems for storage of data used by the kernel
- Are there trade-offs that can be made to exchange computation for more parallelism?
- Brute force methods that expose significant parallelism do surprisingly well on current GPUs
- Analyze the real-world use case for the problem and optimize the kernel for the problem size/characteristics that will be heavily used
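The "thousands of independent threads" requirement usually reduces to index arithmetic: one thread per output element. A generic sketch (hypothetical kernel):

// One thread per point of a 2-D lattice; a 512 x 512 lattice already
// exposes 262,144 independent threads.
__global__ void lattice_kernel(float *out, int nx, int ny) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x < nx && y < ny)
    out[y * nx + x] = 0.0f;        // placeholder for the real per-point work
}

// Host-side launch: enough 16x16 blocks to tile the whole lattice.
//   dim3 block(16, 16), grid((nx + 15) / 16, (ny + 15) / 16);
//   lattice_kernel<<<grid, block>>>(d_out, nx, ny);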
13. NAMD: Parallel Molecular Dynamics
Kale et al., J. Comp. Phys. 151:283-312, 1999.
- Designed from the beginning as a parallel program
- Uses the Charm++ philosophy:
  - Decompose computation into a large number of objects
  - An intelligent run-time system (Charm++) assigns objects to processors for dynamic load balancing with minimal communication
- Hybrid of spatial and force decomposition:
  - Spatial decomposition of atoms into cubes (called patches; rough sketch below)
  - For every pair of interacting patches, create one object for calculating electrostatic interactions
  - Recent codes (Blue Matter, Desmond, etc.) use this idea in some form
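A rough sketch of the patch idea (not NAMD's actual data structures): bin each atom into a cube whose edge is at least the cutoff distance, so every interacting pair of atoms lies within a patch or between neighboring patches.

#include <vector>

struct Atom { float x, y, z, charge; };

// Hypothetical spatial decomposition: cube edge >= cutoff; coordinates are
// assumed already shifted into [0, n*edge) along each axis.
static int patch_index(const Atom &a, float edge, int nx, int ny) {
  int ix = (int)(a.x / edge), iy = (int)(a.y / edge), iz = (int)(a.z / edge);
  return (iz * ny + iy) * nx + ix;              // linearized cube index
}

static std::vector<std::vector<Atom>>
bin_atoms(const std::vector<Atom> &atoms, float edge, int nx, int ny, int nz) {
  std::vector<std::vector<Atom>> patches(nx * ny * nz);
  for (const Atom &a : atoms)
    patches[patch_index(a, edge, nx, ny)].push_back(a);
  return patches;
}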
14. NAMD Overlapping Execution
Phillips et al., SC2002.
[Figure: example configuration, 847 objects, 100,000 atoms; objects are assigned to processors and queued as data arrives, with force work offloaded to the GPU.]
15. Non-bonded Interactions
- Calculate forces for pairs of atoms within the cutoff distance (a serial reference loop is sketched below)
- r_ij: distance between atom i and atom j
[Figure: atom i and atom j within the cutoff radius.]
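For reference, the serial form of this computation is a double loop with a cutoff test on r^2 (Coulomb term only here; the GPU kernel two slides ahead adds Lennard-Jones and exclusion handling):

#include <math.h>

struct Atom { float x, y, z, charge; };   // as in the earlier sketch
struct Vec3 { float x, y, z; };

// Serial reference: accumulate Coulomb forces for all atom pairs within the cutoff.
void coulomb_forces(const Atom *atoms, Vec3 *forces, int n, float cutoff2) {
  for (int i = 0; i < n; i++) {
    for (int j = i + 1; j < n; j++) {
      float dx = atoms[j].x - atoms[i].x;
      float dy = atoms[j].y - atoms[i].y;
      float dz = atoms[j].z - atoms[i].z;
      float r2 = dx*dx + dy*dy + dz*dz;
      if (r2 < cutoff2) {                   // compare r^2 to avoid a sqrt
        float r = sqrtf(r2);
        float f = atoms[i].charge * atoms[j].charge / (r2 * r);  // q_i q_j / r^3
        forces[i].x -= dx * f;  forces[i].y -= dy * f;  forces[i].z -= dz * f;
        forces[j].x += dx * f;  forces[j].y += dy * f;  forces[j].z += dz * f;
      }
    }
  }
}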
16. Nonbonded Forces on G80 GPU
- Start with the most expensive calculation: direct nonbonded interactions
- Decompose work into pairs of patches, identical to the NAMD structure
- GPU hardware assigns patch-pairs to multiprocessors dynamically
[Diagram: force computation on a single multiprocessor (the GeForce 8800 GTX has 16). A 32-way SIMD multiprocessor runs 32-256 multiplexed threads; 16kB of shared memory holds patch A coordinates and parameters; 32kB of registers hold patch B coordinates, parameters, and forces; the texture unit performs force-table interpolation; constants hold exclusion data (64kB and 8kB caches); 768 MB main memory, no cache, 300 cycle latency.]
Stone et al., J. Comp. Chem. 28:2618-2640, 2007.
17. Nonbonded Forces CUDA Code

texture<float4> force_table;
__constant__ unsigned int exclusions[];
__shared__ atom jatom[];
atom iatom;      // per-thread atom, stored in registers
float4 iforce;   // per-thread force, stored in registers
for (int j = 0; j < jatom_count; j++) {
  float dx = jatom[j].x - iatom.x;
  float dy = jatom[j].y - iatom.y;
  float dz = jatom[j].z - iatom.z;
  float r2 = dx*dx + dy*dy + dz*dz;
  if (r2 < cutoff2) {
    float4 ft = texfetch(force_table, 1.f/sqrt(r2));        // force interpolation
    bool excluded = false;
    int indexdiff = iatom.index - jatom[j].index;
    if (abs(indexdiff) <= (int) jatom[j].excl_maxdiff) {     // exclusions
      indexdiff += jatom[j].excl_index;
      excluded = ((exclusions[indexdiff >> 5] & (1 << (indexdiff & 31))) != 0);
    }
    float f = iatom.half_sigma + jatom[j].half_sigma;        // parameters: sigma
    f *= f * f;                                              // sigma^3
    f *= f;                                                  // sigma^6
    f *= (f * ft.x + ft.y);                                  // sigma12 * fi.x - sigma6 * fi.y
    f *= iatom.sqrt_epsilon * jatom[j].sqrt_epsilon;
    float qq = iatom.charge * jatom[j].charge;
    if (excluded) { f = qq * ft.w; }    // PME correction
    else          { f += qq * ft.z; }   // Coulomb
    iforce.x += dx * f;                 // accumulation
    iforce.y += dy * f;
    iforce.z += dz * f;
    iforce.w += 1.f;                    // interaction count or energy
  }
}

Stone et al., J. Comp. Chem. 28:2618-2640, 2007.
18. NAMD Performance on NCSA GPU Cluster, April 2008
- STMV virus (1M atoms)
- 60 GPUs match the performance of 330 CPU cores
- 5.5-7x overall application speedup with G80-based GPUs
- Overlap with CPU
- Off-node results done first
- Plans for better performance:
  - Tune or port remaining work
  - Balance GPU load
[Chart: STMV performance; values 25.7, 13.8, 7.8 (lower is faster); 2.4 GHz Opteron with Quadro FX 5600.]
19. NAMD Performance on GT200 GPU Cluster, August 2008
- 8 GT200s, 240 SPs @ 1.3GHz
- 72x faster than a single CPU core
- 9x overall application speedup vs. 8 CPU cores
- 32% faster overall than 8 nodes of the G80 cluster
- GT200 CUDA kernel is 54% faster
- 8% variation in GPU load
- Cost of double-precision for force accumulation is minimal: only 8% slower than single-precision
20. VMD: Visual Molecular Dynamics
- Visualization and analysis of molecular dynamics simulations, sequence data, volumetric data, quantum chemistry simulations, particle systems, ...
- User extensible with scripting and plugins
- http://www.ks.uiuc.edu/Research/vmd/
21. GPU Acceleration in VMD
- Electrostatic field calculation, ion placement: factor of 20x to 44x faster
- Molecular orbital calculation and display: factor of 120x faster
- Imaging of gas migration pathways in proteins with implicit ligand sampling: factor of 20x to 30x faster
22. Electrostatic Potential Maps
- Electrostatic potentials evaluated on a 3-D lattice
- Applications include:
  - Ion placement for structure building
  - Time-averaged potentials for simulation
  - Visualization and analysis
Isoleucine tRNA synthetase
23. Direct Coulomb Summation
- Each lattice point accumulates the electrostatic potential contribution from all atoms:
  potential[j] += charge[i] / r_ij
- r_ij: distance from lattice point j to atom i
[Figure: lattice point j being evaluated; atom i.]
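A simplified sketch in the spirit of the VMD direct Coulomb summation kernel (one thread per lattice point of a 2-D slice, one batch of atoms staged through constant memory; names and constants are illustrative, not VMD's actual code):

// Atom data packed as (x, y, z, q); 4000 float4 entries fit in 64KB constant memory.
#define MAXATOMS 4000
__constant__ float4 atominfo[MAXATOMS];

// Each thread computes the potential at one point of a 2-D lattice slice at height z.
__global__ void dcs_kernel(float *energygrid, int nx, int ny,
                           float gridspacing, float z, int numatoms) {
  int xi = blockIdx.x * blockDim.x + threadIdx.x;
  int yi = blockIdx.y * blockDim.y + threadIdx.y;
  if (xi >= nx || yi >= ny) return;

  float coorx = gridspacing * xi;
  float coory = gridspacing * yi;
  float energy = 0.0f;

  // Sum contributions q_i / r_ij from every atom in the batch.
  for (int i = 0; i < numatoms; i++) {
    float dx = coorx - atominfo[i].x;
    float dy = coory - atominfo[i].y;
    float dz = z     - atominfo[i].z;
    energy += atominfo[i].w * rsqrtf(dx*dx + dy*dy + dz*dz);
  }
  energygrid[yi * nx + xi] += energy;   // accumulate across atom batches
}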
24. Photobiology of Vision and Photosynthesis: Investigations of the Chromatophore, a Photosynthetic Organelle
- Partial model: 10M atoms
- Electrostatic field of the chromatophore model from the multilevel summation method, computed with 3 GPUs (G80) in 90 seconds, 46x faster than a single CPU core
- Electrostatics needed to build the full structural model, place ions, and study macroscopic properties
- The full chromatophore model will permit structural, chemical and kinetic investigations at a structural systems biology level
[Figure: chromatophore model with incident light.]
25. Lessons Learned
- GPU algorithms need fine-grained parallelism and sufficient work to fully utilize the hardware
- Much of per-thread GPU algorithm optimization revolves around efficient use of multiple memory systems and latency hiding
- Concurrency can often be traded for per-thread performance, in combination with increased use of registers or shared memory (see the sketch after this list)
- Fine-grained GPU work decompositions often compose well with the comparatively coarse-grained decompositions used for multicore or distributed memory programming
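One common form of that concurrency-for-performance trade is having each thread produce several outputs so that per-atom values held in registers are reused; a hypothetical variant of the direct Coulomb summation sketch shown earlier:

// atominfo[] as declared in the earlier direct Coulomb sketch:
//   __constant__ float4 atominfo[MAXATOMS];

// Each thread evaluates 4 adjacent lattice points in x, reusing the atom
// data fetched into registers: fewer threads, more work per thread.
__global__ void dcs_kernel_unroll4(float *energygrid, int nx, int ny,
                                   float gridspacing, float z, int numatoms) {
  int xi = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
  int yi = blockIdx.y * blockDim.y + threadIdx.y;
  if (xi + 3 >= nx || yi >= ny) return;   // sketch: assumes nx divisible by 4

  float coorx = gridspacing * xi;
  float coory = gridspacing * yi;
  float e0 = 0.f, e1 = 0.f, e2 = 0.f, e3 = 0.f;

  for (int i = 0; i < numatoms; i++) {
    float dy = coory - atominfo[i].y;
    float dz = z     - atominfo[i].z;
    float dyz2 = dy*dy + dz*dz;                 // reused for all 4 points
    float dx = coorx - atominfo[i].x;
    float q  = atominfo[i].w;
    e0 += q * rsqrtf(dx*dx + dyz2);  dx += gridspacing;
    e1 += q * rsqrtf(dx*dx + dyz2);  dx += gridspacing;
    e2 += q * rsqrtf(dx*dx + dyz2);  dx += gridspacing;
    e3 += q * rsqrtf(dx*dx + dyz2);
  }
  energygrid[yi*nx + xi    ] += e0;
  energygrid[yi*nx + xi + 1] += e1;
  energygrid[yi*nx + xi + 2] += e2;
  energygrid[yi*nx + xi + 3] += e3;
}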
26. Lessons Learned (2)
- The host CPU can potentially be used to "regularize" the computation for the GPU, yielding better overall performance
- Overlapping CPU work with the GPU can hide some communication and unaccelerated computation
- Targeted use of double-precision floating point arithmetic, or compensated summation, can reduce the effects of floating point truncation at low cost to performance (sketch below)
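"Compensated summation" here refers to Kahan-style accumulation; a minimal sketch (note that aggressive floating-point reassociation by the compiler can optimize the correction away):

// Kahan (compensated) summation: carry a running correction term so that
// low-order bits lost in each addition are re-applied on the next one.
__host__ __device__ inline void kahan_add(float &sum, float &comp, float value) {
  float y = value - comp;      // apply the correction from the previous step
  float t = sum + y;           // low-order bits of y may be lost here
  comp = (t - sum) - y;        // recover what was lost
  sum = t;
}

// Usage inside a force accumulation loop (illustrative):
//   float fx = 0.f, cx = 0.f;
//   for (...) kahan_add(fx, cx, dx * f);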
27. Acknowledgements
- Theoretical and Computational Biophysics Group, University of Illinois at Urbana-Champaign
- Wen-mei Hwu and the IMPACT group at the University of Illinois at Urbana-Champaign
- NVIDIA Center of Excellence, University of Illinois at Urbana-Champaign
- NCSA Innovative Systems Lab
- David Kirk and the CUDA team at NVIDIA
- NIH support: P41-RR05969