Transcript and Presenter's Notes

Title: Accelerating Molecular Dynamics on a GPU


1
Accelerating Molecular Dynamics on a GPU
  • John Stone
  • Theoretical and Computational Biophysics Group
  • Beckman Institute for Advanced Science and
    Technology
  • University of Illinois at Urbana-Champaign
  • http://www.ks.uiuc.edu/Research/gpu/
  • Careers in High-Performance Systems (CHiPS)
    Workshop
  • National Center for Supercomputing Applications,
    July 25, 2009

2
Computational Biology's Insatiable Demand for Processing Power
  • Simulations still fall short of biological
    timescales
  • Large simulations extremely difficult to prepare,
    analyze
  • Order of magnitude increase in performance would
    allow use of more sophisticated models

Satellite Tobacco Mosaic Virus (STMV)
3
Programmable Graphics Hardware
  • Groundbreaking research systems
  • AT&T Pixel Machine (1989)
  • 82 x DSP32 processors
  • UNC PixelFlow (1992-98)
  • 64 x (PA-8000 + 8,192 bit-serial SIMD)
  • SGI RealityEngine (1990s)
  • Up to 12 i860-XP processors perform vertex
    operations (ucode), fixed-func. fragment hardware
  • All mainstream GPUs now incorporate fully
    programmable processors

UNC PixelFlow Rack
SGI Reality Engine i860 Vertex Processors
4
GLSL Sphere Fragment Shader
  • Written in OpenGL Shading Language
  • High-level C-like language with vector types and
    operations
  • Compiled dynamically by the graphics driver at
    runtime (host-side compile/link sketch below)
  • Compiled machine code executes on GPU
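To make the "compiled dynamically by the driver" step concrete, here is a minimal host-side C sketch (not code from the presentation) that hands a GLSL fragment shader source string to the driver, which compiles it to GPU machine code at runtime. The function name and error handling are illustrative; an OpenGL 2.0+ context is assumed to be current, with entry points loaded (e.g. via GLEW).

/* Minimal sketch: compile and link a GLSL fragment shader at runtime. */
#include <GL/glew.h>
#include <stdio.h>

GLuint build_fragment_program(const char *fragsrc) {
  GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
  glShaderSource(shader, 1, &fragsrc, NULL);   /* hand source text to the driver   */
  glCompileShader(shader);                     /* driver compiles to GPU code here */

  GLint ok = 0;
  glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);
  if (!ok) {
    char log[1024];
    glGetShaderInfoLog(shader, sizeof(log), NULL, log);
    fprintf(stderr, "shader compile failed: %s\n", log);
    return 0;
  }

  GLuint prog = glCreateProgram();
  glAttachShader(prog, shader);
  glLinkProgram(prog);                         /* link into an executable GPU program */
  return prog;                                 /* later bound with glUseProgram(prog) */
}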

5
GPU Computing
  • Commodity devices, omnipresent in modern
    computers (over a million sold per week)
  • Massively parallel hardware, hundreds of
    processing units, throughput oriented
    architecture
  • Standard integer and floating point types
    supported
  • Programming tools allow software to be written in
    dialects of familiar C/C++ and integrated into
    legacy software
  • GPU algorithms are often multicore friendly due
    to attention paid to data locality and
    data-parallel work decomposition

6
What Speedups Can GPUs Achieve?
  • Single-GPU speedups of 10x to 30x vs. one CPU
    core are common
  • Best speedups can reach 100x or more, attained on
    codes dominated by floating point arithmetic,
    especially native GPU machine instructions, e.g.
    expf(), rsqrtf(), etc.
  • Amdahl's Law can prevent legacy codes from
    achieving peak speedups with shallow GPU
    acceleration efforts (worked example below)
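As a worked example of Amdahl's Law (illustrative numbers, not figures from the talk): if a fraction $p$ of the runtime is accelerated by a factor $s$, the overall speedup is

$S = \dfrac{1}{(1-p) + p/s}$, so with $p = 0.8$ and $s = 30$: $S = \dfrac{1}{0.2 + 0.8/30} \approx 4.4$

which is why shallow acceleration of a legacy code falls far short of the kernel-level speedup.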

7
GPU Peak Single-Precision Performance: Exponential Trend
8
GPU Peak Memory Bandwidth: Linear Trend
9
Comparison of CPU and GPU Hardware Architecture
CPU: cache heavy, focused on individual thread performance
GPU: ALU heavy, massively parallel, throughput oriented
10
NVIDIA GT200
[Diagram: GT200 hardware/software hierarchy. The streaming processor array
(the full GPU) executes a grid of thread blocks; each streaming
multiprocessor, grouped into texture processor clusters, runs multiple
thread blocks as many warps of threads; within a streaming multiprocessor,
8 SPs and 2 SFUs execute individual threads, with a texture unit serving
the cluster.]
11
GPU Memory Accessible in CUDA
  • Mapped host memory: up to 4GB, 5.7GB/sec
    bandwidth (PCIe), accessible by multiple GPUs
  • Global memory: up to 4GB, high latency (600
    clock cycles), 140GB/sec bandwidth, accessible by
    all threads, atomic operations (slow)
  • Texture memory: read-only, cached, and
    interpolated/filtered access to global memory
  • Constant memory: 64KB, read-only, cached,
    fast/low-latency if data elements are accessed in
    unison by peer threads
  • Shared memory: 16KB, low-latency, accessible among
    threads in the same block, fast if accessed
    without bank conflicts (declaration sketch below)
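A minimal CUDA C sketch (illustrative kernel and names, not code from the talk) showing how these memory spaces are declared and accessed. Mapped host memory would additionally require cudaHostAlloc() with the cudaHostAllocMapped flag on the host, which is not shown.

__constant__ float4 const_params[1024];                 // constant memory: small, cached, read-only
texture<float4, 1, cudaReadModeElementType> tex_table;  // texture reference: cached reads
                                                        // (bound with cudaBindTexture() on the host)

// Assumes blockDim.x <= 256 and that tex_table has been bound before launch.
__global__ void memory_spaces_demo(const float4 *gmem_in, float *gmem_out, int n) {
  __shared__ float tile[256];                      // shared memory: per-block, low latency
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // per-thread index lives in registers
  if (i >= n) return;
  float4 v = gmem_in[i];                           // global memory: high latency, high bandwidth
  float4 p = const_params[threadIdx.x % 1024];     // constant memory access
  float4 t = tex1Dfetch(tex_table, i);             // cached texture fetch
  tile[threadIdx.x] = v.x + p.x + t.x;             // stage a value in shared memory
  __syncthreads();
  gmem_out[i] = tile[threadIdx.x];                 // write result back to global memory
}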

12
An Approach to Writing CUDA Kernels
  • Find an algorithm that exposes substantial
    parallelism, thousands of independent threads
    (see the launch sketch after this list)
  • Identify appropriate GPU memory subsystems for
    storage of data used by kernel
  • Are there trade-offs that can be made to exchange
    computation for more parallelism?
  • Brute force methods that expose significant
    parallelism do surprisingly well on current GPUs
  • Analyze the real-world use case for the problem
    and optimize the kernel for the problem
    size/characteristics that will be heavily used
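A minimal sketch (hypothetical kernel and names, not from the talk) of the fine-grained decomposition meant here: one thread per output element, with the grid sized to cover the whole problem.

// One thread per output element; grid size rounds up so every element is covered.
__global__ void scale_add(const float *a, const float *b, float *out,
                          float s, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  if (i < n)                                      // guard the partial tail block
    out[i] = s * a[i] + b[i];
}

// Host-side launch: tens of thousands of threads even for modest n,
// which is what keeps the GPU's many processing units busy.
void launch_scale_add(const float *a, const float *b, float *out,
                      float s, int n) {
  int block = 256;                                // threads per block
  int grid = (n + block - 1) / block;             // enough blocks to cover n
  scale_add<<<grid, block>>>(a, b, out, s, n);
}

Even a modest n of a few hundred thousand elements yields enough threads to occupy all multiprocessors and hide memory latency.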

13
NAMD Parallel Molecular Dynamics
Kale et al., J. Comp. Phys. 151:283-312, 1999.
  • Designed from the beginning as a parallel program
  • Uses the Charm++ philosophy
  • Decompose computation into a large number of
    objects
  • Intelligent run-time system (Charm++) assigns
    objects to processors for dynamic load balancing
    with minimal communication
  • Hybrid of spatial and force decomposition
  • Spatial decomposition of atoms into cubes (called
    patches; see the binning sketch after this list)
  • For every pair of interacting patches, create one
    object for calculating electrostatic interactions
  • Recent Blue Matter, Desmond, etc. use this idea
    in some form
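A toy sketch (illustrative C, not NAMD's code) of the spatial decomposition idea: atoms are binned into cubic patches whose side is at least the cutoff distance, so every interacting pair lies in the same patch or an adjacent one, and one compute object is created per pair of neighboring patches.

#include <math.h>

/* Illustrative only: map an atom position to integer patch coordinates.
   The patch side must be >= the nonbonded cutoff so that all atom pairs
   within the cutoff fall in the same or neighboring patches. */
typedef struct { float x, y, z; } vec3;

static void patch_index(vec3 pos, vec3 origin, float patchsize,
                        int *px, int *py, int *pz) {
  *px = (int) floorf((pos.x - origin.x) / patchsize);
  *py = (int) floorf((pos.y - origin.y) / patchsize);
  *pz = (int) floorf((pos.z - origin.z) / patchsize);
}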

14
NAMD Overlapping Execution
Phillips et al., SC2002.
[Figure: timeline of overlapping execution. Example configuration: 847
objects, 100,000 atoms on 108 processors. Objects are assigned to
processors and queued as data arrives; nonbonded force objects are
offloaded to the GPU.]
15
Non-bonded Interactions
  • Calculate forces for pairs of atoms within the
    cutoff distance (pair interaction shown below)

[Figure: cutoff sphere of the cutoff radius around atom i; r_ij = distance
between atom i and atom j]
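For reference (standard functional form; the slide itself does not spell it out), the nonbonded pair interaction evaluated inside the cutoff combines Lennard-Jones and Coulomb terms:

$$U(r_{ij}) = 4\,\varepsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right] + \frac{q_i q_j}{4\pi\epsilon_0\, r_{ij}}, \qquad r_{ij} < r_{\mathrm{cutoff}}$$

The CUDA kernel shown two slides ahead evaluates these terms through a force-table texture lookup, with combining rules applied via per-atom half_sigma and sqrt_epsilon values.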
16
Nonbonded Forces on G80 GPU
  • Start with the most expensive calculation: direct
    nonbonded interactions.
  • Decompose work into pairs of patches, identical
    to NAMD structure.
  • GPU hardware assigns patch-pairs to
    multiprocessors dynamically.

[Diagram: force computation on a single multiprocessor (GeForce 8800 GTX
has 16). 16kB shared memory: Patch A coordinates and parameters. Texture
unit: force table interpolation. 32-way SIMD multiprocessor: 32-256
multiplexed threads. Constants: exclusions. 32kB registers: Patch B
coordinates, parameters, forces. 64kB and 8kB caches. 768 MB main memory,
no cache, ~300 cycle latency.]
Stone et al., J. Comp. Chem. 28:2618-2640, 2007.
17
Nonbonded Forces CUDA Code
(annotated regions: force interpolation, exclusions, parameters, force
accumulation)

texture<float4> force_table;
__constant__ unsigned int exclusions[];
__shared__ atom jatom[];
atom iatom;     // per-thread atom, stored in registers
float4 iforce;  // per-thread force, stored in registers

for (int j = 0; j < jatom_count; j++) {
  float dx = jatom[j].x - iatom.x;
  float dy = jatom[j].y - iatom.y;
  float dz = jatom[j].z - iatom.z;
  float r2 = dx*dx + dy*dy + dz*dz;
  if (r2 < cutoff2) {
    float4 ft = texfetch(force_table, 1.f/sqrt(r2));   // force interpolation
    bool excluded = false;
    int indexdiff = iatom.index - jatom[j].index;
    if (abs(indexdiff) <= (int) jatom[j].excl_maxdiff) {         // exclusions
      indexdiff += jatom[j].excl_index;
      excluded = ((exclusions[indexdiff >> 5] & (1 << (indexdiff & 31))) != 0);
    }
    float f = iatom.half_sigma + jatom[j].half_sigma;  // sigma    (parameters)
    f *= f*f;                                          // sigma^3
    f *= f;                                            // sigma^6
    f *= (f*ft.x + ft.y);                              // sigma^12*fi.x - sigma^6*fi.y
    f *= iatom.sqrt_epsilon * jatom[j].sqrt_epsilon;
    float qq = iatom.charge * jatom[j].charge;
    if (excluded) { f = qq * ft.w; }                   // PME correction
    else          { f += qq * ft.z; }                  // Coulomb
    iforce.x += dx * f;                                // force accumulation
    iforce.y += dy * f;
    iforce.z += dz * f;
    iforce.w += 1.f;                                   // interaction count or energy
  }
}
Stone et al., J. Comp. Chem. 28:2618-2640, 2007.
18
NAMD Performance on NCSA GPU Cluster, April 2008
  • STMV virus (1M atoms)
  • 60 GPUs match performance of 330 CPU cores
  • 5.5-7x overall application speedup w/ G80-based
    GPUs
  • Overlap with CPU
  • Off-node results done first
  • Plans for better performance
  • Tune or port remaining work
  • Balance GPU load

[Chart: STMV performance on 2.4 GHz Opteron CPUs with Quadro FX 5600 GPUs;
values 25.7, 13.8, and 7.8, faster with increasing GPU acceleration.]
19
NAMD Performance on GT200 GPU Cluster, August
2008
  • 8 GT200s, 240 SPs @ 1.3GHz
  • 72x faster than a single CPU core
  • 9x overall application speedup vs. 8 CPU cores
  • 32% faster overall than 8 nodes of G80 cluster
  • GT200 CUDA kernel is 54% faster
  • 8% variation in GPU load
  • Cost of double-precision for force accumulation
    is minimal: only 8% slower than single-precision

20
VMD - Visual Molecular Dynamics
  • Visualization and analysis of molecular dynamics
    simulations, sequence data, volumetric data,
    quantum chemistry simulations, particle systems, ...
  • User extensible with scripting and plugins
  • http://www.ks.uiuc.edu/Research/vmd/

21
GPU Acceleration in VMD
  • Electrostatic field calculation, ion placement:
    20x to 44x faster
  • Molecular orbital calculation and display:
    120x faster
  • Imaging of gas migration pathways in proteins
    with implicit ligand sampling: 20x to 30x faster
22
Electrostatic Potential Maps
  • Electrostatic potentials evaluated on 3-D
    lattice
  • Applications include
  • Ion placement for structure building
  • Time-averaged potentials for simulation
  • Visualization and analysis

Isoleucine tRNA synthetase
23
Direct Coulomb Summation
  • Each lattice point accumulates electrostatic
    potential contribution from all atoms
  • potential[j] += charge[i] / r_ij, summed over all
    atoms i (kernel sketch below)

[Figure: lattice point j being evaluated; r_ij = distance from lattice
point j to atom i]
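A simplified direct Coulomb summation kernel sketch: one thread per lattice point, each summing charge[i] / r_ij over all atoms. The kernel name, the float4 atom layout (x, y, z, charge), and the one-z-slice-per-launch simplification are illustrative assumptions, not the tuned VMD kernel.

// One thread per lattice point; launched with a 2-D grid covering a
// dims.x by dims.y slice of the potential lattice at height zplane.
__global__ void direct_coulomb(const float4 *atoms, int numatoms,
                               float *potential, int3 dims,
                               float spacing, float zplane) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x >= dims.x || y >= dims.y) return;
  float px = x * spacing;
  float py = y * spacing;
  float sum = 0.f;
  for (int i = 0; i < numatoms; i++) {          // every atom contributes
    float dx = atoms[i].x - px;
    float dy = atoms[i].y - py;
    float dz = atoms[i].z - zplane;
    sum += atoms[i].w * rsqrtf(dx*dx + dy*dy + dz*dz);  // charge / r_ij
  }
  potential[y * dims.x + x] += sum;             // accumulate into the lattice
}

The brute-force structure, in which every lattice point visits every atom, is exactly the kind of abundant, regular parallelism that maps well onto the GPU, and rsqrtf() is one of the fast native instructions mentioned earlier.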
24
Photobiology of Vision and Photosynthesis
Investigations of the chromatophore, a photosynthetic organelle
  • Partial model: 10M atoms
  • Electrostatic field of the chromatophore model,
    computed with the multilevel summation method on 3
    GPUs (G80) in 90 seconds, 46x faster than a single
    CPU core
  • Electrostatics needed to build the full structural
    model, place ions, and study macroscopic properties
  • The full chromatophore model will permit structural,
    chemical and kinetic investigations at a
    structural systems biology level
25
Lessons Learned
  • GPU algorithms need fine-grained parallelism and
    sufficient work to fully utilize the hardware
  • Much of per-thread GPU algorithm optimization
    revolves around efficient use of multiple memory
    systems and latency hiding
  • Concurrency can often be traded for per-thread
    performance, in combination with increased use of
    registers or shared memory
  • Fine-grained GPU work decompositions often
    compose well with the comparatively
    coarse-grained decompositions used for multicore
    or distributed memory programming

26
Lessons Learned (2)
  • The host CPU can potentially be used to
    regularize the computation for the GPU,
    yielding better overall performance
  • Overlapping CPU work with GPU can hide some
    communication and unaccelerated computation
  • Targeted use of double-precision floating point
    arithmetic or compensated summation can reduce
    the effects of floating point truncation at low
    cost to performance (see the sketch below)
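A minimal sketch of compensated (Kahan) summation as a general technique (illustrative CUDA C, not NAMD's actual force accumulation code):

// Kahan compensated summation: carries a running error term so that many
// small single-precision contributions are not lost when added to a large
// accumulator. Usable per-thread inside a kernel or on the host.
__host__ __device__ inline void kahan_add(float value, float *sum, float *comp) {
  float y = value - *comp;       // subtract the previously lost low-order bits
  float t = *sum + y;            // add; low-order bits of y may be lost here
  *comp = (t - *sum) - y;        // recover what was lost in this step
  *sum = t;                      // new running sum
}

// Example: accumulate n force contributions into a single float.
__host__ __device__ inline float sum_forces(const float *f, int n) {
  float sum = 0.f, comp = 0.f;
  for (int i = 0; i < n; i++)
    kahan_add(f[i], &sum, &comp);
  return sum;
}

Note that aggressive floating point optimization (fast-math style reassociation) can eliminate the compensation term, so code like this is normally compiled with strict floating point semantics.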

27
Acknowledgements
  • Theoretical and Computational Biophysics Group,
    University of Illinois at Urbana-Champaign
  • Wen-mei Hwu and the IMPACT group at University of
    Illinois at Urbana-Champaign
  • NVIDIA Center of Excellence, University of
    Illinois at Urbana-Champaign
  • NCSA Innovative Systems Lab
  • David Kirk and the CUDA team at NVIDIA
  • NIH support P41-RR05969