Transcript and Presenter's Notes

Title: Accelerating Molecular Dynamics on a GPU


1
Accelerating Molecular Dynamics on a GPU
  • John Stone
  • Theoretical and Computational Biophysics Group
  • Beckman Institute for Advanced Science and
    Technology
  • University of Illinois at Urbana-Champaign
  • http://www.ks.uiuc.edu/Research/gpu/
  • Careers in High-Performance Systems (CHiPS)
    Workshop
  • National Center for Supercomputing Applications,
    July 25, 2009

2
Computational Biology's Insatiable Demand for Processing Power
  • Simulations still fall short of biological
    timescales
  • Large simulations extremely difficult to prepare,
    analyze
  • Order of magnitude increase in performance would
    allow use of more sophisticated models

Satellite Tobacco Mosaic Virus (STMV)
3
Programmable Graphics Hardware
  • Groundbreaking research systems
  • AT&T Pixel Machine (1989)
  • 82 x DSP32 processors
  • UNC PixelFlow (1992-98)
  • 64 x (PA-8000 + 8,192 bit-serial SIMD)
  • SGI RealityEngine (1990s)
  • Up to 12 i860-XP processors perform vertex
    operations (ucode), fixed-func. fragment hardware
  • All mainstream GPUs now incorporate fully
    programmable processors

UNC PixelFlow Rack
SGI Reality Engine i860 Vertex Processors
4
GLSL Sphere Fragment Shader
  • Written in OpenGL Shading Language
  • High-level C-like language with vector types and
    operations
  • Compiled dynamically by the graphics driver at
    runtime (host-side compile/link sketch below)
  • Compiled machine code executes on GPU
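To make the "compiled dynamically by the driver" step concrete, here is a minimal host-side C sketch (not code from the presentation) that hands a GLSL fragment shader source string to the driver, which compiles it to GPU machine code at runtime. The function name and error handling are illustrative; an OpenGL 2.0+ context is assumed to be current, with entry points loaded (e.g. via GLEW).

/* Minimal sketch: compile and link a GLSL fragment shader at runtime. */
#include <GL/glew.h>
#include <stdio.h>

GLuint build_fragment_program(const char *fragsrc) {
  GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
  glShaderSource(shader, 1, &fragsrc, NULL);   /* hand source text to the driver   */
  glCompileShader(shader);                     /* driver compiles to GPU code here */

  GLint ok = 0;
  glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);
  if (!ok) {
    char log[1024];
    glGetShaderInfoLog(shader, sizeof(log), NULL, log);
    fprintf(stderr, "shader compile failed: %s\n", log);
    return 0;
  }

  GLuint prog = glCreateProgram();
  glAttachShader(prog, shader);
  glLinkProgram(prog);                         /* link into an executable GPU program */
  return prog;                                 /* later bound with glUseProgram(prog) */
}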

5
GPU Computing
  • Commodity devices, omnipresent in modern
    computers (over a million sold per week)
  • Massively parallel hardware, hundreds of
    processing units, throughput oriented
    architecture
  • Standard integer and floating point types
    supported
  • Programming tools allow software to be written in
    dialects of familiar C/C++ and integrated into
    legacy software
  • GPU algorithms are often multicore friendly due
    to attention paid to data locality and
    data-parallel work decomposition

6
What Speedups Can GPUs Achieve?
  • Single-GPU speedups of 10x to 30x vs. one CPU
    core are common
  • Best speedups can reach 100x or more, attained on
    codes dominated by floating point arithmetic,
    especially native GPU machine instructions, e.g.
    expf(), rsqrtf(), etc.
  • Amdahl's Law can prevent legacy codes from
    achieving peak speedups with shallow GPU
    acceleration efforts (worked example below)
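As a worked example of Amdahl's Law (illustrative numbers, not figures from the talk): if a fraction $p$ of the runtime is accelerated by a factor $s$, the overall speedup is

$S = \dfrac{1}{(1-p) + p/s}$, so with $p = 0.8$ and $s = 30$: $S = \dfrac{1}{0.2 + 0.8/30} \approx 4.4$

which is why shallow acceleration of a legacy code falls far short of the kernel-level speedup.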

7
GPU Peak Single-Precision Performance: Exponential Trend
8
GPU Peak Memory Bandwidth: Linear Trend
9
Comparison of CPU and GPU Hardware Architecture
CPU: cache heavy, focused on individual thread performance
GPU: ALU heavy, massively parallel, throughput oriented
10
NVIDIA GT200
[Diagram: GT200 hardware/software hierarchy. The streaming processor array
(the full GPU) executes a grid of thread blocks; each streaming
multiprocessor, grouped into texture processor clusters, runs multiple
thread blocks as many warps of threads; within a streaming multiprocessor,
8 SPs and 2 SFUs execute individual threads, with a texture unit serving
the cluster.]
11
GPU Memory Accessible in CUDA
  • Mapped host memory: up to 4GB, 5.7GB/sec
    bandwidth (PCIe), accessible by multiple GPUs
  • Global memory: up to 4GB, high latency (600
    clock cycles), 140GB/sec bandwidth, accessible by
    all threads, atomic operations (slow)
  • Texture memory: read-only, cached, and
    interpolated/filtered access to global memory
  • Constant memory: 64KB, read-only, cached,
    fast/low-latency if data elements are accessed in
    unison by peer threads
  • Shared memory: 16KB, low-latency, accessible among
    threads in the same block, fast if accessed
    without bank conflicts (declaration sketch below)
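A minimal CUDA C sketch (illustrative kernel and names, not code from the talk) showing how these memory spaces are declared and accessed. Mapped host memory would additionally require cudaHostAlloc() with the cudaHostAllocMapped flag on the host, which is not shown.

__constant__ float4 const_params[1024];                 // constant memory: small, cached, read-only
texture<float4, 1, cudaReadModeElementType> tex_table;  // texture reference: cached reads
                                                        // (bound with cudaBindTexture() on the host)

// Assumes blockDim.x <= 256 and that tex_table has been bound before launch.
__global__ void memory_spaces_demo(const float4 *gmem_in, float *gmem_out, int n) {
  __shared__ float tile[256];                      // shared memory: per-block, low latency
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // per-thread index lives in registers
  if (i >= n) return;
  float4 v = gmem_in[i];                           // global memory: high latency, high bandwidth
  float4 p = const_params[threadIdx.x % 1024];     // constant memory access
  float4 t = tex1Dfetch(tex_table, i);             // cached texture fetch
  tile[threadIdx.x] = v.x + p.x + t.x;             // stage a value in shared memory
  __syncthreads();
  gmem_out[i] = tile[threadIdx.x];                 // write result back to global memory
}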

12
An Approach to Writing CUDA Kernels
  • Find an algorithm that exposes substantial
    parallelism, thousands of independent threads
    (see the launch sketch after this list)
  • Identify appropriate GPU memory subsystems for
    storage of data used by kernel
  • Are there trade-offs that can be made to exchange
    computation for more parallelism?
  • Brute force methods that expose significant
    parallelism do surprisingly well on current GPUs
  • Analyze the real-world use case for the problem
    and optimize the kernel for the problem
    size/characteristics that will be heavily used
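A minimal sketch (hypothetical kernel and names, not from the talk) of the fine-grained decomposition meant here: one thread per output element, with the grid sized to cover the whole problem.

// One thread per output element; grid size rounds up so every element is covered.
__global__ void scale_add(const float *a, const float *b, float *out,
                          float s, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  if (i < n)                                      // guard the partial tail block
    out[i] = s * a[i] + b[i];
}

// Host-side launch: tens of thousands of threads even for modest n,
// which is what keeps the GPU's many processing units busy.
void launch_scale_add(const float *a, const float *b, float *out,
                      float s, int n) {
  int block = 256;                                // threads per block
  int grid = (n + block - 1) / block;             // enough blocks to cover n
  scale_add<<<grid, block>>>(a, b, out, s, n);
}

Even a modest n of a few hundred thousand elements yields enough threads to occupy all multiprocessors and hide memory latency.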

13
NAMD Parallel Molecular Dynamics
Kale et al., J. Comp. Phys. 151:283-312, 1999.
  • Designed from the beginning as a parallel program
  • Uses the Charm++ philosophy
  • Decompose computation into a large number of
    objects
  • Intelligent run-time system (Charm++) assigns
    objects to processors for dynamic load balancing
    with minimal communication
  • Hybrid of spatial and force decomposition
  • Spatial decomposition of atoms into cubes (called
    patches; see the binning sketch after this list)
  • For every pair of interacting patches, create one
    object for calculating electrostatic interactions
  • Recent Blue Matter, Desmond, etc. use this idea
    in some form
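A toy sketch (illustrative C, not NAMD's code) of the spatial decomposition idea: atoms are binned into cubic patches whose side is at least the cutoff distance, so every interacting pair lies in the same patch or an adjacent one, and one compute object is created per pair of neighboring patches.

#include <math.h>

/* Illustrative only: map an atom position to integer patch coordinates.
   The patch side must be >= the nonbonded cutoff so that all atom pairs
   within the cutoff fall in the same or neighboring patches. */
typedef struct { float x, y, z; } vec3;

static void patch_index(vec3 pos, vec3 origin, float patchsize,
                        int *px, int *py, int *pz) {
  *px = (int) floorf((pos.x - origin.x) / patchsize);
  *py = (int) floorf((pos.y - origin.y) / patchsize);
  *pz = (int) floorf((pos.z - origin.z) / patchsize);
}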

14
NAMD Overlapping Execution
Phillips et al., SC2002.
[Figure: timeline of overlapping execution. Example configuration: 847
objects, 100,000 atoms on 108 processors. Objects are assigned to
processors and queued as data arrives; nonbonded force objects are
offloaded to the GPU.]
15
Non-bonded Interactions
  • Calculate forces for pairs of atoms within the
    cutoff distance (pair interaction shown below)

[Figure: cutoff sphere of the cutoff radius around atom i; r_ij = distance
between atom i and atom j]
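For reference (standard functional form; the slide itself does not spell it out), the nonbonded pair interaction evaluated inside the cutoff combines Lennard-Jones and Coulomb terms:

$$U(r_{ij}) = 4\,\varepsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right] + \frac{q_i q_j}{4\pi\epsilon_0\, r_{ij}}, \qquad r_{ij} < r_{\mathrm{cutoff}}$$

The CUDA kernel shown two slides ahead evaluates these terms through a force-table texture lookup, with combining rules applied via per-atom half_sigma and sqrt_epsilon values.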
16
Nonbonded Forces on G80 GPU
  • Start with the most expensive calculation: direct
    nonbonded interactions.
  • Decompose work into pairs of patches, identical
    to NAMD structure.
  • GPU hardware assigns patch-pairs to
    multiprocessors dynamically.

[Diagram: force computation on a single multiprocessor (GeForce 8800 GTX
has 16). 16kB shared memory: Patch A coordinates and parameters. Texture
unit: force table interpolation. 32-way SIMD multiprocessor: 32-256
multiplexed threads. Constants: exclusions. 32kB registers: Patch B
coordinates, parameters, forces. 64kB and 8kB caches. 768 MB main memory,
no cache, ~300 cycle latency.]
Stone et al., J. Comp. Chem. 28:2618-2640, 2007.
17
Nonbonded Forces CUDA Code
(annotated regions: force interpolation, exclusions, parameters, force
accumulation)

texture<float4> force_table;
__constant__ unsigned int exclusions[];
__shared__ atom jatom[];
atom iatom;     // per-thread atom, stored in registers
float4 iforce;  // per-thread force, stored in registers

for (int j = 0; j < jatom_count; j++) {
  float dx = jatom[j].x - iatom.x;
  float dy = jatom[j].y - iatom.y;
  float dz = jatom[j].z - iatom.z;
  float r2 = dx*dx + dy*dy + dz*dz;
  if (r2 < cutoff2) {
    float4 ft = texfetch(force_table, 1.f/sqrt(r2));   // force interpolation
    bool excluded = false;
    int indexdiff = iatom.index - jatom[j].index;
    if (abs(indexdiff) <= (int) jatom[j].excl_maxdiff) {         // exclusions
      indexdiff += jatom[j].excl_index;
      excluded = ((exclusions[indexdiff >> 5] & (1 << (indexdiff & 31))) != 0);
    }
    float f = iatom.half_sigma + jatom[j].half_sigma;  // sigma    (parameters)
    f *= f*f;                                          // sigma^3
    f *= f;                                            // sigma^6
    f *= (f*ft.x + ft.y);                              // sigma^12*fi.x - sigma^6*fi.y
    f *= iatom.sqrt_epsilon * jatom[j].sqrt_epsilon;
    float qq = iatom.charge * jatom[j].charge;
    if (excluded) { f = qq * ft.w; }                   // PME correction
    else          { f += qq * ft.z; }                  // Coulomb
    iforce.x += dx * f;                                // force accumulation
    iforce.y += dy * f;
    iforce.z += dz * f;
    iforce.w += 1.f;                                   // interaction count or energy
  }
}
Stone et al., J. Comp. Chem. 28:2618-2640, 2007.
18
NAMD Performance on NCSA GPU Cluster, April 2008
  • STMV virus (1M atoms)
  • 60 GPUs match performance of 330 CPU cores
  • 5.5-7x overall application speedup w/ G80-based
    GPUs
  • Overlap with CPU
  • Off-node results done first
  • Plans for better performance
  • Tune or port remaining work
  • Balance GPU load

[Chart: STMV performance on 2.4 GHz Opteron CPUs with Quadro FX 5600 GPUs;
values 25.7, 13.8, and 7.8, faster with increasing GPU acceleration.]
19
NAMD Performance on GT200 GPU Cluster, August
2008
  • 8 GT200s, 240 SPs @ 1.3GHz
  • 72x faster than a single CPU core
  • 9x overall application speedup vs. 8 CPU cores
  • 32% faster overall than 8 nodes of G80 cluster
  • GT200 CUDA kernel is 54% faster
  • 8% variation in GPU load
  • Cost of double-precision for force accumulation
    is minimal: only 8% slower than single-precision

20
VMD - Visual Molecular Dynamics
  • Visualization and analysis of molecular dynamics
    simulations, sequence data, volumetric data,
    quantum chemistry simulations, particle systems, ...
  • User extensible with scripting and plugins
  • http://www.ks.uiuc.edu/Research/vmd/

21
GPU Acceleration in VMD
  • Electrostatic field calculation, ion placement:
    20x to 44x faster
  • Molecular orbital calculation and display:
    120x faster
  • Imaging of gas migration pathways in proteins
    with implicit ligand sampling: 20x to 30x faster
22
Electrostatic Potential Maps
  • Electrostatic potentials evaluated on 3-D
    lattice
  • Applications include
  • Ion placement for structure building
  • Time-averaged potentials for simulation
  • Visualization and analysis

Isoleucine tRNA synthetase
23
Direct Coulomb Summation
  • Each lattice point accumulates electrostatic
    potential contribution from all atoms
  • potential[j] += charge[i] / r_ij, summed over all
    atoms i (kernel sketch below)

[Figure: lattice point j being evaluated; r_ij = distance from lattice
point j to atom i]
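A simplified direct Coulomb summation kernel sketch: one thread per lattice point, each summing charge[i] / r_ij over all atoms. The kernel name, the float4 atom layout (x, y, z, charge), and the one-z-slice-per-launch simplification are illustrative assumptions, not the tuned VMD kernel.

// One thread per lattice point; launched with a 2-D grid covering a
// dims.x by dims.y slice of the potential lattice at height zplane.
__global__ void direct_coulomb(const float4 *atoms, int numatoms,
                               float *potential, int3 dims,
                               float spacing, float zplane) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x >= dims.x || y >= dims.y) return;
  float px = x * spacing;
  float py = y * spacing;
  float sum = 0.f;
  for (int i = 0; i < numatoms; i++) {          // every atom contributes
    float dx = atoms[i].x - px;
    float dy = atoms[i].y - py;
    float dz = atoms[i].z - zplane;
    sum += atoms[i].w * rsqrtf(dx*dx + dy*dy + dz*dz);  // charge / r_ij
  }
  potential[y * dims.x + x] += sum;             // accumulate into the lattice
}

The brute-force structure, in which every lattice point visits every atom, is exactly the kind of abundant, regular parallelism that maps well onto the GPU, and rsqrtf() is one of the fast native instructions mentioned earlier.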
24
Photobiology of Vision and Photosynthesis
Investigations of the chromatophore, a photosynthetic organelle
  • Partial model: 10M atoms
  • Electrostatic field of the chromatophore model,
    computed with the multilevel summation method on 3
    GPUs (G80) in 90 seconds, 46x faster than a single
    CPU core
  • Electrostatics needed to build the full structural
    model, place ions, and study macroscopic properties
  • The full chromatophore model will permit structural,
    chemical and kinetic investigations at a
    structural systems biology level
25
Lessons Learned
  • GPU algorithms need fine-grained parallelism and
    sufficient work to fully utilize the hardware
  • Much of per-thread GPU algorithm optimization
    revolves around efficient use of multiple memory
    systems and latency hiding
  • Concurrency can often be traded for per-thread
    performance, in combination with increased use of
    registers or shared memory
  • Fine-grained GPU work decompositions often
    compose well with the comparatively
    coarse-grained decompositions used for multicore
    or distributed memory programming

26
Lessons Learned (2)
  • The host CPU can potentially be used to
    regularize the computation for the GPU,
    yielding better overall performance
  • Overlapping CPU work with GPU can hide some
    communication and unaccelerated computation
  • Targeted use of double-precision floating point
    arithmetic or compensated summation can reduce
    the effects of floating point truncation at low
    cost to performance (see the sketch below)
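A minimal sketch of compensated (Kahan) summation as a general technique (illustrative CUDA C, not NAMD's actual force accumulation code):

// Kahan compensated summation: carries a running error term so that many
// small single-precision contributions are not lost when added to a large
// accumulator. Usable per-thread inside a kernel or on the host.
__host__ __device__ inline void kahan_add(float value, float *sum, float *comp) {
  float y = value - *comp;       // subtract the previously lost low-order bits
  float t = *sum + y;            // add; low-order bits of y may be lost here
  *comp = (t - *sum) - y;        // recover what was lost in this step
  *sum = t;                      // new running sum
}

// Example: accumulate n force contributions into a single float.
__host__ __device__ inline float sum_forces(const float *f, int n) {
  float sum = 0.f, comp = 0.f;
  for (int i = 0; i < n; i++)
    kahan_add(f[i], &sum, &comp);
  return sum;
}

Note that aggressive floating point optimization (fast-math style reassociation) can eliminate the compensation term, so code like this is normally compiled with strict floating point semantics.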

27
Acknowledgements
  • Theoretical and Computational Biophysics Group,
    University of Illinois at Urbana-Champaign
  • Wen-mei Hwu and the IMPACT group at University of
    Illinois at Urbana-Champaign
  • NVIDIA Center of Excellence, University of
    Illinois at Urbana-Champaign
  • NCSA Innovative Systems Lab
  • David Kirk and the CUDA team at NVIDIA
  • NIH support P41-RR05969