Title: Brent Oster
Slide 1: Brent Oster
- Associate Director, Allosphere, CNSI
- PhD studies in Computational Nanotechnology
- nVidia G80 GPU / CUDA
Slide 2: Previously Technical Director for Video Games
(BioWare, EA, Alias|Wavefront, LucasArts)
Slide 4: The Modern GPU Is More General-Purpose, with Lots of ALUs
Slide 5: The nVidia G80 GPU
- 128 streaming floating-point processors at 1.5 GHz
- 1.5 GB shared RAM with 86 GB/s bandwidth
- 500 Gflops on one chip (single precision)
Slide 6: What Has Driven the Evolution of These Chips?
- Males age 15-35 buy $10B in video games / year
- Crysis demo
Slide 7: Are GPUs Useful for Scientific Computing?
- Electronic structure (DFT)
- Finite element modeling
- Molecular dynamics / Monte Carlo
Slide 8: nVidia G80 GPU Architecture Overview
- 16 multiprocessor blocks
- Each MP block has:
  - 8 streaming processors (IEEE 754 single-precision floating-point compliant)
  - 16 KB shared memory
  - 64 KB constant cache
  - 8 KB texture cache
- Each processor can access all of the memory at 86 GB/s, but with different latencies:
  - Shared memory: 2-cycle latency
  - Device memory: 300-cycle latency
Slide 9: Programming Interface
- Interface to the GPU is via nVidia's proprietary API, CUDA (very C-like)
- Looks a lot like UPC (simplified CUDA below):

    void AddVectors(float *r, float *a, float *b)
    {
        int tx = threadIdx.x;   // processor rank
        r[tx] = a[tx] + b[tx];  // executed in parallel
    }
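The simplified kernel above computes one element per thread. A plain C loop that produces the same result serially (a hypothetical CPU reference, not from the slides; `add_vectors_serial` is an illustrative name) makes the mapping explicit:

```c
#include <stddef.h>

/* Serial CPU equivalent of the simplified AddVectors kernel:
   the loop index tx plays the role of threadIdx.x, and each
   iteration corresponds to one CUDA thread running in parallel. */
void add_vectors_serial(float *r, const float *a, const float *b, size_t n)
{
    for (size_t tx = 0; tx < n; tx++)
        r[tx] = a[tx] + b[tx];
}
```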
Slide 10: Actual CUDA Code

    #define MAX_THREADS 512

    extern "C" void AddVectors(float *r, float *a, float *b, int n)
    {
        int nThreads = MAX_THREADS / 2;
        int nBlocks  = n / nThreads;
        AddVectorsKernel<<<nBlocks, nThreads>>>(r, a, b, n);
    }

    __global__ void AddVectorsKernel(float *r, float *a, float *b, int n)
    {
        int tx = threadIdx.x;
        int bx = blockIdx.x;
        int i  = tx + bx * MAX_THREADS;
        r[i] = a[i] + b[i];
    }

    // This would be extremely slow and inefficient code (more later)
Slide 11: Still a Specialized Processor
- Very efficient for:
  - Fast parallel floating-point processing
  - Single-instruction, multiple-data (SIMD) operations
  - High computation per memory access
- Not as efficient for:
  - Double precision (need to test performance)
  - Logical operations on integer data
  - Branching-intensive operations
  - Random-access, memory-intensive operations
Slide 12:

    __global__ void NxNGenericOp_Kernel(float *r, float *a, float *b, int n)
    {
        // r[i] = SUM_j(a[i] * b[j])
        __shared__ float r_sh[MAX_THREADS];  // allocate in fast 16 KB shared memory
        __shared__ float a_sh[MAX_THREADS];
        __shared__ float b_sh[MAX_THREADS];

        int tx = threadIdx.x;            // rank of processor
        int bx = blockIdx.x;             // rank of multiprocessor block
        int i  = tx + bx * MAX_THREADS;  // compute index from tx, bx

        a_sh[tx] = a[i];                 // each thread loads a value for a_sh
        r_sh[tx] = 0;                    // each thread zeros a value for r_sh
        __syncthreads();                 // sync till all threads reach this point

        for (int J = 0; J < n; J += MAX_THREADS)  // loop over blocks in b
        {
            b_sh[tx] = b[J + tx];        // each thread loads a value for b_sh
            __syncthreads();             // sync
            for (int j = 0; j < MAX_THREADS; j++)  // for each b_sh
                r_sh[tx] += a_sh[tx] * b_sh[j];    // product a_sh*b_sh, add to r_sh
            __syncthreads();             // sync before the next tile is loaded
        }
        r[i] = r_sh[tx];                 // write the accumulated result back
    }
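The kernel computes r[i] = a[i] * SUM_j(b[j]). A serial C reference with the same tiled loop structure (a hypothetical sketch, not from the slides; `nxn_generic_op_ref` and the small tile size are illustrative) shows what the shared-memory tiling is doing:

```c
#define TILE 4  /* stands in for MAX_THREADS; kept small for illustration */

/* Serial reference for the shared-memory kernel: r[i] = a[i] * sum_j b[j].
   The outer J loop mirrors loading one tile of b into b_sh; the inner j
   loop mirrors each thread accumulating over that tile. Assumes n is a
   multiple of the tile size, as the kernel does. */
void nxn_generic_op_ref(float *r, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++) {
        float acc = 0.0f;
        for (int J = 0; J < n; J += TILE)    /* loop over tiles of b */
            for (int j = 0; j < TILE; j++)   /* accumulate over one tile */
                acc += a[i] * b[J + j];
        r[i] = acc;
    }
}
```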
Slide 13: Making Optimal Use of 16 KB Shared Memory
- The 16 KB of shared memory is allocated in 16 banks
- Array data is allocated across banks: B[0] -> bank 0, B[1] -> bank 1, ..., B[n] -> bank mod(n, nBanks)
- No bank conflicts if each thread indexes a different bank
- Bank conflicts if threads access the same bank (results in a data stall)
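The bank rule above can be modeled in a few lines of C (hypothetical helpers; 16 banks and one 32-bit word per thread assumed, per the slide):

```c
#include <stdbool.h>

#define N_BANKS 16

/* Shared-memory bank that a 32-bit word at a given array index maps to. */
int bank_of(int index) { return index % N_BANKS; }

/* True if two of nThreads threads hit the same bank when thread t
   accesses index t * stride (a common shared-memory access pattern). */
bool has_bank_conflict(int nThreads, int stride)
{
    bool used[N_BANKS] = { false };
    for (int t = 0; t < nThreads; t++) {
        int b = bank_of(t * stride);
        if (used[b])
            return true;   /* two threads in one bank: data stall */
        used[b] = true;
    }
    return false;
}
```

With stride 1, 16 threads touch 16 distinct banks; with stride 2, threads 0 and 8 both land in bank 0 and their accesses serialize.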
Slide 14: More Detail on GPU Architecture
Slide 16: Exploiting the Texture Samplers
- Designed to map textures onto 3D polygons
- Specialty hardware pipelines for:
  - Fast data sampling from 1D, 2D, 3D arrays
  - Swizzling of 2D, 3D data for optimal access
  - Bilinear filtering in zero cycles
  - Image compositing / blending operations
- Arrays indexed by (u, v, w) coordinates: easy to program
- Extremely well suited to multigrid finite-difference methods (example later)
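What the sampler's zero-cycle bilinear filtering computes can be written out in software (a hypothetical C version; row-major layout and edge clamping are assumptions):

```c
/* Bilinear filtering of a row-major 2D array at fractional coordinates
   (u, v), as the texture sampler performs in hardware. Valid range:
   0 <= u <= w-1, 0 <= v <= h-1; reads past the edge are clamped. */
float bilinear_sample(const float *img, int w, int h, float u, float v)
{
    int   x0 = (int)u,  y0 = (int)v;
    int   x1 = x0 + 1 < w ? x0 + 1 : x0;   /* clamp at the right edge */
    int   y1 = y0 + 1 < h ? y0 + 1 : y0;   /* clamp at the bottom edge */
    float fx = u - x0,  fy = v - y0;

    /* Blend the four neighboring texels by the fractional offsets. */
    float top = img[y0 * w + x0] * (1 - fx) + img[y0 * w + x1] * fx;
    float bot = img[y1 * w + x0] * (1 - fx) + img[y1 * w + x1] * fx;
    return top * (1 - fy) + bot * fy;
}
```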
Slide 17: Experiments in Computational Nanotech on the GPU
- Electronic structure (DFT)
- Finite element modeling
- Molecular dynamics / Monte Carlo
Slide 18: HP XW9400 with Quad AMD CPU and Dual nVidia Quadro 5600 GPUs: A Teraflop Workstation?
Slide 19: Molecular Dynamics Trial
- Lennard-Jones inter-atomic potential
- Verlet integration
- Normalized coordinates
- FCC lattice in an NxNxN simulation cell
- Periodic boundary conditions
- Trials with Rc = 8 and Rc = 3.0
- Tested an nVidia 8800 GPU vs. a 3.0 GHz Intel P4
- OpenGL used to implement MD on the GPU
Slide 20: MD Timing Tests (NxN brute force)
Slide 21: MD Timing Results (bins, Rc = 9 Å)
Slide 22: Hardware-Accelerated DFT Test
- Real-space grid method (Beck, Bryant, ...)
- LDA, localized basis functions
- Iterative solution of the Kohn-Sham equations
- Finite-difference methods
- Multigrid with FMG-FAS
- Weighted Jacobi relaxation
- Mehrstellen discretization
- Gram-Schmidt orthogonalization on the low-res grid
- 64x64x64 grid x 4 orbitals
- 8 H nuclei, 8 electrons
- >1M data elements
- 81 ms computation time!
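The weighted Jacobi smoother in the list above can be sketched for the simplest case, a 1D Poisson equation (a hypothetical illustration, not the 3D code from the slide; Dirichlet endpoints and the common damping factor omega = 2/3 are assumptions):

```c
/* One weighted-Jacobi sweep for the 1D Poisson equation -u'' = f on a
   grid of spacing h with fixed (Dirichlet) endpoints. The plain Jacobi
   update is damped by omega; omega = 2/3 is the usual choice when Jacobi
   is used as a multigrid smoother. tmp is an n-element workspace. */
void weighted_jacobi_sweep(double *u, const double *f, double *tmp,
                           int n, double h, double omega)
{
    for (int i = 1; i < n - 1; i++)
        tmp[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i]);  /* plain Jacobi */
    for (int i = 1; i < n - 1; i++)
        u[i] = (1.0 - omega) * u[i] + omega * tmp[i];         /* damped update */
}
```

On its own the sweep converges slowly; multigrid with FMG-FAS uses it only to damp high-frequency error on each grid level, which is why the combination is fast.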
Slide 23: Where Next? G90 Double-Precision GPU in Spring 2008
- nVidia Quadro PC workstation: 4 Teraflops, $15,000
- G90 GPU, double precision: 1 Teraflop, $2,500
- nVidia QuadroPlex cluster, 16 PC nodes: 64 Teraflops, $300,000
Slide 24: NanoCAD in the Allosphere, California NanoSystems Institute
Slide 25: How to Find Out More
- Download CUDA and docs from nVidia:
  http://developer.nvidia.com/object/cuda.html
- Buy a $600 nVidia GeForce 8800 GTX
  - Get one free through their developer program (talk to me after class)
- CUDA programming course through CS
  - Fall '07 or Winter '08
  - Tobias Hollerer / myself
- NanoCAD collaborative development: www.powerofminus9.net