Title: Brent Oster
Slide 1: Brent Oster
- Associate Director, Allosphere, CNSI
- PhD studies in Computational Nanotechnology
- nVidia G80 GPU / CUDA
Slide 2: Previously Technical Director for Video Games
(BioWare, EA, Alias|Wavefront, LucasArts)
Slide 4: The Modern GPU Is More General-Purpose, with Lots of ALUs
Slide 5: The nVidia G80 GPU
- 128 streaming floating-point processors at 1.5 GHz
- 1.5 GB shared RAM with 86 GB/s bandwidth
- 500 Gflops on one chip (single precision)
Slide 6: What Has Driven the Evolution of These Chips?
- Males age 15-35 buy $10B in video games / year
- Crysis demo
Slide 7: Are GPUs Useful for Scientific Computing?
- Electronic structure (DFT)
- Finite element modeling
- Molecular dynamics / Monte Carlo
Slide 8: nVidia G80 GPU Architecture Overview
- 16 multiprocessor blocks
- Each MP block has:
  - 8 streaming processors (IEEE 754 single-precision floating-point compliant)
  - 16 KB shared memory
  - 64 KB constant cache
  - 8 KB texture cache
- Each processor can access all of the memory at 86 GB/s, but with different latencies:
  - Shared memory: 2-cycle latency
  - Device memory: 300-cycle latency
Slide 9: Programming Interface
- Interface to the GPU is via nVidia's proprietary API, CUDA (very C-like)
- Looks a lot like UPC (simplified CUDA below):

    void AddVectors(float *r, float *a, float *b)
    {
        int tx = threadIdx.x;   // processor rank
        r[tx] = a[tx] + b[tx];  // executed in parallel
    }
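The simplified kernel above computes one element per thread. A plain C loop that produces the same result serially (a hypothetical CPU reference, not from the slides; `add_vectors_serial` is an illustrative name) makes the mapping explicit:

```c
#include <stddef.h>

/* Serial CPU equivalent of the simplified AddVectors kernel:
   the loop index tx plays the role of threadIdx.x, and each
   iteration corresponds to one CUDA thread running in parallel. */
void add_vectors_serial(float *r, const float *a, const float *b, size_t n)
{
    for (size_t tx = 0; tx < n; tx++)
        r[tx] = a[tx] + b[tx];
}
```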
Slide 10: Actual CUDA Code

    #define MAX_THREADS 512

    extern "C" void AddVectors(float *r, float *a, float *b, int n)
    {
        int nThreads = MAX_THREADS / 2;
        int nBlocks  = n / nThreads;
        AddVectorsKernel<<<nBlocks, nThreads>>>(r, a, b, n);
    }

    __global__ void AddVectorsKernel(float *r, float *a, float *b, int n)
    {
        int tx = threadIdx.x;
        int bx = blockIdx.x;
        int i  = tx + bx * MAX_THREADS;
        r[i] = a[i] + b[i];
    }

    // This would be extremely slow and inefficient code (more later)
Slide 11: Still a Specialized Processor
- Very efficient for:
  - Fast parallel floating-point processing
  - Single-instruction, multiple-data (SIMD) operations
  - High computation per memory access
- Not as efficient for:
  - Double precision (need to test performance)
  - Logical operations on integer data
  - Branching-intensive operations
  - Random-access, memory-intensive operations
Slide 12:

    __global__ void NxNGenericOp_Kernel(float *r, float *a, float *b, int n)
    {
        // r[i] = SUM_j(a[i] * b[j])
        __shared__ float r_sh[MAX_THREADS];  // allocate in fast 16 KB shared memory
        __shared__ float a_sh[MAX_THREADS];
        __shared__ float b_sh[MAX_THREADS];

        int tx = threadIdx.x;            // rank of processor
        int bx = blockIdx.x;             // rank of multiprocessor block
        int i  = tx + bx * MAX_THREADS;  // compute index from tx, bx

        a_sh[tx] = a[i];                 // each thread loads a value for a_sh
        r_sh[tx] = 0;                    // each thread zeros a value for r_sh
        __syncthreads();                 // sync till all threads reach this point

        for (int J = 0; J < n; J += MAX_THREADS)  // loop over blocks in b
        {
            b_sh[tx] = b[J + tx];        // each thread loads a value for b_sh
            __syncthreads();             // sync
            for (int j = 0; j < MAX_THREADS; j++)  // for each b_sh
                r_sh[tx] += a_sh[tx] * b_sh[j];    // product a_sh*b_sh, add to r_sh
            __syncthreads();             // sync before the next tile is loaded
        }
        r[i] = r_sh[tx];                 // write the accumulated result back
    }
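The kernel computes r[i] = a[i] * SUM_j(b[j]). A serial C reference with the same tiled loop structure (a hypothetical sketch, not from the slides; `nxn_generic_op_ref` and the small tile size are illustrative) shows what the shared-memory tiling is doing:

```c
#define TILE 4  /* stands in for MAX_THREADS; kept small for illustration */

/* Serial reference for the shared-memory kernel: r[i] = a[i] * sum_j b[j].
   The outer J loop mirrors loading one tile of b into b_sh; the inner j
   loop mirrors each thread accumulating over that tile. Assumes n is a
   multiple of the tile size, as the kernel does. */
void nxn_generic_op_ref(float *r, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++) {
        float acc = 0.0f;
        for (int J = 0; J < n; J += TILE)    /* loop over tiles of b */
            for (int j = 0; j < TILE; j++)   /* accumulate over one tile */
                acc += a[i] * b[J + j];
        r[i] = acc;
    }
}
```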
Slide 13: Making Optimal Use of 16 KB Shared Memory
- The 16 KB of shared memory is allocated in 16 banks
- Array data is allocated across banks: B[0] -> bank 0, B[1] -> bank 1, ..., B[n] -> bank mod(n, nBanks)
- No bank conflicts if each thread indexes a different bank
- Bank conflicts if threads access the same bank (results in a data stall)
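The bank rule above can be modeled in a few lines of C (hypothetical helpers; 16 banks and one 32-bit word per thread assumed, per the slide):

```c
#include <stdbool.h>

#define N_BANKS 16

/* Shared-memory bank that a 32-bit word at a given array index maps to. */
int bank_of(int index) { return index % N_BANKS; }

/* True if two of nThreads threads hit the same bank when thread t
   accesses index t * stride (a common shared-memory access pattern). */
bool has_bank_conflict(int nThreads, int stride)
{
    bool used[N_BANKS] = { false };
    for (int t = 0; t < nThreads; t++) {
        int b = bank_of(t * stride);
        if (used[b])
            return true;   /* two threads in one bank: data stall */
        used[b] = true;
    }
    return false;
}
```

With stride 1, 16 threads touch 16 distinct banks; with stride 2, threads 0 and 8 both land in bank 0 and their accesses serialize.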
Slide 14: More Detail on GPU Architecture
Slide 16: Exploiting the Texture Samplers
- Designed to map textures onto 3D polygons
- Specialty hardware pipelines for:
  - Fast data sampling from 1D, 2D, 3D arrays
  - Swizzling of 2D, 3D data for optimal access
  - Bilinear filtering in zero cycles
  - Image compositing / blending operations
- Arrays indexed by (u, v, w) coordinates: easy to program
- Extremely well suited to multigrid finite-difference methods (example later)
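What the sampler's zero-cycle bilinear filtering computes can be written out in software (a hypothetical C version; row-major layout and edge clamping are assumptions):

```c
/* Bilinear filtering of a row-major 2D array at fractional coordinates
   (u, v), as the texture sampler performs in hardware. Valid range:
   0 <= u <= w-1, 0 <= v <= h-1; reads past the edge are clamped. */
float bilinear_sample(const float *img, int w, int h, float u, float v)
{
    int   x0 = (int)u,  y0 = (int)v;
    int   x1 = x0 + 1 < w ? x0 + 1 : x0;   /* clamp at the right edge */
    int   y1 = y0 + 1 < h ? y0 + 1 : y0;   /* clamp at the bottom edge */
    float fx = u - x0,  fy = v - y0;

    /* Blend the four neighboring texels by the fractional offsets. */
    float top = img[y0 * w + x0] * (1 - fx) + img[y0 * w + x1] * fx;
    float bot = img[y1 * w + x0] * (1 - fx) + img[y1 * w + x1] * fx;
    return top * (1 - fy) + bot * fy;
}
```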
Slide 17: Experiments in Computational Nanotech on the GPU
- Electronic structure (DFT)
- Finite element modeling
- Molecular dynamics / Monte Carlo
Slide 18: HP XW9400 with Quad AMD CPU and Dual nVidia Quadro 5600 GPUs: A Teraflop Workstation?
Slide 19: Molecular Dynamics Trial
- Lennard-Jones inter-atomic potential
- Verlet integration
- Normalized coordinates
- FCC lattice in an NxNxN simulation cell
- Periodic boundary conditions
- Trials with Rc = 8 and Rc = 3.0
- Tested an nVidia 8800 GPU vs. a 3.0 GHz Intel P4
- OpenGL used to implement MD on the GPU
Slide 20: MD Timing Tests (NxN brute force)
Slide 21: MD Timing Results (bins, Rc = 9 Å)
Slide 22: Hardware-Accelerated DFT Test
- Real-space grid method (Beck, Bryant, ...)
- LDA, localized basis functions
- Iterative solution of the Kohn-Sham equations
- Finite-difference methods
- Multigrid with FMG-FAS
- Weighted Jacobi relaxation
- Mehrstellen discretization
- Gram-Schmidt orthogonalization on the low-res grid
- 64x64x64 grid x 4 orbitals
- 8 H nuclei, 8 electrons
- >1M data elements
- 81 ms computation time!
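The weighted Jacobi smoother in the list above can be sketched for the simplest case, a 1D Poisson equation (a hypothetical illustration, not the 3D code from the slide; Dirichlet endpoints and the common damping factor omega = 2/3 are assumptions):

```c
/* One weighted-Jacobi sweep for the 1D Poisson equation -u'' = f on a
   grid of spacing h with fixed (Dirichlet) endpoints. The plain Jacobi
   update is damped by omega; omega = 2/3 is the usual choice when Jacobi
   is used as a multigrid smoother. tmp is an n-element workspace. */
void weighted_jacobi_sweep(double *u, const double *f, double *tmp,
                           int n, double h, double omega)
{
    for (int i = 1; i < n - 1; i++)
        tmp[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i]);  /* plain Jacobi */
    for (int i = 1; i < n - 1; i++)
        u[i] = (1.0 - omega) * u[i] + omega * tmp[i];         /* damped update */
}
```

On its own the sweep converges slowly; multigrid with FMG-FAS uses it only to damp high-frequency error on each grid level, which is why the combination is fast.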
Slide 23: Where Next? G90 Double-Precision GPU in Spring 2008
- nVidia Quadro PC workstation: 4 Teraflops, $15,000
- G90 GPU, double precision: 1 Teraflop, $2,500
- nVidia QuadroPlex cluster, 16 PC nodes: 64 Teraflops, $300,000
Slide 24: NanoCAD in the Allosphere, California NanoSystems Institute
Slide 25: How to Find Out More
- Download CUDA and docs from nVidia:
  http://developer.nvidia.com/object/cuda.html
- Buy a $600 nVidia GeForce 8800 GTX
  - Get one free through their developer program (talk to me after class)
- CUDA programming course through CS
  - Fall '07 or Winter '08
  - Tobias Hollerer / myself
- NanoCAD collaborative development: www.powerofminus9.net