Brent Oster - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Brent Oster
  • Associate Director, Allosphere, CNSI
  • PhD studies in Computational Nanotechnology
  • nVidia G80 GPU / CUDA

2
Previously Technical Director for Video Games
(BioWare, EA, Alias|Wavefront, LucasArts)
3
(No Transcript)
4
Modern GPUs Are More General Purpose: Lots of ALUs
5
The nVidia G80 GPU
  • 128 streaming floating-point processors @ 1.5 GHz
  • 1.5 GB shared RAM with 86 GB/s bandwidth
  • 500 GFLOPS on one chip (single precision)

6
What Has Driven the Evolution of These Chips?
Males age 15-35 buy $10B in video games per year
Crysis demo
7
Are GPUs Useful for Scientific Computing?
Electronic Structure (DFT)
Finite Element Modeling
Molecular Dynamics / Monte Carlo
8
nVidia G80 GPU Architecture Overview
  • 16 multiprocessor blocks
  • Each MP block has:
  • 8 streaming processors (IEEE 754 single-precision floating-point compliant)
  • 16 KB shared memory
  • 64 KB constant cache
  • 8 KB texture cache
  • Each processor can access all of memory at 86 GB/s, but with different latencies:
  • Shared memory: 2-cycle latency
  • Device memory: 300-cycle latency

9
Programming Interface
  • Interface to the GPU is via nVidia's proprietary API, CUDA (very C-like)
  • Looks a lot like UPC (simplified CUDA below)
  • void AddVectors(float* r, float* a, float* b)
  •   int tx = threadIdx.x; // processor rank
  •   r[tx] = a[tx] + b[tx]; // executed in parallel

10
Actual CUDA Code
  • #define MAX_THREADS 512
  • extern "C" void AddVectors(float* r, float* a, float* b, int n)
  • {
  •   int nThreads = MAX_THREADS / 2;
  •   int nBlocks = n / nThreads;
  •   AddVectorsKernel<<<nBlocks, nThreads>>>(r, a, b, n);
  • }
  • __global__ void AddVectorsKernel(float* r, float* a, float* b, int n)
  • {
  •   int tx = threadIdx.x;
  •   int bx = blockIdx.x;
  •   int i = tx + bx * blockDim.x;
  •   r[i] = a[i] + b[i];
  • }
  • // This would be extremely slow and inefficient code; more later

11
Still A Specialized Processor
  • Very efficient for:
  • Fast parallel floating-point processing
  • Single-Instruction, Multiple-Data (SIMD) operations
  • High computation per memory access
  • Not as efficient for:
  • Double precision (need to test performance)
  • Logical operations on integer data
  • Branching-intensive operations
  • Random-access, memory-intensive operations

12
  • __global__ void NxNGenericOp_Kernel(float* r, float* a, float* b, int n) // r[i] = SUMj(a[i]*b[j])
  • {
  •   __shared__ float r_sh[MAX_THREADS]; // Allocate in fast 16 KB shared memory
  •   __shared__ float a_sh[MAX_THREADS];
  •   __shared__ float b_sh[MAX_THREADS];
  •   int tx = threadIdx.x; // Rank of processor
  •   int bx = blockIdx.x; // Rank of multiprocessor block
  •   int i = tx + bx * MAX_THREADS; // Compute index from tx, bx
  •   a_sh[tx] = a[i]; // Each thread loads a value for a_sh
  •   r_sh[tx] = 0; // Each thread zeros a value for r_sh
  •   __syncthreads(); // Wait until all threads reach this point
  •   for (int J = 0; J < n; J += MAX_THREADS) // Loop over tiles of b
  •   {
  •     b_sh[tx] = b[J + tx]; // Each thread loads a value for b_sh
  •     __syncthreads();
  •     for (int j = 0; j < MAX_THREADS; j++) // For each b_sh
  •       r_sh[tx] += a_sh[tx] * b_sh[j]; // Compute product a_sh*b_sh, add to r_sh
  •     __syncthreads(); // Before the next tile overwrites b_sh
  •   }
  •   r[i] = r_sh[tx]; // Write the accumulated result back to device memory
  • }

13
Making Optimal Use of 16 KB Shared Memory
The 16 KB of shared memory is divided into 16 banks. Array data is allocated across the banks: B[0] -> bank 0, B[1] -> bank 1, ... B[n] -> bank mod(n, nBanks). There are no bank conflicts if each thread indexes a different bank; a bank conflict occurs when threads access the same bank (and results in a data stall).
14
More Detail on GPU Architecture
15
(No Transcript)
16
Exploiting the Texture Samplers
  • Designed to map textures onto 3D polygons
  • Specialty hardware pipelines for:
  • Fast data sampling from 1D, 2D, 3D arrays
  • Swizzling of 2D, 3D data for optimal access
  • Bilinear filtering in zero cycles
  • Image compositing / blending operations
  • Arrays indexed by (u,v,w) coordinates; easy to program
  • Extremely well suited for multigrid finite-difference methods (example later)

17
Experiments in Computational Nanotech on GPU
Electronic Structure (DFT)
Finite Element Modeling
Molecular Dynamics / Monte Carlo
18
HP XW9400 with quad AMD CPUs and dual nVidia
Quadro 5600 GPUs: A Teraflop Workstation?
19
Molecular Dynamics Trial
  • Lennard-Jones inter-atomic potential
  • Verlet integration
  • Normalized coordinates
  • FCC lattice in an NxNxN simulation cell
  • Periodic boundary conditions
  • Trials with Rc = 8 and Rc = 3.0
  • Tested nVidia 8800 GPU vs. 3.0 GHz Intel P4
  • OpenGL used to implement MD on the GPU

20
MD Timing Tests (NxN brute force)
21
MD Timing Results (bins, Rc = 9 Å)
22
Hardware-Accelerated DFT Test
  • Real-space grid method (Beck, Bryant, ...)
  • LDA, localized basis functions
  • Iterative solution of the Kohn-Sham equations
  • Finite-difference methods
  • Multigrid with FMG-FAS
  • Weighted Jacobi relaxation
  • Mehrstellen discretization
  • Gram-Schmidt orthogonalization on the low-res grid
  • 64x64x64 grid x 4 orbitals
  • 8 H nuclei, 8 electrons
  • >1M data elements
  • 81 ms computation time!

23
Where Next? G90 Double-Precision GPU in Spring 2008
  • nVidia Quadro PC Workstation: 4 Teraflops, $15,000
  • G90 GPU, double precision: 1 Teraflop, $2,500
  • nVidia QuadroPlex Cluster (16 PC nodes): 64 Teraflops, $300,000
24
NanoCAD in the Allosphere, California NanoSystems Institute
25
How to Find out More
  • Download CUDA and docs from nVidia:
  • http://developer.nvidia.com/object/cuda.html
  • Buy a $600 nVidia GeForce 8800 GTX
  • Get one free through their developer program (talk to me after class)
  • CUDA programming course through CS
  • Fall '07 or Winter '08
  • Tobias Hollerer & myself
  • NanoCAD collaborative development: www.powerofminus9.net