Title: GPGPU Programming
GPGPU Programming

Overview
- Choices in GPGPU programming
- Illustrated CPU vs. GPU step-by-step example
- GPU kernels in detail
Choices in GPU Programming
- Window manager, e.g. GLUT, Qt, Win32, Motif
- Graphics hardware, e.g. Radeon (ATI), GeForce (NVIDIA)
Bottom lines
- This is not as difficult as it seems
- Similar choices have to be made in all software projects
  - some options are mutually exclusive
  - some can be used without in-depth knowledge
- No direct access to the hardware; the driver does all the tedious thread management anyway
- Advantages and disadvantages
  - steeper learning curve vs. higher flexibility
  - focus on the algorithm, not on (unnecessary) graphics
  - portable code vs. platform- and hardware-specific code
Shading languages
- Kernels are programmed in a shading language
  - Cg (NVIDIA)
  - HLSL (Microsoft, Direct3D only)
  - GLSL (OpenGL)
- Feature set
  - array access, conditionals, loops
  - math, but no bitwise ops (yet)
- Typically very easy to learn
- All three languages are very similar
Libraries and Abstractions
- Some coding is required
  - no library available that you just link against
  - tremendously hard to massively parallelize existing complex code automatically
- Good news
  - much functionality can be added to applications in a minimally invasive way, no rewrite from scratch
- First libraries under development
  - Accelerator (Microsoft): linear algebra, BLAS-like
  - Glift (Lefohn et al.): abstract data structures, e.g. trees
Overview
- Choices in GPGPU programming
- Illustrated CPU vs. GPU step-by-step example
- GPU kernels in detail
Native Data Layout
- CPU: 1D array
- GPU: 2D array
- Indices are floats, addressing array element centers (GL) or top-left corners (D3D); this will be important later (see the sketch below)
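To make the layout concrete, here is a minimal C sketch of how a 1D CPU index maps to 2D GL-style texture coordinates; the helper name and the row-major layout are my assumptions, not from the course code.

#include <stdio.h>

/* Hypothetical helper: map a 1D array index i to 2D texture coordinates
   for a W-wide texture; +0.5f addresses element centers (GL convention). */
void index_1d_to_2d(int i, int W, float *s, float *t)
{
    *s = (float)(i % W) + 0.5f;   /* column */
    *t = (float)(i / W) + 0.5f;   /* row    */
}

int main(void)
{
    float s, t;
    index_1d_to_2d(13, 4, &s, &t);   /* element 13 of a 4-wide texture */
    printf("%.1f %.1f\n", s, t);     /* prints 1.5 3.5 */
    return 0;
}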
Example Problem
- saxpy (from BLAS)
  - given two vectors x and y of size N and a scalar a
  - compute the scaled vector-vector addition y = y + a*x
- CPU implementation
  - store each vector in one array, loop over all elements
- Identify the computation inside the loop as the kernel
  - no logic in this basic kernel, pure computation
  - logic and computation fully separated

for (int i = 0; i < N; i++)
    y[i] = y[i] + a * x[i];
Understanding GPU Limitations
- No simultaneous reads from and writes to the same memory
  - no read-modify-write buffer means no logic required to handle read-before-write hazards
  - not a missing feature, but an essential hardware design decision for good performance and throughput
  - saxpy: introduce an additional array, ynew = yold + a*x (see below)
- Coherent memory access
  - for a given output element, read from the same index in the two input arrays
  - trivially achieved in this basic example
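As a minimal illustration (assuming ynew and yold are separate buffers of length N), the CPU loop rewritten without the in-place update looks like this:

/* saxpy with a separate output array: no element is both read and
   written, matching the GPU's no-read-modify-write restriction */
for (int i = 0; i < N; i++)
    ynew[i] = yold[i] + a * x[i];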
Performing Computations
- Load a kernel program
  - detailed examples later on
- Specify the output and input arrays
  - pseudocode:
    setInputArrays(yold, x)
    setOutputArray(ynew)
  - (a sketch of the underlying GL calls follows below)
- Trigger the computation
  - the GPU is, after all, a graphics processor
  - so just draw something appropriate
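A hedged sketch of what such pseudocode typically abstracts on OpenGL hardware; the handles (texYold, texX, texYnew, fbo) are illustrative, and the EXT_framebuffer_object extension is assumed.

/* Output: attach ynew's texture to a framebuffer object so the
   kernel's results are rendered into it */
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                          GL_TEXTURE_RECTANGLE_ARB, texYnew, 0);

/* Inputs: bind yold and x as rectangle textures on units 0 and 1 */
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texYold);
glActiveTexture(GL_TEXTURE1);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texX);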
Computing = Drawing
- Specify input and output regions
  - set up a 1:1 mapping from the graphics viewport to the output array elements, set up input regions
  - saxpy: input and output regions coincide
- Generate data streams
  - literally draw some geometry that covers all elements in the output array
  - in this example, a 4x4 filled quad from four vertices
  - the GPU interpolates output array indices from the vertices across the output region, generating a data stream that flows through the parallel PEs
Example
(figure: the 4x4 output array covered by a filled quad; each rasterized fragment is one kernel invocation)
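A sketch of the drawing step in legacy OpenGL, assuming a 4x4 output array and an orthographic projection (my setup, not the course code); rasterizing the quad yields one fragment, i.e. one kernel invocation, per array element.

glMatrixMode(GL_PROJECTION);
glLoadIdentity();
gluOrtho2D(0.0, 1.0, 0.0, 1.0);    /* unit square fills the viewport */
glViewport(0, 0, 4, 4);            /* 1:1 viewport-to-array mapping  */

glBegin(GL_QUADS);                 /* four vertices cover all 16 elements */
glTexCoord2f(0.0f, 0.0f); glVertex2f(0.0f, 0.0f);
glTexCoord2f(4.0f, 0.0f); glVertex2f(1.0f, 0.0f);
glTexCoord2f(4.0f, 4.0f); glVertex2f(1.0f, 1.0f);
glTexCoord2f(0.0f, 4.0f); glVertex2f(0.0f, 1.0f);
glEnd();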
Performing Computations
- High-level view
  - kernel is executed simultaneously on all elements in the output region
  - kernel knows its output index (and eventually additional input indices, more on that later)
  - drawing replaces CPU loops: foreach-style execution
  - output array is write-only
- Feedback loop (ping-pong technique)
  - output array can be used read-only as input for the next operation (see the sketch below)
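A minimal C sketch of the ping-pong technique, reusing hypothetical helpers (setInputArray, setOutputArray, drawQuad): two textures alternate between the read-only input role and the write-only output role.

GLuint tex[2];                 /* two arrays sharing the input/output roles */
int src = 0, dst = 1;

for (int iter = 0; iter < maxIter; iter++) {
    setInputArray(tex[src]);   /* read-only input (hypothetical helper)   */
    setOutputArray(tex[dst]);  /* write-only output (hypothetical helper) */
    drawQuad(N);               /* trigger the kernel over the full region */
    int tmp = src; src = dst; dst = tmp;   /* swap roles for next pass */
}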
Overview
- Choices in GPGPU programming
- Illustrated CPU vs. GPU step-by-step example
- GPU kernels in detail
GPU Kernels: saxpy
- Kernel on the CPU:

y[i] = y[i] + a * x[i];

- Written in Cg for the GPU:

float saxpy (float2 coords : WPOS,
             uniform samplerRECT arrayX,
             uniform samplerRECT arrayY,
             uniform float a) : COLOR
{
    float y = texRECT(arrayY, coords);
    float x = texRECT(arrayX, coords);
    return y + a * x;
}
GPU Kernels: Jacobi Iteration
- Good news
  - a simple linear system solver can be built with exactly these basic techniques!
- Example: Finite Differences
  - x: vector of unknowns, sampled with a 5-point stencil (offsets)
  - b: right-hand side
  - regular, equidistant grid
  - solved with Jacobi iteration (update formula below)
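Written out, the update computed by the kernel on the next slide is the standard Jacobi step (my notation, not from the slides; $a_{ii}$ is the diagonal entry of $A$, $h$ the grid spacing):

\[ x^{\text{new}}_{i,j} = x_{i,j} + \frac{1}{a_{ii}} \bigl( b_{i,j} - (Ax)_{i,j} \bigr), \qquad (Ax)_{i,j} = \frac{1}{h} \bigl( 4x_{i,j} - x_{i-1,j} - x_{i+1,j} - x_{i,j-1} - x_{i,j+1} \bigr) \]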
18GPU Kernels Jacobi Iteration
float jacobi (float2 center WPOS, uniform
samplerRECT x, uniform samplerRECT b,
uniform float one_over_h) COLOR float2
left center float2(1,0) float2 right
center float2(1,0) float2 bottom center
float2(0,1) float2 top center
float2(0,1) float x_center texRECT(x,
center) float x_left texRECT(x, left)
float x_right texRECT(x, right) float
x_bottom texRECT(x, bottom) float x_top
texRECT(x, top) float rhs texRECT(b,
center) float Ax one_over_h
( 4.0 x_center x_left -
x_right x_bottom x_top ) float inv_diag
one_over_h / 4.0 return x_center inv_diag
(rhs Ax)
Maximum of an Array
- Entirely different operation
  - output is a single scalar, input is an array of length N
- Naive approach
  - use a 1x1 array as output, gather all N values in one step
  - doomed: will only use one PE, no parallelism at all
  - runs into all sorts of other troubles
- Solution: parallel reduction
  - idea based on global communication in parallel computing
  - smart interplay of output and input regions (a driver sketch follows the kernel below)
  - the same technique applies to dot products, norms etc.
20Maximum of an Array
float maximum (float2 coords WPOS, uniform
samplerRECT array) COLOR float2 topleft
((coords-0.5)2.0)0.5 float val1
texRECT(array, topleft) float val2
texRECT(array, topleftfloat2(1,0)) float val3
texRECT(array, topleftfloat2(1,1)) float
val4 texRECT(array, topleftfloat2(0,1))
return max(val1,max(val2,max(val3,val4)))
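A hedged C sketch of the driver loop for the reduction, assuming a square input of power-of-two size N and reusing the hypothetical ping-pong helpers from before; each pass halves the output region while the kernel above folds 2x2 blocks.

int size = N;                      /* N x N input, N a power of two */
while (size > 1) {
    size /= 2;                     /* output region shrinks by 2 per pass  */
    setInputArray(tex[src]);       /* previous partial maxima              */
    setOutputArray(tex[dst]);
    drawQuad(size);                /* run the 'maximum' kernel size x size */
    int tmp = src; src = dst; dst = tmp;
}
/* after log2(N) passes, the maximum sits in element (0,0) of tex[src] */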
Multigrid Transfers
- Restriction
  - interpolate values from the fine into the coarse array
  - local neighborhood weighted gather on both CPU and GPU
Multigrid Transfers
- Prolongation
  - scatter values from the coarse to the fine array with a weighting stencil
  - typical CPU implementation: loop over the coarse array with stride-2 daxpys (sketch below)
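A minimal 1D sketch (my illustration, not the course code) of the CPU-style scatter with the weighting stencil (1/2, 1, 1/2): each coarse value is copied to its coincident fine node and scattered with weight 1/2 to its two neighbours.

/* 1D prolongation as a scatter: coarse node j coincides with fine node 2j.
   Nc coarse nodes, Nf = 2*Nc - 1 fine nodes; fine[] is pre-zeroed. */
void prolongate_scatter(const double *coarse, int Nc, double *fine)
{
    for (int j = 0; j < Nc; j++) {
        fine[2*j] += coarse[j];                            /* coincident: copy */
        if (j > 0)      fine[2*j - 1] += 0.5 * coarse[j];  /* left  neighbour  */
        if (j < Nc - 1) fine[2*j + 1] += 0.5 * coarse[j];  /* right neighbour  */
    }
}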
Multigrid Transfers
- Three cases
  - fine node lies in the center of an element (4 interpolants)
  - fine node lies on the edge of an element (2 interpolants)
  - fine node lies on top of a coarse node (copy)
- Reformulate scatter as gather for the GPU
  - set the fine array as the output region
  - sample with index offset 0.25
  - same code for all three cases, no conditionals or red-black map (see the gather sketch below)
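To see why the 0.25 offset gives the same code for all cases, here is a hedged 1D C sketch of the gather formulation (my construction, following the slide's description): a linear read at fine index * 0.5 + 0.25, with GL-style element centers, lands exactly on a coarse element center for coincident nodes (a copy) and exactly halfway between two centers for edge nodes (the average).

#include <math.h>

/* Emulate a GL-style linear texture read: element centers at j + 0.5 */
static double sample_linear(const double *coarse, int Nc, double c)
{
    double t = c - 0.5;                       /* shift to center-based index */
    int j = (int)floor(t);
    double w = t - j;                         /* interpolation weight        */
    int j0 = j < 0 ? 0 : j;
    int j1 = (j + 1 >= Nc) ? Nc - 1 : j + 1;  /* clamp at the boundaries     */
    return (1.0 - w) * coarse[j0] + w * coarse[j1];
}

/* Gather formulation: one read per fine node, no conditionals on node type */
void prolongate_gather(const double *coarse, int Nc, double *fine, int Nf)
{
    for (int i = 0; i < Nf; i++)
        fine[i] = sample_linear(coarse, Nc, (i + 0.5) * 0.5 + 0.25);
}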
Conclusions
- This is not as complicated as it might seem
- Course notes online
  - http://www.mathematik.uni-dortmund.de/goeddeke/iccs
- GPGPU community site: http://www.gpgpu.org
  - developer information, lots of useful references
  - paper archive
  - help from real people in the GPGPU forums