1
GPGPU Programming

Dominik Göddeke

2
Overview

Choices in GPGPU programming
Illustrated CPU vs. GPU step by step example
GPU kernels in detail

3
Choices in GPU Programming
Window manager: e.g. GLUT, Qt, Win32, Motif
Graphics hardware: e.g. Radeon (ATI), GeForce (NV)

4
Bottom lines

This is not as difficult as it seems
Similar choices to be made in all software
projects
Some options are mutually exclusive
Some can be used without in-depth knowledge
No direct access to the hardware, the driver does
all the tedious thread-management anyway
Advantages and disadvantages
Steeper learning curve vs. higher flexibility
Focus on algorithm, not on (unnecessary) graphics
Portable code vs. platform and hardware specific

5
Shading languages

Kernels are programmed in a shading language
Cg (NVIDIA)
HLSL (Microsoft, only Direct3D)
GLSL (OpenGL)
Feature sets
Array access
Conditionals, loops
Math
No bitwise ops (yet)
Typically very easy to learn
All three languages are very similar

6
Libraries and Abstractions

Some coding is required
no library available that you just link against
tremendously hard to massively parallelize
existing complex code automatically
Good news
much functionality can be added to applications
in a minimally invasive way, no rewrite from
scratch
First libraries under development
Accelerator (Microsoft): linear algebra, BLAS-like
Glift (Lefohn et al.): abstract data structures, e.g. trees

7
Overview

Choices in GPGPU programming
Illustrated CPU vs. GPU step by step example
GPU kernels in detail

8
Native Data Layout

CPU: 1D array
GPU: 2D array

Indices are floats, addressing array element
centers (GL) or top-left corners (D3D). This
will be important later.
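
The mapping between the CPU's 1D array and the GPU's 2D array can be sketched in C as follows. This is only a minimal sketch: it assumes a row-major layout and OpenGL-style addressing of element centers (offset 0.5). The names TexCoord, index_to_texcoord and texcoord_to_index are hypothetical and do not come from the presentation.

```c
#include <stddef.h>

/* Hypothetical helper type: a 2D array coordinate as the GPU sees it. */
typedef struct { float s; float t; } TexCoord;

/* Map 1D index i to the center of the matching element in a 2D array
   with `width` columns (row-major layout is an assumption). */
TexCoord index_to_texcoord(size_t i, size_t width)
{
    TexCoord c;
    c.s = (float)(i % width) + 0.5f;  /* column center */
    c.t = (float)(i / width) + 0.5f;  /* row center */
    return c;
}

/* Inverse mapping: truncation drops the 0.5 center offset. */
size_t texcoord_to_index(TexCoord c, size_t width)
{
    return (size_t)c.t * width + (size_t)c.s;
}
```

For example, with a width of 4, index 6 lands in column 2 of row 1, so its center is addressed as (2.5, 1.5).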
9
Example Problem

saxpy (from BLAS)
given two vectors x and y of size N and a scalar a
compute scaled vector-vector addition y = y + a*x
CPU implementation
store each vector in one array, loop over all
elements
Identify computation inside loop as kernel
no logic in this basic kernel, pure computation
logic and computation fully separated

for (i = 0; i < N; i++)
    y[i] = y[i] + a*x[i];
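
The separation of logic (the loop) from computation (the kernel) might look like this on the CPU. This is a sketch under the assumption of single-precision arrays; the function names saxpy_kernel and saxpy_cpu are made up for illustration.

```c
#include <stddef.h>

/* The kernel: computation for a single element, no loop logic. */
static float saxpy_kernel(float a, float xi, float yi)
{
    return yi + a * xi;
}

/* The loop: logic only, applies the kernel to every index. */
void saxpy_cpu(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = saxpy_kernel(a, x[i], y[i]);
    }
}
```

Only the body of saxpy_kernel has a counterpart on the GPU; the loop is what the graphics hardware takes over.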
10
Understanding GPU Limitations

No simultaneous reads and writes into the same
memory
No read-modify-write buffer means no logic
required to handle read-before-write hazards
Not a missing feature, but essential hardware
design for good performance and throughput
saxpy: introduce additional array, ynew = yold + a*x
Coherent memory access
For a given output element, read in from the same
index in the two input arrays
Trivially achieved in this basic example
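
A CPU-side sketch of the same restriction: the inputs are declared read-only and the result goes to a separate array, so no element is ever read and written in the same pass. The function name saxpy_out_of_place is an assumption for illustration.

```c
#include <stddef.h>

/* ynew = yold + a*x; x and yold are read-only, ynew is write-only.
   Output element i reads from the same index i in both inputs. */
void saxpy_out_of_place(size_t n, float a, const float *x,
                        const float *yold, float *ynew)
{
    for (size_t i = 0; i < n; i++) {
        ynew[i] = yold[i] + a * x[i];
    }
}
```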

11
Performing Computations

Load a kernel program
Detailed examples later on
Specify the output and input arrays
Pseudocode
setInputArrays(yold, x)
setOutputArray(ynew)
Trigger the computation
GPU is after all a graphics processor
So just draw something appropriate

12
Computing = Drawing

Specify input and output regions
Set up 1:1 mapping from graphics viewport to
output array elements, set up input regions
saxpy: input and output regions coincide
Generate data streams
Literally draw some geometry that covers all
elements in the output array
In this example, a 4x4 filled quad from four
vertices
GPU will interpolate output array indices from
vertices across the output region
And generate data stream flowing through the
parallel PEs
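
What drawing the quad achieves can be imitated on the CPU. The sketch below is not graphics code: draw_quad, kernel_fn and index_kernel are hypothetical names, and the nested loops stand in for the rasterizer that generates one kernel invocation per output element with interpolated center coordinates.

```c
#include <stddef.h>

/* Hypothetical kernel signature: called once per output element with
   the interpolated coordinates of that element's center. */
typedef float (*kernel_fn)(float s, float t, const void *inputs);

/* CPU stand-in for "draw a filled quad covering the output array":
   visit every element of a width x height output region and run the
   kernel on it. On a GPU these invocations run in parallel. */
void draw_quad(size_t width, size_t height, kernel_fn kernel,
               const void *inputs, float *output)
{
    for (size_t row = 0; row < height; row++) {
        for (size_t col = 0; col < width; col++) {
            float s = (float)col + 0.5f;
            float t = (float)row + 0.5f;
            output[row * width + col] = kernel(s, t, inputs);
        }
    }
}

/* Example kernel: writes the 1D index of its own output element.
   `inputs` points to the array width. */
float index_kernel(float s, float t, const void *inputs)
{
    size_t width = *(const size_t *)inputs;
    return (float)((size_t)t * width + (size_t)s);
}
```

Calling draw_quad(4, 4, index_kernel, &width, out) with width set to 4 fills the 16 output elements with the values 0 to 15, one kernel invocation per element.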

13
Example
14
Performing Computations

High-level view
Kernel is executed simultaneously on all elements
in the output region
Kernel knows its output index (and possibly
additional input indices, more on that later)
Drawing replaces CPU loops, foreach-execution
Output array is write-only
Feedback loop (ping-pong technique)
Output array can be used read-only as input for
next operation
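
The ping-pong technique can be sketched on the CPU with two buffers that swap roles after every pass. The saxpy update is reused as the operation here; the function name saxpy_ping_pong and its parameters are assumptions for illustration.

```c
#include <stddef.h>

/* Run `steps` passes of ynew = yold + a*x, swapping the roles of the
   two buffers after each pass. Returns the buffer holding the result. */
float *saxpy_ping_pong(size_t n, float a, const float *x,
                       float *buf_a, float *buf_b, int steps)
{
    float *src = buf_a;   /* read-only during a pass */
    float *dst = buf_b;   /* write-only during a pass */
    for (int s = 0; s < steps; s++) {
        for (size_t i = 0; i < n; i++) {
            dst[i] = src[i] + a * x[i];
        }
        float *tmp = src; src = dst; dst = tmp;   /* swap roles */
    }
    return src;   /* after the swap, src holds the latest output */
}
```

Within a single pass no buffer is both read and written, which matches the write-only output restriction above.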

15
Overview

Choices in GPGPU programming
Illustrated CPU vs. GPU step by step example
GPU kernels in detail

16
GPU Kernels: saxpy

Kernel on the CPU

y[i] = y[i] + a*x[i];

Written in Cg for the GPU

float saxpy(float2 coords : WPOS,
            uniform samplerRECT arrayX,
            uniform samplerRECT arrayY,
            uniform float a) : COLOR
{
    float y = texRECT(arrayY, coords);
    float x = texRECT(arrayX, coords);
    return y + a*x;
}

17
GPU Kernels: Jacobi Iteration

Good news
Simple linear system solver can be built with
exactly these basic techniques!
Example: Finite Differences
x: vector of unknowns, sampled with a 5-point
stencil (offsets)
b: right-hand side
regular, equidistant grid
solved with Jacobi iteration

18
GPU Kernels: Jacobi Iteration

float jacobi(float2 center : WPOS,
             uniform samplerRECT x,
             uniform samplerRECT b,
             uniform float one_over_h) : COLOR
{
    float2 left   = center - float2(1,0);
    float2 right  = center + float2(1,0);
    float2 bottom = center - float2(0,1);
    float2 top    = center + float2(0,1);
    float x_center = texRECT(x, center);
    float x_left   = texRECT(x, left);
    float x_right  = texRECT(x, right);
    float x_bottom = texRECT(x, bottom);
    float x_top    = texRECT(x, top);
    float rhs      = texRECT(b, center);
    float Ax = one_over_h *
               (4.0 * x_center - x_left - x_right - x_bottom - x_top);
    float inv_diag = one_over_h / 4.0;
    return x_center + inv_diag * (rhs - Ax);
}
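
For checking results, the Jacobi kernel can be mirrored on the CPU. This is a sketch, not the presenter's code: the names fetch and jacobi_sweep are hypothetical, and clamping out-of-range reads to the array edge is an assumption, since the transcript does not say how boundary elements are treated. The inv_diag expression is copied from the kernel as written.

```c
#include <stddef.h>

/* Stand-in for texRECT: fetch element (col,row), clamping out-of-range
   indices to the edge (the clamp mode is an assumption). */
static float fetch(const float *a, int w, int h, int col, int row)
{
    if (col < 0) col = 0;
    if (col > w - 1) col = w - 1;
    if (row < 0) row = 0;
    if (row > h - 1) row = h - 1;
    return a[row * w + col];
}

/* One Jacobi sweep: reads x and b, writes x_new (a separate array). */
void jacobi_sweep(int w, int h, float one_over_h,
                  const float *x, const float *b, float *x_new)
{
    for (int row = 0; row < h; row++) {
        for (int col = 0; col < w; col++) {
            float xc = fetch(x, w, h, col, row);
            float xl = fetch(x, w, h, col - 1, row);
            float xr = fetch(x, w, h, col + 1, row);
            float xb = fetch(x, w, h, col, row - 1);
            float xt = fetch(x, w, h, col, row + 1);
            float rhs = fetch(b, w, h, col, row);
            float Ax = one_over_h * (4.0f * xc - xl - xr - xb - xt);
            float inv_diag = one_over_h / 4.0f;   /* as in the kernel */
            x_new[row * w + col] = xc + inv_diag * (rhs - Ax);
        }
    }
}
```

Repeated sweeps would alternate the roles of x and x_new, in the same way as the ping-pong technique described earlier.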
19
Maximum of an Array

Entirely different operation
Output is single scalar, input is array of length
N
Naive approach
Use 1x1 array as output, gather all N values in
one step
Doomed: will only use one PE, no parallelism at
all
Runs into all sorts of other troubles
Solution: parallel reduction
Idea based on global communication in parallel
computing
Smart interplay of output and input regions
Same technique applies to dot products, norms etc.

20
Maximum of an Array
float maximum(float2 coords : WPOS,
              uniform samplerRECT array) : COLOR
{
    float2 topleft = ((coords - 0.5) * 2.0) + 0.5;
    float val1 = texRECT(array, topleft);
    float val2 = texRECT(array, topleft + float2(1,0));
    float val3 = texRECT(array, topleft + float2(1,1));
    float val4 = texRECT(array, topleft + float2(0,1));
    return max(val1, max(val2, max(val3, val4)));
}
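
The reduction can be followed step by step in a CPU sketch. Each pass halves the array in both dimensions, and every output element takes the maximum of one 2x2 input block, just as the kernel does. The names max2, max_pass and max_reduce are hypothetical, and the sketch assumes a square array whose side length is a power of two.

```c
#include <stddef.h>

static float max2(float p, float q) { return p > q ? p : q; }

/* One reduction pass: each output element is the maximum of a 2x2
   block of the input. `in` is n x n, `out` is (n/2) x (n/2). */
void max_pass(size_t n, const float *in, float *out)
{
    size_t half = n / 2;
    for (size_t row = 0; row < half; row++) {
        for (size_t col = 0; col < half; col++) {
            size_t r = 2 * row, c = 2 * col;   /* top-left of block */
            float v1 = in[r * n + c];
            float v2 = in[r * n + c + 1];
            float v3 = in[(r + 1) * n + c + 1];
            float v4 = in[(r + 1) * n + c];
            out[row * half + col] = max2(v1, max2(v2, max2(v3, v4)));
        }
    }
}

/* Repeat passes, ping-ponging between two buffers, until 1x1 remains.
   n must be a power of two; both buffers hold n*n floats and are
   overwritten. */
float max_reduce(size_t n, float *buf_a, float *buf_b)
{
    float *src = buf_a, *dst = buf_b;
    while (n > 1) {
        max_pass(n, src, dst);
        float *tmp = src; src = dst; dst = tmp;
        n /= 2;
    }
    return src[0];
}
```

A 4x4 input needs two passes (4x4 to 2x2 to 1x1). In general an N x N input needs log2(N) passes, and every pass except the last keeps several output elements busy at once.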
21
Multigrid Transfers