GPGPU Programming - PowerPoint PPT Presentation

1
GPGPU Programming
  • Dominik Göddeke

2
Overview
  • Choices in GPGPU programming
  • Illustrated CPU vs. GPU step-by-step example
  • GPU kernels in detail

3
Choices in GPU Programming
  • Window manager, e.g. GLUT, Qt, Win32, Motif
  • Graphics hardware, e.g. Radeon (ATI), GeForce (NVIDIA)
4
Bottom lines
  • This is not as difficult as it seems
  • Similar choices to be made in all software
    projects
  • Some options are mutually exclusive
  • Some can be used without in-depth knowledge
  • No direct access to the hardware; the driver does
    all the tedious thread management anyway
  • Advantages and disadvantages
  • Steeper learning curve vs. higher flexibility
  • Focus on algorithm, not on (unnecessary) graphics
  • Portable code vs. platform and hardware specific

5
Shading languages
  • Kernels are programmed in a shading language
  • Cg (NVIDIA)
  • HLSL (Microsoft, only Direct3D)
  • GLSL (OpenGL)
  • Feature sets
  • Array access, conditionals, loops
  • Math, but no bitwise ops (yet)
  • Typically very easy to learn
  • All three languages are very similar

6
Libraries and Abstractions
  • Some coding is required
  • no library available that you just link against
  • tremendously hard to massively parallelize
    existing complex code automatically
  • Good news
  • much functionality can be added to applications
    in a minimally invasive way, no rewrite from
    scratch
  • First libraries under development
  • Accelerator (Microsoft): linear algebra,
    BLAS-like
  • Glift (Lefohn et al.): abstract data structures,
    e.g. trees

7
Overview
  • Choices in GPGPU programming
  • Illustrated CPU vs. GPU step-by-step example
  • GPU kernels in detail

8
Native Data Layout
  • CPU: 1D array
  • GPU: 2D array

Indices are floats, addressing array element
centers (GL) or top-left corners (D3D). This
will be important later.
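
To make the addressing convention concrete, here is a minimal C sketch of how a 1D CPU index could map to the float coordinate of an element center in a 2D array, following the OpenGL convention mentioned above. The `texcoord` type, the function names, and the row-major layout are assumptions made for illustration; they are not part of the slides.

```c
#include <stddef.h>

/* Hypothetical helper type: a 2D array coordinate with float components. */
typedef struct { float s, t; } texcoord;

/* Map a 1D index to the center of the matching element in a row-major
   2D array of the given width (OpenGL-style: centers sit at +0.5). */
texcoord index_to_texcoord(size_t i, size_t width)
{
    texcoord c;
    c.s = (float)(i % width) + 0.5f;  /* column center */
    c.t = (float)(i / width) + 0.5f;  /* row center */
    return c;
}

/* Inverse mapping: truncate the float coordinate back to integer indices. */
size_t texcoord_to_index(texcoord c, size_t width)
{
    return (size_t)c.t * width + (size_t)c.s;
}
```

With a width of 4, index 6 lands in column 2 of row 1, so its center coordinate is (2.5, 1.5).
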
9
Example Problem
  • saxpy (from BLAS)
  • given two vectors x and y of size N and a scalar
    a
  • compute scaled vector-vector addition y = y + a*x
  • CPU implementation
  • store each vector in one array, loop over all
    elements
  • Identify computation inside loop as kernel
  • no logic in this basic kernel, pure computation
  • logic and computation fully separated

for (i = 0; i < N; i++)
    y[i] = y[i] + a * x[i];
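
The separation of loop logic from computation can be sketched in plain C as follows. The function names `saxpy_kernel` and `saxpy_cpu` are chosen here for illustration only; the point is that the per-element computation is isolated in a function that knows nothing about the loop.

```c
#include <stddef.h>

/* The "kernel": pure computation on one element, no loop logic. */
float saxpy_kernel(float a, float x, float y)
{
    return y + a * x;
}

/* The loop logic: apply the kernel to every element, in place. */
void saxpy_cpu(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = saxpy_kernel(a, x[i], y[i]);
}
```
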
10
Understanding GPU Limitations
  • No simultaneous reads and writes into the same
    memory
  • No read-modify-write buffer means no logic
    required to handle read-before-write hazards
  • Not a missing feature, but essential hardware
    design for good performance and throughput
  • saxpy: introduce an additional array,
    ynew = yold + a*x
  • Coherent memory access
  • For a given output element, read in from the same
    index in the two input arrays
  • Trivially achieved in this basic example
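
A CPU analogue of this restriction, sketched under the assumption that C99 is available: the `restrict` qualifier promises the compiler that the output array does not overlap either input, which mirrors the rule that the same memory is never read and written at the same time. The function name is hypothetical.

```c
#include <stddef.h>

/* Out-of-place saxpy: y_new is written, x and y_old are only read.
   restrict states that none of the three arrays overlap. */
void saxpy_out_of_place(size_t n, float a,
                        const float *restrict x,
                        const float *restrict y_old,
                        float *restrict y_new)
{
    for (size_t i = 0; i < n; i++)
        y_new[i] = y_old[i] + a * x[i];  /* same index in both inputs */
}
```
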

11
Performing Computations
  • Load a kernel program
  • Detailed examples later on
  • Specify the output and input arrays
  • Pseudocode
  • setInputArrays(yold, x)
  • setOutputArray(ynew)
  • Trigger the computation
  • GPU is after all a graphics processor
  • So just draw something appropriate

12
Computing = Drawing
  • Specify input and output regions
  • Set up 1:1 mapping from graphics viewport to
    output array elements, set up input regions
  • saxpy: input and output regions coincide
  • Generate data streams
  • Literally draw some geometry that covers all
    elements in the output array
  • In this example, a 4x4 filled quad from four
    vertices
  • GPU will interpolate output array indices from
    vertices across the output region
  • And generate data stream flowing through the
    parallel PEs
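
What "drawing a quad" does for the computation can be imitated on the CPU. The sketch below is an emulation, not graphics code: `draw_quad` stands in for the rasterizer, visiting every element covered by the 4x4 quad and handing the kernel the element's center coordinate. All names (`QUAD_W`, `QUAD_H`, `kernel_fn`, `fetch`, `saxpy_at`, `draw_quad`) are hypothetical, and on real hardware the elements are processed in parallel rather than in a loop.

```c
#include <stddef.h>

enum { QUAD_W = 4, QUAD_H = 4 };  /* the 4x4 output region from the example */

/* Hypothetical kernel signature: gets the interpolated output coordinate
   and reads from the input arrays at that coordinate. */
typedef float (*kernel_fn)(float s, float t,
                           const float *x, const float *y, float a);

/* Read the element that contains the float coordinate (s, t). */
float fetch(const float *array, float s, float t)
{
    return array[(size_t)t * QUAD_W + (size_t)s];
}

/* saxpy kernel: input and output coordinates coincide. */
float saxpy_at(float s, float t, const float *x, const float *y, float a)
{
    return fetch(y, s, t) + a * fetch(x, s, t);
}

/* Stand-in for drawing the quad: run the kernel once per covered element,
   passing the element center as the interpolated coordinate. */
void draw_quad(kernel_fn kernel, const float *x, const float *y,
               float a, float *out)
{
    for (size_t row = 0; row < QUAD_H; row++)
        for (size_t col = 0; col < QUAD_W; col++)
            out[row * QUAD_W + col] =
                kernel((float)col + 0.5f, (float)row + 0.5f, x, y, a);
}
```
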

13
Example
14
Performing Computations
  • High-level view
  • Kernel is executed simultaneously on all elements
    in the output region
  • Kernel knows its output index (and possibly
    additional input indices, more on that later)
  • Drawing replaces CPU loops, foreach-execution
  • Output array is write-only
  • Feedback loop (ping-pong technique)
  • Output array can be used read-only as input for
    next operation
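
The ping-pong technique can be sketched on the CPU with two buffers that swap roles after every pass. This is a minimal illustration with hypothetical function names; on the GPU the two buffers would be two arrays that alternate between being the write-only output and a read-only input.

```c
#include <stddef.h>

/* One pass: the output is only written, the inputs are only read. */
void saxpy_pass(size_t n, float a, const float *x,
                const float *y_in, float *y_out)
{
    for (size_t i = 0; i < n; i++)
        y_out[i] = y_in[i] + a * x[i];
}

/* Run `passes` saxpy passes, swapping buffer roles after each one.
   Returns the buffer that holds the final result. */
float *saxpy_ping_pong(size_t n, float a, const float *x,
                       float *buf_a, float *buf_b, int passes)
{
    float *src = buf_a, *dst = buf_b;
    for (int p = 0; p < passes; p++) {
        saxpy_pass(n, a, x, src, dst);
        float *tmp = src; src = dst; dst = tmp;  /* swap roles */
    }
    return src;  /* after the swap, src is the most recent output */
}
```

After an odd number of passes the result lives in the second buffer, after an even number in the first, so the caller should use the returned pointer rather than assume a fixed buffer.
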

15
Overview
  • Choices in GPGPU programming
  • Illustrated CPU vs. GPU step-by-step example
  • GPU kernels in detail

16
GPU Kernels: saxpy
  • Kernel on the CPU

y[i] = y[i] + a * x[i];

  • Written in Cg for the GPU

float saxpy(float2 coords : WPOS,
            uniform samplerRECT arrayX,
            uniform samplerRECT arrayY,
            uniform float a) : COLOR
{
    float y = texRECT(arrayY, coords);
    float x = texRECT(arrayX, coords);
    return y + a * x;
}
17
GPU Kernels: Jacobi Iteration
  • Good news
  • Simple linear system solver can be built with
    exactly these basic techniques!
  • Example: Finite Differences
  • x: vector of unknowns, sampled with a 5-point
    stencil (offsets)
  • b: right-hand side
  • regular, equidistant grid
  • solved with Jacobi iteration

18
GPU Kernels: Jacobi Iteration
float jacobi(float2 center : WPOS,
             uniform samplerRECT x,
             uniform samplerRECT b,
             uniform float one_over_h) : COLOR
{
    // coordinates of the four neighbors in the 5-point stencil
    float2 left   = center - float2(1,0);
    float2 right  = center + float2(1,0);
    float2 bottom = center - float2(0,1);
    float2 top    = center + float2(0,1);
    float x_center = texRECT(x, center);
    float x_left   = texRECT(x, left);
    float x_right  = texRECT(x, right);
    float x_bottom = texRECT(x, bottom);
    float x_top    = texRECT(x, top);
    float rhs      = texRECT(b, center);
    // matrix-vector product for this row
    float Ax = one_over_h *
        (4.0 * x_center - x_left - x_right - x_bottom - x_top);
    // inverse of the diagonal entry 4 * one_over_h
    float inv_diag = 1.0 / (4.0 * one_over_h);
    return x_center + inv_diag * (rhs - Ax);
}
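
For checking a GPU kernel like this one, a CPU reference sweep is useful. The sketch below applies the standard Jacobi update x_new = x + D^-1 (b - A x) for the 5-point stencil, where the diagonal entry is D = 4 * scale and `scale` plays the role of the factor in front of the stencil. The function name, the row-major n-by-n layout, and the choice to copy boundary values unchanged are assumptions made here; the slides do not specify the boundary treatment.

```c
#include <stddef.h>

/* One Jacobi sweep on an n-by-n grid stored row-major.
   Reads x and b, writes x_new (separate output array). */
void jacobi_sweep(size_t n, float scale,
                  const float *x, const float *b, float *x_new)
{
    for (size_t r = 0; r < n; r++) {
        for (size_t c = 0; c < n; c++) {
            size_t i = r * n + c;
            if (r == 0 || c == 0 || r == n - 1 || c == n - 1) {
                x_new[i] = x[i];  /* assumption: fixed boundary values */
                continue;
            }
            /* Row of A applied to x: center minus four neighbors. */
            float Ax = scale * (4.0f * x[i] - x[i - 1] - x[i + 1]
                                - x[i - n] - x[i + n]);
            float inv_diag = 1.0f / (4.0f * scale);
            x_new[i] = x[i] + inv_diag * (b[i] - Ax);
        }
    }
}
```

With a zero right-hand side, one sweep replaces each interior value by the average of its four neighbors, which is a quick sanity check for any implementation.
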
19
Maximum of an Array
  • Entirely different operation
  • Output is single scalar, input is array of length
    N
  • Naive approach
  • Use 1x1 array as output, gather all N values in
    one step
  • Doomed: will only use one PE, no parallelism at
    all
  • Runs into all sorts of other troubles
  • Solution: parallel reduction
  • Idea based on global communication in parallel
    computing
  • Smart interplay of output and input regions
  • Same technique applies to dot products, norms etc.

20
Maximum of an Array
float maximum(float2 coords : WPOS,
              uniform samplerRECT array) : COLOR
{
    // top-left element of the 2x2 input block for this output element
    float2 topleft = ((coords - 0.5) * 2.0) + 0.5;
    float val1 = texRECT(array, topleft);
    float val2 = texRECT(array, topleft + float2(1,0));
    float val3 = texRECT(array, topleft + float2(1,1));
    float val4 = texRECT(array, topleft + float2(0,1));
    return max(val1, max(val2, max(val3, val4)));
}
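
The full reduction applies this kernel repeatedly, halving the output region in each pass until a single element remains. Below is a CPU sketch of that driver loop, assuming a square n-by-n input with n a power of two; the function names and the two-buffer scheme are illustrative choices, not taken from the slides.

```c
#include <stddef.h>

float max2(float a, float b) { return a > b ? a : b; }

/* One reduction pass: output element (r, c) of the n/2-by-n/2 array is
   the maximum of the 2x2 input block whose top-left element is (2r, 2c). */
void max_reduce_pass(size_t n, const float *in, float *out)
{
    size_t half = n / 2;
    for (size_t r = 0; r < half; r++)
        for (size_t c = 0; c < half; c++) {
            const float *p = in + (2 * r) * n + 2 * c;
            out[r * half + c] = max2(max2(p[0], p[1]), max2(p[n], p[n + 1]));
        }
}

/* Reduce an n-by-n array to its maximum, ping-ponging between `data`
   and `scratch` (at least n*n/4 elements). Both buffers are overwritten. */
float max_reduce(size_t n, float *data, float *scratch)
{
    float *src = data, *dst = scratch;
    for (; n > 1; n /= 2) {
        max_reduce_pass(n, src, dst);
        float *tmp = src; src = dst; dst = tmp;
    }
    return src[0];
}
```

An n-by-n input needs log2(n) passes, and every pass except the last few keeps many processing elements busy, unlike the naive single-output approach.
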
21
Multigrid Transfers
  • Restriction
  • Interpolate values from fine into coarse array
  • Local neighborhood weighted gather on both CPU
    and GPU

22
Multigrid Transfers
  • Prolongation
  • Scatter values from coarse to fine with weighting
    stencil
  • Typical CPU implementation: loop over coarse
    array with stride-2 daxpys

23
Multigrid Transfers
  • Three cases
  • Fine node lies in the center of an element (4
    interpolants)
  • Fine node lies on the edge of an element (2
    interpolants)
  • Fine node lies on top of a coarse node (copy)
  • Reformulate scatter as gather for the GPU
  • Set fine array as output region
  • Sample with index offset 0.25

Same code for all three cases, no conditionals or
red-black map.
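
One possible reading of "sample with index offset 0.25" is sketched below as a CPU emulation; it is an interpretation, not necessarily the author's exact kernel. Each fine node maps its center into coarse coordinates and averages four nearest-neighbor samples taken 0.25 away in each direction. When the fine node sits on a coarse node, all four samples hit the same element (copy); on an edge they hit two elements twice each; in an element center they hit four different elements. The grid relation nf = 2*nc - 1, the row-major layout, and all names are assumptions.

```c
#include <stddef.h>

/* Nearest-neighbor fetch from an nc-by-nc row-major array: the float
   coordinate is truncated to the element that contains it. */
float fetch_nearest(const float *a, size_t nc, float s, float t)
{
    return a[(size_t)t * nc + (size_t)s];
}

/* Prolongation as a gather: one output per fine node, nf = 2*nc - 1. */
void prolongate(size_t nc, const float *coarse, float *fine)
{
    size_t nf = 2 * nc - 1;
    for (size_t r = 0; r < nf; r++) {
        for (size_t c = 0; c < nf; c++) {
            /* Fine element center mapped into coarse coordinates. */
            float s = 0.5f * ((float)c + 0.5f) + 0.25f;
            float t = 0.5f * ((float)r + 0.5f) + 0.25f;
            /* Four samples at offset 0.25, no case distinction needed. */
            fine[r * nf + c] = 0.25f *
                (fetch_nearest(coarse, nc, s - 0.25f, t - 0.25f) +
                 fetch_nearest(coarse, nc, s + 0.25f, t - 0.25f) +
                 fetch_nearest(coarse, nc, s - 0.25f, t + 0.25f) +
                 fetch_nearest(coarse, nc, s + 0.25f, t + 0.25f));
        }
    }
}
```
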
24
Conclusions
  • This is not as complicated as it might seem
  • Course notes online
  • http://www.mathematik.uni-dortmund.de/goeddeke/iccs
  • GPGPU community site: http://www.gpgpu.org
  • Developer information, lots of useful references
  • Paper archive
  • Help from real people in the GPGPU forums