Title: GPU Computation Strategies
1. GPU Computation Strategies & Tricks
- Ian Buck, Stanford University
2. DirectX or OpenGL?
- DirectX
- Render to Texture
- SetRenderTarget()
- No float targets on NV3x
- Write once run anywhere
- DBMON
- Short programs
- Only 96 instructions required
- ps_2_a compiler target allows long programs on NV3x
- Readback is slow!
- 50 MB/sec
- OpenGL
- 0 to N texture addressing
- GL_TEXTURE_RECTANGLE_EXT
- Readback is fast
- Render to Texture not finalized
- Pbuffer rendering can be slow
- SuperBuffers
- GL_EXT_render_target
- Specialized float formats for ATI and NV
- No ARB standard for creating float Pbuffer
- ATI float2: Red and Alpha
- NV float2: Red and Green
3. ATI Radeon 9800XT or NVIDIA GeForce 5900 Ultra?
[Figure: Instruction Timings]
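4. Floating Point Precision
- NVIDIA FP32
- s23e8 (integers exact up to 2^24 = 16,777,216; first uncountable integer is 16,777,217)
- ATI 24-bit float
- s16e7 (integers exact up to 2^17 = 131,072; first uncountable integer is 131,073)
- NVIDIA FP16
- s10e5 (integers exact up to 2^11 = 2,048; first uncountable integer is 2,049)
- Bit layout: s (sign) | exponent | mantissa
- value = sign * 1.mantissa * 2^(exponent - bias)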
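The counting limits quoted on this slide follow directly from the mantissa width. The C sketch below shows one way to derive them and to observe the FP32 limit on a CPU. The helper names `first_uncountable` and `through_fp32` are hypothetical and not part of any GPU API; the sketch assumes the host `float` type is IEEE-754 single precision.

```c
#include <stdint.h>

/* Hypothetical helper: a format with `mantissa_bits` stored mantissa bits
 * (plus one implicit leading 1) represents every integer up to
 * 2^(mantissa_bits + 1) exactly. The next integer is the first one lost. */
static uint32_t first_uncountable(unsigned mantissa_bits)
{
    return (UINT32_C(1) << (mantissa_bits + 1)) + 1u;
}

/* Round-trips an integer through single precision (s23e8).
 * Assumes the host float is IEEE-754 binary32. */
static uint32_t through_fp32(uint32_t n)
{
    volatile float f = (float)n;  /* volatile: force a real 32-bit store */
    return (uint32_t)f;
}
```

Calling `first_uncountable` with 23, 16, and 10 mantissa bits reproduces the three numbers on the slide, and `through_fp32(16777217u)` does not come back unchanged.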
5. Floating Point Precision
- Common Mistake
- Pack large 1D array in 2D texture
- Compute 1D address in shader
- Convert 1D address into 2D
- FP precision will leave unaddressable texels!
- First 1D address that cannot be represented exactly:
- NVIDIA FP32: 16,777,217
- ATI 24-bit float: 131,073
- NVIDIA FP16: 2,049
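The mistake described above can be reproduced on a CPU. The following C sketch converts a 1D index to a 2D texel address twice: once through float arithmetic, as a shader would, and once with exact integers. It uses FP32 only because C provides that type natively; the same collapse happens far earlier with the 24-bit and 16-bit formats. The `Texel` type and both function names are hypothetical, chosen for illustration.

```c
#include <stdint.h>

typedef struct { uint32_t x, y; } Texel;

/* Mimics shader-style arithmetic: the 1D index lives in a float, so
 * indices beyond the exact-integer range collapse onto a neighbor. */
static Texel address_via_fp32(uint32_t index, uint32_t width)
{
    volatile float i = (float)index;                 /* precision lost here */
    volatile float row = (float)(uint32_t)(i / (float)width); /* floor for i >= 0 */
    volatile float col = i - row * (float)width;
    Texel t = { (uint32_t)col, (uint32_t)row };
    return t;
}

/* Exact reference using integer arithmetic. */
static Texel address_exact(uint32_t index, uint32_t width)
{
    Texel t = { index % width, index / width };
    return t;
}
```

With a width of 4096, index 16,777,217 should map to column 1 of row 4096, but the float path lands on column 0, so the texel at column 1 can never be addressed.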
6. Multiple Outputs
- Hardware supported multiple outputs
- Not as fast as you think
[Figure: multiple-output performance on ATI 9800XT]
7. Multiple Outputs
- Software solution
- Let cgc or fxc do dead code elimination
- Can be a good idea if the shader is separable
kernel void foo (float3 a<>, float3 b<>, ...,
                 out float3 x<>, out float3 y<>)
kernel void foo1(float3 a<>, float3 b<>, ...,
                 out float3 x<>)
kernel void foo2(float3 a<>, float3 b<>, ...,
                 out float3 y<>)
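To illustrate what "separable" means here, the C sketch below mirrors the three kernel signatures from the slide. The kernel bodies are invented for illustration (the slide does not show them): `x` is a component-wise sum and `y` a component-wise product, so neither output depends on work done for the other. In that case each single-output version is simply the original with one output removed, which is what a compiler's dead code elimination produces.

```c
typedef struct { float x, y, z; } float3;

/* Hypothetical per-output computations; the slide does not specify them. */
static float3 add3(float3 a, float3 b)
{
    float3 r = { a.x + b.x, a.y + b.y, a.z + b.z };
    return r;
}

static float3 mul3(float3 a, float3 b)
{
    float3 r = { a.x * b.x, a.y * b.y, a.z * b.z };
    return r;
}

/* Two-output kernel: needs hardware multiple outputs in one pass. */
static void foo(float3 a, float3 b, float3 *x, float3 *y)
{
    *x = add3(a, b);
    *y = mul3(a, b);
}

/* Single-output versions: one pass each, the unused output eliminated. */
static void foo1(float3 a, float3 b, float3 *x) { *x = add3(a, b); }
static void foo2(float3 a, float3 b, float3 *y) { *y = mul3(a, b); }
```

If the two outputs shared an expensive intermediate result, splitting would compute it twice, which is why the approach pays off mainly for separable shaders.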
8. Scatter Techniques
- Problem: a[i] = p
- Indirect write
- Can't set the x,y of fragment in pixel shader
- Also want to do a[i] += p
9. Scatter Techniques
- Solution 1
- Sort & Search
- Shader outputs destination address and data
- Bitonic sort based on address
- Run binary search shader over destination buffer
- Each fragment searches for source data
- See Sorting and Searching course notes
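The sort-and-search steps above turn a scatter into a gather. The C sketch below emulates that flow on the CPU under stated assumptions: `qsort` stands in for the GPU bitonic sort, the loop over destinations stands in for one fragment per output texel, and every address appears at most once. `ScatterRecord` and `scatter_by_sort_search` are hypothetical names.

```c
#include <stdlib.h>

/* What the first shader pass would emit: a destination address and a value. */
typedef struct { int address; float data; } ScatterRecord;

static int by_address(const void *l, const void *r)
{
    const ScatterRecord *a = (const ScatterRecord *)l;
    const ScatterRecord *b = (const ScatterRecord *)r;
    return (a->address > b->address) - (a->address < b->address);
}

/* Emulates a[address] = data without any indirect write. */
static void scatter_by_sort_search(ScatterRecord *records, size_t n,
                                   float *dest, size_t dest_len)
{
    /* Step 1: sort by address (qsort replaces the bitonic sort here). */
    qsort(records, n, sizeof *records, by_address);

    /* Step 2: each destination element searches for its own source data. */
    for (size_t i = 0; i < dest_len; ++i) {
        ScatterRecord key = { (int)i, 0.0f };
        const ScatterRecord *hit = (const ScatterRecord *)
            bsearch(&key, records, n, sizeof *records, by_address);
        if (hit)
            dest[i] = hit->data;  /* destinations with no record stay untouched */
    }
}
```

Each destination performs a logarithmic number of reads instead of the source performing one write, which is the trade the technique makes to stay within what a fragment program can do.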
10. Scatter Techniques
- Solution 2
- Render points
- Use vertex shader to set destination
- or just read back the data and reissue
11. Scatter Techniques
- Solution 3
- Vertex Textures
- Render data and address to texture
- Issue points, set point x,y in vertex shader using address texture
- Requires texld instruction in vertex program
12. Conditional Mask
- How to efficiently implement if (a) then c = b
- Kill instruction or LRP c, a, b, c
- Executes all conditional code
- Using early Z-kill
- Set Zbuffer equal to conditional
- Z test can prevent shader execution
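The LRP form of the conditional can be written out in C to show why it "executes all conditional code". The sketch assumes the ARB fragment program definition of LRP, where `LRP d, a, b, c` computes `a * b + (1 - a) * c`, and assumes the condition `a` is exactly 0 or 1. The function name `lrp` is only an illustration.

```c
/* LRP d, a, b, c  ->  d = a * b + (1 - a) * c
 * With a in {0, 1} this selects b when a is 1 and keeps c when a is 0.
 * Both b and c must already be computed, so no work is skipped. */
static float lrp(float a, float b, float c)
{
    return a * b + (1.0f - a) * c;
}
```

For example, `c = lrp(a, b, c)` reproduces `if (a) c = b` for every fragment, at the cost of always evaluating `b`.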
13. Conditional Mask
- Using early Z-kill
- Z-kill operates at 4x4 block resolution
- Good only if locality in conditional
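The locality requirement can be made concrete with a small CPU model. The C sketch below assumes, as a simplification of real hardware, that a 4x4 block is skipped only when every pixel in it fails the condition. The function name `skippable_blocks` and the mask layout (row-major, nonzero meaning "run the shader") are assumptions for illustration.

```c
#include <stddef.h>

/* Counts 4x4 blocks in which no pixel needs the shader, i.e. blocks that
 * a block-granular early Z test could reject as a whole. */
static size_t skippable_blocks(const unsigned char *mask,
                               size_t width, size_t height)
{
    size_t skipped = 0;
    for (size_t by = 0; by < height; by += 4) {
        for (size_t bx = 0; bx < width; bx += 4) {
            int any_active = 0;
            for (size_t y = by; y < by + 4 && y < height; ++y)
                for (size_t x = bx; x < bx + 4 && x < width; ++x)
                    if (mask[y * width + x])
                        any_active = 1;
            if (!any_active)
                ++skipped;
        }
    }
    return skipped;
}
```

On an 8x8 image with half of the pixels active, a mask covering the left half lets two of the four blocks be skipped, while a checkerboard mask with the same number of active pixels lets none be skipped.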
14. Optimizing Execution
- Two methods for GPGPU shader execution
- Quad covering the output region:
glBegin(GL_QUADS);
glVertex2f(left, bottom);
glVertex2f(right, bottom);
glVertex2f(right, top);
glVertex2f(left, top);
glEnd();
- One oversized triangle clipped by the viewport (faster, higher observed bandwidth):
glViewport(0, 0, width, height);
glBegin(GL_TRIANGLES);
glVertex2f(0, 0);
glVertex2f(width*2, 0);
glVertex2f(0, height*2);
glEnd();
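The single-triangle method works because a triangle with corners (0, 0), (2 * width, 0), and (0, 2 * height) fully contains the width by height output rectangle: its hypotenuse passes exactly through (width, height). The C sketch below checks that geometry on the CPU; `inside_big_triangle` is a hypothetical helper, not an OpenGL call.

```c
/* Returns 1 if (px, py) lies inside or on the triangle with corners
 * (0, 0), (2w, 0), (0, 2h). The hypotenuse is x / (2w) + y / (2h) = 1. */
static int inside_big_triangle(float px, float py, float w, float h)
{
    return px >= 0.0f && py >= 0.0f &&
           (px / (2.0f * w) + py / (2.0f * h)) <= 1.0f;
}
```

All four corners of the output rectangle pass the test, so every pixel in the viewport receives a fragment, and the viewport discards the part of the triangle that lies outside it.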
15. Performance Issues
16. Performance Issues
- NV3x Register Penalty
- The more registers used in a shader, the slower the shader executes
- 3-4 registers: 2x slower
- 5-6 registers: 3x slower
- 7-8 registers: 4x slower
- 9-12 registers: 6x slower
- 13-16 registers: 8x slower
- 17-24 registers: 12x slower
- 25-32 registers: 16x slower
- Compiler / driver will try to minimize register usage.
- General rule: the more state in your program, the slower the execution
17. Performance Issues
- Floating Point Texture Bandwidth
- Observed Results
- GeForce 5900 Ultra
- Cache 11.08 GB/sec
- Sequential 4.40 GB/sec
- Random 0.76 GB/sec
- ATI 9800 XT (24-bit)
- Cache 9.15 GB/sec
- Sequential 5.55 GB/sec
- Random 1.80 GB/sec
- Big Penalty for Random Access!
18. Performance Issues
- WinXP Float4 Download and Readback
- NVIDIA
- 1215 MB/sec texture download
- 221 MB/sec glReadPixels rate
- ATI
- 926 MB/sec texture download
- 180 MB/sec glReadPixels rate
- Readback should be faster!
- 680 MB/sec ATI Linux Readback