Title: GPU Data Formatting and Addressing
1. GPU Data Formatting and Addressing
- Aaron Lefohn, University of California, Davis
2. Overview
- GPU Memory Model
- GPU-Based Data Structures
- Performance Considerations
3. GPU memory model
- GPU Data Storage
- Vertex data
- Texture data
- Frame buffer
[Figure: GPU memory regions: Vertex Data, Texture Data, Frame Buffer(s); annotation: PS3.0 GPUs]
4. GPU memory model
- Read-Only
- Traditional use of GPU memory
- CPU writes, GPU reads
- Read/Write
- Save frame buffer(s) for later use as texture or vertex array
- Save up to 16 32-bit floating-point values per pixel
- Multiple Render Targets (MRTs)
5. How to Save Render Result
- Copy framebuffer result to other GPU memory
- Copy-to-texture
- Copy-to-vertex-array
- Write directly to other GPU memory
- Render-to-texture
- Render-to-vertex-array
6. OpenGL GPU Memory Writes
- Texture
- Copy frame buffer to texture
- Render-to-texture
- WGL_ARB_render_texture
- GL_EXT_render_target
- Superbuffers
- Vertex Array
- Copy frame buffer to vertex array
- GL_EXT_pixel_buffer_object
- Superbuffers
- Render-to-vertex-array
- Superbuffers
7. Render-To-Texture 1
- Copy-To-Texture
- Good
- Cross-Platform texture writes
- Flexible output
- 2D output can be copied to a 1D, 2D, or 3D texture
- Bad
- Slow
- Consumes internal GPU memory bandwidth
8. Render-To-Texture 2
- WGL_ARB_render_texture
- Render-to-texture (RTT) using pbuffers
- http://oss.sgi.com/projects/ogl-sample/registry/ARB/wgl_render_texture.txt
- Good
- Fast RTT
- Current state of the art for RTT
- Bad
- Only works on Windows
- Slow OpenGL context switches
- Many hacks to avoid this bottleneck
9. Render-To-Texture 3
- GL_EXT_render_target
- Proposed extension for cross-platform RTT
- http://www.opengl.org/resources/features/GL_EXT_render_target.txt
- Good
- Cross-platform, efficient RTT solution
- Lightweight, simple extension
- Bad
- Specification not approved (April 24, 2004)
- No implementations exist (April 24, 2004)
10. Render-To-Texture 4
- Superbuffers
- Proposed new memory model for GPUs
- http://www.ati.com/developer/gdc/SuperBuffers.pdf
- Good
- Unified GPU memory model
- Render to any GPU memory
- Cross platform (OpenGL owns memory, not OS)
- Mix-and-match depth/stencil/color buffers
- Bad
- Large, complex extension
- Specification not approved (April 24, 2004)
- Only driver support is alpha version (ATI)
11. Render-To-Texture Summary
- OpenGL RTT Currently Only Under Windows
- Pbuffers
- Complex and awkward RTT mechanism
- Current state of the art
- Cross-Platform RTT Coming Soon
12. Render-To-Vertex-Array 1
- GL_EXT_pixel_buffer_object
- Copy framebuffer to vertex buffer object
- http://developer.nvidia.com/object/nvidia_opengl_specs.html
- Good
- Only GPU/AGP memory bandwidth
- Works with current drivers (NVIDIA)
- Bad
- No direct render-to-vertex-array (slower than true RTVA)
- No ATI implementation
13. Render-To-Vertex-Array 2
- Superbuffers
- Write to memory object as render target
- Read from memory object as vertex array
- Good
- Direct render-to-vertex-array (fast)
- Bad
- Can render results always be interpreted as vertex data?
- Large, complex, unapproved extension
14. Render-To-Vertex-Array Summary
- Current OpenGL Support
- NVIDIA: GL_EXT_pixel_buffer_object
- ATI: Superbuffers
- Semantics Still Under Development
15. Fbuffer: Capturing Fragments
- Idea
- Rasterization-Order FIFO Buffer
- Render results are fragment values instead of pixel values
- Mark and Proudfoot, Graphics Hardware 2001
- http://graphics.stanford.edu/projects/shading/pubs/hwws2001-fbuffer/
- Uses
- Designed for multi-pass rendering with transparent geometry
- New possibilities for GPGPU?
- Varying number of results per pixel
- RTT and RTVA with an fbuffer?
16. Fbuffer: Capturing Fragments
- Implementations
- ATI Radeon 9800 and newer ATI GPUs
- Not yet exposed to user (ask for it!)
- Problems
- Size of fbuffer is not known before rendering
- GPUs cannot perform dynamic memory allocation
- How to handle buffer overflow?
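The overflow problem above can be made concrete with a small CPU-side model. This is only a sketch: the class name `FBuffer`, the fixed `capacity`, and the "drop the fragment and set a flag" policy are all assumptions chosen for illustration; the slides leave the overflow question open.

```python
class FBuffer:
    """Toy rasterization-order FIFO with a fixed capacity (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity      # size must be chosen before rendering
        self.fragments = []           # (pixel, value) pairs in arrival order
        self.overflowed = False

    def append(self, pixel, value):
        """Store one fragment; return False if the buffer is already full."""
        if len(self.fragments) >= self.capacity:
            # Assumed policy: drop the fragment and record the overflow.
            self.overflowed = True
            return False
        self.fragments.append((pixel, value))
        return True
```

Unlike a frame buffer, the model keeps every fragment, so one pixel can contribute several entries; that is also why the required size cannot be known before rendering.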
17. Overview
- GPU Memory Model
- GPU-Based Data Structures
- Performance Considerations
18. GPU-Based Data Structures
- Building Blocks
- GPU memory addresses
- Address Generation
- Address Use
- Pointers
- Multi-dimensional arrays
- Sparse representations
19. GPU Memory Addresses
- Where Are Addresses Generated?
- CPU: vertex stream or textures
- Vertex processor: input stream, ALU ops, or textures
- Rasterizer: interpolation
- Fragment processor: input stream, ALU ops, or textures
20. GPU Memory Addresses
- Where Are Addresses Used?
- Vertex textures (PS3.0 GPUs)
- Fragment textures
[Figure labels: Texture Data, Vertex Processor]
21. GPU Memory Addresses
- Pointers
- Store addresses in texture
- Dependent texture read
- Example: see Tim Purcell's ray tracing talk
- float2 addr = tex2D( addrTex, texCoord );
- float2 data = tex2D( dataTex, addr );
[Figure: Address Texture whose texels store addresses into a Data Texture]
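A CPU-side sketch of the dependent read may help make the pointer idea concrete. It assumes textures are modeled as nested Python lists indexed `[row][column]` with integer texel coordinates; `tex2d`, `addr_tex`, `data_tex`, and the sample values are illustrative stand-ins, not a GPU API or data from the talk.

```python
def tex2d(texture, coord):
    """Fetch one texel at integer (x, y) coordinates (nearest lookup)."""
    x, y = coord
    return texture[y][x]

# Address texture: each texel stores an (x, y) address into the data texture.
addr_tex = [
    [(0, 1), (1, 1)],
    [(1, 0), (1, 1)],
]

# Data texture: each texel stores a payload value.
data_tex = [
    ["A", "B"],
    ["C", "D"],
]

def dependent_read(tex_coord):
    """Two chained fetches: read a pointer, then dereference it."""
    addr = tex2d(addr_tex, tex_coord)   # first read yields an address
    return tex2d(data_tex, addr)        # second read depends on the first
```

Two address texels may point at the same data texel, which is what lets a texture of addresses act like an array of pointers.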
22. GPU-Based Data Structures
- Building Blocks
- GPU memory addresses
- Address Generation
- Address Use
- Pointers
- Multi-dimensional arrays
- Sparse representations
23. Multi-Dimensional Arrays
- Build Data Structures in 2D Memory
- Read/Write GPU memory optimized for 2D
- Images
- But Isn't Physical Memory 1D?
- GPU memory hierarchy optimized to capture 2D locality
- Rasterization
- Texture filtering
- Igehy, Eldridge, Proudfoot, "Prefetching in a Texture Cache Architecture," Graphics Hardware, 1998
- Conclusion: Use illusion of 2D physical memory
24. GPU Arrays
- Large 1D Arrays
- Current GPUs limit 1D array sizes to 2048 or 4096
- Pack into 2D memory
- 1D-to-2D address translation
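A minimal sketch of 1D-to-2D address translation, assuming a row-major layout and integer texel coordinates. The function names and the `fill` padding value are illustrative choices, not part of the talk.

```python
def addr_1d_to_2d(index, width):
    """Map a 1D element index to (x, y) in a row-major 2D texture."""
    return index % width, index // width

def addr_2d_to_1d(x, y, width):
    """Inverse mapping: recover the 1D index from a 2D texel address."""
    return y * width + x

def pack_1d(values, width, fill=0.0):
    """Copy a 1D array into a 2D texture, padding the last row with fill."""
    height = -(-len(values) // width)  # ceiling division
    texture = [[fill] * width for _ in range(height)]
    for i, v in enumerate(values):
        x, y = addr_1d_to_2d(i, width)
        texture[y][x] = v
    return texture
```

With a width of 2048, element 5000 lands at texel (904, 2), so an array far longer than the 1D limit still fits inside one 2D texture.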
25. GPU Arrays
- 3D Arrays
- Problem
- GPUs do not have 3D frame buffers
- No RTT to slice of 3D texture (except Superbuffers)
- Solutions
- Stack of 2D slices
- Multiple slices per 2D buffer
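The "multiple slices per 2D buffer" option can be sketched as a tiling of z-slices across one large 2D buffer. The layout below (slices placed left to right, then top to bottom) and the parameter names `slice_w`, `slice_h`, and `tiles_per_row` are assumptions for illustration; other tilings work equally well.

```python
def addr_3d_to_2d(x, y, z, slice_w, slice_h, tiles_per_row):
    """Map voxel (x, y, z) to a texel in a 2D buffer that tiles the z-slices."""
    tile_x = z % tiles_per_row       # column of the tile holding slice z
    tile_y = z // tiles_per_row      # row of the tile holding slice z
    return tile_x * slice_w + x, tile_y * slice_h + y

def addr_2d_to_3d(u, v, slice_w, slice_h, tiles_per_row):
    """Inverse mapping: recover (x, y, z) from a flattened texel address."""
    z = (v // slice_h) * tiles_per_row + (u // slice_w)
    return u % slice_w, v % slice_h, z
```

Neighbors in x and y stay adjacent in the 2D buffer, while neighbors in z sit one tile away, so z-neighbor lookups need the address translation.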
26. GPU Arrays
- Problems With 3D Arrays for GPGPU
- Cannot read stack of 2D slices as 3D texture
- Must know which slices are needed in advance
- Visualization of 3D data difficult
- Solutions
- Need render-to-slice-of-3D-texture (Superbuffers)
- Volume rendering of slice-based 3D data
- Course 28, Real-Time Volume Graphics, Siggraph 2004
27. GPU Arrays
- Higher Dimensional Arrays
- Pack into 2D buffers
- N-D to 2D address translation
- Same problems as 3D arrays if data does not fit in a single 2D texture
- Conclusions
- Fundamental GPU memory primitive is a fixed-size 2D array
- GPGPU needs more general memory model
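One way to sketch N-D to 2D address translation is to linearize the N-D index in row-major order and then wrap the linear index into a 2D buffer. The function name, the row-major choice, and the `width` parameter are assumptions for illustration.

```python
def addr_nd_to_2d(index, shape, width):
    """Map an N-D index to (x, y) in a 2D buffer of the given width."""
    linear = 0
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError("index out of range")
        linear = linear * n + i      # row-major linearization
    return linear % width, linear // width
```

For a 4D array of shape (2, 3, 4, 5), index (1, 2, 3, 4) linearizes to 119 and lands at texel (7, 7) in a buffer 16 texels wide.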
28. GPU-Based Data Structures
- Building Blocks
- GPU memory addresses
- Address Generation
- Address Use
- Pointers
- Multi-dimensional arrays
- Sparse representations
29. Sparse Data Structures
- Why Sparse Data Structures?
- Reduce computational workload
- Reduce memory pressure
- Examples
- Sparse matrices
- Krueger et al., Siggraph 2003
- Bolz et al., Siggraph 2003
- Implicit surface computations (sparse volumes)
- Sherbondy et al., IEEE Visualization 2003
- Lefohn et al., IEEE Visualization 2003
- Premoze et al., Eurographics 2003
30. Sparse Computation
- Option 1: Store Complete Data Set on GPU
- Cull unused data
- Conditional execution tricks (discussed earlier)
- Option 2: Store Only Sparse Data on GPU
- Saves memory
- Potentially much faster than culling
- Much more complicated (especially if time-varying)
31. Sparse Data Structures
- Basic Idea
- Pack active data elements into GPU memory
- For more information
- Linear algebra section in this course: static structures
- Level-set case study in this course: dynamic structures
32. Sparse Data Structures
- Addressing Sparse Data
- Neighborhoods no longer implicitly defined on grid
- Use pointer-based data structures to locate neighbors
- Pre-compute neighbor addresses if possible
- Use CPU or vertex processor
- Removes pointer dereference from fragment program
- Separate common addressing case from boundary conditions
- Common case must be cache coherent
- See Harris and Lefohn case studies for substream technique
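The packing and neighbor pre-computation described above can be sketched on the CPU as follows. This is an illustrative model only: the slot assignment by sorted order, the 4-neighborhood, and the `boundary_slot` sentinel are assumptions, not the layout used in the cited case studies.

```python
def pack_active(active_cells):
    """Assign each active (x, y) grid cell a slot in a packed buffer."""
    return {cell: slot for slot, cell in enumerate(sorted(active_cells))}

def precompute_neighbors(cell_to_slot, boundary_slot):
    """Build a per-slot table of neighbor slots (left, right, down, up).

    Neighbors that are not active map to boundary_slot, which is assumed to
    hold a boundary-condition value.
    """
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    table = [None] * len(cell_to_slot)
    for (x, y), slot in cell_to_slot.items():
        table[slot] = [
            cell_to_slot.get((x + dx, y + dy), boundary_slot)
            for dx, dy in offsets
        ]
    return table
```

Because the table is built ahead of time, the per-element kernel only follows stored addresses and never has to search for its neighbors.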
33. Overview
- GPU Memory Model
- GPU-Based Data Structures
- Performance Considerations
34. Memory Performance Issues
- Pbuffer Survival Guide
- Dependent Texture Costs
- Computational Frequency
35. Pbuffer Survival Guide
- Pbuffers Give Us Render-To-Texture
- Designed to create an environment map or two
- Never intended to be used for GPGPU (100s of pbuffers)
- Problem
- Each pbuffer has its own OpenGL render context
- Each pbuffer may have depth and/or stencil buffer
- Changing OpenGL contexts is slow
- Solution
- Many optimizations to avoid this bottleneck
36. Pbuffer Survival Guide
- Pack Scalar Data Into RGBA
- > 4x memory savings
- 4x reduction in context switches
- Be careful of read-modify-write hazard
[Figure: scalar data in 4 RGBA pbuffers vs. 1 RGBA pbuffer]
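One possible packing, sketched on the CPU: four equally sized scalar fields are interleaved so that each RGBA texel carries one value from each field. The function names and the field-per-channel layout are assumptions for illustration; packing four neighboring elements of a single field into one texel is an equally valid layout.

```python
def pack_fields_rgba(r, g, b, a):
    """Interleave four equally sized scalar fields into a list of RGBA texels."""
    if not (len(r) == len(g) == len(b) == len(a)):
        raise ValueError("all four fields must have the same length")
    return list(zip(r, g, b, a))

def unpack_channel(texels, channel):
    """Extract one scalar field (channel 0..3) from the packed texels."""
    return [texel[channel] for texel in texels]
```

Since all four fields share one buffer, a pass that updates one channel while reading another from the same texels is where the read-modify-write hazard mentioned above can arise.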
37. Pbuffer Survival Guide
- Use Multi-Surface Pbuffers
- Each RGBA surface is its own render-texture
- Front, Back, AuxN (N = 0, 1, 2, ...)
- Greatly reduces context switches
- Technically illegal, but blessed by ATI. Works on NVIDIA.
[Figure: 1 pbuffer with 5 RGBA surfaces vs. 5 pbuffers with 1 RGBA surface each]
38. Pbuffer Survival Guide
- Using Multi-Surface Pbuffers
- Allocate double-buffered pbuffer (and/or with AUX buffers)
- Set render target to back buffer
- glDrawBuffer(GL_BACK)
- Bind front buffer as texture
- wglBindTexImageARB(hpbuffer, WGL_FRONT_LEFT_ARB)
- Render
- Switch buffers
- wglReleaseTexImageARB(hpbuffer, WGL_FRONT_LEFT_ARB)
- glDrawBuffer(GL_FRONT)
- wglBindTexImageARB(hpbuffer, WGL_BACK_LEFT_ARB)
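The buffer-switching steps above amount to a ping-pong between two surfaces of one pbuffer. The following CPU-side model illustrates only the bookkeeping; the class `DoubleBufferedPbuffer` and its methods are hypothetical and stand in for the `glDrawBuffer` and `wglBindTexImageARB` calls rather than wrapping them.

```python
class DoubleBufferedPbuffer:
    """Toy model: two surfaces, one drawn to and one bound as a texture."""

    def __init__(self, size, initial=0.0):
        self.surfaces = {"front": [initial] * size, "back": [initial] * size}
        self.draw = "back"      # models the current draw buffer
        self.bound = "front"    # models the surface bound as a texture

    def render(self, kernel):
        """Run one pass: read the bound surface, write the draw surface."""
        src = self.surfaces[self.bound]
        dst = self.surfaces[self.draw]
        for i in range(len(dst)):
            dst[i] = kernel(src, i)

    def swap(self):
        """Switch roles, like releasing one surface and binding the other."""
        self.draw, self.bound = self.bound, self.draw

    def result(self):
        """After a swap, the latest output is the surface bound as texture."""
        return self.surfaces[self.bound]
```

A pass never reads and writes the same surface, and the swap changes only which surface plays which role, so no OpenGL context switch is modeled.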
39. Pbuffer Survival Guide
- Pack 2D domains into large buffer
- Flat 3D textures
- Be careful of read-modify-write hazard
[Figure: 3D volume and its flattened 2D layout]
40. Dependent Texture Costs
- Cache Coherency
- Dependent reads fast if they hit cache
- Even chained dependencies can be same speed as non-dependent reads
- Very slow if out of cache
- Example
- 3 levels of dependent cache misses can be >10x slower
- More detail in "GPU Computation Strategies and Tricks"
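The effect of coherence on dependent reads can be illustrated with a toy cache model. Everything here is an assumption made for illustration: the 4x4 texel blocks, the 8-entry LRU cache, and the 64x64 texture do not describe any real GPU, and the model counts misses rather than measuring time.

```python
import random
from collections import OrderedDict

def count_misses(addresses, block=4, capacity=8):
    """Count misses for a sequence of (x, y) reads through a small LRU cache
    whose lines are block x block tiles of texels."""
    cache = OrderedDict()
    misses = 0
    for x, y in addresses:
        key = (x // block, y // block)
        if key in cache:
            cache.move_to_end(key)         # mark tile as most recently used
        else:
            misses += 1
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used tile
    return misses

def coherent_addresses(size=64, block=4):
    """Visit every texel tile by tile, so consecutive reads share a tile."""
    return [(bx + dx, by + dy)
            for by in range(0, size, block)
            for bx in range(0, size, block)
            for dy in range(block)
            for dx in range(block)]

def scattered_addresses(size=64, block=4, seed=0):
    """Visit the same texels in a shuffled, incoherent order."""
    addrs = coherent_addresses(size, block)
    random.Random(seed).shuffle(addrs)
    return addrs
```

Comparing `count_misses(coherent_addresses())` with `count_misses(scattered_addresses())` shows the same set of reads missing far more often when the addresses stored in the address texture jump around.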
41. Computational Frequency
- Compute Memory Addresses at Low Frequency
- Compute memory addresses in vertex program
- Let rasterizer interpolation create per-fragment addresses
- Compute neighbor addresses this way
- Avoid fragment-level address computation whenever possible
- Consumes fragment instructions
- Computation often redundant with neighboring fragments
- May defeat texture pre-fetch
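A 1D sketch of why low-frequency address computation works: adding a constant neighbor offset at the two vertices and interpolating gives the same addresses as interpolating first and adding the offset per fragment. The function names, the fragment-center sampling at `(i + 0.5) / n`, and the 1D simplification are assumptions for illustration.

```python
def lerp(a, b, t):
    """Linear interpolation, standing in for the rasterizer."""
    return a + (b - a) * t

def per_fragment_addresses(x0, x1, n, offset):
    """Offset added once per fragment (n additions in the fragment stage)."""
    return [lerp(x0, x1, (i + 0.5) / n) + offset for i in range(n)]

def interpolated_addresses(x0, x1, n, offset):
    """Offset added once per vertex; interpolation produces the rest."""
    v0, v1 = x0 + offset, x1 + offset
    return [lerp(v0, v1, (i + 0.5) / n) for i in range(n)]
```

This holds because the offset is constant across the primitive; address computations that are not linear in position cannot be moved to the vertex stage this way.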
42. Conclusions
- GPU Memory Model Evolving
- Writable GPU memory forms loop-back in an otherwise feed-forward streaming pipeline
- Memory model will continue to evolve as GPUs become more general stream processors
- GPGPU Data Structures
- Basic memory primitive is limited-size, 2D texture
- Use address translation to fit all array dimensions into 2D
- Maintain 2D cache locality
- Render-To-Texture
- Use pbuffers with care and eagerly adopt their successor
43. Selected References
- J. Bolz, I. Farmer, E. Grinspun, P. Schröder, "Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid," SIGGRAPH 2003
- N. Goodnight, C. Woolley, G. Lewin, D. Luebke, G. Humphreys, "A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware," Graphics Hardware 2003
- M. Harris, W. Baxter, T. Scheuermann, A. Lastra, "Simulation of Cloud Dynamics on Graphics Hardware," Graphics Hardware 2003
- H. Igehy, M. Eldridge, K. Proudfoot, "Prefetching in a Texture Cache Architecture," Graphics Hardware 1998
- J. Krueger, R. Westermann, "Linear Algebra Operators for GPU Implementation of Numerical Algorithms," SIGGRAPH 2003
- A. Lefohn, J. Kniss, C. Hansen, R. Whitaker, "A Streaming Narrow-Band Algorithm: Interactive Deformation and Visualization of Level Sets," IEEE Transactions on Visualization and Computer Graphics 2004
44. Selected References
- A. Lefohn, J. Kniss, C. Hansen, R. Whitaker, "Interactive Deformation and Visualization of Level Set Surfaces Using Graphics Hardware," IEEE Visualization 2003
- W. Mark, K. Proudfoot, "The F-Buffer: A Rasterization-Order FIFO Buffer for Multi-Pass Rendering," Graphics Hardware 2001
- T. Purcell, C. Donner, M. Cammarano, H. W. Jensen, P. Hanrahan, "Photon Mapping on Programmable Graphics Hardware," Graphics Hardware 2003
- A. Sherbondy, M. Houston, S. Napel, "Fast Volume Segmentation With Simultaneous Visualization Using Programmable Graphics Hardware," IEEE Visualization 2003
45. OpenGL References
- GL_EXT_pixel_buffer_object, http://www.nvidia.com/dev_content/nvopenglspecs/GL_EXT_pixel_buffer_object.txt
- GL_EXT_render_target, http://www.opengl.org/resources/features/GL_EXT_render_target.txt
- OpenGL Extension Registry, http://oss.sgi.com/projects/ogl-sample/registry/
- Superbuffers, http://www.ati.com/developer/gdc/SuperBuffers.pdf
- WGL_ARB_render_texture, http://oss.sgi.com/projects/ogl-sample/registry/ARB/wgl_render_texture.txt and http://oss.sgi.com/projects/ogl-sample/registry/ARB/wgl_pbuffer.txt
46. Questions?
- Acknowledgements
- Cass Everitt, Craig Kolb, Chris Seitz, and Jeff Juliano at NVIDIA
- Mark Segal, Rob Mace, and Evan Hart at ATI
- GPGPU Siggraph 2004 course presenters
- Joe Kniss and Ross Whitaker
- Brian Budge
- John Owens
- National Science Foundation Graduate Fellowship
- Pixar Animation Studios