Global Illumination on the GPU: Lessons Learned

1
Global Illumination on the GPU: Lessons Learned
  • John C. Hart
  • University of Illinois

2
What do NVidia U?
  • It's all about porting and speed
  • How to implement on the GPU, especially NVidia's
  • How to make it run as fast (& competitive) as
    possible
  • Research papers leave out grodie details
  • Won't be important five years from now
  • But they are very important right now
  • This is where we discuss grodie details
  • Texture cache size, organization
  • Tricks that probably won't work next year

3
How to NVidia U.
  • Ask lots of questions
  • Chat between talks
  • Stuff you can know and stuff you can't know
  • NDA v. No @ Way
  • Don't expect to change the hardware
  • Dirty little secrets to getting code to run fast
  • Send interns
  • Computational Pantheism
  • Don't be religious, use whatever works best now
  • Windows, Linux, OpenGL, Direct3D

4
Local Illumination
  • GPU designed for efficient local illumination
    computations to make video games more interesting
  • Bump mapping, BRDF
  • Vertex shader processes per-vertex attributes
    (normal, texcoords, color)
  • Displacement mapping, skinning
  • Rasterization interpolates vertex attributes
    across pixels
  • Projective, perspective correct
  • Pixel shader computes colors from interpolated
    values and texture lookups
  • texture shading, perspective texturing

5
Modern GPU Org.
[Diagram: Geometry (vertex stream), Vertex Shader, Setup, Rasterization, Pixel Shader, Frame Buffer, with Texture Memory (Tex 0, Tex 1, Tex 2) feeding the Pixel Shader; units labeled 1-4 appear four times]
6
Global Illumination
  • Energy transport across all paths from source to
    eye
  • Ray tracing, radiosity, path tracing,
    bidirectional ray tracing, irradiance caching,
    photon mapping, subsurface scattering
  • Often broken into stages
  • Precomputation (e.g. form factors, radiosity
    solution)
  • Display (e.g. render colored patches)
  • Storage/query format
  • Ray: ray tracing
  • Point-to-point: radiosity, subsurface scattering
  • Point-sphere: photon map, PRT

7
Global Illumination on GPU
  • GPU not designed for global illumination
  • OpenGL (w/shadow maps, env. maps) can yield
    L → DS → E path
  • Global illumination approximated and precomputed
  • Cass Everitt's Dueling Frusta
  • Shadow map resolution needs to be view dependent

8
Environment Map Problems
  • Environment maps provide precomputed global
    illumination lookup
  • Approximate
  • Reflection from behind boat
  • boat doesn't meet reflection
  • Sampled
  • aliasing where magnified

9
Ray Tracing
  • Uses GPU to intersect rays with triangles
  • Turns Geometry Engine into a Ray Engine
  • Carr et al., GH02

10
GPU Ray-Tri Intersect
[Diagram: GPU ray-triangle intersection]
Vertex Shader (prep)
Shared Quad-Vertex Attributes (Δ data): vertex, normal, edge0, edge1, ID (color)
Rasterization (dist. Δ data across pixels)
Texture Memory: Ray Origins, Ray Directions
Pixel Shader (ray-Δ intersection)
Z-buffer holds t-values; frame buffer holds triangle ID
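
A CPU-side sketch of what one pass of this pipeline computes, for illustration only. The slides do not say which intersection formula the pixel shader used; the Möller-Trumbore test is swapped in here as an assumption, and the names `intersect` and `ray_engine_pass` are hypothetical. Each ray plays the role of a pixel, the t-buffer plays the role of the Z-buffer, and the ID buffer plays the role of the frame buffer.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def intersect(origin, direction, v0, edge0, edge1, eps=1e-9):
    """Moller-Trumbore test (an assumed stand-in for the shader's math).
    Returns the ray parameter t of the hit, or None on a miss."""
    p = cross(direction, edge1)
    det = dot(edge0, p)
    if abs(det) < eps:            # ray parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv           # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, edge0)
    v = dot(direction, q) * inv   # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(edge1, q) * inv
    return t if t > eps else None

def ray_engine_pass(rays, triangles):
    """Intersect every ray with every triangle, keeping the nearest hit."""
    t_buffer = [float("inf")] * len(rays)   # "Z-buffer": nearest t per ray
    id_buffer = [None] * len(rays)          # "frame buffer": triangle ID per ray
    for tri_id, (v0, v1, v2) in enumerate(triangles):
        edge0, edge1 = sub(v1, v0), sub(v2, v0)    # per-triangle prep
        for i, (origin, direction) in enumerate(rays):
            t = intersect(origin, direction, v0, edge0, edge1)
            if t is not None and t < t_buffer[i]:  # depth test
                t_buffer[i] = t
                id_buffer[i] = tri_id
    return t_buffer, id_buffer
```

The outer loop over triangles corresponds to drawing one screen-filling quad per triangle; the inner loop is what the GPU runs in parallel across pixels.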
11
What Doesn't Work
  • Rays in vertex stream, triangles in texture
  • Rays 5D, triangles 9D, more attributes than
    textures
  • Ray can be held in two textures
  • 10-bit RGB anchors texture
  • 16-bit XY directions texture (bump map texture)
  • Ray-triangle intersection in vertex shader
  • Loses benefit of rasterization crossbar
  • Need to store rays in constant registers
  • Only 4.1M ray-triangle intersections per sec.
  • In general vertex shader not much faster than CPU
  • Vertex shader allows CPU to focus on other tasks

12
Quick & Dirty
  • Implemented on ATI Radeon 8500, DX 8.1
  • Pixel Shaders 1.4
  • 16-bit fixed point
  • 114M ray-triangle ints/s
  • Much faster than best single CPU time (20-40M on
    800MHz P3, Wald et al. EGRW01)
  • Expect gap to widen further
  • Problem: Don't want to intersect all rays with
    all tris

13
Fast Ray Tracing
  • Avoid all pairs intersection
  • Need acceleration structure
  • We used ray cache (Pharr et al. S97)
  • Batches ray intersection queries
  • Organizes queries into coherent ray bundles
  • Triangle octree and 5D ray tree (Arvo & Kirk S87)
  • Problem: How to implement?

14
Ray Engine Organization
  • We have a perfectly good CPU sitting around doing
    nothing. Put it to work!
  • Let the GPU do what it does best
  • SIMD parallel execution
  • streamed ray-triangle intersections
  • Let the CPU do what it does best
  • traverse/maintain data structures
  • decide which rays & triangles to intersect
  • Ray Engine CPU-side
  • Cache ray-triangle int. queries into coherent
    buckets
  • When bucket large enough, send to GPU
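
A minimal sketch of the CPU-side bucketing idea, under stated assumptions. The talk's front end used a triangle octree and a 5D ray tree; the coarse key below (origin grid cell plus dominant direction axis) is a simplified stand-in, and `RayBucketCache`, `flush_size`, and `send_to_gpu` are hypothetical names.

```python
from collections import defaultdict

class RayBucketCache:
    """Groups ray queries into coherent buckets and flushes full buckets."""

    def __init__(self, flush_size, send_to_gpu, cell=1.0):
        self.flush_size = flush_size      # bucket size that triggers a GPU batch
        self.send_to_gpu = send_to_gpu    # callback: (key, list of rays) -> None
        self.cell = cell                  # grid cell size for origin quantization
        self.buckets = defaultdict(list)

    def key(self, origin, direction):
        # Coherence key (an assumption): rays starting in the same grid cell
        # and heading along the same dominant axis share a bucket.
        cell = tuple(int(c // self.cell) for c in origin)
        axis = max(range(3), key=lambda a: abs(direction[a]))
        return cell, axis, direction[axis] >= 0

    def add(self, origin, direction):
        k = self.key(origin, direction)
        bucket = self.buckets[k]
        bucket.append((origin, direction))
        if len(bucket) >= self.flush_size:    # bucket large enough: send it
            self.send_to_gpu(k, self.buckets.pop(k))

    def flush_all(self):
        # Drain leftovers; small buckets like these could be traced on the CPU.
        for k in list(self.buckets):
            self.send_to_gpu(k, self.buckets.pop(k))
```

The flush threshold is the tunable that trades setup and readback overhead against latency, which is why it is a constructor parameter here.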

15
The Ray Engine
[Diagram: The Ray Engine]
CPU: Application (Ray Tracing, Path Tracing, Photon Mapping, Radiosity Form Factors, ...) passes Rays To Query and Geometry to the Front End (batch/queue/sort coherent rays) and gets back Intersection Results
GPU: Ray Data In Textures and Triangle Data as Quad Attributes feed the Ray Triangle Intersection Pixel Shader, which returns Intersection Pixel Data
16
Analysis
  • How small can the ray/tri buckets be?
  • Overhead: texture & attr. setup, readback delay
  • Determined best query size by experimentation
  • Texture-strip ray buckets 4 texels high
  • Takes advantage of 2-D spatial texture cache
  • CPU handles small queries (using NV_FENCE)
  • CPU traces between 10% and 33% of rays

17
Comparison
  • Stanford GPU Ray Tracer (Purcell et al., S02)
  • State-based traversal, intersection, shading
  • Each pixel is ray intersection process
  • Four states: traversal, intersection, shading,
    spawning
  • Same state program run simultaneously on all
    pixels
  • Result ignored when pixel was in different state
    (90%)
  • Implemented entirely on GPU, avoids readback!
  • All geometry must fit in texture memory
  • Could page geometry from host
  • Limited to simple grid-based ray acceleration
  • GPU spawns rays, can be complex (importance)

18
Results
[Figure captions: 150K rays/s, 207K rays/s]
  • Ray Engine GH02
  • 200K rays/s for highly coherent, small (2.5K)
    scenes
  • 115K rays/s for large (34K) complicated scenes
  • Wald et al. EGRW01
  • P3 + SSE: 200K to 1.5M rays/s
  • CPU-SSE tightly coupled
  • Purcell et al. S02
  • 300K rays/s large (35K)
  • up to 4M rays/s small (35)

[Figure captions: 115K rays/s, 128K rays/s, 131K rays/s]
19
! Readback
  • Why is readback slow?
  • Driver uses PCI readback, even for AGP cards!
  • Only got 250MB/s (should get 1GB/s)
  • Problem goes away if readback asynchronous
  • Proposed in OpenGL 2.0

21
GPU Ray Tracing Lessons
  • Ray Engine only doubles ray tracing speed
  • Maintaining coherence expensive, 2-3x
    intersection
  • Is coherence worth it?
  • Need to factor readback rate into performance
  • GPU(R,T) = T R fill^-1 + R γ readback^-1
  • γ = 4 → ID: flat shaded triangles
  • γ = 16 → barycentrics: textured, shaded triangles
  • Vertex shader ≈ CPU, pixel shader > CPU
  • SIGGRAPH values analysis over implementation
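
The performance model on this slide can be written as a small function. This is a sketch that assumes the garbled formula reads as intersection time plus readback time, with R rays, T triangles, a fill rate in intersections per second, a readback rate in bytes per second, and γ bytes read back per ray. The function name and units are assumptions.

```python
def gpu_time(num_rays, num_tris, fill_rate, readback_rate, gamma):
    """Estimated seconds to intersect num_rays with num_tris and read results back.

    fill_rate     : ray-triangle intersections per second (assumed unit)
    readback_rate : bytes per second from GPU to host (assumed unit)
    gamma         : bytes read back per ray (4 for an ID, 16 for barycentrics)
    """
    intersect_time = num_tris * num_rays / fill_rate
    readback_time = num_rays * gamma / readback_rate
    return intersect_time + readback_time

# Illustrative inputs only, borrowing the 114M ints/s and 250MB/s figures
# quoted elsewhere in the talk; they were not necessarily measured together.
flat = gpu_time(65536, 100, 114e6, 250e6, gamma=4)
textured = gpu_time(65536, 100, 114e6, 250e6, gamma=16)
```

With few triangles per bucket the readback term is a large share of the total, which is the point the slide makes about factoring readback into performance.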

22
Matrix Radiosity
  • Given form-factor
  • Energy balance of scene result of linear system
    solution
  • MB = E
  • Matrix → 2-D texture
  • Vector → 1-D texture
  • Product → Series of row vector dot products
    accumulated into a 1-D texture

[Figure: 4x4 matrix (rows A B C D, E F G H, I J K L, M N O P) times vector (1, 2, 3, 4) gives row dot products A1+B2+C3+D4, E1+F2+G3+H4, I1+J2+K3+L4, M1+N2+O3+P4]
23
Jacobi v. Gauss-Seidel
  • Jacobi iteration
  • Classical: B_i^(k+1) = E_i - Σ_{j≠i} M_ij B_j^(k)
  • Decision free: B^(k+1) = E - M B^(k) + B^(k)
  • Gauss-Seidel
  • Needs decision: B_i = E_i - Σ_{j≠i} M_ij B_j
  • Converges 2x faster than Jacobi
  • GPU Gauss-Seidel
  • n passes (Kruger & Westermann S03)
  • GPU Jacobi
  • n/254 passes (unrolled)
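
A small CPU sketch of the two iterations, assuming the system is MB = E with unit diagonal (M_ii = 1), as reconstructed from the slide. The function names are hypothetical. The Jacobi step uses the decision-free form (a full matrix-vector product, then add B back), which is what makes it map to a GPU pass without a per-element branch on j ≠ i.

```python
def jacobi_step(M, E, B):
    """Decision-free Jacobi: B_new = E - M B + B (valid when M_ii = 1)."""
    n = len(B)
    # Matrix-vector product as a series of row dot products.
    MB = [sum(M[i][j] * B[j] for j in range(n)) for i in range(n)]
    return [E[i] - MB[i] + B[i] for i in range(n)]

def gauss_seidel_step(M, E, B):
    """Gauss-Seidel: update in place, skipping the diagonal term (the decision)."""
    B = list(B)
    n = len(B)
    for i in range(n):
        B[i] = E[i] - sum(M[i][j] * B[j] for j in range(n) if j != i)
    return B

def solve(step, M, E, iterations=50):
    """Run a fixed number of sweeps starting from B = E."""
    B = list(E)
    for _ in range(iterations):
        B = step(M, E, B)
    return B
```

Usage: `solve(jacobi_step, M, E)` and `solve(gauss_seidel_step, M, E)` should agree on a well-conditioned system; the fixed iteration count is an illustrative choice, not a convergence test.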

M_ii = 1
24
Radiosity Performance
  • CPU: Athlon 2800
  • Gauss-Seidel
  • 40 iter/s, 190M fp/s
  • 100% mem. bw
  • O(n^2)
  • GPU: FX5900 Ultra
  • Jacobi
  • 30 iter/s, 141M fp/s
  • 10% mem. bw
  • O(n)

25
Radiosity Lessons
  • Matrix size limited
  • Maximum texture size 4Kx4K, maximum p-buffer size
    2Kx2K
  • Need paged block-based solutions, or sparse (Bolz
    et al. S03)
  • 1-D texture vector non-optimal for texture
    cache
  • 2x according to Kruger & Westermann S03
  • They pack vectors nicely into 2-D textures
  • Also accelerates dot product, magnitude
    operations
  • Gouraud interpolation not so easy!
  • Need to interpolate 1-D texture across a 2-D mesh
  • KW-S03's 2-D texture vector not appropriate for
    radiosity
  • Matrix-matrix product caches better if done
    blockwise
  • R = Upper Left, G = LL, B = UR, A = LR
  • See UIUCDCS-R-2003-2328

26
Subsurface Scattering
  • Simulates scattering of light within a
    homogeneous translucent material
  • Needed for all non-metallic surfaces
  • Skin, milk, bread, stone
  • Precompute scattering for real-time display
  • CPU implementations
  • Jensen et al. S02 used octree
  • Hao et al. I3D03 approx. vert. backscatter
  • Lensch et al. PG02 used atlas
  • GPU implementation
  • Carr et al. GH03, extends Lensch et al.
  • Sloan et al. S03, incorporates SS into PRT

27
Scattering v. Radiosity
  • Diffuse subsurface scattering resembles a single
    radiosity transport step (Lensch et al. PG02)
  • Scattering factor Fij based on BSSRDF Rd
  • Precomputed and stored
  • Hierarchically clustered to avoid O(n^2) evaluation

28
Multires Meshed Atlas
  • Quads in atlas correspond to clusters in surface
    mesh
  • Each cluster composed of four subclusters
  • Allows MIP-mapping to provide multiresolution
    mesh access
  • Allows subsurface scattering to operate at
    multiple resolutions

29
Algorithm
  • Pass 1: Construct Radiosity Map
  • Illuminate each patch by external light
  • Scale each patch by 1 - Fresnel
  • Pass 2: Construct Irradiance Map
  • Gather for each texel i all texels j
  • Scaled by precomputed Fij
  • Pass 3: Display result
  • Scale irradiance map by 1 - Fresnel
  • Texture map onto surface and display
  • Problem: how to store all j terms for each texel
    i?
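
The three passes can be sketched on the CPU as follows. This assumes the scale factor is (1 - Fresnel) at both entry and exit, and it uses a dense all-pairs factor table F, which is exactly the storage problem the slide raises. The function name and the flat per-texel lists are illustrative assumptions.

```python
def subsurface_passes(direct_light, fresnel, F):
    """Return displayed intensity per texel after the three passes.

    direct_light : external illumination arriving at each texel
    fresnel      : Fresnel reflectance at each texel (fraction reflected)
    F            : F[i][j], precomputed scattering factor from texel j to texel i
    """
    n = len(direct_light)
    # Pass 1: radiosity map, the light that enters the material.
    radiosity = [direct_light[i] * (1.0 - fresnel[i]) for i in range(n)]
    # Pass 2: irradiance map, gathering from all texels j for each texel i.
    irradiance = [sum(F[i][j] * radiosity[j] for j in range(n)) for i in range(n)]
    # Pass 3: scale by the fraction that exits the surface, ready for display.
    return [irradiance[i] * (1.0 - fresnel[i]) for i in range(n)]
```

Pass 2 is O(n^2) in this dense form, which is why the deck goes on to replace per-texel links with links to clusters.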

30
Hierarchical Links
  • Each texel i needs to represent a link to all
    other texels j
  • Instead link texel i to a cluster j
  • Store Fij records at each texel
  • Fij = factor between texel i and cluster j
  • Accuracy limited by # of textures
  • 16 links per texel → 4 textures of 4 components
    of 16-bit floats
  • Per link needs 1 lookup for address and ¼ lookup
    (dependent) for factor

31
Adaptive Links
  • Construct dynamically
  • based on magnitude of Fij
  • Store (u, v, LOD, Fij) records at each texel
  • u,v = location of cluster j
  • LOD = MIP-map level of cluster j
  • Fij = factor between patch i and cluster j
  • Needs 2 lookups (dependent) per link
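
A sketch of how a gather over adaptive links might look, assuming a square power-of-two radiosity texture whose MIP levels hold the average of each 2x2 block, and assuming u,v are integer texel indices at the link's LOD. The helper names and the averaging choice are assumptions, not details given in the talk.

```python
def build_mip_pyramid(level0):
    """Return [level0, level1, ...], each level averaging 2x2 blocks of the previous."""
    pyramid = [level0]
    while len(pyramid[-1]) > 1:
        prev = pyramid[-1]
        half = len(prev) // 2
        pyramid.append([
            [(prev[2 * y][2 * x] + prev[2 * y][2 * x + 1] +
              prev[2 * y + 1][2 * x] + prev[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(half)]
            for y in range(half)
        ])
    return pyramid

def gather(links, pyramid):
    """Sum cluster radiosity weighted by each link's factor.

    links: list of (u, v, lod, f_ij) records stored at one texel.
    Each link costs one lookup for the record and one dependent lookup
    into the MIP level it points at.
    """
    return sum(f_ij * pyramid[lod][v][u] for (u, v, lod, f_ij) in links)
```

A link with a higher LOD stands in for a whole cluster of texels, so distant regions can be covered by a few coarse links while nearby ones use fine links.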

32
Results
70K faces, 1K texture, 13 fps
Direct: 18%
Scattered: 68%
Displayed: 14%
33
Subsurface Lessons
  • Adaptive looks bad when # of links small
  • Need to be very careful where to place links
  • Need to increase # of links
  • Used vector quantization to compress link records
  • Increased to 64 links
  • Also allowed color

34
Precomputed Radiance Xfer
  • Radiance fns about p
  • Source (env. map)
  • Incident
  • Exit
  • Represented with spherical harmonics
  • 25-vector of SH weights
  • Radiance transfer
  • Transfer matrix: source-to-incident x BRDF
  • Multiply source vector w/precomputed transfer
    matrix at p to get exit radiance vector
  • Need to store 25^2 = 625 elements at each point p
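
The per-point work is a 25x25 matrix times a 25-vector. A minimal sketch, where the 25 coefficients correspond to spherical harmonic bands l = 0..4 (1 + 3 + 5 + 7 + 9 = 25); the function and constant names are hypothetical.

```python
N_SH = 25  # spherical harmonic coefficients for bands l = 0..4

def exit_radiance(transfer, source):
    """Multiply the source lighting vector by the transfer matrix stored at p.

    transfer : N_SH rows of N_SH floats (625 elements per point)
    source   : N_SH SH weights of the source (environment) lighting
    Returns the N_SH SH weights of exit radiance at p.
    """
    return [sum(row[j] * source[j] for j in range(N_SH)) for row in transfer]

def storage_per_point():
    """Number of matrix elements stored at each surface point."""
    return N_SH * N_SH
```

Storing 625 values at every point is what motivates the compression schemes on the slides that follow.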

35
VQ PRT
  • Vector quantization
  • Create a codebook of typical transfer matrices
    (LBG)
  • Pick random codebook matrices
  • Cluster transfer matrices nearest to each
    codebook matrix
  • Replace codebook matrix with its cluster center
  • Repeat until done
  • Store for each transfer matrix the index of its
    nearest codebook matrix
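
The codebook loop above can be sketched as follows, treating each transfer matrix as a flat list of floats (625 long in the PRT case). The fixed iteration count stands in for "repeat until done" and the seeded random start is for reproducibility; both are assumptions, as are the function names.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two flat vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec."""
    return min(range(len(codebook)), key=lambda k: dist2(codebook[k], vec))

def train_codebook(vectors, size, iterations=20, seed=0):
    """Return (codebook, indices) where indices[i] names the entry for vectors[i]."""
    rng = random.Random(seed)
    # Pick random codebook entries from the data.
    codebook = [list(v) for v in rng.sample(vectors, size)]
    for _ in range(iterations):
        # Cluster every vector with its nearest codebook entry.
        clusters = [[] for _ in codebook]
        for v in vectors:
            clusters[nearest(codebook, v)].append(v)
        # Replace each entry with its cluster center (empty clusters keep theirs).
        for k, members in enumerate(clusters):
            if members:
                codebook[k] = [sum(c) / len(members) for c in zip(*members)]
    # Store only the index of the nearest entry for each input.
    indices = [nearest(codebook, v) for v in vectors]
    return codebook, indices
```

After training, each point stores one small index instead of a full matrix, at the cost of the quantization error between the matrix and its codebook entry.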

36
PCA PRT
  • Principal Component Analysis
  • Determine which few principal directions in 625-D
    space have greatest transfer matrix variance
  • Store global origin transfer matrix and
    principal direction axes transfer matrices
  • Store for each transfer matrix its approx.
    coordinates along the axes

37
CPCA
  • VQPCA
  • Creates VQ clusters, codebook
  • Computes PCA on each VQ cluster
  • Iterative VQPCA
  • Computes PCA on each VQ cluster
  • Reclusters based on approx. error
  • Repeat until done
  • Adaptive VQPCA
  • Homogenize error
  • Give some clusters more PCA axes

38
Vertex Shader Rendering
  • Set blending mode to ADD
  • For each cluster
  • Load cluster's PCA origin and axes
    (multiplied by lighting) as constants
  • Render only faces w/a vertex whose transfer
    matrix is in the current cluster
  • Color non-cluster vertices black
  • The impact on runtime is not so nice
  • When faces tween clusters show twice or thrice
  • (sorry)

39
Cluster Coherence
  • Reclassification
  • Move some vertices to slightly worse clusters
    (10%) if they improve coherence
  • Reduces mean overdraw from 2.0 to 1.8
  • Superclustering
  • Load several clusters' constants into vertex
    shader simultaneously, (1 + axes) x 25/4
  • Greedily merge neighboring clusters into
    superclusters
  • Limited by vertex shader constant store
  • Reduced mean overdraw to 1.6

40
Results
  • Look what we can do in

30Hz
60Hz (250Hz non-local viewer)
40Hz
41
PRT Lessons
  • 24-vectors as good as 25-vectors, and fit nicely
    into 6 RGBA registers
  • Adaptive VQPCA needs data-dependent looping,
    eludes GPU implementation
  • Render one pass per channel to take advantage of
    alpha channel in registers
  • Allows each register to hold four data elements
    instead of three (one per channel)
  • GPU needs to interpolate high-precision textures
  • GeForceFX only interpolates 8-bit textures

42
Global Illumination Lessons
  • New algorithms & data structures needed to port
    global illumination efficiently to GPU
  • What's good for CPU not necessarily good for GPU
  • Focus has been on real time display of
    precomputed global illumination
  • All cheats except for ray engine
  • Need more GPU global illumination algorithms
  • Like Purcell et al. GH03
  • Leading to a GPUPACK
  • Library of tuned GPU algorithms

43
Thanks
  • Who did all the work?
  • Nate Carr
  • Jesse Hall
  • NSF ITR Award ACI-0113968
  • NVidia
  • Microsoft Research
  • Peter-Pike Sloan
  • John Snyder