1
Cg and Hardware Accelerated Shading
  • Cem Cebenoyan

2
Overview
  • Cg Overview
  • Where we are in hardware today
  • Physical Simulation on GPU
  • GeForce FX / Cg Demos
  • Advanced hair and skin rendering in Dawn
  • Adaptive subdivision surfaces and ambient
    occlusion shading in Ogre
  • Procedural shading in Time Machine
  • Depth of field and post-processing effects in
    Toys
  • OIT

3
What is Cg?
  • A high level language for controlling parts of
    the graphics pipeline of modern GPUs
  • Today, this includes the vertex transformation
    and fragment processing units of the pipeline
  • Very C-like
  • Only simpler
  • Native support for vectors, matrices,
    dot-products, reflection vectors, etc.
  • Similar in scope to RenderMan
  • But notably different, to handle the way hardware
    accelerators work (a minimal example follows)
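For flavor, a minimal Cg vertex program might look like this (an
illustrative sketch, not from the slides; struct and parameter names
are made up):

    // A minimal Cg vertex program: transform position and compute
    // per-vertex diffuse. Struct/parameter names are illustrative;
    // lightDir is assumed normalized.
    struct appdata {
        float4 position : POSITION;
        float3 normal   : NORMAL;
    };
    struct vout {
        float4 hpos  : POSITION;
        float4 color : COLOR0;
    };

    vout main(appdata IN,
              uniform float4x4 modelViewProj,
              uniform float3   lightDir)
    {
        vout OUT;
        OUT.hpos = mul(modelViewProj, IN.position);  // native matrix math
        float d  = max(dot(normalize(IN.normal), lightDir), 0); // native dot
        OUT.color = float4(d.xxx, 1);
        return OUT;
    }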

4
Cg Pipeline Overview
Graphics program written in Cg ("C for Graphics")
  -> compiled and optimized
  -> low-level graphics assembly code
5
Graphics Data Flow
Application -> Vertex Program -> Fragment Program -> Framebuffer
(the vertex and fragment stages each run a Cg program)

Example fragment code:

    // Diffuse lighting
    float d = dot(normalize(frag.N), normalize(frag.L));
    if (d < 0) d = 0;
    c = d * f4tex2D(t, frag.uv);   // diffuse
6
Graphics Hardware Today
  • Fully programmable vertex processing
  • Full IEEE 32-bit floating point processing
  • Native support for mul, dp3, dp4, rsq, pow, sin,
    cos...
  • Full support for branching, looping, subroutines
  • Fully programmable pixel processing
  • IEEE 32-bit, 16-bit (s10e5) math supported
  • Same native math ops as vertex, plus texture
    fetch, and derivative instructions
  • No branching, but a >1000 instruction limit
  • Floating point textures / frame buffers
  • No blending / filtering yet
  • 500 MHz core clock

7
Physical Simulation
  • Simple cellular-automata-like simulations are
    possible on NV20-class hardware (e.g. Game of
    Life, Greg James's water simulation, Mark
    Harris's CML work)
  • Use textures to represent physical quantities
    (e.g. displacement, velocity, force) on a regular
    grid
  • Multiple texture lookups allow access to
    neighbouring values
  • Pixel shader calculates new values, renders
    results back to texture
  • Each rendering pass draws a single quad,
    calculating next time step in simulation

8
Physical Simulation
  • Problem: 8-bit precision on NV20 is not enough;
    it causes drifting and stability problems
  • Float precision on NV30 allows GPU physics to
    match CPU accuracy
  • New fragment programming model (longer programs,
    flexible dependent texture reads) allows much
    more interesting simulations

9
Example Cloth Simulation Shader
  • Uses Verlet integration (see Jakobsen, GDC 2001)
  • Avoids storing explicit velocity
  • newx = x + (x - oldx)*damping + a*dt*dt
  • Not always accurate, but stable!
  • Store current and previous position of each
    particle in 2 RGB float textures
  • Fragment program calculates new position, writes
    result to float buffer
  • Copy float buffer back to texture for next
    iteration (could use render-to-texture instead)
  • Swap current and previous textures
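A sketch of the host-side loop this implies (C/OpenGL; assumes a float
pbuffer is the current render target and positions live in two
GL_TEXTURE_RECTANGLE_NV float textures; helper names are illustrative,
not the demo's actual API):

    int curr = 0, prev = 1;
    for (int i = 0; i < num_iterations; i++) {
        bind_cloth_fragment_program();  /* the Cg programs on slides 12-14 */
        draw_fullscreen_quad();         /* one quad = one simulation step  */
        /* copy the float buffer back to the "current position" texture */
        glBindTexture(GL_TEXTURE_RECTANGLE_NV, pos_tex[curr]);
        glCopyTexSubImage2D(GL_TEXTURE_RECTANGLE_NV, 0, 0, 0, 0, 0, W, H);
        /* swap current and previous textures */
        int tmp = curr; curr = prev; prev = tmp;
    }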

10
Cloth Shader Demo
11
Cloth Simulation Shader
  • 2 passes
  • 1. Perform integration
  • 2. Apply constraints
  • Floor constraint
  • Sphere constraint
  • Distance constraints between particles
  • Read back float frame buffer using glReadPixels
  • Draw particles and constraints

12
Cloth Simulation Cg Code (1st pass)
void Integrate(inout float3 x, float3 oldx, float3 a,
               float timestep2, float damping)
{
    x = x + damping*(x - oldx) + a*timestep2;
}

myFragout main(v2fconnector In,
               uniform texobjRECT x_tex,
               uniform texobjRECT ox_tex,
               uniform float timestep,
               uniform float damping,
               uniform float3 gravity)
{
    myFragout Out;
    float2 s = In.TEX0.xy;
    // get current and previous position
    float3 x    = f3texRECT(x_tex, s);
    float3 oldx = f3texRECT(ox_tex, s);
    // move the particle
    Integrate(x, oldx, gravity, timestep*timestep, damping);
    Out.COL.xyz = x;
    return Out;
}
13
Cloth Simulation Cg Code (2nd pass)
// constrain particle to be a fixed distance from another particle
void DistanceConstraint(float3 x, inout float3 newx, float3 x2,
                        float restlength, float stiffness)
{
    float3 delta = x2 - x;
    float deltalength = length(delta);
    float diff = (deltalength - restlength) / deltalength;
    newx = newx + delta*stiffness*diff;
}

// constrain particle to be outside sphere
void SphereConstraint(inout float3 x, float3 center, float r)
{
    float3 delta = x - center;
    float dist = length(delta);
    if (dist < r) x = center + delta*(r / dist);
}

// constrain particle to be above floor
void FloorConstraint(inout float3 x, float level)
{
    if (x.y < level) x.y = level;
}
14
Cloth Simulation Cg Code (cont.)
myFragout main(v2fconnector In,
               uniform texobjRECT x_tex,
               uniform texobjRECT ox_tex,
               uniform float dist,
               uniform float stiffness)
{
    myFragout Out;
    float2 s = In.TEX0.xy;
    // get current position
    float3 x = f3texRECT(x_tex, s);

    // satisfy constraints
    FloorConstraint(x, 0.0f);
    SphereConstraint(x, float3(0.0, 2.0, 0.0), 1.0f);

    // get positions of neighbouring particles
    float3 x1 = f3texRECT(x_tex, s + float2( 1.0,  0.0));
    float3 x2 = f3texRECT(x_tex, s + float2(-1.0,  0.0));
    float3 x3 = f3texRECT(x_tex, s + float2( 0.0,  1.0));
    float3 x4 = f3texRECT(x_tex, s + float2( 0.0, -1.0));

    // apply distance constraints
    // (skip neighbours outside the cloth grid, bounded at 0 and 31)
    float3 newx = x;
    if (s.x < 31) DistanceConstraint(x, newx, x1, dist, stiffness);
    if (s.x > 0)  DistanceConstraint(x, newx, x2, dist, stiffness);
    if (s.y < 31) DistanceConstraint(x, newx, x3, dist, stiffness);
    if (s.y > 0)  DistanceConstraint(x, newx, x4, dist, stiffness);
    Out.COL.xyz = newx;
    return Out;
}
15
Physical Simulation Future Work
  • Limitation: only one destination buffer, so we
    can only modify the position of one particle at a
    time
  • Could use pack instructions to store 2 vec4h (8
    half floats) in 128 bit float buffer
  • Could also use additional textures to encode
    particle masses, stiffness, constraints between
    arbitrary particles (rigid bodies)
  • float buffer to vertex array extension offers
    possibility of directly interpreting results as
    geometry without any CPU intervention!
  • Collision detection with meshes is hard

16
Demos Introduction
  • Developed 4 demos for the launch of GeForce FX
  • Dawn
  • Toys
  • Time Machine
  • Ogre(Spellcraft Studio)

17
Characters Look Better With Hair
18
Rendering Hair
  • Two options
  • 1) Volumetric (texture)
  • 2) Geometric (lines)
  • We have used volumetric approximations (shells
    and fins) in the past (e.g. Wolfman demo)
  • Doesn't work well for long hair
  • We considered using textured ribbons (popular in
    Japanese video games). Alpha sorting is a pain.
  • Performance of GeForce FX finally lets us render
    hair as geometry

19
Rendering Hair as Lines
  • Each hair strand is rendered as a line strip
    (2-20 vertices, depending on curvature)
  • Problem: lines are a minimum of 1 pixel thick,
    regardless of distance from camera
  • Not possible to change line width per vertex
  • Can use camera-facing triangle strips, but these
    require twice the number of vertices, and have
    aliasing problems

20
Anti-Aliasing
  • Two methods of anti-aliasing lines in OpenGL
  • GL_LINE_SMOOTH
  • High quality, but requires blending, sorting
    geometry
  • GL_MULTISAMPLE
  • Usually lower quality, but order independent
  • We used multisample anti-aliasing with
    alpha-to-coverage mode (GL setup sketched below)
  • By fading alpha to zero at the ends of hairs,
    coverage and apparent thickness decrease
  • SAMPLE_ALPHA_TO_COVERAGE_ARB is part of the
    ARB_multisample extension
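The corresponding GL state is small (C/OpenGL sketch; assumes the
context was created with a multisample pixel format):

    /* Enable multisampling with alpha-to-coverage (ARB_multisample). */
    glEnable(GL_MULTISAMPLE_ARB);
    glEnable(GL_SAMPLE_ALPHA_TO_COVERAGE_ARB);
    /* ... draw hair line strips with alpha fading to 0 at the tips ... */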

21
Hair Without Antialiasing
22
Hair With Multisample Antialiasing
23
Hair Shading
  • Hair is lit with a simple anisotropic shader
    (the Heidrich-Seidel model; sketched after this
    list)
  • Low specular exponent, dim highlight looks best
  • Black hair: no shadows!
  • Self-shadowing hair is hard
  • Deep shadow maps
  • Opacity shadow maps
  • Top of head is painted black to avoid skin
    showing through
  • We also had a very short hair style, which helps

24
Hair Styling is Important
25
Hair Styling
  • Difficult to position 50,000 individual curves by
    hand
  • Typical solution is to define a small number of
    control hairs, which are then interpolated across
    the surface to produce render hairs
  • We developed a custom tool for hair styling
  • Commercial hair applications have poor styling
    tools and are not designed for real time output

26
Hair Styling
  • Scalp is defined as a polygon mesh
  • Hairs are represented as cubic Bezier curves
  • Control hairs are defined at each vertex
  • Render hairs are interpolated across triangles
    using barycentric coordinates (see the sketch
    after this list)
  • Number of generated hairs is based on triangle
    area to maintain constant density
  • Can add noise to interpolated hairs to add
    variation
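A hedged sketch of the interpolation step, in Cg-style syntax (the
styling tool ran on the CPU; names are illustrative):

    // One render-hair control point from the three control hairs at
    // the triangle's corners: h0/h1/h2 are the corresponding Bezier
    // control points, b the barycentric weights (b.x + b.y + b.z = 1).
    float3 InterpolateHairPoint(float3 b, float3 h0, float3 h1, float3 h2)
    {
        return b.x*h0 + b.y*h1 + b.z*h2;   // plus optional per-hair noise
    }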

27
Hair Styling Tool
  • Provides a simple UI for styling hair
  • Combing tools
  • Lengthen / shorten
  • Straighten / mess up
  • Uses a simple physics simulation based on Verlet
    integration (Jakobsen, GDC 2001)
  • Physics is run on control hairs only
  • Collision detection done with ellipsoids

28
(No Transcript)
29
(No Transcript)
30
(No Transcript)
31
Dawn Demo
  • Show demo

32
(No Transcript)
33
The Ogre Demo
  • A real-time preview of Spellcraft Studio's
    in-production short movie "Yeah!"
  • Created in 3DStudio MAX
  • Used Character Studio for animation, plus Stitch
    plug-in for cloth simulation
  • Original movie was rendered in Brazil with global
    illumination
  • Available at www.yeahthemovie.de
  • Our aim was to recreate the original as closely
    as possible, in real-time

34
What are Subdivision Surfaces?
  • A curved surface defined as the limit of repeated
    subdivision steps on a polygonal model
  • Subdivision rules create new vertices, edges,
    faces based on neighboring features
  • We used the Catmull-Clark subdivision scheme (as
    used by Pixar)
  • MAX, Maya, Softimage, Lightwave all support forms
    of subdivision surfaces

35
Realtime Adaptive Tessellation
  • Brute force subdivision is expensive
  • Generates lots of polygons where they aren't
    needed
  • Number of polygons increases exponentially with
    each subdivision
  • Adaptive tessellation subdivides patches based on
    a screen-space patch-size test (sketched below)
  • Guaranteed crack-free
  • Generates normals and tangents on the fly
  • Culls off-screen and back-facing patches
  • CPU-based (uses SSE where possible)
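A hedged C sketch of the per-patch test (the demo's code is not
public, so types, helpers, and the threshold are assumptions):

    typedef struct Patch  Patch;    /* assumed types */
    typedef struct Camera Camera;
    int   patch_offscreen(const Patch *p, const Camera *c);
    int   patch_backfacing(const Patch *p, const Camera *c);
    float projected_edge_length_px(const Patch *p, const Camera *c);
    #define MAX_EDGE_PIXELS 8.0f    /* assumed threshold */

    int ShouldSubdivide(const Patch *p, const Camera *cam)
    {
        if (patch_offscreen(p, cam) || patch_backfacing(p, cam))
            return 0;                    /* culled: stop subdividing */
        /* longest control-mesh edge, projected to pixels */
        float px = projected_edge_length_px(p, cam);
        return px > MAX_EDGE_PIXELS;     /* subdivide big patches only */
    }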

36
Control Mesh vs. Subdivided Mesh
4,000 control-mesh faces vs. 17,000 subdivided triangles
37
Control Mesh Detail
38
Subdivided Mesh Detail
39
Why Use Subdivision Surfaces?
  • Content
  • Characters were modeled with subdivision in mind
    (using 3DSMax MeshSmooth/NURMS modifier)
  • Scalability
  • Wanted the demo to be scalable to lower-end
    hardware
  • Infinite detail
  • Can zoom in forever without seeing hard edges
  • Animation compression
  • Just store low-res control mesh for each frame
  • May be accelerated on future GPUs

40
Disadvantages of Realtime Subdivision
  • CPU intensive
  • But we might as well use the CPU for something!
  • View dependent
  • Requires re-tessellation for shadow map passes
  • Mesh topology changes from frame to frame
  • Makes motion blur difficult

41
Ambient Occlusion Shading
  • Helps simulate the global illumination look of
    the original movie
  • Self occlusion is the degree to which an object
    shadows itself
  • How much of the sky can I see from this point?
  • Simulates a large spherical light surrounding the
    scene
  • Popular in production rendering: Pearl Harbor
    (ILM), Stuart Little 2 (Sony)

42
Occlusion
(Diagram: hemisphere of ray directions around the surface normal N)
43
How To Calculate Occlusion
  • Shoot rays from surface in random directions over
    the hemisphere (centered around the normal)
  • The percentage of rays that hit something is the
    occlusion amount
  • Can also keep track of the average of un-occluded
    directions: the "bent normal"
  • Some RenderMan-compliant renderers (e.g. Entropy)
    have a built-in occlusion() function that will do
    this
  • We can't trace rays using graphics hardware (yet)
  • So we pre-calculate it! (a baking sketch follows
    this list)
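A hedged C sketch of the baking computation (the tool ran offline on a
CPU ray tracer; all helper names here are illustrative, not the demo's
API):

    typedef struct { float x, y, z; } float3;
    /* assumed helpers: */
    float3 random_dir_on_hemisphere(float3 n); /* uniform dir, dot > 0 */
    int    trace_ray(float3 origin, float3 dir); /* 1 if the mesh is hit */
    float3 add3(float3 a, float3 b);
    float3 normalize3(float3 v);

    float BakeOcclusion(float3 p, float3 n, int num_rays,
                        float3 *bent_normal)
    {
        int hits = 0;
        float3 avg = {0, 0, 0};
        for (int i = 0; i < num_rays; i++) {      /* demo used 128 rays */
            float3 dir = random_dir_on_hemisphere(n); /* around normal */
            if (trace_ray(p, dir))
                hits++;                           /* occluded direction */
            else
                avg = add3(avg, dir);             /* open direction */
        }
        *bent_normal = normalize3(avg);           /* average open dir */
        return (float)hits / num_rays;            /* occlusion in [0,1] */
    }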

44
Occlusion Baking Tool
  • Uses ray-tracing engine to calculate occlusion
    values for each vertex in control mesh
  • We used 128 rays / vertex
  • Stored as floating point scalar for each vertex
    and each frame of the animation
  • Calculation took around 5 hours for 1000 frames
  • Subdivision code interpolates occlusion values
    using cubic interpolation
  • Used as ambient term in shader

45
(No Transcript)
46
(No Transcript)
47
Ogre Demo
  • Show demo

48
Procedural Shading in Time Machine
  • Goals for the Time Machine demo
  • Overview of effects
  • Metallic Paint
  • Wood
  • Chrome
  • Techniques used
  • Faux-BRDF reflection
  • Reveal and dXdT maps
  • Normal and DuDv scaling
  • Dynamic Bump mapping
  • Performance Issues
  • Summary

49
Why do Time Machine?
  • GPUs are much more programmable
  • Thanks to generalized dependent texturing, more
    active textures (16 on GeForce FX) and (for our
    purposes) unlimited blend operations,
    high-quality animation is possible per-pixel
  • GeForce FX has >2x the performance of GeForce4 Ti
  • Executing lots of per-pixel operations isn't just
    possible; it can be done in real time.
  • Previous per-pixel animation was limited
  • Animated textures
  • PDE / CA effects (see Mark Harris's talk at GDC)
  • Goal: full-scene per-pixel animation

50
Why do Time Machine? (continued)
  • Neglected pick-up trucks demonstrate a wide
    variety of surface effects, with intricate
    transitions and boundaries
  • Paint oxidizing, bleaching and rusting
  • Vinyl cracking
  • Wood splintering and fading
  • And more

Not possible with just per-vertex animation!
51
Time Machine Effects Paint
  • Paint textures
  • Paint Color
  • Rust LUT
  • Shadow map
  • Spotlight mask
  • Light Rust Color
  • Deep Rust Color
  • Ambient Light
  • Bubble Height
  • Reveal Time
  • New Environment
  • Old Environment
  • (artist created)

Effects: oxidation, specular color shift, rusting, bubbling

60 pixel shader instructions, 11 textures
52
Effects (contd) Wood, Chrome, Glass
Chrome welts and corrodes: 31 instructions, 6 textures
Wood fades and cracks: 23 instructions, 8 textures
Headlights fog: 24 instructions, 4 textures
53
Procedural or Not?
  • Procedural shading normally replaces textures
    with functions of several variables.
  • Time Machine uses textures liberally.
  • The only parameter to our shaders is time.
  • However, turning everything into math is
    expensive
  • Time Machine's solution
  • Give the artist direct control (textures) over the
    final image; use functions to control transitions

54
Techniques Faux-BRDF Reflection
  • Many automotive paints exhibit a color-shift as a
    function of the light and viewer directions.
  • This effect has been approximated with analytic
    BRDFs (Lafortune's cosine lobes)
  • And measured by Cornell University's graphics lab
  • BRDF factorization (McCool, Rusinkiewicz) is one
    method to use this data on graphics hardware
    method to use this data on graphics hardware
  • Efficient representation with multiple 2D
    textures
  • Closely approximates the original BRDFs
  • But not necessarily the most efficient method for
    automotive paint, and not artist-controllable.
  • Reflection intensity is uninteresting (largely
    Blinn)
  • Rotated/projected axes hard to visualize

55
Techniques Faux-BRDF Reflection 2
  • Our solution: project BRDF values onto a single
    2D texture, and factor out the intensity
  • Compute intensity in real-time, using (N.H)^s
  • Texture varies slowly, so it can be low-res
    (64x64).
  • Anti-aliasing texture fixes "laser" noise at
    grazing angles
  • For automotive paints, N.L and N.H work well for
    axes.
  • Not physically accurate, but fast and
    high-quality.
  • Easy for artists to tweak. (Cg sketch below.)
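A hedged Cg sketch of the lookup described above (texture contents and
names are assumptions; the anti-aliasing texture is omitted):

    // Faux-BRDF paint shading: a 64x64 color-shift texture indexed by
    // (N.L, N.H), with intensity computed analytically as (N.H)^s.
    float3 PaintColor(float3 N, float3 L, float3 H,
                      texobj2D brdfTex, float s)
    {
        float NdotL = dot(N, L);
        float NdotH = dot(N, H);
        float3 shift = f3tex2D(brdfTex, float2(NdotL, NdotH)); // color shift
        float  intensity = pow(max(NdotH, 0), s);              // (N.H)^s
        return shift * intensity;
    }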

Mystique lacquer
Dupont Cayman lacquer
56
Techniques Reveal and dXdT maps
  • Artists do not want to paint hundreds of frames
    of animation for a surface transition (e.g.,
    paint -> rust)
  • Ultimately, the effect is just a conditional
  • if (time > n) color = rust; else color = paint;
  • Or an interpolation between a start and end point
  • paint = interpolate(paint, bleach, s*(time - n));
  • So all intermediate values can be generated
    (sketched below).
  • For continuous effects, use dXdT (velocity) maps
  • Can be stored in alpha in a DXT5 texture.
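A hedged Cg sketch of a reveal-map transition (the rate parameter and
texture layout are assumptions; names are illustrative):

    // Per-texel transition driven by a reveal map: revealTex stores
    // the start time n for each texel; time is the shader's only
    // global parameter.
    float3 Transition(float2 uv, float time, float rate,
                      texobj2D revealTex, float3 paint, float3 rust)
    {
        float n = f1tex2D(revealTex, uv);       // per-texel start time
        float s = saturate((time - n) * rate);  // 0 before n, ramps to 1
        return lerp(paint, rust, s);            // all intermediates free
    }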

57
Performance Concerns
  • Executing large shaders is expensive.
  • First rule of optimization: keep inner loops
    tight
  • Shaders are the inner loop, run >1M times per
    frame.
  • But graphics cards have many parallel units
  • Vertex, fragment, and texture units
  • Modern GPUs do a great job of hiding texture
    latency
  • Bandwidth is unimportant in long shaders
  • Time Machine runs at virtually the same framerate
    on a 500/500 GeForce FX as it does on a 500/400 or
    500/550
  • So not using textures is wasting performance!

58
Performance Concerns
  • What makes a good texture?
  • Saves math operations
  • 8 (RGBA) or 16 (HILO) bit precision sufficient
  • Depends on a limited number of variables
  • Textures we used
  • Interpolating between light and dark rust layers
  • Required computing the difference between the
    light and dark layers' reveal maps, and expanding
    to 0..1.
  • The function depended on the current time and the
    reveal time.
  • Used to blend two texture maps

59
Performance Concerns
  • Textures Used, continued
  • Surround Maps
  • Recomputing the normal requires knowing the
    heights of 4 texels: (s-1,t), (s+1,t), (s,t+1) and
    (s,t-1)
  • Each height is only 1 8-bit component
  • Instead of 4 dependent fetches, we can pack all
    into 1:
  • S(s,t) = ( H(s-1,t), H(s+1,t), H(s,t-1), H(s,t+1) )
  • Saved 4 math ops and 3 texture fetches, plus the
    shuffle logic (sketched below)
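A hedged Cg sketch of the single-fetch normal reconstruction
(bumpScale and all names are assumptions):

    // Rebuild a bump normal from one surround-map fetch. S packs the
    // four neighbour heights H(s-1,t), H(s+1,t), H(s,t-1), H(s,t+1)
    // into one RGBA texel.
    float3 NormalFromSurroundMap(texobj2D surroundTex, float2 uv,
                                 float bumpScale)
    {
        float4 h = f4tex2D(surroundTex, uv);  // one fetch instead of four
        float dx = h.y - h.x;                 // H(s+1,t) - H(s-1,t)
        float dy = h.w - h.z;                 // H(s,t+1) - H(s,t-1)
        return normalize(float3(-dx * bumpScale, -dy * bumpScale, 1));
    }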

60
Time Machine demo
  • Show demo

61
Toys Demo - Simple Depth of Field
  • Render scene to color and depth textures
  • Generate mipmaps for color texture
  • Render full screen quad with simpledof shader
  • depth = tex(depthtex, texcoord)
  • coc (circle of confusion) = abs(depth*scale +
    bias)
  • color = txd(colortex, texcoord, (coc,0), (0,coc))
  • Scale and bias are derived from the camera:
  • scale = (aperture * focaldistance * planeinfocus
    * (zfar - znear)) / ((planeinfocus -
    focaldistance) * znear * zfar)
  • bias = (aperture * focaldistance * (znear -
    planeinfocus)) / ((planeinfocus - focaldistance)
    * znear)
    (a fragment-program sketch follows this list)
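Putting those steps together, a hedged Cg sketch of the full-screen
fragment program (connector and texture names are assumptions; the
derivative overload of the texture fetch plays the role of txd):

    float4 main(float2 uv : TEXCOORD0,
                uniform texobj2D colortex,   // mipmapped scene color
                uniform texobj2D depthtex,   // scene depth
                uniform float scale,
                uniform float bias) : COLOR
    {
        float depth = f1tex2D(depthtex, uv);
        float coc   = abs(depth * scale + bias); // circle of confusion
        // wide "derivatives" push the lookup to a blurrier mip level
        return f4tex2D(colortex, uv, float2(coc, 0), float2(0, coc));
    }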

62
Artifacts Bilinear Interpolation/Magnification
  • Bilinear artifacts in extreme back- and
    near-ground
  • Solution: multiple jittered samples
  • Even without jittering, a 4 or 5 sample rotated
    grid pattern brings smaller artifacts under
    control
  • Larger artifacts need jittered samples, and more
    of them
  • Then it's just a tradeoff between noise from the
    jittering and bilinear interpolation artifacts
  • (and of course the quality/performance tradeoff
    with number of samples)

63
Noise vs. Interpolation Artifacts
With Noise
Without Noise
64
Artifacts Depth Discontinuities
  • Near-ground (blurry) pixels don't properly blend
    out over top of mid-ground (sharp) pixels
  • Easy solution: cheat!
  • Either don't let objects get too far in front of
    the plane in focus, or blur everything a little
    more when they do; soft edges help hide this
    fairly well.

65
Depth Discontinuities
66
Fun With Color Matrices
  • Since we're already rendering to a full-screen
    texture, it's easy to muck with the final image.
  • Operations are just rotations / scales in RGB
    space
  • Color (hue) shift
  • Saturation
  • Brightness
  • Contrast
  • These are all matrices, so compose them together,
    and apply them as 3 dot products in the shader
    (sketched below)
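A hedged Cg-style sketch: building one such matrix (saturation, using
standard luminance weights, which are an assumption rather than the
demo's exact constants) and applying the composed result per pixel:

    // Saturation matrix: s = 0 gives grayscale, s = 1 is identity.
    float3x3 SaturationMatrix(float s)
    {
        const float3 lum = float3(0.3086, 0.6094, 0.0820);
        return float3x3(lerp(lum, float3(1, 0, 0), s),
                        lerp(lum, float3(0, 1, 0), s),
                        lerp(lum, float3(0, 0, 1), s));
    }

    // Hue-shift/brightness/contrast matrices compose the same way;
    // multiply them once on the CPU, then per pixel it is just:
    float3 ApplyColorMatrix(float3x3 colorMat, float3 c)
    {
        return mul(colorMat, c);   // 3 dot products in the shader
    }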

67
Original Image
68
Color-shifted Image
69
Black and White Image
70
Toys Demo
  • Show demo

71
Order Independent Transparency
  • Why is correct transparency hard?
  • Depth peeling
  • Two depth buffers
  • Enter the shadow map
  • Precision/invariance issues
  • Depth replace texture shader
  • Blending the layers
  • Other applications

72
Can't just glEnable(GL_BLEND)
Good transparency (with OIT) vs. bad transparency (without OIT)
73
Why is correct transparency hard?
  • Most hardware does object-order rendering
  • Correct transparency requires sorted traversal
  • Have to render polygons in sorted order
  • Not very convenient
  • Polygons can't intersect
  • Lots of extra application work
  • Especially difficult for dynamic scene databases

74
Depth Peeling
  • The algorithm uses an implicit sort to extract
    multiple depth layers
  • The first render pass finds the front-most
    fragment color/depth
  • Each successive render pass finds (extracts) the
    fragment color/depth for the next-nearest
    fragment on a per-pixel basis
  • Use dual depth buffers to compare the previous
    nearest fragment with the current one
  • Second depth buffer used for comparison (read
    only) from texture; more on this later

75
Layer 0
Layer 1
Layer 2
Layer 3
76
Cross-section view of depth peeling
(Figure: three frames showing Layer 0, Layer 1, and Layer 2, each
plotted over 0 <= depth <= 1)
Depth peeling strips away depth layers with each
successive pass. The frames above show the
frontmost (leftmost) surfaces as bold black
lines, hidden surfaces as thin black lines, and
peeled away surfaces as light grey lines.
77
Dual Depth Buffer Pseudo-code
for (i = 0; i < num_passes; i++)
{
    clear color buffer

    // depth unit 0 (read-only test against the previous layer):
    if (i == 0) disable depth test
    else        enable depth test
    bind depth buffer (i % 2)
    disable depth writes      /* read-only depth test */
    set depth func to GREATER

    // depth unit 1 (writable, finds the nearest remaining fragment):
    bind depth buffer ((i + 1) % 2)
    clear depth buffer
    enable depth writes
    enable depth test
    set depth func to LESS

    render scene
    save color buffer RGBA as layer i
}

78
Implementation
  • There is no dual depth buffer extension to
    OpenGL, so what can we do?
  • Just need one depth test with a writable depth
    buffer; the other can be read-only
  • Shadow mapping is a read-only depth test! (setup
    sketched after this list)
  • Depth test can have an arbitrary camera location
  • Other interesting uses: clip volumes
  • Fast copies make this proposition reasonable
  • Copies will be unnecessary in the future
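A hedged C/OpenGL sketch of the read-only "second depth test" via
shadow mapping (ARB_shadow-style; the demo actually used NV-specific
texture shaders, so treat this as an approximation):

    glBindTexture(GL_TEXTURE_2D, prev_depth_tex); /* previous layer's depths */
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_COMPARE_MODE_ARB,
                    GL_COMPARE_R_TO_TEXTURE_ARB);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_COMPARE_FUNC_ARB,
                    GL_GREATER);  /* needs EXT_shadow_funcs; plain
                                     ARB_shadow only allows LEQUAL/GEQUAL */
    /* the normal, writable depth buffer keeps LESS to find the nearest
       of the fragments that survive the GREATER test */
    glDepthFunc(GL_LESS);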

79
Precision / Invariance issues
  • Using shadow mapping hardware introduces
    precision and invariance issues
  • Depth rasterization usually just needs to match
    the output depth buffer's precision, and requires
    no perspective correction
  • Texture hardware requires perspective correction
    and projection at high precision
  • Making things match would be difficult without
    the DEPTH_REPLACE texture shader
  • Computes with texture hardware at texture
    precision
  • Solves invariance problems at some extra expense
  • Will be cheaper in the future

80
(No Transcript)
81
Compositing
  • Each time we peel, we capture the RGBA; then, as
    a final step, we blend all the layers together
    from back to front (sketched below)
  • Opaque fragments completely overwrite previous
    transparent ones
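A hedged C/OpenGL sketch of the final compositing pass (layer_tex[]
holding each peeled layer's RGBA is an assumed name):

    /* standard back-to-front "over" blending */
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
    for (int i = num_layers - 1; i >= 0; i--) {
        glBindTexture(GL_TEXTURE_2D, layer_tex[i]); /* RGBA of layer i */
        draw_fullscreen_quad();  /* opaque texels (alpha = 1) completely
                                    overwrite the layers behind them */
    }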

82
Conclusions
  • Results are nice!
  • Get correct transparency without invasive changes
    to internal data structures
  • Can be bolted on to existing CAD/CAM apps
  • Requires n scene traversals for n correctly
    sorted depths
  • n = 4 is often quite satisfactory (see previous
    slide)
  • Shadow maps are for more than shadows!

83
Questions?
  • cem@nvidia.com
  • http://developer.nvidia.com
  • http://developer.nvidia.com/cg/
  • http://www.cgshaders.org/