Title: Advanced D3D10 Rendering
1. Advanced D3D10 Rendering
- Emil Persson
- May 24, 2007
2. Overview
- Introduction to D3D10
- Rendering techniques in D3D10
- Optimizations
3. Introduction
- Best D3D revision yet!
- Clean and powerful API
- Lots of new features
- SM 4.0
- New geometry shader
- Stream Out
- Texture arrays
- Render to volume texture
- MSAA individual sample access
- Constant buffers
- Sampler state decoupled from texture unit
- Dual-source blending
- Etc
4. Clean API
- Vista only
- Everything is mandatory (almost)
- No legacy hardware support
- Clean starting point for future evolution of the API
- Limited market short-term
- Some old features deprecated
- Fixed function
- Assembly shaders
- Alpha test
- Triangle fans
- Point sprites
- Clip planes
5. Dealing with deprecated features
- Fixed function
  - Write a few über-shaders
- Assembly shaders
  - Convert to HLSL
- Alpha test
  - Use discard or clip() in pixel shader
  - Use alpha-to-coverage
- Triangle fans
  - Seldom used anyway, usually just for a quad
  - Convert to triangle list or strip
- Point sprites
  - Expand point to 2 triangles in GS
- Clip planes
  - Use clip distance and/or cull distance
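The triangle-fan replacement above can be done once, at load time, on the index data. The following is a minimal CPU-side sketch in C++, not code from the talk; the function name `fanToTriangleList` is made up for illustration. It relies only on the fact that a fan of n vertices describes n - 2 triangles sharing the first vertex.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert triangle-fan indices into an equivalent triangle list.
// A fan with n vertices (n >= 3) describes n - 2 triangles that all
// share the first vertex: (v0, v1, v2), (v0, v2, v3), ...
// (fanToTriangleList is a hypothetical helper name.)
std::vector<std::uint32_t> fanToTriangleList(const std::vector<std::uint32_t>& fan)
{
    std::vector<std::uint32_t> list;
    if (fan.size() < 3)
        return list; // not enough vertices for a single triangle

    list.reserve((fan.size() - 2) * 3);
    for (std::size_t i = 1; i + 1 < fan.size(); ++i) {
        list.push_back(fan[0]);     // shared hub vertex
        list.push_back(fan[i]);
        list.push_back(fan[i + 1]);
    }
    return list;
}
```

For the common quad case, a 4-vertex fan becomes 6 indices, and the winding order of each triangle is preserved.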
6. SM 4.0
- Geometry shader
- Processes a full primitive (point, line, triangle)
- Has access to adjacency information (optional)
- Useful for silhouette detection, shadow volume extrusion etc.
- May output multiple primitives
- Output limitation is 1024 floats
- May output nothing (to kill primitive)
7. SM 4.0
- Infinite instruction count
- Very long shaders may have lower throughput though
- Integer and bitwise instructions
- Indexable temporaries
- Allows for local arrays
- May be used to emulate a stack
- Useful system generated values
- SV_VertexID
- SV_PrimitiveID
- SV_InstanceID
- SV_Position (Like VPOS, but now .zw are defined too)
- SV_IsFrontFace (Like VFACE)
- SV_RenderTargetArrayIndex
- SV_ViewportArrayIndex
- SV_ClipDistance
- SV_CullDistance
8. SM 4.0
- Integer and bitwise instructions
- Signed and unsigned
- No idiv though, just udiv
- Same registers as floats
- Can alias without conversion with asint(), asuint(), asfloat() etc.
- Integer texture sample values
- Syntax: Texture2D<uint4> myTex;
- Access to individual samples in MSAA surface
- Allows for custom AA resolve
- Syntax: Texture2DMS<float4, 4> myTex;
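To make "alias without conversion" concrete, here is a rough CPU-side analogue in C++ of what asuint() and asfloat() do: the 32 bits are reinterpreted, not numerically converted. The helper names `asUint` and `asFloat` are made up for this sketch, and it assumes 32-bit IEEE-754 floats.

```cpp
#include <cstdint>
#include <cstring>

// CPU-side analogue of HLSL asuint(): reinterpret the bits of a float
// as an unsigned integer, with no numeric conversion.
// Assumes a 32-bit IEEE-754 float; helper names are hypothetical.
std::uint32_t asUint(float f)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return u;
}

// CPU-side analogue of HLSL asfloat(): the inverse reinterpretation.
float asFloat(std::uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}
```

Contrast this with a cast such as `(uint) 1.0f`, which yields the integer 1; `asUint(1.0f)` yields the bit pattern 0x3F800000.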
9. Pixel center
- Half pixel offset is gone!
- Affects SV_Position as well
- Now matches OpenGL
[Figure: pixel center placement, DX10 vs. DX9]
10. Pixel center
- Pixels and texels align
- TexCoord = SV_Position.xy / float2(width, height)
[Figure: texel centers in screen space]
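A small CPU-side check in C++ of why the formula above needs no half-pixel fix-up: in D3D10 the pixel at integer coordinates (px, py) has SV_Position.xy = (px + 0.5, py + 0.5), so dividing by the render target size lands exactly on the center of the matching texel. The struct and function names are made up for this sketch, and it assumes the texture and the render target have the same dimensions.

```cpp
struct Float2 { float x, y; };

// SV_Position.xy for the pixel at integer coordinates (px, py) in D3D10:
// pixel centers sit at half-integer positions.
Float2 svPosition(int px, int py)
{
    return { px + 0.5f, py + 0.5f };
}

// TexCoord = SV_Position.xy / float2(width, height)
Float2 texCoordFromSvPosition(Float2 pos, float width, float height)
{
    return { pos.x / width, pos.y / height };
}

// Normalized coordinate of the center of texel (tx, ty).
Float2 texelCenter(int tx, int ty, float width, float height)
{
    return { (tx + 0.5f) / width, (ty + 0.5f) / height };
}
```

For a 4x4 target, pixel (3, 1) maps to (0.875, 0.375), which is also the center of texel (3, 1).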
11. The small batch problem
- D3D10 designed to minimize batch overhead
- Pulls work from draw time to creation time
- Validation
- Shader input/output configuration
- Immutable State Objects
- Input layout
- Rasterizer state
- Sampler state
- Depth stencil state
- Blend state
12. The small batch problem
- D3D10 also provides tools to reduce draw calls
- Improved instancing interface
- Geometry shader
- More shader resources
- Constant indexing in PS
- Render target arrays
- Texture arrays
13. Rendering techniques in D3D10
14. Global Illumination
15. Global Illumination
- Probes on a volume grid across the scene
- Each probe captures light environment into a tiny cubemap
- Probes are converted to Spherical Harmonics coefficients
- Indirect lighting is computed using interpolated SH coefficients
- Do the same in probe passes to get multiple light bounces
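The slides do not say how the SH coefficients are interpolated. One plausible reading, given that the probes sit on a volume grid, is a trilinear blend of the 8 probes surrounding the shaded point (which is also what a filtered volume texture lookup would produce). The C++ sketch below illustrates that assumption only; the coefficient count of 9 and all names are hypothetical.

```cpp
#include <array>
#include <cstddef>

// Assumption: 9 coefficients per probe (3 SH bands, one color channel).
constexpr std::size_t kShCoeffs = 9;
using ShProbe = std::array<float, kShCoeffs>;

// Linear blend of two coefficient sets.
ShProbe lerpProbe(const ShProbe& a, const ShProbe& b, float t)
{
    ShProbe r{};
    for (std::size_t i = 0; i < kShCoeffs; ++i)
        r[i] = a[i] + (b[i] - a[i]) * t;
    return r;
}

// Trilinear blend of the 8 probes around a point (an assumed scheme).
// corners[z][y][x] is the probe at that cell corner; fx, fy, fz are the
// fractional position inside the cell, each in [0, 1].
ShProbe interpolateProbes(const ShProbe (&corners)[2][2][2],
                          float fx, float fy, float fz)
{
    ShProbe z[2];
    for (int k = 0; k < 2; ++k) {
        ShProbe y0 = lerpProbe(corners[k][0][0], corners[k][0][1], fx);
        ShProbe y1 = lerpProbe(corners[k][1][0], corners[k][1][1], fx);
        z[k] = lerpProbe(y0, y1, fy);
    }
    return lerpProbe(z[0], z[1], fz);
}
```

Because SH coefficients combine linearly, blending the coefficients and then evaluating the lighting gives the same result as evaluating each probe and blending the results.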
16. Global Illumination
- Awful lot of work
- Each probe is 6 slices. We need loads of probes.
- Sample scene has over 300 probes
- Solution
- Use geometry shader to reduce work
- Distribute work across multiple frames
- Sample updates 40 cubes per frame
- Scatter updates to hide artifacts
- Skip over empty space probes
17. Global Illumination
- The Geometry Shader advantage
- 40 cubes x 6 faces x n draw calls = pain
- DX9 style unrealistic even for simple scenes
- Update multiple slices per pass with GS
- GS output limit is 1024 floats
- Keep number of interpolators down to maximize primitive count
- Managed to update 5 probes (30 slices) per pass
- 8 passes is more manageable than 240 ...
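The numbers on this slide follow from simple budget arithmetic, sketched here in C++. The helper names are made up, and the sketch assumes each input triangle is re-emitted once per slice as a separate 3-vertex primitive, which is the usual way to replicate geometry to render target array slices.

```cpp
// GS output limit cited in the slides.
constexpr int kGsMaxOutputFloats = 1024;

// Most vertices one GS invocation can emit at a given vertex size.
constexpr int maxGsVertices(int floatsPerVertex)
{
    return kGsMaxOutputFloats / floatsPerVertex;
}

// Largest vertex size (in floats) that still lets one invocation emit
// `triangles` separate triangles (3 vertices each).
constexpr int maxFloatsPerVertex(int triangles)
{
    return kGsMaxOutputFloats / (triangles * 3);
}

// Passes needed to update `cubes` cubemaps at `cubesPerPass` per pass.
constexpr int passesNeeded(int cubes, int cubesPerPass)
{
    return (cubes + cubesPerPass - 1) / cubesPerPass;
}
```

With 30 slices per pass, each input triangle becomes 90 output vertices, so the whole output vertex has to fit in 11 floats; that is why the interpolator count matters. Forty cubes at 5 per pass gives 8 passes, compared with 40 x 6 = 240 when every face is rendered separately.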
18. Post tone-mapping resolve
- D3D10 allows for custom AA resolves
- Can drastically improve HDR AA quality
- Standard resolve occurs before tone-mapping
- Ideally resolve should be done after tone-mapping
[Figure: standard resolve vs. custom resolve]
19. Post-tonemapping resolve
    Texture2DMS<float4, SAMPLES> tHDR;

    float4 main(float4 pos : SV_Position) : SV_Target
    {
        int3 coord;
        coord.xy = (int2) pos.xy;
        coord.z = 0;

        // Tone-map individual samples and sum them up
        float4 sum = 0;
        [unroll]
        for (int i = 0; i < SAMPLES; i++)
        {
            float4 c = tHDR.Load(coord, i);
            sum.rgb += 1.0 - exp2(-exposure * c.rgb);
        }

        // Average
        sum *= (1.0 / SAMPLES);

        return sum;
    }
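To see why the order of resolve and tone-mapping matters, the C++ sketch below runs both orders on the CPU for a single color channel, using the same operator as the shader, 1 - exp2(-exposure * c). The function names and the sample values are made up for illustration.

```cpp
#include <cmath>
#include <vector>

// Tone-mapping operator used in the shader: 1 - 2^(-exposure * c).
float toneMap(float c, float exposure)
{
    return 1.0f - std::exp2(-exposure * c);
}

// Standard resolve: average the HDR samples, then tone-map the average.
float resolveThenToneMap(const std::vector<float>& samples, float exposure)
{
    if (samples.empty())
        return 0.0f;
    float sum = 0.0f;
    for (float c : samples)
        sum += c;
    return toneMap(sum / samples.size(), exposure);
}

// Custom resolve: tone-map every sample, then average the results.
float toneMapThenResolve(const std::vector<float>& samples, float exposure)
{
    if (samples.empty())
        return 0.0f;
    float sum = 0.0f;
    for (float c : samples)
        sum += toneMap(c, exposure);
    return sum / samples.size();
}
```

Take an edge pixel where one of four samples hits a very bright surface (HDR value 16) and the rest are black, with exposure 1. Resolving first gives an average of 4, which tone-maps to about 0.94, so the pixel is nearly as bright as a fully covered one and the edge looks aliased. Tone-mapping first gives about 0.25, which matches the 25% coverage.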
20. Optimizations
21. Geometry shader
- GS optimizations
- Input/output usually the bottleneck
- Reduce outputs with frustum and/or backface culling
- Keep input small by packing data
- TexCoord could be 2x16 bits in a uint
- Or use for instance asuint(normal.w)
- Merge to full float4 vectors
- Don't do 2x float2
- Keep output small
- Could be faster to trade for some work in PS
- Pass just position, don't interpolate both lightVec and viewVec
- Or even back-project SV_Position.xyz to world space in PS
- Small output means more work fits within the 1024-float limit
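One way to read "2x16 bits in a uint" is as two 16-bit normalized integers. The C++ sketch below shows the packing as it might run on the CPU when building vertex data; the slide does not specify the encoding (16-bit floats would be another option), so the unorm choice, the [0, 1] range and the function names are all assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Pack a texcoord into one 32-bit uint as two 16-bit unorm values:
// u in the low half, v in the high half.
// Assumption: both coordinates lie in [0, 1]; values outside are clamped.
std::uint32_t packTexCoord(float u, float v)
{
    auto toUnorm16 = [](float x) -> std::uint32_t {
        x = std::min(std::max(x, 0.0f), 1.0f);
        return static_cast<std::uint32_t>(std::lround(x * 65535.0f));
    };
    return toUnorm16(u) | (toUnorm16(v) << 16);
}

// Inverse of packTexCoord; the shader would do the same with
// integer masks and shifts.
void unpackTexCoord(std::uint32_t packed, float& u, float& v)
{
    u = (packed & 0xFFFFu) / 65535.0f;
    v = (packed >> 16) / 65535.0f;
}
```

The round trip loses at most about 1/65535 per coordinate, in exchange for halving the size of that vertex attribute.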
22. GS frustum and backface culling
    // Transform to clip space
    float4 pos[3];
    pos[0] = mul(mvp, In[0].pos);
    pos[1] = mul(mvp, In[1].pos);
    pos[2] = mul(mvp, In[2].pos);

    // Use frustum culling to improve performance
    float4 t0 = saturate(pos[0].xyxy * float4(-1, -1, 1, 1) - pos[0].w);
    float4 t1 = saturate(pos[1].xyxy * float4(-1, -1, 1, 1) - pos[1].w);
    float4 t2 = saturate(pos[2].xyxy * float4(-1, -1, 1, 1) - pos[2].w);
    float4 t = t0 * t1 * t2;

    [branch]
    if (!any(t))
    {
        // Use backface culling to improve performance
        float2 d0 = pos[1].xy * pos[0].w - pos[0].xy * pos[1].w;
        float2 d1 = pos[2].xy * pos[0].w - pos[0].xy * pos[2].w;
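The saturate/multiply trick in the listing encodes a simple rule: a triangle can be discarded when all three clip-space vertices lie outside the same left, right, bottom or top plane. The C++ sketch below spells that rule out with plain comparisons, and computes the 2D cross product of the d0 and d1 vectors from the listing. The names are made up, and because the slide's listing ends before the facing test, the sketch deliberately does not say which sign means back-facing; that depends on the winding convention in use.

```cpp
struct Float4 { float x, y, z, w; };

// True when all three clip-space vertices are outside the same
// left/right/bottom/top frustum plane, so the triangle cannot be visible.
// Mirrors t = t0 * t1 * t2 in the listing: a component of t is non-zero
// only if every vertex violates that plane.
bool outsideSamePlane(const Float4 (&p)[3])
{
    bool left = true, right = true, bottom = true, top = true;
    for (const Float4& v : p) {
        left   = left   && (v.x < -v.w);
        right  = right  && (v.x >  v.w);
        bottom = bottom && (v.y < -v.w);
        top    = top    && (v.y >  v.w);
    }
    return left || right || bottom || top;
}

// Cross product of the d0 and d1 edge vectors from the listing.
// Its sign flips when the triangle's winding flips; which sign counts
// as back-facing is left to the caller (an assumption of this sketch).
float facingSign(const Float4 (&p)[3])
{
    float d0x = p[1].x * p[0].w - p[0].x * p[1].w;
    float d0y = p[1].y * p[0].w - p[0].y * p[1].w;
    float d1x = p[2].x * p[0].w - p[0].x * p[2].w;
    float d1y = p[2].y * p[0].w - p[0].y * p[2].w;
    return d0x * d1y - d0y * d1x;
}
```

Note that the frustum test is conservative: a triangle that straddles a plane, or whose vertices are outside different planes, is kept and left to the fixed-function clipper.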
23. Miscellaneous optimizations
- Pre-baked constant buffers
- Don't update per-material constants in DX9 style
- PS doesn't need to return float4 anymore
- Use float3 if you only care about RGB
- May reduce instruction count
- Use GS to reduce draw calls
- Single pass render-to-cubemap
- Update multiple render targets per pass
24. The new shader compiler
- SM4 shader compiler preserves semantics better
- This means more responsibility for you guys
- Be careful about your assumptions
- Periodically check the resulting assembly
- D3D10DisassembleShader()
- Use GPUShaderAnalyzer for performance critical shaders
25. The new shader compiler
Example
HLSL code:
    float4 main(float4 t : TEXCOORD0) : SV_Target
    {
        if (t.x > t.y)
            return t.xyzw;
        else
            return t.wzyx;
    }
DX9 assembly:
    add r0.x, -v0.x, v0.y
    cmp oC0, r0.x, v0.wzyx, v0
DX10 assembly:
    lt r0.x, v0.y, v0.x
    if_nz r0.x    // <--- Did you really want a branch here?
        mov o0.xyzw, v0.xyzw
        ret
    else
        mov o0.xyzw, v0.wzyx
        ret
    endif
26. The new shader compiler
- Use [branch], [flatten], [unroll] and [loop] to control output code
- This is not for everyone
- Poor use could reduce performance
- Make sure you know what you're doing
- Only use if you're familiar with assembly code
- Verify that you get the code you expect
- Always benchmark both options
New DX10 assembly (using [flatten]):
    lt r0.x, v0.y, v0.x
    movc o0.xyzw, r0.xxxx, v0.xyzw, v0.wzyx
    ret
27. Questions? emil.persson_at_amd.com