Advanced D3D10 Rendering

1
Advanced D3D10 Rendering

Emil Persson
May 24, 2007

2
Overview

Introduction to D3D10
Rendering techniques in D3D10
Optimizations

3
Introduction

Best D3D revision yet!
Clean and powerful API
Lots of new features
SM 4.0
New geometry shader
Stream Out
Texture arrays
Render to volume texture
MSAA individual sample access
Constant buffers
Sampler state decoupled from texture unit
Dual-source blending
Etc

4
Clean API

Vista only
Everything is mandatory (almost)
No legacy hardware support
Clean starting point for future evolution of the API
Limited market short-term
Some old features deprecated
Fixed function
Assembly shaders
Alpha test
Triangle fans
Point sprites
Clip planes

5
Dealing with deprecated features

Fixed function: write a few über-shaders
Assembly shaders: convert to HLSL
Alpha test: use discard or clip() in pixel shader, or use alpha-to-coverage
Triangle fans: seldom used anyway, usually just for a quad; convert to triangle list or strip
Point sprites: expand point to 2 triangles in GS
Clip planes: use clip distance and/or cull distance

6
SM 4.0

Geometry shader
Processes a full primitive (point, line, triangle)
Has access to adjacency information (optional)
Useful for silhouette detection, shadow volume extrusion etc.
May output multiple primitives
Output limitation is 1024 floats
May output nothing (to kill primitive)

7
SM 4.0

Infinite instruction count
Very long shaders may have lower throughput though
Integer and bitwise instructions
Indexable temporaries
Allows for local arrays
May be used to emulate a stack
Useful system generated values
SV_VertexID
SV_PrimitiveID
SV_InstanceID
SV_Position (like VPOS, but now .zw are defined too)
SV_IsFrontFace (Like VFACE)
SV_RenderTargetArrayIndex
SV_ViewportArrayIndex
SV_ClipDistance
SV_CullDistance

8
SM 4.0

Integer bitwise instructions
Signed and unsigned
No idiv though, just udiv
Same registers as floats
Can alias without conversion with asint(), asuint(), asfloat() etc.
Integer texture sample values
Syntax: Texture2D<uint4> myTex;
Access to individual samples in MSAA surface
Allows for custom AA resolve
Syntax: Texture2DMS<float4, 4> myTex;
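
Since integers and floats share registers, asuint() and asfloat() only reinterpret bits. A minimal CPU-side sketch of the same idea in C++ follows; the names as_uint and as_float are illustrative, not part of any D3D API, and a 32-bit IEEE-754 float is assumed.

```cpp
#include <cstdint>
#include <cstring>

// CPU-side analogue of HLSL asuint(): reinterpret the bits of a float
// as an unsigned integer without any numeric conversion.
inline std::uint32_t as_uint(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// CPU-side analogue of HLSL asfloat(): the inverse reinterpretation.
inline float as_float(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}
```

Under the IEEE-754 assumption, as_uint(1.0f) is 0x3F800000, and as_float(as_uint(x)) returns x unchanged.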

9
Pixel center

Half pixel offset is gone!
Affects SV_Position as well
Now matches OpenGL
[Figure: pixel center placement, DX10 vs. DX9]

10
Pixel center

Pixels and texels align
TexCoord = SV_Position.xy / float2(width, height)
[Figure labels: texel center, screen space]
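
A small worked sketch of why pixels and texels now align, written as plain C++ rather than shader code. The helper names and the Float2 struct are hypothetical; the only facts taken from the slides are that pixel centers sit at half-integer SV_Position values and that TexCoord is SV_Position.xy divided by the render target size.

```cpp
struct Float2 { float x, y; };

// SV_Position.xy of the pixel at integer coordinates (px, py) in D3D10:
// pixel centers sit at half-integer positions.
inline Float2 sv_position(int px, int py) {
    return { px + 0.5f, py + 0.5f };
}

// TexCoord = SV_Position.xy / float2(width, height)
inline Float2 tex_coord(Float2 pos, int width, int height) {
    return { pos.x / width, pos.y / height };
}

// Center of texel (tx, ty) in normalized coordinates,
// for a texture with the same dimensions as the render target.
inline Float2 texel_center(int tx, int ty, int width, int height) {
    return { (tx + 0.5f) / width, (ty + 0.5f) / height };
}
```

For pixel (3, 5) on an 8x8 target, tex_coord(sv_position(3, 5), 8, 8) equals texel_center(3, 5, 8, 8), so no half-pixel correction term is needed.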

11
The small batch problem

D3D10 designed to minimize batch overhead
Pulls work from draw time to creation time
Validation
Shader input/output configuration
Immutable State Objects
Input layout
Rasterizer state
Sampler state
Depth stencil state
Blend state

12
The small batch problem

D3D10 also provides tools to reduce draw calls
Improved instancing interface
Geometry shader
More shader resources
Constant indexing in PS
Render target arrays
Texture arrays

13
Rendering techniques in D3D10

14
Global Illumination

15
Global Illumination

Probes on a volume grid across the scene
Each probe captures light environment into a tiny cubemap
Probes are converted to Spherical Harmonics coefficients
Indirect lighting is computed using interpolated SH coefficients
Do the same in probe passes to get multiple light bounces

16
Global Illumination

Awful lot of work
Each probe is 6 slices. We need loads of probes.
Sample scene has over 300 probes
Solution
Use geometry shader to reduce work
Distribute work across multiple frames
Sample updates 40 cubes per frame
Scatter updates to hide artifacts
Skip over empty space probes

17
Global Illumination

The Geometry Shader advantage
40 cubes x 6 faces x n draw calls = pain
DX9 style unrealistic even for simple scenes
Update multiple slices per pass with GS
GS output limit is 1024 floats
Keep number of interpolators down to maximize primitive count
Managed to update 5 probes (30 slices) per pass
8 passes is more manageable than 240 ...
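
The numbers on this slide follow from simple budget arithmetic, sketched here in C++. The function names are illustrative. The sketch assumes one input triangle is emitted once per slice (3 vertices each) and that every output scalar, including the position and the slice index, counts toward the 1024-float limit.

```cpp
// Output limit of a single geometry shader invocation, in floats (from the slides).
constexpr int kGsOutputLimitFloats = 1024;

// Largest per-vertex size (in floats) that still lets one input triangle
// be emitted to `slices` render-target slices, 3 vertices each.
constexpr int max_floats_per_vertex(int slices) {
    return kGsOutputLimitFloats / (slices * 3);
}

// Number of passes needed to update `cubes` probes at `cubes_per_pass` per pass.
constexpr int passes_needed(int cubes, int cubes_per_pass) {
    return (cubes + cubes_per_pass - 1) / cubes_per_pass;
}
```

For 30 slices this leaves 11 floats per vertex, which is why the interpolator count has to stay small. Updating 40 cubes at 5 per pass takes 8 passes, versus 40 x 6 = 240 when each face is its own pass.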

18
Post tone-mapping resolve

D3D10 allows for custom AA resolves
Can drastically improve HDR AA quality
Standard resolve occurs before tone-mapping
Ideally resolve should be done after tone-mapping
[Figure: standard resolve vs. custom resolve]

19
Post-tonemapping resolve

Texture2DMS<float4, SAMPLES> tHDR;

float4 main(float4 pos : SV_Position) : SV_Target
{
    int3 coord;
    coord.xy = (int2) pos.xy;
    coord.z = 0;

    // Tone-map individual samples and sum it up
    float4 sum = 0;
    [unroll]
    for (int i = 0; i < SAMPLES; i++)
    {
        float4 c = tHDR.Load(coord, i);
        sum.rgb += 1.0 - exp2(-exposure * c.rgb);
    }

    // Average
    sum *= (1.0 / SAMPLES);

    return sum;
}
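
To see why the order matters, here is a CPU-side C++ sketch that applies the same tone-mapping operator as the shader, 1 - exp2(-exposure * c), to a single color channel. The function names are hypothetical, and the sample values in the note below are made up for illustration.

```cpp
#include <cmath>
#include <vector>

// Tone-mapping operator from the shader: 1 - 2^(-exposure * c).
inline float tone_map(float c, float exposure) {
    return 1.0f - std::exp2(-exposure * c);
}

// Standard resolve: average the HDR samples first, then tone-map the result.
inline float resolve_then_tone_map(const std::vector<float>& samples, float exposure) {
    float sum = 0.0f;
    for (float c : samples) sum += c;
    return tone_map(sum / samples.size(), exposure);
}

// Custom resolve: tone-map each sample individually, then average.
inline float tone_map_then_resolve(const std::vector<float>& samples, float exposure) {
    float sum = 0.0f;
    for (float c : samples) sum += tone_map(c, exposure);
    return sum / samples.size();
}
```

Take an edge pixel where one of four samples hits a very bright surface, {0, 0, 0, 64}, with exposure 1. Resolving first gives a value just below 1.0, so the edge shows no gradient at all. Tone-mapping first gives 0.25, which matches the 25% coverage.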

20
Optimizations

21
Geometry shader

GS optimizations
Input/output usually the bottleneck
Reduce outputs with frustum and/or backface culling
Keep input small by packing data
TexCoord could be 2x16 bits in a uint
Or use for instance asuint(normal.w)
Merge to full float4 vectors
Don't do 2x float2
Keep output small
Could be faster to trade for some work in PS
Pass just position, don't interpolate both lightVec and viewVec
Or even back-project SV_Position.xyz to world space in PS
Small output means more work fits within the 1024 floats limit
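
As an illustration of the "2x16 bits in a uint" suggestion, the following C++ sketch packs a texture coordinate on the CPU side. The slides do not specify an encoding, so the 16-bit unsigned-normalized layout, the [0, 1] range, and the names pack_texcoord and unpack_texcoord are assumptions.

```cpp
#include <cstdint>

// Pack a texture coordinate in [0, 1] x [0, 1] into one 32-bit value,
// 16 bits per component (u in the low half, v in the high half).
inline std::uint32_t pack_texcoord(float u, float v) {
    auto quantize = [](float x) -> std::uint32_t {
        if (x < 0.0f) x = 0.0f;  // clamp to the representable range
        if (x > 1.0f) x = 1.0f;
        return static_cast<std::uint32_t>(x * 65535.0f + 0.5f);  // round to nearest
    };
    return quantize(u) | (quantize(v) << 16);
}

// Inverse of pack_texcoord; the shader side would do the same with
// integer masks and shifts, which SM 4.0 supports.
inline void unpack_texcoord(std::uint32_t packed, float& u, float& v) {
    u = (packed & 0xFFFFu) * (1.0f / 65535.0f);
    v = (packed >> 16) * (1.0f / 65535.0f);
}
```

This halves the input for that attribute from two floats to one 32-bit value, at the cost of quantizing each component to 16 bits. Coordinates that tile outside [0, 1] would need a different encoding.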

22
GS frustum and backface culling

// Transform to clip space
float4 pos[3];
pos[0] = mul(mvp, In[0].pos);
pos[1] = mul(mvp, In[1].pos);
pos[2] = mul(mvp, In[2].pos);

// Use frustum culling to improve performance
float4 t0 = saturate(pos[0].xyxy * float4(-1, -1, 1, 1) - pos[0].w);
float4 t1 = saturate(pos[1].xyxy * float4(-1, -1, 1, 1) - pos[1].w);
float4 t2 = saturate(pos[2].xyxy * float4(-1, -1, 1, 1) - pos[2].w);
float4 t = t0 * t1 * t2;

[branch]
if (!any(t))
{
    // Use backface culling to improve performance
    float2 d0 = pos[1].xy * pos[0].w - pos[0].xy * pos[1].w;
    float2 d1 = pos[2].xy * pos[0].w - pos[0].xy * pos[2].w;
    ...
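
The frustum test in the listing above can be read as follows: each component of t0, t1 and t2 is positive only when that vertex lies outside one side plane, and the product is non-zero only when all three vertices are outside the same plane. A C++ sketch of that logic follows, with a hypothetical Float4 struct and helper names; only the four side planes are tested, as in the shader.

```cpp
#include <algorithm>
#include <array>

struct Float4 { float x, y, z, w; };

inline float saturate(float v) { return std::min(std::max(v, 0.0f), 1.0f); }

// For one clip-space vertex, returns four values that are > 0 when the vertex
// lies outside the left, bottom, right and top planes respectively
// (x < -w, y < -w, x > w, y > w). Mirrors pos.xyxy * float4(-1,-1,1,1) - pos.w.
inline std::array<float, 4> outside_planes(const Float4& p) {
    return { saturate(-p.x - p.w), saturate(-p.y - p.w),
             saturate( p.x - p.w), saturate( p.y - p.w) };
}

// True when all three vertices are outside the same plane,
// so the triangle cannot be visible and need not be emitted.
inline bool frustum_culled(const Float4& p0, const Float4& p1, const Float4& p2) {
    const auto t0 = outside_planes(p0);
    const auto t1 = outside_planes(p1);
    const auto t2 = outside_planes(p2);
    for (int i = 0; i < 4; ++i)
        if (t0[i] * t1[i] * t2[i] != 0.0f) return true;
    return false;
}
```

A triangle whose vertices are outside different planes is kept, since it may still cross the view volume.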

23
Miscellaneous optimizations

Pre-baked constant buffers
Don't update per-material constants in DX9 style
PS doesn't need to return float4 anymore
Use float3 if you only care about RGB
May reduce instruction count
Use GS to reduce draw calls
Single pass render-to-cubemap
Update multiple render targets per pass

24
The new shader compiler

SM4 shader compiler preserves semantics better
This means more responsibility for you guys
Be careful about your assumptions
Periodically check the resulting assembly
D3D10DisassembleShader()
Use GPUShaderAnalyzer for performance critical shaders

25
The new shader compiler
Example

HLSL code:
float4 main(float4 t : TEXCOORD0) : SV_Target
{
    if (t.x > t.y)
        return t.xyzw;
    else
        return t.wzyx;
}

DX9 assembly:
add r0.x, -v0.x, v0.y
cmp oC0, r0.x, v0.wzyx, v0

DX10 assembly:
lt r0.x, v0.y, v0.x
if_nz r0.x    // <--- Did you really want a branch here?
    mov o0.xyzw, v0.xyzw
    ret
else
    mov o0.xyzw, v0.wzyx
    ret
endif

26
The new shader compiler

Use [branch], [flatten], [unroll] and [loop] to control output code
This is not for everyone
Poor use could reduce performance
Make sure you know what you're doing
Only use if you're familiar with assembly code
Verify that you get the code you expect
Always benchmark both options

New DX10 assembly (using [flatten]):
lt r0.x, v0.y, v0.x
movc o0.xyzw, r0.xxxx, v0.xyzw, v0.wzyx
ret
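
The assembly listings above all compute the same result; they differ only in whether a branch is emitted. The C++ sketch below emulates the selection logic to make that explicit. The Vec4 alias and function names are illustrative, and the emulation relies on cmp choosing its second operand when the first is >= 0 and movc choosing its second operand when the condition is non-zero.

```cpp
#include <array>

using Vec4 = std::array<float, 4>;

// The .wzyx swizzle.
inline Vec4 wzyx(const Vec4& v) { return { v[3], v[2], v[1], v[0] }; }

// HLSL source: if (t.x > t.y) return t.xyzw; else return t.wzyx;
inline Vec4 hlsl_reference(const Vec4& t) {
    return (t[0] > t[1]) ? t : wzyx(t);
}

// DX9 output: add r0.x, -v0.x, v0.y ; cmp oC0, r0.x, v0.wzyx, v0
inline Vec4 dx9_cmp(const Vec4& v) {
    const float r0 = -v[0] + v[1];
    return (r0 >= 0.0f) ? wzyx(v) : v;
}

// DX10 [flatten] output: lt r0.x, v0.y, v0.x ; movc o0, r0.xxxx, v0.xyzw, v0.wzyx
inline Vec4 dx10_movc(const Vec4& v) {
    const bool r0 = v[1] < v[0];
    return r0 ? v : wzyx(v);
}
```

For finite inputs all three functions agree, including the t.x == t.y case, where each returns the swizzled value. Which form runs faster on real hardware still has to be measured, as the slide says.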

27
Questions? emil.persson_at_amd.com
