Title: DirectX 9
1. DirectX 9 Radeon 9700 Performance Optimizations
- Richard Huddy
- RHuddy_at_ati.com
2. DirectX 9 and Radeon 9700 considerations
- Resources
- Sorting and Clearing
- Vertex Buffers and Index Buffers
- Render States
- How to draw primitives
- Vertex Data
- Vertex Shaders
- Pixel Shaders
- Textures
- Targets (both Z and color)
- Miscellaneous
3. General resource management
- Create your most important resources first (that's targets, shaders, textures, VBs, IBs, etc.)
- "Most important" means most frequently used
- Never call Create in your main loop
- So create the main colour and Z buffers before you do anything else
- The main buffer is the one through which the largest number of pixels pass
4. Sorting
- Sort roughly front to back
- There's a staggering amount of hardware devoted to making this highly efficient
- Sort by vertex shader, or
- Sort by pixel shader, or
- Sort by texture
- When you change VS or PS it's good to go back to that shader as soon as possible
- Short shaders are faster when sorted
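The sorting advice above can be sketched as a single sort key. This is only one possible combination, and it is an assumption rather than the slide's prescription: draws are grouped into coarse depth buckets (so the order is only roughly front to back), then ordered by vertex shader, pixel shader and texture inside each bucket. `DrawItem`, `depthBucket` and `sortDraws` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical per-draw record; the field names are illustrative only.
struct DrawItem {
    std::uint32_t vertexShader;  // id of the vertex shader
    std::uint32_t pixelShader;   // id of the pixel shader
    std::uint32_t texture;       // id of the primary texture
    float viewDepth;             // distance from the camera, >= 0
};

// Coarse depth bucket: nearby draws share a bucket, so they can still be
// grouped by shader and texture without breaking the rough depth order.
inline std::uint32_t depthBucket(float viewDepth, float bucketSize) {
    assert(bucketSize > 0.0f && viewDepth >= 0.0f);
    return static_cast<std::uint32_t>(viewDepth / bucketSize);
}

// Sort by (depth bucket, vertex shader, pixel shader, texture).
inline void sortDraws(std::vector<DrawItem>& items, float bucketSize) {
    std::stable_sort(items.begin(), items.end(),
        [bucketSize](const DrawItem& a, const DrawItem& b) {
            return std::make_tuple(depthBucket(a.viewDepth, bucketSize),
                                   a.vertexShader, a.pixelShader, a.texture)
                 < std::make_tuple(depthBucket(b.viewDepth, bucketSize),
                                   b.vertexShader, b.pixelShader, b.texture);
        });
}
```

The bucket size is a tuning knob: a larger bucket favours shader and texture grouping, a smaller one favours depth order.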
5. Clearing
- Ideally use Clear once per frame (not less)
- Always clear the whole render target
- Don't track dirty regions at all
- Always clear colour, Z and stencil together unless you can just clear Z/stencil
- Most importantly, don't force us to preserve stencil
- Don't use 2 triangles to clear
- Using Clear() is the way to get all the fancy Z buffer hardware working for you
6. Vertex Buffers
- Use the standard DirectX 8/9 VB handling algorithm with NOOVERWRITE etc.
- Try to always use DISCARD at the start of the frame on dynamic VBs
- Specify write-only whenever possible
- Use the default pool whenever possible
- Roughly 2-4 MB for best performance
- This allows large batches
- And gives the driver sufficient granularity
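The DISCARD/NOOVERWRITE pattern mentioned above can be sketched without any Direct3D headers. `LockMode` below is a stand-in for the D3DLOCK_DISCARD and D3DLOCK_NOOVERWRITE lock flags, and `DynamicBufferCursor` is a hypothetical helper that only decides where to write and which mode to request; the actual Lock call is left to the application.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for the two lock flags named on the slide.
enum class LockMode { Discard, NoOverwrite };

struct LockRequest {
    std::size_t offset;  // byte offset at which to lock
    LockMode mode;       // how to lock
};

// Hypothetical cursor over one dynamic vertex buffer.
class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacityBytes)
        : capacity_(capacityBytes), next_(0), freshFrame_(true) {}

    // Call once per frame so that the first lock of the frame discards.
    void beginFrame() { freshFrame_ = true; }

    // Decide where `bytes` of new vertex data should go.
    LockRequest reserve(std::size_t bytes) {
        assert(bytes > 0 && bytes <= capacity_);
        if (freshFrame_ || next_ + bytes > capacity_) {
            // Start of frame, or no room left: ask for a fresh buffer.
            freshFrame_ = false;
            next_ = bytes;
            return LockRequest{0, LockMode::Discard};
        }
        // Append after data the GPU may still be reading.
        LockRequest req{next_, LockMode::NoOverwrite};
        next_ += bytes;
        return req;
    }

private:
    std::size_t capacity_;
    std::size_t next_;
    bool freshFrame_;
};
```

With a buffer in the 2-4 MB range, most reservations take the append path and only the occasional one pays for a discard.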
7. Index Buffers
- Treat Index Buffers exactly as if they were vertex buffers, except that you always choose the smallest element possible
- i.e. use 32-bit indices only if you need to
- Use 16-bit indices whenever you can
- All recent ATI hardware treats Index Buffers as first class citizens
- They don't have to be copied about before the chip gets access
- So keep them out of system memory
8. Updating Index and Vertex Buffers
- IBs and VBs which are optimally located need to be updated with sequential DWORD writes
- AGP memory and LVM (Local Video Memory) both benefit from this treatment
9. Handling Render States
- Prefer minimal state blocks
- "Minimal" means you should weed out any redundant state changes where possible
- If 5% of state changes are redundant, that's OK
- If 50% are redundant, then get it fixed!
- The expensive state changes:
  - Switching between VS and FF (fixed function)
  - Switching Vertex Shader
  - Changing Texture
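One way to weed out redundant state changes, and to measure how many there are, is a small cache in front of the device. The sketch below is an assumption about how an application might do this, not a description of the runtime or driver; `StateFilter` and its methods are hypothetical names, and states and values are treated as plain 32-bit integers.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical filter that sits in front of the real state-setting call.
class StateFilter {
public:
    // Returns true when the value differs from the cached one, i.e. when
    // the caller should forward the change to the device.
    bool shouldApply(std::uint32_t state, std::uint32_t value) {
        ++requested_;
        auto it = cache_.find(state);
        if (it != cache_.end() && it->second == value) {
            ++redundant_;  // same value as last time: skip it
            return false;
        }
        cache_[state] = value;
        return true;
    }

    // Percentage of all requests so far that were redundant.
    double redundantPercent() const {
        assert(redundant_ <= requested_);
        return requested_ == 0 ? 0.0 : 100.0 * redundant_ / requested_;
    }

private:
    std::unordered_map<std::uint32_t, std::uint32_t> cache_;
    std::uint64_t requested_ = 0;
    std::uint64_t redundant_ = 0;
};
```

Logging `redundantPercent()` once per frame gives a direct reading against the 5% and 50% thresholds above.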
10. How to draw primitives
- DrawIndexedPrimitive( strip or list )
- Indexing is a big win on real world data
- Long strips beat everything else
- Use lists if you would have to add large numbers of degenerate polys to stick with strips (more than 20% means use lists)
- Make sure your VBs and IBs are in optimal memory for best performance
- Give the card hundreds of polys per call
- Small batches kill performance
11. Vertex data
- Don't scatter it around
- Fewer streams give better cache behaviour
- Compress it if you can
- 16 bits or less per component
- Even if it costs you 1 or 2 ops in the shader
- Try to avoid spilling into AGP
- Because AGP has high latency
- pow2 sizes help: 32 bytes is best
- Work the cache on the GPU
- Avoid random access patterns where possible by reordering vertex data before the main loop
- That's at app start-up or at authoring time
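As a rough illustration of "16 bits or less per component" and the 32-byte target, the sketch below packs floats in [-1, 1] into signed 16-bit integers and decodes them with a single multiply, which is the kind of extra op the shader would pay. The `PackedVertex` layout and the function names are assumptions for illustration; positions would first need scaling into [-1, 1] (for example by the object's bounding box), which is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical 32-byte vertex: every component is 16 bits or smaller.
struct PackedVertex {
    std::int16_t position[4];  // x, y, z, spare: 8 bytes
    std::int16_t normal[4];    // 8 bytes
    std::int16_t tangent[4];   // 8 bytes
    std::int16_t uv[2];        // 4 bytes
    std::uint8_t color[4];     // 4 bytes
};
static_assert(sizeof(PackedVertex) == 32, "vertex should be 32 bytes");

// Map a float in [-1, 1] to a signed 16-bit integer (values outside the
// range are clamped).
inline std::int16_t packSnorm16(float v) {
    assert(!std::isnan(v));
    float clamped = std::max(-1.0f, std::min(1.0f, v));
    return static_cast<std::int16_t>(std::lround(clamped * 32767.0f));
}

// Matching decode: one multiply per component.
inline float unpackSnorm16(std::int16_t q) {
    return static_cast<float>(q) * (1.0f / 32767.0f);
}
```

Compared with four-float positions, normals and tangents, this layout moves far fewer bytes per vertex, at the cost of a small quantisation error.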
12. Compiling and Linking shaders
- Do this all up front
- It may not be obvious to you, but you have to actually use a shader to force its complete instantiation in DirectX 9
- So, if you're not careful you may get linking happening in your main loop
- And linking may be time consuming
- Draw a little of everything before you start for real. Think of this as priming the caches
13. Vertex shaders I
- Shorter shaders are faster: no surprises here
- Avoid all unnecessary writes
- This includes the output registers of the VS
- So use the write masks aggressively
- Pack constants as much as possible
- Prefer locality of reference on constants too
- Be aware of the expansion of macros, but prefer them anyway if they match exactly what you want
- Pack your shader constant updates
- You should optimise the algorithm and leave the object-code optimisation to the driver/runtime
14. Vertex shaders II
- Branches and conditionals are fast, so use them aggressively
- That's not like the CPU, where branches are slow
- Longer shaders allow better batching
- Shorter shaders are also more cache friendly
- i.e. it's usually faster to switch to the previous shader than to any other
- But the shorter your shaders are, the more of them fit into the cache
15. Vertex shaders III
- API change
- Now you don't mov to the address register, you use mova
- And this performs round to nearest, not floor
- And now A0 is a 4D register
- A0.x, A0.y, A0.z, A0.w
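The practical effect of the rounding change can be seen with ordinary CPU arithmetic. The sketch below only mimics the two behaviours described above; it is not shader code, the function names are hypothetical, and how the hardware treats exact .5 ties is not covered here (`std::lround` rounds ties away from zero).

```cpp
#include <cassert>
#include <cmath>

// Index as a floor-style address load would compute it.
inline int indexWithFloor(float v) {
    assert(std::isfinite(v));
    return static_cast<int>(std::floor(v));
}

// Index as a round-to-nearest address load would compute it.
inline int indexWithRoundToNearest(float v) {
    assert(std::isfinite(v));
    return static_cast<int>(std::lround(v));
}
```

For an input of 2.7 the floor version selects entry 2 while the round-to-nearest version selects entry 3, so index values that relied on truncation need revisiting when moving to mova.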
16. Pixel shaders I
- API change to accommodate METs
- You now have to explicitly write to oC0, oC1, oC2 and oC3 to set the output colour
- And the write has to be with a mov instruction
- If you write to oCn you must write to all elements from oC0 to oCn-1
- i.e. writes must be contiguous starting at oC0
- But the writes can happen in any order
- You can also write to oDepth to update the Z buffer, but note that this kills the early Z cull (this replaces ps1.3 texdepth)
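The contiguity rule for output colour writes can be expressed as a small check. This is a sketch of the rule as stated above, not of any runtime validation code; `outputsAreContiguous` is a hypothetical helper, and bit n of its argument is assumed to be set when the shader writes oCn.

```cpp
#include <cassert>
#include <cstdint>

// Bit n of `writtenMask` is set when the shader writes oCn (n = 0..3).
// The order in which the writes happen does not matter, only the set.
inline bool outputsAreContiguous(std::uint32_t writtenMask) {
    assert(writtenMask < 16u);  // only oC0..oC3 exist
    // Valid sets are {oC0}, {oC0,oC1}, {oC0..oC2}, {oC0..oC3}: a run of
    // one-bits starting at bit 0, so mask + 1 is a power of two.
    return writtenMask != 0u && (writtenMask & (writtenMask + 1u)) == 0u;
}
```

For example, writing oC0 and oC2 without oC1 (mask 0x5) fails the check, while writing oC0, oC1 and oC2 (mask 0x7) passes.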
17. Pixel shaders II
- Shorter is much faster
- It's much easier to be pixel limited than vertex limited
- Short shaders are more cache friendly
- Be aggressive with write masks
- Think dual-issue even though it's gone from the API (so split colour and alpha out)
- Generally prefer to spend cycles on shader ops rather than using texture lookups
- Because memory latency is the enemy here
18. Pixel shaders III
- Dual issue?
- But that's not in the 2.0 shader spec
- But remember that DX9 hardware like the Radeon 9700 has to run DirectX 8 apps very fast indeed
- And that means it has dual issue hardware ready for you to use
19. Pixel shaders IV
- Example: diffuse and specular lighting
- Original, 8 instructions in total:
    dp3 r0, r1, r0          // N.H
    dp3 r2, r1, r2          // N.L
    mul r2, r2, r3          // color
    mul r2, r2, r4          // texture
    mul r0.r, r0.r, r0.r    // spec^2
    mul r0.r, r0.r, r0.r    // spec^4
    mul r0.r, r0.r, r0.r    // spec^8
    mad r0.rgb, r0.r, r5, r2
- Optimized to 5 DI (dual-issue) instructions:
    dp3 r0, r1, r0          // N.H
    dp3 r2.r, r1, r2        // N.L
    mul r6.a, r0.r, r0.r    // spec^2
    mul r2.rgb, r2.r, r3    // color
    mul r6.a, r6.a, r6.a    // spec^4
    mul r2.rgb, r2, r4      // texture
    mul r6.a, r6.a, r6.a    // spec^8
    mad r0.rgb, r6.a, r5, r2
20. Pixel shaders V
- Texture instructions
  - Avoid TEXDEPTH to retain the early Z-reject
  - If you do choose to use TEXKILL then use it as early as possible. But the positioning of TEXKILL within texture loading code is unimportant
- Register usage
  - Minimize total number of registers used
  - No problems with dependency
21. Vertex and Pixel shaders
- If you're fed up with writing assembler, and don't feel excited by the opportunity to code 256 VS ops and 96 PS ops, then maybe you should consider HLSL?
- In most cases it is as good as hand-written assembler
- And much faster to author
- Perfect for prototyping
- And for release code where you use D3DX
22. Textures I
- API addition
- SetSamplerState()
- Handles the now-decoupled texture sampler setup
- You may now freely mix and match texture coordinates with texture samplers to fetch texels in arbitrary ways
- Texture coordinates are now just iterated floats
- Samplers handle clamp, wrap, bias and filter modes
- You have 8 texture coordinates
- And 16 texture samplers
- texld r11, t7, s15 (all register numbers are max)
23. Textures II
- Use compressed textures
- Do you need a good compressor?
- Use smaller textures
- Use 16-bit textures in preference to 32-bit
- Use textures with few components
- Use an L8 or A8 format if that's what you want
- Pack textures together
- e.g. if you're using two 2D textures then consider using a single RGBA texture
- Texture performance is bandwidth limited
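Since texture performance is bandwidth limited, it can help to compare how many bytes each format costs. The sketch below computes the size of the top mip level only; the enum and function names are hypothetical, the mip chain is ignored, and the bits-per-texel figures are the standard ones for these formats (DXT1 stores a 4x4 block in 8 bytes, DXT5 in 16 bytes).

```cpp
#include <cassert>
#include <cstddef>

// A few representative formats; names are illustrative.
enum class TexelFormat { Argb8888, Rgb565, L8, Dxt1, Dxt5 };

inline std::size_t bitsPerTexel(TexelFormat f) {
    switch (f) {
        case TexelFormat::Argb8888: return 32;
        case TexelFormat::Rgb565:   return 16;
        case TexelFormat::L8:       return 8;
        case TexelFormat::Dxt1:     return 4;  // 8 bytes per 4x4 block
        case TexelFormat::Dxt5:     return 8;  // 16 bytes per 4x4 block
    }
    return 0;
}

// Size in bytes of the top mip level.
inline std::size_t topLevelBytes(std::size_t width, std::size_t height,
                                 TexelFormat f) {
    // Multiples of 4 keep the block-compressed maths exact.
    assert(width % 4 == 0 && height % 4 == 0);
    return width * height * bitsPerTexel(f) / 8;
}
```

A 256x256 texture takes 256 KB as 32-bit ARGB, 128 KB as 16-bit, 64 KB as L8 and 32 KB as DXT1, so every texel fetched from the smaller formats costs correspondingly less bandwidth.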
24. Textures III
- Filtering modes
- Use trilinear filtering to improve texture cache coherency
- Only use anisotropic or trilinear filtering when they make sense: they are more expensive
- Avoid using anisotropic filtering with bump mapping
- Avoid using trilinear anisotropic filtering unless the quality win justifies it
- More costly filtering is more affordable with longer pixel shaders
25. Targets
- Always clear the whole of the target
- Present()
- WASSTILLDRAWING makes a comeback
- Please use it!
- Because using it properly will gain you CPU cycles, and that's typically your scarcest resource
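The idea behind using WASSTILLDRAWING is to do something useful on the CPU instead of blocking inside Present(). The sketch below models that loop without Direct3D: `tryPresent` is a stand-in for a non-blocking present that returns false while the hardware is still drawing, and `doUsefulWork` is any small CPU job. Both names, and the attempt limit, are assumptions made for illustration.

```cpp
#include <cassert>
#include <functional>

// Keeps the CPU busy while the present call reports "still drawing".
// Returns the attempt number on which the present succeeded, or -1 if it
// never did within `maxAttempts`.
inline int presentWhileWorking(const std::function<bool()>& tryPresent,
                               const std::function<void()>& doUsefulWork,
                               int maxAttempts) {
    assert(maxAttempts > 0);
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (tryPresent()) {
            return attempt;  // frame handed over
        }
        doUsefulWork();      // GPU busy: spend the wait on the CPU
    }
    return -1;               // still busy; the caller decides what to do
}
```

In a real application the work items would be things like AI, audio or streaming updates that can be sliced into short steps.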
26. Depth Buffer I
- Never lock depth buffers
- Clearing depth buffers
  - Clear the whole surface
  - When stencil is present, clear both depth and stencil simultaneously
- If possible, disable depth buffering when alpha blending (e.g. drawing HUDs)
- Use as few depth buffers as possible
- i.e. re-use them across multiple render targets
27. Depth Buffer II
- Efficiently use Hyper-Z
- Render front to back
- Make Znear, Zfar close to active depth range of the scene
- The EQUAL and NOT EQUAL depth tests require exact compares, which kill the early Z comparisons. Avoid them!
28. Occlusion query
- New to DirectX 9
- In GL you have HP_occlusion_test and NV_occlusion_query to avoid the need for locks
- Not free, but much cheaper than Lock()
- Supported on all ATI hardware since the Radeon 8500
- CreateQuery(OCCLUSION, ppQuery)
- Issue(Begin/End)
- GetData() returns S_OK to signal completion, but please don't spin waiting for the answer
29. AGP 8X
- Is fast at 2 GB per second
- But has high latency compared to LVM
- And is 10 times slower than LVM
- Radeon 9700 has up to 20 GB per second of bandwidth available when talking to LVM
- (LVM = Local Video Memory)
30. User clip planes
- User clip planes are much more efficient than texkill because:
  - They insert a per-vertex test, rather than a per-pixel test, and vertices are typically fewer in number than pixels
  - It's important always to kill data at the earliest stage possible in the pipeline
  - Plus, clipping is essentially a geometric operation
- All hardware which supports ps1.4 supports user clip planes in hardware
31. Sky box. First or last?
- Draw it last because
  - That's a rough front to back sort
  - In this case you know that most sky pixels will fail the Z test
- Draw it first because
  - That way you don't need any Z tests
  - In this case you know that most sky pixels would pass the Z test
32. So, here is our target
- DX9 style mainstream graphics (per frame)
  - More than 500K triangles
  - Fewer than 500 DrawIndexedPrimitive() calls
  - Fewer than 500 VertexBuffer switches
  - Fewer than 200 different textures
  - Fewer than 200 state change groups
  - Few calls to SetRenderTarget: aim for 0 to 4
  - 1 pass per poly is typical, but 2 is sometimes smart
- Runs at monitor refresh rate
- Which gives more than 40 million polys per second
- And everything goes through the programmable pipeline
- No occurrences of Lock(0), DrawPrimitive(), DPUP()
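The per-frame targets above translate into per-second and per-call figures with simple arithmetic. The sketch below works them through; the 85 Hz refresh rate used in the example is an assumption (the slide only says "monitor refresh rate"), and `FrameBudget` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Per-frame budget taken from the target list.
struct FrameBudget {
    std::uint64_t triangles;  // triangles per frame
    std::uint64_t drawCalls;  // DrawIndexedPrimitive() calls per frame
};

// Triangles submitted per second at a given refresh rate.
inline std::uint64_t trianglesPerSecond(const FrameBudget& b,
                                        std::uint64_t refreshHz) {
    return b.triangles * refreshHz;
}

// Average batch size implied by the budget.
inline std::uint64_t trianglesPerCall(const FrameBudget& b) {
    assert(b.drawCalls > 0);
    return b.triangles / b.drawCalls;
}
```

With 500K triangles in 500 calls, the average batch is 1000 triangles, in line with "hundreds of polys per call". At an assumed 85 Hz that is 42.5 million triangles per second; at 60 Hz the same budget gives 30 million, so the 40 million figure implies a refresh rate of 80 Hz or more, or a triangle count above the 500K minimum.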
33. Questions
Richard Huddy RHuddy_at_ati.com