DirectX 9 - PowerPoint PPT Presentation

About This Presentation

Title:

DirectX 9

Slides: 34

Transcript and Presenter's Notes

Title: DirectX 9

1
DirectX 9 Radeon 9700 Performance Optimizations

Richard Huddy
RHuddy_at_ati.com

2
DirectX 9 and Radeon 9700 considerations

Resources
Sorting and Clearing
Vertex Buffers and Index Buffers
Render States
How to draw primitives
Vertex Data
Vertex Shaders
Pixel Shaders
Textures
Targets (both Z and color)
Miscellaneous

3
General resource management

Create your most important resources first
(that's targets, shaders, textures, VBs, IBs,
etc.)
"Most important" means most frequently used
Never call Create in your main loop
So create the main colour and Z buffers before
you do anything else
The main buffer is the one through which the
largest number of pixels pass

4
Sorting

Sort roughly front to back
There's a staggering amount of hardware devoted
to making this highly efficient
Sort by vertex shader
or
Sort by pixel shader, or
sort by texture
When you change VS or PS it's good to go back to
that shader as soon as possible
Short shaders are faster2 when sorted
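
One way to combine the sorting advice above is sketched below. This is an illustration only, not the presenter's code: `DrawItem`, `depthBucket`, `sortDrawItems` and the bucket size are hypothetical names and parameters. Items are grouped into coarse depth buckets (roughly front to back), and within a bucket they are ordered by vertex shader, then pixel shader, then texture, so that switches are kept to a minimum.

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical description of one draw call; none of these names come from Direct3D.
struct DrawItem {
    std::uint32_t vertexShaderId;
    std::uint32_t pixelShaderId;
    std::uint32_t textureId;
    float viewDepth;  // distance from the camera, larger means further away
};

// Coarse depth bucket: "roughly" front to back, not an exact depth sort.
// Assumes bucketSize > 0.
inline std::uint32_t depthBucket(float viewDepth, float bucketSize) {
    return viewDepth <= 0.0f ? 0u
                             : static_cast<std::uint32_t>(viewDepth / bucketSize);
}

// Sort by depth bucket first, then by shader and texture inside each bucket.
inline void sortDrawItems(std::vector<DrawItem>& items, float bucketSize) {
    std::stable_sort(items.begin(), items.end(),
        [bucketSize](const DrawItem& a, const DrawItem& b) {
            return std::make_tuple(depthBucket(a.viewDepth, bucketSize),
                                   a.vertexShaderId, a.pixelShaderId, a.textureId)
                 < std::make_tuple(depthBucket(b.viewDepth, bucketSize),
                                   b.vertexShaderId, b.pixelShaderId, b.textureId);
        });
}
```

Which key should dominate (depth or shader) depends on the scene; the order used here is an assumption and worth profiling.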

5
Clearing

Ideally use Clear once per frame (not less)
Always clear the whole render target
Don't track dirty regions at all
Always clear colour, Z and stencil together
unless you can just clear Z/stencil
Most importantly, don't force us to preserve
stencil
Don't use 2 triangles to clear
Using Clear() is the way to get all the fancy Z
buffer hardware working for you

6
Vertex Buffers

Use the standard DirectX8/9 VB handling algorithm
with NOOVERWRITE etc
Try to always use DISCARD at the start of the
frame on dynamic VBs
Specify write-only whenever possible
Use the default pool whenever possible
Roughly 2 to 4 MB for best performance
This allows large batches
And gives the driver sufficient granularity
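
The NOOVERWRITE/DISCARD pattern mentioned above can be sketched without touching Direct3D at all. The model below only decides which lock mode and offset to use; `LockMode`, `LockRequest` and `DynamicBufferCursor` are hypothetical names, with `LockMode::Discard` and `LockMode::NoOverwrite` standing in for `D3DLOCK_DISCARD` and `D3DLOCK_NOOVERWRITE`. A real application would pass the result to the buffer's `Lock` call.

```cpp
#include <cstddef>

// Stand-ins for D3DLOCK_NOOVERWRITE and D3DLOCK_DISCARD (assumption: the
// application maps these onto the real lock flags itself).
enum class LockMode { NoOverwrite, Discard };

struct LockRequest {
    std::size_t offset;  // byte offset to lock at
    LockMode mode;       // how to lock
};

// Tracks the write position inside one dynamic buffer.
class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacityBytes)
        : capacity_(capacityBytes) {}

    // Call once at the start of each frame so the first lock uses Discard.
    void beginFrame() { forceDiscard_ = true; }

    // Reserve space for the next batch. Assumes bytes <= capacity.
    LockRequest reserve(std::size_t bytes) {
        // Discard at frame start, or when the batch would run off the end.
        const bool wrap = forceDiscard_ || cursor_ + bytes > capacity_;
        if (wrap) {
            cursor_ = 0;
            forceDiscard_ = false;
        }
        const LockRequest request{cursor_,
                                  wrap ? LockMode::Discard : LockMode::NoOverwrite};
        cursor_ += bytes;
        return request;
    }

private:
    std::size_t capacity_;
    std::size_t cursor_ = 0;
    bool forceDiscard_ = true;
};
```

Appending with NoOverwrite promises the driver that data already submitted is left alone, which is why the cursor only moves forward until the next Discard.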

7
Index Buffers

Treat Index Buffers exactly as if they were
vertex buffers except that you always choose
the smallest element possible
i.e. Use 32 bit indices only if you need to
Use 16 bit indices whenever you can
All recent ATI hardware treats Index Buffers as
first class citizens
They don't have to be copied about before the
chip gets access
So keep them out of system memory
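
As a small illustration of "use 16 bit indices whenever you can", the helper below picks an index width from the vertex count. `IndexFormat`, `chooseIndexFormat` and `indexBufferBytes` are hypothetical names; the two enum values stand in for `D3DFMT_INDEX16` and `D3DFMT_INDEX32`. The 65536 limit assumes every 16-bit value is usable as an index, which may not hold on all hardware.

```cpp
#include <cstddef>

// Stand-ins for D3DFMT_INDEX16 and D3DFMT_INDEX32.
enum class IndexFormat { Index16, Index32 };

// 16-bit indices can address vertices 0..65535, so they cover up to 65536 vertices.
inline IndexFormat chooseIndexFormat(std::size_t vertexCount) {
    return vertexCount <= 65536 ? IndexFormat::Index16 : IndexFormat::Index32;
}

// Size of the index buffer in bytes for the chosen format.
inline std::size_t indexBufferBytes(std::size_t indexCount, IndexFormat format) {
    return indexCount * (format == IndexFormat::Index16 ? 2 : 4);
}
```

Halving the index size halves the bandwidth spent fetching indices, which is the point of the advice.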

8
Updating Index and Vertex Buffers

IBs and VBs which are optimally located need to
be updated with sequential DWORD writes.
AGP memory and LVM both benefit from this
treatment

9
Handling Render States

Prefer minimal state blocks
"Minimal" means you should weed out any redundant
state changes where possible
If 5% of state changes are redundant that's OK
If 50% are redundant then get it fixed!
The expensive state changes
Switching between VS and FF
Switching Vertex Shader
Changing Texture
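
A simple way to weed out redundant state changes is to remember the last value sent for each state and skip repeats. The sketch below is a model only: `RenderStateCache` and its methods are hypothetical, and the state and value are plain integers rather than real Direct3D enums. The caller is assumed to forward the change to the device only when `needsUpdate` returns true.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Remembers the last value set for each render state (hypothetical helper).
class RenderStateCache {
public:
    // Returns true when the value differs from the cached one, meaning the
    // caller should forward the change to the device.
    bool needsUpdate(std::uint32_t state, std::uint32_t value) {
        const auto it = current_.find(state);
        if (it != current_.end() && it->second == value) {
            ++redundant_;  // same value as last time: skip it
            return false;
        }
        current_[state] = value;
        ++forwarded_;
        return true;
    }

    std::size_t redundantCount() const { return redundant_; }
    std::size_t forwardedCount() const { return forwarded_; }

private:
    std::unordered_map<std::uint32_t, std::uint32_t> current_;
    std::size_t redundant_ = 0;
    std::size_t forwarded_ = 0;
};
```

The two counters give a rough measure of how many requested changes were redundant, which is the figure the slide asks you to keep low.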

10
How to draw primitives

DrawIndexedPrimitive( strip or list )
Indexing is a big win on real world data
Long strips beat everything else
Use lists if you would have to add large numbers
of degenerate polys to stick with strips (more
than 20% means use lists)
Make sure your VBs and IBs are in optimal
memory for best performance
Give the card hundreds of polys per call
Small batches kill performance

11
Vertex data

Don't scatter it around
Fewer streams give better cache behaviour
Compress it if you can
16 bits or less per component
Even if it costs you 1 or 2 ops in the shader
Try to avoid spilling into AGP
Because AGP has high latency
Power-of-two sizes help; 32 bytes is best
Work the cache on the GPU
Avoid random access patterns where possible by
reordering vertex data before the main loop
That's at app start-up or at authoring time
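
To make "16 bits or less per component" and "32 bytes is best" concrete, here is a minimal sketch of packing normalised floats into signed 16-bit integers and laying out a 32-byte vertex. The names `packSnorm16`, `unpackSnorm16` and `PackedVertex` are hypothetical, and the particular layout is an assumption rather than anything from the slides. The decode is a single multiply, which corresponds to the "1 or 2 ops in the shader" cost.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Pack a value in [-1, 1] into a signed 16-bit integer (out-of-range input is clamped).
inline std::int16_t packSnorm16(float v) {
    const float clamped = std::max(-1.0f, std::min(1.0f, v));
    return static_cast<std::int16_t>(std::lround(clamped * 32767.0f));
}

// Decode on the CPU; a shader would do the same with one multiply.
inline float unpackSnorm16(std::int16_t packed) {
    return static_cast<float>(packed) / 32767.0f;
}

// One possible 32-byte vertex: full-precision position, everything else packed.
struct PackedVertex {
    float position[3];        // 12 bytes
    std::int16_t normal[4];   //  8 bytes, w unused
    std::int16_t tangent[4];  //  8 bytes, w unused
    std::int16_t uv[2];       //  4 bytes
};
static_assert(sizeof(PackedVertex) == 32, "vertex should be exactly 32 bytes");
```

The quantisation error is at most about 1/65534 per component, which is normally invisible for normals and tangents.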

12
Compiling and Linking shaders

Do this all up front
It may not be obvious to you - but you have to
actually use a shader to force its complete
instantiation in DirectX 9
So, if you're not careful you may get linking
happening in your main loop
And linking may be time consuming
Draw a little of everything before you start for
real. Think of this as priming the caches

13
Vertex shaders I

Shorter shaders are faster, no surprises here
Avoid all unnecessary writes
This includes the output registers of the VS
So use the write masks aggressively
Pack constants as much as possible
Prefer locality of reference on constants too
Be aware of the expansion of macros but prefer
them anyway if they match exactly what you want
Pack your shader constant updates
You should optimise the algorithm and leave the
object-code optimisation to the driver/runtime

14
Vertex shaders II

Branches and conditionals are fast so use them
aggressively
That's not like the CPU where branches are slow
Longer shaders allow better batching
Shorter shaders are also more cache friendly
i.e. it's usually faster to switch to the
previous shader than to any other
But the shorter your shaders are
the more of them fit into the cache.

15
Vertex shaders III

API Change
Now you don't mov to the address register, you
use mova
And this performs round to nearest, not floor
And now A0 is a 4D register
A0.x, A0.y, A0.z, A0.w
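
The difference between floor and round to nearest matters when the index is computed in floating point. The sketch below shows the two behaviours on the CPU; `addressWithFloor` and `addressWithRoundToNearest` are hypothetical names. How the hardware breaks exact ties (x.5) is not stated in the slides, so `std::lround` here is an assumption for that case only.

```cpp
#include <cmath>

// Old behaviour: the fractional part is dropped toward negative infinity.
inline long addressWithFloor(float value) {
    return static_cast<long>(std::floor(value));
}

// New behaviour as described for mova: the nearest integer is used.
// Tie handling (x.5) in this sketch follows std::lround, which is an assumption.
inline long addressWithRoundToNearest(float value) {
    return std::lround(value);
}
```

An index that comes out as 2.7 selects entry 2 under floor but entry 3 under round to nearest, so shaders ported from the older model may need their index arithmetic checked.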

16
Pixel shaders I

API change to accommodate METs
You now have to explicitly write to oC0, oC1, oC2
and oC3 to set the output colour
And the write has to be with a mov instruction
If you write to oCn you must write to all
elements from oC0 to oCn-1
i.e. Writes must be contiguous starting at oC0
But the writes can happen in any order
You can also write to oDepth to update the Z
buffer but note that this kills the early Z cull
(this replaces ps1.3 texdepth)

17
Pixel shaders II

Shorter is much faster
It's much easier to be pixel limited than vertex
limited
Short shaders are more cache friendly
Be aggressive with write masks
Think dual-issue even though it's gone from
the API (so split colour and alpha out)
Generally prefer to spend cycles on shader ops
rather than using texture lookups
Because memory latency is the enemy here

18
Pixel shaders III

Dual issue?
But that's not in the 2.0 shader spec
But remember that DX9 hardware like the Radeon
9700 has to run DirectX 8 apps very fast indeed
And that means it has dual issue hardware ready
for you to use

19
Pixel shaders IV

Example: Diffuse + specular lighting

dp3 r0, r1, r0            // N.H
dp3 r2, r1, r2            // N.L
mul r2, r2, r3            // color
mul r2, r2, r4            // texture
mul r0.r, r0.r, r0.r      // spec^2
mul r0.r, r0.r, r0.r      // spec^4
mul r0.r, r0.r, r0.r      // spec^8
mad r0.rgb, r0.r, r5, r2
Total 8 instructions

dp3 r0, r1, r0            // N.H
dp3 r2.r, r1, r2          // N.L
mul r6.a, r0.r, r0.r      // spec^2
mul r2.rgb, r2.r, r3      // color
mul r6.a, r6.a, r6.a      // spec^4
mul r2.rgb, r2, r4        // texture
mul r6.a, r6.a, r6.a      // spec^8
mad r0.rgb, r6.a, r5, r2
Optimized to 5 DI instructions

20
Pixel shaders V

Texture instructions
Avoid TEXDEPTH to retain the early Z-reject
If you do choose to use TEXKILL then use it as
early as possible. But, the positioning of
TEXKILL within texture loading code is
unimportant
Register usage
Minimize total number of registers used
No problems with dependency

21
Vertex and Pixel shaders

If you're fed up with writing assembler, and
don't feel excited by the opportunity to code 256
VS ops and 96 PS ops, then
maybe you should consider HLSL?
In most cases it is as good as hand-written
assembler
And much faster to author
Perfect for prototyping
And for release code where you use D3DX

22
Textures I

API addition
SetSamplerState()
Handles the now-decoupled texture sampler setup.
You may now freely mix and match texture
coordinates with texture samplers to fetch texels
in arbitrary ways
Texture coordinates are now just iterated floats
Samplers handle clamp, wrap, bias and filter
modes
You have 8 texture coordinates
And 16 texture samplers
texld r11, t7, s15 (all register numbers are max)

23
Textures II

Use compressed textures
Do you need a good compressor?
Use smaller textures
Use 16 bit textures in preference to 32 bit
Use textures with few components
Use an L8 or A8 format if that's what you want
Pack textures together
e.g. if you're using two 2D textures then
consider using a single RGBA texture
Texture performance is bandwidth limited

24
Textures III

Filtering modes
Use trilinear filtering to improve texture cache
coherency
Only use anisotropic or tri-linear filtering when
they make sense - they are more expensive
Avoid using anisotropic filtering with
bumpmapping
Avoid using tri-linear anisotropic filtering
unless the quality win justifies it
More costly filtering is more affordable with
longer pixel shaders

25
Targets

Always clear the whole of the target
Present()
WASSTILLDRAWING makes a comeback
Please use it!
Because using it properly will gain you CPU
cycles - and that's typically your scarcest
resource

26
Depth Buffer I

Never lock depth buffers
Clearing depth buffers
Clear the whole surface
When stencil is present clear both depth and
stencil simultaneously
If possible disable depth buffering when alpha
blending (e.g. drawing HUDs)
Use as few depth buffers as possible
i.e. re-use them across multiple render targets

27
Depth Buffer II

Efficiently use Hyper-Z
Render front to back
Make Znear, Zfar close to active depth range of
the scene
The EQUAL and NOT EQUAL depth tests require exact
compares which kill the early Z comparisons.
Avoid them!

28
Occlusion query

New to DirectX 9
In GL you have HP_occlusion_test and
NV_occlusion_query to avoid the need for locks
Not free, but much cheaper than Lock()
Supported on all ATI hardware since the Radeon
8500
CreateQuery(OCCLUSION, ppQuery)
Issue(Begin/End)
GetData() returns S_OK to signal completion - but
please don't spin waiting for the answer
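
One way to avoid spinning is to poll once per frame and keep using the last known answer until a new one arrives. The sketch below models that policy without calling Direct3D: `PendingQuery` and `isProbablyVisible` are hypothetical, and the `poll` callback stands in for a non-blocking `GetData()` call that reports whether the result is ready and, if so, how many pixels passed.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical wrapper around one occlusion query.
struct PendingQuery {
    // Stand-in for a non-blocking GetData(): returns true and fills
    // pixelsOut once the result is available, otherwise returns false.
    std::function<bool(std::uint32_t& pixelsOut)> poll;

    // Until a result arrives, assume the object is visible.
    std::uint32_t lastKnownPixels = 1;
};

// Poll exactly once and never wait. Stale answers are reused meanwhile.
inline bool isProbablyVisible(PendingQuery& query) {
    std::uint32_t pixels = 0;
    if (query.poll(pixels)) {
        query.lastKnownPixels = pixels;
    }
    return query.lastKnownPixels > 0;
}
```

The cost of this approach is that visibility can lag by a frame or two; treating an unknown result as visible keeps that lag from causing objects to disappear.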

29
AGP 8X

Is fast at 2GB per second
But has high latency compared to LVM
And is 10 times slower than LVM
Radeon 9700 has up to 20GB per sec of bandwidth
available when talking to LVM
(LVM Local Video Memory)

30
User clip planes

User clip planes are much more efficient than
texkill because
They insert a per-vertex test, rather than a
per-pixel test, and vertices are typically fewer
in number than pixels
It's important always to kill data at the
earliest stage possible in the pipeline
Plus, clipping is essentially a geometric
operation
All hardware which supports ps1.4 supports user
clip planes in hardware

31
Sky box. First or last?