DirectX 9 - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: DirectX 9


1
DirectX 9 Radeon 9700 Performance Optimizations
  • Richard Huddy
  • RHuddy_at_ati.com

2
DirectX 9 and Radeon 9700 considerations
  • Resources
  • Sorting and Clearing
  • Vertex Buffers and Index Buffers
  • Render States
  • How to draw primitives
  • Vertex Data
  • Vertex Shaders
  • Pixel Shaders
  • Textures
  • Targets (both Z and color)
  • Miscellaneous

3
General resource management
  • Create your most important resources first
    (that's targets, shaders, textures, VBs, IBs,
    etc.)
  • "Most important" means most frequently used
  • Never call Create in your main loop
  • So create the main colour and Z buffers before
    you do anything else
  • The main buffer is the one through which the
    largest number of pixels pass

4
Sorting
  • Sort roughly front to back
  • There's a staggering amount of hardware devoted
    to making this highly efficient
  • Sort by vertex shader
  • or
  • Sort by pixel shader, or
  • sort by texture
  • When you change VS or PS it's good to go back to
    that shader as soon as possible
  • Short shaders are even faster when sorted
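
The sorting advice above can be sketched as a single comparison key. This is a minimal illustration, not the presenter's code: the `DrawItem` record and its field names are hypothetical, and folding the shader, texture and front-to-back criteria into one key is an assumption (the slide offers them as alternatives).

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical per-draw record; the field names are illustrative only.
struct DrawItem {
    std::uint32_t vertexShaderId;
    std::uint32_t pixelShaderId;
    std::uint32_t textureId;
    float viewDepth;  // distance from the camera, smaller = nearer
};

// Group draws by vertex shader, then pixel shader, then texture, and order
// draws that share all three roughly front to back.
inline void sortDrawItems(std::vector<DrawItem>& items) {
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return std::tie(a.vertexShaderId, a.pixelShaderId,
                                  a.textureId, a.viewDepth) <
                         std::tie(b.vertexShaderId, b.pixelShaderId,
                                  b.textureId, b.viewDepth);
              });
}
```

Which criterion should lead the key depends on where the application is bottlenecked; a pixel-limited scene may prefer depth first.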

5
Clearing
  • Ideally use Clear once per frame (not less)
  • Always clear the whole render target
  • Don't track dirty regions at all
  • Always clear colour, Z and stencil together
    unless you can just clear Z/stencil
  • Most importantly, don't force us to preserve
    stencil
  • Don't use 2 triangles to clear
  • Using Clear() is the way to get all the fancy Z
    buffer hardware working for you

6
Vertex Buffers
  • Use the standard DirectX8/9 VB handling algorithm
    with NOOVERWRITE etc
  • Try to always use DISCARD at the start of the
    frame on dynamic VBs
  • Specify write-only whenever possible
  • Use the default pool whenever possible
  • Roughly 2 to 4 MB for best performance
  • This allows large batches
  • And gives the driver sufficient granularity
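
One way to read the DISCARD / NOOVERWRITE pattern above is as a write cursor that appends until the buffer is full and starts over with a discard. The sketch below models only that decision in plain C++; `LockMode`, `LockRequest` and `DynamicBufferCursor` are hypothetical names, and the enum values merely stand in for the D3DLOCK_DISCARD and D3DLOCK_NOOVERWRITE lock flags.

```cpp
#include <cstddef>

// Stand-ins for the D3DLOCK_NOOVERWRITE and D3DLOCK_DISCARD lock flags.
enum class LockMode { NoOverwrite, Discard };

struct LockRequest {
    std::size_t offset;  // byte offset at which to lock
    LockMode mode;
};

// Tracks the write cursor of one dynamic vertex buffer.
class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacityBytes)
        : capacity_(capacityBytes), cursor_(0), freshFrame_(true) {}

    // Call once at the start of each frame so the first lock discards.
    void beginFrame() { freshFrame_ = true; }

    // Decide where the next 'bytes' of vertex data go and which mode to use.
    // Assumes bytes <= capacity.
    LockRequest allocate(std::size_t bytes) {
        LockRequest req;
        if (freshFrame_ || cursor_ + bytes > capacity_) {
            // New frame or buffer full: request a fresh buffer, restart at 0.
            cursor_ = 0;
            req.mode = LockMode::Discard;
        } else {
            // Append after data the GPU may still be reading.
            req.mode = LockMode::NoOverwrite;
        }
        freshFrame_ = false;
        req.offset = cursor_;
        cursor_ += bytes;
        return req;
    }

private:
    std::size_t capacity_;
    std::size_t cursor_;
    bool freshFrame_;
};
```

In a real application the returned offset and mode would be passed on to the vertex buffer's lock call.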

7
Index Buffers
  • Treat Index Buffers exactly as if they were
    vertex buffers except that you always choose
    the smallest element possible
  • i.e. Use 32 bit indices only if you need to
  • Use 16 bit indices whenever you can
  • All recent ATI hardware treats Index Buffers as
    first class citizens
  • They don't have to be copied about before the
    chip gets access
  • So keep them out of system memory
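
As a small illustration of the "smallest element possible" rule, the helper below picks the index width from the vertex count. The function names are hypothetical; the two sizes correspond to the D3DFMT_INDEX16 and D3DFMT_INDEX32 formats.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes per index: 2 corresponds to D3DFMT_INDEX16, 4 to D3DFMT_INDEX32.
inline std::size_t indexSizeBytes(std::uint32_t vertexCount) {
    // 16-bit indices can address vertices 0..65535.
    return vertexCount <= 65536u ? 2 : 4;
}

// Total size of an index buffer holding 'indexCount' indices.
inline std::size_t indexBufferBytes(std::uint32_t vertexCount,
                                    std::size_t indexCount) {
    return indexCount * indexSizeBytes(vertexCount);
}
```

For a mesh with 1,000 vertices and 3,000 indices this gives 6,000 bytes, half of what 32-bit indices would need.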

8
Updating Index and Vertex Buffers
  • IBs and VBs which are optimally located need to
    be updated with sequential DWORD writes.
  • AGP memory and LVM both benefit from this
    treatment

9
Handling Render States
  • Prefer minimal state blocks
  • minimal means you should weed out any redundant
    state changes where possible
  • If 5% of state changes are redundant that's OK
  • If 50% are redundant then get it fixed!
  • The expensive state changes:
  • Switching between VS and FF
  • Switching Vertex Shader
  • Changing Texture
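
A simple way to weed out redundant state changes, and to measure how many there are, is a shadow copy of the current state. This is a sketch under stated assumptions: `RenderStateCache` is a hypothetical class, and states and values are plain integers standing in for the arguments an application would forward to SetRenderState.

```cpp
#include <cstdint>
#include <unordered_map>

// Filters out render-state writes that would not change anything.
class RenderStateCache {
public:
    // Returns true if the caller should forward the change to the device.
    bool set(std::uint32_t state, std::uint32_t value) {
        auto it = current_.find(state);
        if (it != current_.end() && it->second == value) {
            ++redundant_;
            return false;
        }
        current_[state] = value;
        ++applied_;
        return true;
    }

    // Fraction of requested changes that were redundant, in [0, 1].
    double redundantFraction() const {
        const std::uint64_t total = applied_ + redundant_;
        return total == 0 ? 0.0
                          : static_cast<double>(redundant_) /
                                static_cast<double>(total);
    }

private:
    std::unordered_map<std::uint32_t, std::uint32_t> current_;
    std::uint64_t applied_ = 0;
    std::uint64_t redundant_ = 0;
};
```

Logging `redundantFraction()` once per frame gives a direct reading against the 5% and 50% thresholds mentioned above.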

10
How to draw primitives
  • DrawIndexedPrimitive( strip or list )
  • Indexing is a big win on real world data
  • Long strips beat everything else
  • Use lists if you would have to add large numbers
    of degenerate polys to stick with strips (more
    than 20% means use lists)
  • Make sure your VBs and IBs are in optimal
    memory for best performance
  • Give the card hundreds of polys per call
  • Small batches kill performance

11
Vertex data
  • Don't scatter it around
  • Fewer streams give better cache behaviour
  • Compress it if you can
  • 16 bits or less per component
  • Even if it costs you 1 or 2 ops in the shader
  • Try to avoid spilling into AGP
  • Because AGP has high latency
  • Power-of-two vertex sizes help; 32 bytes is best
  • Work the cache on the GPU
  • Avoid random access patterns where possible by
    reordering vertex data before the main loop
  • That's at app start up or at authoring time
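
To make the compression advice concrete, here is a minimal sketch of 16-bit quantisation and a 32-byte vertex. The `PackedVertex` layout and the function names are hypothetical, chosen only to hit the 32-byte size; the multiply in `unpackSnorm16` stands in for the one or two extra shader ops mentioned above.

```cpp
#include <cmath>
#include <cstdint>

// Quantise a value in [-1, 1] to a signed 16-bit integer.
inline std::int16_t packSnorm16(float v) {
    if (v > 1.0f) v = 1.0f;
    if (v < -1.0f) v = -1.0f;
    return static_cast<std::int16_t>(std::lround(v * 32767.0f));
}

// Recover the approximate original value with a single multiply.
inline float unpackSnorm16(std::int16_t q) {
    return static_cast<float>(q) * (1.0f / 32767.0f);
}

// Hypothetical 32-byte vertex layout.
struct PackedVertex {
    float position[3];        // 12 bytes
    std::int16_t normal[4];   // 8 bytes, w unused (padding)
    std::int16_t tangent[4];  // 8 bytes, w unused (padding)
    std::int16_t uv[2];       // 4 bytes
};
static_assert(sizeof(PackedVertex) == 32, "vertex should be 32 bytes");
```

The same data stored as full floats would take 12 + 12 + 12 + 8 = 44 bytes per vertex.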

12
Compiling and Linking shaders
  • Do this all up front
  • It may not be obvious to you - but you have to
    actually use a shader to force its complete
    instantiation in DirectX 9
  • So, if you're not careful you may get linking
    happening in your main loop
  • And linking may be time consuming
  • Draw a little of everything before you start for
    real. Think of this as priming the caches

13
Vertex shaders I
  • Shorter shaders are faster: no surprises here
  • Avoid all unnecessary writes
  • This includes the output registers of the VS
  • So use the write masks aggressively
  • Pack constants as much as possible
  • Prefer locality of reference on constants too
  • Be aware of the expansion of macros but prefer
    them anyway if they match exactly what you want
  • Pack your shader constant updates
  • You should optimise the algorithm and leave the
    object-code optimisation to the driver/runtime
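
If "pack your shader constant updates" is read as "upload neighbouring registers together" (an assumption), then dirty registers can be merged into contiguous ranges before uploading. The names below are hypothetical; each resulting range maps naturally onto one SetVertexShaderConstantF call with a start register and a float4 count.

```cpp
#include <vector>

struct ConstantRange {
    unsigned startRegister;
    unsigned count;  // number of float4 registers in the range
};

// Merge a sorted, duplicate-free list of dirty constant registers into
// contiguous ranges so that each range can be uploaded with one call.
inline std::vector<ConstantRange> mergeDirtyRegisters(
        const std::vector<unsigned>& dirty) {
    std::vector<ConstantRange> ranges;
    for (unsigned reg : dirty) {
        if (!ranges.empty() &&
            ranges.back().startRegister + ranges.back().count == reg) {
            ++ranges.back().count;  // extends the current range
        } else {
            ranges.push_back({reg, 1});  // starts a new range
        }
    }
    return ranges;
}
```

Laying out constants that change together in adjacent registers keeps the number of ranges, and so the number of calls, small.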

14
Vertex shaders II
  • Branches and conditionals are fast so use them
    aggressively
  • That's not like the CPU where branches are slow
  • Longer shaders allow better batching
  • Shorter shaders are also more cache friendly
  • i.e. it's usually faster to switch to the
    previous shader than to any other
  • But the shorter your shaders are
  • the more of them fit into the cache.

15
Vertex shaders III
  • API Change
  • Now you don't mov to the address register, you
    use mova
  • And this performs round to nearest, not floor
  • And now A0 is a 4D register
  • A0.x, A0.y, A0.z, A0.w
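
The rounding change matters whenever the value moved into the address register has a fractional part. The sketch below only contrasts the two rounding rules on the CPU; the function names are hypothetical, and behaviour on exact .5 ties is deliberately left out.

```cpp
#include <cmath>

// Address computed with floor: round toward negative infinity.
inline int addressWithFloor(float v) {
    return static_cast<int>(std::floor(v));
}

// Address computed with round to nearest, as the slide describes for mova.
// Exact .5 ties are not considered here.
inline int addressWithRoundToNearest(float v) {
    return static_cast<int>(std::lround(v));
}
```

For an input of 2.7 the first function selects register offset 2 and the second selects 3, so shaders that relied on floor may index the wrong constant after porting.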

16
Pixel shaders I
  • API change to accommodate METs
  • You now have to explicitly write to oC0, oC1, oC2
    and oC3 to set the output colour
  • And the write has to be with a mov instruction
  • If you write to oCn you must write to all
    elements from oC0 to oCn-1
  • i.e. Writes must be contiguous starting at oC0
  • But the writes can happen in any order
  • You can also write to oDepth to update the Z
    buffer but note that this kills the early Z cull
    (this replaces ps1.3 texdepth)
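
The contiguity rule for colour outputs can be checked with a bitmask, for example in a shader-generation tool. This is an illustrative sketch: the function name is hypothetical, and treating a shader that writes no colour output at all as invalid is an assumption.

```cpp
#include <cstdint>

// Bit n of 'writtenMask' is set when the shader writes oCn (n = 0..3).
// Valid masks are 0b0001, 0b0011, 0b0111 and 0b1111.
inline bool colourOutputsContiguous(std::uint32_t writtenMask) {
    if (writtenMask == 0 || writtenMask > 0xFu) return false;
    // A mask of the form 2^k - 1 shares no bits with mask + 1.
    return (writtenMask & (writtenMask + 1u)) == 0;
}
```

Because the writes may happen in any order, only the set of written outputs matters, which is why a mask is enough.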

17
Pixel shaders II
  • Shorter is much faster
  • It's much easier to be pixel limited than vertex
    limited
  • Short shaders are more cache friendly
  • Be aggressive with write masks
  • Think dual-issue even though it's gone from
    the API (so split colour and alpha out)
  • Generally prefer to spend cycles on shader ops
    rather than using texture lookups
  • Because memory latency is the enemy here

18
Pixel shaders III
  • Dual issue?
  • But that's not in the 2.0 shader spec
  • But remember that DX9 hardware like the Radeon
    9700 has to run DirectX 8 apps very fast indeed
  • And that means it has dual issue hardware ready
    for you to use

19
Pixel shaders IV
  • Example: diffuse + specular lighting

    dp3 r0, r1, r0           // N.H
    dp3 r2, r1, r2           // N.L
    mul r2, r2, r3           // color
    mul r2, r2, r4           // texture
    mul r0.r, r0.r, r0.r     // spec^2
    mul r0.r, r0.r, r0.r     // spec^4
    mul r0.r, r0.r, r0.r     // spec^8
    mad r0.rgb, r0.r, r5, r2
    Total: 8 instructions

    dp3 r0, r1, r0           // N.H
    dp3 r2.r, r1, r2         // N.L
    mul r6.a, r0.r, r0.r     // spec^2
    mul r2.rgb, r2.r, r3     // color
    mul r6.a, r6.a, r6.a     // spec^4
    mul r2.rgb, r2, r4       // texture
    mul r6.a, r6.a, r6.a     // spec^8
    mad r0.rgb, r6.a, r5, r2
    Optimized to 5 DI (dual-issue) instructions

20
Pixel shaders V
  • Texture instructions
  • Avoid TEXDEPTH to retain the early Z-reject
  • If you do choose to use TEXKILL then use it as
    early as possible. But, the positioning of
    TEXKILL within texture loading code is
    unimportant
  • Register usage
  • Minimize total number of registers used
  • No problems with dependency

21
Vertex and Pixel shaders
  • If you're fed up with writing assembler, and
    don't feel excited by the opportunity to code 256
    VS ops and 96 PS ops, then
  • maybe you should consider HLSL?
  • In most cases it is as good as hand written
    assembler
  • And much faster to author
  • Perfect for prototyping
  • And for release code where you use D3DX

22
Textures I
  • API addition
  • SetSamplerState()
  • Handles the now-decoupled texture sampler setup.
  • You may now freely mix and match texture
    coordinates with texture samplers to fetch texels
    in arbitrary ways
  • Texture coordinates are now just iterated floats
  • Samplers handle clamp, wrap, bias and filter
    modes
  • You have 8 texture coordinates
  • And 16 texture samplers
  • texld r11, t7, s15 (all register numbers are max)

23
Textures II
  • Use compressed textures
  • Do you need a good compressor?
  • Use smaller textures
  • Use 16 bit textures in preference to 32 bit
  • Use textures with few components
  • Use an L8 or A8 format if that's what you want
  • Pack textures together
  • e.g. if you're using two 2D textures then
    consider using a single RGBA texture
  • Texture performance is bandwidth limited
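
Since texture performance is bandwidth limited, it helps to compare footprints directly. The helper below is a rough sketch that counts only the top mip level and assumes dimensions that are multiples of 4 for the DXT formats; the constant and function names are hypothetical, while the bits-per-texel figures are those of the named formats.

```cpp
#include <cstddef>

// Bits per texel for a few Direct3D 9 texture formats.
constexpr std::size_t kBitsA8R8G8B8 = 32;
constexpr std::size_t kBitsR5G6B5 = 16;
constexpr std::size_t kBitsL8 = 8;
constexpr std::size_t kBitsDXT1 = 4;

// Size in bytes of the top mip level of a texture.
constexpr std::size_t topLevelBytes(std::size_t width, std::size_t height,
                                    std::size_t bitsPerTexel) {
    return width * height * bitsPerTexel / 8;
}
```

A 256 x 256 texture takes 262,144 bytes as A8R8G8B8, 131,072 as R5G6B5, 65,536 as L8 and 32,768 as DXT1.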

24
Textures III
  • Filtering modes
  • Use trilinear filtering to improve texture cache
    coherency
  • Only use anisotropic or trilinear filtering when
    they make sense - they are more expensive
  • Avoid using anisotropic filtering with
    bumpmapping
  • Avoid using trilinear anisotropic filtering
    unless the quality win justifies it
  • More costly filtering is more affordable with
    longer pixel shaders

25
Targets
  • Always clear the whole of the target
  • Present()
  • WASSTILLDRAWING makes a comeback
  • Please use it!
  • Because using it properly will gain you CPU
    cycles - and that's typically your scarcest
    resource

26
Depth Buffer I
  • Never lock depth buffers
  • Clearing depth buffers
  • Clear the whole surface
  • When stencil is present clear both depth and
    stencil simultaneously
  • If possible disable depth buffering when alpha
    blending (i.e. drawing HUDs)
  • Use as few depth buffers as possible
  • i.e. re-use them across multiple render targets

27
Depth Buffer II
  • Efficiently use Hyper-Z
  • Render front to back
  • Make Znear, Zfar close to active depth range of
    the scene
  • The EQUAL and NOT EQUAL depth tests require exact
    compares which kill the early Z comparisons.
    Avoid them!

28
Occlusion query
  • New to DirectX 9
  • In GL you have HP_occlusion_test and
    NV_occlusion_query to avoid the need for locks
  • Not free, but much cheaper than Lock()
  • Supported on all ATI hardware since the Radeon
    8500
  • CreateQuery(OCCLUSION, ppQuery)
  • Issue(Begin/End)
  • GetData() returns S_OK to signal completion - but
    please don't spin waiting for the answer
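
The "don't spin" advice amounts to polling once per frame and reusing the previous answer until the result arrives. The sketch below uses a fake query object so that it runs anywhere; `FakeOcclusionQuery`, `Visibility` and `pollOnce` are hypothetical, and `tryGetData` returning true plays the role of GetData() returning S_OK.

```cpp
#include <cstdint>

// Minimal stand-in for an occlusion query. A real one would come from
// CreateQuery; this fake becomes ready after a fixed number of polls.
struct FakeOcclusionQuery {
    int pollsUntilReady;
    std::uint32_t visiblePixels;

    // Mimics GetData: true means the result is available.
    bool tryGetData(std::uint32_t& outPixels) {
        if (pollsUntilReady > 0) {
            --pollsUntilReady;
            return false;
        }
        outPixels = visiblePixels;
        return true;
    }
};

enum class Visibility { Unknown, Hidden, Visible };

// Poll once per frame. If the answer is not ready, keep the previous
// decision instead of spinning.
inline Visibility pollOnce(FakeOcclusionQuery& q, Visibility previous) {
    std::uint32_t pixels = 0;
    if (!q.tryGetData(pixels)) return previous;
    return pixels > 0 ? Visibility::Visible : Visibility::Hidden;
}
```

The cost of this design is that visibility decisions lag by a frame or more, which is usually acceptable for coarse culling.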

29
AGP 8X
  • Is fast at 2GB per second
  • But has high latency compared to LVM
  • And is 10 times slower than LVM
  • Radeon 9700 has up to 20GB per sec of bandwidth
    available when talking to LVM
  • (LVM = Local Video Memory)

30
User clip planes
  • User clip planes are much more efficient than
    texkill because
  • They insert a per-vertex test, rather than a
    per-pixel test, and vertices are typically fewer
    in number than pixels
  • It's important always to kill data at the
    earliest stage possible in the pipeline
  • Plus, clipping is essentially a geometric
    operation
  • All hardware which supports ps1.4 supports user
    clip planes in hardware

31
Sky box. First or last?
  • Draw it last because
  • That's a rough front to back sort
  • In this case you know that most sky pixels will
    fail the Z test.
  • Draw it first because
  • That way you don't need any Z tests
  • In this case you know that most sky pixels would
    pass the Z test

32
So, here is our target
  • DX9 style mainstream graphics (per frame)
  • > 500K triangles
  • < 500 DrawIndexedPrimitive() calls
  • < 500 VertexBuffer switches
  • < 200 different textures
  • < 200 State change groups
  • Few calls to SetRenderTarget - aim for 0 to 4...
  • 1 pass per poly is typical, but 2 is sometimes
    smart
  • Runs at monitor refresh rate
  • Which gives more than 40 million polys per second
  • And everything goes through the programmable
    pipeline
  • No occurrences of Lock(0), DrawPrimitive(), DPUP()
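
The per-frame targets above can be sanity-checked with a little arithmetic. The sketch assumes an 85 Hz monitor refresh rate, which the slide does not state; the function names are hypothetical.

```cpp
#include <cstdint>

// Triangles per second for a per-frame triangle count and a refresh rate.
constexpr std::uint64_t trianglesPerSecond(std::uint64_t trisPerFrame,
                                           std::uint64_t refreshHz) {
    return trisPerFrame * refreshHz;
}

// Average batch size implied by a triangle count and a draw-call count.
constexpr std::uint64_t trianglesPerCall(std::uint64_t trisPerFrame,
                                         std::uint64_t drawCalls) {
    return trisPerFrame / drawCalls;
}
```

With 500K triangles at the assumed 85 Hz this gives 42.5 million triangles per second, consistent with the "more than 40 million" figure, and 500K triangles over 500 calls means an average of 1,000 triangles per call.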

33
Questions
  • ?

Richard Huddy RHuddy_at_ati.com