Revving Up Shader Performance: Transcript and Presenter's Notes

1
Revving Up Shader Performance
Under the Hood
  • Shanon Drone
  • Development Engineer
  • XNA Developer Connection, Microsoft

2
Key Takeaways
  • Know the shader hardware on a low level
  • Pixel threads, GPRs, fetch efficiency, latency,
    etc.
  • Learn from the Xbox 360
  • Architect shaders accordingly
  • Shader optimization is not always intuitive
  • Some things that seem to make sense actually
    hurt performance
  • At some point, you must take a trial-and-error
    approach
  • In other words
  • You must write tests
  • Real-world measurements are the only way to
    really know what works

3
Under the Hood
  • Help the shader compiler by writing shaders for
    how hardware works under the hood
  • Modern GPUs are highly parallel to hide latency
  • Lots of pipelines (maybe even a shared shader
    core)
  • Works on units larger than single vertices and
    pixels
  • Rasterization goes to tiles, then to quads, then
    to (groups of) pixels
  • Texture caches are still small
  • 32 KB on Xbox 360 and similar on PC hardware
  • Shaders use 6-10 textures with 10-20 MB of
    traffic
  • Rasterization order affects texture cache usage
  • Lots of other subtleties affect performance

4
Pixel and Vertex Vectors
  • GPUs work on groups of vertices and pixels
  • Xbox 360 works on a vector of 64 vertices or
    pixels
  • Meaning each instruction is executed 64 times at
    once
  • Indexed constants cause constant waterfalling
    (see the sketch below)
  • For example, c[a0] can translate into 64
    different constants for a single instruction
  • When all pixels take the same branch,
    performance can be good
  • But in most cases, both code paths will execute
  • Author shaders accordingly
  • Early-out code paths may not work out as
    expected
  • Check assembly output
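
For illustration, here is a minimal HLSL sketch of the indexed-constant pattern that waterfalls (the constant array and function are hypothetical, not from the deck): each of the 64 vertices in a vector can pick a different palette entry, so the hardware serializes the constant reads instead of broadcasting one value.

    // Hypothetical skinning helper. BoneMatrices is indexed per vertex, so the
    // compiled code reads constants through the address register (c[a0]-style).
    // With 64 vertices per instruction, one read can touch up to 64 different
    // constants; that serialization is constant waterfalling.
    float4x3 BoneMatrices[64] : register(c16);

    float3 SkinPosition(float3 pos, int4 boneIndices, float4 boneWeights)
    {
        float3 result = 0;
        result += boneWeights.x * mul(float4(pos, 1), BoneMatrices[boneIndices.x]);
        result += boneWeights.y * mul(float4(pos, 1), BoneMatrices[boneIndices.y]);
        result += boneWeights.z * mul(float4(pos, 1), BoneMatrices[boneIndices.z]);
        result += boneWeights.w * mul(float4(pos, 1), BoneMatrices[boneIndices.w]);
        return result;
    }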

5
Why GPRs Matter
  • GPU can process many ALU ops per clock cycle
  • But fetch results can take hundreds of clock
    cycles
  • Due to cache misses, texel size, filtering, etc.
  • Threads hide fetch latency
  • While one thread is waiting for fetch results,
    the GPU can start processing another thread
  • Max number of threads is limited by the GPR pool
  • Example: 64 pixels × 32 GPRs = 2,048 GPRs per
    thread
  • The GPU can quickly run out of GPRs
  • Fewer GPRs used generally translates to more
    threads

6
Minimize GPR Usage
  • Xbox 360 has 24,576 GPRs
  • 64-unit vector × 128 register banks × 3 SIMD
    units
  • Sounds like a lot, but many shaders are GPR
    bound
  • The issue is the same for PC hardware
  • Tweak shaders to use the least number of GPRs
  • Maybe even at the expense of additional ALUs or
    unintuitive control flow
  • Unrolling loops usually requires more GPRs
  • as does lumping tfetches at the beginning of the
    shader
  • Ditto for über-shaders, even with static control
    flow
  • Rules get tricky, so we must try out many
    variations

7
Vertex Shader Performance
  • Lots of reasons to care about vertex shader
    performance
  • Modern games have many passes (z-pass, tiling,
    etc.)
  • Non-shared cores devote less hardware for
    vertices
  • For shared cores, resources could be used for
    pixels
  • For ALU bound shaders
  • Constant waterfalling can be the biggest problem
  • Matrix order can affect ALU optimization (see
    the sketch below)
  • Many vertex shaders are fetch bound
  • Especially lightweight shaders
  • One fetch can cost 32x more cycles than an ALU
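
As an illustration of the matrix-order point (a sketch, not from the deck): with HLSL's default column_major constant packing, mul(v, M) compiles to four dp4 instructions, while row_major packing turns it into a mul plus three mads; the two forms can schedule quite differently, so try both and check the assembly.

    column_major float4x4 WorldViewProjCM;   // default packing: compiles to 4x dp4
    row_major    float4x4 WorldViewProjRM;   // row-major packing: compiles to mul + 3x mad

    float4 TransformDp4(float4 pos) { return mul(pos, WorldViewProjCM); }
    float4 TransformMad(float4 pos) { return mul(pos, WorldViewProjRM); }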

8
VFetch
  • Goal is to minimize the number of fetches
  • Which is why vertex compression is so important
  • Vertex data should be aligned to 32 or 64 bytes
  • Multiple streams will multiply the fetch cost
  • Vertex declaration should match fetch order
  • Shader compiler has ZERO info about vertex
    components
  • Shaders are patched at run time with the vertex
    declaration
  • Shader patching can't optimize out unnecessary
    vfetches

9
Mega vs. Mini Fetches
  • A quick refresher
  • GPU reads vertex data in groups of bytes
  • Typically 32 bytes (a mega or full fetch)
  • Additional fetches within the 32-byte range
    should be free (a mini fetch)
  • On Xbox 360
  • A vfetch_full pulls in 32 bytes worth of data
  • At two fetches per clock cycle, a 64-vertex
    vector costs 32 cycles per fetch
  • Without 32 cycles worth of ALU ops, the shader is
    fetch bound

10
Vfetch Recommendations
  • Compress vertices
  • Normal, Tangent, Binormal -> 11:11:10
  • Texture coords -> 16:16
  • Put all non-lighting components first
  • So depth-only shaders do just one fetch
  • FLOAT32 Position ×3
  • UINT8 BoneWeights ×4
  • UINT8 BoneIndices ×4
  • UINT16 DiffuseTexCoords ×2
  • UINT16 NormalMapCoords ×2
  • DEC3N Normal
  • DEC3N Tangent
  • DEC3N BiNormal
  • All in one stream, of course

(The slide diagram marks which of these components land in the 1st vs. 2nd
32-byte fetch; an HLSL input struct for this layout is sketched below.)
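
A hypothetical HLSL input struct for this layout (names and semantics are illustrative); the point is that everything the depth-only pass needs fits in the first 32 bytes, so it costs a single vfetch_full:

    struct VS_INPUT
    {
        // 1st 32-byte fetch: everything a depth-only shader needs
        float3 Position         : POSITION;      // FLOAT32 ×3 (12 bytes)
        float4 BoneWeights      : BLENDWEIGHT;   // UINT8  ×4  (4 bytes)
        float4 BoneIndices      : BLENDINDICES;  // UINT8  ×4  (4 bytes)
        float2 DiffuseTexCoords : TEXCOORD0;     // UINT16 ×2  (4 bytes)
        float2 NormalMapCoords  : TEXCOORD1;     // UINT16 ×2  (4 bytes)
        float3 Normal           : NORMAL;        // DEC3N      (4 bytes)
        // 2nd fetch: lighting-only data
        float3 Tangent          : TANGENT;       // DEC3N      (4 bytes)
        float3 Binormal         : BINORMAL;      // DEC3N      (4 bytes)
    };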
11
Fetch From Two Streams
  • Cycles per 64-vertex vector: ALU = 12,
    vertex = 64, sequencer = 12
  • 3 GPRs, 31 threads

    // Fetch position
    vfetch_full r2.xyz1, r0.x, vf0, Offset=0, DataFormat=FMT_32_32_32_FLOAT
    // Fetch diffuse texcoord
    vfetch_full r0.xy0_, r0.x, vf2, Offset=0, DataFormat=FMT_32_32_FLOAT
    mul  r1, r2.y, c1
    mad  r1, r2.x, c0.wyxz, r1.wyxz
    mad  r1, r2.z, c2.zywx, r1.wyxz
    mad  r2, r2.w, c3, r1.wyxz
    mul  r1, r2.y, c5.wzyx
    mad  r1, r2.x, c4.xzwy, r1.wyxz
    mad  r1, r2.z, c6.yzxw, r1.wyxz
    mad  oPos, r2.w, c7, r1.zxyw

12
2 Fetches From Same Stream
  • Cycles per 64-vertex vector: ALU = 12,
    vertex = 64, sequencer = 12
  • 3 GPRs, 31 threads

    // Fetch position
    vfetch_full r2.xyz1, r0.x, vf0, Offset=0, DataFormat=FMT_32_32_32_FLOAT
    // Fetch diffuse texcoord
    vfetch_full r0.xy0_, r0.x, vf0, Offset=10, DataFormat=FMT_32_32_FLOAT
    mul  r1, r2.y, c1
    mad  r1, r2.x, c0.wyxz, r1.wyxz
    mad  r1, r2.z, c2.zywx, r1.wyxz
    mad  r2, r2.w, c3, r1.wyxz
    mul  r1, r2.y, c5.wzyx
    mad  r1, r2.x, c4.xzwy, r1.wyxz
    mad  r1, r2.z, c6.yzxw, r1.wyxz
    mad  oPos, r2.w, c7, r1.zxyw

13
1 Fetch From Single Stream
  • Cycles per 64-vertex vector: ALU = 12,
    vertex = 32, sequencer = 12
  • 3 GPRs, 31 threads

    // Fetch position
    vfetch_full r2.xyz1, r0.x, vf0, Offset=0, DataFormat=FMT_32_32_32_FLOAT
    // Fetch diffuse texcoord
    vfetch_mini r0.xy0_, r0.x, vf0, Offset=5, DataFormat=FMT_32_32_FLOAT
    mul  r1, r2.y, c1
    mad  r1, r2.x, c0.wyxz, r1.wyxz
    mad  r1, r2.z, c2.zywx, r1.wyxz
    mad  r2, r2.w, c3, r1.wyxz
    mul  r1, r2.y, c5.wzyx
    mad  r1, r2.x, c4.xzwy, r1.wyxz
    mad  r1, r2.z, c6.yzxw, r1.wyxz
    mad  oPos, r2.w, c7, r1.zxyw

14
Triple the Fetch Cost
// DepthOnlyVS.asm
vfetch r6.xyz1, r0.x, position
vfetch r1, r0.x, blendindices
vfetch r2, r0.x, blendweight
vfetch r0.xy__, r0.x, texcoord
mul   r1, r1.wzyx, c255.x
movas r0._, r1.x
dp4   r3.x, c8[a0].zxyw, r6.zxyw
dp4   r3.y, c9[a0].zxyw, r6.zxyw
dp4   r3.z, c10[a0].zxyw, r6.zxyw

// DepthOnlyVS.hlsl
struct VS_INPUT
{
    float4 Position;
    float4 BoneIndices;
    float4 BoneWeights;
    float2 DiffuseTexCoords;
};
VS_OUTPUT DepthOnlyVS( VS_INPUT In )

// DepthOnlyVS.cpp
D3DVERTEXELEMENT9 decl[] =
{
    { 0,  0, D3DDECLTYPE_FLOAT3,   D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION,     0 },
    { 0, 12, D3DDECLTYPE_FLOAT3,   D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_NORMAL,       0 },
    { 0, 24, D3DDECLTYPE_FLOAT2,   D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD,     0 },
    { 0, 32, D3DDECLTYPE_UBYTE4,   D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_BLENDINDICES, 0 },
    { 0, 36, D3DDECLTYPE_USHORT4N, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_BLENDWEIGHT,  0 },
    D3DDECL_END()
};
15
Triple the Fetch Cost
  • Cycles per 64-vertex vector: ALU = 38,
    vertex = 96, sequencer = 22
  • 7 GPRs, 27 threads

    vfetch_full r6.xyz1, r0.x, vf0, Offset=0, DataFormat=FMT_32_32_32_FLOAT  // FLOAT3 POSITION
    vfetch_full r1, r0.x, vf0, Offset=8, DataFormat=FMT_8_8_8_8              // UBYTE4 BLENDINDICES
    vfetch_mini r2, Offset=9, DataFormat=FMT_8_8_8_8                         // USHORT4N BLENDWEIGHT
    vfetch_full r0.xy__, r0.x, vf0, Offset=6, DataFormat=FMT_32_32_FLOAT     // FLOAT2 TEXCOORD
    mul   r1, r1.wzyx, c255.x
    movas r0._, r1.x
    dp4   r3.x, c8[a0].zxyw, r6.zxyw

16
One-third the Fetch Cost
  • Cycles per 64-vertex vector: ALU = 38,
    vertex = 32, sequencer = 22
  • 7 GPRs, 27 threads

    vfetch_full r6.xyz1, r0.x, vf0, Offset=0, DataFormat=FMT_32_32_32_FLOAT  // FLOAT3 POSITION
    vfetch_mini r1, Offset=3, DataFormat=FMT_8_8_8_8                         // UBYTE4 BLENDINDICES
    vfetch_mini r2, Offset=4, DataFormat=FMT_8_8_8_8                         // USHORT4N BLENDWEIGHT
    vfetch_mini r0.xy__, Offset=5, DataFormat=FMT_32_32_FLOAT                // FLOAT2 TEXCOORD
    mul   r1, r1.wzyx, c255.x
    movas r0._, r1.x
    dp4   r3.x, c8[a0].zxyw, r6.zxyw

17
Depth-Only Rendering
  • GPUs have perf improvements for depth-buffering
  • Hierarchical-Z
  • Double-speed, depth-only rendering
  • Depth-only rendering is often still fill-bound
  • A few triangles can cover 10-100,000s of quads
  • True for z-prepass, vis-testing, shadow rendering
  • For vis-testing, use tighter bounding objects and
    proper culling
  • Since we're fill-bound anyway, consider doing
    pixel work
  • I.e., give up the double-speed benefit
  • Lay down something useful to spare an additional
    pass
  • Velocity, focal plane, etc. (sketched below)
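
A sketch of laying down something useful during the depth pass (the interpolators and output are illustrative): write screen-space velocity from the otherwise-null pixel shader, accepting the loss of double-speed depth-only rendering to spare a later pass.

    float4 DepthPlusVelocityPS(float4 clipPos     : TEXCOORD0,   // current-frame clip position
                               float4 prevClipPos : TEXCOORD1)   // previous-frame clip position
        : COLOR0
    {
        // Perspective divide, then store the screen-space delta as the velocity.
        float2 curr = clipPos.xy / clipPos.w;
        float2 prev = prevClipPos.xy / prevClipPos.w;
        return float4(curr - prev, 0, 1);
    }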

18
Pixel Shader Performance
  • Most calls will be fill-bound
  • Pixel shader optimization is some combination of
  • Minimizing ALUs
  • Minimizing GPRs
  • Reducing control flow overhead
  • Improving texture cache usage
  • Avoiding expensive work
  • Also, trying to balance the hardware
  • Fetches versus ALUs versus GPRs
  • A big challenge is getting the shader compiler to
    do exactly what we want

19
Minimizing ALUs
  • Minor modifications to an expression can change
    the number of ALUs
  • The shader compiler produces slightly different
    results
  • Play around to try different things out
  • Avoid math on constants (see the sketch below)
  • Reducing just one multiply has saved 3 ALU ops
  • Using [isolate] can dramatically change results
  • Especially for the ALU ops around texture fetches
  • Verify shader compiler output
  • Get comfortable with assembly
  • Compare with expectations given your HLSL code
  • Finally, start tweaking to get what you want
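
A sketch of the avoid-math-on-constants point (constant names are made up): fold constant-only products on the CPU so the compiler does not spend per-pixel ALU ops combining uniforms.

    float3 LightColor;
    float  LightIntensity;
    float  Exposure;
    float3 PremultipliedLightColor;   // set on the CPU to LightColor * LightIntensity * Exposure

    // Before: the compiler may burn ALU ops combining three constants per pixel.
    float3 ShadeBefore(float3 albedo, float ndotl)
    {
        return albedo * ndotl * (LightColor * LightIntensity * Exposure);
    }

    // After: one pre-folded constant, fewer instructions to schedule.
    float3 ShadeAfter(float3 albedo, float ndotl)
    {
        return albedo * ndotl * PremultipliedLightColor;
    }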

20
Minimizing GPRs
  • Minimizing ALUs usually saves on GPRs as well
  • Unrolling loops consumes more GPRs
  • Conversely, using loops can save GPRs
  • Lumping tfetches to top of shader costs GPRs
  • Both for calculated tex coords
  • And fetch results
  • Xbox 360 extensions can save ALUs
  • Like tfetch with offsets
  • Shader compiler can be told to make do with a
    user-specified max number of GPRs

21
Control Flow
  • Flattening or preserving loops can have a huge
    effect on shader performance
  • One game shaved 4 ms off of a 30 ms scene by
    unrolling just one loop: its main pixel shader
  • Unrolling allowed for much more aggressive
    optimization by the compiler
  • However, the DepthOfField shader saves 2 ms by
    not preserving the loop
  • Using the loop reduced GPR usage dramatically
  • Recommendation is to try both ways (see the
    sketch below)
  • Non-branching control flow is still overhead
  • Gets reduced as ALU count goes down
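
A sketch of trying both control-flow forms on the same filter loop (the sampler and offsets are placeholders): [unroll] trades GPRs for scheduling freedom, while [loop] keeps the loop and its sequencer overhead but tends to use fewer registers.

    sampler2D SourceSampler;
    float2    TapOffsets[8];

    float4 FilterUnrolled(float2 uv : TEXCOORD0) : COLOR0
    {
        float4 sum = 0;
        [unroll]   // flat code: more GPRs, more room for the compiler to optimize
        for (int i = 0; i < 8; i++)
            sum += tex2Dlod(SourceSampler, float4(uv + TapOffsets[i], 0, 0));
        return sum / 8;
    }

    float4 FilterLooped(float2 uv : TEXCOORD0) : COLOR0
    {
        float4 sum = 0;
        [loop]     // real loop: fewer GPRs, but control-flow overhead per iteration
        for (int i = 0; i < 8; i++)
            sum += tex2Dlod(SourceSampler, float4(uv + TapOffsets[i], 0, 0));
        return sum / 8;
    }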

22
Early Out Shaders
  • Early out shaders may not really do anything
  • On Xbox 360, HLSL clip doesn't really kill a
    pixel
  • But rather just invalidates the output
  • Remaining texture fetches may be spared
  • but all remaining ALU instructions still execute
  • Write a test for other hardware to see if early
    outs actually improve performance
  • Otherwise, assume they don't
  • For any gain, all 64 pixels would need to be
    killed
  • Dynamic control flow via an if-else block should
    get close to what you intend (see the sketch
    below)
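
A sketch contrasting the two early-out styles (the sampler and the "expensive" helper are stand-ins): clip() only invalidates the output, while a [branch] if gives the hardware something it can actually skip when every pixel in the vector agrees.

    sampler2D DiffuseSampler;

    float4 ExpensiveLighting(float2 uv)   // stand-in for the heavy part of the shader
    {
        return tex2Dlod(DiffuseSampler, float4(uv, 0, 0)) * 2.0f;
    }

    float4 EarlyOutWithClip(float2 uv : TEXCOORD0) : COLOR0
    {
        float alpha = tex2D(DiffuseSampler, uv).a;
        clip(alpha - 0.5f);        // output is invalidated, but the ALU work below still runs
        return ExpensiveLighting(uv);
    }

    float4 EarlyOutWithBranch(float2 uv : TEXCOORD0) : COLOR0
    {
        float  alpha  = tex2D(DiffuseSampler, uv).a;
        float4 result = 0;
        [branch]
        if (alpha >= 0.5f)         // skipped only when all pixels in the vector agree
            result = ExpensiveLighting(uv);
        clip(alpha - 0.5f);
        return result;
    }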

23
Dynamic Branching
  • Dynamic branching can help or hurt performance
  • All pixels in a thread must actually take the
    branch
  • Otherwise, both branches need to be executed
  • Use branching to skip fetches and calculations
  • Like whenever the alpha will be zero
  • But beware of multiple code paths executing
  • if-else statements result in additional overhead
  • The ?: operator turns into a single instruction
    (see the sketch below)
  • Avoid static branching masquerading as dynamic
  • Do not use numerical constants for control flow;
    use booleans instead
  • Special-case various simple code paths, which
    results in less control flow and fewer GPRs used
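
A small sketch of these points (the constants and function are illustrative): a ?: select is a single instruction, a [branch] if adds overhead but can skip work, and a bool constant lets the compiler treat the test as static instead of dynamic.

    bool  UseDetailMap;    // bool constant: the compiler can treat the branch as static
    float DetailBlend;     // numeric constant used only as an on/off flag: avoid this pattern

    float3 ApplyDetail(float3 baseColor, float3 detail, float mask)
    {
        // Single-instruction select: both operands are already computed, so this is cheap.
        float3 color = (mask > 0.5f) ? baseColor * detail : baseColor;

        // Real branch: only worthwhile if it skips fetches or meaningful ALU work.
        [branch]
        if (UseDetailMap)
            color *= detail;

        return color;
    }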

24
Thread Size and Branching
  • Some pixel threads take the non-lighting path
  • Some pixel threads take the lighting path
  • Some pixel threads take the soft shadow path
25
Thread Size and Branching
  • 43% of pixel threads take the non-lighting path
  • 14% of pixel threads take the lighting path
  • 43% of pixel threads take the soft shadow path

26
Thread Size and Branching
  • 43% of pixel threads take the non-lighting path
  • 14% of pixel threads take the lighting path
  • 43% of pixel threads take the soft shadow path
  • (Values overlaid on the slide's screenshot: 54,
    20, 26)

27
Texture Cache Usage
  • Fetches have latency that becomes a bottleneck
  • Can be a challenge to fetch 6-10 textures per
    pixel and many MB of texture traffic through a 32
    KB cache
  • Age-old recommendations still apply
  • Compare measured texture traffic to ideal traffic
  • Consider a 1280x720x32-bit post-processing pass
  • 1280 × 720 × 32 bits = 3.686 MB of ideal texture
    traffic
  • But measured result may claim 7.0 MB
  • Triangle rasterization can and will affect
    texture cache usage
  • In the case above, it's the only explanation
  • Pixels are processed in an order that's causing
    texels to be evicted from and re-fetched into the
    cache

28
Rasterization Test
  • Use an MxN grid instead of a full-screen quad
  • Smaller primitives confine rasterization to
    better match usage patterns of the texture cache
  • Prevent premature evictions from the texture
    cache
  • Ideal grid size varies for different conditions
  • Number of textures, texel width, etc.
  • And surely for different hardware, too
  • Write a test that lets you try different grid
    configurations for each shader
  • For the DepthOfField shader, an 8x1 grid works
    best
  • For a different shader, 20x13 worked best
  • In all cases, 1x1 seems to be pretty bad

29
Conditional Processing
  • The shader compiler doesn't know when
    intermediate results might be zero
  • Diffuse alpha, N·L, specular, bone weight,
    lightmap contribution, etc.
  • Pixel is in shadow
  • Some constant value is zero (or one)
  • Try making expressions conditional when using
    these values (see the sketch below)
  • Experiment to see if even small branches pay off
  • Use a texture mask to mask off areas where
    expensive calculations can be avoided
  • For PCF, take a few samples to see if you need to
    do more
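
A sketch of making an expression conditional on an intermediate result (the lighting model and names are illustrative): skip the specular term whenever N·L or the shadow term is zero.

    float3 LightWithConditionalSpecular(float3 N, float3 L, float3 V,
                                        float3 albedo, float shadow)
    {
        float  ndotl = saturate(dot(N, L));
        float3 color = albedo * ndotl * shadow;

        [branch]
        if (ndotl * shadow > 0.0f)   // specular can only contribute when lit and unshadowed
        {
            float3 H = normalize(L + V);
            color += pow(saturate(dot(N, H)), 32.0f) * shadow;
        }
        return color;
    }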

30
Multiple Passes
  • Multiple passes cause lots of re-fetching of
    textures (normal maps, etc.)
  • However, separate passes are better for the
    texture cache
  • Resolve and fetch cost of scene and depth
    textures adds up
  • Tiling overhead may be worth it to save passes
  • Alpha blending between passes eats bandwidth
  • ALU power is 4× that of the texture hardware
  • Branching in the shader can handle multiple
    lights (sketched below)
  • Consider multiple render targets
  • Meshes are often transformed many times
  • Try to skin meshes just once for all N passes
  • Consider memexport or StreamOut for skinning
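
A sketch of folding several lights into one pass with in-shader control flow (the constants are illustrative), instead of re-transforming the mesh and re-fetching its textures once per light:

    int    NumLights;            // set per draw call
    float3 LightPositions[4];
    float3 LightColors[4];

    float3 AccumulateLights(float3 worldPos, float3 N, float3 albedo)
    {
        float3 color = 0;
        [loop]
        for (int i = 0; i < NumLights; i++)
        {
            float3 L = normalize(LightPositions[i] - worldPos);
            color += albedo * LightColors[i] * saturate(dot(N, L));
        }
        return color;
    }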

31
Xbox 360 Platform
  • Fixed hardware offers lots of low-level
    manipulations to tweak performance
  • Consistent hardware characteristics, like fetch
    rules
  • Shader GPR allocation
  • Tfetch with offsets
  • Ability to view/author shader microcode
  • Access to hardware performance counters
  • Custom formats (7e3) and blend modes
  • Predicated tiling
  • Custom tools
  • Microcode-aware shader compiler, with custom
    extensions and attributes
  • PIX for Xbox 360
  • It's still important to measure real-world
    performance

32
Windows Platform
  • Hardware can obviously vary a lot
  • Hard to get low-level knowledge of what the rules
    are
  • Driver performs additional compiling of shaders
  • Some things to check on include
  • Effect of thread size on dynamic branching
  • Effect of GPR usage
  • 32-bit versus 16-bit float performance
  • 64-bit render target and texture performance
  • Multiple render target performance
  • Z-prepass effectiveness
  • Hardware support for shadowmap filtering
  • If possible, consider shader authoring on an Xbox
    360 dev kit
  • Get ready for D3D 10

33
The HLSL Shader Compiler
  • Don't expect the shader compiler to solve all
    your problems
  • It can't be perfect (and it's not)
  • Garbage in, garbage out
  • It can't know what you're really trying to do
  • It's easy to trick the compiler, especially with
    constants
  • It can't know the situation the shader will run
    in
  • Texture cache usage, 64-bit textures, filtering,
    etc.
  • Vertex declarations
  • Interpolators
  • Alpha-blending
  • Neighboring vertices and pixels
  • Besides, what does the driver do with your shader
    once it gets it?

34
Shader Compiler
  • The shader compiler can generate variants that
    perform dramatically differently
  • Loops and branching versus flat control flow
  • Grouping tfetches versus minimizing GPR count
  • Isolated operations versus intertwining
    instructions
  • Reactions to a subtle difference in an expression
  • Be accountable for shader compiler output
  • Always verify the output of the shader compiler
  • Know what you want, then try to get the shader
    compiler to play along
  • Rough-count the number of expected instructions
    for verification

35
Controlling the HLSL Compiler
  • Options to affect compiler output include
  • Compiler switches and HLSL attributes to mandate
    control flow
  • Manually unrolling loops
  • Rearranging HLSL code
  • The Xbox 360 compiler has a few more switches and
    attributes
  • /Xmaxtempreg to limit GPR usage
  • [isolate]-ing blocks can have a huge effect on
    code generation (sketched below)
  • Especially around fetches
  • Ditto for [noExpressionOptimizations]
  • Compiler output can often be improved by
    massaging the actual HLSL code
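
A sketch of how these Xbox 360 compiler attributes are applied to statement blocks; the placement shown here is an assumption based on the slide's wording, and the sampler and constant are illustrative.

    sampler2D SceneSampler;
    float     Exposure;

    float4 TonemapPS(float2 uv : TEXCOORD0) : COLOR0
    {
        float4 scene;

        [isolate]                      // assumed usage: keep this block's scheduling separate,
        {                              // e.g. to stop the fetch being mixed into other code
            scene = tex2D(SceneSampler, uv);
        }

        [noExpressionOptimizations]    // assumed usage: take the math in this block as written
        {
            scene.rgb = scene.rgb * Exposure;
        }
        return scene;
    }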

36
Massaging HLSL Code
  • Changing one simple operation can improve or
    degrade output
  • The input changes, so the rules change, so code
    generation changes
  • But new code is not always better
  • Send weird cases into developer support
  • The one operation may be as simple as an add
  • Or moving an expression to earlier/later in the
    shader
  • Or needless math on constants
    rcp   r0.x, c20.x
    mul   r0.xyz, c5.xyz, c6.w
    movs  r0.y, c0.z
    cndeq r0, c6.x, r0, r1
  • Always verify results in assembly

37
Where's the HLSL?
  • Make sure your art pipeline lets you
    access/view/tweak HLSL shaders
  • Many engines assemble shader fragments
    dynamically
  • Meaning there's no complete HLSL source lying
    around for every variation of every shader used
    in-game
  • You must solve this problem
  • Recommendation is to spit out exact HLSL
    immediately after compilation
  • Save the HLSL to a sequentially named file
  • Then add PIX output to your scene with the ID of
    each shader used
  • That way, you can trace draw calls back to an
    HLSL file that you can experiment with

38
Test Things Out
  • Optimizing shaders means trying out a lot of
    different things under real-world conditions
  • So it's imperative to test things out
  • Shaders need to be buildable from the command
    line
  • And be able to be dropped into a test framework
  • And hooked up to performance tools like PIX
  • Get comfortable with shader compiler output
  • Verify that the assembly looks close to expected
  • Like GPR usage, control flow, tfetch placement
  • Isolate shaders and exaggerate their effect
  • Draw 100 full-screen quads instead of one
  • Draw objects off-screen to eliminate fill cost
  • Then, start tweaking