Title: Revving Up Shader Performance
Slide 1: Revving Up Shader Performance — Under the Hood
- Shanon Drone
- Development Engineer
- XNA Developer Connection, Microsoft
Slide 2: Key Takeaways
- Know the shader hardware on a low level
  - Pixel threads, GPRs, fetch efficiency, latency, etc.
  - Learn from the Xbox 360
- Architect shaders accordingly
- Shader optimization is not always intuitive
  - Some things that seem to make sense actually hurt performance
  - At some point, you must take a trial-and-error approach
- In other words: you must write tests
  - Real-world measurements are the only way to really know what works
Slide 3: Under the Hood
- Help the shader compiler by writing shaders for how the hardware works under the hood
- Modern GPUs are highly parallel to hide latency
  - Lots of pipelines (maybe even a shared shader core)
  - Works on units larger than single vertices and pixels
  - Rasterization goes to tiles, then to quads, then to (groups of) pixels
- Texture caches are still small
  - 32 KB on Xbox 360 and similar on PC hardware
  - Shaders use 6-10 textures with 10-20 MB of traffic
  - Rasterization order affects texture cache usage
- Lots of other subtleties affect performance
Slide 4: Pixel and Vertex Vectors
- GPUs work on groups of vertices and pixels
  - Xbox 360 works on a vector of 64 vertices or pixels
  - Meaning each instruction is executed 64 times at once
- Indexed constants cause constant waterfalling
  - For example, c[a0] can translate to 64 different constants per one instruction
- When all pixels take the same branch, performance can be good
  - But in most cases, both code paths will execute
- Author shaders accordingly
  - Early-out code paths may not work out as expected
  - Check assembly output
Slide 5: Why GPRs Matter
- GPU can process many ALU ops per clock cycle
  - But fetch results can take hundreds of clock cycles
  - Due to cache misses, texel size, filtering, etc.
- Threads hide fetch latency
  - While one thread is waiting for fetch results, the GPU can start processing another thread
- Max number of threads is limited by the GPR pool
  - Example: 64 pixels x 32 GPRs = 2048 GPRs per thread
  - The GPU can quickly run out of GPRs
- Fewer GPRs used generally translates to more threads
Slide 6: Minimize GPR Usage
- Xbox 360 has 24,576 GPRs
  - 64-unit vector x 128 register banks x 3 SIMD units
- Sounds like a lot, but many shaders are GPR bound
  - The issue is the same for PC hardware
- Tweak shaders to use the least number of GPRs
  - Maybe even at the expense of additional ALUs or unintuitive control flow
- Unrolling loops usually requires more ALUs
  - As does lumping tfetches at the beginning of the shader
  - Ditto for über-shaders, even with static control flow
- Rules get tricky, so we must try out many variations
Slide 7: Vertex Shader Performance
- Lots of reasons to care about vertex shader performance
  - Modern games have many passes (z-pass, tiling, etc.)
  - Non-shared cores devote less hardware to vertices
  - For shared cores, resources could be used for pixels
- For ALU-bound shaders
  - Constant waterfalling can be the biggest problem
  - Matrix order can affect ALU optimization
- Many vertex shaders are fetch bound
  - Especially lightweight shaders
  - One fetch can cost 32x more cycles than an ALU op
Slide 8: VFetch
- Goal is to minimize the number of fetches
  - Which is why vertex compression is so important
- Vertex data should be aligned to 32 or 64 bytes
- Multiple streams will multiply the fetch cost
- Vertex declaration should match fetch order
- Shader compiler has ZERO info about vertex components
  - Shaders are patched at run time with the vertex declaration
  - Shader patching can't optimize out unnecessary vfetches
Slide 9: Mega vs. Mini Fetches
- A quick refresher
  - GPU reads vertex data in groups of bytes
  - Typically 32 bytes (a mega or full fetch)
  - Additional fetches within the 32-byte range should be free (a mini fetch)
- On Xbox 360
  - A vfetch_full pulls in 32 bytes worth of data
  - Two fetches (times 64 vertices) per clock cycle equals 32 cycles per fetch
  - Without 32 cycles worth of ALU ops, the shader is fetch bound
Slide 10: Vfetch Recommendations
- Compress vertices
  - Normal, Tangent, Binormal -> 11:11:10
  - Texture coords -> 16:16
- Put all non-lighting components first
  - So depth-only shaders do just one fetch
- Example layout (all in one stream, of course)
  - 1st fetch:
    - FLOAT32 Position[3]
    - UINT8 BoneWeights[4]
    - UINT8 BoneIndices[4]
    - UINT16 DiffuseTexCoords[2]
    - UINT16 NormalMapCoords[2]
  - 2nd fetch:
    - DEC3N Normal
    - DEC3N Tangent
    - DEC3N Binormal
Slide 11: Fetch From Two Streams
- Cycles per 64-vertex vector: ALU = 12, vertex = 64, sequencer = 12
- 3 GPRs, 31 threads

    // Fetch position
    vfetch_full r2.xyz1, r0.x, vf0,
        Offset=0,
        DataFormat=FMT_32_32_32_FLOAT
    // Fetch diffuse texcoord
    vfetch_full r0.xy0_, r0.x, vf2,
        Offset=0,
        DataFormat=FMT_32_32_FLOAT
    mul r1, r2.y, c1
    mad r1, r2.x, c0.wyxz, r1.wyxz
    mad r1, r2.z, c2.zywx, r1.wyxz
    mad r2, r2.w, c3, r1.wyxz
    mul r1, r2.y, c5.wzyx
    mad r1, r2.x, c4.xzwy, r1.wyxz
    mad r1, r2.z, c6.yzxw, r1.wyxz
    mad oPos, r2.w, c7, r1.zxyw
Slide 12: 2 Fetches From Same Stream
- Cycles per 64-vertex vector: ALU = 12, vertex = 64, sequencer = 12
- 3 GPRs, 31 threads

    // Fetch position
    vfetch_full r2.xyz1, r0.x, vf0,
        Offset=0,
        DataFormat=FMT_32_32_32_FLOAT
    // Fetch diffuse texcoord
    vfetch_full r0.xy0_, r0.x, vf0,
        Offset=10,
        DataFormat=FMT_32_32_FLOAT
    mul r1, r2.y, c1
    mad r1, r2.x, c0.wyxz, r1.wyxz
    mad r1, r2.z, c2.zywx, r1.wyxz
    mad r2, r2.w, c3, r1.wyxz
    mul r1, r2.y, c5.wzyx
    mad r1, r2.x, c4.xzwy, r1.wyxz
    mad r1, r2.z, c6.yzxw, r1.wyxz
    mad oPos, r2.w, c7, r1.zxyw
Slide 13: 1 Fetch From Single Stream
- Cycles per 64-vertex vector: ALU = 12, vertex = 32, sequencer = 12
- 3 GPRs, 31 threads

    // Fetch position
    vfetch_full r2.xyz1, r0.x, vf0,
        Offset=0,
        DataFormat=FMT_32_32_32_FLOAT
    // Fetch diffuse texcoord
    vfetch_mini r0.xy0_, r0.x, vf0,
        Offset=5,
        DataFormat=FMT_32_32_FLOAT
    mul r1, r2.y, c1
    mad r1, r2.x, c0.wyxz, r1.wyxz
    mad r1, r2.z, c2.zywx, r1.wyxz
    mad r2, r2.w, c3, r1.wyxz
    mul r1, r2.y, c5.wzyx
    mad r1, r2.x, c4.xzwy, r1.wyxz
    mad r1, r2.z, c6.yzxw, r1.wyxz
    mad oPos, r2.w, c7, r1.zxyw
Slide 14: Triple the Fetch Cost

    // DepthOnlyVS.asm
    vfetch r6.xyz1, r0.x, position
    vfetch r1, r0.x, blendindices
    vfetch r2, r0.x, blendweight
    vfetch r0.xy__, r0.x, texcoord
    mul r1, r1.wzyx, c255.x
    movas r0._, r1.x
    dp4 r3.x, c8[a0.x].zxyw, r6.zxyw
    dp4 r3.y, c9[a0.x].zxyw, r6.zxyw
    dp4 r3.z, c10[a0.x].zxyw, r6.zxyw

    // DepthOnlyVS.hlsl
    struct VS_INPUT
    {
        float4 Position;
        float4 BoneIndices;
        float4 BoneWeights;
        float2 DiffuseTexCoords;
    };

    VS_OUTPUT DepthOnlyVS( VS_INPUT In )
    {
        ...
    }

    // DepthOnlyVS.cpp
    D3DVERTEXELEMENT9 decl[] =
    {
        { 0,  0, D3DDECLTYPE_FLOAT3,   0, D3DDECLUSAGE_POSITION,     0 },
        { 0, 12, D3DDECLTYPE_FLOAT3,   0, D3DDECLUSAGE_NORMAL,       0 },
        { 0, 24, D3DDECLTYPE_FLOAT2,   0, D3DDECLUSAGE_TEXCOORD,     0 },
        { 0, 32, D3DDECLTYPE_UBYTE4,   0, D3DDECLUSAGE_BLENDINDICES, 0 },
        { 0, 36, D3DDECLTYPE_USHORT4N, 0, D3DDECLUSAGE_BLENDWEIGHT,  0 },
        D3DDECL_END()
    };
Slide 15: Triple the Fetch Cost
- Cycles per 64-vertex vector: ALU = 38, vertex = 96, sequencer = 22
- 7 GPRs, 27 threads

    vfetch_full r6.xyz1, r0.x, vf0,
        Offset=0,
        DataFormat=FMT_32_32_32_FLOAT  // FLOAT3 POSITION
    vfetch_full r1, r0.x, vf0,
        Offset=8,
        DataFormat=FMT_8_8_8_8         // UBYTE4 BLENDINDICES
    vfetch_mini r2,
        Offset=9,
        DataFormat=FMT_8_8_8_8         // USHORT4N BLENDWEIGHT
    vfetch_full r0.xy__, r0.x, vf0,
        Offset=6,
        DataFormat=FMT_32_32_FLOAT     // FLOAT2 TEXCOORD
    mul r1, r1.wzyx, c255.x
    movas r0._, r1.x
    dp4 r3.x, c8[a0.x].zxyw, r6.zxyw
Slide 16: One-Third the Fetch Cost
- Cycles per 64-vertex vector: ALU = 38, vertex = 32, sequencer = 22
- 7 GPRs, 27 threads

    vfetch_full r6.xyz1, r0.x, vf0,
        Offset=0,
        DataFormat=FMT_32_32_32_FLOAT  // FLOAT3 POSITION
    vfetch_mini r1,
        Offset=3,
        DataFormat=FMT_8_8_8_8         // UBYTE4 BLENDINDICES
    vfetch_mini r2,
        Offset=4,
        DataFormat=FMT_8_8_8_8         // USHORT4N BLENDWEIGHT
    vfetch_mini r0.xy__,
        Offset=5,
        DataFormat=FMT_32_32_FLOAT     // FLOAT2 TEXCOORD
    mul r1, r1.wzyx, c255.x
    movas r0._, r1.x
    dp4 r3.x, c8[a0.x].zxyw, r6.zxyw
Slide 17: Depth-Only Rendering
- GPUs have perf improvements for depth buffering
  - Hierarchical Z
  - Double-speed, depth-only rendering
- Depth-only rendering is often still fill-bound
  - A few triangles can cover 10-100,000s of quads
  - True for z-prepass, vis-testing, shadow rendering
- For vis-testing, use tighter bounding objects and proper culling
- Since we're fill-bound, consider doing pixel work
  - I.e., give up the double-speed benefit
  - Lay down something useful to spare an additional pass
  - Velocity, focal plane, etc.
Slide 18: Pixel Shader Performance
- Most calls will be fill-bound
- Pixel shader optimization is some combination of
  - Minimizing ALUs
  - Minimizing GPRs
  - Reducing control flow overhead
  - Improving texture cache usage
  - Avoiding expensive work
- Also, trying to balance the hardware
  - Fetches versus ALUs versus GPRs
- A big challenge is getting the shader compiler to do exactly what we want
Slide 19: Minimizing ALUs
- Minor modifications to an expression can change the number of ALUs
  - The shader compiler produces slightly different results
  - Play around to try different things out
- Avoid math on constants
  - Reducing just one multiply has saved 3 ALU ops
- Using [isolate] can dramatically change results
  - Especially for the ALU ops around texture fetches
- Verify shader compiler output
  - Get comfortable with assembly
  - Compare with expectations given your HLSL code
  - Finally, start tweaking to get what you want
Slide 20: Minimizing GPRs
- Minimizing ALUs usually saves on GPRs as well
- Unrolling loops consumes more GPRs
  - Conversely, using loops can save GPRs
- Lumping tfetches at the top of the shader costs GPRs
  - Both for calculated tex coords
  - And fetch results
- Xbox 360 extensions can save ALUs
  - Like tfetch with offsets
- Shader compiler can be told to make do with a user-specified max number of GPRs
Slide 21: Control Flow
- Flattening or preserving loops can have a huge effect on shader performance
- One game shaved 4 ms off of a 30 ms scene by unrolling just one loop (its main pixel shader)
  - Unrolling allowed for much more aggressive optimization by the compiler
- However, the DepthOfField shader saves 2 ms by preserving the loop
  - Using the loop reduced GPR usage dramatically
- Recommendation is to try both ways
- Non-branching control flow is still overhead
  - Gets reduced as ALU count goes down
Slide 22: Early-Out Shaders
- Early-out shaders may not really do anything
- On Xbox 360, HLSL clip doesn't really kill a pixel
  - But rather just invalidates the output
  - Remaining texture fetches may be spared
  - But all remaining ALU instructions still execute
- Write a test for other hardware to see if early outs actually improve performance
  - Otherwise, assume they don't
- For any gain, all 64 pixels would need to be killed
- Dynamic control flow via an if-else block should get close to what you intend
Slide 23: Dynamic Branching
- Dynamic branching can help or hurt performance
  - All pixels in a thread must actually take the branch
  - Otherwise, both branches need to be executed
- Use branching to skip fetches and calculations
  - Like whenever the alpha will be zero
  - But beware of multiple code paths executing
- if-else statements result in additional overhead
  - The ?: operator turns into a single instruction
- Avoid static branching masquerading as dynamic
  - Do not use numerical constants for control flow; use booleans instead
- Special-case various simple code paths, which results in less control flow and fewer GPRs used
Slide 24: Thread Size and Branching
[Figure: some pixel threads take the non-lighting path, some take the lighting path, and some take the soft shadow path]
Slide 25: Thread Size and Branching
- 43% of pixel threads take the non-lighting path
- 14% of pixel threads take the lighting path
- 43% of pixel threads take the soft shadow path
Slide 26: Thread Size and Branching
[Figure: the same 43% / 14% / 43% breakdown shown per thread across the screen]
Slide 27: Texture Cache Usage
- Fetches have latency that becomes a bottleneck
  - Can be a challenge to fetch 6-10 textures per pixel and many MB of texture traffic through a 32 KB cache
- Age-old recommendations still apply
- Compare measured texture traffic to ideal traffic
  - Consider a 1280x720x32-bit post-processing pass
  - 1280 x 720 x 32 bits = 3.686 MB of ideal texture traffic
  - But measured result may claim 7.0 MB
- Triangle rasterization can and will affect texture cache usage
  - In the case above, it's the only explanation
  - Pixels are processed in an order that's causing texels to be evicted from and re-fetched into the cache
Slide 28: Rasterization Test
- Use an MxN grid instead of a full-screen quad
  - Smaller primitives confine rasterization to better match usage patterns of the texture cache
  - Prevent premature evictions from the texture cache
- Ideal grid size varies for different conditions
  - Number of textures, texel width, etc.
  - And surely for different hardware, too
- Write a test that lets you try different grid configurations for each shader
  - For the DepthOfField shader, an 8x1 grid works best
  - For a different shader, 20x13 worked best
  - In all cases, 1x1 seems to be pretty bad
Slide 29: Conditional Processing
- The shader compiler doesn't know when intermediate results might be zero
  - Diffuse alpha, N·L, specular, bone weight, lightmap contribution, etc.
  - Pixel is in shadow
  - Some constant value is zero (or one)
- Try making expressions conditional when using these values
  - Experiment to see if even small branches pay off
- Use a texture mask to mask off areas where expensive calculations can be avoided
- For PCF, take a few samples to see if you need to do more
Slide 30: Multiple Passes
- Multiple passes cause lots of re-fetching of textures (normal maps, etc.)
  - However, separate passes are better for the texture cache
- Resolve and fetch cost of scene and depth textures adds up
  - Tiling overhead may be worth it to save passes
- Alpha blending between passes eats bandwidth
  - ALU power is 4x that of texture hardware
  - Branching in the shader can handle multiple lights
  - Consider multiple render targets
- Meshes are often transformed many times
  - Try to skin meshes just once for all N passes
  - Consider memexport or StreamOut for skinning
Slide 31: Xbox 360 Platform
- Fixed hardware offers lots of low-level manipulations to tweak performance
  - Consistent hardware characteristics, like fetch rules
  - Shader GPR allocation
  - Tfetch with offsets
  - Ability to view/author shader microcode
  - Access to hardware performance counters
  - Custom formats (7e3) and blend modes
  - Predicated tiling
- Custom tools
  - Microcode-aware shader compiler, with custom extensions and attributes
  - PIX for Xbox 360
- It's still important to measure real-world performance
Slide 32: Windows Platform
- Hardware can obviously vary a lot
  - Hard to get low-level knowledge of what the rules are
  - Driver performs additional compiling of shaders
- Some things to check on include
  - Effect of thread size on dynamic branching
  - Effect of GPR usage
  - 32-bit versus 16-bit float performance
  - 64-bit render target and texture performance
  - Multiple render target performance
  - Z-prepass effectiveness
  - Hardware support for shadowmap filtering
- If possible, consider shader authoring on an Xbox 360 dev kit
- Get ready for D3D 10
Slide 33: The HLSL Shader Compiler
- Don't expect the shader compiler to solve all your problems
- It can't be perfect (and it's not)
  - Garbage in, garbage out
- It can't know what you're really trying to do
  - It's easy to trick the compiler, especially with constants
- It can't know the situation the shader will run in
  - Texture cache usage, 64-bit textures, filtering, etc.
  - Vertex declarations
  - Interpolators
  - Alpha blending
  - Neighboring vertices and pixels
- Besides, what does the driver do with your shader once it gets it?
Slide 34: Shader Compiler
- The shader compiler can generate variants that perform dramatically differently
  - Loops and branching versus flat control flow
  - Grouping tfetches versus minimizing GPR count
  - Isolated operations versus intertwining instructions
  - Reactions to a subtle difference in an expression
- Be accountable for shader compiler output
  - Always verify the output of the shader compiler
  - Know what you want, then try to get the shader compiler to play along
  - Rough-count the number of hypothetical instructions for verification
Slide 35: Controlling the HLSL Compiler
- Options to affect compiler output include
  - Compiler switches and HLSL attributes to mandate control flow
  - Manually unrolling loops
  - Rearranging HLSL code
- The Xbox 360 compiler has a few more switches and attributes
  - /Xmaxtempreg to limit GPR usage
  - [isolate]-ing blocks can have a huge effect on code generation
  - Especially around fetches
  - Ditto for [noExpressionOptimizations]
- Compiler output can often be improved by massaging the actual HLSL code
Slide 36: Massaging HLSL Code
- Changing one simple operation can improve or degrade output
  - The input changes, so the rules change, so code generation changes
  - But new code is not always better
  - Send weird cases into developer support
- The one operation may be as simple as an add
  - Or moving an expression earlier/later in the shader
  - Or needless math on constants

    rcp r0.x, c20.x
    mul r0.xyz, c5.xyz, c6.w
    movs r0.y, c0.z
    cndeq r0, c6.x, r0, r1

- Always verify results in assembly
Slide 37: Where's the HLSL?
- Make sure your art pipeline lets you access/view/tweak HLSL shaders
- Many engines assemble shader fragments dynamically
  - Meaning there's not complete HLSL source lying around for every variation of every shader used in-game
- You must solve this problem
  - Recommendation is to spit out exact HLSL immediately after compilation
  - Save the HLSL to a sequentially named file
  - Then add PIX output to your scene with the ID of each shader used
  - That way, you can trace draw calls back to an HLSL file that you can experiment with
Slide 38: Test Things Out
- Optimizing shaders means trying out a lot of different things under real-world conditions
  - So it's imperative to test things out
- Shaders need to be buildable from the command line
  - And be able to be dropped into a test framework
  - And hooked up to performance tools like PIX
- Get comfortable with shader compiler output
  - Verify that the assembly looks close to expected
  - Like GPR usage, control flow, tfetch placement
- Isolate shaders and exaggerate their effect
  - Draw 100 full-screen quads instead of one
  - Draw objects off-screen to eliminate fill cost
- Then, start tweaking