Title: Revving Up Shader Performance
Slide 1: Revving Up Shader Performance — Under the Hood
- Shanon Drone
- Development Engineer
- XNA Developer Connection, Microsoft
Slide 2: Key Takeaways
- Know the shader hardware on a low level
  - Pixel threads, GPRs, fetch efficiency, latency, etc.
  - Learn from the Xbox 360
- Architect shaders accordingly
- Shader optimization is not always intuitive
  - Some things that seem to make sense actually hurt performance
  - At some point, you must take a trial-and-error approach
- In other words: you must write tests
  - Real-world measurements are the only way to really know what works
Slide 3: Under the Hood
- Help the shader compiler by writing shaders for how the hardware works under the hood
- Modern GPUs are highly parallel to hide latency
  - Lots of pipelines (maybe even a shared shader core)
  - Works on units larger than single vertices and pixels
  - Rasterization goes to tiles, then to quads, then to (groups of) pixels
- Texture caches are still small
  - 32 KB on Xbox 360 and similar on PC hardware
  - Shaders use 6-10 textures with 10-20 MB of traffic
  - Rasterization order affects texture cache usage
- Lots of other subtleties affect performance
Slide 4: Pixel and Vertex Vectors
- GPUs work on groups of vertices and pixels
  - Xbox 360 works on a vector of 64 vertices or pixels
  - Meaning each instruction is executed 64 times at once
- Indexed constants cause constant waterfalling
  - For example, c[a0] can translate to 64 different constants per one instruction
- When all pixels take the same branch, performance can be good
  - But in most cases, both code paths will execute
- Author shaders accordingly
  - Early-out code paths may not work out as expected
  - Check assembly output
Slide 5: Why GPRs Matter
- GPU can process many ALU ops per clock cycle
  - But fetch results can take hundreds of clock cycles
  - Due to cache misses, texel size, filtering, etc.
- Threads hide fetch latency
  - While one thread is waiting for fetch results, the GPU can start processing another thread
- Max number of threads is limited by the GPR pool
  - Example: 64 pixels x 32 GPRs = 2048 GPRs per thread
  - The GPU can quickly run out of GPRs
- Fewer GPRs used generally translates to more threads
Slide 6: Minimize GPR Usage
- Xbox 360 has 24,576 GPRs
  - 64-unit vector x 128 register banks x 3 SIMD units
- Sounds like a lot, but many shaders are GPR bound
  - The issue is the same for PC hardware
- Tweak shaders to use the least number of GPRs
  - Maybe even at the expense of additional ALUs or unintuitive control flow
- Unrolling loops usually requires more ALUs
  - As does lumping tfetches at the beginning of the shader
  - Ditto for über-shaders, even with static control flow
- Rules get tricky, so we must try out many variations
Slide 7: Vertex Shader Performance
- Lots of reasons to care about vertex shader performance
  - Modern games have many passes (z-pass, tiling, etc.)
  - Non-shared cores devote less hardware to vertices
  - For shared cores, resources could be used for pixels
- For ALU-bound shaders
  - Constant waterfalling can be the biggest problem
  - Matrix order can affect ALU optimization
- Many vertex shaders are fetch bound
  - Especially lightweight shaders
  - One fetch can cost 32x more cycles than an ALU op
Slide 8: VFetch
- Goal is to minimize the number of fetches
  - Which is why vertex compression is so important
- Vertex data should be aligned to 32 or 64 bytes
- Multiple streams will multiply the fetch cost
- Vertex declaration should match fetch order
- Shader compiler has ZERO info about vertex components
  - Shaders are patched at run time with the vertex declaration
  - Shader patching can't optimize out unnecessary vfetches
Slide 9: Mega vs. Mini Fetches
- A quick refresher
  - GPU reads vertex data in groups of bytes
  - Typically 32 bytes (a mega or full fetch)
  - Additional fetches within the 32-byte range should be free (a mini fetch)
- On Xbox 360
  - A vfetch_full pulls in 32 bytes worth of data
  - Two fetches (times 64 vertices) per clock cycle equals 32 cycles per fetch
  - Without 32 cycles worth of ALU ops, the shader is fetch bound
Slide 10: Vfetch Recommendations
- Compress vertices
  - Normal, Tangent, Binormal -> 11:11:10
  - Texture coords -> 16:16
- Put all non-lighting components first
  - So depth-only shaders do just one fetch
- Example layout (all in one stream, of course)
  - 1st fetch:
    - FLOAT32 Position[3]
    - UINT8 BoneWeights[4]
    - UINT8 BoneIndices[4]
    - UINT16 DiffuseTexCoords[2]
    - UINT16 NormalMapCoords[2]
  - 2nd fetch:
    - DEC3N Normal
    - DEC3N Tangent
    - DEC3N Binormal
Slide 11: Fetch From Two Streams
- Cycles per 64-vertex vector: ALU = 12, vertex = 64, sequencer = 12
- 3 GPRs, 31 threads

    // Fetch position
    vfetch_full r2.xyz1, r0.x, vf0,
        Offset=0,
        DataFormat=FMT_32_32_32_FLOAT
    // Fetch diffuse texcoord
    vfetch_full r0.xy0_, r0.x, vf2,
        Offset=0,
        DataFormat=FMT_32_32_FLOAT
    mul r1, r2.y, c1
    mad r1, r2.x, c0.wyxz, r1.wyxz
    mad r1, r2.z, c2.zywx, r1.wyxz
    mad r2, r2.w, c3, r1.wyxz
    mul r1, r2.y, c5.wzyx
    mad r1, r2.x, c4.xzwy, r1.wyxz
    mad r1, r2.z, c6.yzxw, r1.wyxz
    mad oPos, r2.w, c7, r1.zxyw
Slide 12: 2 Fetches From Same Stream
- Cycles per 64-vertex vector: ALU = 12, vertex = 64, sequencer = 12
- 3 GPRs, 31 threads

    // Fetch position
    vfetch_full r2.xyz1, r0.x, vf0,
        Offset=0,
        DataFormat=FMT_32_32_32_FLOAT
    // Fetch diffuse texcoord
    vfetch_full r0.xy0_, r0.x, vf0,
        Offset=10,
        DataFormat=FMT_32_32_FLOAT
    mul r1, r2.y, c1
    mad r1, r2.x, c0.wyxz, r1.wyxz
    mad r1, r2.z, c2.zywx, r1.wyxz
    mad r2, r2.w, c3, r1.wyxz
    mul r1, r2.y, c5.wzyx
    mad r1, r2.x, c4.xzwy, r1.wyxz
    mad r1, r2.z, c6.yzxw, r1.wyxz
    mad oPos, r2.w, c7, r1.zxyw
Slide 13: 1 Fetch From Single Stream
- Cycles per 64-vertex vector: ALU = 12, vertex = 32, sequencer = 12
- 3 GPRs, 31 threads

    // Fetch position
    vfetch_full r2.xyz1, r0.x, vf0,
        Offset=0,
        DataFormat=FMT_32_32_32_FLOAT
    // Fetch diffuse texcoord
    vfetch_mini r0.xy0_, r0.x, vf0,
        Offset=5,
        DataFormat=FMT_32_32_FLOAT
    mul r1, r2.y, c1
    mad r1, r2.x, c0.wyxz, r1.wyxz
    mad r1, r2.z, c2.zywx, r1.wyxz
    mad r2, r2.w, c3, r1.wyxz
    mul r1, r2.y, c5.wzyx
    mad r1, r2.x, c4.xzwy, r1.wyxz
    mad r1, r2.z, c6.yzxw, r1.wyxz
    mad oPos, r2.w, c7, r1.zxyw
Slide 14: Triple the Fetch Cost

    // DepthOnlyVS.asm
    vfetch r6.xyz1, r0.x, position
    vfetch r1, r0.x, blendindices
    vfetch r2, r0.x, blendweight
    vfetch r0.xy__, r0.x, texcoord
    mul r1, r1.wzyx, c255.x
    movas r0._, r1.x
    dp4 r3.x, c8[a0.x].zxyw, r6.zxyw
    dp4 r3.y, c9[a0.x].zxyw, r6.zxyw
    dp4 r3.z, c10[a0.x].zxyw, r6.zxyw

    // DepthOnlyVS.hlsl
    struct VS_INPUT
    {
        float4 Position;
        float4 BoneIndices;
        float4 BoneWeights;
        float2 DiffuseTexCoords;
    };

    VS_OUTPUT DepthOnlyVS( VS_INPUT In )
    {
        ...
    }

    // DepthOnlyVS.cpp
    D3DVERTEXELEMENT9 decl[] =
    {
        { 0,  0, D3DDECLTYPE_FLOAT3,   0, D3DDECLUSAGE_POSITION,     0 },
        { 0, 12, D3DDECLTYPE_FLOAT3,   0, D3DDECLUSAGE_NORMAL,       0 },
        { 0, 24, D3DDECLTYPE_FLOAT2,   0, D3DDECLUSAGE_TEXCOORD,     0 },
        { 0, 32, D3DDECLTYPE_UBYTE4,   0, D3DDECLUSAGE_BLENDINDICES, 0 },
        { 0, 36, D3DDECLTYPE_USHORT4N, 0, D3DDECLUSAGE_BLENDWEIGHT,  0 },
        D3DDECL_END()
    };
Slide 15: Triple the Fetch Cost
- Cycles per 64-vertex vector: ALU = 38, vertex = 96, sequencer = 22
- 7 GPRs, 27 threads

    vfetch_full r6.xyz1, r0.x, vf0,
        Offset=0,
        DataFormat=FMT_32_32_32_FLOAT  // FLOAT3 POSITION
    vfetch_full r1, r0.x, vf0,
        Offset=8,
        DataFormat=FMT_8_8_8_8         // UBYTE4 BLENDINDICES
    vfetch_mini r2,
        Offset=9,
        DataFormat=FMT_8_8_8_8         // USHORT4N BLENDWEIGHT
    vfetch_full r0.xy__, r0.x, vf0,
        Offset=6,
        DataFormat=FMT_32_32_FLOAT     // FLOAT2 TEXCOORD
    mul r1, r1.wzyx, c255.x
    movas r0._, r1.x
    dp4 r3.x, c8[a0.x].zxyw, r6.zxyw
Slide 16: One-Third the Fetch Cost
- Cycles per 64-vertex vector: ALU = 38, vertex = 32, sequencer = 22
- 7 GPRs, 27 threads

    vfetch_full r6.xyz1, r0.x, vf0,
        Offset=0,
        DataFormat=FMT_32_32_32_FLOAT  // FLOAT3 POSITION
    vfetch_mini r1,
        Offset=3,
        DataFormat=FMT_8_8_8_8         // UBYTE4 BLENDINDICES
    vfetch_mini r2,
        Offset=4,
        DataFormat=FMT_8_8_8_8         // USHORT4N BLENDWEIGHT
    vfetch_mini r0.xy__,
        Offset=5,
        DataFormat=FMT_32_32_FLOAT     // FLOAT2 TEXCOORD
    mul r1, r1.wzyx, c255.x
    movas r0._, r1.x
    dp4 r3.x, c8[a0.x].zxyw, r6.zxyw
Slide 17: Depth-Only Rendering
- GPUs have perf improvements for depth buffering
  - Hierarchical Z
  - Double-speed, depth-only rendering
- Depth-only rendering is often still fill-bound
  - A few triangles can cover 10-100,000s of quads
  - True for z-prepass, vis-testing, shadow rendering
- For vis-testing, use tighter bounding objects and proper culling
- Since we're fill-bound, consider doing pixel work
  - I.e., give up the double-speed benefit
  - Lay down something useful to spare an additional pass
  - Velocity, focal plane, etc.
Slide 18: Pixel Shader Performance
- Most calls will be fill-bound
- Pixel shader optimization is some combination of
  - Minimizing ALUs
  - Minimizing GPRs
  - Reducing control flow overhead
  - Improving texture cache usage
  - Avoiding expensive work
- Also, trying to balance the hardware
  - Fetches versus ALUs versus GPRs
- A big challenge is getting the shader compiler to do exactly what we want
Slide 19: Minimizing ALUs
- Minor modifications to an expression can change the number of ALUs
  - The shader compiler produces slightly different results
  - Play around to try different things out
- Avoid math on constants
  - Reducing just one multiply has saved 3 ALU ops
- Using [isolate] can dramatically change results
  - Especially for the ALU ops around texture fetches
- Verify shader compiler output
  - Get comfortable with assembly
  - Compare with expectations given your HLSL code
  - Finally, start tweaking to get what you want
Slide 20: Minimizing GPRs
- Minimizing ALUs usually saves on GPRs as well
- Unrolling loops consumes more GPRs
  - Conversely, using loops can save GPRs
- Lumping tfetches at the top of the shader costs GPRs
  - Both for calculated tex coords
  - And fetch results
- Xbox 360 extensions can save ALUs
  - Like tfetch with offsets
- Shader compiler can be told to make do with a user-specified max number of GPRs
Slide 21: Control Flow
- Flattening or preserving loops can have a huge effect on shader performance
- One game shaved 4 ms off of a 30 ms scene by unrolling just one loop (its main pixel shader)
  - Unrolling allowed for much more aggressive optimization by the compiler
- However, the DepthOfField shader saves 2 ms by preserving the loop
  - Using the loop reduced GPR usage dramatically
- Recommendation is to try both ways
- Non-branching control flow is still overhead
  - Gets reduced as ALU count goes down
Slide 22: Early-Out Shaders
- Early-out shaders may not really do anything
- On Xbox 360, HLSL clip doesn't really kill a pixel
  - But rather just invalidates the output
  - Remaining texture fetches may be spared
  - But all remaining ALU instructions still execute
- Write a test for other hardware to see if early outs actually improve performance
  - Otherwise, assume they don't
- For any gain, all 64 pixels would need to be killed
- Dynamic control flow via an if-else block should get close to what you intend
Slide 23: Dynamic Branching
- Dynamic branching can help or hurt performance
  - All pixels in a thread must actually take the branch
  - Otherwise, both branches need to be executed
- Use branching to skip fetches and calculations
  - Like whenever the alpha will be zero
  - But beware of multiple code paths executing
- if-else statements result in additional overhead
  - The ?: operator turns into a single instruction
- Avoid static branching masquerading as dynamic
  - Do not use numerical constants for control flow; use booleans instead
- Special-case various simple code paths, which results in less control flow and fewer GPRs used
Slide 24: Thread Size and Branching
[Figure: some pixel threads take the non-lighting path, some take the lighting path, and some take the soft shadow path]
Slide 25: Thread Size and Branching
- 43% of pixel threads take the non-lighting path
- 14% of pixel threads take the lighting path
- 43% of pixel threads take the soft shadow path
Slide 26: Thread Size and Branching
[Figure: the same 43% / 14% / 43% breakdown shown per thread across the screen]
Slide 27: Texture Cache Usage
- Fetches have latency that becomes a bottleneck
  - Can be a challenge to fetch 6-10 textures per pixel and many MB of texture traffic through a 32 KB cache
- Age-old recommendations still apply
- Compare measured texture traffic to ideal traffic
  - Consider a 1280x720x32-bit post-processing pass
  - 1280 x 720 x 32 bits = 3.686 MB of ideal texture traffic
  - But measured result may claim 7.0 MB
- Triangle rasterization can and will affect texture cache usage
  - In the case above, it's the only explanation
  - Pixels are processed in an order that's causing texels to be evicted from and re-fetched into the cache
Slide 28: Rasterization Test
- Use an MxN grid instead of a full-screen quad
  - Smaller primitives confine rasterization to better match usage patterns of the texture cache
  - Prevent premature evictions from the texture cache
- Ideal grid size varies for different conditions
  - Number of textures, texel width, etc.
  - And surely for different hardware, too
- Write a test that lets you try different grid configurations for each shader
  - For the DepthOfField shader, an 8x1 grid works best
  - For a different shader, 20x13 worked best
  - In all cases, 1x1 seems to be pretty bad
Slide 29: Conditional Processing
- The shader compiler doesn't know when intermediate results might be zero
  - Diffuse alpha, N·L, specular, bone weight, lightmap contribution, etc.
  - Pixel is in shadow
  - Some constant value is zero (or one)
- Try making expressions conditional when using these values
  - Experiment to see if even small branches pay off
- Use a texture mask to mask off areas where expensive calculations can be avoided
- For PCF, take a few samples to see if you need to do more
Slide 30: Multiple Passes
- Multiple passes cause lots of re-fetching of textures (normal maps, etc.)
  - However, separate passes are better for the texture cache
- Resolve and fetch cost of scene and depth textures adds up
  - Tiling overhead may be worth it to save passes
- Alpha blending between passes eats bandwidth
  - ALU power is 4x that of texture hardware
  - Branching in the shader can handle multiple lights
  - Consider multiple render targets
- Meshes are often transformed many times
  - Try to skin meshes just once for all N passes
  - Consider memexport or StreamOut for skinning
Slide 31: Xbox 360 Platform
- Fixed hardware offers lots of low-level manipulations to tweak performance
  - Consistent hardware characteristics, like fetch rules
  - Shader GPR allocation
  - Tfetch with offsets
  - Ability to view/author shader microcode
  - Access to hardware performance counters
  - Custom formats (7e3) and blend modes
  - Predicated tiling
- Custom tools
  - Microcode-aware shader compiler, with custom extensions and attributes
  - PIX for Xbox 360
- It's still important to measure real-world performance
Slide 32: Windows Platform
- Hardware can obviously vary a lot
  - Hard to get low-level knowledge of what the rules are
  - Driver performs additional compiling of shaders
- Some things to check on include
  - Effect of thread size on dynamic branching
  - Effect of GPR usage
  - 32-bit versus 16-bit float performance
  - 64-bit render target and texture performance
  - Multiple render target performance
  - Z-prepass effectiveness
  - Hardware support for shadowmap filtering
- If possible, consider shader authoring on an Xbox 360 dev kit
- Get ready for D3D 10
Slide 33: The HLSL Shader Compiler
- Don't expect the shader compiler to solve all your problems
- It can't be perfect (and it's not)
  - Garbage in, garbage out
- It can't know what you're really trying to do
  - It's easy to trick the compiler, especially with constants
- It can't know the situation the shader will run in
  - Texture cache usage, 64-bit textures, filtering, etc.
  - Vertex declarations
  - Interpolators
  - Alpha blending
  - Neighboring vertices and pixels
- Besides, what does the driver do with your shader once it gets it?
Slide 34: Shader Compiler
- The shader compiler can generate variants that perform dramatically differently
  - Loops and branching versus flat control flow
  - Grouping tfetches versus minimizing GPR count
  - Isolated operations versus intertwining instructions
  - Reactions to a subtle difference in an expression
- Be accountable for shader compiler output
  - Always verify the output of the shader compiler
  - Know what you want, then try to get the shader compiler to play along
  - Rough-count the number of hypothetical instructions for verification
Slide 35: Controlling the HLSL Compiler
- Options to affect compiler output include
  - Compiler switches and HLSL attributes to mandate control flow
  - Manually unrolling loops
  - Rearranging HLSL code
- The Xbox 360 compiler has a few more switches and attributes
  - /Xmaxtempreg to limit GPR usage
  - [isolate]-ing blocks can have a huge effect on code generation
  - Especially around fetches
  - Ditto for [noExpressionOptimizations]
- Compiler output can often be improved by massaging the actual HLSL code
Slide 36: Massaging HLSL Code
- Changing one simple operation can improve or degrade output
  - The input changes, so the rules change, so code generation changes
  - But new code is not always better
  - Send weird cases into developer support
- The one operation may be as simple as an add
  - Or moving an expression earlier/later in the shader
  - Or needless math on constants

    rcp r0.x, c20.x
    mul r0.xyz, c5.xyz, c6.w
    movs r0.y, c0.z
    cndeq r0, c6.x, r0, r1

- Always verify results in assembly
Slide 37: Where's the HLSL?
- Make sure your art pipeline lets you access/view/tweak HLSL shaders
- Many engines assemble shader fragments dynamically
  - Meaning there's not complete HLSL source lying around for every variation of every shader used in-game
- You must solve this problem
  - Recommendation is to spit out exact HLSL immediately after compilation
  - Save the HLSL to a sequentially named file
  - Then add PIX output to your scene with the ID of each shader used
  - That way, you can trace draw calls back to an HLSL file that you can experiment with
Slide 38: Test Things Out
- Optimizing shaders means trying out a lot of different things under real-world conditions
  - So it's imperative to test things out
- Shaders need to be buildable from the command line
  - And be able to be dropped into a test framework
  - And hooked up to performance tools like PIX
- Get comfortable with shader compiler output
  - Verify that the assembly looks close to expected
  - Like GPR usage, control flow, tfetch placement
- Isolate shaders and exaggerate their effect
  - Draw 100 full-screen quads instead of one
  - Draw objects off-screen to eliminate fill cost
- Then, start tweaking