Title: Efficient High-Level Shader Development
1. Efficient High-Level Shader Development
- Natalya Tatarchuk
- 3D Application Research Group
- ATI Technologies, Inc.
2. Overview
- Writing optimal HLSL code
  - Compiling issues
  - Optimization strategies
  - Code structure pointers
- HLSL Shader Examples
  - Multi-layer car paint effect
  - Translucent Iridescent Shader
  - Überlight Shader
3. Why use HLSL?
- Faster, easier effect development
  - Instant readability of your shader code
  - Better code reuse and maintainability
- Optimization
  - Added benefit of HLSL compiler optimizations
  - Still helps to know what's under the hood
- Industry standard that will run on cards from any vendor
  - Current and future industry direction
- Increases your ability to iterate on a given shader design, resulting in better-looking games
- Conveniently manage shader permutations
4. Compile Targets
- Legal HLSL is still independent of the compile target chosen
  - But having an HLSL shader doesn't mean it will always run on any hardware!
- Currently supported compile targets:
  - vs_1_1, vs_2_0, vs_2_sw
  - ps_1_1, ps_1_2, ps_1_3, ps_1_4, ps_2_0, ps_2_sw
- Compilation is vendor-independent and is done by a D3DX component that Microsoft can update independently of the runtime release schedule
5. Compilation Failure
- The obvious program errors (bad syntax, etc.)
- Compile-target-specific reasons your shader is too complex for the selected target:
  - Not enough resources in the selected target
    - Uses too many registers (temporaries, for example)
    - Too many resulting asm instructions for the compile target
  - Lack of capability in the target
    - Such as trying to sample a texture in vs_1_1
    - Using dynamic branching when it is unsupported in the target
    - Sampling a texture too many times for the target (for example, more than 6 for ps_1_4)
- The compiler provides useful messages
6. Use Disassembly for Hints
- Very helpful for understanding the relationship between compile targets and code generation
- Disassembly output provides valuable hints when compiling down to an older compile target
  - If the shader compiles successfully for a more recent target (e.g., ps_2_0), look at the disassembly output for hints when it fails to compile to an older target (e.g., ps_1_4)
- Check the instruction counts for ALU and texture ops
- Figure out how HLSL instructions get mapped to assembly
7. Getting Disassembly Output for Your Shaders
- Directly use FXC
  - Compile for any target desired
  - Compile both individual shader files and full effects
  - Various input arguments:
    - Allow you to turn shader optimizations on/off
    - Specify different entry points
    - Enable/disable generating debug information
8. Easier Path to Disassembly
- Use RenderMonkey while developing shaders
  - See your changes in real time
  - Disassembly output is updated every time a shader is compiled
  - Displays the counts for ALU and texture ops, as well as the limits for the selected target
  - Can save the resulting assembly code to a text file
9. Optimizing HLSL Shaders
- Don't forget you are running on a vector processor
- Do your computations at the most efficient frequency
  - Don't do something per-pixel that you can do per-vertex
  - Don't perform computation in a shader that you can precompute in the app
- Use HLSL intrinsic functions
  - Helps hardware to optimize your shaders
  - Know your intrinsics and how they map to asm, especially asm modifiers
10. HLSL Syntax Not Limited
- The HLSL code you write is not limited by the compile target you choose
  - You can always use loops, subroutines, if-else statements, etc.
  - If they are not natively supported in the selected compile target, the compiler will still try to generate code:
    - Loops will be unrolled
    - Subroutines will be inlined
    - If-else statements will execute both branches, selecting the appropriate output as the result
- Code generation is dependent upon the compile target
  - Use appropriate data types to improve instruction count
    - Store your data in a vector when needed
    - However, using appropriate data types helps the compiler do a better job of optimizing your code
11. Using If Statement in HLSL
- Can have large performance implications
  - Lack of branching support in most asm models
  - Both sides of an if statement will be executed
  - The output is chosen based on which side of the if would have been taken
- Optimization is different from the CPU programming world
12. Example of Using If in vs_1_1
if ( Threshold > 0.0 )
    Out.Position = Value1;
else
    Out.Position = Value2;

generates the following assembly output:

// calculate lerp value based on Threshold > 0
mov r1.w, c2.x
slt r0.w, c3.x, r1.w
// lerp between Value1 and Value2
mov r7, -c1
add r2, r7, c0
mad oPos, r0.w, r2, c1
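To make the flattening concrete, here is a small C model of what the vs_1_1 code above computes. The float4 struct and both function names are hypothetical and exist only for this sketch. The comparison in select_flattened stands in for the slt instruction, and the final expression mirrors the mov/add/mad sequence that lerps from Value2 toward Value1.

```c
/* Hypothetical stand-in for an HLSL float4 (not part of the original code). */
typedef struct { float x, y, z, w; } float4;

/* What the HLSL source expresses: choose one value or the other. */
static float4 select_branch(float threshold, float4 value1, float4 value2)
{
    if (threshold > 0.0f)
        return value1;
    return value2;
}

/* What vs_1_1 executes: both values are always read, and a 0/1 mask
   produced by slt blends them. */
static float4 select_flattened(float threshold, float4 value1, float4 value2)
{
    /* slt r0.w, c3.x, r1.w  ->  1.0 if 0.0 < threshold, else 0.0 */
    float mask = (0.0f < threshold) ? 1.0f : 0.0f;
    float4 out;

    /* mov r7, -c1 / add r2, r7, c0  ->  value1 - value2
       mad oPos, r0.w, r2, c1        ->  mask * (value1 - value2) + value2 */
    out.x = mask * (value1.x - value2.x) + value2.x;
    out.y = mask * (value1.y - value2.y) + value2.y;
    out.z = mask * (value1.z - value2.z) + value2.z;
    out.w = mask * (value1.w - value2.w) + value2.w;
    return out;
}
```

Both inputs are fetched and combined on every execution. This is why an innocent-looking if can still cost instructions on targets without branching.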
13. Example of Function Inlining
// Bias and double a value to take it from 0..1 range to -1..1 range
float4 bx2( float4 x )
{
   return 2.0f * x - 1.0f;
}

float4 main( float4 tc0 : TEXCOORD0, float4 tc1 : TEXCOORD1,
             float4 tc2 : TEXCOORD2, float4 tc3 : TEXCOORD3 ) : COLOR
{
   // Sample noise map three times with different texture coordinates
   float4 noise0 = tex2D( fire_distortion, tc1.xy );
   float4 noise1 = tex2D( fire_distortion, tc2.xy );
   float4 noise2 = tex2D( fire_distortion, tc3.xy );

   // Weighted sum of signed noise
   float4 noiseSum = bx2( noise0 ) * distortion_amount0 +
                     bx2( noise1 ) * distortion_amount1 +
                     bx2( noise2 ) * distortion_amount2;

   // Perturb base coordinates in direction of noiseSum as function of height (y)
   float4 perturbedBaseCoords = tc0 + noiseSum * ( tc0.y * height_attenuation.x +
                                                   height_attenuation.y );

   // Sample base and opacity maps with perturbed coordinates
   float4 base    = tex2D( fire_base, perturbedBaseCoords.xy );
   float4 opacity = tex2D( fire_opacity, perturbedBaseCoords.xy );

   return base * opacity;
}
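The same helper pattern carries over to C, where a small function like bx2 is likewise a candidate for expansion at each call site. The sketch below rewrites the helper and one channel of the noise sum in C. noise_sum and its array parameters are hypothetical names for illustration. In C the inline keyword is only a hint. The slides note that the HLSL compiler inlines subroutines whenever the target lacks native support for them.

```c
/* Bias and double: maps a 0..1 texture value to the signed -1..1 range. */
static inline float bx2(float x)
{
    return 2.0f * x - 1.0f;
}

/* One channel of the fire shader's weighted sum of signed noise.
   weight[] stands in for distortion_amount0..2 (hypothetical layout). */
static float noise_sum(const float noise[3], const float weight[3])
{
    float sum = 0.0f;
    for (int i = 0; i < 3; ++i)
        sum += bx2(noise[i]) * weight[i];   /* call is an inlining candidate */
    return sum;
}
```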
14. Code Permutations Via Compilation
static const bool bAnimate = false;

VS_OUTPUT vs_main( float4 Pos : POSITION,
                   float2 Tex : TEXCOORD0 )
{
   VS_OUTPUT Out = (VS_OUTPUT) 0;
   Out.Pos = mul( view_proj_matrix, Pos );
   if ( bAnimate )
   {
      Out.Tex.x = Tex.x + time / 2;
      Out.Tex.y = Tex.y - time / 2;
   }
   else
      Out.Tex = Tex;
   return Out;
}

With static const bool bAnimate = false, the animation path is compiled out entirely (5 instructions):

vs_1_1
dcl_position v0
dcl_texcoord v1
mul r0, v0.y, c1
mad r0, c0, v0.x, r0
mad r0, c2, v0.z, r0
mad oPos, c3, v0.w, r0
mov oT0.xy, v1

The same function with bool bAnimate = false (or const bool bAnimate = false) compiles to 8 instructions. Without static, bAnimate remains a uniform that the application can change. The compiler therefore keeps the animated path and blends it in with bAnimate (c5.x):

vs_1_1
def c6, 0.5, 0, 0, 0
dcl_position v0
dcl_texcoord v1
mul r0, v0.y, c1
mad r0, c0, v0.x, r0
mov r1.w, c4.x
mul r1.x, r1.w, c6.x
mad r0, c2, v0.z, r0
mov r1.y, -r1.x
mad oPos, c3, v0.w, r0
mad oT0.xy, c5.x, r1, v1
15. Scalar and Vector Data Types
- Scalar data types are not all natively supported in hardware
  - e.g., integers are emulated on float hardware
- Not all targets have native half, and none currently have double
- Can apply swizzles to vector types
  - float2 vec = pos.xy;
- But!
  - Not all targets have fully flexible swizzles
  - Acquaint yourself with the swizzles native to the relevant compile targets (particularly ps_2_0 and lower)
16. Integer Data Type
- Added to make relative addressing more efficient
- Using floats for addressing purposes without defined truncation rules can result in incorrect access to arrays
- All inputs used as ints should be defined as ints in your shader
17. Example of Integer Data Type Usage
- Matrix palette indices for skinning
  - Declaring the variable as an int is a free operation: no truncation occurs
  - Using a float and casting it to an int, or using it directly: truncation will happen
18. Real-World Shader Examples
- Will present several case studies of developing shaders used in ATI's demos
  - Multi-tone car paint effect
  - Translucent iridescent effect
  - Classic Überlight example
- Examples are presented as RenderMonkey™ workspaces
  - Distributed publicly with the version 1.0 release
19. Multi-Tone Car Paint
20. Multi-Tone Car Paint Effect
- Multi-tone base color layer
- Microflake layer simulation
- Clear gloss coat
- Dynamically Blurred Reflections
21. Car Paint Layers Build Up
Figure panels: Multi-Tone Base Color; Microflake Layer; Clear Gloss Coat; Final Color Composite
22. Multi-Tone Base Paint Layer
- View-dependent lerping between three paint colors
- Normal from the appearance-preserving simplification process, N
- Uses a subtractive tone to control overall color accumulation
23. Normal Decompression
- Sample from a two-channel 16-16 normal map
- Derive z from $z = \sqrt{1 - x^2 - y^2}$
- Gives higher precision than the typically used 8-8-8-8 normal map
24. Multi-Tone Base Coat Vertex Shader
VS_OUTPUT main( float4 Pos      : POSITION,
                float3 Normal   : NORMAL,
                float2 Tex      : TEXCOORD0,
                float3 Tangent  : TANGENT,
                float3 Binormal : BINORMAL )
{
   VS_OUTPUT Out = (VS_OUTPUT) 0;

   // Propagate transformed position out
   Out.Pos = mul( view_proj_matrix, Pos );

   // Compute view vector
   Out.View = normalize( mul( inv_view_matrix, float4( 0, 0, 0, 1 )) - Pos );

   // Propagate texture coordinates
   Out.Tex = Tex;

   // Propagate tangent, binormal, and normal vectors to pixel shader
   Out.Normal   = Normal;
   Out.Tangent  = Tangent;
   Out.Binormal = Binormal;

   return Out;
}
25. Multi-Tone Base Coat Pixel Shader
float4 main( float4 Diff     : COLOR0,
             float2 Tex      : TEXCOORD0,
             float3 Tangent  : TEXCOORD1,
             float3 Binormal : TEXCOORD2,
             float3 Normal   : TEXCOORD3,
             float3 View     : TEXCOORD4 ) : COLOR
{
   // Fetch the normal and expand it from 0..1 to -1..1
   float3 vNormal = tex2D( normalMap, Tex );
   vNormal = 2 * vNormal - 1.0;

   float3 vView = normalize( View );

   float3x3 mTangentToWorld = transpose( float3x3( Tangent, Binormal, Normal ));
   float3 vNormalWorld = normalize( mul( mTangentToWorld, vNormal ));

   float fNdotV   = saturate( dot( vNormalWorld, vView ));
   float fNdotVSq = fNdotV * fNdotV;

   float4 paintColor = fNdotV * paintColor0 +
                       fNdotVSq * paintColorMid +
                       fNdotVSq * fNdotVSq * paintColor2;

   return float4( paintColor.rgb, 1.0 );
}
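The view-dependent blend is easy to inspect outside the shader. This C sketch evaluates the same polynomial per channel. color3 and multi_tone are hypothetical names, and n_dot_v is assumed to be already saturated, as fNdotV is in the shader.

```c
/* Hypothetical RGB triple for this sketch. */
typedef struct { float r, g, b; } color3;

/* paintColor = NdotV * c0 + NdotV^2 * cMid + NdotV^4 * c2 */
static color3 multi_tone(float n_dot_v, color3 c0, color3 c_mid, color3 c2)
{
    float sq = n_dot_v * n_dot_v;   /* fNdotVSq */
    float quad = sq * sq;           /* fNdotVSq * fNdotVSq */
    color3 out;
    out.r = n_dot_v * c0.r + sq * c_mid.r + quad * c2.r;
    out.g = n_dot_v * c0.g + sq * c_mid.g + quad * c2.g;
    out.b = n_dot_v * c0.b + sq * c_mid.b + quad * c2.b;
    return out;
}
```

Higher powers fall off faster as N·V decreases. As a result, c0 dominates toward grazing angles, while all three colors contribute fully when the surface faces the viewer.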
26. Microflake Layer
27. Microflake Deposit Layer
- Simulates the light interaction resulting from metallic flakes suspended in the enamel coat of the paint
- Uses a high-frequency normalized vector noise map (Nn), which is repeated across the surface of the car
28. Computing Microflake Layer Normals
- Start out by using the normal vector fetched from the normal map, N
- Using the high-frequency noise map, compute the perturbed normal Np
- Simulate two layers of microflake deposits by computing perturbed normals Np1 and Np2:
  - $N_{p1} = a N_n + b N$, where $a \ll b$
  - $N_{p2} = c N_n + b N$, where $c = b$
29. Microflake Layer Vertex Shader
VS_OUTPUT main( float4 Pos      : POSITION,
                float3 Normal   : NORMAL,
                float2 Tex      : TEXCOORD0,
                float3 Tangent  : TANGENT,
                float3 Binormal : BINORMAL )
{
   VS_OUTPUT Out = (VS_OUTPUT) 0;

   // Propagate transformed position out
   Out.Pos = mul( view_proj_matrix, Pos );

   // Compute view vector
   Out.View = normalize( mul( inv_view_matrix, float4( 0, 0, 0, 1 )) - Pos );

   // Propagate texture coordinates
   Out.Tex = Tex;

   // Propagate tangent, binormal, and normal vectors to pixel shader
   Out.Normal   = Normal;
   Out.Tangent  = Tangent;
   Out.Binormal = Binormal;

   // Compute microflake tiling factor
   ...
}

The tiling step computes texture coordinates for accessing the noise map from the input texture coordinates and a tiling factor.
30. Microflake Layer Pixel Shader
float4 main( float4 Diff       : COLOR0,
             float2 Tex        : TEXCOORD0,
             float3 Tangent    : TEXCOORD1,
             float3 Binormal   : TEXCOORD2,
             float3 Normal     : TEXCOORD3,
             float3 View       : TEXCOORD4,
             float3 SparkleTex : TEXCOORD5 ) : COLOR
{
   // ... fetch and signed-scale the normal fetched from the normal map (vNormal)

   float3 vFlakesNormal = 2 * tex2D( microflakeNMap, SparkleTex ) - 1;

   float3 vNp1 = microflakePerturbationA * vFlakesNormal +
                 normalPerturbation * vNormal;
   float3 vNp2 = microflakePerturbation * ( vFlakesNormal + vNormal );

   float3 vView = normalize( View );

   float3x3 mTangentToWorld = transpose( float3x3( Tangent, Binormal, Normal ));

   float3 vNp1World = normalize( mul( mTangentToWorld, vNp1 ));
   float  fFresnel1 = saturate( dot( vNp1World, vView ));

   float3 vNp2World = normalize( mul( mTangentToWorld, vNp2 ));
   float  fFresnel2 = saturate( dot( vNp2World, vView ));

   float fFresnel1Sq = fFresnel1 * fFresnel1;

   float4 paintColor = fFresnel1 * flakeColor +
                       fFresnel1Sq * flakeColor +
                       fFresnel1Sq * fFresnel1Sq * flakeColor +
                       pow( fFresnel2, 16 ) * flakeColor;
   ...
}
31. Clear Gloss Coat
32. RGBScale HDR Environment Map
- Alpha channel contains 1/16 of the true HDR scale of the pixel value
- RGB contains the normalized color of the pixel
- Pixel shader reconstructs the HDR value as scale × 8 × color, which gives half of the true HDR value
- Obvious quantization issues, but reasonable for some applications
- Similar to Ward's RGBE "Real Pixels" but simpler to reconstruct in the pixel shader
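The encoding arithmetic can be sketched in C. The slides do not say how the original tool chose the normalizing scale. This sketch assumes the largest RGB channel is used and that HDR values stay within 0..16, so the stored alpha fits in 0..1. It also ignores 8-bit quantization. The type and function names are hypothetical.

```c
/* Hypothetical texel layout: normalized color in rgb, scale / 16 in a. */
typedef struct { float r, g, b, a; } rgba;

/* Encode an HDR color; assumes all channels lie in [0, 16]. */
static rgba rgbscale_encode(float r, float g, float b)
{
    float scale = r > g ? r : g;
    if (b > scale) scale = b;

    rgba t = {0.0f, 0.0f, 0.0f, 0.0f};
    if (scale <= 0.0f)
        return t;                 /* black: nothing to normalize */
    t.r = r / scale;
    t.g = g / scale;
    t.b = b / scale;
    t.a = scale / 16.0f;          /* alpha holds 1/16 of the true scale */
    return t;
}

/* Shader-style reconstruction: rgb * (alpha * 8), i.e. half the HDR value. */
static void rgbscale_decode_half(rgba t, float out[3])
{
    float s = t.a * 8.0f;
    out[0] = t.r * s;
    out[1] = t.g * s;
    out[2] = t.b * s;
}
```

Multiplying the decoded result by 2 recovers the full HDR value.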
33. Environment Map
Figure: ceiling of the car showroom, shown as the top cube map face (RGB) and the top face scale (alpha channel)
34. Dynamically Blurred Reflections
Figure: blurred reflections
35. Dynamic Blurring of Environment Map Reflections
- A gloss map can be supplied to specify the regions where reflections can be blurred
- Use a bias when sampling the environment map to vary the blurriness of the resulting reflections
  - Use texCUBEbias to access the cubic environment map
  - For rough specular, the bias is high, selecting a coarser mip level and causing a blurring effect
- Can also convert the color fetched from the environment map to luminance in rough trim areas
36. Clear Gloss Coat Pixel Shader
float4 ps_main( ... /* same inputs as in the previous shader */ ) : COLOR
{
   // ... use normal in world space (see Multi-tone pixel shader)

   // Compute reflection vector
   float fFresnel = saturate( dot( vNormalWorld, vView ));
   float3 vReflection = 2 * vNormalWorld * fFresnel - vView;

   float fEnvBias = glossLevel;

   // Sample environment map using this reflection vector and bias
   float4 envMap = texCUBEbias( showroomMap, float4( vReflection, fEnvBias ));

   // Premultiply by alpha
   envMap.rgb = envMap.rgb * envMap.a;

   // Brighten the environment map sampling result
   envMap.rgb *= brightnessFactor;
   ...
}
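The reflection vector in this shader is the standard mirror formula R = 2(N·V)N - V, with N·V saturated. A small C sketch follows. float3, saturate, and reflect_view are hypothetical names. Both inputs are assumed to be unit length, and V is assumed to point from the surface toward the viewer, as the View vector does in the vertex shaders above. The mip bias passed in the fourth component of texCUBEbias is not modeled.

```c
/* Hypothetical 3-component vector type for this sketch. */
typedef struct { float x, y, z; } float3;

static float saturate(float v) { return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v); }

/* R = 2 * saturate(N.V) * N - V, matching
   vReflection = 2 * vNormalWorld * fFresnel - vView */
static float3 reflect_view(float3 n, float3 v)
{
    float fresnel = saturate(n.x * v.x + n.y * v.y + n.z * v.z);
    float3 r = { 2.0f * fresnel * n.x - v.x,
                 2.0f * fresnel * n.y - v.y,
                 2.0f * fresnel * n.z - v.z };
    return r;
}
```

Saturating N·V matters only when the normal faces away from the viewer. In that case the clamp yields R = -V instead of a true mirror direction.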
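37. Compositing Multi-Tone Base Layer and Microflake Layer
- Base color and flake effect are derived from Np1 and Np2 using the following polynomial:

$color_0 (N_{p1} \cdot V) + color_1 (N_{p1} \cdot V)^2 + color_2 (N_{p1} \cdot V)^4 + color_3 (N_{p2} \cdot V)^{16}$

- The first three terms produce the base color; the last term produces the flake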
38. Compositing Final Look
...
// Compute final paint color: combines all layers of paint as well
// as two layers of microflakes
float fFresnel1Sq = fFresnel1 * fFresnel1;
float4 paintColor = fFresnel1 * paintColor0 +
                    fFresnel1Sq * paintColorMid +
                    fFresnel1Sq * fFresnel1Sq * paintColor2 +
                    pow( fFresnel2, 16 ) * flakeLayerColor;

// Combine result of environment map reflection with the paint color
float fEnvContribution = 1.0 - 0.5 * fNdotV;

// Assemble the final look
float4 finalColor;
finalColor.a   = 1.0;
finalColor.rgb = envMap.rgb * fEnvContribution + paintColor.rgb;

return finalColor;
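One channel of the final composite can be evaluated on the CPU to see how the layers add up. In this sketch, f1 and f2 stand for fFresnel1 and fFresnel2, and n_dot_v stands for fNdotV. The function names are hypothetical. The 16th power is computed by four squarings, which keeps the test arithmetic exact; this says nothing about how the compiler implements pow.

```c
/* x^16 by four squarings */
static float pow16(float x)
{
    x *= x;   /* x^2 */
    x *= x;   /* x^4 */
    x *= x;   /* x^8 */
    x *= x;   /* x^16 */
    return x;
}

/* paintColor for one channel: f1*c0 + f1^2*cMid + f1^4*c2 + f2^16*flake */
static float paint_channel(float f1, float f2, float c0, float c_mid,
                           float c2, float flake)
{
    float f1_sq = f1 * f1;
    return f1 * c0 + f1_sq * c_mid + f1_sq * f1_sq * c2 + pow16(f2) * flake;
}

/* finalColor.rgb = envMap * (1 - 0.5 * NdotV) + paintColor */
static float final_channel(float env, float n_dot_v, float paint)
{
    float env_contribution = 1.0f - 0.5f * n_dot_v;
    return env * env_contribution + paint;
}
```

The environment term receives full weight at grazing angles (N·V near 0) and half weight head-on.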
39. Original Hand-Tuned Assembly
40. Car Paint Shader: HLSL Compiler Disassembly Output
41. Full Result of Multi-Layer Paint
42. Translucent Iridescent Shader: Butterfly Wings
43. Translucent Iridescent Shader: Butterfly Wings
- Simulates the translucency of delicate butterfly wings
  - Wings glow from scattered reflected light
  - Similar to the effect of softly backlit rice paper
- Displays subtle iridescent lighting
  - Similar to the rainbow pattern on the surface of soap bubbles
  - Caused by the interference of light waves resulting from multiple reflections of light off surfaces of varying thickness
- Combines gloss, opacity, and normal maps for a multi-layered final look
  - Gloss map contributes to satiny highlights
  - Opacity map allows portions of the wings to be transparent
  - Normal map is used to give the wings a bump-mapped look
44. RenderMonkey Butterfly Wings Shader Example
- Parameters that contribute to the translucency and iridescence look:
  - Light position and scene ambient color
  - Translucency coefficient
  - Gloss scale and bias
  - Scale and bias for the speed of iridescence change
- Workspace: Iridescent Butterfly.rfx
45. Translucent Iridescent Shader: Vertex Shader
...
// Propagate input texture coordinates
Out.Tex = Tex;

// Define tangent space matrix
float3x3 mTangentSpace;
mTangentSpace[0] = Tangent;
mTangentSpace[1] = Binormal;
mTangentSpace[2] = Normal;

// Compute the light vector (object space)
float3 vLight = normalize( mul( inv_view_matrix, lightPos ) - Pos );

// Output light vector in tangent space
Out.Light = mul( mTangentSpace, vLight );

// Compute the view vector (object space)
float3 vView = normalize( mul( inv_view_matrix, float4( 0, 0, 0, 1 )) - Pos );
46. Translucent Iridescent Shader: Loading Information
float3 vNormal, baseColor;
float  fGloss, fTranslucency;

// Load normal and gloss map
float4( vNormal, fGloss ) = tex2D( bump_glossMap, Tex );

// Load base and opacity map
float4( baseColor, fTranslucency ) = tex2D( base_opacityMap, Tex );
47. Diffuse Illumination For Translucency
float3 scatteredIllumination = saturate( dot( -vNormal, Light )) *
                               fTranslucency * translucencyCoeff;

float3 diffuseContribution = saturate( dot( vNormal, Light )) + ambient;

baseColor *= scatteredIllumination + diffuseContribution;
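A per-channel C sketch of this lighting follows. It assumes the last statement combines the two terms as baseColor *= scatteredIllumination + diffuseContribution. The float3 type and the function names are hypothetical, and Light is assumed to be the normalized tangent-space light vector.

```c
/* Hypothetical 3-component vector type for this sketch. */
typedef struct { float x, y, z; } float3;

static float saturate(float v) { return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v); }
static float dot3(float3 a, float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

/* One color channel of the wing lighting. */
static float wing_lighting(float base, float3 n, float3 light, float translucency,
                           float translucency_coeff, float ambient)
{
    float3 back = { -n.x, -n.y, -n.z };
    /* Light arriving from behind the wing scatters through it */
    float scattered = saturate(dot3(back, light)) * translucency * translucency_coeff;
    /* Ordinary front-side diffuse plus ambient */
    float diffuse = saturate(dot3(n, light)) + ambient;
    return base * (scattered + diffuse);
}
```

A light directly behind the wing contributes only through the translucency term. A light in front contributes only through the diffuse term.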
48. Adding Opacity to Butterfly Wings
- The resulting color is modulated by the opacity value to add transparency to the wings

// Premultiply alpha blend to avoid clamping the highlights
baseColor *= fOpacity;
49. Making Butterfly Wings Iridescent
// Compute index into the iridescence gradient map,
// which consists of N·V coefficient
float fGradientIndex = dot( vNormal, View ) * iridescence_speed_scale +
                       iridescence_speed_bias;

// Load the iridescence value from the gradient map
float4 iridescence = tex1D( gradientMap, fGradientIndex );
50. Assembling Final Color
// Compute glossy highlights using values from gloss map
float fGlossValue = fGloss * ( saturate( dot( vNormal, Half )) * gloss_scale +
                               gloss_bias );

// Assemble the final color for the wings
baseColor += fGlossValue * iridescence.rgb;
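The scalar math of the last two steps is sketched in C below. It assumes the final statement adds the gloss-weighted iridescence to the base color (baseColor += fGlossValue * iridescence). It leaves out the tex1D gradient lookup and uses hypothetical function names. The scale and bias parameters correspond to iridescence_speed_scale/bias and gloss_scale/bias.

```c
static float saturate(float v) { return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v); }

/* Index into the iridescence gradient map: N.V, scaled and biased. */
static float gradient_index(float n_dot_v, float speed_scale, float speed_bias)
{
    return n_dot_v * speed_scale + speed_bias;
}

/* Glossy highlight: gloss-map value times scaled/biased saturate(N.H). */
static float gloss_value(float gloss, float n_dot_h, float gloss_scale, float gloss_bias)
{
    return gloss * (saturate(n_dot_h) * gloss_scale + gloss_bias);
}

/* One channel of the final wing color. */
static float wing_color(float base, float gloss_val, float iridescence)
{
    return base + gloss_val * iridescence;
}
```

An index outside 0..1 is resolved by the sampler's addressing mode, which these slides do not specify.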
51. HLSL Disassembly Comparison
- 12 ALU + 3 texture = 15 total
- 15 ALU + 3 texture = 18 total
52. Example of Translucent Iridescent Shader
53. Optimization Study: Überlight
- Flexible light described in the JGT article "Lighting Controls for Computer Cinematography" by Ronen Barzel of Pixar
- Überlight is procedural and has many controls:
  - light type, intensity, light color, cuton, cutoff, near edge, far edge, falloff, falloff distance, max intensity, parallel rays, shearx, sheary, width, height, width edge, height edge, roundness, and beam distribution
- Code here is based upon the public-domain RenderMan implementation by Larry Gritz
54. Überlight Spotlight Mode
- Spotlight mode defines a procedural volume with smooth boundaries
- The shape of the spotlight is made up of two nested superellipses, which are swept along the direction of the light
- Also has smooth cuton and cutoff planes
- Can tune parameters to get all sorts of looks
55. Überlight Spotlight Volume
Figure: spotlight volume with roundness = ½
56. Überlight Spotlight Volume
Figure: inner swept superellipse (dimensions a, b) nested inside the outer swept superellipse (dimensions A, B), roundness = 1
57. Original clipSuperellipse() routine
- Computes attenuation as a function of a point's position in the swept superellipse
- Directly ported from the original RenderMan source
- Compiles to 42 cycles in ps_2_0, 40 cycles on R3x0

float clipSuperellipse (
    float3 Q,          // Test point on the x-y plane
    float  a,          // Inner superellipse
    float  b,
    float  A,          // Outer superellipse
    float  B,
    float  roundness ) // Same roundness for both ellipses
{
    float x = abs( Q.x ), y = abs( Q.y );

    float re = 2 / roundness;   // roundness exponent

    float q = a * b * pow( pow( b * x, re ) + pow( a * y, re ), -1 / re );
    float r = A * B * pow( pow( B * x, re ) + pow( A * y, re ), -1 / re );

    return smoothstep( q, r, 1 );
}
58. Vectorized Version
- Precompute functions of roundness in the app
- Vectorize abs() and all of the multiplications
- Compiles to 33 cycles in ps_2_0, 28 cycles on R3x0

float clipSuperellipse (
    float2 Q,      // Test point on the x-y plane
    float4 aABb,   // Dimensions of superellipses
    float2 r )     // Two precomputed functions of roundness: (2/roundness, -roundness/2)
{
    float2 qr, Qabs = abs( Q );

    float2 bx_Bx = Qabs.x * aABb.wzyx;   // Swizzle to unpack bB
    float2 ay_Ay = Qabs.y * aABb;

    qr.x = pow( pow( bx_Bx.x, r.x ) + pow( ay_Ay.x, r.x ), r.y );
    qr.y = pow( pow( bx_Bx.y, r.x ) + pow( ay_Ay.y, r.x ), r.y );

    qr *= aABb * aABb.wzyx;

    return smoothstep( qr.x, qr.y, 1 );
}
59. smoothstep() function
- Standard function in procedural shading
- Intrinsic built into both RenderMan and DirectX HLSL
Figure: smoothstep curve rising smoothly from 0 at edge0 to 1 at edge1
60. C implementation
float smoothstep (float edge0, float edge1, float x)
{
    if (x < edge0)
        return 0;

    if (x > edge1)
        return 1;

    // Scale/bias into 0..1 range
    x = (x - edge0) / (edge1 - edge0);

    return x * x * (3 - 2 * x);
}
61. HLSL implementation
- saturate is free here (it maps to the _sat instruction modifier) and handles any x outside the edge0..edge1 range, so no comparisons are needed

float smoothstep (float edge0, float edge1, float x)
{
    // Scale, bias and saturate x to 0..1 range
    x = saturate( (x - edge0) / (edge1 - edge0) );

    // Evaluate polynomial
    return x * x * (3 - 2 * x);
}
62. Vectorized HLSL Implementation
- Precompute 1/(edge1 - edge0)
  - Done in the app for the edge widths at the cuton and cutoff planes
- The operation is performed on float3s to compute three different smoothstep operations in parallel
- With these optimizations, the entire spotlight volume computation of Überlight compiles to 47 cycles in ps_2_0, 41 cycles on R3x0

float3 smoothstep3 (float3 edge, float3 OneOverWidth, float3 x)
{
    // Scale, bias and saturate x to 0..1 range
    x = saturate( (x - edge) * OneOverWidth );

    // Evaluate polynomial
    return x * x * (3 - 2 * x);
}
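The vectorized form is easy to check against a scalar reference on the CPU. In this C sketch, float3, saturate, hermite, and smoothstep1 are hypothetical helpers. The application is assumed to have precomputed one_over_width as 1/(edge1 - edge0), as the slide describes. Multiplying by a precomputed reciprocal can differ from a true divide in the last bit when a width is not a power of two.

```c
/* Hypothetical 3-component vector type for this sketch. */
typedef struct { float x, y, z; } float3;

static float saturate(float v) { return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v); }

/* Scalar reference: HLSL-style smoothstep built on saturate. */
static float smoothstep1(float edge0, float edge1, float x)
{
    float t = saturate((x - edge0) / (edge1 - edge0));
    return t * t * (3.0f - 2.0f * t);
}

/* Hermite polynomial for an already-saturated t. */
static float hermite(float t) { return t * t * (3.0f - 2.0f * t); }

/* Three smoothsteps at once; the divide becomes a multiply by the
   precomputed reciprocal width. */
static float3 smoothstep3(float3 edge, float3 one_over_width, float3 x)
{
    float3 out = {
        hermite(saturate((x.x - edge.x) * one_over_width.x)),
        hermite(saturate((x.y - edge.y) * one_over_width.y)),
        hermite(saturate((x.z - edge.z) * one_over_width.z))
    };
    return out;
}
```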
63. Summary
- Writing optimal HLSL code
  - Compiling issues
  - Optimization strategies
  - Code structure pointers
- Shader Examples
  - Shipped with RenderMonkey version 1.0; see www.ati.com/developer
    - MultiTone Car Paint.rfx
    - Iridescent Butterfly.rfx