Title: Insomniac
1Insomniac's SPU Best Practices: Hard-won lessons
we're applying to a 3rd generation PS3 title
Mike Acton, Eric Christensen. GDC 08
2Introduction
- What will be covered...
- Understanding SPU programming
- Designing systems for the SPUs
- SPU optimization tips
- Breaking the 256K barrier
- Looking to the future
3Introduction
- Isn't it harder to program for the SPUs?
- No.
- Classical optimization techniques still apply
- Perhaps even more so than on other architectures.
- e.g. In-order processing means a predictable pipeline, which means easier to optimize.
- Both at instruction-level and multi-processing level.
4Introduction
- Multi-processing is not new
- Trouble with the SPUs usually is just trouble with multi-core.
- You can't wish multi-core programming away. It's part of the job.
5Introduction
- But isn't programming for the SPUs different?
- The SPU is not a magical beast only tamed by wizards.
- It's just a CPU
- Get your feet wet. Code something.
- Highly recommend Linux on the PS3!
6Introduction
- Seriously though. It's not the same, right?
- Not the same if you've been sucked into one of
the three big lies of software development...
7Introduction
- The "software as a platform" lie.
- The "domain-model design" lie.
- The "code design is more important than data design" lie
- ... The real difficulty is unlearning these mindsets.
8Introduction
- But what's changed?
- Old model
- Big semi truck. Stuff everything in. Then stuff some more. Then put some stuff up front. Then drive away.
- New model
- Fleet of Ford GTs taking off every five minutes. Each one only fits so much. Bucket brigade. Damn they're fast!
9Introduction
- But what about special code management?
- Yes, you need to upload the code.
- So what? Something needs to load the code on
every CPU.
10Introduction
- But what about DMA'ing data?
- Yes, you need to use a DMA controller to move the data around.
- Not really different from calling memcpy
11SPU DMA vs. PPU memcpy
SPU DMA from main ram to local store
  wrch ch16, ls_addr
  wrch ch18, main_addr
  wrch ch19, size
  wrch ch20, dma_tag
  il 2, MFC_GET_CMD
  wrch ch21, 2
PPU memcpy from far ram to near ram
  mr 3, near_addr
  mr 4, far_addr
  mr 5, size
  bl memcpy
SPU DMA from local store to main ram
  wrch ch16, ls_addr
  wrch ch18, main_addr
  wrch ch19, size
  wrch ch20, dma_tag
  il 2, MFC_PUT_CMD
  wrch ch21, 2
PPU memcpy from near ram to far ram
  mr 4, near_addr
  mr 3, far_addr
  mr 5, size
  bl memcpy
Conclusion: If you can call memcpy, you can DMA data.
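To make the comparison concrete, here is a minimal host-side sketch in C that simulates a DMA get and put with memcpy. The names sim_dma_get, sim_dma_put and sim_dma_wait are hypothetical stand-ins, not an SDK API; on real hardware the channel writes would do the work. The only point is that the arguments line up with memcpy, plus a tag.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Host-side simulation only; these names are hypothetical, not an SDK API. */
#define LS_SIZE  (256u * 1024u)
#define MAX_TAGS 32u

static uint8_t local_store[LS_SIZE];   /* stands in for SPU local store */

/* GET: main RAM -> local store. Same arguments as memcpy, plus a tag. */
static void sim_dma_get(uint32_t ls_addr, const void *main_addr,
                        uint32_t size, uint32_t tag)
{
    assert(tag < MAX_TAGS);
    assert(ls_addr <= LS_SIZE && size <= LS_SIZE - ls_addr);
    memcpy(&local_store[ls_addr], main_addr, size);
}

/* PUT: local store -> main RAM. */
static void sim_dma_put(uint32_t ls_addr, void *main_addr,
                        uint32_t size, uint32_t tag)
{
    assert(tag < MAX_TAGS);
    assert(ls_addr <= LS_SIZE && size <= LS_SIZE - ls_addr);
    memcpy(main_addr, &local_store[ls_addr], size);
}

/* On hardware this would block until every transfer with this tag is done.
   The simulated copies are synchronous, so there is nothing to wait for. */
static void sim_dma_wait(uint32_t tag)
{
    assert(tag < MAX_TAGS);
}
```

The difference from memcpy is the tag and the separate wait, which is what lets other work happen between issuing the transfer and using the data.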
12Introduction
- But what about DMA'ing data?
- But with more control over how and when it's sent and retrieved.
13SPU Synchronization
Example Sync
DMA from main ram to local store
Do other productive work while DMA is happening...
(Sync) Wait for DMA to complete
  il 2, 1
  shl 2, 2, dma_tag
  wrch ch22, 2
  il 3, MFC_TAG_UPDATE_ALL
  wrch ch23, 3
  rdch 2, ch24
Fence: Transfer after previous with the same tag
  PUTF   Transfer previous before this PUT
  PUTLF  Transfer previous before this PUT LIST
  GETF   Transfer previous before this GET
  GETLF  Transfer previous before this GET LIST
Barrier: Transfer after previous and before next with the same tag
  PUTB   Fixed order with respect to this PUT
  PUTLB  Fixed order with respect to this PUT LIST
  GETB   Fixed order with respect to this GET
  GETLB  Fixed order with respect to this GET LIST
Lock Line Reservation
  GETLLAR  Gets locked line. (PPU lwarx, ldarx)
  PUTLLC   Puts locked line. (PPU stwcx, stdcx)
14Introduction
- Bottom line: SPUs are like most CPUs
- Basics are pretty much the same.
- Good data design decisions and smart code choices see benefits on any platform
- Good DMA pattern also means cache coherency. Better on every platform
- Bad choices may work on some, but not others.
- Xbox 360, PC, Wii, DS, PSP, whatever.
15Introduction
- And that's what we're talking about today.
- Trying to apply smart choices to these particular CPUs for our games.
- That's what console development is.
- What mistakes we've made along the way.
- What's worked best.
16Understanding the SPUs
- Rule 1: The SPU is not a co-processor!
- Don't think of SPUs as hiding time behind a main PPU loop
17Understanding the SPUs
- What clicked with some Insomniacs about the SPUs
- Everything is local
- Think streams of data
- Forget conventional OOP
- Everything is a quadword
- si intrinsics make things clearer
- Local memory is really, really fast
18Designing for the SPUs
- The ultimate goal: Get everything on the SPUs.
- Leave the PPU for shuffling stuff around.
- Complex systems can go on the SPUs
- Not just streaming systems
- Used for any kind of task
- But you do need to consider some things...
19Designing for the SPUs
- Data comes first.
- Goal is minimum energy for transformation.
- What is energy usage? CPU time. Memory read/write
time. Stall time.
Input -> Transform() -> Output
20Designing for the SPUs
- Design the transformation pipeline back to front.
- Start with your destination data and work backward.
- Changes are inevitable. This way you pay less for them.
- An example...
21Front to Back
- Simulate Glass: Started here. Had a really nice looking simulation, but would find out soon that this stage was worthless.
- Generate Crack Geometry (igTriangulate): Then wrote igTriangulate. Oops, the only possible output didn't support the glamorous crack rendering.
- Render: The rendering part of the pipeline didn't completely support the outputs of the triangulation library.
- Even worse, the inputs that were being provided to the triangulation library weren't adequate. Needed more information about retaining surface features.
- Realized that the level of detail from the simulation wasn't necessary, considering that the granularity restrictions (memory, CPU) could not support it.
Back to Front
- Render: Rendered dynamic geometry using fake mesh data.
- igTriangulate: Faked inputs to triangulate and output transformed data to the render stage.
- Simulate Glass: Wrote the simulation to provide useful (and expected) results to the triangulation library.
- Could have avoided re-writing the simulation if the design process was done in the correct order.
- Good looking results were arrived at with a much smaller processing and memory impact.
- Full simulation turned out to be unnecessary since its outputs weren't realistic considering the restrictions of the final stage.
- Proof that "code as you design" can be disastrous.
- Working from back to front forces you to think about your pipeline in advance. It's easier to fix problems that live in front of final code. Wildly scattered fixes and data format changes will only end in sorrow.
22Designing for the SPUs
- The data the SPUs will transform is the canonical data.
- i.e. Store the data in the best format for the case that takes the most resources.
23Designing for the SPUs
- Minimize synchronization
- Start with the smallest synchronization method
possible.
24Designing for the SPUs
- Simplest method is usually lock-free single
reader, single writer queue.
25SPU Ordered Write / PPU Ordered Write
- SPU Ordered Write: Write Data, then Increment Index (with Fence)
- PPU Ordered Write: Write Data, then lwsync, then Increment Index
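The ordered-write pattern can be sketched on a host machine with C11 atomics. This is a minimal single-reader, single-writer queue under stated assumptions: the names spsc_queue_t, spsc_push and spsc_pop are hypothetical, and the release store on the index stands in for the lwsync (PPU) or fenced DMA (SPU) that orders the data write before the index increment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_CAPACITY 8u
static_assert((QUEUE_CAPACITY & (QUEUE_CAPACITY - 1u)) == 0,
              "capacity must be a power of two");

typedef struct {
    uint32_t    data[QUEUE_CAPACITY];
    atomic_uint write_index;   /* stored only by the writer */
    atomic_uint read_index;    /* stored only by the reader */
} spsc_queue_t;

static void spsc_init(spsc_queue_t *q)
{
    atomic_init(&q->write_index, 0u);
    atomic_init(&q->read_index, 0u);
}

/* Writer side: write the data first, then publish by incrementing the index. */
static bool spsc_push(spsc_queue_t *q, uint32_t value)
{
    unsigned w = atomic_load_explicit(&q->write_index, memory_order_relaxed);
    unsigned r = atomic_load_explicit(&q->read_index, memory_order_acquire);
    if (w - r == QUEUE_CAPACITY)
        return false;                               /* full */
    q->data[w % QUEUE_CAPACITY] = value;            /* 1. write data */
    atomic_store_explicit(&q->write_index, w + 1u,  /* 2. ordered index update */
                          memory_order_release);
    return true;
}

/* Reader side: only reads slots the writer has already published. */
static bool spsc_pop(spsc_queue_t *q, uint32_t *out)
{
    unsigned r = atomic_load_explicit(&q->read_index, memory_order_relaxed);
    unsigned w = atomic_load_explicit(&q->write_index, memory_order_acquire);
    if (r == w)
        return false;                               /* empty */
    *out = q->data[r % QUEUE_CAPACITY];
    atomic_store_explicit(&q->read_index, r + 1u, memory_order_release);
    return true;
}
```

Because each index has exactly one writer, no lock or compare-and-swap is needed; the only requirement is that the data is visible before the index that announces it.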
26Designing for the SPUs
- Fairly straightforward to load balance
- For constant time transforms, just divide into multiple queues
- For other transforms, use a heuristic to decide times and a single entry queue to distribute to multiple queues.
27Designing for the SPUs
- Then work your way up.
- Is there a pre-existing sync point that will work? (e.g. vsync)
- Can you split your data into need-to-sync and don't-care?
28Resistance: Fall of Man: Immediate Effect Updates Only
PPU
- Update Game Objects
- Run Immediate Effect Updates
- Finish Frame Update, Start Rendering
- Sync Immediate Effect Updates
- Generate Push Buffer To Render Frame
- Generate Push Buffer To Render Effects
- Finish Push Buffer Setup
SPU
- Immediate Update
Resistance 2: Immediate and Deferred Effect Updates, Reduced Sync Points
PPU
- Sync Immediate Updates For Last Frame
- Run Deferred Effect Update/Render
- Update Game Objects
- Sync Deferred Updates
- Post Update Game Objects
- Run Effects System Manager
- Finish Frame Update, Start Rendering
- Sync Effect System Manager
- Run Immediate Effect Update/Render
- Generate Push Buffer To Render Frame
- Finish Push Buffer Setup
SPU
- Deferred Update Render
- System Manager
- Immediate Update Render (can run past end of PPU frame due to reduced sync points)
29Resistance: Fall of Man: Immediate Effect Updates Only
Legend: PPU time overlapping effects SPU time; PPU time spent on effect system; PPU time that cannot be overlapped
PPU
SPU
Update Game Objects
No effects can be updated till all game objects
have updated so attachments do not lag.
Visibility and LOD culling done on PPU before
creating jobs.
Run Immediate Effect Updates
Each effect is a separate SPU job
Immediate Update
Effect updates running on all available SPUs (four)
Finish Frame Update Start Rendering
Likely to stall here, due to limited window in which to update all effects.
Sync Immediate Effect Updates
Generate Push Buffer To Render Frame
Generate Push Buffer To Render Effects
The number of effects that could render was limited by available PPU time to generate their PBs.
Finish Push Buffer Setup
30Resistance 2: Immediate and Deferred Effect Updates, Reduced Sync Points
Legend: PPU time overlapping effects SPU time; PPU time spent on effect system; PPU time that cannot be overlapped
PPU
SPU
Sync Immediate Updates For Last Frame
Run Deferred Effect Update/Render
Initial PB allocations done on PPU. Single SPU job for each SPU (anywhere from one to three)
Deferred Update Render
Huge amount of previously unused SPU processing
time available.
Update Game Objects
Deferred effects are one frame behind, so
effects attached to moving objects usually should
not be deferred.
Sync Deferred Updates
Post Update Game Objects
SPU manager handles all visibility and LOD
culling previously done on the PPU.
Run Effects System Manager
System Manager
Finish Frame Update Start Rendering
Generates lists of instances for update jobs to
process.
Sync Effect System Manager
Immediate updates are allowed to run till the beginning of the next frame, as they do not need to sync to finish generating this frame's PB
Run Immediate Effect Update/Render
Initial PB allocations done on PPU. Single SPU job for each SPU (anywhere from one to three)
Immediate Update Render (can run past end of PPU frame due to reduced sync points)
Generate Push Buffer To Render Frame
Doing the initial PB alloc on the PPU eliminates
need to sync SPU updates before generating full
PB.
Smaller window available to update immediate
effects, so only effects attached to moving
objects should be immediate.
Finish Push Buffer Setup
31Designing for the SPUs
- Write optimizable code.
- Often optimized code can wait a bit.
- Simple, self-contained loops
- Over as many iterations as possible
- No branches
32Designing for the SPUs
- Transitioning from "legacy" systems...
- We're not immune to design problems
- Schedule, manpower, education, and experience all
play a part.
33Designing for the SPUs
- Example from RCF...
- FastPathFollowers C++ class
- And its derived classes
- Running on the PPU
- Typical Update() method
- Derived from a root class of all updatable types
34Designing for the SPUs
- Where did this go wrong?
- What rules were broken?
- Used domain-model design
- Code design over data design
- No advantage of scale
- No synchronization design
- No cache consideration
35Designing for the SPUs
- Result
- Typical performance issues
- Cache misses
- Unnecessary transformations
- Didn't scale well
- Problems after a few hundred updating
36Designing for the SPUs
- Step 1: Group the data together
- Where there's one, there's more than one.
- Before the update() loop was called,
- Intercepted all FastPathFollowers and derived classes and removed them from the update list.
- Then kept in a separate array.
37Designing for the SPUs
- Step 1: Group the data together
- Created new function, UpdateFastPathFollowers()
- Used the new list of same type of data
- Generic Update() no longer used
- (Ignored derived class behaviors here.)
38Designing for the SPUs
- Step 2: Organize Inputs and Outputs
- Define what's read, what's written.
- Inputs: Position, Time, State, Results of queries, Paths
- Outputs: Position, State, Queries, Animation
- Read inputs. Transform to Outputs.
- Nothing more complex than that.
39Designing for the SPUs
- Step 3: Reduce Synchronization Points
- Collected all outputs together
- Collected any external function calls together into a command buffer
- Separate Query and Query-Result
- Effectively a Queue between systems
- Reduced from many sync points per object to one sync point for the system
40Designing for the SPUs
- Before Pattern
- Loop Objects
- Read Input 0
- Update 0
- Write Output
- Read Input 1
- Update 1
- Call External Function
- Block (Sync)
41Designing for the SPUs
- After Pattern (Simplified)
- Loop Objects
- Read Input 0, 1
- Update 0, 1
- Write Output, Function to Queue
- Block (Sync)
- Empty (Execute) Queue
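The "after" pattern can be sketched in C as an update loop that only appends commands to a queue, followed by one pass that executes them. Everything named here (follower_t, command_queue_t, play_animation, the 1.0 threshold) is a hypothetical illustration, not the actual FastPathFollowers code.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float position; float velocity; } follower_t;

typedef void (*command_fn)(int object_index);
typedef struct { command_fn fn; int object_index; } command_t;

enum { MAX_COMMANDS = 64 };
typedef struct { command_t items[MAX_COMMANDS]; size_t count; } command_queue_t;

static int g_external_calls;   /* counts calls into the "external" system */

/* Stand-in for an external function that used to be called mid-update. */
static void play_animation(int object_index)
{
    (void)object_index;
    g_external_calls++;
}

/* Loop objects: read inputs, update, write outputs, queue external calls. */
static void update_followers(follower_t *objs, size_t n, float dt,
                             command_queue_t *q)
{
    for (size_t i = 0; i < n; ++i) {
        objs[i].position += objs[i].velocity * dt;
        if (objs[i].position >= 1.0f) {          /* reached end of path */
            assert(q->count < MAX_COMMANDS);
            q->items[q->count].fn = play_animation;
            q->items[q->count].object_index = (int)i;
            q->count++;
        }
    }
}

/* Single sync point for the whole system: empty (execute) the queue. */
static void execute_queue(command_queue_t *q)
{
    for (size_t i = 0; i < q->count; ++i)
        q->items[i].fn(q->items[i].object_index);
    q->count = 0;
}
```

The update loop no longer touches anything outside its own inputs and outputs, which is what later allows it to move to another thread or to an SPU.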
42Designing for the SPUs
- Next: Added derived-class functionality
- Similarly simplified derived-class Update() functions into functions with clear inputs and outputs.
- Added functions to deferred queue as any other function.
- Advantage: Can limit derived functionality based on count, LOD, etc.
43Designing for the SPUs
- Step 4: Move to PPU thread
- Now system update has no external dependencies
- Now system update has no conflicting data areas (with other systems)
- Now system update does not call non-re-entrant functions
- Simply put in another thread
44Designing for the SPUs
- Step 4: Move to PPU thread
- Add literal sync between system update and queue execution
- Sync can be removed because there is only a single reader and a single writer to the data
- Queue can be emptied while being filled without collision
- See R&D page on multi-threaded optimization
45Designing for the SPUs
- Step 5: Move to SPU
- Now completely independent thread
- Can be run anytime
- Prototype for new SPU system
- AsyncMobyUpdate
- Using SPU Shaders
46Designing for the SPUs
- Transitioning from SPU as coprocessor model.
- Example: igPhysics from Resistance to now...
47Resistance: Fall of Man Physics Pipeline
PPU / SPU Execution
Environment Pre-Update (Resolve AnimIK)
AABB Tests
Triangle Intersection
Environment Update
Sphere, Capsule, etc.
Pack contact points
Note: One Job Per Object (box, ragdoll, etc.)
Collision Update (Start Coll Jobs while building)
Collide Prims (generate contacts)
Sync Collision Jobs and Process Contact Points
Blocked!
Associate Rigid Bodies Through Constraints
Unpack Constraints
Generate Jacobian Data
Solve Constraints
Package Rigid Body Pools (Start SPU Jobs while packing)
Simulate
Pack Rigid Body Data
Sync Sim Jobs and Process Rigid Body Data
The only time hidden between start and stop of jobs is the packing of job data. The only other savings come from merely running the jobs on the SPU.
Blocked!
Post Update (Transform Anim Joints)
48PPU
SPU
Execution
Resistance 2 Physics Pipeline
Environment Update
Upload Tri-Cache
Upload Object Cache
Upload RB Prims
Upload Intersect Funcs
Triangle Cache Update
Intersection Tests
For Each Iteration
Upload CO Prims
Object Cache Update
Collide Triangles
Upload Intersect Funcs
Intersection Tests
Start Physics Jobs
Collide Primitives
Sort Joint Types
Per Joint Type Upload Jacobian Generation Code
Upload Physics Joints
PPU Work
Build Simulation Pools
Calculate Jacobian Data
Solve Constraints
Upload Solver Code
Integrate
For Each Physics Object Upload Anim Joints
Sync Physics Jobs
Simulate Pools
Transform Anim Joints Using Rigid Body Data
Post Update
Update Rigid Bodies
Send Update To PPU
49Optimizing for SPUs
- Instruction-level optimizations are similar to any other platform
- i.e. Look at the instruction set and write code that takes advantage of it.
50Optimizing for SPUs
- Memory transfer optimizations are similar to any other platform
- i.e. Organize data for line-length and coherency. Separate read and write buffers wherever possible.
- DMA is exactly like cache pre-fetch
51Optimizing for SPUs
- Local memory optimizations are similar to any other platform
- i.e. Have a fixed-size buffer, split it into smaller buffers for input, output, temporary data and code.
- Organizing 256K is essentially the same process as organizing 256M
52Optimizing for SPUs
- Memory layout
- Memory is dedicated to your code.
- Memory is local to your code.
- Design so you know what will read and write to the memory
- i.e. DMAs from PPU, other SPUs, etc.
- Generally fairly straightforward.
- Remember you can use an offline tool to lay out your memory if you want.
53Optimizing for SPUs
- Memory layout
- But never, ever try to use a dynamic memory allocator.
- Malloc for dedicated 256K would be ridiculous.
- OK. Malloc in a console game would be ridiculous.
54Optimizing for SPUs
- Memory layout
- Rules of thumb
- Organize everything into blocks of 16 bytes (one quadword).
- SPU reads/writes only 16 bytes at a time
- Group same fields together
- No single object data
- Similar to most SIMD.
- Similar to GPUs.
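As a rough illustration of these rules of thumb, here is a small C sketch of a structure-of-arrays layout with every field array aligned to 16 bytes. The type particles_soa_t and its fields are hypothetical; the inner four-lane loop only mimics what a single quadword instruction would do.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

enum { PARTICLE_COUNT = 8 };   /* multiple of 4: whole quadwords only */

/* Same fields grouped together; each array starts on a 16-byte boundary. */
typedef struct {
    alignas(16) float x[PARTICLE_COUNT];
    alignas(16) float y[PARTICLE_COUNT];
    alignas(16) float life[PARTICLE_COUNT];
} particles_soa_t;

/* Processes one quadword (four floats) of the same field per outer step. */
static void age_particles(particles_soa_t *p, float dt)
{
    assert(((uintptr_t)p->life % 16u) == 0);
    for (size_t i = 0; i < PARTICLE_COUNT; i += 4) {
        for (size_t lane = 0; lane < 4; ++lane)
            p->life[i + lane] -= dt;
    }
}
```

With this layout there is no "single object" record to load; every load brings in four values that all need the same operation.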
55Optimizing for SPUs
- Memory transfer
- Usually pretty straightforward
- Rules of thumb
- Keep everything 128-byte aligned
- Nothing different. Same rule as the PPU. (Cache line is 128 bytes)
- Transfer as much data as possible together. Transform together.
- Nothing different. Same rule as the PPU. (For cache coherency)
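A minimal C sketch of the alignment arithmetic, assuming 32-bit addresses and power-of-two alignments. The helper names align_up and is_aligned are illustrative, not from any SDK.

```c
#include <assert.h>
#include <stdint.h>

#define DMA_ALIGN 128u   /* bytes; matches the PPU cache line */

/* Rounds value up to the next multiple of alignment (a power of two). */
static uint32_t align_up(uint32_t value, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1u)) == 0);
    return (value + alignment - 1u) & ~(alignment - 1u);
}

/* Returns nonzero when addr is a multiple of alignment (a power of two). */
static int is_aligned(uint32_t addr, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1u)) == 0);
    return (addr & (alignment - 1u)) == 0;
}
```

Padding buffer sizes with align_up(size, DMA_ALIGN) at layout time keeps every transfer on whole 128-byte blocks.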
56Optimizing for SPUs
- Memory transfer
- Let's dig in to these rules of thumb a bit...
- Shared alignment between main ram and SPU local memory is going to be faster. (So pick an alignment and stick with it.)
- Transfer is done in 128-byte blocks, so alignment isn't strictly necessary (but no worries about the above if it is)
57Optimizing for SPUs
- Number of transfers doesn't really matter (re: biggest transfers possible) but...
- You want to transfer 128-byte blocks, not scattered data.
- You want to minimize synchronization (sync on fewer DMA tags)
- You have fewer places to worry about alignment.
- You want to minimize scatter/gather. Especially considering TLB misses.
58Optimizing for SPUs
- Memory transfer
- Rules of thumb
- If scattered reads, writes are necessary, use DMA list (not individual DMAs)
- Advantage over PPU. PPU can't do out-of-order, grouped memory transfer.
- Keeps predictability of in-order execution with performance of out-of-order memory transfer.
59Optimizing for SPUs
- Speaking of out-of-order transfers...
- Use DMA fence to dictate order
- Reads and writes are interleaved,
- If you need max transfer performance, issue them
separately.
60Optimizing for SPUs
- Memory transfer
- Double, Triple buffer optimization
- (Show fence example)
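A host-side sketch of double buffering in C, assuming a hypothetical sim_fetch that stands in for an asynchronous DMA get. On hardware the fetch would return immediately and a tag wait would precede the processing of each buffer; here the copy is synchronous, so only the structure of the loop is illustrated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CHUNK = 4 };   /* floats per buffer; tiny for illustration */

/* Hypothetical stand-in for an asynchronous DMA get into local store. */
static void sim_fetch(float *ls_buffer, const float *main_mem, size_t count)
{
    memcpy(ls_buffer, main_mem, count * sizeof(float));
}

/* Sums `total` floats while always keeping the next chunk in flight. */
static float sum_double_buffered(const float *main_mem, size_t total)
{
    float  buffers[2][CHUNK];
    float  sum = 0.0f;
    size_t chunks = total / CHUNK;

    assert(total % CHUNK == 0);
    if (chunks == 0)
        return 0.0f;

    sim_fetch(buffers[0], main_mem, CHUNK);          /* prime buffer 0 */
    for (size_t c = 0; c < chunks; ++c) {
        size_t cur = c & 1u;
        if (c + 1 < chunks)                          /* start next transfer */
            sim_fetch(buffers[cur ^ 1u], main_mem + (c + 1) * CHUNK, CHUNK);
        /* On hardware: wait on the tag for buffers[cur] here. */
        for (size_t i = 0; i < CHUNK; ++i)           /* transform current */
            sum += buffers[cur][i];
    }
    return sum;
}
```

Triple buffering follows the same shape with a third buffer reserved for output that is still being written back.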
61Optimizing for SPUs
- Code level optimization
- Rules of thumb
- Know the instruction set
- Use si intrinsics (or asm)
- Stick with native types
- Clue: There's only one (qword)
62Optimizing for SPUs
- Code level optimization
- Rules of thumb
- Code branch free
- Not just for branch performance.
- Branch free scalar transforms to SIMD extremely well.
- There is a hitch. No SIMD loads or stores.
- This drives data design decisions.
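A small scalar C sketch of branch-free code: a compare produces an all-ones or all-zeros mask, and a bitwise select picks the result. The helper names are hypothetical; select_bits follows the same operand convention as the SPU selb instruction (bits come from the second value where the mask is set).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All ones when a > b, all zeros otherwise (like a SIMD compare result). */
static uint32_t mask_gt(uint32_t a, uint32_t b)
{
    return (uint32_t)0 - (uint32_t)(a > b);
}

/* Takes bits from if_set where mask is 1, from if_clear where mask is 0. */
static uint32_t select_bits(uint32_t if_clear, uint32_t if_set, uint32_t mask)
{
    return (if_clear & ~mask) | (if_set & mask);
}

static uint32_t max_u32(uint32_t a, uint32_t b)
{
    return select_bits(b, a, mask_gt(a, b));
}

/* Clamps every value to `limit` with no branch inside the loop body. */
static void clamp_max(uint32_t *values, size_t n, uint32_t limit)
{
    assert(values != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        values[i] = select_bits(values[i], limit, mask_gt(values[i], limit));
}
```

Because every iteration executes the same instructions, the loop body maps directly onto four lanes of a quadword.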
63Optimizing for SPUs
- Code level optimization
- Examples...
64Optimizing for SPUs
- Example 1: Vector-Matrix Multiply
65Vector-Matrix Multiplication
Standard Approach
Multiplying a vector (x,y,z,w) by a 4x4 matrix:
(x' y' z' w') = (x y z w) * (m00 m01 m02 m03)
                            (m10 m11 m12 m13)
                            (m20 m21 m22 m23)
                            (m30 m31 m32 m33)
- The general case
- shufb xxxx, xyzw, xyzw, shuf_AAAA
- shufb yyyy, xyzw, xyzw, shuf_BBBB
- shufb zzzz, xyzw, xyzw, shuf_CCCC
- shufb wwww, xyzw, xyzw, shuf_DDDD
- fm result, xxxx, m0
- fma result, yyyy, m1, result
- fma result, zzzz, m2, result
- fma result, wwww, m3, result
The result is obtained by multiplying x by the first row of the matrix, y by the second, etc., and accumulating these products. This observation leads to the standard method: broadcast each of x, y, z and w across all 4 components, then perform 4 multiply-add type instructions. Abbreviated versions are possible in the special cases of w=0 and w=1, which occur frequently. All 3 versions are shown. It's a simple matter to extend this approach to the product of two 4x4 matrices. Note that the w=0 and w=1 cases come into play here when our matrices have (0,0,0,1)^T in the rightmost column.
- Case w=0
- shufb xxxx, xyz0, xyz0, shuf_AAAA
- shufb yyyy, xyz0, xyz0, shuf_BBBB
- shufb zzzz, xyz0, xyz0, shuf_CCCC
- fm result, xxxx, m0
- fma result, yyyy, m1, result
- fma result, zzzz, m2, result
- Case w=1
- shufb xxxx, xyz1, xyz1, shuf_AAAA
- shufb yyyy, xyz1, xyz1, shuf_BBBB
- shufb zzzz, xyz1, xyz1, shuf_CCCC
- fma result, xxxx, m0, m3
- fma result, yyyy, m1, result
- fma result, zzzz, m2, result
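The general case of the standard approach can be modelled in plain C. This is a sketch only: quad_t stands in for one SPU quadword of four floats, and splat, fm and fma4 are hypothetical helpers that mimic the broadcast shuffle, fm and fma instructions one lane at a time.

```c
#include <assert.h>

typedef struct { float v[4]; } quad_t;   /* models one quadword of floats */

/* Broadcast one lane across all four (the shufb xxxx/yyyy/... step). */
static quad_t splat(quad_t q, int lane)
{
    quad_t r;
    assert(lane >= 0 && lane < 4);
    for (int i = 0; i < 4; ++i)
        r.v[i] = q.v[lane];
    return r;
}

/* Lane-wise multiply (fm). */
static quad_t fm(quad_t a, quad_t b)
{
    quad_t r;
    for (int i = 0; i < 4; ++i)
        r.v[i] = a.v[i] * b.v[i];
    return r;
}

/* Lane-wise multiply-add (fma): a * b + c. */
static quad_t fma4(quad_t a, quad_t b, quad_t c)
{
    quad_t r;
    for (int i = 0; i < 4; ++i)
        r.v[i] = a.v[i] * b.v[i] + c.v[i];
    return r;
}

/* Row vector times matrix: x*m0 + y*m1 + z*m2 + w*m3, m[i] being row i. */
static quad_t vec_mat_mul(quad_t xyzw, const quad_t m[4])
{
    quad_t result = fm(splat(xyzw, 0), m[0]);
    result = fma4(splat(xyzw, 1), m[1], result);
    result = fma4(splat(xyzw, 2), m[2], result);
    result = fma4(splat(xyzw, 3), m[3], result);
    return result;
}
```

The w=0 and w=1 variants drop the last broadcast; for w=1 the first multiply becomes a multiply-add that starts from row m3.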
66Vector-Matrix Multiplication
Faster Alternatives
In the simple case where we only wish to
transform a single vector, or multiply a single
pair of matrices, the standard approach that was
shown would be most appropriate. But frequently
we'll have a collection of vectors or matrices
which we wish to multiply by the same matrix, in
which case we may be prepared to make sacrifices
for the sake of reducing the instruction count.
67Vector-Matrix Multiplication
Alternative 1
By simply preswizzling the matrix, we can reduce the number of shuffles needed
- The general case
- Preswizzle the matrix as
- (m00 m11 m22 m33)
- (m10 m21 m32 m03)
- (m20 m31 m02 m13)
- (m30 m01 m12 m23)
- then transform a vector using the sequence
- rotqbyi yzwx, xyzw, 4
- rotqbyi zwxy, xyzw, 8
- rotqbyi wxyz, xyzw, 12
- fm result, xyzw, m0_
- fma result, yzwx, m1_, result
- fma result, zwxy, m2_, result
- fma result, wxyz, m3_, result
- Case w=0, with (0,0,0,1)^T in the rightmost matrix column
- Preswizzle the matrix as
- (m00, m11, m22, 0)
- (m10, m21, m02, 0)
- (m20, m01, m12, 0)
- This can be done efficiently using selb
- fsmbi mask_0F00, 0x0F00
- fsmbi mask_00F0, 0x00F0
- selb m0_, m0, m1, mask_0F00
- selb m1_, m1, m2, mask_0F00
- selb m2_, m2, m0, mask_0F00
- selb m0_, m0_, m2, mask_00F0
- selb m1_, m1_, m0, mask_00F0
- selb m2_, m2_, m1, mask_00F0
- The vector multiply then only requires 5 instructions
- shufb yzx0, xyz0, xyz0, shuf_BCA0
- shufb zxy0, xyz0, xyz0, shuf_CAB0
- fm result, xyz0, m0_
- fma result, yzx0, m1_, result
- fma result, zxy0, m2_, result
- Case w=1, with (0,0,0,1)^T in the rightmost matrix column
- Use the same preswizzle as the w=0 case, leaving row 3 unchanged.
- Again 5 instructions suffice
- shufb yzx0, xyz0, xyz0, shuf_BCA0
- shufb zxy0, xyz0, xyz0, shuf_CAB0
- fma result, xyz0, m0_, m3
- fma result, yzx0, m1_, result
- fma result, zxy0, m2_, result
68Vector-Matrix Multiplication
Alternative 2
If we're dealing with the general case, we can reduce the instruction count further still
- Using the preswizzle
- (m02, m13, m20, m31)
- (m12, m23, m30, m01)
- (m00, m11, m22, m33)
- (m10, m21, m32, m03)
- we can carry out the vector multiply in just 6 instructions
- rotqbyi yzwx, xyzw, 4
- fm temp, xyzw, m0_
- fma temp, yzwx, m1_, temp
- rotqbyi result, temp, 8
- fma result, xyzw, m2_, result
- fma result, yzwx, m3_, result
This approach yields no additional benefits for the w=0 and w=1 cases, however.
Conclusion
- Single vector/matrix times a single matrix: use the Standard Approach.
- Many vectors/matrices times a single matrix: use Alternative 1.
- Many general vectors/matrices (i.e. anything in w) times a single matrix in a pipelined loop: use Alternative 2.
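Both preswizzled alternatives (general case) can be checked against each other with a lane-by-lane C model. This is a sketch under stated assumptions: quad_t models a quadword of four floats, rotl models rotqbyi by a whole number of words, and the preswizzle functions build the row layouts listed on the slides from a row-major matrix m[row].v[column].

```c
#include <assert.h>

typedef struct { float v[4]; } quad_t;   /* models one quadword of floats */

/* Rotate left by whole words: rotl((x,y,z,w), 1) == (y,z,w,x). */
static quad_t rotl(quad_t q, int words)
{
    quad_t r;
    assert(words >= 0 && words < 4);
    for (int j = 0; j < 4; ++j)
        r.v[j] = q.v[(j + words) & 3];
    return r;
}

/* Lane-wise a * b + c. */
static quad_t madd(quad_t a, quad_t b, quad_t c)
{
    quad_t r;
    for (int j = 0; j < 4; ++j)
        r.v[j] = a.v[j] * b.v[j] + c.v[j];
    return r;
}

/* Alternative 1: out[k] lane j holds m[(j + k) % 4][j]. */
static void preswizzle_alt1(const quad_t m[4], quad_t out[4])
{
    for (int k = 0; k < 4; ++k)
        for (int j = 0; j < 4; ++j)
            out[k].v[j] = m[(j + k) & 3].v[j];
}

static quad_t mul_alt1(quad_t xyzw, const quad_t s[4])
{
    quad_t result = { { 0.0f, 0.0f, 0.0f, 0.0f } };
    for (int k = 0; k < 4; ++k)
        result = madd(rotl(xyzw, k), s[k], result);
    return result;
}

/* Alternative 2: (m02 m13 m20 m31), (m12 m23 m30 m01),
                  (m00 m11 m22 m33), (m10 m21 m32 m03). */
static void preswizzle_alt2(const quad_t m[4], quad_t out[4])
{
    for (int j = 0; j < 4; ++j) {
        out[0].v[j] = m[j].v[(j + 2) & 3];
        out[1].v[j] = m[(j + 1) & 3].v[(j + 2) & 3];
        out[2].v[j] = m[j].v[j];
        out[3].v[j] = m[(j + 1) & 3].v[j];
    }
}

static quad_t mul_alt2(quad_t xyzw, const quad_t s[4])
{
    quad_t zero = { { 0.0f, 0.0f, 0.0f, 0.0f } };
    quad_t yzwx = rotl(xyzw, 1);
    quad_t temp = madd(xyzw, s[0], zero);
    temp = madd(yzwx, s[1], temp);
    quad_t result = rotl(temp, 2);
    result = madd(xyzw, s[2], result);
    result = madd(yzwx, s[3], result);
    return result;
}
```

The preswizzle is paid once per matrix, so it only wins when the same matrix is applied to many vectors, which is the situation the conclusion describes.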
69Optimizing for SPUs
- Example 2: Matrix Transpose
70Matrix Transposition
Standard Approach
A general 4x4 matrix can be transposed in 8 shuffles as follows
(x0, y0, z0, w0)       (x0, x1, x2, x3)
(x1, y1, z1, w1)  ->   (y0, y1, y2, y3)
(x2, y2, z2, w2)       (z0, z1, z2, z3)
(x3, y3, z3, w3)       (w0, w1, w2, w3)
shufb t0, a0, a2, shuf_AaBb // t0 = (x0, x2, y0, y2)
shufb t1, a1, a3, shuf_AaBb // t1 = (x1, x3, y1, y3)
shufb t2, a0, a2, shuf_CcDd // t2 = (z0, z2, w0, w2)
shufb t3, a1, a3, shuf_CcDd // t3 = (z1, z3, w1, w3)
shufb b0, t0, t1, shuf_AaBb // b0 = (x0, x1, x2, x3)
shufb b1, t0, t1, shuf_CcDd // b1 = (y0, y1, y2, y3)
shufb b2, t2, t3, shuf_AaBb // b2 = (z0, z1, z2, z3)
shufb b3, t2, t3, shuf_CcDd // b3 = (w0, w1, w2, w3)
Many variations are possible by changing the
particular shuffles used, but they all end up
doing the same thing in the same amount of work.
The version shown above is a good choice because
it only requires two constants.
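The 8-shuffle transpose can be traced with a word-granular C model of shufb. This is a simplification: the real instruction selects individual bytes through a control quadword, whereas the hypothetical shuf_words below selects whole 32-bit words, with selectors 0..3 meaning words A..D of the first operand and 4..7 meaning words a..d of the second.

```c
#include <assert.h>

typedef struct { float v[4]; } quad_t;   /* models one quadword of floats */

/* Word-level model of shufb: sel 0..3 picks from a, sel 4..7 picks from b. */
static quad_t shuf_words(quad_t a, quad_t b, const int sel[4])
{
    quad_t r;
    for (int i = 0; i < 4; ++i) {
        assert(sel[i] >= 0 && sel[i] < 8);
        r.v[i] = (sel[i] < 4) ? a.v[sel[i]] : b.v[sel[i] - 4];
    }
    return r;
}

/* The only two shuffle constants the standard transpose needs. */
static const int SHUF_AaBb[4] = { 0, 4, 1, 5 };
static const int SHUF_CcDd[4] = { 2, 6, 3, 7 };

/* Transposes rows a[0..3] into b[0..3] using 8 shuffles. */
static void transpose4x4(const quad_t a[4], quad_t b[4])
{
    quad_t t0 = shuf_words(a[0], a[2], SHUF_AaBb);   /* (x0, x2, y0, y2) */
    quad_t t1 = shuf_words(a[1], a[3], SHUF_AaBb);   /* (x1, x3, y1, y3) */
    quad_t t2 = shuf_words(a[0], a[2], SHUF_CcDd);   /* (z0, z2, w0, w2) */
    quad_t t3 = shuf_words(a[1], a[3], SHUF_CcDd);   /* (z1, z3, w1, w3) */
    b[0] = shuf_words(t0, t1, SHUF_AaBb);            /* (x0, x1, x2, x3) */
    b[1] = shuf_words(t0, t1, SHUF_CcDd);            /* (y0, y1, y2, y3) */
    b[2] = shuf_words(t2, t3, SHUF_AaBb);            /* (z0, z1, z2, z3) */
    b[3] = shuf_words(t2, t3, SHUF_CcDd);            /* (w0, w1, w2, w3) */
}
```

Since all temporaries are computed before any output is written, this model also works when the destination rows alias the source rows.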
71Matrix Transposition
Faster 4x4
By using a different set of shuffles, a couple of the shuffles can then be replaced by select-bytes, which has lower latency
shufb t0, a0, a1, shuf_AaCc // t0 = (x0, x1, z0, z1)
shufb t1, a2, a3, shuf_CcAa // t1 = (z2, z3, x2, x3)
shufb t2, a0, a1, shuf_BbDd // t2 = (y0, y1, w0, w1)
shufb t3, a2, a3, shuf_DdBb // t3 = (w2, w3, y2, y3)
shufb b2, t0, t1, shuf_CDab // b2 = (z0, z1, z2, z3)
shufb b3, t2, t3, shuf_CDab // b3 = (w0, w1, w2, w3)
selb b0, t0, t1, mask_00FF // b0 = (x0, x1, x2, x3)
selb b1, t2, t3, mask_00FF // b1 = (y0, y1, y2, y3)
This version is quicker by 1 cycle, at the expense of requiring more constants
72Matrix Transposition
3x4 -> 4x3
Here is an example that uses only 6 shuffles
(x0, y0, z0, w0)       (x0, x1, x2, 0)
(x1, y1, z1, w1)  ->   (y0, y1, y2, 0)
(x2, y2, z2, w2)       (z0, z1, z2, 0)
                       (w0, w1, w2, 0)
shufb t0, a0, a1, shuf_AaBb // t0 = (x0, x1, y0, y1)
shufb t1, a0, a1, shuf_CcDd // t1 = (z0, z1, w0, w1)
shufb b0, t0, a2, shuf_ABa0 // b0 = (x0, x1, x2, 0)
shufb b1, t0, a2, shuf_CDb0 // b1 = (y0, y1, y2, 0)
shufb b2, t1, a2, shuf_ABc0 // b2 = (z0, z1, z2, 0)
shufb b3, t1, a2, shuf_CDd0 // b3 = (w0, w1, w2, 0)
Note that care must be taken if the destination
matrix is the same as the source. In this case
the last 2 lines of code must be swapped to
avoid prematurely overwriting a2.
73Matrix Transposition
3x3
Here is an example that uses only 5 shuffles
(x0, y0, z0, w0)       (x0, x1, x2, 0)
(x1, y1, z1, w1)  ->   (y0, y1, y2, 0)
(x2, y2, z2, w2)       (z0, z1, z2, 0)
shufb t0, a0, a1, shuf_AaBb // t0 = (x0, x1, y0, y1)
shufb t1, a0, a1, shuf_CcDd // t1 = (z0, z1, w0, w1)
shufb b0, t0, a2, shuf_ABa0 // b0 = (x0, x1, x2, 0)
shufb b1, t0, a2, shuf_CDb0 // b1 = (y0, y1, y2, 0)
shufb b2, t1, a2, shuf_ABc0 // b2 = (z0, z1, z2, 0)
74Matrix Transposition
3x3 (reduced latency)
If we seek the lowest latency, this example is 2 cycles quicker than the last example, at the expense of an extra instruction and an extra constant
(x0, y0, z0, w0)       (x0, x1, x2, 0)
(x1, y1, z1, w1)  ->   (y0, y1, y2, 0)
(x2, y2, z2, w2)       (z0, z1, z2, 0)
shufb t0, a1, a2, shuf_0Aa0 // t0 = ( 0, x1, x2, 0)
shufb t1, a2, a0, shuf_b0B0 // t1 = (y0, 0, y2, 0)
shufb t2, a0, a1, shuf_Cc00 // t2 = (z0, z1, 0, 0)
selb b0, a0, t0, mask_0FFF // b0 = (x0, x1, x2, 0)
selb b1, a1, t1, mask_F0FF // b1 = (y0, y1, y2, 0)
selb b2, a2, t2, mask_FF0F // b2 = (z0, z1, z2, 0)
Hybrid versions are also possible, which may be
of use when trying to balance even vs. odd counts.
75Optimizing for SPUs
- Example 3: 8-bit palette lookup
- Flip the problem around
- Instead of looking up the index for each byte...
- Loop through the palette, compare each quadword of indices against the current entry, and mask in any matching results
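A scalar C sketch of the flipped lookup, assuming 32-bit palette entries and a hypothetical function name. The inner loop has no data-dependent load: each index is compared against the current palette entry and the colour is merged in through a mask, which is the part that maps onto quadword compare and select instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Converts 8-bit indices to colours by looping over the palette instead
   of doing one table load per index. */
static void palette_lookup(const uint8_t *indices, uint32_t *out, size_t count,
                           const uint32_t *palette, size_t palette_size)
{
    assert(palette_size <= 256);
    for (size_t i = 0; i < count; ++i)
        out[i] = 0;
    for (size_t p = 0; p < palette_size; ++p) {
        for (size_t i = 0; i < count; ++i) {
            /* All ones where this index matches entry p, else all zeros. */
            uint32_t mask = (uint32_t)0 - (uint32_t)(indices[i] == p);
            out[i] = (out[i] & ~mask) | (palette[p] & mask);
        }
    }
}
```

This does more arithmetic than a table lookup, so it is a trade: it pays off when per-element loads are expensive and many indices can be compared at once.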
76Optimizing for SPUs
- When is it better to use asm?
- When you know facts the compiler cannot (and can take advantage of them)
- i.e. almost always.
77Optimizing for SPUs
- When is asm really worth it?
- Case-by-case.
- Time, experience, performance, practice.
- Doesn't it make the code unmaintainable?
- Not much different from using intrinsics.
- Especially if you use macro-asm tools.
- e.g. for register coloring. That's really the tedious part of editing asm.
78Optimizing for SPUs
- Writing asm rules-of-thumb
- Minimize instruction count
- Minimize trace latency
- (Instruction count takes precedence)
- Balance even/odd instruction pipelines
- Minimize memory accesses
- Can block DMA or instruction fetch
79The 256K Barrier
- The solution is simple
- Upload more code when you need it.
- Upload more data when you need it.
- Data is managed by traditional means
- i.e. Double, triple fixed-buffers, etc.
- Code is just data.
- Can we manage code the same way we manage data?
80SPU Shaders
- SPU Shaders are
- Fragments of code used in existing systems (Physics, Animation, Effects, AI, etc.)
- Code is loaded at location pre-determined by system.
- Custom (Data/Interface) for each system.
- An expansion of an existing system (e.g. Pipelined stages)
- Custom modifications of system data.
- Way of delivering feedback to other systems outside the scope of the current system.
81SPU Shaders
- SPU Shaders are NOT
- Generic, general purpose system.
- A system of any kind, actually.
- Globally scheduled.
82SPU Shaders
- Why is it called a shader?
- Shares important similarities to GPU shaders.
- Native code fragments
- Part of a larger system
- In-context execution
- Independently optimizable
- Most important: Concept is approachable.
83SPU Shaders
- Don't try to solve everyone's problems
- Solutions that try to solve all problems tend to cause more problems than they solve.
84SPU Shaders
- Easy to Implement
- Pick stage(s) in system kernel to inject shaders.
- Define available inputs and outputs.
- Collect common functions.
- Compile shaders as data.
- Sort instance data based on shader type(s)
- Load shader on-demand based on data select.
- Call shaders.
85SPU Shaders
- What data is being transformed?
- What are the inputs?
- What are the outputs?
- What can be modified?
86SPU Shaders
- Collect the common functions...
- Always loaded by the system
- e.g.
- Dma wrapper functions
- Debugging functions
- Common transformation functions
87Example Structure Passed to Shader
struct common_t
{
    void (*print_str)(const char *str);
    void (*dma_wait)(uint32_t tag);
    void (*dma_send)(void *ls, uint32_t ea, uint32_t size, uint32_t tag);
    void (*dma_recv)(void *ls, uint32_t ea, uint32_t size, uint32_t tag);
    char *ls;
    uint32_t ls_size;
    uint32_t data_ea;
    uint32_t data_size;
    uint32_t dma_tags[2];
};
88SPU Shaders
- System Shader Configuration...
- System knows where the fragments are.
- System knows when to call the fragments.
- System doesn't know what the fragments do.
- Fragments are in main RAM.
- Fragments don't need to be fixed.
89SPU Shaders
- System Shader Configuration.
- Manage fragment memory
- Simplest method
- Double buffer,
- On-demand,
- Fixed maximum size,
- By-index from array,...
90SPU Shaders
- Create the shader code...
- Code is just data
- No special distinguishing feature on the SPUs
- Overlays or additional jobs are too complex and heavyweight.
- Just want load and execute.
- No special system needed.
91SPU Shaders
- Create the shader code..
- Method 1: Shader as PPU header
- Compile shader as normal, to obj file.
- Dump obj file using spu-objdump
- Convert dump to header using script.
- This is what we started with
92SPU Shaders
- Create the shader code..
- Method 2: Use elf file
- Requires extra compile step, but more debugger friendly.
- This is what we're doing now.
- Other methods too, use whatever works for you.
93SPU Shaders
- Calling the shader...
- Nothing could be easier.
- ShaderEntry shader = (addr of fragment)
- shader( data, common )
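A host-side C sketch of the call, under loud assumptions: shader_data_t and the trimmed common_t here are hypothetical, and the fragment is an ordinary linked function rather than code uploaded into local store. On the SPU the pointer would be formed from the address the fragment was loaded at; the call itself looks the same.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down versions of the system-provided structures. */
typedef struct { uint32_t calls; } common_t;
typedef struct { float *values; uint32_t count; } shader_data_t;

/* Fixed entry-point signature shared by every shader of this system. */
typedef void ShaderEntry(shader_data_t *data, common_t *common);

/* Example fragment: transforms all instances it is handed. */
static void scale_by_two(shader_data_t *data, common_t *common)
{
    for (uint32_t i = 0; i < data->count; ++i)
        data->values[i] *= 2.0f;
    common->calls++;
}

/* The system knows where the fragment is and when to call it,
   but not what it does. */
static void call_fragment(ShaderEntry *shader, shader_data_t *data,
                          common_t *common)
{
    assert(shader != NULL);
    shader(data, common);
}
```

Keeping one fixed signature per system is what lets new fragments be added without recompiling the system or the other shaders.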
94SPU Shaders
- Debugging Shaders...
- Fragments are small
- Fragments have well defined inputs and outputs.
- Ideal for unit tests in separate framework.
- Test on PS3/Linux box.
- Alternatives
- Debug on PPU (intrinsics are portable)
- Temporarily link in shader.
95SPU Shaders
- Runtime debugging
- Is a problem with the first method.
- Using the full elf, have debugging info
- Now works transparently in our debugger.
96SPU Shaders
- Rule 1: Don't Manage Data for Shaders
- Just give shaders a buffer and fixed size.
- Shaders should depend on size, so leave room for system changes.
- Best size depends on system.
- (Maybe 4K, maybe 32K)
- Don't read or write from/to shader buffer.
97SPU Shaders
- System-specific
- Multiple list of instances to modify or transform
- Context data
- Shader-internal (local)
- EA passed by system
- Fixed buffer
- Shader shared (global)
- EA passed by system
98SPU Shaders
- Rule 2: Don't Manage DMA for Shaders
- Give fixed number of DMA tags to shader
- (Grab them in the entry function and pass down)
- Avoid GetDmaTagFromParentSystem()
- Give DMA functions to shaders
- To allow system to run with any job manager, or none
- Don't use shader tags for other purposes
99SPU Shaders
- Rule 3: Enforce fixed maximum size for Shader code.
- System can be maintained.
- Rule 4: Shaders are always called in a clear, well defined context.
- i.e. Part of a larger system.
100SPU Shaders
- Rule 5: Fixed parameter list for shaders, per-system (or sub-system)
- Don't want to re-compile all shaders.
- Don't want to manage dynamic parameter lists.
- Rule 6: Shaders should be given as many instances as possible.
- More optimizable.
101SPU Shaders
- Rule 7: Don't break the rules.
- You'll end up with a new job manager.
- You'll end up with a big headache.
102SPU Shaders
- Where are we using these?
- Physics, Effects, Animation, Some AI Update
- Also experimenting with pre-vertex shaders on the SPUs
- And experimenting with giving some of that control to the artists (Directly generating code from a tool...)
112Conclusion
- It's not that complicated.
- Good data and good design work well on the SPUs (and will work well anywhere)
- Sometimes you can get away with bad design and bad data on other platforms
- ...for now. Bad design will not survive this generation.
- Lots of opportunities for optimization.
113Credits
- This was based on the hard work and dedication of
the Insomniac Tech Team. You guys are awesome.