1
Insomniac's SPU Best Practices
Hard-won lessons we're applying to a 3rd generation PS3 title
Mike Acton, Eric Christensen (GDC 08)
2
Introduction
  • What will be covered...
  • Understanding SPU programming
  • Designing systems for the SPUs
  • SPU optimization tips
  • Breaking the 256K barrier
  • Looking to the future

3
Introduction
  • Isn't it harder to program for the SPUs?
  • No.
  • Classical optimization techniques still apply
  • Perhaps even more so than on other
    architectures.
  • e.g. In-order processing means predictable
    pipeline. Means easier to optimize.
  • Both at instruction-level and multi-processing
    level.

4
Introduction
  • Multi-processing is not new
  • Trouble with the SPUs usually is just trouble
    with multi-core.
  • You can't wish multi-core programming away. It's
    part of the job.

5
Introduction
  • But isn't programming for the SPUs different?
  • The SPU is not a magical beast only tamed by
    wizards.
  • It's just a CPU
  • Get your feet wet. Code something.
  • Highly Recommend Linux on the PS3!

6
Introduction
  • Seriously though. It's not the same, right?
  • Not the same if you've been sucked into one of
    the three big lies of software development...

7
Introduction
  • The "software as a platform" lie.
  • The "domain-model design" lie.
  • The "code design is more important than data
    design" lie.
  • ... The real difficulty is unlearning these
    mindsets.

8
Introduction
  • But what's changed?
  • Old model
  • Big semi truck. Stuff everything in. Then stuff
    some more. Then put some stuff up front. Then
    drive away.
  • New model
  • Fleet of Ford GTs taking off every five minutes.
    Each one only fits so much. Bucket brigade. Damn
    they're fast!

9
Introduction
  • But what about special code management?
  • Yes, you need to upload the code.
  • So what? Something needs to load the code on
    every CPU.

10
Introduction
  • But what about DMA'ing data?
  • Yes, you need to use a DMA controller to move
    around the data.
  • Not really different from calling memcpy

11
SPU DMA vs. PPU memcpy

SPU DMA from main ram to local store:
  wrch ch16, ls_addr
  wrch ch18, main_addr
  wrch ch19, size
  wrch ch20, dma_tag
  il   2, MFC_GET_CMD
  wrch ch21, 2

PPU memcpy from far ram to near ram:
  mr 3, near_addr
  mr 4, far_addr
  mr 5, size
  bl memcpy

SPU DMA from local store to main ram:
  wrch ch16, ls_addr
  wrch ch18, main_addr
  wrch ch19, size
  wrch ch20, dma_tag
  il   2, MFC_PUT_CMD
  wrch ch21, 2

PPU memcpy from near ram to far ram:
  mr 4, near_addr
  mr 3, far_addr
  mr 5, size
  bl memcpy

Conclusion: If you can call memcpy, you can DMA data.
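The same transfers at the C level: a minimal sketch using the mfc_* calls
from spu_mfcio.h. The dma_get/dma_put wrapper names are illustrative:

    /* Minimal sketch: the GET/PUT above via spu_mfcio.h.
       ls must point into local store; ea is a main-ram effective address. */
    #include <spu_mfcio.h>
    #include <stdint.h>

    static void dma_get(volatile void *ls, uint64_t ea, uint32_t size,
                        uint32_t tag)
    {
        mfc_get(ls, ea, size, tag, 0, 0);   /* main ram -> local store */
    }

    static void dma_put(volatile void *ls, uint64_t ea, uint32_t size,
                        uint32_t tag)
    {
        mfc_put(ls, ea, size, tag, 0, 0);   /* local store -> main ram */
    }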
12
Introduction
  • But what about DMA'ing data?
  • But with more control over how and when it's
    sent and retrieved.

13
SPU Synchronization

Fence: Transfer after previous with the same tag
  PUTF   Transfer previous before this PUT
  PUTLF  Transfer previous before this PUT LIST
  GETF   Transfer previous before this GET
  GETLF  Transfer previous before this GET LIST

Barrier: Transfer after previous and before next with the same tag
  PUTB   Fixed order with respect to this PUT
  PUTLB  Fixed order with respect to this PUT LIST
  GETB   Fixed order with respect to this GET
  GETLB  Fixed order with respect to this GET LIST

Example Sync:
  DMA from main ram to local store.
  Do other productive work while DMA is happening...
  (Sync) Wait for DMA to complete:
    il   2, 1
    shl  2, 2, dma_tag
    wrch ch22, 2
    il   3, MFC_TAG_UPDATE_ALL
    wrch ch23, 3
    rdch 2, ch24

Lock Line Reservation:
  GETLLAR  Gets locked line. (PPU lwarx, ldarx)
  PUTLLC   Puts locked line. (PPU stwcx, stdcx)
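The same tag wait in C is a two-call sketch (spu_mfcio.h):

    #include <spu_mfcio.h>

    /* Block until every DMA issued on `tag` has completed. */
    static void dma_wait(unsigned int tag)
    {
        mfc_write_tag_mask(1 << tag);   /* select the tag group     */
        mfc_read_tag_status_all();      /* stall until it completes */
    }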
14
Introduction
  • Bottom line: SPUs are like most CPUs
  • Basics are pretty much the same.
  • Good data design decisions and smart code choices
    see benefits on any platform
  • Good DMA patterns also mean cache coherency.
    Better on every platform
  • Bad choices may work on some, but not others.
  • Xbox 360, PC, Wii, DS, PSP, whatever.

15
Introduction
  • And that's what we're talking about today.
  • Trying to apply smart choices to these
    particular CPUs for our games.
  • That's what console development is.
  • What mistakes we've made along the way.
  • What's worked best.

16
Understanding the SPUs
  • Rule 1: The SPU is not a co-processor!
  • Don't think of SPUs as hiding time behind a
    main PPU loop

17
Understanding the SPUs
  • What clicked with some Insomniacs about the
    SPUs
  • Everything is local
  • Think streams of data
  • Forget conventional OOP
  • Everything is a quadword
  • si intrinsics make things clearer
  • Local memory is really, really fast

18
Designing for the SPUs
  • The ultimate goal: Get everything on the SPUs.
  • Leave the PPU for shuffling stuff around.
  • Complex systems can go on the SPUs
  • Not just streaming systems
  • Used for any kind of task
  • But you do need to consider some things...

19
Designing for the SPUs
  • Data comes first.
  • Goal is minimum energy for transformation.
  • What is energy usage? CPU time. Memory read/write
    time. Stall time.

Input -> Transform() -> Output
20
Designing for the SPUs
  • Design the transformation pipeline back to front.
  • Start with your destination data and work
    backward.
  • Changes are inevitable. This way you pay less for
    them.
  • An example...

21
Front to Back (started here):
  Simulate Glass: Had a really nice looking simulation, but would soon
  find out that this stage was worthless. The level of detail from the
  simulation wasn't necessary, considering that the granularity
  restrictions (memory, CPU) could not support it.
  Generate Crack Geometry (igTriangulate): Then wrote igTriangulate.
  Oops, the only possible output didn't support the glamorous crack
  rendering. Even worse, the inputs being provided to the triangulation
  library weren't adequate; needed more information about retaining
  surface features.
  Render: The rendering part of the pipeline didn't completely support
  the outputs of the triangulation library.

Back to Front:
  Render: Rendered dynamic geometry using fake mesh data.
  Generate Crack Geometry (igTriangulate): Faked inputs to triangulate
  and output transformed data to the render stage.
  Simulate Glass: Wrote the simulation to provide useful (and expected)
  results to the triangulation library.

  • Could have avoided re-writing the simulation if the design process
    was done in the correct order.
  • Good looking results were arrived at with a much smaller processing
    and memory impact.
  • Full simulation turned out to be unnecessary since its outputs
    weren't realistic considering the restrictions of the final stage.
  • Proof that "code as you design" can be disastrous.
  • Working from back to front forces you to think about your pipeline
    in advance. It's easier to fix problems that live in front of final
    code. Wildly scattered fixes and data format changes will only end
    in sorrow.
22
Designing for the SPUs
  • The data the SPUs will transform is the canonical
    data.
  • i.e. Store the data in the best format for the
    case that takes the most resources.

23
Designing for the SPUs
  • Minimize synchronization
  • Start with the smallest synchronization method
    possible.

24
Designing for the SPUs
  • Simplest method is usually lock-free single
    reader, single writer queue.

25
PPU Ordered Write:
  Write Data
  lwsync
  Increment Index

SPU Ordered Write:
  Write Data
  Increment Index (with Fence)
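A minimal single-reader/single-writer queue sketch, written in portable
C11 for illustration: the release store plays the role of the lwsync (or
the SPU's fenced transfer), guaranteeing data is visible before the
index. Names and sizes are ours:

    #include <stdatomic.h>
    #include <stdint.h>

    #define QUEUE_SIZE 64   /* power of two */

    typedef struct {
        uint32_t         items[QUEUE_SIZE];
        _Atomic uint32_t write_index;   /* written only by producer */
        _Atomic uint32_t read_index;    /* written only by consumer */
    } spsc_queue_t;

    int queue_push(spsc_queue_t *q, uint32_t item)
    {
        uint32_t w = atomic_load_explicit(&q->write_index, memory_order_relaxed);
        uint32_t r = atomic_load_explicit(&q->read_index,  memory_order_acquire);
        if (w - r == QUEUE_SIZE)
            return 0;                                 /* full */
        q->items[w % QUEUE_SIZE] = item;              /* 1: write data  */
        atomic_store_explicit(&q->write_index, w + 1,
                              memory_order_release);  /* 2: then index  */
        return 1;
    }

    int queue_pop(spsc_queue_t *q, uint32_t *out)
    {
        uint32_t r = atomic_load_explicit(&q->read_index,  memory_order_relaxed);
        uint32_t w = atomic_load_explicit(&q->write_index, memory_order_acquire);
        if (r == w)
            return 0;                                 /* empty */
        *out = q->items[r % QUEUE_SIZE];
        atomic_store_explicit(&q->read_index, r + 1, memory_order_release);
        return 1;
    }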
26
Designing for the SPUs
  • Fairly straightforward to load balance
  • For constant time transforms, just divide into
    multiple queues
  • For other transforms, use a heuristic to decide
    times and a single entry queue to distribute to
    multiple queues.

27
Designing for the SPUs
  • Then work your way up.
  • Is there a pre-existing sync point that will
    work? (e.g. vsync)?
  • Can you split your data into need-to-sync and
    don't-care?

28
Resistance: Fall of Man (Immediate Effect Updates Only)

PPU:
  Update Game Objects
  Run Immediate Effect Updates
  Finish Frame Update / Start Rendering
  Sync Immediate Effect Updates
  Generate Push Buffer To Render Frame
  Generate Push Buffer To Render Effects
  Finish Push Buffer Setup

SPU:
  Immediate Update

Resistance 2 (Immediate + Deferred Effect Updates, Reduced Sync Points)

PPU:
  Sync Immediate Updates For Last Frame
  Run Deferred Effect Update/Render
  Update Game Objects
  Sync Deferred Updates
  Post Update Game Objects
  Run Effects System Manager
  Finish Frame Update / Start Rendering
  Sync Effect System Manager
  Run Immediate Effect Update/Render
  Generate Push Buffer To Render Frame
  Finish Push Buffer Setup

SPU:
  Deferred Update Render
  System Manager
  Immediate Update Render (can run past end of PPU
  frame due to reduced sync points)
29
Resistance: Fall of Man (Immediate Effect Updates Only)

Legend: PPU time overlapping effects SPU time; PPU time spent on effect
system; PPU time that cannot be overlapped.

PPU:
  Update Game Objects
    No effects can be updated till all game objects have updated, so
    attachments do not lag. Visibility and LOD culling done on PPU
    before creating jobs.
  Run Immediate Effect Updates
    Each effect is a separate SPU job.
  Finish Frame Update / Start Rendering
  Sync Immediate Effect Updates
    Likely to stall here, due to limited window in which to update all
    effects.
  Generate Push Buffer To Render Frame
  Generate Push Buffer To Render Effects
    The number of effects that could render was limited by available
    PPU time to generate their PBs.
  Finish Push Buffer Setup

SPU:
  Immediate Update
    Effect updates running on all available SPUs (four).
30
Resistance 2 (Immediate + Deferred Effect Updates, Reduced Sync Points)

Legend: PPU time overlapping effects SPU time; PPU time spent on effect
system; PPU time that cannot be overlapped.

PPU:
  Sync Immediate Updates For Last Frame
  Run Deferred Effect Update/Render
    Initial PB allocations done on PPU. Single SPU job for each SPU
    (anywhere from one to three).
  Update Game Objects
    Deferred effects are one frame behind, so effects attached to moving
    objects usually should not be deferred.
  Sync Deferred Updates
  Post Update Game Objects
  Run Effects System Manager
    SPU manager handles all visibility and LOD culling previously done
    on the PPU.
  Finish Frame Update / Start Rendering
  Sync Effect System Manager
  Run Immediate Effect Update/Render
    Initial PB allocations done on PPU. Single SPU job for each SPU
    (anywhere from one to three). Immediate updates are allowed to run
    till the beginning of the next frame, as they do not need to sync to
    finish generating this frame's PB. Smaller window available to
    update immediate effects, so only effects attached to moving objects
    should be immediate.
  Generate Push Buffer To Render Frame
    Doing the initial PB alloc on the PPU eliminates the need to sync
    SPU updates before generating the full PB.
  Finish Push Buffer Setup

SPU:
  Deferred Update Render
    Huge amount of previously unused SPU processing time available.
  System Manager
    Generates lists of instances for update jobs to process.
  Immediate Update Render (can run past end of PPU frame due to reduced
  sync points)
31
Designing for the SPUs
  • Write optimizable code.
  • Often the actual optimization can wait a bit.
  • Simple, self-contained loops
  • Over as many iterations as possible
  • No branches

32
Designing for the SPUs
  • Transitioning from "legacy" systems...
  • We're not immune to design problems
  • Schedule, manpower, education, and experience all
    play a part.

33
Designing for the SPUs
  • Example from RCF...
  • FastPathFollowers C++ class
  • And its derived classes
  • Running on the PPU
  • Typical Update() method
  • Derived from a root class of all updatable types

34
Designing for the SPUs
  • Where did this go wrong?
  • What rules were broken?
  • Used domain-model design
  • Code design over data design
  • No advantage of scale
  • No synchronization design
  • No cache consideration

35
Designing for the SPUs
  • Result
  • Typical performance issues
  • Cache misses
  • Unnecessary transformations
  • Didn't scale well
  • Problems after a few hundred updating objects

36
Designing for the SPUs
  • Step 1: Group the data together
  • Where there's one, there's more than one.
  • Before the update() loop was called, intercepted
    all FastPathFollowers and derived classes and
    removed them from the update list.
  • Then kept in a separate array.

37
Designing for the SPUs
  • Step 1: Group the data together
  • Created new function, UpdateFastPathFollowers()
  • Used the new list of same type of data
  • Generic Update() no longer used
  • (Ignored derived class behaviors here.)

38
Designing for the SPUs
  • Step 2: Organize Inputs and Outputs
  • Define what's read, what's write.
  • Inputs: Position, Time, State, Results of
    queries, Paths
  • Outputs: Position, State, Queries, Animation
  • Read inputs. Transform to Outputs.
  • Nothing more complex than that.

39
Designing for the SPUs
  • Step 3: Reduce Synchronization Points
  • Collected all outputs together
  • Collected any external function calls together
    into a command buffer
  • Separate Query and Query-Result
  • Effectively a Queue between systems
  • Reduced from many sync points per object to one
    sync point for the system

40
Designing for the SPUs
  • Before Pattern:
  • Loop Objects
  • Read Input 0
  • Update 0
  • Write Output
  • Read Input 1
  • Update 1
  • Call External Function
  • Block (Sync)

41
Designing for the SPUs
  • After Pattern (Simplified; sketched in code below):
  • Loop Objects
  • Read Input 0, 1
  • Update 0, 1
  • Write Output, Function to Queue
  • Block (Sync)
  • Empty (Execute) Queue
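A minimal sketch of this pattern, with illustrative types and names (not
the actual RCF code):

    #include <stddef.h>

    typedef struct { float pos[3]; int wants_callback; } follower_t;
    typedef struct { void (*fn)(void *arg); void *arg; } deferred_call_t;

    enum { MAX_DEFERRED = 1024 };
    static deferred_call_t s_queue[MAX_DEFERRED];
    static size_t s_queue_count;

    static void spawn_effect(void *arg) { (void)arg; /* external system */ }

    static void defer(void (*fn)(void *), void *arg)
    {
        s_queue[s_queue_count].fn  = fn;    /* record, don't call now */
        s_queue[s_queue_count].arg = arg;
        ++s_queue_count;
    }

    void update_followers(follower_t *followers, size_t count)
    {
        size_t i;
        for (i = 0; i < count; ++i) {          /* loop objects          */
            follower_t *f = &followers[i];
            f->pos[0] += 0.1f;                 /* read, update, write   */
            if (f->wants_callback)
                defer(spawn_effect, f);        /* function to queue     */
        }
        /* block (sync) would happen here; then empty (execute) queue */
        for (i = 0; i < s_queue_count; ++i)
            s_queue[i].fn(s_queue[i].arg);
        s_queue_count = 0;
    }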

42
Designing for the SPUs
  • Next: Added derived-class functionality
  • Similarly simplified derived-class Update()
    functions into functions with clear inputs and
    outputs.
  • Added functions to deferred queue as any other
    function.
  • Advantage Can limit derived functionality based
    on count, LOD, etc.

43
Designing for the SPUs
  • Step 4: Move to PPU thread
  • Now system update has no external dependencies
  • Now system update has no conflicting data areas
    (with other systems)?
  • Now system update does not call non-re-entrant
    functions
  • Simply put in another thread

44
Designing for the SPUs
  • Step 4: Move to PPU thread
  • Add literal sync between system update and queue
    execution
  • Sync can be removed because only single reader
    and single writer to data
  • Queue can be emptied while being filled without
    collision
  • See R&D page on multi-threaded optimization

45
Designing for the SPUs
  • Step 5: Move to SPU
  • Now completely independent thread
  • Can be run anytime
  • Prototype for new SPU system
  • AsyncMobyUpdate
  • Using SPU Shaders

46
Designing for the SPUs
  • Transitioning from the "SPU as coprocessor" model.
  • Example: igPhysics, from Resistance to now...

47
Resistance: Fall of Man Physics Pipeline

PPU:
  Environment Pre-Update (Resolve Anim/IK)
  Environment Update
  Collision Update (start collision jobs while building)
  Sync Collision Jobs and Process Contact Points (Blocked!)
  Associate Rigid Bodies Through Constraints
  Package Rigid Body Pools (start SPU jobs while packing)
  Sync Sim Jobs and Process Rigid Body Data (Blocked!)
  Post Update (Transform Anim Joints)

SPU (collision jobs):
  AABB Tests
  Triangle Intersection
  Sphere, Capsule, etc.
  Collide Prims (generate contacts)
  Pack contact points
  Note: One job per object (box, ragdoll, etc.)

SPU (simulation jobs):
  Unpack Constraints
  Generate Jacobian Data
  Solve Constraints
  Simulate
  Pack Rigid Body Data

The only time hidden between start and stop of jobs is the packing of
job data. The only other savings come from merely running the jobs on
the SPU.
48
Resistance 2 Physics Pipeline

PPU:
  Environment Update
  Start Physics Jobs
  PPU Work: Build Simulation Pools
  Sync Physics Jobs
  Post Update

SPU:
  Triangle Cache Update
    Upload Tri-Cache
    Upload RB Prims
    Upload Intersect Funcs
    Intersection Tests
  Object Cache Update
    Upload Object Cache
    For Each Iteration:
      Upload CO Prims
      Collide Triangles
      Upload Intersect Funcs
      Intersection Tests
      Collide Primitives
  Simulate Pools
    Sort Joint Types
    Per Joint Type: Upload Jacobian Generation Code
    Upload Physics Joints
    Calculate Jacobian Data
    Upload Solver Code
    Solve Constraints
    Integrate
  For Each Physics Object:
    Upload Anim Joints
    Transform Anim Joints Using Rigid Body Data
    Update Rigid Bodies
    Send Update To PPU
49
Optimizing for SPUs
  • Instruction-level optimizations are similar to
    any other platform
  • i.e. Look at the instruction set and write code
    that takes advantage of it.

50
Optimizing for SPUs
  • Memory transfer optimizations are similar to any
    other platform
  • i.e. Organize data for line-length and coherency.
    Separate read and write buffers wherever
    possible.
  • DMA is exactly like cache pre-fetch

51
Optimizing for SPUs
  • Local memory optimizations are similar to any
    other platform
  • i.e. Have a fixed-size buffer, split it into
    smaller buffers for input, output, temporary data
    and code.
  • Organizing 256K is essentially the same process
    as organizing 256M

52
Optimizing for SPUs
  • Memory layout
  • Memory is dedicated to your code.
  • Memory is local to your code.
  • Design so you know what will read and write to
    the memory
  • i.e. DMAs from PPU, other SPUs, etc.
  • Generally fairly straightforward.
  • Remember you can use an offline tool to lay out
    your memory if you want.

53
Optimizing for SPUs
  • Memory layout
  • But never, ever try to use a dynamic memory
    allocator.
  • Malloc for dedicated 256K would be ridiculous.
  • OK. Malloc in a console game would be ridiculous.

54
Optimizing for SPUs
  • Memory layout
  • Rules of thumb
  • Organize everything into blocks of 16b.
  • SPU Reads/Writes only 16b
  • Group same fields together
  • No single object data
  • Similar to most SIMD.
  • Similar to GPUs.
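A small illustration of "group same fields together": structure of
arrays, so each 16b load feeds four lanes of the same field. The types
and names here are assumptions, not engine code:

    #include <spu_intrinsics.h>

    #define MAX_PARTICLES 1024

    /* AoS: one 16b load pulls the mixed fields of a single object */
    typedef struct { float x, y, z, w; } particle_aos_t;

    /* SoA: one 16b load pulls the same field of four objects */
    typedef struct {
        vec_float4 x[MAX_PARTICLES / 4];
        vec_float4 y[MAX_PARTICLES / 4];
        vec_float4 z[MAX_PARTICLES / 4];
    } particles_soa_t;

    /* e.g. integrate four x positions per instruction */
    void integrate_x(particles_soa_t *p, const vec_float4 *vx, vec_float4 dt)
    {
        int i;
        for (i = 0; i < MAX_PARTICLES / 4; ++i)
            p->x[i] = spu_madd(vx[i], dt, p->x[i]);   /* x += vx * dt */
    }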

55
Optimizing for SPUs
  • Memory transfer
  • Usually pretty straightforward
  • Rules of thumb
  • Keep everything 128b aligned
  • Nothing different. Same rule as the PPU.
    (Cache-line is 128b)
  • Transfer as much data as possible together.
    Transform together.
  • Nothing different. Same rule as the PPU. (For
    cache coherency)

56
Optimizing for SPUs
  • Memory transfer
  • Let's dig into these rules of thumb a bit...
  • Shared alignment between main ram and SPU local
    memory is going to be faster. (So pick an
    alignment and stick with it.)
  • Transfer is done in 128b blocks, so alignment
    isn't strictly necessary (but no worries about
    the above if it is)

57
Optimizing for SPUs
  • Number of transfers doesn't really matter (re:
    biggest transfers possible) but...
  • You want to transfer 128b blocks, not scattered.
  • You want to minimize synchronization (sync on
    fewer DMA tags)
  • You have fewer places to worry about alignment.
  • You want to minimize scatter/gather, especially
    considering TLB misses.

58
Optimizing for SPUs
  • Memory transfer
  • Rules of thumb
  • If scattered reads/writes are necessary, use a DMA
    list (not individual DMAs)
  • Advantage over PPU. PPU can't do out-of-order,
    grouped memory transfer.
  • Keeps predictability of in-order execution with
    performance of out-of-order memory transfer.

59
Optimizing for SPUs
  • Speaking of out-of-order transfers...
  • Use DMA fence to dictate order
  • Reads and writes are interleaved.
  • If you need max transfer performance, issue them
    separately.

60
Optimizing for SPUs
  • Memory transfer
  • Double/triple buffer optimization
  • (Fence example below)
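A double-buffered streaming sketch using a fenced GET (spu_mfcio.h): the
mfc_getf won't start until the earlier PUT on the same tag has finished
reading from that buffer, so the two tags are the only sync needed.
Chunk size and names are illustrative:

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define CHUNK 4096

    static char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void process(char *data, uint32_t size);   /* assumed transform */

    void stream(uint64_t ea_in, uint64_t ea_out, uint32_t total)
    {
        uint32_t i, n = total / CHUNK;
        if (n == 0)
            return;
        mfc_get(buf[0], ea_in, CHUNK, 0, 0, 0);        /* prime buffer 0 */
        for (i = 0; i < n; ++i) {
            uint32_t cur = i & 1, nxt = cur ^ 1;
            if (i + 1 < n)  /* fenced prefetch: ordered after buf[nxt]'s PUT */
                mfc_getf(buf[nxt], ea_in + (uint64_t)(i + 1) * CHUNK,
                         CHUNK, nxt, 0, 0);
            mfc_write_tag_mask(1 << cur);              /* wait for input  */
            mfc_read_tag_status_all();
            process(buf[cur], CHUNK);
            mfc_put(buf[cur], ea_out + (uint64_t)i * CHUNK, CHUNK, cur, 0, 0);
        }
        mfc_write_tag_mask(3);                         /* drain both tags */
        mfc_read_tag_status_all();
    }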

61
Optimizing for SPUs
  • Code level optimization
  • Rules of thumb
  • Know the instruction set
  • Use si intrinsics (or asm)
  • Stick with native types
  • Clue: There's only one (qword)

62
Optimizing for SPUs
  • Code level optimization
  • Rules of thumb
  • Code branch-free
  • Not just for branch performance.
  • Branch-free scalar transforms to SIMD extremely
    well.
  • There is a hitch: no SIMD loads or stores.
  • This drives data design decisions. (Branch-free
    example below.)
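A tiny branch-free sketch with SPU intrinsics (spu_intrinsics.h): the
compare builds a per-lane mask and the select picks without branching:

    #include <spu_intrinsics.h>

    /* clamp each lane of v to be >= 0, with no branches */
    vec_float4 clamp_min_zero(vec_float4 v)
    {
        vec_float4 zero = spu_splats(0.0f);
        vec_uint4  gt   = spu_cmpgt(v, zero);  /* all-ones where v > 0 */
        return spu_sel(zero, v, gt);           /* pick v there, else 0 */
    }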

63
Optimizing for SPUs
  • Code level optimization
  • Examples...

64
Optimizing for SPUs
  • Example 1: Vector-Matrix Multiply

65
Vector-Matrix Multiplication
Standard Approach
Multiplying a vector (x,y,z,w) by a 4x4 matrix:

  (x' y' z' w') = (x y z w) (m00 m01 m02 m03)
                            (m10 m11 m12 m13)
                            (m20 m21 m22 m23)
                            (m30 m31 m32 m33)

The result is obtained by multiplying the x by the first row of the
matrix, y by the second, etc., and accumulating these products. This
observation leads to the standard method: broadcast each of the x, y, z
and w across all 4 components, then perform 4 multiply-add type
instructions. Abbreviated versions are possible in the special cases of
w=0 and w=1, which occur frequently. All 3 versions are shown below.
It's a simple matter to extend this approach to the product of two 4x4
matrices. Note that the w=0 and w=1 cases come into play here when our
matrices have (0,0,0,1)^T in the rightmost column.

The general case:
  shufb xxxx, xyzw, xyzw, shuf_AAAA
  shufb yyyy, xyzw, xyzw, shuf_BBBB
  shufb zzzz, xyzw, xyzw, shuf_CCCC
  shufb wwww, xyzw, xyzw, shuf_DDDD
  fm    result, xxxx, m0
  fma   result, yyyy, m1, result
  fma   result, zzzz, m2, result
  fma   result, wwww, m3, result

Case w=0:
  shufb xxxx, xyz0, xyz0, shuf_AAAA
  shufb yyyy, xyz0, xyz0, shuf_BBBB
  shufb zzzz, xyz0, xyz0, shuf_CCCC
  fm    result, xxxx, m0
  fma   result, yyyy, m1, result
  fma   result, zzzz, m2, result

Case w=1:
  shufb xxxx, xyz1, xyz1, shuf_AAAA
  shufb yyyy, xyz1, xyz1, shuf_BBBB
  shufb zzzz, xyz1, xyz1, shuf_CCCC
  fma   result, xxxx, m0, m3
  fma   result, yyyy, m1, result
  fma   result, zzzz, m2, result
66
Vector-Matrix Multiplication
Faster Alternatives
In the simple case where we only wish to transform a single vector, or
multiply a single pair of matrices, the standard approach that was shown
would be most appropriate. But frequently we'll have a collection of
vectors or matrices which we wish to multiply by the same matrix, in
which case we may be prepared to make sacrifices for the sake of
reducing the instruction count.
67
Vector-Matrix Multiplication
Alternative 1
By simply preswizzling the matrix, we can reduce the number of shuffles
needed.

The general case: Preswizzle the matrix as
  (m00 m11 m22 m33)
  (m10 m21 m32 m03)
  (m20 m31 m02 m13)
  (m30 m01 m12 m23)
then transform a vector using the sequence:
  rotqbyi yzwx, xyzw, 4
  rotqbyi zwxy, xyzw, 8
  rotqbyi wxyz, xyzw, 12
  fm      result, xyzw, m0_
  fma     result, yzwx, m1_, result
  fma     result, zwxy, m2_, result
  fma     result, wxyz, m3_, result

Case w=0, with (0,0,0,1)^T in the rightmost matrix column: Preswizzle
the matrix as
  (m00, m11, m22, 0)
  (m10, m21, m02, 0)
  (m20, m01, m12, 0)
This can be done efficiently using selb:
  fsmbi mask_0F00, 0x0F00
  fsmbi mask_00F0, 0x00F0
  selb  m0_, m0, m1, mask_0F00
  selb  m1_, m1, m2, mask_0F00
  selb  m2_, m2, m0, mask_0F00
  selb  m0_, m0_, m2, mask_00F0
  selb  m1_, m1_, m0, mask_00F0
  selb  m2_, m2_, m1, mask_00F0
The vector multiply then only requires 5 instructions:
  shufb yzx0, xyz0, xyz0, shuf_BCA0
  shufb zxy0, xyz0, xyz0, shuf_CAB0
  fm    result, xyz0, m0_
  fma   result, yzx0, m1_, result
  fma   result, zxy0, m2_, result

Case w=1, with (0,0,0,1)^T in the rightmost matrix column: Use the same
preswizzle as the w=0 case, leaving row 3 unchanged. Again 5
instructions suffice:
  shufb yzx0, xyz0, xyz0, shuf_BCA0
  shufb zxy0, xyz0, xyz0, shuf_CAB0
  fma   result, xyz0, m0_, m3
  fma   result, yzx0, m1_, result
  fma   result, zxy0, m2_, result

68
Vector-Matrix Multiplication
Alternative 2
If were dealing with the general case, we can
reduce the instruction count further still
  • Using the preswizzle (m02, m13, m20, m31)?
  • (m12, m23,
    m30, m01)?
  • (m00, m11,
    m22, m33)?
  • (m10, m21,
    m32, m03)?
  • we can carry out the vector multiply
  • in just 6 instructions
  • rotqbyi yzwx, xyzw, 4
  • fm temp, xyzw, m0_
  • fma temp, yzwx, m1_, temp
  • rotqbyi result, temp, 8
  • fma result, xyzw, m2_, result
  • fma result, yzwx, m3_, result

This approach yields no additional benefits for
the w0 and w1 cases however.
Conclusion
Single vector/matrix times a single matrix use
the Standard Approach. Many vectors/matrices
times a single matrix use Alternative 1. Many
general vectors/matrices (i.e. anything in w)
times a single matrix in a pipelined loop use
Alternative 2.
69
Optimizing for SPUs
  • Example 2: Matrix Transpose

70
Matrix Transposition
Standard Approach
A general 4x4 matrix can be transposed in 8 shuffles as follows:

  (x0, y0, z0, w0)       (x0, x1, x2, x3)
  (x1, y1, z1, w1)  ->   (y0, y1, y2, y3)
  (x2, y2, z2, w2)       (z0, z1, z2, z3)
  (x3, y3, z3, w3)       (w0, w1, w2, w3)

  shufb t0, a0, a2, shuf_AaBb   // t0 = (x0, x2, y0, y2)
  shufb t1, a1, a3, shuf_AaBb   // t1 = (x1, x3, y1, y3)
  shufb t2, a0, a2, shuf_CcDd   // t2 = (z0, z2, w0, w2)
  shufb t3, a1, a3, shuf_CcDd   // t3 = (z1, z3, w1, w3)
  shufb b0, t0, t1, shuf_AaBb   // b0 = (x0, x1, x2, x3)
  shufb b1, t0, t1, shuf_CcDd   // b1 = (y0, y1, y2, y3)
  shufb b2, t2, t3, shuf_AaBb   // b2 = (z0, z1, z2, z3)
  shufb b3, t2, t3, shuf_CcDd   // b3 = (w0, w1, w2, w3)

Many variations are possible by changing the particular shuffles used,
but they all end up doing the same thing in the same amount of work. The
version shown above is a good choice because it only requires two
constants.
71
Matrix Transposition
Faster 4x4
By using a different set of shuffles, a couple of the shuffles can be
replaced by select-bytes, which has lower latency:

  shufb t0, a0, a1, shuf_AaCc   // t0 = (x0, x1, z0, z1)
  shufb t1, a2, a3, shuf_CcAa   // t1 = (z2, z3, x2, x3)
  shufb t2, a0, a1, shuf_BbDd   // t2 = (y0, y1, w0, w1)
  shufb t3, a2, a3, shuf_DdBb   // t3 = (w2, w3, y2, y3)
  shufb b2, t0, t1, shuf_CDab   // b2 = (z0, z1, z2, z3)
  shufb b3, t2, t3, shuf_CDab   // b3 = (w0, w1, w2, w3)
  selb  b0, t0, t1, mask_00FF   // b0 = (x0, x1, x2, x3)
  selb  b1, t2, t3, mask_00FF   // b1 = (y0, y1, y2, y3)

This version is quicker by 1 cycle, at the expense of requiring more
constants.
72
Matrix Transposition
3x4 -> 4x3
Here is an example that uses only 6 shuffles:

  (x0, y0, z0, w0)       (x0, x1, x2, 0)
  (x1, y1, z1, w1)  ->   (y0, y1, y2, 0)
  (x2, y2, z2, w2)       (z0, z1, z2, 0)
                         (w0, w1, w2, 0)

  shufb t0, a0, a1, shuf_AaBb   // t0 = (x0, x1, y0, y1)
  shufb t1, a0, a1, shuf_CcDd   // t1 = (z0, z1, w0, w1)
  shufb b0, t0, a2, shuf_ABa0   // b0 = (x0, x1, x2, 0)
  shufb b1, t0, a2, shuf_CDb0   // b1 = (y0, y1, y2, 0)
  shufb b2, t1, a2, shuf_ABc0   // b2 = (z0, z1, z2, 0)
  shufb b3, t1, a2, shuf_CDd0   // b3 = (w0, w1, w2, 0)

Note that care must be taken if the destination matrix is the same as
the source. In this case the last 2 lines of code must be swapped to
avoid prematurely overwriting a2.
73
Matrix Transposition
3x3
Here is an example that uses only 5 shuffles:

  (x0, y0, z0, w0)       (x0, x1, x2, 0)
  (x1, y1, z1, w1)  ->   (y0, y1, y2, 0)
  (x2, y2, z2, w2)       (z0, z1, z2, 0)

  shufb t0, a0, a1, shuf_AaBb   // t0 = (x0, x1, y0, y1)
  shufb t1, a0, a1, shuf_CcDd   // t1 = (z0, z1, w0, w1)
  shufb b0, t0, a2, shuf_ABa0   // b0 = (x0, x1, x2, 0)
  shufb b1, t0, a2, shuf_CDb0   // b1 = (y0, y1, y2, 0)
  shufb b2, t1, a2, shuf_ABc0   // b2 = (z0, z1, z2, 0)
74
Matrix Transposition
3x3 (reduced latency)
If we seek the lowest latency, this example is 2 cycles quicker than the
last example, at the expense of an extra instruction and an extra
constant:

  (x0, y0, z0, w0)       (x0, x1, x2, 0)
  (x1, y1, z1, w1)  ->   (y0, y1, y2, 0)
  (x2, y2, z2, w2)       (z0, z1, z2, 0)

  shufb t0, a1, a2, shuf_0Aa0   // t0 = ( 0, x1, x2, 0)
  shufb t1, a2, a0, shuf_b0B0   // t1 = (y0,  0, y2, 0)
  shufb t2, a0, a1, shuf_Cc00   // t2 = (z0, z1,  0, 0)
  selb  b0, a0, t0, mask_0FFF   // b0 = (x0, x1, x2, 0)
  selb  b1, a1, t1, mask_F0FF   // b1 = (y0, y1, y2, 0)
  selb  b2, a2, t2, mask_FF0F   // b2 = (z0, z1, z2, 0)

Hybrid versions are also possible, which may be of use when trying to
balance even vs. odd counts.
75
Optimizing for SPUs
  • Example 3: 8-bit palette lookup
  • Flip the problem around
  • Instead of looking up the index for each byte...
  • Loop through the palette and compare each
    quadword of indices, masking in any matching
    results (sketched below)
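A hedged sketch of the inverted lookup (an 8-bit to 8-bit remap to keep
it short; the names are ours, not the shipping code):

    #include <spu_intrinsics.h>

    void palette_remap(const vec_uchar16 *indices,   /* 16 indices/qword */
                       const unsigned char palette[256],
                       vec_uchar16 *out, int nquads)
    {
        int i, q;
        for (q = 0; q < nquads; ++q)
            out[q] = spu_splats((unsigned char)0);
        for (i = 0; i < 256; ++i) {                  /* loop the palette */
            vec_uchar16 entry = spu_splats(palette[i]);
            vec_uchar16 iv    = spu_splats((unsigned char)i);
            for (q = 0; q < nquads; ++q) {
                vec_uchar16 match = spu_cmpeq(indices[q], iv);
                out[q] = spu_sel(out[q], entry, match);  /* mask in hits */
            }
        }
    }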

76
Optimizing for SPUs
  • When is it better to use asm?
  • When you know facts the compiler cannot (and can
    take advantage of them)
  • i.e. almost always.

77
Optimizing for SPUs
  • When is asm really worth it?
  • Case-by-case.
  • Time, experience, performance, practice.
  • Doesn't it make the code unmaintainable?
  • Not much different from using intrinsics.
  • Especially if you use macro-asm tools.
  • e.g. for register coloring - that's really the
    tedious part of editing asm.

78
Optimizing for SPUs
  • Writing asm rules-of-thumb
  • Minimize instruction count
  • Minimize trace latency
  • (Instruction count takes precedence)
  • Balance even/odd instruction pipelines
  • Minimize memory accesses
  • Can block DMA or instruction fetch

79
The 256K Barrier
  • The solution is simple
  • Upload more code when you need it.
  • Upload more data when you need it.
  • Data is managed by traditional means
  • i.e. Double, triple fixed-buffers, etc.
  • Code is just data.
  • Can we manage code the same way we manage data?

80
SPU Shaders
  • SPU Shaders are
  • Fragments of code used in existing systems
    (Physics, Animation, Effects, AI, etc.)
  • Code is loaded at a location pre-determined by the
    system.
  • Custom (Data/Interface) for each system.
  • An expansion of an existing system (e.g.
    pipelined stages)
  • Custom modifications of system data.
  • Way of delivering feedback to other systems
    outside the scope of the current system.

81
SPU Shaders
  • SPU Shaders are NOT
  • A generic, general-purpose system.
  • A system of any kind, actually.
  • Globally scheduled.

82
SPU Shaders
  • Why is it called a shader?
  • Shares important similarities to GPU shaders.
  • Native code fragments
  • Part of a larger system
  • In-context execution
  • Independently optimizable
  • Most important: Concept is approachable.

83
SPU Shaders
  • Don't try to solve everyone's problems
  • Solutions that try to solve all problems tend to
    cause more problems than they solve.

84
SPU Shaders
  • Easy to Implement
  • Pick stage(s) in system kernel to inject shaders.
  • Define available inputs and outputs.
  • Collect common functions.
  • Compile shaders as data.
  • Sort instance data based on shader type(s)
  • Load shader on-demand based on data select.
  • Call shaders. (Load-and-call sketch below.)
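A hedged sketch of the last three steps (sort, load on demand, call).
The table layout, signature, and names are illustrative, not Insomniac's
actual interface:

    #include <spu_mfcio.h>
    #include <spu_intrinsics.h>
    #include <stdint.h>

    #define SHADER_MAX_SIZE 4096   /* fixed maximum size, per Rule 3 */

    typedef void (*shader_entry_t)(void *instances, void *common);

    static char shader_buf[SHADER_MAX_SIZE] __attribute__((aligned(128)));

    void run_sorted_shaders(const uint64_t *code_ea, const uint32_t *code_size,
                            void *const *instance_lists, int type_count,
                            void *common)
    {
        int t;
        for (t = 0; t < type_count; ++t) {
            /* code is just data: DMA the fragment to a fixed buffer */
            mfc_get(shader_buf, code_ea[t], code_size[t], 0, 0, 0);
            mfc_write_tag_mask(1 << 0);
            mfc_read_tag_status_all();
            spu_sync();   /* make new code visible to instruction fetch */
            ((shader_entry_t)(void *)shader_buf)(instance_lists[t], common);
        }
    }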

85
SPU Shaders
  • What data is being transformed?
  • What are the inputs?
  • What are the outputs?
  • What can be modified?

86
SPU Shaders
  • Collect the common functions...
  • Always loaded by the system
  • e.g.
  • DMA wrapper functions
  • Debugging functions
  • Common transformation functions

87
Example Structure Passed to Shader
struct common_t
{
    void (*print_str)(const char *str);
    void (*dma_wait)(uint32_t tag);
    void (*dma_send)(void *ls, uint32_t ea, uint32_t size, uint32_t tag);
    void (*dma_recv)(void *ls, uint32_t ea, uint32_t size, uint32_t tag);

    char     *ls;
    uint32_t  ls_size;
    uint32_t  data_ea;
    uint32_t  data_size;
    uint32_t  dma_tags[2];
};
88
SPU Shaders
  • System Shader Configuration...
  • System knows where the fragments are.
  • System knows when to call the fragments.
  • System doesn't know what the fragments do.
  • Fragments are in main RAM.
  • Fragments don't need to be fixed.

89
SPU Shaders
  • System Shader Configuration.
  • Manage fragment memory
  • Simplest method
  • Double buffer,
  • On-demand,
  • Fixed maximum size,
  • By-index from array,...

90
SPU Shaders
  • Create the shader code...
  • Code is just data
  • No special distinguishing feature on the SPUs
  • Overlays or additional jobs are too complex and
    heavyweight.
  • Just want load and execute.
  • No special system needed.

91
SPU Shaders
  • Create the shader code..
  • Method 1: Shader as PPU header
  • Compile shader as normal, to obj file.
  • Dump obj file using spu-objdump.
  • Convert dump to header using script.
  • This is what we started with.

92
SPU Shaders
  • Create the shader code..
  • Method 2: Use elf file
  • Requires extra compile step, but more debugger
    friendly.
  • This is what we're doing now.
  • Other methods too, use whatever works for you.

93
SPU Shaders
  • Calling the shader...
  • Nothing could be easier.
  • ShaderEntry shader = (ShaderEntry)(addr of fragment);
  • shader( data, common );

94
SPU Shaders
  • Debugging Shaders...
  • Fragments are small
  • Fragments have well defined inputs and outputs.
  • Ideal for unit tests in separate framework.
  • Test on PS3/Linux box.
  • Alternatives
  • Debug on PPU (intrinsics are portable)
  • Temporarily link in shader.

95
SPU Shaders
  • Runtime debugging
  • It is a problem with the first method.
  • Using the full elf, we have debugging info.
  • Now works transparently in our debugger.

96
SPU Shaders
  • Rule 1: Don't Manage Data for Shaders
  • Just give shaders a buffer and fixed size.
  • Shaders should depend on that size, so leave room
    for system changes.
  • Best size depends on system.
  • (Maybe 4K, maybe 32K)
  • Don't read or write from/to the shader buffer.

97
SPU Shaders
  • System-specific
  • Multiple lists of instances to modify or transform
  • Context data
  • Shader-internal (local)
  • EA passed by system
  • Fixed buffer
  • Shader-shared (global)
  • EA passed by system

98
SPU Shaders
  • Rule 2: Don't Manage DMA for Shaders
  • Give a fixed number of DMA tags to the shader
  • (Grab them in the entry function and pass them down)
  • Avoid GetDmaTagFromParentSystem()
  • Give DMA functions to shaders
  • To allow the system to run with any job manager,
    or none
  • Don't use shader tags for other purposes

99
SPU Shaders
  • Rule 3: Enforce a fixed maximum size for shader
    code.
  • System can be maintained.
  • Rule 4: Shaders are always called in a clear,
    well defined context.
  • i.e. Part of a larger system.

100
SPU Shaders
  • Rule 5: Fixed parameter list for shaders,
    per-system (or sub-system)
  • Don't want to re-compile all shaders.
  • Don't want to manage dynamic parameter lists.
  • Rule 6: Shaders should be given as many instances
    as possible.
  • More optimizable.

101
SPU Shaders
  • Rule 7: Don't break the rules.
  • You'll end up with a new job manager.
  • You'll end up with a big headache.

102
SPU Shaders
  • Where are we using these?
  • Physics, Effects, Animation, some AI Update
  • Also experimenting with pre-vertex shaders on the
    SPUs
  • And experimenting with giving some of that
    control to the artists (directly generating code
    from a tool...)

(Slides 103-111: no transcript)
112
Conclusion
  • It's not that complicated.
  • Good data and good design work well on the SPUs
    (and will work well anywhere)
  • Sometimes you can get away with bad design and
    bad data on other platforms
  • ...for now. Bad design will not survive this
    generation.
  • Lots of opportunities for optimization.

113
Credits
  • This was based on the hard work and dedication of
    the Insomniac Tech Team. You guys are awesome.