1
Insomniac's SPU Best Practices
Hard-won lessons we're applying to a 3rd generation PS3 title
Mike Acton, Eric Christensen (GDC 08)
2
Introduction
  • What will be covered...
  • Understanding SPU programming
  • Designing systems for the SPUs
  • SPU optimization tips
  • Breaking the 256K barrier
  • Looking to the future

3
Introduction
  • Isn't it harder to program for the SPUs?
  • No.
  • Classical optimization techniques still apply
  • Perhaps even more so than on other
    architectures.
  • e.g. In-order processing means predictable
    pipeline. Means easier to optimize.
  • Both at instruction-level and multi-processing
    level.

4
Introduction
  • Multi-processing is not new
  • Trouble with the SPUs usually is just trouble
    with multi-core.
  • You can't wish multi-core programming away. It's
    part of the job.

5
Introduction
  • But isn't programming for the SPUs different?
  • The SPU is not a magical beast only tamed by
    wizards.
  • It's just a CPU
  • Get your feet wet. Code something.
  • Highly Recommend Linux on the PS3!

6
Introduction
  • Seriously though. It's not the same, right?
  • Not the same if you've been sucked into one of
    the three big lies of software development...

7
Introduction
  • The "software as a platform" lie.
  • The "domain-model design" lie.
  • The "code design is more important than data
    design" lie.
  • ... The real difficulty is unlearning these
    mindsets.

8
Introduction
  • But what's changed?
  • Old model
  • Big semi truck. Stuff everything in. Then stuff
    some more. Then put some stuff up front. Then
    drive away.
  • New model
  • Fleet of Ford GTs taking off every five minutes.
    Each one only fits so much. Bucket brigade. Damn
    they're fast!

9
Introduction
  • But what about special code management?
  • Yes, you need to upload the code.
  • So what? Something needs to load the code on
    every CPU.

10
Introduction
  • But what about DMA'ing data?
  • Yes, you need to use a DMA controller to move
    around the data.
  • Not really different from calling memcpy

11
SPU DMA vs. PPU memcpy

SPU DMA from main ram to local store:
  wrch ch16, ls_addr
  wrch ch18, main_addr
  wrch ch19, size
  wrch ch20, dma_tag
  il   2, MFC_GET_CMD
  wrch ch21, 2

PPU memcpy from far ram to near ram:
  mr 3, near_addr
  mr 4, far_addr
  mr 5, size
  bl memcpy

SPU DMA from local store to main ram:
  wrch ch16, ls_addr
  wrch ch18, main_addr
  wrch ch19, size
  wrch ch20, dma_tag
  il   2, MFC_PUT_CMD
  wrch ch21, 2

PPU memcpy from near ram to far ram:
  mr 4, near_addr
  mr 3, far_addr
  mr 5, size
  bl memcpy

Conclusion: If you can call memcpy, you can DMA data.
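The same transfers at the C level: a minimal sketch using the mfc_* calls
from spu_mfcio.h. The dma_get/dma_put wrapper names are illustrative:

    /* Minimal sketch: the GET/PUT above via spu_mfcio.h.
       ls must point into local store; ea is a main-ram effective address. */
    #include <spu_mfcio.h>
    #include <stdint.h>

    static void dma_get(volatile void *ls, uint64_t ea, uint32_t size,
                        uint32_t tag)
    {
        mfc_get(ls, ea, size, tag, 0, 0);   /* main ram -> local store */
    }

    static void dma_put(volatile void *ls, uint64_t ea, uint32_t size,
                        uint32_t tag)
    {
        mfc_put(ls, ea, size, tag, 0, 0);   /* local store -> main ram */
    }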
12
Introduction
  • But what about DMA'ing data?
  • But with more control over how and when it's
    sent and retrieved.

13
SPU Synchronization

Fence: Transfer after previous with the same tag
  PUTF   Transfer previous before this PUT
  PUTLF  Transfer previous before this PUT LIST
  GETF   Transfer previous before this GET
  GETLF  Transfer previous before this GET LIST

Barrier: Transfer after previous and before next with the same tag
  PUTB   Fixed order with respect to this PUT
  PUTLB  Fixed order with respect to this PUT LIST
  GETB   Fixed order with respect to this GET
  GETLB  Fixed order with respect to this GET LIST

Example Sync:
  DMA from main ram to local store.
  Do other productive work while DMA is happening...
  (Sync) Wait for DMA to complete:
    il   2, 1
    shl  2, 2, dma_tag
    wrch ch22, 2
    il   3, MFC_TAG_UPDATE_ALL
    wrch ch23, 3
    rdch 2, ch24

Lock Line Reservation:
  GETLLAR  Gets locked line. (PPU lwarx, ldarx)
  PUTLLC   Puts locked line. (PPU stwcx, stdcx)
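The same tag wait in C is a two-call sketch (spu_mfcio.h):

    #include <spu_mfcio.h>

    /* Block until every DMA issued on `tag` has completed. */
    static void dma_wait(unsigned int tag)
    {
        mfc_write_tag_mask(1 << tag);   /* select the tag group     */
        mfc_read_tag_status_all();      /* stall until it completes */
    }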
14
Introduction
  • Bottom line: SPUs are like most CPUs
  • Basics are pretty much the same.
  • Good data design decisions and smart code choices
    see benefits on any platform
  • Good DMA patterns also mean cache coherency.
    Better on every platform
  • Bad choices may work on some, but not others.
  • Xbox 360, PC, Wii, DS, PSP, whatever.

15
Introduction
  • And that's what we're talking about today.
  • Trying to apply smart choices to these
    particular CPUs for our games.
  • That's what console development is.
  • What mistakes we've made along the way.
  • What's worked best.

16
Understanding the SPUs
  • Rule 1: The SPU is not a co-processor!
  • Don't think of SPUs as hiding time behind a
    main PPU loop

17
Understanding the SPUs
  • What clicked with some Insomniacs about the
    SPUs
  • Everything is local
  • Think streams of data
  • Forget conventional OOP
  • Everything is a quadword
  • si intrinsics make things clearer
  • Local memory is really, really fast

18
Designing for the SPUs
  • The ultimate goal: Get everything on the SPUs.
  • Leave the PPU for shuffling stuff around.
  • Complex systems can go on the SPUs
  • Not just streaming systems
  • Used for any kind of task
  • But you do need to consider some things...

19
Designing for the SPUs
  • Data comes first.
  • Goal is minimum energy for transformation.
  • What is energy usage? CPU time. Memory read/write
    time. Stall time.

Input -> Transform() -> Output
20
Designing for the SPUs
  • Design the transformation pipeline back to front.
  • Start with your destination data and work
    backward.
  • Changes are inevitable. This way you pay less for
    them.
  • An example...

21
Front to Back (started here):
  Simulate Glass: Had a really nice looking simulation, but would soon
  find out that this stage was worthless. The level of detail from the
  simulation wasn't necessary, considering that the granularity
  restrictions (memory, CPU) could not support it.
  Generate Crack Geometry (igTriangulate): Then wrote igTriangulate.
  Oops, the only possible output didn't support the glamorous crack
  rendering. Even worse, the inputs being provided to the triangulation
  library weren't adequate; needed more information about retaining
  surface features.
  Render: The rendering part of the pipeline didn't completely support
  the outputs of the triangulation library.

Back to Front:
  Render: Rendered dynamic geometry using fake mesh data.
  Generate Crack Geometry (igTriangulate): Faked inputs to triangulate
  and output transformed data to the render stage.
  Simulate Glass: Wrote the simulation to provide useful (and expected)
  results to the triangulation library.

  • Could have avoided re-writing the simulation if the design process
    was done in the correct order.
  • Good looking results were arrived at with a much smaller processing
    and memory impact.
  • Full simulation turned out to be unnecessary since its outputs
    weren't realistic considering the restrictions of the final stage.
  • Proof that "code as you design" can be disastrous.
  • Working from back to front forces you to think about your pipeline
    in advance. It's easier to fix problems that live in front of final
    code. Wildly scattered fixes and data format changes will only end
    in sorrow.
22
Designing for the SPUs
  • The data the SPUs will transform is the canonical
    data.
  • i.e. Store the data in the best format for the
    case that takes the most resources.

23
Designing for the SPUs
  • Minimize synchronization
  • Start with the smallest synchronization method
    possible.

24
Designing for the SPUs
  • Simplest method is usually lock-free single
    reader, single writer queue.

25
PPU Ordered Write:
  Write Data
  lwsync
  Increment Index

SPU Ordered Write:
  Write Data
  Increment Index (with Fence)
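A minimal single-reader/single-writer queue sketch, written in portable
C11 for illustration: the release store plays the role of the lwsync (or
the SPU's fenced transfer), guaranteeing data is visible before the
index. Names and sizes are ours:

    #include <stdatomic.h>
    #include <stdint.h>

    #define QUEUE_SIZE 64   /* power of two */

    typedef struct {
        uint32_t         items[QUEUE_SIZE];
        _Atomic uint32_t write_index;   /* written only by producer */
        _Atomic uint32_t read_index;    /* written only by consumer */
    } spsc_queue_t;

    int queue_push(spsc_queue_t *q, uint32_t item)
    {
        uint32_t w = atomic_load_explicit(&q->write_index, memory_order_relaxed);
        uint32_t r = atomic_load_explicit(&q->read_index,  memory_order_acquire);
        if (w - r == QUEUE_SIZE)
            return 0;                                 /* full */
        q->items[w % QUEUE_SIZE] = item;              /* 1: write data  */
        atomic_store_explicit(&q->write_index, w + 1,
                              memory_order_release);  /* 2: then index  */
        return 1;
    }

    int queue_pop(spsc_queue_t *q, uint32_t *out)
    {
        uint32_t r = atomic_load_explicit(&q->read_index,  memory_order_relaxed);
        uint32_t w = atomic_load_explicit(&q->write_index, memory_order_acquire);
        if (r == w)
            return 0;                                 /* empty */
        *out = q->items[r % QUEUE_SIZE];
        atomic_store_explicit(&q->read_index, r + 1, memory_order_release);
        return 1;
    }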
26
Designing for the SPUs
  • Fairly straightforward to load balance
  • For constant time transforms, just divide into
    multiple queues
  • For other transforms, use a heuristic to decide
    times and a single entry queue to distribute to
    multiple queues.

27
Designing for the SPUs
  • Then work your way up.
  • Is there a pre-existing sync point that will
    work? (e.g. vsync)?
  • Can you split your data into need-to-sync and
    don't-care?

28
Resistance: Fall of Man (Immediate Effect Updates Only)

PPU:
  Update Game Objects
  Run Immediate Effect Updates
  Finish Frame Update / Start Rendering
  Sync Immediate Effect Updates
  Generate Push Buffer To Render Frame
  Generate Push Buffer To Render Effects
  Finish Push Buffer Setup

SPU:
  Immediate Update

Resistance 2 (Immediate + Deferred Effect Updates, Reduced Sync Points)

PPU:
  Sync Immediate Updates For Last Frame
  Run Deferred Effect Update/Render
  Update Game Objects
  Sync Deferred Updates
  Post Update Game Objects
  Run Effects System Manager
  Finish Frame Update / Start Rendering
  Sync Effect System Manager
  Run Immediate Effect Update/Render
  Generate Push Buffer To Render Frame
  Finish Push Buffer Setup

SPU:
  Deferred Update Render
  System Manager
  Immediate Update Render (can run past end of PPU
  frame due to reduced sync points)
29
Resistance: Fall of Man (Immediate Effect Updates Only)

Legend: PPU time overlapping effects SPU time; PPU time spent on effect
system; PPU time that cannot be overlapped.

PPU:
  Update Game Objects
    No effects can be updated till all game objects have updated, so
    attachments do not lag. Visibility and LOD culling done on PPU
    before creating jobs.
  Run Immediate Effect Updates
    Each effect is a separate SPU job.
  Finish Frame Update / Start Rendering
  Sync Immediate Effect Updates
    Likely to stall here, due to limited window in which to update all
    effects.
  Generate Push Buffer To Render Frame
  Generate Push Buffer To Render Effects
    The number of effects that could render was limited by available
    PPU time to generate their PBs.
  Finish Push Buffer Setup

SPU:
  Immediate Update
    Effect updates running on all available SPUs (four).
30
Resistance 2 (Immediate + Deferred Effect Updates, Reduced Sync Points)

Legend: PPU time overlapping effects SPU time; PPU time spent on effect
system; PPU time that cannot be overlapped.

PPU:
  Sync Immediate Updates For Last Frame
  Run Deferred Effect Update/Render
    Initial PB allocations done on PPU. Single SPU job for each SPU
    (anywhere from one to three).
  Update Game Objects
    Deferred effects are one frame behind, so effects attached to moving
    objects usually should not be deferred.
  Sync Deferred Updates
  Post Update Game Objects
  Run Effects System Manager
    SPU manager handles all visibility and LOD culling previously done
    on the PPU.
  Finish Frame Update / Start Rendering
  Sync Effect System Manager
  Run Immediate Effect Update/Render
    Initial PB allocations done on PPU. Single SPU job for each SPU
    (anywhere from one to three). Immediate updates are allowed to run
    till the beginning of the next frame, as they do not need to sync to
    finish generating this frame's PB. Smaller window available to
    update immediate effects, so only effects attached to moving objects
    should be immediate.
  Generate Push Buffer To Render Frame
    Doing the initial PB alloc on the PPU eliminates the need to sync
    SPU updates before generating the full PB.
  Finish Push Buffer Setup

SPU:
  Deferred Update Render
    Huge amount of previously unused SPU processing time available.
  System Manager
    Generates lists of instances for update jobs to process.
  Immediate Update Render (can run past end of PPU frame due to reduced
  sync points)
31
Designing for the SPUs
  • Write optimizable code.
  • Often the actual optimization can wait a bit.
  • Simple, self-contained loops
  • Over as many iterations as possible
  • No branches

32
Designing for the SPUs
  • Transitioning from "legacy" systems...
  • We're not immune to design problems
  • Schedule, manpower, education, and experience all
    play a part.

33
Designing for the SPUs
  • Example from RCF...
  • FastPathFollowers C++ class
  • And its derived classes
  • Running on the PPU
  • Typical Update() method
  • Derived from a root class of all updatable types

34
Designing for the SPUs
  • Where did this go wrong?
  • What rules were broken?
  • Used domain-model design
  • Code design over data design
  • No advantage of scale
  • No synchronization design
  • No cache consideration

35
Designing for the SPUs
  • Result
  • Typical performance issues
  • Cache misses
  • Unnecessary transformations
  • Didn't scale well
  • Problems after a few hundred updating objects

36
Designing for the SPUs
  • Step 1: Group the data together
  • Where there's one, there's more than one.
  • Before the update() loop was called, intercepted
    all FastPathFollowers and derived classes and
    removed them from the update list.
  • Then kept in a separate array.

37
Designing for the SPUs
  • Step 1: Group the data together
  • Created new function, UpdateFastPathFollowers()
  • Used the new list of same type of data
  • Generic Update() no longer used
  • (Ignored derived class behaviors here.)

38
Designing for the SPUs
  • Step 2: Organize Inputs and Outputs
  • Define what's read, what's write.
  • Inputs: Position, Time, State, Results of
    queries, Paths
  • Outputs: Position, State, Queries, Animation
  • Read inputs. Transform to Outputs.
  • Nothing more complex than that.

39
Designing for the SPUs
  • Step 3: Reduce Synchronization Points
  • Collected all outputs together
  • Collected any external function calls together
    into a command buffer
  • Separate Query and Query-Result
  • Effectively a Queue between systems
  • Reduced from many sync points per object to one
    sync point for the system

40
Designing for the SPUs
  • Before Pattern:
  • Loop Objects
  • Read Input 0
  • Update 0
  • Write Output
  • Read Input 1
  • Update 1
  • Call External Function
  • Block (Sync)

41
Designing for the SPUs
  • After Pattern (Simplified; sketched in code below):
  • Loop Objects
  • Read Input 0, 1
  • Update 0, 1
  • Write Output, Function to Queue
  • Block (Sync)
  • Empty (Execute) Queue
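A minimal sketch of this pattern, with illustrative types and names (not
the actual RCF code):

    #include <stddef.h>

    typedef struct { float pos[3]; int wants_callback; } follower_t;
    typedef struct { void (*fn)(void *arg); void *arg; } deferred_call_t;

    enum { MAX_DEFERRED = 1024 };
    static deferred_call_t s_queue[MAX_DEFERRED];
    static size_t s_queue_count;

    static void spawn_effect(void *arg) { (void)arg; /* external system */ }

    static void defer(void (*fn)(void *), void *arg)
    {
        s_queue[s_queue_count].fn  = fn;    /* record, don't call now */
        s_queue[s_queue_count].arg = arg;
        ++s_queue_count;
    }

    void update_followers(follower_t *followers, size_t count)
    {
        size_t i;
        for (i = 0; i < count; ++i) {          /* loop objects          */
            follower_t *f = &followers[i];
            f->pos[0] += 0.1f;                 /* read, update, write   */
            if (f->wants_callback)
                defer(spawn_effect, f);        /* function to queue     */
        }
        /* block (sync) would happen here; then empty (execute) queue */
        for (i = 0; i < s_queue_count; ++i)
            s_queue[i].fn(s_queue[i].arg);
        s_queue_count = 0;
    }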

42
Designing for the SPUs
  • Next: Added derived-class functionality
  • Similarly simplified derived-class Update()
    functions into functions with clear inputs and
    outputs.
  • Added functions to deferred queue as any other
    function.
  • Advantage Can limit derived functionality based
    on count, LOD, etc.

43
Designing for the SPUs
  • Step 4: Move to PPU thread
  • Now system update has no external dependencies
  • Now system update has no conflicting data areas
    (with other systems)?
  • Now system update does not call non-re-entrant
    functions
  • Simply put in another thread

44
Designing for the SPUs
  • Step 4: Move to PPU thread
  • Add literal sync between system update and queue
    execution
  • Sync can be removed because only single reader
    and single writer to data
  • Queue can be emptied while being filled without
    collision
  • See R&D page on multi-threaded optimization

45
Designing for the SPUs
  • Step 5: Move to SPU
  • Now completely independent thread
  • Can be run anytime
  • Prototype for new SPU system
  • AsyncMobyUpdate
  • Using SPU Shaders

46
Designing for the SPUs
  • Transitioning from the "SPU as coprocessor" model.
  • Example: igPhysics, from Resistance to now...

47
Resistance: Fall of Man Physics Pipeline

PPU:
  Environment Pre-Update (Resolve Anim/IK)
  Environment Update
  Collision Update (start collision jobs while building)
  Sync Collision Jobs and Process Contact Points (Blocked!)
  Associate Rigid Bodies Through Constraints
  Package Rigid Body Pools (start SPU jobs while packing)
  Sync Sim Jobs and Process Rigid Body Data (Blocked!)
  Post Update (Transform Anim Joints)

SPU (collision jobs):
  AABB Tests
  Triangle Intersection
  Sphere, Capsule, etc.
  Collide Prims (generate contacts)
  Pack contact points
  Note: One job per object (box, ragdoll, etc.)

SPU (simulation jobs):
  Unpack Constraints
  Generate Jacobian Data
  Solve Constraints
  Simulate
  Pack Rigid Body Data

The only time hidden between start and stop of jobs is the packing of
job data. The only other savings come from merely running the jobs on
the SPU.
48
Resistance 2 Physics Pipeline

PPU:
  Environment Update
  Start Physics Jobs
  PPU Work: Build Simulation Pools
  Sync Physics Jobs
  Post Update

SPU:
  Triangle Cache Update
    Upload Tri-Cache
    Upload RB Prims
    Upload Intersect Funcs
    Intersection Tests
  Object Cache Update
    Upload Object Cache
    For Each Iteration:
      Upload CO Prims
      Collide Triangles
      Upload Intersect Funcs
      Intersection Tests
      Collide Primitives
  Simulate Pools
    Sort Joint Types
    Per Joint Type: Upload Jacobian Generation Code
    Upload Physics Joints
    Calculate Jacobian Data
    Upload Solver Code
    Solve Constraints
    Integrate
  For Each Physics Object:
    Upload Anim Joints
    Transform Anim Joints Using Rigid Body Data
    Update Rigid Bodies
    Send Update To PPU
49
Optimizing for SPUs
  • Instruction-level optimizations are similar to
    any other platform
  • i.e. Look at the instruction set and write code
    that takes advantage of it.

50
Optimizing for SPUs
  • Memory transfer optimizations are similar to any
    other platform
  • i.e. Organize data for line-length and coherency.
    Separate read and write buffers wherever
    possible.
  • DMA is exactly like cache pre-fetch

51
Optimizing for SPUs
  • Local memory optimizations are similar to any
    other platform
  • i.e. Have a fixed-size buffer, split it into
    smaller buffers for input, output, temporary data
    and code.
  • Organizing 256K is essentially the same process
    as organizing 256M

52
Optimizing for SPUs
  • Memory layout
  • Memory is dedicated to your code.
  • Memory is local to your code.
  • Design so you know what will read and write to
    the memory
  • i.e. DMAs from PPU, other SPUs, etc.
  • Generally fairly straightforward.
  • Remember you can use an offline tool to lay out
    your memory if you want.

53
Optimizing for SPUs
  • Memory layout
  • But never, ever try to use a dynamic memory
    allocator.
  • Malloc for dedicated 256K would be ridiculous.
  • OK. Malloc in a console game would be ridiculous.

54
Optimizing for SPUs
  • Memory layout
  • Rules of thumb
  • Organize everything into blocks of 16b.
  • SPU Reads/Writes only 16b
  • Group same fields together
  • No single object data
  • Similar to most SIMD.
  • Similar to GPUs.
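A small illustration of "group same fields together": structure of
arrays, so each 16b load feeds four lanes of the same field. The types
and names here are assumptions, not engine code:

    #include <spu_intrinsics.h>

    #define MAX_PARTICLES 1024

    /* AoS: one 16b load pulls the mixed fields of a single object */
    typedef struct { float x, y, z, w; } particle_aos_t;

    /* SoA: one 16b load pulls the same field of four objects */
    typedef struct {
        vec_float4 x[MAX_PARTICLES / 4];
        vec_float4 y[MAX_PARTICLES / 4];
        vec_float4 z[MAX_PARTICLES / 4];
    } particles_soa_t;

    /* e.g. integrate four x positions per instruction */
    void integrate_x(particles_soa_t *p, const vec_float4 *vx, vec_float4 dt)
    {
        int i;
        for (i = 0; i < MAX_PARTICLES / 4; ++i)
            p->x[i] = spu_madd(vx[i], dt, p->x[i]);   /* x += vx * dt */
    }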

55
Optimizing for SPUs
  • Memory transfer
  • Usually pretty straightforward
  • Rules of thumb
  • Keep everything 128b aligned
  • Nothing different. Same rule as the PPU.
    (Cache-line is 128b)
  • Transfer as much data as possible together.
    Transform together.
  • Nothing different. Same rule as the PPU. (For
    cache coherency)

56
Optimizing for SPUs
  • Memory transfer
  • Let's dig into these rules of thumb a bit...
  • Shared alignment between main ram and SPU local
    memory is going to be faster. (So pick an
    alignment and stick with it.)
  • Transfer is done in 128b blocks, so alignment
    isn't strictly necessary (but no worries about
    the above if it is)

57
Optimizing for SPUs
  • Number of transfers doesn't really matter (re:
    biggest transfers possible) but...
  • You want to transfer 128b blocks, not scattered.
  • You want to minimize synchronization (sync on
    fewer DMA tags)
  • You have fewer places to worry about alignment.
  • You want to minimize scatter/gather, especially
    considering TLB misses.

58
Optimizing for SPUs
  • Memory transfer
  • Rules of thumb
  • If scattered reads/writes are necessary, use a DMA
    list (not individual DMAs)
  • Advantage over PPU. PPU can't do out-of-order,
    grouped memory transfer.
  • Keeps predictability of in-order execution with
    performance of out-of-order memory transfer.

59
Optimizing for SPUs
  • Speaking of out-of-order transfers...
  • Use DMA fence to dictate order
  • Reads and writes are interleaved.
  • If you need max transfer performance, issue them
    separately.

60
Optimizing for SPUs
  • Memory transfer
  • Double/triple buffer optimization
  • (Fence example below)
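A double-buffered streaming sketch using a fenced GET (spu_mfcio.h): the
mfc_getf won't start until the earlier PUT on the same tag has finished
reading from that buffer, so the two tags are the only sync needed.
Chunk size and names are illustrative:

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define CHUNK 4096

    static char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void process(char *data, uint32_t size);   /* assumed transform */

    void stream(uint64_t ea_in, uint64_t ea_out, uint32_t total)
    {
        uint32_t i, n = total / CHUNK;
        if (n == 0)
            return;
        mfc_get(buf[0], ea_in, CHUNK, 0, 0, 0);        /* prime buffer 0 */
        for (i = 0; i < n; ++i) {
            uint32_t cur = i & 1, nxt = cur ^ 1;
            if (i + 1 < n)  /* fenced prefetch: ordered after buf[nxt]'s PUT */
                mfc_getf(buf[nxt], ea_in + (uint64_t)(i + 1) * CHUNK,
                         CHUNK, nxt, 0, 0);
            mfc_write_tag_mask(1 << cur);              /* wait for input  */
            mfc_read_tag_status_all();
            process(buf[cur], CHUNK);
            mfc_put(buf[cur], ea_out + (uint64_t)i * CHUNK, CHUNK, cur, 0, 0);
        }
        mfc_write_tag_mask(3);                         /* drain both tags */
        mfc_read_tag_status_all();
    }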

61
Optimizing for SPUs
  • Code level optimization
  • Rules of thumb
  • Know the instruction set
  • Use si intrinsics (or asm)
  • Stick with native types
  • Clue: There's only one (qword)

62
Optimizing for SPUs
  • Code level optimization
  • Rules of thumb
  • Code branch-free
  • Not just for branch performance.
  • Branch-free scalar transforms to SIMD extremely
    well.
  • There is a hitch: no SIMD loads or stores.
  • This drives data design decisions. (Branch-free
    example below.)
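A tiny branch-free sketch with SPU intrinsics (spu_intrinsics.h): the
compare builds a per-lane mask and the select picks without branching:

    #include <spu_intrinsics.h>

    /* clamp each lane of v to be >= 0, with no branches */
    vec_float4 clamp_min_zero(vec_float4 v)
    {
        vec_float4 zero = spu_splats(0.0f);
        vec_uint4  gt   = spu_cmpgt(v, zero);  /* all-ones where v > 0 */
        return spu_sel(zero, v, gt);           /* pick v there, else 0 */
    }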

63
Optimizing for SPUs
  • Code level optimization
  • Examples...

64
Optimizing for SPUs
  • Example 1: Vector-Matrix Multiply

65
Vector-Matrix Multiplication
Standard Approach
Multiplying a vector (x,y,z,w) by a 4x4 matrix:

  (x' y' z' w') = (x y z w) (m00 m01 m02 m03)
                            (m10 m11 m12 m13)
                            (m20 m21 m22 m23)
                            (m30 m31 m32 m33)

The result is obtained by multiplying the x by the first row of the
matrix, y by the second, etc., and accumulating these products. This
observation leads to the standard method: broadcast each of the x, y, z
and w across all 4 components, then perform 4 multiply-add type
instructions. Abbreviated versions are possible in the special cases of
w=0 and w=1, which occur frequently. All 3 versions are shown below.
It's a simple matter to extend this approach to the product of two 4x4
matrices. Note that the w=0 and w=1 cases come into play here when our
matrices have (0,0,0,1)^T in the rightmost column.

The general case:
  shufb xxxx, xyzw, xyzw, shuf_AAAA
  shufb yyyy, xyzw, xyzw, shuf_BBBB
  shufb zzzz, xyzw, xyzw, shuf_CCCC
  shufb wwww, xyzw, xyzw, shuf_DDDD
  fm    result, xxxx, m0
  fma   result, yyyy, m1, result
  fma   result, zzzz, m2, result
  fma   result, wwww, m3, result

Case w=0:
  shufb xxxx, xyz0, xyz0, shuf_AAAA
  shufb yyyy, xyz0, xyz0, shuf_BBBB
  shufb zzzz, xyz0, xyz0, shuf_CCCC
  fm    result, xxxx, m0
  fma   result, yyyy, m1, result
  fma   result, zzzz, m2, result

Case w=1:
  shufb xxxx, xyz1, xyz1, shuf_AAAA
  shufb yyyy, xyz1, xyz1, shuf_BBBB
  shufb zzzz, xyz1, xyz1, shuf_CCCC
  fma   result, xxxx, m0, m3
  fma   result, yyyy, m1, result
  fma   result, zzzz, m2, result
66
Vector-Matrix Multiplication
Faster Alternatives
In the simple case where we only wish to transform a single vector, or
multiply a single pair of matrices, the standard approach that was shown
would be most appropriate. But frequently we'll have a collection of
vectors or matrices which we wish to multiply by the same matrix, in
which case we may be prepared to make sacrifices for the sake of
reducing the instruction count.
67
Vector-Matrix Multiplication
Alternative 1
By simply preswizzling the matrix, we can reduce the number of shuffles
needed.

The general case: Preswizzle the matrix as
  (m00 m11 m22 m33)
  (m10 m21 m32 m03)
  (m20 m31 m02 m13)
  (m30 m01 m12 m23)
then transform a vector using the sequence:
  rotqbyi yzwx, xyzw, 4
  rotqbyi zwxy, xyzw, 8
  rotqbyi wxyz, xyzw, 12
  fm      result, xyzw, m0_
  fma     result, yzwx, m1_, result
  fma     result, zwxy, m2_, result
  fma     result, wxyz, m3_, result

Case w=0, with (0,0,0,1)^T in the rightmost matrix column: Preswizzle
the matrix as
  (m00, m11, m22, 0)
  (m10, m21, m02, 0)
  (m20, m01, m12, 0)
This can be done efficiently using selb:
  fsmbi mask_0F00, 0x0F00
  fsmbi mask_00F0, 0x00F0
  selb  m0_, m0, m1, mask_0F00
  selb  m1_, m1, m2, mask_0F00
  selb  m2_, m2, m0, mask_0F00
  selb  m0_, m0_, m2, mask_00F0
  selb  m1_, m1_, m0, mask_00F0
  selb  m2_, m2_, m1, mask_00F0
The vector multiply then only requires 5 instructions:
  shufb yzx0, xyz0, xyz0, shuf_BCA0
  shufb zxy0, xyz0, xyz0, shuf_CAB0
  fm    result, xyz0, m0_
  fma   result, yzx0, m1_, result
  fma   result, zxy0, m2_, result

Case w=1, with (0,0,0,1)^T in the rightmost matrix column: Use the same
preswizzle as the w=0 case, leaving row 3 unchanged. Again 5
instructions suffice:
  shufb yzx0, xyz0, xyz0, shuf_BCA0
  shufb zxy0, xyz0, xyz0, shuf_CAB0
  fma   result, xyz0, m0_, m3
  fma   result, yzx0, m1_, result
  fma   result, zxy0, m2_, result

68
Vector-Matrix Multiplication
Alternative 2
If were dealing with the general case, we can
reduce the instruction count further still
  • Using the preswizzle (m02, m13, m20, m31)?
  • (m12, m23,
    m30, m01)?
  • (m00, m11,
    m22, m33)?
  • (m10, m21,
    m32, m03)?
  • we can carry out the vector multiply
  • in just 6 instructions
  • rotqbyi yzwx, xyzw, 4
  • fm temp, xyzw, m0_
  • fma temp, yzwx, m1_, temp
  • rotqbyi result, temp, 8
  • fma result, xyzw, m2_, result
  • fma result, yzwx, m3_, result

This approach yields no additional benefits for
the w0 and w1 cases however.
Conclusion
Single vector/matrix times a single matrix use
the Standard Approach. Many vectors/matrices
times a single matrix use Alternative 1. Many
general vectors/matrices (i.e. anything in w)
times a single matrix in a pipelined loop use
Alternative 2.
69
Optimizing for SPUs
  • Example 2: Matrix Transpose

70
Matrix Transposition
Standard Approach
A general 4x4 matrix can be transposed in 8 shuffles as follows:

  (x0, y0, z0, w0)       (x0, x1, x2, x3)
  (x1, y1, z1, w1)  ->   (y0, y1, y2, y3)
  (x2, y2, z2, w2)       (z0, z1, z2, z3)
  (x3, y3, z3, w3)       (w0, w1, w2, w3)

  shufb t0, a0, a2, shuf_AaBb   // t0 = (x0, x2, y0, y2)
  shufb t1, a1, a3, shuf_AaBb   // t1 = (x1, x3, y1, y3)
  shufb t2, a0, a2, shuf_CcDd   // t2 = (z0, z2, w0, w2)
  shufb t3, a1, a3, shuf_CcDd   // t3 = (z1, z3, w1, w3)
  shufb b0, t0, t1, shuf_AaBb   // b0 = (x0, x1, x2, x3)
  shufb b1, t0, t1, shuf_CcDd   // b1 = (y0, y1, y2, y3)
  shufb b2, t2, t3, shuf_AaBb   // b2 = (z0, z1, z2, z3)
  shufb b3, t2, t3, shuf_CcDd   // b3 = (w0, w1, w2, w3)

Many variations are possible by changing the particular shuffles used,
but they all end up doing the same thing in the same amount of work. The
version shown above is a good choice because it only requires two
constants.
71
Matrix Transposition
Faster 4x4
By using a different set of shuffles, a couple of the shuffles can be
replaced by select-bytes, which has lower latency:

  shufb t0, a0, a1, shuf_AaCc   // t0 = (x0, x1, z0, z1)
  shufb t1, a2, a3, shuf_CcAa   // t1 = (z2, z3, x2, x3)
  shufb t2, a0, a1, shuf_BbDd   // t2 = (y0, y1, w0, w1)
  shufb t3, a2, a3, shuf_DdBb   // t3 = (w2, w3, y2, y3)
  shufb b2, t0, t1, shuf_CDab   // b2 = (z0, z1, z2, z3)
  shufb b3, t2, t3, shuf_CDab   // b3 = (w0, w1, w2, w3)
  selb  b0, t0, t1, mask_00FF   // b0 = (x0, x1, x2, x3)
  selb  b1, t2, t3, mask_00FF   // b1 = (y0, y1, y2, y3)

This version is quicker by 1 cycle, at the expense of requiring more
constants.
72
Matrix Transposition
3x4 -> 4x3
Here is an example that uses only 6 shuffles:

  (x0, y0, z0, w0)       (x0, x1, x2, 0)
  (x1, y1, z1, w1)  ->   (y0, y1, y2, 0)
  (x2, y2, z2, w2)       (z0, z1, z2, 0)
                         (w0, w1, w2, 0)

  shufb t0, a0, a1, shuf_AaBb   // t0 = (x0, x1, y0, y1)
  shufb t1, a0, a1, shuf_CcDd   // t1 = (z0, z1, w0, w1)
  shufb b0, t0, a2, shuf_ABa0   // b0 = (x0, x1, x2, 0)
  shufb b1, t0, a2, shuf_CDb0   // b1 = (y0, y1, y2, 0)
  shufb b2, t1, a2, shuf_ABc0   // b2 = (z0, z1, z2, 0)
  shufb b3, t1, a2, shuf_CDd0   // b3 = (w0, w1, w2, 0)

Note that care must be taken if the destination matrix is the same as
the source. In this case the last 2 lines of code must be swapped to
avoid prematurely overwriting a2.
73
Matrix Transposition
3x3
Here is an example that uses only 5 shuffles:

  (x0, y0, z0, w0)       (x0, x1, x2, 0)
  (x1, y1, z1, w1)  ->   (y0, y1, y2, 0)
  (x2, y2, z2, w2)       (z0, z1, z2, 0)

  shufb t0, a0, a1, shuf_AaBb   // t0 = (x0, x1, y0, y1)
  shufb t1, a0, a1, shuf_CcDd   // t1 = (z0, z1, w0, w1)
  shufb b0, t0, a2, shuf_ABa0   // b0 = (x0, x1, x2, 0)
  shufb b1, t0, a2, shuf_CDb0   // b1 = (y0, y1, y2, 0)
  shufb b2, t1, a2, shuf_ABc0   // b2 = (z0, z1, z2, 0)
74
Matrix Transposition
3x3 (reduced latency)
If we seek the lowest latency, this example is 2 cycles quicker than the
last example, at the expense of an extra instruction and an extra
constant:

  (x0, y0, z0, w0)       (x0, x1, x2, 0)
  (x1, y1, z1, w1)  ->   (y0, y1, y2, 0)
  (x2, y2, z2, w2)       (z0, z1, z2, 0)

  shufb t0, a1, a2, shuf_0Aa0   // t0 = ( 0, x1, x2, 0)
  shufb t1, a2, a0, shuf_b0B0   // t1 = (y0,  0, y2, 0)
  shufb t2, a0, a1, shuf_Cc00   // t2 = (z0, z1,  0, 0)
  selb  b0, a0, t0, mask_0FFF   // b0 = (x0, x1, x2, 0)
  selb  b1, a1, t1, mask_F0FF   // b1 = (y0, y1, y2, 0)
  selb  b2, a2, t2, mask_FF0F   // b2 = (z0, z1, z2, 0)

Hybrid versions are also possible, which may be of use when trying to
balance even vs. odd counts.
75
Optimizing for SPUs
  • Example 3: 8-bit palette lookup
  • Flip the problem around
  • Instead of looking up the index for each byte...
  • Loop through the palette and compare each
    quadword of indices, masking in any matching
    results (sketched below)
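A hedged sketch of the inverted lookup (an 8-bit to 8-bit remap to keep
it short; the names are ours, not the shipping code):

    #include <spu_intrinsics.h>

    void palette_remap(const vec_uchar16 *indices,   /* 16 indices/qword */
                       const unsigned char palette[256],
                       vec_uchar16 *out, int nquads)
    {
        int i, q;
        for (q = 0; q < nquads; ++q)
            out[q] = spu_splats((unsigned char)0);
        for (i = 0; i < 256; ++i) {                  /* loop the palette */
            vec_uchar16 entry = spu_splats(palette[i]);
            vec_uchar16 iv    = spu_splats((unsigned char)i);
            for (q = 0; q < nquads; ++q) {
                vec_uchar16 match = spu_cmpeq(indices[q], iv);
                out[q] = spu_sel(out[q], entry, match);  /* mask in hits */
            }
        }
    }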

76
Optimizing for SPUs
  • When is it better to use asm?
  • When you know facts the compiler cannot (and can
    take advantage of them)
  • i.e. almost always.

77
Optimizing for SPUs
  • When is asm really worth it?
  • Case-by-case.
  • Time, experience, performance, practice.
  • Doesn't it make the code unmaintainable?
  • Not much different from using intrinsics.
  • Especially if you use macro-asm tools.
  • e.g. for register coloring - that's really the
    tedious part of editing asm.

78
Optimizing for SPUs
  • Writing asm rules-of-thumb
  • Minimize instruction count
  • Minimize trace latency
  • (Instruction count takes precedence)
  • Balance even/odd instruction pipelines
  • Minimize memory accesses
  • Can block DMA or instruction fetch

79
The 256K Barrier
  • The solution is simple
  • Upload more code when you need it.
  • Upload more data when you need it.
  • Data is managed by traditional means
  • i.e. Double, triple fixed-buffers, etc.
  • Code is just data.
  • Can we manage code the same way we manage data?

80
SPU Shaders
  • SPU Shaders are
  • Fragments of code used in existing systems
    (Physics, Animation, Effects, AI, etc.)
  • Code is loaded at a location pre-determined by the
    system.
  • Custom (Data/Interface) for each system.
  • An expansion of an existing system (e.g.
    pipelined stages)
  • Custom modifications of system data.
  • Way of delivering feedback to other systems
    outside the scope of the current system.

81
SPU Shaders
  • SPU Shaders are NOT
  • A generic, general-purpose system.
  • A system of any kind, actually.
  • Globally scheduled.

82
SPU Shaders
  • Why is it called a shader?
  • Shares important similarities to GPU shaders.
  • Native code fragments
  • Part of a larger system
  • In-context execution
  • Independently optimizable
  • Most important: Concept is approachable.

83
SPU Shaders
  • Don't try to solve everyone's problems
  • Solutions that try to solve all problems tend to
    cause more problems than they solve.

84
SPU Shaders
  • Easy to Implement
  • Pick stage(s) in system kernel to inject shaders.
  • Define available inputs and outputs.
  • Collect common functions.
  • Compile shaders as data.
  • Sort instance data based on shader type(s)
  • Load shader on-demand based on data select.
  • Call shaders. (Load-and-call sketch below.)
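A hedged sketch of the last three steps (sort, load on demand, call).
The table layout, signature, and names are illustrative, not Insomniac's
actual interface:

    #include <spu_mfcio.h>
    #include <spu_intrinsics.h>
    #include <stdint.h>

    #define SHADER_MAX_SIZE 4096   /* fixed maximum size, per Rule 3 */

    typedef void (*shader_entry_t)(void *instances, void *common);

    static char shader_buf[SHADER_MAX_SIZE] __attribute__((aligned(128)));

    void run_sorted_shaders(const uint64_t *code_ea, const uint32_t *code_size,
                            void *const *instance_lists, int type_count,
                            void *common)
    {
        int t;
        for (t = 0; t < type_count; ++t) {
            /* code is just data: DMA the fragment to a fixed buffer */
            mfc_get(shader_buf, code_ea[t], code_size[t], 0, 0, 0);
            mfc_write_tag_mask(1 << 0);
            mfc_read_tag_status_all();
            spu_sync();   /* make new code visible to instruction fetch */
            ((shader_entry_t)(void *)shader_buf)(instance_lists[t], common);
        }
    }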

85
SPU Shaders
  • What data is being transformed?
  • What are the inputs?
  • What are the outputs?
  • What can be modified?

86
SPU Shaders
  • Collect the common functions...
  • Always loaded by the system
  • e.g.
  • DMA wrapper functions
  • Debugging functions
  • Common transformation functions

87
Example Structure Passed to Shader
struct common_t
{
    void (*print_str)(const char *str);
    void (*dma_wait)(uint32_t tag);
    void (*dma_send)(void *ls, uint32_t ea, uint32_t size, uint32_t tag);
    void (*dma_recv)(void *ls, uint32_t ea, uint32_t size, uint32_t tag);

    char     *ls;
    uint32_t  ls_size;
    uint32_t  data_ea;
    uint32_t  data_size;
    uint32_t  dma_tags[2];
};
88
SPU Shaders
  • System Shader Configuration...
  • System knows where the fragments are.
  • System knows when to call the fragments.
  • System doesn't know what the fragments do.
  • Fragments are in main RAM.
  • Fragments don't need to be fixed.

89
SPU Shaders
  • System Shader Configuration.
  • Manage fragment memory
  • Simplest method
  • Double buffer,
  • On-demand,
  • Fixed maximum size,
  • By-index from array,...

90
SPU Shaders
  • Create the shader code...
  • Code is just data
  • No special distinguishing feature on the SPUs
  • Overlays or additional jobs are too complex and
    heavyweight.
  • Just want load and execute.
  • No special system needed.

91
SPU Shaders
  • Create the shader code..
  • Method 1: Shader as PPU header
  • Compile shader as normal, to obj file.
  • Dump obj file using spu-objdump.
  • Convert dump to header using script.
  • This is what we started with.

92
SPU Shaders
  • Create the shader code..
  • Method 2: Use elf file
  • Requires extra compile step, but more debugger
    friendly.
  • This is what we're doing now.
  • Other methods too, use whatever works for you.

93
SPU Shaders
  • Calling the shader...
  • Nothing could be easier.
  • ShaderEntry shader = (ShaderEntry)(addr of fragment);
  • shader( data, common );

94
SPU Shaders
  • Debugging Shaders...
  • Fragments are small
  • Fragments have well defined inputs and outputs.
  • Ideal for unit tests in separate framework.
  • Test on PS3/Linux box.
  • Alternatives
  • Debug on PPU (intrinsics are portable)
  • Temporarily link in shader.

95
SPU Shaders
  • Runtime debugging
  • It is a problem with the first method.
  • Using the full elf, we have debugging info.
  • Now works transparently in our debugger.

96
SPU Shaders
  • Rule 1: Don't Manage Data for Shaders
  • Just give shaders a buffer and fixed size.
  • Shaders should depend on that size, so leave room
    for system changes.
  • Best size depends on system.
  • (Maybe 4K, maybe 32K)
  • Don't read or write from/to the shader buffer.

97
SPU Shaders
  • System-specific
  • Multiple lists of instances to modify or transform
  • Context data
  • Shader-internal (local)
  • EA passed by system
  • Fixed buffer
  • Shader-shared (global)
  • EA passed by system

98
SPU Shaders
  • Rule 2: Don't Manage DMA for Shaders
  • Give a fixed number of DMA tags to the shader
  • (Grab them in the entry function and pass them down)
  • Avoid GetDmaTagFromParentSystem()
  • Give DMA functions to shaders
  • To allow the system to run with any job manager,
    or none
  • Don't use shader tags for other purposes

99
SPU Shaders
  • Rule 3: Enforce a fixed maximum size for shader
    code.
  • System can be maintained.
  • Rule 4: Shaders are always called in a clear,
    well defined context.
  • i.e. Part of a larger system.

100
SPU Shaders
  • Rule 5: Fixed parameter list for shaders,
    per-system (or sub-system)
  • Don't want to re-compile all shaders.
  • Don't want to manage dynamic parameter lists.
  • Rule 6: Shaders should be given as many instances
    as possible.
  • More optimizable.

101
SPU Shaders
  • Rule 7: Don't break the rules.
  • You'll end up with a new job manager.
  • You'll end up with a big headache.

102
SPU Shaders
  • Where are we using these?
  • Physics, Effects, Animation, some AI Update
  • Also experimenting with pre-vertex shaders on the
    SPUs
  • And experimenting with giving some of that
    control to the artists (directly generating code
    from a tool...)

(Slides 103-111: no transcript)
112
Conclusion
  • It's not that complicated.
  • Good data and good design work well on the SPUs
    (and will work well anywhere)
  • Sometimes you can get away with bad design and
    bad data on other platforms
  • ...for now. Bad design will not survive this
    generation.
  • Lots of opportunities for optimization.

113
Credits
  • This was based on the hard work and dedication of
    the Insomniac Tech Team. You guys are awesome.