PPU Optimizations

About This Presentation

Slides: 199

Provided by: cellperf

Transcript and Presenter's Notes

1
PPU Optimizations

Mike Acton
Highmoon Studios
macton_at_highmoonstudios.com

2
Why Optimize for the PPU?

Putting a game on the PPU of the PS3 is like
putting a game on the IOP of the PS2.
Not enough time?
Pre-existing codebase?
Someone else's codebase?
Stepping stone to good SPU usage?
Don't think an SPU can handle it?

3
Where Can We Optimize?

Data Hazards
Scalar Fixed Point (Integer)
Scalar Floating Point
Data Design
VMX (Altivec)
Code Layout

4
Data Hazards

Basic Hazards
Write After Read
Write After Write
Read After Write
Avoiding Hazards
Use const
Use restrict
Use inline

5
TIMEOUT Understanding restrict

Why was restrict introduced into C99?
What transformations can the compiler now make?
What is the danger in using restrict?
A little about overlapping regions...

7
Why was restrict introduced into C99?

Not possible to prove that two memory windows do
not overlap

8
Why was restrict introduced into C99?

Not possible to prove that two memory windows do
not overlap
Not possible to prove that two memory access
patterns do not overlap

9
Why was restrict introduced into C99?

Not possible to prove that two memory windows do
not overlap
Not possible to prove that two memory access
patterns do not overlap

The scheduler must always presume that memory
accesses can overlap.
Avoid generating data hazards.

10
Why was restrict introduced into C99?

Not possible to prove that two memory windows do
not overlap
Not possible to prove that two memory access
patterns do not overlap

The scheduler must always presume that memory
accesses can overlap.
Avoid generating data hazards.
Unless there was some keyword...
restrict is a "no hazards will be generated"
contract

11
TIMEOUT Understanding restrict

Why was restrict introduced into C99?
What transformations can the compiler now make?
What is the danger in using restrict?
A little about overlapping regions...

12
What transformations can the compiler now make?

Re-order loads and stores!

The scheduler can presume that memory accesses
cannot overlap.
Responsibility of the programmer: avoid generating
data hazards.
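
A minimal sketch of the contract described above, written for a generic C99 host compiler rather than the PPU toolchain. The function names are hypothetical. Both functions compute the same result for non-overlapping buffers; only the second one promises the compiler that the three windows never overlap, so loads from a and b may be moved ahead of earlier stores to dst.

```c
#include <stddef.h>

/* No promise: the compiler must assume dst may overlap a or b,
   so every store to dst[i] could change a later load. */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}

/* restrict version: the caller guarantees that dst is not accessed
   through a or b (and vice versa) for the lifetime of these pointers.
   Breaking that guarantee is undefined behavior. */
void add_arrays_restrict(float *restrict dst,
                         const float *restrict a,
                         const float *restrict b,
                         size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```

Calling add_arrays_restrict with dst pointing into a or b is the "broken contract" case from the following slides: it may appear to work at one optimization level and fail at another.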

13
What transformations can the compiler now make?

Re-order loads and stores!

NOTES ON USE
Restricted pointers may be copied.
Only leaf pointers should be used.
Use of restrict should be very common.
Typical access is most likely exclusive.
Publish data requirements in declarations
Not doing this - Very hard to find bugs
Start using immediately.
Somewhat difficult to refactor restricted
requirements into pre-existing code.

17
What transformations can the compiler now make?

Re-order loads and stores!
Potentially manage structures in registers

18
TIMEOUT Understanding restrict

Why was restrict introduced into C99?
What transformations can the compiler now make?
What is the danger in using restrict?
A little about overlapping regions...

19
What is the danger in using restrict?

Programmer breaking the restrict contract

20
What is the danger in using restrict?

Programmer breaking the restrict contract
Unexpected results
Hard to find bugs

21
What is the danger in using restrict?

Programmer breaking the restrict contract
Unexpected results
Hard to find bugs
Unit testing on host machine

22
What is the danger in using restrict?

Programmer breaking the restrict contract
Unexpected results
Hard to find bugs
Unit testing on host machine
Make sure restrict is supported
Compile with -fstrict-aliasing

23
TIMEOUT Understanding restrict

Why was restrict introduced into C99?
What transformations can the compiler now make?
What is the danger in using restrict?
A little about overlapping regions...

24
A little about overlapping regions...

IMPORTANT! Not restricting the thing being
pointed to.

25
A little about overlapping regions...

IMPORTANT! Not restricting the thing being
pointed to.
Generally, data within a stripe is not
re-ordered.

26
A little about overlapping regions...

IMPORTANT! Not restricting the thing being
pointed to.
Generally, data within a stripe is not
re-ordered.
Use multiple levels of striped data to restrict
fields independently.

Can point to same address

27
Scalar Fixed Point

What size integer?
Single Load/Store
Aligned access (Preference Load or Store?)
Cache hints
Using floating point registers
Minimize status bit dependencies

29
What size integer?

General Purpose Use 64 bits ( int64_t /
uint64_t )

30
What size integer?

General Purpose Use 64 bits ( int64_t /
uint64_t )
Often sign extends after each arithmetic
operation
Signed 32 bits

31
What size integer?

General Purpose Use 64 bits ( int64_t /
uint64_t )
Often sign extends after each arithmetic
operation
Signed 32 bits

int32_t ab;
int32_t abc;
ab  = a + b;
abc = ab + c;

32
What size integer?

General Purpose Use 64 bits ( int64_t /
uint64_t )
Often sign extends after each arithmetic
operation
Signed 32 bits

int32_t ab;
int32_t abc;
ab  = a + b;
abc = ab + c;

add    ab0, a, b
extsw  ab1, ab0
add    abc0, ab1, c
extsw  abc1, abc0

33
What size integer?

General Purpose Use 64 bits ( int64_t /
uint64_t )
Often sign extends after each arithmetic
operation
Signed 32 bits
Typically defers extension until after multiple
operations
Unsigned 32 bits
16 bits
8 bits

34
What size integer?

General Purpose Use 64 bits ( int64_t /
uint64_t )
Often sign extends after each arithmetic
operation
Signed 32 bits
Typically defers extension until after multiple
operations
Unsigned 32 bits
16 bits
8 bits

int16_t ab;
int16_t abc;
ab  = a + b;
abc = ab + c;

35
What size integer?

General Purpose Use 64 bits ( int64_t /
uint64_t )
Often sign extends after each arithmetic
operation
Signed 32 bits
Typically defers extension until after multiple
operations
Unsigned 32 bits
16 bits
8 bits

int16_t ab;
int16_t abc;
ab  = a + b;
abc = ab + c;

add    ab, a, b
add    abc0, ab, c
extsh  abc1, abc0

36
What size integer?

General Purpose Use 64 bits ( int64_t /
uint64_t )
Often sign extends after each arithmetic
operation
Signed 32 bits
Typically defers extension until after multiple
operations
Unsigned 32 bits
16 bits
8 bits
Reminder int is signed 32 bits

37
What size integer?

General Purpose Use 64 bits ( int64_t /
uint64_t )
Often sign extends after each arithmetic
operation
Signed 32 bits
Typically defers extension until after multiple
operations
Unsigned 32 bits
16 bits
8 bits
Reminder int is signed 32 bits
Avoid bool
bool is only good for creating more branching

38
What size integer?

General Purpose Use 64 bits ( int64_t /
uint64_t )
Often sign extends after each arithmetic
operation
Signed 32 bits
Typically defers extension until after multiple
operations
Unsigned 32 bits
16 bits
8 bits
Reminder int is signed 32 bits
Avoid bool
bool is only good for creating more branching
Most logical instructions add/sub
64 bits
1 cycle throughput
2 cycle latency

39
What size integer?

Multiply and Divide ( 32 and 64 bits )

40
What size integer?

Multiply and Divide ( 32 and 64 bits )
All integer multiply instructions stall FXU
(6-15 cycles)

41
What size integer?

Multiply and Divide ( 32 and 64 bits )
All integer multiply instructions stall FXU
(6-15 cycles)
64 bit integer divide instructions stall FXU
(10-70 cycles)

42
What size integer?

Multiply and Divide ( 32 and 64 bits )
All integer multiply instructions stall FXU
(6-15 cycles)
64 bit integer divide instructions stall FXU
(10-70 cycles)
32 bit integer divide instructions stall FXU
(10-38 cycles)

43
Scalar Fixed Point

What size integer?
Single Load/Store
Aligned access (Preference Load or Store?)
Cache hints
Using floating point registers
Minimize status bit dependencies

44
Single Load/Store

Avoid Multiple Load/Store (lmw, stmw, etc.)

45
Single Load/Store

Avoid Multiple Load/Store (lmw, stmw, etc.)

Microcoded instructions have 11 cycle decode
penalty
Microcoded instructions cannot be interrupted
Microcoded instructions require pipeline flush

46
Single Load/Store

Avoid Multiple Load/Store (lmw, stmw, etc.)
Always load/store on size boundary

47
Single Load/Store

Avoid Multiple Load/Store (lmw, stmw, etc.)
Always load/store on size boundary
Load Store Unit does manage misaligned
loads/stores but

48
Single Load/Store

Avoid Multiple Load/Store (lmw, stmw, etc.)
Always load/store on size boundary
Load Store Unit does manage misaligned
loads/stores but
Major penalties for
Crossing 32B boundaries ( microcoded
instructions )
Crossing Page boundaries ( microcoded
instructions )

49
Single Load/Store

Avoid Multiple Load/Store (lmw, stmw, etc.)
Always load/store on size boundary
If you must break this rule

50
Single Load/Store

Avoid Multiple Load/Store (lmw, stmw, etc.)
Always load/store on size boundary
If you must break this rule
Misaligned load aligned store

51
Single Load/Store

Avoid Multiple Load/Store (lmw, stmw, etc.)
Always load/store on size boundary
If you must break this rule Misaligned load
aligned store
Load Hit Store

52
Single Load/Store

Avoid Multiple Load/Store (lmw, stmw, etc.)
Always load/store on size boundary
If you must break this rule Misaligned load
aligned store
Load Hit Store
No store forwarding from the Store Queue (STQ).

53
Single Load/Store

Avoid Multiple Load/Store (lmw, stmw, etc.)
Always load/store on size boundary
If you must break this rule Misaligned load
aligned store
Load Hit Store
No store forwarding from the Store Queue (STQ).
Enter the Load Miss Queue (LMQ)

54
Single Load/Store

Avoid Multiple Load/Store (lmw, stmw, etc.)
Always load/store on size boundary
If you must break this rule Misaligned load
aligned store
Load Hit Store
No store forwarding from the Store Queue (STQ).
Enter the Load Miss Queue (LMQ)
Most likely problem!

Avoid
Small functions
Globals (especially in loops)

55
Single Load/Store

Avoid Multiple Load/Store (lmw, stmw, etc.)
Always load/store on size boundary
If you must break this rule Misaligned load
aligned store
Load Hit Store
No store forwarding from the Store Queue (STQ).
Enter the Load Miss Queue (LMQ)
Most likely problem!
(40-80 cycles)
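
A hedged sketch of the "avoid globals in loops" advice above. The names g_total, sum_into_global and sum_via_local are hypothetical. In the first function the input pointer has the same type as the global, so the compiler cannot rule out that values points at g_total and may store and reload the global on every iteration, which is the load-hit-store pattern. The second function keeps the running total in a local (a register candidate) and stores once.

```c
#include <stddef.h>
#include <stdint.h>

uint64_t g_total; /* hypothetical global accumulator */

/* Each iteration may store g_total and then load it again,
   because values[i] could alias g_total. */
void sum_into_global(const uint64_t *values, size_t n)
{
    g_total = 0;
    for (size_t i = 0; i < n; ++i) {
        g_total += values[i];
    }
}

/* Accumulate in a local and write the global once:
   loads and stores are separated. */
void sum_via_local(const uint64_t *values, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        total += values[i];
    }
    g_total = total;
}
```

Whether a given compiler already performs this transformation depends on what it can prove about aliasing; writing the local explicitly removes the guesswork.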

56
Single Load/Store

Avoid Multiple Load/Store (lmw, stmw, etc.)
Always load/store on size boundary
If you must break this rule Misaligned load
aligned store
Load Hit Store
Store Hit Load

57
Single Load/Store

Avoid Multiple Load/Store (lmw, stmw, etc.)
Always load/store on size boundary
If you must break this rule Misaligned load
aligned store
Load Hit Store
Store Hit Load
All younger loads re-issued

58
Single Load/Store

Avoid Multiple Load/Store (lmw, stmw, etc.)
Always load/store on size boundary
If you must break this rule Misaligned load
aligned store
Load Hit Store
Store Hit Load
Load Hit Load

59
Single Load/Store

Avoid Multiple Load/Store (lmw, stmw, etc.)
Always load/store on size boundary
If you must break this rule Misaligned load
aligned store
Load Hit Store
Store Hit Load
Load Hit Load
Un-snooped loads re-issued

60
Single Load/Store

Avoid Multiple Load/Store (lmw, stmw, etc.)
Always load/store on size boundary
If you must break this rule Misaligned load
aligned store
Load Hit Store
Store Hit Load
Load Hit Load
Load Hit Reload

61
Single Load/Store

Avoid Multiple Load/Store (lmw, stmw, etc.)
Always load/store on size boundary
If you must break this rule Misaligned load
aligned store
Load Hit Store
Store Hit Load
Load Hit Load
Load Hit Reload
Younger load enters LMQ
General Case Penalty hidden until LMQ full

62
Single Load/Store

Avoid Multiple Load/Store (lmw, stmw, etc.)
Always load/store on size boundary
If you must break this rule Misaligned load
aligned store
Load Hit Store
Store Hit Load
Load Hit Load
Load Hit Reload
Separate Loads and Stores

63
Scalar Fixed Point

What size integer?
Single Load/Store
Aligned access (Preference Load or Store?)
Cache hints
Reads Prefetch Block (dcbt)
Writes Zero Block (dcbz)
Using floating point registers
Minimize status bit dependencies
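A small sketch of the read-side cache hint, assuming a GCC-compatible compiler. __builtin_prefetch is a GCC/Clang builtin; on PowerPC targets it is typically lowered to dcbt, but that mapping is an assumption to verify in the generated assembly. The function name and the 128-byte line size constant are illustrative. There is no equally portable builtin for dcbz, so the write-side hint is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum an array one cache line at a time, hinting the next line
   before working on the current one. */
uint64_t sum_with_prefetch(const uint64_t *data, size_t n)
{
    enum { ELEMS_PER_LINE = 16 }; /* 16 * 8 bytes = 128-byte line (assumed) */
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i += ELEMS_PER_LINE) {
        const size_t end = (i + ELEMS_PER_LINE < n) ? i + ELEMS_PER_LINE : n;

        if (end < n) {
            /* rw = 0 (read), locality = 0 (streaming data) */
            __builtin_prefetch(&data[end], 0, 0);
        }
        for (size_t j = i; j < end; ++j) {
            sum += data[j];
        }
    }
    return sum;
}
```

The prefetch only helps when the work per line is long enough to cover the memory latency; for a loop this small it mainly illustrates where the hint goes.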

64
Scalar Fixed Point

What size integer?
Single Load/Store
Aligned access (Preference Load or Store?)
Cache hints
Using floating point registers
For Move Larger functions with free FPU
registers
Minimize status bit dependencies

65
Scalar Fixed Point

What size integer?
Single Load/Store
Aligned access (Preference Load or Store?)
Cache hints
Using floating point registers
Minimize status bit dependencies

66
Minimize status bit dependencies

Remember GPR are not the only dependencies

67
Minimize status bit dependencies

Remember GPR are not the only dependencies
Condition Register (CR) is a major source of
problems for the scheduler

68
Minimize status bit dependencies

Remember GPR are not the only dependencies
Condition Register (CR) is a major source of
problems for the scheduler
CR is read/modified by
Comparisons
Boolean operations
Branches

69
Minimize status bit dependencies

Remember GPR are not the only dependencies
Condition Register (CR) is a major source of
problems for the scheduler
FPU and VXU instructions that use CR will block
CR.

70
Minimize status bit dependencies

Remember GPR are not the only dependencies
Condition Register (CR) is a major source of
problems for the scheduler
FPU and VXU instructions that use CR will block
CR.
FXU instructions that use CR will be re-issued.

71
Scalar Fixed Point

Branch Elimination
Prefer bit operations to comparisons
Combine branches

72
Scalar Fixed Point

Branch Elimination
Prefer bit operations to comparisons
Combine branches

Even well-predicted branches can impact
performance
The instruction may not be in the fetch buffer
The instruction may not be in the icache
If functions are not properly aligned, small
subroutines can cause another icache miss on the
calling function on return.
The level-2 cache is shared with data; memory
fetch can impact more than code performance.

77
Scalar Fixed Point

Branch Elimination
Prefer bit operations to comparisons
Combine branches

Eliminating branches
Increases size of basic blocks
Decreases number of blocks
Good for the compiler's code scheduler
(optimization)!

80
Scalar Fixed Point

Branch Elimination
Prefer bit operations to comparisons
Combine branches

Eliminating branches
Increases size of basic blocks
Decreases number of blocks
Decreases opportunities for branch penalties
Good for the compiler's code scheduler
(optimization)!

82
TIMEOUT Replacing comparisons

Examples
Branch on greater than
Branch on not zero
Integer select
A little about the classic example min/max
Some other examples...

83
EXAMPLE
uint64_t test_1_0( const uint64_t a, const uint64_t b )
{
  if ( ( compare_a(a) != 0 ) || ( compare_b(b) != 0 ) )
  {
    return (b);
  }
  return (a);
}

84
Simple inline test functions
static inline uint64_t compare_a( const uint64_t a )
{
  return ( a & (uint64_t)0x10000100 );
}
static inline uint64_t compare_b( const uint64_t b )
{
  return ( b & (uint64_t)0x80004000 );
}

85
lis    r6,0x1000
ori    r5,r6,0x0100
li     r7,0x4000
and    r10,r3,r5
oris   r9,r7,0x8000
cmpdi  cr7,r10,0
and    r0,r4,r9
mr     r11,r3
cmpdi  cr6,r0,0
bne-   cr7,
beq-   cr6,
mr     r11,r4
mr     r3,r11
blr
Two comparisons
ppu-lv2-gcc (GCC) 3.4.1 (Cell 2.3 Aug 18 2005)
CFLAGS = -pedantic -std=c99 -O3 -Wall -fstrict-aliasing

86
lis    r6,0x1000
ori    r5,r6,0x0100
li     r7,0x4000
and    r10,r3,r5
oris   r9,r7,0x8000
cmpdi  cr7,r10,0
and    r0,r4,r9
mr     r11,r3
cmpdi  cr6,r0,0
bne-   cr7,
beq-   cr6,
mr     r11,r4
mr     r3,r11
blr
(A) GCC knows there are no side-effects: both comparisons are started before the branch.
(B) PPC has CR logical instructions. Why aren't cr6 and cr7 merged?

87
lis    r6,0x1000
ori    r5,r6,0x0100
li     r7,0x4000
and    r10,r3,r5
oris   r9,r7,0x8000
cmpdi  cr7,r10,0
and    r0,r4,r9
mr     r11,r3
cmpdi  cr6,r0,0
bne-   cr7,
beq-   cr6,
mr     r11,r4
mr     r3,r11
blr
Two branches: major optimization barrier if this function is inlined.

88
lis    r6,0x1000
ori    r5,r6,0x0100
li     r7,0x4000
and    r10,r3,r5
oris   r9,r7,0x8000
cmpdi  cr7,r10,0
and    r0,r4,r9
mr     r11,r3
cmpdi  cr6,r0,0
bne-   cr7,
beq-   cr6,
mr     r11,r4
mr     r3,r11
blr

Limited to one issue per cycle
Branches
CR Modify
CR Read
More difficult to schedule
May inhibit multithreading

89
lis    r6,0x1000
ori    r5,r6,0x0100
li     r7,0x4000
and    r10,r3,r5
oris   r9,r7,0x8000
cmpdi  cr7,r10,0
and    r0,r4,r9
mr     r11,r3
cmpdi  cr6,r0,0
bne-   cr7,
beq-   cr6,
mr     r11,r4
mr     r3,r11
blr
Bonus penalty: two moves for false, three (!!) moves for true.

90
Combine the comparisons
uint64_t test_1_1( const uint64_t a, const uint64_t b )
{
  const uint64_t cmpa  = compare_a(a);
  const uint64_t cmpb  = compare_b(b);
  const uint64_t cmpab = cmpa | cmpb;
  if ( cmpab != 0 )
  {
    return (b);
  }
  return (a);
}
Reminder: No side effects

91
lis    r7,0x1000
li     r6,0x4000
mr     r11,r3
ori    r5,r7,0x100
oris   r3,r6,0x8000
and    r9,r11,r5
and    r0,r4,r3
or     r10,r9,r0
cmpdi  r10,0
mr     r3,r4
bnelr
mr     r3,r11
blr
Better. (We got what we expected.)

92
lis    r7,0x1000
li     r6,0x4000
mr     r11,r3
ori    r5,r7,0x100
oris   r3,r6,0x8000
and    r9,r11,r5
and    r0,r4,r3
or     r10,r9,r0
cmpdi  r10,0
mr     r3,r4
bnelr
mr     r3,r11
blr
One comparison

93
lis    r7,0x1000
li     r6,0x4000
mr     r11,r3
ori    r5,r7,0x100
oris   r3,r6,0x8000
and    r9,r11,r5
and    r0,r4,r3
or     r10,r9,r0
cmpdi  r10,0
mr     r3,r4
bnelr
mr     r3,r11
blr
One branch

94
lis    r7,0x1000
li     r6,0x4000
mr     r11,r3
ori    r5,r7,0x100
oris   r3,r6,0x8000
and    r9,r11,r5
and    r0,r4,r3
or     r10,r9,r0
cmpdi  r10,0
mr     r3,r4
bnelr
mr     r3,r11
blr
But still: two moves for false, three moves for true.
Pretty good indicator of an optimization barrier.

95
QUESTION! Does the ?: syntax make a difference in
what the compiler will generate?
uint64_t test_1_2( const uint64_t a, const uint64_t b )
{
  return ( ( compare_a(a) | compare_b(b) ) ? b : a );
}

96
QUESTION! Does the ?: syntax make a difference in
what the compiler will generate?
uint64_t test_1_2( const uint64_t a, const uint64_t b )
{
  return ( ( compare_a(a) | compare_b(b) ) ? b : a );
}
ANSWER! No. Knock yourself out.

97
Introducing cmpnz_u64
static inline uint64_t cmpnz_u64( const uint64_t arg )
{
  const uint64_t sign        = arg | (-arg);
  const int64_t  signed_sign = (int64_t)sign;
  const uint64_t sign_ext    = (uint64_t)( signed_sign >> 0x1f );
  return ( sign_ext );
}

98
static inline uint64_t cmpnz_u64( const uint64_t arg )
{
  const uint64_t sign        = arg | (-arg);
  const int64_t  signed_sign = (int64_t)sign;
  const uint64_t sign_ext    = (uint64_t)( signed_sign >> 0x1f );
  return ( sign_ext );
}
/* All non-zero values will set the sign bit */

99
static inline uint64_t cmpnz_u64( const uint64_t arg )
{
  const uint64_t sign        = arg | (-arg);
  const int64_t  signed_sign = (int64_t)sign;
  const uint64_t sign_ext    = (uint64_t)( signed_sign >> 0x1f );
  return ( sign_ext );
}
/* Make sure we end up using sra */

100
static inline uint64_t cmpnz_u64( const uint64_t arg )
{
  const uint64_t sign        = arg | (-arg);
  const int64_t  signed_sign = (int64_t)sign;
  const uint64_t sign_ext    = (uint64_t)( signed_sign >> 0x1f );
  return ( sign_ext );
}
/* Saturate with sign bit.
   ( arg == 0 ) ? 0x00000000_00000000 : 0xffffffff_ffffffff */

101
What do we expect for cmpnz_u64?
neg    result, arg
or     result, result, arg
sradi  result, result, 31
About three fixed point instructions.

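A host-side sketch of the same mask idea, under two stated assumptions: the function name is hypothetical, and the compiler implements >> on a negative int64_t as an arithmetic shift (implementation-defined in C, but what GCC does). Note one deliberate difference from the slide: the listing shifts by 0x1f, which saturates to all ones only when the lowest set bit of arg lies in the low 32 bits (true for the masks in this example). The sketch shifts by 63 so that any non-zero 64-bit value produces a full mask.

```c
#include <stdint.h>

/* Returns 0 when arg == 0, otherwise 0xffffffffffffffff. */
static inline uint64_t cmpnz_u64_sketch(const uint64_t arg)
{
    /* arg | -arg sets every bit from the lowest set bit of arg upward,
       so the top bit is set for every non-zero arg. */
    const uint64_t sign        = arg | ((uint64_t)0 - arg);
    const int64_t  signed_sign = (int64_t)sign;

    /* Assumed arithmetic shift: replicate the top bit into all 64 bits. */
    return (uint64_t)(signed_sign >> 63);
}
```
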
102
Why use masks instead of predicates?
104
Predicates

Recommended by IBM's PowerPC Compiler Writer's
Guide

105
Predicates

Recommended by IBM's PowerPC Compiler Writer's
Guide
One bit (of) value identical to condition
register
false = 0, true = 1

106
Predicates

Recommended by IBM's PowerPC Compiler Writer's
Guide
One bit (of) value identical to condition
register
false = 0, true = 1

static inline uint64_t predicate_cmpnz_u64( const uint64_t arg )
{
  const uint64_t sign      = arg | (-arg);
  const uint64_t predicate = sign >> 0x1f;
  return (predicate);
}
Similar code to mask. Generates similar
instructions, except srl is used instead of sra.

107
Predicates

Recommended by IBM's PowerPC Compiler Writer's
Guide
One bit (of) value identical to condition
register
false = 0, true = 1
Potentially many predicates can be stored if
registers are limited

108
Predicates

Recommended by IBM's PowerPC Compiler Writer's
Guide
One bit (of) value identical to condition
register
false = 0, true = 1
Potentially many predicates can be stored if
registers are limited
Easy to generate code that uses either predicate
or CR

109
Predicates

Recommended by IBM's PowerPC Compiler Writer's
Guide
One bit (of) value identical to condition
register
false = 0, true = 1
Potentially many predicates can be stored if
registers are limited
Easy to generate code that uses either predicate
or CR
Doesn't break higher level code

110
Predicates

Recommended by IBM's PowerPC Compiler Writer's
Guide
One bit (of) value identical to condition
register
false = 0, true = 1
Potentially many predicates can be stored if
registers are limited
Easy to generate code that uses either predicate
or CR
Doesn't break higher level code
A lot of pre-existing code

111
Predicates

Recommended by IBM's PowerPC Compiler Writer's
Guide
One bit (of) value identical to condition
register
false = 0, true = 1
Potentially many predicates can be stored if
registers are limited
Easy to generate code that uses either predicate
or CR
Doesn't break higher level code
A lot of pre-existing code
But you can select with masks.

112
Introducing sel_u64
static inline uint64_t sel_u64( const uint64_t mask, const uint64_t a, const uint64_t b )
{
  const uint64_t b_result = b & mask;
  const uint64_t a_result = a & (~mask);
  const uint64_t result   = b_result | a_result;
  return (result);
}

113
static inline uint64_t sel_u64( const uint64_t mask, const uint64_t a, const uint64_t b )
{
  const uint64_t b_result = b & mask;
  const uint64_t a_result = a & (~mask);
  const uint64_t result   = b_result | a_result;
  return (result);
}
/* b_result is b if mask is set (else zero) */

114
static inline uint64_t sel_u64( const uint64_t mask, const uint64_t a, const uint64_t b )
{
  const uint64_t b_result = b & mask;
  const uint64_t a_result = a & (~mask);
  const uint64_t result   = b_result | a_result;
  return (result);
}
/* a_result is a if mask is not set (else zero) */

115
static inline uint64_t sel_u64( const uint64_t mask, const uint64_t a, const uint64_t b )
{
  const uint64_t b_result = b & mask;
  const uint64_t a_result = a & (~mask);
  const uint64_t result   = b_result | a_result;
  return (result);
}
/* One of the two results will be zero, the other
   will be the one we want. ORing them together
   will just move the result we want into the final
   result register. */

116
What do we expect for sel_u64?
and    b_result, b, mask
andc   a_result, a, mask
or     result, a_result, b_result
About three fixed point instructions.

117
and    b_result, b, mask
andc   a_result, a, mask
or     result, a_result, b_result
PPC has two fixed point logical-with-complement
operators that make building and working with
masks much simpler.
a andc b = a and (not b)
a orc b  = a or (not b)

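The select and the two complement operators can be checked on a host machine with a few lines of plain C. This is a sketch with hypothetical names (sel_u64_sketch, andc_u64, orc_u64); the helper functions only spell out what the andc and orc instructions compute and are not compiler intrinsics.

```c
#include <stdint.h>

/* a andc b: a AND (NOT b) */
static inline uint64_t andc_u64(const uint64_t a, const uint64_t b)
{
    return a & ~b;
}

/* a orc b: a OR (NOT b) */
static inline uint64_t orc_u64(const uint64_t a, const uint64_t b)
{
    return a | ~b;
}

/* Bitwise select: where a mask bit is 1 take the bit from b,
   where it is 0 take the bit from a. */
static inline uint64_t sel_u64_sketch(const uint64_t mask,
                                      const uint64_t a,
                                      const uint64_t b)
{
    return (b & mask) | andc_u64(a, mask);
}
```

Because the select works bit by bit, a partial mask merges fields from both inputs; an all-zero or all-one mask picks a or b whole.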
118
Let's make a new version of our test that uses
our new mask functions
uint64_t test_2_0( const uint64_t a, const uint64_t b )
{
  const uint64_t cmpa   = compare_a(a);
  const uint64_t cmpb   = compare_b(b);
  const uint64_t ab_sel = cmpnz_u64( cmpa | cmpb );
  const uint64_t result = sel_u64( ab_sel, a, b );
  return (result);
}

119
What do we get?
lis    r0,0x1000
li     r5,0x4000
ori    r12,r0,0x100
oris   r10,r5,0x8000
and    r8,r3,r12
and    r9,r4,r10
or     r7,r8,r9
neg    r6,r7
or     r5,r6,r7
sradi  r0,r5,0x1f
and    r4,r4,r0
andc   r3,r3,r0
or     r3,r4,r3
blr
No branches. No compares (no CR dependencies).

120
With masks (After)
lis    r0,0x1000
li     r5,0x4000
ori    r12,r0,0x100
oris   r10,r5,0x8000
and    r8,r3,r12
and    r9,r4,r10
or     r7,r8,r9
neg    r6,r7
or     r5,r6,r7
sradi  r0,r5,0x1f
and    r4,r4,r0
andc   r3,r3,r0
or     r3,r4,r3
blr
With compare and branch (Before)
lis    r7,0x1000
li     r6,0x4000
mr     r11,r3
ori    r5,r7,0x100
oris   r3,r6,0x8000
and    r9,r11,r5
and    r0,r4,r3
or     r10,r9,r0
cmpdi  r10,0
mr     r3,r4
bnelr
mr     r3,r11
blr

121
With masks (After)
lis    r0,0x1000
li     r5,0x4000
ori    r12,r0,0x100
oris   r10,r5,0x8000
and    r8,r3,r12
and    r9,r4,r10
or     r7,r8,r9
neg    r6,r7
or     r5,r6,r7
sradi  r0,r5,0x1f
and    r4,r4,r0
andc   r3,r3,r0
or     r3,r4,r3
blr
With compare and branch (Before)
lis    r7,0x1000
li     r6,0x4000
mr     r11,r3
ori    r5,r7,0x100
oris   r3,r6,0x8000
and    r9,r11,r5
and    r0,r4,r3
or     r10,r9,r0
cmpdi  r10,0
mr     r3,r4
bnelr
mr     r3,r11
blr

About the same number of instructions (for size),
but
Inlines extremely well (Can be pipelined)
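
In the spirit of "unit testing on host machine", the before and after versions can be compared directly. The sketch below restates both variants with hypothetical names (test_branch, test_mask) and assumes an arithmetic right shift for negative int64_t values; it shifts by 63 rather than the listing's 0x1f so the mask is full width for any input.

```c
#include <stdint.h>

static inline uint64_t compare_a_sketch(const uint64_t a)
{
    return a & (uint64_t)0x10000100;
}

static inline uint64_t compare_b_sketch(const uint64_t b)
{
    return b & (uint64_t)0x80004000;
}

/* Before: compare and branch. */
uint64_t test_branch(const uint64_t a, const uint64_t b)
{
    if ((compare_a_sketch(a) | compare_b_sketch(b)) != 0) {
        return b;
    }
    return a;
}

/* After: build a mask and select, no branch and no CR use in the source. */
uint64_t test_mask(const uint64_t a, const uint64_t b)
{
    const uint64_t cmpab = compare_a_sketch(a) | compare_b_sketch(b);
    const uint64_t sign  = cmpab | ((uint64_t)0 - cmpab);
    const uint64_t mask  = (uint64_t)((int64_t)sign >> 63); /* assumed sra */
    return (b & mask) | (a & ~mask);
}
```

Both functions should agree for every input pair; a loop over a handful of boundary values is enough to catch a wrong shift count or a swapped select argument.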

122

TIMEOUT Replacing comparisons

Examples
Branch on greater than
Branch on not zero
Integer select
A little about the classic example min/max
Some other examples...

123
A little about min/max...
static inline sint64_t min2_s64( const sint64_t arg0, const sint64_t arg1 )
{
  return ( ( arg0 < arg1 ) ? arg0 : arg1 );
}

One compare (uses CR)
One branch

124
What we're looking for...
static inline sint64_t min2_s64( const sint64_t arg0, const sint64_t arg1 )
{
  return sel_u64( cmpgte_s64( arg0, arg1 ), arg0, arg1 );
}

125
Breakdown of cmpgte
static inline uint64_t cmpgte_s64( const int64_t arg0, const int64_t arg1 )
{
  const int64_t  msb0              = arg0 >> 0x3f;
  const int64_t  msb1              = arg1 >> 0x3f;
  const int64_t  signs_neq         = msb0 ^ msb1;
  const int64_t  signs_eq          = ~signs_neq;
  const uint64_t always_gt         = (uint64_t)(msb1 & signs_neq);
  const int64_t  diff              = arg1 - arg0;
  const int64_t  neg_diff          = -diff;
  const int64_t  diff_nz_msb       = diff | neg_diff;
  const int64_t  diff_nz           = diff_nz_msb >> 0x3f;
  const uint64_t diff_z            = ~(uint64_t)diff_nz;
  const int64_t  diff_iif_signs_eq = diff & signs_eq;
  const uint64_t diff_gt           = (uint64_t)(diff_iif_signs_eq >> 0x3f);
  const uint64_t result_gt         = diff_gt | always_gt;
  const uint64_t result_gte        = result_gt | diff_z;
  return (result_gte);
}

126
Breakdown of cmpgte
static inline uint64_t cmpgte_s64( const int64_t arg0, const int64_t arg1 )
{
  const int64_t  msb0              = arg0 >> 0x3f;
  const int64_t  msb1              = arg1 >> 0x3f;
  const int64_t  signs_neq         = msb0 ^ msb1;
  const int64_t  signs_eq          = ~signs_neq;
  const uint64_t always_gt         = (uint64_t)(msb1 & signs_neq);
  const int64_t  diff              = arg1 - arg0;
  const int64_t  neg_diff          = -diff;
  const int64_t  diff_nz_msb       = diff | neg_diff;
  const int64_t  diff_nz           = diff_nz_msb >> 0x3f;
  const uint64_t diff_z            = ~(uint64_t)diff_nz;
  const int64_t  diff_iif_signs_eq = diff & signs_eq;
  const uint64_t diff_gt           = (uint64_t)(diff_iif_signs_eq >> 0x3f);
  const uint64_t result_gt         = diff_gt | always_gt;
  const uint64_t result_gte        = result_gt | diff_z;
  return (result_gte);
}
Checking sign independently:
cmpgte_s64( INT64_MAX, INT64_MIN )

127
__FAST_MATH__ for integers?
#if defined(__FAST_MATH__)
static inline uint64_t cmpgte_s64( const int64_t arg0, const int64_t arg1 )
{
  const int64_t  diff        = arg1 - arg0;
  const int64_t  neg_diff    = -diff;
  const int64_t  diff_nz_msb = diff | neg_diff;
  const int64_t  diff_nz     = diff_nz_msb >> 0x3f;
  const uint64_t diff_z      = ~(uint64_t)diff_nz;
  const uint64_t diff_gt     = (uint64_t)(diff >> 0x3f);
  const uint64_t result_gte  = diff_gt | diff_z;
  return (result_gte);
}
#endif

128
We don't need for min/max...
#if defined(__FAST_MATH__)
static inline uint64_t cmpgte_s64( const int64_t arg0, const int64_t arg1 )
{
  const int64_t  diff        = arg1 - arg0;
  const int64_t  neg_diff    = -diff;
  const int64_t  diff_nz_msb = diff | neg_diff;
  const int64_t  diff_nz     = diff_nz_msb >> 0x3f;
  const uint64_t diff_z      = ~(uint64_t)diff_nz;
  const uint64_t diff_gt     = (uint64_t)(diff >> 0x3f);
  const uint64_t result_gte  = diff_gt | diff_z;
  return (result_gte);
}
#endif
Checking for arg0 == arg1

129
Better!
#if defined(__FAST_MATH__)
static inline uint64_t cmpgt_s64( const int64_t arg0, const int64_t arg1 )
{
  const int64_t  diff    = arg1 - arg0;
  const uint64_t diff_gt = (uint64_t)(diff >> 0x3f);
  return (diff_gt);
}
#endif
Change to cmpgt...
static inline sint64_t min2_s64( const sint64_t arg0, const sint64_t arg1 )
{
  return sel_u64( cmpgt_s64( arg0, arg1 ), arg0, arg1 );
}

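A host-side sketch of the fast cmpgt-based min, with hypothetical names. Two assumptions are made explicit: the subtraction is done on unsigned values so the wraparound is well defined in C (signed overflow would be undefined behavior), and >> on a negative int64_t is an arithmetic shift. The last case in the usage note shows why the full version checks the signs independently: when the difference overflows, the fast version picks the wrong operand.

```c
#include <stdint.h>

/* All ones when arg0 > arg1, assuming arg1 - arg0 does not overflow. */
static inline uint64_t cmpgt_s64_fast(const int64_t arg0, const int64_t arg1)
{
    /* Unsigned subtraction wraps instead of invoking undefined behavior. */
    const int64_t diff = (int64_t)((uint64_t)arg1 - (uint64_t)arg0);
    return (uint64_t)(diff >> 63); /* assumed arithmetic shift */
}

/* Branch-free min: select arg1 when arg0 > arg1, otherwise arg0. */
static inline int64_t min2_s64_fast(const int64_t arg0, const int64_t arg1)
{
    const uint64_t mask = cmpgt_s64_fast(arg0, arg1);
    return (int64_t)(((uint64_t)arg1 & mask) | ((uint64_t)arg0 & ~mask));
}
```

For ordinary inputs this matches the branching min. For min2_s64_fast( INT64_MAX, INT64_MIN ) the wrapped difference is 1, the mask is zero, and the function returns INT64_MAX, which is the price of dropping the sign check.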
130
Scalar Floating Point

Double versus float
Single Load/Store
Aligned access
Using fixed point registers

132
Double versus float

Expect mostly similar performance

133
Double versus float

Expect mostly similar performance
Differences to note

double: fsqrt, fre, frsqrte, fsel, fabs, fnabs
float:  fsqrts, fres, frsqrtes
134
Double versus float

Expect mostly similar performance
Differences to note

double: fsqrt, fre, frsqrte, fsel, fabs, fnabs
float:  fsqrts, fres, frsqrtes

ppu-lv2-gcc (GCC) 3.4.1 (Cell 2.3 Aug 18 2005)
Does NOT generate these instructions!

135
fsqrt
static inline double ppc_fsqrt( const double arg )
{
  double result;
  __asm__ ( "fsqrt %0, %1" : "=f"(result) : "f"(arg) );
  return (result);
}
fsqrts
static inline float ppc_fsqrts( const float arg )
{
  float result;
  __asm__ ( "fsqrts %0, %1" : "=f"(result) : "f"(arg) );
  return (result);
}

136
fres
+/- 1/256
static inline float ppc_fres( const float arg )
{
  float result;
  __asm__ ( "fres %0, %1" : "=f"(result) : "f"(arg) );
  return (result);
}
frs
+/- 4 ulps
static inline float ppc_frs( const float arg )
{
  const float estimate   = ppc_fres( arg );
  const float refinement = -( estimate * arg - 1.0f );
  const float result     = refinement * estimate + estimate;
  return (result);
}
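The refinement above is one Newton-Raphson step on the reciprocal estimate. A minimal off-target sketch follows. The `sketch_` names are hypothetical, and the low-precision estimate is simulated by truncating mantissa bits of an exact divide, which is only a stand-in for a hardware estimate instruction and does not model its actual error.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a hardware reciprocal estimate: computes 1/x,
   then clears the low 16 bits so only about 7 mantissa bits survive. */
static inline float sketch_recip_estimate(float x)
{
    float r = 1.0f / x;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFF0000u;
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* One Newton-Raphson step: estimate + estimate * (1 - estimate * x). */
static inline float sketch_recip_refined(float x)
{
    const float estimate   = sketch_recip_estimate(x);
    const float refinement = -(estimate * x - 1.0f);
    return refinement * estimate + estimate;
}
```

Each step roughly squares the relative error, so a coarse estimate gains about twice as many good bits from a single multiply-add pair.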
137
frsqrte
static inline double ppc_frsqrte( const double arg )
{
  double result;
  __asm__ ( "frsqrte %0, %1" : "=f"(result) : "f"(arg) );
  return (result);
}
frsqrtes
static inline float ppc_frsqrtes( const double arg )
{
  float result;
  __asm__ ( "frsqrte %0, %1" : "=f"(result) : "f"(arg) );
  return (result);
}
138
fsel
static inline double ppc_fsel( const double test_gez, const double arg0, const double arg1 )
{
  double result;
  __asm__ ( "fsel %0, %1, %2, %3"
            : "=f"(result)
            : "f"(test_gez), "f"(arg0), "f"(arg1) );
  return (result);
}
139
fsels
static inline float ppc_fsels( const double test_gez, const double arg0, const double arg1 )
{
  float result;
  __asm__ ( "fsel %0, %1, %2, %3"
            : "=f"(result)
            : "f"(test_gez), "f"(arg0), "f"(arg1) );
  return (result);
}
140
fabs
static inline double ppc_fabs( const double arg )
{
  double result;
  __asm__ ( "fabs %0, %1" : "=f"(result) : "f"(arg) );
  return (result);
}
fabss
static inline float ppc_fabss( const double arg )
{
  float result;
  __asm__ ( "fabs %0, %1" : "=f"(result) : "f"(arg) );
  return (result);
}
141
fnabs
static inline double ppc_fnabs( const double arg )
{
  double result;
  __asm__ ( "fnabs %0, %1" : "=f"(result) : "f"(arg) );
  return (result);
}
fnabss
static inline float ppc_fnabss( const double arg )
{
  float result;
  __asm__ ( "fnabs %0, %1" : "=f"(result) : "f"(arg) );
  return (result);
}
142
Double versus float

Expect mostly similar performance
Differences to note
Use -ffast-math (if reordering is OK)

static inline double fmul_re( const double arg0, const double arg1 )
{
  const double result = arg0 * ( 1.0 / arg1 );
  return (result);
}
143
Double versus float

Expect mostly similar performance
Differences to note
Use -ffast-math (if reordering is OK)

/* const double result = arg0 * ( 1.0 / arg1 ); */

/* -fno-fast-math (default) generates: */
lfd   oned, 0(addr_of_oned)
fdiv  temp, oned, arg1
fmul  result, arg0, temp

/* -ffast-math generates: */
fdiv  result, arg0, arg1
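Why the compiler needs permission for this rewrite: the two forms round differently, so they are not guaranteed to give bit-identical results. A small sketch with hypothetical `sketch_` names, meant to be built without -ffast-math so that each function keeps its written form:

```c
/* The expression as written: one divide to form the reciprocal,
   then a multiply. Two roundings. */
static inline double sketch_mul_recip(double arg0, double arg1)
{
    return arg0 * (1.0 / arg1);
}

/* The form the reordering produces: a single divide. One rounding. */
static inline double sketch_div(double arg0, double arg1)
{
    return arg0 / arg1;
}
```

For some inputs the two results may differ in the last bit (dividing 49.0 by 49.0 is a commonly cited case), which is what "if reordering is OK" is asking you to accept.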
144
Scalar Floating Point

Double versus float
Single Load/Store
Aligned access
Using fixed point registers

145
Scalar Floating Point

Double versus float
Single Load/Store
Aligned access
Always load on address aligned to size
Misalignment generates interrupt
Using fixed point registers
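A minimal sketch of checking "aligned to size" before a load. The helper and buffer names are hypothetical, and the check assumes the size is a power of two, which holds for float (4) and double (8).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero when ptr is aligned to size bytes.
   size is assumed to be a power of two. */
static inline int sketch_is_aligned(const void *ptr, size_t size)
{
    return ((uintptr_t)ptr & (uintptr_t)(size - 1)) == 0;
}

/* A byte buffer explicitly aligned for double access (C11 _Alignas). */
static _Alignas(8) unsigned char sketch_buffer[16];
```

A check like this fits naturally in a debug-build assert wherever floats or doubles are read out of packed byte streams.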

146
Scalar Floating Point

Double versus float
Single Load/Store
Aligned access
Using fixed point registers
Same idea OK for moves.

147
Scalar Floating Point

Branch Elimination
Avoid bit operations
Floating point select
Combine branches

148
Scalar Floating Point

Branch Elimination
Avoid bit operations
Load-Hit-Store Hazard
Use fctiw / fctiwz / stfiwx if result is integer
Floating point select
Combine branches
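To illustrate "avoid bit operations": a sketch of absolute value done two ways. The `sketch_` names are hypothetical. In the first version memcpy stands in for the store and reload that moving a float into an integer register implies, which is the pattern that sets up a load-hit-store.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Bit-operation version: the value leaves the floating point side,
   is masked as an integer, and comes back. */
static inline float sketch_abs_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= 0x7FFFFFFFu;            /* clear the sign bit */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Floating point version: expressed so it can stay in float registers. */
static inline float sketch_abs_fpu(float x)
{
    return fabsf(x);
}
```

Both return the same value; the difference is only in which register file the work happens in.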

149
Scalar Floating Point

Branch Elimination
Avoid bit operations
Floating point select
Combine branches

150
Floating point select

Slightly different than integer select on mask
Test is a double: selects on test >= 0.0 (gez)

151
fsel_gez
static inline double ppc_fsel_gez( const double test_gez, const double arg0, const double arg1 )
{
  double result;
  __asm__ ( "fsel %0, %1, %2, %3"
            : "=f"(result)
            : "f"(test_gez), "f"(arg0), "f"(arg1) );
  return (result);
}
152
fsel_ltz
static inline double ppc_fsel_ltz( const double test_ltz, const double arg0, const double arg1 )
{
  double result;
  __asm__ ( "fsel %0, %1, %2, %3"
            : "=f"(result)
            : "f"(test_ltz), "f"(arg1), "f"(arg0) );
  return (result);
}
153
fsel_gte
static inline double ppc_fsel_gte( const double cmp0, const double cmp1, const double arg0, const double arg1 )
{
  const double test_gez = cmp0 - cmp1;
  double result;
  __asm__ ( "fsel %0, %1, %2, %3"
            : "=f"(result)
            : "f"(test_gez), "f"(arg0), "f"(arg1) );
  return (result);
}
154
fmax (with fsel)
static inline double ppc_fmax( const double arg0, const double arg1 )
{
  return (ppc_fsel_gte( arg0, arg1, arg0, arg1 ));
}

fsub  temp, arg0, arg1
fsel  result, temp, arg0, arg1
blr
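A portable model of the select semantics, useful for checking logic off-target. The `sketch_` names are hypothetical and the conditional expression only models what fsel computes; whether a given compiler turns it into an fsel is not guaranteed.

```c
/* Model of fsel: arg0 when test_gez >= 0.0, otherwise arg1.
   A NaN test fails the comparison and so yields arg1. */
static inline double sketch_fsel(double test_gez, double arg0, double arg1)
{
    return (test_gez >= 0.0) ? arg0 : arg1;
}

/* Select arg0 when cmp0 >= cmp1 by testing the sign of the difference. */
static inline double sketch_fsel_gte(double cmp0, double cmp1,
                                     double arg0, double arg1)
{
    return sketch_fsel(cmp0 - cmp1, arg0, arg1);
}

/* Maximum built on the select, mirroring the fsub/fsel pair. */
static inline double sketch_fmax(double arg0, double arg1)
{
    return sketch_fsel_gte(arg0, arg1, arg0, arg1);
}
```

Because the comparison is done through a subtraction, inputs whose difference is not representable as an ordinary number (two infinities of the same sign, for example) fall through to the second argument.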
155
fmax (with compare/branch)
static inline double ppc_fmax( const double arg0, const double arg1 )
{
  return ( ( arg0 >= arg1 ) ? arg0 : arg1 );
}

fmr    result, arg0
fcmpu  temp, arg0, arg1
bgelr- temp
fmr    result, arg1
blr
156
fmax (with compare/branch)
static inline double ppc_fmax( const double arg0, const double arg1 )
{
  return ( ( arg0 >= arg1 ) ? arg0 : arg1 );
}

fmr    result, arg0
fcmpu  temp, arg0, arg1
bgelr- temp
fmr    result, arg1
blr

Blocks CR
157
fmax (with compare/branch)
static inline double ppc_fmax( const double arg0, const double arg1 )
{
  return ( ( arg0 >= arg1 ) ? arg0 : arg1 );
}

fmr    result, arg0
fcmpu  temp, arg0, arg1
bgelr- temp
fmr    result, arg1
blr

Blocks CR
Optimization (scheduling) barrier
158
Scalar Floating Point

Branch Elimination
Avoid bit operations
Floating point select
Combine branches
Similar benefit to fixed point

159
Data Design

Basic Principles
Know the data and access patterns
Be prepared to reorganize the data
Every bit counts
Design for the hardware
Sort by dominant type
Clearly distinguish RO/WO/RW data
Almost everything belongs to a set

160
Data Design

Cache Friendly Data
Minimize cache footprint
Sort by data-reuse lifetime
Separate scalars from arrays
Use table-based storage patterns
Tile sparse queues with sequential data
Merge multiple source tiles
Keep write-once data off-cache

161
Data Design

Cache Friendly Data (cont.)
Minimize write-multiple data
Keep short life read-write data in register file
Pipeline long life read-write data
Subclass data based on independent functionality

162
Data Design

Allocation
Static versus Dynamic
Alignment
System pages
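A minimal sketch of static versus dynamic allocation with explicit alignment, using C11 facilities. The names are hypothetical, and the 128-byte figure is an assumption standing in for the target's cache line size.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed cache line size for this sketch; adjust for the target. */
#define SKETCH_CACHE_LINE 128u

/* Static: size and alignment are fixed at build time. */
static _Alignas(SKETCH_CACHE_LINE) unsigned char sketch_pool[2 * SKETCH_CACHE_LINE];

/* Dynamic: allocate count whole cache lines, line aligned.
   aligned_alloc requires the size to be a multiple of the alignment. */
static inline void *sketch_alloc_lines(size_t count)
{
    return aligned_alloc(SKETCH_CACHE_LINE, count * SKETCH_CACHE_LINE);
}
```

Memory from sketch_alloc_lines is released with free() as usual.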

163
VMX (Altivec)

What is VMX?
What are the advantages to using it?
Are there any dangers?

166
VMX What are the advantages to using it?

More registers

167
VMX What are the advantages to using it?

More registers
Much higher throughput

168
VMX What are the advantages to using it?

More registers
Much higher throughput
Instruction throughput: 1 cycle
Latency for simple instructions: 4 cycles
Latency for complex instructions: 9 cycles
Latency for float add/sub/madd/nmsub: 12 cycles
Latency for float re/rsqrte: 12 cycles
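Reading the numbers together: with one instruction issued per cycle and a 12 cycle float latency, a chain of dependent adds uses only a fraction of the available throughput, so independent work has to fill the gap. A scalar sketch of the idea with a hypothetical function name; four accumulators are used for brevity, and the same pattern applies to vector registers.

```c
#include <stddef.h>

/* Sum with four independent accumulators so consecutive adds do not
   wait on each other's results. count is assumed to be a multiple of 4. */
static inline float sketch_sum4(const float *values, size_t count)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    for (size_t i = 0; i < count; i += 4) {
        acc0 += values[i + 0];
        acc1 += values[i + 1];
        acc2 += values[i + 2];
        acc3 += values[i + 3];
    }
    /* Combine the partial sums once, at the end. */
    return (acc0 + acc1) + (acc2 + acc3);
}
```

The summation order differs from a single running total, so results can differ in the low bits for general inputs.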

169
VMX What are the advantages to using it?

More registers
Much higher throughput
Saturated arithmetic instructions
Bit manipulation on all types (permute, shift,
rotate)
Tons (162) of really cool instructions!
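What "saturated arithmetic" means per lane, as a scalar sketch with hypothetical function names. A vector saturating add applies the first function to every lane at once.

```c
#include <stdint.h>

/* Scalar model of one lane of an unsigned 8-bit saturating add:
   results above 255 clamp to 255 instead of wrapping around. */
static inline uint8_t sketch_adds_u8(uint8_t a, uint8_t b)
{
    const unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* The wrapping (modular) add, for comparison. */
static inline uint8_t sketch_add_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}
```

Saturation is what you usually want for colors and audio samples, where wrapping past the maximum turns bright into dark or loud into silent.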

170
VMX (Altivec)

What is VMX?
What are the advantages to using it?
Are there any dangers?

171
VMX (Altivec)

What is VMX?
What are the advantages to using it?
Are there any dangers?
Load-Hit-Store from GPR

172
VMX (Altivec)

What types?
Aligned access
Minimize dependencies
Branch Elimination

174
VMX What types?
int8_t, uint8_t x 16 (vector unsigned char)
int16_t, uint16_t x 8 (vector unsigned short)
int32_t, uint32_t x 4 (vector unsigned int)
float x 4 (vector float)
175
VMX What types?
int8_t, uint8_t x 16 (vector unsigned char)
int16_t, uint16_t x 8 (vector unsigned short)
int32_t, uint32_t x 4 (vector unsigned int)
float x 4 (vector float)

No 64 bit argument instructions

176
VMX What types?
int8_t, uint8_t x 16 (vector unsigned char)
int16_t, uint16_t x 8 (vector unsigned short)
int32_t, uint32_t x 4 (vector unsigned int)
float x 4 (vector float)