Title: PPU Optimizations
1PPU Optimizations
- Mike Acton
- Highmoon Studios
- macton_at_highmoonstudios.com
2Why Optimize for the PPU?
- Putting a game on the PPU of the PS3 is like putting a game on the IOP of the PS2.
- Not enough time?
- Pre-existing codebase?
- Someone else's codebase?
- Stepping stone to good SPU usage?
- Don't think an SPU can handle it?
3Where Can We Optimize?
- Data Hazards
- Scalar Fixed Point (Integer)
- Scalar Floating Point
- Data Design
- VMX (Altivec)
- Code Layout
4Data Hazards
- Basic Hazards
- Write After Read
- Write After Write
- Read After Write
- Avoiding Hazards
- Use const
- Use restrict
- Use inline
5TIMEOUT Understanding restrict
- Why was restrict introduced into C99?
- What transformations can the compiler now make?
- What is the danger in using restrict?
- A little about overlapping regions...
7Why was restrict introduced into C99?
- Not possible to prove that two memory windows do not overlap
- Not possible to prove that two memory access patterns do not overlap
- The scheduler must always presume that memory accesses can overlap.
  - Avoid generating data hazards.
- Unless there was some keyword...
- restrict is a "no hazards will be generated" contract
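The contract can be sketched as follows. This is a minimal illustration, not code from the talk: the function names `add_scaled` and `add_scaled_restrict` are hypothetical, and whether the compiler actually reorders the loads and stores depends on the compiler and flags.

```c
#include <stddef.h>

/* Without restrict, the compiler must assume dst and src may overlap:
   a store to dst[i] might change a later src[j], so loads of src cannot
   be freely moved ahead of earlier stores. */
void add_scaled(float *dst, const float *src, float scale, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        dst[i] += src[i] * scale;
    }
}

/* With restrict, the caller promises that memory written through dst is
   not accessed through src while these pointers are in use. The scheduler
   is then free to reorder the loads and stores. */
void add_scaled_restrict(float *restrict dst, const float *restrict src,
                         float scale, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        dst[i] += src[i] * scale;
    }
}
```

Both functions compute the same result when the buffers do not overlap; only the second publishes that requirement in its declaration.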
11TIMEOUT Understanding restrict
- Why was restrict introduced into C99?
- What transformations can the compiler now make?
- What is the danger in using restrict?
- A little about overlapping regions...
12What transformations can the compiler now make?
- Re-order loads and stores!
- The scheduler can presume that memory accesses cannot overlap.
- Responsibility of programmer: avoid generating data hazards.
13What transformations can the compiler now make?
- Re-order loads and stores!
- NOTES ON USE
- Restricted pointers may be copied.
- Only leaf pointers should be used.
- Use of restrict should be very common.
- Typical access is most likely exclusive.
- Publish data requirements in declarations
- Not doing this: very hard to find bugs
- Start using immediately.
- Somewhat difficult to refactor restricted requirements into pre-existing code.
17What transformations can the compiler now make?
- Re-order loads and stores!
- Potentially manage structures in registers
18TIMEOUT Understanding restrict
- Why was restrict introduced into C99?
- What transformations can the compiler now make?
- What is the danger in using restrict?
- A little about overlapping regions...
19What is the danger in using restrict?
- Programmer breaking the restrict contract
  - Unexpected results
  - Hard to find bugs
- Unit testing on host machine
  - Make sure restrict is supported
  - Compile with -fstrict-aliasing
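One way to catch a broken contract in host-side unit tests is to assert that the byte ranges handed to a restricted function do not overlap. The sketch below is an assumption-laden illustration: `ranges_overlap` and `copy_u32` are hypothetical names, and comparing addresses through `uintptr_t` assumes a flat address space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if [a, a + a_size) and [b, b + b_size) share at least one byte.
   Empty ranges never overlap anything. */
static int ranges_overlap(const void *a, size_t a_size,
                          const void *b, size_t b_size)
{
    const uintptr_t a_begin = (uintptr_t)a;
    const uintptr_t b_begin = (uintptr_t)b;
    if (a_size == 0 || b_size == 0) {
        return 0;
    }
    return (a_begin < b_begin + b_size) && (b_begin < a_begin + a_size);
}

/* The assert documents and checks the restrict contract in debug builds;
   it compiles away when NDEBUG is defined. */
void copy_u32(uint32_t *restrict dst, const uint32_t *restrict src, size_t count)
{
    assert(!ranges_overlap(dst, count * sizeof *dst, src, count * sizeof *src));
    for (size_t i = 0; i < count; ++i) {
        dst[i] = src[i];
    }
}
```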
23TIMEOUT Understanding restrict
- Why was restrict introduced into C99?
- What transformations can the compiler now make?
- What is the danger in using restrict?
- A little about overlapping regions...
24A little about overlapping regions...
- IMPORTANT! Not restricting the thing being pointed to.
- Generally, data within a stripe is not re-ordered.
- Use multiple levels of striped data to restrict fields independently.
Can point to same address
27Scalar Fixed Point
- What size integer?
- Single Load/Store
- Aligned access (Preference Load or Store?)
- Cache hints
- Using floating point registers
- Minimize status bit dependencies
29What size integer?
- General Purpose: Use 64 bits ( int64_t / uint64_t )
- Often sign extends after each arithmetic operation
  - Signed 32 bits

  int32_t ab;
  int32_t abc;
  ab  = a + b;
  abc = ab + c;

  add   ab0, a, b
  extsw ab1, ab0
  add   abc0, ab1, c
  extsw abc1, abc0
33What size integer?
- General Purpose: Use 64 bits ( int64_t / uint64_t )
- Often sign extends after each arithmetic operation
  - Signed 32 bits
- Typically defers extension until after multiple operations
  - Unsigned 32 bits
  - 16 bits
  - 8 bits

  int16_t ab;
  int16_t abc;
  ab  = a + b;
  abc = ab + c;

  add   ab, a, b
  add   abc0, ab, c
  extsh abc1, abc0
36What size integer?
- General Purpose: Use 64 bits ( int64_t / uint64_t )
- Often sign extends after each arithmetic operation
  - Signed 32 bits
- Typically defers extension until after multiple operations
  - Unsigned 32 bits
  - 16 bits
  - 8 bits
- Reminder: int is signed 32 bits
- Avoid bool
  - bool is only good for creating more branching
- Most logical instructions, add/sub
  - 64 bits
  - 1 cycle throughput
  - 2 cycle latency
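The integer size advice above can be sketched as below. The function names `sum3_i32` and `sum3_i64` are hypothetical, and whether a given compiler really emits an `extsw` after each 32-bit add is something to confirm in the disassembly rather than assume.

```c
#include <stdint.h>

/* 32-bit intermediates: the compiler may sign extend after each add. */
int32_t sum3_i32(int32_t a, int32_t b, int32_t c)
{
    const int32_t ab  = a + b;
    const int32_t abc = ab + c;
    return abc;
}

/* 64-bit intermediates: the adds work on full registers and the value is
   narrowed once at the end. The final cast assumes the sum fits in 32 bits. */
int32_t sum3_i64(int32_t a, int32_t b, int32_t c)
{
    const int64_t ab  = (int64_t)a + (int64_t)b;
    const int64_t abc = ab + (int64_t)c;
    return (int32_t)abc;
}
```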
39What size integer?
- Multiply and Divide ( 32 and 64 bits )
  - All integer multiply instructions stall FXU (6-15 cycles)
  - 64 bit integer divide instructions stall FXU (10-70 cycles)
  - 32 bit integer divide instructions stall FXU (10-38 cycles)
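Given those stall costs, one common way to avoid a divide is to use a shift and a mask when the divisor is an unsigned power of two. This is a sketch under that assumption; `div_pow2_u64` and `mod_pow2_u64` are hypothetical helper names. Compilers usually do this on their own when the divisor is a compile-time constant, so the helpers matter mainly when only the shift amount is known at run time.

```c
#include <stdint.h>

/* Both helpers assume the divisor is 2^shift with 0 <= shift < 64. */

/* value / 2^shift for unsigned values. */
static inline uint64_t div_pow2_u64(uint64_t value, unsigned shift)
{
    return value >> shift;
}

/* value % 2^shift for unsigned values. */
static inline uint64_t mod_pow2_u64(uint64_t value, unsigned shift)
{
    return value & ((UINT64_C(1) << shift) - 1);
}
```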
43Scalar Fixed Point
- What size integer?
- Single Load/Store
- Aligned access (Preference Load or Store?)
- Cache hints
- Using floating point registers
- Minimize status bit dependencies
44Single Load/Store
- Avoid Multiple Load/Store (lmw, stmw, etc.)
  - Microcoded instructions have 11 cycle decode penalty
  - Microcoded instructions cannot be interrupted
  - Microcoded instructions require pipeline flush
- Always load/store on size boundary
  - Load Store Unit does manage misaligned loads/stores but...
  - Major penalties for
    - Crossing 32B boundaries ( microcoded instructions )
    - Crossing Page boundaries ( microcoded instructions )
49Single Load/Store
- Avoid Multiple Load/Store (lmw, stmw, etc.)
- Always load/store on size boundary
- If you must break this rule: misaligned load, aligned store
- Load Hit Store
  - No store forwarding from the Store Queue (STQ).
  - Enter the Load Miss Queue (LMQ)
  - Most likely problem!
  - (40-80) cycles
  - Avoid
    - Small functions
    - Globals (especially in loops)
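The "globals in loops" case can be sketched like this. The names `g_total`, `accumulate_global`, and `accumulate_local` are hypothetical. A compiler may already keep the global in a register when it can prove nothing aliases it, so treat this as an illustration of the pattern rather than a guaranteed win.

```c
#include <stddef.h>
#include <stdint.h>

uint64_t g_total = 0; /* hypothetical global accumulator */

/* Each iteration may store g_total and reload it on the next iteration:
   a load that hits a store still in flight. */
void accumulate_global(const uint32_t *values, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        g_total += values[i];
    }
}

/* Load once, accumulate in a local (register), store once. */
void accumulate_local(const uint32_t *values, size_t count)
{
    uint64_t total = g_total;
    for (size_t i = 0; i < count; ++i) {
        total += values[i];
    }
    g_total = total;
}
```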
56Single Load/Store
- Avoid Multiple Load/Store (lmw, stmw, etc.)
- Always load/store on size boundary
- If you must break this rule: misaligned load, aligned store
- Load Hit Store
- Store Hit Load
  - All younger loads re-issued
- Load Hit Load
  - Un-snooped loads re-issued
- Load Hit Reload
  - Younger load enters LMQ
  - General Case: Penalty hidden until LMQ full
- Separate Loads and Stores
63Scalar Fixed Point
- What size integer?
- Single Load/Store
- Aligned access (Preference Load or Store?)
- Cache hints
  - Reads: Prefetch Block (dcbt)
  - Writes: Zero Block (dcbz)
- Using floating point registers
- Minimize status bit dependencies
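A read-side cache hint can be sketched with the GCC and Clang builtin `__builtin_prefetch`, which on PowerPC targets is typically lowered to a `dcbt`-style touch. The function name `sum_with_prefetch` and the `PREFETCH_DISTANCE` value are assumptions for illustration; the right distance has to be found by measurement. There is no equally portable builtin for `dcbz`, so it is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: how many elements ahead to touch. Tune by measurement. */
#define PREFETCH_DISTANCE ((size_t)16)

uint64_t sum_with_prefetch(const uint64_t *values, size_t count)
{
    uint64_t total = 0;
    size_t i = 0;

    /* Main loop: every prefetched address is inside the array. */
    const size_t main_count =
        (count > PREFETCH_DISTANCE) ? (count - PREFETCH_DISTANCE) : 0;
    for (; i < main_count; ++i) {
        __builtin_prefetch(&values[i + PREFETCH_DISTANCE], 0 /* read */, 3);
        total += values[i];
    }

    /* Tail: nothing left to prefetch. */
    for (; i < count; ++i) {
        total += values[i];
    }
    return total;
}
```

Splitting the loop keeps the prefetch address in bounds without adding a branch inside the main loop.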
64Scalar Fixed Point
- What size integer?
- Single Load/Store
- Aligned access (Preference Load or Store?)
- Cache hints
- Using floating point registers
  - For Move: Larger functions with free FPU registers
- Minimize status bit dependencies
65Scalar Fixed Point
- What size integer?
- Single Load/Store
- Aligned access (Preference Load or Store?)
- Cache hints
- Using floating point registers
- Minimize status bit dependencies
66Minimize status bit dependencies
- Remember: GPRs are not the only dependencies
- Condition Register (CR) is a major source of problems for the scheduler
- CR is read/modified by
  - Comparisons
  - Boolean operations
  - Branches
- FPU and VXU instructions that use CR will block CR.
- FXU instructions that use CR will be re-issued.
71Scalar Fixed Point
- Branch Elimination
- Prefer bit operations to comparisons
- Combine branches
72Scalar Fixed Point
- Branch Elimination
  - Prefer bit operations to comparisons
  - Combine branches
- Even well-predicted branches can impact performance
  - The instruction may not be in the fetch buffer
  - The instruction may not be in the icache
  - If functions are not properly aligned, small subroutines can cause another icache miss on the calling function on return.
  - The level-2 cache is shared with data; memory fetch can impact more than code performance.
77Scalar Fixed Point
- Branch Elimination
  - Prefer bit operations to comparisons
  - Combine branches
- Eliminating branches
  - Increases size of basic blocks
  - Decreases number of blocks
  - Decreases opportunities for branch penalties
  - Good for the compiler's code scheduler (optimization)!
82TIMEOUT Replacing comparisons
- Examples
- Branch on greater than
- Branch on not zero
- Integer select
- A little about the classic example min/max
- Some other examples...
83EXAMPLE
  uint64_t test_1_0( const uint64_t a, const uint64_t b )
  {
    if ( ( compare_a(a) != 0 ) || ( compare_b(b) != 0 ) )
    {
      return (b);
    }
    return (a);
  }
84Simple inline test functions
  static inline uint64_t compare_a( const uint64_t a )
  {
    return ( a & (uint64_t)0x10000100 );
  }

  static inline uint64_t compare_b( const uint64_t b )
  {
    return ( b & (uint64_t)0x80004000 );
  }
85 Compiler output for test_1_0
ppu-lv2-gcc (GCC) 3.4.1 (Cell 2.3 Aug 18 2005)
CFLAGS = -pedantic -std=c99 -O3 -Wall -fstrict-aliasing

  lis   r6,0x1000
  ori   r5,r6,0x0100
  li    r7,0x4000
  and   r10,r3,r5
  oris  r9,r7,0x8000
  cmpdi cr7,r10,0
  and   r0,r4,r9
  mr    r11,r3
  cmpdi cr6,r0,0
  bne-  cr7,
  beq-  cr6,
  mr    r11,r4
  mr    r3,r11
  blr

- Two comparisons
  - (A) GCC knows there are no side-effects: both comparisons started before branch
  - (B) PPC has CR logical instructions: why aren't cr6 and cr7 merged?
- Two branches
  - Major optimization barrier if this function is inlined.
- Limited to one issue per cycle
  - Branches
  - CR Modify
  - CR Read
- More difficult to schedule
- May inhibit multithreading
- Bonus penalty: two moves for false, three (!!) moves for true
90Combine the comparisons
  uint64_t test_1_1( const uint64_t a, const uint64_t b )
  {
    const uint64_t cmpa  = compare_a(a);
    const uint64_t cmpb  = compare_b(b);
    const uint64_t cmpab = cmpa | cmpb;
    if ( cmpab != 0 )
    {
      return (b);
    }
    return (a);
  }
Reminder: No side effects
91 Compiler output for test_1_1
  lis   r7,0x1000
  li    r6,0x4000
  mr    r11,r3
  ori   r5,r7,0x100
  oris  r3,r6,0x8000
  and   r9,r11,r5
  and   r0,r4,r3
  or    r10,r9,r0
  cmpdi r10,0
  mr    r3,r4
  bnelr
  mr    r3,r11
  blr

- Better. (We got what we expected.)
- One comparison
- One branch
- But still: two moves when the branch returns (true), three moves on the fall-through path (false). Pretty good indicator of optimization barrier.
95QUESTION! Does the ?: syntax make a difference in what the compiler will generate?
  uint64_t test_1_2( const uint64_t a, const uint64_t b )
  {
    return ( ( compare_a(a) | compare_b(b) ) ? b : a );
  }
ANSWER! No. Knock yourself out.
97Introducing cmpnz_u64
  static inline uint64_t cmpnz_u64( const uint64_t arg )
  {
    /* All non zero values will set the sign bit */
    const uint64_t sign        = arg | (-arg);

    /* Make sure we end up using sra */
    const int64_t  signed_sign = (int64_t)sign;

    /* Saturate with sign bit.
       ( arg == 0 ) ? 0x00000000_00000000 : 0xffffffff_ffffffff
       A shift of 0x1f saturates fully only when the lowest set bit of arg
       is in bits 0..31; a shift of 0x3f covers the full 64-bit range. */
    const uint64_t sign_ext    = (uint64_t)( signed_sign >> 0x1f );

    return ( sign_ext );
  }
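For testing on a host machine, the same mask can be built without relying on an arithmetic right shift of a signed value, which C leaves implementation-defined. This is a sketch, not the talk's code: `cmpnz_u64_portable` is a hypothetical name, and it shifts by 63 so that the result saturates for any non-zero 64-bit input.

```c
#include <stdint.h>

/* Returns 0 when arg == 0 and all ones otherwise. */
static inline uint64_t cmpnz_u64_portable(const uint64_t arg)
{
    /* Bit 63 is set if and only if arg is non-zero. */
    const uint64_t sign      = arg | (UINT64_C(0) - arg);
    /* Logical shift leaves a 0 or 1 predicate. */
    const uint64_t predicate = sign >> 63;
    /* 0 - 0 = 0, 0 - 1 = all ones (unsigned wraparound). */
    return UINT64_C(0) - predicate;
}
```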
101What do we expect for cmpnz_u64?
  neg   result, arg
  or    result, result, arg
  sradi result, result, 31
...about three fixed point instructions.
102Why use masks instead of predicates?
104Predicates
- Recommended by IBM's PowerPC Compiler Writer's Guide
- One bit (of) value identical to condition register
  - false = 0, true = 1

  static inline uint64_t predicate_cmpnz_u64( const uint64_t arg )
  {
    const uint64_t sign      = arg | (-arg);
    /* As on the slide; a one-bit predicate over the full 64-bit range
       needs a shift of 0x3f. */
    const uint64_t predicate = sign >> 0x1f;
    return (predicate);
  }

  Similar code to mask. Generates similar instructions, except srl is used instead of sra.

- Potentially many predicates can be stored if registers are limited
- Easy to generate code that uses either predicate or CR
- Doesn't break higher level code
  - A lot of pre-existing code
- But you can select with masks.
112Introducing sel_u64
  static inline uint64_t sel_u64( const uint64_t mask, const uint64_t a, const uint64_t b )
  {
    /* b_result is b if mask is set (else zero) */
    const uint64_t b_result = b & mask;

    /* a_result is a if mask is not set (else zero) */
    const uint64_t a_result = a & (~mask);

    /* One of the two results will be zero, the other will be the one we
       want. or'ing them together will just move the result we want into
       the final result register. */
    const uint64_t result   = b_result | a_result;

    return (result);
  }
116What do we expect for sel_u64?
  and   b_result, b, mask
  andc  a_result, a, mask
  or    result, a_result, b_result
...about three fixed point instructions.
PPC has two fixed point "logical with complement" operators that make building and working with masks much simpler.
  a andc b = a and (not b)
  a orc b  = a or (not b)
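The select and the two complement operators can be written in plain C for experimentation. `sel_u64` follows the slide's definition; `andc_u64` and `orc_u64` are hypothetical helper names that simply spell out what the instructions compute.

```c
#include <stdint.h>

/* Returns b where mask bits are set and a where they are clear. */
static inline uint64_t sel_u64(const uint64_t mask, const uint64_t a, const uint64_t b)
{
    return (b & mask) | (a & ~mask);
}

/* a andc b: a AND (NOT b) */
static inline uint64_t andc_u64(const uint64_t a, const uint64_t b)
{
    return a & ~b;
}

/* a orc b: a OR (NOT b) */
static inline uint64_t orc_u64(const uint64_t a, const uint64_t b)
{
    return a | ~b;
}
```

With a mask that is either zero or all ones, `sel_u64` behaves like `mask ? b : a` without a branch.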
118Let's make a new version of our test that uses our new mask functions
  uint64_t test_2_0( const uint64_t a, const uint64_t b )
  {
    const uint64_t cmpa   = compare_a(a);
    const uint64_t cmpb   = compare_b(b);
    const uint64_t ab_sel = cmpnz_u64( cmpa | cmpb );
    const uint64_t result = sel_u64( ab_sel, a, b );
    return (result);
  }
119What do we get?
  lis   r0,0x1000
  li    r5,0x4000
  ori   r12,r0,0x100
  oris  r10,r5,0x8000
  and   r8,r3,r12
  and   r9,r4,r10
  or    r7,r8,r9
  neg   r6,r7
  or    r5,r6,r7
  sradi r0,r5,0x1f
  and   r4,r4,r0
  andc  r3,r3,r0
  or    r3,r4,r3
  blr
- No branches
- No compares (no CR dependencies)
120With masks (After)
  lis   r0,0x1000
  li    r5,0x4000
  ori   r12,r0,0x100
  oris  r10,r5,0x8000
  and   r8,r3,r12
  and   r9,r4,r10
  or    r7,r8,r9
  neg   r6,r7
  or    r5,r6,r7
  sradi r0,r5,0x1f
  and   r4,r4,r0
  andc  r3,r3,r0
  or    r3,r4,r3
  blr
With compare and branch (Before)
  lis   r7,0x1000
  li    r6,0x4000
  mr    r11,r3
  ori   r5,r7,0x100
  oris  r3,r6,0x8000
  and   r9,r11,r5
  and   r0,r4,r3
  or    r10,r9,r0
  cmpdi r10,0
  mr    r3,r4
  bnelr
  mr    r3,r11
  blr
- About the same number of instructions (for size), but...
- Inlines extremely well (Can be pipelined)
122TIMEOUT Replacing comparisons
- Examples
- Branch on greater than
- Branch on not zero
- Integer select
- A little about the classic example min/max
- Some other examples...
123A little about min/max...
  static inline sint64_t min2_s64( const sint64_t arg0, const sint64_t arg1 )
  {
    return ( ( arg0 < arg1 ) ? arg0 : arg1 );
  }
- One compare (uses CR)
- One branch
124What we're looking for...
  static inline sint64_t min2_s64( const sint64_t arg0, const sint64_t arg1 )
  {
    return sel_u64( cmpgte_s64( arg0, arg1 ), arg0, arg1 );
  }
125Breakdown of cmpgte
  static inline uint64_t cmpgte_s64( const int64_t arg0, const int64_t arg1 )
  {
    const int64_t  msb0              = arg0 >> 0x3f;
    const int64_t  msb1              = arg1 >> 0x3f;
    const int64_t  signs_neq         = msb0 ^ msb1;
    const int64_t  signs_eq          = ~signs_neq;
    const uint64_t always_gt         = (uint64_t)(msb1 & signs_neq);
    const int64_t  diff              = arg1 - arg0;
    const int64_t  neg_diff          = -diff;
    const int64_t  diff_nz_msb       = diff | neg_diff;
    const int64_t  diff_nz           = diff_nz_msb >> 0x3f;
    const uint64_t diff_z            = ~(uint64_t)diff_nz;
    const int64_t  diff_iif_signs_eq = diff & signs_eq;
    const uint64_t diff_gt           = (uint64_t)(diff_iif_signs_eq >> 0x3f);
    const uint64_t result_gt         = diff_gt | always_gt;
    const uint64_t result_gte        = result_gt | diff_z;
    return (result_gte);
  }
Checking sign independently, e.g. cmpgte_s64( INT64_MAX, INT64_MIN )
127__FAST_MATH__ for integers?
  #if defined(__FAST_MATH__)
  static inline uint64_t cmpgte_s64( const int64_t arg0, const int64_t arg1 )
  {
    const int64_t  diff        = arg1 - arg0;
    const int64_t  neg_diff    = -diff;
    const int64_t  diff_nz_msb = diff | neg_diff;
    const int64_t  diff_nz     = diff_nz_msb >> 0x3f;
    const uint64_t diff_z      = ~(uint64_t)diff_nz;
    const uint64_t diff_gt     = (uint64_t)(diff >> 0x3f);
    const uint64_t result_gte  = diff_gt | diff_z;
    return (result_gte);
  }
  #endif
We don't need, for min/max: checking for arg0 == arg1 (the diff_z terms)
129Better!
  #if defined(__FAST_MATH__)
  static inline uint64_t cmpgt_s64( const int64_t arg0, const int64_t arg1 )
  {
    const int64_t  diff    = arg1 - arg0;
    const uint64_t diff_gt = (uint64_t)(diff >> 0x3f);
    return (diff_gt);
  }
  #endif
Change to cmpgt...
  static inline sint64_t min2_s64( const sint64_t arg0, const sint64_t arg1 )
  {
    return sel_u64( cmpgt_s64( arg0, arg1 ), arg0, arg1 );
  }
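A host-testable sketch of the branch-free minimum follows. It is not the talk's code: `cmpgt_mask_s64` is a hypothetical name, the subtraction is done on unsigned values to avoid signed overflow, and the signs are checked independently so that inputs such as INT64_MAX and INT64_MIN compare correctly. The final conversion back to `int64_t` assumes the usual two's complement behavior.

```c
#include <stdint.h>

static inline uint64_t sel_u64(const uint64_t mask, const uint64_t a, const uint64_t b)
{
    return (b & mask) | (a & ~mask);
}

/* All ones when arg0 > arg1, zero otherwise. */
static inline uint64_t cmpgt_mask_s64(const int64_t arg0, const int64_t arg1)
{
    const uint64_t u0        = (uint64_t)arg0;
    const uint64_t u1        = (uint64_t)arg1;
    const uint64_t signs_neq = u0 ^ u1;   /* bit 63 set when signs differ     */
    const uint64_t diff      = u1 - u0;   /* wraps instead of overflowing     */
    /* Signs differ: arg0 > arg1 exactly when arg1 is negative.
       Signs equal:  arg0 > arg1 exactly when arg1 - arg0 is negative. */
    const uint64_t msb       = ((u1 & signs_neq) | (diff & ~signs_neq)) >> 63;
    return UINT64_C(0) - msb;
}

static inline int64_t min2_s64(const int64_t arg0, const int64_t arg1)
{
    const uint64_t mask = cmpgt_mask_s64(arg0, arg1);
    /* mask set means arg0 > arg1, so select arg1. */
    return (int64_t)sel_u64(mask, (uint64_t)arg0, (uint64_t)arg1);
}
```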
130 Scalar Floating Point
- Double versus float
- Single Load/Store
- Aligned access
- Using fixed point registers
132Double versus float
- Expect mostly similar performance
133Double versus float
- Expect mostly similar performance
- Differences to note
double
float
fsqrt fre frsqrte fsel fabs fnabs
fsqrts fres frsqrtes
134Double versus float
- Expect mostly similar performance
- Differences to note
    double:  fsqrt   fre   frsqrte   fsel  fabs  fnabs
    float:   fsqrts  fres  frsqrtes
- ppu-lv2-gcc (GCC) 3.4.1 (Cell 2.3 Aug 18 2005)
- Does NOT generate these instructions!
135fsqrt
static inline double ppc_fsqrt( const double arg )
{
    double result;
    __asm__ ( "fsqrt %0, %1" : "=f"(result) : "f"(arg) );
    return (result);
}
fsqrts
static inline float ppc_fsqrts( const float arg )
{
    float result;
    __asm__ ( "fsqrts %0, %1" : "=f"(result) : "f"(arg) );
    return (result);
}
136fres
+/- 1/256
static inline float ppc_fres( const float arg )
{
    float result;
    __asm__ ( "fres %0, %1" : "=f"(result) : "f"(arg) );
    return (result);
}
frs
+/- 4 ulps
static inline float ppc_frs( const float arg )
{
    const float estimate   = ppc_fres( arg );
    const float refinement = -( estimate * arg - 1.0f );
    const float result     = refinement * estimate + estimate;
    return (result);
}
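The refinement on this slide is one Newton-Raphson step, e' = e + e * (1 - x * e), which roughly squares the relative error of the estimate. The sketch below shows that effect in portable C. It is an illustration only: `coarse_recip_sketch` is a hypothetical stand-in for `fres` that truncates an exact reciprocal to 8 mantissa bits, and it makes no claim about the accuracy of the real instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for fres: the exact reciprocal with the mantissa
   truncated to its top 8 bits (relative error below 2^-8). */
static inline float coarse_recip_sketch(float x)
{
    float r = 1.0f / x;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFF8000u;          /* keep sign, exponent, top 8 mantissa bits */
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* One Newton-Raphson step, written the same way as the slide. */
static inline float refined_recip_sketch(float x)
{
    const float estimate   = coarse_recip_sketch(x);
    const float refinement = -(estimate * x - 1.0f);
    return refinement * estimate + estimate;
}

/* Absolute value of (r * x - 1), i.e. the relative error of r as 1/x. */
static inline double recip_error_sketch(float r, float x)
{
    const double e = (double)r * (double)x - 1.0;
    return (e < 0.0) ? -e : e;
}

/* Small self-check: the refined value is far closer than the estimate allows. */
static inline void recip_selfcheck(void)
{
    const float x = 3.0f;
    assert(recip_error_sketch(coarse_recip_sketch(x), x) < 1.0 / 128.0);
    assert(recip_error_sketch(refined_recip_sketch(x), x) < 1e-4);
}
```

Each further step would tighten the result again, at the cost of two more dependent multiply-add operations.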
137frsqrte
static inline double ppc_frsqrte( const double arg )
{
    double result;
    __asm__ ( "frsqrte %0, %1" : "=f"(result) : "f"(arg) );
    return (result);
}
frsqrtes
static inline float ppc_frsqrtes( const double arg )
{
    float result;
    __asm__ ( "frsqrte %0, %1" : "=f"(result) : "f"(arg) );
    return (result);
}
138fsel
static inline double ppc_fsel( const double test_gez, const double arg0, const double arg1 )
{
    double result;
    __asm__ ( "fsel %0, %1, %2, %3" : "=f"(result) : "f"(test_gez), "f"(arg0), "f"(arg1) );
    return (result);
}
139fsels
static inline float ppc_fsels( const double test_gez, const double arg0, const double arg1 )
{
    float result;
    __asm__ ( "fsel %0, %1, %2, %3" : "=f"(result) : "f"(test_gez), "f"(arg0), "f"(arg1) );
    return (result);
}
140fabs
static inline double ppc_fabs( const double arg )
{
    double result;
    __asm__ ( "fabs %0, %1" : "=f"(result) : "f"(arg) );
    return (result);
}
fabss
static inline float ppc_fabss( const double arg )
{
    float result;
    __asm__ ( "fabs %0, %1" : "=f"(result) : "f"(arg) );
    return (result);
}
141fnabs
static inline double ppc_fnabs( const double arg )
{
    double result;
    __asm__ ( "fnabs %0, %1" : "=f"(result) : "f"(arg) );
    return (result);
}
fnabss
static inline float ppc_fnabss( const double arg )
{
    float result;
    __asm__ ( "fnabs %0, %1" : "=f"(result) : "f"(arg) );
    return (result);
}
142Double versus float
- Expect mostly similar performance
- Differences to note
- Use -ffast-math (if reordering is OK)
static inline double fmul_re( const double arg0, const double arg1 )
{
    const double result = arg0 * ( 1.0 / arg1 );
    return (result);
}
143Double versus float
- Expect mostly similar performance
- Differences to note
- Use -ffast-math (if reordering is OK)
    // const double result = arg0 * ( 1.0 / arg1 );
    //
    // -fno-fast-math (default) generates:
    //     lfd   oned, 0(addr_of_oned)
    //     fdiv  temp, oned, arg1
    //     fmul  result, arg0, temp
    //
    // -ffast-math generates:
    //     fdiv  result, arg0, arg1
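Why the compiler needs permission for this rewrite: `arg0 * ( 1.0 / arg1 )` rounds twice and `arg0 / arg1` rounds once, so the two can differ in the last bit. The sketch below counts such cases for small integers. The function name is hypothetical, and `volatile` is used only to keep the compiler from folding either expression.

```c
#include <assert.h>

/* Counts integers b in [1, limit] for which b * (1.0 / b) differs from b / b. */
static inline int count_reciprocal_mismatches(int limit)
{
    int mismatches = 0;
    for (int b = 1; b <= limit; ++b) {
        /* volatile stops the compiler from simplifying either expression */
        volatile double x     = (double)b;
        volatile double recip = 1.0 / x;
        const double via_reciprocal = x * recip;   /* two roundings */
        const double via_divide     = x / x;       /* one rounding  */
        if (via_reciprocal != via_divide) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

On IEEE 754 doubles a commonly cited mismatch is b = 49. Whether the difference matters is the "if reordering is OK" question on the slide.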
144 Scalar Floating Point
- Double versus float
- Single Load/Store
- Aligned access
- Using fixed point registers
145 Scalar Floating Point
- Double versus float
- Single Load/Store
- Aligned access
- Always load on address aligned to size
- Misalignment generates interrupt
- Using fixed point registers
146 Scalar Floating Point
- Double versus float
- Single Load/Store
- Aligned access
- Using fixed point registers
- Same idea OK for moves.
147 Scalar Floating Point
- Branch Elimination
- Avoid bit operations
- Floating point select
- Combine branches
148 Scalar Floating Point
- Branch Elimination
- Avoid bit operations
- Load-Hit-Store Hazard
- Use fctiw / fctiwz / stfiwx if result is integer
- Floating point select
- Combine branches
149 Scalar Floating Point
- Branch Elimination
- Avoid bit operations
- Floating point select
- Combine branches
150Floating point select
- Slightly different than integer select on mask
- Selects on a double test value: test >= 0.0 (gez)
151fsel_gez
static inline double ppc_fsel_gez( const double test_gez, const double arg0, const double arg1 )
{
    double result;
    __asm__ ( "fsel %0, %1, %2, %3" : "=f"(result) : "f"(test_gez), "f"(arg0), "f"(arg1) );
    return (result);
}
152fsel_ltz
static inline double ppc_fsel_ltz( const double test_ltz, const double arg0, const double arg1 )
{
    double result;
    __asm__ ( "fsel %0, %1, %2, %3" : "=f"(result) : "f"(test_ltz), "f"(arg1), "f"(arg0) );
    return (result);
}
153fsel_gte
static inline double ppc_fsel_gte( const double cmp0, const double cmp1, const double arg0, const double arg1 )
{
    const double test_gez = cmp0 - cmp1;
    double result;
    __asm__ ( "fsel %0, %1, %2, %3" : "=f"(result) : "f"(test_gez), "f"(arg0), "f"(arg1) );
    return (result);
}
154fmax (with fsel)
static inline double ppc_fmax( const double arg0, const double arg1 )
{
    return (ppc_fsel_gte( arg0, arg1, arg0, arg1 ));
}
    fsub  temp, arg0, arg1
    fsel  result, temp, arg0, arg1
    blr
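For readers without a PowerPC toolchain, the behaviour of the `fsel`-based helpers can be modelled in portable C. This sketch only documents the semantics: first value if the test is >= 0.0, otherwise the second, with a NaN test selecting the second. The `_sketch` names are hypothetical, and on other targets the ternary may well compile to a branch, so it shows the logic rather than the performance.

```c
#include <assert.h>

/* Model of fsel: arg0 if test_gez >= 0.0, else arg1 (a NaN test selects arg1). */
static inline double fsel_gez_sketch(double test_gez, double arg0, double arg1)
{
    return (test_gez >= 0.0) ? arg0 : arg1;
}

/* Select arg0 when cmp0 >= cmp1, using the sign of the difference as the test. */
static inline double fsel_gte_sketch(double cmp0, double cmp1,
                                     double arg0, double arg1)
{
    return fsel_gez_sketch(cmp0 - cmp1, arg0, arg1);
}

/* Maximum built from the select, mirroring the fsub + fsel sequence. */
static inline double fmax_sketch(double arg0, double arg1)
{
    return fsel_gte_sketch(arg0, arg1, arg0, arg1);
}

/* Small self-check of the model. */
static inline void fmax_selfcheck(void)
{
    assert(fmax_sketch(1.0, 2.0) == 2.0);
    assert(fmax_sketch(2.0, 1.0) == 2.0);
    assert(fmax_sketch(-3.0, -3.0) == -3.0);
}
```

Because the test is a subtraction, the result is only as trustworthy as `cmp0 - cmp1`: the difference can overflow, or become NaN when a NaN is involved.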
155fmax (with compare/branch)
static inline double ppc_fmax( const double arg0, const double arg1 )
{
    return ( ( arg0 >= arg1 ) ? arg0 : arg1 );
}
    fmr    result, arg0
    fcmpu  temp, arg0, arg1
    bgelr- temp
    fmr    result, arg1
    blr
157fmax (with compare/branch)
static inline double ppc_fmax( const double arg0, const double arg1 )
{
    return ( ( arg0 >= arg1 ) ? arg0 : arg1 );
}
    fmr    result, arg0
    fcmpu  temp, arg0, arg1
    bgelr- temp
    fmr    result, arg1
    blr
Blocks CR
Optimization (scheduling) barrier
158 Scalar Floating Point
- Branch Elimination
- Avoid bit operations
- Floating point select
- Combine branches
- Similar benefit to fixed point
159 Data Design
- Basic Principles
- Know the data and access patterns
- Be prepared to reorganize the data
- Every bit counts
- Design for the hardware
- Sort by dominant type
- Clearly distinguish RO/WO/RW data
- Almost everything belongs to a set
160 Data Design
- Cache Friendly Data
- Minimize cache footprint
- Sort by data-reuse lifetime
- Separate scalars from arrays
- Use table-based storage patterns
- Tile sparse queues with sequential data
- Merge multiple source tiles
- Keep write-once data off-cache
161 Data Design
- Cache Friendly Data (cont.)
- Minimize write-multiple data
- Keep short life read-write data in register file
- Pipeline long life read-write data
- Subclass data based on independent functionality
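One possible reading of "minimize cache footprint" and "subclass data based on independent functionality" is to split hot fields from cold ones, so a loop pulls only the data it uses through the cache. The sketch below contrasts the two layouts. Every name in it is hypothetical and the particle example is not from the slides.

```c
#include <assert.h>
#include <stddef.h>

enum { PARTICLE_COUNT = 256 };

/* Array-of-structures: updating x drags velocity and debug data into cache. */
struct particle_aos_sketch {
    float x, y, z;
    float vx, vy, vz;
    char  debug_name[32];   /* cold: rarely read */
};

/* Structure-of-arrays: each loop touches only the arrays it needs. */
struct particles_soa_sketch {
    float x[PARTICLE_COUNT];
    float vx[PARTICLE_COUNT];
    char  debug_name[PARTICLE_COUNT][32];   /* cold data kept apart */
};

/* Reads vx and writes x sequentially; debug_name never enters the cache. */
static inline void integrate_x_sketch(struct particles_soa_sketch *p, float dt)
{
    for (size_t i = 0; i < PARTICLE_COUNT; ++i) {
        p->x[i] += p->vx[i] * dt;
    }
}
```

In the array-of-structures form each particle is 56 bytes, of which this update needs 8. The split form streams only the two arrays involved.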
162 Data Design
- Allocation
- Static versus Dynamic
- Alignment
- System pages
163 VMX (Altivec)
- What is VMX?
- What are the advantages to using it?
- Are there any dangers?
166VMX What are the advantages to using it?
167VMX What are the advantages to using it?
- More registers
- Much higher throughput
168VMX What are the advantages to using it?
- More registers
- Much higher throughput
- Instruction throughput: 1 cycle
- Latency for simple instructions: 4 cycles
- Latency for complex instructions: 9 cycles
- Latency for float add/sub/madd/nmsub: 12 cycles
- Latency for float re/rsqrte: 12 cycles
169VMX What are the advantages to using it?
- More registers
- Much higher throughput
- Saturated arithmetic instructions
- Bit manipulation on all types (permute, shift, rotate)
- Tons (162) of really cool instructions!
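To make "saturated arithmetic" concrete, here is a scalar C model of an unsigned 8-bit saturating add applied across 16 lanes, the kind of operation VMX performs in a single instruction. This is a sketch of the semantics only, with hypothetical names. It does not use the VMX intrinsics, so it compiles anywhere.

```c
#include <assert.h>
#include <stdint.h>

/* One lane: add two bytes and clamp to 255 instead of wrapping. */
static inline uint8_t adds_u8_sketch(uint8_t a, uint8_t b)
{
    const unsigned sum      = (unsigned)a + (unsigned)b;   /* at most 510 */
    const unsigned overflow = 0u - (sum >> 8);   /* all ones if sum > 255 */
    return (uint8_t)(sum | overflow);            /* low byte becomes 0xFF */
}

/* Sixteen lanes, matching the width of one 128-bit vector of bytes. */
static inline void adds_u8x16_sketch(uint8_t out[16],
                                     const uint8_t a[16],
                                     const uint8_t b[16])
{
    for (int i = 0; i < 16; ++i) {
        out[i] = adds_u8_sketch(a[i], b[i]);
    }
}

/* Small self-check of the lane model. */
static inline void adds_u8_selfcheck(void)
{
    assert(adds_u8_sketch(10, 20) == 30);
    assert(adds_u8_sketch(200, 100) == 255);
    assert(adds_u8_sketch(255, 0) == 255);
}
```

Note that the scalar version is itself branch-free: the clamp comes from a mask, not a comparison and jump.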
170 VMX (Altivec)
- What is VMX?
- What are the advantages to using it?
- Are there any dangers?
171 VMX (Altivec)
- What is VMX?
- What are the advantages to using it?
- Are there any dangers?
- Load-Hit-Store from GPR
172 VMX (Altivec)
- What types?
- Aligned access
- Minimize dependencies
- Branch Elimination
174VMX What types?
    int8_t   uint8_t    x 16  (vector unsigned char)
    int16_t  uint16_t   x 8   (vector unsigned short)
    int32_t  uint32_t   x 4   (vector unsigned int)
    float               x 4   (vector float)
175VMX What types?
    int8_t   uint8_t    x 16  (vector unsigned char)
    int16_t  uint16_t   x 8   (vector unsigned short)
    int32_t  uint32_t   x 4   (vector unsigned int)
    float               x 4   (vector float)
- No 64 bit argument instructions
176VMX What types?
    int8_t   uint8_t    x 16  (vector unsigned char)
    int16_t  uint16_t   x 8   (vector unsigned short)
    int32_t  uint32_t   x 4   (vector unsigned int)
    float               x 4   (vector float)
- No 64 bit argument instructions
    int64_t  uint64_t   x 2   (vector unsigned long long)
    double              x 2   (vector double)
- 64 bit typedefs exist (but memory-based)
- Even simple casts are really crappy. Avoid!
177 VMX (Altivec)
- What types?
- Aligned access
- Normal load/store must be aligned (ld/ldx/st/stx)
- There are explicit load misaligned instructions (lvsl/lvsr)
- There are store element instructions (ste)
- Minimize dependencies
- Branch Elimination
178 VMX (Altivec)
- What types?
- Aligned access
- Minimize dependencies
- Branch Elimination
180VMX Branch Elimination
181VMX Branch Elimination
- Mask compare and select
- vec_cmpeq
- vec_cmpge ( vector float only )
- vec_cmpgt
- vec_cmple ( vector float only )
- vec_cmplt
- vec_sel
182VMX Branch Elimination
- Also
- vec_min
- vec_max
- vec_avg ( except vector float )
- Mask compare and select
- vec_cmpeq
- vec_cmpge ( vector float only )
- vec_cmpgt
- vec_cmple ( vector float only )
- vec_cmplt
- vec_sel
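The mask-compare-and-select idiom listed above can be modelled in scalar C over four 32-bit lanes. This is a sketch of the semantics, not VMX code: the `_sketch` types and functions are hypothetical stand-ins, and the select follows the `vec_sel` convention of taking bits from the second operand where the mask is set.

```c
#include <assert.h>
#include <stdint.h>

/* Four 32-bit lanes standing in for one 128-bit vector register. */
typedef struct { uint32_t lane[4]; } u32x4_sketch;

/* Compare: each lane becomes all ones where a > b, else zero. */
static inline u32x4_sketch cmpgt_u32x4_sketch(u32x4_sketch a, u32x4_sketch b)
{
    u32x4_sketch mask;
    for (int i = 0; i < 4; ++i) {
        mask.lane[i] = (uint32_t)0 - (uint32_t)(a.lane[i] > b.lane[i]);
    }
    return mask;
}

/* Select: bits of b where mask is set, bits of a elsewhere. */
static inline u32x4_sketch sel_u32x4_sketch(u32x4_sketch a, u32x4_sketch b,
                                            u32x4_sketch mask)
{
    u32x4_sketch out;
    for (int i = 0; i < 4; ++i) {
        out.lane[i] = (a.lane[i] & ~mask.lane[i]) | (b.lane[i] & mask.lane[i]);
    }
    return out;
}

/* Per-lane maximum with no per-lane branch: take a wherever a > b. */
static inline u32x4_sketch max_u32x4_sketch(u32x4_sketch a, u32x4_sketch b)
{
    return sel_u32x4_sketch(b, a, cmpgt_u32x4_sketch(a, b));
}
```

The same compare mask can feed several selects, which is how one test replaces a whole if/else body.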
183 VMX (Altivec)
- Maximizing throughput
- Combining transformations
- Uniform versus Non-uniform vectors
- Watch out! Building immediate values
185 VMX (Altivec)
- Maximizing throughput
- Combining transformations
- Uniform versus Non-uniform