PPU Optimizations - PowerPoint PPT Presentation

About This Presentation
Title:

PPU Optimizations

Slides: 199
Provided by: cellperf
Category:
Tags: ppu | optimizations | ps3

1
PPU Optimizations
  • Mike Acton
  • Highmoon Studios
  • macton_at_highmoonstudios.com

2
Why Optimize for the PPU?
  • Putting a game on the PPU of the PS3 is like
    putting a game on the IOP of the PS2.
  • Not enough time?
  • Pre-existing codebase?
  • Someone else's codebase?
  • Stepping stone to good SPU usage?
  • Don't think an SPU can handle it?

3
Where Can We Optimize?
  • Data Hazards
  • Scalar Fixed Point (Integer)
  • Scalar Floating Point
  • Data Design
  • VMX (Altivec)
  • Code Layout

4
Data Hazards
  • Basic Hazards
  • Write After Read
  • Write After Write
  • Read After Write
  • Avoiding Hazards
  • Use const
  • Use restrict
  • Use inline

5
TIMEOUT Understanding restrict
  • Why was restrict introduced into C99?
  • What transformations can the compiler now make?
  • What is the danger in using restrict?
  • A little about overlapping regions...

10
Why was restrict introduced into C99?
  • Not possible to prove that two memory windows do
    not overlap
  • Not possible to prove that two memory access
    patterns do not overlap
  • The scheduler must always presume that memory
    accesses can overlap.
  • Avoid generating data hazards.
  • Unless there was some keyword...
  • restrict is a "no hazards will be generated"
    contract
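
A minimal sketch of the aliasing problem described above, assuming a C99 compiler. The function names add_arrays and add_arrays_restrict are hypothetical and do not come from the slides. In the first version the compiler has to assume that a store to dst[i] may change a later a[] or b[] element, so loads and stores stay in program order. In the second version the caller promises that the three windows do not overlap, which leaves the scheduler free to hoist loads above earlier stores.

```c
#include <stddef.h>

/* No promise about overlap: every store to dst[i] might modify a[] or b[],
   so the compiler must keep loads and stores in program order. */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}

/* restrict is the caller's contract that dst, a and b never overlap,
   so loads from a[] and b[] may be moved ahead of stores to dst[]. */
void add_arrays_restrict(float *restrict dst,
                         const float *restrict a,
                         const float *restrict b,
                         size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

Both functions return the same results when the contract holds. Calling the restrict version with overlapping buffers is undefined behaviour.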

11
TIMEOUT Understanding restrict
  • Why was restrict introduced into C99?
  • What transformations can the compiler now make?
  • What is the danger in using restrict?
  • A little about overlapping regions...

12
What transformations can the compiler now make?
  • Re-order loads and stores!
  • The scheduler can presume that memory accesses
    cannot overlap.
  • Responsibility of programmer: Avoid generating
    data hazards.

13
What transformations can the compiler now make?
  • Re-order loads and stores!
  • NOTES ON USE
  • Restricted pointers may be copied.
  • Only leaf pointers should be used.
  • Use of restrict should be very common.
  • Typical access is most likely exclusive.
  • Publish data requirements in declarations
  • Not doing this - Very hard to find bugs
  • Start using immediately.
  • Somewhat difficult to refactor restricted
    requirements into pre-existing code.

17
What transformations can the compiler now make?
  • Re-order loads and stores!
  • Potentially manage structures in registers
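
As a hedged illustration of the "structures in registers" point: the stats_t type and the accumulate function below are hypothetical. With both pointers restrict-qualified, a compiler is allowed to load stats->sum and stats->count once, keep them in registers for the whole loop, and write them back at the end. Whether a particular compiler actually does so is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure used only for this sketch. */
typedef struct {
    int64_t count;
    int64_t sum;
} stats_t;

/* Because stats and values are restrict-qualified, a load from values[i]
   can never observe a store to *stats, so the two fields may live in
   registers across the loop instead of being stored on every iteration. */
void accumulate(stats_t *restrict stats, const int64_t *restrict values, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        stats->sum   += values[i];
        stats->count += 1;
    }
}
```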

18
TIMEOUT Understanding restrict
  • Why was restrict introduced into C99?
  • What transformations can the compiler now make?
  • What is the danger in using restrict?
  • A little about overlapping regions...

22
What is the danger in using restrict?
  • Programmer breaking the restrict contract
  • Unexpected results
  • Hard to find bugs
  • Unit testing on host machine
  • Make sure restrict is supported
  • Compile with -fstrict-aliasing

23
TIMEOUT Understanding restrict
  • Why was restrict introduced into C99?
  • What transformations can the compiler now make?
  • What is the danger in using restrict?
  • A little about overlapping regions...

26
A little about overlapping regions...
  • IMPORTANT! Not restricting the thing being
    pointed to.
  • Generally, data within a stripe is not
    re-ordered.
  • Use multiple levels of striped data to restrict
    fields independently.

(Figure label: Can point to same address)
27
Scalar Fixed Point
  • What size integer?
  • Single Load/Store
  • Aligned access (Preference Load or Store?)
  • Cache hints
  • Using floating point registers
  • Minimize status bit dependencies

32
What size integer?
  • General Purpose: Use 64 bits ( int64_t /
    uint64_t )
  • Often sign extends after each arithmetic
    operation
  • Signed 32 bits

    int32_t ab;
    int32_t abc;
    ab  = a + b;
    abc = ab + c;

    add   ab0, a, b
    extsw ab1, ab0
    add   abc0, ab1, c
    extsw abc1, abc0
35
What size integer?
  • General Purpose: Use 64 bits ( int64_t /
    uint64_t )
  • Often sign extends after each arithmetic
    operation
  • Signed 32 bits
  • Typically defers extension until after multiple
    operations
  • Unsigned 32 bits
  • 16 bits
  • 8 bits

    int16_t ab;
    int16_t abc;
    ab  = a + b;
    abc = ab + c;

    add   ab, a, b
    add   abc0, ab, c
    extsh abc1, abc0
38
What size integer?
  • General Purpose: Use 64 bits ( int64_t /
    uint64_t )
  • Often sign extends after each arithmetic
    operation
  • Signed 32 bits
  • Typically defers extension until after multiple
    operations
  • Unsigned 32 bits
  • 16 bits
  • 8 bits
  • Reminder: int is signed 32 bits
  • Avoid bool
  • bool is only good for creating more branching
  • Most logical instructions + add/sub
  • 64 bits
  • 1 cycle throughput
  • 2 cycle latency

42
What size integer?
  • Multiply and Divide ( 32 and 64 bits )
  • All integer multiply instructions stall FXU (6
    to 15 cycles)
  • 64 bit integer divide instructions stall FXU (10
    to 70 cycles)
  • 32 bit integer divide instructions stall FXU (10
    to 38 cycles)
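
One common way to stay clear of the divide stalls listed above is to keep divisors at powers of two, so that a shift or a mask does the work. The sketch below is an illustration only: div_pow2_u64 and mod_pow2_u64 are hypothetical helper names, the values must be unsigned, and shift must be less than 64. Compilers usually make this substitution themselves when the divisor is a compile-time constant, so the helpers matter mostly when the power of two is only known at run time.

```c
#include <stdint.h>

/* value / (1 << shift) for unsigned values, using a single shift.
   Assumes shift < 64. */
static inline uint64_t div_pow2_u64(const uint64_t value, const unsigned shift)
{
    return value >> shift;
}

/* value % (1 << shift) for unsigned values, using a single mask.
   Assumes shift < 64. */
static inline uint64_t mod_pow2_u64(const uint64_t value, const unsigned shift)
{
    const uint64_t mask = (UINT64_C(1) << shift) - 1;
    return value & mask;
}
```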

43
Scalar Fixed Point
  • What size integer?
  • Single Load/Store
  • Aligned access (Preference Load or Store?)
  • Cache hints
  • Using floating point registers
  • Minimize status bit dependencies

45
Single Load/Store
  • Avoid Multiple Load/Store (lmw, stmw, etc.)
  • Microcoded instructions have 11 cycle decode
    penalty
  • Microcoded instructions cannot be interrupted
  • Microcoded instructions require pipeline flush

48
Single Load/Store
  • Avoid Multiple Load/Store (lmw, stmw, etc.)
  • Always load/store on size boundary
  • Load Store Unit does manage misaligned
    loads/stores, but...
  • Major penalties for:
  • Crossing 32B boundaries ( microcoded
    instructions )
  • Crossing Page boundaries ( microcoded
    instructions )

54
Single Load/Store
  • Avoid Multiple Load/Store (lmw, stmw, etc.)
  • Always load/store on size boundary
  • If you must break this rule: misaligned load,
    aligned store
  • Load Hit Store
  • No store forwarding from the Store Queue (STQ).
  • Enter the Load Miss Queue (LMQ)
  • Most likely problem!
  • Avoid:
  • Small functions
  • Globals (especially in loops)
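
A hedged sketch of the "globals in loops" advice. The global g_total and both function names are hypothetical. In the first function, each iteration may store to the global and then load it straight back on the next iteration, which is the load-hit-store pattern. The second function reads the global once, accumulates in a local that can live in a register, and stores once at the end. Whether a given compiler generates the store/reload pair for the first version depends on what it can prove about aliasing.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical global accumulator, for illustration only. */
uint64_t g_total = 0;

/* Read-modify-write of a global on every iteration: a candidate for a
   store immediately followed by a dependent load of the same address. */
void sum_into_global(const uint64_t *values, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        g_total += values[i];
    }
}

/* Loads and stores separated: one load before the loop, one store after. */
void sum_into_local(const uint64_t *values, size_t n)
{
    uint64_t total = g_total;
    for (size_t i = 0; i < n; i++) {
        total += values[i];
    }
    g_total = total;
}
```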

55
Single Load/Store
  • Avoid Multiple Load/Store (lmw, stmw, etc.)
  • Always load/store on size boundary
  • If you must break this rule: misaligned load,
    aligned store
  • Load Hit Store
  • No store forwarding from the Store Queue (STQ).
  • Enter the Load Miss Queue (LMQ)
  • Most likely problem!
  • (40 to 80) cycles

57
Single Load/Store
  • Avoid Multiple Load/Store (lmw, stmw, etc.)
  • Always load/store on size boundary
  • If you must break this rule: misaligned load,
    aligned store
  • Load Hit Store
  • Store Hit Load
  • All younger loads re-issued

59
Single Load/Store
  • Avoid Multiple Load/Store (lmw, stmw, etc.)
  • Always load/store on size boundary
  • If you must break this rule: misaligned load,
    aligned store
  • Load Hit Store
  • Store Hit Load
  • Load Hit Load
  • Un-snooped loads re-issued

61
Single Load/Store
  • Avoid Multiple Load/Store (lmw, stmw, etc.)
  • Always load/store on size boundary
  • If you must break this rule: misaligned load,
    aligned store
  • Load Hit Store
  • Store Hit Load
  • Load Hit Load
  • Load Hit Reload
  • Younger load enters LMQ
  • General Case: Penalty hidden until LMQ full

62
Single Load/Store
  • Avoid Multiple Load/Store (lmw, stmw, etc.)
  • Always load/store on size boundary
  • If you must break this rule: misaligned load,
    aligned store
  • Load Hit Store
  • Store Hit Load
  • Load Hit Load
  • Load Hit Reload
  • Separate Loads and Stores

63
Scalar Fixed Point
  • What size integer?
  • Single Load/Store
  • Aligned access (Preference Load or Store?)
  • Cache hints
  • Reads: Prefetch Block (dcbt)
  • Writes: Zero Block (dcbz)
  • Using floating point registers
  • Minimize status bit dependencies

64
Scalar Fixed Point
  • What size integer?
  • Single Load/Store
  • Aligned access (Preference Load or Store?)
  • Cache hints
  • Using floating point registers
  • For Move: Larger functions with free FPU
    registers
  • Minimize status bit dependencies

65
Scalar Fixed Point
  • What size integer?
  • Single Load/Store
  • Aligned access (Preference Load or Store?)
  • Cache hints
  • Using floating point registers
  • Minimize status bit dependencies

68
Minimize status bit dependencies
  • Remember: GPRs are not the only dependencies
  • Condition Register (CR) is a major source of
    problems for the scheduler
  • CR is read/modified by:
  • Comparisons
  • Boolean operations
  • Branches

70
Minimize status bit dependencies
  • Remember: GPRs are not the only dependencies
  • Condition Register (CR) is a major source of
    problems for the scheduler
  • FPU and VXU instructions that use CR will block
    CR.
  • FXU instructions that use CR will be re-issued.

76
Scalar Fixed Point
  • Branch Elimination
  • Prefer bit operations to comparisons
  • Combine branches
  • Even well-predicted branches can impact
    performance
  • The instruction may not be in the fetch buffer
  • The instruction may not be in the icache
  • If functions are not properly aligned, small
    subroutines can cause another icache miss on the
    calling function on return.
  • The level-2 cache is shared with data; a memory
    fetch can impact more than code performance.

80
Scalar Fixed Point
  • Branch Elimination
  • Prefer bit operations to comparisons
  • Combine branches
  • Eliminating branches
  • Increases size of basic blocks
  • Decreases number of blocks
  • Decreases opportunities for branch penalties
  • Good for the compiler's code scheduler
    (optimization)!

82
  • TIMEOUT Replacing comparisons
  • Examples
  • Branch on greater than
  • Branch on not zero
  • Integer select
  • A little about the classic example min/max
  • Some other examples...

83
EXAMPLE
    uint64_t test_1_0( const uint64_t a, const uint64_t b )
    {
      if ( ( compare_a(a) != 0 ) || ( compare_b(b) != 0 ) )
      {
        return (b);
      }
      return (a);
    }
84
Simple inline test functions
    static inline uint64_t compare_a( const uint64_t a )
    {
      return ( a & (uint64_t)0x10000100 );
    }

    static inline uint64_t compare_b( const uint64_t b )
    {
      return ( b & (uint64_t)0x80004000 );
    }
85
    lis    r6,0x1000
    ori    r5,r6,0x0100
    li     r7,0x4000
    and    r10,r3,r5
    oris   r9,r7,0x8000
    cmpdi  cr7,r10,0
    and    r0,r4,r9
    mr     r11,r3
    cmpdi  cr6,r0,0
    bne-   cr7,
    beq-   cr6,
    mr     r11,r4
    mr     r3,r11
    blr
Two comparisons
ppu-lv2-gcc (GCC) 3.4.1 (Cell 2.3 Aug 18 2005)
CFLAGS=-pedantic -std=c99 -O3 -Wall -fstrict-aliasing
86
(A) GCC knows there are no side-effects: both
comparisons started before branch.
(B) PPC has CR logical instructions. Why aren't cr6
and cr7 merged?
87
Two branches: Major optimization barrier if
this function is inlined.
88
  • Limited to one issue per cycle
  • Branches
  • CR Modify
  • CR Read
  • More difficult to schedule
  • May inhibit multithreading

89
Bonus penalty: Two moves for false, Three
(!!) moves for true
90
Combine the comparisons
    uint64_t test_1_1( const uint64_t a, const uint64_t b )
    {
      const uint64_t cmpa  = compare_a(a);
      const uint64_t cmpb  = compare_b(b);
      const uint64_t cmpab = cmpa | cmpb;
      if ( cmpab != 0 )
      {
        return (b);
      }
      return (a);
    }
Reminder: No side effects
91
    lis    r7,0x1000
    li     r6,0x4000
    mr     r11,r3
    ori    r5,r7,0x100
    oris   r3,r6,0x8000
    and    r9,r11,r5
    and    r0,r4,r3
    or     r10,r9,r0
    cmpdi  r10,0
    mr     r3,r4
    bnelr
    mr     r3,r11
    blr
Better. (We got what we expected.)
92
One comparison
93
One branch
94
But still: Two moves for false, Three
moves for true. Pretty good indicator of
optimization barrier.
95
QUESTION! Does the ?: syntax make a difference in
what the compiler will generate?
    uint64_t test_1_2( const uint64_t a, const uint64_t b )
    {
      return ( ( compare_a(a) | compare_b(b) ) ? b : a );
    }
96
ANSWER! No. Knock yourself out.
97
Introducing cmpnz_u64
    static inline uint64_t cmpnz_u64( const uint64_t arg )
    {
      const uint64_t sign        = arg | (-arg);
      const int64_t  signed_sign = (int64_t)sign;
      const uint64_t sign_ext    = (uint64_t)( signed_sign >> 0x1f );
      return ( sign_ext );
    }
98
    const uint64_t sign = arg | (-arg);
/* All non zero values will set the sign bit */
99
    const int64_t signed_sign = (int64_t)sign;
/* Make sure we end up using sra */
100
    const uint64_t sign_ext = (uint64_t)( signed_sign >> 0x1f );
/* Saturate with sign bit. ( arg == 0 ) ?
0x00000000_00000000 : 0xffffffff_ffffffff */
101
What do we expect for cmpnz_u64?
    neg    result, arg
    or     result, result, arg
    sradi  result, result, 31
about three fixed point instructions.
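
A host-testable sketch of cmpnz_u64. It makes one deliberate change: it shifts by 63 rather than the 0x1f used on the slides, so that inputs with only high bits set (for example 0x8000000000000000) also saturate to a full 64-bit mask. It also assumes that converting to int64_t wraps and that right-shifting a negative value is an arithmetic shift. Both are implementation-defined in C, although GCC and Clang behave this way.

```c
#include <stdint.h>

/* Returns 0 when arg == 0, otherwise 0xffffffffffffffff. */
static inline uint64_t cmpnz_u64(const uint64_t arg)
{
    /* arg | -arg has its top bit set for every non-zero arg. */
    const uint64_t sign = arg | (UINT64_C(0) - arg);

    /* Assumption: signed conversion wraps and >> on a negative value is an
       arithmetic shift, which smears the top bit across all 64 bits. */
    const int64_t signed_sign = (int64_t)sign;
    const uint64_t sign_ext = (uint64_t)(signed_sign >> 63);

    return sign_ext;
}
```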
102
Why use masks instead of predicates?
106
Predicates
  • Recommended by IBM's PowerPC Compiler Writer's
    Guide
  • One bit (of) value identical to condition
    register
  • false = 0, true = 1

    static inline uint64_t predicate_cmpnz_u64( const uint64_t arg )
    {
      const uint64_t sign      = arg | (-arg);
      const uint64_t predicate = sign >> 0x1f;
      return (predicate);
    }

Similar code to mask. Generates similar
instructions, except srl is used instead of sra.
111
Predicates
  • Recommended by IBM's PowerPC Compiler Writer's
    Guide
  • One bit (of) value identical to condition
    register
  • false = 0, true = 1
  • Potentially many predicates can be stored if
    registers are limited
  • Easy to generate code that uses either predicate
    or CR
  • Doesn't break higher level code
  • A lot of pre-existing code
  • But you can select with masks.

112
Introducing sel_u64
    static inline uint64_t sel_u64( const uint64_t mask, const uint64_t a, const uint64_t b )
    {
      const uint64_t b_result = b & mask;
      const uint64_t a_result = a & (~mask);
      const uint64_t result   = b_result | a_result;
      return (result);
    }
113
    const uint64_t b_result = b & mask;
/* b_result is b if mask is set (else zero) */
114
    const uint64_t a_result = a & (~mask);
/* a_result is a if mask is not set (else zero) */
115
    const uint64_t result = b_result | a_result;
/* One of the two results will be zero, the other
will be the one we want. or'ing them together
will just move the result we want into the final
result register. */
116
What do we expect for sel_u64?
    and   b_result, b, mask
    andc  a_result, a, mask
    or    result, a_result, b_result
about three fixed point instructions.
117
PPC has two fixed point logical-with-complement
operators that make building and working with
masks much simpler.
    a andc b  =  a and ~b
    a orc  b  =  a or  ~b
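
A self-contained sketch of the select described above, for trying out on a host machine. The function body follows the slide. Because the select works bit by bit, a partial mask merges bits from both inputs, which is why the mask is expected to be all zeros or all ones when a whole value is being chosen.

```c
#include <stdint.h>

/* Returns b where mask bits are set and a where they are clear. */
static inline uint64_t sel_u64(const uint64_t mask, const uint64_t a, const uint64_t b)
{
    const uint64_t b_result = b & mask;    /* b if mask is set, else zero     */
    const uint64_t a_result = a & ~mask;   /* a if mask is not set, else zero */
    return b_result | a_result;            /* exactly one side is non-zero    */
}
```

With a mask of zero the call returns a, with a mask of all ones it returns b, and no compare or branch is involved.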
118
Let's make a new version of our test that uses
our new mask functions
    uint64_t test_2_0( const uint64_t a, const uint64_t b )
    {
      const uint64_t cmpa   = compare_a(a);
      const uint64_t cmpb   = compare_b(b);
      const uint64_t ab_sel = cmpnz_u64( cmpa | cmpb );
      const uint64_t result = sel_u64( ab_sel, a, b );
      return (result);
    }
119
What do we get?
    lis    r0,0x1000
    li     r5,0x4000
    ori    r12,r0,0x100
    oris   r10,r5,0x8000
    and    r8,r3,r12
    and    r9,r4,r10
    or     r7,r8,r9
    neg    r6,r7
    or     r5,r6,r7
    sradi  r0,r5,0x1f
    and    r4,r4,r0
    andc   r3,r3,r0
    or     r3,r4,r3
    blr
No branches. No compares (no CR dependencies).
120
    With masks (After)        With compare and branch (Before)
    lis    r0,0x1000          lis    r7,0x1000
    li     r5,0x4000          li     r6,0x4000
    ori    r12,r0,0x100       mr     r11,r3
    oris   r10,r5,0x8000      ori    r5,r7,0x100
    and    r8,r3,r12          oris   r3,r6,0x8000
    and    r9,r4,r10          and    r9,r11,r5
    or     r7,r8,r9           and    r0,r4,r3
    neg    r6,r7              or     r10,r9,r0
    or     r5,r6,r7           cmpdi  r10,0
    sradi  r0,r5,0x1f         mr     r3,r4
    and    r4,r4,r0           bnelr
    andc   r3,r3,r0           mr     r3,r11
    or     r3,r4,r3           blr
    blr
121
  • About the same number of instructions (for size),
    but...
  • Inlines extremely well (Can be pipelined)

122
  • TIMEOUT Replacing comparisons
  • Examples
  • Branch on greater than
  • Branch on not zero
  • Integer select
  • A little about the classic example min/max
  • Some other examples...

123
A little about min/max...
    static inline sint64_t min2_s64( const sint64_t arg0, const sint64_t arg1 )
    {
      return ( ( arg0 < arg1 ) ? arg0 : arg1 );
    }
  • One compare (uses CR)
  • One branch

124
What we're looking for...
    static inline sint64_t min2_s64( const sint64_t arg0, const sint64_t arg1 )
    {
      return sel_u64( cmpgte_s64( arg0, arg1 ), arg0, arg1 );
    }
125
Breakdown of cmpgte
static inline uint64_t cmpgte_s64( const int64_t
arg0, const int64_t arg1 ) const int64_t
msb0 arg0 0x3f const
int64_t msb1 arg1 0x3f
const int64_t signs_neq msb0
msb1 const int64_t signs_eq
signs_neg const uint64_t always_gt
(uint64_t)(msb1 signs_neq) const int64_t
diff arg1 arg0 const int64_t
neg_diff -diff const int64_t
diff_nz_msb diff neg_diff const
int64_t diff_nz diff_nz_msb
0x3f const uint64_t diff_z
(uint64_t)diff_nz const int64_t
diff_iif_signs_eq diff signs_eq const
uint64_t diff_gt (uint_64_t)(diff_if
f_signs_eq 0x3f) const uint64_t result_gt
diff_gt always_gt const uint64_t
result_gte result_gt diff_z
return (result_gte)
126
Breakdown of cmpgte
Checking sign independently: needed for cases like cmpgte_s64( INT64_MAX, INT64_MIN )
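To see why the sign check matters, the sketch below puts a difference-only comparison next to one that checks the signs independently. It is an illustration rather than the slide's exact code: the function names are hypothetical, and the arithmetic is done on `uint64_t` so that the wrap-around is well defined in portable C.

```c
#include <stdint.h>

/* All ones if the top bit of x is set, all zeros otherwise. */
static inline uint64_t msb_mask_u64(uint64_t x)
{
    return UINT64_C(0) - (x >> 63);
}

/* Difference only: wrong whenever arg1 - arg0 overflows. */
static inline uint64_t cmpgte_s64_fast(int64_t arg0, int64_t arg1)
{
    const uint64_t diff    = (uint64_t)arg1 - (uint64_t)arg0;
    const uint64_t diff_gt = msb_mask_u64(diff);
    const uint64_t diff_z  = ~msb_mask_u64(diff | (UINT64_C(0) - diff));
    return diff_gt | diff_z;
}

/* Signs checked independently: correct over the whole int64_t range. */
static inline uint64_t cmpgte_s64_full(int64_t arg0, int64_t arg1)
{
    const uint64_t msb0      = msb_mask_u64((uint64_t)arg0);
    const uint64_t msb1      = msb_mask_u64((uint64_t)arg1);
    const uint64_t signs_neq = msb0 ^ msb1;
    const uint64_t always_gt = msb1 & signs_neq;          /* arg0 >= 0 > arg1 */
    const uint64_t diff      = (uint64_t)arg1 - (uint64_t)arg0;
    const uint64_t diff_gt   = msb_mask_u64(diff & ~signs_neq);
    const uint64_t diff_z    = ~msb_mask_u64(diff | (UINT64_C(0) - diff));
    return diff_gt | always_gt | diff_z;
}
```

For `( INT64_MAX, INT64_MIN )` the wrapped difference is 1, so the difference-only version reports "not greater or equal" while the full version returns an all-ones mask.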
127
__FAST_MATH__ for integers?
#if defined(__FAST_MATH__)
static inline uint64_t cmpgte_s64( const int64_t arg0, const int64_t arg1 )
{
  const int64_t  diff        = arg1 - arg0;
  const int64_t  neg_diff    = -diff;
  const int64_t  diff_nz_msb = diff | neg_diff;
  const int64_t  diff_nz     = diff_nz_msb >> 0x3f;
  const uint64_t diff_z      = ~(uint64_t)diff_nz;
  const uint64_t diff_gt     = (uint64_t)(diff >> 0x3f);
  const uint64_t result_gte  = diff_gt | diff_z;
  return (result_gte);
}
#endif
128
We don't need for min/max...
Checking for arg0 == arg1
129
Better!
#if defined(__FAST_MATH__)
static inline uint64_t cmpgt_s64( const int64_t arg0, const int64_t arg1 )
{
  const int64_t  diff    = arg1 - arg0;
  const uint64_t diff_gt = (uint64_t)(diff >> 0x3f);
  return (diff_gt);
}
#endif
Change to cmpgt...
static inline sint64_t min2_s64( const sint64_t arg0, const sint64_t arg1 )
{
  return sel_u64( cmpgt_s64( arg0, arg1 ), arg0, arg1 );
}
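Put together, the branch-free min might look like the sketch below. Two things are assumptions here: `sel_u64` is defined elsewhere in the deck, so its semantics (second value where the mask is set) are inferred from how min2_s64 uses it, and the subtraction is done on `uint64_t` to keep the wrap-around well defined. As on the slide, the result is only valid when arg1 - arg0 does not overflow.

```c
#include <stdint.h>

/* All ones when arg0 > arg1, assuming arg1 - arg0 does not overflow. */
static inline uint64_t cmpgt_s64(int64_t arg0, int64_t arg1)
{
    const uint64_t diff = (uint64_t)arg1 - (uint64_t)arg0; /* wraps, no UB */
    return UINT64_C(0) - (diff >> 63);                     /* sign bit -> full mask */
}

/* Assumed semantics: arg1 where mask bits are set, arg0 elsewhere. */
static inline uint64_t sel_u64(uint64_t mask, uint64_t arg0, uint64_t arg1)
{
    return (arg0 & ~mask) | (arg1 & mask);
}

/* Branch-free minimum built from the two helpers above. */
static inline int64_t min2_s64(int64_t arg0, int64_t arg1)
{
    return (int64_t)sel_u64(cmpgt_s64(arg0, arg1),
                            (uint64_t)arg0, (uint64_t)arg1);
}
```

When the arguments are equal the mask is zero and arg0 is returned, which is why min/max can drop the equality test.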
130
Scalar Floating Point
  • Double versus float
  • Single Load/Store
  • Aligned access
  • Using fixed point registers

132
Double versus float
  • Expect mostly similar performance

133
Double versus float
  • Expect mostly similar performance
  • Differences to note

double: fsqrt, fre, frsqrte, fsel, fabs, fnabs
float: fsqrts, fres, frsqrtes
134
Double versus float
  • Expect mostly similar performance
  • Differences to note

double: fsqrt, fre, frsqrte, fsel, fabs, fnabs
float: fsqrts, fres, frsqrtes
  • ppu-lv2-gcc (GCC) 3.4.1 (Cell 2.3 Aug 18 2005)
  • Does NOT generate these instructions!

135
fsqrt
static inline double ppc_fsqrt( const double arg )
{
  double result;
  __asm__ ( "fsqrt %0, %1" : "=f"(result) : "f"(arg) );
  return (result);
}
fsqrts
static inline float ppc_fsqrts( const float arg )
{
  float result;
  __asm__ ( "fsqrts %0, %1" : "=f"(result) : "f"(arg) );
  return (result);
}
136
fres
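+/- 1/256
static inline float ppc_fres( const float arg )
{
  float result;
  __asm__ ( "fres %0, %1" : "=f"(result) : "f"(arg) );
  return (result);
}
frs
+/- 4 ulps
/* The slide reuses the name ppc_fres here and calls ppc_res; the name below is assumed so the refinement can call the estimate above. */
static inline float ppc_fres_refined( const float arg )
{
  const float estimate   = ppc_fres( arg );
  const float refinement = -( estimate * arg - 1.0f );
  const float result     = refinement * estimate + estimate;
  return (result);
}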
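The refinement is one Newton-Raphson step, x1 = x0 + x0 * (1 - arg * x0), which roughly squares the relative error of the estimate. The sketch below shows that effect in portable C. The helper names are hypothetical, and `rough_recip` is only a stand-in that is deliberately off by 1/512; it does not model the actual fres hardware estimate.

```c
/* Stand-in for a hardware estimate: a reciprocal with a deliberate
   relative error of 1/512 (hypothetical helper, not a model of fres). */
static inline float rough_recip(float arg)
{
    return (1.0f / arg) * (1.0f + 1.0f / 512.0f);
}

/* One Newton-Raphson step: x1 = x0 + x0 * (1 - arg * x0). */
static inline float refine_recip(float arg, float estimate)
{
    const float refinement = -(estimate * arg - 1.0f);
    return refinement * estimate + estimate;
}

/* Relative error of approx as a reciprocal of arg. */
static inline float recip_rel_error(float arg, float approx)
{
    const float err = approx * arg - 1.0f;
    return (err < 0.0f) ? -err : err;
}
```

Starting from an error near 2e-3, a single step should land near 4e-6, which is the kind of improvement the +/- 1/256 versus +/- 4 ulps labels describe.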
137
frsqrte
static inline double ppc_frsqrte( const double arg )
{
  double result;
  __asm__ ( "frsqrte %0, %1" : "=f"(result) : "f"(arg) );
  return (result);
}
frsqrtes
static inline float ppc_frsqrtes( const double arg )
{
  float result;
  __asm__ ( "frsqrte %0, %1" : "=f"(result) : "f"(arg) );
  return (result);
}
138
fsel
static inline double ppc_fsel( const double test_gez, const double arg0, const double arg1 )
{
  double result;
  __asm__ ( "fsel %0, %1, %2, %3" : "=f"(result) : "f"(test_gez), "f"(arg0), "f"(arg1) );
  return (result);
}
139
fsels
static inline float ppc_fsels( const double test_gez, const double arg0, const double arg1 )
{
  float result;
  __asm__ ( "fsel %0, %1, %2, %3" : "=f"(result) : "f"(test_gez), "f"(arg0), "f"(arg1) );
  return (result);
}
140
fabs
static inline double ppc_fabs( const double arg )
{
  double result;
  __asm__ ( "fabs %0, %1" : "=f"(result) : "f"(arg) );
  return (result);
}
fabss
static inline float ppc_fabss( const double arg )
{
  float result;
  __asm__ ( "fabs %0, %1" : "=f"(result) : "f"(arg) );
  return (result);
}
141
fnabs
static inline double ppc_fnabs( const double arg )
{
  double result;
  __asm__ ( "fnabs %0, %1" : "=f"(result) : "f"(arg) );
  return (result);
}
fnabss
static inline float ppc_fnabss( const double arg )
{
  float result;
  __asm__ ( "fnabs %0, %1" : "=f"(result) : "f"(arg) );
  return (result);
}
142
Double versus float
  • Expect mostly similar performance
  • Differences to note
  • Use -ffast-math (if reordering is OK)

static inline double fmul_re( const double arg0, const double arg1 )
{
  const double result = arg0 * ( 1.0 / arg1 );
  return (result);
}
143
Double versus float
  • Expect mostly similar performance
  • Differences to note
  • Use -ffast-math (if reordering is OK)

/* const double result = arg0 * ( 1.0 / arg1 ); */

/* -fno-fast-math (default) generates: */
lfd  oned, 0(addr_of_oned)
fdiv temp, oned, arg1
fmul result, arg0, temp

/* -ffast-math generates: */
fdiv result, arg0, arg1
144
Scalar Floating Point
  • Double versus float
  • Single Load/Store
  • Aligned access
  • Using fixed point registers

145
Scalar Floating Point
  • Double versus float
  • Single Load/Store
  • Aligned access
  • Always load on address aligned to size
  • Misalignment generates interrupt
  • Using fixed point registers
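A minimal way to act on the alignment rule above is to check addresses in debug builds and to force alignment in the data layout. The sketch below assumes C11 (`_Alignas`); the `is_aligned` helper, the `sample_t` record and the `g_sample` instance are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero when ptr is aligned to size bytes (size must be a power of two). */
static inline int is_aligned(const void *ptr, size_t size)
{
    return ((uintptr_t)ptr & (uintptr_t)(size - 1)) == 0;
}

/* Hypothetical record: the 8-byte member is forced onto an 8-byte boundary. */
typedef struct {
    _Alignas(8) double position;
    float               weight;
} sample_t;

/* Example instance used to demonstrate the check. */
static sample_t g_sample;
```

A check such as `is_aligned(&g_sample.position, sizeof(double))` can sit in an assert in front of code that assumes aligned floating point loads.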

146
Scalar Floating Point
  • Double versus float
  • Single Load/Store
  • Aligned access
  • Using fixed point registers
  • Same idea OK for moves.

147
Scalar Floating Point
  • Branch Elimination
  • Avoid bit operations
  • Floating point select
  • Combine branches

148
Scalar Floating Point
  • Branch Elimination
  • Avoid bit operations
  • Load-Hit-Store Hazard
  • Use fctiw / fctiwz / stfiwx if result is integer
  • Floating point select
  • Combine branches

149
Scalar Floating Point
  • Branch Elimination
  • Avoid bit operations
  • Floating point select
  • Combine branches

150
Floating point select
  • Slightly different than integer select on mask
  • Tests a double for >= 0 (gez)

151
fsel_gez
static inline double ppc_fsel_gez( const double test_gez, const double arg0, const double arg1 )
{
  double result;
  __asm__ ( "fsel %0, %1, %2, %3" : "=f"(result) : "f"(test_gez), "f"(arg0), "f"(arg1) );
  return (result);
}
152
fsel_ltz
static inline double ppc_fsel_ltz( const double test_ltz, const double arg0, const double arg1 )
{
  double result;
  __asm__ ( "fsel %0, %1, %2, %3" : "=f"(result) : "f"(test_ltz), "f"(arg1), "f"(arg0) );
  return (result);
}
153
fsel_gte
static inline double ppc_fsel_gte( const double cmp0, const double cmp1, const double arg0, const double arg1 )
{
  const double test_gez = cmp0 - cmp1;
  double result;
  __asm__ ( "fsel %0, %1, %2, %3" : "=f"(result) : "f"(test_gez), "f"(arg0), "f"(arg1) );
  return (result);
}
154
fmax (with fsel)
static inline double ppc_fmax( const double arg0, const double arg1 )
{
  return (ppc_fsel_gte( arg0, arg1, arg0, arg1 ));
}

fsub temp, arg0, arg1
fsel result, temp, arg0, arg1
blr
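For readers without a PowerPC toolchain, the select-based max can be modelled in portable C. This is a sketch of the semantics only: the `model_` names are hypothetical, and the ternary here may compile to a branch, whereas the point of fsel is that the hardware makes the choice without one.

```c
/* Portable model of fsel: arg0 when test_gez >= 0.0 (including -0.0),
   arg1 when test_gez is negative or NaN. */
static inline double model_fsel(double test_gez, double arg0, double arg1)
{
    return (test_gez >= 0.0) ? arg0 : arg1;
}

/* arg0 when cmp0 >= cmp1, decided from the sign of the difference. */
static inline double model_fsel_gte(double cmp0, double cmp1,
                                    double arg0, double arg1)
{
    return model_fsel(cmp0 - cmp1, arg0, arg1);
}

/* Maximum of two doubles built on the select. */
static inline double model_fmax(double arg0, double arg1)
{
    return model_fsel_gte(arg0, arg1, arg0, arg1);
}
```

One consequence of comparing through a subtraction: if either input is NaN the difference is NaN, so the second value is returned.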
155
fmax (with compare/branch)
static inline double ppc_fmax( const double arg0, const double arg1 )
{
  return ( ( arg0 >= arg1 ) ? arg0 : arg1 );
}

fmr    result, arg0
fcmpu  temp, arg0, arg1
bgelr- temp
fmr    result, arg1
blr
156
fmax (with compare/branch)
Blocks CR
157
fmax (with compare/branch)
Blocks CR
Optimization (scheduling) barrier
158
Scalar Floating Point
  • Branch Elimination
  • Avoid bit operations
  • Floating point select
  • Combine branches
  • Similar benefit to fixed point

159
Data Design
  • Basic Principles
  • Know the data and access patterns
  • Be prepared to reorganize the data
  • Every bit counts
  • Design for the hardware
  • Sort by dominant type
  • Clearly distinguish RO/WO/RW data
  • Almost everything belongs to a set

160
Data Design
  • Cache Friendly Data
  • Minimize cache footprint
  • Sort by data-reuse lifetime
  • Separate scalars from arrays
  • Use table-based storage patterns
  • Tile sparse queues with sequential data
  • Merge multiple source tiles
  • Keep write-once data off-cache

161
Data Design
  • Cache Friendly Data (cont.)
  • Minimize write-multiple data
  • Keep short life read-write data in register file
  • Pipeline long life read-write data
  • Subclass data based on independent functionality

162
Data Design
  • Allocation
  • Static versus Dynamic
  • Alignment
  • System pages

163
VMX (Altivec)
  • What is VMX?
  • What are the advantages to using it?
  • Are there any dangers?

166
VMX What are the advantages to using it?
  • More registers

167
VMX What are the advantages to using it?
  • More registers
  • Much higher throughput

168
VMX What are the advantages to using it?
  • More registers
  • Much higher throughput
  • Instruction throughput: 1 cycle
  • Latency for simple instructions: 4 cycles
  • Latency for complex instructions: 9 cycles
  • Latency for float add/sub/madd/nmsub: 12 cycles
  • Latency for float re/rsqrte: 12 cycles
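With one instruction issued per cycle but several cycles of latency, throughput is only reached when enough independent operations are in flight. The sketch below illustrates the idea in scalar C for portability (the same reasoning applies to vector registers); the function names and the choice of four accumulators are assumptions made for illustration.

```c
#include <stddef.h>

/* One dependency chain: each add waits for the previous result. */
static float sum_serial(const float *values, size_t count)
{
    float acc = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        acc += values[i];
    }
    return acc;
}

/* Four independent chains: up to four adds can be in flight at once. */
static float sum_4way(const float *values, size_t count)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        acc0 += values[i + 0];
        acc1 += values[i + 1];
        acc2 += values[i + 2];
        acc3 += values[i + 3];
    }
    for (; i < count; ++i) {   /* leftover elements */
        acc0 += values[i];
    }
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Splitting the sum changes the order of the floating point additions, so results can differ in the last bits from the serial version unless the values are exactly representable.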

169
VMX What are the advantages to using it?
  • More registers
  • Much higher throughput
  • Saturated arithmetic instructions
  • Bit manipulation on all types (permute, shift, rotate)
  • Tons (162) of really cool instructions!

170
VMX (Altivec)
  • What is VMX?
  • What are the advantages to using it?
  • Are there any dangers?

171
VMX (Altivec)
  • What is VMX?
  • What are the advantages to using it?
  • Are there any dangers?
  • Load-Hit-Store from GPR

172
VMX (Altivec)
  • What types?
  • Aligned access
  • Minimize dependencies
  • Branch Elimination

174
VMX What types?
int8_t, uint8_t    x 16  (vector unsigned char)
int16_t, uint16_t  x 8   (vector unsigned short)
int32_t, uint32_t  x 4   (vector unsigned int)
float              x 4   (vector float)
175
VMX What types?
  • No 64 bit argument instructions

176
VMX What types?
  • No 64 bit argument instructions

int64_t, uint64_t  x 2   (vector unsigned long long)
double             x 2   (vector double)
  • 64 bit typedefs exist (but memory-based)
  • Even simple casts are really crappy. Avoid!

177
VMX (Altivec)
  • What types?
  • Aligned access
  • Normal load/store must be aligned (ld/ldx/st/stx)
  • There are explicit load misaligned instructions
    (lvsl/lvsr)
  • There are store element instructions (ste)
  • Minimize dependencies
  • Branch Elimination

178
VMX (Altivec)
  • What types?
  • Aligned access
  • Minimize dependencies
  • Branch Elimination

180
VMX Branch Elimination
  • Mask compare and select

181
VMX Branch Elimination
  • Mask compare and select
  • vec_cmpeq
  • vec_cmpge ( vector float only )
  • vec_cmpgt
  • vec_cmple ( vector float only )
  • vec_cmplt
  • vec_sel

182
VMX Branch Elimination
  • Also
  • vec_min
  • vec_max
  • vec_avg ( except vector float )
  • Mask compare and select
  • vec_cmpeq
  • vec_cmpge ( vector float only )
  • vec_cmpgt
  • vec_cmple ( vector float only )
  • vec_cmplt
  • vec_sel
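The compare-and-select pattern can be tried out without VMX hardware by emulating four 32-bit lanes in scalar C. This is an emulation sketch: the `u32x4` type and the `emu_` functions are hypothetical stand-ins that mirror the lane-wise behaviour of vec_cmpgt and vec_sel, not the real intrinsics from altivec.h.

```c
#include <stdint.h>

/* Scalar stand-in for a 4 x uint32_t vector register (emulation only). */
typedef struct { uint32_t lane[4]; } u32x4;

/* Like vec_cmpgt: each lane is all ones where a > b, else zero. */
static inline u32x4 emu_cmpgt(u32x4 a, u32x4 b)
{
    u32x4 r;
    for (int i = 0; i < 4; ++i) {
        r.lane[i] = (a.lane[i] > b.lane[i]) ? UINT32_C(0xFFFFFFFF) : 0;
    }
    return r;
}

/* Like vec_sel(a, b, mask): bits of b where mask is set, bits of a elsewhere. */
static inline u32x4 emu_sel(u32x4 a, u32x4 b, u32x4 mask)
{
    u32x4 r;
    for (int i = 0; i < 4; ++i) {
        r.lane[i] = (a.lane[i] & ~mask.lane[i]) | (b.lane[i] & mask.lane[i]);
    }
    return r;
}

/* Per-lane maximum: take a where a > b, otherwise b. */
static inline u32x4 emu_max(u32x4 a, u32x4 b)
{
    return emu_sel(b, a, emu_cmpgt(a, b));
}
```

On real hardware the whole of `emu_max` corresponds to one compare and one select (or simply vec_max), with no per-lane branch.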

183
VMX (Altivec)
  • Maximizing throughput
  • Combining transformations
  • Uniform versus Non-uniform vectors
  • Watch out! Building immediate values

185
VMX (Altivec)
  • Maximizing throughput
  • Combining transformations
  • Uniform versus Non-uniform