Title: Inter-Iteration Scalar Replacement in the Presence of Control-Flow
1 Inter-Iteration Scalar Replacement in the Presence of Control-Flow
- Mihai Budiu Microsoft Research, Silicon Valley
- Seth Copen Goldstein Carnegie Mellon University
- ODES 2005
2 Summary
- What: compiler optimization
- Where: dense regular matrix codes
  - FORTRAN
  - some media processing
- Goal: reduce number of memory accesses
- How: allocate array elements to registers
- New: optimal algorithm based on predication
3 Outline
- Scalar Replacement
- Predicated PRE
- Combining the two
- Results
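4 Scalar Replacement
Front-end (source level):
    before:  a[i] += 2;  a[i] <<= 4;
    after:   tmp = a[i];  tmp += 2;  tmp <<= 4;  a[i] = tmp;
Back-end (memory operations):
    before:  ld a[i]; arith; ...; st a[i]; ld a[i]; arith; st a[i]
    after:   ld a[i]; arith; arith; st a[i]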
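A minimal C sketch of the transformation on this slide, assuming the two arithmetic steps are an add of 2 and a left shift by 4. The ld/st helpers and their counters are hypothetical stand-ins for load and store instructions and are not part of the original slides.

```c
#include <assert.h>

/* Hypothetical access counters standing in for ld/st instructions. */
static int loads, stores;

int  ld(const int *p)  { loads++;  return *p; }
void st(int *p, int v) { stores++; *p = v; }

/* Without scalar replacement: every reference to a[i] goes to memory. */
void update_memory(int *a, int i) {
    st(&a[i], ld(&a[i]) + 2);
    st(&a[i], ld(&a[i]) << 4);
}

/* With scalar replacement: a[i] lives in tmp between one load and one store. */
void update_scalar(int *a, int i) {
    int tmp = ld(&a[i]);
    tmp += 2;
    tmp <<= 4;
    st(&a[i], tmp);
}

/* Runs one version on a one-element array holding 3; returns the final value. */
int result_of(int use_scalar) {
    int a[1] = { 3 };
    loads = stores = 0;
    if (use_scalar) update_scalar(a, 0); else update_memory(a, 0);
    return a[0];
}

/* Returns loads + stores performed by one version. */
int accesses_of(int use_scalar) {
    (void)result_of(use_scalar);
    return loads + stores;
}

/* Usage: both versions compute the same value with different access counts. */
void self_check(void) {
    assert(result_of(0) == result_of(1));
    assert(accesses_of(1) < accesses_of(0));
}
```

In this sketch the memory version performs two loads and two stores, while the scalarized version performs one of each.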
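5 Inter-Iteration Scalar Replacement
Original loop:
    for (i = 0; i < N; i++)
        a[i] += a[i+1];
After scalar replacement:
    tmp0 = a[0];
    for (i = 0; i < N; i++) {
        tmp1 = a[i+1];
        a[i] = tmp0 + tmp1;
        tmp0 = tmp1;
    }
Runtime memory accesses:
    original:   i=0: ld a[0], ld a[1], st a[0]    i=1: ld a[1], ld a[2], st a[1]
    optimized:  i=0: ld a[0], ld a[1], st a[0]    i=1: ld a[2], st a[1]
In iteration i=1 the value of a[1] comes from the scalar loaded as tmp1 in iteration i=0, so the second ld a[1] disappears.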
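The load counts on this slide can be reproduced with a small C sketch. It assumes the loop body adds a[i+1] into a[i]; the ld helper, the counter, and the trip count N = 8 are illustrative choices, not taken from the talk.

```c
#include <assert.h>

enum { N = 8 };          /* illustrative trip count; the array holds N + 1 elements */

static int loads;        /* hypothetical counter standing in for ld instructions */

int ld(const int *p) { loads++; return *p; }

/* Original loop: two loads per iteration. */
void loop_original(int *a) {
    for (int i = 0; i < N; i++)
        a[i] = ld(&a[i]) + ld(&a[i + 1]);
}

/* Scalar-replaced loop: the value loaded as a[i+1] is reused as a[i]
   one iteration later, so only one load per iteration remains. */
void loop_scalar(int *a) {
    int tmp0 = ld(&a[0]);
    for (int i = 0; i < N; i++) {
        int tmp1 = ld(&a[i + 1]);
        a[i] = tmp0 + tmp1;
        tmp0 = tmp1;
    }
}

static void fill(int *a) { for (int i = 0; i <= N; i++) a[i] = i * i; }

/* Returns the number of loads one version performs. */
int loads_of(int use_scalar) {
    int a[N + 1];
    fill(a);
    loads = 0;
    if (use_scalar) loop_scalar(a); else loop_original(a);
    return loads;
}

/* Returns 1 when both versions leave identical arrays. */
int same_result(void) {
    int a[N + 1], b[N + 1];
    fill(a);
    fill(b);
    loop_original(a);
    loop_scalar(b);
    for (int i = 0; i <= N; i++)
        if (a[i] != b[i]) return 0;
    return 1;
}

/* Usage: same output, fewer loads. */
void self_check(void) {
    assert(same_result());
    assert(loads_of(1) < loads_of(0));
}
```

With these assumptions the original loop issues 2N loads and the scalarized loop issues N + 1.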
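6 Rotating Scalars
Original loop:
    for (i = 0; i < N; i++)
        a[i] += a[i+3];
Scalars are rotated at the end of each iteration:
    for (...) {
        ...
        tmp0 = tmp1;
        tmp1 = tmp2;
        tmp2 = tmp3;
        tmp3 = a[i+4];
    }
Invariant: tmp0 = a[i+0], tmp1 = a[i+1], tmp2 = a[i+2], tmp3 = a[i+3]
Itanium has hardware support for rotating registers.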
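A C sketch of the rotation scheme, under stated assumptions: the loop body adds a[i+3] into a[i], and the four scalars are preloaded before the loop (the slide elides that part). Because the unconditional load of a[i+4] in the last iteration reads a[N+3], the sketch pads the array by one element. The ld helper and counter are hypothetical.

```c
#include <assert.h>

enum { N = 8, LEN = N + 4 };   /* LEN includes a[N+3], read once by the rotated loop */

static int loads;              /* hypothetical counter standing in for ld instructions */

int ld(const int *p) { loads++; return *p; }

/* Original loop: two loads per iteration. */
void loop_original(int *a) {
    for (int i = 0; i < N; i++)
        a[i] = ld(&a[i]) + ld(&a[i + 3]);
}

/* Rotating scalars: one load per iteration after a four-load prologue. */
void loop_rotating(int *a) {
    int tmp0 = ld(&a[0]), tmp1 = ld(&a[1]), tmp2 = ld(&a[2]), tmp3 = ld(&a[3]);
    for (int i = 0; i < N; i++) {
        /* Invariant at the top of the body: tmp0..tmp3 hold a[i+0]..a[i+3]. */
        a[i] = tmp0 + tmp3;
        tmp0 = tmp1;
        tmp1 = tmp2;
        tmp2 = tmp3;
        tmp3 = ld(&a[i + 4]);
    }
}

static void fill(int *a) { for (int i = 0; i < LEN; i++) a[i] = i * i; }

/* Returns the number of loads one version performs. */
int loads_of(int use_rotating) {
    int a[LEN];
    fill(a);
    loads = 0;
    if (use_rotating) loop_rotating(a); else loop_original(a);
    return loads;
}

/* Returns 1 when both versions leave identical arrays. */
int same_result(void) {
    int a[LEN], b[LEN];
    fill(a);
    fill(b);
    loop_original(a);
    loop_rotating(b);
    for (int i = 0; i < LEN; i++)
        if (a[i] != b[i]) return 0;
    return 1;
}

/* Usage: same output, fewer loads. */
void self_check(void) {
    assert(same_result());
    assert(loads_of(1) < loads_of(0));
}
```

The register copies at the end of the body are what rotating-register hardware would perform implicitly.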
7 Control-Flow
    for (i = 0; i < N; i++)
        if (i & 1)
            a[i] += a[i+3];
9 Availability
    y = a[i];
    ...
    if (x) {
        ...
        ... = a[i];
    }
The value of a[i] is available in y at the second use.
10 Conservative Analysis
    if (x) {
        ...
        y = a[i];
    }
    ...
    ... = a[i];      <- y?
11 Predicated PRE
    flag = false;
    if (x) {
        ...
        y = a[i];
        flag = true;
    }
    ...
    ... = flag ? y : a[i];
Invariant: flag = true implies y = a[i]
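The flag-based rewrite can be tried out with a short C sketch. The function names, the ld counter, and the use of a running sum as the consumer of the two reads are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

static int loads;   /* hypothetical counter standing in for ld instructions */

int ld(const int *p) { loads++; return *p; }

/* Original: the second use always reloads a[i]. */
int sum_original(const int *a, int i, int x) {
    int sum = 0;
    if (x) {
        int y = ld(&a[i]);
        sum += y;
    }
    sum += ld(&a[i]);
    return sum;
}

/* Predicated PRE: flag records at run time whether y already holds a[i]. */
int sum_ppre(const int *a, int i, int x) {
    int sum = 0, y = 0;
    bool flag = false;
    if (x) {
        y = ld(&a[i]);
        flag = true;
        sum += y;
    }
    sum += flag ? y : ld(&a[i]);   /* invariant: flag implies y == a[i] */
    return sum;
}

/* Returns the value computed by one version on a one-element array holding 7. */
int value_of(int use_ppre, int x) {
    int a[1] = { 7 };
    loads = 0;
    return use_ppre ? sum_ppre(a, 0, x) : sum_original(a, 0, x);
}

/* Returns the number of loads one version performs. */
int loads_of(int use_ppre, int x) {
    (void)value_of(use_ppre, x);
    return loads;
}

/* Usage: the predicated version loads a[i] exactly once on either path. */
void self_check(void) {
    assert(value_of(0, 1) == value_of(1, 1));
    assert(loads_of(1, 0) == 1 && loads_of(1, 1) == 1);
}
```

The load is removed only on the path where x holds, which is the case a conservative static analysis cannot exploit.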
13 Scalars and Flags
    for (i = 0; i < N; i++)
        if (i & 1)
            a[i] += a[i+3];
Invariant:
    (valid0 = true) implies tmp0 = a[i+0]
    (valid1 = true) implies tmp1 = a[i+1]
    (valid2 = true) implies tmp2 = a[i+2]
    (valid3 = true) implies tmp3 = a[i+3]
tmpk: scalar; validk: bool
14 Scalar Replacement Algorithm
A load, ld a[i+k], becomes:
    if (!validk) {
        tmpk = a[i+k];
        validk = true;
    }
Can be implemented with predication or conditional moves.
A store, st a[i+k], v, becomes:
    tmpk = v;
    validk = true;
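The two rewrite rules on this slide can be sketched as a pair of C helpers. The Slot type, the function names, and the load counter are hypothetical; writing the scalar back to memory is deliberately left out here, since the slide shows only the flag updates.

```c
#include <assert.h>
#include <stdbool.h>

/* One scalarized location: tmpk plus its validk flag (names follow the slide). */
typedef struct { int tmp; bool valid; } Slot;

static int loads;   /* hypothetical counter for real memory loads */

/* Replaces "ld a[i+k]": touch memory only if the scalar is not valid yet. */
int slot_load(Slot *s, const int *addr) {
    if (!s->valid) {
        s->tmp = *addr;
        loads++;
        s->valid = true;
    }
    return s->tmp;
}

/* Replaces "st a[i+k], v": the value goes to the scalar, which becomes valid. */
void slot_store(Slot *s, int v) {
    s->tmp = v;
    s->valid = true;
}

/* Two reads of the same location: returns the number of real loads. */
int loads_after_two_reads(void) {
    int mem = 5;
    Slot s = { 0, false };
    loads = 0;
    (void)slot_load(&s, &mem);
    (void)slot_load(&s, &mem);
    return loads;
}

/* A read after a write: returns the value seen by the read. */
int value_after_store(void) {
    int mem = 5;
    Slot s = { 0, false };
    loads = 0;
    slot_store(&s, 9);
    return slot_load(&s, &mem);
}

/* A read after a write: returns the number of real loads. */
int loads_after_store(void) {
    (void)value_after_store();
    return loads;
}

/* Usage: a second read or a read after a write never reaches memory. */
void self_check(void) {
    assert(loads_after_two_reads() == 1);
    assert(loads_after_store() == 0);
}
```

The branch in slot_load is the part that a predicated load or a conditional move would replace on hardware that supports them.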
15 Optimality
- No scalarized memory location is read or written two times
- The resulting program touches exactly the same memory locations as the original program
- Proof: trivial based on valid flags invariant
(given perfect dependence analysis and enough registers)
16 Additional Details (see paper)
- Initialize validk to false
- Rotate scalars and valid flags
- Use dirtyk flags to avoid extra stores
- Postlude for missing stores
  - if (validk) a[N+k] = tmpk
- Lift loop-invariant accesses
  - (finding loop-invariant predicates)
- Hardware support (for rotating registers and flags)
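The pieces listed above (valid flags initialized to false, rotation, dirty flags, postlude) can be combined in one C sketch. This is an illustration under stated assumptions, not the paper's implementation: a hypothetical predicate array take[] stands in for the loop condition so the reuse pattern can be varied, the reuse distance D is 3, write-back happens when an element rotates out of the window, and the per-element counters exist only to check that no element is loaded twice.

```c
#include <assert.h>
#include <string.h>

enum { N = 8, D = 3, LEN = N + D };   /* illustrative sizes; D is the reuse distance */

static int ld_count[LEN], st_count[LEN];   /* hypothetical per-element access counters */

static int  ld(const int *a, int i)  { ld_count[i]++; return a[i]; }
static void st(int *a, int i, int v) { st_count[i]++; a[i] = v; }

/* Original loop with an arbitrary predicate take[i]. */
void loop_original(int *a, const int *take) {
    for (int i = 0; i < N; i++)
        if (take[i])
            st(a, i, ld(a, i) + ld(a, i + D));
}

/* Scalarized loop: tmp[k] caches a[i+k] whenever valid[k] is set. */
void loop_scalar(int *a, const int *take) {
    int tmp[D + 1] = {0}, valid[D + 1] = {0}, dirty[D + 1] = {0};
    for (int i = 0; i < N; i++) {
        if (take[i]) {
            if (!valid[0]) { tmp[0] = ld(a, i);     valid[0] = 1; }
            if (!valid[D]) { tmp[D] = ld(a, i + D); valid[D] = 1; }
            tmp[0] += tmp[D];               /* the store to a[i] goes to the scalar */
            dirty[0] = 1;
        }
        if (valid[0] && dirty[0]) st(a, i, tmp[0]);   /* a[i] leaves the window */
        for (int k = 0; k < D; k++) {       /* rotate scalars and flags */
            tmp[k] = tmp[k + 1];
            valid[k] = valid[k + 1];
            dirty[k] = dirty[k + 1];
        }
        valid[D] = dirty[D] = 0;
    }
    for (int k = 0; k < D; k++)             /* postlude: flush pending stores */
        if (valid[k] && dirty[k]) st(a, N + k, tmp[k]);
}

static const int take[N] = {1, 0, 1, 1, 0, 1, 1, 1};   /* arbitrary predicate values */

static void run(int use_scalar, int *a) {
    for (int i = 0; i < LEN; i++) a[i] = i * i + 1;
    memset(ld_count, 0, sizeof ld_count);
    memset(st_count, 0, sizeof st_count);
    if (use_scalar) loop_scalar(a, take); else loop_original(a, take);
}

/* Total loads performed by one version. */
int total_loads(int use_scalar) {
    int a[LEN], n = 0;
    run(use_scalar, a);
    for (int i = 0; i < LEN; i++) n += ld_count[i];
    return n;
}

/* Largest number of loads of any single element. */
int max_loads_per_element(int use_scalar) {
    int a[LEN], m = 0;
    run(use_scalar, a);
    for (int i = 0; i < LEN; i++)
        if (ld_count[i] > m) m = ld_count[i];
    return m;
}

/* Returns 1 when both versions leave identical arrays. */
int same_result(void) {
    int a[LEN], b[LEN];
    run(0, a);
    run(1, b);
    return memcmp(a, b, sizeof a) == 0;
}

/* Usage: same output, and no element is loaded more than once. */
void self_check(void) {
    assert(same_result());
    assert(max_loads_per_element(1) <= 1);
}
```

In this particular loop only a[i] is ever written, so the postlude never fires; it is kept to show where stores still pending at loop exit would be flushed.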
18 Redundant Stores
[Chart omitted; axis label: reduction]
19 Redundant Loads
[Chart omitted; axis label: reduction]
20 Performance Impact
Target: Spatial Computation
Removed accesses tend to be cache hits, so they make only a small contribution to running time.
[Chart omitted; axis label: reduction in running time]
21 Conclusions
- Use predicates to dynamically detect redundant memory accesses
- Simple algorithm gives optimal result even with un-analyzable control flow
- Can dramatically reduce memory accesses
22 Related Work
- Carr and Kennedy, PLDI 1990: Scalar Replacement. Arrays, no control flow.
- Carr and Kennedy, SPE 1994: Generalized Scalar Replacement. Restricted control-flow.
- Morel and Renvoise, CACM 1979: Partial Redundancy Elimination. Not across remote iterations.
- Scholz, Europar 2003: Predicated PRE. Single iteration, no writes.
- This work, ODES 2005: PPRE across iterations. Optimal.
Non-speculative promotion
Speculative promotion