Title: Increasing and Detecting Memory Address Congruence
1Increasing and Detecting Memory Address Congruence
Sam Larsen, Emmett Witchel, Saman Amarasinghe
Laboratory for Computer Science, Massachusetts Institute of Technology
2The Congruence Property
int a[M], b[n];
for (i=0; i<n; i++)
  a[b[i]*8] = 0;
[Figure: one 32-byte cache line with byte offsets 0, 4, ..., 28; the accesses for i=0 and i=1 land at the same offset]
Congruent with offset of 0
7The Congruence Property
int a[M];
for (i=0; i<n; i++)
  a[16*i+2] = 0;
[Figure: one 32-byte cache line with byte offsets 0, 4, ..., 28]
Congruent with offset of 8
8The Congruence Property
int a[M];
for (i=0; i<n; i++)
  a[15*i+3] = 0;
[Figure: one 32-byte cache line with byte offsets 0, 4, ..., 28]
Not Congruent (32-byte line)
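The three loops above can be checked numerically. The following is a minimal sketch, assuming a 32-byte line and an array that starts on a line boundary; the helper name `common_offset` and its parameters are hypothetical and not from the talk. It returns the byte offset shared by every access `a[stride*i + start]`, or -1 when the offsets differ.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_SIZE 32  /* assumed cache line size in bytes */

/* Byte offset within a line shared by all accesses a[stride*i + start],
 * i in [0, n), assuming a is line-aligned and each element occupies
 * elem_size bytes. Returns -1 if two accesses land at different offsets. */
int common_offset(size_t stride, size_t start, size_t elem_size, size_t n)
{
    assert(n > 0);
    size_t first = (start * elem_size) % LINE_SIZE;
    for (size_t i = 1; i < n; i++) {
        size_t off = ((stride * i + start) * elem_size) % LINE_SIZE;
        if (off != first)
            return -1;  /* not congruent */
    }
    return (int)first;
}
```

With 4-byte elements, `common_offset(16, 2, 4, n)` yields 8 and `common_offset(15, 3, 4, n)` yields -1. An index that is always a multiple of 8, as in `a[b[i]*8]`, behaves like `common_offset(8, 0, 4, n)` and yields 0.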
9Outline
- Uses of congruence information
- Congruence detection algorithm
- Congruence-increasing transformations
- Results
- Related work
10SIMD Compilation (PLDI '00)
- Multimedia extensions offer wide mem ops
- Motorola's AltiVec
- Intel's MMX/SSE
- Automatic SIMD parallelization
- Multiple mem ops -> single wide mem op
- 128-bit lds/strs must be 128-bit aligned
- SSE: 6-9 cycle penalty for unaligned accesses
- AltiVec: all wide mem ops have to be aligned
11Energy Savings (MICRO '01)
- Skip tag checks in a set-associative cache
- Add special loads/stores to ISA
- First mem op memoizes the cache way
- Second mem op uses this to skip the check
- Compiler analysis determines when data occupy the same line
- Need congruence information
12Banked Memory Architectures
- Offset specifies the memory bank
- Place data close to computation
- Access banks in parallel
13Congruence Recognition
- Iterative dataflow analysis
- Low-level IR
- Lattice elements of the form $an + b$
- For pointers, memory locations accessed
- If $a$ = cache line size, then $b$ = offset
- $32n + 8$: accesses offset 8 in a 32-byte line
14Dataflow Lattice
[Figure: lattice for an 8-byte cache line, levels listed from top to bottom]
⊤ (top element)
8n+0 8n+4 8n+2 8n+6 8n+1 8n+5 8n+3 8n+7
4n+0 4n+2 4n+1 4n+3
2n+0 2n+1
n+0
15Dataflow Lattice
[Figure: the same lattice, with the elements 8n+0 and 4n+2 highlighted]
16Transfer Functions
Meet: $a = \gcd(a_1, a_2, b_1 - b_2)$, $b = b_1 \bmod a$
Add: $a = \gcd(a_1, a_2)$, $b = (b_1 + b_2) \bmod a$
Subtract: $a = \gcd(a_1, a_2)$, $b = (b_1 - b_2) \bmod a$
Multiply: $a = \gcd(a_1 a_2, a_1 b_2, a_2 b_1, C)$, $b = (b_1 b_2) \bmod a$
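The transfer functions above translate almost directly into code. This is a minimal sketch rather than the authors' implementation: the type `cong_t` and the function names are hypothetical, and it assumes that `C` in the multiply rule is the cache line size (32 bytes here), which the slide does not spell out.

```c
#include <assert.h>

#define LINE_SIZE 32  /* assumed meaning of C: the cache line size */

/* Lattice element a*n + b: every value congruent to b modulo a. */
typedef struct { long a, b; } cong_t;

static long gcd(long x, long y)
{
    if (x < 0) x = -x;
    if (y < 0) y = -y;
    while (y != 0) { long t = x % y; x = y; y = t; }
    return x;
}

/* Build an element with b normalized into the range [0, a). */
static cong_t make(long a, long b)
{
    assert(a > 0);
    cong_t r = { a, ((b % a) + a) % a };
    return r;
}

cong_t cong_meet(cong_t x, cong_t y)
{
    return make(gcd(gcd(x.a, y.a), x.b - y.b), x.b);
}

cong_t cong_add(cong_t x, cong_t y)
{
    return make(gcd(x.a, y.a), x.b + y.b);
}

cong_t cong_sub(cong_t x, cong_t y)
{
    return make(gcd(x.a, y.a), x.b - y.b);
}

cong_t cong_mul(cong_t x, cong_t y)
{
    long a = gcd(gcd(x.a * y.a, x.a * y.b), gcd(y.a * x.b, LINE_SIZE));
    return make(a, x.b * y.b);
}
```

For example, the meet of 8n+0 and 4n+2 comes out as 2n+0, and multiplying 32n+7 by 32n+4 (an index scaled by a 4-byte element size) gives 32n+28.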
18The Bad News
- Most mem ops are not congruent
- 32 byte cache line
19Congruence Conventions (Padding)
- Allocate arrays/structs on a line boundary
- Congruent accesses to arrays for a given index
- Congruent accesses to struct fields
- Requires that we
- Allocate stack frames on cache line boundary
- Modify malloc to return aligned data
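One way to get line-aligned heap data is sketched below. It assumes a POSIX system and a 32-byte line; the wrapper name `line_malloc` is hypothetical, and the talk does not say how the authors modified malloc.

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define LINE_SIZE 32  /* assumed cache line size in bytes */

/* Hypothetical malloc replacement: returns memory that starts on a
 * line boundary, or NULL on failure. Release it with free(). */
void *line_malloc(size_t size)
{
    assert(size > 0);
    void *p = NULL;
    if (posix_memalign(&p, LINE_SIZE, size) != 0)
        return NULL;
    return p;
}

/* True if p sits at offset 0 within its cache line. */
int is_line_aligned(const void *p)
{
    return (uintptr_t)p % LINE_SIZE == 0;
}
```

For stack and global arrays, C11 `_Alignas(32)` on the declaration gives the same guarantee where the compiler supports it.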
20Unrolling
- Unrolling creates congruent references
int a[100];
for (i=0; i<n; i+=8) {
  a[i+0] = 0;
  a[i+1] = 0;
  ...
  a[i+7] = 0;
}
[Figure: one 32-byte cache line with byte offsets 0, 4, ..., 28; a[0], a[8], a[16] map to one offset, and a[1], a[9], a[17] map to another]
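Why an unroll factor of 8 works for 4-byte elements can be sketched as follows. The sketch assumes a line-aligned array, a 32-byte line, and a loop index that starts at 0; `ref_info`, `ref_congruence`, and `is_congruent` are hypothetical names. A reference `a[i + k]` whose index advances by `step` per iteration has a byte offset known only modulo gcd(step * elem_size, 32), so it is congruent exactly when that modulus reaches the line size.

```c
#include <assert.h>

#define LINE_SIZE 32  /* assumed cache line size in bytes */

typedef struct { unsigned modulus, offset; } ref_info;

static unsigned gcd_u(unsigned x, unsigned y)
{
    while (y != 0) { unsigned t = x % y; x = y; y = t; }
    return x;
}

/* Congruence of the reference a[i + k] when i = 0, step, 2*step, ...
 * and a is line-aligned: byte offsets are known modulo `modulus`. */
ref_info ref_congruence(unsigned step, unsigned k, unsigned elem_size)
{
    assert(step > 0 && elem_size > 0);
    ref_info r;
    r.modulus = gcd_u(step * elem_size, LINE_SIZE);
    r.offset = (k * elem_size) % r.modulus;
    return r;
}

/* A reference is congruent when its offset within the line is fixed. */
int is_congruent(ref_info r)
{
    return r.modulus == LINE_SIZE;
}
```

In the rolled loop (`step` 1) the single reference is known only modulo 4, so it is not congruent. After unrolling by 8, reference `k` is congruent with offset `4*k`; for instance `a[i+1]` always lands at offset 4.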
22Congruence with Parameters
void init(int *a) {
  for (i=0; i<n; i+=8) {
    a[i+0] = 0;
    a[i+1] = 0;
    ...
    a[i+7] = 0;
  }
}
void main() {
  int a[100];
  init(&a[2]);
  init(&a[3]);
}
[Figure: one 32-byte cache line with byte offsets 0, 4, ..., 28]
24Pre-loop
- Add a pre-loop to enforce congruence
for (i=0; i<n; i++) {
  if ((int)&a[i] % 32 == 0) break;
  a[i] = 0;
}
for (; i<n; i+=8) {
  a[i+0] = 0;
  a[i+1] = 0;
  ...
  a[i+7] = 0;
}
[Figure: one 32-byte cache line with byte offsets 0, 4, ..., 28]
25Pre-loop
- Add a pre-loop to enforce congruence
- Mem ops congruent in the unrolled body
- Pre-loop has few iterations
- Most dynamic mem ops are congruent
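A compilable version of the pre-loop transformation might look like the sketch below. It assumes 4-byte ints and a 32-byte line. The function name `zero_with_preloop` and the trailing cleanup loop are additions for illustration; the cleanup loop handles trip counts that are not a multiple of 8, which the slide code does not show.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 32  /* assumed cache line size in bytes */
#define UNROLL 8      /* LINE_SIZE / sizeof(int), assuming 4-byte int */

/* Zero a[0..n-1]; returns how many elements the pre-loop handled. */
size_t zero_with_preloop(int *a, size_t n)
{
    assert(a != NULL);
    size_t i = 0;

    /* Pre-loop: peel iterations until &a[i] is on a line boundary. */
    for (; i < n; i++) {
        if ((uintptr_t)&a[i] % LINE_SIZE == 0)
            break;
        a[i] = 0;
    }
    size_t peeled = i;

    /* Unrolled body: a[i+k] is at byte offset 4*k in its line. */
    for (; i + UNROLL <= n; i += UNROLL) {
        a[i + 0] = 0; a[i + 1] = 0; a[i + 2] = 0; a[i + 3] = 0;
        a[i + 4] = 0; a[i + 5] = 0; a[i + 6] = 0; a[i + 7] = 0;
    }

    /* Cleanup loop for the leftover iterations. */
    for (; i < n; i++)
        a[i] = 0;

    return peeled;
}
```

Because the pre-loop runs at most 7 iterations for 4-byte elements, nearly all dynamic stores execute in the unrolled body, where each has a known offset.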
26Finding the Break Condition
- Can we choose arbitrarily?
void init(int *x) {
  int i;
  for (i=0; i<100; i+=2) {
    if ((int)&x[i] % 32 == 0) break;
    x[i] = 0;
  }
  ...
}
int main() {
  int x[200];
  init(&x[1]);
}
i    &x[i] % 32
0    4
2    12
4    20
6    28
8    4
NO!
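Whether a given break condition can ever fire is easy to test by stepping through the offsets. This is a small sketch with a hypothetical helper `first_hit`, assuming a 32-byte line: it returns the first iteration at which the access reaches the target offset, or -1 if it never does.

```c
#include <assert.h>

#define LINE_SIZE 32  /* assumed cache line size in bytes */

/* An access starts at byte offset `start` within its line and advances
 * by `stride` bytes per iteration. Returns the first iteration at which
 * it sits at offset `target`, or -1 if that offset is never reached. */
int first_hit(unsigned start, unsigned stride, unsigned target)
{
    assert(start < LINE_SIZE && target < LINE_SIZE);
    unsigned off = start;
    /* The offset sequence repeats within LINE_SIZE steps. */
    for (int i = 0; i < LINE_SIZE; i++) {
        if (off == target)
            return i;
        off = (off + stride) % LINE_SIZE;
    }
    return -1;
}
```

For the loop above, which starts at offset 4 and moves 8 bytes per iteration, `first_hit(4, 8, 0)` is -1, so testing for offset 0 never exits the pre-loop, whereas testing for offset 20 would exit at the third iteration.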
27Finding the Break Condition
void copy(int *x, int *y) {
  int i;
  for (i=0; i<100; i++) {
    if ((int)&x[i] % 32 == 0 &&
        (int)&y[i] % 32 == 0) break;
    x[i] = y[i];
  }
  ...
}
int main() {
  int x[200], y[200];
  copy(&x[0], &y[0]);
  copy(&x[0], &y[1]);
}
first call
i    &x[i] % 32    &y[i] % 32
0    0             0
1    4             4
...
8    0             0
second call
i    &x[i] % 32    &y[i] % 32
0    0             4
1    4             8
...
8    0             4
29Finding the Break Condition
void copy(int *x, int *y) {
  int i;
  for (i=0; i<100; i++) {
    if ((int)&x[i] % 32 == 0 &&
        (int)&y[i] % 32 == 4) break;
    x[i] = y[i];
  }
  ...
}
int main() {
  int x[200], y[200];
  copy(&x[0], &y[0]);
  copy(&x[0], &y[1]);
}
first call
i    &x[i] % 32    &y[i] % 32
0    0             0
1    4             4
...
8    0             0
second call
i    &x[i] % 32    &y[i] % 32
0    0             4
1    4             8
...
8    0             4
30Finding the Break Condition
void copy(int *x, int *y) {
  int i;
  for (i=0; i<100; i++) {
    if ((int)&x[i] % 32 == 0) break;
    x[i] = y[i];
  }
  ...
}
int main() {
  int x[200], y[200];
  copy(&x[0], &y[0]);
  copy(&x[0], &y[1]);
}
first call
i    &x[i] % 32    &y[i] % 32
0    0             0
1    4             4
...
8    0             0
second call
i    &x[i] % 32    &y[i] % 32
0    0             4
1    4             8
...
8    0             4
31Finding the Break Condition
- Use profiling to observe runtime addresses
- Find best break condition for the profile
- Exhaustive search
- Consider all possible break conditions
- Compute iterations in unrolled loop
- Multiply by # of mem ops with known offset
- Break condition with highest value is the best
- Results vary little with profile data set
- Insignificant on all but one benchmark
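The search described above can be sketched for a two-pointer loop such as `copy`. Everything here is an illustrative assumption rather than the authors' tool: the profile record `call_t`, the candidate encoding with `ANY` meaning "this pointer is not tested", 4-byte elements with stride 1, and a score measured in original loop iterations rather than unrolled iterations.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_SIZE 32  /* assumed cache line size in bytes */
#define ELEM 4        /* assumed element size in bytes */
#define ANY (-1)      /* pointer is not tested in the break condition */

/* One profiled call: starting line offsets of x and y, and trip count. */
typedef struct { int x_off, y_off; long trips; } call_t;
typedef struct { int x_target, y_target; long score; } cond_t;

/* Iterations left for the unrolled loop once the pre-loop exits,
 * or 0 if the break condition never fires for this call. */
static long unrolled_iters(call_t c, int xt, int yt)
{
    /* Offsets repeat every LINE_SIZE / ELEM iterations. */
    for (long i = 0; i < c.trips && i < LINE_SIZE / ELEM; i++) {
        int xo = (int)((c.x_off + i * ELEM) % LINE_SIZE);
        int yo = (int)((c.y_off + i * ELEM) % LINE_SIZE);
        if ((xt == ANY || xo == xt) && (yt == ANY || yo == yt))
            return c.trips - i;
    }
    return 0;
}

/* Try every break condition and keep the one with the highest score:
 * iterations in the unrolled loop times mem ops with a known offset. */
cond_t best_break_condition(const call_t *profile, size_t ncalls)
{
    assert(profile != NULL && ncalls > 0);
    cond_t best = { ANY, ANY, -1 };
    for (int xt = ANY; xt < LINE_SIZE; xt = (xt == ANY) ? 0 : xt + ELEM) {
        for (int yt = ANY; yt < LINE_SIZE; yt = (yt == ANY) ? 0 : yt + ELEM) {
            int known = (xt != ANY) + (yt != ANY);
            long score = 0;
            for (size_t k = 0; k < ncalls; k++)
                score += unrolled_iters(profile[k], xt, yt) * known;
            if (score > best.score) {
                best.x_target = xt;
                best.y_target = yt;
                best.score = score;
            }
        }
    }
    return best;
}
```

With a profile in which `y` starts at offset 0 in two calls and at offset 4 in one, the search prefers testing both pointers against offset 0. With one call of each kind, testing only `x` scores as well as any two-pointer condition, which matches the progression of the `copy` example.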
32Congruence Results (SPECfp95)
[Figure: results chart; legend entries: Original, Congruent]
33Congruence Results (SPECfp95)
34Congruence Results (MediaBench)
35Execution Time Overhead
benchmark   unrolling   pre-loop
applu       -6.27       -5.28
apsi         0.93        1.13
fpppp        0.00        0.00
hydro2d      0.99        0.39
mgrid        0.72        0.72
su2cor      -0.32        0.11
swim        -0.96       -0.17
tomcatv     -0.18        0.65
turb3d      -0.80        1.72
wave5        3.75        4.58
36DCache Energy Savings (MICRO '01)
37Related Work
- Fisher and Ellis: Bulldog Compiler
- Memory bank disambiguation
- Loop unrolling
- Barua et al.: Raw Compiler
- Modulo unrolling
- Davidson et al.: Mem Access Coalescing
- Loop Unrolling
- Alignment checks at runtime
38Conclusions
- Increased number of congruent refs by 5x
- Analysis detected 95%
- Results are good
- MediaBench: 65% congruent, 60% detected
- SPECfp95: 84% congruent, 82% detected
- Many uses of congruence information
- Wide accesses in multimedia extensions
- Energy savings by tag check elimination
- Bank disambiguation in clustered architectures
40Example
int a[100];
for (i=0; i<n; i+=8) {
  a[i+0] = 0;
  a[i+1] = 0;
  ...
  a[i+7] = 0;
}
41Example
i:  32n+0
r0: 32n+0 + 32n+7 = 32n+7
r1: 32n+7 * 32n+4 = 32n+28
r2: 32n+28 + 32n+0 = 32n+28
i:  32n+0 + 32n+8 = 32n+8
42Example
i:  32n+0 (initial value)
      First pass                   Second pass
i:    32n+0                        32n+0 ∩ 32n+8 = 8n+0
r0:   32n+0 + 32n+7 = 32n+7        8n+0 + 32n+7 = 8n+7
r1:   32n+7 * 32n+4 = 32n+28       8n+7 * 32n+4 = 32n+28
r2:   32n+28 + 32n+0 = 32n+28      32n+28 + 32n+0 = 32n+28
i:    32n+0 + 32n+8 = 32n+8        8n+0 + 32n+8 = 8n+0
r2 offset is 28
43Multimedia Compilation
- Power Mac G4 with AltiVec
- Commercial vectorizing compiler
- Alignment pragmas
datatype   Vector length   Speedup (unaligned)   Speedup (aligned)   Improvement (%)
float      4               3.25                  4.75                46
int        4               2.15                  2.93                36
short      8               2.98                  5.87                97
char       16              5.21                  11.53               121
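The improvement column appears to be the relative gain of the aligned speedup over the unaligned one. As a check under that assumption, the float and char rows work out as:

\[
\frac{4.75}{3.25} - 1 \approx 0.46 = 46\%, \qquad \frac{11.53}{5.21} - 1 \approx 1.21 = 121\%
\]

The int and short rows give 36% and 97% the same way.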