Increasing and Detecting Memory Address Congruence

Learn more at: https://groups.csail.mit.edu

1
Increasing and Detecting Memory Address Congruence
Sam Larsen, Emmett Witchel, Saman Amarasinghe
Laboratory for Computer Science
Massachusetts Institute of Technology
2
The Congruence Property
int a[M], b[n];
for (i = 0; i < n; i++)
  a[b[i]*8] = 0;
[Figure: cache line with byte offsets 0, 4, 8, ...]

6
The Congruence Property
int a[M], b[n];
for (i = 0; i < n; i++)
  a[b[i]*8] = 0;
[Figure: 32-byte cache line with byte offsets 0, 4, 8, 12, 16, 20, 24, 28]
Congruent with offset of 0
7
The Congruence Property
int a[M];
for (i = 0; i < n; i++)
  a[16*i+2] = 0;
[Figure: 32-byte cache line with byte offsets 0, 4, 8, 12, 16, 20, 24, 28]
Congruent with offset of 8
8
The Congruence Property
int a[M];
for (i = 0; i < n; i++)
  a[15*i+3] = 0;
[Figure: 32-byte cache line with byte offsets 0, 4, 8, 12, 16, 20, 24, 28]
Not Congruent (32-byte line)
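The three examples above can be checked with a little arithmetic. The sketch below is not from the slides; the function name affine_line_offset is hypothetical, and it assumes the array starts on a cache line boundary and that the index has the form stride*i + c for an arbitrary integer i.

```c
#include <assert.h>

/* Line offset of the access a[stride*i + c], valid for every i,
   or -1 if the offset changes with i (the access is not congruent).
   Assumption: a[] itself starts on a cache line boundary. */
int affine_line_offset(long stride, long c, long elem_size, long line_size) {
    assert(elem_size > 0 && line_size > 0);
    /* Each step of i moves the address by stride*elem_size bytes.
       The offset is fixed only if that step is a whole number of lines. */
    if ((stride * elem_size) % line_size != 0)
        return -1;
    long off = (c * elem_size) % line_size;
    if (off < 0)
        off += line_size;
    return (int)off;
}
```

With 4-byte ints and a 32-byte line, a[b[i]*8] corresponds to stride 8 and c 0, a[16*i+2] to stride 16 and c 2, and a[15*i+3] to stride 15 and c 3.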
9
Outline

Uses of congruence information
Congruence detection algorithm
Congruence-increasing transformations
Results
Related work

10
SIMD Compilation (PLDI '00)

Multimedia extensions offer wide mem ops
Motorola's AltiVec
Intel's MMX/SSE
Automatic SIMD parallelization
Multiple mem ops -> single wide mem op
128-bit lds/strs must be 128-bit aligned
SSE: 6-9 cycle penalty for unaligned accesses
AltiVec: all wide mem ops have to be aligned

11
Energy Savings (MICRO '01)

Skip tag checks in a set-associative cache
Add special loads/stores to ISA
First mem op memoizes the cache way
Second mem op uses this to skip the check
Compiler analysis determines when data occupy the same line
Need congruence information

12
Banked Memory Architectures

Offset specifies the memory bank
Place data close to computation
Access banks in parallel

13
Congruence Recognition

Iterative dataflow analysis
Low-level IR
Lattice elements of the form an + b
For pointers, memory locations accessed
If a = cache line size, then b = offset
32n + 8 accesses offset 8 in a 32-byte line

14
Dataflow Lattice
⊤
8n+0  8n+4  8n+2  8n+6  8n+1  8n+5  8n+3  8n+7
4n+0  4n+2  4n+1  4n+3
2n+0  2n+1
n+0
8-byte cache line
15
Dataflow Lattice
⊤
8n+0  8n+4  8n+2  8n+6  8n+1  8n+5  8n+3  8n+7
4n+0  4n+2  4n+1  4n+3
2n+0  2n+1
n+0
8n+0  4n+2
16
Transfer Functions
Meet:      a = gcd(a1, a2, b1 - b2)          b = b1 % a
17
Transfer Functions
Meet:      a = gcd(a1, a2, b1 - b2)          b = b1 % a
Subtract:  a = gcd(a1, a2)                   b = (b1 - b2) % a
Add:       a = gcd(a1, a2)                   b = (b1 + b2) % a
Multiply:  a = gcd(a1*a2, a1*b2, a2*b1, C)   b = (b1*b2) % a
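A minimal C sketch of these four transfer functions follows. The type Cong and the function names are hypothetical, not from the slides. The slide does not define C; the sketch passes it as a parameter and the usage below assumes it is the cache line size. Remainders are normalized to be non-negative.

```c
#include <assert.h>
#include <stdlib.h>

/* A lattice element "a*n + b" with a >= 1 and 0 <= b < a. */
typedef struct { long a; long b; } Cong;

static long gcd_l(long x, long y) {
    x = labs(x);
    y = labs(y);
    while (y != 0) { long t = x % y; x = y; y = t; }
    return x;                       /* gcd_l(x, 0) == x */
}

static long mod_pos(long v, long a) {
    long r = v % a;
    return r < 0 ? r + a : r;       /* remainder in [0, a) */
}

Cong cong_meet(Cong p, Cong q) {
    assert(p.a > 0 && q.a > 0);
    long a = gcd_l(gcd_l(p.a, q.a), p.b - q.b);
    return (Cong){ a, mod_pos(p.b, a) };
}

Cong cong_add(Cong p, Cong q) {
    assert(p.a > 0 && q.a > 0);
    long a = gcd_l(p.a, q.a);
    return (Cong){ a, mod_pos(p.b + q.b, a) };
}

Cong cong_sub(Cong p, Cong q) {
    assert(p.a > 0 && q.a > 0);
    long a = gcd_l(p.a, q.a);
    return (Cong){ a, mod_pos(p.b - q.b, a) };
}

/* C is the extra gcd argument from the slide (assumed: cache line size). */
Cong cong_mul(Cong p, Cong q, long C) {
    assert(p.a > 0 && q.a > 0 && C > 0);
    long a = gcd_l(gcd_l(p.a * q.a, p.a * q.b), gcd_l(q.a * p.b, C));
    return (Cong){ a, mod_pos(p.b * q.b, a) };
}
```

For example, the meet of 8n+0 and 4n+2 is 2n+0, and multiplying 8n+7 by the constant 32n+4 with C = 32 gives 32n+28.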
18
The Bad News

Most mem ops are not congruent
32-byte cache line

19
Congruence Conventions (Padding)

Allocate arrays/structs on a line boundary
Congruent accesses to arrays for a given index
Congruent accesses to struct fields
Requires that we
Allocate stack frames on cache line boundary
Modify malloc to return aligned data
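One way to get line-aligned heap data without changing malloc itself is the C11 aligned_alloc call. The wrapper below is only a sketch: the names line_alloc and line_offset are hypothetical, and the 32-byte line size is taken from the slides.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define LINE_SIZE 32   /* cache line size in bytes (value used on the slides) */

/* Allocate at least `bytes` bytes starting on a cache line boundary.
   aligned_alloc expects the size to be a multiple of the alignment,
   so the request is rounded up to whole lines. Release with free(). */
void *line_alloc(size_t bytes) {
    assert((LINE_SIZE & (LINE_SIZE - 1)) == 0);   /* alignment must be a power of two */
    size_t rounded = (bytes + LINE_SIZE - 1) / LINE_SIZE * LINE_SIZE;
    if (rounded == 0)
        rounded = LINE_SIZE;
    return aligned_alloc(LINE_SIZE, rounded);
}

/* Byte offset of an address within its cache line. */
size_t line_offset(const void *p) {
    return (size_t)((uintptr_t)p % LINE_SIZE);
}
```

With such an allocator, the element at byte 8 of any array it returns is always at line offset 8, which is the convention the slide asks for.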

20
Unrolling

Unrolling creates congruent references

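int a[100];
for (i = 0; i < n; i += 8) {
  a[i+0] = 0;
  a[i+1] = 0;
  ...
  a[i+7] = 0;
}
a[0], a[8], a[16], ...
[Figure: 32-byte cache line with byte offsets 0, 4, 8, 12, 16, 20, 24, 28]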
21
Unrolling

Unrolling creates congruent references

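int a[100];
for (i = 0; i < n; i += 8) {
  a[i+0] = 0;
  a[i+1] = 0;
  ...
  a[i+7] = 0;
}
a[1], a[9], a[17], ...
[Figure: 32-byte cache line with byte offsets 0, 4, 8, 12, 16, 20, 24, 28]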
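The effect of unrolling can be checked by brute force. The sketch below is illustrative only; the function names are hypothetical, and the base offset of the array within its line is passed in explicitly rather than read from a real address.

```c
#include <assert.h>

#define LINE_SIZE 32   /* cache line size in bytes */

/* Line offset of the reference a[i + k], where a[0] sits at byte
   offset base_off within its line and elements are elem bytes wide. */
unsigned ref_offset(unsigned base_off, unsigned elem, unsigned i, unsigned k) {
    return (base_off + elem * (i + k)) % LINE_SIZE;
}

/* Returns 1 if a[i + k] hits the same line offset on every iteration
   of a loop that runs i = 0, step, 2*step, ... while i < n. */
int is_congruent(unsigned base_off, unsigned elem, unsigned step,
                 unsigned k, unsigned n) {
    assert(step > 0);
    unsigned first = ref_offset(base_off, elem, 0, k);
    for (unsigned i = 0; i < n; i += step)
        if (ref_offset(base_off, elem, i, k) != first)
            return 0;
    return 1;
}
```

With 4-byte elements, a step of 8 makes each of the eight references congruent (a[i+1] always lands at offset 4), while the original step of 1 does not.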
22
Congruence with Parameters
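void init(int *a) {
  for (i = 0; i < n; i += 8) {
    a[i+0] = 0;
    a[i+1] = 0;
    ...
    a[i+7] = 0;
  }
}
void main() {
  int a[100];
  init(&a[2]);
  init(&a[3]);
}
[Figure: 32-byte cache line with byte offsets 0, 4, 8, 12, 16, 20, 24, 28]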
24
Pre-loop

Add a pre-loop to enforce congruence

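for (i = 0; i < n; i++) {
  if ((int)&a[i] % 32 == 0) break;
  a[i] = 0;
}
for (; i < n; i += 8) {
  a[i+0] = 0;
  a[i+1] = 0;
  ...
  a[i+7] = 0;
}
[Figure: 32-byte cache line with byte offsets 0, 4, 8, 12, 16, 20, 24, 28]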
25
Pre-loop

Add a pre-loop to enforce congruence
Mem ops congruent in the unrolled body
Pre-loop has few iterations
Most dynamic mem ops are congruent
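A runnable version of the pre-loop idea might look like the sketch below. It is not the authors' generated code: the function name zero_fill is hypothetical, the cleanup loop for leftover iterations is an addition the slide does not show, and the offset comments assume 4-byte ints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 32   /* cache line size in bytes */

/* Zero n ints starting at a. Returns the number of pre-loop iterations. */
int zero_fill(int *a, int n) {
    assert(a != NULL && n >= 0);
    int i;

    /* Pre-loop: peel iterations until &a[i] reaches a line boundary. */
    for (i = 0; i < n; i++) {
        if ((uintptr_t)&a[i] % LINE_SIZE == 0)
            break;
        a[i] = 0;
    }
    int pre = i;

    /* Unrolled body: with 4-byte ints, a[i+k] is always at line offset 4*k. */
    for (; i + 8 <= n; i += 8) {
        a[i + 0] = 0;
        a[i + 1] = 0;
        a[i + 2] = 0;
        a[i + 3] = 0;
        a[i + 4] = 0;
        a[i + 5] = 0;
        a[i + 6] = 0;
        a[i + 7] = 0;
    }

    /* Cleanup for the last n % 8 elements (added here, not on the slide). */
    for (; i < n; i++)
        a[i] = 0;

    return pre;
}
```

Called on a pointer at line offset 12, the pre-loop runs 5 times, and every later store in the unrolled body has a known offset.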

26
Finding the Break Condition

Can we choose arbitrarily?

void init(int *x) {
  int i;
  for (i = 0; i < 100; i += 2) {
    if ((int)&x[i] % 32 == 0) break;
    x[i] = 0;
  }
  ...
}
int main() {
  int x[200];
  init(&x[1]);
}

 i   &x[i] % 32
 0    4
 2   12
 4   20
 6   28
 8    4

NO!
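Whether a given break condition can ever fire is easy to test ahead of time. The sketch below is illustrative; the function name preloop_iterations is hypothetical, and offsets are modelled as plain integers rather than real addresses.

```c
#include <assert.h>

/* Number of pre-loop iterations before the address offset reaches
   `target`, starting at line offset start_off and advancing by
   stride_bytes per iteration. Returns -1 if the target is never hit. */
int preloop_iterations(unsigned start_off, unsigned stride_bytes,
                       unsigned target, unsigned line) {
    assert(line > 0);
    unsigned off = start_off % line;
    /* The offset sequence repeats after at most `line` steps,
       so checking that many iterations is enough. */
    for (unsigned it = 0; it < line; it++) {
        if (off == target)
            return (int)it;
        off = (off + stride_bytes) % line;
    }
    return -1;
}
```

For the call init(&x[1]) above, the start offset is 4 and the stride is 8 bytes, so a target of 0 is never reached, while a target of 28 is reached after 3 iterations.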
27
Finding the Break Condition
void copy(int *x, int *y) {
  int i;
  for (i = 0; i < 100; i++) {
    if ((int)&x[i] % 32 == 0 &&
        (int)&y[i] % 32 == 0) break;
    x[i] = y[i];
  }
  ...
}
int main() {
  int x[200], y[200];
  copy(&x[0], &y[0]);
  copy(&x[0], &y[1]);
}

first call
 i   &x[i] % 32   &y[i] % 32
 0    0            0
 1    4            4
 ...
 8    0            0

second call
 i   &x[i] % 32   &y[i] % 32
 0    0            4
 1    4            8
 ...
 8    0            4
29
Finding the Break Condition
void copy(int *x, int *y) {
  int i;
  for (i = 0; i < 100; i++) {
    if ((int)&x[i] % 32 == 0 &&
        (int)&y[i] % 32 == 4) break;
    x[i] = y[i];
  }
  ...
}
int main() {
  int x[200], y[200];
  copy(&x[0], &y[0]);
  copy(&x[0], &y[1]);
}

first call
 i   &x[i] % 32   &y[i] % 32
 0    0            0
 1    4            4
 ...
 8    0            0

second call
 i   &x[i] % 32   &y[i] % 32
 0    0            4
 1    4            8
 ...
 8    0            4
30
Finding the Break Condition
void copy(int *x, int *y) {
  int i;
  for (i = 0; i < 100; i++) {
    if ((int)&x[i] % 32 == 0) break;
    x[i] = y[i];
  }
  ...
}
int main() {
  int x[200], y[200];
  copy(&x[0], &y[0]);
  copy(&x[0], &y[1]);
}

first call
 i   &x[i] % 32   &y[i] % 32
 0    0            0
 1    4            4
 ...
 8    0            0

second call
 i   &x[i] % 32   &y[i] % 32
 0    0            4
 1    4            8
 ...
 8    0            4
31
Finding the Break Condition

Use profiling to observe runtime addresses
Find best break condition for the profile
Exhaustive search
Consider all possible break conditions
Compute iterations in unrolled loop
Multiply by # of mem ops with known offset
Break condition with highest value is the best
Results vary little with profile data set
Insignificant on all but one benchmark
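The search described above can be sketched in C under some simplifying assumptions that are not from the slides: only break conditions that test a single reference are considered (no conjunctions), every reference advances by one 4-byte element per iteration, and a reference counts as having a known offset when its offset at the break point is the same in every profiled call. The types Call and BreakCond and the function best_break are hypothetical.

```c
#include <assert.h>

#define LINE  32   /* cache line size in bytes */
#define ELEM  4    /* bytes advanced per original loop iteration */
#define NREFS 2    /* memory references in the loop body, e.g. x[i] and y[i] */

/* One profiled call: starting line offset of each reference, and trip count. */
typedef struct { unsigned start[NREFS]; int trip; } Call;

/* Candidate condition "offset of reference ref == target", with its score. */
typedef struct { int ref; unsigned target; long score; } BreakCond;

/* Pre-loop iterations executed before reference r reaches `target`. */
static int pre_iters(const Call *c, int r, unsigned target) {
    for (int i = 0; i < c->trip; i++)
        if ((c->start[r] + (unsigned)i * ELEM) % LINE == target)
            return i;
    return c->trip;   /* the condition never fires */
}

BreakCond best_break(const Call *calls, int ncalls) {
    assert(ncalls > 0);
    BreakCond best = { -1, 0, -1 };
    for (int r = 0; r < NREFS; r++) {
        for (unsigned t = 0; t < LINE; t += ELEM) {
            /* Original-loop iterations that end up in the unrolled loop. */
            long unrolled = 0;
            for (int c = 0; c < ncalls; c++)
                unrolled += calls[c].trip - pre_iters(&calls[c], r, t);

            /* References whose offset at the break point agrees across calls. */
            int known = 0;
            for (int q = 0; q < NREFS; q++) {
                int same = 1;
                unsigned first = 0;
                for (int c = 0; c < ncalls; c++) {
                    unsigned pre = (unsigned)pre_iters(&calls[c], r, t);
                    unsigned off = (calls[c].start[q] + pre * ELEM) % LINE;
                    if (c == 0)
                        first = off;
                    else if (off != first)
                        same = 0;
                }
                known += same;
            }

            long score = unrolled * known;
            if (score > best.score)
                best = (BreakCond){ r, t, score };
        }
    }
    return best;
}
```

For a profile matching the two copy calls above (x at offset 0 both times, y at offset 0 and then 4, 100 iterations each), this sketch picks the condition on x with target offset 0.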

32
Congruence Results (SPECfp95)
[Chart legend: Original, Congruent]
33
Congruence Results (SPECfp95)
34
Congruence Results (MediaBench)
35
Execution Time Overhead
Benchmark   Unrolling (%)   Pre-loop (%)
applu         -6.27           -5.28
apsi           0.93            1.13
fpppp          0.00            0.00
hydro2d        0.99            0.39
mgrid          0.72            0.72
su2cor        -0.32            0.11
swim          -0.96           -0.17
tomcatv       -0.18            0.65
turb3d        -0.80            1.72
wave5          3.75            4.58
36
DCache Energy Savings (MICRO '01)
37
Related Work

Fisher and Ellis: Bulldog Compiler
Memory bank disambiguation
Loop unrolling
Barua et al.: Raw Compiler
Modulo unrolling
Davidson et al.: Mem Access Coalescing
Loop unrolling
Alignment checks at runtime

38
Conclusions

Increased number of congruent refs by 5x
Analysis detected 95%
Results are good
MediaBench: 65% congruent, 60% detected
SPECfp95: 84% congruent, 82% detected
Many uses of congruence information
Wide accesses in multimedia extensions
Energy savings by tag check elimination
Bank disambiguation in clustered architectures

40
Example
int a[100];
for (i = 0; i < n; i += 8) {
  a[i+0] = 0;
  a[i+1] = 0;
  ...
  a[i+7] = 0;
}
41
Example
i:  32n+0
r0: (32n+0) + (32n+7) = 32n+7
r1: (32n+7) * (32n+4) = 32n+28
r2: (32n+28) + (32n+0) = 32n+28
i:  (32n+0) + (32n+8) = 32n+8
42
Example
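First pass:
i:  32n+0
r0: (32n+0) + (32n+7) = 32n+7
r1: (32n+7) * (32n+4) = 32n+28
r2: (32n+28) + (32n+0) = 32n+28
i:  (32n+0) + (32n+8) = 32n+8
Second pass:
i:  (32n+0) ∩ (32n+8) = 8n+0
r0: (8n+0) + (32n+7) = 8n+7
r1: (8n+7) * (32n+4) = 32n+28
r2: (32n+28) + (32n+0) = 32n+28
i:  (8n+0) + (32n+8) = 8n+0
r2 offset is 28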
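The example can be replayed as a small fixed-point iteration. The sketch below is a reading of the slide, not the authors' implementation: it assumes the loop body is r0 = i + 7, r1 = r0 * 4, r2 = a + r1, i = i + 8, that the array base a is line-aligned (32n+0), and that the constant C in the multiply rule is the 32-byte line size. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { long a; long b; } Cong;   /* lattice element a*n + b */

static long gcd_l(long x, long y) {
    x = labs(x);
    y = labs(y);
    while (y != 0) { long t = x % y; x = y; y = t; }
    return x;
}

static long mod_pos(long v, long a) {
    long r = v % a;
    return r < 0 ? r + a : r;
}

static Cong cong_meet(Cong p, Cong q) {
    long a = gcd_l(gcd_l(p.a, q.a), p.b - q.b);
    return (Cong){ a, mod_pos(p.b, a) };
}

static Cong cong_add(Cong p, Cong q) {
    long a = gcd_l(p.a, q.a);
    return (Cong){ a, mod_pos(p.b + q.b, a) };
}

static Cong cong_mul(Cong p, Cong q, long C) {
    long a = gcd_l(gcd_l(p.a * q.a, p.a * q.b), gcd_l(q.a * p.b, C));
    return (Cong){ a, mod_pos(p.b * q.b, a) };
}

/* Iterate the loop body until the value of i at the loop head stops
   changing, then report the lattice value of the address r2. */
Cong analyze_r2(void) {
    const long C = 32;            /* assumed cache line size */
    Cong base = { C, 0 };         /* array a, assumed line-aligned */
    Cong i_in = { C, 0 };         /* i = 0 on loop entry */
    Cong r2;
    assert(C > 0);
    for (;;) {
        Cong r0 = cong_add(i_in, (Cong){ C, 7 });      /* r0 = i + 7  */
        Cong r1 = cong_mul(r0, (Cong){ C, 4 }, C);     /* r1 = r0 * 4 */
        r2 = cong_add(r1, base);                       /* r2 = a + r1 */
        Cong i_out = cong_add(i_in, (Cong){ C, 8 });   /* i = i + 8   */
        Cong next = cong_meet(i_in, i_out);            /* merge at loop head */
        if (next.a == i_in.a && next.b == i_in.b)
            break;                                     /* fixed point reached */
        i_in = next;
    }
    return r2;
}
```

The iteration settles after the second pass, with i at 8n+0 and r2 at 32n+28.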
43
Multimedia Compilation

Power Mac G4 with AltiVec
Commercial vectorizing compiler
Alignment pragmas

Datatype   Vector length   Speedup (unaligned)   Speedup (aligned)   Improvement (%)
float       4               3.25                  4.75                 46
int         4               2.15                  2.93                 36
short       8               2.98                  5.87                 97
char       16               5.21                 11.53                121
