Increasing and Detecting Memory Address Congruence

1
Increasing and Detecting Memory Address Congruence
Sam Larsen, Emmett Witchel, Saman Amarasinghe
Laboratory for Computer Science
Massachusetts Institute of Technology
2
The Congruence Property
    int a[M], b[n];
    for (i = 0; i < n; i++)
        a[b[i] * 8] = 0;
[Figure: cache line marked at byte offsets 0, 4, 8, ...]

6
The Congruence Property
    int a[M], b[n];
    for (i = 0; i < n; i++)
        a[b[i] * 8] = 0;
[Figure: 32-byte cache line marked at byte offsets 0 through 28 in steps of 4]
Congruent with offset of 0
7
The Congruence Property
    int a[M];
    for (i = 0; i < n; i++)
        a[16 * i + 2] = 0;
[Figure: 32-byte cache line marked at byte offsets 0 through 28 in steps of 4]
Congruent with offset of 8
8
The Congruence Property
    int a[M];
    for (i = 0; i < n; i++)
        a[15 * i + 3] = 0;
[Figure: 32-byte cache line marked at byte offsets 0 through 28 in steps of 4]
Not Congruent (32-byte line)
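
The three loops above can be checked by computing each address modulo the line size. This is a minimal sketch, not code from the talk: `LINE_SIZE`, `line_offset`, and `common_offset` are hypothetical names, and it assumes 4-byte `int`s and an array that starts on a 32-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LINE_SIZE = 32 };  /* assumed cache line size in bytes */

/* Byte offset of an address within its cache line. */
static unsigned line_offset(const void *p) {
    return (unsigned)((uintptr_t)p % LINE_SIZE);
}

/* Offset shared by a[stride*i + start] for i = 0 .. n-1,
   or -1 if the accesses do not all land on the same offset.
   The caller must ensure every index is inside the array. */
static int common_offset(const int *a, size_t stride, size_t start, size_t n) {
    assert(n > 0);
    unsigned first = line_offset(&a[start]);
    for (size_t i = 1; i < n; i++) {
        if (line_offset(&a[stride * i + start]) != first)
            return -1;  /* not congruent */
    }
    return (int)first;
}
```

Under those assumptions, `common_offset(a, 16, 2, n)` returns 8 because the byte address is 64i + 8, while `common_offset(a, 15, 3, n)` returns -1 because 60i + 12 changes its remainder modulo 32 from one iteration to the next.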
9
Outline
  • Uses of congruence information
  • Congruence detection algorithm
  • Congruence-increasing transformations
  • Results
  • Related work

10
SIMD Compilation (PLDI '00)
  • Multimedia extensions offer wide mem ops
    • Motorola's AltiVec
    • Intel's MMX/SSE
  • Automatic SIMD parallelization
    • Multiple mem ops → single wide mem op
  • 128-bit lds/strs must be 128-bit aligned
    • SSE: 6-9 cycle penalty for unaligned accesses
    • AltiVec: all wide mem ops have to be aligned

11
Energy Savings (MICRO '01)
  • Skip tag checks in a set-associative cache
  • Add special loads/stores to ISA
  • First mem op memoizes the cache way
  • Second mem op uses this to skip the check
  • Compiler analysis determines when data occupy the
    same line
  • Need congruence information

12
Banked Memory Architectures
  • Offset specifies the memory bank
  • Place data close to computation
  • Access banks in parallel

13
Congruence Recognition
  • Iterative dataflow analysis
  • Low-level IR
  • Lattice elements of the form an + b
  • For pointers, describes the memory locations accessed
  • If a = cache line size, then b = offset
  • 32n + 8 accesses offset 8 in a 32-byte line
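
One way to read the last two bullets: an element an + b pins down the line offset exactly when a is a multiple of the line size. A small sketch under that reading (the `Congruence` struct and `known_offset` are hypothetical names, not from the talk):

```c
#include <assert.h>
#include <stdbool.h>

/* The set of values { a*n + b : n an integer }, with a > 0. */
typedef struct { unsigned a, b; } Congruence;

/* If every address in c has the same offset within a line of
   `line` bytes, store it in *offset and return true. */
static bool known_offset(Congruence c, unsigned line, unsigned *offset) {
    assert(c.a > 0 && line > 0);
    if (c.a % line != 0)
        return false;          /* offset varies from access to access */
    *offset = c.b % line;
    return true;
}
```

For a 32-byte line, 32n + 8 yields offset 8, whereas 8n + 7 yields no single offset because consecutive members differ by 8 bytes.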

14
Dataflow Lattice
[Figure: dataflow lattice for an 8-byte cache line, top to bottom]
    ⊤
    8n+0  8n+4  8n+2  8n+6  8n+1  8n+5  8n+3  8n+7
    4n+0  4n+2  4n+1  4n+3
    2n+0  2n+1
    n+0
15
Dataflow Lattice
[Figure: dataflow lattice, top to bottom]
    ⊤
    8n+0  8n+4  8n+2  8n+6  8n+1  8n+5  8n+3  8n+7
    4n+0  4n+2  4n+1  4n+3
    2n+0  2n+1
    n+0
[Highlighted elements: 8n+0, 4n+2]
16
Transfer Functions
Meet: $a = \gcd(a_1, a_2, b_1 - b_2)$, $b = b_1 \bmod a$
17
Transfer Functions
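Meet: $a = \gcd(a_1, a_2, b_1 - b_2)$, $b = b_1 \bmod a$
Add: $a = \gcd(a_1, a_2)$, $b = (b_1 + b_2) \bmod a$
Subtract: $a = \gcd(a_1, a_2)$, $b = (b_1 - b_2) \bmod a$
Multiply: $a = \gcd(a_1 a_2, a_1 b_2, a_2 b_1, C)$, $b = (b_1 b_2) \bmod a$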
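
The four transfer functions can be written down almost literally. The sketch below is illustrative only: the type and function names are hypothetical, and `LINE` stands in for the constant C on the slide, which is assumed here to be the cache line size. It also assumes the `a` values stay small enough that the products do not overflow `unsigned`.

```c
#include <assert.h>

enum { LINE = 32 };  /* stands in for C; assumed to be the line size */

typedef struct { unsigned a, b; } Cong;  /* the set { a*n + b } */

static unsigned gcd(unsigned x, unsigned y) {
    while (y != 0) { unsigned t = x % y; x = y; y = t; }
    return x;  /* gcd(x, 0) == x */
}

/* Normalize so that 0 <= b < a. */
static Cong cong(unsigned a, unsigned b) {
    assert(a > 0);
    Cong c = { a, b % a };
    return c;
}

static Cong meet(Cong x, Cong y) {
    unsigned diff = x.b > y.b ? x.b - y.b : y.b - x.b;
    return cong(gcd(gcd(x.a, y.a), diff), x.b);
}

static Cong add(Cong x, Cong y) {
    return cong(gcd(x.a, y.a), x.b + y.b);
}

static Cong sub(Cong x, Cong y) {
    unsigned a = gcd(x.a, y.a);
    /* add a before subtracting so the unsigned value never wraps */
    return cong(a, x.b % a + a - y.b % a);
}

static Cong mul(Cong x, Cong y) {
    unsigned a = gcd(gcd(x.a * y.a, x.a * y.b), gcd(y.a * x.b, LINE));
    return cong(a, x.b * y.b);
}
```

With these definitions, `meet(cong(32, 0), cong(32, 8))` gives 8n + 0, and `mul(cong(8, 7), cong(32, 4))` gives 32n + 28, since gcd(256, 32, 224, 32) = 32 and 7 * 4 = 28.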
18
The Bad News
  • Most mem ops are not congruent
  • 32 byte cache line

19
Congruence Conventions (Padding)
  • Allocate arrays/structs on a line boundary
    • Congruent accesses to arrays for a given index
    • Congruent accesses to struct fields
  • Requires that we
    • Allocate stack frames on cache line boundary
    • Modify malloc to return aligned data
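
The talk does not show how malloc was modified. As one possible illustration, a wrapper can over-allocate and round the pointer up to a line boundary. `line_malloc` and `line_free` are hypothetical names, the 32-byte line size is an assumption, and overflow of the size computation is not checked.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum { LINE_SIZE = 32 };  /* assumed line size; must be a power of two */

/* Allocate `size` bytes starting on a line boundary. The pointer that
   malloc returned is stashed just below the aligned block for line_free. */
static void *line_malloc(size_t size) {
    void *raw = malloc(size + LINE_SIZE - 1 + sizeof(void *));
    if (raw == NULL)
        return NULL;
    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + LINE_SIZE - 1) & ~(uintptr_t)(LINE_SIZE - 1);
    assert(aligned % LINE_SIZE == 0);
    ((void **)aligned)[-1] = raw;   /* remember what to pass to free */
    return (void *)aligned;
}

/* Release a block obtained from line_malloc; NULL is ignored. */
static void line_free(void *p) {
    if (p != NULL)
        free(((void **)p)[-1]);
}
```

Every array obtained this way starts at line offset 0, so an access with a fixed index has an offset the compiler can compute.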

20
Unrolling
  • Unrolling creates congruent references

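    int a[100];
    for (i = 0; i < n; i += 8) {
        a[i + 0] = 0;
        a[i + 1] = 0;
        ...
        a[i + 7] = 0;
    }
[Figure: accesses a[0], a[8], a[16] shown against a 32-byte cache line marked at byte offsets 0 through 28]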
21
Unrolling
  • Unrolling creates congruent references

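    int a[100];
    for (i = 0; i < n; i += 8) {
        a[i + 0] = 0;
        a[i + 1] = 0;
        ...
        a[i + 7] = 0;
    }
[Figure: accesses a[1], a[9], a[17] shown against a 32-byte cache line marked at byte offsets 0 through 28]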
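
To see why the unroll factor matters, one can count how many of the references in the loop body keep a fixed line offset. This is a rough sketch with hypothetical helper names, assuming 4-byte `int`s and a 32-byte line:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LINE_SIZE = 32 };  /* assumed cache line size in bytes */

static unsigned line_offset(const void *p) {
    return (unsigned)((uintptr_t)p % LINE_SIZE);
}

/* Does the static reference a[i + k] hit one fixed line offset
   when i runs 0, unroll, 2*unroll, ... while i + unroll <= n? */
static int ref_is_congruent(const int *a, size_t n, size_t unroll, size_t k) {
    unsigned first = line_offset(&a[k]);
    for (size_t i = 0; i + unroll <= n; i += unroll) {
        if (line_offset(&a[i + k]) != first)
            return 0;
    }
    return 1;
}

/* Number of the `unroll` references in the loop body that are congruent. */
static size_t count_congruent_refs(const int *a, size_t n, size_t unroll) {
    assert(unroll > 0 && unroll <= n);
    size_t count = 0;
    for (size_t k = 0; k < unroll; k++)
        count += (size_t)ref_is_congruent(a, n, unroll, k);
    return count;
}
```

With an unroll factor of 1 or 4 no reference is congruent, because the stride (4 or 16 bytes) is not a multiple of the line size. With a factor of 8 all eight references are. Passing `a + 1` instead of `a` keeps them congruent but shifts every offset by 4 bytes.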
22
Congruence with Parameters
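    void init(int *a) {
        for (i = 0; i < n; i += 8) {
            a[i + 0] = 0;
            a[i + 1] = 0;
            ...
            a[i + 7] = 0;
        }
    }

    void main() {
        int a[100];
        init(&a[2]);
        init(&a[3]);
    }
[Figure: 32-byte cache line marked at byte offsets 0 through 28 in steps of 4]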
24
Pre-loop
  • Add a pre-loop to enforce congruence

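    for (i = 0; i < n; i++) {
        if ((int)&a[i] % 32 == 0) break;
        a[i] = 0;
    }
    for (; i < n; i += 8) {
        a[i + 0] = 0;
        a[i + 1] = 0;
        ...
        a[i + 7] = 0;
    }
[Figure: 32-byte cache line marked at byte offsets 0 through 28 in steps of 4]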
25
Pre-loop
  • Add a pre-loop to enforce congruence
  • Mem ops congruent in the unrolled body
  • Pre-loop has few iterations
  • Most dynamic mem ops are congruent
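
A compilable version of the pre-loop idea might look like the following. It is a sketch under the same assumptions as the slides (32-byte line, 4-byte `int`); the function name is hypothetical, and the remainder loop at the end is an addition that the slide omits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LINE_SIZE = 32, UNROLL = 8 };  /* 32-byte line, 4-byte int assumed */

/* Zero a[0..n-1]; returns how many iterations the pre-loop ran. */
static size_t zero_with_preloop(int *a, size_t n) {
    size_t i = 0;

    /* Pre-loop: peel iterations until &a[i] sits on a line boundary. */
    for (; i < n; i++) {
        if ((uintptr_t)&a[i] % LINE_SIZE == 0)
            break;
        a[i] = 0;
    }
    size_t peeled = i;

    /* Unrolled body: a[i + k] is now always at line offset 4*k. */
    for (; i + UNROLL <= n; i += UNROLL) {
        assert((uintptr_t)&a[i] % LINE_SIZE == 0);
        a[i + 0] = 0; a[i + 1] = 0; a[i + 2] = 0; a[i + 3] = 0;
        a[i + 4] = 0; a[i + 5] = 0; a[i + 6] = 0; a[i + 7] = 0;
    }

    /* Remainder loop (not on the slide) for a trip count that is
       not a multiple of UNROLL. */
    for (; i < n; i++)
        a[i] = 0;

    return peeled;
}
```

The pre-loop runs at most seven times here, so almost all dynamic stores execute in the unrolled body where their offsets are known.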

26
Finding the Break Condition
  • Can we choose arbitrarily?

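    void init(int *x) {
        int i;
        for (i = 0; i < 100; i += 2) {
            if ((int)&x[i] % 32 == 0) break;
            x[i] = 0;
        }
        ...
    }

    int main() {
        int x[200];
        init(&x[1]);
    }

    i    &x[i] % 32
    0    4
    2    12
    4    20
    6    28
    8    4

NO! Here &x[i] % 32 never equals 0, so the pre-loop never exits early.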
27
Finding the Break Condition
    void copy(int *x, int *y) {
        int i;
        for (i = 0; i < 100; i++) {
            if ((int)&x[i] % 32 == 0 &&
                (int)&y[i] % 32 == 0) break;
            x[i] = y[i];
        }
        ...
    }

    int main() {
        int x[200], y[200];
        copy(&x[0], &y[0]);
        copy(&x[0], &y[1]);
    }

first call
    i    &x[i] % 32    &y[i] % 32
    0    0             0
    1    4             4
    ...
    8    0             0
second call
    i    &x[i] % 32    &y[i] % 32
    0    0             4
    1    4             8
    ...
    8    0             4
29
Finding the Break Condition
    void copy(int *x, int *y) {
        int i;
        for (i = 0; i < 100; i++) {
            if ((int)&x[i] % 32 == 0 &&
                (int)&y[i] % 32 == 4) break;
            x[i] = y[i];
        }
        ...
    }

    int main() {
        int x[200], y[200];
        copy(&x[0], &y[0]);
        copy(&x[0], &y[1]);
    }

first call
    i    &x[i] % 32    &y[i] % 32
    0    0             0
    1    4             4
    ...
    8    0             0
second call
    i    &x[i] % 32    &y[i] % 32
    0    0             4
    1    4             8
    ...
    8    0             4
30
Finding the Break Condition
    void copy(int *x, int *y) {
        int i;
        for (i = 0; i < 100; i++) {
            if ((int)&x[i] % 32 == 0) break;
            x[i] = y[i];
        }
        ...
    }

    int main() {
        int x[200], y[200];
        copy(&x[0], &y[0]);
        copy(&x[0], &y[1]);
    }

first call
    i    &x[i] % 32    &y[i] % 32
    0    0             0
    1    4             4
    ...
    8    0             0
second call
    i    &x[i] % 32    &y[i] % 32
    0    0             4
    1    4             8
    ...
    8    0             4
31
Finding the Break Condition
  • Use profiling to observe runtime addresses
  • Find best break condition for the profile
  • Exhaustive search
    • Consider all possible break conditions
    • Compute iterations in unrolled loop
    • Multiply by number of mem ops with known offset
    • Break condition with highest value is the best
  • Results vary little with profile data set
    • Insignificant on all but one benchmark
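
The search described above can be sketched for a two-pointer loop such as `copy`. Everything here is an illustrative assumption rather than the authors' implementation: the `Call` and `Cond` types, the scoring details, a unit-stride loop over 4-byte elements, and a profile that records only each pointer's starting line offset and the trip count.

```c
#include <assert.h>
#include <stddef.h>

enum { LINE = 32, ELEM = 4, NREFS = 2, UNROLL = LINE / ELEM, ANY = -1 };

/* One profiled loop entry: starting line offset of each pointer, trip count. */
typedef struct { int off[NREFS]; int trips; } Call;

/* A break condition: for each pointer, a required offset or ANY. */
typedef struct { int want[NREFS]; } Cond;

/* Iterations spent in the unrolled loop, summed over the profile and
   multiplied by the number of mem ops whose offset the condition fixes. */
static long score(const Cond *c, const Call *prof, size_t ncalls) {
    int constrained = 0;
    for (int r = 0; r < NREFS; r++)
        constrained += (c->want[r] != ANY);

    long total = 0;
    for (size_t k = 0; k < ncalls; k++) {
        /* Offsets repeat every UNROLL iterations, so a condition that
           has not fired within UNROLL iterations never fires. */
        for (int i = 0; i < UNROLL && i < prof[k].trips; i++) {
            int ok = 1;
            for (int r = 0; r < NREFS; r++) {
                int offset = (prof[k].off[r] + ELEM * i) % LINE;
                if (c->want[r] != ANY && offset != c->want[r])
                    ok = 0;
            }
            if (ok) {
                total += (long)(prof[k].trips - i) * constrained;
                break;
            }
        }
    }
    return total;
}

/* Try every condition: each pointer is ANY or one of UNROLL offsets. */
static Cond best_condition(const Call *prof, size_t ncalls, long *best_score) {
    Cond best = {{ANY, ANY}};
    *best_score = 0;
    for (int cx = -1; cx < UNROLL; cx++) {
        for (int cy = -1; cy < UNROLL; cy++) {
            Cond c = {{ cx < 0 ? ANY : cx * ELEM, cy < 0 ? ANY : cy * ELEM }};
            long s = score(&c, prof, ncalls);
            if (s > *best_score) {
                *best_score = s;
                best = c;
            }
        }
    }
    return best;
}
```

Under this scoring, a condition that constrains both pointers earns double credit on the calls it matches and nothing on the calls where it never fires. With a profile in which `y` starts at several different offsets, constraining `x` alone tends to win.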

32
Congruence Results (SPECfp95)
[Figure: chart with legend entries Original and Congruent]
33
Congruence Results (SPECfp95)
34
Congruence Results (MediaBench)
35
Execution Time Overhead
    benchmark    unrolling    pre-loop
    applu        -6.27        -5.28
    apsi          0.93         1.13
    fpppp         0.00         0.00
    hydro2d       0.99         0.39
    mgrid         0.72         0.72
    su2cor       -0.32         0.11
    swim         -0.96        -0.17
    tomcatv      -0.18         0.65
    turb3d       -0.80         1.72
    wave5         3.75         4.58
36
DCache Energy Savings (MICRO '01)
37
Related Work
  • Fisher and Ellis: Bulldog Compiler
    • Memory bank disambiguation
    • Loop unrolling
  • Barua et al.: Raw Compiler
    • Modulo unrolling
  • Davidson et al.: Mem Access Coalescing
    • Loop unrolling
    • Alignment checks at runtime

38
Conclusions
  • Increased number of congruent refs by 5x
  • Analysis detected 95% of congruent refs
  • Results are good
    • MediaBench: 65% congruent, 60% detected
    • SPECfp95: 84% congruent, 82% detected
  • Many uses of congruence information
    • Wide accesses in multimedia extensions
    • Energy savings by tag check elimination
    • Bank disambiguation in clustered architectures

40
Example
    int a[100];
    for (i = 0; i < n; i += 8) {
        a[i + 0] = 0;
        a[i + 1] = 0;
        ...
        a[i + 7] = 0;
    }
41
Example
    value    operands → result
    i        32n+0
    r0       32n+0, 32n+7 → 32n+7
    r1       32n+7, 32n+4 → 32n+28
    r2       32n+28, 32n+0 → 32n+28
    i        32n+0, 32n+8 → 32n+8
42
Example
    value    first pass                  second pass
    i        32n+0                       32n+0 ∩ 32n+8 = 8n+0
    r0       32n+0, 32n+7 → 32n+7        8n+0, 32n+7 → 8n+7
    r1       32n+7, 32n+4 → 32n+28       8n+7, 32n+4 → 32n+28
    r2       32n+28, 32n+0 → 32n+28      32n+28, 32n+0 → 32n+28
    i        32n+0, 32n+8 → 32n+8        8n+0, 32n+8 → 8n+0
r2 offset is 28
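
The two passes above can be reproduced by iterating the transfer functions to a fixed point. This sketch makes several assumptions that the transcript does not state: r0 = i + k, r1 = r0 * 4, and r2 = r1 + a with `a` on a line boundary; constants are represented as 32n + c; and C in the multiply rule is the line size. All names are hypothetical.

```c
#include <assert.h>

enum { LINE = 32 };  /* line size in bytes; also used as C in multiply */

typedef struct { unsigned a, b; } Cong;  /* the set { a*n + b } */

static unsigned gcd(unsigned x, unsigned y) {
    while (y != 0) { unsigned t = x % y; x = y; y = t; }
    return x;
}

static Cong cong(unsigned a, unsigned b) {
    assert(a > 0);
    Cong c = { a, b % a };
    return c;
}

static Cong meet(Cong x, Cong y) {
    unsigned diff = x.b > y.b ? x.b - y.b : y.b - x.b;
    return cong(gcd(gcd(x.a, y.a), diff), x.b);
}

static Cong add(Cong x, Cong y) { return cong(gcd(x.a, y.a), x.b + y.b); }

static Cong mul(Cong x, Cong y) {
    unsigned a = gcd(gcd(x.a * y.a, x.a * y.b), gcd(y.a * x.b, LINE));
    return cong(a, x.b * y.b);
}

/* Analyze the store a[i + k] = 0 in
       for (i = 0; i < n; i += 8) ...
   Returns the lattice value of the store address and counts the
   passes over the loop body needed to reach a fixed point. */
static Cong analyze_store(unsigned k, int *passes) {
    const Cong entry = cong(LINE, 0);   /* i = 0 before the loop */
    Cong i_head = entry;
    Cong addr = entry;
    *passes = 0;
    for (;;) {
        ++*passes;
        Cong r0 = add(i_head, cong(LINE, k));      /* r0 = i + k  (assumed) */
        Cong r1 = mul(r0, cong(LINE, 4));          /* r1 = r0 * 4 (assumed) */
        addr = add(r1, cong(LINE, 0));             /* r2 = r1 + a (assumed) */
        Cong i_next = add(i_head, cong(LINE, 8));  /* i = i + 8 */
        Cong merged = meet(entry, i_next);         /* entry meets back edge */
        if (merged.a == i_head.a && merged.b == i_head.b)
            break;                                 /* fixed point reached */
        i_head = merged;
    }
    return addr;
}
```

For k = 7 the loop-head value of i drops from 32n + 0 to 8n + 0 after the first pass and is unchanged by the second, and the store address stays at 32n + 28 throughout.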
43
Multimedia Compilation
  • PowerMAC G4 with AltiVec
  • Commercial vectorizing compiler
  • Alignment pragmas

    datatype    vector length    speedup (unaligned)    speedup (aligned)    improvement (%)
    float       4                3.25                   4.75                 46
    int         4                2.15                   2.93                 36
    short       8                2.98                   5.87                 97
    char        16               5.21                   11.53                121
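
The last column appears to be the relative gain of the aligned speedup over the unaligned one, in percent. This reading is inferred from the numbers and is not stated on the slide. With $S_u$ and $S_a$ as hypothetical names for the two speedup columns:

```latex
\[
\text{improvement} = \left(\frac{S_a}{S_u} - 1\right) \times 100\%,
\qquad \text{e.g. float: } \frac{4.75}{3.25} - 1 \approx 0.46 = 46\%.
\]
```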