Title: Compiler Managed Partitioned Data Caches for Low Power
Slide 1: Compiler Managed Partitioned Data Caches for Low Power
- Rajiv Ravindran, Michael Chu, and Scott Mahlke
- Advanced Computer Architecture Lab
- Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor
- Currently with the Java, Compilers, and Tools Lab, Hewlett-Packard, Cupertino, California
Slide 2: Introduction: Memory Power
- On-chip memories are a major contributor to system energy
  - Data caches: ~16% of power in the StrongARM (Unsal et al., '01)
- Hardware techniques: banking, dynamic voltage/frequency scaling, dynamic resizing
  - Transparent to the user; handle arbitrary instruction/data accesses
  - Limited program information; reactive
- Software techniques: software-controlled scratch-pads, data/code reorganization
  - Whole-program information; proactive
  - No dynamic adaptability; conservative
Slide 3: Reducing Data Memory Power: Compiler Managed, Hardware Assisted
- Hardware: banking, dynamic voltage/frequency scaling, dynamic resizing
  - Pros: transparent to the user; handles arbitrary instruction/data accesses
  - Cons: limited program information; reactive
- Software: software-controlled scratch-pads, data/code reorganization
  - Pros: whole-program information; proactive
  - Cons: no dynamic adaptability; conservative
Slide 4: Data Caches: Tradeoffs
- Advantages
  - Capture spatial/temporal locality
  - Transparent to the programmer
  - More general than software scratch-pads
  - Efficient lookups
- Disadvantages
  - Fixed replacement policy
  - Set index carries no program locality information
  - Set-associativity has high overhead: multiple data/tag arrays are activated per access
Slide 5: Traditional Cache Architecture
[Figure: a 4-way set-associative cache. The address splits into tag, set, and offset fields; each way holds tag, data, and LRU state. All four tag comparators feed the replacement logic, and a 4:1 mux selects the hit way.]
- Lookup: activate all ways on every access
- Replacement: choose among all the ways
Slide 6: Partitioned Cache Architecture
[Figure: the 4-way cache of Slide 5 with each way treated as a partition P0-P3. Each load/store carries a k-bit partition vector and an R/U (restricted/unrestricted) bit alongside the register address; the partition vector gates the tag/data arrays, the replacement logic, and the 4:1 mux.]
- Advantages
  - Improve performance by controlling replacement
  - Reduce cache access power by restricting the number of ways accessed
- Lookup: restricted to the partitions specified in the bit-vector if R; otherwise defaults to all partitions
- Replacement: restricted to the partitions specified in the bit-vector
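The restricted-lookup rule above can be sketched in a few lines of C. This is an illustrative model only: the `ways_probed` helper and the `NUM_PARTS` constant are hypothetical names, not part of the proposed hardware, but the logic follows the slide's rule (probe only the partitions named in the bit-vector when the R bit is set, otherwise probe every way).

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_PARTS 4  /* one partition per way in this 4-way sketch */

/* Count how many ways a memory access activates, given its k-bit
 * partition vector and its R/U bit.  Restricted (R) accesses probe
 * only the partitions named in the vector; unrestricted (U) accesses
 * fall back to probing all ways, as in a conventional cache. */
int ways_probed(uint8_t part_vector, bool restricted) {
    if (!restricted)
        return NUM_PARTS;              /* default: activate every way */
    int count = 0;
    for (int p = 0; p < NUM_PARTS; p++)
        if (part_vector & (1u << p))
            count++;                   /* probe only the named partitions */
    return count;
}
```

For example, a restricted access with vector `0b0001` probes one way instead of four, which is exactly where the tag/data-array energy savings come from.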
Slide 7: Partitioned Caches: Example

  for (i = 0; i < N1; i++) {
    for (j = 0; j < N2; j++)
      y[i][j] += w1 * x[i][j];   /* ld1/st1 (y), ld3 (x), ld5 (w1) */
    for (k = 0; k < N3; k++)
      y[i][k] += w2 * x[i][k];   /* ld2/st2 (y), ld4 (x), ld6 (w2) */
  }

[Figure: a three-way partitioned cache. Partition assignment: ld1 -> 100 (R), ld5 -> 010 (R), ld3 -> 001 (R). way-0 holds y (ld1, st1, ld2, st2); way-1 holds x (ld3, ld4); way-2 holds w1/w2 (ld5, ld6).]
- Reduces the number of tag checks per iteration from 12 to 4!
Slide 8: Compiler Controlled Data Partitioning
- Goal: place loads/stores into cache partitions
- Analyze the application's memory characteristics
  - Cache requirements: number of partitions per ld/st
  - Predict conflicts
- Place loads/stores into different partitions
  - Satisfy each instruction's caching needs
  - Avoid conflicts; overlap if possible
Slide 9: Cache Analysis: Estimating Number of Partitions
- Find the minimal number of partitions that avoids conflict/capacity misses
- Probabilistic hit-rate estimate
- Use the working set to compute the number of partitions
[Figure: access streams for the j-loop and k-loop (X W1 Y Y, X W1 Y Y, X W2 Y Y, X W2 Y Y), with cache block B1 and misses M annotated per reference.]
Slide 10: Cache Analysis: Estimating Number of Partitions (cont.)
- Avoid conflict/capacity misses for an instruction
- Estimate the hit-rate based on:
  - Reuse distance (D), total number of cache blocks (B), associativity (A) (Brehob et al., '99)
[Table: estimated hit rates (e.g. 0.76, 0.87, 0.98, 1.0) for reuse distances D = 0, 1, 2 across associativities 1-4 and cache sizes of 8, 16, 24, and 32 blocks.]
- In reality, compute energy matrices
- Pick the most energy-efficient configuration per instruction
Slide 11: Cache Analysis: Computing Interferences
- Avoid conflicts among temporally co-located references
- Model conflicts using an interference graph
[Figure: the access stream X W1 Y Y, X W1 Y Y, X W2 Y Y, X W2 Y Y labeled with instructions M4 M2 M1 M1 / M4 M3 M1 M1; the resulting interference graph connects M1, M2, M3, and M4, each with reuse distance D = 1.]
Slide 12: Partition Assignment
- The placement phase can overlap references
  - Compute the combined working set
- Use the graph-theoretic notion of a clique
  - For each clique, new D = sum of D over its nodes
  - Combined D for all overlaps = max over all cliques
[Figure: interference graph with nodes M1-M4, each with D = 1. Clique 1 = {M1, M2, M4} gives new reuse distance D = 3; Clique 2 = {M1, M3, M4} gives D = 3; combined reuse distance = max(3, 3) = 3.]
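The combining rule above (sum each clique's reuse distances, then take the maximum across cliques) can be sketched directly. The data layout here, cliques given as index lists with precomputed sizes, is an assumption for illustration; clique enumeration itself is assumed to have been done already, as on the slide.

```c
#define MAX_NODES 8

/* Combined reuse distance for a set of overlapped references.
 * D[i] is the reuse distance of node i in the interference graph;
 * cliques[c] lists the node indices of clique c.  Each clique's new
 * distance is the sum over its members; the combined distance is the
 * maximum over all cliques. */
int combined_reuse_distance(const int D[],
                            const int cliques[][MAX_NODES],
                            const int clique_sizes[],
                            int num_cliques) {
    int combined = 0;
    for (int c = 0; c < num_cliques; c++) {
        int sum = 0;                    /* new D for this clique */
        for (int i = 0; i < clique_sizes[c]; i++)
            sum += D[cliques[c][i]];
        if (sum > combined)
            combined = sum;             /* max over all cliques */
    }
    return combined;
}
```

On the slide's example (M1-M4 all with D = 1, cliques {M1, M2, M4} and {M1, M3, M4}) this returns max(3, 3) = 3, which then feeds back into the hit-rate estimate to size the shared partition.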
Slide 13: Experimental Setup
- Trimaran compiler and simulator infrastructure
- ARM9 processor model
- Cache configurations
  - 1 KB to 32 KB
  - 32-byte block size
  - 2, 4, 8 partitions vs. 2-, 4-, 8-way set-associative caches
- Mediabench suite
- CACTI for cache energy modeling
14Reduction in Tag Data-Array Checks
8
8-part
4-part
2-part
7
6
5
Average way accesses
4
3
2
1
0
1-K
2-K
4-K
8-K
16-K
32-K
Average
Cache size
- 36 reduction on a 8-partition cache
15Improvement in Fetch Energy
16-Kb cache
60
2-part vs 2-way
4-part vs 4-way
8-part vs 8-way
50
40
30
Percentage energy improvement
20
10
0
epic
cjpeg
djpeg
unepic
Average
pegwitenc
pegwitdec
rawcaudio
rawdaudio
mpeg2dec
mpeg2enc
pgpencode
pgpdecode
gsmencode
gsmdecode
g721encode
g721decode
Slide 16: Summary
- Maintains the advantages of a hardware cache
- Exposes placement and lookup decisions to the compiler
  - Avoid conflicts, eliminate redundancies
- 24% energy savings for a 4 KB cache with 4 partitions
- Extensions
  - Hybrid scratch-pad and caches
  - Disable selected tags to convert those partitions into scratch-pads
  - 35% additional savings in a 4 KB cache with 1 partition as scratch-pad
Slide 17: Thank You! Questions?
Slide 18: Cache Analysis Step 1: Instruction Fusioning
- Combine loads/stores that access the same set of objects
  - Avoids coherence problems and duplication
- Uses points-to analysis

  for (i = 0; i < N1; i++) {
    for (j = 0; j < readInput1(); j++)
      y[i][j] += w1 * x[i][j];
    for (k = 0; k < readInput2(); k++)
      y[i][k] += w2 * x[i][k];
  }

[Figure: ld1/st1, ld3, ld5 and ld2/st2, ld4, ld6 fused into memory objects M1 and M2.]
Slide 19: Partition Assignment
- Greedily place instructions based on their cache estimates
- Overlap instructions if required
  - Compute the number of partitions for overlapped instructions
  - Enumerate cliques within the interference graph
  - Compute the combined working set over all cliques
- Assign the R/U bit to control lookup
[Figure: interference graph with nodes M1-M4, each with D = 1, and Clique 1 = {M1, M2, M4}, Clique 2 = {M1, M3, M4}.]
Slide 20: Related Work
- Direct-addressed and cool caches (Unsal '01, Asanovic '01)
  - Tags maintained in registers that are addressed within loads/stores
- Split temporal/spatial cache (Rivers '96)
  - Hardware managed, two partitions
- Column partitioning (Devdas '00)
  - Individual ways can be configured as a scratch-pad
  - No load/store-based partitioning
- Region-based caching (Tyson '02)
  - Heap, stack, globals
  - Our approach offers finer-grained control and management
- Pseudo set-associative caches (Calder '96, Inou '99, Albonesi '99)
  - Reduce tag-check power
  - Compromise on cycle time
  - Orthogonal to our technique
21Code Size Overhead
Annotated LD/STs
Extra MOV instructions
15
16
12
10
8
6
Percentage instructions
4
2
0
epic
cjpeg
djpeg
unepic
Average
pegwitenc
pegwitdec
rawcaudio
rawdaudio
mpeg2dec
mpeg2enc
pgpencode
pgpdecode
gsmencode
gsmdecode
g721encode
g721decode