Title: Compiler Managed Partitioned Data Caches for Low Power
Slide 1: Compiler Managed Partitioned Data Caches for Low Power
- Rajiv Ravindran, Michael Chu, and Scott Mahlke
- Advanced Computer Architecture Lab
- Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor
- Currently with the Java, Compilers, and Tools Lab, Hewlett-Packard, Cupertino, California
Slide 2: Introduction: Memory Power
- On-chip memories are a major contributor to system energy
  - Data caches: ~16% of power in the StrongARM (Unsal et al., '01)
- Hardware techniques: banking, dynamic voltage/frequency scaling, dynamic resizing
  - Transparent to the user; handle arbitrary instruction/data accesses
  - Limited program information; reactive
- Software techniques: software-controlled scratch-pads, data/code reorganization
  - Whole-program information; proactive
  - No dynamic adaptability; conservative
Slide 3: Reducing Data Memory Power: Compiler Managed, Hardware Assisted
- Hardware: banking, dynamic voltage/frequency scaling, dynamic resizing
  - Pros: transparent to the user; handles arbitrary instruction/data accesses
  - Cons: limited program information; reactive
- Software: software-controlled scratch-pads, data/code reorganization
  - Pros: whole-program information; proactive
  - Cons: no dynamic adaptability; conservative
Slide 4: Data Caches: Tradeoffs
- Advantages
  - Capture spatial/temporal locality
  - Transparent to the programmer
  - More general than software scratch-pads
  - Efficient lookups
- Disadvantages
  - Fixed replacement policy
  - Set index carries no program locality information
  - Set-associativity has high overhead: multiple data/tag arrays are activated per access
Slide 5: Traditional Cache Architecture
[Figure: a 4-way set-associative cache. The address splits into tag, set, and offset fields; each way holds tag, data, and LRU state. All four tag comparators feed the replacement logic, and a 4:1 mux selects the hit way.]
- Lookup: activate all ways on every access
- Replacement: choose among all the ways
Slide 6: Partitioned Cache Architecture
[Figure: the 4-way cache of Slide 5 with each way treated as a partition P0-P3. Each load/store carries a k-bit partition vector and an R/U (restricted/unrestricted) bit alongside the register address; the partition vector gates the tag/data arrays, the replacement logic, and the 4:1 mux.]
- Advantages
  - Improve performance by controlling replacement
  - Reduce cache access power by restricting the number of ways accessed
- Lookup: restricted to the partitions specified in the bit-vector if R; otherwise defaults to all partitions
- Replacement: restricted to the partitions specified in the bit-vector
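The restricted-lookup rule above can be sketched in a few lines of C. This is an illustrative model only: the `ways_probed` helper and the `NUM_PARTS` constant are hypothetical names, not part of the proposed hardware, but the logic follows the slide's rule (probe only the partitions named in the bit-vector when the R bit is set, otherwise probe every way).

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_PARTS 4  /* one partition per way in this 4-way sketch */

/* Count how many ways a memory access activates, given its k-bit
 * partition vector and its R/U bit.  Restricted (R) accesses probe
 * only the partitions named in the vector; unrestricted (U) accesses
 * fall back to probing all ways, as in a conventional cache. */
int ways_probed(uint8_t part_vector, bool restricted) {
    if (!restricted)
        return NUM_PARTS;              /* default: activate every way */
    int count = 0;
    for (int p = 0; p < NUM_PARTS; p++)
        if (part_vector & (1u << p))
            count++;                   /* probe only the named partitions */
    return count;
}
```

For example, a restricted access with vector `0b0001` probes one way instead of four, which is exactly where the tag/data-array energy savings come from.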
Slide 7: Partitioned Caches: Example

  for (i = 0; i < N1; i++) {
    for (j = 0; j < N2; j++)
      y[i][j] += w1 * x[i][j];   /* ld1/st1 (y), ld3 (x), ld5 (w1) */
    for (k = 0; k < N3; k++)
      y[i][k] += w2 * x[i][k];   /* ld2/st2 (y), ld4 (x), ld6 (w2) */
  }

[Figure: a three-way partitioned cache. Partition assignment: ld1 -> 100 (R), ld5 -> 010 (R), ld3 -> 001 (R). way-0 holds y (ld1, st1, ld2, st2); way-1 holds x (ld3, ld4); way-2 holds w1/w2 (ld5, ld6).]
- Reduces the number of tag checks per iteration from 12 to 4!
Slide 8: Compiler Controlled Data Partitioning
- Goal: place loads/stores into cache partitions
- Analyze the application's memory characteristics
  - Cache requirements: number of partitions per ld/st
  - Predict conflicts
- Place loads/stores into different partitions
  - Satisfy each instruction's caching needs
  - Avoid conflicts; overlap if possible
Slide 9: Cache Analysis: Estimating Number of Partitions
- Find the minimal number of partitions that avoids conflict/capacity misses
- Probabilistic hit-rate estimate
- Use the working set to compute the number of partitions
[Figure: access streams for the j-loop and k-loop (X W1 Y Y, X W1 Y Y, X W2 Y Y, X W2 Y Y), with cache block B1 and misses M annotated per reference.]
Slide 10: Cache Analysis: Estimating Number of Partitions (cont.)
- Avoid conflict/capacity misses for an instruction
- Estimate the hit-rate based on:
  - Reuse distance (D), total number of cache blocks (B), associativity (A) (Brehob et al., '99)
[Table: estimated hit rates (e.g. 0.76, 0.87, 0.98, 1.0) for reuse distances D = 0, 1, 2 across associativities 1-4 and cache sizes of 8, 16, 24, and 32 blocks.]
- In reality, compute energy matrices
- Pick the most energy-efficient configuration per instruction
Slide 11: Cache Analysis: Computing Interferences
- Avoid conflicts among temporally co-located references
- Model conflicts using an interference graph
[Figure: the access stream X W1 Y Y, X W1 Y Y, X W2 Y Y, X W2 Y Y labeled with instructions M4 M2 M1 M1 / M4 M3 M1 M1; the resulting interference graph connects M1, M2, M3, and M4, each with reuse distance D = 1.]
Slide 12: Partition Assignment
- The placement phase can overlap references
  - Compute the combined working set
- Use the graph-theoretic notion of a clique
  - For each clique, new D = sum of D over its nodes
  - Combined D for all overlaps = max over all cliques
[Figure: interference graph with nodes M1-M4, each with D = 1. Clique 1 = {M1, M2, M4} gives new reuse distance D = 3; Clique 2 = {M1, M3, M4} gives D = 3; combined reuse distance = max(3, 3) = 3.]
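The combining rule above (sum each clique's reuse distances, then take the maximum across cliques) can be sketched directly. The data layout here, cliques given as index lists with precomputed sizes, is an assumption for illustration; clique enumeration itself is assumed to have been done already, as on the slide.

```c
#define MAX_NODES 8

/* Combined reuse distance for a set of overlapped references.
 * D[i] is the reuse distance of node i in the interference graph;
 * cliques[c] lists the node indices of clique c.  Each clique's new
 * distance is the sum over its members; the combined distance is the
 * maximum over all cliques. */
int combined_reuse_distance(const int D[],
                            const int cliques[][MAX_NODES],
                            const int clique_sizes[],
                            int num_cliques) {
    int combined = 0;
    for (int c = 0; c < num_cliques; c++) {
        int sum = 0;                    /* new D for this clique */
        for (int i = 0; i < clique_sizes[c]; i++)
            sum += D[cliques[c][i]];
        if (sum > combined)
            combined = sum;             /* max over all cliques */
    }
    return combined;
}
```

On the slide's example (M1-M4 all with D = 1, cliques {M1, M2, M4} and {M1, M3, M4}) this returns max(3, 3) = 3, which then feeds back into the hit-rate estimate to size the shared partition.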
Slide 13: Experimental Setup
- Trimaran compiler and simulator infrastructure
- ARM9 processor model
- Cache configurations
  - 1 KB to 32 KB
  - 32-byte block size
  - 2, 4, 8 partitions vs. 2-, 4-, 8-way set-associative caches
- Mediabench suite
- CACTI for cache energy modeling
14Reduction in Tag Data-Array Checks
8
8-part
4-part
2-part
7
6
5
Average way accesses
4
3
2
1
0
1-K
2-K
4-K
8-K
16-K
32-K
Average
Cache size
- 36 reduction on a 8-partition cache
15Improvement in Fetch Energy
16-Kb cache
60
2-part vs 2-way
4-part vs 4-way
8-part vs 8-way
50
40
30
Percentage energy improvement
20
10
0
epic
cjpeg
djpeg
unepic
Average
pegwitenc
pegwitdec
rawcaudio
rawdaudio
mpeg2dec
mpeg2enc
pgpencode
pgpdecode
gsmencode
gsmdecode
g721encode
g721decode
Slide 16: Summary
- Maintains the advantages of a hardware cache
- Exposes placement and lookup decisions to the compiler
  - Avoid conflicts, eliminate redundancies
- 24% energy savings for a 4 KB cache with 4 partitions
- Extensions
  - Hybrid scratch-pad and caches
  - Disable selected tags to convert those partitions into scratch-pads
  - 35% additional savings in a 4 KB cache with 1 partition as scratch-pad
Slide 17: Thank You! Questions?
Slide 18: Cache Analysis Step 1: Instruction Fusioning
- Combine loads/stores that access the same set of objects
  - Avoids coherence problems and duplication
- Uses points-to analysis

  for (i = 0; i < N1; i++) {
    for (j = 0; j < readInput1(); j++)
      y[i][j] += w1 * x[i][j];
    for (k = 0; k < readInput2(); k++)
      y[i][k] += w2 * x[i][k];
  }

[Figure: ld1/st1, ld3, ld5 and ld2/st2, ld4, ld6 fused into memory objects M1 and M2.]
Slide 19: Partition Assignment
- Greedily place instructions based on their cache estimates
- Overlap instructions if required
  - Compute the number of partitions for overlapped instructions
  - Enumerate cliques within the interference graph
  - Compute the combined working set over all cliques
- Assign the R/U bit to control lookup
[Figure: interference graph with nodes M1-M4, each with D = 1, and Clique 1 = {M1, M2, M4}, Clique 2 = {M1, M3, M4}.]
Slide 20: Related Work
- Direct-addressed and cool caches (Unsal '01, Asanovic '01)
  - Tags maintained in registers that are addressed within loads/stores
- Split temporal/spatial cache (Rivers '96)
  - Hardware managed, two partitions
- Column partitioning (Devdas '00)
  - Individual ways can be configured as a scratch-pad
  - No load/store-based partitioning
- Region-based caching (Tyson '02)
  - Heap, stack, globals
  - Our approach offers finer-grained control and management
- Pseudo set-associative caches (Calder '96, Inou '99, Albonesi '99)
  - Reduce tag-check power
  - Compromise on cycle time
  - Orthogonal to our technique
21Code Size Overhead
Annotated LD/STs
Extra MOV instructions
15
16
12
10
8
6
Percentage instructions
4
2
0
epic
cjpeg
djpeg
unepic
Average
pegwitenc
pegwitdec
rawcaudio
rawdaudio
mpeg2dec
mpeg2enc
pgpencode
pgpdecode
gsmencode
gsmdecode
g721encode
g721decode