Compiler Managed Partitioned Data Caches for Low Power - PowerPoint PPT Presentation


1
Compiler Managed Partitioned Data Caches for Low
Power
  • Rajiv Ravindran, Michael Chu, and Scott Mahlke
  • Advanced Computer Architecture Lab
  • Department of Electrical Engineering and Computer
    Science
  • University of Michigan, Ann Arbor

Currently with the Java, Compilers, and Tools
Lab, Hewlett Packard, Cupertino, California
2
Introduction: Memory Power
  • On-chip memories are a major contributor to
    system energy
  • Data caches → 16% of power in StrongARM [Unsal
    et al. '01]

Hardware: banking, dynamic voltage/frequency
scaling, dynamic resizing
  • Transparent to the user
  • Handles arbitrary instruction/data accesses
  • Limited program information
  • Reactive
Software: compiler-controlled scratch-pads,
data/code reorganization
  • Whole-program information
  • Proactive
  • No dynamic adaptability
  • Conservative
3
Reducing Data Memory Power: Compiler Managed,
Hardware Assisted
Hardware: banking, dynamic voltage/frequency
scaling, dynamic resizing
  • Transparent to the user
  • Handles arbitrary instruction/data accesses
  • ✗ Limited program information
  • ✗ Reactive
Software: compiler-controlled scratch-pads,
data/code reorganization
  • Whole-program information
  • Proactive
  • ✗ No dynamic adaptability
  • ✗ Conservative
4
Data Caches: Tradeoffs
Advantages
  • Capture spatial/temporal locality
  • Transparent to the programmer
  • More general than software scratch-pads
  • Efficient lookups
Disadvantages
  • Fixed replacement policy
  • Set index carries no program-locality information
  • Set-associativity has high overhead: multiple
    data/tag arrays activated per access

5
Traditional Cache Architecture
[Figure: conventional 4-way set-associative cache.
The address splits into tag / set / offset; each of
the four ways holds tag, data, and LRU bits; four
tag comparators feed a 4:1 output mux; replacement
chooses among all ways.]
  • Lookup → activate all ways on every access
  • Replacement → choose among all the ways

6
Partitioned Cache Architecture
[Figure: partitioned cache. Each load/store carries
a k-bit partition vector and an R/U
(restricted/unrestricted) flag alongside the
address; the four ways act as partitions P0-P3,
feeding the same 4:1 output mux.]
  • Advantages
  • Improve performance by controlling replacement
  • Reduce cache access power by restricting the
    number of accesses
  • Lookup → restricted to the partitions specified
    in the bit-vector if R; else defaults to all
    partitions
  • Replacement → restricted to the partitions
    specified in the bit-vector
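The lookup and replacement rules above can be sketched as a small software model. This is an illustrative sketch, not the authors' hardware: the class name, the single-set scope, and the LRU bookkeeping are assumptions for demonstration.

```python
# Minimal model of the partitioned-cache lookup/replacement rules.
class PartitionedSet:
    def __init__(self, num_ways=4):
        self.tags = [None] * num_ways       # one tag per way/partition
        self.lru = list(range(num_ways))    # lru[0] = least recently used

    def access(self, tag, bitvec, restricted):
        # R (restricted): probe only the ways named in the bit-vector.
        # U (unrestricted): fall back to probing every way.
        ways = [w for w in range(len(self.tags)) if bitvec[w]] if restricted \
               else list(range(len(self.tags)))
        probes = len(ways)                  # tag/data arrays activated
        for w in ways:
            if self.tags[w] == tag:         # hit: refresh LRU position
                self.lru.remove(w); self.lru.append(w)
                return True, probes
        # Miss: replacement is always restricted to the bit-vector's ways.
        victim = next(w for w in self.lru if bitvec[w])
        self.tags[victim] = tag
        self.lru.remove(victim); self.lru.append(victim)
        return False, probes

cache = PartitionedSet()
hit, probes = cache.access(tag=0x12, bitvec=[1, 0, 0, 0], restricted=True)
print(hit, probes)   # miss; only 1 way probed instead of 4
```

Note how a U-mode access still probes all ways but evicts only within its bit-vector, matching the slide's asymmetry between lookup and replacement.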

7
Partitioned Caches: Example
for (i = 0; i < N1; i++) {
  for (j = 0; j < N2; j++)
    y[i+j] += w1 * x[i+j];    /* ld1/st1 (y), ld3 (x), ld5 (w1) */
  for (k = 0; k < N3; k++)
    y[i+k] += w2 * x[i+k];    /* ld2/st2 (y), ld4 (x), ld6 (w2) */
}

[Figure: 3-way partitioned cache (way-0, way-1,
way-2), one tag/data pair per way.]
Partition bit-vectors: ld1 → 100 (R), ld5 → 010 (R),
ld3 → 001 (R)
  • way-0: ld1, st1, ld2, st2 (y)
  • way-1: ld5, ld6 (w1/w2)
  • way-2: ld3, ld4 (x)
  • Reduce the number of tag checks per iteration
    from 12 (4 accesses × 3 ways) to 4 (one way each)!

8
Compiler Controlled Data Partitioning
  • Goal: place loads/stores into cache partitions
  • Analyze the application's memory characteristics
  • Cache requirements → number of partitions per
    ld/st
  • Predict conflicts
  • Place loads/stores into different partitions so
    that each
  • satisfies its caching needs
  • avoids conflicts, overlapping where possible

9
Cache Analysis: Estimating Number of Partitions
  • Minimal partitions to avoid conflict/capacity
    misses
  • Probabilistic hit-rate estimate
  • Use the working-set to compute number of
    partitions

[Figure: reference stream over the j- and k-loops:
X W1 Y Y, X W1 Y Y, X W2 Y Y, X W2 Y Y; every w1/w2
access maps to the same cache block M.]
  • M has working-set size 1

10
Cache Analysis: Estimating Number of Partitions
  • Avoid conflict/capacity misses for an
    instruction
  • Estimate the hit rate based on:
  • reuse distance (D), total number of cache blocks
    (B), associativity (A)

(Brehob et al., '99)
[Table: estimated hit rate as a function of reuse
distance D (0, 1, 2), associativity A (1-4), and
total cache blocks B (8, 16, 24, 32); e.g., for
B = 8 and A = 1 the estimate is .87 at D = 1 and
.76 at D = 2, approaching 1 as B or A grows.]
  • In practice, compute energy matrices
  • Pick the most energy-efficient configuration per
    instruction
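One way to realize the probabilistic estimate is a binomial conflict model: an access with reuse distance D hits if fewer than A of the D intervening blocks map to its set, each doing so with probability A/B. This sketch is my reading of a Brehob-Enbody-style estimate, not code from the paper, but it reproduces the .87 (D = 1) and .76 (D = 2) table entries for B = 8, A = 1.

```python
from math import comb

def est_hit_rate(D, B, A):
    """Estimated hit rate for reuse distance D, B cache blocks,
    associativity A: the access hits if fewer than A of the D
    intervening blocks fall into its set (probability A/B each)."""
    p = A / B
    return sum(comb(D, i) * p**i * (1 - p)**(D - i) for i in range(A))

print(est_hit_rate(1, 8, 1))  # 0.875     (the .87 table entry)
print(est_hit_rate(2, 8, 1))  # 0.765625  (the .76 table entry)
```

The compiler can invert this: pick the smallest number of partitions whose B and A keep the estimate above a target hit rate for the instruction's D.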

11
Cache Analysis: Computing Interferences
  • Avoid conflicts among temporally co-located
    references
  • Model conflicts using an interference graph

[Figure: the stream X W1 Y Y, X W1 Y Y, X W2 Y Y,
X W2 Y Y maps to blocks M4 M2 M1 M1, M4 M2 M1 M1,
M4 M3 M1 M1, M4 M3 M1 M1; the interference graph
has nodes M1-M4, each with D = 1, and edges between
temporally co-located blocks.]
12
Partition Assignment
  • Placement phase can overlap references
  • Compute the combined working-set
  • Use the graph-theoretic notion of a clique
  • For each clique, new D = Σ of the D of each node
  • Combined D for all overlaps = max over all
    cliques

[Figure: interference graph with Clique 1 =
{M1, M2, M4} and Clique 2 = {M1, M3, M4}; each node
has D = 1.]
Clique 1 {M1, M2, M4} → new reuse distance D = 3
Clique 2 {M1, M3, M4} → new reuse distance D = 3
Combined reuse distance = max(3, 3) = 3
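The combination rule on this slide is mechanical; a short sketch over the slide's own numbers:

```python
# Combined reuse distance for overlapped references:
# each clique's D is the sum of its members' D, and the
# combined D is the maximum over all cliques.
D = {"M1": 1, "M2": 1, "M3": 1, "M4": 1}   # per-node reuse distances
cliques = [("M1", "M2", "M4"), ("M1", "M3", "M4")]

clique_D = [sum(D[m] for m in c) for c in cliques]   # [3, 3]
combined = max(clique_D)
print(combined)   # 3
```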
13
Experimental Setup
  • Trimaran compiler and simulator infrastructure
  • ARM9 processor model
  • Cache configurations:
  • 1-KB to 32-KB
  • 32-byte block size
  • 2-, 4-, 8-partition vs. 2-, 4-, 8-way
    set-associative caches
  • Mediabench suite
  • CACTI for cache energy modeling

14
Reduction in Tag/Data-Array Checks
[Chart: average way accesses for 2-, 4-, and
8-partition caches at cache sizes 1-KB through
32-KB, plus the overall average.]
  • 36% reduction on an 8-partition cache

15
Improvement in Fetch Energy
[Chart: percentage fetch-energy improvement for a
16-KB cache (2-part vs. 2-way, 4-part vs. 4-way,
8-part vs. 8-way) across the Mediabench benchmarks
(epic, cjpeg, djpeg, unepic, pegwitenc, pegwitdec,
rawcaudio, rawdaudio, mpeg2dec, mpeg2enc,
pgpencode, pgpdecode, gsmencode, gsmdecode,
g721encode, g721decode) and their average.]
16
Summary
  • Maintain the advantages of a hardware cache
  • Expose placement and lookup decisions to the
    compiler
  • Avoid conflicts, eliminate redundant lookups
  • 24% energy savings for a 4-KB cache with 4
    partitions
  • Extensions:
  • Hybrid scratch-pads and caches
  • Disable selected tags → convert them into
    scratch-pads
  • 35% additional savings in a 4-KB cache with 1
    partition as a scratch-pad

17
Thank You! Questions?
18
Cache Analysis Step 1: Instruction Fusioning
  • Combine loads/stores that access the same set of
    objects
  • Avoids coherence problems and duplication
  • Uses points-to analysis

for (i = 0; i < N1; i++) {
  for (j = 0; j < readInput1(); j++)
    y[i+j] += w1 * x[i+j];    /* ld1/st1, ld3, ld5 */
  for (k = 0; k < readInput2(); k++)
    y[i+k] += w2 * x[i+k];    /* ld2/st2, ld4, ld6 */
}
[Figure: ld1/st1 and ld2/st2 (same object y) fuse
into group M1; ld3 and ld4 (same object x) fuse
into group M2.]
19
Partition Assignment
  • Greedily place instructions based on their cache
    estimates
  • Overlap instructions if required
  • Compute the number of partitions for overlapped
    instructions:
  • Enumerate cliques within the interference graph
  • Compute the combined working-set over all cliques
  • Assign the R/U bit to control lookup

[Figure: interference graph with cliques
{M1, M2, M4} and {M1, M3, M4}; each node has D = 1.]
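A minimal sketch of the greedy pass described above, assuming hypothetical per-group partition needs and the interference graph from the earlier example; the real pass additionally weighs the per-instruction energy estimates when choosing.

```python
# Greedy sketch: assign each fused reference group to partitions so that
# interfering groups avoid sharing partitions whenever enough exist;
# non-interfering groups are allowed to overlap.
def assign_partitions(need, interferes, total_parts=4):
    assigned = {}                                        # group -> partitions
    for g in sorted(need, key=need.get, reverse=True):   # biggest need first
        # partitions already used by neighbors in the interference graph
        busy = set().union(*(assigned.get(n, set()) for n in interferes[g]))
        free = [p for p in range(total_parts) if p not in busy]
        # prefer free partitions; overlap with neighbors only if forced
        chosen = (free + [p for p in range(total_parts) if p in busy])[:need[g]]
        assigned[g] = set(chosen)
    return assigned

# Hypothetical inputs: partition needs per group, plus the M1-M4 graph.
need = {"M1": 2, "M2": 1, "M3": 1, "M4": 1}
interferes = {"M1": {"M2", "M3", "M4"}, "M2": {"M1", "M4"},
              "M3": {"M1", "M4"}, "M4": {"M1", "M2", "M3"}}
print(assign_partitions(need, interferes))
```

With these inputs M2 and M3 end up sharing a partition, which is safe because they are never temporally co-located (one lives in the j-loop, the other in the k-loop).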
20
Related Work
  • Direct-addressed, cool caches [Unsal '01,
    Asanovic '01]
  • Tags maintained in registers that are addressed
    within loads/stores
  • Split temporal/spatial cache [Rivers '96]
  • Hardware managed, two partitions
  • Column partitioning [Devadas '00]
  • Individual ways can be configured as a
    scratch-pad
  • No load/store-based partitioning
  • Region-based caching [Tyson '02]
  • Heap, stack, globals
  • We provide finer-grained control and management
  • Pseudo set-associative caches [Calder '96,
    Inoue '99, Albonesi '99]
  • Reduce tag-check power
  • Compromise on cycle time
  • Orthogonal to our technique

21
Code Size Overhead
[Chart: percentage of instructions per Mediabench
benchmark that are annotated LD/STs or extra MOV
instructions, with the overall average.]