Title: A Low Energy Set-Associative I-Cache with Extended BTB
1 A Low Energy Set-Associative I-Cache with Extended BTB
K. Inoue, V. Moshnyaga, and K. Murakami
2 Introduction
Increase in cache size
Power consumed in on-chip caches
- DEC 21164 CPU
- StrongARM SA-110 CPU
- Bipolar ECL CPU
[Chart: power consumed in on-chip caches; values shown: 50%, 25%, 43%]
Kamble et al., Analytical Energy Dissipation Models for Low Power Caches, ISLPED'97
Jouppi et al., A 300-MHz 115-W 32-b Bipolar ECL Microprocessor, IEEE Journal of Solid-State Circuits, 1993
3 Problem of Conventional Caches
4 Our Proposal
History-Based Tag-Comparison I-Cache
- Attempts to reduce cache-access energy without performance degradation
- Reuses tag-check results to eliminate unnecessary way activation
- Can achieve a 62% energy reduction with only a 0.2% performance degradation
5 Conventional Tag-Check Scheme
Exactly the same tag-check result!
6 History-Based Tag-Comparison (HBTC) Scheme
Attempts to reuse tag-check results produced earlier within a cache-miss interval!
A tag-check result can be reused when:
- The target instruction has been referenced before, and
- No cache miss has occurred since the previous reference.
[Figure: timeline between two misses (the cache-miss interval). The first Ref. A performs a tag check; the second Ref. A reuses the result.]
7 Concept of the HBTC Cache
2. If a cache miss occurs, then we invalidate all the stored tag-check results.
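The reuse rule and the invalidate-on-miss rule can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the hardware described in these slides: the class name `TagCheckHistory`, the per-address dictionary, and the `cache_lookup` callback are all hypothetical, and the actual design keeps way pointers in the extended BTB rather than in a per-address map.

```python
class TagCheckHistory:
    """Toy model: remember the hit way of each line until the next cache miss."""

    def __init__(self):
        self.way_of = {}      # line address -> way that hit on the last tag check
        self.tag_checks = 0   # accesses that needed a full tag check
        self.reuses = 0       # accesses served from the recorded history

    def access(self, line_addr, cache_lookup):
        """cache_lookup(addr) is a hypothetical helper: hit way, or None on a miss."""
        if line_addr in self.way_of:
            self.reuses += 1
            return self.way_of[line_addr]
        self.tag_checks += 1
        way = cache_lookup(line_addr)
        if way is None:
            # A miss may replace any line, so every stored result is invalidated
            self.way_of.clear()
            return None
        self.way_of[line_addr] = way
        return way


# Hypothetical cache contents: line address -> way currently holding it
resident = {0x100: 2, 0x120: 0}
history = TagCheckHistory()
results = [history.access(a, resident.get) for a in (0x100, 0x100, 0x200, 0x100)]
print(results, history.tag_checks, history.reuses)
```

In this run the second reference to `0x100` reuses the stored way, the reference to `0x200` misses and clears the history, and the last reference to `0x100` has to perform a tag check again.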
8 Conventional vs. Phased vs. HBTC
[Figure: way activation in the Conventional, Phased, and HBTC caches. Cases labeled: Reuse, No Reuse, Cache Hit, Cache Miss.]
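The comparison can be made concrete by counting how many tag arrays and data arrays one access activates. The sketch below assumes the usual definitions of a conventional cache (all ways read in parallel) and a phased cache (all tags first, then only the hit way's data), plus the HBTC behaviour stated in these slides (no tag check and a single way on reuse, a conventional access otherwise). The function name `activations` and its parameters are illustrative, and the BTB overhead is ignored.

```python
def activations(scheme, n_ways, hit=True, reuse=False):
    """Return (tag_arrays_read, data_arrays_read) for a single cache access."""
    if scheme == "conventional":
        # All tags and all data ways are read in parallel
        return (n_ways, n_ways)
    if scheme == "phased":
        # Tags are checked first; data is read only from the hit way
        return (n_ways, 1 if hit else 0)
    if scheme == "hbtc":
        # Reuse skips the tag check; otherwise the access is conventional
        return (0, 1) if reuse else (n_ways, n_ways)
    raise ValueError(f"unknown scheme: {scheme}")


for scheme, kwargs in [("conventional", {}), ("phased", {"hit": True}),
                       ("hbtc", {"reuse": True}), ("hbtc", {"reuse": False})]:
    print(scheme, kwargs, activations(scheme, 4, **kwargs))
```

The phased cache saves data-array energy but normally pays an extra cycle for the sequential tag-then-data access, which is the performance cost the HBTC scheme tries to avoid.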
9 HBTC SA I-Cache Architecture
[Figure: architecture diagram; surviving label: PBAreg]
10 HBTC I-Cache Operation
- Normal Mode (NM): w/ tag checks
- Omitting Mode (OM): w/o tag checks (reuse)
- Tracing Mode (TM): w/ tag checks (tag-check results are preserved in the WPRreg and are stored into the WP-table on the next BTB hit)
11 HBTC I-Cache Operation Example
[Figure: HBTC I-cache block diagram. Labels: PC; Branch Target Buffer (Inst. Addr. A, Inst. Addr. B, Target Addr., WP Table with T and N fields, Valid, Invalid); 4-way I-Cache (ways 0 to 3); Pred. (T or N); BTB Hit; From I-Cache; WPreg; WPRreg; PBAreg; Mode Controller; Mode Transition among NM, TM, and OM, with GOtoNM signals.]
12 HBTC I-Cache Operation Example
[Figure: same block diagram as slide 11. Values shown: A; Taken.]
13 HBTC I-Cache Operation Example
No valid WPs are detected!
[Figure: same block diagram as slide 11. Values shown: A; Taken.]
14 HBTC I-Cache Operation Example
PC and branch prediction result are saved!
No valid WPs are detected!
[Figure: same block diagram as slide 11. Values shown: A, T (saved); A.]
15 HBTC I-Cache Operation Example
Tag-comparison result is stored into the WPRreg!
Conventional accesses!
[Figure: same block diagram as slide 11. Values shown: 1; A, T.]
16 HBTC I-Cache Operation Example
Tag-comparison result is stored into the WPRreg!
Conventional accesses!
[Figure: same block diagram as slide 11. Values shown: 3; A, T.]
17 HBTC I-Cache Operation Example
Tag-comparison result is stored into the WPRreg!
Conventional accesses!
[Figure: same block diagram as slide 11. Values shown: 0; A, T.]
18 HBTC I-Cache Operation Example
The WPRreg is stored into the WP-Table entry pointed to by the PBAreg!
BTB Hit!
[Figure: same block diagram as slide 11. Values shown: A, T; B.]
19 HBTC I-Cache Operation Example
[Figure: same block diagram as slide 11. Values shown: A; Taken.]
20 HBTC I-Cache Operation Example
Valid WPs are detected!
[Figure: same block diagram as slide 11. Values shown: A; Taken.]
21 HBTC I-Cache Operation Example
Tag-comparison reuse
[Figure: same block diagram as slide 11. Value shown: 1.]
22 HBTC I-Cache Operation Example
Tag-comparison reuse
[Figure: same block diagram as slide 11. Value shown: 3.]
23 HBTC I-Cache Operation Example
Tag-comparison reuse
[Figure: same block diagram as slide 11. Value shown: 0.]
24 HBTC I-Cache Operation Example
No valid WPs in the WPreg!
[Figure: same block diagram as slide 11. Value shown: ?]
25 HBTC I-Cache Operation Example
Conventional accesses!
[Figure: same block diagram as slide 11.]
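The walkthrough on slides 11 to 25 suggests a small state machine for the mode controller. The sketch below is an interpretation of that walkthrough, not a specification taken from the slides: the function `next_mode`, its parameter names, and the rule that a cache miss returns the controller to NM are assumptions, and the limit on how many tag-check results the WPRreg can hold is not modeled.

```python
from enum import Enum


class Mode(Enum):
    NM = "normal"     # tag checks performed
    TM = "tracing"    # tag checks performed and recorded in the WPRreg
    OM = "omitting"   # tag checks skipped, way pointers reused


def next_mode(mode, btb_hit=False, valid_wps=False,
              wps_exhausted=False, cache_miss=False):
    """Return the next mode. All parameter names are hypothetical."""
    if cache_miss:
        # Assumption: a miss invalidates all WPs, so accesses become conventional
        return Mode.NM
    if btb_hit:
        # Valid way pointers in the BTB entry allow reuse; otherwise start tracing
        return Mode.OM if valid_wps else Mode.TM
    if mode is Mode.OM and wps_exhausted:
        # No valid WPs left in the WPreg: fall back to conventional accesses
        return Mode.NM
    return mode


# Replay of the example: the first visit to A traces, the second visit reuses
trace = [Mode.NM]
trace.append(next_mode(trace[-1], btb_hit=True, valid_wps=False))  # A, no WPs yet
trace.append(next_mode(trace[-1]))                                 # sequential fetch
trace.append(next_mode(trace[-1], btb_hit=True, valid_wps=True))   # A again, WPs valid
trace.append(next_mode(trace[-1], wps_exhausted=True))             # WPreg runs out
print([m.name for m in trace])
```

The replay skips the BTB hit on B (slide 18), where the WPRreg contents are written back to the WP-table entry pointed to by the PBAreg.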
26 Advantages and Disadvantages
Normal Mode (NM) / Tracing Mode (TM)
Omitting Mode (OM)
- Eliminates unnecessary energy consumption w/o performance degradation (during OM)!
- BTB energy overhead due to WP-table read accesses
- BTB access conflict for invalidating all WPs (causes 1 stall cycle)
- BTB access conflict to record WPs (causes 1 stall cycle)
27 Evaluation Environment
- Out-of-order (OOO) simulation with SimpleScalar
- 16 KB 4-way I-cache (32 B line size)
- For other parameters, the defaults were used
- Cache energy model based on [Kamble97] (including the WP-table read-energy overhead)
- Assume that the BTB is accessed only when branch or jump instructions are executed (instructions are pre-decoded)
[Kamble97] M. B. Kamble and K. Ghose, Analytical Energy Dissipation Models for Low Power Caches, ISLPED'97
28 Evaluation: Energy and Performance
- 62% Ecache reduction with a 0.2% increase in execution time
- Even in the worst case, about 20% Ecache reduction
29 Evaluation: Effect of WP Invalidation Penalty
[Figure: normalized execution time (cycles) vs. WP invalidation penalty (cycles) for 099.go, 126.gcc, 132.ijpeg, and mpeg2(d).]
- If the penalty is 4 clock cycles or less, the performance overhead is trivial.
- The performance overhead grows once the penalty exceeds 4 clock cycles.
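A first-order way to see why a small penalty is harmless is to charge every WP invalidation a fixed number of stall cycles and divide by the baseline cycle count. The sketch below is only that back-of-the-envelope model: the function name, the baseline cycle count, and the invalidation count are made-up illustration values, not results from the evaluation, and the model ignores any overlap of stalls in an out-of-order core.

```python
def normalized_exec_time(base_cycles, n_invalidations, penalty_cycles):
    """First-order estimate: each WP invalidation adds penalty_cycles of stall."""
    if base_cycles <= 0:
        raise ValueError("base_cycles must be positive")
    return (base_cycles + n_invalidations * penalty_cycles) / base_cycles


# Hypothetical workload: 1,000,000 baseline cycles and 500 WP invalidations
for penalty in (1, 2, 4, 8, 16):
    print(penalty, normalized_exec_time(1_000_000, 500, penalty))
```

Under this model the overhead scales linearly with the product of invalidation count and penalty, so it stays negligible as long as invalidations are rare relative to total cycles.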
30 Evaluation: Effect of the Number of WPs
[Figure: normalized energy (0.0 to 1.2) vs. # of way pointers (1, 2, 4, 8, 16, 32) for 126.gcc, w/ and w/o pre-decoding. Bars split into energy for cache access and energy overhead of BTB.]
- Increasing the number of WPs makes it possible to reuse more tag-check results
- But it adds BTB access energy overhead
31 Evaluation: Effect of Cache Associativity
[Figure: energy (Joule) vs. associativity (1, 2, 4, 8, 16, 32, 64) for mpeg2decode, Conventional vs. HBTC. Energy components: Eothers, Etag, Edata,bl, Edata,prectl.]
- Conventional: Ecache grows as associativity increases
- HBTC: Ecache decreases as associativity increases (n < 4); after that, it starts to increase (n > 4)
32 Conclusions
History-Based Tag-Comparison Instruction Cache
- Records tag-check results generated by the I-cache into the extended BTB
- Attempts to reuse them in order to eliminate unnecessary way activation
- Achieves a 62% I-cache energy reduction with only a 0.2% performance degradation!
Future work
- Analyze energy consumption based on a real chip design.
33 Backup Slides (History-Based Tag-Comparison Cache)
34 Evaluation: Comparison with IS Approach
[Figure: normalized tag-compare count (0.0 to 0.8) per benchmark for three schemes: Interline Sequential (IS) approach, History-Based Look-up (HBL) cache, and the combination of IS and HBL. Benchmarks: 099.go, 126.gcc, 130.li, 102.swim, adpcm(d), mpeg2(d), 124.m88ksim, 129.comp., 132.ijpeg, adpcm(e), mpeg2(e).]
35 Evaluation: Effects of Cache Associativity
[Figure: energy (Joule) vs. associativity (1, 2, 4, 8, 16, 32, 64) for 099.go, Conventional vs. HBL Cache, 0.8um CMOS. Energy components: Eothers, Etag, Edata,bl, Edata,prectl.]
M. B. Kamble and K. Ghose, Energy-Efficiency of VLSI Caches: A Comparative Study, 10th Int. Conf. on VLSI Design
S. J. E. Wilton and N. P. Jouppi, An Enhanced Access and Cycle Time Model for On-Chip Caches, WRL Research Report 93/5
36 Evaluation: Effects of Cache Associativity
[Figure: energy (Joule) vs. associativity (1, 2, 4, 8, 16, 32, 64) for 126.gcc, Conventional vs. HBL Cache, 0.8um CMOS. Energy components: Eothers, Etag, Edata,bl, Edata,prectl.]
37 Evaluation: Effects of Cache Associativity
[Figure: energy (Joule) vs. associativity (1, 2, 4, 8, 16, 32, 64) for 132.ijpeg, Conventional vs. HBL Cache, 0.8um CMOS. Energy components: Eothers, Etag, Edata,bl, Edata,prectl.]
38 Evaluation: Effects of Cache Associativity
[Figure: energy (Joule) vs. associativity (1, 2, 4, 8, 16, 32, 64) for mpeg2decode, Conventional vs. HBL Cache, 0.8um CMOS. Energy components: Eothers, Etag, Edata,bl, Edata,prectl.]
39 Evaluation: Effects of # of WPs
w/ Pre-Decoding (BTB access occurs only at branch or jump executions)
[Figure: normalized energy (0.0 to 1.0) vs. # of way pointers (1, 2, 4, 8, 16, 32) for 126.gcc and 132.ijpeg. Bars split into energy for cache access and energy overhead at BTB.]
40 Evaluation: Effects of # of WPs
w/o Pre-Decoding (BTB access occurs for all instructions)
[Figure: normalized energy (0.0 to 1.0) vs. # of way pointers (1, 2, 4, 8, 16, 32) for 126.gcc and 132.ijpeg. Bars split into energy for cache access and energy overhead at BTB.]
41 Evaluation: Effect of WP Invalidation Penalty
[Figure: normalized execution time (cycles) vs. WP invalidation penalty (cycles) for 099.go, 126.gcc, 132.ijpeg, and mpeg2(d).]
[Figure: breakdown of WP invalidations by cause (BTB replacement, cache miss) for 099.go, 126.gcc, 130.li, 102.swim, adpcm(d), mpeg2(d), 124.m88ksim, 129.comp., 132.ijpeg, adpcm(e), mpeg2(e).]
- If the penalty is 4 clock cycles or less, the performance overhead is trivial.
- The performance overhead grows once the penalty exceeds 4 clock cycles.