Title: Drowsy Caches: Simple Techniques for Reducing Leakage Power
1 Drowsy Caches: Simple Techniques for Reducing Leakage Power
- Krisztián Flautner
- Nam Sung Kim
- Steve Martin
- David Blaauw
- Trevor Mudge
krisztian.flautner_at_arm.com, kimns_at_eecs.umich.edu, stevenmm_at_eecs.umich.edu, blaauw_at_eecs.umich.edu, tnm_at_eecs.umich.edu
2 Motivation
- Ever increasing leakage power
- as feature size shrinks
- Vt scales down
- exponential increase in leakage power
- On-chip caches
- responsible for 15%-20% of the total power
- leakage power can exceed 50% of total cache power, according to our projection using Berkeley Predictive Models
3 Processor power trends
- Based on ITRS roadmap and transistor count estimates.
- The total power in this projection cannot be realized in practice.
4 An observation about data caches
- L1 data caches
- Working set: fraction of cache lines accessed in a time window.
- Window size: 2000 cycles.
- Only a small fraction of lines are accessed in a window.
[Figure: working set of the current window, and working set of the current and the 1, 8, and 32 previous windows]
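The working-set measurement described on this slide can be sketched as follows. This is a minimal illustration, not the authors' tooling: the trace format (pairs of cycle and cache line index) and the function name `working_set_fractions` are assumptions made for the example.

```python
def working_set_fractions(trace, num_lines, window=2000):
    """Return, per time window, the fraction of cache lines accessed.

    trace     -- iterable of (cycle, line_index) pairs (hypothetical format)
    num_lines -- total number of lines in the cache
    window    -- window size in cycles (2000 on the slide)
    """
    touched = {}  # window index -> set of line indices accessed in it
    for cycle, line in trace:
        touched.setdefault(cycle // window, set()).add(line)
    if not touched:
        return []
    last_window = max(touched)
    # Windows with no accesses at all contribute a fraction of 0.0.
    return [len(touched.get(w, ())) / num_lines
            for w in range(last_window + 1)]


# Example: a 4-line cache observed over three 2000-cycle windows.
example_trace = [(0, 1), (10, 1), (1999, 2), (2000, 3), (4500, 3)]
fractions = working_set_fractions(example_trace, num_lines=4)
```

For this toy trace the first window touches two of four lines and each later window touches one, so most lines are idle in any given window, which is the observation the slide relies on.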
5 The Drowsy Cache approach
Instead of being sophisticated about predicting the working set, reduce the penalty for being wrong.
- Algorithm
- Periodically put all lines in cache into drowsy mode.
- When accessed, wake up the line.
- Optimize across circuit-microarchitecture boundary
- Use of the appropriate circuit technique enables simplified microarchitectural control.
- Requirement: state preservation in low-leakage mode.
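The algorithm above can be sketched as a small behavioral model. The class name `DrowsyCacheModel`, the default window of 2000 cycles, and the one-cycle wake-up penalty are assumptions chosen for illustration; the model tracks only the awake/drowsy state of each line, not tags or data.

```python
class DrowsyCacheModel:
    """Behavioral sketch: periodic global drowsy transition, wake on access."""

    def __init__(self, num_lines, window=2000, wake_penalty=1):
        self.awake = [True] * num_lines   # all lines start awake (assumption)
        self.window = window              # cycles between global drowsy signals
        self.wake_penalty = wake_penalty  # extra cycles to wake a drowsy line
        self.next_drowse = window
        self.wakeups = 0

    def access(self, cycle, line):
        """Access a line at the given cycle; return the extra latency in cycles."""
        # A single global counter puts every line into drowsy mode at each
        # window boundary; no per-line counter or predictor state is kept.
        if cycle >= self.next_drowse:
            self.awake = [False] * len(self.awake)
            self.next_drowse = (cycle // self.window + 1) * self.window
        if self.awake[line]:
            return 0                      # awake lines are not penalized
        self.awake[line] = True           # wake the line; its state was preserved
        self.wakeups += 1
        return self.wake_penalty


model = DrowsyCacheModel(num_lines=4, window=100)
latencies = [model.access(c, l) for c, l in [(0, 1), (50, 1), (100, 1), (101, 1)]]
```

Being wrong about the working set costs only the wake-up penalty on the first access after a window boundary, which is why the policy can stay this simple.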
6 Access control flow: Awake tags
[Flow diagram, box labels: awake tag match, line wake up, line access, hit; awake tag miss, line wake up, miss, memory, replacement]
- Drowsy hit / miss adds at most 1 cycle latency
- Access to awake line is not penalized
7 Access control flow: Drowsy tags
[Flow diagram, box labels: awake tag match, line wake up, line access, tag wake up, hit; awake tag miss, line wake up, tag wake up, unneeded tags and lines back to drowsy, miss, memory, replacement]
- Drowsy tags implementation is more complicated
- Is the complexity worth it?
- Tags use about 7% of data bits (32-bit address)
- Only small incremental leakage reduction
- Worst case 3 cycle extra latency
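The tag-to-data ratio quoted above can be checked with a short calculation. The cache geometry used here (32 KB, 4-way set associative, 32-byte lines) is an assumption for illustration, not a configuration stated on the slide, and status bits such as valid and dirty are ignored.

```python
def tag_overhead(cache_bytes, ways, line_bytes, address_bits=32):
    """Return tag bits per line as a fraction of data bits per line.

    Assumes power-of-two sizes; ignores valid/dirty and other status bits.
    """
    sets = cache_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 for a power of two
    index_bits = sets.bit_length() - 1
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits / (line_bytes * 8)


# Hypothetical geometry: 256 sets, 5 offset bits, 8 index bits, 19 tag bits.
ratio = tag_overhead(32 * 1024, ways=4, line_bytes=32)
```

Under these assumptions each line carries 19 tag bits against 256 data bits, roughly 7%, so putting tags into drowsy mode can only recover a small additional share of leakage.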
8 Low-leakage circuit techniques
Circuit | Pros | Cons
Gated-VDD | Largest leakage reduction; fast mode switching; easy implementation | Loses cell state
ABB-MTCMOS | Retains cell state | Slow mode switching
DVS | Retains cell state; fast mode switching; more power reduction than ABB | More susceptible to SEU noise
9 Drowsy memory using DVS
- Low supply voltage for inactive memory cells
- Low voltage reduces leakage current too!
- Quadratic reduction in leakage power
[Figure: memory cell showing the supply voltage for normal mode, the supply voltage for drowsy mode, and the leakage path]
10 Leakage reduction using DVS
- High-Vt devices for access transistors
- reduce leakage power
- increase access time of cache
- 91% leakage reduction
- 6% cycle time increase
Projections for 0.07µm process
11 Drowsy cache line architecture
12 Energy reduction
- Projections for 0.07µm process
- High leakage lines have to be powered up when accessed.
- Drowsy circuit
- Without high-Vt device (in SRAM): 6x leakage reduction, no access delay.
- With high-Vt device: 10x leakage reduction, 6% access time increase.
13 1-cycle vs. 2-cycle wake up
- Fast wakeup is important but easy to accomplish!
- Cache access time: 0.57ns (for 0.07µm, from CACTI using 0.18µm baseline).
- Speed depends on voltage controller size: 64 x Leff: 0.28ns (half cycle at 4 GHz), 32 x Leff: 0.42ns, 16 x Leff: 0.77ns.
- Impact of drowsy tags is quite similar to double-cycle wake up.
14 Policy comparison
15 Energy reduction
Tags | Normalized Total Energy (DVS) | Normalized Total Energy (Theoretical min.) | Normalized Leakage Energy (DVS) | Normalized Leakage Energy (Theoretical min.) | Run-time increase (%)
Awake tags | 0.46 | 0.35 | 0.29 | 0.15 | 0.41
Drowsy tags | 0.42 | 0.31 | 0.24 | 0.09 | 0.84
- Theoretical minimum assumes zero leakage in drowsy mode
- Total energy reduction within 0.1 of theoretical minimum
- Diminishing returns for better leakage reduction techniques
- Above figures assume 6x leakage reduction, 10x possible with small additional run-time impact
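One way to read the leakage columns is with a simple two-state model, sketched below. This model is an assumption made for illustration, not a method stated on the slides: it treats the theoretical minimum (zero leakage in drowsy mode) as the share of leakage spent on awake lines, and divides the remaining share by the drowsy circuit's leakage reduction factor. The function name `normalized_leakage` is hypothetical.

```python
def normalized_leakage(awake_fraction, reduction_factor):
    """Normalized leakage energy under a two-state (awake/drowsy) model.

    awake_fraction   -- share of baseline leakage spent on awake lines,
                        taken here from the "Theoretical min." column
    reduction_factor -- leakage reduction of a drowsy line (e.g. 6 or 10)
    """
    drowsy_fraction = 1.0 - awake_fraction
    return awake_fraction + drowsy_fraction / reduction_factor


awake_tags_6x = normalized_leakage(0.15, 6)    # about 0.29
drowsy_tags_6x = normalized_leakage(0.09, 6)   # about 0.24
awake_tags_10x = normalized_leakage(0.15, 10)  # about 0.235
```

With a 6x reduction the model gives values that round to the 0.29 and 0.24 in the DVS leakage column. Moving to 10x only lowers the awake-tags figure from about 0.29 to about 0.235 against a floor of 0.15, which illustrates the diminishing returns noted above.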
16 Conclusions
- Simple circuit technique
- Need high-Vt transistors, low Vdd supply
- Simple architecture
- No need to keep counter/predictor state for each line
- Periodic global counter asserts drowsy signal
- Window size (for periodic drowsy transition) depends on core: 4000 cycles has a good energy-delay trade-off
- Technique also works well on in-order processors
- Memory subsystem is already latency tolerant
- Drowsy circuit is good enough
- Diminishing returns on further leakage reduction
- Focus is again on dynamic energy