Title: A Highly Configurable Cache Architecture for Embedded Systems
1A Highly Configurable Cache Architecture for Embedded Systems
- Chuanjun Zhang, Frank Vahid, and Walid Najjar
- Dept. of Electrical Engineering
- Dept. of Computer Science and Engineering
- University of California, Riverside
- Also with the Center for Embedded Computer Systems at UC Irvine
- This work was supported by the National Science Foundation and the Semiconductor Research Corporation
2Outline
- Why a Configurable Cache? What Parameters?
- Configurable Associativity by Way Concatenation
- Configurable Size by Way Shutdown
- Configurable Line Size
- How to Configure Cache
- Cache Parameter Explorer
- A Heuristic Algorithm Searches Pareto Set of Cache Parameters
- Tradeoff Between Energy Dissipation and Performance
- The Explorer is Synthesized Using Synopsys
- Conclusions and Future Work
3Why Choose Cache: Impacts Performance and Power
- Performance impacts are well known
- Power
- ARM920T: caches consume 50% of total processor system power (Segars 01)
- MCORE: unified cache consumes 50% of total processor system power (Lee/Moyer/Arends 99)
- We'll show that a configurable cache can cut that power nearly in half on average
4Why a Configurable Cache?
- An embedded system may execute one application forever
- Tuning the cache configuration (size, associativity, line size) can save a lot of energy
- Associativity example
- 40% difference in memory access energy
[Figure: miss rate and normalized energy versus associativity for epic and mpeg2 from MediaBench]
5Benefits of Configurable Cache
- Mass production
- Unique chips getting more expensive as technology scales down (ITRS)
- Huge benefits to mass producing a single chip
- Harder to produce chips distinguished by cache when we have 50-100 processors per chip
- Adapt to program phases
- Recent research shows programs have different cache requirements over time
- Much research assumes a configurable cache
6Caches Vary Greatly in Embedded Processors
7Configurable Associativity by Way Concatenation
C. Zhang (ISCA 03)
[Figure: Way 1, Way 2, Way 3, Way 4; labeled four-way]
- Four-way set-associative base cache
- Ways can be concatenated to form two-way
- Can be further concatenated to direct-mapped
- Concatenation is logical: only 1 array accessed
8Way-Concatenate Cache Architecture
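- Trivial area overhead
[Figure: way-concatenate cache. Address fields: a31..a13 tag address, a12 and a11 (used with configuration registers reg0 and reg1), a10..a5 index, a4..a0 line offset. Blocks: two 6x64 decoders, tag part, data array, data output. The critical path is marked.]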
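As a rough illustration of the address fields on this slide, the following Python sketch models which ways would be enabled for a given address under each configured associativity. It assumes a 5-bit line offset (a0 to a4), a 6-bit index (a5 to a10), and that bits a11 and a12 pick among ways when the cache is configured as two-way or direct-mapped. The function names and the exact bit-to-way mapping are hypothetical and are not taken from the slides.

```python
LINE_OFFSET_BITS = 5  # a0..a4
INDEX_BITS = 6        # a5..a10, one of 64 rows per way


def decode(addr):
    """Split an address into the fields named on the slide."""
    offset = addr & ((1 << LINE_OFFSET_BITS) - 1)
    index = (addr >> LINE_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    a11 = (addr >> 11) & 1
    a12 = (addr >> 12) & 1
    return offset, index, a11, a12


def active_ways(addr, associativity):
    """Return the way numbers (0..3) that would be accessed.

    Assumption: a11 selects a pair of ways in two-way mode, and
    a12/a11 together select a single way in direct-mapped mode.
    """
    _, _, a11, a12 = decode(addr)
    if associativity == 4:
        return [0, 1, 2, 3]
    if associativity == 2:
        return [2 * a11, 2 * a11 + 1]
    if associativity == 1:
        return [2 * a12 + a11]
    raise ValueError("associativity must be 1, 2, or 4")
```

In this model, lowering the associativity only changes which ways are enabled; the row index is unchanged, which is one way to read "concatenation is logical".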
9Previous Method: Way Shutdown
- Albonesi proposed a cache where ways could be shut down
- To save dynamic power
- Motorola MCORE has the same way-shutdown feature
- Unified cache even allows setting each way as I, D, both, or off
[Figure: Way 1, Way 2, Way 3, Way 4]
- Reduces dynamic power by accessing fewer ways
- But decreases total size, so may increase miss rate
10Way Shutdown Can be Good for Static Power
- Static power (leakage) increasingly important in nanoscale technologies
- We combine way shutdown with way concatenation
- Use sleep transistor method of Powell (ISLPED 2000)
[Figure: gated-Vdd circuit with labels Vdd, Bitline, Bitline, Gated-Vdd Control, Gnd]
- When off, prevents leakage, but with a 4% performance overhead
11Cache Line Size
C. Zhang (ISVLSI 03)
[Figure: (A) a 64B cache line holding 64B of consecutive code; (B) a 64B cache line holding non-consecutive code, where 16B are used and 48B are wasted]
12Configurable Cache Line Size With Line Concatenation
- The physical line size is 16 bytes
- A programmable counter is used to designate the line size
- An interleaved off-chip memory organization
- 4 physical lines are filled when line size is 64 bytes
[Figure: one way, counter, bus, and off-chip memory]
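The line-concatenation idea can be sketched in a few lines of Python: on a miss, the configured (logical) line size decides how many 16-byte physical lines are fetched. This is only an illustrative model. The function name is hypothetical, and the set of allowed line sizes (16, 32, 64 bytes) is an assumption based on the sizes mentioned elsewhere in the slides.

```python
PHYSICAL_LINE_BYTES = 16  # physical line size stated on the slide


def physical_lines_to_fill(miss_addr, line_size):
    """Return the start addresses of the physical lines filled on a miss.

    Assumption: the logical line is aligned to its own size, and the
    counter steps through consecutive 16-byte physical lines.
    """
    if line_size not in (16, 32, 64):
        raise ValueError("line_size must be 16, 32, or 64 bytes")
    base = miss_addr & ~(line_size - 1)  # align down to the logical line
    count = line_size // PHYSICAL_LINE_BYTES
    return [base + i * PHYSICAL_LINE_BYTES for i in range(count)]
```

With a 64-byte line size the list has four entries, matching the "4 physical lines are filled" bullet; with a 16-byte line size only the missing physical line is fetched.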
13Computing Total Memory-Related Energy
- Considers CPU stall energy and off-chip memory energy
- Excludes CPU active energy
- Thus, represents all memory-related energy
energy_mem = energy_dynamic + energy_static
energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
energy_miss = energy_offchip_access + energy_uP_stall + energy_cache_block_fill
energy_static = cycles * energy_static_per_cycle
energy_miss = k_miss_energy * energy_hit
energy_static_per_cycle = k_static * energy_total_per_cycle
(We varied the k's to account for different system implementations)
- Underlined: measured quantities
- SimpleScalar (cache_hits, cache_misses, cycles)
- Our layout or data sheets (others)
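The energy model on this slide can be written out as a small Python function. This is a minimal sketch that follows the slide's quantity names; the function name and the numbers in the usage example are made up for illustration and are not measured values.

```python
def memory_energy(cache_hits, cache_misses, cycles,
                  energy_hit, energy_offchip_access,
                  energy_up_stall, energy_cache_block_fill,
                  energy_static_per_cycle):
    """Total memory-related energy: dynamic plus static."""
    # Each miss pays for the off-chip access, the processor stall,
    # and the cache block fill.
    energy_miss = (energy_offchip_access + energy_up_stall
                   + energy_cache_block_fill)
    energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
    energy_static = cycles * energy_static_per_cycle
    return energy_dynamic + energy_static


# Illustrative (invented) numbers in arbitrary energy units.
example = memory_energy(cache_hits=1000, cache_misses=10, cycles=2000,
                        energy_hit=1.0, energy_offchip_access=20.0,
                        energy_up_stall=5.0, energy_cache_block_fill=5.0,
                        energy_static_per_cycle=0.5)
```

In practice the hit, miss, and cycle counts would come from a simulator such as SimpleScalar, and the per-event energies from layout or data sheets, as the slide notes.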
14Energy Savings
- Energy savings when way concatenation, way shutdown, and cache line size concatenation are implemented (C. Zhang, ACM TECS, to appear)
15Cache Parameters that Consume the Lowest Energy Vary Across Applications
16How to Configure Cache
- Simulation-based methods
- Drawback: slowness
- Seconds of real-time work may take tens of hours to simulate
- Setting up the simulation tools adds further time
- Self-exploring method
- Cache parameter explorer
- Incorporated on a prototype platform
- Pareto parameters: a set of parameters that shows the performance and energy tradeoff
17Cache self-exploring hardware
- An explorer is used to detect the Pareto set of cache parameters
- The explorer stands aside to collect information used to calculate the energy
[Figure: block diagram with labels Processor, I, D, Mem]
18Pareto parameter sets
[Figure: tradeoff between energy and performance. Point A: lowest energy. Point B: best performance. Region C: tradeoff points between A and B. Point D: not a Pareto point.]
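To make the notion of a Pareto point concrete, here is a short Python sketch that filters a list of (name, energy, time) measurements down to the non-dominated ones. The function name and the sample numbers are invented for illustration; only the idea that a point like D is dominated comes from the slide.

```python
def pareto_set(points):
    """Return names of points not dominated in both energy and time.

    A point is dominated if some other point is no worse in both
    energy and time and strictly better in at least one of them.
    """
    result = []
    for name, energy, time in points:
        dominated = any(
            e <= energy and t <= time and (e < energy or t < time)
            for _, e, t in points
        )
        if not dominated:
            result.append(name)
    return result


# Invented sample values: (name, energy in mJ, execution time).
samples = [("A", 1.0, 5.0), ("B", 3.0, 2.0), ("C", 2.0, 3.0), ("D", 2.5, 4.0)]
pareto_names = pareto_set(samples)
```

With these sample values, D is dominated by C (C has lower energy and lower time), so only A, B, and C remain.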
19Heuristic algorithm
- Search all possible cache configurations
- Time consuming. Considering other configurable parameters (voltage levels, bus width, etc.), the search space will quickly grow to millions.
- A heuristic is proposed
- First, search for point A
- The order in which parameters are searched matters
- No cache flush is needed
- Then search for point B
- Last, search for points in region C
[Figure: energy (mJ) versus time, marking point A (lowest energy), point B (best performance), and region C (tradeoff)]
20Impact of Cache Parameters on Miss Rate and Energy
[Figure: two panels, each annotated "One Way" and "Line Size 32B"]
- Average instruction cache miss rate and normalized energy of the benchmarks.
21Energy Dissipation on On-Chip Cache and Off-Chip Memory
Benchmark: parser
22Searching for Point A
[Figure: search flow: Search Cache Size, Search Line Size, Search Associativity, Way prediction. Energy (mJ) versus time plot with point A marked Lowest Energy.]
- Point A: the least-energy cache configuration
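The one-parameter-at-a-time search for point A can be sketched as follows. This is a simplified model under stated assumptions: each parameter is stepped upward from its smallest value and the step stops as soon as energy no longer improves; the parameter values (2K/4K/8K, 16/32/64 bytes, 1/2/4 ways) are assumed; way prediction is omitted for brevity; and `measure_energy` is a hypothetical callback standing in for whatever the explorer measures.

```python
def search_point_a(measure_energy,
                   sizes=(2048, 4096, 8192),
                   line_sizes=(16, 32, 64),
                   associativities=(1, 2, 4)):
    """Greedy search: tune cache size, then line size, then associativity."""
    config = {"size": sizes[0], "line": line_sizes[0],
              "assoc": associativities[0]}
    best = measure_energy(config)
    evaluations = 1
    for key, values in (("size", sizes), ("line", line_sizes),
                        ("assoc", associativities)):
        for value in values[1:]:
            trial = dict(config, **{key: value})
            energy = measure_energy(trial)
            evaluations += 1
            if energy < best:
                best, config = energy, trial
            else:
                break  # energy stopped improving; keep the previous value
    return config, best, evaluations


def toy_energy(cfg):
    """Invented energy function with a minimum at 4K, 32B, one way."""
    return (abs(cfg["size"] - 4096) / 1024
            + abs(cfg["line"] - 32) / 16
            + cfg["assoc"])


found, lowest, tried = search_point_a(toy_energy)
```

On the toy function this visits 6 configurations rather than all combinations, which is the kind of saving the heuristic is after; on real workloads such a greedy search may miss the true minimum.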
23Searching for Point B
[Figure: search flow: Fix Cache Size, Search Line Size, Search Associativity, No Way prediction. Energy (mJ) plot with points A and B, B marked Best Performance.]
- Point B: the best-performance cache configuration
- High associativity doesn't mean high performance
- Large line size may not be good for data cache
24Searching for Point C
Point: A, B, C
Line size: 64, 64, 64
Cache size: 2K, 8K, 4K, 8K
Associativity: 1W, 4W, 1W, 1W, 2W
- Cache parameters in region C represent the tradeoff between energy and performance
- Choose cache parameters between points A and B.
- If the cache sizes at points A and B are 8K and 4K, respectively, then the cache size of points in region C will be tested at 8K and 4K.
- Combinations of point A's and B's parameters are tested.
[Figure: points A, B, and C, with C marking the tradeoff between energy and performance]
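One way to read "combinations of point A's and B's parameters" is as a Cartesian product over the per-parameter values found at A and B. The Python sketch below enumerates those candidates, leaving out A and B themselves. The dictionary keys and the sample configurations are hypothetical, and the actual explorer may test a different subset.

```python
from itertools import product


def region_c_candidates(point_a, point_b):
    """Mix each parameter's value from A or B; exclude A and B themselves."""
    keys = list(point_a)
    choices = []
    for key in keys:
        values = [point_a[key]]
        if point_b[key] != point_a[key]:
            values.append(point_b[key])
        choices.append(values)
    candidates = []
    for combo in product(*choices):
        config = dict(zip(keys, combo))
        if config != point_a and config != point_b:
            candidates.append(config)
    return candidates


# Invented example points (sizes in bytes, line size in bytes, ways).
a = {"size": 8192, "line": 64, "assoc": 1}
b = {"size": 4096, "line": 64, "assoc": 4}
mixed = region_c_candidates(a, b)
```

For these example points, only two extra configurations need to be measured, since the line size is the same at A and B.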
25FSM and Data Path of the Cache Explorer
[Figure: FSM and data path of the cache explorer. Inputs: hit energies, hit num, miss energies, miss num, static energies, exe time. Components: FSM (control), two muxes, multiplier, adder, register, memory, configure register, lowest energy register, comparator with output com_out.]
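The data path on this slide can be mimicked in software as a multiply-accumulate followed by a comparison against the lowest energy seen so far. The Python class below is a behavioral sketch only. The class name, method name, and the integer energy values are hypothetical, and the real hardware's sequencing and bit widths are not modeled.

```python
class ExplorerModel:
    """Behavioral model: multiplier + adder + comparator + registers."""

    def __init__(self, energies):
        # "memory": per-configuration (hit, miss, static) energy constants
        self.energies = energies
        self.lowest_energy = None        # lowest energy register
        self.configure_register = None   # best configuration so far

    def evaluate(self, config, hit_num, miss_num, exe_time):
        hit_e, miss_e, static_e = self.energies[config]
        acc = 0
        # The FSM would steer one operand pair at a time through the muxes.
        for energy, count in ((hit_e, hit_num), (miss_e, miss_num),
                              (static_e, exe_time)):
            acc += energy * count        # multiplier feeding the adder
        com_out = self.lowest_energy is None or acc < self.lowest_energy
        if com_out:                      # comparator says "new minimum"
            self.lowest_energy = acc
            self.configure_register = config
        return acc, com_out


# Invented constants in arbitrary integer units.
model = ExplorerModel({"8K_4W": (4, 40, 2), "8K_1W": (2, 40, 2)})
first = model.evaluate("8K_4W", hit_num=100, miss_num=5, exe_time=50)
second = model.evaluate("8K_1W", hit_num=100, miss_num=8, exe_time=55)
```

After both evaluations the configure register holds the lower-energy configuration, which is the value the explorer would write back to the cache.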
26Implementing the Heuristic in Hardware
- Total size of the explorer
- About 4,200 gates, or 0.041 mm^2 in 0.18 micron CMOS technology
- Area overhead
- Compared to the reported size of the MIPS 4Kp with cache, this represents just over a 3% area overhead
- Power consumption
- 2.69 mW at 200 MHz. The power overhead compared with the MIPS 4Kp would be less than 0.5%
- Furthermore, the exploring hardware is used only during the exploring stage, and can be shut down after the best configuration is determined.
27How Good Is the Heuristic?
- Time complexity
- Searching the whole space: O(m x n x l x p)
- Heuristic: O(m + n + l + p)
- m: number of associativities, n: number of cache sizes
- l: number of cache line sizes, p: way prediction on/off
- Efficiency
- On average, 5 searches instead of 27 total searches find point A
- 2 out of 19 benchmarks miss the lowest-power cache configuration
- Using a different search order: line size, associativity, way prediction, and cache size
- 11 out of 19 benchmarks miss the best configuration
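The figure of 27 total configurations can be reproduced under a few assumptions that are not spelled out on the slides: four 2K ways, so an 8K cache can be 1-, 2-, or 4-way, a 4K cache 1- or 2-way, and a 2K cache only 1-way; three line sizes (16, 32, 64 bytes); and way prediction offered only when the associativity is greater than one. The Python sketch below enumerates that space; all names in it are hypothetical.

```python
# Assumed legal associativities per cache size (bytes -> ways).
ASSOC_BY_SIZE = {8192: (1, 2, 4), 4096: (1, 2), 2048: (1,)}
LINE_SIZES = (16, 32, 64)


def all_configurations():
    """Enumerate (size, assoc, line, way_prediction) tuples."""
    configs = []
    for size, assocs in ASSOC_BY_SIZE.items():
        for assoc in assocs:
            for line in LINE_SIZES:
                # Way prediction only makes sense with more than one way.
                options = (False, True) if assoc > 1 else (False,)
                for way_prediction in options:
                    configs.append((size, assoc, line, way_prediction))
    return configs


total = len(all_configurations())

# Rough step counts for m = n = l = 3 and p = 2.
m, n, l, p = 3, 3, 3, 2
exhaustive_bound = m * n * l * p   # upper bound on the full space
heuristic_bound = m + n + l + p    # upper bound on heuristic steps
```

Under these assumptions the enumeration yields 27 configurations, consistent with the slide, while the heuristic's worst case is on the order of the sum of the parameter counts rather than their product.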
28Results of Some Other Benchmarks
29Conclusion and Future Work
- A configurable cache architecture is proposed.
- Associativity, size, line size.
- A cache parameter explorer is implemented to find the cache parameters.
- A heuristic algorithm is proposed to search the Pareto cache parameter sets.
- The complexity of the heuristic is O(m + n + l) instead of O(m x n x l)
- Only 95% of the Pareto points can be found by the heuristic
- Overhead
- Little area and power overhead, and no performance overhead.
- Future Work
- Dynamically detect the cache parameters.