A Highly Configurable Cache Architecture for Embedded Systems (presentation transcript)

Transcript and Presenter's Notes
1
A Highly Configurable Cache Architecture for Embedded Systems
  • Chuanjun Zhang, Frank Vahid, and Walid Najjar
  • Dept. of Electrical Engineering
  • Dept. of Computer Science and Engineering
  • University of California, Riverside
  • Also with the Center for Embedded Computer
    Systems at UC Irvine
  • This work was supported by the National Science
    Foundation and the Semiconductor Research
    Corporation

2
Outline
  • Why a Configurable Cache? What Parameters?
  • Configurable Associativity by Way Concatenation
  • Configurable Size by Way Shutdown
  • Configurable Line Size
  • How to Configure the Cache
  • Cache Parameter Explorer
  • A Heuristic Algorithm to Search the Pareto Set of Cache Parameters
  • Tradeoff Between Energy Dissipation and Performance
  • The Explorer Is Synthesized Using Synopsys
  • Conclusions and Future Work

3
Why? Cache Impacts Performance and Power
  • Performance impacts are well known
  • Power
  • ARM920T caches consume 50% of total processor system power (Segars 01)
  • MCORE unified cache consumes 50% of total processor system power (Lee/Moyer/Arends 99)
  • We'll show that a configurable cache can cut that power nearly in half on average

4
Why a Configurable Cache?
  • An embedded system may execute one application forever
  • Tuning the cache configuration (size, associativity, line size) can save a lot of energy
  • Associativity example
  • 40% difference in memory access energy

(Figure: miss rate and normalized energy vs. associativity for epic and mpeg2 from MediaBench)
5
Benefits of a Configurable Cache
  • Mass production
  • Unique chips get more expensive as technology scales down (ITRS)
  • Huge benefits to mass-producing a single chip
  • Harder to produce chips distinguished by cache when we have 50-100 processors per chip
  • Adapt to program phases
  • Recent research shows programs have different cache requirements over time
  • Much research assumes a configurable cache

6
Caches Vary Greatly in Embedded Processors
7
Configurable Associativity by Way Concatenation (C. Zhang, ISCA 03)

(Figure: the four ways of the base cache, configured as four-way)
  • Four-way set-associative base cache
  • Ways can be concatenated to form a two-way cache
  • Ways can be further concatenated to form a direct-mapped cache
  • Concatenation is logical: only 1 array is accessed
8
Way-Concatenate Cache Architecture
Trivial area overhead

(Figure: 32-bit address split into tag (a31..a13), two configurable bits (a12, a11, steered by reg0 and reg1), index (a10..a5), and line offset (a4..a0); the index drives 6x64 data arrays producing the data output; the configuration logic sits in the tag part, off the critical path)
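The address split above shifts bits between the tag and index as the ways are concatenated. As a rough illustration (not from the paper), assuming the 8 KB base cache and 32-byte lines implied by the a4..a0 offset field, the field widths for each configuration can be computed as:

```python
from math import log2

TOTAL_SIZE = 8 * 1024   # assumed base cache: 8 KB
LINE_SIZE = 32          # assumed 32-byte lines (offset bits a4..a0)

def address_split(ways):
    """Return (offset_bits, index_bits, tag_bits) of a 32-bit address
    when the 4-way base cache is concatenated down to `ways` ways."""
    sets = TOTAL_SIZE // (LINE_SIZE * ways)
    offset_bits = int(log2(LINE_SIZE))
    index_bits = int(log2(sets))        # fewer ways -> more index bits
    tag_bits = 32 - index_bits - offset_bits
    return offset_bits, index_bits, tag_bits

for w in (4, 2, 1):
    o, i, t = address_split(w)
    print(f"{w}-way: offset={o} index={i} tag={t}")
```

Bits a11 and a12 are exactly the two bits that move from the tag into the index as the configuration goes from four-way to direct-mapped, which is why reg0 and reg1 suffice to steer them.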
9
Previous Method: Way Shutdown
  • Albonesi proposed a cache whose ways can be shut down
  • To save dynamic power
  • Motorola MCORE has the same way-shutdown feature
  • Its unified cache even allows setting each way as I, D, both, or off

(Figure: four ways, some shut down)
  • Reduces dynamic power by accessing fewer ways
  • But it decreases total size, so it may increase the miss rate

10
Way Shutdown Can Be Good for Static Power
  • Static power (leakage) is increasingly important in nanoscale technologies
  • We combine way shutdown with way concatenation
  • We use the sleep-transistor method of Powell (ISLPED 2000)

(Figure: gated-Vdd control transistor between the SRAM cell's bitlines and ground; when off, it prevents leakage, but adds about 4% performance overhead)
11
Cache Line Size (C. Zhang, ISVLSI 03)

(Figure A: a 64B cache line holding 64B of consecutive code, fully used)
(Figure B: a 64B cache line holding non-consecutive code; only 16B are used, so 48B are wasted)
12
Configurable Cache Line Size with Line Concatenation
  • The physical line size is 16 bytes
  • A programmable counter designates the logical line size
  • An interleaved off-chip memory organization fills multiple physical lines per miss

(Figure: a counter fills 4 physical lines of one way over the bus from off-chip memory when the line size is 64 bytes)
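The counter-driven fill can be sketched behaviorally. This is an illustrative Python sketch, not the hardware, and `read_offchip` is a hypothetical callback standing in for the interleaved off-chip memory:

```python
PHYSICAL_LINE = 16  # bytes; the base line size of one way

def fill_lines(miss_addr, logical_line_size, read_offchip):
    """On a miss, fill logical_line_size // 16 consecutive physical
    lines; the loop plays the role of the programmable counter."""
    assert logical_line_size in (16, 32, 64)
    # align the miss address down to a logical-line boundary
    base = miss_addr - (miss_addr % logical_line_size)
    lines = []
    for i in range(logical_line_size // PHYSICAL_LINE):
        lines.append(read_offchip(base + i * PHYSICAL_LINE))
    return lines
```

With a 64-byte logical line, four 16-byte bursts are issued; with a 16-byte logical line, just one, which is how the same array supports all three line sizes.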
13
Computing Total Memory-Related Energy
  • Considers CPU stall energy and off-chip memory
    energy
  • Excludes CPU active energy
  • Thus, represents all memory-related energy

energy_mem = energy_dynamic + energy_static
energy_dynamic = cache_hits x energy_hit + cache_misses x energy_miss
energy_miss = energy_offchip_access + energy_uP_stall + energy_cache_block_fill
energy_static = cycles x energy_static_per_cycle
energy_miss = k_miss_energy x energy_hit
energy_static_per_cycle = k_static x energy_total_per_cycle
(We varied the k's to account for different system implementations)
  • Measured quantities
  • SimpleScalar (cache_hits, cache_misses, cycles)
  • Our layout or data sheets (the other quantities)
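The model above can be expressed directly in code. A minimal Python sketch (the names follow the slide's equations; the k factors fold the miss and static energies into multiples of the hit and per-cycle energies):

```python
def memory_energy(cache_hits, cache_misses, cycles,
                  energy_hit, k_miss, k_static,
                  energy_total_per_cycle):
    """Total memory-related energy per the slide's model."""
    # energy_miss = k_miss x energy_hit (covers off-chip access,
    # CPU stall, and cache block fill)
    energy_miss = k_miss * energy_hit
    energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
    # energy_static = cycles x (k_static x energy_total_per_cycle)
    energy_static = cycles * (k_static * energy_total_per_cycle)
    return energy_dynamic + energy_static
```

The input numbers are illustrative placeholders, not measurements from the paper; hit/miss counts and cycles would come from SimpleScalar, the energies from layout or data sheets.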

14
Energy Savings
  • Energy savings when way concatenation, way shutdown, and cache line concatenation are all implemented (C. Zhang, ACM TECS, to appear)

15
The Cache Parameters that Consume the Lowest Energy Vary Across Applications
16
How to Configure the Cache
  • Simulation-based methods
  • Drawback: slowness
  • Seconds of real-time work may take tens of hours to simulate
  • Simulation tool setup increases the time further
  • Self-exploring method
  • A cache parameter explorer
  • Incorporated on a prototype platform
  • Pareto parameters: a set of parameters that exposes the performance/energy tradeoff

17
Cache Self-Exploring Hardware
  • An explorer detects the Pareto set of cache parameters
  • The explorer stands aside and collects the information used to calculate energy

(Figure: processor with I- and D-caches and memory, with the explorer alongside)
18
Pareto Parameter Sets

(Figure: point A is the lowest-energy configuration and point B the best-performance one; points between them trade off energy and performance; point D is dominated and thus not a Pareto point; C labels the tradeoff region)
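A Pareto point is a configuration that no other configuration beats in both execution time and energy. A small illustrative Python sketch of that filter (the tuple layout is an assumption, not the explorer's format):

```python
def pareto_set(points):
    """Keep the configurations not dominated in both time and energy.
    Each point is a (time, energy, config) tuple."""
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q[:2] != p[:2]
                        for q in points)
        if not dominated:
            front.append(p)
    return front
```

In the slide's terms, a point like D is dropped because some point in region C is at least as good on both axes.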
19
Heuristic Algorithm
  • Searching all possible cache configurations
  • Time consuming: considering other configurable parameters (voltage levels, bus width, etc.), the search space quickly grows into the millions
  • A heuristic is proposed
  • First, search for point A
  • The order in which parameters are searched matters
  • No cache flush is needed
  • Then search for point B
  • Last, search for points in region C

(Figure: energy (mJ) vs. time; point A has the lowest energy, point B the best performance, and region C the tradeoff between them)
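The search for point A can be sketched as a greedy, one-parameter-at-a-time sweep. This is an illustrative Python sketch, not the paper's hardware FSM; `evaluate` is a hypothetical callback that runs the application under one configuration and returns its measured energy:

```python
def find_point_A(evaluate, sizes, line_sizes, assocs):
    """Greedy search for the lowest-energy configuration (point A).
    Sweeps one parameter dimension at a time, fixing each winner."""
    best = {'size': sizes[0], 'line': line_sizes[0],
            'assoc': assocs[0], 'way_pred': False}

    def sweep(key, values):
        # evaluate each candidate value with the other parameters fixed
        energies = {v: evaluate({**best, key: v}) for v in values}
        best[key] = min(energies, key=energies.get)

    sweep('size', sizes)            # cache size first: largest impact
    sweep('line', line_sizes)       # then line size
    sweep('assoc', assocs)          # then associativity
    sweep('way_pred', [False, True])
    return best
```

Each sweep fixes the winners found so far, so at most m + n + l + p configurations are evaluated instead of m x n x l x p; the slides' average of 5 searches suggests the real explorer also stops a sweep early, whereas this sketch always runs full sweeps.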
20
Impact of Cache Parameters on Miss Rate and Energy

(Figure: average instruction cache miss rate and normalized energy of the benchmarks; the labeled configuration is one way with a 32B line size)

21
Energy Dissipation of the On-Chip Cache and Off-Chip Memory

(Figure: energy breakdown for the benchmark parser)
22
Searching for Point A

(Figure: the search sweeps cache size, then line size, then associativity, then way prediction, converging to point A on the energy (mJ) vs. time plot)
  • Point A: the least-energy cache configuration
23
Searching for Point B

(Figure: with the cache size fixed, line size and associativity are searched, with no way prediction, converging to point B)
  • Point B: the best-performance cache configuration
  • High associativity doesn't mean high performance
  • A large line size may not be good for the data cache
24
Searching for Point C

(Table: cache parameters at points A and B and in region C; line size is 64B throughout; cache sizes span 2K-8K; associativities include 1W, 2W, and 4W)
  • Cache parameters in region C represent the tradeoff between energy and performance
  • Choose cache parameters between points A and B
  • If the cache sizes at points A and B are 8K and 4K respectively, then points in region C are tested at both 8K and 4K
  • Combinations of point A's and point B's parameters are tested

(Figure: points A and B with region C between them on the energy/performance tradeoff curve)
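The region-C candidates are the combinations of point A's and point B's parameter values, as the bullets describe. A hypothetical Python sketch (the dict layout is an assumption for illustration):

```python
from itertools import product

def region_C_candidates(cfg_A, cfg_B):
    """Enumerate region-C trial configurations: every combination of
    the parameter values used by points A and B, duplicates removed."""
    keys = ('size', 'line', 'assoc')
    choices = [sorted({cfg_A[k], cfg_B[k]}) for k in keys]
    return [dict(zip(keys, vals)) for vals in product(*choices)]
```

If A and B share a value (say, a 64B line size), that dimension contributes only one choice, so the candidate list stays small.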
25
FSM and Data Path of the Cache Explorer

(Figure: an FSM steers muxes that select among the inputs (hit energies, hit count, miss energies, miss count, static energies, execution time); a multiplier and adder accumulate the energy into a register; a comparator tracks the lowest energy seen and writes the winning settings into the configure register)
26
Implementing the Heuristic in Hardware
  • Total size of the explorer
  • About 4,200 gates, or 0.041 mm2 in 0.18-micron CMOS technology
  • Area overhead
  • Compared to the reported size of the MIPS 4Kp with cache, this is just over a 3% area overhead
  • Power consumption
  • 2.69 mW at 200 MHz; the power overhead compared with the MIPS 4Kp would be less than 0.5%
  • Furthermore, the exploring hardware is used only during the exploration stage and can be shut down after the best configuration is determined

27
How Well Does the Heuristic Do?
  • Time complexity
  • Searching the whole space: O(m x n x l x p)
  • Heuristic: O(m + n + l + p)
  • m = number of associativities, n = number of cache sizes
  • l = number of cache line sizes, p = way prediction on/off
  • Efficiency
  • On average, 5 searches instead of 27 total find point A
  • 2 out of 19 benchmarks miss the lowest-power cache configuration
  • Using a different search order (line size, associativity, way prediction, then cache size), 11 out of 19 benchmarks miss the best configuration
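The two complexity figures differ because the heuristic sweeps each parameter dimension once instead of enumerating the cross product. A tiny arithmetic sketch with illustrative parameter counts (not the paper's exact setup):

```python
def search_counts(m, n, l, p):
    """Configurations evaluated: exhaustive cross product vs. one
    sweep per parameter dimension (the heuristic's upper bound)."""
    return m * n * l * p, m + n + l + p

# e.g. 4 associativities, 3 cache sizes, 3 line sizes, way prediction on/off
exhaustive, heuristic = search_counts(4, 3, 3, 2)
print(exhaustive, heuristic)  # 72 12
```

The slides' own counts (5 searches instead of 27) are smaller than these upper bounds, presumably because the point-A search stops a sweep early once energy stops improving.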

28
Results of Some Other Benchmarks
29
Conclusion and Future Work
  • A configurable cache architecture is proposed
  • Associativity, size, line size
  • A cache parameter explorer is implemented to find the cache parameters
  • A heuristic algorithm is proposed to search the Pareto set of cache parameters
  • The complexity of the heuristic is O(m + n + l + p) instead of O(m x n x l x p)
  • 95% of the Pareto points can be found by the heuristic
  • Overhead
  • Little area and power overhead, and no performance overhead
  • Future work
  • Dynamically detect the cache parameters