1
Dynamic Loop Caching Meets Preloaded Loop Caching
A Hybrid Approach

Ann Gordon-Ross and Frank Vahid
Department of Computer Science and Engineering
University of California, Riverside
Also with the Center for Embedded Computer
Systems, UC Irvine
This work was supported in part by the U.S.
National Science Foundation and a U.S. Dept. of
Education GAANN Fellowship
International Conference on Computer Design, 2002

2
Introduction

Memory access can consume 50% of an embedded
microprocessor system's power
Instruction fetching usually accounts for more than half of
that power
Caches tend to be power hungry
ARM920T caches consume half of total power
(Segars 01)
MCORE unified cache consumes half of total
power (Lee/Moyer/Arends 99)

[Figure: Processor, L1 Cache, I-Mem; ARM920T data, source: Segars, ISSCC 01]
3
Filter Cache

Tiny L0 cache (64 instructions)
(Kin/Gupta/Mangione-Smith 97)
Has very low dynamic power
Short internal bitlines
Close to microprocessor
Power/energy savings, but
Performance penalty of 21% due to high miss rate
(Kin 97)
Tag comparisons consume power

[Figure: L1 Cache, Filter Cache (L0), Processor]
4
Dynamically Loaded Tagless Loop Cache

Tiny cache that passively fills with loops as
they execute (Lee/Moyer/Arends 99)
Not really a first level of memory
Rather, an alternative
Operation
Filled when a short backwards branch (sbb) is detected in
the instruction stream
Compared to filter cache...
No tags: even lower power
Missless: no performance penalty

[Figure: L1 Cache, Dynamic Loop Cache, Mux, Processor]
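
The operation described on this slide can be sketched as a small trace-driven simulation. This is a minimal illustration, not the controller from Lee/Moyer/Arends 99: the three state names, the function name simulate_dynamic_loop_cache, the word-addressed trace format, and the default capacity of 64 are all assumptions made for the example. A taken short backwards branch starts a fill, the next full iteration is copied while still being fetched from L1, and later iterations are served by the loop cache until the loop exits.

```python
from enum import Enum


class State(Enum):
    IDLE = 0    # fetching from L1, loop cache unused
    FILL = 1    # fetching from L1 while copying the loop into the loop cache
    ACTIVE = 2  # fetching from the loop cache


def simulate_dynamic_loop_cache(trace, capacity=64):
    """Count (l1_fetches, loop_cache_fetches) for a list of executed PCs.

    Hypothetical model: PCs are word addresses, and a "short backwards
    branch" is any step to a lower-or-equal address that spans at most
    `capacity` instructions.
    """
    state = State.IDLE
    loop_start = loop_end = None  # loop_end is the address of the sbb
    l1 = lc = 0
    for i, pc in enumerate(trace):
        # Serve the current fetch from whichever memory is selected.
        if state is State.ACTIVE:
            lc += 1
        else:
            l1 += 1
        if i + 1 == len(trace):
            break
        nxt = trace[i + 1]
        if state is not State.IDLE:
            if pc == loop_end and nxt == loop_start:
                state = State.ACTIVE      # sbb taken again: loop is cached
            elif pc == loop_end or nxt != pc + 1:
                state = State.IDLE        # loop exit or other change of flow
        if state is State.IDLE and nxt <= pc and pc - nxt < capacity:
            # Taken short backwards branch: start filling on the next iteration.
            state, loop_start, loop_end = State.FILL, nxt, pc
    return l1, lc


if __name__ == "__main__":
    # Four-instruction loop at addresses 10..13, executed five times.
    print(simulate_dynamic_loop_cache([10, 11, 12, 13] * 5 + [14]))  # (9, 12)
```

In this toy trace the first iteration triggers the fill, the second iteration is copied, and the remaining three iterations (12 fetches) come from the loop cache.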
5
Dynamically Loaded Tagless Loop Cache

[Figure: L1 Cache, Dynamic Loop Cache, Mux, Processor; example instruction stream: mov r1, 2; sbb -2]
8
Dynamically Loaded Tagless Loop Cache - Results

We ran 10 Powerstone benchmarks (from Motorola)
on a MIPS processor instruction-set simulator
Average L1 fetch reduction was 30%
Closely matched results of Lee et al. 99
L1 fetch reductions translate to system power
savings of 10-15%

[Figure: L1 Cache, Dynamic Loop Cache, Mux, Processor]
9
Dynamically Loaded Tagless Loop Cache - Limitation

Does not support loops with control of flow
changes (cof)
Only supports sequential instruction fetching
since it is filled passively during a loop
iteration
Does not see instructions not executed on that
iteration
A cof thus terminates loop cache filling or
fetching
cofs unfortunately include common if-then-else
statements within a loop

[Figure: L1 Cache, Dynamic Loop Cache, Mux, Processor; example instruction stream: mov r1, 2; mov r2, 3; bne r1, r2, 2; sbb -4]
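
The limitation can be made concrete with a short sketch of passive filling over one loop iteration. The function name fill_one_iteration and the six-instruction if-then-else layout in the usage lines are hypothetical and are not taken from the slides. The point is only that filling records the addresses it actually sees and stops at the first non-sequential step.

```python
def fill_one_iteration(iteration_pcs):
    """Passively fill during one loop iteration.

    `iteration_pcs` is the list of word addresses executed in that
    iteration, ending at the short backwards branch. Returns
    (filled_addresses, completed); `completed` is False when a change
    of flow (cof) terminated the fill before the end of the loop.
    """
    filled = []
    for pc, nxt in zip(iteration_pcs, iteration_pcs[1:] + [None]):
        filled.append(pc)
        if nxt is not None and nxt != pc + 1:
            return filled, False  # cof: the skipped instructions were never seen
    return filled, True


if __name__ == "__main__":
    # Hypothetical loop at addresses 0..5: 1 branches to the else-part at 4,
    # 3 jumps over the else-part to 5, and 5 is the short backwards branch.
    print(fill_one_iteration([0, 1, 2, 3, 4, 5]))  # straight-line body
    print(fill_one_iteration([0, 1, 2, 3, 5]))     # then-path taken
    print(fill_one_iteration([0, 1, 4, 5]))        # else-path taken
```

Only the straight-line iteration completes. On either if-then-else path the fill stops at the branch, so the loop never becomes usable from the loop cache.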
12
Dynamically Loaded Tagless Loop Cache - Limitation

Without cof support, the dynamic loop cache handles
only half of the small, frequently executed loops in the
benchmarks

13
Preloaded Tagless Loop Cache

Embedded systems typically execute a fixed
application
Can determine critical loops/subroutines through
profiling
Can preload critical regions into loop cache
whose contents will not change
Preloaded loop cache (Gordon-Ross/Cotterell/Vahid
CAL 02)
Tagless, missless
Supports more loops than dynamic loop cache

[Figure: L1 Cache, Preloaded Loop Cache, Mux, Processor; preloaded example loop: mov r1, 2; mov r2, 3; bne r1, r2, 2; sbb -4]
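
The profile-then-preload flow can be sketched as follows. Everything specific here is an assumption for illustration: the Region record, the greedy ranking by profiled fetches per instruction, the max_regions limit of 4, and the address-range lookup are not the selection method or hardware from Gordon-Ross/Cotterell/Vahid CAL 02. The lookup shows how a tagless cache can map an address to a slot using only the start and end addresses of each preloaded region.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int    # first instruction address of the loop or subroutine
    end: int      # last instruction address (inclusive)
    fetches: int  # profiled number of instruction fetches inside the region


def choose_regions(profile, capacity=128, max_regions=4):
    """Greedily pick regions with the most profiled fetches per instruction."""
    chosen, used = [], 0
    ranked = sorted(profile,
                    key=lambda r: r.fetches / (r.end - r.start + 1),
                    reverse=True)
    for region in ranked:
        size = region.end - region.start + 1
        if len(chosen) < max_regions and used + size <= capacity:
            chosen.append(region)
            used += size
    return chosen


def loop_cache_index(pc, regions):
    """Map an address to its loop cache slot, or None if it is not preloaded."""
    base = 0
    for region in regions:
        if region.start <= pc <= region.end:
            return base + (pc - region.start)
        base += region.end - region.start + 1
    return None


if __name__ == "__main__":
    profile = [Region(100, 131, 9000),   # 32 instructions
               Region(200, 327, 5000),   # 128 instructions
               Region(400, 415, 8000)]   # 16 instructions
    preloaded = choose_regions(profile)
    print([(r.start, r.end) for r in preloaded])  # [(400, 415), (100, 131)]
    print(loop_cache_index(402, preloaded))       # 2
    print(loop_cache_index(250, preloaded))       # None
```

Because the contents never change at run time, the range comparison stands in for tags: an address either falls inside a preloaded region and is served from it, or it is fetched from L1.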
15
Preloaded Tagless Loop Cache - Results

Results
A 128-instruction preloaded loop cache reduces L1
fetches by nearly twice as much as the dynamic loop cache (30%) for
the benchmarks studied (Powerstone and Mediabench)

16
Preloaded Tagless Loop Cache - Disadvantages

Preloaded loop cache has some limitations too
Occasionally dynamic loop cache is actually
better
Preloading also requires
Fixed application
Profiling
Limited number of loops can be preloaded
We really want both!

[Figure: Instruction fetch power savings]
17
Solution: A Hybrid Loop Cache

2 levels of cache storage
Main Loop Cache for instruction fetching
2nd Level Storage - preloaded loops are stored
here

Functions as both a dynamic and a preloaded loop
cache

18
Hybrid Loop Cache - Functionality

Dynamic Loop Cache functionality
On a short backwards branch, main loop cache is
filled dynamically
Preloaded Loop Cache functionality
On a cof, if the next instruction falls within a
preloaded region of code, that region is filled
into the main loop cache from 2nd level storage
After being filled, instructions can be fetched
from the main loop cache
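
The two behaviors above can be combined in one trace-driven sketch. This is an illustration under stated assumptions, not the authors' controller: the mode names, the function simulate_hybrid, and the word-addressed trace are hypothetical, each preloaded region is assumed to fit in the main loop cache, and the copy from 2nd level storage is idealized as finishing before the next fetch.

```python
from enum import Enum


class Mode(Enum):
    IDLE = 0       # fetching from L1
    FILL = 1       # dynamic fill in progress (still fetching from L1)
    DYNAMIC = 2    # fetching a dynamically filled loop
    PRELOADED = 3  # fetching a region copied from 2nd level storage


def simulate_hybrid(trace, preloaded, capacity=64):
    """Count (l1_fetches, loop_cache_fetches) for a list of executed PCs.

    `preloaded` is a list of (start, end) inclusive address ranges held
    in 2nd level storage.
    """
    def region_of(pc):
        for start, end in preloaded:
            if start <= pc <= end:
                return start, end
        return None

    mode = Mode.IDLE
    lo = hi = None  # bounds of the code currently in the main loop cache
    l1 = lc = 0
    for i, pc in enumerate(trace):
        if mode in (Mode.DYNAMIC, Mode.PRELOADED):
            lc += 1
        else:
            l1 += 1
        if i + 1 == len(trace):
            break
        nxt = trace[i + 1]
        cof = nxt != pc + 1
        if mode is Mode.PRELOADED:
            if not lo <= nxt <= hi:
                mode = Mode.IDLE          # left the preloaded region
        elif mode is not Mode.IDLE:
            if pc == hi and nxt == lo:
                mode = Mode.DYNAMIC       # sbb taken again: loop is cached
            elif pc == hi or cof:
                mode = Mode.IDLE          # loop exit or cof inside the loop
        if mode is Mode.IDLE and cof:
            region = region_of(nxt)
            if region is not None:
                mode, (lo, hi) = Mode.PRELOADED, region  # fill from 2nd level
            elif nxt <= pc and pc - nxt < capacity:
                mode, lo, hi = Mode.FILL, nxt, pc        # dynamic fill
    return l1, lc


if __name__ == "__main__":
    # Loop at 10..13 whose instruction 11 always branches over 12.
    branchy = [10, 11, 13] * 5 + [14]
    print(simulate_hybrid(branchy, preloaded=[]))          # (16, 0)
    print(simulate_hybrid(branchy, preloaded=[(10, 13)]))  # (3, 13)
```

With nothing preloaded, the loop containing a cof is never served from the loop cache, as with the purely dynamic design. Once the region is preloaded, the first cof that lands inside it switches fetching to the main loop cache.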

19
Hybrid Loop Cache - Results

Hybrid performance
Best in 9 out of 13 benchmarks
Performed equally well in 1
In the remaining 3, the hybrid loop cache
performed nearly as well as or better than the strictly
dynamic approach but was outperformed by the
preloaded approach

[Figure: Instruction fetch power savings]
20
Hybrid Loop Cache - Additional Consideration

Hybrid loop cache can behave like a dynamic loop
cache
If designer does not wish to profile/preload
Power savings are almost identical to the dynamic
loop cache when no loops are preloaded

21
Conclusions

Hybrid loop cache reduced embedded system
instruction fetch power by an average of 51%
90% savings in several cases
Outperformed dynamic and preloaded loop caches
Dynamic 23%, Preloaded 35%, Hybrid 51%
Can work as dynamic, preloaded, or both
More capacity than a preloaded loop cache
Can be used transparently as a dynamic loop cache
With nearly identical results
Hybrid loop cache may be a good addition to
low-power embedded microprocessor architectures