Dynamic Loop Caching Meets Preloaded Loop Caching

1
Dynamic Loop Caching Meets Preloaded Loop Caching
A Hybrid Approach
  • Ann Gordon-Ross and Frank Vahid
  • Department of Computer Science and Engineering
  • University of California, Riverside
  • Also with the Center for Embedded Computer
    Systems, UC Irvine
  • This work was supported in part by the U.S.
    National Science Foundation and a U.S. Dept. of
    Education GAANN Fellowship
  • International Conference on Computer Design, 2002

2
Introduction
I-Mem
  • Memory access can consume 50% of an embedded
    microprocessor's system power
  • Instruction fetching usually consumes more than
    half of that power
  • Caches tend to be power hungry
  • ARM920T caches consume half of total power
    (Segars 01)
  • M-CORE unified cache consumes half of total
    power (Lee/Moyer/Arends 99)

L1 Cache
Processor
ARM920T. Source: Segars, ISSCC 01
3
Filter Cache
L1 Cache
  • Tiny L0 cache (64 instructions)
  • Kin/Gupta/Mangione-Smith 97
  • Has very low dynamic power
  • Short internal bitlines
  • Close to microprocessor
  • Power/energy savings, but...
  • Performance penalty of 21% due to high miss rate
    (Kin 97)
  • Tag comparisons consume power

Filter Cache (L0)
Processor
4
Dynamically Loaded Tagless Loop Cache
L1 Cache
  • Tiny cache that passively fills with loops as
    they execute (Lee/Moyer/Arends 99)
  • Not really first level of memory
  • Rather, an alternative
  • Operation
  • Filled when short backwards branch detected in
    instruction stream
  • Compared to filter cache...
  • No tags, so even lower power
  • Missless, so no performance penalty

Dynamic Loop Cache
Mux
Processor
mov r1, 2 sbb -2
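The fill-and-fetch operation above can be sketched as a small Python model. This is a minimal illustration of the idea (the state machine, names, and 64-entry size are my assumptions, not the authors' hardware): a taken short backward branch (sbb) triggers passive filling during the next iteration, after which instructions are served tag-free from the loop cache.

```python
# Minimal sketch of dynamic tagless loop cache control (all names and
# the state machine are illustrative, not the authors' design).
# IDLE: watch for a taken short backward branch (sbb).
# FILL: passively capture the next iteration's instructions.
# FETCH: serve instructions from the loop cache, tag-free.
IDLE, FILL, FETCH = range(3)

class DynamicLoopCache:
    def __init__(self, size=64):
        self.size = size      # capacity in instructions
        self.start = None     # loop start address (the sbb's target)
        self.valid = 0        # instructions captured so far
        self.state = IDLE

    def access(self, pc, sbb_taken, target=None):
        """Return True if this fetch is served by the loop cache."""
        if self.state == FETCH:
            if self.start <= pc < self.start + self.valid:
                return True             # tagless hit inside the loop
            self.state = IDLE           # control left the loop
            return False
        if self.state == FILL:
            offset = pc - self.start
            if 0 <= offset < self.size:
                self.valid = max(self.valid, offset + 1)
                if sbb_taken and target == self.start:
                    self.state = FETCH  # whole body captured
                return False            # filling still fetches from L1
            self.state = IDLE
            return False
        # IDLE: a taken short backward branch triggers filling.
        if sbb_taken and pc - target < self.size:
            self.start, self.valid, self.state = target, 0, FILL
        return False
```

Running a three-instruction loop through this model misses on the detection and fill iterations, then hits from the third iteration on, matching the fill-then-fetch behavior described above.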
8
Dynamically Loaded Tagless Loop Cache Results
L1 Cache
  • We ran 10 Powerstone benchmarks (from Motorola)
    on a MIPS processor instruction-set simulator
  • Average L1 fetch reduction was 30%
  • Closely matched results of Lee et al. 99
  • L1 fetch reductions translate to system power
    savings of 10-15%

Dynamic Loop Cache
Mux
Processor
9
Dynamically Loaded Tagless Loop Cache - Limitation
L1 Cache
  • Does not support loops with control-of-flow
    changes (cofs)
  • Only supports sequential instruction fetching,
    since it is filled passively during a loop
    iteration
  • Does not see instructions not executed on that
    iteration
  • A cof thus terminates loop cache filling or
    fetching
  • cofs unfortunately include common if-then-else
    statements within a loop

Dynamic Loop Cache
Mux
Processor
mov r1, 2 mov r2, 3 bne r1, r2, 2 sbb -4
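The limitation can be captured in a few lines of Python (the trace format is my assumption, used only for illustration): given the fetch addresses of one loop-iteration body, in order and excluding the wrap-around of the closing sbb, the dynamic loop cache can capture the loop only if every fetch is sequential.

```python
# Sketch of the cof limitation: any non-sequential fetch inside the
# loop body is a control-of-flow change (cof) and aborts filling.
def capturable_by_dynamic_cache(iteration_pcs):
    for prev, cur in zip(iteration_pcs, iteration_pcs[1:]):
        if cur != prev + 1:   # a cof inside the body: fill aborts
            return False
    return True               # purely sequential body: capturable
```

A straight-line body like [0, 1, 2] is capturable; a body where a taken bne skips an instruction, e.g. [0, 1, 2, 4], is not.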
12
Dynamically Loaded Tagless Loop Cache - Limitation
  • Lack of cof support means only about half of the
    small, frequent loops in the benchmarks are
    supported

13
Preloaded Tagless Loop Cache
  • Embedded systems typically execute a fixed
    application
  • Can determine critical loops/subroutines through
    profiling
  • Can preload critical regions into a loop cache
    whose contents will not change
  • Preloaded loop cache (Gordon-Ross/Cotterell/Vahid
    CAL 02)
  • Tagless, missless
  • Supports more loops than the dynamic loop cache

L1 Cache
Preloaded Loop Cache
mov r1, 2 mov r2, 3 bne r1, r2, 2 sbb -4
Mux
Processor
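The profiling step above can be sketched as a simple selection pass. This is hypothetical (the region format, names, and greedy policy are my assumptions, not the authors' tool): rank candidate loops by dynamic fetch count and preload the hottest ones that fit in the cache.

```python
# Hypothetical profiling/selection sketch: choose which loops to
# preload into a fixed-capacity loop cache, hottest first.
def choose_preloaded_loops(regions, capacity):
    """regions: list of (start_addr, size_in_instructions, fetch_count)."""
    chosen, used = [], 0
    for start, size, fetches in sorted(regions, key=lambda r: -r[2]):
        if used + size <= capacity:   # region fits in remaining space
            chosen.append(start)
            used += size
    return chosen
```

For example, with a 128-instruction cache and three profiled loops of sizes 40, 80, and 60 instructions, the two hottest fit and the third is left to the L1 cache.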
15
Preloaded Tagless Loop Cache - Results
  • Results
  • A 128-instruction preloaded loop cache reduces L1
    fetches by nearly twice as much as the dynamic
    loop cache (which achieved 30%) for the benchmarks
    studied (Powerstone and Mediabench)

16
Preloaded Tagless Loop Cache - Disadvantages
  • Preloaded loop cache has some limitations too
  • Occasionally the dynamic loop cache is actually
    better
  • Preloading also requires
  • A fixed application
  • Profiling
  • Only a limited number of loops can be preloaded
  • We really want both!

Instruction fetch power savings
17
Solution: A Hybrid Loop Cache
  • 2 levels of cache storage
  • Main Loop Cache for instruction fetching
  • 2nd Level Storage - preloaded loops are stored
    here
  • Functions as both a dynamic and a preloaded loop
    cache

18
Hybrid Loop Cache - Functionality
  • Dynamic Loop Cache functionality
  • On a short backwards branch, main loop cache is
    filled dynamically
  • Preloaded Loop Cache functionality
  • On a cof, if the next instruction falls within a
    preloaded region of code, that region is filled
    into the main loop cache from 2nd level storage
  • After being filled, instructions can be fetched
    from the main loop cache
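The two fill paths above can be sketched as one dispatch function (interface names are my assumptions, used only for illustration): dynamic filling on a short backward branch, preloaded filling from second-level storage when a cof lands in a known region.

```python
# Sketch of the hybrid controller's fill decision.
def hybrid_fill_action(event, next_pc, preloaded_regions):
    """event: 'sbb' (taken short backward branch) or 'cof' (any other
    control-of-flow change). preloaded_regions: list of (start, end)."""
    if event == 'sbb':
        return 'fill_dynamic'                    # capture loop passively
    if event == 'cof':
        for start, end in preloaded_regions:
            if start <= next_pc < end:
                # copy region from 2nd-level storage to main loop cache
                return ('fill_from_preloaded', start)
    return 'fetch_from_L1'                       # no loop cache help
```

A cof into a preloaded region triggers a copy from second-level storage; a cof anywhere else falls back to the L1 cache, just as a purely dynamic loop cache would.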

19
Hybrid Loop Cache - Results
  • Hybrid performance
  • Best in 9 out of 13 benchmarks
  • Performed equally well in 1
  • In the remaining 3, the hybrid loop cache
    performed nearly as well as or better than the
    strictly dynamic approach but was outperformed by
    the preloaded approach

Instruction fetch power savings
20
Hybrid Loop Cache - Additional Consideration
  • Hybrid loop cache can behave like a dynamic loop
    cache
  • If designer does not wish to profile/preload
  • Power savings are almost identical to the dynamic
    loop cache when no loops are preloaded

21
Conclusions
  • Hybrid loop cache reduced embedded system
    instruction fetch power by an average of 51%
  • 90% savings in several cases
  • Outperformed dynamic and preloaded loop caches
  • Dynamic 23%, Preloaded 35%, Hybrid 51%
  • Can work as dynamic, preloaded, or both
  • More capacity than a preloaded loop cache
  • Can be used transparently as dynamic loop cache
  • With nearly identical results
  • Hybrid loop cache may be a good addition to low
    power embedded microprocessor architectures