Tuning of Loop Cache Architectures to Programs in Embedded System Design

1
Tuning of Loop Cache Architectures to Programs in
Embedded System Design
  • Susan Cotterell and Frank Vahid
  • Department of Computer Science and Engineering
  • University of California, Riverside
  • Also with the Center for Embedded Computer
    Systems at UC Irvine
  • This work was supported in part by the U.S.
    National Science Foundation and a U.S. Department
    of Education GAANN Fellowship

4
Introduction
  • Memory access can consume 50% of an embedded
    microprocessor system's power
  • Caches tend to be power hungry
  • The M•CORE unified cache consumes half of total
    power (Lee/Moyer/Arends 99)
  • The ARM920T caches consume half of total power
    (Segars 01)

5
Introduction
(Block diagram: processor with separate I and D caches, memory, and a bridge to USB, JPEG, CCDP, and P4 peripherals)
  • Advantageous to focus on the instruction fetching
    subsystem

6
Introduction
  • Techniques to reduce instruction fetch power
  • Program compression
    • Compress only a subset of frequently used
      instructions (Benini 1999)
    • Compress procedures in a small cache (Kirovski
      1997)
    • Lookup table based (Lekatsas 2000)
  • Bus encoding
    • Increment (Benini 1997)
    • Bus-invert (Stan 1995)
    • Binary/gray code (Mehta 1996)
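As an illustration of the bus-encoding idea, a minimal sketch of bus-invert coding (Stan 1995): if sending a new value would toggle more than half the bus lines, the complement is sent instead and an extra invert line is asserted. The function name and 32-bit width are illustrative assumptions, not from the slides.

```python
def bus_invert(prev_bus, data, width=32):
    # Bus-invert coding sketch: count lines that would toggle.
    toggles = bin(prev_bus ^ data).count("1")
    if toggles > width // 2:
        # Inverting caps the transitions at width/2; signal it
        # on a dedicated invert line (second return value).
        return (~data) & ((1 << width) - 1), 1
    return data, 0

# 0x00000000 -> 0xFFFFFFFE would toggle 31 of 32 lines,
# so the encoder sends the complement (0x00000001) instead.
bus, inv = bus_invert(0x00000000, 0xFFFFFFFE)
```

The receiver simply re-inverts the bus value whenever the invert line is high, so decoding costs one XOR per line.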

7
Introduction
  • Techniques to reduce instruction fetch power
    (cont.)
  • Efficient cache design
    • Small buffers (victim, non-temporal, speculative,
      and penalty) to reduce miss rate (Bahar 1998)
    • Memory array partitioning and variation in cache
      sizes (Ko 1995)
  • Tiny caches
    • Filter cache (Kin/Gupta/Mangione-Smith 1997)
    • Dynamically loaded tagless loop cache
      (Lee/Moyer/Arends 1999)
    • Preloaded tagless loop cache
      (Gordon-Ross/Cotterell/Vahid 2002)

8
Cache Architectures Filter Cache
  • Small L0 direct mapped cache
  • Utilizes standard tag comparison and miss logic
  • Has low dynamic power
  • Short internal bitlines
  • Close to the microprocessor
  • Performance penalty of 21% due to high miss rate
    (Kin 1997)
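A minimal hit/miss model of a direct-mapped L0 filter cache shows why it is power-efficient but miss-prone: each address maps to exactly one line, so a tiny cache thrashes on anything larger than its working set. The class name and default sizes are illustrative assumptions.

```python
class FilterCache:
    """Direct-mapped L0 cache sketch: tag comparison only, no data."""
    def __init__(self, num_lines=16, line_bytes=8):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.tags = [None] * num_lines  # one tag per line

    def access(self, addr):
        line_addr = addr // self.line_bytes
        index = line_addr % self.num_lines   # direct-mapped: one candidate
        tag = line_addr // self.num_lines
        if self.tags[index] == tag:
            return True       # hit: served from short, low-power bitlines
        self.tags[index] = tag  # miss: fill from L1, pay the penalty
        return False

# Tiny loop of eight 4-byte instructions, iterated twice:
fc = FilterCache()
trace = [0x100 + 4 * i for i in range(8)] * 2
hits = sum(fc.access(a) for a in trace)  # 12 of 16 fetches hit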

9
Cache Architectures Dynamically Loaded Loop
Cache
  • Small tagless loop cache
  • Alternative location to fetch instructions
  • Dynamically fills the loop cache
  • Triggered by short backwards branch (sbb)
    instruction
  • Flexible variation
  • Allows loops larger than the loop cache to be
    partially stored

      ...
      add r1, 2
      ...
      sbb -5
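The fill/fetch control above can be sketched as a small state machine: a taken sbb starts filling, seeing the same sbb again switches to fetching from the loop cache, and any other taken branch returns to idle. This is a behavioral simplification (fill is modeled as completing within one iteration), with illustrative names; it is not the paper's exact controller.

```python
IDLE, FILLING, ACTIVE = range(3)  # loop cache controller states

class DynamicLoopCache:
    """Tagless dynamically loaded loop cache controller sketch."""
    def __init__(self, size=32):
        self.size = size
        self.state = IDLE
        self.sbb_pc = None

    def on_sbb(self, pc, other_cof_taken, loop_len):
        # Called at each taken short backward branch (sbb).
        if other_cof_taken:
            self.state = IDLE          # any other cof kills fill/fetch
        elif loop_len <= self.size:
            if self.state == IDLE:
                self.state, self.sbb_pc = FILLING, pc  # start filling
            elif pc == self.sbb_pc:
                self.state = ACTIVE    # loop body now resident
        return self.state == ACTIVE    # fetch from loop cache?

dlc = DynamicLoopCache()
results = [dlc.on_sbb(0x40, False, 5) for _ in range(3)]
# -> [False, True, True]: miss once, fill, then hit in the loop cache
```

No tags are needed because the controller knows, by construction, that the loop body is resident while in the ACTIVE state.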
10
Cache Architectures Dynamically Loaded Loop
Cache (cont.)
  • Limitations
  • Does not support loops with control-of-flow
    changes (cofs)
  • cofs terminate loop cache filling and fetching
  • cofs include commonly found if-then-else
    statements

      ...
      add r1, 2
      bne r1, r2, 3
      ...
      sbb -5
11
Cache Architectures Preloaded Loop Cache
  • Small tagless loop cache
  • Alternative location to fetch instructions
  • Loop cache filled at compile time and remains
    fixed
  • Supports loops with cof
  • Fetch triggered by short backwards branch
  • Start address variation
  • Fetch begins on first loop iteration

      ...
      add r1, 2
      bne r1, r2, 3
      ...
      sbb -5
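Because the preloaded loop cache contents are fixed at compile time, the runtime check reduces to comparing the PC against a few loop address registers (LARs); internal cofs are harmless since nothing is ever evicted. A minimal sketch, with illustrative names and address ranges:

```python
class PreloadedLoopCache:
    """Preloaded tagless loop cache sketch: regions fixed offline."""
    def __init__(self, lars):
        # lars: (start_addr, end_addr) pairs chosen at compile time
        # via profiling and locked into the loop cache.
        self.lars = lars

    def serves(self, pc):
        # Fetch from the loop cache iff pc lies in a preloaded region,
        # regardless of how control flow moves within that region.
        return any(start <= pc <= end for start, end in self.lars)

# Two profiled loops preloaded into 2 LARs (hypothetical addresses):
plc = PreloadedLoopCache([(0x100, 0x11C), (0x200, 0x23C)])
```

With the start-address variation, the region check can fire on the very first iteration, since no warm-up fill pass is needed.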
12
Traditional Design
  • Traditional pre-fabricated IC
  • Typically optimized for the best average case
  • Intended to run well across a variety of programs
  • A benchmark suite is used to determine the best
    configuration
  • On average, what is the best tiny cache
    configuration?

13
Evaluation Framework Candidate Cache
Configurations
Type                                      Size             Loops / line size            Configuration
Original dynamically loaded loop cache    8-1024 entries   n/a                          1-8
Flexible dynamically loaded loop cache    8-1024 entries   n/a                          9-16
Preloaded loop cache (sa)                 8-1024 entries   2-3 loop address registers   17-32
Preloaded loop cache (sbb)                8-1024 entries   2-6 loop address registers   33-72
Filter cache                              8-1024 bytes     line size of 8-64 bytes      73-106
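The tuning flow the following slides describe amounts to an exhaustive sweep: simulate each benchmark against all 106 candidate configurations and keep the one with the lowest energy. A sketch of that loop, where `simulate_energy` stands in for the hours-long simulation step the slides mention, with toy numbers derived from the binary results on slide 19 (energy as 100 minus the reported % savings):

```python
def best_configuration(benchmarks, configs, simulate_energy):
    # Exhaustive design-space sweep: per benchmark, evaluate every
    # candidate tiny-cache configuration and keep the cheapest.
    best = {}
    for bench in benchmarks:
        best[bench] = min(configs, key=lambda c: simulate_energy(bench, c))
    return best

# Illustrative energies for the 'binary' benchmark (configs 30/31/105):
energies = {("binary", 30): 39, ("binary", 31): 21, ("binary", 105): 35}
pick = best_configuration(["binary"], [30, 31, 105],
                          lambda b, c: energies[(b, c)])
# -> {"binary": 31}, matching the slide's per-program winner
```

The cost of the sweep is what motivates the faster, equation-based method mentioned in the conclusion.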
14
Evaluation Framework Motorola's Powerstone
Benchmarks
Benchmark   Lines of C   Instructions Executed   Description
adpcm       501          63891                   Voice Encoding
bcnt        90           1938                    Bit Manipulation
binary      67           816                     Binary Insertion
blit        94           22845                   Graphics Application
compress    943          138573                  Data Compression Program
crc         84           37650                   Cyclic Redundancy Check
des         745          122214                  Data Encryption Standard
engine      276          410607                  Engine Controller
fir         173          16211                   FIR Filtering
g3fax       606          1128023                 Group Three Fax Decode
jpeg        540          4594721                 JPEG Compression
summin      74           1909787                 Handwriting Recognition
ucbqsort    209          219978                  U.C.B. Quick Sort
v42         553          2442551                 Modem Encoding/Decoding
15
Simplified Tool Chain
16
Best on Average
17
Core Based Design
  • Core Based Design
  • Know application
  • Opportunity to tune the architecture
  • Is it worth tuning the architecture to the
    application or is the average case good enough?

18
Best on Average
  • Both configurations perform well for some
    benchmarks such as engine and summin
  • However, both configurations perform below
    average for binary, v42, and others

19
Results - binary
  • Config 30 yields 61% savings
  • Config 105 yields 65% savings
  • Config 31 (preloaded/1024-entry/2 LAR) yields 79%
    savings

20
Results v42
  • Config 30 yields 58% savings
  • Config 105 yields 23% savings
  • Config 67 (preloaded/512-entry/6 LAR) yields 68%
    savings

21
Results - averages
22
Conclusion and Future Work
  • Shown the benefits of tuning the tiny cache to a
    particular program
  • On average, tuning yields an additional 11%
    savings
  • Up to an additional 40% for some programs
  • The environment is automated but requires several
    hours to find the best configuration
  • The current methodology is too slow
  • A faster, equation-based method is described in an
    upcoming ICCAD 2002 paper