Title: Flexicache: Software-based Instruction Caching for Embedded Processors
1. Flexicache: Software-based Instruction Caching for Embedded Processors
- Jason E. Miller and Anant Agarwal
- Raw Group, MIT CSAIL
2. Outline
- Introduction
- Baseline Implementation
- Optimizations
- Energy
- Conclusions
3. Hardware Instruction Caches
- Used in virtually all high-performance general-purpose processors
- Good performance
- Decreases average memory access time
- Easy to use
- Transparent operation
[Figure: processor chip with an on-chip I-Cache, backed by off-chip DRAM]
4. I-Cache-less Processors
- Embedded processors and DSPs
- TMS470, ADSP-21xx, etc.
- Embedded multicore processors
- IBM Cell SPE
- No special-purpose hardware
- Less design/verification time
- Less area
- Shorter cycle time
- Less energy per access
- Predictable behavior
- Much harder to program!
- Manually partition code and transfer pieces from DRAM
[Figure: processor chip with plain on-chip SRAM, backed by off-chip DRAM]
5. Software-based I-Caching
- Use a software system to virtualize instruction memory by recreating hardware cache functionality
- Automatic management of a simple SRAM memory
- Good performance with no extra programming effort
- Integrated into each individual application
- Customized to the program's needs
- Optimize for different goals
- Real-time predictability
- Maintain low-cost, high-speed hardware
6. Outline
- Introduction
- Baseline Implementation
- Optimizations
- Energy
- Conclusions
7. Flexicache System Overview
[Figure: system overview: the programmer's original binary enters the Flexicache toolchain]
8. Binary Rewriter
- Break up user program into cache blocks
- Modify control-flow that leaves the blocks
[Figure: the Binary Rewriter transforms the binary and links in the Flexicache runtime]
9. Rewriter Details
- One basic block in each cache block, but
- Fixed size of 16 instructions
- Simplifies bookkeeping
- Requires padding of small blocks and splitting of large ones
- Control-flow instructions that leave a block are modified to jump to the runtime system (sketched below)
- E.g., BEQ 2,3,foo → JEQL 2,3,runtime
- Original destination addresses stored in a table
- Fall-through jumps added at the end of blocks
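A minimal C sketch of this transformation (the decoded-instruction record, NOP encoding, and helper names are all hypothetical; the real rewriter operates on Raw binaries):

    #include <stdint.h>
    #include <stddef.h>

    #define CACHE_BLOCK_WORDS 16      /* fixed cache block size */

    /* Hypothetical record for one decoded instruction. */
    typedef struct {
        uint32_t word;         /* raw encoding */
        int      is_cti;       /* control-transfer instruction? */
        uint32_t dest_vaddr;   /* branch/jump target, if is_cti */
    } insn_t;

    /* Rewrite one basic block (assumed here to fit in a cache block).
     * Control flow that leaves the block is redirected to the runtime;
     * the original destinations go into a side table for miss handling. */
    size_t rewrite_block(const insn_t *in, size_t n, insn_t *out,
                         uint32_t runtime_entry, uint32_t *dest_table)
    {
        size_t o = 0;
        for (size_t i = 0; i < n && i < CACHE_BLOCK_WORDS; i++) {
            insn_t ins = in[i];
            if (ins.is_cti) {
                dest_table[o] = ins.dest_vaddr;   /* remember real target */
                ins.dest_vaddr = runtime_entry;   /* e.g. BEQ ...,foo now
                                                     jumps into the runtime */
            }
            out[o++] = ins;
        }
        /* Pad small blocks to the fixed size with NOPs; the real rewriter
         * also splits oversized blocks and appends a fall-through jump. */
        while (o < CACHE_BLOCK_WORDS) {
            out[o] = (insn_t){ .word = 0, .is_cti = 0, .dest_vaddr = 0 };
            o++;
        }
        return o;
    }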
10. Runtime Overview
- Stays resident in I-mem
- Receives requests from cache blocks
- Checks whether the requested block is resident
- Loads the new block from DRAM if necessary
- Evicts blocks to make room
- Transfers control to the new block (see the dispatch sketch below)
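A minimal sketch of that dispatch path in C (slot count, helper names, and the linear residency scan are illustrative; the real runtime is hand-tuned code resident in I-mem and uses a faster lookup, and empty slots would be marked invalid rather than zero):

    #include <stdint.h>

    #define BLOCK_WORDS  16
    #define NUM_SLOTS    128                    /* illustrative capacity */

    extern uint32_t imem[NUM_SLOTS * BLOCK_WORDS];  /* instruction SRAM  */
    extern void dram_fetch(uint32_t vaddr, uint32_t *dst, int words);

    static uint32_t resident_vaddr[NUM_SLOTS];  /* virtual block per slot */
    static int      fifo_head;                  /* oldest slot (FIFO)     */

    /* Return the I-mem location of the block holding vaddr, loading and
     * evicting as needed; the runtime then jumps into that block. */
    uint32_t *runtime_dispatch(uint32_t vaddr)
    {
        uint32_t block = vaddr & ~(uint32_t)(BLOCK_WORDS * 4 - 1);

        for (int i = 0; i < NUM_SLOTS; i++)     /* hit check */
            if (resident_vaddr[i] == block)
                return &imem[i * BLOCK_WORDS];

        int slot = fifo_head;                   /* miss: evict oldest */
        fifo_head = (fifo_head + 1) % NUM_SLOTS;
        resident_vaddr[slot] = block;
        dram_fetch(block, &imem[slot * BLOCK_WORDS], BLOCK_WORDS);
        return &imem[slot * BLOCK_WORDS];
    }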
11. Runtime Operation
[Figure: the runtime dispatching a request from Block 2 through its table of loaded cache blocks]
12. System Policies and Mechanisms
- Fully-associative cache block placement
- Replacement policy: FIFO
- Evict the oldest block in the cache
- Matches sequential execution
- Pinned functions
- Key feature for timing predictability
- No cache overhead within the function
13. Experimental Setup
- Implemented for a tile in the Raw multicore processor
- Similar to many embedded processors
- 32-bit single-issue in-order MIPS pipeline
- 32 kB SRAM I-mem
- Raw simulator
- Cycle-accurate
- Idealized I/O model
- SRAM I-mem or traditional hardware I-cache models
- Uses Wattch to estimate energy consumption
- Mediabench benchmark suite
- Multimedia applications for embedded processors
14. Baseline Performance
[Chart: Flexicache overhead. Overhead = number of additional cycles, relative to a 32 kB, 2-way HW cache]
15. Outline
- Introduction
- Baseline Implementation
- Optimizations
- Energy
- Conclusions
16. Basic Chaining
- Problem: The hit case in the runtime system takes about 40 cycles
- Solution: Modify the jump to the runtime system so that it jumps directly to the loaded code the next time (see the patch sketch below)
[Figure: without chaining, blocks A-D dispatch through the runtime system on every transfer; once chained, they jump to each other directly]
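A sketch of the patch itself, assuming a hypothetical helper that re-encodes a jump (in reality the runtime rewrites one instruction word in I-mem):

    #include <stdint.h>

    /* Hypothetical encoder: produce a direct jump to `target` that keeps
     * the branch condition of the original instruction. */
    extern uint32_t encode_direct_jump(uint32_t old_insn,
                                       const uint32_t *target);

    /* After a hit, patch the call site so future executions skip the
     * ~40-cycle runtime lookup and jump straight to the resident block. */
    void chain(uint32_t *call_site, const uint32_t *resident_block)
    {
        *call_site = encode_direct_jump(*call_site, resident_block);
        /* A real implementation may also need to flush any fetch buffer
         * so the patched instruction is seen on its next execution. */
    }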
17. Basic Chaining Performance
[Chart: Flexicache overhead with basic chaining]
19. Function Call Chaining
- Problem: Function calls were not being chained
- Compound instructions (like jump-and-link) handle two virtual addresses
- Load return address into the link register
- Jump to the destination address
- Solution:
- Decompose them in the rewriter (sketched below)
- The jump can then be chained normally at runtime
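A sketch of the decomposition (the instruction encoders are hypothetical; the real rewriter emits Raw instructions):

    #include <stdint.h>

    #define LINK_REG 31

    /* Hypothetical encoders for the two halves of a decomposed JAL. */
    extern uint32_t encode_load_imm(int reg, uint32_t imm);  /* reg := imm */
    extern uint32_t encode_jump(uint32_t dest_vaddr);        /* j dest     */

    /* JAL touches two virtual addresses at once (the return address and
     * the call target), so split it: materialize the virtual return
     * address into the link register, then emit a plain jump that the
     * runtime can chain like any other direct jump. */
    void decompose_jal(uint32_t jal_vaddr, uint32_t dest_vaddr,
                       uint32_t out[2])
    {
        out[0] = encode_load_imm(LINK_REG, jal_vaddr + 4);
        out[1] = encode_jump(dest_vaddr);
    }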
20. Function Call Chaining Performance
[Chart: Flexicache overhead with function call chaining]
21. Replacement Policy
- Problem: Too much bookkeeping
- Chains must be backed out if the destination block is evicted
- Idea 1: With a FIFO replacement policy, there is no need to record chains from old blocks to young ones
- Idea 2: Limit the number of chains to each block
- Solution: Flush replacement policy (sketched below)
- Evict everything and start fresh
- No need to undo or track chains
- Increased miss rate vs. FIFO
[Figure: chains among blocks A-D and the runtime system]
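A minimal sketch of the flush policy (structures are illustrative, as in the dispatch sketch above):

    #include <stdint.h>
    #include <string.h>

    #define NUM_SLOTS 128
    static uint32_t resident_vaddr[NUM_SLOTS];
    static int next_free;               /* slots are filled in order */

    /* Discard every loaded block. All chains live inside the discarded
     * blocks, and reloading brings back fresh copies whose jumps point
     * at the runtime again, so no chain needs individual tracking. */
    static void flush_cache(void)
    {
        memset(resident_vaddr, 0, sizeof resident_vaddr);
        next_free = 0;
    }

    /* On a miss, take the next slot; flush when the cache is full. */
    int alloc_slot(void)
    {
        if (next_free == NUM_SLOTS)
            flush_cache();
        return next_free++;
    }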
22. Flush Policy Performance
[Chart: Flexicache overhead with the flush policy]
23. Indirect Jump Chaining
- Problem: A different destination on each execution (e.g., JR 31)
- Solution: Pre-screen addresses and chain each one individually (sketched below)
- But:
- Screening takes time
- Which addresses should we chain?
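A sketch of the screening idea in C (in reality this is a short compare-and-branch sequence emitted in front of the JR, not a function call; names are illustrative):

    #include <stdint.h>

    extern void runtime_miss(uint32_t vaddr);   /* full runtime lookup */

    /* One compare slot per chained virtual target. */
    typedef struct {
        uint32_t vaddr;         /* virtual destination seen earlier */
        void   (*code)(void);   /* its current location in I-mem    */
    } chain_slot_t;

    void dispatch_indirect(uint32_t target_vaddr,
                           const chain_slot_t *slots, int nslots)
    {
        for (int i = 0; i < nslots; i++) {
            if (slots[i].vaddr == target_vaddr) {
                slots[i].code();        /* screened hit: go directly */
                return;
            }
        }
        runtime_miss(target_vaddr);     /* unscreened: slow path */
    }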
24. Indirect Jump Chaining Performance
[Chart: Flexicache overhead with indirect jump chaining]
25. Fixed-size Block Padding

    00008400 <L2B1>:
        8400  mfsr r9,28
        8404  rlm  r9,r9,0x4,0x0
        8408  jnel r9,0,_dispatch.entry1
        840c  jal  _dispatch.entry2
        8410  nop
        8414  nop
        8418  nop
        841c  nop

- Padding for small blocks wastes more space than expected
- Average basic block contains 5.5 instructions
- Most common size is 3
- 60-65% of storage space is wasted on NOPs
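A back-of-envelope check of that figure (assuming each block also carries the one fall-through jump added by the rewriter): an average block holds about 5.5 + 1 = 6.5 useful words out of 16, so (16 - 6.5) / 16, or roughly 59%, is padding, consistent with the measured 60-65%.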
26. 8-word Cache Blocks
- Reduce cache block size to better fit basic blocks
- Less padding → less wasted space → lower miss rate
- Bookkeeping structures get bigger → higher miss rate
- More block splits → higher miss rate and overhead
- Allow up to 4 consecutive blocks to be loaded together (sizing rule sketched below)
- Effectively creates 8-, 16-, 24- and 32-word blocks
- Avoids splitting up large basic blocks
- Performance benefits:
- Amortize the cost of a call into the runtime
- Overlap DRAM fetches
- Eliminate jumps used to split large blocks
- Also used to add extra space for runtime JR chaining
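A sketch of the resulting sizing rule (illustrative; the real rewriter makes this choice when laying out blocks):

    /* Choose how many consecutive 8-word cache blocks to load together
     * so a large basic block need not be split. Groups of 1-4 give the
     * effective 8-, 16-, 24- and 32-word block sizes. */
    int blocks_needed(int n_insns)
    {
        int words  = n_insns + 1;       /* +1 for the fall-through jump */
        int blocks = (words + 7) / 8;   /* round up to 8-word units     */
        return blocks > 4 ? 4 : blocks;
    }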
27. 8-word Blocks Performance
[Chart: Flexicache overhead with 8-word blocks]
28. Performance Summary
- Good performance on 6 of 9 benchmarks (5-11% overhead)
- G721 (24.2% overhead)
- Indirect jumps
- Mesa (24.4% overhead)
- Indirect jumps, high miss rate
- Rasta (93.6% overhead)
- High miss rate, indirect jumps
- Majority of the remaining overhead is due to modifications to user code, not runtime calls
- Fall-through jumps added by the rewriter
- Indirect jump chain comparisons
29. Outline
- Introduction
- Baseline Implementation
- Optimizations
- Energy
- Conclusions
30. Energy Analysis
- SRAM uses less energy than a cache for each access
- No tags and no unused cache ways
- Saves about 9% of total processor power
- Additional instructions for software management use extra energy
- Total energy is roughly proportional to the number of cycles
- The software I-cache will use less total energy if instruction overhead is below 9% (see the arithmetic below)
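A first-order reading of that break-even point (an illustrative derivation, treating total energy as power × cycles and the SRAM saving as a flat 9% of processor power):

    E_sw / E_hw ~ (1 - 0.09) * (1 + o),  where o = fractional cycle overhead
    E_sw < E_hw  <=>  o < 0.09 / 0.91 ~ 9.9%

i.e., roughly the 9% threshold stated above.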
31. Energy Results
- Wattch used with CACTI models for the SRAM and I-cache
- 32 kB, 2-way set-associative HW cache, 25% of total power
- Total energy to complete each benchmark was calculated
32. Conclusions
- Software-based instruction caching can be a practical solution for embedded processors
- Provides the programming convenience of a HW cache
- Performance and energy similar to a HW cache
- Overhead < 10% on several benchmarks
- Energy savings of up to 3.8%
- Maintains the advantages of an I-cache-less architecture
- Low-cost hardware
- Real-time guarantees
http://cag.csail.mit.edu/raw
33. Questions?
- http://cag.csail.mit.edu/raw