Title: Flexicache: Software-based Instruction Caching for Embedded Processors
1. Flexicache: Software-based Instruction Caching for Embedded Processors
- Jason E. Miller and Anant Agarwal
- Raw Group, MIT CSAIL
2. Outline
- Introduction
- Baseline Implementation
- Optimizations
- Energy
- Conclusions
3. Hardware Instruction Caches
- Used in virtually all high-performance general-purpose processors
- Good performance
- Decreases average memory access time
- Easy to use
- Transparent operation
[Figure: processor chip with an on-chip I-Cache, backed by off-chip DRAM]
4. I-Cache-less Processors
- Embedded processors and DSPs
- TMS470, ADSP-21xx, etc.
- Embedded multicore processors
- IBM Cell SPE
- No special-purpose hardware
- Less design/verification time
- Less area
- Shorter cycle time
- Less energy per access
- Predictable behavior
- Much harder to program!
- Manually partition code and transfer pieces from DRAM
[Figure: processor chip with plain on-chip SRAM, backed by off-chip DRAM]
5. Software-based I-Caching
- Use a software system to virtualize instruction memory by recreating hardware cache functionality
- Automatic management of a simple SRAM memory
- Good performance with no extra programming effort
- Integrated into each individual application
- Customized to the program's needs
- Optimize for different goals
- Real-time predictability
- Maintain low-cost, high-speed hardware
6. Outline
- Introduction
- Baseline Implementation
- Optimizations
- Energy
- Conclusions
7. Flexicache System Overview
[Figure: system overview: the programmer's original binary enters the Flexicache toolchain]
8. Binary Rewriter
- Break up user program into cache blocks
- Modify control-flow that leaves the blocks
[Figure: the Binary Rewriter transforms the binary and links in the Flexicache runtime]
9. Rewriter Details
- One basic block in each cache block, but
- Fixed size of 16 instructions
- Simplifies bookkeeping
- Requires padding of small blocks and splitting of large ones
- Control-flow instructions that leave a block are modified to jump to the runtime system (sketched below)
- E.g., BEQ 2,3,foo → JEQL 2,3,runtime
- Original destination addresses stored in a table
- Fall-through jumps added at the end of blocks
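A minimal C sketch of this transformation (the decoded-instruction record, NOP encoding, and helper names are all hypothetical; the real rewriter operates on Raw binaries):

    #include <stdint.h>
    #include <stddef.h>

    #define CACHE_BLOCK_WORDS 16      /* fixed cache block size */

    /* Hypothetical record for one decoded instruction. */
    typedef struct {
        uint32_t word;         /* raw encoding */
        int      is_cti;       /* control-transfer instruction? */
        uint32_t dest_vaddr;   /* branch/jump target, if is_cti */
    } insn_t;

    /* Rewrite one basic block (assumed here to fit in a cache block).
     * Control flow that leaves the block is redirected to the runtime;
     * the original destinations go into a side table for miss handling. */
    size_t rewrite_block(const insn_t *in, size_t n, insn_t *out,
                         uint32_t runtime_entry, uint32_t *dest_table)
    {
        size_t o = 0;
        for (size_t i = 0; i < n && i < CACHE_BLOCK_WORDS; i++) {
            insn_t ins = in[i];
            if (ins.is_cti) {
                dest_table[o] = ins.dest_vaddr;   /* remember real target */
                ins.dest_vaddr = runtime_entry;   /* e.g. BEQ ...,foo now
                                                     jumps into the runtime */
            }
            out[o++] = ins;
        }
        /* Pad small blocks to the fixed size with NOPs; the real rewriter
         * also splits oversized blocks and appends a fall-through jump. */
        while (o < CACHE_BLOCK_WORDS) {
            out[o] = (insn_t){ .word = 0, .is_cti = 0, .dest_vaddr = 0 };
            o++;
        }
        return o;
    }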
10. Runtime Overview
- Stays resident in I-mem
- Receives requests from cache blocks
- Checks whether the requested block is resident
- Loads the new block from DRAM if necessary
- Evicts blocks to make room
- Transfers control to the new block (see the dispatch sketch below)
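A minimal sketch of that dispatch path in C (slot count, helper names, and the linear residency scan are illustrative; the real runtime is hand-tuned code resident in I-mem and uses a faster lookup, and empty slots would be marked invalid rather than zero):

    #include <stdint.h>

    #define BLOCK_WORDS  16
    #define NUM_SLOTS    128                    /* illustrative capacity */

    extern uint32_t imem[NUM_SLOTS * BLOCK_WORDS];  /* instruction SRAM  */
    extern void dram_fetch(uint32_t vaddr, uint32_t *dst, int words);

    static uint32_t resident_vaddr[NUM_SLOTS];  /* virtual block per slot */
    static int      fifo_head;                  /* oldest slot (FIFO)     */

    /* Return the I-mem location of the block holding vaddr, loading and
     * evicting as needed; the runtime then jumps into that block. */
    uint32_t *runtime_dispatch(uint32_t vaddr)
    {
        uint32_t block = vaddr & ~(uint32_t)(BLOCK_WORDS * 4 - 1);

        for (int i = 0; i < NUM_SLOTS; i++)     /* hit check */
            if (resident_vaddr[i] == block)
                return &imem[i * BLOCK_WORDS];

        int slot = fifo_head;                   /* miss: evict oldest */
        fifo_head = (fifo_head + 1) % NUM_SLOTS;
        resident_vaddr[slot] = block;
        dram_fetch(block, &imem[slot * BLOCK_WORDS], BLOCK_WORDS);
        return &imem[slot * BLOCK_WORDS];
    }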
11. Runtime Operation
[Figure: the runtime dispatching a request from Block 2 through its table of loaded cache blocks]
12. System Policies and Mechanisms
- Fully-associative cache block placement
- Replacement policy: FIFO
- Evict the oldest block in the cache
- Matches sequential execution
- Pinned functions
- Key feature for timing predictability
- No cache overhead within the function
13. Experimental Setup
- Implemented for a tile in the Raw multicore processor
- Similar to many embedded processors
- 32-bit single-issue in-order MIPS pipeline
- 32 kB SRAM I-mem
- Raw simulator
- Cycle-accurate
- Idealized I/O model
- SRAM I-mem or traditional hardware I-cache models
- Uses Wattch to estimate energy consumption
- Mediabench benchmark suite
- Multimedia applications for embedded processors
14. Baseline Performance
[Chart: Flexicache overhead. Overhead = number of additional cycles, relative to a 32 kB, 2-way HW cache]
15. Outline
- Introduction
- Baseline Implementation
- Optimizations
- Energy
- Conclusions
16. Basic Chaining
- Problem: The hit case in the runtime system takes about 40 cycles
- Solution: Modify the jump to the runtime system so that it jumps directly to the loaded code the next time (see the patch sketch below)
[Figure: without chaining, blocks A-D dispatch through the runtime system on every transfer; once chained, they jump to each other directly]
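A sketch of the patch itself, assuming a hypothetical helper that re-encodes a jump (in reality the runtime rewrites one instruction word in I-mem):

    #include <stdint.h>

    /* Hypothetical encoder: produce a direct jump to `target` that keeps
     * the branch condition of the original instruction. */
    extern uint32_t encode_direct_jump(uint32_t old_insn,
                                       const uint32_t *target);

    /* After a hit, patch the call site so future executions skip the
     * ~40-cycle runtime lookup and jump straight to the resident block. */
    void chain(uint32_t *call_site, const uint32_t *resident_block)
    {
        *call_site = encode_direct_jump(*call_site, resident_block);
        /* A real implementation may also need to flush any fetch buffer
         * so the patched instruction is seen on its next execution. */
    }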
17. Basic Chaining Performance
[Chart: Flexicache overhead with basic chaining]
19. Function Call Chaining
- Problem: Function calls were not being chained
- Compound instructions (like jump-and-link) handle two virtual addresses
- Load return address into the link register
- Jump to the destination address
- Solution:
- Decompose them in the rewriter (sketched below)
- The jump can then be chained normally at runtime
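A sketch of the decomposition (the instruction encoders are hypothetical; the real rewriter emits Raw instructions):

    #include <stdint.h>

    #define LINK_REG 31

    /* Hypothetical encoders for the two halves of a decomposed JAL. */
    extern uint32_t encode_load_imm(int reg, uint32_t imm);  /* reg := imm */
    extern uint32_t encode_jump(uint32_t dest_vaddr);        /* j dest     */

    /* JAL touches two virtual addresses at once (the return address and
     * the call target), so split it: materialize the virtual return
     * address into the link register, then emit a plain jump that the
     * runtime can chain like any other direct jump. */
    void decompose_jal(uint32_t jal_vaddr, uint32_t dest_vaddr,
                       uint32_t out[2])
    {
        out[0] = encode_load_imm(LINK_REG, jal_vaddr + 4);
        out[1] = encode_jump(dest_vaddr);
    }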
20. Function Call Chaining Performance
[Chart: Flexicache overhead with function call chaining]
21. Replacement Policy
- Problem: Too much bookkeeping
- Chains must be backed out if the destination block is evicted
- Idea 1: With a FIFO replacement policy, there is no need to record chains from old blocks to young ones
- Idea 2: Limit the number of chains to each block
- Solution: Flush replacement policy (sketched below)
- Evict everything and start fresh
- No need to undo or track chains
- Increased miss rate vs. FIFO
[Figure: chains among blocks A-D and the runtime system]
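A minimal sketch of the flush policy (structures are illustrative, as in the dispatch sketch above):

    #include <stdint.h>
    #include <string.h>

    #define NUM_SLOTS 128
    static uint32_t resident_vaddr[NUM_SLOTS];
    static int next_free;               /* slots are filled in order */

    /* Discard every loaded block. All chains live inside the discarded
     * blocks, and reloading brings back fresh copies whose jumps point
     * at the runtime again, so no chain needs individual tracking. */
    static void flush_cache(void)
    {
        memset(resident_vaddr, 0, sizeof resident_vaddr);
        next_free = 0;
    }

    /* On a miss, take the next slot; flush when the cache is full. */
    int alloc_slot(void)
    {
        if (next_free == NUM_SLOTS)
            flush_cache();
        return next_free++;
    }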
22. Flush Policy Performance
[Chart: Flexicache overhead with the flush policy]
23. Indirect Jump Chaining
- Problem: A different destination on each execution (e.g., JR 31)
- Solution: Pre-screen addresses and chain each one individually (sketched below)
- But:
- Screening takes time
- Which addresses should we chain?
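A sketch of the screening idea in C (in reality this is a short compare-and-branch sequence emitted in front of the JR, not a function call; names are illustrative):

    #include <stdint.h>

    extern void runtime_miss(uint32_t vaddr);   /* full runtime lookup */

    /* One compare slot per chained virtual target. */
    typedef struct {
        uint32_t vaddr;         /* virtual destination seen earlier */
        void   (*code)(void);   /* its current location in I-mem    */
    } chain_slot_t;

    void dispatch_indirect(uint32_t target_vaddr,
                           const chain_slot_t *slots, int nslots)
    {
        for (int i = 0; i < nslots; i++) {
            if (slots[i].vaddr == target_vaddr) {
                slots[i].code();        /* screened hit: go directly */
                return;
            }
        }
        runtime_miss(target_vaddr);     /* unscreened: slow path */
    }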
24. Indirect Jump Chaining Performance
[Chart: Flexicache overhead with indirect jump chaining]
25. Fixed-size Block Padding

    00008400 <L2B1>:
        8400  mfsr r9,28
        8404  rlm  r9,r9,0x4,0x0
        8408  jnel r9,0,_dispatch.entry1
        840c  jal  _dispatch.entry2
        8410  nop
        8414  nop
        8418  nop
        841c  nop

- Padding for small blocks wastes more space than expected
- Average basic block contains 5.5 instructions
- Most common size is 3
- 60-65% of storage space is wasted on NOPs
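A back-of-envelope check of that figure (assuming each block also carries the one fall-through jump added by the rewriter): an average block holds about 5.5 + 1 = 6.5 useful words out of 16, so (16 - 6.5) / 16, or roughly 59%, is padding, consistent with the measured 60-65%.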
26. 8-word Cache Blocks
- Reduce cache block size to better fit basic blocks
- Less padding → less wasted space → lower miss rate
- Bookkeeping structures get bigger → higher miss rate
- More block splits → higher miss rate and overhead
- Allow up to 4 consecutive blocks to be loaded together (sizing rule sketched below)
- Effectively creates 8-, 16-, 24- and 32-word blocks
- Avoids splitting up large basic blocks
- Performance benefits:
- Amortize the cost of a call into the runtime
- Overlap DRAM fetches
- Eliminate jumps used to split large blocks
- Also used to add extra space for runtime JR chaining
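A sketch of the resulting sizing rule (illustrative; the real rewriter makes this choice when laying out blocks):

    /* Choose how many consecutive 8-word cache blocks to load together
     * so a large basic block need not be split. Groups of 1-4 give the
     * effective 8-, 16-, 24- and 32-word block sizes. */
    int blocks_needed(int n_insns)
    {
        int words  = n_insns + 1;       /* +1 for the fall-through jump */
        int blocks = (words + 7) / 8;   /* round up to 8-word units     */
        return blocks > 4 ? 4 : blocks;
    }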
27. 8-word Blocks Performance
[Chart: Flexicache overhead with 8-word blocks]
28. Performance Summary
- Good performance on 6 of 9 benchmarks (5-11% overhead)
- G721 (24.2% overhead)
- Indirect jumps
- Mesa (24.4% overhead)
- Indirect jumps, high miss rate
- Rasta (93.6% overhead)
- High miss rate, indirect jumps
- Majority of the remaining overhead is due to modifications to user code, not runtime calls
- Fall-through jumps added by the rewriter
- Indirect jump chain comparisons
29. Outline
- Introduction
- Baseline Implementation
- Optimizations
- Energy
- Conclusions
30. Energy Analysis
- SRAM uses less energy than a cache for each access
- No tags and no unused cache ways
- Saves about 9% of total processor power
- Additional instructions for software management use extra energy
- Total energy is roughly proportional to the number of cycles
- The software I-cache will use less total energy if instruction overhead is below 9% (see the arithmetic below)
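A first-order reading of that break-even point (an illustrative derivation, treating total energy as power × cycles and the SRAM saving as a flat 9% of processor power):

    E_sw / E_hw ~ (1 - 0.09) * (1 + o),  where o = fractional cycle overhead
    E_sw < E_hw  <=>  o < 0.09 / 0.91 ~ 9.9%

i.e., roughly the 9% threshold stated above.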
31. Energy Results
- Wattch used with CACTI models for the SRAM and I-cache
- 32 kB, 2-way set-associative HW cache, 25% of total power
- Total energy to complete each benchmark was calculated
32. Conclusions
- Software-based instruction caching can be a practical solution for embedded processors
- Provides the programming convenience of a HW cache
- Performance and energy similar to a HW cache
- Overhead < 10% on several benchmarks
- Energy savings of up to 3.8%
- Maintains the advantages of an I-cache-less architecture
- Low-cost hardware
- Real-time guarantees
http://cag.csail.mit.edu/raw
33. Questions?
- http://cag.csail.mit.edu/raw