1
Flexicache: Software-based Instruction Caching for Embedded Processors
  • Jason E. Miller and Anant Agarwal
  • Raw Group - MIT CSAIL

2
Outline
  • Introduction
  • Baseline Implementation
  • Optimizations
  • Energy
  • Conclusions

3
Hardware Instruction Caches
  • Used in virtually all high-performance
    general-purpose processors

  • Good performance
  • Decreases average memory access time
  • Easy to use
  • Transparent operation

[Diagram: processor chip with an on-chip I-cache between the processor and off-chip DRAM]
4
I-Cache-less Processors
  • Embedded processors and DSPs
  • TMS470, ADSP-21xx, etc.
  • Embedded multicore processors
  • IBM Cell SPE

  • No special-purpose hardware
  • Less design/verification time
  • Less area
  • Shorter cycle time
  • Less energy per access
  • Predictable behavior

[Diagram: processor chip with plain on-chip SRAM instruction memory backed by off-chip DRAM]
  • Much harder to program!
  • Must manually partition code and transfer pieces from DRAM

5
Software-based I-Caching
  • Use a software system to virtualize instruction
    memory by recreating hardware cache functionality
  • Automatic management of simple SRAM memory
  • Good performance with no extra programming effort
  • Integrated into each individual application
  • Customized to the program's needs
  • Optimize for different goals
  • Real-time predictability
  • Maintain low-cost, high-speed hardware

6
Outline
  • Introduction
  • Baseline Implementation
  • Optimizations
  • Energy
  • Conclusions

7
Flexicache System Overview
[Diagram: the programmer's original binary enters the Flexicache toolflow]
8
Binary Rewriter
  • Break up user program into cache blocks
  • Modify control-flow that leaves the blocks

[Diagram: the binary rewriter transforms the original binary and attaches the Flexicache runtime]
9
Rewriter Details
  • One basic block in each cache block, but
  • Fixed size of 16 instructions
  • Simplifies bookkeeping
  • Requires padding of small blocks and splitting of
    large ones
  • Control-flow instructions that leave a block are modified to jump to the runtime system
  • E.g., BEQ r2,r3,foo → JEQL r2,r3,runtime
  • Original destination addresses are stored in a table
  • Fall-through jumps are added at the ends of blocks (see the sketch below)
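To make the block-formation step concrete, here is a minimal C sketch of it, assuming a basic block is simply an array of 32-bit instruction words. The names (cache_block, emit_block, NOP, make_fallthrough_jump) are illustrative, not Flexicache's actual identifiers.

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_WORDS 16          /* fixed cache-block size (slide 9) */
    #define NOP         0x00000000u /* placeholder NOP encoding (assumption) */

    typedef struct {
        uint32_t insns[BLOCK_WORDS];
    } cache_block;

    /* Hypothetical encoder for the fall-through jump appended to a block. */
    extern uint32_t make_fallthrough_jump(uint32_t next_vaddr);

    /* Copy at most BLOCK_WORDS-1 instructions, append a fall-through jump,
     * and pad the remainder with NOPs.  Large basic blocks are split by
     * calling this repeatedly on the leftover instructions. */
    size_t emit_block(cache_block *out, const uint32_t *bb, size_t n,
                      uint32_t next_vaddr)
    {
        size_t take = (n < BLOCK_WORDS - 1) ? n : BLOCK_WORDS - 1;
        memcpy(out->insns, bb, take * sizeof(uint32_t));
        out->insns[take] = make_fallthrough_jump(next_vaddr);
        for (size_t i = take + 1; i < BLOCK_WORDS; i++)
            out->insns[i] = NOP;    /* padding (the waste measured on slide 25) */
        return take;                /* instructions consumed from bb */
    }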

10
Runtime Overview
  • Stays resident in I-mem
  • Receives requests from cache blocks
  • Checks whether the requested block is resident
  • Loads the new block from DRAM if necessary
  • Evicts blocks to make room
  • Transfers control to the new block (see the sketch below)
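A minimal C sketch of this dispatch flow, assuming a simple linear table of resident blocks. All identifiers here (block_table, dispatch, evict_oldest, dram_fetch, jump_to) are illustrative, not the real runtime's API.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint32_t vaddr;   /* block's virtual (DRAM) address; 0 = empty slot */
        void    *sram;    /* where the block's copy lives in I-mem */
    } table_entry;

    extern table_entry  block_table[];
    extern size_t       num_slots;

    extern table_entry *evict_oldest(void);          /* frees and returns a slot */
    extern void dram_fetch(void *dst, uint32_t vaddr);
    extern void jump_to(void *sram_addr);            /* never returns */

    void dispatch(uint32_t target_vaddr)
    {
        /* Hit check: is the requested block already resident? */
        for (size_t i = 0; i < num_slots; i++)
            if (block_table[i].vaddr == target_vaddr)
                jump_to(block_table[i].sram);

        /* Miss: make room, load the block from DRAM, record it, run it. */
        table_entry *e = evict_oldest();
        dram_fetch(e->sram, target_vaddr);
        e->vaddr = target_vaddr;
        jump_to(e->sram);
    }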

11
Runtime Operation
[Diagram: the runtime dispatching between loaded cache blocks]
12
System Policies and Mechanisms
  • Fully-associative cache block placement
  • Replacement policy: FIFO
  • Evict the oldest block in the cache
  • Matches sequential execution
  • Pinned functions
  • Key feature for timing predictability
  • No cache overhead within a pinned function (FIFO eviction with pinning is sketched below)
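A minimal C sketch of FIFO eviction that skips pinned blocks. The table layout and the per-entry pin flag are assumptions carried over from the dispatch sketch above; a real implementation manages fixed-size slots in I-mem.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint32_t vaddr;
        void    *sram;
        int      pinned;  /* pinned blocks are never evicted (slide 12) */
    } table_entry;

    extern table_entry block_table[];
    extern size_t      num_slots;
    static size_t      fifo_head;   /* index of the oldest slot; wraps around */

    /* Evict the oldest unpinned block.  FIFO matches sequential execution:
     * the oldest block is usually the one executed longest ago.  Assumes at
     * least one slot is unpinned. */
    table_entry *evict_oldest(void)
    {
        while (block_table[fifo_head].pinned)
            fifo_head = (fifo_head + 1) % num_slots;
        table_entry *victim = &block_table[fifo_head];
        fifo_head = (fifo_head + 1) % num_slots;
        victim->vaddr = 0;          /* mark the slot free */
        return victim;
    }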

13
Experimental Setup
  • Implemented for a tile in the Raw multicore
    processor
  • Similar to many embedded processors
  • 32-bit single-issue in-order MIPS pipeline
  • 32 kB SRAM I-mem
  • Raw simulator
  • Cycle-accurate
  • Idealized I/O model
  • SRAM I-mem or traditional hardware I-cache models
  • Uses Wattch to estimate energy consumption
  • Mediabench benchmark suite
  • Multimedia applications for embedded processors

14
Baseline Performance
[Chart: Flexicache baseline overhead per benchmark]
Overhead = number of additional cycles, relative to a 32 kB, 2-way HW cache
15
Outline
  • Introduction
  • Baseline Implementation
  • Optimizations
  • Energy
  • Conclusions

16
Basic Chaining
  • Problem: the hit case in the runtime system takes about 40 cycles
  • Solution: modify the jump to the runtime system so that it jumps directly to the loaded code the next time (see the sketch below)
[Diagram: blocks A-D jumping into the runtime system, shown without chaining]
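A minimal sketch of the patch itself, assuming the runtime knows the address of the jump instruction that called it. make_direct_jump is an illustrative encoder, not a real API.

    #include <stdint.h>

    extern uint32_t make_direct_jump(void *sram_target);

    /* call_site points at the block's "jump to runtime" instruction in I-mem.
     * After patching, later executions skip the ~40-cycle hit path entirely. */
    void chain(uint32_t *call_site, void *sram_target)
    {
        *call_site = make_direct_jump(sram_target);
        /* On a real machine the modified I-mem word may need to be synced
         * before it is fetched again. */
    }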

17
Basic Chaining Performance
[Chart: Flexicache overhead with basic chaining]
19
Function Call Chaining
  • Problem: function calls were not being chained
  • Compound instructions (like jump-and-link) handle two virtual addresses
  • Load return address into the link register
  • Jump to the destination address
  • Solution:
  • Decompose them in the rewriter (see the sketch below)
  • The jump can then be chained normally at runtime
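A minimal C sketch of the decomposition, assuming illustrative instruction encoders (make_load_link_reg, make_plain_jump); the real rewriter emits Raw instructions directly.

    #include <stdint.h>

    extern uint32_t make_load_link_reg(uint32_t return_vaddr); /* link reg <- ret */
    extern uint32_t make_plain_jump(uint32_t dest_vaddr);      /* chainable jump */

    /* Replace one jump-and-link with two instructions: an explicit link-register
     * load, then an ordinary jump the runtime can chain like any other. */
    void decompose_jal(uint32_t out[2], uint32_t dest_vaddr, uint32_t return_vaddr)
    {
        out[0] = make_load_link_reg(return_vaddr);
        out[1] = make_plain_jump(dest_vaddr);
    }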

20
Function Call Chaining Performance
[Chart: Flexicache overhead with function call chaining]
21
Replacement Policy
  • Problem: too much bookkeeping
  • Chains must be backed out if the destination block is evicted
  • Idea 1: with a FIFO replacement policy, there is no need to record chains from old blocks to young ones
  • Idea 2: limit the number of chains to each block
  • Solution: flush replacement policy
  • Evict everything and start fresh (see the sketch below)
  • No need to undo or track chains
  • Increased miss rate vs. FIFO
[Diagram: chains from blocks A-D into the runtime system, all discarded on a flush]
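A minimal C sketch of the flush policy, reusing the table layout assumed above. No chain cleanup is needed because evicted blocks are later reloaded from DRAM in their original rewritten form, so any patched-in chains vanish on reload.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint32_t vaddr;
        void    *sram;
        int      pinned;
    } table_entry;

    extern table_entry block_table[];
    extern size_t      num_slots;

    /* On a miss with no free slot: evict everything (except pinned blocks)
     * and start fresh. */
    void flush_all(void)
    {
        for (size_t i = 0; i < num_slots; i++)
            if (!block_table[i].pinned)
                block_table[i].vaddr = 0;   /* forget every resident block */
    }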
22
Flush Policy Performance
[Chart: Flexicache overhead with the flush policy]
23
Indirect Jump Chaining
  • Problem: a different destination on each execution (e.g., JR r31)
  • Solution: pre-screen addresses and chain each one individually (see the sketch below)
  • But:
  • Screening takes time
  • Which addresses should we chain?
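A minimal C sketch of what a chained indirect jump behaves like: a few inline compares against previously seen targets, falling back to the runtime for anything new. In the real system this is a patched-in instruction sequence, not a C function; MAX_CHAINS = 2 is an arbitrary illustrative limit, and the names are assumptions.

    #include <stdint.h>

    extern void jump_to(void *sram_addr);        /* never returns */
    extern void dispatch(uint32_t target_vaddr); /* runtime hit/miss path */

    #define MAX_CHAINS 2   /* screening more targets costs more cycles */

    struct ij_chain { uint32_t vaddr; void *sram; };

    void indirect_jump(const struct ij_chain chains[], uint32_t target)
    {
        for (int i = 0; i < MAX_CHAINS; i++)
            if (chains[i].vaddr == target)
                jump_to(chains[i].sram);   /* screened hit: chained target */
        dispatch(target);                  /* new target: back to the runtime */
    }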

24
Indirect Jump Chaining Performance
[Chart: Flexicache overhead with indirect jump chaining]
25
Fixed-size Block Padding
00008400 <L2B1>:
    8400: mfsr r9,28
    8404: rlm  r9,r9,0x4,0x0
    8408: jnel r9,0,_dispatch.entry1
    840c: jal  _dispatch.entry2
    8410: nop
    8414: nop
    8418: nop
    841c: nop
  • Padding for small blocks wastes more space than expected
  • Average basic block contains 5.5 instructions
  • Most common size is 3
  • 60-65% of storage space is wasted on NOPs

26
8-word Cache Blocks
  • Reduce cache block size to better fit basic blocks
  • Less padding → less wasted space → lower miss rate
  • Bookkeeping structures get bigger → higher miss rate
  • More block splits → higher miss rate, more overhead
  • Allow up to 4 consecutive blocks to be loaded together (see the sketch below)
  • Effectively creates 8-, 16-, 24- and 32-word blocks
  • Avoids splitting up large basic blocks
  • Performance benefits:
  • Amortize the cost of a call into the runtime
  • Overlap DRAM fetches
  • Eliminate jumps used to split large blocks
  • Also used to add extra space for runtime JR chaining
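A minimal C sketch of loading a run of consecutive 8-word blocks in one transfer. continues_into_next stands in for a rewriter-set flag marking blocks that were split from one large basic block and laid out consecutively in DRAM; all names are assumptions.

    #include <stdint.h>
    #include <stddef.h>

    #define BLOCK_WORDS 8
    #define MAX_RUN     4   /* 8-, 16-, 24- or 32-word effective blocks */

    extern int  continues_into_next(uint32_t block_vaddr); /* rewriter-set flag */
    extern void dram_fetch_words(void *dst, uint32_t vaddr, size_t words);

    /* Fetch the block plus up to MAX_RUN-1 successors in one transfer,
     * amortizing the runtime call and overlapping the DRAM accesses. */
    size_t load_run(void *slot, uint32_t vaddr)
    {
        size_t n = 1;
        while (n < MAX_RUN &&
               continues_into_next(vaddr + (uint32_t)((n - 1) * BLOCK_WORDS * 4)))
            n++;
        dram_fetch_words(slot, vaddr, n * BLOCK_WORDS);
        return n;   /* number of consecutive blocks now resident */
    }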

27
8-word Blocks Performance
[Chart: Flexicache overhead with 8-word blocks]
28
Performance Summary
  • Good performance on 6 of 9 benchmarks (5-11% overhead)
  • G721 (24.2% overhead)
  • Indirect jumps
  • Mesa (24.4% overhead)
  • Indirect jumps, high miss rate
  • Rasta (93.6% overhead)
  • High miss rate, indirect jumps
  • The majority of the remaining overhead is due to modifications to user code, not runtime calls
  • Fall-through jumps added by the rewriter
  • Indirect jump chain comparisons

29
Outline
  • Introduction
  • Baseline Implementation
  • Optimizations
  • Energy
  • Conclusions

30
Energy Analysis
  • SRAM uses less energy than a cache for each access
  • No tags and no unused cache ways
  • Saves about 9% of total processor power
  • Additional instructions for software management use extra energy
  • Total energy is roughly proportional to the number of cycles
  • The software I-cache will use less total energy if the instruction overhead is below 9% (see the worked estimate below)
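A back-of-the-envelope version of this break-even point (the ~9% savings and the cycles-to-energy proportionality are from the slides; the algebra is an illustrative sketch): let o be the fractional cycle overhead of the software cache and s ≈ 0.09 the fraction of processor power saved by using SRAM instead of a HW cache. Then

    E_sw ≈ E_hw × (1 + o) × (1 − s)
    E_sw < E_hw  ⇔  o < s / (1 − s) ≈ 0.099   (for s = 0.09)

so to first order the software I-cache wins on total energy whenever its instruction overhead stays below roughly 9%.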

31
Energy Results
  • Wattch used with CACTI models for SRAM and
    I-cache
  • 32 kB, 2-way set-associative HW cache, 25% of total power
  • Total energy to complete each benchmark calculated

32
Conclusions
  • Software-based instruction caching can be a
    practical solution for embedded processors
  • Provides programming convenience of a HW cache
  • Performance and energy similar to a HW cache
  • Overhead < 10% on several benchmarks
  • Energy savings of up to 3.8%
  • Maintains the advantages of an I-cache-less architecture
  • Low-cost hardware
  • Real-time guarantees

http://cag.csail.mit.edu/raw
33
Questions?
  • http://cag.csail.mit.edu/raw