Title: Region Based Caching
1. Region Based Caching
- Advanced Computer Architecture Lab
- University of Michigan
2. Research Focus: The Memory Subsystem
[Diagram: the host processor (core, SRAM L1 cache, SRAM L2 cache on the back-side bus) connected over the front-side bus to system memory (DRAM) and to a graphics processing unit, e.g. A.G.P., with its local frame buffer]
3. Research Direction
- Memory subsystem optimization
- Low-energy cache architecture
  - Region-based Cachelets [CASES-00, ToC-02]
  - For ubiquitous/wearable information appliances
- High memory-level parallelism
  - Stack Value File [HPCA-7, ToC-02]
  - For data/web/simulation server farms
4. Memory Space Partitioning
- Based on programming language
- Non-overlapped subdivisions
[Diagram: MIPS architecture memory map from min mem to max mem: reserved, code region, static data region, dynamic data regions separated by a protected gap, static data region, reserved]
5. Memory Space Partitioning
- Based on programming language
- Non-overlapped subdivisions
- Split code and data ⇒ I-cache / D-cache
[Diagram: same MIPS architecture memory map as the previous slide]
6. Memory Space Partitioning
- Based on programming language
- Non-overlapped subdivisions
- Split code and data ⇒ I-cache / D-cache
- Split data into regions
  - Stack (grows downward)
  - Heap (grows upward)
  - Global (static)
  - Read-only (static)
[Diagram: MIPS architecture memory map showing read-only data, the code region, the static global data region, the heap growing upward, a protected gap, and the stack growing downward]
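The region split above can be sketched as an address classifier. The segment boundaries below are hypothetical, MIPS-like values chosen for illustration; the real boundaries are ABI- and loader-specific.

```python
# Sketch: classify a virtual address into a region. The base addresses are
# assumed, MIPS-like values, not taken from any particular ABI document.
TEXT_BASE  = 0x0040_0000   # code region starts here (assumed)
DATA_BASE  = 0x1000_0000   # static/global data (assumed)
HEAP_BASE  = 0x1001_0000   # heap grows upward from here (assumed)
STACK_TOP  = 0x7FFF_FFFF   # stack grows downward from here (assumed)

def classify(addr, heap_top, stack_ptr):
    """Return the memory region an address belongs to."""
    if TEXT_BASE <= addr < DATA_BASE:
        return "code"
    if DATA_BASE <= addr < HEAP_BASE:
        return "global"
    if HEAP_BASE <= addr < heap_top:       # heap grows upward
        return "heap"
    if stack_ptr <= addr <= STACK_TOP:     # stack grows downward
        return "stack"
    return "other"

print(classify(0x1000_0008, heap_top=0x1010_0000, stack_ptr=0x7FFF_F000))  # global
print(classify(0x7FFF_F800, heap_top=0x1010_0000, stack_ptr=0x7FFF_F000))  # stack
```

Because the regions are non-overlapping, a few high-order address bits suffice to steer a reference in hardware; the function above is the software analogue of that decode.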
7. Motivation
- Most current caches exploit average reference behavior, leading to sub-optimal cache designs
- Exploiting memory-region characteristics can lead to more effective caches
- Attack each region individually: find an optimal cache design for each region
8. Outline
- Cache Partitioning Overview
- Reference Characterization by Regions
- Region-based Cachelets
- Stack Value File Design
- Future Research Directions
9. Prior Art in Cache Partitioning
- Academic research
  - Cacheable/Non-Allocatable (CNA) [Tyson et al. 95]
  - Dual Data Cache [González 95]
  - Split Temporal/Spatial Data Cache [Milutinovic et al. 96]
  - Non-Temporal Streaming (NTS) Cache [Rivers & Davidson 96]
  - Memory Address Table (MAT) [Johnson & Hwu 97]
  - Reconfigurable caches [Ranganathan, Adve & Jouppi 00]
10. Prior Art in Cache Partitioning
- Commercial processors
  - Victim cache: Compaq Alpha, Intel Pentium Pro
  - Assist cache: HP PA-7200
  - Temporal/non-temporal prefetch instruction support: Intel Pentium III, Pentium 4, Itanium
  - Lock-way cache support: IBM PPC440, Cyrix MediaGX
11. How about Region-based Partitioning?
12. Stack References of Memory Instructions
13. Stack + Global
14. Stack + Global + Heap
15. Stack + Global + Heap + Read-only Data
16. Cache Line Reference Frequency by Region
[Plot: number of references vs. cache line ID]
17. Exploiting Low Energy Opportunities: Region-based Cachelets
- Low-power stack and static cachelets
18. Power Density Trend
- Surpassed hot-plate power density at 0.5 µm; "not too long to reach nuclear reactor" (Intel Fellow Fred Pollack)
19. Low Energy Design
- Matters for both mobile gadgets and server farms
  - Longer battery life
  - Power consumption cost (25% of total cost [Singh & Tiwari 99])
  - Caches consume 50% of processor power [Montanaro 96]
  - Cache die area is growing
- Power reduction techniques at the architectural level
  - ISA diet, e.g. the ARM Thumb architecture
  - Code compression [Wolfe et al. 92, 94; Lefurgy et al. 97; Larin & Conte 99]
  - Code rescheduling [Su, Tsui & Despain 94; Toburen, Conte & Reilly 98]
  - Cache partitioning or slicing
20. Region-based Partitioning
- The run-time virtual memory address space is partitioned by the programming language
- Can we reduce power while retaining performance?
- The reference patterns and characteristics of these data differ
21. Miss Rate by Memory Region
- Stack data levels off quickly; so does global data
- The heap miss rate drops linearly each time the cache size is doubled
22. Region-based Cachelets (RBC)
- A simple idea: horizontal partitioning into clock-gated caches
- Only enable (cycle) the region cachelet being accessed
- Redirect >70% of accesses to smaller region cachelets
[Diagram: an address demultiplexer uses region-select bits from the address to assert chip-select (cs) for the stack cachelet (RC1), the static cachelet (RC2), or the L1 cache, all attached to the processor data bus]
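A behavioral sketch of the demultiplexer above: exactly one cachelet is enabled per access, so only that structure burns dynamic power while the others stay clock-gated. The cachelet set and the counter-based power model are illustrative assumptions, not the paper's exact design.

```python
# Sketch of the RBC address demultiplexer. Each access asserts chip-select
# for exactly one cachelet; we model "cycling" a cachelet as a counter bump.
class Cachelet:
    def __init__(self, name):
        self.name = name
        self.cycles = 0        # how often this cachelet was clocked

    def access(self, addr):
        self.cycles += 1       # only the enabled cachelet dissipates dynamic power

stack_rc  = Cachelet("stack")   # small stack cachelet (RC1)
static_rc = Cachelet("static")  # small static cachelet (RC2)
l1        = Cachelet("L1")      # regular L1 for everything else

def demux(addr, region):
    """Route one access: region-select bits pick a single cachelet."""
    target = {"stack": stack_rc, "global": static_rc}.get(region, l1)
    target.access(addr)
    return target.name

demux(0x7FFF_F800, "stack")
demux(0x1000_0040, "global")
demux(0x1002_0000, "heap")      # heap falls through to the regular L1
```

The energy win comes from the >70% of accesses that hit the small cachelets: a small SRAM cycle costs far less than cycling the full L1 array.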
23. Speedup of RBC over Baselines
- Apples-to-apples comparison: S4k-G4k-32kL1 vs. a 40K 5-way baseline
- On average, on-par performance
24. Power Reduction of RBC
- Dynamic cache power reduced by as much as 63%
25. Summary
- Data cache partitioning based on programming-language semantics
- Greater than 70% of data accesses re-routed to smaller region-based cachelets
- The region-based cachelet mechanism:
  - Makes it easier to expand cache size
  - Can reduce dynamic and static power dissipation
  - Retains or increases the performance level
  - 65% reduction in energy-delay product
  - Can be applied on top of existing art, e.g. the filter cache
  - Can be an alternative to multi-ported caches
26. Exploiting High Performance: Stack Value File
- A new stack cache, morphing sp-relative references
27. Memory Access Distribution
- SPEC2000 integer benchmarks (Alpha binaries)
- 42% of instructions access memory
28. Access Method Breakdown
- 86% of stack references use (sp + disp) addressing
29. Morphing sp-relative References
- Morph sp-relative references into register accesses
- Use a Stack Value File (SVF)
- Resolve the address early, in the decode stage, for stack-pointer-indexed accesses
- Resolve stack memory dependences early
- Aliased references are re-routed to the SVF
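The morphing step can be sketched as follows: because sp and the displacement are both known at decode, the effective address resolves without an AGU or cache access, and a simple hash of the address picks an SVF entry. The SVF size and the modulo hash are illustrative assumptions.

```python
# Sketch of morphing at decode: an sp-relative access resolves sp + disp
# early and hashes it to a Stack Value File entry, turning the memory
# reference into a register-file access. SVF size and hash are assumed.
SVF_ENTRIES = 64   # assumed SVF capacity, in 8-byte (quadword) entries

def morph(sp, disp):
    """Return the SVF entry for an sp-relative reference."""
    addr = sp + disp                 # fully resolved at decode time
    word = addr // 8                 # quadword granularity (Alpha stq/ldq)
    return word % SVF_ENTRIES        # simple hash: low-order word-address bits

# stq r10, 24(sp): the store becomes a write to an SVF entry
print(morph(sp=0x7FFF_F000, disp=24))   # prints 3
```

With the SVF entry known at decode, the store never enters the data-cache pipeline, and later loads from 24(sp) can be renamed onto the same entry, which is how stack memory dependences resolve early.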
30. Cache Footprint Distribution by Region
[Plot: % of hits to each set vs. cache set #]
31. Stack Reference Characteristics
- Contiguity
  - Good temporal and spatial locality
  - Can be stored in a simple, fast structure
  - Small die area relative to a regular cache
  - Less power dissipation
- No address tag needed for each datum
  - Only the current TOS address is kept
32. Stack Reference Characteristics
- The first touch is almost always a store
  - Avoids wasting bandwidth bringing in dead data
  - Becomes a register write to the SVF
- A deallocated stack frame holds dead data
  - No need to write it back to memory
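Both traffic savings above can be sketched in a small SVF model: a first-touch store allocates an entry without any fill from memory, and popping a frame marks its words dead so they are never written back. The interface and word granularity are illustrative assumptions.

```python
# Sketch of the SVF traffic savings: no fill on first-touch stores, and
# dead words from popped frames never generate writeback traffic.
class StackValueFile:
    def __init__(self):
        self.data = {}           # addr -> value
        self.dirty = set()       # addrs that would need writeback

    def store(self, addr, value):
        self.data[addr] = value  # no incoming fill: old contents are dead anyway
        self.dirty.add(addr)

    def deallocate_frame(self, lo, hi):
        """Frame [lo, hi) was popped: its words are dead."""
        for addr in [a for a in self.data if lo <= a < hi]:
            del self.data[addr]
            self.dirty.discard(addr)   # dead data is never written back

    def writeback_traffic(self):
        return len(self.dirty)   # only live dirty words would go to memory

svf = StackValueFile()
svf.store(0x7FFF_F018, 42)                       # first touch: a store, no fill
svf.deallocate_frame(0x7FFF_F000, 0x7FFF_F100)   # function returns, frame popped
print(svf.writeback_traffic())                   # prints 0
```

This is why the SVF eliminates essentially all incoming stack traffic and sends out only words that are dirty and still live on eviction.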
33. Speedup Potential of the Stack Value File
- Assume all references can be morphed
- 30% speedup for a 16-wide machine with a dual-ported L1
34. Why is the SVF Faster?
- It reduces the load-to-use latency of stack references
- It effectively increases the number of memory ports by re-routing more than half of all memory references to the SVF
- It reduces contention in the MOB
- It gives more flexibility in renaming stack references
- It reduces memory traffic
35. Summary
- Stack references have several unique characteristics
  - Contiguity, (sp + disp) addressing, first-reference stores, frame deallocation
- The Stack Value File
  - A microarchitecture extension that exploits these characteristics
  - Improves performance by 24% to 65%
36. Future Research Directions
37. Extensions to Region Caching
- Compiler interaction
  - Cache-conscious data placement
  - Similar in concept to decoder issues in the PPro
  - More flexibility with different cache organizations
- Avoiding reference mismatches
  - Sparse local arrays
- Split the virtual address space into regions by function and characteristic
38. Extensions to Region Caching
- Heap allocation techniques
  - Caches mostly hold heap data (85% occupancy)
  - Heap references are increasing as object-oriented languages spread
  - Allocation sites are moving into libraries
  - Does expected locality behavior correlate with the allocation site?
  - If so, one of two malloc routines could be called to place an object in a cacheable region
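The "two mallocs" idea above can be sketched as two bump-pointer arenas selected by a per-site locality hint. The arena addresses, alignment, and the boolean hint are all hypothetical choices for illustration; a real system would key the decision off profiled allocation-site behavior.

```python
# Sketch: route allocations to a cacheable or an uncached arena based on a
# locality hint. Arena bases and the hint interface are assumptions.
_next = {"cacheable": 0x2000_0000,   # assumed cacheable region base
         "uncached":  0x3000_0000}   # assumed uncached region base

def malloc_region(size, hot=True):
    """Bump-pointer allocate; 'hot' stands in for an allocation-site hint."""
    arena = "cacheable" if hot else "uncached"
    addr = _next[arena]
    _next[arena] += (size + 15) & ~15   # 16-byte-aligned bump allocation
    return addr

a = malloc_region(32, hot=True)    # e.g. a frequently traversed list node
b = malloc_region(32, hot=False)   # e.g. write-once log data
print(hex(a), hex(b))              # 0x20000000 0x30000000
```

Because the two arenas live in distinct address regions, the same region-select hardware that steers stack and global references can steer heap objects to the right cachelet with no per-access metadata.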
39. Extensions to Region Caching
- Synthesized processor memories
  - Many embedded designs do not use caches
    - Slow clocks, statically scheduled VLIW
  - But memory design is still important
    - Code size issues
    - Power issues
40. That's All, Folks!
- http://www.eecs.umich.edu/linear
41. Baseline Microarchitecture
[Diagram: Fetch → Decode → Dispatch → Issue → Execute → Commit pipeline, with instruction cache, decoder queue, decoder, register renamer (RAT), reservation station / LSQ, functional units, load/store unit, MOB, architectural register file, and reorder buffer]
42. Microarchitecture Extension
[Diagram: the baseline pipeline extended with a Morphing Pre-Decode unit after decode; it combines the offset with the SP and Max SP values and a hash to index the Stack Value File, with an interlock to the load/store path]
43. Microarchitecture Extension
- Example: stq r10, 24(sp) enters the pipeline; TOS marks the top of stack in the SVF
[Diagram: as in slide 42]
44. Microarchitecture Extension
- stq r10, 24(sp): the morphing pre-decode hashes the resolved offset to SVF entry 3
[Diagram: as in slide 42]
45. Microarchitecture Extension
- stq r10, 24(sp) is renamed: p35 → ROB-18
[Diagram: as in slide 42]
46. Microarchitecture Extension
- stq r10, 24(sp): p35 → ROB-18
[Diagram: as in slide 42]
47. Microarchitecture Extension
- stq r10, 24(sp): p35 → SVF3, the value lands in entry 3 of the Stack Value File
[Diagram: as in slide 42]
48. Backup Foils
49. Stack + Global + Heap + Read-only Data
50. Simulation Framework
- Wattch simulator [Brooks et al. 00]
  - Considers switching power only
  - Simple clock gating
- Baseline microarchitecture parameters
  - Close to the Intel StrongARM SA-110
  - Single-issue, 5-stage, in-order pipeline
  - Unified 32B-line, 32KB L1
  - Region-based caching applied only to the L1
  - 4-way 512KB L2
51. Stack Depth Variation
52. Offset Locality of the Stack
- Cumulative offset within a function call
- Average 3B to 380B
- >80% of offsets fall within ±400B
- >99% of offsets fall within ±8KB
53. SVF Reference Type Breakdown
- 86% of stack references can be morphed
- Re-routed references enter the regular memory pipeline
54. Memory Traffic
- The SVF dramatically reduces memory traffic, by orders of magnitude
  - For gcc, 28M accesses (stack cache → L2) reduced to 86K (SVF → L1)
- Incoming traffic is eliminated because the SVF does not allocate a cache line on a miss
- Outgoing traffic consists of only the words that are dirty on eviction, instead of entire cache lines
55. High Bandwidth Design: Eager Writeback
- "Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed; you cannot bribe God." (David Clark, MIT)
56. Conclusions
- Data demonstrate distinct characteristics by reference region and cache reference behavior
- Region-based caching
  - Maps the cache structure to the specific needs of memory references
  - Exploits data characteristics to optimize hardware/software design
- These characteristics can also be used to improve memory allocation