Title: Region Based Caching
1. Region Based Caching
- Advanced Computer Architecture Lab
- University of Michigan
2. Research Focus: The Memory Subsystem
[Diagram: the host processor (core, SRAM L1 cache, SRAM L2 cache on the back-side bus) connected over the front-side bus to system memory (DRAM) and to a graphics processing unit, e.g. A.G.P., with its local frame buffer]
3. Research Direction
- Memory subsystem optimization
- Low-energy cache architecture
  - Region-based Cachelets [CASES-00, ToC-02]
  - For ubiquitous/wearable information appliances
- High memory-level parallelism
  - Stack Value File [HPCA-7, ToC-02]
  - For data/web/simulation server farms
4. Memory Space Partitioning
- Based on programming language
- Non-overlapped subdivisions
[Diagram: MIPS architecture memory map from min mem to max mem: reserved, code region, static data region, dynamic data regions separated by a protected gap, static data region, reserved]
5. Memory Space Partitioning
- Based on programming language
- Non-overlapped subdivisions
- Split code and data ⇒ I-cache / D-cache
[Diagram: same MIPS architecture memory map as the previous slide]
6. Memory Space Partitioning
- Based on programming language
- Non-overlapped subdivisions
- Split code and data ⇒ I-cache / D-cache
- Split data into regions
  - Stack (grows downward)
  - Heap (grows upward)
  - Global (static)
  - Read-only (static)
[Diagram: MIPS architecture memory map showing read-only data, the code region, the static global data region, the heap growing upward, a protected gap, and the stack growing downward]
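The region split above can be sketched as an address classifier. The segment boundaries below are hypothetical, MIPS-like values chosen for illustration; the real boundaries are ABI- and loader-specific.

```python
# Sketch: classify a virtual address into a region. The base addresses are
# assumed, MIPS-like values, not taken from any particular ABI document.
TEXT_BASE  = 0x0040_0000   # code region starts here (assumed)
DATA_BASE  = 0x1000_0000   # static/global data (assumed)
HEAP_BASE  = 0x1001_0000   # heap grows upward from here (assumed)
STACK_TOP  = 0x7FFF_FFFF   # stack grows downward from here (assumed)

def classify(addr, heap_top, stack_ptr):
    """Return the memory region an address belongs to."""
    if TEXT_BASE <= addr < DATA_BASE:
        return "code"
    if DATA_BASE <= addr < HEAP_BASE:
        return "global"
    if HEAP_BASE <= addr < heap_top:       # heap grows upward
        return "heap"
    if stack_ptr <= addr <= STACK_TOP:     # stack grows downward
        return "stack"
    return "other"

print(classify(0x1000_0008, heap_top=0x1010_0000, stack_ptr=0x7FFF_F000))  # global
print(classify(0x7FFF_F800, heap_top=0x1010_0000, stack_ptr=0x7FFF_F000))  # stack
```

Because the regions are non-overlapping, a few high-order address bits suffice to steer a reference in hardware; the function above is the software analogue of that decode.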
7. Motivation
- Most current caches exploit average reference behavior, leading to sub-optimal cache designs
- Exploiting memory-region characteristics can lead to more effective caches
- Attack each region individually: find an optimal cache design for each region
8. Outline
- Cache Partitioning Overview
- Reference Characterization by Regions
- Region-based Cachelets
- Stack Value File Design
- Future Research Directions
9. Prior Art in Cache Partitioning
- Academic research
  - Cacheable/Non-Allocatable (CNA) [Tyson et al. 95]
  - Dual Data Cache [González 95]
  - Split Temporal/Spatial Data Cache [Milutinovic et al. 96]
  - Non-Temporal Streaming (NTS) Cache [Rivers & Davidson 96]
  - Memory Address Table (MAT) [Johnson & Hwu 97]
  - Reconfigurable caches [Ranganathan, Adve & Jouppi 00]
10. Prior Art in Cache Partitioning
- Commercial processors
  - Victim cache: Compaq Alpha, Intel Pentium Pro
  - Assist cache: HP PA-7200
  - Temporal/non-temporal prefetch instruction support: Intel Pentium III, Pentium 4, Itanium
  - Lock-way cache support: IBM PPC440, Cyrix MediaGX
11. How about Region-based Partitioning?
12. Stack References of Memory Instructions
13. Stack + Global
14. Stack + Global + Heap
15. Stack + Global + Heap + Read-only Data
16. Cache Line Reference Frequency by Region
[Plot: number of references vs. cache line ID]
17. Exploiting Low Energy Opportunities: Region-based Cachelets
- Low-power stack and static cachelets
18. Power Density Trend
- Surpassed hot-plate power density at 0.5 µm; "not too long to reach nuclear reactor" (Intel Fellow Fred Pollack)
19. Low Energy Design
- Matters for both mobile gadgets and server farms
  - Longer battery life
  - Power consumption cost (25% of total cost [Singh & Tiwari 99])
  - Caches consume 50% of processor power [Montanaro 96]
  - Cache die area is growing
- Power reduction techniques at the architectural level
  - ISA diet, e.g. the ARM Thumb architecture
  - Code compression [Wolfe et al. 92, 94; Lefurgy et al. 97; Larin & Conte 99]
  - Code rescheduling [Su, Tsui & Despain 94; Toburen, Conte & Reilly 98]
  - Cache partitioning or slicing
20. Region-based Partitioning
- The run-time virtual memory address space is partitioned by the programming language
- Can we reduce power while retaining performance?
- The reference patterns and characteristics of these data differ
21. Miss Rate by Memory Region
- Stack data levels off quickly; so does global data
- The heap miss rate drops linearly each time the cache size is doubled
22. Region-based Cachelets (RBC)
- A simple idea: horizontal partitioning into clock-gated caches
- Only enable (cycle) the region cachelet being accessed
- Redirect >70% of accesses to smaller region cachelets
[Diagram: an address demultiplexer uses region-select bits from the address to assert chip-select (cs) for the stack cachelet (RC1), the static cachelet (RC2), or the L1 cache, all attached to the processor data bus]
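A behavioral sketch of the demultiplexer above: exactly one cachelet is enabled per access, so only that structure burns dynamic power while the others stay clock-gated. The cachelet set and the counter-based power model are illustrative assumptions, not the paper's exact design.

```python
# Sketch of the RBC address demultiplexer. Each access asserts chip-select
# for exactly one cachelet; we model "cycling" a cachelet as a counter bump.
class Cachelet:
    def __init__(self, name):
        self.name = name
        self.cycles = 0        # how often this cachelet was clocked

    def access(self, addr):
        self.cycles += 1       # only the enabled cachelet dissipates dynamic power

stack_rc  = Cachelet("stack")   # small stack cachelet (RC1)
static_rc = Cachelet("static")  # small static cachelet (RC2)
l1        = Cachelet("L1")      # regular L1 for everything else

def demux(addr, region):
    """Route one access: region-select bits pick a single cachelet."""
    target = {"stack": stack_rc, "global": static_rc}.get(region, l1)
    target.access(addr)
    return target.name

demux(0x7FFF_F800, "stack")
demux(0x1000_0040, "global")
demux(0x1002_0000, "heap")      # heap falls through to the regular L1
```

The energy win comes from the >70% of accesses that hit the small cachelets: a small SRAM cycle costs far less than cycling the full L1 array.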
23. Speedup of RBC over Baselines
- Apples-to-apples comparison: S4k-G4k-32kL1 vs. a 40K 5-way baseline
- On average, on-par performance
24. Power Reduction of RBC
- Dynamic cache power reduced by as much as 63%
25. Summary
- Data cache partitioning based on programming-language semantics
- Greater than 70% of data accesses re-routed to smaller region-based cachelets
- The region-based cachelet mechanism:
  - Makes it easier to expand cache size
  - Can reduce dynamic and static power dissipation
  - Retains or increases the performance level
  - 65% reduction in energy-delay product
  - Can be applied on top of existing art, e.g. the filter cache
  - Can be an alternative to multi-ported caches
26. Exploiting High Performance: Stack Value File
- A new stack cache, morphing sp-relative references
27. Memory Access Distribution
- SPEC2000 integer benchmarks (Alpha binaries)
- 42% of instructions access memory
28. Access Method Breakdown
- 86% of stack references use (sp + disp) addressing
29. Morphing sp-relative References
- Morph sp-relative references into register accesses
- Use a Stack Value File (SVF)
- Resolve the address early, in the decode stage, for stack-pointer-indexed accesses
- Resolve stack memory dependences early
- Aliased references are re-routed to the SVF
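The morphing step can be sketched as follows: because sp and the displacement are both known at decode, the effective address resolves without an AGU or cache access, and a simple hash of the address picks an SVF entry. The SVF size and the modulo hash are illustrative assumptions.

```python
# Sketch of morphing at decode: an sp-relative access resolves sp + disp
# early and hashes it to a Stack Value File entry, turning the memory
# reference into a register-file access. SVF size and hash are assumed.
SVF_ENTRIES = 64   # assumed SVF capacity, in 8-byte (quadword) entries

def morph(sp, disp):
    """Return the SVF entry for an sp-relative reference."""
    addr = sp + disp                 # fully resolved at decode time
    word = addr // 8                 # quadword granularity (Alpha stq/ldq)
    return word % SVF_ENTRIES        # simple hash: low-order word-address bits

# stq r10, 24(sp): the store becomes a write to an SVF entry
print(morph(sp=0x7FFF_F000, disp=24))   # prints 3
```

With the SVF entry known at decode, the store never enters the data-cache pipeline, and later loads from 24(sp) can be renamed onto the same entry, which is how stack memory dependences resolve early.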
30. Cache Footprint Distribution by Region
[Plot: % of hits to each set vs. cache set #]
31. Stack Reference Characteristics
- Contiguity
  - Good temporal and spatial locality
  - Can be stored in a simple, fast structure
  - Small die area relative to a regular cache
  - Less power dissipation
- No address tag needed for each datum
  - Only the current TOS address is kept
32. Stack Reference Characteristics
- The first touch is almost always a store
  - Avoids wasting bandwidth bringing in dead data
  - Becomes a register write to the SVF
- A deallocated stack frame holds dead data
  - No need to write it back to memory
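Both traffic savings above can be sketched in a small SVF model: a first-touch store allocates an entry without any fill from memory, and popping a frame marks its words dead so they are never written back. The interface and word granularity are illustrative assumptions.

```python
# Sketch of the SVF traffic savings: no fill on first-touch stores, and
# dead words from popped frames never generate writeback traffic.
class StackValueFile:
    def __init__(self):
        self.data = {}           # addr -> value
        self.dirty = set()       # addrs that would need writeback

    def store(self, addr, value):
        self.data[addr] = value  # no incoming fill: old contents are dead anyway
        self.dirty.add(addr)

    def deallocate_frame(self, lo, hi):
        """Frame [lo, hi) was popped: its words are dead."""
        for addr in [a for a in self.data if lo <= a < hi]:
            del self.data[addr]
            self.dirty.discard(addr)   # dead data is never written back

    def writeback_traffic(self):
        return len(self.dirty)   # only live dirty words would go to memory

svf = StackValueFile()
svf.store(0x7FFF_F018, 42)                       # first touch: a store, no fill
svf.deallocate_frame(0x7FFF_F000, 0x7FFF_F100)   # function returns, frame popped
print(svf.writeback_traffic())                   # prints 0
```

This is why the SVF eliminates essentially all incoming stack traffic and sends out only words that are dirty and still live on eviction.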
33. Speedup Potential of the Stack Value File
- Assume all references can be morphed
- 30% speedup for a 16-wide machine with a dual-ported L1
34. Why is the SVF Faster?
- It reduces the load-to-use latency of stack references
- It effectively increases the number of memory ports by re-routing more than half of all memory references to the SVF
- It reduces contention in the MOB
- It gives more flexibility in renaming stack references
- It reduces memory traffic
35. Summary
- Stack references have several unique characteristics
  - Contiguity, (sp + disp) addressing, first-reference stores, frame deallocation
- The Stack Value File
  - A microarchitecture extension that exploits these characteristics
  - Improves performance by 24% to 65%
36. Future Research Directions
37. Extensions to Region Caching
- Compiler interaction
  - Cache-conscious data placement
  - Similar in concept to decoder issues in the PPro
  - More flexibility with different cache organizations
- Avoiding reference mismatches
  - Sparse local arrays
- Split the virtual address space into regions by function and characteristic
38. Extensions to Region Caching
- Heap allocation techniques
  - Caches mostly hold heap data (85% occupancy)
  - Heap references are increasing as object-oriented languages spread
  - Allocation sites are moving into libraries
  - Does expected locality behavior correlate with the allocation site?
  - If so, one of two malloc routines could be called to place an object in a cacheable region
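The "two mallocs" idea above can be sketched as two bump-pointer arenas selected by a per-site locality hint. The arena addresses, alignment, and the boolean hint are all hypothetical choices for illustration; a real system would key the decision off profiled allocation-site behavior.

```python
# Sketch: route allocations to a cacheable or an uncached arena based on a
# locality hint. Arena bases and the hint interface are assumptions.
_next = {"cacheable": 0x2000_0000,   # assumed cacheable region base
         "uncached":  0x3000_0000}   # assumed uncached region base

def malloc_region(size, hot=True):
    """Bump-pointer allocate; 'hot' stands in for an allocation-site hint."""
    arena = "cacheable" if hot else "uncached"
    addr = _next[arena]
    _next[arena] += (size + 15) & ~15   # 16-byte-aligned bump allocation
    return addr

a = malloc_region(32, hot=True)    # e.g. a frequently traversed list node
b = malloc_region(32, hot=False)   # e.g. write-once log data
print(hex(a), hex(b))              # 0x20000000 0x30000000
```

Because the two arenas live in distinct address regions, the same region-select hardware that steers stack and global references can steer heap objects to the right cachelet with no per-access metadata.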
39. Extensions to Region Caching
- Synthesized processor memories
  - Many embedded designs do not use caches
    - Slow clocks, statically scheduled VLIW
  - But memory design is still important
    - Code size issues
    - Power issues
40. That's All, Folks!
- http://www.eecs.umich.edu/linear
41. Baseline Microarchitecture
[Diagram: Fetch → Decode → Dispatch → Issue → Execute → Commit pipeline, with instruction cache, decoder queue, decoder, register renamer (RAT), reservation station / LSQ, functional units, load/store unit, MOB, architectural register file, and reorder buffer]
42. Microarchitecture Extension
[Diagram: the baseline pipeline extended with a Morphing Pre-Decode unit after decode; it combines the offset with the SP and Max SP values and a hash to index the Stack Value File, with an interlock to the load/store path]
43. Microarchitecture Extension
- Example: stq r10, 24(sp) enters the pipeline; TOS marks the top of stack in the SVF
[Diagram: as in slide 42]
44. Microarchitecture Extension
- stq r10, 24(sp): the morphing pre-decode hashes the resolved offset to SVF entry 3
[Diagram: as in slide 42]
45. Microarchitecture Extension
- stq r10, 24(sp) is renamed: p35 → ROB-18
[Diagram: as in slide 42]
46. Microarchitecture Extension
- stq r10, 24(sp): p35 → ROB-18
[Diagram: as in slide 42]
47. Microarchitecture Extension
- stq r10, 24(sp): p35 → SVF3, the value lands in entry 3 of the Stack Value File
[Diagram: as in slide 42]
48. Backup Foils
49. Stack + Global + Heap + Read-only Data
50. Simulation Framework
- Wattch simulator [Brooks et al. 00]
  - Considers switching power only
  - Simple clock gating
- Baseline microarchitecture parameters
  - Close to the Intel StrongARM SA-110
  - Single-issue, 5-stage, in-order pipeline
  - Unified 32B-line, 32KB L1
  - Region-based caching applied only to the L1
  - 4-way 512KB L2
51. Stack Depth Variation
52. Offset Locality of the Stack
- Cumulative offset within a function call
- Average 3B to 380B
- >80% of offsets fall within ±400B
- >99% of offsets fall within ±8KB
53. SVF Reference Type Breakdown
- 86% of stack references can be morphed
- Re-routed references enter the regular memory pipeline
54. Memory Traffic
- The SVF dramatically reduces memory traffic, by orders of magnitude
  - For gcc, 28M accesses (stack cache → L2) reduced to 86K (SVF → L1)
- Incoming traffic is eliminated because the SVF does not allocate a cache line on a miss
- Outgoing traffic consists of only the words that are dirty on eviction, instead of entire cache lines
55. High Bandwidth Design: Eager Writeback
- "Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed; you cannot bribe God." (David Clark, MIT)
56. Conclusions
- Data demonstrate distinct characteristics by reference region and cache reference behavior
- Region-based caching
  - Maps the cache structure to the specific needs of memory references
  - Exploits data characteristics to optimize hardware/software design
- These characteristics can also be used to improve memory allocation