Region Based Caching - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Region Based Caching


1
Region Based Caching
  • Advanced Computer Architecture Lab
  • University of Michigan

2
Research Focus: Memory Subsystem
[Diagram: the host processor. The core processor has an L1 cache (SRAM) and connects over the back-side bus to the L2 cache (SRAM), and over the front-side bus to system memory (DRAM) and to the graphics processing unit (e.g. A.G.P.) with its local frame buffer.]
3
Research Direction
  • Memory subsystem optimization
  • Low energy cache architecture
  • Region-based Cachelets [CASES-00, ToC-02]
  • For ubiquitous/wearable information appliances
  • High memory-level parallelism
  • Stack Value File [HPCA-7, ToC-02]
  • For data/web/simulation server farms

4
Memory Space Partitioning
  • Based on programming language
  • Non-overlapped subdivisions

[Diagram: the MIPS virtual address space, top-down from max mem to min mem: reserved; dynamic data region; protected region; dynamic data region; static data region; code region; static data region; reserved]
5
Memory Space Partitioning
  • Based on programming language
  • Non-overlapped subdivisions
  • Split code and data ⇒ I-cache / D-cache

[Diagram: the same MIPS virtual address space, with the code region distinguished from the data regions]
6
Memory Space Partitioning
  • Based on programming language
  • Non-overlapped subdivisions
  • Split code and data ⇒ I-cache / D-cache
  • Split data into regions
  • Stack (dynamic)
  • Heap (dynamic)
  • Global (static)
  • Read-only (static)

[Diagram: the MIPS virtual address space, top-down from max mem to min mem: reserved; stack, growing downward; protected region; heap, growing upward; static global data region; code region; read-only data; reserved]
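The partitioning above can be sketched as a simple address classifier. The boundary constants below are illustrative placeholders, not the actual MIPS memory-map values:

```python
# Illustrative sketch of region-based address classification.
# All boundary constants are hypothetical placeholders, not real
# MIPS memory-map values.
CODE_BASE   = 0x0040_0000   # code region starts here (assumed)
RODATA_BASE = 0x1000_0000   # read-only static data (assumed)
GLOBAL_BASE = 0x1001_0000   # static global data (assumed)
HEAP_BASE   = 0x1002_0000   # heap grows upward from here (assumed)
STACK_TOP   = 0x7FFF_FFFF   # stack grows downward from here (assumed)

def classify(addr: int, heap_end: int, stack_ptr: int) -> str:
    """Map a virtual address to its language-level region."""
    if CODE_BASE <= addr < RODATA_BASE:
        return "code"
    if RODATA_BASE <= addr < GLOBAL_BASE:
        return "read-only"
    if GLOBAL_BASE <= addr < HEAP_BASE:
        return "global"
    if HEAP_BASE <= addr < heap_end:      # heap_end moves as the heap grows up
        return "heap"
    if stack_ptr <= addr <= STACK_TOP:    # stack_ptr moves as the stack grows down
        return "stack"
    return "reserved"
```

Because the regions are non-overlapping, a few high address bits (plus the current SP and heap break) are enough to decide the region.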
7
Motivation
  • Most current caches exploit average reference
    behavior, leading to sub-optimal cache designs
  • Exploiting memory region characteristics can lead
    to more effective caches
  • Attacking each region individually means finding
    an optimal cache design for each region

8
Outline
  • Cache Partitioning Overview
  • Reference Characterization by Regions
  • Region-based Cachelets
  • Stack Value File Design
  • Future Research Directions

9
Prior Art in Cache Partitioning
  • Academic Research
  • Cacheable Non-Allocatable (CNA) [Tyson et al. 95]
  • Dual Data Cache [González 95]
  • Split Temporal/Spatial Data Cache [Milutinovic et
    al. 96]
  • Non-Temporal Streaming (NTS) Cache [Rivers &
    Davidson 96]
  • Memory Address Table (MAT) [Johnson & Hwu 97]
  • Reconfigurable caches [Ranganathan, Adve & Jouppi
    00]

10
Prior Art in Cache Partitioning
  • Commercial Processors
  • Victim cache: Compaq Alpha, Intel Pentium Pro
  • Assist cache: HP PA-7200
  • Temporal/non-temporal prefetch instruction
    support: Intel Pentium III, Pentium 4, Itanium
  • Lock-way cache support: IBM PPC440, Cyrix
    MediaGX

11
How about Region-based Partitioning?
12
Stack Reference of Memory Instructions
13
Stack Global
14
Stack Global Heap
15
Stack Global Heap Read-only Data
16
Cache Line Reference Frequency by Regions
[Chart: number of references per cache line ID, by region]
17
Exploiting Low Energy Opportunities
Region-based Cachelets
Low power stack and static cachelets
18
Power Density Trend
  • "Surpassed hot-plate power density in 0.5 µm; not
    too long to reach nuclear reactor." (Intel Fellow
    Fred Pollack)

19
Low Energy Design
  • For both mobile gadgets and server farms.
  • Longer battery life.
  • Power consumption cost (25% of total cost [Singh &
    Tiwari 99]).
  • Caches consume 50% of power [Montanaro 96].
  • Cache die area is growing.
  • Power reduction techniques at the architectural level
  • ISA diet, e.g. ARM Thumb architecture.
  • Code compression [Wolfe et al. 92, 94; Lefurgy
    et al. 97; Larin & Conte 99]
  • Code rescheduling [Su, Tsui & Despain 94;
    Toburen, Conte & Reilly 98]
  • Cache partitioning or slicing.

20
Region-based Partitioning
  • The run-time virtual memory address space is
    partitioned by the programming language.
  • Can we reduce power while retaining performance?
  • Reference patterns and characteristics of these
    data regions differ.

21
Miss Rate by Memory Region
  • Stack data miss rates level off quickly; so do
    global data
  • Heap miss rates drop linearly each time the cache
    size is doubled

22
Region-based Cachelets (RBC)
  • A simple idea
  • A horizontal partitioning
  • Clock-gated caches
  • Only enable (cycle) the region cachelet being
    accessed
  • Redirect >70% of accesses to smaller region
    cachelets

[Diagram: an address demultiplexer decodes the region-select bits of each address and asserts the chip select (cs) of exactly one structure, the stack cachelet (RC1), the static cachelet (RC2), or the L1 cache, all attached to the processor data bus]
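The demultiplexing idea can be sketched as follows; the region tests, cachelet sizes, and the access counter (a stand-in for "this structure was clocked") are illustrative assumptions:

```python
# Sketch of region-based cachelet selection: only the cachelet for the
# accessed region is enabled; the others stay clock-gated and idle.
# Region tests and sizes are illustrative assumptions.
class Cachelet:
    def __init__(self, name: str, size_kb: int):
        self.name, self.size_kb = name, size_kb
        self.accesses = 0          # counts cycles this cachelet was enabled

    def access(self, addr: int):
        self.accesses += 1         # only the enabled cachelet burns dynamic power

def select_cachelet(addr, stack_ptr, global_base, global_end,
                    stack_rc, static_rc, l1):
    """Demultiplex an address to exactly one enabled cachelet."""
    if addr >= stack_ptr:                      # stack region (assumed test)
        return stack_rc
    if global_base <= addr < global_end:       # static/global region
        return static_rc
    return l1                                  # everything else (heap, etc.)
```

Since >70% of accesses hit the small stack and static cachelets, the large L1 stays gated most of the time, which is where the dynamic power saving comes from.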
23
Speedup of RBC over Baselines
  • Apples-to-apples comparison: S4k-G4k-32kL1 vs.
    40K 5-way
  • On average, on-par performance

24
Power Reduction of RBC
  • Dynamic cache power reduced by as much as 63%

25
Summary
  • Data cache partitioning based on programming
    language semantics
  • Greater than 70% of data accesses re-routed to
    smaller region-based cachelets
  • Region-based cachelets mechanism
  • Easier to expand cache size.
  • Can reduce dynamic and static power dissipation.
  • Retains or increases the performance level.
  • 65% reduction in energy-delay product.
  • Can be applied on top of existing art, e.g.
    filter cache.
  • Can be an alternative to multi-ported caches.

26
Exploiting High Performance
Stack Value File
A new stack cache, morphing sp-relative references
27
Memory Access Distribution
  • SPEC2000int benchmarks (Alpha binaries)
  • 42% of instructions access memory

28
Access Method Breakdown
86% of stack references use (sp + disp) addressing
29
Morphing sp-relative References
  • Morph sp-relative references into register
    accesses
  • Use a Stack Value File (SVF)
  • Resolve the address early, in the decode stage,
    for stack-pointer indexed accesses
  • Resolve stack memory dependencies early
  • Aliased references are re-routed to the SVF
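The morphing step can be sketched as a small model: because SP is known at decode, an (sp + disp) address resolves immediately to an SVF entry, like a register-file index. The word-granularity indexing and SVF size are assumptions:

```python
# Sketch of morphing an sp-relative reference into an SVF access.
# The SVF is modeled as a small circular array of 64-bit words indexed
# by (address // 8) mod size; the size and hash are assumptions.
class StackValueFile:
    def __init__(self, entries: int = 512):
        self.entries = entries
        self.data = [0] * entries

    def index(self, sp: int, disp: int) -> int:
        # The address resolves at decode, since sp is architecturally known
        addr = sp + disp
        return (addr // 8) % self.entries   # word-granularity index (assumed)

    def store(self, sp: int, disp: int, value: int):
        self.data[self.index(sp, disp)] = value

    def load(self, sp: int, disp: int) -> int:
        return self.data[self.index(sp, disp)]
```

Two references to the same stack word map to the same entry regardless of how the (sp, disp) pair is split, which is what lets stack memory dependencies resolve early.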

30
Cache Footprint Distribution by Regions
[Chart: % of hits to each cache set, by region; x-axis: cache set]
31
Stack Reference Characteristics
  • Contiguity
  • Good temporal and spatial locality
  • Can be stored in a simple, fast structure
  • Small die area relative to a regular cache
  • Less power dissipation
  • No address tag needed for each datum
  • Only the current TOS address is kept

32
Stack Reference Characteristics
  • First touch is almost always a store
  • Avoids wasting bandwidth bringing in dead data
  • A register write to the SVF suffices
  • Data in a deallocated stack frame are dead
  • No need to write them back to memory
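The frame-deallocation point above can be sketched directly: once SP pops past a frame, any dirty words inside it are dead and can be dropped instead of written back. The dirty-word dictionary model is an illustrative assumption:

```python
# Sketch of dead-data elimination on stack frame deallocation: dirty
# words inside the popped frame need no writeback. The dirty-word
# dictionary is an illustrative model, not the real SVF structure.
def deallocate_frame(dirty_words: dict, old_sp: int, new_sp: int) -> list:
    """Return only the dirty words that still must be written back.

    dirty_words maps address -> value. The stack grows downward, so
    popping a frame moves SP up from old_sp to new_sp; addresses in
    [old_sp, new_sp) belonged to the deallocated frame and are dead."""
    survivors = []
    for addr, value in sorted(dirty_words.items()):
        if old_sp <= addr < new_sp:
            continue                 # dead: the frame was deallocated
        survivors.append((addr, value))
    return survivors
```

This is why SVF outgoing traffic shrinks to only the words that are live and dirty at eviction.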

33
Speedup Potential of Stack Value File
  • Assume all references can be morphed
  • 30% speedup for a 16-wide machine with a
    dual-ported L1

34
Why is SVF Faster?
  • It reduces the load-to-use latency of stack
    references
  • It effectively increases the number of memory
    ports by rerouting more than half of all memory
    references to the SVF
  • It reduces contention in the MOB
  • It gives more flexibility in renaming stack
    references
  • It reduces memory traffic

35
Summary
  • Stack references have several unique
    characteristics
  • Contiguity, (sp + disp) addressing,
    first-reference store, frame deallocation.
  • Stack Value File
  • a microarchitecture extension to exploit these
    characteristics
  • improves performance by 24% - 65%

36
Future Research Directions
37
Extensions to Region Caching
  • Compiler Interaction
  • Cache conscious data placement
  • Similar in concept to decoder issues in PPro
  • More flexibility with different cache
    organizations
  • Avoiding reference mismatches
  • Sparse local arrays
  • Split virtual address space into regions by
    function and characteristic

38
Extensions to Region Caching
  • Heap allocation techniques
  • Caches hold heap data (85% occupancy)
  • Heap references are increasing as we use
    object-oriented languages (OOL)
  • Allocation sites are moving into libraries
  • Does expected locality behavior correlate with
    the allocation site?
  • If so, it is possible to call one of two
    malloc routines to place the data in a cacheable
    region
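The two-malloc idea can be sketched as an allocation-site dispatch; the site table, pool model, and both function names are hypothetical, not a real allocator API:

```python
# Sketch of allocation-site-directed heap placement: a hypothetical pair
# of allocators, one for data profiled as cache-friendly and one for
# streaming/low-locality data. The site set, pools, and function names
# are all assumptions for illustration.
CACHEABLE_SITES = {"parse_tree_node", "symbol_entry"}   # assumed profile result

cacheable_pool, uncacheable_pool = [], []

def malloc_cacheable(nbytes: int) -> int:
    cacheable_pool.append(nbytes)           # stand-in for a real allocator
    return len(cacheable_pool) - 1

def malloc_uncacheable(nbytes: int) -> int:
    uncacheable_pool.append(nbytes)
    return len(uncacheable_pool) - 1

def site_malloc(site: str, nbytes: int) -> int:
    """Pick an allocator based on the (profiled) allocation site."""
    if site in CACHEABLE_SITES:
        return malloc_cacheable(nbytes)
    return malloc_uncacheable(nbytes)
```

In practice the dispatch would be done at compile time by rewriting each call site, so the run-time check disappears; the sketch only shows the placement decision.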

39
Extensions to Region Caching
  • Synthesized Processor Memories
  • Many embedded designs do not use caches
  • Slow clocks, statically scheduled VLIW
  • But memory design is still important
  • Code size issues
  • Power issues

40
That's All Folks!
http://www.eecs.umich.edu/linear
41
Baseline Microarchitecture
[Diagram: baseline out-of-order pipeline. Fetch (I-Cache) feeds Decode (DecoderQ, Decoder), then Dispatch (register renamer with RAT), Issue (Reservation Station / LSQ), Execute (functional units, Ld/St unit with MOB), and Commit (ReOrder Buffer, architectural register file).]
42
Microarchitecture Extension
[Diagram: the baseline pipeline extended with a Morphing Pre-Decode unit alongside Decode (offset computation from the displacement and decode-time SP, Max and Hash logic) and a Stack Value File, with an interlock to the Ld/St unit and MOB for re-routed references.]
43
Microarchitecture Extension
stq r10, 24(sp)
[Animation: the same extended pipeline diagram; the sp-relative store reaches Morphing Pre-Decode, with TOS marked]
44
Microarchitecture Extension
stq r10, 24(sp)
[Animation: the offset field shows 3, the SVF index computed for the reference]
45
Microarchitecture Extension
stq r10, 24(sp), p35 → ROB-18
[Animation: the reference is renamed; p35 is mapped against ROB entry 18]
46
Microarchitecture Extension
stq r10, 24(sp), p35 → ROB-18
[Animation: the same diagram, next pipeline step]
47
Microarchitecture Extension
stq r10, 24(sp), p35 → SVF3
[Animation: the reference is morphed; p35 is now mapped to SVF entry 3]
48
Backup Foils
49
Stack Global Heap Read-only Data
50
Simulation Framework
  • Wattch simulator [Brooks et al. 00]
  • Considers switching power only
  • Simple clock gating
  • Baseline microarchitecture parameters
  • Close to Intel StrongARM SA-110
  • Single-issue, 5-stage, in-order pipeline
  • Unified 32B-line 32KB L1
  • Region-based caching applied only to the L1
  • 4-way 512KB L2

51
Stack Depth Variation
52
Offset Locality of Stack
  • Cumulative offset within a function call
  • Average 3B - 380B
  • >80% of offsets within 400B
  • >99% of offsets within 8KB

53
SVF Reference Type Breakdown
  • 86% of stack references can be morphed
  • Re-routed references enter the regular memory
    pipeline

54
Memory Traffic
  • The SVF dramatically reduces memory traffic, by
    orders of magnitude.
  • For gcc, 28M transfers (stack cache → L2) are
    reduced to 86K (SVF → L1).
  • Incoming traffic is eliminated because the SVF
    does not allocate a cache line on a miss.
  • Outgoing traffic consists of only those words
    that are dirty when evicted (instead of entire
    cache lines).

55
High Bandwidth Design: Eager Writeback
"Bandwidth problems can be cured with money. Latency
problems are harder because the speed of light is
fixed; you cannot bribe God." (David Clark, MIT)
56
Conclusions
  • Data demonstrate distinct characteristics across
    reference regions and cache reference behaviors
  • Region-based caching
  • Maps the cache structure to the specific needs of
    memory references
  • Exploits these data characteristics to optimize
    hardware/software design
  • These characteristics can also be used to improve
    memory allocation