Title: Cache Prefetching
Slide 1: Cache Prefetching
Slide 2: Outline
- Introduction and Terminology
- Three Techniques
- Stride Prefetching
- Recursive Prefetching
- Markov Prefetching
- Hybrid Approaches
- Conclusions
Slide 3: Traditional Processor
[Timeline figure: a memory reference misses in L1; the processor stalls for the L1 miss time plus the full memory latency until the data arrives.]
Slide 4: Lockup-Free Cache
[Timeline figure: the miss is overlapped with independent work; only the dependent instruction waits until the data arrives.]
Slide 5: Out-of-Order Execution
[Timeline figure: out-of-order execution hides part of the memory latency, but the dependent instruction still stalls until the data arrives.]
Slide 6: Cache Prefetching
[Timeline figure: a prefetch is issued well before the memory reference; the data arrives in time, so neither the reference nor the dependent instruction stalls.]
Slide 7: Accuracy and Coverage
- Prefetch_all: number of prefetches issued
- Prefetch_hit: number of prefetches that result in a cache hit
- Misses: number of cache misses
- Accuracy = Prefetch_hit / Prefetch_all (percentage of useful prefetches)
- Coverage = Prefetch_hit / (Prefetch_hit + Misses) (percentage of misses removed)
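The two metrics can be sketched directly from the slide's counters; the struct and function names below are ours, not from the slides.

```c
#include <assert.h>

/* Counters as defined on the slide. */
typedef struct {
    long prefetch_all;  /* prefetches issued                       */
    long prefetch_hit;  /* prefetches that later hit in the cache  */
    long misses;        /* cache misses remaining with prefetching */
} prefetch_stats;

/* Accuracy: fraction of issued prefetches that were useful. */
double accuracy(const prefetch_stats *s) {
    return s->prefetch_all ? (double)s->prefetch_hit / s->prefetch_all : 0.0;
}

/* Coverage: fraction of would-be misses removed by prefetching.
   Without prefetching there would have been misses + prefetch_hit misses. */
double coverage(const prefetch_stats *s) {
    long base = s->misses + s->prefetch_hit;
    return base ? (double)s->prefetch_hit / base : 0.0;
}
```

Note the tension between the two: issuing more prefetches tends to raise coverage but lower accuracy.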
Slide 8: Producing Prefetch Addresses
- Prefetch addresses are produced by observing memory references:
  - Stride prefetcher: strided memory reference patterns (HW/SW)
  - Recursive prefetcher: linked memory reference patterns (HW/SW)
  - Markov prefetcher: history of miss addresses (HW)
Slide 9: Outline (recap; next section: Stride Prefetching)
Slide 10: Strided Access
[Figure: successive references to addresses b, b+s, b+2s, b+3s, b+4s, b+5s, each pair separated by the constant stride s.]
Slide 11: Stride Prefetcher
[Figure: Reference Prediction Table (RPT), indexed by the PC of a load in a loop. Each entry holds a PC tag, the previous address, the stride, and a state field. On "LOAD reg, address" the RPT predicts the next reference at address + stride.]
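The RPT mechanism can be sketched in software as follows. This is a minimal sketch: the table size, the direct-mapped indexing, and the simple "stride seen twice" confirmation rule are our simplifications of the state machine, not the exact hardware.

```c
#include <stdint.h>
#include <string.h>

#define RPT_SIZE 64

/* One RPT entry: PC tag, previous address, stride, and a (simplified)
   state bit that is set once the same stride is observed twice. */
typedef struct {
    uint64_t pc_tag;
    uint64_t prev_addr;
    int64_t  stride;
    int      confirmed;
    int      valid;
} rpt_entry;

typedef struct { rpt_entry e[RPT_SIZE]; } rpt;

/* Observe one dynamic load; returns 1 and sets *prefetch_addr when the
   entry's stride is confirmed and a prefetch should be issued. */
int rpt_access(rpt *t, uint64_t pc, uint64_t addr, uint64_t *prefetch_addr) {
    rpt_entry *e = &t->e[pc % RPT_SIZE];
    if (!e->valid || e->pc_tag != pc) {          /* allocate a new entry */
        memset(e, 0, sizeof *e);
        e->valid = 1; e->pc_tag = pc; e->prev_addr = addr;
        return 0;
    }
    int64_t stride = (int64_t)(addr - e->prev_addr);
    e->confirmed = (stride == e->stride);        /* same stride as before? */
    e->stride = stride;
    e->prev_addr = addr;
    if (e->confirmed && stride != 0) {
        *prefetch_addr = addr + stride;          /* predict the next access */
        return 1;
    }
    return 0;
}
```

A load that walks an array with a fixed stride trains its entry in two accesses and prefetches from the third on.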
Slide 12: Prefetching in a Loop
[Figure: a loop containing "LOAD reg, address" followed by "PREFETCH address + stride". With t_x the ideal execution time of one loop iteration and t_m the memory latency, a prefetch issued one iteration ahead hides the latency completely only if t_x >= t_m; otherwise the loop is memory-latency bound.]
Slide 13: Overlapped Prefetching
[Figure: the loop instead issues "PREFETCH address + n*stride", prefetching n iterations ahead so that n*t_x >= t_m.]
Instead of computing n we can use a lookahead PC.
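Before turning to the lookahead PC, the computed prefetch distance is just a ceiling division (the function name is ours):

```c
/* Prefetch distance: with iteration time t_x and memory latency t_m,
   prefetching n = ceil(t_m / t_x) iterations ahead hides the latency. */
int prefetch_distance(int t_m, int t_x) {
    return (t_m + t_x - 1) / t_x;   /* integer ceiling */
}
```

The catch, which motivates the lookahead PC, is that t_x is rarely a known compile-time constant.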
Slide 14: Lookahead PC (Chen and Baer)
- Initially set equal to the PC
- Incremented by 1 every cycle (branches predicted using the BPT)
- Runs ahead of the PC when the PC stalls due to a cache miss
- Used to index the RPT and issue prefetches
- The distance between the lookahead PC and the PC is not allowed to exceed the memory latency
Slide 15: Superscalar Challenges
- The lookahead PC (LA PC) is not fast enough when several instructions are issued every cycle
- The cache tags can become a bottleneck because the rate at which prefetches and memory references are issued increases
- Tango (Pinter and Yoaz) addresses both:
  - a faster lookahead PC
  - a cache for the tag memory
Slide 16: Tango (Pinter and Yoaz)
- The lookahead PC advances not by 1 every cycle, but from one branch to the next, using an enhanced branch target buffer (BTB)
[Figure: each BTB entry holds a PC tag, the target address, prediction info, and a taken (T) and a not-taken (NT) entry.]
Slide 17: Prefetching with the New LA PC
[Figure: the RPT (PC tag, previous address, stride, state) is now driven from BTB entries, each of whose T/NT entries records a memory-reference count (MemRefCnt) for the path to the next branch. Example: branch A is followed on its not-taken path by loads B, C, and D before branch E; A's NT entry accumulates the loads on that path (its MemRefCnt growing from 1 to 2 as they are discovered), so when the LA PC reaches A, all loads up to the next branch can be looked up in the RPT and prefetched at once.]
Slide 18: Cache for the Tag Memory
- A FIFO queue tracks the last n (= 6) cache lines that were found (hit) in the cache
- If a prefetch address hits in the FIFO, the prefetch is useless and can be discarded
- This cuts the number of cache-tag lookups caused by useless prefetches in half
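The filter can be sketched as a tiny ring buffer; the type and function names below are ours.

```c
#include <stdint.h>

#define FIFO_N 6

/* FIFO of the last FIFO_N line addresses that hit in the cache. */
typedef struct {
    uint64_t line[FIFO_N];
    int head, count;
} hit_fifo;

/* A demand reference hit in the cache: remember its line. */
void fifo_record_hit(hit_fifo *f, uint64_t line_addr) {
    f->line[f->head] = line_addr;
    f->head = (f->head + 1) % FIFO_N;
    if (f->count < FIFO_N) f->count++;
}

/* Returns 1 if a prefetch to this line is known to be redundant and can
   be dropped without touching the cache tags. */
int fifo_filters(const hit_fifo *f, uint64_t line_addr) {
    for (int i = 0; i < f->count; i++)
        if (f->line[i] == line_addr) return 1;
    return 0;
}
```

A miss in the FIFO does not prove the prefetch is useful, so filtered-out lookups are a strict subset of the redundant ones.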
Slide 19: Outline (recap; next section: Recursive Prefetching)
Slide 20: Linked Data Structure Access
[Figure: four list nodes, each with fields at offsets 0, 4, 8, 12, and 14; the next pointer sits at the same offset (14) in every node, so each traversal step loads from that fixed offset off the current node's address.]
Slide 21: Detecting Recursive Accesses
[Figure: three nodes a, b, c linked through next pointers at offset 14. The load that produces b's address (reading a's next field) and the load that consumes b as its base register (producing c) are the same static instruction.]
Example (p = p->next):
  LOAD rdest, rsrc(14)   ; producer of b
  LOAD rdest, rsrc(14)   ; consumer of b / producer of c
The value loaded by the producer and rsrc of the consumer hold the same value.
Slide 22: Roth, Moshovos, Sohi (HW)
[Figure: the producer of b at PC-A and the consumer of b / producer of c at PC-B (whose rsrc holds the value PC-A loaded) are detected with two hardware tables.]

Potential Producer Window:
  Memory Value Loaded | Producer Instruction Address
  b                   | PC-A

Correlation Table:
  Producer Instruction Address | Consumer Instruction Address | Consumer Instruction Template
  PC-A                         | PC-B                         | LOAD r, r(14)
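The two tables can be sketched in software. This is a toy simplification: the table sizes, the linear lookup, and the use of a plain offset in place of the full consumer instruction template are our assumptions, not the paper's hardware.

```c
#include <stdint.h>

#define PPW_N 16
#define CT_N  16

typedef struct { uint64_t value, producer_pc; } ppw_entry;
typedef struct { uint64_t producer_pc, consumer_pc; int64_t offset; } ct_entry;

typedef struct {
    ppw_entry ppw[PPW_N]; int ppw_n;  /* Potential Producer Window */
    ct_entry  ct[CT_N];   int ct_n;   /* Correlation Table         */
} corr_tables;

/* A load completed: check whether its base address was produced by an
   earlier recorded load (if so, establish a correlation), then record
   this load as a potential producer itself. */
void observe_load(corr_tables *t, uint64_t pc, uint64_t base,
                  int64_t offset, uint64_t value) {
    for (int i = 0; i < t->ppw_n; i++)
        if (t->ppw[i].value == base && t->ct_n < CT_N) {
            t->ct[t->ct_n++] = (ct_entry){ t->ppw[i].producer_pc, pc, offset };
            break;
        }
    if (t->ppw_n < PPW_N)
        t->ppw[t->ppw_n++] = (ppw_entry){ value, pc };
}

/* A producer's loaded value just arrived: if a correlation exists, return
   the consumer's address, i.e. the prefetch target. */
int prefetch_for(const corr_tables *t, uint64_t producer_pc,
                 uint64_t value, uint64_t *addr) {
    for (int i = 0; i < t->ct_n; i++)
        if (t->ct[i].producer_pc == producer_pc) {
            *addr = value + t->ct[i].offset;
            return 1;
        }
    return 0;
}
```

For the pointer-chasing load of slide 21, producer and consumer are the same static PC, so the table correlates the load with itself.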
Slide 23: Recursive Prefetching?
[Figure: the flow through the tables: (1) record each completed load in the Potential Producer Window; (2) when a later load's base value matches a recorded value, establish a producer/consumer correlation in the Correlation Table; (3) when the producer's value arrives, apply the consumer template (LOAD r, r(14)) to issue a prefetch.]
Slide 24: Luk and Mowry (SW)
- Identify recursive data structures (RDSs):
    struct T { int data; struct T *next; };
- Find RDS traversals:
    T *l; ... while (l) { ... l = l->next; }
- Insert prefetches (greedy prefetching):
    T *l; ... while (l) { prefetch(l->next); ... l = l->next; }
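A concrete, compilable version of greedy prefetching on the slide's list type might look like this; __builtin_prefetch is the GCC/Clang intrinsic (a compiler-portable build would wrap it in a macro), and the summing "work" is our stand-in for the loop body.

```c
#include <stddef.h>

struct T { int data; struct T *next; };

/* Greedy prefetching: request the next node as soon as its address is
   known, then do the per-node work, overlapping the two. */
long traverse_sum(struct T *l) {
    long sum = 0;
    while (l) {
        if (l->next)
            __builtin_prefetch(l->next);  /* greedy: one node ahead */
        sum += l->data;                   /* work on current node   */
        l = l->next;
    }
    return sum;
}
```

The prefetch only hides latency to the extent that the per-node work takes as long as the memory access, which is exactly the t_x vs. t_m trade-off from the stride slides.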
Slide 25: Pre-Order Tree Traversal
[Figure: a 15-node binary tree labeled with the ordering of prefetch requests (1-15); each node n is marked as a miss, a partial miss (prefetch issued but not yet complete when the node is visited), or a hit.]
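Greedy prefetching on a tree can be sketched as below (our illustration, not code from the paper): both children are prefetched before visiting a node, so the left child, visited immediately, is often only a partial miss, while the right child's latency overlaps the entire left subtree and it is usually a hit.

```c
#include <stddef.h>

struct node { int key; struct node *left, *right; };

/* Pre-order traversal with greedy child prefetching; keys are written
   into out[] in visit order, k is the running count, which is returned. */
int preorder(struct node *n, int *out, int k) {
    if (!n) return k;
    if (n->left)  __builtin_prefetch(n->left);   /* partial miss likely */
    if (n->right) __builtin_prefetch(n->right);  /* hit likely          */
    out[k++] = n->key;                           /* visit               */
    k = preorder(n->left, out, k);
    return preorder(n->right, out, k);
}
```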
Slide 26: Outline (recap; next section: Markov Prefetching)
Slide 27: Markov Prefetcher (Joseph and Grunwald)
[Figure: a miss-address stream in which Miss Addr. 1 is followed sometimes by Miss Addr. 2a and sometimes by Miss Addr. 2b, each with its own continuations (3a/3b, 4a-4d, 5a-5d). The stream trains a state transition table indexed by the miss address.]

State transition table with a history of 1 (each miss tag stores up to two predicted successors):
  Miss Tag | Predictor 1 | Predictor 2
  Addr. 1  | Addr. 2a    | Addr. 2b
  Addr. 2a | Addr. 3a    | ---
  Addr. 2b | Addr. 3b    | ---
  Addr. 3a | Addr. 4a    | ---
  Addr. 3b | Addr. 4b    | Addr. 4c

State transition table with a history of 3 (the tag is the miss three misses back, so the prediction runs farther ahead):
  Miss Tag | Predictor 1 | Predictor 2
  Addr. 1  | Addr. 4a    | Addr. 4b
  Addr. 2a | Addr. 5a    | ---
  Addr. 2b | Addr. 5b    | Addr. 5c
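A history-of-1 table can be sketched as follows. The table size, the direct-mapped indexing, the two-slot round-robin replacement, and the use of address 0 as an empty-slot sentinel are all our simplifications.

```c
#include <stdint.h>

#define MARKOV_N 32

typedef struct {
    uint64_t tag;       /* miss address this row predicts for      */
    uint64_t pred[2];   /* up to two observed successors (0=empty) */
    int valid, nxt;     /* nxt: slot to replace next (round-robin) */
} markov_row;

typedef struct {
    markov_row row[MARKOV_N];
    uint64_t last_miss;
    int have_last;
} markov_table;

/* Record a miss: learn the transition last_miss -> addr, then return the
   number of predictions (0..2) for addr written into out[]. */
int markov_miss(markov_table *t, uint64_t addr, uint64_t out[2]) {
    if (t->have_last) {
        markov_row *r = &t->row[t->last_miss % MARKOV_N];
        if (!r->valid || r->tag != t->last_miss) {   /* (re)allocate row */
            r->valid = 1; r->tag = t->last_miss;
            r->pred[0] = r->pred[1] = 0; r->nxt = 0;
        }
        if (r->pred[0] != addr && r->pred[1] != addr) {
            r->pred[r->nxt] = addr;
            r->nxt ^= 1;
        }
    }
    t->last_miss = addr; t->have_last = 1;

    markov_row *r = &t->row[addr % MARKOV_N];        /* predict successors */
    int n = 0;
    if (r->valid && r->tag == addr)
        for (int i = 0; i < 2; i++)
            if (r->pred[i]) out[n++] = r->pred[i];
    return n;
}
```

The slide's hardware-cost point is visible here: every row stores full miss addresses, so the table is large relative to what a stride prefetcher needs.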
Slide 28: Outline (recap; next section: Hybrid Approaches)
Slide 29: Hybrid Approaches (Joseph and Grunwald)
- Parallel prefetching
  - all prefetchers have access to the hardware resources (e.g., miss addresses, data on memory reference instructions)
  - all prefetchers are allowed to prefetch
- Serial prefetching
  - the most accurate prefetcher is allowed to prefetch first
  - static ordering: stride, Markov, consecutive
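Serial prefetching amounts to a fixed-priority chain; the sketch below (stub predictors and all names are ours) shows the ordering logic, with each predictor consulted only if the more accurate ones had nothing to offer.

```c
#include <stdint.h>

/* A predictor takes a miss address and, on success, writes a prefetch
   address and returns nonzero. */
typedef int (*predictor)(uint64_t miss_addr, uint64_t *out);

/* Serial prefetching: walk the static accuracy ordering and let the
   first predictor that produces an address prefetch; returns its index,
   or -1 if none predicted. */
int serial_predict(predictor p[], int n, uint64_t miss, uint64_t *out) {
    for (int i = 0; i < n; i++)
        if (p[i](miss, out)) return i;
    return -1;
}

/* Stub predictors for illustration; real ones would be the RPT, the
   Markov table, and a next-line prefetcher. */
static int stride_pred(uint64_t m, uint64_t *o)   { (void)m; (void)o; return 0; }
static int markov_pred(uint64_t m, uint64_t *o)   { *o = m + 64; return 1; }
static int nextline_pred(uint64_t m, uint64_t *o) { *o = m + 64; return 1; }
```

Here the stride predictor declines (no steady stride), so the Markov predictor, next in the static ordering, gets to prefetch.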
Slide 30: Conclusions
- Stride prefetchers are the most mature
  - good coverage and timing, high accuracy
- Recursive prefetchers
  - poor timing; unable to overlap prefetches well
- Markov prefetcher
  - high hardware cost and mediocre accuracy; not a good stand-alone prefetcher
- Hybrid prefetching
  - needs evaluation of which prefetchers to combine, and how to combine them (e.g., serial prefetching)