Title: Feedback Directed Prefetching
1. Feedback Directed Prefetching
- Santhosh Srinath
- Onur Mutlu
- Hyesoon Kim
- Yale N. Patt
2. Problem & Solution
- Prefetching can significantly improve performance
  - When prefetches are accurate
  - And timely
- However, prefetching can also significantly degrade performance
  - Due to memory bandwidth impact
  - Pollution of the cache
- Feedback Directed Prefetching is a comprehensive mechanism that reduces the negative effects of prefetching while enhancing its positive effects
3. Outline
- Background and Motivation
- Feedback Directed Prefetching (FDP)
- Metrics and How to collect
- How to adapt
- Prefetcher Aggressiveness
- Cache Insertion Policy for Prefetches
- Results
4. Background (Prefetcher Aggressiveness)
- Prefetch Distance
- Prefetch Degree
[Figure: a demand access stream and the prefetcher's predicted stream, illustrating prefetch distance (how far ahead of the demand access stream the prefetcher runs) and prefetch degree (how many prefetches are issued at a time), for very conservative, middle-of-the-road, and very aggressive configurations]
5. Background (Prefetcher Aggressiveness)
- Very Aggressive
  - Runs well ahead of the load access stream
  - Hides memory access latency better
  - More speculative
- Very Conservative
  - Stays closer to the load access stream
  - Might not hide memory access latency completely
  - Reduces the potential for cache pollution and bandwidth contention
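To make the two knobs concrete, here is a minimal sketch (function and constant names are illustrative assumptions, not the paper's implementation) of how distance and degree shape the addresses a stream prefetcher issues:

```python
# Illustrative sketch: how prefetch distance and degree determine the
# blocks a stream prefetcher requests on a tracked access.
BLOCK = 64  # cache block size in bytes (assumed)

def issue_prefetches(access_addr, distance, degree):
    """Issue `degree` block prefetches starting `distance` blocks
    ahead of the current access in the stream."""
    start = access_addr + distance * BLOCK
    return [start + i * BLOCK for i in range(degree)]

# Very Conservative (distance=4, degree=1) vs. Very Aggressive (64, 4):
conservative = issue_prefetches(0x1000, 4, 1)   # one block, slightly ahead
aggressive = issue_prefetches(0x1000, 64, 4)    # four blocks, far ahead
```

A more aggressive setting covers more latency but speculates on more blocks, which is exactly the bandwidth/pollution trade-off the following slides quantify.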
6. Motivation
- Very Aggressive prefetching improves average performance by 84%
- However, it can also significantly reduce performance on some benchmarks
7. Outline
- Background and Motivation
- Feedback Directed Prefetching (FDP)
- Metrics and How to collect
- How to adapt
- Prefetcher Aggressiveness
- Cache Insertion Policy for Prefetches
- Results
8. Feedback Directed Prefetching
- A comprehensive mechanism which takes into account
- Prefetcher Accuracy
- Prefetcher Lateness
- Prefetcher-caused Cache Pollution
- Adapts
- Prefetcher Aggressiveness
- Cache Insertion Policy for Prefetches
9. Metrics
- Prefetch Accuracy
- Prefetch Lateness
- Prefetcher-caused Cache Pollution
10. Prefetch Accuracy
- Useful prefetches are those referenced by demand requests while they reside in the L2
- Prefetch Accuracy = useful prefetches / total prefetches sent
11. Prefetch Accuracy
- Low Accuracy
- More likely that Prefetching can reduce
performance
12. Prefetch Accuracy
- Implementation
  - pref-bit added to each L2 tag-store entry
  - Tracked using two counters: pref_total, used_total
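The accuracy bookkeeping can be sketched in software as follows (class and method names are illustrative; the real mechanism is two hardware counters plus a pref-bit per L2 block):

```python
# Sketch of the accuracy-tracking counters: pref_total counts prefetches
# inserted into the L2, used_total counts prefetched blocks later hit by
# a demand request (the pref-bit would be cleared on first use so each
# block is counted at most once).
class AccuracyTracker:
    def __init__(self):
        self.pref_total = 0
        self.used_total = 0

    def on_prefetch_fill(self):
        # a prefetched block is inserted into L2 with its pref-bit set
        self.pref_total += 1

    def on_demand_hit_prefetched(self):
        # a demand request hit a block whose pref-bit was still set
        self.used_total += 1

    def accuracy(self):
        return self.used_total / self.pref_total if self.pref_total else 0.0
```

In hardware these counters would be read and reset at phase boundaries; that periodic sampling is omitted here for brevity.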
13. Prefetch Lateness
- A measure of how timely prefetches are
- A prefetch is late if a demand request arrives while the prefetch is still in flight
- Used to determine whether increasing the aggressiveness helps
- Implementation
  - pref-bit added to each L2 MSHR entry
  - New counter late_total
14. Prefetcher-caused Cache Pollution
- A measure of the disturbance caused by prefetched data in the cache
- Used to determine whether the prefetcher is evicting useful data from the cache
15. Prefetcher-caused Cache Pollution (2)
- Hardware Implementation
  - Insight: this does not need to be exact
  - Track pollution using a Pollution Filter
    - Based on the Bloom Filter concept
    - Bit set when a prefetch evicts a demand-fetched block
    - Bit reset when a prefetch is serviced
  - Two counters: pollution_total, demand_total
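A rough sketch of the pollution filter (the size matches the later hardware-cost slide; the index function and reset policy are simplifying assumptions, not the paper's exact design):

```python
# Bloom-filter-style pollution filter: a bit is set when a prefetch
# evicts a demand-fetched block; a later demand miss that finds its bit
# set is counted as a pollution-induced miss.
class PollutionFilter:
    def __init__(self, entries=4096, block=64):
        self.bits = [0] * entries
        self.entries = entries
        self.block = block
        self.pollution_total = 0
        self.demand_total = 0

    def _index(self, addr):
        # simple block-address hash (assumed; real indexing may differ)
        return (addr // self.block) % self.entries

    def on_prefetch_evicts_demand_block(self, evicted_addr):
        self.bits[self._index(evicted_addr)] = 1

    def on_demand_miss(self, addr):
        self.demand_total += 1
        i = self._index(addr)
        if self.bits[i]:
            # this miss was plausibly caused by a prefetch eviction
            self.pollution_total += 1
            self.bits[i] = 0

    def pollution(self):
        return self.pollution_total / self.demand_total if self.demand_total else 0.0
```

Because it is a Bloom-style filter, aliasing can overcount pollution slightly, which is acceptable given the "does not need to be exact" insight above.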
16. Feedback Directed Prefetching
- A comprehensive mechanism which takes into account
- Prefetcher Accuracy
- Prefetcher Lateness
- Prefetcher-caused Cache Pollution
- Adapts
- Prefetcher Aggressiveness
- Cache Insertion Policy
17. How to adapt? Prefetcher Aggressiveness
- A Dynamic Configuration Counter selects the current aggressiveness:

  Counter  Aggressiveness      Distance  Degree
  1        Very Conservative       4       1
  2        Conservative            8       1
  3        Middle-of-the-Road     16       2
  4        Aggressive             32       4
  5        Very Aggressive        64       4
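The configuration counter can be sketched as a saturating index into the aggressiveness table (the table values are from the slide; the function names are illustrative):

```python
# The dynamic configuration counter (1..5) selects a (distance, degree)
# pair; feedback moves it up or down by one level, saturating at the ends.
LEVELS = {  # counter: (name, distance, degree)
    1: ("Very Conservative", 4, 1),
    2: ("Conservative", 8, 1),
    3: ("Middle-of-the-Road", 16, 2),
    4: ("Aggressive", 32, 4),
    5: ("Very Aggressive", 64, 4),
}

def adjust(counter, action):
    """`action` is one of 'increase', 'decrease', 'no-change'."""
    if action == "increase":
        return min(counter + 1, 5)
    if action == "decrease":
        return max(counter - 1, 1)
    return counter
```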
18. How to adapt? Prefetcher Aggressiveness (2)
- For the current phase, classify against static thresholds:
  - Accuracy: High / Med / Low
  - Lateness: Late / Not-Late
  - Cache pollution caused by prefetches: Polluting / Not-Poll
[Decision tree: each (accuracy, lateness, pollution) classification maps to Increase, Decrease, or No Change for the aggressiveness counter. Increase aims to improve timeliness; Decrease aims to reduce memory bandwidth usage and cache pollution.]
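The overall shape of the feedback decision can be sketched as follows. The exact case-to-action mapping is the paper's decision tree; the mapping below is an illustrative assumption consistent with the stated goals (increase to improve timeliness, decrease to cut bandwidth and pollution), not the paper's exact table:

```python
def decide(accuracy, late, polluting):
    """Illustrative feedback policy (assumed mapping, not the exact tree).
    accuracy in {'high', 'med', 'low'}; late and polluting are booleans."""
    if accuracy == "high":
        # accurate prefetches: push further ahead only if they arrive late
        return "increase" if late else "no-change"
    if accuracy == "med":
        if late:
            return "increase"
        # timely but polluting: back off to reduce cache pollution
        return "decrease" if polluting else "no-change"
    # low accuracy: back off to save bandwidth and reduce pollution
    return "decrease"
```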
19. How to adapt? Cache Insertion Policy for Prefetches
- Why adapt?
  - Reduce the potential for cache pollution
- Classify cache pollution based on static thresholds:
  - Low: insert at the MID (n/2) position
    - E.g., for a 16-way cache, MID = position 8 in the LRU stack
  - Medium: insert at the LRU-4 (n/4) position
    - E.g., for a 16-way cache, LRU-4 = position 4 in the LRU stack
  - High: insert at the LRU position
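The insertion-position choice can be sketched as a small lookup (the function name is illustrative; positions are counted from the LRU end of the stack, matching the slide's examples):

```python
def insertion_position(pollution_level, assoc=16):
    """Pick the LRU-stack insertion point for a prefetched block.
    Returned value = distance from the LRU end (0 = LRU)."""
    if pollution_level == "low":
        return assoc // 2   # MID position, e.g. 8 for a 16-way cache
    if pollution_level == "medium":
        return assoc // 4   # LRU-4 position, e.g. 4 for a 16-way cache
    return 0                # high pollution: insert at LRU
```

Inserting closer to LRU means a polluting prefetch is evicted sooner, bounding the damage it can do to demand-fetched data.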
20. Outline
- Background and Motivation
- Feedback Directed Prefetching
- Metrics and How to collect
- How to adapt
- Prefetcher Aggressiveness
- Cache Insertion Policy for Prefetches
- Results
21. Evaluation Methodology
- Execution-driven Alpha simulator
- Aggressive out-of-order superscalar processor
- 1 MB, 16-way, 10-cycle unified L2 cache
- 500-cycle minimum main memory latency
- Detailed memory model
- Prefetchers Modeled
- Stream Prefetcher tracking 64 different streams
- Global History Buffer Prefetcher (in paper)
- PC-based Stride Prefetcher (in paper)
22. Results: Adjusting Only Aggressiveness
- 4.7% higher average IPC over the Very Aggressive configuration
- Most of the performance losses have been eliminated
23. Results: Adjusting Only Cache Insertion Policy
- Using the Very Aggressive prefetcher:
  - 5.1% better than inserting prefetches at the MRU position
  - 1.9% better than inserting prefetches at the LRU-4 position
24. Results: Putting It All Together (FDP)
- 6.5% IPC improvement over the Very Aggressive configuration
- Performance losses converted to performance gains!
25. Bandwidth Impact
- BPKI: memory bus accesses per 1000 retired instructions
  - Includes effects of L2 demand misses as well as pollution-induced misses and prefetches
- FDP significantly improves bandwidth efficiency
  - 6.5% higher performance with 18.7% less bandwidth than Very Aggressive
  - 13.6% higher performance than Middle-of-the-Road with similar bandwidth usage

        No Pref.  Very Cons   Mid    Very Aggr   FDP
  IPC     0.85      1.21      1.47     1.57      1.67
  BPKI    8.56      9.34     10.60    13.38     10.88
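The headline percentages follow directly from the IPC/BPKI table; a quick check (variable names are mine):

```python
# Verify the bandwidth-efficiency claims from the averaged IPC/BPKI table.
ipc = {"mid": 1.47, "very_aggr": 1.57, "fdp": 1.67}
bpki = {"mid": 10.60, "very_aggr": 13.38, "fdp": 10.88}

perf_gain = (ipc["fdp"] / ipc["very_aggr"] - 1) * 100    # ~6.4% vs. Very Aggressive
bw_saving = (1 - bpki["fdp"] / bpki["very_aggr"]) * 100  # ~18.7% less bandwidth
perf_vs_mid = (ipc["fdp"] / ipc["mid"] - 1) * 100        # ~13.6% vs. Middle-of-the-Road
```

Note FDP's BPKI (10.88) is close to Middle-of-the-Road's (10.60), which is what "similar bandwidth usage" refers to.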
26. Hardware Cost

  pref-bits for L2 cache:  16384 blocks x 1 bit  = 16384 bits
  Pollution Filter:         4096 entries x 1 bit =  4096 bits
  16-bit counters:            11 counters        =   176 bits
  pref-bits for MSHR:        128 entries x 1 bit =   128 bits

- Total hardware cost: 20784 bits ≈ 2.54 KB
- Area overhead compared to the baseline 1 MB L2 cache: 2.5 KB / 1024 KB ≈ 0.24%
- NOT on the critical path
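A quick arithmetic check of the cost table:

```python
# Sum the storage components and express the total as a fraction of the
# 1 MB baseline L2.
l2_pref_bits = 16384        # one pref-bit per L2 block
filter_bits = 4096          # 1-bit pollution filter entries
counter_bits = 11 * 16      # eleven 16-bit counters = 176 bits
mshr_pref_bits = 128        # one pref-bit per MSHR entry

total_bits = l2_pref_bits + filter_bits + counter_bits + mshr_pref_bits
total_kb = total_bits / 8 / 1024       # ~2.54 KB
overhead_pct = total_kb / 1024 * 100   # ~0.25% of a 1 MB L2
```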
27. Outline
- Background and Motivation
- Feedback Directed Prefetching
- Metrics and collecting this information in hardware
- How to adapt
- Results
- Conclusions
28. Contributions
- A comprehensive and low-cost feedback mechanism for hardware prefetchers
- Uses
  - Prefetcher Accuracy
  - Prefetcher Lateness
  - Prefetcher-caused Cache Pollution
- Adapts
  - Aggressiveness
  - Cache Insertion Policy for prefetches
- 6.5% higher performance and 18.7% less bandwidth compared to Very Aggressive prefetching
- Eliminates the negative impact of prefetching
- Applicable to any data prefetch algorithm
29. Questions?
30. Backups
31. FDP vs. Prefetch Cache
- Prefetch caches eliminate prefetcher-induced cache pollution
- However, prefetches are then limited to the size of the prefetch cache
- FDP achieves 5.3% higher performance than Very Aggressive with a 32 KB prefetch cache
- Within 2% of Very Aggressive with a 64 KB prefetch cache
- Memory bandwidth of FDP is 16% less than the 32 KB and 9% less than the 64 KB configuration
32. Performance on Other Prefetch Algorithms
- Global History Buffer Prefetcher
  - 20.8% less memory bandwidth than Very Aggressive with similar performance
  - 9.9% better performance than Middle-of-the-Road with similar bandwidth usage
- PC-based Stride Prefetcher
  - 4% better performance than Very Aggressive
  - 24% reduction in bandwidth usage
33. IPC Performance
34. Dynamic Prefetcher Accuracy
35. Prefetch Lateness
36. Pollution Filter
37. Thresholds
38. Prefetches Sent
39. Distribution of dynamic aggressiveness levels
40. Distribution of insertion position of prefetched blocks
41. Effect of FDP on memory bandwidth consumption
42. Performance of prefetch cache vs. FDP
43. Bandwidth consumption of prefetch cache vs. FDP
44. Effect of FDP on GHB
45. Effect of FDP on GHB (bandwidth)
46. Effect of varying L2 size and memory latency
47. IPC on other benchmarks
48. BPKI on other benchmarks