1
Fetch Directed Instruction Prefetching
  • Glenn Reinman, Brad Calder,
  • Department of Computer Science and Engineering,
  • University of California San Diego
  • and Todd Austin
  • Department of Electrical Engineering and Computer
    Science, University of Michigan

2
Introduction
  • Instruction supply critical to processor
    performance
    • Complicated by instruction cache misses
  • Instruction cache miss solutions
    • Increasing size or associativity of the
      instruction cache
    • Instruction cache prefetching
      • Which cache blocks to prefetch?
      • Timeliness of prefetch
      • Interference with demand misses

3
Prior Instruction Prefetching Work
  • Next line prefetching (NLP) (Smith)
    • Each cache block is tagged with an NLP bit
    • When a block is accessed during a fetch, its NLP
      bit determines whether the next sequential block
      is prefetched
    • Prefetches go into a fully associative buffer
  • Streaming buffers (Jouppi)
    • On a cache miss, sequential cache blocks, starting
      with the block that missed, are prefetched into a
      buffer
    • Buffers can use fully associative lookup
    • A uniqueness filter can avoid redundant prefetches
    • Multiple streaming buffers can be used together
      (see the sketch below)
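
To make the streaming buffer mechanism concrete, here is a minimal
Python sketch. The block size, buffer depth, and all names are
assumptions for illustration; only the behaviors (sequential
streaming from the missed block, fully associative lookup, a
uniqueness filter) come from the slide.

    from collections import deque

    BLOCK_SIZE = 32     # assumed cache block size in bytes (not in the talk)
    BUFFER_DEPTH = 4    # assumed number of entries per streaming buffer

    class StreamingBuffer:
        def __init__(self):
            self.entries = deque(maxlen=BUFFER_DEPTH)
            self.next_block = None

        def allocate(self, miss_block):
            # On a cache miss, stream sequential blocks into the buffer,
            # starting with the block that missed.
            self.entries.clear()
            self.next_block = miss_block
            while len(self.entries) < BUFFER_DEPTH:
                self.entries.append(self.next_block)  # models a prefetch
                self.next_block += BLOCK_SIZE

        def probe(self, block):
            # Fully associative lookup: the block may hit in any entry.
            return block in self.entries

    def handle_miss(buffers, miss_block):
        # Uniqueness filter: do not allocate a new stream if some buffer
        # already holds the missing block.
        if any(b.probe(miss_block) for b in buffers):
            return
        buffers[0].allocate(miss_block)  # replacement policy omitted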

4
Our Prefetching Approach
  • Desirable characteristics
    • Accuracy of prefetch: useful prefetches
    • Timeliness of prefetch: maximize prefetch gain
  • Fetch Directed Prefetching
    • Branch predictor runs ahead of the instruction
      cache
    • Instruction cache prefetch guided by the predicted
      instruction stream

5
Talk Overview
  • Fetch Target Queue (FTQ)
  • Fetch Directed Prefetching (FDP)
  • Filtering Techniques
  • Enhancements to Streaming Buffers
  • Bandwidth Considerations
  • Conclusions

6
Fetch Target Queue
  • Queue of instruction fetch addresses
  • Latency tolerance
    • Branch predictor can continue in the face of an
      icache miss
    • Instruction fetch can continue in the face of a
      branch predictor miss
  • When combined with a high bandwidth branch
    predictor, provides a stream of instruction
    addresses far in advance of the current PC
    (see the sketch below)
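
As a sketch, the FTQ is just a bounded queue sitting between the
branch predictor and instruction fetch. Only the 32-entry size is
from the talk; the interface names are invented.

    from collections import deque

    FTQ_SIZE = 32   # 32-entry FTQ, as in the talk

    class FetchTargetQueue:
        def __init__(self):
            self.q = deque()

        def enqueue(self, fetch_addr):
            # Predictor side: keeps running ahead while instruction
            # fetch stalls on an icache miss, as long as there is room.
            if len(self.q) >= FTQ_SIZE:
                return False          # FTQ full: the predictor stalls
            self.q.append(fetch_addr)
            return True

        def dequeue(self):
            # Fetch side: keeps consuming buffered targets even while
            # the branch predictor itself is stalled.
            return self.q.popleft() if self.q else None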

7
Fetch Directed Prefetching
[Diagram: the branch predictor feeds a 32-entry FTQ; a prefetch
enqueue stage (the filtration mechanisms) moves the current FTQ
prefetch candidate into the PIQ, and prefetched blocks land in a
32-entry fully associative buffer probed by instruction fetch.]
  • Stream of PCs contained in the FTQ guides prefetch
  • FTQ is searched in order for entries to prefetch
  • Prefetched cache blocks are stored in a fully
    associative queue
  • Fully associative queue and instruction cache are
    probed in parallel (see the sketch below)
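
A per-cycle sketch of the pipeline above, using plain Python
containers. The in-order FTQ scan and the parallel probe come from
the slide; everything else (names, one prefetch per cycle) is an
assumption.

    class FTQEntry:
        def __init__(self, block):
            self.block = block       # cache block of the fetch target
            self.prefetched = False

    def prefetch_step(ftq, prefetch_buffer, bus_idle):
        # Search the FTQ in order for the next entry to prefetch.
        for e in ftq:
            if not e.prefetched:
                if bus_idle:
                    prefetch_buffer.add(e.block)  # models the fill done
                    e.prefetched = True
                break                             # in-order scan stops

    def fetch_step(ftq, icache_blocks, prefetch_buffer):
        if not ftq:
            return None
        e = ftq.pop(0)
        # Probe the icache and the fully associative buffer in parallel.
        return (e.block,
                e.block in icache_blocks or e.block in prefetch_buffer)

    ftq = [FTQEntry(0x100), FTQEntry(0x120)]
    buf = set()
    prefetch_step(ftq, buf, bus_idle=True)
    print(fetch_step(ftq, set(), buf))  # (256, True): served from buffer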

8
Methodology
  • SimpleScalar Alpha 3.0 tool set (Burger, Austin)
  • SPEC95 C benchmarks
    • Fast forwarded past the initialization portion of
      each benchmark
  • Can issue 8 instructions per cycle
  • 128 entry reorder buffer
  • 32 entry load/store buffer
  • Variety of instruction cache sizes
    • 16K 2-way and 4-way associative
    • 32K 2-way associative
    • Tried both single and dual ported configurations
    • Instruction cache for this talk is 16K 2-way
  • 32K 4-way associative data cache
  • Unified 1MB 4-way associative second level cache

9
Bandwidth Concerns
  • Prefetching can disrupt demand fetching
    • Need to model bus utilization
  • Modified SimpleScalar's memory hierarchy
    • Accurate modeling of bus usage
  • Two configurations of the L2 cache bus to main
    memory
    • 32 bytes/cycle
    • 8 bytes/cycle
  • Single port on the L2 cache
    • Shared by both the data and instruction caches
      (see the sketch below)
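
A toy version of that bus model, assuming a 64-byte L2 block (the
talk gives only the bus widths): one transfer occupies the single
port for ceil(block_bytes / bytes_per_cycle) cycles, so a prefetch
in flight delays any demand miss queued behind it.

    def transfer_cycles(block_bytes, bytes_per_cycle):
        return -(-block_bytes // bytes_per_cycle)  # ceiling division

    class L2Bus:
        def __init__(self, bytes_per_cycle):
            self.bytes_per_cycle = bytes_per_cycle
            self.busy_until = 0   # cycle at which the port frees up

        def request(self, now, block_bytes=64):
            # Demand misses and prefetches contend for the same port.
            start = max(now, self.busy_until)
            self.busy_until = start + transfer_cycles(
                block_bytes, self.bytes_per_cycle)
            return self.busy_until  # cycle at which the block arrives

    for bw in (32, 8):          # the two configurations from the talk
        bus = L2Bus(bw)
        bus.request(now=0)      # a prefetch grabs the bus first...
        print(bw, "bytes/cycle:", bus.request(now=0))  # ...delaying
                                                       # this demand miss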

10
Performance of Fetch Directed Prefetch
[Graph: performance of fetch directed prefetching. Chart labels:
89.9, 66% bus utilization, 41% bus utilization.]
11
Reducing Wasted Prefetches
  • Reduce bus utilization while retaining speedup
    • How to identify useless or redundant prefetches?
  • Variety of filtration techniques
    • FTQ Position Filtering
    • Cache Probe Filtering: use idle instruction cache
      ports to validate prefetches
      • Remove CPF
      • Enqueue CPF
    • Evict Filtering

12
Cache Probe Filtering
  • Use the instruction cache to validate FTQ entries
    for prefetch
    • FTQ entries are initially unmarked
    • If the cache block is in the i-cache, invalidate
      the FTQ entry
    • If the cache block is not in the i-cache, validate
      the FTQ entry
  • Validation can occur whenever a cache port is idle
    (see the sketch below)
    • When the instruction window is full
    • On an instruction cache miss
    • Lockup-free instruction cache
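
A sketch of that validation step. A set of tags stands in for
probing the icache tag array on an idle port; the unmarked/valid/
invalid marking follows the slide, while the names are invented.

    UNMARKED, VALID, INVALID = "unmarked", "valid", "invalid"

    class Entry:
        def __init__(self, block):
            self.block = block
            self.state = UNMARKED   # FTQ entries start out unmarked

    def cache_probe_filter(ftq, icache_tags, idle_ports):
        # Spend this cycle's idle cache ports validating unmarked
        # FTQ entries.
        for e in ftq:
            if idle_ports == 0:
                break
            if e.state != UNMARKED:
                continue
            idle_ports -= 1
            # Tag-array probe: a block already in the i-cache makes
            # the prefetch redundant, so the entry is invalidated.
            e.state = INVALID if e.block in icache_tags else VALID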

13
Cache Probe Filtering Techniques
  • Enqueue CPF
    • Only enqueue Valid prefetches
    • Conservative, low bandwidth approach
  • Remove CPF
    • By default, prefetch all FTQ entries
    • If idle cache ports are available for validation,
      do not prefetch entries that are found Invalid
      (see the sketch below)
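
Under the same marking scheme as the previous sketch, the two
policies differ only in what they do with entries that are still
unmarked:

    VALID, INVALID = "valid", "invalid"   # as in the previous sketch

    def enqueue_cpf_should_prefetch(entry):
        # Conservative: only prefetch entries a probe has validated,
        # so an unmarked entry waits for an idle port before using
        # any bus bandwidth.
        return entry.state == VALID

    def remove_cpf_should_prefetch(entry):
        # Aggressive default: prefetch everything except entries a
        # probe has explicitly found to be in the cache already.
        return entry.state != INVALID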

14
Performance of Filtering Techniques
[Graph: performance of the filtering techniques, 8 bytes/cycle bus.
Chart labels: 30% bus utilization, 55% bus utilization.]
15
Eviction Prefetching Example
  • If the branch predictor holds more state than the
    instruction cache
    • Mark evicted cache blocks in the branch predictor
    • Prefetch those blocks when they are next predicted
      (see the sketch after the figure)

[Diagram: instruction cache and FTB. A cache miss and a cache block
eviction are shown; FTB entries (indices 3, 27, and 15) carry evict
bits (Evict bit0 through Evict bit2) and an evict index. The
eviction sets a bit in the FTB, and the bit is set for the next
prediction.]
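
A sketch of the mechanism in the figure: the evicted cache block is
assumed to carry an index back into the FTB (the evict index), the
eviction sets that entry's evict bit, and the next prediction from
the entry triggers a prefetch. The figure keeps several evict bits
per FTB entry (bit0 through bit2, presumably one per cache block of
the fetch target); this sketch collapses them to one, and all names
are invented.

    class FTBEntry:
        def __init__(self, target_block):
            self.target_block = target_block
            self.evict_bit = False  # set when our block leaves the icache

    # FTB indices 3, 27, 15 as in the figure; targets are made up.
    ftb = {3: FTBEntry(0x300), 27: FTBEntry(0x1b0), 15: FTBEntry(0xf00)}

    def on_icache_eviction(evict_index):
        # The evicted cache block stored the FTB index that predicted
        # it; mark that FTB entry.
        if evict_index in ftb:
            ftb[evict_index].evict_bit = True

    def on_prediction(ftb_index):
        entry = ftb[ftb_index]
        if entry.evict_bit:
            entry.evict_bit = False
            return entry.target_block  # prefetch now, no port needed
        return None

    on_icache_eviction(27)     # block for FTB entry 27 was evicted
    print(on_prediction(27))   # -> its target block, prefetched now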
16
Performance of Filtering Techniques
[Graph: performance of the filtering techniques, 8 bytes/cycle bus.
Chart labels: 31% bus utilization, 20% bus utilization.]
17
Enqueue CPF and Eviction Prefetching
  • Effective combination of two low bandwidth
    approaches
    • Both attempt to prefetch entries not in the
      instruction cache
    • Enqueue CPF needs to wait on an idle cache port to
      prefetch
    • Eviction Prefetching can prefetch as soon as the
      prediction is made
  • Combined
    • Eviction Prefetching gives basic coverage
    • Enqueue CPF finds additional prefetches that Evict
      misses (see the sketch below)
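
One way the combination could be wired, reusing the earlier
sketches (Entry objects from the CPF sketch, and the block returned
by on_prediction in the eviction sketch); purely illustrative glue.

    VALID = "valid"   # as in the CPF sketch

    def collect_prefetches(ftq_entries, evicted_block):
        # ftq_entries: Entry objects; evicted_block: the block (or
        # None) returned by on_prediction at prediction time.
        issued = []
        if evicted_block is not None:
            # Eviction prefetching fires at prediction time, with no
            # need to wait for an idle cache port.
            issued.append(evicted_block)
        # Enqueue CPF then picks up entries the evict bits missed,
        # once an idle-port probe has validated them.
        issued.extend(e.block for e in ftq_entries if e.state == VALID)
        return issued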

18
Streaming Buffer Enhancements
  • All configurations used uniqueness filters and fully
    associative lookup
  • Base configurations
    • Single streaming buffer (SB1)
    • Dual streaming buffers (SB2)
    • Eight streaming buffers (SB8)
  • Cache Probe Filtering (CPF) enhancements
    • Filter out streaming buffer prefetches already in
      the icache
    • Stop filtering (see the sketch below)
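
A sketch of CPF applied to a streaming buffer, reusing the
StreamingBuffer shape from the earlier sketch: an idle-port tag
probe either skips a next block that is already cached or, under
the stop variant, halts the stream. How "stop filtering" behaves
exactly is not spelled out in the talk, so the stop-on-hit rule
here is an assumption.

    BLOCK_SIZE = 32   # same assumed block size as the earlier sketch

    def stream_next(buffer, icache_tags, have_idle_port,
                    stop_on_hit=False):
        # buffer: any object with .entries and .next_block, e.g. the
        # StreamingBuffer from the slide 3 sketch.
        if have_idle_port:
            while buffer.next_block in icache_tags:  # CPF: redundant
                if stop_on_hit:
                    return False                     # halt this stream
                buffer.next_block += BLOCK_SIZE      # skip cached block
        buffer.entries.append(buffer.next_block)     # issue prefetch
        buffer.next_block += BLOCK_SIZE
        return True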

19
Streaming Buffer Results
[Graph: streaming buffer results, 8 bytes/cycle bus. Chart labels:
36% bus utilization, 58% bus utilization.]
20
Selected Low Bandwidth Results
[Graph: selected low bandwidth results, 8 bytes/cycle bus.]
21
Selected High Bandwidth Results
[Graph: selected high bandwidth results, 32 bytes/cycle bus.]
22
Conclusion
  • Fetch Directed Prefetching
    • Accurate, just-in-time prefetching
  • Cache Probe Filtering
    • Reduces the bus bandwidth of fetch directed
      prefetching
    • Also useful for streaming buffers
  • Evict Filter
    • Provides accurate prefetching by identifying
      evicted cache blocks
  • Fully associative versus in-order prefetch buffer
    • Available in an upcoming tech report by the end
      of the year

23
Prefetching Tradeoffs
  • NLP
    • Simple, low bandwidth approach
    • No notion of prefetch usefulness
    • Limited timeliness
  • Streaming Buffers
    • Takes advantage of the latency of a cache miss
    • Can use low to moderate bandwidth with filtering
    • No notion of prefetch usefulness
  • Fetch Directed Prefetching
    • Prefetch based on the prediction stream
    • Can use low to moderate bandwidth with filtering
    • Most useful with accurate branch prediction