Title: Fetch Directed Instruction Prefetching

1. Fetch Directed Instruction Prefetching
- Glenn Reinman, Brad Calder
- Department of Computer Science and Engineering, University of California San Diego
- and Todd Austin
- Department of Electrical Engineering and Computer Science, University of Michigan
2. Introduction
- Instruction supply is critical to processor performance
  - Complicated by instruction cache misses
- Instruction cache miss solutions
  - Increasing size or associativity of the instruction cache
  - Instruction cache prefetching
    - Which cache blocks to prefetch?
    - Timeliness of prefetch
    - Interference with demand misses
3. Prior Instruction Prefetching Work
- Next line prefetching (NLP) (Smith)
  - Each cache block is tagged with an NLP bit
  - When a block is accessed during a fetch, the NLP bit determines whether the next sequential block is prefetched
  - Prefetch into a fully associative buffer
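The NLP mechanism can be sketched as a toy model (a sketch, not the hardware: the 32-byte block size, 4-entry buffer, and fill policy are assumptions for illustration):

```python
# Toy model of next-line prefetching (NLP): each cache block carries an
# NLP bit; when a fetch touches a block whose bit is set, the next
# sequential block is prefetched into a small fully associative buffer.
# Assumed parameters: 32-byte blocks, 4-entry prefetch buffer.
BLOCK = 32

class NLPCache:
    def __init__(self):
        self.blocks = {}        # block address -> NLP bit
        self.prefetch_buf = []  # fully associative prefetch buffer

    def fetch(self, pc):
        addr = pc // BLOCK * BLOCK
        hit = addr in self.blocks or addr in self.prefetch_buf
        if addr in self.prefetch_buf:       # promote a prefetched block
            self.prefetch_buf.remove(addr)
        self.blocks.setdefault(addr, True)  # fill on miss, NLP bit set
        if self.blocks[addr]:               # NLP bit: prefetch next block
            nxt = addr + BLOCK
            if nxt not in self.blocks and nxt not in self.prefetch_buf:
                self.prefetch_buf.append(nxt)
                if len(self.prefetch_buf) > 4:
                    self.prefetch_buf.pop(0)
        return hit
```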
- Streaming buffers (Jouppi)
  - On a cache miss, sequential cache blocks, starting with the block that missed, are prefetched into a buffer
  - Buffer can use fully associative lookup
  - A uniqueness filter can avoid redundant prefetches
  - Multiple streaming buffers can be used together
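A Jouppi-style streaming buffer can be sketched similarly (a sketch: the depth-4 buffer, 32-byte blocks, and set-based uniqueness filter are simplifying assumptions):

```python
from collections import deque

# Toy streaming buffer: on a miss, enqueue sequential block addresses
# starting at the missing block. A uniqueness filter (the set of
# addresses already queued) avoids redundant prefetches.
# Assumed parameters: 32-byte blocks, depth-4 buffer.
BLOCK = 32

class StreamingBuffer:
    def __init__(self, depth=4):
        self.buf = deque()
        self.queued = set()   # uniqueness filter
        self.depth = depth

    def on_miss(self, addr):
        addr = addr // BLOCK * BLOCK
        for i in range(self.depth):
            blk = addr + i * BLOCK
            if blk not in self.queued:      # skip redundant prefetches
                self.buf.append(blk)
                self.queued.add(blk)

    def lookup(self, addr):
        # Fully associative lookup over the whole buffer.
        return addr // BLOCK * BLOCK in self.buf
```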
4. Our Prefetching Approach
- Desirable characteristics
  - Accuracy of prefetch
    - Useful prefetches
  - Timeliness of prefetch
    - Maximize prefetch gain
- Fetch Directed Prefetching
  - Branch predictor runs ahead of the instruction cache
  - Instruction cache prefetch is guided by the predicted instruction stream
5. Talk Overview
- Fetch Target Queue (FTQ)
- Fetch Directed Prefetching (FDP)
- Filtering Techniques
- Enhancements to Streaming Buffers
- Bandwidth Considerations
- Conclusions
6. Fetch Target Queue
- Queue of instruction fetch addresses
- Latency tolerance
  - Branch predictor can continue in the face of an icache miss
  - Instruction fetch can continue in the face of a branch predictor miss
- When combined with a high bandwidth branch predictor
  - Provides a stream of instruction addresses far in advance of the current PC
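The decoupling the FTQ provides can be sketched as a producer/consumer queue between the branch predictor and the fetch engine (a sketch: the 32-entry size and the stall behavior are modeling assumptions):

```python
from collections import deque

# Toy FTQ: the branch predictor produces fetch addresses and the fetch
# engine consumes them. When one side stalls (icache miss or predictor
# miss), the other keeps running until the queue fills or drains.
# Assumed: 32-entry FTQ.
class FTQ:
    def __init__(self, size=32):
        self.q = deque(maxlen=size)

    def produce(self, fetch_addr):
        # Branch predictor enqueues a predicted fetch address.
        if len(self.q) < self.q.maxlen:
            self.q.append(fetch_addr)
            return True
        return False          # FTQ full: predictor stalls

    def consume(self):
        # Fetch engine dequeues the next address, if any.
        return self.q.popleft() if self.q else None
```

During an icache miss `consume` stops being called while `produce` keeps filling the queue; the accumulated addresses are exactly the stream that fetch directed prefetching works from.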
7. Fetch Directed Prefetching

[Figure: FDP architecture. The branch predictor fills a 32-entry FTQ; a prefetch enqueue stage (with filtration mechanisms) selects the current FTQ prefetch candidates into the PIQ; prefetched blocks are held in a 32-entry fully associative buffer alongside the instruction fetch path.]

- Stream of PCs contained in the FTQ guides prefetch
- FTQ is searched in-order for entries to prefetch
- Prefetched cache blocks are stored in a fully associative queue
- Fully associative queue and instruction cache are probed in parallel
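The in-order FTQ scan described above might look like the following (a sketch: the FTQ and caches are modeled as plain Python collections, the PIQ and port arbitration are omitted, and the issue limit is an assumption):

```python
# Toy fetch directed prefetch: scan the FTQ in-order for candidate cache
# block addresses, skip blocks already in the icache or the prefetch
# buffer, and issue the rest, oldest first.
# Assumed: 32-entry fully associative prefetch buffer.
def fdp_prefetch(ftq, icache, prefetch_buf, max_issue=2):
    issued = []
    for addr in ftq:                      # in-order FTQ scan
        if len(issued) == max_issue:
            break
        if addr in icache or addr in prefetch_buf:
            continue                      # already present: no prefetch
        prefetch_buf.append(addr)
        if len(prefetch_buf) > 32:        # fully associative, 32 entries
            prefetch_buf.pop(0)
        issued.append(addr)
    return issued
```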
8. Methodology
- SimpleScalar Alpha 3.0 tool set (Burger, Austin)
- SPEC95 C benchmarks
  - Fast forwarded past the initialization portion of the benchmarks
- Can issue 8 instructions per cycle
  - 128 entry reorder buffer
  - 32 entry load/store buffer
- Variety of instruction cache sizes
  - 16K 2-way and 4-way associative
  - 32K 2-way associative
  - Tried both single and dual ported configurations
  - Instruction cache size for this talk is 16K 2-way
- 32K 4-way associative data cache
- Unified 1MB 4-way associative second level cache
9. Bandwidth Concerns
- Prefetching can disrupt demand fetching
  - Need to model bus utilization
- Modified SimpleScalar's memory hierarchy
  - Accurate modeling of bus usage
- Two configurations of the L2 cache bus to main memory
  - 32 bytes/cycle
  - 8 bytes/cycle
- Single port on the L2 cache
  - Shared by both data and instruction caches
10. Performance of Fetch Directed Prefetch

[Chart: speedup of fetch directed prefetching; data labels from the slide: 89.9, 66 bus utilization, 41 bus utilization.]
11. Reducing Wasted Prefetches
- Reduce bus utilization while retaining speedup
- How to identify useless or redundant prefetches?
- Variety of filtration techniques
  - FTQ Position Filtering
  - Cache Probe Filtering
    - Use idle instruction cache ports to validate prefetches
    - Remove CPF
    - Enqueue CPF
  - Evict Filtering
12. Cache Probe Filtering
- Use the instruction cache to validate FTQ entries for prefetch
- FTQ entries are initially unmarked
  - If the cache block is in the i-cache, invalidate the FTQ entry
  - If the cache block is not in the i-cache, validate the FTQ entry
- Validation can occur whenever a cache port is idle
  - When the instruction window is full
  - On an instruction cache miss (with a lockup-free instruction cache)
13. Cache Probe Filtering Techniques
- Enqueue CPF
  - Only enqueue Valid prefetches
  - Conservative, low bandwidth approach
- Remove CPF
  - By default, prefetch all FTQ entries
  - If idle cache ports are available for validation, do not prefetch entries found Invalid
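The two CPF policies can be sketched as a filter over FTQ entries carrying a three-state mark (a sketch: the idle-port probe is modeled as a simple cache membership test, and the entry representation is an assumption):

```python
# Toy cache probe filtering. Each FTQ entry is Unmarked until an idle
# icache port probes the tag array: entries already in the cache become
# Invalid (no prefetch needed), absent ones become Valid. Enqueue CPF
# prefetches only Valid entries; Remove CPF prefetches everything
# except entries known Invalid.
UNMARKED, VALID, INVALID = "unmarked", "valid", "invalid"

def probe(entries, icache, idle_ports):
    # Validate up to idle_ports unmarked entries using spare cache ports.
    for e in entries:
        if idle_ports == 0:
            break
        if e["mark"] == UNMARKED:
            e["mark"] = INVALID if e["addr"] in icache else VALID
            idle_ports -= 1

def enqueue_cpf(entries):
    return [e["addr"] for e in entries if e["mark"] == VALID]

def remove_cpf(entries):
    return [e["addr"] for e in entries if e["mark"] != INVALID]
```

With few idle ports, Enqueue CPF issues almost nothing (conservative), while Remove CPF still issues every entry it could not disprove.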
14. Performance of Filtering Techniques

[Chart: filtering results with an 8 bytes/cycle bus; data labels from the slide: 30 bus utilization, 55 bus utilization.]
15. Eviction Prefetching Example
- If the branch predictor holds more state than the instruction cache
  - Mark evicted cache blocks in the branch predictor
  - Prefetch those blocks when predicted
[Figure: eviction prefetching example. Each FTB entry carries evict bits (bit0-bit2) and an evict index; when the instruction cache evicts a block, the corresponding evict bit is set in the FTB entry (example FTB indices shown: 3, 27, 15), and a set bit triggers a prefetch on that entry's next prediction.]
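The evict-bit mechanism can be sketched as follows (a sketch: the number of evict bits per entry and the clear-on-predict policy are simplifying assumptions):

```python
# Toy eviction prefetching: the FTB (fetch target buffer) holds more
# state than the icache, so each FTB entry carries evict bits. When the
# icache evicts a block covered by an entry, the corresponding bit is
# set; the next time the branch predictor predicts that entry, set bits
# trigger prefetches for the evicted blocks.
# Assumed: each FTB entry covers BITS cache blocks.
BITS = 3

class EvictFTB:
    def __init__(self):
        self.evict = {}  # ftb_index -> list of evict bits

    def on_evict(self, ftb_index, evict_index):
        # The cache evicted a block covered by this FTB entry.
        bits = self.evict.setdefault(ftb_index, [0] * BITS)
        bits[evict_index] = 1

    def on_predict(self, ftb_index):
        # Prediction made: prefetch blocks whose evict bit is set.
        bits = self.evict.get(ftb_index, [0] * BITS)
        to_prefetch = [i for i, b in enumerate(bits) if b]
        self.evict[ftb_index] = [0] * BITS   # clear after prefetching
        return to_prefetch
```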
16. Performance of Filtering Techniques

[Chart: filtering results with an 8 bytes/cycle bus; data labels from the slide: 31 bus utilization, 20 bus utilization.]
17. Enqueue CPF and Eviction Prefetching
- Effective combination of two low bandwidth approaches
- Both attempt to prefetch entries not in the instruction cache
  - Enqueue CPF needs to wait on an idle cache port to prefetch
  - Eviction Prefetching can prefetch as soon as the prediction is made
- Combined
  - Eviction Prefetching gives basic coverage
  - Enqueue CPF finds additional prefetches that Evict misses
18. Streaming Buffer Enhancements
- All configurations used uniqueness filters and fully associative lookup
- Base configurations
  - Single streaming buffer (SB1)
  - Dual streaming buffers (SB2)
  - Eight streaming buffers (SB8)
- Cache Probe Filtering (CPF) enhancements
  - Filter out streaming buffer prefetches already in the icache
  - Stop filtering
19. Streaming Buffer Results

[Chart: streaming buffer results with an 8 bytes/cycle bus; data labels from the slide: 36 bus utilization, 58 bus utilization.]
20. Selected Low Bandwidth Results

[Chart: selected results with an 8 bytes/cycle bus.]
21. Selected High Bandwidth Results

[Chart: selected results with a 32 bytes/cycle bus.]
22. Conclusion
- Fetch Directed Prefetching
  - Accurate, just-in-time prefetching
- Cache Probe Filtering
  - Reduces the bus bandwidth of fetch directed prefetching
  - Also useful for Streaming Buffers
- Evict Filter
  - Provides accurate prefetching by identifying evicted cache blocks
- Fully associative versus in-order prefetch buffer
  - Available in an upcoming tech report by the end of the year
23. Prefetching Tradeoffs
- NLP
  - Simple, low bandwidth approach
  - No notion of prefetch usefulness
  - Limited timeliness
- Streaming Buffers
  - Takes advantage of the latency of a cache miss
  - Can use low to moderate bandwidth with filtering
  - No notion of prefetch usefulness
- Fetch Directed Prefetching
  - Prefetch based on the prediction stream
  - Can use low to moderate bandwidth with filtering
  - Most useful with accurate branch prediction