1
Fetch Directed Instruction Prefetching
  • Glenn Reinman, Brad Calder,
  • Department of Computer Science and Engineering,
  • University of California San Diego
  • and Todd Austin
  • Department of Electrical Engineering and Computer
    Science, University of Michigan

2
Introduction
  • Instruction supply critical to processor
    performance
    • Complicated by instruction cache misses
  • Instruction cache miss solutions
    • Increasing size or associativity of the
      instruction cache
    • Instruction cache prefetching
      • Which cache blocks to prefetch?
      • Timeliness of prefetch
      • Interference with demand misses

3
Prior Instruction Prefetching Work
  • Next line prefetching (NLP) (Smith)
    • Each cache block is tagged with an NLP bit
    • When a block is accessed during a fetch, its NLP
      bit determines whether the next sequential block
      is prefetched
    • Prefetches go into a fully associative buffer
  • Streaming buffers (Jouppi)
    • On a cache miss, sequential cache blocks, starting
      with the block that missed, are prefetched into a
      buffer
    • Buffers can use fully associative lookup
    • A uniqueness filter can avoid redundant prefetches
    • Multiple streaming buffers can be used together
      (see the sketch below)
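
To make the streaming buffer mechanism concrete, here is a minimal
Python sketch. The block size, buffer depth, and all names are
assumptions for illustration; only the behaviors (sequential
streaming from the missed block, fully associative lookup, a
uniqueness filter) come from the slide.

    from collections import deque

    BLOCK_SIZE = 32     # assumed cache block size in bytes (not in the talk)
    BUFFER_DEPTH = 4    # assumed number of entries per streaming buffer

    class StreamingBuffer:
        def __init__(self):
            self.entries = deque(maxlen=BUFFER_DEPTH)
            self.next_block = None

        def allocate(self, miss_block):
            # On a cache miss, stream sequential blocks into the buffer,
            # starting with the block that missed.
            self.entries.clear()
            self.next_block = miss_block
            while len(self.entries) < BUFFER_DEPTH:
                self.entries.append(self.next_block)  # models a prefetch
                self.next_block += BLOCK_SIZE

        def probe(self, block):
            # Fully associative lookup: the block may hit in any entry.
            return block in self.entries

    def handle_miss(buffers, miss_block):
        # Uniqueness filter: do not allocate a new stream if some buffer
        # already holds the missing block.
        if any(b.probe(miss_block) for b in buffers):
            return
        buffers[0].allocate(miss_block)  # replacement policy omitted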

4
Our Prefetching Approach
  • Desirable characteristics
    • Accuracy of prefetch: useful prefetches
    • Timeliness of prefetch: maximize prefetch gain
  • Fetch Directed Prefetching
    • Branch predictor runs ahead of the instruction
      cache
    • Instruction cache prefetch guided by the predicted
      instruction stream

5
Talk Overview
  • Fetch Target Queue (FTQ)
  • Fetch Directed Prefetching (FDP)
  • Filtering Techniques
  • Enhancements to Streaming Buffers
  • Bandwidth Considerations
  • Conclusions

6
Fetch Target Queue
  • Queue of instruction fetch addresses
  • Latency tolerance
    • Branch predictor can continue in the face of an
      icache miss
    • Instruction fetch can continue in the face of a
      branch predictor miss
  • When combined with a high bandwidth branch
    predictor, provides a stream of instruction
    addresses far in advance of the current PC
    (see the sketch below)
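
As a sketch, the FTQ is just a bounded queue sitting between the
branch predictor and instruction fetch. Only the 32-entry size is
from the talk; the interface names are invented.

    from collections import deque

    FTQ_SIZE = 32   # 32-entry FTQ, as in the talk

    class FetchTargetQueue:
        def __init__(self):
            self.q = deque()

        def enqueue(self, fetch_addr):
            # Predictor side: keeps running ahead while instruction
            # fetch stalls on an icache miss, as long as there is room.
            if len(self.q) >= FTQ_SIZE:
                return False          # FTQ full: the predictor stalls
            self.q.append(fetch_addr)
            return True

        def dequeue(self):
            # Fetch side: keeps consuming buffered targets even while
            # the branch predictor itself is stalled.
            return self.q.popleft() if self.q else None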

7
Fetch Directed Prefetching
[Diagram: the branch predictor feeds a 32-entry FTQ; a prefetch
enqueue stage (the filtration mechanisms) moves the current FTQ
prefetch candidate into the PIQ, and prefetched blocks land in a
32-entry fully associative buffer probed by instruction fetch.]
  • Stream of PCs contained in the FTQ guides prefetch
  • FTQ is searched in order for entries to prefetch
  • Prefetched cache blocks are stored in a fully
    associative queue
  • Fully associative queue and instruction cache are
    probed in parallel (see the sketch below)
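
A per-cycle sketch of the pipeline above, using plain Python
containers. The in-order FTQ scan and the parallel probe come from
the slide; everything else (names, one prefetch per cycle) is an
assumption.

    class FTQEntry:
        def __init__(self, block):
            self.block = block       # cache block of the fetch target
            self.prefetched = False

    def prefetch_step(ftq, prefetch_buffer, bus_idle):
        # Search the FTQ in order for the next entry to prefetch.
        for e in ftq:
            if not e.prefetched:
                if bus_idle:
                    prefetch_buffer.add(e.block)  # models the fill done
                    e.prefetched = True
                break                             # in-order scan stops

    def fetch_step(ftq, icache_blocks, prefetch_buffer):
        if not ftq:
            return None
        e = ftq.pop(0)
        # Probe the icache and the fully associative buffer in parallel.
        return (e.block,
                e.block in icache_blocks or e.block in prefetch_buffer)

    ftq = [FTQEntry(0x100), FTQEntry(0x120)]
    buf = set()
    prefetch_step(ftq, buf, bus_idle=True)
    print(fetch_step(ftq, set(), buf))  # (256, True): served from buffer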

8
Methodology
  • SimpleScalar Alpha 3.0 tool set (Burger, Austin)
  • SPEC95 C benchmarks
    • Fast forwarded past the initialization portion of
      each benchmark
  • Can issue 8 instructions per cycle
  • 128 entry reorder buffer
  • 32 entry load/store buffer
  • Variety of instruction cache sizes
    • 16K 2-way and 4-way associative
    • 32K 2-way associative
    • Tried both single and dual ported configurations
    • Instruction cache for this talk is 16K 2-way
  • 32K 4-way associative data cache
  • Unified 1MB 4-way associative second level cache

9
Bandwidth Concerns
  • Prefetching can disrupt demand fetching
    • Need to model bus utilization
  • Modified SimpleScalar's memory hierarchy
    • Accurate modeling of bus usage
  • Two configurations of the L2 cache bus to main
    memory
    • 32 bytes/cycle
    • 8 bytes/cycle
  • Single port on the L2 cache
    • Shared by both the data and instruction caches
      (see the sketch below)
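
A toy version of that bus model, assuming a 64-byte L2 block (the
talk gives only the bus widths): one transfer occupies the single
port for ceil(block_bytes / bytes_per_cycle) cycles, so a prefetch
in flight delays any demand miss queued behind it.

    def transfer_cycles(block_bytes, bytes_per_cycle):
        return -(-block_bytes // bytes_per_cycle)  # ceiling division

    class L2Bus:
        def __init__(self, bytes_per_cycle):
            self.bytes_per_cycle = bytes_per_cycle
            self.busy_until = 0   # cycle at which the port frees up

        def request(self, now, block_bytes=64):
            # Demand misses and prefetches contend for the same port.
            start = max(now, self.busy_until)
            self.busy_until = start + transfer_cycles(
                block_bytes, self.bytes_per_cycle)
            return self.busy_until  # cycle at which the block arrives

    for bw in (32, 8):          # the two configurations from the talk
        bus = L2Bus(bw)
        bus.request(now=0)      # a prefetch grabs the bus first...
        print(bw, "bytes/cycle:", bus.request(now=0))  # ...delaying
                                                       # this demand miss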

10
Performance of Fetch Directed Prefetch
[Graph: performance of fetch directed prefetching. Chart labels:
89.9, 66% bus utilization, 41% bus utilization.]
11
Reducing Wasted Prefetches
  • Reduce bus utilization while retaining speedup
    • How to identify useless or redundant prefetches?
  • Variety of filtration techniques
    • FTQ Position Filtering
    • Cache Probe Filtering: use idle instruction cache
      ports to validate prefetches
      • Remove CPF
      • Enqueue CPF
    • Evict Filtering

12
Cache Probe Filtering
  • Use the instruction cache to validate FTQ entries
    for prefetch
    • FTQ entries are initially unmarked
    • If the cache block is in the i-cache, invalidate
      the FTQ entry
    • If the cache block is not in the i-cache, validate
      the FTQ entry
  • Validation can occur whenever a cache port is idle
    (see the sketch below)
    • When the instruction window is full
    • On an instruction cache miss
    • Lockup-free instruction cache
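
A sketch of that validation step. A set of tags stands in for
probing the icache tag array on an idle port; the unmarked/valid/
invalid marking follows the slide, while the names are invented.

    UNMARKED, VALID, INVALID = "unmarked", "valid", "invalid"

    class Entry:
        def __init__(self, block):
            self.block = block
            self.state = UNMARKED   # FTQ entries start out unmarked

    def cache_probe_filter(ftq, icache_tags, idle_ports):
        # Spend this cycle's idle cache ports validating unmarked
        # FTQ entries.
        for e in ftq:
            if idle_ports == 0:
                break
            if e.state != UNMARKED:
                continue
            idle_ports -= 1
            # Tag-array probe: a block already in the i-cache makes
            # the prefetch redundant, so the entry is invalidated.
            e.state = INVALID if e.block in icache_tags else VALID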

13
Cache Probe Filtering Techniques
  • Enqueue CPF
    • Only enqueue Valid prefetches
    • Conservative, low bandwidth approach
  • Remove CPF
    • By default, prefetch all FTQ entries
    • If idle cache ports are available for validation,
      do not prefetch entries that are found Invalid
      (see the sketch below)
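
Under the same marking scheme as the previous sketch, the two
policies differ only in what they do with entries that are still
unmarked:

    VALID, INVALID = "valid", "invalid"   # as in the previous sketch

    def enqueue_cpf_should_prefetch(entry):
        # Conservative: only prefetch entries a probe has validated,
        # so an unmarked entry waits for an idle port before using
        # any bus bandwidth.
        return entry.state == VALID

    def remove_cpf_should_prefetch(entry):
        # Aggressive default: prefetch everything except entries a
        # probe has explicitly found to be in the cache already.
        return entry.state != INVALID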

14
Performance of Filtering Techniques
[Graph: performance of the filtering techniques, 8 bytes/cycle bus.
Chart labels: 30% bus utilization, 55% bus utilization.]
15
Eviction Prefetching Example
  • If the branch predictor holds more state than the
    instruction cache
    • Mark evicted cache blocks in the branch predictor
    • Prefetch those blocks when they are next predicted
      (see the sketch after the figure)

[Diagram: instruction cache and FTB. A cache miss and a cache block
eviction are shown; FTB entries (indices 3, 27, and 15) carry evict
bits (Evict bit0 through Evict bit2) and an evict index. The
eviction sets a bit in the FTB, and the bit is set for the next
prediction.]
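
A sketch of the mechanism in the figure: the evicted cache block is
assumed to carry an index back into the FTB (the evict index), the
eviction sets that entry's evict bit, and the next prediction from
the entry triggers a prefetch. The figure keeps several evict bits
per FTB entry (bit0 through bit2, presumably one per cache block of
the fetch target); this sketch collapses them to one, and all names
are invented.

    class FTBEntry:
        def __init__(self, target_block):
            self.target_block = target_block
            self.evict_bit = False  # set when our block leaves the icache

    # FTB indices 3, 27, 15 as in the figure; targets are made up.
    ftb = {3: FTBEntry(0x300), 27: FTBEntry(0x1b0), 15: FTBEntry(0xf00)}

    def on_icache_eviction(evict_index):
        # The evicted cache block stored the FTB index that predicted
        # it; mark that FTB entry.
        if evict_index in ftb:
            ftb[evict_index].evict_bit = True

    def on_prediction(ftb_index):
        entry = ftb[ftb_index]
        if entry.evict_bit:
            entry.evict_bit = False
            return entry.target_block  # prefetch now, no port needed
        return None

    on_icache_eviction(27)     # block for FTB entry 27 was evicted
    print(on_prediction(27))   # -> its target block, prefetched now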
16
Performance of Filtering Techniques
[Graph: performance of the filtering techniques, 8 bytes/cycle bus.
Chart labels: 31% bus utilization, 20% bus utilization.]
17
Enqueue CPF and Eviction Prefetching
  • Effective combination of two low bandwidth
    approaches
    • Both attempt to prefetch entries not in the
      instruction cache
    • Enqueue CPF needs to wait on an idle cache port to
      prefetch
    • Eviction Prefetching can prefetch as soon as the
      prediction is made
  • Combined
    • Eviction Prefetching gives basic coverage
    • Enqueue CPF finds additional prefetches that Evict
      misses (see the sketch below)
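
One way the combination could be wired, reusing the earlier
sketches (Entry objects from the CPF sketch, and the block returned
by on_prediction in the eviction sketch); purely illustrative glue.

    VALID = "valid"   # as in the CPF sketch

    def collect_prefetches(ftq_entries, evicted_block):
        # ftq_entries: Entry objects; evicted_block: the block (or
        # None) returned by on_prediction at prediction time.
        issued = []
        if evicted_block is not None:
            # Eviction prefetching fires at prediction time, with no
            # need to wait for an idle cache port.
            issued.append(evicted_block)
        # Enqueue CPF then picks up entries the evict bits missed,
        # once an idle-port probe has validated them.
        issued.extend(e.block for e in ftq_entries if e.state == VALID)
        return issued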

18
Streaming Buffer Enhancements
  • All configurations used uniqueness filters and fully
    associative lookup
  • Base configurations
    • Single streaming buffer (SB1)
    • Dual streaming buffers (SB2)
    • Eight streaming buffers (SB8)
  • Cache Probe Filtering (CPF) enhancements
    • Filter out streaming buffer prefetches already in
      the icache
    • Stop filtering (see the sketch below)
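
A sketch of CPF applied to a streaming buffer, reusing the
StreamingBuffer shape from the earlier sketch: an idle-port tag
probe either skips a next block that is already cached or, under
the stop variant, halts the stream. How "stop filtering" behaves
exactly is not spelled out in the talk, so the stop-on-hit rule
here is an assumption.

    BLOCK_SIZE = 32   # same assumed block size as the earlier sketch

    def stream_next(buffer, icache_tags, have_idle_port,
                    stop_on_hit=False):
        # buffer: any object with .entries and .next_block, e.g. the
        # StreamingBuffer from the slide 3 sketch.
        if have_idle_port:
            while buffer.next_block in icache_tags:  # CPF: redundant
                if stop_on_hit:
                    return False                     # halt this stream
                buffer.next_block += BLOCK_SIZE      # skip cached block
        buffer.entries.append(buffer.next_block)     # issue prefetch
        buffer.next_block += BLOCK_SIZE
        return True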

19
Streaming Buffer Results
[Graph: streaming buffer results, 8 bytes/cycle bus. Chart labels:
36% bus utilization, 58% bus utilization.]
20
Selected Low Bandwidth Results
[Graph: selected low bandwidth results, 8 bytes/cycle bus.]
21
Selected High Bandwidth Results
[Graph: selected high bandwidth results, 32 bytes/cycle bus.]
22
Conclusion
  • Fetch Directed Prefetching
    • Accurate, just-in-time prefetching
  • Cache Probe Filtering
    • Reduces the bus bandwidth of fetch directed
      prefetching
    • Also useful for streaming buffers
  • Evict Filter
    • Provides accurate prefetching by identifying
      evicted cache blocks
  • Fully associative versus in-order prefetch buffer
    • Available in an upcoming tech report by the end
      of the year

23
Prefetching Tradeoffs
  • NLP
    • Simple, low bandwidth approach
    • No notion of prefetch usefulness
    • Limited timeliness
  • Streaming Buffers
    • Takes advantage of the latency of a cache miss
    • Can use low to moderate bandwidth with filtering
    • No notion of prefetch usefulness
  • Fetch Directed Prefetching
    • Prefetch based on the prediction stream
    • Can use low to moderate bandwidth with filtering
    • Most useful with accurate branch prediction