1
Decoupled Architecture for Data Prefetching
Jichuan Chang <chang@cs.wisc.edu>
Kai Xu <xuk@cs.wisc.edu>
2
Outline
  • Motivation
  • Design and Evaluation
  • Results
  • Conclusions

3
Motivation
  • Processor-memory performance gap
  • Prefetching helps, but it has overhead.
  • Transistors are cheap; will a coprocessor help?

[Diagram: the main processor feeds access information to the prefetching coprocessor, which issues prefetch requests to the cache; data moves over the shared L1-L2 internal bus]
4
Why a dedicated coprocessor?
  • Simple
    • It simplifies the design of the main processor.
  • Powerful
    • It can (hopefully) exploit complex algorithms.
    • It handles computation overhead (e.g., pattern
      computation, address computation).
  • Flexible
    • It can (hopefully) adapt to different situations.
    • It can implement different algorithms.
  • But are these true?

5
The Generic Design
[Diagram: the generic PCP design, an ALU plus logic that decides what to prefetch, when to prefetch, and where to place the data]
6
Data Prefetching Techniques
  • Regular Access Prefetching
    • Tagged Next-Block Lookahead [Smith 82]
      • Exploits sequential access patterns
    • Stride Prefetching [Baer & Chen 91]
      • Exploits strided access patterns
  • Dependence-based Prefetching [Roth et al. 98]
    • Discovers linked-data-structure access patterns
  • Dead-Block Correlation [Lai et al. 01]
    • History-based correlation prediction
  • Stream Buffer [Jouppi 90]
    • Reduces cache pollution

7
Simulation Settings
  • SimpleScalar v3.0
    • Modified sim-outorder to implement information
      sharing between the MP and the PCP
    • Modified the cache module to implement
      • Prefetching schemes (between the L1 and L2 caches)
      • Prefetch queue (length 16), bus sharing/contention
      • Stream buffer
  • Memory parameters
    • L1 data cache: 4KB, 32B lines, 4-way associative
    • L2 cache: 64KB, 64B lines, 4-way associative
    • Stream buffer: 8 entries, fully associative, 1-cycle hit
    • Hit latency (cycles): L1 = 1, L2 = 12, Mem = 70 (2)
    • Pipelined bus; bus contention and latency are modeled
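For reference, these memory parameters map onto sim-outorder's stock
cache and memory flags. A plausible baseline invocation is sketched
below (the prefetching extensions themselves were custom modifications
and are not shown; "70 (2)" is read here as SimpleScalar's two-part
memory latency, 70 cycles for the first chunk and 2 per chunk after):

    # 4KB L1D = 32 sets x 32B lines x 4 ways; 64KB L2 = 256 sets x 64B x 4 ways
    sim-outorder \
      -cache:dl1 dl1:32:32:4:l  -cache:dl1lat 1  \
      -cache:dl2 ul2:256:64:4:l -cache:dl2lat 12 \
      -mem:lat 70 2 \
      benchmark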

8
Benchmarks
  • From SPEC95
  • gcc
  • compress
  • swim
  • tomcatv
  • Microbenchmark
  • Matrix multiplication (128 × 128, double)
  • Binary tree (1M nodes, similar to treeadd)

9
Results (IPC)
10
Results (Miss Ratio)
11
Results (Prefetch Accuracy)
12
L1-L2 Traffic Increase
13
Results (Delay Tolerance)
  • How many cycles of delay can the PCP tolerate?
  • More delay means
    • Less useful prefetches (data can't get back before
      the demand references)
    • More pollution (due to outdated information)
    • Fewer prefetches (due to bus contention)
  • To avoid pollution, implement the prefetch queue as a
    circular buffer (see the sketch after this list)
    • Overwrite outdated entries when the queue is full
    • The major effect of larger delay is then fewer prefetches
  • Memory behavior is hard to model in SimpleScalar
    • Predetermined latency, no wake-up, no MSHRs
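A minimal C sketch of such a circular prefetch queue (the 16-entry
length is from the settings slide; the drop-oldest policy is an
assumption consistent with "overwrite outdated entries"):

    #include <stdint.h>

    #define PQ_LEN 16  /* prefetch queue length from the settings slide */

    /* Circular prefetch queue: when full, the oldest entry is
     * overwritten instead of stalling, since it is the most likely
     * to be outdated by the time it would issue on the bus. */
    typedef struct {
        uint64_t addr[PQ_LEN];  /* pending prefetch addresses */
        int head;               /* next entry to issue */
        int tail;               /* next free slot */
        int count;              /* number of valid entries */
    } prefetch_queue;

    void pq_push(prefetch_queue *q, uint64_t addr)
    {
        if (q->count == PQ_LEN) {            /* full: drop the oldest */
            q->head = (q->head + 1) % PQ_LEN;
            q->count--;
        }
        q->addr[q->tail] = addr;
        q->tail = (q->tail + 1) % PQ_LEN;
        q->count++;
    }

    int pq_pop(prefetch_queue *q, uint64_t *addr)  /* 1 = got an address */
    {
        if (q->count == 0)
            return 0;
        *addr = q->addr[q->head];
        q->head = (q->head + 1) % PQ_LEN;
        q->count--;
        return 1;
    }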

14
Delay tolerance
  • Preliminary result
  • For almost all schemes on all benchmarks,
    the PCP can tolerate 8 cycles of delay

15
Can we integrate different schemes?
  • Different applications need different schemes
  • Brute-force approach
    • Use both tagged and stride prefetching
    • Good speedup, but much more memory traffic
  • Adapt the prefetching policy dynamically?
    • Share the same hardware table
      • Uses similar matching schemes
      • Hard to reconfigure/flush on context switches
    • Use separate tables
      • More hardware
      • Similar to a tournament predictor (just a thought;
        see the sketch below)
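As a sketch of that thought (not a design from the talk): a per-PC
table of 2-bit saturating counters, trained by which scheme's
prefetches turn out useful, could arbitrate between the two tables
much like a tournament branch predictor chooses between predictors.

    #include <stdint.h>

    #define SEL_ENTRIES 1024

    /* 2-bit saturating chooser per PC hash:
     * counts 0-1 lean toward tagged prefetching, 2-3 toward stride. */
    static uint8_t selector[SEL_ENTRIES];  /* init to 2: weakly stride */

    int use_stride(uint64_t pc)
    {
        return selector[(pc >> 2) % SEL_ENTRIES] >= 2;
    }

    /* Train only when exactly one scheme's prefetch proved useful. */
    void train_selector(uint64_t pc, int tagged_useful, int stride_useful)
    {
        uint8_t *c = &selector[(pc >> 2) % SEL_ENTRIES];
        if (stride_useful && !tagged_useful && *c < 3)
            (*c)++;
        else if (tagged_useful && !stride_useful && *c > 0)
            (*c)--;
    }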

16
Conclusions
  • PCP helps performance! (2-30% speedup)
  • PCP handles prefetching and can tolerate some delay
  • Different schemes work for different applications
    • They require different information (from different places)
    • The PCP should be placed close to the information source
    • Not easy to integrate different schemes
  • Limitations of our approach
    • The PCP is not fully utilized
    • It relies on tables (caches/queues/buffers)
      • DBCP requires a large history table (7.6 MB of memory!)
  • Delay is critical to performance
    • It limits the complexity of prefetch schemes
    • It also determines where to place the PCP

17
Future Work
  • Evaluate more prefetching schemes
    • Dependence-based prefetching, etc.
  • PCP running ahead
    • Probably with the help of a trace cache
  • Fully utilize the PCP
    • Needs checkpoint/rollback mechanisms
  • Coprocessor support for other functionality
    • Branch prediction, power management
  • PCP for multiprocessors
    • Suitable for one-block lookahead
    • Needs changes to the cache-coherence protocol

18
Thank You! Questions?
19
Backup Slides
20
Tagged Prefetching
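(The original slide is a diagram.) Tagged prefetching adds one tag bit
per cache block; the bit is set when a block arrives via prefetch and
cleared on its first demand reference. A minimal C sketch of the
trigger logic, with issue_prefetch() standing in for enqueueing on the
PCP's prefetch queue:

    #include <stdint.h>

    #define BLOCK 32  /* L1 line size from the settings slide */

    typedef struct {
        int valid;
        int tag;   /* 1 = prefetched and not yet demand-referenced */
    } cache_block;

    static void issue_prefetch(uint64_t addr) { (void)addr; /* to PCP queue */ }

    /* Tagged next-block lookahead [Smith 82]: prefetch block b+1 on a
     * demand miss to block b, and on the first demand hit to a block
     * that was itself prefetched (its tag bit is still set). */
    void on_demand_access(cache_block *b, uint64_t block_addr, int hit)
    {
        if (!hit || b->tag) {
            b->tag = 0;  /* block is now demand-referenced */
            issue_prefetch(block_addr + BLOCK);
        }
    }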
21
Stride Prefetching
  • Reference Prediction Table (RPT)
    • Organized like a cache, indexed by the load's PC
    • Each entry: (last data address, stride, state)
  • A per-entry state machine decides when the stride is
    stable enough to prefetch
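A simplified C sketch of the RPT update (Baer & Chen's actual design
uses a four-state machine, initial/transient/steady/no-prediction;
this two-state version keeps only the "same stride seen twice" test):

    #include <stdint.h>

    #define RPT_ENTRIES 64

    typedef struct {
        uint64_t pc;         /* tag: PC of the load instruction */
        uint64_t last_addr;  /* last data address it accessed */
        int64_t  stride;     /* last observed stride */
        int      steady;     /* 1 once the same stride repeats */
    } rpt_entry;

    static rpt_entry rpt[RPT_ENTRIES];

    static void issue_prefetch(uint64_t addr) { (void)addr; /* to PCP queue */ }

    /* Called by the PCP for each load (pc, addr) observed from the MP. */
    void rpt_access(uint64_t pc, uint64_t addr)
    {
        rpt_entry *e = &rpt[(pc >> 2) % RPT_ENTRIES];
        if (e->pc != pc) {   /* miss in the RPT: (re)allocate the entry */
            e->pc = pc;
            e->last_addr = addr;
            e->stride = 0;
            e->steady = 0;
            return;
        }
        int64_t stride = (int64_t)(addr - e->last_addr);
        e->steady = (stride != 0 && stride == e->stride);
        e->stride = stride;
        e->last_addr = addr;
        if (e->steady)       /* stable stride: prefetch one ahead */
            issue_prefetch(addr + stride);
    }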

22
Dependence-based Prefetching
  • Potential Producer Window
  • Correlation Table
  • One step ahead
  • Jump-pointer generation/maintenance
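A much-simplified C sketch of the mechanism from Roth et al. (direct
pointer chains only, no consumer offsets, no jump pointers): the PPW
remembers recently loaded values and the loads that produced them; a
load whose base address matches a remembered value establishes a
producer in the correlation table, and whenever that producer loads
again, its value is prefetched one step ahead as the likely next node.

    #include <stdint.h>

    #define PPW_LEN 16
    #define CT_LEN  64

    typedef struct { uint64_t value, pc; } ppw_entry;  /* value + producer PC */
    typedef struct { uint64_t producer_pc; int valid; } ct_entry;

    static ppw_entry ppw[PPW_LEN];
    static int ppw_next;
    static ct_entry ct[CT_LEN];

    static void issue_prefetch(uint64_t addr) { (void)addr; /* to PCP queue */ }

    /* Called for each committed load: pc identifies the instruction,
     * addr is its base address, value is the word it loaded. */
    void on_load(uint64_t pc, uint64_t addr, uint64_t value)
    {
        /* 1. Did an earlier load produce this load's address? Then
         *    record that earlier load as a producer. */
        for (int i = 0; i < PPW_LEN; i++)
            if (ppw[i].value != 0 && ppw[i].value == addr) {
                ct[ppw[i].pc % CT_LEN].producer_pc = ppw[i].pc;
                ct[ppw[i].pc % CT_LEN].valid = 1;
            }
        /* 2. If this load is a known producer, its value is probably
         *    the next node's address: prefetch one step ahead. */
        ct_entry *c = &ct[pc % CT_LEN];
        if (c->valid && c->producer_pc == pc)
            issue_prefetch(value);
        /* 3. Record this load as a potential producer. */
        ppw[ppw_next].value = value;
        ppw[ppw_next].pc = pc;
        ppw_next = (ppw_next + 1) % PPW_LEN;
    }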