1
Decoupled Architecture for Data Prefetching
Jichuan Chang <chang@cs.wisc.edu>
Kai Xu <xuk@cs.wisc.edu>
2
Outline
  • Motivation
  • Design and Evaluation
  • Results
  • Conclusions

3
Motivation
  • Processor-memory performance gap
  • Prefetching helps, but it has overhead.
  • Transistors are cheap; will a coprocessor help?

[Diagram: the main processor feeds access information to the prefetching coprocessor, which issues prefetch requests to the cache; data moves over the shared L1-L2 internal bus]
4
Why a dedicated coprocessor?
  • Simple
    • It simplifies the design of the main processor.
  • Powerful
    • It can (hopefully) exploit complex algorithms.
    • It handles computation overhead (e.g., pattern
      computation, address computation).
  • Flexible
    • It can (hopefully) adapt to different situations.
    • It can implement different algorithms.
  • But are these true?

5
The Generic Design
[Diagram: the generic PCP design, an ALU plus logic that decides what to prefetch, when to prefetch, and where to place the data]
6
Data Prefetching Techniques
  • Regular Access Prefetching
    • Tagged Next-Block Lookahead [Smith 82]
      • Exploits sequential access patterns
    • Stride Prefetching [Baer & Chen 91]
      • Exploits strided access patterns
  • Dependence-based Prefetching [Roth et al. 98]
    • Discovers linked-data-structure access patterns
  • Dead-Block Correlation [Lai et al. 01]
    • History-based correlation prediction
  • Stream Buffer [Jouppi 90]
    • Reduces cache pollution

7
Simulation Settings
  • SimpleScalar v3.0
    • Modified sim-outorder to implement information
      sharing between the MP and the PCP
    • Modified the cache module to implement
      • Prefetching schemes (between the L1 and L2 caches)
      • Prefetch queue (length 16), bus sharing/contention
      • Stream buffer
  • Memory parameters
    • L1 data cache: 4KB, 32B lines, 4-way associative
    • L2 cache: 64KB, 64B lines, 4-way associative
    • Stream buffer: 8 entries, fully associative, 1-cycle hit
    • Hit latency (cycles): L1 = 1, L2 = 12, Mem = 70 (2)
    • Pipelined bus; bus contention and latency are modeled
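For reference, these memory parameters map onto sim-outorder's stock
cache and memory flags. A plausible baseline invocation is sketched
below (the prefetching extensions themselves were custom modifications
and are not shown; "70 (2)" is read here as SimpleScalar's two-part
memory latency, 70 cycles for the first chunk and 2 per chunk after):

    # 4KB L1D = 32 sets x 32B lines x 4 ways; 64KB L2 = 256 sets x 64B x 4 ways
    sim-outorder \
      -cache:dl1 dl1:32:32:4:l  -cache:dl1lat 1  \
      -cache:dl2 ul2:256:64:4:l -cache:dl2lat 12 \
      -mem:lat 70 2 \
      benchmark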

8
Benchmarks
  • From SPEC95
  • gcc
  • compress
  • swim
  • tomcatv
  • Microbenchmark
  • Matrix multiplication (128 × 128, double)
  • Binary tree (1M nodes, similar to treeadd)

9
Results (IPC)
10
Results (Miss Ratio)
11
Results (Prefetch Accuracy)
12
L1-L2 Traffic Increase
13
Results (Delay Tolerance)
  • How many cycles of delay can the PCP tolerate?
  • More delay means
    • Less useful prefetches (data can't get back before
      the demand references)
    • More pollution (due to outdated information)
    • Fewer prefetches (due to bus contention)
  • To avoid pollution, implement the prefetch queue as a
    circular buffer (see the sketch after this list)
    • Overwrite outdated entries when the queue is full
    • The major effect of larger delay is then fewer prefetches
  • Memory behavior is hard to model in SimpleScalar
    • Predetermined latency, no wake-up, no MSHRs
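A minimal C sketch of such a circular prefetch queue (the 16-entry
length is from the settings slide; the drop-oldest policy is an
assumption consistent with "overwrite outdated entries"):

    #include <stdint.h>

    #define PQ_LEN 16  /* prefetch queue length from the settings slide */

    /* Circular prefetch queue: when full, the oldest entry is
     * overwritten instead of stalling, since it is the most likely
     * to be outdated by the time it would issue on the bus. */
    typedef struct {
        uint64_t addr[PQ_LEN];  /* pending prefetch addresses */
        int head;               /* next entry to issue */
        int tail;               /* next free slot */
        int count;              /* number of valid entries */
    } prefetch_queue;

    void pq_push(prefetch_queue *q, uint64_t addr)
    {
        if (q->count == PQ_LEN) {            /* full: drop the oldest */
            q->head = (q->head + 1) % PQ_LEN;
            q->count--;
        }
        q->addr[q->tail] = addr;
        q->tail = (q->tail + 1) % PQ_LEN;
        q->count++;
    }

    int pq_pop(prefetch_queue *q, uint64_t *addr)  /* 1 = got an address */
    {
        if (q->count == 0)
            return 0;
        *addr = q->addr[q->head];
        q->head = (q->head + 1) % PQ_LEN;
        q->count--;
        return 1;
    }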

14
Delay tolerance
  • Preliminary result
  • For almost all schemes on all benchmarks,
    the PCP can tolerate 8 cycles of delay

15
Can we integrate different schemes?
  • Different applications need different schemes
  • Brute-force approach
    • Use both tagged and stride prefetching
    • Good speedup, but much more memory traffic
  • Adapt the prefetching policy dynamically?
    • Share the same hardware table
      • Uses similar matching schemes
      • Hard to reconfigure/flush on context switches
    • Use separate tables
      • More hardware
      • Similar to a tournament predictor (just a thought;
        see the sketch below)
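As a sketch of that thought (not a design from the talk): a per-PC
table of 2-bit saturating counters, trained by which scheme's
prefetches turn out useful, could arbitrate between the two tables
much like a tournament branch predictor chooses between predictors.

    #include <stdint.h>

    #define SEL_ENTRIES 1024

    /* 2-bit saturating chooser per PC hash:
     * counts 0-1 lean toward tagged prefetching, 2-3 toward stride. */
    static uint8_t selector[SEL_ENTRIES];  /* init to 2: weakly stride */

    int use_stride(uint64_t pc)
    {
        return selector[(pc >> 2) % SEL_ENTRIES] >= 2;
    }

    /* Train only when exactly one scheme's prefetch proved useful. */
    void train_selector(uint64_t pc, int tagged_useful, int stride_useful)
    {
        uint8_t *c = &selector[(pc >> 2) % SEL_ENTRIES];
        if (stride_useful && !tagged_useful && *c < 3)
            (*c)++;
        else if (tagged_useful && !stride_useful && *c > 0)
            (*c)--;
    }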

16
Conclusions
  • PCP helps performance! (2-30% speedup)
  • PCP handles prefetching and can tolerate some delay
  • Different schemes work for different applications
    • They require different information (from different places)
    • The PCP should be placed close to the information source
    • Not easy to integrate different schemes
  • Limitations of our approach
    • The PCP is not fully utilized
    • It relies on tables (caches/queues/buffers)
      • DBCP requires a large history table (7.6 MB of memory!)
  • Delay is critical to performance
    • It limits the complexity of prefetch schemes
    • It also determines where to place the PCP

17
Future Work
  • Evaluate more prefetching schemes
    • Dependence-based prefetching, etc.
  • PCP running ahead
    • Probably with the help of a trace cache
  • Fully utilize the PCP
    • Needs checkpoint/rollback mechanisms
  • Coprocessor support for other functionality
    • Branch prediction, power management
  • PCP for multiprocessors
    • Suitable for one-block lookahead
    • Needs changes to the cache-coherence protocol

18
Thank You! Questions?
19
Backup Slides
20
Tagged Prefetching
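(The original slide is a diagram.) Tagged prefetching adds one tag bit
per cache block; the bit is set when a block arrives via prefetch and
cleared on its first demand reference. A minimal C sketch of the
trigger logic, with issue_prefetch() standing in for enqueueing on the
PCP's prefetch queue:

    #include <stdint.h>

    #define BLOCK 32  /* L1 line size from the settings slide */

    typedef struct {
        int valid;
        int tag;   /* 1 = prefetched and not yet demand-referenced */
    } cache_block;

    static void issue_prefetch(uint64_t addr) { (void)addr; /* to PCP queue */ }

    /* Tagged next-block lookahead [Smith 82]: prefetch block b+1 on a
     * demand miss to block b, and on the first demand hit to a block
     * that was itself prefetched (its tag bit is still set). */
    void on_demand_access(cache_block *b, uint64_t block_addr, int hit)
    {
        if (!hit || b->tag) {
            b->tag = 0;  /* block is now demand-referenced */
            issue_prefetch(block_addr + BLOCK);
        }
    }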
21
Stride Prefetching
  • Reference Prediction Table (RPT)
    • Organized like a cache, indexed by the load's PC
    • Each entry: (last data address, stride, state)
  • A per-entry state machine decides when the stride is
    stable enough to prefetch
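A simplified C sketch of the RPT update (Baer & Chen's actual design
uses a four-state machine, initial/transient/steady/no-prediction;
this two-state version keeps only the "same stride seen twice" test):

    #include <stdint.h>

    #define RPT_ENTRIES 64

    typedef struct {
        uint64_t pc;         /* tag: PC of the load instruction */
        uint64_t last_addr;  /* last data address it accessed */
        int64_t  stride;     /* last observed stride */
        int      steady;     /* 1 once the same stride repeats */
    } rpt_entry;

    static rpt_entry rpt[RPT_ENTRIES];

    static void issue_prefetch(uint64_t addr) { (void)addr; /* to PCP queue */ }

    /* Called by the PCP for each load (pc, addr) observed from the MP. */
    void rpt_access(uint64_t pc, uint64_t addr)
    {
        rpt_entry *e = &rpt[(pc >> 2) % RPT_ENTRIES];
        if (e->pc != pc) {   /* miss in the RPT: (re)allocate the entry */
            e->pc = pc;
            e->last_addr = addr;
            e->stride = 0;
            e->steady = 0;
            return;
        }
        int64_t stride = (int64_t)(addr - e->last_addr);
        e->steady = (stride != 0 && stride == e->stride);
        e->stride = stride;
        e->last_addr = addr;
        if (e->steady)       /* stable stride: prefetch one ahead */
            issue_prefetch(addr + stride);
    }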

22
Dependence-based Prefetching
  • Potential Producer Window
  • Correlation Table
  • One step ahead
  • Jump-pointer generation/maintenance
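A much-simplified C sketch of the mechanism from Roth et al. (direct
pointer chains only, no consumer offsets, no jump pointers): the PPW
remembers recently loaded values and the loads that produced them; a
load whose base address matches a remembered value establishes a
producer in the correlation table, and whenever that producer loads
again, its value is prefetched one step ahead as the likely next node.

    #include <stdint.h>

    #define PPW_LEN 16
    #define CT_LEN  64

    typedef struct { uint64_t value, pc; } ppw_entry;  /* value + producer PC */
    typedef struct { uint64_t producer_pc; int valid; } ct_entry;

    static ppw_entry ppw[PPW_LEN];
    static int ppw_next;
    static ct_entry ct[CT_LEN];

    static void issue_prefetch(uint64_t addr) { (void)addr; /* to PCP queue */ }

    /* Called for each committed load: pc identifies the instruction,
     * addr is its base address, value is the word it loaded. */
    void on_load(uint64_t pc, uint64_t addr, uint64_t value)
    {
        /* 1. Did an earlier load produce this load's address? Then
         *    record that earlier load as a producer. */
        for (int i = 0; i < PPW_LEN; i++)
            if (ppw[i].value != 0 && ppw[i].value == addr) {
                ct[ppw[i].pc % CT_LEN].producer_pc = ppw[i].pc;
                ct[ppw[i].pc % CT_LEN].valid = 1;
            }
        /* 2. If this load is a known producer, its value is probably
         *    the next node's address: prefetch one step ahead. */
        ct_entry *c = &ct[pc % CT_LEN];
        if (c->valid && c->producer_pc == pc)
            issue_prefetch(value);
        /* 3. Record this load as a potential producer. */
        ppw[ppw_next].value = value;
        ppw[ppw_next].pc = pc;
        ppw_next = (ppw_next + 1) % PPW_LEN;
    }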