1. Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures
Rajeev Balasubramonian, School of Computing, University of Utah
July 1st, 2004
2Billion-Transistor Chips
- Partitioned architectures small computational
- units connected by a communication fabric
- Small computational units with limited
functionality ? - fast clocks, low design effort, low power
- Numerous computational units ? high
parallelism
3. The Communication Bottleneck
- Wire delays do not scale down at the same rate as logic delays [Agarwal, ISCA '00; Ho, Proc. IEEE '01]
- 30-cycle delay to go across the chip in 10 years
- 1-cycle inter-hop latency in the RAW prototype [Taylor, ISCA '04]
4. Cache Design
Centralized Cache
[Figure: a cluster accesses a centralized L1D. Address transfer: 6 cycles; RAM access: 6 cycles; data transfer back: 6 cycles.]
18-cycle access (12 cycles for communication)
5. Cache Design
Centralized Cache vs. Decentralized Cache
[Figure: the 18-cycle centralized access (6-cycle address transfer, 6-cycle RAM access, 6-cycle data transfer) alongside a decentralized design with L1D banks distributed among the clusters.]
18-cycle access (12 cycles for communication)
6. Research Goals
- Identify bottlenecks in cache access
- Design cluster prefetch, a latency-hiding mechanism
- Evaluate and compare centralized and decentralized designs
7. Outline
- Motivation
- Evaluation platform
- Cluster prefetch
- Centralized vs. decentralized caches
- Conclusions
8. Clustered Microarchitectures
- Centralized front-end
- Dynamically steered (based on dependences and load; a sketch follows)
- Out-of-order issue and 1-cycle bypass within a cluster
- Hierarchical interconnect
[Figure: centralized instruction fetch, LSQ, and L1D; clusters connected by a hierarchical crossbar-and-ring interconnect]
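As a rough sketch of what dependence- and load-aware steering can look like (the attribute names below are illustrative, not taken from the simulator):

```python
# Illustrative steering heuristic: prefer the cluster that already holds
# the instruction's source operands, but penalize crowded issue queues.
# All attribute names here are hypothetical.
def steer(instr, clusters):
    best, best_cost = None, float("inf")
    for c in clusters:
        # Inter-cluster transfers needed for missing source operands.
        comm_cost = sum(1 for src in instr.sources if src not in c.registers)
        # Load-balance term: fraction of the issue queue already in use.
        load_cost = len(c.issue_queue) / c.issue_queue_capacity
        cost = comm_cost + load_cost
        if cost < best_cost:
            best, best_cost = c, cost
    return best
```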
9. Simulation Parameters
- Simplescalar-based simulator
- In-flight instruction window of 480
- 16 clusters, each with 60 registers, 30 issue queue entries, and one FU of each kind
- Inter-cluster latencies between 2 and 10 cycles
- Primary focus on SPEC-FP programs
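The same parameters, restated as a config sketch (values from the slide; the dict itself is just illustrative):

```python
# Simulated machine parameters as listed above.
SIM_CONFIG = {
    "instruction_window": 480,               # in-flight instructions
    "clusters": 16,
    "registers_per_cluster": 60,
    "issue_queue_per_cluster": 30,
    "fus_per_cluster": "one of each kind",
    "intercluster_latency_cycles": (2, 10),  # varies with distance
    "benchmarks": "primarily SPEC-FP",
}
```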
10. Steps Involved in Cache Access
[Figure: instruction fetch and dispatch → effective address computation → effective address transfer to the LSQ → memory disambiguation → L1D RAM access → data transfer back to the cluster]
11. Lifetime of a Load
12. Load Address Prediction
[Figure: baseline load timeline. The load dispatches at cycle 0; the effective address transfer reaches the LSQ at cycle 27; the cache access happens at cycle 68; the data transfer back to the cluster completes at cycle 94.]
13. Load Address Prediction
[Figure: the same timeline with an address predictor. The predicted address lets the cache access begin at cycle 0 and the data transfer complete by cycle 26; the actual effective address still arrives at cycle 27 and verifies the prediction.]
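The cycle numbers above fold into a back-of-the-envelope model of how much latency the predictor hides (illustrative code, not simulator output):

```python
# Load timeline from the slide: with a predicted address, the L1D access
# starts at dispatch instead of waiting for address transfer and
# disambiguation. Constants are the slide's example cycle counts.
def load_data_ready(address_predicted):
    dispatch = 0
    if address_predicted:
        cache_access = dispatch   # access begins immediately at dispatch
    else:
        # Effective address reaches the LSQ at cycle 27, then waits for
        # memory disambiguation; the cache access starts at cycle 68.
        cache_access = 68
    return cache_access + 26      # RAM access plus data transfer back

print(load_data_ready(False))  # 94: baseline
print(load_data_ready(True))   # 26: 68 cycles of latency hidden
```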
14. Memory Dependence Speculation
- To allow early cache access, loads must issue before resolving earlier store addresses
- High-confidence store address predictions are employed for disambiguation (see the sketch below)
- Stores that have never forwarded results within the LSQ are ignored
- Cluster Prefetch: the combination of load address prediction and memory dependence speculation
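A minimal sketch of the early-issue check these bullets imply, assuming a predictor that returns a high-confidence store address or None, and a per-PC "never forwards" table (all names hypothetical):

```python
# May this load access the cache before all older store addresses are
# known? Every name here is illustrative, not the paper's interface.
def may_issue_early(load, older_stores, store_addr_pred, never_forwards):
    for st in older_stores:
        if st.address is not None:                 # store already resolved
            if st.address == load.predicted_address:
                return False                       # true conflict: wait
            continue
        if never_forwards.get(st.pc, False):
            continue                               # never fed a load: ignore
        pred = store_addr_pred.predict(st.pc)      # high-confidence or None
        if pred is not None and pred != load.predicted_address:
            continue                               # confidently disjoint
        return False                               # cannot rule out a conflict
    return True
```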
15. Implementation Details
- Centralized table that maintains a stride and last address; the stride is established by five consecutive accesses and cleared in case of five mispredicts (see the sketch below)
- Separate centralized table that maintains a single bit per entry to indicate stores that pose conflicts
- Each mispredict flushes all subsequent instrs
- Storage overhead: 18KB
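A minimal Python sketch of such a stride entry, following the five-access/five-mispredict policy above (field names are illustrative):

```python
# One predictor entry per load PC: last address, stride, and counters.
class StrideEntry:
    def __init__(self):
        self.last_addr = None
        self.stride = 0
        self.hits = 0     # consecutive accesses confirming the stride
        self.misses = 0   # consecutive accesses breaking it

    def predict(self):
        # Predict only after five consecutive accesses confirm the stride.
        return self.last_addr + self.stride if self.hits >= 5 else None

    def update(self, addr):
        if self.last_addr is not None:
            if addr - self.last_addr == self.stride:
                self.hits, self.misses = self.hits + 1, 0
            else:
                self.misses += 1
                if self.misses >= 5:   # five mispredicts: clear the entry
                    self.stride = addr - self.last_addr
                    self.hits = self.misses = 0
        self.last_addr = addr
```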
16. Performance Results
Overall IPC improvement: 21%
17. Results Analysis
- Roughly half the programs improved IPC by >8%
- Load address prediction rate: 65%
- Store address prediction rate: 79%
- Stores likely to not pose conflicts: 59%
- Avg. number of mispredicts: 12K per 100M instrs
18. Decentralized Cache
- Replicated cache banks
- Loads do not travel far
- Stores and cache refills are broadcast (see the sketch below)
- Memory disambiguation is not accelerated
- Overheads: interconnect for broadcast and cache refill, power for redundant writes, distributed LRU, etc.
[Figure: L1D banks and LSQs replicated across the cluster array]
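A toy model of the replication idea, with dicts standing in for cache banks (it captures the broadcast writes, not the overheads listed above):

```python
# Each cluster group reads its own L1D replica over short wires, while
# every store (and refill) is broadcast so all replicas stay identical.
class ReplicatedL1D:
    def __init__(self, num_replicas):
        self.replicas = [dict() for _ in range(num_replicas)]

    def load(self, replica_id, addr):
        return self.replicas[replica_id].get(addr)  # local, low latency

    def store(self, addr, value):
        for bank in self.replicas:  # broadcast: redundant writes in all banks
            bank[addr] = value
```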
19. Comparing Centralized and Decentralized
[Figure: the two cache organizations side by side]
IPCs without cluster prefetch: 1.43 (centralized), 1.52 (decentralized)
IPCs with cluster prefetch: 1.73 (centralized), 1.79 (decentralized)
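For reference, the speedups implied by these IPCs (assuming the centralized/decentralized pairing above):

```python
base_cent, base_dec = 1.43, 1.52   # without cluster prefetch
pref_cent, pref_dec = 1.73, 1.79   # with cluster prefetch
print(round(pref_cent / base_cent, 2))  # 1.21: the 21% overall improvement
print(round(pref_dec / base_dec, 2))    # 1.18
print(round(pref_dec / pref_cent, 2))   # 1.03: decentralization adds little
```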
20. Sensitivity Analysis
- Results verified for processor models with varying resources and interconnect latencies
- Evaluations on SPEC-Int: the address prediction rate is only 38% → modest speedups
  - twolf (7%), parser (9%)
  - crafty, gcc, vpr (3-4%)
  - rest (<2%)
21. Related Work
- Modest speedups with decentralized caches: Racunas and Patt [ICS '03] for dynamic clustered processors; Gibert et al. [MICRO '02] for clustered VLIW processors
- Gibert et al. [MICRO '03]: compiler-managed L0 buffers for critical data
22. Conclusions
- Address prediction and memory dependence speculation can hide the latency to cache banks: a prediction rate of 66% for SPEC-FP and an IPC improvement of 21%
- Additional benefits from decentralization are modest
- Future work: build better predictors, study the impact on power consumption [WCED '04]