Title: EECS 470
EECS 470
- Cache Systems
- Lecture 13
- Coverage: Chapter 5
Cache Design 101
[Figure: memory pyramid]
- Registers: 100s of bytes, 1-cycle access (early in pipeline)
- L1 cache: 1-3 cycle access
- L2 cache: 6-15 cycle access
- Main memory: 50-300 cycle access
- Disk: millions of cycles per access!
Direct-mapped cache
[Figure: a small memory array and a 4-line direct-mapped cache; each cache line holds V (valid), d (dirty), tag, and data, and example address 01101 is split into fields]
- Address breakdown: Tag (2-bit) | Line Index (2-bit) | Block Offset (1-bit)
- Compulsory Miss: first reference to a memory block
- Capacity Miss: working set doesn't fit in the cache
- Conflict Miss: working set maps to the same cache line
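To make the field split concrete, here is a minimal C sketch of the address decode for the cache in the figure (the 2-bit tag / 2-bit index / 1-bit offset widths come from the slide; the function names are illustrative):

```c
#include <stdint.h>
#include <stdio.h>

/* Field widths from the slide: 2-bit tag, 2-bit line index, 1-bit block offset. */
#define OFFSET_BITS 1
#define INDEX_BITS  2

static inline uint32_t block_offset(uint32_t addr) {
    return addr & ((1u << OFFSET_BITS) - 1);
}
static inline uint32_t line_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}
static inline uint32_t tag(uint32_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);
}

int main(void) {
    uint32_t addr = 0x0D; /* 01101, the example address from the slide */
    /* Prints tag=1 (01), index=2 (10), offset=1 */
    printf("tag=%u index=%u offset=%u\n", tag(addr), line_index(addr), block_offset(addr));
    return 0;
}
```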
2-way set associative cache
[Figure: the same memory array, now looked up in a 2-way set associative cache; each way holds V, d, tag, and data, and example address 01101 is split into fields]
- Address breakdown: Larger (3-bit) Tag | 1-bit Set Index | Block Offset (unchanged)
- Rule of thumb: increasing associativity decreases conflict misses. A 2-way associative cache has about the same hit rate as a direct-mapped cache twice the size.
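The set-associative lookup itself is a parallel tag compare across the ways. A minimal sketch with illustrative types (a real cache does the comparisons with one comparator per way, not a sequential loop):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_WAYS 2

struct cache_line {
    bool     valid;
    bool     dirty;
    uint32_t tag;
    uint8_t  data[2]; /* 2-byte blocks, matching the 1-bit block offset */
};

/* One set = NUM_WAYS lines; hit if any valid way's tag matches. */
static bool set_lookup(struct cache_line set[NUM_WAYS], uint32_t tag, int *way_out) {
    for (int w = 0; w < NUM_WAYS; w++) {        /* parallel comparators in HW */
        if (set[w].valid && set[w].tag == tag) {
            *way_out = w;
            return true;                        /* hit */
        }
    }
    return false;                               /* miss: must pick a victim way */
}
```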
Effects of Varying Cache Parameters
- Total cache size = block size × sets × associativity
- Positives
  - Should decrease miss rate
- Negatives
  - May increase hit time
  - Increased area requirements
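A quick worked instance of that equation, with illustrative numbers: 64-byte blocks × 128 sets × 4-way associativity = 32 KB of data storage (tag, valid, and dirty bits are extra).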
Effects of Varying Cache Parameters
- Bigger block size
- Positives
  - Exploit spatial locality; reduce compulsory misses
  - Reduce tag overhead (bits)
  - Reduce transfer overhead (address, burst data mode)
- Negatives
  - Fewer blocks for a given size; increase conflict misses
  - Increase miss transfer time (multi-cycle transfers)
  - Wasted bandwidth for non-spatial data
Effects of Varying Cache Parameters
- Increasing associativity
- Positives
  - Reduces conflict misses
  - Low-associativity caches can have pathological behavior (very high miss rate)
- Negatives
  - Increased hit time
  - More hardware requirements (comparators, muxes, bigger tags)
  - Diminishing improvements past 4- or 8-way
Effects of Varying Cache Parameters
- Replacement Strategy (for associative caches)
  - LRU: intuitive; difficult to implement with high associativity; worst-case performance can occur (N+1-element array)
  - Random / pseudo-random: easy to implement; performance close to LRU for high associativity
  - Optimal: replace the block whose next reference is farthest in the future; hard to implement
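For concreteness, here is one classic way to keep true-LRU state: an age counter per way. Names and sizes are illustrative, and real high-associativity designs use cheaper approximations such as pseudo-LRU trees:

```c
#include <stdint.h>

#define NUM_WAYS 4

/* Minimal true-LRU bookkeeping: one age counter per way.
 * On an access, ways younger than the touched way age by one and the
 * touched way's age resets to 0; the victim is the oldest way. */
struct lru_set {
    uint8_t age[NUM_WAYS];
};

static void lru_touch(struct lru_set *s, int way) {
    for (int w = 0; w < NUM_WAYS; w++)
        if (s->age[w] < s->age[way])
            s->age[w]++;            /* only ways younger than the touched one age */
    s->age[way] = 0;                /* touched way becomes most recently used */
}

static int lru_victim(const struct lru_set *s) {
    int victim = 0;
    for (int w = 1; w < NUM_WAYS; w++)
        if (s->age[w] > s->age[victim])
            victim = w;
    return victim;                  /* least recently used way */
}
```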
Other Cache Design Decisions
- Write Policy: how to deal with write misses?
- Write-through / no-allocate
  - Total traffic? Read misses × block size + writes
  - Common for L1 caches backed by an L2 (esp. on-chip)
- Write-back / write-allocate
  - Needs a dirty bit to determine whether cache data differs from memory
  - Total traffic? (Read misses + write misses) × block size + dirty-block evictions × block size
  - Common for L2 caches (memory bandwidth limited)
- Variation: write-validate
  - Write-allocate without fetch-on-write
  - Needs a sub-block cache with valid bits for each word/byte
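A worked comparison with illustrative numbers (32-byte blocks, 1000 read misses, 500 write misses, 300 dirty evictions, 4-byte writes, 10,000 total stores): write-through/no-allocate moves 1000 × 32 = 32,000 bytes of fill traffic plus 10,000 × 4 = 40,000 bytes of write traffic, while write-back/write-allocate moves (1000 + 500) × 32 + 300 × 32 = 57,600 bytes. Which wins depends on how often stores rewrite the same dirty lines.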
Other Cache Design Decisions
- Write Buffering
  - Delay writes until bandwidth is available
  - Put them in a FIFO buffer
  - Only stall on a write if the buffer is full
  - Use bandwidth for reads first (since they have latency problems)
  - Important for write-through caches, since write traffic is frequent
- Write-back buffer
  - Holds evicted (dirty) lines for write-back caches
  - Also allows reads to have priority on the L2 or memory bus
  - Usually only needs a small buffer
- Ref: Eager Writeback Caches
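A minimal sketch of the FIFO write buffer described above (depth and names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_DEPTH 8  /* a small buffer is usually enough */

/* Illustrative FIFO write buffer: entries drain to the next level
 * when the bus is idle; the processor stalls only when it is full. */
struct wb_entry { uint32_t addr; uint32_t data; };

struct write_buffer {
    struct wb_entry q[WB_DEPTH];
    int head, tail, count;
};

static bool wb_push(struct write_buffer *wb, uint32_t addr, uint32_t data) {
    if (wb->count == WB_DEPTH)
        return false;                   /* full: the store must stall */
    wb->q[wb->tail] = (struct wb_entry){ addr, data };
    wb->tail = (wb->tail + 1) % WB_DEPTH;
    wb->count++;
    return true;
}

/* Called when no read is using the bus: drain the oldest write. */
static bool wb_drain_one(struct write_buffer *wb, struct wb_entry *out) {
    if (wb->count == 0)
        return false;
    *out = wb->q[wb->head];
    wb->head = (wb->head + 1) % WB_DEPTH;
    wb->count--;
    return true;
}
```

A real buffer must also be searched by later loads, so that a read miss cannot slip past a buffered store to the same address.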
Adding a Victim Cache
[Figure: a direct-mapped L1 (V, d, tag, data) next to a 4-line fully associative victim cache, stepped through references 11010011 and 01010011]
- Small victim cache adds associativity to hot lines
- Blocks evicted from the direct-mapped cache go to the victim cache
- Tag compares are made to both the direct-mapped cache and the victim cache
- Victim hits cause lines to swap between L1 and the victim cache
- Not very useful for associative L1 caches
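A toy C model of the lookup-and-swap flow, with illustrative sizes and a simplified replacement choice:

```c
#include <stdbool.h>
#include <stdint.h>

#define L1_LINES     8   /* direct-mapped L1 (toy size) */
#define VICTIM_LINES 4   /* small fully associative victim cache */

struct l1_line { bool valid; uint32_t tag; };
struct vc_line { bool valid; uint32_t blockaddr; }; /* full block address: fully assoc. */

static struct l1_line l1[L1_LINES];
static struct vc_line vc[VICTIM_LINES];

static bool access_cache(uint32_t blockaddr) {
    uint32_t index = blockaddr % L1_LINES;
    uint32_t tag   = blockaddr / L1_LINES;

    if (l1[index].valid && l1[index].tag == tag)
        return true;                              /* L1 hit */

    for (int v = 0; v < VICTIM_LINES; v++) {      /* compare all victim tags */
        if (vc[v].valid && vc[v].blockaddr == blockaddr) {
            /* Victim hit: swap lines so the hot block returns to L1. */
            uint32_t displaced = l1[index].tag * L1_LINES + index;
            bool had_line = l1[index].valid;
            l1[index] = (struct l1_line){ true, tag };
            vc[v] = (struct vc_line){ had_line, displaced };
            return true;
        }
    }

    /* True miss: refill from L2; the evicted L1 line moves to the
     * victim cache (slot 0 for brevity; real designs pick LRU/random). */
    if (l1[index].valid)
        vc[0] = (struct vc_line){ true, l1[index].tag * L1_LINES + index };
    l1[index] = (struct l1_line){ true, tag };
    return false;
}
```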
Hash-Rehash Cache
[Figure sequence (4 slides): a direct-mapped cache (V, d, tag, data) stepped through the reference stream 11010011, 01010011, 01000011, 11010011, 11000011]
- On a primary miss, a second lookup is made at a rehashed index
- Walkthrough: a conflicting reference first misses at its primary (hash) location, then misses again at the rehash location and allocates there ("Miss, rehash miss -- allocate?")
- A later reference misses at its primary location but finds its block at the rehash location ("Miss, rehash hit!")
Hash-Rehash Cache
- Calculating performance
  - Primary hit time (normal direct-mapped)
  - Rehash hit time (sequential tag lookups)
  - Block swap time?
- Hit rate comparable to 2-way associative
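A small illustrative model of the resulting average access time (all probabilities and latencies below are assumptions for the sketch, not values from the slide):

```c
#include <stdio.h>

/* Illustrative average-access-time model for a hash-rehash cache. */
int main(void) {
    double t_primary = 1.0;   /* primary hit time (cycles) */
    double t_rehash  = 3.0;   /* primary probe + sequential second probe */
    double t_miss    = 20.0;  /* miss penalty to the next level */

    double p_primary = 0.85;  /* fraction of accesses hitting the first probe */
    double p_rehash  = 0.05;  /* fraction hitting the rehash probe */
    double p_miss    = 1.0 - p_primary - p_rehash;

    double t_avg = p_primary * t_primary
                 + p_rehash  * t_rehash
                 + p_miss    * (t_rehash + t_miss); /* both probes fail first */
    printf("average access time = %.2f cycles\n", t_avg); /* 3.30 here */
    return 0;
}
```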
Compiler Support for Caching
- Array merging (array of structs vs. two arrays)
- Loop interchange (row vs. column access)
- Structure padding and alignment (malloc)
- Cache-conscious data placement
  - Pack the working set into the same line
  - Map to non-conflicting addresses if packing is impossible
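A minimal C illustration of loop interchange: both nests compute the same sum, but the second walks the array in row-major order, matching C's memory layout, so consecutive iterations reuse the same cache line:

```c
#include <stdio.h>

#define N 1024
static double a[N][N];

int main(void) {
    double sum = 0.0;

    /* Column-major traversal: consecutive accesses are N*8 bytes apart,
     * so nearly every access touches a new cache line. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    /* Interchanged (row-major) traversal: consecutive accesses are
     * adjacent in memory, so each fetched cache line is fully used. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}
```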
Prefetching
- Already done: bring in an entire line, assuming spatial locality
- Extend this: Next-Line Prefetch
  - Bring in the next block in memory as well as the missed line (very good for I-cache)
- Software prefetch
  - Loads to R0 have no data dependency
- Aggressive/speculative prefetch useful for L2
- Speculative prefetch problematic for L1
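In portable C, software prefetch is usually expressed with the GCC/Clang __builtin_prefetch intrinsic, the compiler-level analogue of the "load to R0" trick (a memory access with no consumer). The distance constant below is an illustrative tuning knob:

```c
#include <stddef.h>

#define PREFETCH_AHEAD 16  /* how many elements ahead to fetch (tunable) */

double sum_with_prefetch(const double *x, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)  /* read-prefetch with moderate locality */
            __builtin_prefetch(&x[i + PREFETCH_AHEAD], /*rw=*/0, /*locality=*/1);
        sum += x[i];
    }
    return sum;
}
```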
Calculating the Effects of Latency
- Does a cache miss reduce performance?
- It depends on whether there are critical instructions waiting for the result
Calculating the Effects of Latency
- It depends on whether critical resources are held up
- Blocking: when a miss occurs, all later references to the cache must wait. This is a resource conflict.
- Non-blocking: allows later references to access the cache while the miss is being processed
- Generally there is some limit to how many outstanding misses can be bypassed
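The standard way to fold this into one number is average memory access time; a quick worked instance with illustrative numbers: AMAT = hit time + miss rate × miss penalty = 1 + 0.05 × 20 = 2 cycles per access. A non-blocking cache does not change those inputs so much as it hides part of the 20-cycle penalty behind independent work, so the effective penalty seen by the pipeline is smaller.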
P4 Overview (Todd's slides)
- Latest IA-32 processor from Intel
- Equipped with the full set of IA-32 SIMD operations
- First flagship architecture since the P6 microarchitecture
- Pentium 4 ISA = Pentium III ISA + SSE2
- SSE2 (Streaming SIMD Extensions 2) provides 128-bit SIMD integer and floating-point operations, plus prefetch
Comparison Between Pentium III and Pentium 4
[Table: Pentium III vs. Pentium 4 comparison, not recovered from the slide]
Trace Cache
- Primary instruction cache in the P4 architecture
- Stores 12K decoded µops
- On a miss, instructions are fetched from L2
- Trace predictor connects traces
- Trace cache removes
  - Decode latency after mispredictions
  - Decode power for all pre-decoded instructions
Execution Pipeline
[Figure: P4 execution pipeline diagram]
Store and Load Scheduling
- Out-of-order store and load operations
- Stores are always in program order
- 48 loads and 24 stores could be in flight
- Store/load buffers are allocated at the allocation stage
  - Total: 24 store buffers and 48 load buffers
Data Stream of Pentium 4 Processor
[Figure: Pentium 4 data stream / memory hierarchy diagram]
On-chip Caches
- L1 instruction cache (Trace Cache)
- L1 data cache
- L2 unified cache
- All caches use a pseudo-LRU replacement algorithm
- Parameters: [table not recovered from the slide]
L1 Data Cache
- Non-blocking
  - Supports up to 4 outstanding load misses
- Load latency
  - 2-clock for integer
  - 6-clock for floating-point
- 1 load and 1 store per clock
- Load speculation
  - Assume the access will hit the cache
  - Replay the dependent instructions when a miss is detected
L2 Cache
- Non-blocking
- Load latency
  - Net load access latency of 7 cycles
- Bandwidth
  - 1 load and 1 store in one cycle
  - New cache operations may begin every 2 cycles
  - 256-bit wide bus between L1 and L2
  - 48 GB/s at 1.5 GHz
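The bandwidth figure follows directly from the bus width: 256 bits = 32 bytes per transfer, and one transfer per clock at 1.5 GHz gives 32 × 1.5 G = 48 GB/s.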
L2 Cache Data Prefetcher
- Hardware prefetcher monitors the reference patterns
- Brings in cache lines automatically
- Attempts to fetch 256 bytes ahead of the current access
- Prefetches for up to 8 simultaneous independent streams
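A toy software model of the stream-detection idea (the table size mirrors the slide's 8 streams and the 256-byte fetch-ahead distance; issue_prefetch is an assumed memory-side hook, and everything else is illustrative):

```c
#include <stdint.h>

#define NUM_STREAMS 8   /* up to 8 independent streams, per the slide */
#define LINE_BYTES  64
#define AHEAD_BYTES 256 /* fetch-ahead distance from the slide */

/* Toy stream prefetcher: remember the last missed line per stream;
 * if a new miss lands on the next sequential line, the pattern is
 * confirmed and we run AHEAD_BYTES in front of the demand stream. */
struct stream { uint64_t last_line; int live; };
static struct stream streams[NUM_STREAMS];

extern void issue_prefetch(uint64_t addr); /* assumed hook into the memory system */

void on_miss(uint64_t addr) {
    uint64_t line = addr / LINE_BYTES;
    for (int s = 0; s < NUM_STREAMS; s++) {
        if (streams[s].live && line == streams[s].last_line + 1) {
            /* Sequential pattern confirmed: prefetch ahead of the access. */
            for (uint64_t a = 1; a <= AHEAD_BYTES / LINE_BYTES; a++)
                issue_prefetch((line + a) * LINE_BYTES);
            streams[s].last_line = line;
            return;
        }
    }
    /* No match: allocate a stream slot (round-robin for brevity). */
    static int next;
    streams[next] = (struct stream){ line, 1 };
    next = (next + 1) % NUM_STREAMS;
}
```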