Title: Memory Hierarchy Design (Chapter 5)
1. Memory Hierarchy Design (Chapter 5)
2. Background
- 1980: no caches
- 1995: two levels of caches
- 2004: even three levels of caches
- Why?
- The Processor-Memory gap
3. Processor-Memory Gap
[Figure: performance (log scale) vs. year, 1980-2000. Processor (µProc) performance grows at ~60%/yr while DRAM performance grows at ~7%/yr, so the gap widens every year. Source: lecture handouts, Prof. John Kubiatowicz, CS252, U.C. Berkeley]
4. Because
- Memory speed is a limiting factor in performance
- Caches are small and fast
- Caches leverage the principle of locality:
  - Temporal locality: data that has been referenced recently tends to be re-referenced soon
  - Spatial locality: data close (in the address space) to recently referenced data tends to be referenced soon
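A small illustration (mine, not from the slides): summing a row-major C matrix row by row walks consecutive addresses and uses every word of each fetched cache block, while the column-by-column version strides through memory and wastes most of each block.

#include <stdio.h>

#define N 1024
static double a[N][N];  /* row-major: a[i][j] and a[i][j+1] are adjacent */

int main(void) {
    double sum = 0.0;

    /* Good spatial locality: consecutive iterations read consecutive
       addresses, so each fetched cache block is fully used. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Poor spatial locality: a stride of N*8 bytes between accesses,
       so each cache block yields only one element before eviction. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}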
5. Review
- Cache block: the minimum unit of information that can be present in the cache (several contiguous memory positions)
- Cache hit: the requested data can be found in the cache
- Cache miss: the requested data cannot be found in the cache
- The four design questions:
  - Where can a block be placed?
  - How can a block be found?
  - Which block should be replaced?
  - What happens on a write?
6. Where can a block be placed?
Suppose we need to place block 10 in an 8-block cache:
- Direct mapped (1-way): 10 mod 8 = block 2
- 2-way set associative: 10 mod 4 = set 2
- 4-way set associative: 10 mod 2 = set 0
- Fully associative (8-way, in this case): anywhere

Placement: set = (block address) mod (# sets), where (# sets) = (cache size in blocks) / (# ways).
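As a minimal sketch, this placement rule maps directly to code; block 10 and the 8-block cache are the slide's example, everything else is illustrative.

#include <stdio.h>

/* Set index for a given block address: set = block mod (# sets),
   where # sets = (# blocks in cache) / (# ways). */
unsigned set_index(unsigned block_addr, unsigned num_blocks, unsigned ways) {
    unsigned num_sets = num_blocks / ways;
    return block_addr % num_sets;
}

int main(void) {
    /* Reproduce the slide's examples for block 10 in an 8-block cache. */
    printf("1-way: set %u\n", set_index(10, 8, 1)); /* 10 mod 8 = 2 */
    printf("2-way: set %u\n", set_index(10, 8, 2)); /* 10 mod 4 = 2 */
    printf("4-way: set %u\n", set_index(10, 8, 4)); /* 10 mod 2 = 0 */
    /* 8-way (fully associative): one set, so 10 mod 1 = 0. */
    printf("8-way: set %u\n", set_index(10, 8, 8));
    return 0;
}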
7. How can a block be found?
Look at the address! The block address splits into a tag and an index, followed by the block offset:

[ Tag | Index | Block Offset ]   (Tag + Index = Block Address)

- Index: determines the set (there is no index in fully associative caches)
- Block offset: determines the offset within the block
- Tag: the block's unique id, like a primary key
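A hedged sketch of decoding these fields, assuming 64-byte blocks and 64 sets (illustrative parameters, not from the slides):

#include <stdio.h>
#include <stdint.h>

/* Illustrative geometry: 64-byte blocks, 64 sets. */
#define OFFSET_BITS 6   /* log2(block size) */
#define INDEX_BITS  6   /* log2(# sets)     */

int main(void) {
    uint32_t addr = 0x12345678;

    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    /* tag + index together form the block address. */
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}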
8. Which block should be replaced?
- Random
- Least Recently Used (LRU)
  - True LRU may be too costly to implement in hardware (it requires a stack)
  - Simplified LRU (a software model of exact LRU follows this list)
- First In, First Out (FIFO)
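A minimal software model of exact LRU for one set, using per-way age counters (illustrative only; real hardware typically uses cheaper approximations such as tree pseudo-LRU):

#include <stdio.h>
#include <stdint.h>

#define WAYS 4

struct set {
    uint32_t tag[WAYS];
    uint8_t  age[WAYS];  /* 0 = most recently used */
};

/* Called on a hit or a fill: mark `way` most recent, age the others. */
static void touch(struct set *s, int way) {
    for (int w = 0; w < WAYS; w++)
        if (s->age[w] < s->age[way])
            s->age[w]++;
    s->age[way] = 0;
}

/* Victim selection: the least recently used way (largest age). */
static int victim(const struct set *s) {
    int v = 0;
    for (int w = 1; w < WAYS; w++)
        if (s->age[w] > s->age[v])
            v = w;
    return v;
}

int main(void) {
    struct set s = { .tag = {0}, .age = {0, 1, 2, 3} };
    touch(&s, 2);                           /* way 2 becomes most recent */
    printf("victim: way %d\n", victim(&s)); /* way 3 is now the oldest   */
    return 0;
}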
9. What happens on a write?
- Write through: every time a block is written, the new value is propagated to the next memory level
  - Easier to implement
  - Makes displacement simple and fast
  - Reads never have to wait for a displacement to finish
  - Writes may have to wait → use a write buffer
- Write back: the new value is propagated to the next memory level only when the block is displaced
  - Makes writes fast
  - Uses less memory bandwidth
    - A dirty bit may save additional bandwidth: no need to write back clean blocks
  - Saves power
10. What happens on a write? (cont.)
- Write allocate (fetch on write): the entire block is brought into the cache
- No write allocate (write around): the written word is sent directly to the next memory level
- Write policy and write-miss policy are independent, but usually:
  - Write back → write allocate
  - Write through → no write allocate
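A sketch of both usual pairings on a store; the cache_* and memory_write helpers are hypothetical stand-ins for real cache machinery, declared but left undefined:

#include <stdbool.h>

enum policy { WRITE_THROUGH_NO_ALLOCATE, WRITE_BACK_ALLOCATE };

/* Hypothetical helpers (assumed, not a real API). */
bool cache_lookup(unsigned addr);
void cache_fill(unsigned addr);          /* fetch block into the cache */
void cache_write(unsigned addr, int v);  /* update the cached copy     */
void cache_set_dirty(unsigned addr);
void memory_write(unsigned addr, int v); /* next memory level          */

void store(enum policy p, unsigned addr, int v) {
    bool hit = cache_lookup(addr);

    if (p == WRITE_THROUGH_NO_ALLOCATE) {
        if (hit)
            cache_write(addr, v);  /* keep the cached copy in sync */
        memory_write(addr, v);     /* always propagate downstream  */
    } else { /* WRITE_BACK_ALLOCATE */
        if (!hit)
            cache_fill(addr);      /* write allocate: fetch on write  */
        cache_write(addr, v);
        cache_set_dirty(addr);     /* written back only when displaced */
    }
}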
11. Cache Performance
- AMAT = (hit time) + (miss rate) × (miss penalty)

Which system is faster?

               Unified cache    Split I-cache   Split D-cache
Size           32KB             16KB            16KB
Miss rate      1.99%            0.64%           6.47%
Hit time       1 (I) / 2 (D)    1               1
Miss penalty   50               50              50

75% of accesses are instruction references.
12. Solution
- AMAT(split) = 0.75 × (1 + 0.0064 × 50) + 0.25 × (1 + 0.0647 × 50) = 2.05
- AMAT(unified) = 0.75 × (1 + 0.0199 × 50) + 0.25 × (2 + 0.0199 × 50) = 2.24
- Miss rate(split) = 0.75 × 0.64% + 0.25 × 6.47% = 2.10%
- Miss rate(unified) = 1.99%
- Although split has a higher miss rate, it is faster on average!
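The same arithmetic as a runnable check, with the values taken straight from the table above:

#include <stdio.h>

/* AMAT = hit time + miss rate x miss penalty, weighted 75% instruction
   references / 25% data references. */
int main(void) {
    double penalty = 50.0;

    /* Split: 16KB I-cache (0.64% misses), 16KB D-cache (6.47% misses). */
    double split = 0.75 * (1 + 0.0064 * penalty)
                 + 0.25 * (1 + 0.0647 * penalty);

    /* Unified 32KB: 1.99% misses; data hits take 2 cycles (port conflict). */
    double unified = 0.75 * (1 + 0.0199 * penalty)
                   + 0.25 * (2 + 0.0199 * penalty);

    printf("AMAT split   = %.3f cycles\n", split);   /* 2.049; slide: 2.05 */
    printf("AMAT unified = %.3f cycles\n", unified); /* 2.245; slide: 2.24 */
    return 0;
}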
13. Processor Performance
- CPU time = (proc cycles + mem stall cycles) × (clock cycle time)
- proc cycles = IC × CPI
- mem stall cycles = (mem accesses) × (miss rate) × (miss penalty)

What is the total CPU time, including the caches, as a function of IC and clock cycle time?

CPI (proc) = 2.0
Miss penalty = 50 cycles
Miss rate = 2%
Mem refs/inst = 1.33
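One way to work the numbers (my check, not part of the slides): stalls add 1.33 × 0.02 × 50 = 1.33 cycles per instruction, so CPU time = IC × (2.0 + 1.33) × clock cycle time = 3.33 × IC × clock cycle time.

#include <stdio.h>

/* Effective CPI = base CPI + (refs/inst) x (miss rate) x (miss penalty),
   using the slide's numbers. Total CPU time = IC x 3.33 x clk cycle time. */
int main(void) {
    double cpi       = 2.0;   /* processor CPI, ignoring misses      */
    double penalty   = 50.0;  /* cycles                              */
    double miss_rate = 0.02;
    double refs_inst = 1.33;  /* memory references per instruction   */

    double stall_cpi = refs_inst * miss_rate * penalty;  /* 1.33 */
    printf("effective CPI = %.2f\n", cpi + stall_cpi);   /* 3.33 */
    return 0;
}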
14. Processor Performance
- AMAT has a large impact on performance
- If CPI decreases, mem stall cycles represent a larger fraction of total cycles
- If the clock cycle time decreases, mem stall cycles represent more cycles

Note: in out-of-order execution processors, part of the memory access latency is overlapped with computation.
15. Improving Cache Performance

- AMAT = (hit time) + (miss rate) × (miss penalty)
- Reducing hit time:
  - Small and simple caches
  - No address translation
  - Pipelined cache access
  - Trace caches
16. Improving Cache Performance

- AMAT = (hit time) + (miss rate) × (miss penalty)
- Reducing miss rate:
  - Larger block size
  - Larger cache size
  - Higher associativity
  - Way prediction or pseudo-associative caches
  - Compiler optimizations (code/data layout)
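As an example of the compiler/layout item above, a hedged sketch of loop blocking (tiling) for matrix multiply; N and the tile size B are illustrative and assume B divides N:

#define N 512
#define B 32   /* chosen so a few BxB tiles fit in cache */

/* Tiling shrinks the working set so each BxB tile stays resident
   in cache while it is reused, cutting the miss rate. */
void matmul_blocked(double C[N][N], double A[N][N], double Bm[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                /* Operate on one BxB tile at a time. */
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double sum = C[i][j];
                        for (int k = kk; k < kk + B; k++)
                            sum += A[i][k] * Bm[k][j];
                        C[i][j] = sum;
                    }
}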
17. Improving Cache Performance

- AMAT = (hit time) + (miss rate) × (miss penalty)
- Reducing miss penalty:
  - Multilevel caches (see the decomposition after this list)
  - Critical word first
  - Read miss before write miss
  - Merging write buffers
  - Victim caches
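With two levels, for instance, the AMAT formula expands recursively: AMAT = hit time(L1) + miss rate(L1) × (hit time(L2) + miss rate(L2) × miss penalty(L2)). Most L1 misses then pay only the small L2 hit time instead of the full memory penalty.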
18. Improving Cache Performance

- AMAT = (hit time) + (miss rate) × (miss penalty)
- Reducing miss rate and miss penalty:
  - Increase parallelism:
    - Non-blocking caches
    - Prefetching:
      - Hardware
      - Software
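As one software-prefetching sketch: GCC/Clang expose a __builtin_prefetch hint (compiler-specific); the prefetch distance of 8 iterations here is a tuning assumption, not a rule.

/* Hint the hardware to start fetching a[i+8] while a[i] is summed,
   hiding part of the miss latency behind computation. */
double sum_with_prefetch(const double *a, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8]);  /* fetch ahead of use */
        sum += a[i];
    }
    return sum;
}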