Title: Chapter 5: Memory Hierarchy Design
1. Chapter 5: Memory Hierarchy Design
- Yirng-An Chen
- Dept. of CIS
- Computer Architecture
- Fall 2000
2. Computer System
3. Who Cares About the Memory Hierarchy?
- Processor only thus far in course
- CPU cost/performance, ISA, pipelined execution
- CPU-DRAM gap
- 1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with a cache on chip)
[Figure: processor-memory performance gap, 1980-2000 (log scale 1-1000). CPU performance ("Moore's Law") improves ~60%/yr, DRAM ~7%/yr, so the gap grows ~50%/yr.]
4. Levels in a Typical Memory Hierarchy

  Level                    Size           Speed    Cost/MByte   Block size   Reference type
  registers                200 B          3 ns     -            4 B          register reference
  cache                    32 KB / 4 MB   6 ns     NT$?/MB      8 B          cache reference
  main memory              128 MB         100 ns   NT$50/MB     4 KB         memory reference
  disk (virtual memory)    20 GB          10 ms    NT$0.9/MB    -            disk memory reference

Going down the hierarchy: larger, slower, cheaper.
5. Sources of Memory References

  sum = 0;
  for (i = 0; i < n; i++)
      sum += a[i];
  v = sum;

Abstract version of the machine code:

  I0:       sum <-- 0
  I1:       ap  <-- &a
  I2:       i   <-- 0
  I3:       if (i >= n) goto done
  I4: loop: t   <-- *ap
  I5:       sum <-- sum + t
  I6:       ap  <-- ap + 4
  I7:       i   <-- i + 1
  I8:       if (i < n) goto loop
  I9: done: v   <-- sum

Memory layout:

  0x0FC: I0    0x100: I1    0x104: I2    0x108: I3    0x10C: I4    0x110: I5    0x114: I6   ...
  0x400: a[0]  0x404: a[1]  0x408: a[2]  0x40C: a[3]  0x410: a[4]  0x414: a[5]  ...
  0x7A4: v

- Memory addresses are in bytes
- Each instruction and data word is 4 bytes
- Instruction sequences and data arrays are laid out as contiguous memory blocks
6. Locality of Reference
- Principle of locality
- Programs tend to reuse data and instructions near those they have used recently.
- Temporal locality: recently referenced items are likely to be referenced in the near future.
- Spatial locality: items with nearby addresses tend to be referenced close together in time.

  sum = 0;
  for (i = 0; i < n; i++)
      sum += a[i];
  v = sum;

- Locality in the example
- Data
- Reference array elements in succession (spatial)
- Instructions
- Reference instructions in sequence (spatial)
- Cycle through the loop repeatedly (temporal)
7. Accessing Data in a Memory Hierarchy
Between any two levels, memory is divided into blocks. Data moves between levels on demand, in block-sized chunks. Upper-level blocks are a subset of lower-level blocks.
[Figure: accessing word w in block a is a hit -- block a is already resident in the upper level; accessing word v in block b is a miss -- block b must be brought up from the lower level.]
Locality + "smaller hardware is faster" => memory hierarchy.
8. Four Questions for Memory Hierarchy Designers
- Q1: Where can a block be placed in the upper level? (Block placement)
- Fully associative, set associative, direct mapped
- Q2: How is a block found if it is in the upper level? (Block identification)
- Tag/block
- Q3: Which block should be replaced on a miss? (Block replacement)
- Random, LRU
- Q4: What happens on a write? (Write strategy)
- Write back or write through (with a write buffer)
9. Address Spaces
An n-bit address defines an address space of 2^n items: 0, ..., 2^n - 1.
Example: n = 5 gives the 32 addresses 00000, 00001, 00010, ..., 11111.
10. Partitioning Address Spaces
Key idea: partitioning the address bits partitions the address space. In general, an address is partitioned into fields of t (tag), s (set index), and b (block offset) bits:

  address = | tag (t bits) | set index (s bits) | block offset (b bits) |

An address belongs to one of 2^s equivalence classes (sets), where each set consists of 2^t blocks of addresses, and each block consists of 2^b addresses. The s bits uniquely identify an equivalence class. The t bits uniquely identify each block within the equivalence class. The b bits give the offset of an address within a block (block offset).
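As a concrete illustration (not from the original slides), the three fields can be extracted directly from the t/s/b widths; the widths and the example address below are assumptions chosen to match the 5-bit examples that follow.

  #include <stdint.h>
  #include <stdio.h>

  /* Hypothetical field widths: t = 3 tag bits, s = 1 set bit, b = 1 offset bit (n = 5). */
  enum { T_BITS = 3, S_BITS = 1, B_BITS = 1 };

  int main(void) {
      uint32_t addr   = 0x16;                                   /* 10110 in binary   */
      uint32_t offset = addr & ((1u << B_BITS) - 1);            /* low b bits        */
      uint32_t set    = (addr >> B_BITS) & ((1u << S_BITS) - 1);/* next s bits       */
      uint32_t tag    = addr >> (B_BITS + S_BITS);              /* remaining t bits  */
      printf("tag=%u set=%u offset=%u\n", tag, set, offset);    /* tag=5 set=1 offset=0 */
      return 0;
  }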
11. Partitioning Address Spaces (t = 1, s = 3, b = 1)
Example address 10110 splits as tag = 1, set index = 011, offset = 0.
2^s = 8 sets of blocks; 2^t = 2 blocks/set; 2^b = 2 addresses/block.
12. Partitioning Address Spaces (t = 2, s = 2, b = 1)
Example address 10110 splits as tag = 10, set index = 11, offset = 0.
2^s = 4 sets of blocks; 2^t = 4 blocks/set; 2^b = 2 addresses/block.
13. Partitioning Address Spaces (t = 3, s = 1, b = 1)
Example address 10110 splits as tag = 101, set index = 1, offset = 0.
2^s = 2 sets of blocks; 2^t = 8 blocks/set; 2^b = 2 addresses/block.
14. Partitioning Address Spaces (t = 4, s = 0, b = 1)
Example address 10110 splits as tag = 1011, offset = 0 (there is a single set).
2^s = 1 set of blocks; 2^t = 16 blocks/set; 2^b = 2 addresses/block.
15. Basic Cache Organization
- Address space: N = 2^n bytes; address = (t, s, b) bits, with n = t + s + b.
- Cache size: C = S x E x B bytes, organized as S = 2^s sets of E blocks (cache lines) of B = 2^b bytes each.
- E describes the associativity: how many blocks of a set can reside in the cache simultaneously.
16. Direct-Mapped Cache (E = 1)
- N = 16 byte addresses (n = 4); cache size C = 8 data bytes; line size B = 2^b = 2 bytes/line.
- Address split: t = 1 tag bit, s = 2 set-index bits, b = 1 offset bit.
- S = 2^s = 4 sets (00, 01, 10, 11), E = 1 entry/set.
- On an access: 1. determine the set from the middle (set-index) bits; 2. if another block is already there, knock it out; 3. put the new block in the cache.
17. Direct-Mapped Cache Simulation
N = 16 byte addresses; B = 2 bytes/block; S = 4 sets; E = 1 entry/set.
Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000].

- 0 [0000]: miss -- set 00 loads tag 0, data m[1] m[0]
- 1 [0001]: hit -- same block as address 0
- 13 [1101]: miss -- set 10 loads tag 1, data m[13] m[12]
- 8 [1000]: miss -- maps to set 00, evicts the tag-0 block, loads tag 1, data m[9] m[8]
- 0 [0000]: miss -- maps to set 00 again, evicts the tag-1 block, reloads tag 0, data m[1] m[0]
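A minimal sketch of this direct-mapped lookup (not from the slides); the parameters mirror the example above (4 sets, 2-byte lines), and the structure and function names are hypothetical. Running it reproduces the miss/hit pattern of the trace.

  #include <stdint.h>
  #include <string.h>
  #include <stdio.h>

  #define S 4                    /* sets            */
  #define B 2                    /* bytes per line  */

  struct line { int valid; uint32_t tag; uint8_t data[B]; };
  static struct line cache[S];
  static uint8_t mem[16];        /* toy backing "main memory" */

  /* Returns 1 on hit, 0 on miss; always leaves the block resident afterward. */
  int access(uint32_t addr) {
      uint32_t offset = addr % B;
      uint32_t set    = (addr / B) % S;         /* middle bits select the set */
      uint32_t tag    = addr / (B * S);
      if (cache[set].valid && cache[set].tag == tag)
          return 1;                             /* hit                        */
      cache[set].valid = 1;                     /* miss: evict and (re)fill   */
      cache[set].tag   = tag;
      memcpy(cache[set].data, &mem[addr - offset], B);
      return 0;
  }

  int main(void) {
      uint32_t trace[] = {0, 1, 13, 8, 0};
      for (int i = 0; i < 5; i++)
          printf("addr %2u: %s\n", trace[i], access(trace[i]) ? "hit" : "miss");
      return 0;
  }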
18. E-Way Set-Associative Cache
- N = 16 addresses (n = 4); cache size C = 8 data bytes; line size B = 2^b = 2 bytes.
- Address split: t = 2 tag bits, s = 1 set-index bit, b = 1 offset bit.
- 2-way set-associative cache: E = 2 entries/set, S = 2^1 = 2 sets (sets 0 and 1).
19. 2-Way Set-Associative Simulation
N = 16 addresses; B = 2 bytes/line; S = 2 sets; E = 2 entries/set.
Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000].
- 0: miss; 1: hit; 13: miss; 8: miss (LRU replacement); 0: miss (LRU replacement).
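Below is a small sketch (not from the slides) of a 2-way lookup with LRU tracking within each set; the field names and sizes are illustrative only. Running the same trace reproduces the misses above, including the two LRU replacements in set 0.

  #include <stdint.h>
  #include <stdio.h>

  #define SETS 2
  #define WAYS 2
  #define LINE 2                           /* bytes per line */

  struct way { int valid; uint32_t tag; };
  struct set { struct way w[WAYS]; int lru; };   /* lru = index of least recently used way */
  static struct set cache2[SETS];

  int access2(uint32_t addr) {
      uint32_t set = (addr / LINE) % SETS;
      uint32_t tag = addr / (LINE * SETS);
      struct set *s = &cache2[set];
      for (int i = 0; i < WAYS; i++)
          if (s->w[i].valid && s->w[i].tag == tag) {
              s->lru = 1 - i;              /* the other way becomes LRU */
              return 1;                    /* hit                       */
          }
      int victim = s->lru;                 /* miss: replace the LRU way */
      s->w[victim].valid = 1;
      s->w[victim].tag   = tag;
      s->lru = 1 - victim;
      return 0;
  }

  int main(void) {
      uint32_t trace[] = {0, 1, 13, 8, 0};
      for (int i = 0; i < 5; i++)
          printf("addr %2u: %s\n", trace[i], access2(trace[i]) ? "hit" : "miss");
      return 0;
  }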
20. Fully Associative Cache
- N = 16 addresses (n = 4); cache size C = 8 data bytes; line size B = 2^b = 2 bytes/line.
- Address split: t = 3 tag bits, s = 0 set-index bits, b = 1 offset bit.
- Fully associative cache: E = 4 entries/set, S = 2^s = 1 set.
21. Fully Associative Cache Simulation
N = 16 addresses; B = 2 bytes/line; S = 1 set; E = 4 entries/set.
Address split: t = 3 tag bits, s = 0, b = 1.
Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000].

- 0 [0000]: miss -- an entry loads tag 000, data m[1] m[0]
- 1 [0001]: hit -- same block as address 0
- 13 [1101]: miss -- an entry loads tag 110, data m[13] m[12]
- 8 [1000]: miss -- an entry loads tag 100, data m[9] m[8]
- 0 [0000]: hit -- tag 000 is still resident (no eviction needed with 4 entries)
22. Replacement Algorithms
- When a block is fetched, which block in the target set should be replaced?
- Usage-based algorithms
- Least recently used (LRU)
- Replace the block that has been referenced least recently
- Hard to implement
- Non-usage-based algorithms
- First-in first-out (FIFO)
- Treat the set as a circular queue; replace the block at the head of the queue
- Easy to implement
- Random (RAND)
- Replace a random block in the set
- Even easier to implement
23. Implementing LRU
Create an E x E bit matrix for each set (only E(E-1) bits are actually needed). When block i is referenced, set row i and then clear column i. The LRU block is the one whose row is all zeros: every other block has been referenced more recently than it.
- Setting row i: "my reference is most recent."
- Clearing column i: "I was referenced after you were."
[Figure: matrix states for the example trace (E = 4): references 1, 2, 3, 4, 3, 2, 1. After the full trace, block 4's row is all zeros, so block 4 is the LRU block.]
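A compact sketch of the bit-matrix scheme described above (not from the slides); for clarity it stores a full E x E array per set rather than the packed E(E-1) bits. Running the slide's example trace prints block 4 as the LRU block.

  #include <stdio.h>

  #define E 4                       /* blocks per set */

  static int m[E][E];               /* m[i][j] == 1 means block i was referenced after block j */

  /* Mark block i as most recently used: set row i, then clear column i. */
  void touch(int i) {
      for (int j = 0; j < E; j++) m[i][j] = 1;
      for (int j = 0; j < E; j++) m[j][i] = 0;
  }

  /* The LRU block is the one whose row is all zeros. */
  int lru(void) {
      for (int i = 0; i < E; i++) {
          int all_zero = 1;
          for (int j = 0; j < E; j++) if (m[i][j]) all_zero = 0;
          if (all_zero) return i;
      }
      return 0;                     /* unreachable once every block has been touched */
  }

  int main(void) {
      int trace[] = {1, 2, 3, 4, 3, 2, 1};     /* example trace from the slide */
      for (int k = 0; k < 7; k++) touch(trace[k] - 1);
      printf("LRU block: %d\n", lru() + 1);    /* prints 4 */
      return 0;
  }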
24. Miss Rates
- Tested on a VAX using 16-byte blocks.
- The replacement strategy is critical for small caches; it does not make much difference for large ones.
- Trends: more associativity and larger cache size both reduce the miss rate.
25. Write Strategies
- Write policy: what happens when the processor writes to the cache?
- Write through
- Information is written to the block in both the cache and memory.
- Memory is always consistent with the cache.
- Can simply overwrite the cache entry.
- Write back
- Information is written only to the block in the cache. The modified block is written to memory only when it is replaced.
- Requires a dirty bit for each block.
- To remove a dirty block from the cache, it must be written back to main memory.
- Memory is not always consistent with the cache.
- Write buffer
- Common optimization for write-through caches
- Overlaps memory updates with processor execution
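To make the two policies concrete, here is a hedged sketch (not from the slides) of what a store hit and an eviction do under each policy; the structure, sizes, and function names are assumptions for illustration.

  #include <stdint.h>
  #include <string.h>

  #define BLOCK 32
  static uint8_t main_mem[1024];                        /* toy main memory */

  struct cline { int valid, dirty; uint32_t base; uint8_t data[BLOCK]; };

  /* Write hit, write-through: update the cache line and main memory immediately
     (in practice the memory update would go through a write buffer). */
  void store_write_through(struct cline *l, uint32_t addr, uint8_t val) {
      l->data[addr - l->base] = val;
      main_mem[addr] = val;
  }

  /* Write hit, write-back: update only the cache line and mark it dirty. */
  void store_write_back(struct cline *l, uint32_t addr, uint8_t val) {
      l->data[addr - l->base] = val;
      l->dirty = 1;                                     /* memory updated at eviction */
  }

  /* Eviction of a write-back line: flush it only if dirty. */
  void evict(struct cline *l) {
      if (l->valid && l->dirty)
          memcpy(&main_mem[l->base], l->data, BLOCK);
      l->valid = l->dirty = 0;
  }

  int main(void) {
      struct cline l = { .valid = 1, .dirty = 0, .base = 64 };
      store_write_through(&l, 70, 0xAB);   /* cache and memory both updated   */
      store_write_back(&l, 71, 0xCD);      /* cache updated, line marked dirty */
      evict(&l);                           /* dirty block written back now     */
      return 0;
  }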
26. Allocation Strategies
- On a write miss, is the block loaded from memory into the cache?
- Write allocate
- The block is loaded into the cache on a write miss.
- Usually used with write back.
- No-write allocate (write around)
- The block is not loaded into the cache on a write miss.
- Usually used with write through.
27. Alpha 21064 Direct-Mapped Data Cache
34-bit address; 256 blocks; 32 bytes/block.
28. Write Merging
[Figure: a write buffer without write merging vs. one with write merging; merging combines writes to the same block into a single buffer entry.]
29. 2-Way Set-Associative Cache
- Cache size 8192 bytes; block size 8 bytes; 2-way set associative; random replacement; write through with a 1-word write buffer; no write allocate.
30. Cache Performance
- Average memory access time = Hit time + Miss rate x Miss penalty
- CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time
- Memory stall clock cycles = Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty
- CPU time = (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x Clock cycle time x IC
- Trend: as CPI and clock cycle time are reduced, cache performance is more important than ever!
31. Cache Performance Metrics
- Average memory access time = Hit time + Miss rate x Miss penalty
- Miss rate
- Fraction of memory references not found in the cache (misses / references)
- Typical numbers: 5-10% for L1, 1-2% for L2
- Hit time
- Time to deliver a block in the cache to the processor (includes the time to determine whether the block is in the cache)
- Typical numbers: 1 clock cycle for L1, 3-8 clock cycles for L2
- Miss penalty
- Additional time required because of a miss
- Typically 10-30 cycles for main memory
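Putting the typical numbers above together (the specific values chosen here are illustrative, not from the slides): with a 1-cycle hit time, a 5% miss rate, and a 20-cycle miss penalty,

  AMAT = Hit time + Miss rate x Miss penalty
       = 1 + 0.05 x 20
       = 2 cycles

so even a modest 5% miss rate doubles the effective access time relative to a perfect cache.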
32. Instruction and Data Cache: Unified?
- Miss rates for instruction, data, and unified caches (unified => structural hazards?)
- ~75% instruction references (100 / (100 + 26 + 9)); ~25% data references (26% loads, 9% stores)
- Question: which is better, a split or a unified cache?
- Miss rate? Memory access time?
- Assumptions: 1-cycle hit, 50-cycle miss penalty, 2-cycle load/store hit for unified caches (why more?)
33. Reducing Misses
- Classifying misses: the 3 Cs
- Compulsory: the first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses that would occur even in an infinite cache.)
- Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses occur because blocks are discarded and later retrieved. (Misses in a fully associative cache of size X.)
- Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X.)
34. 3Cs Absolute Miss Rate (SPEC92)
[Figure: absolute miss rates broken into compulsory, capacity, and conflict components.]
- Miss rate of a 1-way associative cache of size X ~= miss rate of a 2-way associative cache of size X/2; the difference at a given size is conflict misses.
35. How Can We Reduce Misses?
- 3 Cs: compulsory, capacity, conflict
- In all cases, assume the total cache size is not changed
- What happens if we:
- 1) Change the block size -- which of the 3Cs is obviously affected?
- 2) Change the associativity -- which of the 3Cs is obviously affected?
- 3) Change the compiler -- which of the 3Cs is obviously affected?
36. (1) Reduce Misses via Larger Block Size
[Figure: miss rate vs. block size, broken into compulsory, capacity, and conflict components; larger blocks reduce compulsory misses but can increase conflict misses.]
37. (2) Reduce Misses via Higher Associativity
- 2:1 cache rule
- Miss rate of a direct-mapped cache of size N ~= miss rate of a 2-way cache of size N/2
- Beware: execution time is the only final measure!
- Will the clock cycle time increase?
- Hill [1988] suggested the hit time for 2-way vs. 1-way is ~10% longer for an external cache, ~2% for an internal cache
38. Avg. Memory Access Time vs. Miss Rate
- Example: assume the clock cycle time is 1.10x for 2-way, 1.12x for 4-way, and 1.14x for 8-way, relative to the direct-mapped clock cycle time.

  Cache size (KB)   1-way   2-way   4-way   8-way
        1           2.33    2.15    2.07    2.01
        2           1.98    1.86    1.76    1.68
        4           1.72    1.67    1.61    1.53
        8           1.46    1.48    1.47    1.43
       16           1.29    1.32    1.32    1.32
       32           1.20    1.24    1.25    1.27
       64           1.14    1.20    1.21    1.23
      128           1.10    1.17    1.18    1.20

- (Red in the original chart marks cases where A.M.A.T. is not improved by more associativity.)
39. (3) Reducing Misses via a Victim Cache
- How to combine the fast hit time of direct mapped yet still avoid conflict misses?
- Add a small buffer to hold data discarded from the cache.
- Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache.
[Figure: CPU connected to a direct-mapped cache (tag + data), with a small victim cache and a write buffer between it and lower-level memory.]
40. (4) Reducing Miss Rate: Pseudo-Associative Caches
- How to combine the fast hit time of direct mapped with the lower conflict misses of a 2-way set-associative cache?
- Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, it is a pseudo-hit (slow hit).
- Drawback: hard on the CPU pipeline, since hits take different numbers of cycles (hit vs. slow hit).
- Better for caches not tied directly to the processor.
41. (5) Reducing Misses by Hardware Prefetching of Instructions and Data
- E.g., instruction prefetching
- The Alpha 21064 fetches 2 blocks on a miss
- The extra block is placed in a stream buffer
- On a miss, check the stream buffer
- Works with data blocks too
- Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4 KB cache; 4 streams caught 43%
- Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64 KB, 4-way set-associative caches
- Prefetching relies on having extra memory bandwidth that can be used without penalty
42. (6) Reducing Misses by Software Prefetching of Data
- Data prefetch
- Load data into a register (HP PA-RISC loads)
- Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC)
- Special prefetching instructions cannot cause faults -- a form of speculative execution
- Issuing prefetch instructions takes time
- Is the cost of issuing prefetches < the savings in reduced misses?
- Wider superscalar issue reduces the difficulty of finding issue bandwidth
43. (7) Reducing Misses by Compiler Optimizations
- McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks, in software
- Instructions
- Reorder procedures in memory so as to reduce conflict misses
- Use profiling to look at conflicts (using tools they developed)
- Data
- Merging arrays: improve spatial locality by using a single array of compound elements vs. 2 separate arrays
- Loop interchange: change the nesting of loops to access data in the order it is stored in memory
- Loop fusion: combine 2 independent loops that have the same looping structure and overlapping variables
- Blocking: improve temporal locality by accessing blocks of data repeatedly vs. going down whole columns or rows
44. Merging Arrays Example

  /* Before: 2 sequential arrays */
  int val[SIZE];
  int key[SIZE];

  /* After: 1 array of structures */
  struct merge {
      int val;
      int key;
  };
  struct merge merged_array[SIZE];

- Reduces conflicts between val and key; improves spatial locality.
45. Loop Interchange Example

  /* Before */
  for (k = 0; k < 100; k = k+1)
      for (j = 0; j < 100; j = j+1)
          for (i = 0; i < 5000; i = i+1)
              x[i][j] = 2 * x[i][j];

  /* After */
  for (k = 0; k < 100; k = k+1)
      for (i = 0; i < 5000; i = i+1)
          for (j = 0; j < 100; j = j+1)
              x[i][j] = 2 * x[i][j];

- Sequential accesses instead of striding through memory every 100 words; improved spatial locality.
46. Loop Fusion Example

  /* Before */
  for (i = 0; i < N; i = i+1)
      for (j = 0; j < N; j = j+1)
          a[i][j] = 1/b[i][j] * c[i][j];
  for (i = 0; i < N; i = i+1)
      for (j = 0; j < N; j = j+1)
          d[i][j] = a[i][j] + c[i][j];

  /* After */
  for (i = 0; i < N; i = i+1)
      for (j = 0; j < N; j = j+1) {
          a[i][j] = 1/b[i][j] * c[i][j];
          d[i][j] = a[i][j] + c[i][j];
      }

- Before: 2 misses per access to a and c; after: one miss per access -- a[i][j] and c[i][j] are reused while still in the cache (improved locality).
47. Blocking Example

  /* Before */
  for (i = 0; i < N; i = i+1)
      for (j = 0; j < N; j = j+1) {
          sum = 0.0;
          for (k = 0; k < N; k = k+1)
              sum = sum + a[i][k] * b[k][j];
          c[i][j] = sum;
      }

- Two inner loops:
- Read all NxN elements of b
- Read N elements of 1 row of a repeatedly
- Write N elements of 1 row of c
- Capacity misses are a function of N and cache size:
- If all 3 NxN matrices fit, there are no capacity misses; otherwise ...
- Idea: compute on a BxB submatrix that fits (see the blocked version on slide 60).
48. Interactions Between Program and Cache
- Major cache effects to consider
- Total cache size
- Try to keep heavily used data in the highest-level cache
- Block size (sometimes referred to as line size)
- Exploit spatial locality
- Example application
- Multiply n x n matrices
- O(n^3) total operations
- Accesses
- n reads per source element
- n values summed per destination
- But the running sum may be held in a register

  /* ijk -- variable sum held in a register */
  for (i = 0; i < n; i++)
      for (j = 0; j < n; j++) {
          sum = 0.0;
          for (k = 0; k < n; k++)
              sum += a[i][k] * b[k][j];
          c[i][j] = sum;
      }
49. Layout of Arrays in Memory
- C arrays are allocated in row-major order
- Each row occupies contiguous memory locations
- Stepping through the columns in one row:

  for (i = 0; i < n; i++)
      sum += a[0][i];

- Accesses successive elements
- For block size > 8 bytes, we get spatial locality
- Cold-start miss rate = 8/B (B = block size in bytes; elements are 8-byte doubles)
- Stepping through the rows in one column:

  for (i = 0; i < n; i++)
      sum += a[i][0];

- Accesses distant elements
- No spatial locality
- Cold-start miss rate = 1 (every access misses)

Memory layout (8-byte elements, 256 elements per row):

  0x80000: a[0][0]   0x80008: a[0][1]   0x80010: a[0][2]   0x80018: a[0][3]   ...   0x807F8: a[0][255]
  0x80800: a[1][0]   0x80808: a[1][1]   0x80810: a[1][2]   0x80818: a[1][3]   ...   0x80FF8: a[1][255]
  ...
  0xFFFF8: a[255][255]
50. Miss Rate Analysis
- Assume
- Block size = 32 B (big enough for 4 doubles)
- n is very large
- Approximate 1/n as 0.0
- The cache is not even big enough to hold multiple rows
- Analysis method
- Look at the access pattern of the inner loop
51. Matrix Multiplication (ijk)

  /* ijk */
  for (i = 0; i < n; i++)
      for (j = 0; j < n; j++) {
          sum = 0.0;
          for (k = 0; k < n; k++)
              sum += a[i][k] * b[k][j];
          c[i][j] = sum;
      }

- Inner loop: a accessed row-wise (i, *); b accessed column-wise (*, j); c fixed at (i, j).
- Approx. miss rates per inner-loop iteration: a = 0.25, b = 1.0, c = 0.0

52. Matrix Multiplication (jik)

  /* jik */
  for (j = 0; j < n; j++)
      for (i = 0; i < n; i++) {
          sum = 0.0;
          for (k = 0; k < n; k++)
              sum += a[i][k] * b[k][j];
          c[i][j] = sum;
      }

- Inner loop: same access pattern as ijk.
- Approx. miss rates: a = 0.25, b = 1.0, c = 0.0

53. Matrix Multiplication (kij)

  /* kij */
  for (k = 0; k < n; k++)
      for (i = 0; i < n; i++) {
          r = a[i][k];
          for (j = 0; j < n; j++)
              c[i][j] += r * b[k][j];
      }

- Inner loop: a fixed at (i, k); b accessed row-wise (k, *); c accessed row-wise (i, *).
- Approx. miss rates: a = 0.0, b = 0.25, c = 0.25

54. Matrix Multiplication (ikj)

  /* ikj */
  for (i = 0; i < n; i++)
      for (k = 0; k < n; k++) {
          r = a[i][k];
          for (j = 0; j < n; j++)
              c[i][j] += r * b[k][j];
      }

- Inner loop: same access pattern as kij.
- Approx. miss rates: a = 0.0, b = 0.25, c = 0.25

55. Matrix Multiplication (jki)

  /* jki */
  for (j = 0; j < n; j++)
      for (k = 0; k < n; k++) {
          r = b[k][j];
          for (i = 0; i < n; i++)
              c[i][j] += a[i][k] * r;
      }

- Inner loop: a accessed column-wise (*, k); b fixed at (k, j); c accessed column-wise (*, j).
- Approx. miss rates: a = 1.0, b = 0.0, c = 1.0

56. Matrix Multiplication (kji)

  /* kji */
  for (k = 0; k < n; k++)
      for (j = 0; j < n; j++) {
          r = b[k][j];
          for (i = 0; i < n; i++)
              c[i][j] += a[i][k] * r;
      }

- Inner loop: same access pattern as jki.
- Approx. miss rates: a = 1.0, b = 0.0, c = 1.0
57. Summary of Matrix Multiplication
- ijk and jik (jik swaps the two outer loops): 2 loads, 0 stores per inner-loop iteration; misses/iteration = 1.25

  for (i = 0; i < n; i++)
      for (j = 0; j < n; j++) {
          sum = 0.0;
          for (k = 0; k < n; k++)
              sum += a[i][k] * b[k][j];
          c[i][j] = sum;
      }

- kij and ikj: 2 loads, 1 store per inner-loop iteration; misses/iteration = 0.5

  for (k = 0; k < n; k++)
      for (i = 0; i < n; i++) {
          r = a[i][k];
          for (j = 0; j < n; j++)
              c[i][j] += r * b[k][j];
      }

- jki and kji: 2 loads, 1 store per inner-loop iteration; misses/iteration = 2.0

  for (j = 0; j < n; j++)
      for (k = 0; k < n; k++) {
          r = b[k][j];
          for (i = 0; i < n; i++)
              c[i][j] += a[i][k] * r;
      }
58. Matmult Performance (Sparc20)
[Figure: cycles per inner-loop iteration vs. matrix size for the six loop orderings, grouped by their (loads, stores, misses/iteration) behavior: (2, 1, 0.5), (2, 0, 1.25), and (2, 1, 2.0). For small matrices, multiple columns of B fit in the cache.]
- As matrices grow in size, they exceed the cache capacity
- Different loop orderings give different performance
- Cache effects
- Whether or not the sum can be accumulated in a register
59. Block Matrix Multiplication
Example: n = 8, B = 4. Partition each matrix into 2x2 sub-blocks:

  C = | C11 C12 |  =  | A11 A12 |  x  | B11 B12 |
      | C21 C22 |     | A21 A22 |     | B21 B22 |

Key idea: sub-blocks (i.e., Aij) can be treated just like scalars:

  C11 = A11*B11 + A12*B21    C12 = A11*B12 + A12*B22
  C21 = A21*B11 + A22*B21    C22 = A21*B12 + A22*B22
60. Blocked Matrix Multiply (bijk)

  for (jj = 0; jj < n; jj += bsize) {
      for (i = 0; i < n; i++)
          for (j = jj; j < min(jj+bsize, n); j++)
              c[i][j] = 0.0;
      for (kk = 0; kk < n; kk += bsize)
          for (i = 0; i < n; i++)
              for (j = jj; j < min(jj+bsize, n); j++) {
                  sum = 0.0;
                  for (k = kk; k < min(kk+bsize, n); k++)
                      sum += a[i][k] * b[k][j];
                  c[i][j] += sum;
              }
  }

- bsize is called the blocking factor.
- Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2 (B = blocking factor).
61. Blocked Matrix Multiply Analysis
- The innermost loop pair multiplies a 1 x bsize sliver of A by a bsize x bsize block of B and accumulates into a 1 x bsize sliver of C.
- The loop over i steps through n row slivers of A and C, using the same block of B.
[Figure: innermost loop pair -- the row sliver of A is accessed bsize times, the block of B is reused n times in succession, and successive elements of the C sliver are updated.]
62. Blocked Matmult Performance (Sparc20)
63. Reducing Conflict Misses by Blocking
- Conflict misses in caches that are not fully associative, vs. blocking size
64. Summary of Compiler Optimizations to Reduce Cache Misses
65. Improving Cache Performance
- 1. Reduce the miss rate,
- 2. Reduce the miss penalty, or
- 3. Reduce the time to hit in the cache.
66. (1) Reducing Miss Penalty: Read Priority over Write on Miss
- Write-through caches with write buffers can create RAW conflicts between main memory reads on cache misses and buffered writes
- If we simply wait for the write buffer to empty, the read miss penalty can increase (by 50% on the old MIPS 1000)
- Check the write buffer contents before the read; if there are no conflicts, let the memory access continue
- Write back?
- Read miss replacing a dirty block
- Normal: write the dirty block to memory, then do the read
- Instead: copy the dirty block to a write buffer, do the read, and then do the write
- The CPU stalls less since it restarts as soon as the read is done
67. (2) Reduce Miss Penalty: Subblock Placement
- Don't have to load the full block on a miss
- Keep valid bits per subblock to indicate which subblocks are valid
- (Originally invented to reduce tag storage)
[Figure: cache entries with tags 100, 300, 200, 204, each with four per-subblock valid bits, showing fully valid, partially valid, and invalid blocks.]
68. (3) Reduce Miss Penalty: Early Restart and Critical Word First
- Don't wait for the full block to be loaded before restarting the CPU
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
- Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
- Generally useful only with large blocks
69. (4) Reduce Miss Penalty: Non-Blocking Caches to Reduce Stalls on Misses
- A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss
- Requires an out-of-order execution CPU
- "Hit under miss" reduces the effective miss penalty by doing useful work during a miss instead of ignoring CPU requests
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
- Significantly increases the complexity of the cache controller, since there can be multiple outstanding memory accesses
- Requires multiple memory banks (otherwise multiple outstanding misses cannot be supported)
70. Value of Hit Under Miss for SPEC
[Figure: AMAT for hit under 0, 1, 2, and 64 outstanding misses (relative to the base case), for integer and floating-point programs.]
- 8 KB data cache, direct mapped, 32 B blocks, 16-cycle miss penalty
- FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
- Integer programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
71. (5) Reduce Miss Penalty: Second-Level Caches
- L2 equations
- AMAT = Hit time(L1) + Miss rate(L1) x Miss penalty(L1)
- Miss penalty(L1) = Hit time(L2) + Miss rate(L2) x Miss penalty(L2)
- AMAT = Hit time(L1) + Miss rate(L1) x (Hit time(L2) + Miss rate(L2) x Miss penalty(L2))
- Definitions
- Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss rate(L2))
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss rate(L1) x Miss rate(L2))
- The global miss rate is what matters
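A quick worked example of the two-level formula (the numbers here are assumed for illustration, not from the slides): with a 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit, 20% local L2 miss rate, and 100-cycle memory access,

  Miss penalty(L1)    = 10 + 0.20 x 100 = 30 cycles
  AMAT                = 1 + 0.05 x 30   = 2.5 cycles
  Global L2 miss rate = 0.05 x 0.20     = 1%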
72. L2 Cache Block Size and A.M.A.T.
- 32 KB L1; 8-byte path to memory
73. Multi-Level Caches
Can have separate I-cache and D-cache, or a unified I/D cache.

  Level         Size          Speed    Block size
  registers     200 B         5 ns     4 B
  L1 cache      8 KB          5 ns     16 B
  L2 cache      1 MB SRAM     6 ns     32 B
  main memory   128 MB DRAM   100 ns   4 KB
  disk          10 GB         10 ms    -

Going down: larger, slower, cheaper; larger block size, higher associativity, more likely to write back.
74. Alpha 21164 Hierarchy
- On the processor chip:
- L1 data: 8 KB, direct mapped, write-through, dual ported, 32 B lines, 1-cycle latency
- L1 instruction: 8 KB, direct mapped, 32 B lines
- L2 unified: 96 KB, 3-way associative, write-back, write allocate, 32 B/64 B lines, 8-cycle latency
- Off chip:
- L3 unified: 1 MB - 64 MB, direct mapped, write-back, write allocate, 32 B or 64 B lines
- Main memory: up to 1 TB
- Improving memory performance was a main design goal
- Earlier Alpha CPUs were starved for data
75. Review: Improving Cache Performance
- 1. Reduce the miss rate,
- 2. Reduce the miss penalty, or
- 3. Reduce the time to hit in the cache.
76. (1) Fast Hit Times via Small and Simple Caches
- Why does the Alpha 21164 have 8 KB instruction and 8 KB data caches plus a 96 KB second-level cache?
- A small data cache allows a short access time and a fast clock rate
- Direct mapped, on chip
77. (2) Fast Hits by Avoiding Address Translation
- Send the virtual address to the cache? Called a virtually addressed cache (or just virtual cache), vs. a physical cache
- Every time a process is switched, the cache logically must be flushed; otherwise we get false hits
- The cost is the time to flush plus the compulsory misses from an empty cache
- Dealing with aliases
- Two different virtual addresses can map to the same physical address
- I/O must interact with the cache, so it would also need virtual addresses
- Solution to aliases
- SW guarantees that aliases agree in the bits covering the cache index field, so in a direct-mapped cache they map to a unique location -- called page coloring
- Solution to cache flushing
- Add a process-identifier tag that identifies the process as well as the address within the process: we can't get a hit if the process ID is wrong
78. (3) Fast Hit Times via Pipelined Writes
- Pipeline the tag check and the cache update as separate stages: the current write does its tag check while the previous write updates the cache
- Only STOREs are in this pipeline; it empties during a miss:

  Store r2, (r1)    -- check r1
  Add   --
  Sub   --
  Store r4, (r3)    -- M[r1] <- r2; check r3

- The shaded entry in the figure is the delayed write buffer; it must be checked on reads -- either complete the write first or read from the buffer
79. Cache Optimization Summary

  Technique                                Primarily improves   Complexity
  Larger block size                        miss rate            0
  Higher associativity                     miss rate            1
  Victim caches                            miss rate            2
  HW prefetching of instructions/data      miss rate            2
  Compiler-controlled prefetching          miss rate            3
  Compiler techniques to reduce misses     miss rate            0
  Priority to read misses over writes      miss penalty         1
  Subblock placement                       miss penalty         1
  Early restart and critical word first    miss penalty         2
  Non-blocking caches                      miss penalty         3
  Second-level caches                      miss penalty         2
  Small and simple caches                  hit time             0
  Avoiding address translation             hit time             2
  Pipelining writes                        hit time             1
80. What You've Learned About Caches
- 1960-1985: speed = f(number of operations)
- 1990s:
- Pipelined execution and fast clock rates
- Out-of-order execution
- Superscalar instruction issue
- 1998: speed = f(non-cached memory accesses)
81. Static RAM (SRAM)
- Fast
- ~10 ns access time (1995)
- Persistent
- As long as power is supplied
- No refresh required
- Expensive
- 6 transistors/bit
- Stable
- High immunity to noise and environmental disturbances
- The technology for caches
82. Anatomy of an SRAM Bit (Cell)
- Read: set both bit lines high, set the word line high, and see which bit line is pulled low.
- Write: set the bit lines to opposite values, set the word line, and the cell flips to the new state.
83. Example: 1-Level-Decode SRAM (16 x 8)
[Figure: a 16-word x 8-bit SRAM array -- word lines W0-W15 select a row of memory cells, complementary bit-line pairs (b7/b7', ..., b0/b0') connect to sense/write amplifiers, and the R/W signal steers data to and from the input/output lines d7-d0.]
84. Dynamic RAM (DRAM)
- Slower than SRAM
- Access time ~70 ns (1995)
- Non-persistent
- Every row must be accessed (refreshed) every ~1 ms
- Cheaper than SRAM
- 1 transistor/bit
- Fragile
- Sensitive to electrical noise, light, radiation
- The workhorse memory technology
85. Anatomy of a DRAM Cell
[Figure: a DRAM cell -- an access transistor gated by the word line connects a storage-node capacitor (C_node) to the bit line (capacitance C_BL). Writing drives the bit line to V and asserts the word line to charge or discharge the storage node.]
86. Addressing Arrays with Bits
Consider an R x C array of addresses, where R = 2^r and C = 2^c. Then for each address:

  row(address) = address / C = leftmost r bits of the address
  col(address) = address % C = rightmost c bits of the address

So an address splits into a row field (r bits) followed by a column field (c bits).
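A minimal sketch of this split (not from the slides); the r and c values are chosen to match the 256 x 256 DRAM array on the next slide.

  #include <stdint.h>
  #include <stdio.h>

  enum { R_BITS = 8, C_BITS = 8 };   /* 256 x 256 array */

  int main(void) {
      uint32_t addr = 0xABCD;                          /* 16-bit address   */
      uint32_t row  = addr >> C_BITS;                  /* leftmost r bits  */
      uint32_t col  = addr & ((1u << C_BITS) - 1);     /* rightmost c bits */
      printf("row=%u col=%u\n", row, col);             /* row=171 col=205  */
      return 0;
  }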
87. Example: 2-Level-Decode DRAM (64K x 1)
[Figure: the 16-bit address A15-A0 is time-multiplexed -- a row address latched on RAS drives a row decoder that selects one of 256 rows of a 256x256 cell array; the column sense/write amps buffer the whole row, and a column address latched on CAS selects one bit for Dout/Din under R/W control.]
88. DRAM Operation
- Row address (~50 ns)
- Put the row address on the address lines and strobe RAS
- The entire row is read and stored in the column latches
- The contents of that row of memory cells are destroyed
- Column address (~10 ns)
- Put the column address on the address lines and strobe CAS
- Access the selected bit
- READ: transfer from the selected column latch to Dout
- WRITE: set the selected column latch to Din
- Rewrite (~30 ns)
- Write back the entire row
- Timing: access time ~60 ns < cycle time ~90 ns
- Must refresh periodically, approximately every 1 ms
- Perform a complete memory cycle for each row
- Handled in the background by the memory controller
89. Enhanced-Performance DRAMs
- Conventional access
- Row + column: RAS CAS RAS CAS ...
- Page mode
- Row + series of columns: RAS CAS CAS CAS ...
- Gives successive bits from the row buffered in the column latches
- Video RAM
- Shift out an entire row sequentially, at video rate

Typical performance:

  row access time   col access time   cycle time   page-mode cycle time
  50 ns              10 ns              90 ns         25 ns
90. Main Memory Background
- Performance of main memory
- Latency: determines the cache miss penalty
- Access time: time between the request and the word arriving
- Cycle time: minimum time between requests
- Bandwidth: determines I/O and large-block miss penalty (L2)
- Main memory is DRAM (dynamic random access memory)
- Dynamic since it needs to be refreshed periodically (every ~8 ms)
- Caches use SRAM (static random access memory)
- No refresh (6 transistors/bit vs. 1 transistor/bit); size DRAM/SRAM ~4-8x; cost and cycle time SRAM/DRAM ~8-16x
- DRAM: capacity +60%/yr, cost -30%/yr
- 2.5x cells/area, 1.5x die size in 3 years
- Order of importance: 1) cost/bit, 2) capacity
91. Bandwidth Matching
- Challenge
- The CPU works with short cycle times
- DRAM has (relatively) long cycle times
- How can we provide enough bandwidth between processor and memory?
- Effect of caching
- Caching greatly reduces the amount of traffic to main memory
- But sometimes we need to move large amounts of data from memory into the cache
- Trends
- The need for high bandwidth is much greater for multimedia applications
- Repeated operations on image data
- Recent-generation machines (e.g., Pentium II) greatly improve on their predecessors
92. High-Bandwidth Memory Systems
- Solution 1: high-bandwidth DRAM. Example: page-mode DRAM, RAMbus.
- Solution 2: a wide path between memory and cache. Example: Alpha AXP 21064 -- 256-bit-wide bus, L2 cache, and memory.
- Solution 3: memory bank interleaving. Example: DEC 3000.
93. Independent Memory Banks
- Memory banks for independent accesses vs. faster sequential accesses
- Multiprocessors
- I/O
- A CPU with hit under n misses (non-blocking cache)
- Superbank: all of the memory active on one block transfer (also just called a bank)
- Bank: the portion within a superbank that is word interleaved (also called a subbank)
- Address fields: superbank number | superbank offset, where the superbank offset = bank number | bank offset
94. Avoiding Bank Conflicts
- Lots of banks:

  int x[256][512];
  for (j = 0; j < 512; j = j+1)
      for (i = 0; i < 256; i = i+1)
          x[i][j] = 2 * x[i][j];

- Even with 128 banks, since 512 is a multiple of 128, the column accesses all fall in the same bank (conflict on word accesses)
- SW: loop interchange, or declare the array with a non-power-of-2 dimension (array padding)
- HW: a prime number of banks
- bank number = address mod number of banks
- address within bank = address / number of banks
- A modulo and a divide per memory access with a prime number of banks?
- address within bank = address mod number of words in bank (avoids the divide; valid because the prime bank count is coprime to the words per bank)
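A small illustration of the two within-bank mappings (not from the slides); the bank count and bank size are made-up values, with the bank count prime and coprime to the words per bank as the divide-free trick requires.

  #include <stdio.h>

  /* Hypothetical sizes: 7 banks (prime) and 16 words per bank (coprime with 7). */
  enum { NBANKS = 7, WORDS_PER_BANK = 16 };

  int main(void) {
      for (unsigned addr = 0; addr < NBANKS * WORDS_PER_BANK; addr += 13) {
          unsigned bank     = addr % NBANKS;          /* bank number                          */
          unsigned word_div = addr / NBANKS;          /* within-bank index using a divide     */
          unsigned word_mod = addr % WORDS_PER_BANK;  /* divide-free alternative: still unique
                                                         per bank because 7 and 16 are coprime */
          printf("addr %3u -> bank %u, word %2u (divide) / %2u (mod)\n",
                 addr, bank, word_div, word_mod);
      }
      return 0;
  }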
95. Fast Memory Systems: DRAM-Specific
- Multiple CAS accesses, under several names (page mode)
- Extended Data Out (EDO): ~30% faster in page mode
- New DRAMs to address the gap: what will they cost, and will they survive?
- RAMBUS: a startup company; reinvents the DRAM interface
- Each chip is a module, rather than a slice of memory
- Short bus between CPU and chips
- Does its own refresh
- Variable amount of data returned
- 1 byte / 2 ns (500 MB/s per chip)
- Synchronous DRAM: 2 banks on chip, a clock signal to the DRAM, transfers synchronous to the system clock (66-150 MHz)
- Intel claims RAMBUS Direct (16 bits wide) is the future of PC memory
- Niche memory or main memory?
- E.g., video RAM for frame buffers: DRAM plus fast serial output
96. Virtual Memory
- Main memory can act as a cache for the secondary storage (disk)
- Advantages
- Illusion of having more physical memory
- Program relocation
- Protection
97. Virtual Memory (cont.)
Provides the illusion of a very large memory: the sum of the memory of many jobs can be greater than physical memory, and the address space of each job can be larger than physical memory. Allows the available (fast and expensive) physical memory to be very well utilized. Simplifies memory management. Exploits the memory hierarchy to keep the average access time low. Involves at least two storage levels: main memory (RAM) and secondary storage (disk).
- Virtual address: the address used by the programmer
- Virtual address space: the collection of such addresses
- Physical address: the address of a word in physical memory
98. Virtual Address Spaces
Key idea: the virtual and physical address spaces are divided into equal-sized blocks known as virtual pages and physical pages (page frames).
[Figure: address translation maps the virtual pages of each process (virtual addresses 0 to 2^n - 1, one space per process) onto physical page frames (physical addresses 0 to 2^m - 1).]
What if the virtual address spaces are bigger than the physical address space?
99. VM as Part of the Memory Hierarchy
[Figure: accessing word w in virtual page p is a hit -- page p is resident in a memory page frame; accessing word v in virtual page q is a miss (page fault) -- page q must be brought in from disk.]
100. VM Address Translation
- V = {0, 1, ..., n - 1}: virtual address space
- M = {0, 1, ..., m - 1}: physical address space, with n > m
- MAP: V -> M U {0}: the address mapping function

  MAP(a) = a'  if the data at virtual address a is present at physical address a', a' in M
  MAP(a) = 0   if the data at virtual address a is not present in M (missing-item fault)

[Figure: the processor issues virtual address a; the address translation mechanism either produces physical address a' for main memory, or raises a missing-item fault handled by the fault handler, and the OS transfers the page from secondary memory.]
101. VM Address Translation
- Virtual address (32 bits): bits 31-12 = virtual page number, bits 11-0 = page offset
- Address translation maps the virtual page number to a physical page number
- Physical address (30 bits): bits 29-12 = physical page number, bits 11-0 = page offset
- Notice that the page offset bits don't change as a result of translation
102. Address Translation with a Page Table
- The virtual page number (bits 31-12 of the virtual address) indexes a page table, located via the page table register
- Each page table entry holds a valid bit, access rights, and a physical page number
- If valid = 0, the page is not in memory and a page fault exception is raised
- Otherwise the physical page number is concatenated with the page offset (bits 11-0) to form the physical address
103. Page Tables
104. Address Translation with a Page Table (cont.)
- There are separate page table(s) per process.
- If V = 1, the page is in main memory at the frame address stored in the table; otherwise the entry gives the location of the page in secondary memory.
- Access rights: R = read-only, R/W = read/write, X = execute only.
- If the kind of access is not compatible with the specified access rights, a protection violation fault occurs. If the valid bit is not set, a page fault occurs.
- Protection fault: an access-rights violation; causes a trap to a hardware, microcode, or software fault handler.
- Page fault: the page is not resident in physical memory; also causes a trap; usually accompanied by a context switch, with the current process suspended while the page is fetched from secondary storage.
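A hedged sketch of the lookup just described (not the slides' code); the PTE layout, rights encoding, and 4 KB page size are assumptions chosen only for illustration.

  #include <stdint.h>

  #define PAGE_BITS 12                          /* assumed 4 KB pages              */
  #define NPAGES    (1u << 20)                  /* 32-bit VA -> 2^20 virtual pages */

  struct pte { unsigned valid : 1; unsigned rights : 2; unsigned frame : 20; };
  static struct pte page_table[NPAGES];         /* one flat table for one process  */

  enum fault { OK = 0, PAGE_FAULT, PROTECTION_FAULT };

  /* Translate a virtual address; on success writes the physical address to *pa. */
  enum fault translate(uint32_t va, unsigned is_write, uint32_t *pa) {
      uint32_t vpn    = va >> PAGE_BITS;
      uint32_t offset = va & ((1u << PAGE_BITS) - 1);
      struct pte e = page_table[vpn];
      if (!e.valid)
          return PAGE_FAULT;                    /* trap: OS fetches page from disk   */
      if (is_write && e.rights == 0)            /* rights == 0 means read-only here  */
          return PROTECTION_FAULT;              /* trap: access-rights violation     */
      *pa = ((uint32_t)e.frame << PAGE_BITS) | offset;  /* offset bits are unchanged */
      return OK;
  }

  int main(void) {
      uint32_t pa;
      page_table[0x12345] = (struct pte){ .valid = 1, .rights = 1, .frame = 0x42 };
      enum fault f = translate((0x12345u << PAGE_BITS) | 0x7FC, 0, &pa);
      return (f == OK && pa == ((0x42u << PAGE_BITS) | 0x7FC)) ? 0 : 1;
  }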
105. VM Design Issues
- Everything is driven by the enormous cost of misses
- Hundreds of thousands of clocks, vs. units or tens of clocks for cache misses
- Disks are high-latency, low-bandwidth devices (compared to memory)
- Disk performance: ~10 ms access time, 10-33 MBytes/sec transfer rate
- Large block (page) sizes
- 4 KB - 16 KB are typical
- Amortize the high access time
- Reduce the miss rate by exploiting locality
106. VM Design Issues (cont.)
- Fully associative page placement
- Eliminates conflict misses
- Every miss is a killer, so the extra cost of full associativity is worth it
- Use smart replacement algorithms
- Handle misses in software
- The miss penalty is so high anyway that there is no reason to handle misses in hardware
- Small improvements pay big dividends
- Write back only
- Disk access is too slow to afford write through plus a write buffer
107. Integrating VM and Cache
[Figure: the CPU issues a VA; the translation unit produces a PA that indexes the cache; on a cache miss the PA goes to main memory.]
It takes an extra memory access to translate a VA to a PA. Why not address the cache with the VA directly? The aliasing problem: 2 virtual addresses can point to the same physical page, with the result that two cache blocks hold one physical location. Solution: index the cache with the low-order VA bits that don't change during translation (requires small caches, or OS support such as page coloring).
108. Speeding Up Translation with a TLB
A translation lookaside buffer (TLB) is a small, usually fully associative cache that maps virtual page numbers to physical page numbers.
[Figure: the CPU's VA goes to the TLB; on a TLB hit the PA goes straight to the cache (and on to main memory on a cache miss); on a TLB miss the full translation is performed and the TLB is refilled.]
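A tiny fully associative TLB lookup in the spirit of the figure (a sketch, not the slides' design; the entry count, page size, and field names are assumptions).

  #include <stdint.h>

  #define TLB_ENTRIES 32
  #define PAGE_BITS   12

  struct tlb_entry { int valid; uint32_t vpn; uint32_t ppn; };
  static struct tlb_entry tlb[TLB_ENTRIES];

  /* Look up a VA; returns 1 and fills *pa on a TLB hit, 0 on a TLB miss
     (on a miss, the page-table walk would supply the PPN and refill the TLB). */
  int tlb_lookup(uint32_t va, uint32_t *pa) {
      uint32_t vpn    = va >> PAGE_BITS;
      uint32_t offset = va & ((1u << PAGE_BITS) - 1);
      for (int i = 0; i < TLB_ENTRIES; i++)            /* fully associative: check all entries */
          if (tlb[i].valid && tlb[i].vpn == vpn) {
              *pa = (tlb[i].ppn << PAGE_BITS) | offset;
              return 1;                                /* TLB hit  */
          }
      return 0;                                        /* TLB miss */
  }

  int main(void) {
      uint32_t pa;
      tlb[0] = (struct tlb_entry){ .valid = 1, .vpn = 0x12345, .ppn = 0x42 };
      int hit = tlb_lookup((0x12345u << PAGE_BITS) | 0x10, &pa);
      return (hit && pa == ((0x42u << PAGE_BITS) | 0x10)) ? 0 : 1;
  }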
109. Address Translation with a TLB
[Figure: the virtual page number (bits 31-12 of the virtual address) is compared against the TLB tags; on a TLB hit the entry supplies the physical page number (plus valid and dirty bits), which is concatenated with the page offset to form the physical address; the physical address is then split into cache tag, index, and byte offset to access the cache, whose tag comparison produces the cache-hit signal and data.]
110. Alpha AXP 21064 TLB
- Page size: 8 KB; block size: 1 PTE (8 bytes)
- Hit time: 1 clock; miss penalty: 20 clocks
- TLB size: ITLB 8 PTEs, DTLB 32 PTEs
- Replacement: random (but not the last used); placement: fully associative
[Figure: the 30-bit page-frame address is compared against the tag of every TLB entry (each entry holds V<1>, R<2>, W<2>, tag<30>, and physical address<21>); the matching entry's 21 high-order physical bits, selected through a 32:1 mux, are combined with the low-order 13-bit page offset to form the 34-bit physical address.]
111. Mapping an Alpha 21064 Virtual Address
[Figure: the virtual address consists of a seg0/seg1 selector (bits all 0s or all 1s), three 10-bit fields (level 1, level 2, level 3), and a 13-bit page offset. The page table base register locates the L1 page table; the level-1 field indexes it to find an L2 page table, the level-2 field indexes that to find an L3 page table, and the level-3 field indexes that to find the final page table entry. Each PTE is 8 bytes, so each page table holds 1K PTEs (8 KB). The 21-bit physical page number from the final PTE plus the 13-bit page offset form the physical address used to access main memory.]
113. Alpha 21164 Chip Photo
- Microprocessor Report, 9/12/94
- Visible on the die:
- L1 data cache
- L1 instruction cache
- L2 unified cache
- TLB
- Branch history table
114. Alpha Memory Performance: Miss Rates of SPEC92
- Caches: 8 KB L1 instruction, 8 KB L1 data, 2 MB L2
[Figure: per-benchmark miss rates; e.g., one workload shows I-cache 6%, D-cache 32%, L2 10%; another I-cache 2%, D-cache 13%, L2 0.6%; another I-cache 1%, D-cache 21%, L2 0.3%.]
115. Alpha CPI Components
- Instruction stall: branch mispredict (green in the original chart)
- Data cache misses (blue); instruction cache misses (yellow); L2 misses (pink)
- Other: compute, register conflicts, structural conflicts
116. Modern Systems