Title: Chapter 5: Memory Hierarchy Design
1Chapter 5 Memory Hierarchy Design
- Yirng-An Chen
- Dept. of CIS
- Computer Architecture
- Fall, 2000
2Computer System
3Who Cares About the Memory Hierarchy?
- Processor Only Thus Far in Course
- CPU cost/performance, ISA, Pipelined Execution
- CPU-DRAM Gap
- 1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with a cache on chip)
[Figure: Processor-Memory Performance Gap, 1980-2000, performance on a log scale from 1 to 1000. µProc (CPU, "Moore's Law"): 60%/yr. DRAM: 7%/yr. The gap grows 50%/year.]
4Levels in a typical memory hierarchy
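[Figure: registers, cache, memory, disk. The cache sits between registers and memory; virtual memory sits between memory and disk. Transfer sizes between adjacent levels: 4 B, 8 B, 4 KB.]
                        size           speed    $/Mbyte     block size
register reference      200 B          3 ns                 4 B
cache reference         32 KB / 4 MB   6 ns     NT$?/MB     8 B
memory reference        128 MB         100 ns   NT$50/MB    4 KB
disk memory reference   20 GB          10 ms    NT$0.9/MB
Moving down the hierarchy: larger, slower, cheaper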
5Sources of Memory References
sum = 0; for (i = 0; i < n; i++) sum += a[i]; v = sum;
Abstract Version of Machine Code
  I0:        sum <-- 0
  I1:        ap <-- a
  I2:        i <-- 0
  I3:        if (i >= n) goto done
  I4: loop:  t <-- (ap)
  I5:        sum <-- sum + t
  I6:        ap <-- ap + 4
  I7:        i <-- i + 1
  I8:        if (i < n) goto loop
  I9: done:  v <-- sum
Memory Layout
  0x0FC  I0
  0x100  I1
  0x104  I2
  0x108  I3
  0x10C  I4
  0x110  I5
  0x114  I6
  ...
  0x400  a[0]
  0x404  a[1]
  0x408  a[2]
  0x40C  a[3]
  0x410  a[4]
  0x414  a[5]
  ...
  0x7A4  v
- Memory addresses in bytes
- Each instruction & data word: 4 bytes
- Instruction sequences & data arrays laid out as contiguous memory blocks
6Locality of reference
- Principle of Locality
- programs tend to reuse data and instructions near those they have used recently.
- temporal locality: recently referenced items are likely to be referenced in the near future.
- spatial locality: items with nearby addresses tend to be referenced close together in time.
sum = 0; for (i = 0; i < n; i++) sum += a[i]; v = sum;
- Locality in Example
- Data
- Reference array elements in succession (spatial)
- Instruction
- Reference instructions in sequence (spatial)
- Cycle through loop repeatedly (temporal)
7Accessing data in a memory hierarchy
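Between any two levels, memory is divided into blocks. Data moves between levels on demand, in block-sized chunks. Upper-level blocks are a subset of lower-level blocks.
- Access word w in block a (hit)
- Access word v in block b (miss)
[Figure: block a is already in the upper level, so word w is delivered directly; block b is only in the lower level and must be copied up before word v is delivered]
Locality + smaller HW is faster = memory hierarchy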
8Four ?s for Memory Hierarchy Designers
- Q1: Where can a block be placed in the upper level? (Block placement)
- Fully Associative, Set Associative, Direct Mapped
- Q2: How is a block found if it is in the upper level? (Block identification)
- Tag/Block
- Q3: Which block should be replaced on a miss? (Block replacement)
- Random, LRU
- Q4: What happens on a write? (Write strategy)
- Write Back or Write Through (with Write Buffer)
9Address spaces
An n-bit address defines an address space of 2^n items: 0, ..., 2^n - 1.
Address space for n = 5:
00000 00001 00010 00011 00100 00101 00110 00111
01000 01001 01010 01011 01100 01101 01110 01111
10000 10001 10010 10011 10100 10101 10110 10111
11000 11001 11010 11011 11100 11101 11110 11111
10Partitioning address spaces
Key idea: partitioning the address bits partitions the address space. In general, an address partitioned into fields of t (tag), s (set index), and b (block offset) bits, e.g.,
  address = | tag (t bits) | set index (s bits) | offset (b bits) |
belongs to one of 2^s equivalence classes (sets), where each set consists of 2^t blocks of addresses, and each block consists of 2^b addresses. The s bits uniquely identify an equivalence class. The t bits uniquely identify each block in the equivalence class. The b bits define the offset of an address within a block (block offset).
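The field split described above can be sketched in C. This is a minimal illustration, not part of the original slides: the names `addr_fields` and `split_address` are hypothetical, and the tag is simply taken to be every bit above the set index.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an address split into tag / set index / block offset. */
typedef struct {
    uint32_t tag;
    uint32_t set;
    uint32_t offset;
} addr_fields;

/* Split addr using s set-index bits and b block-offset bits.
   All remaining high-order bits form the tag. */
static addr_fields split_address(uint32_t addr, unsigned s, unsigned b)
{
    addr_fields f;
    assert(s + b < 32);                       /* keep the shifts well defined */
    f.offset = addr & ((1u << b) - 1u);       /* low b bits */
    f.set    = (addr >> b) & ((1u << s) - 1u);/* next s bits */
    f.tag    = addr >> (s + b);               /* everything above */
    return f;
}
```

For the 5-bit address 10110 with s = 3 and b = 1, `split_address(0x16, 3, 1)` yields tag 1, set 011, offset 0.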
11Partitioning address spaces
00000 00001 00010 00011 00100 00101 00110 00111
01000 01001 01010 01011 01100 01101 01110 01111
10000 10001 10010 10011 10100 10101 10110 10111
11000 11001 11010 11011 11100 11101 11110 11111
t = 1, s = 3, b = 1. Example address 1 011 0: block (tag) 1, set 011, offset 0.
2^s = 8 sets of blocks, 2^t = 2 blocks/set, 2^b = 2 addresses/block.
12Partitioning address spaces
00000 00001 00010 00011 00100 00101 00110 00111
01000 01001 01010 01011 01100 01101 01110 01111
10000 10001 10010 10011 10100 10101 10110 10111
11000 11001 11010 11011 11100 11101 11110 11111
t = 2, s = 2, b = 1. Example address 10 11 0: block (tag) 10, set 11, offset 0.
2^s = 4 sets of blocks, 2^t = 4 blocks/set, 2^b = 2 addresses/block.
13Partitioning address spaces
00000 00001 00010 00011 00100 00101 00110 00111
01000 01001 01010 01011 01100 01101 01110 01111
10000 10001 10010 10011 10100 10101 10110 10111
11000 11001 11010 11011 11100 11101 11110 11111
t = 3, s = 1, b = 1. Example address 101 1 0: block (tag) 101, set 1, offset 0.
2^s = 2 sets of blocks, 2^t = 8 blocks/set, 2^b = 2 addresses/block.
14Partitioning address spaces
00000 00001 00010 00011 00100 00101 00110 00111
01000 01001 01010 01011 01100 01101 01110 01111
10000 10001 10010 10011 10100 10101 10110 10111
11000 11001 11010 11011 11100 11101 11110 11111
t = 4, s = 0, b = 1. Example address 1011 0: block (tag) 1011, set ø (the only set), offset 0.
2^s = 1 set of blocks, 2^t = 16 blocks/set, 2^b = 2 addresses/block.
15Basic cache organization
Cache (C = S x E x B bytes)
Address space (N = 2^n bytes)
Address (n = t + s + b bits): | t | s | b |
S = 2^s sets
E blocks/set
Cache block (cache line)
E describes associativity: how many blocks in a set can reside in the cache simultaneously.
16Direct mapped cache (E 1)
N = 16 byte addresses (n = 4)
cache size: C = 8 data bytes; line size: B = 2^b = 2 bytes/line
t = 1, s = 2, b = 1: address bits x xx x
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
direct mapped cache: E = 1 entry/set, S = 2^s = 4 sets (00 01 10 11)
1. Determine set from middle bits
2. If something already there, knock it out
3. Put new block in cache
17Direct Mapped Cache Simulation
N = 16 byte addresses, B = 2 bytes/block, S = 4 sets, E = 1 entry/set
Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
Cache contents after each miss (v = valid bit; only occupied lines shown):
(1) 0 [0000] (miss)
    v  tag  data
    1  0    m[1] m[0]
(2) 13 [1101] (miss)
    v  tag  data
    1  0    m[1] m[0]
    1  1    m[13] m[12]
(3) 8 [1000] (miss)
    v  tag  data
    1  1    m[9] m[8]
    1  1    m[13] m[12]
(4) 0 [0000] (miss)
    v  tag  data
    1  0    m[1] m[0]
    1  1    m[13] m[12]
18E-way Set-Associative Cache
N = 16 addresses (n = 4)
Cache size: C = 8 data bytes; line size: B = 2^b = 2 bytes
t = 2, s = 1, b = 1: address bits xx x x
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
2-way set associative cache: E = 2 entries/set, S = 2^1 = 2 sets (labeled 00, 01 in the figure)
192-Way Set Associative Simulation
N = 16 addresses, B = 2 bytes/line, S = 2 sets, E = 2 entries/set
Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
- 0 (miss)
- 13 (miss)
- 8 (miss) (LRU replacement)
- 0 (miss) (LRU replacement)
20Fully associative cache
N = 16 addresses (n = 4)
cache size: C = 8 data bytes; line size: B = 2^b = 2 bytes/line
t = 3, s = 0, b = 1: address bits xxx x
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
fully associative cache: E = 4 entries/set, S = 2^s = 1 set
21Fully Associative Cache Simulation
N = 16 addresses, B = 2 bytes/line, S = 1 set, E = 4 entries/set
Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]
t = 3, s = 0, b = 1: address bits xxx x
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
Contents of set ø after each miss (v = valid bit):
(1) 0 (miss)
    v  tag  data
    1  000  m[1] m[0]
(2) 13 (miss)
    v  tag  data
    1  000  m[1] m[0]
    1  110  m[13] m[12]
(3) 8 (miss)
    v  tag  data
    1  000  m[1] m[0]
    1  110  m[13] m[12]
    1  100  m[9] m[8]
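The three simulations above differ only in S and E, so one small simulator can replay the same trace on each organization. The sketch below is illustrative only: `count_misses`, `MAX_SETS`, and `MAX_WAYS` are hypothetical names, reads are the only operation modeled, and LRU is tracked with a per-line timestamp rather than any hardware scheme.

```c
#include <assert.h>
#include <string.h>

#define MAX_SETS 64   /* hypothetical limits for this sketch */
#define MAX_WAYS 8

/* Count misses for a read-only address trace on a cache with
   S sets, E lines per set, B bytes per line, LRU replacement. */
static int count_misses(const unsigned *trace, int len,
                        unsigned S, unsigned E, unsigned B)
{
    int valid[MAX_SETS][MAX_WAYS];
    unsigned tag[MAX_SETS][MAX_WAYS];
    int last_use[MAX_SETS][MAX_WAYS];  /* time of most recent reference */
    int misses = 0;

    assert(S >= 1 && S <= MAX_SETS && E >= 1 && E <= MAX_WAYS && B >= 1);
    memset(valid, 0, sizeof valid);
    memset(tag, 0, sizeof tag);
    memset(last_use, 0, sizeof last_use);

    for (int t = 0; t < len; t++) {
        unsigned block = trace[t] / B;   /* strip the block offset */
        unsigned set = block % S;        /* set index */
        unsigned tg = block / S;         /* tag */
        int line = -1;

        for (unsigned w = 0; w < E; w++)
            if (valid[set][w] && tag[set][w] == tg)
                line = (int)w;           /* hit */

        if (line < 0) {                  /* miss: pick a victim */
            int victim = 0;
            misses++;
            for (unsigned w = 0; w < E; w++) {
                if (!valid[set][w]) { victim = (int)w; break; }  /* empty line */
                if (last_use[set][w] < last_use[set][victim])
                    victim = (int)w;     /* least recently used so far */
            }
            valid[set][victim] = 1;
            tag[set][victim] = tg;
            line = victim;
        }
        last_use[set][line] = t;
    }
    return misses;
}
```

Replaying the trace 0, 1, 13, 8, 0 with B = 2 gives 4 misses for the direct mapped cache (S = 4, E = 1), 4 for the 2-way cache (S = 2, E = 2), and 3 for the fully associative cache (S = 1, E = 4), where the final reference to 0 hits.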
22Replacement Algorithms
- When a block is fetched, which block in the target set should be replaced?
- Usage based algorithms
- Least recently used (LRU)
- replace the block that has been referenced least recently
- hard to implement
- Non-usage based algorithms
- First-in First-out (FIFO)
- treat the set as a circular queue, replace block at head of queue.
- easy to implement
- Random (RAND)
- replace a random block in the set
- even easier to implement
23Implementing LRU
Create an E x E bit matrix for each set (only E(E-1) bits are used). When block i is referenced, set row i and clear column i. The LRU block is the row with all zeros: all other blocks have been referenced more recently than this one.
Example trace (E = 4): 1 2 3 4 3 2 1
[Figure: the 4 x 4 bit matrix after each reference in the trace]
Setting row: "My reference is the most recent." Clearing column: "I was referenced after you were."
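The matrix scheme can be sketched in C with one machine word per row. This is an illustration under assumptions: the names `lru_matrix`, `lru_touch`, and `lru_victim` and the fixed `LRU_WAYS` of 4 are hypothetical, and blocks are numbered from 0 rather than from 1 as in the slide's trace.

```c
#include <assert.h>

#define LRU_WAYS 4   /* associativity E; fixed here for the sketch */

/* Bit j of row[i] set means "block i was referenced after block j". */
typedef struct { unsigned row[LRU_WAYS]; } lru_matrix;

static void lru_init(lru_matrix *m)
{
    for (int r = 0; r < LRU_WAYS; r++)
        m->row[r] = 0;
}

/* Reference block i: set row i, then clear column i in every row. */
static void lru_touch(lru_matrix *m, int i)
{
    assert(i >= 0 && i < LRU_WAYS);
    m->row[i] = (1u << LRU_WAYS) - 1u;
    for (int r = 0; r < LRU_WAYS; r++)
        m->row[r] &= ~(1u << i);
}

/* The LRU block is the one whose row is all zeros. */
static int lru_victim(const lru_matrix *m)
{
    for (int r = 0; r < LRU_WAYS; r++)
        if (m->row[r] == 0)
            return r;
    return -1;  /* not expected: some row is always zero */
}
```

With blocks renumbered 0..3, the slide's trace 1 2 3 4 3 2 1 becomes 0 1 2 3 2 1 0. After the first four references the victim is block 0, and after the full trace it is block 3.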
24Miss Rates
- Tested on a VAX using 16-byte blocks.
- Replacement strategy critical for small caches; does not make a lot of difference for large ones.
- Trends: higher associativity, larger cache size
25Write Strategies
- Write Policy
- What happens when processor writes to the cache?
- write through
- information is written to the block in cache and memory.
- memory always consistent with cache
- Can overwrite cache entry
- write back
- information is written only to the block in cache. Modified block written to memory only when it is replaced.
- requires a dirty bit for each block
- To remove dirty block from cache, must write back to main memory
- memory not always consistent with cache
- Write Buffer
- Common optimization for write-through caches
- Overlaps memory updates with processor execution
26Allocation Strategies
- On a write miss, is the block loaded from memory into the cache?
- Write Allocate
- Block is loaded into cache on a write miss.
- Usually used with write back
- No-Write Allocate (Write Around)
- Block is not loaded into cache on a write miss
- Usually used with write through
27Alpha 21064 direct mapped data cache
34-bit address 256 blocks 32-bytes/block
28Write merging
A write buffer that does not do write merging
A write buffer that does write merging
292-way set associative Cache
- Cache size: 8192 bytes; block size: 8 bytes; 2-way set associative; random replacement; WT with a 1-word write buffer; no write allocate.
30 Cache Performance
- Average memory access time = Hit time + Miss rate x Miss penalty
- CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time
- Memory stall clock cycles = (Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty)
- CPU time = (CPI_execution + Mem accesses per instruction x Miss rate x Miss penalty) x Clock cycle time x IC
- Trend: CPI, CCT reduced => cache performance is more important than ever!
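The two formulas most often used from this slide can be written as small C helpers. This is a sketch, not from the original deck: the function and parameter names are hypothetical, and times may be in cycles or seconds as long as the units are consistent.

```c
#include <assert.h>

/* Average memory access time = hit time + miss rate x miss penalty. */
static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}

/* CPU time = (CPI_execution + accesses/instr x miss rate x miss penalty)
              x clock cycle time x IC.
   miss_penalty is in clock cycles here. */
static double cpu_time(double ic, double cpi_execution,
                       double mem_accesses_per_instr,
                       double miss_rate, double miss_penalty,
                       double clock_cycle_time)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    double cpi = cpi_execution
               + mem_accesses_per_instr * miss_rate * miss_penalty;
    return cpi * clock_cycle_time * ic;
}
```

For example, with made-up inputs, a 1-cycle hit, a 25% miss rate, and an 8-cycle penalty give `amat(1.0, 0.25, 8.0)` = 3 cycles.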
31Cache Performance Metrics
- Average mem access time = Hit time + Miss rate x Miss penalty
- Miss Rate
- fraction of memory references not found in cache (misses/references)
- Typical numbers: 5-10% for L1, 1-2% for L2
- Hit Time
- time to deliver a block in the cache to the processor (includes time to determine whether the block is in the cache)
- Typical numbers:
- 1 clock cycle for L1
- 3-8 clock cycles for L2
- Miss Penalty
- additional time required because of a miss
- Typically 10-30 cycles for main memory
32Instruction and Data cache unified?
- Miss rates for instruction, data, and unified caches (unified: structural hazards?)
- 75% instruction references (100/(100+26+9)); 25% data references (26 loads, 9 stores)
- Question: Which is better, a split or a unified cache?
- Miss rate? Memory access time?
- Assumptions: 1-cycle hit, 50-cycle miss penalty, 2-cycle load/store hit for unified caches (why more?)
33Reducing Misses
- Classifying Misses 3 Cs
- Compulsory: The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an Infinite Cache)
- Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Fully Associative, Size X Cache)
- Conflict: If block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in N-way Associative, Size X Cache)
343Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate per type vs. cache size; the Conflict component is highlighted]
miss rate of a 1-way associative cache of size X = miss rate of a 2-way associative cache of size X/2
35How Can Reduce Misses?
- 3 Cs Compulsory, Capacity, Conflict
- In all cases, assume total cache size not changed
- What happens if:
- 1) Change Block Size: Which of 3Cs is obviously affected?
- 2) Change Associativity: Which of 3Cs is obviously affected?
- 3) Change Compiler: Which of 3Cs is obviously affected?
361. Reduce Misses via Larger Block Size
Conflict misses
Compulsory misses
Capacity misses
372. Reduce Misses via Higher Associativity
- 2:1 Cache Rule
- Miss Rate of DM cache of size N = Miss Rate of 2-way cache of size N/2
- Beware: Execution time is only final measure!
- Will Clock Cycle time increase?
- Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%
38Avg. Memory Access Time vs. Miss Rate
- Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. CCT of direct mapped
  Cache Size   Associativity
  (KB)         1-way   2-way   4-way   8-way
  1            2.33    2.15    2.07    2.01
  2            1.98    1.86    1.76    1.68
  4            1.72    1.67    1.61    1.53
  8            1.46    1.48    1.47    1.43
  16           1.29    1.32    1.32    1.32
  32           1.20    1.24    1.25    1.27
  64           1.14    1.20    1.21    1.23
  128          1.10    1.17    1.18    1.20
- (Red in the original slide marks entries where A.M.A.T. is not improved by more associativity)
393. Reducing Misses via a Victim Cache
- How to combine fast hit time of direct mapped yet still avoid conflict misses?
- Add buffer to place data discarded from cache
- Jouppi [1990]: 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
[Figure: CPU address and data in/out feed the cache tag and data arrays (with tag comparator); a victim cache with its own comparator and a write buffer sit between the cache and lower level memory]
404. Reducing Miss Rate: Pseudo-Associative Caches
- How to combine fast hit time of direct mapped and have the lower conflict misses of 2-way SA cache?
- Divide cache: on a miss, check other half of cache to see if there; if so, have a pseudo-hit (slow hit).
- Drawback: CPU pipeline is hard if hit takes different cycles (hit or slow hit?).
- Better for caches not tied directly to processor.
415. Reducing Misses by Hardware Prefetching of Instructions & Data
- E.g., Instruction Prefetching
- Alpha 21064 fetches 2 blocks on a miss
- Extra block placed in stream buffer
- On miss check stream buffer
- Works with data blocks too
- Jouppi [1990]: 1 data stream buffer got 25% of misses from 4KB cache; 4 streams got 43%
- Palacharla & Kessler [1994]: for scientific programs, 8 streams got 50% to 70% of misses from 2 64KB, 4-way set associative caches
- Prefetching relies on having extra memory bandwidth that can be used without penalty
426. Reducing Misses by Software Prefetching Data
- Data Prefetch
- Load data into register (HP PA-RISC loads)
- Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC)
- Special prefetching instructions cannot cause faults; a form of speculative execution
- Issuing Prefetch Instructions takes time
- Is cost of prefetch issues < savings in reduced misses?
- Higher superscalar reduces difficulty of issue bandwidth
437. Reducing Misses by Compiler Optimizations
- McFarling [1989] reduced cache misses by 75% on 8KB direct mapped cache, 4 byte blocks, in software
- Instructions
- Reorder procedures in memory so as to reduce conflict misses
- Profiling to look at conflicts (using tools they developed)
- Data
- Merging Arrays: improve spatial locality by single array of compound elements vs. 2 arrays
- Loop Interchange: change nesting of loops to access data in order stored in memory
- Loop Fusion: combine 2 independent loops that have same looping and some variables overlap
- Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
44Merging Arrays Example
  /* Before: 2 sequential arrays */
  int val[SIZE];
  int key[SIZE];
  /* After: 1 array of structures */
  struct merge {
    int val;
    int key;
  };
  struct merge merged_array[SIZE];
- Reducing conflicts between val & key; improve spatial locality
45Loop Interchange Example
  /* Before */
  for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
      for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];
  /* After */
  for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
      for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];
- Sequential accesses instead of striding through memory every 100 words; improved spatial locality
46Loop Fusion Example
  /* Before */
  for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
      a[i][j] = 1/b[i][j] * c[i][j];
  for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
      d[i][j] = a[i][j] + c[i][j];
  /* After */
  for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
      a[i][j] = 1/b[i][j] * c[i][j];
      d[i][j] = a[i][j] + c[i][j];
    }
- 2 misses per access to a & c vs. one miss per access; improve spatial locality
47Blocking Example
  /* Before */
  for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
      sum = 0.0;
      for (k = 0; k < N; k = k+1)
        sum = sum + a[i][k]*b[k][j];
      c[i][j] = sum;
    }
- Two Inner Loops:
- Read all NxN elements of b
- Read N elements of 1 row of a repeatedly
- Write N elements of 1 row of c
- Capacity Misses a function of N & Cache Size:
- 3 NxN => no capacity misses; otherwise ...
- Idea: compute on BxB submatrix that fits
48Interactions Between Program & Cache
- Major Cache Effects to Consider
- Total cache size
- Try to keep heavily used data in highest level cache
- Block size (sometimes referred to as line size)
- Exploit spatial locality
- Example Application
- Multiply n x n matrices
- O(n^3) total operations
- Accesses
- n reads per source element
- n values summed per destination
- But may be able to hold in register
  /* ijk: variable sum held in register */
  for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
      sum = 0.0;
      for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
  }
49Layout of Arrays in Memory
- C Arrays Allocated in Row-Major Order
- Each row in contiguous memory locations
- Stepping Through Columns in One Row
  for (i = 0; i < n; i++)
    sum += a[0][i];
- Accesses successive elements
- For block size > 8, get spatial locality
- Cold Start Miss Rate = 8/B
- Stepping Through Rows in One Column
  for (i = 0; i < n; i++)
    sum += a[i][0];
- Accesses distant elements
- No spatial locality
- Cold Start Miss Rate = 1
Memory Layout
  0x80000  a[0][0]
  0x80008  a[0][1]
  0x80010  a[0][2]
  0x80018  a[0][3]
  ...
  0x807F8  a[0][255]
  0x80800  a[1][0]
  0x80808  a[1][1]
  0x80810  a[1][2]
  0x80818  a[1][3]
  ...
  0x80FF8  a[1][255]
  ...
  0xFFFF8  a[255][255]
50Miss Rate Analysis
- Assume
- Block size 32B (big enough for 4 doubles)
- n is very large
- Approximate 1/n as 0.0
- Cache not even big enough to hold multiple rows
- Analysis Method
- Look at access pattern by inner loop
C
51Matrix multiplication (ijk)
  /* ijk */
  for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
      sum = 0.0;
      for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
  }
Inner loop: A (i,*) row-wise; B (*,j) column-wise; C (i,j) fixed
- Approx. Miss Rates
    a      b      c
    0.25   1.0    0.0
52Matrix multiplication (jik)
  /* jik */
  for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
      sum = 0.0;
      for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
  }
Inner loop: A (i,*) row-wise; B (*,j) column-wise; C (i,j) fixed
- Approx. Miss Rates
    a      b      c
    0.25   1.0    0.0
53Matrix multiplication (kij)
  /* kij */
  for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
      r = a[i][k];
      for (j=0; j<n; j++)
        c[i][j] += r * b[k][j];
    }
  }
Inner loop: A (i,k) fixed; B (k,*) row-wise; C (i,*) row-wise
- Approx. Miss Rates
    a      b      c
    0.0    0.25   0.25
54Matrix multiplication (ikj)
  /* ikj */
  for (i=0; i<n; i++) {
    for (k=0; k<n; k++) {
      r = a[i][k];
      for (j=0; j<n; j++)
        c[i][j] += r * b[k][j];
    }
  }
Inner loop: A (i,k) fixed; B (k,*) row-wise; C (i,*) row-wise
- Approx. Miss Rates
    a      b      c
    0.0    0.25   0.25
55Matrix multiplication (jki)
  /* jki */
  for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
      r = b[k][j];
      for (i=0; i<n; i++)
        c[i][j] += a[i][k] * r;
    }
  }
Inner loop: A (*,k) column-wise; B (k,j) fixed; C (*,j) column-wise
- Approx. Miss Rates
    a      b      c
    1.0    0.0    1.0
56Matrix multiplication (kji)
  /* kji */
  for (k=0; k<n; k++) {
    for (j=0; j<n; j++) {
      r = b[k][j];
      for (i=0; i<n; i++)
        c[i][j] += a[i][k] * r;
    }
  }
Inner loop: A (*,k) column-wise; B (k,j) fixed; C (*,j) column-wise
- Approx. Miss Rates
    a      b      c
    1.0    0.0    1.0
57Summary of Matrix Multiplication
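(L = loads, S = stores, MR = misses per inner-loop iteration)
ijk (L=2, S=0, MR=1.25)
  for (i=0; i<n; i++) for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++) sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
jik (L=2, S=0, MR=1.25)
  for (j=0; j<n; j++) for (i=0; i<n; i++) {
    sum = 0.0;
    for (k=0; k<n; k++) sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
kij (L=2, S=1, MR=0.5)
  for (k=0; k<n; k++) for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++) c[i][j] += r * b[k][j];
  }
ikj (L=2, S=1, MR=0.5)
  for (i=0; i<n; i++) for (k=0; k<n; k++) {
    r = a[i][k];
    for (j=0; j<n; j++) c[i][j] += r * b[k][j];
  }
jki (L=2, S=1, MR=2.0)
  for (j=0; j<n; j++) for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++) c[i][j] += a[i][k] * r;
  }
kji (L=2, S=1, MR=2.0)
  for (k=0; k<n; k++) for (j=0; j<n; j++) {
    r = b[k][j];
    for (i=0; i<n; i++) c[i][j] += a[i][k] * r;
  }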
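To compare the orderings on a real machine, two of the variants can be written as self-contained C functions. This is a sketch under assumptions that are not in the slides: matrices are stored as flat row-major arrays of `double`, and the function names are hypothetical. Both functions compute the same product; only the memory access pattern of the inner loop differs.

```c
#include <assert.h>
#include <stddef.h>

/* ijk order: inner loop walks a row of a and a column of b;
   the running sum stays in a local variable (register). */
static void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* kij order: inner loop walks a row of b and a row of c.
   c is cleared first because the loop accumulates into it. */
static void matmul_kij(size_t n, const double *a, const double *b, double *c)
{
    assert(a && b && c);
    for (size_t x = 0; x < n * n; x++)
        c[x] = 0.0;
    for (size_t k = 0; k < n; k++)
        for (size_t i = 0; i < n; i++) {
            double r = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += r * b[k * n + j];
        }
}
```

Timing each function for growing n is one way to observe the effect the miss-rate estimates predict.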
58Matmult Performance (Sparc20)
[Figure: performance of the loop orderings vs. matrix size, in three groups: (L=2, S=1, MR=0.5), (L=2, S=0, MR=1.25), (L=2, S=1, MR=2.0); annotation: "Multiple columns of B fit in cache?"]
- As matrices grow in size, exceed cache capacity
- Different loop orderings give different performance
- Cache effects
- Whether or not can accumulate in register
59Block Matrix Multiplication
Example: n = 8, B = 4
  | A11 A12 |     | B11 B12 |     | C11 C12 |
  | A21 A22 |  X  | B21 B22 |  =  | C21 C22 |
Key idea: Sub-blocks (i.e., Aij) can be treated just like scalars.
  C11 = A11B11 + A12B21    C12 = A11B12 + A12B22
  C21 = A21B11 + A22B21    C22 = A21B12 + A22B22
60Blocked Matrix Multiply (bijk)
  for (jj=0; jj<n; jj+=bsize) {
    for (i=0; i<n; i++)
      for (j=jj; j < min(jj+bsize,n); j++)
        c[i][j] = 0.0;
    for (kk=0; kk<n; kk+=bsize) {
      for (i=0; i<n; i++) {
        for (j=jj; j < min(jj+bsize,n); j++) {
          sum = 0.0;
          for (k=kk; k < min(kk+bsize,n); k++)
            sum += a[i][k] * b[k][j];
          c[i][j] += sum;
        }
      }
    }
  }
- bsize called Blocking Factor
- Capacity Misses from 2N^3 + N^2 to 2N^3/B + N^2
61Blocked Matrix Multiply Analysis
- Innermost loop pair multiplies a 1 x bsize sliver of A times a bsize x bsize block of B and accumulates into a 1 x bsize sliver of C
- Loop over i steps through n row slivers of A & C, using same B
[Figure: Innermost loop pair. A: row sliver accessed bsize times. B: block reused n times in succession. C: update successive elements of sliver.]
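A self-contained version of the blocked (bijk) loop nest is sketched below. As assumptions not taken from the slides, it uses flat row-major arrays and a hypothetical helper `min_sz`; the block bounds use `min(jj + bsize, n)` in both the clearing loop and the compute loop so that no column is skipped.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* Blocked matrix multiply: c = a * b for n x n row-major matrices,
   working on bsize-wide column strips of b and c. */
static void matmul_bijk(size_t n, size_t bsize,
                        const double *a, const double *b, double *c)
{
    assert(bsize > 0 && a && b && c);
    for (size_t jj = 0; jj < n; jj += bsize) {
        size_t jend = min_sz(jj + bsize, n);
        for (size_t i = 0; i < n; i++)          /* clear this strip of c */
            for (size_t j = jj; j < jend; j++)
                c[i * n + j] = 0.0;
        for (size_t kk = 0; kk < n; kk += bsize) {
            size_t kend = min_sz(kk + bsize, n);
            for (size_t i = 0; i < n; i++)
                for (size_t j = jj; j < jend; j++) {
                    double sum = 0.0;           /* partial dot product */
                    for (size_t k = kk; k < kend; k++)
                        sum += a[i * n + k] * b[k * n + j];
                    c[i * n + j] += sum;
                }
        }
    }
}
```

The blocking factor only changes the order of the additions, so for any bsize the result should match the unblocked loop up to floating-point rounding.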
62Blocked matmult perf (Sparc20)
63Reducing Conflict Misses by Blocking
- Conflict misses in caches not FA vs. Blocking size
64Summary of Compiler Optimizations to Reduce Cache
Misses
65Improving Cache Performance
- 1. Reduce the miss rate,
- 2. Reduce the miss penalty, or
- 3. Reduce the time to hit in the cache.
661. Reducing Miss Penalty: Read Priority over Write on Miss
- Write through with write buffers offer RAW conflicts with main memory reads on cache misses
- If simply wait for write buffer to empty, might increase read miss penalty (old MIPS 1000 by 50%)
- Check write buffer contents before read; if no conflicts, let the memory access continue
- Write Back?
- Read miss replacing dirty block
- Normal: Write dirty block to memory, and then do the read
- Instead: copy the dirty block to a write buffer, then do the read, and then do the write
- CPU stalls less since it restarts as soon as the read is done
672. Reduce Miss Penalty: Subblock Placement
- Don't have to load full block on a miss
- Have valid bits per subblock to indicate valid
- (Originally invented to reduce tag storage)
[Figure: four cache blocks with tags 100, 300, 200, 204, each divided into sub-blocks with one valid bit per sub-block]
683. Reduce Miss Penalty: Early Restart and Critical Word First
- Don't wait for full block to be loaded before restarting CPU
- Early restart: As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
- Critical Word First: Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
- Generally useful only in large blocks
694. Reduce Miss Penalty: Non-blocking Caches to reduce stalls on misses
- Non-blocking cache or lockup-free cache allows data cache to continue to supply cache hits during a miss
- requires out-of-order execution CPU
- "hit under miss" reduces the effective miss penalty by working during miss vs. ignoring CPU requests
- "hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
- Significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses
- Requires multiple memory banks (otherwise cannot support)
70Value of Hit Under Miss for SPEC
[Figure: AMAT for SPEC Integer and Floating Point programs under "Hit under n Misses": 0->1, 1->2, 2->64, Base]
- 8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss
- FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
- Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
715th Miss Penalty
- L2 Equations
- AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
- Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
- AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
- Definitions
- Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
- Global Miss Rate is what matters
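The L2 equations compose directly, as the following C sketch shows. The function names are hypothetical, and the L2 miss rate passed in is the local one, as in the definitions above.

```c
#include <assert.h>

/* AMAT = HitTime_L1 + MissRate_L1 x (HitTime_L2 + MissRate_L2 x MissPenalty_L2),
   where MissRate_L2 is the local miss rate of the L2 cache. */
static double amat_two_level(double hit_l1, double miss_rate_l1,
                             double hit_l2, double local_miss_rate_l2,
                             double miss_penalty_l2)
{
    assert(miss_rate_l1 >= 0.0 && miss_rate_l1 <= 1.0);
    assert(local_miss_rate_l2 >= 0.0 && local_miss_rate_l2 <= 1.0);
    double miss_penalty_l1 = hit_l2 + local_miss_rate_l2 * miss_penalty_l2;
    return hit_l1 + miss_rate_l1 * miss_penalty_l1;
}

/* Global L2 miss rate = MissRate_L1 x local MissRate_L2. */
static double global_miss_rate_l2(double miss_rate_l1,
                                  double local_miss_rate_l2)
{
    return miss_rate_l1 * local_miss_rate_l2;
}
```

With made-up inputs (1-cycle L1 hit, 25% L1 miss rate, 8-cycle L2 hit, 50% local L2 miss rate, 64-cycle L2 miss penalty), the AMAT is 1 + 0.25 x (8 + 0.5 x 64) = 11 cycles and the global L2 miss rate is 12.5%.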
72L2 cache block size A.M.A.T.
- 32KB L1, 8 byte path to memory
73Multi-level caches
Can have separate Icache and Dcache or unified
Icache/Dcache
size speed block size
200 B 5 ns 4 B
8 KB 5 ns 16 B
128 MB DRAM 100 ns 4 KB
10 GB 10 ms
1M SRAM 6 ns 32 KB
larger, slower, cheaper
larger block size, higher associativity, more
likely to write back
74Alpha 21164 Hierarchy
Processor Chip:
- Regs.
- L1 Instruction: 8KB, direct mapped, 32B lines
- L1 Data: 1 cycle latency, 8KB, direct mapped, write-through, dual ported, 32B lines
- L2 Unified: 8 cycle latency, 96KB, 3-way assoc., write-back, write allocate, 32B/64B lines
L3 Unified: 1M-64M, direct mapped, write-back, write allocate, 32B or 64B lines
Main Memory: up to 1TB
- Improving memory performance was main design goal
- Earlier Alphas' CPUs starved for data
75Review Improving Cache Performance
- 1. Reduce the miss rate,
- 2. Reduce the miss penalty, or
- 3. Reduce the time to hit in the cache.
761. Fast Hit times via Small and Simple Caches
- Why does the Alpha 21164 have an 8KB instruction cache and an 8KB data cache + a 96KB second level cache?
- Small data cache and clock rate
- Direct Mapped, on chip
772. Fast hits by Avoiding Address Translation
- Send virtual address to cache? Called Virtually Addressed Cache or just Virtual Cache vs. Physical Cache
- Every time process is switched, logically must flush the cache; otherwise get false hits
- Cost is time to flush + "compulsory" misses from empty cache
- Dealing with aliases
- Two different virtual addresses map to same physical address
- I/O must interact with cache, so need virtual address
- Solution to aliases
- SW guarantees: as long as the shared address bits cover the index field and the cache is direct mapped, aliased blocks must be unique; called page coloring
- Solution to cache flush
- Add process identifier tag that identifies process as well as address within process: can't get a hit if wrong process
783. Fast Hit Times Via Pipelined Writes
- Pipeline Tag Check and Update Cache as separate stages: current write tag check & previous write cache update
- Only STOREs in the pipeline; empty during a miss
    Store r2, (r1)    Check r1
    Add               --
    Sub               --
    Store r4, (r3)    M[r1] <- r2 & check r3
- In shade is "Delayed Write Buffer"; must be checked on reads; either complete write or read from buffer
79Cache Optimization Summary
  Technique                            Improves       Complexity
  Larger Block Size                    miss rate      0
  Higher Associativity                 miss rate      1
  Victim Caches                        miss rate      2
  HW Prefetching of Instr/Data         miss rate      2
  Compiler Controlled Prefetching      miss rate      3
  Compiler Reduce Misses               miss rate      0
  Priority to Read Misses              miss penalty   1
  Subblock Placement                   miss penalty   1
  Early Restart & Critical Word 1st    miss penalty   2
  Non-Blocking Caches                  miss penalty   3
  Second Level Caches                  miss penalty   2
  Small & Simple Caches                hit time       0
  Avoiding Address Translation         hit time       2
  Pipelining Writes                    hit time       1
80What Youve Learned About Caches?
- 1960-1985 Speed (no. operations)
- 1990
- Pipelined Execution Fast Clock Rate
- Out-of-Order execution
- Superscalar Instruction Issue
- 1998 Speed (non-cached memory accesses)
81Static RAM (SRAM)
- Fast
- 10 ns 1995
- Persistent
- as long as power is supplied
- no refresh required
- Expensive
- 6 transistors/bit
- Stable
- High immunity to noise and environmental disturbances
- Technology for caches
82Anatomy of an SRAM bit (cell)
Read
- set bit lines high
- set word line high
- see which bit line goes low
Write
- set bit lines to opposite values
- set word line
- flip cell to new state
83Example 1-level-decode SRAM (16 x 8)
[Figure: 16 x 8 array of memory cells. Word lines W0 to W15 select a row; each of the columns 7 to 0 has a pair of bit lines (b7 ... b1, b0) feeding sense/write amps controlled by R/W; input/output lines d7 ... d1, d0.]
84Dynamic RAM (DRAM)
- Slower than SRAM
- access time 70 ns 1995
- Non-persistent
- every row must be accessed every 1 ms (refreshed)
- Cheaper than SRAM
- 1 transistor/bit
- Fragile
- electrical noise, light, radiation
- Workhorse memory technology
85Anatomy of a DRAM Cell
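[Figure: DRAM cell. An access transistor, gated by the word line, connects the bit line (capacitance C_BL) to the storage node (capacitance C_node). Writing: waveforms of the word line, bit line, and storage node voltage V.]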
86Addressing arrays with bits
Consider an R x C array of addresses, where R = 2^r and C = 2^c. Then for each address,
  row(address) = address / C = leftmost r bits of address
  col(address) = address % C = rightmost c bits of address
  address = | row (r bits) | col (c bits) |
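Because C is a power of two, the divide and modulo reduce to a shift and a mask. The C sketch below illustrates this; the function names are hypothetical and `c_bits` stands for c in the text.

```c
#include <assert.h>
#include <stdint.h>

/* row(address) = address / C, with C = 2^c_bits: drop the low c bits. */
static uint32_t dram_row(uint32_t address, unsigned c_bits)
{
    assert(c_bits < 32);
    return address >> c_bits;
}

/* col(address) = address % C: keep only the low c bits. */
static uint32_t dram_col(uint32_t address, unsigned c_bits)
{
    assert(c_bits < 32);
    return address & ((1u << c_bits) - 1u);
}
```

For a 256 x 256 array (r = c = 8), address 0x1234 lies in row 0x12, column 0x34.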
87Example 2-level decode DRAM (64Kx1)
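[Figure: 64K x 1 DRAM with a 256 x 256 cell array. Address lines A15-A0 supply the row and col fields. RAS strobes the row address latch, which drives the row decoder (256 rows). CAS strobes the column address latch, which drives the column latch and decoder (256 columns). Column sense/write amps with R/W control; data pins Din and Dout.]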
88DRAM Operation
- Row Address (50ns)
- Set Row address on address lines strobe RAS
- Entire row read stored in column latches
- Contents of row of memory cells destroyed
- Column Address (10ns)
- Set Column address on address lines strobe CAS
- Access selected bit
- READ transfer from selected column latch to Dout
- WRITE Set selected column latch to Din
- Rewrite (30ns)
- Write back entire row
- Timing: Access time = 60ns < cycle time = 90ns
- Must Refresh Periodically: approx. every 1ms
- Perform complete memory cycle for each row
- Handled in background by memory controller
89Enhanced Performance DRAMs
- Conventional Access
- Row + Col
- RAS CAS RAS CAS ...
- Page Mode
- Row + Series of columns
- RAS CAS CAS CAS ...
- Gives successive bits
- Video RAM
- Shift out entire row sequentially
- At video rate
[Figure: DRAM array with column latches; "Entire row buffered here"]
Typical Performance
  row access time        50ns
  col access time        10ns
  cycle time             90ns
  page mode cycle time   25ns
90Main Memory Background
- Performance of Main Memory
- Latency: Cache Miss Penalty
- Access Time: time between request and word arrives
- Cycle Time: time between requests
- Bandwidth: I/O & Large Block Miss Penalty (L2)
- Main Memory is DRAM: Dynamic Random Access Memory
- Dynamic since needs to be refreshed periodically (8 ms)
- Cache uses SRAM: Static Random Access Memory
- No refresh (6 transistors/bit vs. 1 transistor)
- Size: DRAM/SRAM 4-8; Cost/Cycle time: SRAM/DRAM 8-16
- DRAMs: capacity +60%/yr, cost -30%/yr
- 2.5X cells/area, 1.5X die size in 3 years
- Order of importance: 1) Cost/bit 2) Capacity
91Bandwidth Matching
- Challenge
- CPU works with short cycle times
- DRAM (relatively) long cycle times
- How can we provide enough bandwidth between processor & memory?
- Effect of Caching
- Caching greatly reduces amount of traffic to main memory
- But, sometimes need to move large amounts of data from memory into cache
- Trends
- Need for high bandwidth much greater for multimedia applications
- Repeated operations on image data
- Recent generation machines (e.g., Pentium II) greatly improve on predecessors
92High Bandwidth Memory Systems
Solution 1: High BW DRAM. Example: Page Mode DRAM, RAMbus
Solution 2: Wide path between memory & cache. Example: Alpha AXP 21064, 256 bit wide bus, L2 cache, and memory.
Solution 3: Memory bank interleaving. Example: DEC 3000
93Independent Memory Banks
- Memory banks for independent accesses vs. faster sequential accesses
- Multiprocessor
- I/O
- CPU with Hit under n Misses, Non-blocking Cache
- Superbank: all memory active on one block transfer (or Bank)
- Bank: portion within a superbank that is word interleaved (or Subbank)
Address fields: superbank number, superbank offset; the superbank offset divides into bank number and bank offset.
94Avoiding Bank Conflicts
- Lots of banks
  int x[256][512];
  for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
      x[i][j] = 2 * x[i][j];
- Even with 128 banks, since 512 is multiple of 128, conflict on word accesses
- SW: loop interchange or declaring array not power of 2 ("array padding")
- HW: Prime number of banks
- bank number = address mod number of banks
- address within bank = address / number of banks
- modulo & divide per memory access with prime no. banks?
- address within bank = address mod number words in bank
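The prime-bank mapping avoids the divide by using two modulo operations, and it places every address in its own slot as long as the number of banks and the number of words per bank share no common factor (for example a prime number of banks and a power-of-two bank size). The C sketch below checks this by brute force; the names `bank_loc`, `map_address`, `mapping_is_one_to_one`, and the `MAX_WORDS` cap are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_WORDS 4096u  /* hypothetical cap for the self-check below */

typedef struct { unsigned bank; unsigned offset; } bank_loc;

/* bank number = address mod number of banks;
   address within bank = address mod number of words in bank. */
static bank_loc map_address(unsigned address, unsigned num_banks,
                            unsigned words_per_bank)
{
    bank_loc loc;
    assert(num_banks > 0 && words_per_bank > 0);
    loc.bank = address % num_banks;
    loc.offset = address % words_per_bank;
    return loc;
}

/* Returns 1 if every address in [0, banks x words) gets its own
   (bank, offset) slot, 0 if two addresses collide. */
static int mapping_is_one_to_one(unsigned num_banks, unsigned words_per_bank)
{
    unsigned char seen[MAX_WORDS];
    unsigned total = num_banks * words_per_bank;
    assert(total > 0 && total <= MAX_WORDS);
    memset(seen, 0, sizeof seen);
    for (unsigned a = 0; a < total; a++) {
        bank_loc loc = map_address(a, num_banks, words_per_bank);
        unsigned slot = loc.bank * words_per_bank + loc.offset;
        if (seen[slot])
            return 0;
        seen[slot] = 1;
    }
    return 1;
}
```

With 3 banks of 8 words the mapping is one-to-one, while 4 banks of 8 words collide (addresses 0 and 8 both land in bank 0, offset 0). For the column walk in the slide, a stride of 512 words always hits bank 0 with 128 banks but moves to bank 4 with 127 banks.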
95Fast Memory Systems DRAM specific
- Multiple CAS accesses: several names (page mode)
- Extended Data Out (EDO): 30% faster in page mode
- New DRAMs to address gap; what will they cost, will they survive?
- RAMBUS: startup company; reinvent DRAM interface
- Each chip a module vs. slice of memory
- Short bus between CPU and chips
- Does own refresh
- Variable amount of data returned
- 1 byte / 2 ns (500 MB/s per chip)
- Synchronous DRAM: 2 banks on chip, a clock signal to DRAM, transfer synchronous to system clock (66 - 150 MHz)
- Intel claims RAMBUS Direct (16 b wide) is future PC memory
- Niche memory or main memory?
- e.g., Video RAM for frame buffers, DRAM + fast serial output
96Virtual Memory
- Main memory can act as a cache for the secondary storage (disk)
- Advantages
- illusion of having more physical memory
- program relocation
- protection
97Virtual Memory (cont)
- Provides illusion of very large memory
- sum of the memory of many jobs greater than physical memory
- address space of each job larger than physical memory
- Allows available (fast and expensive) physical memory to be very well utilized
- Simplifies memory management
- Exploits memory hierarchy to keep average access time low
- Involves at least two storage levels: main (RAM) and secondary (disk)
- Virtual Address -- address used by the programmer
- Virtual Address Space -- collection of such addresses
- Physical Address -- address of word in physical memory
98Virtual Address Spaces
Key idea: virtual and physical address spaces are divided into equal-sized blocks known as virtual pages and physical pages (page frames)
[Figure: address translation maps virtual pages in the virtual address spaces (VA) of Process 1 and Process 2, each running from 0 to 2^n - 1, onto physical pages in the physical address space (PA), running from 0 to 2^m - 1]
What if the virtual address spaces are bigger
than the physical address space?
99VM as part of the memory hierarchy
Access word w in virtual page p (hit)
Access word v in virtual page q (miss or page fault)
[Figure: memory (page frames) and disk (pages) holding pages p and q, with cache blocks for words w and v; page q is transferred from disk to memory on the page fault]
100VM address translation
V = {0, 1, ..., n - 1}: virtual address space
M = {0, 1, ..., m - 1}: physical address space
MAP: V --> M U {0}: address mapping function
n > m
MAP(a) = a' if data at virtual address a is present at physical address a' and a' in M
MAP(a) = 0 if data at virtual address a is not present in M
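[Figure: the processor issues virtual address a from name space V to the address translation mechanism, which either produces physical address a' for main memory or signals a missing item fault (0) to the fault handler; the OS performs the transfer from secondary memory to main memory]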
101VM address translation
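[Figure: a 32-bit virtual address consists of a virtual page number (bits 31-12) and a page offset (bits 11-0); address translation replaces the virtual page number with a physical page number (bits 29-12) to form a 30-bit physical address with the same page offset (bits 11-0)]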
Notice that the page offset bits don't change as
a result of translation
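The bit layout on this slide can be written out directly. This is a minimal sketch, assuming the field widths shown (32-bit virtual address, 12-bit page offset, 30-bit physical address). The function names vpn_of, offset_of, and make_pa are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET_BITS 12u                          /* bits 11..0 */
#define PAGE_OFFSET_MASK ((1u << PAGE_OFFSET_BITS) - 1u)

/* Virtual page number: bits 31..12 of the virtual address. */
static uint32_t vpn_of(uint32_t va)
{
    return va >> PAGE_OFFSET_BITS;
}

/* Page offset: bits 11..0, identical in virtual and physical addresses. */
static uint32_t offset_of(uint32_t addr)
{
    return addr & PAGE_OFFSET_MASK;
}

/* Form a 30-bit physical address from an 18-bit physical page number
   (bits 29..12) and the untranslated page offset of the virtual address. */
static uint32_t make_pa(uint32_t ppn, uint32_t va)
{
    assert(ppn < (1u << 18));
    return (ppn << PAGE_OFFSET_BITS) | offset_of(va);
}
```

For example, virtual address 0x12345678 has virtual page number 0x12345 and page offset 0x678; if that page maps to physical page 0x2ABCD, the physical address is 0x2ABCD678.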
102Address translation with a page table
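[Figure: the page table register points to the page table; the virtual page number (bits 31-12 of the virtual address) selects an entry holding a valid bit, access bits, and a physical page number; the physical page number (bits 29-12) and the unchanged page offset (bits 11-0) form the physical address]
If valid = 0, the page is not in memory and a page fault exception is raised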
103Page Tables
104Address translation with a page table(cont)
- Separate page table(s) per process
- If V = 1 then page is in main memory at frame address stored in table, else address is location of page in secondary memory
- Access Rights: R = read-only, R/W = read/write, X = execute only
- If kind of access not compatible with specified access rights, then protection_violation_fault
- If valid bit not set then page fault
- Protection Fault: access rights violation; causes trap to hardware, microcode, or software fault handler
- Page Fault: page not resident in physical memory, also causes trap; usually accompanied by a context switch: current process suspended while page is fetched from secondary storage
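The checks on this slide can be sketched as a small lookup routine. This is an illustration under stated assumptions, not how any particular machine does it: 4 KB pages, a single flat table, and the two checks done in the order the slide lists them. All type and function names (pte, translate, access_allowed, and the enums) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_OFFSET_BITS 12u   /* assumed 4 KB pages */

typedef enum { ACCESS_READ, ACCESS_WRITE, ACCESS_EXECUTE } access_kind;
typedef enum { RIGHTS_R, RIGHTS_RW, RIGHTS_X } access_rights;
typedef enum { XLATE_OK, XLATE_PAGE_FAULT, XLATE_PROTECTION_FAULT } xlate_status;

typedef struct {
    bool valid;            /* V = 1: page is resident in main memory */
    access_rights rights;  /* R, R/W, or X */
    uint32_t frame;        /* physical page number when valid */
} pte;

/* Is this kind of access compatible with the page's access rights? */
static bool access_allowed(access_rights rights, access_kind kind)
{
    switch (rights) {
    case RIGHTS_R:  return kind == ACCESS_READ;
    case RIGHTS_RW: return kind == ACCESS_READ || kind == ACCESS_WRITE;
    case RIGHTS_X:  return kind == ACCESS_EXECUTE;
    }
    return false;
}

/* Translate va through a per-process page table of num_pages entries.
   On XLATE_OK the physical address is written to *pa. */
static xlate_status translate(const pte *table, size_t num_pages,
                              uint32_t va, access_kind kind, uint32_t *pa)
{
    uint32_t vpn = va >> PAGE_OFFSET_BITS;
    uint32_t offset = va & ((1u << PAGE_OFFSET_BITS) - 1u);

    assert(table != NULL && pa != NULL && vpn < num_pages);
    if (!access_allowed(table[vpn].rights, kind))
        return XLATE_PROTECTION_FAULT;   /* access rights violation */
    if (!table[vpn].valid)
        return XLATE_PAGE_FAULT;         /* page not resident */
    *pa = (table[vpn].frame << PAGE_OFFSET_BITS) | offset;
    return XLATE_OK;
}
```

In a real system both fault cases would trap to a handler instead of returning a status code; the return value here only stands in for that trap.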
105VM design issues
- Everything driven by enormous cost of misses
- hundreds of thousands of clocks.
- vs units or tens of clocks for cache misses.
- disks are high latency, low bandwidth devices (compared to memory)
- disk performance: 10 ms access time, 10-33 MBytes/sec transfer rate
- Large block sizes
- 4 KBytes - 16 KBytes are typical
- amortize high access time
- reduce miss rate by exploiting locality
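The amortization argument can be made concrete with the disk figures quoted on this slide. This is a rough sketch, assuming a simple model (fixed access time plus transfer at a constant rate) and 1 MByte = 10^6 bytes. The function names are hypothetical.

```c
#include <assert.h>

/* Time in milliseconds to fetch one block from disk:
   fixed access time plus transfer time at a constant rate. */
static double disk_fetch_ms(double access_ms, double mbytes_per_sec,
                            double block_bytes)
{
    assert(mbytes_per_sec > 0.0);
    return access_ms + block_bytes / (mbytes_per_sec * 1.0e6) * 1.0e3;
}

/* Fetch cost per KByte (1024 bytes) of data delivered. */
static double disk_ms_per_kbyte(double access_ms, double mbytes_per_sec,
                                double block_bytes)
{
    assert(block_bytes > 0.0);
    return disk_fetch_ms(access_ms, mbytes_per_sec, block_bytes)
           / (block_bytes / 1024.0);
}
```

At 10 ms access time and 10 MBytes/sec, a 4 KByte page takes about 10.4 ms to fetch and a 16 KByte page about 11.6 ms. Four times the data costs only about 12% more time, so the cost per KByte drops from about 2.6 ms to about 0.73 ms.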
106VM design issues (cont)
- Fully associative page placement
- eliminates conflict misses
- every miss is a killer, so worth the slower hit time
- Use smart replacement algorithms
- handle misses in software
- miss penalty is so high anyway, no reason to handle in hardware
- small improvements pay big dividends
- Write back only
- disk access too slow to afford write through + write buffer
107Integrating VM and cache
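[Figure: the CPU sends a VA to translation, which produces the PA used to access the cache; a cache hit returns data to the CPU, a cache miss goes to main memory]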
It takes an extra memory access to translate VA to PA.
Why not address the cache with VA?
Aliasing problem: 2 virtual addresses that point to the same physical page.
Result: two cache blocks for one physical location.
Solution: index the cache with low-order VA bits that don't change during translation (requires small caches or OS support such as page coloring).
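The "small caches" requirement can be stated as a size check: the cache index and block-offset bits all lie inside the page offset exactly when the bytes per way (cache size divided by associativity) do not exceed the page size. A minimal sketch follows; the function name index_within_page_offset is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* True when the cache index and block-offset bits all fall inside the page
   offset, so indexing with virtual address bits selects the same set as
   indexing with physical address bits. */
static bool index_within_page_offset(unsigned long cache_bytes,
                                     unsigned associativity,
                                     unsigned long page_bytes)
{
    assert(associativity > 0 && page_bytes > 0);
    return cache_bytes / associativity <= page_bytes;
}
```

For example, an 8 KB direct-mapped cache with 8 KB pages passes the check. A 16 KB direct-mapped cache with 4 KB pages does not, unless it is made 4-way set associative or the OS uses page coloring.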
108Speeding up translation with a TLB
A translation lookaside buffer (TLB) is a small,
usually fully associative cache, that maps
virtual page numbers to physical page numbers.
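[Figure: the CPU sends a VA to the TLB lookup; on a TLB hit the PA goes to the cache, on a TLB miss the full translation is performed first; a cache hit returns data to the CPU, a cache miss goes to main memory]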
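A fully associative TLB lookup can be sketched as a search over all entries. This is an illustration only, assuming 4 KB pages and an 8-entry TLB; real hardware compares all tags in parallel, where this code loops. The names tlb_entry, tlb_lookup, and TLB_ENTRIES are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_OFFSET_BITS 12u   /* assumed 4 KB pages */
#define TLB_ENTRIES       8u   /* hypothetical TLB size */

typedef struct {
    bool valid;
    uint32_t vpn;   /* tag: virtual page number */
    uint32_t ppn;   /* physical page number */
} tlb_entry;

/* Fully associative lookup: any entry may hold any page, so the virtual
   page number is compared against every valid entry. Returns true on a
   TLB hit and writes the physical address to *pa; returns false on a TLB
   miss, in which case the caller would fall back to the page table. */
static bool tlb_lookup(const tlb_entry *tlb, uint32_t va, uint32_t *pa)
{
    uint32_t vpn = va >> PAGE_OFFSET_BITS;
    uint32_t offset = va & ((1u << PAGE_OFFSET_BITS) - 1u);

    assert(tlb != NULL && pa != NULL);
    for (size_t i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pa = (tlb[i].ppn << PAGE_OFFSET_BITS) | offset;
            return true;
        }
    }
    return false;
}
```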
109Address translation with a TLB
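[Figure: the virtual page number (bits 31-12 of the virtual address) is compared with the tags of the TLB entries, each holding valid and dirty bits, a tag, and a physical page number; on a TLB hit the physical page number and the page offset (bits 11-0) form the physical address, which is split into tag, index, and byte offset to access the cache (valid, tag, data) and signal a cache hit]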
110Alpha AXP 21064 TLB
page size: 8 KB
block size: 1 PTE (8 bytes)
hit time: 1 clock
miss penalty: 20 clocks
TLB size: ITLB 8 PTEs, DTLB 32 PTEs
replacement: random (but not last used)
placement: fully associative
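[Figure: the virtual address consists of a page-frame address <30> and a page offset <13>; each TLB entry holds V <1>, R <2>, W <2>, Tag <30>, and Physical address <21>; in steps 1-4 a 32:1 mux selects the <21> physical address of the matching entry, which supplies the high-order 21 bits of the 34-bit physical address, while the page offset supplies the low-order 13 bits]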
111Mapping an Alpha 21064 virtual address
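[Figure: the virtual address consists of a seg0/seg1 selector (000...0 or 111...1), three 10-bit fields (Level1, Level2, Level3), and a 13-bit page offset; the page table base register locates the L1 page table, and the Level1, Level2, and Level3 fields select page table entries in the L1, L2, and L3 page tables in turn; the final page table entry supplies 21 bits that are combined with the 13-bit page offset to form the physical address in main memory]
PTE size: 8 bytes
PT size: 1K PTEs (8 KBytes)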
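The field widths on this slide (three 10-bit level fields, a 13-bit page offset, 8-byte PTEs, a 21-bit page frame number) can be turned into index arithmetic. This sketch assumes the fields sit directly above the page offset with Level1 most significant, as the figure's left-to-right order suggests. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_BITS 13u   /* 8 KB pages */
#define LEVEL_BITS  10u   /* 1K PTEs per page table */
#define PTE_BYTES    8u   /* PTE size */

/* Low-order 13 bits: offset within the page. */
static uint32_t page_offset(uint64_t va)
{
    return (uint32_t)(va & ((1u << OFFSET_BITS) - 1u));
}

/* Index into the level-1, level-2, or level-3 page table (level = 1..3).
   Level 3 sits just above the page offset; level 1 is most significant. */
static uint32_t level_index(uint64_t va, unsigned level)
{
    unsigned shift;

    assert(level >= 1 && level <= 3);
    shift = OFFSET_BITS + (3u - level) * LEVEL_BITS;
    return (uint32_t)((va >> shift) & ((1u << LEVEL_BITS) - 1u));
}

/* Byte offset of the selected PTE within its 8 KB page table. */
static uint32_t pte_byte_offset(uint64_t va, unsigned level)
{
    return level_index(va, level) * PTE_BYTES;
}

/* Combine the 21-bit frame number from the final PTE with the page offset
   to form the 34-bit physical address. */
static uint64_t physical_address(uint32_t frame, uint64_t va)
{
    assert(frame < (1u << 21));
    return ((uint64_t)frame << OFFSET_BITS) | page_offset(va);
}
```

Each page table holds 1K entries of 8 bytes, so pte_byte_offset always stays within one 8 KByte table, which is exactly one page.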
113Alpha 21164 Chip Photo
- Microprocessor Report 9/12/94
- Caches
- L1 data
- L1 instruction
- L2 unified
- TLB
- Branch history
114Alpha Memory Performance Miss Rates of SPEC92
Cache sizes: I 8K, D 8K, L2 2M
I miss 6%, D miss 32%, L2 miss 10%
I miss 2%, D miss 13%, L2 miss 0.6%
I miss 1%, D miss 21%, L2 miss 0.3%
115Alpha CPI Components
- Instruction stall: branch mispredict (green)
- Data cache (blue)
- Instruction cache (yellow)
- L2 (pink)
- Other: compute + reg conflicts, structural conflicts
116Modern Systems