Title: CS 161 Ch 7: Memory Hierarchy LECTURE 15
1. CS 161 Ch 7: Memory Hierarchy, LECTURE 15
- Instructor: L.N. Bhuyan
- www.cs.ucr.edu/bhuyan
2. Direct-mapped Cache Contd.
- The direct-mapped cache is simple to design and its access time is fast (Why?) - good for an L1 (on-chip) cache
- Problem: conflict misses, hence a low hit ratio
- Conflict misses are misses caused by accessing different memory locations that are mapped to the same cache index (see the sketch below)
- In a direct-mapped cache there is no flexibility in where a memory block can be placed in the cache, contributing to conflict misses
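As a hedged illustration (not from the lecture), here is a minimal Python sketch of the direct-mapped index/tag split, using a hypothetical 8-block, word-addressed cache; it shows why two different addresses can collide on one index:

```python
# Minimal sketch of direct-mapped address mapping (hypothetical parameters:
# 8 one-word blocks, word addresses). For illustration only.
NUM_BLOCKS = 8

def index_and_tag(addr):
    index = addr % NUM_BLOCKS      # low-order bits select the single candidate block
    tag = addr // NUM_BLOCKS       # remaining bits are stored and compared as the tag
    return index, tag

# Addresses 3 and 11 differ, yet map to the same index, so they evict each
# other repeatedly (a conflict miss) even if the rest of the cache is empty.
print(index_and_tag(3))    # (3, 0)
print(index_and_tag(11))   # (3, 1)
```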
3. Another Extreme: Fully Associative
- Fully Associative Cache (8-word block)
- Omit the cache index: place the item in any block!
- Compare all Cache Tags in parallel
[Figure: fully associative cache - the address splits into a 27-bit Cache Tag and a Byte Offset; every block B 0 ... B 31 holds a Valid bit, Cache Tag, and Cache Data, and all tags are compared in parallel]
- By definition, Conflict Misses = 0 for a fully associative cache
4. Fully Associative Cache
- Must search all tags in the cache, as an item can be in any cache block
- The tag search must be done by hardware in parallel (other searches are too slow)
- But the necessary parallel comparator hardware is very expensive
- Therefore, fully associative placement is practical only for a very small cache
5. Compromise: N-way Set Associative Cache
- N-way set associative: N cache blocks for each Cache Index
- Like having N direct-mapped caches operating in parallel
- Select the one that gets a hit
- Example: 2-way set associative cache (sketched in code after this list)
- Cache Index selects a set of 2 blocks from the cache
- The 2 tags in the set are compared in parallel
- Data is selected based on the tag result (whichever matched the address)
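Below is a small Python sketch of this 2-way lookup, assuming a hypothetical 4-set, word-addressed cache with made-up helper names; real hardware compares the two tags in parallel rather than in a loop:

```python
# Sketch of an N-way set-associative lookup (hypothetical, word-addressed).
NUM_SETS = 4
WAYS = 2

# Each set holds WAYS entries of (valid, tag)
cache = [[{"valid": False, "tag": None} for _ in range(WAYS)] for _ in range(NUM_SETS)]

def lookup(addr):
    index = addr % NUM_SETS          # Cache Index selects one set
    tag = addr // NUM_SETS
    for way in cache[index]:         # compare both tags of the set
        if way["valid"] and way["tag"] == tag:
            return "hit"
    return "miss"

def fill(addr):
    index, tag = addr % NUM_SETS, addr // NUM_SETS
    for way in cache[index]:         # use an invalid way if one exists
        if not way["valid"]:
            way["valid"], way["tag"] = True, tag
            return
    cache[index][0]["tag"] = tag     # otherwise evict way 0 (replacement policy comes later)

for a in (5, 13, 5):                 # 5 and 13 share index 1 but can coexist in a 2-way set
    if lookup(a) == "miss":
        fill(a)
print(lookup(5), lookup(13))         # hit hit
```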
6. Example: 2-way Set Associative Cache
[Figure: the address is split into tag, index, and offset; the index selects one set whose two ways (each with Valid, Cache Tag, Cache Data, Block 0) are read and their tags compared in parallel; a mux driven by the comparison selects the Cache Block and produces Hit]
7. Set Associative Cache Contd.
- Direct Mapped and Fully Associative can be seen as just variations of the Set Associative block placement strategy
- Direct Mapped = 1-way Set Associative Cache
- Fully Associative = n-way Set Associative for a cache with exactly n blocks
9. Block Replacement Policy
- N-way Set Associative and Fully Associative caches have a choice of where to place a block (and which block to replace)
- Of course, if there is an invalid block, use it
- Whenever there is a cache hit, record the cache block that was touched
- When a cache block must be evicted, choose one that hasn't been touched recently: Least Recently Used (LRU)
- Past is prologue: history suggests it is the least likely of the choices to be used soon - the flip side of temporal locality (see the sketch below)
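A minimal sketch of LRU bookkeeping for one 2-way set, using hypothetical timestamps rather than real hardware LRU bits:

```python
# Sketch of LRU bookkeeping for one set of a 2-way cache (hypothetical).
# On every hit, record when the block was touched; on eviction, pick the
# block with the oldest timestamp (least recently used).
import itertools

clock = itertools.count()                    # monotonically increasing "time"
ways = [{"tag": "A", "last_used": next(clock)},
        {"tag": "B", "last_used": next(clock)}]

def touch(tag):                              # called on a cache hit
    for w in ways:
        if w["tag"] == tag:
            w["last_used"] = next(clock)

def evict_lru(new_tag):                      # called on a miss needing replacement
    victim = min(ways, key=lambda w: w["last_used"])
    victim["tag"], victim["last_used"] = new_tag, next(clock)
    return victim

touch("A")                                   # "A" is now the most recently used
print(evict_lru("C")["tag"])                 # "C" replaces "B", the LRU block
```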
10. Block Replacement Policy: Random
- Sometimes it is hard to keep track of the LRU block if there are lots of choices
- How hard is it for 2-way associativity?
- Second choice policy: pick a block at random and replace it
- Advantages:
- Very simple to implement
- Predictable behavior
- No worst case behavior
11. What about Writes to Memory?
- Suppose we write data only to the cache?
- Main memory and the cache would then be inconsistent - cannot allow that
- Simplest policy: the information is written both to the block in the cache and to the block in the lower-level memory (write-through)
- Problem: writes operate at the speed of the lower-level memory!
12. Improving Cache Performance: Write Buffer
[Figure: Processor writes go to the Cache and into a Write Buffer; the Write Buffer drains to DRAM]
- A Write Buffer is added between the Cache and Memory
- The processor writes data into the cache and the write buffer
- The memory controller writes the buffer contents to memory
- The write buffer is just a First-In First-Out (FIFO) queue
- Typical number of entries: 4 to 10
- Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle time (see the sketch below)
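A rough Python model of write-through with a write buffer (hypothetical addresses and a 4-entry FIFO), illustrating that the processor only waits on memory when the buffer is full:

```python
# Sketch of write-through with a write buffer (hypothetical sizes).
from collections import deque

WRITE_BUFFER_ENTRIES = 4           # typical depth is about 4 to 10 entries
cache = {}
memory = {}
write_buffer = deque()

def store(addr, value):
    cache[addr] = value                      # write the cache block
    if len(write_buffer) == WRITE_BUFFER_ENTRIES:
        drain_one()                          # processor stalls only when the FIFO is full
    write_buffer.append((addr, value))       # queue the write for memory

def drain_one():                             # done by the memory controller in the background
    addr, value = write_buffer.popleft()
    memory[addr] = value

store(0x100, 42)
store(0x104, 7)
print(cache[0x100], list(write_buffer))      # 42 [(256, 42), (260, 7)]
```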
13. Improving Cache Performance: Write Back
- Option 2: data is written only to the cache block
- A modified cache block is written to main memory only when it is replaced
- A block is either unmodified (clean) or modified (dirty)
- This scheme is called Write Back
- Advantage? Repeated writes to the same block stay in the cache (see the sketch below)
- Disadvantage? More complex to implement
- Write Back is standard for the Pentium Pro, optional for the PowerPC 604
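A corresponding sketch of write-back with a dirty bit, reduced to a single hypothetical cache block for brevity:

```python
# Sketch of write-back with a dirty bit (hypothetical single-block "cache").
# Writes update only the cache; memory is updated once, when a dirty block
# is replaced, so repeated writes to the same block stay in the cache.
memory = {0x200: 0, 0x300: 5}
block = {"addr": 0x200, "data": 0, "dirty": False}   # clean copy of 0x200

def write(addr, value):
    assert block["addr"] == addr                      # hit assumed for brevity
    block["data"], block["dirty"] = value, True       # mark the block modified (dirty)

def replace_with(new_addr):
    if block["dirty"]:                                # write the victim back only if modified
        memory[block["addr"]] = block["data"]
    block.update(addr=new_addr, data=memory[new_addr], dirty=False)

write(0x200, 99)
write(0x200, 123)          # repeated writes never touch memory
replace_with(0x300)        # eviction finally writes 123 back
print(memory[0x200])       # 123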
14. Improving Caches
- In general, we want to minimize Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
- (recall Hit Time << Miss Penalty)
- So far, we have looked at ways to reduce the Miss Rate:
- Larger Block Size
- Larger Cache
- Higher Associativity
- What else? To reduce the miss penalty, add a second-level (L2) cache.
15. Current Memory Hierarchy
[Figure: Processor (regs, data-path, control) with an on-chip L1 cache, backed by an L2 Cache, Main Memory, and Secondary Memory]

Level:        Regs     L1 cache   L2 cache   Main memory   Secondary memory
Speed (ns):   0.5      2          6          100           10,000,000
Size (MB):    0.0005   0.05       1-4        100-1000      100,000
Cost ($/MB):  --       100        30         1             0.05
Technology:   Regs     SRAM       SRAM       DRAM          Disk
16. How do we calculate the miss penalty?
- Access time = L1 hit time x L1 hit rate + L1 miss penalty x L1 miss rate
- We simply take the L1 miss penalty to be the access time of the L2 cache
- Access time = L1 hit time x L1 hit rate + (L2 hit time x L2 hit rate + L2 miss penalty x L2 miss rate) x L1 miss rate
17. Do the numbers for the L2 Cache
- Assumptions:
- L1 hit time = 1 cycle, L1 hit rate = 90%
- L2 hit time (also the L1 miss penalty) = 4 cycles, L2 miss penalty = 100 cycles, L2 hit rate = 90%
- Access time = L1 hit time x L1 hit rate + (L2 hit time x L2 hit rate + L2 miss penalty x (1 - L2 hit rate)) x L1 miss rate
- = 1 x 0.9 + (4 x 0.9 + 100 x 0.1) x (1 - 0.9)
- = 0.9 + (13.6) x 0.1 = 2.26 clock cycles
18. What would it be without the L2 cache?
- Assume that the L1 miss penalty would be 100 clock cycles
- 1 x 0.9 + (100) x 0.1
- = 10.9 clock cycles vs. 2.26 with L2
- So we gain a real benefit from having the second, larger cache before main memory (see the check below)
- Today's L1 cache sizes are 16 KB-64 KB; an L2 cache may be 512 KB to 4096 KB
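The two results above can be checked with a few lines of Python (values taken from the slides; the function name is just for illustration):

```python
# Reproduces the two-level access-time numbers from slides 17 and 18.
def avg_access_time(l1_hit, l1_rate, l1_miss_penalty):
    return l1_hit * l1_rate + l1_miss_penalty * (1 - l1_rate)

# With L2: the L1 miss penalty is itself a weighted L2 access time.
l2_penalty = 4 * 0.9 + 100 * (1 - 0.9)           # = 13.6 cycles
print(avg_access_time(1, 0.9, l2_penalty))       # ~2.26 cycles

# Without L2: every L1 miss pays the full 100-cycle main-memory penalty.
print(avg_access_time(1, 0.9, 100))              # ~10.9 cycles
```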
19. An Example
- Q: Suppose we have a processor with a base CPI of 1.0 assuming all references hit in the primary cache, and a clock rate of 500 MHz. The main memory access time is 200 ns. Suppose the miss rate per instruction is 5%. What is the revised CPI? How much faster will the machine run if we add a secondary cache (with a 20-ns access time) that reduces the miss rate to memory to 2%? Assume the same access time for hit or miss.
- A: Miss penalty to main memory = 200 ns = 100 cycles. Total CPI = Base CPI + Memory-stall cycles per instruction. Hence, revised CPI = 1.0 + 5% x 100 = 6.0
- When an L2 with a 20-ns (10-cycle) access time is added, the miss rate to memory is reduced to 2%. So, of the 5% of instructions that miss in L1, 3% hit in L2 and 2% miss.
- The CPI is reduced to 1.0 + 5% x (10 + 40% x 100) = 3.5. Thus, the machine with the secondary cache is faster by 6.0/3.5 = 1.7 (checked in the sketch below)
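A quick check of the arithmetic in this example (all values from the slide):

```python
# Reproduces the CPI numbers from the example.
base_cpi = 1.0
miss_rate_l1 = 0.05                 # 5% of instructions miss in L1
mem_penalty = 100                   # 200 ns at 500 MHz = 100 cycles
l2_penalty = 10                     # 20 ns = 10 cycles

cpi_no_l2 = base_cpi + miss_rate_l1 * mem_penalty
print(cpi_no_l2)                    # 6.0

# With L2, every L1 miss pays 10 cycles; 2% of all instructions (i.e. 40%
# of the L1 misses) also go on to main memory.
cpi_with_l2 = base_cpi + miss_rate_l1 * l2_penalty + 0.02 * mem_penalty
print(cpi_with_l2)                  # ~3.5
print(cpi_no_l2 / cpi_with_l2)      # ~1.7 speedup
```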
20. The Three Cs in Memory Hierarchy
- Cache misses fall into three classes:
- Compulsory misses: caused by the first access to a block from memory; small, and fixed independent of cache size.
- Capacity misses: occur because the cache cannot contain all the blocks, due to its limited size; reduced by increasing cache size.
- Conflict misses: occur because multiple blocks compete for the same block or set in the cache (also called collision misses); reduced by increasing associativity.
21. 3Cs Absolute Miss Rate (SPEC92)
[Figure: absolute miss rate versus cache size for SPEC92, with the conflict-miss component labeled]
23. Unified vs Split Caches
- Example:
- 16 KB split (I & D): instruction miss rate = 0.64%, data miss rate = 6.47%
- 32 KB unified: aggregate miss rate = 1.99%
- Which is better (ignoring the L2 cache)?
- Assume 33% data ops => 75% of accesses come from instructions (1.0/1.33)
- hit time = 1, miss time = 50
- Note that a data hit incurs 1 extra stall for the unified cache (only one port)
- AMAT(Harvard) = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
- AMAT(Unified) = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24 (verified in the sketch below)
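The two AMAT figures can be verified with the following short calculation (percentages from the slide):

```python
# Reproduces the split vs. unified AMAT comparison.
instr_frac, data_frac = 0.75, 0.25    # 33% data ops -> 75% of accesses are instruction fetches
miss_time = 50                        # cycles

# Split (Harvard): separate 16 KB instruction and data caches.
amat_split = instr_frac * (1 + 0.0064 * miss_time) + data_frac * (1 + 0.0647 * miss_time)
print(amat_split)                     # ~2.05

# Unified 32 KB cache: one port, so a data access stalls one extra cycle.
amat_unified = instr_frac * (1 + 0.0199 * miss_time) + data_frac * (1 + 1 + 0.0199 * miss_time)
print(amat_unified)                   # ~2.24
```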
24. Static RAM (SRAM)
- Six transistors connected in a cross-coupled fashion
- Provides regular AND inverted outputs
- Implemented in a CMOS process
[Figure: Single-Port 6-T SRAM Cell]
25. Dynamic Random Access Memory - DRAM
- DRAM organization is similar to SRAM except that each bit of DRAM is constructed using a pass transistor and a capacitor, shown on the next slide
- Fewer transistors per bit gives higher density, but slow discharge through the capacitor
- The capacitor needs to be recharged, or refreshed, giving rise to a high cycle time
- Uses a two-level decoder as shown later. Note that 2,048 bits are accessed per row, but only one bit is used (a sketch of the row/column split follows)
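As an illustrative sketch (the exact bit split is an assumption), the two-level decoding for a 2,048 x 2,048 array can be modeled as splitting a 22-bit cell address into an 11-bit row and an 11-bit column:

```python
# Sketch of two-level DRAM decoding for a 2,048 x 2,048 array (4 Mbit).
# The row address selects and reads out an entire 2,048-bit row; the column
# address then picks the single bit that is actually used.
ROW_BITS = 11          # 2**11 = 2,048 rows
COL_BITS = 11          # 2**11 = 2,048 bits per row

def split_address(bit_addr):
    row = bit_addr >> COL_BITS               # high-order bits -> Row Decoder (RAS)
    col = bit_addr & ((1 << COL_BITS) - 1)   # low-order bits -> Column Decoder (CAS)
    return row, col

print(split_address(0b10000000001_00000000111))  # (1025, 7)
```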
26. Dynamic RAM
- SRAM cells exhibit high speed / poor density
- DRAM: simple transistor/capacitor pairs in high-density form
[Figure: 1-T DRAM cell - a pass transistor gated by the Word Line connects the storage capacitor C to the Bit Line, which is read by a Sense Amp]
27. DRAM logical organization (4 Mbit)
- Access time of DRAM = row access time + column access time + refreshing
[Figure: a Row Decoder selects a word line in the 2,048 x 2,048 memory array of storage cells; the Sense Amps/I/O and Column Decoder then pick the single data bit (D in, Q out) selected by the column address]
- Square root of bits per RAS/CAS
28. Main Memory Organizations (Fig. 7.13)
[Figure: three organizations, each with a CPU, Cache, and Bus - (a) one-word wide memory organization, (b) wide memory organization with a multiplexor between the cache and the wide memory, and (c) interleaved memory organization with Memory banks 0-3]
- DRAM access time >> bus transfer time
29. Memory Access Time Example
- Assume that it takes 1 cycle to send the address, 15 cycles for each DRAM access, and 1 cycle to send a word of data.
- Assuming a cache block of 4 words and a one-word-wide DRAM (Fig. 7.13a), miss penalty = 1 + 4x15 + 4x1 = 65 cycles
- With a main memory and bus width of 2 words (Fig. 7.13b), miss penalty = 1 + 2x15 + 2x1 = 33 cycles. For a 4-word-wide memory, the miss penalty is 17 cycles. Expensive due to the wide bus and control circuits.
- With an interleaved memory of 4 memory banks and the same one-word bus width (Fig. 7.13c), the miss penalty = 1 + 1x15 + 4x1 = 20 cycles. The memory controller must supply consecutive addresses to the different memory banks. Interleaving is universally adopted in high-performance computers. (These numbers are reproduced in the sketch below.)
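The four miss-penalty figures can be reproduced with the following short calculation (parameters from this slide):

```python
# Miss penalties for the memory organizations of Fig. 7.13:
# 1 cycle to send the address, 15 cycles per DRAM access, 1 cycle per word
# transferred on the bus; 4-word cache blocks.
ADDR, DRAM, BUS, WORDS = 1, 15, 1, 4

one_word_wide  = ADDR + WORDS * DRAM + WORDS * BUS              # sequential word accesses
two_word_wide  = ADDR + (WORDS // 2) * DRAM + (WORDS // 2) * BUS
four_word_wide = ADDR + 1 * DRAM + 1 * BUS
interleaved    = ADDR + 1 * DRAM + WORDS * BUS                  # 4 banks overlap their DRAM accesses

print(one_word_wide, two_word_wide, four_word_wide, interleaved)   # 65 33 17 20
```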