Title: Chapter Seven, Part I: Cache Memory
1. Chapter Seven, Part I: Cache Memory
2. Memories Review
- SRAM
  - value is stored on a pair of inverting gates
  - very fast, but takes up more space than DRAM (4 to 6 transistors)
- DRAM
  - value is stored as a charge on a capacitor (must be refreshed)
  - very small, but slower than SRAM (by a factor of 5 to 10)
3. Exploiting Memory Hierarchy
- Users want large and fast memories! (2004 prices)
  - SRAM access times are 0.5-5 ns, at a cost of $4000 to $10,000 per GB
  - DRAM access times are 50-70 ns, at a cost of $100 to $200 per GB
  - Disk access times are 5 to 20 million ns, at a cost of $0.50 to $2 per GB
- Try and give it to them anyway
  - build a memory hierarchy
4. Locality
- A principle that makes having a memory hierarchy a good idea
- Temporal locality: if an item is referenced, it will tend to be referenced again soon
- Spatial locality: if an item is referenced, nearby items will tend to be referenced soon
- Why does code have locality?
  - Library books analogy
- Our initial focus: two levels (upper, lower)
  - block: the minimum unit of data
  - hit: the data requested is in the upper level
  - miss: the data requested is not in the upper level
5. Cache
- Two issues
  - How do we know if a data item is in the cache?
  - If it is, how do we find it?
- Our first example
  - block size is one word of data
  - "direct mapped": for each item of data at the lower level, there is exactly one location in the cache where it might be (i.e., lots of items at the lower level share locations in the upper level)
6. Direct Mapped Cache
- Mapping: cache index = (block address) modulo (number of blocks in the cache)
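A minimal C sketch (my own illustration, not from the slides) of the direct-mapped placement rule; for a power-of-two block count the modulo is just the low-order bits of the block address:

#include <stdio.h>

#define NUM_BLOCKS 8   /* cache capacity in blocks (power of two) */

/* Direct-mapped placement: (block address) modulo (number of blocks). */
unsigned cache_index(unsigned block_addr) {
    return block_addr % NUM_BLOCKS;   /* same as block_addr & (NUM_BLOCKS - 1) */
}

int main(void) {
    unsigned blocks[] = {22, 26, 16, 3, 18};   /* from the example that follows */
    for (int i = 0; i < 5; i++)
        printf("block %2u -> cache index %u\n", blocks[i], cache_index(blocks[i]));
    return 0;
}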
7. Direct Mapped Cache
- Add a set of tags and a valid bit
  - The tag contains the upper portion of the address
  - The valid bit indicates whether an entry contains a valid address
- Example: 1 word/block, memory of 32 words → 32 blocks, cache of 8 blocks
Decimal address | Binary address | Hit or miss | Assigned cache block
22              | 10110          | miss        | 110
26              | 11010          | miss        | 010
22              | 10110          | hit         | 110
26              | 11010          | hit         | 010
16              | 10000          | miss        | 000
3               | 00011          | miss        | 011
16              | 10000          | hit         | 000
18              | 10010          | miss        | 010
[Figure: cache contents table (Index | V | Tag | Data) for indices 000-111, initially all invalid]
Note that the addresses in this example are word addresses.
Q: How do you obtain the byte address from a word address?
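A small C sketch (my own, not from the slides) that replays the reference stream above through an 8-entry direct-mapped cache with tags and valid bits; note that with 4 bytes per word, the byte address is just the word address shifted left by 2:

#include <stdio.h>
#include <stdbool.h>

#define NUM_BLOCKS 8

struct entry { bool valid; unsigned tag; };

int main(void) {
    struct entry cache[NUM_BLOCKS] = {0};   /* all valid bits start at 0 */
    unsigned refs[] = {22, 26, 22, 26, 16, 3, 16, 18};   /* word addresses */

    for (int i = 0; i < 8; i++) {
        unsigned addr  = refs[i];
        unsigned index = addr % NUM_BLOCKS;   /* low 3 bits of the address  */
        unsigned tag   = addr / NUM_BLOCKS;   /* remaining upper bits       */
        bool hit = cache[index].valid && cache[index].tag == tag;
        printf("word %2u (byte %3u): %s, block %u\n",
               addr, addr << 2, hit ? "hit " : "miss", index);
        if (!hit) { cache[index].valid = true; cache[index].tag = tag; }
    }
    return 0;
}

Running this reproduces the hit/miss column of the table above.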
8. Direct Mapped Cache
[Figure: cache contents table (Index | V | Tag | Data) for indices 000-111, filled in as the references are processed]
Decimal word address | Binary word address | Memory block | Tag | Assigned cache block
22                   | 10110               | 10110        | 10  | 110
26                   | 11010               | 11010        | 11  | 010
22                   | 10110               | 10110        | 10  | 110
26                   | 11010               | 11010        | 11  | 010
16                   | 10000               | 10000        | 10  | 000
3                    | 00011               | 00011        | 00  | 011
16                   | 10000               | 10000        | 10  | 000
18                   | 10010               | 10010        | 10  | 010
9. Direct Mapped Cache: 2 words/block
[Figure: cache contents table (Index | V | Tag | Data) for indices 000-111, filled in as the references are processed]
Decimal word address | Binary word address | Memory block | Tag | Assigned cache block
22                   | 10110               | 1011         | 1   | 011
26                   | 11010               | 1101         | 1   | 101
22                   | 10110               | 1011         | 1   | 011
27                   | 11011               | 1101         | 1   | 101
16                   | 10000               | 1000         | 1   | 000
3                    | 00011               | 0001         | 0   | 001
17                   | 10001               | 1000         | 1   | 000
18                   | 10010               | 1001         | 1   | 001
10. Direct Mapped Cache
- For MIPS (4 bytes/word)
- What kind of locality are we taking advantage of?
11. Direct Mapped Cache
- What is the address format?
- How many bits are required to build a cache?
- Example (page 479)
  - 32-bit address, byte-addressable, 4 bytes/word, 4 words/block, cache size 16 KB
  - 16 KB / 4 bytes/word = 4K words
  - 4K words / 4 words/block = 1K blocks
  - 10 bits for the block index, 2 bits for the block offset, 2 bits for the byte offset
  - → 32 - 10 - 2 - 2 = 18 bits for the tag
  - Address format: bits 31-14 tag | bits 13-4 index | bits 3-2 block offset | bits 1-0 byte offset
  - For each block: 1 valid bit + 18 tag bits + 4 × 32 data bits = 1 + 18 + 128 = 147 bits/block
  - Total cache size = 147 bits/block × 1K blocks = 147 Kbits, versus 16 KB (128 Kbits) of data
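The bit counts above can be checked mechanically; a minimal C sketch (my own), assuming the slide's parameters:

#include <stdio.h>

/* Recomputes the slide's example: 32-bit byte addresses, 4 bytes/word,
   4 words/block, 16 KB of cache data, direct mapped. */
int main(void) {
    int addr_bits    = 32;
    int byte_offset  = 2;                       /* log2(4 bytes/word)  */
    int block_offset = 2;                       /* log2(4 words/block) */
    int num_blocks   = 16 * 1024 / (4 * 4);     /* 1K blocks           */
    int index_bits   = 10;                      /* log2(1K)            */
    int tag_bits     = addr_bits - index_bits - block_offset - byte_offset;
    int bits_per_block = 1 + tag_bits + 4 * 32; /* valid + tag + data  */

    printf("tag bits: %d\n", tag_bits);                               /* 18  */
    printf("bits/block: %d\n", bits_per_block);                       /* 147 */
    printf("total Kbits: %d\n", bits_per_block * num_blocks / 1024);  /* 147 */
    return 0;
}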
12. Direct Mapped Cache
- Taking advantage of spatial locality (>1 word per block)
- Example: FastMATH embedded microprocessor, 16 KB cache, 16 words/block
13. Direct Mapped Cache

Consider a cache with 64 blocks and a block size of 16 bytes. What block number does byte address 1204 map to?

Decimal solution: the word address is 1204 / 4 = 301 (base 10). With 4 words per block, word address 301 corresponds to block address 301 / 4 = 75. The block offset is 1 (base 10), which is 01 in binary. The index is 75 modulo 64 = 11 (base 10), which is 001011 in binary. The tag is 75 / 64 = 1 (base 10), which is the 22-bit field 00 0000 0000 0000 0000 0001 in binary.

Binary solution: 64 blocks → 6 bits for the index, plus 4 bits for the block offset and byte offset:

1204 = 0000 0000 0000 0000 0000 01 | 00 1011 | 01           | 00
                 tag                  index     block offset   byte offset
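A small C sketch (my own, not from the slides) that extracts the same fields from byte address 1204 with shifts and masks:

#include <stdio.h>

int main(void) {
    unsigned addr = 1204;                       /* byte address          */
    unsigned byte_off  =  addr        & 0x3;    /* bits 1-0              */
    unsigned block_off = (addr >> 2)  & 0x3;    /* bits 3-2              */
    unsigned index     = (addr >> 4)  & 0x3F;   /* bits 9-4 (64 blocks)  */
    unsigned tag       =  addr >> 10;           /* remaining upper bits  */
    printf("tag=%u index=%u block_off=%u byte_off=%u\n",
           tag, index, block_off, byte_off);    /* 1, 11, 1, 0           */
    return 0;
}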
14. Hits vs. Misses
- Read hits
  - this is what we want!
- Read misses
  - stall the CPU, fetch the block from memory, deliver it to the cache, restart
- Write hits
  - replace the data in both the cache and memory (write-through)
  - write the data only into the cache, and write it back to memory later (write-back)
- Write misses
  - read the entire block into the cache, then write the word
15. Direct Mapped Cache: handling a read miss
- The CPU fetches instructions from the cache; if an instruction access results in a miss:
  - Send the value of the original PC (current PC - 4) to the memory
  - Instruct the memory to read, and wait for the memory to complete its access
  - Write the cache entry: put the data from memory into the entry's data field, write the upper bits of the address (from the ALU) into the tag field, and turn the valid bit on
  - Restart the instruction execution at the first step, which will refetch the instruction and this time find it in the cache
16. Direct Mapped Cache: handling writes
- Data inconsistency on a store instruction: after the data is written into the cache, memory would have a different value from that in the cache
- Solutions
  - Write-through: always write the data into both the memory and the cache
  - Write buffer: hold the data in a buffer while it is being written to memory, so the processor can continue execution
  - Write-back: a modified block in the cache is written to memory only when it needs to be replaced in the cache
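A schematic C sketch (my own, with hypothetical structures and a stub memory backend) contrasting write-through and write-back on a write hit; a real controller is hardware and also handles the write buffer and the miss path:

#include <stdio.h>
#include <stdbool.h>

enum policy { WRITE_THROUGH, WRITE_BACK };

struct line {
    bool valid;
    bool dirty;              /* only meaningful for write-back */
    unsigned tag;
    unsigned data[4];        /* 4-word block */
};

/* Stub memory backend, for illustration only. */
static void mem_write_word(unsigned addr, unsigned value) {
    printf("memory[%u] <- %u\n", addr, value);
}

static void write_hit(struct line *ln, unsigned addr, unsigned word_off,
                      unsigned value, enum policy p) {
    ln->data[word_off] = value;        /* always update the cache           */
    if (p == WRITE_THROUGH)
        mem_write_word(addr, value);   /* keep memory consistent right away */
    else
        ln->dirty = true;              /* defer: flush the block on eviction */
}

int main(void) {
    struct line ln = { .valid = true };
    write_hit(&ln, 100, 1, 42, WRITE_THROUGH);   /* writes memory now      */
    write_hit(&ln, 100, 1, 43, WRITE_BACK);      /* marks the line dirty   */
    printf("dirty after write-back hit: %d\n", ln.dirty);
    return 0;
}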
17. Hardware Issues
- Make reading multiple words easier by using banks
of memory
18. Increasing Memory Bandwidth
- Widen the memory and the bus between the memory and the processor
- Reduce the time by minimizing the number of times we must start a new memory access

Assume 4 words per block, 1 clock cycle to send the address, 15 clock cycles to initiate each DRAM access, and 1 clock cycle to send a word of data.

With a main memory width of 1 word, the miss penalty is 1 + 4 × 15 + 4 × 1 = 65 cycles.
With a main memory width of 4 words, the miss penalty is 1 + 15 + 1 = 17 cycles.
19. Increasing Memory Bandwidth
- Keep the memory and bus one word wide, but use multiple memory banks (interleaving)
- One address is sent to all 4 banks of memory, and the reads are done simultaneously
- Reduce the time by minimizing the number of times we must start a new memory access

Assume 4 words per block, 1 clock cycle to send the address, 15 clock cycles to initiate each DRAM access, and 1 clock cycle to send a word of data.

With 4 memory banks, each one word wide, the miss penalty is 1 + 15 + 4 × 1 = 20 cycles.
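A small C sketch (my own) recomputing the three miss penalties under the timing assumptions on these two slides:

#include <stdio.h>

int main(void) {
    int addr = 1, init = 15, xfer = 1, words = 4;
    /* One-word-wide memory: every access and transfer is serialized.   */
    int narrow = addr + words * init + words * xfer;        /* 65 cycles */
    /* Four-word-wide memory and bus: one access, one transfer.         */
    int wide = addr + init + xfer;                          /* 17 cycles */
    /* Four interleaved banks: accesses overlap, transfers serialize.   */
    int interleaved = addr + init + words * xfer;           /* 20 cycles */
    printf("narrow=%d wide=%d interleaved=%d\n", narrow, wide, interleaved);
    return 0;
}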
20. AMAT
- Average memory access time (AMAT)
  - Average time to access memory, considering both hits and misses and the frequency of different accesses
  - Captures the fact that the time to access data on both hits and misses affects performance
  - Useful as a figure of merit for comparing cache systems
- AMAT = Time for a hit + Miss rate × Miss penalty
- Exercise 7.17 [5] <7.2>: Find the AMAT for a processor with a 2 ns clock, a miss penalty of 20 clock cycles, a miss rate of 0.05 misses per instruction, and a cache access time (including hit detection) of 1 clock cycle. Assume that the read and write miss penalties are the same and ignore other write stalls.
- Exercise 7.18 [5] <7.2>: Suppose we can improve the miss rate to 0.03 misses per reference by doubling the cache size. This causes the cache access time to increase to 1.2 clock cycles. Using AMAT as a metric, determine whether this is a good trade-off.
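A hedged worked check of the two exercises (my own arithmetic, applying the AMAT formula above):

#include <stdio.h>

static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;   /* all in clock cycles */
}

int main(void) {
    double clock_ns = 2.0;
    /* 7.17: 1-cycle hit, 0.05 misses per access, 20-cycle penalty */
    double a1 = amat(1.0, 0.05, 20);              /* 2.0 cycles = 4.0 ns */
    /* 7.18: doubled cache: 1.2-cycle hit, 0.03 misses per access  */
    double a2 = amat(1.2, 0.03, 20);              /* 1.8 cycles = 3.6 ns */
    printf("7.17: %.1f cycles (%.1f ns)\n", a1, a1 * clock_ns);
    printf("7.18: %.1f cycles (%.1f ns) -> lower AMAT, good trade-off\n",
           a2, a2 * clock_ns);
    return 0;
}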
21. Performance
- Increasing the block size tends to decrease the miss rate
- But if the block size becomes a significant fraction of the cache size, the number of cache blocks will be small
  - A block may then be replaced before many of its words are accessed, which increases the miss rate
- How do we solve this?
22. Performance
- Simplified model:
  - execution time = (execution cycles + stall cycles) × cycle time
  - stall cycles = # of instructions × miss ratio × miss penalty
- Two ways of improving performance
  - decreasing the miss ratio
  - decreasing the miss penalty
- What happens if we increase block size?
23. Performance

Assume an instruction cache miss rate for gcc of 2% and a data cache miss rate of 4%. If a machine has a CPI of 2 without any memory stalls and the miss penalty is 100 cycles for all misses, determine how much faster the machine would run with a perfect cache that never missed. The frequency of loads and stores in gcc is 36%.

To run a program, the CPU fetches both instructions and data from memory.
Instruction miss cycles = IC × 2% × 100 = 2.00 × IC
Data miss cycles = IC × 36% × 4% × 100 = 1.44 × IC
Total memory-stall cycles = 2.00 × IC + 1.44 × IC = 3.44 × IC
CPI with memory stalls = 2 + 3.44 = 5.44
(CPU time with stalls) / (CPU time with perfect cache)
  = (IC × CPI_stall × clock cycle) / (IC × CPI_perfect × clock cycle)
  = 5.44 / 2 = 2.72
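A small C sketch (my own) reproducing this calculation:

#include <stdio.h>

int main(void) {
    double cpi_base = 2.0;
    double i_miss   = 0.02;    /* instruction cache miss rate      */
    double d_miss   = 0.04;    /* data cache miss rate             */
    double mem_freq = 0.36;    /* loads + stores per instruction   */
    double penalty  = 100.0;   /* miss penalty in cycles           */

    double stalls = i_miss * penalty + mem_freq * d_miss * penalty; /* 3.44 */
    double cpi    = cpi_base + stalls;                              /* 5.44 */
    printf("stall cycles/instruction = %.2f\n", stalls);
    printf("CPI = %.2f, speedup with perfect cache = %.2f\n",
           cpi, cpi / cpi_base);                                    /* 2.72 */
    return 0;
}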
24. Decreasing Miss Ratio with Associativity
- An n-way set-associative cache consists of a number of sets, each holding n blocks
- Each memory block maps to a unique set in the cache, given by the index field
- A memory block can be placed in any element of that set
- Set index = (Block number) modulo (Number of sets in the cache)
25. Decreasing Miss Ratio with Associativity
- If the block address is 13 and the cache has 8 blocks:
  - Set index = (Block number) modulo (Number of sets in the cache)
  - one-way (direct mapped): 13 modulo 8 = 5
  - two-way: 13 modulo 4 = 1
  - four-way: 13 modulo 2 = 1
[Figure: placement and search of block 13 in one-way, two-way, and four-way caches]
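A small C sketch (my own, not from the slides) recomputing the set index for each associativity above:

#include <stdio.h>

int main(void) {
    unsigned block = 13, cache_blocks = 8;
    /* ways = blocks per set; sets = total blocks / ways */
    for (unsigned ways = 1; ways <= 4; ways *= 2) {
        unsigned sets = cache_blocks / ways;
        printf("%u-way: block %u -> set %u of %u\n",
               ways, block, block % sets, sets);   /* sets 5, 1, 1 */
    }
    return 0;
}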
26. Set-Associative Cache
- Compared to direct mapped, give a series of references that
  - results in a lower miss ratio using a 2-way set-associative cache
  - results in a higher miss ratio using a 2-way set-associative cache
  assuming we use the least-recently-used (LRU) replacement strategy
27. An Implementation of a 4-Way Set-Associative Cache

28. Set-Associative Cache Issues
- Locating a block in the cache
  - Address format: Tag | Set Index | Block Offset | Byte Offset
- Block replacement
  - LRU (Least Recently Used)
  - Random
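A simplified C sketch (my own illustration; the WAYS/SETS geometry is hypothetical) of lookup plus LRU update in a 4-way set-associative cache. Real hardware compares the tags of all ways in parallel rather than in a loop:

#include <stdio.h>
#include <stdbool.h>

#define WAYS 4
#define SETS 8

struct way { bool valid; unsigned tag; unsigned lru; /* higher = older */ };

static struct way cache[SETS][WAYS];

/* Returns true on a hit; on a miss, fills the LRU (or an invalid) way. */
static bool access_block(unsigned block_addr) {
    unsigned set = block_addr % SETS;
    unsigned tag = block_addr / SETS;
    int victim = 0;

    for (int w = 0; w < WAYS; w++) {
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            /* Hit: every way younger than this one ages by one step. */
            for (int v = 0; v < WAYS; v++)
                if (cache[set][v].lru < cache[set][w].lru) cache[set][v].lru++;
            cache[set][w].lru = 0;
            return true;
        }
        if (cache[set][w].lru > cache[set][victim].lru || !cache[set][w].valid)
            victim = w;
    }
    /* Miss: age everything, then replace the chosen victim. */
    for (int v = 0; v < WAYS; v++) cache[set][v].lru++;
    cache[set][victim] = (struct way){ .valid = true, .tag = tag, .lru = 0 };
    return false;
}

int main(void) {
    unsigned refs[] = {0, 8, 0, 16, 24, 32, 0};   /* all map to set 0 */
    for (int i = 0; i < 7; i++)
        printf("block %2u: %s\n", refs[i], access_block(refs[i]) ? "hit" : "miss");
    return 0;
}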
29. Set-Associative Cache Example

Assume a cache of 4K blocks, a four-word block size, and a 32-bit address. Find the total number of sets and the total number of tag bits for caches that are direct mapped, two-way and four-way set associative, and fully associative. (Page 504)

The direct-mapped cache has as many sets as blocks, 4K = 2^12, hence the total number of tag bits is (32 - 12 - 4) × 4K = 64 Kbits.
For a two-way set-associative cache, there are 2K = 2^11 sets. The total number of tag bits is (32 - 11 - 4) × 2 × 2K = 68 Kbits.
For a four-way set-associative cache, there are 1K = 2^10 sets. The total number of tag bits is (32 - 10 - 4) × 4 × 1K = 72 Kbits.
For a fully associative cache, there is only 1 set of 4K blocks. The total number of tag bits is (32 - 0 - 4) × 4K × 1 = 112 Kbits.
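A small C sketch (my own) recomputing the tag-bit totals for each associativity:

#include <stdio.h>

int main(void) {
    int addr_bits = 32, offset_bits = 4;   /* 4-word blocks: 2+2 offset bits */
    int blocks = 4 * 1024;
    int ways_list[] = {1, 2, 4, blocks};   /* fully associative = all ways   */
    for (int i = 0; i < 4; i++) {
        int ways = ways_list[i];
        int sets = blocks / ways;
        int index_bits = 0;
        while ((1 << index_bits) < sets) index_bits++;   /* log2(sets) */
        int tag_bits = addr_bits - index_bits - offset_bits;
        printf("%5d-way: %5d sets, %2d tag bits, total %3d Kbits\n",
               ways, sets, tag_bits, tag_bits * blocks / 1024);
    }
    return 0;   /* prints 64, 68, 72, 112 Kbits */
}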
30. Performance
31. Decreasing Miss Penalty with Multilevel Caches
- Add a second-level cache
  - often the primary cache is on the same chip as the processor
  - use SRAMs to add another level of cache above primary memory (DRAM)
  - the miss penalty goes down if the data is in the 2nd-level cache
- Example (pages 505-506)
  - CPI of 1.0 on a 5 GHz machine with a 5% miss rate and 100 ns DRAM access
  - Adding a 2nd-level cache with a 5 ns access time decreases the miss rate to main memory to 0.5%
- Using multilevel caches
  - try to optimize the hit time on the 1st-level cache
  - try to optimize the miss rate on the 2nd-level cache
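A hedged check of the example's arithmetic (my own, assuming the 0.5% figure is the miss rate all the way to main memory; the cycle counts follow directly from the 5 GHz clock):

#include <stdio.h>

int main(void) {
    double cycle_ns = 1.0 / 5.0;          /* 5 GHz -> 0.2 ns per cycle    */
    double dram_pen = 100.0 / cycle_ns;   /* 500 cycles to main memory    */
    double l2_pen   = 5.0 / cycle_ns;     /* 25 cycles to the L2 cache    */

    double cpi_l1only = 1.0 + 0.05 * dram_pen;                   /* 26.0  */
    double cpi_l1l2   = 1.0 + 0.05 * l2_pen + 0.005 * dram_pen;  /* 4.75  */
    printf("CPI without L2: %.2f\n", cpi_l1only);
    printf("CPI with L2:    %.2f (speedup %.1fx)\n",
           cpi_l1l2, cpi_l1only / cpi_l1l2);                     /* ~5.5x */
    return 0;
}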
32. Cache Complexities
- It is not always easy to understand the implications of caches

[Figures: theoretical vs. observed behavior of radix sort vs. quicksort]
33. Cache Complexities
- Here is why
  - Memory system performance is often the critical factor
  - Multilevel caches and pipelined processors make it harder to predict outcomes
  - Compiler optimizations to increase locality sometimes hurt ILP
  - It is difficult to predict the best algorithm: experimental data is needed