Title: Chapter Seven, Part I: Cache Memory
1. Chapter Seven, Part I: Cache Memory
2. Memories Review
- SRAM
  - value is stored on a pair of inverting gates
  - very fast, but takes up more space than DRAM (4 to 6 transistors)
- DRAM
  - value is stored as a charge on a capacitor (must be refreshed)
  - very small, but slower than SRAM (by a factor of 5 to 10)
3. Exploiting Memory Hierarchy
- Users want large and fast memories! (2004 prices)
  - SRAM access times are 0.5-5 ns, at a cost of $4000 to $10,000 per GB
  - DRAM access times are 50-70 ns, at a cost of $100 to $200 per GB
  - Disk access times are 5 to 20 million ns, at a cost of $0.50 to $2 per GB
- Try and give it to them anyway
  - build a memory hierarchy
4. Locality
- A principle that makes having a memory hierarchy a good idea
- Temporal locality: if an item is referenced, it will tend to be referenced again soon
- Spatial locality: if an item is referenced, nearby items will tend to be referenced soon
- Why does code have locality?
  - Library books analogy
- Our initial focus: two levels (upper, lower)
  - block: the minimum unit of data
  - hit: the data requested is in the upper level
  - miss: the data requested is not in the upper level
5. Cache
- Two issues
  - How do we know if a data item is in the cache?
  - If it is, how do we find it?
- Our first example
  - block size is one word of data
  - "direct mapped": for each item of data at the lower level, there is exactly one location in the cache where it might be (i.e., lots of items at the lower level share locations in the upper level)
6. Direct Mapped Cache
- Mapping: cache index = (block address) modulo (number of blocks in the cache)
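A minimal C sketch (my own illustration, not from the slides) of the direct-mapped placement rule; for a power-of-two block count the modulo is just the low-order bits of the block address:

#include <stdio.h>

#define NUM_BLOCKS 8   /* cache capacity in blocks (power of two) */

/* Direct-mapped placement: (block address) modulo (number of blocks). */
unsigned cache_index(unsigned block_addr) {
    return block_addr % NUM_BLOCKS;   /* same as block_addr & (NUM_BLOCKS - 1) */
}

int main(void) {
    unsigned blocks[] = {22, 26, 16, 3, 18};   /* from the example that follows */
    for (int i = 0; i < 5; i++)
        printf("block %2u -> cache index %u\n", blocks[i], cache_index(blocks[i]));
    return 0;
}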
7. Direct Mapped Cache
- Add a set of tags and a valid bit
  - The tag contains the upper portion of the address
  - The valid bit indicates whether an entry contains a valid address
- Example: 1 word/block, memory of 32 words → 32 blocks, cache of 8 blocks
Decimal address | Binary address | Hit or miss | Assigned cache block
22              | 10110          | miss        | 110
26              | 11010          | miss        | 010
22              | 10110          | hit         | 110
26              | 11010          | hit         | 010
16              | 10000          | miss        | 000
3               | 00011          | miss        | 011
16              | 10000          | hit         | 000
18              | 10010          | miss        | 010
[Figure: cache contents table (Index | V | Tag | Data) for indices 000-111, initially all invalid]
Note that the addresses in this example are word addresses.
Q: How do you obtain the byte address from a word address?
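A small C sketch (my own, not from the slides) that replays the reference stream above through an 8-entry direct-mapped cache with tags and valid bits; note that with 4 bytes per word, the byte address is just the word address shifted left by 2:

#include <stdio.h>
#include <stdbool.h>

#define NUM_BLOCKS 8

struct entry { bool valid; unsigned tag; };

int main(void) {
    struct entry cache[NUM_BLOCKS] = {0};   /* all valid bits start at 0 */
    unsigned refs[] = {22, 26, 22, 26, 16, 3, 16, 18};   /* word addresses */

    for (int i = 0; i < 8; i++) {
        unsigned addr  = refs[i];
        unsigned index = addr % NUM_BLOCKS;   /* low 3 bits of the address  */
        unsigned tag   = addr / NUM_BLOCKS;   /* remaining upper bits       */
        bool hit = cache[index].valid && cache[index].tag == tag;
        printf("word %2u (byte %3u): %s, block %u\n",
               addr, addr << 2, hit ? "hit " : "miss", index);
        if (!hit) { cache[index].valid = true; cache[index].tag = tag; }
    }
    return 0;
}

Running this reproduces the hit/miss column of the table above.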
8. Direct Mapped Cache
[Figure: cache contents table (Index | V | Tag | Data) for indices 000-111, filled in as the references are processed]
Decimal word address | Binary word address | Memory block | Tag | Assigned cache block
22                   | 10110               | 10110        | 10  | 110
26                   | 11010               | 11010        | 11  | 010
22                   | 10110               | 10110        | 10  | 110
26                   | 11010               | 11010        | 11  | 010
16                   | 10000               | 10000        | 10  | 000
3                    | 00011               | 00011        | 00  | 011
16                   | 10000               | 10000        | 10  | 000
18                   | 10010               | 10010        | 10  | 010
9. Direct Mapped Cache: 2 words/block
[Figure: cache contents table (Index | V | Tag | Data) for indices 000-111, filled in as the references are processed]
Decimal word address | Binary word address | Memory block | Tag | Assigned cache block
22                   | 10110               | 1011         | 1   | 011
26                   | 11010               | 1101         | 1   | 101
22                   | 10110               | 1011         | 1   | 011
27                   | 11011               | 1101         | 1   | 101
16                   | 10000               | 1000         | 1   | 000
3                    | 00011               | 0001         | 0   | 001
17                   | 10001               | 1000         | 1   | 000
18                   | 10010               | 1001         | 1   | 001
10. Direct Mapped Cache
- For MIPS (4 bytes/word)
- What kind of locality are we taking advantage of?
11. Direct Mapped Cache
- What is the address format?
- How many bits are required to build a cache?
- Example (page 479)
  - 32-bit address, byte-addressable, 4 bytes/word, 4 words/block, cache size 16 KB
  - 16 KB / 4 bytes/word = 4K words
  - 4K words / 4 words/block = 1K blocks
  - 10 bits for the block index, 2 bits for the block offset, 2 bits for the byte offset
  - → 32 - 10 - 2 - 2 = 18 bits for the tag
  - Address format: bits 31-14 tag | bits 13-4 index | bits 3-2 block offset | bits 1-0 byte offset
  - For each block: 1 valid bit + 18 tag bits + 4 × 32 data bits = 1 + 18 + 128 = 147 bits/block
  - Total cache size = 147 bits/block × 1K blocks = 147 Kbits, versus 16 KB (128 Kbits) of data
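The bit counts above can be checked mechanically; a minimal C sketch (my own), assuming the slide's parameters:

#include <stdio.h>

/* Recomputes the slide's example: 32-bit byte addresses, 4 bytes/word,
   4 words/block, 16 KB of cache data, direct mapped. */
int main(void) {
    int addr_bits    = 32;
    int byte_offset  = 2;                       /* log2(4 bytes/word)  */
    int block_offset = 2;                       /* log2(4 words/block) */
    int num_blocks   = 16 * 1024 / (4 * 4);     /* 1K blocks           */
    int index_bits   = 10;                      /* log2(1K)            */
    int tag_bits     = addr_bits - index_bits - block_offset - byte_offset;
    int bits_per_block = 1 + tag_bits + 4 * 32; /* valid + tag + data  */

    printf("tag bits: %d\n", tag_bits);                               /* 18  */
    printf("bits/block: %d\n", bits_per_block);                       /* 147 */
    printf("total Kbits: %d\n", bits_per_block * num_blocks / 1024);  /* 147 */
    return 0;
}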
12. Direct Mapped Cache
- Taking advantage of spatial locality (>1 word per block)
- Example: FastMATH embedded microprocessor, 16 KB cache, 16 words/block
13. Direct Mapped Cache

Consider a cache with 64 blocks and a block size of 16 bytes. What block number does byte address 1204 map to?

Decimal solution: the word address is 1204 / 4 = 301 (base 10). With 4 words per block, word address 301 corresponds to block address 301 / 4 = 75. The block offset is 1 (base 10), which is 01 in binary. The index is 75 modulo 64 = 11 (base 10), which is 001011 in binary. The tag is 75 / 64 = 1 (base 10), which is the 22-bit field 00 0000 0000 0000 0000 0001 in binary.

Binary solution: 64 blocks → 6 bits for the index, plus 4 bits for the block offset and byte offset:

1204 = 0000 0000 0000 0000 0000 01 | 00 1011 | 01           | 00
                 tag                  index     block offset   byte offset
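A small C sketch (my own, not from the slides) that extracts the same fields from byte address 1204 with shifts and masks:

#include <stdio.h>

int main(void) {
    unsigned addr = 1204;                       /* byte address          */
    unsigned byte_off  =  addr        & 0x3;    /* bits 1-0              */
    unsigned block_off = (addr >> 2)  & 0x3;    /* bits 3-2              */
    unsigned index     = (addr >> 4)  & 0x3F;   /* bits 9-4 (64 blocks)  */
    unsigned tag       =  addr >> 10;           /* remaining upper bits  */
    printf("tag=%u index=%u block_off=%u byte_off=%u\n",
           tag, index, block_off, byte_off);    /* 1, 11, 1, 0           */
    return 0;
}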
14. Hits vs. Misses
- Read hits
  - this is what we want!
- Read misses
  - stall the CPU, fetch the block from memory, deliver it to the cache, restart
- Write hits
  - replace the data in both the cache and memory (write-through)
  - write the data only into the cache, and write it back to memory later (write-back)
- Write misses
  - read the entire block into the cache, then write the word
15. Direct Mapped Cache: handling a read miss
- The CPU fetches instructions from the cache; if an instruction access results in a miss:
  - Send the value of the original PC (current PC - 4) to the memory
  - Instruct the memory to read, and wait for the memory to complete its access
  - Write the cache entry: put the data from memory into the entry's data field, write the upper bits of the address (from the ALU) into the tag field, and turn the valid bit on
  - Restart the instruction execution at the first step, which will refetch the instruction and this time find it in the cache
16. Direct Mapped Cache: handling writes
- Data inconsistency on a store instruction: after the data is written into the cache, memory would have a different value from that in the cache
- Solutions
  - Write-through: always write the data into both the memory and the cache
  - Write buffer: hold the data in a buffer while it is being written to memory, so the processor can continue execution
  - Write-back: a modified block in the cache is written to memory only when it needs to be replaced in the cache
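A schematic C sketch (my own, with hypothetical structures and a stub memory backend) contrasting write-through and write-back on a write hit; a real controller is hardware and also handles the write buffer and the miss path:

#include <stdio.h>
#include <stdbool.h>

enum policy { WRITE_THROUGH, WRITE_BACK };

struct line {
    bool valid;
    bool dirty;              /* only meaningful for write-back */
    unsigned tag;
    unsigned data[4];        /* 4-word block */
};

/* Stub memory backend, for illustration only. */
static void mem_write_word(unsigned addr, unsigned value) {
    printf("memory[%u] <- %u\n", addr, value);
}

static void write_hit(struct line *ln, unsigned addr, unsigned word_off,
                      unsigned value, enum policy p) {
    ln->data[word_off] = value;        /* always update the cache           */
    if (p == WRITE_THROUGH)
        mem_write_word(addr, value);   /* keep memory consistent right away */
    else
        ln->dirty = true;              /* defer: flush the block on eviction */
}

int main(void) {
    struct line ln = { .valid = true };
    write_hit(&ln, 100, 1, 42, WRITE_THROUGH);   /* writes memory now      */
    write_hit(&ln, 100, 1, 43, WRITE_BACK);      /* marks the line dirty   */
    printf("dirty after write-back hit: %d\n", ln.dirty);
    return 0;
}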
17. Hardware Issues
- Make reading multiple words easier by using banks
of memory
18. Increasing Memory Bandwidth
- Widen the memory and the bus between the memory and the processor
- Reduce the time by minimizing the number of times we must start a new memory access

Assume 4 words per block, 1 clock cycle to send the address, 15 clock cycles to initiate each DRAM access, and 1 clock cycle to send a word of data.

With a main memory width of 1 word, the miss penalty is 1 + 4 × 15 + 4 × 1 = 65 cycles.
With a main memory width of 4 words, the miss penalty is 1 + 15 + 1 = 17 cycles.
19. Increasing Memory Bandwidth
- Keep the memory and bus one word wide, but use multiple memory banks (interleaving)
- One address is sent to all 4 banks of memory, and the reads are done simultaneously
- Reduce the time by minimizing the number of times we must start a new memory access

Assume 4 words per block, 1 clock cycle to send the address, 15 clock cycles to initiate each DRAM access, and 1 clock cycle to send a word of data.

With 4 memory banks, each one word wide, the miss penalty is 1 + 15 + 4 × 1 = 20 cycles.
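A small C sketch (my own) recomputing the three miss penalties under the timing assumptions on these two slides:

#include <stdio.h>

int main(void) {
    int addr = 1, init = 15, xfer = 1, words = 4;
    /* One-word-wide memory: every access and transfer is serialized.   */
    int narrow = addr + words * init + words * xfer;        /* 65 cycles */
    /* Four-word-wide memory and bus: one access, one transfer.         */
    int wide = addr + init + xfer;                          /* 17 cycles */
    /* Four interleaved banks: accesses overlap, transfers serialize.   */
    int interleaved = addr + init + words * xfer;           /* 20 cycles */
    printf("narrow=%d wide=%d interleaved=%d\n", narrow, wide, interleaved);
    return 0;
}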
20. AMAT
- Average memory access time (AMAT)
  - Average time to access memory, considering both hits and misses and the frequency of different accesses
  - Captures the fact that the time to access data on both hits and misses affects performance
  - Useful as a figure of merit for comparing cache systems
- AMAT = Time for a hit + Miss rate × Miss penalty
- Exercise 7.17 [5] <7.2>: Find the AMAT for a processor with a 2 ns clock, a miss penalty of 20 clock cycles, a miss rate of 0.05 misses per instruction, and a cache access time (including hit detection) of 1 clock cycle. Assume that the read and write miss penalties are the same and ignore other write stalls.
- Exercise 7.18 [5] <7.2>: Suppose we can improve the miss rate to 0.03 misses per reference by doubling the cache size. This causes the cache access time to increase to 1.2 clock cycles. Using AMAT as a metric, determine whether this is a good trade-off.
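A hedged worked check of the two exercises (my own arithmetic, applying the AMAT formula above):

#include <stdio.h>

static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;   /* all in clock cycles */
}

int main(void) {
    double clock_ns = 2.0;
    /* 7.17: 1-cycle hit, 0.05 misses per access, 20-cycle penalty */
    double a1 = amat(1.0, 0.05, 20);              /* 2.0 cycles = 4.0 ns */
    /* 7.18: doubled cache: 1.2-cycle hit, 0.03 misses per access  */
    double a2 = amat(1.2, 0.03, 20);              /* 1.8 cycles = 3.6 ns */
    printf("7.17: %.1f cycles (%.1f ns)\n", a1, a1 * clock_ns);
    printf("7.18: %.1f cycles (%.1f ns) -> lower AMAT, good trade-off\n",
           a2, a2 * clock_ns);
    return 0;
}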
21. Performance
- Increasing the block size tends to decrease the miss rate
- But if the block size becomes a significant fraction of the cache size, the number of cache blocks will be small
  - A block may then be replaced before many of its words are accessed, which increases the miss rate
- How do we solve this?
22. Performance
- Simplified model:
  - execution time = (execution cycles + stall cycles) × cycle time
  - stall cycles = # of instructions × miss ratio × miss penalty
- Two ways of improving performance
  - decreasing the miss ratio
  - decreasing the miss penalty
- What happens if we increase block size?
23. Performance

Assume an instruction cache miss rate for gcc of 2% and a data cache miss rate of 4%. If a machine has a CPI of 2 without any memory stalls and the miss penalty is 100 cycles for all misses, determine how much faster the machine would run with a perfect cache that never missed. The frequency of loads and stores in gcc is 36%.

To run a program, the CPU fetches both instructions and data from memory.
Instruction miss cycles = IC × 2% × 100 = 2.00 × IC
Data miss cycles = IC × 36% × 4% × 100 = 1.44 × IC
Total memory-stall cycles = 2.00 × IC + 1.44 × IC = 3.44 × IC
CPI with memory stalls = 2 + 3.44 = 5.44
(CPU time with stalls) / (CPU time with perfect cache)
  = (IC × CPI_stall × clock cycle) / (IC × CPI_perfect × clock cycle)
  = 5.44 / 2 = 2.72
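A small C sketch (my own) reproducing this calculation:

#include <stdio.h>

int main(void) {
    double cpi_base = 2.0;
    double i_miss   = 0.02;    /* instruction cache miss rate      */
    double d_miss   = 0.04;    /* data cache miss rate             */
    double mem_freq = 0.36;    /* loads + stores per instruction   */
    double penalty  = 100.0;   /* miss penalty in cycles           */

    double stalls = i_miss * penalty + mem_freq * d_miss * penalty; /* 3.44 */
    double cpi    = cpi_base + stalls;                              /* 5.44 */
    printf("stall cycles/instruction = %.2f\n", stalls);
    printf("CPI = %.2f, speedup with perfect cache = %.2f\n",
           cpi, cpi / cpi_base);                                    /* 2.72 */
    return 0;
}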
24. Decreasing Miss Ratio with Associativity
- An n-way set-associative cache consists of a number of sets, each holding n blocks
- Each memory block maps to a unique set in the cache, given by the index field
- A memory block can be placed in any element of that set
- Set index = (Block number) modulo (Number of sets in the cache)
25. Decreasing Miss Ratio with Associativity
- If the block address is 13 and the cache has 8 blocks:
  - Set index = (Block number) modulo (Number of sets in the cache)
  - one-way (direct mapped): 13 modulo 8 = 5
  - two-way: 13 modulo 4 = 1
  - four-way: 13 modulo 2 = 1
[Figure: placement and search of block 13 in one-way, two-way, and four-way caches]
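A small C sketch (my own, not from the slides) recomputing the set index for each associativity above:

#include <stdio.h>

int main(void) {
    unsigned block = 13, cache_blocks = 8;
    /* ways = blocks per set; sets = total blocks / ways */
    for (unsigned ways = 1; ways <= 4; ways *= 2) {
        unsigned sets = cache_blocks / ways;
        printf("%u-way: block %u -> set %u of %u\n",
               ways, block, block % sets, sets);   /* sets 5, 1, 1 */
    }
    return 0;
}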
26. Set-Associative Cache
- Compared to direct mapped, give a series of references that
  - results in a lower miss ratio using a 2-way set-associative cache
  - results in a higher miss ratio using a 2-way set-associative cache
  assuming we use the least-recently-used (LRU) replacement strategy
27. An Implementation of a 4-Way Set-Associative Cache

28. Set-Associative Cache Issues
- Locating a block in the cache
  - Address format: Tag | Set Index | Block Offset | Byte Offset
- Block replacement
  - LRU (Least Recently Used)
  - Random
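A simplified C sketch (my own illustration; the WAYS/SETS geometry is hypothetical) of lookup plus LRU update in a 4-way set-associative cache. Real hardware compares the tags of all ways in parallel rather than in a loop:

#include <stdio.h>
#include <stdbool.h>

#define WAYS 4
#define SETS 8

struct way { bool valid; unsigned tag; unsigned lru; /* higher = older */ };

static struct way cache[SETS][WAYS];

/* Returns true on a hit; on a miss, fills the LRU (or an invalid) way. */
static bool access_block(unsigned block_addr) {
    unsigned set = block_addr % SETS;
    unsigned tag = block_addr / SETS;
    int victim = 0;

    for (int w = 0; w < WAYS; w++) {
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            /* Hit: every way younger than this one ages by one step. */
            for (int v = 0; v < WAYS; v++)
                if (cache[set][v].lru < cache[set][w].lru) cache[set][v].lru++;
            cache[set][w].lru = 0;
            return true;
        }
        if (cache[set][w].lru > cache[set][victim].lru || !cache[set][w].valid)
            victim = w;
    }
    /* Miss: age everything, then replace the chosen victim. */
    for (int v = 0; v < WAYS; v++) cache[set][v].lru++;
    cache[set][victim] = (struct way){ .valid = true, .tag = tag, .lru = 0 };
    return false;
}

int main(void) {
    unsigned refs[] = {0, 8, 0, 16, 24, 32, 0};   /* all map to set 0 */
    for (int i = 0; i < 7; i++)
        printf("block %2u: %s\n", refs[i], access_block(refs[i]) ? "hit" : "miss");
    return 0;
}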
29. Set-Associative Cache Example

Assume a cache of 4K blocks, a four-word block size, and a 32-bit address. Find the total number of sets and the total number of tag bits for caches that are direct mapped, two-way and four-way set associative, and fully associative. (Page 504)

The direct-mapped cache has as many sets as blocks, 4K = 2^12, hence the total number of tag bits is (32 - 12 - 4) × 4K = 64 Kbits.
For a two-way set-associative cache, there are 2K = 2^11 sets. The total number of tag bits is (32 - 11 - 4) × 2 × 2K = 68 Kbits.
For a four-way set-associative cache, there are 1K = 2^10 sets. The total number of tag bits is (32 - 10 - 4) × 4 × 1K = 72 Kbits.
For a fully associative cache, there is only 1 set of 4K blocks. The total number of tag bits is (32 - 0 - 4) × 4K × 1 = 112 Kbits.
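A small C sketch (my own) recomputing the tag-bit totals for each associativity:

#include <stdio.h>

int main(void) {
    int addr_bits = 32, offset_bits = 4;   /* 4-word blocks: 2+2 offset bits */
    int blocks = 4 * 1024;
    int ways_list[] = {1, 2, 4, blocks};   /* fully associative = all ways   */
    for (int i = 0; i < 4; i++) {
        int ways = ways_list[i];
        int sets = blocks / ways;
        int index_bits = 0;
        while ((1 << index_bits) < sets) index_bits++;   /* log2(sets) */
        int tag_bits = addr_bits - index_bits - offset_bits;
        printf("%5d-way: %5d sets, %2d tag bits, total %3d Kbits\n",
               ways, sets, tag_bits, tag_bits * blocks / 1024);
    }
    return 0;   /* prints 64, 68, 72, 112 Kbits */
}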
30. Performance
31. Decreasing Miss Penalty with Multilevel Caches
- Add a second-level cache
  - often the primary cache is on the same chip as the processor
  - use SRAMs to add another level of cache above primary memory (DRAM)
  - the miss penalty goes down if the data is in the 2nd-level cache
- Example (pages 505-506)
  - CPI of 1.0 on a 5 GHz machine with a 5% miss rate and 100 ns DRAM access
  - Adding a 2nd-level cache with a 5 ns access time decreases the miss rate to main memory to 0.5%
- Using multilevel caches
  - try to optimize the hit time on the 1st-level cache
  - try to optimize the miss rate on the 2nd-level cache
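A hedged check of the example's arithmetic (my own, assuming the 0.5% figure is the miss rate all the way to main memory; the cycle counts follow directly from the 5 GHz clock):

#include <stdio.h>

int main(void) {
    double cycle_ns = 1.0 / 5.0;          /* 5 GHz -> 0.2 ns per cycle    */
    double dram_pen = 100.0 / cycle_ns;   /* 500 cycles to main memory    */
    double l2_pen   = 5.0 / cycle_ns;     /* 25 cycles to the L2 cache    */

    double cpi_l1only = 1.0 + 0.05 * dram_pen;                   /* 26.0  */
    double cpi_l1l2   = 1.0 + 0.05 * l2_pen + 0.005 * dram_pen;  /* 4.75  */
    printf("CPI without L2: %.2f\n", cpi_l1only);
    printf("CPI with L2:    %.2f (speedup %.1fx)\n",
           cpi_l1l2, cpi_l1only / cpi_l1l2);                     /* ~5.5x */
    return 0;
}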
32. Cache Complexities
- It is not always easy to understand the implications of caches

[Figures: theoretical vs. observed behavior of radix sort vs. quicksort]
33. Cache Complexities
- Here is why
  - Memory system performance is often the critical factor
  - Multilevel caches and pipelined processors make it harder to predict outcomes
  - Compiler optimizations to increase locality sometimes hurt ILP
  - It is difficult to predict the best algorithm: experimental data is needed