Title: Cache Memory
1 Cache Memory
2 Outline
- General concepts
- 3 ways to organize cache memory
- Issues with writes
- Suggested Reading 6.4
3 Cache Memory
- History
- At the very beginning, 3 levels
- Registers, main memory, disk storage
- 10 years later, 4 levels
- Registers, SRAM cache, main DRAM memory, disk storage
- Modern processors, 4-5 levels
- Registers, SRAM L1, L2 (,L3) cache, main DRAM memory, disk storage
- Cache memories
- are small, fast SRAM-based memories
- are managed by hardware automatically
- can be on-chip, on-die, or off-chip
4 Cache Memory
[Figure, pp. 488: bus structure connecting the CPU and caches. The CPU chip contains the register file, ALU, L1 cache, and bus interface; a cache bus connects to the L2 cache; the bus interface connects over the system bus to the I/O bridge, and over the memory bus to main memory.]
5 Cache Memory
- L1 cache is on-chip
- L2 cache was off-chip until a few years ago
- L3 cache can be off-chip or on-chip
- The CPU looks for data first in L1, then in L2, then in main memory
- Caches hold frequently accessed blocks of main memory
6 Inserting an L1 cache between the CPU and main memory
7 Generic Cache Memory Organization
8 Cache Memory
9 Cache Memory
10 Addressing caches
11 Direct-mapped cache
- The simplest kind of cache
- Characterized by exactly one line per set (a struct sketch follows below)
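As a concrete picture, here is a minimal C sketch of this organization; the struct layout and the parameter values S = 4 and B = 2 are illustrative choices (they match the simulation later in the deck), not definitions from the slides. A line holds a valid bit, a tag, and a B-byte block, and a direct-mapped cache is an array of S sets with one line each.

#include <stdint.h>

#define S 4   /* number of sets      */
#define B 2   /* block size in bytes */

/* One cache line: valid bit, tag, and a B-byte data block. */
typedef struct {
    int      valid;
    uint64_t tag;
    uint8_t  block[B];
} cache_line_t;

/* Direct-mapped cache: S sets, exactly one line per set (E = 1). */
typedef struct {
    cache_line_t sets[S];
} dm_cache_t;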
12 Accessing direct-mapped caches
- Set selection
- Use the set index bits to determine the set of interest
13 Accessing direct-mapped caches
- Line matching and word extraction
- Find a valid line in the selected set with a matching tag (line matching)
- Then extract the word (word selection), as in the sketch below
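A minimal sketch of these steps, reusing the types from the earlier struct sketch; the bit widths (1 offset bit, 2 set-index bits) and the function name are illustrative, matching the small cache used in the simulation slides.

#include <stdbool.h>
#include <stdint.h>

#define B_BITS 1   /* b = log2(B): block offset bits */
#define S_BITS 2   /* s = log2(S): set index bits    */

/* Returns true on a hit and stores the addressed byte in *out;
 * returns false on a miss (the block must then be fetched from
 * the next level of the memory hierarchy).                      */
bool dm_read_byte(dm_cache_t *c, uint64_t addr, uint8_t *out)
{
    uint64_t offset = addr & ((1u << B_BITS) - 1);             /* block offset  */
    uint64_t set    = (addr >> B_BITS) & ((1u << S_BITS) - 1); /* set selection */
    uint64_t tag    = addr >> (B_BITS + S_BITS);               /* remaining bits */

    cache_line_t *line = &c->sets[set];
    if (line->valid && line->tag == tag) {   /* line matching   */
        *out = line->block[offset];          /* word extraction */
        return true;
    }
    return false;
}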
14 Accessing direct-mapped caches
15 Line Replacement on Misses in Direct-Mapped Caches
- If the cache misses
- Retrieve the requested block from the next level in the memory hierarchy
- Store the new block in one of the cache lines of the set indicated by the set index bits
16 Line Replacement on Misses in Direct-Mapped Caches
- If the set is full of valid cache lines
- One of the existing lines must be evicted
- For a direct-mapped cache
- Each set contains only one line
- The current line is replaced by the newly fetched line
17 Direct-mapped cache simulation
- M = 16 byte addresses
- B = 2 bytes/block, S = 4 sets, E = 1 entry/set
18 Direct-mapped cache simulation
M = 16 byte addresses, B = 2 bytes/block, S = 4 sets, E = 1 entry/set
Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]
20 Direct-mapped cache simulation
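The trace above can be replayed with a small, self-contained C program; this is a sketch that only tracks valid bits and tags (no data), modeling S = 4 sets, E = 1 line/set, B = 2 bytes/block with 4-bit addresses. It reports a miss, a hit, and then three more misses; the final access to address 0 misses again because the access to address 8 evicted its block.

#include <stdio.h>

int main(void)
{
    /* Direct-mapped cache: S = 4 sets, E = 1 line/set, B = 2 bytes/block.
     * 4-bit addresses: 1 tag bit, 2 set-index bits, 1 block-offset bit.   */
    int      valid[4] = {0};
    unsigned tag[4]   = {0};

    unsigned trace[] = {0, 1, 13, 8, 0};      /* address trace (reads) */
    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        unsigned addr = trace[i];
        unsigned set  = (addr >> 1) & 0x3;    /* middle two bits       */
        unsigned t    = addr >> 3;            /* high bit              */

        if (valid[set] && tag[set] == t) {
            printf("addr %2u: hit  (set %u)\n", addr, set);
        } else {
            printf("addr %2u: miss (set %u)\n", addr, set);
            valid[set] = 1;                   /* fetch the block and      */
            tag[set]   = t;                   /* replace the current line */
        }
    }
    return 0;
}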
21 Why use middle bits as index?
- High-Order Bit Indexing
- Adjacent memory lines would map to the same cache entry
- Poor use of spatial locality
- Middle-Order Bit Indexing
- Consecutive memory lines map to different cache lines
- Can hold a C-byte region of the address space in the cache at one time (compare the two indexings in the sketch below)
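The contrast is easy to see by computing both indexings for consecutive block addresses. This sketch reuses the 4-bit-address, 4-set, 2-byte-block parameters from the simulation; with only 8 blocks, high-order indexing packs adjacent blocks into the same set, while middle-bit indexing spreads any 4 consecutive blocks across all 4 sets.

#include <stdio.h>

int main(void)
{
    /* 4-bit addresses, B = 2 bytes/block, S = 4 sets: block numbers have
     * 3 bits, and the set index uses 2 of them.                          */
    for (unsigned block = 0; block < 8; block++) {
        unsigned middle = block & 0x3;   /* middle bits: low 2 bits of the block number */
        unsigned high   = block >> 1;    /* high-order bits: top 2 bits                 */
        printf("block %u -> set %u (middle-bit)  set %u (high-order)\n",
               block, middle, high);
    }
    return 0;
}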
22 Set associative caches
- Characterized by more than one line per set
23 Accessing set associative caches
- Set selection
- Identical to the direct-mapped cache
24 Accessing set associative caches
- Line matching and word selection
- Must compare the tag in each valid line in the selected set, as in the sketch below
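A sketch of the line-matching and word-selection step for a 2-way set; E = 2 is an illustrative choice, and the types mirror the earlier direct-mapped sketch but are repeated here so the fragment stands alone.

#include <stdbool.h>
#include <stdint.h>

#define E 2   /* lines per set (2-way set associative) */
#define B 2   /* bytes per block                       */

typedef struct {
    int      valid;
    uint64_t tag;
    uint8_t  block[B];
} line_t;

typedef struct {
    line_t lines[E];
} set_t;

/* After set selection (identical to the direct-mapped case), the tag must
 * be compared against every valid line in the selected set.               */
bool sa_match(const set_t *set, uint64_t tag, unsigned offset, uint8_t *out)
{
    for (int i = 0; i < E; i++) {
        if (set->lines[i].valid && set->lines[i].tag == tag) {
            *out = set->lines[i].block[offset];   /* word selection */
            return true;                          /* hit            */
        }
    }
    return false;                                 /* miss           */
}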
25 Fully associative caches
- Characterized by all of the lines being in a single set
- No set index bits in the address
26 Accessing fully associative caches
- Line matching and word selection
- Must compare the tag in each valid line
27 Issues with Writes
- Write hits (both policies are sketched below)
- Write through
- The cache updates its copy
- Immediately writes the corresponding cache block to memory
- Write back
- Defers the memory update as long as possible
- Writes the updated block to memory only when it is evicted from the cache
- Maintains a dirty bit for each cache line
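A sketch of the two write-hit policies for a single line, assuming a per-line dirty bit as described above; the memory-write helper is a hypothetical stand-in for the next level of the hierarchy.

#include <stdint.h>
#include <stdio.h>

typedef struct {
    int      valid;
    int      dirty;      /* write-back: block is newer than memory */
    uint64_t tag;
    uint8_t  block[2];
} line_t;

/* Hypothetical stub standing in for a real write to main memory. */
static void write_byte_to_memory(uint64_t addr, uint8_t byte)
{
    printf("mem[%llu] <- %u\n", (unsigned long long)addr, (unsigned)byte);
}

/* Write-through: update the cache copy and memory immediately. */
void write_hit_through(line_t *line, uint64_t addr, unsigned off, uint8_t byte)
{
    line->block[off] = byte;
    write_byte_to_memory(addr, byte);
}

/* Write-back: update only the cache copy and mark the line dirty;
 * memory is updated later, when the line is evicted.              */
void write_hit_back(line_t *line, unsigned off, uint8_t byte)
{
    line->block[off] = byte;
    line->dirty = 1;
}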
28 Issues with Writes
- Write misses (both policies are sketched below)
- Write-allocate
- Loads the corresponding memory block into the cache
- Then updates the cache block
- No-write-allocate
- Bypasses the cache
- Writes the word directly to memory
- Combination
- Write through, no-write-allocate
- Write back, write-allocate
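And a matching sketch of the two write-miss policies; the stubs for the next level of the hierarchy are hypothetical, and write-allocate is shown paired with write-back while no-write-allocate is paired with write-through, following the combinations listed above.

#include <stdint.h>
#include <string.h>

#define B 2   /* bytes per block */

typedef struct {
    int      valid, dirty;
    uint64_t tag;
    uint8_t  block[B];
} line_t;

/* Hypothetical stubs standing in for the next level of the hierarchy. */
static void fetch_block_from_memory(uint64_t addr, uint8_t *buf) { (void)addr; memset(buf, 0, B); }
static void write_byte_to_memory(uint64_t addr, uint8_t byte)    { (void)addr; (void)byte; }

/* Write-allocate: load the containing block into the cache, then update it. */
void write_miss_allocate(line_t *line, uint64_t addr, uint64_t tag,
                         unsigned off, uint8_t byte)
{
    fetch_block_from_memory(addr & ~(uint64_t)(B - 1), line->block);
    line->valid = 1;
    line->tag   = tag;
    line->block[off] = byte;   /* update the freshly loaded block      */
    line->dirty = 1;           /* defer the memory update (write-back) */
}

/* No-write-allocate: bypass the cache, write the word directly to memory. */
void write_miss_no_allocate(uint64_t addr, uint8_t byte)
{
    write_byte_to_memory(addr, byte);
}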
29 Multi-level caches
30 Cache performance metrics
- Miss Rate
- Fraction of memory references not found in the cache (misses / references)
- Typical numbers: 3-10% for L1
- Hit Rate
- Fraction of memory references found in the cache (1 - miss rate)
31 Cache performance metrics
- Hit Time
- Time to deliver a line in the cache to the processor (includes the time to determine whether the line is in the cache)
- Typical numbers
- 1-2 clock cycles for L1
- 5-10 clock cycles for L2
- Miss Penalty
- Additional time required because of a miss
- Typically 25-100 cycles for main memory
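Putting these metrics together, average memory access time can be estimated as hit time + miss rate x miss penalty. A short sketch using values drawn from the typical ranges above; the specific numbers are illustrative, not from the slides.

#include <stdio.h>

int main(void)
{
    /* Illustrative values taken from the typical ranges on this slide. */
    double hit_time     = 2.0;    /* cycles for an L1 hit          */
    double miss_rate    = 0.05;   /* 5% of references miss in L1   */
    double miss_penalty = 100.0;  /* cycles to reach main memory   */

    /* Average memory access time = hit time + miss rate * miss penalty. */
    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f cycles\n", amat);   /* 2 + 0.05 * 100 = 7.0 cycles */
    return 0;
}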
32 Cache performance metrics
- Cache size
- Hit rate vs. hit time
- Block size
- Spatial locality vs. temporal locality
- Associativity
- Thrashing
- Cost
- Speed
- Miss penalty
- Write strategy
- Write-through: simple, cheaper read misses; write-back: fewer transfers