Title: Lecture 14: Caches
1. Lecture 14: Caches
- Prof. Kenneth M. Mackenzie
- Computer Systems and Networks
- CS2200, Spring 2003
Includes slides from Bill Leahy
2. Caches
- 1. Concept
- 2. Mechanics
- 3. Performance
- 4. Programs vs. a cache
- 5. Caches vs. a program
3. 1. Concept: Memory Hierarchy (always reuse a good idea)
From the upper level (smaller, faster, costlier) to the lower level (bigger, slower, cheaper); each level stages data into the level above it:
- Registers: 100s of bytes, <10s ns. Staged by the program/compiler in 1-8 byte units (instruction operands).
- Cache: KBytes, 10-100 ns, 1-0.1 cents/bit. Staged by the cache controller in 8-128 byte units (blocks).
- Main memory: MBytes, 200-500 ns, 10^-4 to 10^-5 cents/bit. Staged by the OS in 4KB-16KB units (pages).
- Disk: GBytes, 10 ms (10,000,000 ns), 10^-5 to 10^-6 cents/bit. Staged by the user/operator in MByte units (files).
- Tape: effectively infinite capacity, seconds to minutes per access, 10^-8 cents/bit.
4. 2. Mechanics
- N-way set associative: N entries for each cache index, i.e. N direct-mapped caches operating in parallel (N is typically 2 to 4)
- Example: a two-way set-associative cache
- The cache index selects a set from the cache
- The two tags in the set are compared in parallel
[Figure: two-way set-associative lookup. The index selects one entry from each of two valid/tag/data arrays (cache block 0 and cache block 1); two comparators match the address tag against both stored tags, their outputs are ORed to produce Hit and drive the Sel0/Sel1 inputs of the mux that selects the matching cache block.]
- Advantage: typically exhibits a hit rate equal to a direct-mapped cache of twice the size
5. 3. Performance: Average Memory Access Time
AMAT = HitTime + (1 - h) x MissPenalty
- Hit time: the basic time of every access
- Hit rate (h): the fraction of accesses that hit
- Miss penalty: the extra time to fetch a block from the lower level, including the time to replace a block in the cache
6. 4. Programs vs. a cache
- Suppose you have a loop like this:

char a[1024][1024];
for (i = 0; i < 1024; i++)
    for (j = 0; j < 1024; j++)
        a[i][j]++;

- What's the hit rate in a 64KB/direct-mapped/16B-block cache?
7. 5. Caches vs. a program
[Figure: cache misses in IJPEG: percent misses vs. cache size, for split I/D caches, direct-mapped, with 16B blocks. The data input is a 938x636 array of 24-bit pixels, about 1.8 MBytes.]
8. Today: Example Problems
- A. Rehash terminology
- B. Performance details
- C. Construct a program to reveal cache structure
- D. TLB
- E. All the caches in a computer system
9. A. Terminology
- Take out a piece of paper and draw the following cache:
- total data size: 256KB
- associativity: 4-way
- block size: 16 bytes
- address: 32 bits
- write policy: write-back
- replacement policy: random
- How do you partition the 32-bit address?
- How many total bits of storage are required?
10. Terminology
- Rows (also called sets, sometimes lines)
- Columns (one per way of associativity)
- Blocks (also called elements, sometimes lines)
[Figure: a cache drawn as a grid: each row is one set, each column is one way, and each cell is one block.]
11. B. Performance
12. Performance
AMAT = HitTime + (1 - h) x MissPenalty
- Hit time: the basic time of every access
- Hit rate (h): the fraction of accesses that hit
- Miss penalty: the extra time to fetch a block from the lower level, including the time to replace a block in the cache
But what about multiple caches? What about when a cache is part of a system?
13. Cache as part of a system
[Figure: a five-stage pipeline (IF, ID, EX, MEM, WB) with the instruction cache accessed in IF and the data cache accessed in MEM, along with the PC, decode/register file (DPRF), BEQ logic, ALU, sign extension, and muxes.]
14. CPI of system
- CPI from the processor
- Plus the addition to CPI caused by cache misses:
- CPI_total = CPI_proc + (1 - h) x MissPenalty_in_cycles
- Example:
- CPI_proc = 6.5 (from Project 1)
- h = 95%
- miss penalty = 100 ns
15. Multiple Caches
AMAT = Thit_L1 + (1 - h_L1) x Thit_L2 + (1 - h_L1) x (1 - h_L2) x Tmem
Hit rates of 98% in L1 and 95% in L2 would yield an AMAT of 1 + 0.2 + 0.1 = 1.3 ns -- outstanding!
[Figure: the processor backed by an L1 cache (1 ns), an L2 cache (10 ns), and a BIG SLOW MEMORY (100 ns).]
16. C. Measuring Caches
17. Measuring Processor Caches
- Generate a test program that, when timed, reveals the cache size, block size, associativity, etc.
- How to do this?
- How do you cause misses in a cache of size X?
18. Detecting Cache Size

for (size = 1; size < MAXSIZE; size *= 2)
    for (dummy = 0; dummy < ZILLION; dummy++)
        for (i = 0; i < size; i++)
            array[i]++;              /* time this part */

- What happens when size < cache size?
- What happens when size > cache size?
- How can you figure out the block size?
19. Cache and Block Size

for (stride = 1; stride < MAXSTRIDE; stride *= 2)
    for (size = 1; size < MAXSIZE; size *= 2)
        for (dummy = 0; dummy < ZILLION; dummy++)
            for (i = 0; i < size; i += stride)
                array[i]++;          /* time this part */

- What happens for stride = 1?
- What happens for stride = blocksize?
20. Example
21. D. Revisit Paging/VM Hardware
22. Paging/VM
[Figure: CPU, physical memory (holding the operating system), and disk. The CPU issues virtual page 42; page table entry i maps it to physical frame 356.]
23. Paging/VM
[Figure: the same CPU/memory/disk picture, with the page table itself stored in physical memory.]
Place the page table in physical memory. However, this doubles the time per memory access!!
24. Paging/VM
[Figure: the same picture again, with a small translation cache added next to the CPU.]
Cache! Use a special-purpose cache for translations, historically called the TLB: Translation Lookaside Buffer.
25. Translation Cache
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped. TLBs are usually small, typically no more than 128-256 entries even on high-end machines, which permits a fully associative lookup on those machines. Most mid-range machines use small n-way set-associative organizations. Note: 128-256 entries times 4KB-16KB per entry spans only 512KB-4MB; the L2 cache is often bigger than the span of the TLB.
[Figure: translation with a TLB. The CPU presents a VA to the TLB lookup; on a hit the resulting PA goes directly to the cache (and on to main memory on a cache miss), while on a TLB miss the full translation runs and its result refills the TLB before the data access proceeds.]
26. Translation Cache
A way to speed up translation is to use a special cache of recently used page table entries. This has many names, but the most frequently used is Translation Lookaside Buffer, or TLB.
Each TLB entry holds a virtual page number (the tag) plus the physical frame number and the dirty, reference, valid, and access bits.
The TLB is really just a cache (a special-purpose cache) on the page table mappings, with an access time comparable to the cache access time (much less than the main memory access time).
27. UltraSPARC III
- TLBs
- L1 caches
- Tags for the L2 cache (the data for the L2 cache is off-chip)
[Figure: UltraSPARC III die layout, from Microprocessor Report.]
28. E. All the Caches
29. Full Memory Hierarchy (always reuse a good idea)
[Figure: the same memory-hierarchy figure as slide 3: registers, cache, main memory, disk, and tape, from the small/fast upper level to the large/slow lower level.]
30. Caches?
- L1/L2 hardware caches
- TLB is a cache
- VM is a sort of a cache
31. Four General Questions for the Memory Hierarchy
- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Replacement policy)
- Q4: What happens on a write? (Write strategy)
32. Compare the 4 General Questions: all-HW vs. VM-style caching (small blocks vs. large blocks)
- Q1: Where can a block be placed in the upper level? HW: one of N sets; VM: always fully associative
- Q2: How is a block found if it is in the upper level? HW: match on tags; VM: lookup via the map (page table)
- Q3: Which block should be replaced on a miss? HW: LRU or random; VM: pseudo-LRU
- Q4: What happens on a write? HW: write-through or write-back; VM: always write-back
33. Other Caches?
- Filesystem cache?
- /.netscape/cache/ ??
34. Summary: Example Problems
- A. Rehash terminology
- B. Performance details
- C. Construct a program to reveal cache structure
- D. TLB
- E. All the caches in a computer system
35. Bonus Slides
- Multiprocessors and Caches
36. Multicomputer
[Figure: two nodes, each a processor plus cache (Proc+Cache A and Proc+Cache B) with its own memory, connected by an interconnect.]
37. Multiprocessor: Symmetric Multiprocessor, or SMP
[Figure: two processor caches (Cache A and Cache B) sharing a single memory.]
38. Cache Coherence Problem
[Figure: memory holds X = 0. Cache A reads X and then writes X = 1 into its own copy; cache B reads X and still sees the stale X = 0. Oops!]
39. Simplest Coherence Strategy: Enforce Exactly One Copy
[Figure: the same read/write sequence, but with at most one cached copy of X at any time, so B's read obtains the up-to-date X = 1.]
40. Exactly One Copy
- Each cache line is INVALID or VALID: a read or write makes it VALID (and invalidates any other copies); more reads or writes keep it VALID; replacement or an invalidation from another cache returns it to INVALID.
- Maintain a lock per cache line
- Invalidate the other caches on a read/write
- Easy on a bus: snoop the bus for transactions
41. Snoopy Cache
- CPU references check the cache tags (as usual)
- Cache misses are filled from memory (as usual)
- Other processors' reads/writes on the bus must check the tags, too, and possibly invalidate
[Figure: a cache with state/tag/data per line, connected to both the CPU and the bus so it can watch bus transactions.]
42. Exactly One Copy
- Works, but performance is crummy.
- Suppose we all just want to read the same memory location:
- one lousy global variable n, the size of the problem, written once at the start of the program and read thereafter
- Fix: permit multiple readers (a readers/writer lock per cache line)