Title: Lecture 14: Caches
1. Lecture 14: Caches
- Prof. Kenneth M. Mackenzie
- Computer Systems and Networks
- CS2200, Spring 2003
Includes slides from Bill Leahy
2. Caches
- 1. Concept
- 2. Mechanics
- 3. Performance
- 4. Programs vs. a cache
- 5. Caches vs. a program
3. 1. Concept: Memory Hierarchy (always reuse a good idea)
From the upper level (smaller, faster, costlier) to the lower level (bigger, slower, cheaper); each level stages data into the level above it:
- Registers: 100s of bytes, <10s ns. Staged by the program/compiler in 1-8 byte units (instruction operands).
- Cache: KBytes, 10-100 ns, 1-0.1 cents/bit. Staged by the cache controller in 8-128 byte units (blocks).
- Main memory: MBytes, 200-500 ns, 10^-4 to 10^-5 cents/bit. Staged by the OS in 4KB-16KB units (pages).
- Disk: GBytes, 10 ms (10,000,000 ns), 10^-5 to 10^-6 cents/bit. Staged by the user/operator in MByte units (files).
- Tape: effectively infinite capacity, seconds to minutes per access, 10^-8 cents/bit.
4. 2. Mechanics
- N-way set associative: N entries for each cache index, i.e. N direct-mapped caches operating in parallel (N is typically 2 to 4)
- Example: a two-way set-associative cache
- The cache index selects a set from the cache
- The two tags in the set are compared in parallel
[Figure: two-way set-associative lookup. The index selects one entry from each of two valid/tag/data arrays (cache block 0 and cache block 1); two comparators match the address tag against both stored tags, their outputs are ORed to produce Hit and drive the Sel0/Sel1 inputs of the mux that selects the matching cache block.]
- Advantage: typically exhibits a hit rate equal to a direct-mapped cache of twice the size
5. 3. Performance: Average Memory Access Time
AMAT = HitTime + (1 - h) x MissPenalty
- Hit time: the basic time of every access
- Hit rate (h): the fraction of accesses that hit
- Miss penalty: the extra time to fetch a block from the lower level, including the time to replace a block in the cache
6. 4. Programs vs. a cache
- Suppose you have a loop like this:

char a[1024][1024];
for (i = 0; i < 1024; i++)
    for (j = 0; j < 1024; j++)
        a[i][j]++;

- What's the hit rate in a 64KB/direct-mapped/16B-block cache?
7. 5. Caches vs. a program
[Figure: cache misses in IJPEG: percent misses vs. cache size, for split I/D caches, direct-mapped, with 16B blocks. The data input is a 938x636 array of 24-bit pixels, about 1.8 MBytes.]
8. Today: Example Problems
- A. Rehash terminology
- B. Performance details
- C. Construct a program to reveal cache structure
- D. TLB
- E. All the caches in a computer system
9. A. Terminology
- Take out a piece of paper and draw the following cache:
- total data size: 256KB
- associativity: 4-way
- block size: 16 bytes
- address: 32 bits
- write policy: write-back
- replacement policy: random
- How do you partition the 32-bit address?
- How many total bits of storage are required?
10. Terminology
- Rows (also called sets, sometimes lines)
- Columns (one per way of associativity)
- Blocks (also called elements, sometimes lines)
[Figure: a cache drawn as a grid: each row is one set, each column is one way, and each cell is one block.]
11. B. Performance
12. Performance
AMAT = HitTime + (1 - h) x MissPenalty
- Hit time: the basic time of every access
- Hit rate (h): the fraction of accesses that hit
- Miss penalty: the extra time to fetch a block from the lower level, including the time to replace a block in the cache
But what about multiple caches? What about when a cache is part of a system?
13. Cache as part of a system
[Figure: a five-stage pipeline (IF, ID, EX, MEM, WB) with the instruction cache accessed in IF and the data cache accessed in MEM, along with the PC, decode/register file (DPRF), BEQ logic, ALU, sign extension, and muxes.]
14. CPI of system
- CPI from the processor
- Plus the addition to CPI caused by cache misses:
- CPI_total = CPI_proc + (1 - h) x MissPenalty_in_cycles
- Example:
- CPI_proc = 6.5 (from Project 1)
- h = 95%
- miss penalty = 100 ns
15. Multiple Caches
AMAT = Thit_L1 + (1 - h_L1) x Thit_L2 + (1 - h_L1) x (1 - h_L2) x Tmem
Hit rates of 98% in L1 and 95% in L2 would yield an AMAT of 1 + 0.2 + 0.1 = 1.3 ns -- outstanding!
[Figure: the processor backed by an L1 cache (1 ns), an L2 cache (10 ns), and a BIG SLOW MEMORY (100 ns).]
16. C. Measuring Caches
17. Measuring Processor Caches
- Generate a test program that, when timed, reveals the cache size, block size, associativity, etc.
- How to do this?
- How do you cause misses in a cache of size X?
18. Detecting Cache Size

for (size = 1; size < MAXSIZE; size *= 2)
    for (dummy = 0; dummy < ZILLION; dummy++)
        for (i = 0; i < size; i++)
            array[i]++;              /* time this part */

- What happens when size < cache size?
- What happens when size > cache size?
- How can you figure out the block size?
19. Cache and Block Size

for (stride = 1; stride < MAXSTRIDE; stride *= 2)
    for (size = 1; size < MAXSIZE; size *= 2)
        for (dummy = 0; dummy < ZILLION; dummy++)
            for (i = 0; i < size; i += stride)
                array[i]++;          /* time this part */

- What happens for stride = 1?
- What happens for stride = blocksize?
20. Example
21. D. Revisit Paging/VM Hardware
22. Paging/VM
[Figure: CPU, physical memory (holding the operating system), and disk. The CPU issues virtual page 42; page table entry i maps it to physical frame 356.]
23. Paging/VM
[Figure: the same CPU/memory/disk picture, with the page table itself stored in physical memory.]
Place the page table in physical memory. However, this doubles the time per memory access!!
24. Paging/VM
[Figure: the same picture again, with a small translation cache added next to the CPU.]
Cache! Use a special-purpose cache for translations, historically called the TLB: Translation Lookaside Buffer.
25. Translation Cache
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped. TLBs are usually small, typically no more than 128-256 entries even on high-end machines, which permits a fully associative lookup on those machines. Most mid-range machines use small n-way set-associative organizations. Note: 128-256 entries times 4KB-16KB per entry spans only 512KB-4MB; the L2 cache is often bigger than the span of the TLB.
[Figure: translation with a TLB. The CPU presents a VA to the TLB lookup; on a hit the resulting PA goes directly to the cache (and on to main memory on a cache miss), while on a TLB miss the full translation runs and its result refills the TLB before the data access proceeds.]
26. Translation Cache
A way to speed up translation is to use a special cache of recently used page table entries. This has many names, but the most frequently used is Translation Lookaside Buffer, or TLB.
Each TLB entry holds a virtual page number (the tag) plus the physical frame number and the dirty, reference, valid, and access bits.
The TLB is really just a cache (a special-purpose cache) on the page table mappings, with an access time comparable to the cache access time (much less than the main memory access time).
27. UltraSPARC III
- TLBs
- L1 caches
- Tags for the L2 cache (the data for the L2 cache is off-chip)
[Figure: UltraSPARC III die layout, from Microprocessor Report.]
28. E. All the Caches
29. Full Memory Hierarchy (always reuse a good idea)
[Figure: the same memory-hierarchy figure as slide 3: registers, cache, main memory, disk, and tape, from the small/fast upper level to the large/slow lower level.]
30. Caches?
- L1/L2 hardware caches
- TLB is a cache
- VM is a sort of a cache
31. Four General Questions for the Memory Hierarchy
- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Replacement policy)
- Q4: What happens on a write? (Write strategy)
32. Compare the 4 General Questions: all-HW vs. VM-style caching (small blocks vs. large blocks)
- Q1: Where can a block be placed in the upper level? HW: one of N sets; VM: always fully associative
- Q2: How is a block found if it is in the upper level? HW: match on tags; VM: lookup via the map (page table)
- Q3: Which block should be replaced on a miss? HW: LRU or random; VM: pseudo-LRU
- Q4: What happens on a write? HW: write-through or write-back; VM: always write-back
33. Other Caches?
- Filesystem cache?
- /.netscape/cache/ ??
34. Summary: Example Problems
- A. Rehash terminology
- B. Performance details
- C. Construct a program to reveal cache structure
- D. TLB
- E. All the caches in a computer system
35. Bonus Slides
- Multiprocessors and Caches
36. Multicomputer
[Figure: two nodes, each a processor plus cache (Proc+Cache A and Proc+Cache B) with its own memory, connected by an interconnect.]
37. Multiprocessor: Symmetric Multiprocessor, or SMP
[Figure: two processor caches (Cache A and Cache B) sharing a single memory.]
38. Cache Coherence Problem
[Figure: memory holds X = 0. Cache A reads X and then writes X = 1 into its own copy; cache B reads X and still sees the stale X = 0. Oops!]
39. Simplest Coherence Strategy: Enforce Exactly One Copy
[Figure: the same read/write sequence, but with at most one cached copy of X at any time, so B's read obtains the up-to-date X = 1.]
40. Exactly One Copy
- Each cache line is INVALID or VALID: a read or write makes it VALID (and invalidates any other copies); more reads or writes keep it VALID; replacement or an invalidation from another cache returns it to INVALID.
- Maintain a lock per cache line
- Invalidate the other caches on a read/write
- Easy on a bus: snoop the bus for transactions
41. Snoopy Cache
- CPU references check the cache tags (as usual)
- Cache misses are filled from memory (as usual)
- Other processors' reads/writes on the bus must check the tags, too, and possibly invalidate
[Figure: a cache with state/tag/data per line, connected to both the CPU and the bus so it can watch bus transactions.]
42. Exactly One Copy
- Works, but performance is crummy.
- Suppose we all just want to read the same memory location:
- one lousy global variable n, the size of the problem, written once at the start of the program and read thereafter
- Fix: permit multiple readers (a readers/writer lock per cache line)