Lecture 14: Caches - PowerPoint PPT Presentation

1
Lecture 14: Caches
  • Prof. Kenneth M. Mackenzie
  • Computer Systems and Networks
  • CS2200, Spring 2003

Includes slides from Bill Leahy
2
Caches
  • 1. Concept
  • 2. Mechanics
  • 3. Performance
  • 4. Programs vs. a cache
  • 5. Caches vs. a program

3
1. Concept: Memory Hierarchy (always reuse a good idea)

[Figure: the memory hierarchy, from the upper (smaller, faster) level to the lower (larger, slower) level.]

Level         Capacity    Access Time            Cost                     Staged by        Staging Xfer Unit
Registers     100s bytes  <10s ns                                         prog./compiler   1-8 bytes (instr. operands)
Cache         K bytes     10-100 ns              1-0.1 cents/bit          cache cntl       8-128 bytes (blocks)
Main Memory   M bytes     200-500 ns             10^-4 - 10^-5 cents/bit  OS               4K-16K bytes (pages)
Disk          G bytes     10 ms (10,000,000 ns)  10^-5 - 10^-6 cents/bit  user/operator    Mbytes (files)
Tape          infinite    sec-min                10^-8 cents/bit
4
2. Mechanics
  • N-way set associative: N entries for each cache index
  • N direct-mapped caches operate in parallel (N typically 2 to 4)
  • Example: a two-way set-associative cache
  • The cache index selects a set from the cache
  • The two tags in the set are compared in parallel

[Diagram: two-way set-associative cache. The cache index selects a set; each way's valid bit and cache tag are compared against the address tag in parallel (Sel0, Sel1), a mux picks the hitting way's cache block, and Hit is the OR of the two compares.]
Advantage: typically exhibits the hit rate of a direct-mapped cache of twice the size.
5
3. Performance: Average Memory Access Time
AMAT = HitTime + (1 - h) x MissPenalty
  • Hit time: the basic time of every access.
  • Hit rate (h): the fraction of accesses that hit.
  • Miss penalty: the extra time to fetch a block from the lower level, including the time to replace it in the CPU.

6
4. Programs vs. a cache
  • Suppose you have a loop like this:
  • What's the hit rate in a 64KB/direct-mapped/16B-block cache?

char a[1024][1024];
for (i = 0; i < 1024; i++)
  for (j = 0; j < 1024; j++)
    a[i][j] += 1;
7
5. Caches vs. a program: Cache Misses in IJPEG
percent misses vs. cache size; split I/D, direct-mapped, 16B blocks
  • data input: 938x636 array of 24-bit pixels (1.8 Mbytes)
8
Today: Example Problems
  • A. Rehash terminology
  • B. Performance details
  • C. Construct a program to reveal cache structure
  • D. TLB
  • E. All the caches in a computer system

9
A. Terminology
  • Take out a piece of paper and draw the following cache:
  • total data size: 256KB
  • associativity: 4-way
  • block size: 16 bytes
  • address: 32 bits
  • write policy: write-back
  • replacement policy: random
  • How do you partition the 32-bit address?
  • How many total bits of storage are required?

10
Terminology
  • Rows (also sets; sometimes lines)
  • Columns
  • Blocks (also elements; sometimes lines)

[Diagram: a grid of cache blocks with one row, one column, and one block labeled.]
11
B. Performance
12
Performance
AMAT = HitTime + (1 - h) x MissPenalty
  • Hit time: the basic time of every access.
  • Hit rate (h): the fraction of accesses that hit.
  • Miss penalty: the extra time to fetch a block from the lower level, including the time to replace it in the CPU.

But what about multiple caches? What about when
a cache is part of a system?
13
Cache as part of a system
[Diagram: the five-stage pipeline (IF, ID, EX, MEM, WB) with the instruction cache feeding IF and the data cache in MEM, plus the PC, register file (DPRF), ALU, sign-extend, BEQ logic, and muxes.]
14
CPI of the system
  • CPI from the processor
  • plus the addition to CPI caused by cache misses:
  • CPI_total = CPI_proc + (1 - h) x MissPenalty_in_cycles
  • Example:
  • CPI_proc = 6.5 (from Project 1)
  • h = 95%
  • Miss penalty = 100 ns

15
Multiple Caches
AMAT = Thit_L1 + (1 - h_L1) x Thit_L2 + (1 - h_L1) x (1 - h_L2) x Tmem
A hit rate of 98% in L1 and 95% in L2 would yield an AMAT of 1 + 0.2 + 0.1 = 1.3 ns -- outstanding!
[Diagram: the processor backed by an L1 cache (1 ns), an L2 cache (10 ns), and a BIG SLOW MEMORY (100 ns).]
16
C. Measuring Caches
17
Measuring Processor Caches
  • Generate a test program that, when timed, reveals
    the cache size, block size, associativity, etc.
  • How to do this?
  • How do you cause cache misses in a cache of size X?

18
Detecting Cache Size

for (size = 1; size < MAXSIZE; size *= 2)
  for (dummy = 0; dummy < ZILLION; dummy++)
    for (i = 0; i < size; i++)
      array[i] += 1;                /* time this part */

  • what happens when size < cache size?
  • what happens when size > cache size?
  • how can you figure out the block size?

19
Cache and Block Size

for (stride = 1; stride < MAXSTRIDE; stride *= 2)
  for (size = 1; size < MAXSIZE; size *= 2)
    for (dummy = 0; dummy < ZILLION; dummy++)
      for (i = 0; i < size; i += stride)
        array[i] += 1;              /* time this part */

  • what happens for stride = 1?
  • what happens for stride = block size?

20
Example
21
D. Revisit Paging/VM Hardware
22
Paging/VM
[Diagram: the CPU issues virtual page 42; the page table (entry i) maps it to physical frame 356 in physical memory, which is backed by disk and managed by the operating system.]
23
Paging/VM
[Same diagram, with the page table placed in physical memory.]
Place the page table in physical memory. However, this doubles the time per memory access!!
24
Paging/VM
[Same diagram, with a cache added for the translations.]
Cache! A special-purpose cache for translations, historically called the TLB: Translation Lookaside Buffer.
25
Translation Cache
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped. TLBs are usually small, typically not more than 128-256 entries even on high-end machines; this permits a fully associative lookup on those machines. Most mid-range machines use small n-way set-associative organizations. Note: 128-256 entries times 4KB-16KB per entry spans only 512KB-4MB; the L2 cache is often bigger than the span of the TLB.

[Diagram: translation with a TLB. The CPU presents a VA to the TLB lookup; a hit yields the PA used to access the cache, while a miss goes to the full translation; the cache in turn hits or misses to main memory.]
26
Translation Cache
A way to speed up translation is to use a special cache of recently used page table entries. This has many names, but the most frequently used is Translation Lookaside Buffer, or TLB.

TLB entry: Virtual Page (tag) | Physical Frame | Dirty | Ref | Valid | Access

Really just a cache (a special-purpose cache) on the page table mappings. TLB access time is comparable to cache access time (much less than main memory access time).
27
UltraSPARC III
  • TLBs
  • L1 Caches
  • Tags for the L2 cache (data for the L2 cache is
    off-chip)

Figure from Microprocessor Report
28
E. All the Caches
29
Full Memory Hierarchy: always reuse a good idea

[Figure: the memory hierarchy again, from the upper (smaller, faster) level to the lower (larger, slower) level.]

Level         Capacity    Access Time            Cost                     Staged by        Staging Xfer Unit
Registers     100s bytes  <10s ns                                         prog./compiler   1-8 bytes (instr. operands)
Cache         K bytes     10-100 ns              1-0.1 cents/bit          cache cntl       8-128 bytes (blocks)
Main Memory   M bytes     200-500 ns             10^-4 - 10^-5 cents/bit  OS               4K-16K bytes (pages)
Disk          G bytes     10 ms (10,000,000 ns)  10^-5 - 10^-6 cents/bit  user/operator    Mbytes (files)
Tape          infinite    sec-min                10^-8 cents/bit
30
Caches?
  • L1/L2 hardware caches
  • TLB is a cache
  • VM is a sort of a cache

31
4 General Questions for the Memory Hierarchy
  • Q1: Where can a block be placed in the upper level? (Block placement)
  • Q2: How is a block found if it is in the upper level? (Block identification)
  • Q3: Which block should be replaced on a miss? (Replacement policy)
  • Q4: What happens on a write? (Write strategy)

32
Compare the 4 General Questions: all-HW vs. VM-style caching (small blocks vs. large blocks)
  • Q1: Where can a block be placed in the upper level? HW: N sets. VM: always fully associative.
  • Q2: How is a block found if it is in the upper level? HW: match on tags. VM: lookup via map.
  • Q3: Which block should be replaced on a miss? HW: LRU/random. VM: pseudo-LRU.
  • Q4: What happens on a write? HW: WT or WB. VM: always WB.
33
Other Caches?
  • Filesystem cache?
  • ~/.netscape/cache/ ??

34
Summary: Example Problems
  • A. Rehash terminology
  • B. Performance details
  • C. Construct a program to reveal cache structure
  • D. TLB
  • E. All the caches in a computer system

35
Bonus Slides
  • Multiprocessors and Caches

36
Multicomputer
[Diagram: two nodes, each a processor plus cache (A and B) with its own memory, connected by an interconnect.]
37
Multiprocessor: Symmetric Multiprocessor or SMP
[Diagram: caches A and B sharing a single memory over a bus.]
38
Cache Coherence Problem
[Diagram: cache A reads X and then writes X = 1; cache B read X earlier and still holds X = 0, as does memory. Oops! B's copy is stale.]
39
Simplest Coherence Strategy: Enforce Exactly One Copy
[Diagram: the same Read X / Write X / Read X sequence, but X is allowed to live in at most one cache at a time.]
40
Exactly One Copy
[State machine: INVALID goes to VALID on a read or write (invalidating other copies); VALID stays VALID on more reads or writes; VALID goes back to INVALID on replacement or invalidation.]
  • Maintain a lock per cache line
  • Invalidate other caches on a read/write
  • Easy on a bus: snoop the bus for transactions

41
Snoopy Cache
CPU references check the cache tags (as usual). Cache misses are filled from memory (as usual). Other reads/writes on the bus must check the tags, too, and possibly invalidate.
[Diagram: the CPU above a cache of State/Tag/Data entries, which also watches the bus.]
42
Exactly One Copy
  • Works, but performance is crummy.
  • Suppose we all just want to read the same memory location:
  • one lousy global variable, n, the size of the problem, written once at the start of the program and read thereafter

Permit multiple readers (a readers/writer lock per cache line).