Title: CS 161 Ch 7: Memory Hierarchy LECTURE 16

Slide 1: CS 161 Ch 7: Memory Hierarchy, Lecture 16
- Instructor: L.N. Bhuyan
- www.cs.ucr.edu/bhuyan
Slide 2: Cache Access Time
- With Load Bypass: Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
- OR Without Load Bypass: Average Memory Access Time = Time for a hit + Miss rate x Miss penalty
Slide 3: Unified vs Split Caches
- Unified Cache
  - Low miss ratio because more space is available for either instructions or data
  - Low cache bandwidth because instructions and data cannot be read at the same time due to one port
- Split Cache
  - High miss ratio because either instructions or data may run out of space even though space is available in the other cache
  - High bandwidth because an instruction and data can be accessed at the same time
- Example
  - 16KB each (I & D): Inst miss rate = 0.64%, Data miss rate = 6.47%
  - 32KB unified: Aggregate miss rate = 1.99%
  - Which is better (ignore L2 cache)?
  - Assume 33% data ops => 75% of accesses come from instructions (1.0/1.33)
  - hit time = 1, miss time = 50
  - Note that a data hit incurs 1 extra stall cycle for the unified cache (only one port)
  - AMAT(Harvard) = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
  - AMAT(Unified) = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24
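The AMAT comparison above can be checked numerically. A minimal Python sketch (the `amat` helper name is ours; the miss rates, miss penalty, and access mix come from the slide):

```python
# Worked AMAT comparison for split vs unified caches.
# Assumes hit time = 1 cycle, miss penalty = 50 cycles,
# 75% instruction / 25% data accesses, as stated on the slide.

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Split (Harvard) caches: 16 KB each, separate instruction and data miss rates.
amat_split = 0.75 * amat(1, 0.0064, 50) + 0.25 * amat(1, 0.0647, 50)

# Unified 32 KB cache: aggregate miss rate 1.99%, plus 1 extra stall cycle
# on data accesses because the single port is busy with the instruction fetch.
amat_unified = 0.75 * amat(1, 0.0199, 50) + 0.25 * amat(1 + 1, 0.0199, 50)

print(amat_split)    # ≈ 2.05 cycles
print(amat_unified)  # ≈ 2.24 cycles
```

Despite its higher aggregate miss rate, the split design wins here because the unified cache's single port stalls every data access.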
Slide 4: Static RAM (SRAM)
- Six transistors in a cross-connected fashion
- Provides regular AND inverted outputs
- Implemented in a CMOS process

Single-Port 6-T SRAM Cell
Slide 5: Dynamic Random Access Memory - DRAM
- DRAM organization is similar to SRAM, except that each bit of DRAM is constructed using a pass transistor and a capacitor, shown on the next slide
- Fewer transistors per bit gives higher density, but slow discharge through the capacitor
- The capacitor needs to be recharged, or refreshed, giving rise to a high cycle time. Q: What is the difference between access time and cycle time?
- Uses a two-level decoder, as shown later. Note that 2,048 bits are accessed per row, but only one bit is used
Slide 6: Dynamic RAM
- SRAM cells exhibit high speed / poor density
- DRAM: simple transistor/capacitor pairs in high-density form

[Figure: DRAM cell array - a word line gates a pass transistor connecting a storage capacitor C to the bit line; a sense amp reads the bit line]
Slide 7: DRAM Logical Organization (4 Mbit)
- Access time of DRAM = Row access time + column access time + refreshing

[Figure: 4 Mbit DRAM - multiplexed address bits feed a row decoder and a column decoder around a 2,048 x 2,048 memory array; sense amps and I/O drive the data pins D and Q; each storage cell sits at the intersection of a word line and a bit line]

- Square root of bits per RAS/CAS
Slide 8: Virtual Memory
- Idea 1: Many programs share DRAM memory so that context switches can occur
- Idea 2: Allow a program to be written without memory constraints - the program can exceed the size of the main memory
- Idea 3: Relocation - parts of the program can be placed at different locations in memory instead of one big chunk
- Virtual Memory:
  - (1) DRAM memory holds many programs running at the same time (processes)
  - (2) Use DRAM memory as a kind of cache for disk
Slide 9: Disk Technology in Brief
- Disk is mechanical memory

[Figure: disk platter with tracks and R/W arm; 3600 - 7200 RPM rotation speed]

- Disk Access Time = seek time + rotational delay + transfer time
  - usually measured in milliseconds
- A miss to disk is extremely expensive
  - typical access time = millions of clock cycles
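The access-time formula above is easy to evaluate. A small sketch with hypothetical but typical numbers (the seek and transfer times are our assumptions; the RPM range comes from the slide; the average rotational delay is half a revolution):

```python
# Back-of-envelope disk access time.
# Access time = seek time + rotational delay + transfer time.

def avg_access_ms(seek_ms, rpm, transfer_ms):
    # On average the desired sector is half a revolution away.
    half_revolution_ms = 0.5 * 60_000 / rpm   # 60,000 ms per minute
    return seek_ms + half_revolution_ms + transfer_ms

# A 7200 RPM disk: half a revolution takes 60000/7200/2 ≈ 4.17 ms.
t = avg_access_ms(seek_ms=8.0, rpm=7200, transfer_ms=0.5)
print(t)  # ≈ 12.7 ms -- millions of cycles for a processor clocked in GHz
```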
Slide 10: Virtual Memory Has Its Own Terminology
- Each process has its own private virtual address space (e.g., 2^32 bytes); the CPU actually generates virtual addresses
- Each computer has a physical address space (e.g., 128 MegaBytes of DRAM), also called real memory
- Address translation: mapping virtual addresses to physical addresses
- Allows multiple programs to use (different chunks of physical) memory at the same time
- Also allows some chunks of virtual memory to be kept on disk, not in main memory (to exploit the memory hierarchy)
Slide 11: Mapping Virtual Memory to Physical Memory
- Divide memory into equal-sized chunks (say, 4KB each)
- Any chunk of Virtual Memory can be assigned to any chunk of Physical Memory (a page)

[Figure: a single process's virtual memory (code, static, heap, stack, starting at address 0) mapped page by page onto a 64 MB physical memory]
Slide 12: Handling Page Faults
- A page fault is like a cache miss
  - Must find the page in the lower level of the hierarchy
- If the valid bit is zero, the Physical Page Number points to a page on disk
- When the OS starts a new process, it creates space on disk for all the pages of the process, sets all valid bits in the page table to zero, and sets all Physical Page Numbers to point to disk
- This is called Demand Paging - pages of the process are loaded from disk only as needed
Slide 13: Comparing the 2 Levels of Hierarchy
- Cache                                            | Virtual Memory
- Block or Line                                    | Page
- Miss                                             | Page Fault
- Block Size: 32-64B                               | Page Size: 4K-16KB
- Placement: Direct Mapped, N-way Set Associative  | Fully Associative
- Replacement: LRU or Random                       | Least Recently Used (LRU) approximation
- Write Thru or Back                               | Write Back
- How Managed: Hardware                            | Software/Hardware (Operating System)
Slide 14: How to Perform Address Translation?
- VM divides memory into equal-sized pages
- Address translation relocates entire pages
  - offsets within the pages do not change
  - if we make the page size a power of two, the virtual address separates into two fields
  - like the cache index and offset fields

virtual address = | Virtual Page Number | Page Offset |
Slide 15: Mapping a Virtual to a Physical Address (1KB page size)
- Virtual Address: Virtual Page Number (bits 31..10) | Page Offset (bits 9..0)
- Translation maps the Virtual Page Number to a Physical Page Number; the Page Offset passes through unchanged
- Physical Address: Physical Page Number (bits 29..10) | Page Offset (bits 9..0)
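The split above is just bit slicing. A short sketch assuming the slide's 1 KB pages (10-bit offset); the function name and sample address are ours:

```python
# Splitting a virtual address into page number and offset (1 KB pages).

OFFSET_BITS = 10
PAGE_SIZE = 1 << OFFSET_BITS          # 1024 bytes

def split(vaddr):
    vpn = vaddr >> OFFSET_BITS        # upper bits select the virtual page
    offset = vaddr & (PAGE_SIZE - 1)  # lower 10 bits pass through translation unchanged
    return vpn, offset

vpn, offset = split(0x12345)
print(hex(vpn), hex(offset))  # 0x48 0x345
```

Only the page number goes through the page table; the offset is concatenated back onto the physical page number.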
Slide 16: Address Translation
- We want fully associative page placement
- How to locate the physical page?
  - Search is impractical (too many pages)
- A page table is a data structure which contains the mapping of virtual pages to physical pages
  - There are several different ways, all up to the operating system, to keep this data around
- Each process running in the system has its own page table
Slide 17: Address Translation: Page Table

[Figure: the virtual page number of the Virtual Address (VA) indexes the Page Table; each entry holds a Valid bit, Access Rights (A.R.), and a Physical Page Number (P.P.N.), which combines with the offset to form the Physical Memory Address (PA); invalid entries point to disk]

- The Page Table is located in physical memory
- Access Rights: None, Read Only, Read/Write, Executable
Slide 18: Optimizing for Space
- Page Table too big!
  - 4GB Virtual Address Space / 4 KB pages => 2^20 (about 1 million) Page Table Entries => 4 MB just for the Page Table of a single process!
- There is a variety of solutions that trade off Page Table size for slower performance
  - Use a limit register to restrict page table size and let it grow with more pages; multilevel page tables; paging the page tables; etc.
  - (Take the O/S class to learn more)
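The arithmetic above can be verified directly; this sketch assumes 4-byte page table entries (a common size, implied but not stated by the slide's 4 MB figure):

```python
# Page table size check: 4 GB virtual address space, 4 KB pages.

VA_BITS = 32                           # 4 GB virtual address space
PAGE_BITS = 12                         # 4 KB pages
PTE_BYTES = 4                          # assumed entry size

entries = 1 << (VA_BITS - PAGE_BITS)   # 2**20 virtual pages to map
table_bytes = entries * PTE_BYTES
print(entries, table_bytes // 2**20)   # 1048576 entries, 4 MB per process
```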
Slide 19: How to Translate Fast?
- Problem: Virtual Memory requires two memory accesses!
  - one to translate the Virtual Address into a Physical Address (page table lookup)
  - one to transfer the actual data (cache hit)
  - But the Page Table is in physical memory!
- Observation: since there is locality in pages of data, there must be locality in the virtual addresses of those pages!
- Why not create a cache of virtual-to-physical address translations to make translation fast? (smaller is faster)
- For historical reasons, such a page table cache is called a Translation Lookaside Buffer, or TLB
Slide 20: Typical TLB Format

Entry fields: | Virtual Page Nbr (tag) | Physical Page Nbr (data) | Valid | Ref | Dirty | Access Rights |

- The TLB is just a cache of the page table mappings
- Dirty: since we use write back, we need to know whether or not to write the page to disk when it is replaced
- Ref: used to calculate LRU on replacement
- TLB access time is comparable to cache (much less than main memory access time)
Slide 21: Translation Look-Aside Buffers
- The TLB is usually small, typically 32-4,096 entries
- Like any other cache, the TLB can be fully associative, set associative, or direct mapped

[Figure: the Processor sends a virtual addr. to the TLB; a TLB hit yields the physical addr., which goes to the Cache (a cache hit returns data; a miss fetches data from Main Memory); a TLB miss consults the Page Table; a page fault/protection violation invokes the OS Fault Handler, which brings the page in from Disk Memory]
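The flow in the figure can be sketched as a lookup with a fallback page-table walk. This is a toy model: the dictionaries, names, and fault handling are ours, and no replacement policy or capacity limit is shown:

```python
# A TLB is just a small cache of page-table entries.

page_table = {0x48: 0x7F, 0x49: 0x80}  # VPN -> PPN, maintained by the OS
tlb = {}                                # small cache of recent translations

def translate(vpn):
    if vpn in tlb:                      # TLB hit: no extra memory access needed
        return tlb[vpn]
    if vpn in page_table:               # TLB miss: walk the page table in memory
        tlb[vpn] = page_table[vpn]      # ...and cache the translation
        return tlb[vpn]
    raise LookupError("page fault")     # OS fault handler loads the page from disk

print(hex(translate(0x48)))  # first access misses the TLB and walks the table
print(hex(translate(0x48)))  # second access hits the TLB
```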
Slide 22: DECStation 3100 / MIPS R2000

[Figure: address translation and cache lookup datapath]
- Virtual Address (32 bits): Virtual page number (20 bits, 31..12) | Page offset (12 bits, 11..0)
- TLB: 64 entries, fully associative; each entry holds Valid, Dirty, Tag, and Physical page number; a tag match asserts TLB hit
- Physical Address: Physical page number (20 bits) | Page offset (12 bits)
- Cache: 16K entries, direct mapped; the physical address splits into a 16-bit Physical address tag, a 14-bit Cache index, and a 2-bit Byte offset; each cache entry holds Valid, Tag, and 32 bits of Data; a tag match asserts Cache hit
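The cache-side field widths on this slide can be checked with bit slicing (the function name and sample address are ours; the 16/14/2 split is from the slide):

```python
# Slicing a DECStation 3100 physical address for its direct-mapped cache:
# 16-bit tag | 14-bit index (16K entries) | 2-bit byte offset.

def cache_fields(paddr):
    byte_offset = paddr & 0x3           # bits 1..0
    index = (paddr >> 2) & 0x3FFF       # bits 15..2: 14 bits -> 16K entries
    tag = paddr >> 16                   # bits 31..16
    return tag, index, byte_offset

tag, index, off = cache_fields(0xABCD1234)
print(hex(tag), hex(index), off)  # 0xabcd 0x48d 0
```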
Slide 23: Real Stuff: Pentium Pro Memory Hierarchy
- Address Size: 32 bits (VA, PA)
- VM Page Size: 4 KB, 4 MB
- TLB organization: separate i, d TLBs (i-TLB: 32 entries, d-TLB: 64 entries); 4-way set associative; LRU approximated; hardware handles misses
- L1 Cache: 8 KB, separate i, d; 4-way set associative; LRU approximated; 32-byte block; write back
- L2 Cache: 256 or 512 KB