Title: Memory Hierarchies (Chapter 7)
1 Memory Hierarchies: Chapter 7
2 The Basics
- Fake out the program (application) into thinking it has a massive high-speed memory, limited only by the address space.
- The appearance of a massive high-speed memory system is an illusion, but the performance enhancement is real.
- Will consider three levels of memory media:
  - Cache
  - Main memory
  - Disk
- Principle of locality (pp. 540-541): programs access only a very small part of the address space at any instant of time.
  - Temporal locality: if an item is referenced, it will probably be referenced again soon.
  - Spatial locality: if an item is referenced, items whose addresses are close by will tend to be referenced soon.
3 Memory Hierarchy
- Organize memory into levels according to cost/speed factors.
- The higher the level, the faster, smaller, and more expensive (per bit) the memory is.
- The lower the level, the slower, larger, and cheaper (per bit) the memory becomes.
- The highest level is directly accessed by the microprocessor ... the cache.
- The levels to be considered are:
  - SRAM ... super fast memory, expensive
  - DRAM ... ordinary RAM memory, minimum speed, cheap
  - DISK (DASD) ... very slow, very cheap
4 Memory Hierarchy Levels
Cache
Main memory
Disk
Fig. 7.1
5 Data Transfer Between Levels
- Block
  - The minimum unit of data that can be present or not present in a two-level memory (cache system). It can be as small as a word or may be multiple words.
  - This is the basic unit of data transfer between the cache and main memory.
  - At lower levels of memory, blocks are also known as pages.
6 Data Transfer Between Levels
Fig. 7.2 (blocks are the unit of transfer between levels)
7 The Basic Idea
- The data needed by the processor is loaded into the cache on demand.
- If the data is present in the cache, the processor uses it directly. This is called a hit.
- Hit rate or ratio
  - The fraction (or percent) of the time the data is found at the high level of memory without incurring a miss.
- Miss rate or ratio
  - 1 - hit rate (or 100% - hit rate, in percent)
8 The Basic Idea (cont.)
- Hit time: the time it takes to access an upper level of a memory hierarchy (the cache, for two levels).
- Miss penalty: the time it takes to replace a block in an upper level of the hierarchy with the corresponding block from a lower level.
- A miss in a cache is serviced by hardware.
- A miss at a level below the cache (e.g., main memory) is serviced in software by the operating system (must do I/O at the disk level).
9 Why the Idea Works
- Fundamental phenomena or properties which make this scheme work:
  - Hit ratios must be, and generally are, very high.
  - High hit ratios are a result of the locality principle described above.
  - Success is statistical and could never be deterministically guaranteed. A good (or bad?) programmer could always write a program which would kill a memory hierarchy ... cause thrashing.
  - Memory hierarchies must be tuned via level sizes and block (page) sizes to get optimal performance.
10 The Basics of Caches
- Simplest type: single-word block.
- On start-up the cache is empty, and as references are made, it fills up via block misses and transfers from RAM.
- Once it fills, the locality principle takes over and there are very few misses.
- Question:
  - Where are the blocks put in the cache, and how are they found and related to main memory blocks?
  - There must be a mapping scheme between cache and main memory.
11 The Basics of Caches
- Direct mapping (assuming addresses are block numbers*):
  - A word from memory can map into one and only one place in the cache.
  - The cache location is directly derived from the main memory address.
  - Direct mapping rule: the cache address of a block (block number) is the main memory block number modulo the cache size in units of blocks.
  - For example, if the main memory block number is 21 decimal and the cache size is 8 blocks, then the cache address is 21 mod 8 = 5 decimal.
  - If we keep the size of the cache (in blocks) a power of 2, then the cache address is directly given by the log2(cache size in blocks) low-order bits. Example: 21 (base 10) = 010101 (base 2); log2 8 = 3; the three low-order bits are 101 = 5 decimal. (A small sketch of this rule appears below.)
  - Note that the memory-to-cache mapping is many to one (see p. 546, Fig. 7.5).
  - Preview: if this is direct mapping, then there must be a non-direct mapping - set associative, with more than one place to put a block. See later.
* A block number is a byte address with the low-order bits designating the word within a block, and the byte within a word, stripped off. If a block were a byte, then these addresses would be complete.
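The direct-mapping rule above can be checked with a few lines of Python (a minimal sketch, not from the text, assuming single-word blocks and the 8-block cache of the 21 mod 8 example):

```python
# Minimal sketch of the direct-mapping rule for a single-word-per-block cache.
# Assumes an 8-block cache, as in the "21 mod 8 = 5" example above.

CACHE_SIZE_BLOCKS = 8                               # must be a power of 2
INDEX_BITS = CACHE_SIZE_BLOCKS.bit_length() - 1     # log2(8) = 3

def cache_index(block_number: int) -> int:
    """Cache location of a main-memory block under direct mapping."""
    return block_number % CACHE_SIZE_BLOCKS

def cache_index_from_bits(block_number: int) -> int:
    """Same result, read directly from the INDEX_BITS low-order bits."""
    return block_number & ((1 << INDEX_BITS) - 1)

if __name__ == "__main__":
    for blk in (21, 5, 13):                          # all three map to index 5
        print(blk, "->", cache_index(blk), cache_index_from_bits(blk))
```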
12 Memory-Cache Mapping: Special Case, One Word Per Block ("Piano Keys")
Every block (= word) in memory maps into a unique location in the cache. The two low-order bits for the byte within a word are stripped off.
Fig. 7.5
13 Memory-Cache Mapping (cont.)
- Questions
  - Because each cache location can contain the contents of a number of different memory locations, how do we know whether the data in the cache corresponds to a requested word, i.e., the block containing this word?
  - How do we know if a requested word is in the cache or not?
- The answer is in the contents of a cache entry:
  - 32-bit data field - the desired raw data.
  - Tag field - the high-order address bits left after the modulo (low-order cache) bits are stripped out; this identifies the entry with a unique memory block.
  - A one-bit valid field (validity bit). The valid bit is turned on when a block has been moved into the cache on demand. If the valid bit is on, the tag and data fields are valid.
14 How a Block is Found
- The low-order address bits of the block number (log2 of the cache size in blocks) from the main memory address form the index into the cache.
- If the valid bit is on, the tag is compared with the corresponding field in the main memory address.
- If it compares, we have a hit.
- If it does not compare, or if the valid bit is off, we have a miss, and the hardware copies the desired block from main memory into this location in the cache.
- Whatever was at this cache location gets overlaid. (A software sketch of this lookup appears below.)
15 How a Block is Found
One word (4 bytes) per block.
Byte offset within a word: bits 0-1.
Block (frame) number within the cache: bits 2-11.
Tag: bits 12-31 = block number in logical space (compare with a page table).
Data field: 4 bytes.
The cache by definition has size 2^10 blocks; contrast to main memory in a VM scheme, where memory size is arbitrary.
Data is 32 bits; each entry has 32 + 20 + 1 = 53 bits.
The emphasis is more on temporal rather than spatial locality.
Fig. 7.7 (similar to the DEC example, Fig. 7.8)
16 Handling a Cache Miss
- Instruction miss (p. 551)
  - Access main memory at address PC - 4 for the desired instruction block (read).
  - Write the memory data into the proper cache location (low-order bits) and the upper bits into the tag field, then turn on the valid bit.
  - Restart the instruction execution from the beginning; it will now re-fetch the instruction and find it in the cache.
- A cache stall occurs by stalling the entire machine (rather than only certain instructions, as in a pipeline stall).
17 Handling a Cache Miss (cont.)
- Read data miss
  - Similar to an instruction miss - simply stall the processor until the cache is updated; just retain the ALU address output for processing the miss (where to put the memory data in the cache).
- Write miss (see pp. 553-554)
  - If we simply wrote to the data cache without updating main memory, then cache and memory would be inconsistent.
  - The simple solution is to use write-through: index into the cache with the low-order bits, write the data and tag portion into the block, set valid, then write the data word to main memory using the entire address. Contrast this later with the case of more than one word per block.
  - This method impacts performance - an alternate approach is =>
18 Handling a Cache Miss (cont.)
- Write buffer technique (p. 554)
  - Write data into the cache and the buffer at the same time (the buffer is fast) ... the processor continues execution; a sort of lazy evaluation.
  - While the processor proceeds, the buffer data is copied to memory.
  - When the write to main memory completes, the write buffer entry is freed up.
  - If the write buffer is full when a write is encountered, the processor must stall until a buffer position frees up.
  - Problem: even though writes are generated at a rate less than the rate of absorption by main memory (on average), bursts of writes can stall the processor ... the only remedy is to increase the buffer size.
  - The buffer is generally small (< 10 words).
- A preview of other problems associated with caches:
  - In a multiprocessing scheme, each processor may have its own cache while there is a common main memory.
  - Now we have a cache coherency problem.
  - Not only must we keep the caches in step with main memory, but we must keep them in step with each other - more later.
19 Taking Advantage of Spatial Locality
- Up to now there was essentially no spatial locality:
  - The block size was too small - the unit of memory transfer on the bus is a word.
  - The block size was one word.
- A block consists of multiple contiguous words from main memory.
- Need a cache block of size greater than one word for spatial locality.
- Load the desired word and its local companions into the cache.
- A miss always brings in an entire block.
- Assume the number of words per block is a power of 2.
20 Taking Advantage of Spatial Locality
- Mapping an address to a multiword cache block:
  - Example: block size = 16 bytes => the low 4 bits are the byte offset into a block; the low 2 bits are the byte offset into a word; bits 2 and 3 are the word offset into the block. Cache size = 64 blocks, thus the low 6 bits of the block number give the block's location in the cache. What does a byte address of 1202 (decimal) = 0x4B2 map to?
  - The cache block is given by (block address) mod (number of cache blocks), where the block address (actually the block number in logical space) = (byte address) / (bytes per block) = floor(1202/16) = 75 decimal = 0x4B (drop the low 4-bit offset within the block). Cache block number = 75 mod 64 = 0x4B mod 64 = 11 decimal = 001011 binary = the low 6 bits of the block number.
  - Summary: 1202 decimal = 0x000004B2 = 0000 0000 0000 0000 0000 0100 1011 0010 binary, with the fields tag | cache location (index) | block offset. Remember that the "block address" here is really the block number (the byte address with the block offset dropped) - the book is a bit sloppy. The 001011 field is the cache location. Also, bits in index = log2(sets in cache) = log2((size of cache)/(size of set)) = log2(64 blocks / (1 block/set)) = log2 64 = 6. See later. (The same arithmetic is worked in the sketch below.)
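The same arithmetic, written as a small script; the 16-byte block and 64-block direct-mapped cache are the example's assumptions:

```python
# Decompose a byte address into block offset, cache index, and tag,
# for a direct-mapped cache with 16-byte blocks and 64 blocks (as above).

BYTES_PER_BLOCK = 16          # 4-bit block offset
NUM_BLOCKS = 64               # 6-bit index

OFFSET_BITS = BYTES_PER_BLOCK.bit_length() - 1    # 4
INDEX_BITS = NUM_BLOCKS.bit_length() - 1           # 6

def map_address(byte_address: int):
    block_offset = byte_address & (BYTES_PER_BLOCK - 1)
    block_number = byte_address >> OFFSET_BITS      # floor(addr / 16)
    index = block_number & (NUM_BLOCKS - 1)         # block_number mod 64
    tag = block_number >> INDEX_BITS
    return tag, index, block_offset

if __name__ == "__main__":
    tag, index, offset = map_address(1202)          # 0x4B2
    print(f"block number = {1202 // 16}")           # 75 = 0x4B
    print(f"tag = {tag}, index = {index} ({index:06b}b), offset = {offset}")
    # index prints 11 (001011b), matching the worked example above
```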
21 64 KB Cache Using a 16-Byte Block
16-byte blocks, direct mapped (equivalently, 4-word blocks, direct mapped; preview: 1-way associative).
64 KB = 16K words = 4K blocks => 12-bit index into the cache (the 12 low bits of the block number).
The tag is associated with the block, not the word.
Bits 2-3 pick off the word within a block.
Still direct mapping! See set associative later.
Fig. 7.10
22 Taking Advantage of Spatial Locality: Miss Handling
- Read miss handling
  - Processed the same way as a single-word-block read miss.
  - A read miss always brings back the entire block.
23 Taking Advantage of Spatial Locality: Miss Handling (cont.)
- Write miss handling
  - Can't simply write the data and corresponding tag, because the block contains more than a single word.
  - When we had one word per block, we simply wrote the data and tag into the block, set valid, then wrote the data word to main memory.
  - Must now first bring in the correct block from memory if the tag mismatches, and then update the block using write-through or buffering. If we simply wrote the tag and word, we could possibly be updating the wrong block (intermixing two blocks) - multiple blocks can map to the same cache location. See the bottom of page 557.
24 Tuning Performance With Block Size
- Very small blocks may lose spatial locality (e.g., 1 word/block).
- Very large blocks may reduce performance if the cache is relatively small - competition for space.
- Spatial locality occurs over a limited address range - large blocks may bring in data which will never get referenced: dead wood.
- Miss rate may increase for very large blocks.
25 Tuning Performance With Block Size (cont.)
Fig. 7.12 (miss rate vs. block size for several cache sizes)
Cache performance is covered later.
26 Performance Considerations
- Assume that a cache hit gives "normal" performance; that is, this is our baseline for no performance degradation - peak performance.
- We get performance degradation when a cache miss occurs.
- Recall that a cache stall occurs by stalling the entire machine (rather than only certain instructions, as in a pipeline stall).
- Memory stall cycles are dead cycles elapsing during a cache stall. They consist of:
  - Memory-stall clock cycles = read-stall cycles + write-stall cycles (for example, on a per-program basis), where
  - Read-stall cycles = (reads/program) x (read miss rate) x (read miss penalty), where the read miss penalty is in cycles and may be given by some formula involving, say, block size.
  - Write-stall cycles = (writes/program) x (write miss rate) x (write miss penalty) + write buffer stalls, where "write buffer stalls" accounts for the case where a buffer is used to update main memory when a cache write occurs. If write buffer stalls are a significant part of the equation, it probably means this is a bad design! We shall assume a good design where the buffer is deep enough for this to be an insignificant term.
27 Performance Considerations (cont.)
- Assuming that the read and write miss penalties are the same, and that we can neglect write buffer stalls, we can write a more general formula:
  - Memory-stall clock cycles = (memory accesses/program) x (miss rate) x (cache miss penalty)
- For example, in homework problem 7.27, the cache miss penalty is given by the formula 6 + (block size in words) cycles.
- An example (page 565) =>
28 Performance Considerations (cont.)
- Assume an instruction cache miss rate for gcc of 2% and a data cache miss rate of 4%. If a machine has a CPI of 2 without any memory stalls (i.e., the ideal case of no cache misses), and the miss penalty is 40 cycles for all misses, determine how much faster a machine would run with a perfect cache that never misses. Use instruction frequencies from page 311 of the text.
- For instruction count I: instruction miss cycles = I x 2% x 40 = 0.80I.
- Data miss cycles = I x 35% x 4% x 40 = 0.56I, where the frequency of instructions doing memory references is 35% from page 311.
- Total memory stall cycles = 0.80I + 0.56I = 1.36I => more than 1 cycle of memory stall per instruction.
- The CPI with stalls = 2 + 1.36 = 3.36.
- Thus (CPU time with stalls)/(CPU time for perfect cache) = (I x CPIstall x clock cycle time)/(I x CPIperfect x clock cycle time) = CPIstall/CPIperfect = 3.36/2 = 1.68 - the ideal case is 68% faster. (The calculation is repeated in the sketch below.)
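The page-565 example can be redone as a short calculation (the 2%, 4%, 35%, and 40-cycle figures are just the inputs stated above):

```python
# Recompute the gcc example: CPI with memory stalls vs. a perfect cache.

base_cpi = 2.0            # CPI with no memory stalls
i_miss_rate = 0.02        # instruction cache miss rate
d_miss_rate = 0.04        # data cache miss rate
mem_ref_freq = 0.35       # fraction of instructions that access data memory
miss_penalty = 40         # cycles

i_stalls = i_miss_rate * miss_penalty                  # 0.80 cycles/instr
d_stalls = mem_ref_freq * d_miss_rate * miss_penalty   # 0.56 cycles/instr
stall_cpi = i_stalls + d_stalls                        # 1.36

cpi_with_stalls = base_cpi + stall_cpi                 # 3.36
speedup_of_perfect_cache = cpi_with_stalls / base_cpi  # 1.68

print(f"stall cycles per instruction = {stall_cpi:.2f}")
print(f"CPI with stalls = {cpi_with_stalls:.2f}")
print(f"perfect cache is {speedup_of_perfect_cache:.2f}x faster")
```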
29 Performance Considerations (cont.)
- Effects of cache/memory interface options on performance:
  - The cache interacts with memory on cache misses.
  - Goal: minimize block transfer time (maximize bandwidth) and minimize cost.
  - Must deal with tradeoffs.
- Cache and memory communicate over a bus - generally not the main bus.
- Assume memory is implemented in DRAM and the cache in SRAM.
- Miss penalty (MP) is the time (in clock cycles) it takes to transfer a block between memory and cache.
- Bandwidth is the bytes per clock cycle delivered when transferring a block.
- Example: assume 1 clock cycle to send the address to memory (just the initial address is needed), 15 clock cycles for each DRAM access initiated (effective access time), and 1 clock cycle to send a word to the cache.
- Bandwidth = (bytes per block)/(miss penalty in cycles).
- See Fig. 7.13 for three cases =>
30 Bandwidth Example
One-word-wide memory: MP = 1 + 4x15 + 4x1 = 65 cycles; BW = (4x4)/65 = 0.25. Read one word from memory at a time and deliver it to the cache one word at a time.
Wide (4-word) memory: MP = 1 + 1x15 + 1x1 = 17 cycles; BW = (4x4)/17 = 0.94. Read 4 words in parallel and deliver them to the cache 4 words at a time.
Interleaved memory (4 one-word-wide banks): MP = 1 + 1x15 + 4x1 = 20 cycles; BW = (4x4)/20 = 0.8. Read 4 words in parallel and deliver them to the cache one word at a time.
Fig. 7.13 (a sketch of these calculations follows below)
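A sketch reproducing the Fig. 7.13 numbers, assuming a 4-word block and the 1 / 15 / 1 cycle costs from the previous slide:

```python
# Miss penalty and bandwidth (bytes per cycle) for the three memory
# organizations of Fig. 7.13, using the 1 + 15 + 1 cycle assumptions above.

WORDS_PER_BLOCK = 4
BYTES_PER_WORD = 4
ADDR_CYCLES = 1        # send the address to memory
ACCESS_CYCLES = 15     # per DRAM access initiated
XFER_CYCLES = 1        # send one word (or one wide transfer) to the cache

def miss_penalty(accesses: int, transfers: int) -> int:
    return ADDR_CYCLES + accesses * ACCESS_CYCLES + transfers * XFER_CYCLES

cases = {
    "one-word-wide memory":         miss_penalty(4, 4),   # 65 cycles
    "wide (4-word) memory":         miss_penalty(1, 1),   # 17 cycles
    "interleaved (4-bank) memory":  miss_penalty(1, 4),   # 20 cycles
}

for name, mp in cases.items():
    bw = WORDS_PER_BLOCK * BYTES_PER_WORD / mp
    print(f"{name}: MP = {mp} cycles, BW = {bw:.2f} bytes/cycle")
```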
31 Now Comes Amdahl's Law!
- Summary: CPIstall/CPIperfect = 3.36/2 = 1.68 - the ideal case is 68% faster.
- Let's speed up the processor:
  - The amount of time spent on memory stalls will take up an increasing fraction of the execution time.
- Example: speed up the CPU by reducing the CPI from 2 to 1 without changing the clock rate.
  - The CPI with cache misses is now 1 + 1.36 = 2.36; the perfect cache is now 2.36/1 = 2.36 times faster instead of 1.68 times faster.
  - The fraction of execution time spent on memory stalls would have risen
    - from 1.36/3.36 = 41% (slow CPU)
    - to 1.36/2.36 = 58% (fast CPU)!
- A similar situation arises when increasing the clock rate without changing the memory system. (See the small check below.)
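The shift in the stall fraction can be checked in a couple of lines (using the same 1.36 stall cycles per instruction as above):

```python
# Fraction of execution time spent in memory stalls, before and after
# improving the base CPI from 2 to 1 (memory system unchanged).

stall_cpi = 1.36
for base_cpi in (2.0, 1.0):
    total_cpi = base_cpi + stall_cpi
    print(f"base CPI {base_cpi}: stall fraction = {stall_cpi / total_cpi:.1%}")
# prints about 40.5% and 57.6%; the slide rounds these to 41% and 58%
```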
32 The Bottom Line
- Relative cache penalties increase as the machine becomes faster; thus if a machine improves both clock rate and CPI, it suffers a double hit.
- The lower the CPI, the more pronounced the impact of stall cycles.
- If the main memories of two machines have the same absolute access times, the higher CPU clock rate leads to a larger miss penalty (in cycles).
- Bottom line: put the improvement where it is needed - improve the memory system, not just the CPU.
33 More Flexible Placement of Blocks: Set Associative
- Up to now we used direct mapping for block placement in the cache:
  - Only one place to put a block in the cache.
  - Finding a block is easy and fast.
  - Simply address it directly with the low-order block number bits.
- The other extreme is fully associative:
  - A block can be placed anywhere in the cache.
  - Finding a block is now more complicated - must search the cache looking for a match on the tag.
  - In order to keep performance high, we do the search in hardware (see later), at a cost tradeoff.
- Let us look at schemes between these two extremes.
34 Set Associative Block Placement
- There is now a fixed number of locations (at least two) where a block can be placed.
- For n locations it is called an n-way set-associative cache.
- An n-way set-associative cache consists of a number of sets, each having n blocks.
- Each block in memory maps to a unique set in the cache using the index field (the low-order modulo bits).
- Recall that in a direct-mapped cache, the position of a memory block was given by (block number) mod (number of cache blocks) - the low-order block bits.
- Note that the number of cache blocks in this case is the same as the number of sets - one block per set.
- In a general set-associative cache, the set number containing the desired memory block is given by (block number) mod (number of sets in the cache). Again these are the low-order block bits.
- See diagram =>
35 Set Associative Block Placement (cont.)
Direct mapped: 8 sets, 8 blocks (1 block/set); block 12 goes to set 12 mod 8 = 4.
Set associative (2-way): 4 sets, 8 blocks (2 blocks/set); block 12 goes to set 12 mod 4 = 0.
Fully associative: 1 set, 8 blocks; block 12 can go anywhere within the one set (set 0).
One tag per block within a set.
Fig. 7.15
The example above uses cache block number 12 decimal = 0xC = 1100 binary. Note that 0xC results when the block offset bits are stripped off.
36 Set Associative Block Placement (cont.)
Definition: the (logical) size or capacity of a cache usually means the amount of real or user data it can hold. The physical size is larger, to account for tags and status bits (such as valid bits) also stored there.
Definitions: associativity = blocks/set. A block is one or more words. The tag is associated with a block within a set.
Fig. 7.16
37 Set Associative Mapping
- For direct mapped, the location of a memory block (of one or more words) is given by: index = (block number) mod (number of cache blocks).
- For a general set-associative cache, the set containing a memory block is given by: index = (block number) mod (number of sets in the cache).
- This is consistent with the direct-mapped definition, since for direct mapped there is one block per set.
- Each block in memory maps to a unique set in the cache, given by the index field.
- The placement of a block within a set is not unique; it depends on a replacement algorithm (example: LRU).
- Must logically search a set to find a particular block, identified by its tag.
- The tag is the high-order bits remaining after the index and the offset into the block are stripped off.
- In the case of fully associative (only one set), there is no index, because there is only one place to index into, i.e., the entire cache.
- The number of bits in the index field is determined by the size of the cache in units of sets. The size of a set is the block size x the associativity of the set.
38 Set Associative Mapping (cont.)
Address fields: <------- block number -------> | block offset
index = (block number) mod (number of sets in the cache)
block size = 2^(number of bits in block offset) bytes
number of bits in index = log2(number of sets in cache)
The number of bits in the index is directly a function of the size of the cache and the associativity of the sets:
number of bits in index = log2((size of cache)/(size of set)) = log2((size of cache)/((associativity of set) x (size of block)))
Consistent units must be used in the calculations (bytes, words, etc.). "Size of cache" means the amount of real data it holds; it does not account for validity bits and tags also stored. The associativity of a set is defined as the number of blocks in a set. (A small helper evaluating this formula appears below.)
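A small helper that evaluates this index-bit formula (a sketch; sizes are in bytes here, and the example values match the 4-way cache on the next slide):

```python
import math

# Number of index bits = log2( cache size / (associativity * block size) ),
# i.e. log2(number of sets).  All sizes must be in the same units (bytes here).

def index_bits(cache_size: int, block_size: int, associativity: int) -> int:
    sets = cache_size // (associativity * block_size)
    return int(math.log2(sets))

# Example matching the 4-way, 1-word-block, 4 KB cache on the next slide:
print(index_bits(cache_size=4096, block_size=4, associativity=4))   # -> 8
```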
39 Set Associative Block Placement (cont.): Example, 4-Way Set Associative
Block address fields: tag | index | block offset (2 bits). Block size = 1 word = 4 bytes.
bits in index = log2(sets in cache) = log2((cache size)/(set size)) = log2(1024/4) = log2(256) = 8
Cache size = sets x size of set = 256 x 4 words = 1024 words = 4 Kbytes
Fig. 7.19
40 Set Associative Mapping, an Example (pp. 571-572)
Cache size = 4 words = 4 blocks. Sequence of block numbers: 0, 8, 0, 6, 8.
CASE 1: direct mapping - one block per set, thus sets in cache = 4.
index bits = log2(sets in cache) = log2((cache size)/(set size)) = log2(4/1) = 2 bits in the index.
The tag must have 32 - 2 bits (index) - 2 bits (block offset) = 28 bits.
Address fields: tag = bits 4-31, index = bits 2-3, block offset = bits 0-1.
block 0 => set index = (0 mod 4) = 0
block 6 => set index = (6 mod 4) = 2
block 8 => set index = (8 mod 4) = 0
Total of 5 misses - see text.
41 Set Associative Mapping, an Example (pp. 571-572)
Cache size = 4 words = 4 blocks. Sequence of block numbers: 0, 8, 0, 6, 8.
CASE 2: 2-way associative mapping - two blocks per set, thus sets in cache = 4 blocks/(2 blocks/set) = two sets in the cache.
bits in index = log2(sets in cache) = log2(2) = 1 bit in the index.
The tag must have 32 - 1 bit (index) - 2 bits (block offset) = 29 bits.
Address fields: tag = bits 3-31, index = bit 2, block offset = bits 0-1.
block 0 => set index = (0 mod 2) = 0
block 6 => set index = (6 mod 2) = 0
block 8 => set index = (8 mod 2) = 0
Total of 4 misses - see text. (LRU is used for replacement.)
42 Set Associative Mapping, an Example (pp. 571-572)
Cache size = 4 words = 4 blocks. Sequence of block numbers: 0, 8, 0, 6, 8.
CASE 3: fully associative mapping - 1 set, 4 blocks per set, thus sets in cache = 4 blocks/(4 blocks/set) = one set in the cache.
bits in index = log2(sets in cache) = log2(1) = 0 bits in the index.
The tag must have 32 - 0 bits (index) - 2 bits (block offset) = 30 bits.
Address fields: tag = bits 2-31, index = none, block offset = bits 0-1.
block 0 => set index = (0 mod 1) = 0
block 6 => set index = (6 mod 1) = 0
block 8 => set index = (8 mod 1) = 0
Total of 3 misses - see text. (LRU is used for replacement.)
All three cases are replayed in the small simulation sketch below.
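The three cases (pp. 571-572) can be replayed with a tiny per-set LRU simulator; a sketch, not the text's figures, assuming the same 4-block cache and the reference string 0, 8, 0, 6, 8:

```python
# Count misses for the block-number sequence 0, 8, 0, 6, 8 in a 4-block cache
# under direct-mapped, 2-way set-associative, and fully associative placement.
# Replacement within a set is LRU, as assumed in the text's example.

def count_misses(refs, num_blocks, ways):
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]        # each set holds blocks in LRU order
    misses = 0
    for block in refs:
        s = sets[block % num_sets]
        if block in s:
            s.remove(block)                     # hit: refresh its LRU position
        else:
            misses += 1                         # miss: evict the LRU block if full
            if len(s) == ways:
                s.pop(0)
        s.append(block)                         # most recently used at the end
    return misses

refs = [0, 8, 0, 6, 8]
for ways, name in [(1, "direct mapped"), (2, "2-way"), (4, "fully associative")]:
    print(f"{name}: {count_misses(refs, 4, ways)} misses")
# expected: 5, 4, and 3 misses respectively, as in the text
```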
43 Virtual Memory
- An extension of the concepts used for cache design (or maybe vice versa?).
- Key differences:
  - The "cache" is now main memory.
  - The backing store is now a hard drive.
  - The size of the logical space is now orders of magnitude larger than in a cache scheme.
  - The hardware search on the tag, used in locating a block in a cache, is no longer feasible in a VM scheme.
- A fully associative scheme is used in order to get the highest hit ratio:
  - No restrictions on where the block (page) goes.
  - Dragging along a tag is prohibitive in space usage.
  - Searching for a tag is also not practical - too many of them.
- Thus software is used, and addresses are mapped using a page table.
- Sometimes the PT is cached using a TLB, which works like a cache.
44 Virtual Memory Mapping
Virtual page
Fig. 7.23
45 Virtual Memory Mapping
Virtual address
Fig. 7.21
46 Virtual Memory Mapping
Fig. 7.22
47 Page Faults
- Page faults
  - If the page referenced by the virtual address is not in memory, we have a page fault.
  - A page fault is detected by checking the valid bit in the page table entry indexed by the virtual page number.
  - If the valid bit is off, we have a page fault.
- Response to a page fault
  - An exception is raised, and the OS takes over to fetch the desired page from disk.
  - Must keep track of the disk address to do this.
  - Disk addresses are typically kept in the page table, but there are other schemes as well (p. 586).
  - If memory has an unused page frame, then load the new page there.
  - If memory is full, choose an existing valid memory page to be replaced; if this page is modified, the disk must be updated.
48 Page Faults (cont.)
- How do we locate the page to be brought in from the disk?
  - Need a data structure similar to the page table to map a virtual page number to a disk address.
  - One way is to keep disk addresses in the page table along with real page numbers; other schemes may be used as well.
  - Used for reading from as well as writing to disk.
- Where do you put the page from disk if memory is full?
  - Replace some existing page in memory which is least likely to be needed in the future.
  - The Least Recently Used (LRU) algorithm is commonly used.
  - LRU in its exact form is costly to implement:
    - LRU status updates must be made on each reference.
    - At least from a logical point of view, an LRU stack must be managed.
  - A number of LRU approximations are possible which are more realistic to implement - for example, use reference bit(s) and replace only pages whose reference bits are off (a sketch follows below).
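A hedged sketch of one common reference-bit approximation of LRU (a "clock" / second-chance style scheme; the class and method names are illustrative, not from the text):

```python
# Illustrative reference-bit ("clock" / second-chance) approximation of LRU:
# replace only pages whose reference bit is off, clearing bits as we scan.

class ClockReplacer:
    def __init__(self, num_frames):
        self.frames = [None] * num_frames     # page stored in each frame
        self.ref = [False] * num_frames       # reference bit per frame
        self.hand = 0                         # clock hand

    def touch(self, frame):
        """Called on every reference to a resident page."""
        self.ref[frame] = True

    def choose_victim(self):
        """Pick a frame whose reference bit is off, clearing bits as we pass."""
        while True:
            if not self.ref[self.hand]:
                victim = self.hand
                self.hand = (self.hand + 1) % len(self.frames)
                return victim
            self.ref[self.hand] = False       # give the page a second chance
            self.hand = (self.hand + 1) % len(self.frames)

if __name__ == "__main__":
    r = ClockReplacer(3)
    r.frames = ["A", "B", "C"]
    r.touch(0); r.touch(2)                    # A and C recently referenced
    print("evict frame", r.choose_victim())   # frame 1 (page B)
```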
49 Writes to the Memory System
- What about writes to the cache?
  - Must keep main memory in step with the cache.
  - Use write-through or a write buffer to memory for cache writes.
  - A write buffer hides the latency of writing to memory.
  - Main memory is updated at the time of the write.
- What about writes to memory (in a VM scheme)?
  - Must keep the disk backing store in step with memory.
  - Disk access is too slow to use write-through.
  - Use lazy evaluation and do updates only on a replacement:
    - Write back only when the page is to be replaced, or when the process owning the page table ends - this minimizes disk accesses.
    - Keep track of modified pages with a dirty bit in the page table.
50 Virtual Memory Mapping (cont.): Houston, We Have a Problem!
- We have both a space and a time problem.
- Space would be bad enough, but time also!
- Page tables can be huge:
  - If the address is 32 bits and the page size is 2K, then there are 2^21 (about 2 million) entries; at, say, 4 bytes per entry that is 8 megabytes!
  - To make matters worse, each process has a page table.
- Memory access time is doubled:
  - Even for a page hit, we must first access the page table stored in memory, and then get the desired data.
  - Two memory accesses to get the desired data in the most ideal situation (a page hit).
51 Two-Level Paging Example
- One solution to the large page table problem: paging the page table (distinct from the paged segment table on p. 588; see other solutions on pp. 587-588).
- A logical address (on a 32-bit machine with a 4K page size) is divided into:
  - a page number consisting of 20 bits.
  - a page offset consisting of 12 bits.
- Since the page table is paged, the page number is further divided into:
  - a 10-bit page number.
  - a 10-bit page offset.
- Thus, a logical address is as follows, where p1 is an index into the outer page table, and p2 is the displacement within the page of the outer page table:
  <---------- page number ----------><-- page offset -->
  |   p1 (10 bits)  |   p2 (10 bits) |    d (12 bits)   |
(A small sketch of this address split follows below.)
From Silberschatz et al., Operating System Concepts, 6th ed. Colored changes by Guydosh.
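The 10/10/12 split written out as address arithmetic (a sketch; the example address is arbitrary):

```python
# Split a 32-bit virtual address into outer page-table index (p1),
# inner page-table index (p2), and page offset (d): 10 / 10 / 12 bits.

P1_BITS, P2_BITS, OFFSET_BITS = 10, 10, 12

def split_address(va: int):
    d  = va & ((1 << OFFSET_BITS) - 1)
    p2 = (va >> OFFSET_BITS) & ((1 << P2_BITS) - 1)
    p1 = va >> (OFFSET_BITS + P2_BITS)
    return p1, p2, d

if __name__ == "__main__":
    va = 0x1234ABCD
    p1, p2, d = split_address(va)
    print(f"p1 = {p1:#x}, p2 = {p2:#x}, d = {d:#x}")
```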
52 Two-Level Page-Table Scheme
One solution to the large page table problem: page the page table.
For a 32-bit address and a 4K page, the non-paged PT would have 1M entries; at 4 bytes/entry, the PT would take up 4 MB! Solution: demand-page in only those blocks (pages) of the page table that are needed.
Outer page table (1st level): non-pageable, i.e., pinned in memory.
2nd-level page table: each block (page) is demand-paged into memory. Only a fraction of all pages of this table is resident in memory.
From Silberschatz et al., Operating System Concepts, 6th ed. Colored changes by Guydosh.
53 Two-Level Page-Table Scheme
- Address-translation scheme for a two-level 32-bit paging architecture.
Outer page table (1st level): non-pageable, i.e., pinned in memory.
2nd-level page table: each block (page) is demand-paged into memory. Only a fraction of all pages of this table is resident in memory.
Main memory: where the data is.
From Silberschatz et al., Operating System Concepts, 6th ed. Colored changes by Guydosh.
54 The TLB: Preventing the Doubling of Memory Access Time by the Page Table
- Use a Translation Lookaside Buffer - the TLB:
  - A cache for the page table - distinct from the main memory cache already studied.
  - Holds only page table mappings (entries).
  - Uses the principle of locality of references in the page table.
  - Contains the most used page table entries - a working set.
  - Most of the time we should find the desired address translation in the TLB.
- The TLB uses all the principles we already discussed for general memory caches: associativity, etc.
- When a TLB miss occurs, we must determine whether this is a true page fault, or whether we only need a TLB update from the page table.
  - If the latter, then process the reference using the page table and copy the PT entry to the TLB, with possible replacement.
  - If a dirty bit is on in the TLB entry being replaced, then the corresponding entry in the PT must be updated. (A small sketch of this lookup flow follows below.)
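A tiny software model of the TLB-then-page-table flow just described (a sketch; the dictionaries and field layout are assumptions, not the DECstation hardware):

```python
# Minimal model of a TLB lookup backed by a page table: on a TLB miss,
# consult the page table; a true page fault occurs only if the page table
# entry is itself invalid (the page is not in memory).

tlb = {}            # virtual page number -> physical frame number (the "cache")
page_table = {}     # virtual page number -> (valid, frame number)

def translate(vpn: int) -> int:
    if vpn in tlb:
        return tlb[vpn]                       # TLB hit
    valid, frame = page_table.get(vpn, (False, None))
    if not valid:
        raise RuntimeError("page fault: OS must bring the page in from disk")
    tlb[vpn] = frame                          # refill the TLB from the page table
    return frame

if __name__ == "__main__":
    page_table[7] = (True, 42)
    print(translate(7))   # TLB miss, PT hit -> 42, and the TLB is refilled
    print(translate(7))   # TLB hit -> 42
```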
55 TLB and PT Relationship
TLB and page table (see figure)
Fig. 7.24
56 TLB Integration With the Data Cache
The TLB is fully associative; the cache is direct mapped with 1 word/block.
Steps for hits: the VA is translated by the TLB and the physical address (PA) is constructed; the PA is sent to the cache and split into cache tag and index; data is retrieved as described previously.
Fig. 7.25 (DECstation 3100)
57 Read/Write Process for the TLB/Cache System
Fig. 7.25 (DECstation 3100)
58 Common Framework for All Memory Hierarchies
- We studied three concepts that may look different on the surface:
  - Caches, TLBs, and virtual memory.
- All three ideas rely on the fundamental principles of spatial and temporal locality to work.
- All three ideas deal with four fundamental questions:
  - Q1: Where can a block be placed? A1: One and only one place - direct-mapped cache. A few places - set-associative cache. Any place - fully associative cache or VM (with table lookup instead of search).
  - Q2: How is a block found? A2: Indexing, as in direct mapped. Limited search, as in set associative. Full search, as in a fully associative cache. A separate lookup table, as in a page table in VM.
  - Q3: What block is replaced on a miss? A3: Must choose an algorithm for keeping track of the reference behavior of the resident blocks or pages. Typically some variation of LRU, or random.
  - Q4: How are writes handled? A4: Write-through - with buffers to hide the write-through latency - or lazy evaluation (write-back, updating only on replacement) if the backing store is extremely slow, as in VM.
59 Common Framework for All Memory Hierarchies (cont.)
- Key relationship between the VM/page-table approach and the cache approach:
  - Both VM and a fully associative cache do not restrict where a page could go. The real memory in a VM scheme resembles a fully associative cache.
  - In a true fully associative cache we must search the cache (in hardware) for the tag (block number) - the tag must be stored along with the data.
  - In VM we only store the data in memory, and use a separate indexed page-table lookup to find the physical address of a page. Searching real memory for a page would be impractical because of its large size, and the space wasted on tags would be prohibitive.
  - This is OK for caches because they are small.
- Some generalizations of the meaning of memory hierarchy misses - the 3 Cs:
  - Compulsory misses: initial loading of a cache - cold start.
  - Capacity misses: the cache cannot contain the entire working set, or the blocks needed for execution of the program - replacement is needed - danger of thrashing.
  - Conflict misses: cache misses in a set-associative (not fully associative) or direct-mapped cache. Multiple blocks compete for the same set and the set is full, but there may be empty blocks in other sets which are off limits to use. This cannot happen in a fully associative cache.
60 Other Memory Hierarchy Topics
- Memory performance and bandwidth issues and tradeoffs - see section 7.2.
- Protection in virtual memory - see pp. 596-598:
  - Critical due to the multiprogramming environment.
  - Each process has its own virtual address space.
  - Must keep processes from interfering with each other's address spaces, and keep them out of the operating system's space.
  - Perhaps provide some reading ability (read only) outside of the process space, but no writing.
  - Sharing memory spaces for functional reasons must be totally under the control of the OS.
  - Page tables must be protected - put them in kernel space.
  - Maintain user and kernel modes.
  - If you want privileges outside of your own space, you must execute a system call - allow the OS to control the privilege (example: shared memory).