Title: Memory Hierarchies (Chapter 7)
1 Memory Hierarchies: Chapter 7
2 The Basics
- Fake out the program (application) into thinking it has a massive high-speed memory, limited only by the address space.
- The appearance of a massive high-speed memory system is an illusion, but the performance enhancement is real.
- Will consider three levels of memory media:
  - Cache
  - Main memory
  - Disk
- Principle of locality (pp. 540-541): programs access only a very small part of the address space at any instant of time.
  - Temporal locality: if an item is referenced, it will probably be referenced again soon.
  - Spatial locality: if an item is referenced, items whose addresses are close by will tend to be referenced soon.
3 Memory Hierarchy
- Organize memory into levels according to cost/speed factors.
- The higher the level, the faster, smaller, and more expensive (per bit) the memory is.
- The lower the level, the slower, larger, and cheaper (per bit) the memory becomes.
- The highest level is directly accessed by the microprocessor ... the cache.
- The levels to be considered are:
  - SRAM ... super fast memory, expensive
  - DRAM ... ordinary RAM memory, minimum speed, cheap
  - DISK (DASD) ... very slow, very cheap
4 Memory Hierarchy Levels
Cache
Main memory
Disk
Fig. 7.1
5 Data Transfer Between Levels
- Block
  - The minimum unit of data that can be present or not present in a two-level memory (cache system). It can be as small as a word or may be multiple words.
  - This is the basic unit of data transfer between the cache and main memory.
  - At lower levels of memory, blocks are also known as pages.
6 Data Transfer Between Levels
Fig. 7.2 (blocks are the unit of transfer between levels)
7 The Basic Idea
- The data needed by the processor is loaded into the cache on demand.
- If the data is present in the cache, the processor uses it directly. This is called a hit.
- Hit rate or ratio
  - The fraction (or percent) of the time the data is found at the high level of memory without incurring a miss.
- Miss rate or ratio
  - 1 - hit rate (or 100% - hit rate, in percent)
8 The Basic Idea (cont.)
- Hit time: the time it takes to access an upper level of a memory hierarchy (the cache, for two levels).
- Miss penalty: the time it takes to replace a block in an upper level of the hierarchy with the corresponding block from a lower level.
- A miss in a cache is serviced by hardware.
- A miss at a level below the cache (e.g., main memory) is serviced in software by the operating system (must do I/O at the disk level).
9 Why the Idea Works
- Fundamental phenomena or properties which make this scheme work:
  - Hit ratios must be, and generally are, very high.
  - High hit ratios are a result of the locality principle described above.
  - Success is statistical and could never be deterministically guaranteed. A good (or bad?) programmer could always write a program which would kill a memory hierarchy ... cause thrashing.
  - Memory hierarchies must be tuned via level sizes and block (page) sizes to get optimal performance.
10 The Basics of Caches
- Simplest type: single-word block.
- On start-up the cache is empty, and as references are made, it fills up via block misses and transfers from RAM.
- Once it fills, the locality principle takes over and there are very few misses.
- Question:
  - Where are the blocks put in the cache, and how are they found and related to main memory blocks?
  - There must be a mapping scheme between cache and main memory.
11 The Basics of Caches
- Direct mapping (assuming addresses are block numbers*):
  - A word from memory can map into one and only one place in the cache.
  - The cache location is directly derived from the main memory address.
  - Direct mapping rule: the cache address of a block (block number) is the main memory block number modulo the cache size in units of blocks.
  - For example, if the main memory block number is 21 decimal and the cache size is 8 blocks, then the cache address is 21 mod 8 = 5 decimal.
  - If we keep the size of the cache (in blocks) a power of 2, then the cache address is directly given by the log2(cache size in blocks) low-order bits. Example: 21 (base 10) = 010101 (base 2); log2 8 = 3; the three low-order bits are 101 = 5 decimal. (A small sketch of this rule appears below.)
  - Note that the memory-to-cache mapping is many to one (see p. 546, Fig. 7.5).
  - Preview: if this is direct mapping, then there must be a non-direct mapping - set associative, with more than one place to put a block. See later.
* A block number is a byte address with the low-order bits designating the word within a block, and the byte within a word, stripped off. If a block were a byte, then these addresses would be complete.
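The direct-mapping rule above can be checked with a few lines of Python (a minimal sketch, not from the text, assuming single-word blocks and the 8-block cache of the 21 mod 8 example):

```python
# Minimal sketch of the direct-mapping rule for a single-word-per-block cache.
# Assumes an 8-block cache, as in the "21 mod 8 = 5" example above.

CACHE_SIZE_BLOCKS = 8                               # must be a power of 2
INDEX_BITS = CACHE_SIZE_BLOCKS.bit_length() - 1     # log2(8) = 3

def cache_index(block_number: int) -> int:
    """Cache location of a main-memory block under direct mapping."""
    return block_number % CACHE_SIZE_BLOCKS

def cache_index_from_bits(block_number: int) -> int:
    """Same result, read directly from the INDEX_BITS low-order bits."""
    return block_number & ((1 << INDEX_BITS) - 1)

if __name__ == "__main__":
    for blk in (21, 5, 13):                          # all three map to index 5
        print(blk, "->", cache_index(blk), cache_index_from_bits(blk))
```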
12 Memory-Cache Mapping: Special Case, One Word Per Block ("Piano Keys")
Every block (= word) in memory maps into a unique location in the cache. The two low-order bits for the byte within a word are stripped off.
Fig. 7.5
13 Memory-Cache Mapping (cont.)
- Questions
  - Because each cache location can contain the contents of a number of different memory locations, how do we know whether the data in the cache corresponds to a requested word, i.e., the block containing this word?
  - How do we know if a requested word is in the cache or not?
- The answer is in the contents of a cache entry:
  - 32-bit data field - the desired raw data.
  - Tag field - the high-order address bits left after the modulo (low-order cache) bits are stripped out; this identifies the entry with a unique memory block.
  - A one-bit valid field (validity bit). The valid bit is turned on when a block has been moved into the cache on demand. If the valid bit is on, the tag and data fields are valid.
14 How a Block is Found
- The low-order address bits of the block number (log2 of the cache size in blocks) from the main memory address form the index into the cache.
- If the valid bit is on, the tag is compared with the corresponding field in the main memory address.
- If it compares, we have a hit.
- If it does not compare, or if the valid bit is off, we have a miss, and the hardware copies the desired block from main memory into this location in the cache.
- Whatever was at this cache location gets overlaid. (A software sketch of this lookup appears below.)
15 How a Block is Found
One word (4 bytes) per block.
Byte offset within a word: bits 0-1.
Block (frame) number within the cache: bits 2-11.
Tag: bits 12-31 = block number in logical space (compare with a page table).
Data field: 4 bytes.
The cache by definition has size 2^10 blocks; contrast to main memory in a VM scheme, where memory size is arbitrary.
Data is 32 bits; each entry has 32 + 20 + 1 = 53 bits.
The emphasis is more on temporal rather than spatial locality.
Fig. 7.7 (similar to the DEC example, Fig. 7.8)
16 Handling a Cache Miss
- Instruction miss (p. 551)
  - Access main memory at address PC - 4 for the desired instruction block (read).
  - Write the memory data into the proper cache location (low-order bits) and the upper bits into the tag field, then turn on the valid bit.
  - Restart the instruction execution from the beginning; it will now re-fetch the instruction and find it in the cache.
- A cache stall occurs by stalling the entire machine (rather than only certain instructions, as in a pipeline stall).
17 Handling a Cache Miss (cont.)
- Read data miss
  - Similar to an instruction miss - simply stall the processor until the cache is updated; just retain the ALU address output for processing the miss (where to put the memory data in the cache).
- Write miss (see pp. 553-554)
  - If we simply wrote to the data cache without updating main memory, then cache and memory would be inconsistent.
  - The simple solution is to use write-through: index into the cache with the low-order bits, write the data and tag portion into the block, set valid, then write the data word to main memory using the entire address. Contrast this later with the case of more than one word per block.
  - This method impacts performance - an alternate approach is =>
18 Handling a Cache Miss (cont.)
- Write buffer technique (p. 554)
  - Write data into the cache and the buffer at the same time (the buffer is fast) ... the processor continues execution; a sort of lazy evaluation.
  - While the processor proceeds, the buffer data is copied to memory.
  - When the write to main memory completes, the write buffer entry is freed up.
  - If the write buffer is full when a write is encountered, the processor must stall until a buffer position frees up.
  - Problem: even though writes are generated at a rate less than the rate of absorption by main memory (on average), bursts of writes can stall the processor ... the only remedy is to increase the buffer size.
  - The buffer is generally small (< 10 words).
- A preview of other problems associated with caches:
  - In a multiprocessing scheme, each processor may have its own cache while there is a common main memory.
  - Now we have a cache coherency problem.
  - Not only must we keep the caches in step with main memory, but we must keep them in step with each other - more later.
19 Taking Advantage of Spatial Locality
- Up to now there was essentially no spatial locality:
  - The block size was too small - the unit of memory transfer on the bus is a word.
  - The block size was one word.
- A block consists of multiple contiguous words from main memory.
- Need a cache block of size greater than one word for spatial locality.
- Load the desired word and its local companions into the cache.
- A miss always brings in an entire block.
- Assume the number of words per block is a power of 2.
20 Taking Advantage of Spatial Locality
- Mapping an address to a multiword cache block:
  - Example: block size = 16 bytes => the low 4 bits are the byte offset into a block; the low 2 bits are the byte offset into a word; bits 2 and 3 are the word offset into the block. Cache size = 64 blocks, thus the low 6 bits of the block number give the block's location in the cache. What does a byte address of 1202 (decimal) = 0x4B2 map to?
  - The cache block is given by (block address) mod (number of cache blocks), where the block address (actually the block number in logical space) = (byte address) / (bytes per block) = floor(1202/16) = 75 decimal = 0x4B (drop the low 4-bit offset within the block). Cache block number = 75 mod 64 = 0x4B mod 64 = 11 decimal = 001011 binary = the low 6 bits of the block number.
  - Summary: 1202 decimal = 0x000004B2 = 0000 0000 0000 0000 0000 0100 1011 0010 binary, with the fields tag | cache location (index) | block offset. Remember that the "block address" here is really the block number (the byte address with the block offset dropped) - the book is a bit sloppy. The 001011 field is the cache location. Also, bits in index = log2(sets in cache) = log2((size of cache)/(size of set)) = log2(64 blocks / (1 block/set)) = log2 64 = 6. See later. (The same arithmetic is worked in the sketch below.)
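The same arithmetic, written as a small script; the 16-byte block and 64-block direct-mapped cache are the example's assumptions:

```python
# Decompose a byte address into block offset, cache index, and tag,
# for a direct-mapped cache with 16-byte blocks and 64 blocks (as above).

BYTES_PER_BLOCK = 16          # 4-bit block offset
NUM_BLOCKS = 64               # 6-bit index

OFFSET_BITS = BYTES_PER_BLOCK.bit_length() - 1    # 4
INDEX_BITS = NUM_BLOCKS.bit_length() - 1           # 6

def map_address(byte_address: int):
    block_offset = byte_address & (BYTES_PER_BLOCK - 1)
    block_number = byte_address >> OFFSET_BITS      # floor(addr / 16)
    index = block_number & (NUM_BLOCKS - 1)         # block_number mod 64
    tag = block_number >> INDEX_BITS
    return tag, index, block_offset

if __name__ == "__main__":
    tag, index, offset = map_address(1202)          # 0x4B2
    print(f"block number = {1202 // 16}")           # 75 = 0x4B
    print(f"tag = {tag}, index = {index} ({index:06b}b), offset = {offset}")
    # index prints 11 (001011b), matching the worked example above
```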
21 64 KB Cache Using a 16-Byte Block
16-byte blocks, direct mapped (equivalently, 4-word blocks, direct mapped; preview: 1-way associative).
64 KB = 16K words = 4K blocks => 12-bit index into the cache (the 12 low bits of the block number).
The tag is associated with the block, not the word.
Bits 2-3 pick off the word within a block.
Still direct mapping! See set associative later.
Fig. 7.10
22 Taking Advantage of Spatial Locality: Miss Handling
- Read miss handling
  - Processed the same way as a single-word-block read miss.
  - A read miss always brings back the entire block.
23 Taking Advantage of Spatial Locality: Miss Handling (cont.)
- Write miss handling
  - Can't simply write the data and corresponding tag, because the block contains more than a single word.
  - When we had one word per block, we simply wrote the data and tag into the block, set valid, then wrote the data word to main memory.
  - Must now first bring in the correct block from memory if the tag mismatches, and then update the block using write-through or buffering. If we simply wrote the tag and word, we could possibly be updating the wrong block (intermixing two blocks) - multiple blocks can map to the same cache location. See the bottom of page 557.
24 Tuning Performance With Block Size
- Very small blocks may lose spatial locality (e.g., 1 word/block).
- Very large blocks may reduce performance if the cache is relatively small - competition for space.
- Spatial locality occurs over a limited address range - large blocks may bring in data which will never get referenced: dead wood.
- Miss rate may increase for very large blocks.
25 Tuning Performance With Block Size (cont.)
Fig. 7.12 (miss rate vs. block size for several cache sizes)
Cache performance is covered later.
26 Performance Considerations
- Assume that a cache hit gives "normal" performance; that is, this is our baseline for no performance degradation - peak performance.
- We get performance degradation when a cache miss occurs.
- Recall that a cache stall occurs by stalling the entire machine (rather than only certain instructions, as in a pipeline stall).
- Memory stall cycles are dead cycles elapsing during a cache stall. They consist of:
  - Memory-stall clock cycles = read-stall cycles + write-stall cycles (for example, on a per-program basis), where
  - Read-stall cycles = (reads/program) x (read miss rate) x (read miss penalty), where the read miss penalty is in cycles and may be given by some formula involving, say, block size.
  - Write-stall cycles = (writes/program) x (write miss rate) x (write miss penalty) + write buffer stalls, where "write buffer stalls" accounts for the case where a buffer is used to update main memory when a cache write occurs. If write buffer stalls are a significant part of the equation, it probably means this is a bad design! We shall assume a good design where the buffer is deep enough for this to be an insignificant term.
27 Performance Considerations (cont.)
- Assuming that the read and write miss penalties are the same, and that we can neglect write buffer stalls, we can write a more general formula:
  - Memory-stall clock cycles = (memory accesses/program) x (miss rate) x (cache miss penalty)
- For example, in homework problem 7.27, the cache miss penalty is given by the formula 6 + (block size in words) cycles.
- An example (page 565) =>
28 Performance Considerations (cont.)
- Assume an instruction cache miss rate for gcc of 2% and a data cache miss rate of 4%. If a machine has a CPI of 2 without any memory stalls (i.e., the ideal case of no cache misses), and the miss penalty is 40 cycles for all misses, determine how much faster a machine would run with a perfect cache that never misses. Use instruction frequencies from page 311 of the text.
- For instruction count I: instruction miss cycles = I x 2% x 40 = 0.80I.
- Data miss cycles = I x 35% x 4% x 40 = 0.56I, where the frequency of instructions doing memory references is 35% from page 311.
- Total memory stall cycles = 0.80I + 0.56I = 1.36I => more than 1 cycle of memory stall per instruction.
- The CPI with stalls = 2 + 1.36 = 3.36.
- Thus (CPU time with stalls)/(CPU time for perfect cache) = (I x CPIstall x clock cycle time)/(I x CPIperfect x clock cycle time) = CPIstall/CPIperfect = 3.36/2 = 1.68 - the ideal case is 68% faster. (The calculation is repeated in the sketch below.)
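The page-565 example can be redone as a short calculation (the 2%, 4%, 35%, and 40-cycle figures are just the inputs stated above):

```python
# Recompute the gcc example: CPI with memory stalls vs. a perfect cache.

base_cpi = 2.0            # CPI with no memory stalls
i_miss_rate = 0.02        # instruction cache miss rate
d_miss_rate = 0.04        # data cache miss rate
mem_ref_freq = 0.35       # fraction of instructions that access data memory
miss_penalty = 40         # cycles

i_stalls = i_miss_rate * miss_penalty                  # 0.80 cycles/instr
d_stalls = mem_ref_freq * d_miss_rate * miss_penalty   # 0.56 cycles/instr
stall_cpi = i_stalls + d_stalls                        # 1.36

cpi_with_stalls = base_cpi + stall_cpi                 # 3.36
speedup_of_perfect_cache = cpi_with_stalls / base_cpi  # 1.68

print(f"stall cycles per instruction = {stall_cpi:.2f}")
print(f"CPI with stalls = {cpi_with_stalls:.2f}")
print(f"perfect cache is {speedup_of_perfect_cache:.2f}x faster")
```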
29 Performance Considerations (cont.)
- Effects of cache/memory interface options on performance:
  - The cache interacts with memory on cache misses.
  - Goal: minimize block transfer time (maximize bandwidth) and minimize cost.
  - Must deal with tradeoffs.
- Cache and memory communicate over a bus - generally not the main bus.
- Assume memory is implemented in DRAM and the cache in SRAM.
- Miss penalty (MP) is the time (in clock cycles) it takes to transfer a block between memory and cache.
- Bandwidth is the bytes per clock cycle delivered when transferring a block.
- Example: assume 1 clock cycle to send the address to memory (just the initial address is needed), 15 clock cycles for each DRAM access initiated (effective access time), and 1 clock cycle to send a word to the cache.
- Bandwidth = (bytes per block)/(miss penalty in cycles).
- See Fig. 7.13 for three cases =>
30 Bandwidth Example
One-word-wide memory: MP = 1 + 4x15 + 4x1 = 65 cycles; BW = (4x4)/65 = 0.25. Read one word from memory at a time and deliver it to the cache one word at a time.
Wide (4-word) memory: MP = 1 + 1x15 + 1x1 = 17 cycles; BW = (4x4)/17 = 0.94. Read 4 words in parallel and deliver them to the cache 4 words at a time.
Interleaved memory (4 one-word-wide banks): MP = 1 + 1x15 + 4x1 = 20 cycles; BW = (4x4)/20 = 0.8. Read 4 words in parallel and deliver them to the cache one word at a time.
Fig. 7.13 (a sketch of these calculations follows below)
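A sketch reproducing the Fig. 7.13 numbers, assuming a 4-word block and the 1 / 15 / 1 cycle costs from the previous slide:

```python
# Miss penalty and bandwidth (bytes per cycle) for the three memory
# organizations of Fig. 7.13, using the 1 + 15 + 1 cycle assumptions above.

WORDS_PER_BLOCK = 4
BYTES_PER_WORD = 4
ADDR_CYCLES = 1        # send the address to memory
ACCESS_CYCLES = 15     # per DRAM access initiated
XFER_CYCLES = 1        # send one word (or one wide transfer) to the cache

def miss_penalty(accesses: int, transfers: int) -> int:
    return ADDR_CYCLES + accesses * ACCESS_CYCLES + transfers * XFER_CYCLES

cases = {
    "one-word-wide memory":         miss_penalty(4, 4),   # 65 cycles
    "wide (4-word) memory":         miss_penalty(1, 1),   # 17 cycles
    "interleaved (4-bank) memory":  miss_penalty(1, 4),   # 20 cycles
}

for name, mp in cases.items():
    bw = WORDS_PER_BLOCK * BYTES_PER_WORD / mp
    print(f"{name}: MP = {mp} cycles, BW = {bw:.2f} bytes/cycle")
```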
31 Now Comes Amdahl's Law!
- Summary: CPIstall/CPIperfect = 3.36/2 = 1.68 - the ideal case is 68% faster.
- Let's speed up the processor:
  - The amount of time spent on memory stalls will take up an increasing fraction of the execution time.
- Example: speed up the CPU by reducing the CPI from 2 to 1 without changing the clock rate.
  - The CPI with cache misses is now 1 + 1.36 = 2.36; the perfect cache is now 2.36/1 = 2.36 times faster instead of 1.68 times faster.
  - The fraction of execution time spent on memory stalls would have risen
    - from 1.36/3.36 = 41% (slow CPU)
    - to 1.36/2.36 = 58% (fast CPU)!
- A similar situation arises when increasing the clock rate without changing the memory system. (See the small check below.)
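The shift in the stall fraction can be checked in a couple of lines (using the same 1.36 stall cycles per instruction as above):

```python
# Fraction of execution time spent in memory stalls, before and after
# improving the base CPI from 2 to 1 (memory system unchanged).

stall_cpi = 1.36
for base_cpi in (2.0, 1.0):
    total_cpi = base_cpi + stall_cpi
    print(f"base CPI {base_cpi}: stall fraction = {stall_cpi / total_cpi:.1%}")
# prints about 40.5% and 57.6%; the slide rounds these to 41% and 58%
```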
32 The Bottom Line
- Relative cache penalties increase as the machine becomes faster; thus if a machine improves both clock rate and CPI, it suffers a double hit.
- The lower the CPI, the more pronounced the impact of stall cycles.
- If the main memories of two machines have the same absolute access times, the higher CPU clock rate leads to a larger miss penalty (in cycles).
- Bottom line: put the improvement where it is needed - improve the memory system, not just the CPU.
33 More Flexible Placement of Blocks: Set Associative
- Up to now we used direct mapping for block placement in the cache:
  - Only one place to put a block in the cache.
  - Finding a block is easy and fast.
  - Simply address it directly with the low-order block number bits.
- The other extreme is fully associative:
  - A block can be placed anywhere in the cache.
  - Finding a block is now more complicated - must search the cache looking for a match on the tag.
  - In order to keep performance high, we do the search in hardware (see later), at a cost tradeoff.
- Let us look at schemes between these two extremes.
34 Set Associative Block Placement
- There is now a fixed number of locations (at least two) where a block can be placed.
- For n locations it is called an n-way set-associative cache.
- An n-way set-associative cache consists of a number of sets, each having n blocks.
- Each block in memory maps to a unique set in the cache using the index field (the low-order modulo bits).
- Recall that in a direct-mapped cache, the position of a memory block was given by (block number) mod (number of cache blocks) - the low-order block bits.
- Note that the number of cache blocks in this case is the same as the number of sets - one block per set.
- In a general set-associative cache, the set number containing the desired memory block is given by (block number) mod (number of sets in the cache). Again these are the low-order block bits.
- See diagram =>
35 Set Associative Block Placement (cont.)
Direct mapped: 8 sets, 8 blocks (1 block/set); block 12 goes to set 12 mod 8 = 4.
Set associative (2-way): 4 sets, 8 blocks (2 blocks/set); block 12 goes to set 12 mod 4 = 0.
Fully associative: 1 set, 8 blocks; block 12 can go anywhere within the one set (set 0).
One tag per block within a set.
Fig. 7.15
The example above uses cache block number 12 decimal = 0xC = 1100 binary. Note that 0xC results when the block offset bits are stripped off.
36 Set Associative Block Placement (cont.)
Definition: the (logical) size or capacity of a cache usually means the amount of real or user data it can hold. The physical size is larger, to account for tags and status bits (such as valid bits) also stored there.
Definitions: associativity = blocks/set. A block is one or more words. The tag is associated with a block within a set.
Fig. 7.16
37 Set Associative Mapping
- For direct mapped, the location of a memory block (of one or more words) is given by: index = (block number) mod (number of cache blocks).
- For a general set-associative cache, the set containing a memory block is given by: index = (block number) mod (number of sets in the cache).
- This is consistent with the direct-mapped definition, since for direct mapped there is one block per set.
- Each block in memory maps to a unique set in the cache, given by the index field.
- The placement of a block within a set is not unique; it depends on a replacement algorithm (example: LRU).
- Must logically search a set to find a particular block, identified by its tag.
- The tag is the high-order bits remaining after the index and the offset into the block are stripped off.
- In the case of fully associative (only one set), there is no index, because there is only one place to index into, i.e., the entire cache.
- The number of bits in the index field is determined by the size of the cache in units of sets. The size of a set is the block size x the associativity of the set.
38 Set Associative Mapping (cont.)
Address fields: <------- block number -------> | block offset
index = (block number) mod (number of sets in the cache)
block size = 2^(number of bits in block offset) bytes
number of bits in index = log2(number of sets in cache)
The number of bits in the index is directly a function of the size of the cache and the associativity of the sets:
number of bits in index = log2((size of cache)/(size of set)) = log2((size of cache)/((associativity of set) x (size of block)))
Consistent units must be used in the calculations (bytes, words, etc.). "Size of cache" means the amount of real data it holds; it does not account for validity bits and tags also stored. The associativity of a set is defined as the number of blocks in a set. (A small helper evaluating this formula appears below.)
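A small helper that evaluates this index-bit formula (a sketch; sizes are in bytes here, and the example values match the 4-way cache on the next slide):

```python
import math

# Number of index bits = log2( cache size / (associativity * block size) ),
# i.e. log2(number of sets).  All sizes must be in the same units (bytes here).

def index_bits(cache_size: int, block_size: int, associativity: int) -> int:
    sets = cache_size // (associativity * block_size)
    return int(math.log2(sets))

# Example matching the 4-way, 1-word-block, 4 KB cache on the next slide:
print(index_bits(cache_size=4096, block_size=4, associativity=4))   # -> 8
```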
39 Set Associative Block Placement (cont.): Example, 4-Way Set Associative
Block address fields: tag | index | block offset (2 bits). Block size = 1 word = 4 bytes.
bits in index = log2(sets in cache) = log2((cache size)/(set size)) = log2(1024/4) = log2(256) = 8
Cache size = sets x size of set = 256 x 4 words = 1024 words = 4 Kbytes
Fig. 7.19
40 Set Associative Mapping, an Example (pp. 571-572)
Cache size = 4 words = 4 blocks. Sequence of block numbers: 0, 8, 0, 6, 8.
CASE 1: direct mapping - one block per set, thus sets in cache = 4.
index bits = log2(sets in cache) = log2((cache size)/(set size)) = log2(4/1) = 2 bits in the index.
The tag must have 32 - 2 bits (index) - 2 bits (block offset) = 28 bits.
Address fields: tag = bits 4-31, index = bits 2-3, block offset = bits 0-1.
block 0 => set index = (0 mod 4) = 0
block 6 => set index = (6 mod 4) = 2
block 8 => set index = (8 mod 4) = 0
Total of 5 misses - see text.
41 Set Associative Mapping, an Example (pp. 571-572)
Cache size = 4 words = 4 blocks. Sequence of block numbers: 0, 8, 0, 6, 8.
CASE 2: 2-way associative mapping - two blocks per set, thus sets in cache = 4 blocks/(2 blocks/set) = two sets in the cache.
bits in index = log2(sets in cache) = log2(2) = 1 bit in the index.
The tag must have 32 - 1 bit (index) - 2 bits (block offset) = 29 bits.
Address fields: tag = bits 3-31, index = bit 2, block offset = bits 0-1.
block 0 => set index = (0 mod 2) = 0
block 6 => set index = (6 mod 2) = 0
block 8 => set index = (8 mod 2) = 0
Total of 4 misses - see text. (LRU is used for replacement.)
42 Set Associative Mapping, an Example (pp. 571-572)
Cache size = 4 words = 4 blocks. Sequence of block numbers: 0, 8, 0, 6, 8.
CASE 3: fully associative mapping - 1 set, 4 blocks per set, thus sets in cache = 4 blocks/(4 blocks/set) = one set in the cache.
bits in index = log2(sets in cache) = log2(1) = 0 bits in the index.
The tag must have 32 - 0 bits (index) - 2 bits (block offset) = 30 bits.
Address fields: tag = bits 2-31, index = none, block offset = bits 0-1.
block 0 => set index = (0 mod 1) = 0
block 6 => set index = (6 mod 1) = 0
block 8 => set index = (8 mod 1) = 0
Total of 3 misses - see text. (LRU is used for replacement.)
All three cases are replayed in the small simulation sketch below.
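The three cases (pp. 571-572) can be replayed with a tiny per-set LRU simulator; a sketch, not the text's figures, assuming the same 4-block cache and the reference string 0, 8, 0, 6, 8:

```python
# Count misses for the block-number sequence 0, 8, 0, 6, 8 in a 4-block cache
# under direct-mapped, 2-way set-associative, and fully associative placement.
# Replacement within a set is LRU, as assumed in the text's example.

def count_misses(refs, num_blocks, ways):
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]        # each set holds blocks in LRU order
    misses = 0
    for block in refs:
        s = sets[block % num_sets]
        if block in s:
            s.remove(block)                     # hit: refresh its LRU position
        else:
            misses += 1                         # miss: evict the LRU block if full
            if len(s) == ways:
                s.pop(0)
        s.append(block)                         # most recently used at the end
    return misses

refs = [0, 8, 0, 6, 8]
for ways, name in [(1, "direct mapped"), (2, "2-way"), (4, "fully associative")]:
    print(f"{name}: {count_misses(refs, 4, ways)} misses")
# expected: 5, 4, and 3 misses respectively, as in the text
```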
43 Virtual Memory
- An extension of the concepts used for cache design (or maybe vice versa?).
- Key differences:
  - The "cache" is now main memory.
  - The backing store is now a hard drive.
  - The size of the logical space is now orders of magnitude larger than in a cache scheme.
  - The hardware search on the tag, used in locating a block in a cache, is no longer feasible in a VM scheme.
- A fully associative scheme is used in order to get the highest hit ratio:
  - No restrictions on where the block (page) goes.
  - Dragging along a tag is prohibitive in space usage.
  - Searching for a tag is also not practical - too many of them.
- Thus software is used, and addresses are mapped using a page table.
- Sometimes the PT is cached using a TLB, which works like a cache.
44 Virtual Memory Mapping
Virtual page
Fig. 7.23
45 Virtual Memory Mapping
Virtual address
Fig. 7.21
46 Virtual Memory Mapping
Fig. 7.22
47 Page Faults
- Page faults
  - If the page referenced by the virtual address is not in memory, we have a page fault.
  - A page fault is detected by checking the valid bit in the page table entry indexed by the virtual page number.
  - If the valid bit is off, we have a page fault.
- Response to a page fault
  - An exception is raised, and the OS takes over to fetch the desired page from disk.
  - Must keep track of the disk address to do this.
  - Disk addresses are typically kept in the page table, but there are other schemes as well (p. 586).
  - If memory has an unused page frame, then load the new page there.
  - If memory is full, choose an existing valid memory page to be replaced; if this page is modified, the disk must be updated.
48 Page Faults (cont.)
- How do we locate the page to be brought in from the disk?
  - Need a data structure similar to the page table to map a virtual page number to a disk address.
  - One way is to keep disk addresses in the page table along with real page numbers; other schemes may be used as well.
  - Used for reading from as well as writing to disk.
- Where do you put the page from disk if memory is full?
  - Replace some existing page in memory which is least likely to be needed in the future.
  - The Least Recently Used (LRU) algorithm is commonly used.
  - LRU in its exact form is costly to implement:
    - LRU status updates must be made on each reference.
    - At least from a logical point of view, an LRU stack must be managed.
  - A number of LRU approximations are possible which are more realistic to implement - for example, use reference bit(s) and replace only pages whose reference bits are off (a sketch follows below).
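A hedged sketch of one common reference-bit approximation of LRU (a "clock" / second-chance style scheme; the class and method names are illustrative, not from the text):

```python
# Illustrative reference-bit ("clock" / second-chance) approximation of LRU:
# replace only pages whose reference bit is off, clearing bits as we scan.

class ClockReplacer:
    def __init__(self, num_frames):
        self.frames = [None] * num_frames     # page stored in each frame
        self.ref = [False] * num_frames       # reference bit per frame
        self.hand = 0                         # clock hand

    def touch(self, frame):
        """Called on every reference to a resident page."""
        self.ref[frame] = True

    def choose_victim(self):
        """Pick a frame whose reference bit is off, clearing bits as we pass."""
        while True:
            if not self.ref[self.hand]:
                victim = self.hand
                self.hand = (self.hand + 1) % len(self.frames)
                return victim
            self.ref[self.hand] = False       # give the page a second chance
            self.hand = (self.hand + 1) % len(self.frames)

if __name__ == "__main__":
    r = ClockReplacer(3)
    r.frames = ["A", "B", "C"]
    r.touch(0); r.touch(2)                    # A and C recently referenced
    print("evict frame", r.choose_victim())   # frame 1 (page B)
```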
49 Writes to the Memory System
- What about writes to the cache?
  - Must keep main memory in step with the cache.
  - Use write-through or a write buffer to memory for cache writes.
  - A write buffer hides the latency of writing to memory.
  - Main memory is updated at the time of the write.
- What about writes to memory (in a VM scheme)?
  - Must keep the disk backing store in step with memory.
  - Disk access is too slow to use write-through.
  - Use lazy evaluation and do updates only on a replacement:
    - Write back only when the page is to be replaced, or when the process owning the page table ends - this minimizes disk accesses.
    - Keep track of modified pages with a dirty bit in the page table.
50 Virtual Memory Mapping (cont.): Houston, We Have a Problem!
- We have both a space and a time problem.
- Space would be bad enough, but time also!
- Page tables can be huge:
  - If the address is 32 bits and the page size is 2K, then there are 2^21 (about 2 million) entries; at, say, 4 bytes per entry that is 8 megabytes!
  - To make matters worse, each process has a page table.
- Memory access time is doubled:
  - Even for a page hit, we must first access the page table stored in memory, and then get the desired data.
  - Two memory accesses to get the desired data in the most ideal situation (a page hit).
51 Two-Level Paging Example
- One solution to the large page table problem: paging the page table (distinct from the paged segment table on p. 588; see other solutions on pp. 587-588).
- A logical address (on a 32-bit machine with a 4K page size) is divided into:
  - a page number consisting of 20 bits.
  - a page offset consisting of 12 bits.
- Since the page table is paged, the page number is further divided into:
  - a 10-bit page number.
  - a 10-bit page offset.
- Thus, a logical address is as follows, where p1 is an index into the outer page table, and p2 is the displacement within the page of the outer page table:
  <---------- page number ----------><-- page offset -->
  |   p1 (10 bits)  |   p2 (10 bits) |    d (12 bits)   |
(A small sketch of this address split follows below.)
From Silberschatz et al., Operating System Concepts, 6th ed. Colored changes by Guydosh.
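The 10/10/12 split written out as address arithmetic (a sketch; the example address is arbitrary):

```python
# Split a 32-bit virtual address into outer page-table index (p1),
# inner page-table index (p2), and page offset (d): 10 / 10 / 12 bits.

P1_BITS, P2_BITS, OFFSET_BITS = 10, 10, 12

def split_address(va: int):
    d  = va & ((1 << OFFSET_BITS) - 1)
    p2 = (va >> OFFSET_BITS) & ((1 << P2_BITS) - 1)
    p1 = va >> (OFFSET_BITS + P2_BITS)
    return p1, p2, d

if __name__ == "__main__":
    va = 0x1234ABCD
    p1, p2, d = split_address(va)
    print(f"p1 = {p1:#x}, p2 = {p2:#x}, d = {d:#x}")
```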
52 Two-Level Page-Table Scheme
One solution to the large page table problem: page the page table.
For a 32-bit address and a 4K page, the non-paged PT would have 1M entries; at 4 bytes/entry, the PT would take up 4 MB! Solution: demand-page in only those blocks (pages) of the page table that are needed.
Outer page table (1st level): non-pageable, i.e., pinned in memory.
2nd-level page table: each block (page) is demand-paged into memory. Only a fraction of all pages of this table is resident in memory.
From Silberschatz et al., Operating System Concepts, 6th ed. Colored changes by Guydosh.
53 Two-Level Page-Table Scheme
- Address-translation scheme for a two-level 32-bit paging architecture.
Outer page table (1st level): non-pageable, i.e., pinned in memory.
2nd-level page table: each block (page) is demand-paged into memory. Only a fraction of all pages of this table is resident in memory.
Main memory: where the data is.
From Silberschatz et al., Operating System Concepts, 6th ed. Colored changes by Guydosh.
54 The TLB: Preventing the Doubling of Memory Access Time by the Page Table
- Use a Translation Lookaside Buffer - the TLB:
  - A cache for the page table - distinct from the main memory cache already studied.
  - Holds only page table mappings (entries).
  - Uses the principle of locality of references in the page table.
  - Contains the most used page table entries - a working set.
  - Most of the time we should find the desired address translation in the TLB.
- The TLB uses all the principles we already discussed for general memory caches: associativity, etc.
- When a TLB miss occurs, we must determine whether this is a true page fault, or whether we only need a TLB update from the page table.
  - If the latter, then process the reference using the page table and copy the PT entry to the TLB, with possible replacement.
  - If a dirty bit is on in the TLB entry being replaced, then the corresponding entry in the PT must be updated. (A small sketch of this lookup flow follows below.)
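A tiny software model of the TLB-then-page-table flow just described (a sketch; the dictionaries and field layout are assumptions, not the DECstation hardware):

```python
# Minimal model of a TLB lookup backed by a page table: on a TLB miss,
# consult the page table; a true page fault occurs only if the page table
# entry is itself invalid (the page is not in memory).

tlb = {}            # virtual page number -> physical frame number (the "cache")
page_table = {}     # virtual page number -> (valid, frame number)

def translate(vpn: int) -> int:
    if vpn in tlb:
        return tlb[vpn]                       # TLB hit
    valid, frame = page_table.get(vpn, (False, None))
    if not valid:
        raise RuntimeError("page fault: OS must bring the page in from disk")
    tlb[vpn] = frame                          # refill the TLB from the page table
    return frame

if __name__ == "__main__":
    page_table[7] = (True, 42)
    print(translate(7))   # TLB miss, PT hit -> 42, and the TLB is refilled
    print(translate(7))   # TLB hit -> 42
```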
55 TLB and PT Relationship
TLB and page table (see figure)
Fig. 7.24
56 TLB Integration With the Data Cache
The TLB is fully associative; the cache is direct mapped with 1 word/block.
Steps for hits: the VA is translated by the TLB and the physical address (PA) is constructed; the PA is sent to the cache and split into cache tag and index; data is retrieved as described previously.
Fig. 7.25 (DECstation 3100)
57 Read/Write Process for the TLB/Cache System
Fig. 7.25 (DECstation 3100)
58 Common Framework for All Memory Hierarchies
- We studied three concepts that may look different on the surface:
  - Caches, TLBs, and virtual memory.
- All three ideas rely on the fundamental principles of spatial and temporal locality to work.
- All three ideas deal with four fundamental questions:
  - Q1: Where can a block be placed? A1: One and only one place - direct-mapped cache. A few places - set-associative cache. Any place - fully associative cache or VM (with table lookup instead of search).
  - Q2: How is a block found? A2: Indexing, as in direct mapped. Limited search, as in set associative. Full search, as in a fully associative cache. A separate lookup table, as in a page table in VM.
  - Q3: What block is replaced on a miss? A3: Must choose an algorithm for keeping track of the reference behavior of the resident blocks or pages. Typically some variation of LRU, or random.
  - Q4: How are writes handled? A4: Write-through - with buffers to hide the write-through latency - or lazy evaluation (write-back, updating only on replacement) if the backing store is extremely slow, as in VM.
59 Common Framework for All Memory Hierarchies (cont.)
- Key relationship between the VM/page-table approach and the cache approach:
  - Both VM and a fully associative cache do not restrict where a page could go. The real memory in a VM scheme resembles a fully associative cache.
  - In a true fully associative cache we must search the cache (in hardware) for the tag (block number) - the tag must be stored along with the data.
  - In VM we only store the data in memory, and use a separate indexed page-table lookup to find the physical address of a page. Searching real memory for a page would be impractical because of its large size, and the space wasted on tags would be prohibitive.
  - This is OK for caches because they are small.
- Some generalizations of the meaning of memory hierarchy misses - the 3 Cs:
  - Compulsory misses: initial loading of a cache - cold start.
  - Capacity misses: the cache cannot contain the entire working set, or the blocks needed for execution of the program - replacement is needed - danger of thrashing.
  - Conflict misses: cache misses in a set-associative (not fully associative) or direct-mapped cache. Multiple blocks compete for the same set and the set is full, but there may be empty blocks in other sets which are off limits to use. This cannot happen in a fully associative cache.
60 Other Memory Hierarchy Topics
- Memory performance and bandwidth issues and tradeoffs - see section 7.2.
- Protection in virtual memory - see pp. 596-598:
  - Critical due to the multiprogramming environment.
  - Each process has its own virtual address space.
  - Must keep processes from interfering with each other's address spaces, and keep them out of the operating system's space.
  - Perhaps provide some reading ability (read only) outside of the process space, but no writing.
  - Sharing memory spaces for functional reasons must be totally under the control of the OS.
  - Page tables must be protected - put them in kernel space.
  - Maintain user and kernel modes.
  - If you want privileges outside of your own space, you must execute a system call - allow the OS to control the privilege (example: shared memory).