Memory Hierarchies Chapter 7 - PowerPoint PPT Presentation


1
Memory Hierarchies: Chapter 7
  • N. Guydosh
  • 4/18/04

2
The Basics
  • Fake out the program (application) into thinking
    it has a massive high speed memory - limited only
    by the address space.
  • The appearance of a massive high speed memory
    system is an illusion - but the performance
    enhancement is real.
  • Will consider three levels of memory media
  • Cache
  • Main Memory
  • Disk
  • Principle of locality (pp. 540-541): Programs
    access only a very small part of the address
    space at any instant of time
  • Temporal locality: If an item is referenced, it
    will probably be referenced again soon.
  • Spatial locality: If an item is referenced, items
    whose addresses are close by will tend to be
    referenced soon

3
Memory Hierarchy
  • Organize memory into levels according to cost and
    speed factors
  • The higher the level the faster, smaller, and
    more expensive (per bit) the memory is
  • The lower the level the slower, larger, and
    cheaper (per bit) the memory becomes
  • The highest level is directly accessed by the
    microprocessor. ... The cache
  • The levels to be considered are:
  • SRAM ... super fast memory - expensive
  • DRAM ... ordinary RAM memory - minimum speed - cheap
  • DISK (DASD) ... very slow, very cheap

4
Memory Hierarchy Levels
Cache
Main memory
Disk
Fig. 7.1
5
Data Transfer Between Levels
  • Block
  • The minimum unit of data that can be present or
    not present in a two level memory (cache system).
    It can be as small as a word or may be multiple
    words.
  • This is the basic unit of data transfer between
    the cache and main memory
  • At lower levels of memory, blocks are also known
    as pages

6
Data Transfer Between Levels
blocks (the unit of transfer between levels)
Fig. 7.2
7
The Basic Idea
  • The data needed by the processor is loaded into
    the cache on demand
  • If the data is present in the cache, the
    processor uses it directly. This is called a hit
  • Hit rate or ratio
  • The fraction (or percent) of the time the data is
    found at the high level of memory without
    incurring a miss.
  • Miss rate or ratio
  • 1 - hit rate (or 100 - hit rate in percent)

8
The Basic Idea (cont.)
  • Hit time: The time it takes to access an upper
    level of a memory hierarchy (the cache, for two
    levels)
  • Miss penalty: The time it takes to replace a block
    in an upper level of a hierarchy with a
    corresponding block from a lower level
  • A miss in a cache is serviced by hardware
  • A miss at a level below the cache (e.g., main
    memory) is serviced in software by the operating
    system (must do I/O at the disk level).

9
Why the Idea Works
  • Fundamental phenomena or properties which make
    this scheme work:
  • Hit ratios must be, and generally are, very high
  • High hit ratios are a result of the locality
    principle described above
  • Success is statistical and could never be
    deterministically guaranteed. A good (or bad?)
    programmer could always write a program which
    would kill a memory hierarchy ... cause thrashing
  • Memory hierarchies must be tuned via level sizes
    and block (page) sizes to get optimal performance

10
The Basics of Caches
  • Simplest type: single word blocks
  • On start up the cache is empty, and as references
    are made, it fills up via block misses and
    transfers from RAM
  • Once it fills, the locality principle takes over
    and there are very few misses.
  • Question
  • Where are the blocks put in the cache and how are
    they found and related to main memory blocks?
  • There must be a mapping scheme between cache and
    main memory

11
The Basics of Caches
  • Direct mapping (assuming addresses are block
    numbers)
  • A word from memory can map into one and only one
    place in the cache.
  • The cache location is directly derived from the
    main memory address.
  • Direct mapping rule: the cache address of a block
    (block number) is the main memory block number
    modulo the cache size in units of blocks
  • For example, if the main memory block number is
    21 decimal and the cache size is 8 blocks, then the
    cache address is 21 mod 8 = 5 decimal.
  • If we keep the size of the cache (in blocks) a
    power of 2, then the cache address is directly
    given by the log2(cache size in blocks) low
    order bits. Ex: 21 (base 10) = 010101 (base 2);
    log2(8) = 3; the three low order bits are 101 = 5
    decimal. (A sketch of this calculation follows
    below.)
  • Note that the memory → cache mapping is many to one
    (see p. 546, Fig. 7.5)
  • Preview: if this is direct mapping, then there
    must be a non-direct mapping - set associative,
    with more than one place to put a block. See later.
  • Note: a block number is a byte address which has the
    low order bits (designating the word within a
    block, and the byte within a word) stripped off.
    If a block were a byte, then these addresses
    would be complete.
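A quick check of the modulo rule in code - a minimal
sketch (the helper name is illustrative, not from the
text), assuming a power-of-2 cache size so the modulo
reduces to keeping the low order bits:

    def direct_mapped_index(block_number, cache_blocks):
        # With cache_blocks a power of 2, block_number mod cache_blocks
        # is just the low log2(cache_blocks) bits of the block number.
        assert cache_blocks & (cache_blocks - 1) == 0, "cache size must be a power of 2"
        return block_number & (cache_blocks - 1)  # same as block_number % cache_blocks

    # The slide's example: memory block 21 in an 8-block cache.
    print(direct_mapped_index(21, 8))  # 5 (the low 3 bits, 101, of 010101)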

12
Memory Cache Mapping: Special Case of One Word Per
Block (Piano Keys)
Every block (= word) in memory maps into a
unique location in cache. The two low order bits
(byte within a word) are stripped off.
Fig. 7.5
13
Memory Cache Mapping (cont.)
  • Questions:
  • Because each cache location can contain the
    contents of a number of different memory
    locations, how do we know whether the data in the
    cache corresponds to a requested word, i.e., the
    block containing this word?
  • How do we know if a requested word is in the
    cache or not?
  • The answer is in the contents of a cache entry
    (blowing in the wind):
  • 32-bit data field - the desired raw data
  • Tag field - the high order address bits remaining
    after the modulo (low order cache) bits are stripped
    out; this identifies the entry with a unique memory
    block.
  • A one-bit valid field (validity bit). The valid
    bit is turned on if a block has been moved into
    the cache on demand. If the valid bit is on, the
    tag and data fields are valid

14
How a Block is Found
  • The low order address bits of the block number
    (log2 of the cache size in blocks) from the main
    memory address are the index into the cache.
  • If the valid bit is on, the tag is compared with
    the corresponding field in the main memory
    address.
  • If it compares, we have a hit
  • If it does not compare, or if the valid bit is
    off, we have a miss and the hardware copies the
    desired block from main memory to this location
    in cache (see the sketch below)
  • Whatever was at this cache location gets overlaid.
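Putting the entry fields and lookup steps together: a
minimal direct-mapped lookup sketch (structure and
function names are illustrative; real caches do this in
hardware). On a miss it overlays whatever occupied the
slot, as described above.

    class CacheEntry:
        def __init__(self):
            self.valid = False  # validity bit
            self.tag = 0        # high order address bits
            self.data = 0       # the 32-bit data word

    def lookup(cache, block_number, index_bits, read_block_from_memory):
        index = block_number & ((1 << index_bits) - 1)  # low order bits index the cache
        tag = block_number >> index_bits                # remaining high order bits
        entry = cache[index]
        if entry.valid and entry.tag == tag:
            return entry.data, True                     # hit
        # Miss: copy the desired block from main memory, overlaying the old entry.
        entry.data = read_block_from_memory(block_number)
        entry.tag = tag
        entry.valid = True
        return entry.data, False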

15
How a Block is Found
One word (4 bytes) per block
Byte offset within a word: bits 0, 1
Block (frame) number within cache: bits 2 ... 11
Tag: bits 12, 13, ..., 31 = block number in logical
space (compare with the page table in VM)
The cache by definition is of size 2^10 blocks;
contrast to main memory in the VM scheme, where
memory size is arbitrary
Data is 32 bits; an entry has 32+20+1 = 53 bits
Emphasis is more on temporal rather than
spatial locality
Fig. 7.7 (similar to the DEC example, Fig. 7.8)
16
Handling a Cache Miss
  • Instruction miss (p. 551)
  • Access main memory at address PC - 4 for the
    desired instruction block (read).
  • Write the memory data in the proper cache
    location (low order bits) and the upper bits in
    the tag field, then turn on the valid bit
  • Restart the instruction execution from the
    beginning; it will now re-fetch and find it in
    the cache.
  • A cache stall occurs by stalling the entire
    machine (rather than only certain instructions as
    in a pipeline stall).

17
Handling a Cache Miss (cont.)
  • Read data miss
  • Similar to an instruction miss - simply stall the
    processor until the cache is updated - simply retain
    the ALU address output for processing the miss (where
    to move memory data in cache).
  • Write miss (see pp. 553-554)
  • If we simply wrote to the data cache without updating
    main memory, then cache and memory would be
    inconsistent
  • Simple solution is to use write-through: index
    into the cache with the low order bits
  • Write the data and tag portion into the block and
    set valid, then write the data word to main memory
    with the entire address. Contrast this later to the
    case of more than one word per block.
  • This method impacts performance - an alternate
    approach is =>

18
Handling a Cache Miss (cont.)
  • Write buffer technique (p. 554)
  • Write data into cache and buffer at the same time
    (the buffer is fast) ... the processor continues
    execution - sort of a lazy evaluation.
  • While the processor proceeds, the buffer data is
    copied to memory
  • When the write to main memory completes, the write
    buffer entry is freed up
  • If the write buffer is full when a write is
    encountered, the processor must stall until a
    buffer position frees up.
  • Problem: even though the writes are generated at
    a rate less than the rate of absorption by main
    memory (on average), bursts of writes can stall
    the processor ... the only remedy is to increase
    the buffer size.
  • The buffer is generally small (< 10 words)
  • A preview of other problems associated with
    caches:
  • In a multiprocessing scheme, each processor
    may have its own cache and there is a common main
    memory
  • Now we have a cache coherency problem
  • Not only must we keep the caches in step with
    main memory, but we must keep them in step with
    each other - more later.

19
Taking Advantage of Spatial Locality
  • Up to now there was essentially no spatial
    locality
  • Block size was too small: the unit of memory
    transfer on the bus is a word
  • The block size was one word
  • A block now consists of multiple contiguous words
    from main memory
  • Need a cache block of size greater than one word
    for spatial locality
  • Load the desired word and its local companions
    into cache
  • A miss always brings in an entire block
  • Assume the number of words per block is a power
    of 2

20
Taking Advantage of Spatial Locality
  • Mapping an address to a multiword cache block
  • Example: block size = 16 bytes => the low 4 bits
    are the byte offset into a block => the low 2 bits
    are the byte offset into a word => bits 2 and 3
    are the word offset into the block. Cache size = 64
    blocks, thus the low 6 bits of the block number are
    the block's location in the cache. What does byte
    address 1202 (decimal) = 0x4B2 map to?
  • Block given by (block address) mod (number of
    cache blocks), where block address (actually the
    block number in logical space) = (byte
    address) / (bytes per block) = floor(1202/16) =
    75 decimal = 0x4B (drop the low 4-bit offset into
    the block). Cache block number = 75 mod 64 = 0x4B mod
    64 = 11 decimal = 001011 binary =
    low 6 bits of the block number.
  • Summary: 1202d = 0x000004B2 = 0000 0000 0000 0000
    0000 0100 1011 0010, split (left to right) into
    tag, cache location, and block offset.
    Remember: the byte address is the block number
    concatenated with the block offset - the book is a
    bit sloppy. The 001011 field is the cache location.
    Also, bits in index = log2(sets in cache) =
    log2((size of cache)/(size of set)) =
    log2(64 blocks / (1 block/set)) = log2(64) = 6. See
    later. (A sketch of this decomposition follows
    below.)
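The decomposition in this example is easy to verify
with a few lines of bit arithmetic - a sketch using the
field widths above (4-bit block offset, 6-bit index):

    byte_address = 1202                # 0x4B2
    block_offset = byte_address & 0xF  # low 4 bits: offset within the 16-byte block
    block_number = byte_address >> 4   # floor(1202/16) = 75 = 0x4B
    cache_index = block_number % 64    # 64-block cache: 75 mod 64 = 11 = 001011b
    tag = block_number >> 6            # remaining high order bits
    print(block_number, cache_index, block_offset)  # 75 11 2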

21
64 KB Cache Using a 16-Byte Block
16-byte blocks, direct mapped (i.e., 4-word blocks,
direct mapped; preview: 1-way associative)
64 KB = 16K words = 4K blocks => 12-bit index into the
cache (the 12 low bits of the block number)
The tag is associated with the block, not the word
Two bits pick off the word within a block
Still direct mapping! See set associative later
Fig. 7.10
22
Taking Advantage of Spatial Locality: Miss Handling
  • Read miss handling
  • Processed the same way as a read miss for a
    single-word block
  • A read miss always brings back the entire block

23
Taking Advantage of Spatial Locality: Miss Handling (cont.)
  • Write miss handling
  • Can't simply write the data and corresponding tag,
    because the block contains more than a single word.
  • When we had one word per block, we simply wrote
    the data and tag into the block, set valid, then
    wrote the data word to main memory.
  • Must now first bring in the correct block from
    memory if the tag mismatches, and then update the
    block using write-through or buffering. If we
    simply wrote the tag and word, we could possibly
    be updating the wrong block (intermixing two
    blocks) - multiple blocks can map to the same
    cache location. See the bottom of page 557.

24
Tuning Performance With Block Size
  • Very small blocks may lose spatial locality (ex.
    1 word/block)
  • Very large blocks may reduce performance if the
    cache is relatively small - competition for space
  • Spatial locality occurs over a limited address
    range - large blocks may bring in data which will
    never get referenced ("dead wood")
  • Miss rate may increase for very large blocks

25
Tuning Performance With Block Size (cont.)
Fig. 7.12 (miss rate vs. block size; one curve per
cache size)
Cache performance covered later
26
Performance Considerations
  • Assume that a cache hit gives normal
    performance; that is, this is our baseline for
    no performance degradation - peak performance.
  • We get performance degradation when a cache miss
    occurs.
  • Recall that a cache stall occurs by stalling
    the entire machine (rather than only certain
    instructions as in a pipeline stall).
  • Memory stall cycles are dead cycles elapsing
    during a cache stall. This consists of:
  • Memory-stall clock cycles = read-stall cycles +
    write-stall cycles, for example on a per-program
    basis, where
  • Read-stall cycles = (reads/program) x (read miss
    rate) x (read miss penalty), where the read miss
    penalty is in cycles, and may be given by some
    formula involving, say, block size.
  • Write-stall cycles = (writes/program) x (write
    miss rate) x (write miss penalty) + write
    buffer stalls, where write buffer stalls accounts
    for the case where a buffer is used to update
    main memory when a cache write occurs. If write
    buffer stalls are a significant part of the
    equation, it probably means this is a bad design!
    We shall assume a good design, where the buffer
    is deep enough for this to be an insignificant
    term

27
Performance Considerations (cont.)
  • Assuming that the read and write miss penalties are
    the same, and that we can neglect write buffer
    stalls, we can write a more general formula:
  • Memory-stall clock cycles = (memory
    accesses/program) x (miss rate) x (cache miss
    penalty)
  • For example, in homework problem 7.27 the cache
    miss penalty is given by the formula 6 + (block
    size in words) cycles.
  • An example (page 565) =>

28
Performance Considerations (cont.)
  • Assuming an instruction cache miss rate for gcc
    of 2 and a data cache miss rate of 4. If a
    machine has a CPI of 2 without any memory stalls
    (ie., ideal case of no cache misses), and the
    miss penalty is 40 cycles for all misses,
    determine how much faster a machine would run
    with a perfect cache that never misses. Use
    instruction frequencies from page 311 of text.
  • For instruction count Iinstruction miss cycles
    I x 2x 40 0.80I
  • Data miss cycles I x 36 x 4 x 40
    0.56Iwhere the frequency of instructions doing
    memory references is 35 from page 311
  • Total memory stall cycles 0.80I 0.56I
    1.36I gt 1 cycle of memory stall per instruction.
  • The CPI with stalls 2 1.36 3.36
  • Thus (CPU time with stalls)/(CPU time for
    perfect cache) IxCPIstall x (clock cycle
    time)/I x CPIperfect x (clock cycle time)
    CPIstall / CPIperfect 3.36/2 1.68 ideal
    case is 68 better.
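The arithmetic of this example can be reproduced in a
few lines - a sketch using the slide's figures (2%
instruction and 4% data miss rates, 35% load/store
frequency, 40-cycle penalty):

    cpi_perfect = 2.0
    miss_penalty = 40
    instr_stall = 0.02 * miss_penalty          # 0.80 cycles per instruction
    data_stall = 0.35 * 0.04 * miss_penalty    # 0.56 cycles per instruction
    cpi_stall = cpi_perfect + instr_stall + data_stall
    print(cpi_stall, cpi_stall / cpi_perfect)  # ~3.36 ~1.68: perfect cache is 68% better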

29
Performance Considerations (cont.)
  • Effects of cache/memory interface options on
    performance
  • Cache interacts with memory on cache misses
  • Goal: minimize block transfer time (maximize
    bandwidth); minimize cost
  • Must deal with tradeoffs
  • Cache and memory communicate over a bus -
    generally not the main bus
  • Assume memory is implemented in DRAM and the cache
    in SRAM
  • Miss penalty (MP) is the time (in clock cycles)
    it takes to transfer a block between memory and cache
  • Bandwidth here is the bytes per clock cycle achieved
    in transferring a block
  • Example: assume 1 clock cycle to send the address
    to memory (just need the initial address);
    15 clock cycles for each DRAM access
    initiated (effective access time); 1 clock
    cycle to send a word to the cache
  • Bandwidth = (bytes per block)/(miss penalty)
  • See Fig. 7.13 for three cases =>

30
Bandwidth Example
(4-word blocks; MP = miss penalty, BW = bytes/cycle)
Interleaved memory (4 banks, one-word bus): read 4
words in parallel and deliver to cache one word at a
time. MP = 1 + 1x15 + 4x1 = 20 cycles;
BW = (4x4)/20 = 0.8
Wide memory and bus (4 words wide): read 4 words in
parallel and deliver to cache 4 words at a time.
MP = 1 + 1x15 + 1x1 = 17 cycles; BW = (4x4)/17 = 0.94
One-word-wide memory: read 1 word from memory and
deliver to cache one word at a time.
MP = 1 + 4x15 + 4x1 = 65 cycles; BW = (4x4)/65 = 0.25
Fig. 7.13
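All three miss penalties follow one pattern: 1 cycle to
send the address, 15 cycles per DRAM access initiated,
and 1 cycle per bus transfer to the cache. A sketch
reproducing the numbers above (the labels are
descriptive, not taken from the figure):

    def miss_penalty(dram_accesses, bus_transfers, addr=1, dram=15, bus=1):
        return addr + dram_accesses * dram + bus_transfers * bus

    block_bytes = 4 * 4  # 4-word block
    cases = [
        ("one-word-wide memory", miss_penalty(4, 4)),               # 65 cycles
        ("wide (4-word) memory and bus", miss_penalty(1, 1)),       # 17 cycles
        ("interleaved (4 banks, 1-word bus)", miss_penalty(1, 4)),  # 20 cycles
    ]
    for label, mp in cases:
        print(label, mp, round(block_bytes / mp, 2))  # BW in bytes/cycle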
31
Now Comes Amdahl's Law!
  • Summary: CPIstall / CPIperfect = 3.36/2 = 1.68 -
    the ideal case is 68% better
  • Let's speed up the processor
  • The amount of time spent on memory stalls will
    take up an increasing fraction of the execution
    time.
  • Example: speed up the CPU by reducing CPI from 2
    to 1 without changing the clock rate.
  • The CPI with cache misses is now 1 + 1.36 = 2.36;
    a perfect cache is now 2.36/1 = 2.36 times faster
    instead of 3.36/2 = 1.68 times faster
  • The amount of execution time spent on memory stalls
    would have risen
  • from 1.36/3.36 = 41% (slow CPU)
  • to 1.36/2.36 = 58% (fast CPU)!
  • A similar situation holds for increasing the clock
    rate without changing the memory system.

32
The Bottom Line
  • Relative cache penalties increase as the machine
    becomes faster thus if a machine improves both
    clock rate and CPI, it suffers a double hit.
  • The lower the CPI, the more pronounced the impact
    of stall cycles.
  • If main memory of two machines have the same
    absolute accesses times, the higher CPU clock
    rate leads to a larger miss penalty.
  • Bottom line put the improvement where it is
    needed improve memory system not CPU.

33
More Flexible Placement of Blocks: Set Associative
  • Up to now we used direct mapping for block
    placement in the cache
  • Only one place to put a block in cache.
  • Finding a block is easy and fast
  • Simple: directly address it with the low order
    block number bits.
  • The other extreme is fully associative
  • A block can be placed anywhere in the cache.
  • Finding a block is now more complicated: must
    search the cache looking for a match on the
    tag.
  • In order to keep performance high, we do the
    search in hardware (see later), at a cost tradeoff
  • Let us look at schemes between these two
    extremes.

34
Set Associative Block Placement
  • There is now a fixed number of locations (at
    least two) where a block can be placed
  • For n locations it is called an n-way set
    associative cache
  • An n-way set associative cache consists of a number
    of sets, each having n blocks
  • Each block in memory maps to a unique set in the
    cache using the index field (low modulo bits).
  • Recall that in a direct mapped cache, the
    position of a memory block was given by (block
    number) mod (number of cache blocks) = low order
    block bits
  • Note that the number of cache blocks in this case
    is the same as the number of sets - one block per
    set.
  • In a general set-associative cache, the set
    number containing the desired memory block is
    given by (block number) mod (number of sets in
    the cache). Again this is the low order block
    bits
  • See diagram =>

35
Set Associative Block Placement (cont.)
Direct mapped: 8 sets of 1 block (8 blocks);
block 12 goes to location 12 mod 8 = 4
Set associative (2-way): 4 sets of 2 blocks (8 blocks);
set = 12 mod 4 = 0
Fully associative: 1 set of 8 blocks (set 0 is the
one set)
Fig. 7.15
One tag per block within a set
Example above uses memory block number 12 decimal =
0xC = 1100 binary. Note 0xC results when the block
offset bits are stripped off.
36
Set Associative Block Placement (cont.)
Definition: the (logical) size or capacity of
a cache usually means the amount of real or
user data it can hold. The physical size is
larger, to account for tags and status bits (such
as valid bits) also stored there.
Definitions: Associativity = blocks/set. A block
is one or more words. The tag is associated with
a block within a set.
Fig. 7.16
37
Set Associative Mapping
  • For direct mapped, the location of a memory block
    (one or more words) is given by: index = (block
    number) mod (number of cache blocks)
  • For a general set associative cache, the set
    containing a memory block is given by: index =
    (block number) mod (number of sets in the cache)
  • This is consistent with the direct mapped
    definition, since for direct mapped there was
    one block per set.
  • Each block in memory maps to a unique set in the
    cache given by the index field.
  • The placement of a block within a set is not
    unique - it depends on a replacement algorithm
    (example: LRU).
  • Must logically search a set to find a particular
    block identified by its tag.
  • The tag is the high order bits remaining after
    the index and the offset into the block are
    stripped off.
  • In the case of fully associative (only one set),
    there is no index, because there is only one place
    to index into, i.e., the entire cache.
  • The number of bits in the index field is
    determined by the size of the cache (in units of
    sets). The size of a set is the block size x the
    associativity of the set.

38
Set Associative Mapping (cont.)
<------- block number ------->
index = (block number) mod (number of sets in
the cache); block size = 2^(number of bits in block
offset) bytes; number of bits in index =
log2(number of sets in cache). The number of bits in
the index is directly a function of the size of the
cache and the associativity of the sets:
number of bits in index = log2((size of
cache)/(size of set)) = log2((size of
cache)/((associativity of set) x (size of block)))
Consistent units must be used in
calculations (bytes, words, etc.). Size of
cache means the amount of real data it holds;
it does not account for validity bits and tags
stored. Assumes the associativity of a set is defined
as the number of blocks in a set.
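The index-width formula translates directly into code -
a minimal sketch (consistent byte units assumed, as the
slide requires):

    from math import log2

    def index_bits(cache_size_bytes, block_size_bytes, associativity):
        # bits in index = log2(number of sets) = log2(cache / (assoc x block))
        set_size = associativity * block_size_bytes
        return int(log2(cache_size_bytes / set_size))

    # Fig. 7.19's case: 4 KB of data, 4-byte (1-word) blocks, 4-way set associative.
    print(index_bits(4096, 4, 4))  # 8 bits -> 256 sets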
39
Set Associative Block Placement (cont.): Example,
4-Way Associative
Block size = 1 word = 4 bytes => block offset = 2 bits;
the rest of the address is tag and index
Bits in index = log2(sets in cache) =
log2((cache size)/(set size)) = log2(1024/4) =
log2(256) = 8
Cache size = sets x size of set =
256 x 4 words = 1024 words = 4 Kbytes
Fig. 7.19
40
Set Associative Mapping: an Example (pp. 571-572)
Cache size = 4 words = 4 blocks. Sequence of block
numbers: 0, 8, 0, 6, 8.
CASE 1: direct mapping = one block per set, thus
sets in cache = 4.
Index bits = log2(sets in cache) = log2((cache
size)/(set size)) = log2(4/1) = 2 bits in the index.
The tag must have 32 - 2 bits of index - 2 bits of
block offset = 28 bits.
Tag: bits 4 ... 31. Index: bits 2, 3. Block offset:
bits 0, 1.
block 0 => set index = (0 mod 4) = 0
block 6 => set index = (6 mod 4) = 2
block 8 => set index = (8 mod 4) = 0
Total of 5 misses - see text.
41
Set Associative Mapping: an Example (pp. 571-572)
Cache size = 4 words = 4 blocks. Sequence of block
numbers: 0, 8, 0, 6, 8.
CASE 2: 2-way associative mapping = two blocks per
set, thus sets in cache = 4 blocks/(2 blocks/set) =
two sets in the cache.
Bits in index = log2(sets in cache) = log2(2) = 1 bit
in the index. The tag must have 32 - 1 bit of index -
2 bits of block offset = 29 bits.
Tag: bits 3 ... 31. Index: bit 2. Block offset:
bits 0, 1.
block 0 => set index = (0 mod 2) = 0
block 6 => set index = (6 mod 2) = 0
block 8 => set index = (8 mod 2) = 0
Total of 4 misses - see text. Use LRU for replacement.
42
Set Associative Mapping: an Example (pp. 571-572)
Cache size = 4 words = 4 blocks. Sequence of block
numbers: 0, 8, 0, 6, 8.
CASE 3: fully associative mapping = 1 set, 4 blocks
per set, thus sets in cache = 4 blocks/(4 blocks/set)
= one set in the cache.
Bits in index = log2(sets in cache) = log2(1) = 0 bits
in the index. The tag must have 32 - 0 bits of index -
2 bits of block offset = 30 bits.
Tag: bits 2 ... 31. Index: none. Block offset:
bits 0, 1.
block 0 => set index = (0 mod 1) = 0
block 6 => set index = (6 mod 1) = 0
block 8 => set index = (8 mod 1) = 0
Total of 3 misses - see text. Use LRU for replacement.
(A small simulator checking all three cases follows
below.)
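All three cases can be checked with a tiny
set-associative simulator with LRU replacement - a
sketch (associativity 1 = direct mapped; associativity
= number of blocks = fully associative):

    def count_misses(block_refs, num_blocks, assoc):
        num_sets = num_blocks // assoc
        sets = [[] for _ in range(num_sets)]  # each set lists its blocks, LRU first
        misses = 0
        for b in block_refs:
            s = sets[b % num_sets]            # set index = block number mod sets
            if b in s:
                s.remove(b)                   # hit: refresh LRU position
            else:
                misses += 1                   # miss: bring the block in
                if len(s) == assoc:
                    s.pop(0)                  # evict the least recently used block
            s.append(b)
        return misses

    refs = [0, 8, 0, 6, 8]
    for assoc in (1, 2, 4):                   # direct, 2-way, fully associative
        print(assoc, count_misses(refs, 4, assoc))  # 5, 4, and 3 misses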
43
Virtual Memory
  • An extension of the concepts used for cache
    design (or maybe vice-versa?)
  • Key differences:
  • The cache is now main memory
  • The backing store is now a hard drive
  • The size of logical space is now orders of
    magnitude larger than in a cache scheme
  • The hardware search on the tag, used for locating
    a block in a cache, is no longer feasible in a VM
    scheme
  • A fully associative scheme is used in order to
    get the highest hit ratio
  • No restrictions on where the block (page) goes
  • Dragging along a tag is prohibitive in space
    usage
  • Searching for a tag is also not practical - too
    many of them
  • Thus software is used and addresses are mapped
    using a page table
  • Sometimes the PT is cached using a TLB, which
    works like a cache

44
Virtual Memory Mapping
Virtual page
Fig. 7.23
45
Virtual Memory Mapping
Virtual address
Fig. 7.21
46
Virtual Memory Mapping
Fig. 7.22
47
Page Faults
  • Page Faults
  • If the page referenced by the virtual address is
    not in memory, we have a page fault.
  • A page fault is detected by checking the valid
    bit in the page table entry indexed by the
    virtual page number.
  • If the valid bit is off, we have a page fault
  • Response to a page fault:
  • An exception is raised, and the OS takes over to
    fetch the desired page from disk.
  • Must keep track of the disk address to do this
  • Disk addresses are typically kept in the page
    table, but there are other schemes also (p. 586)
  • If memory has an unused page frame, then load the
    new page there.
  • If memory is full, choose an existing valid
    memory page to be replaced - if this page is
    modified, disk must be updated.

48
Page Faults (cont.)
  • How do we locate the page to be brought in from
    the disk?
  • Need a data structure similar to the page table
    to map a virtual page number to disk.
  • One way is to keep disk addresses in the page
    table along with real page numbers - other
    schemes may be used also.
  • Used for reading from as well as writing to disk.
  • Where do you put the page from disk if memory is
    full?
  • Replace some existing page in memory which is
    least likely to be needed in the future.
  • The Least Recently Used (LRU) algorithm is
    commonly used.
  • LRU in its exact form is costly to implement.
  • LRU status updates must be made on each
    reference.
  • At least from a logical point of view, must
    manage an LRU stack
  • A number of LRU approximations are possible which
    are more realistic to implement - example: use
    reference bit(s) and replace only pages whose
    reference bits are off (see the sketch below).
49
Writes to the Memory System
  • What about writes to cache?
  • Must keep main memory in step with the cache
  • Use write-through or a write buffer to memory for
    cache writes
  • The write buffer hides the latency of writing to
    memory.
  • Main memory is updated at the time of the write.
  • What about writes to memory?
  • Must keep the disk backing store in step with
    memory
  • Disk access is too slow to use write-through.
  • Use lazy evaluation and do updates only on a
    replacement.
  • Write back only when a page is to be replaced or
    when the process owning the page table ends -
    minimizes disk access.
  • Keep track of modified pages with a dirty bit
    in the page table.

50
Virtual Memory Mapping (cont.): Houston, We Have
a Problem!
  • We have both a space and a time problem
  • Space would be bad enough, but time also!
  • Page tables can be huge.
  • If the address is 32 bits and the page size is
    2K, then there are 2^21 ≈ 2 million entries; at,
    say, 4 bytes per entry, that is 8 megabytes!
  • To make matters worse, each process has a page
    table.
  • Memory access time is doubled
  • Even for a page hit, we must first access
    the page table stored in memory, and then get the
    desired data.
  • Two memory accesses to get the desired data in
    the most ideal situation (a page hit).

51
Two-Level Paging Example
  • One solution to the large page table problem:
    paging the page table - distinct from the paged
    segment table on p. 588; see other solutions
    on pp. 587-588.
  • A logical address (on a 32-bit machine with 4K
    page size) is divided into:
  • a page number consisting of 20 bits.
  • a page offset consisting of 12 bits.
  • Since the page table is paged, the page number is
    further divided into:
  • a 10-bit page number.
  • a 10-bit page offset.
  • Thus, a logical address is as follows, where
    p1 is an index into the outer page table, and p2
    is the displacement within the page of the outer
    page table.

From Silberschatz et al., Operating System
Concepts, 6th ed. Colored changes by Guydosh
Logical address layout: page number = p1 (10 bits),
p2 (10 bits); page offset = d (12 bits)
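Splitting a 32-bit logical address into p1, p2, and d
is simple masking and shifting - a sketch using the
10/10/12 widths above:

    def split_vaddr(vaddr):
        d = vaddr & 0xFFF            # 12-bit offset within the 4K page
        p2 = (vaddr >> 12) & 0x3FF   # 10-bit index into a 2nd-level page table
        p1 = (vaddr >> 22) & 0x3FF   # 10-bit index into the outer page table
        return p1, p2, d

    print(split_vaddr(0x00403004))   # (1, 3, 4): outer entry 1, inner entry 3, offset 4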
52
Two-Level Page-Table Scheme
One solution to the large page table problem:
page the page table.
For a 32-bit address and a 4K page, the
non-paged PT would have 1 meg entries - at 4
bytes/entry, the PT would take
up 4 MB! Solution: demand page in only those
blocks of the page table that are needed.
Outer page table (1st level): non-pageable, i.e.,
pinned in memory =>
2nd level page table => each block (page) is
demand paged into memory. Only a fraction of all
the pages of this table is resident in memory.
From Silberschatz et al., Operating System
Concepts, 6th ed. Colored changes by Guydosh
53
Two-Level Page-Table Scheme
  • Address-translation scheme for a two-level 32-bit
    paging architecture

Outer page table (1st level): non-pageable, i.e.,
pinned in memory
2nd level page table: each block (page) is
demand paged into memory. Only a fraction of all
the pages of this table is resident in memory.
From Silberschatz et al., Operating System
Concepts, 6th ed. Colored changes by Guydosh
Main memory: where the data is
54
The TLB: Preventing Doubling of Memory Access
Time With the Page Table
  • Use a Translation Lookaside Buffer - the TLB
  • A cache for the page table - distinct from the
    main memory cache already studied
  • Holds only page table mappings (entries)
  • Uses the principle of locality of references in
    the page table
  • Contains the most used page table entries - a
    working set.
  • Most of the time we should find the desired
    address translation in the TLB.
  • The TLB uses all the principles we already
    discussed for general memory caches -
    associativity, etc.
  • When a TLB miss occurs, we must determine whether
    this is a true page fault, or whether we only need
    a TLB update from the page table (see the sketch
    below).
  • If the latter, then process the reference using
    the page table and copy the PT entry to the TLB,
    with possible replacement.
  • If the dirty bit is on in the TLB entry being
    replaced, then the corresponding entry in the PT
    must be updated.
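The TLB-miss decision above, sketched in Python
(illustrative data structures, not the actual
hardware/OS interface; the page table entry is assumed
to carry valid and frame fields):

    class PageFault(Exception):
        pass

    def translate(vpn, tlb, page_table):
        if vpn in tlb:
            return tlb[vpn]              # TLB hit: the common case
        entry = page_table[vpn]
        if not entry.valid:
            raise PageFault(vpn)         # true page fault: OS must fetch from disk
        # TLB miss only: copy the PT entry into the TLB (with possible replacement;
        # a replaced dirty TLB entry must be copied back to the page table).
        tlb[vpn] = entry.frame
        return entry.frame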

55
TLB and PT Relationship
<- TLB
Page Table ->
Fig. 7.24
56
TLB Integration With the Data Cache
TLB: fully associative
Virtual address in =>
Steps for hits: the VA is translated by the TLB; the
physical address (PA) is constructed; the PA is sent
to the cache; the PA is split into cache tag and
index; data is retrieved as described previously
Cache: direct mapped, 1 word/block
Fig. 7.25 (DECStation 3100)
57
Read/Write Process for the TLB/Cache System
Fig. 7.25 (DECStation 3100)
58
Common Framework for all Memory Hierarchies
  • We studied three concepts that may look different
    on the surface:
  • Caches, TLBs, and virtual memory
  • All three ideas rely on the fundamental principle
    of spatial and temporal locality to work
  • All three ideas deal with four fundamental
    questions:
  • Q1: Where can a block be placed? A1: One and only
    one place - direct mapped cache. A few
    places - set associative cache. Any place -
    fully associative cache or VM (with table lookup
    instead of search).
  • Q2: How is a block found? A2: Indexing, as in
    direct mapped. Limited search, as in set
    associative. Full search, as in a fully
    associative cache. Separate lookup table, as
    in a page table in VM.
  • Q3: What block is replaced on a miss? A3: Must
    choose an algorithm for keeping track of the
    reference behavior of the resident
    pages. Typically some variation of LRU, or random.
  • Q4: How are writes handled? A4: Write through -
    use buffers in order to hide write-through
    latency - and lazy evaluation (update on
    faults only) if the backing store is extremely
    slow, as in VM.

59
Common Framework for all Memory Hierarchies
(cont.)
  • Key relationship between the VM/page table
    approach and the cache approach:
  • Both VM and a fully associative cache do not
    restrict where a page could go - the real memory
    in a VM scheme resembles a fully associative
    cache.
  • In a true fully associative cache we must search
    the cache (in hardware) for the tag (block
    number) - must store the tag along with the data.
  • In VM we only store the data in memory, and use a
    separate indexed page table lookup to look up
    the physical address of a page - searching real
    memory for a tag would be impractical
    because of its large size, and the waste of space
    for tags would be prohibitive.
  • OK for caches because they are small.
  • Some generalizations of the meaning of memory
    hierarchy misses - the 3 Cs:
  • Compulsory misses: initial loading of a cache -
    cold start
  • Capacity misses: the cache cannot contain the
    entire working set - the blocks needed for
    execution of the program - so replacement is
    needed; danger of thrashing
  • Conflict misses: cache misses in a set
    associative (not fully associative) or direct
    mapped cache. Multiple blocks compete for the same
    set and the set is full, but there may be empty
    blocks in other sets which are off limits to use.
    Cannot happen in fully associative.

60
Other Memory Hierarchy Topics
  • Memory performance and bandwidth issues and
    tradeoffs - see section 7.2
  • Protection in Virtual Memory - see pp. 596-598.
  • Critical due to the multiprogramming environment
  • Each process has its own virtual address space
  • Must keep processes from interfering with each
    other's address spaces, and keep them out of the
    operating system's space.
  • Perhaps provide some reading ability (read only)
    outside of the process space, but no writing
  • Sharing memory spaces for functional reasons must
    be totally under control of the OS
  • Page tables must be protected - put them in
    kernel space
  • Maintain user and kernel modes
  • If you want privileges outside of your own space,
    you must execute a system call - allowing the OS to
    control the privilege (example: shared memory)