THE MEMORY HIERARCHY - PowerPoint PPT Presentation

About This Presentation
Title:

THE MEMORY HIERARCHY

Description:

Title: COSC3330/6308 Computer Architecture Author: Jehan-François Pâris Last modified by: Jehan-François Pâris Created Date: 8/29/2001 4:04:21 AM – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: THE MEMORY HIERARCHY


1
THE MEMORY HIERARCHY
  • Jehan-François Pâris
  • jfparis@uh.edu

2
Chapter Organization
  • Technology overview
  • Caches
  • Cache associativity, write through and write back
  • Virtual memory
  • Page table organization, the translation
    lookaside buffer (TLB), page fault handling,
    memory protection
  • Virtual machines
  • Cache consistency

3
TECHNOLOGY OVERVIEW
4
Dynamic RAM
  • Standard solution for main memory since 70's
  • Replaced magnetic core memory
  • Bits are represented by charges stored on capacitors
  • Charged state represents a one
  • Capacitors discharge
  • Must be dynamically refreshed
  • Achieved by accessing each cell several thousand
    times each second

5
Dynamic RAM
[Diagram: a DRAM cell built from an nMOS transistor and a capacitor, with row select and column select lines and a ground connection]
6
The role of the nMOS transistor
Not on the exam
  • Normally, no current can go from the source to
    the drain
  • When the gate is positive with respect to the
    ground, electrons are attracted to the gate (the
    "field effect") and current can go through

7
Magnetic disks
[Diagram: a magnetic disk drive - platters, servo, arm, and R/W head]
8
Magnetic disk (I)
  • Data are stored into circular tracks
  • Tracks are partitioned into a variable number of
    fixed-size sectors
  • If the disk drive has more than one platter, all
    tracks corresponding to the same position of the
    R/W head form a cylinder

9
Magnetic disk (II)
  • Disk spins at a speed varying between
  • 5,400 rpm (laptops) and
  • 15,000 rpm (Seagate Cheetah X15, ...)
  • Accessing data requires
  • Positioning the head on the right track
  • Seek time
  • Waiting for the data to reach the R/W head
  • On the average half a rotation

10
Disk access times
  • Dominated by seek time and rotational delay
  • We try to reduce seek times by placing all data
    that are likely to be accessed together on
    nearby tracks or the same cylinder
  • Cannot do as much for rotational delay
  • On the average half a rotation

11
Average rotational delay
RPM Delay (ms)
5400 5.6
7200 4.2
10,000 3.0
15,000 2.0
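These delays are just the time for half a rotation; a quick Python check of the table (my own arithmetic, not part of the slides):

    # Average rotational delay = time for half a rotation, in milliseconds
    def avg_rotational_delay_ms(rpm):
        seconds_per_rotation = 60.0 / rpm
        return seconds_per_rotation / 2 * 1000

    for rpm in (5400, 7200, 10000, 15000):
        print(rpm, round(avg_rotational_delay_ms(rpm), 1))   # 5.6, 4.2, 3.0, 2.0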
12
Overall performance
  • Disk access times are still dominated by
    rotational latency
  • Were 8-10 ms in the late 70's when rotational
    speeds were 3,000 to 3,600 RPM
  • Disk capacities and maximum transfer rates have
    done much better
  • Pack many more tracks per platter
  • Pack many more bits per track

13
The internal disk controller
  • Printed circuit board attached to disk drive
  • As powerful as the CPU of a personal computer of
    the early 80's
  • Functions include
  • Speed buffering
  • Disk scheduling

14
Reliability issues
  • Disk drives have more reliability issues than
    most other computer components
  • Moving parts eventually wear
  • Infant mortality
  • Would be too costly to produce perfect magnetic
    surfaces
  • Disks have bad blocks

15
Disk failure rates
  • Failure rates follow a bathtub curve
  • High infantile mortality
  • Low failure rate during useful life
  • Higher failure rates as disks wear out

16
Disk failure rates (II)
Failurerate
Wearout
Infantilemortality
Useful life
Time
17
Disk failure rates (III)
  • Infant mortality effect can last for months for
    disk drives
  • Cheap ATA disk drives seem to age less gracefully
    than SCSI drives

18
MTTF
  • Disk manufacturers advertise very high Mean Times
    To Fail (MTTF) for their products
  • 500,000 to 1,000,000 hours, that is, 57 to 114
    years
  • Does not mean that a disk will last that long!
  • Means that disks will fail at an average rate of
    one failure per 500,000 to 1,000,000 hours during
    their useful life

19
More MTTF Issues (I)
  • Manufacturers' claims are not supported by solid
    experimental evidence
  • Obtained by submitting disks to a stress test at
    high temperature and extrapolating results to
    ideal conditions
  • Procedure raises many issues

20
More MTTF Issues (II)
  • Failure rates observed in the field are much
    higher
  • Can go up to 8 to 9 percent per year
  • Corresponding MTTFs are 11 to 12.5 years
  • If we have 100 disks and a MTTF of 12.5 years,
    we can expect an average of 8 disk failures per
    year

21
Bad blocks (I)
  • Also known as
  • Irrecoverable read errors
  • Latent sector errors
  • Can be caused by
  • Defects in magnetic substrate
  • Problems during last write

22
Bad blocks (II)
  • Disk controller uses redundant encoding that can
    detect and correct many errors
  • When internal disk controller detects a bad block
  • Marks it as unusable
  • Remaps logical block address of bad block to
    spare sectors
  • Each disk is extensively tested during a burn-in
    period before being released

23
The memory hierarchy (I)
Level Device Access Time
1 Fastest: registers (2 GHz CPU) 0.5 ns
2 Main memory 10-60 ns
3 Secondary storage (disk) 7 ms
4 Mass storage (CD-ROM library) a few seconds
24
The memory hierarchy (II)
  • To make sense of these numbers, let us consider
    an analogy

25
Writing a paper (I)
Level Resource Access Time
1 Open book on desk 1 s
2 Book on desk
3 Book in library
4 Book far away
26
Writing a paper (II)
Level Resource Access Time
1 Open book on desk 1 s
2 Book on desk 20-120 s
3 Book in library
4 Book far away
27
Writing a paper (III)
Level Resource Access Time
1 Open book on desk 1 s
2 Book on desk 20-140 s
3 Book in library 162 days
4 Book far away
28
Writing a paper (IV)
Level Resource Access Time
1 Open book on desk 1 s
2 Book on desk 20-140 s
3 Book in library 162 days
4 Book far away 63 years
29
Major issues
  • Huge gaps between
  • CPU speeds and SDRAM access times
  • SDRAM access times and disk access times
  • Both problems have very different solutions
  • Gap between CPU speeds and SDRAM access times
    handled by hardware
  • Gap between SDRAM access times and disk access
    times handled by combination of software and
    hardware

30
Why?
  • Having hardware handle an issue
  • Complicates hardware design
  • Offers a very fast solution
  • Standard approach for very frequent actions
  • Letting software handle an issue
  • Cheaper
  • Has a much higher overhead
  • Standard approach for less frequent actions

31
Will the problem go away?
  • It will become worse
  • RAM access times are not improving as fast as CPU
    power
  • Disk access times are limited by rotational speed
    of disk drive

32
What are the solutions?
  • To bridge the CPU/DRAM gap
  • Interposing between the CPU and the DRAM smaller,
    faster memories that cache the data that the CPU
    currently needs
  • Cache memories
  • Managed by the hardware and invisible to the
    software (OS included)

33
What are the solutions?
  • To bridge the DRAM/disk drive gap
  • Storing in main memory the data blocks that are
    currently accessed (I/O buffer)
  • Managing memory space and disk space as a single
    resource (Virtual memory)
  • I/O buffer and virtual memory are managed by the
    OS and invisible to the user processes

34
Why do these solutions work?
  • Locality principle
  • Spatial locality: at any time a process only
    accesses a small portion of its address space
  • Temporal locality: this subset does not change
    too frequently

35
Can we think of examples?
  • The way we write programs
  • The way we act in everyday life

36
CACHING
37
The technology
  • Caches use faster static RAM (SRAM)
  • Similar organization to that of D flip-flops
  • Can have
  • Separate caches for instructions and data
  • Great for pipelining
  • A unified cache

38
A little story (I)
  • Consider a closed-stack library
  • Customers bring book requests to circulation desk
  • Librarians go to stack to fetch requested book
  • Solution is used in national libraries
  • Costlier than open-stack approach
  • Much better control of assets

39
A little story (II)
  • Librarians have noted that some books get asked
    again and again
  • Want to put them closer to the circulation desk
  • Would result in much faster service
  • The problem is how to locate these books
  • They will not be at the right location!

40
A little story (III)
  • Librarians come with a great solution
  • They put behind the circulation desk shelves with
    100 book slots numbered from 00 to 99
  • Each slot is a home for the most recently
    requested book that has a call number whose
    last two digits match the slot number
  • 3141593 can only go in slot 93
  • 1234567 can only go in slot 67

41
A little story (IV)
Let me see if it's in bin 93
The call number of the book I need is 3141593
42
A little story (V)
  • To let the librarian do her job, each slot must
    contain either
  • Nothing or
  • A book and its reference number
  • There are many books whose reference number ends
    in 93, or in any other two given digits

43
A little story (VI)
Sure
Could I get this time the book whose call number
is 4444493?
44
A little story (VII)
  • This time the librarian will
  • Go to bin 93
  • Find it contains a book with a different call
    number
  • She will
  • Bring back that book to the stacks
  • Fetch the new book

45
Basic principles
  • Assume we want to store in a faster memory the 2^n
    words that are currently accessed by the CPU
  • Can be instructions or data or even both
  • When the CPU will need to fetch an instruction or
    load a word into a register
  • It will look first into the cache
  • Can have a hit or a miss

46
Cache hits
  • Occur when the requested word is found in the
    cache
  • Cache avoided a memory access
  • CPU can proceed

47
Cache misses
  • Occur when the requested word is not found in the
    cache
  • Will need to access the main memory
  • Will bring the new word into the cache
  • Must make space for it by expelling one of the
    cache entries
  • Need to decide which one

48
Handling writes (I)
  • When CPU has to store the contents of a register
    into main memory
  • Write will update the cache
  • If the modified word is already in the cache
  • Everything is fine
  • Otherwise
  • Must make space for it by expelling one of the
    cache entries

49
Handling writes (II)
  • Two ways to handle writes
  • Write through
  • Each write updates both the cache and the main
    memory
  • Write back
  • Writes are not propagated to the main memory
    until the updated word is expelled from the cache

50
Handling writes (III)
  • Write through
  • Write back

[Diagram: with write through, CPU writes go to the cache and to RAM at once; with write back, they go to the cache now and to RAM later]
51
Pros and cons
  • Write through
  • Ensures that memory is always up to date
  • Expelled cache entries can be overwritten
  • Write back
  • Faster writes
  • Complicates cache expulsion procedure
  • Must write back cache entries that have been
    modified in the cache

52
Picking the right solution
  • Caches use write through
  • Provides simpler cache expulsions
  • Can minimize write-through overhead with
    additional circuitry
  • I/O buffers and virtual memory use write back
  • Write-through overhead would be too high

53
A better write through (I)
  • Add a small buffer to speed up write performance
    of write-through caches
  • At least four words
  • Holds modified data until they are written into
    main memory
  • Cache can proceed as soon as data are written
    into the write buffer

54
A better write through (II)
  • Write through
  • Better write through

[Diagram: with a write buffer, CPU writes go to the cache and to the write buffer; the buffer later updates RAM]
55
A very basic cache
  • Has 2^n entries
  • Each entry contains
  • A word (4 bytes)
  • Its RAM address
  • Sole way to identify the word
  • A bit indicating whether the cache entry contains
    something useful

56
A very basic cache (I)
Actual caches are much bigger
57
A very basic cache (II)
58
Comments (I)
  • The cache organization we have presented is
    nothing but the hardware implementation of a hash
    table
  • Each entry has
  • a key: the word address
  • a value: the word contents plus a valid bit

59
Comments (II)
  • The hash function is
  • h(k) = (k/4) mod N
  • where k is the key and N is the cache size
  • Can be computed very fast
  • Unlike conventional hash tables, this
    organization has no provision for handling
    collisions
  • Use expulsion to resolve collisions
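As an illustration of this hash-table view, here is a minimal Python sketch (names and sizes invented for the example, not taken from the slides) of a direct-mapped cache of N one-word entries that uses h(k) = (k/4) mod N and resolves collisions by expulsion:

    N = 8                     # number of cache entries (2^n)
    cache = [None] * N        # each entry is None (invalid) or a (word address, word) pair

    def index(address):
        return (address // 4) % N          # h(k) = (k/4) mod N

    def lookup(address):
        entry = cache[index(address)]
        return entry is not None and entry[0] == address   # True on a hit

    def store(address, word):
        cache[index(address)] = (address, word)   # expel whatever was there before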

60
Managing the cache
  • Each word fetched into the cache can occupy a single
    cache location
  • Specified by bits n+1 to 2 of its address
  • Two words with the same bits n+1 to 2 cannot be
    in the cache at the same time
  • Happens whenever the addresses of the two words
    differ by a multiple of 2^(n+2)

61
Example
  • Assume the cache can contain 8 words
  • If word 48 is in the cache it will be stored at
    cache index (48/4) mod 8 = 12 mod 8 = 4
  • In our case 2^(n+2) = 2^(3+2) = 32
  • The only possible cache index for word 80 would
    be (80/4) mod 8 = 20 mod 8 = 4
  • Same for words 112, 144, 176, ...

62
Managing the cache
  • Each word fetched into the cache can occupy a single
    cache location
  • Specified by bits n+1 to 2 of its address
  • Two words with the same bits n+1 to 2 cannot be
    in the cache at the same time
  • Happens whenever the addresses of the two words
    differ by a multiple of 2^(n+2)

63
Saving cache space
  • We do not need to store the whole address of each
    word in the cache
  • Bits 1 and 0 will always be zero
  • Bits n+1 to 2 can be inferred from the cache
    index
  • If the cache has 8 entries, bits 4 to 2
  • Will only store in the tag the remaining bits of the
    address

64
A very basic cache (III)
Cache uses bits 4 to 2 of word address
65
Storing a new word in the cache
  • Location of the new word entry will be obtained from
    the LSBs of the word address
  • Discard the 2 LSBs
  • Always zero for a well-aligned word
  • Take the n next LSBs for a cache of size 2^n
  • They give the cache index

[Diagram: word address = MSBs (stored in the tag) | n next LSBs (cache index) | 00]
66
Accessing a word in the cache (I)
  • Start with word address
  • Remove the two least significant bits
  • Always zero

Word address
67
Accessing a word in the cache (II)
  • Split the remainder of the address into
  • n least significant bits
  • The word's index in the cache
  • The remaining bits form the cache tag

[Diagram: word address minus two LSB = cache tag | n LSB (cache index)]
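The address split described above, as a small Python sketch (my own helper, assuming well-aligned 4-byte words and a direct-mapped cache with 2^n one-word entries):

    def split_address(address, n):
        byte_offset = address & 0b11                    # bits 1-0, always zero for aligned words
        cache_index = (address >> 2) & ((1 << n) - 1)   # next n bits select the cache entry
        tag = address >> (n + 2)                        # remaining MSBs are stored in the tag
        return tag, cache_index, byte_offset

    print(split_address(48, 3))   # (1, 4, 0): word 48 goes to index 4 with tag 1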
68
Towards a better cache
  • Our cache takes into account temporal locality of
    accesses
  • Repeated accesses to the same location
  • But not their spatial locality
  • Accesses to neighboring locations
  • Cache space is poorly used
  • Need 26 + 1 bits of overhead to store 32 bits of
    data

69
Multiword cache (I)
  • Each cache entry will contain a block of 2, 4,
    8, ... words with consecutive addresses
  • Will require words to be well aligned
  • A pair of words should start at an address that is a
    multiple of 2×4 = 8
  • A group of four words should start at an address
    that is a multiple of 4×4 = 16

70
Multiword cache (II)
[Diagram: a multiword cache entry = tag + block contents]
71
Multiword cache (III)
  • Has 2^n entries each containing 2^m words
  • Each entry contains
  • 2^m words
  • A tag
  • A bit indicating whether the cache entry contains
    useful data

72
Storing a new word in the cache
  • Location of the new word entry will be obtained from
    the LSBs of the word address
  • Discard the 2 + m LSBs
  • Always zero for a well-aligned group of words
  • Take the n next LSBs for a cache of size 2^n

[Diagram: address = MSBs (tag) | n next LSBs (cache index) | 2 + m LSBs]
73
Example
  • Assume
  • Cache can contain 8 entries
  • Each block contains 2 words
  • Words 48 and 52 belong to the same block
  • If word 48 is in the cache it will be stored at
    cache index (48/8) mod 8 = 6 mod 8 = 6
  • If word 52 is in the cache it will be stored at
    cache index (52/8) mod 8 = 6 mod 8 = 6

74
Selecting the right block size
  • Larger block sizes improve the performance of the
    cache
  • Allows us to exploit spatial locality
  • Three limitations
  • Spatial locality effect less pronounced if block
    size exceeds 128 bytes
  • Too many collisions in very small caches
  • Large blocks take more time to be fetched into
    the cache

75
(No Transcript)
76
Collision effect in small cache
  • Consider a 4KB cache
  • If block size is 16 B, that is, 4 words,cache
    will have 256 blocks
  • If block size is 128 B, that is 32 words, the cache
    will have 32 blocks
  • Too many collisions

77
Problem
  • Consider a very small cache with 8 entries and a
    block size of 8 bytes (2 words)
  • Which words will be fetched in the cache when the
    CPU accesses words at address 32, 48, 60 and 80?
  • How will these words be stored in the cache?

78
Solution (I)
  • Since the block size is 8 bytes
  • 3 LSBs of the address are used to address one of the 8
    bytes in a block
  • Since the cache holds 8 blocks,
  • The next 3 LSBs of the address are used as the cache index
  • As a result, the tag has 32 - 3 - 3 = 26 bits

79
Solution (I)
  • Consider words at address 32
  • Cache index is (32/2^3) mod 2^3 = (32/8) mod 8 = 4
  • Block tag is 32/2^6 = 32/64 = 0

Row 4 Tag 0 Bytes 32 33 34 35 36 37 38 39
80
Solution (II)
  • Consider words at address 48
  • Cache index is (48/8) mod 8 = 6
  • Block tag is 48/64 = 0

Row 6 Tag 0 Bytes 48 49 50 51 52 53 54 55
81
Solution (III)
  • Consider words at address 60
  • Cache index is (60/8) mod 8 = 7
  • Block tag is 60/64 = 0

Row 7 Tag 0 Bytes 56 57 58 59 60 61 62 63
82
Solution (IV)
  • Consider words at address 80
  • Cache index is (80/8) mod 8 = 10 mod 8 = 2
  • Block tag is 80/64 = 1

Row 2 Tag 1 Bytes 80 81 82 83 84 85 86 87
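The same placement computation as a Python sketch (my own helper, using the problem's assumptions of 8 cache entries and 8-byte blocks):

    BLOCK_BYTES = 8     # 2 words per block
    NUM_ENTRIES = 8

    def block_placement(address):
        row = (address // BLOCK_BYTES) % NUM_ENTRIES
        tag = address // (BLOCK_BYTES * NUM_ENTRIES)
        first = (address // BLOCK_BYTES) * BLOCK_BYTES
        return row, tag, list(range(first, first + BLOCK_BYTES))

    for a in (32, 48, 60, 80):
        print(a, block_placement(a))
    # 32 -> row 4, tag 0, bytes 32-39      48 -> row 6, tag 0, bytes 48-55
    # 60 -> row 7, tag 0, bytes 56-63      80 -> row 2, tag 1, bytes 80-87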
83
Set-associative caches (I)
  • Can be seen as 2, 4, 8 caches attached together
  • Reduces collisions

84
Set-associative caches (II)
85
Set-associative caches (III)
  • Advantage
  • We take care of more collisions
  • Like a hash table with a fixed bucket size
  • Results in lower miss rates than direct-mapped
    caches
  • Disadvantage
  • Slower access
  • Best solution if miss penalty is very big

86
Fully associative caches
  • The dream!
  • A block can occupy any index position in the
    cache
  • Requires an associative memory
  • Content-addressable
  • Like our brain!
  • Remains a dream

87
Designing RAM to support caches
  • RAM connected to CPU through a "bus"
  • Clock rate much slower than CPU clock rate
  • Assume that a RAM access takes
  • 1 bus clock cycle to send the address
  • 15 bus clock cycles to initiate a read
  • 1 bus clock cycle to send a word of data

88
Designing RAM to support caches
  • Assume
  • Cache block size is 4 words
  • One-word bank of DRAM
  • Fetching a cache block would take
  • 1 + 4×15 + 4×1 = 65 bus clock cycles
  • Transfer rate is 0.25 byte/bus cycle
  • Awful!

89
Designing RAM to support caches
  • Could
  • Double bus width (from 32 to 64 bits)
  • Have a two-word bank of DRAM
  • Fetching a cache block would take
  • 1 + 2×15 + 2×1 = 33 bus clock cycles
  • Transfer rate is 0.48 byte/bus cycle
  • Much better
  • Costly solution

90
Designing RAM to support caches
  • Could
  • Have an interleaved memory organization
  • Four one-word banks of DRAM
  • A 32-bit bus

[Diagram: four one-word RAM banks (bank 0 to bank 3) attached to a 32-bit bus]
91
Designing RAM to support caches
  • Can do the 4 accesses in parallel
  • Must still transmit the block 32 bits by 32 bits
  • Fetching a cache block would take
  • 1 + 15 + 4×1 = 20 bus clock cycles
  • Transfer rate is 0.80 byte/bus cycle
  • Even better
  • Much cheaper than having a 64-bit bus
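A quick Python check of the three organizations (my own arithmetic, using the timing assumptions stated earlier):

    SEND_ADDR, INIT_READ, SEND_WORD = 1, 15, 1   # bus clock cycles, per the assumptions above
    BLOCK_BYTES = 16                             # 4-word cache block

    one_word_bank = SEND_ADDR + 4 * (INIT_READ + SEND_WORD)   # 65 cycles
    two_word_bank = SEND_ADDR + 2 * (INIT_READ + SEND_WORD)   # 33 cycles (64-bit bus)
    interleaved   = SEND_ADDR + INIT_READ + 4 * SEND_WORD     # 20 cycles (4 banks in parallel)

    for cycles in (one_word_bank, two_word_bank, interleaved):
        print(cycles, "cycles,", round(BLOCK_BYTES / cycles, 2), "bytes/cycle")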

92
ANALYZING CACHE PERFORMANCE
93
Memory stalls
  • Can divide CPU time into
  • N_EXEC clock cycles spent executing instructions
  • N_MEM_STALLS cycles spent waiting for memory
    accesses
  • We have
  • CPU time = (N_EXEC + N_MEM_STALLS) × T_CYCLE

94
Memory stalls
  • We assume that
  • cache access times can be neglected
  • most CPU cycles spent waiting for memory accesses
    are caused by cache misses
  • Distinguishing between read stalls and write
    stalls
  • N_MEM_STALLS = N_RD_STALLS + N_WR_STALLS

95
Read stalls
  • Fairly simple
  • N_RD_STALLS = N_MEM_RD × Read miss rate × Read
    miss penalty

96
Write stalls (I)
  • Two causes of delays
  • Must fetch missing blocks before updating them
  • We update at most 8 bytes of the block!
  • Must take into account cost of write through
  • Buffering delay depends on the proximity of writes,
    not the number of cache misses
  • Writes too close to each other

97
Write stalls (II)
  • We have
  • N_WR_STALLS = N_WRITES × Write miss rate ×
    Write miss penalty + N_WR_BUFFER_STALLS
  • In practice, very few buffer stalls if the buffer
    contains at least four words

98
Global impact
  • We have
  • N_MEM_STALLS = N_MEM_ACCESSES × Cache miss rate ×
    Cache miss penalty
  • and also
  • N_MEM_STALLS = N_INSTRUCTIONS × (N_MISSES/Instruction) ×
    Cache miss penalty

99
Example
  • Miss rate of the instruction cache is 2 percent
  • Miss rate of the data cache is 4 percent
  • In the absence of memory stalls, each instruction
    would take 2 cycles
  • Miss penalty is 100 cycles
  • 36 percent of instructions access the main memory
  • How many cycles are lost due to cache misses?

100
Solution (I)
  • Impact of instruction cache misses
  • 0.02 × 100 = 2 cycles/instruction
  • Impact of data cache misses
  • 0.36 × 0.04 × 100 = 1.44 cycles/instruction
  • Total impact of cache misses
  • 2 + 1.44 = 3.44 cycles/instruction

101
Solution (II)
  • Average number of cycles per instruction
  • 2 + 3.44 = 5.44 cycles/instruction
  • Fraction of time wasted
  • 3.44/5.44 ≈ 63 percent (see the sketch below)
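The computation above as a short Python sketch (my own helper; it also solves the problem that follows):

    def stall_impact(i_miss_rate, d_miss_rate, mem_fraction, penalty, base_cpi):
        stalls = i_miss_rate * penalty + mem_fraction * d_miss_rate * penalty
        cpi = base_cpi + stalls
        return stalls, cpi, stalls / cpi      # cycles lost, total CPI, fraction wasted

    print(stall_impact(0.02, 0.04, 0.36, 100, 2))   # (3.44, 5.44, 0.63...)
    print(stall_impact(0.03, 0.05, 0.40, 100, 2))   # next problem: about 71 percent wasted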

102
Problem
  • Redo the example with the following data
  • Miss rate of the instruction cache is 3 percent
  • Miss rate of the data cache is 5 percent
  • In the absence of memory stalls, each instruction
    would take 2 cycles
  • Miss penalty is 100 cycles
  • 40 percent of instructions access the main memory

103
Solution
  • The fraction of time wasted to memory stalls is
    71 percent

104
Average memory access time
  • Some authors call it AMAT
  • T_AVERAGE = T_CACHE + f × T_MISS
  • where f is the cache miss rate
  • Times can be expressed
  • In nanoseconds
  • In number of cycles
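The formula in Python (my own notation), applied to the example that follows:

    def amat(t_cache, miss_rate, t_miss):
        return t_cache + miss_rate * t_miss   # T_AVERAGE = T_CACHE + f * T_MISS

    print(amat(1, 0.04, 100))   # 5 cycles with a 96 percent hit rate
    print(amat(1, 0.02, 100))   # 3 cycles with a 98 percent hit rate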

105
Example
  • A cache has a hit rate of 96 percent
  • Accessing data
  • In the cache requires one cycle
  • In the memory requires 100 cycles
  • What is the average memory access time?

106
Solution
Corrected
  • Miss rate = 1 - Hit rate = 0.04
  • Applying the formula
  • T_AVERAGE = 1 + 0.04 × 100 = 5 cycles

107
Impact of a better hit rate
  • What would be the impact of improving the hit
    rate of the cache from 96 to 98 percent?

108
Solution
  • New miss rate = 1 - New hit rate = 0.02
  • Applying the formula
  • T_AVERAGE = 1 + 0.02 × 100 = 3 cycles

When the hit rate is above 80 percent, small
improvements in the hit rate will result in a much
better miss rate
109
Examples
  • Old hit rate 80 percent, new hit rate 90
    percent
  • Miss rate goes from 20 to 10 percent!
  • Old hit rate 94 percent, new hit rate 98
    percent
  • Miss rate goes from 6 to 2 percent!

110
In other words
It's the miss rate, stupid!
111
Improving cache hit rate
  • Two complementary techniques
  • Using set-associative caches
  • Must check tags of all blocks with the same index
    values
  • Slower
  • Have fewer collisions
  • Fewer misses
  • Use a cache hierarchy

112
A cache hierarchy (I)
CPU
L1
L1 misses
L2
L2 misses
L3
L3 misses
RAM
113
A cache hierarchy
  • Topmost cache
  • Optimized for speed, not miss rate
  • Rather small
  • Uses a small block size
  • As we go down the hierarchy
  • Cache sizes increase
  • Block sizes increase
  • Cache associativity level increases

114
Example
  • Cache miss rate per instruction is 2 percent
  • In the absence of memory stalls, each instruction
    would take one cycle
  • Cache miss penalty is 100 ns
  • Clock rate is 4 GHz
  • How many cycles are lost due to cache misses?

115
Solution (I)
  • Duration of a clock cycle
  • 1/(4 GHz) = 0.25 × 10^-9 s = 0.25 ns
  • Cache miss penalty
  • 100 ns = 400 cycles
  • Total impact of cache misses
  • 0.02 × 400 = 8 cycles/instruction

116
Solution (II)
  • Average number of cycles per instruction
  • 1 + 8 = 9 cycles/instruction
  • Fraction of time wasted
  • 8/9 ≈ 89 percent

117
Example (cont'd)
  • How much faster would the processor be if we added
    an L2 cache that
  • Has a 5 ns access time
  • Would reduce miss rate to main memory to 0.5
    percent?
  • Will see later how to get that

118
Solution (I)
  • L2 cache access time
  • 5 ns = 20 cycles
  • Impact of cache misses per instruction
  • L1 cache misses + L2 cache misses =
    0.02 × 20 + 0.005 × 400 = 0.4 + 2.0 = 2.4
    cycles/instruction
  • Average number of cycles per instruction
  • 1 + 2.4 = 3.4 cycles/instruction

119
Solution (II)
  • Fraction of time wasted
  • 2.4/3.4 ≈ 71 percent
  • CPU speedup
  • 9/3.4 ≈ 2.6

120
How to get the 0.005 miss rate
  • Wanted miss rate corresponds to a combined cache
    hit rate of 99.5 percent
  • Let H1 be hit rate of L1 cache and H2 be the hit
    rate of the second cache
  • The combined hit rate of the cache hierarchy is
    H = H1 + (1 - H1) × H2

121
How to get the 0.005 miss rate
  • We have 0.995 = 0.98 + 0.02 × H2
  • H2 = (0.995 - 0.98)/0.02 = 0.75
  • Quite feasible!
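A Python check (my own) of the combined hit rate and of the resulting stall cycles for this two-level hierarchy:

    def combined_hit_rate(h1, h2):
        return h1 + (1 - h1) * h2              # H = H1 + (1 - H1) * H2

    def stalls_per_instruction(l1_miss, l2_cycles, l2_miss, ram_cycles):
        return l1_miss * l2_cycles + l1_miss * l2_miss * ram_cycles

    print(combined_hit_rate(0.98, 0.75))                 # 0.995
    print(stalls_per_instruction(0.02, 20, 0.25, 400))   # 2.4 cycles/instruction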

122
Can we do better? (I)
  • Keep 98 percent hit rate for L1 cache
  • Raise hit rate of L2 cache to 85 percent
  • L2 cache is now slower: 6 ns
  • Impact of cache misses per instruction
  • L1 cache misses + L2 cache misses =
    0.02 × 24 + 0.02 × 0.15 × 400 = 0.48 + 1.2 = 1.68
    cycles/instruction

123
The verdict
  • Fraction of time wasted per cycle
  • 1.68/2.68 ≈ 63 percent
  • CPU speedup
  • 9/2.68 ≈ 3.36

124
Would a faster L2 cache help?
  • Redo the example assuming
  • Hit rate of the L1 cache is still 98 percent
  • New faster L2 cache
  • Access time reduced to 3 ns
  • Hit rate only 50 percent

125
The verdict
  • Fraction of time wasted
  • 81 percent
  • CPU speedup
  • 1.72

New L2 cache with a lower access time but a
higher miss rate performs much worse than the
original L2 cache
126
Cache replacement policy
  • Not an issue in direct mapped caches
  • We have no choice!
  • An issue in set-associative caches
  • Best policy is least recently used (LRU)
  • Expels from the cache a block in the same set as
    the incoming block
  • Pick block that has not been accessed for the
    longest period of time

127
Implementing LRU policy
  • Easy when each set contains two blocks
  • We attach to each block a use bit that is
  • Set to 1 when the block is accessed
  • Reset to 0 when the other block is accessed
  • We expel block whose use bit is 0
  • Much more complicated for higher associativity
    levels
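A minimal sketch (my own, not from the slides) of the use-bit scheme for one two-block set:

    class TwoWaySet:
        """One set of a 2-way set-associative cache with a use bit per block."""
        def __init__(self):
            self.blocks = [None, None]    # each block is None or a (tag, data) pair
            self.use = [0, 0]             # use[i] == 1 means block i was accessed last

        def access(self, tag, data=None):
            for i, blk in enumerate(self.blocks):
                if blk is not None and blk[0] == tag:    # hit
                    self.use = [0, 0]
                    self.use[i] = 1
                    return True
            victim = self.use.index(0)                   # expel the block whose use bit is 0
            self.blocks[victim] = (tag, data)
            self.use = [0, 0]
            self.use[victim] = 1
            return False                                 # miss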

128
REALIZATIONS
129
Caching in a multicore organization
  • Multicore organizations often involve multiple
    chips
  • Say four chips with four cores per chip
  • Have a cache hierarchy on each chip
  • L1, L2, L3
  • Some caches are private, others are shared
  • Accessing a cache on a chip is much faster than
    accessing a cache on another chip

130
AMD 16-core system (I)
  • AMD 16-core system
  • Sixteen cores on four chips
  • Each core has a 64-KB L1 and a 512-KB L2 cache
  • Each chip has a 2-MB shared L3 cache

131
X/Y, where X is the latency in cycles and Y is the bandwidth
in bytes/cycle
132
AMD 16-core system (II)
  • Observe that access times are non-uniform
  • Takes more time to access L1 or L2 cache of
    another core than accessing shared L3 cache
  • Takes more time to access caches in another chip
    than local caches
  • Access times and bandwidths depend on the chip
    interconnect topology

133
VIRTUAL MEMORY
134
Main objective (I)
  • To allow programmers to write programs that
    reside
  • partially in main memory
  • partially on disk

135
Main objective (II)
Main memory
Address space (I)
Address space (II)
136
Motivation
  • Most programs do not access their whole address
    space at the same time
  • Compilers go through several phases
  • Lexical analysis
  • Preprocessing (C, C++)
  • Syntactic analysis
  • Semantic analysis

137
Advantages (I)
  • VM allows programmers to write programs that
    would not otherwise fit in main memory
  • They will run, although much more slowly
  • Very important in the 70's and 80's
  • VM allows OS to allocate the main memory much
    more efficiently
  • Do not waste precious memory space
  • Still important today

138
Advantages
  • VM lets programmers use
  • Sparsely populated
  • Very large address spaces

[Diagram: a virtual address space sparsely populated with regions labeled D, C, S, and L]
139
Sparsely populated address spaces
  • Let programmers put different items apart from
    each other
  • Code segment
  • Data segment
  • Stack
  • Shared library
  • Mapped files

Wait until you take 4330 to study this
140
Big difference with caching
  • Miss penalty is much bigger
  • Around 5 ms
  • Assuming a memory access time of 50 ns, 5 ms
    equals 100,000 memory accesses
  • For caches, the miss penalty was around 100 cycles

141
Consequences
  • Will use much larger block sizes
  • Blocks, here called pages, measure 4 KB, 8 KB, ...,
    with 4 KB an unofficial standard
  • Will use fully associative mapping to reduce
    misses, here called page faults
  • Will use write back to reduce disk accesses
  • Must keep track of modified (dirty) pages in
    memory

142
Virtual memory
  • Combines two big ideas
  • Non-contiguous memory allocation: processes are
    allocated page frames scattered all over the main
    memory
  • On-demand fetch: process pages are brought into
    main memory when they are accessed for the first
    time
  • MMU takes care of almost everything

143
Main memory
  • Divided into fixed-size page frames
  • Allocation units
  • Sizes are powers of 2 (512 B, ..., 4 KB, ...)
  • Properly aligned
  • Numbered 0 , 1, 2, . . .

[Diagram: main memory divided into page frames numbered 0 through 8]
144
Program address space
  • Divided into fixed-size pages
  • Same sizes as page frames
  • Properly aligned
  • Also numbered 0 , 1, 2, . . .

[Diagram: a program address space divided into pages numbered 0 through 7]
145
The mapping
  • Will allocate non contiguous page frames to the
    pages of a process

146
The mapping
Page Number Frame number
0 0
1 4
2 2
147
The mapping
  • Assuming 1KB pages and page frames

Virtual Addresses Physical Addresses
0 to 1,023 0 to 1,023
1,024 to 2,047 4,096 to 5,119
2,048 to 3,071 2,048 to 3,071
148
The mapping
  • Observing that 2^10 is written in binary as a one
    followed by ten zeroes
  • We will write 0-0 for ten zeroes and 1-1 for ten
    ones

Virtual Addresses Physical Addresses
0000-0 to 0001-1 0000-0 to 0001-1
0010-0 to 0011-1 1000-0 to 1001-1
0100-0 to 0101-1 0100-0 to 0101-1
149
The mapping
  • The ten least significant bits of the address do
    not change

Virtual Addresses Physical Addresses
000 0-0 to 000 1-1 000 0-0 to 000 1-1
001 0-0 to 001 1-1 100 0-0 to 100 1-1
010 0-0 to 010 1-1 010 0-0 to 010 1-1
150
The mapping
  • Must only map page numbers into page frame numbers

Page number Page frame number
000 000
001 100
010 010
151
The mapping
  • Same in decimal

Page number Page frame number
0 0
1 4
2 2
152
The mapping
  • Since page numbers are always in sequence, they
    are redundant

X
Page number Page frame number
0 0
1 4
2 2
153
The algorithm
  • Assume the page size is 2^p
  • Remove p least significant bits from virtual
    address to obtain the page number
  • Use page number to find corresponding page frame
    number in page table
  • Append p least significant bits from virtual
    address to page frame number to get physical
    address
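The algorithm as a Python sketch (my own, with a dict standing in for the page table and the 1 KB pages of the earlier example):

    PAGE_SIZE = 1024                    # 2^p bytes with p = 10
    page_table = {0: 0, 1: 4, 2: 2}     # page number -> page frame number

    def translate(virtual_address):
        page = virtual_address // PAGE_SIZE      # drop the p least significant bits
        offset = virtual_address % PAGE_SIZE     # the p LSBs pass through unchanged
        frame = page_table[page]                 # a missing entry would be a page fault
        return frame * PAGE_SIZE + offset        # append the offset to the frame number

    print(translate(1030))   # page 1 maps to frame 4, so the physical address is 4102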

154
Realization
155
The offset
  • Offset contains all bits that remain unchanged
    through the address translation process
  • Function of page size

Page size Offset
1 KB 10 bits
2 KB 11 bits
4KB 12 bits
156
The page number
  • Contains other bits of virtual address
  • Assuming 32-bit addresses

Page size Offset Page number
1 KB 10 bits 22 bits
2 KB 11 bits 21 bits
4KB 12 bits 20 bits
157
Internal fragmentation
  • Each process now occupies an integer number of
    pages
  • Actual process space is not a round number
  • Last page of a process is rarely full
  • On the average, half a page is wasted
  • Not a big issue
  • Internal fragmentation

158
On-demand fetch (I)
  • Most processes terminate without having accessed
    their whole address space
  • Code handling rare error conditions, . . .
  • Other processes go to multiple phases during
    which they access different parts of their
    address space
  • Compilers

159
On-demand fetch (II)
  • VM systems do not fetch whole address space of a
    process when it is brought into memory
  • They fetch individual pages on demand when they
    get accessed the first time
  • Page miss or page fault
  • When memory is full, they expel from memory pages
    that are not currently in use

160
On-demand fetch (III)
  • The pages of a process that are not in main
    memory reside on disk
  • In the executable file for the program being
    run for the pages in the code segment
  • In a special swap area for the data pages that
    were expelled from main memory

161
On-demand fetch (IV)
Main memory
Code
Data
Disk
Executable
Swap area
162
On-demand fetch (V)
  • When a process tries to access data that are not
    present in main memory
  • MMU hardware detects that the page is missing and
    causes an interrupt
  • Interrupt wakes up page fault handler
  • Page fault handler puts process in waiting state
    and brings missing page in main memory

163
Advantages
  • VM systems use main memory more efficiently than
    other memory management schemes
  • Give to each process more or less what it needs
  • Process sizes are not limited by the size of main
    memory
  • Greatly simplifies program organization

164
Sole disadvantage
  • Bringing pages from disk is a relatively slow
    operation
  • Takes milliseconds while memory accesses take
    nanoseconds
  • Ten thousand to a hundred thousand times
    slower

165
The cost of a page fault
  • Let
  • Tm be the main memory access time
  • Td the disk access time
  • f the page fault rate
  • Ta the average access time of the VM
  • Ta = (1 - f) Tm + f (Tm + Td) = Tm + f Td

166
Example
  • Assume Tm 50 ns and Td 5 ms

f Mean memory access time
10^-3 50 ns + 5 ms/10^3 = 5,050 ns
10^-4 50 ns + 5 ms/10^4 = 550 ns
10^-5 50 ns + 5 ms/10^5 = 100 ns
10^-6 50 ns + 5 ms/10^6 = 55 ns
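A Python check of the table (my own arithmetic, with Tm = 50 ns and Td = 5 ms):

    T_MEM = 50             # ns
    T_DISK = 5_000_000     # 5 ms expressed in ns

    def avg_access_time(fault_rate):
        return T_MEM + fault_rate * T_DISK    # Ta = Tm + f Td

    for f in (1e-3, 1e-4, 1e-5, 1e-6):
        print(f, avg_access_time(f), "ns")    # 5050, 550, 100, 55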
167
Conclusion
  • Virtual memory works best when page fault rate
    is less than a page fault per 100,000
    instructions

168
Locality principle (I)
  • A process that would access its pages in a
    totally unpredictable fashion would perform very
    poorly in a VM system unless all its pages are in
    main memory

169
Locality principle (II)
  • Process P accesses randomly a very large array
    consisting of n pages
  • If m of these n pages are in main memory, the
    page fault frequency of the process will be
    (n - m)/n
  • Must switch to another algorithm

170
Tuning considerations
  • In order to achieve an acceptable performance, a
    VM system must ensure that each process has in
    main memory all the pages it is currently
    referencing
  • When this is not the case, the system performance
    will quickly collapse

171
First problem
  • A virtual memory system has
  • 32 bit addresses
  • 8 KB pages
  • What are the sizes of the
  • Page number field?
  • Offset field?

172
Solution (I)
  • Step 1: Convert the page size to a power of 2:
    8 KB = 2^--- B
  • Step 2: The exponent is the length of the offset field

173
Solution (II)
  • Step 3: Size of page number field = Address size -
    Offset size. Here 32 - ____ = ____ bits
  • Highlight the text in the box to see the answers

13 bits for the offset and 19 bits for the page
number
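The same two steps in Python (my own helper), for any address width and page size:

    def page_fields(address_bits, page_size_bytes):
        offset_bits = page_size_bytes.bit_length() - 1    # log2 of the page size
        return offset_bits, address_bits - offset_bits    # offset width, page number width

    print(page_fields(32, 8 * 1024))   # (13, 19) for 8 KB pages
    print(page_fields(32, 4 * 1024))   # (12, 20) for 4 KB pages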
174
PAGE TABLE REPRESENTATION
175
Page table entries
  • A page table entry (PTE) contains
  • A page frame number
  • Several special bits
  • Assuming 32-bit addresses, all fit into four bytes

176
The special bits (I)
  • Valid bit: 1 if the page is in main memory, 0
    otherwise
  • Missing bit: 1 if the page is not in main memory, 0
    otherwise
  • They serve the same function but use different conventions

177
The special bits (II)
  • Dirty bit: 1 if the page has been modified since it
    was brought into main memory, 0 otherwise
  • A dirty page must be saved in the process swap
    area on disk before being expelled from main
    memory
  • A clean page can be immediately expelled

178
The special bits (III)
  • Page-referenced bit: 1 if the page has been recently
    accessed, 0 otherwise
  • Often simulated in software

179
Where to store page tables
  • Use a three-level approach
  • Store parts of page table
  • In high-speed registers located in the MMU: the
    translation lookaside buffer (TLB)
    (good solution)
  • In main memory (bad solution)
  • On disk (ugly solution)

180
The translation lookaside buffer
  • Small high-speed memory
  • Contains fixed number of PTEs
  • Content-addressable memory
  • Entries include page frame number and page number

181
Realizations (I)
  • TLB of Intrinsity FastMATH
  • 32-bit addresses
  • 4 KB pages
  • Fully associative TLB with 16 entries
  • Each entry occupies 64 bits
  • 20 bits for page number
  • 20 bits for page frame number
  • Valid bit, dirty bit,

182
Realizations (II)
  • TLB of ULTRA SPARC III
  • 64-bit addresses
  • Maximum program size is 2^44 bytes, that is, 16 TB
  • Supported page sizes are 4 KB, 16 KB, 64 KB, and 4 MB
    ("superpages")

183
Realizations (III)
  • TLB of ULTRA SPARC III
  • Dual direct-mapping (?) TLB
  • 64 entries for code pages
  • 64 entries for data pages
  • Each entry occupies 64 bits
  • Page number and page frame number
  • Context
  • Valid bit, dirty bit,

184
The context (I)
  • Conventional TLBs contain the PTE's for a
    specific address space
  • Must be flushed each time the OS switches from
    the current process to a new process
  • Frequent action in any modern OS
  • Introduces a significant time penalty

185
The context (II)
  • UltraSPARC III architecture adds to TLB entries a
    context identifying a specific address space
  • Page mappings from different address spaces can
    coexist in the TLB
  • A TLB hit now requires a match for both page
    number and context
  • Eliminates the need to flush the TLB

186
TLB misses
  • When a PTE cannot be found in the TLB, a TLB
    miss is said to occur
  • TLB misses can be handled
  • By the computer firmware
  • Cost of miss is one extra memory access
  • By the OS kernel
  • Cost of miss is two context switches

187
Letting SW handle TLB misses
  • As in other exceptions, must save current value
    of PC in EPC register
  • Must also assert the exception by the end of the
    clock cycle during which the memory access occurs
  • In MIPS, must prevent WB cycle to occur after MEM
    cycle that generated the exception

188
Example
  • Consider the instruction
  • lw $1, 0($2)
  • If the word at address 0($2) is not in the TLB, we must
    prevent any update of $1

189
Performance implications
  • When TLB misses are handled by the firmware,
    they are very cheap
  • A TLB hit rate of 99 percent is very good. Average
    access cost will be
  • Ta = 0.99 × Tm + 0.01 × 2Tm = 1.01 Tm
  • Less true if TLB misses are handled by the kernel

190
Storing the rest of the page table
  • PTs are too large to be stored in main memory
  • Will store active part of the PT in main memory
  • Other entries on disk
  • Three solutions
  • Linear page tables
  • Multilevel page tables
  • Hashed page tables

191
Storing the rest of the page table
  • We will review these solutions even though page
    table organizations are an operating system topic

192
Linear page tables (I)
  • Store PT in virtual memory (VMS solution)
  • Very large page tables need more than 2 levels (3
    levels on MIPS R3000)

193
Linear page tables (II)
[Diagram: the page table (PT) is stored in virtual memory along with other PTs; only part of it resides in physical memory]
194
Linear page tables (III)
  • Assuming a page size of 4KB,
  • Each page of virtual memory requires 4 bytes of
    physical memory
  • Each PT maps 4GB of virtual addresses
  • A PT will occupy 4MB
  • Storing these 4MB in virtual memory will require
    4KB of physical memory

195
Multi-level page tables (I)
  • PT is divided into
  • A master index that always remains in main memory
  • Sub indexes that can be expelled

196
Multi-level page tables (II)
[Diagram: the virtual address = < page number (primary index, secondary index) | offset >; the primary index selects an entry of the master index, which points to a subindex; the secondary index selects the frame number in that subindex; the offset is appended unchanged to the frame number to form the physical address]
197
Multi-level page tables (III)
  • Especially suited for a page size of 4 KB and 32-bit
    virtual addresses
  • Will allocate
  • 10 bits of the address for the first level,
  • 10 bits for the second level, and
  • 12 bits for the offset
  • Master index and subindexes will all have 2^10
    entries and occupy 4 KB (see the sketch below)
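A sketch (my own) of the 10/10/12 split of a 32-bit virtual address used by this two-level organization:

    def split_virtual_address(va):
        primary   = (va >> 22) & 0x3FF    # 10 bits: entry in the master index
        secondary = (va >> 12) & 0x3FF    # 10 bits: entry in the selected subindex
        offset    = va & 0xFFF            # 12 bits: copied unchanged into the physical address
        return primary, secondary, offset

    print(split_virtual_address(0x12345678))   # (72, 837, 1656)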

198
Hashed page tables (I)
  • Only contain pages that are in main memory
  • PTs are much smaller
  • Also known as inverted page tables

199
Hashed page table (II)
PN = page number, PFN = page frame number
200
Selecting the right page size
  • Increasing the page size
  • Increases the length of the offset
  • Decreases the length of the page number
  • Reduces the size of page tables
  • Fewer entries
  • Increases internal fragmentation
  • 4KB seems to be a good choice

201
MEMORY PROTECTION
202
Objective
  • Unless we have an isolated single-user system, we
    must prevent users from
  • Accessing
  • Deleting
  • Modifying
  • the address spaces of other processes, including
    the kernel

203
Historical considerations
  • Earlier operating systems for personal computers
    did not have any protection
  • They were single-user machines
  • They typically ran one program at a time
  • Windows 2000, Windows XP, Vista and MacOS X are
    protected

204
Memory protection (I)
  • VM ensures that processes cannot access page
    frames that are not referenced in their page
    table.
  • Can refine control by distinguishing among
  • Read access
  • Write access
  • Execute access
  • Must also prevent processes from modifying their
    own page tables

205
Dual-mode CPU
  • Require a dual-mode CPU
  • Two CPU modes
  • Privileged mode or executive mode that allows
    CPU to execute all instructions
  • User mode that allows CPU to execute only safe
    unprivileged instructions
  • State of the CPU is determined by a special bit

206
Switching between states
  • User mode will be the default mode for all
    programs
  • Only the kernel can run in supervisor mode
  • Switching from user mode to supervisor mode is
    done through an interrupt
  • Safe because the jump address is at a
    well-defined location in main memory

207
Memory protection (II)
  • Has additional advantages
  • Prevents programs from corrupting address spaces
    of other programs
  • Prevents programs from crashing the kernel
  • Not true for device drivers which are inside the
    kernel
  • Required part of any multiprogramming system

208
INTEGRATING CACHES AND VM
209
The problem
  • In a VM system, each byte of memory has two
    addresses
  • A virtual address
  • A physical address
  • Should cache tags contain virtual addresses or
    physical addresses?

210
Discussion
  • Using virtual addresses
  • Directly available
  • Bypass TLB
  • Cache entries specific to a given address space
  • Must flush caches when the OS selects another
    process
  • Using physical addresses
  • Must access first TLB
  • Cache entries not specific to a given address
    space
  • Do not have to flush caches when the OS selects
    another process

211
The best solution
  • Let the cache use physical addresses
  • No need to flush the cache at each context switch
  • TLB access delay is tolerable

212
Processing a memory access (I)
  • if virtual address in TLB:
        get physical address
    else:
        create TLB miss exception
        break

I use Python because it is very compact:
hetland.org/writing/instant-python.html
213
Processing a memory access (II)
  • if read_access:
        while data not in cache:
            stall
        deliver data to CPU
    else:
        write_access

Continues on next page
214
Processing a memory access (III)
  • if write_access_OK:
        while data not in cache:
            stall
        write data into cache
        update dirty bit
        put data and address in write buffer
    else:
        # illegal access
        create TLB miss exception
215
More Problems (I)
  • A virtual memory system has a virtual address
    space of 4 Gigabytes and a page size of 4
    Kilobytes. Each page table entry occupies 4
    bytes.

216
More Problems (II)
  • How many bits are used for the byte offset?
  • Since 4 K = 2^___, the byte offset will use ___ bits.
  • Highlight text in box to see the answer

Since 4 KB = 2^12 bytes, the byte offset uses 12 bits
217
More Problems (III)
  • How many bits are used for the page number?
  • Since 4 G = 2^__, we will have __-bit virtual
    addresses. Since the byte offset occupies ___ of
    these __ bits, __ bits are left for the page
    number.

The page number uses 20 bits of the address
218
More Problems (IV)
  • What is the maximum number of page table entries
    in a page table?
  • Address space / Page size = 2^__ / 2^__ = 2^___
    PTEs.

2^20 page table entries
219
More problems (VI)
  • A computer has 32 bit addresses and a page size
    of one kilobyte.
  • How many bits are used to represent the page
    number?
  • ___ bits
  • What is the maximum number of entries in a
    process page table?
  • 2^___ entries

220
Answer
  • As 1 KB = 2^10 bytes, the byte offset occupies 10
    bits
  • The page number uses the remaining 22 bits of the
    address

221
Some review questions
  • Why are TLB entries 64-bit wide while page table
    entries only require 32 bits?
  • What would be the main disadvantage of a virtual
    memory system lacking a dirty bit?
  • What is the big limitation of VM systems that
    cannot prevent processes from executing the
    contents of any arbitrary page in their address
    space?

222
Answers
  • We need extra space for storing the page number
  • It would have to write back to disk all pages
    that it expels, even when they were not modified
  • It would make the system less secure

223
VIRTUAL MACHINES
224
Key idea
  • Let different operating systems run at the same
    time on a single computer
  • Windows, Linux and Mac OS
  • A real-time OS and a conventional OS
  • A production OS and a new OS being tested

225
How it is done
  • A hypervisor /VM monitor defines two or more
    virtual machines
  • Each virtual machine has
  • Its own virtual CPU
  • Its own virtual physical memory
  • Its own virtual disk(s)

226
The virtualization process
Hypervisor
227
Reminder
  • In a conventional OS,
  • Kernel executes in privileged/supervisor mode
  • Can do virtually everything
  • User processes execute in user mode
  • Cannot modify their page tables
  • Cannot execute privileged instructions

228
[Diagram: user processes run in user mode and enter the kernel, which runs in privileged mode, through system calls]
229
Two virtual machines
[Diagram: the kernels of the two virtual machines run in user mode on top of the hypervisor, which runs in privileged mode]
230
Explanations (II)
  • Whenever the kernel of a VM issues a privileged
    instruction, an interrupt occurs
  • The hypervisor takes control and does the physical
    equivalent of what the VM attempted to do
  • Must convert virtual RAM addresses into physical
    RAM addresses
  • Must convert virtual disk block addresses into
    physical block addresses

231
Translating a block address
That's block v, w of the actual disk
Access block x, y of my virtual disk
VM kernel
Hypervisor
Virtual disk
Access block v, w of actual disk
Actual disk
232
Handling I/Os
  • Difficult task because
  • Wide variety of devices
  • Some devices may be shared among several VMs
  • Printers
  • Shared disk partition
  • Want to let Linux and Windows access the same
    files

233
Virtual Memory Issues
  • Each VM kernel manages its own memory