Chap 7: Memory Hierarchies


1
Chap 7: Memory Hierarchies
  • Shivkumar Kalyanaraman
  • Rensselaer Polytechnic Institute
  • shivkuma@ecse.rpi.edu
  • http://www.ecse.rpi.edu/Homepages/shivkuma

2
Overview
  • Memory and memory hierarchies overview
  • Caching: direct-mapped, set- and fully-associative
  • Virtual memory: paging, translation lookaside buffers (TLBs), protection issues
  • A common framework for caching and virtual memory
  • Real stuff: Pentium Pro and PowerPC caches & VM support

3
Why memory hierarchies?
[Figure: Processor-DRAM memory gap (latency), 1980-2000. Processor performance grows ~60%/yr (2X every 1.5 years, Moore's Law); DRAM latency improves ~9%/yr (2X every 10 years). The processor-memory performance gap grows ~50% per year.]
4
Impact on Performance
  • Suppose a processor executes at:
  • Clock rate = 200 MHz (5 ns per cycle)
  • Ideal CPI = 1.1
  • 50% arith/logic, 30% ld/st, 20% control
  • Suppose 10% of memory operations incur a 50-cycle miss penalty
  • CPI = ideal CPI + average stalls per instruction = 1.1 (cycles) + (0.30 (memops/ins) x 0.10 (miss/memop) x 50 (cycles/miss)) = 1.1 cycles + 1.5 cycles = 2.6
  • 58% of the time the processor is stalled waiting for memory!
  • Every 1% instruction miss rate would add an additional 0.5 cycles to the CPI!
  • Need caches to bridge this growing performance gap (see the sketch below)!
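As a concrete check of this arithmetic, here is a minimal Python sketch (variable names are mine; the numbers are the slide's):

```python
# CPI = ideal CPI + average stalls per instruction
ideal_cpi = 1.1        # cycles per instruction with no misses
memops_per_ins = 0.30  # 30% of instructions are loads/stores
miss_rate = 0.10       # 10% of memory operations miss
miss_penalty = 50      # cycles per miss

stalls = memops_per_ins * miss_rate * miss_penalty   # 1.5 cycles/instruction
cpi = ideal_cpi + stalls                             # 2.6 cycles/instruction
print(f"CPI = {cpi:.1f}; stalled {stalls / cpi:.0%} of the time")  # 58%
```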

5
Memories Review
  • SRAM:
  • value is stored on a pair of inverting gates
  • very fast, but takes up more space than DRAM (4 to 6 transistors). Access times: 5-25 ns
  • DRAM:
  • value is stored as a charge on a capacitor (must be refreshed)
  • very small, but slower than SRAM (by a factor of 5 to 10)

6
Classical DRAM Organization (square)
[Figure: a square RAM cell array. A row decoder driven by the row address asserts the word (row) select lines; the bit (data) lines feed a column selector and I/O circuits driven by the column address. Each intersection represents a 1-transistor DRAM cell.]
  • Row and column address together select 1 bit at a time
7
Improvements to DRAM access time
  • Page mode and EDO (Extended Data Out) DRAM:
  • Access a single row and store it in a buffer, which acts as an SRAM.
  • Now, with different column addresses, different bits can be randomly accessed and sent out.
  • Improvement: 120 ns -> 60 ns (page mode) -> 25 ns (EDO)
  • SDRAMs (S = synchronous) clock bits out of the buffer based upon a clock (100 MHz).
  • Avoids the need to pass multiple column addresses. Useful for burst access. Time between successive bits after the row access completes: 8-10 ns!
  • All these improvements use DRAM circuitry, adding little cost to the system, but improve performance dramatically.

8
Exploiting Memory Hierarchy
  • Users want large and fast memories! SRAM access times: 2-25 ns, cost $100 to $250/MB. DRAM access times: 60-120 ns, cost $5 to $10/MB. Disk access times: 10 to 20 million ns, cost $0.10 to $0.20/MB.
  • But we know that smaller, not larger, is faster!
  • Need to create the illusion of large, fast memories
  • Can we use a memory hierarchy?

9
Locality enables hierarchies ...
  • Although memory allows random access, programs access relatively small portions of it at any time.
  • Specifically, if an item is referenced:
  • it will tend to be referenced again soon (temporal locality)
  • nearby items may be referenced soon (spatial locality)
  • Trick: capture these small portions and place them in a small, fast memory (which we can build)!
  • Programmers and compilers must increasingly cooperate to enhance locality in programs, else the cache won't deliver and performance will degrade!

10
Modern memory hierarchy
  • By taking advantage of the principle of
    locality
  • Present the user with as much memory as is
    available in the cheapest technology.
  • Provide access at the speed offered by the
    fastest technology.

[Figure: memory hierarchy pyramid. Speed (ns): 1, 10, 100, 10,000,000, 10,000,000,000. Size (bytes): 100s, KB, MB, GB, TB.]
11
Memory Hierarchy Terminology
  • Hit: data appears in some block in the upper level (example: Block X)
  • Hit rate: the fraction of memory accesses found in the upper level
  • Hit time: time to access the upper level, which consists of SRAM access time + time to determine hit/miss
  • Miss: data must be retrieved from a block in the lower level (Block Y)
  • Miss rate = 1 - (hit rate)
  • Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor (see the sketch below)
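These definitions combine into the standard average memory access time (AMAT) formula, which the slide implies but does not spell out; a minimal sketch with assumed sample numbers:

```python
# AMAT = hit time + miss rate * miss penalty (all numbers assumed)
hit_time = 1       # cycles: upper-level access + hit/miss check
miss_rate = 0.05   # 1 - hit rate
miss_penalty = 40  # cycles: replace the block + deliver it to the processor

amat = hit_time + miss_rate * miss_penalty
print(f"AMAT = {amat} cycles")  # 3.0 cycles
```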

12
Cache
  • Two issues:
  • How do we know if a data item is in the cache? (hit check)
  • If it is, how do we find it?
  • Simple design:
  • Block size is one word of data
  • Direct mapped:
  • For each item of data at the lower level, there is exactly one location in the cache where it might be
  • i.e., lots of items at the lower level share locations in the upper level

13
Direct Mapped Cache
  • Mapping: address is modulo the number of blocks in the cache
  • Cache index = (block address) mod (cache size in blocks)
  • Cache index can be derived as a bit field in the address
  • E.g.: cache = 256 B, block size = 32 B, byte address = 300
  • Cache index = (300/32) mod (256/32) = 9 mod 8 = 1
  • Note: 300 = 1 001 01100 in binary (ignore the 5 LSBs since block size = 32; the next 3 bits, 001, are the cache index; the sketch below checks this)
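A minimal sketch that checks the example both ways, by modulo arithmetic and by extracting the bit field (helper names are mine):

```python
# Direct-mapped cache index: 256 B cache, 32 B blocks, byte address 300
cache_size, block_size, addr = 256, 32, 300
num_blocks = cache_size // block_size        # 8 blocks

# Method 1: modulo arithmetic
block_addr = addr // block_size              # 300 // 32 = 9
index = block_addr % num_blocks              # 9 mod 8 = 1

# Method 2: bit fields (5 offset bits, 3 index bits)
offset_bits = block_size.bit_length() - 1    # 5
index_bits = num_blocks.bit_length() - 1     # 3
index2 = (addr >> offset_bits) & ((1 << index_bits) - 1)

assert index == index2 == 1
print(f"address {addr:09b} -> index {index}")  # 1 001 01100 -> 1
```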

14
Direct Mapped Cache
  • For MIPS:
  • Q: What kind of locality are we exploiting here?

15
Direct Mapped Cache: larger blocks
  • Taking advantage of spatial locality

16
Hits vs. Misses
  • Read hits: this is what we want!
  • Read misses: stall the CPU, fetch the block from memory, deliver it to the cache, restart
  • Write hits (both policies are sketched below):
  • Can replace data in cache and memory (write-through)
  • Use a buffer and write controller to cut stalls (write buffer)
  • Write the data only into the cache (write-back); write data to memory when the block is being tossed out of the cache
  • Write misses:
  • Read the entire block into the cache, then write the word
  • Note: the miss penalty has increased (larger blocks transferred per write) just when we try to attack the miss rate (through use of larger blocks for spatial locality)
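A minimal sketch of the two write-hit policies, modeling cache and memory as Python dicts (all structure here is assumed; real caches do this in hardware):

```python
def write_through(cache, memory, addr, value):
    cache[addr] = value
    memory[addr] = value   # every write also goes to memory; a write buffer
                           # lets the CPU continue while this completes

def write_back(cache, dirty, addr, value):
    cache[addr] = value
    dirty.add(addr)        # memory is now stale; defer the write

def evict(cache, dirty, memory, addr):
    if addr in dirty:      # write memory only when the block is tossed out
        memory[addr] = cache[addr]
        dirty.discard(addr)
    del cache[addr]
```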

17
Reducing the miss penalty hardware
  • Memory read time = address send time + DRAM access time + data transfer time. Access time dominates.
  • Speed up:
  • 1. Through parallel access of multiple words
  • 2. Through a wider bus for transfers
  • 1 + 2 => wide memory; 1 only => interleaved memory

18
Performance
  • Increasing the block size tends to decrease the miss rate, unless the total number of blocks is very small
  • Split caches (one for instructions and one for data) are preferred because you double bandwidth by accessing both simultaneously

19
Performance
  • Simplified model: execution time = (execution cycles + stall cycles) x cycle time, where stall cycles = # of instructions x miss ratio x miss penalty (a worked sketch follows this list)
  • Where do we head from here?
  • Two ways of improving performance:
  • decreasing the miss ratio: increase associativity
  • decreasing the miss penalty: multi-level caches
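A minimal sketch of this model with assumed sample numbers:

```python
# execution time = (execution cycles + stall cycles) * cycle time
# stall cycles   = # of instructions * miss ratio * miss penalty
cycle_time = 5e-9          # seconds per cycle (200 MHz, assumed)
instructions = 1_000_000   # assumed program size
ideal_cpi = 1.1
miss_ratio = 0.02          # misses per instruction (assumed)
miss_penalty = 50          # cycles

exec_cycles = instructions * ideal_cpi
stall_cycles = instructions * miss_ratio * miss_penalty
exec_time = (exec_cycles + stall_cycles) * cycle_time
print(f"{exec_time * 1e3:.1f} ms")  # (1.1M + 1.0M cycles) * 5 ns = 10.5 ms
```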

20
Decreasing miss ratio with associativity
  • Smart placement of blocks:
  • Fully associative: a block can be placed anywhere in the cache
  • Set-associative: a block can be placed anywhere within one set of blocks in the cache

21
Implementation: 4-way set-associative cache
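The implementation figure is not reproduced in this transcript; below is a minimal software sketch of the lookup logic such a cache performs (geometry and names assumed):

```python
NUM_SETS, BLOCK = 64, 32   # assumed geometry; 4 ways per set

def lookup(cache, addr):
    """cache: list of NUM_SETS sets, each a list of up to 4 (tag, data) ways."""
    block_addr = addr // BLOCK
    set_index = block_addr % NUM_SETS       # index picks the set
    tag = block_addr // NUM_SETS            # tag identifies the block
    for way_tag, data in cache[set_index]:  # hardware compares all 4 tags in parallel
        if way_tag == tag:
            return data                     # hit
    return None                             # miss: fetch the block, evict one way

cache = [[] for _ in range(NUM_SETS)]       # an empty 4-way cache
```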
22
Performance
  • Huge improvements up to 4-way; beyond that we see diminishing returns, combined with the higher cost of implementation

23
Decreasing miss penalty with multilevel caches
  • Add a second-level cache:
  • often the primary cache is on the same chip as the processor
  • use SRAMs to add another cache above primary memory (DRAM)
  • the miss penalty goes down if the data is in the 2nd-level cache
  • Example (worked through in the sketch below):
  • CPI of 1.0 on a 500 MHz machine with a 5% miss rate and 200 ns DRAM access => new CPI = 6.0
  • Adding a 2nd-level cache with 20 ns access time and a miss rate of 2% => new CPI = 3.5
  • Using multilevel caches:
  • try to optimize the hit time on the 1st-level cache
  • try to optimize the miss rate on the 2nd-level cache
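A minimal sketch reproducing the slide's arithmetic (at 500 MHz a cycle is 2 ns, so the 200 ns DRAM access costs 100 cycles and the 20 ns L2 access costs 10 cycles):

```python
cycle_ns = 2.0                 # 500 MHz clock
dram_penalty = 200 / cycle_ns  # 100 cycles
l2_penalty = 20 / cycle_ns     # 10 cycles

cpi_l1_only = 1.0 + 0.05 * dram_penalty                      # 6.0
cpi_with_l2 = 1.0 + 0.05 * l2_penalty + 0.02 * dram_penalty  # 3.5
print(cpi_l1_only, cpi_with_l2)
```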

24
Virtual Memory
  • Main memory can act as a cache for the secondary storage (disk); used for multiprogramming, not for direct disk access
  • Goals:
  • Illusion of having more physical memory
  • Program relocation support (relieves programmer burden)
  • Protection: one program does not read/write the data of another

25
Virtual memory terminology
  • Page: equivalent of a block. Fixed size.
  • Page fault: equivalent of a miss
  • Virtual address: equivalent of the tag
  • No cache-index equivalent: fully associative. A VM table index appears because VM uses a different (page-table) implementation of full associativity.
  • Physical address: translated value of the virtual address. Can be smaller than the virtual address. No equivalent in caches.
  • Memory mapping (address translation): converting virtual to physical addresses. No equivalent in caches.
  • Valid bit: same as in caches
  • Referenced bit: used to approximate an LRU algorithm
  • Dirty bit: used to optimize write-back

26
Virtual memory design issues
  • Miss penalty is huge: disk access time is millions of cycles!
  • Highest priority: minimize misses (page faults)
  • Use fully-associative placement, large block sizes, and complex replacement strategies (implemented in software); they are worth the price!!
  • Use write-back instead of write-through. This is called copy-back in VM. Optimization: access the disk only if the page has actually been written (dirty bit).
  • Page fault => the operating system schedules another process!
  • Protection support:
  • Break up a program's code and data into pages. Add a process id to the cache index; use separate tables for different programs.
  • The OS is called via an exception; it handles page faults and protection

27
Page Tables: implement full associativity
- Place a virtual page anywhere in physical memory.
- Index the page table by the virtual page number to find the physical page address (or disk address); a minimal sketch follows.
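A sketch of the translation this describes (field layout and names assumed; a real page table lives in memory and is indexed by hardware/OS):

```python
PAGE_SIZE = 4096  # assumed page size

class PageFault(Exception):
    pass

# page_table maps virtual page number -> (valid bit, physical frame / disk addr)
def translate(page_table, vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    valid, frame = page_table[vpn]
    if not valid:
        raise PageFault(vpn)           # OS handler brings the page in from disk
    return frame * PAGE_SIZE + offset  # physical address
```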
28
Using page tables to access a word

- Separate page tables per program
- The page-table register and the table's values are part of the state of a program, called a process
29
Making Address Translation Fast
  • Why? A table (memory) access is required for every memory access (100% overhead)!
  • A hardware cache for address translations: the translation lookaside buffer (TLB). It caches entries from the page table (see the sketch below).
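A minimal sketch of a TLB consulted before the page table (the TLB is a dict here; a real TLB is a small associative hardware structure):

```python
def translate_with_tlb(tlb, page_table, vaddr, page_size=4096):
    vpn, offset = divmod(vaddr, page_size)
    if vpn in tlb:               # TLB hit: no page-table (memory) access needed
        frame = tlb[vpn]
    else:                        # TLB miss: walk the page table in memory,
        frame = page_table[vpn]  # then cache the translation for next time
        tlb[vpn] = frame
    return frame * page_size + offset
```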

30
TLBs and caches: the hierarchy in action
31
A common framework for virtual memory and caches
  • Where can a block be placed? (Placement)
  • Associativity: direct-mapped, set- or fully-associative
  • More associativity is great for small cache sizes
  • For larger cache sizes, the relative improvement is at best slight (diminishing returns)
  • How is a block found? (Search)
  • Direct-mapped, set-associative, fully associative, table/TLB
  • Cost of a miss vs. cost of implementation
  • The cost of comparators is high, and the miss-rate improvements are small
  • For VM, fully associative is a must since misses are very expensive; the extra table is also worth it. Software replacement schemes, with large page sizes, make it even more worthwhile

32
Common Framework (contd)
  • Set-associative placement is used for caches and TLBs in general
  • Which block should be replaced on a miss? (Replacement)
  • LRU is costly; only approximations are implemented, e.g. for VM
  • Random for TLBs and caches, with some hardware assistance. Penalty is only about 1.1 times that of LRU
  • What happens on a write? (Write policies)
  • Write-through: caches, usually with write buffers
  • misses are simpler to handle and cheaper, since no write to the lower level is required, and it is easy to implement
  • Write-back (copy-back): used in VM
  • write to the cache only => writes run at the speed of the cache

33
Common framework (contd)
  • multiple writes within a block require only one write to the lower level
  • can use high-bandwidth transfer effectively to write back the entire block
  • Three C's model:
  • Compulsory misses: cold-start misses
  • Capacity misses: when the cache can't contain all the blocks needed
  • Conflict misses: conflicts for positions within a set. Also called collision misses.
  • Challenge: techniques that improve the miss rate also affect some other aspect of performance or cost negatively (hit time, miss penalty)

34
Modern Systems
  • Very complicated memory systems

35
Current issues in memory hierarchies
  • Processor speeds continue to increase very fast
  • much faster than either DRAM or disk access times
  • Design challenge: dealing with this growing disparity
  • Trends:
  • Synchronous SRAMs (provide a burst of data)
  • Redesign DRAM chips to provide higher bandwidth or processing: SDRAMs, page mode, EDO
  • Restructure code to increase locality (compiler)
  • Use pre-fetching in multi-level caches
  • Out-of-order execution: finding other instructions to execute in superscalar architectures during a cache miss
  • Intelligent RAM (IRAM): 32-bit memory on the processor!

36
Summary
  • Memory hierarchy concepts
  • Caches, Virtual memories, techniques, performance
  • Similarities, differences.
  • Real stuff and current issues

37
Extras More VM issues
  • The OS does not save the page table while swapping programs; it just loads the page-table register to point to the new table
  • The OS is responsible for allocating physical memory and updating page tables. It also creates space on disk to store all pages of a process when it creates the process, plus the data structure that points to this disk location
  • LRU approximation: hardware sets the referenced bit when a page is touched. The OS periodically reads and clears these bits; this information is used to track usage (a sketch follows this list)
  • Approaches to minimize the size of the page table: e.g., page the page tables themselves!
  • Process switching => the TLB needs to be flushed and reloaded => loses locality. Hence the process id is concatenated to the tag portion of the TLB to avoid the need to clear the TLB on each context switch
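A minimal sketch of the referenced-bit scheme in the LRU bullet above; the 8-bit aging history is an assumed refinement of the periodic read-and-clear the slide describes:

```python
def touch(ref_bit, vpn):
    ref_bit[vpn] = 1      # set by hardware on every access to the page

def os_sample(ref_bit, history):
    # Periodically run by the OS: shift each sampled bit into a usage history,
    # then clear the referenced bits for the next interval.
    for vpn, bit in ref_bit.items():
        history[vpn] = ((history.get(vpn, 0) >> 1) | (bit << 7)) & 0xFF
        ref_bit[vpn] = 0

def victim(history):
    # Evict the page with the smallest history value (least recently used-ish).
    return min(history, key=history.get)
```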

38
VM issues (contd)
  • A user process cannot change page tables => they are under the control of the OS
  • Hardware requirements for protection:
  • Provide at least two modes: user and supervisor mode
  • Mode-switching mechanisms in MIPS: the syscall instruction (system call exception) and the return-from-exception (RFE) instruction
  • Space which a user process can read but not write
  • Some support for shared memory
  • A TLB miss need not imply a page fault. Both are handled by the OS.
  • Several subtle issues here, including the handling of exceptions during the exception-handling phase itself. Need the capability to disable and enable exceptions. Protection violations must also be reported to the OS through exceptions.