Chapter 7: Large and Fast: Exploiting Memory Hierarchy

Transcript and Presenter's Notes


1
Chapter 7: Large and Fast: Exploiting Memory
Hierarchy
  • Bo Cheng

2
Principle of Locality
  • Programs access a relatively small portion of
    their address space at a given time.
  • Temporal locality (locality in time): if an item
    is referenced, it will tend to be referenced
    again soon.
  • Spatial locality (locality in space): if an item
    is referenced, items whose addresses are close by
    will tend to be referenced soon. Both kinds are
    visible in the loop sketched below.
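As a concrete illustration (not from the slides), the loop below
exhibits both kinds of locality: sum is touched on every iteration
(temporal), while a[i] walks through consecutive addresses (spatial).

#include <stdio.h>

int main(void) {
    int a[1024];
    for (int i = 0; i < 1024; i++)
        a[i] = i;

    int sum = 0;                  /* reused every iteration: temporal locality */
    for (int i = 0; i < 1024; i++)
        sum += a[i];              /* consecutive addresses: spatial locality   */

    printf("%d\n", sum);
    return 0;
}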

3
Basic Structure
4
The Principle
  • Combine two concepts: locality and hierarchy.
  • Temporal locality -> keep the most recently
    accessed data items closer to the processor.
  • Spatial locality -> move blocks consisting of
    multiple contiguous words to the upper levels of
    the hierarchy.

5
Memory Hierarchy (I)
6
Memory Hierarchy (II)
  • Data is copied between adjacent levels.
  • The minimum unit of information copied is a block.
  • If the requested data appears in some block in
    the upper level, this is called a hit; otherwise
    it is a miss, and a block containing the requested
    data is copied from a lower level.
  • The hit rate, or hit ratio, is the fraction of
    memory accesses found in the upper level. The
    miss rate (1.0 - hit rate) is the fraction not
    found at the upper level.
  • Hit time: the time to access the upper level,
    including the time to determine if the access is
    a hit or a miss.
  • Miss penalty: the time to replace a block in the
    upper level, plus the time to deliver the block
    to the processor.

7
Memory Hierarchy (II)
8
Moore's Law
9
Cache
  • A safe place for hiding or storing things.
  • The level of the memory hierarchy between the
    processor and main memory.
  • Refers to any storage managed to take advantage of
    locality of access.
  • Motivation:
  • high processor cycle speed
  • low memory cycle speed
  • fast access to recently used portions of a
    program's code and data

10
The Basic Cache Concept
  1. The CPU requests data item Xn.
  2. The request results in a miss.
  3. The word Xn is brought from memory into the cache.
11
Direct Mapped Cache
  • Each memory location is mapped to exactly one
    location in the cache:
  • (block address) modulo (number of blocks in the
    cache) - see the sketch below.
  • This must answer two crucial questions:
  • How do we know if a data item is in the cache?
  • If it is, how do we find it?
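A minimal sketch of that mapping in C, assuming a 32-bit byte
address and one-word blocks; the 10-bit index is illustrative
(it matches the 1024-block figures on the next slides).

#include <stdint.h>
#include <stdio.h>

#define INDEX_BITS 10                          /* 2^10 = 1024 blocks */
#define NUM_BLOCKS (1u << INDEX_BITS)

int main(void) {
    uint32_t addr  = 0x12345678;
    uint32_t block = addr >> 2;                /* drop the 2-bit byte offset      */
    uint32_t index = block % NUM_BLOCKS;       /* address modulo number of blocks */
    uint32_t tag   = block / NUM_BLOCKS;       /* upper bits identify the block   */
    printf("index=%u tag=0x%x\n", index, tag);
    /* A hit requires: valid[index] == 1 && tags[index] == tag */
    return 0;
}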

12
The Example of Direct-Mapped Cache
14
Cache Contents
(figure: a cache with 2^n entries; n = number of
index bits, m = tag width in bits)
  • Tag
  • Identifies whether a word in the cache corresponds
    to the requested word.
  • Valid bit
  • Indicates whether an entry contains a valid
    address.
  • Data

Tag size = 32 - n - 2 (e.g., 32 - 10 - 2 = 20 bits
for n = 10)
Total size = 2^index x (valid + tag + data)
           = 2^n x (1 + m + 32) for one-word blocks
15
Direct-Mapped Example
How many total bits are required for this
direct-mapped cache?
  • A cache with:
  • 16 KB of data
  • 4-word blocks
  • 32-bit addresses

16 KB of data = 4K words = 2^12 words.
With 4-word blocks there are 2^12 / 4 = 2^10 blocks,
so the index is n = 10 bits.
Data per block = 4 words x 4 bytes x 8 bits = 128 bits.
Tag size m = 32 - n - 2 (word offset) - 2 (byte
offset) = 18 bits.
Total bits = 2^10 x (1 + 18 + 128) = 2^10 x 147
= 147 Kbits (reproduced in code below).
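The same arithmetic written out as a small C program (a sketch
that just reproduces the slide's numbers):

#include <stdio.h>

int main(void) {
    int blocks = 16 * 1024 / 4 / 4;   /* 16 KB = 4K words, 4-word blocks -> 1024 */
    int n = 0;                        /* index bits: log2(blocks)                */
    for (int b = blocks; b > 1; b >>= 1)
        n++;
    int m = 32 - n - 2 - 2;           /* tag = 32 - index - word - byte offsets  */
    int data_bits = 4 * 32;           /* 128 data bits per block                 */
    long total = (long)blocks * (1 + m + data_bits);
    printf("n=%d m=%d total=%ld Kbits\n", n, m, total / 1024);
    return 0;
}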
16
Mapping an address to a cache block
Source: http://www.faculty.uaf.edu/ffdr/EE443/
17
Block Size vs. Miss Rate
18
Handling Cache Misses
  • Stall the entire pipeline and fetch the requested
    word (sketched in code below).
  • Steps to handle an instruction cache miss:
  • Send the original PC value (PC - 4) to the memory.
  • Instruct main memory to perform a read and wait
    for the memory to complete its access.
  • Write the cache entry, putting the data from
    memory in the data portion of the entry, writing
    the upper bits of the address (from the ALU) into
    the tag field, and turning the valid bit on.
  • Restart the instruction execution at the first
    step, which will refetch the instruction, this
    time finding it in the cache.
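In a software cache model the same steps look roughly like this
(a sketch; the structures and the memory stand-in are illustrative):

#include <stdint.h>
#include <stdio.h>

#define INDEX_BITS 10
#define NUM_BLOCKS (1u << INDEX_BITS)

struct line { uint32_t tag, data; int valid; };
static struct line cache[NUM_BLOCKS];

/* Stand-in for "instruct main memory to perform a read and wait". */
static uint32_t memory_read(uint32_t addr) { return addr ^ 0xabcd1234u; }

static uint32_t fetch(uint32_t pc) {
    uint32_t block = pc >> 2;                 /* word address                    */
    uint32_t index = block & (NUM_BLOCKS - 1);
    uint32_t tag   = block >> INDEX_BITS;
    struct line *l = &cache[index];
    if (!(l->valid && l->tag == tag)) {       /* miss: stall and fill the entry  */
        l->data  = memory_read(pc);           /* wait for memory to complete     */
        l->tag   = tag;                       /* upper address bits -> tag field */
        l->valid = 1;                         /* turn the valid bit on           */
    }
    return l->data;                           /* restart: this time it hits      */
}

int main(void) {
    printf("%08x\n", fetch(0x00400000));      /* miss, then fill */
    printf("%08x\n", fetch(0x00400000));      /* hit             */
    return 0;
}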

19
Write-Through
  • A scheme in which writes always update both the
    cache and the memory, ensuring that data is
    always consistent between the two.
  • Write buffer:
  • A queue that holds data while the data are
    waiting to be written to memory.

20
Write-Back
  • A scheme that handles writes by updating values
    only in the block in the cache, then writing the
    modified block to the lower level of the
    hierarchy when the block is replaced.
  • Pro: improves performance, especially when writes
    are frequent (and couldn't be handled by a write
    buffer).
  • Con: more complex to implement.
  • Both write policies are sketched in code below.
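The two policies differ only in what a store does and what happens
on eviction; a minimal sketch (the dirty bit and the memory stand-in
are illustrative):

#include <stdint.h>
#include <stdio.h>

struct line { uint32_t data; int valid, dirty; };

/* Illustrative stand-in for the memory interface. */
static void memory_write(uint32_t addr, uint32_t data) {
    printf("mem[%08x] <- %08x\n", addr, data);
}

static void store_write_through(struct line *l, uint32_t addr, uint32_t data) {
    l->data = data;                       /* update the cache ...              */
    memory_write(addr, data);             /* ... and memory, on every store    */
}

static void store_write_back(struct line *l, uint32_t data) {
    l->data  = data;                      /* update only the cached block      */
    l->dirty = 1;                         /* remember to write it on eviction  */
}

static void evict(struct line *l, uint32_t addr) {
    if (l->valid && l->dirty)             /* write-back: a modified block      */
        memory_write(addr, l->data);      /* reaches memory only when replaced */
    l->valid = l->dirty = 0;
}

int main(void) {
    struct line a = { 0, 1, 0 }, b = { 0, 1, 0 };
    store_write_through(&a, 0x1000, 42);  /* memory updated immediately */
    store_write_back(&b, 7);              /* memory not touched yet     */
    evict(&b, 0x2000);                    /* modified block written now */
    return 0;
}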

21
Cache Performance
  • CPU time = (CPU execution clock cycles +
    Memory-stall clock cycles) x Clock cycle time
  • Memory-stall clock cycles = Read-stall cycles +
    Write-stall cycles
  • Read-stall cycles = (Reads / Program) x Read miss
    rate x Read miss penalty
  • Write-stall cycles = (Writes / Program) x Write
    miss rate x Write miss penalty + Write buffer
    stalls
  • Memory-stall clock cycles = (Memory accesses /
    Program) x Miss rate x Miss penalty
  • Memory-stall clock cycles = (Instructions /
    Program) x (Misses / Instruction) x Miss penalty
    (this last form is used in the sketch below)
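The per-instruction form, plugged into code; the base CPI of 2 and
the stall CPI of 1.38 come from the example on the next slides,
while the instruction count, miss rate, penalty, and clock are
illustrative values chosen to produce them:

#include <stdio.h>

int main(void) {
    double instructions   = 1e9;     /* illustrative program size       */
    double base_cpi       = 2.0;     /* CPI with a perfect cache        */
    double misses_per_ins = 0.0345;  /* chosen so that stall CPI = 1.38 */
    double miss_penalty   = 40.0;    /* cycles                          */
    double cycle_time     = 0.5e-9;  /* seconds                         */

    double stall_cpi = misses_per_ins * miss_penalty;
    double cpu_time  = instructions * (base_cpi + stall_cpi) * cycle_time;
    printf("stall CPI = %.2f, total CPI = %.2f, CPU time = %.3f s\n",
           stall_cpi, base_cpi + stall_cpi, cpu_time);
    return 0;
}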

22
The Example
Source: http://www.faculty.uaf.edu/ffdr/EE443/
(memory-stall CPI of 1.38 + base CPI of 2 = total
CPI of 3.38)
23
What if...?
  • What if the processor is made faster, but the
    memory system stays the same?
  • Speed up the machine by improving the CPI from 2
    to 1 without increasing the clock rate.
  • The system with a perfect cache would now be
    2.38 / 1 = 2.38 times faster.
  • The fraction of time spent on memory stalls rises
    from 1.38 / 3.38 = 41% to 1.38 / 2.38 = 58%.

24
What if...?
25
Our Observations
  • Relative cache penalties increase as a processor
    becomes faster.
  • The lower the CPI, the more pronounced the impact
    of stall cycles.
  • If the main memory system is the same, a higher
    CPU clock rate leads to a larger miss penalty (in
    cycles).

26
Decreasing the miss ratio with associative caches
  • Direct-mapped cache: a cache structure in which
    each memory location is mapped to exactly one
    location in the cache.
  • Set-associative cache: a cache that has a fixed
    number of locations (at least two) where each
    block can be placed.
  • Fully associative cache: a cache structure in
    which a block can be placed in any location in
    the cache.

27
The Example
Where can memory block 12 be placed in an
eight-block cache?
  • Direct mapped: (12 mod 8) = block 4
  • Two-way set associative (four sets): (12 mod 4) =
    set 0
  • Fully associative: it can appear in any of the
    eight cache blocks
(the sketch below computes these placements)
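A sketch of the three placement rules for block address 12 in an
eight-block cache:

#include <stdio.h>

int main(void) {
    int block = 12, num_blocks = 8;

    int dm_slot = block % num_blocks;  /* direct mapped: exactly one slot -> 4 */
    int sets    = num_blocks / 2;      /* two-way: 4 sets of 2 blocks each     */
    int sa_set  = block % sets;        /* set associative: one set -> set 0    */
    printf("direct-mapped slot %d, 2-way set %d, "
           "fully associative: any of %d blocks\n",
           dm_slot, sa_set, num_blocks);
    return 0;
}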
28
One More Example: Direct Mapped
5 Misses
29
Two-Way Set Associative Cache
Which block should be replaced? The commonly used
choice is the LRU scheme.
Least recently used (LRU): a replacement scheme in
which the block replaced is the one that has been
unused for the longest time (a two-way LRU sketch
follows below).
4 Misses
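For two-way associativity, LRU needs only one bit per set marking
the way used least recently. A sketch, assuming the textbook's
reference stream of block addresses 0, 8, 0, 6, 8 and a four-block
(two-set) cache, which reproduces the 4 misses above:

#include <stdint.h>
#include <stdio.h>

#define SETS 2                           /* 4 blocks, 2 ways per set          */

struct way { uint32_t tag; int valid; };
static struct way cache[SETS][2];
static int lru[SETS];                    /* which way was used least recently */

static int access_block(uint32_t block) {
    int set = block % SETS;
    uint32_t tag = block / SETS;
    for (int w = 0; w < 2; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            lru[set] = 1 - w;            /* the other way is now the LRU one  */
            return 1;                    /* hit  */
        }
    int victim = lru[set];               /* replace the block unused longest  */
    cache[set][victim].tag   = tag;
    cache[set][victim].valid = 1;
    lru[set] = 1 - victim;
    return 0;                            /* miss */
}

int main(void) {
    uint32_t refs[] = { 0, 8, 0, 6, 8 };
    int misses = 0;
    for (int i = 0; i < 5; i++)
        misses += !access_block(refs[i]);
    printf("%d misses\n", misses);       /* prints 4 */
    return 0;
}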
30
The Implementation of 4-Way Set Associative Cache
31
Fully Associative Cache
3 Misses
Increasing the degree of associativity -> a decrease
in the miss rate
32
Performance of Multilevel Cache
Speedup with a second-level cache: 11 / 4 = 2.8
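The 11 / 4 figure matches the classic two-level cache example; a
sketch with those assumed parameters (base CPI 1, a 500-cycle main
memory penalty, a 25-cycle L2 penalty, 2% L1 and 0.5% global miss
rates per instruction):

#include <stdio.h>

int main(void) {
    double base_cpi   = 1.0;
    double l1_miss    = 0.02;    /* misses per instruction to the next level */
    double mem_cycles = 500.0;   /* main-memory penalty in clock cycles      */
    double l2_cycles  = 25.0;    /* L2 access penalty in clock cycles        */
    double l2_miss    = 0.005;   /* misses per instruction that go to memory */

    double cpi_no_l2 = base_cpi + l1_miss * mem_cycles;             /* 11   */
    double cpi_l2    = base_cpi + l1_miss * l2_cycles
                                + l2_miss * mem_cycles;             /* 4    */
    printf("speedup with L2 = %.2f\n", cpi_no_l2 / cpi_l2);         /* 2.75 */
    return 0;
}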
33
Designing the Memory System to Support Caches (I)
  • Consider hypothetical memory system parameters:
  • 1 memory bus clock cycle to send the address
  • 15 memory bus clock cycles to initiate each DRAM
    access
  • 1 memory bus clock cycle to transfer a word of
    data
  • a cache block of 4 words
  • a 1-word-wide bank of DRAMs
  • The miss penalty is 1 + 4 x 15 + 4 x 1 = 65
    clock cycles.
  • Number of bytes transferred per clock cycle per
    miss (reproduced in code below):
  • (4 x 4) / 65 = 0.25

34
Designing the Memory System to Support Caches (II)
35
Virtual Memory
  • The technique in which main memory acts as a
    "cache" for the secondary storage
  • automatically manages main memory and secondary
    storage
  • Motivation
  • allow efficient sharing of memory among multiple
    programs
  • remove the programming burdens of a small,
    limited amount of main memory

36
Basic Concepts of Virtual Memory
Source: http://www.faculty.uaf.edu/ffdr/EE443/
  • Virtual memory allows each program to exceed the
    size of primary memory.
  • It automatically manages two levels of the memory
    hierarchy:
  • Main memory (physical memory)
  • Secondary storage
  • Same concepts as in caches, different terminology:
  • A virtual memory block is called a page.
  • A virtual memory miss is called a page fault.
  • The CPU produces a virtual address, which is
    translated to a physical address used to access
    main memory. This process (accomplished by a
    combination of HW and SW) is called memory mapping
    or address translation (sketched in code below).
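A software sketch of the translation step, assuming 32-bit virtual
addresses and 4-KB pages; the page-table layout is illustrative
(real hardware walks the table via the page table register):

#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 12                         /* 4-KB pages         */
#define NUM_PAGES (1u << (32 - PAGE_BITS))   /* 2^20 virtual pages */

struct pte { uint32_t ppn; int valid; };     /* one entry per page */
static struct pte page_table[NUM_PAGES];

static uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;    /* virtual page number */
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    struct pte e = page_table[vpn];
    if (!e.valid) {
        /* page fault: the OS must fetch the page from disk */
        fprintf(stderr, "page fault at %08x\n", vaddr);
        return 0;
    }
    return (e.ppn << PAGE_BITS) | offset;    /* physical address    */
}

int main(void) {
    page_table[0x00400] = (struct pte){ 0x00123, 1 };  /* sample mapping */
    printf("%08x\n", translate(0x00400abc));           /* -> 00123abc    */
    return 0;
}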

37
Mapping from a Virtual to Physical Address
2^32 bytes = 4 GB of virtual address space
2^30 bytes = 1 GB of physical memory
38
High Cost of a Miss
  • Page fault takes millions of cycles to process
  • E.g., main memory is 100,000 times faster than
    disk
  • For typical page sizes, this time is dominated by
    the time it takes to get the first word
  • Key decisions
  • Page size large enough to amortize the high
    access time
  • Pick organization that reduces page fault rate
    (e.g., fully associative placement of pages)
  • Handle page faults in software (overhead is small
    compared to disk access times) and use clever
    algorithms for page placement
  • Use write-back

39
Page Table
  • Contains the virtual-to-physical address
    translations in a virtual memory system.
  • Resides in memory.
  • Indexed with the page number from the virtual
    address.
  • Contains the corresponding physical page number.
  • Each program has its own page table.
  • Hardware includes a register pointing to the
    start of the page table (the page table register).

40
Page Table Size
  • For example, consider:
  • 32-bit virtual addresses,
  • 4-KB page size,
  • 4 B per page table entry.
  • Number of page table entries:
  • 2^32 / 2^12 = 2^20
  • Size of the page table:
  • 2^20 x 4 B = 4 MB
    (computed in the sketch below)
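The same page-table sizing in code:

#include <stdio.h>

int main(void) {
    int va_bits    = 32;                        /* virtual address width */
    int page_bits  = 12;                        /* 4-KB pages            */
    int entry_size = 4;                         /* bytes per entry       */

    long entries = 1L << (va_bits - page_bits); /* 2^20 entries          */
    long bytes   = entries * entry_size;        /* 4 MB                  */
    printf("%ld entries, %ld MB\n", entries, bytes >> 20);
    return 0;
}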

41
Page Faults
  • A page fault occurs when a valid bit (V) is found
    to be 0.
  • Control is transferred to the operating system
    (using the exception mechanism).
  • The operating system must find the appropriate
    page in the next level of the hierarchy and
    decide where to place it in main memory.
  • Where is the page on the disk?
  • The information can be found either in the same
    page table or in a separate structure.
  • The OS creates the space on disk for all the
    pages of a process at the time it creates the
    process.
  • At the same time, it creates a data structure
    that records the location of each page.

42
The Translation-Lookaside Buffer (TLB)
  • Each memory access by a program requires two
    memory accesses:
  • one to obtain the physical address (by referencing
    the page table), and
  • one to get the data.
  • Because of the spatial and temporal locality
    within each page, a translation for a virtual
    page will likely be needed again in the near
    future.
  • To speed this process up, include a special cache
    that keeps track of recently used translations
    (sketched below).
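A sketch of a small, fully associative TLB sitting in front of the
page-table walk; the sizes, the round-robin replacement, and the
identity-mapped slow path are illustrative:

#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 16

struct tlb_entry { uint32_t vpn, ppn; int valid; };
static struct tlb_entry tlb[TLB_ENTRIES];
static int next_victim;                    /* simple round-robin replacement */

/* Slow-path stand-in: a real walk reads the page table in memory. */
static uint32_t page_table_lookup(uint32_t vpn) { return vpn; }

static uint32_t translate(uint32_t vaddr) {
    uint32_t vpn = vaddr >> 12, offset = vaddr & 0xfff;
    for (int i = 0; i < TLB_ENTRIES; i++)  /* hit: no page-table access     */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].ppn << 12) | offset;
    uint32_t ppn = page_table_lookup(vpn); /* miss: the extra memory access */
    tlb[next_victim].vpn   = vpn;
    tlb[next_victim].ppn   = ppn;
    tlb[next_victim].valid = 1;
    next_victim = (next_victim + 1) % TLB_ENTRIES;
    return (ppn << 12) | offset;
}

int main(void) {
    printf("%08x\n", translate(0x00400abc)); /* TLB miss, then fill */
    printf("%08x\n", translate(0x00400def)); /* TLB hit             */
    return 0;
}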

43
The Translation-Lookaside Buffer (TLB)
44
Processing read/write requests
45
Where Can a Block Be Placed?
  1. An increase in the degree of associativity
     usually decreases the miss rate.
  2. The improvement in miss rate comes from reduced
     competition for the same location.
46
How Is a Block Found?
47
What block is replaced on a miss?
  • Which blocks are candidates for replacement?
  • In a fully associative cache, all blocks are
    candidates.
  • In a set-associative cache, all the blocks in
    the set.
  • In a direct-mapped cache, there is only one
    candidate.
  • In set-associative and fully associative caches,
    use one of two strategies:
  • 1. Random (use hardware assistance to make it
    fast).
  • 2. LRU (least recently used); usually too
    complicated even for four-way associativity.

48
How Are Writes Handled?
  • There are two basic options:
  • Write-through: the information is written to
    both the block in the cache and the block in
    the lower level of the memory hierarchy.
  • Write-back: the modified block is written to the
    lower level only when it is replaced.
  • ADVANTAGES of WRITE-THROUGH:
  • Misses are cheaper and simpler.
  • Easier to implement (although it usually requires
    a write buffer).
  • ADVANTAGES of WRITE-BACK:
  • The CPU can write at the rate that the cache can
    accept.
  • Writes can be combined.
  • Effective use of bandwidth (writing the entire
    block).
  • Virtual memory is a special case: only
    write-back is practical.

49
The Big Picture
  • 1. Where to place a block?
  • One place (direct-mapped)
  • A few places (set-associative)
  • Any place (fully associative)
  • 2. How to find a block?
  • Indexing (direct-mapped)
  • Limited search (set-associative)
  • Full search (fully associative)
  • Separate lookup table (page table)
  • 3. Which block should be replaced on a cache
    miss?
  • Random
  • LRU
  • 4. What happens on a write?
  • Write-through
  • Write-back

50
The 3Cs
  • Compulsory misses: caused by the first access to
    a block that has never been in the cache
    (cold-start misses).
  • Remedy: INCREASE THE BLOCK SIZE (at the cost of
    an increased miss penalty).
  • Capacity misses: caused when the cache cannot
    contain all the blocks needed by the program;
    blocks are replaced and later retrieved again.
  • Remedy: INCREASE THE CACHE SIZE (access time
    increases as well).
  • Conflict misses: occur when multiple blocks
    compete for the same set (collision misses).
  • Remedy: INCREASE ASSOCIATIVITY (may slow down
    access time).

51
The Design Challenges