1
Chapter Seven, Part I: Cache Memory
2
Memories Review
  • SRAM
  • value is stored on a pair of inverting gates
  • very fast, but takes more space than DRAM (4 to 6 transistors per bit)
  • DRAM
  • value is stored as a charge on a capacitor (must be refreshed)
  • much smaller per bit, but slower than SRAM (by a factor of 5 to 10)

3
Exploiting Memory Hierarchy
  • Users want large and fast memories!
  • SRAM access times are 0.5-5 ns, at a cost of $4,000 to $10,000 per GB (2004)
  • DRAM access times are 50-70 ns, at a cost of $100 to $200 per GB
  • Disk access times are 5 to 20 million ns, at a cost of $0.50 to $2 per GB
  • Try and give it to them anyway
  • build a memory hierarchy

4
Locality
  • The principle that makes having a memory hierarchy a good idea
  • temporal locality: if an item is referenced, it will tend to be referenced again soon
  • spatial locality: if an item is referenced, nearby items will tend to be referenced soon
  • Why does code have locality?
  • Library books analogy
  • Our initial focus: two levels (upper, lower)
  • block: minimum unit of data
  • hit: data requested is in the upper level
  • miss: data requested is not in the upper level

5
Cache
  • Two issues
  • How do we know if a data item is in the cache?
  • If it is, how do we find it?
  • Our first example
  • block size is one word of data
  • "direct mapped"

For each item of data at the lower level, there
is exactly one location in the cache where it
might be, i.e., many items at the lower level
share the same location in the upper level.
6
Direct Mapped Cache
  • Mapping: cache index = (block address) modulo (number of blocks in the cache); see the sketch below
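
A minimal C sketch of this mapping (the 8-block cache and the reference addresses are taken from the example on the following slides; the code itself is illustrative, not from the book):

    #include <stdio.h>

    /* Direct-mapped placement: each memory block maps to exactly one
       cache block, (block address) modulo (number of cache blocks).
       With a power-of-two block count, the modulo is simply the
       low-order bits of the block address. */
    int main(void) {
        unsigned num_blocks = 8;  /* cache size from the example: 8 blocks */
        unsigned refs[] = {22, 26, 16, 3, 18};
        for (int i = 0; i < 5; i++) {
            printf("block %2u -> cache index %u\n",
                   refs[i], refs[i] % num_blocks);
        }
        return 0;
    }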

7
Direct Mapped Cache
  • Add a set of tags and a valid bit
  • The tag contains the upper portion of the address
  • The valid bit indicates whether an entry contains
    a valid address
  • Example: 1 word/block, memory of 32 words => 32 blocks, cache of 8 blocks

Decimal address of reference   Binary address   Hit or miss   Assigned cache block
22                             10110            miss          110
26                             11010            miss          010
22                             10110            hit           110
26                             11010            hit           010
16                             10000            miss          000
3                              00011            miss          011
16                             10000            hit           000
18                             10010            miss          010
Index V Tag Data
000
001
010
011
100
101
110
111
Note that the addresses in this example are word
addresses. Q: How do you obtain the byte address
from a word address?
8
Direct Mapped Cache
Index V Tag Data
000
001
010
011
100
101
110
111
Decimal word address   Binary word address   Memory block   Tag and assigned cache block
22                     10110                 10110          10 110
26                     11010                 11010          11 010
22                     10110                 10110          10 110
26                     11010                 11010          11 010
16                     10000                 10000          10 000
3                      00011                 00011          00 011
16                     10000                 10000          10 000
18                     10010                 10010          10 010
9
Direct Mapped Cache 2 words/block
Index V Tag Data
000
001
010
011
100
101
110
111
Decimal word address   Binary word address   Memory block   Tag and assigned cache block
22                     10110                 1011           1 011
26                     11010                 1101           1 101
22                     10110                 1011           1 011
27                     11011                 1101           1 101
16                     10000                 1000           1 000
3                      00011                 0001           0 001
17                     10001                 1000           1 000
18                     10010                 1001           1 001
10
Direct Mapped Cache
  • For MIPS (4 bytes/word)
  • What kind of locality are we taking
    advantage of?

11
Direct Mapped Cache
  • What is the address format?
  • How many bits are required to build a cache?
  • Example (page 479)
  • 32-bit address, byte-addressable, 4 bytes/word, 4 words/block
  • cache size 16 KB
  • 16 KB / (4 bytes/word) = 4K words
  • 4K words / (4 words/block) = 1K blocks
  • 10 bits for block index, 2 bits for block offset, 2 bits for byte offset
  • => 32 - 10 - 2 - 2 = 18 bits for tag
  • Address format:
        bits 31-14: tag | bits 13-4: index | bits 3-2: block offset | bits 1-0: byte offset
  • For each block: 1 bit for V, 18 bits for tag, 4x32 bits for data
  • 1 + 18 + 128 = 147 bits/block
  • Total cache size = 147 bits/block x 1K blocks = 147 Kbits, more than the 128 Kbits (16 KB) of data alone (see the sketch below)
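
A short C check of this bit counting (all parameters are the slide's; the code is an illustrative sketch, not from the book):

    #include <stdio.h>

    /* Bit count for a 16 KB direct-mapped cache: 32-bit byte address,
       4 bytes/word, 4 words/block, 1K blocks. */
    int main(void) {
        unsigned addr_bits    = 32;
        unsigned byte_offset  = 2;                      /* log2(4 bytes/word)  */
        unsigned block_offset = 2;                      /* log2(4 words/block) */
        unsigned num_blocks   = (16 * 1024) / (4 * 4);  /* 1K blocks           */
        unsigned index_bits   = 10;                     /* log2(1024)          */
        unsigned tag_bits = addr_bits - index_bits - block_offset - byte_offset;
        unsigned bits_per_block = 1 + tag_bits + 4 * 32; /* valid + tag + data */
        printf("tag: %u bits, %u bits/block, total %u Kbits\n",
               tag_bits, bits_per_block, bits_per_block * num_blocks / 1024);
        return 0;
    }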

12
Direct Mapped Cache
  • Taking advantage of spatial locality (>1 word per block)
  • Example FastMATH embedded microprocessor, 16 KB
    cache, 16 words / block

13
Direct Mapped Cache
Consider a cache with 64 blocks and a block size
of 16 bytes. What block number does byte address
1204 map to?
Its word address is 1204/4 = 301. With 4 words
per block, word address 301 is block address
301/4 = 75. The block offset is 1, which is 01
in binary. The index is 75 modulo 64 = 11, which
is 001011 in binary. The tag is 1, which is
00 0000 0000 0000 0000 0001 in binary.
Binary solution: 64 blocks => 6 bits for index,
2 bits for block offset, 2 bits for byte offset:
1204 = 0000 0000 0000 0000 0000 01 | 00 1011 | 01 | 00
       tag                           index     block offset  byte offset
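
A C sketch that extracts the same fields with shifts and masks (the field widths follow the breakdown above; the constants are ours):

    #include <stdio.h>

    /* Decompose byte address 1204 for a cache with 64 blocks and
       16 bytes (4 words) per block: 2 byte-offset bits, 2 block-offset
       bits, 6 index bits, and the remaining 22 bits of tag. */
    int main(void) {
        unsigned addr         = 1204;
        unsigned byte_offset  = addr & 0x3;         /* bits 1..0   */
        unsigned block_offset = (addr >> 2) & 0x3;  /* bits 3..2   */
        unsigned index        = (addr >> 4) & 0x3F; /* bits 9..4   */
        unsigned tag          = addr >> 10;         /* bits 31..10 */
        printf("tag=%u index=%u block_offset=%u byte_offset=%u\n",
               tag, index, block_offset, byte_offset);
        /* prints: tag=1 index=11 block_offset=1 byte_offset=0 */
        return 0;
    }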
14
Hits vs. Misses
  • Read hits
  • this is what we want!
  • Read misses
  • stall the CPU, fetch block from memory, deliver
    to cache, restart
  • Write hits
  • can replace data in cache and memory
    (write-through)
  • write the data only into the cache (write-back
    the cache later)
  • Write misses
  • read the entire block into the cache, then write
    the word

15
Direct Mapped Cache handling a read miss
  • If an instruction fetch results in a cache miss:
  • Send the value of the current PC - 4 to the memory
  • Instruct the memory to read, and wait for the memory to complete its access
  • Write the cache entry: put the data from memory into the cache entry, write the upper bits of the address (from the ALU) into the tag field, and turn the valid bit on
  • Restart the instruction execution at the first step, which re-fetches the instruction, this time finding it in the cache

16
Direct Mapped Cache handling writes
  • Data inconsistency on a store instruction: after the data is written into the cache, memory would have a different value from that in the cache
  • Solutions
  • Write through: always write the data into both the memory and the cache
  • Write buffer: a buffer holds the data while it is being written to the memory, so the processor can continue execution
  • Write back: the modified block in the cache is written to the memory only when it needs to be replaced in the cache

17
Hardware Issues
  • Make reading multiple words easier by using banks
    of memory

18
Increasing Memory Bandwidth
  • Widen the memory and the bus between the memory
    and processor
  • Reduce the time by minimizing the number of times
    we must start a new memory access

Assuming 4 words per block, 1 clock cycle to
send the address, 15 clock cycles to initiate each
DRAM access, and 1 clock cycle to send a word of
data:
With a main memory width of 1 word, the miss
penalty is 1 + 4x15 + 4x1 = 65 cycles. With a
main memory width of 4 words, the miss penalty is
1 + 15 + 1 = 17 cycles. (A comparison sketch
follows the next slide.)
19
Increasing Memory Bandwidth
  • Increase bandwidth by using multiple memory banks, without widening the memory or the bus
  • One address is sent to 4 banks of memory, reading
    is done simultaneously
  • Reduce the time by minimizing the number of times
    we must start a new memory access

Assuming 4 words per block, 1 clock cycle to
send the address, 15 clock cycles to initiate
DRAM access and 1 clock cycle to send a word of
data
With 4 memory banks, each 1 word wide, the miss
penalty is 1 + 15 + 4x1 = 20 cycles.
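
A small C check of the three miss-penalty formulas from this slide and the previous one (the cycle counts are the slides' assumptions; the variable names are ours):

    #include <stdio.h>

    /* Miss penalty for a 4-word block: 1 cycle to send the address,
       15 cycles per DRAM access, 1 cycle per one-word bus transfer. */
    int main(void) {
        int words = 4, addr = 1, dram = 15, xfer = 1;
        int one_word_wide  = addr + words * dram + words * xfer; /* 65 */
        int four_word_wide = addr + dram + xfer;                 /* 17 */
        int four_banks     = addr + dram + words * xfer;         /* 20 */
        printf("1-word-wide: %d, 4-word-wide: %d, 4 banks: %d cycles\n",
               one_word_wide, four_word_wide, four_banks);
        return 0;
    }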
20
AMAT
  • Average memory access time (AMAT)
  • Average time to access memory considering both
    hits and misses and the frequency of different
    accesses
  • Used to capture the fact that the time to access
    data for both hits and misses affects performance
  • Useful as a figure of merit for different cache
    systems
  • AMAT = Time for a hit + Miss rate x Miss penalty
  • Exercise 7.17 [5] <7.2> Find the AMAT for a processor with a 2 ns clock, a miss penalty of 20 clock cycles, a miss rate of 0.05 misses per instruction, and a cache access time (including hit detection) of 1 clock cycle. Assume that the read and write miss penalties are the same and ignore other write stalls.
  • Exercise 7.18 [5] <7.2> Suppose we can improve the miss rate to 0.03 misses per reference by doubling the cache size. This causes the cache access time to increase to 1.2 clock cycles. Using the AMAT as a metric, determine whether this is a good trade-off. (A quick check follows below.)
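
A quick C check of both exercises using the AMAT formula above (this is our arithmetic, not the book's published solution):

    #include <stdio.h>

    /* AMAT = hit time + miss rate * miss penalty, in cycles,
       converted to ns with the 2 ns clock. */
    int main(void) {
        double clock_ns = 2.0, penalty = 20.0;
        double amat_717 = (1.0 + 0.05 * penalty) * clock_ns; /* 2.0 cycles */
        double amat_718 = (1.2 + 0.03 * penalty) * clock_ns; /* 1.8 cycles */
        printf("7.17: %.1f ns, 7.18: %.1f ns\n", amat_717, amat_718);
        /* 4.0 ns vs. 3.6 ns: the larger, slower cache still lowers AMAT */
        return 0;
    }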

21
Performance
  • Increasing the block size tends to decrease the miss rate
  • But if the block size becomes a significant fraction of the cache size, the number of cache blocks becomes small
  • A block will then be replaced in the cache before many of its words are accessed
  • This increases the miss rate
  • How do we solve this?

22
Performance
  • Simplified model:
        execution time = (execution cycles + stall cycles) x cycle time
        stall cycles = # of instructions x miss ratio x miss penalty
  • Two ways of improving performance
  • decreasing the miss ratio
  • decreasing the miss penalty
  • What happens if we increase block size?

23
Performance
Assume an instruction cache miss rate for gcc of
2% and a data miss rate of 4%. If a machine has a
CPI of 2 without any memory stalls and the miss
penalty is 100 cycles for all misses, determine
how much faster a machine would run with a
perfect cache that never missed. The frequency of
loads and stores for gcc is 36%.
To run a program, the CPU fetches both
instructions and data from the memory.
Instruction miss cycles = IC x 2% x 100 = 2.00 x IC
Data miss cycles = IC x 36% x 4% x 100 = 1.44 x IC
Total memory-stall cycles = 2.00 x IC + 1.44 x IC = 3.44 x IC
CPI with memory stalls = 2 + 3.44 = 5.44
(CPU time with stalls) / (CPU time perfect)
  = (IC x CPIstall x Clock cycle) / (IC x CPIperfect x Clock cycle)
  = 5.44 / 2 = 2.72
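
The same stall arithmetic in a few lines of C (all numbers are from the example above):

    #include <stdio.h>

    /* gcc example: base CPI 2, 100-cycle miss penalty, 2% instruction
       miss rate, 4% data miss rate, loads/stores are 36% of instructions. */
    int main(void) {
        double base_cpi = 2.0, penalty = 100.0;
        double inst_stall = 0.02 * penalty;        /* 2.00 cycles/instruction */
        double data_stall = 0.36 * 0.04 * penalty; /* 1.44 cycles/instruction */
        double cpi_stall = base_cpi + inst_stall + data_stall; /* 5.44 */
        printf("CPI with stalls: %.2f, slowdown vs. perfect cache: %.2fx\n",
               cpi_stall, cpi_stall / base_cpi);
        return 0;
    }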
24
Decreasing Miss Ratio with Associativity
  • An n-way set-associative cache consists of a number of sets; each set has n blocks
  • Each memory block is mapped to a unique set in the cache, given by the index field
  • A memory block can be placed in any element of that set
  • set index = (Block number) modulo (Number of sets in the cache)

25
Decreasing Miss Ratio with Associativity
  • If the block address is 13 and the cache has 8 blocks:
  • Set index = (Block number) modulo (Number of sets in the cache)
  • one-way (direct mapped): 13 modulo 8 = 5
  • two-way: 13 modulo 4 = 1
  • four-way: 13 modulo 2 = 1 (see the sketch below)
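
A minimal C sketch of the set-index computation at each associativity (the 8-way row, equivalent to fully associative here, is our addition):

    #include <stdio.h>

    /* Set index for block address 13 in an 8-block cache as the
       associativity grows; the number of sets is 8 / ways. */
    int main(void) {
        unsigned block = 13, blocks = 8;
        for (unsigned ways = 1; ways <= 8; ways *= 2) {
            unsigned sets = blocks / ways;
            printf("%u-way: %u sets, block 13 -> set %u\n",
                   ways, sets, block % sets);
        }
        return 0;
    }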

26
Set-Associative Cache
  • Compared to direct mapped, give a series of
    references that
  • results in a lower miss ratio using a 2-way set
    associative cache
  • results in a higher miss ratio using a 2-way set
    associative cache
  • assuming we use the least recently used
    replacement strategy

27
An Implementation of 4-Way Set-Associative Cache
28
Set-Associative Cache Issues
  • Locating a block in the cache
  • Address format:
  • Tag | Set Index | Block Offset | Byte Offset
  • Block replacement
  • LRU (Least Recently Used)
  • Random

29
Set-Associative Cache Example
Assume a cache of 4K blocks, a four-word block
size, and a 32-bit address. Find the total number
of sets and the total number of tag bits for
caches that are direct mapped, two-way and
four-way set associative, and fully associative.
(Page 504)
The direct-mapped cache has as many sets as
blocks, 4K = 2^12, hence the total number of tag
bits is (32 - 12 - 4) x 4K = 64 Kbits.
For a 2-way set-associative cache, there are
2K = 2^11 sets. The total number of tag bits is
(32 - 11 - 4) x 2 x 2K = 68 Kbits.
For a 4-way set-associative cache, there are
1K = 2^10 sets. The total number of tag bits is
(32 - 10 - 4) x 4 x 1K = 72 Kbits.
For a fully associative cache, there is only 1
set. The total number of tag bits is
(32 - 0 - 4) x 4K x 1 = 112 Kbits.
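
A short C sketch reproducing these tag-bit totals (the log2 loop is an illustrative helper; everything else follows the example):

    #include <stdio.h>

    /* Tag storage for a 4K-block cache, 4-word blocks (4 offset bits),
       32-bit addresses, at each associativity from the example. */
    int main(void) {
        unsigned addr_bits = 32, offset_bits = 4, blocks = 4096;
        unsigned ways_list[] = {1, 2, 4, 4096}; /* direct ... fully assoc. */
        for (int i = 0; i < 4; i++) {
            unsigned sets = blocks / ways_list[i];
            unsigned index_bits = 0;
            while ((1u << index_bits) < sets) index_bits++; /* log2(sets) */
            unsigned tag_bits = addr_bits - index_bits - offset_bits;
            printf("%4u-way: %4u sets, %2u tag bits, %3u Kbits of tags\n",
                   ways_list[i], sets, tag_bits, tag_bits * blocks / 1024);
        }
        return 0;
    }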
30
Performance
31
Decreasing Miss Penalty with Multilevel Caches
  • Add a second-level cache
  • often the primary cache is on the same chip as the processor
  • use SRAMs to add another level of cache between the primary cache and main memory (DRAM)
  • the miss penalty goes down if data is found in the 2nd-level cache
  • Example (pages 505-506), sketched below
  • CPI of 1.0 on a 5 GHz machine with a 5% miss rate and 100 ns DRAM access
  • Adding a 2nd-level cache with a 5 ns access time decreases the miss rate to main memory to 0.5%
  • Using multilevel caches
  • try to optimize the hit time on the 1st-level cache
  • try to optimize the miss rate on the 2nd-level cache
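
A sketch of the CPI arithmetic this example implies, assuming the 100 ns and 5 ns latencies convert to 500 and 25 cycles at 5 GHz, and that 0.5% is the miss rate all the way to DRAM (these conversions are our reading of the book's method, not quoted from it):

    #include <stdio.h>

    /* Multilevel-cache example: base CPI 1.0 at 5 GHz (0.2 ns/cycle),
       100 ns DRAM access, 5 ns L2 access, 5% primary miss rate,
       0.5% of accesses still missing all the way to DRAM. */
    int main(void) {
        double cycle_ns    = 0.2;
        double dram_cycles = 100.0 / cycle_ns; /* 500 cycles */
        double l2_cycles   = 5.0 / cycle_ns;   /*  25 cycles */
        double cpi_l1_only = 1.0 + 0.05 * dram_cycles;                     /* 26.0 */
        double cpi_with_l2 = 1.0 + 0.05 * l2_cycles + 0.005 * dram_cycles; /* 4.75 */
        printf("CPI, L1 only: %.2f; with L2: %.2f; speedup: %.1fx\n",
               cpi_l1_only, cpi_with_l2, cpi_l1_only / cpi_with_l2);
        return 0;
    }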

32
Cache Complexities
  • Not always easy to understand the implications of caches

(Figures: theoretical vs. observed behavior of radix sort vs. quicksort)
33
Cache Complexities
  • Here is why
  • Memory system performance is often the critical factor
  • Multilevel caches and pipelined processors make it harder to predict outcomes
  • Compiler optimizations that increase locality sometimes hurt ILP
  • Difficult to predict the best algorithm: need experimental data