CS 2200 Lecture 13-15 Memory Management - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: CS 2200 Lecture 13-15 Memory Management


1
CS 2200 Lecture 13-15: Memory Management
  • (Lectures based on the work of Jay Brockman,
    Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy,
    Ken MacKenzie, Richard Murphy, and Michael
    Niemier)

2
The principle of locality
  • says that most programs don't access all code or
    data uniformly
  • For instance, in a loop a small subset of
    instructions might be executed over and over
    again
  • And a block of memory addresses might be accessed
    sequentially
  • This observation has led to the idea of a memory
    hierarchy
  • Some important things to note
  • Fast memory is expensive
  • Each level of memory is usually smaller and faster than
    the level below it
  • Levels of memory usually subset one another
  • So, all of the data in a higher level is also in some
    level below it

3
The processor/memory bottleneck
4
The processor/memory bottleneck
[Figure: single-chip DRAM memory capacity (Kb) vs. year]
5
How big is the problem?
Processor-DRAM Memory Gap (latency)
[Figure: relative performance vs. time, 1980-2000. Processor performance
("Moore's Law") improves roughly 60%/yr (2X/1.5 yr) while DRAM improves
roughly 9%/yr (2X/10 yrs), so the gap widens steadily.]
6
Memory Background
[Figure: basic RAM organization. The incoming address drives a row decoder;
the selected wordline activates a row of storage cells on bitlines; sense
amplifiers and a column mux connect the selected column to data in/out.]
7
Pick Your Storage Cells
  • DRAM
  • dynamic: must be refreshed
  • densest technology; cost/bit is paramount
  • SRAM
  • static: value is stored in a latch
  • fastest technology: 8-16x faster than DRAM
  • larger cell: 4-8x larger
  • more expensive: 8-16x more per bit
  • others
  • EEPROM/Flash: high density, non-volatile
  • core...

8
Solution: small memory unit closer to processor
Processor
small, fast memory
BIG SLOW MEMORY
9
Terminology
Processor
upper level (the cache)
small, fast memory
Memory
lower level (sometimes called backing store)
BIG SLOW MEMORY
10
Terminology
Processor
hit rate: fraction of accesses resulting in
hits
A 'hit': block found in the upper level
small, fast memory
Memory
BIG SLOW MEMORY
11
Terminology
Processor
hit rate: fraction of accesses resulting in
hits
A 'miss': block not found in the upper level; must look in the
lower level
small, fast memory
Memory
miss rate = (1 - hit_rate)
BIG SLOW MEMORY
12
Terminology Summary
  • Hit: data appears in some block in the upper
    level (example: Block X in the cache)
  • Hit Rate: the fraction of memory accesses found in
    the upper level
  • Hit Time: time to access the upper level, which
    consists of
  • RAM access time + time to determine hit/miss
  • Miss: data needs to be retrieved from a block in
    the lower level (example: Block Y in memory)
  • Miss Rate = 1 - (Hit Rate)
  • Miss Penalty: extra time to replace a block in
    the upper level
  • plus the time to deliver the block to the processor
  • Hit Time << Miss Penalty (500 instructions on the
    21264)

13
Average Memory Access Time
AMAT = HitTime + (1 - h) x MissPenalty
  • Hit time: basic time of every access
  • Hit rate (h): fraction of accesses that hit
  • Miss penalty: extra time to fetch a block from the
    lower level, including the time to place it in the
    cache and deliver it to the CPU (a short worked
    sketch in C follows)
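A minimal C sketch of this formula (my illustration, not from the original
slides); the 1 ns hit time, 98% hit rate, and 100 ns miss penalty are the
single-level numbers reused on the multilevel-cache slide near the end of
this deck.

#include <stdio.h>

/* AMAT = hit_time + (1 - hit_rate) * miss_penalty */
static double amat(double hit_time_ns, double hit_rate, double miss_penalty_ns)
{
    return hit_time_ns + (1.0 - hit_rate) * miss_penalty_ns;
}

int main(void)
{
    /* 1 ns cache, 98% hit rate, 100 ns memory: 1 + 0.02 * 100 = 3 ns */
    printf("AMAT = %.1f ns\n", amat(1.0, 0.98, 100.0));
    return 0;
}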

14
The Full Memory Hierarchy: always reuse a good idea
(Upper levels are smaller and faster; lower levels are larger and slower.)

Level         Capacity       Access Time            Cost                        Staging Xfer Unit
Registers     100s of bytes  <10s of ns                                         prog./compiler: 1-8 bytes (instr. operands)
Cache         K bytes        10-100 ns              1-0.1 cents/bit             cache cntl: 8-128 bytes (blocks)
Main Memory   M bytes        200-500 ns             10^-4 - 10^-5 cents/bit     OS: 4K-16K bytes (pages)
Disk          G bytes        10 ms (10,000,000 ns)  10^-5 - 10^-6 cents/bit     user/operator: Mbytes (files)
Tape          infinite       sec-min                10^-8 cents/bit
15
A brief description of a cache
  • Discussion will center around caches
  • Cache: basically the next level of the memory hierarchy up
    from the register file
  • All of the values in the register file should be
    in the cache
  • Cache entries are usually referred to as blocks
  • A block is the minimum amount of information that
    can be in the cache
  • If we're looking for an item in the cache and we
    find it, we have a cache hit; if not, a
    cache miss
  • The cache miss rate is the fraction of accesses
    that are not in the cache
  • The miss penalty is the number of clock cycles
    required because of the miss

Mem. stall cycles = Inst. count x Mem. refs./inst.
x Miss rate x Miss penalty
16
Some initial questions to consider
  • Where can a block be placed in an upper level of
    memory hierarchy (i.e. a cache)?
  • How is a block found if it is in an upper level of
    the memory hierarchy?
  • Or in other words, how is the block identified?
  • Which block in the cache should be replaced on a
    cache miss if the entire cache is full and we
    want to bring in new data?
  • What happens if you want to write to a
    memory location?
  • Do you just write to the cache?
  • Do you write somewhere else?

17
Where can a block be placed in a cache?
  • There are 3 fundamental schemes for block
    placement in a cache
  • A direct mapped cache
  • Each block (or some set of data to be stored) can
    only go to 1 place in the cache
  • Usually (Block address) MOD (number of blocks in the
    cache)
  • A fully associative cache
  • A block can be placed anywhere in the cache
  • A set associative cache
  • A set is a group of blocks in the cache
  • A block is mapped onto a set and then the block
    can be placed anywhere within that set
  • Usually (Block address) MOD (number of sets in the
    cache)
  • If each set holds n blocks, we call it n-way set
    associative (a short sketch of these mappings in C follows)
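A minimal C sketch (not from the original slides) of the two MOD mappings
above; the block number 12, the 8-block cache, and the 4 sets mirror the
figure on the next slide.

#include <stdio.h>

int main(void)
{
    unsigned block_addr = 12;   /* memory block number from the example      */
    unsigned num_blocks = 8;    /* blocks in the cache                       */
    unsigned num_sets   = 4;    /* sets in a 2-way set-associative cache     */

    /* direct mapped: exactly one legal slot */
    printf("direct mapped   -> block %u\n", block_addr % num_blocks); /* 12 mod 8 = 4 */

    /* set associative: one legal set, any way within it */
    printf("set associative -> set %u\n", block_addr % num_sets);     /* 12 mod 4 = 0 */

    /* fully associative: any block frame is legal, so no index is computed */
    return 0;
}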

18
Where can a block be placed in a cache?
Fully Associative / Direct Mapped / Set Associative
[Figure: an 8-block cache (blocks 1-8) shown three ways, with memory blocks
1, 2, 3, ... below. Fully associative: block 12 can go anywhere. Direct
mapped: block 12 can go only into block 4 (12 mod 8). 2-way set associative
(Set 0 - Set 3): block 12 can go anywhere in set 0 (12 mod 4).]
19
1 KB Direct Mapped Cache, 32B blocks
  • For a 2^N byte cache
  • The uppermost (32 - N) bits are always the Cache
    Tag
  • The lowest M bits are the Byte Select (Block Size
    = 2^M)

[Figure: 1 KB direct-mapped cache with 32-byte blocks. The 32-bit address
splits into a Cache Tag (bits 31-10, example 0x50), a Cache Index (bits 9-5,
example 0x01), and a Byte Select (bits 4-0, example 0x00). The cache holds 32
entries (0-31), each storing a valid bit, the cache tag (stored as part of the
cache state), and 32 bytes of data (entry 0: bytes 0-31; entry 1: bytes 32-63,
shown holding tag 0x50; ...; entry 31: bytes 992-1023). The field extraction
is sketched in C below.]
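A short C sketch (mine, not from the slides) of the field extraction this
figure shows, using the same 1 KB / 32-byte-block geometry and the
0x50 / 0x01 / 0x00 example values:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* 1 KB direct-mapped cache, 32-byte blocks: 5 offset bits, 5 index bits */
    uint32_t addr   = (0x50u << 10) | (0x01u << 5) | 0x00u;

    uint32_t offset = addr & 0x1Fu;          /* bits 4-0:  byte select */
    uint32_t index  = (addr >> 5) & 0x1Fu;   /* bits 9-5:  cache index */
    uint32_t tag    = addr >> 10;            /* bits 31-10: cache tag  */

    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}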
20
2. Associativity
  • Requiring that every memory location be cachable
    in exactly one place (direct-mapped) was simple
    but incredibly limiting
  • How can we relax this constraint?

21
Associativity
  • Block 12 placed in an 8 block cache
  • Fully associative, direct mapped, 2-way set
    associative
  • S.A. mapping = (Block Number) modulo (Number of Sets)

Direct mapped: (12 mod 8) = 4
2-way assoc.: (12 mod 4) = 0
Fully mapped: block 12 can go anywhere
Cache
Memory
22
Two-way Set Associative Cache
  • N-way set associative: N entries for each Cache
    Index
  • N direct mapped caches operate in parallel (N = 2
    to 4)
  • Example: two-way set associative cache
  • The Cache Index selects a set from the cache
  • The two tags in the set are compared in parallel
    (sketched in C after the figure below)

[Figure: two-way set-associative cache. The Cache Index selects one set; the
set's two entries (Valid, Cache Tag, Cache Data / Cache Block 0 and 1) are
read in parallel, each stored tag is compared against the address tag
(Adr Tag), the compare results (Sel0/Sel1) are ORed to produce Hit, and a mux
selects the matching Cache Block.]

Advantage: typically exhibits a hit rate equal to
a 2X-sized direct-mapped cache
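A compact C sketch of the parallel two-way lookup just described; the struct
names, the 64-set geometry, and the omitted miss handling are my illustrative
assumptions, not details from the slide:

#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS    64           /* illustrative geometry */
#define BLOCK_BYTES 32

struct way {
    bool     valid;
    uint32_t tag;
    uint8_t  data[BLOCK_BYTES];
};

struct set {
    struct way way[2];           /* both ways are probed in parallel in hardware */
};

static struct set cache[NUM_SETS];

/* Returns a pointer to the requested byte on a hit, or NULL on a miss. */
static uint8_t *lookup(uint32_t addr)
{
    uint32_t offset = addr % BLOCK_BYTES;
    uint32_t index  = (addr / BLOCK_BYTES) % NUM_SETS;
    uint32_t tag    = addr / (BLOCK_BYTES * NUM_SETS);
    struct set *s   = &cache[index];

    for (int w = 0; w < 2; w++)                      /* compare both tags */
        if (s->way[w].valid && s->way[w].tag == tag)
            return &s->way[w].data[offset];          /* hit */
    return NULL;                                     /* miss */
}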
23
Disadvantage of Set Associative Cache
  • N-way Set Associative Cache v. Direct Mapped
    Cache
  • N comparators vs. 1
  • Extra MUX delay for the data
  • Data comes AFTER Hit/Miss

24
Associativity
  • If you have associativity > 1 you have to have a
    replacement policy (like VM!)
  • FIFO
  • LRU
  • random
  • "Full" or "full-map" associativity means you
    check every tag in parallel and a memory block
    can go into any cache block
  • virtual memory is effectively fully associative

25
How is a block found in the cache?
  • Caches have an address tag on each block frame
    that provides the block address
  • The tag of every cache block that might hold our
    entry is examined against the CPU address (in
    parallel! why?)
  • Each entry usually has a valid bit
  • It says whether or not the data is good (and not
    garbage)
  • If the bit is not set, there can't be a match
  • How does the address provided by the CPU relate
    to the entry in the cache?
  • The address is divided between the block address and the
    block offset...
  • and the block address is further divided into the tag field
    and the index field

26
How is a block found in the cache?
[Address layout: Block Address (Tag | Index) | Block Offset]
  • The block offset field selects the desired data from the
    block
  • (i.e. the address of the desired data within the
    block)
  • The index field selects a specific set
  • The tag field is compared against the stored tag for a hit
  • Question: could we/should we do a compare on
    more of the address than the tag?
  • Not necessary; checking the index would be redundant
  • It was already used to select the set to be checked
  • Ex. an address stored in set 0 must have 0 in
    the index field
  • The offset is not needed in the comparison; the
    entire block is either present or not, so all block
    offsets would match

27
Which block should be replaced on a cache miss?
  • If we look something up in the cache and our
    entry is not there, we generally want to get our
    data from memory and put it in the cache
  • Because the principle of locality says we'll probably
    use it again
  • With direct mapped caches we only have one choice
    of what block to replace
  • Fully associative or set associative caches offer more
    choices
  • Usually 2 strategies
  • Random: pick any possible block and replace it
  • LRU: stands for Least Recently Used
  • Why not throw out the block not used for the
    longest time?
  • Can be expensive, usually approximated, and not much
    better than random, e.g. 5.18% vs. 5.69% miss rate for a
    16KB 2-way set associative cache

28
What happens on a write?
  • FYI, most accesses to a cache are reads
  • Used to fetch instructions (reads)
  • Most instructions don't write to memory
  • For DLX, only about 7% of memory traffic involves
    writes
  • Translates to about 25% of cache data traffic
  • Make the common case fast! Optimize the cache for
    reads!
  • Actually pretty easy to do
  • Can read the block while comparing/reading the tag
  • Block read begins as soon as the address is available
  • If it's a hit, the data is passed right on to the CPU
  • Writes take longer. Any idea why?

29
What happens on a write?
  • Generically, there are 2 kinds of write policies
  • Write through (or store through)
  • With write through, information is written to
    both the block in the cache and to the block in
    lower-level memory
  • Write back (or copy back)
  • With write back, information is written only to
    the cache. It will be written back to
    lower-level memory when the cache block is
    replaced
  • The dirty bit
  • Each cache entry usually has a bit that specifies
    if a write has occurred in that block or not
  • This helps reduce the frequency of writes to
    lower-level memory upon block replacement (a short
    C sketch of both policies follows)
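A minimal C sketch contrasting the two write policies and the dirty bit; the
function and structure names are illustrative, not from the slides:

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define BLOCK_BYTES 32

struct block {
    bool     valid;
    bool     dirty;                  /* only meaningful for write back */
    uint32_t tag;
    uint8_t  data[BLOCK_BYTES];
};

/* Write through: update the cached copy and lower-level memory on every write. */
static void write_through(struct block *b, uint32_t offset, uint8_t v, uint8_t *memory_copy)
{
    b->data[offset] = v;
    memory_copy[offset] = v;         /* the lower level always has the latest value */
}

/* Write back: update only the cache and mark the block dirty; memory is
   updated later, when the block is replaced. */
static void write_back(struct block *b, uint32_t offset, uint8_t v)
{
    b->data[offset] = v;
    b->dirty = true;
}

static void evict(struct block *b, uint8_t *memory_copy)
{
    if (b->dirty)                    /* the dirty bit avoids needless writebacks */
        memcpy(memory_copy, b->data, BLOCK_BYTES);
    b->valid = false;
    b->dirty = false;
}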

30
What happens on a write?
  • Write back versus write through
  • Write back is advantageous because
  • Writes occur at the speed of the cache and we
don't incur the delay of lower-level memory
  • Multiple writes to a cache block result in only 1
    lower-level memory access
  • Write through is advantageous because
  • Lower-levels of memory have the most recent copy
    of data
  • If the CPU has to wait for a write, we have a
    write stall
  • One way around this is a write buffer
  • Ideally, the CPU shouldn't have to stall during a
    write
  • Instead, data is written to a buffer, which then
    sends it on to the lower levels of the memory
    hierarchy

31
What happens on a write?
  • What if we want to do a write and the block we
    want to write to isn't in the cache?
  • There are 2 common policies
  • Write allocate (or fetch on write)
  • The block is loaded on a write miss
  • The idea behind this is that subsequent writes
    will be captured by the cache (ideal for a write
    back cache)
  • No-write allocate (or write around)
  • The block is modified in the lower-level and not
    loaded into the cache
  • Usually used for write-through caches (as
    subsequent writes will still have to go to
    memory)

32
An example: the Alpha 21064 data and instruction
cache
[Figure: Alpha 21064 data cache organization. The CPU address splits into a
Block Addr. (Tag <21> | Index <8>) and a Block Offset <5>. The 8-bit index
selects one of 256 blocks, each holding Valid <1>, Tag <21>, and Data <256>.
The stored tag is compared (=?) against the address tag, and a 4:1 mux drives
data out; writes also pass through a write buffer on their way to lower-level
memory. The numbers 1-4 mark the steps traced on the following slides.]
33
Ex. Alpha cache trace step 1
  • First, the address coming into the cache is
    divided into two fields
  • a 29-bit block address and a 5-bit block offset
  • The block address is further divided into
  • an address tag and a cache index
  • The cache index selects the tag to be tested to
    see if the desired block is in the cache
  • The size of the index depends on cache size, block
    size, and set associativity
  • Here, the index is 8 bits wide and the tag is
    29 - 8 = 21 bits

34
Ex. Alpha cache trace step 2
  • Next, we need to do an index selection
    essentially what happens here is
  • (With direct mapping data read/tag checked in
    parallel)

Index (8 bits)
Tag (21 bits)
Data (256 bits)
Valid (1 bit)

35
Ex. Alpha cache trace step 3,4
  • After reading the tag from the cache, it's
    compared against the tag portion of the block
    address from the CPU
  • If the tags do match, the data is still not
    necessarily valid; the valid bit must be set
    as well
  • If the valid bit is not set, the result is
    ignored by the CPU
  • If the tags match (and the entry is valid), it's OK for
    the CPU to load the data
  • Note
  • The 21064 allows 2 clock cycles for these 4 steps
  • Instructions following a load in the next 2 clock
    cycles would stall if they tried to use the
    result of the load

36
What happens on a write in the Alpha?
  • If something (i.e. a data word) is supposed to be
    written to the cache, the 1st 3 steps will
    actually be the same
  • After the tag comparison indicates a hit, the write takes place
  • Because the Alpha uses a write-through cache, we
    also have to go back to main memory
  • So, we go to the write buffer next (4 blocks in
    the Alpha)
  • If the buffer has room in it, the data is copied
    there and, as far as the CPU is concerned, the
    write is done
  • We may have to merge writes to the same block, however
  • If the buffer is full, the CPU must wait until
    the buffer has an empty entry (a small sketch of such
    a buffer follows)
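A minimal write-buffer sketch in C, in the spirit of this slide; the 4-entry
size matches the Alpha, but the structure, the merge check, and all names are
my own illustrative assumptions:

#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4                     /* the Alpha's buffer holds 4 blocks */

struct wb_entry {
    bool     valid;
    uint32_t block_addr;
    uint32_t data;
};

static struct wb_entry wbuf[WB_ENTRIES];

/* Returns true if the write was absorbed; false means the buffer is full
   and the CPU must stall until an entry drains to memory. */
static bool write_buffer_put(uint32_t block_addr, uint32_t data)
{
    for (int i = 0; i < WB_ENTRIES; i++)     /* merge with a pending write */
        if (wbuf[i].valid && wbuf[i].block_addr == block_addr) {
            wbuf[i].data = data;
            return true;
        }
    for (int i = 0; i < WB_ENTRIES; i++)     /* otherwise take a free slot */
        if (!wbuf[i].valid) {
            wbuf[i] = (struct wb_entry){ true, block_addr, data };
            return true;
        }
    return false;                            /* full: stall */
}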

37
What happens on a read miss (with the Alpha
cache)?
  • This is here just so you can get a practical
    idea of what's going on with a real cache
  • Say we try to read something in the Alpha cache
    and it's not there
  • We have to get it from the next level of the memory
    hierarchy
  • So, what happens?
  • The cache tells the CPU to stall while it waits for new data
  • We need to get 32 bytes of data but only have 16
    bytes of available bandwidth
  • Each transfer takes 5 clock cycles
  • So we'll need 10 clock cycles to get all 32
    bytes
  • The Alpha cache is direct mapped, so there's
    only one place for the new block to go

38
One way
  • Have a scheme that allows the contents of a main
    memory address to be found in exactly one place
    in the cache.
  • Remember, the cache is smaller than the level
    below it, thus multiple locations could map to
    the same place
  • A severe restriction! But let's see what we can do
    with it...

39
One way
Example: looking for location 10011 (19). Look in
011 (3), since 3 = 19 MOD 8
40
One way
If there are four possible locations in
memory which map into the same location in
our cache...
41
One way
Index: 000 001 010 011 100 101 110 111
TAG:    00  00  00  10  00  00  00  00
We can add tags which tell us if we have a match.
42
One way
Index: 000 001 010 011 100 101 110 111
TAG:    00  00  00  00  00  00  00  00
But there is still a problem! What if we haven't
put anything into the cache? The 00 tag (for
example) will confuse us.
43
One way
Index: 000 001 010 011 100 101 110 111
V:       0   0   0   0   0   0   0   0
Solution: add a valid bit
44
One way
Index: 000 001 010 011 100 101 110 111
V:       0   0   0   1   0   0   0   0
Now if the valid bit is set, our match is good
45
Basic Algorithm
  • Assume we want the contents of location M
  • Calculate CacheAddr = M mod CacheSize
  • Calculate TargetTag = M / CacheSize
  • if (Valid[CacheAddr] is SET
  •     and Tag[CacheAddr] == TargetTag)
  •   return Data[CacheAddr]                    (hit)
  • else                                        (miss)
  •   Fetch the contents of location M from the backing memory
  •   Put it in Data[CacheAddr]
  •   Update Tag[CacheAddr] and Valid[CacheAddr]
  • (a runnable C version follows)
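A runnable C version of this algorithm (my illustration, not code from the
slides), sized to match the worked example that follows: 8 cache entries,
5-bit addresses, and the same reference sequence.

#include <stdio.h>

#define CACHE_SIZE 8                  /* 8 entries: 3-bit index, 2-bit tag */

static int      Valid[CACHE_SIZE];
static unsigned Tag[CACHE_SIZE];
static unsigned Data[CACHE_SIZE];     /* stands in for the cached block    */

static unsigned backing_memory(unsigned m) { return m; }   /* placeholder */

static unsigned access_cache(unsigned m)
{
    unsigned cache_addr = m % CACHE_SIZE;     /* low 3 bits: index          */
    unsigned target_tag = m / CACHE_SIZE;     /* remaining high bits: tag   */
    int hit = Valid[cache_addr] && Tag[cache_addr] == target_tag;

    printf("address %2u: %s\n", m, hit ? "hit" : "miss");
    if (!hit) {                               /* fetch from the lower level */
        Data[cache_addr]  = backing_memory(m);
        Tag[cache_addr]   = target_tag;
        Valid[cache_addr] = 1;
    }
    return Data[cache_addr];
}

int main(void)
{
    /* the reference sequence from the example slides:
       10110, 11010, 10110, 11010, 10000, 00011, 10000, 10010 */
    unsigned refs[] = { 22, 26, 22, 26, 16, 3, 16, 18 };
    for (int i = 0; i < 8; i++)
        access_cache(refs[i]);
    return 0;
}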
46
Example
  • The cache is initially empty
  • We get the following sequence of memory references
  • 10110
  • 11010
  • 10110
  • 11010
  • 10000
  • 00011
  • 10000
  • 10010

47
Example
Index: 000 001 010 011 100 101 110 111
TAG:    00  00  00  00  00  00  00  00
V:       0   0   0   0   0   0   0   0
Initial condition
48
Example
Index: 000 001 010 011 100 101 110 111
TAG:    00  00  00  00  00  00  00  00
V:       0   0   0   0   0   0   0   0
Access 10110: Result?
49
Example
Index: 000 001 010 011 100 101 110 111
TAG:    00  00  00  00  00  00  10  00
V:       0   0   0   0   0   0   1   0
Access 10110: Miss
50
Example
Index: 000 001 010 011 100 101 110 111
TAG:    00  00  00  00  00  00  10  00
V:       0   0   0   0   0   0   1   0
Access 11010: Result?
51
Example
Index: 000 001 010 011 100 101 110 111
TAG:    00  00  11  00  00  00  10  00
V:       0   0   1   0   0   0   1   0
Access 11010: Miss
52
Example
Index: 000 001 010 011 100 101 110 111
TAG:    00  00  11  00  00  00  10  00
V:       0   0   1   0   0   0   1   0
Access 10110: Result?
53
Example
Index: 000 001 010 011 100 101 110 111
TAG:    00  00  11  00  00  00  10  00
V:       0   0   1   0   0   0   1   0
Access 10110: Hit
54
Example
Index: 000 001 010 011 100 101 110 111
TAG:    00  00  11  00  00  00  10  00
V:       0   0   1   0   0   0   1   0
Access 11010: Result?
55
Example
Index: 000 001 010 011 100 101 110 111
TAG:    00  00  11  00  00  00  10  00
V:       0   0   1   0   0   0   1   0
Access 11010: Hit
56
Example
Index: 000 001 010 011 100 101 110 111
TAG:    00  00  11  00  00  00  10  00
V:       0   0   1   0   0   0   1   0
Access 10000: Result?
57
Example
Index: 000 001 010 011 100 101 110 111
TAG:    10  00  11  00  00  00  10  00
V:       1   0   1   0   0   0   1   0
Access 10000: Miss
58
Example
Index: 000 001 010 011 100 101 110 111
TAG:    10  00  11  00  00  00  10  00
V:       1   0   1   0   0   0   1   0
Access 00011: Result?
59
Example
Index: 000 001 010 011 100 101 110 111
TAG:    10  00  11  00  00  00  10  00
V:       1   0   1   1   0   0   1   0
Access 00011: Miss
60
Example
Index: 000 001 010 011 100 101 110 111
TAG:    10  00  11  00  00  00  10  00
V:       1   0   1   1   0   0   1   0
Access 10000: Result?
61
Example
Index: 000 001 010 011 100 101 110 111
TAG:    10  00  11  00  00  00  10  00
V:       1   0   1   1   0   0   1   0
Access 10000: Hit
62
Example
Index: 000 001 010 011 100 101 110 111
TAG:    10  00  11  00  00  00  10  00
V:       1   0   1   1   0   0   1   0
Access 10010: Result?
63
Example
Index: 000 001 010 011 100 101 110 111
TAG:    10  00  10  00  00  00  10  00
V:       1   0   1   1   0   0   1   0
Access 10010: Miss
64
Instruction and data caches
  • Most processors have separate caches for data and
    for instructions
  • Why?
  • What if a load or store instruction is executed?
  • Processor should request data and fetch another
    instruction at the same time
  • If both were in the same cache, there could be a
    structural hazard
  • The Alpha actually uses an 8-KB instruction cache
    that is almost identical to its 8-KB data cache
  • Note: you may see the terms unified or mixed
    cache

65
Cache performance
  • When evaluating cache performance, a bit of a
    fallacy is to focus only on miss rate
  • The temptation may arise because miss rate is
    actually independent of the HW implementation
  • You may think it gives you an apples-to-apples
    comparison
  • A better way is to use the following equation
  • Average memory access time
  • = Hit time + Miss Rate x Miss Penalty
  • Average memory access time is kind of like CPI
  • It's a good measure of performance but still not
    perfect
  • Again, the best end-to-end comparison is of
    course execution time

66
A cache example
  • We want to compare the following
  • A 16-KB data cache and a 16-KB instruction cache
    versus a 32-KB unified cache
  • Assume that a hit takes just 1 clock cycle to
    process
  • The miss penalty is 50 clock cycles
  • In a unified cache, a load or store hit takes 1
    extra clock cycle, because having only 1 cache port
    leads to a structural hazard
  • 75% of accesses are instruction references
  • What's the avg. memory access time in each case?

[Miss rates: 16-KB instruction cache 0.64%, 16-KB data cache 6.47%,
32-KB unified cache 1.99%]
67
A cache example continued
  • First of all we need to figure out the overall
    miss rate for the split caches
  • (75% x 0.64%) + (25% x 6.47%) = 2.10%
  • This compares to the unified cache miss rate of
    1.99%
  • We'll use the average memory access time formula
    from a few slides ago but break it up into
    instruction and data references
  • Average memory access time, split cache
  • = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50)
  • = (75% x 1.32) + (25% x 4.235) = 2.05 cycles
  • Average memory access time, unified cache
  • = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50)
  • = (75% x 1.995) + (25% x 2.995) = 2.24 cycles
  • So, despite the higher miss rate, the access time is
    lower for the split cache!

68
The Big Picture
  • A very generic equation for total CPU time is
  • (CPU execution clock cycles + memory stall
    cycles) x clock cycle time
  • This formula raises the question of whether or
    not the clock cycles for a cache hit should be
    included in
  • the CPU execution cycles part of the equation
  • or the memory stall cycles part of the equation
  • Convention puts them in the CPU execution cycles
    part
  • So, if you think back to the basic DLX pipeline,
    cache hit time would be included as part of the
    memory stage
  • This allows memory stall cycles to be defined in
    terms of
  • the number of accesses per program, the miss penalty (in
    clock cycles), and the miss rate for writes and reads

69
Memory access equations
  • Using what we defined on the previous slide, we
    can say
  • Memory stall clock cycles =
  • Reads x Read miss rate x Read miss penalty
  • + Writes x Write miss rate x Write miss penalty
  • Often, reads and writes are combined/averaged
  • Memory stall cycles =
  • Memory accesses x Miss rate x Miss penalty
    (approximation)
  • It's also possible to factor in instruction count
    to get a complete formula (a small sketch follows)
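A small C sketch of the combined formula with instruction count factored in;
all of the numbers here are illustrative placeholders, not data from this
lecture:

#include <stdio.h>

int main(void)
{
    double inst_count       = 1e9;    /* illustrative program size           */
    double mem_refs_per_inst = 1.4;   /* memory references per instruction   */
    double miss_rate        = 0.02;   /* 2% of references miss               */
    double miss_penalty     = 50.0;   /* clock cycles per miss               */
    double base_cpi         = 1.0;    /* CPI assuming all accesses hit       */
    double cycle_time_ns    = 1.0;

    double stall_cycles = inst_count * mem_refs_per_inst * miss_rate * miss_penalty;
    double cpu_time_ns  = (inst_count * base_cpi + stall_cycles) * cycle_time_ns;

    printf("stall cycles = %.3g, CPU time = %.3g ns\n", stall_cycles, cpu_time_ns);
    return 0;
}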

70
Reducing cache misses
  • Obviously, we want our data accesses to result in
    cache hits, not misses, as this will optimize
    our performance
  • Start by looking at ways to increase the number of
    hits...
  • but first look at the 3 kinds of misses!
  • Compulsory misses
  • Obviously, the very 1st access to a block in the
    cache will not be a hit, just because the data's not
    there yet!
  • Capacity misses
  • A cache is only so big. It probably won't be
    able to store every block accessed in a program,
    so it must swap blocks out!
  • Conflict misses
  • These result from set-associative or direct
    mapped caches
  • Blocks are discarded/retrieved if too many map to a
    location

71
Cache misses and the architect
  • What can a computer architect do about the 3
    kinds of cache misses?
  • Compulsory, capacity, and conflict
  • Well, all conflict misses can be avoided with a
    fully associative cache
  • But fully associative caches mean expensive HW,
    possibly slower clock rates, and other bad stuff
  • Capacity misses might be avoided by making the
    cache bigger small caches can lead to thrashing
  • With thrashing, data moves between 2 levels of
    memory hierarchy very frequently can really
    slow down perf.
  • Larger blocks can mean fewer compulsory misses
  • But can turn a capacity miss into a conflict miss!

72
(1) Larger cache block size
  • The easiest way to reduce miss rate is to
    increase the cache block size
  • This will help eliminate what kind of misses?
  • This helps improve miss rate because of the
    principle of locality
  • Recall that temporal locality says that if
    something is accessed once, it'll probably be
    accessed again soon
  • Spatial locality says that if something is
    accessed, something nearby it will probably be
    accessed
  • Larger block sizes help with spatial locality
  • Be careful though!
  • Larger block sizes can increase the miss penalty!
  • Generally, larger blocks reduce the number of total
    blocks in the cache

73
Larger cache block size (graph comparison)
74
(1) Larger cache block size (example)
  • Assume that to access a lower-level of memory
    hierarchy you
  • Automatically incur a 40 clock cycle overhead
  • And then get 16 bytes of data every 2 clock
    cycles
  • Thus, we get 16 bytes in 42 clock cycles, 32 in
    44, etc
  • Using the following data, which block size has
    the minimum average memory access time?

[Table: miss rates for the block sizes and cache sizes used in this example]
75
(1) Larger cache block size (example continued)
  • Recall that the average memory access time
  • = Hit time + Miss rate x Miss penalty
  • We'll assume that a cache hit otherwise takes
    1 clock cycle, independent of block
    size
  • So, for a 16-byte block in a 1-KB cache
  • Average memory access time
  • = 1 + (15.05% x 42) = 7.321 clock cycles
  • And for a 256-byte block in a 256-KB cache
  • Average memory access time
  • = 1 + (0.49% x 72) = 1.353 clock cycles
  • The rest of the data is included on the next
    overhead

76
(1) Larger cache block size (example continued)
[Table: average memory access time for each block size and cache size; the
lowest average time for each configuration is highlighted.]
Note: all of these block sizes are common in processors today.
Note: the data in the table is in units of clock cycles.
77
(1) Larger cache block sizes (wrap-up)
  • Of course the computer architect is trying to
    minimize both the cache miss rate and the cache
    miss penalty at the same time!
  • The selection of block size depends on both the
    latency and the bandwidth of the lower-level
    memory
  • High latency and high bandwidth encourage large
    block size
  • The cache gets many more bytes per miss for a
    small increase in miss penalty
  • Low latency and low bandwidth encourage small
    block size
  • Twice the miss penalty of a small block may be
    close to the penalty of a block twice the size
  • The larger number of small blocks may reduce
    conflict misses

78
(2) Higher associativity
  • Generally speaking, higher associativity can
    improve cache miss rates
  • By looking at data, one can see that an 8-way set
    associative cache is
  • For all intents and purposes a fully-associative
    cache
  • This helps lead to the 2:1 cache rule of thumb
  • First of all, a bit of an unfortunate rhyme
    there
  • But seriously it says
  • A direct mapped cache of size N has about the
    same miss rate as a 2-way set-associative cache
    of size N/2
  • But, diminishing returns set in sooner or later
  • Greater associativity can come at the cost of
    increased hit time

79
Cache miss penalties
  • Recall the equation for average memory access
    time
  • = Hit time + Miss Rate x Miss Penalty
  • We've talked about lots of ways to improve the
    miss rates of our caches in the previous slides
  • But, just by looking at the formula, we can see
  • Improving the miss penalty will work just as
    well!
  • Also, remember that technology trends have made
    processor speeds much faster than memory/DRAM
    speeds
  • Thus, the relative cost of miss penalties has
    increased over time!

80
Second-level caches
  • The 1st 4 techniques discussed all impact the
    CPU
  • This technique focuses on cache/main memory
    interface
  • Processor/memory performance gap makes architects
    consider
  • If they should make caches faster to keep pace
    with CPUs
  • If they should make caches larger to overcome
    widening gap between CPU and main memory
  • One solution is to do both
  • Add another level of cache (L2) between the 1st
    level cache (L1) and main memory
  • Ideally L1 will be fast enough to match the speed
    of the CPU while L2 will be large enough to
    reduce the penalty of going to main memory

81
Second-level caches
  • This will of course introduce a new definition
    for average memory access time
  • = Hit time_L1 + Miss Rate_L1 x Miss Penalty_L1
  • where Miss Penalty_L1
  • = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
  • So the 2nd-level miss rate is measured on the misses
    leaving the 1st-level cache
  • A few definitions to avoid confusion
  • Local miss rate
  • number of misses in the cache divided by the total number
    of memory accesses to this cache; specifically, Miss
    Rate_L2
  • Global miss rate
  • number of misses in the cache divided by the total number
    of memory accesses generated by the CPU;
    specifically, Miss Rate_L1 x Miss Rate_L2
    (a small sketch follows)
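A minimal C sketch of these definitions; the miss counts (40 and 20 per 1000
references) come from the example on the next slide, while the 1 ns / 10 ns /
100 ns timings (borrowed from the multilevel-cache slides later in the deck)
are only illustrative:

#include <stdio.h>

int main(void)
{
    double refs      = 1000.0;
    double l1_misses = 40.0;
    double l2_misses = 20.0;

    double l1_miss_rate        = l1_misses / refs;       /* 4% (local = global) */
    double l2_local_miss_rate  = l2_misses / l1_misses;  /* 50%                 */
    double l2_global_miss_rate = l2_misses / refs;       /* 2%                  */

    /* AMAT = HitTime_L1 + MissRate_L1 x (HitTime_L2 + MissRate_L2 x MissPenalty_L2) */
    double hit_l1 = 1.0, hit_l2 = 10.0, mem = 100.0;     /* ns, illustrative */
    double amat = hit_l1 + l1_miss_rate * (hit_l2 + l2_local_miss_rate * mem);

    printf("L2 local %.0f%%, L2 global %.0f%%, AMAT %.1f ns\n",
           l2_local_miss_rate * 100, l2_global_miss_rate * 100, amat);
    return 0;
}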

82
(5) Second-level caches
  • Example
  • In 1000 memory references there are 40 misses in
    the L1 cache and 20 misses in the L2 cache. What
    are the various miss rates?
  • Miss Rate L1 (local or global) = 40/1000 = 4%
  • Miss Rate L2 (local) = 20/40 = 50%
  • Miss Rate L2 (global) = 20/1000 = 2%
  • Note that the global miss rate is very similar to the
    single-cache miss rate of the L2 cache
  • (if the L2 size >> L1 size)
  • The local miss rate is not a good measure of secondary
    caches; it's a function of the L1 miss rate
  • Which can vary by changing the L1 cache
  • Use the global cache miss rate to evaluate 2nd
    level caches!

83
Second-level caches(some odds and ends
comments)
  • The speed of the L1 cache will affect the clock
    rate of the CPU while the speed of the L2 cache
    will affect only the miss penalty of the L1 cache
  • Which of course could affect the CPU in various
    ways
  • 2 big things to consider when designing the L2
    cache are
  • Will the L2 cache lower the average memory access
    time portion of the CPI?
  • If so, how much will it cost?
  • In terms of HW, etc.
  • 2nd level caches are usually BIG!
  • Usually L1 is a subset of L2
  • Should have few capacity misses in L2 cache
  • Only worry about compulsory and conflict for
    optimizations

84
(5) Second-level caches (example)
  • Given the following data
  • 2-way set associativity increases hit time by 10%
    of a CPU clock cycle (or 1% of the overall time
    it takes for a hit)
  • Hit time for the L2 direct mapped cache is 10 clock
    cycles
  • Local miss rate for the L2 direct mapped cache is
    25%
  • Local miss rate for the L2 2-way set associative
    cache is 20%
  • Miss penalty for the L2 cache is 50 clock
    cycles
  • What is the impact of using a 2-way set
    associative cache on our miss penalty?

85
(5) Second-level caches (example)
  • Miss penalty, direct mapped L2
  • = 10 + 25% x 50 = 22.5 clock cycles
  • Adding the cost of associativity increases the
    hit cost by only 0.1 clock cycles
  • Thus, miss penalty, 2-way set associative L2
  • = 10.1 + 20% x 50 = 20.1 clock cycles
  • However, we can't have a fraction of a clock cycle
    for the hit time (i.e. 10.1 isn't possible!)
  • We'll either need to round up to 11 or optimize
    some more to get it down to 10. So
  • 10 + 20% x 50 = 20.0 clock cycles, or
  • 11 + 20% x 50 = 21.0 clock cycles (both better
    than 22.5)

86
(5) Second level caches (some final random
comments)
  • We can reduce the miss penalty by reducing the
    miss rate of the 2nd level caches using
    techniques previously discussed
  • E.g., higher associativity or pseudo-associativity
    is worth considering because it has a small
    impact on 2nd level hit time
  • And much of the average access time is due to
    misses in the L2 cache
  • Could also reduce misses by increasing L2 block
    size
  • Need to think about something called the
    multilevel inclusion property
  • In other words, all data in L1 cache is always in
    L2
  • Gets complex for writes, and what not

87
Multilevel caches: recall 1-level cache numbers
Processor
cache
1nS
AMAT = Thit + (1-h) x Tmem = 1 ns +
(1-h) x 100 ns; a hit rate of 98% would yield an
AMAT of 3 ns ... pretty good!
BIG SLOW MEMORY
100nS
88
Multilevel cache: add a medium-size, medium-speed
L2
Processor
AMAT = Thit_L1 + (1-h_L1) x Thit_L2
+ ((1-h_L1) x (1-h_L2) x Tmem); hit
rates of 98% in L1 and 95% in L2 would yield an
AMAT of 1 + 0.2 + 0.1 = 1.3 ns -- outstanding!
L1 cache
1nS
L2 cache
10nS
BIG SLOW MEMORY
100nS
89
Reducing the hit time
  • Again, recall our average memory access time
    equation
  • = Hit time + Miss Rate x Miss Penalty
  • We've talked about reducing the Miss Rate and the
    Miss Penalty; Hit time can also be a big
    component
  • On many machines cache accesses can affect the
    clock cycle time, so making this small is a good
    thing!
  • We'll talk about a few ways next

90
Small and simple caches
  • Why is this good?
  • Generally, smaller hardware is faster so a
    small cache should help the hit time
  • If an L1 cache is small enough, it should fit on
    the same chip as the actual processing logic
  • Processor avoids time going off chip!
  • Some designs compromise and keep the tags on chip
    and the data off chip; this allows a fast tag check and
    much greater memory capacity
  • Direct mapping also falls under the category of
    simple
  • It relates to the point above as well: you can check the
    tag and read the data at the same time!

91
Cache Mechanics Summary
  • Basic action
  • look up block
  • check tag
  • select byte from block
  • Block size
  • Associativity
  • Write Policy

92
Great Cache Questions
  • How do you use the processor's address bits to
    look up a value in a cache?
  • How many bits of storage are required in a cache
    with a given organization?

93
Great Cache Questions
  • How do you use the processor's address bits to
    look up a value in a cache?
  • How many bits of storage are required in a cache
    with a given organization?
  • E.g. 64KB, direct mapped, 16B blocks, write-back
  • 64K x 8 bits for data (4K blocks x 16 bytes)
  • 4K x (16 + 1 + 1) bits for tag, valid, and dirty bits

[Address split: tag | index | offset (computed in the short sketch below)]
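A small C sketch (mine, not from the slides) that works out the field widths
and total storage bits for the 64KB / direct-mapped / 16-byte-block /
write-back example above, assuming 32-bit addresses:

#include <stdio.h>

/* integer log2 for power-of-two sizes */
static unsigned ilog2(unsigned x) { unsigned n = 0; while (x >>= 1) n++; return n; }

int main(void)
{
    unsigned cache_bytes = 64 * 1024;
    unsigned block_bytes = 16;
    unsigned addr_bits   = 32;

    unsigned blocks      = cache_bytes / block_bytes;            /* 4K blocks */
    unsigned offset_bits = ilog2(block_bytes);                   /* 4         */
    unsigned index_bits  = ilog2(blocks);                        /* 12        */
    unsigned tag_bits    = addr_bits - index_bits - offset_bits; /* 16        */

    /* per block: data + tag + valid + dirty (write-back needs the dirty bit) */
    unsigned long long total_bits =
        (unsigned long long)blocks * (block_bytes * 8 + tag_bits + 1 + 1);

    printf("tag=%u index=%u offset=%u, total storage = %llu bits\n",
           tag_bits, index_bits, offset_bits, total_bits);
    return 0;
}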
94
More Great Cache Questions
  • Suppose you have a loop like this
  • What's the hit rate in a 64KB/direct/16B-block
    cache?

char a[1024][1024];
for (i = 0; i < 1024; i++)
  for (j = 0; j < 1024; j++)
    a[i][j] += 1;
95
A. Terminology
  • Take out a piece of paper and draw the following
    cache
  • total data size: 256KB
  • associativity: 4-way
  • block size: 16 bytes
  • address: 32 bits
  • write policy: write-back
  • replacement policy: random
  • How do you partition the 32-bit address?
  • How many total bits of storage are required?

96
C. Measuring Caches
97
Measuring Processor Caches
  • Generate a test program that, when timed, reveals
    the cache size, block size, associativity, etc.
  • How to do this?
  • how do you cause cache misses in a cache of size
    X?

98
Detecting Cache Size
for (size = 1; size < MAXSIZE; size *= 2)
  for (dummy = 0; dummy < ZILLION; dummy++)
    for (i = 0; i < size; i++)
      array[i]++;               /* time this part */

  • what happens when size < cache size?
  • what happens when size > cache size?
  • how can you figure out the block size?

99
Cache and Block Size
for (stride = 1; stride < MAXSTRIDE; stride *= 2)
  for (size = 1; size < MAXSIZE; size *= 2)
    for (dummy = 0; dummy < ZILLION; dummy++)
      for (i = 0; i < size; i += stride)
        array[i]++;             /* time this part */

  • what happens for stride == 1?
  • what happens for stride == block size?

100
Cache as part of a system
[Figure: caches in the pipelined datapath. The instruction cache is accessed
in the IF stage (PC plus a next-PC mux); the DPRF (decode/register file), BEQ
logic, and sign-extend (SE) sit in ID; the ALU and operand muxes in EX; the
data cache in MEM; and write-back muxing in WB.]