Title: Class 4: Cache Memory
1. Class 4: Cache Memory
- I. Memory Hierarchy
- II. Computer Memory Overview
- A. Static/Dynamic RAM
- B. Performance Parameters
- C. Motivation for Cache Memory
- III. Cache Memory
- A. Locality Principle
- B. Hit Rate
- IV. Cache Mapping
- A. Direct
- B. Associative
- C. Set Associative
2. Processor vs. Memory
3. The Memory Hierarchy
4. Everything is Cache?
- Small, fast storage used to improve average access time to slow memory
- Exploits spatial and temporal locality
- In computer architecture, almost everything is a cache!
  - First-level cache is a cache of the second-level cache
  - Second-level cache is a cache of memory
  - Memory is a cache of disk
5. Caching Timing
- Time scales are HUGE (10 ms is 10,000,000 cycles)
- [Timing diagram: access times of 2 ns, 10 ns, 100 ns, 10 ms, and 100 s, with annotated gaps of 10x and 100,000x between levels]
6. Computer Memory
- Computer memory is a collection of cells capable of storing binary information.
- Two states, 0 and 1, in each cell
- Two operations: Write and Read
- [Figure: a ferromagnetic core with write and read windings; a semiconductor cell with 3 terminals (look familiar?)]
7. How Data is Represented
- Storing electrical signals
- Data is represented as electrical signals.
- Digital signals are used to transmit data to and from devices attached to the system bus.
- Storage devices must accept electrical signals as input and output.
- [Figure: a digital electrical signal alternating between 10101010 and 10101011 every second (by voltage)]
8. Static RAM (SRAM)
- Implemented with multiple transistors, usually six.
- One state represents 1, the other state represents 0.
- Pros
  - Simplicity: holds its data without external refresh, for as long as power is supplied to the circuit.
  - Speed: SRAM is faster than DRAM.
- Cons
  - Cost: SRAM is, byte for byte, roughly two orders of magnitude more expensive than DRAM.
  - Size: SRAM cells take up much more space than DRAM cells (which is part of why the cost is higher).
- Used for cache memory
9. Dynamic RAM (DRAM)
- Uses a transistor and a capacitor.
- Loses charge quickly.
- Requires a fresh infusion of power thousands of times per second.
- Each refresh operation is called a refresh cycle.
- Pros
  - Size: about 1/4 the size of SRAM (uses only one transistor).
  - Cost: much less expensive (about 100x) than SRAM.
- Cons
  - Cell refreshing: constant refreshing (reading) of each cell. This refreshing action is why the memory is called dynamic.
  - Speed: SRAM is about 10 times faster.
- Used for main memory (RAM) in computers.
10. Synchronous DRAM (SDRAM)
- SDRAM is not the same as SRAM.
- Faster than asynchronous DRAM.
- Read-ahead RAM that uses the same clock pulse as the system bus.
- Read and write operations are broken into a series of simple steps, and each step can be completed in one bus clock cycle.
- Pipelining is possible with SDRAM.
- SDRAM is the most commonly used technology for PC memory today.
11. So you want fast?
- It is possible to build a computer which uses only SRAM. Why not?
- It would be very fast
- It would need no cache
  - How can you cache a cache?
- BUT it would cost a very large amount of money.
12. Motivation for Caches
- Large main memories (DRAM) are slow
- Small cache memories (SRAM) are fast
- Make the average access time small by servicing most accesses from the small, fast memory
- By combining a small fast memory with a large slow memory, we can get (almost) the speed of the fast memory for the price of the slow memory. The fast memory is called a cache.
13. Cache Memory: Data Transfer
- Small amount of fast memory (SRAM)
- Sits between normal main memory and the CPU
- May be located on the CPU chip (e.g., the Level 1 cache)
- Goal: high speed, high capacity
- [Figure: CPU, a 256-word cache of 32-bit (4-byte) words, and main memory; the cache is checked first, then main memory]
14. Cache Memory: Principle
- Basic idea: the most heavily used memory words are kept in the cache. When a memory word is required, the CPU first looks in the cache. If the word is not there, it looks to main memory.
- To be successful, a large percentage of the words searched for must be in the cache. We can ensure this by exploiting the locality principle: when a word is referenced, it and some of its neighbors are brought into the cache, so that the next time it can be accessed quickly.
15. Cache Memory: Locality
- The Principle of Locality states that programs access a small portion of the address space at any instant of time (e.g., 90% of the time in 10% of the code)
- Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon (loops)
  - Keep more recently accessed data items closer to the processor
- Spatial Locality (Locality in Space): if an item is referenced, items in close proximity to it will tend to be referenced soon (arrays)
  - Move blocks consisting of contiguous words to the upper levels
- Both kinds show up in ordinary code; see the C sketch after this list.
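A minimal C sketch of both kinds of locality (the array, its size, and the loop shapes are illustrative, not from the slides):

```c
#include <stdio.h>

#define N 1024

int main(void) {
    int a[N];
    int sum = 0;

    /* Spatial locality: consecutive array elements share a cache
       block, so after a[i] misses, a[i+1], a[i+2], ... usually hit. */
    for (int i = 0; i < N; i++)
        a[i] = i;

    /* Temporal locality: the loop variables i and sum, and the loop's
       own instructions, are reused on every iteration, so they stay
       resident in the cache for the whole loop. */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %d\n", sum);
    return 0;
}
```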
16. Cache Examples
- Library: if the librarians know that a class is doing a unit on Russian literature, and they know the first book is Dostoevsky's The Brothers Karamazov, they can keep a cache of it at the front desk to save retrieval time (temporal locality). They can also keep a cache of related books, such as Tolstoy's War and Peace (spatial locality).
- Music downloads (temporal and spatial)
17. How the Locality Principle Works with Cache Memory
- Using the locality principle, main memory and the cache are divided into fixed-size blocks. Within the cache, these blocks are called cache lines. When a cache miss occurs, the entire cache line is loaded.
- Instructions and data can either be kept in the same cache (unified cache) or in separate caches (split caches). Today, split caches are the most common.
- There can also be multiple caches: on chip, off chip but in the same package as the CPU, and farther away.
18. Cache Terminology
- Hit: the data appears in some block in the upper level (e.g., Block X)
- Hit Rate: the fraction of memory accesses found in the upper level
- Hit Time: time to access the upper level, which consists of the time to determine hit/miss plus the SRAM access time
- Miss: the data must be retrieved from a block in the lower level (Block Y)
- Miss Rate = 1 - (Hit Rate)
- Miss Penalty: time to replace a block in the upper level from the lower level, plus the time to deliver the block to the processor
- Hit Time << Miss Penalty
19. Cache Memory: Access Time
- Let T1 be the cache access time, T2 the main memory access time, and h the hit ratio (the fraction of references satisfied out of the cache). Then the mean access time is
- T1 + (1 - h) T2
- As h approaches 1, all references can be satisfied out of the cache and the access time approaches T1. On the other hand, as h approaches 0, the access time approaches T1 + T2.
20. Example: Hit Rate
- Suppose that the processor has access to 2 levels of memory. Level 1 contains 1,000 words and has an access time of 0.01 µs; level 2 contains 100,000 words and has an access time of 0.1 µs. Assume that if a word is in level 1, the processor accesses it directly. If it is in level 2, the word is first transferred to level 1 and then accessed by the processor. Assume 95% of memory accesses are found in the cache. What is the average time to access a word?
- T1 = 0.01 µs, T2 = 0.10 µs
- Mean Access Time = 0.01 µs + (1 - 0.95)(0.10 µs)
- = 0.015 µs
- (This calculation is repeated in the C snippet below.)
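The formula and this worked example can be checked in a few lines of C (the function name mean_access_time is mine; the values are the slide's):

```c
#include <stdio.h>

/* Mean access time = T1 + (1 - h) * T2, per the formula on slide 19. */
double mean_access_time(double t1, double t2, double hit_ratio) {
    return t1 + (1.0 - hit_ratio) * t2;
}

int main(void) {
    /* Slide values: T1 = 0.01 us, T2 = 0.10 us, h = 0.95. */
    printf("%.3f us\n", mean_access_time(0.01, 0.10, 0.95)); /* 0.015 us */
    return 0;
}
```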
21. Pentium 4 Block Diagram
22. Typical Values
- Hit time: 1 cycle
- Miss penalty: 10 cycles (access + transfer)
- Miss rate: usually well under 10%
23. Cache Operation: Overview
24. Some Laptop Specs
- Recognize any terms here?
25. Cache Evolution
26. Intel and Cache Memory
27. Cache Mapping
- How is the cache mapped to main memory? How does one determine whether there is a cache hit?
- Since the number of memory blocks is larger than the number of cache lines, special mapping policies are needed to map memory blocks into cache lines.
- Mapping Policies
  - Direct Mapping
  - Associative Mapping
  - Set Associative Mapping
28. Main Memory/Cache Structure
- Main memory space is divided into blocks. Suppose:
  - 1) # of addressable words = 2^n
    - If 32-bit addresses, then 2^32 addressable words
  - 2) Block size = K words/block
    - Ex: 4 words/block
  - 3) # of memory blocks B = 2^n / K
    - B = 2^32 / 4 = 2^30 blocks
- Cache space is divided into m cache lines
  - Say m = 2^14 lines
  - With K = 4 words/line, the cache size is 4 words/line x 2^14 lines = 2^16 words = 64K words (checked in the C snippet below)
- [Figure: main memory blocks mapped into cache lines]
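A short C check of the sizing arithmetic above (all constants are the slide's example values):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    int n = 32;        /* 2^n addressable words (32-bit addresses) */
    int k = 4;         /* K = 4 words per block                    */
    int m_bits = 14;   /* m = 2^14 cache lines                     */

    uint64_t blocks      = (1ULL << n) / k;        /* B = 2^32 / 4 = 2^30 */
    uint64_t cache_words = (uint64_t)k << m_bits;  /* 4 x 2^14 = 2^16     */

    printf("memory blocks B = %llu (2^30)\n", (unsigned long long)blocks);
    printf("cache size = %llu words (64K)\n", (unsigned long long)cache_words);
    return 0;
}
```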
29. Cache Organization
- (1) How do you know if something is in the cache?
- (2) If it is in the cache, how do you find it?
- The answers to (1) and (2) depend on the type, or organization, of the cache
- In a direct mapped cache, each memory address is associated with one possible block within the cache
- Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache. This makes cache access fast!
30. Simplest Cache: Direct Mapped
- [Figure: a 4-block direct-mapped cache next to a 16-block main memory. Cache indexes run 0-3 (binary 00-11); memory block addresses run 0-15 (binary 0000-1111). Each memory block address splits into a tag (upper bits) and an index (lower bits).]
- The index determines the block in the cache
- If the number of cache blocks is a power of 2, then the cache index is just the lower n bits of the memory address
- For example, memory blocks 0001, 0101, 1001, and 1101 all map to cache index 01; a C sketch follows.
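A minimal C sketch of the index calculation for the 4-block cache above (the function name is mine; masking the low bits is the slide's technique):

```c
#include <stdio.h>

#define NUM_BLOCKS 4            /* must be a power of 2 */

/* The lower log2(NUM_BLOCKS) bits of the block address are the cache index. */
unsigned cache_index(unsigned block_addr) {
    return block_addr & (NUM_BLOCKS - 1);
}

int main(void) {
    /* Memory blocks 0b0001, 0b0101, 0b1001, 0b1101 all map to index 1. */
    for (unsigned b = 1; b < 16; b += 4)
        printf("block %2u -> index %u\n", b, cache_index(b));
    return 0;
}
```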
31. Elements of a Direct-Mapped Cache
- If the block size > 1 word, the rightmost bits of the index are really the offset of a word (possibly a byte number) within the indexed block
32. Direct-Mapped Cache Issues
- The direct mapped cache is simple to design and its access time is fast (why?)
- Good for an L1 (on-chip) cache
- Problem: conflict misses, and hence a lower hit ratio
- Conflict misses are misses caused by accessing different memory locations that are mapped to the same cache index
- In a direct mapped cache there is no flexibility in where a memory block can be placed in the cache, which contributes to conflict misses
33. Direct Mapping Example
- With m = 2^14 cache lines and s = 2^22 blocks in main memory:
  - 2^14 = 16,384 lines
  - 2^22 = 4,194,304 blocks
- The low bits of an address hold a 14-bit line number followed by a 2-bit word offset:
  - 0000 0000 0000 0100 -> line 1
  - 0011 0011 1001 1100 -> line 3303
34. Direct Mapping Exercise
- (Stallings 4.8) Consider a machine with a byte-addressable main memory of 2^16 bytes and a block size of 8 bytes. Assume that a direct mapped cache consisting of 32 lines is used with this machine.
- a) How is a 16-bit memory address divided into tag, line number, and byte number?
  - There are 8 bytes per block, so 3 bits identify the byte within a block. There are 32 cache lines, so 5 bits are needed to refer to the cache line. This leaves 8 bits for the tag.
- b) Into what line would bytes with each of the following addresses be stored? (See the C check after this exercise.)
  - 0001 0001 0001 1011: the 5 line bits are 00011 = 3.
  - 1100 0011 0011 0100: the 5 line bits are 00110 = 6.
  - 1101 0000 0001 1101: the 5 line bits are 00011 = 3.
  - 1010 1010 1010 1010: the 5 line bits are 10101 = 21.
- c) Suppose the byte with address 0001 1010 0001 1010 is stored in the cache. What are the addresses of the other bytes stored along with it?
  - The addresses stored in the same block range from 0001 1010 0001 1000 to 0001 1010 0001 1111.
- d) How many total bytes of memory can be stored in the cache?
  - The cache has 32 lines with 8 bytes each: 32 x 8 = 256 bytes.
- e) Why is the tag also stored in the cache?
  - The tag determines whether an access is a cache hit; tags distinguish between the different memory blocks that can occupy the same cache line.
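Part (b) can be verified with a small C program using the 8/5/3 tag/line/byte split from part (a) (the function name and output format are mine):

```c
#include <stdio.h>
#include <stdint.h>

/* 16-bit address = tag (8 bits) | line (5 bits) | byte (3 bits). */
void split(uint16_t addr) {
    unsigned byte = addr & 0x7;          /* low 3 bits  */
    unsigned line = (addr >> 3) & 0x1F;  /* next 5 bits */
    unsigned tag  = addr >> 8;           /* high 8 bits */
    printf("0x%04X -> tag %3u, line %2u, byte %u\n", addr, tag, line, byte);
}

int main(void) {
    split(0x111B); /* 0001 0001 0001 1011 -> line 3  */
    split(0xC334); /* 1100 0011 0011 0100 -> line 6  */
    split(0xD01D); /* 1101 0000 0001 1101 -> line 3  */
    split(0xAAAA); /* 1010 1010 1010 1010 -> line 21 */
    return 0;
}
```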
35. Direct Mapping: Pros and Cons
- Advantages
  - Simple
  - Inexpensive
- Disadvantage
  - The fixed cache location for a given block can lead to a high miss rate and low performance.
  - For example, if the processor repeatedly accesses 2 memory blocks that map to the same cache line, the miss rate will be very high, even though other cache lines may be idle.
36. Associative Mapping
- A main memory block can be mapped into any cache line; the mapping relationship is not fixed.
- The memory address contains two fields: tag and word.
- The tag uniquely identifies the memory block.
- A cache search must check every line's tag.
37. Associative Mapping Example
- Address: Tag (22 bits) | Word (2 bits)
- Cache size is 4 x (word size in bytes) x m cache lines. If a word is 1 byte, then a block is 4 bytes and the cache size is 4m bytes.
- The tag has 22 bits (2^22 = 4M). 2 bits identify the byte (or word) in a memory block or cache line, so there are 2^2 = 4 bytes per block (or line).
- Memory size is 2^22 x 2^2 = 2^24 = 16 MB; each block has 4 bytes, so there are a total of 4M blocks of memory.
- Compare the address tag field with the tag entry in each cache line to determine a cache hit or miss.
- e.g. (checked in the C snippet below):
  - Address (24 bits): 0001 0110 0011 0011 1001 1100 (binary) = 16339C (hex)
  - Tag (22 bits): 00 0101 1000 1100 1110 0111 (binary) = 058CE7 (hex)
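The tag computation can be checked in C by shifting off the 2 word bits (the constants are the slide's):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t addr = 0x16339C;        /* 24-bit address from the slide  */
    uint32_t tag  = addr >> 2;       /* drop 2 word bits -> 22-bit tag */
    printf("tag = 0x%06X\n", tag);   /* prints 0x058CE7, as above      */
    return 0;
}
```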
38. Associative Mapping Example (continued)
- [Figure: 22-bit tag within a 24-bit memory address]
- Features: high hit rate, but a very complex hardware design.
39. Set Associative Mapping
- Combines features of direct mapping and associative mapping
- The cache is divided into a number of sets
- Each set contains a number of lines
- A given memory block can only be mapped into a specific set
- Within that set, the memory block can be mapped into any cache line
- Example: 2-way set associative mapping
  - 2 cache lines in each set
  - A given memory block can be mapped into only one cache set. Inside that set, the memory block can be mapped into any one of the cache lines.
40. Address Structure
- The address contains three fields: tag, set, and word
- w bits identify the word or byte in a memory block or cache line; there are 2^s blocks in main memory
- Cache space is divided into 2^d sets
- Each set contains up to 2^(s-d) cache lines
- Searching takes two steps:
  - Use the set field to determine which cache set to look in
  - Compare the tag field to check for a cache hit or miss
41. Set Associative Exercise
- (Stallings 4.1) A set associative cache consists of 64 lines, or slots, divided into 4-line sets. Main memory contains 4K blocks of 128 words each. Show the format of main memory addresses.
- The cache is divided into 16 sets of 4 lines each (16 x 4 = 64). Therefore, 4 bits are needed to identify the set number.
- Each block contains 128 = 2^7 words, so the word field must be 7 bits long.
- Main memory consists of 4K = 2^12 blocks. Therefore, the set plus tag lengths must total 12 bits, and so the tag length is 8 bits.
- The address is thus 19 bits: tag (8) | set (4) | word (7); a C sketch of the split follows.
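A hedged C sketch of the resulting 19-bit address split (the function name and the example address are mine; the 8/4/7 field widths come from the exercise):

```c
#include <stdio.h>
#include <stdint.h>

/* 19-bit address = tag (8 bits) | set (4 bits) | word (7 bits). */
void split(uint32_t addr) {
    unsigned word = addr & 0x7F;         /* low 7 bits  */
    unsigned set  = (addr >> 7) & 0xF;   /* next 4 bits */
    unsigned tag  = (addr >> 11) & 0xFF; /* high 8 bits */
    printf("tag %3u, set %2u, word %3u\n", tag, set, word);
}

int main(void) {
    split(0x12345u & 0x7FFFF);  /* an arbitrary address truncated to 19 bits */
    return 0;
}
```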
42. Set Associative Example
43. Replacement Algorithms
- When there is no available cache line in which to place a memory block, a replacement algorithm is used. The replacement algorithm determines which line is to be freed up for the new block.
- For the direct mapping function, because each block maps to only one fixed cache line, no replacement algorithm is needed.
- We will focus on replacement algorithms for fully associative mapping and set associative mapping.
44. Replacement Algorithms: Direct Mapping
- No choice
- Each block maps to only one line
- Replace that line
45. Replacement Algorithms for Associative and Set Associative Mapping
- Least Recently Used (LRU)
  - e.g., in a 2-way set associative cache, determine which of the 2 blocks is the least recently used and replace it
- First In First Out (FIFO)
  - Replace the block that has been in the cache longest
- Least Frequently Used (LFU)
  - Replace the block which has had the fewest hits
- Random Replacement
  - Randomly replace one block from the candidate cache lines
- For 2-way set associative LRU, a single bit per set suffices; see the sketch below.
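A minimal C sketch of 2-way LRU as described above (the structure and names are mine, not from the slides):

```c
#include <stdbool.h>
#include <stdio.h>

/* One 2-way set: two tags, valid bits, and one LRU bit naming the
   least recently used way. */
struct set2 {
    unsigned tag[2];
    bool     valid[2];
    int      lru;          /* index of the way to evict next */
};

/* Returns true on hit; on miss, fills or evicts the LRU way. */
bool cache_access(struct set2 *s, unsigned tag) {
    for (int w = 0; w < 2; w++) {
        if (s->valid[w] && s->tag[w] == tag) {
            s->lru = 1 - w;           /* the other way is now LRU */
            return true;
        }
    }
    int victim = s->lru;              /* miss: replace the LRU way */
    s->tag[victim] = tag;
    s->valid[victim] = true;
    s->lru = 1 - victim;
    return false;
}

int main(void) {
    struct set2 s = { {0, 0}, {false, false}, 0 };
    printf("%d\n", cache_access(&s, 7));  /* 0: miss, fills way 0 */
    printf("%d\n", cache_access(&s, 9));  /* 0: miss, fills way 1 */
    printf("%d\n", cache_access(&s, 7));  /* 1: hit               */
    return 0;
}
```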
46. Write Policy
- Some unpleasant situations:
  - If a word in the cache has been changed, then the corresponding word in memory is invalid.
  - If a word in memory is changed, then the corresponding cache word is invalid.
  - If multiple processor-cache modules exist and a word in one cache is changed, then the corresponding word in the other caches becomes invalid.
- A write policy is used to update memory or cache so as to keep the data consistent in memory and in cache.
- Write through
- Write back
47. Write Through
- All writes go to main memory as well as to the cache
- Multiple CPUs can monitor main memory traffic to keep their local (per-CPU) caches up to date
- Pro
  - Cache and memory are kept coherent
- Cons
  - Lots of memory traffic
  - Slows down writes
48. Write Back
- Updates are initially made in the cache only
- An update bit for the cache slot is set when an update occurs
- If a block is to be replaced, it is written to main memory only if its update bit is set
- Pro
  - More efficient than write through; less traffic
- Cons
  - Caches can get out of sync with memory
  - I/O must access main memory through the cache
- This has been done on all Intel processors since the 486.
- (A C sketch of this bookkeeping follows.)
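A hedged C sketch of the write-back bookkeeping described above (the names are mine; write-through would instead forward every write to memory immediately):

```c
#include <stdbool.h>
#include <stdio.h>

struct line {
    unsigned tag;
    bool     valid;
    bool     dirty;        /* the "update bit": set when the line is modified */
    unsigned data;
};

/* Stand-in for the bus transaction to main memory. */
static void mem_write(unsigned tag, unsigned data) {
    printf("write back: tag %u, data %u\n", tag, data);
}

/* Write-back policy: a write updates only the cache and sets the dirty bit. */
static void cache_write(struct line *l, unsigned data) {
    l->data  = data;
    l->dirty = true;       /* memory is now stale until this line is evicted */
}

/* On replacement, write to main memory only if the update bit is set. */
static void evict(struct line *l) {
    if (l->valid && l->dirty)
        mem_write(l->tag, l->data);
    l->valid = l->dirty = false;
}

int main(void) {
    struct line l = { 42, true, false, 0 };
    cache_write(&l, 99);   /* dirty, but no memory traffic yet */
    evict(&l);             /* the single write back happens now */
    return 0;
}
```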