

1
CHAPTER 7 LARGE AND FAST: EXPLOITING MEMORY
HIERARCHY
  • Topics to be covered
  • Principle of locality
  • Memory hierarchy
  • Cache concepts and cache organization
  • Virtual memory concepts
  • Impact of cache and virtual memories on performance

2
PRINCIPLE OF LOCALITY
  • Two types of locality inherent in programs are
  • Temporal locality (locality in time)
  • If an item is referenced, it will tend to be
    referenced again soon
  • Spatial locality (locality in space)
  • If an item is referenced, items whose
    addresses are close by will tend to be
    referenced soon (the sketch after this list
    illustrates both kinds of locality)
  • The principle of locality makes the
    implementation of a memory hierarchy possible.
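A small illustrative sketch (not from the slides): summing an array touches adjacent elements in order (spatial locality) and reuses the same few variables on every iteration (temporal locality).

```python
# Illustration of locality, assuming a simple array-sum loop.
data = list(range(1000))

total = 0
for i in range(len(data)):
    total += data[i]   # spatial locality: data[i+1] sits right next to data[i]
                       # temporal locality: `total` and `i` are reused every iteration
print(total)           # 499500
```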

3
MEMORY HIERARCHY
  • Consists of multiple levels of memory with
    different speeds and sizes.
  • Goal is to provide the user with memory at a low
    cost, while providing access at the speed offered
    by the fastest memory.

4
MEMORY HIERARCHY (Continued)
  • Level               Speed     Size      Cost/bit   Implemented using
  • CPU
  • Cache               Fastest   Smallest  Highest    SRAM
  • Main Memory                                        DRAM
  • Secondary Memory    Slowest   Biggest   Lowest     Magnetic disk
  • Memory hierarchy in a computer

5
CACHE MEMORY
  • Cache represents the level of memory hierarchy
  • between the main memory and the CPU.
  • Terms associated with cache
  • Hit: The item requested by the processor is
    found in some block in the cache.
  • Miss: The item requested by the processor is not
    found in the cache.

6
Terms associated with cache (Continued)
  • Hit rate: The fraction of memory accesses
    found in the cache. Used as a measure of
    performance of the cache.
  • Hit rate = (Number of hits) / (Number of accesses)
  •          = (Number of hits) / (Number of hits +
    Number of misses)
  • Miss rate: The fraction of memory accesses not
    found in the cache.
  • Miss rate = 1.0 - Hit rate

7
Terms associated with cache (Continued)
  • Hit time: Time to access the cache memory.
  • Includes the time needed to determine
    whether the access is a hit or a miss.
  • Miss penalty:
  • Time to replace a cache block with the
    corresponding block from memory, plus the time to
    deliver this block to the processor.

8
Cache Organizations
  • Three types of cache organization are available
  • Direct-mapped cache
  • Set associative cache
  • Fully associative cache

9
DIRECT MAPPED CACHE
  • Each main memory block is mapped to exactly one
    location in the cache.
  • (It is assumed for right now that 1 block = 1 word.)
  • For each block in the main memory, a
    corresponding cache location is assigned based on
    the address of the block in the main memory.
  • Mapping used in a direct-mapped cache
    (a small sketch follows this slide):
  • Cache index = (Memory block address) modulo
    (Number of blocks in the cache)
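A minimal sketch of the mapping rule above. The 8-block cache size is taken from the worked example a few slides below; the function name is illustrative.

```python
# Direct-mapped placement, assuming a cache of 8 one-word blocks.
NUM_CACHE_BLOCKS = 8

def cache_index(memory_block_address: int) -> int:
    # Cache index = (memory block address) modulo (number of blocks in the cache)
    return memory_block_address % NUM_CACHE_BLOCKS

print(cache_index(22))   # 22 mod 8 = 6 -> binary index 110
print(cache_index(26))   # 26 mod 8 = 2 -> binary index 010
```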

10
Example of a Direct-Mapped Cache
  • Figure 7.5 A direct-mapped cache of 8
    blocks

11
Accessing a Cache Location and Identifying a Hit
  • We need to know
  • Whether a cache block has valid information
  • - done using valid bit
  • and
  • Whether the cache block corresponds to the
    requested word
  • - done using tags

12
CONTENTS OF A CACHE MEMORY BLOCK
  • A cache memory block consists of the data bits,
    the tag bits and a valid (V) bit.
  • The V bit is set only if the cache block has valid
    information.
  • Cache index | V | Tag | Data

13
CACHE AND MAIN MEMORY STRUCTURE
  • Cache: each entry holds an index, a valid (V) bit,
    a tag, and a data field one block (K words) wide.
  • Main memory: a sequence of numbered blocks (block
    addresses 0, 1, 2, ...), each block K words long;
    the block is the unit transferred between main
    memory and the cache.

14
CACHE CONCEPT
  • CPU  <-- word transfer -->  Cache  <-- block transfer -->  Main Memory

15
IDENTIFYING A CACHE HIT
  • The index of a cache block and the tag contents
    of
  • that block uniquely specify the memory address of
  • the word contained in the cache block.
  • Example
  • Consider a 32-bit memory address and a cache
    block size of one
  • word. The cache has 1024 words. Compute the
    following.
  • Cache index = ? bits
  • Byte offset = ? bits
  • Tag = ? bits
  • Actual cache size = ? bits
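One way to work the exercise above, assuming byte-addressed memory with 4-byte words (an assumption consistent with the 2-bit byte offset of a one-word block):

```python
# 1024 one-word blocks -> 10-bit index; 4 bytes per word -> 2-bit byte offset.
address_bits = 32
byte_offset_bits = 2
index_bits = 10                                              # 2**10 = 1024 blocks
tag_bits = address_bits - index_bits - byte_offset_bits      # 20 tag bits
bits_per_entry = 32 + tag_bits + 1                           # data + tag + valid = 53
total_cache_bits = 1024 * bits_per_entry                     # 54,272 bits in total
print(tag_bits, bits_per_entry, total_cache_bits)
```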

16
Example (Continued)
  • Figure 7.7 Identifying a hit on the cache block

17
A Cache Example
  • The series of memory address references given as
    word addresses are 22, 26, 22, 26, 16, 3, 16, and
    18. Assume a direct-mapped cache with 8 one-word
    blocks that is initially empty. Label each
    reference in the list as a hit or miss and show
    the contents of the cache after each reference.
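A small simulation of this example (not part of the original slides); the index/tag split assumes the 8-block direct-mapped cache stated in the problem.

```python
# Direct-mapped cache of 8 one-word blocks, initially empty.
refs = [22, 26, 22, 26, 16, 3, 16, 18]
NUM_BLOCKS = 8
cache = [None] * NUM_BLOCKS          # each entry holds a tag, or None if invalid

for addr in refs:
    index = addr % NUM_BLOCKS        # low-order bits select the cache block
    tag = addr // NUM_BLOCKS         # remaining high-order bits form the tag
    if cache[index] == tag:
        print(f"address {addr:2d}: hit  (index {index:03b})")
    else:
        print(f"address {addr:2d}: miss (index {index:03b}), loading tag {tag:02b}")
        cache[index] = tag
```

Running the sketch labels the references miss, miss, hit, hit, miss, miss, hit, miss.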

18
A Cache Example (Continued)
  • Cache state tables (a) and (b): each table lists
    Index (000-111), V, Tag, and Data for the 8 blocks.
  • (a) is the initial state of the cache (all entries
    invalid).

19
A Cache Example (Continued)
  • Cache state tables (c) and (d): Index (000-111),
    V, Tag, and Data after the next references.

20
A Cache Example (Continued)
  • Cache state tables (e) and (f): Index (000-111),
    V, Tag, and Data after the final references.

21
HANDLING CACHE MISSES
  • If the cache reports a miss, then the
    corresponding block has to be loaded into the
    cache from the main memory.
  • The requested word may be forwarded immediately
    to the processor as the block is being updated
  • or
  • The requested word may be delayed until the
    entire block has been stored in the cache

22
HANDLING CACHE MISSES FOR INSTRUCTIONS
  • Decrement PC by 4
  • Fetch the block containing the missed instruction
    from the main memory and write the block into the
    cache
  • Write the instruction block into the data
    portion of the referenced cache block
  • Copy the upper bits of the referenced memory
    address into the tag field of the cache memory
  • Turn the valid (V) bit on
  • Restart the fetch cycle - this will refetch the
    instruction, this time finding it
    in cache

23
HANDLING CACHE MISSES FOR DATA
  • Read the block containing the missed data from
    the main memory and write the block into the
    cache
  • Write the data into the data portion of the
    referenced cache block
  • Copy the upper bits of the referenced memory
    address into the tag field of the cache memory
  • Turn the valid (V) bit on
  • The requested word may be forwarded immediately
    to the processor as the block is being updated
  • or
  • The requested word may be delayed until the
    entire block has been stored in the cache

24
CACHE WRITE POLICIES
  • Two techniques are used in handling a write to a
    cache block:
  • Write-through technique
  • Updates both the cache and the main memory
    for each write.
  • Write-back technique
  • Writes to the cache only and postpones
    updating the main memory until the block is
    replaced in the cache.

25
CACHE WRITE POLICIES (Continued)
  • The write-back strategy usually employs a dirty
    bit associated with each cache block, much the same
    as the valid bit.
  • The dirty bit is set the first time a value is
    written to the block.
  • When a block in the cache is to be replaced, its
    dirty bit is examined.
  • If the dirty bit has been set, the block is
    written back to the main memory; otherwise it is
    simply overwritten. (A short sketch follows.)
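A minimal sketch, using hypothetical structures, of the dirty-bit behaviour described above: writes update only the cache and set the dirty bit, and main memory is updated only when a dirty block is evicted.

```python
class CacheBlock:
    def __init__(self):
        self.valid = False
        self.dirty = False
        self.tag = None
        self.data = None

def write_word(block, data):
    # Write hit under write-back: update the cache only and set the dirty bit.
    block.data = data
    block.dirty = True

def replace_block(block, new_tag, new_data, write_back_to_memory):
    # On replacement, the dirty bit is examined first.
    if block.valid and block.dirty:
        write_back_to_memory(block.tag, block.data)   # copy the modified block back
    block.tag, block.data = new_tag, new_data         # otherwise simply overwrite
    block.valid, block.dirty = True, False

b = CacheBlock()
replace_block(b, 0x1A, "old block", lambda t, d: print("write back", hex(t), d))
write_word(b, "modified data")                        # marks the block dirty
replace_block(b, 0x2B, "new block", lambda t, d: print("write back", hex(t), d))
```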

26
TAKING ADVANTAGE OF SPATIAL LOCALITY
  • To take advantage of spatial locality we should
    have a block that is larger than one word in length
    (a multiple-word block).
  • When a miss occurs, the entire block (consisting
    of multiple adjacent words) will be fetched from
    the main memory and brought into the cache.
  • The total number of tags and valid bits, in a
    cache with multiword blocks, is smaller because each
    tag and valid bit serves several words.

27
Cache with Multiple-word Blocks - Example
  • Figure 7.9 A 16 KB cache using 16-word blocks

28
Identifying a Cache Block for a Given Memory
Address
  • For a given memory address, the corresponding
    cache block can be determined as follows
    (a short sketch follows this slide):
  • Step 1: Identify the memory block that contains
    the given memory address
  • Memory block address = (Word address) div
    (Number of words in the block), or
    Memory block address = (Byte address) div
    (Number of bytes in the block)
  • (The memory block address is essentially the
    block number in the main memory.)
  • Step 2: Compute the cache index corresponding to
    the memory block
  • Cache index = (Memory block address) modulo
    (Number of blocks in the cache)
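A minimal sketch of the two steps above, using the parameters of Figure 7.9 (16-word blocks, 16 KB of data, hence 256 blocks); the byte address used in the example is arbitrary.

```python
BYTES_PER_BLOCK = 64      # 16 words x 4 bytes per word
NUM_CACHE_BLOCKS = 256    # 16 KB of data / 64-byte blocks

def locate(byte_address: int):
    memory_block_address = byte_address // BYTES_PER_BLOCK     # Step 1: div
    cache_index = memory_block_address % NUM_CACHE_BLOCKS      # Step 2: modulo
    return memory_block_address, cache_index

print(locate(0x12345))    # -> (memory block 1165, cache index 141)
```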

29
HANDLING CACHE MISSES FOR A MULTIWORD BLOCK
  • For a cache read miss, the corresponding block is
  • copied into the cache from the memory.
  •  
  • A cache write miss, in a multiword block cache,
    is
  • handled in two steps
  • Step 1 Copy the corresponding block from memory
    into cache
  • Step 2 Update the cache block with the requested
    word

30
EFFECT OF A LARGER BLOCK SIZE ON PERFORMANCE
  • In general, the miss rate falls when we increase
    the block size.
  • The miss rate may actually go up, if the block
    size is made very
  • large compared with the cache size, due to the
    following reasons
  • The number of blocks that can be held in the
    cache will become small, giving rise to a great
    deal of competition for these blocks.
  • A block may be bumped out of the cache before
    many of its words can be used.
  • Increasing the block size increases the cost of a
    miss (miss penalty)

31
DESIGNING MEMORY SYSTEMS TO SUPPORT CACHES
  • Three memory organizations are widely used
  • One-word-wide memory organization
  • Wide memory organization
  • Interleaved memory organization
  • Figure 7.11 Three options for designing the
    memory system

32
Figure 7.11 Three options for designing the
memory system
33
Example
  • Consider the following memory access times
  • 1 clock cycle to send the address
  • 10 clock cycles for each DRAM access
    initiated
  • 1 clock cycle to send a word of data
  • Assume we have a cache block of four words.
    Discuss
  • the impact of the different organizations on miss
    penalty
  • and bandwidth per miss.
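A sketch of one way to work the numbers, following the usual treatment of the three organizations in Figure 7.11: a one-word-wide memory, a memory wide enough to return the whole four-word block in one access, and a four-bank interleaved memory (these interpretations are assumptions, not stated on the slide).

```python
ADDR, DRAM, XFER, WORDS = 1, 10, 1, 4           # cycles from the example above

one_word_wide   = ADDR + WORDS * DRAM + WORDS * XFER   # 1 + 40 + 4 = 45 cycles
four_word_wide  = ADDR + DRAM + XFER                   # 1 + 10 + 1 = 12 cycles
interleaved_4   = ADDR + DRAM + WORDS * XFER           # 1 + 10 + 4 = 15 cycles

block_bytes = WORDS * 4
for name, penalty in [("one-word-wide", one_word_wide),
                      ("four-word-wide", four_word_wide),
                      ("interleaved (4 banks)", interleaved_4)]:
    print(f"{name}: miss penalty {penalty} cycles, "
          f"bandwidth {block_bytes / penalty:.2f} bytes/cycle")
```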

34
MEASURING AND IMPROVING CACHE PERFORMANCE
  • Total cycles CPU spends on a program equals
  • (Clock cycles CPU spends executing the program) +
    (Clock cycles CPU spends waiting for the memory
    system)
  • Total CPU time = Total CPU cycles × Clock cycle
    time
  •               = (CPU execution clock cycles +
    Memory-stall clock cycles) × Clock cycle time

35
MEASURING AND IMPROVING CACHE PERFORMANCE
(Continued)
  • Memory-stall clock cycles = Read-stall cycles +
    Write-stall cycles
  • Read-stall cycles = Number of reads × Read miss
    rate × Read miss penalty
  • Write-stall cycles = Number of writes × Write
    miss rate × Write miss penalty
  • Total memory accesses = Number of reads + Number
    of writes
  • Therefore,
  • Memory-stall cycles = Total memory accesses × Miss
    rate × Miss penalty
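A minimal sketch of the combined formula above; the access count, miss rate, and penalty used in the call are illustrative values only.

```python
def memory_stall_cycles(total_accesses, miss_rate, miss_penalty):
    # Memory-stall cycles = Total memory accesses x Miss rate x Miss penalty
    return total_accesses * miss_rate * miss_penalty

print(memory_stall_cycles(1_000_000, 0.05, 12))   # 600,000 stall cycles
```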

36
MEASURING AND IMPROVING CACHE PERFORMANCE
  • Two ways of improving cache performance
  • Decreasing the cache miss rate
  • Decreasing the cache miss penalty

37
Example
  • Assume the following:
  • Instruction cache miss rate for gcc = 5%
  • Data cache miss rate for gcc = 10%
  • If a machine has a CPI of 4 without memory stalls
    and a miss penalty of 12 cycles for all misses,
    determine how much faster a machine would run with
    a perfect cache that never missed. The frequency of
    data transfer instructions in gcc is 33%.
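A sketch of one way to work this example, reading the figures on the slide as percentages and assuming an instruction count of I:

```python
CPI_base = 4
miss_penalty = 12
instr_miss_rate = 0.05
data_miss_rate = 0.10
data_transfer_freq = 0.33    # loads/stores per instruction

instr_stalls = instr_miss_rate * miss_penalty                       # 0.600 per instruction
data_stalls = data_transfer_freq * data_miss_rate * miss_penalty    # 0.396 per instruction
CPI_with_stalls = CPI_base + instr_stalls + data_stalls             # 4.996
speedup_perfect_cache = CPI_with_stalls / CPI_base                  # about 1.25x faster
print(CPI_with_stalls, round(speedup_perfect_cache, 2))
```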

38
REDUCING CACHE MISSES BY MORE FLEXIBLE PLACEMENT
OF BLOCKS
  • Direct-mapped cache: A block can go in
    exactly one place.
  • Set associative cache: There are a fixed number
    of locations where each block can be placed.
  • Each block in the memory maps to a unique set in
    the cache given by the index field.
  • A block can be placed in any element of that set.
  • The set corresponding to a memory block is given
    by
  • Cache set = (Block address) modulo (Number of
    sets in the cache)
  • A set associative cache with n possible locations
    for a block is called an n-way set associative
    cache. (A small sketch follows this slide.)
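A minimal sketch of the set mapping, assuming a small 2-way set associative cache of 8 blocks (the sizes are illustrative):

```python
NUM_BLOCKS = 8
ASSOCIATIVITY = 2                        # 2-way set associative
NUM_SETS = NUM_BLOCKS // ASSOCIATIVITY   # 4 sets

def cache_set(block_address: int) -> int:
    # Cache set = (block address) modulo (number of sets in the cache)
    return block_address % NUM_SETS

print(cache_set(12))   # 12 mod 4 = 0: block 12 may go in either way of set 0
```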

39
REDUCING CACHE MISSES BY MORE FLEXIBLE PLACEMENT
OF BLOCKS (Continued)
  • Fully Associative Cache A block can be placed
    in any location in the cache.
  • To find a block in a fully associative cache, all
    the entries
  • (blocks) in the cache must be searched
  • Figure 7.14 Examples of direct-mapped, set
    associative and fully associative caches
  • The miss rate decreases with the increase in the
    degree
  • of associativity.

40
Figure 7.14 Examples of direct-mapped, set
associative and fully associative caches

41
LOCATING A BLOCK IN THE CACHE
  • Fig. 7.17 Locating a block in a four-way set
    associative cache

42
CHOOSING WHICH BLOCK TO REPLACE
  • Direct-mapped Cache
  •  
  • When a miss occurs, the requested block can go
    in exactly one position. So the block occupying
    that position must be replaced.

43
CHOOSING WHICH BLOCK TO REPLACE (Continued)
  • Set associative or fully associative Cache
  •  
  • When a miss occurs we have a choice of where to
    place the requested block, and therefore a choice
    of which block to replace.
  • Set associative cache
  • All the blocks in the selected set are
    candidates for replacement.
  • Fully associative cache
  • All blocks in the cache are candidates for
    replacement.

44
Replacing a Block in Set Associative and Fully
Associative Caches
  • Strategies employed are
  • First-in first-out (FIFO)
  • The block replaced is the one that was brought
    in first
  • Least-frequently used (LFU)
  • The block replaced is the one that is least
    frequently used
  • Random: Blocks to be replaced are randomly
    selected
  • Least recently used (LRU)
  • The block replaced is the one that has been
    unused for the longest time
  • LRU is the most commonly used replacement
    technique (a small sketch follows this slide).
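A minimal sketch, with hypothetical structures, of LRU replacement within one set of an n-way set associative cache; real caches implement (or approximate) this in hardware rather than software.

```python
from collections import OrderedDict

class LRUSet:
    def __init__(self, ways: int):
        self.ways = ways
        self.blocks = OrderedDict()          # tag -> data, ordered oldest-first

    def access(self, tag, load_block):
        if tag in self.blocks:               # hit: mark as most recently used
            self.blocks.move_to_end(tag)
            return self.blocks[tag], True
        if len(self.blocks) >= self.ways:    # miss with a full set:
            self.blocks.popitem(last=False)  # evict the least recently used block
        self.blocks[tag] = load_block(tag)   # fetch the block from memory
        return self.blocks[tag], False

s = LRUSet(ways=2)
for t in [1, 2, 1, 3]:                       # tag 2 is the LRU block when 3 arrives
    _, hit = s.access(t, lambda tag: f"block {tag}")
    print(t, "hit" if hit else "miss")
```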

45
REDUCING THE MISS PENALTY USING MULTILEVEL CACHES
  • To further close the gap between the fast clock
    rates of modern
  • processors and the relatively long time required
    to access
  • DRAMs, high-performance microprocessors support
    an
  • additional level of caching.
  • This second-level cache (often off chip in a
    separate set of
  • SRAMs) will be accessed whenever a miss occurs in
    the
  • primary cache.
  •  
  • Since the access time for the second-level cache
    is significantly
  • less than the access time of the main memory, the
    miss penalty
  • of the primary cache is greatly reduced.
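A sketch of why this works, using assumed access times and an assumed L2 local miss rate (none of these figures appear on the slide): the primary cache's effective miss penalty becomes the L2 access time plus only the fraction of misses that must continue on to main memory.

```python
main_memory_access = 100     # cycles (assumed)
l2_access = 10               # cycles (assumed)
l2_local_miss_rate = 0.25    # fraction of L1 misses that also miss in L2 (assumed)

penalty_without_l2 = main_memory_access                                  # 100 cycles
penalty_with_l2 = l2_access + l2_local_miss_rate * main_memory_access    # 35 cycles
print(penalty_without_l2, penalty_with_l2)
```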

46
Evolution of Cache organization
  • 80386 No on-chip cache. Employs a
    direct-mapped external cache with a block size
    of 16 bytes (4, 32-bit words). Employs
    write-through technique.
  • 80486 Has a single on-chip cache of 8 KByte
    with a block size of 16 bytes and a 4-way set
    associative organization. Employs write-through
    technique and LRU replacement algorithm.
  • Pentium/Pentium Pro
  • Employs split instruction and data caches (2
    on-chip caches, one for data and one for
    instructions). Each cache is 8 KByte with a block
    size of 32 bytes and a 4-way set associative
    organization. Employs a write-back policy and
    the LRU replacement algorithm.
  • Supports the use of a 2-way set associative
    level 2 cache of 256 or 512 Kbytes with a block
    size of 32, 64, or 128 bytes. Employs a
    write-back policy and the LRU replacement
    algorithm. Can be dynamically configured to
    support write-through caching. In Pentium Pro,
    the secondary cache is on a separate die, but
    packaged together with the processor.

47
Evolution of Cache organization (Continued)
  • Pentium II Employs split instruction and
    data caches. Each cache is 16 Kbytes.
    Supports the use of a level 2 cache of 512
    Kbytes.
  • PII Xeon Employs split instruction and
    data caches. Each cache is 16 Kbytes.
    Supports the use of a level 2 cache of 1 Mbytes
    or 2 Mbytes.
  • Celeron Employs split instruction and
    data caches. Each cache is 16 Kbytes.
    Supports the use of a level 2 cache of 128
    Kbytes.
  • Pentium III Employs split instruction and
    data caches. Each cache is 16 Kbytes.
    Supports the use of a level 2 cache of 512
    Kbytes.

48
Evolution of Cache Organization (Continued)
  • Pentium 4 Employs split instruction and data
    caches (2 on-chip caches, one for data and one
    for instructions). Each cache is 8 KByte with a
    block size of 64 bytes and a 4-way set
    associative organization. Employs a write-back
    policy and the LRU replacement algorithm.
  • Supports the use of a 2-way set associative
    level 2 cache of 256 Kbytes with a block size
    of 128 bytes. Employs a write-back policy and
    the LRU replacement algorithm. Can be dynamically
    configured to support write-through caching.

49
Evolution of Cache organization (Continued)
  • Power PC
  • Model 601 has a single on-chip 32 Kbytes, 8-way
    set associative cache with a block size of 32
    bytes.
  • Model 603 has two on-chip 8 Kbytes, 2-way set
    associative caches with a block size of 32 bytes.
  • Model 604 has two on-chip 16 KByte, 4-way set
    associative caches with a block size of 32 bytes.
    Uses LRU replacement algorithm and write-through
    and write-back techniques. Employs a 2-way set
    associative level 2 cache of 256 or 512 Kbytes
    with a block size of 32 bytes.
  • Model 620 has two on-chip 32 Kbytes, 8-way set
    associative caches with a block size of 64 bytes.
  • Model G3 has two on-chip 32 Kbytes, 8-way set
    associative caches with a block size of 64 bytes.
  • Model G4 has two on-chip 32 Kbytes, 8-way set
    associative caches with a block size of 32 bytes

50
ELEMENTS OF CACHE DESIGN
  • The key elements that serve to classify and
    differentiate cache architectures are as follows
  • Cache size
  • Mapping function
  • Direct
  • Set associative
  • Fully associative
  • Replacement algorithms (for set and fully
    associative)
  • Least-recently used (LRU)
  • First-in-first-out (FIFO)
  • Least-frequently used (LFU)
  • Random
  • Write policy
  • Write-through
  • Write-back
  • Block size
  • Number of caches
  • Single-level or two-level
  • Unified or split

51
  • CACHE SIZE
  • The size of the cache should be small enough so
    that the overall average cost per bit is close to
    that of the memory alone and large enough so that
    the overall average access time is close to that
    of the cache alone.
  •  
  • Large caches tend to be slightly slower than
    small ones (because of the additional gates
    involved).
  •  
  • Cache size is also limited by the available chip
    and board area.
  • Because the performance of the cache is very
    sensitive to the nature of the workload, it is
    almost impossible to arrive at an optimum cache
    size. But studies have suggested that cache sizes
    of between 1K and 512K words would be optimum.

52
  • MAPPING FUNCTION
  • The choice of the mapping function dictates how
    the cache is organized.
  •  
  • The direct mapping technique is simple and
    inexpensive to implement. The main disadvantage
    is that there is a fixed cache location for any
    given block. Thus, if for example a program
    happens to repeatedly reference words from two
    different blocks that map into the same cache
    location, then the blocks will be continually
    swapped in the cache, and the hit ratio will be
    very low.
  •  
  • With (fully) associative mapping, there is
    flexibility as to which block to replace when a
    new block is read into the cache. Replacement
    algorithms are designed to maximize the hit
    ratio. The principal disadvantage is the complex
    circuitry required to examine the tags of all
    cache locations in parallel.
  • Set associative mapping is a compromise that
    exhibits the strengths of both the direct and
    fully associative approaches without their
    disadvantages. The use of two blocks per set is
    the most common set associative organization. It
    significantly improves the hit ratio over direct
    mapping.

53
  • REPLACEMENT ALGORITHMS
  • For set associative and fully associative
    mapping, a replacement algorithm is required. To
    achieve high speed, such algorithms must be
    implemented in hardware.
  • WRITE POLICY
  • The write-through technique, even though simple
    to implement, has the disadvantage that it
    generates substantial memory traffic and may
    create a bottleneck. The write-back technique
    minimizes memory writes. The disadvantage is
    that portions of the main memory are invalid, and
    hence access by I/O modules can be allowed only
    through the cache. This calls for complex
    circuitry and a potential bottleneck.
  • BLOCK SIZE
  • Larger blocks reduce the number of blocks that
    fit into a cache. Because each block fetch
    overwrites older cache contents, a small number
    of cache blocks result in data being overwritten
    shortly after it is fetched. Also, as a block
    becomes larger, each additional word is farther
    from the requested word, therefore less likely to
    be needed in the near future.
  •  
  • The relationship between block size and hit
    ratio is complex, depending on the locality
    characteristics of a given program. Studies have
    shown that a block size of 4 to 8 addressable
    units (words or bytes) would be reasonably close
    to optimum.

54
  • NUMBER OF CACHES
  • Two aspects have to be considered here number
    of levels of caches and the use of unified versus
    split caches.
  •  
  • Single- Versus Two-Level Caches: As logic
    density has increased, it has become possible to
    have a cache on the same chip as the processor: the
    on-chip cache. The on-chip cache reduces the
    processor's external bus activity and therefore
    speeds up execution time and increases overall
    system performance. If the system is also provided
    with an off-chip or external cache, then the system
    is said to have a two-level cache, with the on-chip
    cache designated as level 1 (L1) and the external
    cache designated as level 2 (L2). In the absence of
    an external cache, for every on-chip cache miss the
    processor has to access the DRAM. Because of the
    typically slow bus speed and slow DRAM access time,
    the overall performance of the system goes down.
    The potential savings due to the use of an L2 cache
    depend on the hit rates in both the L1 and L2
    caches. Studies have shown that, in general, the
    use of an L2 cache does improve performance.
  • Unified Versus Split Cache: When the on-chip
    cache first made its appearance, many designs
    consisted of a single on-chip cache used to store
    both data and instructions. More recently, it has
    become common to split the on-chip cache into two:
    one dedicated to instructions and one dedicated to
    data.
  •  
  • A unified cache has the following advantages: for
    a given cache size, a unified cache has a higher
    hit rate than split caches because it balances the
    load between instruction and data fetches
    automatically, and only one cache needs to be
    designed and implemented.
  •  
  • The advantage of split cache is that it
    eliminates the contention for cache between the
    instruction fetch unit and the execution unit.
    This is extremely important in any design that
    implements pipelining of instructions.

55
VIRTUAL MEMORY
  • Virtual memory permits each process to use the
    main memory as if it were the only user, and to
    extend the apparent size of accessible memory
    beyond its actual physical size.
  • The virtual address generated by the CPU is
    translated into a physical address, which in turn
    can be used to access the main memory. The
    process of translating the virtual address into a
    physical address is called memory mapping or
    address translation.
  • Page: A virtual memory block
  • Page fault: A virtual memory miss
  • Figure 7.19 The virtual addressed memory with
    pages mapped to the main memory
  • Figure 7.20 Mapping from a virtual to a physical
    address

56
Figure 7.19 The virtual addressed memory with
pages mapped to the main memory
57
Figure 7.20 Mapping from a virtual to a physical
address
58
PLACING A PAGE AND FINDING IT AGAIN
  • Operating system must maintain a page table.
  • Page Table
  • Maps virtual pages to physical pages, or else to
    locations in the secondary memory
  • Resides in memory
  • Indexed with the virtual page number from the
    virtual address
  • Contains the physical page number of the
    corresponding virtual page
  • Each program has its own page table, which maps
    the virtual address space of that program to main
    memory.
  • No tags are required in the page table because
    the page table contains a mapping for every
    possible virtual page.
    (A small translation sketch follows this slide.)
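A minimal sketch of page-table translation, using a hypothetical page table and an assumed 4 KB page size; the valid bit plays the same role as in the cache.

```python
PAGE_SIZE = 4096                                    # assumed page size

# page_table[virtual page number] = (valid bit, physical page number)
page_table = {0: (1, 7), 1: (1, 3), 2: (0, None)}   # hypothetical contents

def translate(virtual_address: int) -> int:
    vpn, offset = divmod(virtual_address, PAGE_SIZE)   # page number indexes the table
    valid, ppn = page_table[vpn]
    if not valid:
        raise RuntimeError("page fault: OS must fetch the page from disk")
    return ppn * PAGE_SIZE + offset                    # physical page number + offset

print(hex(translate(0x1234)))                          # VPN 1 -> PPN 3 -> 0x3234
```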

59
PLACING A PAGE AND FINDING IT AGAIN (Continued)
  • Page table register
  • Indicates the location of the page table in the
    memory
  • Points to the start of the page table.
  •  
  • Figure 7.21 Mapping from a virtual to a physical
    address using the page table register and the
    page table

60
Figure 7.21 Mapping from a virtual to a physical
address using the page table register and
the page table
61
PAGE FAULTS
  • If the valid bit for a virtual page is off, a
    page fault occurs.
  • Operating system is given control at this point
    (exception mechanism)
  • OS finds the page in the next level of the
    hierarchy (magnetic disk, for example)
  • OS decides where to place the requested page in
    main memory
  • OS also creates a data structure that tracks
    which processes and which virtual addresses use
    each physical page.
  • When a page fault occurs, if all the pages in
    the main memory are in use, the OS has to choose
    a page to replace. The algorithm usually employed
    is the LRU replacement algorithm.

62
WRITES
  • In a virtual memory system, writes to the disk
    take hundreds of
  • thousands of cycles. Hence write-through is
    impractical. The
  • strategy employed is called write-back (copy
    back).
  •  
  • Write-back technique
  • Individual writes are accumulated into a page
  • When the page is replaced in the memory, it is
    copied back into the disk.

63
MAKING ADDRESS TRANSLATION FAST THE
TRANSLATION-LOOKASIDE BUFFER (TLB)
  • If the CPU had to access a page table resident in
    memory to translate every memory reference, virtual
    memory would have too much overhead. Instead, a
    special cache, the TLB, holds recently used page
    table entries (a small sketch follows this slide).
  •  
  • Figure 7.23 TLB acts as a cache for page table
    references
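A minimal sketch, with hypothetical structures, of the TLB acting as a cache for page table entries: a TLB hit avoids the memory reference needed to read the page table.

```python
PAGE_SIZE = 4096
page_table = {0: (1, 7), 1: (1, 3)}             # vpn -> (valid, ppn), hypothetical
tlb = {}                                        # vpn -> ppn, small and fast

def translate(virtual_address: int) -> int:
    vpn, offset = divmod(virtual_address, PAGE_SIZE)
    if vpn in tlb:                              # TLB hit: no page table reference
        ppn = tlb[vpn]
    else:                                       # TLB miss: read the page table
        valid, ppn = page_table[vpn]
        if not valid:
            raise RuntimeError("page fault")
        tlb[vpn] = ppn                          # keep the translation in the TLB
    return ppn * PAGE_SIZE + offset

print(hex(translate(0x1234)), hex(translate(0x1ABC)))   # second access hits the TLB
```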

64
Figure 7.23 TLB acts as a cache for page table
references
65
FIG. 7.24 INTEGRATING VIRTUAL MEMORY, TLBs,
CACHE