Title: 7. Large and Fast: Exploiting Memory Hierarchy
1. Large and Fast: Exploiting Memory Hierarchy
2. The Big Picture: Where Are We Now?
- The Five Classic Components of a Computer
3. Technology Trends
Technology   Capacity        Speed (latency)
Logic        2x in 3 years   2x in 3 years
DRAM         4x in 3 years   2x in 10 years
Disk         4x in 3 years   2x in 10 years
4. Who Cares About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency)
5. The Goal: Illusion of Large, Fast, Cheap Memory
- Fact:
- Large memories are slow
- Fast memories are small
- How do we create a memory that is large, cheap, and fast (most of the time)?
- Hierarchy
6. Exploiting Memory Hierarchy
- Users want large and fast memories!
- As of 2004, SRAM access times are 0.5-5 ns at a cost of $4,000 to $10,000 per GB; DRAM access times are 50-70 ns at a cost of $100 to $200 per GB; disk access times are 5 to 20 million ns at a cost of $0.50 to $2 per GB.
- Try and give it to them anyway
- Build a memory hierarchy
7. Memory Hierarchy of a Modern Computer System
- By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
8. Memory Hierarchy: Why Does It Work? Locality!
- Spatial Locality (Locality in Space)
- => Move blocks consisting of contiguous words to the upper levels
- Temporal Locality (Locality in Time)
- => Keep most recently accessed data items closer to the processor
9. Memory Hierarchy Terminology
- Hit: data appears in some block in the upper level (example: Block X)
- Hit Rate: the fraction of memory accesses found in the upper level
- Hit Time: time to access the upper level, which consists of the RAM access time + the time to determine hit/miss
- Miss: data needs to be retrieved from a block in the lower level (Block Y)
- Miss Rate = 1 - (Hit Rate)
- Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
- Hit Time << Miss Penalty
10. Memory Hierarchy Technology
- Random Access
- "Random" is good: access time is the same for all locations
- Volatile Memory
- DRAM: Dynamic Random Access Memory
- High density, low power, cheap, slow
- Dynamic: needs to be refreshed regularly
- SRAM: Static Random Access Memory
- Low density, high power, expensive, fast
- Static: content lasts "forever" (until power is lost)
- Non-Volatile Memory
- ROM (Mask ROM, PROM, EPROM, E2PROM)
- Flash Memory, FRAM, MRAM
- Not-so-random Access Technology
- Access time varies from location to location and from time to time
- Examples: Disk, CD-ROM
- Sequential Access Technology
- Access time linear in location (e.g., Tape)
11. Main Memory Background
- Main memory is DRAM: Dynamic Random Access Memory
- 1 transistor and 1 capacitor (~2 transistors) per bit
- Dynamic since it needs to be refreshed periodically (8 ms)
- Addresses are divided into 2 halves (memory as a 2D matrix): row address, then column address
- Number of address pins cut in half
- Called address multiplexing
- Cache uses SRAM: Static Random Access Memory
- No refresh
- 6 transistors/bit
- No address multiplexing
- SRAM is faster and more expensive than DRAM
- Size SRAM/DRAM: 4-8x
- Cost SRAM/DRAM: 20-25x (1997)
- Access time DRAM/SRAM: 5-12x
12. Cache
- Motivation
- The slow speed of DRAM main memory limits processor performance
- A smaller SRAM memory matches processor speed
- Make the average access time near that of SRAM
- if the large majority of memory references hit the cache
- Reduce the bandwidth required of the large memory
13. Cache Organization
- The cache duplicates part of main memory
- We specify an address in main memory and search whether a copy of that memory location resides in the cache
- Need a mapping between main memory locations and cache locations
- Direct-Mapped Cache
- Each memory address maps to a UNIQUE cache location determined by a simple modulo function
- Simplest implementation because there is only one cache location to search
14. Memory Reference Sequence in a Direct-Mapped Cache
15. Direct-Mapped Cache Lookup
- For a cache with block size 4 bytes and total capacity 4KB (1024 blocks):
- the 2 lowest address bits specify the byte within a block
- the next 10 address bits specify the block's index within the cache
- the 20 highest address bits are the unique tag for this memory block
- the valid bit specifies whether the block is an accurate copy of memory
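A minimal sketch of this address split in C, assuming the 4KB direct-mapped cache with 4-byte blocks described above (the field widths come from the slide; the variable names and example address are illustrative):

```c
#include <stdint.h>
#include <stdio.h>

/* 4 KB direct-mapped cache with 4-byte blocks:
 * 2 offset bits, 10 index bits, 20 tag bits in a 32-bit address. */
#define OFFSET_BITS 2
#define INDEX_BITS  10

int main(void) {
    uint32_t addr   = 0x12345678u;                                   /* example address */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);              /* byte within block */
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); /* block index */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);            /* 20-bit tag */
    printf("tag=0x%05x index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```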
16. Cache Entry Example
17. Bits in a Cache
- Total bits required for a direct-mapped cache with 4KB of data and 1-word blocks, assuming a 32-bit address:
- Block size = 1 word = 4 bytes
- Number of blocks = 4KB / 4 bytes = 1K blocks
- Each block has 4 bytes of data + a tag + a valid bit
- Tag size = 32 bits (data address) - 10 bits (block address) - 2 bits (byte in a block) = 20 bits
- Total bits in the cache = 1K x (4 bytes + 20 bits + 1 bit) = 53 Kbits (= 6.625 KB)
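The same arithmetic can be written as a small helper; this is a sketch assuming a direct-mapped cache with a 32-bit address, and the function name and parameters are illustrative:

```c
#include <stdio.h>

/* Total storage bits of a direct-mapped cache with a 32-bit address:
 * data bits + tag bits + 1 valid bit per block. */
static long cache_total_bits(long num_blocks, long block_bytes) {
    int offset_bits = 0, index_bits = 0;
    while ((1L << offset_bits) < block_bytes) offset_bits++;
    while ((1L << index_bits) < num_blocks) index_bits++;
    int tag_bits = 32 - index_bits - offset_bits;
    return num_blocks * (block_bytes * 8 + tag_bits + 1);
}

int main(void) {
    /* 4 KB of data, 1-word (4-byte) blocks -> 1K blocks, 53 Kbits total. */
    printf("%ld bits\n", cache_total_bits(1024, 4));   /* prints 54272 = 53 Kbits */
    return 0;
}
```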
18. Cache Blocks
- Cache block (sometimes called a cache line)
- a cache entry that has its own cache tag
- the previous example uses 4-byte blocks
- Larger cache blocks take advantage of spatial locality
- Example of a 64KB cache using 4-word (16-byte) blocks
19. Block Size Tradeoff
- In general, a larger block size takes advantage of spatial locality, BUT:
- A larger block size means a larger miss penalty
- It takes longer to fill up the block
- If the block size is too big relative to the cache size, the miss rate will go up
- Too few cache blocks
- In general, Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
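A small numeric sketch of this formula; the miss rate and penalty values below are assumed inputs chosen only to show the computation:

```c
#include <stdio.h>

int main(void) {
    double hit_time = 1.0;       /* cycles (assumed) */
    double miss_rate = 0.05;     /* 5% misses (assumed) */
    double miss_penalty = 20.0;  /* cycles (assumed) */

    /* Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate */
    double avg = hit_time * (1.0 - miss_rate) + miss_penalty * miss_rate;
    printf("Average access time = %.2f cycles\n", avg);   /* 0.95 + 1.00 = 1.95 */
    return 0;
}
```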
(Figures: miss rate and average access time vs. block size)
20. Block Size Tradeoff (cont.)
- Data from simulating a direct-mapped cache
- Note the miss rate trends as:
- capacity increases for a fixed block size
- block size increases for a fixed capacity
21. Hits vs. Misses
- Read hits
- this is what we want!
- Read misses
- stall the CPU, fetch the block from memory, deliver it to the cache, restart (see the read-path sketch below)
- Write hits
- can replace data in cache and memory (write-through)
- write the data only into the cache (write it back to memory later: write-back)
- Write misses
- read the entire block into the cache, then write the word
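A minimal software sketch of the read path of a direct-mapped cache (read hit vs. read miss as listed above); the structures, sizes, and the toy memory model are simplifying assumptions, not a hardware description:

```c
#include <stdint.h>
#include <string.h>

#define NUM_BLOCKS 1024
#define BLOCK_BYTES 4

struct line { int valid; uint32_t tag; uint8_t data[BLOCK_BYTES]; };
static struct line cache[NUM_BLOCKS];
static uint8_t memory[1 << 20];                 /* toy 1 MB main memory */

/* Read one word; on a read miss, fetch the whole block from memory first. */
uint32_t cache_read(uint32_t addr) {
    uint32_t index = (addr / BLOCK_BYTES) % NUM_BLOCKS;
    uint32_t tag   = addr / BLOCK_BYTES / NUM_BLOCKS;
    struct line *l = &cache[index];

    if (!l->valid || l->tag != tag) {           /* read miss: fill the block */
        memcpy(l->data, &memory[addr & ~(uint32_t)(BLOCK_BYTES - 1)], BLOCK_BYTES);
        l->valid = 1;
        l->tag = tag;
    }
    uint32_t word;                              /* read hit (or just-filled block) */
    memcpy(&word, l->data, sizeof word);
    return word;
}

int main(void) {
    uint32_t v = cache_read(0x1234);            /* first access misses, second hits */
    return v == cache_read(0x1234) ? 0 : 1;
}
```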
22. Hardware Issues
- Make reading multiple words easier by using banks of memory
- It can get a lot more complicated...
23. Synchronous DRAM (SDRAM) Timing
24. Increasing Bandwidth: Interleaving
25. Split Cache
- Use split caches because there is more spatial locality in code
- Two independent caches operating in parallel
- Instruction cache and data cache
- Used to increase cache bandwidth
- i.e., the data rate between cache and processor
- Miss rate slightly higher than that of a combined cache
- e.g., total cache size 32KB:
- Split cache effective miss rate: 3.24%
- Combined cache miss rate: 3.18%
- The increased cache bandwidth easily overcomes the disadvantage of the slightly increased miss rate
- Free from cache contention in instruction pipelining
26. More About Cache Writes
- Cache reads are much easier to handle than cache writes
- A read does not change the value of the data
- Cache writes
- Need to keep data in the cache and memory consistent
- Two options (sketched below):
- Write-Through: write to both the cache and memory
- control is simple
- Isn't memory too slow for this?
- Write-Back: write to the cache only
- write the cache block to memory when that cache block is being replaced on a cache miss
- reduces the memory bandwidth required
- keep a bit (called the dirty bit) per cache block to track whether the block has been modified
- only need to write back modified blocks
- control can be complex
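A minimal sketch contrasting the two write-hit policies in software terms, reusing the same toy direct-mapped structures as the earlier read sketch; names, sizes, and the memory model are assumptions:

```c
#include <stdint.h>
#include <string.h>

#define NUM_BLOCKS 1024
#define BLOCK_BYTES 4

struct line { int valid, dirty; uint32_t tag; uint8_t data[BLOCK_BYTES]; };
static struct line cache[NUM_BLOCKS];
static uint8_t memory[1 << 20];

/* Write one word that is already resident in the cache (a write hit). */
void cache_write_hit(uint32_t addr, uint32_t word, int write_through) {
    struct line *l = &cache[(addr / BLOCK_BYTES) % NUM_BLOCKS];
    memcpy(l->data, &word, sizeof word);            /* always update the cache */
    if (write_through)
        memcpy(&memory[addr], &word, sizeof word);  /* write-through: update memory now */
    else
        l->dirty = 1;                               /* write-back: defer, mark dirty */
}

/* On eviction, a write-back cache flushes only dirty blocks. */
void evict(uint32_t index, uint32_t old_block_addr) {
    struct line *l = &cache[index];
    if (l->valid && l->dirty)
        memcpy(&memory[old_block_addr], l->data, BLOCK_BYTES);
    l->valid = 0;
    l->dirty = 0;
}

int main(void) {
    cache[(0x40 / BLOCK_BYTES) % NUM_BLOCKS].valid = 1;   /* pretend block is resident */
    cache_write_hit(0x40, 7, 0);                          /* write-back style write hit */
    evict((0x40 / BLOCK_BYTES) % NUM_BLOCKS, 0x40);       /* dirty block flushed here */
    return 0;
}
```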
27. Write Buffer for Write-Through
- A write buffer is needed between the cache and memory
- The processor writes data into the cache and the write buffer
- The memory controller writes the contents of the buffer to memory
- The write buffer is just a FIFO (First-In First-Out) queue
- Typical number of entries: 4-8
- Works fine if store frequency << 1 / (DRAM write cycle)
- On write buffer saturation, stall the processor to allow memory to catch up
28. Cache Performance
- We can safely assume cache access time (hit time) is a single clock cycle
- CPU time with a perfect cache = CPU cycles x Clock cycle time
- CPU time with a real-world cache = (CPU cycles + Memory stall cycles) x Clock cycle time
- The memory system affects:
- Memory stall cycles
- cache miss stalls + write buffer stalls (in the case of a write-through cache)
- Clock cycle time
- since cache access often determines the clock speed of a processor
- Memory stall cycles = Read stall cycles + Write stall cycles
- Read stall cycles = Read miss rate x Reads x Read miss penalty
- For write-back caches:
- Write stall cycles = Write miss rate x Writes x Write miss penalty
- The read and write components can be combined:
- Memory stall cycles = Miss rate x Memory accesses x Miss penalty
- For write-through caches:
- add write buffer stalls
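These formulas can be checked with a small calculation; the workload numbers below are the ones used in the example on the next slide (5% instruction miss rate, 8% data miss rate, 0.4 data references per instruction, 20-cycle penalty):

```c
#include <stdio.h>

int main(void) {
    double base_cpi = 1.4;             /* CPI with a perfect cache */
    double instr_miss_rate = 0.05;
    double data_miss_rate = 0.08;
    double data_refs_per_instr = 0.4;
    double miss_penalty = 20.0;        /* cycles */

    /* Memory stall cycles per instruction = miss rate x accesses x penalty,
     * combining the instruction-fetch and data-access components. */
    double stall_cpi = (instr_miss_rate * 1.0 +
                        data_miss_rate * data_refs_per_instr) * miss_penalty;
    double real_cpi = base_cpi + stall_cpi;

    printf("stall CPI = %.2f, real CPI = %.2f, slowdown = %.2fx\n",
           stall_cpi, real_cpi, real_cpi / base_cpi);   /* 1.64, 3.04, ~2.2x */
    return 0;
}
```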
29. Cache Performance Example
- Assume:
- Instruction miss rate: 5%
- Data miss rate: 8%
- Data references per instruction: 0.4
- CPI with a perfect cache: 1.4
- Miss penalty: 20 cycles
- Find the performance relative to a perfect cache with no misses (same clock rate)
- Misses/instruction = 0.05 (instruction misses) + 0.4 x 0.08 (data misses) = 0.082
- Miss stall CPI = 0.082 x 20 = 1.64
- Performance is the ratio of CPIs (instruction count and clock rate are the same): real CPI = 1.4 + 1.64 = 3.04, so the perfect cache is 3.04 / 1.4 = 2.2 times faster
30. Set-Associative Caches
- Improve the cache hit ratio by allowing a memory location to be placed in more than one cache block
- An N-way set-associative cache allows placement in any block of a set with N elements
- N is the set size
- Number of blocks = N x number of sets
- The set number is selected by a simple modulo function of the address bits (the set number is also called the index)
- Fully-associative cache
- when there is a single set, allowing a memory location to be placed in any cache block
- A direct-mapped organization can be considered a degenerate set-associative cache with set size = 1
- For a fixed cache capacity, a larger set size leads to higher hit rates
- because more combinations of cache blocks can be present in the cache at the same time
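A minimal sketch of the lookup in an N-way set-associative cache: the set index is a modulo of the block address, and all N ways of that set are searched for a matching tag. The structure, sizes, and names are illustrative assumptions:

```c
#include <stdint.h>

#define NUM_SETS 256
#define WAYS 4                         /* N-way set-associative */
#define BLOCK_BYTES 16

struct way { int valid; uint32_t tag; uint8_t data[BLOCK_BYTES]; };
static struct way cache[NUM_SETS][WAYS];

/* Return the matching way within the selected set, or -1 on a miss. */
int lookup(uint32_t addr) {
    uint32_t block_addr = addr / BLOCK_BYTES;
    uint32_t set = block_addr % NUM_SETS;      /* index = block address mod #sets */
    uint32_t tag = block_addr / NUM_SETS;
    for (int w = 0; w < WAYS; w++)             /* searched in parallel in hardware */
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return w;                          /* hit */
    return -1;                                 /* miss */
}

int main(void) { return lookup(0x1000) < 0 ? 0 : 1; }
```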
31. Set-Associative Cache Examples
32. Implementation of a 4-Way Set-Associative Cache
33. Miss Rate vs. Set Size
- Data is for gcc (GNU C compiler) and spice on a DECstation 3100 with separate 64KB instruction/data caches using 16-byte blocks
- In general, increasing associativity beyond 2-4 has minimal impact on the miss ratio
34. Miss Rate vs. Set Size
- Data for SPEC92 on a combined instruction/data cache with 32-byte blocks
35. Disadvantages of a Set-Associative Cache
- N-way set-associative cache versus direct-mapped cache:
- N comparators vs. 1
- Extra MUX delay for the data
- Data comes AFTER the hit/miss decision and set selection
- In a direct-mapped cache, the cache block is available BEFORE the hit/miss decision
- Possible to assume a hit and continue; recover later if it was a miss
- Example
- 2-way set-associative cache
36. Cache Block Replacement Policies
- Direct-Mapped Cache
- Each memory location maps to a single cache location
- No replacement policy is necessary
- the new item replaces the previous item in that cache location
- Set-Associative Caches
- N-way set-associative cache
- each memory location has a choice of N cache locations
- Cache miss handling for set-associative caches:
- bring in the new block from memory
- identify a block in the selected set to replace if the set is full
- need to decide which block to replace
37. Cache Block Replacement Policies (cont.)
- Random Replacement
- Hardware randomly selects a cache block to replace
- Optimal Replacement
- Replace the block that will be used furthest in the future
- Least Recently Used (LRU)
- Hardware keeps track of access history
- replace the entry that has not been used for the longest time
- Simple for 2-way associative:
- a single bit in each set indicates which block was more recently used (see the sketch below)
- Implementing LRU gets harder for higher degrees of associativity
- In practice, the replacement policy has a minor impact on the miss rate
- Especially for high associativity
38. Decreasing Miss Penalty with Multilevel Caches
- Add a second-level cache
- often the primary cache is on the same chip as the processor
- Primary cache = L1 cache = on-chip cache
- use SRAMs to add another cache above primary memory (DRAM)
- L2 cache
- the miss penalty goes down if the data is in the 2nd-level cache
- On-die L2 cache
- started to be integrated onto the same die in late 1998 and has since become the general trend
- Example (worked through in the sketch at the end of this slide):
- CPI of 1.0 on a 5 GHz machine with a 2% miss rate and 100 ns DRAM access
- adding a 2nd-level cache with 5 ns access time decreases the miss rate to main memory to 0.5%
- performance gain is 2.8x
- refer to the textbook (pp. 505-506)
- Using multilevel caches:
- try to optimize the hit time on the 1st-level cache
- try to optimize the miss rate on the 2nd-level cache
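A sketch of the arithmetic behind the 2.8x figure, assuming one cycle is 0.2 ns at 5 GHz, so 100 ns of DRAM access is 500 cycles and 5 ns of L2 access is 25 cycles (following the textbook example; rounding may differ slightly):

```c
#include <stdio.h>

int main(void) {
    double base_cpi = 1.0;
    double cycle_ns = 1.0 / 5.0;                 /* 5 GHz -> 0.2 ns per cycle */
    double mem_penalty = 100.0 / cycle_ns;       /* 100 ns DRAM -> 500 cycles */
    double l2_penalty  = 5.0 / cycle_ns;         /* 5 ns L2     -> 25 cycles  */

    /* L1 only: 2% of accesses pay the full trip to DRAM. */
    double cpi_l1 = base_cpi + 0.02 * mem_penalty;                      /* 1 + 10 = 11 */

    /* With L2: 2% pay the L2 penalty, and 0.5% still go to DRAM. */
    double cpi_l2 = base_cpi + 0.02 * l2_penalty + 0.005 * mem_penalty; /* 1 + 0.5 + 2.5 = 4 */

    printf("CPI %.1f -> %.1f, speedup %.2fx\n",
           cpi_l1, cpi_l2, cpi_l1 / cpi_l2);     /* ~2.8x */
    return 0;
}
```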
39. Cache Complexities
- It is not always easy to understand the implications of caches
Theoretical behavior of Radix sort vs. Quicksort
Observed behavior of Radix sort vs. Quicksort
40. Cache Complexities (cont.)
- Here is why:
- Memory system performance is often the critical factor
- multilevel caches and pipelined processors make it harder to predict outcomes
- Compiler optimizations to increase locality sometimes hurt ILP
- It is difficult to predict the best algorithm: you need experimental data
41. Summary: Improving Cache Performance
- Cache performance is determined by:
- Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
- Use better technology
- Use faster RAMs
- Cost and availability are limitations
- Decrease hit time
- Make the cache smaller, but the miss rate increases
- Use direct-mapped instead of set-associative, but the miss rate increases
- Decrease miss rate
- Make the cache larger, but this can increase hit time
- Add associativity, but this can increase hit time
- Increase the block size, but this increases the miss penalty
- Decrease miss penalty
- Reduce the transfer time component of the miss penalty
- Add another level of cache (L2 cache)
42. Another View of the Memory Hierarchy
43. Memory Hierarchy Requirements
- If the Principle of Locality allows caches to offer (close to) the speed of cache memory with the size of DRAM memory, then why not use the same idea recursively at the next level to give the speed of DRAM memory with the size of disk memory?
- Share memory between multiple processes but still provide protection: don't let one program read/write memory belonging to another
- Address space: give each program the illusion that it has its own private memory
- the compiler, linker, and loader are simplified because they see only the virtual address space, abstracted from physical memory allocation
44. Virtual Memory
- Called Virtual Memory
- Also allows the OS to share memory and protect programs from each other
- Today, it is more important for protection than as just another level of the memory hierarchy
- Each process thinks it has all the memory to itself
- Historically, it predates caches
45. Virtual Memory (cont.)
- Addressable memory space vs. physical memory
- Example:
- a 32-bit memory address can specify 4GB of memory
- physical main memory: 16MB - 512MB
- Distinguish between virtual and physical addresses:
- a virtual address is used by the programmer to address memory within a process's address space
- a physical address is used by the hardware to access a physical memory location
- Virtual memory provides the appearance of a very large memory:
- total memory of all jobs >> physical memory
- address space of each job > physical memory
- Simplifies memory management for multi-processing systems:
- each program operates in its own virtual address space as if it were the only program running in the system
- Uses 2 storage levels:
- primary (DRAM) and secondary (hard disk)
- Exploits the hierarchy to reduce average access time, as a cache does
46. Virtual to Physical Address Translation
- Each program operates in its own virtual address space
- as if it were the only program running in the system
- Each program is protected from the others
- The OS can decide where each program goes in memory
- Hardware (HW) provides the virtual-to-physical mapping
47. Paged Virtual Memory
- The most common form of address translation
- the virtual and physical address spaces are partitioned into blocks of equal size
- virtual address space blocks are called pages
- physical address space blocks are called frames (or page frames)
- Placement:
- any page can be placed in any frame (fully associative)
- Pages are fetched on demand
48. Paging Organization
- Paging can map any virtual page to any physical frame
- Data missing from main memory must be transferred from secondary memory (disk)
- misses (page faults) are handled by the operating system
- the miss time is very large, so the OS manages the hierarchy and schedules another process instead of stalling (context switching)
(Figure: mapping of virtual addresses to physical addresses)
49. Paging/Virtual Memory: Multiple Processes
50. Address Translation
- A program uses virtual addresses
- Relocation: a program can be loaded anywhere in physical memory without recompiling or re-linking
- Memory is accessed with physical addresses
- Hardware (HW) provides the virtual-to-physical mapping
- need a translation table for each process
- When a virtual address is missing from main memory, the OS handles the miss:
- read the missing data, create the translation, and return to re-execute the instruction that caused the miss
51. Address Mapping
52. Address Translation Algorithm
- If V = 1, the mapping is valid:
- the CPU checks the permissions (R, R/W, X) against the access type
- if the access is permitted, it generates the physical address and proceeds
- if the access is not permitted, it generates a protection fault
- If V != 1, the mapping is invalid:
- the wanted page does not reside in main memory
- the CPU generates a page fault
- Faults are exceptions handled by the OS
- page faults:
- the OS fetches the missing page, creates a map entry, and restarts the process
- another user process is switched in to execute while the page is brought in from disk (context switching)
- protection faults:
- the OS checks whether it is a programming error or whether the permissions need to be changed
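A software sketch of the algorithm above, assuming a simple one-level page table and 4 KB pages; the structures, permission flags, and return codes are illustrative, not a description of any particular hardware:

```c
#include <stdint.h>

#define PAGE_BITS 12
#define NUM_PAGES 1024

enum access { ACC_READ, ACC_WRITE, ACC_EXEC };

struct pte { unsigned valid : 1, write : 1, exec : 1; uint32_t frame; };
static struct pte page_table[NUM_PAGES];

/* Returns 0 and fills *paddr on success; -1 = page fault, -2 = protection fault. */
int translate(uint32_t vaddr, enum access type, uint32_t *paddr) {
    struct pte *e = &page_table[vaddr >> PAGE_BITS];

    if (!e->valid)                          /* V != 1: page not in main memory */
        return -1;                          /* page fault: OS fetches the page */

    if ((type == ACC_WRITE && !e->write) ||
        (type == ACC_EXEC && !e->exec))
        return -2;                          /* protection fault */

    *paddr = (e->frame << PAGE_BITS) | (vaddr & ((1u << PAGE_BITS) - 1));
    return 0;                               /* access permitted: proceed */
}

int main(void) {
    uint32_t pa;
    return translate(0x1234, ACC_READ, &pa);  /* -1 here: the table starts empty */
}
```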
53. Making VM Fast: the TLB
- If the page table is kept in memory:
- every memory reference requires two accesses
- one for the page table entry and one to get the actual data
- Translation Lookaside Buffer (TLB)
- an additional cache for the page table only
- hardware maintains a cache of recently used page table translations
- all accesses are looked up in the TLB
- a hit in the TLB gives the physical page number
- a miss in the TLB => get the translation from the page table and reload the TLB
- The TLB is usually smaller than the cache (each entry maps a full page)
- more associativity is possible and common
- similar speed to a cache access
- contains all the bits needed to translate an address and implement VM
- Typical TLB entry: Valid | Virtual Address | Physical Address | Dirty | Access Rights
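A sketch of the TLB-first lookup described above: check a small translation cache before walking the page table. The sizes, names, the stub page-table walk, and the naive replacement are all assumptions for illustration:

```c
#include <stdint.h>

#define PAGE_BITS 12
#define TLB_ENTRIES 16                      /* small, often fully associative */

struct tlb_entry { int valid; uint32_t vpn, frame; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Stub page-table walk for this sketch: always reports a page fault. */
static int page_table_lookup(uint32_t vpn, uint32_t *frame) {
    (void)vpn; (void)frame;
    return -1;
}

int translate_with_tlb(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn = vaddr >> PAGE_BITS;

    for (int i = 0; i < TLB_ENTRIES; i++)                 /* TLB hit: no extra memory access */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *paddr = (tlb[i].frame << PAGE_BITS) | (vaddr & ((1u << PAGE_BITS) - 1));
            return 0;
        }

    uint32_t frame;                                       /* TLB miss: consult the page table */
    if (page_table_lookup(vpn, &frame) != 0)
        return -1;                                        /* page fault handled by the OS */

    tlb[0] = (struct tlb_entry){1, vpn, frame};           /* reload TLB (naive replacement) */
    *paddr = (frame << PAGE_BITS) | (vaddr & ((1u << PAGE_BITS) - 1));
    return 0;
}

int main(void) {
    uint32_t pa;
    return translate_with_tlb(0x2345, &pa) == 0 ? 0 : 1;
}
```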
54. Virtual Memory and Cache
- The OS manages the memory hierarchy between secondary storage and main memory
- it allocates physical memory to virtual memory and specifies the mapping to hardware through page tables
- hardware caches recently used page table entries in the TLB
55. TLBs and Caches
56. Page Replacement and Write Policies
- When a page fault occurs, choose a page to replace:
- fully associative, so any frame/page is a candidate
- choose an empty one if it exists
- otherwise choose using either (just as we did for caches):
- LRU
- Random
- Write policy: always write-back
- keep a dirty bit
- set it to 1 if the page is modified
- when a modified page is replaced, the OS writes it back to disk
57. Modern Systems
58. Modern Systems (cont.)
- Things are getting complicated!