Title: 7. Large and Fast: Exploiting Memory Hierarchy
1. Large and Fast: Exploiting Memory Hierarchy
2. The Big Picture: Where Are We Now?
- The Five Classic Components of a Computer
3. Technology Trends
Technology   Capacity        Speed (latency)
Logic        2x in 3 years   2x in 3 years
DRAM         4x in 3 years   2x in 10 years
Disk         4x in 3 years   2x in 10 years
4. Who Cares About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency)
5. The Goal: Illusion of Large, Fast, Cheap Memory
- Fact:
- Large memories are slow
- Fast memories are small
- How do we create a memory that is large, cheap, and fast (most of the time)?
- Hierarchy
6. Exploiting Memory Hierarchy
- Users want large and fast memories!
- As of 2004, SRAM access times are 0.5-5 ns at a cost of $4,000 to $10,000 per GB; DRAM access times are 50-70 ns at a cost of $100 to $200 per GB; disk access times are 5 to 20 million ns at a cost of $0.50 to $2 per GB.
- Try and give it to them anyway
- Build a memory hierarchy
7. Memory Hierarchy of a Modern Computer System
- By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
8. Memory Hierarchy: Why Does It Work? Locality!
- Spatial Locality (Locality in Space)
- => Move blocks consisting of contiguous words to the upper levels
- Temporal Locality (Locality in Time)
- => Keep most recently accessed data items closer to the processor
9. Memory Hierarchy Terminology
- Hit: data appears in some block in the upper level (example: Block X)
- Hit Rate: the fraction of memory accesses found in the upper level
- Hit Time: time to access the upper level, which consists of the RAM access time + the time to determine hit/miss
- Miss: data needs to be retrieved from a block in the lower level (Block Y)
- Miss Rate = 1 - (Hit Rate)
- Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
- Hit Time << Miss Penalty
10. Memory Hierarchy Technology
- Random Access
- "Random" is good: access time is the same for all locations
- Volatile Memory
- DRAM: Dynamic Random Access Memory
- High density, low power, cheap, slow
- Dynamic: needs to be refreshed regularly
- SRAM: Static Random Access Memory
- Low density, high power, expensive, fast
- Static: content lasts "forever" (until power is lost)
- Non-Volatile Memory
- ROM (Mask ROM, PROM, EPROM, E2PROM)
- Flash Memory, FRAM, MRAM
- Not-so-random Access Technology
- Access time varies from location to location and from time to time
- Examples: Disk, CD-ROM
- Sequential Access Technology
- Access time linear in location (e.g., Tape)
11. Main Memory Background
- Main memory is DRAM: Dynamic Random Access Memory
- 1 transistor and 1 capacitor (~2 transistors) per bit
- Dynamic since it needs to be refreshed periodically (8 ms)
- Addresses are divided into 2 halves (memory as a 2D matrix): row address, then column address
- Number of address pins cut in half
- Called address multiplexing
- Cache uses SRAM: Static Random Access Memory
- No refresh
- 6 transistors/bit
- No address multiplexing
- SRAM is faster and more expensive than DRAM
- Size SRAM/DRAM: 4-8x
- Cost SRAM/DRAM: 20-25x (1997)
- Access time DRAM/SRAM: 5-12x
12. Cache
- Motivation
- The slow speed of DRAM main memory limits processor performance
- A smaller SRAM memory matches processor speed
- Make the average access time near that of SRAM
- if the large majority of memory references hit the cache
- Reduce the bandwidth required of the large memory
13. Cache Organization
- The cache duplicates part of main memory
- We specify an address in main memory and search whether a copy of that memory location resides in the cache
- Need a mapping between main memory locations and cache locations
- Direct-Mapped Cache
- Each memory address maps to a UNIQUE cache location determined by a simple modulo function
- Simplest implementation because there is only one cache location to search
14. Memory Reference Sequence in a Direct-Mapped Cache
15. Direct-Mapped Cache Lookup
- For a cache with block size 4 bytes and total capacity 4KB (1024 blocks):
- the 2 lowest address bits specify the byte within a block
- the next 10 address bits specify the block's index within the cache
- the 20 highest address bits are the unique tag for this memory block
- the valid bit specifies whether the block is an accurate copy of memory
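A minimal sketch of this address split in C, assuming the 4KB direct-mapped cache with 4-byte blocks described above (the field widths come from the slide; the variable names and example address are illustrative):

```c
#include <stdint.h>
#include <stdio.h>

/* 4 KB direct-mapped cache with 4-byte blocks:
 * 2 offset bits, 10 index bits, 20 tag bits in a 32-bit address. */
#define OFFSET_BITS 2
#define INDEX_BITS  10

int main(void) {
    uint32_t addr   = 0x12345678u;                                   /* example address */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);              /* byte within block */
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); /* block index */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);            /* 20-bit tag */
    printf("tag=0x%05x index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```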
16. Cache Entry Example
17. Bits in a Cache
- Total bits required for a direct-mapped cache with 4KB of data and 1-word blocks, assuming a 32-bit address:
- Block size = 1 word = 4 bytes
- Number of blocks = 4KB / 4 bytes = 1K blocks
- Each block has 4 bytes of data + a tag + a valid bit
- Tag size = 32 bits (data address) - 10 bits (block address) - 2 bits (byte in a block) = 20 bits
- Total bits in the cache = 1K x (4 bytes + 20 bits + 1 bit) = 53 Kbits (= 6.625 KB)
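The same arithmetic can be written as a small helper; this is a sketch assuming a direct-mapped cache with a 32-bit address, and the function name and parameters are illustrative:

```c
#include <stdio.h>

/* Total storage bits of a direct-mapped cache with a 32-bit address:
 * data bits + tag bits + 1 valid bit per block. */
static long cache_total_bits(long num_blocks, long block_bytes) {
    int offset_bits = 0, index_bits = 0;
    while ((1L << offset_bits) < block_bytes) offset_bits++;
    while ((1L << index_bits) < num_blocks) index_bits++;
    int tag_bits = 32 - index_bits - offset_bits;
    return num_blocks * (block_bytes * 8 + tag_bits + 1);
}

int main(void) {
    /* 4 KB of data, 1-word (4-byte) blocks -> 1K blocks, 53 Kbits total. */
    printf("%ld bits\n", cache_total_bits(1024, 4));   /* prints 54272 = 53 Kbits */
    return 0;
}
```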
18. Cache Blocks
- Cache block (sometimes called a cache line)
- a cache entry that has its own cache tag
- the previous example uses 4-byte blocks
- Larger cache blocks take advantage of spatial locality
- Example of a 64KB cache using 4-word (16-byte) blocks
19. Block Size Tradeoff
- In general, a larger block size takes advantage of spatial locality, BUT:
- A larger block size means a larger miss penalty
- It takes longer to fill up the block
- If the block size is too big relative to the cache size, the miss rate will go up
- Too few cache blocks
- In general, Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
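A small numeric sketch of this formula; the miss rate and penalty values below are assumed inputs chosen only to show the computation:

```c
#include <stdio.h>

int main(void) {
    double hit_time = 1.0;       /* cycles (assumed) */
    double miss_rate = 0.05;     /* 5% misses (assumed) */
    double miss_penalty = 20.0;  /* cycles (assumed) */

    /* Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate */
    double avg = hit_time * (1.0 - miss_rate) + miss_penalty * miss_rate;
    printf("Average access time = %.2f cycles\n", avg);   /* 0.95 + 1.00 = 1.95 */
    return 0;
}
```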
(Figures: miss rate and average access time vs. block size)
20. Block Size Tradeoff (cont.)
- Data from simulating a direct-mapped cache
- Note the miss rate trends as:
- capacity increases for a fixed block size
- block size increases for a fixed capacity
21. Hits vs. Misses
- Read hits
- this is what we want!
- Read misses
- stall the CPU, fetch the block from memory, deliver it to the cache, restart (see the read-path sketch below)
- Write hits
- can replace data in cache and memory (write-through)
- write the data only into the cache (write it back to memory later: write-back)
- Write misses
- read the entire block into the cache, then write the word
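A minimal software sketch of the read path of a direct-mapped cache (read hit vs. read miss as listed above); the structures, sizes, and the toy memory model are simplifying assumptions, not a hardware description:

```c
#include <stdint.h>
#include <string.h>

#define NUM_BLOCKS 1024
#define BLOCK_BYTES 4

struct line { int valid; uint32_t tag; uint8_t data[BLOCK_BYTES]; };
static struct line cache[NUM_BLOCKS];
static uint8_t memory[1 << 20];                 /* toy 1 MB main memory */

/* Read one word; on a read miss, fetch the whole block from memory first. */
uint32_t cache_read(uint32_t addr) {
    uint32_t index = (addr / BLOCK_BYTES) % NUM_BLOCKS;
    uint32_t tag   = addr / BLOCK_BYTES / NUM_BLOCKS;
    struct line *l = &cache[index];

    if (!l->valid || l->tag != tag) {           /* read miss: fill the block */
        memcpy(l->data, &memory[addr & ~(uint32_t)(BLOCK_BYTES - 1)], BLOCK_BYTES);
        l->valid = 1;
        l->tag = tag;
    }
    uint32_t word;                              /* read hit (or just-filled block) */
    memcpy(&word, l->data, sizeof word);
    return word;
}

int main(void) {
    uint32_t v = cache_read(0x1234);            /* first access misses, second hits */
    return v == cache_read(0x1234) ? 0 : 1;
}
```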
22. Hardware Issues
- Make reading multiple words easier by using banks of memory
- It can get a lot more complicated...
23. Synchronous DRAM (SDRAM) Timing
24. Increasing Bandwidth: Interleaving
25. Split Cache
- Use split caches because there is more spatial locality in code
- Two independent caches operating in parallel
- Instruction cache and data cache
- Used to increase cache bandwidth
- i.e., the data rate between cache and processor
- Miss rate slightly higher than that of a combined cache
- e.g., total cache size 32KB:
- Split cache effective miss rate: 3.24%
- Combined cache miss rate: 3.18%
- The increased cache bandwidth easily overcomes the disadvantage of the slightly increased miss rate
- Free from cache contention in instruction pipelining
26. More About Cache Writes
- Cache reads are much easier to handle than cache writes
- A read does not change the value of the data
- Cache writes
- Need to keep data in the cache and memory consistent
- Two options (sketched below):
- Write-Through: write to both the cache and memory
- control is simple
- Isn't memory too slow for this?
- Write-Back: write to the cache only
- write the cache block to memory when that cache block is being replaced on a cache miss
- reduces the memory bandwidth required
- keep a bit (called the dirty bit) per cache block to track whether the block has been modified
- only need to write back modified blocks
- control can be complex
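A minimal sketch contrasting the two write-hit policies in software terms, reusing the same toy direct-mapped structures as the earlier read sketch; names, sizes, and the memory model are assumptions:

```c
#include <stdint.h>
#include <string.h>

#define NUM_BLOCKS 1024
#define BLOCK_BYTES 4

struct line { int valid, dirty; uint32_t tag; uint8_t data[BLOCK_BYTES]; };
static struct line cache[NUM_BLOCKS];
static uint8_t memory[1 << 20];

/* Write one word that is already resident in the cache (a write hit). */
void cache_write_hit(uint32_t addr, uint32_t word, int write_through) {
    struct line *l = &cache[(addr / BLOCK_BYTES) % NUM_BLOCKS];
    memcpy(l->data, &word, sizeof word);            /* always update the cache */
    if (write_through)
        memcpy(&memory[addr], &word, sizeof word);  /* write-through: update memory now */
    else
        l->dirty = 1;                               /* write-back: defer, mark dirty */
}

/* On eviction, a write-back cache flushes only dirty blocks. */
void evict(uint32_t index, uint32_t old_block_addr) {
    struct line *l = &cache[index];
    if (l->valid && l->dirty)
        memcpy(&memory[old_block_addr], l->data, BLOCK_BYTES);
    l->valid = 0;
    l->dirty = 0;
}

int main(void) {
    cache[(0x40 / BLOCK_BYTES) % NUM_BLOCKS].valid = 1;   /* pretend block is resident */
    cache_write_hit(0x40, 7, 0);                          /* write-back style write hit */
    evict((0x40 / BLOCK_BYTES) % NUM_BLOCKS, 0x40);       /* dirty block flushed here */
    return 0;
}
```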
27. Write Buffer for Write-Through
- A write buffer is needed between the cache and memory
- The processor writes data into the cache and the write buffer
- The memory controller writes the contents of the buffer to memory
- The write buffer is just a FIFO (First-In First-Out) queue
- Typical number of entries: 4-8
- Works fine if store frequency << 1 / (DRAM write cycle)
- On write buffer saturation, stall the processor to allow memory to catch up
28. Cache Performance
- We can safely assume cache access time (hit time) is a single clock cycle
- CPU time with a perfect cache = CPU cycles x Clock cycle time
- CPU time with a real-world cache = (CPU cycles + Memory stall cycles) x Clock cycle time
- The memory system affects:
- Memory stall cycles
- cache miss stalls + write buffer stalls (in the case of a write-through cache)
- Clock cycle time
- since cache access often determines the clock speed of a processor
- Memory stall cycles = Read stall cycles + Write stall cycles
- Read stall cycles = Read miss rate x Reads x Read miss penalty
- For write-back caches:
- Write stall cycles = Write miss rate x Writes x Write miss penalty
- The read and write components can be combined:
- Memory stall cycles = Miss rate x Memory accesses x Miss penalty
- For write-through caches:
- add write buffer stalls
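These formulas can be checked with a small calculation; the workload numbers below are the ones used in the example on the next slide (5% instruction miss rate, 8% data miss rate, 0.4 data references per instruction, 20-cycle penalty):

```c
#include <stdio.h>

int main(void) {
    double base_cpi = 1.4;             /* CPI with a perfect cache */
    double instr_miss_rate = 0.05;
    double data_miss_rate = 0.08;
    double data_refs_per_instr = 0.4;
    double miss_penalty = 20.0;        /* cycles */

    /* Memory stall cycles per instruction = miss rate x accesses x penalty,
     * combining the instruction-fetch and data-access components. */
    double stall_cpi = (instr_miss_rate * 1.0 +
                        data_miss_rate * data_refs_per_instr) * miss_penalty;
    double real_cpi = base_cpi + stall_cpi;

    printf("stall CPI = %.2f, real CPI = %.2f, slowdown = %.2fx\n",
           stall_cpi, real_cpi, real_cpi / base_cpi);   /* 1.64, 3.04, ~2.2x */
    return 0;
}
```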
29. Cache Performance Example
- Assume:
- Instruction miss rate: 5%
- Data miss rate: 8%
- Data references per instruction: 0.4
- CPI with a perfect cache: 1.4
- Miss penalty: 20 cycles
- Find the performance relative to a perfect cache with no misses (same clock rate)
- Misses/instruction = 0.05 (instruction misses) + 0.4 x 0.08 (data misses) = 0.082
- Miss stall CPI = 0.082 x 20 = 1.64
- Performance is the ratio of CPIs (instruction count and clock rate are the same): real CPI = 1.4 + 1.64 = 3.04, so the perfect cache is 3.04 / 1.4 = 2.2 times faster
30. Set-Associative Caches
- Improve the cache hit ratio by allowing a memory location to be placed in more than one cache block
- An N-way set-associative cache allows placement in any block of a set with N elements
- N is the set size
- Number of blocks = N x number of sets
- The set number is selected by a simple modulo function of the address bits (the set number is also called the index)
- Fully-associative cache
- when there is a single set, allowing a memory location to be placed in any cache block
- A direct-mapped organization can be considered a degenerate set-associative cache with set size = 1
- For a fixed cache capacity, a larger set size leads to higher hit rates
- because more combinations of cache blocks can be present in the cache at the same time
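A minimal sketch of the lookup in an N-way set-associative cache: the set index is a modulo of the block address, and all N ways of that set are searched for a matching tag. The structure, sizes, and names are illustrative assumptions:

```c
#include <stdint.h>

#define NUM_SETS 256
#define WAYS 4                         /* N-way set-associative */
#define BLOCK_BYTES 16

struct way { int valid; uint32_t tag; uint8_t data[BLOCK_BYTES]; };
static struct way cache[NUM_SETS][WAYS];

/* Return the matching way within the selected set, or -1 on a miss. */
int lookup(uint32_t addr) {
    uint32_t block_addr = addr / BLOCK_BYTES;
    uint32_t set = block_addr % NUM_SETS;      /* index = block address mod #sets */
    uint32_t tag = block_addr / NUM_SETS;
    for (int w = 0; w < WAYS; w++)             /* searched in parallel in hardware */
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return w;                          /* hit */
    return -1;                                 /* miss */
}

int main(void) { return lookup(0x1000) < 0 ? 0 : 1; }
```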
31. Set-Associative Cache Examples
32. Implementation of a 4-Way Set-Associative Cache
33. Miss Rate vs. Set Size
- Data is for gcc (GNU C compiler) and spice on a DECstation 3100 with separate 64KB instruction/data caches using 16-byte blocks
- In general, increasing associativity beyond 2-4 has minimal impact on the miss ratio
34. Miss Rate vs. Set Size
- Data for SPEC92 on a combined instruction/data cache with 32-byte blocks
35. Disadvantages of a Set-Associative Cache
- N-way set-associative cache versus direct-mapped cache:
- N comparators vs. 1
- Extra MUX delay for the data
- Data comes AFTER the hit/miss decision and set selection
- In a direct-mapped cache, the cache block is available BEFORE the hit/miss decision
- Possible to assume a hit and continue; recover later if it was a miss
- Example
- 2-way set-associative cache
36. Cache Block Replacement Policies
- Direct-Mapped Cache
- Each memory location maps to a single cache location
- No replacement policy is necessary
- the new item replaces the previous item in that cache location
- Set-Associative Caches
- N-way set-associative cache
- each memory location has a choice of N cache locations
- Cache miss handling for set-associative caches:
- bring in the new block from memory
- identify a block in the selected set to replace if the set is full
- need to decide which block to replace
37. Cache Block Replacement Policies (cont.)
- Random Replacement
- Hardware randomly selects a cache block to replace
- Optimal Replacement
- Replace the block that will be used furthest in the future
- Least Recently Used (LRU)
- Hardware keeps track of access history
- replace the entry that has not been used for the longest time
- Simple for 2-way associative:
- a single bit in each set indicates which block was more recently used (see the sketch below)
- Implementing LRU gets harder for higher degrees of associativity
- In practice, the replacement policy has a minor impact on the miss rate
- Especially for high associativity
38. Decreasing Miss Penalty with Multilevel Caches
- Add a second-level cache
- often the primary cache is on the same chip as the processor
- Primary cache = L1 cache = on-chip cache
- use SRAMs to add another cache above primary memory (DRAM)
- L2 cache
- the miss penalty goes down if the data is in the 2nd-level cache
- On-die L2 cache
- started to be integrated onto the same die in late 1998 and has since become the general trend
- Example (worked through in the sketch at the end of this slide):
- CPI of 1.0 on a 5 GHz machine with a 2% miss rate and 100 ns DRAM access
- adding a 2nd-level cache with 5 ns access time decreases the miss rate to main memory to 0.5%
- performance gain is 2.8x
- refer to the textbook (pp. 505-506)
- Using multilevel caches:
- try to optimize the hit time on the 1st-level cache
- try to optimize the miss rate on the 2nd-level cache
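A sketch of the arithmetic behind the 2.8x figure, assuming one cycle is 0.2 ns at 5 GHz, so 100 ns of DRAM access is 500 cycles and 5 ns of L2 access is 25 cycles (following the textbook example; rounding may differ slightly):

```c
#include <stdio.h>

int main(void) {
    double base_cpi = 1.0;
    double cycle_ns = 1.0 / 5.0;                 /* 5 GHz -> 0.2 ns per cycle */
    double mem_penalty = 100.0 / cycle_ns;       /* 100 ns DRAM -> 500 cycles */
    double l2_penalty  = 5.0 / cycle_ns;         /* 5 ns L2     -> 25 cycles  */

    /* L1 only: 2% of accesses pay the full trip to DRAM. */
    double cpi_l1 = base_cpi + 0.02 * mem_penalty;                      /* 1 + 10 = 11 */

    /* With L2: 2% pay the L2 penalty, and 0.5% still go to DRAM. */
    double cpi_l2 = base_cpi + 0.02 * l2_penalty + 0.005 * mem_penalty; /* 1 + 0.5 + 2.5 = 4 */

    printf("CPI %.1f -> %.1f, speedup %.2fx\n",
           cpi_l1, cpi_l2, cpi_l1 / cpi_l2);     /* ~2.8x */
    return 0;
}
```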
39. Cache Complexities
- It is not always easy to understand the implications of caches
Theoretical behavior of Radix sort vs. Quicksort
Observed behavior of Radix sort vs. Quicksort
40. Cache Complexities (cont.)
- Here is why:
- Memory system performance is often the critical factor
- multilevel caches and pipelined processors make it harder to predict outcomes
- Compiler optimizations to increase locality sometimes hurt ILP
- It is difficult to predict the best algorithm: you need experimental data
41. Summary: Improving Cache Performance
- Cache performance is determined by:
- Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
- Use better technology
- Use faster RAMs
- Cost and availability are limitations
- Decrease hit time
- Make the cache smaller, but the miss rate increases
- Use direct-mapped instead of set-associative, but the miss rate increases
- Decrease miss rate
- Make the cache larger, but this can increase hit time
- Add associativity, but this can increase hit time
- Increase the block size, but this increases the miss penalty
- Decrease miss penalty
- Reduce the transfer time component of the miss penalty
- Add another level of cache (L2 cache)
42. Another View of the Memory Hierarchy
43. Memory Hierarchy Requirements
- If the Principle of Locality allows caches to offer (close to) the speed of cache memory with the size of DRAM memory, then why not use the same idea recursively at the next level to give the speed of DRAM memory with the size of disk memory?
- Share memory between multiple processes but still provide protection: don't let one program read/write memory belonging to another
- Address space: give each program the illusion that it has its own private memory
- the compiler, linker, and loader are simplified because they see only the virtual address space, abstracted from physical memory allocation
44. Virtual Memory
- Called Virtual Memory
- Also allows the OS to share memory and protect programs from each other
- Today, it is more important for protection than as just another level of the memory hierarchy
- Each process thinks it has all the memory to itself
- Historically, it predates caches
45. Virtual Memory (cont.)
- Addressable memory space vs. physical memory
- Example:
- a 32-bit memory address can specify 4GB of memory
- physical main memory: 16MB - 512MB
- Distinguish between virtual and physical addresses:
- a virtual address is used by the programmer to address memory within a process's address space
- a physical address is used by the hardware to access a physical memory location
- Virtual memory provides the appearance of a very large memory:
- total memory of all jobs >> physical memory
- address space of each job > physical memory
- Simplifies memory management for multi-processing systems:
- each program operates in its own virtual address space as if it were the only program running in the system
- Uses 2 storage levels:
- primary (DRAM) and secondary (hard disk)
- Exploits the hierarchy to reduce average access time, as a cache does
46. Virtual to Physical Address Translation
- Each program operates in its own virtual address space
- as if it were the only program running in the system
- Each program is protected from the others
- The OS can decide where each program goes in memory
- Hardware (HW) provides the virtual-to-physical mapping
47. Paged Virtual Memory
- The most common form of address translation
- the virtual and physical address spaces are partitioned into blocks of equal size
- virtual address space blocks are called pages
- physical address space blocks are called frames (or page frames)
- Placement:
- any page can be placed in any frame (fully associative)
- Pages are fetched on demand
48. Paging Organization
- Paging can map any virtual page to any physical frame
- Data missing from main memory must be transferred from secondary memory (disk)
- misses (page faults) are handled by the operating system
- the miss time is very large, so the OS manages the hierarchy and schedules another process instead of stalling (context switching)
(Figure: mapping of virtual addresses to physical addresses)
49. Paging/Virtual Memory: Multiple Processes
50. Address Translation
- A program uses virtual addresses
- Relocation: a program can be loaded anywhere in physical memory without recompiling or re-linking
- Memory is accessed with physical addresses
- Hardware (HW) provides the virtual-to-physical mapping
- need a translation table for each process
- When a virtual address is missing from main memory, the OS handles the miss:
- read the missing data, create the translation, and return to re-execute the instruction that caused the miss
51. Address Mapping
52. Address Translation Algorithm
- If V = 1, the mapping is valid:
- the CPU checks the permissions (R, R/W, X) against the access type
- if the access is permitted, it generates the physical address and proceeds
- if the access is not permitted, it generates a protection fault
- If V != 1, the mapping is invalid:
- the wanted page does not reside in main memory
- the CPU generates a page fault
- Faults are exceptions handled by the OS
- page faults:
- the OS fetches the missing page, creates a map entry, and restarts the process
- another user process is switched in to execute while the page is brought in from disk (context switching)
- protection faults:
- the OS checks whether it is a programming error or whether the permissions need to be changed
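A software sketch of the algorithm above, assuming a simple one-level page table and 4 KB pages; the structures, permission flags, and return codes are illustrative, not a description of any particular hardware:

```c
#include <stdint.h>

#define PAGE_BITS 12
#define NUM_PAGES 1024

enum access { ACC_READ, ACC_WRITE, ACC_EXEC };

struct pte { unsigned valid : 1, write : 1, exec : 1; uint32_t frame; };
static struct pte page_table[NUM_PAGES];

/* Returns 0 and fills *paddr on success; -1 = page fault, -2 = protection fault. */
int translate(uint32_t vaddr, enum access type, uint32_t *paddr) {
    struct pte *e = &page_table[vaddr >> PAGE_BITS];

    if (!e->valid)                          /* V != 1: page not in main memory */
        return -1;                          /* page fault: OS fetches the page */

    if ((type == ACC_WRITE && !e->write) ||
        (type == ACC_EXEC && !e->exec))
        return -2;                          /* protection fault */

    *paddr = (e->frame << PAGE_BITS) | (vaddr & ((1u << PAGE_BITS) - 1));
    return 0;                               /* access permitted: proceed */
}

int main(void) {
    uint32_t pa;
    return translate(0x1234, ACC_READ, &pa);  /* -1 here: the table starts empty */
}
```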
53. Making VM Fast: the TLB
- If the page table is kept in memory:
- every memory reference requires two accesses
- one for the page table entry and one to get the actual data
- Translation Lookaside Buffer (TLB)
- an additional cache for the page table only
- hardware maintains a cache of recently used page table translations
- all accesses are looked up in the TLB
- a hit in the TLB gives the physical page number
- a miss in the TLB => get the translation from the page table and reload the TLB
- The TLB is usually smaller than the cache (each entry maps a full page)
- more associativity is possible and common
- similar speed to a cache access
- contains all the bits needed to translate an address and implement VM
- Typical TLB entry: Valid | Virtual Address | Physical Address | Dirty | Access Rights
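A sketch of the TLB-first lookup described above: check a small translation cache before walking the page table. The sizes, names, the stub page-table walk, and the naive replacement are all assumptions for illustration:

```c
#include <stdint.h>

#define PAGE_BITS 12
#define TLB_ENTRIES 16                      /* small, often fully associative */

struct tlb_entry { int valid; uint32_t vpn, frame; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Stub page-table walk for this sketch: always reports a page fault. */
static int page_table_lookup(uint32_t vpn, uint32_t *frame) {
    (void)vpn; (void)frame;
    return -1;
}

int translate_with_tlb(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn = vaddr >> PAGE_BITS;

    for (int i = 0; i < TLB_ENTRIES; i++)                 /* TLB hit: no extra memory access */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *paddr = (tlb[i].frame << PAGE_BITS) | (vaddr & ((1u << PAGE_BITS) - 1));
            return 0;
        }

    uint32_t frame;                                       /* TLB miss: consult the page table */
    if (page_table_lookup(vpn, &frame) != 0)
        return -1;                                        /* page fault handled by the OS */

    tlb[0] = (struct tlb_entry){1, vpn, frame};           /* reload TLB (naive replacement) */
    *paddr = (frame << PAGE_BITS) | (vaddr & ((1u << PAGE_BITS) - 1));
    return 0;
}

int main(void) {
    uint32_t pa;
    return translate_with_tlb(0x2345, &pa) == 0 ? 0 : 1;
}
```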
54. Virtual Memory and Cache
- The OS manages the memory hierarchy between secondary storage and main memory
- it allocates physical memory to virtual memory and specifies the mapping to hardware through page tables
- hardware caches recently used page table entries in the TLB
55. TLBs and Caches
56. Page Replacement and Write Policies
- When a page fault occurs, choose a page to replace:
- fully associative, so any frame/page is a candidate
- choose an empty one if it exists
- otherwise choose using either (just as we did for caches):
- LRU
- Random
- Write policy: always write-back
- keep a dirty bit
- set it to 1 if the page is modified
- when a modified page is replaced, the OS writes it back to disk
57. Modern Systems
58. Modern Systems (cont.)
- Things are getting complicated!