Title: EEM 486 Computer Architecture, Lecture 6: Memory Systems and Caches
Slide 1: EEM 486 Computer Architecture, Lecture 6: Memory Systems and Caches
Slide 2: The Big Picture: Where Are We Now?
- The Five Classic Components of a Computer
Slide 3: The Art of Memory System Design

- Workload or benchmark programs drive the processor, which issues a
  reference stream to memory: <op,addr>, <op,addr>, <op,addr>, ...
  where op is i-fetch, read, or write
- Goal: optimize the memory system organization to minimize the average
  memory access time for typical workloads
- (Figure: an SRAM cache sitting between the processor and a DRAM main
  memory)
Slide 4: Technology Trends
Slide 5: Processor-DRAM Memory Gap
Slide 6: The Goal: Illusion of Large, Fast, Cheap Memory

- Facts
  - Large memories are slow but cheap (DRAM)
  - Fast memories are small but expensive (SRAM)
- How do we create a memory that is large, fast, and cheap?
  - Memory hierarchy
  - Parallelism
Slide 7: The Principle of Locality

- The principle of locality: programs access a relatively small portion
  of their address space at any instant of time
- Temporal locality (locality in time)
  - If an item is referenced, it will tend to be referenced again soon
  - So keep the most recently accessed data items closer to the processor
- Spatial locality (locality in space)
  - If an item is referenced, nearby items will tend to be referenced soon
  - So move blocks of contiguous words to the upper levels
- Q: Why does code have locality?
Slide 8: Memory Hierarchy
- Based on the principle of locality
- A way of providing large, cheap, and fast memory
Slide 9: Cache Memory
Slide 10: Elements of Cache Design
- Cache size
- Mapping function
- Direct
- Set Associative
- Fully Associative
- Replacement algorithm
- Least recently used (LRU)
- First in first out (FIFO)
- Random
- Write policy
- Write through
- Write back
- Line size
- Number of caches
- Single or two level
- Unified or split
Slide 11: Terminology

- Hit: data appears in some block in the upper level
- Hit rate: the fraction of memory accesses found in the upper level
- Hit time: time to access the upper level, which consists of
  RAM access time + time to determine hit/miss
Slide 12: Terminology (continued)

- Miss: data needs to be retrieved from a block in the lower level
- Miss rate = 1 - (hit rate)
- Miss penalty: time to replace a block in the upper level
  + time to deliver the block to the processor
- Hit time << miss penalty
Slide 13: Direct Mapped Cache

- Each memory location is mapped to exactly one location in the cache
- Cache block = (block address) modulo (# of cache blocks)
- Equivalently, the low-order log2(# of cache blocks) bits of the address
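The mapping above can be sketched in a few lines; the cache size here is an illustrative assumption, not from the slides.

```python
# Sketch of direct-mapped block placement (NUM_BLOCKS is illustrative).
NUM_BLOCKS = 8  # of cache blocks; must be a power of two

def cache_index(block_address):
    """Cache block = (block address) modulo (# of cache blocks)."""
    return block_address % NUM_BLOCKS

def cache_index_bits(block_address):
    """Equivalently, keep the low-order log2(# of cache blocks) bits."""
    return block_address & (NUM_BLOCKS - 1)

# Both forms agree: block 12 maps to index 4 in an 8-block cache.
assert cache_index(12) == cache_index_bits(12) == 4
```

The bit-mask form is what the hardware actually does: the modulo costs nothing when the block count is a power of two.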
Slide 14: 64 KByte Direct Mapped Cache

- Why do we need a Tag field?
- Why do we need a Valid bit field?
- What kind of locality are we taking care of?
- Total number of bits in a cache = 2^n x (valid + tag + block), where
  - 2^n = # of cache blocks
  - valid = 1 bit
  - tag = 32 - (n + 2) bits, for a 32-bit byte address and 1-word blocks
  - block = 32 bits
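Plugging the slide's formula into code makes the overhead concrete; the choice of n = 14 (64 KB of data = 16K one-word blocks) follows the slide title.

```python
def total_cache_bits(n, addr_bits=32, word_bits=32):
    """Total bits in a direct-mapped cache with 2^n one-word blocks.
    Per block: valid (1) + tag (addr_bits - n - 2) + data (word_bits)."""
    tag_bits = addr_bits - n - 2  # n index bits, 2 byte-offset bits
    return 2**n * (1 + tag_bits + word_bits)

# 64 KB of data = 16K one-word blocks -> n = 14
bits = total_cache_bits(14)
print(bits, "bits =", bits // 8 // 1024, "KB")  # 802816 bits = 98 KB
```

So a "64 KB" cache actually stores 98 KB of SRAM bits once tags and valid bits are counted.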
Slide 15: Reading from Cache

- Address the cache with the PC or the ALU output
- If the cache signals hit, we have a read hit
  - The requested word will be on the data lines
- Otherwise, we have a read miss
  - Stall the CPU
  - Fetch the block from memory and write it into the cache
  - Restart the execution
Slide 16: Writing to Cache

- Address the cache with the PC or the ALU output
- If the cache signals hit, we have a write hit; there are two options
  - Write-through: write the data into both the cache and memory
  - Write-back: write the data only into the cache, and write it into
    memory only when the block is replaced
- Otherwise, we have a write miss
  - Handle the write miss as if it were a write hit
Slide 17: 64 KByte Direct Mapped Cache

- Taking advantage of spatial locality
Slide 18: Writing to Cache

- Address the cache with the PC or the ALU output
- If the cache signals hit, we have a write hit
  - Write-through cache: write the data into both the cache and memory
- Otherwise, we have a write miss
  - Stall the CPU
  - Fetch the block from memory and write it into the cache
  - Restart the execution and rewrite the word
Slide 19: Associativity in Caches

- Compute the set number: (block number) modulo (number of sets)
- Choose one of the blocks in the computed set
Slide 20: Set Associative Cache

- N-way set associative
  - N direct-mapped caches operating in parallel
  - N entries for each cache index
  - N comparators and an N-to-1 mux
  - Data comes AFTER the hit/miss decision and set selection
- (Figure: a four-way set associative cache)
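A set-associative lookup can be modeled as below; the set count and way count are illustrative, and the loop over ways stands in for the N comparators that operate in parallel in hardware.

```python
# Sketch of an N-way set-associative lookup (parameters are illustrative).
NUM_SETS = 4
WAYS = 4  # four-way set associative

# Each set holds up to WAYS (valid, tag) entries.
sets = [[(False, 0)] * WAYS for _ in range(NUM_SETS)]

def lookup(block_address):
    index = block_address % NUM_SETS   # (block number) mod (number of sets)
    tag = block_address // NUM_SETS    # remaining high-order bits
    for valid, stored_tag in sets[index]:  # N comparators in hardware
        if valid and stored_tag == tag:
            return True   # hit
    return False          # miss

sets[12 % NUM_SETS][0] = (True, 12 // NUM_SETS)  # install block 12
assert lookup(12) and not lookup(13)
```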
Slide 21: Fully Associative Cache

- A block can be anywhere in the cache, so there is no cache index
- Compare the cache tags of all cache entries in parallel
- Practical only for a small number of cache blocks
Slide 22: Four Questions for Caches

- Q1: Block placement: where can a block be placed in the upper level?
- Q2: Block identification: how is a block found if it is in the upper level?
- Q3: Block replacement: which block should be replaced on a miss?
- Q4: Write strategy: what happens on a write?
Slide 23: Q1: Block Placement?

- Example: block 12 to be placed in an 8-block cache
- Direct mapped: one place, (block address) mod (# of cache blocks)
- Set associative: a few places, (block address) mod (# of cache sets),
  where # of cache sets = # of cache blocks / degree of associativity
- Fully associative: any place
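The three placement schemes differ only in the degree of associativity, which the slide's block-12 example shows directly:

```python
# Where can block 12 go in an 8-block cache?
BLOCKS = 8

def placement(block, assoc):
    """Cache block indices that `block` may occupy, given the degree of
    associativity (1 = direct mapped, BLOCKS = fully associative)."""
    num_sets = BLOCKS // assoc        # sets = blocks / associativity
    s = block % num_sets              # (block address) mod (# of sets)
    return list(range(s * assoc, (s + 1) * assoc))

assert placement(12, 1) == [4]               # direct mapped: one place
assert placement(12, 2) == [0, 1]            # 2-way: set 0, a few places
assert placement(12, 8) == list(range(8))    # fully associative: any place
```

Direct mapped and fully associative are just the two endpoints of the same formula.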
Slide 24: Q2: Block Identification?

- Direct mapped: indexing (index, 1 comparison)
- N-way set associative: limited search (index the set, N comparisons)
- Fully associative: full search (search all cache entries)
Slide 25: Q3: Replacement Policy on a Miss?

- Easy for direct mapped: there is only one candidate
- Set associative or fully associative
  - Random: randomly select one of the blocks in the set
  - LRU (least recently used): select the block in the set which has been
    unused for the longest time
- Miss rates (%) for LRU vs. random replacement:

  Associativity    2-way           4-way           8-way
  Size           LRU   Random    LRU   Random    LRU   Random
  16 KB          5.2   5.7       4.7   5.3       4.4   5.0
  64 KB          1.9   2.0       1.5   1.7       1.4   1.5
  256 KB         1.15  1.17      1.13  1.13      1.12  1.12
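LRU for one set can be sketched with an ordered dictionary, where insertion order tracks recency; this is a model of the policy, not how the hardware implements it.

```python
from collections import OrderedDict

# Minimal LRU replacement for one cache set (illustrative model).
class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # insertion order tracks recency

    def access(self, tag):
        if tag in self.blocks:                 # hit: mark most recently used
            self.blocks.move_to_end(tag)
            return "hit"
        if len(self.blocks) == self.ways:      # set full: evict the LRU block
            self.blocks.popitem(last=False)
        self.blocks[tag] = True
        return "miss"

s = LRUSet(2)
assert [s.access(t) for t in [1, 2, 1, 3, 2]] == \
       ["miss", "miss", "hit", "miss", "miss"]
```

Note the last access to tag 2 misses: it was the least recently used block when tag 3 arrived, illustrating the ping-pong risk with low associativity.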
Slide 26: Q4: Write Policy?

- Write through: the information is written to both the block in the cache
  and the block in the lower-level memory
- Write back: the information is written only to the block in the cache;
  the modified cache block is written to main memory only when it is
  replaced
  - Is the block clean or dirty?
- Pros and cons of each?
  - WT: read misses cannot result in writes
  - WB: no repeated writes of the same block to memory
  - WT is always combined with write buffers to avoid waiting for the
    lower-level memory
Slide 27: Cache Performance

- CPU time = (CPU execution clock cycles + memory stall clock cycles)
  x cycle time
- Note: memory hit time is included in the execution cycles
- Stalls due to cache misses:
  - Memory stall clock cycles = read-stall clock cycles
    + write-stall clock cycles
  - Read-stall clock cycles = reads x read miss rate x read miss penalty
  - Write-stall clock cycles = writes x write miss rate x write miss penalty
- If read miss penalty = write miss penalty,
  memory stall clock cycles = memory accesses x miss rate x miss penalty
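The stall formulas translate directly to code; the access counts and rates below are made-up numbers chosen only to show that the two forms agree when the penalties are equal.

```python
def memory_stall_cycles(reads, read_mr, read_mp, writes, write_mr, write_mp):
    """Read-stall clock cycles + write-stall clock cycles."""
    return reads * read_mr * read_mp + writes * write_mr * write_mp

# With equal read and write miss penalties, this collapses to
# (memory accesses) x (miss rate) x (miss penalty):
assert memory_stall_cycles(1000, 0.04, 50, 500, 0.04, 50) == 1500 * 0.04 * 50
```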
Slide 28: Cache Performance

- CPU time = instruction count x CPI x cycle time
  = instruction count x cycle time
    x (ideal CPI + memory stalls/inst + other stalls/inst)
- Memory stalls/inst
  = instruction miss rate x instruction miss penalty
  + loads/inst x load miss rate x load miss penalty
  + stores/inst x store miss rate x store miss penalty
- Average memory access time (AMAT)
  = hit time + (miss rate x miss penalty)
  = (hit rate x hit time) + (miss rate x miss time)
Slide 29: Example

- Suppose a processor executes at
  - Clock rate = 200 MHz (5 ns per cycle)
  - Base CPI = 1.1
  - 50% arith/logic, 30% ld/st, 20% control
- Suppose that 10% of memory operations incur a 50-cycle miss penalty
- Suppose that 1% of instructions incur the same miss penalty
- CPI = base CPI + average stalls per instruction
  = 1.1 (cycles/ins)
  + 0.30 (data Mops/ins) x 0.10 (miss/data Mop) x 50 (cycles/miss)
  + 1 (inst Mop/ins) x 0.01 (miss/inst Mop) x 50 (cycles/miss)
  = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
- AMAT = (1/1.3) x (1 + 0.01 x 50) + (0.3/1.3) x (1 + 0.1 x 50) = 2.54
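The example's arithmetic checks out; a hit time of 1 cycle is implicit in the AMAT line:

```python
# Reproducing the example's numbers.
base_cpi = 1.1
data_ops, inst_ops = 0.30, 1.0   # memory operations per instruction
data_mr, inst_mr = 0.10, 0.01    # miss rates
penalty = 50                     # cycles per miss

cpi = base_cpi + data_ops * data_mr * penalty + inst_ops * inst_mr * penalty
assert abs(cpi - 3.1) < 1e-9     # 1.1 + 1.5 + 0.5

accesses = inst_ops + data_ops   # 1.3 memory accesses per instruction
amat = (inst_ops / accesses) * (1 + inst_mr * penalty) \
     + (data_ops / accesses) * (1 + data_mr * penalty)
print(round(amat, 2))            # 2.54 cycles, assuming a 1-cycle hit time
```

Cache stalls nearly triple the CPI here, which is why the rest of the lecture is about reducing miss rate and miss penalty.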
Slide 30: Improving Cache Performance

- CPU time = IC x CT x (ideal CPI + memory stalls)
- Average memory access time = hit time + (miss rate x miss penalty)
  = (hit rate x hit time) + (miss rate x miss time)
- Options to reduce AMAT
  1. Reduce the miss rate
  2. Reduce the miss penalty
  3. Reduce the time to hit in the cache
Slide 31: Reduce Misses: Larger Block Size

- Increasing block size also increases the miss penalty!
Slide 32: Reduce Misses: Higher Associativity

- Increasing associativity also increases both access time and hardware
  cost!
Slide 33: Reducing Penalty: Second-Level Cache

- L2 equations
  - AMAT = hit time_L1 + miss rate_L1 x miss penalty_L1
  - Miss penalty_L1 = hit time_L2 + miss rate_L2 x miss penalty_L2
  - AMAT = hit time_L1
    + miss rate_L1 x (hit time_L2 + miss rate_L2 x miss penalty_L2)
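Substituting the L1 miss penalty gives a single expression; the latencies and miss rates below are illustrative assumptions, not from the slides.

```python
def amat_two_level(hit_l1, mr_l1, hit_l2, mr_l2, penalty_l2):
    """AMAT = HitTime_L1 + MissRate_L1 x (HitTime_L2
              + MissRate_L2 x MissPenalty_L2)."""
    miss_penalty_l1 = hit_l2 + mr_l2 * penalty_l2
    return hit_l1 + mr_l1 * miss_penalty_l1

# Illustrative: 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit,
# 25% L2 local miss rate, 100-cycle memory penalty.
assert amat_two_level(1, 0.05, 10, 0.25, 100) == 1 + 0.05 * (10 + 25)
```

The L2 cache pays off because the common L1 miss is now serviced in hit_time_L2 cycles instead of a full trip to main memory.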
Slide 34: Designing the Memory System to Support Caches

- Wide
  - CPU/mux: 1 word; mux/cache, bus, and memory: N words
- Interleaved
  - CPU, cache, bus: 1 word
  - N memory modules
- Simple
  - CPU, cache, bus, and memory all the same width (32 bits)
Slide 35: Main Memory Performance

- DRAM (read/write) cycle time >> DRAM (read/write) access time
- DRAM (read/write) cycle time
  - How frequently can you initiate an access?
- DRAM (read/write) access time
  - How quickly will you get what you want once you initiate an access?
- DRAM bandwidth limitation
Slide 36: Increasing Bandwidth: Interleaving

- Access pattern without interleaving: the CPU waits for each access to a
  single memory bank to complete before starting the next
- Access pattern with 4-way interleaving: consecutive accesses go to
  banks 0, 1, 2, 3 in turn, overlapping their access times; by the time
  bank 3 is busy, we can access bank 0 again
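A rough cycle count shows the payoff; the per-phase latencies below are illustrative assumptions (1 cycle to send the address, 6 cycles of DRAM access, 1 cycle to transfer the data), and the interleaved case assumes consecutive words land in different banks.

```python
# Back-of-the-envelope timing for n sequential word accesses.
SEND, ACCESS, XFER = 1, 6, 1   # illustrative cycle counts per phase

def cycles_no_interleave(n):
    """Each access must wait for the previous one to fully complete."""
    return n * (SEND + ACCESS + XFER)

def cycles_4way_interleave(n):
    """Banks overlap their access time; only the first address send and
    the per-word data transfers serialize (n <= 4, one word per bank)."""
    return SEND + ACCESS + n * XFER

assert cycles_no_interleave(4) == 32
assert cycles_4way_interleave(4) == 11
```

The bank access latency is paid once instead of four times, which is exactly the bandwidth gain the slide's figure illustrates.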
Slide 37: Summary (1/2)

- The principle of locality
  - A program is likely to access a relatively small portion of the
    address space at any instant of time
  - Temporal locality: locality in time
  - Spatial locality: locality in space
- Three major categories of cache misses
  - Compulsory misses: sad facts of life; example: cold-start misses
  - Conflict misses: reduce them with a larger cache and/or higher
    associativity (nightmare scenario: the ping-pong effect!)
  - Capacity misses: reduce them with a larger cache
- Cache design space
  - Total size, block size, associativity
  - Replacement policy
  - Write-hit policy (write-through, write-back)
  - Write-miss policy
Slide 38: Summary (2/2): The Cache Design Space

- Several interacting dimensions
  - Cache size
  - Block size
  - Associativity
  - Replacement policy
  - Write-through vs. write-back
  - Write allocation
- The optimal choice is a compromise
  - Depends on access characteristics
    - Workload
    - Use (I-cache, D-cache, TLB)
  - Depends on technology / cost
- Simplicity often wins
- (Figure: the design space as a trade-off curve of factor A vs. factor B,
  running from less to more along axes such as cache size, associativity,
  and block size, with good and bad regions marked)