Title: EEM 486 Computer Architecture, Lecture 6: Memory Systems and Caches
Slide 1: EEM 486 Computer Architecture, Lecture 6: Memory Systems and Caches
Slide 2: The Big Picture: Where Are We Now?
- The Five Classic Components of a Computer
Slide 3: The Art of Memory System Design

- Workload or benchmark programs drive the processor, which issues a
  reference stream to memory: <op,addr>, <op,addr>, <op,addr>, ...
  where op is i-fetch, read, or write
- Goal: optimize the memory system organization to minimize the average
  memory access time for typical workloads
- (Figure: an SRAM cache sitting between the processor and a DRAM main
  memory)
Slide 4: Technology Trends
Slide 5: Processor-DRAM Memory Gap
Slide 6: The Goal: Illusion of Large, Fast, Cheap Memory

- Facts
  - Large memories are slow but cheap (DRAM)
  - Fast memories are small but expensive (SRAM)
- How do we create a memory that is large, fast, and cheap?
  - Memory hierarchy
  - Parallelism
Slide 7: The Principle of Locality

- The principle of locality: programs access a relatively small portion
  of their address space at any instant of time
- Temporal locality (locality in time)
  - If an item is referenced, it will tend to be referenced again soon
  - So keep the most recently accessed data items closer to the processor
- Spatial locality (locality in space)
  - If an item is referenced, nearby items will tend to be referenced soon
  - So move blocks of contiguous words to the upper levels
- Q: Why does code have locality?
Slide 8: Memory Hierarchy
- Based on the principle of locality
- A way of providing large, cheap, and fast memory
Slide 9: Cache Memory
Slide 10: Elements of Cache Design
- Cache size
- Mapping function
- Direct
- Set Associative
- Fully Associative
- Replacement algorithm
- Least recently used (LRU)
- First in first out (FIFO)
- Random
- Write policy
- Write through
- Write back
- Line size
- Number of caches
- Single or two level
- Unified or split
Slide 11: Terminology

- Hit: data appears in some block in the upper level
- Hit rate: the fraction of memory accesses found in the upper level
- Hit time: time to access the upper level, which consists of
  RAM access time + time to determine hit/miss
Slide 12: Terminology (continued)

- Miss: data needs to be retrieved from a block in the lower level
- Miss rate = 1 - (hit rate)
- Miss penalty: time to replace a block in the upper level
  + time to deliver the block to the processor
- Hit time << miss penalty
Slide 13: Direct Mapped Cache

- Each memory location is mapped to exactly one location in the cache
- Cache block = (block address) modulo (# of cache blocks)
- Equivalently, the low-order log2(# of cache blocks) bits of the address
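The mapping above can be sketched in a few lines; the cache size here is an illustrative assumption, not from the slides.

```python
# Sketch of direct-mapped block placement (NUM_BLOCKS is illustrative).
NUM_BLOCKS = 8  # of cache blocks; must be a power of two

def cache_index(block_address):
    """Cache block = (block address) modulo (# of cache blocks)."""
    return block_address % NUM_BLOCKS

def cache_index_bits(block_address):
    """Equivalently, keep the low-order log2(# of cache blocks) bits."""
    return block_address & (NUM_BLOCKS - 1)

# Both forms agree: block 12 maps to index 4 in an 8-block cache.
assert cache_index(12) == cache_index_bits(12) == 4
```

The bit-mask form is what the hardware actually does: the modulo costs nothing when the block count is a power of two.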
Slide 14: 64 KByte Direct Mapped Cache

- Why do we need a Tag field?
- Why do we need a Valid bit field?
- What kind of locality are we taking care of?
- Total number of bits in a cache = 2^n x (valid + tag + block), where
  - 2^n = # of cache blocks
  - valid = 1 bit
  - tag = 32 - (n + 2) bits, for a 32-bit byte address and 1-word blocks
  - block = 32 bits
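Plugging the slide's formula into code makes the overhead concrete; the choice of n = 14 (64 KB of data = 16K one-word blocks) follows the slide title.

```python
def total_cache_bits(n, addr_bits=32, word_bits=32):
    """Total bits in a direct-mapped cache with 2^n one-word blocks.
    Per block: valid (1) + tag (addr_bits - n - 2) + data (word_bits)."""
    tag_bits = addr_bits - n - 2  # n index bits, 2 byte-offset bits
    return 2**n * (1 + tag_bits + word_bits)

# 64 KB of data = 16K one-word blocks -> n = 14
bits = total_cache_bits(14)
print(bits, "bits =", bits // 8 // 1024, "KB")  # 802816 bits = 98 KB
```

So a "64 KB" cache actually stores 98 KB of SRAM bits once tags and valid bits are counted.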
Slide 15: Reading from Cache

- Address the cache with the PC or the ALU output
- If the cache signals hit, we have a read hit
  - The requested word will be on the data lines
- Otherwise, we have a read miss
  - Stall the CPU
  - Fetch the block from memory and write it into the cache
  - Restart the execution
Slide 16: Writing to Cache

- Address the cache with the PC or the ALU output
- If the cache signals hit, we have a write hit; there are two options
  - Write-through: write the data into both the cache and memory
  - Write-back: write the data only into the cache, and write it into
    memory only when the block is replaced
- Otherwise, we have a write miss
  - Handle the write miss as if it were a write hit
Slide 17: 64 KByte Direct Mapped Cache

- Taking advantage of spatial locality
Slide 18: Writing to Cache

- Address the cache with the PC or the ALU output
- If the cache signals hit, we have a write hit
  - Write-through cache: write the data into both the cache and memory
- Otherwise, we have a write miss
  - Stall the CPU
  - Fetch the block from memory and write it into the cache
  - Restart the execution and rewrite the word
Slide 19: Associativity in Caches

- Compute the set number: (block number) modulo (number of sets)
- Choose one of the blocks in the computed set
Slide 20: Set Associative Cache

- N-way set associative
  - N direct-mapped caches operating in parallel
  - N entries for each cache index
  - N comparators and an N-to-1 mux
  - Data comes AFTER the hit/miss decision and set selection
- (Figure: a four-way set associative cache)
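A set-associative lookup can be modeled as below; the set count and way count are illustrative, and the loop over ways stands in for the N comparators that operate in parallel in hardware.

```python
# Sketch of an N-way set-associative lookup (parameters are illustrative).
NUM_SETS = 4
WAYS = 4  # four-way set associative

# Each set holds up to WAYS (valid, tag) entries.
sets = [[(False, 0)] * WAYS for _ in range(NUM_SETS)]

def lookup(block_address):
    index = block_address % NUM_SETS   # (block number) mod (number of sets)
    tag = block_address // NUM_SETS    # remaining high-order bits
    for valid, stored_tag in sets[index]:  # N comparators in hardware
        if valid and stored_tag == tag:
            return True   # hit
    return False          # miss

sets[12 % NUM_SETS][0] = (True, 12 // NUM_SETS)  # install block 12
assert lookup(12) and not lookup(13)
```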
Slide 21: Fully Associative Cache

- A block can be anywhere in the cache, so there is no cache index
- Compare the cache tags of all cache entries in parallel
- Practical only for a small number of cache blocks
Slide 22: Four Questions for Caches

- Q1: Block placement: where can a block be placed in the upper level?
- Q2: Block identification: how is a block found if it is in the upper level?
- Q3: Block replacement: which block should be replaced on a miss?
- Q4: Write strategy: what happens on a write?
Slide 23: Q1: Block Placement?

- Example: block 12 to be placed in an 8-block cache
- Direct mapped: one place, (block address) mod (# of cache blocks)
- Set associative: a few places, (block address) mod (# of cache sets),
  where # of cache sets = # of cache blocks / degree of associativity
- Fully associative: any place
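The three placement schemes differ only in the degree of associativity, which the slide's block-12 example shows directly:

```python
# Where can block 12 go in an 8-block cache?
BLOCKS = 8

def placement(block, assoc):
    """Cache block indices that `block` may occupy, given the degree of
    associativity (1 = direct mapped, BLOCKS = fully associative)."""
    num_sets = BLOCKS // assoc        # sets = blocks / associativity
    s = block % num_sets              # (block address) mod (# of sets)
    return list(range(s * assoc, (s + 1) * assoc))

assert placement(12, 1) == [4]               # direct mapped: one place
assert placement(12, 2) == [0, 1]            # 2-way: set 0, a few places
assert placement(12, 8) == list(range(8))    # fully associative: any place
```

Direct mapped and fully associative are just the two endpoints of the same formula.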
Slide 24: Q2: Block Identification?

- Direct mapped: indexing (index, 1 comparison)
- N-way set associative: limited search (index the set, N comparisons)
- Fully associative: full search (search all cache entries)
Slide 25: Q3: Replacement Policy on a Miss?

- Easy for direct mapped: there is only one candidate
- Set associative or fully associative
  - Random: randomly select one of the blocks in the set
  - LRU (least recently used): select the block in the set which has been
    unused for the longest time
- Miss rates (%) for LRU vs. random replacement:

  Associativity    2-way           4-way           8-way
  Size           LRU   Random    LRU   Random    LRU   Random
  16 KB          5.2   5.7       4.7   5.3       4.4   5.0
  64 KB          1.9   2.0       1.5   1.7       1.4   1.5
  256 KB         1.15  1.17      1.13  1.13      1.12  1.12
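LRU for one set can be sketched with an ordered dictionary, where insertion order tracks recency; this is a model of the policy, not how the hardware implements it.

```python
from collections import OrderedDict

# Minimal LRU replacement for one cache set (illustrative model).
class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # insertion order tracks recency

    def access(self, tag):
        if tag in self.blocks:                 # hit: mark most recently used
            self.blocks.move_to_end(tag)
            return "hit"
        if len(self.blocks) == self.ways:      # set full: evict the LRU block
            self.blocks.popitem(last=False)
        self.blocks[tag] = True
        return "miss"

s = LRUSet(2)
assert [s.access(t) for t in [1, 2, 1, 3, 2]] == \
       ["miss", "miss", "hit", "miss", "miss"]
```

Note the last access to tag 2 misses: it was the least recently used block when tag 3 arrived, illustrating the ping-pong risk with low associativity.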
Slide 26: Q4: Write Policy?

- Write through: the information is written to both the block in the cache
  and the block in the lower-level memory
- Write back: the information is written only to the block in the cache;
  the modified cache block is written to main memory only when it is
  replaced
  - Is the block clean or dirty?
- Pros and cons of each?
  - WT: read misses cannot result in writes
  - WB: no repeated writes of the same block to memory
  - WT is always combined with write buffers to avoid waiting for the
    lower-level memory
Slide 27: Cache Performance

- CPU time = (CPU execution clock cycles + memory stall clock cycles)
  x cycle time
- Note: memory hit time is included in the execution cycles
- Stalls due to cache misses:
  - Memory stall clock cycles = read-stall clock cycles
    + write-stall clock cycles
  - Read-stall clock cycles = reads x read miss rate x read miss penalty
  - Write-stall clock cycles = writes x write miss rate x write miss penalty
- If read miss penalty = write miss penalty,
  memory stall clock cycles = memory accesses x miss rate x miss penalty
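The stall formulas translate directly to code; the access counts and rates below are made-up numbers chosen only to show that the two forms agree when the penalties are equal.

```python
def memory_stall_cycles(reads, read_mr, read_mp, writes, write_mr, write_mp):
    """Read-stall clock cycles + write-stall clock cycles."""
    return reads * read_mr * read_mp + writes * write_mr * write_mp

# With equal read and write miss penalties, this collapses to
# (memory accesses) x (miss rate) x (miss penalty):
assert memory_stall_cycles(1000, 0.04, 50, 500, 0.04, 50) == 1500 * 0.04 * 50
```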
Slide 28: Cache Performance

- CPU time = instruction count x CPI x cycle time
  = instruction count x cycle time
    x (ideal CPI + memory stalls/inst + other stalls/inst)
- Memory stalls/inst
  = instruction miss rate x instruction miss penalty
  + loads/inst x load miss rate x load miss penalty
  + stores/inst x store miss rate x store miss penalty
- Average memory access time (AMAT)
  = hit time + (miss rate x miss penalty)
  = (hit rate x hit time) + (miss rate x miss time)
Slide 29: Example

- Suppose a processor executes at
  - Clock rate = 200 MHz (5 ns per cycle)
  - Base CPI = 1.1
  - 50% arith/logic, 30% ld/st, 20% control
- Suppose that 10% of memory operations incur a 50-cycle miss penalty
- Suppose that 1% of instructions incur the same miss penalty
- CPI = base CPI + average stalls per instruction
  = 1.1 (cycles/ins)
  + 0.30 (data Mops/ins) x 0.10 (miss/data Mop) x 50 (cycles/miss)
  + 1 (inst Mop/ins) x 0.01 (miss/inst Mop) x 50 (cycles/miss)
  = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
- AMAT = (1/1.3) x (1 + 0.01 x 50) + (0.3/1.3) x (1 + 0.1 x 50) = 2.54
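The example's arithmetic checks out; a hit time of 1 cycle is implicit in the AMAT line:

```python
# Reproducing the example's numbers.
base_cpi = 1.1
data_ops, inst_ops = 0.30, 1.0   # memory operations per instruction
data_mr, inst_mr = 0.10, 0.01    # miss rates
penalty = 50                     # cycles per miss

cpi = base_cpi + data_ops * data_mr * penalty + inst_ops * inst_mr * penalty
assert abs(cpi - 3.1) < 1e-9     # 1.1 + 1.5 + 0.5

accesses = inst_ops + data_ops   # 1.3 memory accesses per instruction
amat = (inst_ops / accesses) * (1 + inst_mr * penalty) \
     + (data_ops / accesses) * (1 + data_mr * penalty)
print(round(amat, 2))            # 2.54 cycles, assuming a 1-cycle hit time
```

Cache stalls nearly triple the CPI here, which is why the rest of the lecture is about reducing miss rate and miss penalty.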
Slide 30: Improving Cache Performance

- CPU time = IC x CT x (ideal CPI + memory stalls)
- Average memory access time = hit time + (miss rate x miss penalty)
  = (hit rate x hit time) + (miss rate x miss time)
- Options to reduce AMAT
  1. Reduce the miss rate
  2. Reduce the miss penalty
  3. Reduce the time to hit in the cache
Slide 31: Reduce Misses: Larger Block Size

- Increasing block size also increases the miss penalty!
Slide 32: Reduce Misses: Higher Associativity

- Increasing associativity also increases both access time and hardware
  cost!
Slide 33: Reducing Penalty: Second-Level Cache

- L2 equations
  - AMAT = hit time_L1 + miss rate_L1 x miss penalty_L1
  - Miss penalty_L1 = hit time_L2 + miss rate_L2 x miss penalty_L2
  - AMAT = hit time_L1
    + miss rate_L1 x (hit time_L2 + miss rate_L2 x miss penalty_L2)
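Substituting the L1 miss penalty gives a single expression; the latencies and miss rates below are illustrative assumptions, not from the slides.

```python
def amat_two_level(hit_l1, mr_l1, hit_l2, mr_l2, penalty_l2):
    """AMAT = HitTime_L1 + MissRate_L1 x (HitTime_L2
              + MissRate_L2 x MissPenalty_L2)."""
    miss_penalty_l1 = hit_l2 + mr_l2 * penalty_l2
    return hit_l1 + mr_l1 * miss_penalty_l1

# Illustrative: 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit,
# 25% L2 local miss rate, 100-cycle memory penalty.
assert amat_two_level(1, 0.05, 10, 0.25, 100) == 1 + 0.05 * (10 + 25)
```

The L2 cache pays off because the common L1 miss is now serviced in hit_time_L2 cycles instead of a full trip to main memory.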
Slide 34: Designing the Memory System to Support Caches

- Wide
  - CPU/mux: 1 word; mux/cache, bus, and memory: N words
- Interleaved
  - CPU, cache, bus: 1 word
  - N memory modules
- Simple
  - CPU, cache, bus, and memory all the same width (32 bits)
Slide 35: Main Memory Performance

- DRAM (read/write) cycle time >> DRAM (read/write) access time
- DRAM (read/write) cycle time
  - How frequently can you initiate an access?
- DRAM (read/write) access time
  - How quickly will you get what you want once you initiate an access?
- DRAM bandwidth limitation
Slide 36: Increasing Bandwidth: Interleaving

- Access pattern without interleaving: the CPU waits for each access to a
  single memory bank to complete before starting the next
- Access pattern with 4-way interleaving: consecutive accesses go to
  banks 0, 1, 2, 3 in turn, overlapping their access times; by the time
  bank 3 is busy, we can access bank 0 again
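A rough cycle count shows the payoff; the per-phase latencies below are illustrative assumptions (1 cycle to send the address, 6 cycles of DRAM access, 1 cycle to transfer the data), and the interleaved case assumes consecutive words land in different banks.

```python
# Back-of-the-envelope timing for n sequential word accesses.
SEND, ACCESS, XFER = 1, 6, 1   # illustrative cycle counts per phase

def cycles_no_interleave(n):
    """Each access must wait for the previous one to fully complete."""
    return n * (SEND + ACCESS + XFER)

def cycles_4way_interleave(n):
    """Banks overlap their access time; only the first address send and
    the per-word data transfers serialize (n <= 4, one word per bank)."""
    return SEND + ACCESS + n * XFER

assert cycles_no_interleave(4) == 32
assert cycles_4way_interleave(4) == 11
```

The bank access latency is paid once instead of four times, which is exactly the bandwidth gain the slide's figure illustrates.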
Slide 37: Summary (1/2)

- The principle of locality
  - A program is likely to access a relatively small portion of the
    address space at any instant of time
  - Temporal locality: locality in time
  - Spatial locality: locality in space
- Three major categories of cache misses
  - Compulsory misses: sad facts of life; example: cold-start misses
  - Conflict misses: reduce them with a larger cache and/or higher
    associativity (nightmare scenario: the ping-pong effect!)
  - Capacity misses: reduce them with a larger cache
- Cache design space
  - Total size, block size, associativity
  - Replacement policy
  - Write-hit policy (write-through, write-back)
  - Write-miss policy
Slide 38: Summary (2/2): The Cache Design Space

- Several interacting dimensions
  - Cache size
  - Block size
  - Associativity
  - Replacement policy
  - Write-through vs. write-back
  - Write allocation
- The optimal choice is a compromise
  - Depends on access characteristics
    - Workload
    - Use (I-cache, D-cache, TLB)
  - Depends on technology / cost
- Simplicity often wins
- (Figure: the design space as a trade-off curve of factor A vs. factor B,
  running from less to more along axes such as cache size, associativity,
  and block size, with good and bad regions marked)