Title: ECE6130: Computer Architecture: Memory Hierarchy design
1 ECE6130 Computer Architecture: Memory Hierarchy Design
- Dr. Xubin He
- http://www.ece.tntech.edu/hexb
- Email: hexb_at_tntech.edu
- Tel: 931-3723462, Brown Hall 319
2 What you have learned so far
- Fundamentals of Computer Design
  - Cost and technology trends, Amdahl's law, principle of locality, CPU performance equations
- ISA
  - Classifications, addressing modes, operands and operations, MIPS
- Pipelining
  - Hazards, MIPS 5-stage pipeline
- ILP
  - Dynamic scheduling, dynamic branch prediction
Next: Memory Hierarchy Design: cache performance, techniques to improve memory performance, memory organization and technology, case study
3 Since 1980, CPU has outpaced DRAM ...
Q. How do architects address this gap?
A. Put smaller, faster cache memories between CPU and DRAM. Create a memory hierarchy.
[Figure: performance (1/latency) on a log scale from 10 to 1000 versus year, 1980 to 2000. CPU: 60% per year (2X in 1.5 years). DRAM: 9% per year (2X in 10 years).]
4 What is a cache?
- Small, fast storage used to improve average access time to slow memory.
- Exploits spatial and temporal locality.
- In computer architecture, almost everything is a cache!
  - Registers: a cache on variables (software managed)
  - First-level cache: a cache on second-level cache
  - Second-level cache: a cache on memory
  - Memory: a cache on disk (virtual memory)
  - TLB: a cache on page table
  - Branch-prediction buffer: a cache on prediction information?
[Figure: the hierarchy Proc/Regs, L1-Cache, L2-Cache, Memory, Disk/Tape etc.; levels get faster toward the processor and bigger toward disk.]
5 1977: DRAM faster than microprocessors
6 Levels of the Memory Hierarchy
Level           Capacity     Access time             Cost                        Unit moved to the level above
CPU Registers   100s bytes   < 1 ns
Cache           KBytes       1 ns                    1 - 0.1 cents/bit           Instructions, operands
Main Memory     M-G bytes    100 ns                  .0001 - .00001 cents/bit    Blocks
Disk            G-T bytes    10 ms (10,000,000 ns)   10^-5 - 10^-6 cents/bit     Pages
Tape            infinite     sec-min                 10^-8 cents/bit             Files
Upper levels are faster; lower levels are larger.
7 Memory Hierarchy: Apple iMac G5
Goal: illusion of large, fast, cheap memory.
Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.
8 iMac's PowerPC 970: all caches on-chip
9 The Principle of Locality
- The Principle of Locality:
  - Programs access a relatively small portion of the address space at any instant of time.
- Two different types of locality:
  - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
  - Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
- For the last 15 years, hardware has relied on locality for speed.
Locality is a property of programs that is exploited in machine design.
10 Programs with locality cache well ...
[Figure: memory address (one dot per access) plotted against time.]
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
11 Memory Hierarchy Terminology
- Hit: data appears in some block in the upper level (example: Block X)
  - Hit Rate: the fraction of memory accesses found in the upper level
  - Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
- Miss: data needs to be retrieved from a block in the lower level (Block Y)
  - Miss Rate = 1 - (Hit Rate)
  - Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
- Hit Time << Miss Penalty (500 instructions on 21264!)
12 More terms
- Block: a fixed-size collection of data, which is retrieved from main memory and placed into the cache (the cache unit).
- Temporal locality: recently accessed data items are likely to be accessed in the near future.
- Spatial locality: items whose addresses are near one another tend to be referenced close together in time.
- The time required for a cache miss depends on both the latency and the bandwidth of the memory. Latency determines the time to retrieve the first word of the block, and bandwidth determines the time to retrieve the rest of the block.
- Virtual memory: the address space is usually broken into fixed-size blocks (pages). At any time, each page resides either in main memory or on disk. When the CPU references an item within a page that is not in the cache or main memory, a page fault occurs, and the entire page is then moved from disk to main memory.
- The cache and main memory have the same relationship as main memory and disk.
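The latency/bandwidth point above can be sketched as a small calculation. This is only an illustrative model: the function name, the word size, and the numbers in the usage lines are assumptions, not figures from the slides.

```python
def miss_service_time_ns(latency_ns, block_bytes, word_bytes, bandwidth_bytes_per_ns):
    """Rough time to fetch one block on a miss.

    Latency covers the first word; bandwidth covers the rest of the block.
    All parameter names and units are illustrative assumptions.
    """
    rest_of_block = block_bytes - word_bytes
    return latency_ns + rest_of_block / bandwidth_bytes_per_ns


# Hypothetical memory: 100 ns latency, 8 bytes/ns bandwidth, 8-byte words.
small_block = miss_service_time_ns(100, 64, 8, 8)    # 64-byte block
large_block = miss_service_time_ns(100, 128, 8, 8)   # 128-byte block
print(small_block, large_block)
```

In this model, doubling the block size adds only transfer time, which is why latency tends to dominate for small blocks.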
13 Cache performance
- Memory stall cycles: the number of cycles during which the CPU is stalled waiting for a memory access.
- $\text{CPU time} = (\text{CPU clock cycles} + \text{Memory stall cycles}) \times \text{Cycle time}$
- $\text{Memory stall cycles} = \text{Number of misses} \times \text{Miss penalty} = IC \times \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty}$
- Miss rate: the fraction of cache accesses that result in a miss.
14 Impact on Performance
- Example: assume we have a computer where the CPI is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?
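One way to work the example is to turn the memory-stall formula into a few lines of Python. This is a sketch under the stated assumptions (one instruction fetch per instruction plus the data accesses); the function and variable names are made up for illustration.

```python
def cpi_with_memory_stalls(base_cpi, data_access_fraction, miss_rate, miss_penalty):
    """CPI including memory stall cycles per instruction.

    Assumes every instruction makes one instruction fetch, plus
    `data_access_fraction` data accesses on average.
    """
    accesses_per_instruction = 1.0 + data_access_fraction
    stall_cycles = accesses_per_instruction * miss_rate * miss_penalty
    return base_cpi + stall_cycles


ideal_cpi = 1.0
real_cpi = cpi_with_memory_stalls(ideal_cpi, 0.50, 0.02, 25)
speedup_if_all_hits = real_cpi / ideal_cpi
print(real_cpi, speedup_if_all_hits)
```

With these inputs the stall term is 1.5 x 0.02 x 25 = 0.75 cycles per instruction, so the all-hit machine comes out about 1.75 times faster.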
15 Traditional Four Questions for Memory Hierarchy Designers
- Q1: Where can a block be placed in the upper level? (Block placement)
  - Fully associative, set associative, direct mapped
- Q2: How is a block found if it is in the upper level? (Block identification)
  - Tag/Block
- Q3: Which block should be replaced on a miss? (Block replacement)
  - Random, LRU
- Q4: What happens on a write? (Write strategy)
  - Write back or write through (with write buffer)
16 Q1: Where can a block be placed in the upper level? (Block placement)
- (a) Fully associative: any block in main memory can be placed in any block frame.
  - It is flexible but expensive due to associativity.
- (b) Direct mapped: each block in memory is placed in a fixed block frame given by the mapping function
  - block frame = (block address in memory) MOD (# of block frames in cache)
  - It is inflexible but simple and economical.
- (c) Set associative: a compromise between fully associative and direct mapped. The cache is divided into sets of block frames, and each block from memory is first mapped to a fixed set, within which the block can be placed in any block frame. Mapping to a set follows the function below, called bit selection:
  - set = (block address in memory) MOD (# of sets in cache)
  - n-way set associative: there are n blocks in a set
  - a fully associative cache is m-way set associative if there are m block frames in the cache, whereas a direct-mapped cache is one-way set associative
  - one-way, two-way, and four-way are the most frequently used methods.
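17 Q1: Where can a block be placed in the upper level?
- Block 12 placed in an 8-block cache:
  - Fully associative, direct mapped, 2-way set associative
  - Set-associative mapping = block number modulo number of sets
- Direct mapped: (12 mod 8) = 4
- 2-way set associative: (12 mod 4) = 0
- Fully associative: block 12 can go in any block frame
[Figure: block 12 of memory mapped into the cache under each of the three schemes.]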
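The block-12 example can be reproduced with a short sketch. The helper name and the assumption that the frames of a set are numbered contiguously are illustrative choices, not something the slides specify.

```python
def candidate_frames(block_addr, num_frames, ways):
    """Block frames in which a memory block may be placed.

    ways=1 is direct mapped, ways=num_frames is fully associative.
    Assumes the frames of set s are numbered s*ways .. s*ways + ways - 1.
    """
    num_sets = num_frames // ways
    set_index = block_addr % num_sets          # bit selection
    first_frame = set_index * ways
    return list(range(first_frame, first_frame + ways))


direct = candidate_frames(12, num_frames=8, ways=1)
two_way = candidate_frames(12, num_frames=8, ways=2)
fully = candidate_frames(12, num_frames=8, ways=8)
print(direct, two_way, fully)
```

Treating all three schemes as n-way set associative with different n keeps a single mapping function, which mirrors the "one-way" and "m-way" view of direct-mapped and fully associative caches.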
18 Q2: How is a block found if it is in the upper level? (Block identification)
- Each block frame in the cache has an address tag indicating the block's address in memory.
  - All possible tags are searched in parallel.
- A valid bit is attached to the tag to indicate whether the block contains valid information or not.
- An address A for a datum from the CPU is divided into a block address field and a block offset field:
  - block address = A / block size
  - block offset = A MOD (block size)
- The block address is further divided into tag and index:
  - The index indicates the set in which the block may reside, while the tag is compared to indicate a hit or a miss.
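The address split described above can be sketched as follows. The function name and the example address, block size, and set count are assumptions chosen only for illustration.

```python
def split_address(addr, block_size, num_sets):
    """Split a byte address into (tag, index, block offset).

    block address = addr // block_size, which is further divided
    into an index (which set) and a tag (compared on lookup).
    """
    block_addr = addr // block_size
    offset = addr % block_size
    index = block_addr % num_sets
    tag = block_addr // num_sets
    return tag, index, offset


# Hypothetical cache: 64-byte blocks, 512 sets.
tag, index, offset = split_address(0x12345, block_size=64, num_sets=512)
rebuilt = (tag * 512 + index) * 64 + offset
print(tag, index, offset, hex(rebuilt))
```

When the block size and the number of sets are powers of two, the divisions and remainders reduce to selecting bit fields of the address.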
19 Example: Alpha 21264 Data Cache
For 2-way set associative, use round-robin (FIFO) to choose where to go
[Figure 5.7: Alpha 21264 data cache organization; surviving labels: "16 bytes", "Cache miss".]
20 Q3: Which block should be replaced on a miss? (Block replacement)
- The more choices for replacement, the more expensive the hardware; direct mapping is the simplest.
- Random vs. least-recently used (LRU): the former gives uniform allocation and is simple to build, while the latter can take advantage of temporal locality but can be expensive to implement (why?)
- Random, LRU, FIFO
21 Q3: After a cache read miss, if there are no empty cache blocks, which block should be removed from the cache?
- A randomly chosen block? Easy to implement, but how well does it work?
- The Least Recently Used (LRU) block? Appealing, but hard to implement for high associativity.
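To make the LRU option concrete, here is a minimal software model of one cache set with LRU replacement. It is a sketch only: real hardware does not use an ordered dictionary, and the class name and access trace are hypothetical.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with `ways` block frames and LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # tags, least recently used first

    def access(self, tag):
        """Return True on a hit, False on a miss (evicting the LRU block if full)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)      # mark as most recently used
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)   # evict least recently used
        self.blocks[tag] = None
        return False


lru_set = LRUSet(ways=2)
results = [lru_set.access(tag) for tag in ["A", "B", "A", "C", "B"]]
print(results)
```

The model has to reorder its state on every hit, which hints at why exact LRU gets expensive as associativity grows: the hardware must track a full ordering of the blocks in each set.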
22 Q4: What happens on a write? (Write strategy)
- Most cache accesses are reads: all instruction accesses are reads, and most instructions don't write to memory.
- Optimize reads to make the common case fast, observing that the CPU doesn't have to wait for writes but must wait for reads. Fortunately, reads are easy: reading the block and comparing the tag can be done in parallel. Writes are hard:
  - (a) cannot overlap tag reading and block writing (destructive)
  - (b) the CPU specifies the write size: only 1 - 8 bytes
- Thus write strategies often distinguish cache designs:
  - (a) Write through (or store through): write info to blocks in both levels
    - ensures consistency at the cost of memory and bus bandwidth
    - write stalls may be alleviated by using write buffers, overlapping processor execution with memory updating
  - (b) Write back (store in): write info to blocks only at the cache level
    - minimizes memory and bus traffic at the cost of weak consistency
    - uses a dirty bit to indicate modification, reducing the frequency of write-back on replacement
    - read misses may result in writes (why?)
- On a write miss, the data are not needed:
  - (a) Write allocate (the block is allocated on a write miss): normally used with write-back
  - (b) No-write allocate (a write miss does not affect the cache; the block is modified only in the lower level): normally used with write-through
23 Q4: What happens on a write?
Additional option: let writes to an un-cached address allocate a new cache line (write-allocate).
24 Write Buffers for Write-Through Caches
Q. Why a write buffer?
A. So the CPU doesn't stall.
Q. Why a buffer, why not just one register?
A. Bursts of writes are common.
Q. Are Read After Write (RAW) hazards an issue for the write buffer?
A. Yes! Drain the buffer before the next read, or check the write buffer for a matching address and, if there is none, send the read first.
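The RAW hazard can be illustrated with a toy write buffer. This is a sketch under assumptions: memory is modeled as a dictionary, the class and method names are invented, and a matching buffered write is forwarded to the read, which is one possible way to handle the conflict rather than the policy the slide prescribes.

```python
from collections import deque


class WriteBuffer:
    """FIFO of pending writes sitting between a cache and memory."""

    def __init__(self, memory):
        self.memory = memory        # dict: address -> value (stand-in for memory)
        self.pending = deque()      # (address, value) pairs not yet written

    def write(self, addr, value):
        self.pending.append((addr, value))    # CPU continues without stalling

    def drain_one(self):
        if self.pending:
            addr, value = self.pending.popleft()
            self.memory[addr] = value

    def read(self, addr):
        # Check the buffer first (newest entry wins) to avoid a RAW hazard.
        for pending_addr, value in reversed(self.pending):
            if pending_addr == addr:
                return value
        return self.memory.get(addr, 0)


memory = {100: 1}
buffer = WriteBuffer(memory)
buffer.write(100, 7)
value_before_drain = buffer.read(100)   # memory still holds the old value
stale_memory_value = memory[100]
buffer.drain_one()
value_after_drain = memory[100]
print(value_before_drain, stale_memory_value, value_after_drain)
```

Without the buffer check, the read would go straight to memory and return the stale value 1.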
25 Example
- Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations (the address is in square brackets):
  - Write Mem[100]
  - Write Mem[100]
  - Read Mem[200]
  - Write Mem[200]
  - Write Mem[100]
- For no-write allocate: ? misses and ? hits
- For write allocate: ? misses and ? hits
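One way to check an answer to this exercise is a small simulation. The sketch below assumes the cache is large enough that nothing is ever evicted, as the example states; the function name and the tuple encoding of operations are illustrative.

```python
def count_hits_and_misses(operations, write_allocate):
    """Count hits and misses for a cache that starts empty and never evicts."""
    cached = set()
    hits = misses = 0
    for kind, addr in operations:
        if addr in cached:
            hits += 1
            continue
        misses += 1
        # Reads always allocate; writes allocate only under write allocate.
        if kind == "read" or write_allocate:
            cached.add(addr)
    return hits, misses


operations = [("write", 100), ("write", 100), ("read", 200),
              ("write", 200), ("write", 100)]
no_write_allocate = count_hits_and_misses(operations, write_allocate=False)
with_write_allocate = count_hits_and_misses(operations, write_allocate=True)
print(no_write_allocate, with_write_allocate)
```

Under no-write allocate, address 100 never enters the cache, so every write to it misses; under write allocate, only the first touch of each address misses.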
26 Example: Fig. 5.7, Alpha 21264 data cache (64-bit machine). Cache size: 65,536 bytes (64 KB), block size: 64 bytes, two-way set-associative placement, write back, write allocate on a write miss, 44-bit physical address. What is the index size?
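The index size follows from the relation 2^index = cache size / (block size x associativity). A short sketch of that arithmetic, with an assumed helper name and assuming all sizes are powers of two:

```python
def cache_field_widths(cache_bytes, block_bytes, ways, physical_address_bits):
    """Return (index_bits, offset_bits, tag_bits) for a set-associative cache.

    Assumes cache size, block size, and number of sets are powers of two.
    """
    num_sets = cache_bytes // (block_bytes * ways)
    index_bits = num_sets.bit_length() - 1       # log2(num_sets)
    offset_bits = block_bytes.bit_length() - 1   # log2(block_bytes)
    tag_bits = physical_address_bits - index_bits - offset_bits
    return index_bits, offset_bits, tag_bits


index_bits, offset_bits, tag_bits = cache_field_widths(65536, 64, 2, 44)
print(index_bits, offset_bits, tag_bits)
```

For the parameters in the example this gives 512 sets, so a 9-bit index, a 6-bit block offset, and the remaining 29 bits as tag.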
27 Unified vs Split Caches
- Unified vs separate I&D
- Example:
  - 16KB I&D: instruction miss rate = 0.64%, data miss rate = 6.47%
  - 32KB unified: aggregate miss rate = 1.99%
- Which is better (ignore L2 cache)?
  - Assume 33% data ops => 75% of accesses are from instructions (1.0/1.33)
  - hit time = 1, miss time = 50
  - Note that a data hit has 1 stall for the unified cache (only one port)
  - $\text{AMAT}_{\text{Harvard}} = 75\% \times (1 + 0.64\% \times 50) + 25\% \times (1 + 6.47\% \times 50) = 2.05$
  - $\text{AMAT}_{\text{Unified}} = 75\% \times (1 + 1.99\% \times 50) + 25\% \times (1 + 1 + 1.99\% \times 50) = 2.24$
[Figure: two processor organizations; labels: Proc, I-Cache-1, D-Cache-1, Unified Cache-2.]
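The two AMAT figures can be recomputed directly. This is a sketch using the miss rates and access mix from the example; the helper name `amat` and the variable names are assumptions.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty


instruction_fraction, data_fraction = 0.75, 0.25
miss_penalty = 50

# Split (Harvard) caches: separate miss rates, no port conflict.
amat_split = (instruction_fraction * amat(1, 0.0064, miss_penalty)
              + data_fraction * amat(1, 0.0647, miss_penalty))

# Unified cache: one miss rate, but a data access pays 1 extra stall cycle
# because the single port is busy with the instruction fetch.
amat_unified = (instruction_fraction * amat(1, 0.0199, miss_penalty)
                + data_fraction * amat(1 + 1, 0.0199, miss_penalty))

print(round(amat_split, 3), round(amat_unified, 3))
```

The split organization comes out ahead (about 2.05 vs. about 2.24 to 2.25 cycles depending on rounding) even though the unified cache has the lower aggregate miss rate, because of the extra structural stall on data accesses.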
28 Discussion: single (unified) cache vs. separate caches: different miss rates (74% instruction vs. 26% data), see Figure 5.8. (A single unified cache may have structural hazards from loads/stores.)
Figure 5.8: Misses per 1000 instructions for instruction, data, and unified caches of different sizes
29 The Limits of Physical Addressing
[Figure: CPU connected directly to memory by address lines A0-A31 and data lines D0-D31.]
Machine language programs must be aware of the machine organization.
No way to prevent a program from accessing any machine resource.
30 Solution: Add a Layer of Indirection
[Figure: the CPU issues virtual addresses (A0-A31); address translation turns them into physical addresses (A0-A31) for memory; data lines D0-D31 connect CPU and memory.]
User programs run in a standardized virtual address space.
Address translation hardware, managed by the operating system (OS), maps virtual addresses to physical memory.
Hardware supports modern OS features: protection, translation, sharing.
31 Three Advantages of Virtual Memory
- Translation:
  - Program can be given a consistent view of memory, even though physical memory is scrambled
  - Makes multithreading reasonable (now used a lot!)
  - Only the most important part of the program (working set) must be in physical memory.
  - Contiguous structures (like stacks) use only as much physical memory as necessary yet can still grow later.
- Protection:
  - Different threads (or processes) protected from each other.
  - Different pages can be given special behavior (read only, invisible to user programs, etc.).
  - Kernel data protected from user programs
  - Very important for protection from malicious programs
- Sharing:
  - Can map same physical page to multiple users (shared memory)
32 Page tables encode virtual address spaces
A virtual address space is divided into blocks of memory called pages.
[Figure: pages of the virtual address space mapped to physical memory frames.]
A valid page table entry encodes the physical memory frame address for the page.
33 Page tables encode virtual address spaces
A virtual address space is divided into blocks of memory called pages.
34 Details of Page Table
[Figure: a virtual address indexes the page table, whose entries point to physical memory frames.]
- Page table maps virtual page numbers to physical frames (PTE = Page Table Entry)
- Virtual memory => treat memory as a cache for disk
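The mapping from virtual page numbers to physical frames can be sketched as a lookup followed by re-attaching the page offset. The 4 KB page size, the dictionary-based page table, and the `PageFault` exception are illustrative assumptions; a real PTE also carries valid, dirty, and protection bits.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages


class PageFault(Exception):
    """Raised when the virtual page has no valid page table entry."""


def translate(virtual_addr, page_table):
    """Translate a virtual address using a {virtual page number: frame} table."""
    vpn, offset = divmod(virtual_addr, PAGE_SIZE)
    frame = page_table.get(vpn)      # a missing entry models an invalid PTE
    if frame is None:
        raise PageFault(vpn)
    return frame * PAGE_SIZE + offset


page_table = {0: 5, 1: 2}            # hypothetical mappings
physical_a = translate(0x1234, page_table)
physical_b = translate(0x0010, page_table)
print(hex(physical_a), hex(physical_b))
```

Only the page number is translated; the offset within the page passes through unchanged, which is why physical and virtual pages must be the same size.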
35 Page tables may not fit in memory!
A table for 4KB pages for a 32-bit address space has 1M entries.
Each process needs its own address space!
Top-level table wired in main memory.
Subset of 1024 second-level tables in main memory; rest are on disk or unallocated.
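The 1M-entry figure and the two-level split can be checked with a little bit arithmetic. This sketch assumes the 20-bit virtual page number is divided evenly, 10 bits selecting one of the 1024 second-level tables and 10 bits selecting an entry within it; that even split and all names are assumptions.

```python
ADDRESS_BITS = 32
PAGE_BYTES = 4 * 1024

offset_bits = PAGE_BYTES.bit_length() - 1     # 12 bits of page offset
vpn_bits = ADDRESS_BITS - offset_bits         # 20 bits of virtual page number
flat_table_entries = 1 << vpn_bits            # entries in a single-level table

SECOND_LEVEL_TABLES = 1024
entries_per_second_level = flat_table_entries // SECOND_LEVEL_TABLES


def split_virtual_address(vaddr):
    """Return (top-level index, second-level index, page offset)."""
    offset = vaddr & (PAGE_BYTES - 1)
    second = (vaddr >> offset_bits) & (entries_per_second_level - 1)
    top = vaddr >> (offset_bits + 10)
    return top, second, offset


fields = split_virtual_address(0x00403004)
print(flat_table_entries, entries_per_second_level, fields)
```

Because only the second-level tables that are actually in use need to be resident, a sparse address space costs far less memory than the flat 1M-entry table.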
36 VM and Disk: Page replacement policy
Dirty bit: page written. Used bit: set to 1 on any reference.
[Figure: set of all pages in memory.]
Architect's role: support setting dirty and used bits.
37 TLB Design Concepts
38 MIPS Address Translation: How does it work?
[Figure: CPU, address translation, and memory; virtual addresses (A0-A31) are translated to physical addresses (A0-A31); data lines D0-D31.]
TLB also contains protection bits for the virtual address.
Fast common case: the virtual address is in the TLB and the process has permission to read/write it.
39 The TLB caches page table entries
Physical and virtual pages must be the same size!
[Figure: TLB entry fields, including one for the ASID.]
MIPS handles TLB misses in software (random replacement). Other machines use hardware.
40 Can TLB and caching be overlapped?
A. Inflexibility. Size of cache limited by page size.
41 Problems With Overlapped TLB Access
Overlapped access only works as long as the address bits used to index into the cache do not change as the result of VA translation.
This usually limits things to small caches, large page sizes, or highly set-associative caches if you want a large cache.
Example: suppose everything is the same except that the cache is increased to 8 KB instead of 4 KB.
[Figure: a virtual address with a 20-bit virtual page number and a 12-bit displacement, above a cache address with an 11-bit cache index and a 2-bit byte offset (00). The top bit of the cache index is changed by VA translation, but is needed for cache lookup.]
Solutions:
- go to 8 KB page sizes;
- go to a 2-way set associative cache; or
- SW guarantees VA[13] = PA[13]
[Figure: 2-way set associative cache; labels: 1K entries, 10-bit index, 4 bytes per entry in each way.]
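The constraint can be phrased as a simple check: the bits used for the cache index and block offset must all lie within the page offset, which holds when cache size divided by associativity does not exceed the page size. A sketch with an assumed function name:

```python
def can_overlap_tlb_access(cache_bytes, ways, page_bytes):
    """True if cache indexing uses only untranslated (page offset) bits.

    The index and block offset together address cache_bytes / ways bytes,
    so that span must fit within one page.
    """
    return cache_bytes // ways <= page_bytes


KB = 1024
original = can_overlap_tlb_access(4 * KB, ways=1, page_bytes=4 * KB)
bigger_cache = can_overlap_tlb_access(8 * KB, ways=1, page_bytes=4 * KB)
bigger_pages = can_overlap_tlb_access(8 * KB, ways=1, page_bytes=8 * KB)
two_way = can_overlap_tlb_access(8 * KB, ways=2, page_bytes=4 * KB)
print(original, bigger_cache, bigger_pages, two_way)
```

The check covers the two hardware fixes listed on the slide (larger pages, higher associativity); the software guarantee on address bits is a separate mechanism that this sketch does not model.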
42 Use virtual addresses for cache?
[Figure: the CPU sends virtual addresses (A0-A31) to a virtual cache; the Translation Look-Aside Buffer (TLB) sits between the cache and main memory and produces physical addresses (A0-A31); data lines D0-D31.]
Only use the TLB on a cache miss!
Downside: a subtle, fatal problem. What is it?
A. Synonym problem. If two address spaces share a physical frame, data may be in the cache twice. Maintaining consistency is a nightmare.
43 Summary 1/3: The Cache Design Space
- Several interacting dimensions
  - cache size
  - block size
  - associativity
  - replacement policy
  - write-through vs write-back
  - write allocation
- The optimal choice is a compromise
  - depends on access characteristics
    - workload
    - use (I-cache, D-cache, TLB)
  - depends on technology / cost
- Simplicity often wins
[Figure: the design space drawn along three axes (cache size, associativity, block size), with a sketch of Factor A and Factor B trading off between "good" and "bad" as a parameter goes from less to more.]
44 Summary 2/3: Caches
- The Principle of Locality:
  - Programs access a relatively small portion of the address space at any instant of time.
    - Temporal locality: locality in time
    - Spatial locality: locality in space
- Three major categories of cache misses:
  - Compulsory misses: sad facts of life. Example: cold start misses.
  - Capacity misses: increase cache size
  - Conflict misses: increase cache size and/or associativity. Nightmare scenario: ping-pong effect!
- Write policy: write through vs. write back
- Today CPU time is a function of (ops, cache misses) vs. just f(ops): this affects compilers, data structures, and algorithms
45 Summary 3/3: TLB, Virtual Memory
- Page tables map virtual addresses to physical addresses
- TLBs are important for fast translation
- TLB misses are significant in processor performance
  - funny times, as most systems can't access all of the 2nd-level cache without TLB misses!
- Caches, TLBs, and virtual memory can all be understood by examining how they deal with 4 questions: 1) Where can a block be placed? 2) How is a block found? 3) What block is replaced on a miss? 4) How are writes handled?
- Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than memory hierarchy benefits, but computers remain insecure