Title: CSL718 : Memory Hierarchy
1. CSL718: Memory Hierarchy
- Cache Performance Improvement
- 23rd Feb, 2006
2. Performance
- Average memory access time (worked example below)
  = Hit time + Mem stalls / access
  = Hit time + Miss rate × Miss penalty
- Program execution time
  = IC × Cycle time × (CPI_exec + Mem stalls / instr)
- Mem stalls / instr
  = Miss rate × Miss penalty × Mem accesses / instr
- Miss penalty in an OOO processor
  = Total miss latency − Overlapped miss latency
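As a worked illustration with assumed numbers (not from the slides): with hit time = 1 cycle, miss rate = 5%, and miss penalty = 100 cycles, average memory access time = 1 + 0.05 × 100 = 6 cycles. With 1.3 memory accesses per instruction, mem stalls / instr = 0.05 × 100 × 1.3 = 6.5 cycles.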
3. Performance Improvement
- Reducing miss penalty
- Reducing miss rate
- Reducing miss penalty and miss rate
- Reducing hit time
4. Reducing Miss Penalty
- Multi level caches
- Critical word first and early restart
- Giving priority to read misses over writes
- Merging write buffer
- Victim caches
5. Multi-Level Caches
- Average memory access time
  = Hit time_L1 + Miss rate_L1 × Miss penalty_L1
- Miss penalty_L1 (worked example below)
  = Hit time_L2 + Miss rate_L2 × Miss penalty_L2
- Multi-level inclusion
- Multi-level exclusion
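For instance, with assumed numbers (not from the slides): Hit time_L1 = 1, Miss rate_L1 = 5%, Hit time_L2 = 10, Miss rate_L2 = 20%, Miss penalty_L2 = 100. Then Miss penalty_L1 = 10 + 0.2 × 100 = 30, and average memory access time = 1 + 0.05 × 30 = 2.5 cycles.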
6. Misses in a Multilevel Cache
- Local miss rate
  - no. of misses / no. of requests, as seen at a level
- Global miss rate
  - no. of misses / no. of requests, on the whole
- Solo miss rate
  - miss rate if only this cache were present
7. Two-Level Cache Miss Example
- Classify the accesses by where they hit; A, B, C, D denote the access counts in each class (numeric illustration below):
  - A: hit in L1
  - B: miss in L1, hit in L2
  - C: hit in L1, but would miss if only L2 were present
  - D: miss in L1, miss in L2
- Local miss (L1) = (B + D)/(A + B + C + D)
- Local miss (L2) = D/(B + D)
- Global miss = D/(A + B + C + D)
- Solo miss (L2) = (C + D)/(A + B + C + D)
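As a numeric illustration with assumed counts (not from the slide): out of 1000 accesses, let A = 940, B = 40, C = 15, D = 5. Then local miss (L1) = 45/1000 = 4.5%, local miss (L2) = 5/45 ≈ 11%, global miss = 5/1000 = 0.5%, and solo miss (L2) = 20/1000 = 2%.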
8. Critical Word First and Early Restart
- Read policy
- Load policy
- More effective when block size is large
9. Read Miss Priority Over Writes
- Provide write buffers
- Processor writes into the buffer and proceeds (for write through as well as write back)
- On a read miss
  - wait for the buffer to be empty, or
  - check addresses in the buffer for a conflict
10. Merging Write Buffer
- Merge writes belonging to the same block in the case of write through
11. Victim Cache (proposed by Jouppi)
- Evicted blocks are recycled
- Much faster than getting a block from the next level
- Size: 1 to 5 blocks
- A significant fraction of misses may be found in the victim cache (C sketch below)
[Diagram: data flows from memory into the cache and on to the processor, with a small victim cache alongside the cache]
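A minimal C sketch of the victim-cache idea, assuming a direct-mapped main cache with 32-byte blocks and a 4-entry victim cache; all type and function names (line_t, fetch_from_next_level, access_block) are illustrative, not from the slides.

    #define VICTIM_SIZE 4

    typedef struct { int valid; unsigned tag; } line_t;   /* tag = full block address, data omitted */

    static line_t cache[1024];            /* direct-mapped main cache             */
    static line_t victim[VICTIM_SIZE];    /* small fully associative victim cache */

    /* Stub standing in for a (much slower) fetch from the next memory level. */
    static line_t fetch_from_next_level(unsigned blk)
    {
        line_t l = { 1, blk };
        return l;
    }

    /* Access one address; returns 1 on cache hit, 2 on victim-cache hit, 0 on miss. */
    static int access_block(unsigned addr)
    {
        unsigned blk   = addr >> 5;       /* block address (32-byte blocks) */
        unsigned index = blk & 1023;      /* direct-mapped index            */

        if (cache[index].valid && cache[index].tag == blk)
            return 1;                     /* normal hit                     */

        /* Miss in the main cache: probe the victim cache before going to memory. */
        for (int i = 0; i < VICTIM_SIZE; i++) {
            if (victim[i].valid && victim[i].tag == blk) {
                line_t tmp = victim[i];   /* swap the victim entry with the */
                victim[i] = cache[index]; /* line it displaces              */
                cache[index] = tmp;
                return 2;
            }
        }

        /* Miss in both: the evicted block is recycled into the victim cache. */
        victim[0] = cache[index];         /* replacement policy kept trivial */
        cache[index] = fetch_from_next_level(blk);
        return 0;
    }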
12. Reducing Miss Rate
- Large block size
- Larger cache
- Higher associativity
- Way prediction and pseudo-associative cache
- Warm start in multi-tasking
- Compiler optimizations
13. Large Block Size
- Reduces compulsory misses
- Too large a block size: misses increase
- Miss penalty increases
14. Large Cache
- Reduces capacity misses
- Hit time increases
- Keep small L1 cache and large L2 cache
15. Higher Associativity
- Reduces conflict misses
- 8-way is almost like fully associative
- Hit time increases
16. Way Prediction and Pseudo-associative Cache
- Way prediction: low miss rate of a SA cache with the hit time of a DM cache
  - Only one tag is compared initially
  - Extra bits are kept for prediction
  - Hit time in case of misprediction is high
- Pseudo-associative (or column-associative) cache: gets the advantage of a SA cache in a DM cache (C sketch below)
  - Check sequentially in a pseudo-set
  - Fast hit and slow hit
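A minimal C sketch of a pseudo-associative (column-associative) lookup, assuming a direct-mapped array of 1024 lines with 32-byte blocks in which the second probe flips the most significant index bit; names and sizes are illustrative assumptions.

    #define SETS 1024

    typedef struct { int valid; unsigned tag; } line_t;   /* tag = full block address, data omitted */

    static line_t cache[SETS];

    static line_t fetch_from_next_level(unsigned blk)      /* stub for the slower next level */
    {
        line_t l = { 1, blk };
        return l;
    }

    /* Returns 1 on a fast hit, 2 on a slow hit (pseudo-set), 0 on a miss. */
    static int lookup(unsigned addr)
    {
        unsigned blk   = addr >> 5;                 /* 32-byte blocks              */
        unsigned index = blk & (SETS - 1);          /* primary (direct-mapped) set */

        if (cache[index].valid && cache[index].tag == blk)
            return 1;                               /* fast hit                    */

        unsigned alt = index ^ (SETS >> 1);         /* pseudo-set: flip MSB of index */
        if (cache[alt].valid && cache[alt].tag == blk) {
            line_t tmp = cache[index];              /* swap so the next access to  */
            cache[index] = cache[alt];              /* this block is a fast hit    */
            cache[alt] = tmp;
            return 2;                               /* slow hit                    */
        }

        cache[index] = fetch_from_next_level(blk);  /* miss: fill the primary set  */
        return 0;
    }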
17. Warm Start in Multi-tasking
- Cold start
- process starts with empty cache
- blocks of previous process invalidated
- Warm start
- some blocks from previous activation are still
available
18. Compiler Optimizations
- Loop interchange
  - Improve spatial locality by scanning arrays row-wise (C sketch below)
- Blocking
  - Improve temporal and spatial locality
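A minimal C sketch of loop interchange; the array name and sizes are illustrative. C stores arrays row-major, so the interchanged version walks memory with unit stride and much better spatial locality.

    #define ROWS 1024
    #define COLS 1024

    static double x[ROWS][COLS];

    static void column_wise_scan(void)     /* before: inner loop strides by COLS elements */
    {
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                x[i][j] = 2.0 * x[i][j];
    }

    static void row_wise_scan(void)        /* after interchange: unit-stride inner loop */
    {
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                x[i][j] = 2.0 * x[i][j];
    }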
19. Improving Locality
Matrix Multiplication example
20. Cache Organization for the Example
- Cache line (or block) = 4 matrix elements.
- Matrices are stored row-wise.
- Cache can't accommodate a full row/column. (In other words, L, M and N are so large w.r.t. the cache size that after an iteration along any of the three indices, when an element is accessed again, it results in a miss.)
- Ignore misses due to conflicts between matrices (as if there were a separate cache for each matrix).
21. Matrix Multiplication Code I
- for (i = 0; i < L; i++)
-   for (j = 0; j < M; j++)
-     for (k = 0; k < N; k++)
-       c[i][j] += A[i][k] * B[k][j];
               C        A        B
  accesses     LM       LMN      LMN
  misses       LM/4     LMN/4    LMN
- Total misses = LM(5N + 1)/4
22. Matrix Multiplication Code II
- for (k = 0; k < N; k++)
-   for (i = 0; i < L; i++)
-     for (j = 0; j < M; j++)
-       c[i][j] += A[i][k] * B[k][j];
               C        A        B
  accesses     LMN      LN       LMN
  misses       LMN/4    LN       LMN/4
- Total misses = LN(2M + 4)/4
23. Matrix Multiplication Code III
- for (i = 0; i < L; i++)
-   for (k = 0; k < N; k++)
-     for (j = 0; j < M; j++)
-       c[i][j] += A[i][k] * B[k][j];
               C        A        B
  accesses     LMN      LN       LMN
  misses       LMN/4    LN/4     LMN/4
- Total misses = LN(2M + 1)/4
24. Blocking
[Diagram: blocked matrix multiplication with 5 nested loops and blocking factor b; the outer loops step over block indices jj and kk, the inner loops over i, j, k within the current blocks of C, A and B] (C sketch below)
               C         A         B
  accesses     LMN/b     LMN/b     LMN
  misses       LMN/4b    LMN/4b    MN/4
- Total misses = MN(2L/b + 1)/4
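A runnable C sketch of blocked (tiled) matrix multiplication with blocking factor b, corresponding to the 5-loop structure above; the matrix sizes and the value of b are illustrative assumptions.

    #define L 256
    #define M 256
    #define N 256
    #define b 32                              /* blocking factor */

    static double A[L][N], B[N][M], C[L][M];

    static void blocked_matmul(void)
    {
        for (int kk = 0; kk < N; kk += b)             /* block index along k */
            for (int jj = 0; jj < M; jj += b)         /* block index along j */
                for (int i = 0; i < L; i++)
                    for (int k = kk; k < kk + b && k < N; k++) {
                        double a = A[i][k];           /* reused across the j loop */
                        for (int j = jj; j < jj + b && j < M; j++)
                            C[i][j] += a * B[k][j];
                    }
    }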
25. Loop Blocking
- for (k = 0; k < N; k += 4)
-   for (i = 0; i < L; i++)
-     for (j = 0; j < M; j++)
-       c[i][j] += A[i][k]   * B[k][j]
-                + A[i][k+1] * B[k+1][j]
-                + A[i][k+2] * B[k+2][j]
-                + A[i][k+3] * B[k+3][j];
               C         A        B
  accesses     LMN/4     LN       LMN
  misses       LMN/16    LN/4     LMN/4
- Total misses = LN(5M/4 + 1)/4
26. Reducing Miss Penalty and Miss Rate
- Non-blocking cache
- Hardware prefetching
- Compiler controlled prefetching
27. Non-blocking Cache
- In OOO processor
- Hit under a miss
- complexity of cache controller increases
- Hit under multiple misses or miss under a miss
- memory should be able to handle multiple misses
28. Hardware Prefetching
- Prefetch items before they are requested
- both data and instructions
- What and when to prefetch?
- fetch two blocks on a miss (requested + next)
- Where to keep prefetched information?
- in cache
- in a separate buffer (most common case)
29. Prefetch Buffer / Stream Buffer
[Diagram: data from memory fills the cache, with a prefetch (stream) buffer alongside the cache on the path to the processor]
30. Hardware Prefetching: Stream Buffers
- Jouppi's experiment (1990)
  - A single instruction stream buffer catches 15% to 25% of the misses from a 4 KB direct-mapped instruction cache with 16-byte blocks
  - A 4-block buffer catches 50%, a 16-block buffer 72%
  - A single data stream buffer catches 25% of the misses from a 4 KB direct-mapped cache
  - 4 data stream buffers (each prefetching at a different address) catch 43%
31. HW Prefetching: UltraSPARC III Example
- 64 KB data cache, 36.9 misses per 1000 instructions
- 22% of instructions make a data reference
- hit time = 1 cycle, miss penalty = 15 cycles
- prefetch hit rate = 20%
- 1 cycle to get data from the prefetch buffer
- What size of cache will give the same performance?
  - miss rate = 36.9 / 220 = 16.7%
  - avg mem access time = 1 + (0.167 × 0.2 × 1) + (0.167 × 0.8 × 15) = 3.046
  - effective miss rate = (3.046 − 1) / 15 = 13.6% ⇒ 256 KB cache
32. Compiler-Controlled Prefetching
- Register prefetch / cache prefetch
- Faulting / non-faulting (non-binding)
- Semantically invisible (no change in registers or cache contents)
- Makes sense if the processor doesn't stall while prefetching (non-blocking cache)
- Overhead of the prefetch instructions should not exceed the benefit (C sketch below)
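A minimal C sketch of software prefetching using the GCC/Clang __builtin_prefetch intrinsic; the array, its size, and the prefetch distance of 8 iterations are illustrative assumptions.

    #define SIZE 100000

    static double v[SIZE];

    static double sum_with_prefetch(void)
    {
        double s = 0.0;
        for (int i = 0; i < SIZE; i++) {
            if (i + 8 < SIZE)
                __builtin_prefetch(&v[i + 8], 0, 1);  /* non-faulting hint: read access,
                                                         low temporal locality */
            s += v[i];
        }
        return s;
    }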
33. SW Prefetch Example
- 8 KB direct-mapped, write-back data cache with 16-byte blocks
- a is 3 × 100, b is 101 × 3
- for (i = 0; i < 3; i++)
-   for (j = 0; j < 100; j++)
-     a[i][j] = b[j][0] * b[j+1][0];
- each array element is 8 bytes
- misses in array a: 3 × 100 / 2 = 150
- misses in array b: 101
- total misses = 251
34. SW Prefetch Example (contd.)
- Suppose we need to prefetch 7 iterations in advance
- for (j = 0; j < 100; j++) {
-   prefetch(b[j+7][0]);
-   prefetch(a[0][j+7]);
-   a[0][j] = b[j][0] * b[j+1][0];
- }
- for (i = 1; i < 3; i++)
-   for (j = 0; j < 100; j++) {
-     prefetch(a[i][j+7]);
-     a[i][j] = b[j][0] * b[j+1][0];
-   }
- misses in the first loop: 7 (for b[0..6][0]) + 4 (for a[0][0..6])
- misses in the second loop: 4 (for a[1][0..6]) + 4 (for a[2][0..6])
- total misses = 19, total prefetches = 400
35. SW Prefetch Example (contd.)
- Performance improvement?
- Assume no capacity and conflict misses; prefetches overlap with each other and with misses
- Original loop: 7 cycles per iteration; prefetch loops: 9 and 8 cycles per iteration
- Miss penalty = 100 cycles
- Original loop: 300 × 7 + 251 × 100 = 27,200 cycles
- 1st prefetch loop: 100 × 9 + 11 × 100 = 2,000 cycles
- 2nd prefetch loop: 200 × 8 + 8 × 100 = 2,400 cycles
- Speedup = 27,200 / (2,000 + 2,400) = 6.2
36. Reducing Hit Time
- Small and simple caches
- Avoid time loss in address translation
- Pipelined cache access
- Trace caches
37. Small and Simple Caches
- Small size ⇒ faster access
- Small size ⇒ fits on the chip, lower delay
- Simple (direct mapped) ⇒ lower delay
- Second level tags may be kept on chip
38. Cache Access Time Estimates Using CACTI
- 0.8 micron technology, 1 R/W port, 32-bit address, 64-bit output, 32-byte block
39. Avoid Time Loss in Address Translation
- Virtually indexed, physically tagged cache
  - simple and effective approach
  - possible only if the cache is not too large (worked constraint below)
- Virtually addressed cache
  - protection?
  - multiple processes?
  - aliasing?
  - I/O?
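As a worked illustration of the size constraint (assumed parameters, not from the slides): the cache index plus block offset must fit within the page offset so that indexing can begin before translation completes. With 8 KB pages (13-bit page offset) and 32-byte blocks (5-bit offset), 8 index bits remain, i.e. 256 sets; a 2-way set-associative cache can then be at most 2 × 256 × 32 B = 16 KB and still be virtually indexed and physically tagged without aliasing problems.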
40. Cache Addressing
- Physical address
  - first convert the virtual address into a physical address, then access the cache
  - no time loss if the index field is available without address translation
- Virtual address
  - access the cache directly using the virtual address
41. Problems with a Virtually Addressed Cache
- Page-level protection?
  - copy the protection info from the TLB
- The same virtual address from two different processes needs to be distinguished
  - purge cache blocks on a context switch, or use PID tags along with the other address tags
- Aliasing (different virtual addresses from two processes pointing to the same physical address): inconsistency?
- I/O uses physical addresses
42. Multiple Processes in a Virtually Addressed Cache
- purge cache blocks on context switch
- use PID tags along with other address tags
43. Inconsistency in a Virtually Addressed Cache
- Hardware solution (Alpha 21264)
  - 64 KB cache, 2-way set associative, 8 KB pages
  - a block with a given offset within a page can map to 8 locations in the cache
  - check all 8 locations and invalidate duplicate entries
- Software solution (page coloring)
  - make the 18 LSBs of all aliases the same; this ensures that a direct-mapped cache ≤ 256 KB has no duplicates (derivation below)
  - i.e., 4 KB pages are mapped to 64 sets (or colors)
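To see why (a worked derivation under the slide's parameters): with 4 KB pages the page offset is 12 bits, so fixing the 18 LSBs fixes 18 − 12 = 6 bits of the page number, giving 2^6 = 64 page colors; a direct-mapped cache of up to 2^18 bytes = 256 KB is indexed entirely by those 18 bits, so aliases with matching 18 LSBs always map to the same cache block.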
44. Pipelined Cache Access
- Multi-cycle cache access, but pipelined
  - reduces the cycle time, but hit time is more than one cycle
  - Pentium 4 takes 4 cycles
- Greater penalty on branch misprediction
- More clock cycles between the issue of a load and the use of its data
45. Trace Caches
- What maps to a cache block?
  - not statically determined
  - decided by the dynamic sequence of instructions, including predicted branches
- Used in Pentium 4 (NetBurst architecture)
- Block starting addresses are not word-size powers of 2
- Better utilization of cache space
- Downside: the same instruction may be stored multiple times