CSL718 : Memory Hierarchy

Transcript and Presenter's Notes

1
CSL718 Memory Hierarchy
  • Cache Performance Improvement
  • 23rd Feb, 2006

2
Performance
  • Average memory access time
    = Hit time + Mem stalls / access
    = Hit time + Miss rate × Miss penalty
  • Program execution time
    = IC × Cycle time × (CPIexec + Mem stalls / instr)
  • Mem stalls / instr
    = Miss rate × Miss penalty × Mem accesses / instr
  • Miss penalty in an OOO processor
    = Total miss latency - Overlapped miss latency
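
A quick worked illustration with assumed numbers (not from the slides): hit time = 1 cycle, miss rate = 5%, miss penalty = 50 cycles, 1.3 memory accesses per instruction, CPIexec = 1:

    Average memory access time = 1 + 0.05 × 50 = 3.5 cycles
    Mem stalls / instr = 0.05 × 50 × 1.3 = 3.25 cycles
    Execution time = IC × Cycle time × (1 + 3.25) = 4.25 × IC × Cycle time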

3
Performance Improvement
  • Reducing miss penalty
  • Reducing miss rate
  • Reducing miss penalty × miss rate
  • Reducing hit time

4
Reducing Miss Penalty
  • Multi level caches
  • Critical word first and early restart
  • Giving priority to read misses over writes
  • Merging write buffer
  • Victim caches

5
Multi Level Caches
  • Average memory access time
    = Hit timeL1 + Miss rateL1 × Miss penaltyL1
  • Miss penaltyL1
    = Hit timeL2 + Miss rateL2 × Miss penaltyL2
  • Multi level inclusion and multi level exclusion
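
A small worked example with assumed values (not from the slides): hit timeL1 = 1 cycle, miss rateL1 = 4%, hit timeL2 = 10 cycles, miss rateL2 = 20%, miss penaltyL2 = 100 cycles:

    Miss penaltyL1 = 10 + 0.20 × 100 = 30 cycles
    Average memory access time = 1 + 0.04 × 30 = 2.2 cycles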

6
Misses in Multilevel Cache
  • Local Miss rate
  • no. of misses / no. of requests, as seen at a
    level
  • Global Miss rate
  • no. of misses / no. of requests, on the whole
  • Solo Miss rate
  • miss rate if only this cache was present

7
Two level cache miss example
Consider four classes of accesses: A hits in L1 (and would hit in L2),
B misses in L1 but hits in L2, C hits in L1 (but would miss in L2),
and D misses in both L1 and L2.
Local miss (L1) = (B+D)/(A+B+C+D)
Local miss (L2) = D/(B+D)
Global miss = D/(A+B+C+D)
Solo miss (L2) = (C+D)/(A+B+C+D)
8
Critical Word First and Early Restart
  • Read policy: critical word first (fetch the requested word from
    memory first, then the rest of the block)
  • Load policy: early restart (let the processor continue as soon as
    the requested word arrives)
  • More effective when block size is large

9
Read Miss Priority Over Write
  • Provide write buffers
  • Processor writes into buffer and proceeds (for
    write through as well as write back)
  • On read miss
  • wait for buffer to be empty, or
  • check addresses in buffer for conflict
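
A minimal C sketch of the second option, checking the write buffer for an address conflict on a read miss. The data structure and names are illustrative assumptions, not part of the slides:

    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES 4
    #define BLOCK_BITS 5           /* assumed 32-byte blocks */

    typedef struct {
        bool     valid;
        uint32_t block_addr;       /* address >> BLOCK_BITS */
        uint8_t  data[1 << BLOCK_BITS];
    } WriteBufferEntry;

    static WriteBufferEntry write_buffer[WB_ENTRIES];

    /* On a read miss, scan the write buffer. If the missed block is still
     * queued for memory, the read must use (or wait for) that newer data
     * instead of fetching a stale copy from the next level. */
    bool read_miss_conflicts_with_write_buffer(uint32_t addr)
    {
        uint32_t block = addr >> BLOCK_BITS;
        for (int i = 0; i < WB_ENTRIES; i++)
            if (write_buffer[i].valid && write_buffer[i].block_addr == block)
                return true;   /* conflict: forward from buffer or drain it first */
        return false;          /* safe to send the read miss to memory immediately */
    }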

10
Merging Write Buffer
  • Merge writes belonging to same block in case of
    write through

11
Victim Cache (proposed by Jouppi)
  • Evicted blocks are recycled
  • Much faster than getting a block from the next
    level
  • Size: 1 to 5 blocks
  • A significant fraction of misses may be found in
    victim cache

[Diagram: the victim cache sits beside the cache, between the processor
and memory; blocks evicted from the cache are placed in the victim cache,
and hits there are returned to the processor.]
12
Reducing Miss Rate
  • Large block size
  • Larger cache
  • Higher associativity
  • Way prediction and pseudo-associative cache
  • Warm start in multi-tasking
  • Compiler optimizations

13
Large Block Size
  • Reduces compulsory misses
  • Too large block size - misses increase
  • Miss Penalty increases

14
Large Cache
  • Reduces capacity misses
  • Hit time increases
  • Keep small L1 cache and large L2 cache

15
Higher Associativity
  • Reduces conflict misses
  • 8-way is almost like fully associative
  • Hit time increases

16
Way Prediction and Pseudo-associative Cache
  • Way prediction: low miss rate of a set associative (SA) cache with
    the hit time of a direct mapped (DM) cache
  • Only one tag is compared initially
  • Extra bits are kept for prediction
  • Hit time in case of mis-prediction is high
  • Pseudo-associative or column associative cache: gets the advantage
    of an SA cache within a DM cache
  • Check sequentially within a pseudo-set
  • Fast hit and slow hit

17
Warm Start in Multi-tasking
  • Cold start
  • process starts with empty cache
  • blocks of previous process invalidated
  • Warm start
  • some blocks from previous activation are still
    available

18
Compiler optimizations
  • Loop interchange
  • Improve spatial locality by scanning arrays
    row-wise (see the sketch below)
  • Blocking
  • Improve temporal and spatial locality
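
A minimal C sketch of loop interchange, assuming a large row-major array; the array name and sizes are illustrative, not from the slides:

    #define ROWS 1000
    #define COLS 1000

    /* Loop interchange: same computation, different traversal order. */
    void scale_interchanged(double x[ROWS][COLS])
    {
        /* Before interchange (column-wise inner loop): each access to
         * x[i][j] jumps COLS elements ahead, so spatial locality is poor.
         *
         *   for (int j = 0; j < COLS; j++)
         *       for (int i = 0; i < ROWS; i++)
         *           x[i][j] = 2 * x[i][j];
         */

        /* After interchange (row-wise inner loop): consecutive iterations
         * touch consecutive elements, so every word of a fetched block is used. */
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                x[i][j] = 2 * x[i][j];
    }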

19
Improving Locality
Matrix Multiplication example
20
Cache Organization for the example
  • Cache line (or block) = 4 matrix elements.
  • Matrices are stored row-wise.
  • Cache can't accommodate a full row/column. (In
    other words, L, M and N are so large w.r.t. the
    cache size that after an iteration along any of
    the three indices, when an element is accessed
    again, it results in a miss.)
  • Ignore misses due to conflict between matrices
    (as if there were a separate cache for each
    matrix).

21
Matrix Multiplication Code I
  • for (i = 0; i < L; i++)
  •   for (j = 0; j < M; j++)
  •     for (k = 0; k < N; k++)
  •       c[i][j] += A[i][k] * B[k][j];
  • accesses: C = LM, A = LMN, B = LMN
  • misses:   C = LM/4, A = LMN/4, B = LMN
  • Total misses = LM(5N+1)/4

22
Matrix Multiplication Code II
  • for (k = 0; k < N; k++)
  •   for (i = 0; i < L; i++)
  •     for (j = 0; j < M; j++)
  •       c[i][j] += A[i][k] * B[k][j];
  • accesses: C = LMN, A = LN, B = LMN
  • misses:   C = LMN/4, A = LN, B = LMN/4
  • Total misses = LN(2M+4)/4

23
Matrix Multiplication Code III
  • for (i = 0; i < L; i++)
  •   for (k = 0; k < N; k++)
  •     for (j = 0; j < M; j++)
  •       c[i][j] += A[i][k] * B[k][j];
  • accesses: C = LMN, A = LN, B = LMN
  • misses:   C = LMN/4, A = LN/4, B = LMN/4
  • Total misses = LN(2M+1)/4

24
Blocking
5 nested loops (indices jj, kk, i, j, k), blocking factor = b
[Diagram: blocked access patterns of C, A and B; only b-wide blocks of
each matrix are touched inside the inner loops.]
  • accesses: C = LMN/b, A = LMN/b, B = LMN
  • misses:   C = LMN/4b, A = LMN/4b, B = MN/4
  • Total misses = MN(2L/b+1)/4
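
A hedged C sketch of the 5-loop blocked version the diagram refers to, assuming blocking over j and k with factor b (and, for brevity, that M and N are multiples of b):

    /* Blocked matrix multiply: C (L x M) += A (L x N) * B (N x M).
     * Only a b x b sub-block of B and b-wide strips of A and C are live
     * in the inner loops, so they can stay resident in the cache. */
    void matmul_blocked(int L, int M, int N, int b,
                        double A[L][N], double B[N][M], double C[L][M])
    {
        for (int jj = 0; jj < M; jj += b)
            for (int kk = 0; kk < N; kk += b)
                for (int i = 0; i < L; i++)
                    for (int j = jj; j < jj + b; j++) {
                        double r = 0.0;
                        for (int k = kk; k < kk + b; k++)
                            r += A[i][k] * B[k][j];
                        C[i][j] += r;
                    }
    }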

25
Loop Blocking
  • for (k = 0; k < N; k += 4)
  •   for (i = 0; i < L; i++)
  •     for (j = 0; j < M; j++)
  •       c[i][j] += A[i][k]   * B[k][j]
  •                + A[i][k+1] * B[k+1][j]
  •                + A[i][k+2] * B[k+2][j]
  •                + A[i][k+3] * B[k+3][j];
  • accesses: C = LMN/4, A = LN, B = LMN
  • misses:   C = LMN/16, A = LN/4, B = LMN/4
  • Total misses = LN(5M/4+1)/4

26
Reducing Miss Penalty × Miss Rate
  • Non-blocking cache
  • Hardware prefetching
  • Compiler controlled prefetching

27
Non-blocking Cache
  • In OOO processor
  • Hit under a miss
  • complexity of cache controller increases
  • Hit under multiple misses or miss under a miss
  • memory should be able to handle multiple misses

28
Hardware Prefetching
  • Prefetch items before they are requested
  • both data and instructions
  • What and when to prefetch?
  • fetch two blocks on a miss (requested + next)
  • Where to keep prefetched information?
  • in cache
  • in a separate buffer (most common case)

29
Prefetch Buffer/Stream Buffer
[Diagram: the prefetch/stream buffer sits between the cache and memory;
prefetched blocks flow from memory into the buffer, and the cache
supplies the processor.]
30
Hardware prefetching Stream buffers
  • Jouppi's experiment (1990)
  • a single instruction stream buffer catches 15% to
    25% of misses from a 4 KB direct mapped instruction
    cache with 16 byte blocks
  • 4-block buffer: 50%, 16-block buffer: 72%
  • a single data stream buffer catches 25% of misses
    from a 4 KB direct mapped cache
  • 4 data stream buffers (each prefetching at a
    different address): 43%

31
HW prefetching UltraSPARC III example
  • 64 KB data cache, 36.9 misses per 1000
    instructions
  • 22% of instructions make a data reference
  • hit time = 1, miss penalty = 15
  • prefetch hit rate = 20%
  • 1 cycle to get data from prefetch buffer
  • What size of cache will give same performance?
  • miss rate = 36.9/220 = 16.7%
  • av mem access time = 1 + (.167 × .2 × 1) + (.167 × .8 × 15) = 3.046
  • effective miss rate = (3.046 - 1)/15 = 13.6% => 256 KB
    cache

32
Compiler Controlled Prefetching
  • Register prefetch / Cache prefetch
  • Faulting / non-faulting (non-binding)
  • Semantically invisible (no change in registers or
    cache contents)
  • Makes sense if processor doesn't stall while
    prefetching (non-blocking cache)
  • Overhead of prefetch instruction should not
    exceed the benefit

33
SW Prefetch Example
  • 8 KB direct mapped, write back data cache with 16
    byte blocks
  • a is 3 × 100, b is 101 × 3
  • for (i = 0; i < 3; i++)
  •   for (j = 0; j < 100; j++)
  •     a[i][j] = b[j][0] * b[j+1][0];
  • each array element is 8 bytes
  • misses in array a = 3 × 100 / 2 = 150
  • misses in array b = 101
  • total misses = 251

34
SW Prefetch Example contd.
  • Suppose we need to prefetch 7 iterations in
    advance
  • for (j = 0; j < 100; j++) {
  •   prefetch(b[j+7][0]);
  •   prefetch(a[0][j+7]);
  •   a[0][j] = b[j][0] * b[j+1][0];
  • }
  • for (i = 1; i < 3; i++)
  •   for (j = 0; j < 100; j++) {
  •     prefetch(a[i][j+7]);
  •     a[i][j] = b[j][0] * b[j+1][0];
  •   }
  • misses in first loop = 7 (for b[0..6][0]) + 4
    (for a[0][0..6])
  • misses in second loop = 4 (for a[1][0..6]) + 4
    (for a[2][0..6])
  • total misses = 19, total prefetches = 400
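
For reference, a compilable C version of the loops above, using GCC/Clang's __builtin_prefetch as a stand-in for the slide's non-binding prefetch; the intrinsic choice is an assumption, not part of the slides:

    void compute_with_prefetch(double a[3][100], double b[101][3])
    {
        for (int j = 0; j < 100; j++) {
            /* prefetch 7 iterations ahead; prefetches that run past the end
             * of the arrays are assumed harmless (non-faulting, non-binding) */
            __builtin_prefetch(&b[j + 7][0]);
            __builtin_prefetch(&a[0][j + 7]);
            a[0][j] = b[j][0] * b[j + 1][0];
        }
        for (int i = 1; i < 3; i++)
            for (int j = 0; j < 100; j++) {
                __builtin_prefetch(&a[i][j + 7]);  /* b is already resident after the first loop */
                a[i][j] = b[j][0] * b[j + 1][0];
            }
    }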

35
SW Prefetch Example contd.
  • Performance improvement?
  • Assume no capacity and conflict misses;
    prefetches overlap with each other and with
    misses
  • Original loop: 7 cycles/iteration; prefetch loops: 9 and 8
    cycles/iteration
  • Miss penalty = 100 cycles
  • Original loop: 300 × 7 + 251 × 100 = 27,200 cycles
  • 1st prefetch loop: 100 × 9 + 11 × 100 = 2,000 cycles
  • 2nd prefetch loop: 200 × 8 + 8 × 100 = 2,400 cycles
  • Speedup = 27,200/(2,000 + 2,400) = 6.2

36
Reducing Hit Time
  • Small and simple caches
  • Avoid time loss in address translation
  • Pipelined cache access
  • Trace caches

37
Small and Simple Caches
  • Small size => faster access
  • Small size => fits on the chip, lower delay
  • Simple (direct mapped) => lower delay
  • Second level tags may be kept on chip

38
Cache access time estimates using CACTI
(0.8 micron technology, 1 R/W port, 32-bit address,
64-bit output, 32 B block)
39
Avoid time loss in addr translation
  • Virtually indexed, physically tagged cache
  • simple and effective approach
  • possible only if cache is not too large
  • Virtually addressed cache
  • protection?
  • multiple processes?
  • aliasing?
  • I/O?

40
Cache Addressing
  • Physical Address
  • first convert virtual address into physical
    address, then access cache
  • no time loss if index field available without
    address translation
  • Virtual Address
  • access cache directly using the virtual address

41
Problems with virtually addressed cache
  • page level protection?
  • copy protection info from TLB
  • same virtual address from two different processes
    needs to be distinguished
  • purge cache blocks on context switch or use PID
    tags along with other address tags
  • aliasing (different virtual addresses from two
    processes pointing to same physical address)
    inconsistency?
  • I/O uses physical addresses

42
Multi processes in virtually addr cache
  • purge cache blocks on context switch
  • use PID tags along with other address tags

43
Inconsistency in virtually addr cache
  • Hardware solution (Alpha 21264)
  • 64 KB cache, 2-way set associative, 8 KB page
  • a block with a given offset in a page can map to
    8 locations in cache
  • check all 8 locations, invalidate duplicate
    entries
  • Software solution (page coloring)
  • make the 18 LSBs of all aliases the same; this ensures that a
    direct mapped cache of size <= 256 KB has no duplicates
  • i.e., 4 KB pages are mapped to 64 sets (or colors)
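
A small C sketch of the page-coloring constraint described above; the helper names are illustrative, and the constants follow from 4 KB pages and a 256 KB direct mapped cache:

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_BITS   12            /* 4 KB pages */
    #define CACHE_BITS  18            /* 256 KB direct mapped cache */
    #define COLOR_MASK  ((1u << (CACHE_BITS - PAGE_BITS)) - 1)  /* 6 bits -> 64 colors */

    /* The "color" of a page is the part of the cache index that lies above
     * the page offset. If the OS gives a virtual page a physical frame of
     * the same color, the low 18 bits of the virtual and physical addresses
     * agree, so all aliases index the same cache location. */
    static unsigned page_color(uint64_t addr)
    {
        return (addr >> PAGE_BITS) & COLOR_MASK;
    }

    static bool coloring_ok(uint64_t vaddr, uint64_t paddr)
    {
        return page_color(vaddr) == page_color(paddr);
    }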

44
Pipelined Cache Access
  • Multi-cycle cache access but pipelined
  • reduces cycle time but hit time is more than one
    cycle
  • Pentium 4 takes 4 cycles
  • greater penalty on branch misprediction
  • more clock cycles between issue of load and use
    of data

45
Trace Caches
  • what maps to a cache block?
  • not statically determined
  • decided by the dynamic sequence of instructions,
    including predicted branches
  • Used in Pentium 4 (NetBurst architecture)
  • starting addresses need not be power-of-2 multiples of the word size
  • Better utilization of cache space
  • downside same instruction may be stored
    multiple times