Lecture 16: Large Cache Innovations - PowerPoint PPT Presentation

About This Presentation
Title:

Lecture 16: Large Cache Innovations

Slides: 21
Provided by: rajeevbala
Learn more at: https://my.eng.utah.edu

Transcript and Presenter's Notes


1
Lecture 16: Large Cache Innovations
  • Today: large cache design and other cache
    innovations
  • Midterm scores
    • 91-80: 17 students
    • 79-75: 14 students
    • 74-68: 8 students
    • 63-54: 5 students
  • Q1 (HM), Q2 (bpred), Q3 (stalls), Q7 (loops):
    mostly correct
  • Q4 (ooo): 50% correct; many didn't stall
    renaming
  • Q5 (multi-core): less than a handful got 8
    points
  • Q6 (memdep): less than a handful got part (b)
    right and correctly articulated the effect on
    power/energy

2
Shared Vs. Private Caches in Multi-Core
  • Advantages of a shared cache
    • Space is dynamically allocated among cores
    • No wastage of space because of replication
    • Potentially faster cache coherence (and easier
      to locate data on a miss)
  • Advantages of a private cache
    • Small L2 → faster access time
    • Private bus to L2 → less contention

3
UCA and NUCA
  • The small-sized caches so far have all been
    uniform cache access (UCA): the latency for any
    access is a constant, no matter where data is
    found
  • For a large multi-megabyte cache, it is
    expensive to limit access time by the worst-case
    delay; hence, non-uniform cache architecture
    (NUCA)

4
Large NUCA
  • Issues to be addressed for Non-Uniform Cache
    Access:
    • Mapping
    • Migration
    • Search
    • Replication

5
Static and Dynamic NUCA
  • Static NUCA (S-NUCA)
    • The address index bits determine where the
      block is placed
    • Page coloring can help here as well to improve
      locality
  • Dynamic NUCA (D-NUCA)
    • Blocks are allowed to move between banks
    • The block can be anywhere: need some search
      mechanism
    • Each core can maintain a partial tag structure
      so it has an idea of where the data might be
      (complex!)
    • Every possible bank is looked up and the search
      propagates (either in series or in parallel)
      (complex!)
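
The S-NUCA placement rule can be sketched in a few lines of C. This is only an illustration: the 64-byte block size, the 16 banks, the choice of bank-select bits directly above the block offset, and the name `snuca_bank` are all assumptions made here, not parameters given in the lecture.

```c
#include <stdint.h>

/* Assumed parameters (hypothetical, chosen only for illustration). */
#define BLOCK_OFFSET_BITS 6u   /* 64-byte cache blocks */
#define BANK_BITS         4u   /* 16 banks */

/* S-NUCA: the bank is a fixed function of the address index bits,
 * so a request goes straight to one bank and no search is needed. */
static unsigned snuca_bank(uint64_t phys_addr)
{
    uint64_t block_number = phys_addr >> BLOCK_OFFSET_BITS;
    return (unsigned)(block_number & ((1u << BANK_BITS) - 1u));
}
```

With these assumed sizes, consecutive blocks are spread across consecutive banks and the mapping wraps around every 16 blocks. A D-NUCA design would not have such a function, because a block's bank can change over time.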

6
Example Organization
  • Data must be placed close to the center-of-gravity
    of requests
[Figure labels: "Latency 65 cyc" and "Latency 13-17 cyc"]
7
Examples: Frequency of Accesses
  • Dark → more accesses
  • OLTP (on-line transaction processing)
  • Ocean (scientific code)

8
[Figure labels: Core 0 through Core 7, each with L1 D and L1 I; Scalable Non-broadcast Interconnect; Shared L2 Cache and Directory State; L2 Cache Controller]
9
[Figure labels: Core 0 through Core 7, each with L1 D, L1 I, and L2; Scalable Non-broadcast Interconnect; Replicated Tags of all L2 and L1 Caches; Controller that handles L2 misses; Off-chip access]
10
A single tile composed of a core, L1 caches,
and a bank (slice) of the shared L2 cache
The cache controller forwards address requests
to the appropriate L2 bank and handles
coherence operations
Memory Controller for off-chip access
[Figure labels: Core 0 through Core 7, each with L1 D, L1 I, and an L2 bank]
11
Memory controller for off-chip access
Top die with L2 cache banks. Each core
has low-latency access to one L2 bank
Bottom die with cores and L1 caches
[Figure labels: four L2 banks on the top die; Core 0 through Core 3, each with L1 D and L1 I, on the bottom die]
12
The Best of S-NUCA and D-NUCA
  • Employ S-NUCA (no search required) and use page
    coloring to influence the block's cache index
    bits and hence the bank that the block gets
    placed in
  • Page migration enables block movement just as in
    D-NUCA
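
One way to picture page coloring for bank selection is the sketch below. It assumes 4 KB pages, 16 banks, and bank-select bits taken from the low bits of the physical frame number; the names `frame_color`, `bank_of`, and `pick_frame` are hypothetical and do not come from the lecture or from any real OS.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_OFFSET_BITS 12u  /* assumed 4 KB pages */
#define NUM_BANKS        16u  /* assumed power-of-two bank count */

/* Under the assumed layout, a frame's "color" is the bank all of its
 * blocks map to. */
static unsigned frame_color(uint64_t frame_number)
{
    return (unsigned)(frame_number % NUM_BANKS);
}

/* Bank holding a physical address: still a fixed S-NUCA mapping. */
static unsigned bank_of(uint64_t phys_addr)
{
    return frame_color(phys_addr >> PAGE_OFFSET_BITS);
}

/* OS-side choice: return the index of the first free frame whose color
 * matches the wanted bank, or -1 if none is available. */
static int pick_frame(const uint64_t *free_frames, size_t n,
                      unsigned wanted_bank)
{
    for (size_t i = 0; i < n; i++)
        if (frame_color(free_frames[i]) == wanted_bank)
            return (int)i;
    return -1;
}
```

In this picture, page migration amounts to calling something like `pick_frame` with a different target bank, copying the page into the new frame, and updating the page table, so data moves between banks without any hardware search.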

13
Prefetching
  • Hardware prefetching can be employed for any of
    the cache levels
  • It can introduce cache pollution: prefetched
    data is often placed in a separate prefetch
    buffer to avoid pollution; this buffer must be
    looked up in parallel with the cache access
  • Aggressive prefetching increases coverage, but
    leads to a reduction in accuracy → wasted memory
    bandwidth
  • Prefetches must be timely: they must be issued
    sufficiently in advance to hide the latency, but
    not too early (to avoid pollution and eviction
    before use)
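
The coverage and accuracy trade-off can be made concrete with the usual definitions of the two metrics. The sketch below assumes a simulator that collects three counters; the struct and function names are hypothetical, and the baseline miss count is approximated as useful prefetches plus remaining demand misses, which ignores extra misses caused by pollution.

```c
/* Counters a simulator might collect (hypothetical names). */
struct prefetch_stats {
    unsigned long issued;        /* prefetches sent to memory */
    unsigned long useful;        /* prefetched lines used before eviction */
    unsigned long demand_misses; /* misses that still occurred */
};

/* Accuracy: fraction of issued prefetches that turned out to be useful. */
static double prefetch_accuracy(const struct prefetch_stats *s)
{
    return s->issued ? (double)s->useful / (double)s->issued : 0.0;
}

/* Coverage: fraction of would-be misses removed by prefetching.
 * Assumption: baseline misses = useful prefetches + remaining misses. */
static double prefetch_coverage(const struct prefetch_stats *s)
{
    unsigned long baseline = s->useful + s->demand_misses;
    return baseline ? (double)s->useful / (double)baseline : 0.0;
}
```

For example, 200 issued prefetches of which 50 were useful, with 50 demand misses remaining, gives an accuracy of 0.25 and a coverage of 0.5: half the misses are hidden, but three quarters of the prefetch bandwidth is wasted.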

14
Stream Buffers
  • Simplest form of prefetch: on every miss, bring
    in multiple cache lines
  • When you read the top of the queue, bring in the
    next line

[Figure labels: L1; stream buffer holding sequential lines]
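
A single stream buffer can be modeled as a small FIFO of line addresses. The sketch below is a simplified model under stated assumptions: a depth of 4 entries, only the head entry is compared on a miss, and the buffer is flushed and restarted whenever the head does not match. The type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SB_DEPTH 4  /* assumed number of buffered lines */

struct stream_buffer {
    uint64_t line[SB_DEPTH]; /* line addresses; line[head] is the oldest */
    int      head;
    bool     valid;
    uint64_t next_line;      /* next sequential line to prefetch */
};

/* Called on an L1 miss to `miss_line`.
 * Returns true if the head of the stream buffer supplies the line. */
static bool sb_on_l1_miss(struct stream_buffer *sb, uint64_t miss_line)
{
    if (sb->valid && sb->line[sb->head] == miss_line) {
        /* Head hit: the line moves to L1 and the freed slot is refilled
         * with the next sequential line. */
        sb->line[sb->head] = sb->next_line++;
        sb->head = (sb->head + 1) % SB_DEPTH;
        return true;
    }
    /* Head miss: restart the stream just past the missing line. */
    for (int i = 0; i < SB_DEPTH; i++)
        sb->line[i] = miss_line + 1 + (uint64_t)i;
    sb->head = 0;
    sb->next_line = miss_line + 1 + SB_DEPTH;
    sb->valid = true;
    return false;
}
```

Starting from an empty buffer, misses to lines 100, 101, 102 produce one real miss followed by two stream-buffer hits; skipping ahead to line 104 breaks the sequence and restarts the stream.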
15
Stride-Based Prefetching
  • For each load, keep track of the last address
    accessed by the load and a possibly consistent
    stride
  • FSM detects consistent stride and issues
    prefetches

[Figure: table indexed by PC with fields tag, prev_addr, stride, state. FSM states: init, trans, steady, no-pred. Transition labels: correct, incorrect, incorrect (update stride)]
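
The per-load table entry and its state machine can be sketched as follows. The field and state names follow the slide; the exact transitions are an assumption modeled on the classic reference prediction table design (three "incorrect" edges update the stride, the one leaving steady does not), and issuing prefetches only in the steady state is a conservative choice made here for simplicity.

```c
#include <stdbool.h>
#include <stdint.h>

enum rpt_state { RPT_INIT, RPT_TRANS, RPT_STEADY, RPT_NOPRED };

struct rpt_entry {
    uint64_t       tag;        /* PC of the load */
    uint64_t       prev_addr;  /* last address accessed by the load */
    int64_t        stride;
    enum rpt_state state;
    bool           valid;
};

/* Update one entry for a load at `pc` that accesses `addr`.
 * Returns true and sets *prefetch_addr when a prefetch should be issued. */
static bool rpt_access(struct rpt_entry *e, uint64_t pc, uint64_t addr,
                       uint64_t *prefetch_addr)
{
    if (!e->valid || e->tag != pc) {        /* allocate a fresh entry */
        e->tag = pc; e->prev_addr = addr; e->stride = 0;
        e->state = RPT_INIT; e->valid = true;
        return false;
    }
    bool correct = (addr == e->prev_addr + (uint64_t)e->stride);
    int64_t new_stride = (int64_t)(addr - e->prev_addr);

    switch (e->state) {
    case RPT_INIT:
        if (correct) e->state = RPT_STEADY;
        else { e->stride = new_stride; e->state = RPT_TRANS; }
        break;
    case RPT_TRANS:
        if (correct) e->state = RPT_STEADY;
        else { e->stride = new_stride; e->state = RPT_NOPRED; }
        break;
    case RPT_STEADY:
        if (!correct) e->state = RPT_INIT;  /* stride is kept */
        break;
    case RPT_NOPRED:
        if (correct) e->state = RPT_TRANS;
        else e->stride = new_stride;
        break;
    }
    e->prev_addr = addr;
    if (e->state == RPT_STEADY) {
        *prefetch_addr = addr + (uint64_t)e->stride;
        return true;
    }
    return false;
}
```

For a load that touches addresses 1000, 1064, 1128, the entry learns a stride of 64 on the second access, reaches steady on the third, and requests a prefetch of address 1192.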
16
Compiler Optimizations
  • Loop interchange: loops can be re-ordered to
    exploit spatial locality

    for (j = 0; j < 100; j++)
      for (i = 0; i < 5000; i++)
        x[i][j] = 2 * x[i][j];

    is converted to

    for (i = 0; i < 5000; i++)
      for (j = 0; j < 100; j++)
        x[i][j] = 2 * x[i][j];

17
Exercise
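  • Re-organize data accesses so that a piece of
    data is used a number of times before moving on;
    in other words, artificially create temporal
    locality

    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++)
          r = r + y[i][k] * z[k][j];
        x[i][j] = r;
      }

    for (jj = 0; jj < N; jj += B)
      for (kk = 0; kk < N; kk += B)
        for (i = 0; i < N; i++)
          for (j = jj; j < min(jj+B, N); j++) {
            r = 0;
            for (k = kk; k < min(kk+B, N); k++)
              r = r + y[i][k] * z[k][j];
            x[i][j] = x[i][j] + r;
          }

[Figure: array access patterns, panels labeled y, z, x]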
19
Exercise
  • Original code could have 2N^3 + N^2 memory
    accesses, while the new version has
    2N^3/B + N^2
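
A runnable version of the two loop nests makes it easy to check that blocking changes only the order of accesses, not the result. The sketch below assumes square matrices stored row-major in flat `double` arrays; the function names and the `min_sz` helper are hypothetical. Note that the blocked version accumulates into `x`, so the caller must zero `x` first.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Unblocked: x = y * z for n-by-n matrices (row-major, flat arrays). */
static void matmul_naive(size_t n, const double *y, const double *z,
                         double *x)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double r = 0.0;
            for (size_t k = 0; k < n; k++)
                r += y[i * n + k] * z[k * n + j];
            x[i * n + j] = r;
        }
}

/* Blocked with block size b: each pass reuses a b-by-b piece of z.
 * x must be zero-initialized because partial sums are accumulated. */
static void matmul_blocked(size_t n, size_t b, const double *y,
                           const double *z, double *x)
{
    assert(b > 0);  /* a zero block size would never advance jj or kk */
    for (size_t jj = 0; jj < n; jj += b)
        for (size_t kk = 0; kk < n; kk += b)
            for (size_t i = 0; i < n; i++)
                for (size_t j = jj; j < min_sz(jj + b, n); j++) {
                    double r = 0.0;
                    for (size_t k = kk; k < min_sz(kk + b, n); k++)
                        r += y[i * n + k] * z[k * n + j];
                    x[i * n + j] += r;
                }
}
```

Calling both functions on the same inputs, with the output of `matmul_blocked` zeroed beforehand, should give matching results for any block size of at least 1, including sizes that do not divide `n` evenly. In practice the block size would be chosen so that the reused pieces of the arrays fit in the cache.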
