Lecture 7: Implementing Cache Coherence
1
Lecture 7: Implementing Cache Coherence
  • Topics: implementation details

2
Implementing Coherence Protocols
  • Correctness and performance are not the only
    metrics
  • Deadlock: a cycle of resource dependencies, where
    each process holds shared resources in a
    non-preemptible fashion
  • Livelock: similar to deadlock, but transactions
    continue in the system without each process making
    forward progress
  • Starvation: an extreme case of unfairness

3
Basic Implementation
  • Assume a single level of cache and atomic bus
    transactions
  • It is simpler to implement a processor-side cache
    controller that monitors requests from the processor
    and a bus-side cache controller that services the bus
  • Both controllers are constantly trying to read tags
  • tags can be duplicated (moderate area overhead; see
    the sketch after this slide)
  • unlike data, tags are rarely updated
  • tag updates stall the other controller
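
A minimal sketch of the duplicated-tag arrangement described on this
slide; the geometry and field names are illustrative, not from the
lecture. Each controller reads its own copy, so only the rare tag
updates have to stall the other side.

    #include <stdint.h>

    #define NUM_SETS 128            /* illustrative geometry */
    #define NUM_WAYS 2

    struct tag_entry {
        uint32_t tag;
        uint8_t  state;             /* e.g., MSI/MESI state bits */
    };

    /* The processor-side controller reads proc_tags on every access;
     * the bus-side (snoop) controller reads bus_tags on every bus
     * transaction. A tag update must be written to both copies,
     * briefly stalling the other controller. */
    struct dual_tags {
        struct tag_entry proc_tags[NUM_SETS][NUM_WAYS];
        struct tag_entry bus_tags[NUM_SETS][NUM_WAYS];
    };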

4
Reporting Snoop Results
  • Uniprocessor system: the initiator places an address
    on the bus, all devices monitor the address, one
    device acks by raising a wired-OR signal, and data is
    transferred
  • In a multiprocessor, memory has to wait for the snoop
    result before it chooses to respond; this needs 3
    wired-OR signals: (i) indicates that a cache has a
    copy, (ii) indicates that a cache has a modified copy,
    (iii) indicates that the snoop has not completed (see
    the sketch after this slide)
  • Ensuring timely snoops: the time to respond could be
    fixed or variable (with the third wired-OR signal), or
    the memory could track if a cache has a block in M
    state
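
A small sketch of the three wired-OR snoop lines listed above, written
as a plain struct; in hardware each field would be a single open-drain
line that any cache can assert and that memory samples before deciding
whether to respond.

    #include <stdbool.h>

    /* Snoop result reported to memory on every bus transaction. */
    struct snoop_result {
        bool shared;      /* some cache has a copy of the block         */
        bool modified;    /* some cache has the block in modified state */
        bool not_done;    /* at least one snoop has not completed yet   */
    };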

5
Non-Atomic State Transitions
  • Note that a cache controller's actions are not all
    atomic: tag look-up, bus arbitration, bus transaction,
    data/tag update
  • Consider this: block A is in shared state in P1 and
    P2; both issue a write; the bus controllers are ready
    to issue an upgrade request and try to acquire the
    bus; is there a problem?
  • The controller can keep track of additional
    intermediate states so it can react to bus traffic
    (e.g., S→M, I→M, I→S/E); see the sketch after this
    slide
  • Alternatively, eliminate the upgrade request: use the
    shared wire to suppress memory's response to an
    exclusive read
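
A minimal sketch, assuming an MSI-style protocol, of the transient
states a controller might track between handing a request to the bus
side and completing the transition; the state names are illustrative.
It also shows the reaction to the scenario above: if an invalidation
is snooped while an upgrade is pending, our copy is gone, so the
upgrade must become a full read-exclusive.

    /* Stable MSI states plus transient states for in-flight requests. */
    typedef enum {
        INVALID,
        SHARED,
        MODIFIED,
        S_TO_M,     /* upgrade requested, bus not yet acquired          */
        I_TO_M,     /* write miss outstanding                           */
        I_TO_S_E    /* read miss outstanding (will end in S or E)       */
    } line_state_t;

    /* React to a snooped invalidation (e.g., another cache's upgrade). */
    line_state_t on_snooped_invalidate(line_state_t s)
    {
        if (s == S_TO_M)
            return I_TO_M;   /* our S copy was invalidated: upgrade no
                                longer suffices, issue a read-exclusive */
        if (s == SHARED)
            return INVALID;
        return s;
    }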

6
Serialization
  • Write serialization is an important requirement for
    coherence and sequential consistency: writes must be
    seen by all processors in the same order
  • On a write, the processor hands the request to the
    cache controller and some time elapses before the bus
    transaction happens (when the external world sees the
    write)
  • If the writing processor continues its execution after
    handing the write to the controller, the same write
    order may not be seen by all processors; hence, the
    processor is not allowed to continue unless the write
    has completed

7
Livelock
  • Livelock can happen if the processor-cache handshake
    is not designed correctly
  • Before the processor can attempt the write, it must
    acquire the block in exclusive state
  • If all processors are writing to the same block, one
    of them acquires the block first; if another exclusive
    request is seen on the bus, the cache controller must
    wait for the processor to complete the write before
    releasing the block -- else, the processor's write
    will fail again because the block would be in invalid
    state (see the sketch after this slide)
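
A minimal sketch of the handshake described above: once the block has
been acquired in exclusive state for a pending store, incoming
exclusive requests are deferred until that store has been performed.
The structure and field names are illustrative.

    #include <stdbool.h>

    enum state { INVALID, SHARED, MODIFIED };

    struct cache_line {
        enum state state;
        bool       store_pending;  /* processor's write not yet done */
    };

    /* Called when another processor's exclusive request for this block
     * is snooped. Returning false means defer/retry: the block is not
     * given up until the local write completes, preventing livelock. */
    bool snoop_exclusive_request(struct cache_line *line)
    {
        if (line->state == MODIFIED && line->store_pending)
            return false;          /* hold the block; finish our write */

        line->state = INVALID;     /* normal invalidation */
        return true;
    }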

8
Atomic Instructions
  • A test&set instruction acquires the block in exclusive
    state and does not release the block until the read
    and write have completed
  • Should an LL bring the block in exclusive state to
    avoid bus traffic during the SC?
  • Note that for the SC to succeed, a bit associated with
    the cache block must be set (the bit is reset when a
    write to that block is observed or when the block is
    evicted); see the sketch after this slide
  • What happens if an instruction between the LL and SC
    causes the LL-SC block to always be replaced?
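
As a hedged illustration of how LL/SC is used, the C11 sketch below
expresses fetch-and-increment as a compare-exchange retry loop; on an
LL/SC machine the compiler typically lowers the compare-exchange to an
ll/sc pair, so a cleared reservation simply causes another trip around
the loop.

    #include <stdatomic.h>

    /* Atomic fetch-and-increment built from a CAS retry loop. If
     * another write (or an eviction) clears the reservation, the SC
     * fails and we retry with the refreshed value. */
    int fetch_and_increment(atomic_int *addr)
    {
        int old = atomic_load(addr);
        while (!atomic_compare_exchange_weak(addr, &old, old + 1)) {
            /* 'old' now holds the current value; try again. */
        }
        return old;
    }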

9
Multilevel Cache Hierarchies
  • Ideally, the snooping protocol employed for L2 must be
    duplicated for L1 -- redundant work because of blocks
    common to L1 and L2
  • Inclusion greatly simplifies the implementation

10
Maintaining Inclusion
  • Assuming equal block size, if L1 is 8KB 2-way and L2
    is 256KB 8-way, is the hierarchy inclusive? (assume
    that an L1 miss brings a block into L1 and L2)
  • Assuming equal block size, if L1 is 8KB direct-mapped
    and L2 is 256KB 8-way, is the hierarchy inclusive?
  • To maintain inclusion, L2 replacements must also evict
    relevant blocks in L1 (a worked set-count example
    follows this slide)
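
As a quick look at the geometry behind the questions above, the sketch
below (assuming 64-byte blocks, which the slide does not specify)
computes the number of sets in each cache: the L1 has 64 sets and the
L2 has 512, so blocks from eight different L2 sets can land in the
same L1 set.

    #include <stdio.h>

    int main(void)
    {
        const int block   = 64;                  /* bytes (assumed)     */
        const int l1_size = 8 * 1024,   l1_ways = 2;
        const int l2_size = 256 * 1024, l2_ways = 8;

        int l1_sets = l1_size / (block * l1_ways); /* 8KB/(64B*2)  = 64  */
        int l2_sets = l2_size / (block * l2_ways); /* 256KB/(64B*8)= 512 */

        printf("L1 sets: %d, L2 sets: %d\n", l1_sets, l2_sets);
        printf("L2 sets per L1 set: %d\n", l2_sets / l1_sets);
        return 0;
    }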

11
Intra-Hierarchy Protocol
  • Some coherence traffic needs to be propagated to L1;
    likewise, L1 write traffic needs to be propagated
    to L2
  • What is the best way to implement the above? More
    traffic? More state?
  • In general, external requests propagate upward from L3
    to L1 and processor requests percolate down from L1
    to L3
  • Dual tags are not as important, as the L2 can filter
    out bus transactions and the L1 can filter out
    processor requests

12
Split Transaction Bus
  • What would it take to implement the protocol correctly
    while assuming a split transaction bus?
  • Split transaction bus: a cache puts out a request,
    releases the bus (so others can use the bus), and
    receives its response much later
  • Assumptions (a sketch of a request table that enforces
    the first one follows this slide):
  • only one request per block can be outstanding
  • separate lines for addr (request) and data (response)
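
A minimal sketch of how a controller might enforce the
one-outstanding-request-per-block assumption: before arbitrating for
the request lines, it checks a small table of in-flight requests.
Field names and sizes are illustrative, not from the lecture.

    #include <stdbool.h>
    #include <stdint.h>

    #define TABLE_ENTRIES 8        /* illustrative capacity */

    struct request_entry {
        bool     valid;
        uint64_t block_addr;       /* block-aligned address */
        int      requester;        /* which cache issued it */
    };

    static struct request_entry request_table[TABLE_ENTRIES];

    /* Return false (defer/retry) if the block already has an
     * outstanding transaction or the table is full; otherwise record
     * the request and allow the cache to arbitrate for the bus. */
    bool try_issue_request(uint64_t block_addr, int requester)
    {
        for (int i = 0; i < TABLE_ENTRIES; i++)
            if (request_table[i].valid &&
                request_table[i].block_addr == block_addr)
                return false;      /* one request per block */

        for (int i = 0; i < TABLE_ENTRIES; i++)
            if (!request_table[i].valid) {
                request_table[i] = (struct request_entry){
                    true, block_addr, requester };
                return true;
            }
        return false;              /* table full */
    }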

13
Split Transaction Bus
[Figure: three processors, each with its own cache, attached to shared
 request (address) lines and response (data) lines]
14
Design Issues
  • When does the snoop complete? What if the snoop takes
    a long time?
  • What if the buffer in a processor/memory is full? When
    does the buffer release an entry? Are the buffers
    identical?
  • How does each processor ensure that a block does not
    have multiple outstanding requests?
  • What determines the write order: requests or
    responses?

15
Design Issues II
  • What happens if a processor is arbitrating for the bus
    and witnesses another bus transaction for the same
    address?
  • If the processor issues a read miss and there is
    already a matching read in the request table, can we
    reduce bus traffic?

16
Shared Cache Designs
  • There are benefits to sharing the first-level cache
    among many processors (for example, in a CMP)
  • no coherence protocol
  • low-cost communication between processors
  • better prefetching by processors
  • working set overlap allows the shared cache to be
    smaller than the combined size of private caches;
    improves utilization
  • Disadvantages
  • high contention for ports
  • longer hit latency (size and proximity)
  • more conflict misses

17
TLBs
  • Recall that a TLB caches virtual-to-physical page
    translations
  • While swapping a page out, can we have a problem in a
    multiprocessor system?
  • All matching entries in every processor's TLB must be
    removed
  • TLB shootdown: the initiating processor sends a
    special instruction to other TLBs asking them to
    invalidate a page table entry (see the sketch after
    this slide)
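
A minimal software-side sketch of the shootdown described above: the
initiator publishes the victim address, interrupts the other
processors, and waits for each of them to invalidate its local TLB
entry and acknowledge. send_ipi_to_all_others() and
tlb_invalidate_entry() are hypothetical stand-ins for hardware/OS
primitives, stubbed out so the sketch compiles.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Hypothetical primitives, stubbed for illustration. */
    static void send_ipi_to_all_others(void) { /* would raise an IPI */ }
    static void tlb_invalidate_entry(uintptr_t vaddr) { (void)vaddr; }

    static uintptr_t  victim_vaddr;   /* page being unmapped            */
    static atomic_int pending_acks;   /* processors yet to acknowledge  */

    /* Initiator: run before changing the page table / swapping out. */
    void tlb_shootdown(uintptr_t vaddr, int num_other_cpus)
    {
        victim_vaddr = vaddr;
        atomic_store(&pending_acks, num_other_cpus);
        send_ipi_to_all_others();
        while (atomic_load(&pending_acks) > 0)
            ;                         /* spin until all TLBs are clean  */
    }

    /* Handler run by every other processor when the interrupt arrives. */
    void tlb_shootdown_handler(void)
    {
        tlb_invalidate_entry(victim_vaddr);
        atomic_fetch_sub(&pending_acks, 1);   /* acknowledge */
    }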

18
Case Study: SGI Challenge
  • Supports 18 or 36 MIPS processors
  • Employs a 1.2 GB/s, 47.6 MHz system bus (Powerpath-2)
  • The bus has 256-bit-wide data and a 40-bit-wide
    address, plus 33 other signals (non-multiplexed)
  • Split transaction, supporting eight outstanding
    requests
  • Employs the MESI protocol by default; also supports
    update transactions

19
Processor Board
  • Each board has four processors (to reduce the number
    of slots on the bus from 36 to 9)
  • The A-chip has request tables, arbitration logic, etc.

[Figure: four MIPS processors, each with an L2 cache and a CC (cache
 controller) chip with its own tags, sharing one A-chip (address) and
 one D-chip (data) that connect the board to the bus]
20
Latencies
  • 75 ns for an L2 cache hit
  • 300 ns for a cache miss to percolate down to the
    A-chip
  • Additional 400 ns for the data to be delivered to the
    D-chips across the bus (includes 250 ns memory
    latency)
  • Another 300 ns for the data to reach the processor
  • Note that the system bus can accommodate 256 bits of
    data, while the CC-chip to processor interface can
    handle 64 bits at a time
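
Summing the miss components above, a request that goes all the way to
memory costs roughly 300 + 400 + 300 = 1000 ns, i.e. about 1 µs,
versus 75 ns for an L2 hit.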

21
Sun Enterprise 6000
  • Supports 30 UltraSPARC processors
  • 2.67 GB/s, 83.5 MHz Gigaplane system bus
  • Non-multiplexed bus with 256 bits of data, 41 bits of
    address, and 91 bits of control/error correction, etc.
  • Split transaction bus with up to 112 outstanding
    requests
  • Each node speculatively drives the bus (in parallel
    with arbitration)
  • L2 hits are 40 ns, memory access is 300 ns
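
As a consistency check, the quoted bandwidth follows from the data
path width and clock: 256 bits (32 bytes) per cycle at 83.5 MHz gives
32 B x 83.5 MHz ≈ 2.67 GB/s.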
