Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors

Slides: 25
Provided by: karins
Learn more at: https://microarch.org

Transcript and Presenter's Notes


1
Uncorq: Unconstrained Snoop Request Delivery in
Embedded-Ring Multiprocessors
http://iacoma.cs.uiuc.edu
2
Motivation
  • CMPs are ubiquitous
  • Shared memory with caches requires cache coherence
  • Traditional cache coherence solutions
      • shared bus-based: electrical, layout issues
      • directory-based: indirection, storage

3
Embedded-ring cache coherence (ISCA 2006)
  • Novel snoopy cache coherence for mid-sized
    machines
      • logical ring is embedded in network
      • control messages use ring
      • data messages use any path
  • Simple and inexpensive to implement
  • Snoop requests can have long latencies

4
Contributions
  • Propose invariant for transaction serialization
  • Propose performance enhancements
  • Uncorq: unconstrained snoop request delivery
      • reduces cache-to-cache transfer latency
  • Simple hardware data prefetching technique
      • reduces memory-to-cache transfer latency

5
Embedded-ring terminology
  • Snoopy, invalidate protocol
  • Single supplier protocol
  • Types of messages
      • control messages: snoop request, snoop response,
        combined snoop request + response
      • data


6
Ordering invariant
7
Transaction serialization
[Figure: coherence state transitions (S, I, M) showing the old value and the new value]
8
Serialization enforcement with embedded-ring
  • Logical unidirectional ring provides partial
    ordering
  • Distributed algorithm establishes global order
      • for same-address transactions
  • On simultaneous transactions to same address
      • one is declared the winner (first to reach
        supplier)
      • others have to retry
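
The winner rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Request` record, its `arrival` field (the time a request reaches the supplier), and the centralized `serialize` helper are all hypothetical. The protocol in the slides reaches the same decision with a distributed algorithm, not a central sort.

```python
from dataclasses import dataclass


@dataclass
class Request:
    node: str      # requesting node (hypothetical label)
    addr: int      # cache-line address
    arrival: int   # assumed time at which the request reaches the supplier


def serialize(requests):
    """Pick one winner per address: the request that reaches the supplier first.

    Every other request to the same address has to retry.
    """
    winners, retries = {}, []
    for req in sorted(requests, key=lambda r: r.arrival):
        if req.addr not in winners:
            winners[req.addr] = req.node   # first to reach the supplier wins
        else:
            retries.append(req.node)       # later same-address request retries
    return winners, retries


reqs = [Request("A", 0x40, 7), Request("B", 0x40, 3), Request("C", 0x80, 5)]
winners, retries = serialize(reqs)
print(winners, retries)
```

Here A and B race for the same address; B reaches the supplier first, so B wins and A retries, while C's request to a different address is unaffected.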

9
How to serialize transactions

No clear first transaction between requesters A and B
B's request reaches supplier S first
Ring guarantees responses are forwarded in the
order S performed snoop operations
A receives B's positive response before its own
A retries; resulting order: B → A
10
Enforcing transaction serialization
  • Node whose request arrives at supplier node
    first is the winner
  • What we need to enforce transaction
    serialization:

Ordering invariant: the order in which responses
travel the ring after leaving the supplier must
be the same as the order in which the supplier
processed their corresponding requests.
The loser node sees the other node's positive response
before its own.
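
As a rough illustration of the invariant, the sketch below checks that the response order on the ring matches the supplier's processing order, and shows how a requester could classify itself from the responses it observes. The function names and the `(requester, positive)` tuple encoding are assumptions made for this example. The no-supplier case, which the slides defer to the paper, is left as "undecided".

```python
def satisfies_invariant(supplier_order, ring_order):
    """True if responses travel the ring in the order the supplier
    processed the corresponding requests."""
    return list(ring_order) == list(supplier_order)


def outcome(node, responses_seen):
    """Classify `node` from the (requester, positive) responses it observes,
    listed in arrival order."""
    for requester, positive in responses_seen:
        if requester == node:
            # Own response arrived before any other positive response.
            return "win" if positive else "undecided"
        if positive:
            # Another node's positive response arrived first: this node lost.
            return "retry"
    return "undecided"


print(satisfies_invariant(["B", "A"], ["B", "A"]))
print(outcome("A", [("B", True), ("A", False)]))
print(outcome("B", [("B", True)]))
```

With the invariant holding, A sees B's positive response before its own and retries, while B sees its own positive response first and wins.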
11
Uncorq: Unconstrained snoop request delivery
12
Uncorq idea
Baseline: control messages (requests and responses) follow the ring
Idea: requests do not have to follow the ring
(but responses do)
13
Benefit of Uncorq
Reduced cache-to-cache transfer latency
[Figure: timeline of request, snoop, and data messages for Baseline vs. Uncorq]
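
To give a feel for why freeing requests from the ring helps, the sketch below compares hop counts on a 2D torus. It assumes a serpentine embedding of the ring (even rows run left to right, odd rows right to left) and treats hop count as a stand-in for latency. Both are assumptions for illustration; the slides do not specify the ring embedding or a latency model.

```python
def ring_position(x, y, width):
    """Serpentine index of node (x, y) on the assumed embedded ring."""
    return y * width + (x if y % 2 == 0 else width - 1 - x)


def ring_hops(src, dst, width, height):
    """Hops from src to dst when the request must follow the unidirectional ring."""
    n = width * height
    return (ring_position(*dst, width) - ring_position(*src, width)) % n


def torus_hops(src, dst, width, height):
    """Hops from src to dst on the shortest torus path (wraparound links allowed)."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    return min(dx, width - dx) + min(dy, height - dy)


# 8 x 8 torus, 64 nodes, requester at (0, 0)
for supplier in [(3, 4), (0, 7)]:
    print(supplier,
          ring_hops((0, 0), supplier, 8, 8),
          torus_hops((0, 0), supplier, 8, 8))
```

Under these assumptions a supplier at (0, 7) is 63 ring hops away but a single torus hop away, which is the kind of gap unconstrained request delivery is meant to exploit.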
14
Implications of Uncorq
  • Uncorq no longer restricts order of requests
  • Nodes may receive and process requests in any
    order
  • Responses may also get reordered

Problem: the distributed algorithm relies on the fact
that response order reflects order of requests
at supplier
15
Example: incorrect transaction ordering
A node cannot forward any other response if it
has an outstanding positive snoop outcome
16
How Uncorq stalls responses
  • Local transaction table (per-node structure)
      • records messages that node is currently
        processing

[Figure: nodes A, B, and C; the local transaction table at C lists requests and responses by addr]
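
The stalling rule (a node cannot forward any other response while it has an outstanding positive snoop outcome) can be sketched as a small per-node structure. The class name, its fields, and the assumption that each request is snooped before its response arrives are hypothetical simplifications, not the hardware design from the paper.

```python
from collections import deque


class LocalTransactionTable:
    """Hypothetical per-node table that stalls responses as the rule requires."""

    def __init__(self):
        self.pending_positive = set()  # requesters we snooped positively, response not yet seen
        self.stalled = deque()         # responses held back, in arrival order
        self.forwarded = []            # order in which responses leave this node

    def snoop_request(self, requester, positive):
        if positive:
            self.pending_positive.add(requester)

    def receive_response(self, requester):
        if requester in self.pending_positive:
            # The response our positive outcome belongs to: forward it, then drain.
            self.pending_positive.discard(requester)
            self.forwarded.append(requester)
            while self.stalled and not self.pending_positive:
                self.forwarded.append(self.stalled.popleft())
        elif self.pending_positive:
            # Some other response: stall it behind the outstanding positive outcome.
            self.stalled.append(requester)
        else:
            self.forwarded.append(requester)


node = LocalTransactionTable()
node.snoop_request("B", positive=True)   # this node supplies B
node.snoop_request("A", positive=False)
node.receive_response("A")               # arrives first, but must wait
node.receive_response("B")               # releases B, then A
print(node.forwarded)
```

Even though A's response reaches the node first, it leaves after B's, so the response order again reflects the order in which the supplier processed the requests.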
17
Optimization: prefetching from memory
  • Goal: reduce latency of memory-to-cache transfers
  • Access memory in parallel with ring snoop

[Figure: unoptimized vs. optimized access to memory, with numbered steps (1) and (2)]
  • Predict when no node will supply data
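
The benefit of overlapping the memory access with the ring snoop comes down to simple arithmetic, sketched below. The function name and the cycle counts are illustrative assumptions, not measurements from the paper, and the sketch ignores the cost of a wrong prediction.

```python
def miss_latency(ring_snoop, memory_access, prefetch):
    """Latency of a memory-to-cache transfer when no node supplies the data."""
    if prefetch:
        # Memory is accessed in parallel with the ring snoop.
        return max(ring_snoop, memory_access)
    # Memory is accessed only after the snoop completes.
    return ring_snoop + memory_access


# Illustrative cycle counts only
print(miss_latency(100, 150, prefetch=False))
print(miss_latency(100, 150, prefetch=True))
```

With these made-up numbers the serialized access costs 250 cycles and the overlapped access costs 150, so the slower of the two operations sets the latency once they run in parallel.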

18
Evaluation
19
Experimental setup
  • 64 nodes in a single CMP
  • Interconnection network: 2D torus with
    embedded ring
  • SESC simulator (sesc.sourceforge.net)
  • SPLASH-2, SPECjbb and SPECweb workloads

20
Cache-to-cache transfer latency
21
Execution Time
[Figure: normalized execution time (0 to 1) of Baseline, Uncorq, and Uncorq+Pref for SPLASH-2, SPECjbb, and SPECweb]
  • Uncorq significantly reduces execution time
    (reduction: 5-23%)
  • Uncorq+Pref performs the best (reduction:
    13-26%)

22
Also in the paper
  • Serialization mechanism for case with no supplier
  • System and node forward progress
  • Fences and memory consistency issues
  • Characterization of prefetching mechanism
  • Comparison against ccHyperTransport

23
Conclusion
  • Propose invariant for transaction serialization
  • Propose performance enhancements
  • Uncorq: unconstrained snoop request delivery
  • Simple hardware data prefetching technique
  • Reduce execution time by 13-26%

24
Uncorq: Unconstrained Snoop Request Delivery in
Embedded-Ring Multiprocessors
http://iacoma.cs.uiuc.edu