Title: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors

1. Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors
http://iacoma.cs.uiuc.edu
2. Motivation
- Shared-memory machines with caches need cache coherence
- Traditional cache coherence solutions:
  - shared bus-based: electrical and layout issues
  - directory-based: indirection and storage overhead
3. Embedded-ring cache coherence [ISCA 2006]
- Novel snoopy cache coherence for mid-sized machines
  - logical ring is embedded in the network
  - control messages use the ring
  - data messages use any path
- Simple and inexpensive to implement
- Snoop requests can have long latencies
4. Contributions
- Propose an invariant for transaction serialization
- Propose performance enhancements:
  - Uncorq: unconstrained snoop-request delivery
    - reduces cache-to-cache transfer latency
  - Simple hardware data prefetching technique
    - reduces memory-to-cache transfer latency
5. Embedded-ring terminology
- Snoopy, invalidate-based protocol
- Control messages (snoop requests and responses) travel the ring
6. Ordering invariant
7. Transaction serialization
[Figure: cache-line state transitions (M, S, I) as the old value is replaced by the new value]
8. Serialization enforcement with embedded-ring
- Logical unidirectional ring provides partial ordering
- Distributed algorithm establishes a global order for same-address transactions
- On simultaneous transactions to the same address:
  - one is declared the winner (first to reach the supplier)
  - the others have to retry
9. How to serialize transactions
[Figure: requesters A and B and supplier S on the ring; no clear first transaction]
- B's request reaches S first
- The ring guarantees responses are forwarded in the order S performed the snoop operations
- A receives B's positive response before its own
- A retries; B is ordered before A
10. Enforcing transaction serialization
- The node whose request arrives at the supplier node first is the winner
- What we need to enforce transaction serialization:
Ordering invariant: the order in which responses travel the ring after leaving the supplier must be the same as the order in which the supplier processed their corresponding requests.
- The loser node sees the other node's positive response before its own
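The serialization rule above can be sketched in a few lines. This is my own toy illustration (the function and names are invented, not from the talk): responses leave the supplier in the order it snooped the requests and are forwarded in FIFO order along the ring, so every loser observes the winner's positive response before its own and retries.

```python
# Toy sketch of same-address transaction serialization on the embedded ring.
# The supplier snoops requests in some order; the ordering invariant keeps
# responses in that same order on the ring, so only the first requester wins.

def outcome(requesters, supplier_snoop_order):
    """Return {requester: 'winner' | 'retry'} for simultaneous
    same-address transactions, given the order the supplier snooped them."""
    results = {}
    for i, req in enumerate(supplier_snoop_order):
        # First request the supplier processed wins; every later requester
        # sees the winner's positive response travel past it first.
        results[req] = "winner" if i == 0 else "retry"
    return results

# B's request reaches the supplier first, as in the slide's example.
print(outcome({"A", "B"}, supplier_snoop_order=["B", "A"]))
# {'B': 'winner', 'A': 'retry'}
```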
11. Uncorq: Unconstrained snoop-request delivery
12. Uncorq idea
- Baseline: requests follow the ring
- Idea: requests do not have to follow the ring (but responses do)
13. Benefit of Uncorq
- Reduced cache-to-cache transfer latency
[Figure: timeline of request, snoop, and data phases for Baseline vs. Uncorq]
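A back-of-the-envelope hop count shows where the latency saving comes from. This is my own illustration under assumed parameters (a 16-node machine, minimal routing): in the baseline, the supplier sees a request only after it travels the unidirectional ring, while Uncorq can deliver the request over the 2D torus; the data then takes a direct path in both cases.

```python
# Hop-count comparison: ring-constrained vs. unconstrained request delivery.

def ring_distance(src, dst, n):
    """Hops along a unidirectional ring of n nodes."""
    return (dst - src) % n

def torus_distance(src, dst, side):
    """Manhattan hops on a side x side torus with wraparound links."""
    sx, sy = divmod(src, side)
    dx, dy = divmod(dst, side)
    h = min(abs(sx - dx), side - abs(sx - dx))
    v = min(abs(sy - dy), side - abs(sy - dy))
    return h + v

src, dst = 0, 15                      # requester and supplier (hypothetical)
print(ring_distance(src, dst, 16))    # 15 hops until the supplier is snooped
print(torus_distance(src, dst, 4))    # 2 hops on the 4x4 torus with Uncorq
```

With the supplier snooped after 2 hops instead of 15, the data transfer can start that much earlier; the responses still circulate on the ring for serialization.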
14. Implications of Uncorq
- Uncorq no longer restricts the order of requests
  - nodes may receive and process requests in any order
  - responses may also get reordered
- Problem: the distributed algorithm relies on the response order reflecting the order of requests at the supplier
15. Example: incorrect transaction ordering
- A node cannot forward any other response while it has an outstanding positive snoop outcome
16. How Uncorq stalls responses
- Local transaction table (per-node structure)
  - records messages that the node is currently processing
[Figure: nodes A, B, C; the per-node table tracks, per address, pending requests and responses]
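The stall rule from the previous slide can be sketched as follows. This is a minimal sketch with invented structure names, not the paper's hardware design: while a node holds an outstanding positive snoop outcome, it buffers every other response instead of forwarding it, so its own positive response cannot be overtaken on the ring.

```python
from collections import deque

class Node:
    """Toy per-node Uncorq stall logic (illustrative names, my own sketch)."""

    def __init__(self):
        self.outstanding_positive = set()   # addresses snooped with a hit
        self.stalled = deque()              # responses buffered, not forwarded
        self.forwarded = []                 # order in which responses left

    def snoop(self, addr, hit):
        """Process a (possibly out-of-order) request; remember positive outcomes."""
        if hit:
            self.outstanding_positive.add(addr)

    def on_response(self, resp_addr, origin):
        """Forward a passing response only if no positive outcome is pending."""
        if self.outstanding_positive:
            self.stalled.append((resp_addr, origin))
        else:
            self.forwarded.append((resp_addr, origin))

    def send_own_response(self, addr):
        """Emit our positive response first, then drain stalled responses in order."""
        self.outstanding_positive.discard(addr)
        self.forwarded.append((addr, "self"))
        while self.stalled and not self.outstanding_positive:
            self.forwarded.append(self.stalled.popleft())

n = Node()
n.snoop(0x40, hit=True)        # positive outcome for line 0x40 is outstanding
n.on_response(0x80, "B")       # another response arrives: stalled, not forwarded
n.send_own_response(0x40)      # own response leaves first, then the stalled one
print(n.forwarded)             # [(64, 'self'), (128, 'B')]
```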
17. Optimization: prefetching from memory
- Goal: reduce the latency of memory-to-cache transfers
- Access memory in parallel with the ring snoop
[Figure: unoptimized vs. optimized timing of the ring snoop and the memory access]
- Predict when no node will supply the data
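The idea behind the optimization can be sketched with a deliberately simple predictor. Everything here is a hypothetical illustration (the paper characterizes its own mechanism): when the predictor guesses that no cache will supply the line, the memory access is launched in parallel with the ring snoop instead of after it.

```python
# Sketch of overlapping the memory access with the ring snoop.
# The predictor is a last-outcome table, chosen only for illustration.

class SupplierPredictor:
    def __init__(self):
        self.last = {}   # addr -> True if a cache supplied the line last time

    def predict_cache_supplies(self, addr):
        return self.last.get(addr, False)   # default: expect memory to supply

    def update(self, addr, cache_supplied):
        self.last[addr] = cache_supplied

def issue_request(pred, addr):
    """Return the actions started together for a miss on addr."""
    actions = ["ring_snoop"]
    if not pred.predict_cache_supplies(addr):
        actions.append("memory_prefetch")   # overlap memory with the snoop
    return actions

p = SupplierPredictor()
print(issue_request(p, 0x100))        # ['ring_snoop', 'memory_prefetch']
p.update(0x100, cache_supplied=True)  # a cache supplied it: predict cache next
print(issue_request(p, 0x100))        # ['ring_snoop']
```

A wrong "memory" prediction only wastes memory bandwidth on an unused prefetch; correctness still comes from the ring snoop, which always runs.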
18. Evaluation
19. Experimental setup
- Interconnection network: 2D torus with an embedded ring
- SESC simulator (sesc.sourceforge.net)
- SPLASH-2, SPECjbb, and SPECweb workloads
20. Cache-to-cache transfer latency
21. Execution time
[Figure: normalized execution time (0 to 1) of Baseline, Uncorq, and Uncorq+Pref on SPLASH-2, SPECjbb, and SPECweb]
- Uncorq significantly reduces execution time (5-23% reduction)
- Uncorq+Pref performs the best (13-26% reduction)
22. Also in the paper
- Serialization mechanism for the case with no supplier
- System and node forward progress
- Fences and memory consistency issues
- Characterization of the prefetching mechanism
- Comparison against ccHyperTransport
23. Conclusion
- Proposed an invariant for transaction serialization
- Proposed performance enhancements:
  - Uncorq: unconstrained snoop-request delivery
  - Simple hardware data prefetching technique
- Reduced execution time by 13-26%
24. Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors
http://iacoma.cs.uiuc.edu