Title: Distributed Shared Memory for Large-Scale Dynamic Systems
1. Distributed Shared Memory for Large-Scale Dynamic Systems
- Vincent Gramoli
- supervised by Michel Raynal
2-5. My Thesis
- Implementing a distributed shared memory for large-scale dynamic systems is
- NECESSARY,
- DIFFICULT,
- DOABLE!
6-7. RoadMap
- Necessary? Communicating in Large-Scale Systems
- An Example of Distributed Shared Memory
- Difficult? Facing Dynamism is not trivial
- Difficult? Facing Scalability is tricky too
- Doable? Yes, here is a solution!
- Conclusion
8. Distributed Systems Grow Larger
- Internet explosion: IPv4 → IPv6
- Multiplication of personal devices
- 17 billion networked devices by 2012 (IDC prediction)
9. Distributed Systems Are Dynamic
- Independent computational entities act asynchronously and are affected by unpredictable events (joins and leaves).
- These sporadic activities make the system dynamic.
10. Massively Accessed Applications
- Web services handle large amounts of information
- eBay: auctioning service (increase an auction)
- Wikipedia: collaborative encyclopedia (modify an article)
- LastMinute: booking application (reserve tickets)
- but they require too much power and cost too much
11. Massively Distributed Applications
- Peer-to-peer applications share resources
- BitTorrent: file sharing
- Skype: voice over IP
- Joost: video streaming
- but they prevent large-scale collaboration.
12. Filling the Gap Is Necessary
- Providing distributed applications where entities (nodes) can fully collaborate
- P2Pedia: using P2P to build a collaborative encyclopedia
- P2P eBay: using P2P as an auctioning service
13. There Are Two Ways of Collaborating
- Using a shared memory
- A node writes information to the memory
- Another node reads information from the memory
- Using message passing
- A node sends a message to another node
- The second node receives the message from the first
[Figure: Nodes 1-3 communicating through a shared Memory (write v / read v) versus directly by messages (send v / recv v)]
14. Shared Memory Is Easier to Use
- Shared memory is easy to use
- If the information is written, collaboration progresses!
- Message passing is difficult to use
- To which node should the information be sent?
15. Message Passing Tolerates Failures
- Shared memory is failure-prone
- Communication relies on memory availability
- Message passing is fault-tolerant
- As long as there is a way to route a message
[Figure: the shared Memory becomes unavailable, while send v / recv v between Nodes 1-3 still succeeds]
16. The Best of the Two Ways
- Distributed Shared Memory (DSM)
- emulates a shared memory, to provide simplicity,
- in the message-passing model, to tolerate failures.
[Figure: DSM interface with read / write(v) operations and read-ack(v) / write-ack responses]
17. RoadMap
- Necessary? Communicating in Large-Scale Systems
- An Example of Distributed Shared Memory
- Difficult? Facing Dynamism is not trivial
- Difficult? Facing Scalability is tricky too
- Doable? Yes, here is a solution!
- Conclusion
18. Our DSM Consistency: Atomicity
- Atomicity (linearizability) defines an operation ordering
- If an operation ends before another starts, it cannot be ordered after it
- Write operations are totally ordered, and read operations are ordered with respect to write operations
- A read returns the last value written (or the default value if none exists)
19. Quorum-based DSM
"Sharing Memory Robustly in Message-Passing Systems", H. Attiya, A. Bar-Noy, D. Dolev, JACM 1995
- Quorums: mutually intersecting sets of nodes
- Ex.: 3 quorums of size q = 2 over a memory of size m = 3, with Q1 ∩ Q2 ≠ ∅, Q1 ∩ Q3 ≠ ∅, Q2 ∩ Q3 ≠ ∅
- Each node of the quorums maintains
- a local value v of the object,
- a unique tag t, the version number of this value.
20. Quorum-based DSM
- Read and write operations (a runnable sketch follows below)
- A node i reads the object value vk by
- asking for vj and tj from each node j of a quorum,
- choosing the value vk with the largest tag tk,
- replicating vk and tk to all nodes of a quorum.
- A node i writes a new object value vn by
- asking for tj from each node j of a quorum,
- choosing a tag tn larger than any returned tj,
- replicating vn and tn to all nodes of a quorum.
[Figure: read = Get ⟨vk,tk⟩ then Set ⟨vk,tk⟩; write = Get ⟨vk,tk⟩, pick tn > tk, then Set ⟨vn,tn⟩]
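The following is a minimal, in-memory sketch of these two operations, assuming replicas can be accessed directly as Python objects rather than over the network; the Replica class, the quorum lists, and the tag scheme (writer ids used for tie-breaking are omitted) are illustrative simplifications, not the algorithm as published.

```python
# Sketch of the quorum-based read/write above. Replicas are plain objects;
# a real system would query them by messages and wait for quorum replies.

class Replica:
    def __init__(self):
        self.value = None   # local value v of the object
        self.tag = 0        # version number t of that value

def read(query_quorum, update_quorum):
    # 1) ask a quorum for <v, t>, keep the pair with the largest tag
    value, tag = max(((r.value, r.tag) for r in query_quorum), key=lambda vt: vt[1])
    # 2) replicate that pair to a quorum before returning the value
    for r in update_quorum:
        if tag > r.tag:
            r.value, r.tag = value, tag
    return value

def write(query_quorum, update_quorum, new_value):
    # 1) ask a quorum for the largest tag and pick a strictly larger one
    new_tag = max(r.tag for r in query_quorum) + 1
    # 2) replicate <vn, tn> to a quorum
    for r in update_quorum:
        if new_tag > r.tag:
            r.value, r.tag = new_value, new_tag

# Example: memory of m = 3 replicas, quorums of size q = 2 that pairwise intersect.
memory = [Replica() for _ in range(3)]
q1, q2, q3 = [memory[0], memory[1]], [memory[1], memory[2]], [memory[0], memory[2]]
write(q1, q2, "v1")
print(read(q3, q1))   # -> "v1", because Q3 intersects Q2
```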
21-26. Quorum-based DSM (execution example)
[Figures: a read queries a quorum for its value and tag, collects ⟨v1,t1⟩, propagates ⟨v1,t1⟩ to a quorum, and outputs v1; then a write with input v2 queries a quorum for the maximum tag (t1) and propagates ⟨v2,t2⟩, with t2 > t1, to a quorum]
27. Quorum-based DSM
- Works well in a static system
- The number of failures f must satisfy f ≤ m − q (e.g., with m = 3 and q = 2, a single failure still leaves a complete quorum), so that Q1 ∩ Q2 ≠ ∅ and Q2 ∩ Q3 ≠ ∅ still hold
- All operations can still access a quorum
28. Quorum-based DSM
- Does not work in dynamic systems
- All quorums may fail if failures are unbounded
- Problem: Q1 ∩ Q2 = ∅ and Q1 ∩ Q3 = ∅ and Q2 ∩ Q3 = ∅
29. RoadMap
- Necessary? Communicating in Large-Scale Systems
- An Example of Distributed Shared Memory
- Difficult? Facing Dynamism is not trivial
- Difficult? Facing Scalability is tricky too
- Doable? Yes, here is a solution!
- Conclusion
30. Reconfiguring
- Dynamism produces an unbounded number of failures
- Solution: reconfiguration
- Replace the quorum configuration periodically
- Problem: Q1 ∩ Q2 = ∅ and Q1 ∩ Q3 = ∅ and Q2 ∩ Q3 = ∅
31. Agreeing on the Configuration
- All nodes must agree on the next configuration
- Quorum-based consensus algorithm: Paxos
- Previously, a consensus building block complemented the DSM service
- Paxos: a 3-phase leader-based algorithm
- Prepare a ballot (2 message delays)
- Propose a configuration to install (2 message delays)
- Propagate the decided configuration (1 message delay)
"RAMBO: Reconfigurable Atomic Memory Service for Dynamic Networks", N. Lynch, A. Shvartsman, DISC 2002
32. RDS: Reconfigurable Distributed Storage
- RDS integrates the consensus service into the reconfigurable DSM (a simplified sketch of the phase pattern follows below)
- Fast version of Paxos
- Remove the first phase (in some cases)
- Quorums also propagate the configuration
- Ensuring read/write atomicity
- Piggyback object information onto Paxos messages
- Parallelizing obsolete configuration removal
- Add an additional message to the propagate phase of Paxos
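A highly simplified sketch of the three-phase message pattern (prepare, propose, propagate), assuming a single uncontested leader and replicas simulated as local objects; competing ballots, failures, recovery, and the piggybacked object state that RDS adds are all omitted, so this illustrates the phase structure only, not RDS itself.

```python
# Skeleton of the 3-phase reconfiguration pattern with one uncontested leader.
# Real RDS/Paxos also resolves competing ballots and piggybacks the object's
# value and tag on these messages; none of that is modeled here.

class ConfigReplica:
    def __init__(self, config):
        self.promised = 0        # highest ballot promised so far
        self.accepted = None     # (ballot, config) accepted during propose
        self.installed = config  # configuration currently in use

def reconfigure(ballot, new_config, quorum):
    # Phase 1 -- prepare: a quorum promises the ballot
    # (the "fast" variant skips this phase in some cases).
    for r in quorum:
        if ballot > r.promised:
            r.promised = ballot

    # Phase 2 -- propose: the quorum accepts the proposed configuration.
    for r in quorum:
        if ballot >= r.promised:
            r.accepted = (ballot, new_config)

    # Phase 3 -- propagate: install the decision and retire the old config.
    for r in quorum:
        r.installed = new_config

replicas = [ConfigReplica(config="old") for _ in range(3)]
reconfigure(ballot=1, new_config="new", quorum=replicas)
print([r.installed for r in replicas])   # -> ['new', 'new', 'new']
```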
33. Contributions
- Operations are fast (sometimes optimal)
- 1 to 2 message delays
- Reconfiguration is fast (fault tolerance)
- 3 to 5 message delays
- while preserving
- operation atomicity, and
- operation independence
34. Facing Dynamism
- "Reconfigurable Distributed Storage", G. Chockler, S. Gilbert, V. Gramoli, P. Musial, A. Shvartsman, Proceedings of OPODIS 2005
35. RoadMap
- Necessary? Communicating in Large-Scale Systems
- An Example of Distributed Shared Memory
- Difficult? Facing Dynamism is not trivial
- Difficult? Facing Scalability is tricky too
- Doable? Yes, here is a solution!
- Conclusion
36. Facing Scalability Is Difficult
- Problems
- Large-scale participation induces load
- When the load is too high, requests can be lost
- Bandwidth resources are limited
- Goal: tolerate load while preventing communication overhead
- Solution: a DSM that adapts to load variations and restricts communication
37. Using a Logical Overlay
- Object replicas r1, ..., rk share a 2-dimensional coordinate space
[Figure: replicas r1, ..., rk laid out over the coordinate space]
38. Benefiting from Locality
- Each replica ri communicates only with its nearest neighbors
39. Repairing the Overlay
- Topology takeover mechanism: if a node ri fails, a takeover node rj replaces it (a toy sketch follows below)
"A Scalable Content-Addressable Network", S. Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker, SIGCOMM 2001
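Below is a toy sketch of this overlay idea, assuming replicas sit at integer grid coordinates rather than owning CAN-style rectangular zones; the Overlay class, its neighbor rule, and the takeover choice are illustrative, not the mechanism of the cited paper.

```python
# Toy 2-d overlay: replicas at integer grid coordinates, each replica only
# talks to its adjacent neighbors, and when a replica fails one neighbor
# takes over its coordinate. Real CAN zones are rectangles that get split
# and merged; this grid keeps the idea minimal.

class Overlay:
    def __init__(self, side):
        self.alive = {(x, y) for x in range(side) for y in range(side)}

    def neighbors(self, node):
        x, y = node
        candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
        return [n for n in candidates if n in self.alive]

    def takeover(self, failed):
        """Remove a failed replica; one adjacent replica covers its spot."""
        self.alive.discard(failed)
        substitutes = self.neighbors(failed)
        return substitutes[0] if substitutes else None

overlay = Overlay(4)
print(overlay.neighbors((1, 1)))   # the only peers replica (1,1) talks to
print(overlay.takeover((1, 1)))    # e.g. (0, 1) now also covers (1,1)'s area
```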
40. Dynamic Bi-Quorums
- Bi-quorums
- Quorums of two types, where not all quorums intersect
- Quorums of different types do intersect
- Vertical quorum: all replicas responsible for abscissa x
- Horizontal quorum: all replicas responsible for ordinate y
- For any horizontal quorum H and any vertical quorum V: H ∩ V ≠ ∅
[Figure: a vertical quorum at abscissa x crossing a horizontal quorum at ordinate y]
41. Operation Execution (a sketch follows below)
- Read operation
- 1) Get the up-to-date value and largest tag from a horizontal quorum,
- 2) Propagate this value and tag on a vertical quorum.
- Write operation
- 1) Get the up-to-date value and largest tag from a horizontal quorum,
- 2) Propagate the value to write (and a higher tag) twice on the same vertical quorum.
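A minimal sketch of these bi-quorum operations, assuming the replicas form a k × k grid so that a horizontal quorum is one row and a vertical quorum is one column; the Cell class, the grid layout, and the quorum selection are illustrative simplifications, not SQUARE's actual data structures.

```python
# Bi-quorum sketch: replicas form a k x k grid; a horizontal quorum is a row,
# a vertical quorum is a column, and any row intersects any column.

class Cell:
    def __init__(self):
        self.value, self.tag = None, 0

k = 3
grid = [[Cell() for _ in range(k)] for _ in range(k)]
row = lambda y: [grid[y][x] for x in range(k)]   # horizontal quorum
col = lambda x: [grid[y][x] for y in range(k)]   # vertical quorum

def get(quorum):
    """Return the <value, tag> pair with the largest tag in a quorum."""
    best = max(quorum, key=lambda c: c.tag)
    return best.value, best.tag

def put(quorum, value, tag):
    for c in quorum:
        if tag > c.tag:
            c.value, c.tag = value, tag

def read(y, x):
    value, tag = get(row(y))     # 1) consult a horizontal quorum
    put(col(x), value, tag)      # 2) propagate on a vertical quorum
    return value

def write(y, x, value):
    _, tag = get(row(y))         # 1) consult a horizontal quorum
    put(col(x), value, tag + 1)  # 2) propagate a larger tag on a vertical
    put(col(x), value, tag + 1)  #    quorum (done twice, as described above)

write(0, 0, "v2")
print(read(2, 0))   # -> "v2": row 2 meets column 0 at grid[2][0]
```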
42. Load Adaptation
- Thwart: requests follow the diagonal until a non-overloaded node is found (sketched below).
- Expansion: a node is added to the memory if no non-overloaded node is found.
- Shrink: if underloaded, a node leaves the memory after notifying its neighbors.
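A small sketch of the thwart rule only, assuming loads are known locally and the walk wraps around a k × k torus; the threshold, the load values, and the wrap-around are illustrative, and expansion and shrink are not modeled.

```python
# Thwart: a request walks the diagonal of the coordinate space until it
# reaches a replica whose load is below a threshold. Expansion (adding a
# node when everyone is overloaded) and shrink are not modeled here.

def thwart(loads, start, threshold):
    """loads maps (x, y) -> current load on a k x k torus of replicas."""
    k = max(x for x, _ in loads) + 1
    x, y = start
    for _ in range(k):                      # at most one full diagonal tour
        if loads[(x, y)] < threshold:
            return (x, y)                   # this replica handles the request
        x, y = (x + 1) % k, (y + 1) % k     # follow the diagonal
    return None                             # all overloaded: trigger expansion

loads = {(x, y): 0 for x in range(3) for y in range(3)}
loads[(0, 0)] = 10                          # the entry point is overloaded
print(thwart(loads, (0, 0), threshold=5))   # -> (1, 1)
```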
43. Contributions
- SQUARE is a DSM that
- scales well by tolerating load variations,
- defines load-optimal quorums (under reasonable assumptions),
- uses communication-efficient reconfiguration.
44. Operation Latency

Request rate | Memory size | Read latency | Write latency
100  | 10 | 479  | 733
125  | 14 | 622  | 812
250  | 24 | 1132 | 1396
500  | 46 | 1501 | 2173
1000 | 98 | 2408 | 3501

Bad news: operation latency increases with the load (request rate).
45. Facing Scalability Is Difficult
- "P2P Architecture for Self-* Atomic Memory", E. Anceaume, M. Gradinariu, V. Gramoli, A. Virgillito, Proceedings of ISPAN 2005
- "SQUARE: Scalable Quorum-Based Atomic Memory with Local Reconfiguration", V. Gramoli, E. Anceaume, A. Virgillito, Proceedings of ACM SAC 2007
46. RoadMap
- Necessary? Communicating in Large-Scale Systems
- An Example of Distributed Shared Memory
- Difficult? Facing Dynamism is not trivial
- Difficult? Facing Scalability is tricky too
- Doable? Yes, here is a solution!
- Conclusion
47. Probability for Modeling Reality
- Motivations for probabilistic solutions
- A tradeoff prevents deterministic solutions from being efficient
- They allow more realistic models
- Any node can fail independently
- even if it is unlikely that many nodes fail at the same time
48. What Is Churn?
- Churn is the intensity of dynamism!
- Dynamic system
- n interconnected nodes
- Nodes join and leave the system
- A joining node is new
- Here, we model churn simply by a rate c (a small simulation follows below)
- At each time unit, c·n nodes leave the network
- At each time unit, c·n nodes enter the network
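The toy simulation below merely exercises this model: every time unit, c·n nodes leave and c·n fresh nodes join, so the system size stays n while its composition drifts; the node identifiers and parameter values are illustrative.

```python
# Toy churn simulation: each time unit, c*n nodes leave and c*n new nodes
# join, so the size stays n while the population gradually changes.
import random

def churn_step(nodes, c, next_id):
    n = len(nodes)
    leaving = random.sample(sorted(nodes), int(c * n))
    nodes.difference_update(leaving)                    # c*n departures
    nodes.update(range(next_id, next_id + int(c * n)))  # c*n fresh arrivals
    return next_id + int(c * n)

nodes, next_id = set(range(1000)), 1000
for _ in range(10):                     # 10 time units with churn c = 0.1
    next_id = churn_step(nodes, 0.1, next_id)
print(len(nodes), len(nodes & set(range(1000))))  # size stays 1000; originals dwindle
```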
49. Relaxing Consistency
- Every operation satisfies all atomicity rules with high probability!
- Unsuccessful operation: an operation that violates at least one of those rules
- Probabilistic atomicity
- If an operation Op1 ends before another operation Op2 starts, then Op1 is ordered after Op2 with probability at most ε = e^(−β²) (with β a constant); if this happens, Op2 is considered unsuccessful
- Write operations are totally ordered and read operations are ordered w.r.t. write operations
- A read returns the last value successfully written (or the default one if none exists) with probability 1 − e^(−β²) (with β a constant); if this does not hold, the read is unsuccessful
50. TQS: Timed Quorum System
- Intersection is guaranteed during a bounded period of time, with high probability
- Gossip-based algorithm running in parallel
- Shuffle the set of neighbors using a gossip-based algorithm
- Traditional read/write operations using two message round-trips between the client and a quorum
- Consult the value and tag of a quorum
- Create a new, larger tag (if writing)
- Propagate the value and tag to a quorum
51. TQS: Timed Quorum System
- Contacting a quorum (a sketch follows below)
- Disseminate the message with TTL l to k neighbors,
- decrement the TTL upon first reception,
- forward received messages to k neighbors while their TTL is not null,
- so that, at the end, the required number of nodes has been contacted,
- where Δ denotes the maximum period of time between 2 successful operations.
52. Complexity of Our Implementation
- Assumptions
- At least one operation succeeds every Δ time units
- The gossip-based protocol provides uniformity
- Operation time complexity (in expectation):
- where D = (1 − c)^(−Δ) is the dynamic parameter
53-54. Complexity of Our Implementation
- Operation communication complexity (in expectation):
- where D = (1 − c)^(−Δ) is the dynamic parameter
- If D is a constant, this matches the communication complexity for static systems presented in "Probabilistic Quorum Systems", D. Malkhi, M. Reiter, A. Wool, R. Wright, Information and Computation, 2001
55. Probability of Success
[Plot: probability of non-intersection as a function of quorum size, for n = 10,000 and failure rates of 10%, 30%, 50%, 70%, and 90%; a back-of-the-envelope computation follows below]
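As a back-of-the-envelope companion to this plot: for two uniformly random quorums of size q out of n nodes, the probability that they do not intersect is about (1 − q/n)^q ≈ e^(−q²/n), so q = β·√n yields roughly e^(−β²). The exact TQS model (churn, timed validity, failures) is not reproduced here; this only illustrates the shape of the curve.

```python
# Probability that two uniformly random quorums of size q (out of n nodes)
# are disjoint, compared to the e^(-beta^2) approximation with q = beta*sqrt(n).
# Churn and timed validity are ignored; this only shows the curve's shape.
import math

def prob_disjoint(n, q):
    # P(disjoint) = C(n-q, q) / C(n, q) = prod_{i<q} (n - q - i) / (n - i)
    p = 1.0
    for i in range(q):
        p *= (n - q - i) / (n - i)
    return p

n = 10_000
for beta in (1, 2, 3):
    q = int(beta * math.sqrt(n))
    print(q, round(prob_disjoint(n, q), 6), round(math.exp(-beta ** 2), 6))
```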
56. Contributions
- TQS relies on timed, probabilistic intersections
- Operation latency is low
- Operation communication complexity is low
- No reconfiguration is needed
- Replication is inherently performed by the operations
- Atomicity is ensured with high probability
57. A DSM to Face Scalability and Dynamism
- "Core Persistence in Peer-to-Peer Systems: Relating Size to Lifetime", V. Gramoli, A.-M. Kermarrec, A. Mostéfaoui, M. Raynal, B. Sericola, Proceedings of RDDS 2006 (in conjunction with OTM 2006)
- "Timed Quorum Systems for Large-Scale and Dynamic Environments", V. Gramoli, M. Raynal, Proceedings of OPODIS 2007
58. RoadMap
- Necessary? Communicating in Large-Scale Systems
- An Example of Distributed Shared Memory
- Difficult? Facing Dynamism is not trivial
- Difficult? Facing Scalability is tricky too
- Doable? Yes, here is a solution!
- Conclusion
59. Conclusion
- We have presented three DSMs
- Dynamism: RDS
- Scalability: SQUARE
- Dynamism and scalability: TQS
60. Conclusion

Solution | Latency | Communication | Guarantee
RDS    | Low  | High | Safe
SQUARE | High | Low  | Safe
TQS    | Low  | Low  | High probability
61. Open Questions
- Could we speed up operations further?
- Disseminating continuously up-to-date values
- Consulting values that have already been aggregated
- How should dynamism be modeled?
- Results differ for P2P file sharing
- What would it be for other applications?
62. END
63. Load Balancing
Good news: the load is well balanced over the replicas
64. Load Adaptation
Good news: the memory self-adapts well in the face of dynamism
65. Reconfigurable Distributed Storage
- Prepare phase
- The leader creates a new ballot and sends it to quorums
- A quorum of nodes sends back their candidate configurations
- The leader chooses the configuration for the ballot
- Propose phase
- The leader sends the ballot and its configuration to quorums; the leader also sends its tag and value and adds the current configuration
- A quorum of nodes can send their ballot vote, their tag, and their value to quorums
- These quorum nodes decide the next configuration
- Propagate phase
- These quorum nodes propagate the decided configuration to quorums
- These quorum nodes remove the old configuration if not done already