Title: Computer Science 328 Distributed Systems
1 Computer Science 328 Distributed Systems
- Lecture 20
- Replication Control II
2 Gossiping Architecture
- The replica managers (RMs) exchange gossip messages periodically in order to convey the updates they have each received from clients.
- Objective: provisioning of a highly available service.
- Each client obtains a consistent service over time: in response to a query, an RM may have to wait until it receives updates from other RMs. The RM then provides data that at least reflects the updates the client has observed so far.
- Relaxed consistency between replicas: all RMs eventually receive all updates and apply them with ordering guarantees.
3 Query and Update Operations in a Gossip Service
[Figure: clients issue Query (with prev) and Update (with prev) operations through front ends (FEs) to the RMs; a query returns (Val, new), an update returns an Update id, and the RMs exchange gossip messages among themselves.]
4 Various Timestamps
- Vector timestamps are used to control the order of operation processing. A timestamp contains an entry for each RM.
- Each front end keeps a vector timestamp, prev, that reflects the latest data values accessed by the front end. The FE sends it in every request to an RM.
- When an RM returns a value as the result of a query operation, it supplies a new timestamp, new.
- An update operation returns a timestamp, update id.
- Each returned timestamp is merged with the FE's previous timestamp to record the data that has been observed by the client (see the merge sketch below).
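As a concrete illustration of this merge step, here is a minimal Python sketch (not from the slides; the names merge, prev and new are illustrative) of the element-wise merge a front end might perform on a returned timestamp.

```python
# Minimal sketch: element-wise merge of vector timestamps, as an FE might do
# when an RM returns `new` with a query result or an update id with an update.

def merge(a, b):
    """Element-wise maximum of two vector timestamps of equal length."""
    return [max(x, y) for x, y in zip(a, b)]

prev = [2, 4, 6]         # FE's record of the data it has observed so far
new = [2, 5, 5]          # timestamp returned by an RM with a query result
prev = merge(prev, new)  # FE now reflects everything it has seen: [2, 5, 6]
print(prev)
```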
5 Front Ends Propagate Their Timestamps
Since client-to-client communication can also lead to causal relationships between operations applied to the service, the FE piggybacks its timestamp on messages to other clients.
6 A Gossip Replica Manager
7 Replica Manager State
- Value: the value of the object maintained by the RM.
- Value timestamp: the timestamp that represents the updates reflected in the value. Updated whenever an update operation is applied.
- Update log: records all update operations as soon as they are received.
  - Keeps all the updates that are not yet stable, where a stable update is one that can be applied consistently with its ordering guarantees.
  - Keeps stable updates that have been applied but cannot be purged yet, because there is no confirmation that they have been received at all other RMs.
- Replica timestamp: represents the updates that have been accepted by the RM into the log.
- Executed operation table: contains the FE-supplied ids of updates that have been applied to the value.
  - Used to prevent an update from being applied twice, as an update may arrive both from an FE and in gossip messages from other RMs.
- Timestamp table: contains, for each other RM, the latest timestamp that has arrived in a gossip message from that RM (see the data-structure sketch below).
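Below is a minimal data-structure sketch of this state, assuming Python dataclasses; the field names are illustrative and not taken from the slides.

```python
from dataclasses import dataclass, field

@dataclass
class LogRecord:
    i: int        # index of the RM that accepted the update
    ts: list      # timestamp assigned to the update by that RM
    op: object    # the update operation itself
    prev: list    # FE timestamp (u.prev) sent with the update
    uid: str      # FE-supplied unique id (u.id)

@dataclass
class ReplicaState:
    value: object = None                              # application state
    value_ts: list = field(default_factory=list)      # updates reflected in value
    update_log: list = field(default_factory=list)    # LogRecord entries
    replica_ts: list = field(default_factory=list)    # updates accepted into the log
    executed: set = field(default_factory=set)        # ids of updates applied to value
    ts_table: dict = field(default_factory=dict)      # latest gossip timestamp per RM
```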
8 A Closer Look at the Vector Timestamp
- The ith element of the vector timestamp held by RMi corresponds to the number of updates RMi has received directly from FEs.
- The jth element (j != i) of the vector timestamp held by RMi equals the number of updates received by RMj and forwarded to RMi in gossip messages.
9 Update Operations
- Each update request u contains:
  - The update operation, u.op
  - The FE's timestamp, u.prev
  - A unique id that the FE generates, u.id
- Upon receipt of an update request, RM i (see the sketch below):
  - Checks whether u has already been processed by looking up u.id in the executed operation table and in the update log.
  - If not, increments the ith element of its replica timestamp by one, to keep track of the number of updates received directly from FEs.
  - Places a record for the update in the RM's log:
    - logRecord := <i, ts, u.op, u.prev, u.id>
    - where ts is derived from u.prev by replacing u.prev's ith element with the ith element of the replica timestamp.
  - Returns ts to the FE, which merges it with its timestamp.
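A hedged sketch of these receipt steps, reusing the assumed ReplicaState and LogRecord structures from the earlier sketch; the function name and argument order are illustrative.

```python
def receive_update(rm, i, u_op, u_prev, u_id):
    """RM i processes an update request (u.op, u.prev, u.id) from an FE."""
    # Ignore duplicates: the update may already be in the log or already applied.
    if u_id in rm.executed or any(r.uid == u_id for r in rm.update_log):
        return None
    rm.replica_ts[i] += 1              # one more update received directly from FEs
    ts = list(u_prev)
    ts[i] = rm.replica_ts[i]           # ts = u.prev with its ith entry replaced
    rm.update_log.append(LogRecord(i, ts, u_op, u_prev, u_id))
    return ts                          # the FE merges ts into its own timestamp
```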
10 Update Operation (Cont'd)
- The stability condition for an update u is:
  - u.prev <= valueTS
  - i.e., all the updates on which this update depends have already been applied to the value.
- When the update operation u becomes stable, the RM does the following (see the sketch below):
  - value := apply(value, u.op)
  - valueTS := merge(valueTS, ts)   (update the value timestamp)
  - executed := executed U {u.id}   (update the executed operation table)
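A sketch of the stability test and the apply step, again with assumed helper names; apply_op stands in for the application-specific operation, and le implements the element-wise <= comparison (merge comes from the earlier sketch).

```python
def le(a, b):
    """a <= b for vector timestamps: every entry of a is <= the matching entry of b."""
    return all(x <= y for x, y in zip(a, b))

def apply_stable(rm, apply_op):
    """Apply every logged update whose stability condition u.prev <= valueTS holds."""
    progress = True
    while progress:
        progress = False
        for r in rm.update_log:
            if r.uid not in rm.executed and le(r.prev, rm.value_ts):
                rm.value = apply_op(rm.value, r.op)       # value := apply(value, u.op)
                rm.value_ts = merge(rm.value_ts, r.ts)    # valueTS := merge(valueTS, ts)
                rm.executed.add(r.uid)                    # executed := executed U {u.id}
                progress = True   # applying one update may make others stable
```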
11 Exchange of Gossip Messages
- A gossip message m consists of the log of the sending RM, m.log, and its replica timestamp, m.ts.
- An RM that receives a gossip message has three tasks (see the sketch below):
  - (1) Merge the arriving log with its own.
    - Let replicaTS denote the recipient RM's replica timestamp. A record r in m.log is added to the recipient's log unless r.ts <= replicaTS.
    - replicaTS := merge(replicaTS, m.ts)
  - (2) Apply any updates that have become stable but have not yet been executed (stable updates in the arriving log may make pending updates become stable).
  - (3) Eliminate records from the log and the executed operation table when it is known that the updates have been applied everywhere.
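A sketch of these three tasks, reusing the assumed helpers (le, merge, apply_stable) from the earlier sketches; the purge step is only indicated by a comment.

```python
def receive_gossip(rm, m_log, m_ts, apply_op):
    # (1) Merge the arriving log with our own, skipping records already covered.
    for r in m_log:
        already_known = le(r.ts, rm.replica_ts) or any(x.uid == r.uid for x in rm.update_log)
        if not already_known:
            rm.update_log.append(r)
    rm.replica_ts = merge(rm.replica_ts, m_ts)
    # (2) Apply any updates that have now become stable.
    apply_stable(rm, apply_op)
    # (3) Purging records known to have reached every RM is omitted here;
    #     it would consult rm.ts_table to see what the other RMs have received.
```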
12 Update Propagation
- The frequency with which RMs send gossip messages depends on the application.
- The policy for choosing a partner with which to exchange gossip:
  - Random policies choose a partner randomly (perhaps with weighted probabilities).
  - In deterministic policies, an RM can examine its timestamp table and choose the RM that is the furthest behind in the updates it has received (see the sketch below).
  - Topological policies arrange the RMs into a fixed graph, such as a mesh, a ring, or a tree.
- Each policy has its own merits and drawbacks. For example, the ring topology produces relatively little communication but is subject to high transmission latencies, since gossip may have to traverse several RMs.
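As an illustration of the deterministic policy, the sketch below (the names and the lag measure are assumptions, not prescribed by the slides) picks the peer whose latest gossiped timestamp lags furthest behind the local replica timestamp.

```python
def furthest_behind(replica_ts, ts_table):
    """Choose the peer RM whose last gossiped timestamp lags our replica timestamp most."""
    def lag(peer_ts):
        return sum(max(0, mine - theirs) for mine, theirs in zip(replica_ts, peer_ts))
    return max(ts_table, key=lambda rm_id: lag(ts_table[rm_id]))

ts_table = {"RM2": [3, 1, 0], "RM3": [1, 0, 0]}   # latest timestamps seen in gossip
print(furthest_behind([4, 2, 1], ts_table))        # -> RM3 (further behind than RM2)
```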
13 Query Operations
- A query request q contains the operation, q.op, and the timestamp, q.prev, sent by the FE.
- Let valueTS denote the RM's value timestamp; then q can be applied if:
  - q.prev <= valueTS
- The RM keeps q on a hold-back queue until the condition is fulfilled (see the sketch below).
  - For example, if valueTS is (2,5,5) and q.prev is (2,4,6), then one update from RM3 is missing.
- Once the query is applied, the RM returns new := valueTS to the FE, and the FE merges new with its timestamp.
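A sketch of the query path under the same assumed state and helpers; evaluate stands in for the application-specific read, and the hold-back queue is passed in explicitly.

```python
def try_query(rm, hold_back, q_op, q_prev, evaluate):
    """Answer q if q.prev <= valueTS; otherwise hold it back until more updates apply."""
    if le(q_prev, rm.value_ts):
        return evaluate(rm.value, q_op), list(rm.value_ts)   # (val, new := valueTS)
    hold_back.append((q_op, q_prev))   # retried after further updates become stable
    return None
```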
14 Transactions with Replicated Data
15 Transactions on Replicated Data
16 One-Copy Serializability
- In a non-replicated system, transactions appear to be performed one at a time in some order. This is achieved by ensuring a serially equivalent interleaving of transaction operations.
- One-copy serializability: the effect of transactions performed by clients on replicated objects should be the same as if they had been performed one at a time on a single set of objects.
17 Two-Phase Commit Protocol for Replicated Objects
- In the first phase, the coordinator sends the canCommit? request to the workers, each of which passes it on to the other RMs in its group and collects their replies before replying to the coordinator.
- In the second phase, the coordinator sends the doCommit or doAbort request, which is passed on to the members of the groups of RMs.
18 Primary Copy Replication
- All client requests are directed to a single primary RM.
- Concurrency control is applied at the primary.
- To commit a transaction, the primary communicates with the backup RMs and then replies to the client.
19 Read-One/Write-All Replication
- FEs may communicate with any RM.
- Every write operation must be performed at all of the RMs, each of which sets a write lock on the object.
- Each read operation is performed by a single RM, which sets a read lock on the object.
- Consider pairs of operations of different transactions on the same object:
  - Any pair of write operations will require conflicting locks at all of the RMs.
  - A read operation and a write operation will require conflicting locks at a single RM.
  - Thus one-copy serializability is achieved (see the check below).
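To make the conflict argument concrete, here is a tiny illustrative check (the replica names are assumed) that under read-one/write-all the lock sets of any two conflicting operations share at least one RM.

```python
replicas = {"X", "M", "P"}

def write_locks():             # a write locks the object at every RM
    return set(replicas)

def read_locks(chosen_rm):     # a read locks the object at a single RM
    return {chosen_rm}

# Any write/write or read/write pair overlaps in at least one RM, so that RM's
# local lock manager detects the conflict.
assert write_locks() & write_locks()
assert read_locks("M") & write_locks()
```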
20 Available Copies Replication
- A client's read request on an object can be performed by any available RM, but a client's update request must be performed by all available RMs in the group.
- Example (next slide): at X, transaction T has read A, and hence transaction U is not allowed to update A until transaction T has completed.
- As long as the set of available RMs does not change, local concurrency control achieves one-copy serializability in the same way as in read-one/write-all replication.
- This is not true if RMs fail and recover during conflicting transactions.
21 Available Copies Approach
22 The Impact of RM Failure
- Assume that (i) RM X fails just after T has performed getBalance but before it has performed deposit, and (ii) RM N fails just after U has performed getBalance but before it has performed deposit.
- T's deposit will then be performed at RMs M and P, and U's deposit will be performed at RM Y.
- The concurrency control on A at RM X does not prevent transaction U from updating A at RM Y.
- Solution: crashes and recoveries must be serialized with respect to transaction operations.
23 Local Validation (Using Our Example)
- From T's perspective:
  - T has read from an object at X, so X must fail after T's operation.
  - T observes the failure of N when it attempts to update the object, so N's failure must come before T.
  - N fails -> T reads object A at X; T writes object B at M and P -> T commits -> X fails.
- From U's perspective:
  - X fails -> U reads object B at N; U writes object A at Y -> U commits -> N fails.
- At the time T commits, it checks that N is still unavailable and that X, M and P are still available. If so, T can commit.
- This implies that X fails after T validated and before U validated. U's validation therefore fails, because N has already failed.
24 Network Partition
As there has been a partition, pairs of conflicting transactions may have been allowed to commit in different partitions. The only choice after the network is recovered is to abort one of the transactions. In the pessimistic quorum consensus approach, updates are allowed only in a partition that has a majority of the RMs, and updates are propagated to the other RMs when the partition is repaired.
25 Static Quorums
- The decision about how many RMs should be involved in an operation on replicated data is called quorum selection.
- Quorum rules state that:
  - At least r replicas must be accessed for a read.
  - At least w replicas must be accessed for a write.
  - r + w > N, where N is the number of replicas.
  - w > N/2.
- Each object has a version number or a consistent timestamp.
- A static quorum predefines r and w; this is a pessimistic approach: if a partition occurs, an update is possible in at most one partition (see the check below).
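A quick check of the two quorum rules; the helper name is illustrative.

```python
def valid_quorum(n, r, w):
    """Static-quorum constraints: read/write quorums overlap and writes form a majority."""
    return r + w > n and w > n / 2

print(valid_quorum(5, 3, 3))   # True: any read quorum intersects any write quorum
print(valid_quorum(5, 2, 3))   # False: r + w = 5 is not greater than N
```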
26 Voting with Static Quorums
- A variant of quorum selection in which each replica has a number of votes; a quorum is reached by a majority of the votes (N is the total number of votes).
- E.g., a cache replica may be given 0 votes:

  Replica | Votes | Access time | Version check | P(failure)
  Cache   | 0     | 100 ms      | 0 ms          | 0
  Rep1    | 1     | 750 ms      | 75 ms         | 1
  Rep2    | 1     | 750 ms      | 75 ms         | 1
  Rep3    | 1     | 750 ms      | 75 ms         | 1

- With r = w = 2: the access time for a write is 750 ms; the access time for a read without the cache is 750 ms; the access time for a read with the cache is 175-825 ms (see the arithmetic below).
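The quoted access times can be reproduced with a short calculation, under the assumption that a read first checks one replica's version number (75 ms) and then fetches the data from the cache (100 ms) if it is current, or from a replica (750 ms) otherwise; this interpretation is not stated explicitly on the slide.

```python
replica_read = 750     # ms, read the object from a full replica
version_check = 75     # ms, check a replica's version number
cache_read = 100       # ms, read the object from the 0-vote cache

read_without_cache = replica_read                # 750 ms
read_cache_current = version_check + cache_read  # 175 ms (cache is up to date)
read_cache_stale = version_check + replica_read  # 825 ms (must go to a replica)
write = replica_read                             # 750 ms (replicas written in parallel)
print(read_without_cache, read_cache_current, read_cache_stale, write)
```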
27 Quorum Consensus Examples
28 Optimistic Quorum Approaches
- An optimistic quorum selection allows writes to proceed in any partition.
- Failed processors are counted towards the write quorum, but not towards the read quorum.
- This might lead to write-write conflicts.
- Optimistic quorums are practical when:
  - Conflicting updates are rare.
  - Conflicts are always detectable.
  - Damage from conflicts can be easily confined.
  - Repair of damaged data is possible, or an update can be discarded without consequences.
29 View-based Quorum
- The quorum is based on the views held at any time.
  - In a small partition, the group "plays dead", hence no writes occur.
  - In a large enough partition, inaccessible nodes are counted towards the quorum as "ghost" processors.
  - Once the partition is repaired, processors in the small partition know whom to contact for updates.
- But taking a straw poll for every action is very costly. Can we do better?
30 View-based Quorum - Details
- We define a read threshold and a write threshold:
  - Aw: minimum number of nodes in a view for a write, with Aw > N/2.
  - Ar: minimum number of nodes in a view for a read.
  - Aw + Ar > N.
- If an ordinary quorum cannot be reached for an operation, we take a straw poll, i.e., update the views:
  - In a large enough partition for a read, Ar <= view size; in a large enough partition for a write, Aw <= view size (inaccessible nodes are counted as ghosts).
- Views are per object, numbered sequentially, and only updated if necessary.
- The first update after a partition is repaired forces restoration for the nodes in the smaller partition (see the check below).
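A minimal sketch of the view-size test, using the thresholds from the example on the next slide (N = 5, Aw = 3, Ar = 3); counting inaccessible nodes as ghosts is assumed to have happened before view_size is computed.

```python
def can_read(view_size, Ar):
    return view_size >= Ar

def can_write(view_size, Aw):
    return view_size >= Aw

N, Aw, Ar = 5, 3, 3        # satisfies Aw + Ar > N and Aw > N/2
print(can_write(3, Aw))    # True: a 3-node partition may form a write view
print(can_write(2, Aw))    # False: the 2-node partition "plays dead"
```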
31 Example: View-based Quorum
- Consider N = 5, w = 5, r = 1, Aw = 3, Ar = 3.
32 Example: View-based Quorum (Cont'd)