Title: CS556: Distributed Systems
1. CS-556 Distributed Systems
Consistency & Replication (III)
- Manolis Marazakis
- maraz_at_csd.uoc.gr
2. Fault Tolerance?
- Define correctness criteria
- When 2 replicas are separated by a network partition:
  - Both are deemed incorrect & stop serving.
  - One (the master) continues & the other ceases service.
  - One (the master) continues to accept updates & both continue to supply reads (of possibly stale data).
  - Both continue service & subsequently synchronise.
3. Fault Tolerance
- Design to recover after a failure with no loss of (committed) data.
- Designs for fault tolerance:
  - Single server, fail and recover
  - Primary server with trailing backups
  - Replicated service
4. Network Partitions
- Separate but viable groups of servers
- Optimistic schemes validate on recovery
  - Available copies with validation
- Pessimistic schemes limit availability until recovery
5. Replication under partitions
- Available copies with validation
  - Validation involves aborting conflicting Txs & compensations
  - Precedence graphs for detecting inconsistencies
- Quorum consensus (see the sketch below)
  - Version number per replica
  - Operations should only be applied to replicas with the current version number
- Virtual partition
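To make the quorum idea concrete, here is a minimal sketch of my own (not from the slides): read/write quorum access with per-replica version numbers, under the usual constraints R + W > N and W > N/2. The Replica class and helper names are illustrative assumptions.

  # Minimal quorum-consensus sketch (illustrative names and structure).
  # Correctness constraints: R + W > N (read & write quorums intersect)
  # and W > N/2 (any two write quorums intersect).

  class Replica:
      def __init__(self):
          self.version = 0
          self.value = None

  def quorum_read(replicas, R):
      """Contact R replicas; return the value carrying the highest version."""
      responses = replicas[:R]                 # stand-in for contacting R servers
      latest = max(responses, key=lambda r: r.version)
      return latest.value, latest.version

  def quorum_write(replicas, W, value):
      """Learn the current version from a write quorum, then install the
      value with version+1 at every quorum member."""
      group = replicas[:W]                     # stand-in for a write quorum
      current = max(r.version for r in group)
      for r in group:
          r.version = current + 1              # new current version
          r.value = value

  N, R, W = 5, 3, 3                            # satisfies R + W > N and W > N/2
  replicas = [Replica() for _ in range(N)]
  quorum_write(replicas, W, "x=42")
  print(quorum_read(replicas, R))              # ('x=42', 1)

Because every read quorum overlaps every write quorum, the read always sees at least one replica with the current version number.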
6. Transactions with Replicated Data
- Better performance
  - Concurrent service
  - Reduced latency
- Higher availability
  - Fault tolerance
- What if a replica fails or becomes isolated?
  - Upon rejoining, it must catch up
- Replicated transaction service
  - Data replicated at a set of replica managers
  - Replication transparency
- One-copy serializability
- Read one, write all
- Failures must be observed to have happened before any active Txs at other servers
7. Active replication (I)
8. Active replication (II)
- RMs are state machines with equivalent roles
- Front ends communicate the client requests to the RM group, using totally ordered reliable multicast
- RMs process requests independently & reply to the front end (correct RMs process each request identically)
- The front end can synthesize the final response to the client (tolerating Byzantine failures)
- Active replication provides sequential consistency if the multicast is reliable & ordered
- Byzantine failures (F out of 2F+1): the front end waits until it gets F+1 identical responses (see the sketch below)
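As a toy illustration of the F+1 rule (my sketch, not from the slides), a front end can tally the replies from the 2F+1 RMs and return as soon as some value has F+1 identical copies; with at most F faulty RMs, that value must be correct.

  from collections import Counter

  def synthesize_response(replies, F):
      """Return the first value backed by F+1 identical replies; F+1
      matching replies cannot all come from the <= F faulty RMs."""
      counts = Counter()
      for value in replies:            # replies arrive from the RM group
          counts[value] += 1
          if counts[value] >= F + 1:
              return value
      return None                      # not enough agreement yet

  # Example: F = 1, so 2F+1 = 3 RMs; the one faulty reply is outvoted.
  print(synthesize_response(["ok", "bad", "ok"], F=1))   # -> "ok"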
9. Available Copies Replication
[Figure: client transactions issuing getBalance(A), deposit(A), getBalance(B), deposit(B) against a group of replica managers]
- Not all copies will always be available.
- Failures:
  - Timeout at a failed replica
  - Rejected by a recovering, unsynchronised replica
10. Passive Replication (I)
- At any time, the system has a single primary RM
- One or more secondary backup RMs
- Front ends communicate with the primary; the primary executes requests & propagates the response to all backups
- If the primary fails, one backup is promoted to primary
- The new primary starts from the Coordination phase for each new request
- What happens if the primary crashes before/during/after the agreement phase?
11. Passive Replication (II)
12. Passive replication (III)
- Satisfies linearizability
- The front end looks up the new primary when the current primary does not respond
- The primary RM is a performance bottleneck
- Can tolerate F failures for F+1 RMs
- Variation: clients can access backup RMs (linearizability is lost, but clients get sequential consistency)
- Sun NIS (Yellow Pages) uses passive replication: clients can contact the primary or backup servers for reads, but only the primary server for updates
13. Consensus for HA Systems (I)
- B.W. Lampson, "How to Build a Highly Available System Using Consensus", in Distributed Algorithms, ed. Babaoglu and Marzullo, Lecture Notes in Computer Science 1151, Springer, 1996, pp. 1-17
- Based on Lamport's framework
- Replicated state machines
  - Deterministic function: (state, input) -> (new_state, output) (see the sketch below)
- Paxos consensus algorithm
  - Ensures that all non-faulty processes see the same inputs
  - We can make the order a part of the input value by defining a total order on the set of inputs
- Analysis of concurrent systems
- Leases
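The replicated-state-machine idea reduces to a pure transition function: if every replica applies the same deterministic function to the same totally ordered input log, all replicas stay identical. A minimal sketch (my own, with an illustrative counter state):

  def step(state, inp):
      """Deterministic transition: (state, input) -> (new_state, output)."""
      if inp == "inc":
          return state + 1, state + 1
      if inp == "read":
          return state, state
      return state, None     # unknown inputs leave the state unchanged

  def replay(log, state=0):
      """Apply a totally ordered input log, as each replica would."""
      outputs = []
      for inp in log:
          state, out = step(state, inp)
          outputs.append(out)
      return state, outputs

  log = ["inc", "inc", "read"]
  assert replay(log) == replay(log)   # two replicas with the same log agree
  print(replay(log))                  # (2, [1, 2, 2])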
14. Consensus for HA Systems (II)
- Lamport's Paxos algorithm
  - It is run by a set of leader processes that guide a set of agent processes to achieve consensus.
  - It is correct no matter how many simultaneous leaders there are and no matter how often leader or agent processes fail and recover, how slow they are, or how many messages are lost, delayed, or duplicated.
  - It terminates if there is a single leader for a long enough time, during which the leader can talk to a majority of the agent processes twice.
  - It may not terminate if there are always too many leaders.
- Guaranteed termination is impossible!
15. Consensus for HA Systems (III)
- Sequence of rounds
- In each round, a single leader attempts to reach consensus on a single variable:
  - Query the agents to learn their status in past rounds
  - Select a value and command the agents to accept it
  - If a majority of the agents accept the value, propagate this value to all as the outcome
- 2½ round-trips for a successful round
- If the leader fails repeatedly, or more than one leader competes, it may take multiple rounds to reach consensus (see the sketch below)
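A compressed single-round sketch of the query/command exchange described above (an assumption-laden toy of mine, not Lampson's full algorithm): agents promise to ignore lower-numbered rounds, and the leader must adopt the highest-numbered value already accepted by any agent in its majority.

  class Agent:
      def __init__(self):
          self.promised = 0             # highest round number promised
          self.accepted = (0, None)     # (round, value) last accepted

      def on_query(self, n):            # phase 1: leader queries agent status
          if n > self.promised:
              self.promised = n
              return self.accepted      # report any past accepted value
          return None                   # refuse stale rounds

      def on_command(self, n, value):   # phase 2: leader commands acceptance
          if n >= self.promised:
              self.promised = n
              self.accepted = (n, value)
              return True
          return False

  def run_round(agents, n, my_value):
      """One leader's attempt at round n; returns the chosen value or None."""
      replies = [r for r in (a.on_query(n) for a in agents) if r is not None]
      if len(replies) <= len(agents) // 2:
          return None                   # no majority responded: round fails
      # Adopt the highest-round value already accepted, else choose freely.
      prev_round, prev_value = max(replies)
      value = prev_value if prev_value is not None else my_value
      acks = sum(a.on_command(n, value) for a in agents)
      return value if acks > len(agents) // 2 else None

  agents = [Agent() for _ in range(5)]
  print(run_round(agents, n=1, my_value="lease-to-A"))   # -> 'lease-to-A'
  print(run_round(agents, n=2, my_value="lease-to-B"))   # -> 'lease-to-A'

The second round illustrates stability: a later leader with a different proposal is forced to re-propagate the already-chosen value.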
16. Bayou
- A data management system to support collaboration in diverse network environments
  - Variable degrees of connectedness
  - Relies only on occasional pair-wise communication
  - No notion of a disconnected mode of operation
- Tentative & committed data
- Update everywhere
  - No locking
- Incremental reconciliation
- Eventual consistency
- Application participation
  - Conflict detection & resolution
  - Merge procedures
- No replication transparency
17. Design Overview
- Client stub (API) & servers
- Per-replica state:
  - Database (Tuple Store)
  - Write log & undo log
  - Version vector
- Reconciliation protocol (peer-to-peer)
- Eventual consistency
  - All replicas receive all updates
  - Any two replicas that have received the same set of updates have identical databases.
18. Accommodating weak connectivity
- Weakly consistent replication
  - read-any/write-any access
  - Session & sharing semantics
- Epidemic propagation
  - pair-wise contacts (anti-entropy sessions)
  - agree on the set of writes & their order
- Convergence rate depends on
  - connectivity & frequency of anti-entropy sessions
  - partner selection policy
19. Conventional Approaches
- Version vectors
- Optimistic concurrency control
- Problems
  - Concurrent writes to the same object may not conflict, while writes to different objects may conflict, depending on object granularity
- Bayou: account for application semantics
  - conflict detection with the help of the application
20. Conflict detection & resolution
- Shared calendar example
  - Conflicting meetings overlap in time
  - Resolution: reschedule to an alternate time
- Dependency check
  - included in every write, together with the expected result
  - calls the merge procedure if a conflict is detected
- Merge procedure
  - can query the database
  - produces a new update
21. A Bayou write
- Processed at each replica (a runnable sketch follows below):

  Bayou_Write(update, dep_check, mergeproc)
    IF (DB_EVAL(dep_check.query) <> dep_check.expected_result)
      resolved_update = EXECUTE(mergeproc)
    ELSE
      resolved_update = update
    DB_EXEC(resolved_update)
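A minimal executable rendering of this pseudocode (my sketch; the in-memory dict database and the query/merge callables are illustrative assumptions):

  def bayou_write(db, update, dep_check, mergeproc):
      """Apply a Bayou write at one replica: evaluate the dependency check,
      and if the database no longer matches the expected result, let the
      merge procedure produce a substitute update to execute instead."""
      query, expected_result = dep_check
      if query(db) != expected_result:    # conflict detected
          resolved_update = mergeproc(db)
      else:
          resolved_update = update
      resolved_update(db)                 # DB_EXEC against the tuple store

  # Example: book a 1:30pm slot, falling back to 3:00pm on conflict.
  def book(slot, title):
      return lambda db: db.__setitem__(slot, title)

  db = {"1:30pm": "Staff Meeting"}                     # slot already taken
  bayou_write(
      db,
      update=book("1:30pm", "Budget Meeting"),
      dep_check=(lambda db: db.get("1:30pm"), None),   # expect slot free
      mergeproc=lambda db: book("3:00pm", "Budget Meeting"),
  )
  print(db)   # {'1:30pm': 'Staff Meeting', '3:00pm': 'Budget Meeting'}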
22. Example of a write in the shared calendar
- Update = insert, Meetings, 12/18/95, 1:30pm, 60min, "Budget Meeting"
- Dependency_check = { query = "SELECT key FROM Meetings WHERE day = 12/18/95 AND start < 2:30pm AND end > 1:30pm", expected_result = EMPTY }
- MergeProc:

    alternates = { 12/18/95 3:00pm, 12/19/95 9:30am }
    FOREACH a IN alternates
      /* check if feasible, produce newupdate */
    IF (newupdate = NULL)   /* no feasible alternate */
      newupdate = insert, ErrorLog, Update
    RETURN(newupdate)
23. Eventual consistency
- Propagation
  - All replicas receive all updates
  - via chains of pair-wise interactions
- Determinism
  - All replicas execute writes in the same way
  - Including conflict detection & resolution
- Global order
  - All replicas apply writes to their databases in the same order
  - Since writes include arbitrarily complex merge procedures, it is effectively impossible to determine whether two writes commute, or to transform them so that they can be re-ordered
- Tentative writes are ordered by the timestamp assigned by their accepting servers
  - Total order using <timestamp, serverID>
  - Desirable so that a cluster of isolated servers agrees on the tentative resolution of conflicts
24. Undoing & re-applying writes
- Servers may accept writes (from clients or other servers) in an order that differs from the required execution order
- Servers immediately apply all known writes
- Therefore, servers must be able to undo the effects of a previous tentative execution of a write & re-apply it in a different order (see the sketch below)
- The number of re-executions depends only on the order in which writes arrive (via anti-entropy sessions)
  - Not on the likelihood of conflicts
- Each server maintains a write log & an undo log
  - Sorted by committed or tentative timestamp
  - Committed writes take precedence
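A sketch of rollback-and-replay under simplifying assumptions of mine (the database is a dict, writes are (timestamp, function) pairs, and the undo log stores full snapshots rather than per-write undo records):

  import copy

  class Replica:
      def __init__(self):
          self.db = {}
          self.write_log = []   # (timestamp, write_fn), sorted by timestamp
          self.undo_log = []    # snapshot of db taken before each applied write

      def receive(self, ts, write_fn):
          """Insert a write in timestamp order; if it lands in the middle of
          the log, undo everything after it and re-apply in the new order."""
          pos = len([w for w in self.write_log if w[0] < ts])
          self.write_log.insert(pos, (ts, write_fn))
          if pos < len(self.undo_log):             # roll back past executions
              self.db = copy.deepcopy(self.undo_log[pos])
              del self.undo_log[pos:]
          for t, fn in self.write_log[pos:]:       # re-apply in timestamp order
              self.undo_log.append(copy.deepcopy(self.db))
              fn(self.db)

  r = Replica()
  r.receive(2, lambda db: db.__setitem__("x", "late write"))
  r.receive(1, lambda db: db.__setitem__("x", "early write"))  # forces rollback
  print(r.db)   # {'x': 'late write'}: timestamp order wins, not arrival order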
25. Constraints on writes
- Must produce the same result on all replicas with equal write logs preceding that write
- Client-provided merge procedures:
  - can only access the database & the parameters provided
  - cannot access time-varying or server-specific state
    - pid, time, file system
  - have uniform bounds on memory & processor usage
    - so that failures due to resource exhaustion are uniform
  - Otherwise, non-deterministic behavior!
26. Global order
- Writes are totally ordered w.r.t. their write-stamp (see the comparator sketch below)
  - (commit-stamp, accept-stamp, server-id)
- accept-stamp
  - assigned by the server that initially receives the write
  - derived from a logical clock
  - monotonically increasing
  - global clock synchronization is not required
- commit-stamp
  - initialized to ∞ (the write is still tentative)
  - updated when the write is stabilized
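Since tuples compare lexicographically, the total order falls out of a three-field key; a small sketch of mine, using float('inf') for the unset commit-stamp so that committed writes sort before all tentative ones:

  INF = float("inf")   # commit-stamp of a still-tentative write

  def write_stamp(commit_stamp, accept_stamp, server_id):
      """Lexicographic key: committed writes (finite commit-stamp) order
      before tentative ones; ties break on accept-stamp, then server id."""
      return (commit_stamp, accept_stamp, server_id)

  writes = [
      write_stamp(INF, 17, "B"),   # tentative
      write_stamp(3,   42, "A"),   # committed as the 3rd write
      write_stamp(INF, 12, "C"),   # tentative, earlier accept-stamp
  ]
  print(sorted(writes))
  # [(3, 42, 'A'), (inf, 12, 'C'), (inf, 17, 'B')]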
27. Stabilizing writes (I)
- A write is stable at a replica when it has been executed for the last time at that replica
  - All writes with earlier write-stamps are known to the replica, and no future writes will be given earlier write-stamps
- Convenient for applications to have a notion of confirmation/commitment
- Stabilize as soon as possible
  - allow replicas to prune their write logs
  - inform applications/users that writes have been confirmed -or- fully resolved
28. Stabilizing writes (II)
- A write may be executed several times at a server & may produce different results
  - depending on the server's execution history
- The Bayou API provides means to inquire about the stability of a write
- By including the current clock value in anti-entropy sessions, a server can determine that a write is stable when it has a lower timestamp than all servers' clocks
- A single server that remains disconnected may prevent writes from stabilizing, and cause rollbacks upon its re-connection
  - Hence an explicit commit procedure
29. Committing writes
- In a given data collection, one replica is designated the primary
  - Commits writes by assigning a commit-stamp
  - No requirement for a majority quorum
  - A disconnected server can be the primary for a user's personal data objects
- Committing a write makes it stable
- The commit-stamp determines the total order
  - Committed writes are ordered before tentative writes
  - Replicas are informed of committed writes in commit-stamp order
30. Applying sets of writes

  Receive_Writes(wset, from)
    IF (from = CLIENT)
      TS = MAX(sysClock, TS + 1)
      w = First(wset)
      w.WID = <TS, mySrvID>
      w.state = TENTATIVE
      WriteLog.Append(w)
      Bayou_Write(w.update, w.dependency_check, w.mergeproc)
    ELSE   /* received via anti-entropy -> wset is ordered */
      w = First(wset)
      insertionPoint = WriteLog.IdentifyInsertionPoint(w.WID)
      TupleStore.RollbackTo(insertionPoint)
      WriteLog.Insert(wset)
      FOREACH w IN WriteLog after insertionPoint
        Bayou_Write(w.update, w.dependency_check, w.mergeproc)
      w = Last(wset)
      TS = MAX(TS, w.WID.timestamp)
31. Epidemic protocols (I)
- Scalable propagation of updates in eventually-consistent data stores
  - eventually all replicas receive all updates
  - in as few msgs as possible
  - Aggregation of multiple updates in each msg
- Classification of servers
  - Infective
  - Susceptible
  - Removed
32. Epidemic protocols (II)
- Anti-entropy propagation
  - Server P randomly selects server Q
- Options (compared in the simulation sketch below)
  - P pushes updates to Q
    - Problem of delays if there are relatively many infective servers
  - P pulls updates from Q
    - Spreading of updates is triggered by susceptible servers
  - P and Q exchange updates (push/pull)
- Assuming a single server is initially infective, both push & pull eventually spread all updates
- Optimization
  - Ensure that at least a number of servers immediately become infective
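The difference between the options can be seen in a tiny round-based simulation (entirely my construction; exact round counts vary per run): in each round every server contacts one random partner and the update propagates by push, pull, or both.

  import random

  def simulate(n, mode, seed=1):
      """Rounds of anti-entropy until all n servers hold the update.
      mode: 'push', 'pull', or 'push-pull'."""
      random.seed(seed)
      infected = {0}                     # a single infective server initially
      rounds = 0
      while len(infected) < n:
          rounds += 1
          for p in range(n):
              q = random.randrange(n)    # random partner selection
              if "push" in mode and p in infected:
                  infected.add(q)        # P pushes its updates to Q
              if "pull" in mode and q in infected:
                  infected.add(p)        # P pulls updates from Q
      return rounds

  for mode in ("push", "pull", "push-pull"):
      print(mode, simulate(1000, mode))  # push-pull typically converges fastest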
33. Epidemic protocols (III)
- Rumor spreading (gossiping)
  - Server P randomly selects Q to push updates
  - If Q has already seen P's updates, then P may lose interest with probability 1/k
- Rapid propagation
  - but no guarantee that all servers will see all updates
- Let s be the fraction of servers that remain ignorant of an update; then
    s = e^(-(k+1)(1-s))
  e.g. k = 3 -> s < 0.02 (see the numeric check below)
- Enhancements by combining gossiping with anti-entropy
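The residue equation has no closed form, but iterating the fixed point confirms the quoted figure; a quick check (my own):

  import math

  def ignorant_fraction(k, iters=100):
      """Solve s = exp(-(k+1)(1-s)) by fixed-point iteration."""
      s = 0.5                       # any start in (0, 1) converges here
      for _ in range(iters):
          s = math.exp(-(k + 1) * (1 - s))
      return s

  print(ignorant_fraction(3))       # ~0.0198 < 0.02, matching the slide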
34. Epidemic protocols (IV)
- Spreading a deletion is hard
  - After removing a data item, a server may receive old copies!
  - Must record deletions & spread them
- Death certificates
  - Time-stamped upon creation
  - Enforce a TTL on certificates
    - Based on an estimate of the max. update propagation time
  - Maintain a few dormant death certificates
    - that never expire
    - as a guarantee that a death certificate can be re-spread in case an obsolete update is received for a deleted data item
35. The gossip architecture (I)
- Replicate data close to the points where groups of clients need it
- Periodic exchange of msgs among RMs
- Front-ends send queries & updates to any RM they choose
  - Any RM that is available can provide acceptable response times
- Consistent service over time
- Relaxed consistency between replicas
36. The gossip architecture (II)
- Causal update ordering
- Forced ordering
  - Causal & total
  - A forced-order & a causal-order update that are related by the happened-before relation may be applied in different orders at different RMs!
- Immediate ordering
  - Updates are applied in a consistent order relative to any other update at all RMs
37. The gossip architecture (III)
- Bulletin board application example
  - Posting items -> causal order
  - Adding a subscriber -> forced order
  - Removing a subscriber -> immediate order
- Gossip messages: updates among RMs
- Front-ends maintain a prev vector timestamp
  - One entry per RM
- RMs respond with a new vector timestamp
38. State components of a gossip RM
39. Query operations in gossip
- The RM must return a value that is at least as recent as the request's timestamp
  - q.prev ≤ valueTS (component-wise vector comparison; see the helper below)
- List of pending query operations
  - Held back until the above condition is satisfied
  - The RM can wait for missing updates, or request updates from the RMs concerned
- The RM's response includes valueTS
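The hold-back test is the standard component-wise vector-timestamp comparison; a small helper of mine makes it concrete:

  def vec_leq(a, b):
      """q.prev <= valueTS holds iff every component of a is <= b's."""
      return all(x <= y for x, y in zip(a, b))

  value_ts = [2, 5, 1]                 # updates reflected in this RM's value
  print(vec_leq([2, 4, 1], value_ts))  # True: query can be answered now
  print(vec_leq([2, 6, 1], value_ts))  # False: hold back, a 6th update from RM 1 is missing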
40. Updates in causal order
- RM i checks whether the operation ID is in its executed table or in its log
  - Discards the update if it has already been seen
- Increments the i-th element of its replica timestamp
  - Count of updates received from front-ends
- Assigns a vector timestamp ts to the update
  - Replaces the i-th element of u.prev by the i-th element of the replica timestamp
- Inserts a log entry
  - <i, ts, u.op, u.prev, u.id>
- Stability condition: u.prev ≤ valueTS (see the sketch below)
  - All updates on which u depends have been applied
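Putting the steps together, a minimal gossip-RM update path under my own simplifying assumptions (three RMs, updates carried as functions, no gossip exchange or executed-table pruning):

  def vec_leq(a, b):
      return all(x <= y for x, y in zip(a, b))

  class GossipRM:
      def __init__(self, i, n):
          self.i = i
          self.replica_ts = [0] * n   # updates received by this RM
          self.value_ts = [0] * n     # updates applied to the value
          self.value = {}
          self.log = []               # pending update records

      def receive_update(self, u_op, u_prev, u_id):
          self.replica_ts[self.i] += 1           # one more front-end update
          ts = list(u_prev)
          ts[self.i] = self.replica_ts[self.i]   # assign the update's timestamp
          self.log.append((ts, u_op, u_prev, u_id))
          self.apply_stable()
          return ts                              # returned to the front end

      def apply_stable(self):
          progress = True
          while progress:                        # applying one may free others
              progress = False
              for rec in list(self.log):
                  ts, op, prev, _ = rec
                  if vec_leq(prev, self.value_ts):   # stability: deps applied
                      op(self.value)
                      self.value_ts = [max(x, y) for x, y in zip(self.value_ts, ts)]
                      self.log.remove(rec)
                      progress = True

  rm = GossipRM(i=0, n=3)
  print(rm.receive_update(lambda v: v.__setitem__("post", "hello"), [0, 0, 0], "u1"))
  print(rm.value_ts, rm.value)    # [1, 0, 0] {'post': 'hello'}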
41. Forced & immediate order
- A unique sequence number is appended to update timestamps
- The primary RM acts as sequencer
  - Another RM can be elected to take over consistently as sequencer
  - A majority of RMs (including the primary) must record which update is next in the sequence
- Immediate ordering: the primary orders these updates in the sequence along with forced updates, considering causal updates as well
- Agreement protocol on the sequence
42. Gossip timestamps
- Gossip msgs between RMs carry
  - Replica timestamp & log
- Receiver's tasks
  - Merge the arriving log m.log with its own
    - Add record r to the local log if replicaTS < r.ts
  - Apply any updates that have become stable
    - This may in turn make pending updates become stable
    - Sort the set of stable updates in timestamp order: r is applied only if there is no s s.t. s.prev < r.prev
  - Eliminate records from the log & entries in the executed table
    - Once it is established that they have been applied everywhere
    - Timestamp table: tableTS[j] = m.ts for gossip received from RM j
    - If tableTS[i][c] ≥ r.ts[c] for all i, then r is discarded (c: the RM that created record r)
    - ACKs by front-ends allow discarding records from the executed table
43. Update propagation
- How long before all RMs receive an update?
- Frequency & duration of network partitions
  - Beyond the system's control!
- Frequency of gossip msgs
- Policy for choosing a gossip partner (see the sketch below)
  - Random
    - Weighted probabilities to favor near partners
    - Surprisingly robust!
    - But exhibits variable update propagation times
  - Deterministic
    - Simple function of the RM's state
    - E.g. examine the timestamp table & choose the RM that appears to be furthest behind in updates received
  - Topological
    - Based on a fixed arrangement of RMs into a graph
    - Ring, mesh, trees
    - Trade-off: amount of communication against higher latencies & the possibility that a single failure will affect other RMs
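The random and deterministic policies above are a few lines each; a sketch with assumed inputs (a weight list for the random policy, the local timestamp table for the deterministic one):

  import random

  def random_partner(weights, me):
      """Random policy: pick a partner with probabilities favoring near RMs."""
      others = [j for j in range(len(weights)) if j != me]
      return random.choices(others, weights=[weights[j] for j in others])[0]

  def furthest_behind(table_ts, me):
      """Deterministic policy: the RM whose known timestamp vector sums
      lowest appears to have received the fewest updates."""
      others = [j for j in range(len(table_ts)) if j != me]
      return min(others, key=lambda j: sum(table_ts[j]))

  table_ts = [[4, 2, 3], [4, 2, 3], [1, 0, 2]]   # RM 2 lags behind
  print(furthest_behind(table_ts, me=0))          # -> 2
  print(random_partner([0.0, 0.7, 0.3], me=0))    # mostly picks RM 1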
44. Scalability concerns
- 2 messages per query (between front-end & RM)
- Causal update: with G updates batched per gossip message, 2 + (R-1)/G messages are exchanged (see the worked example below)
- Increasing G leads to
  - fewer messages
  - but also worse delivery latencies
    - the RM has to wait for more updates to arrive before propagating them
- Improvement by having read-only replicas
  - Provided that the update/query ratio is low!
  - Updated by gossip msgs, but do not receive updates directly from front-ends
  - Can be situated close to client groups
  - Vector timestamps need only include the updateable RMs
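A quick evaluation of the 2 + (R-1)/G cost (my arithmetic, for illustrative R and G) shows how batching trades messages for latency:

  def msgs_per_causal_update(R, G):
      """2 front-end messages plus the update's share of (R-1) gossip msgs,
      each gossip message amortized over G batched updates."""
      return 2 + (R - 1) / G

  for G in (1, 5, 25):
      print(G, msgs_per_causal_update(R=26, G=G))
  # G=1 -> 27.0, G=5 -> 7.0, G=25 -> 3.0: fewer msgs, higher batching delay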
45. References
- R. Ladin, B. Liskov, L. Shrira and S. Ghemawat, "Providing High Availability Using Lazy Replication", ACM Trans. on Computer Systems, vol. 10, no. 4, pp. 360-391, 1992.
- A. Demers, D. Greene, C. Hauser, W. Irish and J. Larson, "Epidemic Algorithms for Replicated Database Maintenance", Proc. 6th ACM Symposium on Principles of Distributed Computing, pp. 1-12, 1987.
- D. B. Terry, M. M. Theimer, K. Petersen, A. J. Demers, M. J. Spreitzer and C. H. Hauser, "Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System", Proc. 15th ACM SOSP, pp. 172-183, 1995.