Title: CS556: Distributed Systems
1. CS-556 Distributed Systems
Consistency & Replication (III)
- Manolis Marazakis
- maraz_at_csd.uoc.gr
2. Fault Tolerance?
- Define correctness criteria
- When 2 replicas are separated by a network partition:
  - Both are deemed incorrect & stop serving.
  - One (the master) continues & the other ceases service.
  - One (the master) continues to accept updates & both continue to supply reads (of possibly stale data).
  - Both continue service & subsequently synchronise.
3. Fault Tolerance
- Design to recover after a failure with no loss of (committed) data.
- Designs for fault tolerance:
  - Single server, fail and recover
  - Primary server with trailing backups
  - Replicated service
4. Network Partitions
- Separate but viable groups of servers
- Optimistic schemes validate on recovery
  - Available copies with validation
- Pessimistic schemes limit availability until recovery
5. Replication under partitions
- Available copies with validation
  - Validation involves aborting conflicting Txs & compensations
  - Precedence graphs for detecting inconsistencies
- Quorum consensus (see the sketch below)
  - Version number per replica
  - Operations should only be applied to replicas with the current version number
- Virtual partition
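To make the quorum idea concrete, here is a minimal sketch of my own (not from the slides): read/write quorum access with per-replica version numbers, under the usual constraints R + W > N and W > N/2. The Replica class and helper names are illustrative assumptions.

  # Minimal quorum-consensus sketch (illustrative names and structure).
  # Correctness constraints: R + W > N (read & write quorums intersect)
  # and W > N/2 (any two write quorums intersect).

  class Replica:
      def __init__(self):
          self.version = 0
          self.value = None

  def quorum_read(replicas, R):
      """Contact R replicas; return the value carrying the highest version."""
      responses = replicas[:R]                 # stand-in for contacting R servers
      latest = max(responses, key=lambda r: r.version)
      return latest.value, latest.version

  def quorum_write(replicas, W, value):
      """Learn the current version from a write quorum, then install the
      value with version+1 at every quorum member."""
      group = replicas[:W]                     # stand-in for a write quorum
      current = max(r.version for r in group)
      for r in group:
          r.version = current + 1              # new current version
          r.value = value

  N, R, W = 5, 3, 3                            # satisfies R + W > N and W > N/2
  replicas = [Replica() for _ in range(N)]
  quorum_write(replicas, W, "x=42")
  print(quorum_read(replicas, R))              # ('x=42', 1)

Because every read quorum overlaps every write quorum, the read always sees at least one replica with the current version number.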
6. Transactions with Replicated Data
- Better performance
  - Concurrent service
  - Reduced latency
- Higher availability
  - Fault tolerance
- What if a replica fails or becomes isolated?
  - Upon rejoining, it must catch up
- Replicated transaction service
  - Data replicated at a set of replica managers
  - Replication transparency
- One-copy serializability
- Read one, write all
- Failures must be observed to have happened before any active Txs at other servers
7. Active replication (I)
8. Active replication (II)
- RMs are state machines with equivalent roles
- Front ends communicate the client requests to the RM group, using totally ordered reliable multicast
- RMs process requests independently & reply to the front end (correct RMs process each request identically)
- The front end can synthesize the final response to the client (tolerating Byzantine failures)
- Active replication provides sequential consistency if the multicast is reliable & ordered
- Byzantine failures (F out of 2F+1): the front end waits until it gets F+1 identical responses (see the sketch below)
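As a toy illustration of the F+1 rule (my sketch, not from the slides), a front end can tally the replies from the 2F+1 RMs and return as soon as some value has F+1 identical copies; with at most F faulty RMs, that value must be correct.

  from collections import Counter

  def synthesize_response(replies, F):
      """Return the first value backed by F+1 identical replies; F+1
      matching replies cannot all come from the <= F faulty RMs."""
      counts = Counter()
      for value in replies:            # replies arrive from the RM group
          counts[value] += 1
          if counts[value] >= F + 1:
              return value
      return None                      # not enough agreement yet

  # Example: F = 1, so 2F+1 = 3 RMs; the one faulty reply is outvoted.
  print(synthesize_response(["ok", "bad", "ok"], F=1))   # -> "ok"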
9. Available Copies Replication
[Figure: client transactions issuing getBalance(A), deposit(A), getBalance(B), deposit(B) against a group of replica managers]
- Not all copies will always be available.
- Failures:
  - Timeout at a failed replica
  - Rejected by a recovering, unsynchronised replica
10. Passive Replication (I)
- At any time, the system has a single primary RM
- One or more secondary backup RMs
- Front ends communicate with the primary; the primary executes requests & propagates the response to all backups
- If the primary fails, one backup is promoted to primary
- The new primary starts from the Coordination phase for each new request
- What happens if the primary crashes before/during/after the agreement phase?
11. Passive Replication (II)
12. Passive replication (III)
- Satisfies linearizability
- The front end looks up the new primary when the current primary does not respond
- The primary RM is a performance bottleneck
- Can tolerate F failures for F+1 RMs
- Variation: clients can access backup RMs (linearizability is lost, but clients get sequential consistency)
- Sun NIS (Yellow Pages) uses passive replication: clients can contact the primary or backup servers for reads, but only the primary server for updates
13. Consensus for HA Systems (I)
- B.W. Lampson, "How to Build a Highly Available System Using Consensus", in Distributed Algorithms, ed. Babaoglu and Marzullo, Lecture Notes in Computer Science 1151, Springer, 1996, pp. 1-17
- Based on Lamport's framework
- Replicated state machines
  - Deterministic function: (state, input) -> (new_state, output) (see the sketch below)
- Paxos consensus algorithm
  - Ensures that all non-faulty processes see the same inputs
  - We can make the order a part of the input value by defining a total order on the set of inputs
- Analysis of concurrent systems
- Leases
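The replicated-state-machine idea reduces to a pure transition function: if every replica applies the same deterministic function to the same totally ordered input log, all replicas stay identical. A minimal sketch (my own, with an illustrative counter state):

  def step(state, inp):
      """Deterministic transition: (state, input) -> (new_state, output)."""
      if inp == "inc":
          return state + 1, state + 1
      if inp == "read":
          return state, state
      return state, None     # unknown inputs leave the state unchanged

  def replay(log, state=0):
      """Apply a totally ordered input log, as each replica would."""
      outputs = []
      for inp in log:
          state, out = step(state, inp)
          outputs.append(out)
      return state, outputs

  log = ["inc", "inc", "read"]
  assert replay(log) == replay(log)   # two replicas with the same log agree
  print(replay(log))                  # (2, [1, 2, 2])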
14. Consensus for HA Systems (II)
- Lamport's Paxos algorithm
  - It is run by a set of leader processes that guide a set of agent processes to achieve consensus.
  - It is correct no matter how many simultaneous leaders there are and no matter how often leader or agent processes fail and recover, how slow they are, or how many messages are lost, delayed, or duplicated.
  - It terminates if there is a single leader for a long enough time, during which the leader can talk to a majority of the agent processes twice.
  - It may not terminate if there are always too many leaders.
- Guaranteed termination is impossible!
15. Consensus for HA Systems (III)
- Sequence of rounds
- In each round, a single leader attempts to reach consensus on a single variable:
  - Query the agents to learn their status in past rounds
  - Select a value and command the agents to accept it
  - If a majority of the agents accept the value, propagate this value to all as the outcome
- 2½ round-trips for a successful round
- If the leader fails repeatedly, or more than one leader competes, it may take multiple rounds to reach consensus (see the sketch below)
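A compressed single-round sketch of the query/command exchange described above (an assumption-laden toy of mine, not Lampson's full algorithm): agents promise to ignore lower-numbered rounds, and the leader must adopt the highest-numbered value already accepted by any agent in its majority.

  class Agent:
      def __init__(self):
          self.promised = 0             # highest round number promised
          self.accepted = (0, None)     # (round, value) last accepted

      def on_query(self, n):            # phase 1: leader queries agent status
          if n > self.promised:
              self.promised = n
              return self.accepted      # report any past accepted value
          return None                   # refuse stale rounds

      def on_command(self, n, value):   # phase 2: leader commands acceptance
          if n >= self.promised:
              self.promised = n
              self.accepted = (n, value)
              return True
          return False

  def run_round(agents, n, my_value):
      """One leader's attempt at round n; returns the chosen value or None."""
      replies = [r for r in (a.on_query(n) for a in agents) if r is not None]
      if len(replies) <= len(agents) // 2:
          return None                   # no majority responded: round fails
      # Adopt the highest-round value already accepted, else choose freely.
      prev_round, prev_value = max(replies)
      value = prev_value if prev_value is not None else my_value
      acks = sum(a.on_command(n, value) for a in agents)
      return value if acks > len(agents) // 2 else None

  agents = [Agent() for _ in range(5)]
  print(run_round(agents, n=1, my_value="lease-to-A"))   # -> 'lease-to-A'
  print(run_round(agents, n=2, my_value="lease-to-B"))   # -> 'lease-to-A'

The second round illustrates stability: a later leader with a different proposal is forced to re-propagate the already-chosen value.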
16. Bayou
- A data management system to support collaboration in diverse network environments
  - Variable degrees of connectedness
  - Relies only on occasional pair-wise communication
  - No notion of a disconnected mode of operation
- Tentative & committed data
- Update everywhere
  - No locking
- Incremental reconciliation
- Eventual consistency
- Application participation
  - Conflict detection & resolution
  - Merge procedures
- No replication transparency
17. Design Overview
- Client stub (API) & servers
- Per-replica state:
  - Database (Tuple Store)
  - Write log & undo log
  - Version vector
- Reconciliation protocol (peer-to-peer)
- Eventual consistency
  - All replicas receive all updates
  - Any two replicas that have received the same set of updates have identical databases.
18. Accommodating weak connectivity
- Weakly consistent replication
  - read-any/write-any access
  - Session & sharing semantics
- Epidemic propagation
  - pair-wise contacts (anti-entropy sessions)
  - agree on the set of writes & their order
- Convergence rate depends on
  - connectivity & frequency of anti-entropy sessions
  - partner selection policy
19. Conventional Approaches
- Version vectors
- Optimistic concurrency control
- Problems
  - Concurrent writes to the same object may not conflict, while writes to different objects may conflict, depending on object granularity
- Bayou: account for application semantics
  - conflict detection with the help of the application
20. Conflict detection & resolution
- Shared calendar example
  - Conflicting meetings overlap in time
  - Resolution: reschedule to an alternate time
- Dependency check
  - included in every write, together with the expected result
  - calls the merge procedure if a conflict is detected
- Merge procedure
  - can query the database
  - produces a new update
21. A Bayou write
- Processed at each replica (a runnable sketch follows below):

  Bayou_Write(update, dep_check, mergeproc)
    IF (DB_EVAL(dep_check.query) <> dep_check.expected_result)
      resolved_update = EXECUTE(mergeproc)
    ELSE
      resolved_update = update
    DB_EXEC(resolved_update)
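A minimal executable rendering of this pseudocode (my sketch; the in-memory dict database and the query/merge callables are illustrative assumptions):

  def bayou_write(db, update, dep_check, mergeproc):
      """Apply a Bayou write at one replica: evaluate the dependency check,
      and if the database no longer matches the expected result, let the
      merge procedure produce a substitute update to execute instead."""
      query, expected_result = dep_check
      if query(db) != expected_result:    # conflict detected
          resolved_update = mergeproc(db)
      else:
          resolved_update = update
      resolved_update(db)                 # DB_EXEC against the tuple store

  # Example: book a 1:30pm slot, falling back to 3:00pm on conflict.
  def book(slot, title):
      return lambda db: db.__setitem__(slot, title)

  db = {"1:30pm": "Staff Meeting"}                     # slot already taken
  bayou_write(
      db,
      update=book("1:30pm", "Budget Meeting"),
      dep_check=(lambda db: db.get("1:30pm"), None),   # expect slot free
      mergeproc=lambda db: book("3:00pm", "Budget Meeting"),
  )
  print(db)   # {'1:30pm': 'Staff Meeting', '3:00pm': 'Budget Meeting'}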
22. Example of a write in the shared calendar
- Update = insert, Meetings, 12/18/95, 1:30pm, 60min, "Budget Meeting"
- Dependency_check = { query = "SELECT key FROM Meetings WHERE day = 12/18/95 AND start < 2:30pm AND end > 1:30pm", expected_result = EMPTY }
- MergeProc:

    alternates = { 12/18/95 3:00pm, 12/19/95 9:30am }
    FOREACH a IN alternates
      /* check if feasible, produce newupdate */
    IF (newupdate = NULL)   /* no feasible alternate */
      newupdate = insert, ErrorLog, Update
    RETURN(newupdate)
23. Eventual consistency
- Propagation
  - All replicas receive all updates
  - via chains of pair-wise interactions
- Determinism
  - All replicas execute writes in the same way
  - Including conflict detection & resolution
- Global order
  - All replicas apply writes to their databases in the same order
  - Since writes include arbitrarily complex merge procedures, it is effectively impossible to determine whether two writes commute, or to transform them so that they can be re-ordered
- Tentative writes are ordered by the timestamp assigned by their accepting servers
  - Total order using <timestamp, serverID>
  - Desirable so that a cluster of isolated servers agrees on the tentative resolution of conflicts
24. Undoing & re-applying writes
- Servers may accept writes (from clients or other servers) in an order that differs from the required execution order
- Servers immediately apply all known writes
- Therefore, servers must be able to undo the effects of a previous tentative execution of a write & re-apply it in a different order (see the sketch below)
- The number of re-executions depends only on the order in which writes arrive (via anti-entropy sessions)
  - Not on the likelihood of conflicts
- Each server maintains a write log & an undo log
  - Sorted by committed or tentative timestamp
  - Committed writes take precedence
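A sketch of rollback-and-replay under simplifying assumptions of mine (the database is a dict, writes are (timestamp, function) pairs, and the undo log stores full snapshots rather than per-write undo records):

  import copy

  class Replica:
      def __init__(self):
          self.db = {}
          self.write_log = []   # (timestamp, write_fn), sorted by timestamp
          self.undo_log = []    # snapshot of db taken before each applied write

      def receive(self, ts, write_fn):
          """Insert a write in timestamp order; if it lands in the middle of
          the log, undo everything after it and re-apply in the new order."""
          pos = len([w for w in self.write_log if w[0] < ts])
          self.write_log.insert(pos, (ts, write_fn))
          if pos < len(self.undo_log):             # roll back past executions
              self.db = copy.deepcopy(self.undo_log[pos])
              del self.undo_log[pos:]
          for t, fn in self.write_log[pos:]:       # re-apply in timestamp order
              self.undo_log.append(copy.deepcopy(self.db))
              fn(self.db)

  r = Replica()
  r.receive(2, lambda db: db.__setitem__("x", "late write"))
  r.receive(1, lambda db: db.__setitem__("x", "early write"))  # forces rollback
  print(r.db)   # {'x': 'late write'}: timestamp order wins, not arrival order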
25. Constraints on writes
- Must produce the same result on all replicas with equal write logs preceding that write
- Client-provided merge procedures:
  - can only access the database & the parameters provided
  - cannot access time-varying or server-specific state
    - pid, time, file system
  - have uniform bounds on memory & processor usage
    - so that failures due to resource exhaustion are uniform
  - Otherwise, non-deterministic behavior!
26. Global order
- Writes are totally ordered w.r.t. their write-stamp (see the comparator sketch below)
  - (commit-stamp, accept-stamp, server-id)
- accept-stamp
  - assigned by the server that initially receives the write
  - derived from a logical clock
  - monotonically increasing
  - global clock synchronization is not required
- commit-stamp
  - initialized to ∞ (the write is still tentative)
  - updated when the write is stabilized
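Since tuples compare lexicographically, the total order falls out of a three-field key; a small sketch of mine, using float('inf') for the unset commit-stamp so that committed writes sort before all tentative ones:

  INF = float("inf")   # commit-stamp of a still-tentative write

  def write_stamp(commit_stamp, accept_stamp, server_id):
      """Lexicographic key: committed writes (finite commit-stamp) order
      before tentative ones; ties break on accept-stamp, then server id."""
      return (commit_stamp, accept_stamp, server_id)

  writes = [
      write_stamp(INF, 17, "B"),   # tentative
      write_stamp(3,   42, "A"),   # committed as the 3rd write
      write_stamp(INF, 12, "C"),   # tentative, earlier accept-stamp
  ]
  print(sorted(writes))
  # [(3, 42, 'A'), (inf, 12, 'C'), (inf, 17, 'B')]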
27. Stabilizing writes (I)
- A write is stable at a replica when it has been executed for the last time at that replica
  - All writes with earlier write-stamps are known to the replica, and no future writes will be given earlier write-stamps
- Convenient for applications to have a notion of confirmation/commitment
- Stabilize as soon as possible
  - allow replicas to prune their write logs
  - inform applications/users that writes have been confirmed -or- fully resolved
28. Stabilizing writes (II)
- A write may be executed several times at a server & may produce different results
  - depending on the server's execution history
- The Bayou API provides means to inquire about the stability of a write
- By including the current clock value in anti-entropy sessions, a server can determine that a write is stable when it has a lower timestamp than all servers' clocks
- A single server that remains disconnected may prevent writes from stabilizing, and cause rollbacks upon its re-connection
  - Hence an explicit commit procedure
29. Committing writes
- In a given data collection, one replica is designated the primary
  - Commits writes by assigning a commit-stamp
  - No requirement for a majority quorum
  - A disconnected server can be the primary for a user's personal data objects
- Committing a write makes it stable
- The commit-stamp determines the total order
  - Committed writes are ordered before tentative writes
  - Replicas are informed of committed writes in commit-stamp order
30. Applying sets of writes

  Receive_Writes(wset, from)
    IF (from = CLIENT)
      TS = MAX(sysClock, TS + 1)
      w = First(wset)
      w.WID = <TS, mySrvID>
      w.state = TENTATIVE
      WriteLog.Append(w)
      Bayou_Write(w.update, w.dependency_check, w.mergeproc)
    ELSE   /* received via anti-entropy -> wset is ordered */
      w = First(wset)
      insertionPoint = WriteLog.IdentifyInsertionPoint(w.WID)
      TupleStore.RollbackTo(insertionPoint)
      WriteLog.Insert(wset)
      FOREACH w IN WriteLog after insertionPoint
        Bayou_Write(w.update, w.dependency_check, w.mergeproc)
      w = Last(wset)
      TS = MAX(TS, w.WID.timestamp)
31. Epidemic protocols (I)
- Scalable propagation of updates in eventually-consistent data stores
  - eventually all replicas receive all updates
  - in as few msgs as possible
  - Aggregation of multiple updates in each msg
- Classification of servers
  - Infective
  - Susceptible
  - Removed
32. Epidemic protocols (II)
- Anti-entropy propagation
  - Server P randomly selects server Q
- Options (compared in the simulation sketch below)
  - P pushes updates to Q
    - Problem of delays if there are relatively many infective servers
  - P pulls updates from Q
    - Spreading of updates is triggered by susceptible servers
  - P and Q exchange updates (push/pull)
- Assuming a single server is initially infective, both push & pull eventually spread all updates
- Optimization
  - Ensure that at least a number of servers immediately become infective
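The difference between the options can be seen in a tiny round-based simulation (entirely my construction; exact round counts vary per run): in each round every server contacts one random partner and the update propagates by push, pull, or both.

  import random

  def simulate(n, mode, seed=1):
      """Rounds of anti-entropy until all n servers hold the update.
      mode: 'push', 'pull', or 'push-pull'."""
      random.seed(seed)
      infected = {0}                     # a single infective server initially
      rounds = 0
      while len(infected) < n:
          rounds += 1
          for p in range(n):
              q = random.randrange(n)    # random partner selection
              if "push" in mode and p in infected:
                  infected.add(q)        # P pushes its updates to Q
              if "pull" in mode and q in infected:
                  infected.add(p)        # P pulls updates from Q
      return rounds

  for mode in ("push", "pull", "push-pull"):
      print(mode, simulate(1000, mode))  # push-pull typically converges fastest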
33. Epidemic protocols (III)
- Rumor spreading (gossiping)
  - Server P randomly selects Q to push updates
  - If Q has already seen P's updates, then P may lose interest with probability 1/k
- Rapid propagation
  - but no guarantee that all servers will see all updates
- Let s be the fraction of servers that remain ignorant of an update; then
    s = e^(-(k+1)(1-s))
  e.g. k = 3 -> s < 0.02 (see the numeric check below)
- Enhancements by combining gossiping with anti-entropy
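The residue equation has no closed form, but iterating the fixed point confirms the quoted figure; a quick check (my own):

  import math

  def ignorant_fraction(k, iters=100):
      """Solve s = exp(-(k+1)(1-s)) by fixed-point iteration."""
      s = 0.5                       # any start in (0, 1) converges here
      for _ in range(iters):
          s = math.exp(-(k + 1) * (1 - s))
      return s

  print(ignorant_fraction(3))       # ~0.0198 < 0.02, matching the slide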
34. Epidemic protocols (IV)
- Spreading a deletion is hard
  - After removing a data item, a server may receive old copies!
  - Must record deletions & spread them
- Death certificates
  - Time-stamped upon creation
  - Enforce a TTL on certificates
    - Based on an estimate of the max. update propagation time
  - Maintain a few dormant death certificates
    - that never expire
    - as a guarantee that a death certificate can be re-spread in case an obsolete update is received for a deleted data item
35. The gossip architecture (I)
- Replicate data close to the points where groups of clients need it
- Periodic exchange of msgs among RMs
- Front-ends send queries & updates to any RM they choose
  - Any RM that is available can provide acceptable response times
- Consistent service over time
- Relaxed consistency between replicas
36. The gossip architecture (II)
- Causal update ordering
- Forced ordering
  - Causal & total
  - A forced-order & a causal-order update that are related by the happened-before relation may be applied in different orders at different RMs!
- Immediate ordering
  - Updates are applied in a consistent order relative to any other update at all RMs
37. The gossip architecture (III)
- Bulletin board application example
  - Posting items -> causal order
  - Adding a subscriber -> forced order
  - Removing a subscriber -> immediate order
- Gossip messages: updates among RMs
- Front-ends maintain a prev vector timestamp
  - One entry per RM
- RMs respond with a new vector timestamp
38. State components of a gossip RM
39. Query operations in gossip
- The RM must return a value that is at least as recent as the request's timestamp
  - q.prev ≤ valueTS (component-wise vector comparison; see the helper below)
- List of pending query operations
  - Held back until the above condition is satisfied
  - The RM can wait for missing updates, or request updates from the RMs concerned
- The RM's response includes valueTS
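The hold-back test is the standard component-wise vector-timestamp comparison; a small helper of mine makes it concrete:

  def vec_leq(a, b):
      """q.prev <= valueTS holds iff every component of a is <= b's."""
      return all(x <= y for x, y in zip(a, b))

  value_ts = [2, 5, 1]                 # updates reflected in this RM's value
  print(vec_leq([2, 4, 1], value_ts))  # True: query can be answered now
  print(vec_leq([2, 6, 1], value_ts))  # False: hold back, a 6th update from RM 1 is missing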
40. Updates in causal order
- RM i checks whether the operation ID is in its executed table or in its log
  - Discards the update if it has already been seen
- Increments the i-th element of its replica timestamp
  - Count of updates received from front-ends
- Assigns a vector timestamp ts to the update
  - Replaces the i-th element of u.prev by the i-th element of the replica timestamp
- Inserts a log entry
  - <i, ts, u.op, u.prev, u.id>
- Stability condition: u.prev ≤ valueTS (see the sketch below)
  - All updates on which u depends have been applied
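Putting the steps together, a minimal gossip-RM update path under my own simplifying assumptions (three RMs, updates carried as functions, no gossip exchange or executed-table pruning):

  def vec_leq(a, b):
      return all(x <= y for x, y in zip(a, b))

  class GossipRM:
      def __init__(self, i, n):
          self.i = i
          self.replica_ts = [0] * n   # updates received by this RM
          self.value_ts = [0] * n     # updates applied to the value
          self.value = {}
          self.log = []               # pending update records

      def receive_update(self, u_op, u_prev, u_id):
          self.replica_ts[self.i] += 1           # one more front-end update
          ts = list(u_prev)
          ts[self.i] = self.replica_ts[self.i]   # assign the update's timestamp
          self.log.append((ts, u_op, u_prev, u_id))
          self.apply_stable()
          return ts                              # returned to the front end

      def apply_stable(self):
          progress = True
          while progress:                        # applying one may free others
              progress = False
              for rec in list(self.log):
                  ts, op, prev, _ = rec
                  if vec_leq(prev, self.value_ts):   # stability: deps applied
                      op(self.value)
                      self.value_ts = [max(x, y) for x, y in zip(self.value_ts, ts)]
                      self.log.remove(rec)
                      progress = True

  rm = GossipRM(i=0, n=3)
  print(rm.receive_update(lambda v: v.__setitem__("post", "hello"), [0, 0, 0], "u1"))
  print(rm.value_ts, rm.value)    # [1, 0, 0] {'post': 'hello'}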
41. Forced & immediate order
- A unique sequence number is appended to update timestamps
- The primary RM acts as sequencer
  - Another RM can be elected to take over consistently as sequencer
  - A majority of RMs (including the primary) must record which update is next in the sequence
- Immediate ordering: the primary orders these updates in the sequence along with forced updates, considering causal updates as well
- Agreement protocol on the sequence
42. Gossip timestamps
- Gossip msgs between RMs carry
  - Replica timestamp & log
- Receiver's tasks
  - Merge the arriving log m.log with its own
    - Add record r to the local log if replicaTS < r.ts
  - Apply any updates that have become stable
    - This may in turn make pending updates become stable
    - Sort the set of stable updates in timestamp order: r is applied only if there is no s s.t. s.prev < r.prev
  - Eliminate records from the log & entries in the executed table
    - Once it is established that they have been applied everywhere
    - Timestamp table: tableTS[j] = m.ts for gossip received from RM j
    - If tableTS[i][c] ≥ r.ts[c] for all i, then r is discarded (c: the RM that created record r)
    - ACKs by front-ends allow discarding records from the executed table
43. Update propagation
- How long before all RMs receive an update?
- Frequency & duration of network partitions
  - Beyond the system's control!
- Frequency of gossip msgs
- Policy for choosing a gossip partner (see the sketch below)
  - Random
    - Weighted probabilities to favor near partners
    - Surprisingly robust!
    - But exhibits variable update propagation times
  - Deterministic
    - Simple function of the RM's state
    - E.g. examine the timestamp table & choose the RM that appears to be furthest behind in updates received
  - Topological
    - Based on a fixed arrangement of RMs into a graph
    - Ring, mesh, trees
    - Trade-off: amount of communication against higher latencies & the possibility that a single failure will affect other RMs
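The random and deterministic policies above are a few lines each; a sketch with assumed inputs (a weight list for the random policy, the local timestamp table for the deterministic one):

  import random

  def random_partner(weights, me):
      """Random policy: pick a partner with probabilities favoring near RMs."""
      others = [j for j in range(len(weights)) if j != me]
      return random.choices(others, weights=[weights[j] for j in others])[0]

  def furthest_behind(table_ts, me):
      """Deterministic policy: the RM whose known timestamp vector sums
      lowest appears to have received the fewest updates."""
      others = [j for j in range(len(table_ts)) if j != me]
      return min(others, key=lambda j: sum(table_ts[j]))

  table_ts = [[4, 2, 3], [4, 2, 3], [1, 0, 2]]   # RM 2 lags behind
  print(furthest_behind(table_ts, me=0))          # -> 2
  print(random_partner([0.0, 0.7, 0.3], me=0))    # mostly picks RM 1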
44. Scalability concerns
- 2 messages per query (between front-end & RM)
- Causal update: with G updates batched per gossip message, 2 + (R-1)/G messages are exchanged (see the worked example below)
- Increasing G leads to
  - fewer messages
  - but also worse delivery latencies
    - the RM has to wait for more updates to arrive before propagating them
- Improvement by having read-only replicas
  - Provided that the update/query ratio is low!
  - Updated by gossip msgs, but do not receive updates directly from front-ends
  - Can be situated close to client groups
  - Vector timestamps need only include the updateable RMs
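A quick evaluation of the 2 + (R-1)/G cost (my arithmetic, for illustrative R and G) shows how batching trades messages for latency:

  def msgs_per_causal_update(R, G):
      """2 front-end messages plus the update's share of (R-1) gossip msgs,
      each gossip message amortized over G batched updates."""
      return 2 + (R - 1) / G

  for G in (1, 5, 25):
      print(G, msgs_per_causal_update(R=26, G=G))
  # G=1 -> 27.0, G=5 -> 7.0, G=25 -> 3.0: fewer msgs, higher batching delay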
45. References
- R. Ladin, B. Liskov, L. Shrira and S. Ghemawat, "Providing High Availability Using Lazy Replication", ACM Trans. on Computer Systems, vol. 10, no. 4, pp. 360-391, 1992.
- A. Demers, D. Greene, C. Hauser, W. Irish and J. Larson, "Epidemic Algorithms for Replicated Database Maintenance", Proc. 6th ACM Symposium on Principles of Distributed Computing, pp. 1-12, 1987.
- D. B. Terry, M. M. Theimer, K. Petersen, A. J. Demers, M. J. Spreitzer and C. H. Hauser, "Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System", Proc. 15th ACM SOSP, pp. 172-183, 1995.