Title: CS556: Distributed Systems
1. CS-556 Distributed Systems
High Availability Replication (II)
- Manolis Marazakis
- maraz_at_csd.uoc.gr
2. Transactions with Replicated Data
- Better performance
- Concurrent service
- Reduced latency
- Higher availability
- Fault tolerance
- What if a replica fails or becomes isolated?
- Upon rejoining, it must catch up
- Replicated transaction service
- Data replicated at a set of replica managers
- Replication transparency
- One-copy serializability
- Read one, write all
- Failures must be observed to have happened before any active Txs at other servers
3. Fault Tolerance?
- Define correctness criteria
- When two replicas are separated by a network partition:
- Both are deemed incorrect; both stop serving.
- One (the master) continues; the other ceases service.
- One (the master) continues to accept updates; both continue to supply reads (of possibly stale data).
- Both continue service; they subsequently synchronise.
4. Linearizability
- Sequence of client i operations: Oi0, Oi1, Oi2, ...
- A single server would serialize client operations in some order
- e.g., O10, O11, O20, O21, O12, ...
- This is a virtual interleaving of the client operations in a server holding a single copy of the data
- A replicated shared-object service is linearizable if, for any execution, there is an interleaving of the client operations that satisfies:
- The interleaved sequence of operations meets the specification of a single correct copy of the objects
- The order of operations in the interleaving is consistent with the real times at which the operations occurred in the actual execution
5. Sequential consistency
- Linearizability is hard to achieve in practice without precise clock synchronization
- A replicated shared-object service is sequentially consistent if, for any execution, there is an interleaving of the client operations that satisfies:
- The interleaved sequence of operations meets the specification of a single correct copy of the objects
- The order of operations in the interleaving is consistent with the program order in which each client executed them
- ATTENTION: no total ordering between clients
- Every linearizable service is sequentially consistent (the converse is not true)
6. Example
- Client 1: setBalance_B(x,1)
- Client 2: getBalance_A(y) → 0
- Client 2: getBalance_A(x) → 0
- Client 1: setBalance_A(y,2)
- The real-time criterion of linearizability is not satisfied
- Find an interleaving that satisfies both criteria for sequential consistency (see the sketch below)
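A minimal Python sketch of this example (illustrative only; the operation labels, the valid_interleaving helper, and the encoding of the four operations are assumptions, not from the slides). It checks whether a proposed interleaving respects each client's program order and reads like a single copy of the data:

    # Slide-6 example: Client 1 writes x:=1 then y:=2; Client 2 reads y -> 0, x -> 0.
    ops = {
        "w_x": ("C1", "write", "x", 1),   # setBalance_B(x,1)
        "w_y": ("C1", "write", "y", 2),   # setBalance_A(y,2)
        "r_y": ("C2", "read",  "y", 0),   # getBalance_A(y) -> 0
        "r_x": ("C2", "read",  "x", 0),   # getBalance_A(x) -> 0
    }
    program_order = [("w_x", "w_y"), ("r_y", "r_x")]  # per-client order

    def valid_interleaving(order):
        pos = {op: i for i, op in enumerate(order)}
        # each client's program order must be preserved
        if any(pos[a] > pos[b] for a, b in program_order):
            return False
        # single-copy semantics: replay the writes and check every read
        store = {"x": 0, "y": 0}
        for op in order:
            _, kind, var, val = ops[op]
            if kind == "write":
                store[var] = val
            elif store[var] != val:
                return False
        return True

    # Valid for sequential consistency even though it violates real time
    # (setBalance_B(x,1) actually finished before the two reads):
    print(valid_interleaving(["r_y", "r_x", "w_x", "w_y"]))   # True
    print(valid_interleaving(["w_x", "r_y", "r_x", "w_y"]))   # False: r_x would see x = 1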
7. Passive Replication (I)
- At any time, the system has a single primary RM
- One or more secondary (backup) RMs
- Front ends communicate with the primary; the primary executes requests and sends the response to all backups
- If the primary fails, one backup is promoted to primary
- The new primary starts from the Coordination phase for each new request
- What happens if the primary crashes before/during/after the agreement phase?
8. Passive Replication (II)
9. Passive replication (III)
- Satisfies linearizability
- The front end looks up a new primary when the current primary does not respond
- The primary RM is a performance bottleneck
- Can tolerate F failures with F+1 RMs
- Variation: clients can access backup RMs (linearizability is lost, but clients get sequential consistency)
- Sun NIS (yellow pages) uses passive replication: clients can contact primary or backup servers for reads, but only the primary server for updates
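A toy primary-backup sketch in Python (the class names and single-process direct calls are assumptions for illustration; a real deployment needs failure detection, view changes, and the agreement phase discussed above):

    class ReplicaManager:
        def __init__(self, name):
            self.name, self.state, self.alive = name, {}, True

        def apply(self, op, key, value=None):
            if op == "write":
                self.state[key] = value
            return self.state.get(key)

    class PrimaryBackupService:
        def __init__(self, rms):
            self.rms = rms                          # rms[0] starts as the primary

        def primary(self):
            return next(rm for rm in self.rms if rm.alive)   # promote first live backup

        def request(self, op, key, value=None):
            p = self.primary()
            result = p.apply(op, key, value)        # 1. primary executes the request
            if op == "write":                       # 2. primary propagates the update
                for rm in self.rms:
                    if rm.alive and rm is not p:
                        rm.apply(op, key, value)
            return result                           # 3. reply goes back to the front end

    svc = PrimaryBackupService([ReplicaManager("P"), ReplicaManager("B1"), ReplicaManager("B2")])
    svc.request("write", "x", 42)
    svc.rms[0].alive = False                        # primary crashes
    print(svc.request("read", "x"))                 # 42, served by the promoted backup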
10. Active replication (I)
- RMs are state machines with equivalent roles
- Front ends communicate client requests to the RM group using totally ordered, reliable multicast
- RMs process requests independently and reply to the front end (correct RMs process each request identically)
- The front end can synthesize the final response to the client (tolerating Byzantine failures)
- Active replication provides sequential consistency if the multicast is reliable and ordered
- Byzantine failures (F out of 2F+1): the front end waits until it gets F+1 identical responses (see the voting sketch below)
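A minimal sketch of that voting rule (synthesize_response is a made-up name; a real front end would also match request IDs and authenticate replies):

    from collections import Counter

    def synthesize_response(replies, f):
        """Return the first value reported by at least F+1 replicas, else None."""
        for value, count in Counter(replies).most_common():
            if count >= f + 1:
                return value
        return None

    # 2F+1 = 5 replica managers, at most F = 2 of them Byzantine:
    print(synthesize_response([7, 7, 7, 0, 99], f=2))   # 7    (F+1 identical replies)
    print(synthesize_response([7, 7, 0, 0, 99], f=2))   # None (no F+1 quorum yet)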
11. Active replication (II)
12. Replication Architectures
- How many replicas are required?
- All or majority ?
- Forward all updates as soon as received.
- Two-phase commit protocol.
- Contacted replica acts as coordinator
- What if one of the replicas isn't available?
- Primary copy replication
(Diagram: transaction operations getBalance(A) and deposit(B) issued to the replica managers)
13. Available Copies Replication
(Diagram: transactions T and U issue getBalance(A), deposit(B), getBalance(B), deposit(A) at the available replica managers)
- Not all copies will always be available.
- Failures
- Timeout at failed replica
- Rejected by recovering, unsynchronised replica
14. Local Validation
- Failure and recovery events do not occur during a Tx.
- Example:
- T reads A before server X's failure, therefore T → fail_X
- T observes server N's failure when it writes B, therefore fail_N → T
- fail_N → T.getBalance(A) → T.deposit(B) → fail_X
- fail_X → U.getBalance(B) → U.deposit(A) → fail_N
- Combined: server X fails, followed by transaction U, which is followed by server N's failure, which is followed by transaction T, which is followed by server X's failure. This is inconsistent, so the transactions must not be allowed to commit.
- Failure and recovery must be serialised just like a Tx: they occur before or after a Tx, but not during.
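One way to see the check, as a small Python sketch (the has_cycle helper and the edge labels are illustrative, not from the slides): treat the observed before/after constraints as edges of a graph and reject the transactions if the edges form a cycle.

    def has_cycle(edges):
        # build adjacency lists, making sure every node appears
        graph = {}
        for a, b in edges:
            graph.setdefault(a, []).append(b)
            graph.setdefault(b, [])
        WHITE, GREY, BLACK = 0, 1, 2
        color = {n: WHITE for n in graph}

        def visit(n):
            color[n] = GREY
            for m in graph[n]:
                if color[m] == GREY or (color[m] == WHITE and visit(m)):
                    return True       # found a back edge: cycle
            color[n] = BLACK
            return False

        return any(color[n] == WHITE and visit(n) for n in graph)

    # fail_N -> T -> fail_X  (from T's observations)
    # fail_X -> U -> fail_N  (from U's observations)
    edges = [("fail_N", "T"), ("T", "fail_X"), ("fail_X", "U"), ("U", "fail_N")]
    print(has_cycle(edges))   # True: inconsistent ordering, so T and U must not commit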
15. Network Partitions
- Separate but viable groups of servers
- Optimistic schemes validate on recovery
- Available copies with validation
- Pessimistic schemes limit availability until
recovery
16. Fault Tolerance
- Design to recover after a failure w/o loss of (committed) data.
- Designs for fault tolerance:
- Single server, fail and recover
- Primary server with trailing backups
- Replicated service
17. Ordered Multicast
- FIFO ordering: if a correct process issues multicast(g, m1) followed by multicast(g, m2), then every correct process that delivers m2 will deliver m1 before m2.
- Causal ordering: if multicast(g, m1) happened-before multicast(g, m2), then any correct process that delivers m2 will deliver m1 before m2.
- Total ordering: if a correct process delivers m1 before it delivers m2, then any other correct process that delivers m2 will deliver m1 before m2.
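A concrete illustration of FIFO ordering only (per-sender sequence numbers with a hold-back queue; the FifoReceiver class is a made-up sketch, and causal or total ordering need vector timestamps or a sequencer on top of this):

    class FifoReceiver:
        def __init__(self):
            self.next_seq = {}      # sender -> next expected sequence number
            self.hold_back = {}     # sender -> {seq: message}
            self.delivered = []

        def receive(self, sender, seq, msg):
            self.hold_back.setdefault(sender, {})[seq] = msg
            expected = self.next_seq.setdefault(sender, 1)
            # deliver as many consecutive messages from this sender as possible
            while expected in self.hold_back[sender]:
                self.delivered.append(self.hold_back[sender].pop(expected))
                expected += 1
            self.next_seq[sender] = expected

    r = FifoReceiver()
    r.receive("p", 2, "m2")         # arrives early: held back
    r.receive("p", 1, "m1")         # now both can be delivered, in order
    print(r.delivered)              # ['m1', 'm2']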
18. Total / Causal Ordering
- C3: no happened-before indication; it is delivered in different orders at P2, P3, P4
19. Synch Ordering
- All copies of a given request are processed either before the synch request or after it, consistently at every replica.
- Essentially, a synch request serves to flush the system.
20. Causal Ordering - Vector Timestamps
- Lamport's logical clock does not capture causality
- Vector timestamp: one event count per source process
- The FE keeps the timestamp of the RM at the last read
21. Group membership service
For each group, this service delivers to any member process a series of views.
22. View delivery constraints
- Order
- If p delivers v(g) and then v'(g), then no other process delivers v'(g) before v(g)
- Integrity
- If p delivers v(g), then p ∈ v(g)
- Non-triviality
- If q has joined the group and is (or becomes) indefinitely reachable from p, then eventually q is always in the views that p delivers
- If a group is partitioned, then eventually the views delivered in any one partition will exclude processes in another partition
23. View-synchronous guarantees (I)
- Agreement
- If p delivers message m in view v(g) and then delivers v'(g), then all processes that survive to deliver v'(g) also deliver m in view v(g)
- Integrity
- If p delivers message m, then it will not deliver m again.
- Validity
- Correct processes always deliver the messages that they send.
- If the system fails to deliver a message to any process q, then it notifies the surviving processes by delivering a new view, with q excluded, immediately after the view in which any of them delivered the message.
24. View-synchronous guarantees (II)
- p sends a message while in view (p, q, r), then crashes
25. View-synchronous guarantees (III)
- State transfer to a new group member
- Delivery of first view containing the new process
- Group representative captures its state
- Send state to new member (one-to-one)
- Suspend execution
- All (previous) group members suspend their execution as well
- New member delivers state
- Integrate new state
- Multicast Commence message to the group
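A toy sketch of these steps in Python, with all group communication replaced by direct calls (Member, admit, and the "commence" step are illustrative names only):

    class Member:
        def __init__(self, name, state=None):
            self.name, self.state, self.suspended = name, state or {}, False

    def admit(group, representative, newcomer):
        # first view containing the new process has been delivered
        snapshot = dict(representative.state)   # representative captures its state
        for m in group:                         # all previous members suspend execution
            m.suspended = True
        newcomer.state = snapshot               # new member delivers and integrates the state
        group.append(newcomer)
        for m in group:                         # "Commence" message: everyone resumes
            m.suspended = False

    group = [Member("p", {"x": 1}), Member("q", {"x": 1})]
    admit(group, group[0], Member("r"))
    print([m.state for m in group])             # all three members now share the state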
26. The gossip architecture (I)
- Replicate data close to the points where groups of clients need it
- Periodic exchange of messages among RMs
- Front-ends send queries and updates to any RM they choose
- Any RM that is available can provide acceptable response times
- Consistent service over time
- Relaxed consistency between replicas
27. The gossip architecture (II)
- Causal update ordering
- Forced ordering
- Causal + total
- A forced-order and a causal-order update that are related by the happened-before relation may be applied in different orders at different RMs!
- Immediate ordering
- Updates are applied in a consistent order relative to any other update at all RMs
28. The gossip architecture (III)
- Bulletin board application example
- Posting items → causal order
- Adding a subscriber → forced order
- Removing a subscriber → immediate order
- Gossip messages carry updates among RMs
- Front-ends maintain prev vector timestamp
- One entry per RM
- RMs respond with new vector timestamp
29. State components of a gossip RM
30. Query operations in gossip
- The RM must return a value that is at least as recent as the request's timestamp
- q.prev ≤ valueTS
- List of pending query operations
- Held back until the above condition is satisfied
- The RM can wait for the missing updates
- or request the updates from the RMs concerned
- The RM's response includes valueTS
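Sketch of this query path (the class and method names are made up; the vector-timestamp helpers from the earlier sketch are repeated so the snippet stands alone): a query is answered only once q.prev ≤ valueTS, otherwise it is held back until the missing updates arrive.

    def vt_leq(a, b):    # q.prev <= valueTS, element-wise
        return all(x <= y for x, y in zip(a, b))

    def vt_merge(a, b):  # element-wise maximum
        return [max(x, y) for x, y in zip(a, b)]

    class GossipRM:
        def __init__(self, n_rms):
            self.value, self.value_ts = {}, [0] * n_rms
            self.pending = []                       # held-back query operations

        def query(self, q_prev, key):
            if vt_leq(q_prev, self.value_ts):       # recent enough: answer now
                return self.value.get(key), list(self.value_ts)
            self.pending.append((q_prev, key))      # hold back until updates arrive
            return None

        def apply_update(self, ts, key, val):
            self.value[key] = val
            self.value_ts = vt_merge(self.value_ts, ts)
            ready = [(p, k) for p, k in self.pending if vt_leq(p, self.value_ts)]
            self.pending = [q for q in self.pending if q not in ready]
            return [self.query(p, k) for p, k in ready]   # answer the released queries

    rm = GossipRM(n_rms=3)
    print(rm.query([0, 1, 0], "x"))             # None: update 1 from RM 1 not seen yet
    print(rm.apply_update([0, 1, 0], "x", 5))   # [(5, [0, 1, 0])]: held-back query released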
31. Updates in causal order
- RM i checks whether the operation ID is in its executed table or in its log
- Discard the update if it has already been seen
- Increment the i-th element of the replica timestamp
- Count of updates received from front-ends
- Assign a vector timestamp ts to the update
- Replace the i-th element of u.prev by the i-th element of the replica timestamp
- Insert a log entry
- <i, ts, u.op, u.prev, u.id>
- Stability condition: u.prev ≤ valueTS
- All updates on which u depends have been applied
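A rough sketch of this causal-update path at RM i (all names are illustrative, not from the paper): duplicate check against the executed table and log, replica-timestamp increment, log entry <i, ts, u.op, u.prev, u.id>, and the stability condition u.prev ≤ valueTS before applying.

    def vt_leq(a, b): return all(x <= y for x, y in zip(a, b))      # as in the earlier sketch
    def vt_merge(a, b): return [max(x, y) for x, y in zip(a, b)]

    class CausalGossipRM:
        def __init__(self, i, n_rms):
            self.i = i
            self.replica_ts = [0] * n_rms   # updates received at this RM
            self.value_ts = [0] * n_rms     # updates applied to the value
            self.value, self.log, self.executed = {}, [], set()

        def receive_update(self, u_id, u_op, u_prev):
            if u_id in self.executed or any(e[4] == u_id for e in self.log):
                return                       # already seen: discard
            self.replica_ts[self.i] += 1     # count updates received from front ends
            ts = list(u_prev)
            ts[self.i] = self.replica_ts[self.i]
            self.log.append((self.i, ts, u_op, u_prev, u_id))
            self._apply_stable()
            return ts                        # returned to the front end

        def _apply_stable(self):
            applied = True
            while applied:
                applied = False
                for entry in list(self.log):
                    _, ts, op, prev, uid = entry
                    if vt_leq(prev, self.value_ts):   # stability condition holds
                        op(self.value)                # all dependencies already applied
                        self.value_ts = vt_merge(self.value_ts, ts)
                        self.executed.add(uid)
                        self.log.remove(entry)
                        applied = True

    rm = CausalGossipRM(i=0, n_rms=2)
    rm.receive_update("u1", lambda v: v.update(balance=100), [0, 0])
    print(rm.value, rm.value_ts)    # {'balance': 100} [1, 0]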
32. Forced / immediate order
- A unique sequence number is appended to update timestamps
- The primary RM acts as sequencer
- Another RM can be elected to take over consistently as sequencer
- A majority of RMs (including the primary) must record which update is next in the sequence
- Immediate ordering: the primary also places these updates in the sequence (along with forced updates), considering causal updates as well
- Agreement protocol on the sequence
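A bare-bones sequencer sketch for forced ordering (Sequencer and ForcedOrderRM are illustrative names; electing a replacement sequencer and the majority-agreement step are omitted): the primary assigns consecutive sequence numbers, and every RM applies forced updates strictly in that order.

    class Sequencer:
        def __init__(self):
            self.next_seq = 0
        def assign(self, update):
            self.next_seq += 1               # primary appends a unique sequence number
            return (self.next_seq, update)

    class ForcedOrderRM:
        def __init__(self):
            self.expected, self.hold_back, self.applied = 1, {}, []
        def receive(self, seq, update):
            self.hold_back[seq] = update
            while self.expected in self.hold_back:   # apply strictly in sequence order
                self.applied.append(self.hold_back.pop(self.expected))
                self.expected += 1

    seq = Sequencer()
    rm = ForcedOrderRM()
    u1 = seq.assign("add subscriber A")      # sequence number 1
    u2 = seq.assign("add subscriber B")      # sequence number 2
    rm.receive(*u2)                          # arrives first: held back
    rm.receive(*u1)
    print(rm.applied)                        # ['add subscriber A', 'add subscriber B']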
33. References
- K.P. Birman, "The process group approach to reliable distributed computing," CACM, vol. 36, no. 12, pp. 36-53, 1993.
- R. Ladin, B. Liskov, L. Shrira, and S. Ghemawat, "Providing high availability using lazy replication," ACM Trans. Computer Systems, vol. 10, no. 4, pp. 360-391, 1992.