CS556: Distributed Systems

1
CS-556 Distributed Systems
Consistency & Replication (III)
  • Manolis Marazakis
  • maraz_at_csd.uoc.gr

2
Fault Tolerance ?
  • Define correctness criteria
  • When 2 replicas are separated by a network
    partition:
  • Both are deemed incorrect & stop serving.
  • One (the master) continues; the other ceases
    service.
  • One (the master) continues to accept updates;
    both continue to supply reads (of possibly stale
    data).
  • Both continue service & subsequently synchronise.

3
Fault Tolerance
  • Design to recover after a failure with no loss of
    (committed) data.
  • Designs for fault tolerance
  • Single server, fail and recover
  • Primary server with trailing backups
  • Replicated service

4
Network Partitions
  • Separate but viable groups of servers
  • Optimistic schemes validate on recovery
  • Available copies with validation
  • Pessimistic schemes limit availability until
    recovery

5
Replication under partitions
  • Available copies with validation
  • Validation involves
  • Aborting conflicting Txs
  • Compensations
  • Precedence graphs for detecting inconsistencies
  • Quorum consensus
  • Version number per replica
  • Operations should only be applied to replicas
    with the current version number
  • Virtual partition

6
Transactions with Replicated Data
  • Better performance
  • Concurrent service
  • Reduced latency
  • Higher availability
  • Fault tolerance
  • What if a replica fails or becomes isolated ?
  • Upon rejoining, it must catch up
  • Replicated transaction service
  • Data replicated at a set of replica managers
  • Replication transparency
  • One copy serializability
  • Read one, write all

Failures must be observed to have happened
before any active Txs at other servers
7
Active replication (I)
8
Active replication (II)
  • RMs are state machines with equivalent roles
  • Front ends communicate the client requests to the
    RM group, using totally ordered reliable multicast
  • RMs process requests independently & reply to the
    front end (correct RMs process each request
    identically)
  • Front end can synthesize final response to client
    (tolerating Byzantine failures)
  • Active replication provides sequential
    consistency if multicast is reliable & ordered
  • Byzantine failures (F out of 2F+1): front end
    waits until it gets F+1 identical responses
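
The front end's voting step can be sketched as follows. This is only an illustration: the function name `synthesize_response` and the use of account balances as reply values are assumptions, not part of the slides.

```python
from collections import Counter


def synthesize_response(replies, f):
    """Return the first value reported by at least f + 1 replica managers.

    With at most f Byzantine RMs out of 2f + 1, any value that appears
    f + 1 times must have been sent by at least one correct RM.
    """
    counts = Counter()
    for value in replies:
        counts[value] += 1
        if counts[value] >= f + 1:
            return value
    return None  # not enough matching replies yet


# 2f + 1 = 3 replica managers; one of them (f = 1) returns a corrupt balance.
print(synthesize_response([100, 999, 100], f=1))  # prints 100
```

Returning `None` stands for "keep waiting for more replies"; a real front end would block on the multicast group instead.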

9
Available Copies Replication
  • Not all copies will always be available.
  • Failures
  • Timeout at failed replica
  • Rejected by recovering, unsynchronised replica

[Figure: replica managers serving getBalance(A), deposit(B), deposit(A), getBalance(B)]
10
Passive Replication (I)
  • At any time, system has single primary RM
  • One or more secondary backup RMs
  • Front ends communicate with the primary; the
    primary executes requests & sends the response
    to all backups
  • If primary fails, one backup is promoted to
    primary
  • New primary starts from Coordination phase for
    each new request
  • What happens if primary crashes
    before/during/after agreement phase?

11
Passive Replication (II)
12
Passive replication (III)
  • Satisfies linearizability
  • Front end looks up new primary, when current
    primary does not respond
  • Primary RM is performance bottleneck
  • Can tolerate F failures for F+1 RMs
  • Variation: clients can access backup RMs
    (linearizability is lost, but clients get
    sequential consistency)
  • SUN NIS (yellow pages) uses passive replication:
    clients can contact primary or backup servers for
    reads, but only primary server for updates

13
Consensus for HA Systems (I)
  • B.W. Lampson, How to build a highly available
    system using consensus.
  • Distributed Algorithms, ed. Babaoglu and
    Marzullo, Lecture Notes in Computer Science 1151,
    Springer, 1996, pp 1-17
  • Based on Lamport's framework
  • Replicated state machines
  • Deterministic function (state, input) →
    (new_state, output)
  • Paxos consensus algorithm
  • Ensure that all non-faulty processes see the same
    inputs
  • We can make the order a part of the input value
    by defining a total order on the set of inputs
  • Analysis of concurrent systems
  • Leases
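
A replicated state machine in this sense can be sketched as below. The bank-account commands and the names `transition` and `run_replica` are hypothetical; the point is only that replicas fed the same inputs in the same order end in the same state.

```python
def transition(state, command):
    """Deterministic function: (state, input) -> (new_state, output)."""
    op, account, amount = command
    balance = state.get(account, 0)
    if op == "deposit":
        new_state = {**state, account: balance + amount}
        return new_state, new_state[account]
    if op == "get_balance":
        return state, balance
    raise ValueError(f"unknown operation: {op}")


def run_replica(inputs):
    """Feed an agreed input sequence to a fresh replica."""
    state, outputs = {}, []
    for command in inputs:
        state, output = transition(state, command)
        outputs.append(output)
    return state, outputs


# The consensus layer is assumed to have produced this single ordered sequence.
agreed_inputs = [("deposit", "A", 10), ("deposit", "A", 5), ("get_balance", "A", 0)]
replica_1 = run_replica(agreed_inputs)
replica_2 = run_replica(agreed_inputs)
print(replica_1 == replica_2)  # prints True
```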

14
Consensus for HA Systems (II)
  • Lamport's Paxos algorithm
  • It is run by a set of leader processes that guide
    a set of agent processes to achieve consensus.
  • It is correct no matter how many simultaneous
    leaders there are and no matter how often leader
    or agent processes fail and recover, how slow
    they are, or how many messages are lost, delayed,
    or duplicated.
  • It terminates if there is a single leader for a
    long enough time during which the leader can talk
    to a majority of the agent processes twice.
  • It may not terminate if there are always too many
    leaders
  • Guaranteed termination is impossible !!

15
Consensus for HA Systems (III)
  • Sequence of rounds
  • In each round, a single leader attempts to reach
    consensus on a single variable
  • Query the agents to learn their status on past
    rounds
  • Select a value and command agents to accept it
  • If a majority of the agents accepts the value,
    propagate this value to all as the outcome
  • 2 ½ round-trips for a successful round
  • If the leader fails repeatedly, or more than one
    leader competes, it may take multiple rounds to
    reach consensus

16
Bayou
  • A data management system to support collaboration
    in diverse network environments
  • Variable degrees of connectedness
  • Rely only on occasional pair-wise communication
  • No notion of disconnected mode of operation
  • Tentative & committed data
  • Update everywhere
  • No locking
  • Incremental reconciliation
  • Eventual consistency
  • Application participation
  • Conflict detection & resolution
  • Merge procedures
  • No replication transparency

17
Design Overview
  • Client stub (API) & servers
  • Per replica state
  • Database (Tuple Store)
  • Write log & undo log
  • Version vector
  • Reconciliation protocol (peer-to-peer)
  • Eventual consistency
  • All replicas receive all updates
  • Any two replicas that have received the same set
    of updates have identical databases.

18
Accommodating weak connectivity
  • Weakly consistent replication
  • read-any/write-any access
  • Session sharing semantics
  • Epidemic propagation
  • pair-wise contacts (anti-entropy sessions)
  • agree on the set of writes & their order
  • Convergence rate depends on
  • connectivity & frequency of anti-entropy
    sessions
  • partner selection policy

19
Conventional Approaches
  • Version vectors
  • Optimistic concurrency control
  • Problems
  • Concurrent writes to the same object may not
    conflict, while writes to different objects may
    conflict
  • depending on object granularity
  • Bayou: account for application semantics
  • conflict detection with help of application

20
Conflict detection & resolution
  • Shared calendar example
  • Conflict: meetings overlap in time
  • Resolution: reschedule to alternate time
  • Dependency check
  • included in every write
  • Together with expected result
  • calls merge procedure if a conflict is detected
  • can query the database
  • produces new update

21
A Bayou write
  • Processed at each replica
    Bayou_Write(update, dep_check, mergeproc)
      IF (DB_EVAL(dep_check.query) <>
          dep_check.expected_result)
        resolved_update = EXECUTE(mergeproc)
      ELSE
        resolved_update = update
      DB_EXEC(resolved_update)
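
The same control flow, written as a small Python sketch. The dictionary database, the room-reservation scenario and the helper names (`reserve_for`, `waitlist`, `room_is_free`) are assumptions made for illustration; only the structure of `bayou_write` follows the pseudocode.

```python
def bayou_write(db, update, dep_check, mergeproc):
    """Apply one write to `db`, resolving a detected conflict via `mergeproc`."""
    query, expected_result = dep_check
    if query(db) != expected_result:
        resolved_update = mergeproc(db)  # conflict: the application decides
    else:
        resolved_update = update
    resolved_update(db)                  # corresponds to DB_EXEC
    return db


def reserve_for(name):
    def update(db):
        db["room"] = name
    return update


def waitlist(name):
    # Merge procedure: may inspect the database and produces a new update.
    def mergeproc(db):
        def update(db):
            db["waitlist"].append(name)
        return update
    return mergeproc


# Dependency check: the query result is expected to be None (room is free).
room_is_free = (lambda db: db["room"], None)

db = {"room": None, "waitlist": []}
bayou_write(db, reserve_for("alice"), room_is_free, waitlist("alice"))
bayou_write(db, reserve_for("bob"), room_is_free, waitlist("bob"))
print(db)  # prints {'room': 'alice', 'waitlist': ['bob']}
```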

22
Example of write in the shared calendar
  • Update = {insert, Meetings, 12/18/95, 1:30pm,
    60min, "Budget Meeting"}
  • Dependency_check = {query = "SELECT key FROM
    Meetings WHERE day = 12/18/95 AND start < 2:30pm
    AND end > 1:30pm", expected_result = EMPTY}
  • MergeProc
  • alternates = {{12/18/95, 3:00pm}, {12/19/95,
    9:30am}}
  • FOREACH a IN alternates
  • /* check if feasible, produce newupdate */
  • if (newupdate = {}) /* no feasible alternate */
  • newupdate = {insert, ErrorLog, Update}
  • Return(newupdate)
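
A rough Python rendering of the calendar write, under stated assumptions: times are minutes since midnight, dates are ISO strings, the database is a plain list of dictionaries, and the dependency check and merge procedure are folded into one loop for brevity. The existing "Staff Meeting" entry is invented test data.

```python
def overlapping(meetings, day, start, end):
    """Dependency query: meetings on `day` that overlap [start, end)."""
    return [m for m in meetings
            if m["day"] == day and m["start"] < end and m["end"] > start]


def schedule(meetings, error_log, day, start, length, title, alternates):
    """Insert the meeting, falling back to the first feasible alternate slot."""
    for d, s in [(day, start)] + alternates:
        if not overlapping(meetings, d, s, s + length):
            meetings.append({"day": d, "start": s, "end": s + length, "title": title})
            return d, s
    error_log.append((day, start, length, title))  # no feasible alternate
    return None


meetings = [{"day": "1995-12-18", "start": 780, "end": 840, "title": "Staff Meeting"}]
error_log = []
slot = schedule(
    meetings, error_log,
    day="1995-12-18", start=810, length=60, title="Budget Meeting",  # 1:30pm
    alternates=[("1995-12-18", 900), ("1995-12-19", 570)],           # 3:00pm, 9:30am
)
print(slot)  # prints ('1995-12-18', 900)
```

The requested 1:30pm slot overlaps the existing 1:00pm to 2:00pm meeting, so the first alternate is chosen.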

23
Eventual consistency
  • Propagation
  • All replicas receive all updates
  • chain of pair-wise interactions
  • Determinism
  • All replicas execute writes in the same way
  • Including conflict detection & resolution
  • Global order
  • All replicas apply writes to their databases in
    the same order
  • Since writes include arbitrarily complex merge
    procedures, it is effectively impossible to
    determine if two writes commute or to transform
    them so that they can be re-ordered
  • Tentative writes are ordered by the timestamp
    assigned by their accepting servers
  • Total order using <Timestamp, serverID>
  • Desirable so that a cluster of isolated servers
    agree on the tentative resolution of conflicts

24
Undoing & re-applying writes
  • Servers may accept writes (from clients or other
    servers) in an order that differs from the
    acceptable execution order
  • Servers immediately apply all known writes
  • Therefore
  • Servers must be able to undo the effects of some
    previous tentative execution of a write &
    re-apply it in a different order
  • The number of retries depends only on the order
    in which writes arrive (via anti-entropy
    sessions)
  • Not on the likelihood of conflicts
  • Each server maintains write log & undo log
  • Sorted by committed or tentative timestamp
  • Committed writes take precedence

25
Constraints on write
  • Must produce the same result on all replicas with
    equal write logs preceding that write
  • Client-provided merge procedures
  • can only access the database & parameters
    provided
  • cannot access time-varying or server-specific
    state
  • pid, time, file system
  • have uniform bounds on memory & processor
  • So that failures due to resource usage are
    uniform
  • Otherwise, non-deterministic behavior !

26
Global order
  • Writes are totally ordered w.r.t. write-stamp
  • (commit-stamp, accept-stamp, server-id)
  • accept-stamp
  • assigned by the server that initially receives
    the write
  • derived from logical clock
  • monotonically increasing
  • Global clock sync. is not required
  • commit-stamp
  • initialized to ∞
  • updated when write is stabilized
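
The write-stamp order can be illustrated with ordinary tuple comparison. This sketch assumes that a tentative write's commit-stamp is modelled as infinity, so that committed writes sort first; the stamps and server names are made-up values.

```python
import math


def write_stamp(commit_stamp, accept_stamp, server_id):
    """Write-stamp triple; tuples compare field by field, left to right."""
    return (commit_stamp, accept_stamp, server_id)


writes = [
    write_stamp(math.inf, 5, "B"),  # tentative
    write_stamp(2, 9, "A"),         # committed second
    write_stamp(math.inf, 5, "A"),  # tentative; tie on accept-stamp broken by server-id
    write_stamp(1, 12, "C"),        # committed first
]
ordered = sorted(writes)
for stamp in ordered:
    print(stamp)
```

Committed writes come out ahead of tentative ones regardless of their accept-stamps, and the two tentative writes are ordered by server-id.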

27
Stabilizing writes (I)
  • A write is stable at a replica when it has been
    executed for the last time at that replica
  • All writes with earlier write-stamps are known to
    the replica, and no future writes will be given
    earlier write-stamps
  • Convenient for applications to have a notion of
    confirmation/commitment
  • Stabilize as soon as possible
  • allow replicas to prune their write-logs
  • inform applications/users that writes have been
    confirmed -or- fully resolved

28
Stabilizing writes (II)
  • A write may be executed several times at a server
    & may produce different results
  • Depending on server's execution history
  • The Bayou API provides means to inquire about the
    stability of a write
  • By including the current clock value in
    anti-entropy sessions, a server can determine
    that a write is stable when it has a lower
    timestamp than all servers' clocks
  • A single server that remains disconnected may
    prevent writes from stabilizing, and cause
    rollbacks upon its re-connection
  • Explicit commit procedure
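
The clock-based stability test can be sketched as follows, assuming each write is represented only by its accept timestamp and that `known_clocks` holds the latest clock value learned from every server. Both names are hypothetical.

```python
def stable_writes(write_log, known_clocks):
    """Return the timestamps of writes older than every known server clock."""
    horizon = min(known_clocks.values())
    return [ts for ts in write_log if ts < horizon]


write_log = [3, 6, 9, 11]
print(stable_writes(write_log, {"A": 12, "B": 7, "C": 30}))  # prints [3, 6]

# A server that stays disconnected keeps its last known clock low,
# which holds back stabilisation for everyone else.
print(stable_writes(write_log, {"A": 12, "B": 2, "C": 30}))  # prints []
```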

29
Committing writes
  • In a given data collection, one replica is
    designated the primary
  • Commits writes by assigning a commit-stamp
  • No requirement for majority quorum
  • A disconnected server can be the primary for a
    user's personal data objects
  • Committing a write makes it stable
  • Commit-stamp determines total order
  • Committed writes are ordered before tentative
    writes
  • Replicas are informed of committed writes in
    commit-stamp order

30
Applying sets of writes
Receive_Writes(wset, from)
  if (from = CLIENT)
    TS = MAX(sysClock, TS+1)
    w = First(wset)
    w.WID = <TS, mySrvID>
    w.state = TENTATIVE
    WriteLogAppend(w)
    Bayou_Write(w.update, w.dependency_check, w.mergeproc)
  else /* received via anti-entropy -> wset is ordered */
    w = First(wset)
    insertionPoint = WriteLogIdentifyInsertionPoint(w.WID)
    TupleStoreRollbackTo(insertionPoint)
    WriteLogInsert(wset)
    for each w in WriteLog, w after insertionPoint
      Bayou_Write(w.update, w.dependency_check, w.mergeproc)
    w = Last(wset)
    TS = MAX(TS, w.WID.timestamp)
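
A simplified Python sketch of the same rollback and roll-forward logic. It is not the Bayou implementation: writes are plain (WID, key, value) assignments without dependency checks or merge procedures, and the undo log is approximated by per-write snapshots of the database. The `Server` class and its method names are assumptions.

```python
import bisect


class Server:
    def __init__(self, server_id):
        self.server_id = server_id
        self.clock = 0
        self.log = []        # writes as (wid, key, value), kept sorted by wid
        self.snapshots = []  # snapshots[i]: database state before log[i] was applied
        self.db = {}

    def _apply_from(self, start):
        # Roll back to the state before log[start], then roll forward.
        if start < len(self.snapshots):
            self.db = dict(self.snapshots[start])
        del self.snapshots[start:]
        for _, key, value in self.log[start:]:
            self.snapshots.append(dict(self.db))
            self.db[key] = value

    def receive_from_client(self, key, value, sys_clock=0):
        self.clock = max(sys_clock, self.clock + 1)
        self.log.append(((self.clock, self.server_id), key, value))
        self._apply_from(len(self.log) - 1)

    def receive_from_server(self, wset):
        # wset is ordered by WID, as in an anti-entropy session.
        insertion_point = bisect.bisect_left(self.log, wset[0])
        known = set(self.log)
        self.log = sorted(self.log + [w for w in wset if w not in known])
        self._apply_from(insertion_point)
        self.clock = max(self.clock, wset[-1][0][0])


a, b = Server("A"), Server("B")
a.receive_from_client("x", "a1")
b.receive_from_client("x", "b1")
b.receive_from_client("y", "b2")
a.receive_from_server(list(b.log))  # A appends B's writes after its own
b.receive_from_server(list(a.log))  # B must undo its writes and re-apply them
print(a.db == b.db, a.db)
```

After the two anti-entropy exchanges both servers hold the same log and therefore the same database, even though B had tentatively executed its writes in a different order.
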
31
Epidemic protocols (I)
  • Scalable propagation of updates in
    eventually-consistent data stores
  • eventually all replicas receive all updates
  • in as few msgs as possible
  • Aggregation of multiple updates in each msg
  • Classification of servers
  • Infective
  • Susceptible
  • Removed

32
Epidemic protocols (II)
  • Anti-entropy propagation
  • Server P randomly selects server Q
  • Options
  • P pushes updates to Q
  • Problem of delay if we have relatively many
    infective servers
  • P pulls updates from Q
  • Spreading of updates is triggered by susceptible
    servers
  • P and Q exchange updates (push/pull)
  • Assuming only a single infective server
  • Both push & pull eventually spread updates
  • Optimization
  • Ensure that at least a number of servers
    immediately become infective

33
Epidemic protocols (III)
  • Rumor spreading (gossiping)
  • Server P randomly selects Q to push updates
  • If Q already has seen the updates of P, then P
    may lose interest
  • with probability 1/k
  • Rapid propagation
  • but no guarantee that all servers will see all
    updates

s = fraction of servers that remain ignorant of an update
s = e^{-(k+1)(1-s)}
k = 3 ⇒ s < 0.02
Enhancements by combining gossiping with
anti-entropy
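
The relation s = e^{-(k+1)(1-s)} has no closed-form solution for s, but it can be solved numerically. The sketch below uses plain fixed-point iteration starting from s = 0, which converges to the non-trivial root for the small values of k used here; the function name and the iteration count are arbitrary choices.

```python
import math


def ignorant_fraction(k, iterations=100):
    """Solve s = exp(-(k + 1) * (1 - s)) by fixed-point iteration from s = 0."""
    s = 0.0
    for _ in range(iterations):
        s = math.exp(-(k + 1) * (1 - s))
    return s


for k in (1, 2, 3):
    print(k, round(ignorant_fraction(k), 4))
```

For k = 3 the result is just under 0.02, in line with the bound quoted above; smaller k leaves a noticeably larger fraction of servers ignorant.
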
34
Epidemic protocols (IV)
  • Spreading a deletion is hard
  • After removing a data item, a server may receive
    old copies !
  • Must record deletions & spread them
  • Death certificates
  • Time-stamped upon creation time
  • Enforce TTL of certificates
  • Based on estimate of max. update propagation time
  • Maintain a few dormant death certificates
  • that never expire
  • as a guarantee that a death certificate can be
    re-spread in case an obsolete update is received
    for a deleted data item
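
A minimal sketch of death certificates with a TTL. The `Replica` class, its field names and the integer clock are assumptions; dormant certificates are not modelled, and the last two lines show the failure they are meant to prevent.

```python
class Replica:
    def __init__(self, ttl):
        self.ttl = ttl
        self.items = {}         # key -> (timestamp, value)
        self.certificates = {}  # key -> deletion timestamp

    def delete(self, key, now):
        self.items.pop(key, None)
        self.certificates[key] = now  # death certificate, time-stamped on creation

    def receive_update(self, key, timestamp, value):
        """Accept an update unless a death certificate shows it is obsolete."""
        deleted_at = self.certificates.get(key)
        if deleted_at is not None and timestamp <= deleted_at:
            return False
        current = self.items.get(key)
        if current is None or timestamp > current[0]:
            self.items[key] = (timestamp, value)
        return True

    def expire_certificates(self, now):
        self.certificates = {k: t for k, t in self.certificates.items()
                             if now - t < self.ttl}


r = Replica(ttl=10)
r.receive_update("x", 1, "v1")
r.delete("x", now=5)
print(r.receive_update("x", 1, "v1"))  # prints False: old copy is rejected
r.expire_certificates(now=20)
print(r.receive_update("x", 1, "v1"))  # prints True: the deleted item is resurrected
```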

35
The gossip architecture (I)
  • Replicate data close to points where groups of
    clients need it
  • Periodic exchange of msgs among RMs
  • Front-ends send queries & updates to any RM they
    choose
  • Any RM that is available can provide acceptable
    response times
  • Consistent service over time
  • Relaxed consistency between replicas

36
The gossip architecture (II)
  • Causal update ordering
  • Forced ordering
  • Causal & total
  • A forced-order & a causal-order update that are
    not related by the happened-before relation may
    be applied in different orders at different RMs !
  • Immediate ordering
  • Updates are applied in a consistent order
    relative to any other update at all RMs

37
The gossip architecture (III)
  • Bulletin board application example
  • Posting items -> causal order
  • Adding a subscriber -> forced order
  • Removing a subscriber -> immediate order
  • Gossip messages: updates among RMs
  • Front-ends maintain prev vector timestamp
  • One entry per RM
  • RMs respond with new vector timestamp

38
State components of a gossip RM
39
Query operations in gossip
  • RM must return a value that is at least as recent
    as the request's timestamp
  • Q.prev ≤ valueTS
  • List of pending query operations
  • Hold back until above condition is satisfied
  • RM can wait for missing updates
  • or request updates from the RMs concerned
  • RM's response includes valueTS
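
The hold-back rule for queries can be sketched with vector timestamps represented as lists, one entry per RM. The class and helper names are assumptions, and the comparison used is the component-wise "less than or equal" on vector timestamps.

```python
def leq(a, b):
    """Component-wise comparison of two vector timestamps."""
    return all(x <= y for x, y in zip(a, b))


def merge(a, b):
    """Component-wise maximum of two vector timestamps."""
    return [max(x, y) for x, y in zip(a, b)]


class ReplicaManager:
    def __init__(self, value, value_ts):
        self.value = value
        self.value_ts = list(value_ts)
        self.pending = []  # prev timestamps of held-back queries

    def query(self, prev):
        """Answer with (value, valueTS), or hold the query back and return None."""
        if leq(prev, self.value_ts):
            return self.value, list(self.value_ts)
        self.pending.append(list(prev))
        return None


rm = ReplicaManager("bulletin v2", [2, 1, 0])
front_end_prev = [1, 1, 0]
value, new_ts = rm.query(front_end_prev)
front_end_prev = merge(front_end_prev, new_ts)  # front end now holds [2, 1, 0]
print(value, front_end_prev)
print(rm.query([2, 2, 0]))  # prints None: RM is missing an update from RM 1
```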

40
Updates in causal order
  • RM-i checks to see if operation ID is in its
    executed table or in its log
  • Discard update if it has already seen it
  • Increment i-th element of replica timestamp
  • Count of updates received from front-ends
  • Assign vector timestamp (ts) to the update
  • Replace i-th element of u.prev by i-th element of
    replica timestamp
  • Insert log entry
  • <i, ts, u.op, u.prev, u.id>
  • Stability condition: u.prev ≤ valueTS
  • All updates on which u depends have been applied
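
The steps above can be sketched for a single RM as follows. This is an illustrative model only: the value is just the list of applied operations, records are never pruned from the log, and gossip is omitted, so an update whose dependencies are missing simply stays pending.

```python
def leq(a, b):
    return all(x <= y for x, y in zip(a, b))


class ReplicaManager:
    def __init__(self, index, size):
        self.i = index
        self.replica_ts = [0] * size
        self.value_ts = [0] * size
        self.value = []        # applied operations, in order
        self.log = []          # records (i, ts, op, prev, uid)
        self.executed = set()  # IDs of updates already applied

    def receive_update(self, op, prev, uid):
        if uid in self.executed or any(r[4] == uid for r in self.log):
            return None                       # already seen: discard
        self.replica_ts[self.i] += 1          # count of updates from front ends
        ts = list(prev)
        ts[self.i] = self.replica_ts[self.i]  # unique timestamp for this update
        self.log.append((self.i, ts, op, list(prev), uid))
        self.apply_stable()
        return ts

    def apply_stable(self):
        progress = True
        while progress:                       # applying one update may unblock others
            progress = False
            for _, ts, op, prev, uid in self.log:
                if uid not in self.executed and leq(prev, self.value_ts):
                    self.value.append(op)
                    self.value_ts = [max(a, b) for a, b in zip(self.value_ts, ts)]
                    self.executed.add(uid)
                    progress = True


rm = ReplicaManager(index=0, size=3)
first_ts = rm.receive_update("post A", [0, 0, 0], "u1")
held_ts = rm.receive_update("post C", [0, 1, 0], "u3")  # depends on an update at RM 1
print(first_ts, held_ts, rm.value)
```

The second update is logged and time-stamped but not applied, because its `prev` timestamp is not yet covered by `value_ts`.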

41
Forced & immediate order
  • Unique sequence number is appended to update
    timestamps
  • Primary RM acts as sequencer
  • Another RM can be elected to take over
    consistently as sequencer
  • Majority of RMs (including primary) must record
    which update is the next in sequence
  • Immediate ordering by having the primary order
    them in the sequence (along with forced updates
    considering causal updates as well)
  • Agreement protocol on sequence

42
Gossip timestamps
  • Gossip msgs between RMs
  • Replica timestamp & log
  • Receiver's tasks
  • Merge arriving log m.log with its own
  • Add record r to local log if replicaTS < r.ts
  • Apply any updates that have become stable
  • This may in turn make pending updates become
    stable
  • Eliminate records from log entries in executed
    table
  • Once it is established that they have been
    applied everywhere
  • Sort the set of stable updates in timestamp order
  • r is applied only if there is no s s.t. s.prev <
    r.prev
  • tableTS[j] = m.ts
  • If tableTS[i][c] ≥ r.ts[c], for all i, then r is
    discarded
  • c: RM that created record r
  • ACKs by front-ends to discard records from
    executed table

43
Update propagation
  • How long before all RMs receive an update ?
  • Frequency & duration of network partitions
  • Beyond system's control !
  • Frequency of gossip msgs
  • Policy for choosing a gossip partner
  • Random
  • Weighted probabilities to favor near partners
  • Surprisingly robust !
  • But exhibits variable update propagation times
  • Deterministic
  • Simple function of RM's state
  • E.g., examine timestamp table & choose the RM
    that appears to be the furthest behind in updates
    received
  • Topological
  • Based on fixed arrangement of RMs into a graph
  • Ring, mesh, trees
  • Trade-off: amount of communication against higher
    latencies & the possibility that a single failure
    will affect other RMs

44
Scalability concerns
  • 2 messages per query (between front-end & RM)
  • Causal update
  • G updates per gossip message
  • 2 + (R-1)/G messages exchanged
  • Increasing G leads to
  • Fewer messages
  • but also worse delivery latencies
  • RM has to wait for more updates to arrive before
    propagating them
  • Improvement by having read-only replicas
  • Provided that update/query ratio is low !
  • Updated by gossip msgs but do not receive
    updates directly from front-ends
  • Can be situated close to client groups
  • Vector timestamps need only include updateable RMs
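
A quick calculation of the message cost per causal update, assuming the count is 2 + (R - 1)/G for R replica managers and G updates batched into each gossip message. The function name is hypothetical.

```python
def messages_per_update(r, g):
    """2 messages between front end and RM, plus a (R - 1)/G share of gossip."""
    return 2 + (r - 1) / g


for g in (1, 2, 4):
    print(g, messages_per_update(5, g))
```

With R = 5 the cost falls from 6 messages per update at G = 1 to 3 at G = 4, at the price of updates waiting longer before they are gossiped.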

45
References
  • R. Ladin, B. Liskov, L. Shrira and S. Ghemawat,
    "Providing Availability using Lazy Replication",
    ACM Trans. Computer Systems, vol. 10, no. 4, pp.
    360-391, 1992.
  • A. Demers, D. Greene, C. Hauser, W. Irish and J.
    Larson, "Epidemic algorithms for replicated
    database maintenance", Proc. 6th ACM Symposium
    on Principles of Distributed Computing, pp. 1-12,
    1987.
  • D. B. Terry, M. M. Theimer, K. Petersen, A. J.
    Demers, M. J. Spreitzer and C. Hauser,
    "Managing Update Conflicts in Bayou, a Weakly
    Connected Replicated Storage System", Proc. 15th
    ACM SOSP, pp. 172-183, 1995.