1
Computer Science 328 Distributed Systems
  • Lecture 20
  • Replication Control II

2
Gossiping Architecture
  • The replica managers exchange gossip messages
    periodically in order to convey the updates they
    have each received from clients.
  • Objective: provisioning of a highly available
    service.
  • Each client obtains a consistent service over
    time: in response to a query, an RM may have to
    wait until it receives updates from other RMs.
    The RM then provides data that reflects at least
    the updates that the client has observed so far.
  • Relaxed consistency between replicas: all RMs
    eventually receive all updates, and they apply
    them with ordering guarantees.

3
Query and Update Operations in a Gossip Service
[Figure: the gossip service. Clients issue Query and Update operations
through front ends (FEs). An FE sends a query together with its prev
timestamp to an RM and receives a value Val plus a new timestamp; it sends
an update with prev and receives an update id. The RMs exchange gossip
messages among themselves.]
4
Various Timestamps
  • Virtual timestamps are used to control the order
    of operation processing. The timestamp contains
    an entry for each RM.
  • Each front end keeps a vector timestamp, prev,
    that reflects the latest data values accessed by
    the front end. The FE sends it in every request
    to an RM.
  • When an RM returns a value as a result of a query
    operation, it supplies a new timestamp, new.
  • An update operation returns a timestamp, update
    id.
  • Each returned timestamp is merged with the FE's
    previous timestamp to record the data that has
    been observed by the client (see the sketch
    below).
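
A minimal sketch in Python of the vector-timestamp merge an FE performs; the list-of-ints representation and the helper names are illustrative assumptions, not part of the slides.

    # Vector timestamps as plain lists of ints, one entry per RM (assumed layout).
    def merge(ts_a, ts_b):
        """Element-wise maximum of two vector timestamps."""
        return [max(a, b) for a, b in zip(ts_a, ts_b)]

    def leq(ts_a, ts_b):
        """ts_a <= ts_b iff every element of ts_a is <= the matching one in ts_b."""
        return all(a <= b for a, b in zip(ts_a, ts_b))

    # Example: an FE with prev = [2, 4, 6] receives new = [2, 5, 5] from a query.
    prev = merge([2, 4, 6], [2, 5, 5])   # -> [2, 5, 6]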

5
Front ends Propagate Their Timestamps
Since client-to-client communication can also
lead to causal relationships between operations
applied to the service, the FE piggybacks its
timestamp on messages to other clients.
6
A Gossip Replica Manager
7
Replica Manager State
  • Value: the value of the object maintained by the RM.
  • Value timestamp: the timestamp that represents
    the updates that are reflected in the value.
    Updated whenever an update operation is applied.
  • Update log: records all update operations as soon
    as they are received.
  • Keeps all the updates that are not yet stable, where
    a stable update is one that can be applied
    consistently with its ordering guarantees.
  • Keeps stable updates that have been applied but
    cannot be purged yet, because it has not yet been
    confirmed that these updates have been received at
    all other RMs.
  • Replica timestamp: represents the updates that have
    been accepted by the RM into the log.
  • Executed operation table: contains the
    FE-supplied ids of updates that have been applied
    to the value.
  • Used to prevent an update being applied twice, as
    an update may arrive both from an FE and in gossip
    messages from other RMs.
  • Timestamp table: contains, for each RM, the
    latest timestamp that has arrived in a gossip
    message from that RM (this state is sketched as a
    data structure below).
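
A sketch of this per-RM state as a Python data structure; the class and field names are illustrative, chosen to mirror the terms on the slide.

    from dataclasses import dataclass, field

    @dataclass
    class LogRecord:
        i: int        # index of the RM that accepted the update
        ts: list      # timestamp assigned to the update by that RM
        op: object    # the update operation itself
        prev: list    # FE timestamp sent with the update
        uid: str      # FE-supplied unique id

    @dataclass
    class ReplicaManagerState:
        value: object = None                              # application state
        value_ts: list = field(default_factory=list)      # updates reflected in value
        update_log: list = field(default_factory=list)    # LogRecords not yet purged
        replica_ts: list = field(default_factory=list)    # updates accepted into the log
        executed: set = field(default_factory=set)        # ids of applied updates
        ts_table: dict = field(default_factory=dict)      # RM id -> latest gossiped timestamp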

8
A Closer Look at the Virtual Timestamp
  • The ith element of a vector timestamp held by RMi
    corresponds to the number of updates received
    from FEs by RMi.
  • The jth element (j ≠ i) of a vector timestamp held
    by RMi equals the number of updates received by
    RMj and forwarded to RMi in gossip messages.

9
Update Operations
  • Each update request u contains:
  • The update operation, u.op
  • The FE's timestamp, u.prev
  • A unique id that the FE generates, u.id.
  • Upon receipt of an update request, the RM (say the
    ith one):
  • Checks if u has been processed, by looking up u.id
    in the executed operation table and in the update
    log.
  • If not, increments the ith element of its replica
    timestamp by one, to keep track of the number of
    updates directly received from FEs.
  • Places a record for the update in the RM's log:
  • logRecord := <i, ts, u.op, u.prev, u.id>
  • where ts is derived from u.prev by replacing
    u.prev's ith element with the ith element of the
    replica timestamp.
  • Returns ts back to the FE, which merges it with
    its timestamp (these steps are sketched below).
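
A sketch of these steps in Python, reusing the illustrative ReplicaManagerState and LogRecord from the earlier sketch; the function name and argument layout are assumptions.

    def receive_update(rm, i, u_op, u_prev, u_id):
        """Sketch: RM number i accepts an update request from an FE."""
        # Discard duplicates: the update may already be logged or applied.
        if u_id in rm.executed or any(r.uid == u_id for r in rm.update_log):
            return None
        # One more update received directly from an FE.
        rm.replica_ts[i] += 1
        # Derive ts from u.prev by replacing its ith element.
        ts = list(u_prev)
        ts[i] = rm.replica_ts[i]
        rm.update_log.append(LogRecord(i, ts, u_op, u_prev, u_id))
        return ts   # the FE merges this into its own timestamp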

10
Update Operation (Contd)
  • The stability condition for an update u is
  • u.prev ≤ valueTS
  • i.e., all the updates on which this update depends
    have already been applied to the value.
  • When the update operation u becomes stable, the
    RM does the following (sketched below):
  • value := apply(value, u.op)
  • valueTS := merge(valueTS, ts) (update the value
    timestamp)
  • executed := executed ∪ {u.id} (update the
    executed operation table)
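
A sketch of the stability check and application step, continuing with the illustrative helpers above (leq, merge, ReplicaManagerState); the apply argument stands in for the application-specific update function.

    def apply_stable_updates(rm, apply):
        """Sketch: apply every logged update whose dependencies are satisfied."""
        progress = True
        while progress:                      # applying one update may stabilise others
            progress = False
            for r in rm.update_log:
                # Stability condition: u.prev <= valueTS, and not applied already.
                if r.uid not in rm.executed and leq(r.prev, rm.value_ts):
                    rm.value = apply(rm.value, r.op)         # value := apply(value, u.op)
                    rm.value_ts = merge(rm.value_ts, r.ts)   # valueTS := merge(valueTS, ts)
                    rm.executed.add(r.uid)                   # executed := executed U {u.id}
                    progress = True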

11
Exchange of Gossiping Messages
  • A gossip message m consists of the log of the RM,
    m.log, and the replica timestamp, m.ts.
  • An RM that receives a gossip message has three
    tasks:
  • (1) Merge the arriving log with its own.
  • Let replicaTS denote the recipient RM's replica
    timestamp. A record r in m.log is added to the
    recipient's log unless r.ts ≤ replicaTS.
  • replicaTS := merge(replicaTS, m.ts)
  • (2) Apply any updates that have become stable but
    have not yet been executed (stable updates in the
    arrived log may make pending updates become stable).
  • (3) Eliminate records from the log and the executed
    operation table when it is known that the updates
    have been applied everywhere (tasks 1 and 2 are
    sketched below).
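
A sketch of tasks (1) and (2) in Python, again using the illustrative helpers from the earlier sketches; purging (task 3) is omitted because it needs the timestamp table and acknowledgement tracking.

    def receive_gossip(rm, m_log, m_ts, apply):
        """Sketch: merge an incoming gossip message <m.log, m.ts> into this RM."""
        # (1) Merge the arriving log with our own, skipping records already
        #     covered by our replica timestamp.
        for r in m_log:
            if not leq(r.ts, rm.replica_ts):
                rm.update_log.append(r)
        rm.replica_ts = merge(rm.replica_ts, m_ts)
        # (2) Apply any updates that have now become stable.
        apply_stable_updates(rm, apply)
        # (3) Purging of records applied everywhere is not shown here.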

12
Update Propagation
  • The frequency with which RMs send gossip messages
    depends on the application.
  • The policy for choosing a partner with which to
    exchange gossip:
  • Random policies choose a partner randomly
    (perhaps with weighted probabilities).
  • In deterministic policies, an RM can examine its
    timestamp table and choose the RM that is the
    furthest behind in the updates it has received.
  • Topological policies arrange the RMs into a fixed
    graph, such as a mesh, a ring, or a tree.
  • Each has its own merits and drawbacks. The ring
    topology produces relatively little communication
    but is subject to high transmission latencies
    since gossip has to traverse several RMs.

13
Query Operations
  • A query request q contains the operation, q.op,
    and the timestamp, q.prev, sent by the FE.
  • Let valueTS denote the RM's value timestamp; then
    q can be applied if
  • q.prev ≤ valueTS
  • The RM keeps q on a hold-back queue until the
    condition is fulfilled.
  • If valueTS is (2,5,5) and q.prev is (2,4,6), then
    one update from RM3 is missing.
  • Once the query is applied, the RM returns
  • new := valueTS
  • to the FE, and the FE merges new with its
    timestamp (a sketch of this check follows).
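
A sketch of the query path with the illustrative helpers above; the read argument stands in for the application-specific query function, and the hold-back queue itself is not modelled.

    def can_answer(rm, q_prev):
        """Sketch: a query q can be applied once q.prev <= valueTS."""
        return leq(q_prev, rm.value_ts)

    def answer_query(rm, q_op, q_prev, read):
        """Sketch: apply a stable query and return (Val, new)."""
        if not can_answer(rm, q_prev):
            return None                      # would stay on the hold-back queue
        return read(rm.value, q_op), list(rm.value_ts)   # 'new' is merged by the FE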

14
Transactions with Replicated Data
15
Transactions on Replicated Data
16
One Copy Serialization
  • In a non-replicated system, transactions appear
    to be performed one at a time in some order. This
    is achieved by ensuring a serially equivalent
    interleaving of transaction operations.
  • One-copy serializability: the effect of
    transactions performed by clients on replicated
    objects should be the same as if they had been
    performed one at a time on a single set of
    objects.

17
Two Phase Commit Protocol For Replicated Objects
  • In the first phase, the coordinator sends the
    canCommit? request to the workers, each of which
    then passes it on to the other RMs in its group and
    collects their replies before replying to the
    coordinator.
  • In the second phase, the coordinator sends the
    doCommit or doAbort request, which is passed on to
    the members of the groups of RMs (a schematic
    sketch follows).
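
A schematic sketch of how a worker RM might relay the two phases for its replica group; the rm.can_commit, rm.do_commit and rm.do_abort methods are assumptions standing in for whatever transaction interface the RMs expose.

    def can_commit_group(trans_id, group):
        """Phase one at the worker fronting a replica group:
        vote yes only if every RM in the group can commit."""
        return all(rm.can_commit(trans_id) for rm in group)        # assumed RM method

    def decide_group(trans_id, group, commit):
        """Phase two: pass doCommit / doAbort on to every member of the group."""
        for rm in group:
            (rm.do_commit if commit else rm.do_abort)(trans_id)    # assumed RM methods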

18
Primary Copy Replication
  • All the client requests are directed to a single
    primary RM.
  • Concurrency control is applied at the primary.
  • To commit a transaction, the primary communicates
    with the backup RMs and replies to the client.

19
Read One/Write All Replication
  • FEs may communicate with any RM.
  • Every write operation must be performed at all of
    the RMs, each of which sets a write lock on the
    object.
  • Each read operation is performed by a single RM,
    which sets a read lock on the object.
  • Consider pairs of operations of different
    transactions on the same object.
  • Any pair of write operations will require
    conflicting locks at all of the RMs.
  • A read operation and a write operation will
    require conflicting locks at a single RM.
  • Thus one-copy serializability is achieved.

20
Available Copies Replication
  • A client's read request on an object may be
    performed by any available RM, but a client's
    update request must be performed by all available
    RMs in the group.
  • Example (next slide): at X, transaction T
    has read A and hence transaction U is not allowed
    to update A until transaction T has completed.
  • As long as the set of available RMs does not
    change, local concurrency control achieves
    one-copy serializability in the same way as in
    read-one/write-all replication.
  • This is not true if RMs fail and recover during
    conflicting transactions.

21
Available Copies Approach
22
The Impact of RM Failure
  • Assume that (i) RM X fails just after T has
    performed getBalance but before it has performed
    deposit and (ii) RM N fails just after U has
    performed getBalance, but before it has performed
    deposit.
  • T's deposit will be performed at RMs M and P, and
    U's deposit will be performed at RM Y.
  • The concurrency control on A at RM X does not
    prevent transaction U from updating A at RM Y.
  • Solution: crashes and recoveries must be
    serialized with respect to transaction
    operations.

23
Local Validation (using Our Example)
  • From T's perspective:
  • T has read from an object at X, so X must fail
    after T's operation.
  • T observes the failure of N when it attempts to
    update the object, so N's failure must be before
    T's update.
  • N fails → T reads object A at X; T writes object
    B at M and P → T commits → X fails.
  • From U's perspective:
  • X fails → U reads object B at N; U writes object
    A at Y → U commits → N fails.
  • At the time T commits, it checks that N is still
    unavailable and that X, M and P are still
    available. If so, T can commit.
  • This implies that X fails after T validated and
    before U validated. U's validation fails because N
    has already failed (a sketch of the check follows).
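
A sketch of the local validation check at commit time; the function, the set-based bookkeeping and the availability sets are illustrative.

    def local_validation(rms_used, rms_seen_failed, currently_available):
        """Sketch: commit only if every RM the transaction accessed is still
        available and every RM it observed as failed is still unavailable."""
        still_up = all(rm in currently_available for rm in rms_used)
        still_down = all(rm not in currently_available for rm in rms_seen_failed)
        return still_up and still_down

    # T used X, M, P and saw N fail; at T's commit X, M, P (and Y) are up:
    print(local_validation({"X", "M", "P"}, {"N"}, {"X", "M", "P", "Y"}))  # True
    # U used N and Y and saw X fail; by U's commit N has already failed:
    print(local_validation({"N", "Y"}, {"X"}, {"M", "P", "Y"}))            # False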

24
Network Partition
As there has been a partition, pairs of
conflicting transactions have been allowed to
commit in different partitions. The only choice
after the network is recovered is to abort one
of the transactions. In the pessimistic quorum
consensus approach, updates are allowed only in a
partition that has the majority of RMs and
updates are propagated to the other RMs when
the partition is repaired.
25
Static Quorums
  • The decision about how many RMs should be
    involved in an operation on replicated data is
    called quorum selection.
  • Quorum rules state that:
  • At least r replicas must be accessed for a read
  • At least w replicas must be accessed for a write
  • r + w > N, where N is the number of replicas
  • w > N/2
  • Each object has a version number or a consistent
    timestamp.
  • A static quorum predefines r and w; this is a
    pessimistic approach: if a partition occurs, an
    update is possible in at most one partition
    (a sketch of the rule check follows).
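
A minimal sketch of checking the two quorum rules; the function name and the example figures are illustrative.

    def valid_static_quorum(r, w, n):
        """Sketch: read/write quorums must overlap (r + w > N) and
        write quorums must overlap each other (w > N/2)."""
        return r + w > n and w > n / 2

    print(valid_static_quorum(2, 4, 5))   # True:  any read sees the latest write
    print(valid_static_quorum(2, 2, 5))   # False: two writes could miss each other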

26
Voting with Static Quorums
  • A version of quorum selection where each replica
    has a number of votes. A quorum is reached by a
    majority of votes (N is the total number of
    votes).
  • E.g., a cache replica may be given 0 votes.
  • With r = w = 2: the access time for a write is
    750 ms; the access time for a read without the
    cache is 750 ms; the access time for a read with
    the cache is 175 - 825 ms.

Replica   Votes   Access time   Version check   P(failure)
Cache     0       100 ms        0 ms            0
Rep1      1       750 ms        75 ms           1
Rep2      1       750 ms        75 ms           1
Rep3      1       750 ms        75 ms           1
27
Quorum Consensus Examples
28
Optimistic Quorum Approaches
  • An Optimistic Quorum selection allows writes to
    proceed in any partition.
  • Failed processors are considered for write
    quorum, but not for read quorum.
  • This might lead to write-write conflicts.
  • An optimistic quorum is practical when:
  • Conflicting updates are rare
  • Conflicts are always detectable
  • Damage from conflicts can be easily confined
  • Repair of damaged data is possible or an update
    can be discarded without consequences.

29
View-based Quorum
  • The quorum is based on views at any given time.
  • In a small partition, the group "plays dead",
    hence no writes occur.
  • In a large enough partition, inaccessible nodes
    are considered in the quorum as ghost processors.
  • Once the partition is repaired, processors in
    the small partition know who to contact for
    updates.
  • But taking a straw poll for every action is
    very costly. Can we do better?

30
View-based Quorum - details
  • We define a read and a write threshold:
  • Aw: minimum number of nodes in a view for a write,
    with Aw > N/2
  • Ar: minimum number of nodes in a view for a read
  • Aw + Ar > N
  • If an ordinary quorum cannot be reached for an
    operation, we take a straw poll, i.e. update the
    views.
  • In a partition large enough for reads, Ar ≤
    Viewsize; in a partition large enough for
    writes, Aw ≤ Viewsize (inaccessible nodes are
    counted as ghosts).
  • Views are per object, numbered sequentially, and
    only updated if necessary.
  • The first update after partition repair forces
    restoration for nodes in the smaller partition
    (the thresholds are sketched below).
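
A minimal sketch of the view-based thresholds; the function names are illustrative, and Viewsize here simply counts the nodes (including ghosts) in the current view.

    def view_allows_read(view_size, a_r):
        """Sketch: reads are allowed once the view holds at least A_r nodes."""
        return view_size >= a_r

    def view_allows_write(view_size, a_w):
        """Sketch: writes are allowed once the view holds at least A_w nodes."""
        return view_size >= a_w

    # With N = 5 and Ar = Aw = 3 (next slide): a 3-node view may read and write,
    # while a 2-node view "plays dead".
    print(view_allows_write(3, 3), view_allows_read(2, 3))   # True False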

31
Example View-based Quorum
  • Consider N = 5, w = 5, r = 1, Aw = 3, Ar = 3

32
Example View-based Quorum (contd)