Formal Models for Distributed Negotiations Commit Protocols


XVII Escuela de Ciencias Informaticas (ECI 2003),
Buenos Aires, July 21-26 2003
Roberto Bruni Dipartimento di Informatica
Università di Pisa
Distributed DataBases
  • Data can be inherently distributed
  • e.g. customers accounts in different branches of
    the same bank
  • Data are distributed to achieve failure
    independence
  • e.g. replicated file systems
  • Partial failures can lead to inconsistent results
  • Commits have to be coordinated among participants
    to preserve data consistency

Atomic Commitment Problem
  • Reach a globally consistent state despite
    failures
  • Each participant has two possible decision values
  • commit
  • All participants will make the transaction's
    updates permanent
  • abort
  • All will roll-back
  • Individual decisions are irreversible
  • A commit decision requires unanimity of YES votes

Atomic Commitment Properties
  • Consensus
  • All participants that decide reach the same
    decision
  • If any participant decides commit, then all
    participants must have voted YES
  • If all participants have voted YES and no
    failures occur, then commit is decided
  • Irreversibility
  • Each participant decides at most once
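The properties above can be read as checks on the outcome of a single protocol run. The following is a minimal sketch under that reading; the function name `check_atomic_commitment` and the dictionary encoding of votes and decisions are assumptions made for illustration, not part of the original slides. Irreversibility is not covered, because it cannot be checked from a final snapshot.

```python
from typing import Dict, Optional


def check_atomic_commitment(votes: Dict[str, str],
                            decisions: Dict[str, Optional[str]],
                            failures_occurred: bool) -> Dict[str, bool]:
    """Check one finished run against three of the listed properties.

    votes maps participant -> "YES" / "NO".
    decisions maps participant -> "commit" / "abort" / None (undecided).
    """
    decided = {d for d in decisions.values() if d is not None}
    all_yes = all(v == "YES" for v in votes.values())
    return {
        # All participants that decide reach the same decision.
        "same_decision": len(decided) <= 1,
        # If anyone decides commit, everyone must have voted YES.
        "commit_implies_all_yes": ("commit" not in decided) or all_yes,
        # All YES votes and no failures imply that commit is decided.
        "all_yes_no_failures_implies_commit":
            (not all_yes) or failures_occurred or decided == {"commit"},
    }
```

For example, a run in which one participant voted NO but another decided commit fails both `same_decision` and `commit_implies_all_yes`.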

Commitment Protocols
  • Atomic commitment protocol
  • satisfies all atomic commitment properties
  • ensures that transactions terminate consistently
    at all participating sites of a distributed
    database, even in presence of failures
  • Non-blocking
  • if it permits transaction termination to proceed
    at correct participants despite failures of
    others
  • is the activity of ensuring that software and
    hardware failures do not corrupt persistent data
  • can limit time intervals of resource locking

Some Assumptions
  • One of the participants acts as unique
    coordinator (centralized version)
  • At most one (if no failures occur, then there is
    one)
  • A participant assumes the role of coordinator
    within a fixed time interval from the beginning
    of the transaction
  • The transaction begins at a single participant
    called the invoker
  • sends start messages to other participants
  • Only undeliverable messages are dropped
  • All participants can communicate (useful later)

Generic ACP Coordinator
  • send VOTE-REQ_Tid to all participants // Phase 1
  • set-timeout
  • wait-for vote_Tid from all participants
      • if (all votes are YES) then // Phase 2
          • broadcast(commit_Tid, participants)
      • else // at least one vote is NO
          • broadcast(abort_Tid, participants)
  • on-timeout // escape blocking wait-for
      • broadcast(abort_Tid, participants)

Generic ACP Participants
  • set-timeout
  • wait-for VOTE-REQ_Tid from coordinator // 1
      • send vote_Tid to coordinator
      • if (vote = NO) then // unilateral abort
          • decide abort
      • else
          • set-timeout
          • wait-for decision from coordinator // 2
              • if (decision = abort) then decide abort
              • else decide commit
          • on-timeout termination-protocol // escape 2
  • on-timeout decide abort // escape 1
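The decision logic of the generic ACP coordinator and participant can be sketched as two pure functions. This is only an illustration: message passing and timers are abstracted away, a timeout is encoded as `None`, and the names `coordinator_decision` and `participant_decision` are hypothetical.

```python
from typing import Dict, Optional


def coordinator_decision(votes: Dict[str, Optional[str]]) -> str:
    """votes maps participant -> "YES", "NO", or None (vote timed out)."""
    if all(v == "YES" for v in votes.values()):
        return "commit"
    return "abort"  # at least one NO vote, or a timeout


def participant_decision(vote_req_received: bool,
                         my_vote: str,
                         coordinator_msg: Optional[str]) -> str:
    """coordinator_msg is "commit", "abort", or None (decision timed out)."""
    if not vote_req_received:
        return "abort"  # escape 1: timeout while waiting for VOTE-REQ
    if my_vote == "NO":
        return "abort"  # unilateral abort
    if coordinator_msg is None:
        # escape 2: the participant voted YES and is now uncertain
        return "run-termination-protocol"
    return coordinator_msg
```

Note that only a participant that voted YES ever needs the termination protocol; every other timeout resolves to abort locally.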

Simple Broadcast
  • broadcast(m,S)
  • // Broadcaster
  • send m to all processes in S
  • deliver m
  • // other processes in S
  • upon-receipt m // non-blocking
  • deliver m
  • With this simple broadcast, the generic ACP
    corresponds to the 2PC protocol

Timeout Actions
  • Participants wait for VOTE_REQ from the coordinator
      • If this takes too long, a participant can just
        decide abort
  • The coordinator waits for the votes
      • No global decision has been made yet
      • The coordinator can decide abort
  • Participants wait for commit / abort from the
    coordinator
      • The participant has already voted YES
      • It is now uncertain
      • It must consult other participants according to
        the termination protocol

Termination Protocol (TP)
  • What if a participant that voted YES times out
    waiting for the response from coordinator?
  • It invokes a termination protocol to contact
  • the coordinator
  • other participants (cooperative TP)
  • can have already voted or not yet voted
  • There are failure scenarios for which no
    termination protocol can lead to a decision
  • Blocking scenario: correct participants cannot
    decide
  • e.g. the coordinator crashes during the broadcast
  • all faulty participants deliver the decision and
    then crash
  • all correct participants do not deliver the
    decision
  • if the faulty participants do not recover, any
    decision could contradict the decision of a
    participant that crashed

Non-Blocking ACP I
  • set-timeout
  • wait-for VOTE-REQ_Tid from coordinator // 1
      • send vote_Tid to coordinator
      • if (vote = NO) then // unilateral abort
          • decide abort
      • else
          • set-timeout
          • wait-for decision from coordinator // 2
              • if (decision = abort) then decide abort
              • else decide commit
          • on-timeout decide abort // escape 2
  • on-timeout decide abort // escape 1

Non-Blocking ACP II
  • broadcast(m,S)
  • // Broadcaster as before
  • // other processes in S
  • upon-first-receipt m
  • send m to all processes in S // S can be sent
    along with VOTE_REQ
  • deliver m
  • any process receiving m relays m to all others
    (if any correct process receives m, all correct
    processes receive m, even if the broadcaster crashes)
  • m is delivered only after relaying
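The relay rule can be illustrated with a small simulation. It is a sketch under simplifying assumptions that are not in the slides: channels between correct processes are reliable, faulty processes crash before relaying anything, and the function name `relay_broadcast` is hypothetical.

```python
from typing import List, Set


def relay_broadcast(processes: List[str],
                    first_receivers: Set[str],
                    crashed: Set[str]) -> Set[str]:
    """Return the set of correct processes that deliver m.

    first_receivers: processes reached by the broadcaster before it crashed.
    crashed: processes that crash before relaying (assumption of this sketch).
    """
    received = set(first_receivers)
    frontier = [p for p in first_receivers if p not in crashed]
    delivered: Set[str] = set()
    while frontier:
        p = frontier.pop()
        # upon-first-receipt: relay m to every process that lacks it ...
        for q in processes:
            if q not in received:
                received.add(q)
                if q not in crashed:
                    frontier.append(q)
        # ... and deliver m only after relaying.
        delivered.add(p)
    return delivered
```

In this model either every correct process delivers m or none does, which is the property the simple broadcast lacks when the broadcaster crashes midway.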

  • Participant p is recovering from a failure
  • Must reach a consistent decision
  • Suppose p remembers its state at the time it
    failed
  • Before voting
  • it can unilaterally abort
  • After deciding abort
  • it can unilaterally abort
  • After receiving commit / abort from coordinator
  • it had already decided and must behave
    accordingly
  • During the uncertainty period (voted YES)
  • Independent recovery is not possible!
  • Termination protocol is needed

Distributed Transaction Log
  • DTL is kept in stable storage at each site
  • Its content must survive failures
  • Coordinators and participants at that site can
    record information about transactions
  • Before/after sending VOTE_REQ, the coordinator C
    writes start2PC(S,Tid)
  • Before voting YES, a participant writes yes(Tid)
  • Before/after voting NO, a participant writes
    abort(Tid)
  • Before C sends commit, it writes commit(Tid)
  • Before/after C sends abort, it writes abort(Tid)
  • After receiving the decision, a participant writes
    it (commit(Tid) or abort(Tid))

Recovery From DTL
  • If DTL contains start2PC (the site hosted the
    coordinator)
      • If it also contains commit/abort
          • The coordinator decided before the failure
      • Otherwise
          • The coordinator can decide abort (and record
            it in DTL)
  • Otherwise (the site hosted a participant)
      • It contains commit/abort
          • The participant reached a decision before the
            failure
      • Does not contain yes
          • Either failed before voting or voted no
          • The participant can unilaterally abort
      • Otherwise (it contains yes but not commit/abort)
          • The participant failed in its uncertainty period
          • Must use the termination protocol
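The recovery cases above amount to a small decision procedure over the log contents. The sketch below assumes the DTL for one transaction is given as a list of record names ("start2PC", "yes", "commit", "abort"); that encoding and the function name `recover` are assumptions for illustration.

```python
from typing import List


def recover(dtl: List[str]) -> str:
    """Decide what a recovering site does for one transaction."""
    records = set(dtl)
    # A logged decision, at coordinator or participant, is simply replayed.
    for decision in ("commit", "abort"):
        if decision in records:
            return decision
    if "start2PC" in records:
        # Coordinator site that failed before deciding: it can abort.
        return "abort"
    if "yes" not in records:
        # Participant failed before voting, or voted NO: unilateral abort.
        return "abort"
    # Participant voted YES and saw no decision: uncertainty period.
    return "run-termination-protocol"
```

Only the last case needs help from other sites; all the others are instances of independent recovery.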

Cooperative TP Initiator
  • send DECISION_REQ_Tid to all processes in S
  • wait-for decision_Tid from any process
      • if (decision = commit) then
          • write commit in DTL
      • else // decision = abort
          • write abort in DTL

Cooperative TP Responder
  • wait-for DECISION_REQ_Tid from any process p
      • if (abort(Tid) in DTL) then
          • send abort to p
      • else if (commit(Tid) in DTL) then
          • send commit to p
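The two roles of the cooperative TP can be sketched as follows. Messages are replaced by direct function calls, each peer's DTL is modelled as a set of record names, and `respond` and `initiate` are hypothetical names; a responder with no logged decision stays silent, which is how the initiator can remain blocked.

```python
from typing import Iterable, Optional, Set


def respond(dtl: Set[str]) -> Optional[str]:
    """Responder: answer a DECISION_REQ only if a decision is logged."""
    if "abort" in dtl:
        return "abort"
    if "commit" in dtl:
        return "commit"
    return None  # no logged decision: no reply in this sketch


def initiate(peer_logs: Iterable[Set[str]]) -> Optional[str]:
    """Initiator: adopt the first decision any peer reports."""
    for log in peer_logs:
        reply = respond(log)
        if reply is not None:
            return reply  # the initiator would now write it in its DTL
    return None  # every reachable peer is uncertain: still blocked
```

If every reachable peer has only a yes record, `initiate` returns `None`, which corresponds to the blocking scenario.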

Evaluation of 2PC
  • Criteria: Reliability vs. Efficiency
  • Resiliency
  • What failures can be tolerated?
  • Blocking
  • Can processes be blocked?
  • Under which conditions?
  • Time Complexity
  • How long does it take to reach a decision?
  • Message Complexity
  • How many messages are exchanged to reach a
    decision?
  • What are their sizes?

  • Reliability and Efficiency are conflicting goals
  • each can be achieved at the expense of the other
  • The choice of protocol depends on which goal is
    more important for a specific application
  • Whatever protocol is chosen, we should optimize
    for the case with no failures
  • Hopefully the normal operating state of the system

Measuring Time Complexity
  • A round is the max time for a message to reach
    its destination
  • Timeouts are based on the assumption that such a
    delay is known
  • Note that many messages can be sent in a single
    round
  • Two messages must belong to different rounds iff
    one cannot be sent before the other is received
  • Rounds are taken as time units
  • We count the number of rounds needed for
    unblocked sites to reach a decision, in the worst
    case
  • This neglects the time needed to process messages
  • Reasonable: message delays usually exceed
    processing delays
  • Two other factors can be relevant
  • DTL management (on stable storage)
  • Broadcasting preparation (to a large number of
    recipients)

Measuring Message Complexity
  • Number of messages sent during the whole protocol
  • Reasonable measure if individual messages are not
    very large
  • Otherwise we should measure the length of
    messages, not merely their number
  • Here messages are short, so we abstract away from
    their lengths

Reliability of 2PC
  • Resiliency
  • 2PC is resilient to
  • site failures
  • communication failures
  • In fact, the cause of timeouts is not important
  • Blocking
  • 2PC is subject to blocking
  • Probabilistic analysis can be performed depending
    on the probability distribution of failures

Time Complexity of 2PC
  • In absence of failure, 2PC requires 3 rounds
  • Broadcast VOTE-REQ
  • Collect votes
  • Broadcast global decision
  • If failures happen, the TP may need 2 additional
    rounds
  • Broadcast DECISION_REQ
  • Reply from a process outside its uncertainty
    period
  • Note that several TPs can be initiated separately
    in the same round
  • Up to 5 rounds, independently of the number of
    failures
  • But processes may remain blocked for an unbounded
    period of time

Message Complexity of 2PC
  • Let N+1 be the number of participants, including
    the coordinator
  • In each round of 2PC, N messages are sent
  • Hence, in absence of failures 2PC uses 3N
    messages
  • Cooperative TP is invoked by all participants
    that voted YES but did not receive commit / abort
  • Let there be M such participants
  • M initiators, each sending N DECISION_REQ (MN
    messages)
  • At most N-M+1 processes will respond to the first
    initiator
  • In the worst case only one more process leaves its
    uncertainty period and will respond to the next
    initiator: (N-M+1) + (N-M+2) + ... + N replies

Calculating the Message Complexity of 2PC
  • In the worst case the total number of TP messages
    will be
  • $NM + \sum_{i=1}^{M} (N-M+i) = NM + NM - M^2 + M(M+1)/2$
  • $= 2NM - M^2/2 + M/2$ messages
  • This quantity is maximum when $M = N$
  • $N(3N+1)/2$ messages
  • The 2PC together with the worst-case TP amounts to
  • $3N + N(3N+1)/2 = N(3N+7)/2$ messages
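The closed forms can be cross-checked numerically by summing the requests and replies directly. The sketch below assumes the counting model stated in the slides (M initiators each sending N requests, the i-th initiator receiving N-M+i replies); the function names are hypothetical.

```python
def tp_messages(n: int, m: int) -> int:
    """Worst-case cooperative TP messages for m initiators."""
    requests = m * n                                    # M initiators x N DECISION_REQ
    replies = sum(n - m + i for i in range(1, m + 1))   # (N-M+1) + ... + N
    return requests + replies


def worst_case_total(n: int) -> int:
    """3N messages of failure-free 2PC plus the worst-case TP (M = N)."""
    return 3 * n + max(tp_messages(n, m) for m in range(1, n + 1))


# Compare with the closed forms; both sides are doubled to stay in integers.
for n in range(1, 30):
    for m in range(1, n + 1):
        assert 2 * tp_messages(n, m) == 4 * n * m - m * m + m
    assert tp_messages(n, n) == n * (3 * n + 1) // 2
    assert worst_case_total(n) == n * (3 * n + 7) // 2
```

For N = 4, for instance, the TP contributes 26 messages in the worst case and the total is 38.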

Communication Topology
  • The communication topology of a protocol is the
    specification of who sends messages to whom
  • e.g. in 2PC without TP, the coordinator sends
    messages to participants and vice versa
  • Participants do not send messages directly to
    each other
  • The topology is described as a tree of height 1


Alternative 2PCs
  • To reduce time and message complexity of
    centralized 2PC, two variations have been
    proposed, based on different communication
    topologies
  • Decentralized 2PC
  • Communication topology is a complete graph
  • Improve time complexity
  • Linear 2PC (aka Nested 2PC)
  • Linearly ordered processes
  • Reduce the number of messages

Decentralized 2PC
  • Depending on its own vote, the coordinator sends
    YES or NO to all participants
  • Informs that it is time to vote
  • Tells the coordinator's vote
  • If the message is NO
  • Each participant decides abort and stops
  • Otherwise, each participant sends its vote to all
    other processes
  • After receiving all votes each process can decide
  • If all are YES and its own vote is YES, decide
    commit
  • Otherwise it decides abort
  • Timeouts can be employed as in the centralized
    2PC

Evaluation of Decentralized 2PC
  • In the absence of failures, only 2 rounds are
    needed
  • Coordinator voting YES / NO
  • Each participant voting YES / NO
  • More messages are needed: $N^2 + N$ messages
  • N messages in the first round
  • $N^2$ messages in the second round
  • (and this is just in absence of failures)

Linear 2PC
  • Each participant can communicate only with its
    left / right neighbors
  • The coordinator is the leftmost process
  • It sends its vote YES / NO to its right neighbor
  • This message has a dual meaning as in
    decentralized 2PC
  • Each participant p waits for the vote from its
    left neighbor
  • If it is YES, and p votes YES, then p tells YES
    to its right neighbor
  • Otherwise, p tells NO to its right neighbor
  • When the rightmost participant receives the vote,
    it makes the final decision commit / abort
  • The decision is propagated from right to left
  • When the coordinator receives it, the protocol
    terminates
  • Timeout periods are influenced by positions

Evaluation of Linear 2PC
  • Only 2N messages needed
  • N votes from left to right
  • N decisions from right to left
  • (and this is just in absence of failures)
  • Unfortunately, the same number of rounds is
    needed: 2N rounds
  • No two messages are sent concurrently

Comparison of 2PC Variants
  • Hybrid communication topologies are also possible
  • e.g. linear for voting, complete for conveying
    the decision
  • 2N messages, N+1 rounds
  • The choice of the protocol might be influenced by
    the available communication topology
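The failure-free costs quoted in the slides for each variant can be gathered in one place for a side-by-side comparison. The sketch below only restates those figures as functions of N (participants excluding the coordinator); the function name and dictionary layout are assumptions.

```python
from typing import Dict


def failure_free_costs(n: int) -> Dict[str, Dict[str, int]]:
    """Messages and rounds per 2PC variant, in absence of failures."""
    return {
        "centralized":   {"messages": 3 * n,     "rounds": 3},
        "decentralized": {"messages": n * n + n, "rounds": 2},
        "linear":        {"messages": 2 * n,     "rounds": 2 * n},
        # linear for voting, complete graph for conveying the decision
        "hybrid":        {"messages": 2 * n,     "rounds": n + 1},
    }


for name, cost in failure_free_costs(10).items():
    print(f"{name:14s} messages={cost['messages']:4d} rounds={cost['rounds']:3d}")
```

For N = 10 the decentralized variant needs 110 messages against 30 for the centralized one, while linear 2PC needs 20 rounds against 3.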

From 2PC to 3PC
  • In 2PC, if all operational participants are
    uncertain, they are blocked
  • They cannot decide abort even if aware that
    processes they cannot communicate with have
    failed, because some of them could have decided
    commit before failure
  • The 3PC is an ACP designed to rule out this
    situation
  • It guarantees that if any operational process is
    uncertain, then no (operational / failed) process
    can have decided commit
  • Thus, if p realizes that any operational site is
    uncertain, then p can decide abort
  • Why does 2PC violate this property?
  • A participant p can receive commit while q is
    still uncertain

Sketch of 3PC: The Idea
  • After the coordinator has found that all votes
    were YES, it sends pre-commit messages to all
  • When a participant p receives pre-commit, it
    knows that all participants voted YES
  • p is no longer uncertain, but does not decide
    commit yet
  • p knows that it will decide commit unless it
    fails
  • p acknowledges the receipt of pre-commit
  • When the coordinator collects all acks it knows
    that no participant is uncertain
  • The coordinator sends commit to all participants
  • When a participant receives commit, it decides
    commit
  • If a participant voted NO, then 3PC behaves as 2PC

Sketch of 3PC: Some Notes
  • In absence of failures, 3PC involves 5 rounds and
    up to 5N messages
  • Participants have four possible states
  • Aborted, Uncertain, Committable, Committed
  • For p and q any two participants, only certain
    combinations of their states are possible
  • Timeouts can occur in five situations
  • 3 are trivially handled
  • 2 require a complex termination protocol
  • Election protocol (for a new coordinator) based
    on a linear ordering of participants
  • The new coordinator checks the states of all
    operational participants
  • Timeouts are again necessary

  • We have seen
  • Atomic Commitment Problem
  • Several ACP protocols
  • Generic ACP
  • Centralized 2PC (Good middle ground)
  • Non-Blocking ACP
  • Decentralized 2PC (OK if end-to-end delays must
    be minimized)
  • Linear 2PC (OK if messages are expensive)
  • 3PC (sketched)
  • Learned some criteria to evaluate and compare
    protocols
  • Usually also dependent on the communication
    topology

  • P. Bernstein, N. Goodman, V. Hadzilacos.
    Concurrency Control and Recovery in Database
    Systems, Chapter 7. Addison-Wesley, 1987.
  • O. Babaoglu, S. Toueg. Non-blocking atomic
    commitment. Chapter 6 of Distributed Systems,
    Addison-Wesley, 1995.
