Title: Paxos Commit
1 Paxos Commit
- Jim Gray
- Leslie Lamport
- Microsoft Research
- Preview of a paper in preparation
- Presented at Microsoft Research Techfest
- 3 March 2004, Redmond, WA
- Article MSR-TR-2003-96
- Consensus on Transaction Commit
- http://research.microsoft.com/research/pubs/view.aspx?tr_id=701
2 Commit is Common
- Marriage ceremony: "Do you?" "I do." "I now pronounce you..."
- Theater: "Ready on the set?" "Ready!" "Action!"
- Contract law: Offer, Signature, Deal / lawsuit
3 The Common Picture
[Diagram: the director asks each actor "Ready?"; every actor answers "Ready"; the director then announces "Action!" to all actors.]
4 All or Nothing
If any actor says no, the deal is off.
[Diagram: the director asks each actor "Ready?"; one actor answers "No!", so the director announces "No deal!" to all actors.]
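5 The Database Version
[Diagram: the same picture relabeled for databases, with the director as the TM and the actors as RMs; the TM asks "Ready?", the RMs answer "Ready", and the TM sends "Commit" to every RM.]
TM = Transaction Manager, RM = Resource Manager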
6 Two Phase Commit
- N Resource Managers (RMs)
- Want all RMs to commit or all abort.
- Coordinated by Transaction Manager (TM): TM sends Prepare, Commit-Abort; RM responds Prepared, Aborted
- 3N+1 messages
- N+1 stable writes
- Delay: 4 message delays, 2 stable-write delays
- Blocking: if TM fails, Commit-Abort stalls
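The failure-free message flow on this slide can be sketched as a small in-memory simulation. This is only an illustration: the function name `two_phase_commit` and the vote strings are hypothetical, and the message tally assumes the slide's count reads 3N+1, made up of one client request plus N Prepare, N Prepared/Aborted, and N Commit-Abort messages.

```python
from typing import List, Tuple


def two_phase_commit(votes: List[str]) -> Tuple[str, int]:
    """Simulate one failure-free 2PC round (illustrative sketch only).

    votes[i] is RM i's reply to Prepare: "Prepared" or "Aborted".
    Returns (decision, messages_sent).
    """
    n = len(votes)
    messages = 1   # client -> TM: RequestCommit (assumed to be counted)
    messages += n  # phase 1: TM -> each RM: Prepare
    messages += n  # phase 1: each RM -> TM: Prepared / Aborted
    # All or nothing: a single "Aborted" vote aborts the transaction.
    decision = "Commit" if all(v == "Prepared" for v in votes) else "Abort"
    messages += n  # phase 2: TM -> each RM: Commit / Abort
    return decision, messages
```

With three RMs that all answer "Prepared", the sketch returns `("Commit", 10)`. The blocking problem is not modeled here: if the TM stops after phase 1, the prepared RMs have no way to learn the decision.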
7 The Problem With 2PC
- Atomicity: all or nothing
- Consistency: does right thing
- Isolation: no concurrency anomalies
- Durability / Reliability: state survives failures
- Availability: always up
Blocks if TM fails
8 Problem Statement
- ACID Transactions make error handling easy.
- One fault can make 2-Phase Commit block.
- Goal: ACID and Available. Non-blocking despite F faults.
9 Fault-Tolerant Two Phase Commit
[Diagram: the client sends RequestCommit to the TM; the TM sends Prepare to each RM; each RM replies Prepared.]
If the 2PC Transaction Manager (TM) fails, the transaction blocks.
Solution: add a spare transaction manager (non-blocking commit, 3-phase commit)
10 Fault-Tolerant Two Phase Commit
[Diagram: the same exchange with a spare TM added. After RequestCommit, Prepare, and Prepared, one TM sends commit while the other sends abort. Inconsistent! Now what?]
But what if...?
The complexity is a mess.
11 Fault Tolerant 2PC
- Several workarounds proposed in database community
- Often called "3-phase" or "non-blocking" commit.
- None with complete algorithm and correctness proof.
12 Reaching Agreement in the Presence of Faults
Shostak, Pease, Lamport
JACM, 1980
- 25 years of theory
- Now called the Consensus problem
- N processes want to agree on a value, even if F of them have failed.
13 Consensus
[Diagram: clients send "Propose X" and "Propose W" to a consensus box; the box reports "W Chosen" to every client.]
- Collects proposed values
- Picks one proposed value
- Remembers it forever
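The three properties of the consensus box can be illustrated with a write-once register. This is a single-process stand-in with no fault tolerance, and the class name `ConsensusBox` and its `propose` method are hypothetical. It also assumes the box picks the first proposal it receives, whereas a real consensus box may pick any proposed value.

```python
from typing import Optional


class ConsensusBox:
    """Single-process stand-in for a consensus box (illustrative only)."""

    def __init__(self) -> None:
        self.chosen: Optional[str] = None  # nothing chosen yet

    def propose(self, value: str) -> str:
        # Pick one proposed value: here, the first one to arrive (an assumption).
        if self.chosen is None:
            self.chosen = value
        # Remember it forever: later proposals never change the outcome.
        return self.chosen
```

If "W" is proposed first and "X" second, both calls return "W", matching the "W Chosen" replies in the diagram.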
14 Consensus for Commit: The Obvious Approach
[Diagram: the client sends Request Commit to the TM; the TM sends Prepare to each RM; the RMs reply Prepared; the TM sends "Propose Prepared" to the consensus box; the box reports "Prepared Chosen"; the TM sends Commit to the RMs.]
- Get consensus on TM's decision.
- TM just learns consensus value.
- TM is stateless
15 Consensus for Commit: The Paxos Commit Approach
[Diagram: the client sends Request Commit to the TM; the TM sends Prepare to each RM; each RM sends its own proposal ("Propose RM1 Prepared", "Propose RM2 Prepared") to a separate consensus box; the boxes report "RM1 Prepared Chosen" and "RM2 Prepared Chosen"; the TM sends Commit to the RMs.]
- Get consensus on each RM's choice.
- TM just combines consensus values.
- TM is stateless
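One way to picture "TM just combines consensus values" is as a pure function over the per-RM outcomes. The sketch below is an assumption about how the combination could look, not the paper's algorithm; the name `combine` and the use of `None` for "nothing chosen yet" are hypothetical.

```python
from typing import Dict, Optional


def combine(chosen: Dict[str, Optional[str]]) -> Optional[str]:
    """Combine per-RM consensus values into a transaction outcome.

    chosen maps each RM name to the value chosen by that RM's consensus box:
    "Prepared", "Aborted", or None if nothing has been chosen yet.
    """
    values = list(chosen.values())
    if any(v == "Aborted" for v in values):
        return "Abort"   # one aborted RM aborts the whole transaction
    if all(v == "Prepared" for v in values):
        return "Commit"  # every RM's consensus box chose Prepared
    return None          # still waiting for some consensus box
```

Because the result depends only on the chosen values, any TM can recompute it after a failure, which is one reading of "TM is stateless".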
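16 The Obvious Approach vs. Paxos Commit
[Diagram: the two message flows side by side. The Obvious Approach: Prepare, Prepared, Propose Prepared, Prepared Chosen, Commit. Paxos Commit: Prepare, Propose RM1 Prepared / Propose RM2 Prepared, RM1 Prepared Chosen / RM2 Prepared Chosen, Commit.]
Paxos Commit: one fewer message delay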
17 Consensus in Action
[Diagram: inside the consensus box, the RM sends "Propose RM Prepared" to each of three acceptors; each acceptor sends "Vote RM Prepared" to the TMs, which learn "RM Prepared Chosen".]
- The normal (failure-free) case
- Two message delays
- Can optimize
18 Consensus in Action
[Diagram: the RM, three acceptors, and the TMs inside the consensus box.]
A TM can always learn what was chosen, or get Aborted chosen if nothing has been chosen yet, provided a majority of acceptors are working.
19 The Complete Algorithm
- Subtle.
- More weird cases than most people imagine.
- Proved correct.
20 Paxos Commit
- N RMs
- 2F+1 acceptors (2F+1 TMs)
- If F+1 acceptors see all RMs prepared, then transaction committed.
- 2F(N+1) + 3N + 1 messages, 5 message delays, 2 stable write delays.
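The commit rule on this slide can be sketched as a check over what each acceptor has recorded. This is a simplified illustration that assumes the acceptor count reads 2F+1 and the threshold reads F+1; the function name `is_committed` and the set-based representation are hypothetical.

```python
from typing import List, Set


def is_committed(acceptor_views: List[Set[str]], rms: Set[str], f: int) -> bool:
    """Return True once at least F+1 acceptors have seen every RM prepared.

    acceptor_views[i] is the set of RMs that acceptor i has recorded as Prepared.
    """
    if len(acceptor_views) != 2 * f + 1:
        raise ValueError("this sketch assumes exactly 2F+1 acceptors")
    # Count acceptors whose view covers the full set of RMs.
    full = sum(1 for view in acceptor_views if rms <= view)
    return full >= f + 1
```

With F = 1 and three acceptors, two complete views are enough to commit even if the third acceptor has seen only some of the RMs or has failed.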
21 Two-Phase Commit vs. Paxos Commit
Two-Phase Commit
- 3N+1 messages
- N+1 stable writes
- 4 message delays
- 2 stable-write delays
Paxos Commit (tolerates F faults)
- 3N + 2F(N+1) + 1 messages
- N+2F+1 stable writes
- 5 message delays
- 2 stable-write delays
Same algorithm when F=0 and TM = Acceptor
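The cost comparison can be checked with a little arithmetic. The sketch below assumes the operators lost in the slide text are plus signs, so that 2PC costs 3N+1 messages and N+1 stable writes, and Paxos Commit costs 3N + 2F(N+1) + 1 messages and N+2F+1 stable writes. The function names are hypothetical.

```python
from typing import Dict


def two_phase_commit_cost(n: int) -> Dict[str, int]:
    """Message and stable-write counts for 2PC with n RMs (assumed formulas)."""
    return {"messages": 3 * n + 1, "stable_writes": n + 1}


def paxos_commit_cost(n: int, f: int) -> Dict[str, int]:
    """Message and stable-write counts for Paxos Commit with n RMs,
    tolerating f faults (assumed formulas)."""
    return {
        "messages": 3 * n + 2 * f * (n + 1) + 1,
        "stable_writes": n + 2 * f + 1,
    }
```

Under these assumptions, setting F = 0 makes both counts identical to 2PC, and for N = 5 and F = 1 Paxos Commit uses 28 messages and 8 stable writes against 16 and 6 for 2PC.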
22 Summary
- Commit is common
- Two Phase commit is good, but it is the un-availability protocol
- Paxos commit is non-blocking if there are at most F faults.
- When F=0 (no fault-tolerance), Paxos Commit = 2PC
24 Paxos Consensus
- Group has a leader known to all
- Leader election is a subroutine
- Process proposes a value v to leader.
- Leader sends proposal (phase 2) (ballot, value) to all acceptors
- Acceptors respond with max (ballot, value) they have seen
- If leader gets no higher ballot, and gets at least F+1 responses, then leader can announce (ballot, value)
- Full protocol is 3-phase
- Phase 1: Leader starts new ballot
- Phase 2: Leader proposes value
- Phase 3: If value accepted by F+1 then value is accepted. If not, leader tries to get majority value accepted.
6F+4 messages, 2F+1 stable writes, 4 message delays, and 2 stable write delays
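A heavily simplified sketch of one ballot may help make the phases concrete. It follows the usual textbook shape of single-value Paxos (a promise step, then an accept step), not necessarily the exact phase numbering on this slide, and it assumes a single leader, synchronous calls instead of messages, and no stable storage. The names `Acceptor` and `run_ballot` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                          # highest ballot promised so far
    accepted: Optional[Tuple[int, str]] = None  # last (ballot, value) accepted
    up: bool = True                             # False models a stopped acceptor

    def phase1(self, ballot: int) -> Tuple[bool, Optional[Tuple[int, str]]]:
        """Promise to ignore lower ballots; report the last accepted pair."""
        if not self.up or ballot <= self.promised:
            return False, None
        self.promised = ballot
        return True, self.accepted

    def phase2(self, ballot: int, value: str) -> bool:
        """Accept (ballot, value) unless a higher ballot was promised."""
        if not self.up or ballot < self.promised:
            return False
        self.promised = ballot
        self.accepted = (ballot, value)
        return True


def run_ballot(acceptors: List[Acceptor], f: int, ballot: int,
               value: str) -> Optional[str]:
    """Run one ballot; return the chosen value, or None if no majority."""
    replies = [a.phase1(ballot) for a in acceptors]
    promises = [acc for ok, acc in replies if ok]
    if len(promises) < f + 1:
        return None  # fewer than F+1 responses: the leader cannot proceed
    prior = [acc for acc in promises if acc is not None]
    if prior:
        # Safety rule: adopt the value of the highest-ballot accepted pair.
        value = max(prior, key=lambda bv: bv[0])[1]
    votes = sum(1 for a in acceptors if a.phase2(ballot, value))
    return value if votes >= f + 1 else None
```

With three acceptors and F = 1, a first ballot proposing "Prepared" succeeds. If one acceptor then stops and a later ballot proposes "Aborted", the sketch still returns "Prepared", because the earlier value was already accepted by a majority. With two acceptors stopped, the ballot returns `None`: the protocol stalls but does not choose a conflicting value.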
25 Using Consensus: Have a consensus for each RM
[Diagram: the client sends RequestCommit to the TM; the TM sends Prepare to each RM; each RM's Prepared goes to its own consensus box; the TM then sends Commit to the RMs.]
26 [Diagram: the RM sends "Propose X" and a TM sends "Propose W" to the consensus box; the box reports "X Chosen" to the RM and to both TMs.]
27 Paxos Commit (success case)
[Diagram labels: Acceptors, Commit Leader]
28 Consensus
- The distributed systems theory community has thought about this a lot.
- They call it Consensus: N processes want to agree on a value
- Want to tolerate F faults
- Tolerate F processes stopping
- Tolerate F messages delayed or lost
- If there are at most F faults in a window, then consensus is achieved.
- Byzantine faults need 3F+1 acceptors
- Benign faults need 2F+1 acceptors; stalls but safe if more than F faults
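The reason 2F+1 acceptors suffice for benign faults can be checked by brute force: any two groups of F+1 acceptors share at least one member, so two conflicting values can never both gather a majority. The sketch below assumes the acceptor count on the slide reads 2F+1; the function names are hypothetical.

```python
from itertools import combinations


def majorities_intersect(f: int) -> bool:
    """Check that every pair of (F+1)-sized quorums out of 2F+1 acceptors overlaps."""
    acceptors = range(2 * f + 1)
    quorums = [set(q) for q in combinations(acceptors, f + 1)]
    return all(a & b for a, b in combinations(quorums, 2))


def can_make_progress(f: int, failed: int) -> bool:
    """With 2F+1 acceptors, progress needs F+1 of them still working."""
    return (2 * f + 1) - failed >= f + 1
```

For F = 1, `can_make_progress(1, 1)` is true and `can_make_progress(1, 2)` is false, which matches "stalls but safe": with more than F faults no majority can form, so nothing new is chosen, but nothing inconsistent is chosen either.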