Title: Replication techniques: Primary-backup, RSM, Paxos
1. Replication techniques: Primary-backup, RSM, Paxos
2. Fault tolerance -> replication
- How to recover a single node from power failure?
  - Wait for reboot
  - Data is durable, but service is unavailable temporarily
- Use multiple nodes to provide service
  - Another node takes over to provide service
3. Replicated state machine (RSM)
- RSM is a general replication method
  - Lab 6 applies RSM to the lock service
- RSM rules
  - All replicas start in the same initial state
  - Every replica applies operations in the same order
  - All operations must be deterministic
  - All replicas end up in the same state
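The rules above can be sketched as a toy state machine (the ops and class names are hypothetical, not the lab's lock service): two replicas that start in the same state and apply a deterministic op log in the same order end in the same state.

```python
# Toy RSM sketch: identical initial state + same op order + deterministic
# ops => identical final state on every replica.

class Replica:
    def __init__(self):
        self.state = 0          # all replicas start in the same initial state

    def apply(self, op, arg):
        # ops must be deterministic: no clocks, randomness, or local I/O
        if op == "add":
            self.state += arg
        elif op == "mul":
            self.state *= arg

log = [("add", 3), ("mul", 4), ("add", 1)]   # the single agreed-upon order

r1, r2 = Replica(), Replica()
for op, arg in log:             # every replica applies ops in the same order
    r1.apply(op, arg)
    r2.apply(op, arg)

assert r1.state == r2.state == 13
```

If either replica reordered the log (e.g. applied the `mul` first), the final states would differ, which is exactly why a single order matters.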
4. RSM
[Diagram: two replicas receive concurrent requests opA and opB; one applies opA then opB, the other opB then opA, and their states diverge]
- How to maintain a single order in the face of concurrent client requests?
5. RSM: primary/backup
[Diagram: clients send opA and opB to the primary, which forwards them to the backup in a single order]
- Primary/backup ensures a single order of ops
  - Primary orders operations
  - Backups execute operations in order
6. Case study: Hypervisor (Bressoud and Schneider)
- Goal: fault-tolerant computing
  - Banks, NASA, etc. need it
  - CPUs are most likely to fail due to complexity
- Hypervisor: primary/backup replication
  - If primary fails, backup takes over
  - Caveat: assumes failure detection is perfect
7. Hypervisor replicates at the VM level
- Why replicate at the VM level?
  - Hardware fault-tolerant machines were big in the 80s
  - A software solution is more economical
- Replicating at the O/S level is messy (many interfaces)
- Replicating at the app level requires programmer effort
- Replicating at the VM level has a cleaner interface (and no need to change the O/S or app)
  - Primary and backup execute the same sequence of machine instructions
8. A strawman design
[Diagram: two machines, each with its own memory and disk]
- Two identical machines
- Same initial memory/disk contents
- Start executing on both machines
- Will they perform the same computation?
9. Strawman flaws
- To see the same effect, operations must be deterministic
- What are deterministic ops?
  - ADD, MUL, etc.
  - Read time-of-day register, cycle counter, privilege level?
  - Read memory?
  - Read disk?
  - Interrupt timing?
  - External input devices (network, keyboard)?
10. Hypervisor's architecture
- Strawman replicates disks at both machines
  - Problem: disks might not behave identically (e.g. fail at different sectors)
[Diagram: primary and backup, each with its own memory, share I/O devices over a SCSI bus and communicate over ethernet]
- Hypervisor connects I/O devices to both machines
  - Only primary reads/writes to devices
  - Primary sends read values to backup
- Only primary handles interrupts from h/w
  - Primary sends interrupts to backup
11. Hypervisor executes in epochs
- Challenge: must execute interrupts at the same point in the instruction streams on both nodes
- Strawman: execute one instruction at a time
  - Backup waits for primary to send interrupts at the end of each instruction
  - Very slow
- Hypervisor executes in epochs
  - CPU h/w interrupts every N instructions (so both nodes stop at the same point)
  - Primary delays all interrupts till the end of an epoch
  - Primary sends all interrupts to backup
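The epoch idea can be sketched as follows (a simulation with illustrative names and an arbitrary epoch length, not the Hypervisor's actual mechanism): interrupts that arrive mid-epoch are buffered and delivered only at epoch boundaries, so primary and backup deliver the same interrupts at the same instruction count.

```python
# Sketch: deliver buffered interrupts only at epoch boundaries, so two
# nodes executing the same instruction stream deliver each interrupt at
# the same point.

EPOCH = 4  # interrupt every N instructions (hardware counter in the paper)

def run(instructions, incoming):
    """Execute instructions; incoming maps instruction index -> interrupts
    arriving at that point. Returns (delivery_index, interrupt) pairs."""
    delivered, buffer = [], []
    for i, _ in enumerate(instructions, start=1):
        buffer.extend(incoming.pop(i, []))   # interrupt arrives mid-epoch
        if i % EPOCH == 0:                   # epoch boundary
            delivered.extend((i, irq) for irq in buffer)
            buffer = []
    return delivered

insns = list(range(10))
# "disk" IRQ arrives at instruction 2, "net" at instruction 6; the backup
# replays the same interrupts the primary forwarded to it:
primary_log = run(insns, {2: ["disk"], 6: ["net"]})
backup_log = run(insns, {2: ["disk"], 6: ["net"]})
assert primary_log == backup_log == [(4, "disk"), (8, "net")]
```

Both logs deliver "disk" at instruction 4 and "net" at instruction 8, even though the raw interrupts arrived mid-epoch; without the buffering, delivery points could differ between the two machines.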
12. Hypervisor failover
- If primary fails, backup must handle I/O
- Suppose primary fails at epoch E+1
  - Backup times out waiting for the end-of-epoch-E+1 message
  - Backup delivers all buffered interrupts at the end of E
  - Backup starts epoch E+1
  - Backup becomes primary at epoch E+2
13. Hypervisor failover
- Backup does not know if primary executed I/O in epoch E+1
- Relies on the O/S to retry the I/O
- Device needs to support repeated ops
  - OK for disk writes/reads
  - OK for network (TCP will figure it out)
  - How about keyboard, printer, ATM cash machine?
14. Hypervisor implementation
- Hypervisor needs to trap every non-deterministic instruction
  - Time-of-day register
  - HP TLB replacement
  - HP branch-and-link instruction
  - Memory-mapped I/O loads/stores
- Performance penalty is reasonable
  - A factor-of-two slowdown
  - How about its performance on modern hardware?
15. Caveats in Hypervisor
- Hypervisor assumes failure detection is perfect
- What if the network between primary/backup fails?
  - Primary is still running
  - Backup becomes a new primary
  - Two primaries at the same time!
- Can timeouts detect failures correctly?
  - Pings from backup to primary are lost
  - Pings from backup to primary are delayed
16. Paxos: fault-tolerant agreement
- Paxos lets all nodes agree on the same value despite node failures, network failures and delays
- Extremely useful
  - e.g. nodes agree that X is the primary
  - e.g. nodes agree that Y is the last operation executed
17. Paxos: general approach
- One (or more) node decides to be the leader
- Leader proposes a value and solicits acceptance from others
- Leader announces the result, or tries again
18. Paxos requirements
- Correctness (safety)
  - All nodes agree on the same value
  - The agreed value X has been proposed by some node
- Fault-tolerance
  - If fewer than N/2 nodes fail, the remaining nodes should reach agreement eventually w.h.p.
  - Liveness is not guaranteed
19. Why is agreement hard?
- What if >1 nodes become leaders simultaneously?
- What if there is a network partition?
- What if a leader crashes in the middle of solicitation?
- What if a leader crashes after deciding but before announcing results?
- What if the new leader proposes values different from the already-decided value?
20. Paxos setup
- Each node runs as a proposer, acceptor and learner
- Proposer (leader) proposes a value and solicits acceptance from acceptors
- Leader announces the chosen value to learners
21. Strawman
- Designate a single node X as acceptor (e.g. the one with the smallest id)
  - Each proposer sends its value to X
  - X decides on one of the values
  - X announces its decision to all learners
- Problem?
  - Failure of the single acceptor halts decision
  - Need multiple acceptors!
22. Strawman 2: multiple acceptors
- Each proposer (leader) proposes to all acceptors
- Each acceptor accepts the first proposal it receives and rejects the rest
- If the leader receives positive replies from a majority of acceptors, it chooses its own value
  - There is at most one majority, hence only a single value is chosen
- Leader sends chosen value to all learners
- Problem
  - What if multiple leaders propose simultaneously, so that no value is accepted by a majority?
23. Paxos solution
- Proposals are ordered by proposal number
- Each acceptor may accept multiple proposals
  - If a proposal with value v is chosen, all higher proposals have value v
24. Paxos operation: node state
- Each node maintains
  - na, va: highest proposal number accepted and its corresponding value
  - nh: highest proposal number seen
  - myn: my proposal number in the current Paxos round
25. Paxos operation: 3P protocol
- Phase 1 (Prepare)
  - A node decides to be leader (and propose)
  - Leader chooses myn > nh
  - Leader sends <prepare, myn> to all nodes
  - Upon receiving <prepare, n>
    - If n < nh
      - reply <prepare-reject>
    - Else
      - nh = n
      - reply <prepare-ok, na, va>
        (this node will not accept any proposal lower than n)
26. Paxos operation
- Phase 2 (Accept)
  - If leader gets prepare-ok from a majority
    - V = non-null value corresponding to the highest na received
    - If V == null, then leader can pick any V
    - Send <accept, myn, V> to all nodes
  - If leader fails to get a majority of prepare-ok
    - Delay and restart Paxos
  - Upon receiving <accept, n, V>
    - If n < nh
      - reply with <accept-reject>
    - Else
      - na = n; va = V; nh = n
      - reply with <accept-ok>
27. Paxos operation
- Phase 3 (Decide)
  - If leader gets accept-ok from a majority
    - Send <decide, va> to all nodes
  - If leader fails to get accept-ok from a majority
    - Delay and restart Paxos
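The three phases above can be sketched with in-memory acceptors (a single-decree simulation with no networking or failures; the method names are illustrative, not the lab's RPC interface). The key line is that a leader who gets a majority of prepare-ok replies must adopt the value attached to the highest na it heard, which is what keeps an already-chosen value stable across later proposals.

```python
# Single-decree Paxos sketch: one leader round against in-memory acceptors.

class Acceptor:
    def __init__(self):
        self.nh = 0                    # highest proposal number seen
        self.na, self.va = 0, None     # highest proposal accepted + its value

    def prepare(self, n):
        if n < self.nh:
            return None                # prepare-reject
        self.nh = n                    # promise not to accept proposals < n
        return (self.na, self.va)      # prepare-ok

    def accept(self, n, v):
        if n < self.nh:
            return False               # accept-reject
        self.na, self.va, self.nh = n, v, n
        return True                    # accept-ok

def propose(acceptors, myn, myv):
    majority = len(acceptors) // 2 + 1
    # Phase 1: prepare
    oks = [r for r in (a.prepare(myn) for a in acceptors) if r is not None]
    if len(oks) < majority:
        return None                    # no majority: delay and restart
    # Phase 2: adopt the value of the highest na seen, else pick our own
    na, va = max(oks, key=lambda t: t[0])
    v = va if va is not None else myv
    if sum(a.accept(myn, v) for a in acceptors) < majority:
        return None
    return v                           # Phase 3: announce <decide, v>

accs = [Acceptor() for _ in range(3)]
assert propose(accs, myn=1, myv="X") == "X"
assert propose(accs, myn=2, myv="Y") == "X"  # later round re-decides "X"
```

The second call illustrates the invariant from slide 23: once "X" is chosen, a higher-numbered proposal learns of it in phase 1 and proposes "X" again rather than "Y".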
28. Paxos operation: an example
- Initially: N0 has nh=N0:0, N1 has nh=N1:0, N2 has nh=N2:0; na=va=null everywhere
- N1 (leader) sends <prepare, N1:1> to N0 and N2
  - N0, N2 set nh=N1:1 (na=null, va=null) and reply <prepare-ok, na=va=null>
- N1 sends <accept, N1:1, val1> to N0 and N2
  - N0, N2 set nh=N1:1, na=N1:1, va=val1 and reply <accept-ok>
- N1 sends <decide, val1> to N0 and N2
29. Paxos properties
- When is the value V chosen?
  - When the leader receives prepare-ok from a majority and proposes V?
  - When a majority of nodes accept V?
  - When the leader receives accept-ok from a majority for value V?
30. Understanding Paxos
- What if more than one leader is active?
- Suppose two leaders use different proposal numbers, N0:10 and N1:11
- Can both leaders see a majority of prepare-ok?
31. Understanding Paxos
- What if the leader fails while sending accept?
- What if a node fails after receiving accept?
  - If it doesn't restart
  - If it reboots
- What if a node fails after sending prepare-ok?
  - If it reboots
32. Using Paxos for RSM
- Fault-tolerant RSM requires consistent replica membership
- Membership: <primary, backups>
- RSM goes through a series of membership changes
  - <vid-0, primary, backups>, <vid-1, primary, backups>, ...
- Use Paxos to agree on the <primary, backups> for a particular vid
33. Lab 5: Using Paxos for RSM
- All nodes start with the static config vid1: {N1}
- N2 joins
  - A majority in vid1 {N1} accept vid2: {N1, N2}
- N3 joins
  - A majority in vid2 {N1, N2} accept vid3: {N1, N2, N3}
- N3 fails
  - A majority in vid3 {N1, N2, N3} accept vid4: {N1, N2}
34. Lab 5: Using Paxos for RSM
[Diagram: N1 and N2 each hold the view history vid1 {N1}, vid2 {N1, N2}]
35. Lab 5: Using Paxos for RSM
[Diagram: N3 joins while N1 and N2 hold views vid1 {N1} and vid2 {N1, N2}; N3 must learn the view history before participating]
36. Lab 6: re-configurable RSM
- Use RSM to replicate lock_server
- Primary in each view assigns a viewstamp to each client request
  - Viewstamp is a tuple (vid, seqno)
  - (0,0) (0,1) (0,2) (0,3) (1,0) (1,1) (1,2) (2,0) (2,1) ...
- All replicas execute client requests in viewstamp order
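Representing viewstamps as (vid, seqno) tuples, the execution order above is just lexicographic tuple comparison: all ops of view v precede all ops of view v+1, and seqno resets at each view change. (A sketch; the lab's actual viewstamp type may differ.)

```python
# Viewstamp ordering as lexicographic (vid, seqno) comparison.
vs_log = [(0, 0), (0, 1), (0, 2), (0, 3),
          (1, 0), (1, 1), (1, 2),
          (2, 0), (2, 1)]

assert vs_log == sorted(vs_log)   # the slide's sequence is already ordered
assert (0, 3) < (1, 0)            # a view change outranks any earlier seqno
```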
37. Lab 6: Viewstamp replication
- To execute an op with viewstamp vs, a replica must have executed all ops < vs
- A newly joined replica needs a state transfer to ensure its state reflects execution of all ops < vs
38. Lab 5: Using Paxos for RSM
[Diagram: N1 in vid1 {N1} with myvs = (1, 50)]