Title: Replication techniques: Primary-backup, RSM, Paxos
1. Replication techniques: Primary-backup, RSM, Paxos
2. Fault tolerance -> replication
- How to recover a single node from power failure?
  - Wait for reboot
  - Data is durable, but service is unavailable temporarily
- Use multiple nodes to provide service
  - Another node takes over to provide service
3. Replicated state machine (RSM)
- RSM is a general replication method
  - Lab 6 applies RSM to the lock service
- RSM rules
  - All replicas start in the same initial state
  - Every replica applies operations in the same order
  - All operations must be deterministic
  - All replicas end up in the same state
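The rules above can be sketched as a toy state machine (the ops and class names are hypothetical, not the lab's lock service): two replicas that start in the same state and apply a deterministic op log in the same order end in the same state.

```python
# Toy RSM sketch: identical initial state + same op order + deterministic
# ops => identical final state on every replica.

class Replica:
    def __init__(self):
        self.state = 0          # all replicas start in the same initial state

    def apply(self, op, arg):
        # ops must be deterministic: no clocks, randomness, or local I/O
        if op == "add":
            self.state += arg
        elif op == "mul":
            self.state *= arg

log = [("add", 3), ("mul", 4), ("add", 1)]   # the single agreed-upon order

r1, r2 = Replica(), Replica()
for op, arg in log:             # every replica applies ops in the same order
    r1.apply(op, arg)
    r2.apply(op, arg)

assert r1.state == r2.state == 13
```

If either replica reordered the log (e.g. applied the `mul` first), the final states would differ, which is exactly why a single order matters.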
4. RSM
[Diagram: two replicas receive concurrent requests opA and opB; one applies opA then opB, the other opB then opA, and their states diverge]
- How to maintain a single order in the face of concurrent client requests?
5. RSM: primary/backup
[Diagram: clients send opA and opB to the primary, which forwards them to the backup in a single order]
- Primary/backup ensures a single order of ops
  - Primary orders operations
  - Backups execute operations in order
6. Case study: Hypervisor (Bressoud and Schneider)
- Goal: fault-tolerant computing
  - Banks, NASA, etc. need it
  - CPUs are most likely to fail due to complexity
- Hypervisor: primary/backup replication
  - If primary fails, backup takes over
  - Caveat: assumes failure detection is perfect
7. Hypervisor replicates at the VM level
- Why replicate at the VM level?
  - Hardware fault-tolerant machines were big in the 80s
  - A software solution is more economical
- Replicating at the O/S level is messy (many interfaces)
- Replicating at the app level requires programmer effort
- Replicating at the VM level has a cleaner interface (and no need to change the O/S or app)
  - Primary and backup execute the same sequence of machine instructions
8. A strawman design
[Diagram: two machines, each with its own memory and disk]
- Two identical machines
- Same initial memory/disk contents
- Start executing on both machines
- Will they perform the same computation?
9. Strawman flaws
- To see the same effect, operations must be deterministic
- What are deterministic ops?
  - ADD, MUL, etc.
  - Read time-of-day register, cycle counter, privilege level?
  - Read memory?
  - Read disk?
  - Interrupt timing?
  - External input devices (network, keyboard)?
10. Hypervisor's architecture
- Strawman replicates disks at both machines
  - Problem: disks might not behave identically (e.g. fail at different sectors)
[Diagram: primary and backup, each with its own memory, share I/O devices over a SCSI bus and communicate over ethernet]
- Hypervisor connects I/O devices to both machines
  - Only primary reads/writes to devices
  - Primary sends read values to backup
- Only primary handles interrupts from h/w
  - Primary sends interrupts to backup
11. Hypervisor executes in epochs
- Challenge: must execute interrupts at the same point in the instruction streams on both nodes
- Strawman: execute one instruction at a time
  - Backup waits for primary to send interrupts at the end of each instruction
  - Very slow
- Hypervisor executes in epochs
  - CPU h/w interrupts every N instructions (so both nodes stop at the same point)
  - Primary delays all interrupts till the end of an epoch
  - Primary sends all interrupts to backup
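The epoch idea can be sketched as follows (a simulation with illustrative names and an arbitrary epoch length, not the Hypervisor's actual mechanism): interrupts that arrive mid-epoch are buffered and delivered only at epoch boundaries, so primary and backup deliver the same interrupts at the same instruction count.

```python
# Sketch: deliver buffered interrupts only at epoch boundaries, so two
# nodes executing the same instruction stream deliver each interrupt at
# the same point.

EPOCH = 4  # interrupt every N instructions (hardware counter in the paper)

def run(instructions, incoming):
    """Execute instructions; incoming maps instruction index -> interrupts
    arriving at that point. Returns (delivery_index, interrupt) pairs."""
    delivered, buffer = [], []
    for i, _ in enumerate(instructions, start=1):
        buffer.extend(incoming.pop(i, []))   # interrupt arrives mid-epoch
        if i % EPOCH == 0:                   # epoch boundary
            delivered.extend((i, irq) for irq in buffer)
            buffer = []
    return delivered

insns = list(range(10))
# "disk" IRQ arrives at instruction 2, "net" at instruction 6; the backup
# replays the same interrupts the primary forwarded to it:
primary_log = run(insns, {2: ["disk"], 6: ["net"]})
backup_log = run(insns, {2: ["disk"], 6: ["net"]})
assert primary_log == backup_log == [(4, "disk"), (8, "net")]
```

Both logs deliver "disk" at instruction 4 and "net" at instruction 8, even though the raw interrupts arrived mid-epoch; without the buffering, delivery points could differ between the two machines.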
12. Hypervisor failover
- If primary fails, backup must handle I/O
- Suppose primary fails at epoch E+1
  - Backup times out waiting for the end-of-epoch-E+1 message
  - Backup delivers all buffered interrupts at the end of E
  - Backup starts epoch E+1
  - Backup becomes primary at epoch E+2
13. Hypervisor failover
- Backup does not know if primary executed I/O in epoch E+1
- Relies on the O/S to retry the I/O
- Device needs to support repeated ops
  - OK for disk writes/reads
  - OK for network (TCP will figure it out)
  - How about keyboard, printer, ATM cash machine?
14. Hypervisor implementation
- Hypervisor needs to trap every non-deterministic instruction
  - Time-of-day register
  - HP TLB replacement
  - HP branch-and-link instruction
  - Memory-mapped I/O loads/stores
- Performance penalty is reasonable
  - A factor-of-two slowdown
  - How about its performance on modern hardware?
15. Caveats in Hypervisor
- Hypervisor assumes failure detection is perfect
- What if the network between primary/backup fails?
  - Primary is still running
  - Backup becomes a new primary
  - Two primaries at the same time!
- Can timeouts detect failures correctly?
  - Pings from backup to primary are lost
  - Pings from backup to primary are delayed
16. Paxos: fault-tolerant agreement
- Paxos lets all nodes agree on the same value despite node failures, network failures and delays
- Extremely useful
  - e.g. nodes agree that X is the primary
  - e.g. nodes agree that Y is the last operation executed
17. Paxos: general approach
- One (or more) node decides to be the leader
- Leader proposes a value and solicits acceptance from others
- Leader announces the result, or tries again
18. Paxos requirements
- Correctness (safety)
  - All nodes agree on the same value
  - The agreed value X has been proposed by some node
- Fault-tolerance
  - If fewer than N/2 nodes fail, the remaining nodes should reach agreement eventually w.h.p.
  - Liveness is not guaranteed
19. Why is agreement hard?
- What if >1 nodes become leaders simultaneously?
- What if there is a network partition?
- What if a leader crashes in the middle of solicitation?
- What if a leader crashes after deciding but before announcing results?
- What if the new leader proposes values different from the already-decided value?
20. Paxos setup
- Each node runs as a proposer, acceptor and learner
- Proposer (leader) proposes a value and solicits acceptance from acceptors
- Leader announces the chosen value to learners
21. Strawman
- Designate a single node X as acceptor (e.g. the one with the smallest id)
  - Each proposer sends its value to X
  - X decides on one of the values
  - X announces its decision to all learners
- Problem?
  - Failure of the single acceptor halts decision
  - Need multiple acceptors!
22. Strawman 2: multiple acceptors
- Each proposer (leader) proposes to all acceptors
- Each acceptor accepts the first proposal it receives and rejects the rest
- If the leader receives positive replies from a majority of acceptors, it chooses its own value
  - There is at most one majority, hence only a single value is chosen
- Leader sends chosen value to all learners
- Problem
  - What if multiple leaders propose simultaneously, so that no value is accepted by a majority?
23. Paxos solution
- Proposals are ordered by proposal number
- Each acceptor may accept multiple proposals
  - If a proposal with value v is chosen, all higher proposals have value v
24. Paxos operation: node state
- Each node maintains
  - na, va: highest proposal number accepted and its corresponding value
  - nh: highest proposal number seen
  - myn: my proposal number in the current Paxos round
25. Paxos operation: 3P protocol
- Phase 1 (Prepare)
  - A node decides to be leader (and propose)
  - Leader chooses myn > nh
  - Leader sends <prepare, myn> to all nodes
  - Upon receiving <prepare, n>
    - If n < nh
      - reply <prepare-reject>
    - Else
      - nh = n
      - reply <prepare-ok, na, va>
        (this node will not accept any proposal lower than n)
26. Paxos operation
- Phase 2 (Accept)
  - If leader gets prepare-ok from a majority
    - V = non-null value corresponding to the highest na received
    - If V == null, then leader can pick any V
    - Send <accept, myn, V> to all nodes
  - If leader fails to get a majority of prepare-ok
    - Delay and restart Paxos
  - Upon receiving <accept, n, V>
    - If n < nh
      - reply with <accept-reject>
    - Else
      - na = n; va = V; nh = n
      - reply with <accept-ok>
27. Paxos operation
- Phase 3 (Decide)
  - If leader gets accept-ok from a majority
    - Send <decide, va> to all nodes
  - If leader fails to get accept-ok from a majority
    - Delay and restart Paxos
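The three phases above can be sketched with in-memory acceptors (a single-decree simulation with no networking or failures; the method names are illustrative, not the lab's RPC interface). The key line is that a leader who gets a majority of prepare-ok replies must adopt the value attached to the highest na it heard, which is what keeps an already-chosen value stable across later proposals.

```python
# Single-decree Paxos sketch: one leader round against in-memory acceptors.

class Acceptor:
    def __init__(self):
        self.nh = 0                    # highest proposal number seen
        self.na, self.va = 0, None     # highest proposal accepted + its value

    def prepare(self, n):
        if n < self.nh:
            return None                # prepare-reject
        self.nh = n                    # promise not to accept proposals < n
        return (self.na, self.va)      # prepare-ok

    def accept(self, n, v):
        if n < self.nh:
            return False               # accept-reject
        self.na, self.va, self.nh = n, v, n
        return True                    # accept-ok

def propose(acceptors, myn, myv):
    majority = len(acceptors) // 2 + 1
    # Phase 1: prepare
    oks = [r for r in (a.prepare(myn) for a in acceptors) if r is not None]
    if len(oks) < majority:
        return None                    # no majority: delay and restart
    # Phase 2: adopt the value of the highest na seen, else pick our own
    na, va = max(oks, key=lambda t: t[0])
    v = va if va is not None else myv
    if sum(a.accept(myn, v) for a in acceptors) < majority:
        return None
    return v                           # Phase 3: announce <decide, v>

accs = [Acceptor() for _ in range(3)]
assert propose(accs, myn=1, myv="X") == "X"
assert propose(accs, myn=2, myv="Y") == "X"  # later round re-decides "X"
```

The second call illustrates the invariant from slide 23: once "X" is chosen, a higher-numbered proposal learns of it in phase 1 and proposes "X" again rather than "Y".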
28. Paxos operation: an example
- Initially: N0 has nh=N0:0, N1 has nh=N1:0, N2 has nh=N2:0; na=va=null everywhere
- N1 (leader) sends <prepare, N1:1> to N0 and N2
  - N0, N2 set nh=N1:1 (na=null, va=null) and reply <prepare-ok, na=va=null>
- N1 sends <accept, N1:1, val1> to N0 and N2
  - N0, N2 set nh=N1:1, na=N1:1, va=val1 and reply <accept-ok>
- N1 sends <decide, val1> to N0 and N2
29. Paxos properties
- When is the value V chosen?
  - When the leader receives prepare-ok from a majority and proposes V?
  - When a majority of nodes accept V?
  - When the leader receives accept-ok from a majority for value V?
30. Understanding Paxos
- What if more than one leader is active?
- Suppose two leaders use different proposal numbers, N0:10 and N1:11
- Can both leaders see a majority of prepare-ok?
31. Understanding Paxos
- What if the leader fails while sending accept?
- What if a node fails after receiving accept?
  - If it doesn't restart
  - If it reboots
- What if a node fails after sending prepare-ok?
  - If it reboots
32. Using Paxos for RSM
- Fault-tolerant RSM requires consistent replica membership
- Membership: <primary, backups>
- RSM goes through a series of membership changes
  - <vid-0, primary, backups>, <vid-1, primary, backups>, ...
- Use Paxos to agree on the <primary, backups> for a particular vid
33. Lab 5: Using Paxos for RSM
- All nodes start with the static config vid1: {N1}
- N2 joins
  - A majority in vid1 {N1} accept vid2: {N1, N2}
- N3 joins
  - A majority in vid2 {N1, N2} accept vid3: {N1, N2, N3}
- N3 fails
  - A majority in vid3 {N1, N2, N3} accept vid4: {N1, N2}
34. Lab 5: Using Paxos for RSM
[Diagram: N1 and N2 each hold the view history vid1 {N1}, vid2 {N1, N2}]
35. Lab 5: Using Paxos for RSM
[Diagram: N3 joins while N1 and N2 hold views vid1 {N1} and vid2 {N1, N2}; N3 must learn the view history before participating]
36. Lab 6: re-configurable RSM
- Use RSM to replicate lock_server
- Primary in each view assigns a viewstamp to each client request
  - Viewstamp is a tuple (vid, seqno)
  - (0,0) (0,1) (0,2) (0,3) (1,0) (1,1) (1,2) (2,0) (2,1) ...
- All replicas execute client requests in viewstamp order
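Representing viewstamps as (vid, seqno) tuples, the execution order above is just lexicographic tuple comparison: all ops of view v precede all ops of view v+1, and seqno resets at each view change. (A sketch; the lab's actual viewstamp type may differ.)

```python
# Viewstamp ordering as lexicographic (vid, seqno) comparison.
vs_log = [(0, 0), (0, 1), (0, 2), (0, 3),
          (1, 0), (1, 1), (1, 2),
          (2, 0), (2, 1)]

assert vs_log == sorted(vs_log)   # the slide's sequence is already ordered
assert (0, 3) < (1, 0)            # a view change outranks any earlier seqno
```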
37. Lab 6: Viewstamp replication
- To execute an op with viewstamp vs, a replica must have executed all ops < vs
- A newly joined replica needs a state transfer to ensure its state reflects execution of all ops < vs
38. Lab 5: Using Paxos for RSM
[Diagram: N1 in vid1 {N1} with myvs = (1, 50)]