
1
Replication techniques: Primary-backup, RSM, Paxos
  • Jinyang Li

2
Fault tolerance => replication
  • How to recover a single node from power failure?
  • Wait for reboot
  • Data is durable, but the service is temporarily unavailable
  • Use multiple nodes to provide the service
  • If one node fails, another takes over

3
Replicated state machine (RSM)
  • RSM is a general replication method
  • Lab 6 applies RSM to the lock service
  • RSM rules (see the sketch below):
  • All replicas start in the same initial state
  • Every replica applies operations in the same order
  • All operations must be deterministic
  • All replicas end up in the same state
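
A minimal sketch of these rules, assuming a toy key-value state machine rather than the lab's lock service (all names are illustrative):

def apply_ops(ops):
    state = {}                   # same initial state on every replica
    for op, key, val in ops:     # same order on every replica
        if op == "put":          # deterministic operation
            state[key] = val
    return state

log = [("put", "x", 1), ("put", "y", 2), ("put", "x", 3)]
# Every replica that applies `log` in order ends in the same state.
assert apply_ops(log) == {"x": 3, "y": 2}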

4
RSM
[Diagram: two clients submit opA and opB concurrently; each replica must apply them in a single agreed order]
  • How to maintain a single order in the face of
    concurrent client requests?

5
RSM: primary/backup
[Diagram: clients send opA and opB to the primary, which forwards them in one order to the backup]
  • Primary/backup ensures a single order of ops (see the sketch below)
  • Primary orders operations
  • Backups execute operations in order
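
A toy sketch of this ordering, assuming the primary stamps each op with a sequence number; the Backup class and all names are illustrative, not the lab's code:

import heapq

class Backup:
    def __init__(self):
        self.next_seq, self.pending, self.applied = 0, [], []
    def receive(self, seq, op):
        # Buffer early arrivals; execute strictly in the primary's order.
        heapq.heappush(self.pending, (seq, op))
        while self.pending and self.pending[0][0] == self.next_seq:
            self.applied.append(heapq.heappop(self.pending)[1])
            self.next_seq += 1

b = Backup()
for seq, op in [(1, "opB"), (0, "opA")]:   # opB arrives out of order
    b.receive(seq, op)
print(b.applied)                            # ['opA', 'opB']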

6
Case study: Hypervisor (Bressoud and Schneider)
  • Goal: fault-tolerant computing
  • Banks, NASA, etc. need it
  • CPUs are most likely to fail due to their complexity
  • Hypervisor does primary/backup replication
  • If the primary fails, the backup takes over
  • Caveat: assumes failure detection is perfect

7
Hypervisor replicates at the VM level
  • Why replicate at the VM level?
  • Hardware fault-tolerant machines were big in the '80s
  • A software solution is more economical
  • Replicating at the O/S level is messy (many interfaces)
  • Replicating at the app level requires programmer effort
  • Replicating at the VM level has a cleaner interface (and no need to change the O/S or apps)
  • Primary and backup execute the same sequence of machine instructions

8
A strawman design
[Diagram: two machines, each with its own memory]
  • Two identical machines
  • Same initial memory/disk contents
  • Start executing on both machines
  • Will they perform the same computation?

9
Strawman flaws
  • For both machines to compute the same result, operations must be deterministic
  • What are deterministic ops?
  • ADD, MUL etc.
  • Read time-of-day register, cycle counter,
    privilege level?
  • Read memory?
  • Read disk?
  • Interrupt timing?
  • External input devices (network, keyboard)?

10
Hypervisor's architecture
The strawman replicates disks at both machines. Problem: disks might not behave identically (e.g., fail at different sectors).
[Diagram: primary and backup machines, each with its own memory, attach to shared devices over a SCSI bus and to each other over Ethernet]
  • Hypervisor connects devices to both machines
  • Only the primary reads/writes the devices
  • Primary sends read values to the backup
  • Only the primary handles interrupts from h/w
  • Primary sends those interrupts to the backup
11
Hypervisor executes in epochs
  • Challenge: interrupts must be delivered at the same point in the instruction stream on both nodes
  • Strawman: execute one instruction at a time
  • Backup waits for the primary to send interrupts at the end of each instruction
  • Very slow.
  • Hypervisor instead executes in epochs (see the toy model below)
  • CPU h/w interrupts every N instructions (so both nodes stop at the same point)
  • Primary delays all interrupts till the end of an epoch
  • Primary sends all interrupts to the backup
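
A toy model of epoch execution, assuming a fixed epoch length and pre-buffered interrupts; this is purely conceptual (the real system relies on CPU hardware that interrupts every N instructions):

EPOCH = 4  # instructions per epoch; illustrative value

def run(instructions, buffered_interrupts):
    # Deliver each epoch's buffered interrupts only at the epoch
    # boundary, so both nodes see them at the same instruction.
    trace = []
    for i, instr in enumerate(instructions, start=1):
        trace.append(instr)
        if i % EPOCH == 0:                        # end of an epoch
            epoch = i // EPOCH - 1
            trace.extend(buffered_interrupts.get(epoch, []))
    return trace

prog = ["i%d" % k for k in range(8)]
irqs = {0: ["irq-timer"], 1: ["irq-net"]}
# Identical inputs produce the identical trace on primary and backup.
assert run(prog, irqs) == run(list(prog), dict(irqs))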

12
Hypervisor failover
  • If primary fails, backup must handle I/O
  • Suppose the primary fails during epoch E+1
  • In epoch E, the backup times out waiting for <end, E+1>
  • Backup delivers all buffered interrupts at the end of E
  • Backup starts epoch E+1
  • Backup becomes the primary at epoch E+2

13
Hypervisor failover
  • Backup does not know whether the primary executed the I/O of epoch E+1
  • Relies on the O/S to retry the I/O
  • Devices need to support repeated ops
  • OK for disk writes/reads
  • OK for the network (TCP will figure it out)
  • How about a keyboard, printer, or ATM cash machine?

14
Hypervisor implementation
  • Hypervisor needs to trap every non-deterministic instruction
  • Time-of-day register
  • HP TLB replacement
  • HP branch-and-link instruction
  • Memory-mapped I/O loads/stores
  • Performance penalty is reasonable
  • A factor-of-two slowdown
  • How about its performance on modern hardware?

15
Caveats in Hypervisor
  • Hypervisor assumes failure detection is perfect
  • What if the network between primary/backup fails?
  • Primary is still running
  • Backup becomes a new primary
  • Two primaries at the same time!
  • Can timeouts detect failures correctly?
  • Pings from backup to primary are lost
  • Pings from backup to primary are delayed

16
Paxos: fault-tolerant agreement
  • Paxos lets all nodes agree on the same value
    despite node failures, network failures and
    delays
  • Extremely useful
  • e.g. Nodes agree that X is the primary
  • e.g. Nodes agree that Y is the last operation
    executed

17
Paxos general approach
  • One (or more) node decides to be the leader
  • Leader proposes a value and solicits acceptance
    from others
  • Leader announces the result, or tries again

18
Paxos requirements
  • Correctness (safety)
  • All nodes agree on the same value
  • The agreed value X has been proposed by some node
  • Fault tolerance
  • If fewer than N/2 nodes fail, the remaining nodes eventually reach agreement w.h.p.
  • Liveness is not guaranteed

19
Why is agreement hard?
  • What if >1 node becomes leader simultaneously?
  • What if there is a network partition?
  • What if a leader crashes in the middle of
    solicitation?
  • What if a leader crashes after deciding but
    before announcing results?
  • What if the new leader proposes a value different from the already-decided value?

20
Paxos setup
  • Each node runs as a proposer, an acceptor and a learner
  • Proposer (leader) proposes a value and solicits acceptance from acceptors
  • Leader announces the chosen value to learners

21
Strawman
  • Designate a single node X as the acceptor (e.g. the one with the smallest id)
  • Each proposer sends its value to X
  • X decides on one of the values
  • X announces its decision to all learners
  • Problem?
  • Failure of the single acceptor halts the decision
  • Need multiple acceptors!

22
Strawman 2: multiple acceptors
  • Each proposer (leader) proposes to all acceptors
  • Each acceptor accepts the first proposal it receives and rejects the rest
  • If the leader receives positive replies from a majority of acceptors, it chooses its own value
  • There is at most one majority, hence only a single value is chosen
  • Leader sends the chosen value to all learners
  • Problem:
  • What if multiple leaders propose simultaneously, so that no value is accepted by a majority?

23
Paxos solution
  • Proposals are ordered by proposal number
  • Each acceptor may accept multiple proposals
  • If a proposal with value v is chosen, all higher
    proposals have value v

24
Paxos operation: node state
  • Each node maintains
  • n_a, v_a: highest proposal accepted and its corresponding value
  • n_h: highest proposal seen
  • my_n: my proposal number in the current Paxos round

25
Paxos operation: three-phase protocol
  • Phase 1 (Prepare), sketched below:
  • A node decides to be the leader (and propose)
  • Leader chooses my_n > n_h
  • Leader sends <prepare, my_n> to all nodes
  • Upon receiving <prepare, n>:
  • If n < n_h:
  • reply <prepare-reject>
  • Else:
  • n_h = n
  • reply <prepare-ok, n_a, v_a>

This node will not accept any proposal numbered lower than n
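
A sketch of the acceptor side of Phase 1, assuming plain integers as proposal numbers (the slides use node:round pairs) and a dict holding the slide-24 state; handle_prepare and all names are illustrative:

def handle_prepare(state, n):
    if n < state["n_h"]:
        return ("prepare-reject",)
    state["n_h"] = n   # promise: reject any proposal numbered below n
    return ("prepare-ok", state["n_a"], state["v_a"])

acceptor = {"n_a": None, "v_a": None, "n_h": 0}
print(handle_prepare(acceptor, 11))   # ('prepare-ok', None, None)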
26
Paxos operation
  • Phase 2 (Accept), sketched below:
  • If the leader gets prepare-ok from a majority:
  • V = the value corresponding to the highest n_a received, if any
  • If V is null, the leader can pick any V
  • Send <accept, my_n, V> to all nodes
  • If the leader fails to get a majority of prepare-ok:
  • Delay and restart Paxos
  • Upon receiving <accept, n, V>:
  • If n < n_h:
  • reply <accept-reject>
  • else:
  • n_a = n; v_a = V; n_h = n
  • reply <accept-ok>
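
A companion sketch of the acceptor side of Phase 2, under the same assumptions as the Phase-1 sketch (integer proposal numbers, illustrative names):

def handle_accept(state, n, v):
    if n < state["n_h"]:
        return ("accept-reject",)
    state["n_a"], state["v_a"], state["n_h"] = n, v, n
    return ("accept-ok",)

acceptor = {"n_a": None, "v_a": None, "n_h": 11}  # promised n = 11 above
print(handle_accept(acceptor, 11, "val1"))        # ('accept-ok',)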

27
Paxos operation
  • Phase 3 (Decide)
  • If the leader gets accept-ok from a majority:
  • Send <decide, v_a> to all nodes
  • If the leader fails to get accept-ok from a majority:
  • Delay and restart Paxos
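
Putting the three phases together, a single-process sketch of one complete round; real deployments run acceptors on separate nodes behind RPC and use (node id, count) pairs as proposal numbers, so everything below is illustrative:

class Acceptor:
    def __init__(self):
        self.n_a, self.v_a, self.n_h = None, None, 0
    def prepare(self, n):                 # Phase 1, acceptor side
        if n < self.n_h:
            return None                   # prepare-reject
        self.n_h = n
        return (self.n_a, self.v_a)       # prepare-ok
    def accept(self, n, v):               # Phase 2, acceptor side
        if n < self.n_h:
            return False                  # accept-reject
        self.n_a, self.v_a, self.n_h = n, v, n
        return True                       # accept-ok

def run_round(my_n, my_value, acceptors):
    majority = len(acceptors) // 2 + 1
    oks = [r for r in (a.prepare(my_n) for a in acceptors) if r is not None]
    if len(oks) < majority:
        return None                       # delay and restart Paxos
    accepted = [(n_a, v_a) for n_a, v_a in oks if n_a is not None]
    # Use the value of the highest-numbered accepted proposal, if any
    v = max(accepted, key=lambda t: t[0])[1] if accepted else my_value
    if sum(a.accept(my_n, v) for a in acceptors) < majority:
        return None                       # delay and restart Paxos
    return v                              # Phase 3: send <decide, v> to all

nodes = [Acceptor() for _ in range(3)]
print(run_round(1, "val1", nodes))        # val1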

28
Paxos operation: an example
[Message trace: N1 leads a round with proposal number N1:1]
Initially: n_h = N0:0 at N0, N1:0 at N1, N2:0 at N2; n_a = v_a = null everywhere
N1 sends <prepare, N1:1> to N0 and N2
N0 and N2 set n_h = N1:1 and reply <prepare-ok, n_a = v_a = null>
N1 sends <accept, N1:1, val1> to N0 and N2
N0 and N2 set n_a = N1:1, v_a = val1 and reply <accept-ok>
N1 sends <decide, val1> to N0 and N2
29
Paxos properties
  • When is the value V chosen?
  • When the leader receives prepare-ok from a majority and proposes V?
  • When a majority of nodes accept V?
  • When the leader receives accept-ok from a majority for value V?

30
Understanding Paxos
  • What if more than one leader is active?
  • Suppose two leaders use different proposal numbers, N0:10 and N1:11
  • Can both leaders see a majority of prepare-ok?

31
Understanding Paxos
  • What if leader fails while sending accept?
  • What if a node fails after receiving accept?
  • If it doesn't restart
  • If it reboots
  • What if a node fails after sending prepare-ok?
  • If it reboots

32
Using Paxos for RSM
  • Fault-tolerant RSM requires a consistent replica membership
  • Membership = <primary, backups>
  • RSM goes through a series of membership changes
  • <vid-0, primary, backups> -> <vid-1, primary, backups> -> ...
  • Use Paxos to agree on the <primary, backups> for a particular vid

33
Lab 5: Using Paxos for RSM
All nodes start with the static config vid1 = {N1}
N2 joins: a majority in vid1 = {N1} accepts vid2 = {N1, N2}
N3 joins: a majority in vid2 = {N1, N2} accepts vid3 = {N1, N2, N3}
N3 fails: a majority in vid3 = {N1, N2, N3} accepts vid4 = {N1, N2}
34
Lab 5: Using Paxos for RSM
[Diagram: N1 and N2 each record the view history vid1 = {N1}, vid2 = {N1, N2}]
35
Lab 5: Using Paxos for RSM
[Diagram: N3 joins; N1, N2 and the newly joined N3 each learn the view history vid1 = {N1}, vid2 = {N1, N2}]
36
Lab 6: re-configurable RSM
  • Use RSM to replicate the lock_server
  • The primary in each view assigns a viewstamp to each client request
  • A viewstamp is a tuple (vid, seqno)
  • (0,0) (0,1) (0,2) (0,3) (1,0) (1,1) (1,2) (2,0) (2,1)
  • All replicas execute client requests in viewstamp order (see the demo below)
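
Since a viewstamp is a (vid, seqno) pair, ordinary tuple comparison (vid first, then seqno) gives exactly the order above; a small demo, not lab code:

vs = [(1, 1), (0, 2), (2, 0), (0, 0), (1, 0), (0, 1), (0, 3), (1, 2), (2, 1)]
print(sorted(vs))
# [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1)]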

37
Lab 6: Viewstamp replication
  • To execute an op with viewstamp vs, a replica must have executed all ops < vs
  • A newly joined replica needs to transfer state to ensure its state reflects execution of all ops < vs

38
Lab 5: Using Paxos for RSM
[Diagram: N1 in vid1 = {N1}, at viewstamp myvs = (1, 50)]