Title: Fault Tolerance and Consensus
1. Fault Tolerance and Consensus
- Problem definitions
- Stopping failures
- Byzantine failures
- Randomized solutions
- Impossibility in asynchronous systems
Fundamentals and Design of Distributed Systems
D.H.J. Epema
Parallel and Distributed Group
2. Fault tolerance and consensus (1)
- Two persons (Alice and Bob) try to make an appointment
- Two propositions
  - P_A: Alice wants to have the appointment
  - P_B: Bob wants to have the appointment
- Alice sends message A1 that she wants the appointment
- Bob receives A1, and knows P_A: K_B(P_A)
- Bob sends back a message B1 that he wants to go too
- Alice receives B1, and so K_A(P_B) and K_A(K_B(P_A)) hold
- Alice sends confirmation A2 back
- This continues forever
- Problem: messages may have arbitrary delays and may get lost
[Figure: message exchange between A and B. A1 goes from A to B, after which K_B(P_A) holds; B1 goes from B to A, after which K_A(P_B) and K_A(K_B(P_A)) hold; A2 goes from A to B.]
3. Fault tolerance and consensus (2)
- Processors may need to reach consensus
- Applications
  - committing a transaction in a database
    - all participating sites have to agree on committing the results
  - in a distributed database with replication
    - when a record is to be modified, the database servers holding the replicas have to agree on the modification
  - in a replicated computation
    - processes have to start with the same input value (e.g., from a sensor), so they have to agree on this value
- Agreement is modeled as agreeing on the value of a single bit
- Reaching consensus is a problem in the face of failures
4. Fault classification
- Possible processor failures
  - fail-stop (crash) failures
    - a process just stops
    - when in a round in a synchronous system a process should send a set of messages, it may send only a subset
  - omission failures
    - failing to send or receive a message
  - performance failures
    - not meeting timing specifications
  - Byzantine failures
    - arbitrary (possibly malicious) behavior
5. Model aspects
- Synchronous versus asynchronous
  - reaching agreement is much more difficult in asynchronous systems: the difference between a long delay and a processor/link failure cannot be detected
- Authentication
  - without: messages can be forged or altered by a process before passing them along to others
  - with: messages cannot be forged or modified
  - agreement is much more difficult to reach when messages are non-authenticated
- Network connectivity
  - we assume a complete network
6. Agreement with stopping failures
- All processes start with an initial value from some set V
- Every process has to decide on a value in V such that
  - Agreement: no two processes decide on different values
  - Validity: if all processes start with the same value v, then no process decides on a value different from v
  - Termination: all non-faulty processes decide within finite time
7. Byzantine generals
[Figure: a city surrounded by armies whose generals choose between "attack" and "no attack"]
- City surrounded by armies
- Armies have to attack simultaneously in order to conquer the city
- Communication between generals is by means of messengers
- Some generals of the armies are traitors
8. The Byzantine agreement problem
- One process (the source or commander) starts with a binary value
- Each of the remaining processes (the lieutenants) has to decide on a binary value such that
  - Agreement: all non-faulty processes agree on the same value
  - Validity: if the source is non-faulty, then all non-faulty processes agree on the initial value of the source
  - Termination: all non-faulty processes decide within finite time
- So if the source is faulty, the non-faulty processes can agree on any value
- It is irrelevant on what value a faulty process decides
[Figure: commander C with initial value 0/1]
9. Two variations
- All generals start with a value
- Variation 1
  - all non-faulty generals have to agree on a vector with a value for every general: (v1, v2, ..., vn)
  - solution: run a copy of an algorithm for the previous problem for every general
- Variation 2
  - all non-faulty generals have to agree on a single value: majority(v1, v2, ..., vn)
  - solution: apply the same decision rule to the vector in every general (e.g., majority function)
10. A solution for stopping failures (1)
- Solution by flooding decision values
- No more than f failing processes
- Every process starts with a value v
- Every process maintains a set W (with decision values seen so far)
- Initially W = {v}
- Then, do f+1 rounds
  - broadcast the current value of W to all other processes
  - receive all these sets and set W to the union of them all and W
- Finally,
  - if W contains only a single element v, decide(v)
  - else decide(default)
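The flooding scheme above can be tried out in a small synchronous simulation. This is only a sketch under stated assumptions: the function name `floodset`, the `crash_plan` format (process id mapped to the round in which it crashes and the receivers it still reaches in that round), and the default value 0 are illustrative choices, not part of the slides.

```python
def floodset(initial, f, crash_plan=None, default=0):
    """Simulate f+1 rounds of flooding with crash failures.

    initial:    dict mapping process id -> initial value
    crash_plan: dict mapping process id -> (round, receivers), meaning the
                process crashes in that round after reaching only `receivers`
                (hypothetical format chosen for this sketch)
    Returns a dict with the decision of every process that is still active.
    """
    crash_plan = crash_plan or {}
    W = {p: {v} for p, v in initial.items()}   # initially W = {v}
    alive = set(initial)
    for rnd in range(1, f + 2):                # rounds 1 .. f+1
        sent = []
        for p in sorted(alive):
            if p in crash_plan and crash_plan[p][0] == rnd:
                receivers = crash_plan[p][1]   # partial broadcast, then crash
            else:
                receivers = alive - {p}
            for q in receivers:
                sent.append((q, frozenset(W[p])))
        # processes that crash in this round are gone from now on
        alive -= {p for p, (crash_round, _) in crash_plan.items() if crash_round == rnd}
        for q, w in sent:
            if q in alive:
                W[q] |= w                      # union of all received sets and W
    return {p: next(iter(W[p])) if len(W[p]) == 1 else default for p in alive}


# Process 1 holds the only 0 and crashes in round 1 after reaching process 2 only.
decisions = floodset({1: 0, 2: 1, 3: 1, 4: 1}, f=1, crash_plan={1: (1, {2})})
print(decisions)
```

In this run process 2 learns the value 0 in round 1 and spreads it in round 2, which is why one extra round beyond the number of failures is needed before all active processes hold the same W.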
11. A solution for stopping failures (2)
- Validity and termination are trivially satisfied
- For agreement
  - it is enough to show that all processes that are still active at the end of round f+1 then have the same set W
  - because there are f+1 rounds and at most f failing processes, there is at least one round r in which no process fails
  - in round r all active processes exchange their sets W, and so have identical sets W at the end of the round
  - from then on, all sets W in all active processes are identical
[Figure: timeline of rounds; in round r nobody fails, and from then on all active processes have the same W]
12. A solution for stopping failures (3)
- Optimization
  - processes only need to know whether at the end |W| = 1 or |W| > 1
  - so let processes broadcast at most two values
    - their initial value
    - the first different value they receive
13. Conditions for a solution for Byzantine
- Number of processes: n
- Maximum number of possibly failing processes: f
- Necessary and sufficient condition for a solution to Byzantine agreement
  - f < n/3
- Minimal number of rounds in a deterministic solution
  - f+1
- There exist randomized solutions with a lower expected number of rounds
14. Example: three generals (1)
- Scenario 1: Lieutenant L2 is a traitor
[Figure: commander C sends 0 to both L1 and L2; L1 sends 0 to L2; L2 sends 1 to L1. Note all messages sent and received by L1.]
15. Example: three generals (2)
- Scenario 2: Commander C is a traitor
[Figure: commander C sends 0 to L1 and 1 to L2; L1 sends 0 to L2; L2 sends 1 to L1. The same messages are sent and received by L1 as in scenario 1.]
16. Example: three generals (3)
- L1 has to decide 0 in scenario 1, because both L1 and C are loyal and C starts with a 0
- Lieutenant L1 cannot distinguish the two scenarios
- So L1 also has to decide 0 in scenario 2
- So a loyal lieutenant (L1) always has to follow the commander
- The same holds for L2, so L2 has to decide 1 in scenario 2
- Contradiction: L1 and L2 are both loyal in scenario 2, but decide on different values!
- This is an example of an impossibility result
17. A solution for Byzantine agreement (1)
- Algorithm is recursive with f+1 levels
- Without authentication, modeled with Oral Messages (OM)
- When a message is supposed to be sent according to the algorithm, but a process does not send it, this is detected, and a default value (e.g., 0) is assumed
- Bottom case of the recursion: OM(0) (no failures)
  - the commander broadcasts its initial value
  - every other process decides on the value it receives
18. A solution for Byzantine agreement (2)
- OM(f), f > 0 (resilient to f failures)
  - the commander broadcasts its initial value
  - process numbering: commander 0, lieutenants 1, 2, ..., n-1
  - let v_i be the value received from the commander by lieutenant L_i, or the default if no value is received
  - recursive step
    - L_i executes OM(f-1), acting as the commander for the other lieutenants (L_1, ..., L_{i-1}, L_{i+1}, ..., L_{n-1})
    - let v_j be the value on which L_i decides in the recursive step with L_j as the commander (for i, j = 1, 2, ..., n-1, i ≠ j)
    - L_i decides on majority(v_1, ..., v_i, ..., v_{n-1})
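19. A solution for Byzantine agreement (3)
[Figure: in OM(f), the commander sends v_1, v_2, ..., v_i, ..., v_{n-1} to lieutenants L_1, L_2, ..., L_i, ..., L_{n-1}; in OM(f-1), L_1 acts as commander for L_2, ..., L_i, ..., L_{n-1}. Here L_i decides on its own v_1 as a lieutenant of L_1.]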
20. A solution for Byzantine agreement (4)
- So a lieutenant does not decide on the majority of all values it receives!
- But L_i decides on majority(majority(...), majority(...), ..., v_i, ..., majority(...), majority(...))
  - every majority(...) is computed as the decision when acting as a lieutenant in OM(f-1)
  - v_i is obtained directly from the commander
21. A solution for Byzantine agreement (5)
- Number of executions
  - OM(f): 1 time
  - OM(f-1): (n-1) times
  - OM(k): (n-1)(n-2)...(n-f+k) times, for k = 0, 1, ..., f-1
- Total number of messages is of order n^(f+1)
  - OM(f): n-1
  - OM(f-1): (n-1)(n-2)
  - OM(k): (n-1)(n-2)...(n-(f-k))(n-(f-k+1))
  - OM(0): (n-1)(n-2)...(n-(f+1)) (f+1 factors, this dominates)
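The counts above can be checked numerically. The helper below is a small sketch (the name `om_message_counts` is an assumption of this example) that multiplies the number of executions at each recursion level by the number of lieutenants addressed per execution.

```python
def om_message_counts(n, f):
    """Messages sent at each recursion level OM(k), for k = f, f-1, ..., 0."""
    counts = {}
    executions = 1           # OM(f) is executed once
    lieutenants = n - 1      # the top-level commander addresses n-1 lieutenants
    for k in range(f, -1, -1):
        counts[k] = executions * lieutenants   # one message per lieutenant per execution
        executions *= lieutenants              # every lieutenant starts one OM(k-1)
        lieutenants -= 1                       # the sub-commander drops out of the group
    return counts


counts = om_message_counts(7, 2)
print(counts, sum(counts.values()))
```

For n = 7 and f = 2 this gives 6, 30 and 120 messages at levels OM(2), OM(1) and OM(0), so the bottom level with its f+1 factors indeed dominates.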
22. A solution for Byzantine agreement (6)
[Figure: tree in lieutenant L6 for n = 7, f = 2, i = 6. Level 0: root 0; level 1: nodes 1, 2, 3, 4, 5; level 2: the children of the level-1 nodes (for example, node 1 has children 2, 3, 4, 5).]
- In order to decide, every lieutenant L_i creates a labelled tree with f+1 levels
  - level 0: the root with label 0 (the commander)
  - level 1: n-2 children with all labels except 0 and i
  - at every subsequent level: all ids that have not yet occurred on the path from the root and are different from i
  - the degree decreases by 1 at every level
23. A solution for Byzantine agreement (7)
[Figure: the same tree in lieutenant L6 (n = 7, f = 2, i = 6) with values attached: the root carries v_6, the level-1 node 1 carries the value received from L1, and the level-2 node 5 below node 1 carries the value received through L1 and L5.]
- Label the nodes of the tree with additional labels
  - level 0: v_i (value received from the commander)
  - level 1: the value that L_j told L_i that the commander told him
  - label of any node: the value that was passed to L_i from the commander through the chain of lieutenants on the path from the root to the node
24. A solution for Byzantine agreement (8)
[Figure: tree in lieutenant L6 (n = 7, f = 2, i = 6) with v_6 at the root (level 0). A level-1 node with value v and level-2 children with values w, x, y, z evaluates to majority(v,w,x,y,z); a leaf with value z decides z.]
- Decide by propagating the result up with the majority function
  - at the leaf level: decide on the value received (OM(0))
  - at every next higher level: take the majority of the local value and the decisions at child nodes
  - the final value at the root is the final decision
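25. Example: four generals (1)
- A lieutenant is a traitor
[Figure: in OM(1), the loyal commander C sends v to all three lieutenants; in the 3 x OM(0) step, the loyal lieutenants relay v and the traitor relays arbitrary values (?).]
- Every loyal lieutenant receives v, v, ?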
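26. Example: four generals (2)
- The commander is a traitor
[Figure: commander C sends x, y and z to the three lieutenants, who each relay the value they received to the other two.]
- Every loyal lieutenant receives x, y, z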
27. Byzantine agreement with authentication (1)
- Every message carries a signature
- The signature of a loyal general cannot be forged
- Alteration of the contents of a signed message can be detected
- Every (loyal) general can verify the signature of any other (loyal) general
- Any number f of traitors can be allowed
- Commander is process 0
- Structure of a message from (and signed by) the commander, and subsequently signed and sent by lieutenants L_i1, L_i2, ..., L_ik
  - (v : s_0 : s_i1 : ... : s_ik)
- Every lieutenant maintains a set of orders V
- Some choice function on V for deciding (e.g., majority, minimum)
28. Byzantine agreement with authentication (2)
- Algorithm in commander
  - send(v : s_0) to every lieutenant
- Algorithm in every lieutenant L_i (sign and propagate messages long enough)
  - upon receipt of (v : s_0 : s_i1 : ... : s_ik) do
    - if (v not in V) then
      - V := V union {v}
      - if (k < f) then
        - for (j in {1, 2, ..., n-1} \ {i, i1, ..., ik}) do
          - send(v : s_0 : s_i1 : ... : s_ik : s_i) to L_j
  - if (L_i will not receive any more messages) then
    - decide(choice(V))
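The sign-and-propagate scheme can be sketched in a round-based simulation. Several simplifications are assumptions of this sketch: signatures are modeled as a tuple of signer ids that nobody tampers with, all lieutenants are loyal, only the commander may misbehave by signing different values for different lieutenants, and the commander sends an order to every lieutenant. The name `signed_agreement` is illustrative.

```python
def signed_agreement(n, f, commander_orders, choice=min):
    """Simulate f+1 rounds of sign-and-propagate.

    commander_orders: dict lieutenant id -> value signed and sent to that
                      lieutenant by the commander (process 0)
    Returns a dict lieutenant id -> decision choice(V).
    """
    orders = {i: set() for i in range(1, n)}          # the set V of every lieutenant
    inbox = [(i, v, (0,)) for i, v in commander_orders.items()]
    for _ in range(f + 1):
        outbox = []
        for receiver, v, signers in inbox:
            if v in orders[receiver]:
                continue                              # value already known: ignore
            orders[receiver].add(v)
            k = len(signers) - 1                      # lieutenant signatures so far
            if k < f:
                for j in range(1, n):
                    if j != receiver and j not in signers:
                        outbox.append((j, v, signers + (receiver,)))
        inbox = outbox
    return {i: choice(V) for i, V in orders.items()}


# Three generals, f = 1: the commander signs 1 for L1 and 0 for L2.
print(signed_agreement(3, 1, {1: 1, 2: 0}))
```

In this run both lieutenants end with V = {0, 1} and apply the same choice function, so they agree even though two of the three generals would be too few without signatures.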
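29. Example: three generals
[Figure: message format value:signature(s). Commander C sends 1:0 to L1 and 0:0 to L2; L1 forwards 1:0:1 to L2 and L2 forwards 0:0:2 to L1; both lieutenants end with V = {0,1}.]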
30. Randomized Byzantine agreement (1)
- Solution for synchronous and asynchronous systems!
- n processes, of which at most f fail, n > 5f
- Every process has an initial value v
- The algorithm proceeds in rounds consisting of three phases
  - a notification phase (messages contain an N)
  - a proposal phase (messages contain a P)
  - a decision phase
- When a process expects messages from all other processes, it is no use waiting for more than n-f messages
- When not enough processes support a possible decision, a process starts the next round with a new random input value v
31. Randomized Byzantine agreement (2)
- r := 1; decided := false
- do forever
  - broadcast(N,r,v) /* notification phase */
  - await (n-f) messages of the form (N,r,*)
  - if (> (n+f)/2 messages (N,r,w), w = 0,1) then /* proposal phase */
    - broadcast(P,r,w)
  - else broadcast(P,r,?)
  - if decided then STOP
  - else await (n-f) messages of the form (P,r,*)
  - if (> f messages (P,r,w), w = 0,1) then /* decision phase */
    - v := w
    - if (> 3f messages (P,r,w)) then
      - decide(w)
      - decided := true
  - else v := random(0,1)
  - r := r+1
- The threshold conditions are explained later
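The three phases can be exercised in a simulation. The following is a sketch under several assumptions that are not in the slides: asynchrony is modeled by letting every correct process use a random subset of n-f senders in each phase, faulty processes send an independent random bit to every receiver, decided processes simply keep participating until everyone has decided, and the name `randomized_agreement` is illustrative.

```python
import random


def randomized_agreement(initial, faulty, rng, max_rounds=10_000):
    """initial: list of n bits; faulty: set of faulty process ids (f = len(faulty)).
    Returns (decisions of the correct processes, number of rounds used)."""
    n, f = len(initial), len(faulty)
    assert n > 5 * f, "the algorithm requires n > 5f"
    correct = [p for p in range(n) if p not in faulty]
    v = {p: initial[p] for p in correct}
    decided = {}

    def collect(sent):
        """Values in the n-f messages one process waits for in a phase."""
        senders = rng.sample(range(n), n - f)
        # faulty senders are not in `sent`; they contribute a random bit
        return [sent[q] if q in sent else rng.randint(0, 1) for q in senders]

    for r in range(1, max_rounds + 1):
        # notification phase followed by the choice of a proposal
        proposals = {}
        for p in correct:
            got = collect(v)
            proposals[p] = next((w for w in (0, 1) if got.count(w) > (n + f) / 2), "?")
        # proposal messages are collected, then the decision phase
        new_v = {}
        for p in correct:
            got = collect(proposals)
            w = next((w for w in (0, 1) if got.count(w) > f), None)
            if w is None:
                new_v[p] = rng.randint(0, 1)          # no support: pick a random value
            else:
                new_v[p] = w
                if got.count(w) > 3 * f:
                    decided.setdefault(p, w)          # keep the first decision
        v = new_v
        if len(decided) == len(correct):
            return decided, r
    raise RuntimeError("no decision within max_rounds")


decisions, rounds = randomized_agreement([1] * 6, faulty={5}, rng=random.Random(1))
print(decisions, rounds)
```

With n = 6 and f = 1, identical inputs lead to a decision in the first round; with mixed inputs the number of rounds depends on the random choices, but the correct processes still end with the same value.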
32. Randomized Byzantine agreement (3)
- No simultaneous contradicting proposals by correct processes
- Lemma 1
  - If a correct process proposes v in round r, then no other correct process proposes 1-v in round r
- Proof
  - the process has received more than (n+f)/2 messages (N,r,v)
  - of these, more than (n-f)/2 are from correct processes, which is a majority of the correct processes
33. Randomized Byzantine agreement (4)
- When all correct processes have the same value, immediate decision
- Lemma 2
  - If at the start of round r all correct processes have the same value v, then they all decide v in round r
- Proof
  - each correct process receives at least n-f notification messages, at least n-2f of which are from correct processes, and so of the form (N,r,v)
  - because n > 5f, we have n-2f = n/2 + n/2 - 2f > n/2 + 5f/2 - 2f = (n+f)/2
  - so each correct process proposes v
  - so each correct process receives at least n-2f messages of the form (P,r,v)
  - because n > 5f, we have n-2f > 3f, and so each correct process decides v
34. Randomized Byzantine agreement (5)
- Decision of any correct process immediately followed by others
- Lemma 3
  - If a correct process decides v in round r, all correct processes decide v in round r+1
- Proof
  - it is enough to show that all correct processes propose v in round r+1
  - if a process decides v in round r, it must have received more than 3f proposals for v, m of which are from correct processes, for some m > 2f
  - so every other correct process receives at least m-f > f proposals for v, so it starts the next round with this value
  - now use Lemma 2
35. Randomized Byzantine agreement (6)
- Theorem
  - If n > 5f, the algorithm guarantees agreement and validity, and terminates with probability 1
- Proof
  - with probability 1, enough processes will eventually pick a common value v to have at least one correct process decide
- Expected number of rounds is of order 2^n (in fact, slightly better)
- Remark: randomization is used only if there is not enough initial support for any decision anyway
36. Randomized coordinated attack (1)
- Synchronous system
- Coordinated-attack problem
- Complete graph
- System runs for a fixed number r of rounds
- Messages may get lost (all links may exhibit failures)
- Processes do not exhibit failures
37. Randomized coordinated attack (2)
- Validity
  - if all processes start with 0, they all decide 0
  - if all processes start with 1 and all messages are received, they all decide 1
- Agreement with some probability
  - P[some process decides 0 and some process decides 1] ≤ ε, for some 0 ≤ ε ≤ 1 (probability of disagreement)
- Termination: trivial
38. Adversaries
- Faults are modeled with an adversary who can on purpose try to deceive the system/processors
- Here, the adversary can choose
  - the input values of the processors
  - the communication pattern (can omit arbitrary messages)
- In the algorithm, we get ε = 1/r
39. Communication patterns (1)
- Communication pattern: a subset of the set
  - {(i,j,k) | (i,j) an edge in the processor graph, k a round number}
- We will define an ordering ≤_γ for pairs (i,k) for a communication pattern γ
- Interpretation: (i,k) ≤_γ (j,l) means that j has at least the same knowledge in round l as i had in round k
[Figure: a message from i to j in round k]
40. Communication patterns (2)
- Ordering ≤_γ for pairs (i,k) for a communication pattern γ
- Knowledge is monotonic
  - (i,k) ≤_γ (i,l) if k ≤ l
- All knowledge is transferred in messages
  - if (i,j,k) in γ, then (i,k-1) ≤_γ (j,k)
- Transitive closure
  - if (i,k) ≤_γ (i',k') and (i',k') ≤_γ (i'',k''), then (i,k) ≤_γ (i'',k'')
41. Information level (1)
- The information level on pairs (i,k) is defined as
  - k = 0: level_γ(i,0) = 0
  - k > 0: if there is a j ≠ i such that (j,0) ≤_γ (i,k) does not hold, then level_γ(i,k) = 0
  - k > 0, otherwise: let l_j = max{level_γ(j,k') | (j,k') ≤_γ (i,k)}
    - then, level_γ(i,k) = 1 + min{l_j | j ≠ i}
- The information level
  - starts at 0
  - indicates what a process knows about other processes
  - is incremented when a process has heard about the previous level of all other processes
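The definition can be turned into a direct computation. The sketch below assumes processes are numbered 0, ..., n-1, rounds 1, ..., r, and that a triple (i, j, k) means that the message from i to j in round k is delivered; the function name and the helper table `heard` are assumptions of this example. `heard[i][k][j]` stores the largest k' with (j,k') ≤_γ (i,k), or -1 if i has not heard from j at all.

```python
def information_levels(n, r, pattern):
    """Return level[i][k] for all processes i and rounds k = 0..r."""
    heard = [[[-1] * n for _ in range(r + 1)] for _ in range(n)]
    level = [[0] * (r + 1) for _ in range(n)]
    for i in range(n):
        heard[i][0][i] = 0                    # a process knows only itself at round 0
    for k in range(1, r + 1):
        for i in range(n):
            row = list(heard[i][k - 1])       # knowledge is monotonic
            for j in range(n):
                if (j, i, k) in pattern:      # message from j delivered in round k
                    row = [max(a, b) for a, b in zip(row, heard[j][k - 1])]
            row[i] = k
            heard[i][k] = row
        for i in range(n):
            others = [j for j in range(n) if j != i]
            if any(heard[i][k][j] < 0 for j in others):
                level[i][k] = 0               # some (j,0) has not reached (i,k)
            else:
                # l_j is the highest level of j that i has heard about
                level[i][k] = 1 + min(
                    max(level[j][: heard[i][k][j] + 1]) for j in others
                )
    return level


# Two processes, two rounds, and only the round-1 message from 0 to 1 arrives.
print(information_levels(2, 2, {(0, 1, 1)}))
```

In this example process 1 reaches level 1 while process 0 stays at level 0, so the two levels differ by exactly 1.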
42. Information level (2)
- It can be shown that
  - the information levels of different processes in the same round never differ by more than 1
  - if the communication pattern is complete (all triples (i,j,k) appear), then level_γ(i,k) = k for all i and k
[Figure: example communication pattern annotated with the information levels of the processes in successive rounds]
43. The algorithm (1)
- Ideas
  - Process 1 picks a random number k between 1 and r
  - Full information distribution in every round (on correct links)
  - Processes maintain information on the initial values v and the levels of all processes
- Messages are of the form (L,V,k), with
  - L: a vector with the levels as far as known by the sending process
  - V: a vector with the initial values of all processes
  - k: the round number picked by process 1
- Levels and initial values of other processes, and k, are initially undefined
44. The algorithm (2)
- Picking a round number in process 1
  - if ((i = 1) and (rounds = 0)) then key := random(1,r)
- Sending a message in every round in every process
  - send(L,V,key) to all j /* all locally known information */
45. The algorithm (3)
- Receiving all messages in a round in process i
  - rounds := rounds+1
  - for (j = 1 to i-1, i+1 to N) do
    - receive(Lj,Vj,kj) from j /* on correct links */
    - if (kj ≠ undefined) then key := kj /* round number picked by 1 */
    - for all l: if (Vj(l) ≠ undefined) then Vi(l) := Vj(l) /* copy init. vals */
    - for all l: if (Lj(l) > Li(l)) then Li(l) := Lj(l) /* copy levels */
  - Li(i) := 1 + min{Li(j) | j ≠ i} /* compute own level */
  - if (rounds = r) then
    - if (key ≠ undefined) and (Li(i) ≥ key) and (Vi(j) = 1 for all j) then /* all processes started with 1 */
      - decision := 1
    - else decision := 0
46. Use of levels and key
- In a sense, processes agree on their levels, i.e., on the actual round they have reached at the end of the algorithm
- The key chosen by process 1 is a guess of this level
47. Why do we get ε = 1/r?
- Sketch
  - Let l_i be the value of Li(i) in round r
  - The levels l_i differ by at most 1
  - If key > max{l_i} or at least one process has initial value 0, all processes decide 0 (agreement)
  - If key ≤ min{l_i} and all processes have initial value 1, all processes decide 1 (again agreement)
  - So the only case where disagreement is possible is when key = max{l_i}, which has probability 1/r, since max{l_i} is determined by the adversary and key is uniform on {1, ..., r}
48. We can't do much better
- It can be shown that
  - any r-round algorithm for the randomized coordinated attack problem has a probability of disagreement at least equal to 1/(r+1)
49. Impossibility of consensus in asynchronous systems
- The consensus problem (weak form)
  - Consistency: all correct processors that take a decision, take the same decision
  - Validity: both 0 and 1 are possible decisions from possibly different initial configurations (to avoid trivial solutions)
  - Termination: at least one correct processor eventually takes a decision
- Theorem: there is no such solution (FLP: Fischer-Lynch-Paterson)
50. The model (1)
- Every processor has
  - an input variable x
  - an output variable y with possible values 0 and 1
- Messages are of the form (P,m), with
  - P: the destination processor
  - m: the message contents
- Message passing is not necessarily FIFO
- This is modeled with a single global message buffer
[Figure: processors P and Q, each with variables x and y, connected through the global message buffer, which holds messages such as (P,m)]
51. The model (2)
- Every processor has a deterministic transition function
- In a single step, a processor
  - receives a message (any from the message buffer meant for it)
  - does a local computation
  - sends a finite number of messages
- A configuration is defined by
  - the internal state of every processor
  - the contents of the message buffer
52. The model (3)
- In an initial configuration
  - every processor is in an initial state (with some value for x)
  - the message buffer is empty
- A step
  - takes the system from one configuration to the next
  - is defined by a single step of one processor
  - is associated with an event e = (P,m)
    - receiving the message m by P
    - performing the ensuing computation
    - and sending a finite number of messages
- A schedule s from C is a sequence of events applied to C
- The configuration s(C), with s a finite schedule, is said to be reachable from C
[Figure: event e = (P,m) takes C to C' = e(C); schedule s takes C to C' = s(C)]
53. The model (4)
- A run is a sequence of configurations associated with a schedule
- A processor is in a decision state if it has given y a value
- A run is a deciding run if at least one processor reaches a decision state
- A processor is non-faulty in a run if it performs infinitely many steps
- A run is admissible if at most one processor is faulty, and all messages to non-faulty processors are eventually received
- A configuration is bivalent if both decision values are possible (for a correct processor)
- A configuration is univalent if only one of the decision values is possible (0-valent or 1-valent)
54. Outline impossibility proof
- Suppose there exists a 1-resilient consensus protocol
- Step 1. There exists a bivalent initial configuration
- Step 2. For every bivalent configuration C and processor P, there is a finite schedule s such that s(C) is bivalent and P takes a step in s
- Step 3. Keep the system in bivalent configurations forever
  - apply step 1: let B_1 be a bivalent initial configuration
  - for every i, apply step 2: there exists an s such that B_{i+1} = s(B_i) is bivalent with process P_j taking the final step in s, j = i mod n
[Figure: B_1, B_2, ..., B_i, B_{i+1} connected by schedules s, with all processes taking steps in turn]
55. Proof step 1
- Suppose all initial configurations are univalent, and consider a 0-valent initial configuration C0 and a 1-valent initial configuration C1
- We can assume that C0 and C1 differ in the initial value of only one process, say process P (by flipping one value at a time)
- The schedule s of an admissible deciding run starting in C0 in which P does not take any step can also be applied to C1
- Both runs reach the same decision, but the decision must be 0 in the run starting in C0 and 1 in the run starting in C1: contradiction
[Figure: C0 (0-valent) and C1 (1-valent) differ only in the initial value of P; the same schedule s of the other processes is applied to both]
56. Proof step 2 (1)
- Let C be a bivalent configuration in which e = (P,m) is applicable
- Let 𝒞 be the set of configurations reachable from C without ever applying e
- Let D be the set of configurations obtained by applying e to every configuration in 𝒞
- To prove: D contains a bivalent configuration
[Figure: the set 𝒞 of configurations reachable from C without e; applying e to its members gives the set D = e(𝒞)]
57. Proof step 2 (2)
- Assume D does not contain bivalent configurations
- Then D contains both 0-valent and 1-valent configurations
  - Let E_i be an i-valent configuration reachable from C, i = 0,1
  - If E_i is in the set 𝒞 of configurations reachable from C without applying e, let F_i = e(E_i)
  - If E_i is not in 𝒞, event e was used in getting from C to E_i, and so there exists an F_i in D such that E_i can be reached from F_i
  - F_i is i-valent, i = 0,1
[Figure: the two cases: E_i inside 𝒞 with F_i = e(E_i) in D, and E_i outside 𝒞, reached from some F_i in D]
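58. Proof step 2 (3)
- So D must have neighboring 0- and 1-valent configurations D0 and D1
- So there exist C0 and C1 in the set 𝒞 of configurations reachable from C without applying e such that
  - C1 = e'(C0) for some step e' = (P',m')
  - D_i = e(C_i) is i-valent, i = 0,1
- Steps in different processes commute
- Two cases
  - P' ≠ P: then e and e' commute and D1 = e'(D0): contradiction!
[Figure: inside 𝒞, the step e' = (P',m') leads from C0 to C1; applying e leads from C0 to D0 and from C1 to D1 in D]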
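59. Proof step 2 (4)
- P' = P
  - consider a finite deciding run from C0 in which P does not take any steps; let s be the schedule and let A = s(C0)
  - s is applicable to D0 and D1
  - let E_i = s(D_i), i = 0,1; E_i is i-valent
  - then E0 = e(A) and E1 = e(e'(A))
  - so A is bivalent: contradiction with the run to A being deciding!
[Figure: from C0, e leads to D0 and e' leads to C1, from which e leads to D1; schedule s leads from C0 to A, from D0 to E0 and from D1 to E1; e leads from A to E0, and e' followed by e leads from A to E1]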