Title: Distributed Systems 2006
1Distributed Systems 2006
- Overcoming Failures in a Distributed System
- With material adapted from Ken Birman
2Leslie Lamport
- A distributed system is one in which the failure
of a machine you have never heard of can cause
your own machine to become unusable
3Plan
- Goals
- Static and Dynamic Membership
- Logical Time
- Distributed Commit
4Thought question
- Suppose that a distributed system was built by interconnecting a set of extremely reliable components running on fault-tolerant hardware
- Would such a system be expected to be reliable?
- Perhaps not. The pattern of interaction, the need to match rates of data production and consumption, and other distributed factors can all prevent a system from operating correctly!
5Example (1)
- The Web components are individually reliable
- But the Web can fail by returning inconsistent or stale data, can freeze up or claim that a server is not responding (even if both browser and server are operational), and it can be so slow that we consider it faulty even if it is working
- For stateful systems (the Web is stateless) this issue extends to joint behavior of sets of programs
6Example (2)
- Ariane 5
- June 4, 1996, 40 seconds after takeoff
- Self destruction after abrupt course correction
- Caused by the complete loss of guidance and attitude information due to specification and design errors in the software of the inertial reference system
- Loss of 500 million, but no loss of life
- Where are the distribution aspects?
7Our Goal Here
- We want to replicate data and computation
- For availability
- For performance
- while guaranteeing consistent behavior
- Work towards virtually synchronous communication
- System appears to have no replicated data
- System appears to only have multi-thread
concurrency
8Synchronous and Asynchronous Executions
[Figure: time-lines for processes p, q and r under a synchronous and an asynchronous execution]
- In the synchronous model messages arrive on time, processes share a synchronized clock, and failures are easily detected
- None of these properties holds in an asynchronous model
9Reality: Neither One
- Real distributed systems aren't synchronous
- Although some can come close
- Nor are they asynchronous
- Software often treats them as asynchronous
- In reality, clocks work well, so in practice we often use time cautiously and can even put limits on message delays
- For our purposes we usually start with an asynchronous model
- Subsequently enrich it with sources of time when useful
10Steps Towards Our Goal
- Robust Web Services: we'll build them with these tools
- Tools for solving practical replication and availability problems: we'll base them on ordered multicast
- Ordered multicast: we'll base it on fault-tolerant multicast
- Fault-tolerant multicast: we'll use membership
- Tracking group membership: we'll base it on 2PC and 3PC
- 2PC and 3PC: our first tools (lowest layer)
11Membership
- Which processes are available in a distributed system?
- Dynamic membership
- Use group membership protocol to track members
- Performant, complicated
- Static membership
- Use static list of potential group members
- Resolve liveness on a per-operation basis
- May be slow, simpler
- (Approaches may be combined)
12Dynamic Membership
- Provides a Group Membership Service (GMS)
- Processes as members
- Processes may join or leave the group and monitor other processes in the group
- (More next time)
- 80,000 updates per second, 5 members
- Static membership: tens of updates per second, 5 members
13Static Membership
- Example
- Static set of potential members
- E.g., p, q, r, s, t
- Support replicated data on members
- E.g., x: integer value
- E.g., x: $(t = 0, v = 0) \to (t = 21, v = 17) \to (t = 25, v = 97), \dots$
- Each process records version of x and value of x
- p reading a value?
- Cannot just look at its own version: it may have been changed at others
14Quorum Update and Read
- Simple fix
- Make sure that operations reach a majority of processes in the system
- Update and read only if supported by a majority of processes
- A read of x will be sure to see the latest updated value: just take the one with the largest version
- General fix
- Two basic rules
- A quorum read should intersect any prior quorum write in at least one process
- Likewise, a quorum write should intersect any prior quorum write
- In a group of size N
- $Q_r + Q_w > N$
- $Q_w + Q_w > N$
- The example again, $N = 5$
- $Q_r = 3$, $Q_w = 3$
- Other possibilities?
- Note that we want $Q_w < N$ for fault tolerance, thus $Q_r > 1$!
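The two rules can be checked mechanically, which also answers "Other possibilities?" for a given N. The following is a minimal sketch; the function name `valid_quorums` and the `require_fault_tolerance` flag are illustrative and not part of the lecture.

```python
def valid_quorums(n, require_fault_tolerance=True):
    """Return all (qr, qw) pairs with qr + qw > n and qw + qw > n.

    With require_fault_tolerance=True, also demand qw < n, so that
    updates can still proceed when at least one process is down.
    """
    pairs = []
    for qw in range(1, n + 1):
        if require_fault_tolerance and qw >= n:
            continue
        for qr in range(1, n + 1):
            if qr + qw > n and 2 * qw > n:
                pairs.append((qr, qw))
    return pairs
```

Calling `valid_quorums(5)` includes the slide's choice (3, 3); every pair it returns has a read quorum larger than 1, as the last bullet notes.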
15Update Protocol
- 1) p issues RPC-style read requests to one replica after another
- p collects at least $Q_r$ replies
- p notes version (and value)
- 2) p computes new version of data
- Larger than the maximum current version received
- 3) p issues RPC to $Q_w$ members asking them to prepare
- Processes reply to p
- 4) p checks the number of acknowledgements
- $\ge Q_w$: commit
- $< Q_w$: abort
- (Actually a two-phase commit protocol (2PC) is used in 3) and 4), more later)
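The four steps can be sketched with in-memory replicas. This is an illustration under simplifying assumptions, not the lecture's implementation: the `Replica` class, the `up` flag standing in for a failed process, and the choice to send the prepare request to every reachable replica (rather than exactly $Q_w$ of them) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Replica:
    """One copy of x: a (version, value) pair plus an uncommitted update."""
    version: int = 0
    value: int = 0
    up: bool = True  # False simulates a crashed or unreachable process
    pending: Optional[Tuple[int, int]] = None

    def read(self):
        return self.version, self.value

    def prepare(self, version, value):
        # Step 3: remember the update but do not apply it yet.
        self.pending = (version, value)
        return True

    def commit(self):
        if self.pending is not None:
            self.version, self.value = self.pending
            self.pending = None

    def abort(self):
        self.pending = None


def quorum_read(replicas, qr):
    """Step 1: ask replicas one after another until qr have replied."""
    replies = []
    for r in replicas:
        if r.up:
            replies.append(r.read())
        if len(replies) >= qr:
            break
    if len(replies) < qr:
        raise RuntimeError("read quorum not reached")
    # The reply with the largest version holds the latest committed value.
    return max(replies, key=lambda reply: reply[0])


def quorum_update(replicas, qr, qw, new_value):
    """Steps 1-4; returns True on commit and False on abort."""
    version, _ = quorum_read(replicas, qr)
    new_version = version + 1  # step 2: larger than any version seen
    acks = [r for r in replicas if r.up and r.prepare(new_version, new_value)]
    if len(acks) >= qw:  # step 4
        for r in acks:
            r.commit()
        return True
    for r in acks:
        r.abort()
    return False
```

With five replicas and $Q_r = Q_w = 3$, an update that reaches only three replicas is still seen by a later read that contacts three different-but-overlapping replicas, because the two quorums share at least one process.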
16Time
- We were somewhat careful to avoid time in static membership
- In a distributed system we need practical ways to deal with time
- E.g., we may need to agree that update A occurred before update B
- Or offer a lease on a resource that expires at time 101001.50
- Or guarantee that a time critical event will reach all interested parties within 100 ms
17But what does Time Mean?
- Time on a machine's local clock
- But was it set accurately?
- And could it drift, e.g. run fast or slow?
- What about faults, like stuck bits?
- Time on a global clock?
- E.g. with GPS receiver
- Still not accurate enough to determine which events happen before other events
- Or could try to agree on time
18Lamport's Approach
- Leslie Lamport suggested that we should reduce time to its basics
- Cannot order events according to a global clock
- None available
- Can use logical clock
- Time basically becomes a way of labeling events so that we may ask if event A happened before event B
- Answer should be consistent with what could have happened with respect to a global clock
- Often this is what matters
19Drawing time-line pictures
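[Figure: time-lines for processes p and q; p sends message m at $snd_p(m)$, and q receives and delivers it at $rcv_q(m)$ and $deliv_q(m)$; event D is marked at q]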
20Drawing time-line pictures
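- A, B, C and D are events.
- Could be anything meaningful to the application
- microcode, program code, file write, message handling, ...
- So are snd(m) and rcv(m) and deliv(m)
- What ordering claims are meaningful?
[Figure: time-lines for processes p and q, with events A and B at p and C and D at q; p sends message m at $snd_p(m)$, and q receives and delivers it at $rcv_q(m)$ and $deliv_q(m)$]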
21Drawing time-line pictures
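- A happens before B, and C before D
- Local ordering at a single process
- Write $A \to_p B$ and $C \to_q D$
[Figure: time-lines for processes p and q, with events A and B at p and C and D at q; p sends message m at $snd_p(m)$, and q receives and delivers it at $rcv_q(m)$ and $deliv_q(m)$]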
22Drawing time-line pictures
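- $snd_p(m)$ also happens before $rcv_q(m)$
- Distributed ordering introduced by a message
- Write $snd_p(m) \to_m rcv_q(m)$
[Figure: time-lines for processes p and q, with events A and B at p and C and D at q; p sends message m at $snd_p(m)$, and q receives and delivers it at $rcv_q(m)$ and $deliv_q(m)$]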
23Drawing time-line pictures
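- A happens before D
- Transitivity: A happens before $snd_p(m)$, which happens before $rcv_q(m)$, which happens before D
[Figure: time-lines for processes p and q, with events A and B at p and C and D at q; p sends message m at $snd_p(m)$, and q receives and delivers it at $rcv_q(m)$ and $deliv_q(m)$]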
24Drawing time-line pictures
- B and D are concurrent
- Looks like B happens first, but D has no way to
know. No information flowed
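[Figure: time-lines for processes p and q, with events A and B at p and C and D at q; p sends message m at $snd_p(m)$, and q receives and delivers it at $rcv_q(m)$ and $deliv_q(m)$]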
25The Happens-Before Relation
- We'll say that A happens-before B, written $A \to B$, if
- 1) $A \to_p B$ according to the local ordering, or
- 2) A is a snd and B is a rcv and $A \to_m B$, or
- 3) A and B are related under the transitive closure of rules 1) and 2)
- Thus, $A \to D$
- So far, this is just a mathematical notation, not a systems tool
- A new event seen by a process happens logically after other events seen by that process
- A message receive happens logically after the message has been sent
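The three rules translate directly into a small computation. The sketch below builds the relation for the time-line example; the event names `"snd"`, `"rcv"` and `"deliv"`, and the helper names `happens_before` and `concurrent`, are chosen here for illustration only.

```python
def happens_before(local_orders, messages):
    """Return the happens-before relation as a set of (a, b) pairs.

    local_orders: one list of event names per process, in local order.
    messages: (snd_event, rcv_event) pairs.
    """
    edges = set(messages)  # rule 2: a send precedes its receive
    for events in local_orders:  # rule 1: local ordering at one process
        edges.update(zip(events, events[1:]))
    # Rule 3: transitive closure, joining pairs until nothing changes.
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def concurrent(hb, a, b):
    """Two events are concurrent if neither happens before the other."""
    return (a, b) not in hb and (b, a) not in hb


# The time-line example: A, snd, B at p and C, rcv, deliv, D at q.
hb = happens_before(
    [["A", "snd", "B"], ["C", "rcv", "deliv", "D"]],
    [("snd", "rcv")],
)
```

Here `("A", "D") in hb` holds by transitivity through the message, while `concurrent(hb, "B", "D")` holds because no chain of local steps and messages links B and D.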
26Simultaneous Actions
- There are many situations in which we want to talk about some form of simultaneous event
- Think about updating replicated data
- Perhaps we have multiple conflicting updates
- The need is to ensure that they will happen in the same order at all copies
- This looks like a kind of simultaneous action
- Want to know the states of a distributed system that might have occurred at an instant of real-time
27Temporal distortions
- Things can be complicated because we can't predict
- Message delays (they vary constantly)
- Execution speeds (often a process shares a machine with many other tasks)
- Timing of external events
- Lamport looked at this question too
28Temporal distortions
[Figure: time-lines for processes p0, p1, p2 and p3 with events a, b, c, d, e and f]
29Temporal distortions
[Figure: time-lines for processes p0, p1, p2 and p3 with events a, b, c, d, e and f]
30Temporal distortions
- Timelines can stretch
- caused by scheduling effects, message delays,
message loss
[Figure: time-lines for processes p0, p1, p2 and p3 with events a, b, c, d, e and f; a time-line is stretched]
31Temporal distortions
- Timelines can shrink
- E.g. something lets a machine speed up
[Figure: time-lines for processes p0, p1, p2 and p3 with events a, b, c, d, e and f; a time-line is shrunk]
32Temporal distortions
- Cuts represent instants of time
- Viz., subsets of events, one per process
- E.g., a, c, a, rcv(d), f, rcv(e)
- But not every cut makes sense
- Black cuts could occur but not gray ones.
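[Figure: time-lines for processes p0, p1, p2 and p3 with events a, b, c, d, e and f; black and gray cuts are drawn across the time-lines]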
33Temporal distortions
- Red messages cross gray cuts backwards
- Need to avoid capturing states in which a message is received but nobody is shown as having sent it
- Consistent cuts
- If rcv(m) is in the cut, snd(m) (or earlier) is in the cut
- snd(m) may be in the cut without rcv(m) being in the cut
- m is then in the message channel
[Figure: time-lines for processes p0, p1, p2 and p3 with events a, b, c, d, e and f; red messages cross the gray cuts backwards]
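The consistency rule for cuts can be expressed as a short check. This is a sketch under an assumed representation: a cut is a mapping from process name to the number of that process's events inside the cut, and each message is a pair of (process, 0-based event index) positions. The names `is_consistent` and `in_channel` are illustrative.

```python
def is_consistent(cut, messages):
    """Return True if no message is received inside the cut but sent outside it.

    cut: dict mapping process -> number of its events inside the cut.
    messages: list of ((sender, snd_index), (receiver, rcv_index)).
    """
    for (sender, snd_index), (receiver, rcv_index) in messages:
        received = rcv_index < cut.get(receiver, 0)
        sent = snd_index < cut.get(sender, 0)
        if received and not sent:
            return False  # the message crosses the cut backwards
    return True


def in_channel(cut, messages):
    """Messages sent inside the cut whose receive lies outside it."""
    result = []
    for (sender, snd_index), (receiver, rcv_index) in messages:
        sent = snd_index < cut.get(sender, 0)
        received = rcv_index < cut.get(receiver, 0)
        if sent and not received:
            result.append(((sender, snd_index), (receiver, rcv_index)))
    return result
```

For a single message sent as p's second event and received as q's second event, the cut `{"p": 1, "q": 2}` is inconsistent, while `{"p": 2, "q": 1}` is consistent and leaves the message in the channel.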
34Who Cares?
- Suppose
- p has lock
- m: release lock
- p sends m to q
- $snd_p(m) \to rcv_q(m)$
- Inconsistent cut
- Contains $rcv_q(m)$ but not $snd_p(m)$
- Sees that both p and q have lock
35Logical clocks
- A simple tool that can capture parts of the happens-before relation
- First version uses just a single integer
- Designed for big (64-bit or more) counters
- Each process p maintains $LT_p$, a local counter
- A message m will carry $LT_m$
36Rules for managing logical clocks
- When an event happens at a process p it increments $LT_p$
- Any event that matters to p
- Normally, also snd and rcv events (since we want receive to occur after the matching send)
- When p sends m, set
- $LT_m = LT_p$
- When q receives m, set
- $LT_q = \max(LT_q, LT_m) + 1$
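These rules fit in a few lines. The class below is a minimal sketch; the name `LamportClock` and its method names are illustrative, and it assumes that a send counts as an event, as the rules above suggest.

```python
class LamportClock:
    """Logical clock of one process: a single integer counter."""

    def __init__(self):
        self.time = 0

    def event(self):
        """An event that matters to this process: increment the counter."""
        self.time += 1
        return self.time

    def send(self):
        """A send is an event; the message carries the resulting timestamp."""
        return self.event()

    def receive(self, lt_m):
        """On receipt, jump past both the local time and the message time."""
        self.time = max(self.time, lt_m) + 1
        return self.time


p, q = LamportClock(), LamportClock()
lt_a = p.event()         # event A at p
lt_m = p.send()          # snd_p(m)
lt_c = q.event()         # event C at q
lt_rcv = q.receive(lt_m) # rcv_q(m)
```

Because the receive takes the maximum before incrementing, the receive of a message always gets a larger timestamp than its send, whatever the local counter was.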
37Time-line with LT annotations
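- $LT(A) = 1$, $LT(snd_p(m)) = 2$, $LT(m) = 2$
- $LT(rcv_q(m)) = \max(1, 2) + 1 = 3$, etc.
[Figure: time-lines for processes p and q, with events A and B at p and C and D at q; p sends message m at $snd_p(m)$, and q receives and delivers it at $rcv_q(m)$ and $deliv_q(m)$]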
38Logical clocks
- If A happens before B, $A \to B$, then $LT(A) < LT(B)$
- $A \to B$: $A = E_0 \to \dots \to E_n = B$, where each pair is ordered either by $\to_p$ or $\to_m$
- LT values associated with these only increase
- But the converse might not be true
- If $LT(A) < LT(B)$ we can't be sure that $A \to B$
- This is because processes that don't communicate still assign timestamps, and hence events will seem to have an order
39Can we do better?
- One option is to use vector clocks
- Here we treat timestamps as a list
- One counter for each process
- Rules for managing vector times differ from what we did with logical clocks
40Vector clocks
- Clock is a vector, e.g. $VT(A) = [1, 0]$
- We'll just assign p index 0 and q index 1
- Vector clocks require either agreement on the numbering/static membership, or that the actual process ids be included with the vector
- Rules for managing vector clocks
- When an event happens at p, increment $VT_p[index_p]$
- Normally, also increment for snd and rcv events
- When sending a message, set $VT(m) = VT_p$
- When receiving, set $VT_q = \max(VT_q, VT(m))$
- Where max is the component-wise max of the vectors
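A minimal sketch of these rules follows, assuming static membership with agreed indices (p is 0, q is 1). The class name `VectorClock` is illustrative, and the example assumes that snd, rcv and deliv each count as an event at the process where they occur.

```python
class VectorClock:
    """Vector clock of one process in a group of known, fixed size."""

    def __init__(self, index, size):
        self.index = index      # this process's own position in the vector
        self.vt = [0] * size

    def event(self):
        """Any local event: increment this process's own entry."""
        self.vt[self.index] += 1
        return list(self.vt)

    def send(self):
        """A send is counted as an event; the message carries a copy of VT."""
        return self.event()

    def receive(self, vt_m):
        """Take the component-wise max, then count the receive as an event."""
        self.vt = [max(a, b) for a, b in zip(self.vt, vt_m)]
        self.vt[self.index] += 1
        return list(self.vt)


p, q = VectorClock(0, 2), VectorClock(1, 2)
vt_a = p.event()          # A
vt_m = p.send()           # snd_p(m)
vt_b = p.event()          # B
vt_c = q.event()          # C
vt_rcv = q.receive(vt_m)  # rcv_q(m)
vt_deliv = q.event()      # deliv_q(m)
vt_d = q.event()          # D
```

Under these assumptions the run gives `vt_a == [1, 0]`, `vt_m == [2, 0]`, `vt_b == [3, 0]` and `vt_d == [2, 4]`. Returning copies of the vector keeps a timestamp that was already handed out from changing when the clock advances.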
41Time-line with VT annotations
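[Figure: time-lines for processes p and q, with events A and B at p and C and D at q; p sends message m at $snd_p(m)$, carrying $VT(m) = [2, 0]$, and q receives and delivers it at $rcv_q(m)$ and $deliv_q(m)$]
- $VT(m)$ could also be $[1, 0]$ if we decide not to increment the clock on a snd event. The decision depends on how the timestamps will be used.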
42Rules for comparison of VTs
- We'll say that $VT_A \le VT_B$ if
- $\forall i$, $VT_A[i] \le VT_B[i]$
- And we'll say that $VT_A < VT_B$ if
- $VT_A \le VT_B$ but $VT_A \ne VT_B$
- That is, for some i, $VT_A[i] < VT_B[i]$
- Examples?
- $[2, 4] \le [2, 4]$
- $[1, 3] < [7, 3]$
- $[1, 3]$ is incomparable to $[3, 1]$
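The comparison rules can be written as three small functions. This is a sketch that assumes both vectors have the same length and use the same process numbering; the function names are illustrative.

```python
def vt_leq(a, b):
    """VT_A <= VT_B: every component of a is at most that of b."""
    return all(x <= y for x, y in zip(a, b))


def vt_lt(a, b):
    """VT_A < VT_B: a <= b, and they differ in at least one component."""
    return vt_leq(a, b) and a != b


def vt_incomparable(a, b):
    """Neither vector is <= the other."""
    return not vt_leq(a, b) and not vt_leq(b, a)
```

Applied to the three examples: `vt_leq([2, 4], [2, 4])`, `vt_lt([1, 3], [7, 3])` and `vt_incomparable([1, 3], [3, 1])` all hold.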
43Time-line with VT annotations
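- $VT(A) = [1, 0]$. $VT(D) = [2, 4]$. So $VT(A) < VT(D)$
- $VT(B) = [3, 0]$. So $VT(B)$ and $VT(D)$ are incomparable
[Figure: time-lines for processes p and q, with events A and B at p and C and D at q; p sends message m at $snd_p(m)$, carrying $VT(m) = [2, 0]$, and q receives and delivers it at $rcv_q(m)$ and $deliv_q(m)$]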
44Vector time and happens before
- If $A \to B$, then $VT(A) < VT(B)$
- Write a chain of events from A to B
- Step by step the vector clocks get larger
- But also, if $VT(A) < VT(B)$ then $A \to B$
- Two cases
- If A and B both happen at the same process p: all events seen by p increment its vector clock
- If A happens at p and B at q, we can trace back the path by which q learned $VT(A)[p]$, since q only updates its entry for p based on message receipt from, say, $q'$
- If $q' \ne p$, trace further back
- (Otherwise A and B happened concurrently)
45Introducing wall clock time
- There are several options
- Extend a logical clock or vector clock with the clock time and use it to break ties
- Makes meaningful statements like "B and D were concurrent, although B occurred first"
- But unless clocks are closely synchronized such statements could be erroneous!
- We use a clock synchronization algorithm to reconcile differences between clocks on various computers in the network
46Synchronizing clocks
- Without help, clocks will often differ by many milliseconds
- Problem is that when a machine downloads time from a network clock it can't be sure what the delay was
- This is because the uplink and downlink delays are often very different in a network
- Outright failures of clocks are rare
47Synchronizing clocks
- Suppose p synchronizes with time.windows.com and notes that 123 ms elapsed while the protocol was running: what time is it now?
[Figure: p asks time.windows.com "What time is it?" and gets the reply 0923.02921; delay: 123 ms]
48Synchronizing clocks
- Options?
- p could guess that the delay was evenly split, but this is rarely the case in WAN settings (downlink speeds are higher)
- p could ignore the delay
- p could factor in only the part of the delay that is certain, e.g. if we know that the link takes at least 5 ms in each direction. Works best with GPS time sources!
- In general we can't do better than the uncertainty in the link delay from the time source down to p
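The options can be made concrete with a small calculation. The sketch below takes the midpoint of the possible interval and reports the remaining uncertainty, which is essentially the estimate used in Cristian's algorithm (a named technique brought in here, not one the slides cite). It assumes the server's timestamp is taken as the reply leaves and that server processing time is negligible; the function name and parameters are illustrative.

```python
def estimate_server_time(server_time, round_trip, min_one_way=0.0):
    """Return (estimate, max_error) for the server's clock right now.

    server_time: the timestamp in the reply, in seconds.
    round_trip: time that elapsed at p between request and reply.
    min_one_way: a known lower bound on the delay in each direction.
    """
    if round_trip < 2 * min_one_way:
        raise ValueError("round trip is shorter than twice the minimum delay")
    # The reply spent at least min_one_way on the way down, and at most
    # round_trip - min_one_way (if the request took only the minimum).
    earliest = server_time + min_one_way
    latest = server_time + round_trip - min_one_way
    return (earliest + latest) / 2, (latest - earliest) / 2


# The slide's numbers: 123 ms round trip, at least 5 ms in each direction.
estimate, max_error = estimate_server_time(100.0, 0.123, 0.005)
```

Knowing a minimum delay does not move the midpoint; it only narrows the error bound, here from 61.5 ms to 56.5 ms. The bound shrinks to zero only when the round trip equals twice the minimum delay.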
49Consequences?
- In a network of processes, we must assume that clocks are
- Not perfectly synchronized. Even GPS has uncertainty, although small
- We say that clocks are inaccurate (with respect to real time)
- And clocks can drift during periods between synchronizations
- Relative drift between clocks is their precision (with respect to each other)
50Thought question
- We are building an anti-missile system
- Radar tells the interceptor where it should be and what time to get there
- Do we want the radar and interceptor to be as accurate as possible, or as precise as possible?
51Thought question
- We want them to agree on the time, but it isn't important whether they are accurate with respect to true time
- Precision matters more than accuracy
- Although for this, a GPS time source would be the way to go
- Might achieve higher precision than we can with an internal synchronization protocol!
52Transactions in distributed systems
- A client and database might not run on the same computer
- They may not both fail at the same time
- Also, either could time out waiting for the other in normal situations
- When this happens, we normally abort the transaction
- Exception is a timeout that occurs while commit is being processed
- If server fails, one effect of crash is to break locks even for read-only access
53Transactions in distributed systems
- What if data is on multiple servers?
- In a networked system, transactions run against a single database system
- Indeed, many systems are structured to use just a single operation: a "one shot" transaction!
- In true distributed systems we may want one application to talk to multiple databases
- Main issue that arises is that now we can have multiple database servers that are touched by one transaction
- Reasons?
- Data spread around: each server owns a subset
- Could have replicated some data object on multiple servers, e.g. to load-balance read access for a large client set
- Might do this for high availability
- Solve using the 2-phase commit (2PC) protocol!
54Two-phase commit in transactions
- Phase 1
- Transaction wishes to commit. Data managers force updates and lock records to the disk (e.g. to the log) and then say "prepared to commit"
- Phase 2
- Transaction manager makes sure all are prepared, then says commit (or abort, if some are not)
- Data managers then make updates permanent or roll back to old values, and release locks
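The two phases can be sketched as follows for the failure-free case. Everything here is illustrative: the `DataManager` class, its in-memory `log` list standing in for the on-disk log, and the `will_prepare` flag that simulates a manager unable to prepare. Crashes and timeouts are not modeled.

```python
class DataManager:
    """A participant holding part of the data touched by the transaction."""

    def __init__(self, name, will_prepare=True):
        self.name = name
        self.will_prepare = will_prepare  # False simulates a "no" vote
        self.data = {}
        self.log = []        # stands in for the log forced to disk
        self.locked = False
        self.state = "initial"

    def prepare(self, updates):
        """Phase 1: force updates to the log, hold locks, vote."""
        if not self.will_prepare:
            self.state = "aborted"
            return False
        self.log.append(dict(updates))
        self.locked = True
        self.state = "prepared"
        return True

    def commit(self):
        """Phase 2: make the logged updates permanent, release locks."""
        self.data.update(self.log[-1])
        self.locked = False
        self.state = "committed"

    def abort(self):
        """Phase 2: keep the old values, release locks."""
        self.locked = False
        self.state = "aborted"


def two_phase_commit(managers, updates):
    """Run both phases; return True if the transaction committed."""
    votes = [m.prepare(updates) for m in managers]  # phase 1
    decision = all(votes)
    for m in managers:                              # phase 2
        if decision:
            m.commit()
        else:
            m.abort()
    return decision
```

A single manager that cannot prepare is enough to abort the transaction at every manager, which is the all-or-nothing behavior the protocol is meant to give.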
55As a time-line picture
[Figure: time-lines for the 2PC initiator and members p, q, r, s and t; the initiator sends "Vote?", all vote commit, and the initiator sends "Commit!"]
56As a time-line picture
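[Figure: time-lines for the 2PC initiator and members p, q, r, s and t; in phase 1 the initiator sends "Vote?" and all vote commit; in phase 2 the initiator sends "Commit!"]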
57Missing Stuff
- Eventually will need to do some form of garbage collection
- Issue is that participants need memory of the protocol, at least for a while
- But can delay garbage collection and run it later on behalf of many protocol instances
- Part of any real implementation but not thought of as part of the protocol
58Fault tolerance
- We can separate this into three cases
- Group member fails; initiator remains healthy
- Initiator fails; group members remain healthy
- Both initiator and group member fail
- Further separation
- Handling recovery of a failed member
- Recovery after total failure of the whole group
59Fault tolerance
- Some cases are pretty easy
- E.g. if a member fails before voting we just treat it as an abort
- If a member fails after voting commit, we assume that when it recovers it will finish up the commit and perform whatever action we requested
- Hard cases involve crash of initiator
60Initiator fails, members healthy
- When did it fail?
- Could fail before starting the 2PC protocol
- In this case if the members were expecting the protocol to run, e.g., to terminate a pending transaction on a database, they do unilateral abort
- Could fail after some are prepared to commit
- Those members need to learn the outcome before they can finish the protocol
- Could fail after some have learned the outcome
- Others may still be in a prepared state
61How to handle initiator failures?
- Wait for initiator to come up again
- May hold resources on members
- Rather
- Initiator should record the decision in a logging server for use after crashes
- If decision is logged, a process may learn outcome by examining log if initiator fails (timeout needed here)
- Also, members can help one another terminate the protocol
- This is needed if a failure happens before the initiator has a chance to log its decision
- A process member may repeat phase 1
62Problems?
- 2PC has a bad state
- Suppose that the initiator and a member, p, both fail and we are not using a log
- May not always want to use log because of extra overhead and reliability concerns
- Other members cannot determine whether the transaction should commit or abort
- p may have transferred 10M to a bank account, want to be consistent with that
- There is a case in which we can't terminate the protocol!
63As a time-line picture
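[Figure: time-lines for the 2PC initiator and members p, q, r, s and t; in phase 1 the initiator sends "Vote?" and all vote commit; in phase 2 the initiator sends "Commit!"]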
64Can we do Better?
- 3 phase commit (3PC)
- Assumes detectable failures
- We happen to know that real systems can't detect failures, unless they can unplug the power for a faulty node
- Idea is to add an extra "prepared to commit" stage
653PC
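[Figure: time-lines for the 3PC initiator and members p, q, r, s and t; in phase 1 the initiator sends "Vote?" and all vote commit; in phase 2 it sends "Prepare to commit" and all say ok; in phase 3 it sends "Commit!" and they commit]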
66Why 3PC?
- A new leader in the group can deduce the outcomes when this protocol is used
- Main insight?
- In 2PC the decision to commit can be known by only the initiator and one other process
- In 3PC nobody can enter the commit state unless all are first in the prepared state
- Makes it possible to determine the state, then push the protocol forward (or back)
- But does require accurate failure detection
- Only commit if all operational members are in the "prepared to commit" state, or abort if all operational members are in the "ok to commit" state
- Failed processes may learn the outcome when they become operational
67Value of 3PC?
- Even with inaccurate failure detection, it greatly reduces the window of vulnerability
- The bad case for 2PC is not so uncommon
- Especially if a group member is the initiator
- In that case one badly timed failure freezes the whole group
- With 3PC in real systems, the troublesome case becomes very unlikely
- But the problems remain
- E.g., in a network partition where half may be "prepared to commit" and half may be "ok to commit"
68State diagram for non-faulty member
[Figure: state diagram with the following annotations]
- Protocol starts in the initial state. Initiator sends the "OK to commit?" inquiry
- Coordinator failure sends us into an inquiry mode in which someone (anyone) tries to figure out the situation
- We collect responses. If any is an abort, we enter the abort stage
- Here, we finish off the prepare state if a crash interrupted it, by resending the prepare message (needed in case only some processes saw the coordinator's message before it crashed)
- Otherwise send prepare-to-commit messages out
- This state corresponds to the coordinator sending out the commit messages. We enter the state when all members receive them
- We get here if some processes were still in the initial "OK to commit?" stage
- In this case it is safe to abort, and we do so
69Summary
- We looked at goals and prerequisites for consistent replication
- (Static and) Dynamic Membership
- Logical Time
- Distributed Commit