Distributed Systems 2006
1
Distributed Systems 2006
  • Overcoming Failures in a Distributed System
  • With material adapted from Ken Birman

2
Leslie Lamport
  • A distributed system is one in which the failure
    of a machine you have never heard of can cause
    your own machine to become unusable

3
Plan
  • Goals
  • Static and Dynamic Membership
  • Logical Time
  • Distributed Commit

4
Thought question
  • Suppose that a distributed system was built by
    interconnecting a set of extremely reliable
    components running on fault-tolerant hardware
  • Would such a system be expected to be reliable?
  • Perhaps not. The pattern of interaction, the
    need to match rates of data production and
    consumption, and other distributed factors all
    can prevent a system from operating correctly!

5
Example (1)
  • The Web's components are individually reliable
  • But the Web can fail by returning inconsistent or
    stale data, can freeze up or claim that a server
    is not responding (even if both browser and
    server are operational), and it can be so slow
    that we consider it faulty even if it is working
  • For stateful systems (the Web is stateless) this
    issue extends to joint behavior of sets of
    programs

6
Example (2)
  • Ariane 5
  • June 4, 1996, 40 seconds after takeoff
  • Self destruction after abrupt course correction
  • caused by the complete loss of guidance and
    attitude information due to specification and
    design errors in the software of the inertial
    reference system
  • Loss of 500 million dollars, but no loss of life
  • Where are the distribution aspects?

7
Our Goal Here
  • We want to replicate data and computation
  • For availability
  • For performance
  • while guaranteeing consistent behavior
  • Work towards virtual synchronous communication
  • System appears to have no replicated data
  • System appears to only have multi-thread
    concurrency

8
Synchronous and Asynchronous Executions
[Figure: two time-line diagrams for processes p, q, and r: a synchronous execution and an asynchronous one]
In the synchronous model messages arrive on time, processes share a synchronized clock, and failures are easily detected. None of these properties holds in an asynchronous model.
9
Reality Neither One
  • Real distributed systems aren't synchronous
  • Although some can come close
  • Nor are they asynchronous
  • Software often treats them as asynchronous
  • In reality, clocks work well so in practice we
    often use time cautiously and can even put limits
    on message delays
  • For our purposes we usually start with an
    asynchronous model
  • Subsequently enrich it with sources of time when
    useful

10
Steps Towards Our Goal
Robust Web Services: we'll build them with these tools
Tools for solving practical replication and availability problems: we'll base them on ordered multicast
Ordered multicast: we'll base it on fault-tolerant multicast
Fault-tolerant multicast: we'll use membership
Tracking group membership: we'll base it on 2PC and 3PC
2PC and 3PC: our first tools (lowest layer)
11
Membership
  • Which processes are available in a distributed
    system?
  • Dynamic membership
  • Use group membership protocol to track members
  • Performant, complicated
  • Static membership
  • Use static list of potential group members
  • Resolve liveness on a per-operation basis
  • May be slow, simpler
  • (Approaches may be combined)

12
Dynamic Membership
  • Provides a Group Membership Service (GMS)
  • Processes as members
  • Processes may join or leave the group and monitor
    other processes in the group
  • (More next time)
  • Dynamic membership: 80,000 updates per second with 5 members
  • Static membership: tens of updates per second with 5 members

13
Static Membership
  • Example
  • Static set of potential members
  • E.g., p, q, r, s, t
  • Support replicated data on members
  • E.g., x, an integer value
  • E.g., x: (t=0, v=0) → (t=21, v=17) → (t=25, v=97), ...
  • Each process records version of x and value of x
  • p reading a value?
  • It cannot just look at its own copy: the value may have been changed at other processes

14
Quorum Update and Read
  • Simple fix
  • Make sure that operations reach a majority of
    processes in the system
  • Update and read only if supported by a majority
    of processes
  • A read of x is then sure to see the latest updated value: just take the reply with the largest version
  • General fix
  • Two basic rules
  • A quorum read should intersect any prior quorum write in at least one process
  • Likewise, quorum writes should intersect prior quorum writes
  • In a group of size N
  • Qr + Qw > N
  • Qw + Qw > N
  • The example again, with N = 5
  • Qr = 3, Qw = 3
  • Other possibilities?
  • Note that we want Qw < N for fault tolerance, thus Qr > 1!
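These two inequalities are easy to check mechanically. A minimal sketch in Python (the function name is ours, not from the slides):

```python
def quorums_valid(n, qr, qw):
    """Check the two quorum intersection rules for a group of size n:
    reads intersect prior writes (Qr + Qw > N), and writes intersect
    prior writes (Qw + Qw > N)."""
    return qr + qw > n and qw + qw > n

# The slide's example: N = 5 with Qr = 3, Qw = 3 satisfies both rules.
assert quorums_valid(5, 3, 3)
# Qr = 2, Qw = 3 fails: a read quorum could miss the latest write.
assert not quorums_valid(5, 2, 3)
```

Any (Qr, Qw) pair passing this check works; a smaller Qw makes writes cheaper at the cost of larger read quorums.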

15
Update Protocol
  • 1) p issues RPC-style read request to one replica
    after another
  • p collects at least Qr replies
  • p notes version (and value)
  • 2) p computes new version of data
  • Larger than maximum current version received
  • 3) p issues RPC to Qw members asking to prepare
  • Processes reply to p
  • 4) p checks number of acknowledgements
  • ≥ Qw → commit
  • < Qw → abort
  • (Actually a two-phase commit protocol (2PC) is
    used in 3) and 4), more later)
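The four steps can be sketched with in-memory replicas standing in for the RPCs (the dict structure and function names are illustrative, not part of the protocol):

```python
def quorum_read(replicas, qr):
    """Step 1: collect at least Qr replies; return the (version, value)
    pair with the largest version."""
    replies = [(r['version'], r['value']) for r in replicas[:qr]]
    return max(replies)

def quorum_update(replicas, qr, qw, new_value):
    """Steps 2-4: pick a version larger than any seen, ask Qw members to
    prepare (modeled here as always succeeding), commit iff Qw acknowledge."""
    version, _ = quorum_read(replicas, qr)
    acks = 0
    for r in replicas[:qw]:                 # in reality, any Qw members
        r['version'], r['value'] = version + 1, new_value
        acks += 1
    return acks >= qw                       # commit, else abort

replicas = [{'version': 0, 'value': 0} for _ in range(5)]
assert quorum_update(replicas, 3, 3, 17)    # commits
assert quorum_read(replicas, 3) == (1, 17)  # the new value is visible
```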

16
Time
  • We were somewhat careful to avoid time in static
    membership
  • In a distributed system we need practical ways to deal with time
  • E.g., we may need to agree that update A occurred
    before update B
  • Or offer a lease on a resource that expires at time 10:10:01.50
  • Or guarantee that a time critical event will
    reach all interested parties within 100ms

17
But what does Time Mean?
  • Time on a machine's local clock
  • But was it set accurately?
  • And could it drift, e.g. run fast or slow?
  • What about faults, like stuck bits?
  • Time on a global clock?
  • E.g. with GPS receiver
  • Still not accurate enough to determine which events happen before which other events
  • Or could try to agree on time

18
Lamport's Approach
  • Leslie Lamport suggested that we should reduce
    time to its basics
  • Cannot order events according to a global clock
  • None available
  • Can use logical clock
  • Time basically becomes a way of labeling events
    so that we may ask if event A happened before
    event B
  • Answer should be consistent with what could have
    happened with respect to a global clock
  • Often this is what matters

19
Drawing time-line pictures
[Figure: time-line for processes p and q; p sends message m (sndp(m)), which arrives at q after delay D (rcvq(m)) and is then delivered (delivq(m))]
20
Drawing time-line pictures
  • A, B, C and D are events
  • Could be anything meaningful to the application
  • microcode, program code, file write, message handling, ...
  • So are snd(m), rcv(m) and deliv(m)
  • What ordering claims are meaningful?

[Figure: the same time-line with events A and B at p, and C and D at q]
21
Drawing time-line pictures
  • A happens before B, and C before D
  • Local ordering at a single process
  • Write A →p B and C →p D

[Figure: the same time-line with events A and B at p, and C and D at q]
22
Drawing time-line pictures
  • sndp(m) also happens before rcvq(m)
  • Distributed ordering introduced by a message
  • Write sndp(m) →m rcvq(m)

[Figure: the same time-line with events A and B at p, and C and D at q]
23
Drawing time-line pictures
  • A happens before D
  • Transitivity A happens before sndp(m), which
    happens before rcvq(m), which happens before D

[Figure: the same time-line with events A and B at p, and C and D at q]
24
Drawing time-line pictures
  • B and D are concurrent
  • Looks like B happens first, but D has no way to
    know. No information flowed

[Figure: the same time-line with events A and B at p, and C and D at q]
25
The Happens-Before Relation
  • We'll say that A happens-before B, written A→B, if
  • 1) A →p B according to the local ordering, or
  • 2) A is a snd and B is a rcv and A →m B, or
  • 3) A and B are related under the transitive closure of rules 1) and 2)
  • Thus, A→D
  • So far, this is just a mathematical notation, not
    a systems tool
  • A new event seen by a process happens logically
    after other events seen by that process
  • A message receive happens logically after a
    message has been sent

26
Simultaneous Actions
  • There are many situations in which we want to
    talk about some form of simultaneous event
  • Think about updating replicated data
  • Perhaps we have multiple conflicting updates
  • The need is to ensure that they will happen in
    the same order at all copies
  • This looks like a kind of simultaneous action
  • Want to know the states of a distributed system that might have occurred at an instant of real time

27
Temporal distortions
  • Things can be complicated because we can't predict
  • Message delays (they vary constantly)
  • Execution speeds (often a process shares a
    machine with many other tasks)
  • Timing of external events
  • Lamport looked at this question too

28
Temporal distortions
  • What does "now" mean?


[Figure: time-lines for processes p0, p1, p2, p3 with events a–f and messages between them]
30
Temporal distortions
  • Timelines can stretch
  • caused by scheduling effects, message delays,
    message loss


[Figure: time-lines for processes p0, p1, p2, p3 with events a–f and messages between them]
31
Temporal distortions
  • Timelines can shrink
  • E.g. something lets a machine speed up


[Figure: time-lines for processes p0, p1, p2, p3 with events a–f and messages between them]
32
Temporal distortions
  • Cuts represent instants of time
  • Viz., subsets of events, one per process
  • E.g., {a, c} or {a, rcv(d), f, rcv(e)}
  • But not every cut makes sense
  • Black cuts could occur but not gray ones.


[Figure: the p0–p3 time-lines with several cuts drawn; black cuts could occur, gray ones could not]
33
Temporal distortions
  • Red messages cross gray cuts backwards
  • Need to avoid capturing states in which a message
    is received but nobody is shown as having sent it
  • Consistent cuts
  • If rcv(m) is in the cut, then snd(m) (or an earlier event) is in the cut
  • snd(m) may be in the cut without rcv(m) being in the cut
  • (m is then in the message channel)


[Figure: the p0–p3 time-lines with a message crossing a gray cut backwards]
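The consistency condition on cuts is easy to state as a predicate. A sketch with events as strings and each message mapped to its (snd, rcv) pair (names illustrative):

```python
def is_consistent(cut, messages):
    """A cut is consistent if every rcv(m) in the cut has its snd(m)
    in the cut too; a snd without its rcv just means m is in the channel."""
    return all(snd in cut
               for snd, rcv in messages.values()
               if rcv in cut)

msgs = {'m': ('snd_m', 'rcv_m')}
assert is_consistent({'a', 'snd_m', 'rcv_m'}, msgs)  # both sides present
assert is_consistent({'a', 'snd_m'}, msgs)           # m still in the channel
assert not is_consistent({'a', 'rcv_m'}, msgs)       # received from nobody
```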
34
Who Cares?
  • Suppose
  • p holds a lock
  • m says "release the lock"
  • p sends m to q
  • snd(m) → rcv(m)
  • An inconsistent cut may contain rcv(m) but not snd(m)
  • It sees that both p and q hold the lock

35
Logical clocks
  • A simple tool that can capture parts of the happens-before relation
  • First version uses just a single integer
  • Designed for big (64-bit or more) counters
  • Each process p maintains LTp, a local counter
  • A message m will carry LTm

36
Rules for managing logical clocks
  • When an event happens at a process p it
    increments LTp
  • Any event that matters to p
  • Normally, also snd and rcv events (since we want
    receive to occur after the matching send)
  • When p sends m, set
  • LT(m) = LTp
  • When q receives m, set
  • LTq = max(LTq, LT(m)) + 1
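These rules fit in a few lines. A sketch (the class name is ours, not from the slides):

```python
class LamportClock:
    """Scalar logical clock following the three rules above."""
    def __init__(self):
        self.time = 0

    def tick(self):                  # an event that matters to this process
        self.time += 1
        return self.time

    def send(self):                  # snd counts as an event; LT(m) = LTp
        return self.tick()

    def receive(self, lt_m):         # LTq = max(LTq, LT(m)) + 1
        self.time = max(self.time, lt_m) + 1
        return self.time

p, q = LamportClock(), LamportClock()
p.tick()             # event A at p: LTp = 1
lt_m = p.send()      # sndp(m): LT(m) = 2
q.tick()             # event C at q: LTq = 1
q.receive(lt_m)      # rcvq(m): max(1, 2) + 1 = 3
```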

37
Time-line with LT annotations
  • LT(A) = 1, LT(sndp(m)) = 2, LT(m) = 2
  • LT(rcvq(m)) = max(1, 2) + 1 = 3, etc.

[Figure: the p/q time-line with events A, B at p and C, D at q, annotated with logical timestamps]
38
Logical clocks
  • If A happens before B, A→B, then LT(A) < LT(B)
  • A→B means A = E0 → E1 → ... → En = B, where each consecutive pair is ordered either by →p or →m
  • The LTs along this chain only increase
  • But the converse might not be true
  • If LT(A) < LT(B) we can't be sure that A→B
  • This is because processes that don't communicate still assign timestamps, and hence events will seem to have an order

39
Can we do better?
  • One option is to use vector clocks
  • Here we treat timestamps as a list
  • One counter for each process
  • Rules for managing vector times differ from what we did with logical clocks

40
Vector clocks
  • Clock is a vector, e.g. VT(A) = [1, 0]
  • We'll just assign p index 0 and q index 1
  • Vector clocks require either agreement on the numbering (static membership), or that the actual process ids be included with the vector
  • Rules for managing vector clocks
  • When an event happens at p, increment VTp[indexp]
  • Normally, also increment for snd and rcv events
  • When sending a message, set VT(m) = VTp
  • When receiving, set VTq = max(VTq, VT(m))
  • Where max is taken componentwise
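A sketch of these rules for a statically numbered group (the class name is ours; incrementing on snd and rcv follows the convention above):

```python
class VectorClock:
    """Vector clock for one member of an n-process group with agreed numbering."""
    def __init__(self, n, index):
        self.vt = [0] * n
        self.index = index           # this process's agreed-upon slot

    def tick(self):                  # local event: increment own entry
        self.vt[self.index] += 1

    def send(self):                  # snd counts as an event; VT(m) = VTp
        self.tick()
        return list(self.vt)

    def receive(self, vt_m):         # componentwise max, then count the rcv
        self.vt = [max(a, b) for a, b in zip(self.vt, vt_m)]
        self.tick()

p, q = VectorClock(2, 0), VectorClock(2, 1)
p.tick()            # event A at p: VTp = [1, 0]
vt_m = p.send()     # sndp(m): VT(m) = [2, 0]
q.tick()            # event C at q: VTq = [0, 1]
q.receive(vt_m)     # rcvq(m): max([0,1], [2,0]) then tick -> [2, 2]
```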

41
Time-line with VT annotations
[Figure: the p/q time-line annotated with vector timestamps; VT(m) = [2, 0]]
Could also be [1, 0] if we decide not to increment the clock on a snd event. The decision depends on how the timestamps will be used.
42
Rules for comparison of VTs
  • We'll say that VT(A) ≤ VT(B) if
  • ∀i: VT(A)[i] ≤ VT(B)[i]
  • And we'll say that VT(A) < VT(B) if
  • VT(A) ≤ VT(B) but VT(A) ≠ VT(B)
  • That is, for some i, VT(A)[i] < VT(B)[i]
  • Examples?
  • [2, 4] ≤ [2, 4]
  • [1, 3] < [7, 3]
  • [1, 3] is incomparable to [3, 1]
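The comparison rules translate directly (function names ours):

```python
def vt_leq(a, b):
    """VT(A) <= VT(B): every component of A is at most the matching one of B."""
    return all(x <= y for x, y in zip(a, b))

def vt_lt(a, b):
    """VT(A) < VT(B): <= holds but the vectors are not equal."""
    return vt_leq(a, b) and a != b

assert vt_leq([2, 4], [2, 4])
assert vt_lt([1, 3], [7, 3])
# Incomparable: neither direction of <= holds.
assert not vt_leq([1, 3], [3, 1]) and not vt_leq([3, 1], [1, 3])
```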

43
Time-line with VT annotations
  • VT(A) = [1, 0], VT(D) = [2, 4]. So VT(A) < VT(D)
  • VT(B) = [3, 0]. So VT(B) and VT(D) are incomparable

[Figure: the p/q time-line annotated with vector timestamps; VT(m) = [2, 0]]
44
Vector time and happens before
  • If A→B, then VT(A) < VT(B)
  • Write a chain of events from A to B
  • Step by step, the vector clocks get larger
  • Conversely, if VT(A) < VT(B), then A→B
  • Two cases
  • If A and B both happen at the same process p, every event seen by p increments p's vector clock
  • If A happens at p and B at q, we can trace back the path by which q learned VT(A)[p], since q only updates its entry for p on receipt of a message
  • If that message did not come directly from p, trace further back
  • (Otherwise A and B happened concurrently)

45
Introducing wall clock time
  • There are several options
  • Extend a logical clock or vector clock with the wall-clock time and use it to break ties
  • Makes meaningful statements like "B and D were concurrent, although B occurred first"
  • But unless clocks are closely synchronized such statements could be erroneous!
  • We use a clock synchronization algorithm to
    reconcile differences between clocks on various
    computers in the network

46
Synchronizing clocks
  • Without help, clocks will often differ by many
    milliseconds
  • The problem is that when a machine downloads time from a network clock it can't be sure what the delay was
  • This is because the uplink and downlink delays are often very different in a network
  • Outright failures of clocks are rare

47
Synchronizing clocks
  • Suppose p synchronizes with time.windows.com and notes that 123 ms elapsed while the protocol was running: what time is it now?

[Figure: p asks time.windows.com "What time is it?", receives the reply "09:23:02.921", and measures a round-trip delay of 123 ms]
48
Synchronizing clocks
  • Options?
  • p could guess that the delay was evenly split, but this is rarely the case in WAN settings (downlink speeds are higher)
  • p could ignore the delay
  • p could factor in only the certain part of the delay, e.g. if we know that the link takes at least 5 ms in each direction. Works best with GPS time sources!
  • In general we can't do better than the uncertainty in the link delay from the time source down to p
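The "evenly split" guess and its error bound can be sketched as follows (the function and its parameters are illustrative):

```python
def estimate_now(server_time_ms, round_trip_ms, min_one_way_ms=0):
    """Estimate the time at the moment the server's reply arrives.

    Best guess: the reply spent half the round trip in flight. The error
    bound is the uncertainty in the one-way delay; a known minimum one-way
    delay (e.g. 5 ms) tightens the bound, not the estimate."""
    estimate = server_time_ms + round_trip_ms / 2
    uncertainty = round_trip_ms / 2 - min_one_way_ms
    return estimate, uncertainty

est, err = estimate_now(server_time_ms=1_000_000, round_trip_ms=123)
# The true time is within err = 61.5 ms of est on either side.
```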

49
Consequences?
  • In a network of processes, we must assume that
    clocks are
  • Not perfectly synchronized. Even GPS has
    uncertainty, although small
  • We say that clocks are inaccurate (with respect to real time)
  • And clocks can drift during periods between synchronizations
  • The relative drift between clocks is their precision (with respect to each other)

50
Thought question
  • We are building an anti-missile system
  • Radar tells the interceptor where it should be
    and what time to get there
  • Do we want the radar and interceptor to be as
    accurate as possible, or as precise as possible?

51
Thought question
  • We want them to agree on the time, but it isn't important whether they are accurate with respect to true time
  • Precision matters more than accuracy
  • Although for this, a GPS time source would be the
    way to go
  • Might achieve higher precision than we can with
    an internal synchronization protocol!

52
Transactions in distributed systems
  • A client and a database might not run on the same computer
  • Both may not fail at the same time
  • Also, either could timeout waiting for the other
    in normal situations
  • When this happens, we normally abort the
    transaction
  • Exception is a timeout that occurs while commit
    is being processed
  • If server fails, one effect of crash is to break
    locks even for read-only access

53
Transactions in distributed systems
  • What if data is on multiple servers?
  • In a networked system, transactions run against a
    single database system
  • Indeed, many systems are structured to use just a single operation: a one-shot transaction!
  • In true distributed systems we may want one application to talk to multiple databases
  • Main issue that arises is that now we can have
    multiple database servers that are touched by one
    transaction
  • Reasons?
  • Data is spread around; each server owns a subset
  • Could have replicated some data object on multiple servers, e.g. to load-balance read access for a large client set
  • Might do this for high availability
  • Solve using 2-phase commit (2PC) protocol!

54
Two-phase commit in transactions
  • Phase 1
  • The transaction wishes to commit. Data managers force updates and lock records to disk (e.g. to the log) and then say "prepared to commit"
  • Phase 2
  • The transaction manager makes sure all are prepared, then says "commit" (or "abort", if some are not)
  • Data managers then make updates permanent or roll back to old values, and release locks
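The two phases can be sketched with data managers modeled as callables that answer a prepare request with a vote (the interface is illustrative, not a real database API):

```python
def two_phase_commit(data_managers):
    # Phase 1: ask every data manager to prepare (force updates to the
    # log, hold locks) and collect the votes.
    votes = [dm('prepare') for dm in data_managers]
    # Phase 2: commit only if all are prepared, otherwise abort.
    outcome = 'commit' if all(v == 'prepared' for v in votes) else 'abort'
    for dm in data_managers:
        dm(outcome)       # make updates permanent or roll back; release locks
    return outcome

def willing(msg):
    return 'prepared' if msg == 'prepare' else msg

def refusing(msg):
    return 'abort' if msg == 'prepare' else msg

assert two_phase_commit([willing, willing]) == 'commit'
assert two_phase_commit([willing, refusing]) == 'abort'
```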

55
As a time-line picture
[Figure: 2PC time-line; the initiator sends "Vote?" to p, q, r, s, t, all vote commit, and the initiator sends "Commit!"]
56
As a time-line picture
[Figure: the same 2PC time-line with Phase 1 ("Vote?" and the votes) and Phase 2 ("Commit!") marked]
57
Missing Stuff
  • Eventually will need to do some form of garbage
    collection
  • Issue is that participants need memory of the
    protocol, at least for a while
  • But can delay garbage collection and run it later
    on behalf of many protocol instances
  • Part of any real implementation but not thought
    of as part of the protocol

58
Fault tolerance
  • We can separate this into three cases
  • Group member fails; initiator remains healthy
  • Initiator fails; group members remain healthy
  • Both initiator and group member fail
  • Further separation
  • Handling recovery of a failed member
  • Recovery after total failure of the whole group

59
Fault tolerance
  • Some cases are pretty easy
  • E.g. if a member fails before voting we just
    treat it as an abort
  • If a member fails after voting commit, we assume
    that when it recovers it will finish up the
    commit and perform whatever action we requested
  • Hard cases involve crash of initiator

60
Initiator fails, members healthy
  • When did it fail?
  • Could fail before starting the 2PC protocol
  • In this case if the members were expecting the
    protocol to run, e.g., to terminate a pending
    transaction on a database, they do unilateral
    abort
  • Could fail after some are prepared to commit
  • Those members need to learn the outcome before
    they can finish the protocol
  • Could fail after some have learned the outcome
  • Others may still be in a prepared state

61
How to handle initiator failures?
  • Wait for initiator to come up again
  • May hold resources on members
  • Rather
  • The initiator should record the decision at a logging server for use after crashes
  • If the decision is logged, a process may learn the outcome by examining the log when the initiator fails (a timeout is needed here)
  • Also, members can help one another terminate the protocol
  • This is needed if a failure happens before the initiator has a chance to log its decision
  • A member process may repeat phase 1

62
Problems?
  • 2PC has a bad state
  • Suppose that the initiator and a member, p, both
    fail and we are not using a log
  • May not always want to use log because of extra
    overhead and reliability concerns
  • The other members cannot determine whether to commit or abort
  • p may have transferred 10M to a bank account; we want to be consistent with that
  • There is a case in which we can't terminate the protocol!

63
As a time-line picture
[Figure: the 2PC time-line again; the initiator and p fail during Phase 2, leaving q, r, s, t unable to decide even though all voted commit]
64
Can we do Better?
  • 3 phase commit (3PC)
  • Assumes detectable failures
  • We happen to know that real systems can't detect failures, unless they can unplug the power for a faulty node
  • Idea is to add an extra prepared to commit stage
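With the extra round, the initiator's side looks like this (members again modeled as callables; the interface is illustrative):

```python
def three_phase_commit(members):
    # Phase 1: collect votes.
    if not all(m('vote?') == 'commit' for m in members):
        for m in members:
            m('abort')
        return 'abort'
    # Phase 2: all voted commit, so move everyone to the prepared state.
    # Nobody may commit until every member has acknowledged this step.
    if not all(m('prepare-to-commit') == 'ok' for m in members):
        for m in members:
            m('abort')
        return 'abort'
    # Phase 3: all are prepared; now it is safe to commit.
    for m in members:
        m('commit!')
    return 'commit'

def member(msg):
    # A cooperative member: votes commit and acknowledges the prepare round.
    return {'vote?': 'commit', 'prepare-to-commit': 'ok'}.get(msg, msg)

assert three_phase_commit([member, member]) == 'commit'
```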

65
3PC
[Figure: 3PC time-line; Phase 1: "Vote?" and all vote commit; Phase 2: "Prepare to commit" and all say ok; Phase 3: "Commit!" and they commit]
66
Why 3PC?
  • A new leader in the group can deduce the outcome when this protocol is used
  • Main insight?
  • In 2PC the decision to commit can be known by only the initiator and one other process
  • In 3PC nobody can enter the commit state unless all are first in the prepared state
  • Makes it possible to determine the state, then push the protocol forward (or back)
  • But it does require accurate failure detection
  • Only commit if all operational processes are in the prepared-to-commit state; abort if all operational processes are still in the ok-to-commit state
  • Failed processes may learn the outcome when they become operational again

67
Value of 3PC?
  • Even with inaccurate failure detection, it greatly reduces the window of vulnerability
  • The bad case for 2PC is not so uncommon
  • Especially if a group member is the initiator
  • In that case one badly timed failure freezes the
    whole group
  • With 3PC in real systems, the troublesome case
    becomes very unlikely
  • But the problems remain
  • E.g., in a network partition where half may be prepared to commit and half may be ok to commit

68
State diagram for non-faulty member
[Figure: state diagram for a non-faulty member; the notes below annotate its states]
The protocol starts in the initial state. The initiator sends the "OK to commit?" inquiry.
Coordinator failure sends us into an inquiry mode in which someone (anyone) tries to figure out the situation.
We collect responses. If any is an abort, we enter the abort stage.
Here, we finish off the "prepare" stage if a crash interrupted it, by resending the prepare message (needed in case only some processes saw the coordinator's message before it crashed).
Otherwise we send prepare-to-commit messages out.
This state corresponds to the coordinator sending out the commit messages. We enter the state when all members have received them.
We get here if some processes were still in the initial "OK to commit?" stage.
In this case it is safe to abort, and we do so.
69
Summary
  • We looked at goals and prerequisites for
    consistent replication
  • Static and Dynamic Membership
  • Logical Time
  • Distributed Commit