Distributed Systems 2006
1
Distributed Systems 2006
  • Overcoming Failures in a Distributed System
  • With material adapted from Ken Birman

2
Leslie Lamport
  • A distributed system is one in which the failure
    of a machine you have never heard of can cause
    your own machine to become unusable

3
Plan
  • Goals
  • Static and Dynamic Membership
  • Logical Time
  • Distributed Commit

4
Thought question
  • Suppose that a distributed system was built by
    interconnecting a set of extremely reliable
    components running on fault-tolerant hardware
  • Would such a system be expected to be reliable?
  • Perhaps not. The pattern of interaction, the
    need to match rates of data production and
    consumption, and other distributed factors all
    can prevent a system from operating correctly!

5
Example (1)
  • The Web's components are individually reliable
  • But the Web can fail by returning inconsistent or
    stale data, can freeze up or claim that a server
    is not responding (even if both browser and
    server are operational), and it can be so slow
    that we consider it faulty even if it is working
  • For stateful systems (the Web is stateless) this
    issue extends to joint behavior of sets of
    programs

6
Example (2)
  • Ariane 5
  • June 4, 1996, 40 seconds after takeoff
  • Self destruction after abrupt course correction
  • caused by the complete loss of guidance and
    attitude information due to specification and
    design errors in the software of the inertial
    reference system
  • Loss of 500 million dollars, but no loss of life
  • Where are the distribution aspects?

7
Our Goal Here
  • We want to replicate data and computation
  • For availability
  • For performance
  • while guaranteeing consistent behavior
  • Work towards virtual synchronous communication
  • System appears to have no replicated data
  • System appears to only have multi-thread
    concurrency

8
Synchronous and Asynchronous Executions
[Figure: two time-line diagrams for processes p, q, and r: a synchronous execution and an asynchronous one]
In the synchronous model messages arrive on time, processes share a synchronized clock, and failures are easily detected. None of these properties holds in an asynchronous model.
9
Reality Neither One
  • Real distributed systems aren't synchronous
  • Although some can come close
  • Nor are they asynchronous
  • Software often treats them as asynchronous
  • In reality, clocks work well so in practice we
    often use time cautiously and can even put limits
    on message delays
  • For our purposes we usually start with an
    asynchronous model
  • Subsequently enrich it with sources of time when
    useful

10
Steps Towards Our Goal
Robust Web Services: we'll build them with these tools
Tools for solving practical replication and availability problems: we'll base them on ordered multicast
Ordered multicast: we'll base it on fault-tolerant multicast
Fault-tolerant multicast: we'll use membership
Tracking group membership: we'll base it on 2PC and 3PC
2PC and 3PC: our first tools (lowest layer)
11
Membership
  • Which processes are available in a distributed
    system?
  • Dynamic membership
  • Use group membership protocol to track members
  • Performant, complicated
  • Static membership
  • Use static list of potential group members
  • Resolve liveness on a per-operation basis
  • May be slow, simpler
  • (Approaches may be combined)

12
Dynamic Membership
  • Provides a Group Membership Service (GMS)
  • Processes as members
  • Processes may join or leave the group and monitor
    other processes in the group
  • (More next time)
  • Dynamic membership: 80,000 updates per second with 5 members
  • Static membership: tens of updates per second with 5 members

13
Static Membership
  • Example
  • Static set of potential members
  • E.g., p, q, r, s, t
  • Support replicated data on members
  • E.g., x, an integer value
  • E.g., x: (t=0, v=0) → (t=21, v=17) → (t=25, v=97), ...
  • Each process records version of x and value of x
  • p reading a value?
  • It cannot just look at its own copy: the value may have been changed at other processes

14
Quorum Update and Read
  • Simple fix
  • Make sure that operations reach a majority of
    processes in the system
  • Update and read only if supported by a majority
    of processes
  • A read of x is then sure to see the latest updated value: just take the reply with the largest version
  • General fix
  • Two basic rules
  • A quorum read should intersect any prior quorum write in at least one process
  • Likewise, quorum writes should intersect prior quorum writes
  • In a group of size N
  • Qr + Qw > N
  • Qw + Qw > N
  • The example again, with N = 5
  • Qr = 3, Qw = 3
  • Other possibilities?
  • Note that we want Qw < N for fault tolerance, thus Qr > 1!
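These two inequalities are easy to check mechanically. A minimal sketch in Python (the function name is ours, not from the slides):

```python
def quorums_valid(n, qr, qw):
    """Check the two quorum intersection rules for a group of size n:
    reads intersect prior writes (Qr + Qw > N), and writes intersect
    prior writes (Qw + Qw > N)."""
    return qr + qw > n and qw + qw > n

# The slide's example: N = 5 with Qr = 3, Qw = 3 satisfies both rules.
assert quorums_valid(5, 3, 3)
# Qr = 2, Qw = 3 fails: a read quorum could miss the latest write.
assert not quorums_valid(5, 2, 3)
```

Any (Qr, Qw) pair passing this check works; a smaller Qw makes writes cheaper at the cost of larger read quorums.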

15
Update Protocol
  • 1) p issues RPC-style read request to one replica
    after another
  • p collects at least Qr replies
  • p notes version (and value)
  • 2) p computes new version of data
  • Larger than maximum current version received
  • 3) p issues RPC to Qw members asking to prepare
  • Processes reply to p
  • 4) p checks number of acknowledgements
  • ≥ Qw → commit
  • < Qw → abort
  • (Actually a two-phase commit protocol (2PC) is
    used in 3) and 4), more later)
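The four steps can be sketched with in-memory replicas standing in for the RPCs (the dict structure and function names are illustrative, not part of the protocol):

```python
def quorum_read(replicas, qr):
    """Step 1: collect at least Qr replies; return the (version, value)
    pair with the largest version."""
    replies = [(r['version'], r['value']) for r in replicas[:qr]]
    return max(replies)

def quorum_update(replicas, qr, qw, new_value):
    """Steps 2-4: pick a version larger than any seen, ask Qw members to
    prepare (modeled here as always succeeding), commit iff Qw acknowledge."""
    version, _ = quorum_read(replicas, qr)
    acks = 0
    for r in replicas[:qw]:                 # in reality, any Qw members
        r['version'], r['value'] = version + 1, new_value
        acks += 1
    return acks >= qw                       # commit, else abort

replicas = [{'version': 0, 'value': 0} for _ in range(5)]
assert quorum_update(replicas, 3, 3, 17)    # commits
assert quorum_read(replicas, 3) == (1, 17)  # the new value is visible
```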

16
Time
  • We were somewhat careful to avoid time in static
    membership
  • In a distributed system we need practical ways to deal with time
  • E.g., we may need to agree that update A occurred
    before update B
  • Or offer a lease on a resource that expires at time 10:10:01.50
  • Or guarantee that a time critical event will
    reach all interested parties within 100ms

17
But what does Time Mean?
  • Time on a machine's local clock
  • But was it set accurately?
  • And could it drift, e.g. run fast or slow?
  • What about faults, like stuck bits?
  • Time on a global clock?
  • E.g. with GPS receiver
  • Still not accurate enough to determine which events happen before which other events
  • Or could try to agree on time

18
Lamport's Approach
  • Leslie Lamport suggested that we should reduce
    time to its basics
  • Cannot order events according to a global clock
  • None available
  • Can use logical clock
  • Time basically becomes a way of labeling events
    so that we may ask if event A happened before
    event B
  • Answer should be consistent with what could have
    happened with respect to a global clock
  • Often this is what matters

19
Drawing time-line pictures
[Figure: time-line for processes p and q; p sends message m (sndp(m)), which arrives at q after delay D (rcvq(m)) and is then delivered (delivq(m))]
20
Drawing time-line pictures
  • A, B, C and D are events
  • Could be anything meaningful to the application
  • microcode, program code, file write, message handling, ...
  • So are snd(m), rcv(m) and deliv(m)
  • What ordering claims are meaningful?

[Figure: the same time-line with events A and B at p, and C and D at q]
21
Drawing time-line pictures
  • A happens before B, and C before D
  • Local ordering at a single process
  • Write A →p B and C →p D

[Figure: the same time-line with events A and B at p, and C and D at q]
22
Drawing time-line pictures
  • sndp(m) also happens before rcvq(m)
  • Distributed ordering introduced by a message
  • Write sndp(m) →m rcvq(m)

[Figure: the same time-line with events A and B at p, and C and D at q]
23
Drawing time-line pictures
  • A happens before D
  • Transitivity A happens before sndp(m), which
    happens before rcvq(m), which happens before D

[Figure: the same time-line with events A and B at p, and C and D at q]
24
Drawing time-line pictures
  • B and D are concurrent
  • Looks like B happens first, but D has no way to
    know. No information flowed

[Figure: the same time-line with events A and B at p, and C and D at q]
25
The Happens-Before Relation
  • We'll say that A happens-before B, written A→B, if
  • 1) A →p B according to the local ordering, or
  • 2) A is a snd and B is a rcv and A →m B, or
  • 3) A and B are related under the transitive closure of rules 1) and 2)
  • Thus, A→D
  • So far, this is just a mathematical notation, not
    a systems tool
  • A new event seen by a process happens logically
    after other events seen by that process
  • A message receive happens logically after a
    message has been sent

26
Simultaneous Actions
  • There are many situations in which we want to
    talk about some form of simultaneous event
  • Think about updating replicated data
  • Perhaps we have multiple conflicting updates
  • The need is to ensure that they will happen in
    the same order at all copies
  • This looks like a kind of simultaneous action
  • Want to know the states of a distributed system that might have occurred at an instant of real time

27
Temporal distortions
  • Things can be complicated because we can't predict
  • Message delays (they vary constantly)
  • Execution speeds (often a process shares a
    machine with many other tasks)
  • Timing of external events
  • Lamport looked at this question too

28
Temporal distortions
  • What does "now" mean?


[Figure: time-lines for processes p0, p1, p2, p3 with events a–f and messages between them]
30
Temporal distortions
  • Timelines can stretch
  • caused by scheduling effects, message delays,
    message loss


[Figure: time-lines for processes p0, p1, p2, p3 with events a–f and messages between them]
31
Temporal distortions
  • Timelines can shrink
  • E.g. something lets a machine speed up


[Figure: time-lines for processes p0, p1, p2, p3 with events a–f and messages between them]
32
Temporal distortions
  • Cuts represent instants of time
  • Viz., subsets of events, one per process
  • E.g., {a, c} or {a, rcv(d), f, rcv(e)}
  • But not every cut makes sense
  • Black cuts could occur but not gray ones.


[Figure: the p0–p3 time-lines with several cuts drawn; black cuts could occur, gray ones could not]
33
Temporal distortions
  • Red messages cross gray cuts backwards
  • Need to avoid capturing states in which a message
    is received but nobody is shown as having sent it
  • Consistent cuts
  • If rcv(m) is in the cut, then snd(m) (or an earlier event) is in the cut
  • snd(m) may be in the cut without rcv(m) being in the cut
  • (m is then in the message channel)


[Figure: the p0–p3 time-lines with a message crossing a gray cut backwards]
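The consistency condition on cuts is easy to state as a predicate. A sketch with events as strings and each message mapped to its (snd, rcv) pair (names illustrative):

```python
def is_consistent(cut, messages):
    """A cut is consistent if every rcv(m) in the cut has its snd(m)
    in the cut too; a snd without its rcv just means m is in the channel."""
    return all(snd in cut
               for snd, rcv in messages.values()
               if rcv in cut)

msgs = {'m': ('snd_m', 'rcv_m')}
assert is_consistent({'a', 'snd_m', 'rcv_m'}, msgs)  # both sides present
assert is_consistent({'a', 'snd_m'}, msgs)           # m still in the channel
assert not is_consistent({'a', 'rcv_m'}, msgs)       # received from nobody
```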
34
Who Cares?
  • Suppose
  • p holds a lock
  • m says "release the lock"
  • p sends m to q
  • snd(m) → rcv(m)
  • An inconsistent cut may contain rcv(m) but not snd(m)
  • It sees that both p and q hold the lock

35
Logical clocks
  • A simple tool that can capture parts of the happens-before relation
  • First version uses just a single integer
  • Designed for big (64-bit or more) counters
  • Each process p maintains LTp, a local counter
  • A message m will carry LTm

36
Rules for managing logical clocks
  • When an event happens at a process p it
    increments LTp
  • Any event that matters to p
  • Normally, also snd and rcv events (since we want
    receive to occur after the matching send)
  • When p sends m, set
  • LT(m) = LTp
  • When q receives m, set
  • LTq = max(LTq, LT(m)) + 1
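These rules fit in a few lines. A sketch (the class name is ours, not from the slides):

```python
class LamportClock:
    """Scalar logical clock following the three rules above."""
    def __init__(self):
        self.time = 0

    def tick(self):                  # an event that matters to this process
        self.time += 1
        return self.time

    def send(self):                  # snd counts as an event; LT(m) = LTp
        return self.tick()

    def receive(self, lt_m):         # LTq = max(LTq, LT(m)) + 1
        self.time = max(self.time, lt_m) + 1
        return self.time

p, q = LamportClock(), LamportClock()
p.tick()             # event A at p: LTp = 1
lt_m = p.send()      # sndp(m): LT(m) = 2
q.tick()             # event C at q: LTq = 1
q.receive(lt_m)      # rcvq(m): max(1, 2) + 1 = 3
```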

37
Time-line with LT annotations
  • LT(A) = 1, LT(sndp(m)) = 2, LT(m) = 2
  • LT(rcvq(m)) = max(1, 2) + 1 = 3, etc.

[Figure: the p/q time-line with events A, B at p and C, D at q, annotated with logical timestamps]
38
Logical clocks
  • If A happens before B, A→B, then LT(A) < LT(B)
  • A→B means A = E0 → E1 → ... → En = B, where each consecutive pair is ordered either by →p or →m
  • The LTs along this chain only increase
  • But the converse might not be true
  • If LT(A) < LT(B) we can't be sure that A→B
  • This is because processes that don't communicate still assign timestamps, and hence events will seem to have an order

39
Can we do better?
  • One option is to use vector clocks
  • Here we treat timestamps as a list
  • One counter for each process
  • Rules for managing vector times differ from what we did with logical clocks

40
Vector clocks
  • Clock is a vector, e.g. VT(A) = [1, 0]
  • We'll just assign p index 0 and q index 1
  • Vector clocks require either agreement on the numbering (static membership), or that the actual process ids be included with the vector
  • Rules for managing vector clocks
  • When an event happens at p, increment VTp[indexp]
  • Normally, also increment for snd and rcv events
  • When sending a message, set VT(m) = VTp
  • When receiving, set VTq = max(VTq, VT(m))
  • Where max is taken componentwise
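A sketch of these rules for a statically numbered group (the class name is ours; incrementing on snd and rcv follows the convention above):

```python
class VectorClock:
    """Vector clock for one member of an n-process group with agreed numbering."""
    def __init__(self, n, index):
        self.vt = [0] * n
        self.index = index           # this process's agreed-upon slot

    def tick(self):                  # local event: increment own entry
        self.vt[self.index] += 1

    def send(self):                  # snd counts as an event; VT(m) = VTp
        self.tick()
        return list(self.vt)

    def receive(self, vt_m):         # componentwise max, then count the rcv
        self.vt = [max(a, b) for a, b in zip(self.vt, vt_m)]
        self.tick()

p, q = VectorClock(2, 0), VectorClock(2, 1)
p.tick()            # event A at p: VTp = [1, 0]
vt_m = p.send()     # sndp(m): VT(m) = [2, 0]
q.tick()            # event C at q: VTq = [0, 1]
q.receive(vt_m)     # rcvq(m): max([0,1], [2,0]) then tick -> [2, 2]
```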

41
Time-line with VT annotations
[Figure: the p/q time-line annotated with vector timestamps; VT(m) = [2, 0]]
Could also be [1, 0] if we decide not to increment the clock on a snd event. The decision depends on how the timestamps will be used.
42
Rules for comparison of VTs
  • We'll say that VT(A) ≤ VT(B) if
  • ∀i: VT(A)[i] ≤ VT(B)[i]
  • And we'll say that VT(A) < VT(B) if
  • VT(A) ≤ VT(B) but VT(A) ≠ VT(B)
  • That is, for some i, VT(A)[i] < VT(B)[i]
  • Examples?
  • [2, 4] ≤ [2, 4]
  • [1, 3] < [7, 3]
  • [1, 3] is incomparable to [3, 1]
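The comparison rules translate directly (function names ours):

```python
def vt_leq(a, b):
    """VT(A) <= VT(B): every component of A is at most the matching one of B."""
    return all(x <= y for x, y in zip(a, b))

def vt_lt(a, b):
    """VT(A) < VT(B): <= holds but the vectors are not equal."""
    return vt_leq(a, b) and a != b

assert vt_leq([2, 4], [2, 4])
assert vt_lt([1, 3], [7, 3])
# Incomparable: neither direction of <= holds.
assert not vt_leq([1, 3], [3, 1]) and not vt_leq([3, 1], [1, 3])
```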

43
Time-line with VT annotations
  • VT(A) = [1, 0], VT(D) = [2, 4]. So VT(A) < VT(D)
  • VT(B) = [3, 0]. So VT(B) and VT(D) are incomparable

[Figure: the p/q time-line annotated with vector timestamps; VT(m) = [2, 0]]
44
Vector time and happens before
  • If A→B, then VT(A) < VT(B)
  • Write a chain of events from A to B
  • Step by step, the vector clocks get larger
  • Conversely, if VT(A) < VT(B), then A→B
  • Two cases
  • If A and B both happen at the same process p, every event seen by p increments p's vector clock
  • If A happens at p and B at q, we can trace back the path by which q learned VT(A)[p], since q only updates its entry for p on receipt of a message
  • If that message did not come directly from p, trace further back
  • (Otherwise A and B happened concurrently)

45
Introducing wall clock time
  • There are several options
  • Extend a logical clock or vector clock with the wall-clock time and use it to break ties
  • Makes meaningful statements like "B and D were concurrent, although B occurred first"
  • But unless clocks are closely synchronized such statements could be erroneous!
  • We use a clock synchronization algorithm to
    reconcile differences between clocks on various
    computers in the network

46
Synchronizing clocks
  • Without help, clocks will often differ by many
    milliseconds
  • The problem is that when a machine downloads time from a network clock it can't be sure what the delay was
  • This is because the uplink and downlink delays are often very different in a network
  • Outright failures of clocks are rare

47
Synchronizing clocks
  • Suppose p synchronizes with time.windows.com and notes that 123 ms elapsed while the protocol was running: what time is it now?

[Figure: p asks time.windows.com "What time is it?", receives the reply "09:23:02.921", and measures a round-trip delay of 123 ms]
48
Synchronizing clocks
  • Options?
  • p could guess that the delay was evenly split, but this is rarely the case in WAN settings (downlink speeds are higher)
  • p could ignore the delay
  • p could factor in only the certain part of the delay, e.g. if we know that the link takes at least 5 ms in each direction. Works best with GPS time sources!
  • In general we can't do better than the uncertainty in the link delay from the time source down to p
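The "evenly split" guess and its error bound can be sketched as follows (the function and its parameters are illustrative):

```python
def estimate_now(server_time_ms, round_trip_ms, min_one_way_ms=0):
    """Estimate the time at the moment the server's reply arrives.

    Best guess: the reply spent half the round trip in flight. The error
    bound is the uncertainty in the one-way delay; a known minimum one-way
    delay (e.g. 5 ms) tightens the bound, not the estimate."""
    estimate = server_time_ms + round_trip_ms / 2
    uncertainty = round_trip_ms / 2 - min_one_way_ms
    return estimate, uncertainty

est, err = estimate_now(server_time_ms=1_000_000, round_trip_ms=123)
# The true time is within err = 61.5 ms of est on either side.
```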

49
Consequences?
  • In a network of processes, we must assume that
    clocks are
  • Not perfectly synchronized. Even GPS has
    uncertainty, although small
  • We say that clocks are inaccurate (with respect to real time)
  • And clocks can drift during periods between synchronizations
  • The relative drift between clocks is their precision (with respect to each other)

50
Thought question
  • We are building an anti-missile system
  • Radar tells the interceptor where it should be
    and what time to get there
  • Do we want the radar and interceptor to be as
    accurate as possible, or as precise as possible?

51
Thought question
  • We want them to agree on the time, but it isn't important whether they are accurate with respect to true time
  • Precision matters more than accuracy
  • Although for this, a GPS time source would be the
    way to go
  • Might achieve higher precision than we can with
    an internal synchronization protocol!

52
Transactions in distributed systems
  • A client and a database might not run on the same computer
  • Both may not fail at the same time
  • Also, either could timeout waiting for the other
    in normal situations
  • When this happens, we normally abort the
    transaction
  • Exception is a timeout that occurs while commit
    is being processed
  • If server fails, one effect of crash is to break
    locks even for read-only access

53
Transactions in distributed systems
  • What if data is on multiple servers?
  • In a networked system, transactions run against a
    single database system
  • Indeed, many systems are structured to use just a single operation: a one-shot transaction!
  • In true distributed systems we may want one application to talk to multiple databases
  • Main issue that arises is that now we can have
    multiple database servers that are touched by one
    transaction
  • Reasons?
  • Data is spread around; each server owns a subset
  • Could have replicated some data object on multiple servers, e.g. to load-balance read access for a large client set
  • Might do this for high availability
  • Solve using 2-phase commit (2PC) protocol!

54
Two-phase commit in transactions
  • Phase 1
  • The transaction wishes to commit. Data managers force updates and lock records to disk (e.g. to the log) and then say "prepared to commit"
  • Phase 2
  • The transaction manager makes sure all are prepared, then says "commit" (or "abort", if some are not)
  • Data managers then make updates permanent or roll back to old values, and release locks
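The two phases can be sketched with data managers modeled as callables that answer a prepare request with a vote (the interface is illustrative, not a real database API):

```python
def two_phase_commit(data_managers):
    # Phase 1: ask every data manager to prepare (force updates to the
    # log, hold locks) and collect the votes.
    votes = [dm('prepare') for dm in data_managers]
    # Phase 2: commit only if all are prepared, otherwise abort.
    outcome = 'commit' if all(v == 'prepared' for v in votes) else 'abort'
    for dm in data_managers:
        dm(outcome)       # make updates permanent or roll back; release locks
    return outcome

def willing(msg):
    return 'prepared' if msg == 'prepare' else msg

def refusing(msg):
    return 'abort' if msg == 'prepare' else msg

assert two_phase_commit([willing, willing]) == 'commit'
assert two_phase_commit([willing, refusing]) == 'abort'
```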

55
As a time-line picture
[Figure: 2PC time-line; the initiator sends "Vote?" to p, q, r, s, t, all vote commit, and the initiator sends "Commit!"]
56
As a time-line picture
[Figure: the same 2PC time-line with Phase 1 ("Vote?" and the votes) and Phase 2 ("Commit!") marked]
57
Missing Stuff
  • Eventually will need to do some form of garbage
    collection
  • Issue is that participants need memory of the
    protocol, at least for a while
  • But can delay garbage collection and run it later
    on behalf of many protocol instances
  • Part of any real implementation but not thought
    of as part of the protocol

58
Fault tolerance
  • We can separate this into three cases
  • Group member fails; initiator remains healthy
  • Initiator fails; group members remain healthy
  • Both initiator and group member fail
  • Further separation
  • Handling recovery of a failed member
  • Recovery after total failure of the whole group

59
Fault tolerance
  • Some cases are pretty easy
  • E.g. if a member fails before voting we just
    treat it as an abort
  • If a member fails after voting commit, we assume
    that when it recovers it will finish up the
    commit and perform whatever action we requested
  • Hard cases involve crash of initiator

60
Initiator fails, members healthy
  • When did it fail?
  • Could fail before starting the 2PC protocol
  • In this case if the members were expecting the
    protocol to run, e.g., to terminate a pending
    transaction on a database, they do unilateral
    abort
  • Could fail after some are prepared to commit
  • Those members need to learn the outcome before
    they can finish the protocol
  • Could fail after some have learned the outcome
  • Others may still be in a prepared state

61
How to handle initiator failures?
  • Wait for initiator to come up again
  • May hold resources on members
  • Rather
  • The initiator should record the decision at a logging server for use after crashes
  • If the decision is logged, a process may learn the outcome by examining the log when the initiator fails (a timeout is needed here)
  • Also, members can help one another terminate the protocol
  • This is needed if a failure happens before the initiator has a chance to log its decision
  • A member process may repeat phase 1

62
Problems?
  • 2PC has a bad state
  • Suppose that the initiator and a member, p, both
    fail and we are not using a log
  • May not always want to use log because of extra
    overhead and reliability concerns
  • The other members cannot determine whether to commit or abort
  • p may have transferred 10M to a bank account; we want to be consistent with that
  • There is a case in which we can't terminate the protocol!

63
As a time-line picture
[Figure: the 2PC time-line again; the initiator and p fail during Phase 2, leaving q, r, s, t unable to decide even though all voted commit]
64
Can we do Better?
  • 3 phase commit (3PC)
  • Assumes detectable failures
  • We happen to know that real systems can't detect failures, unless they can unplug the power for a faulty node
  • Idea is to add an extra prepared to commit stage
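With the extra round, the initiator's side looks like this (members again modeled as callables; the interface is illustrative):

```python
def three_phase_commit(members):
    # Phase 1: collect votes.
    if not all(m('vote?') == 'commit' for m in members):
        for m in members:
            m('abort')
        return 'abort'
    # Phase 2: all voted commit, so move everyone to the prepared state.
    # Nobody may commit until every member has acknowledged this step.
    if not all(m('prepare-to-commit') == 'ok' for m in members):
        for m in members:
            m('abort')
        return 'abort'
    # Phase 3: all are prepared; now it is safe to commit.
    for m in members:
        m('commit!')
    return 'commit'

def member(msg):
    # A cooperative member: votes commit and acknowledges the prepare round.
    return {'vote?': 'commit', 'prepare-to-commit': 'ok'}.get(msg, msg)

assert three_phase_commit([member, member]) == 'commit'
```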

65
3PC
[Figure: 3PC time-line; Phase 1: "Vote?" and all vote commit; Phase 2: "Prepare to commit" and all say ok; Phase 3: "Commit!" and they commit]
66
Why 3PC?
  • A new leader in the group can deduce the outcome when this protocol is used
  • Main insight?
  • In 2PC the decision to commit can be known by only the initiator and one other process
  • In 3PC nobody can enter the commit state unless all are first in the prepared state
  • Makes it possible to determine the state, then push the protocol forward (or back)
  • But it does require accurate failure detection
  • Only commit if all operational processes are in the prepared-to-commit state; abort if all operational processes are still in the ok-to-commit state
  • Failed processes may learn the outcome when they become operational again

67
Value of 3PC?
  • Even with inaccurate failure detection, it greatly reduces the window of vulnerability
  • The bad case for 2PC is not so uncommon
  • Especially if a group member is the initiator
  • In that case one badly timed failure freezes the
    whole group
  • With 3PC in real systems, the troublesome case
    becomes very unlikely
  • But the problems remain
  • E.g., in a network partition where half may be prepared to commit and half may be ok to commit

68
State diagram for non-faulty member
[Figure: state diagram for a non-faulty member; the notes below annotate its states]
The protocol starts in the initial state. The initiator sends the "OK to commit?" inquiry.
Coordinator failure sends us into an inquiry mode in which someone (anyone) tries to figure out the situation.
We collect responses. If any is an abort, we enter the abort stage.
Here, we finish off the "prepare" stage if a crash interrupted it, by resending the prepare message (needed in case only some processes saw the coordinator's message before it crashed).
Otherwise we send prepare-to-commit messages out.
This state corresponds to the coordinator sending out the commit messages. We enter the state when all members have received them.
We get here if some processes were still in the initial "OK to commit?" stage.
In this case it is safe to abort, and we do so.
69
Summary
  • We looked at goals and prerequisites for
    consistent replication
  • Static and Dynamic Membership
  • Logical Time
  • Distributed Commit