Distributed Systems: Atomicity, decision making, snapshots - PowerPoint PPT Presentation

Slides adapted from Ken's CS514 lectures. Distributed Systems: Atomicity, ... after Custer, who died at Little Bighorn because he arrived a couple of days too early! ... – PowerPoint PPT presentation

Provided by: ranveer7

Transcript and Presenter's Notes

1
Distributed Systems: Atomicity, decision making, snapshots
2
Announcements
  • Please complete course evaluations
  • http://www.engineering.cornell.edu/CourseEval/
  • Prelim II coming up this week
  • Thursday, April 26th, 7:30-9:00pm, 1½ hour exam
  • 101 Phillips
  • Closed book, no calculators/PDAs/
  • Bring ID
  • Topics
  • Since last Prelim, up to (and including) Monday,
    April 23rd
  • Lectures 19-34, chapters 10-18 (7th ed)
  • Review Session Tuesday, April 24th
  • during second half of 415 Section
  • Homework 6 (and solutions) available via CMS
  • Do it without looking at solutions. However, it
    will not be graded

3
Review: What time is it?
  • In a distributed system we need practical ways to
    deal with time
  • E.g. we may need to agree that update A occurred
    before update B
  • Or offer a lease on a resource that expires at
    time 10:10.0150
  • Or guarantee that a time-critical event will
    reach all interested parties within 100 ms

4
Review: Event Ordering
  • Problem: distributed systems do not share a clock
  • Many coordination problems would be simplified if
    they did ("first one wins")
  • Distributed systems do have some sense of time
  • Events in a single process happen in order
  • Messages between processes must be sent before
    they can be received
  • How helpful is this?

5
Review: Happens-before
  • Define a happens-before relation (denoted by →).
  • 1) If A and B are events in the same process, and
    A was executed before B, then A → B.
  • 2) If A is the event of sending a message by one
    process and B is the event of receiving that
    message by another process, then A → B.
  • 3) If A → B and B → C then A → C.

6
Review: Total ordering?
  • Happens-before gives a partial ordering of events
  • We still do not have a total ordering of events
  • We are not able to order events that happen
    concurrently
  • A and B are concurrent if (not A → B) and (not B → A)
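The three happens-before rules and the definition of concurrency can be sketched in a few lines of Python. This is only an illustration: the helper names `happens_before` and `concurrent` and the two-process run (events p1, p2, q1, q2) are made up for the example and do not come from the slides.

```python
from collections import defaultdict

def happens_before(edges):
    """Build hb(a, b) from direct edges (rules 1 and 2).
    Rule 3 (transitivity) is applied by searching along the edges."""
    succ = defaultdict(set)
    for a, b in edges:
        succ[a].add(b)

    def hb(a, b):
        seen, stack = set(), [a]
        while stack:
            x = stack.pop()
            for y in succ[x]:
                if y == b:
                    return True
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return False

    return hb

def concurrent(hb, a, b):
    """Two events are concurrent if neither happens before the other."""
    return not hb(a, b) and not hb(b, a)

# Hypothetical run: p1 -> p2 and q1 -> q2 are program order,
# and p1 sends a message that is received at q2.
edges = [("p1", "p2"), ("q1", "q2"), ("p1", "q2")]
hb = happens_before(edges)
print(hb("p1", "q2"))              # True: send before receive
print(concurrent(hb, "p2", "q1"))  # True: no path in either direction
```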

7
Review: Partial Ordering
Pi → Pi+1, Qi → Qi+1, Ri → Ri+1
R0 → Q4, Q3 → R4, Q1 → P4, P1 → Q2
8
Review: Total Ordering?
P0, P1, Q0, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4
P0, Q0, Q1, P1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4
P0, Q0, P1, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4
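One way to convince yourself that each listing is a legal total order is to check it against every happens-before edge from the partial-ordering slide. A minimal sketch in Python; the function name `respects` is an illustrative choice.

```python
# Program order within P, Q and R (events 0..4), plus the four message edges.
program_order = [(f"{p}{i}", f"{p}{i + 1}") for p in "PQR" for i in range(4)]
messages = [("R0", "Q4"), ("Q3", "R4"), ("Q1", "P4"), ("P1", "Q2")]
edges = program_order + messages

def respects(order, edges):
    """True if every edge (a, b) has a listed before b in `order`."""
    pos = {e: i for i, e in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in edges)

orders = [
    "P0 P1 Q0 Q1 Q2 P2 P3 P4 Q3 R0 Q4 R1 R2 R3 R4".split(),
    "P0 Q0 Q1 P1 Q2 P2 P3 P4 Q3 R0 Q4 R1 R2 R3 R4".split(),
    "P0 Q0 P1 Q1 Q2 P2 P3 P4 Q3 R0 Q4 R1 R2 R3 R4".split(),
]
print([respects(o, edges) for o in orders])  # [True, True, True]

# Moving Q2 ahead of P1 breaks the message edge P1 -> Q2.
bad = "P0 Q0 Q1 Q2 P1 P2 P3 P4 Q3 R0 Q4 R1 R2 R3 R4".split()
print(respects(bad, edges))                  # False
```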
9
Review: Timestamps
  • Assume each process has a local logical clock
    that ticks once per event and that the processes
    are numbered
  • Clocks tick once per event (including message
    send)
  • When you send a message, send your clock value
  • When you receive a message, set your clock to
    MAX(your clock, timestamp of message + 1)
  • Thus sending comes before receiving
  • A node's only visibility into actions at other
    nodes is through communication; communication
    synchronizes the clocks
  • If the timestamps of two events A and B are the
    same, then use the process identity numbers to
    break ties.
  • This gives a total ordering!
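The timestamp rules can be sketched as a small Python class. The class and method names are illustrative. The sketch also assumes that a receive counts as an event, so the local clock ticks on receipt before the MAX is taken.

```python
class LamportClock:
    def __init__(self, pid):
        self.pid = pid    # process number, used only to break ties
        self.time = 0

    def tick(self):
        """Local event: advance the clock, return a (time, pid) stamp."""
        self.time += 1
        return (self.time, self.pid)

    def send(self):
        """Sending is an event; the stamp travels with the message."""
        return self.tick()

    def receive(self, stamp):
        """Receive event: tick, but never end up at or below the sender."""
        msg_time, _ = stamp
        self.time = max(self.time + 1, msg_time + 1)
        return (self.time, self.pid)

p, q = LamportClock(1), LamportClock(2)
a = p.tick()       # (1, 1)
m = p.send()       # (2, 1)
b = q.tick()       # (1, 2)
r = q.receive(m)   # (3, 2): the receive is stamped after the send

# Tuples compare by time first, then by process number: a total order.
print(sorted([r, b, m, a]))  # [(1, 1), (1, 2), (2, 1), (3, 2)]
```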

10
Review: Distributed Mutual Exclusion
  • Want mutual exclusion in distributed setting
  • The system consists of n processes; each process
    Pi resides at a different processor
  • Each process has a critical section that requires
    mutual exclusion
  • Problem: cannot use an atomic testAndSet primitive
    since memory is not shared and processes may be on
    physically separated nodes
  • Requirement
  • If Pi is executing in its critical section, then
    no other process Pj is executing in its critical
    section
  • Compare three solutions
  • Centralized Distributed Mutual Exclusion (CDME)
  • Fully Distributed Mutual Exclusion (DDME)
  • Token passing

11
Today
  • Atomicity and Distributed Decision Making
  • What time is it now?
  • Synchronized clocks
  • What does the entire system look like at this
    moment?

12
Atomicity
  • Recall
  • Atomicity: either all the operations associated
    with a program unit are executed to completion,
    or none are performed.
  • In a distributed system we may have multiple copies
    of the data
  • (e.g. replicas are good for reliability/availability)
  • PROBLEM: How do we atomically update all of the
    copies?
  • That is, either all replicas reflect a change or
    none

13
Generals' Paradox
  • Generals' paradox
  • Constraints of problem
  • Two generals, on separate mountains
  • Can only communicate via messengers
  • Messengers can be captured
  • Problem: need to coordinate attack
  • If they attack at different times, they all die
  • If they attack at same time, they win
  • Named after Custer, who died at Little Bighorn
    because he arrived a couple of days too early!
  • Can messages over an unreliable network be used
    to guarantee two entities do something
    simultaneously?
  • Remarkably, no, even if all messages get
    through
  • No way to be sure last message gets through!

14
Replica Consistency Problem: Concurrent and
conflicting updates
  • Imagine we have multiple bank servers and a
    client desiring to update their bank account
  • How can we do this?
  • Allow a client to update any server, then have
    that server propagate the update to other servers?
  • Simple and wrong!
  • Simultaneous and conflicting updates can occur at
    different servers
  • Have client send update to all servers?
  • Same problem: a race condition over which of the
    conflicting updates will reach each server first
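A tiny Python illustration of why delivery order matters. The two updates (a deposit of 100 and 10% interest) and the starting balance are invented for the example. Two servers that receive the same pair of updates in different orders end up with different balances.

```python
def deposit_100(balance):
    return balance + 100

def add_interest(balance):
    return balance + balance // 10   # 10% interest, integer arithmetic

def apply_all(balance, updates):
    """Apply updates in the order this server happened to receive them."""
    for update in updates:
        balance = update(balance)
    return balance

server_1 = apply_all(1000, [deposit_100, add_interest])
server_2 = apply_all(1000, [add_interest, deposit_100])
print(server_1, server_2)  # 1210 1200: the replicas have diverged
```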

15
Two-phase commit
  • Since we can't solve the Generals' Paradox (i.e.
    simultaneous action), and concurrent and conflicting
    updates may be sent by clients, let's solve a
    related problem
  • Distributed transaction: two machines agree to do
    something, or not do it, atomically
  • Algorithm for providing atomic updates in a
    distributed system
  • Give the servers (or replicas) a chance to say no
    and if any server says no, client aborts the
    operation

16
Framework
  • Goal: update all replicas atomically
  • Either everyone commits or everyone aborts
  • No inconsistencies even in face of failures
  • Caveat: assume only crash or fail-stop failures
  • Crash: servers stop when they fail; they do not
    continue and generate bad data
  • Fail-stop: in addition to crash, the failure is
    detectable.
  • Definitions
  • Coordinator: software entity that shepherds the
    process (in our example could be one of the
    servers)
  • Ready to commit: side effects of update safely
    stored on non-volatile storage
  • Even if I crash, once I say I am ready to commit
    then a recovery procedure will find evidence and
    continue with the commit protocol

17
Two Phase Commit: Phase 1
  • Coordinator sends a PREPARE message to each
    replica
  • Coordinator waits for all replicas to reply with
    a vote
  • Each participant replies with a vote
  • Votes PREPARED if ready to commit, and locks the
    data items being updated
  • Votes NO if unable to get a lock or unable to
    ensure it is ready to commit

18
Two Phase Commit: Phase 2
  • If coordinator receives PREPARED vote from all
    replicas then it may decide to commit or abort;
    if any replica votes NO, it must abort
  • Coordinator sends its decision to all participants
  • If participant receives COMMIT decision then
    commit changes resulting from update
  • If participant receives ABORT decision then
    discard changes resulting from update
  • Participant replies DONE
  • When coordinator has received DONE from all
    participants then it can delete its record of the
    outcome
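The two phases can be sketched in Python as follows. This is a failure-free sketch under simplifying assumptions: method calls stand in for messages, the flag `can_commit` stands in for "got the locks and logged the update", and the coordinator always commits when every vote is PREPARED. The names `Participant` and `two_phase_commit` are illustrative.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # assumption: locks held, update logged
        self.state = "INIT"

    def prepare(self):
        """Phase 1: vote PREPARED or NO."""
        if self.can_commit:
            self.state = "PREPARED"
            return "PREPARED"
        self.state = "ABORTED"
        return "NO"

    def decide(self, decision):
        """Phase 2: apply the coordinator's decision, then acknowledge."""
        self.state = "COMMITTED" if decision == "COMMIT" else "ABORTED"
        return "DONE"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]            # phase 1
    decision = "COMMIT" if all(v == "PREPARED" for v in votes) else "ABORT"
    acks = [p.decide(decision) for p in participants]      # phase 2
    # Once every DONE is in, the coordinator may forget the outcome.
    assert all(ack == "DONE" for ack in acks)
    return decision

happy = [Participant("a"), Participant("b")]
print(two_phase_commit(happy), [p.state for p in happy])

unhappy = [Participant("a"), Participant("b", can_commit=False)]
print(two_phase_commit(unhappy), [p.state for p in unhappy])
```

A single NO vote is enough to abort everywhere, which is the "chance to say no" that the protocol gives each replica.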

19
Performance
  • In absence of failure, 2PC (two-phase commit)
    makes a total of 2 (1.5?) round trips of messages
    before decision is made
  • Prepare
  • Vote NO or PREPARED
  • Commit/abort
  • Done (but done just for bookkeeping, does not
    affect response time)

20
Failure Handling in 2PC: Replica Failure
  • The log contains a <commit T> record.
  • In this case, the site executes redo(T).
  • The log contains an <abort T> record.
  • In this case, the site executes undo(T).
  • The log contains a <ready T> record.
  • In this case, consult coordinator Ci.
  • If Ci is down, the site sends a query-status T
    message to the other sites.
  • The log contains no control records concerning T.
  • In this case, the site executes undo(T).
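The four recovery cases map onto a small decision function. A sketch in Python, assuming a hypothetical log format of (record kind, transaction) tuples; the function name and the returned strings are illustrative.

```python
def recovery_action(log, txn):
    """Decide what a recovering replica does about transaction `txn`.
    `log` is a list of (kind, txn) tuples, e.g. ("ready", "T")."""
    kinds = {kind for kind, t in log if t == txn}
    if "commit" in kinds:
        return "redo"
    if "abort" in kinds:
        return "undo"
    if "ready" in kinds:
        # Voted yes but never saw the outcome: cannot decide alone.
        return "ask coordinator (or query other sites if it is down)"
    return "undo"   # no control records: the replica never voted

print(recovery_action([("ready", "T"), ("commit", "T")], "T"))  # redo
print(recovery_action([("ready", "T")], "T"))
print(recovery_action([], "T"))                                 # undo
```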

21
Failure Handling in 2PC: Coordinator Ci Failure
  • If an active site contains a <commit T> record in
    its log, then T must be committed.
  • If an active site contains an <abort T> record in
    its log, then T must be aborted.
  • If some active site does not contain the record
    <ready T> in its log, then the failed coordinator
    Ci cannot have decided to commit T. Rather than
    wait for Ci to recover, it is preferable to abort
    T.
  • All active sites have a <ready T> record in their
    logs, but no additional control records. In this
    case we must wait for the coordinator to recover.
  • Blocking problem: T is blocked pending the
    recovery of site Si.
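The rules for a failed coordinator can be read as a decision over the logs of the sites that are still up. A sketch in Python, assuming each active site's log for T is summarized as a set of record kinds (a made-up representation); the function name is illustrative.

```python
def resolve_without_coordinator(active_logs):
    """active_logs: one set of record kinds for T per active site,
    e.g. {"ready"} or {"ready", "commit"}. Returns the outcome."""
    if any("commit" in log for log in active_logs):
        return "commit"
    if any("abort" in log for log in active_logs):
        return "abort"
    if any("ready" not in log for log in active_logs):
        # Someone never voted yes, so Ci cannot have decided to commit.
        return "abort"
    return "blocked"   # everyone is ready, nobody knows the decision

print(resolve_without_coordinator([{"ready"}, {"ready", "commit"}]))  # commit
print(resolve_without_coordinator([{"ready"}, set()]))                # abort
print(resolve_without_coordinator([{"ready"}, {"ready"}]))            # blocked
```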

22
Failure Handling
  • Failure detected with timeouts
  • If a participant times out before getting a PREPARE,
    it can abort
  • If the coordinator times out waiting for a vote, it
    can abort
  • If a participant times out waiting for a decision,
    it is blocked!
  • Wait for coordinator to recover?
  • Punt to some other resolution protocol
  • If the coordinator times out waiting for DONE, keep
    record of outcome
  • other sites may have a replica.

23
Failures in distributed systems
  • We may want to avoid relying on a single
    server/coordinator/boss to make progress
  • Thus we want the decision making to be distributed
    among the participants (all nodes created
    equal): this is the consensus problem in
    distributed systems.
  • However, depending on what we can assume about the
    network, it may be impossible to reach a decision
    in some cases!

24
Impossibility of Consensus
  • Network characteristics
  • Synchronous - some upper bound on
    network/processing delay.
  • Asynchronous - no upper bound on
    network/processing delay.
  • Fischer, Lynch and Paterson showed:
  • With even just one failure possible, you cannot
    guarantee consensus.
  • Cannot guarantee consensus process will terminate
  • Assumes asynchronous network
  • Essence of proof: just before a decision is
    reached, we can delay a node slightly too long to
    reach a decision.
  • But we still want to do it.. Right?

25
Distributed Decision Making Discussion
  • Why is distributed decision making desirable?
  • Fault Tolerance!
  • A group of machines can come to a decision even
    if one or more of them fail during the process
  • Simple failure mode called fail-stop (different
    modes later)
  • After decision made, result recorded in multiple
    places
  • Undesirable feature of Two-Phase Commit: blocking
  • One machine can be stalled until another site
    recovers
  • Site B writes a "prepared to commit" record to its
    log, sends a yes vote to the coordinator (site
    A) and crashes
  • Site A crashes
  • Site B wakes up, checks its log, and realizes that
    it has voted yes on the update. It sends a
    message to site A asking what happened. At this
    point, B cannot decide to abort, because the update
    may have committed
  • B is blocked until A comes back
  • A blocked site holds resources (locks on updated
    items, pages pinned in memory, etc.) until it learns
    the fate of the update
  • Alternative: there are alternatives such as
    Three-Phase Commit which don't have this
    blocking problem
  • What happens if one or more of the nodes is
    malicious?
  • Malicious: attempting to compromise the decision
    making
  • Known as Byzantine fault tolerance. More on this
    next time

26
Introducing wall clock time
  • Back to the notion of time
  • Distributed systems sometimes need a more precise
    notion of time than happens-before
  • There are several options
  • Instead of using network/process identity to break
    ties
  • Extend a logical clock with the clock time and
    use it to break ties
  • Makes meaningful statements like "B and D were
    concurrent, although B occurred first"
  • But unless clocks are closely synchronized such
    statements could be erroneous!
  • We use a clock synchronization algorithm to
    reconcile differences between clocks on various
    computers in the network

27
Synchronizing clocks
  • Without help, clocks will often differ by many
    milliseconds
  • Problem is that when a machine downloads time
    from a network clock it can't be sure what the
    delay was
  • This is because the uplink and downlink
    delays are often very different in a network
  • Outright failures of clocks are rare

28
Synchronizing clocks
  • Suppose p synchronizes with time.windows.com and
    notes that 123 ms elapsed while the protocol was
    running. What time is it now?

[Figure: p asks time.windows.com "What time is it?" and receives the reply 09:23.02921; delay: 123 ms]
29
Synchronizing clocks
  • Options?
  • p could guess that the delay was evenly split,
    but this is rarely the case in WAN settings
    (downlink speeds are higher)
  • p could ignore the delay
  • p could factor in only the delay that is certain,
    e.g. if we know that the link takes at least 5 ms
    in each direction. Works best with GPS time sources!
  • In general we can't do better than the uncertainty
    in the link delay from the time source down to p
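The options above can be made concrete with a short calculation. A sketch in Python under stated assumptions: the server stamps its reply as it sends it, server processing time is negligible, and `min_one_way_ms` is a known lower bound on the delay in each direction. The function name is illustrative. The reply's delay then lies somewhere between the lower bound and the round trip minus the lower bound, so the midpoint is the "evenly split" guess and half the width of that interval is the remaining uncertainty.

```python
def estimate_time(server_time_ms, round_trip_ms, min_one_way_ms=0.0):
    """Return (estimate, uncertainty), both in milliseconds.

    The downlink delay d satisfies
        min_one_way_ms <= d <= round_trip_ms - min_one_way_ms,
    so "now" is server_time_ms + d for some d in that interval.
    """
    estimate = server_time_ms + round_trip_ms / 2
    uncertainty = round_trip_ms / 2 - min_one_way_ms
    return estimate, uncertainty

# The 123 ms exchange from the earlier slide, measured from the server's stamp.
print(estimate_time(0, 123))       # (61.5, 61.5): nothing known about the link
print(estimate_time(0, 123, 5))    # (61.5, 56.5): at least 5 ms each way
```

Knowing a minimum link delay narrows the error bar but does not remove it, which matches the last bullet above.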

30
Consequences?
  • In a network of processes, we must assume that
    clocks are
  • Not perfectly synchronized.
  • We say that clocks are inaccurate
  • Even GPS has uncertainty, although small
  • And clocks can drift during periods between
    synchronizations
  • Relative drift between clocks is their precision

31
Temporal distortions
  • Things can be complicated because we can't
    predict
  • Message delays (they vary constantly)
  • Execution speeds (often a process shares a
    machine with many other tasks)
  • Timing of external events
  • Lamport looked at this question too

32
Temporal distortions
  • What does "now" mean?

[Figure: timelines of processes p0, p1, p2, p3 with events a, b, c, d, e, f]
33
Temporal distortions
What does "now" mean?
[Figure: timelines of processes p0, p1, p2, p3 with events a, b, c, d, e, f]
34
Temporal distortions
Timelines can "stretch": caused by scheduling effects, message delays, message loss
[Figure: timelines of processes p0, p1, p2, p3 with events a, b, c, d, e, f]
35
Temporal distortions
Timelines can "shrink": e.g. something lets a machine speed up
[Figure: timelines of processes p0, p1, p2, p3 with events a, b, c, d, e, f]
36
Temporal distortions
Cuts represent instants of time. But not every cut makes sense: black cuts could occur, but not gray ones.
[Figure: timelines of processes p0, p1, p2, p3 with events a, b, c, d, e, f, crossed by black and gray cuts]
37
Consistent cuts and snapshots
  • Idea is to identify system states that might
    have occurred in real life
  • Need to avoid capturing states in which a message
    is received but nobody is shown as having sent it
  • This is the problem with the gray cuts

38
Temporal distortions
Red messages cross gray cuts backwards
[Figure: timelines of processes p0, p1, p2, p3 with events a, b, c, d, e, f; red messages cross the gray cuts]
39
Temporal distortions
Red messages cross gray cuts backwards. In a nutshell: the cut includes a message that was never sent.
[Figure: timelines of processes p0, p1, p2, p3 with events a, b, c, e]
40
Who cares?
  • Suppose, for example, that we want to do
    distributed deadlock detection
  • System lets processes wait for actions by other
    processes
  • A process can only do one thing at a time
  • A deadlock occurs if there is a circular wait

41
Deadlock detection algorithm
  • p worries: perhaps we have a deadlock
  • p is waiting for q, so sends "what's your state?"
  • q, on receipt, is waiting for r, so sends the
    same question, and r for s. And s is waiting on
    p.
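The query chain amounts to following waits-for edges until they either run out or come back to the process that started asking. A sketch in Python, assuming the answers have already been collected into a dictionary (the table and function name are illustrative). Collecting answers this way treats them as one instantaneous view of the system, and whether that is justified is exactly the question the next slides raise.

```python
def find_cycle(waiting_for, start):
    """Follow waits-for edges from `start`. Return the cycle as a list
    if the chain leads back to `start`, otherwise None."""
    path = [start]
    current = waiting_for.get(start)
    while current is not None and current not in path:
        path.append(current)
        current = waiting_for.get(current)
    return path if current == start else None

# Each process waits for at most one other ("one thing at a time").
waiting_for = {"p": "q", "q": "r", "r": "s", "s": "p"}
print(find_cycle(waiting_for, "p"))                   # ['p', 'q', 'r', 's']
print(find_cycle({"p": "q", "q": "r"}, "p"))          # None: r is not waiting
```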

42
Suppose we detect this state
  • We see a cycle
  • but is it a deadlock?

[Figure: wait-for cycle among p, q, r and s, each edge labeled "Waiting for"]
43
Phantom deadlocks!
  • Suppose system has a very high rate of locking.
  • Then perhaps a lock release message passed a
    query message
  • i.e. we see q waiting for r and r waiting for
    s, but in fact, by the time we checked r, q was
    no longer waiting!
  • In effect we checked for deadlock on a gray cut:
    an inconsistent cut.

44
Consistent cuts and snapshots
  • Goal is to draw a line across the system state
    such that
  • Every message received by a process is shown as
    having been sent by some other process
  • Some pending messages might still be in
    communication channels
  • A cut is the frontier of a snapshot

45
Chandy/Lamport Algorithm
  • Assume that if pi can talk to pj they do so using
    a lossless, FIFO connection
  • Now think about logical clocks
  • Suppose someone sets his clock way ahead and
    triggers a flood of messages
  • As these reach each process, it advances its own
    time; eventually all do so.
  • The point where time jumps forward is a
    consistent cut across the system

46
Using logical clocks to make cuts
Message sets the time forward by a "lot"
[Figure: timelines of processes p0, p1, p2, p3 with events a, b, c, d, e, f]
Algorithm requires FIFO channels: must delay e until b has been delivered!
47
Using logical clocks to make cuts
Cut occurs at point where time advanced
[Figure: timelines of processes p0, p1, p2, p3 with events a, b, c, d, e, f]
48
Turn idea into an algorithm
  • To start a new snapshot, pi
  • Builds a message: "pi is initiating snapshot k."
  • The tuple (pi, k) uniquely identifies the
    snapshot
  • In general, on first learning about snapshot (pi,
    k), px
  • Writes down its state: px's contribution to the
    snapshot
  • Starts "tape recorders" for all communication
    channels
  • Forwards the message on all outgoing channels
  • Stops the "tape recorder" for a channel when a
    snapshot message for (pi, k) is received on it
  • Snapshot consists of all the local state
    contributions and all the tape recordings for the
    channels
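The steps above can be simulated in a few dozen lines of Python. This is a sketch under simplifying assumptions, not the full algorithm: there is a single snapshot (so the (pi, k) identifier is omitted), the application state is just a token count per process, channels are lossless FIFO queues, and contributions are not shipped back to the initiator. The names `Process`, `Network` and `snapshot_total` are illustrative.

```python
from collections import deque

class Process:
    def __init__(self, name, tokens):
        self.name = name
        self.tokens = tokens        # application state
        self.recorded_state = None  # this process's snapshot contribution
        self.recording = {}         # incoming channel -> messages taped so far
        self.closed = {}            # incoming channel -> finished recording

class Network:
    def __init__(self, tokens):
        self.procs = {name: Process(name, t) for name, t in tokens.items()}
        # One lossless FIFO channel per ordered pair of processes.
        self.chan = {(a, b): deque() for a in tokens for b in tokens if a != b}

    def send_tokens(self, src, dst, n):
        self.procs[src].tokens -= n
        self.chan[(src, dst)].append(("app", n))

    def _record_and_flood(self, p):
        p.recorded_state = p.tokens
        for (a, b), queue in self.chan.items():
            if b == p.name:
                p.recording[(a, b)] = []        # start a tape recorder
            if a == p.name:
                queue.append(("marker", None))  # forward the snapshot message

    def start_snapshot(self, initiator):
        self._record_and_flood(self.procs[initiator])

    def deliver(self, src, dst):
        kind, n = self.chan[(src, dst)].popleft()
        p = self.procs[dst]
        if kind == "app":
            p.tokens += n
            if (src, dst) in p.recording:
                p.recording[(src, dst)].append(n)
        else:
            if p.recorded_state is None:        # first news of the snapshot
                self._record_and_flood(p)
            # A marker on this channel stops its tape recorder.
            p.closed[(src, dst)] = p.recording.pop((src, dst))

    def snapshot_total(self):
        states = sum(p.recorded_state for p in self.procs.values())
        taped = sum(sum(msgs) for p in self.procs.values()
                    for msgs in p.closed.values())
        return states + taped

net = Network({"p": 10, "q": 10})
net.send_tokens("q", "p", 3)   # in flight when the snapshot starts
net.start_snapshot("p")        # p records 10 and floods a marker
net.send_tokens("p", "q", 2)   # sent after p recorded its state
net.deliver("q", "p")          # p tapes the 3 tokens on channel q -> p
net.deliver("p", "q")          # q sees the marker and records 7
net.deliver("p", "q")          # the 2 tokens arrive after the cut at q
net.deliver("q", "p")          # q's marker stops p's recorder
print(net.snapshot_total())    # 20
```

The recorded states (10 and 7) plus the taped channel contents (3) add up to the 20 tokens in the system, even though no single moment of the run had p at 10 and q at 7 with nothing else going on. The FIFO assumption is what guarantees that the 2 tokens sent after p's marker are not taped at q.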

49
Chandy/Lamport
  • This algorithm, but implemented with an outgoing
    flood, followed by an incoming wave of snapshot
    contributions
  • Snapshot ends up accumulating at the initiator,
    pi
  • Algorithm doesn't tolerate process failures or
    message failures.

50
Chandy/Lamport
[Figure: a network of processes p, q, r, s, t, u, v, w, x, y, z]
51
Chandy/Lamport
[Figure: a network of processes p through z; p: "I want to start a snapshot"]
52
Chandy/Lamport
[Figure: a network of processes p through z; p records local state]
53
Chandy/Lamport
[Figure: a network of processes p through z; p starts monitoring incoming channels]
54
Chandy/Lamport
[Figure: a network of processes p through z; annotation: "contents of channel p-y"]
55
Chandy/Lamport
[Figure: a network of processes p through z; p floods message on outgoing channels]
56
Chandy/Lamport
[Figure: a network of processes p through z]
57
Chandy/Lamport
[Figure: a network of processes p through z; q is done]
58
Chandy/Lamport
[Figure: a network of processes p through z; additional label shown: q]
59
Chandy/Lamport
[Figure: a network of processes p through z; additional label shown: q]
60
Chandy/Lamport
[Figure: a network of processes p through z; additional labels shown: q, s, z]
61
Chandy/Lamport
[Figure: a network of processes p through z; additional labels shown: q, s, u, v, x, z]
62
Chandy/Lamport
[Figure: a network of processes p through z; additional labels shown: q, r, s, u, v, w, x, y, z]
63
Chandy/Lamport
[Figure: a snapshot of a network; p: "Done!"; every process p through z is shown with a second label]
64
What's in the "state"?
  • In practice we only record things important to
    the application running the algorithm, not the
    whole state
  • E.g. locks currently held, lock release
    messages
  • Idea is that the snapshot will be
  • Easy to analyze, letting us build a picture of
    the system state
  • And will have everything that matters for our
    real purpose, like deadlock detection