Title: CS 141a Distributed Computation Laboratory: Detection Algorithms
1. Detecting Properties of Distributed Systems
- Central ideas
- Understanding global snapshots
- Developing detection algorithms for different problems, starting from global snapshots
- Understanding logical clocks
2. REVIEW
- Time lines
- True snapshots taken at a time t.
- Problem with absence of global clock.
- Distributed snapshots.
3. Process Time Lines
[Figure: time lines for processes P, Q, and R against a time axis, with state changes marked on each line and messages drawn between lines.]
4. Time Lines and True Snapshots
All processes take their local snapshots at exactly the same time.
[Figure: process time lines with every local snapshot taken at the same time t.]
5. Time Lines and Distributed Snapshots
Processes take local snapshots at different times that satisfy the criterion.
[Figure: process time lines divided by a boundary into events BEFORE the snapshot and events AFTER the snapshot.]
6. KEY Property of Distributed Snapshots
- Every edge between a BEFORE snapshot event and an AFTER snapshot event is directed from the BEFORE snapshot event to the AFTER snapshot event.
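The key property can be checked mechanically. The following is a minimal sketch, assuming a hypothetical representation that is not in the slides: each event is a pair (process, index on that process's time line), a cut maps each process to the number of its events that are BEFORE the snapshot, and each message is a (send event, receive event) pair. Because the BEFORE events of a process form a prefix of its time line, edges along a process line already point from BEFORE to AFTER, so only message edges need checking.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: an event is (process name, index on its time line).
Event = Tuple[str, int]
Message = Tuple[Event, Event]  # (send event, receive event)


def is_before(event: Event, cut: Dict[str, int]) -> bool:
    """An event is BEFORE the snapshot if it lies in its process's prefix."""
    proc, index = event
    return index < cut[proc]


def satisfies_key_property(cut: Dict[str, int], messages: List[Message]) -> bool:
    """True if no message is sent by an AFTER event and received by a BEFORE event."""
    return all(
        is_before(send, cut) or not is_before(recv, cut)
        for send, recv in messages
    )


# P's event 0 sends to Q's event 1; Q's event 2 sends to R's event 0.
MESSAGES = [(("P", 0), ("Q", 1)), (("Q", 2), ("R", 0))]
GOOD_CUT = {"P": 1, "Q": 2, "R": 0}  # the first message crosses BEFORE -> AFTER or stays BEFORE
BAD_CUT = {"P": 0, "Q": 2, "R": 0}   # Q's receive is BEFORE but P's send is AFTER
```

With these inputs, `satisfies_key_property(GOOD_CUT, MESSAGES)` holds and `satisfies_key_property(BAD_CUT, MESSAGES)` does not.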
7. Time Lines and Distributed Snapshots
Directions of edges crossing the boundary are from BEFORE events to AFTER events.
[Figure: process lines and message lines, with a boundary separating events BEFORE the snapshot from events AFTER the snapshot.]
8. Process States at Distributed Snapshots
[Figure: a boundary separates events BEFORE the snapshot from events AFTER the snapshot; the state of each of the processes P, Q, and R is its state at the marked point on its time line.]
9. Channel States at Distributed Snapshots
The state of the channel from P to Q is given by the sequence of message lines crossing the BEFORE-AFTER boundary.
[Figure: message lines from P to Q crossing the boundary that follows the events BEFORE the snapshot.]
10. Start and Finish of Distributed Snapshots
[Figure: time axis marking the start and the finish of the snapshot, with the snapshot in progress between them.]
11. Key Theorem
- (1) The distributed snapshot state is reachable from the snapshot initiating state, and
- (2) the distributed snapshot state can reach the snapshot terminating state.
12. Proof of Key Theorem
[Figure: depiction of a computation along a time axis, with BEFORE snapshot events in red and AFTER snapshot events in black.]
13. Proof of Key Theorem
Given a computation, flipping the order of a BEFORE event that immediately follows an AFTER event gives us a new computation with the same states.
[Figure: a flip exchanging an adjacent AFTER snapshot event and BEFORE snapshot event.]
14. Proof of Key Theorem
- Cases
- AFTER event is a
- Message receive,
- Message send
- Autonomous (local) process event
- Cases
- BEFORE event is a
- Message receive,
- Message send
- Autonomous (local) process event
15. Proof of Key Theorem
- To make the flip we need only prohibit
- the BEFORE event receiving a message sent by the AFTER event, and
- the BEFORE event appearing later than the AFTER event on the same process.
16. Proof of Key Theorem
Carry out flips of pairs of adjacent events where a BEFORE event follows an AFTER event.
17. Proof of Key Theorem
Repeat such flips until all BEFORE events occur before all AFTER events.
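The repeated flipping can be sketched as a bubble-sort-style pass over the computation. This is only an illustration under assumed names: the `Ev` record, its `sends` and `receives` message identifiers, and the functions `can_flip` and `reorder` are hypothetical and do not come from the slides. A flip is refused exactly in the two prohibited situations, which never arise for a valid distributed snapshot.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Ev:
    proc: str                       # process on which the event occurs
    tag: str                        # "BEFORE" or "AFTER" the snapshot
    sends: Optional[int] = None     # id of the message sent, if any
    receives: Optional[int] = None  # id of the message received, if any


def can_flip(after: Ev, before: Ev) -> bool:
    """An AFTER event directly followed by a BEFORE event may be flipped unless
    both are on the same process or the BEFORE event receives the AFTER event's message."""
    same_process = after.proc == before.proc
    receives_from_after = after.sends is not None and after.sends == before.receives
    return not same_process and not receives_from_after


def reorder(computation: List[Ev]) -> List[Ev]:
    """Flip adjacent (AFTER, BEFORE) pairs until all BEFORE events come first."""
    comp = list(computation)
    changed = True
    while changed:
        changed = False
        for i in range(len(comp) - 1):
            a, b = comp[i], comp[i + 1]
            if a.tag == "AFTER" and b.tag == "BEFORE":
                if not can_flip(a, b):
                    raise ValueError("tags do not describe a valid snapshot")
                comp[i], comp[i + 1] = b, a
                changed = True
    return comp


EXAMPLE = [
    Ev("P", "BEFORE", sends=1),
    Ev("P", "AFTER"),
    Ev("Q", "BEFORE", receives=1),
    Ev("Q", "AFTER", sends=2),
    Ev("R", "BEFORE"),
    Ev("R", "AFTER", receives=2),
]
```

Calling `reorder(EXAMPLE)` returns the same six events with the three BEFORE events first; the order of events on each process is unchanged, and every message is still sent before it is received.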
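18. Proof of Key Theorem
[Figure: the reordered computation, with the snapshot state lying between the point where the snapshot algorithm starts and the point where the snapshot algorithm ends.]
19. Proof of Key Theorem
[Figure: the same computation; the snapshot state is reachable from the state where the snapshot algorithm starts, and the state where the snapshot algorithm ends is reachable from the snapshot state.]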
20. Deriving Snapshot Algorithms
- We need only prohibit
- a BEFORE event receiving a message sent by an AFTER event, and
- a BEFORE event appearing later than an AFTER event on the same process.
21. Snapshot Algorithm 1
- The initiator takes its local snapshot and sends one signal message on each of its outgoing channels.
- When a process receives a signal for the first time, it takes its local snapshot, sends a signal on each of its outgoing channels, and records the state of the channel on which it received its first signal as being empty.
- When a process receives a signal after it has recorded its state, the process records the state of the channel on which the signal was received as the sequence of messages the process received after it recorded its state and before it received the signal.
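The three rules above can be sketched as a small single-threaded simulation. This is an illustration under stated assumptions, not the course's implementation: channels are assumed to be FIFO queues, process state is a hypothetical integer balance, application messages carry an amount transferred between balances, and the names `Snapshotter`, `record`, `deliver`, and `demo` are invented for this sketch.

```python
from collections import deque

SIGNAL = "SIGNAL"  # the signal message of the algorithm


class Snapshotter:
    def __init__(self, balances, channels):
        self.state = dict(balances)                  # live state of each process
        self.queue = {c: deque() for c in channels}  # FIFO channel (src, dst)
        self.recorded = {}                           # process -> recorded local state
        self.channel_state = {}                      # channel -> recorded messages
        self.recording = set()                       # channels still being recorded

    def send(self, src, dst, amount):
        """Application message: move `amount` from src toward dst."""
        self.state[src] -= amount
        self.queue[(src, dst)].append(amount)

    def record(self, p, first_signal_channel=None):
        """Take p's local snapshot and send a signal on each outgoing channel."""
        self.recorded[p] = self.state[p]
        for c in self.queue:
            if c[1] == p:  # incoming channel of p
                self.channel_state[c] = []
                if c != first_signal_channel:  # that channel is recorded as empty
                    self.recording.add(c)
            if c[0] == p:  # outgoing channel of p
                self.queue[c].append(SIGNAL)

    def deliver(self, src, dst):
        """Deliver the message at the head of channel (src, dst)."""
        c = (src, dst)
        msg = self.queue[c].popleft()
        if msg == SIGNAL:
            if dst not in self.recorded:
                self.record(dst, first_signal_channel=c)
            else:
                self.recording.discard(c)  # channel state is now complete
        else:
            self.state[dst] += msg
            if c in self.recording:  # received after snapshot, before signal
                self.channel_state[c].append(msg)


def demo():
    net = Snapshotter({"P": 10, "Q": 10}, [("P", "Q"), ("Q", "P")])
    net.send("P", "Q", 3)  # in flight when P starts the snapshot
    net.record("P")        # P is the initiator
    net.send("Q", "P", 2)  # Q sends before it has seen a signal
    net.deliver("P", "Q")  # Q receives 3
    net.deliver("P", "Q")  # Q receives its first signal and snapshots
    net.deliver("Q", "P")  # P receives 2 after its snapshot
    net.deliver("Q", "P")  # P receives the signal from Q
    return net.recorded, net.channel_state
```

In this run the recorded process states plus the recorded channel contents add up to the initial total of 20, even though no single instant of the run had P at 7 and Q at 11.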
22. Proof of Correctness
- The algorithm ensures that the following never happens:
- a BEFORE event receiving a message sent by an AFTER event;
- a BEFORE event appearing later than an AFTER event on the same process.
23. Snapshot Algorithm 2
- Logical clocks.
- Each process ticks its local (integer) clock forward by at least one with each event.
- When a process sends a message, it timestamps the message with its local clock at the time that it sent the message.
- When a receiver gets a message with timestamp T, the receiver ensures that its local clock upon receiving the message is greater than T.
- All processes take their local snapshots at the same logical time.
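The clock rules above can be sketched as follows. This is a minimal illustration with hypothetical names (`LogicalClock`, `is_before`, `demo`); it assumes each process advances its clock by exactly one per event, which is one way to satisfy "at least one", and that an event counts as BEFORE the snapshot when its clock value is at most the agreed logical snapshot time.

```python
class LogicalClock:
    """Integer logical clock owned by a single process."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: advance the clock by one."""
        self.time += 1
        return self.time

    def send(self):
        """Sending is an event; the returned value is the message timestamp."""
        return self.tick()

    def receive(self, timestamp):
        """Receiving is an event whose clock value must exceed the timestamp."""
        self.time = max(self.time, timestamp) + 1
        return self.time


def is_before(event_time, snapshot_time):
    """Assumed convention: events with clock value <= snapshot_time are BEFORE."""
    return event_time <= snapshot_time


def demo():
    p, q = LogicalClock(), LogicalClock()
    for _ in range(3):
        p.tick()                      # three local events on p
    sent_at = p.send()                # p's fourth event
    received_at = q.receive(sent_at)  # q jumps past the timestamp
    return sent_at, received_at
```

Because the receive always has a larger clock value than the send, no choice of logical snapshot time can make the receive a BEFORE event while the send is an AFTER event.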
24. Proof of Correctness
- The algorithm ensures that the following never happens:
- a BEFORE event receiving a message sent by an AFTER event;
- a BEFORE event appearing later than an AFTER event on the same process.
25. Termination Detection
- A distributed system is represented by a directed graph where vertices represent processes and directed edges represent directed channels.
- The graph is fixed, so processes and channels are not created or destroyed.
- A process is either idle or active.
- An idle process remains idle until it receives a message on one of its incoming channels.
- An active process can send messages at any time.
- An active process can become idle at any time.
26. Termination Detection
- Detect when all processes are idle and all
channels are empty.
27. Termination Detection Algorithm 1
Given a distributed system represented by a directed graph.
[Figure: directed graph whose vertices are processes and whose edges are directed channels.]
28. Termination Detection Algorithm 1
The detector is a process that is part of the operating system.
[Figure: the same graph with a detector process added and a channel from each client process to the detector.]
29. Termination Detection Algorithm 1
- When an active process becomes idle it sends a message to the detector (along the channel to the detector). The message contains the number of messages received on each incoming channel of the process and the number of messages sent on each outgoing channel of the process.
- The detector has two local variables for each channel c in the underlying distributed system: numberSent[c] and numberReceived[c]. These hold the numbers of messages sent and received on channel c, as given in the messages that the detector last received about channel c from client processes.
30. Termination Detection Algorithm 1
- When the detector receives a message from a client process, the detector updates its values of numberSent and numberReceived to the values in the message. For example, if the message from the client reports 20 messages received on an incoming channel c and 30 messages sent on an outgoing channel d, then the detector sets its local variables numberSent[d] to 30 and numberReceived[c] to 20.
31. Termination Detection Algorithm 1
- After updating its local variables, the detector checks whether numberSent[c] = numberReceived[c] for all channels c, and if so, the detector claims termination.
- Initial values of the detector's local variables are set to ensure that this termination condition is not satisfied initially. For example, set numberSent[c] to 2 and numberReceived[c] to -1.
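The detector's bookkeeping can be sketched as below. This is an illustration, not the course's code: the class name `Detector`, the method `on_report`, and the use of (source, destination) pairs as channel names are assumptions, and the default initial values are simply the example values from the slide.

```python
class Detector:
    """Per-channel termination detector (sketch of Algorithm 1)."""

    def __init__(self, channels, initial_sent=2, initial_received=-1):
        # Initial values differ on every channel, so termination
        # cannot be claimed before any report arrives.
        self.number_sent = {c: initial_sent for c in channels}
        self.number_received = {c: initial_received for c in channels}

    def on_report(self, sent, received):
        """Handle a report from a process that has just become idle.

        sent:     {outgoing channel: messages sent on it}
        received: {incoming channel: messages received on it}
        Returns True if the detector claims termination.
        """
        self.number_sent.update(sent)
        self.number_received.update(received)
        return all(
            self.number_sent[c] == self.number_received[c]
            for c in self.number_sent
        )


def demo():
    pq, qp = ("P", "Q"), ("Q", "P")
    detector = Detector([pq, qp])
    # P sent one message to Q, received none, and became idle.
    first = detector.on_report(sent={pq: 1}, received={qp: 0})
    # Q received P's message, sent none, and became idle.
    second = detector.on_report(sent={qp: 0}, received={pq: 1})
    return first, second
```

In this run the detector does not claim termination after P's report, because Q's counts still hold their initial values, and claims it after Q's report, when both channels show equal sent and received counts.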
32. Termination Detection Algorithm
- Note that the values of the local variables numberSent[c] and numberReceived[c] are NOT, in general, the number of messages sent on channel c and the number of messages received on channel c at this instant.
- The messages from client processes to the detector can take arbitrary (finite) time in flight. So, by the time a message from a client reaches the detector, the message may be old and not representative of the current situation in the client.
- Is this algorithm correct?
33. Proof Obligation
- We must prove that the following never happens:
- a BEFORE event receiving a message sent by an AFTER event;
- a BEFORE event appearing later than an AFTER event on the same process.
- In this termination detection algorithm, what are the points in the process time line at which each process snapshots itself?
34. Proof Obligation
In this termination detection algorithm, what are the points in the process time line at which each process snapshots itself?
The detector claimed termination based on certain values of numberSent[c] and numberReceived[c] for all c. Let's say these values are NS[c] and NR[c]. The detector set numberSent[c] to NS[c] when it received a message from a client; likewise, it set numberReceived[c] to NR[c] when it received a message from a client. The snapshot point for a process is the point at which it sent the message containing the values NS[c] or NR[c].
35. Proof
Proof by contradiction. Consider the first process p to become active after its snapshot.
[Figure: process p's time line with its snapshot point; p is idle at the snapshot point and active at a later point.]
36. Proof
An idle process becomes active only when it receives a message.
[Figure: p's time line, idle at its snapshot point and active after a message arrives from another process's time line.]
37. Proof
- The message was sent by a process either
- After it took its snapshot or
- Before it took its snapshot.
- Let us consider each case in turn.
- We will show that both cases are impossible.
38. Proof
- CASE 1: Suppose the message was sent by a process q after q took its snapshot.
- When q took its snapshot, q was idle (from the algorithm).
- When q sent the message, q was active (because only active processes send messages).
- So q became active after its snapshot and before sending the message.
- But p is the first process to become active after its snapshot.
- So q became active after p became active, that is, after p received the message.
- So q sent the message after p received the message. This is impossible.
39. Proof
- CASE 2: Suppose a process q sent the message before q took its snapshot.
- The number of messages sent on the channel from q to p, as reported in the message that q sends to the detector, includes this message that q sent to p.
- The number of messages received on the channel from q to p, as reported in the message that p sends to the detector, does not include this message that q sent to p.
- So for this channel c, NS[c] is not equal to NR[c].
- So the detector could not have claimed termination.
40. Proposed Variant of Detection Algorithm
- Instead of keeping track of numbers of messages
sent and received on each channel separately,
keep track of ONLY the total number of messages
sent and received by each process on all channels.
41. The Variant
- Previous Algorithm: When an active process becomes idle it sends a message to the detector (along the channel to the detector). The message contains the number of messages received on each incoming channel of the process and the number of messages sent on each outgoing channel of the process.
- Variant: The message contains the TOTAL number of messages received by the process on all its incoming channels, and the total number of messages sent on all its outgoing channels.
42. The Variant
- Previous Algorithm: The detector has two local variables for each channel c in the underlying distributed system: numberSent[c] and numberReceived[c]. These hold the numbers of messages sent and received on channel c, as given in the messages that the detector last received about channel c from client processes.
- Variant: The detector has two local variables for each process p: totalNumberSent[p] and totalNumberReceived[p].
43. Proposed Variant
- The detector updates its (possibly out-of-date) copy of totalNumberSent[p] and totalNumberReceived[p] when it receives a message from process p, and claims termination if the sum over all p of totalNumberSent[p] equals the sum over all p of totalNumberReceived[p].
- Is this algorithm correct?
44. Counterexample
- The algorithm is incorrect. Here is a counterexample with 3 processes P, Q, R with channels between them. Initially all processes are active and all channels are empty.
- P sends a message to Q and becomes idle. When it becomes idle it sends a message to the detector with totalNumberSent[P] = 1, totalNumberReceived[P] = 0.
- Q sends a message to P and becomes idle. When it becomes idle it sends a message to the detector with totalNumberSent[Q] = 1, totalNumberReceived[Q] = 0.
45. Counterexample (continued)
- P gets the message from Q and becomes active. It sends a message to R and remains active.
- Q gets the message from P and becomes active. It sends a message to R and remains active.
- R receives the messages from P and Q, and then R becomes idle. When R becomes idle it sends a message to the detector with totalNumberSent[R] = 0 and totalNumberReceived[R] = 2.
- The detector's totals are now 1 + 1 + 0 = 2 messages sent and 0 + 0 + 2 = 2 messages received, so the detector claims termination though P and Q are still active!
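The counterexample can be replayed step by step. The sketch below uses hypothetical names (`TotalsDetector`, `replay`) and one assumption the slides do not state: the detector makes no claim until every process has reported at least once. Message contents and channels are not modelled, since only the per-process totals and the idle or active status matter here.

```python
class TotalsDetector:
    """Detector for the proposed variant: one pair of totals per process."""

    def __init__(self, processes):
        # None means "no report yet" (assumed initial condition).
        self.total_sent = {p: None for p in processes}
        self.total_received = {p: None for p in processes}

    def on_report(self, p, sent, received):
        self.total_sent[p] = sent
        self.total_received[p] = received
        if None in self.total_sent.values():
            return False
        return sum(self.total_sent.values()) == sum(self.total_received.values())


def replay():
    procs = ["P", "Q", "R"]
    active = {p: True for p in procs}
    sent = {p: 0 for p in procs}
    received = {p: 0 for p in procs}
    detector = TotalsDetector(procs)
    claims = []

    def send(src):
        sent[src] += 1

    def receive(dst):
        received[dst] += 1
        active[dst] = True  # a message activates an idle process

    def go_idle(p):
        active[p] = False
        claims.append(detector.on_report(p, sent[p], received[p]))

    send("P"); go_idle("P")                    # P -> Q, P reports (1, 0)
    send("Q"); go_idle("Q")                    # Q -> P, Q reports (1, 0)
    receive("P"); send("P")                    # P wakes up and sends to R
    receive("Q"); send("Q")                    # Q wakes up and sends to R
    receive("R"); receive("R"); go_idle("R")   # R reports (0, 2)
    return claims, active
```

The last entry of `claims` is a termination claim, while `active` still shows P and Q as active: the stale totals from P and Q hide the messages they sent after waking up.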