Distributed Synchronization

Slides: 62

Provided by: csBg

1
Distributed Synchronization

No shared memory
processes reside on different machines, so
semaphores are ruled out
Processes (processors a.k.a. Agents)
communicate by message passing
Models of distributed computation
Distributed mutual exclusion
Leader election

2
Model of distributed computation

Events
Sending of messages
Receiving of messages
internal interrupt or time-out
Processes (processors) can wait for events
process p waits for events by executing
Wait for A1, A2, ...
A1 (source; parameters)
code to handle A1
process q executes send(p, A1; parameters)
p will eventually perform the code for A1, with
the unpacked parameters

3
Causality

No global system state
cannot be determined by a single observer
Communication delays
impossible to synchronize two observers
(machines) exactly
Distributed systems are causal (no traveling
back in time)
for each processor separately, events are
totally ordered

4
Simplest causality happens_before

send always happens_before receive
two events of the same agent are ordered
e1 < e3, e4 < e7, e7 ?? e5

5
Ordering events

define the happens_before relation as the
transitive closure of the two relations
e1 <_m e2: send → receive
e3 <_p e4: e3 precedes e4 on processor p
require
e1 <_H e2 and e2 <_H e3 ⇒ e1 <_H e3
in the former example
e1 <_H e8 (msg., processor, msg.)
e2 <_H e7 (msg., processor)

6
Ordering by time stamps

a global partial order can be achieved by a
topological order of the <_H relation
for ordering events during execution, one needs
to compute the order on the fly
This can be done by assigning time-stamps to
events
Lamport 78

7
Lamport's time-stamp algorithm

Initially,
   my_TS = 0
On event e,
   if e is the receipt of a message m,
      my_TS = max(m.TS, my_TS)
   my_TS++
   e.TS = my_TS
   if e is the sending of message m
      m.TS = my_TS
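
A minimal Python sketch of the time-stamping rules above, assuming each processor keeps a single integer counter. The class and method names (`LamportClock`, `local_event`, `send`, `receive`) are illustrative and do not come from the slides.

```python
class LamportClock:
    """Logical clock for one processor (illustrative names)."""

    def __init__(self):
        self.my_ts = 0  # Initially, my_TS = 0

    def local_event(self):
        # Every event advances the clock and is stamped with the new value.
        self.my_ts += 1
        return self.my_ts

    def send(self):
        # Sending is an event; the message carries the event's time-stamp.
        return self.local_event()

    def receive(self, m_ts):
        # Catch up with the sender first, then stamp the receive event.
        self.my_ts = max(m_ts, self.my_ts)
        return self.local_event()


p, q = LamportClock(), LamportClock()
a = p.local_event()   # internal event on p
m_ts = p.send()       # p sends a message stamped m_ts
b = q.receive(m_ts)   # q's receive event is stamped later than the send
```

The receive event always gets a larger time-stamp than the matching send, which is the causality property the next slide states.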

8
Lamport's time-stamps

timestamps assigned by Lamport's algorithm are
causal
e1 <_m e3 ⇒ e1.TS < e3.TS
e1 <_p e4 ⇒ e1.TS < e4.TS

9
Causality violations

the causality relation between two events with equal
time-stamps is not clear
Lamport suggests breaking the tie by processor id
but what about the meaning?

10
Vector time-stamps

try to use a multiple time-stamp
record time-stamps of all processors
Vector time-stamps, containing information (TS)
on all processors
e1.VT ≤_V e2.VT ⇔ e1.VT[i] ≤ e2.VT[i] for
all i
e1.VT <_V e2.VT ⇔ e1.VT ≤_V e2.VT and
e1.VT ≠ e2.VT

11
Vector time-stamps
12
A simple algorithm for VT

Initially,
   my_VT = [0,...,0]
On event e,
   if e is the receipt of message m,
      for i = 1 to M
         my_VT[i] = max(m.VT[i], my_VT[i])
   my_VT[self]++
   e.VT = my_VT
   if e is the sending of message m,
      m.VT = my_VT
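
The vector time-stamp rules can be sketched in Python as follows. This is an illustration under assumptions: processors are numbered from 0, and the helper names (`VectorClock`, `vt_leq`, `vt_less`, `concurrent`) are invented for the example.

```python
class VectorClock:
    """Vector time-stamp for one of m processors (illustrative names)."""

    def __init__(self, self_index, m):
        self.me = self_index
        self.vt = [0] * m  # my_VT = [0,...,0]

    def tick(self):
        # my_VT[self]++ ; the event is stamped with a copy of the vector.
        self.vt[self.me] += 1
        return list(self.vt)

    def send(self):
        return self.tick()  # m.VT = my_VT

    def receive(self, m_vt):
        # Component-wise maximum with the message's vector, then tick.
        self.vt = [max(a, b) for a, b in zip(m_vt, self.vt)]
        return self.tick()


def vt_leq(a, b):
    return all(x <= y for x, y in zip(a, b))


def vt_less(a, b):
    # a precedes b: a <= b in every component and the vectors differ.
    return vt_leq(a, b) and a != b


def concurrent(a, b):
    return a != b and not vt_less(a, b) and not vt_less(b, a)


p0, p1, p2 = (VectorClock(i, 3) for i in range(3))
m_vt = p0.send()             # [1, 0, 0]
e_recv = p1.receive(m_vt)    # [1, 1, 0]
e_other = p2.tick()          # [0, 0, 1]
```

Here `vt_less(m_vt, e_recv)` holds because the send precedes the receive, while `e_recv` and `e_other` are concurrent: neither vector dominates the other.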

13
Comparing vector time-stamps
14
Detecting causality violation
15
In simpler words

sending and receiving messages are events
if the sending events of two different messages
are ordered
e1 = send(m1) <_H e2 = send(m2)
then the violation in the example is
e4 = rec(m2) <_p e3 = rec(m1)

16
Causal Communication

One could enforce our form of causality
Block incoming messages and deliver them when
they fit causally
Each source of messages enumerates them
sequentially
The receiver only delivers a message that fits
the sequence of messages received from the same
source
Assuming no messages are lost
One method of numbering can be Lamport's TSs

17
Causal communication - algorithm

Initially
   each earliest[k] is set to 1_k (1 in position k, 0 elsewhere)
   each blocked[k] is set to { }
On the receipt of message m from processor p
   delivery_list = { }
   if (blocked[p] is empty)
      earliest[p] = m.timestamp
   add m to the tail of blocked[p]
   while (there is a k s.t. blocked[k] is not empty
          and
          for every i = 1..M except for k and Self
          not_earlier(earliest[i], earliest[k], i))
      remove the message at the head of blocked[k]
      -> delivery_list
      if (blocked[k] is not empty)
         set earliest[k] to m.timestamp, where m is at the
         head of blocked[k]
      else
         increment earliest[k] by 1_k
   Deliver the messages in delivery_list, in causal
   order

not_earlier(proc_i_vts, proc_j_vts, i)
   if proc_j_vts[i] < proc_i_vts[i]
      return TRUE
   else
      return FALSE
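
A hedged Python sketch of this blocking scheme, as I read the slide. It assumes vector time-stamps, processors numbered from 0, FIFO channels and no lost messages. The class name `CausalReceiver` and the `(timestamp, payload)` message layout are assumptions made for the example.

```python
from collections import deque


def not_earlier(proc_i_vts, proc_j_vts, i):
    """True if no pending or future message from i can precede j's message."""
    return proc_j_vts[i] < proc_i_vts[i]


class CausalReceiver:
    def __init__(self, self_index, m):
        self.me = self_index
        self.m = m
        # earliest[k] starts as 1_k: 1 in position k, 0 elsewhere.
        self.earliest = [[1 if i == k else 0 for i in range(m)] for k in range(m)]
        self.blocked = [deque() for _ in range(m)]

    def _deliverable(self, k):
        return bool(self.blocked[k]) and all(
            not_earlier(self.earliest[i], self.earliest[k], i)
            for i in range(self.m)
            if i not in (k, self.me)
        )

    def receive(self, p, timestamp, payload):
        """Queue a message from processor p; return payloads now deliverable."""
        delivery_list = []
        if not self.blocked[p]:
            self.earliest[p] = list(timestamp)
        self.blocked[p].append((timestamp, payload))
        progress = True
        while progress:
            progress = False
            for k in range(self.m):
                if self._deliverable(k):
                    _, body = self.blocked[k].popleft()
                    delivery_list.append(body)
                    if self.blocked[k]:
                        self.earliest[k] = list(self.blocked[k][0][0])
                    else:
                        self.earliest[k][k] += 1  # increment by 1_k
                    progress = True
        return delivery_list


# Processor 2 receives m2 (which causally follows m1) before m1 itself.
r = CausalReceiver(self_index=2, m=3)
first = r.receive(1, [1, 2, 0], "m2")   # held back: m1 from processor 0 is missing
second = r.receive(0, [1, 0, 0], "m1")  # releases m1, then m2
```

The early message is held in `blocked[1]` until the message it depends on arrives, so the application sees `m1` before `m2`.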

18
Delivering messages causally
19
Consistent States

In order to detect certain failures, the
system's state has to be examined
Examining data of different processors at
different times can generate artifacts
It is difficult to define simultaneity in a
distributed system
The state of the processors is not enough, since
there might be undelivered messages

20
Consistent States (II)

A global state can be meaningful (i.e.
consistent) only if it is reachable
A consistent state is one that can actually
happen through a series of legal operations of the
distributed system
In other words, if in the state the processor pk
has received a message from processor pm, then
the state of processor pm must be such that the
message has already been sent

21
Consistent States (III)

Surprisingly, a simple condition can guarantee a
consistent global state
For every pair of observations oi, oj that
are part of the state, it is not the case that
oi <_H oj
An immediate implementation of the <_H relation
is vector time-stamps
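
The condition can be checked mechanically once each observation carries a vector time-stamp. The sketch below assumes observations are given as plain lists of integers; `happens_before` and `is_consistent` are names chosen for this example.

```python
from itertools import permutations


def happens_before(a, b):
    """Vector time-stamp test: a <= b in every component and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def is_consistent(observations):
    """A set of observations is consistent if no one precedes another."""
    return not any(happens_before(a, b) for a, b in permutations(observations, 2))


ok_state = [[2, 0, 0], [1, 1, 0], [0, 0, 1]]
bad_state = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]  # first observation precedes the second
```

In `bad_state` the observation of the first processor is older than what the second processor has already heard about it, so the combined picture could never have occurred.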

22
Phantom deadlock
23
Distributed Mutual Exclusion

The simplest way to ensure mutual exclusion is
to use a global clock
Allow the processor that sent the
earliest-request to enter the critical section
Processors can use Lamport's method to share a
global clock
Exactly one of n requests is deterministically
the earliest
Request time-stamps are compared to local TSs

24
Global clock based DME (Ricart & Agrawala 81)

Request_CS
   my_timestamp = current_TS
   requesting = TRUE
   pending_reply = M-1
   for every other processor j,
      send(j, Remote_Request; my_timestamp)
   wait until pending_reply is 0
Release_CS
   requesting = FALSE
   for j = 1 through M
      if deferred_reply[j] = TRUE
         send(j, Reply)
         deferred_reply[j] = FALSE

25
DME (Ricart & Agrawala 81)

Main
   wait until a message is received
   Remote_Request(sender; request_time)
      if (not requesting or my_timestamp >
          request_time)
         send(sender, Reply)
      else
         deferred_reply[sender] = TRUE
   Reply(sender)
      pending_reply--
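
A single-threaded simulation can make the message flow concrete. The sketch below is illustrative only: the network is a shared FIFO queue, time-stamps are `(clock, id)` pairs so that ties break by processor id (an assumption), and `can_enter` replaces the blocking "wait until pending_reply is 0".

```python
from collections import deque


class Node:
    def __init__(self, ident, m, net):
        self.id, self.m, self.net = ident, m, net
        self.clock = 0
        self.requesting = False
        self.my_timestamp = None
        self.pending_reply = 0
        self.deferred_reply = [False] * m

    def request_cs(self):
        self.clock += 1
        self.my_timestamp = (self.clock, self.id)  # ties broken by id
        self.requesting = True
        self.pending_reply = self.m - 1
        for j in range(self.m):
            if j != self.id:
                self.net.append((j, "Remote_Request", self.id, self.my_timestamp))

    def can_enter(self):
        return self.requesting and self.pending_reply == 0

    def release_cs(self):
        self.requesting = False
        for j in range(self.m):
            if self.deferred_reply[j]:
                self.net.append((j, "Reply", self.id, None))
                self.deferred_reply[j] = False

    def on_message(self, kind, sender, request_time):
        if kind == "Remote_Request":
            self.clock = max(self.clock, request_time[0]) + 1
            if not self.requesting or self.my_timestamp > request_time:
                self.net.append((sender, "Reply", self.id, None))
            else:
                self.deferred_reply[sender] = True
        else:  # Reply
            self.pending_reply -= 1


def run(nodes, net):
    """Deliver queued messages until the network is quiet; return the count."""
    delivered = 0
    while net:
        dest, kind, sender, ts = net.popleft()
        nodes[dest].on_message(kind, sender, ts)
        delivered += 1
    return delivered


net = deque()
nodes = [Node(i, 3, net) for i in range(3)]
nodes[0].request_cs()
nodes[1].request_cs()
sent = run(nodes, net)                  # 4 requests + 3 immediate replies
winner_first = (nodes[0].can_enter(), nodes[1].can_enter())
nodes[0].release_cs()
sent += run(nodes, net)                 # the deferred reply to node 1
winner_second = nodes[1].can_enter()
```

With M = 3 and two competing requests, the run delivers 8 messages in total, which is 2(M-1) per critical-section entry.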

26
Global clock based DME

Deterministic decision, who is later in the queue
for the CS
For M processors, a minimal number of messages is
needed to enter: 2(M-1)
Protocol uses symmetric information: every
processor receives the same information and can
compute the decision of all other processors

27
Token based ME

Unique token circulating among all processors
A processor possessing the token can enter the
CS
Fixed logical structure among processors: the token
travels along this structure
Processors pass the token along a ring: good
performance if processors have frequent requests
Fixed order that is not related to the order of
requests or to the number of requesting processors

28
Token based ME on a Tree

Hierarchical logical structure needs fewer
messages delivered during a request for the token
a unique path from any (requesting) processor to
any other (holding token)
only one neighbor lies on the path to the token
holder
Each processor stores a pointer to its neighbor
on the path to the token - current_dir
requests are moved to next neighbors on the path
and the requests (i.e. return path) are stored in
a FIFO queue
released tokens are sent to top of queue
each processor on the path can deliver token to
its top of queue

29
Tree-based token ME (Raymond 89)

Nq(neighbor): Add neighbor to requestQ
Dq(): Return the name at head of requestQ
ismt(): True iff requestQ is empty

Request_CS()
   if not Token_hldr
      if ismt()
         send(current_dir, REQUEST)
      Nq(self)
      wait until Token_hldr is true
   Incs = true

Release_CS()
   Incs = false
   if not ismt()
      current_dir = Dq()
      send(current_dir, TOKEN)
      Token_hldr = false
      if not ismt()
         send(current_dir, REQUEST)

30
Raymond's algorithm - Main

while(true)
   REQUEST
      if Token_hldr
         if Incs
            Nq(sender)
         else
            current_dir = sender
            send(current_dir, TOKEN)
            Token_hldr = false
      else
         if ismt()
            send(current_dir, REQUEST)
         Nq(sender)
   TOKEN
      current_dir = Dq()
      if current_dir = self
         Token_hldr = true
      else
         send(current_dir, TOKEN)
         if not ismt()
            send(current_dir, REQUEST)
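
The handlers can be exercised with a small event-driven simulation. This is a sketch under assumptions: messages travel through one shared FIFO queue, the tree is a three-node chain, and the blocking "wait until Token_hldr is true" is replaced by setting `incs` when the token arrives for the node itself. The class name `RaymondNode` is illustrative.

```python
from collections import deque


class RaymondNode:
    def __init__(self, ident, current_dir, net):
        self.id = ident
        self.current_dir = current_dir      # neighbour on the path to the token
        self.token_hldr = current_dir == ident
        self.incs = False
        self.request_q = deque()            # requestQ
        self.net = net                      # queue of (dest, kind, sender)

    def request_cs(self):
        if self.token_hldr:
            self.incs = True
            return
        if not self.request_q:              # ismt(): forward one request only
            self.net.append((self.current_dir, "REQUEST", self.id))
        self.request_q.append(self.id)      # Nq(self)

    def release_cs(self):
        self.incs = False
        if self.request_q:
            self.current_dir = self.request_q.popleft()   # Dq()
            self.token_hldr = False
            self.net.append((self.current_dir, "TOKEN", self.id))
            if self.request_q:              # others still wait: ask for it back
                self.net.append((self.current_dir, "REQUEST", self.id))

    def on_message(self, kind, sender):
        if kind == "REQUEST":
            if self.token_hldr:
                if self.incs:
                    self.request_q.append(sender)
                else:
                    self.current_dir = sender
                    self.token_hldr = False
                    self.net.append((sender, "TOKEN", self.id))
            else:
                if not self.request_q:
                    self.net.append((self.current_dir, "REQUEST", self.id))
                self.request_q.append(sender)
        else:  # TOKEN
            self.current_dir = self.request_q.popleft()
            if self.current_dir == self.id:
                self.token_hldr = True
                self.incs = True            # the waiting Request_CS resumes
            else:
                self.net.append((self.current_dir, "TOKEN", self.id))
                if self.request_q:
                    self.net.append((self.current_dir, "REQUEST", self.id))


net = deque()
# Chain 0 - 1 - 2; node 0 starts with the token.
nodes = [RaymondNode(0, 0, net), RaymondNode(1, 0, net), RaymondNode(2, 1, net)]
nodes[2].request_cs()
hops = 0
while net:
    dest, kind, sender = net.popleft()
    nodes[dest].on_message(kind, sender)
    hops += 1
directions = [n.current_dir for n in nodes]
```

The request travels two hops toward the holder and the token travels two hops back, after which every `current_dir` pointer leads to node 2.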

31
Raymonds algorithm example
32
Raymond's algorithm - features

low storage required O(d)
low message passing overhead O(log n) per
request
when demand increases, work to pass token
decreases
But token may travel a long distance before
reaching its destination
Therefore good only for CS which is seldom
entered.

33
Path compression token-based ME

IsRequesting: True iff processor is requesting
the token
Current_dir: The current guess of the end of the waiting
line
Next: next processor for the token (NIL - end of
waiting line)

Request_CS()
   IsRequesting = true
   if not Token_hldr
      send(current_dir, REQUEST; self)
      current_dir = self
      next = NIL
      wait until Token_hldr is true
   Incs = true

Release_CS()
   Incs = false
   IsRequesting = false
   if next ≠ NIL
      Token_hldr = false
      send(next, TOKEN)
      next = NIL

34
Path compression (II)

while(true)
   REQUEST(requester)
      if IsRequesting = true
         if next = NIL
            next = requester
         else
            send(current_dir, REQUEST; requester)
      elseif Token_hldr = true
         Token_hldr = false
         send(requester, TOKEN; requester)
      else
         send(current_dir, REQUEST; requester)
      current_dir = requester
   TOKEN()
      Token_hldr = true

35
Path compression (example)
36
Path compression

Processors submit forward information about the
requester
Processors receiving requests point back to the
last in line
which is potentially the holder of the token
The last in line for the token adds the new
request to the tail
The last in line itself would add new requesters
to its tail
The state of all processors is not necessarily
consistent (improved efficiency)
requests are sent to better knowing processors

37
Electing a Leader

Replicated data schemes use a primary copy (the
up-to-date by definition)
Distributed computation might need a
coordinator, to assign tasks to participating
processors
If a leader fails, a new leader has to be
elected in order to determine the system's state
and restart computation
All participants must know who the leader is. In
synchronization, non token holders only need to
know that they do not hold the token
In synchronization, failure treatment (stability)
can be an additional requirement
In leader election, it is an integral part of
the algorithm's behavior: if the coordinator
fails, a new coordinator has to be elected

38
The Bully algorithm

General assumptions
Processors can store their state during failure
and increase version numbers upon recovery
Failures halt all processing (no erratic
behavior)
Additional assumption
Messages are delivered within Tm seconds
Nodes respond to messages within Tp seconds
This allows a reliable failure detector
If a processor does not respond to a message
within T = 2Tm + Tp, it must have failed
These are called Synchronous Systems
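
The bound is the sum of one message delay for the request, the processing time, and one message delay for the answer. A small sketch follows, with hypothetical millisecond values and illustrative function names:

```python
def failure_timeout(t_m, t_p):
    """T = 2*Tm + Tp: request delay, processing time, answer delay."""
    return 2 * t_m + t_p


def must_have_failed(elapsed, answered, t_m, t_p):
    """Under the synchronous assumptions, silence longer than T means failure."""
    return (not answered) and elapsed > failure_timeout(t_m, t_p)


# Hypothetical bounds: 50 ms per message, 100 ms to respond.
T = failure_timeout(t_m=50, t_p=100)
```

The conclusion is only as reliable as the two bounds. If a message can take longer than Tm, a live processor may be declared failed.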

39
Algorithm requirements

Nodes have one of four status values
Down, Election, Reorganization, Normal
Correctness assertion 1
For G, a consistent state, and for any pair of
nodes pi, pj
1. If Status_i ∈ {Normal, Reorganization} and
Status_j ∈ {Normal, Reorganization} then
Coordinator_i = Coordinator_j
2. If Status_i = Status_j = Normal then
Definition_i = Definition_j
Recovering from failure, a node sets its status
to Down. Starting an election process, it changes
to Election. Finishing an election, nodes go into
Reorganization. Receiving the new common state →
Normal
If two nodes think that they are in working
order, they agree on who is the coordinator and
on the state of the system

40
Algorithm requirements

Guarantee that the election algorithm makes
progress, i.e. does not stay in the Election
status (for example)
Correctness assertion 2
For a consistent state G, eventually (with no
failures)
1. There is a node i s.t. State_i = Normal and
Coordinator_i = i
2. For any other nonfailed node j, State_j =
Normal and Coordinator_j = i
The simplest strategy is to assign priorities to
all processors, so that they know the priorities
of all others
Each one just finds out whether higher priority
nodes are not failed

41
The Bully algorithm - initialization

Up: set of processors known to be in the group
halted: identity of processor that notified of the
current election

Coordinator_Timeout()   /* check if coordinator is alive */
   if State = Normal or State = Reorganization
      send(Coordinator, AreYouUp), timeout = T
      wait until Coordinator sends AYU_answer,
      timeout = T
      on timeout
         Election()

Recovery()
   State = Down
   Election()

Check()   /* Coordinator checks all others */
   if State = Normal and Coordinator = Self
      for every other node j
         send(j, AreYouNormal)
         wait until j sends (AYN_answer;
         status), timeout = T
         if (j ∈ Up and status = False) or j ∉ Up
            Election()
            return()

42
The Bully algorithm - election()

Election()
   highest = True
   For every higher-priority processor p
      send(p, AreYouUp)
   wait up to T seconds for (AYU_answer)
   messages
      AYU_answer(sender)
         highest = False
   if highest = False
      return()
   State = Election
   halted = Self
   Up = { }
   For every lower-priority processor p
      send(p, Enter_Election)
   wait up to T seconds for (EE_answer) messages
      EE_answer(sender)
         Up = Up ∪ {sender}

43
The Bully algorithm - election() II

   num_answers = 0
   Coordinator = Self
   State = Reorganization
   for each p in Up
      send(p, Set_Coordinator; Self)
   wait up to T seconds for (SC_answer) messages
      SC_answer(sender)
         num_answers++
   if num_answers < |Up|
      Election()
      return()
   num_answers = 0
   for each p in Up
      send(p, New_State; Definition)
   wait up to T seconds for (NS_answer) messages
      NS_answer(sender)
         num_answers++
   if num_answers < |Up|
      Election()
      return()
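
The first phase of Election() can be illustrated without any messaging by relying on the synchronous assumption: a processor answers within T exactly when it is alive. In the sketch below, priorities are the integer ids themselves and the `alive` set stands in for the AreYouUp and Enter_Election exchanges. Both are simplifications made for the example.

```python
def election(self_id, alive):
    """One Election() attempt by self_id; returns None if a better leader is up."""
    # AreYouUp: any live higher-priority processor answers within T.
    higher_up = {p for p in alive if p > self_id}
    if higher_up:
        return None  # highest = False: return and wait for that leader
    # Enter_Election: every live lower-priority processor answers and joins Up.
    up = {p for p in alive if p < self_id}
    return {"coordinator": self_id, "up": up}


alive = {1, 2, 4}              # processors 3 and 5 have failed
deferred = election(2, alive)  # processor 4 is up, so 2 backs off
won = election(4, alive)       # nobody higher answers, so 4 takes over
```

The later phases (Set_Coordinator and New_State) then distribute the result to the processors in `up`. They are not modelled here.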

44
The Bully algorithm

The election procedure run by each agent first
determines whether a better leader exists
If so, wait for the leader to initiate election
Otherwise, attempt to establish itself as leader
Whenever an Enter_Election message is received
immediate response is needed, even if higher
priority nodes were all checked because a
leader may have recovered
The same is true for receiving a Set_Coordinator
message. Update coordinator and move to state of
Reorganization
Similarly, update the Definition (state of
system)

45
The Bully algorithm - Control

Main()
   while(True)
      wait for a message
      case AreYouUp(sender)
         send(sender, AYU_answer)
      case AreYouNormal(sender)
         if State = Normal, send(sender, AYN_answer;
         True)
         else send(sender, AYN_answer; False)
      case Enter_Election(sender)
         State = Election
         stop_processing()
         stop the election procedure, if it is
         processing
         halted = sender
         send(sender, EE_answer)

46
The Bully algorithm - Control (II)

      case Set_Coordinator(sender; newleader)
         if State = Election and halted = newleader
            Coordinator = newleader
            State = Reorganization
         send(sender, SC_answer)
      case New_State(sender; newdef)
         if Coordinator = sender and State =
         Reorganization
            Definition = newdef
            State = Normal

47
The Bully algorithm Example
48
The Bully algorithm

simple algorithm that makes a strong assumption:
timeouts can accurately detect failed processors
lost messages or overfull buffers can make the
bully algorithm elect two leaders
very long timeouts can make failure detection
almost certain
but, make the algorithm run too long
maybe the leader should be tied up to the group
that it coordinates, in case timeout gets so long
that failure is no longer certain

49
Electing a group leader

Electing a global leader is too difficult, when
timeouts are not clear or too long
Tie the coordinator to the group it leads
All members of a group agree on the group
number
Each group number is unique
The group number is part of the state definition
of each node
Only members of the same group agree on the
identity of the coordinator

50
Electing a group leader - correctness

Correctness assertion 3
For G, a consistent state and any pair of nodes
pi and pj the following two conditions hold
1. If Status_i ∈ {Normal, Reorganization}, Status_j ∈
{Normal, Reorganization}, and Group_i = Group_j,
then Coordinator_i = Coordinator_j
2. If Status_i = Normal, Status_j = Normal, and Group_i =
Group_j, then Definition_i = Definition_j

51
Electing a group leader - liveness

Let R be a maximal set of nodes that can
communicate in a consistent state G0. The
following conditions are eventually true in any
run starting at G0 s.t. R remains the maximal set
of communicating nodes
Correctness assertion 4
1. There is a node pi ∈ R such that State_i = Normal
and Coordinator_i = pi
2. For any other non-failed node pj ∈ R, State_j =
Normal and Coordinator_j = pi

52
Electing a group leader observations

Correctness assertion 3 is easy to satisfy. Any
processor p that wishes to establish itself as a
leader forms a unique group number and suggests
to some group of processors to join the group with
itself as Coordinator
Group identifier is unique ⇒ assertion 3.1 is
fulfilled
If participants accept the Definition that p
circulates together with the new group
identifier, assertion 3.2 is satisfied
The hard assertion to satisfy is 4
Run an election algorithm that ensures that each
node in the group has the same Coordinator in
finite time
For more than one Coordinator, the Bully
Algorithm may make Coordinators compete for
participants and not enable progress

53
The Invitation Algorithm

Groups need to coalesce into larger groups
Coordinators search periodically for other
groups
The coordinators found are kept in a set
Others
Each Coordinator detecting another group, tries
to merge it with its own
To avoid deadlock it delays for a time between
detecting and acting
unlike the Bully algorithm, timeout does not
mean much
Not hearing from your coordinator ⇒ form your
own group and proceed
Group IDs are unique: composed of a node's ID
and a running number
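
A group identifier of this kind can be sketched as a pair. The class below is illustrative: `GroupIdSource` is a made-up name, and the counter is assumed to survive crashes (for instance in stable storage), since uniqueness would otherwise be lost after a restart.

```python
class GroupIdSource:
    """Issues group IDs of the form (node ID, running number)."""

    def __init__(self, node_id, counter=0):
        self.node_id = node_id
        self.counter = counter  # assumed to be restored from stable storage

    def next_group(self):
        self.counter += 1
        return (self.node_id, self.counter)


a, b = GroupIdSource("n1"), GroupIdSource("n2")
ids = [a.next_group(), a.next_group(), b.next_group()]
```

Two nodes never collide because the node ID differs, and one node never repeats itself because its counter only grows.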

54
The Invitation Algorithm - Check for groups

Check()
   if State = Normal and Coordinator = Self
      Others = { }
      for every other node p,
         send(p, AreYouCoordinator)
      wait up to T seconds for (AYC_answer)
      messages
         AYC_answer(sender; is_coordinator)
            if is_coordinator = True
               Others = Others ∪ {sender}
      if Others = { }
         return()
      wait for a time inversely proportional to your
      priority
      Merge(Others)

55
The Invitation Algorithm - suspected failure

Timeout()
   if Coordinator = Self
      return()
   send(Coordinator, AreYouThere; Group)
   wait for AYT_answer, timeout is T
      on timeout,
         is_coordinator = False
      AYT_answer(sender; is_coordinator)
   if is_coordinator = False
      Recovery()

Recovery()
   State = Election
   stop_processing()
   Counter++
   Group = (Self | Counter)
   Coordinator = Self
   Up = { }
   State = Reorganization

56
The Invitation Algorithm - Merge groups

Merge(Coordinator_set)
   if Coordinator = Self and State = Normal
      State = Election
      stop_processing()
      Counter++
      Group = (Self | Counter)
      Coordinator = Self
      UpSet = Up
      Up = { }
      for each p in Coordinator_set,
         send(p, Invitation; Self, Group)
      for each p in UpSet,
         send(p, Invitation; Self, Group)
      wait for T seconds
      State = Reorganization
      num_answer = 0
      for each p in Up
         send(p, Ready; Group, Definition)
      wait up to T seconds for Ready_answer messages
         Ready_answer(sender; ingroup, new_group)

57
The Invitation Algorithm - Invite Merging

Invitation()
   while True
      wait for Invitation(new_coordinator;
      new_group)
      if State = Normal
         stop_processing()
         old_coordinator = Coordinator
         UpSet = Up
         State = Election
         Coordinator = new_coordinator
         Group = new_group
         if old_coordinator = Self
            for each p in UpSet
               send(p, Invitation; Coordinator, Group)
         send(Coordinator, Accept; Group)
         wait up to T seconds for an Accept_answer(sender;
         accepted) message
            on Timeout,
               accepted = False
         if accepted is False invoke Recovery()
         State = Reorganization

58
The Invitation Algorithm - Main

Main()
   while True
      wait for a message
      Ready(sender; new_group, new_description)
         if Group = new_group and State = Reorganization
            Description = new_description
            State = Normal
            send(Coordinator, Ready_answer; True, Group)
         else
            send(sender, Ready_answer; False)
      AreYouCoordinator(sender)
         if State = Normal and Coordinator = Self
            send(sender, AYC_answer; True)
         else
            send(sender, AYC_answer; False)

59
The Invitation Algorithm - Main (II)

Main()
   while True
      wait for a message
      ..
      AreYouThere(sender; old_group)
         if Group = old_group and Coordinator = Self and
         sender in Up
            send(sender, AYT_answer; True)
         else
            send(sender, AYT_answer; False)
      Accept(sender; new_group)
         if State = Election and Coordinator = Self and
         Group = new_group
            Up = Up ∪ {sender}
            send(sender, Accept_answer; True)
         else
            send(sender, Accept_answer; False)

60
The Invitation Algorithm Example
61
Electing a leader

Imposing a strong logical structure on the
system (synchronous) enables an efficient
algorithm: the Bully algorithm
Lacking such a structure (asynchronous) needs a
slower algorithm (merging groups): the
Invitation algorithm
Bounded response time is at the basis of the
(synchronous) Bully algorithm
The Invitation algorithm works correctly in the
presence of timing failures (i.e. practical)
Stating the correctness of asynchronous
algorithms is much more complex
Every processor agrees on a value in a
synchronous system
Only a group of processors needs to agree on
group membership and on a value, in an
asynchronous system
Consistency is relative to the group,
uncommunicating processors can be ignored
The Invitation algorithm uses sequence numbers,
instead of global knowledge