Title: Distributed Synchronization
1. Distributed Synchronization
- No shared memory
  - processes reside on different machines → semaphores are ruled out
- Processes (processors, a.k.a. agents) communicate by message passing
- Models of distributed computation
- Distributed mutual exclusion
- Leader election
2. Model of distributed computation
- Events
  - sending of messages
  - receiving of messages
  - internal events (interrupt or time-out)
- Processes (processors) can wait for events
  - process p waits for events by executing
    - wait for A1, A2, ...
    - A1(source, parameters)
      - code to handle A1
  - process q executes send(p, A1, parameters)
  - p will eventually perform the code for A1, with the unpacked parameters
3. Causality
- No global system state
  - cannot be determined by a single observer
- Communication delays
  - impossible to synchronize two observers (machines) exactly
- Distributed systems are causal (no traveling back in time)
  - for each processor separately, events are totally ordered
4. Simplest causality: happens_before
- send always happens_before receive
- two events of the same agent are ordered
- example: e1 < e3 and e4 < e7, but e7 and e5 are unordered
5. Ordering events
- define the happens_before relation (<h) as the transitive closure of two relations
  - e1 <m e2: send → receive of the same message
  - e3 <p e4: e3 precedes e4 on processor p
- require transitivity
  - e1 <h e2 and e2 <h e3 ⇒ e1 <h e3
- in the former example
  - e1 <h e8 (message, processor, message)
  - e2 <h e7 (message, processor)
6. Ordering by time stamps
- a global partial order can be achieved by a topological order of the <h relation
- for ordering events during execution, one needs to compute the order on the fly
- this can be done by assigning time-stamps to events [Lamport 78]
7. Lamport's time-stamp algorithm
- Initially,
  - my_TS = 0
- On event e,
  - if e is the receipt of a message m,
    - my_TS = max(m.TS, my_TS)
  - my_TS++
  - e.TS = my_TS
  - if e is the sending of message m,
    - m.TS = my_TS
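The update rule above translates almost line-for-line into code. A minimal Python sketch (class and method names are illustrative, not from the slides):

    class LamportClock:
        """Minimal sketch of Lamport's time-stamp algorithm."""

        def __init__(self):
            self.my_ts = 0                      # initially, my_TS = 0

        def tick(self):
            """Every event advances the local clock; returns e.TS."""
            self.my_ts += 1
            return self.my_ts

        def send(self):
            """Sending is an event; the message carries m.TS = my_TS."""
            return self.tick()

        def receive(self, m_ts):
            """On receipt, first catch up with the message, then tick."""
            self.my_ts = max(m_ts, self.my_ts)
            return self.tick()

If p sends with stamp t, the receive event at q gets a stamp strictly greater than t, which is exactly the causality property stated on the next slide.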
8. Lamport's time-stamps
- timestamps assigned by Lamport's algorithm are causal
  - e1 <m e3 ⇒ e1.TS < e3.TS
  - e1 <p e4 ⇒ e1.TS < e4.TS
9. Causality violations
- the causality relation between two equal time-stamps is not clear
- Lamport suggests breaking ties by processor ID
- but what about the meaning?...
10. Vector time-stamps
- try to use a multiple time-stamp
- record time-stamps of all processors
- vector time-stamps, containing information (TS) on all processors
- e1.VT ≤ e2.VT ⇔ e1.VT[i] ≤ e2.VT[i] for all i
- e1.VT <VT e2.VT ⇔ e1.VT ≤ e2.VT and e1.VT ≠ e2.VT
11. Vector time-stamps
12. A simple algorithm for VT
- Initially,
  - my_VT = [0,...,0]
- On event e,
  - if e is the receipt of message m,
    - for i = 1 to M
      - my_VT[i] = max(m.VT[i], my_VT[i])
  - my_VT[self]++
  - e.VT = my_VT
  - if e is the sending of message m,
    - m.VT = my_VT
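A direct Python sketch of the VT algorithm above, together with the comparison operators of slide 10 (names illustrative):

    class VectorClock:
        """Sketch of the vector time-stamp algorithm for M processors."""

        def __init__(self, m, self_id):
            self.self_id = self_id
            self.my_vt = [0] * m                # my_VT = (0, ..., 0)

        def tick(self):
            """Every event increments our own component; returns e.VT."""
            self.my_vt[self.self_id] += 1
            return list(self.my_vt)

        def send(self):
            return self.tick()                  # m.VT = my_VT

        def receive(self, m_vt):
            for i in range(len(self.my_vt)):    # component-wise maximum
                self.my_vt[i] = max(m_vt[i], self.my_vt[i])
            return self.tick()

    def vt_leq(a, b):
        """a <= b  iff  a[i] <= b[i] for all i."""
        return all(x <= y for x, y in zip(a, b))

    def vt_lt(a, b):
        """a <VT b  iff  a <= b and a != b."""
        return vt_leq(a, b) and a != b

Two events are concurrent exactly when neither vt_lt(a, b) nor vt_lt(b, a) holds, which is what the comparison slides below illustrate.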
13. Comparing vector time-stamps
14. Detecting causality violation
15. In simpler words
- sending and receiving messages are events
- if the sending events of two different messages are ordered
  - e1 = send(m1) <h e2 = send(m2)
- then the violation in the example is
  - e4 = rec(m2) <p e3 = rec(m1)
16. Causal Communication
- One could enforce our form of causality
- Block incoming messages and deliver them only when they fit causally
- Each source of messages numbers them sequentially
- The receiver only delivers a message that fits the sequence of messages received from the same source
- Assuming no messages are lost
- One method of numbering can be Lamport's TSs
17. Causal communication - algorithm
- not_earlier(proc_i_vts, proc_j_vts, i)
  - if proc_j_vts[i] < proc_i_vts[i]
    - return TRUE
  - else
    - return FALSE
- Initially
  - each earliest[k] is set to the 1_k timestamp (1 in component k, 0 elsewhere)
  - each blocked[k] is set to ∅
- On the receipt of message m from processor p
  - delivery_list = ∅
  - if blocked[p] is empty
    - earliest[p] = m.timestamp
  - add m to the tail of blocked[p]
  - while (there is a k s.t. blocked[k] is not empty and, for every i = 1..M except for k and self, not_earlier(earliest[i], earliest[k], i))
    - remove the message at the head of blocked[k] → delivery_list
    - if blocked[k] is not empty
      - set earliest[k] to m'.timestamp, where m' is at the head of blocked[k]
    - else
      - increment earliest[k] by 1_k
  - Deliver the messages in delivery_list, in causal order
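The same blocking discipline can be expressed compactly with the standard vector-clock delivery rule: deliver m from p once it is the next message from p and every other component of m's VT is already reflected locally. The sketch below uses that rule (assuming each process increments only its own component, and only on sends) rather than porting the earliest/blocked bookkeeping verbatim:

    from collections import deque

    class CausalDelivery:
        """Sketch of causal delivery with vector time-stamps (standard
        rule; a simplification of the slide's earliest/blocked scheme)."""

        def __init__(self, m, self_id):
            self.delivered = [0] * m   # per-source count of delivered messages
            self.self_id = self_id
            self.blocked = deque()     # messages waiting to fit causally

        def _fits(self, sender, vt):
            # next in sequence from sender, no missing causal predecessor
            return (vt[sender] == self.delivered[sender] + 1 and
                    all(vt[k] <= self.delivered[k]
                        for k in range(len(vt)) if k != sender))

        def receive(self, sender, vt, payload):
            """Returns the list of payloads that became deliverable."""
            self.blocked.append((sender, vt, payload))
            delivery_list, progress = [], True
            while progress:            # rescan until no blocked message fits
                progress = False
                for msg in list(self.blocked):
                    s, v, p = msg
                    if self._fits(s, v):
                        self.blocked.remove(msg)
                        self.delivered[s] = v[s]
                        delivery_list.append(p)
                        progress = True
            return delivery_list       # already in causal order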
18. Delivering messages causally
19. Consistent States
- In order to detect certain failures, the system's state has to be examined
- Examining data of different processors at different times can generate artifacts
- It is difficult to define simultaneity in a distributed system
- The state of the processors is not enough, since there might be undelivered messages
20. Consistent States (II)
- A global state can be meaningful (i.e. consistent) only if it is reachable
- A consistent state is one that can actually happen through a series of legal operations of the distributed system
- In other words, if in the state the processor pk has received a message from processor pm, then the state of processor pm must be such that the message has already been sent
21. Consistent States (III)
- Surprisingly, a simple condition can guarantee a consistent global state
- For every pair of observations oi, oj that are part of the state, it is not the case that
  - oi <h oj
- The <h relation can be implemented directly with vector time-stamps
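With vector time-stamps the condition becomes a pairwise concurrency check: no observation's VT may precede another's. A small sketch:

    def vt_leq(a, b):
        return all(x <= y for x, y in zip(a, b))

    def vt_lt(a, b):                  # the <h test via vector time-stamps
        return vt_leq(a, b) and a != b

    def consistent_state(observations):
        """True iff no observation happens-before another one."""
        return all(not vt_lt(a, b) and not vt_lt(b, a)
                   for i, a in enumerate(observations)
                   for b in observations[i + 1:])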
22. Phantom deadlock
23. Distributed Mutual Exclusion
- The simplest way to ensure mutual exclusion is to use a global clock
- Allow the processor that sent the earliest request to enter the critical section
- Processors can use Lamport's method to share a global clock
- Exactly one of n requests is deterministically the earliest
- Request time-stamps are compared to local TSs
24. Global clock based DME (Ricart & Agrawala '81)
- Request_CS
  - my_TS = current_TS
  - requesting = TRUE
  - pending_reply = M-1
  - for every other processor j,
    - send(j, Remote_Request, my_TS)
  - wait until pending_reply is 0
- Release_CS
  - requesting = FALSE
  - for j = 1 through M
    - if deferred_reply[j] = TRUE
      - send(j, Reply)
      - deferred_reply[j] = FALSE
25. DME (Ricart & Agrawala '81)
- Main
  - wait until a message is received
  - Remote_Request(sender, request_time)
    - if (not requesting or my_TS > request_time)
      - send(sender, Reply)
    - else
      - deferred_reply[sender] = TRUE
  - Reply(sender)
    - pending_reply--
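Putting slides 24-25 together, one node of the protocol can be sketched in Python as below. The send callback and the receive loop around it are assumed infrastructure, and equal stamps are broken by processor ID, as slide 9 suggests:

    class RicartAgrawala:
        """Sketch of one node of the Ricart & Agrawala '81 protocol.
        send(dest, msg_type, *args) is an assumed transport callback;
        the surrounding receive loop dispatches to on_request/on_reply."""

        def __init__(self, self_id, n_nodes, send):
            self.self_id, self.n, self.send = self_id, n_nodes, send
            self.clock = 0                       # Lamport clock (current_TS)
            self.my_ts = 0
            self.requesting = False
            self.pending_reply = 0
            self.deferred = [False] * n_nodes

        def request_cs(self):
            self.clock += 1
            self.my_ts = self.clock              # my_TS = current_TS
            self.requesting = True
            self.pending_reply = self.n - 1
            for j in range(self.n):
                if j != self.self_id:
                    self.send(j, "Remote_Request", self.my_ts, self.self_id)
            # the caller then waits until pending_reply == 0

        def release_cs(self):
            self.requesting = False
            for j in range(self.n):
                if self.deferred[j]:
                    self.deferred[j] = False
                    self.send(j, "Reply")

        def on_request(self, sender, request_time):
            self.clock = max(self.clock, request_time)
            # reply unless our own pending request is earlier;
            # ties on the time-stamp are broken by processor ID
            if (not self.requesting or
                    (self.my_ts, self.self_id) > (request_time, sender)):
                self.send(sender, "Reply")
            else:
                self.deferred[sender] = True

        def on_reply(self, sender):
            self.pending_reply -= 1              # enter the CS at 0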
26. Global clock based DME
- Deterministic decision who is later in the queue for the CS
- For M processors, the minimal number of messages needed to enter is 2(M-1)
- The protocol uses symmetric information: every processor receives the same information and can compute the decision of all other processors
27. Token based ME
- A unique token circulates among all processors
- A processor possessing the token can enter the CS
- Fixed logical structure among processors; the token travels along this structure
- Processors passing the token along a ring: good performance if processors have frequent requests
- Fixed order that is not related to the order of requests or to the number of requesting processors
28. Token based ME on a Tree
- A hierarchical logical structure needs fewer messages delivered during a request for the token
  - a unique path from any (requesting) processor to any other (holding the token)
  - only one neighbor lies on the path to the token holder
- Each processor stores a pointer to its neighbor on the path to the token - current_dir
- Requests are moved to next neighbors on the path, and the requests (i.e. the return path) are stored in a FIFO queue
- Released tokens are sent to the top of the queue
- Each processor on the path can deliver the token to its own top of queue
29. Tree-based token ME (Raymond '89)
- Nq(neighbor): add neighbor to requestQ
- Dq(): return the name at head of requestQ
- ismt(): true iff requestQ is empty
- Request_CS()
  - if not Token_hldr
    - if ismt()
      - send(current_dir, REQUEST)
    - Nq(self)
    - wait until Token_hldr is true
  - Incs = true
- Release_CS()
  - Incs = false
  - if not ismt()
    - current_dir = Dq()
    - send(current_dir, TOKEN)
    - Token_hldr = false
    - if not ismt()
      - send(current_dir, REQUEST)
30. Raymond's algorithm - Main
- while(true)
  - REQUEST(sender)
    - if Token_hldr
      - if Incs
        - Nq(sender)
      - else
        - current_dir = sender
        - send(current_dir, TOKEN)
        - Token_hldr = false
    - else
      - if ismt()
        - send(current_dir, REQUEST)
      - Nq(sender)
  - TOKEN
    - current_dir = Dq()
    - if current_dir = self
      - Token_hldr = true
    - else
      - send(current_dir, TOKEN)
      - if not ismt()
        - send(current_dir, REQUEST)
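The two entry routines and the two message handlers combine naturally into one node object. A Python sketch of the algorithm above (the send callback and the wait on the token are assumed infrastructure):

    from collections import deque

    class RaymondNode:
        """Sketch of Raymond's tree-based token algorithm for one node."""

        def __init__(self, self_id, current_dir, send):
            self.self_id = self_id
            self.current_dir = current_dir   # neighbor toward the token (or self)
            self.token_hldr = (current_dir == self_id)
            self.in_cs = False
            self.request_q = deque()         # FIFO queue of pending requests
            self.send = send                 # assumed transport callback

        def request_cs(self):
            if not self.token_hldr:
                if not self.request_q:
                    self.send(self.current_dir, "REQUEST")
                self.request_q.append(self.self_id)
                # ... the caller blocks here until token_hldr becomes true
            self.in_cs = True

        def release_cs(self):
            self.in_cs = False
            if self.request_q:
                self.current_dir = self.request_q.popleft()
                self.send(self.current_dir, "TOKEN")
                self.token_hldr = False
                if self.request_q:           # waiters remain behind us
                    self.send(self.current_dir, "REQUEST")

        def on_request(self, sender):
            if self.token_hldr:
                if self.in_cs:
                    self.request_q.append(sender)
                else:
                    self.current_dir = sender
                    self.send(self.current_dir, "TOKEN")
                    self.token_hldr = False
            else:
                if not self.request_q:
                    self.send(self.current_dir, "REQUEST")
                self.request_q.append(sender)

        def on_token(self):
            self.current_dir = self.request_q.popleft()
            if self.current_dir == self.self_id:
                self.token_hldr = True
            else:
                self.send(self.current_dir, "TOKEN")
                if self.request_q:
                    self.send(self.current_dir, "REQUEST")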
31. Raymond's algorithm - example
32. Raymond's algorithm - features
- low storage required: O(d)
- low message passing overhead: O(log n) per request
- when demand increases, the work to pass the token decreases
- but the token may travel a long distance before reaching its destination
- therefore good only for a CS which is seldom entered
33. Path compression token-based ME
- IsRequesting: true iff the processor is requesting the token
- Current_dir: the current guess of the end of the waiting line
- Next: next processor for the token (NIL - end of waiting line)
- Request_CS()
  - IsRequesting = true
  - if not Token_hldr
    - send(current_dir, REQUEST self)
    - current_dir = self
    - next = NIL
    - wait until Token_hldr is true
  - Incs = true
- Release_CS()
  - Incs = false
  - IsRequesting = false
  - if next ≠ NIL
    - Token_hldr = false
    - send(next, TOKEN)
    - next = NIL
34. Path compression (II)
- while(true)
  - REQUEST(requester)
    - if IsRequesting = true
      - if next = NIL
        - next = requester
      - else
        - send(current_dir, REQUEST requester)
    - elseif Token_hldr = true
      - Token_hldr = false
      - send(requester, TOKEN)
    - else
      - send(current_dir, REQUEST requester)
    - current_dir = requester
  - TOKEN()
    - Token_hldr = true
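A Python sketch of one node under the reading above, where every processor that sees a request ends up pointing at the requester (the send callback and the wait on the token are assumed infrastructure):

    class PathCompressionNode:
        """Sketch of the path-compression token algorithm for one node."""

        def __init__(self, self_id, current_dir, send):
            self.self_id = self_id
            self.current_dir = current_dir  # guess of the end of the waiting line
            self.token_hldr = (current_dir == self_id)
            self.in_cs = False
            self.is_requesting = False
            self.next = None                # next processor for the token
            self.send = send                # assumed transport callback

        def request_cs(self):
            self.is_requesting = True
            if not self.token_hldr:
                self.send(self.current_dir, "REQUEST", self.self_id)
                self.current_dir = self.self_id
                self.next = None
                # ... the caller blocks here until token_hldr becomes true
            self.in_cs = True

        def release_cs(self):
            self.in_cs = False
            self.is_requesting = False
            if self.next is not None:
                self.token_hldr = False
                self.send(self.next, "TOKEN")
                self.next = None

        def on_request(self, requester):
            if self.is_requesting:
                if self.next is None:
                    self.next = requester   # requester queues directly behind us
                else:
                    self.send(self.current_dir, "REQUEST", requester)
            elif self.token_hldr:
                self.token_hldr = False
                self.send(requester, "TOKEN")
            else:
                self.send(self.current_dir, "REQUEST", requester)
            self.current_dir = requester    # point back to the last in line

        def on_token(self):
            self.token_hldr = True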
35. Path compression (example)
36. Path compression
- Processors forward information about the requester
- Processors receiving requests point back to the last in line
  - which is potentially the holder of the token
- The last in line for the token adds the new request to its tail
- The last in line itself would add new requesters to its tail
- The state of all processors is not necessarily consistent (improved efficiency)
  - requests are sent to better-knowing processors
37. Electing a Leader
- Replicated data schemes use a primary copy (the up-to-date one, by definition)
- A distributed computation might need a coordinator, to assign tasks to participating processors
- If a leader fails, a new leader has to be elected in order to determine the system's state and restart the computation
- All participants must know who the leader is. In synchronization, non-token-holders only need to know that one does not hold the token
- In synchronization, (stability) failure treatment can be an additional requirement
- In leader election, it is an integral part of the algorithm's behavior: if the coordinator fails, a coordinator has to be elected
38. The Bully algorithm
- General assumptions
  - Processors can store their state during failure and increase version numbers upon recovery
  - Failures halt all processing (no erratic behavior)
- Additional assumptions
  - Messages are delivered within Tm seconds
  - Nodes respond to messages within Tp seconds
- This allows a reliable failure detector
  - If a processor does not respond to a message within T = 2*Tm + Tp, it must have failed
- Such systems are called Synchronous Systems
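The resulting failure detector is just a bounded wait: one Tm for the query, Tp for processing, one Tm for the answer. A tiny sketch, with Tm and Tp as assumed constants and the transport callbacks assumed:

    TM = 1.0          # assumed bound on message delivery time (seconds)
    TP = 0.5          # assumed bound on response/processing time (seconds)
    T = 2 * TM + TP   # round trip plus processing: silence past T => failed

    def is_alive(peer, send, wait_for_answer):
        """Sketch: send AreYouUp and treat silence for T seconds as failure."""
        send(peer, "AreYouUp")
        return wait_for_answer(peer, timeout=T)   # False on timeout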
39. Algorithm requirements
- Nodes have one of four status values
  - Down, Election, Reorganization, Normal
- Correctness assertion 1
  - For G, a consistent state, and for any pair of nodes pi, pj
    - 1. If status_i ∈ {Normal, Reorganization} and status_j ∈ {Normal, Reorganization}, then Coordinator_i = Coordinator_j
    - 2. If status_i = status_j = Normal, then Definition_i = Definition_j
- Recovering from failure, a node sets its status to Down. Starting an election process, it changes to Election. Finishing an election, nodes go into Reorganization. Receiving the new common state → Normal
- If two nodes think that they are in working order, they agree on who is the coordinator and on the state of the system
40. Algorithm requirements (II)
- Guarantee that the election algorithm makes progress, i.e. does not stay in the Election status (for example)
- Correctness assertion 2
  - For a consistent state G, eventually (with no failures)
    - 1. There is a node i s.t. State_i = Normal and Coordinator_i = i
    - 2. For any other non-failed node j, State_j = Normal and Coordinator_j = i
- The simplest strategy is to assign priorities to all processors, so that they know the priorities of all others
- Each one just finds out whether the higher-priority nodes have failed
41. The Bully algorithm - initialization
- Up: set of processors known to be in the group
- halted: identity of the processor that notified of the current election
- Coordinator_Timeout()  /* check if coordinator is alive */
  - if State = Normal or State = Reorganization
    - send(Coordinator, AreYouUp), timeout = T
    - wait until Coordinator sends AYU_answer, timeout = T
    - on timeout
      - Election()
- Recovery()
  - State = Down
  - Election()
- Check()  /* Coordinator checks all others */
  - if State = Normal and Coordinator = Self
    - for every other node j
      - send(j, AreYouNormal)
      - wait until j sends (AYN_answer, status), timeout = T
      - if (j ∈ Up and status = False) or j ∉ Up
        - Election()
        - return()
42. The Bully algorithm - election()
- Election()
  - highest = True
  - for every higher-priority processor p
    - send(p, AreYouUp)
  - wait up to T seconds for (AYU_answer) messages
    - AYU_answer(sender)
      - highest = False
  - if highest = False
    - return()
  - State = Election
  - halted = Self
  - Up = ∅
  - for every lower-priority processor p
    - send(p, Enter_Election)
  - wait up to T seconds for (EE_answer) messages
    - EE_answer(sender)
      - Up = Up ∪ {sender}
43. The Bully algorithm - election() II
- num_answers = 0
- Coordinator = Self
- State = Reorganization
- for each p in Up
  - send(p, Set_Coordinator, Self)
- wait up to T seconds for (SC_answer) messages
  - SC_answer(sender)
    - num_answers++
- if num_answers < |Up|
  - Election()
  - return()
- num_answers = 0
- for each p in Up
  - send(p, New_State, Definition)
- wait up to T seconds for (NS_answer) messages
  - NS_answer(sender)
    - num_answers++
- if num_answers < |Up|
  - Election()
  - return()
44. The Bully algorithm
- The election procedure run by each agent first determines whether a better leader exists
- If so, wait for that leader to initiate an election
- Otherwise, attempt to establish itself as leader
- Whenever an Enter_Election message is received, an immediate response is needed, even if higher-priority nodes were all checked, because a leader may have recovered
- The same is true for receiving a Set_Coordinator message: update the coordinator and move to the Reorganization state
- Similarly, update the Definition (state of the system)
45. The Bully algorithm - Control
- Main()
  - while(True)
    - wait for a message
    - case AreYouUp(sender)
      - send(sender, AYU_answer)
    - case AreYouNormal(sender)
      - if State = Normal, send(sender, AYN_answer, True)
      - else send(sender, AYN_answer, False)
    - case Enter_Election(sender)
      - State = Election
      - stop_processing()
      - stop the election procedure, if it is running
      - halted = sender
      - send(sender, EE_answer)
46. The Bully algorithm - Control (II)
- ...
  - case Set_Coordinator(sender, newleader)
    - if State = Election and halted = newleader
      - Coordinator = newleader
      - State = Reorganization
      - send(sender, SC_answer)
  - case New_State(sender, newdef)
    - if Coordinator = sender and State = Reorganization
      - Definition = newdef
      - State = Normal
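For reference, the whole election collapses into a few steps when the timeout plumbing is hidden behind an ask(node, msg) callback that returns the answer, or None on timeout T. A condensed, illustrative sketch, not the full control loop of slides 45-46:

    class BullyNode:
        """Condensed sketch of the Bully election for one node."""

        def __init__(self, self_id, all_ids, ask):
            self.self_id, self.all_ids, self.ask = self_id, all_ids, ask
            self.state = "Down"
            self.coordinator = None
            self.up = set()

        def election(self):
            # 1. if any higher-priority node is up, let it run the election
            for p in (q for q in self.all_ids if q > self.self_id):
                if self.ask(p, "AreYouUp") is not None:
                    return
            # 2. halt the lower-priority nodes that answer
            self.state = "Election"
            self.up = {p for p in self.all_ids if p < self.self_id
                       and self.ask(p, "Enter_Election") is not None}
            # 3. establish self as coordinator
            self.coordinator = self.self_id
            self.state = "Reorganization"
            for p in self.up:
                if self.ask(p, "Set_Coordinator") is None:
                    return self.election()   # a member failed: restart
            # 4. distribute the new state definition
            for p in self.up:
                if self.ask(p, "New_State") is None:
                    return self.election()
            self.state = "Normal"            # assumed: coordinator ends Normal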
47. The Bully algorithm - Example
48. The Bully algorithm
- a simple algorithm that makes a strong assumption
  - timeouts can accurately detect failed processors
- lost messages or overfull buffers can make the Bully algorithm elect two leaders
- very long timeouts can make failure detection almost certain
  - but they make the algorithm run too long
- maybe the leader should be tied to the group that it coordinates, in case the timeout gets so long that failure detection is no longer certain
49. Electing a group leader
- Electing a global leader is too difficult when timeouts are not clear or too long
- Tie the coordinator to the group it leads
  - All members of a group agree on the group number
  - Each group number is unique
  - The group number is part of the state definition of each node
  - Only members of the same group agree on the identity of the coordinator
50. Electing a group leader - correctness
- Correctness assertion 3
  - For G, a consistent state, and any pair of nodes pi and pj, the following two conditions hold
    - If Status_i ∈ {Normal, Reorganization}, Status_j ∈ {Normal, Reorganization}, and Group_i = Group_j, then Coordinator_i = Coordinator_j
    - If Status_i = Normal, Status_j = Normal, and Group_i = Group_j, then Definition_i = Definition_j
51. Electing a group leader - liveness
- Let R be a maximal set of nodes that can communicate in a consistent state G0. The following conditions are eventually true in any run starting at G0, s.t. R remains the maximal set of communicating nodes
- Correctness assertion 4
  - There is a node pi ∈ R such that State_i = Normal and Coordinator_i = pi
  - For any other non-failed node pj ∈ R, State_j = Normal and Coordinator_j = pi
52. Electing a group leader - observations
- Correctness assertion 3 is easy to satisfy. Any processor p that wishes to establish itself as a leader forms a unique group number and suggests to some group of processors to join the group with itself as Coordinator
  - The group identifier is unique ⇒ assertion 3.1 is fulfilled
  - If participants accept the Definition that p circulates together with the new group identifier, assertion 3.2 is satisfied
- The hard assertion to satisfy is 4
  - Run an election algorithm that ensures that each node in the group has the same Coordinator in finite time
  - With more than one Coordinator, the Bully Algorithm may make Coordinators compete for participants and not enable progress
53. The Invitation Algorithm
- Groups need to coalesce into larger groups
  - Coordinators search periodically for other groups
  - The coordinators found are kept in a set Others
- Each Coordinator detecting another group tries to merge it with its own
  - To avoid deadlock, it delays for a time between detecting and acting
- Unlike the Bully algorithm, a timeout does not mean much
  - Not hearing from your coordinator ⇒ form your own group and proceed
- Group IDs are unique: composed of a node's ID and a running number
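The last point is simple to realize: the pair (node ID, counter) never repeats, because each node only ever increases its own counter. A tiny illustrative sketch (class name and fields assumed):

    import itertools

    class GroupIds:
        """Unique group IDs: a node's ID plus a running number."""

        def __init__(self, node_id):
            self.node_id = node_id
            self._counter = itertools.count(1)  # part of the node's stable state

        def new_group(self):
            # unique across the system: no other node shares node_id,
            # and this node never reuses a counter value
            return (self.node_id, next(self._counter))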
54. The Invitation Algorithm - Check for groups
- Check()
  - if State = Normal and Coordinator = Self
    - Others = ∅
    - for every other node p,
      - send(p, AreYouCoordinator)
    - wait up to T seconds for (AYC_answer) messages
      - AYC_answer(sender, is_coordinator)
        - if is_coordinator = True
          - Others = Others ∪ {sender}
    - if Others = ∅
      - return()
    - wait for a time inversely proportional to your priority
    - Merge(Others)
55. The Invitation Algorithm - suspected failure
- Timeout()
  - if Coordinator = Self
    - return()
  - send(Coordinator, AreYouThere, Group)
  - wait for AYT_answer, timeout is T
    - on timeout,
      - is_coordinator = False
    - AYT_answer(sender, is_coordinator)
  - if is_coordinator = False
    - Recovery()
- Recovery()
  - State = Election
  - stop_processing()
  - Counter++; Group = (Self, Counter)
  - Coordinator = Self
  - Up = ∅
  - State = Reorganization
56. The Invitation Algorithm - Merge groups
- Merge(Coordinator_set)
  - if Coordinator = Self and State = Normal
    - State = Election
    - stop_processing()
    - Counter++; Group = (Self, Counter)
    - Coordinator = Self
    - UpSet = Up
    - Up = ∅
    - for each p in Coordinator_set,
      - send(p, Invitation, Self, Group)
    - for each p in UpSet,
      - send(p, Invitation, Self, Group)
    - wait for T seconds
    - State = Reorganization
    - num_answer = 0
    - for each p in Up
      - send(p, Ready, Group, Definition)
    - wait up to T seconds for Ready_answer messages
      - Ready_answer(sender, ingroup, new_group)
57. The Invitation Algorithm - Invite Merging
- Invitation()
  - while True
    - wait for Invitation(new_coordinator, new_group)
    - if State = Normal
      - stop_processing()
      - old_coordinator = Coordinator
      - UpSet = Up
      - State = Election
      - Coordinator = new_coordinator
      - Group = new_group
      - if old_coordinator = Self
        - for each p in UpSet
          - send(p, Invitation, Coordinator, Group)
      - send(Coordinator, Accept, Group)
      - wait up to T seconds for an Accept_answer(sender, accepted) message
        - on Timeout,
          - accepted = False
      - if accepted = False, invoke Recovery()
      - State = Reorganization
58. The Invitation Algorithm - Main
- Main()
  - while True
    - wait for a message
    - Ready(sender, new_group, new_description)
      - if Group = new_group and State = Reorganization
        - Description = new_description
        - State = Normal
        - send(Coordinator, Ready_answer, True, Group)
      - else
        - send(sender, Ready_answer, False)
    - AreYouCoordinator(sender)
      - if State = Normal and Coordinator = Self
        - send(sender, AYC_answer, True)
      - else
        - send(sender, AYC_answer, False)
59. The Invitation Algorithm - Main (II)
- Main()
  - while True
    - wait for a message
    - ...
    - AreYouThere(sender, old_group)
      - if Group = old_group and Coordinator = Self and sender ∈ Up
        - send(sender, AYT_answer, True)
      - else
        - send(sender, AYT_answer, False)
    - Accept(sender, new_group)
      - if State = Election and Coordinator = Self and Group = new_group
        - Up = Up ∪ {sender}
        - send(sender, Accept_answer, True)
      - else
        - send(sender, Accept_answer, False)
60. The Invitation Algorithm - Example
61. Electing a leader
- Imposing a strong logical structure on the system (synchronous) enables an efficient algorithm: the Bully algorithm
- Lacking such a structure (asynchronous) requires a slower algorithm (merging groups): the Invitation algorithm
- Bounded response time is at the basis of the (synchronous) Bully algorithm
- The Invitation algorithm works correctly in the presence of timing failures (i.e. it is practical)
- Stating the correctness of asynchronous algorithms is much more complex
  - Every processor agrees on a value in a synchronous system
  - Only a group of processors needs to agree on group membership and on a value in an asynchronous system
  - Consistency is relative to the group; uncommunicating processors can be ignored
- The Invitation algorithm uses sequence numbers instead of global knowledge