Title: Synchronization
1Chapter 7
2Topics
- Physical clock synchronization
- Logical clock synchronization
- Causality relation
- Lamport's logical clock
- Vector logical clock
- Multicast
- ISIS vector clock
- Snapshot
3New Issues in DS
- Global time
- Event order
- e1 at 1:00 pm on machine m1, e2 at 1:01 pm on machine m2. Which event happens earlier?
- Global state
- snapshot
- Mutual exclusion Synchronization
4Time and Clock
- Two roles of time
- - Defines temporal order among events
- - Duration (measured by timer)
- UTC (Coordinated Universal Time) is based on Cesium-133 atom oscillation, measured at over 200 labs in the world
- With satellites, 0.5 ms accuracy is possible.
- (At 100 MIPS, that is 50,000 instructions in 0.5 ms.)
5Clock Skew
- Skew: Clock reading (from a single clock) is location-dependent, e.g., distance from satellite or clock source on a circuit board
- Drift: Multiple clocks.
- t: the real time
- Cp(t): the reading of a clock p at time t (Cp(t) = t for an ideal clock)
- dCp(t)/dt: ticking rate (dCp(t)/dt = 1 for an ideal clock)
6Consequence An Example
- When each machine has its own clock, an event
that occurred after another event may
nevertheless be assigned an earlier time. But it
is a different story in DS.
Time?
7Physical Clock Synchronization
- Cristian's Algorithm
- The Berkeley Algorithm
- Network Time Protocol (NTP)
- OSF DCE
8Cristian's Algorithm: Architecture
- WWV node receiving UTC signals, serving as the central UTC time server (CUTCS) for the DS
- WWV is a short wave radio station in Colorado.
- Periodically every node in the DS sends a time request to the central UTC time server CUTCS.
- The CUTCS responds with its current time tUTC
9Adjust Time
- When the client gets the reply, it simply sets its clock to tUTC
- Time may run backward
- Introduce the change gradually
- Consider the time for message propagation.
10Adjust Time (continued)
- Estimate Tp (propagation delay) from
- T1 - T0 = 2 × Tp + I, where I = processing time.
- Current time t = (server's time in message) + Tp.
- Assumption?
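The estimate above can be sketched in a few lines of Python. This is only an illustration: the function name `cristian_estimate` and the sample readings are assumptions, T0 and T1 are taken to be the client's clock readings when the request was sent and when the reply was received, and the propagation delay is assumed to be the same in both directions.

```python
def cristian_estimate(t0, t1, server_time, processing_time=0.0):
    """Return the client's estimate of the current time when the reply arrives.

    t0, t1          -- client clock readings when the request was sent / reply received
    server_time     -- the server's time carried in the reply
    processing_time -- I, the server's processing time (0.0 if unknown)
    """
    # T1 - T0 = 2 * Tp + I  =>  Tp = (T1 - T0 - I) / 2
    tp = (t1 - t0 - processing_time) / 2
    return server_time + tp


# Hypothetical readings in seconds: 10 ms round trip, 2 ms of server processing.
estimate = cristian_estimate(t0=100.000, t1=100.010, server_time=200.000,
                             processing_time=0.002)
print(round(estimate, 3))
```

With these made-up numbers Tp is 4 ms, so the client would set its clock to roughly 200.004 rather than to the raw server time.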
11The Berkeley Algorithm
- In Cristian's algorithm the central time server was passive
- Now it's active, i.e. it periodically polls all other nodes to hand out their current local times ci(t).
- Based on the answers it calculates a mean and tells all other machines to advance or slow down their clocks accordingly.
12Relative Clock Synchronization
- Time server periodically sends its time to clients and asks for theirs.
- Clients respond with how far ahead of or behind the server they are
- Time server uses the estimated local times to build the arithmetic mean
- Deviations from this arithmetic mean are sent to the nodes, enabling them to slow down or speed up, respectively.
13The Berkeley Algorithm
- The time daemon asks all the other machines for their clock values
- The machines answer
- The time daemon tells everyone how to adjust their clock
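The three steps above can be sketched as follows. This is a minimal sketch, not a full protocol: `berkeley_adjustments` is a hypothetical name, message delays are ignored, and the clock values (minutes since midnight) are made up for illustration.

```python
def berkeley_adjustments(daemon_time, client_times):
    """Return the correction each machine should apply, daemon first.

    A positive value means "advance your clock", a negative value means
    "slow down until the others catch up".
    """
    all_times = [daemon_time] + list(client_times)
    mean = sum(all_times) / len(all_times)
    return [mean - t for t in all_times]


# Daemon reads 3:00, the two other machines read 3:25 and 2:50.
print(berkeley_adjustments(180, [205, 170]))
```

Here the mean is 3:05, so the daemon advances by 5 minutes, the fast machine is slowed by 20 minutes, and the slow machine advances by 15 minutes.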
14Summary
- Cristian's method and the Berkeley algorithm are intended for intranets
- Both may be improved with fault tolerance methods
- Instead of one UTC server in Cristian's algorithm, use n time servers and always take the first answer from whichever time server replies
- Instead of taking the arithmetic mean from all clients in the Berkeley algorithm, take the fault-tolerant mean, i.e. skip deviations beyond a certain threshold
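One possible reading of the fault-tolerant mean is sketched below. The slides do not say what the threshold is measured against; using the median of the reported deviations as the reference point is an assumption of this sketch, as is the name `fault_tolerant_mean`.

```python
from statistics import median_low


def fault_tolerant_mean(deviations, threshold):
    """Average only the deviations that lie within `threshold` of the median.

    median_low always returns one of the input values, so at least one
    deviation is kept and the division below is safe for non-empty input.
    """
    center = median_low(deviations)  # assumption: outliers are judged against the median
    kept = [d for d in deviations if abs(d - center) <= threshold]
    return sum(kept) / len(kept)


# Four clocks roughly agree; the fifth is far off and is skipped.
print(fault_tolerant_mean([0, 2, -1, 1, 500], threshold=10))
```

A plain arithmetic mean of the same values would be pulled to about 100 by the single faulty clock.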
15Network Time Protocol (NTP)
- Goal
- absolute (UTC) time service in large nets (e.g. Internet)
- high availability (via fault tolerance)
- protection against fabrication (via authentication)
- Architecture
- time servers build up a hierarchical synchronization subnet
- all primary servers have a UTC receiver
- secondary servers are synchronized by their parent primary server
- all other stations are leaves on level 3, synchronized by level 2 time servers
- accuracy of clocks decreases with increasing level number
- the net is able to reconfigure
16NTP
Reliability from redundant paths, scalability,
authenticated time sources
17Synchronization of Servers (NTP)
- Synchronization subnet can reconfigure if failures occur, e.g.
- Primary having lost its UTC source can become a secondary
- Secondary having lost its primary can use another one
- Modes of synchronization
- Multicast mode (for quick LANs, low accuracy)
- A server within a LAN periodically multicasts time to other leaves in the LAN, which set their clocks assuming some delay
- Procedure-call mode (Cristian's algorithm with medium accuracy)
- A server responds to requests with its actual timestamp
- Symmetric mode (high accuracy)
- Pairs of servers exchange messages containing times
18OSF DCE
- Time is an interval [t - e, t + e].
- Two intervals overlap ⇒ cannot say which time is earlier (in case of overlap, Unix make should recompile).
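The interval comparison can be illustrated with a short sketch. The function names and the sample numbers are assumptions made here for illustration; they are not part of the DCE time service API.

```python
def compare_intervals(t1, e1, t2, e2):
    """Compare [t1 - e1, t1 + e1] with [t2 - e2, t2 + e2].

    Returns -1 if the first interval lies entirely before the second,
    1 if it lies entirely after, and None if they overlap (order unknown).
    """
    if t1 + e1 < t2 - e2:
        return -1
    if t2 + e2 < t1 - e1:
        return 1
    return None


def needs_recompile(src_t, src_e, obj_t, obj_e):
    """make-style rule: skip recompiling only if the object is definitely newer."""
    return compare_intervals(obj_t, obj_e, src_t, src_e) != 1


print(needs_recompile(src_t=100, src_e=2, obj_t=101, obj_e=2))  # intervals overlap
print(needs_recompile(src_t=100, src_e=2, obj_t=110, obj_e=2))  # object clearly newer
```

The conservative choice is to recompile whenever the order cannot be decided, which is what the overlap case does.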
19Logical Clock Synchronization
- A powerful building block in DS
- Duplicate detection
- Cache consistency
- Leases
- Commitment
20Leslie Lamport
Time, Clocks, and the Ordering of Events in a
Distributed System
Best known for his work on 1) temporal logic and 2) LaTeX
Microsoft
21The Paper
- Handles problems of clock drift in a DS
- Identifies main function of computer clocks, i.e. ordering of events
- Indicates which conditions clocks must satisfy to fulfill their role
- Introduces logical clocks
- Benefits of logical clocks?
- Needed for determining causality
22Logical Time
- In many cases it's sufficient just to order the related relevant events, i.e. we want to be able to position these events relatively, but not absolutely.
- Interesting: relative position of an event on the time axis ⇒ no need for any scaling on this time axis
23Ordering Events
- Event ordering linked with concept of causality
- Saying that event a happened before event b is the same as saying that event a could have affected the outcome of event b
- If events a and b happen on processes that do not exchange any data, their exact ordering is not important.
24Causality Relationship
- Event changes state of process.
- State remains same till next event occurs.
25Formal Definition
- a → b is defined by
- If a and b are in the same process, and a occurs earlier than b, then a → b.
- If a is the sending event and b is the receiving event of the same message, then a → b.
- If a → b and b → c, then a → c (transitivity).
- If a → b, then a causally precedes (or happened before) b; a and b are causally related
- a and b are concurrent if neither a → b nor b → a.
26Message-Related Events
- Sending event
- Receiving event
- Message arrival (at kernel) and delivery (to user process)
- Kernel can control timing of delivery after arrival.
27Example
Time
e11 → e12 → e21 → e22 → e32; furthermore
e31 → e32, whereas e31 is not related by
happened-before to e11, e12, e21, or e22.
e31 is concurrent with e11, e12, e21, and e22.
28Lamport Clock
- Suppose E is a set of events; each e in E gets a Lamport time stamp L(e), as follows
- 1. e is a pure local event or a sending event: if e has no local predecessor, then L(e) = 1; otherwise there is a local predecessor e', thus L(e) = L(e') + 1
- 2. e is a receiving event, with a corresponding sending event s: if e has no local predecessor, then L(e) = L(s) + 1; otherwise there is a local predecessor e', thus L(e) = max{L(s), L(e')} + 1
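The two timestamping rules can be sketched as a small class. The class and method names are assumptions for illustration; starting the counter at 0 makes "no local predecessor" fall out of the same formulas, since the first local event then gets L(e) = 1 and a first receive gets L(e) = L(s) + 1.

```python
class LamportClock:
    """Per-process Lamport clock; `time` holds L(e') of the latest local event."""

    def __init__(self):
        self.time = 0  # 0 means "no local predecessor yet"

    def local_or_send(self):
        """Rule 1: L(e) = L(e') + 1 for a local or sending event."""
        self.time += 1
        return self.time

    def receive(self, sender_time):
        """Rule 2: L(e) = max{L(s), L(e')} + 1 for a receiving event."""
        self.time = max(sender_time, self.time) + 1
        return self.time


p, q = LamportClock(), LamportClock()
p.local_or_send()       # local event on P
s = p.local_or_send()   # P sends a message stamped with L(s)
r = q.receive(s)        # Q's first event is the receive
print(s, r)
```

The timestamp returned by `local_or_send` for a sending event is what would be attached to the outgoing message.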
29Example
Note: Each local counter is incremented with each
local event. In a communication we adjust the
involved counters of the two communicating nodes
to be consistent with the happened-before relation.
Remark: The same mechanism can be used to adjust
clocks on different nodes. The Lamport time is
consistent with the happened-before relation,
i.e. if x → y, then L(x) < L(y), but not vice
versa.
30Adjusting Clocks
Without adjusting local clocks
With adjusting local clocks
31Limitation on Lamport Clocks
From Lamport time values you cannot conclude
whether two events are in the happened-before
relationship, e.g. e11 and e32.
32Total Ordering of Events
- Lamport time only gives us a partial ordering of distributed events.
- To implement the total ordering
- Each processor is assigned a unique id (integer)
- Given two events e1 and e2, e1 is ordered before e2 if L(e1) < L(e2), or L(e1) = L(e2) and Id(e1) < Id(e2)
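In code, this tie-breaking rule is a lexicographic comparison of (timestamp, id) pairs. The event names and values below are made up for illustration.

```python
def total_order_key(lamport_time, process_id):
    """Sort key: compare by L(e) first, break ties by the unique process id."""
    return (lamport_time, process_id)


# Hypothetical events as (name, L(e), Id(e)).
events = [("e1", 3, 2), ("e2", 3, 1), ("e3", 1, 3)]
ordered = sorted(events, key=lambda e: total_order_key(e[1], e[2]))
print([name for name, _, _ in ordered])
```

Because ids are unique, no two events on different processors compare as equal, so every process that sorts the same set of events obtains the same sequence.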
33Holding Back Deliveries
- Delay the delivery of messages that arrived too soon
- Useful when delivering messages from kernel to processes
- Hold back the delivery of M to process P until there is a guarantee that no message M' with L(M') < L(M) will arrive at P in the future.
34Implementation
- Assumption: messages from a particular source arrive in FIFO order
- Each site maintains a set of message queues, one for each other site
- When a message arrives, it is placed in the corresponding queue
- When all queues are non-empty, compare the timestamps of the messages at the heads of the queues, and deliver the message with the oldest timestamp.
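A minimal sketch of this hold-back scheme follows. The class name, the site names, and the message tuples are assumptions for illustration; ties between equal timestamps are broken by source name, in the spirit of the total ordering defined earlier.

```python
from collections import deque


class HoldBackDelivery:
    """One FIFO queue per source site; deliver only when every queue is non-empty."""

    def __init__(self, sources):
        self.queues = {s: deque() for s in sources}
        self.delivered = []

    def arrive(self, source, timestamp, payload):
        """Place an arriving message in its source's queue, then try to deliver."""
        self.queues[source].append((timestamp, source, payload))
        self._deliver_ready()

    def _deliver_ready(self):
        # While every queue has a head, the smallest head can no longer be
        # overtaken (FIFO per source), so it is safe to deliver it.
        while all(self.queues.values()):
            oldest = min(self.queues.values(), key=lambda q: q[0][:2])
            self.delivered.append(oldest.popleft())


site = HoldBackDelivery(["A", "B"])
site.arrive("A", 5, "a5")   # held back: nothing known about B yet
site.arrive("B", 3, "b3")   # both queues non-empty, the oldest head is delivered
print(site.delivered)
```

The message from A stays queued after the second arrival because B's queue is empty again, which is exactly the limitation discussed on the next slide.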
35Limitation
- All message queues need be non-empty.
- Normally not true.
- Require multicast to solve the problem.
- With Lamport clock, L(a) < L(b) does not mean a → b.
- Unnecessarily delay some messages.
- Vector Clock.
36Event Counter
Event counter at Pi: initialized at 0
and incremented for each event
37Vector Time
- Assumption: n tasks (processes) Pi in the DS
- Each Pi has its own local clock, an n-dimensional vector (initially zeroed)
- Vi(a) is the timestamp of event a at process Pi
- Vi[i]: number of local events at Pi
- Vi[j] is Pi's best guess of how many events have occurred on Pj
38Rules
- There is a DS with n distributed processes. The n-dimensional vector Vi is the vector time of process Pi if it is built according to the following rules
- (1) Initially, Vi = (0, ..., 0) for all 1 ≤ i ≤ n
- (2) For a local event on process Pi: Vi[i] = Vi[i] + 1
- (3) Pi includes the value t = Vi in every msg m
- (4) When Pj receives a message m with timestamp t, it sets Vj[k] = max{t[k], Vj[k]}, for 1 ≤ k ≤ n and k ≠ j
- Communication cost?
- Little overhead compared to Lamport clock
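Rules (1) to (4) can be sketched as follows. The class and method names are assumptions, processes are indexed from 0 rather than 1, and sending and receiving are both treated as local events (so each increments the process's own component), which is one common reading of rule (2).

```python
class VectorClock:
    """Vector time Vi of process Pi in a system of n processes (0-based index i)."""

    def __init__(self, n, i):
        self.i = i
        self.v = [0] * n                 # rule (1): initially all zeros

    def local_event(self):
        self.v[self.i] += 1              # rule (2): count the local event
        return tuple(self.v)

    def send(self):
        # Sending is a local event; the returned tuple is the timestamp t
        # carried by the message (rule 3).
        return self.local_event()

    def receive(self, t):
        self.v[self.i] += 1              # the receive itself is a local event
        for k in range(len(self.v)):     # rule (4): component-wise max for k != i
            if k != self.i:
                self.v[k] = max(t[k], self.v[k])
        return tuple(self.v)


p0, p1 = VectorClock(3, 0), VectorClock(3, 1)
t = p0.send()
p1.local_event()
print(t, p1.receive(t))
```

Each message carries n integers instead of the single integer of a Lamport clock, which is the communication cost the slide asks about.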
39Example
[Figure: time diagram with messages M1, M2, and M3]
40Notation
- We define the global V(e) = Vi(e) if event e happens in Pi
- We write V(a) ≤ V(b) if
- V(a)[k] ≤ V(b)[k] for all k.
- Here V(a)[k] denotes the kth component of V(a).
- We write V(a) < V(b) if
- V(a)[k] ≤ V(b)[k] for all k, and
- V(a)[j] < V(b)[j] for at least one j
41Vector Time Characteristics
- The following interrelationships between causality (the happened-before relation) and vector time hold
- A.) e → e' iff V(e) < V(e')
- B.) e || e' iff V(e) || V(e') (the events are concurrent iff their vector times are incomparable)
- The vector time is the best known estimation for global sequencing that is based only on local information.
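The comparison of vector timestamps can be written directly from the component-wise definition of ≤ and <. The helper names `leq`, `lt`, and `concurrent` and the sample timestamps are assumptions for illustration.

```python
def leq(a, b):
    """V(a) <= V(b): every component of a is <= the matching component of b."""
    return all(x <= y for x, y in zip(a, b))


def lt(a, b):
    """V(a) < V(b): a <= b and at least one component is strictly smaller."""
    return leq(a, b) and any(x < y for x, y in zip(a, b))


def concurrent(a, b):
    """Incomparable timestamps: neither a <= b nor b <= a."""
    return not leq(a, b) and not leq(b, a)


print(lt((1, 0, 0), (1, 2, 0)))          # causally ordered
print(concurrent((1, 0, 0), (0, 1, 0)))  # neither happened before the other
```

Unlike Lamport timestamps, these comparisons decide causality in both directions: `lt` holds exactly when one event happened before the other.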
42Proof
- a → b iff V(a) < V(b)
- Proof
- For a fixed b, a → b iff a is in the shaded area iff each component of V(a) ≤ the corresponding component of V(b).
43Multicasting
- A message is sent to all the members of a group
- Sending video stream to a set of customers
- Implementing a chat program
- Sending updates to a group of replica managers
44IPv4 Multicast Addresses
- Class D (starts with bit sequence 1110)
- 224.0.0.0 to 239.255.255.255 (about 2^28 ≈ 268 million)
- 224.0.0.1 is for all systems on this subnet
- 224.2.0.0 to 224.2.127.253 are for multimedia conference calls
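The class D test can be checked with Python's standard `ipaddress` module. The helper name `is_class_d` is an assumption; the check simply looks at the four high-order bits of the 32-bit address.

```python
import ipaddress


def is_class_d(addr):
    """True if the IPv4 address starts with the bit sequence 1110."""
    ip = ipaddress.IPv4Address(addr)
    return (int(ip) >> 28) == 0b1110


for addr in ("224.0.0.1", "239.255.255.255", "192.168.1.1"):
    # is_multicast is the library's own check and should agree with ours.
    print(addr, is_class_d(addr), ipaddress.IPv4Address(addr).is_multicast)

print(2 ** 28)  # number of addresses sharing the 4-bit prefix
```

Fixing 4 of the 32 bits leaves 28 free bits, which is where the figure of about 268 million addresses comes from.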
45Causal Ordering of Messages
- Suppose m1 and m2 are two messages being received at the same node i. A set of messages is causally ordered if for all pairs <m1, m2> the following holds
- send(m1) → send(m2) ⇒ receive(m1) → receive(m2)
46Causality Violation
- Suppose M1's sending event happened before M3's sending event.
- Causality violation occurs if M1 is delivered after M3
- (In particular, non-FIFO delivery is a causality violation).
- Delay the delivery of M3 to P2 until M1 arrives.
- ISIS system using multicast
47Formal Description of ISIS Clock ICi
- Pi initializes its clock ICi = (0, ..., 0).
- For each msg sending event by Pi
- ICi[i] = ICi[i] + 1
- Pi attaches ICi to the message it sends.
- Upon receiving msg M from Pj with timestamp M.ts, Pi checks if
- 1) M.ts[j] = ICi[j] + 1 (M is the next msg expected from Pj)
- 2) ICi[k] ≥ M.ts[k] for all other k (all msgs from Pk that sender Pj has received have been received by Pi)
- If both are satisfied, Pi delivers M after incrementing ICi[j]
- Otherwise, Pi puts M in the hold-back Q until they are satisfied.
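The delivery rule can be sketched as below. This is a simplified illustration, not the ISIS implementation: the class and method names are assumptions, processes are indexed from 0, messages are plain tuples, and a process is assumed not to receive its own multicasts.

```python
class IsisProcess:
    """Process Pi with ISIS clock ICi and a hold-back queue (0-based index i)."""

    def __init__(self, n, i):
        self.i = i
        self.ic = [0] * n
        self.hold_back = []
        self.delivered = []

    def send(self, payload):
        """Count the send and return the message (sender, timestamp, payload)."""
        self.ic[self.i] += 1
        return (self.i, tuple(self.ic), payload)

    def _deliverable(self, j, ts):
        next_from_sender = ts[j] == self.ic[j] + 1                   # condition 1
        seen_what_sender_saw = all(self.ic[k] >= ts[k]
                                   for k in range(len(ts)) if k != j)  # condition 2
        return next_from_sender and seen_what_sender_saw

    def receive(self, msg):
        """Queue the message, then deliver everything that has become deliverable."""
        self.hold_back.append(msg)
        progress = True
        while progress:
            progress = False
            for m in list(self.hold_back):
                j, ts, payload = m
                if self._deliverable(j, ts):
                    self.ic[j] += 1
                    self.delivered.append(payload)
                    self.hold_back.remove(m)
                    progress = True


p0, p1, p2 = IsisProcess(3, 0), IsisProcess(3, 1), IsisProcess(3, 2)
m1 = p0.send("M1")
p1.receive(m1)          # P1 delivers M1 ...
m2 = p1.send("M2")      # ... and then sends M2, so send(M1) happened before send(M2)
p2.receive(m2)          # M2 overtakes M1 on the way to P2 and is held back
p2.receive(m1)          # M1 arrives: M1 is delivered, then M2
print(p2.delivered)
```

Re-scanning the hold-back queue after every delivery is the simplest way to release messages whose conditions have just become true; a real system would likely index the queue instead.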
48Example
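[Figure: time diagram for P1, P2, and P3 with messages M1 and M2; message labels "Migrate foo to P2", "Where is foo?", and "foo is at P2"; clock values shown: 100, 001, 000, 101, 201]
- At P3: M2.ts[1] > IC3[1] + 1, so M2 is put in the hold-back Q
- At P3: M1.ts[1] = IC3[1] + 1, so M1 is delivered and IC3 becomes 101; then M2 is delivered and IC3 becomes 201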
- Note: the jth component of M.ts is the sequence number of the latest msg sent by Pj that is known to the sender of M
49Safety
- Show that msgs are delivered in timestamp order.
- Suppose not
- Let m (m') be the event of sending message M (M')
- Assume Pi delivered msg M' (from Pk) before M (from Pj), even though M.ts (= ICj(m)) < M'.ts (= ICk(m')) (A)
- (a) Just before Pi delivered M: ICi[j] + 1 = ICj(m)[j], hence ICi[j] < ICj(m)[j]
- (b) Delivery of M' would have resulted in ICi'[j] ≥ ICk(m')[j] at the time of delivery, where ICi' is Pi's clock at that earlier time
- (a) and (b) contradict (A): since (b) took place before (a), ICi'[j] ≤ ICi[j], which gives ICk(m')[j] < ICj(m)[j]
50Liveness
- Show the system is starvation-free: no message will wait forever in the hold-back Q
- Assume Q is the hold-back queue in Pi and is non-empty. Let M be a msg in Q which is not preceded by any other msg in Q. Suppose M was sent by Pj.
51Proof
- Assume ICi[k] < M.ts[k] for some k (≠ j), i.e., condition (2) is violated; want to derive a contradiction from this.
- Let M' be the latest msg from Pk that Pj delivered prior to sending M, so that M'.ts < M.ts and M'.ts[k] = M.ts[k].
- If Pi hasn't delivered its copy of msg M' from Pk, then M' with M'.ts < M.ts is in the hold-back Q of Pi, contradicting the assumption that M is not causally preceded by any other msg in the hold-back Q of Pi.
- So Pi must have delivered its copy of msg M' from Pk. Thus ICi[k] ≥ M'.ts[k] = M.ts[k], contradicting ICi[k] < M.ts[k]
- Must give up the assumption that Pi cannot deliver M.
52Proof Illustration
53Global state of a DS
- Consists of
- Local state of each node (task, process)
- Messages in transit
- Why interested in a global state?
- Suppose local computation has stopped on each node and there are no pending messages, then
- 1. Distributed application has terminated successfully? or
- 2. Deadlock?
- Problem: lack of global time
- Consequences?
- How to take a snapshot
54Snapshots (taken at 2:00 pm by local clocks)
[Figure: three snapshots (a), (b), and (c) of nodes A and B, taken at 2:00 pm by local clocks that read between 1:59 pm and 2:01 pm, with 100 either at a node or in the channel; recorded sums: (a) 100, (b) 0, (c) 200]
55Census Taking in Ancient Kingdom
- Want to take census counting all people, some of
whom may be traveling on highways.
56Census Taking Algorithm
- Close all gates into/out of each village (process) and count people (record process state) in village
- Open each outgoing gate and send an official with a red cap (special marker message).
- Open each incoming gate and count all travelers (record channel state: messages sent but not received yet) who arrive ahead of the official (with a red cap).
- Tally the counts from all villages.
57Chandy/Lamport Snapshot Algorithm
- All processes are initially white. Messages sent by white (red) processes are also white (red)
- MSend: Marker sending rule for process P
- Suspend all other activities until done
- Record P's state
- Turn red
- Send one marker over each output channel of P.
- MReceive: Marker receiving rule for P
- On receiving a marker over channel C,
- if P is white: record the state of channel C as empty and invoke MSend
- else: record the state of C as the sequence of white messages received since P turned red.
- Stop when a marker has been received on each incoming channel
- MSend and MReceive are atomic.
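The marker rules can be tried out with a small single-threaded simulation. Everything here is an assumption made for illustration: the class and function names, the use of account balances as process state, and the deques standing in for reliable FIFO channels. Atomicity of MSend and MReceive is trivial because nothing runs concurrently.

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, name, balance):
        self.name, self.balance = name, balance
        self.red = False            # white until the state has been recorded
        self.recorded_state = None
        self.channel_state = {}     # incoming channel -> recorded white messages
        self.pending = set()        # incoming channels still being recorded
        self.inbox, self.outbox = {}, {}


def connect(p, q):
    """Create one FIFO channel in each direction between p and q."""
    for src, dst in ((p, q), (q, p)):
        channel = deque()
        src.outbox[dst.name] = channel
        dst.inbox[src.name] = channel


def send_money(p, to, amount):
    p.balance -= amount
    p.outbox[to].append(amount)


def msend(p):
    """Marker sending rule: record state, turn red, send a marker on each output."""
    p.recorded_state = p.balance
    p.red = True
    p.pending = set(p.inbox)
    p.channel_state = {c: [] for c in p.inbox}
    for channel in p.outbox.values():
        channel.append(MARKER)


def receive(p, frm):
    """Take the next message from channel `frm`, applying the marker receiving rule."""
    msg = p.inbox[frm].popleft()
    if msg == MARKER:
        if not p.red:
            msend(p)                # white: channel `frm` stays recorded as empty
        p.pending.discard(frm)      # stop recording this channel
    else:
        p.balance += msg
        if p.red and frm in p.pending:
            p.channel_state[frm].append(msg)  # white message caught in transit


a, b = Process("A", 100), Process("B", 0)
connect(a, b)
send_money(a, "B", 100)   # 100 is now in the channel A -> B
msend(b)                  # B starts the snapshot while the money is in transit
receive(b, "A")           # the 100 arrives ahead of A's marker
receive(a, "B")           # A gets B's marker, records its state, sends a marker
receive(b, "A")           # A's marker closes the recording of channel A -> B
total = (a.recorded_state + b.recorded_state
         + sum(sum(msgs) for p in (a, b) for msgs in p.channel_state.values()))
print(a.recorded_state, b.recorded_state, b.channel_state, total)
```

In this run both recorded balances are 0 and the 100 is recorded as the state of channel A to B, so the snapshot total stays at 100 without any reference to clock time.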
58Assumptions
- No process failures, no message loss
- Point to point message delivery is ordered
- How to guarantee it? ISIS clock
- Network is strongly connected.
- Why?
59Snapshot (1)
[Figure: two snapshot scenarios (a) and (b) for nodes A and B, with 100 at a node or in the channel relative to the marker; both record sum 100 (OK)]
Msgs arriving before the marker constitute the channel state.
Need not use time.
60Snapshots (2)
[Figure: two snapshot scenarios (c) and (d) for nodes A and B with markers; (c) sum 200: cannot happen; (d) sum 100, with 100 in channel: will be like this]
61Cuts
- Cut C divides all events into PC (those which happened in the past relative to C) and FC (future events).
- Cut C is consistent if there is no message whose sending event is in FC and whose receiving event is in PC.
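The consistency condition translates into a direct check over the messages. The representation is an assumption made for this sketch: a cut is a mapping from process name to the number of that process's events lying in PC, and each message is given by the 1-based indices of its sending and receiving events.

```python
def is_consistent(cut, messages):
    """Return True if no message is received in PC but sent in FC.

    cut      -- dict: process name -> number of its events that are in PC
    messages -- iterable of (sender, send_index, receiver, recv_index)
    """
    for sender, send_index, receiver, recv_index in messages:
        received_in_past = recv_index <= cut[receiver]
        sent_in_future = send_index > cut[sender]
        if received_in_past and sent_in_future:
            return False
    return True


# Hypothetical run: P's 2nd event sends a message that Q receives as its 3rd event.
messages = [("P", 2, "Q", 3)]
print(is_consistent({"P": 2, "Q": 3}, messages))  # send and receive both in the past
print(is_consistent({"P": 1, "Q": 3}, messages))  # received before it was sent
print(is_consistent({"P": 1, "Q": 2}, messages))  # message entirely in the future
```

A message that is sent in PC but received in FC does not violate consistency; it is simply in transit across the cut and belongs to the channel state.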
62Progress shown by cuts
[Figure: time diagram of processes P (events p1 to p4) and Q (events q1 to q3) with numbered cuts]
There are 5 × 4 = 20 possible cuts.
63Example
[Figure: time diagram of processes P (events p1 to p4), Q (events q1 to q5), and R (events r1 to r4) with message M, showing an inconsistent cut and a consistent cut]
State recorded by SNAPSHOT corresponds to a consistent cut
64Checkpoint
- Cut C is consistent ⇒ C doesn't contradict the sequence of events experienced by any site ⇒ can assume it did exist at the same time
- Can use snapshot as checkpoint, from which activity in the distributed system can be resumed after a crash