Title: Synchronization
2Synchronization
- Multiple processes must not simultaneously access a shared resource
- Ordering may be important
- For example, msg 1 must come before msg 2
- Time
- Absolute time vs relative time
- May want one process to coordinate
- Election algorithms
3Synchronization
- Special topics
- Distributed mutual exclusion
- Protect shared resources from simultaneous access
- Distributed transactions
- Similar, but try to optimize access through advanced concurrency control
4What Time is It?
- Easy to answer in a non-distributed system
- Suppose A asks for time, then B
- B's time will be later than A's
- In a distributed system, this may not be true
- Suppose A checks time, then B
- B's time might not be later than A's
- That is, time on A and B might not agree
- If time comes from a central location, network
communication variation is a problem
5What Time is It?
- Why do we care about time?
- Consider make example
- Make is used to compile and link multiple source files into one executable file
- If file.o was last modified before file.c, then file.c must be recompiled
- If file.o was last modified after file.c, then no need to recompile file.c
- This breaks if time is not the same across the distributed system
6Clock Synchronization
- Both machines have their own clock
- Clocks differ by 2
- What will make do with output.c?
- Oops!
7Time
- With single processor system
- Doesn't matter if time is incorrect
- Relative time is what's important
- If more than one processor
- Clock skew is inevitable
- Multiple clock problems
- How to synchronize with real clock?
- How to synchronize clocks with each other?
- But first we digress
8Physical Clocks
- Time between 2 transits of the sun
- Solar day
- Solar second is 1/86400th solar day
9Physical Clocks
- Period of earth rotation not constant
- Earth is slowing due to drag
- Days are getting longer
- Atomic clock invented 1948
- Official second is now
- The time for 9,192,631,770 transitions of the cesium-133 atom
- International Atomic Time (TAI)
- Today, 86,400 TAI seconds is about 3 msec less
than mean solar day!
10Physical Clocks
- Solar seconds are not of constant length
- TAI seconds are of constant length
- Leap seconds are used to keep in phase with sun
- Add leap second when discrepancy exceeds 800 msec
- Otherwise noon would eventually be before breakfast, which might cause riots!
11Physical Clocks
- TAI with leap seconds is known as
- Coordinated Universal Time (UTC)
- UTC replaces Greenwich Mean Time (GMT)
- NIST operates radio WWV from Colorado
- Sends out pulse at start of each UTC second
- But only accurate to within ±1 msec
- Due to atmospheric effects, can vary by ±10 msec
- Some satellites offer similar service
- In any case, must know relative position
- To compensate for propagation delay
12Clock Sync. Algorithms
- Suppose one machine monitors WWV
- How to keep other clocks in sync?
- Let t be UTC time
- Let Cp(t) be time on machine p
- Ideally, want Cp(t) = t
- We'll be happy if dCp/dt = 1
13Clock Sync. Algorithms
- Clocks drift
- Suppose
- One clock is slow and one is fast
- Drift apart at twice the drift rate
14Clock Sync. Algorithms
- Let Cp(t) be time on machine p
- Ideally, want Cp(t) = t
- Or dCp/dt = 1
- But processor clocks can drift
- If maximum rate of drift is ρ
- After Δt, two clocks could be 2ρ · Δt apart
- If you want clocks to differ by less than δ
- Must synchronize clocks every δ / (2ρ) seconds
- How to synchronize?
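The arithmetic on this slide can be sketched in a few lines. This is only an illustration: the function names and the sample values for the drift rate and the tolerance are assumptions, not figures from the slides.

```python
def max_skew(rho, elapsed):
    """Worst-case offset after `elapsed` seconds between two clocks
    that drift in opposite directions at the maximum rate rho."""
    return 2 * rho * elapsed


def resync_interval(delta, rho):
    """Longest time between synchronizations that keeps two clocks
    within delta of each other."""
    return delta / (2 * rho)


# Assumed example values: drift of 10 microseconds per second,
# tolerance of 1 msec.
rho = 1e-5
delta = 1e-3
print("resynchronize at least every", resync_interval(delta, rho), "seconds")
```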
15Clock Sync. Algorithms
- How to synchronize clocks?
- Cristian's algorithm
- Pull protocol
- Berkeley algorithm
- Push protocol
- Averaging algorithms
- Decentralized approach
- Network Time Protocol (NTP)
- Multiple external time sources
16Cristian's Algorithm
- Suppose time server has WWV time
- Clients want to stay within δ of others
- Every δ / (2ρ) seconds or less
- Client asks time server for time
- Somebody got an algorithm named after themselves for that?
- See next slide
17Cristian's Algorithm
- What are the potential problems?
- Time cannot run backwards
- Takes (variable) time to get reply
18Cristian's Algorithm
- Time cannot run backwards
- If clock is fast
- Increment time more slowly than usual
- Must account for time to get reply
- How to do this?
- Educated guess! Roundtrip time divided by 2
- Can also account for the time the server takes to process the request, use multiple roundtrip measurements, etc.
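A minimal sketch of the "educated guess" follows. The function names, the argument layout, and the rule of trusting the sample with the shortest round trip are assumptions made for illustration; the slides only say to halve the round trip and to consider server processing time and multiple measurements.

```python
def cristian_estimate(t_send, t_recv, server_time, server_processing=0.0):
    """Estimate the current time from one request/reply exchange.

    t_send and t_recv are readings of the client's own clock taken when
    the request left and when the reply arrived; server_time is the
    value carried in the reply.
    """
    round_trip = t_recv - t_send
    one_way = (round_trip - server_processing) / 2
    return server_time + one_way


def best_of(samples):
    """Given several (t_send, t_recv, server_time) samples, use the one
    with the shortest round trip, since it has the least room for error."""
    t_send, t_recv, server_time = min(samples, key=lambda s: s[1] - s[0])
    return cristian_estimate(t_send, t_recv, server_time)
```

If the estimate is behind the local clock, the client should slow its clock gradually rather than set it back, as noted above.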
19Berkeley Algorithm
- Cristian's algorithm
- Time server is passive
- Berkeley algorithm
- Time server is aggressive
- Does not require server to know UTC
- Server polls clients
- Computes average time
- Pushes result to clients
20Berkeley Algorithm
- Server asks others for their clock values
- Machines answer
- Server tells others how to adjust their clock
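The three steps can be sketched as below. The function name, the use of minutes as the unit, and the sample clock values are assumptions for illustration; the sketch also ignores message delay, which a real implementation would have to estimate.

```python
def berkeley_adjustments(server_clock, client_clocks):
    """Return the adjustment each machine should apply to its clock.

    server_clock is the server's own reading; client_clocks maps a
    machine name to the reading it answered with.
    """
    readings = {"server": server_clock}
    readings.update(client_clocks)
    average = sum(readings.values()) / len(readings)
    return {name: average - value for name, value in readings.items()}


# Assumed example, clocks in minutes past midnight: 3:00, 3:25, 2:50.
adjustments = berkeley_adjustments(180, {"a": 205, "b": 170})
print(adjustments)
```

The server sends each machine an adjustment rather than an absolute time, so a fast clock can be slowed down instead of being set backwards.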
21Averaging Algorithms
- Cristian's and Berkeley are centralized
- Averaging (decentralized) approach
- All machines broadcast time
- Everybody computes average
- The usual refinements apply
- When to broadcast?
- Only practical on a LAN
22Network Time Protocol
- According to book, NTP uses
- advanced clock synchronization algorithms
- Accuracy range of 1 to 50 msec
- But NTP is not very secure
- NTP actually uses a variant of Marzullo's algorithm
- Aka the Intersection Algorithm
- Have a collection of time intervals
- Example: time of 10 ± 2 gives interval [8, 12]
23Network Time Protocol
- Given collection of time intervals
- Of the form [a, b]
- Marzullo's algorithm finds consistent interval
- Efficient: linear space, O(n log n) time (dominated by sorting the endpoints)
- If no consistent interval, finds interval(s) consistent with the most sources
- Marzullo takes center of resulting interval
- Intersection Algorithm refines this
- Use statistical info on confidence intervals
- Selected time not necessarily midpoint
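A compact sketch of Marzullo's algorithm (not the refined Intersection Algorithm) is shown below. The function name and the return format are assumptions; the sample intervals beyond 10 ± 2 are made up for illustration.

```python
def marzullo(intervals):
    """Find the interval that is consistent with the most sources.

    intervals is a list of (lo, hi) pairs. Returns (count, lo, hi),
    where count is the number of sources that agree on [lo, hi].
    """
    edges = []
    for lo, hi in intervals:
        edges.append((lo, -1))  # an interval opens here
        edges.append((hi, +1))  # an interval closes here
    edges.sort()  # on ties, opening edges sort before closing edges

    best = count = 0
    best_lo = best_hi = None
    for i, (offset, kind) in enumerate(edges):
        count -= kind  # +1 when an interval opens, -1 when it closes
        if count > best:
            best = count
            best_lo = offset
            best_hi = edges[i + 1][0]
    return best, best_lo, best_hi


# 10 +/- 2, 12 +/- 1, and 11 +/- 1 as intervals.
result = marzullo([(8, 12), (11, 13), (10, 12)])
print(result)
```

Plain Marzullo would then report the midpoint of the returned interval as the time estimate.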
24Multiple External Time Sources
- Suppose very accurate time needed
- Multiple UTC sources?
- But these will not agree
- So need to average (or similar)
- Network delays
- Processing delays, etc.
- Not clear that this helps very much!
25Use of Synchronized Clocks
- Today, computers can be at or near UTC
- How to make use of this?
- To enforce at-most-once delivery
- Traditional approach
- Server keeps track of msg numbers
- Checks list against incoming msg numbers
- How long to keep list? What if server crashes?
- Alternative is to use timestamps
- We discuss other apps in later sections
26Logical Clocks
- Usually good enough to agree on time
- Even if it's not the actual time
- Often sufficient to agree on order
- Recall make example
- Lamport time
- Synchronize logical clocks
- Vector timestamps
- Extension of Lamport's algorithm
27Lamport Timestamps
- Happens before: a → b
- According to Tanenbaum: a → b if all processes agree that a came before b
- Lamport actually defines → as the smallest relation satisfying
- If a occurs before b on same processor, then a → b
- If a is the sending of a msg and b is its receipt, then a → b
- Transitive: a → b and b → c implies a → c
28Lamport Timestamps
- Happens before: a → b
- Does "happens before" equal "really happened before"?
- If a and b are on same process and a occurs before b, then a → b
- If a is msg sent, b is (same) msg received, then a → b
- It takes time for message to be sent
- If neither a → b nor b → a, then a and b are concurrent
29Lamport Timestamps
- For event a, want timestamp C(a)
- If a → b then C(a) < C(b)
- C is a non-decreasing function
- Time cannot go backwards!
- Lamport's solution
- Each msg carries timestamp with it
- If local time is less than timestamp, set local time to timestamp + 1
- Advance clock between any two events
- Illustrated on next slide
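The rules above can be sketched as a small class. The class and method names are assumptions; taking the maximum and then adding one covers both the "set local time to timestamp + 1" rule and the "advance clock between any two events" rule.

```python
class LamportClock:
    """Logical clock for one process."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event or message send: advance and return the timestamp."""
        self.time += 1
        return self.time

    def receive(self, msg_timestamp):
        """Message receipt: jump past the sender's timestamp if needed."""
        self.time = max(self.time, msg_timestamp) + 1
        return self.time


p, q = LamportClock(), LamportClock()
p.tick()
p.tick()
sent = p.tick()           # p sends a message stamped with its clock
received = q.receive(sent)
print(sent, received)
```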
30Lamport Timestamps
- Three processes with different clocks
- Lamport's algorithm corrects the clocks
31Lamport Timestamps
- Can also ensure that no two events ever occur at exactly the same time
- 40.1 for process 1
- 40.2 for process 2, etc.
- With this refinement, we have a total ordering on all events in the system
- If a → b on same process then C(a) < C(b)
- If a is msg sent, b is msg received, then we have C(a) < C(b)
- If a ≠ b then C(a) ≠ C(b)
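One way to picture the tie-breaking refinement is to treat each timestamp as a (clock value, process number) pair and compare pairs. The event tuples below are made-up examples.

```python
def total_order(events):
    """Sort events by Lamport time, breaking ties by process number.

    Each event is (lamport_time, process_number, label).
    """
    return sorted(events, key=lambda e: (e[0], e[1]))


# Two events share clock value 40; process 1 is ordered before process 2.
events = [(40, 2, "b"), (40, 1, "a"), (39, 3, "c")]
ordered = [label for _, _, label in total_order(events)]
print(ordered)
```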
32Totally-Ordered Multicast
- Consider replicated database
- Suppose replica in San Jose and in New York
- Query goes to nearest copy
- Updates are tricky
- Must have updates in same order at replicas
- For example: interest calculation and deposit
- For consistency, there is no "right" order
- Just want updates to happen in same order
- Correctness is a different story
33Non-Totally-Ordered Multicast
[Figure: two replicas receive the Deposit and Interest updates in different orders]
- Assumptions
- 1000 in acct, deposit is 1000, interest rate is 10%
- On left, 2200; on right, 2100
- Inconsistent!
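The two outcomes can be checked directly. The function names are assumptions; integer arithmetic is used so the balances come out exact.

```python
def deposit(balance):
    """Deposit of 1000."""
    return balance + 1000


def add_interest(balance):
    """Interest at 10 percent."""
    return balance + balance * 10 // 100


def apply_updates(balance, updates):
    """Apply the updates in the order this replica received them."""
    for update in updates:
        balance = update(balance)
    return balance


left = apply_updates(1000, [deposit, add_interest])
right = apply_updates(1000, [add_interest, deposit])
print(left, right)
```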
34Totally-Ordered Multicast
- Assume msgs received in order and no loss
- Using Lamport timestamps
- Msgs timestamped with sender's logical time
- Multicast sent to all sites, including sender
- Msgs go into local queue in timestamp order
- Multicast ACK msgs (to yourself too)
- Message only removed from queue if
- It is at head of queue and
- It has been ACKed
- Does this work? See next slide
35Totally-Ordered Multicast
- 1000 in acct, deposit is 1000, interest rate 10%
- What happens in this case?
[Figure: two replicas with logical clocks; Deposit is multicast with timestamp 45 and Interest with timestamp 10; each replica's queue later holds Interest 10, Deposit 45, ACK(D) 46, ACK(I) 90]
36Totally-Ordered Multicast
[Figure: same two-replica timeline as on the previous slide]
- When is interest calculation done at each replica?
- When is deposit made?
37Scalar Timestamps
- Scalar timestamps (such as Lamport timestamps) give total ordering using C(a)
- But C(a) < C(b) does not mean that event a really happened before b
- The 4 at P2 occurs before the 3 at P1
38Vector Timestamps
- Lamport timestamps don't reflect causality
- Local events are causally ordered
- Example: multicast news posting
- Response might arrive before original posting
- Vector timestamps do reflect causality
- Must specify
- Local data structures to represent logical time
- Update mechanism/protocol
- Tanenbaum's description is confusing!
39Vector Timestamps
- Want vector timestamp such that
- If VT(a) < VT(b) then a causally precedes b
- Process Pi maintains vector Vi
- Vi[i] is incremented for each event at i
- Vi[j] is Pi's current view of the number of events that have occurred at process Pj
- Vi[i] is easy to maintain
- Vi[j] is obtained from info sent with msgs
- Each message includes vector timestamp
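A minimal sketch of the data structure and update rule follows. The class and function names are assumptions, processes are numbered from 0, and the comparison uses the usual definition of "less than" for vectors: every entry is at most the other's, and the vectors are not equal.

```python
class VectorClock:
    """Vector timestamp kept by process number pid out of n processes."""

    def __init__(self, pid, n):
        self.pid = pid
        self.v = [0] * n

    def event(self):
        """Local event (including a send): count it in our own entry."""
        self.v[self.pid] += 1
        return list(self.v)

    def receive(self, vt):
        """Merge the timestamp carried by a message, then count the receipt."""
        self.v = [max(a, b) for a, b in zip(self.v, vt)]
        return self.event()


def precedes(va, vb):
    """True if the event stamped va causally precedes the one stamped vb."""
    return all(a <= b for a, b in zip(va, vb)) and va != vb


p0, p1, p2 = VectorClock(0, 3), VectorClock(1, 3), VectorClock(2, 3)
a = p0.event()       # p0 sends a message
b = p1.receive(a)    # p1 receives it
c = p2.event()       # unrelated event at p2
print(a, b, c)
```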
40Vector Timestamps
- Suppose Pj received msg m from Pi
- Pi includes its vector timestamp, vt
- Then Pj adjusts its values according to vt
- Pj then knows the number of events on which m can depend
- Tanenbaum claims
- Pj knows no. of messages it must receive before it has seen everything that m could depend on
- Not true! Event ≠ msg!
41Vector Timestamps
[Figure: vector timestamps on events at processes P1, P2, and P3]
42Vector Timestamp
- Modified (useful) form of VT
- Suppose Vi[i] counts msgs sent by Pi
- Now consider multicast newsgroup
- Suppose Pi posts a message a
- Includes vector vt(a)
- Suppose Pj posts a response b
- Includes vector vt(b)
- Want vt(a) to reflect msgs known to Pi when a was sent, and similarly for vt(b)
43Vector Timestamp
- Let Vi[i] be number of msgs sent by Pi
- Vi[j] = number of messages Pi received from Pj
- Pi sends a with vt(a), later Pj sends b with vt(b)
- Suppose Pk receives b before a
- Pk waits to deliver msg b until
- vt(b)[j] = Vk[j] + 1
- This is the next msg expected from Pj
- vt(b)[i] ≤ Vk[i], for all i ≠ j
- Ensures that Pk must have seen msg a
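The delivery test at Pk can be written as one small function. The function name and argument order are assumptions, processes are numbered from 0, and the second condition is taken as "less than or equal", which is the usual form of the causal delivery rule.

```python
def can_deliver(vt, sender, delivered):
    """Decide whether a queued message may be delivered now.

    vt        -- vector timestamp carried by the message
    sender    -- index j of the process that sent it
    delivered -- the receiver's vector Vk of messages seen per process
    """
    # It must be the next message expected from the sender...
    if vt[sender] != delivered[sender] + 1:
        return False
    # ...and the receiver must already have everything the sender had seen.
    return all(vt[i] <= delivered[i] for i in range(len(vt)) if i != sender)


vt_a = [1, 0, 0]   # a, posted by process 0
vt_b = [1, 1, 0]   # b, posted by process 1 after it saw a
vk = [0, 0, 0]     # the receiver has seen nothing yet
print(can_deliver(vt_b, 1, vk), can_deliver(vt_a, 0, vk))
```

In this example b stays queued until a has been delivered and the receiver's vector has advanced to [1, 0, 0].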
44Vector Timestamp
[Figure: vector-timestamped multicast among P1, P2, and P3 with msgs a, b, and c; the labels "Queue msg a" and "Queue msg b" mark messages that are held back before delivery]
45Global State
- Global state of distributed system
- All local states plus msgs in transit
- Definition of state can vary
- Useful to know global state to
- Know that computation is finished
- Detect deadlock
- How to record global state?
- Distributed snapshot
46Global State
- Distributed snapshot
- A consistent state in which the system might have been
- For example, if Q received msg from P then must show that P sent the msg
- P sent a msg that Q has not yet received: this is OK
- Global state represented by a cut
- Next slide
47Global State
- Consistent cut
- Inconsistent cut
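The difference between the two kinds of cut can be stated as a check over the messages. The representation below (local event counts per process, and messages as send/receive event indices) is an assumption made for illustration.

```python
def is_consistent_cut(cut, messages):
    """Check that no message is received inside the cut but sent outside it.

    cut      -- maps a process name to how many of its events are in the cut
    messages -- list of (sender, send_index, receiver, recv_index), where the
                indices are 1-based positions in each process's local history
    """
    for sender, send_idx, receiver, recv_idx in messages:
        received_in_cut = recv_idx <= cut[receiver]
        sent_in_cut = send_idx <= cut[sender]
        if received_in_cut and not sent_in_cut:
            return False
    return True


# P's 2nd event sends a message that Q receives as its 3rd event.
msgs = [("P", 2, "Q", 3)]
print(is_consistent_cut({"P": 2, "Q": 3}, msgs))
print(is_consistent_cut({"P": 1, "Q": 3}, msgs))
```

A cut that includes the send but not the receive is still consistent: the message is simply in transit.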
48Global State
- Assume distributed system uses point-to-point unidirectional communication
- Suppose P starts snapshot
- P records its state
- P sends marker to neighbors
- When Q receives marker
- First marker on any channel: Q records its state and sends marker on its outgoing channels
- Append incoming messages from S until marker from S is received
- Q is done when it has received marker on all incoming channels
49Global State
- This figure does not match algorithm!
- See next few slides
50Global State
- Consider the following example
- Bank has 3 branches, A, B, C
- Each branch connected to others by
- Unidirectional point-to-point links
- State consists of
- Money in branch and
- money in transit between branches
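The marker algorithm can be tried out on this bank example with a small single-threaded simulation. Everything here is an illustrative assumption: the class names, the balances, the transfers, and the order in which messages are delivered. Channels are FIFO queues, which the algorithm requires.

```python
from collections import deque

MARKER = "MARKER"


class Branch:
    def __init__(self, money):
        self.money = money
        self.recorded = None    # recorded local state
        self.open = set()       # incoming channels still being recorded
        self.channel_log = {}   # sender -> money recorded as in transit


class Network:
    def __init__(self, balances):
        self.branches = {name: Branch(m) for name, m in balances.items()}
        names = list(balances)
        self.channels = {(s, d): deque() for s in names for d in names if s != d}

    def transfer(self, src, dst, amount):
        self.branches[src].money -= amount
        self.channels[(src, dst)].append(amount)

    def start_snapshot(self, name):
        """Record local state and send a marker on every outgoing channel."""
        b = self.branches[name]
        b.recorded = b.money
        b.open = {s for (s, d) in self.channels if d == name}
        b.channel_log = {s: 0 for s in b.open}
        for (s, d) in self.channels:
            if s == name:
                self.channels[(s, d)].append(MARKER)

    def deliver(self, src, dst):
        """Deliver the oldest message on the channel from src to dst."""
        msg = self.channels[(src, dst)].popleft()
        b = self.branches[dst]
        if msg == MARKER:
            if b.recorded is None:
                self.start_snapshot(dst)   # first marker: record state
            b.open.discard(src)            # this channel is now complete
        else:
            b.money += msg
            if b.recorded is not None and src in b.open:
                b.channel_log[src] += msg  # money that was in transit

    def snapshot_total(self):
        return sum(b.recorded + sum(b.channel_log.values())
                   for b in self.branches.values())


net = Network({"A": 100, "B": 100, "C": 100})
net.transfer("A", "B", 10)
net.start_snapshot("C")
net.transfer("B", "C", 5)
for src, dst in [("C", "B"), ("A", "B"), ("C", "A"), ("B", "C"),
                 ("B", "C"), ("B", "A"), ("A", "B"), ("A", "C")]:
    net.deliver(src, dst)
print(net.snapshot_total())
```

The recorded branch balances plus the money recorded on channels should add up to the money in the system, even though no branch stopped working while the snapshot was taken.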
51Global State
[Figure: timelines of branches A, B, and C with messages M1 to M6 and recorded states SA, SB, SC]
- Note that no messages are in transit
- Global state: (SA, SB, SC)
52Global State
[Figure: timelines of branches A, B, and C with messages M1 to M6 and a transfer T; A records (SA, T), B records SB, C records SC]
- Global state: (SA, T, SB, SC)
- This does not work if msgs can be reordered!
53Global State
- Example: termination detection
- Process Q received marker 1st time
- Process that sent it is Q's predecessor
- When Q completes its part
- Q sends DONE msg to its predecessor
- When is snapshot DONE?
- When initiator of snapshot received DONE from all of its successors
54Global State
- Problem: if DONE and msgs in transit, then computation may not really be done
- Are msgs part of snapshot or computation?
- Modification: send DONE provided
- All of Q's successors returned DONE and
- Q has not received any msg between time state was recorded and marker(s) received
- Otherwise send CONTINUE msg
- DONE when initiator receives all DONEs
- If CONTINUEs, must do it again
55Election Algorithms
- May want one process to coordinate
- We don't care which process
- How to choose coordinator?
- Have an election!
- Assume each process has a unique number
- All processes know everybody else's number
- But some processes may be down
- Want to elect (live) process with highest number
- We'll consider two election algorithms
- Bully algorithm and ring algorithm
56Bully Algorithm
- P notices coordinator not responding
- P sends ELECTION msg to all processes with higher number than P's
- If no one responds, P becomes coordinator
- If a higher number responds, P is done
- Process receives ELECTION from lower no.
- Responds with OK
- If not already doing so, it initiates an election
- Eventually, everybody gives up
- Except for the biggest bully
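The outcome of the bully algorithm can be sketched without real messaging by tracking which processes end up holding an election. The function name and arguments are assumptions, and the initiator is assumed to be alive.

```python
def bully_election(initiator, all_ids, alive):
    """Return the process that ends up as coordinator.

    all_ids -- every process number in the system
    alive   -- the set of process numbers that are currently up
    """
    pending = [initiator]   # processes that still have to hold an election
    started = set()
    coordinator = None
    while pending:
        p = pending.pop()
        if p in started:
            continue
        started.add(p)
        # p sends ELECTION to every higher number; only live ones answer OK.
        responders = [q for q in all_ids if q > p and q in alive]
        if not responders:
            coordinator = p             # nobody answered: p wins
        else:
            pending.extend(responders)  # each responder holds its own election
    return coordinator


# Assumed scenario: processes 0 to 7, process 7 has crashed, 4 notices.
winner = bully_election(4, range(8), {0, 1, 2, 3, 4, 5, 6})
print(winner)
```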
57Bully Algorithm
- Process 7 was coordinator until it died
- Process 4 is first to notice, so holds an election
- 5 and 6 respond, 4 gives up (why not stop here?)
- Now 5 and 6 each hold an election
58Bully Algorithm
- Process 6 tells 5 to give up
- Process 6 wins, then tells everyone
59Ring Algorithm
- Assume processes are ordered
- Everyone knows their successor
- Note that no token involved
- Suppose P notices coordinator has died
- P sends ELECTION msg to its successor with P's number attached
- If no response, sends msg to P's successor's successor, and so on
- Each process in chain appends its number
- When msg gets back to P, it selects highest number on list and sends COORDINATOR msg
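A sketch of one pass around the ring is shown below. The function name, the list representation of the ring, and the sample process numbers are assumptions; dead processes are simply skipped, which stands in for "no response, try the next successor".

```python
def ring_election(ring, alive, initiator):
    """Circulate an ELECTION message once around the ring.

    ring      -- process numbers in ring order
    alive     -- set of process numbers that are up
    initiator -- the live process that noticed the coordinator died
    Returns (new_coordinator, list_of_numbers_collected).
    """
    n = len(ring)
    members = [initiator]
    i = (ring.index(initiator) + 1) % n
    while ring[i] != initiator:
        if ring[i] in alive:
            members.append(ring[i])   # live process appends its number
        i = (i + 1) % n               # dead process: move on to next successor
    return max(members), members


ring = [0, 1, 2, 3, 4, 5, 6, 7]
alive = {0, 1, 2, 3, 4, 5, 6}         # process 7 has crashed
print(ring_election(ring, alive, 5))
print(ring_election(ring, alive, 2))
```

Two processes that start elections at the same time collect the same set of numbers, so both select the same coordinator.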
60Ring Algorithm
- 5 and 2 both initiate ELECTION
- What will happen?
61Mutual Exclusion
- Critical region: a place where mutual exclusion is required
- Example: update to a shared data structure
- For single processor system
- Use semaphores, monitors, etc.
- Possible distributed system approaches
- Imitate single processor approach
- Distributed approach
- Token ring approach
62Centralized Algorithm
- Elect a coordinator
- If P wants to enter critical region
- Checks with coordinator
- How does coordinator deny request?
- Either explicit denial or no response
- Queues any pending requests
- Fair, efficient, etc.
- No starvation?
- But it's centralized and we hate that!
63Centralized Algorithm
- Process 1 gets OK to enter a critical region
- Process 2 asks permission to enter the same critical region, but no reply
- Process 1 exits, coordinator replies to 2
64Distributed Algorithm
- For this, we need a total ordering on events
- We know how to do this, right?
- P wants to enter critical region
- Send request msg (with timestamp) to everybody
- Including itself
- When request is received
- Receiver not in critical region? Send OK
- Receiver in critical region? No reply, queue request
- Receiver wants to enter critical region but has not yet? Check timestamps, low one wins
- After OKed by everybody, enter critical region
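The receiver's three cases can be sketched as a single decision function. The state names and the (timestamp, process number) request format are assumptions; the process number breaks ties so that the ordering is total.

```python
def on_request(state, my_request, incoming):
    """Decide how to answer an incoming request for the critical region.

    state      -- "RELEASED" (not interested), "WANTED", or "HELD"
    my_request -- this process's own (timestamp, pid), or None
    incoming   -- the requester's (timestamp, pid)
    Returns "OK" to reply immediately or "QUEUE" to defer the reply.
    """
    if state == "HELD":
        return "QUEUE"                  # in the critical region: no reply yet
    if state == "WANTED" and my_request < incoming:
        return "QUEUE"                  # our request is older: we win
    return "OK"


# Assumed scenario: process 0 asks at time 8, process 2 asks at time 12.
print(on_request("WANTED", (8, 0), (12, 2)))   # process 0 receives 2's request
print(on_request("WANTED", (12, 2), (8, 0)))   # process 2 receives 0's request
```

Queued requests are answered with OK when the process leaves the critical region.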
65Distributed Algorithm
- Processes 0 and 2 want to enter critical region
- Process 0 has the lowest timestamp, it wins
- When process 0 is done, 2 gets its turn
66Token Ring Algorithm
- A logical ring with a token
- Token passed around ring
- Process can only enter critical region when it has the token
- Easy to see that this works!
- Usual token ring problems apply
67Token Ring Algorithm
- Unordered group of processes
- Logical ring (also need a token)
68Comparison of Mutual Exclusion Algorithms
Algorithm     Messages per entry/exit   Delay before entry (in message times)   Problems
Centralized   3                         2                                       Coordinator crash
Distributed   2(n - 1)                  2(n - 1)                                Crash of any process
Token ring    1 to ∞                    0 to n - 1                              Lost token, process crash
69Distributed Transactions
70Transaction Model
- Updating master tape is fault tolerant
71Transaction Model
Primitive Description
BEGIN_TRANSACTION Mark the start of a transaction
END_TRANSACTION Terminate the transaction and try to commit
ABORT_TRANSACTION Kill the transaction and restore the old values
READ Read data from a file, a table, or otherwise
WRITE Write data to a file, a table, or otherwise
- Primitives for transactions
72The Transaction Model
(a)
BEGIN_TRANSACTION
    reserve WP -> JFK
    reserve JFK -> Nairobi
    reserve Nairobi -> Malindi
END_TRANSACTION

(b)
BEGIN_TRANSACTION
    reserve WP -> JFK
    reserve JFK -> Nairobi
    reserve Nairobi -> Malindi full => ABORT_TRANSACTION
- a) Transaction to reserve 3 flights commits
- b) Transaction aborts when 3rd flight is unavailable
73Distributed Transactions
- A nested transaction
- A distributed transaction
74Private Workspace
- File index and disk blocks of 3-block file
- After transaction modified 0, appended block 3
- After committing
75Writeahead Log
(a)
x = 0
y = 0
BEGIN_TRANSACTION
    x = x + 1
    y = y + 2
    x = y * y
END_TRANSACTION

(b) Log: [x = 0/1]
(c) Log: [x = 0/1] [y = 0/2]
(d) Log: [x = 0/1] [y = 0/2] [x = 1/4]
- a) A transaction
- b)-d) Log before each statement is executed
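The log on this slide can be reproduced with a short sketch. The function names and the (variable, old value, new value) record format are assumptions; a real write-ahead log would be forced to stable storage before each write.

```python
def run_with_log(state, statements):
    """Execute assignments, logging (variable, old, new) before each write."""
    log = []
    for var, compute in statements:
        new = compute(state)
        log.append((var, state[var], new))   # write the log record first
        state[var] = new                     # then change the value in place
    return log


def rollback(state, log):
    """Abort: undo the writes by replaying the log backwards."""
    for var, old, _new in reversed(log):
        state[var] = old


state = {"x": 0, "y": 0}
transaction = [
    ("x", lambda s: s["x"] + 1),
    ("y", lambda s: s["y"] + 2),
    ("x", lambda s: s["y"] * s["y"]),
]
log = run_with_log(state, transaction)
print(log, state)
```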
76Concurrency Control
- Managers for handling transactions
77Concurrency Control
- Managers for distributed transactions
78Serializability
(a) BEGIN_TRANSACTION  x = 0;  x = x + 1;  END_TRANSACTION
(b) BEGIN_TRANSACTION  x = 0;  x = x + 2;  END_TRANSACTION
(c) BEGIN_TRANSACTION  x = 0;  x = x + 3;  END_TRANSACTION

(d)
Schedule 1   x = 0;  x = x + 1;  x = 0;  x = x + 2;  x = 0;  x = x + 3   Legal
Schedule 2   x = 0;  x = 0;  x = x + 1;  x = x + 2;  x = 0;  x = x + 3   Legal
Schedule 3   x = 0;  x = 0;  x = x + 1;  x = 0;  x = x + 2;  x = x + 3   Illegal
- a)-c) Transactions T1, T2, and T3
- d) Possible schedules
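For this small example, a schedule can be checked by comparing its final value of x with the results of every serial order of the three transactions. This is a sketch for this example only, not a general serializability test; the operation encoding and function names are assumptions.

```python
from itertools import permutations

# Each transaction sets x to 0 and then adds its own constant.
T1 = [("set", 0), ("add", 1)]
T2 = [("set", 0), ("add", 2)]
T3 = [("set", 0), ("add", 3)]


def run(ops, x=0):
    """Execute a sequence of operations on x and return the final value."""
    for kind, value in ops:
        x = value if kind == "set" else x + value
    return x


def is_legal(schedule, transactions):
    """True if the schedule ends in a state some serial order could produce."""
    serial_results = {
        run([op for t in order for op in t])
        for order in permutations(transactions)
    }
    return run(schedule) in serial_results


schedule_1 = [("set", 0), ("add", 1), ("set", 0), ("add", 2), ("set", 0), ("add", 3)]
schedule_2 = [("set", 0), ("set", 0), ("add", 1), ("add", 2), ("set", 0), ("add", 3)]
schedule_3 = [("set", 0), ("set", 0), ("add", 1), ("set", 0), ("add", 2), ("add", 3)]
for s in (schedule_1, schedule_2, schedule_3):
    print(run(s), is_legal(s, [T1, T2, T3]))
```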
79Two-Phase Locking
80Two-Phase Locking
81Pessimistic Timestamp Ordering
- Concurrency control using timestamps