Title: Synchronization
1Synchronization
2Synchronization
- Synchronization in distributed systems is harder than in centralized systems because of the need for distributed algorithms.
- Distributed algorithms have the following properties:
  - No machine has complete information about the system state.
  - Machines make decisions based only on local information.
  - Failure of one machine does not ruin the algorithm.
  - There is no implicit assumption that a global clock exists.
- Clocks are needed to synchronize in a distributed system.
3Clock Synchronization
- Time is unambiguous in centralized systems.
  - The system clock keeps time, and all entities use it.
- In distributed systems, each node has its own system clock.
  - Crystal-based clocks run at slightly different rates. The resulting difference between clock readings is called clock skew.
  - Problem: an event that occurred after another may be assigned an earlier time.
4Physical Clocks: A Primer
- The most accurate clocks are atomic oscillators.
- Most clocks are less accurate (e.g., mechanical watches).
- Computers use crystal-based clocks.
  - This results in clock drift.
- How do you tell time?
  - Use astronomical metrics (solar day).
- Coordinated Universal Time (UTC): the international standard based on atomic time; it has essentially replaced Greenwich Mean Time.
  - Leap seconds are added to stay consistent with astronomical time.
  - UTC is broadcast by radio (satellite and earth stations).
  - Receivers are accurate to 0.1 to 10 ms.
- The goal is to synchronize machines with a master (a machine with a UTC receiver) or with one another.
5Physical Clocks
- Computation of the mean solar day (the transit of the sun marks noon).
6Physical Clocks
- TAI (Temps Atomique International) seconds are of
constant length, unlike solar seconds. Leap
seconds are introduced when necessary to keep in
phase with the sun.
7Clock Synchronization Algorithms
- The relation between clock time and UTC when
clocks tick at different rates.
8Clock Synchronization
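- Each clock has a maximum drift rate $\rho$:
  - $1 - \rho < \frac{dC}{dt} < 1 + \rho$
- Two clocks may drift apart by up to $2\rho\,\Delta t$ in time $\Delta t$.
- To limit the skew to $\delta$, resynchronize every $\delta / 2\rho$ seconds ($2\rho\,\Delta t < \delta$, so $\Delta t < \delta / 2\rho$).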
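The resynchronization bound is a one-line calculation. The sketch below is illustrative only: the function name `resync_interval` and the example values are assumptions, not part of the original slides.

```python
def resync_interval(max_skew: float, drift_rate: float) -> float:
    """Return the longest time (in seconds) two clocks may run unsynchronized.

    Two clocks that each drift by at most `drift_rate` can diverge by
    2 * drift_rate * dt after dt seconds, so dt must stay below
    max_skew / (2 * drift_rate).
    """
    if max_skew <= 0 or drift_rate <= 0:
        raise ValueError("max_skew and drift_rate must be positive")
    return max_skew / (2 * drift_rate)


# Hypothetical values: a drift rate of 1e-5 and a tolerated skew of 1 ms.
interval = resync_interval(max_skew=0.001, drift_rate=1e-5)
print(f"resynchronize at least every {interval:.1f} s")
```

With these assumed values the interval comes out at roughly 50 seconds; a tighter skew bound or a worse oscillator shortens it proportionally.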
9Cristian's Algorithm
- Synchronize machines to a time server that has a UTC receiver.
- Machine P requests the time from the server every $\delta / 2\rho$ seconds.
- P receives time $t$ ($C_{UTC}$) from the server and sets its clock to $t + t_{reply}$, where $t_{reply}$ is the time taken to send the reply to P.
- Use $(t_{req} + t_{reply})/2$, that is, half the measured round-trip time, as an estimate of $t_{reply}$.
- Improve accuracy by making a series of measurements.
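The estimate can be sketched as follows. This is a minimal illustration, not a network client: the timestamps are passed in as plain numbers, and the rule for combining several measurements (keep the sample with the shortest round trip) is one common choice that the slides do not specify. All function names are hypothetical.

```python
def cristian_estimate(t_sent: float, t_received: float, server_time: float) -> float:
    """Estimate the current time at the moment the reply arrives.

    t_sent and t_received are read from the local clock; server_time is
    the value carried in the reply. Half the round trip approximates the
    one-way delay of the reply.
    """
    round_trip = t_received - t_sent
    if round_trip < 0:
        raise ValueError("reply cannot arrive before the request was sent")
    return server_time + round_trip / 2


def best_estimate(samples):
    """Use the sample with the shortest round trip (assumed least delayed)."""
    t_sent, t_received, server_time = min(samples, key=lambda s: s[1] - s[0])
    return cristian_estimate(t_sent, t_received, server_time)


# Two hypothetical measurements: (t_sent, t_received, server_time).
samples = [(100.0, 100.5, 200.0), (101.0, 101.25, 201.0)]
print(best_estimate(samples))
```

The value returned by `best_estimate` is valid at the receive time of the chosen sample, so a real client would apply it as an offset rather than as an absolute clock value.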
10Cristian's Algorithm
- Getting the current time from a time server.
11Berkeley Algorithm
- Used in systems without a UTC receiver.
- Keeps clocks synchronized with one another.
- One computer is the master; the others are slaves.
- The master periodically polls the slaves for their times.
  - It averages the times and returns the differences to the slaves.
  - Communication delays are compensated for as in Cristian's algorithm.
- Failure of the master leads to the election of a new master.
12The Berkeley Algorithm
- The time daemon asks all the other machines for their clock values.
- The machines answer.
- The time daemon tells everyone how to adjust their clock.
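The averaging step can be sketched as below. The machine names and clock readings are made up for illustration, and the compensation for communication delay is left out to keep the example short.

```python
def berkeley_adjustments(clocks):
    """Return the adjustment each machine should apply to its clock.

    clocks maps a machine name to its reported time (here in minutes).
    A positive adjustment means "advance", a negative one "slow down".
    """
    average = sum(clocks.values()) / len(clocks)
    return {name: average - value for name, value in clocks.items()}


# Hypothetical readings in minutes past midnight: 3:00, 3:25 and 2:50.
clocks = {"daemon": 180, "a": 205, "b": 170}
print(berkeley_adjustments(clocks))
```

Sending the difference rather than the new absolute time means the message delay on the way back does not add a further error.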
13Distributed Approaches
- Both approaches studied thus far are centralized.
- Decentralized algorithms use resynchronization intervals:
  - Broadcast the time at the start of the interval.
  - Collect all other broadcasts that arrive in a period S.
  - Use the average value of all reported times.
  - Optionally discard the few highest and lowest values.
- Approaches in use today:
  - rdate: synchronizes a machine with a specified machine.
  - Network Time Protocol (NTP): uses advanced clock synchronization techniques to achieve accuracy in the range of 1 to 50 ms.
14Logical Clocks
- For many problems, only the internal consistency of clocks matters.
  - Absolute (real) time is less important.
  - Use logical clocks.
- Key idea:
  - Clock synchronization need not be absolute.
  - If two machines do not interact, there is no need to synchronize them.
  - More importantly, processes need to agree on the order in which events occur rather than the time at which they occurred.
15Event Ordering
- Problem: define a total ordering of all events that occur in a system.
- Events in a single-processor machine are totally ordered.
- In a distributed system:
  - There is no global clock, and local clocks may be unsynchronized.
  - Events on different machines cannot be ordered using local times.
- Key idea (Lamport):
  - Processes exchange messages.
  - A message must be sent before it is received.
  - Send/receive is used to order events (and synchronize clocks).
16Happens-Before Relation
- The expression $A \to B$ is read "A happens before B".
- If A and B are events in the same process and A executed before B, then $A \to B$.
- If A represents the sending of a message and B is the receipt of this message, then $A \to B$.
- The relation is transitive:
  - $A \to B$ and $B \to C$ imply $A \to C$.
- The relation is undefined across processes that do not exchange messages.
  - It is a partial ordering on events.
17Event Ordering Using HB
- Goal: define the notion of the time of an event such that:
  - If $A \to B$ then $C(A) < C(B)$.
  - If A and B are concurrent, then $C(A)$ may be $<$, $=$, or $>$ $C(B)$.
- Solution:
  - Each processor $i$ maintains a logical clock $LC_i$.
  - Whenever an event occurs locally at $i$, $LC_i = LC_i + 1$.
  - When $i$ sends a message to $j$, it piggybacks $LC_i$.
  - When $j$ receives a message from $i$:
    - If $LC_j < LC_i$ then $LC_j = LC_i + 1$, else do nothing.
- This algorithm meets the above goals.
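A logical clock of this kind fits in a few lines. The sketch below is an assumption-laden illustration: the class and method names are hypothetical, and the receive rule uses the common `max(local, received) + 1` formulation, which treats the receipt itself as a local event and therefore always stamps a receive later than the matching send.

```python
class LamportClock:
    """Logical clock for one process (illustrative sketch)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """A local event occurred: advance the clock."""
        self.time += 1
        return self.time

    def send(self):
        """Sending is an event; the returned value is piggybacked."""
        return self.tick()

    def receive(self, message_time):
        """Move past both the local clock and the sender's timestamp."""
        self.time = max(self.time, message_time) + 1
        return self.time


p, q = LamportClock(), LamportClock()
p.tick()                   # local event at p
stamp = p.send()           # p sends a message carrying its clock
print(stamp, q.receive(stamp))
```

The receive timestamp is strictly larger than the send timestamp, which is exactly the happens-before guarantee the goal asks for.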
18Lamport Timestamps
- Three processes, each with its own clock. The clocks run at different rates.
- Lamport's algorithm corrects the clocks.
19Example: Totally-Ordered Multicasting
- Updating a replicated database and leaving it in an inconsistent state, as can happen without a totally ordered logical clock.
20Causality
- Lamport's logical clocks:
  - If $A \to B$ then $C(A) < C(B)$.
  - The reverse is not true!
    - Nothing can be said about the causal order of events by comparing timestamps alone.
    - If $C(A) < C(B)$, then what? Either $A \to B$, or A and B are concurrent.
- Need to maintain causality:
  - Causal delivery: if send(m) $\to$ send(n), then deliver(m) $\to$ deliver(n).
  - Capture causal relationships between groups of processes.
  - Need a timestamping mechanism such that:
    - If $T(A) < T(B)$ then A should have causally preceded B.
21Vector Clocks
- Causality can be captured by means of vector timestamps.
- Each process $i$ maintains a vector $V_i$:
  - $V_i[i]$: number of events that have occurred at $i$.
  - $V_i[j]$: number of events that $i$ knows have occurred at process $j$.
- Update vector clocks as follows:
  - Local event: increment $V_i[i]$.
  - Send a message: piggyback the entire vector $V_i$.
  - Receipt of a message at $j$ from $i$: $V_j[k] = \max(V_j[k], V_i[k])$ for every $k$.
    - The receiver is told how many events the sender knows occurred at another process $k$.
    - Also $V_j[j] = V_j[j] + 1$ (the receipt is itself an event at $j$).
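The update rules can be sketched as below. The class and helper names are hypothetical, and the sketch assumes that a send counts as a local event and that a receipt increments the receiver's own entry.

```python
class VectorClock:
    """Vector clock for process `pid` in a system of `n` processes."""

    def __init__(self, pid, n):
        self.pid = pid
        self.v = [0] * n

    def local_event(self):
        self.v[self.pid] += 1

    def send(self):
        """Count the send as an event and return a copy to piggyback."""
        self.local_event()
        return list(self.v)

    def receive(self, other):
        """Merge entry by entry, then count the receipt as a local event."""
        self.v = [max(a, b) for a, b in zip(self.v, other)]
        self.v[self.pid] += 1


def happened_before(a, b):
    """True if timestamp a causally precedes timestamp b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def concurrent(a, b):
    return a != b and not happened_before(a, b) and not happened_before(b, a)


p0, p1, p2 = VectorClock(0, 3), VectorClock(1, 3), VectorClock(2, 3)
message = p0.send()        # p0 sends to p1
p1.receive(message)
p2.local_event()           # unrelated event at p2
print(happened_before(message, p1.v), concurrent(p1.v, p2.v))
```

Unlike scalar logical clocks, comparing two vectors does tell us whether one event causally preceded the other or whether they are concurrent.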
22Global State
- The global state of a distributed system consists of:
  - The local state of each process.
  - Messages sent but not received (the state of the queues).
- Many applications need to know the state of the system:
  - Failure recovery, distributed deadlock detection.
- Problem: how can you figure out the state of a distributed system?
  - Each process is independent.
  - There is no global clock or synchronization.
- A distributed snapshot reflects a consistent global state.
23Global State
- A consistent cut: every receipt corresponds to a send event.
- An inconsistent cut: a message is recorded as received but its send event is not in the cut, so its sender cannot be identified.
24Distributed Snapshot Algorithm
- Assume each process communicates with another process using unidirectional point-to-point channels (e.g., TCP connections).
- Any process can initiate the algorithm:
  - Checkpoint the local state.
  - Send a marker on every outgoing channel.
- On receiving a marker:
  - If it is the first marker: checkpoint the local state, send a marker on every outgoing channel, and save the messages arriving on all other channels until a marker arrives on them.
  - A subsequent marker on a channel: stop saving state for that channel.
25Distributed Snapshot
- A process finishes when:
  - It has received a marker on each incoming channel and processed them all.
  - Its recorded state is its local state plus the state of all channels.
  - It sends this state to the initiator.
- Any process can initiate a snapshot.
  - Multiple snapshots may be in progress.
  - Each is separate, and each is distinguished by tagging the marker with the initiator ID (and a sequence number).
[Figure: processes A, B, and C, with markers labeled M.]
26Global State (Snapshot Algorithm)
- Organization of a process and channels for a
distributed snapshot
27Global State (Snapshot Algorithm)
- Process Q receives a marker for the first time and records its local state.
- Q records all incoming messages.
- Q receives a marker for its incoming channel and finishes recording the state of the incoming channel.
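The marker handling for a single process can be sketched as follows. This is a simplified, single-snapshot illustration: the class name, the channel names, and the choice of an integer sum as the application state are all assumptions, and actually sending markers on outgoing channels is left to the caller.

```python
class SnapshotProcess:
    """One process taking part in a marker-based snapshot (illustrative)."""

    def __init__(self, incoming, state=0):
        self.incoming = list(incoming)   # names of incoming channels
        self.state = state               # application state: a running sum
        self.snapshot_taken = False
        self.recorded_state = None
        self.channel_state = {}          # channel -> messages caught in transit
        self.recording = set()           # channels still being recorded

    def checkpoint(self):
        """Record local state and start recording every incoming channel."""
        self.snapshot_taken = True
        self.recorded_state = self.state
        self.channel_state = {c: [] for c in self.incoming}
        self.recording = set(self.incoming)

    def on_marker(self, channel):
        """Handle a marker; return True if markers must now be forwarded."""
        first = not self.snapshot_taken
        if first:
            self.checkpoint()
        self.recording.discard(channel)  # this channel's state is complete
        return first

    def on_message(self, channel, value):
        """Handle an application message (here: add it to the state)."""
        if channel in self.recording:
            self.channel_state[channel].append(value)
        self.state += value

    def finished(self):
        return self.snapshot_taken and not self.recording


q = SnapshotProcess(incoming=["P", "R"], state=5)
q.on_message("P", 1)          # arrives before the snapshot: not recorded
forward = q.on_marker("P")    # first marker: checkpoint, then forward markers
q.on_message("R", 10)         # still in transit on R: recorded
q.on_marker("R")              # marker on R closes that channel
print(q.recorded_state, q.channel_state, q.finished())
```

An initiator would call `checkpoint()` directly and then send markers. Supporting several concurrent snapshots would require keeping one such record per (initiator ID, sequence number) pair.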
28Termination Detection
- Detecting the end of a distributed computation.
- Notation: let the sender be the predecessor and the receiver be the successor.
- Two types of markers: Done and Continue.
- After finishing its part of the snapshot, process Q sends a Done or a Continue to its predecessor.
- Send a Done only when:
  - All of Q's successors have sent a Done.
  - Q has not received any message between check-pointing its local state and receiving a marker on all incoming channels.
- Otherwise, send a Continue.
- The computation has terminated if the initiator receives Done messages from everyone.
29Election Algorithms
- Many distributed algorithms need one process to act as coordinator.
  - It doesn't matter which process does the job; we just need to pick one.
- Election algorithms: techniques to pick a unique coordinator (also known as leader election).
- Examples: taking over the role of a failed process, picking a master in the Berkeley clock synchronization algorithm.
- Types of election algorithms: Bully and Ring algorithms.
30Bully Algorithm
- Each process has a unique numerical ID.
- Processes know the IDs and addresses of every other process.
- Communication is assumed to be reliable.
- Key idea: select the process with the highest ID.
- A process initiates an election if it has just recovered from a failure or if the coordinator has failed.
- Three message types: Election, OK, I won.
- Several processes can initiate an election simultaneously.
  - A consistent result is needed.
- $O(n^2)$ messages are required with $n$ processes.
31Bully Algorithm Details
- Any process P can initiate an election.
- P sends Election messages to all processes with higher IDs and awaits OK messages.
- If no OK messages arrive, P becomes coordinator and sends I won messages to all processes with lower IDs.
- If it receives an OK, it drops out and waits for an I won.
- If a process receives an Election message, it returns an OK and starts an election.
- If a process receives an I won, it treats the sender as the coordinator.
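The message flow can be simulated without a network. The sketch below is a sequential approximation under stated assumptions: the function name and arguments are hypothetical, crashed processes simply never answer, each live process runs at most one election, and I won messages are counted for every lower ID whether or not that process is alive.

```python
def bully_election(initiator, alive, all_ids):
    """Simulate one bully election; return (coordinator, messages sent).

    `alive` is the set of IDs that respond and must contain `initiator`.
    """
    messages = 0
    coordinator = None
    pending = [initiator]      # processes that still have to run an election
    started = set()
    while pending:
        p = pending.pop()
        if p in started:
            continue
        started.add(p)
        higher = [q for q in all_ids if q > p]
        messages += len(higher)                 # Election messages
        responders = [q for q in higher if q in alive]
        messages += len(responders)             # OK messages
        if not responders:                      # nobody higher is alive
            coordinator = p
            messages += sum(1 for q in all_ids if q < p)   # I won messages
        pending.extend(responders)              # each responder holds an election
    return coordinator, messages


# Eight processes (0..7); process 7 has crashed and process 4 notices.
print(bully_election(4, alive={0, 1, 2, 3, 4, 5, 6}, all_ids=range(8)))
```

Only a process with no live higher-numbered peer can win, so the result is always the highest live ID regardless of who starts the election.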
32The Bully Algorithm
- The bully election algorithm
- Process 4 holds an election
- Process 5 and 6 respond, telling 4 to stop
- Now 5 and 6 each hold an election
33Bully Algorithm
- Process 6 tells 5 to stop
- Process 6 wins and tells everyone
34Ring-based Election
- Processes have unique IDs and are arranged in a logical ring.
  - Each process knows its neighbors.
  - Select the process with the highest ID.
- Begin an election if just recovered or if the coordinator has failed.
- Send an Election message to the closest downstream node that is alive.
  - Sequentially poll each successor until a live node is found.
- Each process tags its ID on the message.
- The initiator picks the node with the highest ID and sends a coordinator message.
- Multiple elections can be in progress.
  - This wastes network bandwidth but does no harm.
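One pass of the Election message around the ring can be sketched as below. The function name and arguments are hypothetical, and dead nodes are modeled simply as IDs missing from the `alive` set.

```python
def ring_election(ring, alive, initiator):
    """Circulate an Election message once around the ring.

    `ring` lists the process IDs in ring order. Returns the chosen
    coordinator and the list of IDs tagged onto the message.
    """
    n = len(ring)
    collected = [initiator]
    pos = (ring.index(initiator) + 1) % n
    while ring[pos] != initiator:
        if ring[pos] in alive:          # dead successors are skipped
            collected.append(ring[pos])
        pos = (pos + 1) % n
    return max(collected), collected


# Eight processes in ID order; process 7 has crashed, process 5 notices.
print(ring_election(list(range(8)), alive={0, 1, 2, 3, 4, 5, 6}, initiator=5))
```

A second circulation, omitted here, would carry the coordinator message announcing the winner to every live node.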
35A Ring Algorithm
- Election algorithm using a ring.
36Comparison
- Assume $n$ processes and one election in progress.
- Bully algorithm:
  - Worst case: the initiator is the node with the lowest ID.
    - This triggers $n-2$ elections at higher-ranked nodes: $O(n^2)$ messages.
  - Best case: immediate election, $n-2$ messages.
- Ring:
  - $2(n-1)$ messages always.
37Distributed Synchronization
- A distributed system with multiple processes may need to share data or access shared data structures.
  - Use critical sections with mutual exclusion.
- In a single process with multiple threads:
  - Semaphores, locks, monitors.
- How do you do this for multiple processes in a distributed system?
  - Processes may be running on different machines.
- Solution: a lock mechanism for a distributed environment.
  - It can be centralized or distributed.
38Centralized Mutual Exclusion
- Assume processes are numbered.
- One process is elected coordinator (the highest-ID process).
- Every process needs to check with the coordinator before entering the critical section.
- To obtain exclusive access: send a request, await the reply.
- To release: send a release message.
- Coordinator:
  - On receiving a request: if the lock is available and the queue is empty, send a grant; if not, queue the request.
  - On receiving a release: remove the next request from the queue and send a grant.
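The coordinator's two handlers can be sketched as below. Message passing is replaced by direct method calls, and the class and method names are hypothetical.

```python
from collections import deque


class Coordinator:
    """Grants one lock to one process at a time, in FIFO order."""

    def __init__(self):
        self.holder = None      # process currently in the critical section
        self.queue = deque()    # waiting requests, oldest first

    def request(self, pid):
        """Return True if the grant is sent at once, False if queued."""
        if self.holder is None and not self.queue:
            self.holder = pid
            return True
        self.queue.append(pid)
        return False

    def release(self, pid):
        """Return the process granted next, or None if nobody is waiting."""
        if pid != self.holder:
            raise ValueError("only the current holder can release")
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder


c = Coordinator()
print(c.request(1), c.request(2))   # 1 is granted, 2 is queued (no reply)
print(c.release(1))                 # the grant now goes to 2
```

Not replying to a queued request is what blocks the requester, which is also why a waiting process cannot tell a busy lock from a dead coordinator.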
39Mutual Exclusion: A Centralized Algorithm
- Process 1 asks the coordinator for permission to enter a critical region. Permission is granted.
- Process 2 then asks permission to enter the same critical region. The coordinator does not reply.
- When process 1 exits the critical region, it tells the coordinator, which then replies to 2.
40Properties
- Simulates a centralized lock using blocking calls.
- Fair: requests are granted the lock in the order they were received.
- Simple: three messages per use of a critical section (request, grant, release).
- Shortcomings:
  - Single point of failure.
  - How do you detect a dead coordinator?
    - A process cannot distinguish between a lock in use and a dead coordinator.
    - There is no response from the coordinator in either case.
  - Performance bottleneck in large distributed systems.
41Distributed Algorithm
- Ricart and Agrawala's algorithm needs $2(n-1)$ messages.
- Based on event ordering and timestamps.
- Process $k$ enters the critical section as follows:
  - Generate a new timestamp $TS_k = TS_k + 1$.
  - Send request($k$, $TS_k$) to all other $n-1$ processes.
  - Wait until reply($j$) has been received from all other processes.
  - Enter the critical section.
- Upon receiving a request message, process $j$:
  - Sends a reply if there is no contention.
  - If it is already in the critical section, does not reply and queues the request.
  - If it wants to enter, compares $TS_j$ with $TS_k$ and sends a reply if $TS_k < TS_j$; otherwise it queues the request.
42A Distributed Algorithm
- Two processes want to enter the same critical
region at the same moment. - Process 0 has the lowest timestamp, so it wins.
- When process 0 is done, it sends an OK also, so 2
can now enter the critical region.
43Properties
- Fully decentralized.
- N points of failure!
- All processes are involved in all decisions.
  - Any overloaded process can become a bottleneck.
- A token ring algorithm:
  - Use a token to arbitrate access to the critical section.
  - A process must wait for the token before entering the critical section.
  - Pass the token to the neighbor once done, or if not interested.
  - Detecting token loss is non-trivial.
44A Token Ring Algorithm
- An unordered group of processes on a network.
- A logical ring constructed in software.
45Comparison
- A comparison of three mutual exclusion algorithms.
46Transactions
- Transactions provide a higher-level mechanism for atomicity of processing in distributed systems.
  - They have their origins in databases.
- Banking example: three accounts, A = 100, B = 200, C = 300.
  - Client 1: transfer 4 from A to B.
  - Client 2: transfer 3 from C to B.
- The result can be inconsistent unless certain properties are imposed on the accesses.
47ACID Properties
- Atomic: all or nothing (indivisible).
- Consistent: a transaction takes the system from one consistent state to another (certain invariants hold).
- Isolated: intermediate effects are not visible to other transactions (serializable).
- Durable: changes are permanent once the transaction completes (commits).
48The Transaction Model
- Updating a master tape is fault tolerant.
49The Transaction Model
- Examples of primitives for transactions.
50The Transaction Model
- A transaction to reserve three flights commits (White Plains → New York → Nairobi → Malindi).
- The transaction aborts when the third flight is unavailable.
51Classification of Transactions.
- A flat transaction is a series of operations that satisfy the ACID properties.
  - It does not allow partial results to be committed or aborted.
  - Examples: flight reservation, Web link update.
- A nested transaction is constructed from a number of subtransactions.
- A distributed transaction is logically a flat, indivisible transaction that operates on distributed data.
52Distributed Transactions
- A nested transaction (the transaction is decomposed into subtransactions).
- A distributed transaction (subtransactions on different data).
53Implementation of transactions
- Two methods can be used to implement transactions:
  - Private workspace: until the transaction either commits or aborts, all of its reads and writes go to the private workspace.
  - Write-ahead log: use a log to record the change. Only after the log has been written successfully is the change made to the file.
- Private workspace:
  - Each transaction gets copies of all files and objects.
  - It can optimize for reads by not making copies.
  - It can optimize for writes by copying only what is required. (An appended block and a copy of each modified block are created. These new blocks are called shadow blocks.)
  - Commit requires making the local workspace global.
54Private Workspace
- The file index and disk blocks for a three-block file.
- The situation after a transaction has modified block 0 and appended block 3.
- After committing.
55Implementation: Write-ahead Logs
- In-place updates: the transaction makes changes directly to all files/objects and keeps these changes in a log.
- Write-ahead log: prior to making a change, the transaction writes to a log on stable storage.
  - Each record holds the transaction ID, block number, original value, and new value.
  - Force the log to stable storage on commit.
  - On abort, read the log records and undo the changes (rollback).
  - The log can be used to rerun the transaction after a failure.
- Both workspaces and logs work for distributed transactions.
- Commit needs to be atomic (will return to this issue in Ch. 7).
56Write-ahead Log
- a) A transaction
- b) to d) The log before each statement is executed
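Logging before updating, and undoing on abort, can be sketched as below. This is an in-memory illustration only: the class name is hypothetical, a Python list stands in for stable storage, and the sample transaction (x = x + 1; y = y + 2; x = y * y, starting from x = y = 0) is an assumed example.

```python
class WriteAheadLog:
    """In-place updates protected by an undo log (illustrative sketch)."""

    def __init__(self, data):
        self.data = data      # item name -> current value
        self.log = []         # records: (txn id, item, old value, new value)

    def write(self, txn, item, new_value):
        # The log record is appended first, then the data is changed.
        self.log.append((txn, item, self.data.get(item), new_value))
        self.data[item] = new_value

    def rollback(self, txn):
        """Undo the transaction's writes, newest record first."""
        for t, item, old_value, _new in reversed(self.log):
            if t != txn:
                continue
            if old_value is None:       # item did not exist before
                self.data.pop(item, None)
            else:
                self.data[item] = old_value
        self.log = [r for r in self.log if r[0] != txn]


db = WriteAheadLog({"x": 0, "y": 0})
db.write("T1", "x", db.data["x"] + 1)
db.write("T1", "y", db.data["y"] + 2)
db.write("T1", "x", db.data["y"] * db.data["y"])
print(db.log)
db.rollback("T1")
print(db.data)
```

Walking the log backwards matters when an item is written more than once: the oldest record for that item is applied last, restoring the original value.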
57Concurrency Control
- Goal: allow several transactions to execute simultaneously such that:
  - The collection of manipulated data items is left in a consistent state.
- Achieve consistency by ensuring that data items are accessed in a specific order.
  - The final result should be the same as if each transaction ran sequentially.
58Concurrency Control
- Concurrency control can be implemented in a layered fashion.
  - Bottom layer: a data manager performs the actual read and write operations on data.
  - Middle layer: a scheduler carries the main responsibility for properly controlling concurrency. Scheduling can be based on the use of locks or timestamps.
  - Highest layer: the transaction manager is responsible for guaranteeing atomicity of transactions.
59Concurrency Control
- General organization of managers for handling
transactions.
60Concurrency Control
- General organization of managers for handling
distributed transactions.
61Serializability
- Key idea: properly schedule conflicting operations.
- A conflict is possible if at least one operation is a write:
  - Read-write conflict.
  - Write-write conflict.
- a) to c) Three transactions T1, T2, and T3.
- d) Possible schedules. (Schedule 2 is legal because it results in a valid x value.)
62Serializability
- Two approaches are used in concurrency control:
  - Pessimistic approaches: operations are synchronized before they are carried out.
  - Optimistic approaches: operations are carried out and synchronization takes place at the end of the transaction. At the conflict point, one or more transactions are aborted.
63Two-phase Locking (2PL)
- Widely used concurrency control technique.
- The scheduler acquires all necessary locks in the growing phase and releases them in the shrinking phase.
  - Check whether an operation on data item x conflicts with existing locks.
    - If so, delay the transaction. If not, grant a lock on x.
  - Never release a lock until the data manager finishes the operation on x.
  - Once a lock is released, no further locks can be granted to that transaction.
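The two-phase rule can be enforced with a single flag per transaction. The sketch below makes several simplifying assumptions: locks are exclusive only, the lock table is a plain dictionary shared by all transactions, a conflicting request returns `False` instead of blocking, and all names are hypothetical.

```python
class TwoPhaseTransaction:
    """Tracks the growing and shrinking phases of one transaction."""

    def __init__(self, name, lock_table):
        self.name = name
        self.lock_table = lock_table   # shared: item -> owning transaction
        self.held = set()
        self.shrinking = False

    def acquire(self, item):
        """Return True if the lock is granted, False if we must wait."""
        if self.shrinking:
            raise RuntimeError("2PL violation: lock requested after a release")
        owner = self.lock_table.get(item)
        if owner is not None and owner != self.name:
            return False               # conflict: the caller delays the operation
        self.lock_table[item] = self.name
        self.held.add(item)
        return True

    def release(self, item):
        """The first release ends the growing phase for good."""
        self.shrinking = True
        self.held.remove(item)
        del self.lock_table[item]


locks = {}
t1, t2 = TwoPhaseTransaction("T1", locks), TwoPhaseTransaction("T2", locks)
print(t1.acquire("x"), t2.acquire("x"))   # T1 gets the lock, T2 must wait
t1.release("x")
print(t2.acquire("x"))                    # now T2 can proceed
```

Raising an error on a late `acquire` makes a violation of the protocol visible immediately rather than letting it silently break serializability.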
64Two-Phase Locking
65Two-phase Locking (2PL)
- In strict two-phase locking, the shrinking phase does not take place until the transaction has finished running.
- Advantages:
  - A transaction always reads a value written by a committed transaction.
  - All lock acquisitions and releases can be handled by the system without the transaction being aware of them.
- Problem: deadlock is possible.
  - Example: two transactions acquiring the same two locks in different orders.
66Two-Phase Locking
- Strict two-phase locking.
67Two-phase Locking (2PL)
- In centralized 2PL, a single site is responsible for granting and releasing locks.
- In primary 2PL, each data item is assigned a primary copy. The lock manager on that copy's machine is responsible for granting and releasing locks.
- In distributed 2PL, the schedulers on each machine not only take care that locks are granted and released, but also that the operation is forwarded to the (local) data manager.
68Timestamp-based Concurrency Control
- Each transaction $T_i$ is given a timestamp $ts(T_i)$.
- If $T_i$ wants to do an operation that conflicts with $T_j$:
  - Abort $T_i$ if $ts(T_i) < ts(T_j)$.
- When a transaction aborts, it must restart with a new (larger) timestamp.
- Two values are kept for each data item $x$:
  - $\text{max-rts}(x)$: the maximum timestamp of a transaction that read $x$.
  - $\text{max-wts}(x)$: the maximum timestamp of a transaction that wrote $x$.
69Reads and Writes using Timestamps
- $Read_i(x)$:
  - If $ts(T_i) < \text{max-wts}(x)$ then abort $T_i$.
  - Else:
    - Perform $R_i(x)$.
    - $\text{max-rts}(x) = \max(\text{max-rts}(x), ts(T_i))$.
- $Write_i(x)$:
  - If $ts(T_i) < \text{max-rts}(x)$ or $ts(T_i) < \text{max-wts}(x)$ then abort $T_i$.
  - Else:
    - Perform $W_i(x)$.
    - $\text{max-wts}(x) = ts(T_i)$.
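These read and write rules translate almost directly into code. In the sketch below the class names are hypothetical, timestamps are plain integers, and an abort is modeled as an exception that the caller would catch before restarting the transaction with a larger timestamp.

```python
class Abort(Exception):
    """Raised when an operation arrives too late in timestamp order."""


class TimestampedItem:
    """A data item guarded by its largest read and write timestamps."""

    def __init__(self, value):
        self.value = value
        self.max_rts = 0
        self.max_wts = 0

    def read(self, ts):
        if ts < self.max_wts:
            raise Abort("a younger transaction already wrote this item")
        self.max_rts = max(self.max_rts, ts)
        return self.value

    def write(self, ts, value):
        if ts < self.max_rts or ts < self.max_wts:
            raise Abort("a younger transaction already used this item")
        self.value = value
        self.max_wts = ts


x = TimestampedItem(0)
x.write(5, 10)          # transaction with timestamp 5 writes x
print(x.read(7))        # timestamp 7 may read it
try:
    x.write(6, 99)      # timestamp 6 is older than the reader at 7
except Abort as reason:
    print("aborted:", reason)
```

No operation ever waits in this scheme, so deadlock cannot occur; the price is that late operations cause the whole transaction to be redone.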
70Pessimistic Timestamp Ordering
- Concurrency control using timestamps.
71Optimistic Concurrency Control
- A transaction does what it wants and validates its changes prior to commit.
  - Check whether files/objects have been changed by committed transactions since they were opened.
  - Insight: conflicts are rare, so this works well most of the time.
- Works well with private workspaces.
- Advantages:
  - Deadlock free.
  - Maximum parallelism.
- Disadvantages:
  - A transaction must be rerun if it aborts.
  - The probability of conflict rises substantially at high loads.
- Not widely used.
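Validation at commit time can be sketched with per-item version counters and a private workspace. The version counters are an assumption made for this illustration (the slides only say that changes by committed transactions are checked), the names are hypothetical, and only items that already exist in the store are supported.

```python
class Store:
    """Shared data plus one version counter per item."""

    def __init__(self, data):
        self.data = dict(data)
        self.version = {key: 0 for key in data}


class OptimisticTransaction:
    def __init__(self, store):
        self.store = store
        self.seen = {}         # item -> version when first accessed
        self.workspace = {}    # private copies of written items

    def read(self, key):
        if key in self.workspace:
            return self.workspace[key]
        self.seen.setdefault(key, self.store.version[key])
        return self.store.data[key]

    def write(self, key, value):
        self.seen.setdefault(key, self.store.version[key])
        self.workspace[key] = value

    def commit(self):
        """Validate, then publish the workspace. False means rerun."""
        for key, version in self.seen.items():
            if self.store.version[key] != version:
                return False   # changed by a committed transaction
        for key, value in self.workspace.items():
            self.store.data[key] = value
            self.store.version[key] += 1
        return True


bank = Store({"A": 100, "B": 200, "C": 300})
t1, t2 = OptimisticTransaction(bank), OptimisticTransaction(bank)
t1.write("A", t1.read("A") - 4); t1.write("B", t1.read("B") + 4)
t2.write("C", t2.read("C") - 3); t2.write("B", t2.read("B") + 3)
print(t1.commit(), t2.commit())   # the second transaction must be rerun
```

The second transfer fails validation because account B changed after it was read; rerunning it against the updated store would then succeed.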