Title: Ordering of Events in Distributed Systems
University of Wisconsin-Madison, Computer Sciences Department
CS 739: Distributed Systems
Andrea C. Arpaci-Dusseau
- Two papers
  - "Time, Clocks, and the Ordering of Events in a Distributed System," Lamport, 1978
  - "Distributed Snapshots: Determining Global States of Distributed Systems," Chandy and Lamport, 1985
Motivation: Time, Clocks, Ordering
- To develop distributed algorithms, we want all participants to see messages in the same order (if they are to make the same decisions!)
- How can distributed nodes agree on the order of messages?
[Figure: Process A and Process B each send a message to the other]
Naïve: each process believes it sent its message first
Motivation: Time, Clocks, Ordering
- When does an event precede (happen before) another in a distributed system?
  - Sometimes it is impossible to tell; sometimes it doesn't matter
  - If event A occurs on machine A and event B occurs on machine B, but there is no communication between A and B, did event A or event B happen first?
- How can a process identify which events happened before others?
  - Logical clocks
Terminology
- Distributed system: a collection of distinct processes that are spatially separated and that communicate with one another by exchanging messages
  - How does this differ from our previous definitions?
- Process: a sequence of events (instructions, sending messages, receiving messages)
  - The events within a process have a total ordering
Partial Ordering
- Happened before (→)
- Rules for ordering events a and b
  - If a and b are events in the same process and a comes before b, then a → b
  - If a is the sending of a message by one process and b is the receipt of that message by another process, then a → b
  - If a → b and b → c, then a → c
- a → b: it is possible for a to causally affect b
- Concurrent
  - If a ↛ b and b ↛ a, then we do not know the ordering of a and b
  - It is not possible for a to causally affect b
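The three rules above define → as the transitive closure of program order plus send/receive pairs. The sketch below computes that closure for a small, hypothetical set of events; the function names `happened_before` and `relation` are illustrative and do not come from the papers. Any pair related in neither direction is concurrent.

```python
def happened_before(process_events, messages):
    """Return every pair (a, b) with a -> b.

    process_events: dict mapping a process name to its events in local order.
    messages: list of (send_event, receive_event) pairs.
    """
    hb = set()
    # Rule 1: events in the same process are ordered by their local order
    for events in process_events.values():
        for i, a in enumerate(events):
            for b in events[i + 1:]:
                hb.add((a, b))
    # Rule 2: a send happens before the matching receive
    hb.update(messages)
    # Rule 3: transitivity (Warshall's transitive-closure algorithm)
    all_events = [e for events in process_events.values() for e in events]
    for k in all_events:
        for i in all_events:
            if (i, k) in hb:
                for j in all_events:
                    if (k, j) in hb:
                        hb.add((i, j))
    return hb


def relation(a, b, hb):
    """Describe how two events are ordered under happened-before."""
    if (a, b) in hb:
        return f"{a} -> {b}"
    if (b, a) in hb:
        return f"{b} -> {a}"
    return "concurrent"


# Hypothetical example: a message sent at p1 is received at q2,
# and a message sent at q3 is received at p3
procs = {"P": ["p1", "p2", "p3"], "Q": ["q1", "q2", "q3"]}
hb = happened_before(procs, [("p1", "q2"), ("q3", "p3")])
print(relation("p1", "q3", hb))  # p1 -> q3   (via p1 -> q2 -> q3)
print(relation("q1", "p3", hb))  # q1 -> p3   (via q1 -> q3 -> p3)
print(relation("p2", "q2", hb))  # concurrent
```

The same procedure answers questions like the ones on the space-time diagram once the diagram's message arrows are listed as (send, receive) pairs.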
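Space-Time Diagram
[Figure: space-time diagram with processes P (events p1 to p4), Q (events q1 to q7), and R (events r1 to r4); time runs along the vertical axis and messages are drawn as arrows between the process lines]
What is the relationship between (q3, p3)? (p1, q3)? (p2, q3)? (q3, r4)? (r1, q6)?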
Logical Clocks
- Abstract view: a logical clock is a way to assign a number to an event to express ordering
  - No relation between logical clock and physical time
- Clock Ci for process Pi is a function that assigns a number Ci(a) to any event a in Pi
- Clock condition: for any events a, b
  - If a → b, then C(a) < C(b)
  - The converse condition does not hold
    - Can't say that concurrent events have the same logical time
    - If C(a) < C(b), can't conclude a → b
Implementation of Logical Clocks
- C1. If a and b are events in Pi and a comes before b, then Ci(a) < Ci(b)
  - IR1 (Implementation Rule 1): each process Pi increments Ci between any two successive events
- C2. If a is the sending of a message by Pi and b is the receipt of that message by Pj, then Ci(a) < Cj(b)
  - IR2 (Implementation Rule 2):
    - (a) If event a is the sending of message m by process Pi, then m contains a timestamp Tm = Ci(a)
    - (b) Upon receiving m, process Pj sets Cj greater than or equal to its present value and greater than Tm
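One common way to satisfy IR1 and IR2 is to add 1 on every event and, on receipt, jump to max(Cj, Tm) + 1. The class below is a minimal sketch under that assumption; `LamportClock` and its method names are hypothetical, and any increment that keeps C1 and C2 true would work equally well.

```python
class LamportClock:
    """Logical clock for one process (sketch: increments of 1 are an assumption)."""

    def __init__(self, initial=0):
        self.time = initial  # timestamp of the most recent event

    def local_event(self):
        # IR1: increment between any two successive events
        self.time += 1
        return self.time

    def send(self):
        # IR2(a): sending is an event; its timestamp becomes Tm on the message
        return self.local_event()

    def receive(self, tm):
        # IR2(b): end up >= the present value and > Tm
        # (the +1 also counts the receive as a new event, per IR1)
        self.time = max(self.time, tm) + 1
        return self.time


p, q = LamportClock(), LamportClock()
tm = p.send()          # p's clock is 1, so the message carries Tm = 1
q.local_event()
q.local_event()
q.local_event()        # q's clock is now 3
print(q.receive(tm))   # 4: q is already ahead of Tm, so it just increments
r = LamportClock()
print(r.receive(50))   # 51: r jumps past the incoming timestamp
```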
Logical Clocks Example
What logical clock values are possible? Assume initial values C(p1) = 5, C(q1) = 50, C(r1) = 2
Logical Clocks Example (continued)
[Figure: the space-time diagram for P, Q, and R annotated with one possible assignment of logical clock values]
Total Ordering
- Use logical clocks to obtain a total ordering (⇒) across all processes and events
- a ⇒ b if and only if
  - 1) Ci(a) < Cj(b), OR
  - 2) Ci(a) = Cj(b) and Pi < Pj (i.e., use process IDs to break ties)
- The partial ordering is unique, but the total ordering is not!
  - Concurrent operations can go in any order
  - The total ordering depends on the implementation of each Ci()
  - The total ordering depends on the tie-breaking rule
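Sorting by the pair (clock value, process ID) yields exactly this total order. The sketch below assumes events are hypothetical `(clock, pid, label)` tuples with integer process IDs; the `tiebreak` parameter is an illustrative addition that lets the caller swap the tie-breaking rule, showing that a different rule gives a different but equally valid total order.

```python
def total_order(events, tiebreak=lambda pid: pid):
    """Sort (clock, pid, label) events: by clock value, then by the tie-break key."""
    return sorted(events, key=lambda e: (e[0], tiebreak(e[1])))


events = [(3, 2, "b"), (1, 1, "a"), (3, 1, "c"), (2, 3, "d")]

# Lower process ID wins ties (the rule on this slide)
print([label for _, _, label in total_order(events)])  # ['a', 'd', 'c', 'b']

# Higher process ID wins ties: "c" and "b" swap, everything else is unchanged
print([label for _, _, label in total_order(events, tiebreak=lambda pid: -pid)])
# ['a', 'd', 'b', 'c']
```

Only the concurrent events with equal clock values ("b" and "c") change places; events ordered by → keep their relative order under either rule.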
Distributed State Machines
- Each process runs the same distributed algorithm
  - Relies on a total ordering of requests
  - Agreed upon by all participants
  - Can be used to ensure all participants see events (inputs) in the same order and therefore make the same decisions
- Idea
  - All requests are timestamped; responses (acks) are timestamped too
  - Send each timestamped request to all processes
  - Handle the next request in the total order
  - To know the next request, a process must have received a message with a greater timestamp from every possible participant
- Problems?
- Example: mutual exclusion
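The "greater timestamp from every participant" rule can be sketched as a hold-back queue. This is a hypothetical illustration rather than code from the paper: it assumes FIFO channels, clocks that only increase, and that each process also receives its own requests. The request timestamps 2 and 21 are borrowed from the mutual exclusion example; the ack timestamps are made up.

```python
import heapq


class OrderedDelivery:
    """Hold back timestamped requests until they are safe to handle (sketch)."""

    def __init__(self, participants):
        # Largest timestamp heard so far from each participant (request or ack)
        self.last_seen = {p: float("-inf") for p in participants}
        # Min-heap ordered by (timestamp, pid): Lamport's total order
        self.pending = []

    def on_message(self, timestamp, pid, request=None):
        self.last_seen[pid] = max(self.last_seen[pid], timestamp)
        if request is not None:
            heapq.heappush(self.pending, (timestamp, pid, request))

    def deliverable(self):
        """Remove and return, in total order, every request that is now safe."""
        ready = []
        while self.pending:
            ts, pid, request = self.pending[0]
            # With FIFO channels, once every other participant has sent something
            # later than ts, no request ordered before this one can still arrive
            if all(t > ts for p, t in self.last_seen.items() if p != pid):
                heapq.heappop(self.pending)
                ready.append(request)
            else:
                break
        return ready


q = OrderedDelivery(["A", "B", "C"])
q.on_message(2, "A", "A acquires")
q.on_message(21, "C", "C acquires")
print(q.deliverable())   # []  (nothing heard from B yet)
q.on_message(22, "B")    # ack from B
print(q.deliverable())   # ['A acquires']
q.on_message(23, "A")    # ack from A
print(q.deliverable())   # ['C acquires']
```

The waiting in the first call hints at one of the "Problems?": a single slow or failed participant stalls every other process.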
Mutual Exclusion Example
[Figure: nodes A, B, C, and D, each labeled with its current logical clock value (values 1, 10, 20, and 30)]
Scenario: A and C both want the resource. A and C each send a request before seeing the other's (concurrent requests), and C sends its request earlier in physical time. How will all nodes agree on who should get the resource?
Request queue
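Mutual Exclusion Example
[Figure: the same four nodes exchanging timestamped requests and acknowledgments, with each node's logical clock advancing as messages arrive]
Request queue: A (timestamp 2), C (timestamp 21)
Because 2 < 21, every node orders A's request first and A gets the resource, even though C sent its request earlier in physical time.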
Physical Clocks
- Motivation: anomalous behavior can be observed if other communication channels exist between processes
  - Useful to have a physical clock with meaning in the physical world
- Synchronize independent physical clocks, each running at a slightly different rate (skew)
- Implementation idea
  - Send a timestamp with each message
  - Receiver may advance its clock to timestamp + minimal network delay
  - Clock must always increase
- Lots of work in this area
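Lamport's physical-clock version of IR2 advances the receiver to at least Tm plus the minimum message delay and never moves it backward. A minimal sketch, assuming times in seconds, a hypothetical `SyncedClock` that keeps a correction offset on top of a hardware reading supplied by the caller, and a known minimum delay `min_delay`:

```python
class SyncedClock:
    """Physical clock corrected by incoming timestamps (illustrative sketch)."""

    def __init__(self, offset=0.0):
        self.offset = offset  # correction added to the hardware clock reading

    def read(self, hardware_time):
        return hardware_time + self.offset

    def on_receive(self, hardware_time, tm, min_delay):
        # The message took at least min_delay, so "now" is at least tm + min_delay
        target = tm + min_delay
        now = self.read(hardware_time)
        if target > now:
            # Only ever jump forward: the clock must always increase
            self.offset += target - now
        return self.read(hardware_time)


c = SyncedClock()
print(c.on_receive(100.0, 103.0, 0.5))  # 103.5: local clock was behind, jumps forward
print(c.on_receive(101.0, 99.0, 0.5))   # 104.5: stale timestamp, no adjustment
```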
Conclusions
- Distributed, replicated state machines are useful for tolerating faults
  - Need to construct a total ordering of events to obtain the same results everywhere
- Logical clocks are very simple to implement
- We will see logical clocks used to update replicas
Distributed Snapshots
- Goal
  - Want to record the global state of a distributed system (i.e., the state of each process and the state of each communication channel)
  - Useful for observing system properties
    - Has the computation terminated?
    - Is the system deadlocked?
    - How many tokens are there?
- Complication
  - A distributed system has no shared state and no shared clock
  - Cannot record the global state simultaneously everywhere
- Distributed snapshot: record local states at different times and combine them into a meaningful picture
  - Obtain a cut in logical time; remain consistent by preserving the logical ordering (if not the ordering in physical time)
System Model
- Distributed system: a finite set of processes and channels, described by a graph
- Processes
  - Set of states, initial state, set of events
- Channels
  - FIFO, error-free, infinite buffers, arbitrary but finite delay
  - State: the sequence of messages sent but not yet received
Distributed Snapshot Algorithm
- Goal: record local state (each process plus its adjoining channels) so that the local states together produce a meaningful global system state
- Idea
  - After taking its local snapshot, a process sends a marker along its channels
  - The receiver records the messages that arrive on a channel ahead of the marker
- Initiation: some process decides to initiate a snapshot (performed periodically)
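Intuition with Logical Time
[Figure: processes A and B exchanging messages m1 and m2, with local snapshots S, S1, S2, and S3 marked on their timelines]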
Which snapshots are concurrent? How to reconcile
S2 and S? How to reconcile S1 and S? How to
reconcile S3 and S?
Marker Rules
- Marker-sending rule for p
  - After recording the state of p, send a marker along each outgoing channel before sending any more messages
- Marker-receiving rule for q on channel c
  - If q has not recorded its state yet
    - Record the state of q
    - Record the state of c as empty
    - Follow the marker-sending rule
  - If q has already recorded its state
    - Record the state of c as the sequence of messages that arrived on c after q recorded its state and before the marker
- Termination
  - When the states of all processes and all channels have been recorded
  - Must also have an algorithm to collect and assemble the recorded information
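The marker rules can be simulated in a few dozen lines. The sketch below is a hypothetical simulator, not the paper's code: it assumes a complete graph of FIFO channels and bank balances as process state, and replays one event order that is consistent with the numbers in the banking example that follows (the exact order in the original figure is an assumption).

```python
from collections import deque

MARKER = "MARKER"  # control message, never equal to an integer amount


class Process:
    def __init__(self, balance, peers):
        self.balance = balance
        self.peers = peers            # every other process (complete graph assumed)
        self.recorded_state = None    # balance at the moment this process recorded
        self.channel_state = {}       # incoming peer -> messages recorded on that channel
        self.recording = set()        # incoming channels still being recorded


class Network:
    def __init__(self, balances):
        names = list(balances)
        self.procs = {n: Process(b, [m for m in names if m != n])
                      for n, b in balances.items()}
        # One FIFO queue per directed channel (src, dst)
        self.channels = {(a, b): deque() for a in names for b in names if a != b}

    def transfer(self, src, dst, amount):
        self.procs[src].balance -= amount
        self.channels[(src, dst)].append(amount)

    def record(self, name):
        # Marker-sending rule: record own state, then put a marker on every
        # outgoing channel before this process sends anything else
        p = self.procs[name]
        p.recorded_state = p.balance
        p.recording = set(p.peers)
        for peer in p.peers:
            self.channels[(name, peer)].append(MARKER)

    def deliver(self, src, dst):
        """Deliver the oldest message on channel src -> dst."""
        msg = self.channels[(src, dst)].popleft()
        p = self.procs[dst]
        if msg == MARKER:
            if p.recorded_state is None:
                self.record(dst)            # first marker: record state, send markers
                p.channel_state[src] = []   # and record this channel as empty
            else:
                # Channel state = messages that arrived since recording (maybe none)
                p.channel_state.setdefault(src, [])
            p.recording.discard(src)
        else:
            p.balance += msg
            if p.recorded_state is not None and src in p.recording:
                p.channel_state.setdefault(src, []).append(msg)

    def drain(self):
        """Deliver all remaining messages, each channel in FIFO order."""
        while any(self.channels.values()):
            for (src, dst), chan in self.channels.items():
                if chan:
                    self.deliver(src, dst)

    def snapshot_total(self):
        held = sum(p.recorded_state for p in self.procs.values())
        in_transit = sum(sum(msgs) for p in self.procs.values()
                         for msgs in p.channel_state.values())
        return held + in_transit


# One event order consistent with the banking example (assumed; the figure is lost)
net = Network({"p1": 10, "p2": 20, "p3": 30})
net.transfer("p1", "p2", 1)
net.deliver("p1", "p2")
net.transfer("p2", "p3", 3)
net.deliver("p2", "p3")
net.transfer("p1", "p3", 3)      # ends up in p3's recorded channel from p1
net.transfer("p3", "p1", 5)      # ends up in p1's recorded channel from p3
net.record("p2")                 # p2 initiates: records 18
net.transfer("p3", "p2", 2)
net.deliver("p3", "p2")          # arrives after p2 recorded: goes into channel state
net.transfer("p3", "p2", 4)
net.deliver("p2", "p3")          # marker: p3 records 22
net.deliver("p2", "p1")          # marker: p1 records 6
net.drain()

print({n: p.recorded_state for n, p in net.procs.items()})  # {'p1': 6, 'p2': 18, 'p3': 22}
print(net.procs["p2"].channel_state["p3"])                  # [2, 4]
print(net.snapshot_total())                                 # 60
```

The recorded total matches the initial 10 + 20 + 30, even though no process ever held these exact balances at the same instant.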
Banking Example
Stable property?
Banking Example
p2 state: 20 + 1 - 3 = 18, empty channels
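Banking Example
[Figure: banking diagram with p1 (balance 10), p2 (balance 20), and p3 (balance 30) exchanging transfers of 1, 2, 3, 5, 3, and 4]
p1 state: 10 - 1 - 3 = 6, empty channels
p3 state: 30 - 2 - 5 - 4 + 3 = 22, empty channels
Total money? 18 + 6 + 22 = 46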
Banking Example
[Figure: banking diagram with p1 (balance 10), p2 (balance 20), and p3 (balance 30) exchanging transfers of 1, 2, 3, 5, 3, and 4]
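Banking Example
Channel to p2 from p1: nothing; channel to p2 from p3: 4, 2; channel to p3 from p1: 3; channel to p1 from p3: 5
p1: 6, p2: 18, p3: 22
Total: 6 + 18 + 22 + (4 + 2 + 3 + 5) = 60, the same as the initial 10 + 20 + 30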
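Banking Example
[Figure: banking diagram with p1 (balance 10), p2 (balance 20), and p3 (balance 30) exchanging transfers of 1, 2, 3, 5, 3, and 4]
Channel to p2 from p3: in the actual execution, 2 and 4 were never in this channel simultaneously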
Properties of Recorded Global State
- The recorded global state, S, may never have actually occurred
  - If it bothers you that S doesn't actually exist...
- There is a permutation of the actual sequence of events such that
  - S is reachable from Sinit
  - Sfinal is reachable from S
  - Therefore a stable property that holds in Sinit also holds in S, and one that holds in S also holds in Sfinal
- How to permute the sequence of events?
  - Goal: want the snapshot to correspond to a single logical cut
  - Slide events so that the snapshots are taken at the same logical time
  - Some events across processes will switch order with others
    - Specifically, post-recorded events and pre-recorded events
    - Pre-recorded events occurred before the state of p was recorded; post-recorded events occurred after
  - Can't tell the ordering of concurrent pre- and post-recorded events
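Pre-recorded and post-recorded events define the cut. A cut is consistent when no message is received before the cut but sent after it; messages sent before the cut and received after it make up the recorded channel states. A small sketch; the function names and the event labels (loosely modeled on p2 and p3 in the banking example) are hypothetical.

```python
def is_consistent_cut(pre_recorded, messages):
    """True if no message is received inside the cut but sent outside it.

    pre_recorded: set of events that happened before their process recorded its state.
    messages: list of (send_event, receive_event) pairs.
    """
    return all(send in pre_recorded
               for send, recv in messages if recv in pre_recorded)


def in_transit(pre_recorded, messages):
    """Messages sent before the cut but received after it (the channel states)."""
    return [(send, recv) for send, recv in messages
            if send in pre_recorded and recv not in pre_recorded]


pre = {"p2 recv 1", "p2 send 3", "p3 recv 3", "p3 send 2", "p3 send 4"}
msgs = [("p2 send 3", "p3 recv 3"),
        ("p3 send 2", "p2 recv 2"),
        ("p3 send 4", "p2 recv 4")]

print(is_consistent_cut(pre, msgs))  # True
print(in_transit(pre, msgs))  # [('p3 send 2', 'p2 recv 2'), ('p3 send 4', 'p2 recv 4')]

# A cut that contains the receipt of 3 but not its send is inconsistent
print(is_consistent_cut({"p3 recv 3"}, msgs))  # False
```

Swapping a post-recorded receive (p2 receiving 2) with a pre-recorded send (p3 sending 4) never changes either answer, which is why the permuted execution is indistinguishable to the processes.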
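Banking Example
[Figure: banking diagram with p1 (balance 10), p2 (balance 20), and p3 (balance 30) exchanging transfers of 1, 2, 3, 5, 3, and 4, with events marked as pre-recorded and post-recorded]
Example: need to swap p3's sending of 4 and p2's receipt of 2. The result is still logically consistent: no process could observe the difference.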
Conclusions
- Distributed snapshots allow one to reconstruct a valid picture of the system from snapshots taken at different points in time
  - Record individual process states plus channels
- Snapshots are useful for determining whether stable properties hold
- The recorded state S may not correspond to reality, but S is reachable from the initial state and the final state is reachable from S, so a stable property that holds in S also holds in the final state (and one that held in the initial state holds in S)