Title: Logical Time
1Logical Time
2Introduction
- The concept of logical time has its origin in a
seminal paper by Leslie Lamport: "Time, Clocks,
and the Ordering of Events in a Distributed
System," Communications of the ACM, July 1978.
- The topic remains of interest: a recent paper
appeared in Computer, "Logical Time: Capturing
Causality in Distributed Systems" by Raynal and
Singhal (see handout).
3Application of Logical Time
- Logical Time in Visualizations Produced by
Parallel Computations
- Bank system algorithm.
- "Efficient Solutions to the Replicated Log and
Dictionary Problems" by Wuu and Bernstein.
4Background 1 (source: Raynal and Singhal)
- A distributed computation consists of a set of
processes that cooperate and compete to achieve a
common goal. These processes do not share a
common global memory and communicate solely by
passing messages over a communication network.
5Background 2 (source: Raynal and Singhal)
- In a distributed system, a process's actions are
modeled as three types of events: internal,
message send, and message receive.
- An internal event affects only the process at
which it occurs, and the events at a process are
linearly ordered by their order of occurrence.
- Send and receive events signify the flow of
information between processes and establish
causal dependency from the sender process to the
receiver process.
6Background 3 (source: Raynal and Singhal)
- The execution of a distributed application
results in a set of distributed events produced
by the processes.
- The causal precedence relation induces a partial
order on the events of a distributed computation.
7Background 4 (source: Raynal and Singhal)
- Causality among events, more formally the
causal precedence relation, is a powerful concept
for reasoning, analyzing, and drawing inferences
about a distributed computation. Knowledge of the
causal precedence relation between processes
helps programmers, designers, and the system
itself solve a variety of problems in distributed
computing.
8Background 5 (source: Raynal and Singhal)
- The notion of time is basic to capturing the
causality between events. Distributed systems
have no built-in physical time and can only
approximate it. However, in a distributed
computation, both the progress and the
interaction between processes occur in spurts.
Consequently, logical clocks can be used to
accurately capture the causality relation between
events.
- The article presents a general framework of a
system of logical clocks in distributed systems
and discusses three methods (scalar, vector, and
matrix) for implementing logical time in these
systems.
9Notations
- A distributed program is composed of a set of n
independent and asynchronous processes $p_1, p_2, \ldots,
p_i, \ldots, p_n$. These processes do not share a global
clock.
- Each process can execute an event spontaneously;
when sending a message, it does not have to wait
for the delivery to be complete.
- The execution of each process $p_i$ produces a
sequence of events $e_i^0, e_i^1, \ldots, e_i^x, e_i^{x+1}, \ldots$
- The set of events produced by $p_i$ has a total
order determined by the sequencing of the events:
$e_i^x \rightarrow e_i^{x+1}$
- We say that $e_i^x$ happens before $e_i^{x+1}$.
- The happens-before relation $\rightarrow$ is transitive:
$e_i^x \rightarrow e_i^y$ for all $x < y$.
10Notations - 2
- Events that occur between two concurrent
processes are generally unrelated, except for
those that are causally related as follows:
- for every message m exchanged between two
processes $P_i$ and $P_j$, we have $e_i^x = send(m)$,
$e_j^y = receive(m)$, and $e_i^x \rightarrow e_j^y$
- Events in a distributed execution are partially
ordered:
- Local events are totally ordered.
- Causally related events are ordered with respect
to each other.
- All other events are unordered.
- For any two events $e_1$ and $e_2$ in a distributed
execution, either (i) $e_1 \rightarrow e_2$, (ii) $e_2 \rightarrow e_1$,
or (iii) $e_1 \parallel e_2$ (that is, $e_1$ and $e_2$ are
concurrent).
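The three cases above can be checked mechanically on a small execution. The following is a minimal sketch, assuming events are named by (process, index) pairs and messages are given as (send event, receive event) pairs; these encodings and the function names `happens_before` and `relate` are illustrative and not part of the source.

```python
from itertools import product


def happens_before(events_per_process, messages):
    """Return the set of pairs (a, b) such that a happens before b.

    events_per_process: dict mapping a process id to its number of events.
    messages: list of (send_event, receive_event) pairs; each event is a
    (process, index) tuple. Both encodings are assumptions of this sketch.
    """
    events = [(p, x) for p, n in events_per_process.items() for x in range(n)]
    # Direct edges: consecutive local events, plus send -> receive.
    before = {((p, x), (p, x + 1))
              for p, n in events_per_process.items() for x in range(n - 1)}
    before |= set(messages)
    # Transitive closure; the intermediate event k is the outermost loop.
    for k, a, b in product(events, repeat=3):
        if (a, k) in before and (k, b) in before:
            before.add((a, b))
    return before


def relate(e1, e2, before):
    """Classify two distinct events into one of the three cases."""
    if (e1, e2) in before:
        return "e1 -> e2"
    if (e2, e1) in before:
        return "e2 -> e1"
    return "concurrent"


# Two processes with two events each; the first event of process 1 sends
# a message that process 2 receives as its second event.
hb = happens_before({1: 2, 2: 2}, [((1, 0), (2, 1))])
print(relate((1, 0), (2, 1), hb))  # e1 -> e2
print(relate((1, 1), (2, 0), hb))  # concurrent
```

The cubic closure is only meant for hand-sized examples; logical clocks exist precisely so that this relation does not have to be computed globally.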
11Which of these events are $\rightarrow$ related? Which ones
are concurrent?
12Clock conditions
- In a system of logical clocks, every
participating process has a logical clock that is
advanced according to a protocol.
- Every event is assigned a timestamp in such a
manner that satisfies the clock consistency
condition:
- if $e_1 \rightarrow e_2$ then $C(e_1) < C(e_2)$
- where $C(e_i)$ is the timestamp assigned to
event $e_i$
- If the protocol satisfies the following condition
as well, then the clock is said to be strongly
consistent:
- if $C(e_1) < C(e_2)$ then $e_1 \rightarrow e_2$
13A logical clock implementation - the Lamport
Clock
- R1: Before executing an event (send, receive, or
internal), $p_i$ executes the following:
- $C_i := C_i + d$ ($d > 0$, usually $d = 1$)
- R2: Each message carries the clock value of its
sender at sending time. When $p_i$ receives a
message with the timestamp $C_{msg}$, it executes the
following:
- $C_i := \max(C_i, C_{msg})$
- Execute R1.
- Deliver the message.
- The logical clock at any process is
monotonically increasing.
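Rules R1 and R2 translate almost directly into code. The sketch below is one possible rendering, with d = 1 by default; the class and method names (`LamportClock`, `tick`, `send`, `receive`) are chosen for illustration only.

```python
class LamportClock:
    """Scalar logical clock following rules R1 and R2."""

    def __init__(self, d=1):
        self.d = d          # increment, must be > 0
        self.value = 0      # C_i

    def tick(self):
        """R1: advance the clock before executing any event."""
        self.value += self.d
        return self.value

    def send(self):
        """A send is an event; the returned value travels with the message."""
        return self.tick()

    def receive(self, c_msg):
        """R2: take the max with the message timestamp, then apply R1."""
        self.value = max(self.value, c_msg)
        return self.tick()


p1, p2 = LamportClock(), LamportClock()
p1.tick()                 # internal event at p1, C = 1
c_msg = p1.send()         # send event at p1, C = 2
p2.tick()                 # internal event at p2, C = 1
print(p2.receive(c_msg))  # 3
```

The receive event ends up with a timestamp larger than the send event's, which is what the clock consistency condition requires for a message.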
14Fill in the logical clock values
15Correctness of the Lamport Clock
- Does the Lamport clock satisfy the clock
consistency condition?
- Does the Lamport clock satisfy the strong clock
consistency condition?
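One way to probe the second question is to timestamp a tiny run and look for two concurrent events whose timestamps are nevertheless ordered. The sketch below assumes d = 1 and a made-up schedule encoding (event name, process, name of the matching send event or None); `lamport_timestamps` is a hypothetical helper, not something from the slides.

```python
def lamport_timestamps(schedule):
    """Assign Lamport timestamps (d = 1) to events listed in execution order.

    schedule: list of (name, process, received_from), where received_from
    is the name of the matching send event for a receive, else None.
    """
    clock, stamp = {}, {}
    for name, proc, received_from in schedule:
        c = clock.get(proc, 0)
        if received_from is not None:
            c = max(c, stamp[received_from])   # R2: max with C_msg
        clock[proc] = stamp[name] = c + 1      # R1: advance by d = 1
    return stamp


# p1 executes a then b; p2 executes c. No messages are exchanged,
# so c is concurrent with both a and b.
stamp = lamport_timestamps([("a", 1, None), ("b", 1, None), ("c", 2, None)])
print(stamp)  # {'a': 1, 'b': 2, 'c': 1}
```

In this run C(c) < C(b) even though c does not happen before b, so a smaller scalar timestamp alone does not establish causal precedence.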
16Logical Clock Protocols
- The Lamport Clock is an example of a logical
clock protocol. There are others.
- The Lamport Clock is a scalar clock: it uses a
single integer to represent the clock value.
17Lamport clock paper
- PODC Influential Paper Award 2000,
http://www.podc.org/influential/2000.html
- "Time, Clocks, and the Ordering of Events in a
Distributed System" by Leslie Lamport, obtainable
from the ACM Digital Library.
18An application of scalar logical time: bank
system algorithm
- See bank system algorithm slides
19Vector Logical Clock
- Developed by several persons independently.
- Each $P_i$ of n participating processes maintains an
integer vector (array) of size n, $vt_i[1..n]$, where
$vt_i[i]$ is the local logical clock of $P_i$,
- $vt_i[j]$ represents $P_i$'s latest knowledge of $P_j$'s
local time.
20Vector clock protocol
- At process $P_i$:
- Before executing an event, $P_i$ updates its local
logical time as follows:
- $vt_i[i] := vt_i[i] + d$ ($d > 0$)
- Each sender process piggybacks a message m with
its vector clock value at sending time. Upon
receiving such a message (m, vt), $P_i$ updates its
vector clock as follows:
- For $1 \le k \le n$: $vt_i[k] := \max(vt_i[k], vt[k])$
- $vt_i[i] := vt_i[i] + d$ ($d > 0$)
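The protocol above can be sketched as a small class. This is an illustrative rendering only: process indices are 0-based here (the slides count from 1), and the names `VectorClock`, `tick`, `send`, and `receive` are assumptions of the sketch.

```python
class VectorClock:
    """Vector logical clock for process i out of n processes (0-based)."""

    def __init__(self, i, n, d=1):
        self.i = i
        self.d = d
        self.vt = [0] * n

    def tick(self):
        """Advance the local component before executing an event."""
        self.vt[self.i] += self.d
        return list(self.vt)

    def send(self):
        """A send is an event; the returned copy is piggybacked on m."""
        return self.tick()

    def receive(self, vt_msg):
        """Component-wise max with the received vector, then tick."""
        self.vt = [max(a, b) for a, b in zip(self.vt, vt_msg)]
        return self.tick()


p0, p1 = VectorClock(0, 2), VectorClock(1, 2)
p0.tick()               # p0: [1, 0]
msg = p0.send()         # p0: [2, 0]
p1.tick()               # p1: [0, 1]
print(p1.receive(msg))  # [2, 2]
```

After the receive, p1's vector records both its own two events and the two events of p0 that causally precede the message.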
21Vector clock
- The system of vector clocks is strongly
consistent:
- Every event is assigned a timestamp in such a
manner that satisfies the clock consistency
condition:
- if $e_1 \rightarrow e_2$ then $vt(e_1) < vt(e_2)$, using vector
comparison
- where $vt(e_i)$ is the timestamp assigned to
event $e_i$
- The protocol satisfies the converse condition
as well, which is what makes the clock strongly
consistent:
- if $vt(e_1) < vt(e_2)$ then $e_1 \rightarrow e_2$, using vector
comparison
22Vector comparison
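- Given two vectors V1 and V2, both of size n:
- $V1 < V2$ if $V1[i] \le V2[i]$ for $i = 1, \ldots, n$
- And there exists some k, $1 \le k \le n$, such that
$V1[k] < V2[k]$
- Example: V1 = (1, 2, 3, 4), V2 = (2, 3, 4, 5)
- $V1 < V2$
- Example: V1 = (1, 2, 3, 4), V2 = (2, 2, 4, 4)
- $V1 < V2$ (every component of V1 is $\le$ its
counterpart in V2, and the first and third are
strictly smaller)
- Example: V1 = (1, 2, 3, 4), V2 = (2, 3, 4, 1)
- $V1$ (not) $< V2$ (the last component of V1 is
larger)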
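The comparison can be written as a short predicate. The sketch below reads the definition above as "every component less than or equal, and at least one component strictly less", which is the reading that gives the existence clause a purpose; the function name `vector_less_than` is illustrative.

```python
def vector_less_than(v1, v2):
    """Return True if v1 < v2 under vector comparison."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same size")
    return (all(a <= b for a, b in zip(v1, v2))
            and any(a < b for a, b in zip(v1, v2)))


print(vector_less_than([1, 2, 3, 4], [2, 3, 4, 5]))  # True
print(vector_less_than([1, 2, 3, 4], [2, 3, 4, 1]))  # False
print(vector_less_than([1, 2, 3, 4], [1, 2, 3, 4]))  # False
```

Equal vectors are not ordered by this relation, which is why the strict inequality in at least one position is required.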
23Vector clock
- Because vector clocks are strongly consistent, we
can use them to determine whether two events are
causally related by comparing their vector time
stamps, using vector comparison.
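As a sketch of that use, the function below classifies two events from their vector timestamps alone. It assumes the timestamps come from the same system of n processes; the name `causal_relation` and the returned strings are illustrative choices.

```python
def causal_relation(vt1, vt2):
    """Classify events e1 and e2 from their vector timestamps."""
    le_12 = all(a <= b for a, b in zip(vt1, vt2))
    le_21 = all(b <= a for a, b in zip(vt1, vt2))
    if le_12 and le_21:
        return "same timestamp"   # identical vectors, e.g. the same event
    if le_12:
        return "e1 -> e2"
    if le_21:
        return "e2 -> e1"
    return "concurrent"


print(causal_relation([2, 0], [2, 2]))  # e1 -> e2
print(causal_relation([1, 0], [0, 1]))  # concurrent
```

Neither vector dominates the other in the second call, so the two events are concurrent; a scalar clock could not report that.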
24Matrix Time
- Proposed by Michael and Fischer in 1982.
- A process $P_i$ maintains a matrix
$mt_i[1..n, 1..n]$ where
- $mt_i[i, i]$ denotes the logical clock of $P_i$
- $mt_i[i, j]$ denotes the latest knowledge that $P_i$
has about the local clock, $mt_j[j, j]$, of $P_j$ (row i
is the vector clock of $P_i$).
- $mt_i[j, k]$ represents what $P_i$ knows about the
latest knowledge that $P_j$ has about the local
logical clock $mt_k[k, k]$ of $P_k$.
25Matrix Time Protocol
- At process $P_i$:
- Before executing an event, $P_i$ updates its local
logical time as follows:
- $mt_i[i, i] := mt_i[i, i] + d$ ($d > 0$)
- Each sender process piggybacks a message m with
its matrix clock value at sending time. Upon
receiving such a message (m, mt) from $P_j$, $P_i$
updates its matrix clock as follows:
- 1. for $1 \le k \le n$: $mt_i[i, k] := \max(mt_i[i, k], mt[j, k])$
- 2. for $1 \le k \le n$ and $1 \le q \le n$:
$mt_i[k, q] := \max(mt_i[k, q], mt[k, q])$
- 3. $mt_i[i, i] := mt_i[i, i] + d$ ($d > 0$)
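The three receive steps can be sketched as follows. Indices are 0-based in this sketch (the slides count from 1), and the names `MatrixClock`, `tick`, `send`, and `receive` are illustrative rather than taken from the source.

```python
class MatrixClock:
    """Matrix logical clock for process i out of n processes (0-based)."""

    def __init__(self, i, n, d=1):
        self.i, self.n, self.d = i, n, d
        self.mt = [[0] * n for _ in range(n)]

    def tick(self):
        """Advance the local clock mt[i][i] before executing an event."""
        self.mt[self.i][self.i] += self.d

    def send(self):
        """A send is an event; a copy of the matrix travels with m."""
        self.tick()
        return [row[:] for row in self.mt]

    def receive(self, j, mt_msg):
        i, n = self.i, self.n
        # Step 1: merge the sender's own row (its vector clock) into row i.
        for k in range(n):
            self.mt[i][k] = max(self.mt[i][k], mt_msg[j][k])
        # Step 2: component-wise max over the whole matrix.
        for k in range(n):
            for q in range(n):
                self.mt[k][q] = max(self.mt[k][q], mt_msg[k][q])
        # Step 3: the receive is itself an event.
        self.tick()


p0, p1 = MatrixClock(0, 2), MatrixClock(1, 2)
msg = p0.send()       # p0's matrix is now [[1, 0], [0, 0]]
p1.receive(0, msg)
print(p1.mt)          # [[1, 0], [1, 1]]
```

Row 1 of the result is p1's own vector clock, and row 0 is what p1 now knows about p0's vector clock.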
26matrix clock consistency
- The system of matrix clocks is strongly
consistent:
- Every event is assigned a timestamp in such a
manner that satisfies the clock consistency
condition:
- if $e_1 \rightarrow e_2$ then $mt(e_1) < mt(e_2)$, using matrix
comparison
- where $mt(e_i)$ is the timestamp assigned to
event $e_i$
- The protocol satisfies the converse condition
as well, which is what makes the clock strongly
consistent:
- if $mt(e_1) < mt(e_2)$ then $e_1 \rightarrow e_2$, using matrix
comparison
27Matrix comparison
- Given two matrices M1 and M2, both of size n by
n:
- $M1 < M2$ if $M1[i, j] \le M2[i, j]$
for $i = 1, \ldots, n$ and $j = 1, \ldots, n$
- And there exist some k, $1 \le k \le n$, and some p,
$1 \le p \le n$, such that $M1[k, p] < M2[k, p]$
- Because matrix clocks are strongly consistent, we
can use them to determine whether two events are
causally related by comparing their matrix
timestamps
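Matrix comparison is the entry-by-entry analogue of vector comparison. A minimal sketch, assuming both matrices are given as lists of rows of equal shape; the name `matrix_less_than` is illustrative:

```python
def matrix_less_than(m1, m2):
    """Return True if m1 < m2: every entry <=, at least one entry <."""
    pairs = [(a, b) for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2)]
    return (all(a <= b for a, b in pairs)
            and any(a < b for a, b in pairs))


print(matrix_less_than([[1, 0], [0, 0]], [[1, 0], [1, 1]]))  # True
print(matrix_less_than([[1, 0], [0, 2]], [[1, 0], [1, 1]]))  # False
```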
28An application of matrix time: Wuu and Bernstein
paper
- The dictionary problem: a dictionary is
replicated among multiple nodes. Each node
maintains a view of the dictionary independently
by performing operations on the dictionary
independently.
- The network may be unreliable.
- The dictionary data must be consistent among the
nodes.
- Serializability (using locking) is the database
approach to address such a problem.
- The paper (as did other papers preceding it)
describes an algorithm which does not require
serializability.
29Wuu and Bernstein protocol
- A replicated log is used to achieve mutual
consistency of replicated data in an unreliable
network.
- The log contains records of invocations of
operations which access a data object.
- Each node updates its local copy of the data
object by performing the operations contained in
its local copy of the log.
- The operations are commutative so that the order
in which operations are performed does not affect
the final state of the data.
30The problem environment
- n nodes $N_1, N_2, \ldots, N_n$ are connected over a
network.
- Each node maintains a data dictionary V: a set
of words $s_1, s_2, \ldots, s_n$, stored in stable
storage impervious to crashes.
- $V_i$ denotes the local view of the dictionary at
$N_i$.
- Two types of operations may be issued by any node
to perform on the dictionary:
- insert(x)
- delete(x)
- delete(x) can be invoked at $N_i$ only if x is in $V_i$;
note that the operation may be issued by
multiple nodes.
- insert(x) can only be issued by one node.
31The problem environment - 2
- The unique event which inserts x is denoted $e_x$.
- An event which deletes x is called an x-delete
event.
- If V(e) is the dictionary view at a node after
the occurrence of event e, then x is in V(e) iff
$e_x \rightarrow e$ and there does not exist an x-delete
event, g, such that $g \rightarrow e$.
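The membership rule can be turned into a small function that computes a view from the events known to precede e. The sketch assumes `history` holds records for exactly those events (together with e itself), and the `Record` layout with a separate `item` field is a hypothetical encoding chosen for the example.

```python
from collections import namedtuple

# Hypothetical record layout for this sketch.
Record = namedtuple("Record", ["op", "item", "time", "node"])


def view_from_history(history):
    """x is in the view iff its insert is in the history and no delete is."""
    inserted = {r.item for r in history if r.op == "insert"}
    deleted = {r.item for r in history if r.op == "delete"}
    return inserted - deleted


history = [
    Record("insert", "a", 1, 1),
    Record("insert", "b", 1, 2),
    Record("delete", "a", 2, 2),
]
print(view_from_history(history))  # {'b'}
```

Because each word is inserted by exactly one event, set difference is enough here; there is no later re-insert that a delete could be confused with.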
32The log
- Each node maintains a log of events L and a
distributed algorithm is employed to keep the
dictionary views up to date.
- An event is recorded in the log as a
record/object containing these fields: operation,
time, nodeID. For example:
- (add a, 3, 2) if Node 2 issued add a at its
local time 3.
- The event record describing event e is denoted
eR.
- eR.node is the node that issues the event, eR.op
is the operation, and eR.time is the value of time
at which the operation was issued.
33The log
- Nodes exchange messages containing appropriate
portions of the individually maintained log in
order to achieve data consistency.
- L(e) denotes the contents of the log at a node
immediately after the event e completes.
- The log problem:
- (P1) $f \rightarrow e$ iff fR is in L(e)
34A trivial solution
- Each node i that generates an event e adds a
record for the event, eR, to its local log $L_i$.
- Each time the node sends a message, it includes
its log $L_i$ in the message.
- Upon receiving a message, a node j looks at the
log enclosed in the message, and applies the
event in each record to its dictionary view $V_j$.
- The logs are maintained indefinitely. If a node
j is cut off from the network due to failures,
its dictionary view may fall behind other nodes,
but as soon as the network is repaired and
messages can be sent to node j again, then the
events logged by other nodes will be made known
to j eventually.
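A minimal sketch of this trivial solution follows. It makes two simplifying assumptions that are not in the source: the receiver merges the received log into its own log, and the view is recomputed from the whole log rather than patched record by record. The `Record` layout and the `TrivialNode` class are hypothetical.

```python
from collections import namedtuple

# Hypothetical record layout for this sketch.
Record = namedtuple("Record", ["op", "item", "time", "node"])


class TrivialNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.time = 0
        self.log = set()          # L_i, never purged

    def _record(self, op, item):
        self.time += 1
        self.log.add(Record(op, item, self.time, self.node_id))

    def insert(self, x):
        self._record("insert", x)

    def delete(self, x):
        if x not in self.view():
            raise KeyError(x)     # delete(x) requires x in the local view
        self._record("delete", x)

    def view(self):
        """Recompute V_i from the full log."""
        inserted = {r.item for r in self.log if r.op == "insert"}
        deleted = {r.item for r in self.log if r.op == "delete"}
        return inserted - deleted

    def send(self):
        return set(self.log)      # the entire log goes in every message

    def receive(self, log):
        self.log |= log


n1, n2 = TrivialNode(1), TrivialNode(2)
n1.insert("a")
n1.insert("b")
n2.receive(n1.send())
n2.delete("a")
n1.receive(n2.send())
print(n1.view(), n2.view())  # {'b'} {'b'}
```

Both views converge, but every message carries the full log and every view computation scans it, which is the cost the improved solution sets out to remove.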
35Trivial solution
- The trivial solution
- is fault-tolerant.
- satisfies the log problem and the dictionary
problem.
- The log maintained by each node i, $L_i$, grows
unboundedly, which has these ramifications:
- The entire log is sent with each message:
excessive communication costs.
- A new view of the dictionary is repeatedly
computed based on the log received in each
message: excessive computational costs.
- The entire log is stored at each node: excessive
storage costs.
36Wuu and Bernstein's improved solutions
- Uses matrix time to purge event records that have
already been seen by all participants.
- Each node i maintains a matrix clock $T_i$.
- When i receives a log which contains a record for
event e, eR, initiated by node eR.node, it
determines if process k has already seen this
record by this predicate (boolean function):
- boolean hasrec($T_i$, eR, k)
- return ($T_i[k, eR.node] \ge eR.time$)
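The predicate is a single matrix lookup. In the sketch below the comparison is taken as "greater than or equal", so that a node counts as having seen a record whose time equals the latest time it knows about; that reading, the 0-based node numbering, and the `Record` layout are assumptions of the sketch.

```python
from collections import namedtuple

# Fields follow the record description in the slides: operation, time, node.
Record = namedtuple("Record", ["op", "time", "node"])


def hasrec(T, eR, k):
    """True if, according to matrix clock T, node k has seen record eR."""
    return T[k][eR.node] >= eR.time


# Matrix clock held by some node in a 2-node system (0-based node ids).
T = [[3, 1],
     [2, 1]]
eR = Record("insert a", 3, 0)   # issued by node 0 at its local time 3
print(hasrec(T, eR, 0))  # True
print(hasrec(T, eR, 1))  # False
```

Row 1 says node 1 is only known to have seen node 0's events up to time 2, so the record issued at time 3 still has to be sent to it.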
37Wuu and Bernstein's improved solutions (pp. 236-7)
- Kept at each node are:
- $V_i$: the dictionary view, e.g. {a, b, c}
- $PL_i$: a partial log of events
- Initialization:
- $V_i := \emptyset$, $PL_i := \emptyset$ // set both empty
- set matrix clock to all 0
38Wuu and Bernstein's improved solutions (pp. 236-7)
- When node i issues insert(x)
- Update matrix clock
- Add the event record to the partial log Pli
- Add x to Vi
- When node i issues delete(x)
- Update matrix clock
- Add the event record to the partial log Pli
- delete x from Vi
39Wuu and Bernstein's improved solutions (pp. 236-7)
- When node i sends to node k:
- Create a subset NP of the partial log $PL_i$,
consisting of those entries such that
hasrec($T_i$, eR, k) returns false.
- Send NP and $T_i$ to node k.
40Wuu and Bernstein's improved solutions (pp. 236-7)
- When node i receives from node k:
- Extract from the log received a subset, NE,
consisting of those entries such that
hasrec($T_i$, eR, i) returns false.
- These entries have not already been seen by i.
- Update the dictionary view $V_i$ based on NE.
- Update the matrix clock $T_i$.
- Add to the partial log $PL_i$ (note: not NE) those
records in the log received such that
hasrec($T_i$, eR, j) returns false for at least
one j.
- Such a record has not been seen by at least one
other node.
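The issue, send, and receive steps can be put together in one sketch. Several details are assumptions rather than statements from the slides: node ids are 0-based, "update matrix clock" on insert or delete is read as incrementing $T_i[i, i]$, the receive step merges clocks by maximum without a further increment, hasrec uses "greater than or equal", and records already held in the partial log are purged once every node is known to have seen them. The `Record` layout and `Node` class are hypothetical.

```python
from collections import namedtuple

# Hypothetical record layout for this sketch.
Record = namedtuple("Record", ["op", "item", "time", "node"])


def hasrec(T, eR, k):
    return T[k][eR.node] >= eR.time


class Node:
    def __init__(self, i, n):
        self.i, self.n = i, n
        self.V = set()                         # dictionary view V_i
        self.PL = set()                        # partial log PL_i
        self.T = [[0] * n for _ in range(n)]   # matrix clock T_i

    def _issue(self, op, x):
        self.T[self.i][self.i] += 1            # update matrix clock
        self.PL.add(Record(op, x, self.T[self.i][self.i], self.i))

    def insert(self, x):
        self._issue("insert", x)
        self.V.add(x)

    def delete(self, x):
        if x not in self.V:
            raise KeyError(x)
        self._issue("delete", x)
        self.V.discard(x)

    def send(self, k):
        """Return (NP, copy of T_i): only records k is not known to have."""
        NP = {r for r in self.PL if not hasrec(self.T, r, k)}
        return NP, [row[:] for row in self.T]

    def receive(self, k, NP, Tk):
        i, n = self.i, self.n
        # NE: received records that node i has not seen yet.
        NE = {r for r in NP if not hasrec(self.T, r, i)}
        inserts = {r.item for r in NE if r.op == "insert"}
        deletes = {r.item for r in NE if r.op == "delete"}
        self.V = (self.V | inserts) - deletes
        # Merge clocks: sender's row into row i, then entry-wise max.
        for r in range(n):
            self.T[i][r] = max(self.T[i][r], Tk[k][r])
        for r in range(n):
            for s in range(n):
                self.T[r][s] = max(self.T[r][s], Tk[r][s])
        # Keep only records that some node is not yet known to have seen.
        self.PL = {r for r in self.PL | NP
                   if any(not hasrec(self.T, r, j) for j in range(n))}


n0, n1 = Node(0, 2), Node(1, 2)
n0.insert("a")
n0.insert("b")
n1.receive(0, *n0.send(1))
n1.delete("a")
n0.receive(1, *n1.send(0))
print(n0.V, n1.V)                # {'b'} {'b'}
print(len(n0.PL), len(n1.PL))    # 0 1
```

In this two-node run, node 0 ends with an empty partial log because it knows both nodes have seen every record, while node 1 still keeps its delete record since it has no evidence yet that node 0 received it.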
41Wuu and Bernstein's improved solutions (pp. 236-7)
- The size of the log sent with each message is
minimized based on the matrix clock.
- The number of log entries based on which the
local dictionary view is updated is minimized,
again based on the matrix clock.
- The algorithm will allow each log record to be
maintained by at least one node, so that
eventually that knowledge will be propagated to a
recovered node.