Title: Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms
1Consistent Global States of Distributed Systems
Fundamental Concepts and Mechanisms
- CS 249 Project
- Fall 2005
- Wing Wong
2Outline
- Introduction
- Asynchronous distributed systems, distributed computations, consistency
- Two different strategies to construct global states
  - Monitor passively observes the system (reactive architecture)
  - Monitor actively interrogates the system (snapshot protocol)
- Properties of global predicates
- Sample applications: deadlock detection and debugging
3Introduction
- global state: union of the local states of the individual processes
- many problems in distributed computing require
  - construction of a global state, and
  - evaluation of whether the state satisfies some predicate F
- difficulties
  - uncertainties in message delays
  - relative speeds of computations
  - the global state obtained can be obsolete, incomplete, or inconsistent
4Distributed Systems
- collection of sequential processes $p_1, p_2, \ldots, p_n$
- unidirectional communication channels between pairs of processes
- reliable channels
- messages may be delivered out of order
- network strongly connected (not necessarily completely connected)
5Asynchronous Distributed Systems
- no bounds on relative process speeds
- no bounds on message delays
- no synchronized local clocks
- communication is the only possible mechanism for
synchronization
6Distributed Computations
- distributed program executed by a collection of processes
- each process executes a sequence of events
- communication through events send(m) and receive(m), where m is the message identifier
7Distributed Computations
- $h_i = e_i^1 e_i^2 \ldots$
  - local history of process $p_i$
  - canonical enumeration
  - total order imposed by sequential execution
- $h_i^k = e_i^1 e_i^2 \ldots e_i^k$
  - initial prefix of $h_i$ containing the first k events
- $H = h_1 \cup \cdots \cup h_n$
  - global history containing all events
  - does not specify relative timing between events
8Distributed Computations
- to order events, define the binary relation $\to$ to capture cause and effect
- $e \to e'$ if and only if e causally precedes e'
- concurrent events: neither $e \to e'$ nor $e' \to e$, written $e \parallel e'$
- distributed computation: partially ordered set defined by $(H, \to)$
9Distributed Computations
10Global States, Cuts and Runs
- $s_i^k$
  - local state of process $p_i$ after event $e_i^k$
- $S = (s_1, \ldots, s_n)$
  - global state of the distributed computation
  - n-tuple of local states
- cut $C = h_1^{c_1} \cup \cdots \cup h_n^{c_n}$, or $(c_1, \ldots, c_n)$
  - subset of the global history H
11Global States, Cuts and Runs
- $(s_1^{c_1}, \ldots, s_n^{c_n})$
  - global state corresponding to cut C
- $(e_1^{c_1}, \ldots, e_n^{c_n})$
  - frontier of cut C
  - set of last events
- run
  - a total ordering R including all events in the global history
  - consistent with each local history
12Global States, Cuts and Runs
- cut $C = (5,2,4)$; cut $C' = (3,2,6)$
- a consistent run $R = e_3^1 e_1^1 e_3^2 e_2^1 e_3^3 e_3^4 e_2^2 e_1^2 e_3^5 e_1^3 e_1^4 e_1^5 e_3^6 e_2^3 e_1^6$
13Consistency
- cut C is consistent if for all events e and e': $(e \in C) \wedge (e' \to e) \Rightarrow e' \in C$
  - that is, C is closed under the causal precedence relation
- consistent global state: one that corresponds to a consistent cut
- run R is consistent if for all events, $e \to e'$ implies e appears before e' in R
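The consistency condition for cuts can be checked mechanically. The following is a minimal sketch, not taken from the slides: it assumes events are named by hypothetical (process, index) pairs, a cut is given by its counts $(c_1, \ldots, c_n)$, and all cross-process causality comes from a list of (send, receive) pairs.

```python
from typing import Dict, List, Tuple

# An event is (process id, 1-based position in that process's local history).
Event = Tuple[int, int]
# A cut maps each process id to the number of its events included in the cut.
Cut = Dict[int, int]


def in_cut(event: Event, cut: Cut) -> bool:
    """True if the event lies inside the local prefix selected by the cut."""
    proc, index = event
    return index <= cut[proc]


def is_consistent_cut(cut: Cut, messages: List[Tuple[Event, Event]]) -> bool:
    """A cut is consistent if every receive inside it has its send inside it.

    Local prefixes are already closed under each process's own order, so
    checking the send/receive pairs is enough to establish closure under
    causal precedence.
    """
    return all(in_cut(send, cut) for send, recv in messages if in_cut(recv, cut))


# Hypothetical computation: the first event of p1 sends a message that p2
# receives as its second event.
messages = [((1, 1), (2, 2))]
print(is_consistent_cut({1: 1, 2: 2}, messages))  # send and receive both included
print(is_consistent_cut({1: 0, 2: 2}, messages))  # receive included without its send
```

The second cut is rejected because it contains an effect (the receive) without its cause (the send).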
14Consistency
- run $R = e^1 e^2 \ldots$ results in a sequence of global states $S^0 S^1 S^2 \ldots$
- $S^i$ is obtained from $S^{i-1}$ by some process executing event $e^i$; we say $S^{i-1}$ leads to $S^i$
- denote the transitive closure of the leads-to relation by $\leadsto_R$
- S' is reachable from S in run R iff $S \leadsto_R S'$
15Lattice of Global States
- lattice: set of all consistent global states, along with the leads-to relation
- $S^{k_1 \ldots k_n}$: shorthand for global state $(s_1^{k_1}, \ldots, s_n^{k_n})$
- $k_1 + \cdots + k_n$: level of the lattice
16Lattice of Global States
- path: sequence of global states of increasing level (downwards)
- each path corresponds to a consistent run
- a possible path: $S^{00}, S^{01}, S^{11}, S^{21}, S^{31}, S^{32}, S^{42}, S^{43}, S^{44}, S^{54}, S^{64}, S^{65}$
17Observing Distributed Computations
(reactive-architecture)
- processes notify monitor process $p_0$ whenever they execute an event
- monitor constructs an observation as the sequence of events corresponding to the notification messages
- problem
  - observation may be inconsistent due to variability in notification message delays
18Observing Distributed Computations
19Observing Distributed Computations
- any permutation of run R is a possible observation
- we need
  - a delivery rule at the monitor process to restore message order
- we have First-In-First-Out (FIFO) delivery, using sequence numbers, for every source-destination pair $p_i, p_j$
  - $\mathit{send}_i(m) \to \mathit{send}_i(m') \Rightarrow \mathit{deliver}_j(m) \to \mathit{deliver}_j(m')$
20Delivery Rule 1
- assume
  - global real-time clock
  - message delays bounded by d
  - each process includes a timestamp (real-time clock value) when notifying $p_0$ of a local event e
- DR1: At time t, deliver all received messages with timestamps up to $t - d$ in increasing timestamp order
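DR1 works because any notification stamped at or before $t - d$ must already have arrived by time t. A minimal sketch of the rule follows; the tuple layout, the function name `deliver_dr1`, and the sample values are illustrative assumptions, not part of the slides.

```python
from typing import List, Tuple

# A notification is (real-time timestamp, event label); both are illustrative.
Notification = Tuple[float, str]


def deliver_dr1(
    received: List[Notification], now: float, delay_bound: float
) -> Tuple[List[Notification], List[Notification]]:
    """Split the buffer into (deliverable in timestamp order, still held back)."""
    cutoff = now - delay_bound
    # Anything stamped at or before the cutoff can no longer be overtaken.
    ready = sorted(n for n in received if n[0] <= cutoff)
    held = [n for n in received if n[0] > cutoff]
    return ready, held


# Notifications arrived out of order; the delay bound is assumed to be 1.0.
buffer = [(3.0, "e2"), (1.0, "e1"), (4.5, "e3")]
ready, held = deliver_dr1(buffer, now=5.0, delay_bound=1.0)
print(ready)  # notifications stamped up to 4.0, oldest first
print(held)   # notifications that must wait
```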
21Delivery Rule 1
- let RC(e) denote the value of the global clock when e is executed
- real-time clock satisfies the Clock Condition
  - $e \to e' \Rightarrow RC(e) < RC(e')$
- but logical clocks also satisfy the Clock Condition
22Logical Clocks
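- event orderings based on increasing clock values
- $LC(e_i)$ denotes the value of the logical clock when $e_i$ is executed by $p_i$
- each sent message m contains a timestamp TS(m)
- update rules by $p_i$ at occurrence of $e_i$
  - $LC(e_i) := LC + 1$ if $e_i$ is an internal or send event
  - $LC(e_i) := \max(LC, TS(m)) + 1$ if $e_i$ is receive(m)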
23Logical Clocks
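A minimal sketch of a logical clock, assuming the usual Lamport update rule (add one on an internal or send event; take the maximum of the local clock and the message timestamp, then add one, on a receive). The class and method names are illustrative.

```python
class LamportClock:
    """Logical clock kept by one process; starts at 0 before any event."""

    def __init__(self) -> None:
        self.value = 0

    def tick(self) -> int:
        """Internal or send event: advance the clock by one.

        For a send event, the returned value is the timestamp TS(m).
        """
        self.value += 1
        return self.value

    def receive(self, timestamp: int) -> int:
        """Receive event: jump past both the local clock and TS(m)."""
        self.value = max(self.value, timestamp) + 1
        return self.value


p1, p2 = LamportClock(), LamportClock()
ts = p1.tick()          # p1 sends m; TS(m) is p1's new clock value
p2.tick()               # p2 executes two internal events first
p2.tick()
recv = p2.receive(ts)   # the receive is ordered after the send and after p2's past
print(ts, recv)
```

The receive always gets a larger value than the matching send, which is what the Clock Condition requires.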
24Delivery Rule 2
- replace the real-time clock by a logical clock
- need the gap-detection property
  - given events e, e' where $LC(e) < LC(e')$, determine whether some event e'' exists such that $LC(e) < LC(e'') < LC(e')$
- message m is stable at p if no future messages with timestamps smaller than TS(m) can be received by p
25Delivery Rule 2
- with FIFO, when $p_0$ receives m from $p_i$ with timestamp TS(m), it can be certain that no other message m' from $p_i$ with $TS(m') \le TS(m)$ will arrive later
- message m at $p_0$ is guaranteed stable when $p_0$ has received at least one message from every other process with a timestamp greater than TS(m)
- DR2: Deliver all received messages that are stable at $p_0$ in increasing timestamp order
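The stability test behind DR2 can be sketched as follows. This is an illustration under stated assumptions: channels to the monitor are FIFO, timestamps are logical clock values starting at 1, and ties between different senders are broken by sender id. The class name `Dr2Monitor` and the message layout are hypothetical.

```python
from typing import Dict, List, Tuple

Message = Tuple[int, int, str]  # (logical timestamp, sender id, event label)


class Dr2Monitor:
    def __init__(self, process_ids: List[int]) -> None:
        # Largest timestamp seen so far from each process (0 = nothing yet).
        self.latest: Dict[int, int] = {p: 0 for p in process_ids}
        self.pending: List[Message] = []

    def _is_stable(self, msg: Message) -> bool:
        ts, sender, _ = msg
        # Every other process has already sent something stamped later,
        # so with FIFO nothing stamped earlier can still arrive.
        return all(seen > ts for p, seen in self.latest.items() if p != sender)

    def receive(self, msg: Message) -> List[Message]:
        """Buffer msg and return the messages that are now deliverable."""
        ts, sender, _ = msg
        self.latest[sender] = ts  # FIFO: timestamps from one sender only grow
        self.pending.append(msg)
        stable = sorted(m for m in self.pending if self._is_stable(m))
        self.pending = [m for m in self.pending if not self._is_stable(m)]
        return stable


monitor = Dr2Monitor([1, 2])
first = monitor.receive((1, 1, "a"))   # nothing from process 2 yet: held
second = monitor.receive((2, 1, "b"))  # still nothing from process 2: held
third = monitor.receive((3, 2, "c"))   # makes "a" and "b" stable
print(first, second, third)
```

Note that "c" itself stays buffered until process 1 sends something with a larger timestamp, which is the unnecessary delay that motivates a stronger clock condition.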
26Strong Clock Condition
- DR1, DR2 assume $RC(e) < RC(e')$ (or $LC(e) < LC(e')$) $\Rightarrow e \to e'$
- recall RC and LC guarantee only the Clock Condition: $e \to e' \Rightarrow RC(e) < RC(e')$
- DR1, DR2 can unnecessarily delay delivery
- want a timing mechanism TC that gives the Strong Clock Condition
  - $e \to e' \Leftrightarrow TC(e) < TC(e')$
27Timing Mechanism 1 - Causal Histories
- causal history $\theta(e)$ as clock value
  - set of all events that causally precede event e, together with e itself
  - smallest consistent cut that includes e
- projection of $\theta(e)$ on process $p_i$: $\theta_i(e) = \theta(e) \cap h_i$
28Timing Mechanism 1 - Causal Histories
29Timing Mechanism 1 - Causal Histories
- To maintain causal histories
  - $\theta$ initially empty
  - if $e_i$ is an internal or send event
    - $\theta(e_i) = \{e_i\} \cup \theta(\text{previous local event of } p_i)$
  - if $e_i$ is the receive of message m by $p_i$ from $p_j$
    - $\theta(e_i) = \{e_i\} \cup \theta(\text{previous local event of } p_i) \cup \theta(\text{corresponding send event at } p_j)$
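The maintenance rules for causal histories translate directly into set unions. The sketch below is illustrative only: events are named by hypothetical (process id, index) pairs, and a message is assumed to carry the sender's whole causal history.

```python
from typing import FrozenSet, Tuple

Event = Tuple[int, int]        # (process id, 1-based local index)
History = FrozenSet[Event]     # causal history of an event, including itself


class CausalHistoryProcess:
    def __init__(self, pid: int) -> None:
        self.pid = pid
        self.count = 0
        self.history: History = frozenset()   # initially empty

    def local_or_send(self) -> History:
        """Internal or send event: add the new event to the previous history."""
        self.count += 1
        self.history = self.history | {(self.pid, self.count)}
        return self.history

    def receive(self, sender_history: History) -> History:
        """Receive event: also merge in the history of the matching send."""
        self.count += 1
        self.history = self.history | sender_history | {(self.pid, self.count)}
        return self.history


p1, p2 = CausalHistoryProcess(1), CausalHistoryProcess(2)
h_send = p1.local_or_send()     # first event of p1, a send
p2.local_or_send()              # first event of p2, internal
h_recv = p2.receive(h_send)     # second event of p2, receive of p1's message
h_later = p1.local_or_send()    # second event of p1, unrelated to p2

print(h_send < h_recv)                        # proper subset: send precedes receive
print(h_later < h_recv or h_recv < h_later)   # neither contains the other: concurrent
```

Even in this tiny run the history attached to each message keeps growing, which is the cost that vector clocks avoid.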
30Timing Mechanism 1 - Causal Histories
[Figure labels: new event $e_1^5$ (new send event); new event $e_2^3$ (new receive event)]
31Timing Mechanism 1 - Causal Histories
- can interpret clock comparison as set inclusion
  - $e \to e' \Leftrightarrow \theta(e) \subset \theta(e')$
  - (why not set membership, $e \to e' \Leftrightarrow e \in \theta(e')$?)
- unfortunately, causal histories grow too rapidly
32Timing Mechanism 2 - Vector Clocks
- note
  - projection $\theta_i(e) = h_i^k$ for some unique k
  - $e_i^r \in \theta_i(e)$ for all $r < k$
  - can use the single number k to represent $\theta_i(e)$
- $\theta(e) = \theta_1(e) \cup \cdots \cup \theta_n(e)$
- represent the entire causal history by an n-dimensional vector clock VC(e), where for all $1 \le i \le n$
  - $VC(e)[i] = k$ if and only if $\theta_i(e) = h_i^k$
33Timing Mechanism 2 - Vector Clocks
34Timing Mechanism 2 - Vector Clocks
- To maintain the vector clock
  - each process $p_i$ initializes VC to contain all zeros
  - update rules by $p_i$ at occurrence of $e_i$
  - $VC(e_i)[i]$: number of events $p_i$ has executed up to and including $e_i$
  - $VC(e_i)[j]$: number of events of $p_j$ that causally precede event $e_i$ of $p_i$
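A minimal sketch of vector clock maintenance, assuming the standard update rule: increment the local component on every event, and on a receive first take the component-wise maximum with the message timestamp. Process indices are 0-based here (the slides count from 1), and the class name is illustrative.

```python
from typing import List


class VectorClockProcess:
    def __init__(self, pid: int, n: int) -> None:
        self.pid = pid                 # 0-based index of this process
        self.vc: List[int] = [0] * n   # all zeros before any event

    def local_or_send(self) -> List[int]:
        """Internal or send event; the returned copy is TS(m) for a send."""
        self.vc[self.pid] += 1
        return list(self.vc)

    def receive(self, timestamp: List[int]) -> List[int]:
        """Receive event: merge the sender's knowledge, then count this event."""
        self.vc = [max(a, b) for a, b in zip(self.vc, timestamp)]
        self.vc[self.pid] += 1
        return list(self.vc)


p0, p1, p2 = (VectorClockProcess(i, 3) for i in range(3))
ts_a = p0.local_or_send()   # p0 sends to p1
p1.local_or_send()          # p1 internal event
ts_b = p1.receive(ts_a)     # p1 receives from p0, then forwards to p2
ts_c = p2.receive(ts_b)     # p2 receives from p1
print(ts_a, ts_b, ts_c)
```

Each component j of the final vector counts the events of process j that causally precede the last event (component i also counts the event itself), matching the interpretation given above.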
35Timing Mechanism 2 - Vector Clocks
[Figure: causal histories and the corresponding vector clocks, with a new send event and a new receive event]
36Vector Clock Comparison
- Define the less-than relation
  - $V < V' \Leftrightarrow (V \ne V') \wedge (\forall k: 1 \le k \le n: V[k] \le V'[k])$
37Properties of Vector Clocks
- Strong Clock Condition
  - $e \to e' \Leftrightarrow VC(e) < VC(e')$
- Simple Strong Clock Condition: given event $e_i$ of $p_i$ and event $e_j$ of $p_j$, $i \ne j$
  - $e_i \to e_j \Leftrightarrow VC(e_i)[i] \le VC(e_j)[i]$
38Properties of Vector Clocks
- Test for Concurrency: given event $e_i$ of $p_i$ and event $e_j$ of $p_j$
  - $e_i \parallel e_j \Leftrightarrow (VC(e_i)[i] > VC(e_j)[i]) \wedge (VC(e_j)[j] > VC(e_i)[j])$
- Pairwise Inconsistent: given event $e_i$ of $p_i$ and $e_j$ of $p_j$, $i \ne j$
  - $e_i$, $e_j$ are pairwise inconsistent if they cannot belong to the frontier of the same consistent cut
  - $(VC(e_i)[i] < VC(e_j)[i]) \vee (VC(e_j)[j] < VC(e_i)[j])$ (compare with the test for concurrency)
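The vector clock tests above can be written as three small predicates. This is a sketch with illustrative function names and 0-based process indices; the sample vectors are hypothetical, not taken from the slides' figures.

```python
from typing import Sequence


def vc_less(v: Sequence[int], w: Sequence[int]) -> bool:
    """V < W: the vectors differ and V is component-wise at most W."""
    return list(v) != list(w) and all(a <= b for a, b in zip(v, w))


def concurrent(vc_i: Sequence[int], i: int, vc_j: Sequence[int], j: int) -> bool:
    """Each event knows more about its own process than the other event does."""
    return vc_i[i] > vc_j[i] and vc_j[j] > vc_i[j]


def pairwise_inconsistent(vc_i: Sequence[int], i: int, vc_j: Sequence[int], j: int) -> bool:
    """One event has seen a later event of the other's process."""
    return vc_i[i] < vc_j[i] or vc_j[j] < vc_i[j]


a = [1, 0, 0]   # first event of process 0
b = [1, 2, 0]   # event of process 1 that has seen a
c = [0, 0, 1]   # event of process 2 that has seen nothing else
d = [2, 1, 0]   # event of process 1 that has seen process 0's second event

print(vc_less(a, b))                         # a causally precedes b
print(concurrent(a, 0, c, 2))                # a and c are unrelated
print(pairwise_inconsistent(a, 0, b, 1))     # a and b can share a frontier
print(pairwise_inconsistent(a, 0, d, 1))     # d depends on an event after a
```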
39Properties of Vector Clocks
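- Consistent Cut
  - frontier contains no pairwise inconsistent events
  - $VC(e_i^{c_i})[i] \ge VC(e_j^{c_j})[i], \; \forall\, 1 \le i, j \le n$
- Counting the events that causally precede $e_i$
  - $\#(e_i) = \left(\sum_{j=1}^{n} VC(e_i)[j]\right) - 1$
  - e.g. number of events: 4 + 1 + 3 - 1 = 7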
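Both properties reduce to a few lines once the vector clocks of the frontier events are known. The sketch below uses illustrative names and 0-based indices, and assumes a process with no event in the cut is represented by the all-zero vector.

```python
from typing import List


def frontier_is_consistent(frontier: List[List[int]]) -> bool:
    """frontier[i] is the vector clock of the last event of process i in the cut.

    The cut is consistent when no frontier event has seen more events of
    process i than process i itself contributes to the cut.
    """
    n = len(frontier)
    return all(frontier[i][i] >= frontier[j][i] for i in range(n) for j in range(n))


def events_before(vc: List[int]) -> int:
    """Number of events that causally precede the event stamped vc."""
    return sum(vc) - 1


good = [[1, 0, 0], [1, 2, 0], [0, 0, 1]]
bad = [[1, 0, 0], [2, 1, 0], [0, 0, 1]]   # process 1 has seen p0's second event

print(frontier_is_consistent(good))
print(frontier_is_consistent(bad))
print(events_before([4, 1, 3]))   # the 4 + 1 + 3 - 1 example
```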
40Properties of Vector Clocks
- Weak Gap-Detection: given event $e_i$ of $p_i$ and $e_j$ of $p_j$, if $VC(e_i)[k] < VC(e_j)[k]$ for some $k \ne j$, there exists an event $e_k$ such that $\neg(e_k \to e_i) \wedge (e_k \to e_j)$
41Causal Delivery and Vector Clocks
- assume processes increment the local component of VC only for events notified to monitor $p_0$
- $p_0$ maintains a set M of messages received but not yet delivered
- suppose we have
  - message m from $p_j$
  - m': last message delivered from process $p_k$, $k \ne j$
42Causal Delivery and Vector Clocks
- To deliver m, $p_0$ must verify
  - no earlier message from $p_j$ is undelivered (i.e. $TS(m)[j] - 1$ messages have been delivered from $p_j$)
  - no undelivered message m'' from $p_k$ such that $\mathit{send}_k(m') \to \mathit{send}_k(m'') \to \mathit{send}_j(m)$, $\forall k \ne j$ (i.e. whether $TS(m')[k] \ge TS(m)[k]$ for all k)
43Causal Delivery and Vector Clocks
- $p_0$ maintains an array $D[1 \ldots n]$ where $D[i] = TS(m_i)[i]$, $m_i$ being the last message delivered from $p_i$
- e.g. in the figure on the right, delivery of m' is delayed until m is received and delivered
44Delivery Rule 3
- Causal Delivery
  - for all messages m, m', sending processes $p_i$, $p_j$ and destination process $p_k$: $\mathit{send}_i(m) \to \mathit{send}_j(m') \Rightarrow \mathit{deliver}_k(m) \to \mathit{deliver}_k(m')$
- DR3 (Causal Delivery): Deliver message m from process $p_j$ as soon as
  - $D[j] = TS(m)[j] - 1$, and
  - $D[k] \ge TS(m)[k], \; \forall k \ne j$
- $p_0$ sets $D[j]$ to $TS(m)[j]$ after delivery of m
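A minimal sketch of the monitor side of DR3. It assumes, as stated earlier, that vector clock components count only events notified to the monitor; the class name `CausalMonitor`, the message layout, and the 0-based sender indices are illustrative.

```python
from typing import List, Tuple

Message = Tuple[int, List[int], str]  # (sender index j, TS(m), event label)


class CausalMonitor:
    def __init__(self, n: int) -> None:
        self.D = [0] * n                  # D[i]: last delivered count from process i
        self.pending: List[Message] = []  # the set M: received, not yet delivered
        self.delivered: List[str] = []

    def _deliverable(self, msg: Message) -> bool:
        j, ts, _ = msg
        next_from_sender = self.D[j] == ts[j] - 1
        causes_delivered = all(self.D[k] >= ts[k] for k in range(len(ts)) if k != j)
        return next_from_sender and causes_delivered

    def receive(self, msg: Message) -> None:
        """Buffer msg, then deliver everything that DR3 now allows."""
        self.pending.append(msg)
        progress = True
        while progress:
            progress = False
            for m in list(self.pending):
                if self._deliverable(m):
                    j, ts, label = m
                    self.pending.remove(m)
                    self.D[j] = ts[j]
                    self.delivered.append(label)
                    progress = True


monitor = CausalMonitor(2)
monitor.receive((1, [1, 1], "b"))   # b depends on a, which has not arrived yet
held_after_b = list(monitor.delivered)
monitor.receive((0, [1, 0], "a"))   # a arrives; both can now be delivered
print(held_after_b, monitor.delivered)
```

Unlike DR2, a message is held back only while one of its actual causal predecessors is missing.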
45Causal Delivery and Hidden Channels
- causal delivery should be applied only to closed systems
- hidden channels (communication channels external to the system) can lead to incorrect conclusions
46Active Monitoring - Distributed Snapshots
- monitor $p_0$ requests the states of the other processes and combines them into a global state
- assume channels implement FIFO delivery
- channel state $\chi_{i,j}$ for the channel from $p_i$ to $p_j$: messages sent by $p_i$ not yet received by $p_j$
47Distributed Snapshots
- notation
  - $IN_i$: set of processes having direct channels to $p_i$
  - $OUT_i$: set of processes to which $p_i$ has a channel
- for each execution of the snapshot protocol, process $p_i$ records its local state $s_i$ and the states of its incoming channels ($\chi_{j,i}$ for all $p_j \in IN_i$)
48Distributed Snapshots
- Snapshot Protocol (Chandy-Lamport)
  - $p_0$ starts the protocol by sending itself a "take snapshot" message
  - when receiving the "take snapshot" message for the first time, from process $p_f$
    - $p_i$ records its local state $s_i$ and relays the "take snapshot" message along all outgoing channels
    - channel state $\chi_{f,i}$ is set to empty
    - $p_i$ starts recording messages on its other incoming channels
49Distributed Snapshots
- Snapshot Protocol (Chandy-Lamport)
  - when receiving the "take snapshot" message beyond the first time, from process $p_s$
    - $p_i$ stops recording messages along the channel from $p_s$
    - channel state $\chi_{s,i}$ is the set of messages that have been recorded
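The two cases of the protocol can be exercised in a small single-threaded simulation. This is only a sketch under stated assumptions: two processes connected by FIFO channels modelled as queues, an application state that is a hypothetical account balance, and a process initiating the snapshot directly rather than a separate monitor. None of the names below come from the slides.

```python
from collections import deque

MARKER = "take snapshot"


class Node:
    def __init__(self, pid, peers, channels, balance):
        self.pid = pid
        self.peers = peers          # here IN_i and OUT_i are both the set of peers
        self.channels = channels    # shared dict: (src, dst) -> FIFO queue
        self.balance = balance      # illustrative application state
        self.snapshot = None        # recorded local state
        self.chan_state = {}        # src -> messages recorded on channel src -> pid
        self.recording = set()      # incoming channels still being recorded

    def send(self, dst, amount):
        self.balance -= amount
        self.channels[(self.pid, dst)].append(amount)

    def _record_and_relay(self):
        self.snapshot = self.balance
        self.chan_state = {src: [] for src in self.peers}
        self.recording = set(self.peers)
        for dst in self.peers:
            self.channels[(self.pid, dst)].append(MARKER)

    def start_snapshot(self):
        """The initiator acts as if it had received the marker from itself."""
        self._record_and_relay()

    def deliver_from(self, src):
        msg = self.channels[(src, self.pid)].popleft()
        if msg == MARKER:
            if self.snapshot is None:       # first marker: record and relay
                self._record_and_relay()
            self.recording.discard(src)     # this channel's state is now final
        else:
            self.balance += msg
            if src in self.recording:       # in transit when the snapshot began
                self.chan_state[src].append(msg)


channels = {(1, 2): deque(), (2, 1): deque()}
p1 = Node(1, [2], channels, balance=10)
p2 = Node(2, [1], channels, balance=10)

p2.send(1, 3)          # application message in flight from p2 to p1
p1.start_snapshot()    # p1 records 10 and sends a marker to p2
p1.deliver_from(2)     # the in-flight message is recorded as channel state
p2.deliver_from(1)     # first marker at p2: record 7, relay marker
p1.deliver_from(2)     # second marker at p1: stop recording channel 2 -> 1

total = p1.snapshot + p2.snapshot + sum(p1.chan_state[2]) + sum(p2.chan_state[1])
print(p1.snapshot, p2.snapshot, p1.chan_state, p2.chan_state, total)
```

The recorded local states plus the recorded channel contents add up to the initial total, which is a quick sanity check that the constructed global state is consistent.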
50Distributed Snapshots
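[Figure: execution of the snapshot protocol; labels mark the points where p1 and p2 are done]
- dashed arrows indicate "take snapshot" messages
- constructed global state: $S^{23}$, with $\chi_{1,2} = \emptyset$ and $\chi_{2,1} = \{m\}$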
51Properties of Snapshots
- Let
  - $S^s$: global state constructed
  - $S^a$: global state when the protocol was initiated
  - $S^f$: global state when the protocol terminated
- $S^s$ is guaranteed to be consistent
- the actual run that the system followed may not pass through $S^s$
- but there exists a run R such that $S^a \leadsto_R S^s \leadsto_R S^f$
52Properties of Snapshots
- $S^a = S^{21}$
- $S^f = S^{55}$
- the actual run r does not pass through $S^s$ ($= S^{23}$)
53Properties of Snapshots
54Properties of Global Predicates
- Now we have two methods for global predicate evaluation
  - monitor passively observing runs
  - monitor actively constructing snapshots
- utility of either approach depends (in part) on properties of the predicate
55Stable Predicates
- communication delays $\Rightarrow$ $S^s$ can only reflect some past state of the system
- stable predicate: once it becomes true, it remains true
- e.g. deadlock, termination, loss of all tokens, unreachable storage
- if F is stable, then (F is true in $S^s$) $\Rightarrow$ (F is true in $S^f$), and (F is false in $S^s$) $\Rightarrow$ (F is false in $S^a$)
56Stable Predicates
- deadlock detection through snapshots (p.29, 30)
57Stable Predicates
- deadlock detection using reactive protocol (p.31,
32)
58Nonstable Predicates
- e.g. debugging, checking if queue lengths exceed some thresholds
- Two problems
  - the condition may not persist long enough for it to be true when the predicate is evaluated
  - if a predicate F is found true, we do not know whether F ever held during the actual run
59Nonstable Predicates
- e.g. monitoring condition (x = y)
  - 7 states where (x = y) holds
  - but it no longer holds after state $S^{54}$
- e.g. (y - x) = 2
  - condition holds only in $S^{31}$ and $S^{41}$
  - monitor might detect (y - x) = 2 even if the actual run never goes through $S^{31}$ or $S^{41}$
60Nonstable Predicates
- detecting a nonstable predicate in this way is of very little value
61Nonstable Predicates
- With observations, can extend predicates
  - Possibly(F): There exists a consistent observation O of the computation such that F holds in a global state of O
  - Definitely(F): For every consistent observation O of the computation, there exists a global state of O in which F holds
- e.g. Possibly((y - x) = 2), Definitely(x = y)
62Nonstable Predicates
- use of extended predicates in debugging: if F describes some erroneous state, then Possibly(F) indicates a bug, even if it is not observed during an actual run
- if predicate F is stable, then Possibly(F) $\Leftrightarrow$ Definitely(F)
63Detecting Possibly and Definitely F
- detection based on the lattice of consistent global states
- If any global state in the lattice satisfies F, then Possibly(F) holds
- Definitely(F) requires all possible runs to pass through a global state that satisfies F
64Detecting Possibly and Definitely F
- Possibly((y - x) = 2)
- Definitely(y = x) (why?)
65Detecting Possibly and Definitely F
- maintain a set of global states, current, at progressively increasing levels
- any member of current satisfies F $\Rightarrow$ Possibly(F) is true
66Detecting Possibly and Definitely F
- iteratively construct the set of global states of level l that are reachable without passing through a state that satisfies F
- set empty $\Rightarrow$ Definitely(F) is true
- set contains the final state $\Rightarrow$ $\neg$Definitely(F) is true
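Both detection procedures walk the lattice level by level. The sketch below is an illustration under stated assumptions: a global state is a tuple of per-process event counts, consistency is checked against a list of hypothetical (send, receive) pairs, and the variable values X and Y are made up for the example rather than taken from the slides' figure.

```python
from typing import Callable, Iterator, List, Set, Tuple

State = Tuple[int, ...]          # (k1, ..., kn): events executed per process
Event = Tuple[int, int]          # (0-based process index, 1-based event index)
Message = Tuple[Event, Event]    # (send event, receive event)


def is_consistent(state: State, messages: List[Message]) -> bool:
    """Every receive included in the state must have its send included."""
    return all(state[s[0]] >= s[1] for s, r in messages if state[r[0]] >= r[1])


def successors(state: State, lengths: List[int], messages: List[Message]) -> Iterator[State]:
    """Consistent states one level further down the lattice."""
    for i, k in enumerate(state):
        if k < lengths[i]:
            nxt = state[:i] + (k + 1,) + state[i + 1:]
            if is_consistent(nxt, messages):
                yield nxt


def possibly(pred: Callable[[State], bool], lengths: List[int], messages: List[Message]) -> bool:
    current: Set[State] = {tuple(0 for _ in lengths)}
    while current:
        if any(pred(s) for s in current):
            return True
        current = {t for s in current for t in successors(s, lengths, messages)}
    return False


def definitely(pred: Callable[[State], bool], lengths: List[int], messages: List[Message]) -> bool:
    initial = tuple(0 for _ in lengths)
    final = tuple(lengths)
    # States of the current level reachable while avoiding every state where pred holds.
    current: Set[State] = set() if pred(initial) else {initial}
    while current:
        if final in current:
            return False   # some run reaches the end without ever satisfying pred
        current = {t for s in current
                   for t in successors(s, lengths, messages) if not pred(t)}
    return True


# Hypothetical values of x (held by process 0) and y (held by process 1)
# after 0, 1 and 2 local events; there are no messages in this example.
X, Y = [3, 1, 2], [0, 2, 1]
lengths: List[int] = [2, 2]


def equal(s: State) -> bool:
    return X[s[0]] == Y[s[1]]


def diff_one(s: State) -> bool:
    return Y[s[1]] - X[s[0]] == 1


print(possibly(diff_one, lengths, []), definitely(diff_one, lengths, []))
print(definitely(equal, lengths, []))
```

In this example (y - x) = 1 holds in a single state that some runs bypass, so it is possible but not definite, while every run must pass through one of the two states where x = y.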
67Conclusions
- many distributed system problems require recognizing certain global conditions
- two approaches to constructing global states
  - reactive-architecture based
  - snapshot based
- timing mechanism that captures the causal precedence relation
- applications to distributed deadlock detection and debugging
- solutions can be adapted to deal with nonstable predicates, multiple observations and failures