CS60002 Distributed Systems

Slides: 119

Provided by: arobind

1
CS60002 Distributed Systems
2

Text Book
Advanced Concepts in Operating Systems by
Mukesh Singhal and Niranjan G. Shivaratri
will cover about half the course, supplemented by
copies of papers
Xerox, notes, copies of papers etc. will cover
the rest.

3
What is a distributed system?

A very broad definition
A set of autonomous processes communicating among
themselves to perform a task
Autonomous: able to act independently
Communication: shared memory or message passing
"Concurrent system" is probably a better term

A more restricted definition
A network of autonomous computers that
communicate by message passing to perform some
task
A practical distributed system will probably
have both
Computers that communicate by messages
Processes/threads on a computer that communicate
by messages or shared memory

5
Advantages

Resource Sharing
Higher Performance
Fault Tolerance
Scalability

6
Why is it hard to design them?

The usual problem of concurrent systems
Arbitrary interleaving of actions makes the
system hard to verify
Plus
No globally shared memory (therefore hard to
collect global state)
No global clock
Unpredictable communication delays

7
Models for Distributed Algorithms

Topology: completely connected, ring, tree etc.
Communication: shared memory/message passing
(reliable? delay? FIFO/causal? broadcast/multicast?)
Synchronous/asynchronous
Failure models (fail stop, crash, omission,
Byzantine)
An algorithm needs to specify the model on which
it is supposed to work

8
Complexity Measures

Message complexity: no. of messages
Communication complexity/bit complexity: no. of
bits
Time complexity: for synchronous systems, no. of
rounds. For asynchronous systems, several
definitions exist.

9
Some Fundamental Problems

Ordering events in the absence of a global clock
Capturing the global state
Mutual exclusion
Leader election
Clock synchronization
Termination detection
Constructing spanning trees
Agreement protocols

Ordering of Events and
Logical Clocks

11
Ordering of Events

Lamport's Happened Before relationship
For two events a and b, a → b if
a and b are events in the same process and a
occurred before b
a is a send event of a message m and b is the
corresponding receive event at the destination
process
a → c and c → b for some event c

a → b implies a is a potential cause of b
Causal ordering: potential dependencies
Happened Before relationship causally orders
events
If a → b, then a causally affects b
If a ↛ b and b ↛ a, then a and b are concurrent
(a || b)

13
Logical Clock

Each process i keeps a clock Ci.
Each event a in i is timestamped C(a), the value
of Ci when a occurred
Ci is incremented by 1 for each event in i
In addition, if a is a send of message m from
process i to j, then on receive of m,
Cj = max(Cj, C(a) + 1)

Points to note
if a → b, then C(a) < C(b)
→ is an irreflexive partial order
Total ordering possible by arbitrarily ordering
concurrent events by process numbers
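
The two update rules above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name LamportClock and its method names are assumptions made for this example, not part of the slides.

```python
class LamportClock:
    """Scalar logical clock for one process (illustrative names)."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        # Rule 1: the clock is incremented by 1 for each event.
        self.time += 1
        return self.time

    def send(self):
        # A send is an event; its timestamp C(a) travels with the message.
        return self.local_event()

    def receive(self, msg_time):
        # Rule 2: the receive is an event (+1), and the clock must also
        # move past the send timestamp: Cj = max(Cj, C(a) + 1).
        self.time = max(self.time + 1, msg_time + 1)
        return self.time


p, q = LamportClock(), LamportClock()
p.local_event()          # event at p
stamp = p.send()         # p sends m, carrying C(send)
recv = q.receive(stamp)  # q's clock jumps past the send timestamp
```

Because the receive timestamp is always larger than the send timestamp, a → b implies C(a) < C(b); appending the process number to the timestamp gives the total order mentioned above.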

15
Limitation of Lamport's Clock

a → b implies C(a) < C(b)
BUT
C(a) < C(b) doesn't imply a → b !!
So not a true clock !!

16
Solution: Vector Clocks

Ci is a vector of size n (no. of processes)
C(a) is similarly a vector of size n
Update rules
Ci[i]++ for every event at process i
if a is send of message m from i to j with vector
timestamp tm, on receive of m
Cj[k] = max(Cj[k], tm[k]) for all k

For events a and b with vector timestamps ta and
tb,
ta = tb iff for all i, ta[i] = tb[i]
ta ≠ tb iff for some i, ta[i] ≠ tb[i]
ta ≤ tb iff for all i, ta[i] ≤ tb[i]
ta < tb iff (ta ≤ tb and ta ≠ tb)
ta || tb iff (not ta < tb and not tb < ta)

a → b iff ta < tb
Events a and b are causally related iff ta < tb
or tb < ta, else they are concurrent
Note that this is still not a total order
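
As a rough illustration of the update and comparison rules above, here is a small Python sketch. The names VectorClock, vc_less, happened_before and concurrent are assumptions made for this example.

```python
def vc_leq(ta, tb):
    # ta <= tb iff every component of ta is <= the matching one in tb.
    return all(x <= y for x, y in zip(ta, tb))


def vc_less(ta, tb):
    # ta < tb iff ta <= tb and ta != tb.
    return vc_leq(ta, tb) and ta != tb


def happened_before(ta, tb):
    return vc_less(ta, tb)


def concurrent(ta, tb):
    return not vc_less(ta, tb) and not vc_less(tb, ta)


class VectorClock:
    def __init__(self, n, i):
        self.i = i          # index of the owning process
        self.v = [0] * n    # one entry per process

    def tick(self):
        # Ci[i]++ for every event at process i (including sends).
        self.v[self.i] += 1
        return list(self.v)

    def receive(self, tm):
        # The receive is an event, then take the component-wise max.
        self.v[self.i] += 1
        self.v = [max(a, b) for a, b in zip(self.v, tm)]
        return list(self.v)


p0, p1, p2 = VectorClock(3, 0), VectorClock(3, 1), VectorClock(3, 2)
a = p0.tick()       # send event at process 0
b = p1.receive(a)   # matching receive at process 1
c = p2.tick()       # unrelated event at process 2
```

Here a → b holds and is visible from the timestamps alone, while c is concurrent with both a and b, which a scalar Lamport clock could not tell apart.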

19
Causal ordering of messages: application of
vector clocks

If send(m1) → send(m2), then every recipient of
both message m1 and m2 must deliver m1 before
m2.
"deliver": when the message is actually given to
the application for processing

20
Birman-Schiper-Stephenson Protocol

To broadcast m from process i, increment Ci[i],
and timestamp m with VTm = Ci
When j ≠ i receives m, j delays delivery of m
until
Cj[i] = VTm[i] - 1 and
Cj[k] ≥ VTm[k] for all k ≠ i
Delayed messages are queued in j sorted by vector
time. Concurrent messages are sorted by receive
time.
When m is delivered at j, Cj is updated according
to vector clock rule.
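
The delivery condition can be sketched as follows. This is a minimal sketch under stated assumptions: processes are numbered from 0, a pending message is a tuple (sender, VTm, payload), and the function names can_deliver and deliver_ready are made up for this example.

```python
def can_deliver(cj, vtm, i):
    """Check the delivery condition at a receiver with clock cj for a
    broadcast from process i carrying vector timestamp vtm."""
    next_from_sender = cj[i] == vtm[i] - 1
    seen_all_causes = all(cj[k] >= vtm[k]
                          for k in range(len(cj)) if k != i)
    return next_from_sender and seen_all_causes


def deliver_ready(cj, pending):
    """Deliver every queued message whose condition holds, updating cj.
    Returns the payloads in delivery order."""
    delivered = []
    progress = True
    while progress:
        progress = False
        for msg in list(pending):
            sender, vtm, payload = msg
            if can_deliver(cj, vtm, sender):
                pending.remove(msg)
                # Vector clock rule: component-wise maximum.
                for k in range(len(cj)):
                    cj[k] = max(cj[k], vtm[k])
                delivered.append(payload)
                progress = True
    return delivered


# Process 2 receives m2 (sent by process 1 after it delivered m1)
# before m1 itself arrives from process 0.
clock = [0, 0, 0]
queue = [(1, [1, 1, 0], "m2")]
first = deliver_ready(clock, queue)    # m2 must wait for m1
queue.append((0, [1, 0, 0], "m1"))
second = deliver_ready(clock, queue)   # m1 first, then m2
```

The example shows the point of the protocol: m2 arrives early but is held back until the message it causally depends on has been delivered.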

21
Problem of Vector Clock

message size increases since each message needs
to be tagged with the vector
size can be reduced in some cases by only sending
values that have changed

Capturing Global State

23
Global State Collection

Applications
Checking stable properties, checkpoint and
recovery
Issues
Need to capture both node and channel states
system cannot be stopped
no global clock

Some notations
LSi: local state of process i
send(mij): send event of message mij from
process i to process j
rec(mij): similar, receive instead of send
time(x): time at which state x was recorded
time(send(m)): time at which send(m) occurred

send(mij) ∈ LSi iff
time(send(mij)) < time(LSi)
rec(mij) ∈ LSj iff
time(rec(mij)) < time(LSj)
transit(LSi, LSj) = { mij | send(mij) ∈ LSi and
rec(mij) ∉ LSj }
inconsistent(LSi, LSj) = { mij | send(mij) ∉ LSi
and rec(mij) ∈ LSj }

Global state: collection of local states
GS = {LS1, LS2, ..., LSn}
GS is consistent iff
for all i, j, 1 ≤ i, j ≤ n,
inconsistent(LSi, LSj) = ∅
GS is transitless iff
for all i, j, 1 ≤ i, j ≤ n,
transit(LSi, LSj) = ∅
GS is strongly consistent if it is consistent and
transitless.

27
Chandy-Lamport's Algorithm

Uses special marker messages.
One process acts as initiator, starts the state
collection by following the marker sending rule
below.
Marker sending rule for process P
P records its state; then, for each outgoing
channel C from P on which a marker has not been
sent already, P sends a marker along C before any
further message is sent on C

When Q receives a marker along a channel C
If Q has not recorded its state, then Q records
the state of C as empty; Q then follows the
marker sending rule
If Q has already recorded its state, it records
the state of C as the sequence of messages
received along C after Q's state was recorded and
before Q received the marker along C

Points to Note
Markers sent on a channel distinguish messages
sent on the channel before the sender recorded
its state from the messages sent after the sender
recorded its state
The state collected may not be any state that
actually happened in reality, rather a state that
could have happened
Requires FIFO channels
Network should be strongly connected (obviously
also works for connected, undirected networks)
Message complexity O(E), where E = no. of links

30
Lai and Yang's Algorithm

Similar to Chandy-Lamport's, but does not require
FIFO
Boolean value X at each node: False indicates
state is not recorded yet, True indicates
recorded
Value of X piggybacked with every application
message
Value of X distinguishes pre-snapshot and
post-snapshot messages, similar to the marker

Mutual Exclusion

32
Mutual Exclusion

very well-understood in shared memory systems
Requirements
at most one process in critical section (safety)
if more than one requesting process, someone
enters (liveness)
a requesting process enters within a finite time
(no starvation)
requests are granted in order (fairness)

33
Classification of Distributed Mutual Exclusion
Algorithms

Non-token based/Permission based
Permission from all processes, e.g. Lamport,
Ricart-Agrawala, Roucairol-Carvalho etc.
Permission from a subset, e.g. Maekawa
Token based
e.g. Suzuki-Kasami

34
Some Complexity Measures

No. of messages/critical section entry
Synchronization delay
Response time
Throughput

35
Lamport's Algorithm

Every node i has a request queue qi, keeps
requests sorted by logical timestamps (total
ordering enforced by including process id in the
timestamps)
To request critical section
send timestamped REQUEST (tsi, i) to all other
nodes
put (tsi, i) in its own queue
On receiving a request (tsi, i)
send timestamped REPLY to the requesting node i
put request (tsi, i) in the queue

To enter critical section
i enters critical section if (tsi, i) is at the
top of its own queue, and i has received a
message (any message) with timestamp larger than
(tsi, i) from ALL other nodes.
To release critical section
i removes its request from its own queue and sends
a timestamped RELEASE message to all other nodes
On receiving a RELEASE message from i, i's
request is removed from the local request queue

Some points to note
Purpose of REPLY messages from node i to j is to
ensure that j knows of all requests of i prior to
sending the REPLY (and therefore, possibly any
request of i with timestamp lower than j's
request)
Requires FIFO channels.
3(n - 1) messages per critical section
invocation
Synchronization delay = max. message transmission
time
Requests are granted in order of increasing
timestamps

38
Ricart-Agrawala Algorithm

Improvement over Lamport's
Main Idea
node j need not send a REPLY to node i if j has a
request with timestamp lower than the request of
i (since i cannot enter before j anyway in this
case)
Does not require FIFO
2(n - 1) messages per critical section invocation
Synchronization delay = max. message transmission
time
Requests granted in order of increasing
timestamps

To request critical section
send timestamped REQUEST message (tsi, i)
On receiving request (tsi, i) at j
send REPLY to i if j is neither requesting nor
executing critical section, or if j is requesting
and i's request timestamp is smaller than j's
request timestamp. Otherwise, defer the request.
To enter critical section
i enters critical section on receiving REPLY from
all nodes
To release critical section
send REPLY to all deferred requests
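
The reply/defer decision is the heart of the algorithm, so a small sketch may help. It models only one node's local bookkeeping, not the network; the class name RANode and its method names are assumptions made for this example. Requests are (timestamp, node id) pairs, so Python's tuple comparison gives the total order.

```python
class RANode:
    """Local state of one node (illustrative names)."""

    def __init__(self, node_id):
        self.id = node_id
        self.state = "idle"        # "idle", "requesting" or "in_cs"
        self.my_request = None     # (timestamp, id) while requesting
        self.deferred = []         # requests to answer on release

    def request(self, ts):
        self.state = "requesting"
        self.my_request = (ts, self.id)
        return self.my_request     # to be sent to all other nodes

    def on_request(self, req):
        """Return True to REPLY immediately, False if deferred."""
        if self.state == "idle":
            return True
        if self.state == "requesting" and req < self.my_request:
            return True
        self.deferred.append(req)
        return False

    def enter(self):
        # Called once a REPLY has arrived from every other node.
        self.state = "in_cs"

    def release(self):
        self.state = "idle"
        self.my_request = None
        to_reply, self.deferred = self.deferred, []
        return to_reply            # send REPLY to each of these


a, b = RANode(1), RANode(2)
req_a, req_b = a.request(5), b.request(5)
b_replies = b.on_request(req_a)   # (5, 1) < (5, 2): reply at once
a_replies = a.on_request(req_b)   # a has priority: defer b
a.enter()
owed = a.release()                # now a answers b
```

With equal timestamps the node id breaks the tie, so exactly one of the two nodes defers and the other proceeds.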

40
Roucairol-Carvalho Algorithm

Improvement over Ricart-Agrawala
Main idea
once i has received a REPLY from j, it does not
need to send a REQUEST to j again unless it sends
a REPLY to j (in response to a REQUEST from j)
no. of messages required varies between 0 and
2(n - 1) depending on request pattern
worst case message complexity still the same

41
Maekawa's Algorithm

Permission obtained from only a subset of other
processes, called the Request Set (or Quorum)
Separate Request Set Ri for each process i
Requirements
for all i, j: Ri ∩ Rj ≠ ∅
for all i: i ∈ Ri
for all i: |Ri| = K, for some K
any node i is contained in exactly D Request
Sets, for some D
K = D = sqrt(N) for Maekawa's

A simple version
To request critical section
i sends REQUEST message to all processes in Ri
On receiving a REQUEST message
send a REPLY message if no REPLY message has been
sent since the last RELEASE message is received.
Update status to indicate that a REPLY has been
sent. Otherwise, queue up the REQUEST
To enter critical section
i enters critical section after receiving REPLY
from all nodes in Ri

To release critical section
send RELEASE message to all nodes in Ri
On receiving a RELEASE message, send REPLY to
next node in queue and delete the node from the
queue. If queue is empty, update status to
indicate no REPLY message has been sent.

Message complexity = 3 sqrt(N)
Synchronization delay =
2 × (max. message transmission time)
Major problem: DEADLOCK possible
Need three more types of messages (FAILED,
INQUIRE, YIELD) to handle deadlock. Message
complexity can be 5 sqrt(N)
Building the request sets?
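
One simple way to build request sets is a grid: place the N nodes in a sqrt(N) × sqrt(N) grid and let Ri be the row plus the column of node i. Note the assumption: this is a simpler stand-in, not Maekawa's own construction, and it gives K = D = 2 sqrt(N) - 1 rather than about sqrt(N). The function name grid_request_sets is made up for this sketch.

```python
import math


def grid_request_sets(n):
    """Return a list where entry i is the request set Ri of node i.
    Assumes n is a perfect square (nodes numbered 0 .. n-1)."""
    k = math.isqrt(n)
    if k * k != n:
        raise ValueError("this sketch assumes n is a perfect square")
    sets = []
    for i in range(n):
        r, c = divmod(i, k)
        row = {r * k + x for x in range(k)}
        col = {x * k + c for x in range(k)}
        sets.append(row | col)
    return sets


sets = grid_request_sets(9)

# Check the requirements listed for request sets.
pairwise_intersect = all(sets[i] & sets[j]
                         for i in range(9) for j in range(9))
contains_self = all(i in sets[i] for i in range(9))
sizes = {len(s) for s in sets}
membership = {sum(1 for s in sets if i in s) for i in range(9)}
```

Any two sets intersect because the row of one node always crosses the column of the other, which is the property that mutual exclusion relies on.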

45
Token based Algorithms

Single token circulates, enter CS when token is
present
No FIFO required
Mutual exclusion obvious
Algorithms differ in how to find and get the
token
Uses sequence numbers rather than timestamps to
differentiate between old and current requests

46
Suzuki Kasami Algorithm

Broadcast a request for the token
Process with the token sends it to the requestor
if it does not need it
Issues
Current vs. outdated requests
determining sites with pending requests
deciding which site to give the token to

The token
Queue (FIFO) Q of requesting processes
LN[1..n]: LN[j] is the sequence number of the
request that j executed most recently
The request message
REQUEST(i, k): request message from node i for
its kth critical section execution
Other data structures
RNi[1..n] for each node i, where RNi[j] is the
largest sequence number received so far by i in a
REQUEST message from j.

To request critical section
If i does not have token, increment RNi[i] and
send REQUEST(i, RNi[i]) to all nodes
if i has token already, enter critical section if
the token is idle (no pending requests), else
follow rule to release critical section
On receiving REQUEST(i, sn) at j
set RNj[i] = max(RNj[i], sn)
if j has the token and the token is idle, send it
to i if RNj[i] = LN[i] + 1. If token is not idle,
follow rule to release critical section

To enter critical section
enter CS if token is present
To release critical section
set LN[i] = RNi[i]
For every node j which is not in Q (in token),
add node j to Q if RNi[j] = LN[j] + 1
If Q is non-empty after the above, delete first
node from Q and send the token to that node

Points to note
No. of messages: 0 if node holds the token
already, n otherwise
Synchronization delay: 0 (node has the token) or
max. message delay (token is elsewhere)
No starvation
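
The sequence-number bookkeeping is easiest to see in code. Below is a minimal sketch of the request-handling and release rules only (no networking). Nodes are numbered from 0, and the function names on_request and release_cs are assumptions made for this example.

```python
from collections import deque


def on_request(rn_j, i, sn):
    """Node j records REQUEST(i, sn); outdated requests are ignored
    because only the largest sequence number is kept."""
    rn_j[i] = max(rn_j[i], sn)


def release_cs(i, rn_i, ln, q):
    """Run by token holder i when leaving the critical section.
    ln and q live in the token. Returns the node that should get
    the token next, or None if nobody is waiting."""
    ln[i] = rn_i[i]
    for j in range(len(rn_i)):
        # A request from j is pending iff RNi[j] = LN[j] + 1.
        if j not in q and rn_i[j] == ln[j] + 1:
            q.append(j)
    return q.popleft() if q else None


# Node 0 holds the token and is executing its 1st critical section;
# meanwhile node 1 has broadcast REQUEST(1, 1).
rn_0 = [1, 0, 0]
token_ln = [0, 0, 0]
token_q = deque()
on_request(rn_0, 1, 1)
next_holder = release_cs(0, rn_0, token_ln, token_q)
```

After the release, LN[0] records that node 0's first request is done, and node 1 is found to have exactly one outstanding request, so it receives the token.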

51
Raymond's Algorithm

Forms a directed tree (logical) with the
token-holder as root
Each node has variable "Holder" that points to
its parent on the path to the root. Root's Holder
variable points to itself
Each node i has a FIFO request queue Qi

To request critical section
Send REQUEST to parent on the tree, provided i
does not hold the token currently and Qi is
empty. Then place request in Qi
When a non-root node j receives a request from i
place request in Qj
send REQUEST to parent if no previous REQUEST
sent

When the root receives a REQUEST
send the token to the requesting node
set Holder variable to point to that node
When a node receives the token
delete first entry from the queue
send token to that node
set Holder variable to point to that node
if queue is non-empty, send a REQUEST message to
the parent (node pointed at by Holder variable)

To execute critical section
enter if token is received and own entry is at
the top of the queue; delete the entry from the
queue
To release critical section
if queue is non-empty, delete first entry from
the queue, send token to that node and make
Holder variable point to that node
If queue is still non-empty, send a REQUEST
message to the parent (node pointed at by Holder
variable)

Points to note
Avg. message complexity O(log n)
Sync. delay = (T log n)/2, where T = max. message
delay

56
Leader Election
57
Leader Election in Rings

Models
Synchronous or Asynchronous
Anonymous (no unique id) or Non-anonymous (unique
ids)
Uniform (no knowledge of n, the number of
processes) or non-uniform (knows n)
Known Impossibility Result
There is no Synchronous, non-uniform leader
election protocol for anonymous rings
Implications ??

58
Election in Asynchronous Rings

LeLann-Chang-Roberts Algorithm
send own id to node on left
if an id is received from right, forward id to left
node only if received id is greater than own id,
else ignore
if own id is received, declares itself leader
works on unidirectional rings
message complexity Θ(n^2) in the worst case
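
A small simulation makes the message count concrete. This is a sketch under an assumption: it runs the ring in lock-step rounds for simplicity, although the algorithm itself does not need synchrony. The function name lcr_elect and the convention that position k sends to position (k + 1) mod n are choices made for this example.

```python
def lcr_elect(ids):
    """ids[k] is the unique id of the process at ring position k.
    Returns (leader id, total number of messages sent)."""
    n = len(ids)
    in_flight = list(ids)   # in_flight[k]: id process k sends this round
    messages = 0
    while True:
        nxt = [None] * n
        for k, token in enumerate(in_flight):
            if token is None:
                continue
            messages += 1
            dest = (k + 1) % n
            if token == ids[dest]:
                # Own id came back around the ring: dest is the leader.
                return ids[dest], messages
            if token > ids[dest]:
                nxt[dest] = token   # forward larger ids only
            # smaller ids are ignored (swallowed)
        in_flight = nxt


leader, count = lcr_elect([3, 1, 4, 2])
worst_leader, worst_count = lcr_elect([4, 3, 2, 1])
```

In the second ring the ids decrease along the direction of travel, so the id at distance d from the maximum survives for many hops; the total is n(n + 1)/2 messages, which is the quadratic worst case.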

Hirschberg-Sinclair Algorithm
operates in phases, requires bidirectional ring
In the kth phase, send own id to 2^k processes on
both sides of yourself (directly send only to
the next process, with the id and k in the message)
if id received, forward if received id is greater
than own id, else ignore
last process in the chain sends a reply to
originator if its id is less than the received id
replies are always forwarded
A process goes to the (k+1)th phase only if it
receives a reply from both sides in the kth phase
process receiving its own id declares itself
leader

Message complexity O(n lg n)
Lots of other algorithms exist for rings
Lower Bound Result
Any comparison-based leader election algorithm in
a ring requires Ω(n lg n) messages
What if not comparison-based?

61
Leader Election in Arbitrary Networks

FloodMax
synchronous, round-based
at each round, each process sends the max. id
seen so far (not necessarily its own) to all its
neighbors
after diameter no. of rounds, if max. id seen =
own id, declares itself leader
Complexity O(d·m), where d = diameter of the
network, m = no. of edges
does not extend trivially to the asynchronous model
Variations of building different types of
spanning trees with no pre-specified roots.
Chosen root at the end is the leader (Ex., the
DFS spanning tree algorithm we covered earlier)

Clock Synchronization

63
Clock Synchronization

Multiple machines with physical clocks. How can
we keep them more or less synchronized?
Internal vs. External synchronization
Perfect synchronization not possible because of
communication delays
Even synchronization within a bound can not be
guaranteed with certainty because of
unpredictability of communication delays.
But still useful !! Ex. Kerberos, GPS

64
How clocks work

Computer clocks are crystals that oscillate at a
certain frequency
Every H oscillations, the timer chip interrupts
once (clock tick). No. of interrupts per second
is typically 18.2, 50, 60 or 100; it can be higher,
and is settable in some cases
The interrupt handler increments a counter that
keeps track of no. of ticks from a reference in
the past (epoch)
Knowing no. of ticks per second, we can calculate
year, month, day, time of day etc.

65
Clock Drift

Unfortunately, period of crystal oscillation
varies slightly
If it oscillates faster, more ticks per real
second, so clock runs faster; similar for slower
clocks
For machine p, when correct reference time is t,
let machine clock show time as C = Cp(t)
Ideally, Cp(t) = t for all p, t
In practice,
1 - ρ ≤ dC/dt ≤ 1 + ρ
ρ = max. clock drift rate, usually around 10^-5
for cheap oscillators
Drift => Skew between clocks (difference in clock
values of two machines)

66
Resynchronization

Periodic resynchronization needed to offset skew
If two clocks are drifting in opposite
directions, max. skew after time t is 2ρt
If application requires that clock skew < δ, then
resynchronization period
r < δ/(2ρ)
Usually ρ and δ are known
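
A quick worked example of the bound, as a sketch: the function names and the sample numbers (a drift rate of 10^-5 and a skew limit of 1 ms) are chosen only for illustration.

```python
def max_skew(rho, t):
    """Worst-case skew after t seconds between two clocks that drift
    in opposite directions at rate rho."""
    return 2 * rho * t


def max_resync_period(delta, rho):
    """Upper bound on the resynchronization period r so that the skew
    stays below delta: r < delta / (2 * rho)."""
    if delta <= 0 or rho <= 0:
        raise ValueError("delta and rho must be positive")
    return delta / (2 * rho)


rho = 1e-5      # drift rate of a cheap oscillator
delta = 1e-3    # application tolerates 1 ms of skew
r = max_resync_period(delta, rho)   # about 50 seconds
```

So with these numbers the machines must resynchronize more often than roughly once every 50 seconds.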

67
Cristian's Algorithm

One m/c acts as the time server
Each m/c sends a message periodically (within
resync. period r) asking for current time
Time server replies with its time
Sender sets its clock to the reply
Problems
message delay
time server's time is less than sender's current
time

Handling message delay: try to estimate the time
the message with the time server's time took to
reach the sender
measure round trip time and halve it
make multiple measurements of round trip time,
discard too high values, take average of rest
make multiple measurements and take minimum
use knowledge of processing time at server if
known
Handling fast clocks
do not set clock backwards; slow it down over a
period of time to bring it in tune with server's
clock
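
The delay-handling options above can be sketched as follows. The function names cristian_estimate and best_of are assumptions made for this example, and the sketch assumes the request and reply take equally long, which is the same assumption behind halving the round trip time.

```python
def cristian_estimate(t_send, t_recv, server_time, server_processing=0.0):
    """Estimate the current server time at the moment the reply arrives.
    t_send and t_recv are read from the local clock."""
    round_trip = t_recv - t_send
    one_way = (round_trip - server_processing) / 2
    return server_time + one_way


def best_of(samples):
    """samples: list of (t_send, t_recv, server_time) tuples.
    Use the measurement with the smallest round trip time, since it
    leaves the least room for error in the one-way delay."""
    t_send, t_recv, server_time = min(samples, key=lambda s: s[1] - s[0])
    return cristian_estimate(t_send, t_recv, server_time)


single = cristian_estimate(10.0, 10.5, 100.0)
chosen = best_of([(0.0, 1.0, 50.0), (2.0, 2.5, 52.0)])
```

If the estimate is behind the local clock, the local clock should be slowed down gradually rather than set backwards, as noted above.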

69
Berkeley Algorithm

Centralized as in Cristian's, but the time server
is active
time server asks for time of other m/cs at
periodic intervals
time server averages the times and sends the new
time to m/cs
M/cs set their time (advance immediately or
slow down gradually) to the new time
Estimation of transmission delay as before

70
External Synchronization

Clocks must be synchronized with real time
Cristian's algorithm can be used if the time
server is synchronized with real time somehow
Berkeley algorithm cannot be used
But what is real time anyway?

71
Measurement of time

Astronomical
traditionally used
based on earth's rotation around its axis and
around the sun
solar day: interval between two consecutive
transits of the sun
solar second: 1/86,400 of a solar day
period of earth's rotation varies, so solar
second is not stable
mean solar second: average length of large no. of
solar days, then divide by 86,400

Atomic
based on the transitions of Cesium 133 atom
1 sec.: time for 9,192,631,770 transitions
about 50 labs maintain Cesium clocks
International Atomic Time (TAI): mean no. of
ticks of the clocks since Jan 1, 1958
highly stable
But slightly off-sync with mean solar day (since
solar day is getting longer)
A leap second is inserted occasionally to
bring it in sync. (so far 32, all positive)
Resulting clock is called UTC (Coordinated
Universal Time)

UTC time is broadcast from different sources
around the world, ex.
National Institute of Standards and Technology
(NIST) runs radio stations, most famous being
WWV, anyone with a proper receiver can tune in
United States Naval Observatory (USNO) supplies
time to all defense sources, among others
National Physical Laboratory in UK
GPS satellites
Many others

74
NTP: Network Time Protocol

Protocol for time sync. in the internet
Hierarchical architecture
primary time servers (stratum 1) synchronize to
national time standards via radio, satellite etc.
secondary servers and clients (stratum 2, 3, ...)
synchronize to primary servers in a hierarchical
manner (stratum 2 servers sync. with stratum 1,
stratum 3 with stratum 2 etc.).

Reliability ensured by redundant servers
Communication by multicast (usually within LAN
servers), symmetric (usually within multiple
geographically close servers), or client server
(to higher stratum servers)
Complex algorithms to combine and filter times
Sync. possible to within tens of milliseconds for
most machines
But, just a best-effort service, no guarantees
RFC 1305 and www.eecis.udel.edu/ntp/ for more
details

Termination Detection

77
Termination Detection

Model
processes can be active or idle
only active processes send messages
idle process can become active on receiving a
computation message
active process can become idle at any time
termination: all processes are idle and no
computation messages are in transit
Can use global snapshot to detect termination
also

78
Huang's Algorithm

One controlling agent, has weight 1 initially
All other processes are idle initially and have
weight 0
Computation starts when controlling agent sends a
computation message to a process
An idle process becomes active on receiving a
computation message
B(DW): computation message with weight DW. Can
be sent only by the controlling agent or an
active process
C(DW): control message with weight DW, sent by
active processes to controlling agent when they
are about to become idle

Let current weight at process = W
Send of B(DW)
Find W1, W2 such that W1 > 0, W2 > 0, W1 + W2 = W
Set W = W1 and send B(W2)
Receive of B(DW)
W = W + DW
if idle, become active
Send of C(DW)
send C(W) to controlling agent
Become idle
Receive of C(DW)
W = W + DW
if W = 1, declare termination
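
The four rules can be sketched with exact fractions, which avoids the rounding problems floating-point weights would have. The class names and the choice to split the weight in half on every send are assumptions made for this example; any split into two positive parts would do.

```python
from fractions import Fraction


class Process:
    def __init__(self, weight=Fraction(0)):
        self.weight = weight
        self.active = weight > 0

    def send_computation(self):
        # Send of B(DW): split W into W1 + W2, keep W1, ship W2.
        assert self.active and self.weight > 0
        dw = self.weight / 2
        self.weight -= dw
        return dw

    def receive_computation(self, dw):
        # Receive of B(DW): add the weight; an idle process wakes up.
        self.weight += dw
        self.active = True

    def become_idle(self):
        # Send of C(DW): return all remaining weight to the controller.
        dw, self.weight = self.weight, Fraction(0)
        self.active = False
        return dw


class Controller(Process):
    def __init__(self):
        super().__init__(Fraction(1))

    def receive_control(self, dw):
        # Receive of C(DW): termination once the full weight is back.
        self.weight += dw
        return self.weight == 1


ctrl, p, q = Controller(), Process(), Process()
p.receive_computation(ctrl.send_computation())   # ctrl 1/2, p 1/2
q.receive_computation(p.send_computation())      # p 1/4, q 1/4
done_after_p = ctrl.receive_control(p.become_idle())
done_after_q = ctrl.receive_control(q.become_idle())
```

The invariant is that the weights held by processes plus the weights in transit always sum to 1, so the controller holding weight 1 means nothing is active or in flight.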

Building Spanning Trees

81
Building Spanning Trees

Applications
Broadcast
Convergecast
Leader election
Two variations
from a given root r
root is not given a-priori

Flooding Algorithm
starts from a given root r
r initiates by sending message M to all
neighbours, sets its own parent to nil
For all other nodes, on receiving M from i for
the first time, set parent to i and send M to all
neighbors except i. Ignore any M received after
that
Tree built is an arbitrary spanning tree
Message complexity
2m - (n - 1), where m = no. of edges
Time complexity ??
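
The flooding rule can be simulated with a single message queue. This is a sketch under an assumption: messages are processed in FIFO order of sending, which is just one of the many schedules an asynchronous network could produce. The function name flood_spanning_tree and the adjacency-dict input are choices made for this example.

```python
from collections import deque


def flood_spanning_tree(adj, root):
    """adj maps each node to a list of its neighbours (undirected,
    connected graph). Returns (parent map, number of messages)."""
    parent = {root: None}
    messages = 0
    # Each queue entry is one copy of M in transit: (sender, receiver).
    queue = deque((root, nb) for nb in adj[root])
    while queue:
        sender, node = queue.popleft()
        messages += 1
        if node in parent:
            continue            # M received before: ignore
        parent[node] = sender   # first M: sender becomes the parent
        queue.extend((node, nb) for nb in adj[node] if nb != sender)
    return parent, messages


triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
tree, sent = flood_spanning_tree(triangle, 0)
```

For the triangle (n = 3, m = 3) the root sends 2 messages and each other node sends 1, giving 4 messages in total; a different schedule could produce a different, equally valid tree.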

83
Constructing a DFS tree with given root

plain parallelization of the sequential algorithm
by introducing synchronization
each node i has a set unexplored, initially
contains all neighbors of i
A node i (initiated by the root) considers nodes
in unexplored one by one, sending a neighbor j a
message M and then waiting for a response (parent
or reject) before considering the next node in
unexplored
if j has already received M from some other node,
j sends a reject to i

else, j sets i as its parent, and considers nodes
in its unexplored set one by one
j will send a parent message to i only when it
has considered all nodes in its unexplored set
i then considers the next node in its unexplored
set
Algorithm terminates when root has received
parent or reject message from all its neighbours
Worst case no. of messages = 4m
Time complexity O(m)

85
What if no root given?

Main idea
Nodes have unique ids
A node starts building the DFS tree with itself
as root (a single node fragment) spontaneously as
in the previous case
Fragments of the spanning tree get built in
parallel; all nodes in each fragment are
identified by the id of its root
M carries the fragment id of the sender

when M sent from node in lower id fragment to
node in higher id fragment, lower id fragment is
stalled by higher id fragment by not giving a
response
When M sent from higher to lower id fragment,
node in lower id fragment switches parent to node
in higher id tree, resets unexplored, and starts
DFS again
Eventually, the highest id node becomes the root
(leader election!!)
Message complexity O(mn) !!
Time complexity O(m)

87
What about MSTs??

Gallager-Humblet-Spira Algorithm
much more complex! but similar to Kruskal's
no root given, edge weights assumed to be
distinct
MST built up in fragments (subtree of MST)
initially each node in its own fragment
fragments merge, finally just one fragment
outgoing edge: edge that goes between two
fragments
known result: min. wt. outgoing edge of a
fragment is always in the MST

Issues
How does a node find its min. wt. outgoing edge?
How does a fragment find its min. wt. outgoing
edge?
When do two fragments merge?
How do two fragments merge?

89
Some definitions

Each node has three states
Sleeping: initial state
Find: currently finding the fragment's min. wt.
outgoing edge
Found: found the min. wt. outgoing edge
Each fragment has a level
initially, each node is in a fragment of level 0

90
Merging rule for fragments

Suppose F is a fragment with id X, level L, and
min. wt. outgoing edge e. Let fragment at other
end of e be F1, with id X1 and level L1. Then
if L < L1, F merges into F1; new fragment has id
X1, level L1
if L = L1, and e is also the min. wt. outgoing edge
for F1, then F and F1 merge; new fragment has id
X2 = weight of e, and level L + 1; e is called
the core edge
otherwise, F waits until one of the above becomes
true

91
How to find min. wt. outgoing edge of a fragment

nodes on core edge broadcast an initiate message to
all fragment nodes along fragment edges; it contains
level and id
on receiving initiate, a node finds its min. wt.
outgoing edge (in Find state); how?
nodes send Report message with min. wt. edge up
towards the core edge along fragment edges (and
enter Found state)
leaves send their min. wt. outgoing edge,
intermediate nodes send the min. of their own min.
wt. outgoing edge and the min. edges sent by children
in the fragment; path info to best edge is kept
when Report reaches the nodes on the core edge,
min. wt. outgoing edge of the fragment is known.

92
What then???

nodes on core edges send Change_core message to
node i with min. wt. outgoing edge
node i then sends a Connect message to node j at
other end with its level
If j's fragment level is greater than i's,
initiate message sent from j to i. This updates
level and id of all nodes in i's old fragment; if
j has not sent a Report message yet, nodes in i's
old fragment start finding its min. wt. outgoing
edge, else not.
if j's fragment level is less, no response sent
and i just waits till j's fragment level becomes
higher
if fragment levels match and j sends Connect to i
also, merge into a level L + 1 fragment with new
core edge and id, and send initiate message

some more details skipped, read paper
Algo. terminates when no outgoing edge found for
a fragment
Worst case message complexity O(n log n + m)

Fault Tolerance
and
Recovery

95
Fault Tolerance and Recovery

Classification of faults
based on component that failed
program/process
processor/machine
link
storage
clock
based on behavior of faulty component
Crash: just halts
Failstop: crash with additional conditions
Omission: fails to perform some steps
Byzantine: behaves arbitrarily
Timing: violates timing constraints

Types of tolerance
Masking: system always behaves as per
specifications even in presence of faults
Non-masking: system may violate specifications
in presence of faults. Should at least behave in
a well-defined manner
Fault tolerant system should specify
Class of faults tolerated
what tolerance is given from each class

Some building blocks (assumptions/primitives that
help in designing fault tolerant systems)
Agreement (multiple processes agree on some
value)
Clock synchronization
Stable storage (data accessible after crash)
Reliable communication (point-to-point,
broadcast, multicast)
Atomic actions

98
Agreement Problems

Model
total n processes, at most m of which can be
faulty
reliable communication medium
fully connected
receiver always knows the identity of the sender
of a message
byzantine faults
synchronous system. In each round, a process
receives messages, performs computation, and
sends messages.

99
Different problem variations

Byzantine agreement (or Byzantine Generals
problem)
one process x broadcasts a value v
all nonfaulty processes must agree on a common
value (Agreement condition).
The agreed upon value must be v if x is nonfaulty
(Validity condition)
Consensus
Each process broadcasts its initial value
satisfy agreement condition
If initial value of all nonfaulty processes is v,
then the agreed upon value must be v

100

Interactive Consistency
Each process i broadcasts its own value vi
all nonfaulty processes agree on a common vector
(v1, v2, ..., vn)
If the ith process is nonfaulty, then the ith
value in the vector agreed upon by nonfaulty
processes must be vi
Solution to Byzantine agreement problem implies
solution to other two

101
Byzantine Agreement Problem

no solution possible if
asynchronous system, or
n < (3m + 1)
needs at least (m + 1) rounds of message exchange
(lower bound result)
"Oral" messages: messages can be forged/changed
in any manner, but the receiver always knows the
sender

102
Lamport-Shostak-Pease Algorithm

Recursively defined
OM(m), m > 0
Source x broadcasts value to all processes
Let vi = value received by process i from source
(0 if no value received). Process i acts as a new
source and initiates OM(m - 1), sending vi to
remaining (n - 2) processes
For each i, j, i ≠ j, let vj = value received by
process i from process j in step 2 using OM(m - 1).
Process i uses the value majority(v1, v2, ...,
vn-1)

103

OM(0)
Source x broadcasts value to all processes
Each process uses the value; if no value
received, 0 is used
Time complexity: m + 1 rounds
Message complexity: O(n^m)
You can reduce message complexity to polynomial
by increasing time
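
The recursion in OM(m) is compact enough to simulate directly. This is only a sketch: the faulty behaviour modelled in send (a faulty process tells odd-numbered receivers 1 and even-numbered receivers 0) is one hypothetical adversary picked for illustration, values are restricted to 0 and 1, and the function names are made up for this example.

```python
from collections import Counter


def majority(values):
    ranked = Counter(values).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return 0    # no strict winner: fall back to the default 0
    return ranked[0][0]


def send(sender, receiver, value, faulty):
    # Hypothetical adversary: a faulty sender lies based on receiver id.
    return receiver % 2 if sender in faulty else value


def om(m, source, value, others, faulty):
    """Return {i: value process i uses for this source} for i in others."""
    # Step 1: the source sends its value to every other process.
    received = {i: send(source, i, value, faulty) for i in others}
    if m == 0:
        return received
    # Step 2: each process j relays its value to the rest via OM(m - 1).
    relayed = {
        j: om(m - 1, j, received[j],
              [k for k in others if k != j], faulty)
        for j in others
    }
    # Step 3: each process takes the majority of its own value and
    # the values relayed to it by the others.
    return {
        i: majority([received[i]] +
                    [relayed[j][i] for j in others if j != i])
        for i in others
    }


# n = 4, m = 1: one faulty process is tolerated since n >= 3m + 1.
loyal_source = om(1, 0, 1, [1, 2, 3], faulty={3})
faulty_source = om(1, 0, 1, [1, 2, 3], faulty={0})
```

In the first run the nonfaulty processes 1 and 2 both use the source's value 1 (validity); in the second run the source itself lies, yet all three processes still settle on the same value (agreement).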

104
Atomic Actions and Commit Protocols

An action may have multiple subactions executed
by different processes at different nodes of a
distributed system
Atomic action: either all subactions are done or
none are done (all-or-nothing property/global
atomicity property) as far as system state is
concerned
Commit protocols: protocols for enforcing global
atomicity property

105
Two-Phase Commit

Assumes the presence of write-ahead log at each
process to recover from local crashes
One process acts as coordinator
Phase 1
coordinator sends COMMIT_REQUEST to all processes
waits for replies from all processes
on receiving a COMMIT_REQUEST, a process, if the
local transaction is successful, writes Undo/redo
logs in stable storage, and sends an AGREED
message to the coordinator. Otherwise, sends an
ABORT

106

Phase 2
If all processes reply AGREED, coordinator writes
COMMIT record into the log, then sends COMMIT to
all processes. If at least one process has
replied ABORT, coordinator sends ABORT to all.
Coordinator then waits for ACK from all
processes. If ACK is not received within timeout
period, resend. If all ACKs are received,
coordinator writes COMPLETE to log
On receiving a COMMIT, a process releases all
resources/locks, and sends an ACK to coordinator
On receiving an ABORT, a process undoes the
transaction using Undo log, releases all
resources/locks, and sends an ACK

107

Ensures global atomicity: either all processes
commit or all of them abort
Resilient to crash failures (see text for
different scenarios of failure)
Blocking protocol: crash of coordinator can
block all processes
Non-blocking protocols possible, e.g. Three-Phase
Commit protocol; we will not discuss it in this class

108
Checkpointing and Rollback Recovery

Error recovery
Forward error recovery: assess the damage due to
faults exactly and repair the erroneous part of
the system state
less overhead, but it is hard to assess the effect of
faults exactly in general
Backward error recovery: on a fault, restore the
system state to a previous error-free state and
restart from there
costlier, but a more general, application-independent
technique

109

Checkpoint and Rollback Recovery: a form of
backward error recovery
Checkpoint
local checkpoint: local state of a process saved
in stable storage for possible rollback on a
fault
global checkpoint: collection of local
checkpoints, one from each process
Consistent and Strongly Consistent Global
Checkpoint: similar to consistent and strongly
consistent global state respectively (also called
a recovery line)

110

Orphan message: a message whose receive is
recorded in some local checkpoint of a global
checkpoint but whose send is not recorded in any local
checkpoint in that global checkpoint (Note: a
consistent global checkpoint cannot have an
orphan message)
Lost message: a message whose send is recorded
but whose receive is not in a global checkpoint
Are lost messages a problem?
not if unreliable channels are assumed (since
messages can be lost anyway)
if reliable channels are assumed, this must be handled
properly: messages cannot be lost
We will assume unreliable channels for simplicity
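
The two definitions can be checked mechanically. The sketch below assumes a hypothetical representation in which each local checkpoint records the set of message ids whose send, and the set whose receive, happened before the checkpoint was taken; the function names are made up for illustration.

```python
def classify_messages(global_checkpoint):
    """Return (orphan ids, lost ids) for a global checkpoint.

    global_checkpoint maps a process name to its local checkpoint,
    {"sent": set_of_message_ids, "received": set_of_message_ids}.
    """
    sent = set().union(*(c["sent"] for c in global_checkpoint.values()))
    received = set().union(*(c["received"] for c in global_checkpoint.values()))
    orphans = received - sent   # receive recorded, send not recorded
    lost = sent - received      # send recorded, receive not recorded
    return orphans, lost


def is_consistent(global_checkpoint):
    """A consistent global checkpoint has no orphan message."""
    orphans, _ = classify_messages(global_checkpoint)
    return not orphans


# P checkpointed after sending m1 but before sending m2;
# Q checkpointed after receiving m2 but before receiving m1.
example = {
    "P": {"sent": {"m1"}, "received": set()},
    "Q": {"sent": set(), "received": {"m2"}},
}
```

Here classify_messages(example) reports m2 as an orphan and m1 as lost, so this global checkpoint is not consistent.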

111
Performance measures for a checkpointing and
recovery algorithm

during fault-free operation
checkpointing time
space for storing checkpoints and messages (if
needed)
in case of a fault
recovery time (time to establish recovery line)
extent of rollback (how far in the past did we
roll back? how much computation is lost?)
is the output commit problem handled? (if an output
was sent out before the fault, say cash dispensed
at a teller machine, it should not be resent after
restarting after the fault)

112
Some parameters that affect performance

Checkpoint interval (time between two successive
checkpoints)
Number of processes
Communication pattern of the application
Fault frequency
Nature of stable storage

113
Classification of Checkpoint Recovery Algorithms

Asynchronous/Uncoordinated
every process takes local checkpoints
independently
to recover from a fault in one process, all
processes coordinate to find a consistent global
checkpoint from their local checkpoints
very low fault-free overhead, but recovery overhead
is high
Domino effect possible (no consistent global
checkpoint may exist, so all processes have to
restart from scratch)
higher space requirements, as all local
checkpoints need to be kept
Good for systems where faults are rare and
inter-process communication is not too high (less
chance of domino effect)
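
The search for a consistent global checkpoint, and the domino effect, can be illustrated with a simple rollback-propagation loop. This is a sketch under assumptions, not one of the algorithms from the text: every process keeps all its checkpoints, checkpoint 0 is the initial state with no messages, and each checkpoint stores the cumulative sets of message ids sent and received. The name recovery_line and the data layout are hypothetical.

```python
def recovery_line(history):
    """Return {process: index of the checkpoint it rolls back to}.

    history[p] is p's list of checkpoints, oldest first. Each checkpoint is
    {"sent": set, "received": set} and history[p][0] is the initial state.
    """
    # Start from the latest checkpoint of every process.
    line = {p: len(cps) - 1 for p, cps in history.items()}
    changed = True
    while changed:
        changed = False
        # All sends recorded on the current candidate line.
        sent = set()
        for p, idx in line.items():
            sent |= history[p][idx]["sent"]
        # A process that recorded an orphan message must roll back further.
        for p, idx in list(line.items()):
            if history[p][idx]["received"] - sent:
                line[p] = idx - 1
                changed = True
    return line


# P and Q alternate messages, each checkpointing right after a receive.
domino = {
    "P": [
        {"sent": set(), "received": set()},
        {"sent": {"a1"}, "received": {"b1"}},
        {"sent": {"a1", "a2"}, "received": {"b1", "b2"}},
    ],
    "Q": [
        {"sent": set(), "received": set()},
        {"sent": set(), "received": {"a1"}},
        {"sent": {"b1"}, "received": {"a1", "a2"}},
    ],
}
```

With this communication pattern every rollback of one process creates an orphan at the other, so recovery_line(domino) ends at checkpoint 0 for both processes, i.e. a restart from scratch.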

114

Synchronous/Coordinated
all processes coordinate to take a consistent
global checkpoint
during recovery, every process just rolls back to
its last local checkpoint independently
low recovery overhead, but high checkpointing
overhead
no domino effect possible
low space requirement, since only last checkpoint
needs to be stored at each process

115

Communication Induced
Synchronize checkpointing with communication,
since message send/receive is the fundamental
cause of inconsistency in global checkpoint
Ex.: take a local checkpoint right after every
send. The last local checkpoints of all processes
then always form a consistent global checkpoint, but
this is too costly
Many variations exist that are more efficient than
the above; we will not discuss them in this class.

116

Message logging
Take coordinated or uncoordinated checkpoints, and
then log (in stable storage) all messages
received since the last checkpoint
On recovery, only the recovering process goes
back to its last checkpoint, and then replays
messages from the log appropriately until it
reaches the state right before the fault
The only class that can handle the output commit problem
Details are too complex to discuss in this class
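
The basic idea of checkpoint plus replay can be sketched for a single process. This is a toy illustration under assumptions: the process is deterministic, its state depends only on the messages it has received, and Python attributes stand in for stable storage. The class LoggedProcess and its methods are hypothetical.

```python
import copy


class LoggedProcess:
    """Deterministic process that accumulates the values it receives."""

    def __init__(self):
        self.state = {"total": 0, "count": 0}        # volatile state
        self.checkpoint = copy.deepcopy(self.state)  # stable storage
        self.message_log = []                        # stable storage

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
        # Only messages received since the last checkpoint are needed.
        self.message_log.clear()

    def receive(self, value):
        self.message_log.append(value)  # log the message, then process it
        self._apply(value)

    def _apply(self, value):
        self.state["total"] += value
        self.state["count"] += 1

    def crash(self):
        self.state = None  # volatile state is lost on a fault

    def recover(self):
        # Go back to the last checkpoint and replay the logged messages
        # in their original order.
        self.state = copy.deepcopy(self.checkpoint)
        for value in self.message_log:
            self._apply(value)
```

After a crash, recover() brings the process back to the state it had right before the fault without rolling back any other process.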

117
Some Checkpointing Algorithms

Asynchronous/Uncoordinated
See Juang-Venkatesan's algorithm in text, quite
well-explained
Synchronous/Coordinated
Chandy-Lamport's global state collection
algorithm can be modified to handle recovery from
faults
See Koo-Toueg's algorithm in text, quite
well-explained
