Fault Tolerant Distributed Systems

Transcript and Presenter's Notes
1
Fault Tolerant Distributed Systems
  • A Survey
  • by
  • Nirmit Desai

2
The track
  • Review the approaches to fault tolerance (coupled
    with mutex algorithms in some cases, papers 1-3);
    get an idea of the terminology and the field
  • 1 Revannaswamy, Bhatt - 97
  • 2 Chang, Singhal, Liu - 90
  • 3 Helary, Mostefaoui - 94
  • 4 Stumm, Zhou - 90
  • 5 Nett, Mock, Theisohn - 97
  • 6 Ballarini, Bernardi, Donatelli - 02
  • 7 Hardekopf, Kwiat, Upadhyaya - 01
  • 8 Injong Rhee - 95
  • 9 Goyer, Momtahan, Selic - 90
  • 10 Frank Mueller - 01
  • Summary
  • Questions

3
1 Revannaswamy, Bhatt - 97
  • O(log N), where N is the number of nodes in the
    spanning tree
  • Communicate with neighbours only (Raymond's
    algorithm)
  • Tolerant to single link / single node failures
  • Eliminate the failed component and obtain
    different tree structure (network is biconnected)
  • Mechanism to detect the failures is assumed to
    exist
  • Only Branches are used for message exchanges
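
The token-on-a-spanning-tree idea above can be sketched in a few lines. This is a toy, synchronous simulation of one request under Raymond's scheme (a single request, no concurrency; the class and function names are assumptions of this sketch, not from the paper): each node keeps a pointer toward the token holder, and granting the token flips those pointers back along the request path.

```python
# Toy sketch of Raymond's token-based mutex on a spanning tree.
# Simplification (an assumption of this sketch): one request at a time,
# simulated synchronously rather than by real message passing.
from collections import deque

class Node:
    def __init__(self, nid, holder):
        self.nid = nid          # node id
        self.holder = holder    # neighbour toward the token (self if holding)
        self.queue = deque()    # pending requests from self / children

def request_cs(nodes, requester):
    """Walk the holder chain, queueing the request at each hop."""
    path = []
    cur = requester
    while nodes[cur].holder != cur:          # until the token holder is reached
        nodes[cur].queue.append(requester)
        path.append(cur)
        cur = nodes[cur].holder
    # Grant: the token moves back along the path, flipping holder
    # pointers so every node again points toward the token.
    for hop in reversed(path):
        nodes[cur].holder = hop              # old holder points to the next hop
        cur = hop
        nodes[cur].holder = cur              # hop now holds the token
        nodes[cur].queue.popleft()
    return cur                               # new token holder

# chain 0 - 1 - 2, token initially at node 0
nodes = {0: Node(0, 0), 1: Node(1, 0), 2: Node(2, 1)}
assert request_cs(nodes, 2) == 2
assert nodes[1].holder == 2 and nodes[0].holder == 1   # pointers re-aimed
```

Because every node only ever talks to tree neighbours, a request crosses at most the tree height, which is where the O(log N) bound comes from.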

4
1 Contd - Reconfiguration
  • Repeat the steps below until all nodes are
    connected OR there are insufficient chords
  • At the end of reconfiguration, all
    data-structures are reset and algorithm is
    restarted, older token holder is pre-empted and
    token is transferred to a newly elected leader

5
2 Chang, Singhal, Liu - 90
  1. O((2K) log N), where each node has K alternate
    paths to the token node
  2. Works with DAG, no cycles should ever develop
  3. FIFO violation for greedy strategy

6
2 Contd. - Fault Tolerance
  • Mechanism to detect the failures is assumed to
    exist
  • Single link / single node failure and recovery
  • Any path constructed by taking outgoing edges
    always leads to the token holder
  • Enumerates all system states in which a failure
    may occur and gives a tolerance solution for each
    of those states
  • At recovery, the node asks its neighbours about
    its state in order to reconstruct the data
    structures
  • Fairness is compromised at recovery - queue order
    is arbitrary
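
The neighbour-query recovery step above can be sketched as follows (the state shape and field names are assumptions of this sketch): the recovering node collects each neighbour's view and rebuilds its edges so that outgoing paths still lead toward the token holder, re-queueing requests in arbitrary order.

```python
# Hedged sketch of recovery by querying neighbours (Chang/Singhal/Liu
# style). The dict-of-sets state representation is an assumption.
def recover(recovering, neighbour_states):
    """Rebuild outgoing edges and the request queue from neighbour views."""
    state = {"out_edges": set(), "queue": []}
    for nid, ns in neighbour_states.items():
        if recovering in ns["out_edges"]:
            # neighbour's edge points at us: it awaits the token through
            # us, so re-queue it - order is arbitrary (fairness lost)
            state["queue"].append(nid)
        else:
            # otherwise our edge points at the neighbour, keeping every
            # outgoing path directed toward the token holder (DAG intact)
            state["out_edges"].add(nid)
    return state

views = {1: {"out_edges": {0}},        # node 1 points at recovering node 0
         2: {"out_edges": set()}}      # node 2 does not
state = recover(0, views)
assert state["queue"] == [1] and state["out_edges"] == {2}
```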

7
3 Helary, Mostefaoui - 94
  • Worst case O(lg n) for mutex, average O(lg n) for
    fault-tolerance
  • Fairness is compromised at recovery - queue order
    is arbitrary
  • Best of both worlds: Naimi (dynamic) and
    Raymond (static)
  • Node topology preserves binomial tree (open cube)
    structure
  • Assumes finite message delay for detection of
    node failure

8
3 Contd. - Fault Tolerance
  • Handles multiple and simultaneous node failures,
    no link failures
  • Due to bounded message delays and estimated
    critical section lengths, a node can suspect a
    failure by timeouts.
  • Enquire appropriate nodes about their state and
    deduce whether the suspicion is correct.
  • In case of failure, some data structures are
    updated, tree is reconfigured
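
The timeout-based suspicion step above can be sketched directly. In this sketch the deadline formula (round-trip delay per hop plus an estimated critical-section length) and the function names are assumptions; the paper only assumes bounded delays and estimated CS lengths.

```python
# Sketch of failure suspicion by timeout, Helary/Mostefaoui style.
# The exact deadline formula below is an illustrative assumption.
def suspects_failure(wait_time, max_msg_delay, est_cs_length, hops=1):
    """A node suspects a failure once its wait exceeds the worst case."""
    deadline = hops * 2 * max_msg_delay + est_cs_length
    return wait_time > deadline

def confirm_suspicion(replies):
    """Enquire the appropriate nodes; the suspicion holds only if no
    enquired node reports recent contact with the suspect."""
    return all(r == "no_contact" for r in replies)

assert not suspects_failure(5, max_msg_delay=2, est_cs_length=3)  # 5 <= 7
assert suspects_failure(8, max_msg_delay=2, est_cs_length=3)      # 8 > 7
assert confirm_suspicion(["no_contact", "no_contact"])
assert not confirm_suspicion(["no_contact", "alive"])  # suspicion refuted
```

Only after the enquiry confirms the suspicion are the data structures updated and the tree reconfigured.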

9
4 Stumm, Zhou
  • Extension of coherent DSM algorithms for fault
    tolerance
  • Tolerance to single host failures, no link
    failures
  • Availability of broadcast (multicast)
    communication is assumed
  • Communication cost assumed to be larger than
    memory access costs
  • Critical state (DSM data, state info) replicated
    to multiple hosts
  • Central server algorithm: a central server manages
    data (request/response)
  • Backup the central server state, fault detection
    by timeouts
  • Full replication algorithm: all data on each host;
    local reads, sequenced writes (by sequencers);
    simple recovery by obtaining the latest copy from
    other hosts
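
The fault-tolerant central-server variant above can be sketched as a primary whose state is mirrored to a backup, with failover on timeout. All class and method names are assumptions of this sketch, and for simplicity the client mirrors writes itself (in the paper's design the server state is replicated and failures are detected by timeouts).

```python
# Illustrative sketch of a central-server DSM with a backed-up server
# (Stumm/Zhou style). Names and the mirroring-by-client simplification
# are assumptions of this sketch.
class Server:
    def __init__(self):
        self.pages = {}
        self.alive = True

    def write(self, page, value):
        if not self.alive:
            raise TimeoutError("server unreachable")
        self.pages[page] = value

    def read(self, page):
        if not self.alive:
            raise TimeoutError("server unreachable")
        return self.pages[page]

class Client:
    def __init__(self, primary, backup):
        self.primary, self.backup = primary, backup

    def write(self, page, value):
        self.primary.write(page, value)    # write applied at the primary...
        self.backup.write(page, value)     # ...and mirrored to the backup

    def read(self, page):
        try:
            return self.primary.read(page)
        except TimeoutError:               # fault detection by timeout
            return self.backup.read(page)  # fail over to replicated state

primary, backup = Server(), Server()
c = Client(primary, backup)
c.write("x", 42)
primary.alive = False                      # single host failure
assert c.read("x") == 42                   # state survives via the backup
```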

10
4 Contd.
  • Migration algorithm: data is always migrated to
    the accessing host; can be integrated with the VMM
    of the host OS; maintains sequential consistency
  • Read replication algorithm: similar to migration,
    but no migration for read accesses
  • Comparisons
  • Fault tolerant versions of central server and
    full replication do not introduce substantial
    overhead
  • Central server is better for infrequent shared
    data accesses
  • Full replication is better for sparse write
    accesses

11
5 Nett, Mock, Theisohn
  • Specification and realization of a generalized
    distributed dependency management system for
    fault tolerance
  • Key observation: management of dynamically
    evolving dependencies is a key problem for fault
    tolerance
  • Dependency graph: nodes as sites, edges as
    dependencies
  • Dependency set: set of nodes reachable from start
    nodes
  • Traversal of the dependency set yields a solution
    at each node, answering:
  • Is the node added to the dependency set?
  • Does the search continue?
  • Which dependency types to follow next?
  • In which direction?

12
5 Contd - Example
  • Stability: completed actions have stable
    dependencies
  • The proposed system supports only applications
    that have stable dependency states
  • Does our application conform to this requirement?
  • Next slide

13
5 Contd - Language
  • Generate a state machine with this language and
    verify it for stability

14
5 Contd - Example
  • Send messages and generate global dependency
    graph to look for inconsistencies

15
6 Ballarini, Bernardi, Donatelli
  • Formal and theoretical discussion of
    synchronization problem in distributed systems

16
7 Hardekopf, Kwiat, Upadhyaya
  • Security and fault tolerance for aerospace
    distributed communications by means of voting
    mechanism
  • Basic voting protocol
  • Initiator sends a request to one of the voters
  • The voter multicasts a request to other voters
  • Voters decide locally and send their vote to the
    client
  • The client waits for f+1 matching votes, where f
    is the degree of fault tolerance
  • Problems: not scalable; assumes exact voting
  • Proposes a model of a set of voters, an interface
    module and a user, where a majority of voters is
    assumed to be always correct
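
The f+1 threshold in the basic protocol can be sketched as a toy tally (the function name is an assumption of this sketch, not the paper's protocol):

```python
# Toy sketch of the f+1 decision rule in the basic voting protocol.
from collections import Counter

def decide(votes, f):
    """Output an answer once f+1 votes agree, where f is the degree of
    fault tolerance; return None if more votes are still needed."""
    answer, count = Counter(votes).most_common(1)[0]
    return answer if count >= f + 1 else None

assert decide(["A", "A", "B"], f=1) == "A"   # f+1 = 2 matching votes
assert decide(["A", "B"], f=1) is None       # no f+1 agreement yet
```

With exact voting this rule is simple, but it is exactly the assumption the paper criticises as non-scalable.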

17
7 Contd - The algorithm
  • Voters are authenticated when they commit
  • O(1) messages with any number of voters
  • Hashing and digital signatures for WAN
  • Each voter will follow these steps
  • If no other voter has committed an answer, commit
    self-answer and skip
  • If some voter has committed and the results
    conflict, then commit self-answer
  • After everyone has had a chance, determine the
    majority result
  • If the majority is in conflict, then go to step 1.
  • The interface module receives the first commit,
    buffers it and starts a timer
  • If a new commit is received, the old one is
    overwritten and the timer is restarted
  • Otherwise the buffered result is output to the
    client
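
The interface module's buffer-and-timer behaviour above can be sketched as follows. The timer is modelled as a fixed number of quiet ticks, which is an assumption of this sketch, as are the event encoding and the function name.

```python
# Sketch of the interface module: buffer the latest commit, restart the
# timer on every new commit, output once the timer expires quietly.
def interface_module(events, quiet_ticks=2):
    """events: sequence of ("commit", value) or ("tick", None)."""
    buffered, quiet = None, 0
    for kind, value in events:
        if kind == "commit":
            buffered, quiet = value, 0     # overwrite and restart the timer
        else:
            quiet += 1
            if buffered is not None and quiet >= quiet_ticks:
                return buffered            # timer expired: output the result
    return None                            # still waiting

events = [("commit", "A"), ("tick", None), ("commit", "B"),
          ("tick", None), ("tick", None)]
assert interface_module(events) == "B"     # later commit overwrote "A"
```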

18
8 Injong Rhee
  • Fault tolerance in distributed resource
    allocation by locality
  • Failure locality: max. number of processes dead
    due to a failure
  • Depends on the degree of distributedness of the
    algorithm
  • A tight lower bound on failure locality is given
    in terms of the maximum number of conflicting
    processes
  • The dining philosophers problem has the optimal
    failure locality
  • The proposed algorithm breaks the chain by not
    allowing a process to wait for a process which is
    itself blocked, and achieves this bound

19
8 Contd - Model
  • Each process is modeled in steps: send, receive,
    local, fail - plus a local state
  • Guarded command set: receive -> action, ...
  • Configuration C is the set of current local states
  • Network is a process too
  • Enabling and execution of the guarded command set
  • System is initial configuration C0 and execution
    of guarded command set of all the processes
  • No constraints on relative timing of execution of
    process steps
  • Priorities by logical clocks (with waiting
    factor)
  • Centralized resource managers manage exclusion by
    responding to the requests for resource (with
    preemption)
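
The guarded-command execution model above can be sketched in a few lines (the scheduler, the dict-shaped configuration and the names are assumptions of this sketch): a configuration is a map of local state, and a step executes the action of every enabled command.

```python
# Minimal sketch of guarded-command execution over a configuration.
# Simplification (an assumption): all enabled commands fire in one step.
def step(config, commands):
    """Execute every enabled (guard, action) pair; return the new config."""
    new = dict(config)
    for guard, action in commands:
        if guard(new):             # the command is enabled in this config
            action(new)            # its action updates the local state
    return new

# one guarded command: on receiving a token, enter the critical section
commands = [
    (lambda c: c["inbox"] == "token",
     lambda c: c.update(inbox=None, in_cs=True)),
]
c0 = {"inbox": "token", "in_cs": False}    # initial configuration C0
c1 = step(c0, commands)
assert c1["in_cs"] and c1["inbox"] is None
```

A system is then just C0 plus repeated application of `step` over all processes' command sets, with no constraints on relative timing.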

20
9 Goyer, Momtahan, Selic
  • Hierarchical control strategy for fault tolerance
    (centralized app)
  • Simplicity, efficiency, predictability
  • Transient and hard failures - does the system
    recover or not?
  • Soft QoS guarantees in terms of response times

21
10 Frank Mueller
  • Extension of Naimi's protocol - tolerates single
    node failures
  • Backup communication in form of ring structure
  • Fault handler node: finds the faulty node by
    timeout, collects fault information (from 2 nodes)
    and takes appropriate action

22
Some ideas - for our protocol
  • The first node experiencing timeout may send a
    message to every other node asking for their
    states
  • The node experiencing timeout must be waiting
    for a grant/token to arrive OR waiting for a
    release to arrive
  • In case of a grant/token, propagate the message up
    the parent chain to detect the point of failure
  • In case of release, send the message down the
    chain of children to see who is supposed to send
    a release
  • Once detected the point, recalculate the state of
    the affected nodes - the ones falling on the
    chain
  • Assuming no link failures for now, multiple
    failures can be detected
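
The parent-chain search proposed above can be sketched as follows (the alive-set representation and function name are assumptions of this sketch): starting from the node that timed out waiting for a grant/token, follow parent pointers until a dead node is found.

```python
# Sketch of locating the point of failure by walking the parent chain
# from the node whose grant/token timed out (link failures excluded).
def find_failure_point(start, parent, alive):
    """Return the first dead node on the chain, or None if it is intact."""
    cur = start
    while parent.get(cur) is not None:
        if parent[cur] not in alive:
            return parent[cur]        # point of failure on the chain
        cur = parent[cur]
    return None                       # chain intact up to the root

parent = {3: 2, 2: 1, 1: 0, 0: None}  # 3 waits on 2, ..., root 0 holds token
alive = {0, 1, 3}                     # node 2 has failed
assert find_failure_point(3, parent, alive) == 2
```

The release case is symmetric, walking the chain of children instead; once the point is found, only the nodes on the affected chain need their state recalculated.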

23
Summary
  • Approaches fall into the following categories
  • Fault tolerance embedded into the base algorithms
    - inspiring, concrete, but hard to import
  • Generalized approaches - demonstrate strength of
    applicability but no concrete algorithmic
    description
  • Voting approaches - can be used to reach
    consensus, useful in case of multiple failures
  • Centralized approach - simple, predictable, but
    may not be desirable
  • Replication based approaches - simple, incur
    overhead, sometimes the only solution (esp for
    complex algorithms)
  • No prominent solution for detecting failure -
    even for very specific problems/algorithms

24
Questions
  • ??