Title: CS556: Distributed Systems
CS-556 Distributed Systems
Fault Tolerance (II)
- Manolis Marazakis
- maraz_at_csd.uoc.gr
Dependability: Basic Concepts
- Availability
- Reliability
- Safety
- Maintainability
Fault → Error → Failure
- Faults
- Transient
- Intermittent
- Permanent
A 2-node cluster
Shared-disks vs Shared-nothing
- Shared-disks
- Dual hosting for the storage devices
- SCSI, NAS, SAN
- Access is arbitrated by external software that
runs on both servers
- Shared-nothing
- Replication schemes
- Requires more effort to recover a server
- More suitable for WAN
- Requires a functional network and a functional
host on the other side to ensure that the writes
actually succeed
- Danger of inconsistency after a failover
Failover Management Software
- Key components of the system must be monitored
- H/W is generally the easiest part to monitor
- Relatively easy tests
- Relatively few different varieties of H/W
components
- How to monitor the health of an application?
- Examine the system's process table
- No guarantee that the app. is running properly!
- Query the application itself
- checking for accurate, timely responses
- For some apps, the query is easy (e.g. DBMS)
- Make sure the check is end-to-end
- E.g. DBMS s/w → disk → network
- For others, this is hard!
- Web server → web page access
- File server → file access
- Custom s/w → ??
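As a sketch of the distinction above: an end-to-end check exercises the whole service path and demands an accurate, timely answer, while a process-table check only proves the process exists. The probe callable and timeout value below are illustrative, not from the slides.

```python
import time

def end_to_end_check(probe, timeout=2.0):
    """Application-level health check: run a probe that exercises the
    full path (e.g. a trivial DBMS query touching s/w, disk, network)
    and require an accurate AND timely answer."""
    start = time.monotonic()
    try:
        result = probe()          # e.g. lambda: db.execute("SELECT 1")
    except Exception:
        return False              # probe failed outright
    elapsed = time.monotonic() - start
    return bool(result) and elapsed <= timeout

def process_table_check(pid_exists):
    """Weaker check: the process appears in the process table.
    No guarantee the application is actually serving requests."""
    return pid_exists
```

Note that `process_table_check` can report healthy while the application hangs; only the end-to-end probe catches that case.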
Active-Passive Configuration (I)
- Both servers are connected to a set of
dual-hosted disks.
- These disks are divided between 2 separate
controllers / disk arrays
- The data is mirrored from one controller to the
other.
- A particular disk or filesystem can only be
accessed by one server at a time.
- Ownership conflicts are arbitrated by the
clustering software.
- Both servers are connected to the same public
network, and share a single IP address
- which is migrated by the FMS from one server to
the other as part of the failover.
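A minimal sketch of the arbitration logic described above, with hypothetical node names; a real FMS would also fence the disks and reconfigure the service IP at the OS level.

```python
class FailoverManager:
    """Active-passive sketch: exactly one node owns the floating service
    IP at any time; the FMS migrates ownership to the standby when the
    active node's health check fails."""

    def __init__(self, active, passive, service_ip):
        self.owner, self.standby = active, passive
        self.service_ip = service_ip   # always served by self.owner

    def on_health_check(self, healthy):
        if not healthy:
            # Failover: standby takes over the shared IP (and, in a real
            # cluster, ownership of the dual-hosted disks).
            self.owner, self.standby = self.standby, self.owner
        return self.owner
```

Usage: clients keep addressing `service_ip`; only the node answering for it changes.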
Active-Passive Configuration (II)
Active-Passive Configuration (III)
- Cost
- 2 hosts are reserved to perform the work of one.
- One host sits largely idle most of the time,
consuming electricity, administrative effort,
data center space, cooling, and other limited and
expensive resources.
- However, active-passive configurations are going
to be the most highly available ones over time.
- Since there are no unnecessary processes running
on the second host, there are fewer opportunities
for an error to cause the system to fail.
Active-Active Configuration (I)
- Each host acts as the standby for its partner in
the cluster, while still delivering its own
critical services.
- When one server fails, its partner takes over for
it and begins to deliver both sets of critical
services - until the failed server can be repaired
and returned to service.
- The servers must be truly independent of each
other
Active-Active Configuration (II)
Service Group Failover (I)
Capability for multiple service groups that ran
together on one server to fail over to separate
machines when that first server fails
Service Group Failover (II)
- Service Group: a set containing one or more IP
addresses, one or more disks or volumes, and one
or more critical processes
- A service group is the unit that fails over from
one server to another within a cluster.
- For service groups to maintain their relevance /
value, they must be totally independent of each
other.
- If, because of external requirements, two service
groups must fail over together, then they are, in
reality, a single group.
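The definition above can be sketched as a data structure plus a failover routine; since independent groups are the unit of failover, groups from one failed server may land on different survivors. Names, fields, and the round-robin placement are illustrative.

```python
from dataclasses import dataclass

@dataclass
class ServiceGroup:
    """One or more IP addresses, disks/volumes, and critical processes.
    The whole group moves together; it is the unit of failover."""
    name: str
    ips: list
    disks: list
    processes: list

def failover(assignments, failed_server, survivors):
    """Move every service group of the failed server to surviving nodes;
    independent groups may land on different machines."""
    moved = {}
    for i, group in enumerate(assignments.pop(failed_server, [])):
        target = survivors[i % len(survivors)]   # naive round-robin placement
        assignments.setdefault(target, []).append(group)
        moved[group.name] = target
    return moved
```

If two groups had to move together, they would have to be modelled as one `ServiceGroup`, matching the slide's point.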
N-to-1 clusters (I)
A single standby node for the whole cluster
- This node can see all disks.
After recovery of a failed node, we must fail its
services back to it, freeing up the one node to
take over for another set of service groups.
4-to-1 SCSI cluster
N-to-1 clusters (II)
The hosts are all identically attached to the
storage.
SAN-based 6-to-1 cluster
N-plus-1 clusters
1 dedicated stand-by node
After recovery, no failover is needed from
standby to recovered node
- Over time, the layout of hosts and services will
not match the original layout within the
cluster.
- As long as all of the cluster members
have similar performance capabilities, and they
can see all of the required disks, it does not
matter which host actually runs the
service.
As clusters begin to grow, it's possible that a
single standby node will not be adequate
SAN-based 6-to-1 cluster
Failure Models
Failure detectors
- Not necessarily reliable!
- Each process P sends a "P is here" message every
T sec, assuming a max. message transmission delay D
- Categorization of processes (hints)
- suspected vs unsuspected
- A process may be functioning correctly on the
other side of a partitioned network
- or it could be slow to respond to probes
- Reliable detection
- unsuspected vs failed (crashed)
- Feasible only in synchronous systems
- It is possible to give different responses to
different processes
- different comm. conditions
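The heartbeat scheme above can be sketched as follows; a process is merely *suspected* (a hint, not reliable detection) once no heartbeat has arrived within T + D. Class and method names are illustrative.

```python
class HeartbeatDetector:
    """Unreliable failure detector: P sends a 'P is here' message every
    T seconds; with max. transmission delay D, silence longer than
    T + D makes P 'suspected'."""

    def __init__(self, T, D):
        self.deadline = T + D
        self.last_seen = {}

    def heartbeat(self, pid, now):
        self.last_seen[pid] = now        # record arrival of 'P is here'

    def status(self, pid, now):
        last = self.last_seen.get(pid)
        if last is None or now - last > self.deadline:
            # Only a hint: the process may be slow, or on the far side
            # of a network partition, rather than crashed.
            return "suspected"
        return "unsuspected"
```

In a synchronous system the T + D bound is trustworthy, so "suspected" can be upgraded to "failed"; asynchronously it cannot.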
Failure Masking by Redundancy (I)
- Hide the occurrence of failures from other
processes, by redundancy
- Information
- Extra bits to allow recovery
- Time
- Transactions to allow abort/redo
- Particularly suited for transient or intermittent
faults
- Physical
- Extra equipment to tolerate loss/malfunction of
some components
- or redundant s/w processes
- Voter circuitry
- Voters are components too: they may themselves
fail!
Failure Masking by Redundancy (II)
- Triple modular redundancy (TMR)
Flat vs Hierarchical Groups (I)
Process resilience by replicating processes into
groups
Group membership protocols
Flat vs Hierarchical Groups (II)
- Flat groups
- Symmetrical (no special roles)
- No single point of failure
- Complex operation protocols (e.g. voting)
- Hierarchical groups
- Coordinator is a single point of failure
- Group membership
- group server
- distributed management
- E.g. reliable multicast
- Detection of failed processes?
- Join/leave must be synchronous
- with data messages!
- How to rebuild a group after a major
failure?
Failure Masking &amp; Replication
- Having a group of identical processes allows us
to mask one or more faulty processes
- Primary-backup protocols
- Hierarchical organization
- Election among backups to select a new primary
- Replicated-write protocols
- Flat process groups
- Active replication
- Quorum protocols
- K-fault tolerant system
- Fail-silent processes → group size (k + 1)
- Byzantine failures → group size ≥ (2k + 1)
- Assuming that processes do not team up!!
- (independent failures)
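The two sizing rules above reduce to a one-line calculation (assuming independent failures, as the slide stresses):

```python
def group_size(k, byzantine=False):
    """Minimum group size to tolerate k faulty members.
    Fail-silent: one surviving replica suffices -> k + 1.
    Byzantine: correct replicas must outvote the faulty ones -> 2k + 1."""
    return 2 * k + 1 if byzantine else k + 1
```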
Coordination/Agreement
- A set of processes must collaborate
- or agree with one or more processes
- without fixed master/slave relationships
- failure assumptions &amp; failure detectors
- Problems
- mutual exclusion
- election
- multicast
- reliability &amp; ordering semantics
- consensus
- Byzantine agreement
Problems of Agreement
- A set of processes need to agree on a value
(decision), after one or more processes have
proposed what that value (decision) should be
- Examples
- mutual exclusion, election, transactions
- Processes may be correct, crashed, or they may
exhibit arbitrary (Byzantine) failures
- Messages are exchanged on a one-to-one basis,
and they are not signed
Two Agreement Problems
- Consensus problem: every process i proposes a
value vi, while in the undecided state. Process i
exchanges messages until it makes decision di and
moves to the decided state.
- Termination: all correct processes must make a
decision
- Agreement: same decision for all correct
processes
- Integrity: if all correct processes proposed the
same value, any correct process decides that value
- Byzantine generals problem: a commander
process i orders value v.
- The lieutenant processes must agree on what the
commander ordered.
- Processes may be faulty
- provide wrong or contradictory messages
- Integrity requirement
- A distinguished process decides a value for
others to agree upon
- A solution only exists if N > 3f, where f is the
number of faulty processes
Consensus for 3 processes
The Two-Army Problem
- How can two perfect processes reach agreement
about 1 bit of information?
- over an unreliable comm. channel
- Red army: 5000 troops
- Blue armies 1, 2: 3000 troops each
- How can the blue armies reach agreement on when
to attack?
- Their only means of communication is by sending
messengers
- that may be captured by the enemy!
- No solution!
- Proof by contradiction: assume there is a
solution with a minimum number of messages
Consensus: No Failures Case
majority(v1, ..., vN) returns the most frequently
occurring value - returns ⊥ if no majority
exists
Consensus via reliable multicast
For ordered values, min/max could be used instead
of majority
In general, if failures can occur it is not 100%
certain that consensus can be reached in finite
time!
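The majority-based scheme can be sketched as follows; `None` stands in for the ⊥ ("no majority") result, and the function names are illustrative.

```python
from collections import Counter

def majority(values):
    """Return the most frequently occurring value, or None (standing in
    for ⊥) when no value occurs more often than every other."""
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None                    # tie: no majority exists
    return counts[0][0]

def consensus_no_failures(proposals):
    # With reliable multicast and no failures, every process collects
    # the same multiset of proposals and applies the same deterministic
    # function, so all processes decide identically.
    return majority(proposals)
```

For ordered values, replacing `majority` by `min` or `max` works equally well, as the slide notes.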
Terminating Reliable Multicast (TRB): a single
process multicasts a msg, and all
correct processes must agree on that msg
- Even if the sender crashes, all correct processes
must deliver a special msg (Server-Fault)
Relation among problems
A problem B reduces to a problem A if there is an
algorithm which transforms any algorithm for A
into an algorithm for B.
Synchronous systems: TRB is equivalent to
Consensus
Asynchronous systems: Consensus reduces to
TRB, but not vice versa!
Asynchronous systems with crash failures:
Atomic Multicast is equivalent to Consensus
Consensus in synchronous systems
Duration of a round: max. delay of B-multicast
Up to f faulty processes
Dolev &amp; Strong, 1983: Any algorithm to reach
consensus despite up to f failures requires (f +
1) rounds.
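A failure-free simulation of this round structure: in each of the f + 1 rounds every process B-multicasts the values it knows, and afterwards all correct processes hold the same set and decide, e.g., its minimum. Crash behaviour mid-round is not modelled in this sketch.

```python
def synchronous_consensus(proposals, f):
    """Sketch of the (f+1)-round crash-tolerant algorithm.
    proposals: {process_id: proposed_value}."""
    known = {p: {v} for p, v in proposals.items()}
    for _round in range(f + 1):
        # B-multicast: every (correct) process announces its known values.
        announced = set().union(*known.values())
        for p in known:
            known[p] |= announced
    # Deterministic decision rule applied to identical sets -> agreement.
    return {p: min(vals) for p, vals in known.items()}
```

The f + 1 rounds matter only when crashes can cut announcements short; with no failures one round already suffices.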
Byzantine agreement (synchronous)
Faulty process
Nothing can be done to improve a correct
process's knowledge beyond the first stage
- It cannot tell which process is faulty.
Lamport et al., 1982: No solution for N = 3, f = 1
Pease et al., 1982: No solution for N ≤ 3f
(assuming private comm. channels)
Agreement in Faulty Systems (I)
- The Byzantine generals problem for 3 loyal
generals and 1 traitor
- The generals announce their troop strengths
- The vectors that each general assembles based on
(a)
- The vectors that each general receives in step 3.
Consensus by generals 1, 2, 4 → (1, 2, UNKNOWN, 4)
Agreement in Faulty Systems (II)
No majority!
- The same as in the previous slide, except now with
2 loyal generals and one traitor.
Byzantine agreement for N > 3f
Example with N = 4, f = 1
- 1st round: Commander sends a value to each
lieutenant
- 2nd round: Each of the lieutenants sends the value
it has received to each of its peers.
- A lieutenant receives a total of (N - 2) + 1
values, of which (N - 2) are correct.
- By majority(), the correct lieutenants compute
the same value.
In general, O(N^(f+1)) msgs
O(N^2) for signed msgs
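The N = 4, f = 1 example can be simulated as below. The commander is assumed loyal here, the traitorous lieutenant relays an arbitrary value ("X"), and the lieutenant names are illustrative.

```python
from collections import Counter

def byzantine_round(commander_value, traitor):
    """N = 4, f = 1 sketch: commander sends a value to lieutenants
    L1..L3; each lieutenant relays what it received to its peers;
    correct lieutenants decide by majority over the 3 values held."""
    lieutenants = ["L1", "L2", "L3"]
    # Round 1: loyal commander sends the same value to everyone.
    received = {l: commander_value for l in lieutenants}
    decisions = {}
    for l in lieutenants:
        if l == traitor:
            continue                     # traitor's decision is irrelevant
        values = [received[l]]           # heard directly from the commander
        for peer in lieutenants:
            if peer == l:
                continue
            # Round 2: peers relay; the traitor relays an arbitrary value.
            values.append("X" if peer == traitor else received[peer])
        decisions[l] = Counter(values).most_common(1)[0][0]
    return decisions
```

Each correct lieutenant holds (N - 2) + 1 = 3 values, of which N - 2 = 2 are correct, so majority() masks the traitor.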
Impossibility of (deterministic) consensus in
asynchronous systems
M.J. Fischer, N. Lynch, and M. Paterson,
"Impossibility of distributed consensus with one
faulty process", J. ACM, 32(2), pp. 374-382,
1985.
A crashed process cannot be distinguished from a
slow one.
- Not even with a 100% reliable comm. network!
There is always a chance that some continuation
of the processes' execution avoids consensus being
reached.
No guarantee of consensus, but Prob(consensus)
> 0
Solutions based on randomization, or on
(unreliable) failure detectors, or on fault
masking