Fault Tolerant Computing

Slides: 64

Provided by: DrBetty3

Learn more at: http://www.cse.msu.edu

1
Fault Tolerant Computing
2
Acknowledgements

The following lectures are based on materials
from the following sources
S. Kulkarni
J. Rushby
J. Knight

3
Objectives

Exposure to area of Critical Systems
What it means to have a fault-tolerant system
Specification techniques for representing
critical properties
How to Design Fault tolerance into a system

4
Reliability and Recovery

Reliability
Probability that a system does not fail during the
interval [0, t], given that it was operating
properly at time 0.
Recovery
Process of restoring consistency after a failure

5
Dependability

Dependability
How much one may rely on the quality of services
delivered
Quality of service depends on
Correctness
Continuity of service

6
Terms

Failure: malfunction
Fault: condition that might lead to failure
Error: an incorrect response that indicates a
fault is present
Faults may be
permanent
intermittent
transient

7
Terms (contd)

Graceful Degradation
system is operational, but degraded, after faults
Fail-safe
system execution is safe after the fault
Stabilizing
system recovers to a consistent state after the
fault
Masking
the user of the system does not see any
unintended behavior due to faults

8
Terms (contd)

Mean Time to Failure (MTTF)
expected value of system failure time
Mean Time to Repair (MTTR)
expected value of system repair time
Mean Time Between Failures (MTBF)
expected time between successive failures
MTBF = MTTF + MTTR
Fault Tolerance
ability to continue operation after occurrence of
faults
A system is faulty once its behavior is no
longer consistent with its specification.
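The relation between MTTF, MTTR and MTBF on this slide can be checked with a few lines of Python. This is a minimal sketch: the function names and the sample figures (990 hours, 10 hours) are hypothetical, and the availability ratio is the standard steady-state formula, which the slide itself does not state.

```python
def mtbf(mttf: float, mttr: float) -> float:
    """Mean time between failures: one failure-free period plus one repair."""
    return mttf + mttr


def availability(mttf: float, mttr: float) -> float:
    """Long-run fraction of time the system is up.

    Standard formula, added here as an assumption; it is not on the slide.
    """
    return mttf / (mttf + mttr)


# Hypothetical figures: 990 hours to failure on average, 10 hours to repair.
example_mtbf = mtbf(990.0, 10.0)
example_availability = availability(990.0, 10.0)
```

With these sample figures the system spends 10 hours out of every 1000 under repair.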

9
Design Decisions

Fault detection
Fault confinement
Fault diagnosis
Repair and/or reconfigure
Redundancy
Hardware: extra hardware
Information: redundancy bits
Software: diagnosis software, extra software
Temporal: re-execute software to recover from
intermittent faults

10
Safety vs Reliability

Reliability
concerns occurrence of failures
System failures defined in terms of system
services
Safety
concerns occurrence of accidents
Unplanned events that result in death, injury,
illness, damage, loss of property or environmental
harm
Defined in terms of external consequences

11
Types of Faults

Omission failure
server omits to respond to an input (fail-silent
failure)
Timing failure
response is functionally correct, but untimely
can be early timing failure or late timing
failure
(performance failure)
Response failure
incorrect response
output value incorrect (value failure)
state transition incorrect (state transition
failure)

12
Types of Faults (contd)

Crash failure
after a first omission, the server omits to
produce output until it restarts
Amnesia crash
server restarts in a predefined initial state
that does not depend on the inputs seen before
crash
Partial amnesia crash
some part of the state is the same as before the
crash; the rest is in a predefined initial state
Pause crash
server restarts in the state it had before the
crash
Halting crash
crashed server never restarts

13
Types of Faults (contd)

Byzantine failure
Component exhibits arbitrary and malicious
behavior, perhaps in cooperation with other
faulty components.
Fail-stop failure
In response to a failure, the component changes
to a state that permits other components to
detect that a failure has occurred, and then
stops.

14
Examples

OS crash followed by a reboot in the initial state
(amnesia crash)
Database server crash followed by recovery of a
database state that reflects all transactions
before the crash (pause failure)
Communication server occasionally loses messages
but does not delay messages (omission failure)
Excessive message transmission or message
processing delay (communication performance
failure)
Alteration of a message due to random noise
during transmission (response failure)

15
Hierarchical Failure Masking

A failure of a certain type at a lower level can
propagate as a different kind of failure at a
higher level abstraction.
Value Error at the physical layer (e.g., 2 bits
corrupted) propagates as omission error at data
link layer

16
Group Failure Masking

To ensure a service remains available to clients
despite server failure,
one can implement a group of redundant,
physically independent servers.
The group masks the failure of a member.
Hierarchical masking requires
users to implement resource failure-masking
attempts as exception handling code.
In group masking,
individual member failures are entirely hidden
from users by group management mechanisms.

17
Group Failure Masking (contd)

Group output is a function of outputs of
individual group members.
fastest member
distinguished member
result of majority vote
A server able to mask any k-1 concurrent member
failures will be termed k-fault tolerant
e.g., a primary/standby group of k servers with
members ranked as primary, 1st backup, 2nd
backup, ..., can mask k-1 failures.
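Two of the group output functions listed above can be sketched in Python. This is an illustration under stated assumptions, not a group management protocol: the function names are hypothetical, a member that has crashed or omitted its output is represented by None, and list order stands for the primary/backup ranking.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(outputs: Sequence[Optional[int]]) -> Optional[int]:
    """Return the value produced by a strict majority of all members, else None."""
    counts = Counter(v for v in outputs if v is not None)
    if not counts:
        return None
    value, votes = counts.most_common(1)[0]
    return value if votes > len(outputs) // 2 else None


def primary_standby(outputs: Sequence[Optional[int]]) -> Optional[int]:
    """Return the output of the highest-ranked member that responded."""
    for value in outputs:
        if value is not None:
            return value
    return None
```

For example, a primary/standby group of three members still produces an output when the primary and the first backup have both crashed, which matches the k-1 bound stated on the slide.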

18
Some Formalism

Programs
A program consists of
a finite set of variables
a finite set of actions, each of the form
$guard \rightarrow statement$
where
guard is a boolean expression over program
variables, and
statement updates program variables
Modifications
guards may contain receive from channels
statements may contain sends/receive

19
Computation

A program computation is a "fair" sequence of
steps, where in each step an action whose guard
is true has its statement executed
In one step, multiple guards may be true.
If guard of some action is true continuously,
then that action would eventually be chosen for
execution.
Notes
A program computation is a sequence of states
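The guarded-action model and the fairness condition can be made concrete with a small interpreter. This is a sketch under assumptions: the Action class, the run function, and the two-action counter program are all hypothetical, and round-robin scanning is just one simple scheduler that satisfies the fairness condition above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]


@dataclass
class Action:
    name: str
    guard: Callable[[State], bool]
    statement: Callable[[State], None]


def run(actions: List[Action], state: State, max_steps: int = 100) -> List[State]:
    """Execute enabled actions in round-robin order; return the visited states."""
    trace = [dict(state)]
    idx = 0
    for _ in range(max_steps):
        # Scan cyclically from idx so a continuously enabled action is not starved.
        for offset in range(len(actions)):
            action = actions[(idx + offset) % len(actions)]
            if action.guard(state):
                action.statement(state)
                trace.append(dict(state))
                idx = (idx + offset + 1) % len(actions)
                break
        else:
            break  # no guard is true: the computation has terminated
    return trace


# Hypothetical program: count x up to 3, then set done.
def inc(s: State) -> None:
    s["x"] += 1


def finish(s: State) -> None:
    s["done"] = 1


program = [
    Action("inc", lambda s: s["x"] < 3, inc),
    Action("finish", lambda s: s["x"] == 3 and s["done"] == 0, finish),
]
trace = run(program, {"x": 0, "done": 0})
```

The returned trace is the computation viewed as a sequence of states, as in the note on the slide.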

20
Specification

A specification is a set of sequences of states.
What does it mean for a program, p to satisfy a
specification sp from a set of states S?
every computation of p that starts from a state
in S is in sp .

21
Examples of specifications

Let S be a predicate.
Invariant
$\mathit{Invariant}(S) = \{\, seq : S \text{ is true in each state of } seq \,\}$
A sequence seq is in Invariant(S) iff S is true
in each state in seq.
Closure
$\mathit{Closed}(S) = \{\, seq : (\forall i : i \geq 0 : S \text{ is true in the } i\text{th state of } seq \Rightarrow S \text{ is true in the } (i+1)\text{th state of } seq) \,\}$
If S ever becomes true, it continues to be true.
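Both specifications can be checked mechanically on a finite sequence of states. The sketch below is only an illustration: the function names are hypothetical, states are arbitrary Python values, and a finite list stands in for what the slides treat as a possibly infinite sequence.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def in_invariant(seq: Sequence[T], S: Callable[[T], bool]) -> bool:
    """True iff S holds in every state of seq."""
    return all(S(state) for state in seq)


def in_closed(seq: Sequence[T], S: Callable[[T], bool]) -> bool:
    """True iff whenever S holds in a state, it also holds in the next state."""
    return all(S(seq[i + 1]) for i in range(len(seq) - 1) if S(seq[i]))
```

For instance, with S meaning "the counter is at least 2", the sequence 0, 1, 2, 3 is in the closure specification but not in the invariant specification.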

22
Examples of specifications (contd)

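Let R and S be predicates.
leads-to
$R \text{ leads-to } S = \{\, seq : (\forall i : i \geq 0 : R \text{ is true in the } i\text{th state of } seq \Rightarrow (\exists k : k \geq i : S \text{ is true in the } k\text{th state of } seq)) \,\}$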
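The leads-to specification can also be evaluated on a finite sequence, with one caveat stated up front: a finite prefix cannot refute a liveness property in general, so the sketch below assumes the list is the complete computation. The function name and the sample request/critical-section trace are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def in_leads_to(seq: Sequence[T], R: Callable[[T], bool], S: Callable[[T], bool]) -> bool:
    """True iff every state satisfying R is followed (at or after it) by a state satisfying S."""
    return all(
        any(S(seq[k]) for k in range(i, len(seq)))
        for i in range(len(seq))
        if R(seq[i])
    )


requesting = lambda state: state == "req"
in_cs = lambda state: state == "cs"
served = in_leads_to(["idle", "req", "req", "cs", "idle"], requesting, in_cs)
starved = in_leads_to(["idle", "req"], requesting, in_cs)
```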

23
Examples of specifications (contd)

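Mutual Exclusion
$\mathit{invariant}(\, (j \neq k) \Rightarrow \neg(cs.j \land cs.k) \,)$
$(\forall j :: (req.j \text{ leads-to } cs.j))$ // request for cs
Leader Election
$\mathit{invariant}(\, (j \neq k) \Rightarrow \neg(leader.j \land leader.k) \,)$
$\mathit{true} \text{ leads-to } (\exists j :: leader.j)$
Load Balancing
$\mathit{true} \text{ leads-to } (\forall j, k :: |load.j - load.k| \leq bound)$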

24
Safety Specification

Safety specification
A sequence "does nothing bad"
No sequence has a bad prefix
Let sp be a specification.
sp is a safety specification
iff
$(\forall s : s \notin sp : (\exists a : a \text{ is a prefix of } s : (\forall b :: ab \notin sp)))$

25
Liveness Specification

Liveness specification
A sequence does something good
Every finite prefix has a good extension
Let sp be a specification
sp is a liveness specification
iff
$(\forall a :: (\exists b :: ab \in sp))$ // a could be a bad prefix

26
Faults

A fault is an action that can change the program
state
All faults
(be they crash, fail-stop, omission,
corruption, timing, Byzantine, intruders, or
...)
can be thus viewed as perturbations on the
system

27
Faults (contd)

A program computation in the presence of faults
is a sequence of steps where
in each step either program action executes or
fault action executes
the program actions are fairly executed
the fault occurrences are finite

28
Representation of Faults

Communication faults
Let c denote the sequence of messages on a
channel.
Let m1 and m2 be messages, and let seqm be a
sequence of messages.
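Message Loss: $c = \langle seqm, m1 \rangle \rightarrow c := \langle seqm \rangle$
Message Duplication: $c = \langle seqm, m1 \rangle \rightarrow c := \langle seqm, m1, m1 \rangle$
Message Reorder: $c = \langle seqm, m1, m2 \rangle \rightarrow c := \langle seqm, m2, m1 \rangle$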
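The three communication faults can be read as functions from one channel content to the next. In the sketch below a channel is a Python list whose last elements play the role of m1 and m2; the function names are hypothetical, and a fault whose guard does not hold (for example, loss on an empty channel) leaves the channel unchanged.

```python
from typing import List


def lose(c: List[str]) -> List[str]:
    """Message loss: drop the last message on the channel."""
    return c[:-1]


def duplicate(c: List[str]) -> List[str]:
    """Message duplication: repeat the last message on the channel."""
    return c + c[-1:]


def reorder(c: List[str]) -> List[str]:
    """Message reorder: swap the last two messages on the channel."""
    if len(c) < 2:
        return list(c)
    return c[:-2] + [c[-1], c[-2]]
```

Each function returns a new list, so a fault step can be recorded as one more state in a computation.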

29
Representation of Faults (contd)

Amnesia/Transient faults
Let c denote all the variables of a process.
$\mathit{true} \rightarrow c := \,??$ // ?? denotes an arbitrary value

30
Representation of Permanent Faults

Fail-stop fault
Upon fail-stop, a process does nothing
it does not execute any action and
it does not send any messages.
Introduce an auxiliary variable up.j at process j
Add up.j to the guard of each action of j
If processes can detect failure of other
processes, then they can do so using variable up.
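Strengthening every guard with the auxiliary variable can be sketched as a higher-order function. The names with_up and fail_stop are hypothetical, and the fault action itself (setting up to false) is an assumption; the slide only says that up.j is added to each guard.

```python
from typing import Callable, Dict

State = Dict[str, object]


def with_up(guard: Callable[[State], bool]) -> Callable[[State], bool]:
    """Return a guard that additionally requires the process to be up."""
    return lambda s: bool(s["up"]) and guard(s)


def fail_stop(s: State) -> None:
    """Assumed fault action: once up is false, no strengthened guard can hold."""
    s["up"] = False


process_state: State = {"up": True, "x": 0}
guarded = with_up(lambda s: s["x"] == 0)
enabled_before = guarded(process_state)
fail_stop(process_state)
enabled_after = guarded(process_state)
```

Other processes that can read the up variable can use it to detect the failure, as the last point on the slide notes.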

31
Representation of Permanent Faults

Byzantine Faults
Introduce an auxiliary variable b.j at process j
Add these actions as faults:
$\neg b.j \rightarrow b.j := \mathit{true}$
$b.j \rightarrow state.j := \,??$

32
Goal of Fault-tolerance Design

Starting from some initial states, S,
If the program executes alone then the original
specification, sp, is satisfied
If the program executes in the presence of faults
then the fault-tolerant specification, sp', is
satisfied.
The fault-tolerance specification depends upon
the type of the desired fault-tolerance, e.g.,
for masking: sp' = sp
for fail-safe: sp' = safety specification of sp

33
Representation of Permanent Faults

Fault-tolerant systems are rarely designed from
scratch!!!
One needs to modify a fault-intolerant system to
add fault-tolerance
Need for reuse of the fault-intolerant program.
Fault-tolerant systems need to be modified to
deal with new faults.
Need for incremental design
Need to perform several activities while
developing fault-tolerant systems.
manual or automated design, testing,
verification, synthesis, ...
desirable to have a unified framework that allows
one to perform these activities.

34
Overall Design
35
Overall Design (contd)

Should separate concerns of functionality and
fault-tolerance.
Should use components that are responsible for
fault-tolerance alone.
Should provide structural continuity while
performing these tasks.
Should be able to use the same components while
performing the above tasks.

36
A Specific Approach

We explore the following thesis (Kulkarni):
fault-tolerant system
=
fault-intolerant system
in composition with
fault-tolerance components

37
Validation

Two components, detectors and correctors, form a
basis of fault-tolerance design
Detectors and correctors are necessary and
sufficient for designing fault-tolerant systems
that satisfy the reuse criterion
Reuse criterion
In the absence of faults, the fault-tolerant
system behaves like the fault-intolerant system
In the presence of faults, the fault-tolerant
system recovers to the computations of the
fault-intolerant system

38
Validation (contd)

Existing methods satisfy the reuse criterion
Replication
Schneider's state machine approach
Checkpointing and recovery
Programs designed with these methods can be
(alternatively) designed by using detectors and
correctors
The use of detectors and correctors offers the
potential for improved design

39
Outline of Approach

Identifying the components
Their applications in design
Their applications in verification

40
Components for Fail-safe Tolerance

How to preserve the safety specification?
Existence of safe predicate
follows from the definition of safety
Hence, we need to detect whether execution of an
action in the given state is safe
The added component is called a detector

[Figure labels: "Assume that safety is not violated here"; "Check whether safety would be violated"]
41
Detectors

Specification of a detector (detection
predicate X, witness predicate Z)
$Z \Rightarrow X$
$X \text{ leads-to } (Z \lor \neg X)$
$Z \text{ next } (Z \lor \neg X)$
Examples: error detection codes, acceptance
tests, comparators, snapshot procedures,
exception conditions

42
Designing Fail-safe Fault-Tolerance

For each program action
Add a detector d such that
detection predicate equals a safe predicate of
$g \rightarrow st$
witness predicate equals Z
New action is
$Z \land g \rightarrow st$

43
Hierarchical Construction of Detectors
44
Components for Nonmasking Fault-Tolerance

How to eventually satisfy the specification?
Restore the program to a state from where its
safety and liveness specification are satisfied
The added component is called a corrector

45
Correctors

Specification of a corrector (correction
predicate X, witness predicate Z)
$Z \Rightarrow X$
$\mathit{true} \text{ leads-to } (X \land Z)$
$X \text{ next } X$
$Z \text{ next } Z$
"Large" correctors in distributed programs are
built out of "parallel" or "sequential"
composition of "smaller" ones
Examples: error correction codes, reset
procedures, voters, rollback recovery,
constraint satisfaction

46
Components for Masking Fault-Tolerance

Ensure that in the presence of faults the safety
specification is always satisfied
use detectors
Ensure that eventually the program reaches a
state from where the specification is satisfied
use correctors

47
An example Input-Output Problem

$in$ : constant // either 0 or 1
$out : \{0, 1, \bot\}$ // either 0 or 1 or $\bot$,
some specific value (currently unknown)
Safety specification
always ($\Box$)
$((out = \bot) \lor (out = in))$
$\land\ ((out \neq \bot) \text{ next } (out \neq \bot))$
Liveness specification
eventually ($\Diamond$) $(out = in)$

48
Example contd

$in$ : constant // either 0 or 1
$x : \{0, 1\}$ // initialized to in
$out : \{0, 1, \bot\}$
$out = \bot \rightarrow out := x$
Faults
$\mathit{true} \rightarrow x := \,?$

49
Example contd

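$y, z : \{0, 1\}$ // initialized to in
$(x = y \lor x = z) \land out = \bot \rightarrow out := x$ // $(x = y \lor x = z)$ is the detector
More Faults
$\mathit{true} \rightarrow y := \,?$
$\mathit{true} \rightarrow z := \,?$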

50
Triple Modular Redundancy

$(y = x \lor y = z) \land out = \bot \rightarrow out := y$
$(z = x \lor z = y) \land out = \bot \rightarrow out := z$
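The claim behind triple modular redundancy can be checked exhaustively for this small example. The sketch below assumes that at most one of x, y, z is corrupted before out is assigned (the slides do not state the bound explicitly), and it represents the not-yet-assigned value of out by a hypothetical constant BOTTOM. The function names are illustrative.

```python
from typing import Dict, List, Optional

BOTTOM = None  # stands for the "unassigned" value of out


def enabled_outputs(s: Dict[str, Optional[int]]) -> List[Optional[int]]:
    """Values that the enabled TMR actions would write to out in state s."""
    if s["out"] is not BOTTOM:
        return []
    x, y, z = s["x"], s["y"], s["z"]
    values = []
    if x == y or x == z:  # detector guarding out := x
        values.append(x)
    if y == x or y == z:  # detector guarding out := y
        values.append(y)
    if z == x or z == y:  # detector guarding out := z
        values.append(z)
    return values


def check_single_fault_masking() -> bool:
    """Assumption: at most one of x, y, z is set to an arbitrary bit."""
    for inp in (0, 1):
        for victim in (None, "x", "y", "z"):
            for bad in (0, 1):
                s = {"x": inp, "y": inp, "z": inp, "out": BOTTOM}
                if victim is not None:
                    s[victim] = bad
                values = enabled_outputs(s)
                # Some action must be enabled, and every enabled one must write in.
                if not values or any(v != inp for v in values):
                    return False
    return True
```

Under the single-fault assumption, the detectors keep every enabled action safe, and the two extra copies ensure that some action stays enabled, which is the role the slides assign to the corrector.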

51
Distributed Reset An Example in Design

The problem: Reset the state of a distributed
system to a given global state
Applicable in the design of various
fault-tolerant systems
Need for a fault-tolerant, bounded memory
protocol (Lamport and Lynch, in Handbook of TCS
1990)
Previous solutions are merely stabilizing
tolerant, which
allows resets to be incorrect during recovery
Our solution is the first to provide masking
tolerance in addition to stabilizing tolerance

52
Specification of Distributed Reset
[Figure: starting from the intolerant program, adding detectors yields a fail-safe tolerant program, adding correctors yields a nonmasking tolerant program, and adding detectors and correctors yields a masking tolerant program]
53
Specification of Distributed Reset

A process initiates a reset operation to reset
the system to a given global state.
For each reset operation initiated, the following
two conditions should be satisfied
non-prematurity
when the initiating process completes the reset
operation, the program state is reachable from
the given global state
eventual completion
the initiating process eventually completes the
reset operation

54
Faults and Fault-tolerance Requirements

CJTSS'98
Fault-classes considered in our solution
Network faults
Failure and repair of processes and
communication channels
Memory faults
Transient faults, undetectable message
corruption
Fault-tolerance requirements
Masking tolerance to network faults
Stabilizing tolerance to network faults and
memory faults
Other requirements
Bounded memory at each process

55
Use a diffusing computation
56
Fault-intolerant Distributed Reset

Embed a tree
Use a diffusing computation
Root of the tree initiates a diffusing
computation
Each process propagates the diffusing computation
to its children
A process completes the diffusing computation
only after its descendants have completed the
diffusing computation
Each process resets its state when it propagates
the diffusing computation
Two processes communicate only if either both
have reset their states or neither has reset its
state in the current reset computation
When the root of the tree completes the diffusing
computation, the state of the system is reachable
from the given global state
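The fault-intolerant scheme can be simulated sequentially on a small tree. This is a sketch under assumptions: the function name, the dictionary-based tree, and the sample process names are hypothetical, and recursion stands in for the message exchange that a real distributed execution would use.

```python
from typing import Dict, List


def diffusing_reset(
    tree: Dict[str, List[str]],
    root: str,
    state: Dict[str, int],
    target: Dict[str, int],
) -> List[str]:
    """Reset every process; return the order in which processes complete."""
    completed: List[str] = []

    def propagate(p: str) -> None:
        state[p] = target[p]  # a process resets when it propagates
        for child in tree.get(p, []):
            propagate(child)
        completed.append(p)  # it completes only after all its descendants

    propagate(root)
    return completed


tree = {"r": ["a", "b"], "a": ["c"]}
state = {"r": 9, "a": 9, "b": 9, "c": 9}
target = {"r": 0, "a": 0, "b": 0, "c": 0}
order = diffusing_reset(tree, "r", state, target)
```

The root is always last in the completion order, so when it completes every process has already been reset.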

57
Designing components for masking tolerance

Add a detector that
lets the root detect if all processes
participated in the current diffusing computation
Add a corrector that
reconstructs the tree
corrects the variables used in a diffusing
computation
ensures that the diffusing computation never
blocks
when the diffusing computation completes, if the
check performed by the detector fails, performs
another diffusing computation
These components must be multitolerant!

58
Designing multitolerant detector

Problem: Detect whether all processes
participated in the diffusing computation
Subproblem: Let each process detect if all its
neighbors participated in that diffusing
computation
Easy if each diffusing computation is associated
with a distinct sequence number
requires that the sequence numbers are unbounded
Difficult if the sequence numbers are bounded
sequence numbers from old diffusing computations
may confuse the detection
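Why bounded sequence numbers are harder can be seen from a toy comparison. The sketch below is not the detector from these slides: the bound K, the function names, and the scenario (process l stuck at computation 1 while j has reached computation 5) are all hypothetical, and sequence numbers are assumed to wrap modulo K.

```python
K = 4  # hypothetical bound on the sequence number space


def same_computation_unbounded(sn_j: int, sn_l: int) -> bool:
    """With distinct, unbounded numbers, equality identifies the computation."""
    return sn_j == sn_l


def same_computation_bounded(sn_j: int, sn_l: int, k: int = K) -> bool:
    """With numbers stored modulo k, an old computation can look current."""
    return sn_j % k == sn_l % k


# j is in computation 5; l never left computation 1.
unbounded_verdict = same_computation_unbounded(5, 1)
bounded_verdict = same_computation_bounded(5, 1)
```

The bounded check wrongly reports that l participated, which is the kind of confusion from old diffusing computations that the slide refers to.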

59
Problem with Bounded Sequence Numbers
60
Problem with Bounded Sequence Numbers (contd)
61
Problem with Bounded Sequence Numbers (contd)

Theorem. Let j and l be neighboring processes and
let ROOT be an ancestor of j.
If j and l have completed at least two diffusing
computations since they changed tree or they
observed a network fault, and the sequence
numbers of j and l are identical,
then l has propagated the same diffusing
computation as j

62
Multitolerant Detector (continued)

Our detector guarantees that
In the presence of network faults only, the root
can always detect whether all processes
participated in the current diffusing
computation
In the presence of network faults and memory
faults, the root can eventually detect whether
all processes participated in the current
diffusing computation

63
Multitolerant Distributed Reset