Implementation of a FailSafe system - PowerPoint PPT Presentation

Provided by: kajarighos

1
Implementation of a Fail-Safe system

Kajari Ghosh Dastidar

2
What is a Fail-Safe system?

A system that can fail, but only to a safe state,
so that nothing bad happens
A type of fault tolerance

3
Outline of the talk

What is Fault Tolerance?
Some Terminology
What is a fault?
Why have Fault Tolerance?
How is fault tolerance obtained?
Motivation
Related Work
Contribution
Future work and Conclusion

5
What is Fault Tolerance?

The ability of a system to continue its desired
computation in the presence of system errors.
Making a computer system fault tolerant is one of
the most essential steps in making the system
dependable.

7
Terminologies

Computation

[Figure: state diagram with states s0 to s6; label: Action enabled]
8
Terminologies

A system consists of components which

are connected together
facilitate the flow of information
interact
compute a method or an algorithm to achieve a
goal

9
Terminologies

Error

An incorrect response from a system, indicating
that a fault is present. It will lead to system
failure if no fault tolerance is present.
11
What is a fault?

Fault

A flaw in hardware or software which can result in
a failure.
12
Causes of fault

Physical factors such as wear-out
External disturbances
Design flaws or defects in hardware
Defects in software

13
Types of fault

Transient: disappears without repair
Intermittent: effect not always present
Permanent: effect always present
S/W and H/W design errors
Operator errors
Externally induced damage
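
The first three fault types above differ only in how their effect shows up over time, which can be illustrated with a small sketch. It is not from the presentation: the `FaultKind` enum, the `classify` function, and the rule of classifying from a series of observations taken without any repair are all assumptions made for illustration.

```python
from enum import Enum


class FaultKind(Enum):
    NONE = "none"
    TRANSIENT = "transient"
    INTERMITTENT = "intermittent"
    PERMANENT = "permanent"


def classify(observations):
    """Classify a fault from a list of booleans (hypothetical helper).

    Each entry is True if the fault's effect was observed at that sample.
    Assumption: no repair is performed during the observation window.
    """
    if True not in observations:
        return FaultKind.NONE
    tail = observations[observations.index(True):]
    if all(tail):
        # Once it appeared, the effect was always present.
        return FaultKind.PERMANENT
    # Count separate episodes (runs of consecutive True samples).
    episodes = sum(
        1 for i, seen in enumerate(tail) if seen and (i == 0 or not tail[i - 1])
    )
    # One episode that went away on its own: transient.
    # An effect that comes and goes: intermittent.
    return FaultKind.TRANSIENT if episodes == 1 else FaultKind.INTERMITTENT
```

A finite observation window can only suggest a classification: a fault that looks transient may reappear later and turn out to be intermittent.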

14
Types of fault

How a system should be made fault tolerant
depends on
the type of fault we want to tolerate
the number of faults we want to tolerate

16
Why have fault tolerance?

Novice users
Increasing repair costs
Larger systems
Digital systems more prevalent
More users, more dependent on digital systems

17
Why have fault tolerance?

Some Related Terms
Fault latency:
a fault can go undetected for a time (while it
does not cause an error)
Fault avoidance:
high-quality components and careful design are
used to avoid the occurrence of faults
Graceful degradation:
the system keeps performing correctly, at a
degraded level, after the occurrence of faults

19
How is fault tolerance obtained?

Masking tolerance
Non-Masking tolerance
Fail Safe

20
How is fault tolerance obtained?

Masking

A structural redundancy technique that completely
masks faults within a set of redundant modules,
e.g. Triple Modular Redundancy, Hybrid
Redundancy
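
Triple Modular Redundancy can be sketched as a majority vote over three replicas of the same computation. This is a minimal illustration, not code from the presentation: `tmr_vote` and `run_tmr` are hypothetical names, and module outputs are assumed to be hashable values that can be compared for equality.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value that at least two of the three modules agree on."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # All three outputs differ: the fault can no longer be masked.
        raise ValueError("no majority among the three module outputs")
    return value


def run_tmr(modules, x):
    """Run three redundant modules on the same input and vote on the result."""
    first, second, third = (module(x) for module in modules)
    return tmr_vote(first, second, third)
```

With two healthy modules and one faulty one, the caller sees only the correct value, which is what "masking" means here. The voter itself remains a single point of failure in this sketch.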
21
How is fault tolerance obtained?

Non-Masking Tolerance

After a failure, the system goes through an
unstable state before it returns to stability,
e.g. Forward Recovery Systems, Backward
Recovery Systems
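
Backward recovery is commonly implemented with checkpoints and rollback. The sketch below assumes that scheme; the `CheckpointedCounter` class and its rule that a negative amount signals corrupt input are invented for illustration and do not come from the presentation.

```python
import copy


class CheckpointedCounter:
    """Toy component that recovers by rolling back to its last checkpoint."""

    def __init__(self):
        self.state = {"total": 0}
        self._saved = copy.deepcopy(self.state)

    def checkpoint(self):
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._saved)

    def apply(self, amounts):
        """Apply a batch of amounts; on an error, restore the last good state."""
        try:
            for amount in amounts:
                if amount < 0:  # assumed error condition for this sketch
                    raise ValueError("corrupt input")
                self.state["total"] += amount
        except ValueError:
            # The state may be partially updated here (the unstable phase);
            # rolling back returns the system to a stable state.
            self.rollback()
            return False
        self.checkpoint()
        return True
```

Between the error and the rollback the state is temporarily wrong, which is the sense in which the fault is tolerated but not masked.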
22
How is fault tolerance obtained?

Fail-safe

The system can fail, but only to a safe state, to
avoid catastrophes. Liveness is compromised.
E.g. a common fail-safe system is the pilot-light
sensor in most gas furnaces: if the pilot light
is cold, a mechanical arrangement disengages the
gas valve, so that the house cannot fill with
unburned gas.
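
The pilot-light interlock can be mimicked in software as a valve that opens only on positive evidence that the pilot is hot. This is a sketch under assumptions: the `Furnace` class and the convention that a failed sensor reports `None` are hypothetical.

```python
class Furnace:
    """Gas valve that defaults to closed unless the pilot is known to be hot."""

    def __init__(self):
        self.valve_open = False

    def update(self, heat_requested, pilot_is_hot):
        # Anything other than an explicit True from the sensor (False, or
        # None for a broken sensor) is treated as "pilot cold", so every
        # failure leads to the safe state: valve closed.
        self.valve_open = bool(heat_requested) and pilot_is_hot is True
        return self.valve_open
```

Safety is kept (no unburned gas), while liveness is given up: with a cold pilot or a broken sensor the furnace never heats.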
23
How is fault tolerance obtained?
[Figure: masking, non-masking, and fail-safe tolerance compared; diagram labels: perfect, safe, unsafe]
25
Motivation
Effectiveness and Drawbacks of the existing models
26
Motivation
Terminologies

Reliability
Availability
Maintainability
Mean-time-to-failure (MTTF)
Mean-time-to-repair (MTTR)
Mean-time-between-failure (MTBF)
Fault Detection
Fault Containment
Fault Diagnosis
Recovery
Safety Critical Systems
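
The three mean-time measures above are related by simple arithmetic. The sketch below uses the common convention that MTBF = MTTF + MTTR and that steady-state availability is MTTF / (MTTF + MTTR); the presentation does not state which definitions it uses, so these formulas and the function names are assumptions.

```python
def mtbf(mttf_hours, mttr_hours):
    """Mean time between failures: time to fail plus time to repair."""
    return mttf_hours + mttr_hours


def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is operational in the long run."""
    return mttf_hours / (mttf_hours + mttr_hours)
```

For example, a system that runs 990 hours on average before failing and takes 10 hours to repair has an MTBF of 1000 hours and is available 99% of the time.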

27
Motivation
Effectiveness and Drawbacks of the existing models
Achieving masking tolerance sometimes proves to
be very expensive.
Effectiveness depends on the MTBF and the life of
the system.
Non-masking tolerance does not prevent the system
from functioning erroneously.
The above temporary solutions make the system
internally weaker.
To keep the system safe in a fail-safe system,
fault detection should be efficient.
What happens when even fail-safe can't be
guaranteed?
29
Related Work
Dijkstra proposed a self-stabilizing system,
which laid the foundation of non-masking
fault-tolerant systems. Later, quite a few papers
were written improving these types of systems.
Arora and Kulkarni developed a generalized
theoretical model for detecting and correcting
faults. Ghosh proposed a new fail-safe model
which points to the limitations of the previous
two fault-tolerant models.
31
Contribution

Present a formalization of the system
Present methodologies to implement the proposed
fail-safe system
Try to employ distributed fault detectors
wherever possible
Argue that fail-safe can't be guaranteed in all
situations
Where the above tolerance can't be guaranteed,
we propose a weaker, cheaper alternative

32
Contribution
Formalization: Safety Margin
[Figure: diagram with regions labeled No failures, Safety Margin, and Failure]
33
Contribution
Implementation
Whenever a fault occurs in the system which does
not perturb its normal functioning, instead of
masking it, we can raise an alarm before
stopping the system.
What type of fault is to be detected?
What should be the degree of tolerance?
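
The "raise an alarm, then stop" behaviour described on this slide can be sketched as a small controller. Everything named here is hypothetical: `FailSafeController`, the `detector` callback that decides whether a reading indicates a fault, and the `alarm` callback are illustrative and not taken from the presentation.

```python
from enum import Enum


class Mode(Enum):
    RUNNING = "running"
    STOPPED_SAFE = "stopped_safe"


class FailSafeController:
    """Stops in a safe state as soon as its detector reports a fault."""

    def __init__(self, detector, alarm):
        self.detector = detector  # callable: reading -> bool (fault present?)
        self.alarm = alarm        # callable: message -> None
        self.mode = Mode.RUNNING

    def tick(self, reading):
        if self.mode is Mode.STOPPED_SAFE:
            # The safe state is absorbing: no automatic restart.
            return self.mode
        if self.detector(reading):
            # Raise the alarm first, then stop, rather than masking the fault.
            self.alarm(f"fault detected on reading {reading!r}")
            self.mode = Mode.STOPPED_SAFE
        return self.mode
```

The two questions on the slide map onto the two parameters: which faults to detect is the choice of `detector`, and the degree of tolerance is how strict its threshold is.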
34
Contribution
Why distributed detector?
We can't rely on a single, central detector:
what if it fails? More than one detector is
desired. At least one node should detect the
fault to raise the alarm. The alarm will then
propagate through the system. MTBF plays an
important role here.
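
Alarm propagation from one or more detecting nodes can be sketched as a breadth-first flood over the network graph. The presentation does not specify a propagation protocol, so the flooding scheme, the `propagate_alarm` function, and the adjacency-dictionary representation are assumptions.

```python
from collections import deque


def propagate_alarm(neighbors, detectors):
    """Flood an alarm from every detecting node to all reachable nodes.

    neighbors: dict mapping each node to a list of its neighbors.
    detectors: iterable of nodes that detected the fault locally.
    Returns a dict mapping each alarmed node to the hop count at which
    the alarm reached it (0 for the detectors themselves).
    """
    reached = {node: 0 for node in detectors}
    queue = deque(reached)
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, []):
            if nxt not in reached:
                reached[nxt] = reached[node] + 1
                queue.append(nxt)
    return reached
```

If no node detects the fault, the result is empty and the system never stops, and a node with no path to any detector never hears the alarm. Both cases show why several detectors and sufficient connectivity matter.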
35
Contribution
Safety Margin is Null
Crash failures
Certain networks with sparse connectivity
36
Contribution
Weaker (Cheaper) Alternative
Dijkstra's first algorithm (ring topology)

There should be exactly one privilege at any
instant
The number of states per node (K) is always
greater than the number of nodes
The system stabilizes after a certain period
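
The ring algorithm named above is usually described as a K-state token ring, which can be simulated in a few lines. This sketch follows that common description rather than anything shown on the slide; the function names and the scheduling choice (always move the lowest-numbered privileged node) are assumptions.

```python
def privileged(x):
    """Return the indices of nodes that currently hold a privilege."""
    n = len(x)
    holders = []
    if x[0] == x[n - 1]:
        holders.append(0)      # node 0 compares with the last node
    for i in range(1, n):
        if x[i] != x[i - 1]:
            holders.append(i)  # other nodes compare with their predecessor
    return holders


def move(x, k, i):
    """Let privileged node i make its move (modifies x in place)."""
    if i == 0:
        x[0] = (x[0] + 1) % k
    else:
        x[i] = x[i - 1]


def stabilize(start, k, max_steps):
    """Run from an arbitrary state until exactly one privilege remains."""
    x = list(start)
    for steps in range(max_steps):
        holders = privileged(x)
        if len(holders) == 1:
            return x, steps
        move(x, k, holders[0])  # scheduling choice: lowest index first
    raise RuntimeError("did not stabilize within max_steps")
```

Starting from an arbitrary state such as `[3, 1, 4, 1, 5]` with `k=6`, several nodes are privileged at first. After stabilization exactly one privilege circulates around the ring.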

37
Contribution
Weaker (Cheaper) Alternative

The system deadlocks (or goes to a
pre-defined safe state)
Eventually deadlocking: the effect of a failure
can't be avoided
Only two system states are needed
A marker is introduced

39
Future Work
Implementation of the different types of tolerant
models
Making the methods more effective
Idea of Replication
Reachability
40
Questions?
