Title: Implementation of a Fail-Safe system
1Implementation of a Fail-Safe system
2What is a Fail Safe system?
- A system that can fail, but only into a safe state, so that nothing harmful happens
- A type of fault tolerance
3Outline of the talk
- What is a Fault Tolerant system?
- Some Terminologies
- What is fault?
- Why have Fault Tolerance?
- How is fault tolerance obtained?
- Motivation
- Related Work
- Contribution
- Future work and Conclusion
5What is Fault Tolerance?
- The ability of a system to continue its desired computation in the presence of system errors.
- Making a computer system fault tolerant is one of the most essential steps in making the system dependable.
7Terminologies
[Figure: "Computation", a state-transition diagram over states s0 to s6 with an "Action enabled" label]
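A computation can be read as a sequence of states in which each step executes an action that is enabled in the current state. The sketch below is only an illustration of that reading: the `run` function, the guard/statement pairs, and the counter example are assumptions, not taken from the slides.

```python
from typing import Callable, List, Tuple

State = int
# A guarded action is a (guard, statement) pair. The action is enabled in a
# state when its guard evaluates to True in that state.
Action = Tuple[Callable[[State], bool], Callable[[State], State]]


def run(state: State, actions: List[Action], max_steps: int = 100) -> List[State]:
    """Return the computation (sequence of states) starting from `state`."""
    computation = [state]
    for _ in range(max_steps):
        enabled = [stmt for guard, stmt in actions if guard(state)]
        if not enabled:
            # No action is enabled, so the computation ends here.
            break
        # Deterministic choice of the first enabled action (illustration only).
        state = enabled[0](state)
        computation.append(state)
    return computation


# Hypothetical example: a counter that steps from state 0 up to state 6.
actions = [(lambda s: s < 6, lambda s: s + 1)]
trace = run(0, actions)
```

Here `trace` holds every state visited, and the run stops as soon as no guard holds.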
8Terminologies
- A system consists of components which
- are connected together
- facilitate the flow of information
- interact
- compute a method or an algorithm to achieve a goal
9Terminologies
An incorrect response from a system, indicating that a fault is present. It leads to system failure if no fault tolerance is in place.
11What is fault?
A flaw in hardware or software that can result in a failure.
12Causes of fault
- Physical factors such as wear-out
- External disturbances
- Design flaws or defects in hardware
- Defects in software
13Types of fault
- Transient: disappears without repair
- Intermittent: effect not always present
- Permanent: effect always present
- S/W and H/W design errors
- Operator errors
- Externally induced damages
14Types of fault
- How a system should be made fault tolerant depends on
- the type of fault we want to tolerate
- the number of faults we want to tolerate
16Why have fault tolerance?
- Novice users
- Increasing repair costs
- Larger systems
- Digital systems more prevalent
- More users are more dependent on digital systems
17Why have fault tolerance?
Some related terms
- Fault latency: a fault can go undetected (it does not cause an error)
- Fault avoidance: high-quality components and careful design are used to avoid the occurrence of faults
- Graceful degradation: the system performs with degraded but correct performance after the occurrence of faults
19How is fault tolerance obtained?
- Masking tolerance
- Non-Masking tolerance
- Fail Safe
20How is fault tolerance obtained?
Masking tolerance: a structural redundancy technique that completely masks faults within a set of redundant modules, e.g.
- Triple Modular Redundancy
- Hybrid Redundancy
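Triple Modular Redundancy can be sketched as three replicas of a module feeding a majority voter, so that a single faulty replica is masked. This is a minimal sketch: `tmr_vote`, `run_tmr`, and the `good` and `faulty` modules are hypothetical names, and it assumes module outputs are hashable values.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def tmr_vote(outputs: Sequence[T]) -> T:
    """Return the value produced by at least two of the three modules."""
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three module outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        # More than one module is faulty: the fault can no longer be masked.
        raise RuntimeError("no majority among module outputs")
    return value


def run_tmr(modules: Sequence[Callable[[int], int]], x: int) -> int:
    """Feed the same input to all three modules and vote on the results."""
    return tmr_vote([module(x) for module in modules])


# Hypothetical modules: two healthy replicas and one faulty replica.
def good(x: int) -> int:
    return x * x


def faulty(x: int) -> int:
    return x * x + 1


result = run_tmr([good, faulty, good], 4)
```

The voter masks one faulty replica. When all three outputs differ it raises instead of guessing, which marks the limit of what this form of masking can hide.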
21How is fault tolerance obtained?
Non-masking tolerance: after a failure, the system passes through an unstable state before it returns to stability, e.g.
- Forward Recovery Systems
- Backward Recovery Systems
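Backward recovery can be illustrated with a checkpoint and rollback: the component briefly holds an erroneous state, then returns to the last state known to be good. The class below is a toy sketch. The `CheckpointedCounter` name and the rule that a negative total counts as an error are assumptions made for illustration.

```python
import copy


class CheckpointedCounter:
    """Toy component that supports backward recovery via checkpoints."""

    def __init__(self) -> None:
        self.state = {"total": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self) -> None:
        """Save the current state as the last known good state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Restore the last known good state."""
        self.state = copy.deepcopy(self._checkpoint)

    def add(self, value: int) -> bool:
        """Apply an update; roll back if the result is erroneous."""
        self.state["total"] += value
        # Hypothetical error check: a negative total is treated as an error.
        if self.state["total"] < 0:
            self.rollback()
            return False
        self.checkpoint()
        return True


counter = CheckpointedCounter()
counter.add(5)
ok = counter.add(-10)  # drives the total negative, so the update is undone
```

Between the update and the rollback the component is in an incorrect state, which is the difference from masking tolerance.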
22How is fault tolerance obtained?
Fail safe: the system can fail, but only into a safe state, to avoid catastrophes. Liveness is compromised.
E.g. a common fail-safe system is the pilot-light sensor in most gas furnaces. If the pilot light is cold, a mechanical arrangement disengages the gas valve, so that the house cannot fill with unburned gas.
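The pilot-light example can be mimicked in software as a controller that closes the gas valve whenever the pilot is not positively known to be lit. This is only a sketch of the idea: the `Furnace` class, its `step` method, and the use of `None` for a missing sensor reading are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Furnace:
    """Toy furnace controller that fails toward a closed gas valve."""

    gas_valve_open: bool = False

    def step(self, pilot_lit: Optional[bool], heat_requested: bool) -> bool:
        """Update the valve and return whether it is open."""
        if pilot_lit is not True:
            # Fail-safe: a cold pilot or a missing reading (None) closes the
            # valve, even though heat was requested.
            self.gas_valve_open = False
        else:
            self.gas_valve_open = bool(heat_requested)
        return self.gas_valve_open


furnace = Furnace()
heating = furnace.step(pilot_lit=True, heat_requested=True)
after_fault = furnace.step(pilot_lit=False, heat_requested=True)
after_sensor_loss = furnace.step(pilot_lit=None, heat_requested=True)
```

After the fault the house gets no heat, so liveness is given up, but the valve never stays open without a confirmed flame.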
23How is fault tolerance obtained?
[Figure: diagram relating masking, non-masking, and fail-safe tolerance to state regions labeled "perfect", "safe", and "unsafe"]
25Motivation
Effectiveness and Drawbacks of the existing models
26Motivation
Terminologies
- Reliability
- Availability
- Maintainability
- Mean-time-to-failure (MTTF)
- Mean-time-to-repair (MTTR)
- Mean-time-between-failure (MTBF)
- Fault Detection
- Fault Containment
- Fault Diagnosis
- Recovery
- Safety Critical Systems
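Several of these terms are linked by simple formulas. Under the common convention for repairable systems, MTBF = MTTF + MTTR, and steady-state availability = MTTF / (MTTF + MTTR). The sketch below assumes that convention. The function names and the numbers are illustrative, not taken from the talk.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures, taken as MTTF + MTTR."""
    return mttf_hours + mttr_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Illustrative numbers: the system runs 990 hours on average before failing
# and takes 10 hours on average to repair.
example_mtbf = mtbf(990.0, 10.0)
example_availability = availability(990.0, 10.0)
```

With these numbers the system is up 99% of the time. Shortening the repair time raises availability without changing how often the system fails.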
27Motivation
Effectiveness and Drawbacks of the existing models
- Achieving masking tolerance sometimes proves to be very expensive.
- Effectiveness depends on the MTBF and the life of the system.
- Non-masking tolerance does not prevent the system from functioning erroneously.
- The above temporary solutions make the system internally weaker.
- To keep the system safe in a fail-safe system, fault detection should be efficient.
- What happens when even fail-safe can't be guaranteed?
29Related Work
- Dijkstra proposed a self-stabilizing system, which laid the foundation for non-masking fault-tolerant systems. Later, quite a few papers improved on these types of systems.
- Arora and Kulkarni developed a generalized theoretical model for detecting and correcting faults.
- Ghosh proposed a new fail-safe model that points out the limitations of the previous two fault-tolerant models.
31Contribution
- Present a formalization of the system
- Present methodologies to implement the proposed fail-safe system
- Try to employ distributed fault detectors wherever possible
- Argue that fail-safe can't be guaranteed in all situations
- Where the above tolerance can't be guaranteed, we propose a weaker, cheaper alternative
32Contribution
Formalization: Safety Margin
[Figure: diagram with regions labeled "No failures", "Safety Margin", and "Failure"]
33Contribution
Implementation
- Whenever a fault occurs that does not perturb the system's normal functioning, instead of masking it, we can raise an alarm before stopping the system.
- What type of fault is to be detected?
- What should be the degree of tolerance?
34Contribution
Why distributed detector?
- We can't rely on a single, central detector: what if it fails?
- More than one detector is desired.
- At least one node should detect the fault to raise the alarm.
- The alarm will then propagate through the system.
- MTBF plays an important role here.
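Alarm propagation from distributed detectors can be sketched as flooding over the network graph: every node that detects the fault raises the alarm, and each alarmed node passes it on to its neighbors. The adjacency-list representation, the `propagate_alarm` function, and the example network are assumptions made for illustration.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def propagate_alarm(neighbors: Dict[str, List[str]],
                    detecting_nodes: Iterable[str]) -> Set[str]:
    """Flood an alarm from every detecting node and return the alarmed nodes."""
    alarmed = set(detecting_nodes)
    queue = deque(alarmed)
    while queue:
        node = queue.popleft()
        for peer in neighbors.get(node, ()):
            if peer not in alarmed:
                alarmed.add(peer)
                queue.append(peer)
    return alarmed


# Hypothetical network: a chain a-b-c-d plus an isolated node e.
network = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"], "e": []}
alarmed = propagate_alarm(network, ["c"])
```

One detecting node is enough to alarm everything connected to it. The isolated node `e` never hears the alarm, which hints at why connectivity matters.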
35Contribution
Safety Margin is Null
- Crash failures
- Certain networks with sparse connectivity
36Contribution
Weaker (Cheaper) Alternative
Dijkstra's ring topology: first algorithm
- There should be exactly one privilege at any instant
- The number of states per node (K) is always greater than the number of nodes
- The system stabilizes after a certain period
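Dijkstra's K-state ring can be simulated in a few lines. Machine 0 is privileged when its state equals that of the last machine, and it moves by incrementing its state modulo K. Every other machine is privileged when its state differs from its left neighbor's, and it moves by copying that neighbor. The sketch assumes a central daemon that lets one privileged machine move at a time. The function names, the ring size of 5, and K = 6 are illustrative choices.

```python
import random
from typing import List


def privileged(states: List[int]) -> List[int]:
    """Return the indices of the machines that currently hold a privilege."""
    n = len(states)
    holders = [0] if states[0] == states[n - 1] else []
    holders.extend(i for i in range(1, n) if states[i] != states[i - 1])
    return holders


def move(states: List[int], i: int, k: int) -> None:
    """Let privileged machine i make its move."""
    if i == 0:
        states[0] = (states[0] + 1) % k   # bottom machine increments mod K
    else:
        states[i] = states[i - 1]         # other machines copy the left neighbor


def stabilize(states: List[int], k: int, rng: random.Random,
              max_steps: int = 10_000) -> int:
    """Run until exactly one privilege remains; return the number of moves."""
    for step in range(max_steps):
        holders = privileged(states)
        if len(holders) == 1:
            return step
        # Central daemon: pick one privileged machine at random.
        move(states, rng.choice(holders), k)
    raise RuntimeError("did not stabilize within max_steps")


rng = random.Random(0)
ring = [rng.randrange(6) for _ in range(5)]   # arbitrary, possibly faulty start
moves_needed = stabilize(ring, 6, rng)
```

Once a single privilege remains, each further move hands it to the next machine around the ring, so the one-privilege property is preserved.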
37Contribution
Weaker (Cheaper) Alternative
- The system deadlocks (or goes to a pre-defined safe state)
- Eventually deadlocking: the effect of a failure can't be avoided
- Only two system states are needed
- A marker is introduced
39Future Work
- Implementation of the different types of tolerant models
- Making the methods more effective
- Idea of replication
- Reachability
40Questions?