Title: Failure Mode Assumptions and Assumption Coverage
1Failure Mode Assumptions and Assumption Coverage
2Fault-Tolerance
- Key questions
- How may components fail?
- → Prevention strategies
- At what rate may they fail?
- → The amount of redundancy needed
- What are the important types of faults?
- → Types of redundancy needed
- What is the relation between dependability, redundancy and faults?
- → General FT design guidelines
3An F-T Paradox/Dilemma
- More faults
- → More redundancy
- → More possibility of faults
- → ???
4Solution- Some Key Steps
- Classify, quantify and verify the assumptions
5Type of Failures
6Overview
- Single-user service
- Service Model
- Potential Errors
- Multiple-user service
- Service Model
- Potential Errors
7Single-user Service Model
- Service items $s_i$, $i = 1, 2, \ldots$
- Value of $s_i$: $vs_i$
- Observation time of $s_i$: $ts_i$
- Service Model
- $s_i = \langle vs_i, ts_i \rangle$
- An omniscient observer
8Correctness Model
- Service item $s_i$ is correct iff
- $(vs_i \in SV_i) \wedge (ts_i \in ST_i)$
- $SV_i$ and $ST_i$ are respectively the specified sets of values and times for service item $s_i$
9Potential Errors
- Arbitrary value error: $s_i : vs_i \notin SV_i$
- Noncode error: $s_i : vs_i \notin CV$ ($CV$ defines a code)
- Arbitrary timing error: $s_i : ts_i \notin ST_i$
- Early timing error: $s_i : ts_i < \min(ST_i)$
- Late timing error: $s_i : ts_i > \max(ST_i)$
- Omission error: $s_i : ts_i = \infty$
- Impromptu error si (vsi ?) ? (tsi ?)
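The single-user error classes above can be sketched as a small classifier. This is only an illustration under stated assumptions: the names `Spec`, `value_error` and `timing_error` are hypothetical, the specified time set ST_i is modelled as a closed interval, and an omission is represented by an infinite observation time.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Specification of one service item (hypothetical helper type)."""
    values: frozenset  # SV_i: the specified values
    t_min: float       # min(ST_i)
    t_max: float       # max(ST_i); ST_i is assumed to be an interval


def value_error(vs, spec, code=None):
    """Return None for a correct value, else the kind of value error."""
    if vs in spec.values:
        return None
    # A value outside the code CV is a (detectable) noncode error.
    if code is not None and vs not in code:
        return "noncode"
    return "arbitrary"


def timing_error(ts, spec):
    """Return None for a correct observation time, else the kind of timing error."""
    if spec.t_min <= ts <= spec.t_max:
        return None
    if ts == math.inf:  # assumption: an omitted item is never observed
        return "omission"
    return "early" if ts < spec.t_min else "late"


def is_correct(vs, ts, spec):
    """A service item is correct iff both its value and its time are as specified."""
    return value_error(vs, spec) is None and timing_error(ts, spec) is None
```

For example, with `Spec(frozenset({1, 2}), 10.0, 20.0)` a value of 1 observed at time 25.0 has no value error but a late timing error.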
10Multi-user Service Model
- Service item $s_i = \{s_i(1), s_i(2), \ldots, s_i(n)\}$
- Service model: $\langle vs_i(u), ts_i(u) \rangle$, for all $i, u$
- New issue: consistency
11Correctness Model
- $vs_i(u)$: the value of service item $i$ on process $u$
- $vs_i$: the value of service item $i$
- $SV_i$: the set of specified values of service item $i$
- $ts_i(u)$: the observation time of service item $i$ on process $u$
- $ST_i(u)$: the range of specified observation times of service item $i$ on process $u$
- ?uv: the time bound of related occurrences
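One plausible reading of the multi-user correctness model is sketched below. The slide lists the ingredients but not the full condition, so the conjunction used here (all users see the same specified value, each within its own time range, and any two observations within the pairwise time bound) is an assumption. The function name `multi_user_correct` and the callable `bound(u, v)` are hypothetical.

```python
from itertools import combinations


def multi_user_correct(values, times, spec_values, spec_times, bound):
    """Check one service item as seen by several users.

    values:      dict user -> observed value vs_i(u)
    times:       dict user -> observation time ts_i(u)
    spec_values: set of specified values SV_i
    spec_times:  dict user -> (lo, hi), the specified range ST_i(u)
    bound:       callable (u, v) -> maximum allowed time difference (assumed form)
    """
    users = list(values)
    first = values[users[0]]
    # Consistency in value: every user sees the same, specified, value.
    value_ok = first in spec_values and all(values[u] == first for u in users)
    # Each user's observation time lies in that user's specified range.
    time_ok = all(spec_times[u][0] <= times[u] <= spec_times[u][1] for u in users)
    # Consistency in time: related occurrences are close enough pairwise.
    skew_ok = all(abs(times[u] - times[v]) <= bound(u, v)
                  for u, v in combinations(users, 2))
    return value_ok and time_ok and skew_ok
```

A delivery in which one user sees a different value from the others fails the first check even if every individual value is in the specified set, which is the consistency issue the single-user model cannot express.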
12Examples of Potential Errors
- Consistent value error
- Consistent timing error
- Semi-consistent value error
13Failure Mode Assumptions
- Attempt to formalize the concept of an assumed failure mode
- By assertions on the sequences of service items delivered by a component
14Examples of Value Error Assertions
- No value errors occur (Vnone)
- $\forall i,\ vs_i \in SV_i$
- The only value errors that occur are noncode value errors (Vn)
- $\forall i,\ (vs_i \in SV_i) \vee (vs_i \notin CV)$
- Arbitrary value errors can occur (Varb)
- $\forall i,\ (vs_i \in SV_i) \vee (vs_i \notin SV_i)$
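The three value error assertions can be read as predicates over a finite sequence of delivered values. The sketch below assumes that reading; the function names are hypothetical, and `specs` holds one set SV_i per service item.

```python
def v_none(values, specs):
    """Vnone: every delivered value is in its specified set."""
    return all(v in sv for v, sv in zip(values, specs))


def v_n(values, specs, code):
    """Vn: every value is either correct or outside the code CV."""
    return all(v in sv or v not in code for v, sv in zip(values, specs))


def v_arb(values, specs):
    """Varb: every value is either correct or incorrect, so this always holds."""
    return all(v in sv or v not in sv for v, sv in zip(values, specs))
```

Any sequence that satisfies `v_none` also satisfies `v_n`, and any sequence at all satisfies `v_arb`, so the assertions are ordered from most to least restrictive.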
15Examples of Timing Error Assertions
- No timing error occurs (Tnone)
- The only timing errors are omission errors (TO)
- The only timing errors are late timing errors (TL)
- The only timing errors are early timing errors (TE)
- Arbitrary timing errors can occur (Tarb)
- Permanent omission/crash (Tp)
- Bounded omission degree (TBk)
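The slide names the timing error assertions without giving formulas, so the sketch below is one interpretation rather than the original definitions. Assumptions: each specified time set is an interval `(lo, hi)`, an omission is an infinite observation time, a crash means correct items followed by omissions forever, and bounded omission degree means at most `bound` consecutive omissions. All function names are hypothetical.

```python
import math


def kinds(times, specs):
    """Label each observation time as 'ok', 'early', 'late' or 'omission'."""
    out = []
    for ts, (lo, hi) in zip(times, specs):
        if lo <= ts <= hi:
            out.append("ok")
        elif ts == math.inf:
            out.append("omission")
        else:
            out.append("early" if ts < lo else "late")
    return out


def t_none(k):
    return all(x == "ok" for x in k)


def t_o(k):
    return all(x in ("ok", "omission") for x in k)


def t_l(k):
    # An omission (ts = inf) also satisfies ts > max(ST_i), so it counts as late.
    return all(x in ("ok", "late", "omission") for x in k)


def t_e(k):
    return all(x in ("ok", "early") for x in k)


def t_arb(k):
    return True


def t_p(k):
    """Crash: correct items up to some point, then omissions only."""
    if "omission" not in k:
        return t_none(k)
    first = k.index("omission")
    return t_none(k[:first]) and all(x == "omission" for x in k[first:])


def t_bk(k, bound):
    """Only omission errors, never more than `bound` of them in a row."""
    run = 0
    for x in k:
        if x == "omission":
            run += 1
            if run > bound:
                return False
        elif x == "ok":
            run = 0
        else:
            return False
    return True
```

Under these definitions a sequence satisfying `t_none` satisfies every other assertion, and one satisfying `t_bk` also satisfies `t_o`, `t_l` and `t_arb`.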
16Timing Error Implications
17Failure Mode Assertions(FMA)
- A complete FMA entails an assertion on errors occurring in both the value and time domains
- By taking the Cartesian product of the two domains, we get a family of FMAs
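The family of FMAs can be enumerated mechanically. The sketch below simply pairs the value assertions and timing assertions named on the earlier slides; the list names and the tuple representation are assumptions made for illustration.

```python
from itertools import product

# Assertion names as they appear on the slides.
VALUE_ASSERTIONS = ["Vnone", "Vn", "Varb"]
TIMING_ASSERTIONS = ["Tnone", "Tp", "TBk", "TO", "TL", "TE", "Tarb"]

# Each complete FMA is one (value assertion, timing assertion) pair.
FMAS = list(product(VALUE_ASSERTIONS, TIMING_ASSERTIONS))
```

With three value assertions and seven timing assertions this gives 21 candidate FMAs, ranging from `("Vnone", "Tnone")` (no errors at all) to `("Varb", "Tarb")` (anything can happen).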
18FMA Implication Graph
19So what?
- The FMA classification and implication graph can serve as a guideline for designing families of FT algorithms that process errors of increasing severity!
20Assumption Coverage
- Establishing a link between the assumed component failure mode and system dependability
- (The design of an FT system relies on the failure mode assumptions it makes)
- (The dependability of an FT system is related to the failure mode it assumes)
21Motivation
- Components may fail
- They may fail in a bad way → leads to a violation of the assumptions of the system
- The system, in turn, can fail
- Question: to what degree can a component FMA prove to be true in the real system?
22The Coverage of the Assumption
- Definition
- $P(X) = \Pr\{X \text{ true} \mid \text{component failed}\}$
- $P(\mathrm{Varb} \wedge \mathrm{Tarb}) = 1$
- $P(\mathrm{Vnone} \wedge \mathrm{Tnone}) = 0$
23Coverage of an FT system
- $P_S(X)$
- $\Pr\{\text{correct error processing} \mid X \text{ true}\}$
- $\Pr\{X \text{ true} \mid \text{component failed}\}$
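How the two conditional probabilities on this slide combine is not spelled out in the transcript. A common reading, assumed here, is that error processing cannot succeed when the assumption X is violated, so the overall coverage is the product of the two terms. The function name `system_coverage` is hypothetical.

```python
def system_coverage(p_processing_given_x, p_x):
    """Overall coverage under the assumption that error processing
    never succeeds when the failure mode assumption X does not hold.

    p_processing_given_x: Pr{correct error processing | X true}
    p_x:                  Pr{X true | component failed}, the assumption coverage
    """
    for p in (p_processing_given_x, p_x):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_processing_given_x * p_x
```

This makes the trade-off visible: a weak assumption such as arbitrary failures has assumption coverage 1 but is harder to process correctly, while a strong assumption is easy to process but may itself be violated.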
24Influence of Assumption Coverage on System Dependability
25The System
- A system of n processors
- Connected via a unidirectional message-passing bus
- Each processor carries out the same computation steps
- The result of each processing step is communicated to all other processors
- Each processor has a decision function (DF)
- The DF is applied to the results received from other processors
- Each processor and its associated bus is viewed as a single component
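The slide does not say which decision function is used, and the right choice depends on the assumed failure mode. As a purely hypothetical example, a strict majority vote over the results that actually arrived could look like this, with `None` standing for a message that was not received.

```python
from collections import Counter


def majority_df(results):
    """Hypothetical decision function: strict majority of received results.

    results: list of values received from the other processors,
             with None for a processor that sent nothing.
    Returns the majority value, or None if there is no strict majority.
    """
    received = [r for r in results if r is not None]
    if not received:
        return None
    value, count = Counter(received).most_common(1)[0]
    return value if count > len(received) / 2 else None
```

Under a fail-silent assumption a much simpler DF would do (take any received value), whereas a vote like this one is only needed when faulty processors may send erroneous values.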
26Fail-Silent Processor-bus
- A fail-silent processor
- Only has semi-consistent value errors
- Always produces messages on time
- Or ceases to produce messages forever
- If a message is delivered to one processor, it is delivered to all processors with a consistent fixed delay
27Fail-Consistent Processor Bus
- Only semi-consistent value errors may occur
- Faulty processors may send erroneous values
- Consistent timing error may occur
28Fail-uncontrolled Processor Bus
- Arbitrary timing error
- Arbitrary value error
29Implications of Assumption Coverage
- Failure mode relations
- Coverage relations
30Dependability Expressions From Markov Models
31A Life-critical Application
- System reliability objective: $R > 1 - 10^{-9}$ over 10 hours
- Single processor reliability
- $r = e^{-\lambda t}$
- $1/\lambda = 5$ years
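The numbers on this slide can be worked through directly. The sketch below computes the single-processor unreliability over a 10-hour mission and then uses a simplified combinatorial model, not the Markov models of the original analysis, to show how coverage limits what redundancy can achieve. Assumptions: a 365-day year, independent failures, a system that survives as long as at least one processor remains and every failure that occurred was covered. The function name `unreliability` is hypothetical.

```python
import math

HOURS_PER_YEAR = 365 * 24            # assumption: 365-day year
LAMBDA = 1 / (5 * HOURS_PER_YEAR)    # failure rate, 1/lambda = 5 years
MISSION_HOURS = 10.0
OBJECTIVE = 1e-9                     # allowed unreliability, 1 - R

# q = 1 - r = 1 - exp(-lambda * t), computed without cancellation.
Q = -math.expm1(-LAMBDA * MISSION_HOURS)
R_SINGLE = 1 - Q


def unreliability(n, coverage):
    """Simplified model: the system survives if fewer than n processors
    fail and each of those failures is covered (probability `coverage`)."""
    survive = sum(
        math.comb(n, k) * (Q * coverage) ** k * R_SINGLE ** (n - k)
        for k in range(n)
    )
    return 1 - survive
```

A single processor has an unreliability of roughly 2.3e-4 over 10 hours, far from the objective. In this simplified model three processors with perfect coverage meet the objective, but with a coverage of 0.99999 the uncovered failures alone push the unreliability above 1e-9.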
33A Money-Critical Application
- It is about the availability of the system rather than its reliability
- Please look at the paper for more details
34Unavailability vs. Coverage
35Conclusion
- A formalism for describing component failure modes
- Multiplicity of value and timing errors
- The notion of assumption coverage
- The relation between dependability, availability and assumption coverage
36Thank you