Title: Review last class
1Review last class
- Consensus/Agreement with faulty processes
- Impossibility of consensus in asynchronous systems
- Networks, task graphs and scheduling in DS
2Today
- Fault Tolerance in DS
- Concepts
- Hardware
- Software
3Concepts
- What is Fault-Tolerance?
- A fault-tolerant system is one that continues to perform at the desired level of service in spite of failures in some of the components that constitute the system.
4Motivation
- Approaches to designing fault-tolerant computer systems
- Bottom-up: designing fault-tolerant components to integrate them into a fault-tolerant system
- Top-down: designing a fault-tolerant system using components with little or no fault tolerance
- Top-down is the most used approach
5Motivation (contd.)
- Challenge of fault-tolerant computing using the top-down approach
- Given that both hardware and software components are unreliable, how do we build reliable systems from these unreliable components?
- Not a new concept. First used by J. von Neumann (1956)
6Motivation (contd.)
- A fault-tolerant computing system may be able to tolerate one or more fault types, including
- transient, intermittent or permanent hardware faults,
- software and hardware design errors,
- operator errors, or
- externally induced upsets or physical damage.
7Concepts
- Intuitive concepts
- Reliability: continues to work
- Availability: works when I need it
- Safety: does not put me in jeopardy
- Performability: maintains the same performance in spite of failures
- Maintainability: does not take much time to repair
8Concepts (contd.)
- The two most common ways industry expresses a system's ability to tolerate failure are
- Reliability
- Availability
9Terminology and definitions
- MTTF: mean time to failure
- the expected time the system will operate before the first failure occurs (a system is replaced after a failure)
- MTTR: mean time to repair
- average time required to repair a system
- MTBF: mean time between failures
- average time between failures of a system (renewal situation: there is repair or replacement)
- $MTBF = MTTF + MTTR$
- $MTBF \approx MTTF$ when MTTR is small
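The relation between MTTF, MTTR and MTBF can be checked numerically. The sketch below is only illustrative: the function names and the sample figures (1000 h, 2 h) are assumptions, and the steady-state availability formula MTTF / (MTTF + MTTR) is the common textbook definition rather than something stated on the slide.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """MTBF = MTTF + MTTR (renewal situation: the system is repaired or replaced)."""
    return mttf_hours + mttr_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Assumed textbook definition: fraction of time the system is up."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


# Hypothetical figures: fails on average after 1000 h, repair takes 2 h.
print(mtbf(1000.0, 2.0))          # MTBF is close to MTTF because MTTR is small
print(availability(1000.0, 2.0))  # slightly below 1
```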
10Terminology and definitions
11Fault-Error-Failure concept
- Intuitive definitions
- Fault: an anomalous physical condition caused by a manufacturing problem, fatigue, external disturbance (intentional or unintentional), design flaw, etc.
- Error: effect of the activation of a fault
- Failure: overall system effect of an error
- Fault -> Error -> Failure
- Example: stuck-at bit (fault) -> incorrect data at ALU (error) -> incorrect balance, system crash (failure)
- Not all errors lead to failures!
12Fault Modeling (contd.)
- High-level failure models (process or system failure)
- General classification
- crash failure: a faulty processor or system stops permanently
- omission failure: a faulty process sometimes omits inputs/outputs, but when it works, it works correctly
- timing failure: inputs/outputs are delayed or arrive too early
- Byzantine failure (or arbitrary failure): a faulty processor can exhibit arbitrary behavior, including malicious behavior
13Reliability
- In reliability theory it is customary to assume a constant failure rate. We then typically express reliability $R(t)$, the probability of survival to time $t$, in the form
- $R(t) = e^{-\lambda t} = e^{-t/T}$
- where $T = MTBF = 1/\lambda$ and $\lambda$ is the failure rate
- Reliability = f(MTBF)
- Reliability is measured on a time interval, it is
also estimated as
14Failure Rate
- Bathtub curve
- The rate at which a component suffers faults
depends on its age, the ambient temperature, any
voltage or physical shocks that it suffers, and
the technology
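[Figure: bathtub curve of failure rate versus time. Labels: burn-in is used to avoid the early-failure zone (about 20 weeks); normal lifetime (5-25 years), with $\lambda$ constant and independent failures.]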
15Failure Rate
- Example for normal lifetime period
- If $\lambda$ is 25 per million hours, i.e. 0.000025 failures per hour, then for an 8-hour mission $R(8) = e^{-(0.000025)(8)} \approx 99.98\%$
- The system will complete an 8-hour mission 9,998 times out of 10,000
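The 8-hour mission figure can be reproduced in a few lines. This is a minimal sketch assuming the constant-failure-rate model $R(t) = e^{-\lambda t}$; the function name `reliability` is chosen here for illustration.

```python
import math


def reliability(failure_rate_per_hour: float, mission_hours: float) -> float:
    """Probability of surviving the mission under a constant failure rate."""
    return math.exp(-failure_rate_per_hour * mission_hours)


# 25 failures per million hours, 8-hour mission (values from the example above).
r = reliability(25e-6, 8)
print(f"R(8) = {r:.4%}")  # about 99.98%
```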
16Fault Tolerance and Reliability
- The effect of a fault-tolerant design on reliability can be expressed as
- $R_{sys} = P(\text{no fault}) + P(\text{correct operation} \mid \text{fault}) \, P(\text{fault})$
- $P(\text{no fault})$: maximized by fault-intolerant design (proofs of correct design, high-quality components)
- $P(\text{correct operation} \mid \text{fault})$: coverage of a fault-tolerant design over all possible faults
- $P(\text{fault})$: for cost effectiveness, fault-tolerant design should target the most likely faults
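To see how coverage affects system reliability, the expression can be evaluated directly. The sketch assumes $P(\text{no fault}) = 1 - P(\text{fault})$; the function name and the sample probabilities are hypothetical.

```python
def system_reliability(p_fault: float, coverage: float) -> float:
    """R_sys = P(no fault) + P(correct operation | fault) * P(fault)."""
    p_no_fault = 1.0 - p_fault  # assumption: a fault either occurs or it does not
    return p_no_fault + coverage * p_fault


# Hypothetical numbers: 1% chance of a fault, 90% of faults are tolerated.
print(system_reliability(0.01, 0.0))  # no fault tolerance at all
print(system_reliability(0.01, 0.9))  # fault tolerance with 90% coverage
```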
17Modeling
- Reliability Modeling
- System model, concentrating on reliability aspect
- Models
- Combinatorial Models
- Markov Models
18Modeling (contd.)
- Combinatorial Modeling
- Probabilistic techniques
- Express reliability of a system as a function of the reliability of its components
- Construction models
- series
- parallel
19Modeling (contd.)
- Series: all components must work correctly. No redundancy
- $R_t = R_1 R_2 R_3$
- Parallel: only one of the components must work correctly. High redundancy
- $(1 - R_t) = (1 - R_1)(1 - R_2)(1 - R_3)$
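The series and parallel formulas generalize to any number of components. A minimal sketch, assuming independent component failures; the function names and the value 0.9 are illustrative.

```python
from math import prod


def series_reliability(component_reliabilities):
    """All components must work: multiply the individual reliabilities."""
    return prod(component_reliabilities)


def parallel_reliability(component_reliabilities):
    """Only one component must work: the system fails only if all fail."""
    return 1.0 - prod(1.0 - r for r in component_reliabilities)


parts = [0.9, 0.9, 0.9]  # hypothetical component reliabilities
print(series_reliability(parts))    # lower than any single component
print(parallel_reliability(parts))  # higher than any single component
```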
20Modeling (contd.)
- Markov Models
- Many complex problems cannot be modeled easily in a combinatorial fashion
- Use Markov models (aka Markov chains)
- Repair is very difficult to model combinatorially
- Markov models can be applied to modeling reliability, availability, repair, etc.
21Modeling (contd.)
- Markov Models
- STATE: represents all that must be known to describe the system at a given instant in time
- E.g. for reliability: each state represents a distinct combination of faulty and fault-free modules (e.g. 101, where 1 = OK and 0 = fault)
- TRANSITION: changes of state that happen in the system
- Over time, as failures occur, the system goes from one state to another
- State changes are given probabilities (e.g. probability of failure)
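A small discrete-time Markov model makes the state and transition ideas concrete. The sketch below is built on assumptions not in the slides: three modules, no repair, each working module failing independently with probability `p_fail` per time step, and a system that counts as working while at least two modules are OK. States are tuples such as `(1, 0, 1)`.

```python
from itertools import product


def step(dist, p_fail):
    """Advance the probability distribution over states by one time step."""
    new_dist = {}
    for state, prob in dist.items():
        working = [i for i, ok in enumerate(state) if ok]
        # Each working module independently stays OK (1) or fails (0).
        for outcome in product((1, 0), repeat=len(working)):
            nxt = list(state)
            p = prob
            for i, ok in zip(working, outcome):
                nxt[i] = ok
                p *= (1.0 - p_fail) if ok else p_fail
            key = tuple(nxt)
            new_dist[key] = new_dist.get(key, 0.0) + p
    return new_dist


def tmr_reliability(p_fail, steps):
    """Probability that at least two of three modules are OK after `steps`."""
    dist = {(1, 1, 1): 1.0}  # start with all modules fault-free
    for _ in range(steps):
        dist = step(dist, p_fail)
    return sum(p for state, p in dist.items() if sum(state) >= 2)


print(tmr_reliability(0.1, 1))
print(tmr_reliability(0.1, 5))
```

Adding repair would mean adding transitions back from faulty to fault-free states, which is where this kind of model becomes more convenient than a combinatorial one.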
22Fundamental Principles
- Redundancy
- Addition of extra parts in a system's design to allow it to continue functioning as intended in spite of failures
- Providing redundancy is key in fault-tolerant computing
- Hardware redundancy
- Software Redundancy
- Time Redundancy
- Information Redundancy
23Information Redundancy
- Distinguish
- Data words: the actual information contents
- Code words: the transmitted information (redundant)
- Codes can be
- Separable: if the code word contains all original data bits plus additional check bits
- Non-separable: otherwise
- Example: ASCII coding of a single digit (separable)
- Data word: 9
- Code word: 0x39 (binary 0011 1001: the four data bits 1001 preceded by 0011)
24Information Redundancy
- A dataword with $d$ bits is encoded into a codeword with $c$ bits, where $c > d$
- Not all $2^c$ combinations are valid codewords
- If the $c$ bits are not a valid codeword, an error is detected
- Extra bits may be used to correct errors
- Overhead: time to encode and decode
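A single even-parity bit is perhaps the simplest instance of this scheme: $c = d + 1$, and only half of the $2^c$ bit patterns are valid codewords. The sketch below uses parity purely as an illustration (the slides do not name a specific code), and the function names are made up for the example.

```python
def encode_even_parity(data_bits):
    """Append one check bit so that the codeword has an even number of 1s."""
    parity = sum(data_bits) % 2
    return data_bits + [parity]


def is_valid_codeword(codeword):
    """A codeword is valid only if its number of 1s is even."""
    return sum(codeword) % 2 == 0


codeword = encode_even_parity([1, 0, 1, 1])
print(codeword, is_valid_codeword(codeword))

corrupted = codeword.copy()
corrupted[0] ^= 1  # flip one bit in transit
print(corrupted, is_valid_codeword(corrupted))  # the error is detected
```

This code only detects an odd number of flipped bits; correcting errors needs more check bits.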
25Information Redundancy
- Trade-off: more code bits give more error tolerance, but leave less bandwidth available for real information
26Data Communication
- Error-correcting codes provide reliable digital data transmission when the communication medium used has an unacceptable bit error rate (BER) and a low signal-to-noise ratio (SNR)
[Figure: two ECC encoder/decoder blocks connected by a communication channel subject to noise]
27Mathematical digression
- Theory of ECC is based on abstract algebra
- Fields
- informally, a field is a set in which we can add, subtract, multiply and divide (except by zero), obtaining a result that is also a member of the set
- the sets of real and complex numbers are examples of fields
- fields with a finite number of elements $q$ are called Galois fields $GF(q)$
28Galois Fields
- $GF(2)$
- Binary numbers 0, 1 with XOR as addition and AND as multiplication
- In $GF(2)$: subtraction = addition = XOR
- $GF(2^m)$ is a finite (Galois) field in which the number of elements is an integer power of 2
- Elements are represented as polynomials
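The arithmetic rules above map directly onto bitwise operators. In this sketch a polynomial with coefficients in GF(2) is stored as an integer bit mask (bit i holds the coefficient of $x^i$); that representation and the function names are assumptions for illustration. `poly_mul` is plain polynomial multiplication with coefficients in GF(2), without the reduction step that full $GF(2^m)$ multiplication would add.

```python
def gf2_add(a: int, b: int) -> int:
    """Addition (and subtraction) in GF(2) is XOR."""
    return a ^ b


def gf2_mul(a: int, b: int) -> int:
    """Multiplication in GF(2) is AND."""
    return a & b


def poly_add(p: int, q: int) -> int:
    """Add two polynomials over GF(2): XOR the coefficient bit masks."""
    return p ^ q


def poly_mul(p: int, q: int) -> int:
    """Multiply two polynomials over GF(2) (shift-and-XOR, no carries)."""
    result = 0
    while q:
        if q & 1:
            result ^= p
        p <<= 1
        q >>= 1
    return result


print(gf2_add(1, 1))                # 1 + 1 = 0 in GF(2)
print(bin(poly_mul(0b11, 0b11)))    # (x + 1)^2 = x^2 + 1, cross terms cancel
```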
29Cyclic Redundancy Code (CRC)
- Basic idea: both parties agree on a fixed number beforehand
- Treat the message as a large binary number,
- multiply it by $2^k$, where $k$ is the highest exponent of the fixed number (i.e. append $k$ zero bits), and
- divide it by that fixed binary number,
- make the remainder from this division the error-checking information
- send the message (multiplied number) plus the remainder
- Upon receipt of the message, the receiver can perform the same division and compare the remainder with the transmitted remainder
- CRC calculations are based on
- polynomial division
- arithmetic over $GF(2^m)$.
30CRC
- The divisor is the fixed number: the generator polynomial $G(x)$
- Given a message to be transmitted: $b_n b_{n-1} b_{n-2} \ldots b_2 b_1 b_0$
- View the bits of the message as the coefficients of a polynomial
- $M(x) = b_n x^n + b_{n-1} x^{n-1} + b_{n-2} x^{n-2} + \ldots + b_2 x^2 + b_1 x + b_0$
- Multiply the polynomial corresponding to the message by $x^k$, where $k$ is the degree of the generator polynomial, and then divide this product by the generator to obtain polynomials $Q(x)$ and $R(x)$ such that
- $x^k M(x) / G(x) = Q(x) + R(x)/G(x)$, i.e. $x^k M(x) = Q(x) \, G(x) + R(x)$
- All coefficients are treated not as integers but as integers modulo 2.
31CRC
- Finally, treat the coefficients of the remainder polynomial $R(x)$ as "parity bits". That is, append them to the message before actually transmitting it.
- Transmitted polynomial: $x^k M(x) + R(x)$
- When a message is received, the corresponding polynomial is divided by $G(x)$. If the remainder is non-zero, an error is detected. Otherwise, the message is assumed to be correct.
- $(x^k M(x) + R(x)) / G(x) = Q(x)$ with remainder 0 when no error occurred (modulo 2, $R(x) + R(x) = 0$)
32CRC Example
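- Suppose we want to send the short message 11010111 using the CRC with the polynomial $x^3 + x^2 + 1$ as our generator
- The message corresponds to the polynomial $x^7 + x^6 + x^4 + x^2 + x + 1$
- Given that $G(x)$ is of degree 3, we need to multiply this polynomial by $x^3$ and then divide the result, $x^{10} + x^9 + x^7 + x^5 + x^4 + x^3$, by $G(x)$
- Long division (modulo 2: $(x^i - x^i) \bmod 2 = 0$ and $(x^i + x^i) \bmod 2 = (2x^i) \bmod 2 = 0$)
- Quotient: $x^7 + x^2 + 1$
- $x^{10} + x^9 + x^7 + x^5 + x^4 + x^3$ minus $x^7 G(x) = x^{10} + x^9 + x^7$ leaves $x^5 + x^4 + x^3$
- $x^5 + x^4 + x^3$ minus $x^2 G(x) = x^5 + x^4 + x^2$ leaves $x^3 + x^2$
- $x^3 + x^2$ minus $G(x) = x^3 + x^2 + 1$ leaves 1
- The residue in modulo-2 arithmetic is 1; therefore the parity will be 001
- Codeword sent: $x^{10} + x^9 + x^7 + x^5 + x^4 + x^3 + 1$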
33CRC Example (cont)
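- At the receiver side, the message is divided again by the fixed generator polynomial
- Quotient: $x^7 + x^2 + 1$
- $x^{10} + x^9 + x^7 + x^5 + x^4 + x^3 + 1$ minus $x^{10} + x^9 + x^7$ leaves $x^5 + x^4 + x^3 + 1$
- $x^5 + x^4 + x^3 + 1$ minus $x^5 + x^4 + x^2$ leaves $x^3 + x^2 + 1$
- $x^3 + x^2 + 1$ minus $x^3 + x^2 + 1$ leaves 0
- Remainder = 0: no error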
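The worked example can be replayed in code. This is a minimal sketch of bit-by-bit CRC division using bit strings; the function names are chosen here, and it assumes a generator of degree at least 1 and a message at least as long as the generator. Real implementations are usually table-driven and work on bytes.

```python
def crc_remainder(bits: str, generator: str) -> str:
    """Remainder of bits(x) divided by generator(x), coefficients modulo 2."""
    k = len(generator) - 1  # degree of the generator polynomial
    work = [int(b) for b in bits]
    gen = [int(b) for b in generator]
    for i in range(len(work) - k):
        if work[i]:  # leading coefficient is 1: subtract (XOR) the generator
            for j, g in enumerate(gen):
                work[i + j] ^= g
    return "".join(str(b) for b in work[-k:])


def crc_encode(message: str, generator: str) -> str:
    """Multiply by x^k (append k zeros), then append the remainder as parity."""
    k = len(generator) - 1
    parity = crc_remainder(message + "0" * k, generator)
    return message + parity


def crc_check(codeword: str, generator: str) -> bool:
    """Receiver side: the codeword is accepted if the remainder is zero."""
    return set(crc_remainder(codeword, generator)) == {"0"}


G = "1101"  # x^3 + x^2 + 1
sent = crc_encode("11010111", G)
print(sent)                          # message followed by parity 001
print(crc_check(sent, G))            # no error
print(crc_check("11010111101", G))   # one flipped bit is detected
```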
34Hardware Redundancy
- Passive redundancy techniques
- fault masking
- Active redundancy techniques
- detection, localization, containment, recovery
- Hybrid redundancy techniques
- static + dynamic
- fault masking + reconfiguration
35Passive Hardware Redundancy
- Triple modular redundancy (TMR)
- 3 active components, fault masking by voter
- Problem: voter is a single point of failure
36Passive Hardware Redundancy
- Generalization of TMR (more than 3 modules, e.g. 5MR)
- In general, NMR (always an odd number of modules)
- Voting can be done on digital or analog data
- Application: temperature measurement
- Method: take 3 measurements, compute the median value
- Example
- Sensor 1: 99 C
- Sensor 2: 100 C
- Sensor 3: 45,217 C <- error: discard outlier!
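Median voting on the three sensor readings can be sketched as follows. The function name `vote` is hypothetical; the readings are the ones from the example.

```python
from statistics import median


def vote(readings):
    """Mask a single faulty reading by taking the median of the values."""
    return median(readings)


readings_celsius = [99, 100, 45217]  # sensor 3 is faulty
print(vote(readings_celsius))        # the outlier does not affect the result
```

With three sensors the median tolerates one arbitrary reading; it does not help if two sensors fail in the same direction.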
37TMR with Triplicated Voters
38Cascading TMR modules
- Examples
- JPL STAR (Self-Testing And Repairing computer)
- FAA WAAS (Wide Area Augmentation System)
39Passive Hardware Redundancy
- Hardware realization of 1-bit majority voting
- $F = ab + ac + bc$ (2 gate delays)
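The majority function has a direct software counterpart using bitwise AND and OR. This is a sketch for illustration; applied to integers it votes on every bit position independently.

```python
def majority(a: int, b: int, c: int) -> int:
    """F = ab + ac + bc, evaluated bitwise."""
    return (a & b) | (a & c) | (b & c)


# Truth table for single bits.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            print(a, b, c, "->", majority(a, b, c))

# Word-wide voting: the third module has two corrupted bits.
print(bin(majority(0b1010, 0b1010, 0b0110)))
```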
40Extending TMR
- TMR handles: processor fault, voter fault, memory fault, bus fault
- System has no single point of failure
- This approach is implemented in the Tandem Integrity system
41Software analogies to TMR
- N-version programming
- The same specification is implemented in a number of different versions by different teams. All versions compute simultaneously and the majority output is selected using a voting system.
- This is the most commonly used approach, e.g. in many models of the Airbus commercial aircraft.
- Recovery blocks
- A number of explicitly different versions of the same specification are written and executed in sequence.
- An acceptance test is used to select the output to be transmitted.
42N-version programming
43Output comparison
- As in hardware systems, the output comparator is a simple piece of software that uses a voting mechanism to select the output.
- In real-time systems, there may be a requirement that the results from the different versions are all produced within a certain time frame.
44N-version programming
- The different system versions are designed and implemented by different teams. It is assumed that there is a low probability that they will make the same mistakes. The algorithms used should be different, but may not be.
- There is some empirical evidence that teams commonly misinterpret specifications in the same way and choose the same algorithms in their systems.
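A software output comparator for N-version programming can be sketched as a majority vote over the version outputs. Everything in the example is hypothetical: the three "versions" are toy functions (one deliberately wrong), outputs are assumed to be hashable, and the versions are called one after another here rather than running simultaneously.

```python
from collections import Counter


def n_version_vote(versions, argument):
    """Run every version and return the output produced by a strict majority."""
    outputs = [version(argument) for version in versions]
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority among version outputs")
    return value


# Three hypothetical implementations of "square a number".
def version_a(x):
    return x * x


def version_b(x):
    return x ** 2


def version_c(x):
    return x + x  # faulty version


print(n_version_vote([version_a, version_b, version_c], 3))  # faulty output is outvoted
```

If two teams make the same mistake, the vote selects the wrong answer, which is why the independence assumption matters.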
45Fault Tolerant Techniques in Software
- Checkpoints and rollbacks
- Application state is saved at a checkpoint. Rollback restarts execution from a previous checkpoint
- Recovery blocks
- Alternates (secondary modules that perform the same function as a primary module) are executed when the primary fails to pass an acceptance test
46Recovery blocks
47Recovery blocks
- These force a different algorithm to be used for each version, so they reduce the probability of common errors.
- However, the design of the acceptance test is difficult, as it must be independent of the computation used.
- There are problems with this approach for real-time systems because of the sequential operation of the redundant versions.
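The recovery-block pattern can be sketched with a checkpoint, a list of alternates tried in sequence, and an acceptance test. The names below (`recovery_block`, `faulty_primary`, `insertion_sort`, `is_sorted_permutation`) are hypothetical, and sorting is used only as a convenient example because its result can be checked without re-running a sort.

```python
import copy
from collections import Counter


def recovery_block(alternates, acceptance_test, data):
    """Try each alternate in order until one passes the acceptance test."""
    for alternate in alternates:
        working_copy = copy.deepcopy(data)  # checkpoint: start from saved state
        try:
            result = alternate(working_copy)
        except Exception:
            continue  # a crash counts as a failed alternate
        if acceptance_test(data, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def is_sorted_permutation(original, result):
    """Acceptance test: output is ordered and contains the same items."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and Counter(original) == Counter(result)


def faulty_primary(items):
    return sorted(items)[:-1]  # deliberately wrong: loses the largest item


def insertion_sort(items):
    out = []
    for x in items:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out


print(recovery_block([faulty_primary, insertion_sort],
                     is_sorted_permutation, [3, 1, 2]))
```

The alternates run one after another, so the worst-case response time is the sum of all of them, which is the real-time problem noted above.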
48Problems with design diversity
- Teams are not culturally diverse, so they tend to tackle problems in the same way.
- Characteristic errors
- Different teams make the same mistakes. Some parts of an implementation are more difficult than others, so all teams tend to make mistakes in the same place
- Specification errors
- If there is an error in the specification, then this is reflected in all implementations
- This can be addressed to some extent by using multiple specification representations.
49Summary of FTC Techniques
- A summary chart of all techniques
50End-to-end Argument in Design of Distributed
Systems
- In 1981, Saltzer et al. argued that reliable systems tend to require end-to-end processing to operate correctly, in addition to any processing in the intermediate system.
- They demonstrated (by argumentation) that the end-to-end processing alone would suffice to make the system operate, and that the intermediate processing stages are largely redundant.
- Therefore, much intermediate processing can be made simpler, relying on the end-to-end processing to make the system work.
51End-to-end Argument in Design of Distributed
Systems
- Example
- TCP/IP protocol stack. IP is a dumb, stateless protocol that simply moves datagrams across the network, and TCP is a smart end-to-end protocol operating between the client computers
- Intermediate levels of processing may make the system more efficient. However, they are not essential. In contrast, end-to-end processing is essential
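The end-to-end idea can be illustrated with a toy transfer in which only the endpoints check integrity. This sketch is not how TCP works internally: the link, the retry loop and the names (`make_flaky_link`, `send_end_to_end`) are hypothetical, the payload is assumed to be non-empty, and the digest is assumed to reach the receiver intact.

```python
import hashlib


def make_flaky_link(corrupt_first_n: int):
    """Simulated intermediate hop that corrupts the first N transmissions."""
    state = {"sent": 0}

    def link(payload: bytes) -> bytes:
        state["sent"] += 1
        if state["sent"] <= corrupt_first_n:
            return payload[:-1] + bytes([payload[-1] ^ 0xFF])  # damage last byte
        return payload

    return link


def send_end_to_end(data: bytes, link, max_attempts: int = 5):
    """Endpoints compare digests and retransmit until the data arrives intact."""
    expected = hashlib.sha256(data).digest()  # computed by the sender
    for attempt in range(1, max_attempts + 1):
        received = link(data)
        if hashlib.sha256(received).digest() == expected:  # checked by receiver
            return received, attempt
    raise RuntimeError("transfer failed after retries")


link = make_flaky_link(corrupt_first_n=2)
print(send_end_to_end(b"hello", link))  # succeeds on the third attempt
```

The link does no checking of its own, yet the transfer is still correct; checks inside the link would only reduce the number of retries.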