Review last class - PowerPoint PPT Presentation

Provided by: kew67

Transcript and Presenter's Notes

1
Review last class

Consensus/Agreement with faulty processes
Impossibility of consensus in asynchronous
systems
Networks, task graphs and scheduling in DS

2
Today

Fault Tolerance in DS
Concepts
Hardware
Software

3
Concepts

What is Fault-Tolerance?
A fault-tolerant system is one that continues
to perform at desired level of service in spite
of failures in some components that constitute
the system.

4
Motivation

Approaches to designing fault tolerant computer
systems
Bottom-up: designing fault tolerant components
to integrate them into a fault tolerant system
Top-down: designing a fault tolerant system
using components with little or no fault
tolerance
Top-down is the most used approach

5
Motivation (contd.)

Challenge of Fault Tolerant Computing using the
top-down approach
Given that both hardware and software components
are unreliable, how do we build reliable systems
from these unreliable components?

Not a new concept. First addressed by J. von Neumann
in 1956

6
Motivation (contd.)

A fault-tolerant computing system may be able to
tolerate one or more fault types, including
transient, intermittent or permanent hardware faults,
software and hardware design errors,
operator errors, or
externally induced upsets or physical damage.

7
Concepts

Intuitive concepts
Reliability: continues to work
Availability: works when I need it
Safety: does not put me in jeopardy
Performability: maintains same performance in
spite of failures
Maintainability: does not take much time to repair

8
Concepts (contd.)

The two most common ways industry expresses a
system's ability to tolerate failure are
Reliability
Availability

9
Terminology and definitions

MTTF: mean time to failure
the expected time the system will operate before
the first failure occurs (a system is replaced
after a failure).
MTTR: mean time to repair
average time required to repair a system
MTBF: mean time between failures
average time between failures of a system
(renewal situation: there is repair or
replacement)
MTBF = MTTF + MTTR
MTBF ≈ MTTF when MTTR is small

10
Terminology and definitions
11
Fault-Error-Failure concept

Intuitive definitions
Fault
An anomalous physical condition caused by a
manufacturing problem, fatigue, external
disturbance (intentional or un-intentional),
design flaw,
Error - effect of activation of a fault
Failure - overall system effect of an error
Fault -> Error -> Failure

Example
Fault: bit stuck at
Error: incorrect data at ALU
Failure: incorrect balance, system crash
Not all errors lead to failures!!
12
Fault Modeling (contd.)

High-level failure models (process or system
failure)
General classification
crash failure - a faulty processor or system
stops permanently
omission failure - a faulty process omits
inputs/outputs sometimes, but when it works, it
works correctly
timing failure - inputs/outputs are delayed or
arrive too early
Byzantine failure (or arbitrary failure) - a
faulty processor can exhibit arbitrary behavior
including malicious nature

13
Reliability

In reliability theory it is customary to assume a
constant failure rate. Then we typically
express reliability R, the probability of
survival to time t, in the form
$R(t) = e^{-\lambda t} = e^{-t/T}$
where T = MTBF = 1/λ and λ is the failure rate
Reliability = f(MTBF)
Reliability is measured on a time interval, it is
also estimated as

14
Failure Rate

Bathtub curve
The rate at which a component suffers faults
depends on its age, the ambient temperature, any
voltage or physical shocks that it suffers, and
the technology

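[Figure: bathtub curve of failure rate versus component age. An early-failure zone of about 20 weeks (burn-in is used to avoid this zone) is followed by a normal lifetime of 5-25 years, during which λ is constant and failures are independent.]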
15
Failure Rate

Example for normal lifetime period

If λ is 25 per million hours, i.e. 0.000025
failures per hour, then for an 8-hour mission
$R(8) = e^{-(0.000025)(8)} \approx 99.98\%$
The system will complete an 8-hour mission
9,998 times out of 10,000
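
The arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch under the constant-failure-rate assumption stated on the Reliability slide; the function name `reliability` is illustrative and not part of the original slides.

```python
import math


def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """R(t) = exp(-lambda * t): probability of surviving `hours`
    when the failure rate is constant."""
    return math.exp(-failure_rate_per_hour * hours)


# Values taken from the slide: 25 failures per million hours, 8-hour mission.
r = reliability(25e-6, 8)
print(f"R(8) = {r:.4%}")
```

The result rounds to 99.98%, which matches the figure quoted on the slide.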
16
Fault Tolerance and Reliability

The effect of a fault tolerant design on
reliability can be expressed as
$R_{sys} = P(\text{no fault}) + P(\text{correct operation} \mid \text{fault}) \, P(\text{fault})$

P(no fault): maximized by fault intolerant design
(proofs of correct design, high quality components)
P(correct operation | fault): coverage of a fault
tolerant design over all possible faults
For cost effectiveness, fault tolerant design
should target the most likely faults
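
As a rough illustration of how coverage enters this expression, the sketch below evaluates the system reliability for a given fault probability and coverage. The function name and the sample numbers are assumptions chosen for illustration; they do not come from the slides.

```python
def system_reliability(p_fault: float, coverage: float) -> float:
    """R_sys = P(no fault) + P(correct operation | fault) * P(fault).

    `coverage` stands for P(correct operation | fault).
    """
    p_no_fault = 1.0 - p_fault
    return p_no_fault + coverage * p_fault


# Hypothetical numbers: a fault occurs with probability 0.1.
print(system_reliability(0.1, 0.0))  # no fault tolerance at all
print(system_reliability(0.1, 0.9))  # 90% of faults are tolerated
```

With zero coverage the system is only as reliable as the probability of having no fault; raising the coverage recovers part of the remaining probability mass.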
17
Modeling

Reliability Modeling
System model, concentrating on reliability aspect
Models
Combinatorial Models
Markov Models

18
Modeling (contd.)

Combinatorial Modeling
Probabilistic techniques
Express reliability of a system as a function
of reliability of its components
Construction models
series
parallel

19
Modeling (contd.)

Combinatorial Modeling

Series: all components must work correctly. No
redundancy
$R_t = R_1 R_2 R_3$
Parallel: only one of the components must work
correctly. High redundancy
$(1 - R_t) = (1 - R_1)(1 - R_2)(1 - R_3)$
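
The series and parallel formulas can be sketched in Python as follows. The code assumes independent component failures, as combinatorial models do; the function names and the component reliability of 0.9 are illustrative only.

```python
from math import prod


def series_reliability(components):
    """All components must work: R_t is the product of the R_i."""
    return prod(components)


def parallel_reliability(components):
    """At least one component must work: 1 - R_t is the product of (1 - R_i)."""
    return 1.0 - prod(1.0 - r for r in components)


modules = [0.9, 0.9, 0.9]  # hypothetical component reliabilities
print(series_reliability(modules))
print(parallel_reliability(modules))
```

For three components of reliability 0.9, the series arrangement is less reliable than any single component, while the parallel arrangement is more reliable.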
20
Modeling (contd.)

Markov Models
Many complex problems cannot be modeled easily
in combinatorial fashion
Use Markov models (aka Markov chains)
Repair is very difficult to model combinatorially
Markov models can be applied to modeling
reliability, availability, repair etc.

21
Modeling (contd.)
Markov Models
STATE: represents all that must be known to
describe the system at a given instant in time
E.g. for reliability: each state represents a
distinct combination of faulty and fault-free
modules (e.g. 101, where 1 = OK and 0 = fault)
TRANSITION: changes of state that happen in the
system
Over time, as failures occur, the system goes
from one state to another
State changes are given probabilities (e.g.
prob. of failure, etc.)
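
A minimal sketch of such a model is shown below. It assumes discrete time steps, modules that fail independently with the same probability `p_fail` per step, and no repair; these assumptions and all function names are illustrative, not taken from the slides. States are strings such as "101" (1 = OK, 0 = fault), following the notation above.

```python
from itertools import product


def transition_prob(src: str, dst: str, p_fail: float) -> float:
    """Probability of moving from state `src` to state `dst` in one step."""
    prob = 1.0
    for s, d in zip(src, dst):
        if s == "1":
            # A working module either fails or keeps working.
            prob *= p_fail if d == "0" else 1.0 - p_fail
        elif d == "1":
            # Assumption: no repair, so a failed module never recovers.
            return 0.0
    return prob


def evolve(n_modules: int, p_fail: float, steps: int) -> dict:
    """State probabilities after `steps` steps, starting with all modules OK."""
    states = ["".join(bits) for bits in product("10", repeat=n_modules)]
    dist = {s: 0.0 for s in states}
    dist["1" * n_modules] = 1.0
    for _ in range(steps):
        dist = {
            dst: sum(dist[src] * transition_prob(src, dst, p_fail) for src in states)
            for dst in states
        }
    return dist


dist = evolve(n_modules=3, p_fail=0.1, steps=2)
for state, probability in sorted(dist.items(), reverse=True):
    print(state, round(probability, 6))
```

Adding repair would mean giving the 0 -> 1 transition a non-zero probability, which is the case that is hard to capture with purely combinatorial models.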
22
Fundamental Principles

Redundancy
Addition of extra parts in a system's design to
allow it to continue functioning as intended in
spite of failures
Providing redundancy is key in fault tolerant
computing
Hardware redundancy
Software Redundancy
Time Redundancy
Information Redundancy

23
Information Redundancy

Distinguish
Data words: the actual information contents
Code words: the transmitted information
(redundant)
Codes can be
Separable: if the code word contains all
original data bits plus additional check bits
Non-separable: otherwise
Example: ASCII coding of a single digit (separable)
Data word: 9
Code word: 0x39 (decimal 57); its low four bits, 1001, are the data word

24
Information Redundancy

A dataword with d bits is encoded into a codeword
with c bits, where c > d
Not all 2^c combinations are valid codewords
If the c bits are not a valid codeword, an error is
detected
Extra bits may be used to correct errors
Overhead: time to encode and decode
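
A single even-parity bit is perhaps the simplest separable code that fits this description: c = d + 1, and only half of the 2^c bit patterns are valid codewords. The sketch below is an illustrative example chosen here, not one given on the slides, and the function names are hypothetical.

```python
def encode_even_parity(data_bits: str) -> str:
    """Append one check bit so that the codeword has an even number of 1s."""
    parity = data_bits.count("1") % 2
    return data_bits + str(parity)


def is_valid_codeword(code_bits: str) -> bool:
    """A codeword is valid only if its number of 1s is even."""
    return code_bits.count("1") % 2 == 0


codeword = encode_even_parity("1011")
print(codeword, is_valid_codeword(codeword))

corrupted = "0" + codeword[1:]  # flip the first bit in transit
print(corrupted, is_valid_codeword(corrupted))
```

One parity bit detects any odd number of flipped bits but cannot locate or correct them; correction requires more check bits.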

25
Information Redundancy
Trade-off: more code bits give more error tolerance,
but leave less bandwidth available for real information
26
Data Communication

Error correcting codes provide reliable digital
data transmission when the communication medium
used has an unacceptable bit error rate (BER) and
a low signal-to-noise ratio (SNR)

[Figure: two ECC encoder/decoders connected by a channel subject to noise]
27
Mathematical digression

Theory of ECC is based on abstract algebra
Fields
informally, a field is a set in which we can add,
subtract, multiply and divide, obtaining a
result that is also a member of the set
the sets of real and complex numbers are examples of fields
fields with a finite number of elements q are
called Galois fields GF(q)

28
Galois Fields

GF(2)
Binary numbers 0, 1 with xor as addition and and
as multiplication
In GF(2), subtraction = addition = xor
GF(2^m) is a Galois (finite) field in which the
number of elements is an integer power of 2
Elements are represented as polynomials
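
The following sketch shows GF(2) arithmetic and how polynomials with GF(2) coefficients can be packed into integers, one bit per coefficient. The bit-packing convention and the function names are assumptions made for illustration. The product computed by `poly_mul` is a plain polynomial product; multiplication inside GF(2^m) would additionally reduce the result modulo an irreducible polynomial, which is not shown here.

```python
def gf2_add(a: int, b: int) -> int:
    """Addition in GF(2) is xor (and so is subtraction)."""
    return a ^ b


def gf2_mul(a: int, b: int) -> int:
    """Multiplication in GF(2) is logical and."""
    return a & b


def poly_add(p: int, q: int) -> int:
    """Add two polynomials over GF(2); bit i holds the coefficient of x^i."""
    return p ^ q


def poly_mul(p: int, q: int) -> int:
    """Multiply two polynomials over GF(2) by shift-and-xor."""
    result = 0
    while q:
        if q & 1:
            result ^= p
        p <<= 1
        q >>= 1
    return result


# (x + 1) * (x + 1) = x^2 + 2x + 1 = x^2 + 1 when coefficients are taken modulo 2
print(format(poly_mul(0b11, 0b11), "b"))
```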

29
Cyclic Redundancy Code (CRC)

Basic idea: both parties agree on a fixed number
(the divisor) beforehand
Treat the message as a large binary number,
multiply it by 2^k, where k is the highest
exponent (degree) of the fixed number, and
divide it by that fixed binary number,
make the remainder from this division the error
checking information
send the message (the multiplied number) plus the remainder
Upon receipt of the message, the receiver can
perform the same division and compare the
remainder with the transmitted remainder
CRC calculations are based on
polynomial division
arithmetic over GF(2) (coefficients modulo 2).

30
CRC

The divisor is the fixed number: the generator polynomial G(x)
Given a message to be transmitted $b_n b_{n-1} b_{n-2} \ldots b_2 b_1 b_0$
View the bits of the message as the coefficients
of a polynomial
$M(x) = b_n x^n + b_{n-1} x^{n-1} + b_{n-2} x^{n-2} + \cdots + b_2 x^2 + b_1 x + b_0$
Multiply the polynomial corresponding to the
message by $x^k$, where k is the degree of the
generator polynomial, and then divide this product
by the generator to obtain polynomials Q(x) and
R(x) such that
$x^k M(x) = Q(x) G(x) + R(x)$
treating all the coefficients not as integers but
as integers modulo 2.

31
CRC

Finally, treat the coefficients of the remainder
polynomial R(x) as "parity bits". That is,
append them to the message before actually
transmitting it:
$x^k M(x) + R(x)$
When a message is received, the corresponding
polynomial is divided by G(x). If the remainder
is non-zero, an error is detected. Otherwise, the
message is assumed to be correct:
$x^k M(x) + R(x) = Q(x) G(x) + R(x) + R(x) = Q(x) G(x)$
(remainder 0, because $R(x) + R(x) = 0$ modulo 2)

32
CRC Example

Suppose we want to send the short message
11010111 using the CRC with the polynomial
x^3 + x^2 + 1 as our generator
The message corresponds to the polynomial
x^7 + x^6 + x^4 + x^2 + x + 1
Given G(x) is of degree 3, we need to multiply
this polynomial by x^3 and then divide the result
(x^10 + x^9 + x^7 + x^5 + x^4 + x^3) by G(x)

                  x^7 + x^2 + 1
x^3 + x^2 + 1 )   x^10 + x^9 + x^7 + x^5 + x^4 + x^3
                  x^10 + x^9 + x^7
                                     x^5 + x^4 + x^3
                                     x^5 + x^4 + x^2
                                                 x^3 + x^2
                                                 x^3 + x^2 + 1
                                                             1

Note: (x^i - x^i) mod 2 = 0 and (x^i + x^i) mod 2 = (2x^i) mod 2 = 0
Remainder in modulo-2 arithmetic: 1. Therefore the
parity bits will be 001
Codeword sent: x^10 + x^9 + x^7 + x^5 + x^4 + x^3 + 1
33
CRC Example (cont)

At the receiver side, the message is divided again by
the fixed generator polynomial

                  x^7 + x^2 + 1
x^3 + x^2 + 1 )   x^10 + x^9 + x^7 + x^5 + x^4 + x^3 + 1
                  x^10 + x^9 + x^7
                                     x^5 + x^4 + x^3
                                     x^5 + x^4 + x^2
                                                 x^3 + x^2 + 1
                                                 x^3 + x^2 + 1
                                                             0

Remainder = 0: no error
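
The worked example can be checked with a short bitwise long division. This is a sketch, not a production CRC routine: messages and generators are held as Python integers (so leading zero bits of a message are not preserved), and the function names are illustrative.

```python
def crc_remainder(value: int, generator: int) -> int:
    """Remainder of value(x) divided by generator(x), coefficients modulo 2."""
    gen_len = generator.bit_length()
    while value.bit_length() >= gen_len:
        # Align the generator with the leading 1 of the value and subtract (xor).
        shift = value.bit_length() - gen_len
        value ^= generator << shift
    return value


def crc_encode(message: int, generator: int) -> int:
    """Multiply the message by x^k and append the k remainder (parity) bits."""
    k = generator.bit_length() - 1  # degree of the generator
    shifted = message << k
    return shifted | crc_remainder(shifted, generator)


message = 0b11010111   # x^7 + x^6 + x^4 + x^2 + x + 1
generator = 0b1101     # x^3 + x^2 + 1

codeword = crc_encode(message, generator)
print(format(codeword, "b"))               # 11010111001
print(crc_remainder(codeword, generator))  # 0, so no error is detected
```

Flipping any single bit of the codeword before calling `crc_remainder` yields a non-zero remainder, which is how the receiver detects the error.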
34
Hardware Redundancy

Passive redundancy techniques
fault masking
Active redundancy techniques
detection, localization, containment, recovery
Hybrid redundancy techniques
static + dynamic
fault masking + reconfiguration

35
Passive Hardware Redundancy

Triple modular redundancy (TMR)

3 active components, fault masking by voter
Problem: the voter is a single point of failure
36
Passive Hardware Redundancy

Generalization of TMR (more than 3 modules, e.g.
5MR)
In general NMR (N is always an odd number)
Voting can be done on digital or analog data
Application: temperature measurement
Method: take 3 measurements, compute the median
value
Example
Sensor 1: 99 C
Sensor 2: 100 C
Sensor 3: 45,217 C <- error: discard outlier!!
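
The median method from this example can be sketched as follows. The function name `median_vote` is illustrative; the readings are the ones from the slide.

```python
from statistics import median


def median_vote(readings):
    """Mask a faulty reading by taking the median of an odd number of readings."""
    if len(readings) % 2 == 0:
        raise ValueError("NMR voting expects an odd number of readings")
    return median(readings)


# Sensor readings from the slide; the third sensor is faulty.
print(median_vote([99, 100, 45217]))
```

The outlier never reaches the output because the median of three values is always one of the two values that agree most closely.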

37
TMR with Triplicated Voters
38
Cascading TMR modules

Examples
JPL STAR (Self-Testing And Repairing computer)
FAA WAAS (Wide Area Augmentation System)

39
Passive Hardware Redundancy

Hardware realization of 1 bit majority voting

F = ab + ac + bc (2 gate delays)
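
The same majority function can be written with bitwise operators, which also votes on every bit of a word in parallel. This software sketch only mirrors the logic equation; the function name is illustrative.

```python
def majority(a: int, b: int, c: int) -> int:
    """F = ab + ac + bc, applied bit by bit."""
    return (a & b) | (a & c) | (b & c)


# Truth table for single bits.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            print(a, b, c, "->", majority(a, b, c))

# Word-wide voting: the third module disagrees in two bit positions.
print(format(majority(0b1010, 0b1010, 0b0110), "04b"))
```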
40
Extending TMR
TMR handles
processor fault
voter fault
memory fault
bus fault
System has no single point of failure
This approach is implemented in the Tandem Integrity system
41
Software analogies to TMR

N-version programming
The same specification is implemented in a number
of different versions by different teams. All
versions compute simultaneously and the majority
output is selected using a voting system.
This is the most commonly used approach, e.g. in
many models of Airbus commercial aircraft.
Recovery blocks
A number of explicitly different versions of the
same specification are written and executed in
sequence.
An acceptance test is used to select the output
to be transmitted.

42
N-version programming
43
Output comparison

As in hardware systems, the output comparator is
a simple piece of software that uses a voting
mechanism to select the output.
In real-time systems, there may be a requirement
that the results from the different versions are
all produced within a certain time frame.
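
A minimal sketch of an output comparator with majority voting is given below. The three "versions" are toy functions invented for illustration, they run one after another rather than simultaneously, and timing requirements and crashes inside a version are not handled.

```python
from collections import Counter


def n_version_execute(versions, x):
    """Run every version on the same input and return the majority output."""
    outputs = [version(x) for version in versions]
    value, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority among version outputs")
    return value


# Hypothetical versions of the specification "return the square of x".
def square_a(x):
    return x * x


def square_b(x):
    return x ** 2


def square_faulty(x):
    return x + x  # deliberately wrong for most inputs


print(n_version_execute([square_a, square_b, square_faulty], 3))
```

The faulty version is outvoted as long as the other two agree, which is the same masking idea as in TMR.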

44
N-version programming

The different system versions are designed and
implemented by different teams. It is assumed
that there is a low probability that they will
make the same mistakes. The algorithms used
should be different, but may not be.
There is some empirical evidence that teams
commonly misinterpret specifications in the same
way and choose the same algorithms in their
systems.

45
Fault Tolerant Techniques in Software

Check points and roll backs
The application's state is saved at a checkpoint. A
rollback restarts execution from a previous
checkpoint
Recovery Blocks
Alternates - secondary modules that perform same
function of a primary module - are executed when
primary fails to pass an acceptance test
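
Checkpointing and rollback can be sketched with an in-memory copy of the state. The class name, the account example and the error condition are all assumptions for illustration; a real system would usually write checkpoints to stable storage.

```python
import copy


class CheckpointedState:
    """Holds application state and a stack of saved checkpoints."""

    def __init__(self, state):
        self.state = state
        self._checkpoints = []

    def checkpoint(self):
        # Deep copy so later changes do not alter the saved snapshot.
        self._checkpoints.append(copy.deepcopy(self.state))

    def rollback(self):
        # Restore the most recent checkpoint (raises IndexError if none exists).
        self.state = copy.deepcopy(self._checkpoints[-1])


account = CheckpointedState({"balance": 100})
account.checkpoint()
try:
    account.state["balance"] -= 150
    if account.state["balance"] < 0:
        raise ValueError("negative balance")
except ValueError:
    account.rollback()  # restart from the last checkpoint

print(account.state)
```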

46
Recovery blocks
47
Recovery blocks

These force a different algorithm to be used for
each version so they reduce the probability of
common errors.
However, the design of the acceptance test is
difficult as it must be independent of the
computation used.
There are problems with this approach for
real-time systems because of the sequential
operation of the redundant versions.
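
The recovery block scheme can be sketched as a loop over a primary module and its alternates. The sorting example, the module names and the acceptance test are invented for illustration; the sketch assumes that modules do not modify their input, so no state needs to be restored between attempts.

```python
def recovery_block(x, acceptance_test, primary, *alternates):
    """Run the primary, then each alternate in sequence, until one output
    passes the acceptance test."""
    for module in (primary, *alternates):
        try:
            result = module(x)
        except Exception:
            continue  # a crash counts as a failed attempt
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all modules failed the acceptance test")


def is_ordered(inp, out):
    """Acceptance test: checks a property of the output, not how it was computed."""
    return len(out) == len(inp) and all(a <= b for a, b in zip(out, out[1:]))


def buggy_sort(xs):
    return list(xs)  # hypothetical faulty primary: returns the input unsorted


print(recovery_block([3, 1, 2], is_ordered, buggy_sort, sorted))
```

Because alternates run only after the primary has failed, the worst-case response time is the sum of all attempts, which is the real-time concern raised above.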

48
Problems with design diversity

Teams are not culturally diverse so they tend to
tackle problems in the same way.
Characteristic errors
Different teams make the same mistakes. Some
parts of an implementation are more difficult
than others so all teams tend to make mistakes in
the same place
Specification errors
If there is an error in the specification then
this is reflected in all implementations
This can be addressed to some extent by using
multiple specification representations.

49
Summary of FTC Techniques

A summary chart of all techniques

50
End-to-end Argument in Design of Distributed
Systems

In 1981, Saltzer et al. argued that reliable systems
tend to require end-to-end processing to operate
correctly, in addition to any processing in the
intermediate system.
They demonstrated (by argumentation) that the
end-to-end processing alone would suffice to make
the system operate, and that the intermediate
processing stages are largely redundant.
Therefore, much intermediate processing can be
made simpler, relying on the end-to-end
processing to make the system work.

51
End-to-end Argument in Design of Distributed
Systems

Example
TCP/IP protocol stack. IP is a dumb, stateless
protocol that simply moves datagrams across the
network, and TCP is a smart end-to-end protocol
operating between the client computers
Intermediate levels of processing may make the
system more efficient. However, they are not
essential. In contrast, end-to-end processing is
essential
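
One way to picture the argument is a transfer in which only the endpoints check correctness. In the sketch below, the channel, the retry policy and all function names are hypothetical, and the digest is assumed to reach the receiver intact. The intermediate channel makes no reliability guarantee; the end-to-end check alone decides whether the data is accepted.

```python
import hashlib


def make_flaky_channel():
    """Hypothetical intermediate system: corrupts the first transmission only."""
    calls = {"n": 0}

    def channel(data: bytes) -> bytes:
        calls["n"] += 1
        if calls["n"] == 1:
            return bytes([data[0] ^ 0x01]) + data[1:]
        return data

    return channel


def transfer(payload: bytes, channel, max_attempts: int = 3):
    """Send, verify end to end with a digest, and retry on mismatch."""
    digest = hashlib.sha256(payload).hexdigest()  # computed by the sender
    for attempt in range(1, max_attempts + 1):
        delivered = channel(payload)
        # The receiver recomputes the digest over what actually arrived.
        if hashlib.sha256(delivered).hexdigest() == digest:
            return delivered, attempt
    raise RuntimeError("transfer failed after retries")


print(transfer(b"hello", make_flaky_channel()))
```

A more reliable channel would reduce the number of retries, which is the efficiency benefit of intermediate processing, but the end-to-end check is still what establishes correctness.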
