Review last class - PowerPoint PPT Presentation

Provided by: kew67

1
Review last class
  • Consensus/Agreement with faulty processes
  • Impossibility of consensus in asynchronous
    systems
  • Networks, task graphs and scheduling in DS

2
Today
  • Fault Tolerance in DS
  • Concepts
  • Hardware
  • Software

3
Concepts
  • What is Fault-Tolerance?
  • A fault-tolerant system is one that continues
    to perform at desired level of service in spite
    of failures in some components that constitute
    the system.

4
Motivation
  • Approaches to designing fault-tolerant computer
    systems
  • Bottom-up: design fault-tolerant components and
    integrate them into a fault-tolerant system
  • Top-down: design a fault-tolerant system using
    components with little or no fault tolerance
  • Top-down is the most widely used approach

5
Motivation (contd.)
  • Challenge of Fault Tolerant Computing using the
    top-down approach
  • Given that both hardware and software components
    are unreliable, how do we build reliable systems
    from these unreliable components?
  • Not a new concept: first used by J. von Neumann
    in 1956

6
Motivation (contd.)
  • A fault-tolerant computing system may be able to
    tolerate one or more fault-types including
  • transient, intermittent or permanent hardware
    faults,
  • software and hardware design errors,
  • operator errors, or
  • externally induced upsets or physical damage.

7
Concepts
  • Intuitive concepts
  • Reliability: continues to work
  • Availability: works when I need it
  • Safety: does not put me in jeopardy
  • Performability: maintains the same performance in
    spite of failures
  • Maintainability: does not take much time to repair

8
Concepts (contd.)
  • The two most common ways industry expresses a
    system's ability to tolerate failure are
  • Reliability
  • Availability

9
Terminology and definitions
  • MTTF: mean time to failure
  • the expected time the system will operate before
    the first failure occurs (a system is replaced
    after a failure).
  • MTTR: mean time to repair
  • average time required to repair a system
  • MTBF: mean time between failures
  • average time between failures of a system
    (renewal situation: there is repair or
    replacement)
  • MTBF = MTTF + MTTR
  • MTBF ≈ MTTF when MTTR is small
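The relation between the three quantities can be checked numerically. The sketch below reads the slide's relation as MTBF = MTTF + MTTR. The `availability` helper uses the standard steady-state definition MTTF / (MTTF + MTTR), which is an assumption added here and is not stated in the transcript. Function names and sample values are illustrative only.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures: MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability, MTTF / (MTTF + MTTR).

    Assumption: standard textbook definition, not taken from the slides.
    """
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


# Hypothetical system: fails every 1000 hours on average, 2 hours to repair.
print(mtbf(1000.0, 2.0))                     # 1002.0
print(round(availability(1000.0, 2.0), 4))   # 0.998
```

With MTTR much smaller than MTTF, MTBF and MTTF differ by only 0.2% in this example, which is the sense in which MTBF ≈ MTTF.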

10
Terminology and definitions
11
Fault-Error-Failure concept
  • Intuitive definitions
  • Fault
  • An anomalous physical condition caused by a
    manufacturing problem, fatigue, external
    disturbance (intentional or unintentional),
    or design flaw
  • Error: effect of the activation of a fault
  • Failure: overall system effect of an error
  • Fault -> Error -> Failure

  • Example fault: bit stuck at
  • Example error: incorrect data at ALU
  • Example failure: incorrect balance, system crash
  • Not all errors lead to failures!
12
Fault Modeling (contd.)
  • High-level failure models (process or system
    failure)
  • General classification
  • crash failure - a faulty processor or system
    stops permanently
  • omission failure - a faulty process sometimes
    omits inputs/outputs, but when it works, it
    works correctly
  • timing failure - inputs/outputs are delayed or
    arrive too early
  • Byzantine failure (or arbitrary failure) - a
    faulty processor can exhibit arbitrary behavior
    including malicious nature

13
Reliability
  • In reliability theory it is customary to assume a
    constant failure rate. We then typically
    express reliability R, the probability of
    survival to time t, in the form
  • $R(t) = e^{-\lambda t} = e^{-t/T}$
  • where $T = \text{MTBF} = 1/\lambda$ and $\lambda$ is the failure rate
  • Reliability = f(MTBF)
  • Reliability is measured on a time interval, it is
    also estimated as

14
Failure Rate
  • Bathtub curve
  • The rate at which a component suffers faults
    depends on its age, the ambient temperature, any
    voltage or physical shocks that it suffers, and
    the technology

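[Figure: bathtub curve of failure rate versus age. Labels: burn-in is used to avoid the early-failure zone; normal lifetime, with $\lambda$ constant and failures independent; time-axis marks at 20 weeks and 5-25 years.]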
15
Failure Rate
  • Example for normal lifetime period

If $\lambda$ is 25 per million hours, i.e. 0.000025
failures per hour, then for an 8-hour mission
$R(8) = e^{-(0.000025)(8)} = e^{-0.0002} \approx 0.9998$, i.e. 99.98%.
The system will complete an 8-hour mission
9,998 times out of 10,000.
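The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch assuming the constant-failure-rate model from the Reliability slide; the function name `reliability` is chosen here for illustration.

```python
import math


def reliability(failure_rate_per_hour: float, mission_hours: float) -> float:
    """Probability of surviving the mission under a constant failure rate."""
    return math.exp(-failure_rate_per_hour * mission_hours)


# 25 failures per million hours, 8-hour mission (the values from the slide).
r = reliability(25e-6, 8)
print(f"{r:.4%}")  # 99.9800%
```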
16
Fault Tolerance and Reliability
  • The effect of a fault tolerant design on
    reliability can be expressed as
  • $R_{sys} = P(\text{no fault}) + P(\text{correct operation} \mid \text{fault}) \, P(\text{fault})$

  • $P(\text{no fault})$ is maximized by fault-intolerant design
    (proofs of correct design, high-quality components)
  • $P(\text{correct operation} \mid \text{fault})$ is the coverage of a
    fault-tolerant design over all possible faults
  • For cost effectiveness, fault-tolerant design
    should target the most likely faults
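To see how coverage affects system reliability, the expression can be evaluated directly. The sketch below assumes the reading R_sys = P(no fault) + P(correct operation | fault) · P(fault); the function name and the sample probabilities are hypothetical.

```python
def system_reliability(p_fault: float, coverage: float) -> float:
    """R_sys = P(no fault) + coverage * P(fault).

    coverage stands for P(correct operation | fault).
    """
    p_no_fault = 1.0 - p_fault
    return p_no_fault + coverage * p_fault


# Hypothetical numbers: a fault occurs 10% of the time.
print(round(system_reliability(0.1, 0.0), 4))  # 0.9  (no fault tolerance)
print(round(system_reliability(0.1, 0.9), 4))  # 0.99 (90% coverage)
```

Raising coverage only helps in proportion to P(fault), which is why targeting the most likely faults is the cost-effective choice.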
17
Modeling
  • Reliability Modeling
  • System model, concentrating on reliability aspect
  • Models
  • Combinatorial Models
  • Markov Models

18
Modeling (contd.)
  • Combinatorial Modeling
  • Probabilistic techniques
  • Express reliability of a system as a function
    of reliability of its components
  • Construction models
  • series
  • parallel

19
Modeling (contd.)
  • Combinatorial Modeling

Series: all components must work correctly. No
redundancy.
$R_t = R_1 R_2 R_3$
Parallel: only one of the components must work
correctly. High redundancy.
$(1 - R_t) = (1 - R_1)(1 - R_2)(1 - R_3)$
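The two combinatorial formulas generalize to any number of independent components. The following sketch assumes independence between components; the function names `series` and `parallel` are illustrative.

```python
from math import prod


def series(reliabilities):
    """All components must work: multiply the reliabilities."""
    return prod(reliabilities)


def parallel(reliabilities):
    """At least one component must work: 1 minus the product of unreliabilities."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


components = [0.9, 0.9, 0.9]
print(round(series(components), 4))    # 0.729
print(round(parallel(components), 4))  # 0.999
```

Three components of reliability 0.9 give a weaker system in series and a much stronger one in parallel, which is the redundancy trade-off the slide describes.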
20
Modeling (contd.)
  • Markov Models
  • Many complex problems cannot be modeled easily
    in a combinatorial fashion
  • Use Markov models (aka Markov chains)
  • Repair is very difficult to model combinatorially
  • Markov models can be applied to modeling
    reliability, availability, repair etc.

21
Modeling (contd.)
  • Markov Models
  • State: represents all that must be known to
    describe the system at a given instant in time
  • E.g. for reliability, each state represents a
    distinct combination of faulty and fault-free
    modules (e.g. 101, where 1 = OK and 0 = faulty)
  • Transition: a change of state that happens in the
    system
  • Over time, as failures occur, the system goes
    from one state to another
  • State changes are given probabilities (e.g.
    probability of failure)
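A small discrete-time Markov chain makes the state and transition ideas concrete. The sketch below is an illustration, not the model from the slides: it assumes a two-module parallel system without repair, where each working module fails independently with probability `p` per time step, and the states count the working modules (2, 1, 0).

```python
def two_module_chain(p: float):
    """Transition matrix over states [2 working, 1 working, 0 working]."""
    q = 1.0 - p
    return [
        [q * q, 2 * p * q, p * p],  # from 2 working modules
        [0.0, q, p],                # from 1 working module
        [0.0, 0.0, 1.0],            # system failed (absorbing, no repair)
    ]


def step(dist, matrix):
    """Advance the state distribution by one time step."""
    n = len(dist)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def reliability_after(steps: int, p: float) -> float:
    """Probability that at least one module still works after `steps` steps."""
    dist = [1.0, 0.0, 0.0]  # start with both modules working
    matrix = two_module_chain(p)
    for _ in range(steps):
        dist = step(dist, matrix)
    return 1.0 - dist[2]


print(reliability_after(10, 0.01))
```

For this repair-free case the result agrees with the combinatorial parallel formula, 1 - (1 - (1 - p)^n)^2. The Markov form becomes useful once a repair transition (for example from state 1 back to state 2) is added, which the combinatorial formula cannot express.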
22
Fundamental Principles
  • Redundancy
  • Addition of extra parts in a system's design to
    allow it to continue functioning as intended in
    spite of failures
  • Providing redundancy is key in fault tolerant
    computing
  • Hardware redundancy
  • Software Redundancy
  • Time Redundancy
  • Information Redundancy

23
Information Redundancy
  • Distinguish
  • Data words: the actual information contents
  • Code words: the transmitted information
    (redundant)
  • Codes can be
  • Separable: if the code word contains all
    original data bits plus additional check bits
  • Non-separable: otherwise
  • Example: ASCII coding of a single digit (separable)
  • Data word: 9
  • Code word: 0x39 (decimal 57, binary 0011 1001; the
    low four bits 1001 are the data word)

24
Information Redundancy
  • A dataword with d bits is encoded into a codeword
    with c bits, where c > d
  • Not all $2^c$ combinations are valid codewords
  • If c bits are not a valid codeword an error is
    detected
  • Extra bits may be used to correct errors
  • Overhead time to encode and decode

25
Information Redundancy
  • Trade-off: more code bits give more error
    tolerance, but leave less bandwidth available
    for real information
26
Data Communication
  • Error correcting codes provide reliable digital
    data transmission when the communication medium
    used has an unacceptable bit error rate (BER) and
    a low signal-to-noise ratio (SNR)

[Figure: two ECC encoder/decoders connected by a channel subject to noise.]
27
Mathematical digression
  • Theory of ECC is based on abstract algebra
  • Fields
  • Informally, a field is a set in which we can add,
    subtract, multiply and divide (except by zero),
    obtaining a result that is also a member of the set
  • The sets of real and complex numbers are examples
    of fields
  • fields with a finite number of elements q are
    called Galois fields GF(q)

28
Galois Fields
  • GF(2)
  • Binary numbers 0, 1 with XOR as addition and AND
    as multiplication
  • In GF(2), subtraction = addition = XOR
  • $GF(2^m)$ is a Galois (finite) field in which the
    number of elements is an integer power of 2
  • Elements are represented as polynomials
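The GF(2) operations map directly onto bitwise operators, which a short check can confirm. In this sketch the function names `gf2_add` and `gf2_mul` are illustrative; the loop compares them with ordinary integer arithmetic modulo 2.

```python
def gf2_add(a: int, b: int) -> int:
    """Addition in GF(2) is XOR."""
    return a ^ b


def gf2_mul(a: int, b: int) -> int:
    """Multiplication in GF(2) is AND."""
    return a & b


for a in (0, 1):
    for b in (0, 1):
        # Addition and subtraction modulo 2 give the same result as XOR.
        assert gf2_add(a, b) == (a + b) % 2 == (a - b) % 2
        assert gf2_mul(a, b) == (a * b) % 2
        print(a, b, gf2_add(a, b), gf2_mul(a, b))
```

Applied to whole machine words, the same XOR adds two polynomials over GF(2) coefficient by coefficient, which is what makes CRC arithmetic cheap in hardware and software.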

29
Cyclic Redundancy Code (CRC)
  • Basic idea: both parties agree on a fixed number
    (the divisor) beforehand
  • Treat the message as a large binary number,
  • multiply it by 2 raised to the highest exponent
    (degree) of the fixed number,
  • divide the result by that fixed binary number, and
  • make the remainder from this division the error
    checking information
  • Send the message (the multiplied number) plus the
    remainder
  • Upon receipt of the message, the receiver can
    perform the same division and compare the
    remainder with the transmitted remainder
  • CRC calculations are based on
  • polynomial division
  • arithmetic over $GF(2^m)$.

30
CRC
  • The divisor (the fixed number) is the generator
    polynomial G(x)
  • Given a message to be transmitted,
    $b_n b_{n-1} b_{n-2} \ldots b_2 b_1 b_0$
  • View the bits of the message as the coefficients
    of a polynomial
  • $M(x) = b_n x^n + b_{n-1} x^{n-1} + b_{n-2} x^{n-2} + \ldots + b_2 x^2 + b_1 x + b_0$
  • Multiply the polynomial corresponding to the
    message by $x^k$, where k is the degree of the
    generator polynomial, and then divide this product
    by the generator to obtain polynomials Q(x) and
    R(x) such that
  • $x^k M(x) = Q(x) G(x) + R(x)$
  • All coefficients are treated not as integers but
    as integers modulo 2.

31
CRC
  • Finally, treat the coefficients of the remainder
    polynomial R(x) as "parity bits". That is,
    append them to the message before actually
    transmitting it.
  • $x^k M(x) + R(x)$
  • When a message is received, the corresponding
    polynomial is divided by G(x). If the remainder
    is non-zero, an error is detected. Otherwise, the
    message is assumed to be correct
  • $(x^k M(x) + R(x)) / G(x) = Q(x)$ with zero
    remainder, because $R(x) + R(x) = 0$ modulo 2

32
CRC Example
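  • Suppose we want to send the short message
    11010111 using the CRC with the polynomial
    $x^3 + x^2 + 1$ as our generator
  • The message corresponds to the polynomial
    $x^7 + x^6 + x^4 + x^2 + x + 1$
  • Given G(x) is of degree 3, we need to multiply
    this polynomial by $x^3$ and then divide the result
    ($x^{10} + x^9 + x^7 + x^5 + x^4 + x^3$) by G(x)

Long division in modulo 2 arithmetic, where
$(x^i - x^i) \bmod 2 = 0$ and $(x^i + x^i) \bmod 2 = (2x^i) \bmod 2 = 0$:
Dividend: $x^{10} + x^9 + x^7 + x^5 + x^4 + x^3$
Divisor: $x^3 + x^2 + 1$
Subtract $x^7 G(x) = x^{10} + x^9 + x^7$, leaving $x^5 + x^4 + x^3$
Subtract $x^2 G(x) = x^5 + x^4 + x^2$, leaving $x^3 + x^2$
Subtract $1 \cdot G(x) = x^3 + x^2 + 1$, leaving 1
Quotient: $x^7 + x^2 + 1$
Residue: 1. Therefore the parity will be 001
Codeword sent: $x^{10} + x^9 + x^7 + x^5 + x^4 + x^3 + 1$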
33
CRC Example (cont)
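  • At the receiver side, the message is divided again
    by the fixed generator polynomial

Dividend: $x^{10} + x^9 + x^7 + x^5 + x^4 + x^3 + 1$
Divisor: $x^3 + x^2 + 1$
Subtract $x^7 G(x) = x^{10} + x^9 + x^7$, leaving $x^5 + x^4 + x^3 + 1$
Subtract $x^2 G(x) = x^5 + x^4 + x^2$, leaving $x^3 + x^2 + 1$
Subtract $1 \cdot G(x) = x^3 + x^2 + 1$, leaving 0
Quotient: $x^7 + x^2 + 1$
Remainder = 0: no error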
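The worked example can be replayed in code by storing each polynomial as an integer whose bits are the coefficients (so $x^3 + x^2 + 1$ is 0b1101) and using XOR for subtraction modulo 2. This is a minimal sketch of the division described in the slides, not a production CRC routine; the function names are illustrative.

```python
def gf2_remainder(dividend: int, generator: int) -> int:
    """Remainder of polynomial division over GF(2), bits as coefficients."""
    glen = generator.bit_length()
    while dividend.bit_length() >= glen:
        # Align the generator under the leading term and subtract (XOR).
        shift = dividend.bit_length() - glen
        dividend ^= generator << shift
    return dividend


def crc_encode(message: int, generator: int) -> int:
    """Multiply the message by x^k and append the remainder as parity bits."""
    k = generator.bit_length() - 1  # degree of the generator
    shifted = message << k
    return shifted | gf2_remainder(shifted, generator)


G = 0b1101                               # x^3 + x^2 + 1
codeword = crc_encode(0b11010111, G)
print(format(codeword, "011b"))          # 11010111001 (parity 001)
print(gf2_remainder(codeword, G))        # 0: no error detected
print(gf2_remainder(codeword ^ 0b100, G) != 0)  # True: flipped bit detected
```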
34
Hardware Redundancy
  • Passive redundancy techniques
  • fault masking
  • Active redundancy techniques
  • detection, localization, containment, recovery
  • Hybrid redundancy techniques
  • static + dynamic
  • fault masking + reconfiguration

35
Passive Hardware Redundancy
  • Triple modular redundancy (TMR)

  • 3 active components; fault masking by a voter
  • Problem: the voter is a single point of failure
36
Passive Hardware Redundancy
  • Generalization of TMR (more than 3 modules, e.g.
    5MR)
  • In general, NMR (N is always an odd number)
  • Voting can be done on digital or analog data
  • Application: temperature measurement
  • Method: take 3 measurements, compute the median
    value
  • Example
  • Sensor 1: 99 C
  • Sensor 2: 100 C
  • Sensor 3: 45,217 C <- error, discard outlier!
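Median voting over the three sensor readings from this example takes one call to the standard library. The wrapper name `vote_median` is illustrative.

```python
from statistics import median


def vote_median(readings):
    """Mask a single wild reading by taking the median of an odd number of values."""
    return median(readings)


print(vote_median([99, 100, 45217]))  # 100
```

Unlike an average, the median is unaffected by how far off the faulty sensor is, so one bad reading out of three is masked completely.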

37
TMR with Triplicated Voters
38
Cascading TMR modules
  • Examples
  • JPL STAR (Self-Testing And Repairing computer)
  • FAA WAAS (Wide Area Augmentation System)

39
Passive Hardware Redundancy
  • Hardware realization of 1 bit majority voting

$F = ab + ac + bc$ (2 gate delays)
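The majority function F = ab + ac + bc translates directly into bitwise operators, with AND for the products and OR for the sum. The sketch below is a software illustration of the gate-level voter; because the operators are bitwise, the same function votes on every bit position of a word independently.

```python
def majority(a: int, b: int, c: int) -> int:
    """1-bit majority vote F = ab + ac + bc, applied bitwise."""
    return (a & b) | (a & c) | (b & c)


# Truth table for single bits.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            print(a, b, c, majority(a, b, c))

# Word-level vote: the third module disagrees in two bit positions.
print(format(majority(0b1010, 0b1010, 0b0110), "04b"))  # 1010
```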
40
Extending TMR
  • TMR handles processor faults, voter faults,
    memory faults and bus faults
  • System has no single point of failure
  • This approach is implemented in the Tandem
    Integrity system
41
Software analogies to TMR
  • N-version programming
  • The same specification is implemented in a number
    of different versions by different teams. All
    versions compute simultaneously and the majority
    output is selected using a voting system.
  • This is the most commonly used approach e.g. in
    many models of the Airbus commercial aircraft.
  • Recovery blocks
  • A number of explicitly different versions of the
    same specification are written and executed in
    sequence.
  • An acceptance test is used to select the output
    to be transmitted.

42
N-version programming
43
Output comparison
  • As in hardware systems, the output comparator is
    a simple piece of software that uses a voting
    mechanism to select the output.
  • In real-time systems, there may be a requirement
    that the results from the different versions are
    all produced within a certain time frame.

44
N-version programming
  • The different system versions are designed and
    implemented by different teams. It is assumed
    that there is a low probability that they will
    make the same mistakes. The algorithms used
    should be different, but may not be.
  • There is some empirical evidence that teams
    commonly misinterpret specifications in the same
    way and choose the same algorithms in their
    systems.
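The output comparison step of N-version programming can be sketched as a strict-majority vote over the results of independently written versions. In the sketch below the three "versions" are toy functions invented for illustration (one of them deliberately buggy), and they run one after another rather than simultaneously as a real system would.

```python
from collections import Counter


def n_version_vote(versions, x):
    """Run every version on the same input and return the majority output."""
    outputs = [version(x) for version in versions]
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority among versions")
    return value


# Three hypothetical implementations of "square a number".
versions = [
    lambda x: x * x,
    lambda x: x ** 2,
    lambda x: x + x,   # buggy version: doubles instead of squaring
]
print(n_version_vote(versions, 3))  # 9
```

The vote masks the buggy version only because the other two agree. If two teams make the same mistake, the wrong answer wins the vote, which is the common-mode risk described above.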

45
Fault Tolerant Techniques in Software
  • Check points and roll backs
  • Application state is saved at a checkpoint. A
    rollback restarts execution from a previous
    checkpoint
  • Recovery Blocks
  • Alternates - secondary modules that perform same
    function of a primary module - are executed when
    primary fails to pass an acceptance test

46
Recovery blocks
47
Recovery blocks
  • These force a different algorithm to be used for
    each version so they reduce the probability of
    common errors.
  • However, the design of the acceptance test is
    difficult as it must be independent of the
    computation used.
  • There are problems with this approach for
    real-time systems because of the sequential
    operation of the redundant versions.
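A recovery block combines a checkpoint, a sequence of alternates and an acceptance test. The sketch below is one possible reading under stated assumptions: the state is copied before each attempt so a failed alternate cannot corrupt it, and the alternates and acceptance test are toy functions invented for illustration.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in turn from the same checkpoint until one passes."""
    checkpoint = copy.deepcopy(state)
    for alternate in alternates:
        candidate = copy.deepcopy(checkpoint)  # roll back to the checkpoint
        try:
            result = alternate(candidate)
        except Exception:
            continue  # a crash counts as a failed attempt
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def buggy_sort(items):
    items.pop()  # primary module: loses an element and does not sort
    return items


def safe_sort(items):
    return sorted(items)  # secondary module


data = [3, 1, 2]


def accept(result):
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return len(result) == len(data) and in_order


print(recovery_block(data, [buggy_sort, safe_sort], accept))  # [1, 2, 3]
```

The alternates run sequentially, so the worst-case response time is the sum of all attempts, which is the real-time concern noted above. The acceptance test here checks properties of the result rather than recomputing it, in line with the requirement that it be independent of the computation used.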

48
Problems with design diversity
  • Teams are not culturally diverse so they tend to
    tackle problems in the same way.
  • Characteristic errors
  • Different teams make the same mistakes. Some
    parts of an implementation are more difficult
    than others so all teams tend to make mistakes in
    the same place
  • Specification errors
  • If there is an error in the specification then
    this is reflected in all implementations
  • This can be addressed to some extent by using
    multiple specification representations.

49
Summary of FTC Techniques
  • A summary chart of all techniques

50
End-to-end Argument in Design of Distributed
Systems
  • In 1981, Saltzer et al. argued that reliable systems
    tend to require end-to-end processing to operate
    correctly, in addition to any processing in the
    intermediate system.
  • They demonstrated (by argumentation) that the
    end-to-end processing alone would suffice to make
    the system operate, and that the intermediate
    processing stages are largely redundant.
  • Therefore, much intermediate processing can be
    made simpler, relying on the end-to-end
    processing to make the system work.

51
End-to-end Argument in Design of Distributed
Systems
  • Example
  • TCP/IP protocol stack. IP is a dumb, stateless
    protocol that simply moves datagrams across the
    network, and TCP is a smart end-to-end protocol
    operating between the client computers
  • Intermediate levels of processing may make the
    system more efficient. However they are not
    essential. In contrast end-to-end processing is
    essential
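The end-to-end argument can be illustrated with a transfer in which only the endpoints check integrity and retry. Everything in this sketch is hypothetical: `FlakyChannel` stands in for an unreliable intermediate network that corrupts the first few transmissions, and the expected digest is assumed to reach the receiver intact, which a real protocol would have to arrange.

```python
import hashlib


class FlakyChannel:
    """Toy intermediate system that corrupts the first n transmissions."""

    def __init__(self, corrupt_first_n: int):
        self.remaining = corrupt_first_n

    def send(self, payload: bytes) -> bytes:
        if self.remaining > 0 and payload:
            self.remaining -= 1
            return bytes([payload[0] ^ 0xFF]) + payload[1:]  # flip first byte
        return payload


def transfer(data: bytes, channel: FlakyChannel, max_attempts: int = 3):
    """End-to-end check: the receiver verifies a digest and asks for a resend."""
    expected = hashlib.sha256(data).hexdigest()
    for attempt in range(1, max_attempts + 1):
        received = channel.send(data)
        if hashlib.sha256(received).hexdigest() == expected:
            return received, attempt
    raise IOError("transfer failed after all attempts")


print(transfer(b"hello", FlakyChannel(corrupt_first_n=1)))  # (b'hello', 2)
```

The channel itself does no error checking, yet the transfer still completes correctly. Adding checks inside the channel would reduce the number of retries, which is an efficiency gain rather than a correctness requirement.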