Fault Tolerance

Provided by: carla3
1
Fault Tolerance
2
Fault tolerance terminology
  • dependability - extent to which reliance can
    justifiably be placed on service.
  • General concept
  • reliability - continuity of service
  • metric: mean time between failures (MTBF)
  • availability - readiness for usage
  • safety - avoidance of catastrophic effects on
    environment
  • security - resistance to unauthorized access.

3
Faults, errors, failures
  • fault - component malfunction
  • error - system state is wrong
  • failure - system departs from specification

[Diagram: relationship among fault, error, and failure]
4
System
[Diagram: a system and its components within the environment, with fault and failure labeled]
5
Coping with faults
  • Reduce/eliminate faults in components.
  • Fault tolerance
  • Prevent faults from becoming failures
  • usually through redundancy.

6
Types of faults (fault models)
  • Fault tolerance algorithms depend on fault
    models.
  • Crash fault or stop fault - faulty component
    stops responding. No incorrect state changes in
    component.
  • Timing fault - response is too early or late.
  • Byzantine fault - arbitrary behavior. Can be
    considered adversarial (imagine worst case).

7
The agreement problem
  • Processors may fail
  • so, use multiple processors
  • but then, processors may disagree, causing
    failures.
  • Need a principled approach to distributed
    agreement

8
Example: AFTI 16 (from J. Rushby)
  • Advanced Fighter Technology Integration F-16
  • Triple-redundant digital flight-control system
    (DFCS) with analog backup
  • DFCS design was asynchronous
  • processors ran independently
  • sample sensor, evaluate control law, send command
    to actuator
  • actuator averages or selects from commands
  • General Dynamics felt synchronization would
    introduce a single point of failure.

9
AFTI 16 problems
  • Processors can get widely varying sensor readings
    because of timing differences
  • Reconfiguration can cause sudden changes in
    control (thumps).
  • Need to allow wide range of plausible values
    before declaring a processor bad
  • Bad sensor reading drags average down
  • Sensor finally crosses threshold and is called
    bad
  • average suddenly snaps back when sensor is
    excluded.
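
The "drag, then snap back" behavior can be reproduced with a toy model. This is only a sketch under assumptions that are not in the original slides: three channels, a plain average, a hypothetical threshold of 4.0, and a reading counted as bad when it differs from the median by more than the threshold.

```python
def select_command(readings, threshold):
    """Average the readings that lie within `threshold` of the median.

    Hypothetical exclusion rule, chosen only to illustrate the effect.
    """
    ordered = sorted(readings)
    median = ordered[len(ordered) // 2]
    accepted = [r for r in readings if abs(r - median) <= threshold]
    return sum(accepted) / len(accepted)


# Two healthy sensors read 10.0 while the third drifts downward.
threshold = 4.0
drifts = [0.0, 1.5, 3.0, 4.5]
outputs = [select_command([10.0, 10.0, 10.0 - d], threshold) for d in drifts]

for d, out in zip(drifts, outputs):
    print(f"drift={d:3.1f}  output={out}")
```

While the drifting sensor is still inside the plausible range it pulls the output down; once it crosses the threshold it is dropped and the output jumps back in a single step. A wider threshold delays the exclusion and makes the jump larger.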

10
AFTI 16 problems (cont)
  • Processor states can diverge rapidly
  • especially when different processors go into
    different control modes.
  • Design complexity
  • 70% of application code was for redundancy
    management
  • Control laws had to be modified to ramp changes
    in and out smoothly

11
AFTI 16 flight test, Flight 36
  • Departure from control laws for 3 seconds
  • acceleration exceeded -4g, then +7g
  • Angle of attack went to -10 degrees, then +20
    degrees
  • Aircraft rolled 360 degrees
  • Cause: side air probe cut out at high angle of
    attack
  • Analysis showed this would cause complete failure
    of DFCS for several areas of flight envelope

12
AFTI 16 flight 44
  • Each channel declared the others failed
  • asynchronous operation, timing skew, sensor noise
  • analog backup not selected
  • simultaneous failure of two channels not
    anticipated
  • Aircraft flown home on a single digital channel
    (not designed for this)
  • There were no hardware failures.

13
AFTI 16 Analysis (NASA)
  • Nearly all failure indications were design
    oversights related to asynchronous operation
  • Failures due to lack of understanding of
    interactions among
  • Air data system
  • redundancy management software
  • flight control laws (decision points, thumps,
    ramp-in/out)
  • Moral of the story: reliability through
    redundancy is a lot harder than it looks.

14
Distributed consensus
  • Goal: multiple processors agree on something in
    the presence of various kinds of faults and
    errors
  • Intellectually difficult
  • Algorithms are tricky
  • Proofs are subtle
  • Sensitive to assumptions
  • Synchronous vs. asynchronous
  • Communication mechanism
  • Fault models
  • Many papers written

15
Synchronous vs. asynchronous
  • Synchronous: processors run in lock-step
  • Hard to implement - model may be unrealistic
  • Requires clock synchronization.
  • Consensus is easier
  • Asynchronous: processors run at arbitrary speed
  • Easier to implement - model is conservative
  • In most models, consensus problem is provably
    unsolvable.

16
Synchronous vs. asynchronous
  • Semi-synchronous
  • Bounds on how far out-of-sync processors can get
  • Model is fairly realistic
  • Consensus is almost as easy as synchronous

17
Fault models
  • Goal: make claims such as "the system will
    continue to function if any single processor
    stops."
  • More conservative fault models
  • Fault tolerance is harder
  • But, if successful, stronger claims can be made
  • Fewer assumptions mean simpler FMEA, easier
    certification
  • A lot of models have been proposed.

18
Process fault models
  • Stopping fault - process stops sending messages
  • does not restart
  • does not send wrong messages
  • liberal (easy) model
  • Byzantine fault - process behaves arbitrarily
  • Name comes from cute Byzantine generals
    metaphor
  • May send arbitrary messages, enter arbitrary
    states
  • Equivalent to evil behavior, for our purposes

19
Synchronous agreement with stopping faults
  • Multiple processes want to agree on a value
  • Applications
  • sensor readings among redundant processors
  • decide what time it is
  • decide which of a group of processors are broken
    and should be removed from system.

20
Synchronous agreement - properties
  • Each process starts with an initial value,
    processes end with a decision value.
  • Agreement: all good processes decide on the same
    value.
  • Validity: if all processes start with the same
    value, that value is the final decision value.
  • Termination: all good processes eventually decide.
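
The agreement and validity properties can be written as checks over the outcome of a run. This is a minimal sketch; the function names and the dictionary layout (process id to value, with `decisions` holding only the good processes) are assumptions made for illustration. Termination is not checked, since it is a property of the run rather than of its final result.

```python
def check_agreement(decisions):
    """All good processes decided on the same value."""
    return len(set(decisions.values())) <= 1


def check_validity(initial, decisions):
    """If every process started with the same value, that value was decided."""
    starts = set(initial.values())
    if len(starts) != 1:
        return True  # validity says nothing when initial values differ
    (common,) = starts
    return all(d == common for d in decisions.values())


initial = {"p1": 5, "p2": 5, "p3": 5}
print(check_agreement({"p1": 5, "p2": 5}))          # agreement holds
print(check_validity(initial, {"p1": 0, "p2": 0}))  # validity violated
```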

21
Flood set algorithm
  • Assumption: there is a dedicated link between
    each pair of processes
  • No more than f processes can stop
  • Each process has an initial value v
  • Each process accumulates a set W of all the
    values it has ever seen.
  • On each round, every process sends its W set to
    every other process
  • Every process sets W to the union of the old
    value and all the new values coming in from
    others.

22
Flood set
  • After $f+1$ rounds, every process looks at W.
  • If W has only one value, choose that value.
  • Else, choose 0 (a predetermined default).
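
The flood set procedure can be simulated in a few lines. The sketch below runs f+1 rounds, the number used in the correctness argument. The `crash_plan` argument is a hypothetical way to describe stopping faults for the simulation: it maps a process to the round in which it stops and the recipients it still reaches in that round.

```python
def flood_set(initial, f, crash_plan=None, default=0):
    """Simulate flood set for f+1 synchronous rounds.

    initial:    dict mapping process id -> initial value
    crash_plan: dict mapping process id -> (round, recipients reached
                in that round before stopping); hypothetical format
    Returns the decisions of the processes that never stopped.
    """
    crash_plan = crash_plan or {}
    W = {p: {v} for p, v in initial.items()}
    alive = set(initial)
    for rnd in range(1, f + 2):
        inbox = {p: [] for p in alive}
        for sender in sorted(alive):
            if sender in crash_plan and crash_plan[sender][0] == rnd:
                recipients = crash_plan[sender][1]  # partial send, then stop
            else:
                recipients = alive - {sender}
            for r in recipients:
                if r in inbox:
                    inbox[r].append(set(W[sender]))
        # Processes that stop in this round take no further part.
        alive -= {p for p, (r, _) in crash_plan.items() if r == rnd}
        for p in alive:
            for received in inbox[p]:
                W[p] |= received
    return {p: (next(iter(W[p])) if len(W[p]) == 1 else default)
            for p in alive}


# 3 processes, 1 fault, default 0: p1 stops in round 1 after reaching only p2.
initial = {"p1": 7, "p2": 5, "p3": 5}
crash_plan = {"p1": (1, {"p2"})}
print(flood_set(initial, f=1, crash_plan=crash_plan))
# One round is not enough for this stopping pattern:
print(flood_set(initial, f=0, crash_plan=crash_plan))
```

With two rounds both survivors end up with W = {5, 7} and pick the default. With a single round, p2 has seen both values but p3 has seen only 5, so they disagree.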

23
Flood set correctness
  • In $f+1$ rounds, there must be at least one round
    in which no processes stop
  • At most f processes can stop, and processes
    cannot stop more than once.
  • If no process stops in round r, W will be the
    same in all good processes in subsequent rounds.
  • All good processes successfully send all values
    in W to all other good processes, so all
    processes will have same W after the round.
  • After this, nothing can get added to any W sets,
    so it doesn't matter whether more processes stop.

24
Flood set correctness
  • So, after $f+1$ rounds, all non-stopped processes
    have same W sets
  • If W has only one value, all processes pick this
    value.
  • Else all processes pick the default value 0.

25
Flood set example
  • 3 processes, 1 fault, default value 0

[Table: W at each process in round 0, round 1, and round 2, and the final decision]
26
Flood set efficiency
  • $O((f+1)\,n^2)$ messages
  • $f+1$ rounds
  • n processes send n messages per round
  • $O((f+1)\,n^3)$ values are sent (each message
    may have a set of up to n values)

27
Optimized flood set
  • Note: if W has more than one element, process
    doesn't need to know what is in it.
  • Idea: every process sends only first two distinct
    values.
  • Every process sends its initial value on first
    round
  • If process receives a different value, it sends
    it out on next round
  • Correctness proof: run Flood and OptFlood in
    parallel
  • same initial values, stopping pattern
  • W sets have more than one value iff OptFlood
    process gets two values.
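
A sketch of the optimized variant, under the same simulation assumptions as a plain flood set run: f+1 rounds, and a hypothetical `crash_plan` mapping a process to the round in which it stops and the recipients it still reaches. Each process remembers at most two distinct values and broadcasts at most twice. Initial values are assumed not to be `None`, which the code uses to mean "nothing to send".

```python
def opt_flood(initial, f, crash_plan=None, default=0):
    """Simulate OptFlood; returns (decisions, number of messages sent)."""
    crash_plan = crash_plan or {}
    seen = {p: [v] for p, v in initial.items()}  # at most two distinct values
    outgoing = dict(initial)                     # value to broadcast this round
    alive = set(initial)
    messages = 0
    for rnd in range(1, f + 2):
        inbox = {p: [] for p in alive}
        for sender in sorted(alive):
            if outgoing[sender] is None:
                continue                         # nothing new to report
            if sender in crash_plan and crash_plan[sender][0] == rnd:
                recipients = crash_plan[sender][1]
            else:
                recipients = alive - {sender}
            for r in recipients:
                messages += 1
                if r in inbox:
                    inbox[r].append(outgoing[sender])
        alive -= {p for p, (r, _) in crash_plan.items() if r == rnd}
        for p in alive:
            outgoing[p] = None
            for v in inbox[p]:
                # Only the first value that differs from the initial one matters.
                if len(seen[p]) < 2 and v not in seen[p]:
                    seen[p].append(v)
                    outgoing[p] = v
    decisions = {p: (seen[p][0] if len(seen[p]) == 1 else default)
                 for p in alive}
    return decisions, messages


initial = {"p1": 7, "p2": 5, "p3": 5}
decisions, messages = opt_flood(initial, f=1, crash_plan={"p1": (1, {"p2"})})
print(decisions, messages)
```

In this run p2 learns the second value in round 1 and forwards it once in round 2, while p3 has nothing new to send after its first broadcast.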

28
OptFlood efficiency
  • $2n^2$ messages
  • n processes send at most two messages to n other
    processes.
  • $O(n^2)$ values are sent

29
Byzantine agreement
  • Goal: non-faulty processes should agree on a
    value.
  • e.g., message received
  • e.g., sensor value
  • Faults may cause arbitrary behavior
  • arbitrary values communicated
  • different values communicated to different
    receivers
  • Advantage: reduces fault analysis
  • Disadvantage: hard or impossible to do.

30
Byzantine agreement properties
  • Agreement: all good processes agree on a value
  • Validity: if source of value was non-faulty,
    agreed upon value is the same.

31
Asynchronous agreement
  • Asynchronous model
  • Message transmission takes arbitrary time.
  • Processes run at arbitrary speeds.
  • Theorem: there is no algorithm that reaches
    agreement in an asynchronous model with even one
    Byzantine failure
  • Fine print: details of conditions, communication
  • This is one of the most important results about
    distributed systems.

32
Synchronous agreement
  • Synchronous model: processes can communicate in a
    sequence of rounds. All processes complete a
    round before next round begins.
  • The agreement problem is solvable in this model.
  • Theorem: tolerating k Byzantine faults requires
    more than 3k processes ($n > 3k$).
  • So: triple modular redundancy can't handle
    Byzantine faults.
  • Practical case: 1 Byzantine fault, 4 processes.
  • Assumes full connectivity (connections between
    each pair of processors).
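
The bound can be turned around to ask how many Byzantine faults a given number of processes can tolerate. A minimal sketch; the function name is made up for illustration:

```python
def max_byzantine_faults(n):
    """Largest k with n > 3k, i.e. the most Byzantine faults n processes tolerate."""
    return (n - 1) // 3


for n in (3, 4, 7):
    print(n, "processes tolerate", max_byzantine_faults(n), "Byzantine fault(s)")
```

Three processes (triple modular redundancy) give k = 0, and four processes are the smallest configuration that tolerates one Byzantine fault.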

33
Synchronous agreement with one fault
  • Single transmitter communicates value to all
    processes.
  • Round 0: transmitter sends value to n-1
    receivers.
  • Values are sent correctly if transmitter is not
    faulty.
  • Round 1: each receiver sends value to n-2 other
    receivers.
  • Receivers record all values separately.
  • Intuition: receivers compare notes on what
    transmitter told them.
  • Each receiver chooses majority value of all values
    it received.
  • If no majority, use pre-arranged default value.
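
The two rounds can be simulated directly. In this sketch process 0 is the transmitter and processes 1 to n-1 are receivers. The `relay` callback is a hypothetical device for modelling a faulty receiver: it returns what a sender forwards to a given recipient, so an honest receiver simply returns the value it was given.

```python
from collections import Counter


def majority(values, default):
    """Strict majority of `values`, or `default` if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else default


def one_fault_agreement(n, transmitter_sends, relay, default=0):
    """Round 0: transmitter -> receivers. Round 1: receivers exchange values.

    transmitter_sends: dict receiver -> value received in round 0
    relay(sender, recipient, value): value the sender forwards in round 1
    """
    receivers = range(1, n)
    received = {r: [transmitter_sends[r]] for r in receivers}
    for s in receivers:
        for r in receivers:
            if r != s:
                received[r].append(relay(s, r, transmitter_sends[s]))
    return {r: majority(received[r], default) for r in receivers}


def honest(sender, recipient, value):
    return value


def receiver_1_lies(sender, recipient, value):
    return 9 if sender == 1 else value


# Faulty transmitter sends inconsistent values; receivers still agree.
print(one_fault_agreement(4, {1: 1, 2: 1, 3: 0}, honest))
print(one_fault_agreement(4, {1: 1, 2: 2, 3: 3}, honest))
# Faulty receiver 1 relays a bogus value; receivers 2 and 3 still decide 1.
print(one_fault_agreement(4, {1: 1, 2: 1, 3: 1}, receiver_1_lies))
```

In the first two runs every receiver ends up with the same collection of values, so all take the same majority or the same default. In the third run the single bogus value is outvoted at each good receiver.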

34
Example 1 - faulty transmitter
Round 1: receivers exchange values (reliably)
35
Example 2 - faulty transmitter
Round 1: receivers exchange values (reliably)
36
Example 3 - faulty receiver
Process 1 is broken, so result is not required
to be correct
Process 1 sends bogus values
37
General case
  • Previous algorithm can be generalized to handle
    more Byzantine faults.
  • General results: k faults require $k+1$ (k?)
    rounds, $3k+1$ processors
  • Number of messages grows exponentially with
    number of rounds
  • Intuition: pn said that pn-1 said that ... p1
    said that p0 said that the value was x
  • There are exponentially many chains pn ... p0.
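
The growth can be made concrete by counting messages. The sketch assumes the recursive generalization in which, in every relay round, each process forwards each chain it received to every process not already in that chain. Round 0 then carries n-1 messages, round 1 carries (n-1)(n-2), and so on. The function name is made up for illustration.

```python
def relay_message_count(n, k):
    """Messages sent over k+1 rounds (round 0 plus k relay rounds)."""
    total = 0
    chains = 1
    for i in range(1, k + 2):
        chains *= n - i   # each chain is extended to every process not yet in it
        total += chains
    return total


for k in (1, 2, 3):
    n = 3 * k + 1
    print(f"k={k}, n={n}: {relay_message_count(n, k)} messages")
```

For the one-fault case with four processes this gives 3 + 6 = 9 messages, matching round 0 (n-1 sends) plus round 1 (n-1 receivers each sending to n-2 others).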

38
Hybrid Byzantine agreement
  • Idea: free bonus reliability with the purchase of
    Byzantine agreement.
  • Handles Byzantine faults, plus some simpler
    faults
  • Symmetric fault: process sends same wrong value
    to everyone.
  • Nonmalicious fault: process sends a recognizable
    error value.
  • Advantages
  • If processors have these faults, we can tolerate
    more faulty processors
  • These faults are more probable than true
    Byzantine faults - so this increases reliability

39
Hybrid Byzantine agreement
  • Modify previous algorithm by adding special error
    value E.
  • Nonmalicious faults send E value (other faults
    may send E, also).
  • Majority algorithm first removes E values.
  • Theorem: algorithm reaches agreement if
  • $n > 2a + 2s + b + r$
  • a = Byzantine, s = symmetric, b = nonmalicious, r =
    number of rounds (excluding first
    transmission).
  • Previous case: $a = 1$, $s = 0$, $b = 0$, $r = 1$, so $n > 3$
  • With 6 processors, can deal with 1 Byzantine + 2
    nonmalicious faults.
  • or 1 Byzantine and 1 symmetric
  • ... but just 1 Byzantine in previous algorithm
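
Both the modified majority step and the bound are easy to check numerically. This is a sketch only: the sentinel `ERROR` stands in for the special error value E, the function names are made up, and the check covers just the inequality n > 2a + 2s + b + r, not any other conditions the theorem may carry.

```python
from collections import Counter

ERROR = "E"  # hypothetical stand-in for the special error value E


def hybrid_majority(values, default):
    """Drop error values first, then take a strict majority of the rest."""
    kept = [v for v in values if v != ERROR]
    if not kept:
        return default
    value, count = Counter(kept).most_common(1)[0]
    return value if count > len(kept) / 2 else default


def satisfies_hybrid_bound(n, a, s, b, r):
    """True if n > 2a + 2s + b + r."""
    return n > 2 * a + 2 * s + b + r


print(hybrid_majority([1, ERROR, ERROR, 1, 2], default=0))
print(satisfies_hybrid_bound(n=4, a=1, s=0, b=0, r=1))  # previous case
print(satisfies_hybrid_bound(n=6, a=1, s=0, b=2, r=1))  # 1 Byzantine + 2 nonmalicious
print(satisfies_hybrid_bound(n=6, a=1, s=1, b=0, r=1))  # 1 Byzantine + 1 symmetric
```

Removing the error values before voting is what makes nonmalicious faults cheap: in the first call, the value 1 has no majority among all five entries but does have one among the three that remain.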

40
Variations
  • Synchronous communication is difficult
  • Compromise between synchronous and asynchronous:
    real-time constraints.
  • Authentication - agreement can be made less
    costly by using digital signatures
  • transmitter digitally signs messages
  • processes can't lie about who said what.
  • can handle any number of faults (in synchronous
    model).
  • May assume different network connectivity
  • Some links in network missing

41
Summary
  • Fault tolerance is tricky. Redundancy does not
    necessarily buy reliability.
  • Byzantine models can account for unforeseen fault
    types.
  • Byzantine agreement is impossible in some models.
  • There exist practical algorithms for Byzantine
    agreement if synchronous communication is
    available.
  • There are deep theoretical results in this area.