Fault Tolerance

Slides: 42

Provided by: carla3

Learn more at: http://roc.cs.berkeley.edu

Transcript and Presenter's Notes

1
Fault Tolerance
2
Fault tolerance terminology

dependability - extent to which reliance can
justifiably be placed on service.
General concept
reliability - continuity of service
metric: mean time between failures (MTBF)
availability - readiness for usage
safety - avoidance of catastrophic effects on
environment
security - resistance to unauthorized access.

3
Faults, errors, failures

fault - component malfunction
error - system state is wrong
failure - system departs from specification

[Diagram: fault -> error -> failure]
4
System
[Diagram labels: system, system components, fault, failure, environment]
5
Coping with faults

Reduce/eliminate faults in components.
Fault tolerance: prevent faults from becoming
failures, usually through redundancy.

6
Types of faults (fault models)

Fault tolerance algorithms depend on the fault
model.
Crash fault or stop fault - faulty component
stops responding. No incorrect state changes in
component.
Timing fault - response is too early or late.
Byzantine fault - arbitrary behavior. Can be
considered adversarial (imagine worst case).

7
The agreement problem

Processors may fail
so, use multiple processors
but then, processors may disagree, causing
failures.
Need a principled approach to distributed
agreement

8
Example: AFTI 16 (from J. Rushby)

Advanced Fighter Technology Integration F16
Triple-redundant digital flight-control system
(DFCS) with analog backup
DFCS design was asynchronous
processors ran independently
sample sensor, evaluate control law, send command
to actuator
actuator averages or selects from commands
General Dynamics felt synchronization would
introduce a single point of failure.

9
AFTI 16 problems

Processors can get widely varying sensor readings
because of timing differences
Reconfiguration can cause sudden changes in
control (thumps).
Need to allow wide range of plausible values
before declaring a processor bad
Bad sensor reading drags average down
Sensor finally crosses threshold and is declared
bad
average suddenly snaps back when sensor is
excluded.
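
The "thump" described above can be reproduced with a toy voter. This is a minimal sketch, not the AFTI design: the function name `voted_output`, the channel names, the drift rate, and the threshold of 6.0 are all assumptions chosen for illustration. The output is the average of the trusted channels, and a channel is excluded once it deviates from the median by more than the threshold.

```python
from statistics import mean, median

def voted_output(readings, active, threshold):
    """One voting step: return (output, still_active).

    readings: dict channel -> value; active: set of channels still trusted.
    A channel is dropped once it deviates from the median of the trusted
    channels by more than `threshold` (hypothetical exclusion rule).
    """
    mid = median(readings[c] for c in active)
    still_active = {c for c in active if abs(readings[c] - mid) <= threshold}
    return mean(readings[c] for c in sorted(still_active)), still_active

active = {"A", "B", "C"}
outputs = []
for step in range(8):
    # Channel C drifts low; A and B stay correct.
    readings = {"A": 10.0, "B": 10.0, "C": 10.0 - 1.5 * step}
    out, active = voted_output(readings, active, threshold=6.0)
    outputs.append(out)

print(outputs)
```

With these assumed numbers the output sinks gradually while C is still trusted, then jumps back to 10.0 in a single step once C is excluded. A wider threshold delays the exclusion and makes the jump larger.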

10
AFTI 16 problems (cont)

Processor states can diverge rapidly
especially when different processors go into
different control modes.
Design complexity
70% of application code was for redundancy
management
Control laws had to be modified to ramp changes
in and out smoothly

11
AFTI 16 flight test, Flight 36

Departure from control laws for 3 seconds
acceleration exceeded -4g, then +7g
Angle of attack went to -10 degrees, then +20
degrees
Aircraft rolled 360 degrees
Cause: side air probe cut out at high angle of
attack
Analysis showed this would cause complete failure
of DFCS for several areas of flight envelope

12
AFTI 16 flight 44

Each channel declared the others failed
asynchronous operation, timing skew, sensor noise
analog backup not selected
simultaneous failure of two channels not
anticipated
Aircraft flown home on a single digital channel
(not designed for this)
There were no hardware failures.

13
AFTI 16 Analysis (NASA)

Nearly all failure indications were design
oversights related to asynchronous operation
Failures due to lack of understanding of
interactions among
Air data system
redundancy management software
flight control laws (decision points, thumps,
ramp-in/out)
Moral of the story: reliability through
redundancy is a lot harder than it looks.

14
Distributed consensus

Goal: multiple processors agree on something in
the presence of various kinds of faults and
errors
Intellectually difficult
Algorithms are tricky
Proofs are subtle
Sensitive to assumptions
Synchronous vs. asynchronous
Communication mechanism
Fault models
Many papers written

15
Synchronous vs. asynchronous

Synchronous: processors run in lock-step
Hard to implement - model may be unrealistic
Requires clock synchronization.
Consensus is easier
Asynchronous: processors run at arbitrary speeds
Easier to implement - model is conservative
In most models, the consensus problem is provably
unsolvable.

16
Synchronous vs. asynchronous

Semi-synchronous
Bounds on how far out-of-sync processors can get
Model is fairly realistic
Consensus is almost as easy as synchronous

17
Fault models

Goal: make claims such as "the system will
continue to function if any single processor
stops."
More conservative fault models
Fault tolerance is harder
But, if successful, stronger claims can be made
Fewer assumptions: simpler FMEA, easier
certification
A lot of models have been proposed.

18
Process fault models

Stopping fault - process stops sending messages
does not restart
does not send wrong messages
liberal (easy) model
Byzantine fault - process behaves arbitrarily
Name comes from cute Byzantine generals
metaphor
May send arbitrary messages, enter arbitrary
states
Equivalent to evil behavior, for our purposes

19
Synchronous agreement with stopping faults

Multiple processes want to agree on a value
Applications
sensor readings among redundant processors
decide what time it is
decide which of a group of processors are broken
and should be removed from system.

20
Synchronous agreement - properties

Each process starts with an initial value;
processes end with a decision value.
Agreement: all good processes decide on the same
value.
Validity: if all processes start with the same
value, that value is the final decision value.
Termination: all good processes eventually decide.

21
Flood set algorithm

Assumption: there is a dedicated link between
each pair of processes
No more than f processes can stop
Each process has an initial value v
Each process accumulates a set W of all the
values it has ever seen.
On each round, every process sends its W set to
every other process
Every process sets W to the union of the old
value and all the new values coming in from
others.

22
Flood set

After f+1 rounds, every process looks at W.
If W has only one value, choose that value.
Else, choose 0 (a predetermined default).
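
The flood set steps can be sketched as a small round-by-round simulation. This is an illustrative sketch under stated assumptions, not code from the presentation: the function name `flood_set`, the `crashes` argument (which models a process that reaches only some receivers in the round where it stops), and the process names are hypothetical. It runs f+1 rounds, as the correctness argument requires.

```python
def flood_set(initial, f, crashes=None, default=0):
    """Simulate FloodSet for f+1 synchronous rounds.

    initial: dict process -> initial value.
    crashes: dict process -> (round, receivers reached in that round); the
             process sends only to those receivers in its crash round and
             sends nothing afterwards (hypothetical way to model a stop).
    Returns the decisions of the processes that never stopped.
    """
    crashes = crashes or {}
    procs = set(initial)
    W = {p: {v} for p, v in initial.items()}
    alive = set(procs)
    for rnd in range(1, f + 2):
        inbox = {p: set() for p in procs}
        for sender in sorted(alive):
            crash = crashes.get(sender)
            crashing_now = crash is not None and crash[0] == rnd
            receivers = crash[1] if crashing_now else procs - {sender}
            for receiver in receivers:
                inbox[receiver] |= W[sender]
        # Processes that stopped during this round take no further part.
        alive -= {p for p, (r, _) in crashes.items() if r == rnd}
        for p in alive:
            W[p] |= inbox[p]
    return {p: (next(iter(W[p])) if len(W[p]) == 1 else default) for p in alive}

# p1 stops in round 1 after reaching only p2, so p3 learns the value 5
# one round late, via p2.
decisions = flood_set({"p1": 5, "p2": 7, "p3": 7}, f=1,
                      crashes={"p1": (1, {"p2"})})
print(decisions)
```

In this run, after round 1 only p2 has seen both values; the second round is what brings p3 into line, and both surviving processes then choose the default.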

23
Flood set correctness

In f+1 rounds, there must be at least one round
in which no processes stop
At most f processes can stop, and processes
cannot stop more than once.
If no process stops in round r, W will be the
same in all good processes in subsequent rounds.
All good processes successfully send all values
in W to all other good processes, so all
processes will have same W after the round.
After this, nothing can get added to any W sets,
so it doesn't matter whether more processes stop.

24
Flood set correctness

So, after f+1 rounds, all non-stopped processes
have the same W sets
If W has only one value, all processes pick this
value.
Else all processes pick the default value 0.

25
Flood set example

3 processes, 1 fault, default value 0

[Table: W at each process in round 0, round 1, round 2, and the final decision]
26
Flood set efficiency

O((f+1) n^2) messages
f+1 rounds
n processes send n messages per round
O((f+1) n^3) values are sent (each message
may have a set of up to n values)
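
As a quick arithmetic check, the counts can be computed directly, reading the bounds above as O((f+1) n^2) messages and O((f+1) n^3) values. The helper name `flood_set_cost` is hypothetical, and it assumes no process stops and that a process does not send to itself, so each process sends n-1 messages per round rather than n.

```python
def flood_set_cost(n, f):
    """Return (rounds, messages, max_values) for a failure-free FloodSet run."""
    rounds = f + 1
    messages = rounds * n * (n - 1)   # every process sends to the n-1 others
    max_values = messages * n         # each message carries at most n values
    return rounds, messages, max_values

rounds, messages, max_values = flood_set_cost(n=3, f=1)
print(rounds, messages, max_values)
```

For 3 processes and 1 fault this gives 2 rounds and 12 messages, within the (f+1) n^2 = 18 bound.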

27
Optimized flood set

Note: if W has more than one element, a process
doesn't need to know what is in it.
Idea: every process sends only the first two distinct
values.
Every process sends its initial value on the first
round
If a process receives a different value, it sends
it out on the next round
Correctness proof: run Flood and OptFlood in
parallel
same initial values, stopping pattern
W sets have more than one value iff OptFlood
process gets two values.
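
The optimization can be sketched in the same style. Again this is a sketch under assumptions rather than the presenter's code: `opt_flood`, the `crashes` argument, and the message counter are hypothetical names, and a message addressed to an already stopped process is still counted as sent.

```python
def opt_flood(initial, f, crashes=None, default=0):
    """Simulate OptFlood: each process sends at most two distinct values.

    crashes: dict process -> (round, receivers reached in that round).
    Returns (decisions of processes that never stopped, messages sent).
    """
    crashes = crashes or {}
    procs = set(initial)
    seen = {p: [v] for p, v in initial.items()}    # at most two values kept
    outbox = {p: [v] for p, v in initial.items()}  # values to send next round
    alive = set(procs)
    messages = 0
    for rnd in range(1, f + 2):
        inbox = {p: [] for p in procs}
        for sender in sorted(alive):
            crash = crashes.get(sender)
            crashing_now = crash is not None and crash[0] == rnd
            receivers = crash[1] if crashing_now else procs - {sender}
            for value in outbox[sender]:
                for receiver in sorted(receivers):
                    inbox[receiver].append(value)
                    messages += 1
            outbox[sender] = []
        alive -= {p for p, (r, _) in crashes.items() if r == rnd}
        for p in alive:
            for value in inbox[p]:
                # Only the first value that differs from the initial one is
                # recorded and forwarded.
                if len(seen[p]) < 2 and value not in seen[p]:
                    seen[p].append(value)
                    outbox[p].append(value)
    decisions = {p: (seen[p][0] if len(seen[p]) == 1 else default)
                 for p in alive}
    return decisions, messages

decisions, messages = opt_flood({"p1": 5, "p2": 7, "p3": 7}, f=1,
                                crashes={"p1": (1, {"p2"})})
print(decisions, messages)
```

A process holds two values here exactly when its W set in the unoptimized algorithm would hold more than one, which is all the decision rule needs.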

28
OptFlood efficiency

2n^2 messages
n processes send at most two messages to n other
processes.
O(n^2) values are sent

29
Byzantine agreement

Goal: non-faulty processes should agree on a
value.
e.g., message received
e.g., sensor value
Faults may cause arbitrary behavior
arbitrary values communicated
different values communicated to different
receivers
Advantage: reduces fault analysis
Disadvantage: hard or impossible to do.

30
Byzantine agreement properties

Agreement: all good processes agree on a value
Validity: if the source of the value was non-faulty,
the agreed-upon value is the value the source sent.

31
Asynchronous agreement

Asynchronous model
Message transmission takes arbitrary time.
Processes run at arbitrary speeds.
Theorem: there is no algorithm that reaches
agreement in an asynchronous model with even one
Byzantine failure
Fine print: details of conditions, communication
This is one of the most important results about
distributed systems.

32
Synchronous agreement

Synchronous model: processes can communicate in a
sequence of rounds. All processes complete a
round before the next round begins.
The agreement problem is solvable in this model.
Theorem: tolerating k Byzantine faults requires
more than 3k processes.
So triple modular redundancy can't handle
Byzantine faults.
Practical case: 1 Byzantine fault, 4 processes.
Assumes full connectivity (connections between
each pair of processors).

33
Synchronous agreement with one fault

Single transmitter communicates value to all
processes.
Round 0: transmitter sends value to n-1
receivers.
Values are sent correctly if transmitter is not
faulty.
Round 1: each receiver sends value to n-2 other
receivers.
Receivers record all values separately.
Intuition: receivers compare notes on what
transmitter told them.
Each receiver chooses the majority value of all values
it received.
If no majority, use pre-arranged default value.
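
The two rounds above can be sketched for one transmitter and three receivers. This is a minimal illustration, not the presenter's code: `one_fault_agreement`, `majority`, the `relay` hook used to model a lying receiver, and the default value 0 are assumptions made for the example.

```python
from collections import Counter

DEFAULT = 0  # pre-arranged default value (assumed)

def majority(values, default=DEFAULT):
    """Return the value held by a strict majority, else the default."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else default

def one_fault_agreement(sent_by_transmitter, relay=None):
    """sent_by_transmitter: dict receiver -> value received in round 0.
    relay: optional function (sender, receiver, value) -> value actually
           passed on in round 1; honest receivers relay unchanged.
    Returns dict receiver -> decision.
    """
    relay = relay or (lambda sender, receiver, value: value)
    receivers = sorted(sent_by_transmitter)
    decisions = {}
    for r in receivers:
        heard = [sent_by_transmitter[r]]
        heard += [relay(s, r, sent_by_transmitter[s])
                  for s in receivers if s != r]
        decisions[r] = majority(heard)
    return decisions

# Faulty transmitter sends three different values: no majority anywhere,
# so every receiver falls back to the default.
split = one_fault_agreement({"p1": 1, "p2": 2, "p3": 3})

# Faulty receiver p1 relays a bogus 9; the good receivers still decide 1.
liar = one_fault_agreement({"p1": 1, "p2": 1, "p3": 1},
                           relay=lambda s, r, v: 9 if s == "p1" else v)
print(split, liar)
```

In both runs the good receivers end up with the same decision, and when the transmitter is good they decide the value it sent.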

34
Example 1 - faulty transmitter
Round 1: receivers exchange values (reliably)
35
Example 2 - faulty transmitter
Round 1: receivers exchange values (reliably)
36
Example 3 - faulty receiver
Process 1 is broken, so its result is not required
to be correct
Process 1 sends bogus values
37
General case

Previous algorithm can be generalized to handle
more Byzantine faults.
General results: k faults require k+1 (k?)
rounds, 3k+1 processors
Number of messages grows exponentially with
number of rounds
Intuition: Pn said that Pn-1 said that ... P1
said that P0 said that the value was x
There are exponentially many chains Pn ... P0.
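
To get a feel for the growth, the relay chains can be counted. This sketch assumes that a chain never repeats a process, which matches the one-fault algorithm where each receiver relays to the n-2 others; the function name `relay_message_count` is hypothetical.

```python
def relay_message_count(n, rounds):
    """Messages sent when relaying continues for `rounds` rounds after the
    transmitter's first send: (n-1) + (n-1)(n-2) + (n-1)(n-2)(n-3) + ...
    """
    total, chains = 0, 1
    for depth in range(rounds + 1):
        chains *= n - 1 - depth   # each chain is extended by one new process
        total += chains
    return total

print(relay_message_count(4, 1))   # one fault, 4 processes
print(relay_message_count(7, 2))   # two faults, 7 processes
```

Under this assumption, going from one fault with 4 processes to two faults with 7 processes raises the count from 9 to 156 messages.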

38
Hybrid Byzantine agreement

Idea: free bonus reliability with the purchase of
Byzantine agreement.
Handles Byzantine faults, plus some simpler
faults
Symmetric fault: process sends the same wrong value
to everyone.
Nonmalicious fault: process sends a recognizable
error value.
Advantages
If processors have these faults, we can tolerate
more faulty processors
These faults are more probable than true
Byzantine faults - so this increases reliability

39
Hybrid Byzantine agreement

Modify previous algorithm by adding special error
value E.
Nonmalicious faults send E value (other faults
may send E, also).
Majority algorithm first removes E values.
Theorem: algorithm reaches agreement if
n > 2a + 2s + b + r
a = Byzantine, s = symmetric, b = nonmalicious, r =
number of rounds (excluding first
transmission).
Previous case: a=1, s=0, b=0, r=1, so n > 3
With 6 processors, can deal with 1 Byzantine + 2
nonmalicious faults,
or 1 Byzantine and 1 symmetric
... but just 1 Byzantine in previous algorithm
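
The condition and the modified majority step can be sketched as follows, reading the theorem's inequality as n > 2a + 2s + b + r. The names `tolerates` and `hybrid_majority` and the string "E" standing for the error value are assumptions for illustration. The check covers only the inequality as stated; it does not check whether r rounds are enough for a given number of Byzantine faults.

```python
from collections import Counter

E = "E"  # special error value (representation assumed)

def tolerates(n, a, s, b, r=1):
    """True if n processors satisfy n > 2a + 2s + b + r."""
    return n > 2 * a + 2 * s + b + r

def hybrid_majority(values, default=0):
    """Majority vote that first removes E values."""
    kept = [v for v in values if v != E]
    if not kept:
        return default
    value, count = Counter(kept).most_common(1)[0]
    return value if count > len(kept) / 2 else default

print(tolerates(4, a=1, s=0, b=0))      # previous case
print(tolerates(6, a=1, s=0, b=2))      # 1 Byzantine + 2 nonmalicious
print(tolerates(6, a=1, s=1, b=0))      # 1 Byzantine and 1 symmetric
print(hybrid_majority([E, 1, 1, 2]))    # E is discarded before voting
```

Discarding E before the vote is what lets a nonmalicious fault cost only one processor instead of two in the inequality.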

40
Variations

Synchronous communication is difficult
Compromise between synchronous and asynchronous
real-time constraints.
Authentication - agreement can be made less
costly by using digital signatures
transmitter digitally signs messages
processes can't lie about who said what.
can handle any number of faults (in synchronous
model).
May assume different network connectivity
Some links in network missing

41
Summary

Fault tolerance is tricky. Redundancy does not
necessarily buy reliability.
Byzantine models can account for unforeseen fault
types.
Byzantine agreement is impossible in some models.
There exist practical algorithms for Byzantine
agreement if synchronous communication is
available.
There are deep theoretical results in this area.
