Title: 1. Review of Multiprocessors and Fault Tolerance
18. Fault Tolerance in Software
- Is it true that a program that has once performed
a given task as specified will continue to do so?
- Yes, provided that none of the following change:
- The inputs
- The computing environment
- The user requirements
28. Fault Tolerance in Software
Federal Reserve Funds Transfer Program, active 12
hours/day, 5 days/week.
Consistency of failure rates in time.
38. Fault Tolerance in Software
According to the Data and Analysis Center for Software (DACS),
fault density, the number of faults per 1000 lines of
code, ranges from 10 to 50 for good SW and from
1 to 5 after intensive testing using automated
tools.
Failure rates of Command and Control Systems.
48. Fault Tolerance in Software
Consequences of SW failure: Most people have
personal experience with incorrect billing or lost
airline or hotel reservations. More serious
errors are reported in the media, such as the
disruption of phone service to over 20 million
customers during the summer of 1991 due to a coding
error in a new-generation digital switch. The
most serious consequences are related to
real-time applications, such as those involving
spacecraft: the launch failure of Mariner I
(1962), the destruction of a French
meteorological satellite in 1968, several
problems during the Apollo missions in the early
1970s, the NASA Space Shuttle, the fly-by-wire
Airbus A320, the Russian satellite Mars, the
satellite launcher Ariane.
58. Fault Tolerance in Software
- Causes of SW failure
- Malfunction of a process, e.g., exception handling, timeout computation, design error (solution: check the outputs and timer)
- Erroneous control sequence (solution: set an upper limit on loop iterations)
- Data entry error (solution: use error-detecting codes and type checks on input data).
68. Fault Tolerance in Software
- 8.3 Dealing with Faulty Programs
- 8.3.1 Robustness
- The minimum requirement is that the program will properly handle inputs that are out of range, or in a different type or format than defined, without degrading its performance of functions not dependent on the nonstandard input.
- When input data are found not to comply with the program specification:
- a new input may be requested
- the last acceptable value of a variable can be used
- or a predefined default can be assigned.
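
The fallback options above can be sketched as follows. This is a minimal illustration, not a reference implementation: the class name `RobustInput`, the numeric range, and the default value are all assumptions made for the example.

```python
from typing import Optional


class RobustInput:
    """Validates a numeric input and falls back when it is out of spec.

    Hypothetical helper for illustration; names and limits are assumptions.
    """

    def __init__(self, low: float, high: float, default: float):
        self.low = low
        self.high = high
        self.default = default
        self.last_good: Optional[float] = None  # last acceptable value seen

    def read(self, raw: object) -> float:
        """Return a usable value for `raw`, never raising on bad input."""
        try:
            value = float(raw)  # type/format check
        except (TypeError, ValueError):
            return self._fallback()
        if not (self.low <= value <= self.high):  # range check (also rejects NaN)
            return self._fallback()
        self.last_good = value
        return value

    def _fallback(self) -> float:
        # Prefer the last acceptable value; otherwise use the predefined default.
        return self.last_good if self.last_good is not None else self.default


sensor = RobustInput(low=0.0, high=100.0, default=20.0)
readings = [sensor.read(r) for r in ["abc", 42, 500, None]]
print(readings)
```

Requesting a new input, the first option on the slide, is left out here because it depends on the input source; the caller could simply invoke `read` again with fresh data.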
78. Fault Tolerance in Software
- 8.3 Dealing with Faulty Programs
- 8.3.1 Robustness
- In general, Robustness is used to test:
- the function of a process (e.g., by checking the outputs)
- the control sequence (e.g., by setting an upper limit on loop iterations)
- the input data (e.g., by using error-detecting codes and type checks).
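
As a small sketch of these three checks in one routine, consider a square root computed by Newton's method. The function name `newton_sqrt`, the tolerances, and the iteration limit are assumptions chosen for the example.

```python
def newton_sqrt(x: float, tol: float = 1e-12, max_iterations: int = 100) -> float:
    """Square root by Newton's method with input, loop, and output checks."""
    # Input data check: type and range.
    if not isinstance(x, (int, float)) or isinstance(x, bool) or x < 0:
        raise ValueError("x must be a non-negative number")
    if x == 0:
        return 0.0
    guess = max(float(x), 1.0)
    # Control sequence check: the loop can never run more than max_iterations.
    for _ in range(max_iterations):
        nxt = 0.5 * (guess + x / guess)
        converged = abs(nxt - guess) <= tol * nxt
        guess = nxt
        if converged:
            break
    else:
        raise RuntimeError("iteration limit reached without convergence")
    # Function check: verify the output against the defining property.
    if abs(guess * guess - x) > 1e-9 * max(x, 1.0):
        raise RuntimeError("result failed the output check")
    return guess


print(newton_sqrt(2.0))
```

The `for ... else` form keeps the limit explicit: the `else` branch runs only when the loop exhausts its budget without a `break`.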
88. Fault Tolerance in Software
- 8.3 Dealing with Faulty Programs
- 8.3.2 Temporal Redundancy
- Temporal Redundancy consists of the reexecution of a program when an error is encountered. The error may involve faulty data (as detected by Robustness), faulty execution (e.g., accessing protected memory), or incorrect output (as detected by Acceptance Tests).
- Reexecution will clear errors that arose from temporary circumstances that are no longer present when a new pass through the program is taken.
- E.g., busy or noisy communication channels, full buffers, power supply transients.
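
A minimal sketch of reexecution, assuming a hypothetical `TransientError` exception that marks errors worth retrying and a fixed attempt budget. When the budget is spent, the sketch raises an error as a stand-in for whatever fault containment procedure the system provides.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors that may clear on reexecution."""


def run_with_retry(task: Callable[[], T], max_attempts: int = 3) -> T:
    """Reexecute `task` until it succeeds or the attempt budget is spent."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task()
        except TransientError as err:
            last_error = err  # temporary condition; take a fresh pass
    # The error persisted: hand over to fault containment (here, raise).
    raise RuntimeError("error persisted after reexecution") from last_error


attempts = {"count": 0}


def send_on_busy_channel() -> str:
    """Simulated task: the channel is busy on the first two attempts."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("channel busy")
    return "sent"


print(run_with_retry(send_on_busy_channel))
```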
98. Fault Tolerance in Software
- 8.3 Dealing with Faulty Programs
- 8.3.2 Temporal Redundancy
When the error persists, Fault Containment
Procedures must be triggered by the system.
108. Fault Tolerance in Software
- 8.3 Dealing with Faulty Programs
- 8.3.3 Software Diversity
- SW Diversity permits uninterrupted system operation in the presence of program faults through multiple implementations of a given functional process, and it is therefore particularly applicable to real-time control systems.
- It is divided into two categories:
- Static SW Fault Tolerance: N-Version Programming
- Dynamic SW Fault Tolerance: Recovery Block
118. Fault Tolerance in Software
- 8.3 Dealing with Faulty Programs
- 8.3.3 Software Diversity
- Static SW Fault Tolerance: N-Version Programming
- A given task is executed by several programs (consecutively on the same machine) and the result is accepted only if a specified number of programs agree within specified limits. The same computer performs comparison and selection of the results to be propagated to the external system.
- In practice, the programs are executed concurrently, and therefore multiple computers are required to implement this technique.
128. Fault Tolerance in Software
- 8.3 Dealing with Faulty Programs
- 8.3.3 Software Diversity
- Dynamic SW Fault Tolerance: Recovery Block
- A single program is executed and the result (including intermediate results) is subjected to an Acceptance Test.
138. Fault Tolerance in Software
- 8.3 Dealing with Faulty Programs
- 8.3.3 Software Diversity
- The term STATIC is used because the selection of the acceptable result does not affect the subsequent execution of the programs.
- The term DYNAMIC is used because the selection between the original and alternate program is made during execution based on the outcome of the Acceptance Test.
148. Fault Tolerance in Software
- 8.4 Design of Fault Tolerant Software Using Diversity
- 8.4.1 N-Version Programming
- Defined as the independent generation of N ≥ 2 functionally equivalent programs, called versions, from the same initial specification. For N = 2, fault masking is not provided, and upon disagreement among the versions, 3 alternatives are available:
- Retry or restart (in this case fault containment rather than FT is provided)
- Transition to a predefined safe state, possibly followed by later retries
- Reliance on one of the versions, either designated in advance as more reliable or selected by a diagnostic program (in the latter case the technique takes on some aspects of dynamic redundancy).
158. Fault Tolerance in Software
- 8.4 Design of Fault Tolerant Software Using Diversity
- 8.4.1 N-Version Programming
- For N > 2, majority voting logic can be implemented. For N = 3, the following are required:
- (i) Three independent programs, each furnishing identical output formats
- (ii) An acceptance program that evaluates the output of (i) and selects the result to be furnished as N-version output
- (iii) A driver (process controller) that invokes requirements (i) and (ii) and furnishes the N-version output to other programs or the physical system.
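
The three requirements can be sketched as below. This is an illustration under simplifying assumptions: the versions run one after another rather than concurrently, agreement means exact equality rather than agreement within limits, and the three squaring functions (one deliberately faulty) are invented for the example.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def square_a(x: int) -> int:
    return x * x


def square_b(x: int) -> int:
    return x ** 2


def square_faulty(x: int) -> int:
    return x + x  # deliberately wrong, to show the fault being masked


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Acceptance program: select the output produced by a strict majority."""
    result, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(outputs) // 2:
        raise RuntimeError("no majority among versions")
    return result


def n_version_execute(versions: Sequence[Callable[[int], Hashable]], x: int):
    """Driver: invoke every version on `x`, then furnish the voted result."""
    outputs = [version(x) for version in versions]
    return majority_vote(outputs)


print(n_version_execute([square_a, square_b, square_faulty], 3))
```

A version that raises an exception would abort this simple driver; a fuller sketch would catch that and treat it as a missing vote.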
168. Fault Tolerance in Software
- 8.4 Design of Fault Tolerant Software Using Diversity
- 8.4.1 N-Version Programming
- Experiment carried out at UCLA (1978)
- 7 separate versions of the application program
- From these, 12 3-version sets were constructed
- Each set was subjected to 32 test cases, yielding 384 total tests.
- One of the conclusions:
- In cases where a single faulty version resulted in incorrect execution, the OS of the computer intervened before the program reached the voting stage. Most later N-version experiments overcame this problem by incorporating acceptance tests for abort conditions and precluding the intervention of the OS under these conditions.
178. Fault Tolerance in Software
- 8.4 Design of Fault Tolerant Software Using Diversity
- 8.4.1 N-Version Programming
Results of an Early N-Version Programming
Experiment.
188. Fault Tolerance in Software
- 8.4 Design of Fault Tolerant Software Using Diversity
- 8.4.2 Recovery Block
- Represents the Dynamic Redundancy Approach to SW fault tolerance.
- Consists of 3 SW elements:
- a primary routine, which executes critical SW functions
- an acceptance test, which tests the output of the primary routine after every execution
- at least one alternate routine, which performs the same function as the primary routine (but may be less capable or slower) and is invoked by the acceptance test upon detection of a failure.
198. Fault Tolerance in Software
- 8.4 Design of Fault Tolerant Software Using Diversity
- 8.4.2 Recovery Block
- The basic structure is
- Ensure T
- By P
- Else by Q
- Else Error
- Where
- T is the acceptance test condition that is expected to be met by successful execution of either the primary routine P or the alternate routine Q.
- The structure is easily expanded to accommodate several alternates Q1, Q2, Q3,...,Qn.
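
The Ensure/By/Else structure maps naturally onto a small function. The sketch below assumes the routines have no side effects, so no state needs to be restored before an alternate runs; the sorting routines and the acceptance test are invented for the example.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def recovery_block(
    acceptance_test: Callable[[T], bool],
    routines: Sequence[Callable[[], T]],
) -> T:
    """Ensure T by P else by Q1 ... else by Qn else error."""
    for routine in routines:  # primary first, then alternates in order
        try:
            result = routine()
        except Exception:
            continue  # a crash counts as a failed attempt
        if acceptance_test(result):
            return result
    raise RuntimeError("all routines failed the acceptance test")


data = [3, 1, 2]


def primary_sort() -> List[int]:
    return list(data)  # hypothetical faulty primary: forgets to sort


def alternate_sort() -> List[int]:
    return sorted(data)  # slower but trusted alternate


def is_sorted_permutation(xs: List[int]) -> bool:
    """Acceptance test T: output is ordered and has the input's elements."""
    ordered = all(a <= b for a, b in zip(xs, xs[1:]))
    return ordered and sorted(xs) == sorted(data)


print(recovery_block(is_sorted_permutation, [primary_sort, alternate_sort]))
```

Adding alternates Q2, Q3, and so on only means appending them to the list passed as `routines`.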
208. Fault Tolerance in Software
- 8.4 Design of Fault Tolerant Software Using Diversity
- 8.4.2 Recovery Block
- The differences between Recovery Block and N-Version Programming are:
- only a single implementation of the program is run at a time (in this case P or Q)
- the acceptability of the results is decided by a test rather than by comparison with functionally equivalent alternate versions.
218. Fault Tolerance in Software
- 8.4 Design of Fault Tolerant Software Using Diversity
- 8.4.2 Recovery Block
- Real-time control applications require that results furnished by a program be both correct and timely.
- For this reason, the recovery block for a real-time program should incorporate a watchdog timer which initiates execution of Q if P does not produce an acceptable result within the allocated time.
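
A rough sketch of the watchdog idea, using a worker thread and a bounded wait from Python's `concurrent.futures`. The function name `timed_recovery_block` and the timing values are assumptions. Note the limitation: Python cannot preempt a running thread, so a late P keeps running in the background, whereas a real-time system would use a timer interrupt.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")


def timed_recovery_block(
    primary: Callable[[], T],
    alternate: Callable[[], T],
    acceptance_test: Callable[[T], bool],
    deadline_s: float,
) -> T:
    """Run P under a deadline; switch to Q on timeout, crash, or rejection."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(primary)
    try:
        result = future.result(timeout=deadline_s)  # watchdog: bounded wait
        if acceptance_test(result):
            return result
    except FutureTimeout:
        pass  # P overran its time allocation
    except Exception:
        pass  # P crashed
    finally:
        executor.shutdown(wait=False)  # do not block on a late P
    result = alternate()
    if acceptance_test(result):
        return result
    raise RuntimeError("primary and alternate both failed")


def slow_primary() -> int:
    time.sleep(0.3)  # simulated overrun of the time allocation
    return 1


def fast_alternate() -> int:
    return 2


print(timed_recovery_block(slow_primary, fast_alternate, lambda r: r > 0, 0.05))
```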
228. Fault Tolerance in Software
- 8.4 Design of Fault Tolerant Software Using Diversity
- 8.4.2 Recovery Block
Recovery block for real-time application. (Program flow under direction of the application module is shown in solid lines; timer-triggered interrupts are shown in dashed lines.)
238. Fault Tolerance in Software
- 8.4 Design of Fault Tolerant Software Using Diversity
- 8.4.2 Recovery Block
- Highlights ...
- A single program is executed at any given time
- No special demands on computer redundancy or computer architecture are made.
- Performance penalty in normal operation is small: only the execution of the acceptance test.
248. Fault Tolerance in Software
- 8.4 Design of Fault Tolerant Software Using Diversity
- 8.4.2 Recovery Block
- Highlights ...
- Storage requirements are expanded
- In addition to the primary application program, the acceptance test and the backup program must also be available in memory.
- SW development cost is increased
- Two programs and the associated acceptance test must be generated.
258. Fault Tolerance in Software
- 8.4 Design of Fault Tolerant Software Using Diversity
- 8.4.2 Recovery Block
- Details about the Basic Recovery Block Structure ...
- The Acceptance Test is divided into separate tests which are invoked before and after the execution of the primary routine
- Before
- The first acceptance test checks the call format and parameters.
- The second acceptance test checks the validity of the input data. (When data errors are common, provision of an alternate data source may be considered; dashed lines in the figure indicate the backup data.)
- After
- The last acceptance test examines the output data.
268. Fault Tolerance in Software
- 8.4 Design of Fault Tolerant Software Using Diversity
- 8.4.2 Recovery Block
Internal Structure for primary application module.
278. Fault Tolerance in Software
- 8.4 Design of Fault Tolerant Software Using Diversity
- 8.4.2 Recovery Block
- The integration of application modules structured as recovery blocks into a fault-tolerant SW system is shown in the next figure.
- Application Modules and the decision diamond labeled Return together represent the structure shown in figure .
- In the absence of failures of the recovery blocks, the process will always remain in the inner loop.
- If an abort is taken, the failure is recorded and some diagnostics may be performed. In case of a first failure in a recovery block, a retry may be initiated. If the failure persists, further execution of the task represented by the recovery block is suspended.
288. Fault Tolerance in Software
- 8.4 Design of Fault Tolerant Software Using Diversity
- 8.4.2 Recovery Block
Executive and application modules.