1. Review of Multiprocessors and Fault Tolerance

Slides: 29

Provided by: Fabia177


1
8. Fault Tolerance in Software

8.1 Introduction

Is it true that a program that has once performed
a given task as specified will continue to do so?

Yes, provided that none of the following
parameters change:
The inputs
The computing environment
The user requirements

2
8. Fault Tolerance in Software

8.1 Introduction

Federal Reserve Funds Transfer Program, active 12
hours/day, 5 days/week.
Consistency of failure rates in time.
3
8. Fault Tolerance in Software

8.1 Introduction

Data and Analysis Center for Software (DACS):
fault density, the number of faults per 1000 lines of
code, ranges from 10 to 50 for good SW and from
1 to 5 after intensive testing using automated
tools.
Failure rates of Command and Control Systems.
4
8. Fault Tolerance in Software

8.1 Introduction

Consequences of SW failure: The audience has
personal experience with incorrect billing and lost
airline or hotel reservations. More serious
errors are reported in the media, such as the
disruption of phone service to over 20 million
customers during the summer of 1991 due to a coding
error in a new-generation digital switch. The
most serious consequences are related to
real-time applications, such as those involving
spacecraft: the launch failure of Mariner I
(1962), the destruction of a French
meteorological satellite in 1968, several
problems during the Apollo missions in the early
1970s, the NASA Space Shuttle, the fly-by-wire
Airbus A320, the Russian satellite Mars, and the
satellite launcher Ariane.
5
8. Fault Tolerance in Software

8.1 Introduction

Causes of SW failure:
Malfunction of a process, e.g., exception
handling, timeout computation, design error
(solution: check the outputs and timer)
Erroneous control sequence (solution: set an
upper limit on loop iterations)
Data entry error (solution: use
error-detecting codes and type checks on input
data).

6
8. Fault Tolerance in Software

8.3 Dealing with Faulty Programs
8.3.1 Robustness

The minimum requirement is that the program will
properly handle inputs that are out of range, or of a
different type or format than defined, without
degrading its performance of functions not
dependent on the nonstandard input.
When input data are found not to comply
with the program specification:
a new input may be requested
the last acceptable value of a variable can be
used
or a predefined default can be assigned.
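
The three fallbacks listed above can be sketched in Python. This is only an illustration, not code from the presentation: the names `parse_in_range` and `robust_input`, the numeric range check, and the `max_requests` limit are all assumptions made for the example.

```python
def parse_in_range(raw, low, high):
    """Return raw as a float if it is well formed and within [low, high], else None."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None  # wrong type or format
    # NaN fails the comparison below, so it is rejected as well.
    return value if low <= value <= high else None


def robust_input(raw, low, high, last_good=None, default=0.0,
                 ask_again=None, max_requests=1):
    """Hypothetical helper: always return a usable value for a possibly bad input."""
    value = parse_in_range(raw, low, high)

    # Fallback 1: request a new input (a bounded number of times).
    requests = 0
    while value is None and ask_again is not None and requests < max_requests:
        value = parse_in_range(ask_again(), low, high)
        requests += 1
    if value is not None:
        return value

    # Fallback 2: reuse the last acceptable value of the variable.
    if last_good is not None:
        return last_good

    # Fallback 3: assign the predefined default.
    return default
```

Which fallback is appropriate depends on the application; the fixed order used here (new input, then last good value, then default) is one possible policy.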

7
8. Fault Tolerance in Software

8.3 Dealing with Faulty Programs
8.3.1 Robustness

In general, Robustness is used to test:
the function of a process (e.g., by checking the
outputs)
the control sequence (e.g., by setting an upper
limit on loop iterations)
the input data (e.g., by using error-detecting
codes and type checks).
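
As a rough illustration of the three checks above in a single routine, the sketch below guards a Newton iteration for the square root. The routine, the name `bounded_sqrt`, the tolerances, and the iteration limit are assumptions chosen for the example, not part of the presentation.

```python
class ControlSequenceError(RuntimeError):
    """Raised when the loop exceeds its iteration budget."""


def bounded_sqrt(x, tolerance=1e-12, max_iterations=50):
    # Input-data check: reject wrong types and out-of-range values.
    if isinstance(x, bool) or not isinstance(x, (int, float)) or x < 0:
        raise ValueError("x must be a non-negative number")
    if x == 0:
        return 0.0

    estimate = float(x) if x >= 1 else 1.0
    # Control-sequence check: an upper limit on loop iterations.
    for _ in range(max_iterations):
        improved = 0.5 * (estimate + x / estimate)
        converged = abs(improved - estimate) <= tolerance * improved
        estimate = improved
        if converged:
            break
    else:
        raise ControlSequenceError("iteration limit reached without convergence")

    # Function check: verify the output before releasing it.
    if abs(estimate * estimate - x) > 1e-9 * max(1.0, x):
        raise ArithmeticError("output failed its plausibility check")
    return estimate
```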

8
8. Fault Tolerance in Software

8.3 Dealing with Faulty Programs
8.3.2 Temporal Redundancy

Temporal Redundancy consists of the reexecution
of a program when an error is encountered. The
error may involve faulty data (as detected by
Robustness), faulty execution (e.g., accessing
protected memory), or incorrect output (as
detected by Acceptance Tests).
Reexecution will clear errors that
arose from temporary circumstances that are no
longer present when a new pass through the
program is taken,
e.g., busy or noisy communication channels, full
buffers, power supply transients.
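
A minimal sketch of temporal redundancy, assuming the task signals a temporary condition by raising a (hypothetical) `TransientError`. The attempt limit and the `make_flaky_task` helper exist only for the illustration. If the error persists, the sketch raises `PersistentFailure` and leaves any further handling to the caller.

```python
class TransientError(Exception):
    """Hypothetical signal for a temporary condition (busy channel, full buffer)."""


class PersistentFailure(Exception):
    """Raised when reexecution did not clear the error."""


def run_with_reexecution(task, max_attempts=3):
    """Reexecute task until it succeeds or the attempt budget is used up."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return task()
        except TransientError as error:
            last_error = error  # take a new pass through the program
    raise PersistentFailure(
        f"still failing after {max_attempts} attempts") from last_error


def make_flaky_task(failures):
    """Build a demo task that fails `failures` times and then succeeds."""
    remaining = [failures]

    def task():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientError("channel busy")
        return "delivered"

    return task
```

For example, `run_with_reexecution(make_flaky_task(2))` succeeds on the third pass, while a task that fails five times in a row exhausts the default budget of three attempts.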

9
8. Fault Tolerance in Software

8.3 Dealing with Faulty Programs
8.3.2 Temporal Redundancy

When the error persists, Fault Containment
Procedures must be triggered by the system.
10
8. Fault Tolerance in Software

8.3 Dealing with Faulty Programs
8.3.3 Software Diversity

SW Diversity permits uninterrupted system
operation in the presence of program faults
through multiple implementations of a given
functional process, and it is therefore
particularly applicable to real-time control
systems.
It is divided into two categories:
Static SW Fault Tolerance: N-Version Programming
Dynamic SW Fault Tolerance: Recovery Block

11
8. Fault Tolerance in Software

8.3 Dealing with Faulty Programs
8.3.3 Software Diversity

Static SW Fault Tolerance: N-Version Programming
A given task is executed by several programs
(consecutively on the same machine) and the
result is accepted only if a specified number of programs
agree within specified limits. The same computer
performs comparison and selection of the results
to be propagated to the external system.
In practice, the programs are executed
concurrently, and therefore multiple computers
are required to implement this technique.

12
8. Fault Tolerance in Software

8.3 Dealing with Faulty Programs
8.3.3 Software Diversity

Dynamic SW Fault Tolerance: Recovery Block
A single program is executed and the result
(including intermediate results) is subjected to
an Acceptance Test.

13
8. Fault Tolerance in Software

8.3 Dealing with Faulty Programs
8.3.3 Software Diversity

The term STATIC is used because the selection of
the acceptable result does not affect the
subsequent execution of the programs.
The term DYNAMIC is used because the selection
between the original and alternate program is
made during execution based on the outcome of the
Acceptance Test.

14
8. Fault Tolerance in Software

8.4 Design of Fault Tolerant Software Using
Diversity
8.4.1 N-Version Programming

Defined as the independent generation of N ≥ 2
functionally equivalent programs, called
versions, from the same initial specification.
With N = 2, fault masking is not provided, and upon
disagreement among the versions, 3 alternatives
are available:
Retry or restart (in this case fault containment
rather than FT is provided)
Transition to a predefined safe state, possibly
followed by later retries
Reliance on one of the versions, either
designated in advance as more reliable or
selected by a diagnostic program (in the latter
case the technique takes on some aspects of
dynamic redundancy).

15
8. Fault Tolerance in Software

8.4 Design of Fault Tolerant Software Using
Diversity
8.4.1 N-Version Programming

For N > 2, majority voting logic can be
implemented. For N = 3, the following are required:
(i) Three independent programs, each furnishing
identical output formats
(ii) An acceptance program that evaluates the outputs
of (i) and selects the result to be furnished as
the N-version output
(iii) A driver (process controller) that invokes
items (i) and (ii) and furnishes the
N-version output to other programs or the
physical system.
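
The three elements above can be sketched as follows. Everything here is illustrative: the three "versions" compute a simple mean and one of them is deliberately faulty, the numeric tolerance stands in for "agreement within specified limits", and the versions run one after another rather than concurrently.

```python
class NoMajorityError(Exception):
    """Raised when no majority of versions agrees."""


def majority_vote(results, tolerance=1e-6):
    """Acceptance program: return a result backed by a strict majority, else None."""
    needed = len(results) // 2 + 1
    for candidate in results:
        if candidate is None:
            continue
        agreeing = sum(1 for r in results
                       if r is not None and abs(r - candidate) <= tolerance)
        if agreeing >= needed:
            return candidate
    return None


def n_version_driver(versions, data, tolerance=1e-6):
    """Driver: invoke every version, vote, and furnish the N-version output."""
    results = []
    for version in versions:
        try:
            results.append(version(data))
        except Exception:
            results.append(None)  # an aborted version simply loses its vote
    voted = majority_vote(results, tolerance)
    if voted is None:
        raise NoMajorityError("versions disagree")
    return voted


# Three hypothetical, independently written versions of the same function.
def mean_v1(values):
    return sum(values) / len(values)


def mean_v2(values):
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


def mean_v3_faulty(values):
    return sum(values) / (len(values) - 1)  # seeded fault: wrong divisor
```

With these definitions, `n_version_driver([mean_v1, mean_v2, mean_v3_faulty], [1.0, 2.0, 3.0])` masks the faulty version and returns the value the other two agree on.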

16
8. Fault Tolerance in Software

8.4 Design of Fault Tolerant Software Using
Diversity
8.4.1 N-Version Programming

Experiment carried out at UCLA (1978):
7 separate versions of the application program were written
From these, 12 3-version sets were constructed
Each set was subjected to 32 test cases, yielding
384 total tests.
One of the conclusions:
In cases where a single faulty version resulted in
incorrect execution, the OS of the computer
intervened before the program reached the voting
stage. Most later N-version experiments overcame
this problem by incorporating acceptance tests
for abort conditions and precluding the
intervention of the OS under these conditions.

17
8. Fault Tolerance in Software

8.4 Design of Fault Tolerant Software Using
Diversity
8.4.1 N-Version Programming

Results of an Early N-Version Programming
Experiment.
18
8. Fault Tolerance in Software

8.4 Design of Fault Tolerant Software Using
Diversity
8.4.2 Recovery Block

Represents the Dynamic Redundancy Approach to SW
fault tolerance.
Consists of 3 SW elements:
a primary routine, which executes critical SW
functions
an acceptance test, which tests the output of the
primary routine after every execution
at least one alternate routine, which performs the
same function as the primary routine (but may be
less capable or slower) and is invoked by the
acceptance test upon detection of a failure.

19
8. Fault Tolerance in Software

8.4 Design of Fault Tolerant Software Using
Diversity
8.4.2 Recovery Block

The basic structure is:
Ensure T
By P
Else by Q
Else Error
where
T is the acceptance test condition that is
expected to be met by successful execution of
either the primary routine P or the alternate
routine Q.
The structure is easily expanded to accommodate
several alternates Q1, Q2, Q3, ..., Qn.
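
The Ensure/By/Else structure can be approximated in Python as below. This is a sketch under assumptions: the sorting example, the seeded fault in `fast_sort_faulty`, and the choice to hand each routine its own copy of the input are illustrative and not taken from the presentation.

```python
class RecoveryBlockError(Exception):
    """The 'Else Error' branch: no routine passed the acceptance test."""


def recovery_block(acceptance_test, routines, data):
    """Ensure T by P else by Q1 ... else by Qn else error."""
    for routine in routines:
        try:
            # Each routine gets its own copy, so a failed attempt
            # cannot corrupt the input seen by the next alternate.
            result = routine(list(data))
        except Exception:
            continue  # an abort counts as a failed attempt
        if acceptance_test(data, result):
            return result
    raise RecoveryBlockError("all routines failed the acceptance test")


def is_sorted_same_length(data, result):
    """T: output is ordered and has lost no elements (a partial check only)."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and len(result) == len(data)


def fast_sort_faulty(values):
    """P: hypothetical primary with a seeded fault (drops the last element)."""
    return sorted(values[:-1])


def simple_sort(values):
    """Q: slower but dependable alternate."""
    return sorted(values)
```

Here `recovery_block(is_sorted_same_length, [fast_sort_faulty, simple_sort], [3, 1, 2])` rejects the primary's output and falls back to the alternate. Note that the acceptance test is deliberately cheaper than the computation it checks, so it cannot catch every fault.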

20
8. Fault Tolerance in Software

8.4 Design of Fault Tolerant Software Using
Diversity
8.4.2 Recovery Block

The differences between Recovery Block and N-Version
Programming are:
only a single implementation of the program is
run at a time (in this case P or Q)
the acceptability of the results is decided by a
test rather than by comparison with functionally
equivalent alternate versions.

21
8. Fault Tolerance in Software

8.4 Design of Fault Tolerant Software Using
Diversity
8.4.2 Recovery Block

Real-time control applications require that
results furnished by a program be both correct
and timely.
For this reason, the recovery block for a
real-time program should incorporate a watchdog
timer which initiates execution by Q (if P does
not produce an acceptable result within the
allocated time).
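
One way to approximate the watchdog in Python is to wait on the primary routine with a timeout. This is a sketch only: a real-time system would use a hardware or OS timer interrupt, Python threads cannot be forcibly stopped (the late primary keeps running in the background), and all names and the returned "P"/"Q" tag are assumptions for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def timed_recovery_block(acceptance_test, primary, alternate, argument, deadline_s):
    """Run P under a deadline; fall back to Q on timeout, abort, or rejection."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary, argument)
    try:
        result = future.result(timeout=deadline_s)  # wait at most deadline_s
        if acceptance_test(result):
            return "P", result
    except FutureTimeout:
        pass  # watchdog expired: P missed its deadline
    except Exception:
        pass  # P aborted
    finally:
        pool.shutdown(wait=False)  # do not block on a late primary

    result = alternate(argument)
    if acceptance_test(result):
        return "Q", result
    raise RuntimeError("neither P nor Q produced an acceptable result")


def slow_primary(x):
    """Hypothetical primary that overruns a short deadline."""
    time.sleep(0.3)
    return 2 * x


def quick_alternate(x):
    """Hypothetical alternate that answers immediately."""
    return 2 * x
```

The first element of the returned pair records which routine furnished the result, which makes missed deadlines visible to the caller.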

22
8. Fault Tolerance in Software

8.4 Design of Fault Tolerant Software Using
Diversity
8.4.2 Recovery Block

Recovery block for real-time application. (Program
flow under direction of the application module
is shown in solid lines; timer-triggered
interrupts are shown in dashed lines.)
23
8. Fault Tolerance in Software

8.4 Design of Fault Tolerant Software Using
Diversity
8.4.2 Recovery Block
Highlights ...

A single program is executed at any given time
No special demands on computer redundancy or
computer architecture are made.
Performance penalty in normal operation is
small:
only the execution of the acceptance test.

24
8. Fault Tolerance in Software

8.4 Design of Fault Tolerant Software Using
Diversity
8.4.2 Recovery Block
Highlights ...

Storage requirements are expanded:
in addition to the primary application program,
the acceptance test and the backup program must
also be available in memory.
SW development cost is increased:
two programs and the associated
acceptance test need to be generated.

25
8. Fault Tolerance in Software

8.4 Design of Fault Tolerant Software Using
Diversity
8.4.2 Recovery Block
Details about the Basic Recovery Block Structure
...

The Acceptance Test is divided into two groups of
tests, which are invoked before and after the
execution of the primary routine.
Before:
The first acceptance test checks the call
format and parameters.
The second acceptance test checks the validity
of the input data. (When data errors are common,
provision of an alternate data source may be
considered; dashed lines in the figure indicate the backup
data.)
After:
The last acceptance test examines the output data.
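
The ordering of the three tests around the primary routine can be sketched as below. The concrete checks (a `gain` parameter, sensor readings between 0 and 100, an output limit of 1000) are invented for the illustration, as are all function names.

```python
class AcceptanceFailure(Exception):
    """Raised when one of the three acceptance tests rejects the run."""


def params_ok(params):
    """Test 1 (before): call format and parameters."""
    return isinstance(params, dict) and isinstance(params.get("gain"), (int, float))


def data_ok(data):
    """Test 2 (before): validity of the input data."""
    return len(data) > 0 and all(0 <= v <= 100 for v in data)


def output_ok(output):
    """Test 3 (after): plausibility of the output data."""
    return 0 <= output <= 1000


def scaled_mean(params, data):
    """Hypothetical primary routine."""
    return params["gain"] * sum(data) / len(data)


def guarded_call(primary, params, data, backup_data=None):
    if not params_ok(params):
        raise AcceptanceFailure("call format or parameters rejected")
    if not data_ok(data):
        # Alternate data source, if one is provided and itself valid.
        if backup_data is not None and data_ok(backup_data):
            data = backup_data
        else:
            raise AcceptanceFailure("input data rejected")
    output = primary(params, data)
    if not output_ok(output):
        raise AcceptanceFailure("output rejected")
    return output
```

A caller structured as a recovery block would treat `AcceptanceFailure` as the trigger for invoking the alternate routine.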

26
8. Fault Tolerance in Software

8.4 Design of Fault Tolerant Software Using
Diversity
8.4.2 Recovery Block

Internal Structure for primary application module.
27
8. Fault Tolerance in Software

8.4 Design of Fault Tolerant Software Using
Diversity
8.4.2 Recovery Block

The integration of application modules
structured as recovery blocks into a
fault-tolerant SW system is shown in the next
figure.
Application Modules and the decision diamond
labeled Return together represent the structure
shown in figure .
In the absence of failures of the recovery
blocks, the process will always remain in the
inner loop.
If an abort is taken, the failure is recorded
and some diagnostics may be performed. In case of
a first failure in a recovery block, a retry may
be initiated. If the failure persists, further
execution of the task represented by the recovery
block is suspended.
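
The abort handling described above (record the failure, retry once, then suspend the task) might look like this in an executive loop. The data structures, the single-retry policy, and the `make_module` helper are assumptions made for the sketch, and diagnostics are reduced to a log entry.

```python
def executive_cycle(modules, suspended, failure_log):
    """Run one pass over all application modules that are not suspended."""
    for name, module in modules.items():
        if name in suspended:
            continue
        for attempt in (1, 2):  # first execution, then a single retry
            try:
                module()
                break  # normal return: stay in the inner loop
            except Exception as error:
                failure_log.append((name, attempt, repr(error)))  # record abort
        else:
            suspended.add(name)  # failure persisted: suspend this task


def make_module(failures):
    """Build a demo module whose recovery block aborts `failures` times."""
    remaining = [failures]

    def module():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("recovery block aborted")

    return module
```

A module that aborts once is retried and continues to run in later cycles, while one that aborts twice in the same cycle is added to `suspended` and skipped from then on.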

28
8. Fault Tolerance in Software

8.4 Design of Fault Tolerant Software Using
Diversity
8.4.2 Recovery Block

Executive and application modules.
