1. Review of Multiprocessors and Fault Tolerance - PowerPoint PPT Presentation

Provided by: Fabia177

1
8. Fault Tolerance in Software
  • 8.1 Introduction
  • Is it true that a program that has once performed
    a given task as specified will continue to do so?
  • Yes, provided that none of the following
    parameters change:
  • The inputs
  • The computing environment
  • The user requirements

2
8. Fault Tolerance in Software
  • 8.1 Introduction

Federal Reserve Funds Transfer Program, active 12
hours/day, 5 days/week.
Consistency of failure rates in time.
3
8. Fault Tolerance in Software
  • 8.1 Introduction

According to the Data and Analysis Center for Software (DACS),
fault density, the number of faults per 1000 lines of
code, ranges from 10 to 50 for good SW and from
1 to 5 after intensive testing using automated
tools.
Failure rates of Command and Control Systems.
4
8. Fault Tolerance in Software
  • 8.1 Introduction

Consequences of SW failure: Attendees have
personal experience with incorrect billing and lost
airline or hotel reservations. More serious
errors are reported in the media, such as the
disruption of phone service to over 20 million
customers during the summer of 1991 due to a coding
error in a new-generation digital switch. The
most serious consequences are related to
real-time applications, such as those involving
spacecraft: the launch failure of Mariner I
(1962), the destruction of a French
meteorological satellite in 1968, and several
problems during the Apollo missions in the early
1970s. Other systems affected include the NASA
Space Shuttle, the fly-by-wire Airbus A320, the
Russian satellite Mars, and the satellite
launcher Ariane.
5
8. Fault Tolerance in Software
  • 8.1 Introduction
  • Causes of SW failure:
  • Malfunction of a process, e.g., exception
    handling, timeout computation, design error
    (solution: check the outputs and timer)
  • Erroneous control sequence (solution: set an
    upper limit on loop iterations)
  • Data entry error (solution: use
    error-detecting code and type checks on input
    data).

6
8. Fault Tolerance in Software
  • 8.3 Dealing with Faulty Programs
  • 8.3.1 Robustness
  • The minimum requirement is that the program will
    properly handle inputs that are out of range, or of a
    different type or format than defined, without
    degrading its performance of functions not
    dependent on the nonstandard input.
  • When the input data are found not to comply
    with the program specification:
  • a new input may be requested
  • the last acceptable value of a variable can be
    used
  • or a predefined default can be assigned.
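
The three fallback options above could be sketched in Python roughly as follows. This is only an illustration: the function name `robust_read`, the range limits, and the `request_new` callback are assumptions made for this example and do not come from the slides.

```python
from typing import Callable, Optional


def robust_read(raw: object,
                low: float,
                high: float,
                last_good: Optional[float] = None,
                default: float = 0.0,
                request_new: Optional[Callable[[], object]] = None) -> float:
    """Return a usable value even when `raw` violates the input specification."""

    def valid(x: object) -> bool:
        # Type check first (bool is excluded on purpose), then range check.
        return (isinstance(x, (int, float)) and not isinstance(x, bool)
                and low <= x <= high)

    if valid(raw):
        return float(raw)
    # Option 1: request a new input (a single attempt in this sketch).
    if request_new is not None:
        retry = request_new()
        if valid(retry):
            return float(retry)
    # Option 2: reuse the last acceptable value of the variable.
    if last_good is not None:
        return last_good
    # Option 3: fall back to a predefined default.
    return default
```

The order in which the three options are tried is a design choice of this sketch; the slides list them as alternatives without ranking them.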

7
8. Fault Tolerance in Software
  • 8.3 Dealing with Faulty Programs
  • 8.3.1 Robustness
  • In general, Robustness is used to test:
  • the function of a process (e.g., by checking the
    outputs)
  • the control sequence (e.g., by setting an upper
    limit on loop iterations)
  • the input data (e.g., by using error-detecting
    code and type checks).
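
A small Python sketch of these three checks applied to one routine, assuming a Newton iteration for the square root as the process under test. The routine, the name `bounded_sqrt`, the iteration cap, and the tolerances are illustrative assumptions, not taken from the slides.

```python
def bounded_sqrt(x: float, max_iter: int = 50, tol: float = 1e-9) -> float:
    """Newton iteration for sqrt(x) with input, control, and output checks."""
    # Input data check: reject values outside the specified range.
    if x < 0:
        raise ValueError("input out of range")
    if x == 0:
        return 0.0
    guess = x if x >= 1 else 1.0
    # Control sequence check: hard upper limit on loop iterations.
    for _ in range(max_iter):
        nxt = 0.5 * (guess + x / guess)
        converged = abs(nxt - guess) <= tol * nxt
        guess = nxt
        if converged:
            break
    else:
        raise RuntimeError("iteration limit exceeded")
    # Function check on the output: the result squared must reproduce x.
    if abs(guess * guess - x) > 1e-6 * max(1.0, x):
        raise RuntimeError("output failed plausibility check")
    return guess
```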

8
8. Fault Tolerance in Software
  • 8.3 Dealing with Faulty Programs
  • 8.3.2 Temporal Redundancy
  • Temporal Redundancy consists of the reexecution
    of a program when an error is encountered. The
    error may involve faulty data (as detected by
    Robustness), faulty execution (e.g., accessing
    protected memory), or incorrect output (as
    detected by Acceptance Tests).
  • Temporary reexecution will clear errors that
    arose from temporary circumstances that are no
    longer present when a new pass through the
    program is taken.
  • E.g., busy or noisy communication channels, full
    buffers, power supply transients.

9
8. Fault Tolerance in Software
  • 8.3 Dealing with Faulty Programs
  • 8.3.2 Temporal Redundancy

When the error persists, Fault Containment
Procedures must be triggered by the system.
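
Reexecution followed by escalation can be sketched in Python as below. The names `reexecute` and `PersistentFault`, the attempt limit, and the delay are assumptions for illustration; the slides do not specify how many passes are taken before fault containment is triggered.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class PersistentFault(Exception):
    """Signals that reexecution did not clear the error."""


def reexecute(task: Callable[[], T],
              max_attempts: int = 3,
              delay_s: float = 0.0) -> T:
    """Run `task` again when it fails; escalate if the error persists."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                # Give a transient condition (busy channel, full buffer) time to clear.
                time.sleep(delay_s)
    # The error persisted: hand over to the fault containment procedures.
    raise PersistentFault(f"failed after {max_attempts} attempts") from last_error
```

A caller would catch `PersistentFault` at the point where containment (for example, isolating the task) is implemented.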
10
8. Fault Tolerance in Software
  • 8.3 Dealing with Faulty Programs
  • 8.3.3 Software Diversity
  • SW Diversity permits uninterrupted system
    operation in the presence of program faults
    through multiple implementations of a given
    functional process, and it is therefore
    particularly applicable to real-time control
    systems.
  • It is divided into two categories:
  • Static SW Fault Tolerance: N-Version Programming
  • Dynamic SW Fault Tolerance: Recovery Block

11
8. Fault Tolerance in Software
  • 8.3 Dealing with Faulty Programs
  • 8.3.3 Software Diversity
  • Static SW Fault Tolerance: N-Version Programming
  • A given task is executed by several programs
    (consecutively on the same machine) and the
    result is accepted only if a specified number of programs
    agree within specified limits. The same computer
    performs comparison and selection of the results
    to be propagated to the external system.
  • In practice, the programs are executed
    concurrently, and therefore multiple computers
    are required to implement this technique.

12
8. Fault Tolerance in Software
  • 8.3 Dealing with Faulty Programs
  • 8.3.3 Software Diversity
  • Dynamic SW Fault Tolerance: Recovery Block
  • A single program is executed and the result
    (including intermediate results) is subjected to
    an Acceptance Test.

13
8. Fault Tolerance in Software
  • 8.3 Dealing with Faulty Programs
  • 8.3.3 Software Diversity
  • The term STATIC is used because the selection of
    the acceptable result does not affect the
    subsequent execution of the programs.
  • The term DYNAMIC is used because the selection
    between the original and alternate program is
    made during execution based on the outcome of the
    Acceptance Test.

14
8. Fault Tolerance in Software
  • 8.4 Design of Fault Tolerant Software Using
    Diversity
  • 8.4.1 N-Version Programming
  • Defined as the independent generation of N ≥ 2
    functionally equivalent programs, called
    versions, from the same initial specification.
    For N = 2, fault masking is not provided and, upon
    disagreement among the versions, 3 alternatives
    are available:
  • Retry or restart (in this case fault containment
    rather than FT is provided)
  • Transition to a predefined safe state, possibly
    followed by later retries
  • Reliance on one of the versions, either
    designated in advance as more reliable or
    selected by a diagnostic program (in the latter
    case the technique takes on some aspects of
    dynamic redundancy).
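
The three alternatives for a two-version system can be illustrated with the following Python sketch. The function `run_two_versions`, the policy names, the tolerance, and the safe value are hypothetical; reliance on a version chosen by a diagnostic program is not modeled, only a version designated in advance.

```python
from typing import Callable

Version = Callable[[float], float]


def run_two_versions(v1: Version, v2: Version, x: float, *,
                     tol: float = 1e-6,
                     policy: str = "safe_state",
                     safe_value: float = 0.0,
                     max_retries: int = 1) -> float:
    """Compare two versions; on disagreement apply one of three policies."""
    attempts = 1 + (max_retries if policy == "retry" else 0)
    for _ in range(attempts):
        a, b = v1(x), v2(x)
        if abs(a - b) <= tol:
            return a  # the versions agree within the specified limit
    # Disagreement persists; with two versions the fault cannot be masked.
    if policy == "safe_state":
        return safe_value  # transition to a predefined safe state
    if policy == "trust_first":
        return a  # v1 was designated in advance as more reliable
    raise RuntimeError("versions still disagree after retry")
```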

15
8. Fault Tolerance in Software
  • 8.4 Design of Fault Tolerant Software Using
    Diversity
  • 8.4.1 N-Version Programming
  • For N > 2, a majority voting logic can be
    implemented. For N = 3, the following are required:
  • (i) Three independent programs, each furnishing
    identical output formats
  • (ii) An acceptance program that evaluates the output
    of (i) and selects the result to be furnished as
    the N-version output
  • (iii) A driver (process controller) that invokes
    requirements (i) and (ii) and furnishes the
    N-version output to other programs or the
    physical system.
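
A minimal Python sketch of the three components for N = 3. The task (summing the integers 1..n), the three version functions, and the names `majority_vote` and `driver` are invented for illustration; the sketch also assumes that outputs are hashable and compared for exact equality rather than within limits.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


# (i) Three independently written programs with identical output formats.
def version_a(n: int) -> int:
    return n * (n + 1) // 2


def version_b(n: int) -> int:
    return sum(range(1, n + 1))


def version_c(n: int) -> int:
    return sum(range(1, n))  # deliberately faulty: off-by-one


# (ii) Acceptance program: select the value reported by a majority.
def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 > len(outputs):
        return value
    raise RuntimeError("no majority among versions")


# (iii) Driver: invoke the versions, then the voter, and furnish the result.
def driver(versions: Sequence[Callable[[int], Hashable]], n: int) -> Hashable:
    outputs = [v(n) for v in versions]
    return majority_vote(outputs)
```

Calling `driver([version_a, version_b, version_c], 10)` masks the fault in `version_c`, because the two correct versions outvote it.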

16
8. Fault Tolerance in Software
  • 8.4 Design of Fault Tolerant Software Using
    Diversity
  • 8.4.1 N-Version Programming
  • Experiment carried out at UCLA (1978)
  • 7 separate versions for the application program
  • From this, 12 3-version sets were constructed
  • Each set was subjected to 32 test cases, yielding
    384 total tests.
  • One of the conclusions:
  • In cases where a single faulty version resulted in
    incorrect execution, the OS of the computer
    intervened before the program reached the voting
    stage. Most later N-version experiments overcame
    this problem by incorporating acceptance tests
    for abort conditions and precluding the
    intervention of the OS under these conditions.
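
One way to picture the remedy is to guard each version so that an abort becomes an abstention instead of terminating the whole program before the vote. The Python sketch below uses an uncaught exception as a stand-in for OS intervention; that analogy, the `ABORTED` sentinel, and the function names are assumptions of this example, not a description of the original experiments.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence

ABORTED = object()  # sentinel marking a version that aborted


def guarded(version: Callable[[int], Hashable], x: int) -> object:
    """Run one version; turn any abort condition into an abstention."""
    try:
        return version(x)
    except Exception:
        return ABORTED


def vote_with_aborts(versions: Sequence[Callable[[int], Hashable]], x: int) -> object:
    outputs = [guarded(v, x) for v in versions]
    valid = [o for o in outputs if o is not ABORTED]
    if not valid:
        raise RuntimeError("all versions aborted")
    value, count = Counter(valid).most_common(1)[0]
    # Majority is measured against all versions, not only the survivors.
    if count * 2 > len(versions):
        return value
    raise RuntimeError("no majority among versions")
```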

17
8. Fault Tolerance in Software
  • 8.4 Design of Fault Tolerant Software Using
    Diversity
  • 8.4.1 N-Version Programming

Results of an Early N-Version Programming
Experiment.
18
8. Fault Tolerance in Software
  • 8.4 Design of Fault Tolerant Software Using
    Diversity
  • 8.4.2 Recovery Block
  • Represents the Dynamic Redundancy Approach to SW
    fault tolerance.
  • Consists of 3 SW elements:
  • a primary routine, which executes critical SW
    functions
  • an acceptance test, which tests the output of the
    primary routine after every execution
  • at least one alternate routine which performs the
    same function as the primary routine (but may be
    less capable or slower) and is invoked by the
    acceptance test upon detection of a failure.

19
8. Fault Tolerance in Software
  • 8.4 Design of Fault Tolerant Software Using
    Diversity
  • 8.4.2 Recovery Block
  • The basic structure is:
  • Ensure T
  • By P
  • Else by Q
  • Else Error
  • Where:
  • T is the acceptance test condition that is
    expected to be met by successful execution of
    either the primary routine P or the alternate
    routine Q.
  • The structure is easily expanded to accommodate
    several alternates Q1, Q2, Q3, ..., Qn.
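
The Ensure/By/Else structure can be approximated in Python as shown below. Restoring the input state before each alternate is an assumption added here (the slides do not mention checkpointing), and the sorting task, `faulty_sort`, and `is_sorted` are invented purely to exercise the structure.

```python
import copy
from typing import Any, Callable, Sequence


def recovery_block(state: Any,
                   acceptance_test: Callable[[Any], bool],
                   routines: Sequence[Callable[[Any], Any]]) -> Any:
    """Ensure T by P else by Q1 ... else by Qn else error."""
    checkpoint = copy.deepcopy(state)  # assumed recovery point
    for routine in routines:  # P first, then the alternates in order
        candidate = copy.deepcopy(checkpoint)  # every routine sees the same input
        try:
            result = routine(candidate)
        except Exception:
            continue  # an abort counts as a failed attempt
        if acceptance_test(result):
            return result
    raise RuntimeError("all routines failed the acceptance test")


def is_sorted(xs: list) -> bool:
    """Acceptance test T: the output must be in non-decreasing order."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


def faulty_sort(xs: list) -> list:
    """Primary routine P with a deliberate bug: it only reverses the list."""
    xs.reverse()
    return xs
```

With `routines=[faulty_sort, sorted]`, the primary fails T on the input `[3, 1, 2]` and the built-in `sorted` acts as the alternate Q.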

20
8. Fault Tolerance in Software
  • 8.4 Design of Fault Tolerant Software Using
    Diversity
  • 8.4.2 Recovery Block
  • The differences between Recovery Block and N-Version
    Programming are:
  • only a single implementation of the program is
    run at a time (in this case P or Q)
  • the acceptability of the results is decided by a
    test rather than by comparison with functionally
    equivalent alternate versions.

21
8. Fault Tolerance in Software
  • 8.4 Design of Fault Tolerant Software Using
    Diversity
  • 8.4.2 Recovery Block
  • Real-time control applications require that
    results furnished by a program be both correct
    and timely.
  • For this reason, the recovery block for a
    real-time program should incorporate a watchdog
    timer, which initiates execution of Q if P does
    not produce an acceptable result within the
    allocated time.
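
A rough Python approximation of a watchdog-guarded recovery block, using a bounded wait on a worker thread in place of a timer interrupt. This is a simplification: Python cannot forcibly stop the late primary, so it keeps running in the background, which a real-time system would not tolerate. All names and the example routines are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Any, Callable


def timed_recovery_block(x: Any,
                         primary: Callable[[Any], Any],
                         alternate: Callable[[Any], Any],
                         acceptance_test: Callable[[Any], bool],
                         deadline_s: float) -> Any:
    """Run P under a deadline; switch to Q on timeout, abort, or rejection."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary, x)
    try:
        result = future.result(timeout=deadline_s)  # watchdog: bounded wait
        if acceptance_test(result):
            return result
    except FutureTimeout:
        pass  # P missed its deadline
    except Exception:
        pass  # P aborted
    finally:
        pool.shutdown(wait=False)  # do not block on a late primary
    result = alternate(x)
    if acceptance_test(result):
        return result
    raise RuntimeError("alternate failed the acceptance test")


def slow_primary(x: int) -> int:
    time.sleep(0.3)  # simulates a primary that overruns its time budget
    return x * 2


def quick_alternate(x: int) -> int:
    return x + x  # less capable in principle, but timely
```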

22
8. Fault Tolerance in Software
  • 8.4 Design of Fault Tolerant Software Using
    Diversity
  • 8.4.2 Recovery Block

Recovery block for real-time application. (Program
flow under direction of the application module
is shown in solid lines; timer-triggered
interrupts are shown in dashed lines.)
23
8. Fault Tolerance in Software
  • 8.4 Design of Fault Tolerant Software Using
    Diversity
  • 8.4.2 Recovery Block
  • Highlights ...
  • A single program is executed at any given time
  • No special demands on computer redundancy or
    computer architecture are made.
  • Performance penalty in normal operation is
    small:
  • only the execution of the acceptance test.

24
8. Fault Tolerance in Software
  • 8.4 Design of Fault Tolerant Software Using
    Diversity
  • 8.4.2 Recovery Block
  • Highlights ...
  • Storage requirements are expanded:
  • in addition to the primary application program,
    the acceptance test and the backup program must
    also be available in memory.
  • SW development cost is increased:
  • two programs and the associated acceptance test
    need to be generated.

25
8. Fault Tolerance in Software
  • 8.4 Design of Fault Tolerant Software Using
    Diversity
  • 8.4.2 Recovery Block
  • Details about the Basic Recovery Block Structure
    ...
  • The Acceptance Test is divided into 2 separate
    sets of tests, which are invoked before and after the
    execution of the primary routine:
  • Before:
  • The first acceptance test checks the call
    format and parameters.
  • The second acceptance test checks the validity
    of the input data. (When data errors are common,
    provision of an alternate data source may be
    considered; dashed lines in the figure indicate
    the backup data.)
  • After:
  • The last acceptance test examines the output data.
26
8. Fault Tolerance in Software
  • 8.4 Design of Fault Tolerant Software Using
    Diversity
  • 8.4.2 Recovery Block

Internal Structure for primary application module.
27
8. Fault Tolerance in Software
  • 8.4 Design of Fault Tolerant Software Using
    Diversity
  • 8.4.2 Recovery Block
  • The integration of application modules
    structured as recovery blocks into a
    fault-tolerant SW system is shown in the next
    figure.
  • Application Modules and the decision diamond
    labeled Return together represent the structure
    shown in figure .
  • In the absence of failures of the recovery
    blocks, the process will always remain in the
    inner loop.
  • If an abort is taken, the failure is recorded
    and some diagnostics may be performed. In case of
    a first failure in a recovery block, a retry may
    be initiated. If the failure persists, further
    execution of the task represented by the recovery
    block is suspended.
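
The executive behavior described above (record the failure, retry once, then suspend the task) might look like this in Python. The `Executive` class, its attribute names, and the single-retry policy are assumptions of this sketch; diagnostics are reduced to logging.

```python
from typing import Any, Callable, List, Set, Tuple


class Executive:
    """Runs recovery blocks, records aborts, and suspends persistent failures."""

    def __init__(self) -> None:
        self.failure_log: List[Tuple[str, int, str]] = []  # recorded failures
        self.suspended: Set[str] = set()  # tasks whose failure persisted

    def run(self, name: str, block: Callable[[], Any]) -> Any:
        if name in self.suspended:
            return None  # further execution of this task is suspended
        for attempt in (1, 2):  # first execution plus one retry
            try:
                return block()
            except Exception as exc:
                # An abort was taken: record it for later diagnostics.
                self.failure_log.append((name, attempt, repr(exc)))
        self.suspended.add(name)  # the failure persisted
        return None
```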

28
8. Fault Tolerance in Software
  • 8.4 Design of Fault Tolerant Software Using
    Diversity
  • 8.4.2 Recovery Block

Executive and application modules.