Application Level Fault Tolerance and Detection - PowerPoint PPT Presentation

Slides: 23
Provided by: lordt
Transcript and Presenter's Notes

Title: Application Level Fault Tolerance and Detection


1
Application Level Fault Tolerance and Detection
  • Principal Investigators
  • C. Mani Krishna, Israel Koren
  • Presented By
  • Eric Ciocca

Architecture and Real-Time Systems (ARTS) Lab,
Department of Electrical and Computer Engineering,
University of Massachusetts, Amherst, MA 01003
2
What is ALFTD?
  • Application Level Fault Tolerance and Detection
  • ALFTD complements existing system- or algorithm-
    level fault tolerance by leveraging information
    available only at the application level
  • Using such application-level semantic information
    significantly reduces the overall cost of
    providing fault tolerance
  • ALFTD may be used alone or to supplement other
    fault detection schemes

3
ALFTD Overview
  • Application Level Fault Tolerance and Detection
    allows the system to survive both data faults and
    system (instruction/hardware) faults
      • System faults cause a process to eventually
        cease functioning
      • Data faults cause a process to continue running
        with incorrect results
  • ALFTD is scalable
      • The level of fault tolerance can be traded off
        against the time overhead invested

4
Principles of ALFTD
[Figure: four nodes, labeled Node 1 through Node 4]
  • To provide system fault tolerance, every physical
    node runs its own work (P, primary) as well as a
    scaled-down copy of a neighboring node's work
    (S, secondary)
  • If a fault should corrupt a process, the
    corresponding secondary of that task will still
    produce output, albeit at a lower (but
    acceptable) quality
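
The primary/secondary placement described on this slide can be sketched in a few lines. This is a minimal illustration, not the ARTS Lab implementation: the ring layout (node i hosts the secondary of node i+1) and the names `assign_processes` and `surviving_outputs` are assumptions made for the example; the slide only says the secondary belongs to "a neighboring node".

```python
def assign_processes(num_nodes):
    """Give every node its own primary task plus a neighbor's secondary.

    Assumption: neighbors form a ring, so node i hosts the scaled-down
    secondary of task (i + 1) % num_nodes.
    """
    if num_nodes < 2:
        raise ValueError("need at least two nodes to host a secondary elsewhere")
    return {
        node: {"primary": node, "secondary": (node + 1) % num_nodes}
        for node in range(num_nodes)
    }


def surviving_outputs(assignment, failed_nodes):
    """Report, per task, which copy can still produce output."""
    failed = set(failed_nodes)
    status = {}
    for task in assignment:
        if task not in failed:
            status[task] = "primary"  # full-quality result
            continue
        # The primary's node is down: look for a healthy node hosting the secondary.
        hosts = [
            node
            for node, procs in assignment.items()
            if procs["secondary"] == task and node not in failed
        ]
        status[task] = "secondary" if hosts else "lost"  # secondary = reduced quality
    return status


if __name__ == "__main__":
    layout = assign_processes(4)
    print(surviving_outputs(layout, failed_nodes=[1]))
```

With a single failed node, every task still yields output, one of them at reduced quality. Two adjacent failures can lose a task under this particular layout, which is one reason the placement policy is worth treating as a design choice.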

5
Principles of ALFTD
  • The secondary processes can be scaled down by:
      • reducing the resolution of input data
      • reducing the precision of calculations
      • heuristically predicting results from previous
        iterations' output
  • In some applications the secondary can be run
    optionally, on an as-needed basis:
      • if the corresponding primary is approaching a
        deadline miss
      • if the corresponding primary has been
        incapacitated
      • if the corresponding primary has produced faulty
        data
  • If faults are infrequent, an optional secondary
    will incur very little additional overhead
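
The three triggers for an optional secondary, and the claim about low overhead when faults are rare, can be made concrete with a small sketch. Everything here is illustrative: the function names, the `margin` parameter (how close to the deadline counts as "approaching a miss"), and the simple overhead model (trigger rate times relative secondary cost) are assumptions, not details from the presentation.

```python
def should_run_secondary(elapsed, deadline, primary_alive, output_ok, margin=0.9):
    """Decide whether the optional secondary needs to run.

    elapsed, deadline: time used so far and time allowed, in the same unit.
    margin: hypothetical fraction of the deadline treated as "approaching a miss".
    """
    approaching_miss = elapsed >= margin * deadline
    return approaching_miss or not primary_alive or not output_ok


def expected_overhead(trigger_rate, secondary_cost):
    """Average extra work per task, relative to one primary run.

    trigger_rate: fraction of tasks for which the secondary is started.
    secondary_cost: cost of one secondary run as a fraction of the primary.
    """
    return trigger_rate * secondary_cost


if __name__ == "__main__":
    print(should_run_secondary(0.5, 1.0, primary_alive=True, output_ok=True))
    print(should_run_secondary(0.5, 1.0, primary_alive=True, output_ok=False))
    print(expected_overhead(trigger_rate=0.01, secondary_cost=0.33))
```

Under this model, a secondary that costs a third of the primary and is triggered for one task in a hundred adds well under one percent of average work, which matches the qualitative point on the slide.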

6
ALFTD in OTIS
  • ALFTD was implemented in OTIS (Orbital Thermal
    Imaging Spectrometer) to test its viability as a
    fault tolerance and detection scheme
  • OTIS, part of the REE (Remote Exploration and
    Experimentation) program group from JPL, is
    intended to run on orbiting satellites
  • OTIS processes radiation data of a geographic
    area from a sensor array input and produces
    temperature and emissivity data output

7
OTIS Structure
[Figure: OTIS structure, showing MPI, a master process (M), slave processes (S), and OUTPUT, with numbered steps]
1. MPI starts
2. MPI starts slave and master processes
3. Master sends tasks
4. Slave calculations
5. Slave returns results
8
ALFTD in OTIS (cont'd)
  • ALFTD is suited for remote applications:
      • As a software-based fault handling mechanism, it
        requires no extra hardware
      • The scaled secondaries require less power than
        full software redundancy
  • In OTIS, and in other applications, ALFTD is
    passive, requiring extra runtime only when a
    fault occurs

9
ALFTD OTIS Structure
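[Figure: ALFTD OTIS structure, showing MPI, a master process (M), processes P1, S2, P2, S3, P3, and S1, a "?" node, and OUTPUT, with numbered steps]
1. MPI starts
2. MPI starts master and slaves, primary and secondary processes
3. Master sends tasks
4. Slave calculations
5. Slave returns results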
10
Secondaries in OTIS
  • The secondary required for ALFTD is implemented
    to be functionally similar to the primary
  • Secondary scaling occurs through resolution
    reduction
      • OTIS's natural temperature input data exhibits
        spatial locality
      • Points not directly calculated can be estimated
        by interpolating between calculated points
  • Secondary processes have been tested at 20-50%
    of the primary calculation overhead
      • While 50% affords better quality, 20% has less
        overhead
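
Resolution reduction with interpolation can be illustrated on a single row of input. This is a sketch under stated assumptions: `secondary_row`, its `step` parameter, and the placeholder `toy_model` are hypothetical, and linear interpolation is chosen only for simplicity; the slides do not say which interpolation OTIS uses.

```python
def secondary_row(inputs, compute, step):
    """Run `compute` on every `step`-th input and interpolate the rest.

    A larger step means fewer calls to the expensive model (lower overhead)
    and a coarser result. The last point is always computed so that every
    skipped point lies between two calculated neighbors.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    n = len(inputs)
    if n == 0:
        return []
    sampled = list(range(0, n, step))
    if sampled[-1] != n - 1:
        sampled.append(n - 1)
    values = {i: compute(inputs[i]) for i in sampled}

    out = [0.0] * n
    for left, right in zip(sampled, sampled[1:]):
        for i in range(left, right):
            t = (i - left) / (right - left)  # position between the two samples
            out[i] = values[left] + t * (values[right] - values[left])
    out[n - 1] = values[n - 1]
    return out


def toy_model(radiance):
    """Placeholder for the real per-point computation (not the OTIS algorithm)."""
    return 250.0 + 4.0 * radiance


if __name__ == "__main__":
    row = [float(x) for x in range(9)]
    print(secondary_row(row, toy_model, step=4))
```

A step of 2 calls the model on roughly half the points and a step of 4 on roughly a quarter, which is the kind of knob that trades quality against overhead.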

11
Example of Secondary Resolution
[Figure: panels at 100%, 50%, 33%, and 25% secondary resolution (ALFTD compensation for 10 rows in a sample dataset)]

12
Fault Detection
  • Output filters on the primary data determine when
    secondary validation is required
      • Output filters are created to check for
        application-specific trends in data
      • Deviations from normal data characteristics are
        treated as the product of potentially faulty
        processes
  • OTIS relies on natural temperature
    characteristics to detect potentially faulty data
      • Spatial locality: temperature changes gradually
        over small areas
      • Absolute bounds: temperature should not exceed
        certain values
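
An output filter built from these two checks might look like the following. This is a minimal sketch, not the OTIS filter: the function name `find_suspect_points` and the thresholds used in the example (`lower`, `upper`, `max_step`) are hypothetical, since the presentation gives no numeric limits.

```python
def find_suspect_points(row, lower, upper, max_step):
    """Return indices in one row of output that look potentially faulty.

    Absolute bounds: a value outside [lower, upper] is suspect.
    Spatial locality: a jump larger than max_step between adjacent
    points marks both points as suspect.
    """
    suspects = set()
    for i, value in enumerate(row):
        if not (lower <= value <= upper):
            suspects.add(i)
    for i in range(1, len(row)):
        if abs(row[i] - row[i - 1]) > max_step:
            suspects.update((i - 1, i))
    return sorted(suspects)


if __name__ == "__main__":
    temperatures = [290.0, 290.5, 291.0, 350.0, 291.5, 292.0]
    flagged = find_suspect_points(temperatures, lower=200.0, upper=340.0, max_step=5.0)
    if flagged:
        print("secondary validation required for indices", flagged)
```

Flagging both sides of a large jump is deliberately conservative: the filter cannot tell which neighbor is wrong, so it hands the whole region to the secondary for validation.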

13
Fault Detection (cont'd)
  • After the secondary has been run to validate a
    primary's results, the better data is chosen
    according to the following logic grid

[Table: logic grid; only the "Secondary Results" axis label survives in this transcript]
14
Data Sets
  • Three data sets (Spots, Blob, and Stripe) were
    chosen for their interesting characteristics

15
Fault Tolerance Results: Spots
  • Fault Tolerance with injected faults in Spots

16
Fault Tolerance Results: Spots (cont'd)
[Figure: panels showing fault-free output, faulty output, and ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead]
17
Fault Tolerance Results: Blob
  • Fault Tolerance with injected faults in Blob

18
Fault Tolerance Results: Blob (cont'd)
[Figure: panels showing fault-free output, faulty output, and ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead]
19
Fault Tolerance Results: Stripe
[Figure: difference plots of faulty output versus fault-free output, with panels for no ALFTD and for 25%, 33%, and 50% ALFTD computation overhead, on a scale from "No Error" to "Max Error"]
20
Fault Tolerance Results: Stripe (cont'd)
[Figure: panels showing fault-free output, faulty output, and ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead]
21
Conclusion / Future Work
  • ALFTD has been shown to be a cost-effective
    alternative to full redundancy
  • Improvements on the scheme will increase fault
    coverage and decrease secondary calculation
    overhead
  • OTIS has general application characteristics that
    will make its implementation a springboard to
    other, similar programs
  • ALFTD should continue to be effective in any
    programs that have predictable data
    characteristics

22
Thank You!
  • For additional information, please contact
  • Eric Ciocca (eciocca_at_ecs.umass.edu)
  • Israel Koren (koren_at_euler.ecs.umass.edu)
  • C. Mani Krishna (krishna_at_ecs.umass.edu)