Title: Application Level Fault Tolerance and Detection
1 Application Level Fault Tolerance and Detection
- Principal Investigators: C. Mani Krishna, Israel Koren
- Graduate Students: Diganta, Eric, Janhavi, Osman, Vijay
Architecture and Real-Time Systems (ARTS) Lab, Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003
2 What is ALFTD?
- Application Level Fault Tolerance and Detection
- ALFTD complements existing system-level or algorithm-level fault tolerance by leveraging information available only at the application level
- Using such application-level semantic information significantly reduces the overall cost of providing fault tolerance
- ALFTD may be used alone or to supplement other fault detection schemes
- ALFTD is scalable
- Error can be traded off against the time overhead invested in fault tolerance
3 ALFTD Overview
- Application Level Fault Tolerance and Detection allows a system to survive both data faults and system (instruction/hardware) faults
- System faults cause a process to eventually cease functioning
- Data faults cause a process to continue running with incorrect results
- ALFTD has been implemented in OTIS to determine its feasibility as a fault detection and tolerance method for REE applications
- OTIS has two sets of related output data: temperature and emissivity
- Experiments have focused mostly on the temperature output
4 OTIS Structure
[Figure: OTIS structure diagram; surviving step label: "4. Slave Calculations"]
5 OTIS Work Distribution
- OTIS dynamic workload distribution allows it to compensate for system faults
- Work originally partitioned for a failed processor is instead taken by the remaining processes
- OTIS does not compensate for data faults
- As long as the work is completed, there is no measure of correctness
- OTIS does not consider deadline repercussions
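The dynamic distribution described on this slide can be pictured as a shared queue of rows: a row held by a process that stops responding goes back on the queue and is picked up by a surviving process. The following is a minimal single-threaded simulation under that assumption, not OTIS code; `distribute_rows`, `fail_after`, and the worker names are hypothetical.

```python
from collections import deque


def distribute_rows(num_rows, workers, fail_after=None):
    """Simulate a master handing rows to workers one at a time.

    fail_after (hypothetical) maps a worker name to the number of rows it
    completes before it stops responding, i.e. a simulated system fault.
    Returns a dict mapping each row index to the worker that completed it.
    """
    fail_after = fail_after or {}
    pending = deque(range(num_rows))      # rows still to be computed
    alive = list(workers)                 # workers still responding
    done_count = {w: 0 for w in workers}  # rows completed per worker
    completed = {}

    while pending:
        if not alive:
            raise RuntimeError(f"all workers failed; {len(pending)} rows left")
        for worker in list(alive):
            if not pending:
                break
            row = pending.popleft()
            if done_count[worker] >= fail_after.get(worker, float("inf")):
                # The worker stopped while holding this row: requeue the row
                # so that a surviving worker takes it instead.
                alive.remove(worker)
                pending.append(row)
                continue
            completed[row] = worker
            done_count[worker] += 1
    return completed


result = distribute_rows(8, ["w0", "w1", "w2"], fail_after={"w1": 1})
print(result)
```

Nothing in this loop inspects the values a worker returns, which mirrors the slide's point: a system fault is absorbed by the remaining processes, but a data fault passes through unnoticed.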
6 OTIS Fault Cases
7 ALFTD OTIS Structure
8 Secondaries in OTIS
- The secondary required for ALFTD is implemented to be functionally similar to the primary
- Secondary scaling occurs through resolution reduction
- OTIS natural data input exhibits spatial locality
- Points not directly calculated can be estimated by interpolating between calculated points
- Secondary processes have been tested at 20-50% of the primary calculation overhead
- While 50% affords better quality, 20% has less overhead
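Resolution reduction with interpolation can be sketched as follows: compute only every `step`-th point of a row directly and fill the gaps linearly. This is an illustration under stated assumptions, not the OTIS secondary; `secondary_row`, `compute_point`, and `slow_temperature` are hypothetical names, and linear interpolation along a single row is an assumed choice.

```python
def secondary_row(compute_point, width, step):
    """Compute every `step`-th point of a row directly and fill the gaps
    by linear interpolation between the nearest computed points.

    Returns the reconstructed row and the number of points computed directly.
    """
    if step < 1 or width < 1:
        raise ValueError("step and width must be at least 1")
    # Always include the last column so every gap has two endpoints.
    sampled = sorted(set(range(0, width, step)) | {width - 1})
    row = [0.0] * width
    for x in sampled:
        row[x] = float(compute_point(x))
    for left, right in zip(sampled, sampled[1:]):
        span = right - left
        for x in range(left + 1, right):
            t = (x - left) / span
            row[x] = (1 - t) * row[left] + t * row[right]
    return row, len(sampled)


def slow_temperature(x):
    """Stand-in for the expensive per-point calculation (smooth data)."""
    return 280.0 + 0.5 * x


row, n_computed = secondary_row(slow_temperature, width=9, step=4)
print(n_computed, row)
```

Here 3 of the 9 points are computed directly, roughly a third of the primary's per-row work. Because the last column is always sampled, the real cost sits slightly above `1/step` for short rows.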
9 Example of Secondary Resolution
[Figure panels: 100%, 50%, 33%, and 25% secondary resolution (ALFTD compensation for 10 rows in a sample dataset)]
10 ALFTD Benefit
11 ALFTD Benefit (contd.)
12 Fault Detection
- When to run the secondary, and when to use the secondary output, is determined by output filters
- Output filters are created to check for application-specific trends in data
- Aberrations from normal data characteristics can be considered the product of potentially faulty processes
- OTIS relies on natural temperature characteristics to detect potentially faulty data
- Spatial locality: temperature changes gradually over small areas
- Absolute bounds: temperature should not exceed certain values
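The two temperature checks named on this slide can be sketched as a simple per-row filter. This is a minimal illustration, not the OTIS filter: the function name `suspicious_points` and the thresholds `low`, `high`, and `max_step` are assumptions chosen for the example.

```python
def suspicious_points(row, low, high, max_step):
    """Return the indices in `row` flagged as potentially faulty.

    A value is flagged if it lies outside [low, high] (absolute bounds),
    or if it differs from every existing neighbour by more than max_step
    (spatial locality). Requiring a jump on both sides flags an isolated
    spike without also flagging its healthy neighbours.
    """
    flagged = []
    for i, value in enumerate(row):
        # Absolute bounds check.
        if not (low <= value <= high):
            flagged.append(i)
            continue
        # Spatial locality check against the neighbours that exist.
        neighbours = [row[j] for j in (i - 1, i + 1) if 0 <= j < len(row)]
        if neighbours and all(abs(value - n) > max_step for n in neighbours):
            flagged.append(i)
    return flagged


# Hypothetical temperatures: one in-bounds spike and one out-of-bounds value.
row = [290.0, 290.5, 291.0, 350.0, 291.5, 292.0, 1.0e6]
flags = suspicious_points(row, low=200.0, high=400.0, max_step=5.0)
print(flags)
```

A row with any flagged index would be a candidate for validation by the secondary. Real thresholds would have to come from the characteristics of the data set.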
13 Data Sets
- Three data sets were chosen for their interesting characteristics
14 Data Frequency (Values)
15 Data Frequency (Spatial Locality)
16 Validation Through Secondaries
- When the primary deadline is hit, rows are re-delegated to the secondaries if (and only if) one of the following holds:
- The primary has returned results for that row that are suspected to be faulty; the secondary results can then be used to decide whether the results are indeed faulty
- A particular row was never successfully calculated; the secondary results can then be used immediately in place of the missing primary results
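The re-delegation rule on this slide amounts to sorting rows into two groups at the primary deadline. A minimal sketch, assuming primary results arrive as a mapping from row index to row data; `rows_for_secondary`, `is_suspect`, and `out_of_bounds` are hypothetical names, and the bounds used are illustrative only.

```python
def rows_for_secondary(num_rows, primary_results, is_suspect):
    """Pick the rows to hand to the secondaries at the primary deadline.

    primary_results maps row index -> row data for rows the primary returned.
    Returns (to_validate, to_replace):
      to_validate: rows returned by the primary but flagged by the filter
      to_replace:  rows that were never successfully calculated
    Rows that were returned and pass the filter are not re-delegated.
    """
    to_validate, to_replace = [], []
    for r in range(num_rows):
        if r not in primary_results:
            to_replace.append(r)
        elif is_suspect(primary_results[r]):
            to_validate.append(r)
    return to_validate, to_replace


def out_of_bounds(row):
    """Hypothetical output filter: any value outside assumed bounds."""
    return any(not (200.0 <= v <= 400.0) for v in row)


# Row 1 looks faulty; row 2 was never returned by the primary.
primary = {0: [290.0, 291.0], 1: [290.0, 9999.0], 3: [292.0, 293.0]}
to_validate, to_replace = rows_for_secondary(4, primary, out_of_bounds)
print(to_validate, to_replace)
```

Keeping the two groups separate matters because they are used differently: secondary output for a missing row is taken as-is, while secondary output for a suspect row is compared against the primary's.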
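17 Validation Through Secondaries (contd.)
- After the secondary has been run to verify a primary's results, the better data is chosen according to the following logic grid
[Table: logic grid for choosing between primary and secondary results; only the label "Secondary" survives]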
18 Fault Tolerance Results: Spots
- Fault tolerance with injected faults in Spots
19 Fault Tolerance Results: Spots (contd.)
[Figure panels: faulty output; fault-free output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead]
20 Fault Tolerance Results: Spots (contd.)
[Figure: difference plots of faulty output versus faultless output. Panels: no ALFTD; 25%, 33%, and 50% ALFTD computation overhead. Scale: "No Error" to "Max Error"]
21 Fault Tolerance Results: Blob
- Fault tolerance with injected faults in Blob
22 Fault Tolerance Results: Blob (contd.)
[Figure panels: faulty output; fault-free output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead]
23 Fault Tolerance Results: Blob (contd.)
[Figure: difference plots of faulty output versus faultless output. Panels: no ALFTD; 25%, 33%, and 50% ALFTD computation overhead. Scale: "No Error" to "Max Error"]
24 Fault Tolerance Results: Stripe
- Fault tolerance with injected faults in Stripe
25 Fault Tolerance Results: Stripe (contd.)
[Figure panels: faulty output; fault-free output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead]
26 Fault Tolerance Results: Stripe (contd.)
[Figure: difference plots of faulty output versus faultless output. Panels: no ALFTD; 25%, 33%, and 50% ALFTD computation overhead. Scale: "No Error" to "Max Error"]
27 Emissivity Data
- Emissivity is loosely proportional to temperature data
- Emissivity exhibits spatial locality
- Emissivity has natural bounds of expected data
< 0.5: faulty
Natural metal: 0.5
Rock: 0.8 - 0.95
Vegetation, water: 1.0
> 1.0: faulty
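The natural bounds listed on this slide translate directly into an absolute-bounds check. A minimal sketch; the function and constant names are hypothetical, and only the limits 0.5 and 1.0 come from the slide.

```python
EMISSIVITY_MIN = 0.5  # from the slide: values below 0.5 are faulty
EMISSIVITY_MAX = 1.0  # from the slide: values above 1.0 are faulty


def faulty_emissivity_indices(values, low=EMISSIVITY_MIN, high=EMISSIVITY_MAX):
    """Return the indices whose emissivity lies outside [low, high].

    Written as `not (low <= v <= high)` so that NaN values, which fail
    every comparison, are flagged as well.
    """
    return [i for i, v in enumerate(values) if not (low <= v <= high)]


sample = [0.5, 0.82, 0.95, 1.0, 0.31, 1.4]
bad = faulty_emissivity_indices(sample)
print(bad)
```

A bounds check like this only catches gross errors. A faulty value that still falls between 0.5 and 1.0 passes, so on its own it is a weak filter for emissivity.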
28 Emissivity Data (contd.)
- Emissivity does not exhibit the same data closeness as temperature output
- This makes it very difficult to distinguish faulty from non-faulty data
- Luckily, faults present in temperature output are easily detected, and reflect faults in emissivity output
- Emissivity does not have per-pixel independence of calculation
- Dependence on the correctness of neighboring pixels makes resolution reduction a viable, but not the best, method for secondary reduction
29 Data Frequency (Emissivity Values)
30 Conclusion
- ALFTD has already been shown to be a worthwhile alternative to full redundancy
- Improvements to the scheme will increase fault coverage and decrease secondary calculation overhead in both the emissivity and temperature outputs
- OTIS, as a general matrix-based master/slave program, is a springboard to other, similar programs (e.g., NGST)
- ALFTD as a fault-detection scheme will continue to be effective in programs whose output reflects natural characteristics (such as spatial locality and absolute bounds)
31 Thank You!
32 Relative Error Calculation
- Error in OTIS output is calculated relative to a faultless template
- The average relative error is the average of all relative errors of the entire output
- Faulty value: f(x,y)
- Faultless value: F(x,y)
- Error
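The error formula itself did not survive on this slide. As a rough sketch, assuming the conventional definition of relative error, |f(x,y) - F(x,y)| / |F(x,y)|, the average over the whole output could be computed as below. The function name and the handling of zero template values are assumptions, not the authors' definition.

```python
def average_relative_error(faulty, faultless):
    """Mean of |f(x,y) - F(x,y)| / |F(x,y)| over all points.

    ASSUMPTION: this is the standard relative-error definition, used here
    because the slide's own formula is missing.
    Points where F(x,y) == 0 are skipped, since the ratio is undefined there.
    """
    total, count = 0.0, 0
    for f_row, F_row in zip(faulty, faultless):
        for f, F in zip(f_row, F_row):
            if F == 0:
                continue
            total += abs(f - F) / abs(F)
            count += 1
    return total / count if count else 0.0


# Hypothetical 2x2 outputs: two of the four points differ from the template.
F = [[100.0, 200.0], [400.0, 500.0]]
f = [[110.0, 200.0], [300.0, 500.0]]
err = average_relative_error(f, F)
print(err)
```

In this example the per-point errors are 0.1, 0, 0.25, and 0, giving an average of 0.0875.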