Title: Application Level Fault Tolerance and Detection
1. Application Level Fault Tolerance and Detection
- Principal Investigators: C. Mani Krishna, Israel Koren
- Presented By: Eric Ciocca
Architecture and Real-Time Systems (ARTS) Lab, Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003
2. What is ALFTD?
- Application Level Fault Tolerance and Detection
- ALFTD complements existing system-level or algorithm-level fault tolerance by leveraging information available only at the application level
- Using such application-level semantic information significantly reduces the overall cost of providing fault tolerance
- ALFTD may be used alone or to supplement other fault detection schemes
3. ALFTD Overview
- Application Level Fault Tolerance and Detection allows a system to survive both data faults and system (instruction/hardware) faults
  - System faults cause a process to eventually cease functioning
  - Data faults cause a process to continue running with incorrect results
- ALFTD is scalable
  - The level of fault tolerance can be traded off against the time overhead invested
4. Principles of ALFTD
[Diagram: Node 1, Node 2, Node 3, Node 4]
- To provide system fault tolerance, every physical node runs its own work (P, primary) as well as a scaled-down copy of a neighboring node's work (S, secondary)
- If a fault corrupts a process, the corresponding secondary of that task will still produce output, albeit at a lower (but acceptable) quality
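The primary/secondary placement described above can be sketched as follows. This is only an illustration, assuming the nodes form a ring in which node i hosts the secondary for node (i + 1) mod N. The slides say only "a neighboring node", so the ring layout and the names `assign_roles` and `collect_outputs` are hypothetical.

```python
def assign_roles(num_nodes):
    """Map each node to (primary task, secondary task).

    Assumption: ring layout, node i backs up node (i + 1) % num_nodes.
    Task ids are taken to equal the id of the node that owns the primary.
    """
    return {n: (n, (n + 1) % num_nodes) for n in range(num_nodes)}


def collect_outputs(roles, failed_nodes):
    """Return, for each task, the node supplying its output and the quality."""
    outputs = {}
    for node, (primary, secondary) in roles.items():
        if node in failed_nodes:
            continue  # a failed node contributes nothing
        outputs[primary] = (node, "full")
        # The scaled-down secondary is used only if the task's own node failed.
        if secondary in failed_nodes:
            outputs[secondary] = (node, "reduced")
    return outputs


roles = assign_roles(4)
print(collect_outputs(roles, failed_nodes={2}))
```

With a single failed node, every task still has an output: the failed node's task is covered at reduced quality by the neighbor that hosts its secondary.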
5. Principles of ALFTD
- The secondary processes can be scaled down by
  - reducing the resolution of input data
  - reducing the precision of calculations
  - heuristically predicting results from previous iterations' output
- In some applications the secondary can be run optionally, on an as-needed basis
  - If the corresponding primary is approaching a deadline miss
  - If the corresponding primary has been incapacitated
  - If the corresponding primary has produced faulty data
- If faults are infrequent, an optional secondary will incur very little additional overhead
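The three trigger conditions for an optional secondary, and the reason infrequent faults keep the overhead low, can be sketched as below. Everything here is an assumption made for illustration: the `PrimaryStatus` fields, the `safety_margin` parameter, and the simple model in which the average extra cost is the trigger probability times the secondary's relative cost.

```python
from dataclasses import dataclass


@dataclass
class PrimaryStatus:
    alive: bool           # False if the primary has been incapacitated
    elapsed: float        # seconds spent so far on the current task
    deadline: float       # seconds allowed for the task
    output_suspect: bool  # True if the primary's output looks faulty


def should_run_secondary(status, safety_margin=0.1):
    """Decide whether the optional secondary needs to run.

    safety_margin is a hypothetical tuning knob: the fraction of the
    deadline held in reserve before a miss is treated as imminent.
    """
    if not status.alive:
        return True
    if status.elapsed >= status.deadline * (1.0 - safety_margin):
        return True
    return status.output_suspect


def expected_overhead(trigger_probability, secondary_cost):
    """Average extra cost per task, as a fraction of the primary's cost."""
    return trigger_probability * secondary_cost


healthy = PrimaryStatus(alive=True, elapsed=1.0, deadline=10.0, output_suspect=False)
print(should_run_secondary(healthy))
print(expected_overhead(trigger_probability=0.01, secondary_cost=0.25))
```

Under this model, a secondary costing a quarter of the primary and triggered on one task in a hundred adds well under one percent to the average runtime.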
6. ALFTD in OTIS
- ALFTD was implemented in OTIS (Orbital Thermal Imaging Spectrometer) to test its viability as a fault tolerance and detection scheme
- OTIS, part of the REE (Remote Exploration and Experimentation) program group from JPL, is intended to run on orbiting satellites
- OTIS processes radiation data of a geographic area from a sensor array input and produces temperature and emissivity data output
7. OTIS Structure
[Diagram: MPI, master (M), slaves (S), OUTPUT]
1. MPI starts
2. MPI starts slave and master processes
3. Master sends tasks
4. Slave calculations
5. Slave returns results
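The five steps above follow a standard master/slave pattern. The sketch below mimics that flow with a thread pool from the Python standard library rather than MPI, so that it runs without an MPI runtime. It is a stand-in, not the OTIS code, and `slave_calculation` is a hypothetical placeholder workload.

```python
from concurrent.futures import ThreadPoolExecutor


def slave_calculation(task):
    """Stand-in for the per-task computation (hypothetical workload)."""
    task_id, values = task
    return task_id, [v * 2.0 for v in values]


def master(tasks, num_slaves=3):
    """Send tasks to the slaves, then gather the results in task order."""
    results = {}
    # Steps 1-2: start the worker pool (plays the role of MPI start-up).
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        # Steps 3-5: send tasks, let the slaves compute, collect results.
        for task_id, output in pool.map(slave_calculation, tasks):
            results[task_id] = output
    return [results[i] for i in sorted(results)]


tasks = [(i, [float(i), float(i) + 0.5]) for i in range(6)]
print(master(tasks))
```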
8. ALFTD in OTIS (contd)
- ALFTD is well suited to remote applications
  - As a software-based fault handling mechanism, it requires no extra hardware
  - The scaled secondaries require less power than full software redundancy
- In OTIS, as in other applications, ALFTD is passive, requiring extra runtime only when a fault occurs
9. ALFTD OTIS Structure
[Diagram: MPI, master (M), OUTPUT, and processes P1, S2, P2, S3, P3, S1]
1. MPI starts
2. MPI starts master and slaves, primary and secondary processes
3. Master sends tasks
4. Slave calculations
5. Slave returns results
10. Secondaries in OTIS
- The secondary required for ALFTD is implemented to be functionally similar to the primary
- Secondary scaling occurs through resolution reduction
  - OTIS's natural temperature input data exhibits spatial locality
  - Points not directly calculated can be estimated by interpolating between calculated points
- Secondary processes have been tested at 20-50% of the primary calculation overhead
  - While 50% affords better quality, 20% has less overhead
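Resolution reduction with interpolation can be sketched as follows. This assumes, purely for illustration, that the secondary computes every `stride`-th row in full and fills the skipped rows by linear interpolation between the nearest computed rows. The slides do not specify the interpolation scheme, and `compute_row` is a hypothetical stand-in for the real per-row calculation.

```python
def compute_row(i, width=4):
    """Stand-in for the expensive per-row calculation (hypothetical)."""
    return [10.0 + 0.5 * i + 0.1 * j for j in range(width)]


def secondary_output(num_rows, stride):
    """Compute every `stride`-th row and linearly interpolate the rest.

    The last row is always computed so each skipped row has two anchors.
    Returns the rows and the fraction of rows that were fully computed.
    """
    anchors = sorted(set(range(0, num_rows, stride)) | {num_rows - 1})
    computed = {i: compute_row(i) for i in anchors}
    rows = []
    for i in range(num_rows):
        if i in computed:
            rows.append(computed[i])
            continue
        lo = max(a for a in anchors if a < i)   # nearest computed row above
        hi = min(a for a in anchors if a > i)   # nearest computed row below
        t = (i - lo) / (hi - lo)
        rows.append([(1 - t) * a + t * b
                     for a, b in zip(computed[lo], computed[hi])])
    return rows, len(anchors) / num_rows


rows, fraction = secondary_output(num_rows=10, stride=4)
print(fraction)
```

A larger stride computes fewer rows and so costs less, at the price of coarser estimates wherever the data changes quickly between anchor rows.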
11. Example of Secondary Resolution
[Figure panels: 100% Secondary Resolution, 50% Secondary Resolution, 33% Secondary Resolution, 25% Secondary Resolution]
- (ALFTD compensation for 10 rows in a sample dataset)
12. Fault Detection
- Output filters on the primary data determine when secondary validation is required
  - Output filters are created to check for application-specific trends in data
  - Aberrations from normal data characteristics can be considered the product of potentially faulty processes
- OTIS relies on natural temperature characteristics to detect potentially faulty data
  - Spatial locality: temperature changes gradually over small areas
  - Absolute bounds: temperature should not exceed certain values
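The two OTIS filters named above can be sketched as simple checks on one row of output temperatures. The threshold values (`low`, `high`, `max_step`) and the function names are hypothetical. The slides do not give the actual limits used.

```python
def absolute_bounds_ok(row, low, high):
    """Absolute bounds: every temperature lies within [low, high]."""
    return all(low <= t <= high for t in row)


def spatial_locality_ok(row, max_step):
    """Spatial locality: neighboring points differ by at most max_step."""
    return all(abs(b - a) <= max_step for a, b in zip(row, row[1:]))


def needs_secondary_validation(row, low=200.0, high=350.0, max_step=5.0):
    """Flag a row as potentially faulty if either filter fails.

    The default thresholds are illustrative values in kelvin, not
    figures taken from OTIS.
    """
    return not (absolute_bounds_ok(row, low, high)
                and spatial_locality_ok(row, max_step))


print(needs_secondary_validation([290.0, 291.5, 292.0, 293.0]))
print(needs_secondary_validation([290.0, 291.5, 340.0, 293.0]))
```

A row that fails either check is only a candidate for a fault. In the scheme described here, it triggers the secondary rather than being discarded outright.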
13. Fault Detection (contd)
- After the secondary has been run to validate a primary's results, the better data is chosen according to the following logic grid
[Table: logic grid, with axis label "Secondary Results"]
14. Data Sets
- Three data sets were chosen for their interesting
characteristics
15. Fault Tolerance Results: Spots
- Fault tolerance with injected faults in Spots
16. Fault Tolerance Results: Spots (contd)
[Figure panels: Faulty Output; Fault-Free Output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD Computation Overhead]
17. Fault Tolerance Results: Blob
- Fault tolerance with injected faults in Blob
18. Fault Tolerance Results: Blob (contd)
[Figure panels: Faulty Output; Fault-Free Output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD Computation Overhead]
19. Fault Tolerance Results: Stripe
- Difference plots: faulty output versus faultless output
[Figure panels: No ALFTD; 25%, 33%, and 50% ALFTD Computation Overhead. Color scale: No Error to Max Error]
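A difference plot of this kind can be derived from two outputs as sketched below. This assumes the plotted quantity is the per-point absolute difference, scaled so that the largest difference in the image maps to 1.0 ("Max Error") and identical points map to 0.0 ("No Error"). The slides do not state how the plots were normalized, so that scaling and the name `difference_map` are assumptions.

```python
def difference_map(faulty, reference):
    """Per-point absolute error, scaled to the range 0.0 to 1.0.

    Assumption: each map is normalized by its own largest error.
    """
    diff = [[abs(f - r) for f, r in zip(frow, rrow)]
            for frow, rrow in zip(faulty, reference)]
    peak = max(max(row) for row in diff)
    if peak == 0.0:
        return diff  # identical outputs: every entry is already 0.0
    return [[d / peak for d in row] for row in diff]


faulty = [[1.0, 2.0], [3.0, 8.0]]
reference = [[1.0, 2.0], [3.0, 4.0]]
print(difference_map(faulty, reference))
```

Comparing several panels fairly would need a shared scale across them rather than the per-image peak used in this sketch.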
20. Fault Tolerance Results: Stripe (contd)
[Figure panels: Faulty Output; Fault-Free Output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD Computation Overhead]
21. Conclusion / Future Work
- ALFTD has been shown to be a cost-effective alternative to full redundancy
- Improvements to the scheme will increase fault coverage and decrease secondary calculation overhead
- OTIS has general application characteristics that will make its implementation a springboard to other, similar programs
- ALFTD should continue to be effective in any program that has predictable data characteristics
22. Thank You!
- For additional information, please contact
  - Eric Ciocca (eciocca_at_ecs.umass.edu)
  - Israel Koren (koren_at_euler.ecs.umass.edu)
  - C. Mani Krishna (krishna_at_ecs.umass.edu)