Title: Enhancing The Fault-Tolerance of Nonmasking Programs
1Enhancing The Fault-Tolerance of Nonmasking Programs
- Sandeep S. Kulkarni and Ali Ebnenasir
- Software Engineering and Network Systems Laboratory
- Computer Science and Engineering Department
- Michigan State University
2Acknowledgement
- This work is partially sponsored by
- NSF,
- DARPA NEST,
- ONR URI, and
- Michigan State University
3Motivation
- Programs are subject to unanticipated faults
- Encounter new classes of faults, add corresponding fault-tolerance
- How to add fault-tolerance?
- Develop from scratch (expensive approach)
- Incrementally add fault-tolerance
- Reuse of the behaviors of the fault-intolerant program
- Potential to preserve properties that are hard to specify (e.g., efficiency)
- How to ensure correctness?
- After-the-fact verification
- Automatic addition of fault-tolerance (correct by construction)
4Motivation (Continued)
- Problem: complexity of automatic addition
- Automatic addition of fault-tolerance to distributed programs is NP-hard [FTRTFT00, ICDCS02]
- How do we deal with this complexity?
- Develop heuristics
- Identify the boundary of polynomial-time addition
- Step-wise addition (weaker forms of fault-tolerance)
- The goal of this paper
- Enhance the fault-tolerance of nonmasking programs
- Partial automation of fault-tolerance programs
5Outline
- Preliminary Concepts
- Enhancement Problem
- Enhancement in High Atomicity Model
- Enhancement for Distributed Programs
- Example: Byzantine Agreement Program
- Conclusion and Future Work
6Preliminary Concepts: Programs and Faults
- Finite state space Sp
- Invariant S, fault-span T ⊆ Sp
- Program p, Fault f, Safety ⊆ {(s0, s1) | (s0, s1) ∈ Sp × Sp}
- Fault-tolerance
- Failsafe, Nonmasking, Masking
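A minimal sketch of these definitions on an explicit state graph may help. The states, transitions, and the helper name `is_closed` below are hypothetical, and the sketch assumes the usual reading that the invariant S is closed in p while the fault-span T is closed in p together with f.

```python
# Toy model: every state and transition below is a made-up illustration.
def is_closed(predicate, transitions):
    """True if no transition starts inside `predicate` and ends outside it."""
    return all(s1 in predicate for (s0, s1) in transitions if s0 in predicate)

Sp = {0, 1, 2, 3}                      # finite state space
p = {(0, 1), (1, 0), (2, 0), (3, 2)}   # program transitions, a subset of Sp x Sp
f = {(0, 2), (1, 3)}                   # fault transitions, a subset of Sp x Sp
S = {0, 1}                             # invariant
T = {0, 1, 2, 3}                       # fault-span

print(is_closed(S, p))      # True: the program alone never leaves S
print(is_closed(S, p | f))  # False: faults can take the program out of S
print(is_closed(T, p | f))  # True: faults and program steps stay inside T
```

In this toy model the program transitions (2, 0) and (3, 2) play the role of recovery from T back to S.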
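7Step-Wise Addition
[Figure: levels of fault-tolerance, from the intolerant program through failsafe and nonmasking fault-tolerant programs to masking fault-tolerant programs. The earlier addition steps are labeled [FTRTFT00] and [ICDCS02]; the step from nonmasking to masking fault-tolerance is labeled "This paper".]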
8Enhancement Problem
- Input: nonmasking program p, invariant S, faults f, specification Spec
- Synthesis algorithm
- Output: masking program p', invariant S', fault-span T'
- Requirements: only fault-tolerance is added; no new functional behavior is added
9- Enhancement in High Atomicity Model
10Enhancement in High Atomicity Model
- High Atomicity Model
- Each process can read/write all program variables
11Enhancement in High Atomicity Model (Continued)
- Find a state predicate T' such that
- T' is closed in the computations of the program in the presence of faults
- The specification is satisfied from every state of T' (i.e., no deadlocks)
- Construct p' such that for every (s0, s1) ∈ p'
- (s0, s1) does not violate safety
- s0 ∈ T' ⇒ s1 ∈ T'
- Deadlock states appear due to removing some transitions
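The construction of T' and p' can be illustrated on an explicit state graph. The sketch below is a simplified illustration, not the algorithm of the paper. It assumes that ms is the set of states from which fault transitions alone can violate safety and that mt is the set of program transitions that violate safety or reach ms. The function names and the toy data are hypothetical, and the check that recovery to the invariant is preserved is omitted.

```python
def compute_ms(faults, bad):
    """States from which faults alone can execute a bad transition (assumed meaning of ms)."""
    ms = set()
    changed = True
    while changed:
        changed = False
        for (s0, s1) in faults:
            if s0 not in ms and ((s0, s1) in bad or s1 in ms):
                ms.add(s0)
                changed = True
    return ms

def enhance_high_atomicity(p, faults, T, bad):
    """Shrink T to a closed, deadlock-free T' and keep only safe transitions in p'."""
    ms = compute_ms(faults, bad)
    mt = {(a, b) for (a, b) in p if (a, b) in bad or b in ms}
    t_new = set(T) - ms
    p_new = {(a, b) for (a, b) in p - mt if a in t_new and b in t_new}
    changed = True
    while changed:
        changed = False
        for s in list(t_new):
            fault_escapes = any(a == s and b not in t_new for (a, b) in faults)
            had_move = any(a == s for (a, _) in p)
            has_move = any(a == s for (a, _) in p_new)
            # A deadlock state lost all of its outgoing program transitions.
            if fault_escapes or (had_move and not has_move):
                t_new.discard(s)
                changed = True
        p_new = {(a, b) for (a, b) in p_new if a in t_new and b in t_new}
    return t_new, p_new

# Toy data: state 3 can be driven by a fault into a safety violation.
p = {(0, 0), (1, 0), (2, 1), (2, 3), (3, 0)}
faults = {(0, 2), (3, 4)}
bad = {(3, 4)}
T = {0, 1, 2, 3}
t_new, p_new = enhance_high_atomicity(p, faults, T, bad)
```

In the toy data, state 3 is removed from the fault-span and the program transition (2, 3) is dropped, while state 2 keeps its other transition (2, 1) and therefore does not become a deadlock state.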
12FTRTFT00
- HighAtomicityEnhancement(p, f: transitions, T: StatePredicate, spec: specification)
- Calculate ms; calculate mt
- T' := ConstructFaultSpan( )
- if (T' = {}) declare that no masking f-tolerant program exists; exit
- else construct the transitions of p'
- AddMasking(p, f: transitions, S: StatePredicate, spec: specification)
- 1. Calculate ms; calculate mt
- 2. . . .
- 3. . . .
- 4. repeat
- 4-1) . . .
- 4-2) . . .
- 4-3) T := ConstructFaultSpan( )
- 4-4) . . .
- 4-5) if (S = {} \/ T = {}) declare that no masking f-tolerant program exists; exit
- until (ExitConditionHolds)
- 5. Remove cycles outside the invariant in T
- 6. Construct the transitions of p'
Partial Automation
13- Enhancement For Distributed Programs
14Difficulties with Distribution
- Read/Write restrictions (low atomicity model).
- A program p
- Two processes j, k
- Two Boolean variables a and b
- Process j cannot read b
- Can we include the following transition?
- Groups of transitions (instead of individual transitions) must be chosen.
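The grouping effect can be made concrete with the two Boolean variables a and b. The sketch below is illustrative only: the state encoding as a tuple (a, b) and the helper `group_of` are hypothetical, and it assumes that a process does not write a variable it cannot read. Since process j cannot read b, it cannot tell apart two states that differ only in b, so a transition of j comes together with its copies for every value of b.

```python
from itertools import product

def group_of(transition, unreadable):
    """All transitions that a process cannot distinguish from `transition`.

    `unreadable` lists the tuple positions of variables the process cannot read.
    Assumption: the transition leaves the unreadable variables unchanged.
    """
    s0, s1 = transition
    group = set()
    for values in product([0, 1], repeat=len(unreadable)):
        t0, t1 = list(s0), list(s1)
        for index, value in zip(unreadable, values):
            t0[index] = value
            t1[index] = value
        group.add((tuple(t0), tuple(t1)))
    return group

# State = (a, b). Process j sets a from 0 to 1 and cannot read b (position 1).
group = group_of(((0, 0), (1, 0)), unreadable=[1])
print(sorted(group))
```

Including the transition from (a=0, b=0) therefore forces the inclusion of the transition from (a=0, b=1) as well; if either one violates safety, the whole group has to be excluded.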
15Enhancement of Nonmasking Distributed Programs
- Start
- Calculate T'_high
- Calculate S'_init; set S'_low := S'_init
- Calculate S_reachable, the states reachable from S'_low by fault/program transitions
- Search in (T'_high − S'_low), under distribution restrictions
- If S_reachable = {}: set T' := S'_low, calculate the transitions of p', and stop
- Otherwise, calculate S_recovery, the states from where recovery is possible to S'_low
- If S_recovery = {}: declare failure
- Otherwise, set S'_low := S'_low ∪ S_recovery and recalculate S_reachable
16A High Atomicity Fault-Span
- The largest possible domain for the states that can be included in the fault-span of the distributed program
17The Initial Low Atomicity Invariant
- Remove states from where an outgoing transition crosses the boundary of S'_high (e.g., s0)
- Removal is a non-deterministic choice when there is more than one state to remove
18Single-Step Reachable States
- Reachable from S'_init by a fault/program transition (denoted S_reachable)
19Single-Step Recovery States
- Safe recovery in a single step (denoted S_recovery)
- Goal: infinite computations are possible from all states in S'_low
- s0 represents a typical recovery state
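The loop built from the single-step reachable and recovery states can be sketched as follows. This is one reading of the flowchart, not the implementation of the paper: the function name, the parameter `allowed` (candidate recovery transitions that are assumed to already respect safety and the distribution restrictions), and the toy data are all hypothetical.

```python
def enhance_low_atomicity(t_high, s_init, program, faults, allowed):
    """Grow S'_low layer by layer; return (T', recovery transitions) or None on failure."""
    s_low = set(s_init)
    recovery = set()
    while True:
        # States outside S'_low, inside T'_high, reached in one fault/program step.
        s_reachable = {b for (a, b) in program | faults
                       if a in s_low and b in t_high and b not in s_low}
        if not s_reachable:
            return s_low, recovery          # T' := S'_low
        # Reachable states with a single-step recovery transition into S'_low.
        step = {(a, b) for (a, b) in allowed if a in s_reachable and b in s_low}
        s_recovery = {a for (a, _) in step}
        if not s_recovery:
            return None                     # declare failure
        s_low |= s_recovery
        recovery |= step

t_high = {0, 1, 2, 3}
program = {(0, 0)}
faults = {(0, 1), (1, 2)}
result = enhance_low_atomicity(t_high, {0}, program, faults,
                               allowed={(1, 0), (2, 1), (3, 0)})
failed = enhance_low_atomicity(t_high, {0}, program, faults, allowed={(1, 0)})
```

Each iteration adds at least one state of T'_high to S'_low, so the loop terminates on a finite state space. In the toy data the first call grows S'_low from {0} to {0, 1, 2}, while the second call fails because state 2 has no recovery transition.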
21Example: Byzantine Agreement
- Why this example?
- Was used to illustrate the addition of masking fault-tolerance in [SRDS01]
- Manual enhancement has already been applied [TSE98]
- Processes: General, g, and three non-generals j, k, and l
- Variables
- d.g ∈ {0, 1}
- d.j, d.k, d.l ∈ {0, 1, ⊥}
- b.g, b.j, b.k, b.l ∈ {0, 1}
- f.j, f.k, f.l ∈ {0, 1}
- Safety Specification
- Agreement: No two non-Byzantine non-generals can finalize with different decisions
- Validity: If g is not Byzantine, no process can finalize with a decision different from that of g
- A finalized process should not execute any transition
22Example: Byzantine Agreement
- Read/Write restrictions
- Readable variables for process j: b.j, d.j, f.j, d.g, d.k, d.l
- Process j can write d.j, f.j
- Dijkstra's guarded commands
- Guard → Statement
- {(s0, s1) | Guard holds at s0 and atomic execution of Statement yields s1}
- Nonmasking fault-tolerant program transitions
- d.j = ⊥ ∧ f.j = 0 → d.j := d.g
- d.j ≠ ⊥ ∧ f.j = 0 → f.j := 1
- d.j = 1 ∧ d.k = 0 ∧ d.l = 0 → d.j := 0
- d.j = 0 ∧ d.k = 1 ∧ d.l = 1 → d.j := 1
- Fault transitions
- ¬b.g ∧ ¬b.j ∧ ¬b.k ∧ ¬b.l → b.j := true
- b.j → d.j := 0|1
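The guarded commands of process j can be read as a successor function on states. The sketch below is illustrative: it encodes a state as a dictionary, uses `None` for the undecided value, and relies on a reading of the guards in which the missing operators are equality, inequality, and conjunction. The function name `step_j` is hypothetical.

```python
BOTTOM = None  # stands for the undecided value of d.j, d.k, d.l

def step_j(s):
    """Successor states produced by the nonmasking actions of process j."""
    successors = []
    if s["d.j"] is BOTTOM and s["f.j"] == 0:
        successors.append({**s, "d.j": s["d.g"]})   # copy the decision of g
    if s["d.j"] is not BOTTOM and s["f.j"] == 0:
        successors.append({**s, "f.j": 1})          # finalize
    if s["d.j"] == 1 and s["d.k"] == 0 and s["d.l"] == 0:
        successors.append({**s, "d.j": 0})          # correct d.j towards k and l
    if s["d.j"] == 0 and s["d.k"] == 1 and s["d.l"] == 1:
        successors.append({**s, "d.j": 1})          # correct d.j towards k and l
    return successors

start = {"d.g": 1, "d.j": BOTTOM, "d.k": BOTTOM, "d.l": BOTTOM, "f.j": 0}
decided = step_j(start)

# A finalized process with d.j = 1 while k and l hold 0.
finalized = {"d.g": 1, "d.j": 1, "d.k": 0, "d.l": 0, "f.j": 1}
after = step_j(finalized)
```

Under this reading the correcting actions stay enabled after j has finalized, which conflicts with the requirement that a finalized process should not execute any transition.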
23Example: Byzantine Agreement (Continued)
- Why is enhancement easier?
- S0: d.j = d.k = ⊥, d.g = 1, d.l = 1, f.l = 0
- S0 to S1: a good transition inside the invariant (premature finalization)
- S1: d.j = d.k = ⊥, d.g = 1, d.l = 1, f.l = 1
- S1 to S2: fault transition
- S2: d.j = d.k = ⊥, d.g = 0, d.l = 1, f.l = 1, b.g = 1
- S3: d.j = d.k = ⊥, d.g = 0, d.l = 1, f.l = 1
- S4: d.j = d.k = 0, d.g = 0, d.l = 1, f.l = 1 (a deadlock state)
24Example: Byzantine Agreement (Continued)
- High atomicity reasoning
- Synthesize a masking program in high atomicity and then refine it to a distributed program
- Masking fault-tolerant program
- d.j = ⊥ ∧ f.j = 0 → d.j := d.g
- d.j ≠ ⊥ ∧ f.j = 0 ∧ ((d.j = d.k) ∨ (d.j = d.l)) → f.j := 1
- d.j = 1 ∧ d.k = 0 ∧ d.l = 0 ∧ (f.j = 0) → d.j := 0
- d.j = 0 ∧ d.k = 1 ∧ d.l = 1 ∧ (f.j = 0) → d.j := 1
25Enhancement vs. Addition
- Reuse the computations of the nonmasking program
- Reasoning in high atomicity model has the potential to reduce the complexity of addition
26Synthesis Framework
- Development of a synthesis framework
- Developers of fault-tolerance can interactively add fault-tolerance to fault-intolerant programs
- Partial automation helps us to reap the benefits of automation as much as possible
- Enhancement identifies programs where partial automation is possible
- Implementation of enhancement algorithms in the synthesis framework
- http://www.cse.msu.edu/sandeep/software/Code/synthesis-framework/
27Conclusion and Future Work
- Enhancement simplifies automated design of masking programs
- Lower asymptotic complexity
- Polynomial-time enhancement in the low atomicity model (in the state space of the nonmasking program)
- Sound, but not complete
- Reasoning in high atomicity simplifies the synthesis of masking distributed programs
- Future Work
- A polynomial-time sound and complete enhancement algorithm for a restricted class of programs and specifications
28Thank You!
29Example: Triple Modular Redundancy
- Processes: three processes j, k, and l
- Variables and their domains
- in.j, in.k, and in.l are Boolean variables
- out belongs to {0, 1, ⊥}
- Nonmasking program (addition is in modulo 3)
- N1: (out = ⊥) → out := in.j
- N2: (out ≠ ⊥) ∧ (out ≠ in.j) ∧ ((in.j = in.k) ∨ (in.j = in.l)) → out := in.j
- Faults
- F: (in.j = in.k) ∧ (in.j = in.l) → in.j := 0|1
- Safety specification
- Do not reach states where out is different from the majority of inputs.
- out should not be changed after it is assigned a value.
30Example: Triple Modular Redundancy
- Invariant
- S = ((out = ⊥) ∧ (in.j = in.k = in.l)) ∨ (out = in.j = in.k) ∨ (out = in.j = in.l) ∨ (out = in.k = in.l)
- Fault-span
- T = ((in.j = in.k = in.l) ⇒ ((out = ⊥) ∨ (out = in.j = in.k = in.l)))
- Enhancement algorithm
- Compute ms
- Remove bad transitions
- {t | t violates safety} and {t | t reaches ms}
- Construct a new fault-span T'
- T' = T − {s | (out ≠ ⊥) ∧ (out is not equal to the majority of inputs)}
- Masking program
- M1: (out = ⊥) ∧ ((in.j = in.k) ∨ (in.j = in.l)) → out := in.j
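The masking action M1 can be checked by brute force over all inputs. The sketch below is illustrative: it represents the undefined value of out by `None`, assumes that the neighbors of process j are obtained by addition modulo 3, and uses hypothetical helper names.

```python
from itertools import product

BOTTOM = None  # stands for the undefined value of out

def majority(inputs):
    """Majority value of three Boolean inputs."""
    return 1 if sum(inputs) >= 2 else 0

def m1(inputs, out, j):
    """Action M1 of process j: assign out only if it is undefined and in.j has a match."""
    k, l = (j + 1) % 3, (j + 2) % 3
    if out is BOTTOM and (inputs[j] == inputs[k] or inputs[j] == inputs[l]):
        return inputs[j]
    return out

def m1_is_safe():
    """Check both safety requirements for every input combination and every process."""
    for inputs in product([0, 1], repeat=3):
        for j in range(3):
            # out is never assigned a value other than the majority of inputs
            if m1(inputs, BOTTOM, j) not in (BOTTOM, majority(inputs)):
                return False
            # out is never changed once it holds a value
            for out in (0, 1):
                if m1(inputs, out, j) != out:
                    return False
    return True

print(m1_is_safe())  # True
```

The check succeeds because an input that matches at least one other input is necessarily the majority of three Boolean inputs.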
33Enhancement of Nonmasking Distributed Programs
- Start
- Calculate T'_high
- Calculate S'_init; set S'_low := S'_init (S'_init = S'_low at the first iteration)
- Calculate S_reachable, the states reachable from S'_low by fault/program transitions
- If S_reachable = {}: set T' := S'_low and calculate the transitions of p'
- Otherwise, calculate S_recovery, the states from where recovery is possible to S'_low
- If S_recovery = {}: declare failure
- Otherwise, set S'_low := S'_low ∪ S_recovery and recalculate S_reachable