Title: Analysis of the SPIDER Fault-Tolerance Protocols
1Analysis of the SPIDER Fault-Tolerance Protocols
- Paul S. Miner
- p.s.miner_at_larc.nasa.gov
- 5th NASA Langley Formal Methods Workshop
- Williamsburg, VA
- June 14, 2000
2What is SPIDER?
- A general purpose fault-tolerant architecture
- Scalable Processor-Independent Design for Electromagnetic Resilience
- Intended to serve as a platform to explore recovery strategies for HIRF/EMI induced faults
- Developed as part of an FAA funded case-study to exercise RTCA DO-254 Design Assurance Guidance for Airborne Electronic Hardware
3RTCA DO-254/EUROCAE ED-76
- Developed by RTCA Special Committee 180 and EUROCAE Working Group 46
- Approved by RTCA Program Management Committee in April 2000
- Approved by EUROCAE(?)
- FAA Advisory Circular (?)
- Earliest would be sometime this fall
4Formal Methods in DO-254
- Formal Methods is one of the advanced analysis techniques suggested when developing hardware to support safety-critical (Level A or B) aircraft functions
- DO-254 section on Formal Methods based upon material from NASA Formal Methods Guidebook, Volume II, (NASA-GB-001-97)
5DO-254 Case-Study Participants
- NASA LaRC (Design Team)
- Paul Miner, Project Lead
- Mahyar Malekpour, Design Engineer
- Wilfredo Torres-Pomales, Design Engineer
- Victor Carreño, Process Assurance
- FAA (Sponsor and Certification Liaison)
- Leanna Rierson, Pete Saraceni, Dennis Wallace,
Connie Beane, and Will Struck
6Strategy for DO-254 Case-Study
- Fault-tolerance protocols and reliability models use the same fault classifications
- Reliability analysis using SURE (Butler)
- Calculates P(enough good hardware)
- Formal proof of fault-tolerance protocols using PVS (SRI)
- enough good hardware => correct operation
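As a rough illustration of what P(enough good hardware) means, the sketch below computes the probability that a strict majority of n independent units is still good at the end of a mission. This is not the SURE model: the exponential failure law, the rate `lam`, the function name, and the choice to count every failed unit against the majority are all assumptions made for illustration.

```python
import math

def p_enough_good(n: int, lam: float, hours: float) -> float:
    """Probability that good units strictly outnumber failed units.

    Assumes (hypothetically) n independent units, each with a constant
    failure rate lam per hour, observed over a mission of the given length.
    """
    p_fail = 1.0 - math.exp(-lam * hours)  # one unit fails during the mission
    max_faulty = (n - 1) // 2              # largest count leaving good > faulty
    return sum(
        math.comb(n, k) * p_fail**k * (1.0 - p_fail) ** (n - k)
        for k in range(max_faulty + 1)
    )
```

With these assumptions, adding units raises the result, which is the kind of trade-off a reliability model is used to explore.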
7SPIDER Design Concept
- Inspired by several earlier designs
- Main concept inspired by Palumbo's Fault-tolerant processing system (U.S. Patent 5,533,188)
- Developed as part of Fly-By-Light/Power-By-Wire project
- Other ideas from Draper's FTPP, FTP, and FTMP; Allied-Signal's MAFT; SRI's SIFT; Kopetz's TTA; Honeywell's SAFEbus; . . .
8SPIDER Architecture
- N simplex general purpose nodes logically connected via a Reliable Optical BUS (ROBUS)
- A ROBUS is an ultra-reliable unit providing basic fault-tolerant services
- A ROBUS is implemented as a special purpose fault-tolerant device
9SPIDER Architecture
[Figure: processing nodes 0 through 7 attached to the ROBUS]
10Logical View of ROBUS
- ROBUS operates as a time-division multiplexed access broadcast bus
- ROBUS strictly enforces write access
- no babbling idiots
- Processing nodes may be grouped to provide differing degrees of fault-tolerance
- some voting available within ROBUS
11Logical view of ROBUS (Sample Configuration)
[Figure: sample configuration of processing nodes 0 through 7 on the ROBUS]
12ROBUS Characteristics
- Bus access schedule statically determined
- similar to SAFEbus, TTA
- Some fault-tolerance functions provided by processing nodes
- ROBUS will not have general purpose processing capabilities
- Processing Elements need not be uniform
- support for dissimilar architectures
13ROBUS Requirements
- All fault-free nodes observe the exact same sequence of messages
- ROBUS provides a reliable time source (RTS)
- The nodes are synchronized relative to this RTS
- ROBUS provides correct and consistent system diagnostic information to all fault-free nodes
- For 10 hour mission, P(ROBUS Failure) < 10^-10
14Interactive Consistency (Byzantine Agreement)
- Agreement: For any message, all non-faulty receiving nodes will agree on the value of the message
- Validity: If the originator of the message is non-faulty, good receivers will receive the message sent
15Clock Synchronization
- Precision: There is a small positive constant dmax such that for any two clocks that are good at t,
- |C1(t) - C2(t)| ≤ dmax
- Accuracy: All good clocks maintain an accurate measure of the passage of time (within a linear envelope of real time)
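The precision property can be checked directly on a snapshot of clock readings. The following is a minimal sketch; the function name and the idea of sampling all good clocks at one instant are assumptions for illustration.

```python
def precision_holds(readings, dmax):
    """True if every pair of good-clock readings differs by at most dmax.

    readings: values of the good clocks sampled at the same real time t.
    The largest pairwise difference equals max(readings) - min(readings).
    """
    return max(readings) - min(readings) <= dmax
```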
16Diagnosis
- Correctness: Every node diagnosed as faulty by a good node is faulty
- A good node can never conclude that another good node is faulty
- Completeness: Every faulty node is (eventually) diagnosed as being faulty
- This is not always possible (pathological case involves asymmetric fault)
17Physical Segregation
- ROBUS decomposed into physically isolated Fault Containment Regions (FCR)
- Two main design elements
- Bus Interface Unit (BIU)
- Redundancy Management Unit (RMU)
- Processing elements may form separate FCRs
- FCRs fail independently
- This is necessary to achieve reliability goals
18 ROBUS Topology
19Hybrid Fault Assumptions
- The failure status of an FCR is subdivided into four mutually exclusive cases
- Good (or fault-free)
- Benign Faulty (Known bad by all good)
- Symmetric Faulty (Same to all good)
- Asymmetric Faulty (Byzantine, Malicious)
- This is a global classification; individual FCRs do not know the failure status of other FCRs
20Fault Classification
- Partition the RMUs into disjoint subsets based upon fault classification
- GR, BR, SR, and AR for good, benign, symmetric, and asymmetric RMUs respectively
- Similarly partition the BIUs
- GB, BB, SB, and AB
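The partition into fault classes can be sketched as a small data structure. The enum, the function name, and the unit identifiers below are illustrative assumptions. Because the classification is global, this view is available only to an outside observer and not to any individual FCR.

```python
from enum import Enum

class Fault(Enum):
    GOOD = "good"
    BENIGN = "benign"
    SYMMETRIC = "symmetric"
    ASYMMETRIC = "asymmetric"

def partition(status):
    """Split unit ids into disjoint sets keyed by fault class.

    status: dict mapping a unit id (RMU or BIU) to its Fault value.
    """
    sets = {fault: set() for fault in Fault}
    for unit, fault in status.items():
        sets[fault].add(unit)
    return sets
```

For three RMUs of which one is asymmetric, `partition` yields the sets corresponding to GR, BR, SR, and AR, with BR and SR empty.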
21Tolerating Asymmetric Faults
- Requires 3f + 1 participants in protocol to withstand f simultaneous asymmetric faults
- Requires 2f + 1 disjoint communication paths between any two participants
- Requires f + 1 levels of communication
- ROBUS Topology satisfies these conditions for N ≥ 3, M ≥ 3, f = 1
- For target reliability, we must tolerate at least 1 asymmetric fault
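The classical resource bounds for tolerating f asymmetric faults are simple to tabulate. The sketch below is illustrative only; the function and key names are assumptions.

```python
def byzantine_requirements(f: int) -> dict:
    """Minimum resources needed to tolerate f simultaneous asymmetric faults."""
    return {
        "participants": 3 * f + 1,    # units taking part in the protocol
        "disjoint_paths": 2 * f + 1,  # independent paths between any two units
        "levels": f + 1,              # levels (rounds) of communication
    }
```

For f = 1 this gives 4 participants, 3 disjoint paths, and 2 levels of communication.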
22Interactive Consistency
- SPIDER IC protocol is simple adaptation of IC algorithm for Draper FTP Architecture
- Existing PVS proof due to Lincoln and Rushby, COMPASS '94, pages 107-120
- Protocol generalizes one suggested in
- Daniel Davies and John Wakerly, Synchronization and Matching in Redundant Systems, IEEE Trans. on Computers, Vol. C-27, No. 6, June 1978
23SPIDER IC Protocol
- Algorithm OMS (ignoring hybrid fault model)
- 1. Processing element j sends value v to BIU j
- 2. BIU j broadcasts v to all RMUs
- 3. All RMUs broadcast value received from BIU j to all BIUs
- 4. Each BIU votes on the values received from the RMUs to determine value from j
- 5. Each BIU forwards the voted value to its PE
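The relay-and-vote structure of the five steps can be sketched as a small simulation. This is a toy model and not the verified protocol: the source BIU is assumed good, and at most one RMU is modelled as asymmetric by having it send a different bogus value to every BIU. All names are hypothetical.

```python
from collections import Counter

def majority(values):
    """Return the strict-majority value, or None if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else None

def oms(v, n_biu, n_rmu, faulty_rmu=None):
    """One exchange for a good source BIU; returns each BIU's voted value.

    faulty_rmu: optional index of one RMU that (hypothetically) sends a
    different wrong value to every BIU, modelling an asymmetric fault.
    """
    # Step 2: the source BIU broadcasts v to all RMUs.
    at_rmu = [v] * n_rmu
    results = []
    for biu in range(n_biu):
        # Step 3: every RMU relays what it received to this BIU.
        relayed = [
            ("bogus", biu) if rmu == faulty_rmu else at_rmu[rmu]
            for rmu in range(n_rmu)
        ]
        # Step 4: the BIU votes on the relayed values.
        results.append(majority(relayed))
    return results
```

With three RMUs and one of them faulty, the two good RMUs outvote it at every BIU, so all BIUs forward the value the source sent.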
24Adapting for hybrid faults
- Simple modification to steps 3 and 4 to enable special handling of manifestly bad messages
- from benign faulty or asymmetrically faulty sources
- OMHS(p,v,q) denotes the value received by q, when p broadcast value v using hybrid oral messages protocol on SPIDER
- Verified in PVS, using simple modifications to Lincoln and Rushby's proof of the Draper FTP Interactive Consistency Protocol
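A hybrid vote differs from a plain majority in that manifestly bad messages are discarded before counting. The sketch below shows one common form of this idea. The `ERROR` marker and the function name are assumptions, and the PVS specification of hybrid majority may differ in detail.

```python
from collections import Counter

ERROR = "Error"  # stand-in for a manifestly bad (detectably faulty) message

def hybrid_majority(values):
    """Strict majority over the values that are not manifestly bad.

    Returns ERROR when every value was discarded or when no strict
    majority exists among the values that remain.
    """
    kept = [x for x in values if x != ERROR]
    if not kept:
        return ERROR
    value, count = Counter(kept).most_common(1)[0]
    return value if 2 * count > len(kept) else ERROR
```

Discarding first matters: for the inputs `["v", ERROR, ERROR]` a plain majority over three values finds no winner, while the hybrid vote returns "v".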
25Interactive Consistency Results
- Agreement: For BIU g, if (|AR| = 0) or (g ∉ AB and |GR| > |SR| + |AR|), then for p, q ∈ GB
- OMHS(g,v,p) = OMHS(g,v,q)
- Validity: If |GR| > |SR| + |AR|, then for p ∈ GB
- If g ∈ GB, then OMHS(g,v,p) = v
- If g ∈ BB, then OMHS(g,v,p) = Error
- If g ∈ SB, then OMHS(g,v,p) = sent(g,v)
26Alternative Verification Options
- For a fixed number of participants, it is easy to demonstrate correctness of Interactive Consistency protocol using symbolic simulation
- Amount of effort needed to verify using theorem prover is negligible
- PVS proof is mostly symbolic evaluation
- there is a small amount of deductive reasoning to evaluate the abstract specification of hybrid majority
27Clock Synchronization Goal
- To match the degree of Fault-Tolerance of the IC protocol, the synchronization protocol should ensure synchronization using the following fault assumptions
- |GR| > |SR| + |AR|
- Not (|AR| > 0 and |AB| > 0)
28Clock Synchronization
- Need added assumption
- |GB| > |SB| + |AB|
- With this assumption, a modified form of the Davies and Wakerly protocol (1978, IEEE ToC) ensures synchronization of the RMUs
- Modified protocol is similar to the Srikanth and Toueg protocol (1987, JACM)
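The fault assumptions for synchronization can be written as a predicate over the sizes of the fault classes. The sketch below assumes each condition compares set sizes (for example, more good RMUs than symmetric and asymmetric RMUs combined). The function and key names are illustrative.

```python
def fault_assumptions_hold(rmu, biu):
    """Check the three synchronization fault assumptions.

    rmu, biu: dicts holding the number of units in each fault class
    under the keys 'good', 'symmetric', and 'asymmetric'.
    """
    rmu_majority = rmu["good"] > rmu["symmetric"] + rmu["asymmetric"]
    biu_majority = biu["good"] > biu["symmetric"] + biu["asymmetric"]
    # Asymmetric faults may occur among the RMUs or the BIUs, but not both.
    not_both_asym = not (rmu["asymmetric"] > 0 and biu["asymmetric"] > 0)
    return rmu_majority and biu_majority and not_both_asym
```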
29Clock Synchronization Basics
- Clocks (counters) driven by oscillator with a bounded drift (ρ) from its stated frequency
- Periodically (every P ticks) clocks will adjust the count based on exchange with other clocks
- The periods are indexed by round number (k)
- Protocol seeks to ensure that at beginning of every round, all good clocks are within dmin
30Synchronization Basics
- If all good clocks start round within dmin of each other, by the end of the round they can be at most dmin + 2ρP apart
- If good clocks then make a small adjustment so that they start the next round within dmin, then both precision and accuracy are satisfied
- Several machine checked proofs exist
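The growth of skew over a round is a one-line calculation: two good clocks that drift in opposite directions at the bounded rate separate by twice the drift times the round length. The sketch below works through it. The function names are assumptions, and `rho` stands for the drift bound.

```python
def end_of_round_skew(dmin, rho, period):
    """Worst-case separation of two good clocks at the end of a round.

    dmin:   separation at the start of the round (ticks)
    rho:    bound on drift from the stated frequency
    period: nominal round length P (ticks)
    """
    return dmin + 2 * rho * period

def required_adjustment(rho, period):
    """Skew accumulated in one round, which resynchronization must remove."""
    return 2 * rho * period
```

With purely illustrative numbers, a start-of-round skew of 4 ticks, a drift bound of 1e-5, and a round of 100,000 ticks give a worst-case end-of-round skew of about 6 ticks.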
31Network Imprecision
- There is imprecision in communication
- If a node transmits a message at time t, observing nodes will receive it within time interval
- [t + d, t + d + e]
- d is the minimum communication delay
- e is the inherent imprecision (e > (1 + ρ) ticks)
- e is a lower bound on synchronization precision
- For many Byzantine Resilient protocols,
- dmin ? 2e
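The reception interval [t + d, t + d + e] can be made concrete with a short sketch. The function names are assumptions; d and e are the minimum delay and the imprecision.

```python
def receive_window(t, d, e):
    """Earliest and latest time an observer can receive a message sent at t."""
    return (t + d, t + d + e)

def max_receive_spread(d, e):
    """Largest gap between two observers' reception times of one message."""
    earliest, latest = receive_window(0.0, d, e)
    return latest - earliest
```

The spread equals e regardless of d, which is why e, and not the delay itself, limits how tightly clocks can be synchronized.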
32Simple Protocol (for round k)
- RMU (Perform the following concurrently)
- If Ready?(k) then broadcast (round k) to all BIUs
- If Accept?(k) then reset counter for round k
- BIU
- If Accept?(k) then broadcast (round k) to all
RMUs
33Informal Description
- Each good RMU broadcasts when its clock reaches a specific value
- Each good BIU waits until it knows it has received a message from at least one good RMU. It then relays this information to all RMUs
- Each RMU waits until it knows it has a message from at least one good BIU before resetting
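The two-stage relay can be illustrated with an idealized simulation of one round. The sketch assumes all units are fault-free, a single fixed communication delay, and that a unit accepts once it has heard from more than half of the senders. These simplifications and all names are assumptions, not part of the verified protocol.

```python
def majority_time(times):
    """Time at which more than half of the listed senders have been heard."""
    return sorted(times)[len(times) // 2]

def simulate_round(rmu_ready, n_biu, delay):
    """Return the time at which each RMU resets its counter.

    rmu_ready: times at which each RMU broadcasts (round k)
    n_biu:     number of BIUs
    delay:     fixed one-way communication delay (idealized)
    """
    # Each BIU hears every RMU 'delay' after that RMU became ready.
    biu_accept = [
        majority_time([t + delay for t in rmu_ready]) for _ in range(n_biu)
    ]
    # Each RMU hears every BIU 'delay' after that BIU accepted.
    return [majority_time([t + delay for t in biu_accept]) for _ in rmu_ready]
```

In this idealized setting, RMUs whose clocks reached the broadcast value at 100.0, 100.3, and 99.8 all reset at 102.0; with real network imprecision the reset times would differ slightly.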
34Ready and Accept
- Ready?(k) is an event triggered by a pre-determined local counter value, kP - a,
- a is a constant offset for communication delays
- P is the nominal duration of a round
- k is the round index
- Degree of fault-tolerance is determined by
Accept?(k)
35Accept?(k)
- Wait until there is a hybrid majority of observed (round k) events to trigger Accept?(k)
- A selection function under the hybrid fault model ignores manifestly bad values
- This protocol ensures that all good RMUs accept (round k) within a short time interval, provided the maximum fault assumptions are not violated
- Can also synchronize BIUs by echoing RMU accept
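One way to picture the Accept?(k) trigger is as the arrival time of the message that completes a majority among the sources not already excluded as manifestly bad. The threshold used below (strictly more than half of the eligible sources) is an assumption chosen so that, when good sources outnumber symmetric and asymmetric ones, any majority contains at least one good source. The names are hypothetical.

```python
def accept_time(arrivals, n_eligible):
    """Time at which a strict majority of eligible sources has been heard.

    arrivals:   arrival times of (round k) messages, one per sending source
    n_eligible: number of sources not excluded as manifestly bad
    Returns None if a majority is never reached.
    """
    needed = n_eligible // 2 + 1
    ordered = sorted(arrivals)
    return ordered[needed - 1] if len(ordered) >= needed else None
```

A single early message from a faulty source cannot trigger acceptance on its own: with three eligible sources and arrivals at 9.0, 10.0, and 10.1, acceptance occurs at 10.0.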
36Verification in PVS
- Built generic clock synchronization theory in PVS
- PVS theory from Ulm introduced too much potential error in formulation of clock drift assumptions
- New theory allows proofs of clock skew as tight as best theoretical results
- Support for some protocols absent
- All pieces in place to complete SPIDER verification
- Estimate 1-2 weeks effort to tie up loose ends
37Alternative Verification Options
- Protocol is (almost) finite state
- Should be possible to use a model checker to confirm that all good nodes start each round within dmin of each other
- Plausible tools for this are HyTech, UPPAAL
- Still need theorem prover to get Precision and Accuracy results
- Theorem prover can verify for arbitrary number of participants
38Diagnosis
- Plan to adapt MAFT on-line diagnosis algorithms to SPIDER architecture
- MAFT algorithms previously verified using PVS
- Chris Walter, Patrick Lincoln, and Neeraj Suri. Formally verified on-line diagnosis, IEEE Trans. On Software Engineering, Nov. 1997
- For diagnosis of Processing Elements, there exist verified Group Membership protocols
- Katz, Lincoln, and Rushby, Low overhead time-triggered group membership, In 11th Workshop on Distributed Algorithms (WDAG97), pages 155-169, LNCS 1320
39Summary
- New conceptual design for a family of fault-tolerant systems
- Design being developed using DO-254 guidance
- Critical fault-tolerance verified using PVS
- Able to reuse or adapt many existing proofs of fault-tolerance protocols
- Unable to reuse existing Clock synchronization proofs
40Future Plans
- Report documenting SPIDER Conceptual design and proofs of fault-tolerance by end of summer
- First laboratory prototype implementation of SPIDER by December
- Second generation design in 2001