Analysis of the SPIDER Fault-Tolerance Protocols

1
Analysis of the SPIDER Fault-Tolerance Protocols
  • Paul S. Miner
  • p.s.miner_at_larc.nasa.gov
  • 5th NASA Langley Formal Methods Workshop
  • Williamsburg, VA
  • June 14, 2000

2
What is SPIDER?
  • A general purpose fault-tolerant architecture
  • Scalable Processor-Independent Design for
    Electromagnetic Resilience
  • Intended to serve as a platform to explore
    recovery strategies for HIRF/EMI induced faults
  • Developed as part of an FAA funded case-study to
    exercise RTCA DO-254 Design Assurance Guidance
    for Airborne Electronic Hardware

3
RTCA DO-254/EUROCAE ED-76
  • Developed by RTCA Special Committee 180 and
    EUROCAE Working Group 46
  • Approved by RTCA Program Management Committee in
    April 2000
  • Approved by EUROCAE(?)
  • FAA Advisory Circular (?)
  • Earliest would be sometime this fall

4
Formal Methods in DO-254
  • Formal Methods is one of the advanced analysis
    techniques suggested when developing hardware to
    support safety-critical (Level A or B) aircraft
    functions
  • DO-254 section on Formal Methods based upon
    material from NASA Formal Methods Guidebook,
    Volume II, (NASA-GB-001-97)

5
DO-254 Case-Study Participants
  • NASA LaRC (Design Team)
  • Paul Miner, Project Lead
  • Mahyar Malekpour, Design Engineer
  • Wilfredo Torres-Pomales, Design Engineer
  • Victor Carreño, Process Assurance
  • FAA (Sponsor and Certification Liaison)
  • Leanna Rierson, Pete Saraceni, Dennis Wallace,
    Connie Beane, and Will Struck

6
Strategy for DO-254 Case-Study
  • Fault-tolerance protocols and reliability models
    use the same fault classifications
  • Reliability analysis using SURE (Butler)
  • Calculates P(enough good hardware)
  • Formal proof of fault-tolerance protocols using
    PVS (SRI)
  • enough good hardware ⇒ correct operation

7
SPIDER Design Concept
  • Inspired by several earlier designs
  • Main concept inspired by Palumbo's Fault-tolerant
    processing system (U.S. Patent 5,533,188)
  • Developed as part of Fly-By-Light/Power-By-Wire
    project
  • Other ideas from Draper's FTPP, FTP, and FTMP;
    Allied-Signal's MAFT; SRI's SIFT; Kopetz's TTA;
    Honeywell's SAFEbus . . .

8
SPIDER Architecture
  • N simplex general purpose nodes logically
    connected via a Reliable Optical BUS (ROBUS)
  • A ROBUS is an ultra-reliable unit providing basic
    fault-tolerant services
  • A ROBUS is implemented as a special purpose
    fault-tolerant device

9
SPIDER Architecture
  [Figure: processing nodes 0 through 7 arranged around a central ROBUS]

10
Logical View of ROBUS
  • ROBUS operates as a time-division multiplexed
    access broadcast bus
  • ROBUS strictly enforces write access
  • no babbling idiots
  • Processing nodes may be grouped to provide
    differing degrees of fault-tolerance
  • some voting available within ROBUS

11
Logical View of ROBUS (Sample Configuration)
  [Figure: sample configuration of processing nodes 0 through 7 attached to the ROBUS]

12
ROBUS Characteristics
  • Bus access schedule statically determined
  • similar to SAFEbus, TTA
  • Some fault-tolerance functions provided by
    processing nodes
  • ROBUS will not have general purpose processing
    capabilities
  • Processing Elements need not be uniform
  • support for dissimilar architectures

13
ROBUS Requirements
  • All fault-free nodes observe the exact same
    sequence of messages
  • ROBUS provides a reliable time source (RTS)
  • The nodes are synchronized relative to this RTS
  • ROBUS provides correct and consistent system
    diagnostic information to all fault-free nodes
  • For a 10-hour mission, P(ROBUS Failure) < 10^-10

14
Interactive Consistency (Byzantine Agreement)
  • Agreement: For any message, all non-faulty
    receiving nodes will agree on the value of the
    message
  • Validity: If the originator of the message is
    non-faulty, good receivers will receive the
    message sent

15
Clock Synchronization
  • Precision: There is a small positive constant
    dmax such that for any two clocks that are good
    at t,
  • |C1(t) - C2(t)| ≤ dmax
  • Accuracy: All good clocks maintain an accurate
    measure of the passage of time (within a linear
    envelope of real time)
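
The precision property can be checked mechanically for a snapshot of clock readings. The sketch below is only an illustration: the function name and the sample readings are assumptions, not part of the SPIDER design, and it reads the bound as an absolute difference of at most dmax.

```python
from itertools import combinations


def precision_holds(readings, dmax):
    """True if every pair of good-clock readings differs by at most dmax."""
    return all(abs(a - b) <= dmax for a, b in combinations(readings, 2))


# Hypothetical readings (in ticks) of three good clocks at the same instant.
within_bound = precision_holds([100, 101, 103], dmax=3)
outside_bound = precision_holds([100, 104], dmax=3)
```

Accuracy is a separate property and is not covered here; it bounds each clock against real time rather than against the other clocks.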

16
Diagnosis
  • Correctness: Every node diagnosed as faulty by a
    good node is faulty
  • A good node can never conclude that another good
    node is faulty
  • Completeness: Every faulty node is (eventually)
    diagnosed as being faulty
  • This is not always possible (pathological case
    involves asymmetric fault)

17
Physical Segregation
  • ROBUS decomposed into physically isolated Fault
    Containment Regions (FCR)
  • Two main design elements
  • Bus Interface Unit (BIU)
  • Redundancy Management Unit (RMU)
  • Processing elements may form separate FCRs
  • FCRs fail independently
  • This is necessary to achieve reliability goals

18
ROBUS Topology
19
Hybrid Fault Assumptions
  • The failure status of an FCR is subdivided into
    four mutually exclusive cases
  • Good (or fault-free)
  • Benign Faulty (Known bad by all good)
  • Symmetric Faulty (Same to all good)
  • Asymmetric Faulty (Byzantine, Malicious)
  • This is a global classification, individual FCRs
    do not know the failure status of other FCRs

20
Fault Classification
  • Partition the RMUs into disjoint subsets based
    upon fault classification
  • GR, BR, SR, and AR for good, benign, symmetric,
    and asymmetric RMUs respectively
  • Similarly partition the BIUs
  • GB, BB, SB, and AB
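
The partition into fault classes can be sketched as a small data model. The names FaultClass and partition are illustrative assumptions. As the slides note, this is a global view used for analysis; no individual FCR has access to it.

```python
from enum import Enum


class FaultClass(Enum):
    GOOD = "G"
    BENIGN = "B"
    SYMMETRIC = "S"
    ASYMMETRIC = "A"


def partition(status, suffix):
    """Split units into disjoint sets keyed by names such as 'GR' or 'AB'.

    status maps a unit id to its FaultClass; suffix is 'R' for RMUs and
    'B' for BIUs. Every unit lands in exactly one set.
    """
    sets = {fc.value + suffix: set() for fc in FaultClass}
    for unit, fc in status.items():
        sets[fc.value + suffix].add(unit)
    return sets


# Hypothetical global status of three RMUs, one of them asymmetric faulty.
rmus = {0: FaultClass.GOOD, 1: FaultClass.GOOD, 2: FaultClass.ASYMMETRIC}
rmu_sets = partition(rmus, "R")
```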

21
Tolerating Asymmetric Faults
  • Requires 3f + 1 participants in protocol to
    withstand f simultaneous asymmetric faults
  • Requires 2f + 1 disjoint communication paths
    between any two participants
  • Requires f + 1 levels of communication
  • ROBUS Topology satisfies these conditions for N ≥
    3, M ≥ 3, f = 1
  • For target reliability, we must tolerate at least
    1 asymmetric fault
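
As a worked example of these bounds, the sketch below evaluates the classical requirements for f simultaneous asymmetric faults (3f + 1 participants, 2f + 1 disjoint paths, f + 1 levels of communication). The function name and the dictionary keys are assumptions made for illustration.

```python
def asymmetric_fault_requirements(f):
    """Minimum resources needed to withstand f simultaneous asymmetric faults."""
    return {
        "participants": 3 * f + 1,
        "disjoint_paths": 2 * f + 1,
        "communication_levels": f + 1,
    }


# The single-fault case targeted on this slide.
single_fault = asymmetric_fault_requirements(1)
```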

22
Interactive Consistency
  • SPIDER IC protocol is simple adaptation of IC
    algorithm for Draper FTP Architecture
  • Existing PVS proof due to Lincoln and Rushby,
    COMPASS '94, pages 107-120
  • Protocol generalizes one suggested in
  • Daniel Davies and John Wakerly, Synchronization
    and Matching in Redundant Systems, IEEE Trans. on
    Computers, Vol. C-27, No. 6, June 1978

23
SPIDER IC Protocol
  • Algorithm OMS (ignoring hybrid fault model)
  • 1. Processing element j sends value v to BIU j
  • 2. BIU j broadcasts v to all RMUs
  • 3. All RMUs broadcast value received from BIU j
    to all BIUs
  • 4. Each BIU votes on the values received from the
    RMUs to determine value from j
  • 5. Each BIU forwards the voted value to its PE
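
The five steps above can be sketched as a small simulation. This is a simplified model under stated assumptions, not the verified protocol: the source BIU is taken to be good, a faulty RMU is modelled as an arbitrary function from BIU index to value, and the names majority and oms_round are hypothetical.

```python
from collections import Counter


def majority(values):
    """Return the strict-majority value, or None if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else None


def oms_round(v, n_bius, n_rmus, faulty_rmus=None):
    """One broadcast of v from a good source BIU.

    faulty_rmus maps an RMU index to a function (biu_index -> value) giving
    what that faulty RMU sends to each BIU; all other RMUs relay faithfully.
    """
    faulty_rmus = faulty_rmus or {}
    # Step 2: the source BIU broadcasts v to every RMU.
    received_by_rmu = [v] * n_rmus
    voted = []
    for biu in range(n_bius):
        # Step 3: each RMU relays what it received to this BIU.
        relayed = [
            faulty_rmus[r](biu) if r in faulty_rmus else received_by_rmu[r]
            for r in range(n_rmus)
        ]
        # Step 4: the BIU votes on the relayed values.
        voted.append(majority(relayed))
    # Step 5: each BIU would forward voted[biu] to its processing element.
    return voted


# One asymmetric RMU that tells each BIU something different.
result = oms_round("v", n_bius=3, n_rmus=3, faulty_rmus={1: lambda biu: f"x{biu}"})
```

With three RMUs and one of them faulty, the two faithful relays outvote the faulty one at every BIU, so all BIUs settle on the value that was sent.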

24
Adapting for hybrid faults
  • Simple modification to steps 3 and 4 to enable
    special handling of manifestly bad messages
  • from benign faulty or asymmetrically faulty
    sources
  • OMHS(p,v,q) denotes the value received by q, when
    p broadcast value v using hybrid oral messages
    protocol on SPIDER
  • Verified in PVS, using simple modifications to
    Lincoln and Rushby's proof of the Draper FTP
    Interactive Consistency Protocol
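
A hybrid vote that sets aside manifestly bad messages might look like the sketch below. The ERROR sentinel and the choice to return ERROR when no strict majority exists are assumptions made for illustration; the abstract specification of hybrid majority used in the PVS proof may differ.

```python
from collections import Counter

ERROR = object()  # stand-in for a manifestly bad (detectably faulty) message


def hybrid_majority(values):
    """Majority vote over the values that are not manifestly bad.

    Returns ERROR if nothing usable remains or no strict majority exists
    (an assumption of this sketch).
    """
    usable = [x for x in values if x is not ERROR]
    if not usable:
        return ERROR
    value, count = Counter(usable).most_common(1)[0]
    return value if 2 * count > len(usable) else ERROR
```

For example, hybrid_majority(["v", ERROR, "v"]) yields "v", because the manifestly bad entry is dropped before the vote.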

25
Interactive Consistency Results
  • Agreement: For BIU g, if (|AR| = 0) or (g ∉ AB
    and |GR| > |SR| + |AR|), then for p, q ∈ GB
  • OMHS(g,v,p) = OMHS(g,v,q)
  • Validity: If |GR| > |SR| + |AR|, then for p ∈ GB
  • If g ∈ GB, then OMHS(g,v,p) = v
  • If g ∈ BB, then OMHS(g,v,p) = Error
  • If g ∈ SB, then OMHS(g,v,p) = sent(g,v)
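
The preconditions of these results can be written as predicates over the sizes of the RMU fault classes. The sketch reads the set names as cardinalities (good RMUs must outnumber symmetric plus asymmetric RMUs); the function and parameter names are assumptions.

```python
def agreement_guaranteed(source_asymmetric, good_rmus, symmetric_rmus, asymmetric_rmus):
    """Precondition for Agreement when BIU g is the source.

    Holds if there are no asymmetric RMUs, or if g itself is not asymmetric
    and good RMUs outnumber symmetric plus asymmetric RMUs.
    """
    if asymmetric_rmus == 0:
        return True
    return (not source_asymmetric) and good_rmus > symmetric_rmus + asymmetric_rmus


def validity_guaranteed(good_rmus, symmetric_rmus, asymmetric_rmus):
    """Precondition for Validity: good RMUs outnumber the non-benign faulty ones."""
    return good_rmus > symmetric_rmus + asymmetric_rmus
```

These predicates only state when the results apply; they say nothing about behavior once the fault assumptions are violated.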

26
Alternative Verification Options
  • For a fixed number of participants, it is easy to
    demonstrate correctness of Interactive
    Consistency protocol using symbolic simulation
  • Amount of effort needed to verify using theorem
    prover is negligible
  • PVS proof is mostly symbolic evaluation
  • there is a small amount of deductive reasoning
    to evaluate the abstract specification of hybrid
    majority

27
Clock Synchronization Goal
  • To match the degree of Fault-Tolerance of the IC
    protocol, the synchronization protocol should
    ensure synchronization using the following fault
    assumptions
  • |GR| > |SR| + |AR|
  • Not (|AR| > 0 and |AB| > 0)

28
Clock Synchronization
  • Need added assumption
  • |GB| > |SB| + |AB|
  • With this assumption, a modified form of the
    Davies and Wakerly protocol (1978, IEEE ToC)
    ensures synchronization of the RMUs
  • Modified protocol is similar to the Srikanth and
    Toueg protocol (1987, JACM)

29
Clock Synchronization Basics
  • Clocks (counters) driven by oscillator with a
    bounded drift (ρ) from its stated frequency
  • Periodically (every P ticks) clocks will adjust
    the count based on exchange with other clocks
  • The periods are indexed by round number (k)
  • Protocol seeks to ensure that at beginning of
    every round, all good clocks are within dmin

30
Synchronization Basics
  • If all good clocks start round within dmin of
    each other, by the end of the round they can be
    at most dmin + 2ρP apart
  • If good clocks then make a small adjustment so
    that they start the next round within dmin, then
    both precision and accuracy are satisfied
  • Several machine checked proofs exist
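
A short calculation illustrates the end-of-round bound. It assumes the bound is dmin + 2ρP, where ρ is the drift bound and P the round length in ticks: one clock may run fast by ρ and another slow by ρ for a whole round. The function name and the sample numbers are hypothetical.

```python
def worst_case_skew(dmin, rho, period):
    """Largest separation of two good clocks at the end of a round.

    Both clocks start within dmin of each other and drift in opposite
    directions at rate rho for `period` ticks.
    """
    return dmin + 2 * rho * period


# Hypothetical numbers: start within 2 ticks, drift 1e-5, round of 100000 ticks.
skew = worst_case_skew(dmin=2, rho=1e-5, period=100_000)
```

With these numbers the drift term contributes about 2 ticks, so the adjustment at the end of the round has to pull the clocks back from roughly 4 ticks apart to within dmin.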

31
Network Imprecision
  • There is imprecision in communication
  • If a node transmits a message at time t,
    observing nodes will receive it within time
    interval
  • [t + d, t + d + e]
  • d is the minimum communication delay
  • e is the inherent imprecision (e > (1 + ρ) ticks)
  • e is a lower bound on synchronization precision
  • For many Byzantine Resilient protocols,
  • dmin ≥ 2e

32
Simple Protocol (for round k)
  • RMU (Perform the following concurrently)
  • If Ready?(k) then broadcast (round k) to all BIUs
  • If Accept?(k) then reset counter for round k
  • BIU
  • If Accept?(k) then broadcast (round k) to all
    RMUs

33
Informal Description
  • Each good RMU broadcasts when its clock reaches a
    specific value
  • Each good BIU waits until it knows it has
    received a message from at least one good RMU.
    It then relays this information to all RMUs
  • Each RMU waits until it knows it has a message
    from at least one good BIU before resetting

34
Ready and Accept
  • Ready?(k) is an event triggered by a
    pre-determined local counter value, kP - a,
  • a is a constant offset for communication delays
  • P is the nominal duration of a round
  • k is the round index
  • Degree of fault-tolerance is determined by
    Accept?(k)

35
Accept?(k)
  • Wait until there is a hybrid majority of observed
    (round k) events to trigger Accept?(k)
  • A selection function under the hybrid fault model
    ignores manifestly bad values
  • This protocol ensures that all good RMUs accept
    (round k) within a short time interval, provided
    the maximum fault assumptions are not violated
  • Can also synchronize BIUs by echoing RMU accept
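
The Ready?(k) and Accept?(k) triggers can be sketched as two small functions. This is an after-the-fact calculation over recorded arrival times, not the protocol logic itself. It assumes that "hybrid majority" means a strict majority of the sources not marked manifestly bad, and all names are hypothetical.

```python
def ready_time(k, period, offset):
    """Local counter value that triggers Ready?(k), i.e. kP - a."""
    return k * period - offset


def accept_time(arrivals):
    """Local time at which a majority of eligible sources has been heard from.

    arrivals maps a source id to the local time its (round k) message was
    observed, or None if that source is manifestly bad. Manifestly bad
    sources are left out of the count. Returns None if no source is eligible.
    """
    times = sorted(t for t in arrivals.values() if t is not None)
    if not times:
        return None
    needed = len(times) // 2 + 1  # strict majority of eligible sources
    return times[needed - 1]


# Hypothetical round 3 with P = 1000 ticks and offset a = 20 ticks.
ready = ready_time(3, period=1000, offset=20)
# Three sources: two on time, one very late (possibly faulty).
accepted = accept_time({0: 2990, 1: 2992, 2: 3400})
```

The late source does not delay acceptance, since the second of three arrivals already forms a majority.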

36
Verification in PVS
  • Built generic clock synchronization theory in PVS
  • PVS theory from Ulm introduced too much potential
    error in formulation of clock drift assumptions
  • New theory allows proofs of clock skew as tight
    as best theoretical results
  • Support for some protocols absent
  • All pieces in place to complete SPIDER
    verification
  • Estimate 1-2 weeks effort to tie up loose ends

37
Alternative Verification Options
  • Protocol is (almost) finite state
  • Should be possible to use a model checker to
    confirm that all good nodes start each round
    within dmin of each other
  • Plausible tools for this are HyTech, UPPAAL
  • Still need theorem prover to get Precision and
    Accuracy results
  • Theorem prover can verify for arbitrary number of
    participants

38
Diagnosis
  • Plan to adapt MAFT on-line diagnosis algorithms
    to SPIDER architecture
  • MAFT algorithms previously verified using PVS
  • Chris Walter, Patrick Lincoln, and Neeraj Suri.
    Formally verified on-line diagnosis, IEEE Trans.
    On Software Engineering, Nov. 1997
  • For diagnosis of Processing Elements, there exist
    verified Group Membership protocols
  • Katz, Lincoln, and Rushby, Low overhead
    time-triggered group membership, In 11th Workshop
    on Distributed Algorithms (WDAG97), pages
    155-169, LNCS 1320

39
Summary
  • New conceptual design for a family of
    fault-tolerant systems
  • Design being developed using DO-254 guidance
  • Critical fault-tolerance verified using PVS
  • Able to reuse or adapt many existing proofs of
    fault-tolerance protocols
  • Unable to reuse existing Clock synchronization
    proofs

40
Future Plans
  • Report documenting SPIDER Conceptual design and
    proofs of fault-tolerance by end of summer
  • First laboratory prototype implementation of
    SPIDER by December
  • Second generation design in 2001