Failure and Fault Tolerance - PowerPoint PPT Presentation

Provided by: scie216
1
Failure and Fault Tolerance
2
An hour-long radar failure at the Kansas City En
Route Traffic Control Center late Sunday
afternoon sent controllers scrambling to locate
aircraft and keep them safe while transitioning
to a far less efficient backup system, according
to the union representing flight controllers.
In the most recent incident Sunday, the union
says air traffic controllers' radar displays
stopped working at approximately 5 p.m. CDT, with
335 aircraft above the skies of the central
United States. The failure reportedly stopped
radar targets for individual flights from
updating on controllers' scopes. Controllers then
had to manually transition to a backup system
that provides a limited display. "The transition
period was utterly chaotic until we regained
control and caused an unusually high level of
stress in the facility," said Kansas City Center
controller Howard Blankenship, who also serves as
facility representative for the National Air
Traffic Controllers Association. The union
says the failure also affected the center's User
Request Evaluation Tool (URET). Without URET
on Sunday, however, the union said controllers
scrambled to find paper and pencil to compensate
for the automation's failure. Blankenship said
dwindling numbers of controllers at the facility
make him concerned about the ability of the FAA
to weather future equipment problems without
putting passenger safety at risk.
From http://www.consumeraffairs.com/news04/2005/air_traffic_failure.html
3
Failures and Faults
  • Failure: the system is no longer functioning
  • Fault: a component has failed
  • Fault tolerance
    • System continues functioning in the presence of
      faults.
    • System continues functioning even if components
      fail.

4
Dependable Systems
  • Available: Is the system available when requested?
  • Reliable: Is the system running continuously?
  • Safe: What happens if the system or a component of
    the system fails?
  • Maintainable: Can a failed system be easily repaired?

5
Fault Types
  • Transient: occurs once and disappears
  • Intermittent: vanishes and reappears
  • Permanent: continues until the faulty component is
    repaired

6
Failure Models
  • Omission Failure
  • Timing Failure
  • Arbitrary Failure (Byzantine failures)

7
Omission Failure
  • Process
    • Crash
    • Fail-stop
  • Communication
    • Dropping messages
    • Send-omission failure
    • Receive-omission failure

8
Timing Failures
  • Some performance guarantee given
  • Server provides data too soon
  • Performance failure: server provides data too
    late
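
The two cases above can be told apart by comparing a response's arrival time against the guaranteed window. This is a minimal sketch, assuming the guarantee is expressed as an earliest and a latest acceptable delay in seconds; the names `classify_response`, `earliest`, and `latest` are illustrative and do not come from the slides.

```python
def classify_response(elapsed, earliest, latest):
    """Classify a response by when it arrived relative to the guarantee.

    elapsed  -- seconds between sending the request and getting the response
    earliest -- assumed lower bound of the guaranteed window
    latest   -- assumed upper bound of the guaranteed window
    """
    if elapsed < earliest:
        return "timing failure: too soon"
    if elapsed > latest:
        return "performance failure: too late"
    return "ok"
```

A response that never arrives at all is an omission failure rather than a timing failure, so a real client would pair this check with a timeout.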

9
Arbitrary Failure
  • Byzantine failures
  • Output cannot be detected as incorrect

10
Managing Faults and Failures
  • Redundancy and replication
  • Logging and restarting from log

11
Creating Logs
  • Use logging
  • Restart process (or replace)
  • Use log
    • To recreate the last state of the process
    • To see which requests were suspended
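
The restart-from-log idea can be sketched as follows. This is only an illustration under stated assumptions: the service, its JSON log entries, and the in-memory list standing in for a durable log file are all hypothetical, not taken from the slides.

```python
import json

class LoggedService:
    """Hypothetical service that rebuilds itself from its own log."""

    def __init__(self, log_lines=()):
        self.log = []      # stand-in for a durable, append-only file
        self.state = {}    # key -> running total
        self.pending = {}  # request id -> request that has not completed
        for line in log_lines:
            # Replay: re-apply every logged entry in its original order.
            self._record(json.loads(line))

    def _record(self, entry):
        # Log the entry, then apply it to the in-memory state.
        self.log.append(json.dumps(entry))
        if entry["kind"] == "received":
            self.pending[entry["id"]] = entry
        elif entry["kind"] == "completed":
            request = self.pending.pop(entry["id"])
            key = request["key"]
            self.state[key] = self.state.get(key, 0) + request["amount"]

    def receive(self, req_id, key, amount):
        self._record({"kind": "received", "id": req_id,
                      "key": key, "amount": amount})

    def complete(self, req_id):
        self._record({"kind": "completed", "id": req_id})
```

Constructing a new `LoggedService` from the old instance's `log` plays the role of the restarted (or replacement) process: `state` is the recreated last state, and `pending` holds the requests that were suspended when the process stopped.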

12
Information to Log
  • Messages received
  • Messages sent
  • Log before or after sending message?
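
One way to explore the last question is to simulate a crash between the two steps. The sketch below is an assumption-laden illustration: `log` and `network` are plain lists standing in for a durable log and a communication channel, and `Crash` and `send_with_log` are hypothetical names.

```python
class Crash(Exception):
    """Simulated process crash."""

def send_with_log(message, log, network, log_first, crash_between=False):
    """Send `message`, recording it in `log` before or after the send.

    With crash_between=True the process "dies" after the first step,
    so only one of the two lists receives the message.
    """
    first, second = (log, network) if log_first else (network, log)
    first.append(message)
    if crash_between:
        raise Crash("process died between the two steps")
    second.append(message)
```

Logging first means a restarted process may find a logged message that was never sent; it can resend, at the cost of a possible duplicate at the receiver. Sending first means a message may have gone out with no record of it in the log.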

13
Case Study
Diagram: Client, Server, Accounting, Shipping, Inventory
  1. List of books, each with ISBN and quantity
  2. Server: for each item, check if it is ours and
     check if it is available; if not, check status
  3. Accounting
  4. Shipping
  5. Inventory
  6. Report
14
Failure Masking
  • Redundancy
    • Send same message multiple times
    • Redundancy in hardware
  • Replication
    • Server
    • Database
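
Sending the same message several times masks a dropped message only if the receiver ignores the extra copies. Here is a minimal sketch, assuming each message carries a unique id; `LossyChannel`, `DedupReceiver`, and `send_redundantly` are hypothetical names used for illustration.

```python
class DedupReceiver:
    """Delivers each message id at most once, however many copies arrive."""

    def __init__(self):
        self.seen = set()
        self.delivered = []

    def receive(self, msg_id, payload):
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.delivered.append(payload)

class LossyChannel:
    """Hypothetical channel that silently drops the first `drops` sends."""

    def __init__(self, receiver, drops):
        self.receiver = receiver
        self.drops = drops

    def send(self, msg_id, payload):
        if self.drops > 0:
            self.drops -= 1  # omission failure: the message vanishes
            return
        self.receiver.receive(msg_id, payload)

def send_redundantly(channel, msg_id, payload, copies=3):
    """Mask up to copies - 1 dropped messages by sending `copies` times."""
    for _ in range(copies):
        channel.send(msg_id, payload)
```

The masking has a limit: if the channel drops every copy, the failure is still visible to the receiver.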

15
Replication Issues
  • Group Communication
    • IP Multicast
      • Unreliable
      • Unordered
  • Agreement between replicas
  • Consistency of data
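
Because the underlying multicast is unordered, replicas need some way to agree on delivery order. One common remedy, not named in the slides, is for a single sender to number its messages and for each receiver to hold back anything that arrives early. The sketch below assumes one sender and a hypothetical `OrderedReceiver` class.

```python
class OrderedReceiver:
    """Delivers one sender's messages in sequence-number order."""

    def __init__(self):
        self.next_seq = 0    # next sequence number to deliver
        self.held = {}       # early arrivals: seq -> payload
        self.delivered = []

    def on_message(self, seq, payload):
        if seq < self.next_seq or seq in self.held:
            return           # duplicate: already delivered or already held
        self.held[seq] = payload
        # Deliver every message that is now contiguous with what came before.
        while self.next_seq in self.held:
            self.delivered.append(self.held.pop(self.next_seq))
            self.next_seq += 1
```

A gap left by a lost message blocks delivery of everything after it, so ordering alone does not address unreliability; a real protocol would also request retransmission of the missing sequence number.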

16
Group Management
  • Membership changes
  • Failure detection
  • Notify members of group membership changes.
  • Ensure messages are reliably (and consistently)
    delivered to all group members.

17
Membership Changes
  • Process requests to join the group
  • Process communicates intent to leave the group
  • Process crashes and needs to be removed from the
    group
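
The three kinds of change above can all be modeled as installing a new numbered view of the group and notifying subscribers. This is a single-process sketch under that assumption; the `Group` class and its method names are hypothetical, and a real group service would also have to get all members to agree on each view.

```python
class Group:
    """Tracks membership and notifies listeners whenever the view changes."""

    def __init__(self):
        self.view_id = 0
        self.members = set()
        self.listeners = []

    def subscribe(self, callback):
        # callback(view_id, members) is called after each membership change
        self.listeners.append(callback)

    def _install_view(self, members):
        if members == self.members:
            return  # nothing changed, so no new view and no notification
        self.members = members
        self.view_id += 1
        for callback in self.listeners:
            callback(self.view_id, frozenset(self.members))

    def join(self, process):
        self._install_view(self.members | {process})

    def leave(self, process):
        self._install_view(self.members - {process})

    def crashed(self, process):
        # Same effect on the view as leave(), but triggered by a detector.
        self._install_view(self.members - {process})
```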

18
Failure Detection
  • Reliable failure detector only possible in
    homogeneous system
  • Unreliable failure detector
    • Checks processes in the group
    • Marks them as suspected or unsuspected
    • Suspected processes are removed from the group
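
An unreliable failure detector is often built on heartbeats and a timeout, although the slides do not say which mechanism they have in mind. The sketch below assumes that approach; `HeartbeatDetector` is a hypothetical name, and the current time is passed in explicitly so the behavior is easy to follow.

```python
class HeartbeatDetector:
    """Marks a process as suspected if its last heartbeat is too old."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # process -> time of its most recent heartbeat

    def heartbeat(self, process, now):
        self.last_seen[process] = now

    def status(self, process, now):
        last = self.last_seen.get(process)
        if last is None or now - last > self.timeout:
            return "suspected"
        return "unsuspected"

    def suspected(self, now):
        """Return every known process that is currently suspected."""
        return {p for p in self.last_seen
                if self.status(p, now) == "suspected"}
```

The detector is unreliable in the sense that a suspected process may only be slow: a late heartbeat moves it back to unsuspected, which is a problem if it has already been removed from the group.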