Title: Failure and Fault Tolerance
An hour-long radar failure at the Kansas City En
Route Traffic Control Center late Sunday
afternoon sent controllers scrambling to locate
aircraft and keep them safe while transitioning
to a far less efficient backup system, according
to the union representing flight controllers.
In the most recent incident Sunday, the union
says air traffic controllers' radar displays
stopped working at approximately 5 p.m. CDT, with
335 aircraft above the skies of the central
United States. The failure reportedly stopped
radar targets for individual flights from
updating on controllers' scopes. Controllers then
had to manually transition to a backup system
that provides a limited display. "The transition
period was utterly chaotic until we regained
control and caused an unusually high level of
stress in the facility," said Kansas City Center
controller Howard Blankenship, who also serves as
facility representative for the National Air
Traffic Controllers Association. The union
says the failure also affected the center's User
Request Evaluation Tool (URET). Without URET
on Sunday, however, the union said controllers
scrambled to find paper and pencil to compensate
for the automation's failure. Blankenship said
dwindling numbers of controllers at the facility
make him concerned about the ability of the FAA
to weather future equipment problems without
putting passenger safety at risk.
From: http://www.consumeraffairs.com/news04/2005/air_traffic_failure.html
Failures and Faults
- Failure: the system is no longer functioning
- Fault: a component has failed
- Fault tolerance: the system continues functioning in the presence of
faults, that is, even if components fail
Dependable Systems
- Available: Is the system available when requested?
- Reliable: Is the system running continuously?
- Safe: What happens if the system or a component of
the system fails?
- Maintainable: Is a failed system easily repaired?
Fault Types
- Transient: occurs once and disappears
- Intermittent: vanishes and reappears
- Permanent: continues until the faulty component is
repaired
Failure Models
- Omission Failure
- Timing Failure
- Arbitrary Failure (Byzantine failures)
Omission Failure
- Process
  - Crash
  - Fail-stop
- Communication
  - Dropping messages
  - Send-omission failure
  - Receive-omission failure
Timing Failures
- Some performance guarantee is given
- Server provides data too soon
- Performance failure: server provides data too
late
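
The two cases above can be sketched as a check of a response time against a guaranteed window. This is only an illustration: the function name `classify_response` and the `earliest`/`latest` bounds are hypothetical and do not come from the slides.

```python
def classify_response(elapsed: float, earliest: float, latest: float) -> str:
    """Classify a response time (seconds) against a guaranteed window.

    `earliest` and `latest` are assumed bounds of the performance guarantee.
    """
    if elapsed < earliest:
        # Data arrived before the receiver is guaranteed to be ready for it.
        return "timing failure: too soon"
    if elapsed > latest:
        # Data arrived after the guaranteed deadline.
        return "performance failure: too late"
    return "ok"


if __name__ == "__main__":
    for elapsed in (0.01, 0.20, 1.50):
        print(elapsed, classify_response(elapsed, earliest=0.05, latest=1.0))
```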
Arbitrary Failure
- Byzantine failures
- Server may produce incorrect output that cannot be detected as incorrect
Managing Faults and Failures
- Redundancy and replication
- Logging and restarting from log
Creating Logs
- Use logging
- Restart process (or replace)
- Use log
  - To recreate the last state of the process
  - To see which requests were suspended
Information to Log
- Messages received
- Messages sent
- Log before or after sending message?
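
A minimal sketch of logging received messages and rebuilding state from the log after a restart. The `LoggedCounter` class, the JSON-lines log format, and the choice to log before applying a message are all assumptions made for illustration; the slides leave the before/after question open.

```python
import json
import os
import tempfile


class LoggedCounter:
    """Toy process whose state is the sum of the amounts it has received."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.total = 0

    def receive(self, amount):
        # Assumption: log the message BEFORE applying it, so a crash after
        # this point can be recovered by replaying the log.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"event": "received", "amount": amount}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.total += amount

    @classmethod
    def restart_from_log(cls, log_path):
        """Create a replacement process and replay the log to rebuild state."""
        process = cls(log_path)
        if os.path.exists(log_path):
            with open(log_path, encoding="utf-8") as log:
                for line in log:
                    record = json.loads(line)
                    if record["event"] == "received":
                        process.total += record["amount"]
        return process


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "process.log")
        original = LoggedCounter(path)
        original.receive(5)
        original.receive(7)
        del original  # simulate a crash: in-memory state is lost
        recovered = LoggedCounter.restart_from_log(path)
        print(recovered.total)
```

Logging before applying means a message can be logged but never processed if the crash happens in between; replay then applies it once on restart. Logging after applying has the opposite risk of losing a processed message.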
Case Study
Client, Server
1. List of books, each with ISBN and quantity
2. Server: for each item, check if it is ours;
check if it is available; if not, check status
3. Accounting
4. Shipping
5. Inventory
6. Report
Failure Masking
- Redundancy
  - Send same message multiple times
  - Redundancy in hardware
- Replication
  - Server
  - Database
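
Sending the same message multiple times can be sketched with a simulated lossy channel. Everything here is hypothetical (the `LossyChannel` class, the message IDs, the default of three copies); the point is only that the receiver must discard duplicates when more than one copy arrives.

```python
import random


class LossyChannel:
    """Simulated channel that drops each message with probability loss_rate."""

    def __init__(self, loss_rate, seed=0):
        self.loss_rate = loss_rate
        self.rng = random.Random(seed)
        self.delivered = []

    def send(self, message):
        # random() is in [0, 1), so loss_rate=1.0 drops every message
        # and loss_rate=0.0 delivers every message.
        if self.rng.random() >= self.loss_rate:
            self.delivered.append(message)


def send_redundantly(channel, msg_id, payload, copies=3):
    """Mask omission failures by sending the same message several times."""
    for _ in range(copies):
        channel.send((msg_id, payload))


def receive_unique(channel):
    """Keep the first copy of each message ID and ignore duplicates."""
    seen = {}
    for msg_id, payload in channel.delivered:
        seen.setdefault(msg_id, payload)
    return seen


if __name__ == "__main__":
    channel = LossyChannel(loss_rate=0.5, seed=1)
    for msg_id in range(5):
        send_redundantly(channel, msg_id, f"order-{msg_id}")
    print(receive_unique(channel))
```

If losses are independent with probability p, all k copies are lost with probability p^k, so redundancy lowers the chance of an omission but does not eliminate it.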
Replication Issues
- Group communication
  - IP multicast
    - Unreliable
    - Unordered
- Agreement between replicas
- Consistency of data
Group Management
- Membership changes
- Failure detection
- Notify members of group membership changes.
- Ensure messages are reliably (and consistently)
delivered to all group members.
Membership Changes
- Process requests to join the group
- Process communicates intent to leave the group
- Process crashes and needs to be removed from the
group
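
The three kinds of membership change can be sketched as operations that each install a new numbered view and notify the members. The `Group` class and its view numbering are assumptions for illustration, not a description of any particular group communication system.

```python
class Group:
    """Toy group manager: every membership change installs a new view."""

    def __init__(self):
        self.view_id = 0
        self.members = set()
        # Stand-in for notifying members: a record of every installed view.
        self.notifications = []

    def _install_view(self, members):
        self.view_id += 1
        self.members = set(members)
        view = (self.view_id, tuple(sorted(self.members)))
        self.notifications.append(view)
        return view

    def join(self, process):
        """A process requests to join the group."""
        return self._install_view(self.members | {process})

    def leave(self, process):
        """A process communicates its intent to leave the group."""
        return self._install_view(self.members - {process})

    def crash(self, process):
        """A crashed process is removed (in practice, reported by a failure detector)."""
        return self._install_view(self.members - {process})


if __name__ == "__main__":
    group = Group()
    group.join("p1")
    group.join("p2")
    group.join("p3")
    group.leave("p1")
    group.crash("p3")
    for view in group.notifications:
        print(view)
```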
Failure Detection
- Reliable failure detector: only possible in a
homogeneous system
- Unreliable failure detector
  - Checks processes in the group
  - Marks them as suspected or unsuspected
  - Suspected processes are removed from the group
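
One common way to build an unreliable failure detector is with heartbeats and a timeout; the slides do not say which mechanism is intended, so this is a sketch under that assumption. The `HeartbeatDetector` class is hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class HeartbeatDetector:
    """Marks members as suspected when no heartbeat arrives within `timeout`."""

    def __init__(self, members, timeout, start=0.0):
        self.timeout = timeout
        # Time at which each member was last heard from.
        self.last_heard = {member: start for member in members}

    def heartbeat(self, member, now):
        """Record a heartbeat from a current group member."""
        if member in self.last_heard:
            self.last_heard[member] = now

    def status(self, now):
        """Return 'suspected' or 'unsuspected' for every member."""
        return {
            member: "suspected" if now - heard > self.timeout else "unsuspected"
            for member, heard in self.last_heard.items()
        }

    def remove_suspected(self, now):
        """Remove suspected members from the group and return them."""
        suspected = [m for m, s in self.status(now).items() if s == "suspected"]
        for member in suspected:
            del self.last_heard[member]
        return suspected


if __name__ == "__main__":
    detector = HeartbeatDetector(["a", "b"], timeout=2.0)
    detector.heartbeat("a", now=3.0)
    print(detector.status(now=4.0))
    print(detector.remove_suspected(now=4.0))
```

The detector is unreliable because a slow process or a delayed heartbeat looks the same as a crash, so a live process can be wrongly suspected and removed.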