Fault Tolerance - PowerPoint PPT Presentation

Provided by: loish6
Learn more at: http://www.cs.fsu.edu

Title: Fault Tolerance


1
Fault Tolerance and Reliability, CDA 5140, Spring
2006
  • Chapter 1
  • Overview and Definitions

2
Topics
  • basic concepts of Fault Tolerance (FT)
  • reliability and availability of systems, both
    hardware and software
  • tools to compare and contrast FT designs

3
What is FT?
  • Computing correctly in the presence of errors
  • Some techniques come from analog systems of the
    1940s - 1960s
  • Digital technology adds to these to be faster,
    better and cheaper
  • Investigate architecture keeping in mind tradeoffs
    of cost, weight and volume
  • Becoming more important as digital systems become
    more and more prevalent

4
Why Have FT?
  • Needed more in the 21st century because of:
    • Harsher environments
    • Many novice users
    • Increasing repair costs
    • Larger systems
    • Digital systems more prevalent
    • More users dependent on digital systems, from
      business to government to home to school

5
How is FT Obtained?
  • Add redundancy in the form of:
    • Hardware, e.g. RAID
    • Software, e.g. 2 algorithms for the same task
    • Information, e.g. coding theory
    • Time, e.g. on the Internet, if a fault occurs,
      then try a new route
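
The "2 algorithms for same task" idea above can be sketched in a few lines. This is only an illustrative sketch, not the course's method: the task (summing 1..n) and the function names are assumptions chosen for brevity. Two independently written versions compute the same result, and a disagreement signals a fault in one of them.

```python
def sum_iterative(n: int) -> int:
    """First version (hypothetical task): add 1..n one term at a time."""
    total = 0
    for k in range(1, n + 1):
        total += k
    return total


def sum_closed_form(n: int) -> int:
    """Second, independently designed version: closed form n(n+1)/2."""
    return n * (n + 1) // 2


def checked_sum(n: int) -> int:
    """Run both versions and report a fault if they disagree."""
    a, b = sum_iterative(n), sum_closed_form(n)
    if a != b:
        raise RuntimeError(f"versions disagree: {a} != {b}")
    return a


print(checked_sum(100))  # 5050
```

Two versions can only detect a disagreement; deciding which one is right needs a third version or an acceptance test.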

6
Definitions and Terminology
  • Failure - departure from correct operation
  • Fault - flaw in hardware or software resulting in
    failure, e.g. physical problems, design flaws or
    defects in hardware; design or implementation
    flaws in software
  • Error - incorrect response from a module, leading
    to system failure if there is no FT
  • Fault type - hardware or software
  • Fault cause - improper design, hardware failure,
    external disturbance

7
Definitions continued
  • Permanent Fault - always present, needs repair to
    remove
  • Intermittent fault - not always present but still
    needs repair to remove
  • Transient fault - will disappear without repair
  • Fault latency - period during which a fault goes
    undetected because it has not yet caused an error
  • Fault-avoidance - use of high quality components
    and careful design to avoid faults
  • Fault-tolerance - use of redundancy (hardware,
    software, information or time) to correct system
    operation after fault occurs
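
The transient-fault and time-redundancy definitions above can be illustrated with a retry loop. This is a minimal sketch under stated assumptions: `TransientFault`, `make_flaky_read` and `with_retry` are hypothetical names, and the fault is simulated deterministically (the read fails a fixed number of times, then succeeds).

```python
class TransientFault(Exception):
    """Hypothetical error type for a fault that disappears without repair."""


def make_flaky_read(failures: int):
    """Return a read function that fails `failures` times, then succeeds."""
    remaining = {"count": failures}

    def read() -> int:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise TransientFault("simulated glitch")
        return 42

    return read


def with_retry(operation, max_attempts: int = 3):
    """Time redundancy: repeat `operation` until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientFault:
            if attempt == max_attempts:
                raise  # fault outlasted the retries, so it is not transient


print(with_retry(make_flaky_read(failures=2)))  # 42
```

Retrying only helps with transient faults; a permanent fault exhausts the attempts and still needs repair.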

8
Definitions continued
  • Graceful degradation - system continues to operate
    correctly after faults, but with degraded
    performance
  • Fail-safe - system can fail but only to safe
    state to avoid catastrophes
  • Reliability - probability of not failing within
    time t given operating correctly at time 0
  • Availability - probability system operating
    correctly at time t
  • Maintainability - probability that system can be
    restored to operation by time t given not
    operational at time 0

9
Definitions continued
  • Mean-time-to-failure (MTTF) - expected value of
    system failure time
  • Mean-time-to-repair (MTTR) - expected value of
    system repair time
  • Mean-time-between-failure (MTBF) - expected time
    between successive system failures,
    MTBF = MTTF + MTTR
  • Fault detection - method used to detect presence
    of fault
  • Fault confinement - technique to confine damage
    of fault to as small an area as possible
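
The relation between MTTF, MTTR and MTBF defined above can be checked numerically. The sketch below is illustrative only: the function names and the sample values (999 hours up, 1 hour to repair) are assumptions, not figures from the course. It also uses the standard steady-state result that availability equals MTTF / (MTTF + MTTR).

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures: time to fail plus time to repair."""
    return mttf_hours + mttr_hours


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


# Hypothetical system: runs 999 hours on average, takes 1 hour to repair.
print(mtbf(999.0, 1.0))                       # 1000.0
print(steady_state_availability(999.0, 1.0))  # 0.999
```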

10
Definitions continued
  • Fault diagnosis - automatic identification of
    faulty modules
  • Recovery - system put into operating state,
    possibly degraded
  • Hardware redundancy - extra hardware to detect,
    mask or diagnose faults
  • Passive hardware redundancy - fault masking to
    hide faults and prevent them from resulting in
    errors; requires no action by the system
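
Fault masking, as defined above, is commonly illustrated with triple modular redundancy (TMR): three copies of a module feed a majority voter. The sketch below is a software stand-in for such a voter; the function name and the sample values are assumptions for illustration.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules."""
    value, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(outputs) // 2:
        raise RuntimeError("no majority: too many modules disagree")
    return value


# Three redundant modules; the third one is faulty and outputs 9.
print(majority_vote([7, 7, 9]))  # 7
```

The single faulty output is masked without any detection or reconfiguration step, which is what makes the redundancy passive.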

11
Definitions continued
  • Information redundancy - use of coding theory
    techniques (addition of bits)
  • Software redundancy - use of diagnostic software
    or extra modules, each with distinct algorithm
  • Temporal redundancy - repeating bus cycles or
    whole programs, new route on Internet

12
Microelectronic Growth
  • Density of chips has increased dramatically and,
    concomitantly, so has the use of digital systems
  • Obvious need for FT in space shuttle, nuclear
    power plants, but with increased use in homes,
    more faults likely so will need FT there too
  • Interesting observations:
    • 1999: typical home had 40-60 microprocessors
    • 2004: expected to be 280

13
Reliability and Availability
  • Goal: high reliability and availability based on
    sound analysis, not conjecture!
  • Use both reliability and availability as measures

14
Air Traffic Control Example
  • ATC fails once/year, so MTTF = 8766 hours
  • Airline Reservation System (ARS) down 5
    times/year, so MTTF ≈ 1753 hours
  • Availability (A) = uptime/(uptime + downtime)
  • ATC down 1 hour, so
  • A = 8765/(8765 + 1) = 0.999886
  • ARS down for 1 minute, 5 times, or 0.083333 hours
  • A = 8765.916667/8766 = 0.9999905

15
Air Traffic Control Example (cont'd)
  • Unavailability U = 1 - A
  • So, comparing the two systems for U
  • (1 - 0.999886)/(1 - 0.9999905) = 12
  • The ARS is 12 times better than the ATC in terms
    of availability (its unavailability is 12 times
    lower), even though it fails five times as often.
  • Homework 1: 1.13, 1.14, 1.17 (3 examples)
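
The air traffic control example can be recomputed directly from the stated downtimes. This is a short sketch; the function and variable names are assumptions, and a year is taken as 8766 hours (365.25 days), as in the slides.

```python
HOURS_PER_YEAR = 8766  # 365.25 days * 24 hours


def availability(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """A = uptime / (uptime + downtime) over one period."""
    return (period_hours - downtime_hours) / period_hours


a_atc = availability(1.0)     # ATC: one 1-hour outage per year
a_ars = availability(5 / 60)  # ARS: five 1-minute outages per year

# Ratio of unavailabilities, U = 1 - A
ratio = (1 - a_atc) / (1 - a_ars)

print(f"A_ATC = {a_atc:.6f}")    # A_ATC = 0.999886
print(f"A_ARS = {a_ars:.7f}")    # A_ARS = 0.9999905
print(f"U ratio = {ratio:.1f}")  # U ratio = 12.0
```

The ratio is simply 60 minutes of downtime divided by 5 minutes, which is why it comes out as 12.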