Fault Tolerance - PowerPoint PPT Presentation

About This Presentation

Title:

Fault Tolerance

Slides: 16

Provided by: loish6

Learn more at: http://www.cs.fsu.edu

Transcript and Presenter's Notes

1
Fault Tolerance & Reliability, CDA 5140, Spring
2006

2
Topics

3
What is FT?

4
Why Have FT?

Needed more in the 21st century because of:
Harsher environments
Many novice users
Increasing repair costs
Larger systems
Digital systems more prevalent
More users dependent on digital systems from
business to government to home to school

5
How is FT Obtained?

6
Definitions & Terminology

Failure - departure from correct operation
Fault - flaw in hardware or software resulting in
failure, e.g. physical defects or design flaws in
hardware, design or implementation flaws in
software
Error - incorrect response from module leading to
system failure if no FT
Type - hardware or software
Cause - improper design, hardware failure,
external disturbance

7
Definitions continued

Permanent Fault - always present, needs repair to
remove
Intermittent fault - not always present but still
needs repair to remove
Transient fault - will disappear without repair
Fault latency - fault can go undetected while it
does not cause an error
Fault-avoidance - use of high quality components
and careful design to avoid faults
Fault-tolerance - use of redundancy (hardware,
software, information or time) to maintain correct
system operation after a fault occurs

8
Definitions continued

Graceful degradation - system still operates
correctly after faults, but with degraded
performance
Fail-safe - system can fail but only to safe
state to avoid catastrophes
Reliability - probability of not failing within
time t given operating correctly at time 0
Availability - probability that system is
operating correctly at time t
Maintainability - probability that system can be
restored to operation by time t given not
operational at time 0
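
The reliability and maintainability definitions above are probabilities over time, so a small numeric sketch may help. The slides do not give a failure model; the constant failure rate and constant repair rate used here (the exponential model) are assumptions made only for illustration, and the function and parameter names are hypothetical.

```python
import math


def reliability(t: float, failure_rate: float) -> float:
    """Probability of no failure in [0, t], given correct operation at time 0.

    ASSUMPTION: constant failure rate (exponential model), not from the slides.
    """
    if t < 0 or failure_rate < 0:
        raise ValueError("t and failure_rate must be non-negative")
    return math.exp(-failure_rate * t)


def maintainability(t: float, repair_rate: float) -> float:
    """Probability that a system down at time 0 is restored by time t.

    ASSUMPTION: constant repair rate, again only for illustration.
    """
    if t < 0 or repair_rate < 0:
        raise ValueError("t and repair_rate must be non-negative")
    return 1.0 - math.exp(-repair_rate * t)


if __name__ == "__main__":
    # 1e-4 failures per hour over 1000 hours: about 0.905
    print(reliability(1000.0, 1e-4))
    # 0.5 repairs per hour over 2 hours: about 0.632
    print(maintainability(2.0, 0.5))
```

Under this assumed model, reliability starts at 1 and decays with time, while maintainability starts at 0 and grows toward 1.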

9
Definitions continued

Mean-time-to-failure (MTTF) - expected value of
system failure time
Mean-time-to-repair (MTTR) - expected value of
system repair time
Mean-time-between-failure (MTBF) - expected time
between successive system failures, MTTF + MTTR
Fault detection - method used to detect presence
of fault
Fault confinement - technique to confine damage
of fault to as small an area as possible
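
The relation between MTTF, MTTR and MTBF above can be sketched in a few lines. The steady-state availability ratio MTTF / (MTTF + MTTR) is a commonly used result that does not appear on the slides, so treat it as an added assumption; the function names are hypothetical.

```python
def mtbf(mttf: float, mttr: float) -> float:
    """Mean time between failures: time to fail plus time to repair."""
    if mttf < 0 or mttr < 0:
        raise ValueError("mttf and mttr must be non-negative")
    return mttf + mttr


def steady_state_availability(mttf: float, mttr: float) -> float:
    """Long-run fraction of time the system is up.

    ASSUMPTION: the usual steady-state ratio MTTF / (MTTF + MTTR),
    which is not stated on the slides.
    """
    total = mtbf(mttf, mttr)
    if total == 0:
        raise ValueError("mttf + mttr must be positive")
    return mttf / total


if __name__ == "__main__":
    # A system that runs 999 hours on average and takes 1 hour to repair.
    print(mtbf(999.0, 1.0))                       # 1000.0
    print(steady_state_availability(999.0, 1.0))  # 0.999
```

Shortening repair time raises availability even when the failure behaviour itself is unchanged.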

10
Definitions continued

Fault diagnosis - automatic identification of
faulty modules
Recovery - system put into operating state,
possibly degraded
Hardware redundancy - extra hardware to detect,
mask or diagnose faults
Passive hardware redundancy - fault masking to
hide faults and prevent them from resulting in
errors, with no action by the system
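
Fault masking, as described above, is often illustrated with a majority voter over replicated modules. The slides do not name a particular scheme; the three-replica voter below is an assumed example, and `majority_vote` is a hypothetical helper, sketched in software for readability.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of replicas.

    A minority of faulty replicas is masked: the caller never sees
    their wrong output and takes no corrective action.
    """
    if not outputs:
        raise ValueError("at least one replica output is required")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: too many replicas disagree")
    return value


if __name__ == "__main__":
    # Three replicas, one of them faulty: the fault is masked.
    print(majority_vote([42, 42, 17]))  # 42
```

With three replicas, one faulty module is hidden; two faulty modules that disagree with each other leave no majority, and the voter reports that instead of guessing.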

11
Definitions continued

Information redundancy - use of coding theory
techniques (addition of bits)
Software redundancy - use of diagnostic software
or extra modules, each with distinct algorithm
Temporal redundancy - repeating bus cycles or
whole programs, retrying over a new route on the
Internet
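
Two of the redundancy types above can be sketched briefly: information redundancy as a single even-parity bit added to a word, and temporal redundancy as repeating an operation that may hit a transient fault. Both are minimal illustrations chosen here, not techniques specified on the slides, and all names are hypothetical.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def add_even_parity(bits: List[int]) -> List[int]:
    """Information redundancy: append one bit so the count of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word: List[int]) -> bool:
    """Detect any single-bit error: a valid word has an even number of 1s."""
    return sum(word) % 2 == 0


def retry(operation: Callable[[], T], attempts: int = 3) -> T:
    """Temporal redundancy: repeat the operation until it succeeds."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as exc:  # a transient fault may clear on retry
            last_error = exc
    raise RuntimeError(f"failed after {attempts} attempts") from last_error


if __name__ == "__main__":
    word = add_even_parity([1, 0, 1, 1])
    print(word, parity_ok(word))   # [1, 0, 1, 1, 1] True
    word[0] ^= 1                   # inject a single-bit error
    print(parity_ok(word))         # False
```

A single parity bit only detects an odd number of flipped bits and cannot correct them; retrying only helps with transient faults, since a permanent fault fails on every attempt.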

12
Microelectronic Growth

Density of chips has dramatically increased and,
concomitantly, so has the use of digital systems
Obvious need for FT in space shuttle, nuclear
power plants, but with increased use in homes,
more faults likely so will need FT there too
Interesting observations:
1999: typical home had 40-60 microprocessors
2004: expected to be 280

13
Reliability & Availability