Title: Lecture no. 22: Troubleshooting and fault correction (Feilsøking og -retting)
1. Lecture no. 22: Troubleshooting and fault correction
TDT4285 Planlegging og drift av IT-systemer (Planning and Operation of IT Systems)
Spring 2007, Anders Christensen, IDI
2. Phases in fault detection and correction
[Diagram: the phases of fault handling]
- Error report
- Reproducibility
- Fault isolation
- Verification
- Correction
- Testing
- Deployment
- Documentation
- Feedback
- Adoption
3. Error report phase
- Collect as much info on the problem as possible
- Get precise error messages
- Get screen dumps and transaction logs if applicable
- What is the error message really saying?
- What does the user really want to do?
- What does the user think should have happened?
- What was the context?
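
One way to make sure these questions are actually asked is to record the answers in a fixed structure. The following is a minimal sketch, not part of the lecture; the class name `ErrorReport` and its field names are hypothetical and simply mirror the bullets above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ErrorReport:
    """Hypothetical record of what is collected in the error report phase."""
    error_message: str = ""      # the precise, verbatim error message
    user_goal: str = ""          # what the user really wanted to do
    expected_outcome: str = ""   # what the user thinks should have happened
    context: str = ""            # machine, time, what happened just before
    attachments: List[str] = field(default_factory=list)  # screen dumps, logs

    def missing_fields(self) -> List[str]:
        """Return the names of the text fields that are still empty."""
        text_fields = ["error_message", "user_goal", "expected_outcome", "context"]
        return [name for name in text_fields if not getattr(self, name).strip()]


report = ErrorReport(
    error_message="Permission denied",
    user_goal="Save a file to the shared project folder",
)
print(report.missing_fields())  # the questions still to ask the user
```

The empty fields tell the person receiving the report which follow-up questions remain before the problem is passed on.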
4. Two main types of problems
- Reproducible. Problems that can be reproduced on demand.
- Non-reproducible. Problems that occur sporadically or at random, or where reproduction is unacceptable because of the damage it would do.
5. Non-reproducible faults
- Initiate monitoring of them
- Provoke them by stressing the system
- Analyze to gain insight
- Initiate alarms and alerts
- Create mitigating mechanisms
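
As an illustration of monitoring combined with an alarm, the sketch below counts fault events in a sliding time window and signals an alert when a threshold is reached. The class name, the window length and the threshold are assumptions made for the example, not values from the lecture.

```python
from collections import deque


class SporadicFaultMonitor:
    """Counts fault events in a sliding time window and flags an alert."""

    def __init__(self, window_seconds: float, threshold: int):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.events = deque()  # timestamps of recent fault events

    def record(self, timestamp: float) -> bool:
        """Record one fault event; return True if an alert should be raised."""
        self.events.append(timestamp)
        # Drop events that have fallen out of the window.
        while self.events and timestamp - self.events[0] > self.window_seconds:
            self.events.popleft()
        return len(self.events) >= self.threshold


monitor = SporadicFaultMonitor(window_seconds=60, threshold=3)
for t in [0, 100, 110, 120]:  # invented event times, in seconds
    if monitor.record(t):
        print(f"ALERT at t={t}: {len(monitor.events)} faults within 60 s")
```

A single isolated event stays silent; only a cluster of events inside the window triggers the alert, which keeps sporadic noise from drowning out a real pattern.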
6. Principles for isolating the fault
- Eliminate single components one at a time
- Successive refinement
- Follow the trace from start to end
- Statistical analysis of log data
- Analysis to pinpoint the error
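
Successive refinement can be sketched as a bisection over an ordered list of changes. The example assumes (this is not stated in the lecture) that the system works with no changes applied, fails with all of them applied, and stays broken once the faulty change is included. The function name `find_first_bad` and the list of changes are hypothetical.

```python
from typing import Callable, Sequence


def find_first_bad(changes: Sequence[str],
                   works_with: Callable[[Sequence[str]], bool]) -> str:
    """Return the first change whose inclusion makes the system fail."""
    # Invariant: works with changes[:low], fails with changes[:high].
    low, high = 0, len(changes)
    while high - low > 1:
        mid = (low + high) // 2
        if works_with(changes[:mid]):
            low = mid
        else:
            high = mid
    return changes[high - 1]


# Invented example: five changes applied in order, one of them is faulty.
changes = ["kernel patch", "new DNS server", "firewall rule",
           "mail upgrade", "cron cleanup"]


def works_with(applied: Sequence[str]) -> bool:
    """Stand-in for a real test of the system with these changes applied."""
    return "firewall rule" not in applied


print(find_first_bad(changes, works_with))
```

Each test halves the remaining candidates, so even a long list of changes needs only a handful of test runs.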
7. Suggestions for isolating faults
- Look at internal formats
- Read the logs
- Try the same in a slightly different environment
- Analyze the symptoms.
- Single-step the program
- Change the parameters and try again
- Ask somebody for suggestions
- Introduce/activate debugging output
- Read the documentation once more
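
Debugging output is easiest to activate when it is already built in and only needs to be switched on. A minimal sketch using Python's standard `logging` module follows; the logger name `backup`, the environment variable `BACKUP_DEBUG` and the function `copy_file` are hypothetical.

```python
import logging
import os


def make_logger(name: str, debug: bool) -> logging.Logger:
    """Return a logger that emits DEBUG messages only when debug is True."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if debug else logging.WARNING)
    if not logger.handlers:  # avoid adding a second handler on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


# Debug output is switched on with a (hypothetical) environment variable.
debug_enabled = os.environ.get("BACKUP_DEBUG", "0") == "1"
log = make_logger("backup", debug_enabled)


def copy_file(src: str, dst: str) -> None:
    """Placeholder operation that reports its parameters when debugging."""
    log.debug("copy_file called with src=%r dst=%r", src, dst)


copy_file("/etc/hosts", "/tmp/hosts.bak")
```

In normal operation the debug line is silent; setting the variable to `1` before starting the program makes it appear without any code change.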
8. Two types of causes
- Direct cause. The immediate reason why something does not work.
- Indirect cause. The reason behind the direct cause.
[Diagram: Indirect cause -> Direct cause -> Problem]
9. Verification
- Temporarily fix the error
- Verify that the problem disappears
- Remove the fix
- Verify that the error reappears
- Repeat as needed
[Diagram: Temporary correction -> Testing -> Remove fix -> Testing]
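
The verification cycle can be written down as a small procedure. This is a sketch under the assumption that the fix can be applied and removed programmatically and that there is a test for whether the problem is present. The function `verify_cause` and the toy `system` dictionary are hypothetical.

```python
from typing import Callable


def verify_cause(apply_fix: Callable[[], None],
                 remove_fix: Callable[[], None],
                 problem_present: Callable[[], bool],
                 rounds: int = 2) -> bool:
    """Return True only if the problem follows the fix in every round."""
    for _ in range(rounds):
        apply_fix()
        if problem_present():      # the fix should make the problem disappear
            remove_fix()
            return False
        remove_fix()
        if not problem_present():  # without the fix the error should reappear
            return False
    return True


# Toy system: the "problem" is present whenever the setting is wrong.
system = {"mtu": 9000}
confirmed = verify_cause(
    apply_fix=lambda: system.update(mtu=1500),
    remove_fix=lambda: system.update(mtu=9000),
    problem_present=lambda: system["mtu"] != 1500,
)
print(confirmed)
```

Note that the procedure leaves the temporary fix removed, so the system is back in its faulty state when verification ends and the proper correction is still to be made.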
10. Use the right tools
- To study internal states
- To study intermediate data
- To read the configuration data
- To collect log data and output
- To run the system step-wise
- To gain knowledge about the system
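
When no ready-made tool fits, a few lines of script can serve as a tool for collecting and summarizing log data. The sketch below counts error lines per host; the log lines are invented and the syslog-like layout (host name in the fourth field) is an assumption.

```python
from collections import Counter

# Invented sample lines in a syslog-like format, for illustration only.
log_lines = [
    "Mar 12 10:01:02 web1 sshd[311]: Failed password for root",
    "Mar 12 10:01:05 web1 httpd[412]: error: upstream timed out",
    "Mar 12 10:02:11 web2 httpd[977]: error: upstream timed out",
    "Mar 12 10:02:40 web1 httpd[412]: error: upstream timed out",
    "Mar 12 10:03:00 web1 cron[120]: job finished",
]


def errors_per_host(lines):
    """Count lines containing 'error', keyed by the fourth field (the host)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 4 and "error" in line.lower():
            counts[fields[3]] += 1
    return counts


for host, count in errors_per_host(log_lines).most_common():
    print(host, count)
```

Even a crude count like this shows whether a fault is concentrated on one machine or spread across several.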
11. Examples of tools
- traceroute: lists the network path to a host
- ping: checks connectivity
- truss: lists the system calls made by a process
- tcpdump: dumps network traffic
- lastcomm: shows previously executed commands from the process accounting log
12. Bad handling of faults
- Suppress the symptoms
- Fix something without understanding why
- Implement a temporary fix (and then forget it)
- Redefine an error to be a feature
- Correct an error by introducing new ones
- Correct an error by redesigning the system
13. Fault detection and correction
- Fault detection requires
  - Creativity
  - Tools know-how
  - System overview
  - Technical knowledge
  - General experience
  - Curiosity
  - Persistence
- Fault correction requires
  - Precision
  - System knowledge
  - Knowing the system history
  - Special local knowledge
  - The ability to do it Right
14. Fault handling and the three lines of system administration
[Diagram: the phases distributed over the three lines]
- 3rd line (Projects): Correction, Testing
- 2nd line (Sysadmin): Reproducibility, Fault isolation, Verification, Documentation, Deployment, Adaption
- 1st line (Routines and user support): Error message, Feedback
15. Main categories of faults
- User fault: confusion and misunderstanding on the user's part.
- Routine correction: a correction that has been planned for, and for which a routine exists for how to handle it.
- Normal fault: a fault that needs to be found and fixed.
- Conceptual fault: a fault in the design of the system, where the system must be redesigned to eliminate it.
16. Correction of faults
[Diagram: the fault categories (User fault, Routine correction, Normal fault, Conceptual fault) mapped onto the 1st, 2nd and 3rd lines, with the activities Guidance, Execution, Verification, Correction, Fault detection and Redesign]
17. Corrections and testing
- Correction. Pins down the fix for the fault.
- Distribution. Makes the correction active on all systems.
- Testing. Must be done in multiple ways, from different angles, using different tools and techniques.
- Double testing (and triple testing). See above.
- Documentation. Must be maintained as part of the process, not as an after-the-fact clean-up. Also includes feedback to the user.
18. Four strategies for error correction
- Correct it before it occurs
- Correct it automatically when it occurs
- Correct it manually when the first symptoms can be detected (but before the users notice anything)
- Clean up after the users have noticed a problem
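
The second strategy, automatic correction, can be sketched as a health check paired with a repair action. The function `auto_correct`, its parameters and the toy service below are hypothetical. A real setup would call an actual health check and restart command.

```python
from typing import Callable


def auto_correct(is_healthy: Callable[[], bool],
                 repair: Callable[[], None],
                 max_attempts: int = 3) -> bool:
    """Try to repair a failed service; return True once it is healthy again."""
    for attempt in range(1, max_attempts + 1):
        if is_healthy():
            return True
        print(f"unhealthy, repair attempt {attempt}")
        repair()
    return is_healthy()  # if still False, escalate to manual correction


# Toy service that becomes healthy after being restarted twice.
state = {"restarts": 0}
ok = auto_correct(
    is_healthy=lambda: state["restarts"] >= 2,
    repair=lambda: state.update(restarts=state["restarts"] + 1),
)
print(ok, state["restarts"])
```

Bounding the number of attempts matters: if the repair does not help, the fault should be handed over to manual correction rather than retried forever.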
19. Costs (estimation)
[Figure: estimated costs of the four strategies. 1: Before the fault occurs. 2: Automatic correction. 3: When the symptom is detectable. 4: When the problem is noticeable. Cost labels: Initial operational costs, Down time costs]
20. Accumulated faults
- A critical fault is seldom caused by one single problem, but usually by an accumulation of several contributing problems, none of which may be a show-stopper on its own.
- If potential or actual faults are corrected as soon as possible, they can be prevented from becoming parts of complex problems.