Lecture no. 22: Fault detection and correction (Feilsøking og -retting)

Slides: 21

Provided by: anderschr


Transcript and Presenter's Notes

1
Lecture no. 22: Fault detection and correction (Feilsøking og -retting)
TDT4285 Planlegging og drift av IT-systemer (Planning and operation of IT systems)
Spring 2007, Anders Christensen, IDI
2
Phases in fault detection and correction
Reproducibility
Fault isolation
Verification
Error report
Correction
Testing
Feedback
Documentation
Deployment
Adoption
3
Error report phase

Collect as much info on the problem as possible
Get precise error messages
Get screen dumps and transaction logs if
applicable
What is the error message really saying?
What does the user really want to do?
What does the user think should have happened?
What was the context?

4
Two main types of problems

Reproducible. Problems that can be reproduced on
demand.
Non-reproducible. Problems that occur sporadically
or at random, or where reproduction is unacceptable
because of the damage it would do.

5
Non-reproducible faults

Initiate monitoring of them
Provoke them by stressing the system
Analyze to gain insight
Initiate alarms and alerts
Create mitigating mechanisms

6
Principles for isolating the fault

Eliminate single components one at a time
Successive refinement
Follow the trace from start to end
Statistical analysis of log data
Analysis to pinpoint the error
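
Successive refinement can be sketched as a bisection over an ordered list of suspects (components, configuration changes, patches). This is only an illustration: the names first_bad and is_bad are hypothetical, and the sketch assumes that the system works with no suspects enabled, fails with all of them enabled, and stays broken once the faulty one is included.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(items: Sequence[T], is_bad: Callable[[Sequence[T]], bool]) -> int:
    """Return the index of the first item whose inclusion triggers the fault.

    Assumptions (hypothetical setup): is_bad(items[:0]) is False,
    is_bad(items) is True, and the fault persists once it has appeared.
    """
    lo, hi = 0, len(items)  # invariant: items[:lo] is good, items[:hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(items[:mid]):
            hi = mid  # the fault is already present in the first half
        else:
            lo = mid  # the first half is clean, look in the second half
    return hi - 1


# Hypothetical example: five configuration changes, the fourth breaks the system.
changes = ["a", "b", "c", "d", "e"]
culprit = first_bad(changes, lambda enabled: "d" in enabled)
```

Each test halves the remaining search space, so a fault among n suspects is isolated in about log2(n) test runs instead of n.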

7
Suggestions for isolating faults

Look at internal formats
Read the logs
Try the same in a slightly different environment
Analyze the symptoms.
Single-step the program
Change the parameters and try again
Ask somebody for suggestions
Introduce/activate debugging output
Read the documentation once more
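
Reading the logs often starts with a simple count of which component reports the most errors. The sketch below assumes a hypothetical log format of "timestamp LEVEL component: message"; real logs will need their own parsing.

```python
from collections import Counter


def count_errors_by_component(lines):
    """Count ERROR lines per component in an assumed
    'timestamp LEVEL component: message' log format (hypothetical)."""
    counts = Counter()
    for line in lines:
        parts = line.split(maxsplit=3)
        if len(parts) < 3:
            continue  # skip lines that do not match the assumed format
        _, level, component = parts[:3]
        if level == "ERROR":
            counts[component.rstrip(":")] += 1
    return counts


# Made-up sample data, for illustration only.
sample = [
    "2007-03-01T10:00:00 ERROR db: connection refused",
    "2007-03-01T10:00:05 INFO web: request served",
    "2007-03-01T10:00:09 ERROR db: connection refused",
    "2007-03-01T10:00:12 ERROR mail: queue full",
]
error_counts = count_errors_by_component(sample)
```

The component with the highest count is a reasonable first candidate for closer inspection, not a proven cause.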

8
Two types of causes

Direct cause. The immediate reason why
something does not work.
Indirect cause. The reason behind the direct
cause.

Diagram: Indirect cause -> Direct cause -> Problem
9
Verification

Temporarily fix the error
Verify that the problem disappears
Remove the fix
Verify that the error reappears
Repeat as needed
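
The verification steps above can be sketched as a small loop. The callables apply_fix, remove_fix and problem_present are hypothetical placeholders for whatever the real system needs; the sketch assumes the problem is present when the loop starts.

```python
def verify_fix(apply_fix, remove_fix, problem_present, rounds=2):
    """Return True only if the problem disappears every time the temporary
    fix is applied and reappears every time it is removed."""
    for _ in range(rounds):
        apply_fix()
        if problem_present():
            return False  # the fix does not remove the problem
        remove_fix()
        if not problem_present():
            return False  # the problem went away for some other reason
    return True


# Hypothetical system model: the fault exists exactly when the fix is off.
state = {"fix_on": False}
confirmed = verify_fix(
    apply_fix=lambda: state.update(fix_on=True),
    remove_fix=lambda: state.update(fix_on=False),
    problem_present=lambda: not state["fix_on"],
)
```

Note that the loop ends with the temporary fix removed, so the permanent correction is still a separate step.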

Diagram: Temporary correction -> Testing -> Remove fix -> Testing
10
Use the right tools

To study internal states
To study intermediate data
To read the configuration data
To collect log data and output
To run the system step-wise
To gain knowledge about the system

11
Examples of tools

traceroute: lists the network path
ping: checks connectivity
truss: lists system calls
tcpdump: dumps network data
lastcomm: presents the process log
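
As a rough illustration, the tools above could be collected into one helper that builds their command lines and runs only those installed on the current machine. The function names, the default interface "eth0" and the packet and ping counts are assumptions for this sketch. Flags differ between platforms (the ping and tcpdump options shown follow common Linux usage, and truss is found on Solaris-like systems), and tcpdump and truss normally require elevated privileges.

```python
import shutil
import subprocess


def diagnostic_commands(host, pid, interface="eth0"):
    """Build argument lists for the tools named on the slide.
    Flags are assumptions based on common usage and may differ per platform."""
    return {
        "ping": ["ping", "-c", "3", host],             # send three echo requests
        "traceroute": ["traceroute", host],             # list the network path
        "truss": ["truss", "-p", str(pid)],             # trace a running process
        "tcpdump": ["tcpdump", "-i", interface, "-c", "20", "host", host],
        "lastcomm": ["lastcomm"],                       # previously executed commands
    }


def run_if_available(argv, timeout=30):
    """Run a command and return its standard output, or None if the tool
    is not installed. Raises subprocess.TimeoutExpired on timeout."""
    if shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return result.stdout


commands = diagnostic_commands("example.org", 1234)
```

The timeout matters because a tool such as truss keeps running until the traced process exits or the trace is interrupted.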

12
Bad handling of faults

Suppress the symptoms
Fix something without understanding why
Implement a temporary fix (and forget about it)
Redefine an error to be a feature
Correct an error by introducing new ones
Correct an error by redesigning the system

13
Fault detection and correction

Fault detection requires
Creativity
Tools know-how
System overview
Technical knowledge
General experience
Curiosity
Persistence

Fault correction requires
Precision
System knowledge
Knowing the system history
Special local knowledge
The ability to do it Right

14
Fault handling and the three lines of system administration
3rd line (Projects): Correction, Testing
2nd line (Sysadmin): Adaption, Fault isolation, Verification, Documentation, Reproducibility, Deployment
1st line (Routines and user support): Error message, Feedback
15
Main categories of faults

User fault. Confusion and misunderstanding on
the user's part.
Routine correction. A correction that has been
planned for and for which a handling routine
exists.
Normal fault. A fault that needs to be found and fixed.
Conceptual fault. A fault in the system's design, where
the system must be redesigned to eliminate it.

16
Correction of faults
Lines involved: 1st line, 2nd line, 3rd line
User fault: Guidance
Routine correction: Execution
Normal fault: Verification, Correction
Conceptual fault: Fault detection, Verification, Redesign
17
Corrections and testing

Correction. Makes the fix for the fault permanent.
Distribution. Makes the correction active on all
systems.
Testing. Must be done in multiple ways, from
different angles, using different tools and
techniques.
Double testing (and triple testing). See
above.
Documentation. Must be maintained as a part of
the process, not as an after-the-fact clean-up. Also
includes feedback to the user.

18
Four strategies for error correction

Correct it before it occurs
Automatically correct when it occurs
Manual correction when the first symptoms can be
detected (but before the users notice anything)
Clean-up after the users have noticed a problem.
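
The third strategy depends on detecting symptoms before users notice them. A minimal sketch of such a check, using disk usage as an example symptom, is shown below. The thresholds and function names are assumptions chosen for illustration, not values from the lecture.

```python
import shutil


def classify(fraction, warn_at, critical_at):
    """Map a usage fraction (0.0 to 1.0) to an alert level."""
    if fraction >= critical_at:
        return "critical"
    if fraction >= warn_at:
        return "warning"
    return "ok"


def check_disk(path=".", warn_at=0.80, critical_at=0.95):
    """Classify disk usage for the file system holding path.
    The default thresholds are hypothetical and should be tuned per system."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return classify(usage.used / usage.total, warn_at, critical_at)


level = check_disk(".")
```

A "warning" result leaves time for manual correction, while "critical" usually means users are about to notice the problem.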

19
Costs (estimation)
Figure comparing the four strategies by initial operational costs and down time costs:
1 Before the fault occurs
2 Automatic correction
3 When the symptom is detectable
4 When the problem is noticeable
20
Accumulated faults

A critical fault is seldom caused by one single
problem, but usually by an accumulation of several
contributing problems, each of which may not be
a show-stopper on its own.
If potential or actual faults are corrected as
soon as possible, they can be prevented from
becoming parts of complex problems.
