1
Lecture no. 22: Feilsøking og -retting (Fault detection and correction)
TDT4285 Planlegging og drift av IT-systemer (Planning and Operation of IT Systems)
Spring 2007, Anders Christensen, IDI
2
Phases in fault detection and correction
Reproducibility
Fault isolation
Verification
Error report
Correction
Testing
Feedback
Documentation
Deployment
Adoption
3
Error report phase
  • Collect as much info on the problem as possible
  • Get precise error messages
  • Get screen dumps and transaction logs if
    applicable
  • What is the error message really saying?
  • What does the user really want to do?
  • What does the user think should have happened?
  • What was the context?
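
As a rough illustration of the checklist above, an error report could be kept as a small record that shows which questions are still unanswered. This is only a sketch: the `ErrorReport` class and its field names are hypothetical and not part of the lecture.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorReport:
    """Hypothetical record for the error report phase; field names are illustrative."""
    exact_message: str = ""      # the precise error message, copied verbatim
    user_goal: str = ""          # what the user really wanted to do
    expected_outcome: str = ""   # what the user thought should have happened
    context: str = ""            # what was going on when the problem occurred
    attachments: list = field(default_factory=list)  # screen dumps, transaction logs

    def missing_fields(self):
        """Return the names of the text fields that are still empty."""
        required = ("exact_message", "user_goal", "expected_outcome", "context")
        return [name for name in required if not getattr(self, name).strip()]


report = ErrorReport(exact_message="Permission denied", user_goal="Save a file")
print(report.missing_fields())
```

A report with open fields is a cue to go back to the user before starting fault isolation.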

4
Two main types of problems
  • Reproducible. Problems which can be
    reproduced on command.
  • Non-reproducible. Problems that occur in a
    sporadic or random fashion, or where reproduction
    is unacceptable due to the damage it would do.

5
Non-reproducible faults
  • Initiate monitoring of them
  • Provoke them by stressing the system
  • Analyze to gain insight
  • Initiate alarms and alerts
  • Create mitigating mechanisms
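
One possible way to combine monitoring with alarms for a sporadic fault is to count occurrences within a sliding time window and alert once a threshold is reached. The sketch below assumes timestamps in seconds; the `FaultMonitor` class, the threshold and the window size are illustrative choices, not something the lecture prescribes.

```python
from collections import deque


class FaultMonitor:
    """Signal an alert when too many faults are seen within a time window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one fault at `timestamp` (seconds); return True if an alert should fire."""
        self._timestamps.append(timestamp)
        # Drop occurrences that have fallen out of the window.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


monitor = FaultMonitor(threshold=3, window_seconds=60)
alerts = [monitor.record(t) for t in (0, 100, 110, 120, 300)]
print(alerts)  # [False, False, False, True, False]
```

Only the third fault within one minute triggers the alert; isolated occurrences are recorded but stay silent.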

6
Principles for isolating the fault
  • Eliminate single components one at a time
  • Successive refinement
  • Follow the trace from start to end
  • Statistical analysis of log data
  • Analysis to pin-point the error
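
Successive refinement can be sketched as a bisection over an ordered list of components or changes. The example assumes a single culprit, that the system works with nothing enabled and fails with everything enabled, and that a hypothetical `works` callback can test any prefix of the list. Real systems often violate these assumptions.

```python
def isolate_fault(components, works):
    """Return the first component whose inclusion makes the system fail.

    Assumes works([]) is True, works(components) is False, and that the
    system keeps failing once the faulty component is included.
    """
    # Invariant: works(components[:low]) is True, works(components[:high]) is False.
    low, high = 0, len(components)
    while high - low > 1:
        mid = (low + high) // 2
        if works(components[:mid]):
            low = mid
        else:
            high = mid
    return components[high - 1]


# Toy example: the (hypothetical) proxy component is the culprit.
components = ["dns", "dhcp", "firewall", "proxy", "mail"]
culprit = isolate_fault(components, works=lambda enabled: "proxy" not in enabled)
print(culprit)  # proxy
```

Compared with eliminating components one at a time, halving the search space needs roughly log2(n) test runs instead of up to n.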

7
Suggestions for isolating faults
  • Look at internal formats
  • Read the logs
  • Try the same in a slightly different environment
  • Analyze the symptoms.
  • Single-step the program
  • Change the parameters and try again
  • Ask somebody for suggestions
  • Introduce/activate debugging output
  • Read the doc once more
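
Reading the logs often starts with a simple tally of where the errors cluster. The sketch below assumes a made-up log format of `LEVEL component: message`; real log formats differ and would need their own parsing.

```python
from collections import Counter


def error_counts(lines):
    """Count ERROR lines per component in a hypothetical 'LEVEL component: message' format."""
    counts = Counter()
    for line in lines:
        level, _, rest = line.partition(" ")
        if level == "ERROR" and ":" in rest:
            component = rest.split(":", 1)[0]
            counts[component] += 1
    return counts


sample = [
    "INFO web: request served",
    "ERROR db: connection timed out",
    "ERROR web: upstream unavailable",
    "ERROR db: connection timed out",
]
print(error_counts(sample).most_common(1))  # [('db', 2)]
```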

8
Two types of causes
  • Direct cause. What is the immediate reason why
    something does not work
  • Indirect cause. The reason behind the direct
    cause.

Indirect cause → Direct cause → Problem
9
Verification
  • Temporarily fix the error
  • Verify that the problem disappears
  • Remove the fix
  • Verify that the error reappears
  • Repeat as needed

Temp. correction → Testing → Remove fix → Testing
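
The verification cycle on this slide can be written down as a small loop. This is a minimal sketch: `apply_fix`, `remove_fix` and `problem_present` are hypothetical callbacks supplied by whoever investigates the fault, and the toy "system" is just a dictionary.

```python
def verify_cause(apply_fix, remove_fix, problem_present, rounds=2):
    """Return True only if the problem vanishes with the fix and returns without it, every round."""
    for _ in range(rounds):
        apply_fix()
        gone_with_fix = not problem_present()
        remove_fix()
        back_without_fix = problem_present()
        if not (gone_with_fix and back_without_fix):
            return False
    return True


# Toy system: the problem exists exactly when the (hypothetical) setting is wrong.
system = {"mtu": 9000}
confirmed = verify_cause(
    apply_fix=lambda: system.update(mtu=1500),
    remove_fix=lambda: system.update(mtu=9000),
    problem_present=lambda: system["mtu"] != 1500,
)
print(confirmed)  # True
```

Repeating the cycle guards against a problem that happened to disappear for an unrelated reason while the temporary fix was in place.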
10
Use the right tools
  • To study internal states
  • To study intermediate data
  • To read the configuration data
  • To collect log data and output
  • To run the system step-wise
  • To gain knowledge about the system

11
Examples of tools
  • traceroute: lists the network path
  • ping: checks connectivity
  • truss: lists system calls
  • tcpdump: dumps network data
  • lastcomm: presents the process log
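
Not every system ships all of these tools; `truss`, for instance, comes from Solaris and the BSDs, while Linux systems typically offer `strace` instead. A quick check of what is installed on a given machine could look like the sketch below, where the `available_tools` helper is a hypothetical name introduced for illustration.

```python
import shutil

TOOLS = ["traceroute", "ping", "truss", "tcpdump", "lastcomm"]


def available_tools(names=TOOLS):
    """Map each tool name to its path on this machine, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in available_tools().items():
    print(f"{name}: {path or 'not installed'}")
```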

12
Bad handling of faults
  • Suppress the symptoms
  • Fix something without understanding why
  • Implement a temporary fix (and forget it)
  • Redefine an error to be a feature
  • Correct an error by introducing new ones
  • Correct an error by redesigning the system

13
Fault detection and correction
  • Fault detection requires
      • Creativity
      • Tools know-how
      • System overview
      • Technical knowledge
      • General experience
      • Curiosity
      • Persistence
  • Fault correction requires
      • Precision
      • System knowledge
      • Knowing the system history
      • Special local knowledge
      • The ability to do it Right

14
Fault handling and the three lines of system
administration
(Projects)
3rd line
Correction
Testing
(Sysadmin)
Adaption
2nd line
Fault isolation
Verification
Documentation
Reproducibility
Deployment
1st line
Error message
Feedback
(Routines and user support)
15
Main categories of faults
  • User fault: confusion and misunderstanding on
    the user's part.
  • Routine correction: a fault which has been planned
    for and for which a handling routine exists.
  • Normal fault: a fault which needs to be found and fixed.
  • Conceptual fault: a fault in the system design, where
    the system must be redesigned to eliminate it.

16
Correction of faults
1st line, 2nd line, 3rd line
User fault: Guidance
Routine correction: Execution
Normal fault: Verification, Correction
Conceptual fault: Fault detection, Verification, Redesign
17
Corrections and testing
  • Correction. Establishes the actual fix for the fault.
  • Distribution. Makes the correction active on all
    systems
  • Testing. Must be done in multiple ways, from
    different angles, using different tools and
    techniques.
  • Double testing (and triple testing). See
    above.
  • Documentation. Must be maintained as a part of
    the process, not an after-the-fact clean-up. Also
    includes feedback to the user.

18
Four strategies for error correction
  • Correct it before it occurs
  • Automatically correct when it occurs
  • Correct it manually when the first symptoms can be
    detected (but before the users notice anything)
  • Clean up after the users have noticed a problem

19
Costs (estimation)
(Chart: estimated costs for the four strategies. 1: Before the fault occurs. 2: Automatic correction. 3: When the symptom is detectable. 4: When the problem is noticeable. Cost labels: Initial operational costs, Down time costs.)
20
Accumulated faults
  • A critical fault is seldom caused by one single
    problem, but usually by an accumulation of several
    contributing problems, each of which may not be
    a show-stopper on its own.
  • If potential or actual faults are corrected as
    soon as possible, they can be prevented from
    becoming parts of complex problems.