What Happens Before A Disk Fails - PowerPoint PPT Presentation

Provided by: davidoppe

Transcript and Presenter's Notes


1
What Happens Before A Disk Fails?
  • Randi Thomas, Nisha Talagala
  • http://www.cs.berkeley.edu/randit/Iram/disklogs.html

2
Motivation
  • ISTORE
  • Proposes to take advantage of predicted failures
    to improve system robustness
  • Uses a switched network design to connect
    intelligent devices to each other to improve
    system performance.
  • Therefore ISTORE devices do not share electrical
    connections
  • Is this another ISTORE advantage?
  • This talk examines
  • The potential to predict failures for disk
    devices
  • If and how the failure of a device sharing
    electrical connections with other devices affects
    those other devices

3
Just Before a Disk Fails...
  • Can we predict the disk failure?
    To answer we will investigate
  • What kind of log messages does the system
    generate?
  • When do these messages get generated?
  • How do we distinguish a failing disk from a
    non-failing disk?
  • Are the other connected devices in the system
    affected in any way?
    To answer we will investigate
  • Are there correlations between the logged
    messages?

4
Which Logs on What System?
  • The Error Logs Generated by Berkeley's
    Tertiary Disk System
  • Log Dates: January to November, 1998

The Tertiary Disk Application
  • A WEB Accessible Image Collection
  • Available 24 hours/day, 7 days/week

5
Outline
  • Tertiary Disk Architecture
  • Example of a log Message
  • What Kind of Messages are generated?
  • Can we predict the disk failure?
  • Are the other connected devices in the system
    affected in any way?
  • Summary and Conclusion

6
The Tertiary Disk Architecture
  • 20 PCs (m0-m19)
  • 200 MHz Pentium Pros
  • 96 MB of RAM
  • Running FreeBSD version 2.2
  • Connected through a switched Ethernet network
  • Each hosts a set of disks using fast-wide SCSI-2
    in single-ended mode
  • Using twin channel SCSI controllers
  • Total of 368 Disks
  • 8 GB each
  • State of the Art in 1996

7
The Tertiary Disk Architecture
  • 4 PCs (m0 - m3) have 28 or more disks each
  • 2-3 SCSI Chains per PC
  • 9-15 Disks per SCSI chain
  • 16 PCs (m4 - m19) have 16 disks each
  • 2 SCSI Chains per PC
  • 8 Disks per SCSI chain
  • SCSI bus made up of:
  • SCSI cable: connects the controller and enclosure
  • Backplane of the enclosure

8
The Tertiary Disk Architecture

10
Example of A Log Message
  • Oct 22 14:53:50 m6 /kernel: (da1:ahc0:0:1:0):
    WRITE(06). CDB: a c b1 bf 80 0
  • Oct 22 14:53:50 m6 /kernel: (da1:ahc0:0:1:0):
    HARDWARE FAILURE info:cb1bf asc:44,0
  • Oct 22 14:53:50 m6 /kernel: (da1:ahc0:0:1:0):
    Internal target failure field replaceable unit: 1
    sks:80,3
  • Month Day Time --> Oct 22 14:53:50
  • Machine name --> m6
  • Source of message --> kernel reporting message
  • Error Device --> disk da1, SCSI bus ahc0
  • Description of Error --> Write request had a
    write fault and caused a HW Failure
  • More information --> Driver and SCSI Controller
    Codes
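The field breakdown above can be sketched as a small parser. This is only an illustration: it assumes the colon-delimited form of the log line (timestamp, host, `/kernel:`, then a parenthesized `disk:bus:...` tuple), and the names `LOG_LINE`, `LogEntry` and `parse_line` are hypothetical, not taken from the talk.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: "Mon DD HH:MM:SS host /kernel: (disk:bus:n:n:n): text".
LOG_LINE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+/kernel:\s+"
    r"\((?P<disk>[a-z]+\d+):(?P<bus>[a-z]+\d+):[\d:]+\):\s+(?P<text>.*)$"
)


class LogEntry(NamedTuple):
    month: str
    day: int
    time: str
    host: str   # machine name, e.g. m6
    disk: str   # error device, e.g. da1
    bus: str    # SCSI bus / controller, e.g. ahc0
    text: str   # description of the error and driver codes


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one kernel log line into its fields; None if it does not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    return LogEntry(m["month"], int(m["day"]), m["time"],
                    m["host"], m["disk"], m["bus"], m["text"])


entry = parse_line(
    "Oct 22 14:53:50 m6 /kernel: (da1:ahc0:0:1:0): "
    "WRITE(06). CDB: a c b1 bf 80 0"
)
print(entry)
```

Lines that do not follow the assumed layout return `None` rather than raising, so a real log with mixed message sources can be filtered in one pass.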

12
What kind of messages are generated?
  • Data Disk Error Messages
  • Hardware Error: The command terminated
    unsuccessfully due to a non-recoverable hardware
    failure. (The type is given in the message.)
  • Medium Error: The operation was unsuccessful due
    to a flaw in the medium --> usually recommends
    reassigning sectors
  • Recoverable Error: The last command completed
    with the help of some error recovery at the
    target --> e.g. the drive dynamically reassigned
    a bad sector to an available spare sector
  • Not Ready: The drive cannot be accessed at all
  • SCSI Error Messages
  • Time Outs: Can happen in any of the SCSI bus
    phases, e.g. message, data, idle. Response: a BUS
    RESET command
  • Parity: Cause of an aborted request
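The two message families listed above can be sketched as a lookup table. The substrings matched here (for example `TIMED OUT`) are assumptions about how the kernel words each message; only `HARDWARE FAILURE` appears verbatim in the sample log, and `PATTERNS` and `classify` are hypothetical names.

```python
from typing import Optional, Tuple

# (substring to look for, message type, message family).
# The substrings are assumed wordings, not confirmed log text.
PATTERNS = [
    ("HARDWARE FAILURE", "Hardware Error", "disk"),
    ("MEDIUM ERROR", "Medium Error", "disk"),
    ("RECOVERED ERROR", "Recoverable Error", "disk"),
    ("NOT READY", "Not Ready", "disk"),
    ("TIMED OUT", "Time Out", "scsi"),
    ("PARITY", "Parity", "scsi"),
]


def classify(text: str) -> Optional[Tuple[str, str]]:
    """Return (message type, family) for a log message, or None if unknown."""
    upper = text.upper()
    for needle, kind, family in PATTERNS:
        if needle in upper:
            return kind, family
    return None


print(classify("HARDWARE FAILURE info:cb1bf asc:44,0"))
```

Keeping the family ("disk" or "scsi") next to the type makes it easy to ask later whether a device logged disk errors, SCSI errors, or both.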

14
m0 SCSI Time Outs and Recovered Errors
SCSI Bus 0
15
m0 SCSI Time Outs and Recovered Errors
SCSI Bus 4
16
m0 SCSI Time Outs and Recovered Errors
SCSI Bus 0
17
m0 SCSI Time Outs and Recovered Errors
SCSI Bus 0
18
Can we predict a disk failure?
  • Yes, we can look for Recovered Error messages -->
    on 10-16-98
  • There were 433 Recovered Error Messages
  • These messages lasted for slightly over an hour,
    between 12:43 and 14:10
  • On 11-24-98 Disk 5 on m0 was "fired", i.e. it
    was about to fail so it was swapped
  • Another example...
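One way to spot a burst like the one on m0 is to count messages per disk per day and flag unusually high counts. The sketch below is an assumption-laden illustration: the event tuples are synthetic (only the count of 433 and the date come from the slide), and the threshold of 100 is an arbitrary placeholder, not a value from the talk.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (date, host, disk, message type); a hypothetical pre-parsed form of the log.
Event = Tuple[str, str, str, str]


def find_bursts(events: Iterable[Event], kind: str,
                threshold: int) -> List[Tuple[str, str, str, int]]:
    """Return (date, host, disk, count) for days with >= threshold messages."""
    counts = Counter(
        (day, host, disk) for day, host, disk, k in events if k == kind
    )
    return sorted(
        (day, host, disk, n)
        for (day, host, disk), n in counts.items()
        if n >= threshold
    )


# Synthetic events shaped like the m0 case, plus a little background noise.
events = [("1998-10-16", "m0", "disk5", "Recoverable Error")] * 433
events += [("1998-10-17", "m0", "disk3", "Recoverable Error")] * 2

bursts = find_bursts(events, "Recoverable Error", threshold=100)
print(bursts)
```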

19
m11 SCSI Time Outs
SCSI Bus 0
20
m11 SCSI Time Outs and Hardware Failures
SCSI Bus 0
21
Can we predict a disk failure?
  • Yes, we can also look for Hardware Failure
    messages -->
  • These messages lasted for 8 days, between
    8-17-98 and 8-25-98
  • On disk 9 there were
  • 1763 Hardware Failure Messages, and
  • 297 Timed Out Messages
  • Disk 9 on SCSI Bus 0 of m11 was "fired", i.e. it
    was about to fail so it was swapped on 8-28-98
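As a worked example of the warning time these messages gave, the dates on the slides can be subtracted directly. The interval is measured up to the swap date, since the disks were replaced before they failed outright; the `cases` and `lead_days` names are illustrative only.

```python
from datetime import date

# Dates as given on the slides (month-day-year, 1998):
# first day of error messages, and the day the disk was swapped.
cases = {
    "m0 disk 5": (date(1998, 10, 16), date(1998, 11, 24)),
    "m11 disk 9": (date(1998, 8, 17), date(1998, 8, 28)),
}

lead_days = {
    name: (swapped - first_seen).days
    for name, (first_seen, swapped) in cases.items()
}

for name, days in lead_days.items():
    print(f"{name}: {days} days between first logged errors and swap")
```

Under these dates the m0 messages appeared 39 days before the swap and the m11 messages 11 days before.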

23
Are the other connected devices in the system
affected in any way?
  • Yes, observe the Time Out message traffic on
    other disks on the same SCSI bus -->
  • For the same 8-day period, between 8-17-98 and
    8-25-98
  • What about predicting other kinds of failures
    besides just disk failures? -->
  • Distinguishing between failing and non-failing
    disks...
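The check described above, looking for Time Out messages on the neighbors of a failing disk during the same period, might be sketched as follows. The `Event` record and the sample events are hypothetical and synthetic; only the host, bus, failing disk number and date window come from the slides.

```python
from datetime import date
from typing import Iterable, NamedTuple, Set


class Event(NamedTuple):
    day: date
    host: str
    bus: str
    disk: str
    kind: str


def affected_neighbors(events: Iterable[Event], host: str, bus: str,
                       failing_disk: str, start: date, end: date) -> Set[str]:
    """Disks sharing the failing disk's bus that logged Time Outs in the window."""
    return {
        e.disk
        for e in events
        if e.host == host and e.bus == bus and e.disk != failing_disk
        and e.kind == "Time Out" and start <= e.day <= end
    }


# Synthetic events for illustration only.
events = [
    Event(date(1998, 8, 18), "m11", "bus0", "disk9", "Hardware Error"),
    Event(date(1998, 8, 18), "m11", "bus0", "disk9", "Time Out"),
    Event(date(1998, 8, 19), "m11", "bus0", "disk2", "Time Out"),
    Event(date(1998, 8, 19), "m11", "bus1", "disk4", "Time Out"),
    Event(date(1998, 9, 30), "m11", "bus0", "disk7", "Time Out"),
]

neighbors = affected_neighbors(
    events, "m11", "bus0", "disk9", date(1998, 8, 17), date(1998, 8, 25)
)
print(neighbors)
```

Disks on a different bus, or with Time Outs outside the window, are excluded, which is what separates bus-level propagation from unrelated noise.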

24
m2 SCSI Bus 2 Parity Errors
25
m2 SCSI Bus 2 Parity Errors
26
Can We Predict Other Kinds of Failures?
  • Yes, the flurry of parity errors on m2 occurred
    between
  • 1-1-98 and 2-3-98, as well as
  • 9-3-98 and 10-12-98
  • On 11-24-98
  • m2 had a bad enclosure --> cables or connections
    defective
  • The enclosure was then swapped
  • Note: The activity logs are not available for the
    earlier time period.

27
Can We Distinguish a Failing Disk From a
Non-Failing Disk?
  • Yes...
  • SCSI Error Messages alone --> No impending disk
    failure
  • As in the m2 Parity example
  • Disk Error Messages alone or accompanied by SCSI
    Error Messages --> High probability of an
    impending disk failure, e.g.
  • ALONE: m0 had only Recovered Error Messages
  • Disk 5 was about to fail and therefore was
    "fired"
  • BOTH: m11 had both Hardware Failure Disk Messages
    and Time Out SCSI Messages
  • Disk 9 was about to fail and therefore was
    "fired"
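The rule on this slide can be written as a tiny decision function. This is a sketch of the stated heuristic only, which rests on the two disks observed in the talk; the function name `assess` and the m2 parity count of 50 are placeholders, since the slides give no count for m2.

```python
def assess(disk_errors: int, scsi_errors: int) -> str:
    """Apply the slide's rule to per-device message counts."""
    if disk_errors > 0:
        # Disk error messages, alone or together with SCSI error messages.
        return "high probability of impending disk failure"
    if scsi_errors > 0:
        # SCSI error messages alone.
        return "no impending disk failure indicated"
    return "no errors logged"


# Counts for m0 and m11 are from the slides; the m2 count is a placeholder.
print("m0 disk 5: ", assess(disk_errors=433, scsi_errors=0))
print("m11 disk 9:", assess(disk_errors=1763, scsi_errors=297))
print("m2 bus 2:  ", assess(disk_errors=0, scsi_errors=50))
```

For m2 the rule reports no impending disk failure, but as the parity example shows, SCSI errors alone can still point to another fault such as a bad enclosure.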

29
Total Disk and SCSI Errors Per Machine
30
Summary and Conclusion
  • Disks don't fail very often
  • In the 10 months of logs, only two disks failed
  • We have only 2 data points for these conclusions!
  • We can predict disk failures and other kinds of
    failures with enough time to do something about
    them
  • There are correlations between the logged
    messages
  • Hardware Failure Messages on one disk device
    propagate as Time Out Messages on
  • not only the failing disk,
  • but also other disks on the same SCSI bus

31
Back Up Slides
32
m0 SCSI Time Outs
SCSI Bus 2