Title: What Happens Before A Disk Fails
1 What Happens Before A Disk Fails?
- Randi Thomas, Nisha Talagala
- http://www.cs.berkeley.edu/randit/Iram/disklogs.html
2 Motivation
- ISTORE
- Proposes to take advantage of predicted failures to improve system robustness
- Uses a switched network design to connect intelligent devices to each other to improve system performance
- Therefore ISTORE devices do not share electrical connections
- Is this another ISTORE advantage?
- This talk examines
- The potential to predict failures for disk devices
- If and how the failure of a device sharing electrical connections with other devices affects those other devices
3 Just Before a Disk Fails...
- Can we predict the disk failure? To answer, we will investigate:
- What kind of log messages does the system generate?
- When do these messages get generated?
- How do we distinguish a failing disk from a non-failing disk?
- Are the other connected devices in the system affected in any way? To answer, we will investigate:
- Are there correlations between the logged messages?
4 Which Logs on What System?
- The Error Logs Generated by Berkeley's Tertiary Disk System
- Log Dates: January to November, 1998
The Tertiary Disk Application
- A Web-Accessible Image Collection
- Available 24 hours/day, 7 days/week
5 Outline
- Tertiary Disk Architecture
- Example of a Log Message
- What Kind of Messages are Generated?
- Can we predict the disk failure?
- Are the other connected devices in the system affected in any way?
- Summary and Conclusion
6 The Tertiary Disk Architecture
- 20 PCs (m0-m19)
- 200 MHz Pentium Pros
- 96 MB of RAM
- Running FreeBSD version 2.2
- Connected through a switched Ethernet network
- Each PC hosts a set of disks using fast-wide SCSI-2 in single-ended mode
- Using twin-channel SCSI controllers
- Total of 368 Disks
- 8 GB each
- State of the Art in 1996
7 The Tertiary Disk Architecture
- 4 PCs (m0 - m3) have 28 or more disks each
- 2-3 SCSI Chains per PC
- 9-15 Disks per SCSI chain
- 16 PCs (m4 - m19) have 16 disks each
- 2 SCSI Chains per PC
- 8 Disks per SCSI chain
- SCSI bus made up of
- SCSI cable: connects the controller and enclosure
- Backplane of the enclosure
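As a quick consistency check on the per-PC figures above, the disk counts can be totaled. This is only a minimal sketch: the variable and function names are illustrative, and it assumes each of the four large PCs holds exactly 28 disks, the minimum the slide states.

```python
# Illustrative arithmetic only; names are not from the original talk.
SMALL_PCS, DISKS_PER_SMALL_PC = 16, 16        # m4 - m19
LARGE_PCS, MIN_DISKS_PER_LARGE_PC = 4, 28     # m0 - m3 ("28 or more")

def minimum_total_disks() -> int:
    """Smallest disk count consistent with the per-PC figures."""
    small = SMALL_PCS * DISKS_PER_SMALL_PC
    large = LARGE_PCS * MIN_DISKS_PER_LARGE_PC
    return small + large

print(minimum_total_disks())  # 368
```

The result agrees with the stated total of 368 disks, which suggests the four large PCs hold 28 disks each on average.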
8 The Tertiary Disk Architecture
10 Example of A Log Message
- Oct 22 14:53:50 m6 /kernel: (da1:ahc0:0:1:0): WRITE(06). CDB: a c b1 bf 80 0
- Oct 22 14:53:50 m6 /kernel: (da1:ahc0:0:1:0): HARDWARE FAILURE info:cb1bf asc:44,0
- Oct 22 14:53:50 m6 /kernel: (da1:ahc0:0:1:0): Internal target failure field replaceable unit: 1 sks:80,3
- Month Day Time --> Oct 22 14:53:50
- Machine name --> m6
- Source of message --> kernel reporting message
- Error Device --> disk da1, SCSI bus ahc0
- Description of Error --> Write request had a write fault and caused a HW Failure
- More information --> Driver and SCSI Controller Codes
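The field breakdown above can be sketched as a small parser. This is a hedged illustration, not the tooling used for the talk: it assumes the original syslog lines carry the usual colon separators (for example 14:53:50 and (da1:ahc0:0:1:0)), and the names LogEntry, LINE_RE and parse_line are hypothetical.

```python
import re
from typing import NamedTuple, Optional

class LogEntry(NamedTuple):
    month: str
    day: int
    time: str
    machine: str
    source: str
    disk: str
    controller: str
    text: str

# Assumed layout (not confirmed by the talk):
#   "Mon DD HH:MM:SS host /kernel: (disk:controller:b:t:l): message text"
LINE_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<machine>\S+)\s+"
    r"(?P<source>/?\w+):\s+"
    r"\((?P<disk>\w+):(?P<controller>\w+):[\d:]+\):\s+"
    r"(?P<text>.*)$"
)

def parse_line(line: str) -> Optional[LogEntry]:
    """Split one log line into its fields; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    d = m.groupdict()
    return LogEntry(
        month=d["month"],
        day=int(d["day"]),
        time=d["time"],
        machine=d["machine"],
        source=d["source"].lstrip("/"),
        disk=d["disk"],
        controller=d["controller"],
        text=d["text"],
    )

entry = parse_line(
    "Oct 22 14:53:50 m6 /kernel: (da1:ahc0:0:1:0): "
    "HARDWARE FAILURE info:cb1bf asc:44,0"
)
print(entry.machine, entry.disk, entry.controller)  # m6 da1 ahc0
```

Lines that do not follow the assumed layout return None rather than raising, so unrelated kernel messages can simply be skipped.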
12 What kind of messages are generated?
- Data Disk Error Messages
- Hardware Error: The command terminated unsuccessfully due to a non-recoverable hardware failure. (Type is given in the message)
- Medium Error: The operation was unsuccessful due to a flaw in the medium --> usually recommends reassigning sectors
- Recovered Error: The last command completed with the help of some error recovery at the target --> e.g. if the drive dynamically reassigned a bad sector to an available spare sector
- Not Ready: The drive cannot be accessed at all
- SCSI Error Messages
- Time Outs: Can happen in any of the SCSI bus phases, i.e. message, data, idle. Response: a BUS RESET command
- Parity: Cause of an aborted request
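The two message families above can be sketched as a keyword lookup. This is an assumption-laden illustration: only "HARDWARE FAILURE" appears verbatim in the example log message; the other keyword strings are guesses derived from the slide's category names, and Category and categorize are hypothetical names.

```python
from enum import Enum

class Category(Enum):
    DISK = "data disk error"
    SCSI = "SCSI error"
    UNKNOWN = "unknown"

# Keyword tables are assumptions for illustration; real kernel
# message text may be worded differently.
DISK_KEYWORDS = ("HARDWARE FAILURE", "MEDIUM ERROR", "RECOVERED ERROR", "NOT READY")
SCSI_KEYWORDS = ("TIMED OUT", "TIME OUT", "PARITY")

def categorize(text: str) -> Category:
    """Map the free-text part of a log message to one of the two families."""
    upper = text.upper()
    if any(keyword in upper for keyword in DISK_KEYWORDS):
        return Category.DISK
    if any(keyword in upper for keyword in SCSI_KEYWORDS):
        return Category.SCSI
    return Category.UNKNOWN

print(categorize("HARDWARE FAILURE info:cb1bf asc:44,0").value)  # data disk error
```

Checking the disk keywords first is a design choice made here so that a message mentioning both families is counted as a disk error.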
14 m0: SCSI Time Outs and Recovered Errors, SCSI Bus 0
15 m0: SCSI Time Outs and Recovered Errors, SCSI Bus 4
16 m0: SCSI Time Outs and Recovered Errors, SCSI Bus 0
17 m0: SCSI Time Outs and Recovered Errors, SCSI Bus 0
18 Can we predict a disk failure?
- Yes, we can look for Recovered Error messages --> on 10-16-98
- There were 433 Recovered Error Messages
- These messages lasted for slightly over an hour, between 12:43 and 14:10
- On 11-24-98 Disk 5 on m0 was "fired", i.e. it was about to fail so it was swapped
- Another example...
19 m11: SCSI Time Outs, SCSI Bus 0
20 m11: SCSI Time Outs and Hardware Failures, SCSI Bus 0
21 Can we predict a disk failure?
- Yes, we can also look for Hardware Failure messages
- These messages lasted for 8 days, between 8-17-98 and 8-25-98
- On disk 9 there were 1763 Hardware Failure Messages and 297 Timed Out Messages
- Disk 9 on SCSI Bus 0 of m11 was "fired", i.e. it was about to fail so it was swapped on 8-28-98
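The warning time in the two examples (disk 5 on m0 and disk 9 on m11) can be worked out from the dates on the slides. This is a minimal sketch that assumes the dates are written month-day-year and that the first logged message counts as the first warning; lead_time_days is a hypothetical helper name.

```python
from datetime import date

def lead_time_days(first_warning: date, swapped: date) -> int:
    """Days between the first warning message and the disk swap."""
    return (swapped - first_warning).days

# m0, disk 5: Recovered Errors on 10-16-98, swapped on 11-24-98
m0_lead = lead_time_days(date(1998, 10, 16), date(1998, 11, 24))

# m11, disk 9: Hardware Failures from 8-17-98, swapped on 8-28-98
m11_lead = lead_time_days(date(1998, 8, 17), date(1998, 8, 28))

print(m0_lead, m11_lead)  # 39 11
```

Under these assumptions the logs gave roughly 39 days of notice for the m0 disk and 11 days for the m11 disk.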
23 Are the other connected devices in the system affected in any way?
- Yes, observe the Time Out message traffic on other disks on the same SCSI bus for the same 8-day period, 8-17-98 to 8-25-98
- What about predicting other kinds of failures besides just disk failures?
- Distinguishing between failing and non-failing disks...
24 m2: SCSI Bus 2 Parity Errors
25 m2: SCSI Bus 2 Parity Errors
26 Can We Predict Other Kinds of Failures?
- Yes, the flurry of parity errors on m2 occurred between 1-1-98 and 2-3-98, as well as between 9-3-98 and 10-12-98
- On 11-24-98, m2 had a bad enclosure --> cables or connections defective
- The enclosure was then swapped
- Note: The activity logs are not available for the earlier time period.
27 Can We Distinguish a Failing Disk From a Non-Failing Disk?
- Yes...
- SCSI Error Messages alone --> No impending disk failure
- As in the m2 Parity example
- Disk Error Messages alone or accompanied by SCSI Error Messages --> High probability of an impending disk failure, e.g.
- ALONE: m0 had only Recovered Error Messages
- Disk 5 was about to fail and therefore was "fired"
- BOTH: m11 had both Hardware Failure Disk Messages and Time Out SCSI Messages
- Disk 9 was about to fail and therefore was "fired"
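The rule of thumb above can be written down as a small decision function. This is a sketch of the slide's heuristic only, resting on the talk's two observed failures; the function name diagnose, its return strings, and the parity count of 50 used for the m2 case are hypothetical.

```python
def diagnose(disk_msgs: int, scsi_msgs: int) -> str:
    """Apply the slide's heuristic to per-disk message counts."""
    if disk_msgs > 0:
        # Disk error messages, alone or together with SCSI errors
        return "impending disk failure likely"
    if scsi_msgs > 0:
        # SCSI errors alone, as in the m2 parity example
        return "no impending disk failure"
    return "no errors logged"

print(diagnose(disk_msgs=433, scsi_msgs=0))     # m0 disk 5: disk errors alone
print(diagnose(disk_msgs=1763, scsi_msgs=297))  # m11 disk 9: both kinds
print(diagnose(disk_msgs=0, scsi_msgs=50))      # m2-style case; 50 is made up
```

A real monitor would probably want thresholds and time windows rather than a simple nonzero test, but the slides do not give any, so none are invented here.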
29 Total Disk and SCSI Errors Per Machine
30 Summary and Conclusion
- Disks don't fail very often
- In the 10 months of logs, only two disks failed
- We have only 2 data points for these conclusions!
- We can predict disk failures and other kinds of failures with enough time to do something about it
- There are correlations between the logged messages
- Hardware Failure Messages on one disk device propagate as Time Out Messages on not only the failing disk, but also other disks on the same SCSI bus
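The correlation in the last point can be sketched as a grouping step over parsed log events. This is a hedged illustration: the Event record, the kind labels, and bystander_disks are hypothetical, and the sample events are made up apart from the identity of the failing disk (disk 9 on SCSI bus 0 of m11).

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple

class Event(NamedTuple):
    machine: str
    bus: int
    disk: int
    kind: str  # "hardware_failure" or "time_out" (labels assumed here)

def bystander_disks(events: Iterable[Event]) -> Dict[Tuple[str, int], Set[int]]:
    """For each bus with a disk reporting hardware failures, list the
    other disks on that bus that logged time outs."""
    failing = defaultdict(set)
    timed_out = defaultdict(set)
    for e in events:
        key = (e.machine, e.bus)
        if e.kind == "hardware_failure":
            failing[key].add(e.disk)
        elif e.kind == "time_out":
            timed_out[key].add(e.disk)
    result = {}
    for key, bad_disks in failing.items():
        others = timed_out.get(key, set()) - bad_disks
        if others:
            result[key] = others
    return result

# Made-up sample; only "m11, bus 0, disk 9" comes from the talk.
sample = [
    Event("m11", 0, 9, "hardware_failure"),
    Event("m11", 0, 9, "time_out"),
    Event("m11", 0, 3, "time_out"),
    Event("m2", 2, 1, "time_out"),
]
print(bystander_disks(sample))  # {('m11', 0): {3}}
```

Buses with time outs but no hardware failures, like the made-up m2 entry, are left out, since nothing on them points at a failing disk.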
31 Back Up Slides
32 m0: SCSI Time Outs, SCSI Bus 2