What Happens Before A Disk Fails - PowerPoint PPT Presentation

About This Presentation

Title:

What Happens Before A Disk Fails

Slides: 33

Provided by: davidoppe

Learn more at: http://istore.cs.berkeley.edu

Transcript and Presenter's Notes

1
What Happens Before A Disk Fails?

Randi Thomas, Nisha Talagala
http://www.cs.berkeley.edu/randit/Iram/disklogs.html

2
Motivation

ISTORE:
Proposes to take advantage of predicted failures to improve system robustness
Uses a switched network design to connect intelligent devices to each other to improve system performance
Therefore, ISTORE devices do not share electrical connections
Is this another ISTORE advantage?
This talk examines:
The potential to predict failures for disk devices
If and how the failure of a device sharing electrical connections with other devices affects those other devices

3
Just Before a Disk Fails...

Can we predict the disk failure?
To answer, we will investigate:
What kind of log messages does the system generate?
When do these messages get generated?
How do we distinguish a failing disk from a non-failing disk?
Are the other connected devices in the system affected in any way?
To answer, we will investigate:
Are there correlations between the logged messages?

4
Which Logs on What System?

The error logs generated by Berkeley's Tertiary Disk system
Log dates: January to November 1998

The Tertiary Disk Application

A WEB Accessible Image Collection
Available 24 hours/day, 7 days/week

5
Outline

Tertiary Disk Architecture
Example of a Log Message
What Kind of Messages Are Generated?
Can we predict the disk failure?
Are the other connected devices in the system affected in any way?
Summary and Conclusion

6
The Tertiary Disk Architecture

20 PCs (m0-m19)
200 MHz Pentium Pros
96 MB of RAM
Running FreeBSD version 2.2
Connected through a switched Ethernet network
Each PC hosts a set of disks using Fast-Wide SCSI-2 in single-ended mode
Using twin-channel SCSI controllers
Total of 368 disks
8 GB each
State of the art in 1996

7
The Tertiary Disk Architecture

4 PCs (m0 - m3) have 28 or more disks each
2-3 SCSI Chains per PC
9-15 Disks per SCSI chain
16 PCs (m4 - m19) have 16 disks each
2 SCSI Chains per PC
8 Disks per SCSI chain
SCSI bus made up of:
SCSI cable: connects the controller and the enclosure
Backplane of the enclosure

8
The Tertiary Disk Architecture

10
Example of A Log Message

Oct 22 14:53:50 m6 /kernel: (da1:ahc0:0:1:0): WRITE(06). CDB: a c b1 bf 80 0
Oct 22 14:53:50 m6 /kernel: (da1:ahc0:0:1:0): HARDWARE FAILURE info:cb1bf asc:44,0
Oct 22 14:53:50 m6 /kernel: (da1:ahc0:0:1:0): Internal target failure field replaceable unit: 1 sks:80,3

Month, day, time --> Oct 22 14:53:50
Machine name --> m6
Source of message --> kernel (the kernel is reporting the message)
Error device --> disk da1, SCSI bus ahc0
Description of error --> Write request had a write fault and caused a hardware failure
More information --> driver and SCSI controller codes
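
The field breakdown above can be sketched as a small parser. This is only an illustration, not the tooling used for the talk: it assumes the raw syslog lines use a colon-separated layout such as `14:53:50` and `(da1:ahc0:0:1:0)`, and the names `LOG_LINE` and `parse_log_line` are hypothetical.

```python
import re

# Assumed layout (an assumption about the raw logs, not confirmed by the slides):
# "<Mon> <day> <HH:MM:SS> <machine> /kernel: (<dev>:<ctrl>:<bus>:<target>:<lun>): <text>"
LOG_LINE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<machine>\S+)\s+/kernel:\s+"
    r"\((?P<device>[a-z]+\d+):(?P<controller>[a-z]+\d+):"
    r"(?P<bus>\d+):(?P<target>\d+):(?P<lun>\d+)\):\s+"
    r"(?P<text>.*)$"
)

def parse_log_line(line):
    """Split one kernel log line into its fields; return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Numeric fields are converted so they can be compared and grouped later.
    for key in ("day", "bus", "target", "lun"):
        record[key] = int(record[key])
    return record

if __name__ == "__main__":
    sample = "Oct 22 14:53:50 m6 /kernel: (da1:ahc0:0:1:0): HARDWARE FAILURE info:cb1bf asc:44,0"
    print(parse_log_line(sample))
```

Lines that do not follow the assumed layout return `None`, so unrelated kernel messages can simply be skipped.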

12
What kind of messages are generated?

Data Disk Error Messages:
Hardware Error: The command terminated unsuccessfully due to a non-recoverable hardware failure (the type is given in the message)
Medium Error: The operation was unsuccessful due to a flaw in the medium --> the usual recommendation is to reassign sectors
Recoverable Error: The last command completed with the help of some error recovery at the target --> e.g. the drive dynamically reassigned a bad sector to an available spare sector
Not Ready: The drive cannot be accessed at all
SCSI Error Messages:
Time Outs: Can happen in any of the SCSI bus phases, e.g. message, data, idle. Response: a BUS RESET command
Parity: Cause of an aborted request
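
The two message families above could be assigned to log text with a simple keyword lookup. This is a minimal sketch: only `HARDWARE FAILURE` appears in the example log message, so the other keywords (`MEDIUM ERROR`, `RECOVERED ERROR`, `NOT READY`, `TIMED OUT`, `PARITY`) are assumptions about the wording of the real log lines, and `classify` is a hypothetical helper.

```python
# Keyword -> message type. Apart from "HARDWARE FAILURE", the keywords are
# assumed spellings, chosen for illustration only.
DISK_ERRORS = {
    "HARDWARE FAILURE": "Hardware Error",
    "MEDIUM ERROR": "Medium Error",
    "RECOVERED ERROR": "Recoverable Error",
    "NOT READY": "Not Ready",
}
SCSI_ERRORS = {
    "TIMED OUT": "Time Out",
    "PARITY": "Parity",
}

def classify(text):
    """Return (family, message type) for a log message, or None if unrecognized."""
    upper = text.upper()
    for keyword, name in DISK_ERRORS.items():
        if keyword in upper:
            return ("disk", name)
    for keyword, name in SCSI_ERRORS.items():
        if keyword in upper:
            return ("scsi", name)
    return None

if __name__ == "__main__":
    print(classify("HARDWARE FAILURE info:cb1bf asc:44,0"))
```

Keeping the family ("disk" or "scsi") separate from the message type matters later, because the distinction between failing and non-failing disks is drawn at the family level.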

14
m0 SCSI Time Outs and Recovered Errors
SCSI Bus 0

15
m0 SCSI Time Outs and Recovered Errors
SCSI Bus 4

16
m0 SCSI Time Outs and Recovered Errors
SCSI Bus 0

17
m0 SCSI Time Outs and Recovered Errors
SCSI Bus 0

18
Can we predict a disk failure?

Yes, we can look for Recovered Error messages:
On 10-16-98 there were 433 Recovered Error messages
These messages lasted for slightly over an hour, between 12:43 and 14:10
On 11-24-98, Disk 5 on m0 was "fired", i.e. it was about to fail, so it was swapped
Another example...

19
m11 SCSI Time Outs
SCSI Bus 0

20
m11 SCSI Time Outs and Hardware Failures
SCSI Bus 0

21
Can we predict a disk failure?

Yes, we can also look for Hardware Failure messages:
These messages lasted for 8 days, between 8-17-98 and 8-25-98
On disk 9 there were:
1763 Hardware Failure messages, and
297 Timed Out messages
Disk 9 on SCSI Bus 0 of m11 was "fired", i.e. it was about to fail, so it was swapped on 8-28-98
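
Counts and time spans like the ones quoted in these two examples could be produced by grouping parsed messages per disk and message type. The sketch below is an assumption about how such a summary might be computed. The event tuples, the `summarize` and `suspects` helpers, and the `min_count` threshold are all hypothetical, and the sample events are synthetic rather than taken from the logs.

```python
from datetime import datetime

def summarize(events):
    """Group (timestamp, machine, disk, kind) events into count and first/last time."""
    stats = {}
    for ts, machine, disk, kind in events:
        key = (machine, disk, kind)
        if key not in stats:
            stats[key] = {"count": 0, "first": ts, "last": ts}
        entry = stats[key]
        entry["count"] += 1
        entry["first"] = min(entry["first"], ts)
        entry["last"] = max(entry["last"], ts)
    return stats

def suspects(stats, min_count):
    """Return the sorted (machine, disk) pairs with at least min_count messages of one kind."""
    return sorted({(m, d) for (m, d, _kind), s in stats.items() if s["count"] >= min_count})

if __name__ == "__main__":
    # Synthetic events for illustration only.
    events = [
        (datetime(1998, 8, 17, 10, 0), "m11", 9, "Hardware Failure"),
        (datetime(1998, 8, 18, 10, 0), "m11", 9, "Hardware Failure"),
        (datetime(1998, 8, 17, 11, 0), "m11", 3, "Time Out"),
    ]
    stats = summarize(events)
    print(suspects(stats, min_count=2))
```

The threshold is left as a parameter because the talk gives no cutoff; with only two failed disks in the logs, any specific value would be a guess.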

23
Are the other connected devices in the system affected in any way?

Yes: observe the Time Out message traffic on other disks on the same SCSI bus during the same 8-day period, 8-17-98 to 8-25-98
What about predicting other kinds of failures besides just disk failures?
Distinguishing between failing and non-failing disks...

24
m2 SCSI Bus 2 Parity Errors

25
m2 SCSI Bus 2 Parity Errors

26
Can We Predict Other Kinds of Failures?

Yes, the flurry of parity errors on m2 occurred between 1-1-98 and 2-3-98, as well as between 9-3-98 and 10-12-98
On 11-24-98, m2 was found to have a bad enclosure --> cables or connections defective
The enclosure was then swapped
Note: The activity logs are not available for the earlier time period.

27
Can We Distinguish a Failing Disk From a Non-Failing Disk?

Yes...
SCSI Error Messages alone --> no impending disk failure
As in the m2 Parity example
Disk Error Messages alone or accompanied by SCSI Error Messages --> high probability of an impending disk failure, e.g.
ALONE: m0 had only Recovered Error messages
Disk 5 was about to fail and therefore was "fired"
BOTH: m11 had both Hardware Failure disk messages and Time Out SCSI messages
Disk 9 was about to fail and therefore was "fired"
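
The rule on this slide can be written down as a tiny decision function. This is a sketch of the stated heuristic only, based on the two disk failures and one enclosure failure described in the talk. The `assess` function and its return labels are hypothetical names.

```python
def assess(families):
    """Apply the slide's heuristic to the set of message families seen for one disk.

    families is a set containing "disk" and/or "scsi".
    """
    if "disk" in families:
        # Disk error messages, alone or together with SCSI error messages.
        return "disk-failure-likely"
    if "scsi" in families:
        # SCSI error messages alone, as in the m2 parity example.
        return "no-disk-failure-indicated"
    return "no-errors-logged"

if __name__ == "__main__":
    print(assess({"disk"}))          # the m0 case: Recovered Errors alone
    print(assess({"disk", "scsi"}))  # the m11 case: Hardware Failures and Time Outs
    print(assess({"scsi"}))          # the m2 case: Parity errors only
```

SCSI errors alone still deserve attention: in the m2 example they pointed to a bad enclosure rather than a failing disk.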

29
Total Disk and SCSI Errors Per Machine

30
Summary and Conclusion

Disks don't fail very often
In the 10 months of logs, only two disks failed
We have only 2 data points for these conclusions!
We can predict disk failures and other kinds of failures with enough time to do something about them
There are correlations between the logged messages
Hardware Failure messages on one disk device propagate as Time Out messages on not only the failing disk, but also other disks on the same SCSI bus
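
One way to look for this kind of propagation is to find disks that logged Time Outs on a day when a different disk on the same SCSI bus logged Hardware Failures. The sketch below is an assumed approach, not the analysis behind the talk. The event layout and the `bystander_timeouts` helper are hypothetical, and the sample events are synthetic.

```python
from collections import defaultdict

def bystander_timeouts(events):
    """Find disks with Time Outs on a day when another disk on the same bus had Hardware Errors.

    events is a list of (day, machine, bus, disk, kind) tuples.
    """
    failing = defaultdict(set)
    for day, machine, bus, disk, kind in events:
        if kind == "Hardware Error":
            failing[(day, machine, bus)].add(disk)

    affected = set()
    for day, machine, bus, disk, kind in events:
        if kind == "Time Out":
            # Exclude the failing disk itself; only its bus neighbors count.
            others = failing.get((day, machine, bus), set()) - {disk}
            if others:
                affected.add((machine, bus, disk))
    return affected

if __name__ == "__main__":
    # Synthetic events for illustration only.
    events = [
        ("8-17-98", "m11", 0, 9, "Hardware Error"),
        ("8-17-98", "m11", 0, 9, "Time Out"),
        ("8-17-98", "m11", 0, 3, "Time Out"),
        ("8-17-98", "m11", 1, 2, "Time Out"),
    ]
    print(bystander_timeouts(events))
```

Co-occurrence on the same day and bus is only suggestive. With two failed disks in the logs, it illustrates the correlation rather than proving a cause.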