Title: Failure Prediction in Hardware Systems
1. Failure Prediction in Hardware Systems
- Douglas Turnbull
- Neil Alldrin
- CSE 221 Operating System Final Project
- Fall 2003
2. Background
- Using sensors from a high-end server, can we
predict system board failures?
- If we can predict failure, we can take
preventive action to avoid costly failures.
- System Specifications
- 18 Hot-Swappable System Boards
- 4 Processors per Board
- 18 Sensors per Board
- Sensors measure various temperatures and voltages
3. Sensor Logs
- Each board has an associated sensor log.
- About once a minute, the sensors are sampled and
the measurements are stored in the sensor log.
- System board failures are also recorded in the
sensor log.
We need to extract a data set from these logs to
represent failure events (positive examples) and
normal operating conditions (negative examples).
We accomplish this using a Windowing
Abstraction.
4. Windowing Abstraction
Sensor Window: adjacent entries in the sensor
log that are used to predict failures.
Potential Failure Window: an example is labeled
positive if a failure occurs in the potential
failure window, and negative otherwise.
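
The windowing abstraction can be sketched as follows. This is a minimal illustration, not the project's actual code: the log layout (a list of `(readings, failed)` pairs), the function name `extract_examples`, and the assumption that the potential failure window starts immediately after the sensor window are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical log entry: (sensor readings for one sample, was a failure recorded here?)
LogEntry = Tuple[List[float], bool]
Example = Tuple[List[List[float]], bool]


def extract_examples(log: List[LogEntry], sensor_window: int,
                     failure_window: int) -> List[Example]:
    """Slide a sensor window over the log and label each position.

    An example is positive if any failure is recorded in the
    `failure_window` entries that follow the sensor window.
    """
    examples = []
    last_start = len(log) - sensor_window - failure_window
    for start in range(last_start + 1):
        end = start + sensor_window
        # Sensor window: adjacent entries used to build the feature vector.
        window = [readings for readings, _ in log[start:end]]
        # Potential failure window: the entries right after the sensor window.
        label = any(failed for _, failed in log[end:end + failure_window])
        examples.append((window, label))
    return examples
```

A real data set would also need to decide how to treat sensor windows that themselves contain a failure; this sketch ignores that case.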
5. Feature Vectors
Feature Vectors are created from the data in a
sensor window. There are two types of feature
vectors:
Raw Feature Vectors: a vector of all the sensor
measurements in a sensor window.
Summary Feature Vectors: the mean, standard
deviation, range and slope for each of the
sensors in a sensor window.
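
A minimal sketch of the two feature vector types, assuming a sensor window is a list of samples and each sample is a list with one reading per sensor. The function names, the use of the population standard deviation, and the least-squares slope against the sample index are assumptions made for illustration; the slides do not specify these details.

```python
from statistics import mean, pstdev
from typing import List


def raw_feature_vector(window: List[List[float]]) -> List[float]:
    """Concatenate every measurement in the window, sample by sample."""
    return [value for sample in window for value in sample]


def slope(values) -> float:
    """Least-squares slope of the values against sample index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def summary_feature_vector(window: List[List[float]]) -> List[float]:
    """Mean, standard deviation, range and slope for each sensor."""
    features = []
    for series in zip(*window):  # one time series per sensor
        features += [mean(series), pstdev(series),
                     max(series) - min(series), slope(series)]
    return features
```

With 18 sensors, a summary vector built this way always has 72 entries, while the raw vector grows with the window size.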
6. Classification
A classifier assigns labels (positive or
negative) to novel feature vectors after it has
been trained using a set of feature vectors with
known labels. Many classifiers can be used, such
as SVMs, Bayesian mixture models, and neural
networks. We use a Radial Basis Function (RBF)
network, a special form of a neural network,
because it is computationally efficient.
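
To make the idea concrete, here is a small RBF network sketch in plain Python. It is not the project's implementation: the Gaussian basis function, the choice of training vectors as centers, the fixed `width`, and training the output weights with a simple delta rule are all assumptions chosen to keep the example short.

```python
import math
from typing import List


def rbf(x: List[float], center: List[float], width: float) -> float:
    """Gaussian radial basis function of the distance between x and center."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-dist_sq / (2 * width ** 2))


class RBFNetwork:
    """One hidden layer of Gaussian units feeding a single linear output."""

    def __init__(self, centers: List[List[float]], width: float):
        self.centers = centers
        self.width = width
        self.weights = [0.0] * len(centers)
        self.bias = 0.0

    def hidden(self, x: List[float]) -> List[float]:
        return [rbf(x, c, self.width) for c in self.centers]

    def score(self, x: List[float]) -> float:
        h = self.hidden(x)
        return self.bias + sum(w * a for w, a in zip(self.weights, h))

    def train(self, vectors, labels, epochs: int = 200, rate: float = 0.1):
        """Fit the output weights with the delta rule (targets +1 / -1)."""
        for _ in range(epochs):
            for x, y in zip(vectors, labels):
                h = self.hidden(x)
                target = 1.0 if y else -1.0
                output = self.bias + sum(w * a for w, a in zip(self.weights, h))
                error = target - output
                self.bias += rate * error
                self.weights = [w + rate * error * a
                                for w, a in zip(self.weights, h)]

    def predict(self, x: List[float]) -> bool:
        """True means 'failure predicted' (positive label)."""
        return self.score(x) > 0.0
```

Only the output layer is trained here, which is a linear problem once the centers are fixed; that is one common reason RBF networks are considered cheap to train.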
7. Evaluating Predictions
We must consider two rates when evaluating our
prediction system.
True Positive Rate (tpr): a measure of our
ability to correctly predict true failures.
tpr = Correctly Predicted Failures /
Total Number of True Failures
False Positive Rate (fpr): a measure of the
number of mispredictions.
fpr = Incorrectly Predicted Failures /
Total Number of Non-Failures
                          Ground Truth
                          Failure           Non-failure
Prediction   Failure      True Positives    False Positives
             Non-failure  False Negatives   True Negatives
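
The two rates follow directly from the four confusion-matrix counts. The sketch below assumes predictions and ground-truth labels are given as parallel lists of booleans (True = failure); the function name `rates` is hypothetical.

```python
from typing import List, Tuple


def rates(predictions: List[bool], truths: List[bool]) -> Tuple[float, float]:
    """Return (tpr, fpr) for parallel lists of predicted and true labels."""
    pairs = list(zip(predictions, truths))
    tp = sum(1 for p, t in pairs if p and t)          # predicted failure, was failure
    fp = sum(1 for p, t in pairs if p and not t)      # predicted failure, was not
    fn = sum(1 for p, t in pairs if not p and t)      # missed failure
    tn = sum(1 for p, t in pairs if not p and not t)  # correctly predicted non-failure
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr
```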
8. Preliminary Results
- Observations
- 1. Summary feature vectors have lower false positive
rates than raw feature vectors.
- 2. Window size does not seem to matter.
- How can we improve these results?
9. Feature Subset Selection
We can further improve prediction accuracy (and
reduce computation) by reducing the number of
features used by our classifier. Features are
selected automatically using Forward Stepwise
Selection.
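
Forward stepwise selection starts with no features and greedily adds the one that most improves a score, stopping when no addition helps. The sketch below is generic: `evaluate` is a hypothetical callback that would train and score a classifier on a feature subset (for example, by validation accuracy), and the stopping rule is an assumption, since the slides do not give one.

```python
from typing import Callable, List, Optional, Tuple


def forward_stepwise_selection(
    num_features: int,
    evaluate: Callable[[List[int]], float],
    max_features: Optional[int] = None,
) -> Tuple[List[int], float]:
    """Greedily add the feature index that most improves evaluate(subset)."""
    selected: List[int] = []
    best_score = float("-inf")
    remaining = set(range(num_features))
    limit = num_features if max_features is None else max_features
    while remaining and len(selected) < limit:
        # Score every candidate subset formed by adding one more feature.
        scores = {f: evaluate(selected + [f]) for f in remaining}
        # Ties are broken in favor of the lowest feature index.
        best_feature = max(sorted(scores), key=scores.get)
        if scores[best_feature] <= best_score:
            break  # no remaining feature improves the score
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = scores[best_feature]
    return selected, best_score
```

Each round costs one classifier evaluation per remaining feature, so a cheap classifier keeps the search practical.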
10. Results
11. Best Results
We find the best prediction results with Summary
Feature Vectors using 2/3 of the summary
features:
- 0.87 True Positive Rate (tpr)
- 0.10 False Positive Rate (fpr)
Our data set assumes that we are equally likely
to find a failure as a non-failure. Because
failures are very rare in most hardware systems,
even a low false positive rate will produce many
false positives.
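
A short worked example shows why rare failures matter. The tpr and fpr are the values reported above; the number of windows (10,000) and the failure fraction (0.001) are hypothetical figures chosen only for illustration.

```python
def expected_alarms(num_examples: float, failure_fraction: float,
                    tpr: float, fpr: float):
    """Expected true alarms, false alarms, and the fraction of alarms that are real."""
    failures = num_examples * failure_fraction
    non_failures = num_examples - failures
    true_alarms = tpr * failures
    false_alarms = fpr * non_failures
    total = true_alarms + false_alarms
    precision = true_alarms / total if total else 0.0
    return true_alarms, false_alarms, precision


# Balanced data set, as assumed in the experiments.
balanced = expected_alarms(10_000, 0.5, tpr=0.87, fpr=0.10)

# Hypothetical deployment where only 1 window in 1,000 precedes a failure.
rare = expected_alarms(10_000, 0.001, tpr=0.87, fpr=0.10)
```

Under these assumed numbers, the rare-failure case yields about 999 expected false alarms against fewer than 9 true ones, so under 1% of alarms would correspond to real failures, compared with roughly 90% on the balanced data set.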
12. Future Work
- Implement other classifiers: SVMs, Bayesian
Mixture Models
- Develop a larger data set with more examples of
failures
- Apply framework to other hardware systems such as
personal computers
- Modify operating system to take advantage of
failure prediction:
- Migrate processes to other system boards
- Run diagnostic tests
- Turn off suspect system boards
- Back up data
13. The End
Questions?
14. RBF Network
15. Value of a Prediction System
The value of a prediction system can be
summarized as:
Value = (benefit of predicted failure) * tpr
        - (cost of mispredicted failure) * fpr
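
As a quick illustration, the sketch below reads the formula as the benefit weighted by tpr minus the cost weighted by fpr (the operators are assumed, since they are not legible on the slide). The benefit and cost figures are hypothetical units, not measured values.

```python
def prediction_value(benefit: float, cost: float, tpr: float, fpr: float) -> float:
    """Benefit of predicted failures weighted by tpr, minus misprediction cost weighted by fpr."""
    return benefit * tpr - cost * fpr


# Hypothetical units: a predicted failure saves 100, a misprediction costs 20.
value = prediction_value(benefit=100.0, cost=20.0, tpr=0.87, fpr=0.10)
```

Under this reading, the system is worthwhile only while the weighted benefit exceeds the weighted cost, so an expensive misprediction can outweigh a high tpr.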