1- Failure Rate and Life Expectancy of Critical System Components Based on Real-Time Monitoring of Component Usage and Temperature Cycles
- A disk drive that spins 24 hours a day is likely to fail earlier than one that is rarely used.
- A constantly and heavily loaded CPU is likely to fail earlier than one that is idle most of the time.
2- Motivation
- A typical computer system's failure rate can be estimated from the aggregate mean time to failure (MTTF) of its components.
- Component reliability data are commonly taken from accelerated tests.
- As system complexity increases, the number of components increases, e.g. the number of processors in a system can go as high as 50,000.
- Real-time usage differs considerably from accelerated tests because of unpredicted load, temperature, and the temperature cycling caused by load variations.
- To project system reliability and potential component failure more accurately, it becomes necessary to monitor component usage in real time.
- Real-time usage data, combined with the accelerated test data, provide a better projection of the MTTF of the components and of the system as a whole.
3- Reliability Oriented Performance Management System
- Monitor, analyze, and predict the life expectancy of a computer system
- Collect and analyze real-time usage and temperature data with the Component Life Distribution Model (CLDM) of each component
- Use a System Life Distribution Model (SLDM) to predict the system failure rate based on the components' MTTF
4- Component Life Distribution Model (CLDM)
- The CLDM is constructed from accelerated test data and historical failure data
- Estimate each component's current mean time to failure (MTTF)
[Diagram: usage monitoring of a component yields its accumulated usage; this and the lab-projected MTTF data give an adjusted MTTF, shown on the Individual Component Reliability Dashboard]
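A minimal sketch of how a lab-projected MTTF might be adjusted by accumulated usage. The slides do not give the adjustment formula, so the linear wear model below, the `stress_factor` weighting, and all names are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class UsageSample:
    """One monitored interval for a component (hypothetical record layout)."""
    hours: float          # time spent in this operating condition
    stress_factor: float  # 1.0 = lab reference condition; > 1.0 ages faster (assumed)


def adjusted_mttf(lab_mttf_hours: float, samples: Iterable[UsageSample]) -> float:
    """Subtract stress-weighted accumulated usage from the lab-projected MTTF.

    Assumes life is consumed linearly; the result is floored at zero.
    """
    consumed = sum(s.hours * s.stress_factor for s in samples)
    return max(lab_mttf_hours - consumed, 0.0)


# 1000 h at reference load plus 500 h at twice the reference stress
samples = [UsageSample(hours=1000, stress_factor=1.0),
           UsageSample(hours=500, stress_factor=2.0)]
remaining = adjusted_mttf(100_000, samples)  # 100000 - (1000 + 1000) = 98000.0
```

A real CLDM would presumably replace the linear subtraction with a life distribution fitted to test and field data; the sketch only shows where the monitored usage enters the estimate.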
5- System Life Distribution Model
- Component dependency matrix
- Learning from data
  - Obtain initial dependency information -> prior model
  - Update prior model using historical data
  - Continue updating model online from data stream
- Based on different predictive models
  - Bayesian network approach
  - Neural network approach
  - Markov chains
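The prior-then-update loop above can be sketched for a single entry of the dependency matrix. This is only one possible reading: it assumes each entry is the probability that component j fails given that component i has failed, and it uses a Beta prior with Bernoulli observations, which the slides do not specify. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DependencyEstimate:
    """Beta-distributed belief about P(j fails | i failed) (assumed model)."""
    alpha: float  # pseudo-count of cases where j failed after i failed
    beta: float   # pseudo-count of cases where j survived after i failed

    def update(self, j_failed: bool) -> None:
        """Fold one new observation from the data stream into the belief."""
        if j_failed:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def mean(self) -> float:
        """Posterior mean of the dependency probability."""
        return self.alpha / (self.alpha + self.beta)


# Prior model from initial dependency information: mean 0.1
est = DependencyEstimate(alpha=1.0, beta=9.0)

# Historical data first, then the same call for each online observation
for outcome in [True, False, False, True, False]:
    est.update(outcome)
# After 2 failures and 3 survivals: alpha = 3, beta = 12, mean = 3 / 15
```

The same `update` call serves both the historical batch and the online stream, which is why a conjugate prior is convenient here; a Bayesian network, neural network, or Markov chain model would need its own update rule.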
6- System Reliability Projection Based on Individual Component Usage
[Diagram: for each component, a usage monitor supplies the component's current usage profile, which is combined with the component's lab MTTF data to give a current component reliability projection; the per-component projections and the component dependency matrix feed the system reliability projection, shown on the System Reliability Dashboard]
7- Using a CPU as an Example: a Simple Model to Illustrate the Concept
- Performance: P
- Power: Pr
- Lifetime temperature profile: T
- Reliability adjustment based on real-time usage: R_adj
- Lab accelerated stress test projected reliability: L_lab
- Overall reliability: R_overall
- The higher the performance, the higher the power.
- Temperature is directly proportional to power.
- Reliability is inversely proportional to temperature.
- Pr = F_Pr(P): power is a function of performance
- T = F_T(Pr): temperature is a function of power
- R = F_R(T): reliability is a function of temperature
- It follows that R = F_R(F_T(F_Pr(P))): reliability is a function of performance.
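The chain from performance to reliability can be sketched as a composition of three functions. The functional forms and every constant below are assumptions chosen only to respect the stated trends (power rises with performance, temperature rises with power, reliability falls with temperature); the slides do not give them.

```python
def power_from_performance(p: float) -> float:
    """F_Pr: watts drawn at utilisation p in [0, 1] (assumed linear, 20 W idle)."""
    return 20.0 + 80.0 * p


def temperature_from_power(pr: float) -> float:
    """F_T: kelvin; ambient 298 K plus a rise proportional to power (assumed)."""
    return 298.0 + 0.5 * pr


def reliability_from_temperature(t: float) -> float:
    """F_R: relative reliability index, inversely proportional to T (assumed)."""
    return 298.0 / t


def reliability_from_performance(p: float) -> float:
    """R = F_R(F_T(F_Pr(P))): the composition of the three functions."""
    return reliability_from_temperature(
        temperature_from_power(power_from_performance(p))
    )


idle = reliability_from_performance(0.0)    # 298 / 308
loaded = reliability_from_performance(1.0)  # 298 / 348
```

With these placeholder curves the idle CPU scores higher than the fully loaded one, matching the qualitative claim; measured curves would be substituted for each function in practice.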
8- CPU Power vs. Performance
9- CPU Temperature vs. Power
10- CPU Reliability vs. Temperature
[Figure: calculated failure rate of a CPU using an Arrhenius-based model]
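For reference, a minimal sketch of the standard Arrhenius temperature scaling of a failure rate. The slides do not state the parameters behind the plotted curve, so the activation energy of 0.7 eV and the example temperatures are illustrative assumptions, and the function name is hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_failure_rate(rate_ref: float,
                           temp_ref_c: float,
                           temp_c: float,
                           activation_energy_ev: float = 0.7) -> float:
    """Scale a failure rate known at temp_ref_c to the temperature temp_c.

    Uses the acceleration factor exp((Ea / k) * (1 / T_ref - 1 / T)),
    with both temperatures converted to kelvin.
    """
    t_ref = temp_ref_c + 273.15
    t = temp_c + 273.15
    accel = math.exp(
        activation_energy_ev / BOLTZMANN_EV_PER_K * (1.0 / t_ref - 1.0 / t)
    )
    return rate_ref * accel


# Failure rate measured at 55 C, projected to a hotter and a cooler die
rate_hot = arrhenius_failure_rate(1e-6, temp_ref_c=55, temp_c=85)
rate_cool = arrhenius_failure_rate(1e-6, temp_ref_c=55, temp_c=40)
```

Under a constant-failure-rate assumption the corresponding MTTF is the reciprocal of the returned rate, so a hotter die gives a shorter projected life.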
11- CPU Performance vs. Reliability
12- For a system with 6 CPUs in series
System Fail Rate = Min(MTTF1, ..., MTTFn)
[Diagram: CPU1 through CPU6 connected in series]
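A short sketch of the series case. The first function is the weakest-link rule from this slide. The second is a common textbook refinement that is not in the slides: if the component lifetimes are assumed independent and exponentially distributed, the failure rates add. Function names and the example MTTF values are hypothetical.

```python
from typing import Sequence


def series_weakest_link(mttfs: Sequence[float]) -> float:
    """Slide's rule: the system is bounded by its shortest-lived component."""
    return min(mttfs)


def series_mttf_exponential(mttfs: Sequence[float]) -> float:
    """System MTTF when independent exponential lifetimes are assumed.

    Each component has failure rate 1 / MTTF_i; in series the rates add.
    """
    return 1.0 / sum(1.0 / m for m in mttfs)


cpu_mttfs = [600_000.0] * 6  # six identical CPUs, hours (illustrative)
bound = series_weakest_link(cpu_mttfs)         # 600000.0
estimate = series_mttf_exponential(cpu_mttfs)  # about 100000 hours
```

The gap between the two numbers shows why the weakest-link value is best read as an upper bound on the life of a series system.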
13- For a system with 6 CPUs in parallel
System Fail Rate = B1·Fcpu1 + B2·Fcpu2 + ... + Bn·Fcpun
F: CPU fail rate; B: a function or index for relative probability
[Diagram: CPU1 through CPU6 connected in parallel]
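A minimal sketch of the parallel-case expression, reading it as a weighted sum of per-CPU fail rates. That reading is an assumption, as are the function name and the example values; the slides do not say how the B weights are obtained.

```python
from typing import Sequence


def weighted_system_fail_rate(fail_rates: Sequence[float],
                              weights: Sequence[float]) -> float:
    """Sum of B_i * F_cpu_i over all CPUs (assumed reading of the slide)."""
    if len(fail_rates) != len(weights):
        raise ValueError("need one weight per CPU")
    return sum(b * f for b, f in zip(weights, fail_rates))


# Six CPUs with equal relative probability (illustrative values)
rates = [2e-6] * 6
weights = [1.0 / 6.0] * 6
system_rate = weighted_system_fail_rate(rates, weights)  # close to 2e-6
```

With equal weights that sum to one, the result reduces to the average CPU fail rate; unequal weights would let a heavily loaded CPU dominate the projection.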
14- Network Model for a System with 6 CPUs
[Diagram: network with nodes X1 to X6 (one per CPU, CPU1 to CPU6), nodes Y1 to Yn, and node Z]
- X_i = 1 if CPUi is OK, 0 otherwise
- Y_j = 1 if Probj is OK, 0 otherwise
- Prob(X1 down | Y1 is OK)
- System Fail Rate = arg max_x P(x, y)
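The two queries on this slide can be sketched on a toy joint distribution over one X node and one Y node. The probabilities are invented purely for illustration, the function names are hypothetical, and a real network with six CPUs would use inference over the full graph rather than a lookup table.

```python
# Hypothetical joint distribution P(x, y); 1 = OK, 0 = down
joint = {
    (1, 1): 0.90,  # X1 OK,   Y1 OK
    (0, 1): 0.04,  # X1 down, Y1 OK
    (1, 0): 0.02,  # X1 OK,   Y1 down
    (0, 0): 0.04,  # X1 down, Y1 down
}


def prob_x_down_given_y_ok(table: dict) -> float:
    """Prob(X1 down | Y1 is OK) = P(x=0, y=1) / P(y=1)."""
    p_y_ok = sum(p for (_, y), p in table.items() if y == 1)
    return table[(0, 1)] / p_y_ok


def most_probable_x(table: dict, y: int) -> int:
    """arg max over x of P(x, y) for an observed value of y."""
    candidates = [x for (x, yy) in table if yy == y]
    return max(candidates, key=lambda x: table[(x, y)])


p_down = prob_x_down_given_y_ok(joint)  # 0.04 / 0.94
state_if_y_ok = most_probable_x(joint, y=1)    # 1: CPU most likely OK
state_if_y_down = most_probable_x(joint, y=0)  # 0: CPU most likely down
```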
15- Examples of Reliability Oriented Performance Management System Usage Scenarios
- Display key performance indicators, such as component MTTF and system life expectancy
  - With detailed drill-down capabilities
- Allow estimation of trade-offs between the life expectancy of the components in a system and its overall performance
- Guide the design of a data center based on its cooling capacity and environment to maximize its overall performance