IBM blueandwhite template with image

About This Presentation

Description:

IBM Research. Apr 2005 | IBM Confidential | 2005 IBM Corporation

Slides: 16

Provided by: indus62

Transcript and Presenter's Notes

1

Failure Rate and Life Expectancy of Critical System Components Based on Real-Time Monitoring of Component Usage and Temperature Cycles
A disk drive that spins 24 hours a day is likely to fail earlier than one that is rarely used.
A constantly and heavily loaded CPU is likely to fail earlier than one that is idle most of the time.

2
Motivation
A typical computer system's failure rate can be estimated from the aggregate mean time to failure (MTTF) of its components.
Component reliability data are commonly taken from accelerated stress tests.
As system complexity increases, the number of components increases; for example, the number of processors in a system can reach 50,000.
Real-time usage differs considerably from accelerated test conditions because of unpredicted load, temperature, and temperature cycling caused by load variations.
To project system reliability and potential component failure more accurately, it becomes necessary to monitor component usage in real time.
Real-time usage data combined with accelerated test data provide a better projection of the MTTF of the components and of the system as a whole.

3
Reliability-Oriented Performance Management System

Monitor, analyze, and predict the life expectancy of a computer system
Collect and analyze real-time usage and temperature data with the Component Life Distribution Model (CLDM) of each component
Use a System Life Distribution Model (SLDM) to predict the system failure rate based on the MTTF of its components

4
Component Life Distribution Model (CLDM)

The CLDM is constructed from accelerated test data and historical failure data
It is used to estimate a component's current Mean Time To Failure (MTTF)

[Diagram: Individual Component Reliability Dashboard. Labels: Component, Usage Monitoring, Accumulated Usage, Lab Projected MTTF Data, Adjusted MTTF]
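
As a minimal sketch of how a lab-projected MTTF might be adjusted with accumulated usage, the snippet below assumes a simple linear wear model: each monitored hour consumes life in proportion to a stress factor, where 1.0 means lab reference conditions. The function name, the stress factors, and the linear model itself are illustrative assumptions and are not given in the presentation.

```python
def adjusted_remaining_life(lab_mttf_hours, usage_log):
    """Estimate remaining life from lab MTTF and accumulated usage.

    usage_log is a list of (hours, stress_factor) pairs; a stress_factor of
    1.0 means the component ran at lab reference conditions (assumption).
    """
    # Accumulated usage expressed in lab-equivalent hours.
    consumed = sum(hours * stress for hours, stress in usage_log)
    # Remaining life cannot go below zero.
    return max(lab_mttf_hours - consumed, 0.0)


# Hypothetical disk: 100,000 h lab MTTF, 10,000 h lightly used, 5,000 h heavily loaded.
log = [(10_000, 0.5), (5_000, 2.0)]
remaining = adjusted_remaining_life(100_000, log)
print(remaining)  # 85000.0
```

A real CLDM would presumably use a life distribution rather than a single subtraction; the point here is only the data flow from usage monitoring to an adjusted figure.
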
5
System Life Distribution Model

Component dependency matrix - learning from data:
Obtain initial dependency information -> prior model
Update prior model using historical data
Continue updating model online from data stream
Based on different predictive models: Bayesian network approach, neural network approach, Markov chains
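
One way to picture the prior-then-update flow for a single entry of the dependency matrix is a Beta-Bernoulli update, sketched below. The choice of a Beta prior, the class name, and the sample observations are assumptions for illustration; the slides do not specify which Bayesian model is used.

```python
from dataclasses import dataclass


@dataclass
class DependencyEstimate:
    """Beta(alpha, beta) belief about P(component B fails | component A failed)."""
    alpha: float = 1.0  # prior pseudo-count of "B also failed"
    beta: float = 1.0   # prior pseudo-count of "B survived"

    def update(self, b_failed: bool) -> None:
        # Conjugate update: add one count to the matching outcome.
        if b_failed:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


# Prior from initial dependency information (hypothetical: weak belief of 0.2).
est = DependencyEstimate(alpha=1.0, beta=4.0)

# Historical data first; the same call handles an online data stream.
for observed in [True, False, False, True, False]:
    est.update(observed)

print(round(est.mean, 3))  # 0.3
```

Because the update is incremental, the same object can absorb historical records in a batch and then keep absorbing new observations as they arrive.
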

6
System Reliability Projection Based on Individual Component Usage

[Diagram: System Reliability Dashboard. For each of three components, a Usage Monitor supplies the Component Current Usage Profile, which appears alongside the Component Lab MTTF Data and a Current Component Reliability Projection. Also shown: Component Dependency Matrix and System Reliability Projection.]
7

Using a CPU as an Example: a Simple Model to Illustrate the Concept
Performance: $P$
Power: $Pr$
Lifetime temperature profile: $T$
Reliability adjustment based on real-time usage: $R_{adj}$
Lab accelerated stress test projected reliability: $L_{lab}$
Overall reliability: $R_{overall}$
The higher the performance, the higher the power.
Temperature is directly proportional to power.
Reliability is inversely proportional to temperature.
$Pr = F_{pr}(P)$: power is a function of performance
$T = F_T(Pr)$: temperature is a function of power
$R = F_R(T)$: reliability is a function of temperature; it follows that
$R = F_R(F_T(F_{pr}(P)))$
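
The chain of dependencies above can be sketched as function composition. The specific forms used here (a cubic power law, a linear thermal resistance, reliability taken as 1/T) and all constants are placeholder assumptions chosen only to make the example run; the slides do not give the actual functions.

```python
def f_pr(performance_ghz):
    """Power (W) as a function of performance; the cubic law is an assumption."""
    return 10.0 * performance_ghz ** 3


def f_t(power_w, ambient_k=300.0, thermal_resistance=0.5):
    """Temperature (K) rises linearly with power (assumed K/W resistance)."""
    return ambient_k + thermal_resistance * power_w


def f_r(temperature_k):
    """Relative reliability, inversely proportional to temperature."""
    return 1.0 / temperature_k


def reliability_from_performance(performance_ghz):
    # Composition of the three functions: performance -> power -> temperature -> reliability.
    return f_r(f_t(f_pr(performance_ghz)))


low = reliability_from_performance(1.0)
high = reliability_from_performance(3.0)
print(low > high)  # True: higher performance gives lower relative reliability
```
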

8
CPU Power vs. Performance
9
CPU Temperature vs. Power
10
CPU Reliability vs. Temperature
Calculated failure rate of a CPU using an Arrhenius-based model
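
For readers unfamiliar with the Arrhenius model, the sketch below computes the usual acceleration factor between a use temperature and a stress temperature and applies it to a lab MTTF. The activation energy of 0.7 eV, the temperatures, and the function names are assumptions for illustration, not values from the slides.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(t_use_c, t_stress_c, activation_energy_ev=0.7):
    """Acceleration factor between a use and a stress temperature (Celsius)."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_use_k - 1.0 / t_stress_k
    )
    return math.exp(exponent)


def field_mttf(lab_mttf_hours, t_use_c, t_stress_c, activation_energy_ev=0.7):
    """Scale an MTTF measured at stress temperature to the use temperature."""
    return lab_mttf_hours * arrhenius_acceleration(
        t_use_c, t_stress_c, activation_energy_ev
    )


af = arrhenius_acceleration(t_use_c=55.0, t_stress_c=125.0)
print(af > 1.0)  # True: the part lasts longer at the cooler use temperature
```
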
11
CPU Performance vs. Reliability
12
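For a system with 6 CPUs in series
System Fail Rate = $\min(MTTF_1, \dots, MTTF_n)$, i.e. the component with the shortest MTTF determines when the system fails
[Diagram: CPU1 through CPU6 connected in series]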
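
For comparison, a common textbook treatment of a series system assumes independent components with constant (exponential) failure rates, in which case the rates add and the system MTTF is never longer than the shortest component MTTF. The sketch below uses that assumption and hypothetical MTTF values; it is not taken from the slides.

```python
def series_system_mttf(component_mttfs):
    """System MTTF for independent exponential components in series."""
    # Each failure rate is 1/MTTF; in series the rates add.
    system_rate = sum(1.0 / mttf for mttf in component_mttfs)
    return 1.0 / system_rate


# Hypothetical per-CPU MTTFs in hours, e.g. after usage-based adjustment.
cpu_mttfs = [60_000, 60_000, 60_000, 60_000, 60_000, 30_000]
system_mttf = series_system_mttf(cpu_mttfs)

print(round(system_mttf))  # 8571
print(system_mttf <= min(cpu_mttfs))  # True
```

Under these assumptions the minimum component MTTF acts as an upper bound on system life rather than as an exact value.
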
13
For a system with 6 CPUs in parallel
System Fail Rate = $B_1 F_{cpu1} + B_2 F_{cpu2} + \dots + B_n F_{cpun}$
where $F$ = CPU fail rate and $B$ = a function or index for relative probability
[Diagram: CPU1 through CPU6 connected in parallel]
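
As a point of reference, the standard reliability calculation for a fully redundant parallel system (the system works while at least one CPU works) can be sketched as follows. It assumes independent CPUs with constant failure rates, which is an assumption of this example and not necessarily the weighting scheme on the slide.

```python
import math


def parallel_reliability(failure_rates, hours):
    """P(at least one component still works at time `hours`)."""
    prob_all_failed = 1.0
    for rate in failure_rates:
        # Exponential lifetime: P(failed by t) = 1 - exp(-rate * t).
        prob_all_failed *= 1.0 - math.exp(-rate * hours)
    return 1.0 - prob_all_failed


# Hypothetical failure rates (per hour) for six CPUs.
rates = [1 / 50_000] * 6
single = math.exp(-rates[0] * 50_000)            # one CPU on its own
redundant = parallel_reliability(rates, 50_000)  # six CPUs in parallel

print(redundant > single)  # True
```
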
14
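Network Model for a System with 6 CPUs
[Diagram: network of nodes CPU1 through CPU6 with state variables X1 through X6, nodes Y1 ... Yn, and node Z]
$X_i = 1$ if CPU$_i$ is OK, 0 otherwise
$Y_j = 1$ if Prob$_j$ is OK, 0 otherwise
$\mathrm{Prob}(X_1 \text{ down} \mid Y_1 \text{ is OK})$
System Fail Rate = $\arg\max_x P(x, y)$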
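
The arg max over joint states can be illustrated by brute-force enumeration on a toy network. The structure below (one shared node Y feeding two CPUs) and every probability in it are hypothetical; the slides do not give the actual network parameters.

```python
from itertools import product

# P(Y = 1): probability that the shared node is OK (hypothetical).
P_Y_OK = 0.9
# P(X_i = 1 | Y = y): probability that CPU i is OK given the state of Y.
P_X_OK_GIVEN_Y = {1: [0.95, 0.90], 0: [0.20, 0.10]}


def joint(x, y):
    """P(x, y) for CPU states x (tuple of 0/1) and shared-node state y."""
    p = P_Y_OK if y == 1 else 1.0 - P_Y_OK
    for x_i, p_ok in zip(x, P_X_OK_GIVEN_Y[y]):
        p *= p_ok if x_i == 1 else 1.0 - p_ok
    return p


def most_probable_cpu_states(y):
    """arg max over x of P(x, y), enumerating all CPU state vectors."""
    n = len(P_X_OK_GIVEN_Y[y])
    return max(product([0, 1], repeat=n), key=lambda x: joint(x, y))


print(most_probable_cpu_states(y=1))  # (1, 1)
print(most_probable_cpu_states(y=0))  # (0, 0)
```

Enumeration grows as 2^n in the number of CPUs, so a system of realistic size would need the inference methods of a Bayesian network library rather than this loop.
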
15
Examples of Reliability-Oriented Performance Management System Usage Scenarios

Display key performance indicators, such as component MTTF and system life expectancy, with detailed drill-down capabilities
Allow estimation of trade-offs between the life expectancy of components in a system and its overall performance
Guide the design of a data center based on its cooling capacity and environment to maximize its overall performance
