IBM blueandwhite template with image - PowerPoint PPT Presentation

About This Presentation
Title:

IBM blueandwhite template with image

Description:

Dashboard. IBM Research. Apr 2005 | IBM Confidential 2005 IBM Corporation ... Dashboard. Component. Lab MTTF. Data. Component. Current Usage. Profile. Usage ... – PowerPoint PPT presentation

Slides: 16
Provided by: indus62

Transcript and Presenter's Notes


1
  • Failure Rate and Life Expectancy of Critical
    System Components Based on Real-Time Monitoring
    of Component Usage and Temperature Cycles
  • A disk drive that spins 24 hours a day is
    likely to fail earlier than one that is rarely
    used.
  • A constantly and heavily loaded CPU is likely to
    fail earlier than one that is idle most of the
    time.

2
  • Motivation
  • A typical computer system's failure rate can be
    estimated from the aggregate mean time to
    failure (MTTF) of its components.
  • Component reliability data are commonly taken
    from accelerated tests.
  • As system complexity increases, the number of
    components increases; for example, the number of
    processors in a system can reach 50,000.
  • Real-time usage differs substantially from
    accelerated test conditions because of
    unpredictable load, temperature, and temperature
    cycling caused by load variations.
  • To project system reliability and potential
    component failures more accurately, it becomes
    necessary to monitor component usage in real
    time.
  • Real-time usage data, combined with
    accelerated test data, provide a better
    projection of the MTTF of the components and of
    the system as a whole.


3
Reliability Oriented Performance Management System
  • Monitor, analyze and predict life expectancy of a
    computer system
  • Collect real-time usage and temperature data
    and analyze them with the Component Life
    Distribution Model (CLDM) of each component
  • Use a System Life Distribution Model (SLDM) to
    predict the system failure rate from the
    components' MTTFs

4
Component Life Distribution Model (CLDM)
  • The CLDM is constructed from accelerated test
    data and historical failure data
  • It estimates each component's current mean time
    to failure (MTTF)

[Diagram labels: Component; Usage Monitoring; Accumulated Usage; Lab Projected MTTF Data; Adjusted MTTF; Individual Component Reliability Dashboard]
5
System Life Distribution Model
  • Component dependency matrix - Learning from data
  • Obtain initial dependency information -> prior
    model
  • Update prior model using historical data
  • Continue updating model online from data stream
  • Based on different predictive models
  • Bayesian network approach
  • Neural network approach
  • Markov chains
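
The sequence above (prior model, update from historical data, continued online updates) can be sketched for a single entry of the dependency matrix. This is a minimal illustration that assumes a Beta-Bernoulli model for the probability that component B fails given that component A has failed. The slides do not specify a model, and the class name, prior counts, and observations below are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DependencyEstimate:
    """Beta(alpha, beta) belief about P(B fails | A failed). Hypothetical model."""

    alpha: float  # pseudo-count of cases where B also failed
    beta: float   # pseudo-count of cases where B survived

    def update(self, b_failed: bool) -> None:
        # Conjugate update: each observation adds one count to one side.
        if b_failed:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        # Posterior mean of the dependency probability.
        return self.alpha / (self.alpha + self.beta)


# Step 1: prior model from initial dependency information (assumed: about 10%).
est = DependencyEstimate(alpha=1.0, beta=9.0)

# Step 2: update the prior with historical records (made-up data).
for observed in [True, False, False, True, False]:
    est.update(observed)

# Step 3: keep updating as observations arrive from the data stream.
for observed in [False, False, True]:
    est.update(observed)

print(f"P(B fails | A failed) is about {est.mean:.3f}")
```

The same update rule serves both the batch step and the streaming step, which is why a conjugate prior is a convenient choice for this kind of sketch.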

6
System Reliability Projection Based on Individual
Component Usage

[Diagram: three components, each with a Usage Monitor, a Component Current Usage Profile, Component Lab MTTF Data, and a Current Component Reliability Projection; plus a Component Dependency Matrix, a System Reliability Projection, and a System Reliability Dashboard]
7
  • A simple model to illustrate the concept, using
    a CPU as an example
  • Performance: $P$
  • Power: $Pr$
  • Lifetime temperature profile: $T$
  • Reliability adjustment based on real-time usage:
    $R_{adj}$
  • Lab accelerated stress test projected
    reliability: $L_{lab}$
  • Overall reliability: $R_{overall}$
  • The higher the performance, the higher the power
  • Temperature is directly proportional to power.
  • Reliability is inversely proportional to
    temperature.
  • $Pr = F_{pr}(P)$: power is a function of
    performance
  • $T = F_T(Pr)$: temperature is a function of
    power
  • $R = F_R(T)$: reliability is a function of
    temperature; it follows that
    $R = F_R(F_T(F_{pr}(P)))$, so reliability is a
    function of performance
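
The chain of dependencies above can be sketched as a composition of three functions. The functional forms and constants below are placeholders chosen only to match the qualitative statements on this slide (power rises with performance, temperature is proportional to power, reliability is inversely proportional to temperature). They are not taken from the presentation, and the temperature and reliability values are unitless indices.

```python
def power_from_performance(p: float) -> float:
    """F_pr: power (W) at performance level p in [0, 1]; assumed linear."""
    idle_watts, max_watts = 20.0, 100.0  # hypothetical CPU power range
    return idle_watts + (max_watts - idle_watts) * p


def temperature_from_power(pr: float) -> float:
    """F_T: temperature index, directly proportional to power (assumed factor)."""
    return 0.5 * pr


def reliability_from_temperature(t: float) -> float:
    """F_R: reliability index, inversely proportional to temperature."""
    return 100.0 / t


def reliability_from_performance(p: float) -> float:
    """Composition R = F_R(F_T(F_pr(P)))."""
    return reliability_from_temperature(
        temperature_from_power(power_from_performance(p))
    )


for p in (0.0, 0.5, 1.0):
    print(f"P={p:.1f} -> reliability index {reliability_from_performance(p):.2f}")
```

With these placeholder forms the reliability index falls as performance rises, which is the trade-off the model is meant to expose.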


8
CPU Power vs. Performance
9
CPU Temperature vs. Power
10
CPU Reliability vs. Temperature
Calculated failure rate of a CPU using
Arrhenius-based model
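
As a rough illustration of what an Arrhenius-based model computes, the sketch below gives the ratio of failure rates at two temperatures, using a failure rate proportional to exp(-Ea / (k T)). The activation energy of 0.7 eV, the function names, and the example temperatures and MTTF are assumptions for illustration; the slides do not give the parameters behind the chart.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(t_use_c: float, t_ref_c: float, ea_ev: float = 0.7) -> float:
    """Return failure_rate(t_use) / failure_rate(t_ref) under the Arrhenius model.

    ea_ev is the activation energy in eV; 0.7 is an assumed value.
    """
    t_use_k = t_use_c + 273.15  # convert Celsius to kelvin
    t_ref_k = t_ref_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_ref_k - 1.0 / t_use_k))


def adjusted_mttf(lab_mttf_hours: float, t_use_c: float, t_lab_c: float) -> float:
    """Scale a lab MTTF to the observed operating temperature (hypothetical helper)."""
    return lab_mttf_hours / arrhenius_acceleration(t_use_c, t_lab_c)


factor = arrhenius_acceleration(t_use_c=70.0, t_ref_c=60.0)
print(f"Failure rate at 70 C relative to 60 C: {factor:.2f}x")
print(f"Adjusted MTTF: {adjusted_mttf(100_000.0, 70.0, 60.0):.0f} hours")
```

Under these assumptions a 10 °C rise from 60 °C roughly doubles the failure rate, which is why the temperature profile matters so much for the adjusted reliability.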
11
CPU Performance vs. Reliability
12
For a system with 6 CPUs in series
System Fail Rate: set by the shortest-lived CPU, $\min(MTTF_1, \ldots, MTTF_n)$
[Diagram: CPU1 through CPU6 connected in series]
13
For a system with 6 CPUs in parallel
[Diagram: CPU1 through CPU6 connected in parallel]
System Fail Rate = $B_1 F_{cpu1} + B_2 F_{cpu2} + \ldots + B_n F_{cpun}$
$F$ = CPU fail rate; $B$ = a function or index for relative probability
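
For comparison with the series and parallel slides above, the sketch below uses the standard textbook reliability formulas rather than the simplified expressions on the slides. It assumes independent CPUs with constant failure rates (exponential lifetimes), and it reads "parallel" as redundancy, meaning the system survives while at least one CPU works. Both are assumptions, and the MTTF values are hypothetical.

```python
import math
from typing import Sequence


def component_reliability(mttf_hours: float, t_hours: float) -> float:
    """P(component survives to time t), assuming a constant failure rate 1/MTTF."""
    return math.exp(-t_hours / mttf_hours)


def series_reliability(mttfs: Sequence[float], t_hours: float) -> float:
    """Series system: works only if every component works."""
    return math.prod(component_reliability(m, t_hours) for m in mttfs)


def parallel_reliability(mttfs: Sequence[float], t_hours: float) -> float:
    """Redundant parallel system: works if at least one component works."""
    return 1.0 - math.prod(1.0 - component_reliability(m, t_hours) for m in mttfs)


mttfs = [50_000.0] * 6  # hypothetical MTTF in hours for each of 6 CPUs
t = 10_000.0            # mission time in hours

print(f"single CPU: {component_reliability(mttfs[0], t):.4f}")
print(f"series:     {series_reliability(mttfs, t):.4f}")
print(f"parallel:   {parallel_reliability(mttfs, t):.6f}")
```

The series value is always at or below that of the weakest single CPU, and the redundant parallel value is always at or above that of the strongest one.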
14
Network Model for a System with 6 CPUs
[Diagram: network with nodes CPU1 through CPU6, X1 through X6, Y1 through Yn, and Z]
  • $X_i = 1$ if CPU$_i$ is OK, 0 otherwise
  • $Y_j = 1$ if Prob$_j$ is OK, 0 otherwise
  • Prob($X_1$ down | $Y_1$ is OK)
  • System Fail Rate = $\arg\max_x P(x, y)$
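
A conditional probability such as Prob(X1 down | Y1 is OK) can be computed with Bayes' rule once the network supplies the required probabilities. The sketch below handles one CPU state X and one observation Y. The function name and all three input probabilities are hypothetical, since the transcript does not give the network's parameters.

```python
def posterior_cpu_down(p_down: float, p_ok_given_down: float, p_ok_given_up: float) -> float:
    """Return P(X = 0 | Y = 1) by Bayes' rule for one CPU X and one observation Y."""
    p_up = 1.0 - p_down
    # Total probability that the observation Y reports OK.
    p_y_ok = p_ok_given_down * p_down + p_ok_given_up * p_up
    return p_ok_given_down * p_down / p_y_ok


# Hypothetical inputs: 1% prior chance the CPU is down; Y reports OK 5% of the
# time when the CPU is down and 99% of the time when it is up.
p_down, p_ok_given_down, p_ok_given_up = 0.01, 0.05, 0.99
posterior = posterior_cpu_down(p_down, p_ok_given_down, p_ok_given_up)
print(f"P(X1 down | Y1 OK) = {posterior:.6f}")

# Most probable CPU state given Y = 1: the argmax over x of the joint P(x, y).
joint = {0: p_ok_given_down * p_down, 1: p_ok_given_up * (1.0 - p_down)}
most_likely_state = max(joint, key=joint.get)
print(f"Most likely state of X1: {most_likely_state}")
```

For a full network the same argmax would run over every joint configuration of X1 through X6, which is where a Bayesian network's factorization keeps the computation manageable.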
15
Examples of Reliability Oriented Performance
Management System Usage Scenarios
  • Display key performance indicators, such as
    component MTTF, system life expectancy
  • With detailed drill-down capabilities
  • Allow estimation of trade-offs between life
    expectancy of components in a system and its
    overall performance
  • Guide the design of a data center based on its
    cooling capacity and environment to maximize its
    overall performance