Software Rejuvenation - PowerPoint PPT Presentation

Provided by: tonyabbo
Transcript and Presenter's Notes

Title: Software Rejuvenation


1
Software Rejuvenation
  • Vittorio Castelli
  • Rick Harper
  • Phil Heidelberger
  • Steve Hunter
  • Tom Pahel
  • Kalyan Vaidyanathan

2
Objectives
  • Improve system availability
  • Software-induced outages dominate
    hardware-induced outages
  • A top concern of most customers
  • Proactive fault management is greatly preferred,
    replacing unplanned outages with planned outages
  • There are many problems...we chose to attack
    software aging
  • Predict and avoid unplanned outages due to
    software aging
  • Monitor consumption of resources such as free
    memory, swap space, handle count, thread count,
    inode count, ...
  • Extrapolate resource consumption within a
    user-specified horizon
  • When exhaustion is predicted, send an event to
    IBM Director, which can trigger an alert,
    selective rejuvenation, cluster failover, or
    reboot
  • In some cases, the agent can identify which
    process or subsystem is the culprit
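
The monitor, extrapolate, and notify steps above can be sketched as follows. This is a minimal illustration, not the agent's actual code: it assumes a plain least-squares line through samples of a free resource, and the function names (fit_line, time_to_exhaustion) and the hour-based units are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def fit_line(times: Sequence[float], values: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of values = intercept + slope * time."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (time, value) samples")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all samples share the same timestamp")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def time_to_exhaustion(times: Sequence[float],
                       free_amounts: Sequence[float],
                       horizon: float) -> Optional[float]:
    """Return the time from the last sample until the fitted line reaches
    zero free resource, or None if that is not predicted within the horizon.
    """
    slope, intercept = fit_line(times, free_amounts)
    if slope >= 0:
        return None  # the resource is not being consumed over this window
    t_zero = -intercept / slope          # where the fitted line crosses zero
    remaining = max(t_zero - times[-1], 0.0)
    return remaining if remaining <= horizon else None


# Hypothetical samples: free memory (MB) measured once per hour.
hours = [0, 1, 2, 3, 4]
free_mb = [1000, 900, 800, 700, 600]
print(time_to_exhaustion(hours, free_mb, horizon=24))
print(time_to_exhaustion(hours, free_mb, horizon=5))
```

A result other than None is the point at which the agent would raise its event. What happens next (alert, failover, reboot) is left to the management layer.
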

3
Software Aging
  • Software state (OS, middleware, applications)
    decays with time...
  • memory leaks
  • handle leaks
  • unterminated threads
  • unreleased file locks
  • data corruption
  • ...resulting in Bad Things (outages, hangs,
    performance degradation)
  • We feel this behavior is an inevitable by-product
    of software industry dynamics and practice
  • Software failure prediction and state
    rejuvenation is a proactive technology designed
    to mitigate the effects of software aging
  • Predict when resource exhaustion is about to
    occur
  • Reset the state of the system to an initial
    low-resource-consumption condition

4
Project History
  • Supported in 2000 and 2001 by PSI funding
  • xSeries, Research, University collaboration
  • xSeries: architecture (Steve Hunter), RAS,
    Development (Tom Pahel), Marketing
  • Research: Rick Harper, Vittorio Castelli, and Phil
    Heidelberger
  • Duke: Kalyan Vaidyanathan, Kishor Trivedi
  • Incorporated into IBM Director
  • Timed rejuvenation on NT: GA Q4'99
  • Predictive rejuvenation on NT/W2K: GA Q4'00 (NT
    includes per-process diagnosis)
  • Predictive rejuvenation and per-process diagnosis
    on Linux: GA Q4'01
  • The market liked it

5
Software Rejuvenation Agent
  • Prediction Algorithm

6
Prediction Algorithm
  • Sampled Parameters
  • Windows agent can predict exhaustion of committed
    bytes, pool nonpaged bytes, pool paged bytes,
    logical disk bytes
  • Linux agent: swap space, disk space, inodes, file
    descriptors, processes
  • Sampling Technique
  • User selects exhaustion notification horizon
  • Typically should be at least several days
  • Agent sets up sliding sampling window that is
    1/10 the size of the horizon
  • Agent sets up sampling rate such that 300 points
    lie within sampling window
  • Can perform linear prediction using 200 points;
    more complex predictions require 300 points
  • Historical data is saved, subject to
    user-selected file size limitation
  • Predictive algorithm
  • Constructs 6 candidate fitted curves to smoothed
    sliding window data
  • Linear, Log, Linear/Log with 2 or 3 breakpoints
  • Selects best-fitting curve
  • Extrapolates selected curve out to exhaustion
    horizon
  • Generates an event if the extrapolated curve
    reaches a resource limit within the horizon, and
    indicates how long until that happens
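
The sampling arithmetic and the "select the best-fitting curve" step can be sketched as below. The 1/10 window and the 300 points per window come from the slide. Everything else is an assumption made for illustration: only two of the six candidate shapes (linear and log) are fitted, the breakpoint variants and the smoothing step are omitted, best fit is taken to mean lowest sum of squared errors, and the function names are hypothetical.

```python
import math


def sampling_plan(horizon_s, points_per_window=300):
    """Window is 1/10 of the horizon; the interval puts 300 points in it."""
    window_s = horizon_s / 10
    interval_s = window_s / points_per_window
    return window_s, interval_s


def least_squares(xs, ys):
    """Fit y = intercept + slope * x; also return the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def best_curve(times, values):
    """Fit the linear and log candidates; return (name, predictor) for the
    candidate with the lowest squared error. Times must be > 0 for the log fit.
    """
    candidates = {"linear": lambda t: t, "log": math.log}
    best = None
    for name, transform in candidates.items():
        xs = [transform(t) for t in times]
        slope, intercept, sse = least_squares(xs, values)
        if best is None or sse < best[1]:
            # Bind the fitted parameters as defaults so each predictor keeps its own.
            predictor = lambda t, s=slope, i=intercept, f=transform: i + s * f(t)
            best = (name, sse, predictor)
    return best[0], best[2]


# A 5-day horizon gives a 12-hour window sampled every 144 seconds.
print(sampling_plan(5 * 24 * 3600))

# Hypothetical usage data that grows logarithmically.
times = list(range(1, 11))
usage = [5 + 2 * math.log(t) for t in times]
name, predict = best_curve(times, usage)
print(name, predict(20))
```

The selected predictor is then evaluated out to the horizon and compared against the resource limit, as the last two bullets describe.
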

7
Example of Algorithm Execution

8
Diagnosis

Process      Consumption of Nonpaged Pool Bytes
SERVICES     447936     2.51
WINLOGON      64992     0.36
WinMgmt       57068     0.32
svchost       47448     0.27
explorer      45896     0.26
svchost       44704     0.25
CSRSS         42416     0.24
LSASS         40708     0.23
msdtc         35608     0.20
rtvscan       34448     0.19

System Module   Consumption of NonPaged Pool Bytes
Tag LSwi        2293760     0.124
Tag File        2027424     0.110
Tag Wdm         1705888     0.092
Tag MmCa        1371744     0.074
Tag Ntfr        1350112     0.073
Tag Nmdd        1048576     0.057
Tag NtFs         753888     0.041
Tag Ntfn         750080     0.041
Tag NDam         612608     0.033
Tag FSfm         541888     0.029
Tag Dmio         532448     0.029

Of a total of 35667968 Pool Nonpaged Bytes,
1318904 (3.70%) can be diagnosed to processes
and 34349064 (96.30%) are consumed by system
modules.

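
The split in the summary line can be checked with a few lines of arithmetic. The sketch below uses only the totals quoted on the slide. The function name diagnosis_split is hypothetical.

```python
def diagnosis_split(total_bytes, process_bytes):
    """Split a pool total into the part attributed to processes and the rest.

    Returns (system_bytes, process_percent, system_percent), with the
    percentages rounded to two decimals as on the slide.
    """
    system_bytes = total_bytes - process_bytes
    process_pct = round(100 * process_bytes / total_bytes, 2)
    system_pct = round(100 * system_bytes / total_bytes, 2)
    return system_bytes, process_pct, system_pct


# Totals taken from the slide's summary sentence.
print(diagnosis_split(35667968, 1318904))  # (34349064, 3.7, 96.3)
```
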
9
Problem: False Alarms due to Temporary Surges

10
Transition from Un-Notified to Notified: Ready to
Notify

11
Transition from Un-Notified to Notified: Notify

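
The slide titles name three states (Un-Notified, Ready to Notify, Notified) but the transcript does not preserve the transition rules. One plausible way to suppress false alarms from temporary surges is to require several consecutive exhaustion predictions before notifying. The sketch below assumes exactly that rule. The class name, the confirmations_needed parameter, and the reset-on-clear behavior are all hypothetical and are not taken from the presentation.

```python
from enum import Enum


class State(Enum):
    UN_NOTIFIED = "un-notified"
    READY_TO_NOTIFY = "ready to notify"
    NOTIFIED = "notified"


class Notifier:
    """Assumed debounce rule: notify only after N consecutive predictions."""

    def __init__(self, confirmations_needed=3):
        self.state = State.UN_NOTIFIED
        self.confirmations_needed = confirmations_needed
        self._streak = 0  # consecutive evaluations that predicted exhaustion

    def update(self, exhaustion_predicted):
        """Feed one prediction result; return True only when a notification
        should be sent for this evaluation."""
        if not exhaustion_predicted:
            # A surge that subsides resets the machine without any alert.
            self.state = State.UN_NOTIFIED
            self._streak = 0
            return False
        if self.state is State.NOTIFIED:
            return False  # already notified; do not repeat the alert
        self._streak += 1
        if self._streak >= self.confirmations_needed:
            self.state = State.NOTIFIED
            return True
        self.state = State.READY_TO_NOTIFY
        return False


notifier = Notifier(confirmations_needed=3)
# One isolated surge, then a sustained trend.
for predicted in [True, False, True, True, True, True]:
    print(predicted, notifier.update(predicted), notifier.state.value)
```

With this rule the isolated first prediction never produces a notification, while the sustained run does so once, on its third consecutive evaluation.
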
12
Software Rejuvenation Agent
  • Director Integration

13
High Level Design
  • Console Task is used to configure SW Rejuv
    options and criteria
  • Server Task saves persistent configuration data
    and communicates with agent machine
  • The agent monitors OS usage of resources,
    projects future exhaustion, and notifies the
    server if exhaustion is imminent

[Diagram: three tiers connected by IPC]
  • IBM Director Console: Director Tasks (Inventory,
    Events), Software Rejuvenation Task
  • Director Management Server: Topology Engine (SNMP
    Device, Director Clients, Cluster Servers, Rack
    Device, Other Add-on Topology Extensions),
    Director Server Tasks (Inventory, Event,
    Monitors, FileTransfer, Scheduler, CIM...),
    Software Rejuvenation Server Task (Persistent
    Data), DataBase
  • eServer Box: Director IPC Agent, Software
    Rejuvenation Sub-Agent, Sub-Agents (Events,
    Inventory, Monitors,...), Configuration
    (Input/Output) Files, Plug-ins (Inventory,
    Monitors), Operating System, Device Drivers,
    Service Processor, ServeRAID

14
Console Task
Verify Console Installation
15
Task Interface
  • Trend Viewer for systems with prediction
  • Schedule Filter to prevent rejuvenation on
    specified days
  • Drag-and-drop services for time-based rejuvenation
  • Rejuvenation Options apply to clusters only

16
Conclusions
  • The xSeries Software Rejuvenation project only
    attacked a small fraction of system outage
    causes, yet was well received
  • Much remains...
  • Current technology based on lab testing and a
    priori understanding of exhaustible resources
  • A very limited class of outage causes
  • Adaptive identification of pre-outage signatures
  • Improved diagnostic resolution
  • Selective rejuvenation of offending subsystem
  • Expand to more general classes of software
    failures and syndromes
  • Multiparameter signatures
  • Non-extremal conditions
  • Misconfigurations
  • Event log analysis
  • Applications
  • Workload balancing
  • HW/SW fault discrimination
  • SW testing and hardening