Resiliency and self-healing

1
Resiliency and self-healing
  • Visa Holopainen, visa_at_netlab.tkk.fi

2
Reinforcement Learning for Autonomic Network
Repair, M. Littman, N. Ravi, E. Fenson, R.
Howard, 2004
  • Reinforcement learning
  • Used to solve Markov decision problems (MDPs)
  • States, actions, rewards, transitions, transition
    probabilities
  • Agent explores an environment in which it
    perceives its current state and takes actions to
    reach new states
  • A reward is associated with every state
  • Reinforcement learning tries to find a policy for
    maximizing cumulative reward for a task

3
(Simplified) Reinforcement Learning example
  • Which direction should the agent move?

4
Reinforcement Learning example (cont)
  • Agent makes random moves until a Goal state is
    reached

5
Reinforcement Learning example (cont)
  • Now a policy is associated with the state from
    which the goal state was reached

6
Reinforcement Learning example (cont)
  • Now if at some point a state S' (that has a policy
    associated with it) is reached from a state S, a
    policy is assigned to S also

7
Reinforcement Learning example (cont)
  • After some amount of iterations the optimal
    policies have been formed

8
Reinforcement Learning example (cont)
  • The corresponding state rewards
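
The example on slides 3-8 can be sketched with tabular Q-learning, a standard reinforcement learning algorithm. This is a minimal illustration, not necessarily the method used in the slides or in the paper: the five-state corridor, the reward of 1 at the goal, and the parameters alpha and gamma are all assumptions made here.

```python
import random

# Hypothetical environment: a corridor of states 0..4 with the goal at state 4.
N_STATES = 5
GOAL = 4
ACTIONS = (-1, 1)  # move left or move right


def step(state, action):
    """Deterministic transition; the reward is 1 only when the goal is reached."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward


def q_learning(episodes=200, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values while the agent makes random moves until the goal."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            action = rng.choice(ACTIONS)  # random exploration, as on slide 4
            nxt, reward = step(state, action)
            # The goal is terminal, so no future value is added beyond it.
            best_next = 0.0 if nxt == GOAL else max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = nxt
    return q


def greedy_policy(q):
    """For every non-goal state, pick the action with the highest learned value."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)])
            for s in range(N_STATES) if s != GOAL}


if __name__ == "__main__":
    print(greedy_policy(q_learning()))
```

Values propagate backwards from the goal: the state next to the goal gets a useful value first, then its neighbour, which mirrors the way policies spread from state to state on slides 5-7.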

9
Implemented concept
  • Reinforcement learning is used to restore network
    connectivity after a failure
  • Starting state: no connectivity. Goal state:
    connectivity
  • Actions: PingGateway, PingIP, DNSLookup,
    UseCachedIP, FixIP, RenewLease
  • Learned policy in the picture
  • Prototype implemented
  • Nice concept but not very useful

10
Approaches to Building Self Healing Systems
using Dependency Analysis, J. Gao, G. Kar, P.
Kermani, 2004
  • Problems
  • Is there a way to automatically determine the
    root cause(s) of degraded performance of, e.g.,
    an Internet shopping site?
  • Provided that the root cause(s) can be
    determined, are there ways to automatically
    fix the problem?

11
Architecture
  • Distributed System
  • A typical multi-tier e-Business system (web
    access, database)
  • The Monitoring System
  • Includes monitoring agents that monitor 1) the
    response time of the system from the user's
    perspective and 2) the application components
    (servlets, EJBs, ...)
  • The Dependency Matrix
  • Which transactions depend on which system
    components
  • Self-healing Engine
  • Launched when a performance problem is noticed by
    monitoring system

12
Problem description
  • Based on previous work a dependency matrix can be
    formed
  • The matrix informs which customer transactions
    depend on which system resources
  • Using this matrix the system resource that causes
    a performance problem can be tracked
  • The initial goal was to minimize the number of
    transactions needed to find the root cause of a problem
  • This problem is found to be NP-hard -> a
    heuristic solution is presented

13
Solution
  • No solution can be guaranteed to be found if two
    or more matrix columns are identical
  • Assume that 1) all matrix columns are different
    and 2) there is only one broken system component
  • Now the solution can be found by the following
    algorithm
  • The set of all resources is denoted S. The set of
    all transactions is denoted T
  • Run all transactions one by one
  • If a transaction succeeds then remove all
    resources that this transaction depends on from S.
  • Finally only one resource is left in S. This is
    the broken resource.
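
The elimination steps above can be sketched as follows. This is a minimal illustration under assumptions: the dependency matrix is represented as a dictionary from transaction name to the set of resources it uses, and `run_transaction` is a hypothetical callback that returns True when a transaction succeeds. The transaction and resource names are made up for the example.

```python
def find_broken_resource(dependencies, run_transaction):
    """Return the resources still suspected after running every transaction.

    dependencies: dict mapping a transaction name to the set of resources
        it depends on (one row of the dependency matrix per transaction).
    run_transaction: hypothetical callable, True if the transaction succeeds.
    """
    # S starts as the set of all resources.
    suspects = set().union(*dependencies.values())
    # Run all transactions one by one.
    for transaction, resources in dependencies.items():
        if run_transaction(transaction):
            # A successful transaction proves its resources are healthy.
            suspects -= resources
    return suspects


if __name__ == "__main__":
    # Made-up three-tier example; every resource column is different.
    deps = {
        "browse": {"web"},
        "login": {"web", "app"},
        "checkout": {"web", "app", "db"},
    }
    broken = "db"  # simulate a single broken component
    print(find_broken_resource(deps, lambda t: broken not in deps[t]))
```

The sketch returns whatever remains in S. A single resource is left only if the successful transactions together cover every healthy resource.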

14
Solution (cont)
  • If the fixed set of customer transactions cannot
    locate the root cause of performance problem,
    synthetic transactions need to be created and
    executed
  • Many practical difficulties exist in doing so
  • No testing

15
Ensembles of Models for Automated Diagnosis of
System Performance Problems, S. Zhang, I. Cohen,
M. Goldszmidt, J. Symons, A. Fox, 2005
  • Ensemble collection
  • SLA contains Service Level Objectives (SLO)
  • SLO example: server downtime < X sec in a day
  • Problem: which system metrics correlate with SLO
    violations?
  • Example system metrics: CPU metrics, memory, I/O,
    network activity coming in and out of servers,
    swap space usage, paging, etc.
  • Tree Augmented Naïve Bayes (TAN) models
  • Determine which low-level metrics most likely
    contributed to an SLO violation
  • A mapping function is learned by the algorithm

16
TAN model example
  • Given SLO state (SLO violation) S, what is the
    most predictive set of system-level metrics for
    S?
  • Combinations of metrics are more predictive of SLO
    violations than individual metrics
  • Small numbers of metrics (3-8) are usually sufficient
    to predict SLO violation

17
Multiple TAN models
  • TAN models that are built using data collected
    under some conditions don't work well on data
    collected under different conditions -> need to
    maintain multiple TAN models
  • The model that best suits the current conditions
    is chosen using the Brier score
  • Brier score is similar to Mean Squared Error
    (MSE) and offers a fine grained evaluation of a
    model
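
Model selection by Brier score can be sketched as below. The Brier score for binary outcomes is the mean squared difference between the predicted probability and the observed outcome (0 or 1), so lower is better. The model names and the numbers are invented for illustration, and the slides do not say over which window of data the score is computed.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


def select_model(predictions_by_model, outcomes):
    """Pick the model name whose predictions have the lowest Brier score."""
    return min(predictions_by_model,
               key=lambda name: brier_score(predictions_by_model[name], outcomes))


if __name__ == "__main__":
    # 1 = SLO violation observed, 0 = no violation (made-up data).
    observed = [1, 0, 1, 1]
    predictions = {
        "model_a": [0.9, 0.2, 0.8, 0.7],  # hypothetical model trained on similar conditions
        "model_b": [0.6, 0.5, 0.5, 0.4],  # hypothetical model trained on other conditions
    }
    print(select_model(predictions, observed))
```

Unlike plain accuracy, the score rewards a model for being confident when it is right and penalizes it for being confident when it is wrong, which is the fine-grained evaluation the slide refers to.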

18
Results
  • Ensembles of models outperform single model
  • Also do slightly better than workload specific
    approach
  • Indicates that some workload conditions are too
    complex for a single model
  • BA: Balanced Accuracy
  • FA: False Alerts
  • Det: Detections

19
TAN summary
  • Ensembles of models perform better than a single
    model
  • The approach allows for rapid adaptation to
    changing conditions
  • No domain specific knowledge is required
  • Different workloads seem to be characterized by
    different metric-attribution signatures (future
    work)

20
Towards Autonomic Web Services: Achieving
Self-Healing Using Web Services, S. Gurguis, A.
Zeid, 2005
  • CBE-log is a representation format into which log
    files of all different applications can be
    converted
  • Diagnosis Engine selects a set of repair actions
  • The Symptoms Database is an XML-file containing
    symptoms and recovery actions
  • Rule Engine decides which repair actions should
    be taken based on the Policy Database
  • No prototype implemented

21
  • A typical record in the Symptom Database
    presented in the picture
  • Possible application: legacy systems

22
Reflection, Self-Awareness and Self-Healing in
OpenORB, G. Blair, G. Coulson, et al. 2002
  • OMG (Object Management Group)
  • An open membership, not-for-profit consortium
    that produces and maintains computer industry
    specifications for interoperable enterprise
    applications
  • OMG CORBA (Common Object Request Broker
    Architecture)
  • Open, vendor-independent architecture and
    infrastructure that computer applications use to
    work together over networks
  • Supports communication between different types of
    operating systems, programming languages and
    networks
  • Interfaces defined in OMG IDL (Interface
    Definition Language)
  • Mappings exist between IDL and C, C++, Java,
    COBOL, Smalltalk, Ada, Lisp, Python, and
    IDLscript
  • OpenORB
  • Provides a Java implementation of the OMG CORBA
    2.4.2 specification

23
Example, OMG IDL <-> C mappings

24
OpenORB self-healing
  • Meta-interface supports access to the underlying
    platform
  • OpenORB supports the ability to discover
    meta-information about the current system, both
    in terms of its structure and ongoing behaviour
  • System properties can also be adapted by using
    the appropriate meta-interfaces
  • Management component can be introduced
    (dynamically) into the various meta-space models
  • ??

25
Measuring the Effectiveness of Self-Healing
Autonomic Systems, A. Brown, C. Redlin, 2005
  • SPEC (Standard Performance Evaluation Corporation)
  • Non-profit corporation that maintains a
    standardized set of relevant benchmarks
    applicable to the newest generation of
    high-performance computers
  • SPEC jAppServer2004
  • Benchmark for measuring the performance of J2EE
    application servers
  • An end-to-end application which exercises all
    major J2EE technologies
  • Based on jAppServer2004 a benchmarking system was
    created that is capable of quantifying the
    autonomic self-healing capability of a
    large-scale J2EE software solution
  • The system is used in various production
    environments

26
The Architecture
  • 30 different types of disturbances representing
    common failure modes can be injected into the
    system under test (SUT)
  • Component shutdowns, data loss, resource
    exhaustion, load surges, operator errors, ...
  • Two metrics are used to evaluate the SUT's
    self-healing capability
  • How effectively the SUT heals itself
  • Basically measured by counting how many requests
    the jAppServer2004 gets right under a
    disturbance compared to normal working
    conditions
  • How autonomic the healing response is
  • A 90-question survey is used

27
The Survey
  • The 90-question survey assigns points to the SUT
    based on the level of automation present in its
    response to each disturbance (based on IBM's
    autonomic computing maturity model)
  • 0 points for a basic manual response, 1 point for
    a managed response, 2 for predictive, 4 for
    adaptive, and 8 for autonomic
  • ...Our baseline run on SUT 1 resulted in an
    average healing effectiveness score of 0.79 and
    an autonomic maturity score of 0.15 (both out of
    1.0), indicating a relatively low level of
    autonomic self-healing capability. In comparison,
    SUT 2 attained an effectiveness score of 0.83
    and a maturity score of 0.22. Comparing the two
    results indicates that SUT 2's system management
    technology provided a small, but
    measurable, improvement in autonomic capability...
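
The two metrics can be illustrated with a small sketch. The point values per maturity level come from the slide above. Everything else is an assumption made here: the slides do not give the exact formulas, so effectiveness is taken as the ratio of correctly served requests under a disturbance to those served under normal conditions, and the maturity score is normalized by the maximum of 8 points per question to land in the 0 to 1 range.

```python
# Points per maturity level, as listed on the slide.
MATURITY_POINTS = {
    "basic": 0,
    "managed": 1,
    "predictive": 2,
    "adaptive": 4,
    "autonomic": 8,
}


def healing_effectiveness(ok_under_disturbance, ok_normal):
    """Assumed formula: share of normal-condition throughput kept under disturbance."""
    if ok_normal <= 0:
        raise ValueError("baseline request count must be positive")
    return ok_under_disturbance / ok_normal


def maturity_score(answers):
    """Assumed normalization: total points divided by the maximum possible points."""
    if not answers:
        raise ValueError("at least one survey answer is required")
    total = sum(MATURITY_POINTS[level] for level in answers)
    return total / (MATURITY_POINTS["autonomic"] * len(answers))


if __name__ == "__main__":
    # Made-up numbers, only to show the calculation.
    print(healing_effectiveness(790, 1000))
    print(maturity_score(["basic", "managed", "autonomic"]))
```

Because the point scale doubles at each level, a few fully autonomic responses weigh more than many managed ones under this normalization.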

28
Personal Autonomic Computing Self-Healing Tool,
R. Sterritt, S. Chung, 2004
  • A self-healing tool consisting of pulse monitor
    and a health monitor
  • Used in PC-environment
  • Pulse Monitoring application (PBM) is a
    UDP-based peer-to-peer application which
  • Checks whether hosts are providing a heartbeat
    or not and
  • Indicates the health level of the system (state
    of processes)
  • Reboots a neighbor if no heartbeat is heard from
    it (security?)
  • Health Monitoring runs on a host and restarts a
    process on the same host if it is not responding
  • Combines three old concepts: watchdog processes,
    hello-mechanism, and remote control
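
The heartbeat check at the core of the pulse monitor can be sketched as below. This is only an illustration of the timeout logic, not the tool's implementation: the class name, the timeout value, and the host names are hypothetical, and the UDP transport and the reboot action are left out. Timestamps are passed in explicitly so the logic can be exercised without a network.

```python
class HeartbeatTracker:
    """Remember when each neighbour was last heard and report silent ones."""

    def __init__(self, timeout):
        self.timeout = timeout  # seconds of silence tolerated (assumed parameter)
        self.last_seen = {}     # host name -> time of the last heartbeat

    def record(self, host, now):
        """Call whenever a heartbeat message from `host` arrives."""
        self.last_seen[host] = now

    def silent_hosts(self, now):
        """Hosts whose last heartbeat is older than the timeout."""
        return sorted(host for host, seen in self.last_seen.items()
                      if now - seen > self.timeout)


if __name__ == "__main__":
    tracker = HeartbeatTracker(timeout=3.0)
    tracker.record("pc-a", now=0.0)
    tracker.record("pc-b", now=2.0)
    # At time 4.0 only pc-a has been silent for longer than the timeout.
    print(tracker.silent_hosts(now=4.0))
```

In a deployment the result of `silent_hosts` would drive the remote action (the reboot mentioned above), which is where the security question raised on the slide comes in.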

29
The Architecture
  • Pulse Monitor (Java) communicates with
    platform-specific Health Monitor (C) through JNI
  • Main monitor monitors Pulse monitor and Health
    monitor

30
Testing
  • A proof-of-concept prototype system was built on
    the MS Windows platform
  • Future topics: more autonomic functionality and
    supported platforms
  • Maybe useful when human administration not
    possible (sensor networks?)

31
Conclusions
  • Reinforcement Learning for Autonomic Network
    Repair
  • Learn autonomically the best sequence of actions
    to repair a network outage
  • Prototype implemented and tested (useful?)
  • Approaches to Building Self Healing Systems using
    Dependency Analysis
  • Determine the root-cause of downgraded
    performance and try to fix it
  • No testing, use 3. instead?
  • Ensembles of Models for Automated Diagnosis of
    System Performance Problems
  • Suitable (tested) system for (Hewlett Packard)
    server systems
  • Pinpoints causes of SLO violation
  • Towards Autonomic Web Services: Achieving
    Self-Healing Using Web Services
  • Autonomic web server healing system
  • No testing

32
Conclusions
  • Reflection, Self-Awareness and Self-Healing in
    OpenORB
  • ?
  • Measuring the Effectiveness of Self-Healing
    Autonomic Systems
  • Suitable system for J2EE server systems
  • Provides users with a quantitative way to measure
    the self-healing capability of their IT systems
  • Implemented and in use
  • Personal Autonomic Computing Self-Healing Tool
  • Enables a group of PCs to monitor the health of
    each other
  • Applications?
  • Prototype implemented
  • Overall much discussion about server self-healing