System Models for Problem Determination

1
System Models for Problem Determination
  • Michael Jiang, Mohammad A. Munawar, Kevin Quan,
    Paul A.S. Ward

Department of Electrical and Computer Engineering
University of Waterloo
April 26, 2006
2
Outline
  • Introduction
  • Context
  • Background
  • Related Work
  • The Problem
  • Challenges
  • Contributions
  • Prototype Adaptive Monitoring Tool
  • What Comes Next?
  • Summary

3
Introduction
  • Enterprise information systems have high
    reliability requirements
  • These systems are getting larger and more complex
  • Defects cannot be completely eliminated; failures
    can be very costly
  • Managing these systems is hard
  • Componentization and availability of information
  • More human resources needed, but the ones with
    required abilities are in short supply
  • Many duties of systems administrators depend on
    systems monitoring
  • State of the art in systems monitoring: manual;
    tools help
  • Slow response: impact on availability
  • Error-prone
  • Monitor everything all the time?
  • Impractical and unnecessary
  • Overhead involved: measurement, storage,
    communication, computation

4
Introduction
  • Intelligent Monitoring
  • Automatically adapt monitoring to match
    prevailing condition
  • Motivation
  • Reduce human involvement in monitoring
  • Collect only that which is needed
  • Reduce impact on performance and other overheads
  • Importance
  • Software systems will inevitably get larger and
    more complex; self-managed systems require
    intelligent monitoring
  • Benefits
  • Human resources are free for more important,
    more-complex tasks
  • Less pertinent information lost
  • System can perform as close as possible to its
    unmonitored version

5
Background
  • Component-based software systems
  • Made of re-usable/pluggable parts with well-defined
    boundaries; for our purposes, a component's
    internal implementation is not known
  • Examples: COM/DCOM, CORBA, J2EE, .NET

6
Background
  • Software systems based on Java 2 Platform,
    Enterprise Edition (J2EE)

7
Background
  • Monitoring a J2EE-based system

8
Related Work
  • Monitoring large-scale systems
  • Academic research: NetLogger, Astrolabe, Ganglia
  • Goals: efficiency, scalability, robustness,
    flexibility
  • Techniques used: binary formats, gossip,
    multicast, mobile code
  • Solutions from industry: Tivoli, OpenView,
    eBay SuperCall
  • Intelligent summaries and visual support,
    end-to-end monitoring, storage and analysis of
    data, threshold-based triggers
  • Modeling
  • Black-box: know nothing about internals
  • e.g., time-series modeling, statistical learning,
    machine learning
  • Non-black-box: know something about internals
  • Application emulators, queuing models, Petri nets
    black-box

9
Related Work
  • Applying modeling for problem determination in
    enterprise systems
  • Using access logs
  • Based on page hit counts and errors
  • Learn and use normal access patterns with
    chi-square tests and naïve Bayes models
  • Using traces (request-paths)
  • Model application execution flow using PCFGs
  • Model component interactions and test using the
    chi-square test
  • Diagnosis using clustering and decision trees
  • Using aggregate metrics
  • Use application structure, with averages under
    normal conditions as anomaly thresholds
  • Correlating low-level metrics with SLO violations
    using Bayesian networks
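
To make the chi-square idea on this slide concrete, here is a minimal sketch that compares observed page-hit counts in a window against a learned "normal" access profile. The page names, the profile, and the function name are assumptions made for illustration; the cited work may compute and apply the test differently.

```python
# Sketch only: Pearson chi-square statistic for page-hit counts versus a
# learned "normal" access profile. All names and numbers are hypothetical.

def chi_square_statistic(observed, expected_probs):
    """Return sum((O - E)^2 / E) over all pages in the normal profile."""
    total = sum(observed.values())
    stat = 0.0
    for page, prob in expected_probs.items():
        expected = total * prob
        diff = observed.get(page, 0) - expected
        stat += diff * diff / expected
    return stat

# Assumed normal profile: fraction of hits per page under normal load.
normal_profile = {"home": 0.5, "search": 0.3, "checkout": 0.2}

# Standard critical value for 2 degrees of freedom at the 5% level.
CRITICAL_5PCT_DF2 = 5.991

typical_window = {"home": 50, "search": 30, "checkout": 20}
skewed_window = {"home": 20, "search": 30, "checkout": 50}

for name, window in [("typical", typical_window), ("skewed", skewed_window)]:
    stat = chi_square_statistic(window, normal_profile)
    print(name, round(stat, 2), "anomalous" if stat > CRITICAL_5PCT_DF2 else "normal")
```

A window whose statistic exceeds the critical value is flagged as deviating from the normal access pattern; the degrees of freedom are the number of pages minus one.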

10
Related Work
  • Adaptive monitoring
  • Extensible OS
  • Database table statistics
  • JFluid: dynamic instrumentation of call graphs
    in Java programs
  • Moss: adaptive performance monitoring
  • Measurement overhead ignored
  • No attempt at characterizing normal behaviour;
    need to set thresholds
  • Active probing
  • Incrementally find the smallest set of probes
    (tests) that can determine the system's state

11
Related Work
  • [Figure: related work classified by what is monitored and how
    it is monitored, on a scale from high-level to low-level]
  • Figure labels: Bayesian models and information theory;
    diagnosis; decision trees; data clustering; visualization
    tools; manual diagnosis; monitoring; Bayesian models (naïve
    Bayes); overhead; statistical learning (e.g., chi-square);
    anomaly detection; mean and variance; manually-set thresholds
12
The Problem
  • Need to adapt monitoring
  • Need a framework to do so
  • How do we model system for a generic framework?
  • What is a good model for problem determination?

13
Challenges
  • Characterizing normal behaviour
  • What?
  • Monitoring information
  • System itself (changes)
  • When?
  • Initially
  • Incremental updating
  • How?
  • What techniques to use
  • Leverage work already done
  • Seek new applications of available techniques
  • Emergent behaviour
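
One way to picture "initially" plus "incremental updating" is a running baseline that is learned from early samples and refined with each new one. The sketch below uses Welford's online mean and variance with a k-sigma check; this is an illustrative choice, not a technique the slides commit to, and the class name, sample values, and k are assumptions.

```python
import math

class RunningBaseline:
    """Incrementally maintained mean and variance (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new sample into the baseline."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def stddev(self):
        """Sample standard deviation, or 0.0 with fewer than two samples."""
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_anomalous(self, x, k=3.0):
        """True if x lies more than k standard deviations from the mean."""
        sd = self.stddev()
        return sd > 0 and abs(x - self.mean) > k * sd

# Hypothetical response times (ms) observed under normal conditions.
baseline = RunningBaseline()
for sample in [10, 12, 11, 9, 10, 11, 10, 9]:
    baseline.update(sample)

print(baseline.is_anomalous(30))    # far outside the learned range
print(baseline.is_anomalous(10.5))  # within the learned range
```

Such a baseline does not address drift or system changes by itself; those would need a forgetting or relearning policy on top.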

14
Challenges
  • Minimum Monitoring
  • Most useful, lowest-cost information
  • Adapting Monitoring
  • What warrants collection of more data?
  • Detecting anomalies in the collected data
  • Fusing anomaly-related information
  • When should we stop collecting more data?

15
Challenges
  • Basis for selecting information sources
  • Information gain
  • Reference characterization
  • Overhead involved
  • Prior knowledge
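
The criteria above can be combined in many ways. As a minimal sketch, the code below ranks candidate information sources by information gain per unit of overhead, using a small labeled history. The source names, the discretized readings, the cost figures, and the gain-over-cost ratio are all assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(readings, labels):
    """Reduction in label entropy after splitting on the readings."""
    n = len(labels)
    remainder = 0.0
    for value in set(readings):
        subset = [lab for r, lab in zip(readings, labels) if r == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def rank_sources(history, labels, cost):
    """Order sources by information gain divided by monitoring cost."""
    scores = {name: information_gain(vals, labels) / cost[name]
              for name, vals in history.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical history: system state and discretized source readings.
labels = ["ok", "ok", "fault", "fault"]
history = {
    "servlet_errors": ["low", "low", "high", "high"],
    "method_trace": ["low", "low", "high", "high"],
    "jvm_heap": ["low", "high", "low", "high"],
}
cost = {"servlet_errors": 1.0, "method_trace": 5.0, "jvm_heap": 1.0}

print(rank_sources(history, labels, cost))
```

Here two sources are equally informative, so the cheaper one ranks first, and an uninformative source ranks last regardless of its cost.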

16
Challenges
  • Algorithms for adaptation: what drives adaptation
    of monitoring?
  • Dependency structure: top-down approach
  • Explicit
  • Implicit (Inferred)
  • No dependency structure
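
When an explicit dependency structure is available, a top-down approach could look roughly like the sketch below: start at a component that shows an anomaly and descend only into dependencies that also look anomalous. The component names, the graph, and the anomaly predicate are hypothetical, and the graph is assumed to be acyclic.

```python
def localize(component, depends_on, is_anomalous):
    """Return the deepest anomalous components reachable from `component`.

    depends_on: dict mapping a component to the components it calls.
    is_anomalous: callable deciding whether a component looks abnormal.
    Assumes the dependency graph has no cycles.
    """
    suspects = []
    for child in depends_on.get(component, []):
        if is_anomalous(child):
            suspects.extend(localize(child, depends_on, is_anomalous))
    # Drop duplicates from shared dependencies while keeping order; if no
    # dependency is anomalous, the component itself is the suspect.
    return list(dict.fromkeys(suspects)) or [component]

# Hypothetical explicit dependency structure of a small J2EE application.
depends_on = {
    "OrderServlet": ["OrderEJB", "CatalogEJB"],
    "OrderEJB": ["JDBCPool"],
    "CatalogEJB": ["JDBCPool"],
}
anomalous_now = {"OrderServlet", "OrderEJB"}

print(localize("OrderServlet", depends_on, lambda c: c in anomalous_now))
```

Only the branches that show anomalies need extra monitoring, which is what keeps the approach cheaper than instrumenting every component.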

17
Contributions
  • Design and implementation of an adaptive
    monitoring framework
  • Study of how to perform adaptation of monitoring
  • Basis, algorithms, system modeling and anomaly
    detection
  • Research new ways of applying existing modeling
    techniques
  • Validation by applying it to a J2EE-based
    software system
  • Demonstration of the generality of the framework
  • What's new?
  • Impact-aware monitoring
  • Integrated monitoring
  • More comprehensive: performance is only one
    aspect, albeit an important one
  • Most previous work has considered monitoring
    using a fixed set of information sources

18
Prototype
  • Applied the framework to a J2EE-based testbed
  • Testbed: IBM WebSphere Application Server, DB2 UDB,
    custom workload generators
  • Use J2EE benchmark applications such as TPC-W,
    SPECjAppServer2004, Trade, RUBiS
  • Use synthetic workload
  • Simulate anomalies
  • Software defects
  • Delay in servlets and EJBs
  • Exceptions in same
  • External faults
  • Drop connections/data between WAS and DB2
  • CPU hogs

19
Prototype
  • Correlation-based approach
  • Modeling single variables is difficult
  • Drift
  • No black-box connection to problems
  • Why not consider pairs?
  • Takes out some non-linearity
  • Allows black-box problem determination
  • Initially
  • Collect all metrics
  • Find correlated pairs from known subsystem-level
    relationships
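
A minimal sketch of the initial step, assuming Pearson correlation as the measure and a fixed threshold: collect all metrics, then keep the pairs that move together. The metric names, sample values, and threshold are invented for illustration; the prototype restricts candidates to known subsystem-level relationships, which this sketch does not model.

```python
import math
from itertools import combinations
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(vx * vy)

def correlated_pairs(metrics, threshold=0.9):
    """Return {(a, b): r} for metric pairs with |r| at or above threshold."""
    pairs = {}
    for a, b in combinations(sorted(metrics), 2):
        r = pearson(metrics[a], metrics[b])
        if abs(r) >= threshold:
            pairs[(a, b)] = r
    return pairs

# Hypothetical samples collected while the system behaves normally.
metrics = {
    "servlet_requests": [10, 20, 30, 40, 50],
    "jdbc_queries": [21, 39, 62, 80, 101],
    "gc_count": [3, 1, 4, 1, 5],
}

print(correlated_pairs(metrics))
```

A pair that is strongly correlated under normal operation but stops being so later is a candidate signal that something between the two measurement points has changed.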

20
Prototype
  • Example of a correlated metric pair

21
Prototype
  • Operation
  • Monitor a minimal vector of metrics
  • Response time and errors in user-accessible
    servlets
  • When an anomaly is detected, increase monitoring
    level so as to have pairs to analyze
  • Stop when the source of the problem is found or the
    monitoring level reaches its maximum; otherwise,
    increase the number of pairs
  • Problems
  • Error propagation
  • Determination of thresholds
  • Knowledge decay
  • Anomaly corroboration
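
The operation described on this slide can be sketched as a simple escalation loop. This is an illustration under assumptions, not the prototype's code: the level contents, the `collect` and `find_source` callables, and the metric names are all hypothetical.

```python
def diagnose(levels, collect, find_source):
    """Escalate monitoring until a source is found or levels run out.

    levels: list of metric-pair lists, cheapest level first.
    collect: callable returning {pair: is_broken} for the enabled pairs.
    find_source: callable mapping those observations to a suspect or None.
    Returns (suspect_or_None, last_level_reached).
    """
    enabled = []
    for level, pairs in enumerate(levels, start=1):
        enabled.extend(pairs)  # increase the number of monitored pairs
        suspect = find_source(collect(enabled))
        if suspect is not None:
            return suspect, level  # stop: source of the problem found
    return None, len(levels)  # stop: monitoring level reached its maximum

# Hypothetical levels and a simulated fault between the EJB and database tiers.
levels = [
    [("servlet_rt", "ejb_calls")],
    [("ejb_calls", "jdbc_queries")],
]
broken = {("ejb_calls", "jdbc_queries")}

def collect(pairs):
    return {pair: pair in broken for pair in pairs}

def find_source(observations):
    bad = [pair for pair, is_broken in observations.items() if is_broken]
    return bad[0] if bad else None

print(diagnose(levels, collect, find_source))
```

The listed problems (error propagation, thresholds, knowledge decay, corroboration) all live inside `collect` and `find_source`, which this sketch deliberately leaves trivial.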

22
Prototype
  • Problems
  • Error propagation
  • Determination of thresholds
  • Knowledge decay
  • Anomaly corroboration
  • False positives
  • False negatives

23
What Comes Next?
  • Moved/moving to WAS 6
  • Relevance: fine-grained monitoring
  • Other modeling techniques
  • but still correlation-based between subsystems
  • Other monitored data sources
  • False positives
  • Need to determine whether we should:
  • Relearn system parameters
  • Do nothing
  • Fault injections
  • Workloads

24
Summary
  • Correlation is invaluable for problem
    determination
  • Can be used in a semi-black-box approach

25
Questions?