Statistical Learning, Inference and Control (HP Labs) and the challenges of applying it to the Grid

1
  • Statistical Learning, Inference and Control (HP
    Labs) and the challenges of applying it to the
    Grid
  • George Tsouloupas
  • High Performance Computing systems Laboratory,
    CS Dept., UCY
  • Graphics taken from a presentation at the ACM
    Symposium on Operating Systems Principles (Oct
    2005) by Ira Cohen, Moises Goldszmidt, Julie
    Symons, Terence Kelly (HP Labs) and Steve Zhang,
    Armando Fox (Stanford University)

2
Outline
  • Statistical Learning, Inference and Control (HP)
  • The project
  • The approach
  • On to the Grid
  • Grid Service Levels
  • How is the Grid Different?
  • Potential approaches

3
Statistical Learning, Inference and Control (HP)
  • Correlating instrumentation data to system
    states: A building block for automated diagnosis
    and control (2004)
  • Capturing, Indexing, Clustering, and Retrieving
    System History (2005)
  • Short term performance forecasting in enterprise
    systems (2005)
  • Three research challenges at the intersection of
    machine learning, statistical induction, and
    systems (2005)
  • Ensembles of models for automated diagnosis of
    system performance problems (2005)

4
SLIC-people
  • Team
  • Moises Goldszmidt
  • Ira Cohen
  • Julie Symons
  • Nelson Lee (Stanford University, 2005 Intern)
  • Rob Powers (Stanford University, Intern)
  • Steve Zhang (Stanford University, Intern)
  • Dawn Banard (Duke University, 2003 Summer Intern)

5
SLIC-people
  • Collaborators
  • Terence Kelly, HP Labs
  • Kim Keeton, HP Labs
  • Jeff Chase, Duke University
  • George Forman, HP Labs
  • Armando Fox, Stanford University
  • Fabio Cozman, University of Sao Paulo

6
SLIC - my understanding in a single slide
  • Gather state from the machine
  • low-level system attributes
  • high-level performance metrics (SLO)
  • Statistically associate (a subset of) the
    system-level metrics with Service Levels and then
    classify violations
  • (opt.) Continuously maintain/update the model
  • If past violations are annotated, we can even
    take action based on past knowledge (see the
    sketch below)
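A minimal sketch of that loop in Python. Everything here (the collect_metrics/measure_slo helpers, the model interface and the 5-minute interval) is a hypothetical placeholder, not part of the SLIC tooling:

```python
import time

def monitoring_loop(collect_metrics, measure_slo, model, interval_s=300):
    """Skeleton of the SLIC-style loop: gather low-level metrics,
    check the high-level SLO, and keep the classifier up to date."""
    history = []                      # (metrics, slo_ok) pairs
    while True:
        metrics = collect_metrics()   # dict of system-level attributes
        slo_ok = measure_slo()        # True = compliance, False = violation
        history.append((metrics, slo_ok))

        # (opt.) continuously maintain/update the model on recent data
        model.update(history[-1000:])

        # classify: does the current state look like a violation?
        if model.predicts_violation(metrics):
            print("predicted SLO violation; consult annotated past cases")
        time.sleep(interval_s)
```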

7
Correlating instrumentation data to system states
  • Two approaches to self-managing systems
  • A priori models of system structure and behavior
    -- event-condition-action rules
  • difficult/costly to build
  • incomplete/inaccurate
  • fall apart when faced with unanticipated
    conditions
  • Apply statistical learning techniques to
    automatically induce the models.
  • efficient, accurate and robust techniques are
    still an open issue

8
Aims
  • Investigate the effectiveness and practicality of
    Tree-Augmented Naive Bayesian Networks (TANs) for
    performance diagnosis/forecasting from
    system-level instrumentation
  • Construct an analysis engine (a minimal sketch
    follows this list)
  • induce a classifier
  • predict whether the system is (or will be, over
    some interval) in compliance with the SLO
  • based on the values of the collected metrics
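A minimal stand-in for such an analysis engine, assuming the collected metrics sit in a NumPy array X (one row per sample) with SLO labels in y. scikit-learn has no TAN implementation, so a plain Gaussian naive Bayes classifier plus a heuristic metric-subset selection is used purely for illustration; the data is synthetic.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# X: one row of system-level metrics per sample; y: 1 = SLO violation, 0 = compliance
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))            # placeholder metric values
y = (X[:, 3] + X[:, 7] > 1).astype(int)   # placeholder SLO labels

# heuristically select a subset of the metrics, then induce the classifier
engine = make_pipeline(SelectKBest(f_classif, k=5), GaussianNB())
engine.fit(X[:400], y[:400])

# predict whether new samples are in compliance with the SLO
print(engine.predict(X[400:]))
```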

9
Setup
10
Keep the raw values?
11
Service Level Objectives
12
SLO compliance/violation
(Figure: P(SLO, M), the joint distribution of the SLO state and the collected metrics M)
13
Classification
  • Basically, it is a pattern classification problem
    in supervised learning
  • Balanced Accuracy (see the sketch below)
  • Average of the probability of identifying
    compliance and the probability of identifying
    violation
  • Why Bayesian nets?
  • performance
  • interpretability
  • modifiability (readily accept expert knowledge
    into a model)
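Balanced accuracy restated as a few lines of Python (just the definition above, not code from the papers):

```python
def balanced_accuracy(y_true, y_pred):
    """Average of P(identify compliance | compliance) and
    P(identify violation | violation); labels: 0 = compliance, 1 = violation."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    violations = sum(1 for t in y_true if t == 1)
    compliant = len(y_true) - violations
    return 0.5 * (tn / compliant + tp / violations)

print(balanced_accuracy([0, 0, 1, 1], [0, 1, 1, 1]))  # 0.75
```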

14
Bayesian Networks
  • Inducing an optimal Bayesian Network is
    intractable
  • restrict it to a TAN
  • heuristically select a subset of the metrics and
    find the optimal TAN classifier.
  • The model approximates a probability distribution
    P(S, M) over the SLO state S and the metrics M
  • A decision model based on what is more likely:
    flag a violation when
    P(S = violation | M) > P(S = compliance | M)

15
Bayesian Network Example
16
Metric attribution
17
Metric Attribution
  • Actual values may be quite different, but the
    relation to normal is the same (see the sketch
    below)

(Figure: values of the metric gbl_app_alive_proc)
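A minimal sketch of the attribution idea, assuming independent per-metric Gaussian class-conditional models (the papers use the TAN's conditional distributions; independent Gaussians just keep the example short). A metric is implicated when its observed value is more likely under the violation model than under the compliance model; all numbers below are made up.

```python
from math import log, pi

def gauss_loglik(x, mean, std):
    """Log-likelihood of x under a Gaussian N(mean, std^2)."""
    return -0.5 * log(2 * pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def attributed_metrics(sample, violation_params, compliance_params):
    """Return the metrics whose value is better explained by the violation model.
    *_params: {metric_name: (mean, std)} fitted per class."""
    implicated = []
    for name, x in sample.items():
        if gauss_loglik(x, *violation_params[name]) > gauss_loglik(x, *compliance_params[name]):
            implicated.append(name)
    return implicated

# gbl_app_alive_proc looks "violation-like" even though its absolute value is unremarkable
print(attributed_metrics(
    {"gbl_app_alive_proc": 3.0, "cpu_util": 0.4},
    {"gbl_app_alive_proc": (2.5, 1.0), "cpu_util": (0.9, 0.05)},
    {"gbl_app_alive_proc": (10.0, 2.0), "cpu_util": (0.5, 0.2)},
))  # -> ['gbl_app_alive_proc']
```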
18
Creating and using signatures
Leveraging prior diagnosis efforts
(Architecture diagram: metrics/SLO monitoring of the monitored service feeds a
signature construction engine, which stores signatures in a signature DB; a
retrieval engine and a clustering engine work over the DB to identify intensity
and recurrence and to provide syndromes; the admin provides annotations in free
form. A construction/retrieval sketch follows.)
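A minimal sketch of how I read the signature idea from the 2005 system-history paper (not its actual code): encode each epoch as a vector over the metrics, +1 if the metric was attributed to the violation and -1 if not, then retrieve similar past epochs by distance so the admin's free-form annotations can be reused. Names and the distance measure here are illustrative.

```python
def make_signature(all_metrics, implicated):
    """+1 for attributed metrics, -1 for the rest."""
    return tuple(1 if m in implicated else -1 for m in all_metrics)

def retrieve_similar(signature, signature_db, k=3):
    """signature_db: list of (signature, annotation) saved from past diagnoses."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return sorted(signature_db, key=lambda entry: distance(entry[0], signature))[:k]

metrics = ["gbl_app_alive_proc", "cpu_util", "disk_io"]
db = [(make_signature(metrics, {"gbl_app_alive_proc"}), "app server hung"),
      (make_signature(metrics, {"cpu_util", "disk_io"}), "backup job overload")]
query = make_signature(metrics, {"gbl_app_alive_proc", "disk_io"})
print(retrieve_similar(query, db, k=1))  # closest past syndrome and its annotation
```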
19
Incremental work
  • Ensembles of models
  • Changing workloads
  • External disturbances
  • Short term performance forecasting
  • Automate assignment of resources
  • Transferable models?
  • Capturing, Indexing, Clustering and Retrieving
    system history
  • Find similar events in the past.

20
Ensemble Algorithm Outline
  • Size of window is a parameter
  • Controls rate of adaptation
  • Larger windows produce better models (up to a
    point)
  • Typically 6-12 hrs of data

(Flowchart annotations: a new sample arrives every 1-5 minutes; inducing a new
model takes 2-3 secs; evaluation costs about 1 ms per model, a few ms for 100s
of models; roughly 20 new models per month. A sketch of the loop follows.)
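A rough Python sketch of how I read the ensemble outline; the window sizes and costs above are properties of the HP system, while the code below (including the majority vote and the train_model/predicts_violation interface) is only an illustration.

```python
class Ensemble:
    """Ensemble of classifiers, each induced from one window of recent samples."""

    def __init__(self, window_size, train_model):
        self.window_size = window_size    # e.g. 6-12 hours of 1-5 minute samples
        self.train_model = train_model    # induces one classifier from a window
        self.window = []
        self.models = []

    def add_sample(self, metrics, slo_ok):
        self.window.append((metrics, slo_ok))
        # induce a new model per full window; the real system adds one only
        # when the existing models stop fitting (roughly 20 per month)
        if len(self.window) == self.window_size:
            self.models.append(self.train_model(self.window))
            self.window = []

    def predict_violation(self, metrics):
        if not self.models:
            return False
        # evaluating every model is cheap (~1 ms each); a simple majority vote
        # here, whereas the paper combines models by how well they fit recent data
        votes = sum(1 for m in self.models if m.predicts_violation(metrics))
        return votes > len(self.models) / 2
```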
21
Challenges going forward
  • Design procedures/algorithms to continuously test
    the validity of induced models
  • How to interact with human operators for
    feedback, identification of false positives and
    missed problems
  • How do you maintain a long-term indexable history
    of system state?

22
  • On to the Grid

23
Grid Service Levels
  • Grid policies and Service Level Agreements
  • Current service agreements
  • EGEE
  • N CPUs available
  • x% of processing time committed to a specific VO
  • Several works by different groups, but nothing of
    wide acceptance yet (?)
  • Is a notion of Service Level already fused into
    the scheduling policy of the Resource Broker?
    e.g. rank = -other.GlueCEStateEstimatedResponseTime

24
Setting the expected service level
  • Indirect / by instrumentation
  • Job throughput / instrumented Resource Brokers
  • Job performance
  • Profile the different applications
  • Not applicable to many applications
  • Explicit measurement of performance
  • Perform controlled experiments
  • Monitored/filtered
  • Obtain statistically valid measurements (see the
    sketch below)
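One way to read "statistically valid measurements": repeat a controlled benchmark, filter out runs that monitoring shows were perturbed, and report a confidence interval rather than a single number. A minimal sketch; the filtering flag, the normality assumption and the numbers are placeholders.

```python
from statistics import mean, stdev

def summarize_runs(runtimes_s, unperturbed):
    """runtimes_s: repeated benchmark timings (seconds); unperturbed: True where
    monitoring showed the node was otherwise idle, so the run is kept."""
    clean = [t for t, ok in zip(runtimes_s, unperturbed) if ok]
    m, s = mean(clean), stdev(clean)
    half_width = 1.96 * s / len(clean) ** 0.5   # ~95% CI, assuming normality
    return m, (m - half_width, m + half_width)

print(summarize_runs([102, 99, 140, 101, 98, 103],
                     [True, True, False, True, True, True]))
```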

25
How is the Grid Different?
  • Scale! (a 3-tier web application vs. the Grid)
  • But it really depends on the level of abstraction
    at which we consider the Grid
  • Monitoring
  • On the Grid there is no global view
  • Expensive and prohibitive for system-level
    measurements (how do you monitor 30,000 CPUs in
    200 sites?)
  • From CEs? RBs? BDIIs? RLSs? ...
  • What is system-level monitoring on the Grid?
  • Network traffic, queue lengths, RB metrics
    (instrumented/log monitoring), service response
    times, ...

26
How is the Grid Different? (2)
  • Volatility
  • Resources enter and leave the Grid
  • Internal changes - HW/MW/Configuration
  • Virtualization
  • Two consecutive measurements could be measuring
    two entirely different things
  • Does this have serious impact?
  • Statistics to the rescue?
  • Does system-level monitoring make any sense when
    faced with virtualized resources?

27
How is the Grid Different? (3)
  • End-to-end approach
  • VO service levels
  • Policies, Quotas etc.
  • in the absence of system-level measurements (i.e.
    monitoring)
  • potentially huge delays (relatively) in obtaining
    measurements through running jobs
  • which could also be the subject of an SLA
  • If performance is monitored through regular
    benchmarking jobs, only free CPUs are really
    monitored (see Virtualization)

28
How is the Grid Different? (4)
  • Heterogeneity (spatial/temporal)
  • Root cause analysis
  • More difficult on the Grid -- complexity
  • Sparse measurements
  • Because measurements are costly.
  • Minimize costs (and operate on a budget.)
  • What to measure and when to measure it becomes
    important.

29
Sketch of a potential solution
  • A resource enters the Grid and goes through a
    phase where its SL is established
  • Basic performance metrics are taken and labeled
    as "expected", OR
  • Some SLO is agreed upon
  • A service periodically (and intelligently)
    obtains these metrics (sampling) and establishes
    whether they meet the SL (see the sketch below)
  • Via a threshold, mean squared error, goodness of
    fit, ...
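A minimal sketch of that periodic check, assuming the expected service level was recorded as reference metric values when the resource entered the Grid. The relative-MSE criterion, the tolerance and the benchmark names are placeholders; the slide equally allows a simple threshold or a goodness-of-fit test.

```python
def meets_service_level(expected, observed, tolerance=0.15):
    """expected/observed: {metric_name: value} from registration vs. the latest
    sampling round; flag the resource if the mean squared relative deviation
    from the expected values exceeds the tolerance."""
    errors = [((observed[k] - v) / v) ** 2 for k, v in expected.items()]
    return sum(errors) / len(errors) <= tolerance ** 2

expected = {"hpl_gflops": 4.2, "stream_mb_s": 1800.0, "io_mb_s": 55.0}
observed = {"hpl_gflops": 4.0, "stream_mb_s": 1750.0, "io_mb_s": 40.0}
print(meets_service_level(expected, observed))   # False: I/O dropped ~27%
```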

30
Resource Auditing
  • Define auditing as the evaluation of the
    conformance of a site to certain commitments
    (SLOs)
  • Mainly in terms of performance and dependability
  • Accounting vs. Auditing
  • What is the Service Level Objective?
  • How is it expressed?
  • Throughput? Dependability?
  • Behavior under a set of synthetic tests?
  • How is it measured?

31
Alternatively
  • Apply an approach similar to the SLIC project's
    and establish a relationship between
  • Monitoring at a higher level, but still at a
    Grid-wide system level
  • GSTAT
  • RGMA
  • SLO compliance monitoring through the Resource
    Broker (log watchers)
  • Possibly include the ticketing system database to
    identify actual violations diagnosed by human
    operators.

32
Yet alternatively
  • Apply the SLIC approach to Grid subsystems
    independently, e.g.
  • collect system-level metrics from the CE and
    determine the SLOs for a CE
  • collect system-level metrics from the RB, BDII
    and RLS and correlate them with an RB SLO
  • Possibly consider a hierarchical approach (see
    the sketch below)
  • low-level Service Levels make up the system
    metrics for an upper level. Perhaps relay only
    attributed metrics?
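A rough sketch of the hierarchical idea. Everything below, including the per-subsystem classifier interface and the feature naming, is hypothetical: each subsystem classifier turns its own system-level metrics into an SLO state plus attributed metrics, and only those are relayed upward as inputs to a Grid-level model.

```python
def grid_level_features(subsystems, samples):
    """subsystems: {name: classifier with predict(metrics) -> (slo_ok, attributed)}
    samples: {name: metrics dict for that subsystem}.
    Relay only the SLO state and the attributed metrics to the upper level."""
    features = {}
    for name, clf in subsystems.items():
        slo_ok, attributed = clf.predict(samples[name])
        features[f"{name}_slo_ok"] = int(slo_ok)
        for metric in attributed:
            features[f"{name}_{metric}"] = samples[name][metric]
    return features   # becomes the metric vector for a Grid-level classifier
```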

33
Conclusion
  • Direct application of the SLIC project's mindset
    in a Grid setting requires that
  • we establish what a Grid SLO is, and what
    compliance and violation mean
  • we have readily available system-level
    performance metrics
  • Difficult to determine Grid SLOs
  • Job throughput? For which type of job? Is it the
    job's fault?
  • The subsystem approach may be more doable

34
  • Thank you.
  • Answers? :-)