Title: Statistical Learning, Inference and Control (HP Labs) and the challenges of applying it to the Grid
1 - Statistical Learning, Inference and Control (HP Labs) and the challenges of applying it to the Grid
- George Tsouloupas
- High Performance Computing Systems Laboratory
- CS Dept., UCY
- Graphics taken from a presentation at the ACM Symposium on Operating Systems Principles (Oct 2005)
- by Ira Cohen, Moises Goldszmidt, Julie Symons, Terence Kelly (HP Labs) and Steve Zhang, Armando Fox (Stanford University)
2 - Outline
- Statistical Learning, Inference and Control (HP)
- The project
- The approach
- On to the Grid
- Grid Service Levels
- How is the Grid Different
- Potential approaches
3 - Statistical Learning, Inference and Control (HP)
- Correlating instrumentation data to system states: A building block for automated diagnosis and control (2004)
- Capturing, Indexing, Clustering, and Retrieving System History (2005)
- Short term performance forecasting in enterprise systems (2005)
- Three research challenges at the intersection of machine learning, statistical induction, and systems (2005)
- Ensembles of models for automated diagnosis of system performance problems (2005)
4 - SLIC people
- Team
- Moises Goldszmidt
- Ira Cohen
- Julie Symons
- Nelson Lee (Stanford University, 2005 Intern)
- Rob Powers (Stanford University, Intern)
- Steve Zhang (Stanford University, Intern)
- Dawn Banard (Duke University, 2003 Summer Intern)
5 - SLIC people
- Collaborators
- Terence Kelly, HP Labs
- Kim Keeton, HP Labs
- Jeff Chase, Duke University
- George Forman, HP Labs
- Armando Fox, Stanford University
- Fabio Cozman, University of Sao Paulo
6 - SLIC: my understanding in a single slide
- Gather state from the machine
- low-level system attributes
- high-level performance metrics (SLO)
- Statistically associate (a subset of) the system-level metrics with Service Levels and then classify violations (see the sketch below)
- (opt.) Continuously maintain/update the model
- If past violations are annotated, we can even take action based on past knowledge.
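To make this concrete, here is a hedged toy sketch of the data the approach operates on: each row pairs a vector of low-level metrics M with a high-level SLO label S. The metric names, values and the 4-second response-time objective are assumptions for illustration, not values from the SLIC papers.

```python
import pandas as pd

# Hypothetical raw measurements: one row per collection interval.
# Metric names and the 4-second SLO threshold are illustrative only.
raw = pd.DataFrame({
    "cpu_util":        [0.42, 0.91, 0.37, 0.95],
    "disk_io_rate":    [120,  480,  110,  510],
    "alive_processes": [35,   34,   36,   22],
    "avg_resp_time_s": [1.2,  5.8,  0.9,  7.4],   # high-level SLO metric
})

SLO_THRESHOLD_S = 4.0  # assumed service-level objective on response time

# S = 1 when the interval violates the SLO, 0 when it complies.
raw["slo_violation"] = (raw["avg_resp_time_s"] > SLO_THRESHOLD_S).astype(int)

# M = the low-level metrics a classifier will be trained on.
metrics = raw[["cpu_util", "disk_io_rate", "alive_processes"]]
labels = raw["slo_violation"]
print(metrics, labels, sep="\n")
```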
7 - Correlating instrumentation data to system states
- Two approaches to self-managing systems
- A priori models of system structure and behavior (event-condition-action rules)
- difficult/costly to build
- incomplete/inaccurate
- fall apart when faced with unanticipated conditions
- Apply statistical learning techniques to automatically induce the models
- efficient, accurate and robust techniques are still an open issue
8 - Aims
- Investigate the effectiveness and practicality of Tree-Augmented Naive Bayesian networks (TANs)
- for performance diagnosis/forecasting from system-level instrumentation
- Construct an analysis engine
- induce a classifier
- predict whether the system is (or will be, over some interval) in compliance with the SLO
- based on the values of the collected metrics
9 - Setup
10 - Keep the raw values?
11 - Service Level Objectives
12 - SLO compliance/violation
P(SLO, M)
13 - Classification
- Basically, it is a pattern classification problem in supervised learning
- Balanced Accuracy (see the sketch below)
- the average of the probability of identifying compliance and the probability of identifying violation
- Why Bayesian nets?
- performance
- interpretability
- modifiability (readily accept expert knowledge into a model)
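Because violations are usually much rarer than compliant intervals, plain accuracy would reward a classifier that always predicts compliance; balanced accuracy averages the two per-class detection rates instead. A minimal self-contained computation (the labels are invented):

```python
def balanced_accuracy(y_true, y_pred):
    """Average of the detection rates for compliance (0) and violation (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_violation = sum(1 for t in y_true if t == 1)
    n_compliance = sum(1 for t in y_true if t == 0)
    return 0.5 * (tp / n_violation + tn / n_compliance)

# Example: 4 compliant and 2 violating intervals, one violation missed.
print(balanced_accuracy([0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 1, 0]))  # 0.75
```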
14 - Bayesian Networks
- Inducing an optimal Bayesian network is intractable
- restrict it to a TAN
- heuristically select a subset of the metrics and find the optimal TAN classifier (see the sketch below)
- The model approximates a probability distribution P(S, M)
- A decision model based on what is more likely: flag a violation when P(s- | M) > P(s+ | M)
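The papers induce a Tree-Augmented Naive Bayes model over a heuristically chosen subset of metrics. The sketch below is a simplified stand-in, not the SLIC algorithm: it uses scikit-learn's plain (unaugmented) Naive Bayes with a greedy forward metric selection of my own, on synthetic data, just to show the shape of the procedure and the P(s- | M) > P(s+ | M) decision rule.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Toy data: rows are intervals, columns are low-level metrics,
# y is 1 for SLO violation and 0 for compliance (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=200) > 0.8).astype(int)

# Greedy forward selection of a small metric subset, scored by
# cross-validated balanced accuracy (the papers use a TAN instead
# of this plain Naive Bayes model).
selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
while remaining:
    scores = {
        m: cross_val_score(GaussianNB(), X[:, selected + [m]], y,
                           scoring="balanced_accuracy", cv=5).mean()
        for m in remaining
    }
    m, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score:
        break
    selected.append(m); remaining.remove(m); best_score = score

# Decision rule of the fitted model: predict violation when P(s- | M) > P(s+ | M).
model = GaussianNB().fit(X[:, selected], y)
print("selected metrics:", selected, "balanced accuracy:", round(best_score, 3))
```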
15 - Bayesian Network Example
16 - Metric attribution
17 - Metric Attribution
- Actual values quite different, but the relation to normal is the same (attribution criterion sketched below)
[Figure: example traces of the metric gbl app alive proc]
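My reading of the attribution criterion (notation adapted from the papers; Pa(m_i) is the metric's parent in the TAN): a metric m_i with observed value x_i is implicated in a detected violation roughly when

```latex
\log \frac{P\left(m_i = x_i \mid s^-, \mathrm{Pa}(m_i)\right)}
          {P\left(m_i = x_i \mid s^+, \mathrm{Pa}(m_i)\right)} > 0
```

i.e. when its observed value is more consistent with the violation state than with compliance, which is exactly the "relation to normal" idea above.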
18 - Creating and using signatures
- Leveraging prior diagnosis efforts (a retrieval sketch follows this slide)
[Architecture diagram: monitored service; metrics/SLO monitoring; signature construction engine; signature DB; clustering engine (identifies intensity and recurrence, provides syndromes); retrieval engine; admin (provides annotations in free form)]
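To ground the diagram, a hedged toy sketch of signature retrieval: each epoch is summarised as a vector over the selected metrics (encoded, loosely following the SOSP 2005 paper, as +1 attributed, -1 not attributed, 0 not in the model), and the annotated past incidents closest to a new epoch are retrieved by nearest-neighbour search. The metrics, signatures and annotations below are invented.

```python
import numpy as np

# Toy signature database: each row is a past epoch's signature over 5 metrics.
signature_db = np.array([
    [ 1,  1, -1,  0, -1],
    [-1,  1,  1,  0, -1],
    [ 1, -1, -1,  1,  0],
])
annotations = ["DB connection pool exhausted",
               "application server GC pauses",
               "overloaded backend disk"]

def retrieve_similar(query, db, k=2):
    """Return indices of the k past signatures closest to the query (L2 distance)."""
    dists = np.linalg.norm(db - query, axis=1)
    return np.argsort(dists)[:k]

# Signature of a new, undiagnosed violation epoch.
new_epoch = np.array([1, 1, -1, 1, -1])
for i in retrieve_similar(new_epoch, signature_db):
    print(annotations[i])
```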
19 - Incremental work
- Ensembles of models
- Changing workloads
- External disturbances
- Short term performance forecasting
- Automate assignment of resources
- Transferable models?
- Capturing, Indexing, Clustering and Retrieving system history
- Find similar events in the past.
20 - Ensemble Algorithm Outline
- Size of the window is a parameter (an ensemble sketch follows this slide)
- controls the rate of adaptation
- larger windows produce better models (up to a point)
- typically 6-12 hrs of data
[Flowchart with timing figures: new sample every 1-5 minutes; about 1 ms per model, a few ms for hundreds of models; 2-3 secs; roughly 20 per month]
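A hedged sketch of how such an ensemble might be managed; the real algorithm scores models with a Brier-score-based criterion, whereas the accuracy threshold, the plain Naive Bayes base model and all parameters below are simplifications of mine.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import balanced_accuracy_score

class ModelEnsemble:
    """Illustrative ensemble manager: not the papers' exact algorithm."""

    def __init__(self, accuracy_threshold=0.9):
        self.models = []                  # pool of previously induced models
        self.threshold = accuracy_threshold

    def process_window(self, X_window, y_window):
        """Evaluate existing models on the latest window; train a new one if needed."""
        scores = [balanced_accuracy_score(y_window, m.predict(X_window))
                  for m in self.models]
        if scores and max(scores) >= self.threshold:
            return self.models[scores.index(max(scores))]   # reuse the best model
        new_model = GaussianNB().fit(X_window, y_window)     # induce a new model
        self.models.append(new_model)
        return new_model
```

In use, process_window would be invoked once per window covering the most recent 6-12 hours of labelled samples.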
21 - Challenges going forward
- Design procedures/algorithms to continuously test the validity of induced models
- How to interact with human operators for feedback, identification of false positives and missed problems
- How do you maintain a long-term, indexable history of system state?
23 - Grid Service Levels
- Grid policies and Service Level Agreements
- Current service agreements
- EGEE
- N number of CPUs available
- x% of processing time committed to a specific VO
- Several works by different groups, but nothing of wide acceptance yet (?)
- Is a notion of Service Level already fused into the scheduling policy of the Resource Broker?
- rank = -other.GlueCEStateEstimatedResponseTime
24 - Setting the expected service level
- Indirect / by instrumentation
- Job throughput / instrumented Resource Brokers
- Job performance
- Profile the different applications
- Not applicable to many applications
- Explicit measurement of performance
- Perform controlled experiments
- Monitored/Filtered
- Obtain statistically valid measurements
25 - How is the Grid Different?
- Scale! (3-tier web application vs. the Grid)
- But it really depends on the level of abstraction at which we consider the Grid
- Monitoring
- On the Grid there is no global view
- Expensive and prohibitive for system-level measurements (how do you monitor 30,000 CPUs in 200 sites?)
- From CEs? RBs? BDIIs? RLSs? ...
- What is system-level monitoring on the Grid?
- Network traffic, queue lengths, RB metrics (instrumented/log monitoring), service response times, ...
26 - How is the Grid Different? (2)
- Volatility
- Resources enter and leave the Grid
- Internal changes: HW/MW/configuration
- Virtualization
- Two consecutive measurements could be measuring two entirely different things
- Does this have a serious impact?
- Statistics to the rescue?
- Does system-level monitoring make any sense when faced with virtualized resources?
27 - How is the Grid Different? (3)
- End-to-end approach
- VO service levels
- Policies, quotas, etc.
- in the absence of system-level measurements (i.e. monitoring)
- potentially huge (relative) delays in obtaining measurements through running jobs
- which could themselves be the subject of an SLA
- If performance is monitored through regular benchmarking jobs, only free CPUs are really monitored (see Virtualization)
28 - How is the Grid Different? (4)
- Heterogeneity (spatial/temporal)
- Root cause analysis
- more difficult on the Grid, due to complexity
- Sparse measurements
- because measurements are costly
- minimize costs (and operate on a budget)
- What to measure and when to measure it becomes important.
29 - Sketch of a potential solution
- A resource enters the Grid and goes through a phase where its Service Level is established
- basic performance metrics are taken and labeled as expected, OR
- some SLO is agreed upon
- A service periodically (and intelligently) obtains these metrics (sampling) and establishes whether they meet the Service Level
- via a threshold, mean squared error, goodness of fit, ... (see the sketch below)
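A minimal, hypothetical sketch of that last step, assuming the baseline was recorded when the resource joined and that a two-sample Kolmogorov-Smirnov test is an acceptable goodness-of-fit check; the benchmark runtimes, slowdown threshold and significance level are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

# Baseline benchmark runtimes (seconds) recorded when the resource joined the Grid.
baseline_runtimes = np.array([102.0, 98.5, 101.2, 99.8, 100.4])

def meets_service_level(recent_runtimes, baseline, slowdown_threshold=1.2, alpha=0.05):
    """Illustrative compliance check: flag a violation if the mean runtime has
    degraded past a threshold, or if the runtime distribution has shifted
    (two-sample Kolmogorov-Smirnov test)."""
    recent = np.asarray(recent_runtimes)
    if recent.mean() > slowdown_threshold * baseline.mean():
        return False
    statistic, p_value = ks_2samp(recent, baseline)
    return p_value >= alpha   # small p-value: distributions differ, i.e. violation

# Periodic sample obtained by the auditing service.
print(meets_service_level([131.0, 127.4, 140.2, 129.9], baseline_runtimes))
```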
30 - Resource Auditing
- Define auditing as the evaluation of a site's conformance to certain commitments (SLOs)
- mainly in terms of performance and dependability
- Accounting vs. Auditing
- What is the Service Level Objective?
- How is it expressed?
- Throughput? Dependability?
- Behavior under a set of synthetic tests?
- How is it measured?
31 - Alternatively
- Apply an approach similar to the SLIC project's and establish a relationship between
- monitoring at a higher level, but still at a Grid-wide system level
- GSTAT
- RGMA
- SLO compliance monitoring through the Resource Broker (log watchers)
- Possibly include the ticketing-system database to identify actual violations diagnosed by human operators.
32 - Yet another alternative
- Apply the SLIC approach to Grid subsystems independently, e.g.
- collect system-level metrics from the CE and determine the SLOs for a CE
- collect system-level metrics from the RB, BDII and RLS and correlate them with an RB SLO
- Possibly consider a hierarchical approach
- low-level Service Levels make up the system metrics for an upper level; perhaps relay only attributed metrics?
33 - Conclusion
- Direct application of the SLIC project's mindset in a Grid setting requires that
- we establish what a Grid SLO is and what constitutes compliance and violation
- we have readily available system-level performance metrics
- Difficult to determine Grid SLOs
- Job throughput? For which type of job? Is it the job's fault?
- The subsystem approach is perhaps more doable