Title: Statistical Learning, Inference and Control (HP Labs) and the challenges of applying it to the Grid
1 - Statistical Learning, Inference and Control (HP Labs) and the challenges of applying it to the Grid
- George Tsouloupas
- High Performance Computing Systems Laboratory
- CS Dept., UCY
- Graphics taken from a presentation at the ACM Symposium on Operating Systems Principles (Oct 2005)
- by Ira Cohen, Moises Goldszmidt, Julie Symons, Terence Kelly (HP Labs) and Steve Zhang, Armando Fox (Stanford University)
2 - Outline
- Statistical Learning, Inference and Control (HP)
- The project
- The approach
- On to the Grid
- Grid Service Levels
- How is the Grid Different
- Potential approaches
3 - Statistical Learning, Inference and Control (HP)
- Correlating instrumentation data to system states: A building block for automated diagnosis and control (2004)
- Capturing, Indexing, Clustering, and Retrieving System History (2005)
- Short term performance forecasting in enterprise systems (2005)
- Three research challenges at the intersection of machine learning, statistical induction, and systems (2005)
- Ensembles of models for automated diagnosis of system performance problems (2005)
4 - SLIC people
- Team
- Moises Goldszmidt
- Ira Cohen
- Julie Symons
- Nelson Lee (Stanford University, 2005 Intern)
- Rob Powers (Stanford University, Intern)
- Steve Zhang (Stanford University, Intern)
- Dawn Banard (Duke University, 2003 Summer Intern)
5 - SLIC people
- Collaborators
- Terence Kelly, HP Labs
- Kim Keeton, HP Labs
- Jeff Chase, Duke University
- George Forman, HP Labs
- Armando Fox, Stanford University
- Fabio Cozman, University of Sao Paulo
6 - SLIC: my understanding in a single slide
- Gather state from the machine
- low-level system attributes
- high-level performance metrics (SLO)
- Statistically associate (a subset of) the system-level metrics with Service Levels and then classify violations (see the sketch below)
- (opt.) Continuously maintain/update the model
- If past violations are annotated, we can even take action based on past knowledge.
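To make this concrete, here is a hedged toy sketch of the data the approach operates on: each row pairs a vector of low-level metrics M with a high-level SLO label S. The metric names, values and the 4-second response-time objective are assumptions for illustration, not values from the SLIC papers.

```python
import pandas as pd

# Hypothetical raw measurements: one row per collection interval.
# Metric names and the 4-second SLO threshold are illustrative only.
raw = pd.DataFrame({
    "cpu_util":        [0.42, 0.91, 0.37, 0.95],
    "disk_io_rate":    [120,  480,  110,  510],
    "alive_processes": [35,   34,   36,   22],
    "avg_resp_time_s": [1.2,  5.8,  0.9,  7.4],   # high-level SLO metric
})

SLO_THRESHOLD_S = 4.0  # assumed service-level objective on response time

# S = 1 when the interval violates the SLO, 0 when it complies.
raw["slo_violation"] = (raw["avg_resp_time_s"] > SLO_THRESHOLD_S).astype(int)

# M = the low-level metrics a classifier will be trained on.
metrics = raw[["cpu_util", "disk_io_rate", "alive_processes"]]
labels = raw["slo_violation"]
print(metrics, labels, sep="\n")
```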
7 - Correlating instrumentation data to system states
- Two approaches to self-managing systems
- A priori models of system structure and behavior (event-condition-action rules)
- difficult/costly to build
- incomplete/inaccurate
- fall apart when faced with unanticipated conditions
- Apply statistical learning techniques to automatically induce the models
- efficient, accurate and robust techniques are still an open issue
8 - Aims
- Investigate the effectiveness and practicality of Tree-Augmented Naive Bayesian networks (TANs)
- for performance diagnosis/forecasting from system-level instrumentation
- Construct an analysis engine
- induce a classifier
- predict whether the system is (or will be, over some interval) in compliance with the SLO
- based on the values of the collected metrics
9 - Setup
10 - Keep the raw values?
11 - Service Level Objectives
12 - SLO compliance/violation
P(SLO, M)
13 - Classification
- Basically, it is a pattern classification problem in supervised learning
- Balanced Accuracy (see the sketch below)
- the average of the probability of identifying compliance and the probability of identifying violation
- Why Bayesian nets?
- performance
- interpretability
- modifiability (readily accept expert knowledge into a model)
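Because violations are usually much rarer than compliant intervals, plain accuracy would reward a classifier that always predicts compliance; balanced accuracy averages the two per-class detection rates instead. A minimal self-contained computation (the labels are invented):

```python
def balanced_accuracy(y_true, y_pred):
    """Average of the detection rates for compliance (0) and violation (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_violation = sum(1 for t in y_true if t == 1)
    n_compliance = sum(1 for t in y_true if t == 0)
    return 0.5 * (tp / n_violation + tn / n_compliance)

# Example: 4 compliant and 2 violating intervals, one violation missed.
print(balanced_accuracy([0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 1, 0]))  # 0.75
```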
14 - Bayesian Networks
- Inducing an optimal Bayesian network is intractable
- restrict it to a TAN
- heuristically select a subset of the metrics and find the optimal TAN classifier (see the sketch below)
- The model approximates a probability distribution P(S, M)
- A decision model based on what is more likely: flag a violation when P(s- | M) > P(s+ | M)
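The papers induce a Tree-Augmented Naive Bayes model over a heuristically chosen subset of metrics. The sketch below is a simplified stand-in, not the SLIC algorithm: it uses scikit-learn's plain (unaugmented) Naive Bayes with a greedy forward metric selection of my own, on synthetic data, just to show the shape of the procedure and the P(s- | M) > P(s+ | M) decision rule.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Toy data: rows are intervals, columns are low-level metrics,
# y is 1 for SLO violation and 0 for compliance (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=200) > 0.8).astype(int)

# Greedy forward selection of a small metric subset, scored by
# cross-validated balanced accuracy (the papers use a TAN instead
# of this plain Naive Bayes model).
selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
while remaining:
    scores = {
        m: cross_val_score(GaussianNB(), X[:, selected + [m]], y,
                           scoring="balanced_accuracy", cv=5).mean()
        for m in remaining
    }
    m, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score:
        break
    selected.append(m); remaining.remove(m); best_score = score

# Decision rule of the fitted model: predict violation when P(s- | M) > P(s+ | M).
model = GaussianNB().fit(X[:, selected], y)
print("selected metrics:", selected, "balanced accuracy:", round(best_score, 3))
```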
15 - Bayesian Network Example
16 - Metric attribution
17 - Metric Attribution
- Actual values quite different, but the relation to normal is the same (attribution criterion sketched below)
[Figure: example traces of the metric gbl app alive proc]
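My reading of the attribution criterion (notation adapted from the papers; Pa(m_i) is the metric's parent in the TAN): a metric m_i with observed value x_i is implicated in a detected violation roughly when

```latex
\log \frac{P\left(m_i = x_i \mid s^-, \mathrm{Pa}(m_i)\right)}
          {P\left(m_i = x_i \mid s^+, \mathrm{Pa}(m_i)\right)} > 0
```

i.e. when its observed value is more consistent with the violation state than with compliance, which is exactly the "relation to normal" idea above.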
18 - Creating and using signatures
- Leveraging prior diagnosis efforts (a retrieval sketch follows this slide)
[Architecture diagram: monitored service; metrics/SLO monitoring; signature construction engine; signature DB; clustering engine (identifies intensity and recurrence, provides syndromes); retrieval engine; admin (provides annotations in free form)]
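To ground the diagram, a hedged toy sketch of signature retrieval: each epoch is summarised as a vector over the selected metrics (encoded, loosely following the SOSP 2005 paper, as +1 attributed, -1 not attributed, 0 not in the model), and the annotated past incidents closest to a new epoch are retrieved by nearest-neighbour search. The metrics, signatures and annotations below are invented.

```python
import numpy as np

# Toy signature database: each row is a past epoch's signature over 5 metrics.
signature_db = np.array([
    [ 1,  1, -1,  0, -1],
    [-1,  1,  1,  0, -1],
    [ 1, -1, -1,  1,  0],
])
annotations = ["DB connection pool exhausted",
               "application server GC pauses",
               "overloaded backend disk"]

def retrieve_similar(query, db, k=2):
    """Return indices of the k past signatures closest to the query (L2 distance)."""
    dists = np.linalg.norm(db - query, axis=1)
    return np.argsort(dists)[:k]

# Signature of a new, undiagnosed violation epoch.
new_epoch = np.array([1, 1, -1, 1, -1])
for i in retrieve_similar(new_epoch, signature_db):
    print(annotations[i])
```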
19 - Incremental work
- Ensembles of models
- Changing workloads
- External disturbances
- Short term performance forecasting
- Automate assignment of resources
- Transferable models?
- Capturing, Indexing, Clustering and Retrieving system history
- Find similar events in the past.
20 - Ensemble Algorithm Outline
- Size of the window is a parameter (an ensemble sketch follows this slide)
- controls the rate of adaptation
- larger windows produce better models (up to a point)
- typically 6-12 hrs of data
[Flowchart with timing figures: new sample every 1-5 minutes; about 1 ms per model, a few ms for hundreds of models; 2-3 secs; roughly 20 per month]
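A hedged sketch of how such an ensemble might be managed; the real algorithm scores models with a Brier-score-based criterion, whereas the accuracy threshold, the plain Naive Bayes base model and all parameters below are simplifications of mine.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import balanced_accuracy_score

class ModelEnsemble:
    """Illustrative ensemble manager: not the papers' exact algorithm."""

    def __init__(self, accuracy_threshold=0.9):
        self.models = []                  # pool of previously induced models
        self.threshold = accuracy_threshold

    def process_window(self, X_window, y_window):
        """Evaluate existing models on the latest window; train a new one if needed."""
        scores = [balanced_accuracy_score(y_window, m.predict(X_window))
                  for m in self.models]
        if scores and max(scores) >= self.threshold:
            return self.models[scores.index(max(scores))]   # reuse the best model
        new_model = GaussianNB().fit(X_window, y_window)     # induce a new model
        self.models.append(new_model)
        return new_model
```

In use, process_window would be invoked once per window covering the most recent 6-12 hours of labelled samples.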
21 - Challenges going forward
- Design procedures/algorithms to continuously test the validity of induced models
- How to interact with human operators for feedback, identification of false positives and missed problems
- How do you maintain a long-term, indexable history of system state?
23 - Grid Service Levels
- Grid policies and Service Level Agreements
- Current service agreements
- EGEE
- N number of CPUs available
- x% of processing time committed to a specific VO
- Several works by different groups, but nothing of wide acceptance yet (?)
- Is a notion of Service Level already fused into the scheduling policy of the Resource Broker?
- rank = -other.GlueCEStateEstimatedResponseTime
24 - Setting the expected service level
- Indirect / by instrumentation
- Job throughput / instrumented Resource Brokers
- Job performance
- Profile the different applications
- Not applicable to many applications
- Explicit measurement of performance
- Perform controlled experiments
- Monitored/Filtered
- Obtain statistically valid measurements
25 - How is the Grid Different?
- Scale! (3-tier web application vs. the Grid)
- But it really depends on the level of abstraction at which we consider the Grid
- Monitoring
- On the Grid there is no global view
- Expensive and prohibitive for system-level measurements (how do you monitor 30,000 CPUs in 200 sites?)
- From CEs? RBs? BDIIs? RLSs? ...
- What is system-level monitoring on the Grid?
- Network traffic, queue lengths, RB metrics (instrumented/log monitoring), service response times, ...
26 - How is the Grid Different? (2)
- Volatility
- Resources enter and leave the Grid
- Internal changes: HW/MW/configuration
- Virtualization
- Two consecutive measurements could be measuring two entirely different things
- Does this have a serious impact?
- Statistics to the rescue?
- Does system-level monitoring make any sense when faced with virtualized resources?
27 - How is the Grid Different? (3)
- End-to-end approach
- VO service levels
- Policies, quotas, etc.
- in the absence of system-level measurements (i.e. monitoring)
- potentially huge (relative) delays in obtaining measurements through running jobs
- which could themselves be the subject of an SLA
- If performance is monitored through regular benchmarking jobs, only free CPUs are really monitored (see Virtualization)
28 - How is the Grid Different? (4)
- Heterogeneity (spatial/temporal)
- Root cause analysis
- more difficult on the Grid, due to complexity
- Sparse measurements
- because measurements are costly
- minimize costs (and operate on a budget)
- What to measure and when to measure it becomes important.
29 - Sketch of a potential solution
- A resource enters the Grid and goes through a phase where its Service Level is established
- basic performance metrics are taken and labeled as expected, OR
- some SLO is agreed upon
- A service periodically (and intelligently) obtains these metrics (sampling) and establishes whether they meet the Service Level
- via a threshold, mean squared error, goodness of fit, ... (see the sketch below)
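A minimal, hypothetical sketch of that last step, assuming the baseline was recorded when the resource joined and that a two-sample Kolmogorov-Smirnov test is an acceptable goodness-of-fit check; the benchmark runtimes, slowdown threshold and significance level are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

# Baseline benchmark runtimes (seconds) recorded when the resource joined the Grid.
baseline_runtimes = np.array([102.0, 98.5, 101.2, 99.8, 100.4])

def meets_service_level(recent_runtimes, baseline, slowdown_threshold=1.2, alpha=0.05):
    """Illustrative compliance check: flag a violation if the mean runtime has
    degraded past a threshold, or if the runtime distribution has shifted
    (two-sample Kolmogorov-Smirnov test)."""
    recent = np.asarray(recent_runtimes)
    if recent.mean() > slowdown_threshold * baseline.mean():
        return False
    statistic, p_value = ks_2samp(recent, baseline)
    return p_value >= alpha   # small p-value: distributions differ, i.e. violation

# Periodic sample obtained by the auditing service.
print(meets_service_level([131.0, 127.4, 140.2, 129.9], baseline_runtimes))
```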
30 - Resource Auditing
- Define auditing as the evaluation of a site's conformance to certain commitments (SLOs)
- mainly in terms of performance and dependability
- Accounting vs. Auditing
- What is the Service Level Objective?
- How is it expressed?
- Throughput? Dependability?
- Behavior under a set of synthetic tests?
- How is it measured?
31 - Alternatively
- Apply an approach similar to the SLIC project's and establish a relationship between
- monitoring at a higher level, but still at a Grid-wide system level
- GSTAT
- RGMA
- SLO compliance monitoring through the Resource Broker (log watchers)
- Possibly include the ticketing-system database to identify actual violations diagnosed by human operators.
32 - Yet another alternative
- Apply the SLIC approach to Grid subsystems independently, e.g.
- collect system-level metrics from the CE and determine the SLOs for a CE
- collect system-level metrics from the RB, BDII and RLS and correlate them with an RB SLO
- Possibly consider a hierarchical approach
- low-level Service Levels make up the system metrics for an upper level; perhaps relay only attributed metrics?
33 - Conclusion
- Direct application of the SLIC project's mindset in a Grid setting requires that
- we establish what a Grid SLO is and what constitutes compliance and violation
- we have readily available system-level performance metrics
- Difficult to determine Grid SLOs
- Job throughput? For which type of job? Is it the job's fault?
- The subsystem approach is perhaps more doable