Title: Autonomic Computing: A New Challenge for Machine Learning (ECML-06 Tutorial)
Slide 1: Autonomic Computing: A New Challenge for Machine Learning (ECML-06 Tutorial)
Irina Rish and Gerry Tesauro, IBM T.J. Watson Research Center, Hawthorne, NY
Slide 2: Outline
- What is Autonomic Computing? Why use ML?
- Some Application Examples
- Part 1: Inference and Learning with Active Sampling
  - Active testing in Bayesian inference: problem diagnosis
  - Active learning in collaborative prediction: server selection
- Part 2: Decision Making and Reinforcement Learning
- Summary and Future Directions
Slide 3: Challenges in Systems Management
- Large-scale, heterogeneous distributed systems with highly dynamic, complex multi-component interactions
- Large volumes of real-time, high-dimensional data, but also much missing information and uncertainty
- Too much complexity, too few (skilled) administrators

The need for self-managing systems → autonomic computing
Slide 4: Evolution of Computing
Slide 5: What is Autonomic Computing?
Computing systems that manage themselves in accordance with high-level objectives from humans (Kephart and Chess, "A Vision of Autonomic Computing," IEEE Computer, 2003).

Self-management capabilities include:
- Self-Configuration: automated configuration of components and systems according to high-level policies; the rest of the system adjusts seamlessly.
- Self-Healing: automated detection, diagnosis, and repair of localized software/hardware problems.
- Self-Optimization: automatic and continual adaptive tuning of hundreds of parameters (database parameters, server parameters, ...) affecting performance and efficiency.
- Self-Protection: automated defense against malicious attacks or cascading failures; use early warning to anticipate and prevent system-wide failures.

A good application domain for ML: rich opportunities, little previously done.
Slide 6: Autonomic Computing Element Architecture
[Diagram: an autonomic element combines Inference, Learning, and Decision-making]
Slide 7: Machine Learning: Promises and Challenges
- Promise: machine learning is a natural solution to automation!
  - Avoids knowledge-intensive model building
  - Deals naturally with dynamicity and changes in system composition
  - Can handle complex, non-steady-state dynamical phenomena
- Challenges:
  - Curse of dimensionality: O(exp(N)) state space for N variables → representation and computation complexity
  - Data sparsity: largely unexplored state-action spaces → a challenge for prediction and decision-making
- Reverse the curse!
  - Squeeze out relevant information (generalization, dimensionality reduction)
  - Only do exploration that matters (active learning)
Slide 8: Reversing the Curse: A Unified Approach
[Diagram: a closed loop around the managed system]
- System → high-dimensional, high-volume raw data (challenges: high dimensionality, data sparsity)
- State-space data abstraction → compressed, informative data (challenge: scalable predictive algorithms)
- Performance prediction as a function of possible actions (classification, regression) → predicted performance
- Action evaluation and decision making (challenges: exploration vs. exploitation, cost-benefit tradeoffs) → actions fed back into the system
Slide 9: Some Application Examples
- Mining event data
- Transaction recognition
- Fault diagnosis
- Performance prediction
  - Server/provider selection problem
- Online resource allocation
  - Power and performance management
Slide 10: Example 1: Event Mining (for details, see Ma and Hellerstein)
Analyzing system event logs to extract interesting behavior patterns:
- Thousands of hosts
- Hundreds of event types
- Billions of events
- Various severity levels
  - Some high-severity events: Cisco_Link_Down, chassisMinorAlarm_On
  - Some low-severity events: tcpConnectClose, duplicate_ip
- Learning: dependencies among events (clustering, association rules)
- Inference: predict high-severity events, diagnose suspicious behavior
- Take corrective and/or preventive actions based on predictions
Slide 11: Event Prediction Problem
Slide 12: Classification-Based Event Prediction (for details, see Vilalta and Ma; Domeniconi et al.; Sahoo et al.)
Slide 13: Example 2: Users' Transaction Recognition
- Transaction recognition is needed for:
  - Building realistic workload models for performance testing
  - Quantifying end-user perception of performance (response times)
- Learning: segmentation and labeling of RPC streams
- Inference: predicting the most likely next transactions given the model
- Decision-making: resource allocation based on anticipated requests
Slide 14: Why Is It Hard? Why Learn from Data?
Example: EUTs (end-user transactions) and RPCs in Lotus Notes
Slide 15: Segmentation and Classification Subproblems (for details, see Hellerstein, Jayram, and Rish, AAAI-2000)
Classification (similar to text classification):
- Features: RPC occurrences ("bag of words") or RPC counts
- Accuracy results: Naive Bayes 86-88%, SVM 85-87%, decision tree 90-92%
Segmentation (similar to speech understanding, image segmentation):
- Dynamic programming (Viterbi search) with Naive Bayes
- Accuracy with Naive Bayes: 64% (a harder problem than classification!)
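As a minimal sketch of the classification subproblem, the snippet below is a multinomial Naive Bayes classifier over bag-of-RPCs features with Laplace smoothing. The RPC names and transaction labels are made up for illustration and are not taken from the Lotus Notes data.

```python
import math
from collections import Counter

class RPCNaiveBayes:
    """Multinomial Naive Bayes over bag-of-RPCs features (a sketch)."""

    def fit(self, sessions, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)                 # per-class session counts
        self.counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for rpcs, y in zip(sessions, labels):
            self.counts[y].update(rpcs)              # per-class RPC counts
            self.vocab.update(rpcs)
        return self

    def predict(self, rpcs):
        def log_score(c):
            total = sum(self.counts[c].values()) + len(self.vocab)
            s = math.log(self.prior[c])
            for r in rpcs:                           # Laplace-smoothed likelihoods
                s += math.log((self.counts[c][r] + 1) / total)
            return s
        return max(self.classes, key=log_score)
```

With RPC counts instead of occurrences, one would weight each term by its count; the structure is otherwise identical.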
Slide 16: Example 3: Network Monitoring Using Probes
Probes are end-to-end transactions: ping, traceroute, email and web access, e-business transactions. They are typically used for testing end-to-end performance (e.g., SLA compliance).
- Learning: dependency models (topology, routing), noise parameters
- Inference problem: diagnosis (why did a transaction time out?)
- Decision-making: cost-efficient active probing, optimal routing, etc.
Slide 17: Example 4: Content-Distribution Systems
Examples: Napster, Gnutella, IBM's downloadGrid
[Diagram: peers (p), a probe, a Management Center, a download request, and a complaint]
- Learning: performance prediction (latency, bandwidth)
- Inference: problem diagnosis (a node didn't reply: is it faulty?)
- Decision-making: best provider selection (e.g., max bandwidth)
Slide 18: Example 5: Allocating Server Resources in a Data Center
Scenario: a data center serving multiple customers, each running high-volume web applications with independent, time-varying workloads.
[Diagram: a Resource Arbiter maximizes business value across all customers; each customer (e.g., Citibank online banking) has an Application Manager with an SLA, a Router, and backend servers such as DB2]
Slide 19: Outline
- What is Autonomic Computing? Why use ML?
- Some Application Examples
- Part 1: Inference and Learning with Active Sampling
  - Active testing in Bayesian inference: problem diagnosis
  - Active learning in collaborative prediction: server selection
- Part 2: Decision Making and Reinforcement Learning
- Summary and Future Directions
Slide 20: Why Focus on Active Sampling?
- Cost-efficiency concerns:
  - A huge number of possible measurements → data-collection costs (instrumentation, storage, overhead due to invasive tests)
  - Huge volumes of data → complexity of data analysis
- Need to squeeze out only the most relevant information; otherwise "we are drowning in data but starving for knowledge"
- Needed: optimized measurement selection
- Good news: artificial systems are well suited for active sampling, being more flexible than some natural applications
Slide 21: Example: Network Monitoring via Probes
Probes are end-to-end transactions: ping, traceroute, email and web access, e-business transactions. They are typically used for testing end-to-end performance (e.g., SLA compliance).
- Inference problem: diagnosis (why did a transaction time out?)
- Decision-making: cost-efficient, active probe selection
Slide 22: Simple Example
Dependency matrix:
- Columns: components/nodes (hardware/software components)
- Rows: probes (ping, traceroute, email, web access)
- E.g.:
  - pWS: web-page access probe
  - pDBS: database query
  - pAS: application test
  - pingR: ping the router
  - pingWS: ping the web server
Slide 23: Probabilistic Inference in Bayesian Networks
Slide 24: Noisy-OR Bayesian Network Model
- With no noise, each probe outcome is a logical OR of its causes; probe outcomes define a set of constraints, e.g.:
  - t1 = X1 ∨ X2 ∨ X5
  - t2 = X1 ∨ X3 ∨ X6
  - t3 = X2 ∨ X3 ∨ X4
- With noise: noisy-OR (a causal independence model)
  - "Spurious" probes: unexpected success due to an inhibited cause
  - "Lost" probes: unexpected failure due to a hidden cause
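A minimal sketch of the noisy-OR likelihood just described, with an inhibition probability producing "spurious" probes and a leak probability producing "lost" ones (the parameter names are ours, not the slides'):

```python
def noisy_or(parent_states, inhibit=0.0, leak=0.0):
    """P(probe fails) under a noisy-OR model.

    Each faulty parent (state 1) independently triggers a failure, but
    its effect is inhibited with probability `inhibit` ("spurious"
    success); `leak` is the probability of failure even with no faulty
    parent ("lost" probe). With inhibit=leak=0 this is a logical OR.
    """
    p_no_trigger = 1.0 - leak
    for x in parent_states:
        if x:
            p_no_trigger *= inhibit
    return 1.0 - p_no_trigger
```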
Slide 25: Challenge: Make Diagnosis Cost-Efficient
- Need to reduce the cost of deploying and maintaining the physical infrastructure: probe stations, databases, reporting systems, staff
- However, we also need fast and accurate real-time diagnosis!

Multicriteria optimization: minimize
- the number of probe stations
- the number of probes
- the computational complexity of probe selection
- the expected diagnostic error
- the computational complexity of diagnosis

Pipeline: topology information → probe-station (source) selection and probe-set construction (off-line planning) → probe-set optimization (on-line, active)
Slide 26: Probe Source Selection: Greedy vs. Random (Odintsova and Rish, 2005)
Evaluated on the Watson network (router level), the IBM Research network (router level), random graphs, and scale-free networks, measuring the number of probes needed for fault detection against a bound on the number of sources.
The greedy heuristic appears to consistently beat random source placement.
Slide 27: Optimal Probe Set Selection
- Non-adaptive (off-line): given an unobserved variable X with distribution P(X), a set of possible probes S, and P(S|X), choose the smallest subset of tests T that allows the most accurate diagnosis of X (an NP-hard problem; Beygelzimer, 2003). However, a greedy "most-informative-next" heuristic search works quite well (Brodie, Rish, and Ma, 2001). In the zero-noise case, this yields a minimum probe subset for single-fault diagnosis.
- But an online, adaptive approach is even more efficient! It maximizes information gain given the outcomes of the previous probes, is more context-specific, and typically requires much less probing.
Slide 28: Active Online Probe Selection
Select the next probe that provides maximum information about the unknown system state.
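A sketch of this selection rule for the noise-free, single-fault case: given a prior over which component is faulty and a 0/1 dependency matrix (a probe fails iff it traverses the faulty component), pick the probe whose outcome maximizes the expected entropy reduction. This illustrates the most-informative-next idea; it is not the authors' implementation.

```python
import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

def most_informative_probe(prior, dep_rows):
    """Return (probe index, info gain) of the probe maximizing expected
    information gain about a single fault X.  dep_rows[t][i] == 1 iff
    probe t passes through component i; noise-free outcomes assumed."""
    h0 = entropy(prior)
    best_t, best_gain = None, -1.0
    for t, row in enumerate(dep_rows):
        p_fail = sum(p for p, d in zip(prior, row) if d)
        exp_h = 0.0
        for outcome, p_o in ((1, p_fail), (0, 1.0 - p_fail)):
            if p_o <= 0:
                continue
            # posterior: fault is in (outcome=1) or out of (outcome=0)
            # the set of components this probe covers
            post = [p / p_o if d == outcome else 0.0
                    for p, d in zip(prior, row)]
            exp_h += p_o * entropy(post)
        gain = h0 - exp_h
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

In the adaptive loop, one would condition the prior on each observed outcome and re-run the selection, which is exactly what makes it more context-specific than off-line selection.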
Slide 29: Active Diagnosis: Adaptive vs. Non-Adaptive Test Selection (Rish et al., IEEE Trans. Neural Networks, 2005)
[Table: datasets (Odyssey Mar-03, Feb-03, Nov-02; CRM1-CRM4; intranet1, intranet2) compared by number of nodes and number of tests under non-adaptive exact, non-adaptive greedy, and adaptive (average) selection, with savings of adaptive vs. exact and of active vs. greedy off-line selection]
Active vs. off-line selection: 60-76% savings in the number of tests.
Slide 30: Warning: Active Probe Selection Becomes Intractable for General Multi-Fault Diagnosis (Zhang, Rish, and Beygelzimer, UAI 2005)
- Without a k-fault assumption for small k, even myopic VOI (selecting the single next most-informative probe) becomes intractable, as it now requires generally intractable probabilistic inference.
- However, a decomposition of H(X,T) and a subsequent efficient approximation based on belief propagation are possible (see Zhang et al., UAI-05): it computes ALL information gains for candidate tests in one sweep and does not compromise much accuracy relative to the exact approach to active probe selection.
- Related work: Krause and Guestrin, 2005.
Oops... intractable!
Slide 31: Some Theoretical Bounds on Diagnostic Error in Bayes Nets (Rish, Allerton-2005)
Theorem: the diagnostic error, measured by the bit error rate (BER), is bounded from below in terms of c, the minimum number of children over all Xi, and the maximum prior. [The bound itself appeared as an equation image and is omitted here.]
Assumptions: regular random bipartite networks with n input nodes, m output nodes, and k parents per test node (and thus km/n children per hidden node); p = P(Xi = 1) is the fault prior (p < 0.5), q is the noise parameter, and qleak is the leak parameter.
Interpretation: m probes, each of length k, over randomly selected subsets of nodes. This is a somewhat unrealistic setting (in reality, probe selection is constrained), but convenient for an initial evaluation of diagnostic error.
Slide 32: Necessary Conditions for Zero-Error Diagnosis
Corollary 2 gives a necessary condition for achieving error-free diagnosis:
- More probes per node (m/n) are required for higher fault probability p
- Longer probes (larger k) yield lower requirements on m/n
- But the bounds are weak: can we find an ACHIEVABLE lower bound (just like Shannon's)?
Slide 33: Computational Complexity of Diagnosis
In our problems, the induced width w grows with the number of probe stations, and inference becomes intractable quite quickly (multiple probe stations are needed for more informative probe sets).
- What can we do?
  - Sometimes the problem structure is easy (e.g., trees, low-w graphs)
  - But in general, approximate inference is needed, e.g., loopy belief propagation.
Slide 34: Distributed Diagnosis
- RAIL: real-time active inference and learning engine
- EPP: end-to-end probing station
- In the corresponding Bayesian network, each RAIL corresponds to a region that includes all probes controlled by that RAIL and their respective nodes.
Slide 35: Factor Graphs and Belief Propagation
Map the Bayesian network to a factor graph: each variable is mapped to a variable node, and each function to a factor node.
Factor node ↔ probe; variable node ↔ component.
Parallel belief propagation: each RAIL updates the beliefs of its nodes and sends messages to neighboring RAILs that share common variable nodes.
Slide 36: Belief Propagation on Factor Graphs
The belief b(y) is the BP approximation of the marginal probability of variable y, computed from messages passed from variable nodes to factor nodes and from factor nodes to variable nodes.
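The message updates themselves appeared as figures on the slide; in standard sum-product notation for factor graphs they read as follows, where N(·) denotes a node's set of neighbors:

```latex
% variable node y to factor node f:
m_{y \to f}(y) \;=\; \prod_{g \in N(y)\setminus\{f\}} m_{g \to y}(y)

% factor node f to variable node y:
m_{f \to y}(y) \;=\; \sum_{\mathbf{x}_{N(f)\setminus\{y\}}}
    f\bigl(\mathbf{x}_{N(f)}\bigr)
    \prod_{z \in N(f)\setminus\{y\}} m_{z \to f}(z)

% belief (approximate marginal):
b(y) \;\propto\; \prod_{f \in N(y)} m_{f \to y}(y)
```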
Slide 37: BP Diagnosis on Internet-like Networks
- The INET topology generator was used to create a network of 487 nodes
- 387 probes were selected by greedy search for single-fault diagnosis
- Simulated varying levels of noise q and fault prior p
Findings:
- Error increases with growing fault probability p and noise level q
- The fault probability p has more impact on the error than the noise q
- Good news: in reality, p is usually quite low, so the error is small
Slide 38: Outline
- What is Autonomic Computing? Why use ML?
- Some Application Examples
- Part 1: Inference and Learning with Active Sampling
  - Active testing in Bayesian inference: problem diagnosis
  - Active learning in collaborative prediction: server selection
- Part 2: Decision Making and Reinforcement Learning
- Summary and Future Directions
Slide 39: Performance Prediction and Server Selection
- End-to-end performance between a pair of nodes: network latency, bandwidth, round-trip time, or any other QoS metric
- Knowing end-to-end performance is important in many applications:
  - Content-distribution systems (peer-to-peer, Grid): choosing the highest-bandwidth server from which to download an object
  - Distributed hash tables: routing a lookup request to the peer with the lowest latency
  - Overlay routing: selecting the lowest-latency peer to communicate with
- Example: IBM's downloadGrid
Slide 40: Approach: Collaborative Prediction (CP)
Problem: given a sparse matrix of previously observed user experiences (users' ratings for a set of products, or bandwidth between some client-server pairs), predict the unobserved entries.
Example: a bandwidth matrix for a 100x100 clients-by-servers subset of dGrid.
- How do we generalize from observed to unobserved entries (fill in the blank space)?
- Underlying assumption: matrix entries are NOT independent; e.g., similar nodes have similar performance
- Various approaches, mainly factorized models that assume hidden factors affecting the ratings: aspect model, pLSA, MCVQ, SVD, NMF, MMMF
Slide 41: Assumptions
- There is a small number of (hidden) factors behind user preferences that relate to (hidden) movie properties
- Movies have intrinsic values associated with these factors
- Users have intrinsic weights on these factors
- User ratings are weighted (linear) combinations of the movies' values
Factor models → dimensionality reduction (for a small number of factors)
Slide 42: Low-Rank Matrix Factorization
[Diagram: Y approximated by a rank-k product X = U V^T]
Objective: find a factorization X = U V^T that approximates Y and satisfies some regularization constraints (e.g., rank(X) < k). The loss function depends on the nature of your problem.
Slide 43: How to Solve It?
- Singular value decomposition (SVD): low-rank approximation
  - Assumes a fully observed Y and sum-squared loss
  - In collaborative prediction, Y is only partially observed, and low-rank approximation becomes a non-convex problem with many local minima
- Furthermore, we may not want sum-squared loss, but instead:
  - accurate predictions (0/1 loss, approximated by hinge loss)
  - cost-sensitive predictions (missing a good server vs. suggesting a bad one)
  - the cost of ranking, etc., depending on the decision algorithm using the predictions
- Instead, use the state-of-the-art Max-Margin Matrix Factorization (Srebro, 2005):
  - replaces the bounded-rank constraint with bounded norms of the U, V vectors
  - a convex optimization problem! It can be solved exactly by semi-definite programming
  - strongly related to learning max-margin classifiers (SVMs)
- We demonstrate MMMF on binary classification first
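MMMF itself needs a semi-definite-programming or other specialized solver. As a self-contained illustration of the plainer factorized-model baseline it improves on (not MMMF), here is matrix completion by alternating ridge regressions, fitting a rank-k factorization to the observed entries only:

```python
import numpy as np

def als_complete(Y, mask, k=2, lam=0.01, iters=50, seed=0):
    """Fill in a partially observed matrix with a rank-k factorization
    X = U @ V.T fit only on the observed entries (mask == 1).
    Alternating ridge regressions; a low-rank baseline, not MMMF."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = rng.normal(scale=0.1, size=(n, k))
    V = rng.normal(scale=0.1, size=(m, k))
    reg = lam * np.eye(k)
    for _ in range(iters):
        for i in range(n):                     # update each row factor
            o = mask[i] == 1
            U[i] = np.linalg.solve(V[o].T @ V[o] + reg, V[o].T @ Y[i, o])
        for j in range(m):                     # update each column factor
            o = mask[:, j] == 1
            V[j] = np.linalg.solve(U[o].T @ U[o] + reg, U[o].T @ Y[o, j])
    return U @ V.T
```

Because the fit uses only observed entries, the problem is non-convex (the local-minima issue noted above); MMMF's norm-based relaxation is precisely what restores convexity.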
Slide 44: Key Insight
Rows are feature vectors; columns are linear classifiers (weight vectors).
Each entry decomposes as X_ij = sign_ij x margin_ij, and the predictor outputs sign_ij: if X_ij > 0, classify as +1; otherwise classify as -1.
Slide 45: MMMF: Simultaneous Search for Low-Norm Feature Vectors and Max-Margin Classifiers
Slide 46: Our Contribution: Active Learning
MMMF works well, but it ignores a natural domain property: the possibility of active sampling (e.g., make user A connect to server 117 to improve our model).
[Figure: a matrix of predicted margins; entries near zero are the most uncertain]
Current active-SVM heuristic: actively query the minimum-margin sample (the most uncertain one).
Slide 47: Active Max-Margin Matrix Factorization (A-MMMF)
A-MMMF(M, s):
1. Given a sparse matrix Y, learn an approximation X = MMMF(Y)
2. Using the current predictions, actively select the best s samples and request their labels (e.g., test a client/server pair via an enforced download)
3. Add the new samples to Y
4. Repeat 1-3 until no significant improvement in prediction is likely

Active sampling:
- The idea is to eliminate as many wrong hypotheses (e.g., SVM separators) from consideration as possible (see later for more detail)
- The current approach uses the simplest minimum-margin heuristic for SVMs
- However, there is a variety of other SVM active-learning heuristics to try
- Ideally, a theoretically founded approach is desirable: a hard open problem
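The minimum-margin selection in step 2 can be sketched in a few lines: rank the unobserved entries of the predicted real-valued matrix by |X_ij| and query the smallest, i.e., the predictions closest to the decision boundary. The function below is our illustration, not the authors' code.

```python
def min_margin_queries(X, observed, budget):
    """Min-margin active sampling: among unobserved entries (i, j) of
    the predicted margin matrix X, return the `budget` entries with the
    smallest |X[i][j]| -- the most uncertain predictions."""
    ranked = sorted(
        (abs(X[i][j]), i, j)
        for i in range(len(X)) for j in range(len(X[0]))
        if (i, j) not in observed)
    return [(i, j) for _, i, j in ranked[:budget]]
```

Swapping the sort order gives the "least-uncertain" (max-margin, safe) strategy compared on a later slide.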
Slide 48: Results: Active vs. Random Sampling
Experiments on downloadGrid bandwidth prediction and PlanetLab latency prediction (accuracy vs. percentage of the initial data set sampled).
Active sampling gives a consistent improvement in classification accuracy, which leads to better decisions → higher bandwidth, faster downloads.
Slide 49: More Results on Latency Prediction
On the p2psim and NLANR-AMP data sets, we compare various active sampling strategies: most-uncertain (min-margin) and least-uncertain (max-margin, "safe") sample selection.
Slide 50: Results for Movie Rating Prediction
MovieLens data: prediction error and cost; prediction accuracy versus the cost of sampling under various strategies.
Slide 51: Conclusions
- A common challenge in systems-management applications: cost-efficient measurement selection
- A promising approach: cost-efficient active sampling
- Active sampling improves predictive accuracy while keeping the number of measurements low in several domains:
  - Online diagnosis
  - Bandwidth and latency prediction
- Future work:
  - Other systems applications and problems that need active sampling
  - Theoretical analysis of active-sampling performance
  - More efficient active-sampling approaches (better heuristics, non-myopic test selection)
Slide 52: Outline
- What is Autonomic Computing? Why use ML?
- Some Application Examples
- Part 1: Inference and Learning with Active Sampling
  - Active testing in Bayesian inference: problem diagnosis
  - Active learning in collaborative prediction: server selection
- Part 2: Decision Making and Reinforcement Learning
- Summary and Future Directions
Slide 53: Examples of Autonomic Decision-Making
- Systems performance management
  - Dynamic resource allocation: servers, threads, CPU slices, memory, storage, bandwidth
  - Online parameter tuning: MAXCLIENTS, timeout parameters, ...
  - Routing and scheduling
  - Access/flow control
  - Application placement
- Objectives: QoS/SLA objectives, consistency, customer retention, fairness, etc. (can get pretty nebulous)
Slide 54: More Decision-Making Examples
- Availability management
  - Knobs: data redundancy, server redundancy
  - Objectives: MTBF, RTO, RPO
- Power/thermal management
  - Knobs: CPU clock speeds, ambient temperature, fan speeds, ...
  - Objectives: minimize the cost of electricity, cooling, and heat degradation of equipment
- Background utility throttling: DB backups, re-indexing
- Utilization management
- Multi-criteria tradeoffs, e.g., performance + availability, performance + power, etc.
Slide 55: WebSphere On Demand Operating Environment
[Diagram: the WebSphere on demand Router performs classification, prioritization and flow control, and routing and load balancing across a WebSphere cell of five nodes hosting Stock Trading (high importance), Account Management (medium importance), and Financial Advice (low importance) applications; provisioning decisions are driven by application demand, resource state, and operational policy]
Slide 56: Application: Allocating Server Resources in a Data Center
Scenario: a data center serving multiple customers, each running high-volume web applications with independent, time-varying workloads (the setting of Slide 18: a Resource Arbiter maximizing business value across customers, each with an SLA-bound Application Manager).
Slide 57: ML for Autonomic Decision-Making
- The natural method of choice is Reinforcement Learning: it learns behavioral policies (State → Action)
- But there are several huge challenges in making this practical:
  - Non-Markovian effects (non-stationarity, history dependence, partial observability) may be pronounced
  - The scale and complexity of big distributed systems is daunting: data centers and large multi-tier web applications can easily have many thousands of state variables
  - We need RL approaches that scale to huge state spaces as well as huge action spaces
  - The cost of acquiring training data is quite real in live systems: the cost of poor performance of the initial policy, and the cost of exploration
Slide 58: Needed Enhancements to Vanilla RL
- Active learning, i.e., the exploration-exploitation tradeoff:
  - An exact solution is known for bandit problems (stateless MDPs)
  - Several established heuristic approaches (Boltzmann exploration, interval estimation, ...)
  - Principled Bayes-RL approaches are beginning to be developed (Wang et al., ICML 2005; Poupart et al., ICML 2006)
- Need to address the curse of dimensionality:
  - Robust function approximation: exploit the smooth, monotonic dependence on state variables found in many systems-management applications
  - Feature selection: a well-developed literature, mainly for supervised learning (classification/regression)
  - State abstraction: structured MDP approaches, options, PSRs
  - Hidden-state inference: POMDP literature, hidden aspect model learning
Slide 59: What's Wrong with Standard Model-Based Approaches? (Why Bother with ML?)
- Model-based approach: design an appropriate system performance model (e.g., control-theoretic or queuing-theoretic)
  - Estimate model parameters offline or online
  - The model estimates how the control variable affects system performance
  - Use optimization methods to select the best control-variable setting (if exhaustive search is infeasible)
- Two main limitations:
  - Model design is difficult and knowledge-intensive
  - Model assumptions don't exactly match the real system
- Two prospective benefits of the machine learning approach:
  - Avoids the knowledge bottleneck
  - Can achieve more principled, MDP-optimal policies that properly account for the long-range dynamic consequences of actions
Slide 60: A Knowledge Bottleneck in Autonomic Computing
Slide 61: Case Studies: RL for Dynamic Resource Allocation
- Realistic prototype data center:
  - Real servers and multiple web-based transactional workloads
  - Realistic time-varying demand in each workload
  - Dynamically allocate servers to optimize SLA payments
- Decompositional online RL approach:
  - Enables scalability to many workloads
  - Learns local value functions within each workload
- Hybrid RL approach:
  - Collect data using an external (queuing-model-based) policy to make allocation decisions
  - Train RL in batch mode on the collected data
  - The learned policy outperforms the original policy
Slide 62: Application: Allocating Server Resources in a Data Center
Scenario: a data center serving multiple customers, each running high-volume web applications with independent, time-varying workloads (as on Slides 18 and 56).
Slide 63: Data Center Prototype: Implementation
- Real servers: a cluster of 20 IBM eServer xSeries machines (RedHat Linux)
- Realistic web-based workload: Trade3 (online trading emulation), running on top of WebSphere (web application platform) and DB2 (database management software)
- Realistic demand generation:
  - Open-loop scenario: Poisson HTTP requests with a time-varying mean arrival rate λ
  - Closed-loop scenario: a finite number of customers M with a fixed think-time distribution; M varies with time
  - Variations in M or λ are governed by a stochastic time-series model of web traffic (Squillante, Yao, and Zhang, 1999)
Slide 64: Data Center Prototype: Experimental Setup
[Diagram: a Resource Arbiter maximizes total SLA revenue, exchanging Value(servers) estimates with three App Managers every 5 seconds; two Trade3 environments (WebSphere 5.1 + DB2, driven by time-varying HTTP demand, with SLAs on response time) and one Batch environment share a pool of 8 xSeries servers]
Slide 65: Will ML Without Built-In Knowledge Work?
Tabula rasa: "blank slate" (Latin)
Slide 66: Global RL versus Local RL
- One approach: make the Resource Arbiter a global Q-learner
  - Advantages: the arbiter's problem is a true MDP, so we can rely on a convergence guarantee
  - Main disadvantage: the arbiter's state space is the huge cross product of all local state spaces → a serious curse of dimensionality with many applications
- Alternative approach: local RL
  - Each application does local TD(0) based on local state, local provisioning, and local reward → learns a local value function
  - Each application conveys its current V(resource) estimates to the arbiter
  - The arbiter then acts to maximize the sum of the current value functions
  - Local learning should be much easier than global learning, but we no longer have a convergence guarantee
  - Related work: Russell and Zimdars, ICML-03 (local rewards only)
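The arbiter's step, maximizing the sum of the reported local value functions over feasible server splits, can be sketched as a small dynamic program. This is our illustration of the idea (fine for a handful of applications); the slides do not specify the arbiter's actual optimization method.

```python
from functools import lru_cache

def arbiter_allocate(value_fns, total):
    """Split `total` servers among applications to maximize the sum of
    their reported value functions V_i(n_i).  Returns (best total
    value, allocation tuple).  Exhaustive DP over applications."""
    @lru_cache(maxsize=None)
    def solve(i, left):
        if i == len(value_fns):
            return (0.0, ())
        best_val, best_alloc = float("-inf"), ()
        for n in range(left + 1):              # give app i n servers
            val, rest = solve(i + 1, left - n)
            val += value_fns[i](n)
            if val > best_val:
                best_val, best_alloc = val, (n,) + rest
        return best_val, best_alloc
    return solve(0, total)
```

In the prototype the V_i come from each application's local TD(0) learner; here any callables will do.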
Slide 67: Important RL Issues
- Are the local applications really MDPs?
  - There may be history dependence in demand (e.g., closed-loop user activity)
  - There may be history dependence in performance (garbage collection)
  - We may not have full observability of the local state (lots of sensors needed): average demand, average response time, average CPU utilization, average memory utilization, current number of threads running, ...
- Can we learn fast enough?
  - Shouldn't take millions of value-table updates; at most a few tens of thousands
  - Start RL off from a good heuristic initial state
  - Hybrid learning: initial policy decisions made by a model-based approach
- Can we avoid excessive exploration penalties?
  - Epsilon-greedy (10% random allocation decisions) seems OK, but in general we need more intelligent methods
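The epsilon-greedy rule just mentioned, in a few lines (the 10% figure corresponds to epsilon = 0.1; the dictionary-of-values interface is ours):

```python
import random

def epsilon_greedy(values, epsilon=0.1, rng=random):
    """With probability epsilon pick a uniformly random action
    (exploration); otherwise pick the value-maximizing one
    (exploitation).  `values` maps each candidate action (e.g., a
    server count) to its current estimated value."""
    actions = list(values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: values[a])
```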
Slide 68: Online RL in the Trade3 Application Manager (AAAI 2005)
- Observed state: current demand λ only
- Arbiter action: number of servers provided (n)
- Instantaneous reward U: SLA payment
- Learns a long-range expected value function V(state, action) = V(λ, n) as a two-dimensional lookup table
- Data center results:
  - good asymptotic performance, but
  - poor performance during the long training period
  - the method scales poorly with state-space size
[Diagram: the Trade3 App Manager observes demand λ and reward U from its application environment, learns V(λ, n) by RL, and reports V(n) to the Resource Arbiter, which assigns servers]
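A sketch of the kind of table-based TD(0) backup such a learner performs on its V(λ, n) lookup table; the state encoding and the alpha/gamma settings below are illustrative, not the prototype's actual values.

```python
def td0_update(V, demand, n, reward, next_demand, next_n,
               alpha=0.2, gamma=0.9):
    """One TD(0) backup on a (demand, servers) lookup table V.
    Moves V[(demand, n)] toward reward + gamma * V[next state]."""
    s, s2 = (demand, n), (next_demand, next_n)
    V.setdefault(s, 0.0)
    V.setdefault(s2, 0.0)
    V[s] += alpha * (reward + gamma * V[s2] - V[s])   # TD-error step
    return V[s]
```

Each table cell must be visited to be learned, which is one concrete reason the lookup-table method scales poorly and why the later slides move to neural-net function approximation.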
Slide 69: Amazingly Enough, RL Works! :-)
Results of overnight training (25k RL updates, about 16 hours of real time) starting from a random initial condition.
Slide 70: Comparison of Performance: 2 Application Environments
Slide 71: 3 Application Environments: Performance
Slide 72: Will ML Without Built-In Knowledge Work?
Tabula rasa: "blank slate" (Latin)
Slide 73: A Hybrid Approach: Combining Knowledge and ML (Tesauro et al., ICAC 2006)
- Initial knowledge → behavioral data → ML → improved knowledge
- Several advantages:
  - No direct interface between ML and the initial knowledge; we don't engineer knowledge into the ML
  - The initial knowledge can be virtually anything: very simple (e.g., a crude heuristic), highly sophisticated (a multi-tier closed queuing network), or even human behavior
  - Can do multiple iterations to keep improving
Slide 74: Hybrid Reinforcement Learning Illustrated
- Run RL offline on (state, action, reward) data logged from the initial policy
- By Bellman's policy improvement theorem (1957), the learned V(state, action) defines a new policy guaranteed to be better than the original one
- Combines the best aspects of both RL and model-based (e.g., queuing) methods
- A very general method that automatically improves any existing systems-management policy
- In the data center prototype:
  - Implement the best queuing models within each Trade3 manager
  - Log system data in an overnight run (12-20 hours)
  - Train RL on the log data (2 CPU hours) → new value functions
  - Replace the queuing models with the RL value functions and rerun the experiment
Slide 75: Two Key Ingredients of the Trade3 Implementation
1. Delay-aware state representation
   - Include the previous allocation decision as part of the current state → V = V(λt, nt-1, nt)
   - Can learn to properly evaluate the switching delay (provided the delay < the allocation interval), e.g., can distinguish V(λ, 2, 3) from V(λ, 3, 3)
   - The delay need not be directly observable; RL only observes the delayed reward
   - Also handles transient suboptimal performance
2. Nonlinear function approximation (neural nets)
   - Generalizes across states and actions
   - Obviates visiting every state in the space
   - Greatly reduces the need for exploratory actions
   - Much better scaling to larger state spaces: from 2-3 state variables to potentially 20-30
   - But we lose guaranteed optimality
Slide 76: Results: Open Loop, No Switching Delay
Slide 77: Results: Closed Loop, No Switching Delay
Slide 78: Results: Effects of Switching Delay
Slide 79: Insights into Hybrid RL Outperformance
1. Less biased estimation errors: the queuing model predicts utility indirectly (RT → SLA(RT) → V), and the nonlinear SLA induces an overprovisioning bias; RL estimates utility directly → a less biased estimate of V
2. RL handles transients and switching delays, which steady-state queuing models cannot
3. RL learns to avoid thrashing
Slide 80: Policy Hysteresis in the Learned Value Function
Stable joint allocations (T1, T2, Batch) at fixed λ2.
Slide 81: Hybrid RL Learns Not to Thrash
[Figure: under closed-loop demand (customers in T1) with a 4.5 s allocation delay on T2, the queuing-model policy's server allocations to T1 and T2 oscillate, while hybrid RL's allocations stay stable]
Slide 82: Hybrid RL Does Less Swapping than QM
[Bar chart: average number of servers swapped per allocation decision for QM vs. hybrid RL across four experiments (open and closed loop, delay 0 and 4.5 s); RL shows less swapping than QM in each case, with values ranging from about 0.27 to 0.74]
Slide 83: Conclusions
- RL holds great promise for autonomic decision-making:
  - can learn without building explicit models
  - can achieve decision-theoretic, MDP-optimal policies
- Online RL feasibility was seen in a small-scale lab study; it requires good heuristic initialization
- Hybrid RL works quite well for server allocation:
  - combines the disparate strengths of RL and queuing models
  - exploits the domain knowledge built into the queuing model
  - but doesn't need access to that knowledge; it only uses the externally observable behavior of the queuing-model policy
- Potential for wide usage of RL in systems management:
  - managing other resource types: memory, storage, LPARs, etc.
  - managing control parameters: web server/OS/DB parameters, etc.
  - simultaneous management of multiple criteria: performance/utilization, performance/availability, etc.
- Thanks! Any questions?
Slide 84: Related Work
- R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In VLDB-94.
- S. Ma and J. L. Hellerstein. Mining Mutually Dependent Patterns for System Management. IEEE Journal on Selected Areas in Communications, 2002, pp. 726-735.
- R. Vilalta and S. Ma. Predicting Rare Events in Temporal Domains. ICDM-02.
- C. Domeniconi, C. Perng, R. Vilalta, and S. Ma. A Classification Approach for Prediction of Target Events in Temporal Sequences. PKDD-02.
- R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, and A. Sivasubramaniam. Critical Event Prediction for Proactive Management in Large-Scale Computer Clusters. KDD-03.
- J. L. Hellerstein, T. S. Jayram, and I. Rish. Recognizing End-User Transactions in Performance Management. AAAI-00.
- I. Rish, M. Brodie, S. Ma, N. Odintsova, A. Beygelzimer, G. Grabarnik, and K. Hernandez. Adaptive Diagnosis in Distributed Systems. IEEE Transactions on Neural Networks (special issue on Adaptive Learning Systems in Communication Networks), vol. 16, no. 5, pp. 1088-1109, September 2005.
- A. Beygelzimer, R. Linsker, G. Grinstein, and I. Rish. Improving Network Robustness by Edge Modification. Physica A, vol. 357, no. 3-4, pp. 593-612, November 2005.
- I. Rish. Distributed Systems Diagnosis Using Belief Propagation. In Proc. of the 43rd Annual Allerton Conference on Communication, Control and Computing, 2005.
- A. Zheng, I. Rish, and A. Beygelzimer. Efficient Test Selection in Active Diagnosis via Entropy Approximation. In Proc. of UAI-2005.
Slide 85: End