Title: Autonomic Computing: A New Challenge for Machine Learning (ECML-06 Tutorial)
Slide 1: Autonomic Computing: A New Challenge for Machine Learning (ECML-06 Tutorial)
Irina Rish and Gerry Tesauro, IBM T.J. Watson Research Center, Hawthorne, NY
Slide 2: Outline
- What is Autonomic Computing? Why use ML?
- Some Application Examples
- Part 1: Inference and Learning with Active Sampling
  - Active testing in Bayesian inference: problem diagnosis
  - Active learning in collaborative prediction: server selection
- Part 2: Decision Making and Reinforcement Learning
- Summary and Future Directions
Slide 3: Challenges in Systems Management
- Large-scale, heterogeneous distributed systems with highly dynamic, complex multi-component interactions
- Large volumes of real-time, high-dimensional data, but also much missing information and uncertainty
- Too much complexity, too few (skilled) administrators

The need for self-managing systems → autonomic computing
Slide 4: Evolution of Computing
Slide 5: What is Autonomic Computing?
Computing systems that manage themselves in accordance with high-level objectives from humans (Kephart and Chess, "A Vision of Autonomic Computing," IEEE Computer, 2003).

Self-management capabilities include:
- Self-Configuration: automated configuration of components and systems according to high-level policies; the rest of the system adjusts seamlessly.
- Self-Healing: automated detection, diagnosis, and repair of localized software/hardware problems.
- Self-Optimization: automatic and continual adaptive tuning of hundreds of parameters (database parameters, server parameters, ...) affecting performance and efficiency.
- Self-Protection: automated defense against malicious attacks or cascading failures; use early warning to anticipate and prevent system-wide failures.

A good application domain for ML: rich opportunities, little previously done.
Slide 6: Autonomic Computing Element Architecture
[Diagram: an autonomic element combines Inference, Learning, and Decision-making]
Slide 7: Machine Learning: Promises and Challenges
- Promise: machine learning is a natural solution to automation!
  - Avoids knowledge-intensive model building
  - Deals naturally with dynamicity and changes in system composition
  - Can handle complex, non-steady-state dynamical phenomena
- Challenges:
  - Curse of dimensionality: O(exp(N)) state space for N variables → representation and computation complexity
  - Data sparsity: largely unexplored state-action spaces → a challenge for prediction and decision-making
- Reverse the curse!
  - Squeeze out relevant information (generalization, dimensionality reduction)
  - Only do exploration that matters (active learning)
Slide 8: Reversing the Curse: A Unified Approach
[Diagram: a closed loop around the managed system]
- System → high-dimensional, high-volume raw data (challenges: high dimensionality, data sparsity)
- State-space data abstraction → compressed, informative data (challenge: scalable predictive algorithms)
- Performance prediction as a function of possible actions (classification, regression) → predicted performance
- Action evaluation and decision making (challenges: exploration vs. exploitation, cost-benefit tradeoffs) → actions fed back into the system
Slide 9: Some Application Examples
- Mining event data
- Transaction recognition
- Fault diagnosis
- Performance prediction
  - Server/provider selection problem
- Online resource allocation
  - Power and performance management
Slide 10: Example 1: Event Mining (for details, see Ma and Hellerstein)
Analyzing system event logs to extract interesting behavior patterns:
- Thousands of hosts
- Hundreds of event types
- Billions of events
- Various severity levels
  - Some high-severity events: Cisco_Link_Down, chassisMinorAlarm_On
  - Some low-severity events: tcpConnectClose, duplicate_ip
- Learning: dependencies among events (clustering, association rules)
- Inference: predict high-severity events, diagnose suspicious behavior
- Take corrective and/or preventive actions based on predictions
Slide 11: Event Prediction Problem
Slide 12: Classification-Based Event Prediction (for details, see Vilalta and Ma; Domeniconi et al.; Sahoo et al.)
Slide 13: Example 2: Users' Transaction Recognition
- Transaction recognition is needed for:
  - Building realistic workload models for performance testing
  - Quantifying end-user perception of performance (response times)
- Learning: segmentation and labeling of RPC streams
- Inference: predicting the most likely next transactions given the model
- Decision-making: resource allocation based on anticipated requests
Slide 14: Why Is It Hard? Why Learn from Data?
Example: EUTs (end-user transactions) and RPCs in Lotus Notes
Slide 15: Segmentation and Classification Subproblems (for details, see Hellerstein, Jayram, and Rish, AAAI-2000)
Classification (similar to text classification):
- Features: RPC occurrences ("bag of words") or RPC counts
- Accuracy results: Naive Bayes 86-88%, SVM 85-87%, decision tree 90-92%
Segmentation (similar to speech understanding, image segmentation):
- Dynamic programming (Viterbi search) with Naive Bayes
- Accuracy with Naive Bayes: 64% (a harder problem than classification!)
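As a minimal sketch of the classification subproblem, the snippet below is a multinomial Naive Bayes classifier over bag-of-RPCs features with Laplace smoothing. The RPC names and transaction labels are made up for illustration and are not taken from the Lotus Notes data.

```python
import math
from collections import Counter

class RPCNaiveBayes:
    """Multinomial Naive Bayes over bag-of-RPCs features (a sketch)."""

    def fit(self, sessions, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)                 # per-class session counts
        self.counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for rpcs, y in zip(sessions, labels):
            self.counts[y].update(rpcs)              # per-class RPC counts
            self.vocab.update(rpcs)
        return self

    def predict(self, rpcs):
        def log_score(c):
            total = sum(self.counts[c].values()) + len(self.vocab)
            s = math.log(self.prior[c])
            for r in rpcs:                           # Laplace-smoothed likelihoods
                s += math.log((self.counts[c][r] + 1) / total)
            return s
        return max(self.classes, key=log_score)
```

With RPC counts instead of occurrences, one would weight each term by its count; the structure is otherwise identical.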
Slide 16: Example 3: Network Monitoring Using Probes
Probes are end-to-end transactions: ping, traceroute, email and web access, e-business transactions. They are typically used for testing end-to-end performance (e.g., SLA compliance).
- Learning: dependency models (topology, routing), noise parameters
- Inference problem: diagnosis (why did a transaction time out?)
- Decision-making: cost-efficient active probing, optimal routing, etc.
Slide 17: Example 4: Content-Distribution Systems
Examples: Napster, Gnutella, IBM's downloadGrid
[Diagram: peers (p), a probe, a Management Center, a download request, and a complaint]
- Learning: performance prediction (latency, bandwidth)
- Inference: problem diagnosis (a node didn't reply: is it faulty?)
- Decision-making: best provider selection (e.g., max bandwidth)
Slide 18: Example 5: Allocating Server Resources in a Data Center
Scenario: a data center serving multiple customers, each running high-volume web applications with independent, time-varying workloads.
[Diagram: a Resource Arbiter maximizes business value across all customers; each customer (e.g., Citibank online banking) has an Application Manager with an SLA, a Router, and backend servers such as DB2]
Slide 19: Outline
- What is Autonomic Computing? Why use ML?
- Some Application Examples
- Part 1: Inference and Learning with Active Sampling
  - Active testing in Bayesian inference: problem diagnosis
  - Active learning in collaborative prediction: server selection
- Part 2: Decision Making and Reinforcement Learning
- Summary and Future Directions
Slide 20: Why Focus on Active Sampling?
- Cost-efficiency concerns:
  - A huge number of possible measurements → data-collection costs (instrumentation, storage, overhead due to invasive tests)
  - Huge volumes of data → complexity of data analysis
- Need to squeeze out only the most relevant information; otherwise "we are drowning in data but starving for knowledge"
- Needed: optimized measurement selection
- Good news: artificial systems are well suited for active sampling, being more flexible than some natural applications
Slide 21: Example: Network Monitoring via Probes
Probes are end-to-end transactions: ping, traceroute, email and web access, e-business transactions. They are typically used for testing end-to-end performance (e.g., SLA compliance).
- Inference problem: diagnosis (why did a transaction time out?)
- Decision-making: cost-efficient, active probe selection
Slide 22: Simple Example
Dependency matrix:
- Columns: components/nodes (hardware/software components)
- Rows: probes (ping, traceroute, email, web access)
- E.g.:
  - pWS: web-page access probe
  - pDBS: database query
  - pAS: application test
  - pingR: ping the router
  - pingWS: ping the web server
Slide 23: Probabilistic Inference in Bayesian Networks
Slide 24: Noisy-OR Bayesian Network Model
- With no noise, each probe outcome is a logical OR of its causes; probe outcomes define a set of constraints, e.g.:
  - t1 = X1 ∨ X2 ∨ X5
  - t2 = X1 ∨ X3 ∨ X6
  - t3 = X2 ∨ X3 ∨ X4
- With noise: noisy-OR (a causal independence model)
  - "Spurious" probes: unexpected success due to an inhibited cause
  - "Lost" probes: unexpected failure due to a hidden cause
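A minimal sketch of the noisy-OR likelihood just described, with an inhibition probability producing "spurious" probes and a leak probability producing "lost" ones (the parameter names are ours, not the slides'):

```python
def noisy_or(parent_states, inhibit=0.0, leak=0.0):
    """P(probe fails) under a noisy-OR model.

    Each faulty parent (state 1) independently triggers a failure, but
    its effect is inhibited with probability `inhibit` ("spurious"
    success); `leak` is the probability of failure even with no faulty
    parent ("lost" probe). With inhibit=leak=0 this is a logical OR.
    """
    p_no_trigger = 1.0 - leak
    for x in parent_states:
        if x:
            p_no_trigger *= inhibit
    return 1.0 - p_no_trigger
```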
Slide 25: Challenge: Make Diagnosis Cost-Efficient
- Need to reduce the cost of deploying and maintaining the physical infrastructure: probe stations, databases, reporting systems, staff
- However, we also need fast and accurate real-time diagnosis!

Multicriteria optimization: minimize
- the number of probe stations
- the number of probes
- the computational complexity of probe selection
- the expected diagnostic error
- the computational complexity of diagnosis

Pipeline: topology information → probe-station (source) selection and probe-set construction (off-line planning) → probe-set optimization (on-line, active)
Slide 26: Probe Source Selection: Greedy vs. Random (Odintsova and Rish, 2005)
Evaluated on the Watson network (router level), the IBM Research network (router level), random graphs, and scale-free networks, measuring the number of probes needed for fault detection against a bound on the number of sources.
The greedy heuristic appears to consistently beat random source placement.
Slide 27: Optimal Probe Set Selection
- Non-adaptive (off-line): given an unobserved variable X with distribution P(X), a set of possible probes S, and P(S|X), choose the smallest subset of tests T that allows the most accurate diagnosis of X (an NP-hard problem; Beygelzimer, 2003). However, a greedy "most-informative-next" heuristic search works quite well (Brodie, Rish, and Ma, 2001). In the zero-noise case, this yields a minimum probe subset for single-fault diagnosis.
- But an online, adaptive approach is even more efficient! It maximizes information gain given the outcomes of the previous probes, is more context-specific, and typically requires much less probing.
Slide 28: Active Online Probe Selection
Select the next probe that provides maximum information about the unknown system state.
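A sketch of this selection rule for the noise-free, single-fault case: given a prior over which component is faulty and a 0/1 dependency matrix (a probe fails iff it traverses the faulty component), pick the probe whose outcome maximizes the expected entropy reduction. This illustrates the most-informative-next idea; it is not the authors' implementation.

```python
import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

def most_informative_probe(prior, dep_rows):
    """Return (probe index, info gain) of the probe maximizing expected
    information gain about a single fault X.  dep_rows[t][i] == 1 iff
    probe t passes through component i; noise-free outcomes assumed."""
    h0 = entropy(prior)
    best_t, best_gain = None, -1.0
    for t, row in enumerate(dep_rows):
        p_fail = sum(p for p, d in zip(prior, row) if d)
        exp_h = 0.0
        for outcome, p_o in ((1, p_fail), (0, 1.0 - p_fail)):
            if p_o <= 0:
                continue
            # posterior: fault is in (outcome=1) or out of (outcome=0)
            # the set of components this probe covers
            post = [p / p_o if d == outcome else 0.0
                    for p, d in zip(prior, row)]
            exp_h += p_o * entropy(post)
        gain = h0 - exp_h
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

In the adaptive loop, one would condition the prior on each observed outcome and re-run the selection, which is exactly what makes it more context-specific than off-line selection.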
Slide 29: Active Diagnosis: Adaptive vs. Non-Adaptive Test Selection (Rish et al., IEEE Trans. Neural Networks, 2005)
[Table: datasets (Odyssey Mar-03, Feb-03, Nov-02; CRM1-CRM4; intranet1, intranet2) compared by number of nodes and number of tests under non-adaptive exact, non-adaptive greedy, and adaptive (average) selection, with savings of adaptive vs. exact and of active vs. greedy off-line selection]
Active vs. off-line selection: 60-76% savings in the number of tests.
Slide 30: Warning: Active Probe Selection Becomes Intractable for General Multi-Fault Diagnosis (Zhang, Rish, and Beygelzimer, UAI 2005)
- Without a k-fault assumption for small k, even myopic VOI (selecting the single next most-informative probe) becomes intractable, as it now requires generally intractable probabilistic inference.
- However, a decomposition of H(X,T) and a subsequent efficient approximation based on belief propagation are possible (see Zhang et al., UAI-05): it computes ALL information gains for candidate tests in one sweep and does not compromise much accuracy relative to the exact approach to active probe selection.
- Related work: Krause and Guestrin, 2005.
Oops... intractable!
Slide 31: Some Theoretical Bounds on Diagnostic Error in Bayes Nets (Rish, Allerton-2005)
Theorem: the diagnostic error, measured by the bit error rate (BER), is bounded from below in terms of c, the minimum number of children over all Xi, and the maximum prior. [The bound itself appeared as an equation image and is omitted here.]
Assumptions: regular random bipartite networks with n input nodes, m output nodes, and k parents per test node (and thus km/n children per hidden node); p = P(Xi = 1) is the fault prior (p < 0.5), q is the noise parameter, and qleak is the leak parameter.
Interpretation: m probes, each of length k, over randomly selected subsets of nodes. This is a somewhat unrealistic setting (in reality, probe selection is constrained), but convenient for an initial evaluation of diagnostic error.
Slide 32: Necessary Conditions for Zero-Error Diagnosis
Corollary 2 gives a necessary condition for achieving error-free diagnosis:
- More probes per node (m/n) are required for higher fault probability p
- Longer probes (larger k) yield lower requirements on m/n
- But the bounds are weak: can we find an ACHIEVABLE lower bound (just like Shannon's)?
Slide 33: Computational Complexity of Diagnosis
In our problems, the induced width w grows with the number of probe stations, and inference becomes intractable quite quickly (multiple probe stations are needed for more informative probe sets).
- What can we do?
  - Sometimes the problem structure is easy (e.g., trees, low-w graphs)
  - But in general, approximate inference is needed, e.g., loopy belief propagation.
Slide 34: Distributed Diagnosis
- RAIL: real-time active inference and learning engine
- EPP: end-to-end probing station
- In the corresponding Bayesian network, each RAIL corresponds to a region that includes all probes controlled by that RAIL and their respective nodes.
Slide 35: Factor Graphs and Belief Propagation
Map the Bayesian network to a factor graph: each variable is mapped to a variable node, and each function to a factor node.
Factor node ↔ probe; variable node ↔ component.
Parallel belief propagation: each RAIL updates the beliefs of its nodes and sends messages to neighboring RAILs that share common variable nodes.
Slide 36: Belief Propagation on Factor Graphs
The belief b(y) is the BP approximation of the marginal probability of variable y, computed from messages passed from variable nodes to factor nodes and from factor nodes to variable nodes.
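The message updates themselves appeared as figures on the slide; in standard sum-product notation for factor graphs they read as follows, where N(·) denotes a node's set of neighbors:

```latex
% variable node y to factor node f:
m_{y \to f}(y) \;=\; \prod_{g \in N(y)\setminus\{f\}} m_{g \to y}(y)

% factor node f to variable node y:
m_{f \to y}(y) \;=\; \sum_{\mathbf{x}_{N(f)\setminus\{y\}}}
    f\bigl(\mathbf{x}_{N(f)}\bigr)
    \prod_{z \in N(f)\setminus\{y\}} m_{z \to f}(z)

% belief (approximate marginal):
b(y) \;\propto\; \prod_{f \in N(y)} m_{f \to y}(y)
```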
Slide 37: BP Diagnosis on Internet-like Networks
- The INET topology generator was used to create a network of 487 nodes
- 387 probes were selected by greedy search for single-fault diagnosis
- Simulated varying levels of noise q and fault prior p
Findings:
- Error increases with growing fault probability p and noise level q
- The fault probability p has more impact on the error than the noise q
- Good news: in reality, p is usually quite low, so the error is small
Slide 38: Outline
- What is Autonomic Computing? Why use ML?
- Some Application Examples
- Part 1: Inference and Learning with Active Sampling
  - Active testing in Bayesian inference: problem diagnosis
  - Active learning in collaborative prediction: server selection
- Part 2: Decision Making and Reinforcement Learning
- Summary and Future Directions
Slide 39: Performance Prediction and Server Selection
- End-to-end performance between a pair of nodes: network latency, bandwidth, round-trip time, or any other QoS metric
- Knowing end-to-end performance is important in many applications:
  - Content-distribution systems (peer-to-peer, Grid): choosing the highest-bandwidth server from which to download an object
  - Distributed hash tables: routing a lookup request to the peer with the lowest latency
  - Overlay routing: selecting the lowest-latency peer to communicate with
- Example: IBM's downloadGrid
Slide 40: Approach: Collaborative Prediction (CP)
Problem: given a sparse matrix of previously observed user experiences (users' ratings for a set of products, or bandwidth between some client-server pairs), predict the unobserved entries.
Example: a bandwidth matrix for a 100x100 clients-by-servers subset of dGrid.
- How do we generalize from observed to unobserved entries (fill in the blank space)?
- Underlying assumption: matrix entries are NOT independent; e.g., similar nodes have similar performance
- Various approaches, mainly factorized models that assume hidden factors affecting the ratings: aspect model, pLSA, MCVQ, SVD, NMF, MMMF
Slide 41: Assumptions
- There is a small number of (hidden) factors behind user preferences that relate to (hidden) movie properties
- Movies have intrinsic values associated with these factors
- Users have intrinsic weights on these factors
- User ratings are weighted (linear) combinations of the movies' values
Factor models → dimensionality reduction (for a small number of factors)
Slide 42: Low-Rank Matrix Factorization
[Diagram: Y approximated by a rank-k product X = U V^T]
Objective: find a factorization X = U V^T that approximates Y and satisfies some regularization constraints (e.g., rank(X) < k). The loss function depends on the nature of your problem.
Slide 43: How to Solve It?
- Singular value decomposition (SVD): low-rank approximation
  - Assumes a fully observed Y and sum-squared loss
  - In collaborative prediction, Y is only partially observed, and low-rank approximation becomes a non-convex problem with many local minima
- Furthermore, we may not want sum-squared loss, but instead:
  - accurate predictions (0/1 loss, approximated by hinge loss)
  - cost-sensitive predictions (missing a good server vs. suggesting a bad one)
  - the cost of ranking, etc., depending on the decision algorithm using the predictions
- Instead, use the state-of-the-art Max-Margin Matrix Factorization (Srebro, 2005):
  - replaces the bounded-rank constraint with bounded norms of the U, V vectors
  - a convex optimization problem! It can be solved exactly by semi-definite programming
  - strongly related to learning max-margin classifiers (SVMs)
- We demonstrate MMMF on binary classification first
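MMMF itself needs a semi-definite-programming or other specialized solver. As a self-contained illustration of the plainer factorized-model baseline it improves on (not MMMF), here is matrix completion by alternating ridge regressions, fitting a rank-k factorization to the observed entries only:

```python
import numpy as np

def als_complete(Y, mask, k=2, lam=0.01, iters=50, seed=0):
    """Fill in a partially observed matrix with a rank-k factorization
    X = U @ V.T fit only on the observed entries (mask == 1).
    Alternating ridge regressions; a low-rank baseline, not MMMF."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = rng.normal(scale=0.1, size=(n, k))
    V = rng.normal(scale=0.1, size=(m, k))
    reg = lam * np.eye(k)
    for _ in range(iters):
        for i in range(n):                     # update each row factor
            o = mask[i] == 1
            U[i] = np.linalg.solve(V[o].T @ V[o] + reg, V[o].T @ Y[i, o])
        for j in range(m):                     # update each column factor
            o = mask[:, j] == 1
            V[j] = np.linalg.solve(U[o].T @ U[o] + reg, U[o].T @ Y[o, j])
    return U @ V.T
```

Because the fit uses only observed entries, the problem is non-convex (the local-minima issue noted above); MMMF's norm-based relaxation is precisely what restores convexity.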
Slide 44: Key Insight
Rows are feature vectors; columns are linear classifiers (weight vectors).
Each entry decomposes as X_ij = sign_ij x margin_ij, and the predictor outputs sign_ij: if X_ij > 0, classify as +1; otherwise classify as -1.
Slide 45: MMMF: Simultaneous Search for Low-Norm Feature Vectors and Max-Margin Classifiers
Slide 46: Our Contribution: Active Learning
MMMF works well, but it ignores a natural domain property: the possibility of active sampling (e.g., make user A connect to server 117 to improve our model).
[Figure: a matrix of predicted margins; entries near zero are the most uncertain]
Current active-SVM heuristic: actively query the minimum-margin sample (the most uncertain one).
Slide 47: Active Max-Margin Matrix Factorization (A-MMMF)
A-MMMF(M, s):
1. Given a sparse matrix Y, learn an approximation X = MMMF(Y)
2. Using the current predictions, actively select the best s samples and request their labels (e.g., test a client/server pair via an enforced download)
3. Add the new samples to Y
4. Repeat 1-3 until no significant improvement in prediction is likely

Active sampling:
- The idea is to eliminate as many wrong hypotheses (e.g., SVM separators) from consideration as possible (see later for more detail)
- The current approach uses the simplest minimum-margin heuristic for SVMs
- However, there is a variety of other SVM active-learning heuristics to try
- Ideally, a theoretically founded approach is desirable: a hard open problem
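The minimum-margin selection in step 2 can be sketched in a few lines: rank the unobserved entries of the predicted real-valued matrix by |X_ij| and query the smallest, i.e., the predictions closest to the decision boundary. The function below is our illustration, not the authors' code.

```python
def min_margin_queries(X, observed, budget):
    """Min-margin active sampling: among unobserved entries (i, j) of
    the predicted margin matrix X, return the `budget` entries with the
    smallest |X[i][j]| -- the most uncertain predictions."""
    ranked = sorted(
        (abs(X[i][j]), i, j)
        for i in range(len(X)) for j in range(len(X[0]))
        if (i, j) not in observed)
    return [(i, j) for _, i, j in ranked[:budget]]
```

Swapping the sort order gives the "least-uncertain" (max-margin, safe) strategy compared on a later slide.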
Slide 48: Results: Active vs. Random Sampling
Experiments on downloadGrid bandwidth prediction and PlanetLab latency prediction (accuracy vs. percentage of the initial data set sampled).
Active sampling gives a consistent improvement in classification accuracy, which leads to better decisions → higher bandwidth, faster downloads.
Slide 49: More Results on Latency Prediction
On the p2psim and NLANR-AMP data sets, we compare various active sampling strategies: most-uncertain (min-margin) and least-uncertain (max-margin, "safe") sample selection.
Slide 50: Results for Movie Rating Prediction
MovieLens data: prediction error and cost; prediction accuracy versus the cost of sampling under various strategies.
Slide 51: Conclusions
- A common challenge in systems-management applications: cost-efficient measurement selection
- A promising approach: cost-efficient active sampling
- Active sampling improves predictive accuracy while keeping the number of measurements low in several domains:
  - Online diagnosis
  - Bandwidth and latency prediction
- Future work:
  - Other systems applications and problems that need active sampling
  - Theoretical analysis of active-sampling performance
  - More efficient active-sampling approaches (better heuristics, non-myopic test selection)
Slide 52: Outline
- What is Autonomic Computing? Why use ML?
- Some Application Examples
- Part 1: Inference and Learning with Active Sampling
  - Active testing in Bayesian inference: problem diagnosis
  - Active learning in collaborative prediction: server selection
- Part 2: Decision Making and Reinforcement Learning
- Summary and Future Directions
Slide 53: Examples of Autonomic Decision-Making
- Systems performance management
  - Dynamic resource allocation: servers, threads, CPU slices, memory, storage, bandwidth
  - Online parameter tuning: MAXCLIENTS, timeout parameters, ...
  - Routing and scheduling
  - Access/flow control
  - Application placement
- Objectives: QoS/SLA objectives, consistency, customer retention, fairness, etc. (can get pretty nebulous)
Slide 54: More Decision-Making Examples
- Availability management
  - Knobs: data redundancy, server redundancy
  - Objectives: MTBF, RTO, RPO
- Power/thermal management
  - Knobs: CPU clock speeds, ambient temperature, fan speeds, ...
  - Objectives: minimize the cost of electricity, cooling, and heat degradation of equipment
- Background utility throttling: DB backups, re-indexing
- Utilization management
- Multi-criteria tradeoffs, e.g., performance + availability, performance + power, etc.
Slide 55: WebSphere On Demand Operating Environment
[Diagram: the WebSphere on demand Router performs classification, prioritization and flow control, and routing and load balancing across a WebSphere cell of five nodes hosting Stock Trading (high importance), Account Management (medium importance), and Financial Advice (low importance) applications; provisioning decisions are driven by application demand, resource state, and operational policy]
Slide 56: Application: Allocating Server Resources in a Data Center
Scenario: a data center serving multiple customers, each running high-volume web applications with independent, time-varying workloads (the setting of Slide 18: a Resource Arbiter maximizing business value across customers, each with an SLA-bound Application Manager).
Slide 57: ML for Autonomic Decision-Making
- The natural method of choice is Reinforcement Learning: it learns behavioral policies (State → Action)
- But there are several huge challenges in making this practical:
  - Non-Markovian effects (non-stationarity, history dependence, partial observability) may be pronounced
  - The scale and complexity of big distributed systems is daunting: data centers and large multi-tier web applications can easily have many thousands of state variables
  - We need RL approaches that scale to huge state spaces as well as huge action spaces
  - The cost of acquiring training data is quite real in live systems: the cost of poor performance of the initial policy, and the cost of exploration
Slide 58: Needed Enhancements to Vanilla RL
- Active learning, i.e., the exploration-exploitation tradeoff:
  - An exact solution is known for bandit problems (stateless MDPs)
  - Several established heuristic approaches (Boltzmann exploration, interval estimation, ...)
  - Principled Bayes-RL approaches are beginning to be developed (Wang et al., ICML 2005; Poupart et al., ICML 2006)
- Need to address the curse of dimensionality:
  - Robust function approximation: exploit the smooth, monotonic dependence on state variables found in many systems-management applications
  - Feature selection: a well-developed literature, mainly for supervised learning (classification/regression)
  - State abstraction: structured MDP approaches, options, PSRs
  - Hidden-state inference: POMDP literature, hidden aspect model learning
Slide 59: What's Wrong with Standard Model-Based Approaches? (Why Bother with ML?)
- Model-based approach: design an appropriate system performance model (e.g., control-theoretic or queuing-theoretic)
  - Estimate model parameters offline or online
  - The model estimates how the control variable affects system performance
  - Use optimization methods to select the best control-variable setting (if exhaustive search is infeasible)
- Two main limitations:
  - Model design is difficult and knowledge-intensive
  - Model assumptions don't exactly match the real system
- Two prospective benefits of the machine learning approach:
  - Avoids the knowledge bottleneck
  - Can achieve more principled, MDP-optimal policies that properly account for the long-range dynamic consequences of actions
Slide 60: A Knowledge Bottleneck in Autonomic Computing
Slide 61: Case Studies: RL for Dynamic Resource Allocation
- Realistic prototype data center:
  - Real servers and multiple web-based transactional workloads
  - Realistic time-varying demand in each workload
  - Dynamically allocate servers to optimize SLA payments
- Decompositional online RL approach:
  - Enables scalability to many workloads
  - Learns local value functions within each workload
- Hybrid RL approach:
  - Collect data using an external (queuing-model-based) policy to make allocation decisions
  - Train RL in batch mode on the collected data
  - The learned policy outperforms the original policy
Slide 62: Application: Allocating Server Resources in a Data Center
Scenario: a data center serving multiple customers, each running high-volume web applications with independent, time-varying workloads (as on Slides 18 and 56).
Slide 63: Data Center Prototype: Implementation
- Real servers: a cluster of 20 IBM eServer xSeries machines (RedHat Linux)
- Realistic web-based workload: Trade3 (online trading emulation), running on top of WebSphere (web application platform) and DB2 (database management software)
- Realistic demand generation:
  - Open-loop scenario: Poisson HTTP requests with a time-varying mean arrival rate λ
  - Closed-loop scenario: a finite number of customers M with a fixed think-time distribution; M varies with time
  - Variations in M or λ are governed by a stochastic time-series model of web traffic (Squillante, Yao, and Zhang, 1999)
Slide 64: Data Center Prototype: Experimental Setup
[Diagram: a Resource Arbiter maximizes total SLA revenue, exchanging Value(servers) estimates with three App Managers every 5 seconds; two Trade3 environments (WebSphere 5.1 + DB2, driven by time-varying HTTP demand, with SLAs on response time) and one Batch environment share a pool of 8 xSeries servers]
Slide 65: Will ML Without Built-In Knowledge Work?
Tabula rasa: "blank slate" (Latin)
Slide 66: Global RL versus Local RL
- One approach: make the Resource Arbiter a global Q-learner
  - Advantages: the arbiter's problem is a true MDP, so we can rely on a convergence guarantee
  - Main disadvantage: the arbiter's state space is the huge cross product of all local state spaces → a serious curse of dimensionality with many applications
- Alternative approach: local RL
  - Each application does local TD(0) based on local state, local provisioning, and local reward → learns a local value function
  - Each application conveys its current V(resource) estimates to the arbiter
  - The arbiter then acts to maximize the sum of the current value functions
  - Local learning should be much easier than global learning, but we no longer have a convergence guarantee
  - Related work: Russell and Zimdars, ICML-03 (local rewards only)
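The arbiter's step, maximizing the sum of the reported local value functions over feasible server splits, can be sketched as a small dynamic program. This is our illustration of the idea (fine for a handful of applications); the slides do not specify the arbiter's actual optimization method.

```python
from functools import lru_cache

def arbiter_allocate(value_fns, total):
    """Split `total` servers among applications to maximize the sum of
    their reported value functions V_i(n_i).  Returns (best total
    value, allocation tuple).  Exhaustive DP over applications."""
    @lru_cache(maxsize=None)
    def solve(i, left):
        if i == len(value_fns):
            return (0.0, ())
        best_val, best_alloc = float("-inf"), ()
        for n in range(left + 1):              # give app i n servers
            val, rest = solve(i + 1, left - n)
            val += value_fns[i](n)
            if val > best_val:
                best_val, best_alloc = val, (n,) + rest
        return best_val, best_alloc
    return solve(0, total)
```

In the prototype the V_i come from each application's local TD(0) learner; here any callables will do.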
Slide 67: Important RL Issues
- Are the local applications really MDPs?
  - There may be history dependence in demand (e.g., closed-loop user activity)
  - There may be history dependence in performance (garbage collection)
  - We may not have full observability of the local state (lots of sensors needed): average demand, average response time, average CPU utilization, average memory utilization, current number of threads running, ...
- Can we learn fast enough?
  - Shouldn't take millions of value-table updates; at most a few tens of thousands
  - Start RL off from a good heuristic initial state
  - Hybrid learning: initial policy decisions made by a model-based approach
- Can we avoid excessive exploration penalties?
  - Epsilon-greedy (10% random allocation decisions) seems OK, but in general we need more intelligent methods
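The epsilon-greedy rule just mentioned, in a few lines (the 10% figure corresponds to epsilon = 0.1; the dictionary-of-values interface is ours):

```python
import random

def epsilon_greedy(values, epsilon=0.1, rng=random):
    """With probability epsilon pick a uniformly random action
    (exploration); otherwise pick the value-maximizing one
    (exploitation).  `values` maps each candidate action (e.g., a
    server count) to its current estimated value."""
    actions = list(values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: values[a])
```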
Slide 68: Online RL in the Trade3 Application Manager (AAAI 2005)
- Observed state: current demand λ only
- Arbiter action: number of servers provided (n)
- Instantaneous reward U: SLA payment
- Learns a long-range expected value function V(state, action) = V(λ, n) as a two-dimensional lookup table
- Data center results:
  - good asymptotic performance, but
  - poor performance during the long training period
  - the method scales poorly with state-space size
[Diagram: the Trade3 App Manager observes demand λ and reward U from its application environment, learns V(λ, n) by RL, and reports V(n) to the Resource Arbiter, which assigns servers]
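A sketch of the kind of table-based TD(0) backup such a learner performs on its V(λ, n) lookup table; the state encoding and the alpha/gamma settings below are illustrative, not the prototype's actual values.

```python
def td0_update(V, demand, n, reward, next_demand, next_n,
               alpha=0.2, gamma=0.9):
    """One TD(0) backup on a (demand, servers) lookup table V.
    Moves V[(demand, n)] toward reward + gamma * V[next state]."""
    s, s2 = (demand, n), (next_demand, next_n)
    V.setdefault(s, 0.0)
    V.setdefault(s2, 0.0)
    V[s] += alpha * (reward + gamma * V[s2] - V[s])   # TD-error step
    return V[s]
```

Each table cell must be visited to be learned, which is one concrete reason the lookup-table method scales poorly and why the later slides move to neural-net function approximation.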
Slide 69: Amazingly Enough, RL Works! :-)
Results of overnight training (25k RL updates, about 16 hours of real time) starting from a random initial condition.
Slide 70: Comparison of Performance: 2 Application Environments
Slide 71: 3 Application Environments: Performance
Slide 72: Will ML Without Built-In Knowledge Work?
Tabula rasa: "blank slate" (Latin)
Slide 73: A Hybrid Approach: Combining Knowledge and ML (Tesauro et al., ICAC 2006)
- Initial knowledge → behavioral data → ML → improved knowledge
- Several advantages:
  - No direct interface between ML and the initial knowledge; we don't engineer knowledge into the ML
  - The initial knowledge can be virtually anything: very simple (e.g., a crude heuristic), highly sophisticated (a multi-tier closed queuing network), or even human behavior
  - Can do multiple iterations to keep improving
Slide 74: Hybrid Reinforcement Learning Illustrated
- Run RL offline on (state, action, reward) data logged from the initial policy
- By Bellman's policy improvement theorem (1957), the learned V(state, action) defines a new policy guaranteed to be better than the original one
- Combines the best aspects of both RL and model-based (e.g., queuing) methods
- A very general method that automatically improves any existing systems-management policy
- In the data center prototype:
  - Implement the best queuing models within each Trade3 manager
  - Log system data in an overnight run (12-20 hours)
  - Train RL on the log data (2 CPU hours) → new value functions
  - Replace the queuing models with the RL value functions and rerun the experiment
Slide 75: Two Key Ingredients of the Trade3 Implementation
1. Delay-aware state representation
   - Include the previous allocation decision as part of the current state → V = V(λt, nt-1, nt)
   - Can learn to properly evaluate the switching delay (provided the delay < the allocation interval), e.g., can distinguish V(λ, 2, 3) from V(λ, 3, 3)
   - The delay need not be directly observable; RL only observes the delayed reward
   - Also handles transient suboptimal performance
2. Nonlinear function approximation (neural nets)
   - Generalizes across states and actions
   - Obviates visiting every state in the space
   - Greatly reduces the need for exploratory actions
   - Much better scaling to larger state spaces: from 2-3 state variables to potentially 20-30
   - But we lose guaranteed optimality
Slide 76: Results: Open Loop, No Switching Delay
Slide 77: Results: Closed Loop, No Switching Delay
Slide 78: Results: Effects of Switching Delay
Slide 79: Insights into Hybrid RL Outperformance
1. Less biased estimation errors: the queuing model predicts utility indirectly (RT → SLA(RT) → V), and the nonlinear SLA induces an overprovisioning bias; RL estimates utility directly → a less biased estimate of V
2. RL handles transients and switching delays, which steady-state queuing models cannot
3. RL learns to avoid thrashing
Slide 80: Policy Hysteresis in the Learned Value Function
Stable joint allocations (T1, T2, Batch) at fixed λ2.
Slide 81: Hybrid RL Learns Not to Thrash
[Figure: under closed-loop demand (customers in T1) with a 4.5 s allocation delay on T2, the queuing-model policy's server allocations to T1 and T2 oscillate, while hybrid RL's allocations stay stable]
Slide 82: Hybrid RL Does Less Swapping than QM
[Bar chart: average number of servers swapped per allocation decision for QM vs. hybrid RL across four experiments (open and closed loop, delay 0 and 4.5 s); RL shows less swapping than QM in each case, with values ranging from about 0.27 to 0.74]
Slide 83: Conclusions
- RL holds great promise for autonomic decision-making:
  - can learn without building explicit models
  - can achieve decision-theoretic, MDP-optimal policies
- Online RL feasibility was seen in a small-scale lab study; it requires good heuristic initialization
- Hybrid RL works quite well for server allocation:
  - combines the disparate strengths of RL and queuing models
  - exploits the domain knowledge built into the queuing model
  - but doesn't need access to that knowledge; it only uses the externally observable behavior of the queuing-model policy
- Potential for wide usage of RL in systems management:
  - managing other resource types: memory, storage, LPARs, etc.
  - managing control parameters: web server/OS/DB parameters, etc.
  - simultaneous management of multiple criteria: performance/utilization, performance/availability, etc.
- Thanks! Any questions?
Slide 84: Related Work
- R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In VLDB-94.
- S. Ma and J. L. Hellerstein. Mining Mutually Dependent Patterns for System Management. IEEE Journal on Selected Areas in Communications, 2002, pp. 726-735.
- R. Vilalta and S. Ma. Predicting Rare Events in Temporal Domains. ICDM-02.
- C. Domeniconi, C. Perng, R. Vilalta, and S. Ma. A Classification Approach for Prediction of Target Events in Temporal Sequences. PKDD-02.
- R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, and A. Sivasubramaniam. Critical Event Prediction for Proactive Management in Large-Scale Computer Clusters. KDD-03.
- J. L. Hellerstein, T. S. Jayram, and I. Rish. Recognizing End-User Transactions in Performance Management. AAAI-00.
- I. Rish, M. Brodie, S. Ma, N. Odintsova, A. Beygelzimer, G. Grabarnik, and K. Hernandez. Adaptive Diagnosis in Distributed Systems. IEEE Transactions on Neural Networks (special issue on Adaptive Learning Systems in Communication Networks), vol. 16, no. 5, pp. 1088-1109, September 2005.
- A. Beygelzimer, R. Linsker, G. Grinstein, and I. Rish. Improving Network Robustness by Edge Modification. Physica A, vol. 357, no. 3-4, pp. 593-612, November 2005.
- I. Rish. Distributed Systems Diagnosis Using Belief Propagation. In Proc. of the 43rd Annual Allerton Conference on Communication, Control and Computing, 2005.
- A. Zheng, I. Rish, and A. Beygelzimer. Efficient Test Selection in Active Diagnosis via Entropy Approximation. In Proc. of UAI-2005.
Slide 85: End