Autonomic Computing: A New Challenge for Machine Learning (ECML-06 Tutorial)
1
Autonomic Computing: A New Challenge for Machine Learning
ECML-06 Tutorial
Irina Rish, Gerry Tesauro
IBM T.J. Watson Research Center, Hawthorne, NY
2
Outline
  • What is Autonomic Computing? Why use ML?
  • Some Application Examples
  • Part 1: Inference and Learning with Active Sampling
  • Active testing in Bayesian inference: problem diagnosis
  • Active learning in collaborative prediction: server selection
  • Part 2: Decision Making and Reinforcement Learning
  • Summary and Future Directions

3
Challenges in Systems Management
  • Large-scale, heterogeneous distributed systems
    with highly dynamic, complex multi-component
    interactions
  • Large volumes of real-time high-dimensional data,
    but also lots of missing information and
    uncertainty
  • Too much complexity, too few (skilled)
    administrators

Need for self-managing systems → autonomic computing
4
Evolution of Computing
5
What is Autonomic Computing?
"Computing systems that manage themselves in accordance with high-level objectives from humans" (Kephart and Chess, "A Vision of Autonomic Computing", IEEE Computer, 2003)
  • Self-management capabilities include:
  • Self-Configuration: automated configuration of components and systems according to high-level policies; the rest of the system adjusts seamlessly.
  • Self-Healing: automated detection, diagnosis, and repair of localized software/hardware problems.
  • Self-Optimization: automatic and continual adaptive tuning of hundreds of parameters (database parameters, server parameters, ...) affecting performance and efficiency.
  • Self-Protection: automated defense against malicious attacks or cascading failures; use early warning to anticipate and prevent system-wide failures.
  • A good application domain for ML: rich opportunities, little prior work

6
Autonomic Computing Element Architecture
[Diagram: an autonomic element combines Inference, Learning, and Decision-making]
7
Machine Learning: Promises and Challenges
  • Promise: machine learning is a natural solution to automation!
  • - Avoids knowledge-intensive model building
  • - Deals naturally with dynamicity and changes in system composition
  • - Can deal with complex non-steady-state dynamical phenomena
  • Challenges:
  • - Curse of dimensionality: O(exp(N)) state space for N variables → representation and computation complexity
  • - Data sparsity: largely unexplored state-action spaces → a challenge for prediction and decision-making
  • Reverse the curse!
  • - squeeze out relevant information (generalization, dimensionality reduction)
  • - only do exploration that matters (active learning)


8
Reversing the Curse: A Unified Approach
[Pipeline diagram: System → high-dimensional, high-volume raw data → State-Space Data Abstraction → compressed, informative data → Performance Prediction (classification, regression; as a function of possible actions) → predicted performance → Action Evaluation / Decision Making → actions fed back to the system]
Key concerns along the pipeline: high dimensionality and data sparsity; scalable predictive algorithms; exploration-vs-exploitation and cost-benefit tradeoffs.
9
Some Application Examples
  • Mining Event Data
  • Transaction Recognition
  • Fault Diagnosis
  • Performance Prediction
  • Server provider selection problem
  • Online Resource Allocation
  • Power and performance management

10
Example 1: Event Mining (for details, see Ma and Hellerstein)
Analyzing system event logs to extract interesting behavior patterns:
  • Thousands of hosts
  • Hundreds of event types
  • Billions of events
  • Various severity levels
  • Some high-severity events: Cisco_Link_Down, chassisMinorAlarm_On
  • Some low-severity events: tcpConnectClose, duplicate_ip

- Learning: dependencies among events (clustering, association rules)
- Inference: predict high-severity events, diagnose suspicious behavior
- Decision-making: take corrective and/or preventive actions based on predictions
11
Event Prediction Problem
12
Classification-based Event Prediction (for details, see Vilalta and Ma; Domeniconi et al.; Sahoo et al.)
13
Example 2: User Transaction Recognition
  • Transaction recognition is needed for:
  • Building realistic workload models for performance testing
  • Quantifying end-user perception of performance (response times)

- Learning: segmentation and labeling of RPC streams
- Inference: predicting the most likely next transactions given the model
- Decision-making: resource allocation based on anticipated requests
14
Why is it hard? Why learn from data?
Example: EUTs and RPCs in Lotus Notes
15
Segmentation and Classification Subproblems (for details, see Hellerstein, Jayram, Rish, AAAI-2000)
Classification (similar to text classification):
  • Features: RPC occurrences ("bag of words") or RPC counts
  • Accuracy results: Naïve Bayes 86-88%, SVM 85-87%, Decision Tree 90-92%

Segmentation (similar to speech understanding or image segmentation):
  • Dynamic programming (Viterbi search) with Naïve Bayes
  • Accuracy results with Naïve Bayes: 64% (a harder problem than classification!)
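The segmentation step pairs a Naïve Bayes observation model with Viterbi dynamic programming. As a rough sketch, here is generic Viterbi decoding over a toy two-state model; the transaction names, observation symbols, and all probabilities are invented for illustration, not taken from the Lotus Notes data.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Standard Viterbi DP: most likely state sequence for an observation stream."""
    V = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: V[-1][r] + log_trans[r][s])
            row[s] = V[-1][prev] + log_trans[prev][s] + log_emit[s][o]
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):       # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1]

lg = math.log
states = ["OpenDB", "SendMail"]      # toy transaction types
log_start = {s: lg(0.5) for s in states}
log_trans = {"OpenDB": {"OpenDB": lg(0.9), "SendMail": lg(0.1)},
             "SendMail": {"OpenDB": lg(0.1), "SendMail": lg(0.9)}}
log_emit = {"OpenDB": {"read": lg(0.8), "send": lg(0.2)},
            "SendMail": {"read": lg(0.2), "send": lg(0.8)}}
print(viterbi(["read", "read", "send", "send"], states, log_start, log_trans, log_emit))
```

The sticky transition matrix is what makes this a segmenter rather than a per-RPC classifier: it discourages switching labels mid-transaction.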

16
Example 3: Network Monitoring Using Probes
Probes are end-to-end transactions: ping, trace-route, email- and web-access, e-business transactions. They are typically used for testing end-to-end performance (e.g., SLA compliance).
- Learning: dependency models (topology, routing), noise parameters
- Inference: problem diagnosis (why did a transaction time out?)
- Decision-making: cost-efficient active probing, optimal routing, etc.
17
Example 4: Content-Distribution Systems
Examples: Napster, Gnutella, IBM's downloadGrid
[Diagram: a network of peers (p) and probes, with a Management Center receiving a download request and a complaint about a failed node (x)]
- Learning: performance prediction (latency, bandwidth)
- Inference: problem diagnosis (a node didn't reply: is it faulty?)
- Decision-making: best provider selection (e.g., max bandwidth)
18
Example 5: Allocating Server Resources in a Data Center
  • Scenario: a data center serving multiple customers, each running high-volume web apps with independent time-varying workloads

[Diagram: a Resource Arbiter maximizes business value across all customers; each customer (e.g., Citibank online banking) has an Application Manager with an SLA, a Router, server pools, and DB2]
19
Outline
  • What is Autonomic Computing? Why use ML?
  • Some Application Examples
  • Part 1: Inference and Learning with Active Sampling
  • Active testing in Bayesian inference: problem diagnosis
  • Active learning in collaborative prediction: server selection
  • Part 2: Decision Making and Reinforcement Learning
  • Summary and Future Directions

20
Why Focus on Active Sampling?
  • Cost-efficiency concerns:
  • Huge number of possible measurements → data collection costs (instrumentation, storage, overhead due to invasive tests)
  • Huge volumes of data → complexity of data analysis
  • Need to squeeze out only the most relevant information; otherwise we are "drowning in data but starving for knowledge"
  • Needed: optimized measurement selection
  • Good news: artificial systems are well suited for active sampling (more flexible than some natural applications)

21
Example: Network Monitoring via Probes
Probes are end-to-end transactions: ping, trace-route, email- and web-access, e-business transactions. They are typically used for testing end-to-end performance (e.g., SLA compliance).
- Inference: problem diagnosis (why did a transaction time out?)
- Decision-making: cost-efficient, active probe selection
22
Simple Example
Dependency matrix:
  • Columns: components/nodes (hardware/software components)
  • Rows: probes (ping, trace-route, email, web-access)
  • E.g.:
  • pWS: Web page access probe
  • pDBS: database query
  • pAS: application test
  • pingR: ping router
  • pingWS: ping Web Server
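To make the dependency-matrix idea concrete, here is a minimal sketch of zero-noise, single-fault diagnosis from probe outcomes. The probe names follow the slide, but the matrix entries and the network path they encode are invented for illustration.

```python
# Hypothetical dependency matrix: rows = probes, columns = components;
# D[i][j] = 1 if probe i traverses component j.
PROBES = ["pingR", "pingWS", "pAS", "pDBS", "pWS"]
COMPONENTS = ["Router", "WebServer", "AppServer", "DB"]
D = [
    [1, 0, 0, 0],  # pingR: router only
    [1, 1, 0, 0],  # pingWS: router -> web server
    [1, 1, 1, 0],  # pAS: application test
    [1, 1, 1, 1],  # pDBS: database query, full path
    [1, 1, 0, 0],  # pWS: web page access
]

def diagnose(outcomes):
    """Zero-noise, single-fault diagnosis: a successful probe clears every
    component it touches; a failed probe confines the fault to the
    components it touches."""
    suspects = set(range(len(COMPONENTS)))
    for i, ok in enumerate(outcomes):
        touched = {j for j in range(len(COMPONENTS)) if D[i][j]}
        if ok:
            suspects -= touched
        else:
            suspects &= touched
    return [COMPONENTS[j] for j in sorted(suspects)]

# pingR, pingWS, pWS succeed; pAS and pDBS fail -> the AppServer is isolated
print(diagnose([True, True, False, False, True]))
```

Note how the rows of the matrix act as codewords: probes are informative exactly when their rows distinguish the components.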

23
Probabilistic Inference in Bayesian Networks
24
Noisy-OR Bayesian Network Model
  • With no noise, each test is a logical OR of the components it touches
  • Probe outcomes define a set of constraints:
  • T1 = X1 ∨ X2 ∨ X5
  • T2 = X1 ∨ X3 ∨ X6
  • T3 = X2 ∨ X3 ∨ X4

With noise, this becomes a noisy-OR (causal independence) model:
"Spurious" probes (unexpected success due to an inhibited cause)
"Lost" probes (unexpected failure due to a hidden cause)
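The noisy-OR likelihood fits in a few lines. This sketch uses an inhibition probability and a leak probability matching the slide's "lost" and "spurious" probes; the parameter names and the numeric values are ours.

```python
def noisy_or_fail_prob(n_faulty_parents, q_inhibit, q_leak):
    """P(probe fails | number of faulty components it traverses) under
    noisy-OR: each faulty parent is independently inhibited ("lost") with
    probability q_inhibit, and a leak causes spurious failures with
    probability q_leak even when all parents are healthy."""
    p_no_failure = (1.0 - q_leak) * (q_inhibit ** n_faulty_parents)
    return 1.0 - p_no_failure

# With q_inhibit = q_leak = 0 the model reduces to the logical OR:
print(noisy_or_fail_prob(0, 0.0, 0.0))  # healthy path never fails -> 0.0
print(noisy_or_fail_prob(1, 0.0, 0.0))  # any fault always shows -> 1.0
print(noisy_or_fail_prob(2, 0.2, 0.05))
```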
25
Challenge: Make Diagnosis Cost-Efficient
  • Need to reduce the cost of deploying and maintaining the physical infrastructure: probe stations, databases, reporting systems, staff
  • However, we also need fast and accurate real-time diagnosis!

Multicriteria optimization: need to minimize
  • the number of probe stations
  • the number of probes
  • computational complexity of probe selection
  • expected diagnostic error
  • computational complexity of diagnosis

[Pipeline: topology information → probe station (source) selection → probe set construction (off-line planning) → probe set optimization (on-line, active)]
26
Probe Source Selection: Greedy vs. Random (Odintsova and Rish, 2005)
[Plots: number of probes for fault detection and bound on the number of sources, on the Watson network (router level), the IBM Research network (router level), random graphs, and scale-free networks]
The greedy heuristic approach seems to consistently beat random source placement.
27
Optimal Probe Set Selection
  • Non-adaptive (off-line): given an unobserved variable X, its distribution P(X), a set of possible probes S, and P(S|X), choose the smallest subset of tests T that allows the most accurate diagnosis of X (an NP-hard problem; Beygelzimer 2003)
  • However, a greedy "most-informative-next" heuristic search works quite well (Brodie, Rish, Ma 2001)

This yields a minimum probe subset for single-fault diagnosis in the case of zero noise.
  • But an online, adaptive approach is even more efficient!
  • It maximizes information gain given the outcomes of the previous probes: more context-specific, and typically requires much less probing
28
Active Online Probe Selection
Select the next probe that provides maximum information about the unknown system state.
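A myopic version of this rule can be sketched as follows. In the zero-noise, single-fault setting, a probe's outcome is a deterministic function of the fault location, so its expected information gain equals the entropy of its outcome distribution under the current posterior. The dependency matrix and prior below are invented for illustration.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def best_next_probe(prior, D, remaining):
    """Myopic active selection: pick the probe whose binary outcome has
    maximum entropy under the current posterior over the fault location.
    With zero noise, the conditional entropy of the outcome given the
    fault is zero, so this equals the information gain."""
    best, best_gain = None, -1.0
    for i in remaining:
        p_fail = sum(prior[j] for j in range(len(prior)) if D[i][j])
        gain = entropy([p_fail, 1.0 - p_fail])
        if gain > best_gain:
            best, best_gain = i, gain
    return best

# Toy dependency matrix (3 probes x 4 components) and a uniform prior:
D = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 0, 1]]
prior = [0.25, 0.25, 0.25, 0.25]
print(best_next_probe(prior, D, remaining=[0, 1, 2]))  # probe 0 splits the prior 50/50
```

After each observed outcome the posterior is re-normalized over the consistent faults and the rule is applied again, which is what makes the adaptive scheme context-specific.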
29
Active Diagnosis: Adaptive vs. Non-Adaptive Test Selection (Rish et al., IEEE Trans. NN, 2005)
[Table: # of nodes, # of tests, non-adaptive (exact), non-adaptive (greedy), and adaptive (avg) test counts, with savings of adaptive vs. exact and active vs. greedy off-line, for the Odyssey (Mar03, Feb03, Nov02), CRM1-CRM4, intranet1, and intranet2 networks]
Active vs. offline: 60-76% savings in the number of tests.
30
Warning: active probe selection becomes intractable for general multi-fault diagnosis (Zheng, Rish, Beygelzimer, UAI 2005)
  • Without a k-fault assumption for small k, even myopic VOI (selecting the single next most-informative probe) becomes intractable, as it now requires generally intractable probabilistic inference
  • However, a decomposition of H(X,T) and a subsequent efficient approximation based on belief propagation is possible (see Zheng et al., UAI-05): it computes ALL information gains for candidate tests in one sweep and does not compromise much accuracy w.r.t. the exact approach to active probe selection
  • Related work: Krause, Guestrin 2005

Oops... intractable!
31
Some Theoretical Bounds on Diagnostic Error in Bayes Nets (Rish, Allerton-2005)
Theorem: the diagnostic error, measured by the bit error rate (BER), is bounded from below by an expression [formula shown as an image in the original slides] involving c, the minimum number of children of any Xi, and the maximum prior.
Assumptions: regular random bipartite networks with n input nodes, m output nodes, and k parents per test node (and thus km/n children per hidden node); p = P(Xi = 1) is the fault prior (p < 0.5); q is the noise parameter and q_leak the leak parameter.
Interpretation: m probes, each of length k, over randomly selected subsets of nodes. This is a somewhat unrealistic setting (in reality, probe selection is constrained), but convenient for an initial evaluation of diagnostic error.
32
Necessary Conditions for Zero-Error Diagnosis
Corollary 2: a necessary condition for achieving error-free diagnosis is [formula shown as an image in the original slides].
  • More probes per node (m/n) are required for higher fault probability p
  • Longer probes (larger k) yield lower requirements for m/n
  • But the bounds are weak: can we find an ACHIEVABLE lower bound?
  • (just like Shannon's?)

33
Computational Complexity of Diagnosis
In our problems, the induced width w grows with the increasing number of probe stations, and inference becomes intractable quite quickly (multiple probe stations are needed to obtain more informative probe sets).
  • What can we do?
  • Sometimes the problem structure is easy (e.g., trees, low-w graphs)
  • But in general, approximate inference is needed, e.g., loopy belief propagation.

34
Distributed Diagnosis
  • RAIL: real-time active inference and learning engine
  • EPP: end-to-end probing station
  • In the corresponding Bayesian network, each RAIL corresponds to a region that includes all probes controlled by that RAIL and their respective nodes
35
Factor Graphs and Belief Propagation
Map the Bayesian network to a factor graph: each variable is mapped to a variable node, and each function is mapped to a factor node.
Factor node ↔ probe; variable node ↔ component.
Parallel belief propagation: each RAIL updates the beliefs of its nodes and sends messages to neighbor RAILs sharing common variable nodes.
36
Belief Propagation on Factor Graphs
The belief b(Y) is the BP approximation of the marginal probability. Two message updates alternate [formulas shown as images in the original slides]: from variable node to factor node, and from factor node to variable node.
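The two update rules can be written out for a tiny factor graph. This sketch, with invented factor tables, runs the message schedule on a two-variable tree, where BP is exact: variable-to-factor messages multiply the other incoming factor messages, and factor-to-variable messages sum out the factor's other variables.

```python
import numpy as np

# Sum-product on a two-variable factor graph with binary variables:
# a unary factor f1(x1) and a pairwise factor f12(x1, x2).
f1 = np.array([0.9, 0.1])
f12 = np.array([[0.8, 0.2],
                [0.3, 0.7]])

m_f1_to_x1 = f1                       # factor -> variable: nothing to sum over
m_x1_to_f12 = m_f1_to_x1              # variable -> factor: x1's only other neighbor is f1
m_f12_to_x2 = (f12 * m_x1_to_f12[:, None]).sum(axis=0)  # factor -> variable: sum out x1

belief_x2 = m_f12_to_x2 / m_f12_to_x2.sum()
print(belief_x2)                      # exact marginal P(x2) on this tree
```

On loopy graphs (as in the diagnosis networks above) the same updates are simply iterated until the beliefs stop changing, at the cost of exactness.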
37
BP Diagnosis on Internet-like Networks
  • The INET topology generator was used to create a network of 487 nodes
  • 387 probes selected by greedy search for single-fault diagnosis
  • Simulated levels of noise q and fault prior p
  • Error increases with growing fault probability p and noise level q
  • Fault probability p has more impact on the error than the noise q
  • Good news: in reality, p is usually quite low, so the error is small

38
Outline
  • What is Autonomic Computing? Why use ML?
  • Some Application Examples
  • Part 1: Inference and Learning with Active Sampling
  • Active testing in Bayesian inference: problem diagnosis
  • Active learning in collaborative prediction: server selection
  • Part 2: Decision Making and Reinforcement Learning
  • Summary and Future Directions

39
Performance Prediction and Server Selection
  • End-to-end performance between a pair of nodes: network latency, bandwidth, round-trip time, or any other QoS metric
  • Knowing end-to-end performance is important in many applications:
  • Content-distribution systems (peer-to-peer, Grid): choosing the highest-bandwidth server to download an object from
  • Distributed Hash Tables: routing a lookup request to the peer with the lowest latency
  • Overlay routing: selecting the lowest-latency peer to communicate with
  • Example: IBM's downloadGrid
40
Approach: Collaborative Prediction (CP)
Problem: given a sparse matrix of previously observed user experiences (users' ratings for a set of products, or bandwidth between some client-server pairs), predict the unobserved entries.
Example: a clients x servers bandwidth matrix for a 100x100 subset of dGrid.
  • How do we generalize from observed to unobserved entries (fill in the black space)?
  • Underlying assumption: matrix entries are NOT independent; e.g., similar nodes have similar performance
  • Various approaches, mainly factorized models that assume hidden factors affecting the ratings: aspect model, pLSA, MCVQ, SVD, NMF, MMMF
41
Assumptions:
- there is a number of (hidden) factors behind the user preferences that relate to (hidden) movie properties
- movies have intrinsic values associated with such factors
- users have intrinsic weights associated with such factors
- user ratings are weighted (linear) combinations of the movies' values
Factor models → dimensionality reduction (for a small number of factors)
42
[Diagram: Y ≈ X = UV', with U an n x k and V an m x k matrix of rank k]
Objective: find a factorization X = UV' that approximates Y and satisfies some regularization constraints (e.g., rank(X) < k).
The loss function depends on the nature of your problem.
43
How to solve it?
  • Singular value decomposition (SVD): low-rank approximation
  • Assumes a fully observed Y and sum-squared loss
  • In collaborative prediction, Y is only partially observed
  • Low-rank approximation becomes a non-convex problem with many local minima
  • Furthermore, we may not want sum-squared loss, but instead:
  • accurate predictions (0/1 loss, approximated by hinge loss)
  • cost-sensitive predictions (missing a good server vs. suggesting a bad one)
  • cost of ranking, etc., depending on the decision algorithm using the predictions
  • Use instead the state-of-the-art Max-Margin Matrix Factorization (Srebro 05):
  • replaces the bounded-rank constraint by a bounded norm of the U, V vectors
  • a convex optimization problem! can be solved exactly by semi-definite programming
  • strongly relates to learning max-margin classifiers (SVMs)
  • We demonstrate MMMF on binary classification first
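A rough sense of what such a factorization does can be had from a few lines of alternating least squares on a synthetic, partially observed matrix. Note the hedge: this uses squared loss with a Frobenius-norm penalty, not MMMF's hinge loss and trace-norm bound, so it is a simplified stand-in rather than the method cited above.

```python
import numpy as np

# Fit Y ~ U @ V.T on observed entries only, via ridge-regularized
# alternating least squares; the data are synthetic, exactly rank 3.
rng = np.random.default_rng(0)
n, m, k = 20, 15, 3
Y = rng.normal(size=(n, k)) @ rng.normal(size=(m, k)).T
mask = rng.random((n, m)) < 0.5          # ~50% of entries observed

U = rng.normal(scale=0.1, size=(n, k))
V = rng.normal(scale=0.1, size=(m, k))
lam = 1e-3
for _ in range(50):
    for i in range(n):                   # ridge solve for each row of U
        obs = mask[i]
        A = V[obs].T @ V[obs] + lam * np.eye(k)
        U[i] = np.linalg.solve(A, V[obs].T @ Y[i, obs])
    for j in range(m):                   # and for each row of V
        obs = mask[:, j]
        A = U[obs].T @ U[obs] + lam * np.eye(k)
        V[j] = np.linalg.solve(A, U[obs].T @ Y[obs, j])

test_err = np.abs((U @ V.T - Y)[~mask]).mean()   # error on UNOBSERVED entries
print(test_err)
```

The point of the example is the generalization step: the unobserved entries are recovered purely because the rank constraint couples them to the observed ones.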

44
Key Insight
Rows are feature vectors; columns are linear classifiers (weight vectors).
X_ij = sign_ij x margin_ij; the predictor is sign_ij: if sign_ij > 0, classify as +1, otherwise classify as -1.
45
MMMF: Simultaneous Search for Low-Norm Feature Vectors and Max-Margin Classifiers
46
Our Contribution: Active Learning
MMMF works well, but it ignores a natural property of the domain: the possibility of active sampling (make user A connect to server 117 to improve our model).
[Figure: a matrix of predicted margins; entries near zero are the most uncertain]
Current active-SVM heuristic: actively query the min-margin sample (the most uncertain one).
47
Active Max-Margin Matrix Factorization (A-MMMF)
  • A-MMMF(M, s):
  • 1. Given a sparse matrix Y, learn the approximation X = MMMF(Y)
  • 2. Using current predictions, actively select the best s samples and request their labels (e.g., test a client/server pair via an enforced download)
  • 3. Add the new samples to Y
  • 4. Repeat 1-3 until no significant improvement in prediction is likely
  • Active sampling:
  • The idea is to eliminate as many wrong hypotheses (e.g., SVM separators) as possible from consideration (see later for more detail)
  • The current approach uses the simplest minimum-margin heuristic for SVMs
  • However, there is a variety of other SVM active-learning heuristics to try
  • Ideally, a theoretically founded approach is desirable: a hard open problem
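Step 2's min-margin heuristic is easy to state in code: with current factors U and V, the predicted margin of entry (i, j) is |U_i · V_j|, and we query the unobserved entry where it is smallest. The toy factors below are invented for illustration.

```python
import numpy as np

def pick_query(U, V, observed):
    """Return the (row, col) of the unobserved entry with the smallest
    predicted margin |U_i . V_j| -- the most uncertain prediction."""
    margins = np.abs(U @ V.T)
    margins[observed] = np.inf           # never re-query observed entries
    return np.unravel_index(np.argmin(margins), margins.shape)

U = np.array([[1.0], [0.2]])             # 2 clients, rank-1 factors
V = np.array([[1.0], [0.5], [0.05]])     # 3 servers
observed = np.zeros((2, 3), dtype=bool)
observed[0, 2] = True                    # entry (0, 2) was already measured
print(pick_query(U, V, observed))        # selects the low-margin entry (1, 2)
```

In A-MMMF this selection is run s times per round, the chosen pairs are measured (e.g., by an enforced download), and the factorization is refit.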

48
Results: Active vs. Random Sampling
[Plots: classification accuracy vs. % of the initial data set, for DownloadGrid bandwidth prediction and PlanetLab latency prediction]
Active sampling gives a consistent improvement in classification accuracy, which leads to better decisions → higher bandwidth and faster downloads.
49
More Results on Latency Prediction
[Plots: P2Psim data and NLANR-AMP data]
Comparing various active sampling strategies: most-uncertain (min-margin) and least-uncertain (max-margin, "safe") sample selection.
50
Results for Movie Rating Prediction
[Plots: MovieLens data, prediction error and sampling cost]
Prediction accuracy versus cost of sampling using various strategies.
51
Conclusions
  • A common challenge in systems-management applications: cost-efficient measurement selection
  • A promising approach: cost-efficient active sampling
  • Active sampling improves predictive accuracy while keeping the number of measurements low in several domains:
  • Online diagnosis
  • Bandwidth and latency prediction
  • Future work:
  • Other systems applications and problems that need active sampling
  • Theoretical analysis of active-sampling performance
  • More efficient active-sampling approaches (better heuristics, non-myopic test selection)

52
Outline
  • What is Autonomic Computing? Why use ML?
  • Some Application Examples
  • Part 1: Inference and Learning with Active Sampling
  • Active testing in Bayesian inference: problem diagnosis
  • Active learning in collaborative prediction: server selection
  • Part 2: Decision Making and Reinforcement Learning
  • Summary and Future Directions

53
Examples of Autonomic Decision-Making
  • Systems Performance Management
  • Dynamic Resource Allocation
  • Servers, threads, CPU slices, ...
  • Memory
  • Storage
  • Bandwidth
  • Online Parameter Tuning
  • MAXCLIENTS, timeout parameters, ...
  • Routing and Scheduling
  • Access/Flow Control
  • Application Placement
  • Objectives: QoS/SLA objectives, consistency, customer retention, fairness, etc. (can get pretty nebulous)

54
More Decision-Making Examples
  • Availability Management
  • Knobs: data redundancy, server redundancy
  • Objectives: MTBF, RTO, RPO
  • Power/Thermal Management
  • Knobs: CPU clock speeds, ambient temperature, fan speeds, ...
  • Objectives: minimize cost of electricity, cooling, and heat degradation of equipment
  • Background Utility Throttling
  • DB backups, re-indexing
  • Utilization Management
  • Multi-Criteria Tradeoffs
  • e.g., performance vs. availability, performance vs. power, etc.

55
WebSphere On Demand Operating Environment
[Diagram: the WebSphere on-demand router performs classification, prioritization and flow control, and routing/load balancing across a WebSphere cell of five nodes hosting Stock Trading (high importance), Account Management (medium importance), and Financial Advice (low importance) applications. Provisioning decisions and executions are driven by application demand, resource state, and operational policy.]
56
Application: Allocating Server Resources in a Data Center
  • Scenario: a data center serving multiple customers, each running high-volume web apps with independent time-varying workloads

[Diagram: a Resource Arbiter maximizes business value across all customers; each customer (e.g., Citibank online banking) has an Application Manager with an SLA, a Router, server pools, and DB2]
57
ML for Autonomic Decision-Making
  • The natural method of choice is Reinforcement Learning: it learns behavioral policies State → Action
  • But there are several huge challenges in making this practical:
  • Non-Markovian effects (non-stationarity, history dependence, partial observability) may be pronounced
  • The scale/complexity of big distributed systems is daunting
  • Can easily have many thousands of state variables in data centers and large multi-tier Web applications
  • Need RL approaches that can scale to huge state spaces, as well as huge action spaces
  • The cost of acquiring training data is quite real in live systems:
  • Cost of poor performance of the initial policy
  • Cost of exploration

58
Needed Enhancements to Vanilla RL
  • Active learning, i.e., the exploration-exploitation tradeoff:
  • An exact solution is known for bandit problems (stateless MDPs)
  • Several established heuristic approaches (Boltzmann exploration, Interval Estimation, ...)
  • Principled Bayes-RL approaches are beginning to be developed (Wang et al., ICML 2005; Poupart et al., ICML 2006)
  • Need to address the curse of dimensionality:
  • Robust function approximation: exploit smooth, monotonic dependence on state variables in many systems-management applications
  • Feature selection: a well-developed literature, mainly for supervised learning (classification/regression)
  • State abstraction: structured MDP approaches, options, PSRs
  • Hidden-state inference: the POMDP literature, hidden aspect model learning

59
What's Wrong with Standard Model-Based Approaches? (Why Bother with ML?)
  • Model-based approach: design an appropriate system performance model (e.g., control-theoretic or queuing-theoretic)
  • Estimate model parameters offline or online
  • The model estimates how the control variables affect system performance
  • Use optimization methods to select the best control-variable setting (if exhaustive search is infeasible)
  • Two main limitations:
  • Model design is difficult and knowledge-intensive
  • Model assumptions don't exactly match the real system
  • Two prospective benefits of the machine learning approach:
  • Avoids the knowledge bottleneck
  • Can achieve more principled MDP-optimal policies that properly account for the long-range dynamic consequences of actions
60
A Knowledge Bottleneck in Autonomic Computing
61
Case Studies: RL for Dynamic Resource Allocation
  • Realistic prototype data center:
  • Real servers and multiple Web-based transactional workloads
  • Realistic time-varying demand in each workload
  • Dynamically allocate servers to optimize SLA payments
  • Decompositional online RL approach:
  • Enables scalability to many workloads
  • Learns local value functions within each workload
  • Hybrid RL approach:
  • Collect data using an external (queuing-model-based) policy to make allocation decisions
  • Train RL in batch mode on the collected data
  • The learned policy outperforms the original policy

62
Application: Allocating Server Resources in a Data Center
  • Scenario: a data center serving multiple customers, each running high-volume web apps with independent time-varying workloads

[Diagram: a Resource Arbiter maximizes business value across all customers; each customer (e.g., Citibank online banking) has an Application Manager with an SLA, a Router, server pools, and DB2]
63
Data Center Prototype: Implementation
  • Real servers: a cluster of 20 IBM eServer xSeries machines (RedHat Linux)
  • Realistic Web-based workload: Trade3 (online trading emulation)
  • Runs on top of WebSphere (web applications platform) and DB2 (database management software)
  • Realistic demand generation:
  • Open-loop scenario: Poisson HTTP requests; the mean arrival rate λ varies with time
  • Closed-loop scenario: a finite number of customers M with a fixed think-time distribution; M varies with time
  • Variations in M or λ are governed by a stochastic time-series model of Web traffic (Squillante, Yao and Zhang, 1999)

64
Data Center Prototype: Experimental Setup
[Diagram: the Resource Arbiter maximizes total SLA revenue, reallocating 8 xSeries servers every 5 sec among three application environments (two Trade3 environments on WebSphere 5.1 + DB2, and one Batch environment). Each App Manager reports Value(# servers) estimates, derived from its SLA on response time, given the observed demand in HTTP req/sec.]
65
Will ML Without Built-In Knowledge Work?
Tabula rasa = "blank slate" (Latin)
66
Global RL versus Local RL
  • One approach: make the Resource Arbiter a global Q-learner
  • Advantages:
  • The arbiter's problem is a true MDP
  • Can rely on a convergence guarantee
  • Main disadvantage:
  • The arbiter's state space is huge: the cross product of all local state spaces
  • → a serious curse of dimensionality if there are many applications
  • Alternative approach: local RL
  • Each application does local TD(0) based on local state, local provisioning, and local reward → learns a local value function
  • Each application conveys its current V(resource) estimates to the arbiter
  • The arbiter then acts to maximize the sum of the current value functions
  • Local learning should be much easier than global learning, but:
  • We no longer have a convergence guarantee
  • Related work: Russell and Zimdars, ICML-03 (local rewards only)
67
Important RL Issues
  • Are the local applications really MDPs?
  • There may be history dependence in demand (e.g., closed-loop user activity)
  • There may be history dependence in performance (garbage collection)
  • We may not have full observability of the local state (lots of sensors needed): avg. demand, avg. response time, avg. CPU utilization, avg. memory utilization, current number of running threads, ...
  • Can we learn fast enough?
  • Shouldn't take millions of value-table updates; at most a few 10k updates
  • Start RL off from a good heuristic initial state
  • Hybrid learning: initial policy decisions made by a model-based approach
  • Can we avoid excessive exploration penalties?
  • Epsilon-greedy (10% random allocation decisions) seems OK, but in general more intelligent methods are needed
68
Online RL in the Trade3 Application Manager (AAAI 2005)
  • Observed state: current demand λ only
  • Arbiter action: number of servers provided (n)
  • Instantaneous reward U: SLA payment
  • Learns a long-range expected value function V(state, action) = V(λ, n) (a two-dimensional lookup table)
  • Data Center results:
  • Good asymptotic performance, but
  • Poor performance during a long training period
  • The method scales poorly with state-space size

[Diagram: the Trade3 App Manager observes demand λ and response time, receives SLA payment U, runs RL to learn V(λ, n), and reports values to the Resource Arbiter, which allocates servers to the application environment.]
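The lookup-table learner on this slide can be caricatured in a stateless, bandit-style form: a table V(demand, n) updated online from SLA payments, with epsilon-greedy exploration. The demand bins and the SLA payment function below are invented for illustration, not the prototype's actual SLA.

```python
import random

random.seed(0)
N_BINS, MAX_SERVERS = 5, 4
V = {(d, n): 0.0 for d in range(N_BINS) for n in range(MAX_SERVERS + 1)}

def sla_payment(demand_bin, n_servers):
    # toy utility: revenue grows with served demand, minus a server cost
    return min(demand_bin, n_servers) * 1.0 - 0.3 * n_servers

def choose(demand_bin, eps=0.1):
    if random.random() < eps:                     # explore
        return random.randint(0, MAX_SERVERS)
    return max(range(MAX_SERVERS + 1), key=lambda n: V[(demand_bin, n)])

alpha = 0.1
for _ in range(20000):
    d = random.randint(0, N_BINS - 1)             # i.i.d. demand (stateless caricature)
    n = choose(d)
    r = sla_payment(d, n)
    V[(d, n)] += alpha * (r - V[(d, n)])          # bandit-style TD update

# Greedy allocation after learning: more servers for higher demand
greedy = [max(range(MAX_SERVERS + 1), key=lambda n: V[(d, n)]) for d in range(N_BINS)]
print(greedy)
```

The slide's "scales poorly" point is visible here: the table has one cell per (demand, n) pair, so its size, and the exploration needed to fill it, grows multiplicatively with each added state variable.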
69
Amazingly Enough, RL Works! :-)
Results of overnight training (25k RL updates; 16 hours of real time) with a random initial condition.
70
Comparison of Performance: 2 Application Environments
71
3 Application Environments: Performance
72
Will ML Without Built-In Knowledge Work?
Tabula rasa = "blank slate" (Latin)
73
A Hybrid Approach: Combining Knowledge and ML (Tesauro et al., ICAC 2006)
  • Initial Knowledge → Behavioral Data → ML → Improved Knowledge
  • Several advantages:
  • No direct interface between ML and the initial knowledge: we don't engineer knowledge into ML
  • The initial knowledge can be virtually anything:
  • very simple (e.g., a crude heuristic)
  • highly sophisticated (a multi-tier closed queuing network)
  • could even be human behavior
  • Can do multiple iterations to keep improving
74
Hybrid Reinforcement Learning Illustrated
  • Run RL offline on data from the initial policy
  • Bellman's policy improvement theorem (1957) → V(state, action) defines a new policy guaranteed to be better than the original policy
  • Combines the best aspects of both RL and model-based (e.g., queuing) methods
  • A very general method that automatically improves any existing systems-management policy
  • In the Data Center prototype:
  • Implement the best queuing models within each Trade3 manager
  • Log system data in an overnight run (12-20 hrs)
  • Train RL on the logged data (2 CPU hrs) → new value functions
  • Replace the queuing models by the RL value functions and rerun the experiment

[Diagram: the learner observes the state, action, and reward stream generated by the running policy.]
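The overall loop (log data under an external policy, fit a value function offline, act greedily on it) can be sketched in a few lines. The states, actions, reward function, and logging policy below are all invented; the real prototype used queuing-model decisions and neural-net value functions rather than this toy Monte-Carlo table.

```python
import random

random.seed(1)
STATES, ACTIONS = range(3), range(3)

def reward(s, a):
    return -abs(s - a)                      # toy reward: best action matches the state

def external_policy(s):
    return random.choice(list(ACTIONS))     # crude stand-in for the logging policy

# 1. Log (state, action, reward) tuples produced by the external policy
log = [(s, a, reward(s, a))
       for _ in range(3000)
       for s in STATES
       for a in [external_policy(s)]]

# 2. Offline "training": Monte-Carlo average of logged rewards per (s, a)
sums, counts = {}, {}
for s, a, r in log:
    sums[(s, a)] = sums.get((s, a), 0.0) + r
    counts[(s, a)] = counts.get((s, a), 0) + 1
Q = {sa: sums[sa] / counts[sa] for sa in sums}

# 3. The improved policy acts greedily on the learned values
improved = [max(ACTIONS, key=lambda a: Q.get((s, a), float("-inf")))
            for s in STATES]
print(improved)
```

Because the learner only consumes the external policy's observable behavior, the same loop works whether the logging policy is a crude heuristic, a queuing model, or a human operator, which is exactly the decoupling the slide emphasizes.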
75
Two Key Ingredients of the Trade3 Implementation
  • 1. Delay-Aware State Representation
  • Include the previous allocation decision as part of the current state → V = V(λt, nt-1, nt)
  • Can learn to properly evaluate switching delay (provided that the delay < allocation interval)
  • e.g., can distinguish V(λ, 2, 3) from V(λ, 3, 3)
  • The delay need not be directly observable: RL only observes the delayed reward
  • Also handles transient suboptimal performance
  • 2. Nonlinear Function Approximation (Neural Nets)
  • Generalizes across states and actions
  • Obviates visiting every state in the space
  • Greatly reduces the need for exploratory actions
  • Much better scaling to larger state spaces: from 2-3 state variables to 20-30, potentially
  • But we lose guaranteed optimality
76
Results: Open Loop, No Switching Delay
77
Results: Closed Loop, No Switching Delay
78
Results: Effects of Switching Delay
79
Insights into Hybrid RL Outperformance
  • 1. Less biased estimation errors
  • The queuing model predicts indirectly: RT → SLA(RT) → V
  • The nonlinear SLA induces an overprovisioning bias
  • RL estimates utility directly → a less biased estimate of V
  • 2. RL handles transients and switching delays
  • Steady-state queuing models cannot
  • 3. RL learns to avoid thrashing

80
Policy Hysteresis in the Learned Value Function
  • Stable joint allocations (T1, T2, Batch) at fixed demand λ2

81
Hybrid RL Learns Not to Thrash
[Plots: servers allocated to T1 and T2 over time under closed-loop demand (customers in T1) with a 4.5 s allocation delay for T2; the queuing-model policy swaps servers between T1 and T2 much more often than hybrid RL.]
82
Hybrid RL Does Less Swapping than QM
[Bar chart: average number of servers swapped per allocation decision for QM vs. hybrid RL in four experiments (open/closed loop, delay 0 or 4.5 s); RL swaps fewer servers in every case, with values ranging from 0.269 to 0.736.]
83
Conclusions
  • RL holds great promise for autonomic decision-making:
  • Can learn without building explicit models
  • Can achieve decision-theoretic (MDP-optimal) policies
  • Online RL feasibility was seen in a small-scale lab study:
  • Requires good heuristic initialization
  • Hybrid RL works quite well for server allocation:
  • Combines the disparate strengths of RL and queuing models
  • Exploits the domain knowledge built into the queuing model
  • But doesn't need access to that knowledge: it only uses the externally observable behavior of the queuing-model policy
  • Potential for wide usage of RL in systems management:
  • Managing other resource types: memory, storage, LPARs, etc.
  • Managing control parameters: web server/OS/DB parameters, etc.
  • Simultaneous management of multiple criteria: performance/utilization, performance/availability, etc.
  • Thanks! Any questions?

84
Related Work
  • R. Agrawal and R. Srikant. Fast algorithms for
    mining association rules. In VLDB 94
  • S. Ma and J.L. Hellerstein, Mining Mutually
    Dependent Patterns for System Management, IEEE
    Journal on Selected Areas in Communications,
    2002, pp. 726-735
  • R. Vilalta and S. Ma, Predicting Rare Events in
    Temporal Domains, ICDM-02
  • C. Domeniconi, C. Perng, R. Vilalta, S. Ma, A
    Classification Approach for Prediction of Target
    Events in Temporal Sequences, PKDD-02
  • R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta,
    J.E. Moreira, S. Ma, R. Vilalta, A.
    Sivasubramaniam, Critical Event Prediction for
    Proactive Management in Large-scale Computer
    Clusters, KDD-03
J.L. Hellerstein, T.S. Jayram, and I. Rish. Recognizing End-User Transactions in Performance Management, in AAAI-00.
  • I. Rish, M. Brodie, S. Ma, N. Odintsova, A.
    Beygelzimer, G. Grabarnik, K. Hernandez.
    Adaptive Diagnosis in Distributed Systems, in
    IEEE Transactions on Neural Networks (special
    issue on Adaptive Learning Systems in
    Communication Networks), vol. 16, no. 5, pp.
    1088-1109, September 2005.
  • A. Beygelzimer, R. Linsker, G. Grinstein, I.
    Rish. Improving Network Robustness by Edge
    Modification, Physica A, Vol 357, No 3-4, pp.
    593-612, November 2005.
I. Rish, Distributed Systems Diagnosis Using Belief Propagation, in Proc. of the 43rd Annual Allerton Conference on Communication, Control and Computing, 2005.
A. Zheng, I. Rish, A. Beygelzimer. Efficient Test Selection in Active Diagnosis via Entropy Approximation, in Proceedings of UAI-2005.

85
End