Online Sampling for Markov Decision Processes

Slides: 52
Provided by: Edwin69

1
Online Sampling for Markov Decision Processes
  • Bob Givan
  • Joint work w/ E. K. P. Chong, H. Chang, G. Wu

Electrical and Computer Engineering, Purdue University
2
Markov Decision Process (MDP)
  • Ingredients:
  • System state x in state space X
  • Control action a in A(x)
  • Reward R(x,a)
  • State-transition probability P(x,y,a)
  • Find a control policy to maximize the objective function
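The ingredients above can be written down as a small data structure. The following is a minimal sketch, not the authors' code: the `MDP` class, the `sample_next` helper, and the two-state `toy` problem are all hypothetical names and values introduced only for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

State = int
Action = int

@dataclass
class MDP:
    """Finite MDP holding the ingredients listed on the slide."""
    actions: Callable[[State], List[Action]]                    # A(x)
    reward: Callable[[State, Action], float]                    # R(x,a)
    transition: Callable[[State, Action], Dict[State, float]]   # y -> P(x,y,a)

    def sample_next(self, x: State, a: Action, rng: random.Random) -> State:
        """Draw a next state y with probability P(x,y,a)."""
        dist = self.transition(x, a)
        states = list(dist)
        return rng.choices(states, weights=[dist[y] for y in states])[0]

# Hypothetical two-state toy problem (values chosen only for illustration).
toy = MDP(
    actions=lambda x: [0, 1],
    reward=lambda x, a: 1.0 if (x == 1 and a == 1) else 0.0,
    transition=lambda x, a: {0: 0.5, 1: 0.5} if a == 0 else {x: 0.9, 1 - x: 0.1},
)
```

Sampling-based methods only need `sample_next`, `reward`, and `actions`; they never enumerate the full transition table, which is what makes them usable on very large state spaces.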

3
Optimal Policies
  • Policy: mapping from state and time to actions
  • Stationary policy: mapping from state to actions
  • Goal: a policy maximizing the objective function
  • $V_H(x_0) = \max_u \mathrm{Obj}[R(x_0,a_0), \ldots, R(x_{H-1},a_{H-1})]$
  • where the max is over all policies $u = (u_0, \ldots, u_{H-1})$
  • For large H, a0 is independent of H (with an ergodicity
    assumption)
  • Stationary optimal action a0 for $H \to \infty$ via
    receding-horizon control

4
Q Values
  • Fix a large H, focus on finite-horizon reward
  • Define $Q(x,a) = R(x,a) + E[V_{H-1}(y)]$
  • Utility of action a at state x
  • Name: Q-value of action a at state x
  • Key identities (Bellman's equations):
  • $V_H(x) = \max_a Q(x,a)$
  • $u_0(x) = \arg\max_a Q(x,a)$
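For a state space small enough to enumerate, the Bellman identities can be applied directly by backward induction. The sketch below assumes an undiscounted sum-of-rewards objective with V_0 = 0; the tables `P` and `R` describe a hypothetical two-state problem that does not come from the talk.

```python
# Hypothetical toy MDP: P[x][a] maps next state y -> probability, R[x][a] is the reward.
P = {
    0: {0: {0: 1.0}, 1: {0: 0.5, 1: 0.5}},
    1: {0: {0: 1.0}, 1: {1: 1.0}},
}
R = {
    0: {0: 0.0, 1: 0.0},
    1: {0: 2.0, 1: 1.0},
}

def q_value(x, a, v_prev):
    """Q(x,a) = R(x,a) + E[V_{H-1}(y)], with y drawn from P(x, ., a)."""
    return R[x][a] + sum(p * v_prev[y] for y, p in P[x][a].items())

def solve(horizon):
    """Return (V_H, u_0) by backward induction, starting from V_0 = 0."""
    v = {x: 0.0 for x in P}
    policy = {x: None for x in P}
    for _ in range(horizon):
        q = {x: {a: q_value(x, a, v) for a in P[x]} for x in P}
        v = {x: max(q[x].values()) for x in P}          # V_H(x) = max_a Q(x,a)
        policy = {x: max(q[x], key=q[x].get) for x in P}  # u_0(x) = argmax_a Q(x,a)
    return v, policy
```

In this toy problem the best first action in state 1 is action 0 for H = 1 but action 1 for H = 2, which illustrates why the horizon has to be large enough before the first action settles.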

5
Solution Methods
  • Recall:
  • $u_0(x) = \arg\max_a Q(x,a)$
  • $Q(x,a) = R(x,a) + E[V_{H-1}(y)]$
  • Problem: Q-value depends on the optimal policy
  • State space is extremely large (often continuous)
  • Two-pronged solution approach:
  • Apply a receding-horizon method
  • Estimate Q-values via simulation/sampling

6
Methods for Q-value Estimation
  • Previous work by other authors:
  • Unbiased sampling (exact Q-value) [Kearns et al.,
    IJCAI-99]
  • Policy rollout (lower bound) [Bertsekas &
    Castanon, 1999]
  • Our techniques:
  • Hindsight optimization (upper bound)
  • Parallel rollout (lower bound)
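The "lower bound" and "upper bound" labels can be read off one chain of inequalities, given exact expectations. This is a sketch in the notation of the Q-value slides; the symbol $w$ for the sampled future randomness and the explicit sum of rewards in the last term are assumptions introduced here.

```latex
\begin{align*}
\underbrace{R(x,a) + E\big[V_{H-1}^{u}(y)\big]}_{\text{policy rollout, } u \in U}
  &\le \underbrace{R(x,a) + E\big[\max_{u' \in U} V_{H-1}^{u'}(y)\big]}_{\text{parallel rollout}} \\
  &\le \underbrace{R(x,a) + E\big[V_{H-1}(y)\big]}_{Q(x,a)} \\
  &\le \underbrace{R(x,a) + E_{w}\Big[\max_{a_1,\ldots,a_{H-1}} \sum_{t=1}^{H-1} R(x_t,a_t)\Big]}_{\text{hindsight optimization}}
\end{align*}
```

The first two steps hold because a maximum over a larger set of policies cannot be smaller. The last step holds because the optimal value is a maximum of expectations, which never exceeds the expectation of the maximum.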

7
Expectimax Tree for V
8
Unbiased Sampling
9
Unbiased Sampling (Cont'd)
  • For a given desired accuracy, how large should
    sampling width and depth be?
  • Answered: Kearns, Mansour, and Ng (1999)
  • Requires prohibitive sampling width and depth
  • e.g. $C \approx 10^8$, $H_s > 60$ to distinguish best and
    worst policies in our scheduling domain
  • We evaluate with smaller width and depth
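The sampling width C and depth H_s enter the estimator as follows. This is a minimal sketch in the spirit of the Kearns, Mansour, and Ng sparse-sampling tree, assuming an undiscounted objective; the toy `reward`, `sample_next`, and `actions` functions are hypothetical.

```python
import random

# Hypothetical toy problem: in state 1, action 0 pays 2 and resets to state 0;
# action 1 pays 1 and stays. From state 0, action 1 reaches state 1 half the time.
def reward(x, a):
    return {(1, 0): 2.0, (1, 1): 1.0}.get((x, a), 0.0)

def sample_next(x, a, rng):
    if a == 0:
        return 0
    return 1 if x == 1 else rng.choice([0, 1])

def actions(x):
    return [0, 1]

def sparse_q(x, a, depth, width, rng):
    """Estimate Q(x,a) for a `depth`-step horizon, sampling `width` next states per node."""
    if depth <= 1:
        return reward(x, a)  # one step left: no lookahead needed since V_0 = 0
    total = 0.0
    for _ in range(width):
        y = sample_next(x, a, rng)
        # Recursive estimate of V_{depth-1}(y) = max_b Q(y,b).
        total += max(sparse_q(y, b, depth - 1, width, rng) for b in actions(y))
    return reward(x, a) + total / width
```

With k actions, the tree has on the order of (kC)^(H_s - 1) nodes, so the cost grows exponentially with depth. That is why the width and depth needed for a formal guarantee are prohibitive.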

10
How to Look Deeper?
11
Policy Roll-out
12
Policy Rollout in Equations
  • Write $V_H^u(y)$ for the value of following policy u
  • Recall $Q(x,a) = R(x,a) + E[V_{H-1}(y)]$
  • $= R(x,a) + E[\max_u V_{H-1}^u(y)]$
  • Given a base policy u, use
  • $R(x,a) + E[V_{H-1}^u(y)]$
  • as a lower-bound estimate of the Q-value
  • Resulting policy is PI(u), the policy improvement of u,
    given infinite sampling
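The rollout estimate above can be computed by plain simulation of the base policy. The following is a minimal sketch on a hypothetical toy problem; `simulate`, `rollout_q`, `rollout_action`, and the base policy `cash_out` are illustrative names, not taken from the talk.

```python
import random

# Hypothetical toy problem: in state 1, action 0 pays 2 and resets to state 0;
# action 1 pays 1 and stays. From state 0, action 1 reaches state 1 half the time.
def reward(x, a):
    return {(1, 0): 2.0, (1, 1): 1.0}.get((x, a), 0.0)

def sample_next(x, a, rng):
    if a == 0:
        return 0
    return 1 if x == 1 else rng.choice([0, 1])

def actions(x):
    return [0, 1]

def simulate(x, policy, steps, rng):
    """Total reward of one simulated run of `policy` for `steps` steps from x."""
    total = 0.0
    for _ in range(steps):
        a = policy(x)
        total += reward(x, a)
        x = sample_next(x, a, rng)
    return total

def rollout_q(x, a, policy, horizon, width, rng):
    """Monte Carlo estimate of R(x,a) + E[V^u_{H-1}(y)] from `width` runs."""
    runs = (simulate(sample_next(x, a, rng), policy, horizon - 1, rng)
            for _ in range(width))
    return reward(x, a) + sum(runs) / width

def rollout_action(x, policy, horizon, width, rng):
    """Action chosen by the rollout policy: argmax of the estimated Q-values."""
    return max(actions(x), key=lambda a: rollout_q(x, a, policy, horizon, width, rng))

def cash_out(x):
    """Base policy u: always take action 0."""
    return 0
```

In state 1 with H = 3, the estimate for action 1 is 3, below the true Q-value of 4 for this toy problem, yet the rollout policy still picks action 1 where the base policy would pick action 0.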

13
Policy Roll-out (Cont'd)
14
Parallel Policy Rollout
  • Generalization of policy rollout, due to Chang,
    Givan, and Chong, 2000
  • Given a set U of base policies, use
  • $R(x,a) + E[\max_{u \in U} V_{H-1}^u(y)]$
  • as an estimate of the Q-value
  • More accurate estimate than policy rollout
  • Still gives a lower bound to true Q-value
  • Still gives a policy no worse than any in U
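Parallel rollout differs from simple rollout only in the inner maximum over base policies. The sketch below estimates each V^u(y) by averaging `inner` simulated runs before taking the maximum; that inner-averaging scheme and all names are assumptions made here, since the slides do not say how the maximum is sampled.

```python
import random

# Hypothetical toy problem: in state 1, action 0 pays 2 and resets to state 0;
# action 1 pays 1 and stays. From state 0, action 1 reaches state 1 half the time.
def reward(x, a):
    return {(1, 0): 2.0, (1, 1): 1.0}.get((x, a), 0.0)

def sample_next(x, a, rng):
    if a == 0:
        return 0
    return 1 if x == 1 else rng.choice([0, 1])

def simulate(x, policy, steps, rng):
    """Total reward of one simulated run of `policy` for `steps` steps from x."""
    total = 0.0
    for _ in range(steps):
        a = policy(x)
        total += reward(x, a)
        x = sample_next(x, a, rng)
    return total

def parallel_rollout_q(x, a, policies, horizon, width, inner, rng):
    """Estimate R(x,a) + E[max_{u in U} V^u_{H-1}(y)]."""
    total = 0.0
    for _ in range(width):
        y = sample_next(x, a, rng)
        # Estimate V^u_{H-1}(y) for every base policy, then keep the best one.
        total += max(
            sum(simulate(y, u, horizon - 1, rng) for _ in range(inner)) / inner
            for u in policies
        )
    return reward(x, a) + total / width

def cash_out(x):
    """Base policy: always action 0."""
    return 0

def hold(x):
    """Base policy: always action 1."""
    return 1
```

With U containing a single policy this reduces to simple policy rollout, so the parallel estimate is never below the rollout estimate of any member of U when the expectations are exact.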

15
Hindsight Optimization: Tree View
16
Hindsight Optimization: Equations
  • Swap Max and Exp in expectimax tree.
  • Solve each off-line optimization problem
  • O(kC f(H)) time
  • where f(H) is the offline problem complexity
  • Jensen's inequality implies upper bounds
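Swapping the max and the expectation means each sampled future is solved as a deterministic problem. The sketch below assumes the randomness can be drawn up front as a noise sequence and solves the offline problem by brute-force recursion, so here f(H) is exponential in H. The "guess the coin" toy problem and all function names are hypothetical.

```python
import random

# Hypothetical toy problem: the next state is 1 if the action matches a hidden
# fair coin flip w, else 0; the reward is collected for being in state 1.
def reward(x, a):
    return float(x)

def step(x, a, w):
    return 1 if a == w else 0

def actions(x):
    return [0, 1]

def sample_noise(rng):
    return rng.randint(0, 1)

def best_in_hindsight(x, noise):
    """Offline problem: best total reward when the whole noise sequence is known."""
    if not noise:
        return 0.0
    return max(reward(x, b) + best_in_hindsight(step(x, b, noise[0]), noise[1:])
               for b in actions(x))

def hindsight_q(x, a, horizon, width, rng):
    """Average over `width` sampled futures of the hindsight-optimal return after action a."""
    total = 0.0
    for _ in range(width):
        noise = [sample_noise(rng) for _ in range(horizon)]
        total += reward(x, a) + best_in_hindsight(step(x, a, noise[0]), noise[1:])
    return total / width
```

In this toy problem the true Q-value from state 0 with H = 4 is 1.5, because no policy can predict the coin, while the hindsight estimate always lies between 2 and 3. The gap shows how loose the upper bound can be when knowledge of the future is very valuable.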

17
Hindsight Optimization (Cont'd)
18
Application to Example Problems
  • Apply unbiased sampling, policy rollout, parallel
    rollout, and hindsight optimization to
  • Multi-class deadline scheduling
  • Random early dropping
  • Congestion control

19
Basic Approach
  • Traffic model provides a stochastic description
    of possible future outcomes
  • Method
  • Formulate network decision problems as POMDPs by
    incorporating traffic model
  • Solve belief-state MDP online using sampling
    (choose time-scale to allow for computation time)

20
Domain 1: Deadline Scheduling
Objective: Minimize weighted loss
21
Domain 2: Random Early Dropping
Objective: Minimize delay without sacrificing throughput
22
Domain 3: Congestion Control
23
Traffic Modeling
  • A Hidden Markov Model (HMM) for each source
  • Note: state is hidden, model is partially
    observed
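A hidden-state traffic source supports both operations the approach needs: sampling possible futures and tracking a belief over the hidden state. The following is a minimal sketch with a hypothetical two-state source; the transition and arrival probabilities are made up for illustration and are not the talk's traffic model.

```python
import random

# Hypothetical two-state source: state 0 = "idle", state 1 = "bursty".
TRANS = [[0.9, 0.1], [0.2, 0.8]]   # TRANS[i][j] = P(next state = j | state = i)
ARRIVE = [0.1, 0.7]                # ARRIVE[i] = P(packet arrives | state = i)

def sample_trace(length, rng, state=0):
    """Sample a sequence of 0/1 arrivals; the underlying states stay hidden."""
    arrivals = []
    for _ in range(length):
        state = 0 if rng.random() < TRANS[state][0] else 1
        arrivals.append(1 if rng.random() < ARRIVE[state] else 0)
    return arrivals

def update_belief(belief, arrival):
    """One step of HMM forward filtering: predict, then condition on the observation."""
    predicted = [sum(belief[i] * TRANS[i][j] for i in range(2)) for j in range(2)]
    likelihood = [ARRIVE[j] if arrival else 1.0 - ARRIVE[j] for j in range(2)]
    unnorm = [predicted[j] * likelihood[j] for j in range(2)]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

The belief vector plays the role of the state in the belief-state MDP: the controller updates it after each observation and samples future traffic from it when estimating Q-values.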

24
Deadline Scheduling Results
  • Non-sampling policies:
  • EDF: earliest deadline first
  • Deadline sensitive, class insensitive
  • SP: static priority
  • Deadline insensitive, class sensitive
  • CM: current minloss [Givan et al., 2000]
  • Deadline and class sensitive
  • Minimizes weighted loss for the current packets
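The two simplest baselines differ only in which packet attribute they look at. The sketch below is illustrative: the `Packet` fields, the use of a numeric class weight to stand for priority, and the tie-breaking by queue order are all assumptions. CM is omitted because the slides do not specify it in enough detail.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Packet:
    deadline: int   # assumption: time slots remaining, smaller is more urgent
    weight: float   # assumption: loss cost of the packet's class, larger is more important

def edf(queue: List[Packet]) -> Packet:
    """Earliest deadline first: deadline sensitive, class insensitive."""
    return min(queue, key=lambda p: p.deadline)

def sp(queue: List[Packet]) -> Packet:
    """Static priority: class sensitive, deadline insensitive."""
    return max(queue, key=lambda p: p.weight)

# Usage: the two policies can disagree on the same queue.
queue = [Packet(deadline=3, weight=5.0), Packet(deadline=1, weight=1.0)]
next_by_edf = edf(queue)   # the packet with deadline 1
next_by_sp = sp(queue)     # the packet with weight 5.0
```

Either function can serve as a base policy for rollout once it is wrapped as a mapping from the scheduler state to the packet to serve.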

25
Deadline Scheduling Results
  • Objective: minimize weighted loss
  • Comparison:
  • Non-sampling policies
  • Unbiased sampling (Kearns et al.)
  • Hindsight optimization
  • Rollout with CM as base policy
  • Parallel rollout
  • Results due to H. S. Chang

26
Deadline Scheduling Results
27
Deadline Scheduling Results
28
Deadline Scheduling Results
29
Random Early Dropping Results
  • Objective: minimize delay subject to throughput
    loss-tolerance
  • Comparison:
  • Candidate policies: RED and buffer-k
  • KMN-sampling
  • Rollout of buffer-k
  • Parallel rollout
  • Hindsight optimization
  • Results due to H. S. Chang.

30
Random Early Dropping Results
31
Random Early Dropping Results
32
Congestion Control Results
  • MDP objective: minimize weighted sum of
    throughput, delay, and loss-rate
  • Fairness is hard-wired
  • Comparisons:
  • PD-k (proportional-derivative with k target
    queue)
  • Hindsight optimization
  • Rollout of PD-k, parallel rollout
  • Results due to G. Wu, in progress

33
Congestion Control Results
34
Congestion Control Results
35
Congestion Control Results
36
Congestion Control Results
37
Results Summary
  • Unbiased sampling cannot cope
  • Parallel rollout wins in 2 domains
  • Not always equal to simple rollout of one base
    policy
  • Hindsight optimization wins in 1 domain
  • Simple policy rollout: the cheapest method
  • Poor in domain 1
  • Strong in domain 2 with best base policy, but
    how to find this policy?
  • So-so in domain 3 with any base policy

38
Talk Summary
  • Case study of MDP sampling methods
  • New methods offering practical improvements
  • Parallel policy rollout
  • Hindsight optimization
  • Systematic methods for using traffic models to
    help make network control decisions
  • Feasibility of real-time implementation depends
    on problem timescale

39
Ongoing Research
  • Apply to other control problems (different
    timescales)
  • Admission/access control
  • QoS routing
  • Link bandwidth allotment
  • Multiclass connection management
  • Problems arising in proxy-services
  • Diagnosis and recovery

40
Ongoing Research (Cont'd)
  • Alternative traffic models
  • Multi-timescale models
  • Long-range dependent models
  • Closed-loop traffic
  • Fluid models
  • Learning traffic model online
  • Adaptation to changing traffic conditions

41
Congestion Control (Cont'd)
42
Congestion Control Results
43
Hindsight Optimization (Cont'd)
44
Policy Rollout (Cont'd)
Figure labels: Policy performance, Base policy
45
Receding-horizon Control
  • For large horizon H, policy is stationary
  • At each time, if state is x, then apply action
  • $u(x) = \arg\max_a Q(x,a)$
  • $= \arg\max_a \{ R(x,a) + E[V_{H-1}(y)] \}$
  • Compute estimate of Q-value at each time
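The control loop implied by these bullets re-estimates the Q-values at every step and applies only the first action of the lookahead. The following is a minimal sketch on a hypothetical deterministic toy problem, using an exact recursive Q-value as a stand-in for any of the sampling estimators; all names are illustrative.

```python
import random

# Hypothetical deterministic toy problem: action 1 moves to state 1, action 0
# moves to state 0; in state 1, action 0 pays 3 and action 1 pays 1.
def reward(x, a):
    return {(1, 0): 3.0, (1, 1): 1.0}.get((x, a), 0.0)

def next_state(x, a):
    return 1 if a == 1 else 0

def actions(x):
    return [0, 1]

def exact_q(x, a, horizon):
    """Exact finite-horizon Q-value; a stand-in for a sampling-based estimate."""
    if horizon <= 1:
        return reward(x, a)
    y = next_state(x, a)
    return reward(x, a) + max(exact_q(y, b, horizon - 1) for b in actions(y))

def receding_horizon_control(x0, steps, horizon, estimate_q, sample_next, rng):
    """At each time step, apply argmax_a of the estimated Q(x,a), then observe the next state."""
    x, total, trace = x0, 0.0, []
    for _ in range(steps):
        a = max(actions(x), key=lambda b: estimate_q(x, b, horizon))
        total += reward(x, a)
        trace.append((x, a))
        x = sample_next(x, a, rng)
    return total, trace

total, trace = receding_horizon_control(
    0, 4, 3, exact_q, lambda x, a, rng: next_state(x, a), random.Random(0))
```

The lookahead horizon stays fixed at H while time advances, so the same stationary rule is applied at every step.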

46
Congestion Control (Cont'd)
47
Domain 3: Congestion Control
Figure labels: High-priority traffic, Bottleneck node, Best-effort traffic
  • Resources: bandwidth and buffer
  • Objective: optimize throughput, delay, loss, and
    fairness
  • High-priority traffic:
  • Open-loop controlled
  • Low-priority traffic:
  • Closed-loop controlled

48
Congestion Control Results
49
Congestion Control Results
50
Congestion Control Results
51
Congestion Control Results