Agenda: Markov Decision Processes (Decision Theoretic Planning)

Transcript and Presenter's Notes

1
4/1
  • Agenda: Markov Decision Processes (Decision
    Theoretic Planning)

2
MDPs as a way of Reducing Planning Complexity
  • MDPs provide a normative basis for talking about
    optimal plans (policies) in the context of
    stochastic actions and complex reward models
  • Optimal policies for MDPs can be computed in
    polynomial time
  • In contrast, even classical planning is
    NP-complete or PSPACE-complete (depending on
    whether the plans are of polynomial or exponential
    length)
  • "So, convert planning problems to MDP problems,
    so we can get polynomial-time performance."
  • To see this, note that the sorting problem can be
    written as a planning problem. But sorting takes
    only polynomial time. Thus, the inherent
    complexity of planning is only polynomial. All
    that the MDP conversion does is let planning
    exhibit its inherent polynomial complexity.

Happy April 1st :-)
3
MDP Complexity: The Real Deal
  • Complexity results are stated in terms of the
    size of the input (measured in some way)
  • MDP complexity results are typically given in
    terms of state-space size
  • Planning complexity results are typically given in
    terms of the factored input (in terms of state
    variables)
  • The state space is already exponential in the
    number of state variables
  • So, polynomial in the state space implies
    exponential in the factored representation
  • More depressingly, optimal policy construction is
    exponential and undecidable, respectively, for
    POMDPs with finite and infinite horizons, even
    with input size measured in terms of the explicit
    state space
  • So clearly, we don't compile planning
    problems to the MDP model for efficiency

4
Forget your homework grading. Forget your
project grading. We'll make it look like you
remembered.
5
Agenda
  • General (FO)MDP model
  • Action (Transition) model
  • Action Cost Model
  • Reward Model
  • Histories
  • Horizon
  • Policies
  • Optimal value and policy
  • Value iteration/Policy Iteration/RTDP
  • Special cases of MDP model relevant to Planning
  • Pure cost models (goal states are absorbing)
  • Reward/Cost models
  • Over-subscription models
  • Connections to heuristic search
  • Efficient approaches for policy construction

6
Markov Decision Process (MDP)
  • S: a set of states
  • A: a set of actions
  • Pr(s'|s,a): transition model
    (a.k.a. M^a_{ss'})
  • C(s,a,s'): cost model
  • G: set of goals
  • s0: start state
  • γ: discount factor
  • R(s,a,s'): reward model
    (one possible in-code representation of these
    components is sketched below)
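
As a concrete (and purely illustrative) sketch, here is one way these
components might be laid out in Python; the dataclass fields, the
dictionary encodings of Pr(s'|s,a) and R, and the tiny two-state
example are assumptions made for the sketch, not part of the slides.

from dataclasses import dataclass, field

@dataclass
class MDP:
    # States, actions, and goals are plain Python sets (illustrative choice).
    states: set
    actions: set
    goals: set
    start: str
    gamma: float                     # discount factor
    # transition[(s, a)] -> {s': Pr(s'|s,a)}   (the M^a_{ss'} entries)
    transition: dict = field(default_factory=dict)
    # reward[s] -> R(s); the slides also allow R(s,a,s') and C(s,a,s')
    reward: dict = field(default_factory=dict)

# A tiny two-state example (hypothetical numbers):
mdp = MDP(
    states={"s0", "s1"},
    actions={"a", "b"},
    goals={"s1"},
    start="s0",
    gamma=0.9,
    transition={
        ("s0", "a"): {"s1": 0.8, "s0": 0.2},
        ("s0", "b"): {"s0": 1.0},
        ("s1", "a"): {"s1": 1.0},
        ("s1", "b"): {"s1": 1.0},
    },
    reward={"s0": 0.0, "s1": 1.0},
)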

7
Objective of a Fully Observable MDP
  • Find a policy π : S → A
  • which optimises
  • minimises expected cost to reach a goal
  • maximises expected reward
  • maximises expected (reward-cost)
  • given a ____ horizon
  • finite
  • infinite
  • indefinite
  • assuming full observability

discounted or undiscounted
8
Histories / Value of Histories / (Expected) Value
of a Policy / Optimal Value / Bellman Principle /
Bellman's Equation
9
Policy evaluation vs. Optimal Value Function in
Finite vs. Infinite Horizon
10
Repeat
can generalize to have action costs C(a,s)
If the M^a_{ij} transition matrix is not known a priori, then we
have a reinforcement learning scenario.
11
What does a solution to an MDP look like?
  • The solution should tell us the optimal action to do
    in each state (called a policy)
  • A policy is a function from states to actions
    (see the finite-horizon case below)
  • Not a sequence of actions any more
  • Needed because of the non-deterministic actions
  • If there are |S| states and |A| actions that we
    can do at each state, then there are |A|^|S|
    policies
  • How do we get the best policy?
  • Pick the policy that gives the maximal expected
    reward
  • For each policy π:
  • Simulate the policy (take actions suggested by
    the policy) to get behavior traces
  • Evaluate the behavior traces
  • Take the average value of the behavior traces
    (a simulation-based sketch of this evaluation
    appears after this slide's note)

We will concentrate on infinite horizon
problems (infinite horizon doesn't
necessarily mean that all behavior
traces are infinite. They could be finite
and end in a sink state)
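
Here is a minimal simulation-based sketch of the evaluation recipe
above (simulate the policy, evaluate traces, average them). The
two-state model, the fixed policy, the 100-step horizon cutoff, and
the 1000-trace sample size are all illustrative assumptions.

import random

# Hypothetical two-state MDP (same shape as the earlier sketch).
transition = {
    ("s0", "a"): {"s1": 0.8, "s0": 0.2},
    ("s0", "b"): {"s0": 1.0},
    ("s1", "a"): {"s1": 1.0},
    ("s1", "b"): {"s1": 1.0},
}
reward = {"s0": 0.0, "s1": 1.0}
policy = {"s0": "a", "s1": "a"}       # the fixed policy being evaluated
gamma = 0.9

def sample_next(s, a):
    """Sample s' from Pr(s'|s,a)."""
    dist = transition[(s, a)]
    return random.choices(list(dist), weights=dist.values())[0]

def simulate(policy, s0, steps=100):
    """One behavior trace: discounted sum of rewards along the way."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(steps):            # truncate the infinite horizon
        total += discount * reward[s]
        discount *= gamma
        s = sample_next(s, policy[s])
    return total

def evaluate(policy, s0, n_traces=1000):
    """Average the values of many simulated behavior traces."""
    return sum(simulate(policy, s0) for _ in range(n_traces)) / n_traces

print(evaluate(policy, "s0"))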
12
Horizon and Policy
"If you are twenty and not a liberal, you are
heartless. If you are sixty and not a
conservative, you are mindless."

-- Churchill
  • How long should behavior traces be?
  • Each trace is no longer than k (finite-horizon
    case)
  • The policy will be horizon-dependent (the optimal
    action depends not just on what state you are in,
    but on how far away your horizon is)
  • E.g., financial portfolio advice for yuppies vs.
    retirees
  • No limit on the size of the trace (infinite-horizon
    case)
  • The policy is not horizon-dependent

13
How to handle unbounded state sequences?
  • If we don't have a horizon, then we can have
    potentially infinitely long state sequences.
    Three ways to handle them:
  • Use a discounted reward model (the i-th state in the
    sequence contributes only γ^i R(s_i))
  • Assume that the policy is proper (i.e., each
    sequence terminates in an absorbing state with
    non-zero probability)
  • Consider the average reward per step

14
How to evaluate a policy?
  • Step 1: Define the utility of a sequence of states
    in terms of their rewards
  • Assume stationarity of preferences
  • If you prefer future f1 to f2 starting tomorrow,
    you should prefer them the same way even if they
    start today
  • Then there are only two reasonable ways to define
    the utility of a sequence of states:
  • U(s1, s2, ..., sn) = Σ_i R(s_i)
  • U(s1, s2, ..., sn) = Σ_i γ^i R(s_i)   (0 < γ < 1)
  • The maximum utility is then bounded from above by
    R_max / (1 - γ)
  • Step 2: The utility of a policy π is the expected
    utility of the behaviors exhibited by an agent
    following it: U^π = E[ Σ_{t=0}^∞ γ^t R(s_t) | π ]
  • Step 3: The optimal policy π* is the one that
    maximizes this expectation:
    π* = argmax_π E[ Σ_{t=0}^∞ γ^t R(s_t) | π ]
  • Since there are only |A|^|S| different policies, you
    can evaluate them all in finite time (ha ha..)
    (a short numeric check of the discounted sum and
    its bound follows below)
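
A tiny numeric sketch of the discounted-utility definition and its
R_max/(1 - γ) bound; the reward sequence and γ = 0.9 are made-up
illustrative values.

gamma = 0.9                      # discount factor, 0 < gamma < 1
rewards = [0.0, 0.5, 1.0, 1.0]   # R(s_1), R(s_2), ... for one finite trace

# U(s1,...,sn) = sum_i gamma^i * R(s_i)
u_discounted = sum(gamma**i * r for i, r in enumerate(rewards))

# Even an infinitely long trace is bounded above by R_max / (1 - gamma).
r_max = max(rewards)
bound = r_max / (1 - gamma)

print(u_discounted, bound)       # the discounted sum stays below the bound
assert u_discounted <= bound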

15
Utility of a State
  • The (long-term) utility of a state s with respect
    to a policy π is the expected value of all
    state sequences starting with s:
  • U^π(s) = E[ Σ_{t=0}^∞ γ^t R(s_t) | π, s_0 = s ]
  • The true utility of a state s is just its utility
    w.r.t. the optimal policy: U*(s) = U^{π*}(s)
  • Thus, U* and π* are closely related:
  • π*(s) = argmax_a Σ_{s'} M^a_{ss'} U*(s')
  • as are the utilities of neighboring states:
  • U*(s) = R(s) + γ max_a Σ_{s'} M^a_{ss'} U*(s')

Bellman Equation
(extracting a greedy policy from a given U is sketched
in code below)
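
A minimal sketch of the relation π(s) = argmax_a Σ_{s'} M^a_{ss'} U(s'):
extracting a greedy policy from a given utility table. The transition
dictionary and the utility values here are illustrative assumptions.

# Hypothetical transition model M^a_{ss'} and a given utility table U.
transition = {
    ("s0", "a"): {"s1": 0.8, "s0": 0.2},
    ("s0", "b"): {"s0": 1.0},
    ("s1", "a"): {"s1": 1.0},
    ("s1", "b"): {"s1": 1.0},
}
actions = {"a", "b"}
U = {"s0": 5.0, "s1": 8.0}   # made-up utilities

def greedy_policy(U):
    """pi(s) = argmax_a sum_{s'} M^a_{ss'} U(s')."""
    pi = {}
    for s in U:
        pi[s] = max(actions,
                    key=lambda a: sum(p * U[s2]
                                      for s2, p in transition[(s, a)].items()))
    return pi

print(greedy_policy(U))   # maps each state to its greedy action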
16
Repeat
U* is the maximal expected utility (value),
assuming the optimal policy is followed
17
Optimal Policies depend on rewards..
Repeat
18
Bellman Equations as a basis for computing
optimal policy
  • Qn: Is there a simpler way than having to
    evaluate all |A|^|S| policies?
  • Yes
  • The optimal value and optimal policy are related
    by the Bellman equations:
  • U*(s) = R(s) + γ max_a Σ_{s'} M^a_{ss'} U*(s')
  • π*(s) = argmax_a Σ_{s'} M^a_{ss'} U*(s')
  • The equations can be solved exactly through
  • value iteration (iteratively compute U* and then
    compute π*)
  • policy iteration (iterate over policies)
  • or solved approximately through real-time dynamic
    programming

19
[Figure: stochastic action model with transition probabilities 0.8, 0.1, 0.1]

U(i) = R(i) + γ max_a Σ_j M^a_{ij} U(j)
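
A minimal sketch of this backup for a single state; the two-state
transition table, rewards, and γ = 0.9 are illustrative assumptions.

# Hypothetical model in the same dictionary form as the earlier sketches.
transition = {
    ("s0", "a"): {"s1": 0.8, "s0": 0.2},
    ("s0", "b"): {"s0": 1.0},
    ("s1", "a"): {"s1": 1.0},
    ("s1", "b"): {"s1": 1.0},
}
reward = {"s0": 0.0, "s1": 1.0}
actions = {"a", "b"}
gamma = 0.9

def bellman_backup(s, U):
    """U(s) <- R(s) + gamma * max_a sum_{s'} M^a_{ss'} U(s')."""
    return reward[s] + gamma * max(
        sum(p * U[s2] for s2, p in transition[(s, a)].items())
        for a in actions)

U = {"s0": 0.0, "s1": 0.0}            # initial value estimates
print(bellman_backup("s0", U))        # one backup of state s0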

20
Value Iteration Demo
  • http://www.cs.ubc.ca/spider/poole/demos/mdp/vi.html
  • Things to note:
  • The way the values change (states far from
    absorbing states may first reduce and then
    increase their values)
  • The difference in convergence speed between
    policies and values

21
Updates can be done synchronously OR
asynchronously -- convergence is guaranteed
as long as each state is updated
infinitely often
Why are values coming down first? Why are some
states reaching optimal value faster?
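
A minimal sketch of asynchronous (in-place) updating: states are
backed up one at a time, in random order, directly into the current
value table. The tiny model and the fixed budget of 2000 backups are
illustrative assumptions.

import random

transition = {
    ("s0", "a"): {"s1": 0.8, "s0": 0.2},
    ("s0", "b"): {"s0": 1.0},
    ("s1", "a"): {"s1": 1.0},
    ("s1", "b"): {"s1": 1.0},
}
reward = {"s0": 0.0, "s1": 1.0}
actions = {"a", "b"}
gamma = 0.9
states = ["s0", "s1"]

U = {s: 0.0 for s in states}

# Asynchronous value iteration: pick states in arbitrary order and
# overwrite U[s] in place; convergence only needs every state to keep
# being selected (here: many random picks).
for _ in range(2000):
    s = random.choice(states)
    U[s] = reward[s] + gamma * max(
        sum(p * U[s2] for s2, p in transition[(s, a)].items())
        for a in actions)

print(U)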
22
Terminating Value Iteration
  • The basic idea is to terminate the value
    iteration when the values have converged (i.e.,
    are not changing much from iteration to iteration)
  • Set a threshold ε and stop when the change across
    two consecutive iterations is less than ε
  • There is a minor problem, since the value is a vector
  • We can bound the maximum change that is allowed
    in any of the dimensions between two successive
    iterations by ε
  • The max norm ||.||∞ of a vector is the maximal
    absolute value among all its dimensions. We are
    basically terminating when ||U_i - U_{i+1}||∞ < ε
    (a sketch of value iteration with this stopping
    test follows below)
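
A minimal sketch of synchronous value iteration with the max-norm
stopping test ||U_i - U_{i+1}|| < ε; the model, γ = 0.9, and ε = 1e-6
are illustrative assumptions.

transition = {
    ("s0", "a"): {"s1": 0.8, "s0": 0.2},
    ("s0", "b"): {"s0": 1.0},
    ("s1", "a"): {"s1": 1.0},
    ("s1", "b"): {"s1": 1.0},
}
reward = {"s0": 0.0, "s1": 1.0}
actions = {"a", "b"}
states = ["s0", "s1"]
gamma, epsilon = 0.9, 1e-6

U = {s: 0.0 for s in states}
while True:
    # Synchronous sweep: all new values are computed from the old vector U.
    U_new = {
        s: reward[s] + gamma * max(
            sum(p * U[s2] for s2, p in transition[(s, a)].items())
            for a in actions)
        for s in states
    }
    # Max-norm of the change between two consecutive iterations.
    if max(abs(U_new[s] - U[s]) for s in states) < epsilon:
        U = U_new
        break
    U = U_new

print(U)   # (approximately) converged utilities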

23
Policies converge earlier than values
  • There are a finite number of policies but an
    infinite number of value functions.
  • So entire regions of the value-vector space are
    mapped to a single policy
  • So policies may converge faster than values.
    (This motivates searching in the space of policies.)
  • Given a utility vector U_i we can compute the
    greedy policy π_{U_i}
  • The policy loss of π_{U_i} is ||U^{π_{U_i}} - U*||
  • (the max-norm difference of two vectors is the
    maximum amount by which they differ on any
    dimension)

[Figure: the value space (V(S1), V(S2)) partitioned into
regions P1-P4, each region mapping to a single greedy
policy, with U* marked.]
Consider an MDP with 2 states and 2 actions
24
n linear equations with n unknowns.
We can either solve the linear equations exactly,
or solve them approximately by running the
value iteration a few times (the update won't
have the max operation). A sketch of the exact
solve follows below.
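
A minimal sketch of the exact solve for a fixed policy π: the n
equations U^π(s) = R(s) + γ Σ_{s'} M^{π(s)}_{ss'} U^π(s') rearranged
as (I - γ M^π) U^π = R and handed to a linear solver. The use of
numpy, the tiny model, and the fixed policy are illustrative
assumptions.

import numpy as np

# Hypothetical two-state model and a fixed policy pi.
states = ["s0", "s1"]
transition = {
    ("s0", "a"): {"s1": 0.8, "s0": 0.2},
    ("s1", "a"): {"s1": 1.0},
}
reward = np.array([0.0, 1.0])        # R(s0), R(s1)
policy = {"s0": "a", "s1": "a"}
gamma = 0.9
idx = {s: i for i, s in enumerate(states)}

# Build the |S| x |S| matrix M^pi with entries Pr(s'|s, pi(s)).
M = np.zeros((len(states), len(states)))
for s in states:
    for s2, p in transition[(s, policy[s])].items():
        M[idx[s], idx[s2]] = p

# Solve (I - gamma * M) U = R  -- the exact policy-evaluation step.
U = np.linalg.solve(np.eye(len(states)) - gamma * M, reward)
print(dict(zip(states, U)))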
25
Bellman equations when actions have costs
  • The model discussed in class ignores action costs
    and only considers state rewards
  • C(s,a) is the cost of doing action a in state s
  • Assume costs are just negative rewards..
  • The Bellman equation then becomes
  • U(s) = R(s) + max_a [ -C(s,a) + γ Σ_{s'} M^a_{ss'} U(s') ]
  • Notice that the only difference is that -C(s,a)
    is now inside the maximization
  • With this model, we can talk about partial-satisfaction
    planning problems, where
  • actions have costs, goals have utilities, and the
    optimal plan may not satisfy all goals
    (a sketch of this cost-aware backup appears below)
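
A minimal sketch of the cost-aware backup above,
U(s) = R(s) + max_a [ -C(s,a) + γ Σ_{s'} M^a_{ss'} U(s') ];
the action costs C(s,a) and the rest of the tiny model are made-up
illustrative values.

transition = {
    ("s0", "a"): {"s1": 0.8, "s0": 0.2},
    ("s0", "b"): {"s0": 1.0},
    ("s1", "a"): {"s1": 1.0},
    ("s1", "b"): {"s1": 1.0},
}
reward = {"s0": 0.0, "s1": 1.0}
cost = {("s0", "a"): 0.1, ("s0", "b"): 0.0,   # C(s,a): made-up action costs
        ("s1", "a"): 0.0, ("s1", "b"): 0.0}
actions = {"a", "b"}
gamma = 0.9

def backup_with_costs(s, U):
    """U(s) = R(s) + max_a [ -C(s,a) + gamma * sum_{s'} M^a_{ss'} U(s') ]."""
    return reward[s] + max(
        -cost[(s, a)] + gamma * sum(p * U[s2]
                                    for s2, p in transition[(s, a)].items())
        for a in actions)

U = {"s0": 0.0, "s1": 0.0}
print(backup_with_costs("s0", U))   # note: -C(s,a) sits inside the max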