Title: Markov Decision Processes (Decision Theoretic Planning)

1. Agenda: Markov Decision Processes (Decision Theoretic Planning)
   4/1
2. MDPs as a Way of Reducing Planning Complexity
- MDPs provide a normative basis for talking about optimal plans (policies) in the context of stochastic actions and complex reward models
- Optimal policies for MDPs can be computed in polynomial time
- In contrast, even classical planning is NP-complete or PSPACE-complete (depending on whether the plans are polynomial or exponential length)
- "So, convert planning problems to MDP problems, so we can get polynomial time performance."
- To see this, note that the Sorting problem can be written as a planning problem. But Sorting is only polynomial time. Thus, the inherent complexity of planning is only polynomial. All that MDP conversion does is let planning exhibit its inherent polynomial complexity.
Happy April 1st :-)
3. MDP Complexity: The Real Deal
- Complexity results are stated in terms of the size of the input (measured in some way)
- MDP complexity results are typically in terms of state-space size
- Planning complexity results are typically in terms of the factored input (in terms of state variables)
- The state space is already exponential in the number of state variables
- So, polynomial in the state space implies exponential in the factored representation
- More depressingly, optimal policy construction is exponential and undecidable, respectively, for POMDPs with finite and infinite horizons, even with input size measured in terms of the explicit state space
- So clearly, we don't consider compiling planning problems to the MDP model for efficiency
4. Forget your homework grading. Forget your project grading. We'll make it look like you remembered.
5. Agenda
- General (FO)MDP model
- Action (Transition) model
- Action Cost Model
- Reward Model
- Histories
- Horizon
- Policies
- Optimal value and policy
- Value iteration/Policy Iteration/RTDP
- Special cases of MDP model relevant to Planning
- Pure cost models (goal states are absorbing)
- Reward/Cost models
- Over-subscription models
- Connections to heuristic search
- Efficient approaches for policy construction
6. Markov Decision Process (MDP)
- S: a set of states
- A: a set of actions
- Pr(s'|s,a): transition model
  - (aka M^a_{s,s'})
- C(s,a,s'): cost model
- G: set of goals
- s_0: start state
- γ: discount factor
- R(s,a,s'): reward model
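To make these components concrete, here is a minimal sketch (not from the slides; all field names are illustrative) of how such an MDP tuple could be held in Python; the code sketches after later slides assume this container:

    from dataclasses import dataclass, field
    from typing import Dict, List, Set, Tuple

    State = str
    Action = str

    @dataclass
    class MDP:
        states: List[State]                                   # S
        actions: List[Action]                                 # A
        # transition[s][a] = [(s', Pr(s'|s,a)), ...]   -- the M^a_{s,s'} entries
        transition: Dict[State, Dict[Action, List[Tuple[State, float]]]]
        reward: Dict[State, float]                            # R(s) (state-based rewards, as used in later slides)
        goals: Set[State] = field(default_factory=set)        # G, absorbing goal states
        start: State = "s0"                                    # s_0
        gamma: float = 0.9                                     # gamma, discount factor (1.0 for undiscounted models)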
7. Objective of a Fully Observable MDP
- Find a policy π : S → A
- which optimises
  - minimises expected cost to reach a goal
  - maximises expected reward
  - maximises expected (reward - cost)
- given a ____ horizon
  - finite
  - infinite
  - indefinite
- assuming full observability
- discounted or undiscounted
8. Histories; Value of Histories; (Expected) Value of a Policy; Optimal Value; Bellman Principle; Bellman's Equation
9. Policy Evaluation vs. Optimal Value Function in Finite vs. Infinite Horizon
10. Repeat
- Can generalize to have action costs C(a,s)
- If the M^a_{ij} matrix is not known a priori, then we have a reinforcement learning scenario..
11. What does a solution to an MDP look like?
- The solution should tell the optimal action to do in each state (called a Policy)
  - A policy is a function from states to actions (see the finite horizon case below)
  - Not a sequence of actions anymore
    - Needed because of the non-deterministic actions
  - If there are |S| states and |A| actions that we can do at each state, then there are |A|^|S| policies
- How do we get the best policy?
  - Pick the policy that gives the maximal expected reward
  - For each policy π:
    - Simulate the policy (take actions suggested by the policy) to get behavior traces
    - Evaluate the behavior traces
    - Take the average value of the behavior traces (a simulation-based sketch follows this slide)
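As a rough illustration of the simulate-and-average idea above (illustrative code only; it assumes the MDP container sketched after slide 6 and a policy given as a state-to-action dict):

    import random

    def simulate_trace(mdp, policy, max_steps=200):
        """Follow the policy from the start state; return the discounted return of one behavior trace."""
        s, total, discount = mdp.start, 0.0, 1.0
        for _ in range(max_steps):                     # cut traces off so they stay finite
            total += discount * mdp.reward.get(s, 0.0)
            if s in mdp.goals:                         # sink/absorbing state ends the trace
                break
            successors = mdp.transition[s][policy[s]]  # [(s', Pr(s'|s,a)), ...]
            states, probs = zip(*successors)
            s = random.choices(states, weights=probs)[0]
            discount *= mdp.gamma
        return total

    def evaluate_policy_by_simulation(mdp, policy, num_traces=1000):
        """Average the returns of many simulated traces to estimate the policy's value."""
        return sum(simulate_trace(mdp, policy) for _ in range(num_traces)) / num_traces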
12. Horizon & Policy
"If you are twenty and not a liberal, you are heartless. If you are sixty and not a conservative, you are mindless." --Churchill
- How long should behavior traces be?
  - Each trace is no longer than k (finite horizon case)
    - The policy will be horizon-dependent (the optimal action depends not just on what state you are in, but also on how far away your horizon is)
    - E.g., financial portfolio advice for yuppies vs. retirees
  - No limit on the size of the trace (infinite horizon case)
    - The policy is not horizon-dependent
We will concentrate on infinite horizon problems (infinite horizon doesn't necessarily mean that all behavior traces are infinite; they could be finite and end in a sink state)
13. How to handle unbounded state sequences?
- If we don't have a horizon, then we can have potentially infinitely long state sequences. Three ways to handle them:
  - Use a discounted reward model (the i-th state in the sequence contributes only γ^i R(s_i))
  - Assume that the policy is proper (i.e., each sequence terminates into an absorbing state with non-zero probability)
  - Consider average reward per step
14. How to evaluate a policy?
- Step 1: Define the utility of a sequence of states in terms of their rewards
  - Assume stationarity of preferences
    - If you prefer future f1 to f2 starting tomorrow, you should prefer them the same way even if they start today
  - Then, there are only two reasonable ways to define the utility of a sequence of states:
    - U(s_1, s_2, ..., s_n) = Σ_i R(s_i)
    - U(s_1, s_2, ..., s_n) = Σ_i γ^i R(s_i)   (0 < γ < 1)
      - Maximum utility is bounded from above by R_max / (1 - γ)  (see the short sketch after this slide)
- Step 2: The utility of a policy π is the expected utility of the behaviors exhibited by an agent following it:  U^π = E[ Σ_{t=0..∞} γ^t R(s_t) | π ]
- Step 3: The optimal policy π* is the one that maximizes this expectation:  π* = argmax_π E[ Σ_{t=0..∞} γ^t R(s_t) | π ]
  - Since there are only |A|^|S| different policies, you can evaluate them all in finite time (Haa haa..)
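A tiny illustrative sketch of the discounted-sum definition and its R_max / (1 - γ) bound (names and numbers are made up):

    def discounted_utility(rewards, gamma=0.9):
        """U(s_1..s_n) = sum_i gamma^i * R(s_i), the second definition above."""
        return sum((gamma ** i) * r for i, r in enumerate(rewards))

    rewards = [1.0, 0.0, 2.0, 1.0]                # rewards along one example trace
    print(discounted_utility(rewards, 0.9))       # stays finite even for long traces
    print(max(rewards) / (1 - 0.9))               # upper bound R_max / (1 - gamma)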
15. Utility of a State
- The (long-term) utility of a state s with respect to a policy π is the expected value of all state sequences starting with s:
  - U^π(s) = E[ Σ_{t=0..∞} γ^t R(s_t) | π, s_0 = s ]
- The true utility of a state s is just its utility w.r.t. the optimal policy:  U*(s) = U^{π*}(s)
- Thus, U* and π* are closely related:
  - π*(s) = argmax_a Σ_{s'} M^a_{ss'} U*(s')
  - ...as are the utilities of neighboring states:
  - U*(s) = R(s) + max_a Σ_{s'} M^a_{ss'} U*(s')   (Bellman Eqn)
16. Repeat
U* is the maximal expected utility (value) assuming the optimal policy
17. Optimal Policies depend on rewards..
Repeat
18. Bellman Equations as a Basis for Computing the Optimal Policy
- Qn: Is there a simpler way than having to evaluate |A|^|S| policies?
  - Yes
- The optimal value and optimal policy are related by the Bellman equations:
  - U*(s) = R(s) + max_a Σ_{s'} M^a_{ss'} U*(s')
  - π*(s) = argmax_a Σ_{s'} M^a_{ss'} U*(s')
- The equations can be solved exactly through
  - value iteration (iteratively compute U* and then compute π*), sketched in code below
  - policy iteration (iterate over policies)
- Or solved approximately through real-time dynamic programming
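A compact sketch of value iteration over the MDP container from slide 6 (illustrative, not the deck's own code; the ε-termination test anticipates slide 22):

    def value_iteration(mdp, epsilon=1e-4):
        """Iterate the Bellman update U(s) <- R(s) + gamma * max_a sum_s' M^a_{ss'} U(s')."""
        U = {s: 0.0 for s in mdp.states}
        while True:
            U_new, max_change = {}, 0.0
            for s in mdp.states:
                best = 0.0
                if s not in mdp.goals:                 # goal states are absorbing
                    best = max(sum(p * U[s2] for s2, p in succ)
                               for succ in mdp.transition[s].values())
                U_new[s] = mdp.reward.get(s, 0.0) + mdp.gamma * best
                max_change = max(max_change, abs(U_new[s] - U[s]))
            U = U_new
            if max_change < epsilon:                   # max-norm termination (slide 22)
                return U

    def greedy_policy(mdp, U):
        """pi(s) = argmax_a sum_s' M^a_{ss'} U(s')."""
        return {s: max(mdp.transition[s],
                       key=lambda a: sum(p * U[s2] for s2, p in mdp.transition[s][a]))
                for s in mdp.states if s not in mdp.goals}

Setting gamma = 1.0 in the MDP recovers the undiscounted update U(i) = R(i) + max_a Σ_j M^a_{ij} U(j) shown on the next slide.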
19. [Figure: transition model with probabilities .8, .1, .1]
U(i) = R(i) + max_a Σ_j M^a_{ij} U(j)
20. Value Iteration Demo
- http://www.cs.ubc.ca/spider/poole/demos/mdp/vi.html
- Things to note:
  - The way the values change (states far from absorbing states may first reduce and then increase their values)
  - The difference in convergence speed between policy and value
21. Updates can be done synchronously OR asynchronously -- convergence is guaranteed as long as each state is updated infinitely often
Why are values coming down first? Why are some states reaching the optimal value faster?
22. Terminating Value Iteration
- The basic idea is to terminate the value iteration when the values have converged (i.e., are not changing much from iteration to iteration)
  - Set a threshold ε and stop when the change across two consecutive iterations is less than ε
  - There is a minor problem, since the value is a vector
    - We can bound the maximum change that is allowed in any of the dimensions between two successive iterations by ε
    - The max norm ||.||_∞ of a vector is the maximal value among all its dimensions. We are basically terminating when ||U_i - U_{i+1}|| < ε  (a one-line sketch follows this slide)
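A one-line sketch of that max-norm test (U_old and U_new are value dictionaries as in the value-iteration sketch after slide 18):

    def max_norm_diff(U_old, U_new):
        """||U_i - U_{i+1}||_inf : the largest change in any single dimension of the value vector."""
        return max(abs(U_new[s] - U_old[s]) for s in U_old)

    # terminate value iteration when max_norm_diff(U_old, U_new) < epsilon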
23. Policies converge earlier than values
- There are a finite number of policies but an infinite number of value functions
  - So entire regions of the value vector are mapped to a specific policy
  - So policies may be converging faster than values. Search in the space of policies!
- Given a utility vector U_i we can compute the greedy policy π_{U_i}
- The policy loss of π_{U_i} is ||U^{π_{U_i}} - U*||  (a code sketch follows this slide)
  - (the max norm difference of two vectors is the maximum amount by which they differ on any dimension)
[Figure: the value-vector space with axes V(S1) and V(S2) for an MDP with 2 states and 2 actions, partitioned into regions whose greedy policies are P1, P2, P3, and P4; the point U is marked]
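An illustrative way to compute that policy loss in code, assuming the hypothetical helpers from the other sketches (greedy_policy from slide 18, max_norm_diff from slide 22, and an exact policy evaluator like the one sketched after slide 24):

    def policy_loss(mdp, U_i, U_star):
        """||U^{pi_{U_i}} - U*||_inf : how much is lost by acting greedily w.r.t. U_i instead of U*."""
        pi_i = greedy_policy(mdp, U_i)                 # greedy policy for the current utility vector
        U_pi_i = policy_evaluation_exact(mdp, pi_i)    # its true value function
        return max_norm_diff(U_pi_i, U_star)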
24. n linear equations with n unknowns
We can either solve the linear equations exactly, or solve them approximately by running the value iteration a few times (the update won't have the max operation); an exact-solve sketch follows this slide
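A sketch of the exact-solve option (illustrative; uses numpy, the MDP container from slide 6, and a discount γ < 1 so that I - γ·M_π is invertible):

    import numpy as np

    def policy_evaluation_exact(mdp, policy):
        """Solve U = R + gamma * M_pi U, i.e. (I - gamma * M_pi) U = R: n linear equations, n unknowns."""
        states = list(mdp.states)
        idx = {s: i for i, s in enumerate(states)}
        n = len(states)
        M_pi = np.zeros((n, n))
        for s in states:
            if s in mdp.goals:                         # absorbing states have no outgoing transitions
                continue
            for s2, p in mdp.transition[s][policy[s]]:
                M_pi[idx[s], idx[s2]] = p
        R = np.array([mdp.reward.get(s, 0.0) for s in states])
        U = np.linalg.solve(np.eye(n) - mdp.gamma * M_pi, R)
        return dict(zip(states, U))

Note that dropping the max over actions is exactly what makes the system linear.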
25. Bellman equations when actions have costs
- The model discussed in class ignores action costs and only considers state rewards
  - C(s,a) is the cost of doing action a in state s
  - Assume costs are just negative rewards..
- The Bellman equation then becomes
  - U(s) = R(s) + max_a [ -C(s,a) + Σ_{s'} M^a_{ss'} U(s') ]
  - Notice that the only difference is that -C(s,a) is now inside the maximization (see the backup sketch after this slide)
- With this model, we can talk about partial satisfaction planning problems, where
  - actions have costs, goals have utilities, and the optimal plan may not satisfy all goals
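A sketch of how the Bellman backup from the earlier value-iteration code might change with action costs (illustrative; cost is assumed to be a dict keyed by (state, action)):

    def bellman_backup_with_costs(mdp, U, s, cost):
        """U(s) = R(s) + max_a [ -C(s,a) + sum_s' M^a_{ss'} U(s') ] -- the cost sits inside the max."""
        if s in mdp.goals:
            return mdp.reward.get(s, 0.0)
        return mdp.reward.get(s, 0.0) + max(
            -cost.get((s, a), 0.0) + sum(p * U[s2] for s2, p in succ)
            for a, succ in mdp.transition[s].items())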