Title: Markov Decision Processes (Decision Theoretic Planning)

1. Agenda: Markov Decision Processes (Decision Theoretic Planning)
   4/1
2. MDPs as a Way of Reducing Planning Complexity
- MDPs provide a normative basis for talking about optimal plans (policies) in the context of stochastic actions and complex reward models
- Optimal policies for MDPs can be computed in polynomial time
- In contrast, even classical planning is NP-complete or PSPACE-complete (depending on whether the plans are polynomial or exponential length)
- "So, convert planning problems to MDP problems, so we can get polynomial time performance."
- To see this, note that the Sorting problem can be written as a planning problem. But Sorting is only polynomial time. Thus, the inherent complexity of planning is only polynomial. All that MDP conversion does is let planning exhibit its inherent polynomial complexity.
Happy April 1st :-)
3. MDP Complexity: The Real Deal
- Complexity results are stated in terms of the size of the input (measured in some way)
- MDP complexity results are typically in terms of state-space size
- Planning complexity results are typically in terms of the factored input (in terms of state variables)
- The state space is already exponential in the number of state variables
- So, polynomial in the state space implies exponential in the factored representation
- More depressingly, optimal policy construction is exponential and undecidable, respectively, for POMDPs with finite and infinite horizons, even with input size measured in terms of the explicit state space
- So clearly, we don't consider compiling planning problems to the MDP model for efficiency
4. Forget your homework grading. Forget your project grading. We'll make it look like you remembered.
5. Agenda
- General (FO)MDP model
- Action (Transition) model
- Action Cost Model
- Reward Model
- Histories
- Horizon
- Policies
- Optimal value and policy
- Value iteration/Policy Iteration/RTDP
- Special cases of MDP model relevant to Planning
- Pure cost models (goal states are absorbing)
- Reward/Cost models
- Over-subscription models
- Connections to heuristic search
- Efficient approaches for policy construction
6. Markov Decision Process (MDP)
- S: a set of states
- A: a set of actions
- Pr(s'|s,a): transition model
  - (aka M^a_{s,s'})
- C(s,a,s'): cost model
- G: set of goals
- s_0: start state
- γ: discount factor
- R(s,a,s'): reward model
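To make these components concrete, here is a minimal sketch (not from the slides; all field names are illustrative) of how such an MDP tuple could be held in Python; the code sketches after later slides assume this container:

    from dataclasses import dataclass, field
    from typing import Dict, List, Set, Tuple

    State = str
    Action = str

    @dataclass
    class MDP:
        states: List[State]                                   # S
        actions: List[Action]                                 # A
        # transition[s][a] = [(s', Pr(s'|s,a)), ...]   -- the M^a_{s,s'} entries
        transition: Dict[State, Dict[Action, List[Tuple[State, float]]]]
        reward: Dict[State, float]                            # R(s) (state-based rewards, as used in later slides)
        goals: Set[State] = field(default_factory=set)        # G, absorbing goal states
        start: State = "s0"                                    # s_0
        gamma: float = 0.9                                     # gamma, discount factor (1.0 for undiscounted models)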
7. Objective of a Fully Observable MDP
- Find a policy π : S → A
- which optimises
  - minimises expected cost to reach a goal
  - maximises expected reward
  - maximises expected (reward - cost)
- given a ____ horizon
  - finite
  - infinite
  - indefinite
- assuming full observability
- discounted or undiscounted
8. Histories; Value of Histories; (Expected) Value of a Policy; Optimal Value; Bellman Principle; Bellman's Equation
9. Policy Evaluation vs. Optimal Value Function in Finite vs. Infinite Horizon
10. Repeat
- Can generalize to have action costs C(a,s)
- If the M^a_{ij} matrix is not known a priori, then we have a reinforcement learning scenario..
11. What does a solution to an MDP look like?
- The solution should tell the optimal action to do in each state (called a Policy)
  - A policy is a function from states to actions (see the finite horizon case below)
  - Not a sequence of actions anymore
    - Needed because of the non-deterministic actions
  - If there are |S| states and |A| actions that we can do at each state, then there are |A|^|S| policies
- How do we get the best policy?
  - Pick the policy that gives the maximal expected reward
  - For each policy π:
    - Simulate the policy (take actions suggested by the policy) to get behavior traces
    - Evaluate the behavior traces
    - Take the average value of the behavior traces (a simulation-based sketch follows this slide)
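As a rough illustration of the simulate-and-average idea above (illustrative code only; it assumes the MDP container sketched after slide 6 and a policy given as a state-to-action dict):

    import random

    def simulate_trace(mdp, policy, max_steps=200):
        """Follow the policy from the start state; return the discounted return of one behavior trace."""
        s, total, discount = mdp.start, 0.0, 1.0
        for _ in range(max_steps):                     # cut traces off so they stay finite
            total += discount * mdp.reward.get(s, 0.0)
            if s in mdp.goals:                         # sink/absorbing state ends the trace
                break
            successors = mdp.transition[s][policy[s]]  # [(s', Pr(s'|s,a)), ...]
            states, probs = zip(*successors)
            s = random.choices(states, weights=probs)[0]
            discount *= mdp.gamma
        return total

    def evaluate_policy_by_simulation(mdp, policy, num_traces=1000):
        """Average the returns of many simulated traces to estimate the policy's value."""
        return sum(simulate_trace(mdp, policy) for _ in range(num_traces)) / num_traces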
12. Horizon & Policy
"If you are twenty and not a liberal, you are heartless. If you are sixty and not a conservative, you are mindless." --Churchill
- How long should behavior traces be?
  - Each trace is no longer than k (finite horizon case)
    - The policy will be horizon-dependent (the optimal action depends not just on what state you are in, but also on how far away your horizon is)
    - E.g., financial portfolio advice for yuppies vs. retirees
  - No limit on the size of the trace (infinite horizon case)
    - The policy is not horizon-dependent
We will concentrate on infinite horizon problems (infinite horizon doesn't necessarily mean that all behavior traces are infinite; they could be finite and end in a sink state)
13. How to handle unbounded state sequences?
- If we don't have a horizon, then we can have potentially infinitely long state sequences. Three ways to handle them:
  - Use a discounted reward model (the i-th state in the sequence contributes only γ^i R(s_i))
  - Assume that the policy is proper (i.e., each sequence terminates into an absorbing state with non-zero probability)
  - Consider average reward per step
14. How to evaluate a policy?
- Step 1: Define the utility of a sequence of states in terms of their rewards
  - Assume stationarity of preferences
    - If you prefer future f1 to f2 starting tomorrow, you should prefer them the same way even if they start today
  - Then, there are only two reasonable ways to define the utility of a sequence of states:
    - U(s_1, s_2, ..., s_n) = Σ_i R(s_i)
    - U(s_1, s_2, ..., s_n) = Σ_i γ^i R(s_i)   (0 < γ < 1)
      - Maximum utility is bounded from above by R_max / (1 - γ)  (see the short sketch after this slide)
- Step 2: The utility of a policy π is the expected utility of the behaviors exhibited by an agent following it:  U^π = E[ Σ_{t=0..∞} γ^t R(s_t) | π ]
- Step 3: The optimal policy π* is the one that maximizes this expectation:  π* = argmax_π E[ Σ_{t=0..∞} γ^t R(s_t) | π ]
  - Since there are only |A|^|S| different policies, you can evaluate them all in finite time (Haa haa..)
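A tiny illustrative sketch of the discounted-sum definition and its R_max / (1 - γ) bound (names and numbers are made up):

    def discounted_utility(rewards, gamma=0.9):
        """U(s_1..s_n) = sum_i gamma^i * R(s_i), the second definition above."""
        return sum((gamma ** i) * r for i, r in enumerate(rewards))

    rewards = [1.0, 0.0, 2.0, 1.0]                # rewards along one example trace
    print(discounted_utility(rewards, 0.9))       # stays finite even for long traces
    print(max(rewards) / (1 - 0.9))               # upper bound R_max / (1 - gamma)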
15. Utility of a State
- The (long-term) utility of a state s with respect to a policy π is the expected value of all state sequences starting with s:
  - U^π(s) = E[ Σ_{t=0..∞} γ^t R(s_t) | π, s_0 = s ]
- The true utility of a state s is just its utility w.r.t. the optimal policy:  U*(s) = U^{π*}(s)
- Thus, U* and π* are closely related:
  - π*(s) = argmax_a Σ_{s'} M^a_{ss'} U*(s')
  - ...as are the utilities of neighboring states:
  - U*(s) = R(s) + max_a Σ_{s'} M^a_{ss'} U*(s')   (Bellman Eqn)
16. Repeat
U* is the maximal expected utility (value) assuming the optimal policy
17. Optimal Policies depend on rewards..
Repeat
18. Bellman Equations as a Basis for Computing the Optimal Policy
- Qn: Is there a simpler way than having to evaluate |A|^|S| policies?
  - Yes
- The optimal value and optimal policy are related by the Bellman equations:
  - U*(s) = R(s) + max_a Σ_{s'} M^a_{ss'} U*(s')
  - π*(s) = argmax_a Σ_{s'} M^a_{ss'} U*(s')
- The equations can be solved exactly through
  - value iteration (iteratively compute U* and then compute π*), sketched in code below
  - policy iteration (iterate over policies)
- Or solved approximately through real-time dynamic programming
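A compact sketch of value iteration over the MDP container from slide 6 (illustrative, not the deck's own code; the ε-termination test anticipates slide 22):

    def value_iteration(mdp, epsilon=1e-4):
        """Iterate the Bellman update U(s) <- R(s) + gamma * max_a sum_s' M^a_{ss'} U(s')."""
        U = {s: 0.0 for s in mdp.states}
        while True:
            U_new, max_change = {}, 0.0
            for s in mdp.states:
                best = 0.0
                if s not in mdp.goals:                 # goal states are absorbing
                    best = max(sum(p * U[s2] for s2, p in succ)
                               for succ in mdp.transition[s].values())
                U_new[s] = mdp.reward.get(s, 0.0) + mdp.gamma * best
                max_change = max(max_change, abs(U_new[s] - U[s]))
            U = U_new
            if max_change < epsilon:                   # max-norm termination (slide 22)
                return U

    def greedy_policy(mdp, U):
        """pi(s) = argmax_a sum_s' M^a_{ss'} U(s')."""
        return {s: max(mdp.transition[s],
                       key=lambda a: sum(p * U[s2] for s2, p in mdp.transition[s][a]))
                for s in mdp.states if s not in mdp.goals}

Setting gamma = 1.0 in the MDP recovers the undiscounted update U(i) = R(i) + max_a Σ_j M^a_{ij} U(j) shown on the next slide.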
19. [Figure: transition model with probabilities .8, .1, .1]
U(i) = R(i) + max_a Σ_j M^a_{ij} U(j)
20. Value Iteration Demo
- http://www.cs.ubc.ca/spider/poole/demos/mdp/vi.html
- Things to note:
  - The way the values change (states far from absorbing states may first reduce and then increase their values)
  - The difference in convergence speed between policy and value
21. Updates can be done synchronously OR asynchronously -- convergence is guaranteed as long as each state is updated infinitely often
Why are values coming down first? Why are some states reaching the optimal value faster?
22. Terminating Value Iteration
- The basic idea is to terminate the value iteration when the values have converged (i.e., are not changing much from iteration to iteration)
  - Set a threshold ε and stop when the change across two consecutive iterations is less than ε
  - There is a minor problem, since the value is a vector
    - We can bound the maximum change that is allowed in any of the dimensions between two successive iterations by ε
    - The max norm ||.||_∞ of a vector is the maximal value among all its dimensions. We are basically terminating when ||U_i - U_{i+1}|| < ε  (a one-line sketch follows this slide)
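A one-line sketch of that max-norm test (U_old and U_new are value dictionaries as in the value-iteration sketch after slide 18):

    def max_norm_diff(U_old, U_new):
        """||U_i - U_{i+1}||_inf : the largest change in any single dimension of the value vector."""
        return max(abs(U_new[s] - U_old[s]) for s in U_old)

    # terminate value iteration when max_norm_diff(U_old, U_new) < epsilon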
23. Policies converge earlier than values
- There are a finite number of policies but an infinite number of value functions
  - So entire regions of the value vector are mapped to a specific policy
  - So policies may be converging faster than values. Search in the space of policies!
- Given a utility vector U_i we can compute the greedy policy π_{U_i}
- The policy loss of π_{U_i} is ||U^{π_{U_i}} - U*||  (a code sketch follows this slide)
  - (the max norm difference of two vectors is the maximum amount by which they differ on any dimension)
[Figure: the value-vector space with axes V(S1) and V(S2) for an MDP with 2 states and 2 actions, partitioned into regions whose greedy policies are P1, P2, P3, and P4; the point U is marked]
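An illustrative way to compute that policy loss in code, assuming the hypothetical helpers from the other sketches (greedy_policy from slide 18, max_norm_diff from slide 22, and an exact policy evaluator like the one sketched after slide 24):

    def policy_loss(mdp, U_i, U_star):
        """||U^{pi_{U_i}} - U*||_inf : how much is lost by acting greedily w.r.t. U_i instead of U*."""
        pi_i = greedy_policy(mdp, U_i)                 # greedy policy for the current utility vector
        U_pi_i = policy_evaluation_exact(mdp, pi_i)    # its true value function
        return max_norm_diff(U_pi_i, U_star)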
24. n linear equations with n unknowns
We can either solve the linear equations exactly, or solve them approximately by running the value iteration a few times (the update won't have the max operation); an exact-solve sketch follows this slide
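A sketch of the exact-solve option (illustrative; uses numpy, the MDP container from slide 6, and a discount γ < 1 so that I - γ·M_π is invertible):

    import numpy as np

    def policy_evaluation_exact(mdp, policy):
        """Solve U = R + gamma * M_pi U, i.e. (I - gamma * M_pi) U = R: n linear equations, n unknowns."""
        states = list(mdp.states)
        idx = {s: i for i, s in enumerate(states)}
        n = len(states)
        M_pi = np.zeros((n, n))
        for s in states:
            if s in mdp.goals:                         # absorbing states have no outgoing transitions
                continue
            for s2, p in mdp.transition[s][policy[s]]:
                M_pi[idx[s], idx[s2]] = p
        R = np.array([mdp.reward.get(s, 0.0) for s in states])
        U = np.linalg.solve(np.eye(n) - mdp.gamma * M_pi, R)
        return dict(zip(states, U))

Note that dropping the max over actions is exactly what makes the system linear.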
25. Bellman equations when actions have costs
- The model discussed in class ignores action costs and only considers state rewards
  - C(s,a) is the cost of doing action a in state s
  - Assume costs are just negative rewards..
- The Bellman equation then becomes
  - U(s) = R(s) + max_a [ -C(s,a) + Σ_{s'} M^a_{ss'} U(s') ]
  - Notice that the only difference is that -C(s,a) is now inside the maximization (see the backup sketch after this slide)
- With this model, we can talk about partial satisfaction planning problems, where
  - actions have costs, goals have utilities, and the optimal plan may not satisfy all goals
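A sketch of how the Bellman backup from the earlier value-iteration code might change with action costs (illustrative; cost is assumed to be a dict keyed by (state, action)):

    def bellman_backup_with_costs(mdp, U, s, cost):
        """U(s) = R(s) + max_a [ -C(s,a) + sum_s' M^a_{ss'} U(s') ] -- the cost sits inside the max."""
        if s in mdp.goals:
            return mdp.reward.get(s, 0.0)
        return mdp.reward.get(s, 0.0) + max(
            -cost.get((s, a), 0.0) + sum(p * U[s2] for s2, p in succ)
            for a, succ in mdp.transition[s].items())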