Title: Markov Decision Process (MDP)
1. Markov Decision Process (MDP)
- Ruti Glick
- Bar-Ilan University
2. Policy
- A policy is similar to a plan
- It is generated ahead of time
- Unlike traditional plans, it is not a sequence of actions that the agent must execute
- If there are failures in execution, the agent can continue to execute the policy
- It prescribes an action for every state
- It maximizes expected reward, rather than just reaching a goal state
3. Utility and Policy
- Utility
- Computed for every state
- What is the usefulness (utility) of this state for the overall task?
- Policy
- A complete mapping from states to actions
- In which state should I perform which action?
- policy: state → action
4. The Optimal Policy
π(s) = argmax_a Σ_s' T(s, a, s') U(s')
where T(s, a, s') is the probability of reaching state s' from state s when taking action a,
and U(s') is the utility of state s'.
- If we know the utilities, we can easily compute the optimal policy.
- The problem is to compute the correct utilities for all states.
5. Finding π
- Value iteration
- Policy iteration
6. Value Iteration
- Process
- Calculate the utility of each state
- Use these values to select an optimal action
7. Bellman Equation
- Bellman equation:
  U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s')
- For example:
  U(1,1) = -0.04 + γ max{ 0.8U(1,2) + 0.1U(2,1) + 0.1U(1,1),   (Up)
                          0.9U(1,1) + 0.1U(1,2),                (Left)
                          0.9U(1,1) + 0.1U(2,1),                (Down)
                          0.8U(2,1) + 0.1U(1,2) + 0.1U(1,1) }   (Right)
8. Bellman Equation
- Properties
- U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s')
- n equations, one for each state
- n variables
- Problem
- The max operator is not a linear operator
- Non-linear equations
- Solution
- Iterative approach
9. Value Iteration Algorithm
- Start with arbitrary initial values for the utilities
- Update the utility of each state from its neighbors:
  U_{i+1}(s) ← R(s) + γ max_a Σ_s' T(s, a, s') U_i(s')
- This iteration step is called a Bellman update
- Repeat until it converges
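As a concrete illustration (not from the original slides), a single Bellman update can be written as a short Python function. The representation is an assumption: T maps (state, action) pairs to lists of (probability, next_state) pairs, R maps states to rewards, and actions(s) returns the actions available in s (empty for terminal states).

    def bellman_update(s, U, actions, T, R, gamma=1.0):
        """One Bellman update for state s: R(s) + gamma * max_a sum_s' T(s,a,s') U(s')."""
        acts = actions(s)
        if not acts:                      # terminal state: utility is just its reward
            return R[s]
        return R[s] + gamma * max(sum(p * U[s2] for p, s2 in T[(s, a)]) for a in acts)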
10. Value Iteration Properties
- This equilibrium is a unique solution!
- It can be proven that the value iteration process converges
- We don't need exact values
11. Convergence
- The value iteration update is a contraction
- A function of one argument
- When applied to two inputs, it produces values that are closer together
- It has only one fixed point
- When applied, the value must get closer to the fixed point
- We're not going to prove the last point
- So value iteration converges to the correct values
12. Value Iteration Algorithm
- function VALUE-ITERATION(mdp) returns a utility function
    inputs: mdp, an MDP with states S, transition model T, reward function R, discount γ
    local variables: U, U', vectors of utilities for the states in S, initially identical to R
    repeat
        U ← U'
        for each state s in S do
            U'[s] ← R[s] + γ max_a Σ_s' T(s, a, s') U[s']
    until close-enough(U, U')
    return U
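A minimal Python sketch of this loop, reusing the bellman_update helper and the assumed T/R/actions representation introduced above; the stopping test here is a simple max-norm check, one of the "close enough" criteria discussed on later slides.

    def value_iteration(states, actions, T, R, gamma=1.0, eps=1e-4):
        """Repeat Bellman updates for every state until the largest change drops below eps."""
        U = dict(R)                                   # initialize utilities to the rewards
        while True:
            U_new = {s: bellman_update(s, U, actions, T, R, gamma) for s in states}
            if max(abs(U_new[s] - U[s]) for s in states) < eps:   # close-enough test
                return U_new
            U = U_new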
13. Example
- A small version of our main example
- 2x2 world
- The agent is placed in (1,1)
- States (2,1) and (2,2) are goal states
- If blocked by a wall, the agent stays in place
- The rewards are written on the board (-1 at (2,1), +1 at (2,2), -0.04 elsewhere)
14. Example (cont.)
- First iteration
- U(1,1) = R(1,1) + γ max{ 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
                           0.9U(1,1) + 0.1U(1,2),
                           0.9U(1,1) + 0.1U(2,1),
                           0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }
         = -0.04 + 1 × max{ 0.8×(-0.04) + 0.1×(-0.04) + 0.1×(-1),
                            0.9×(-0.04) + 0.1×(-0.04),
                            0.9×(-0.04) + 0.1×(-1),
                            0.8×(-1) + 0.1×(-0.04) + 0.1×(-0.04) }
         = -0.04 + max{ -0.136, -0.04, -0.136, -0.808 } = -0.08
- U(1,2) = R(1,2) + γ max{ 0.9U(1,2) + 0.1U(2,2),
                           0.9U(1,2) + 0.1U(1,1),
                           0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
                           0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }
         = -0.04 + 1 × max{ 0.9×(-0.04) + 0.1×1,
                            0.9×(-0.04) + 0.1×(-0.04),
                            0.8×(-0.04) + 0.1×1 + 0.1×(-0.04),
                            0.8×1 + 0.1×(-0.04) + 0.1×(-0.04) }
         = -0.04 + max{ 0.064, -0.04, 0.064, 0.792 } = 0.752
- Goal states remain the same
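The Up/Left/Down/Right expressions for U(1,1) above can be checked directly in Python (numbers copied from the slide):

    # First Bellman update for (1,1); all non-goal utilities start at -0.04.
    q_11 = [0.8*(-0.04) + 0.1*(-0.04) + 0.1*(-1),     # Up
            0.9*(-0.04) + 0.1*(-0.04),                # Left
            0.9*(-0.04) + 0.1*(-1),                   # Down
            0.8*(-1) + 0.1*(-0.04) + 0.1*(-0.04)]     # Right
    print(-0.04 + max(q_11))                          # -0.08 (up to floating-point rounding)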
15. Example (cont.)
- Second iteration
- U(1,1) = R(1,1) + γ max{ 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
                           0.9U(1,1) + 0.1U(1,2),
                           0.9U(1,1) + 0.1U(2,1),
                           0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }
         = -0.04 + 1 × max{ 0.8×0.752 + 0.1×(-0.08) + 0.1×(-1),
                            0.9×(-0.08) + 0.1×0.752,
                            0.9×(-0.08) + 0.1×(-1),
                            0.8×(-1) + 0.1×(-0.08) + 0.1×0.752 }
         = -0.04 + max{ 0.4936, 0.0032, -0.172, -0.7328 } = 0.4536
- U(1,2) = R(1,2) + γ max{ 0.9U(1,2) + 0.1U(2,2),
                           0.9U(1,2) + 0.1U(1,1),
                           0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
                           0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }
         = -0.04 + 1 × max{ 0.9×0.752 + 0.1×1,
                            0.9×0.752 + 0.1×(-0.08),
                            0.8×(-0.08) + 0.1×1 + 0.1×0.752,
                            0.8×1 + 0.1×0.752 + 0.1×(-0.08) }
         = -0.04 + max{ 0.7768, 0.6688, 0.1112, 0.8672 } = 0.8272
16. Example (cont.)
- Third iteration
- U(1,1) = R(1,1) + γ max{ 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
                           0.9U(1,1) + 0.1U(1,2),
                           0.9U(1,1) + 0.1U(2,1),
                           0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }
         = -0.04 + 1 × max{ 0.8×0.8272 + 0.1×0.4536 + 0.1×(-1),
                            0.9×0.4536 + 0.1×0.8272,
                            0.9×0.4536 + 0.1×(-1),
                            0.8×(-1) + 0.1×0.4536 + 0.1×0.8272 }
         = -0.04 + max{ 0.6071, 0.491, 0.3082, -0.6719 } = 0.5676
- U(1,2) = R(1,2) + γ max{ 0.9U(1,2) + 0.1U(2,2),
                           0.9U(1,2) + 0.1U(1,1),
                           0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
                           0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }
         = -0.04 + 1 × max{ 0.9×0.8272 + 0.1×1,
                            0.9×0.8272 + 0.1×0.4536,
                            0.8×0.4536 + 0.1×1 + 0.1×0.8272,
                            0.8×1 + 0.1×0.8272 + 0.1×0.4536 }
         = -0.04 + max{ 0.8444, 0.7898, 0.5456, 0.9281 } = 0.8881
17. Example (cont.)
- Continue to the next iteration
- Finish if close enough
- Here the last change was 0.114 – is that close enough?
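The iterations worked through above can be reproduced with the value_iteration sketch from slide 12 by writing out the 2x2 world's rewards and transition model by hand (the dictionary layout below follows that sketch's assumed representation, not anything defined on the slides):

    # 2x2 world: (2,1) and (2,2) are terminal goal states with rewards -1 and +1.
    states = [(1, 1), (1, 2), (2, 1), (2, 2)]
    R = {(1, 1): -0.04, (1, 2): -0.04, (2, 1): -1.0, (2, 2): 1.0}
    terminals = {(2, 1), (2, 2)}

    def actions(s):
        return [] if s in terminals else ['Up', 'Left', 'Down', 'Right']

    # 0.8 in the intended direction, 0.1 to each side; bumping into a wall means staying put.
    T = {
        ((1, 1), 'Up'):    [(0.8, (1, 2)), (0.1, (1, 1)), (0.1, (2, 1))],
        ((1, 1), 'Left'):  [(0.9, (1, 1)), (0.1, (1, 2))],
        ((1, 1), 'Down'):  [(0.9, (1, 1)), (0.1, (2, 1))],
        ((1, 1), 'Right'): [(0.8, (2, 1)), (0.1, (1, 1)), (0.1, (1, 2))],
        ((1, 2), 'Up'):    [(0.9, (1, 2)), (0.1, (2, 2))],
        ((1, 2), 'Left'):  [(0.9, (1, 2)), (0.1, (1, 1))],
        ((1, 2), 'Down'):  [(0.8, (1, 1)), (0.1, (2, 2)), (0.1, (1, 2))],
        ((1, 2), 'Right'): [(0.8, (2, 2)), (0.1, (1, 2)), (0.1, (1, 1))],
    }

    U = value_iteration(states, actions, T, R, gamma=1.0, eps=1e-4)
    # The first sweeps pass through U(1,1) ≈ -0.08, 0.45, 0.57 and U(1,2) ≈ 0.75, 0.83, 0.89,
    # matching the hand calculations on the slides above.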
18. Close Enough?
- We will not go deeply into this issue!
- There are different ways to detect convergence:
- RMS error – the root mean square error of the utility values compared to the correct values
- Require RMS(U, U') < ε
- where ε is the maximum error allowed in the utility of any state in an iteration
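One way this RMS test could be coded, assuming utilities are stored in dictionaries keyed by state (a hypothetical helper, not from the slides); when the correct values are unknown, U_ref is typically the previous iterate:

    from math import sqrt

    def rms_error(U, U_ref):
        """Root mean square difference between two utility vectors (dicts keyed by state)."""
        return sqrt(sum((U[s] - U_ref[s]) ** 2 for s in U) / len(U))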
19. Close Enough? (cont.)
- Policy loss – the difference between the expected utility of following the policy and the expected utility of following the optimal policy
- Require ||U_{i+1} - U_i|| < ε(1 - γ)/γ
- where ||U|| = max_s |U(s)|
- ε is the maximum error allowed in the utility of any state in an iteration
- γ is the discount factor
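A sketch of this stopping test in the same style (illustrative only); note that for γ = 1, as in the running example, the bound degenerates, so a plain max-norm threshold is used instead:

    def close_enough(U_new, U_old, eps, gamma):
        """True when the max-norm change is below eps*(1-gamma)/gamma, bounding the policy loss."""
        delta = max(abs(U_new[s] - U_old[s]) for s in U_new)   # max-norm of the change
        return delta < (eps * (1 - gamma) / gamma if gamma < 1 else eps)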
20. Finding the Policy
- The true utilities have been found
- Now search for the optimal policy:
- For each s in S do
    π[s] ← argmax_a Σ_s' T(s, a, s') U(s')
- Return π
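In Python this extraction step is one argmax per state; the sketch below uses the same assumed T/actions representation as the earlier value-iteration sketch:

    def extract_policy(states, actions, T, U):
        """Greedy policy w.r.t. U: pi(s) = argmax_a sum_s' T(s, a, s') U(s')."""
        pi = {}
        for s in states:
            acts = actions(s)
            if acts:                                  # terminal states get no action
                pi[s] = max(acts, key=lambda a: sum(p * U[s2] for p, s2 in T[(s, a)]))
        return pi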
21. Example (cont.)
- Find the optimal policy
- π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),   // Up
                      0.9U(1,1) + 0.1U(1,2),                // Left
                      0.9U(1,1) + 0.1U(2,1),                // Down
                      0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }   // Right
         = argmax_a { 0.8×0.8881 + 0.1×0.5676 + 0.1×(-1),
                      0.9×0.5676 + 0.1×0.8881,
                      0.9×0.5676 + 0.1×(-1),
                      0.8×(-1) + 0.1×0.5676 + 0.1×0.8881 }
         = argmax_a { 0.6672, 0.5996, 0.4108, -0.6512 } = Up
- π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),                // Up
                      0.9U(1,2) + 0.1U(1,1),                // Left
                      0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),    // Down
                      0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }   // Right
         = argmax_a { 0.9×0.8881 + 0.1×1,
                      0.9×0.8881 + 0.1×0.5676,
                      0.8×0.5676 + 0.1×1 + 0.1×0.8881,
                      0.8×1 + 0.1×0.8881 + 0.1×0.5676 }
         = argmax_a { 0.8993, 0.8561, 0.6429, 0.9456 } = Right
22. Summary – Value Iteration
1. The given environment
2. Calculate the utilities
3. Extract the optimal policy
4. Execute the actions
23. Example – Convergence
- (Figure showing convergence in the example; the allowed error is marked)
24. Policy Iteration
- Pick a policy, then calculate the utility of each state given that policy (the policy evaluation step)
- Update the policy at each state using the utilities of the successor states (the policy improvement step)
- Repeat until the policy stabilizes
25. Policy Iteration
- In each step, for each state:
- Policy evaluation
- Given policy π_i,
- calculate the utility U_i of each state if π_i were to be executed
- Policy improvement
- Calculate a new policy π_{i+1}
- based on the utilities U_i:
- π_{i+1}[s] ← argmax_a Σ_s' T(s, a, s') U_i(s')
26. Policy Iteration Algorithm
- function POLICY-ITERATION(mdp) returns a policy
    inputs: mdp, an MDP with states S, transition model T
    local variables: U, a vector of utilities for the states in S, initially identical to R
                     π, a policy, a vector indexed by states, initially random
    repeat
        U ← POLICY-EVALUATION(π, U, mdp)
        unchanged? ← true
        for each state s in S do
            if max_a Σ_s' T(s, a, s') U[s'] > Σ_s' T(s, π[s], s') U[s'] then
                π[s] ← argmax_a Σ_s' T(s, a, s') U[s']
                unchanged? ← false
    until unchanged?
    return π
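A runnable Python sketch of this algorithm, using the same assumed T/R/actions representation as before. The evaluation step here uses a few fixed-policy sweeps (the "simplified value iteration" of a later slide) rather than solving the linear system exactly, as the worked example on the next slides does:

    import random

    def policy_evaluation(pi, U, states, T, R, gamma=1.0, sweeps=20):
        """Approximate U^pi by repeated sweeps of U(s) = R(s) + gamma * sum_s' T(s, pi(s), s') U(s')."""
        for _ in range(sweeps):
            U = {s: R[s] + (gamma * sum(p * U[s2] for p, s2 in T[(s, pi[s])]) if s in pi else 0.0)
                 for s in states}
        return U

    def policy_iteration(states, actions, T, R, gamma=1.0):
        """Alternate policy evaluation and policy improvement until the policy is stable."""
        pi = {s: random.choice(actions(s)) for s in states if actions(s)}   # initial random policy
        U = dict(R)
        while True:
            U = policy_evaluation(pi, U, states, T, R, gamma)
            unchanged = True
            for s in pi:
                q = {a: sum(p * U[s2] for p, s2 in T[(s, a)]) for a in actions(s)}
                best = max(q, key=q.get)
                if q[best] > q[pi[s]]:                # strict improvement at state s
                    pi[s] = best
                    unchanged = False
            if unchanged:
                return pi

On the 2x2 world defined earlier, this sketch should return π(1,1) = Up, π(1,2) = Right, matching the result derived on the following slides.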
27. Example
- Back to our last example
- 2x2 world
- The agent is placed in (1,1)
- States (2,1) and (2,2) are goal states
- If blocked by a wall, the agent stays in place
- The rewards are written on the board
- Initial policy: Up (for every state)
28. Example (cont.)
- First iteration – policy evaluation
- U(1,1) = R(1,1) + γ × (0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1))
- U(1,2) = R(1,2) + γ × (0.9U(1,2) + 0.1U(2,2))
- U(2,1) = R(2,1)
- U(2,2) = R(2,2)
- U(1,1) = -0.04 + 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1)
- U(1,2) = -0.04 + 0.9U(1,2) + 0.1U(2,2)
- U(2,1) = -1
- U(2,2) = 1
- As a linear system:
  0.04 = -0.9U(1,1) + 0.8U(1,2) + 0.1U(2,1) + 0U(2,2)
  0.04 =  0U(1,1) - 0.1U(1,2) + 0U(2,1) + 0.1U(2,2)
  -1   =  0U(1,1) + 0U(1,2) + 1U(2,1) + 0U(2,2)
   1   =  0U(1,1) + 0U(1,2) + 0U(2,1) + 1U(2,2)
- U(1,1) = 0.3778
- U(1,2) = 0.6
- U(2,1) = -1
- U(2,2) = 1
Policy: π(1,1) = Up, π(1,2) = Up
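Because the fixed-policy equations are linear, they can be solved directly; a small numpy sketch for this first evaluation step (state order and matrix entries written out by hand from the equations above):

    import numpy as np

    # State order: (1,1), (1,2), (2,1), (2,2); the policy is Up everywhere.
    # Solve (I - gamma * T_pi) U = R with gamma = 1; terminal states have no outgoing transitions.
    T_pi = np.array([[0.1, 0.8, 0.1, 0.0],    # from (1,1) under Up
                     [0.0, 0.9, 0.0, 0.1],    # from (1,2) under Up
                     [0.0, 0.0, 0.0, 0.0],    # (2,1) terminal
                     [0.0, 0.0, 0.0, 0.0]])   # (2,2) terminal
    R = np.array([-0.04, -0.04, -1.0, 1.0])
    U = np.linalg.solve(np.eye(4) - T_pi, R)
    print(U)   # ≈ [0.3778, 0.6, -1.0, 1.0], matching the slide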
29. Example (cont.)
- First iteration – policy improvement
- π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),   // Up
                      0.9U(1,1) + 0.1U(1,2),                // Left
                      0.9U(1,1) + 0.1U(2,1),                // Down
                      0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }   // Right
         = argmax_a { 0.8×0.6 + 0.1×0.3778 + 0.1×(-1),
                      0.9×0.3778 + 0.1×0.6,
                      0.9×0.3778 + 0.1×(-1),
                      0.8×(-1) + 0.1×0.3778 + 0.1×0.6 }
         = argmax_a { 0.4178, 0.4, 0.24, -0.7022 } = Up → don't have to update
- π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),                // Up
                      0.9U(1,2) + 0.1U(1,1),                // Left
                      0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),    // Down
                      0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }   // Right
         = argmax_a { 0.9×0.6 + 0.1×1,
                      0.9×0.6 + 0.1×0.3778,
                      0.8×0.3778 + 0.1×1 + 0.1×0.6,
                      0.8×1 + 0.1×0.6 + 0.1×0.3778 }
         = argmax_a { 0.64, 0.5778, 0.4622, 0.8978 } = Right → update
Policy: π(1,1) = Up, π(1,2) = Up
30. Example (cont.)
- Second iteration – policy evaluation
- U(1,1) = R(1,1) + γ × (0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1))
- U(1,2) = R(1,2) + γ × (0.1U(1,2) + 0.8U(2,2) + 0.1U(1,1))
- U(2,1) = R(2,1)
- U(2,2) = R(2,2)
- U(1,1) = -0.04 + 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1)
- U(1,2) = -0.04 + 0.1U(1,2) + 0.8U(2,2) + 0.1U(1,1)
- U(2,1) = -1
- U(2,2) = 1
- As a linear system:
  0.04 = -0.9U(1,1) + 0.8U(1,2) + 0.1U(2,1) + 0U(2,2)
  0.04 =  0.1U(1,1) - 0.9U(1,2) + 0U(2,1) + 0.8U(2,2)
  -1   =  0U(1,1) + 0U(1,2) + 1U(2,1) + 0U(2,2)
   1   =  0U(1,1) + 0U(1,2) + 0U(2,1) + 1U(2,2)
- U(1,1) = 0.5413
- U(1,2) = 0.7843
- U(2,1) = -1
- U(2,2) = 1
Policy: π(1,1) = Up, π(1,2) = Right
31. Example (cont.)
- Second iteration – policy improvement
- π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),   // Up
                      0.9U(1,1) + 0.1U(1,2),                // Left
                      0.9U(1,1) + 0.1U(2,1),                // Down
                      0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }   // Right
         = argmax_a { 0.8×0.7843 + 0.1×0.5413 + 0.1×(-1),
                      0.9×0.5413 + 0.1×0.7843,
                      0.9×0.5413 + 0.1×(-1),
                      0.8×(-1) + 0.1×0.5413 + 0.1×0.7843 }
         = argmax_a { 0.5816, 0.5656, 0.3871, -0.6674 } = Up → don't have to update
- π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),                // Up
                      0.9U(1,2) + 0.1U(1,1),                // Left
                      0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),    // Down
                      0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }   // Right
         = argmax_a { 0.9×0.7843 + 0.1×1,
                      0.9×0.7843 + 0.1×0.5413,
                      0.8×0.5413 + 0.1×1 + 0.1×0.7843,
                      0.8×1 + 0.1×0.7843 + 0.1×0.5413 }
         = argmax_a { 0.8059, 0.76, 0.6115, 0.9326 } = Right → don't have to update
Policy: π(1,1) = Up, π(1,2) = Right
32. Example (cont.)
- No change in the policy was found → finish
- The optimal policy: π(1,1) = Up, π(1,2) = Right
- Policy iteration must terminate, since the number of policies is finite
33. Simplified Policy Iteration
- Can focus on a subset of the states
- Find their utility by simplified value iteration:
  U_{i+1}(s) = R(s) + γ Σ_s' T(s, π(s), s') U_i(s')
- OR by policy improvement
- Guaranteed to converge under certain conditions on the initial policy and utility values
34. Policy Iteration Properties
- Linear equations – easy to solve
- Fast convergence in practice
- Proven to converge to an optimal policy
35. Value vs. Policy Iteration
- Which one to use?
- Policy iteration is more expensive per iteration
- In practice, policy iteration requires fewer iterations