Title: Markov Decision Process (MDP)
1. Markov Decision Process (MDP)
- Ruti Glick
- Bar-Ilan University
2. Policy
- A policy is similar to a plan
- It is generated ahead of time
- Unlike traditional plans, it is not a sequence of actions that the agent must execute
- If there are failures in execution, the agent can continue to execute the policy
- It prescribes an action for every state
- It maximizes expected reward, rather than just reaching a goal state
3. Utility and Policy
- Utility
- Computed for every state
- What is the usefulness (utility) of this state for the overall task?
- Policy
- A complete mapping from states to actions
- In which state should I perform which action?
- policy: state → action
4. The Optimal Policy
π(s) = argmax_a Σ_s' T(s, a, s') U(s')
where T(s, a, s') is the probability of reaching state s' from state s when taking action a,
and U(s') is the utility of state s'.
- If we know the utilities, we can easily compute the optimal policy.
- The problem is to compute the correct utilities for all states.
5. Finding π
- Value iteration
- Policy iteration
6. Value Iteration
- Process
- Calculate the utility of each state
- Use these values to select an optimal action
7. Bellman Equation
- Bellman equation:
  U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s')
- For example:
  U(1,1) = -0.04 + γ max{ 0.8U(1,2) + 0.1U(2,1) + 0.1U(1,1),   (Up)
                          0.9U(1,1) + 0.1U(1,2),                (Left)
                          0.9U(1,1) + 0.1U(2,1),                (Down)
                          0.8U(2,1) + 0.1U(1,2) + 0.1U(1,1) }   (Right)
8. Bellman Equation
- Properties
- U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s')
- n equations, one for each state
- n variables
- Problem
- The max operator is not a linear operator
- Non-linear equations
- Solution
- Iterative approach
9. Value Iteration Algorithm
- Start with arbitrary initial values for the utilities
- Update the utility of each state from its neighbors:
  U_{i+1}(s) ← R(s) + γ max_a Σ_s' T(s, a, s') U_i(s')
- This iteration step is called a Bellman update
- Repeat until it converges
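As a concrete illustration (not from the original slides), a single Bellman update can be written as a short Python function. The representation is an assumption: T maps (state, action) pairs to lists of (probability, next_state) pairs, R maps states to rewards, and actions(s) returns the actions available in s (empty for terminal states).

    def bellman_update(s, U, actions, T, R, gamma=1.0):
        """One Bellman update for state s: R(s) + gamma * max_a sum_s' T(s,a,s') U(s')."""
        acts = actions(s)
        if not acts:                      # terminal state: utility is just its reward
            return R[s]
        return R[s] + gamma * max(sum(p * U[s2] for p, s2 in T[(s, a)]) for a in acts)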
10. Value Iteration Properties
- This equilibrium is a unique solution!
- It can be proven that the value iteration process converges
- We don't need exact values
11. Convergence
- The value iteration update is a contraction
- A function of one argument
- When applied to two inputs, it produces values that are closer together
- It has only one fixed point
- When applied, the value must get closer to the fixed point
- We're not going to prove the last point
- So value iteration converges to the correct values
12. Value Iteration Algorithm
- function VALUE-ITERATION(mdp) returns a utility function
    inputs: mdp, an MDP with states S, transition model T, reward function R, discount γ
    local variables: U, U', vectors of utilities for the states in S, initially identical to R
    repeat
        U ← U'
        for each state s in S do
            U'[s] ← R[s] + γ max_a Σ_s' T(s, a, s') U[s']
    until close-enough(U, U')
    return U
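A minimal Python sketch of this loop, reusing the bellman_update helper and the assumed T/R/actions representation introduced above; the stopping test here is a simple max-norm check, one of the "close enough" criteria discussed on later slides.

    def value_iteration(states, actions, T, R, gamma=1.0, eps=1e-4):
        """Repeat Bellman updates for every state until the largest change drops below eps."""
        U = dict(R)                                   # initialize utilities to the rewards
        while True:
            U_new = {s: bellman_update(s, U, actions, T, R, gamma) for s in states}
            if max(abs(U_new[s] - U[s]) for s in states) < eps:   # close-enough test
                return U_new
            U = U_new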
13. Example
- A small version of our main example
- 2x2 world
- The agent is placed in (1,1)
- States (2,1) and (2,2) are goal states
- If blocked by a wall, the agent stays in place
- The rewards are written on the board (-1 at (2,1), +1 at (2,2), -0.04 elsewhere)
14. Example (cont.)
- First iteration
- U(1,1) = R(1,1) + γ max{ 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
                           0.9U(1,1) + 0.1U(1,2),
                           0.9U(1,1) + 0.1U(2,1),
                           0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }
         = -0.04 + 1 × max{ 0.8×(-0.04) + 0.1×(-0.04) + 0.1×(-1),
                            0.9×(-0.04) + 0.1×(-0.04),
                            0.9×(-0.04) + 0.1×(-1),
                            0.8×(-1) + 0.1×(-0.04) + 0.1×(-0.04) }
         = -0.04 + max{ -0.136, -0.04, -0.136, -0.808 } = -0.08
- U(1,2) = R(1,2) + γ max{ 0.9U(1,2) + 0.1U(2,2),
                           0.9U(1,2) + 0.1U(1,1),
                           0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
                           0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }
         = -0.04 + 1 × max{ 0.9×(-0.04) + 0.1×1,
                            0.9×(-0.04) + 0.1×(-0.04),
                            0.8×(-0.04) + 0.1×1 + 0.1×(-0.04),
                            0.8×1 + 0.1×(-0.04) + 0.1×(-0.04) }
         = -0.04 + max{ 0.064, -0.04, 0.064, 0.792 } = 0.752
- Goal states remain the same
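The Up/Left/Down/Right expressions for U(1,1) above can be checked directly in Python (numbers copied from the slide):

    # First Bellman update for (1,1); all non-goal utilities start at -0.04.
    q_11 = [0.8*(-0.04) + 0.1*(-0.04) + 0.1*(-1),     # Up
            0.9*(-0.04) + 0.1*(-0.04),                # Left
            0.9*(-0.04) + 0.1*(-1),                   # Down
            0.8*(-1) + 0.1*(-0.04) + 0.1*(-0.04)]     # Right
    print(-0.04 + max(q_11))                          # -0.08 (up to floating-point rounding)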
15. Example (cont.)
- Second iteration
- U(1,1) = R(1,1) + γ max{ 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
                           0.9U(1,1) + 0.1U(1,2),
                           0.9U(1,1) + 0.1U(2,1),
                           0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }
         = -0.04 + 1 × max{ 0.8×0.752 + 0.1×(-0.08) + 0.1×(-1),
                            0.9×(-0.08) + 0.1×0.752,
                            0.9×(-0.08) + 0.1×(-1),
                            0.8×(-1) + 0.1×(-0.08) + 0.1×0.752 }
         = -0.04 + max{ 0.4936, 0.0032, -0.172, -0.7328 } = 0.4536
- U(1,2) = R(1,2) + γ max{ 0.9U(1,2) + 0.1U(2,2),
                           0.9U(1,2) + 0.1U(1,1),
                           0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
                           0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }
         = -0.04 + 1 × max{ 0.9×0.752 + 0.1×1,
                            0.9×0.752 + 0.1×(-0.08),
                            0.8×(-0.08) + 0.1×1 + 0.1×0.752,
                            0.8×1 + 0.1×0.752 + 0.1×(-0.08) }
         = -0.04 + max{ 0.7768, 0.6688, 0.1112, 0.8672 } = 0.8272
16. Example (cont.)
- Third iteration
- U(1,1) = R(1,1) + γ max{ 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
                           0.9U(1,1) + 0.1U(1,2),
                           0.9U(1,1) + 0.1U(2,1),
                           0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }
         = -0.04 + 1 × max{ 0.8×0.8272 + 0.1×0.4536 + 0.1×(-1),
                            0.9×0.4536 + 0.1×0.8272,
                            0.9×0.4536 + 0.1×(-1),
                            0.8×(-1) + 0.1×0.4536 + 0.1×0.8272 }
         = -0.04 + max{ 0.6071, 0.491, 0.3082, -0.6719 } = 0.5676
- U(1,2) = R(1,2) + γ max{ 0.9U(1,2) + 0.1U(2,2),
                           0.9U(1,2) + 0.1U(1,1),
                           0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
                           0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }
         = -0.04 + 1 × max{ 0.9×0.8272 + 0.1×1,
                            0.9×0.8272 + 0.1×0.4536,
                            0.8×0.4536 + 0.1×1 + 0.1×0.8272,
                            0.8×1 + 0.1×0.8272 + 0.1×0.4536 }
         = -0.04 + max{ 0.8444, 0.7898, 0.5456, 0.9281 } = 0.8881
17. Example (cont.)
- Continue to the next iteration
- Finish if close enough
- Here the last change was 0.114 – is that close enough?
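The iterations worked through above can be reproduced with the value_iteration sketch from slide 12 by writing out the 2x2 world's rewards and transition model by hand (the dictionary layout below follows that sketch's assumed representation, not anything defined on the slides):

    # 2x2 world: (2,1) and (2,2) are terminal goal states with rewards -1 and +1.
    states = [(1, 1), (1, 2), (2, 1), (2, 2)]
    R = {(1, 1): -0.04, (1, 2): -0.04, (2, 1): -1.0, (2, 2): 1.0}
    terminals = {(2, 1), (2, 2)}

    def actions(s):
        return [] if s in terminals else ['Up', 'Left', 'Down', 'Right']

    # 0.8 in the intended direction, 0.1 to each side; bumping into a wall means staying put.
    T = {
        ((1, 1), 'Up'):    [(0.8, (1, 2)), (0.1, (1, 1)), (0.1, (2, 1))],
        ((1, 1), 'Left'):  [(0.9, (1, 1)), (0.1, (1, 2))],
        ((1, 1), 'Down'):  [(0.9, (1, 1)), (0.1, (2, 1))],
        ((1, 1), 'Right'): [(0.8, (2, 1)), (0.1, (1, 1)), (0.1, (1, 2))],
        ((1, 2), 'Up'):    [(0.9, (1, 2)), (0.1, (2, 2))],
        ((1, 2), 'Left'):  [(0.9, (1, 2)), (0.1, (1, 1))],
        ((1, 2), 'Down'):  [(0.8, (1, 1)), (0.1, (2, 2)), (0.1, (1, 2))],
        ((1, 2), 'Right'): [(0.8, (2, 2)), (0.1, (1, 2)), (0.1, (1, 1))],
    }

    U = value_iteration(states, actions, T, R, gamma=1.0, eps=1e-4)
    # The first sweeps pass through U(1,1) ≈ -0.08, 0.45, 0.57 and U(1,2) ≈ 0.75, 0.83, 0.89,
    # matching the hand calculations on the slides above.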
18. Close Enough?
- We will not go deeply into this issue!
- There are different ways to detect convergence:
- RMS error – the root mean square error of the utility values compared to the correct values
- Require RMS(U, U') < ε
- where ε is the maximum error allowed in the utility of any state in an iteration
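One way this RMS test could be coded, assuming utilities are stored in dictionaries keyed by state (a hypothetical helper, not from the slides); when the correct values are unknown, U_ref is typically the previous iterate:

    from math import sqrt

    def rms_error(U, U_ref):
        """Root mean square difference between two utility vectors (dicts keyed by state)."""
        return sqrt(sum((U[s] - U_ref[s]) ** 2 for s in U) / len(U))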
19. Close Enough? (cont.)
- Policy loss – the difference between the expected utility of following the policy and the expected utility of following the optimal policy
- Require ||U_{i+1} - U_i|| < ε(1 - γ)/γ
- where ||U|| = max_s |U(s)|
- ε is the maximum error allowed in the utility of any state in an iteration
- γ is the discount factor
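A sketch of this stopping test in the same style (illustrative only); note that for γ = 1, as in the running example, the bound degenerates, so a plain max-norm threshold is used instead:

    def close_enough(U_new, U_old, eps, gamma):
        """True when the max-norm change is below eps*(1-gamma)/gamma, bounding the policy loss."""
        delta = max(abs(U_new[s] - U_old[s]) for s in U_new)   # max-norm of the change
        return delta < (eps * (1 - gamma) / gamma if gamma < 1 else eps)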
20. Finding the Policy
- The true utilities have been found
- Now search for the optimal policy:
- For each s in S do
    π[s] ← argmax_a Σ_s' T(s, a, s') U(s')
- Return π
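In Python this extraction step is one argmax per state; the sketch below uses the same assumed T/actions representation as the earlier value-iteration sketch:

    def extract_policy(states, actions, T, U):
        """Greedy policy w.r.t. U: pi(s) = argmax_a sum_s' T(s, a, s') U(s')."""
        pi = {}
        for s in states:
            acts = actions(s)
            if acts:                                  # terminal states get no action
                pi[s] = max(acts, key=lambda a: sum(p * U[s2] for p, s2 in T[(s, a)]))
        return pi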
21. Example (cont.)
- Find the optimal policy
- π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),   // Up
                      0.9U(1,1) + 0.1U(1,2),                // Left
                      0.9U(1,1) + 0.1U(2,1),                // Down
                      0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }   // Right
         = argmax_a { 0.8×0.8881 + 0.1×0.5676 + 0.1×(-1),
                      0.9×0.5676 + 0.1×0.8881,
                      0.9×0.5676 + 0.1×(-1),
                      0.8×(-1) + 0.1×0.5676 + 0.1×0.8881 }
         = argmax_a { 0.6672, 0.5996, 0.4108, -0.6512 } = Up
- π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),                // Up
                      0.9U(1,2) + 0.1U(1,1),                // Left
                      0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),    // Down
                      0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }   // Right
         = argmax_a { 0.9×0.8881 + 0.1×1,
                      0.9×0.8881 + 0.1×0.5676,
                      0.8×0.5676 + 0.1×1 + 0.1×0.8881,
                      0.8×1 + 0.1×0.8881 + 0.1×0.5676 }
         = argmax_a { 0.8993, 0.8561, 0.6429, 0.9456 } = Right
22. Summary – Value Iteration
1. The given environment
2. Calculate the utilities
3. Extract the optimal policy
4. Execute the actions
23. Example – Convergence
- (Figure showing convergence in the example; the allowed error is marked)
24. Policy Iteration
- Pick a policy, then calculate the utility of each state given that policy (the policy evaluation step)
- Update the policy at each state using the utilities of the successor states (the policy improvement step)
- Repeat until the policy stabilizes
25. Policy Iteration
- In each step, for each state:
- Policy evaluation
- Given policy π_i,
- calculate the utility U_i of each state if π_i were to be executed
- Policy improvement
- Calculate a new policy π_{i+1}
- based on the utilities U_i:
- π_{i+1}[s] ← argmax_a Σ_s' T(s, a, s') U_i(s')
26. Policy Iteration Algorithm
- function POLICY-ITERATION(mdp) returns a policy
    inputs: mdp, an MDP with states S, transition model T
    local variables: U, a vector of utilities for the states in S, initially identical to R
                     π, a policy, a vector indexed by states, initially random
    repeat
        U ← POLICY-EVALUATION(π, U, mdp)
        unchanged? ← true
        for each state s in S do
            if max_a Σ_s' T(s, a, s') U[s'] > Σ_s' T(s, π[s], s') U[s'] then
                π[s] ← argmax_a Σ_s' T(s, a, s') U[s']
                unchanged? ← false
    until unchanged?
    return π
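A runnable Python sketch of this algorithm, using the same assumed T/R/actions representation as before. The evaluation step here uses a few fixed-policy sweeps (the "simplified value iteration" of a later slide) rather than solving the linear system exactly, as the worked example on the next slides does:

    import random

    def policy_evaluation(pi, U, states, T, R, gamma=1.0, sweeps=20):
        """Approximate U^pi by repeated sweeps of U(s) = R(s) + gamma * sum_s' T(s, pi(s), s') U(s')."""
        for _ in range(sweeps):
            U = {s: R[s] + (gamma * sum(p * U[s2] for p, s2 in T[(s, pi[s])]) if s in pi else 0.0)
                 for s in states}
        return U

    def policy_iteration(states, actions, T, R, gamma=1.0):
        """Alternate policy evaluation and policy improvement until the policy is stable."""
        pi = {s: random.choice(actions(s)) for s in states if actions(s)}   # initial random policy
        U = dict(R)
        while True:
            U = policy_evaluation(pi, U, states, T, R, gamma)
            unchanged = True
            for s in pi:
                q = {a: sum(p * U[s2] for p, s2 in T[(s, a)]) for a in actions(s)}
                best = max(q, key=q.get)
                if q[best] > q[pi[s]]:                # strict improvement at state s
                    pi[s] = best
                    unchanged = False
            if unchanged:
                return pi

On the 2x2 world defined earlier, this sketch should return π(1,1) = Up, π(1,2) = Right, matching the result derived on the following slides.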
27. Example
- Back to our last example
- 2x2 world
- The agent is placed in (1,1)
- States (2,1) and (2,2) are goal states
- If blocked by a wall, the agent stays in place
- The rewards are written on the board
- Initial policy: Up (for every state)
28. Example (cont.)
- First iteration – policy evaluation
- U(1,1) = R(1,1) + γ × (0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1))
- U(1,2) = R(1,2) + γ × (0.9U(1,2) + 0.1U(2,2))
- U(2,1) = R(2,1)
- U(2,2) = R(2,2)
- U(1,1) = -0.04 + 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1)
- U(1,2) = -0.04 + 0.9U(1,2) + 0.1U(2,2)
- U(2,1) = -1
- U(2,2) = 1
- As a linear system:
  0.04 = -0.9U(1,1) + 0.8U(1,2) + 0.1U(2,1) + 0U(2,2)
  0.04 =  0U(1,1) - 0.1U(1,2) + 0U(2,1) + 0.1U(2,2)
  -1   =  0U(1,1) + 0U(1,2) + 1U(2,1) + 0U(2,2)
   1   =  0U(1,1) + 0U(1,2) + 0U(2,1) + 1U(2,2)
- U(1,1) = 0.3778
- U(1,2) = 0.6
- U(2,1) = -1
- U(2,2) = 1
Policy: π(1,1) = Up, π(1,2) = Up
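Because the fixed-policy equations are linear, they can be solved directly; a small numpy sketch for this first evaluation step (state order and matrix entries written out by hand from the equations above):

    import numpy as np

    # State order: (1,1), (1,2), (2,1), (2,2); the policy is Up everywhere.
    # Solve (I - gamma * T_pi) U = R with gamma = 1; terminal states have no outgoing transitions.
    T_pi = np.array([[0.1, 0.8, 0.1, 0.0],    # from (1,1) under Up
                     [0.0, 0.9, 0.0, 0.1],    # from (1,2) under Up
                     [0.0, 0.0, 0.0, 0.0],    # (2,1) terminal
                     [0.0, 0.0, 0.0, 0.0]])   # (2,2) terminal
    R = np.array([-0.04, -0.04, -1.0, 1.0])
    U = np.linalg.solve(np.eye(4) - T_pi, R)
    print(U)   # ≈ [0.3778, 0.6, -1.0, 1.0], matching the slide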
29. Example (cont.)
- First iteration – policy improvement
- π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),   // Up
                      0.9U(1,1) + 0.1U(1,2),                // Left
                      0.9U(1,1) + 0.1U(2,1),                // Down
                      0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }   // Right
         = argmax_a { 0.8×0.6 + 0.1×0.3778 + 0.1×(-1),
                      0.9×0.3778 + 0.1×0.6,
                      0.9×0.3778 + 0.1×(-1),
                      0.8×(-1) + 0.1×0.3778 + 0.1×0.6 }
         = argmax_a { 0.4178, 0.4, 0.24, -0.7022 } = Up → don't have to update
- π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),                // Up
                      0.9U(1,2) + 0.1U(1,1),                // Left
                      0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),    // Down
                      0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }   // Right
         = argmax_a { 0.9×0.6 + 0.1×1,
                      0.9×0.6 + 0.1×0.3778,
                      0.8×0.3778 + 0.1×1 + 0.1×0.6,
                      0.8×1 + 0.1×0.6 + 0.1×0.3778 }
         = argmax_a { 0.64, 0.5778, 0.4622, 0.8978 } = Right → update
Policy: π(1,1) = Up, π(1,2) = Up
30. Example (cont.)
- Second iteration – policy evaluation
- U(1,1) = R(1,1) + γ × (0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1))
- U(1,2) = R(1,2) + γ × (0.1U(1,2) + 0.8U(2,2) + 0.1U(1,1))
- U(2,1) = R(2,1)
- U(2,2) = R(2,2)
- U(1,1) = -0.04 + 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1)
- U(1,2) = -0.04 + 0.1U(1,2) + 0.8U(2,2) + 0.1U(1,1)
- U(2,1) = -1
- U(2,2) = 1
- As a linear system:
  0.04 = -0.9U(1,1) + 0.8U(1,2) + 0.1U(2,1) + 0U(2,2)
  0.04 =  0.1U(1,1) - 0.9U(1,2) + 0U(2,1) + 0.8U(2,2)
  -1   =  0U(1,1) + 0U(1,2) + 1U(2,1) + 0U(2,2)
   1   =  0U(1,1) + 0U(1,2) + 0U(2,1) + 1U(2,2)
- U(1,1) = 0.5413
- U(1,2) = 0.7843
- U(2,1) = -1
- U(2,2) = 1
Policy: π(1,1) = Up, π(1,2) = Right
31. Example (cont.)
- Second iteration – policy improvement
- π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),   // Up
                      0.9U(1,1) + 0.1U(1,2),                // Left
                      0.9U(1,1) + 0.1U(2,1),                // Down
                      0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }   // Right
         = argmax_a { 0.8×0.7843 + 0.1×0.5413 + 0.1×(-1),
                      0.9×0.5413 + 0.1×0.7843,
                      0.9×0.5413 + 0.1×(-1),
                      0.8×(-1) + 0.1×0.5413 + 0.1×0.7843 }
         = argmax_a { 0.5816, 0.5656, 0.3871, -0.6674 } = Up → don't have to update
- π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),                // Up
                      0.9U(1,2) + 0.1U(1,1),                // Left
                      0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),    // Down
                      0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }   // Right
         = argmax_a { 0.9×0.7843 + 0.1×1,
                      0.9×0.7843 + 0.1×0.5413,
                      0.8×0.5413 + 0.1×1 + 0.1×0.7843,
                      0.8×1 + 0.1×0.7843 + 0.1×0.5413 }
         = argmax_a { 0.8059, 0.76, 0.6115, 0.9326 } = Right → don't have to update
Policy: π(1,1) = Up, π(1,2) = Right
32. Example (cont.)
- No change in the policy was found → finish
- The optimal policy: π(1,1) = Up, π(1,2) = Right
- Policy iteration must terminate, since the number of policies is finite
33. Simplified Policy Iteration
- Can focus on a subset of the states
- Find their utility by simplified value iteration:
  U_{i+1}(s) = R(s) + γ Σ_s' T(s, π(s), s') U_i(s')
- OR by policy improvement
- Guaranteed to converge under certain conditions on the initial policy and utility values
34. Policy Iteration Properties
- Linear equations – easy to solve
- Fast convergence in practice
- Proven to converge to an optimal policy
35. Value vs. Policy Iteration
- Which one to use?
- Policy iteration is more expensive per iteration
- In practice, policy iteration requires fewer iterations