Markov Decision Process (MDP)

1
Markov Decision Process (MDP)
  • Ruti Glick
  • Bar-Ilan University

2
Policy
  • A policy is similar to a plan: it is generated ahead of time
  • Unlike a traditional plan, it is not a sequence of actions that the agent must execute
  • If there are failures in execution, the agent can simply continue to follow the policy
  • A policy prescribes an action for every state
  • It maximizes expected reward, rather than just reaching a goal state

3
Utility and Policy
  • Utility
  • Computed for every state
  • How useful is this state for the overall task?
  • Policy
  • A complete mapping from states to actions
  • In which state should I perform which action?
  • policy: state → action

4
The optimal Policy
π*(s) = argmax_a Σ_s' T(s, a, s') U(s')
where T(s, a, s') is the probability of reaching state s' from state s by taking action a,
and U(s') is the utility of state s'.
  • If we know the utilities, we can easily compute the optimal policy.
  • The problem is to compute the correct utilities for all states.

5
Finding π*
  • Value iteration
  • Policy iteration

6
Value iteration
  • Process
  • Calculate the utility of each state
  • Use the values to select an optimal action

7
Bellman Equation
  • Bellman equation:
  • U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s')
  • For example:
    U(1,1) = -0.04 + γ max{ 0.8U(1,2) + 0.1U(2,1) + 0.1U(1,1),   (Up)
                            0.9U(1,1) + 0.1U(1,2),                (Left)
                            0.9U(1,1) + 0.1U(2,1),                (Down)
                            0.8U(2,1) + 0.1U(1,2) + 0.1U(1,1) }   (Right)

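Written as code, one Bellman backup for a single state is just a max over the actions' expected utilities. The sketch below is illustrative only; the dictionary-based containers T, R, U and the actions list are assumptions, not part of the slides.

```python
# Minimal sketch of a single Bellman backup (illustrative names, not from the slides).
# T[(s, a)] is assumed to be a list of (probability, next_state) pairs,
# U and R are dictionaries keyed by state, actions is an iterable of actions.

def bellman_update(s, U, R, T, actions, gamma):
    """Return R(s) + gamma * max_a sum_s' T(s, a, s') * U(s')."""
    best = max(sum(p * U[s2] for p, s2 in T[(s, a)]) for a in actions)
    return R[s] + gamma * best
```
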
8
Bellman Equation
  • Properties:
  • U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s')
  • n equations, one for each state
  • n variables (the n utilities)
  • Problem:
  • max is not a linear operator, so these are non-linear equations
  • Solution:
  • an iterative approach

9
Value iteration algorithm
  • Start with arbitrary initial values for the utilities
  • Update the utility of each state from the utilities of its neighbours:
  • U_{i+1}(s) ← R(s) + γ max_a Σ_s' T(s, a, s') U_i(s')
  • This iteration step is called a Bellman update
  • Repeat until the values converge

10
Value Iteration properties
  • The equilibrium of these updates is the unique solution of the Bellman equations
  • One can prove that the value iteration process converges to it
  • We don't need exact utility values, only values accurate enough to determine a good policy

11
Convergence
  • The value iteration update is a contraction:
  • a function of one argument that,
  • when applied to two inputs, produces values that are closer together than the inputs were
  • A contraction has only one fixed point
  • Applying the contraction always moves the value closer to that fixed point
  • (We are not going to prove this last point)
  • Therefore value iteration converges to the correct values

12
Value Iteration Algorithm
  • function VALUE-ITERATION(mdp) returns a utility function
  •   inputs: mdp, an MDP with states S, transition model T,
  •           reward function R, discount γ
  •   local variables: U, U', vectors of utilities for states in S,
  •           initially identical to R
  •   repeat
  •     U ← U'
  •     for each state s in S do
  •       U'[s] ← R[s] + γ max_a Σ_s' T(s, a, s') U[s']
  •   until CLOSE-ENOUGH(U, U')
  •   return U

13
Example
  • Small version of our main example
  • 2x2 world
  • The agent is placed in (1,1)
  • States (2,1), (2,2) are goal states
  • If blocked by a wall, the agent stays in place
  • The rewards are written on the board: R(1,1) = R(1,2) = -0.04, R(2,1) = -1, R(2,2) = +1
    (a runnable value iteration sketch for this world follows below)
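As a runnable companion to the pseudocode of the previous slide, here is a minimal value iteration sketch for this 2x2 world. The transition table is an assumed encoding of the example above, with γ = 1 and an arbitrary stopping threshold; its first sweep produces the hand-computed values U(1,1) = -0.08 and U(1,2) = 0.752 shown on the next slides.

```python
# A minimal sketch of value iteration for the 2x2 example world
# (assumed encoding of the slides' model; gamma = 1, epsilon chosen arbitrarily).

GAMMA = 1.0
R = {(1, 1): -0.04, (1, 2): -0.04, (2, 1): -1.0, (2, 2): 1.0}
TERMINAL = {(2, 1), (2, 2)}

# T[(s, a)] lists (probability, next_state) pairs; walls keep the agent in place.
T = {
    ((1, 1), 'Up'):    [(0.8, (1, 2)), (0.1, (2, 1)), (0.1, (1, 1))],
    ((1, 1), 'Left'):  [(0.9, (1, 1)), (0.1, (1, 2))],
    ((1, 1), 'Down'):  [(0.9, (1, 1)), (0.1, (2, 1))],
    ((1, 1), 'Right'): [(0.8, (2, 1)), (0.1, (1, 2)), (0.1, (1, 1))],
    ((1, 2), 'Up'):    [(0.9, (1, 2)), (0.1, (2, 2))],
    ((1, 2), 'Left'):  [(0.9, (1, 2)), (0.1, (1, 1))],
    ((1, 2), 'Down'):  [(0.8, (1, 1)), (0.1, (1, 2)), (0.1, (2, 2))],
    ((1, 2), 'Right'): [(0.8, (2, 2)), (0.1, (1, 2)), (0.1, (1, 1))],
}
ACTIONS = ['Up', 'Left', 'Down', 'Right']


def value_iteration(eps=1e-4):
    U = dict(R)                           # initialise utilities with the rewards
    while True:
        U_new = dict(U)
        delta = 0.0
        for s in R:
            if s in TERMINAL:
                continue                  # goal-state utilities stay at their reward
            best = max(sum(p * U[s2] for p, s2 in T[(s, a)]) for a in ACTIONS)
            U_new[s] = R[s] + GAMMA * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:                   # "close enough" test
            return U


print(value_iteration())
```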

14
Example (cont.)
  • First iteration:
  • U(1,1) = R(1,1) + γ max{ 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
                            0.9U(1,1) + 0.1U(1,2),
                            0.9U(1,1) + 0.1U(2,1),
                            0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }
           = -0.04 + 1 × max{ 0.8×(-0.04) + 0.1×(-0.04) + 0.1×(-1),
                              0.9×(-0.04) + 0.1×(-0.04),
                              0.9×(-0.04) + 0.1×(-1),
                              0.8×(-1) + 0.1×(-0.04) + 0.1×(-0.04) }
           = -0.04 + max{ -0.136, -0.04, -0.136, -0.808 } = -0.08
  • U(1,2) = R(1,2) + γ max{ 0.9U(1,2) + 0.1U(2,2),
                            0.9U(1,2) + 0.1U(1,1),
                            0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
                            0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }
           = -0.04 + 1 × max{ 0.9×(-0.04) + 0.1×1,
                              0.9×(-0.04) + 0.1×(-0.04),
                              0.8×(-0.04) + 0.1×1 + 0.1×(-0.04),
                              0.8×1 + 0.1×(-0.04) + 0.1×(-0.04) }
           = -0.04 + max{ 0.064, -0.04, 0.064, 0.792 } = 0.752
  • The goal states remain unchanged

15
Example (cont.)
  • Second iteration (same action expressions, now using U(1,1) = -0.08, U(1,2) = 0.752):
  • U(1,1) = -0.04 + 1 × max{ 0.8×0.752 + 0.1×(-0.08) + 0.1×(-1),
                              0.9×(-0.08) + 0.1×0.752,
                              0.9×(-0.08) + 0.1×(-1),
                              0.8×(-1) + 0.1×(-0.08) + 0.1×0.752 }
           = -0.04 + max{ 0.4936, 0.0032, -0.172, -0.7328 } = 0.4536
  • U(1,2) = -0.04 + 1 × max{ 0.9×0.752 + 0.1×1,
                              0.9×0.752 + 0.1×(-0.08),
                              0.8×(-0.08) + 0.1×1 + 0.1×0.752,
                              0.8×1 + 0.1×0.752 + 0.1×(-0.08) }
           = -0.04 + max{ 0.7768, 0.6688, 0.1112, 0.8672 } = 0.8272

16
Example (cont.)
  • Third iteration (using U(1,1) = 0.4536, U(1,2) = 0.8272):
  • U(1,1) = -0.04 + 1 × max{ 0.8×0.8272 + 0.1×0.4536 + 0.1×(-1),
                              0.9×0.4536 + 0.1×0.8272,
                              0.9×0.4536 + 0.1×(-1),
                              0.8×(-1) + 0.1×0.4536 + 0.1×0.8272 }
           = -0.04 + max{ 0.6071, 0.491, 0.3082, -0.6719 } = 0.5676
  • U(1,2) = -0.04 + 1 × max{ 0.9×0.8272 + 0.1×1,
                              0.9×0.8272 + 0.1×0.4536,
                              0.8×0.4536 + 0.1×1 + 0.1×0.8272,
                              0.8×1 + 0.1×0.8272 + 0.1×0.4536 }
           = -0.04 + max{ 0.8444, 0.7898, 0.5456, 0.9281 } = 0.8881

17
Example (cont.)
  • Continue to the next iteration
  • Finish when the utilities are close enough
  • Here the largest change in the last iteration was 0.114. Is that close enough?

18
Close enough?
  • We will not go deeply into this issue
  • There are different ways to detect convergence:
  • RMS error: the root mean square error of the utility values compared to the correct values
  • Require RMS(U, U') < ε
  • where ε is the maximum error allowed in the utility of any state in an iteration

19
Close enough? (cont.)
  • Policy loss: the difference between the expected utility of following the current policy and the expected utility of following the optimal policy
  • Stop when ||U_{i+1} - U_i|| < ε (1 - γ) / γ
  • where ||U|| = max_s |U(s)|,
  • ε is the maximum error allowed in the utility of any state in an iteration,
  • and γ is the discount factor
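As a small illustration, the stopping test based on this bound could be written as follows (assumed names; U and U_new are utility dictionaries as in the earlier sketch). Note that the bound is only meaningful for γ < 1; the 2x2 example uses γ = 1, where a plain threshold on the maximum change is used instead.

```python
def close_enough(U, U_new, gamma, eps):
    """True when max_s |U_new(s) - U(s)| < eps * (1 - gamma) / gamma.

    Only meaningful for gamma < 1; with gamma = 1 use a plain threshold
    on the maximum change instead (as in the value iteration sketch above).
    """
    delta = max(abs(U_new[s] - U[s]) for s in U)
    return delta < eps * (1 - gamma) / gamma
```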

20
Finding the policy
  • Once the true utilities have been found,
  • we can extract the optimal policy:
  • for each s in S do π[s] ← argmax_a Σ_s' T(s, a, s') U(s')
  • return π (sketched in code below)
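In code, policy extraction is a single argmax per state. This sketch reuses the assumed R, T, ACTIONS, and TERMINAL structures from the value iteration sketch earlier; running extract_policy(value_iteration()) should return Up for (1,1) and Right for (1,2), matching the next slide.

```python
def extract_policy(U):
    """Greedy policy with respect to the utilities U."""
    pi = {}
    for s in R:
        if s in TERMINAL:
            continue  # no action is needed in the goal states
        pi[s] = max(ACTIONS, key=lambda a: sum(p * U[s2] for p, s2 in T[(s, a)]))
    return pi
```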

21
Example (cont.)
  • Find the optimal policy (using U(1,1) = 0.5676, U(1,2) = 0.8881):
  • π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),      // Up
                        0.9U(1,1) + 0.1U(1,2),                  // Left
                        0.9U(1,1) + 0.1U(2,1),                  // Down
                        0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }     // Right
           = argmax_a { 0.8×0.8881 + 0.1×0.5676 + 0.1×(-1),
                        0.9×0.5676 + 0.1×0.8881,
                        0.9×0.5676 + 0.1×(-1),
                        0.8×(-1) + 0.1×0.5676 + 0.1×0.8881 }
           = argmax_a { 0.6672, 0.5996, 0.4108, -0.6512 } = Up
  • π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),                  // Up
                        0.9U(1,2) + 0.1U(1,1),                  // Left
                        0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),      // Down
                        0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }     // Right
           = argmax_a { 0.9×0.8881 + 0.1×1,
                        0.9×0.8881 + 0.1×0.5676,
                        0.8×0.5676 + 0.1×1 + 0.1×0.8881,
                        0.8×1 + 0.1×0.8881 + 0.1×0.5676 }
           = argmax_a { 0.8993, 0.8561, 0.6429, 0.9456 } = Right

22
Summary: value iteration
1. The given environment
2. Calculate the utilities
3. Extract the optimal policy
4. Execute the actions
23
Example: convergence
[Figure: convergence plot for the example, showing the allowed error]
24
Policy iteration
  • Pick a policy, then calculate the utility of each state given that policy (the policy evaluation step)
  • Update the policy at each state using the utilities of the successor states (policy improvement)
  • Repeat until the policy stabilizes

25
Policy iteration
  • At each step, for each state:
  • Policy evaluation:
  • given the policy π_i,
  • calculate the utility U_i of each state if π_i were to be executed
  • Policy improvement:
  • calculate a new policy π_{i+1} based on U_i:
  • π_{i+1}[s] ← argmax_a Σ_s' T(s, a, s') U_i(s')

26
Policy iteration Algorithm
  • function POLICY-ITERATION(mdp) returns a policy
  •   inputs: mdp, an MDP with states S, transition model T
  •   local variables: U, a vector of utilities for states in S, initially identical to R
  •           π, a policy, a vector indexed by states, initially random
  •   repeat
  •     U ← POLICY-EVALUATION(π, U, mdp)
  •     unchanged? ← true
  •     for each state s in S do
  •       if max_a Σ_s' T(s, a, s') U[s'] > Σ_s' T(s, π[s], s') U[s'] then
  •         π[s] ← argmax_a Σ_s' T(s, a, s') U[s']
  •         unchanged? ← false
  •     end
  •   until unchanged?
  •   return π
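A runnable counterpart for the same 2x2 example (an illustrative sketch: it reuses the assumed R, T, ACTIONS, TERMINAL, and GAMMA structures from the value iteration sketch, and performs exact policy evaluation by solving the linear system with numpy). Starting from the all-Up policy, the first evaluation solves to U(1,1) ≈ 0.3778 and U(1,2) = 0.6, matching the hand calculation on the following slides, and the algorithm terminates with π(1,1) = Up, π(1,2) = Right.

```python
import numpy as np

def expected_utility(s, a, U):
    """Expected utility of taking action a in state s under utilities U."""
    return sum(p * U[s2] for p, s2 in T[(s, a)])

def policy_evaluation(pi):
    """Solve U(s) = R(s) + gamma * sum_s' T(s, pi(s), s') U(s') exactly."""
    states = list(R)
    idx = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))
    b = np.array([R[s] for s in states], dtype=float)
    for s in states:
        if s in TERMINAL:
            continue                      # terminal states keep U = R
        for p, s2 in T[(s, pi[s])]:
            A[idx[s], idx[s2]] -= GAMMA * p
    x = np.linalg.solve(A, b)
    return {s: x[idx[s]] for s in states}

def policy_iteration():
    pi = {s: 'Up' for s in R if s not in TERMINAL}   # initial policy: Up everywhere
    while True:
        U = policy_evaluation(pi)
        unchanged = True
        for s in pi:
            best = max(ACTIONS, key=lambda a: expected_utility(s, a, U))
            if expected_utility(s, best, U) > expected_utility(s, pi[s], U):
                pi[s] = best
                unchanged = False
        if unchanged:
            return pi, U

print(policy_iteration())
```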

27
Example
  • Back to our last example
  • 2x2 world
  • The agent is placed in (1,1)
  • States (2,1), (2,2) are goal states
  • If blocked by a wall, the agent stays in place
  • The rewards are written on the board, as before
  • Initial policy: Up (in every state)

28
Example (cont.)
  • First iteration: policy evaluation
  • Under the current policy (Up in every state):
  • U(1,1) = R(1,1) + γ (0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1))
  • U(1,2) = R(1,2) + γ (0.9U(1,2) + 0.1U(2,2))
  • U(2,1) = R(2,1)
  • U(2,2) = R(2,2)
  • With γ = 1:
  • U(1,1) = -0.04 + 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1)
  • U(1,2) = -0.04 + 0.9U(1,2) + 0.1U(2,2)
  • U(2,1) = -1
  • U(2,2) = 1
  • Rearranged as a linear system:
      0.04 = -0.9U(1,1) + 0.8U(1,2) + 0.1U(2,1) + 0U(2,2)
      0.04 =  0U(1,1) - 0.1U(1,2) + 0U(2,1) + 0.1U(2,2)
      -1   =  0U(1,1) + 0U(1,2) + 1U(2,1) + 0U(2,2)
       1   =  0U(1,1) + 0U(1,2) + 0U(2,1) + 1U(2,2)
  • Solution:
  • U(1,1) = 0.3778
  • U(1,2) = 0.6
  • U(2,1) = -1
  • U(2,2) = 1

Policy: π(1,1) = Up, π(1,2) = Up
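As a quick check, that 4x4 linear system can be solved directly. The snippet below is an illustrative verification (rows are the four rearranged equations above, columns ordered U(1,1), U(1,2), U(2,1), U(2,2)):

```python
import numpy as np

# Coefficient matrix of the rearranged equations for the all-Up policy.
A = np.array([[-0.9,  0.8,  0.1,  0.0],
              [ 0.0, -0.1,  0.0,  0.1],
              [ 0.0,  0.0,  1.0,  0.0],
              [ 0.0,  0.0,  0.0,  1.0]])
b = np.array([0.04, 0.04, -1.0, 1.0])
print(np.linalg.solve(A, b))   # approx [0.3778, 0.6, -1.0, 1.0]
```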
29
Example (cont.)
  • First iteration: policy improvement
  • π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),      // Up
                        0.9U(1,1) + 0.1U(1,2),                  // Left
                        0.9U(1,1) + 0.1U(2,1),                  // Down
                        0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }     // Right
           = argmax_a { 0.8×0.6 + 0.1×0.3778 + 0.1×(-1),
                        0.9×0.3778 + 0.1×0.6,
                        0.9×0.3778 + 0.1×(-1),
                        0.8×(-1) + 0.1×0.3778 + 0.1×0.6 }
           = argmax_a { 0.4178, 0.4, 0.24, -0.7022 } = Up → no update needed
  • π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),                  // Up
                        0.9U(1,2) + 0.1U(1,1),                  // Left
                        0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),      // Down
                        0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }     // Right
           = argmax_a { 0.9×0.6 + 0.1×1,
                        0.9×0.6 + 0.1×0.3778,
                        0.8×0.3778 + 0.1×1 + 0.1×0.6,
                        0.8×1 + 0.1×0.6 + 0.1×0.3778 }
           = argmax_a { 0.64, 0.5778, 0.4622, 0.8978 } = Right → update

Policy: π(1,1) = Up, π(1,2) = Up
30
Example (cont.)
  • Second iteration: policy evaluation
  • Under the current policy (π(1,1) = Up, π(1,2) = Right):
  • U(1,1) = R(1,1) + γ (0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1))
  • U(1,2) = R(1,2) + γ (0.1U(1,2) + 0.8U(2,2) + 0.1U(1,1))
  • U(2,1) = R(2,1)
  • U(2,2) = R(2,2)
  • With γ = 1:
  • U(1,1) = -0.04 + 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1)
  • U(1,2) = -0.04 + 0.1U(1,2) + 0.8U(2,2) + 0.1U(1,1)
  • U(2,1) = -1
  • U(2,2) = 1
  • Rearranged as a linear system:
      0.04 = -0.9U(1,1) + 0.8U(1,2) + 0.1U(2,1) + 0U(2,2)
      0.04 =  0.1U(1,1) - 0.9U(1,2) + 0U(2,1) + 0.8U(2,2)
      -1   =  0U(1,1) + 0U(1,2) + 1U(2,1) + 0U(2,2)
       1   =  0U(1,1) + 0U(1,2) + 0U(2,1) + 1U(2,2)
  • Solution:
  • U(1,1) = 0.5413
  • U(1,2) = 0.7843
  • U(2,1) = -1
  • U(2,2) = 1

Policy: π(1,1) = Up, π(1,2) = Right
31
Example (cont.)
  • Second iteration: policy improvement
  • π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),      // Up
                        0.9U(1,1) + 0.1U(1,2),                  // Left
                        0.9U(1,1) + 0.1U(2,1),                  // Down
                        0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }     // Right
           = argmax_a { 0.8×0.7843 + 0.1×0.5413 + 0.1×(-1),
                        0.9×0.5413 + 0.1×0.7843,
                        0.9×0.5413 + 0.1×(-1),
                        0.8×(-1) + 0.1×0.5413 + 0.1×0.7843 }
           = argmax_a { 0.5816, 0.5656, 0.3871, -0.6674 } = Up → no update needed
  • π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),                  // Up
                        0.9U(1,2) + 0.1U(1,1),                  // Left
                        0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),      // Down
                        0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }     // Right
           = argmax_a { 0.9×0.7843 + 0.1×1,
                        0.9×0.7843 + 0.1×0.5413,
                        0.8×0.5413 + 0.1×1 + 0.1×0.7843,
                        0.8×1 + 0.1×0.7843 + 0.1×0.5413 }
           = argmax_a { 0.8059, 0.76, 0.6115, 0.9326 } = Right → no update needed

Policy: π(1,1) = Up, π(1,2) = Right
32
Example (cont.)
  • No change in the policy was found, so we finish
  • The optimal policy: π(1,1) = Up, π(1,2) = Right
  • Policy iteration must terminate, since the number of distinct policies is finite

33
Simplified policy iteration
  • Can focus on a subset of the states
  • Find utilities by a simplified value iteration under the current policy:
    U_{i+1}(s) ← R(s) + γ Σ_s' T(s, π_i(s), s') U_i(s')
  • OR by policy improvement
  • Guaranteed to converge under certain conditions on the initial policy and utility values
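A sketch of this simplified (approximate) evaluation step, in the style of modified policy iteration, reusing the assumed R, T, and GAMMA structures from the earlier sketches; the number of backups k is an illustrative choice.

```python
def approximate_policy_evaluation(pi, U, k=5):
    """Run k simplified value-iteration backups under the fixed policy pi."""
    U = dict(U)
    for _ in range(k):
        U_new = dict(U)
        for s in pi:                      # pi holds the non-terminal states
            U_new[s] = R[s] + GAMMA * sum(p * U[s2] for p, s2 in T[(s, pi[s])])
        U = U_new
    return U
```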

34
Policy Iteration properties
  • The policy evaluation step involves only linear equations, which are easy to solve
  • Converges quickly in practice
  • The resulting policy is provably optimal

35
Value vs. Policy Iteration
  • Which should we use?
  • Policy iteration is more expensive per iteration
  • In practice, policy iteration requires fewer iterations