Markov Decision Process (MDP)


1
Markov Decision Process (MDP)
  • Value function: expected long-term reward from the state
  • Q value Q(s,a): expected long-term reward of doing a in s
  • V(s) = max_a Q(s,a)
  • Greedy policy w.r.t. a value function
  • Value of a policy
  • Optimal value function
  • S: a set of states
  • A: a set of actions
  • Pr(s'|s,a): transition model (aka M^a_{s,s'})
  • C(s,a,s'): cost model
  • G: set of goals
  • s0: start state
  • γ: discount factor
  • R(s,a,s'): reward model
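A minimal sketch of how these components might be bundled in code (illustrative Python only; the container name, field names, and types are assumptions, not from the slides):

    from dataclasses import dataclass
    from typing import Callable, Dict, Optional, Set, Tuple

    State = str
    Action = str

    @dataclass
    class MDP:
        """Illustrative container for the MDP components listed above."""
        states: Set[State]                                          # S
        actions: Set[Action]                                        # A
        transition: Dict[Tuple[State, Action], Dict[State, float]]  # Pr(s'|s,a)
        cost: Callable[[State, Action, State], float]               # C(s,a,s')
        goals: Set[State]                                           # G
        start: State                                                # s0
        gamma: float = 1.0                                          # discount factor
        reward: Optional[Callable[[State, Action, State], float]] = None  # R(s,a,s')

Later sketches in this transcript reuse this container.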

2
Examples of MDPs
  • Goal-directed, Indefinite Horizon, Cost Minimization MDP
  • ⟨S, A, Pr, C, G, s0⟩
  • Most often studied in the planning community
  • Infinite Horizon, Discounted Reward Maximization MDP
  • ⟨S, A, Pr, R, γ⟩
  • Most often studied in reinforcement learning
  • Goal-directed, Finite Horizon, Probability Maximization MDP
  • ⟨S, A, Pr, G, s0, T⟩
  • Also studied in the planning community
  • Oversubscription Planning: non-absorbing goals, Reward Maximization MDP
  • ⟨S, A, Pr, G, R, s0⟩
  • A relatively recent model

3
SSPPStochastic Shortest Path Problem An MDP with
Init and Goal states
  • MDPs dont have a notion of an initial and
    goal state. (Process orientation instead of
    task orientation)
  • Goals are sort of modeled by reward functions
  • Allows pretty expressive goals (in theory)
  • Normal MDP algorithms dont use initial state
    information (since policy is supposed to cover
    the entire search space anyway).
  • Could consider envelope extension methods
  • Compute a deterministic plan (which gives the
    policy for some of the states Extend the policy
    to other states that are likely to happen during
    execution
  • RTDP methods
  • SSSP are a special case of MDPs where
  • (a) initial state is given
  • (b) there are absorbing goal states
  • (c) Actions have costs. All states have zero
    rewards
  • A proper policy for SSSP is a policy which is
    guaranteed to ultimately put the agent in one of
    the absorbing states
  • For SSSP, it would be worth finding a partial
    policy that only covers the relevant states
    (states that are reachable from init and goal
    states on any optimal policy)
  • Value/Policy Iteration dont consider the notion
    of relevance
  • Consider heuristic state search algorithms
  • Heuristic can be seen as the estimate of the
    value of a state.

4
Bellman Equations for Cost Minimization MDP (absorbing goals), also called Stochastic Shortest Path
  • ⟨S, A, Pr, C, G, s0⟩
  • Define J*(s), the optimal cost, as the minimum expected cost to reach a goal from this state.
  • J* should satisfy the following equation
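The equation itself appears only as an image in the slide, so it is not in the transcript; the standard SSP Bellman equation it refers to, written in LaTeX notation consistent with the definitions above, is

    J^*(s) = 0 \quad \text{if } s \in G
    J^*(s) = \min_{a} Q^*(s,a) \quad \text{otherwise, where}
    Q^*(s,a) = \sum_{s'} \Pr(s' \mid s, a)\,\bigl[\,C(s,a,s') + J^*(s')\,\bigr]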

5
Bellman Equations for Infinite Horizon Discounted Reward Maximization MDP
  • ⟨S, A, Pr, R, s0, γ⟩
  • Define V*(s), the optimal value, as the maximum expected discounted reward from this state.
  • V* should satisfy the following equation
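As on the previous slide, the equation is an image in the original deck; the standard discounted-reward Bellman equation it refers to is

    V^*(s) = \max_{a} \sum_{s'} \Pr(s' \mid s, a)\,\bigl[\,R(s,a,s') + \gamma\, V^*(s')\,\bigr]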

6
Bellman Equations for Probability Maximization MDP
  • ⟨S, A, Pr, G, s0, T⟩
  • Define P*(s,t), the optimal probability, as the maximum probability of reaching a goal from this state within t timesteps.
  • P* should satisfy the following equation
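The equation is again an image in the original deck; one standard finite-horizon formulation consistent with the definition above (an assumption about the exact form used in the slide) is

    P^*(s,t) = 1 \quad \text{if } s \in G
    P^*(s,0) = 0 \quad \text{if } s \notin G
    P^*(s,t) = \max_{a} \sum_{s'} \Pr(s' \mid s, a)\, P^*(s', t-1) \quad \text{if } s \notin G,\ t > 0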

7
Modeling soft-goal problems as deterministic MDPs
  • Consider the net-benefit problem, where actions have costs and goals have utilities, and we want a plan with the highest net benefit
  • How do we model this as an MDP?
  • (Wrong idea) Make every state in which any subset of goals holds into a sink state with reward equal to the cumulative sum of the utilities of those goals
  • Problem: what if achieving g1 and g2 necessarily leads you through a state where g1 is already true?
  • (Correct version) Introduce a new fluent called "done" and a dummy action called Done-Deal. It is applicable in any state and asserts the fluent "done". All "done" states are sink states; their reward is equal to the sum of the utilities of the goals that hold in them (see the sketch below)
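A tiny illustration of that reward computation (hypothetical Python; the function and variable names are mine, not from the slides):

    def done_deal_reward(state_fluents, goal_utilities):
        """Reward collected when the dummy Done-Deal action is applied:
        the sum of the utilities of the (soft) goals that hold in the state."""
        return sum(u for g, u in goal_utilities.items() if g in state_fluents)

    # Example: goals g1 (utility 5) and g2 (utility 3); only g1 holds.
    print(done_deal_reward({"g1", "p"}, {"g1": 5, "g2": 3}))  # -> 5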

8
"An eye for an eye only ends up making the whole world blind." - Mohandas Karamchand Gandhi, born October 2nd, 1869.
Lecture of October 2nd, 2009
9
Ideas for Efficient Algorithms
  • Use heuristic search (and reachability information)
  • LAO*, RTDP
  • Use execution and/or simulation
  • Actual execution: reinforcement learning (the main motivation for RL is to learn the model)
  • Simulation: simulate the given model to sample possible futures
  • Policy rollout, hindsight optimization, etc.
  • Use factored representations
  • Factored representations for actions, reward functions, values and policies
  • Directly manipulating factored representations during the Bellman update

10
Heuristic Search vs. Dynamic Programming (Value/Policy Iteration)
  • VI and PI approaches use the dynamic programming update
  • Set the value of a state in terms of the maximum expected value achievable by doing actions from that state
  • They do the update for every state in the state space (see the value-iteration sketch after this list)
  • Wasteful if we know the initial state(s) the agent is starting from
  • Heuristic search (e.g. A*/AO*) explores only the part of the state space that is actually reachable from the initial state
  • Even within the reachable space, heuristic search can avoid visiting many of the states
  • Depending on the quality of the heuristic used...
  • But what is the heuristic?
  • An admissible heuristic is a lower bound on the cost to reach a goal from any given state
  • It is a lower bound on J*!
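For contrast with heuristic search, here is a minimal value-iteration loop over the full state space (illustrative Python in the cost-minimization setting, reusing the MDP container sketched on slide 1; the stopping threshold epsilon is an assumption, and every non-goal state is assumed to have at least one applicable action):

    def value_iteration(mdp, epsilon=1e-6):
        """Dynamic-programming update applied to *every* state per sweep:
        J(s) = min_a sum_s' Pr(s'|s,a) [C(s,a,s') + J(s')]."""
        J = {s: 0.0 for s in mdp.states}
        while True:
            delta = 0.0
            for s in mdp.states:              # sweeps the whole state space
                if s in mdp.goals:
                    continue                  # absorbing goals stay at cost 0
                best = min(
                    sum(p * (mdp.cost(s, a, s2) + J[s2])
                        for s2, p in mdp.transition[(s, a)].items())
                    for a in mdp.actions if (s, a) in mdp.transition)
                delta = max(delta, abs(best - J[s]))
                J[s] = best
            if delta < epsilon:
                return J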

11
Real Time Dynamic Programming [Barto, Bradtke & Singh '95]
  • Trial: simulate the greedy policy starting from the start state; perform a Bellman backup on the visited states
  • RTDP: repeat trials until the cost function converges

RTDP was originally introduced for Reinforcement Learning:
  • For RL, instead of simulating, you execute
  • You also have to do exploration in addition to exploitation: with probability p, follow the greedy policy; with probability 1-p, pick a random action (a small sketch follows below)

What if we simulate the action's effect with noise (rather than exactly w.r.t. its transition probabilities)?
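One possible reading of that exploration rule in code (hypothetical Python, cost-minimization form, reusing the MDP container from slide 1; the function name and the default p are mine):

    import random

    def epsilon_greedy_action(mdp, J, s, p=0.9):
        """With probability p pick the greedy action (lowest expected
        cost-to-go under the current J); otherwise pick a random
        applicable action to explore."""
        applicable = [a for a in mdp.actions if (s, a) in mdp.transition]
        if random.random() > p:
            return random.choice(applicable)          # explore
        return min(applicable,
                   key=lambda a: sum(pr * (mdp.cost(s, a, s2) + J[s2])
                                     for s2, pr in mdp.transition[(s, a)].items()))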
12
RTDP Trial
Note that the value function is being updated at each level of the trial. How about waiting until you hit the goal and then updating every visited state?
[Figure: one RTDP trial from s0. Q_{n+1}(s0,a) is computed for actions a1, a2, a3 from the successors' J_n values; the greedy action a_greedy (a2 here) is simulated, J_{n+1}(s0) is backed up, and the trial continues until the Goal.]
13
Greedy On-Policy RTDP without execution
Using the current utility values, select the action with the highest expected utility (the greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back until the values stabilize. (A sketch of one such trial follows below.)
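A minimal sketch of one such trial (illustrative Python, cost-minimization form, reusing the MDP container from slide 1; max_steps is an assumed safeguard, not part of the algorithm as stated):

    import random

    def rtdp_trial(mdp, J, max_steps=1000):
        """Follow the greedy policy from s0, doing a Bellman backup at each
        visited state, until a goal (or the step limit) is reached."""
        s = mdp.start
        for _ in range(max_steps):
            if s in mdp.goals:
                break
            # Bellman backup at s
            q = {a: sum(p * (mdp.cost(s, a, s2) + J[s2])
                        for s2, p in mdp.transition[(s, a)].items())
                 for a in mdp.actions if (s, a) in mdp.transition}
            a_greedy = min(q, key=q.get)
            J[s] = q[a_greedy]
            # sample the next state according to Pr(.|s, a_greedy)
            succs, probs = zip(*mdp.transition[(s, a_greedy)].items())
            s = random.choices(succs, weights=probs)[0]
        return J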
14
Comments
  • Properties
  • If all states are visited infinitely often, then Jn → J*
  • Only relevant states will be considered
  • A state is relevant if the optimal policy could visit it
  • Notice the emphasis on the optimal policy: just because a rough neighborhood surrounds the National Mall doesn't mean that you will need to know what to do in that neighborhood
  • Advantages
  • Anytime: more probable states are explored quickly
  • Disadvantages
  • Complete convergence is slow!
  • No termination condition

Do we care about complete convergence? Think of Capt. Sullenberger.
15
Labeled RTDP [Bonet & Geffner '03]
  • Initialise J0 with an admissible heuristic ⇒ Jn monotonically increases
  • Label a state as solved if the Jn for that state has converged (see the simplified sketch below)
  • Backpropagate the solved labeling
  • Stop trials when they reach any solved state
  • Terminate when s0 is solved

[Figure: if the best action from s leads straight to the goal G while the alternative actions have high Q costs, J(s) won't change, so s can be labeled solved; if s reaches G through t, both s and t get solved together.]
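A much-simplified sketch of the labeling test (illustrative Python; the full CheckSolved procedure of Bonet & Geffner searches the greedy graph, whereas this shows only the per-state residual test, with an assumed threshold epsilon):

    def residual(mdp, J, s):
        """Gap between J(s) and a fresh Bellman backup at s."""
        if s in mdp.goals:
            return 0.0
        best = min(sum(p * (mdp.cost(s, a, s2) + J[s2])
                       for s2, p in mdp.transition[(s, a)].items())
                   for a in mdp.actions if (s, a) in mdp.transition)
        return abs(best - J[s])

    def locally_converged(mdp, J, s, epsilon=1e-4):
        """A state may be labeled solved only if its residual (and, in the
        full algorithm, that of every state reachable from it under the
        greedy policy) is below epsilon."""
        return residual(mdp, J, s) < epsilon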
16
(No Transcript)
17
Properties
  • admissible J0 ⇒ optimal J*
  • heuristic-guided
  • explores a subset of the reachable state space
  • anytime
  • focuses attention on more probable states
  • fast convergence
  • focuses attention on unconverged states
  • terminates in finite time

18
Recent Advances: Focused RTDP [Smith & Simmons '06]
  • Similar to Bounded RTDP, except with
  • a more sophisticated definition of priority that combines the gap and the probability of reaching the state
  • adaptively increasing the maximum trial length

Recent Advances: Learning DFS [Bonet & Geffner '06]
  • An Iterative Deepening A* equivalent for MDPs
  • Finds strongly connected components to check whether a state is solved

19
Other Advances
  • Ordering the Bellman backups to maximise
    information flow.
  • Wingate Seppi05
  • Dai Hansen07
  • Partition the state space and combine value
    iterations from different partitions.
  • Wingate Seppi05
  • Dai Goldsmith07
  • External memory version of value iteration
  • Edelkamp, Jabbar Bonet07