An Introduction to PO-MDP

1
An Introduction to PO-MDP
  • Presented by
  • Alp Sardag

2
MDP
  • Components
  • State
  • Action
  • Transition
  • Reinforcement
  • Problem
  • Choose the action that makes the right tradeoffs
    between the immediate rewards and the future
    gains, to yield the best possible solution.
  • Solution
  • A policy, represented by a value function.

3
Definition
  • Horizon length
  • Value Iteration
  • Temporal Difference Learning
  • Q(x,a) ← Q(x,a) + α(r + γ·max_b Q(y,b) − Q(x,a))
  • where α is the learning rate and γ is the
    discount rate (see the sketch after this list).
  • Adding PO to a CO-MDP is not trivial:
  • These methods require complete observability of
    the state.
  • PO clouds the current state.
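A minimal sketch of the update above in code (the state/action names, α, and γ values below are illustrative assumptions, not from the slides):

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.95          # learning rate α, discount rate γ (assumed values)
Q = defaultdict(float)            # Q[(state, action)], initialized to 0

def td_update(x, a, r, y, actions):
    """One temporal-difference update after taking action a in state x,
    receiving reward r, and landing in state y."""
    best_next = max(Q[(y, b)] for b in actions)
    Q[(x, a)] += alpha * (r + gamma * best_next - Q[(x, a)])

# Example: a single update on a toy two-state, two-action problem.
td_update(x="s1", a="a1", r=1.0, y="s2", actions=["a1", "a2"])
```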

4
PO-MDP
  • Components
  • States
  • Actions
  • Transitions
  • Reinforcement
  • Observations

5
Mapping in CO-MDP vs. PO-MDP
  • In CO-MDPs, mapping is from states to actions.
  • In PO-MDPs, mapping is from probability
    distributions (over states) to actions.

6
VI in CO-MDP vs. PO-MDP
  • In a CO-MDP,
  • Track our current state
  • Update it after each action
  • In a PO-MDP,
  • Probability distribution over states
  • Perform an action and make an observation, then
    update the distribution

7
Belief State and Space
  • Belief State: probability distribution over
    states.
  • Belief Space: the entire probability space.
  • Example
  • Assume a two-state PO-MDP.
  • P(s1) = p, P(s2) = 1 − p.
  • The belief space is a line; it becomes a
    hyper-plane in higher dimensions.

(Figure: the belief space as a line parameterized by p = P(s1).)
8
Belief Transform
  • Assumptions
  • Finite set of actions
  • Finite set of observations
  • Next belief state = T(cbf, a, o), where
  • cbf = current belief state, a = action,
    o = observation (see the sketch below)
  • Finite number of possible next belief states
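In code, the belief transform follows directly from Bayes' rule. A sketch assuming tabular transition and observation arrays (the T[a, s, s'] / O[a, s', o] layout and the numbers are illustrative assumptions):

```python
import numpy as np

def belief_transform(cbf, a, o, T, O):
    """Next belief state T(cbf, a, o):
    b'(s') ∝ O[a, s', o] * sum_s T[a, s, s'] * cbf[s]."""
    unnormalized = O[a, :, o] * (cbf @ T[a])    # shape: (num_states,)
    return unnormalized / unnormalized.sum()    # normalize by P(o | cbf, a)

# Toy numbers: two states, one action, two observations.
T = np.array([[[0.7, 0.3], [0.2, 0.8]]])        # T[a, s, s']
O = np.array([[[0.9, 0.1], [0.4, 0.6]]])        # O[a, s', o]
b = np.array([0.5, 0.5])
print(belief_transform(b, a=0, o=0, T=T, O=O))  # -> [0.648 0.352]
```

With finitely many actions and observations, only finitely many next beliefs can follow any given belief.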

9
PO-MDP into continuous CO-MDP
  • The process is Markovian: the next belief state
    depends only on the
  • Current belief state
  • Current action
  • Observation
  • A discrete PO-MDP problem can be converted into a
    continuous-space CO-MDP problem where the
    continuous space is the belief space.
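The reward of that belief-space CO-MDP is just the expectation of the underlying state rewards. A one-line sketch (R[s, a] is an assumed tabular reward array):

```python
import numpy as np

def belief_reward(b, a, R):
    """Reward of the belief-space CO-MDP: rho(b, a) = sum_s b(s) * R[s, a]."""
    return float(b @ R[:, a])
```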

10
Problem
  • Using VI in a continuous state space.
  • No nice tabular representation as before.

11
PWLC
  • Restrictions on the form of the solutions to the
    continuous-space CO-MDP:
  • The finite horizon value function is piecewise
    linear and convex (PWLC) for every horizon
    length.
  • the value of a belief point is simply the dot
    product of the belief vector and a segment's
    vector.

GOAL: for each iteration of value iteration, find
a finite number of linear segments that make up
the value function.
12
Steps in VI
  • Represent the value function for each horizon as
    a set of vectors.
  • This overcomes the problem of representing a
    value function over a continuous space.
  • Find the vector that has the largest dot product
    with the belief state.
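A sketch of evaluating a PWLC value function represented as a set of vectors (the vectors and belief below are illustrative, not from the slides):

```python
import numpy as np

def value(b, alpha_vectors):
    """PWLC value of belief point b: the largest dot product of b with
    any vector in the set (one vector per linear segment)."""
    return max(float(b @ v) for v in alpha_vectors)

vectors = [np.array([1.0, 0.2]), np.array([0.3, 1.1])]
print(value(np.array([0.6, 0.4]), vectors))   # max(0.68, 0.62) = 0.68
```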

13
PO-MDP Value Iteration Example
  • Assumptions
  • Two states
  • Two actions
  • Three observations
  • Example: horizon length is 1.

b = (0.25, 0.75)

Immediate rewards:
        s1    s2
  a1     1     0
  a2     0    1.5

V(a1, b) = 0.25×1 + 0.75×0   = 0.25
V(a2, b) = 0.25×0 + 0.75×1.5 = 1.125
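The same numbers as a dot-product check (each action's reward column acts as a horizon-1 vector):

```python
import numpy as np

b = np.array([0.25, 0.75])
R = np.array([[1.0, 0.0],    # rewards in s1 under a1, a2
              [0.0, 1.5]])   # rewards in s2 under a1, a2
print(b @ R[:, 0])           # V(a1, b) = 0.25
print(b @ R[:, 1])           # V(a2, b) = 1.125, so a2 is best at this b
```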
14
PO-MDP Value Iteration Example
  • The value of a belief state for horizon length 2,
    given b, a1, z1, is
  • the value of the immediate action plus the value
    of the next action:
  • Find the best achievable value for the belief
    state that results from our initial belief state b
    when we perform action a1 and observe z1.
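Putting the pieces together (a sketch reusing the belief_transform, belief_reward, and value helpers above; gamma = 1 is an assumption matching the slides' undiscounted arithmetic):

```python
def value_fixed_a_z(b, a, z, T, O, R, alpha_vectors, gamma=1.0):
    """Horizon-2 value of b for a fixed action a and fixed observation z:
    the value of the immediate action plus the best achievable value of
    the resulting belief state under the horizon-1 value function."""
    b_next = belief_transform(b, a, z, T, O)
    return belief_reward(b, a, R) + gamma * value(b_next, alpha_vectors)
```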

15
PO-MDP Value Iteration Example
  • Find the value for all the belief points given
    this fixed action and observation.
  • The transformed value function is also PWLC.

16
PO-MDP Value Iteration Example
  • How to compute the value of a belief state given
    only the action?
  • The horizon 2 value of the belief state, given
    only the action a1:
  • Values for each observation: z1: 0.7, z2: 0.8,
    z3: 1.2
  • P(z1 | b,a1) = 0.6, P(z2 | b,a1) = 0.25,
    P(z3 | b,a1) = 0.15
  • 0.6×0.8 + 0.25×0.7 + 0.15×1.2 = 0.835
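The same expectation in code (the pairing of probabilities and values follows the slide's weighted sum as written):

```python
def value_fixed_action(probs, best_values):
    """Value of a belief state for a fixed action, before the observation
    is known: sum over z of P(z | b, a) * (best value after observing z)."""
    return sum(p * v for p, v in zip(probs, best_values))

print(value_fixed_action([0.6, 0.25, 0.15], [0.8, 0.7, 1.2]))  # 0.835
```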

17
Transformed Value Functions
  • Each of these transformed functions partitions
    the belief space differently.
  • Best next action to perform depends upon the
    initial belief state and observation.

18
Best Value For Belief States
  • The value of every single belief point is the sum
    of
  • the immediate reward, and
  • the line segments from the S() functions for each
    observation's future strategy.
  • Since adding lines gives you lines, the result is
    linear (see the sketch below).
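A sketch of that construction for one action (the S[z] layout and the toy numbers are assumptions for illustration):

```python
from itertools import product
import numpy as np

def action_vectors(R_a, S):
    """Candidate linear segments for one action: the immediate-reward
    vector plus one transformed segment per observation. Each choice of
    one segment per observation is a future strategy, and adding lines
    gives lines. S[z] lists the transformed vectors for observation z."""
    return [R_a + sum(choice) for choice in product(*S)]

# Two segments per observation for two observations gives 2 x 2 = 4
# candidate lines for this action.
R_a1 = np.array([1.0, 0.0])
S = [[np.array([0.2, 0.1]), np.array([0.0, 0.3])],   # segments for z1
     [np.array([0.1, 0.1]), np.array([0.3, 0.0])]]   # segments for z2
print(len(action_vectors(R_a1, S)))                  # 4
```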

19
Best Strategy for any Belief Points
  • All the useful future strategies are easy to pick
    out.

20
Value Function and Partition
  • For the specific action a1, the value function
    and corresponding partitions

21
Value Function and Partition
  • For the specific action a2, the value function
    and corresponding partitions

22
Which Action to Choose?
  • Put the value functions for each action together
    to see where each action gives the highest value.
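A sketch of that comparison (the {action: vectors} layout is an assumption; the toy vectors are the horizon-1 rewards from the earlier example):

```python
import numpy as np

def best_action(b, vector_sets):
    """Pick the action whose segment attains the highest value at belief b.
    vector_sets maps each action to its list of segment vectors."""
    value, action = max((float(b @ v), a)
                        for a, vs in vector_sets.items()
                        for v in vs)
    return action

vector_sets = {"a1": [np.array([1.0, 0.0])], "a2": [np.array([0.0, 1.5])]}
print(best_action(np.array([0.25, 0.75]), vector_sets))   # a2
```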

23
Compact Horizon 2 Value Function
24
Value Function for Action a1 with a Horizon of 3
25
Value Function for Action a2 with a Horizon of 3
26
Value Function for Both Actions with a Horizon of 3
27
Value Function for Horizon of 3