Title: An Introduction to PO-MDP
MDP
- Components
- State
- Action
- Transition
- Reinforcement
- Problem
- Choose the action that makes the right tradeoff between the immediate rewards and the future gains, to yield the best possible solution.
- Solution
- Policy (value function)
Definitions
- Horizon length
- Value Iteration
- Temporal Difference Learning
- Q(x,a) ← Q(x,a) + α(r + γ max_b Q(y,b) - Q(x,a))
- where α is the learning rate and γ is the discount rate (a code sketch of this update follows this slide).
- Adding partial observability (PO) to a CO-MDP is not trivial
- These methods require complete observability of the state.
- PO clouds the current state.
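A minimal sketch of the tabular TD update above, assuming a generic Q-table keyed by (state, action); the `td_update` helper, the state/action names, and the constants are illustrative, not from the slides.

```python
from collections import defaultdict

ALPHA = 0.1   # learning rate (alpha)
GAMMA = 0.9   # discount rate (gamma)

Q = defaultdict(float)  # Q[(state, action)] -> current estimate

def td_update(x, a, r, y, actions):
    """Apply Q(x,a) <- Q(x,a) + alpha * (r + gamma * max_b Q(y,b) - Q(x,a))."""
    best_next = max(Q[(y, b)] for b in actions)
    Q[(x, a)] += ALPHA * (r + GAMMA * best_next - Q[(x, a)])

# Example: after taking a1 in s1, receiving reward 1.0 and landing in s2
td_update('s1', 'a1', 1.0, 's2', actions=['a1', 'a2'])
```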
PO-MDP
- Components
- States
- Actions
- Transitions
- Reinforcement
- Observations
Mapping in CO-MDP vs. PO-MDP
- In CO-MDPs, mapping is from states to actions.
- In PO-MDPs, mapping is from probability
distributions (over states) to actions.
VI in CO-MDP vs. PO-MDP
- In a CO-MDP,
- Track our current state
- Update it after each action
- In a PO-MDP,
- Probability distribution over states
- Perform an action and make an observation, then
update the distribution
Belief State and Space
- Belief state: a probability distribution over states.
- Belief space: the entire probability space (the set of all belief states).
- Example
- Assume a two-state PO-MDP.
- P(s1) = p, P(s2) = 1 - p, so every belief state is a point on a line segment.
- The line becomes a hyper-plane in higher dimensions.
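For this two-state example the belief space is effectively one-dimensional; a small sketch (the `belief` helper is illustrative):

```python
# A belief is fully determined by p = P(s1), since P(s2) = 1 - p.
def belief(p):
    assert 0.0 <= p <= 1.0, "p must be a valid probability"
    return (p, 1.0 - p)

# Sweeping p from 0 to 1 covers the whole belief space of this PO-MDP.
beliefs = [belief(i / 10) for i in range(11)]
```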
Belief Transform
- Assumptions
- Finite set of actions
- Finite set of observations
- Next belief state: b' = T(b, a, o), where
- b is the current belief state, a the action, and o the observation
- Finite number of possible next belief states
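A sketch of the belief transform under these assumptions; the transition model `T[s][a][s2]`, the observation model `O[s2][a][o]`, and the list-based belief representation are illustrative choices, not part of the slides.

```python
def belief_update(b, a, o, T, O):
    """Return b' = T(b, a, o): the belief after taking action a and observing o.

    b : current belief, b[s] = P(s)
    T : transition model, T[s][a][s2] = P(s2 | s, a)
    O : observation model, O[s2][a][o] = P(o | s2, a)
    """
    n = len(b)
    unnormalized = [
        O[s2][a][o] * sum(T[s][a][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnormalized)  # = P(o | b, a); assumed nonzero here
    return [x / total for x in unnormalized]
```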
PO-MDP into Continuous CO-MDP
- The process is Markovian: the next belief state depends only on
- Current belief state
- Current action
- Observation
- A discrete PO-MDP problem can therefore be converted into a continuous-space CO-MDP problem, where the continuous space is the belief space.
Problem
- Using VI in a continuous state space.
- There is no nice tabular representation as before.
PWLC
- Restrictions on the form of the solutions to the continuous-space CO-MDP:
- The finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length.
- The value of a belief point is simply the dot product of two vectors: the belief state and the vector for that linear segment.
- Goal: for each iteration of value iteration, find a finite number of linear segments that make up the value function.
Steps in VI
- Represent the value function for each horizon as a set of vectors.
- This overcomes the problem of representing a value function over a continuous space.
- To evaluate a belief state, find the vector that has the largest dot product with it.
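A sketch of this vector representation: store the PWLC value function as a finite set of vectors (one per linear segment) and evaluate a belief by taking the largest dot product. The helper names are illustrative.

```python
def dot(alpha, b):
    """Dot product of a value-function vector with a belief state."""
    return sum(a_i * b_i for a_i, b_i in zip(alpha, b))

def value(b, vectors):
    """V(b): the largest dot product of the belief with any stored vector."""
    return max(dot(alpha, b) for alpha in vectors)

def best_vector(b, vectors):
    """The vector (linear segment) that attains the maximum at belief b."""
    return max(vectors, key=lambda alpha: dot(alpha, b))
```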
PO-MDP Value Iteration Example
- Assumptions
- Two states (s1, s2)
- Two actions (a1, a2)
- Three observations (z1, z2, z3)
- Example: horizon length is 1.
- b = (0.25, 0.75)
- Immediate rewards: R(s1, a1) = 1, R(s2, a1) = 0, R(s1, a2) = 0, R(s2, a2) = 1.5
- V(a1, b) = 0.25 x 1 + 0.75 x 0 = 0.25
- V(a2, b) = 0.25 x 0 + 0.75 x 1.5 = 1.125
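The same horizon-1 calculation written as dot products, using only numbers that appear on this slide:

```python
b = (0.25, 0.75)     # belief: P(s1) = 0.25, P(s2) = 0.75
r_a1 = (1.0, 0.0)    # immediate rewards for a1 in (s1, s2)
r_a2 = (0.0, 1.5)    # immediate rewards for a2 in (s1, s2)

V_a1 = b[0] * r_a1[0] + b[1] * r_a1[1]   # 0.25
V_a2 = b[0] * r_a2[0] + b[1] * r_a2[1]   # 1.125
print(max(V_a1, V_a2))                   # best horizon-1 value at b: 1.125
```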
PO-MDP Value Iteration Example
- The value of a belief state for horizon length 2, given b, a1, z1:
- The immediate reward plus the value of the next action.
- Find the best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1.
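A sketch of that computation, reusing the `belief_update` and `value` helpers sketched earlier; the reward model `R[s][a]` and the set of horizon-1 vectors are assumed inputs, not given in the slides.

```python
def value_given_action_obs(b, a, o, R, T, O, horizon1_vectors):
    """Horizon-2 value of belief b, given that we take action a and observe o."""
    n = len(b)
    immediate = sum(b[s] * R[s][a] for s in range(n))   # expected immediate reward
    next_b = belief_update(b, a, o, T, O)                # resulting belief state
    return immediate + value(next_b, horizon1_vectors)   # plus best achievable next value
```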
PO-MDP Value Iteration Example
- Find the value for all belief points, given this fixed action and observation.
- The transformed value function is also PWLC.
PO-MDP Value Iteration Example
- How do we compute the value of a belief state given only the action?
- The horizon-2 value of the belief state, given that the first action is a1:
- Value of the resulting belief for each observation: z1: 0.8, z2: 0.7, z3: 1.2
- P(z1 | b, a1) = 0.6, P(z2 | b, a1) = 0.25, P(z3 | b, a1) = 0.15
- 0.6 x 0.8 + 0.25 x 0.7 + 0.15 x 1.2 = 0.835
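The same weighted sum as code, using exactly the numbers from this slide:

```python
values = {"z1": 0.8, "z2": 0.7, "z3": 1.2}    # value of the resulting belief for each observation
probs  = {"z1": 0.6, "z2": 0.25, "z3": 0.15}  # P(z | b, a1)

V_b_a1 = sum(probs[z] * values[z] for z in values)
print(V_b_a1)  # 0.835
```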
Transformed Value Functions
- Each of these transformed functions partitions the belief space differently.
- The best next action to perform depends upon the initial belief state and the observation.
Best Value for Belief States
- The value of every single belief point is the sum of:
- The immediate reward.
- The line segments from the S() functions for each observation's future strategy.
- Since adding lines gives you lines, the result is linear.
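A sketch of why that sum stays linear: for a fixed action and a fixed choice of future strategy per observation, the combined value function is just a component-wise sum of vectors, i.e. one more linear segment. The helper name is illustrative.

```python
def combined_vector(immediate_reward_vector, chosen_segments):
    """Add the immediate-reward vector and one chosen S(a, z) segment per observation."""
    vec = list(immediate_reward_vector)
    for seg in chosen_segments:                  # one linear segment (vector) per observation
        vec = [v + s for v, s in zip(vec, seg)]
    return vec                                   # a single vector, so still linear in the belief
```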
Best Strategy for Any Belief Point
- All the useful future strategies are easy to pick
out
Value Function and Partition
- For the specific action a1, the value function
and corresponding partitions
Value Function and Partition
- For the specific action a2, the value function
and corresponding partitions
Which Action to Choose?
- Put the value functions for each action together to see where each action gives the highest value.
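Combining the per-action value functions amounts to a pointwise maximum over all of their vectors; a sketch with illustrative names, reusing the vector representation from the earlier sketches:

```python
def best_action(b, vectors_by_action):
    """Return (action, value) with the highest value at belief b."""
    def dot(alpha):
        return sum(a_i * b_i for a_i, b_i in zip(alpha, b))
    scored = ((action, max(dot(alpha) for alpha in vecs))
              for action, vecs in vectors_by_action.items())
    return max(scored, key=lambda pair: pair[1])
```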
Compact Horizon-2 Value Function
Value Function for Action a1 with a Horizon of 3
Value Function for Action a2 with a Horizon of 3
Value Function for Both Actions with a Horizon of 3
Value Function for a Horizon of 3