Title: An Introduction to PO-MDP
MDP
- Components
- State
- Action
- Transition
- Reinforcement
- Problem
- Choose the action that makes the right tradeoff between the immediate rewards and the future gains, to yield the best possible solution.
- Solution
- Policy (value function)
Definitions
- Horizon length
- Value Iteration
- Temporal Difference Learning
- Q(x,a) ← Q(x,a) + α(r + γ max_b Q(y,b) - Q(x,a))
- where α is the learning rate and γ is the discount rate (a code sketch of this update follows this slide).
- Adding partial observability (PO) to a CO-MDP is not trivial
- These methods require complete observability of the state.
- PO clouds the current state.
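A minimal sketch of the tabular TD update above, assuming a generic Q-table keyed by (state, action); the `td_update` helper, the state/action names, and the constants are illustrative, not from the slides.

```python
from collections import defaultdict

ALPHA = 0.1   # learning rate (alpha)
GAMMA = 0.9   # discount rate (gamma)

Q = defaultdict(float)  # Q[(state, action)] -> current estimate

def td_update(x, a, r, y, actions):
    """Apply Q(x,a) <- Q(x,a) + alpha * (r + gamma * max_b Q(y,b) - Q(x,a))."""
    best_next = max(Q[(y, b)] for b in actions)
    Q[(x, a)] += ALPHA * (r + GAMMA * best_next - Q[(x, a)])

# Example: after taking a1 in s1, receiving reward 1.0 and landing in s2
td_update('s1', 'a1', 1.0, 's2', actions=['a1', 'a2'])
```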
PO-MDP
- Components
- States
- Actions
- Transitions
- Reinforcement
- Observations
Mapping in CO-MDP vs. PO-MDP
- In CO-MDPs, mapping is from states to actions.
- In PO-MDPs, mapping is from probability
distributions (over states) to actions.
VI in CO-MDP vs. PO-MDP
- In a CO-MDP,
- Track our current state
- Update it after each action
- In a PO-MDP,
- Probability distribution over states
- Perform an action and make an observation, then
update the distribution
Belief State and Space
- Belief state: a probability distribution over states.
- Belief space: the entire probability space (the set of all belief states).
- Example
- Assume a two-state PO-MDP.
- P(s1) = p, P(s2) = 1 - p, so every belief state is a point on a line segment.
- The line becomes a hyper-plane in higher dimensions.
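For this two-state example the belief space is effectively one-dimensional; a small sketch (the `belief` helper is illustrative):

```python
# A belief is fully determined by p = P(s1), since P(s2) = 1 - p.
def belief(p):
    assert 0.0 <= p <= 1.0, "p must be a valid probability"
    return (p, 1.0 - p)

# Sweeping p from 0 to 1 covers the whole belief space of this PO-MDP.
beliefs = [belief(i / 10) for i in range(11)]
```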
Belief Transform
- Assumptions
- Finite set of actions
- Finite set of observations
- Next belief state: b' = T(b, a, o), where
- b is the current belief state, a the action, and o the observation
- Finite number of possible next belief states
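A sketch of the belief transform under these assumptions; the transition model `T[s][a][s2]`, the observation model `O[s2][a][o]`, and the list-based belief representation are illustrative choices, not part of the slides.

```python
def belief_update(b, a, o, T, O):
    """Return b' = T(b, a, o): the belief after taking action a and observing o.

    b : current belief, b[s] = P(s)
    T : transition model, T[s][a][s2] = P(s2 | s, a)
    O : observation model, O[s2][a][o] = P(o | s2, a)
    """
    n = len(b)
    unnormalized = [
        O[s2][a][o] * sum(T[s][a][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnormalized)  # = P(o | b, a); assumed nonzero here
    return [x / total for x in unnormalized]
```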
PO-MDP into Continuous CO-MDP
- The process is Markovian: the next belief state depends only on
- Current belief state
- Current action
- Observation
- A discrete PO-MDP problem can therefore be converted into a continuous-space CO-MDP problem, where the continuous space is the belief space.
Problem
- Using VI in a continuous state space.
- There is no nice tabular representation as before.
PWLC
- Restrictions on the form of the solutions to the continuous-space CO-MDP:
- The finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length.
- The value of a belief point is simply the dot product of two vectors: the belief state and the vector for that linear segment.
- Goal: for each iteration of value iteration, find a finite number of linear segments that make up the value function.
Steps in VI
- Represent the value function for each horizon as a set of vectors.
- This overcomes the problem of representing a value function over a continuous space.
- To evaluate a belief state, find the vector that has the largest dot product with it.
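A sketch of this vector representation: store the PWLC value function as a finite set of vectors (one per linear segment) and evaluate a belief by taking the largest dot product. The helper names are illustrative.

```python
def dot(alpha, b):
    """Dot product of a value-function vector with a belief state."""
    return sum(a_i * b_i for a_i, b_i in zip(alpha, b))

def value(b, vectors):
    """V(b): the largest dot product of the belief with any stored vector."""
    return max(dot(alpha, b) for alpha in vectors)

def best_vector(b, vectors):
    """The vector (linear segment) that attains the maximum at belief b."""
    return max(vectors, key=lambda alpha: dot(alpha, b))
```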
PO-MDP Value Iteration Example
- Assumptions
- Two states (s1, s2)
- Two actions (a1, a2)
- Three observations (z1, z2, z3)
- Example: horizon length is 1.
- b = (0.25, 0.75)
- Immediate rewards: R(s1, a1) = 1, R(s2, a1) = 0, R(s1, a2) = 0, R(s2, a2) = 1.5
- V(a1, b) = 0.25 x 1 + 0.75 x 0 = 0.25
- V(a2, b) = 0.25 x 0 + 0.75 x 1.5 = 1.125
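The same horizon-1 calculation written as dot products, using only numbers that appear on this slide:

```python
b = (0.25, 0.75)     # belief: P(s1) = 0.25, P(s2) = 0.75
r_a1 = (1.0, 0.0)    # immediate rewards for a1 in (s1, s2)
r_a2 = (0.0, 1.5)    # immediate rewards for a2 in (s1, s2)

V_a1 = b[0] * r_a1[0] + b[1] * r_a1[1]   # 0.25
V_a2 = b[0] * r_a2[0] + b[1] * r_a2[1]   # 1.125
print(max(V_a1, V_a2))                   # best horizon-1 value at b: 1.125
```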
PO-MDP Value Iteration Example
- The value of a belief state for horizon length 2, given b, a1, z1:
- The immediate reward plus the value of the next action.
- Find the best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1.
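A sketch of that computation, reusing the `belief_update` and `value` helpers sketched earlier; the reward model `R[s][a]` and the set of horizon-1 vectors are assumed inputs, not given in the slides.

```python
def value_given_action_obs(b, a, o, R, T, O, horizon1_vectors):
    """Horizon-2 value of belief b, given that we take action a and observe o."""
    n = len(b)
    immediate = sum(b[s] * R[s][a] for s in range(n))   # expected immediate reward
    next_b = belief_update(b, a, o, T, O)                # resulting belief state
    return immediate + value(next_b, horizon1_vectors)   # plus best achievable next value
```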
PO-MDP Value Iteration Example
- Find the value for all belief points, given this fixed action and observation.
- The transformed value function is also PWLC.
PO-MDP Value Iteration Example
- How do we compute the value of a belief state given only the action?
- The horizon-2 value of the belief state, given that the first action is a1:
- Value of the resulting belief for each observation: z1: 0.8, z2: 0.7, z3: 1.2
- P(z1 | b, a1) = 0.6, P(z2 | b, a1) = 0.25, P(z3 | b, a1) = 0.15
- 0.6 x 0.8 + 0.25 x 0.7 + 0.15 x 1.2 = 0.835
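The same weighted sum as code, using exactly the numbers from this slide:

```python
values = {"z1": 0.8, "z2": 0.7, "z3": 1.2}    # value of the resulting belief for each observation
probs  = {"z1": 0.6, "z2": 0.25, "z3": 0.15}  # P(z | b, a1)

V_b_a1 = sum(probs[z] * values[z] for z in values)
print(V_b_a1)  # 0.835
```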
Transformed Value Functions
- Each of these transformed functions partitions the belief space differently.
- The best next action to perform depends upon the initial belief state and the observation.
Best Value for Belief States
- The value of every single belief point is the sum of:
- The immediate reward.
- The line segments from the S() functions for each observation's future strategy.
- Since adding lines gives you lines, the result is linear.
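A sketch of why that sum stays linear: for a fixed action and a fixed choice of future strategy per observation, the combined value function is just a component-wise sum of vectors, i.e. one more linear segment. The helper name is illustrative.

```python
def combined_vector(immediate_reward_vector, chosen_segments):
    """Add the immediate-reward vector and one chosen S(a, z) segment per observation."""
    vec = list(immediate_reward_vector)
    for seg in chosen_segments:                  # one linear segment (vector) per observation
        vec = [v + s for v, s in zip(vec, seg)]
    return vec                                   # a single vector, so still linear in the belief
```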
Best Strategy for Any Belief Point
- All the useful future strategies are easy to pick
out
Value Function and Partition
- For the specific action a1, the value function
and corresponding partitions
Value Function and Partition
- For the specific action a2, the value function
and corresponding partitions
Which Action to Choose?
- Put the value functions for each action together to see where each action gives the highest value.
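Combining the per-action value functions amounts to a pointwise maximum over all of their vectors; a sketch with illustrative names, reusing the vector representation from the earlier sketches:

```python
def best_action(b, vectors_by_action):
    """Return (action, value) with the highest value at belief b."""
    def dot(alpha):
        return sum(a_i * b_i for a_i, b_i in zip(alpha, b))
    scored = ((action, max(dot(alpha) for alpha in vecs))
              for action, vecs in vectors_by_action.items())
    return max(scored, key=lambda pair: pair[1])
```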
Compact Horizon-2 Value Function
Value Function for Action a1 with a Horizon of 3
Value Function for Action a2 with a Horizon of 3
Value Function for Both Actions with a Horizon of 3
Value Function for a Horizon of 3