1
Partially Observable MDP
2
MDP: Perfect Observation
  • Basic assumption: we know the state of the world at each stage.
  • In essence, we have perfect sensors.
  • Typically we have imperfect sensors ⇒ we can only have partial information about the state.
  • When we have imperfect information, we sometimes take actions simply to gain information.

3
POMDP
  • ⟨S, A, Tr, R, Ω, O⟩
  • S – the state space.
  • A – the set of actions.
  • Tr : S×A → Π(S) – a probability distribution over the state space.
  • Tr(s, a, s') = p – the probability of reaching s' from s using a (s is the state before, a the action, s' the state after).
  • R : S×A → ℝ – the reward for doing a∈A in state s∈S.
  • Ω – the set of possible observations.
  • O : S×A → Π(Ω).
  • O(s', a, o) = p – the probability of observing o∈Ω after performing a in s, or the probability of observing o∈Ω after doing a and reaching s' (depending on the convention).
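To make the tuple concrete, here is a minimal, purely illustrative encoding as plain Python dictionaries; the state, action and observation names and all probabilities are invented for this sketch and are not from the slides:

```python
# A tiny two-state POMDP, encoded with nested dicts (hypothetical numbers).
pomdp = {
    "S": ["s1", "s2"],                       # states
    "A": ["up", "down"],                     # actions
    "Omega": ["wall", "no-wall"],            # possible observations
    # Tr[s][a][s2] = probability of reaching s2 from s using a
    "Tr": {
        "s1": {"up": {"s1": 0.9, "s2": 0.1}, "down": {"s1": 0.2, "s2": 0.8}},
        "s2": {"up": {"s1": 0.1, "s2": 0.9}, "down": {"s1": 0.8, "s2": 0.2}},
    },
    # R[s][a] = immediate reward for doing a in s
    "R": {"s1": {"up": 0.0, "down": 1.0}, "s2": {"up": 1.0, "down": 0.0}},
    # O[s2][a][o] = probability of observing o after doing a and reaching s2
    "O": {
        "s1": {"up": {"wall": 0.8, "no-wall": 0.2}, "down": {"wall": 0.6, "no-wall": 0.4}},
        "s2": {"up": {"wall": 0.2, "no-wall": 0.8}, "down": {"wall": 0.4, "no-wall": 0.6}},
    },
}
```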

4
POMDP: Value of Information
  • A robot with a wall sensor starts at one of the locations marked I, each with the same probability.
  • Following a move, it senses the walls around it.
  • By moving up, the observed wall configuration will be the same for both options.
  • By moving down, we get different configurations.

5
Solving a POMDP
  • As in an MDP, because of uncertainty, we need a policy, not a plan.
  • But what does the policy depend on, if we don't know the state?
  • Option 1: History.
  • How much history do we need to remember?
  • How big is our policy?
  • Problem: histories are highly non-uniform and hard to work with.

6
POMDP: Belief State
[History tree: from the initial state s0 with belief b, actions a1, a2 lead to states s1–s4, each followed by observations o1–o6.]
A much harder tree: each node is different from the others, as it is reached by different actions and observations.
The history of observations defines a state.
7
Option 2: Belief State
  • What matters for the future is the current state.
  • We don't know what the current state is.
  • Instead, we can maintain a probability distribution over the current state.
  • This is called the belief state (a minimal sketch follows below).
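As an illustration only (the state names are placeholders matching the two-state example used on the later slides), a belief state can be kept as a dictionary of probabilities that sums to 1:

```python
# A belief state: a probability distribution over S.
belief = {"s1": 0.5, "s2": 0.5}                 # uniform: we cannot tell the states apart yet
assert abs(sum(belief.values()) - 1.0) < 1e-9   # a belief always sums to 1
```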

8
POMDP: Belief State
[Belief tree: from the initial belief b0, actions a1, a2 followed by observations lead to successor beliefs b1–b8; each successor belief is evaluated using the action taken and the observation received.]
  • How do we compute the next belief state?

9
POMDP: Updating the Belief State
  • Let b be the current belief state.
  • We calculate b', the belief state that results from b by applying a and observing o.
  • b(s) is the probability of s according to b.



b'(s') = Pr(s' | b, a, o) = O(s', a, o) · Σ_{s∈S} Tr(s, a, s') · b(s)  /  Pr(o | b, a)
The denominator Pr(o | b, a) is a normalizing factor: ignore it in the calculations, and normalize to 1 later.
Bayes' Rule: Pr(x|y) = Pr(y|x)·Pr(x) / Pr(y), where Pr(x) = Σ_{y∈Y} Pr(x|y)·Pr(y).
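Written as code, the update is a direct transcription of the formula above. This is only a sketch and assumes the hypothetical dictionary layout of the earlier `pomdp` example:

```python
def update_belief(b, a, o, Tr, O, states):
    """b'(s2) ∝ O(s2, a, o) · Σ_s Tr(s, a, s2) · b(s), normalized at the end."""
    new_b = {}
    for s2 in states:
        # probability mass that flows into s2 under action a
        reach = sum(Tr[s][a].get(s2, 0.0) * b[s] for s in states)
        # weight it by the likelihood of the observation actually received
        new_b[s2] = O[s2][a].get(o, 0.0) * reach
    norm = sum(new_b.values())   # this is Pr(o | b, a), the normalizing factor
    if norm == 0.0:
        raise ValueError("observation o is impossible under belief b and action a")
    return {s2: p / norm for s2, p in new_b.items()}
```

For example, `update_belief(belief, "up", "wall", pomdp["Tr"], pomdp["O"], pomdp["S"])` returns the new belief after moving up and seeing a wall in the toy example above.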
10
POMDP → MDP
  • We can reduce the POMDP to an MDP over belief states.
  • States: the belief states.
  • Actions: the same actions.
  • R(b,a) = Σ_{s∈S} b(s)·R(s,a)
  • τ(b,a,b') = Pr(b' | a, b) = Σ_{o∈Ω} Pr(b' | a, o, b)·Pr(o | a, b)
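Both quantities of this reduction can be computed directly from a belief; a sketch, again over the hypothetical dictionary encoding used above:

```python
def belief_reward(b, a, R):
    """R(b, a) = Σ_s b(s)·R(s, a): the expected immediate reward of a under b."""
    return sum(p * R[s][a] for s, p in b.items())

def observation_prob(b, a, o, Tr, O, states):
    """Pr(o | a, b): the chance of observing o after doing a in belief b.
    This is exactly the normalizing factor dropped inside update_belief."""
    return sum(
        O[s2][a].get(o, 0.0) * sum(Tr[s][a].get(s2, 0.0) * b[s] for s in states)
        for s2 in states
    )
```

Pr(b' | a, o, b) itself is deterministic: it is 1 exactly when b' equals the belief returned by update_belief(b, a, o, …) and 0 otherwise, so τ(b, a, b') can be assembled from these two functions.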

11
POMDP: Belief-State MDP's Value Function
  • At every belief state, choose the action that maximizes the value.
  • The best value is v(b) = max_a v_a(b).

12
POMDP: Belief-State MDP
  • For more than one action, v(b) is an average (an expectation over the possible observations).
  • v_n^π(b0) = R(s1, π(s1)) + Σ_{o∈Ω} Pr(o | s1, a) · v_{n-1}^{π/o}((b0)_a^o)

13
POMDP: Belief-State MDP
[Policy tree π: the root applies action a(π); each observation o_1..o_k leads to a sub-policy π/o_1, …, π/o_k.]
  • α_a = ⟨R(s1,a), R(s2,a)⟩ in our example.
  • α_a – a vector of size |S| where each state holds the value of the reward for a.
  • v_a(b) = b·α_a
  • v_π(b) = b·α_π
  • π_1, …, π_n are policies of length m.
  • P = {α_1, …, α_n}
  • π_P(b) = argmax_{α_i∈P} α_i·b,  i∈1..n
  • v_P(b) = max_{α∈P} α·b
  • v_π(b) = Σ_{s∈S} b(s)·r(s,a) + γ·Σ_{o∈Ω} Pr(o | b, a)·v_{π/o}(b_a^o), where γ is the discount factor.

π/o – the policy at the sub-tree of π matching observation o.
Σ_{s∈S} b(s)·r(s,a) – the immediate reward for applying the first action.
b_a^o – the belief state that results from applying a at b and observing o.
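The α-vector machinery boils down to dot products and a max. A minimal sketch; representing each policy by an (action, vector) pair is an assumption made for the example, not something fixed by the slides:

```python
def value_of_belief(b, alphas, states):
    """v_P(b) = max_{α∈P} α·b, plus the action of the maximizing vector (π_P(b))."""
    best_value, best_action = float("-inf"), None
    for action, alpha in alphas:                      # alpha: {state: value}
        v = sum(b[s] * alpha[s] for s in states)      # the dot product b·α
        if v > best_value:
            best_value, best_action = v, action
    return best_value, best_action
```

The depth-1 vectors are just the reward rows, `[(a, {s: pomdp["R"][s][a] for s in pomdp["S"]}) for a in pomdp["A"]]`, matching α_a = ⟨R(s1,a), R(s2,a)⟩ above.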
14
POMDP: Value Iteration for Belief States
  • Init: ∀a∈A, v_1^a(b) = Σ_{s∈S} b(s)·R(s,a).
  • Build a value function for k+1 steps, given the functions for k steps:
  • 1. Let π_1, …, π_n be the possible policies of depth k that are not dominated by the rest of the depth-k policies (i.e., there exists a belief state b for which they are optimal).
  • 2. Build all the policy trees of depth k+1 of the form shown below.
  • 3. Calculate v_π(b) = Σ_{s∈S} b(s)·r(s,a) + γ·Σ_{o∈Ω} Pr(o | b, a)·v_{π/o}(b_a^o) for each of the trees (a code sketch follows the tree).

[Policy tree form: a root action a∈A; each observation o_1..o_k leads to one of the depth-k policies π_{i1}, …, π_{ik}, where i∈1..n.]
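One backward-induction step can be written against that α-vector representation. This is only a sketch of the construction on the slide: it enumerates every tree, omits the pruning of dominated vectors, and reuses the hypothetical `pomdp` dictionary layout from earlier:

```python
from itertools import product

def backup_all(alphas, pomdp, gamma=0.95):
    """Build every depth-(k+1) α-vector from the depth-k vectors: one per root
    action a and per assignment of a depth-k sub-policy to each observation."""
    S, A, Obs = pomdp["S"], pomdp["A"], pomdp["Omega"]
    Tr, R, O = pomdp["Tr"], pomdp["R"], pomdp["O"]
    new_alphas = []
    for a in A:
        # choice[j] = index of the depth-k vector followed after observation Obs[j]
        for choice in product(range(len(alphas)), repeat=len(Obs)):
            alpha = {}
            for s in S:
                future = 0.0
                for s2 in S:
                    p = Tr[s][a].get(s2, 0.0)
                    for o, idx in zip(Obs, choice):
                        future += p * O[s2][a].get(o, 0.0) * alphas[idx][1][s2]
                alpha[s] = R[s][a] + gamma * future
            new_alphas.append((a, alpha))
    return new_alphas   # |A|·n^|Ω| candidate vectors before pruning
```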

15
POMDP: Point-Based Value Iteration
  • Idea: maintain a fixed-size set α_1, …, α_n of vectors. Each vector α_i matches a belief state b_i and an action a(α_i).
  • v(b) = max_{i∈1..n} b·α_i
  • Advantage: the number of vectors is bounded by n.
  • Disadvantage: only an approximation of optimality.

[Figure: two vectors α_1, α_2; the value at a belief b is the maximum of b·α_1 and b·α_2.]
16
POMDPPoint Based Value Iteration
  • The method:
  • The same as value iteration, but initialize the α's to match the optimal actions for b_1, …, b_n.
  • In the iterative part, build all trees for step k+1 given the functions α_1, …, α_n of the k-th step.
  • Use the value functions only from step k.
  • Keep only the new policies, and the matching α_i, that are optimal for some b_i (a sketch follows below).
  • Number of possible trees: |A|·n^|Ω|.
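Following the slide's description literally, one point-based iteration can be sketched on top of backup_all from above (real PBVI implementations back up per belief point instead of enumerating all |A|·n^|Ω| trees; this version favours clarity over efficiency):

```python
def pbvi_step(alphas, beliefs, pomdp, gamma=0.95):
    """Keep, for each fixed belief b_i, only the new vector that is optimal at b_i."""
    candidates = backup_all(alphas, pomdp, gamma)
    kept = []
    for b in beliefs:
        best = max(candidates,
                   key=lambda cand: sum(b[s] * cand[1][s] for s in pomdp["S"]))
        if best not in kept:                 # several b_i may share the same best vector
            kept.append(best)
    return kept                              # at most len(beliefs) vectors survive
```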

17
POMDP: Representing the Policy as an Automaton
  • Automaton + initial state ⇒ policy.
  • The idea:
  • Based on the solution of m vector equations.
  • Match to every state of the automaton a value function, represented by the corresponding α-vector.
  • For every state of the automaton, evaluate the best value function under the assumption that after executing the first action, we continue to one of the value functions evaluated above (an execution sketch follows below).
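To make the automaton idea concrete, here is a sketch of executing a policy given as such an automaton (a policy graph). The `nodes` encoding and the `observe` callback are assumptions made for the example, not part of the slides; acting needs only the current automaton state, and the belief is tracked here just for illustration:

```python
def run_controller(nodes, start, b0, pomdp, steps, observe):
    """nodes: {node: (action, {observation: next_node})}; observe(action) -> observation."""
    node, b = start, dict(b0)
    for _ in range(steps):
        action, transitions = nodes[node]
        o = observe(action)                                     # act, then sense
        b = update_belief(b, action, o,
                          pomdp["Tr"], pomdp["O"], pomdp["S"])  # optional bookkeeping
        node = transitions[o]                                   # follow the automaton edge
    return b
```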