Title: Partially Observable MDP
MDP: Perfect Observation
- Basic assumption: we know the state of the world at each stage; in essence, we have perfect sensors.
- Typically we have imperfect sensors, so we can only obtain partial information about the state.
- When we have imperfect information, we sometimes take actions simply to gain information.
POMDP
- A POMDP is a tuple ⟨S, A, Tr, R, O, O⟩ (a concrete sketch follows the list).
- S: the state space.
- A: the set of actions.
- Tr: S × A → Π(S), the transition function (a probability distribution over S). Tr(s, a, s′) = p is the probability of reaching s′ from s using a (s: the state before, a: the action, s′: the state after).
- R: S × A → ℝ, the reward for doing a ∈ A at state s.
- O: the set of possible observations.
- O: S × A → Π(O), the observation function. O(s, a, o) = p is the probability of observing o ∈ O after performing a in s, or alternatively the probability of observing o ∈ O after doing a and reaching s.
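To make the tuple concrete, here is a minimal Python sketch; the dataclass, the dictionary encodings of Tr, R, and O, and the tiny two-state robot-and-wall example are my own illustrative assumptions, not part of the slides.

```python
from dataclasses import dataclass

@dataclass
class POMDP:
    # Discrete POMDP <S, A, Tr, R, O, O> with the functions stored as dictionaries.
    states: list          # S
    actions: list         # A
    observations: list    # the set of possible observations
    Tr: dict              # Tr[(s, a)][s2] = Pr(s2 | s, a)
    R: dict               # R[(s, a)]      = immediate reward
    O: dict               # O[(s2, a)][o]  = Pr(o | a, s2), observation after reaching s2

# A hypothetical two-state example: the robot is either left or right of a wall.
example = POMDP(
    states=["left", "right"],
    actions=["stay", "move"],
    observations=["wall", "no-wall"],
    Tr={("left", "stay"): {"left": 1.0, "right": 0.0},
        ("right", "stay"): {"left": 0.0, "right": 1.0},
        ("left", "move"): {"left": 0.0, "right": 1.0},
        ("right", "move"): {"left": 1.0, "right": 0.0}},
    R={("left", "stay"): 0.0, ("right", "stay"): 1.0,
       ("left", "move"): -0.1, ("right", "move"): -0.1},
    O={("left", "stay"): {"wall": 0.9, "no-wall": 0.1},
       ("right", "stay"): {"wall": 0.2, "no-wall": 0.8},
       ("left", "move"): {"wall": 0.9, "no-wall": 0.1},
       ("right", "move"): {"wall": 0.2, "no-wall": 0.8}},
)
```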
POMDP: Value of Information
- A robot with a wall sensor starts at one of the locations marked I, each with the same probability.
- After a move, it senses the walls around it.
- By moving up, the observed wall configuration will be the same for both options.
- By moving down, we get different configurations, so the observation reveals which location we started from.
Solving a POMDP
- As in an MDP, because of the uncertainty we need a policy, not a plan.
- But what does the policy depend on if we don't know the state?
- Option 1: the history of actions and observations (see the sketch below).
- How much history do we need to remember?
- How big is our policy?
- Problem: highly non-uniform and hard to work with.
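To see why the history option is unwieldy, here is a small illustrative sketch (hypothetical names, reusing the two-state example): a history-based policy is a table keyed by the whole action/observation sequence, and the number of keys grows exponentially with the horizon.

```python
# A history-based policy: a map from the full action/observation history to the next action.
# With |A| actions and |O| observations there are (|A| * |O|)**t histories of length t,
# so the table we must remember blows up exponentially.
history_policy = {
    (): "move",                          # empty history: the first action
    (("move", "wall"),): "stay",         # after doing "move" and observing "wall"
    (("move", "no-wall"),): "move",      # after doing "move" and observing "no-wall"
    # ... one entry per possible history
}

def act(history):
    return history_policy[tuple(history)]
```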
POMDP: Belief State
[Figure: a tree rooted at the initial state s0 (belief b), branching on the actions a1, a2 and then on the observations o1, ..., o6 into the states s1, ..., s4.]
A much harder tree: each state is different from the others, as it is based on different actions and observations.
The history of observations defines a state.
Option 2: Belief State
- What matters about the future is the current state.
- We don't know what the current state is.
- Instead, we can maintain a probability distribution over the current state.
- This distribution is called the belief state (a small example follows).
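As a tiny illustration (hypothetical numbers, reusing the two-state robot example), a belief state is just a distribution over S that sums to 1:

```python
# A belief state: a probability distribution over the current state.
belief = {"left": 0.5, "right": 0.5}   # the robot is equally unsure which side it is on

assert abs(sum(belief.values()) - 1.0) < 1e-9
```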
POMDP: Belief State
[Figure: a belief-state tree rooted at b0. The actions a1, a2, together with the resulting observations, lead to the belief states b1, ..., b8; each new belief state is evaluated using the action taken and the observations received.]
- How do we compute the next belief state?
POMDP: Updating the Belief State
- Let b be the current belief state.
- We calculate b′, the belief state that results from b by applying a and observing o.
- b(s) is the probability of s according to b.
- By Bayes' rule: b′(s′) = Pr(s′ | b, a, o) = O(s′, a, o) · Σ_{s∈S} Tr(s, a, s′) · b(s) / Pr(o | b, a)
- Pr(o | b, a) is a normalizing factor: ignore it in the calculations and normalize to 1 later (a code sketch of this update follows).
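A minimal sketch of this update, assuming the dictionary-based POMDP structure sketched earlier (Tr[(s, a)][s′], O[(s′, a)][o]); dividing by the total realizes the "normalize to 1 later" step.

```python
def update_belief(pomdp, b, a, o):
    """b'(s') ∝ O(s', a, o) * Σ_s Tr(s, a, s') * b(s); normalized at the end."""
    new_b = {}
    for s2 in pomdp.states:
        reach = sum(pomdp.Tr[(s, a)][s2] * b[s] for s in pomdp.states)
        new_b[s2] = pomdp.O[(s2, a)][o] * reach
    total = sum(new_b.values())          # this is Pr(o | b, a), the normalizing factor
    return {s2: p / total for s2, p in new_b.items()} if total > 0 else new_b

# e.g. update_belief(example, {"left": 0.5, "right": 0.5}, "move", "wall")
```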
POMDP → MDP
- We can reduce the POMDP to an MDP over belief states (a sketch follows the list).
- States: the belief states.
- Actions: the same actions.
- R(b, a) = Σ_{s∈S} b(s) · R(s, a)
- τ(b, a, b′) = Pr(b′ | a, b) = Σ_{o∈O} Pr(b′ | a, o, b) · Pr(o | a, b), where Pr(b′ | a, o, b) is 1 if updating b with a and o yields b′, and 0 otherwise.
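A sketch of the two quantities used in this reduction, R(b, a) and Pr(o | b, a) (the building block of τ), again assuming the dictionary encoding from the earlier sketches.

```python
def belief_reward(pomdp, b, a):
    # R(b, a) = Σ_s b(s) * R(s, a)
    return sum(b[s] * pomdp.R[(s, a)] for s in pomdp.states)

def observation_prob(pomdp, b, a, o):
    # Pr(o | b, a) = Σ_{s'} O(s', a, o) * Σ_s Tr(s, a, s') * b(s)
    return sum(pomdp.O[(s2, a)][o] *
               sum(pomdp.Tr[(s, a)][s2] * b[s] for s in pomdp.states)
               for s2 in pomdp.states)
```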
POMDP: The Belief-State MDP's Value Function
- At every belief state, choose the action that maximizes the value.
- The best value is the maximum over the actions, max_a v_a(b).
POMDP: Belief-State MDP
- For a policy with more than one action, v(b) averages over the possible observations:
- v_n^π = R(s1, π(s1)) + Σ_{o∈O} Pr(o | s1, a) · v_{n−1}^{π/o}(b_a^o)
POMDP: Belief-State MDP
[Figure: a policy tree for π: the root applies the action a(π), and the branches for the observations o1, ..., ok lead to the sub-policies π/o1, ..., π/ok.]
- α_a = ⟨R(s1, a), R(s2, a)⟩ in our example.
- α_a is a vector of size |S| where each entry holds the reward of a at that state.
- v_a(b) = b · α_a
- v_π(b) = b · α_π
- π_1, ..., π_n are policies of length m.
- P = {α_1, ..., α_n}
- π_P(b) = argmax_{α_i ∈ P} α_i · b,  i ∈ 1..n
- v_P(b) = max_{α ∈ P} α · b
- v_π(b) = Σ_{s∈S} b(s) · R(s, a) + γ · Σ_{o∈O} Pr(o | b, a) · v_{π/o}(b_a^o), where a is the first action of π (so the first term is the immediate reward of applying it), π/o is the policy at the subtree of π matching o, and b_a^o is the belief state that results from applying a at b and observing o.
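A sketch of the α-vector machinery above, with beliefs and vectors stored as dictionaries over S and P as a list of (α, action) pairs; the specific vectors and numbers are hypothetical.

```python
def dot(b, alpha):
    # α · b = Σ_s b(s) * α(s)
    return sum(b[s] * alpha[s] for s in b)

def v_P(b, P):
    # v_P(b) = max_{α ∈ P} α · b
    return max(dot(b, alpha) for alpha, _ in P)

def pi_P(b, P):
    # π_P(b): the action a(α) attached to the maximizing vector
    return max(P, key=lambda pair: dot(b, pair[0]))[1]

# P: a list of (α-vector, action) pairs; hypothetical numbers for the two-state example.
P = [({"left": 0.0, "right": 1.0}, "stay"),
     ({"left": 0.4, "right": 0.4}, "move")]
print(pi_P({"left": 0.5, "right": 0.5}, P))   # -> "stay" (0.5 > 0.4)
```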
POMDP: Value Iteration for Belief States
- Initialize: ∀a ∈ A, v_1^a(b) = Σ_{s∈S} b(s) · R(s, a)
- Build the value function for k+1 steps, given the value functions for k steps:
- 1. Let π_1, ..., π_n be the possible policies of depth k that are not dominated by the other depth-k policies (i.e., there exists a belief state b for which they are optimal).
- 2. Build all the policy trees of depth k+1 of the form: an action a ∈ A at the root, with a branch for each observation o1, ..., ok leading to one of the depth-k policies π_{i1}, ..., π_{ik}, where each i_j ∈ 1..n.
- 3. Calculate v_π(b) = Σ_{s∈S} b(s) · R(s, a) + γ · Σ_{o∈O} Pr(o | b, a) · v_{π/o}(b_a^o) for each of the trees.
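A sketch of one such backup step in α-vector form (an equivalent way to enumerate all depth-(k+1) policy trees); Gamma_k is a list of depth-k α-vectors, and the discount γ and the helper names are my own assumptions.

```python
from itertools import product

def backup_all(pomdp, Gamma_k, gamma=0.95):
    """All candidate α-vectors of depth k+1 (one per policy tree) from the depth-k vectors."""
    Gamma_next = []
    for a in pomdp.actions:
        # g[o][j](s) = Σ_{s'} Tr(s, a, s') * O(s', a, o) * α_j(s'): α_j projected back through (a, o).
        g = {o: [{s: sum(pomdp.Tr[(s, a)][s2] * pomdp.O[(s2, a)][o] * alpha[s2]
                         for s2 in pomdp.states)
                  for s in pomdp.states}
                 for alpha in Gamma_k]
             for o in pomdp.observations}
        # Each way of choosing one depth-k vector per observation is one depth-(k+1) policy tree.
        for choice in product(range(len(Gamma_k)), repeat=len(pomdp.observations)):
            alpha_new = {s: pomdp.R[(s, a)]
                            + gamma * sum(g[o][j][s] for o, j in zip(pomdp.observations, choice))
                         for s in pomdp.states}
            Gamma_next.append((alpha_new, a))
    return Gamma_next   # |A| * |Gamma_k|^|O| candidates; dominated ones can then be pruned
```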
POMDP: Point-Based Value Iteration
- Idea: maintain a fixed-size set {α_1, ..., α_n} of vectors. Each vector α_i matches a belief state b_i and an action a(α_i).
- v(b) = max_{i∈1..n} b · α_i
- Advantage: the number of vectors is bounded by n.
- Disadvantage: only an approximation of the optimal value function.
[Figure: the α-vectors α1 and α2 over the belief space.]
POMDP: Point-Based Value Iteration
- The method (a sketch follows the list):
- Same as in value iteration, but initialize the α's to match the optimal actions for b_1, ..., b_n.
- In the iterative part, build all the trees for step k+1 given the functions α_1, ..., α_n of the k-th step.
- Use the value function only from step k.
- Keep only the new policies, and the matching α_i, that are optimal for b_i.
- Number of possible trees: |A| · n^|O|.
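A sketch of the point-based variant of the backup: for each belief point b_i we keep only the best projected vector per observation and the best action, so the set size stays fixed at the number of belief points; the helper names follow the earlier sketches.

```python
def pbvi_backup(pomdp, beliefs, Gamma_k, gamma=0.95):
    """Point-based backup: keep exactly one new (α, action) pair per belief point."""
    new_set = []
    for b in beliefs:
        best = None
        for a in pomdp.actions:
            alpha_a = {s: pomdp.R[(s, a)] for s in pomdp.states}
            for o in pomdp.observations:
                # Project every depth-k vector through (a, o) and keep the one that is best at b.
                projected = [{s: sum(pomdp.Tr[(s, a)][s2] * pomdp.O[(s2, a)][o] * alpha[s2]
                                     for s2 in pomdp.states)
                              for s in pomdp.states}
                             for alpha in Gamma_k]
                best_o = max(projected, key=lambda g: sum(b[s] * g[s] for s in b))
                for s in pomdp.states:
                    alpha_a[s] += gamma * best_o[s]
            value_a = sum(b[s] * alpha_a[s] for s in b)
            if best is None or value_a > best[2]:
                best = (alpha_a, a, value_a)
        new_set.append((best[0], best[1]))
    return new_set   # exactly len(beliefs) vectors, matching b1, ..., bn
```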
POMDP: Representing the Policy as an Automaton
- An automaton together with an initial state defines a policy.
- The idea:
- Based on the solution of m vector equations.
- To every state of the automaton, match a value function represented by the corresponding α-vector.
- For every state of the automaton, evaluate the best value function under the assumption that after executing the first action, we continue to one of the value functions evaluated above.
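A sketch of the automaton (finite-state controller) representation: each automaton state carries an action, and the observation received selects the next state; the environment hook env_step and the example controller below are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ControllerNode:
    action: str                                     # the action executed in this automaton state
    next_node: dict = field(default_factory=dict)   # observation -> index of the next node

def run_controller(env_step, nodes, start=0, horizon=20):
    """Run the policy automaton: execute the node's action, read the observation, follow the edge."""
    node, total = start, 0.0
    for _ in range(horizon):
        reward, obs = env_step(nodes[node].action)  # env_step is a hypothetical environment hook
        total += reward
        node = nodes[node].next_node[obs]
    return total

# e.g. a two-node controller for the wall example: move until a wall is seen, then stay.
nodes = [ControllerNode("move", {"wall": 1, "no-wall": 0}),
         ControllerNode("stay", {"wall": 1, "no-wall": 1})]
```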