1
Partially Observable MDP
2
MDP: Perfect Observation
  • Basic assumption: we know the state of the world at each stage.
  • In essence, we have perfect sensors.
  • Typically we have imperfect sensors ⇒ we can only have partial information about the state.
  • When we have imperfect information, we sometimes take actions simply to gain information.

3
POMDP
  • ⟨S, A, Tr, R, Ω, O⟩
  • S – the state space.
  • A – the set of actions.
  • Tr : S×A → Π(S) – a probability distribution over the state space.
  • Tr(s, a, s') = p – the probability of reaching s' from s using a (s is the state before, a the action, s' the state after).
  • R : S×A → ℝ – the reward for doing a∈A in state s∈S.
  • Ω – the set of possible observations.
  • O : S×A → Π(Ω).
  • O(s', a, o) = p – the probability of observing o∈Ω after performing a in s, or the probability of observing o∈Ω after doing a and reaching s' (depending on the convention).
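To make the tuple concrete, here is a minimal, purely illustrative encoding as plain Python dictionaries; the state, action and observation names and all probabilities are invented for this sketch and are not from the slides:

```python
# A tiny two-state POMDP, encoded with nested dicts (hypothetical numbers).
pomdp = {
    "S": ["s1", "s2"],                       # states
    "A": ["up", "down"],                     # actions
    "Omega": ["wall", "no-wall"],            # possible observations
    # Tr[s][a][s2] = probability of reaching s2 from s using a
    "Tr": {
        "s1": {"up": {"s1": 0.9, "s2": 0.1}, "down": {"s1": 0.2, "s2": 0.8}},
        "s2": {"up": {"s1": 0.1, "s2": 0.9}, "down": {"s1": 0.8, "s2": 0.2}},
    },
    # R[s][a] = immediate reward for doing a in s
    "R": {"s1": {"up": 0.0, "down": 1.0}, "s2": {"up": 1.0, "down": 0.0}},
    # O[s2][a][o] = probability of observing o after doing a and reaching s2
    "O": {
        "s1": {"up": {"wall": 0.8, "no-wall": 0.2}, "down": {"wall": 0.6, "no-wall": 0.4}},
        "s2": {"up": {"wall": 0.2, "no-wall": 0.8}, "down": {"wall": 0.4, "no-wall": 0.6}},
    },
}
```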

4
POMDP: Value of Information
  • A robot with a wall sensor starts at one of the locations marked I, each with the same probability.
  • Following a move, it senses the walls around it.
  • By moving up, the observed wall configuration will be the same for both options.
  • By moving down, we get different configurations.

5
Solving a POMDP
  • As in an MDP, because of uncertainty, we need a policy, not a plan.
  • But what does the policy depend on, if we don't know the state?
  • Option 1: History.
  • How much history do we need to remember?
  • How big is our policy?
  • Problem: histories are highly non-uniform and hard to work with.

6
POMDP: Belief State
[History tree: from the initial state s0 with belief b, actions a1, a2 lead to states s1–s4, each followed by observations o1–o6.]
A much harder tree: each node is different from the others, as it is reached by different actions and observations.
The history of observations defines a state.
7
Option 2: Belief State
  • What matters for the future is the current state.
  • We don't know what the current state is.
  • Instead, we can maintain a probability distribution over the current state.
  • This is called the belief state (a minimal sketch follows below).
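As an illustration only (the state names are placeholders matching the two-state example used on the later slides), a belief state can be kept as a dictionary of probabilities that sums to 1:

```python
# A belief state: a probability distribution over S.
belief = {"s1": 0.5, "s2": 0.5}                 # uniform: we cannot tell the states apart yet
assert abs(sum(belief.values()) - 1.0) < 1e-9   # a belief always sums to 1
```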

8
POMDP: Belief State
[Belief tree: from the initial belief b0, actions a1, a2 followed by observations lead to successor beliefs b1–b8; each successor belief is evaluated using the action taken and the observation received.]
  • How do we compute the next belief state?

9
POMDP: Updating the Belief State
  • Let b be the current belief state.
  • We calculate b', the belief state that results from b by applying a and observing o.
  • b(s) is the probability of s according to b.



b'(s') = Pr(s' | b, a, o) = O(s', a, o) · Σ_{s∈S} Tr(s, a, s') · b(s)  /  Pr(o | b, a)
The denominator Pr(o | b, a) is a normalizing factor: ignore it in the calculations, and normalize to 1 later.
Bayes' Rule: Pr(x|y) = Pr(y|x)·Pr(x) / Pr(y), where Pr(x) = Σ_{y∈Y} Pr(x|y)·Pr(y).
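Written as code, the update is a direct transcription of the formula above. This is only a sketch and assumes the hypothetical dictionary layout of the earlier `pomdp` example:

```python
def update_belief(b, a, o, Tr, O, states):
    """b'(s2) ∝ O(s2, a, o) · Σ_s Tr(s, a, s2) · b(s), normalized at the end."""
    new_b = {}
    for s2 in states:
        # probability mass that flows into s2 under action a
        reach = sum(Tr[s][a].get(s2, 0.0) * b[s] for s in states)
        # weight it by the likelihood of the observation actually received
        new_b[s2] = O[s2][a].get(o, 0.0) * reach
    norm = sum(new_b.values())   # this is Pr(o | b, a), the normalizing factor
    if norm == 0.0:
        raise ValueError("observation o is impossible under belief b and action a")
    return {s2: p / norm for s2, p in new_b.items()}
```

For example, `update_belief(belief, "up", "wall", pomdp["Tr"], pomdp["O"], pomdp["S"])` returns the new belief after moving up and seeing a wall in the toy example above.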
10
POMDP → MDP
  • We can reduce the POMDP to an MDP over belief states.
  • States: the belief states.
  • Actions: the same actions.
  • R(b,a) = Σ_{s∈S} b(s)·R(s,a)
  • τ(b,a,b') = Pr(b' | a, b) = Σ_{o∈Ω} Pr(b' | a, o, b)·Pr(o | a, b)
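Both quantities of this reduction can be computed directly from a belief; a sketch, again over the hypothetical dictionary encoding used above:

```python
def belief_reward(b, a, R):
    """R(b, a) = Σ_s b(s)·R(s, a): the expected immediate reward of a under b."""
    return sum(p * R[s][a] for s, p in b.items())

def observation_prob(b, a, o, Tr, O, states):
    """Pr(o | a, b): the chance of observing o after doing a in belief b.
    This is exactly the normalizing factor dropped inside update_belief."""
    return sum(
        O[s2][a].get(o, 0.0) * sum(Tr[s][a].get(s2, 0.0) * b[s] for s in states)
        for s2 in states
    )
```

Pr(b' | a, o, b) itself is deterministic: it is 1 exactly when b' equals the belief returned by update_belief(b, a, o, …) and 0 otherwise, so τ(b, a, b') can be assembled from these two functions.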

11
POMDP: Belief-State MDP's Value Function
  • At every belief state, choose the action that maximizes the value.
  • The best value is v(b) = max_a v_a(b).

12
POMDP: Belief-State MDP
  • For more than one action, v(b) is an average (an expectation over the possible observations).
  • v_n^π(b0) = R(s1, π(s1)) + Σ_{o∈Ω} Pr(o | s1, a) · v_{n-1}^{π/o}((b0)_a^o)

13
POMDP: Belief-State MDP
[Policy tree π: the root applies action a(π); each observation o_1..o_k leads to a sub-policy π/o_1, …, π/o_k.]
  • α_a = ⟨R(s1,a), R(s2,a)⟩ in our example.
  • α_a – a vector of size |S| where each state holds the value of the reward for a.
  • v_a(b) = b·α_a
  • v_π(b) = b·α_π
  • π_1, …, π_n are policies of length m.
  • P = {α_1, …, α_n}
  • π_P(b) = argmax_{α_i∈P} α_i·b,  i∈1..n
  • v_P(b) = max_{α∈P} α·b
  • v_π(b) = Σ_{s∈S} b(s)·r(s,a) + γ·Σ_{o∈Ω} Pr(o | b, a)·v_{π/o}(b_a^o), where γ is the discount factor.

π/o – the policy at the sub-tree of π matching observation o.
Σ_{s∈S} b(s)·r(s,a) – the immediate reward for applying the first action.
b_a^o – the belief state that results from applying a at b and observing o.
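The α-vector machinery boils down to dot products and a max. A minimal sketch; representing each policy by an (action, vector) pair is an assumption made for the example, not something fixed by the slides:

```python
def value_of_belief(b, alphas, states):
    """v_P(b) = max_{α∈P} α·b, plus the action of the maximizing vector (π_P(b))."""
    best_value, best_action = float("-inf"), None
    for action, alpha in alphas:                      # alpha: {state: value}
        v = sum(b[s] * alpha[s] for s in states)      # the dot product b·α
        if v > best_value:
            best_value, best_action = v, action
    return best_value, best_action
```

The depth-1 vectors are just the reward rows, `[(a, {s: pomdp["R"][s][a] for s in pomdp["S"]}) for a in pomdp["A"]]`, matching α_a = ⟨R(s1,a), R(s2,a)⟩ above.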
14
POMDP: Value Iteration for Belief States
  • Init: ∀a∈A, v_1^a(b) = Σ_{s∈S} b(s)·R(s,a).
  • Build a value function for k+1 steps, given the functions for k steps:
  • 1. Let π_1, …, π_n be the possible policies of depth k that are not dominated by the rest of the depth-k policies (i.e., there exists a belief state b for which they are optimal).
  • 2. Build all the policy trees of depth k+1 of the form shown below.
  • 3. Calculate v_π(b) = Σ_{s∈S} b(s)·r(s,a) + γ·Σ_{o∈Ω} Pr(o | b, a)·v_{π/o}(b_a^o) for each of the trees (a code sketch follows the tree).

[Policy tree form: a root action a∈A; each observation o_1..o_k leads to one of the depth-k policies π_{i1}, …, π_{ik}, where i∈1..n.]
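One backward-induction step can be written against that α-vector representation. This is only a sketch of the construction on the slide: it enumerates every tree, omits the pruning of dominated vectors, and reuses the hypothetical `pomdp` dictionary layout from earlier:

```python
from itertools import product

def backup_all(alphas, pomdp, gamma=0.95):
    """Build every depth-(k+1) α-vector from the depth-k vectors: one per root
    action a and per assignment of a depth-k sub-policy to each observation."""
    S, A, Obs = pomdp["S"], pomdp["A"], pomdp["Omega"]
    Tr, R, O = pomdp["Tr"], pomdp["R"], pomdp["O"]
    new_alphas = []
    for a in A:
        # choice[j] = index of the depth-k vector followed after observation Obs[j]
        for choice in product(range(len(alphas)), repeat=len(Obs)):
            alpha = {}
            for s in S:
                future = 0.0
                for s2 in S:
                    p = Tr[s][a].get(s2, 0.0)
                    for o, idx in zip(Obs, choice):
                        future += p * O[s2][a].get(o, 0.0) * alphas[idx][1][s2]
                alpha[s] = R[s][a] + gamma * future
            new_alphas.append((a, alpha))
    return new_alphas   # |A|·n^|Ω| candidate vectors before pruning
```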

15
POMDP: Point-Based Value Iteration
  • Idea: maintain a fixed-size set α_1, …, α_n of vectors. Each vector α_i matches a belief state b_i and an action a(α_i).
  • v(b) = max_{i∈1..n} b·α_i
  • Advantage: the number of vectors is bounded by n.
  • Disadvantage: only an approximation of optimality.

[Figure: two vectors α_1, α_2; the value at a belief b is the maximum of b·α_1 and b·α_2.]
16
POMDPPoint Based Value Iteration
  • The method:
  • The same as value iteration, but initialize the α's to match the optimal actions for b_1, …, b_n.
  • In the iterative part, build all trees for step k+1 given the functions α_1, …, α_n of the k-th step.
  • Use the value functions only from step k.
  • Keep only the new policies, and the matching α_i, that are optimal for some b_i (a sketch follows below).
  • Number of possible trees: |A|·n^|Ω|.
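Following the slide's description literally, one point-based iteration can be sketched on top of backup_all from above (real PBVI implementations back up per belief point instead of enumerating all |A|·n^|Ω| trees; this version favours clarity over efficiency):

```python
def pbvi_step(alphas, beliefs, pomdp, gamma=0.95):
    """Keep, for each fixed belief b_i, only the new vector that is optimal at b_i."""
    candidates = backup_all(alphas, pomdp, gamma)
    kept = []
    for b in beliefs:
        best = max(candidates,
                   key=lambda cand: sum(b[s] * cand[1][s] for s in pomdp["S"]))
        if best not in kept:                 # several b_i may share the same best vector
            kept.append(best)
    return kept                              # at most len(beliefs) vectors survive
```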

17
POMDP: Representing the Policy as an Automaton
  • Automaton + initial state ⇒ policy.
  • The idea:
  • Based on the solution of m vector equations.
  • Match to every state of the automaton a value function, represented by the corresponding α-vector.
  • For every state of the automaton, evaluate the best value function under the assumption that after executing the first action, we continue to one of the value functions evaluated above (an execution sketch follows below).
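To make the automaton idea concrete, here is a sketch of executing a policy given as such an automaton (a policy graph). The `nodes` encoding and the `observe` callback are assumptions made for the example, not part of the slides; acting needs only the current automaton state, and the belief is tracked here just for illustration:

```python
def run_controller(nodes, start, b0, pomdp, steps, observe):
    """nodes: {node: (action, {observation: next_node})}; observe(action) -> observation."""
    node, b = start, dict(b0)
    for _ in range(steps):
        action, transitions = nodes[node]
        o = observe(action)                                     # act, then sense
        b = update_belief(b, action, o,
                          pomdp["Tr"], pomdp["O"], pomdp["S"])  # optional bookkeeping
        node = transitions[o]                                   # follow the automaton edge
    return b
```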