Active Learning in POMDPs

1
Active Learning in POMDPs
  • Robin JAULMES
  • Supervisors: Doina PRECUP and Joelle PINEAU
  • McGill University
  • rjaulm@cs.mcgill.ca

2
Outline
  • 1) Partially Observable Markov Decision Processes
    (POMDPs)
  • 2) Active Learning in POMDPs
  • 3) The MEDUSA algorithm.

3
Markov Decision Processes (MDPs)
  • A Markov Decision Process has:
  • States S
  • Actions A
  • Probabilistic transitions P(s'|s,a)
  • Immediate rewards R(s,a)
  • A discount factor γ
  • The current state is always perfectly observed.

4
Partially Observable Markov Decision Processes
(POMDPs)
  • A POMDP has:
  • States S
  • Actions A
  • Probabilistic transitions T(s,a,s')
  • Immediate rewards R(s,a)
  • A discount factor γ
  • Observations Z
  • Observation probabilities O(s',a,z)
  • An initial belief b0

5
Applications of POMDPs
  • The ability to model environments in which the
    state is not fully observed enables applications
    in:
  • Dialogue management
  • Vision
  • Robot navigation
  • High-level control of robots
  • Medical diagnosis
  • Network maintenance

6
A POMDP example: The Tiger Problem
7
The Tiger Problem
  • Description:
  • 2 states: Tiger_Left, Tiger_Right
  • 3 actions: Listen, Open_Left, Open_Right
  • 2 observations: Hear_Left, Hear_Right
  • Rewards:
  • -1 for the Listen action
  • -100 for Open_Left in the Tiger_Left state
  • +10 for Open_Right in the Tiger_Left state

8
The Tiger Problem
  • Furthermore:
  • The Listen action does not change the state.
  • The open actions put the tiger behind either door
    with 50% probability.
  • The open actions lead to a useless observation
    (50% Hear_Left, 50% Hear_Right).
  • The Listen action gives the correct information
    85% of the time.

9
Solving a POMDP
  • To solve a POMDP is to find, for any
    action/observation history, the action that
    maximizes the expected discounted reward

10
The belief state
  • Instead of maintaining the complete
    action/observation history, we maintain a belief
    state b.
  • The belief is a probability distribution over the
    states. Dim(b) = |S| - 1.

11
The belief space
Here is a representation of the belief space when
we have two states (s0,s1)
12
The belief space
Here is a representation of the belief space when
we have three states (s0,s1,s2)
13
The belief space
Here is a representation of the belief space when
we have four states (s0,s1,s2,s3)
14
The belief space
  • The belief space is continuous but we only visit
    a countable number of belief points.

15
The Bayesian update
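The update equation on the original slide is an image; below is a minimal sketch of the standard Bayesian belief update, assuming array layouts T[s, a, s'] = P(s'|s, a) and O[s', a, z] = P(z|s', a).

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """One Bayesian belief update: b'(s') is proportional to
    O(z|s',a) * sum_s T(s'|s,a) b(s).
    Assumed layouts: T[s, a, s2] = P(s2|s, a), O[s2, a, z] = P(z|s2, a)."""
    predicted = b @ T[:, a, :]             # predicted distribution over next states
    unnormalized = O[:, a, z] * predicted  # weight by the observation likelihood
    return unnormalized / unnormalized.sum()
```
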
16
Value Function in POMDPs
  • We will compute the value function over the
    belief space.
  • Hard: the belief space is continuous!
  • But we can use a property of the optimal value
    function for a finite horizon: it is
    piecewise-linear and convex.
  • We can represent any finite-horizon solution by a
    finite set of alpha-vectors.
  • V(b) = max_α Σ_s α(s) b(s)

17
Alpha-Vectors
  • They are a set of hyperplanes which define the
    value function. At each belief point the value
    function is equal to the hyperplane with the
    highest value (see the sketch below).
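
A minimal sketch of evaluating this value function at a belief point (the array names are assumptions, not from the original slides):

```python
import numpy as np

def value_at_belief(b, alpha_vectors):
    """V(b) = max over alpha-vectors of sum_s alpha(s) b(s).
    alpha_vectors: array of shape (num_vectors, |S|); b: array of shape (|S|,)."""
    return np.max(alpha_vectors @ b)
```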

18
Value Iteration in POMDPs
  • Value iteration:
  • Initialize the value function (horizon-1 value):
  • V(b) = max_a Σ_s R(s,a) b(s)
  • This produces 1 alpha-vector per action.
  • Compute the value function at the next iteration
    using Bellman's equation:
  • V(b) = max_a [ Σ_s R(s,a) b(s)
    + γ Σ_z max_α Σ_s Σ_s' T(s,a,s') O(s',a,z) α(s') b(s) ]

19
PBVI Point-Based Value Iteration
  • Always keep a bounded number of alpha vectors.
  • Use value iteration starting from belief points
    on a grid to produce new sets of alpha-vectors
    (the backup is sketched below).
  • Stop after n steps (finite horizon).
  • The solution is approximate but found in a
    reasonable amount of time and memory.
  • Good tradeoff between computation time and
    quality
  • See Pineau et al., 2003
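
A minimal sketch of one point-based backup in the spirit of PBVI, assuming the array layouts T[s, a, s'], O[s', a, z], R[s, a] used above; it follows the standard construction rather than the authors' exact implementation.

```python
import numpy as np

def point_based_backup(beliefs, alpha_vectors, T, O, R, gamma):
    """One point-based backup: produce one new alpha-vector per belief point.
    Assumed layouts: T[s, a, s2], O[s2, a, z], R[s, a]."""
    S, A, _ = T.shape
    Z = O.shape[2]
    new_vectors = []
    for b in beliefs:
        best_value, best_vector = -np.inf, None
        for a in range(A):
            vec = R[:, a].copy()
            for z in range(Z):
                # g[i, s] = sum_{s2} T[s,a,s2] O[s2,a,z] alpha_i(s2)
                g = np.array([T[:, a, :] @ (O[:, a, z] * alpha)
                              for alpha in alpha_vectors])
                # keep the old alpha-vector that is best at this belief point
                vec += gamma * g[np.argmax(g @ b)]
            value = vec @ b
            if value > best_value:
                best_value, best_vector = value, vec
        new_vectors.append(best_vector)
    return np.array(new_vectors)
```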

20
Learning a POMDP
  • What happens if we don't know the model of the
    POMDP for sure?
  • We have to learn it.
  • The two solutions in the literature are:
  • EM-based approaches (prone to local minima)
  • History-based approaches (requiring on the order
    of 1,000,000 samples for 2-state problems)
    [Singh et al., 2003]

21
Active Learning
  • In an Active Learning problem the learner has the
    ability to influence its training data.
  • The learner asks for the data that is most useful
    given its current knowledge.
  • Methods to find the most useful query were
    presented by Cohn et al. (1995).

22
Active Learning (Cohn et al., 1995)
  • Their method, used for function approximation
    tasks, is based on finding the query that
    minimizes the estimated variance of the learner.
  • They showed how this could be done exactly:
  • For a mixture of Gaussians model.
  • For locally weighted regression.

23
Applying Active Learning to POMDPs
  • We will suppose in this work that we have an
    oracle to determine the hidden state of a system
    on request.
  • However, this action is costly and we want to use
    it as little as possible.
  • In this setting, the active learning query will
    be to ask for the hidden state.

24
Applying Active Learning to POMDPs
  • We propose two solutions:
  • Integrate the model uncertainty and the query
    possibility inside the POMDP framework to take
    advantage of existing algorithms.
  • The MEDUSA algorithm. It uses a Dirichlet
    distribution over possible models to determine
    which actions to take and which queries to ask.

25
Decision-Theoretic Model Learning
  • We want to integrate into the POMDP model the
    fact that:
  • We have only a rough estimation of its
    parameters.
  • The agent can query the hidden state.
  • These queries should not be used too often, and
    only used to learn.

26
Decision-Theoretic Model Learning
  • So we modify our POMDP:
  • For each uncertain parameter we introduce an
    additional state feature, discretized into n
    levels. With k uncertain parameters this
    multiplies the number of states by n^k.
  • At initialization the belief is uniformly
    distributed among these n groups of states, and
    the group does not change as transitions occur.
  • We introduce a query action that returns the
    hidden state.
  • This action is attached to a negative reward Rq.
  • Then we solve this new POMDP using the usual
    methods.

27
Decision-Theoretic Model Learning
28
D-T Planning Results
29
DT-Planning Conclusions
  • Theoretically sound, but:
  • The results are very sensitive to the value of
    the query penalty, which is therefore very
    difficult to set.
  • The number of states becomes exponential in the
    number of uncertain parameters! This greatly
    increases the complexity of the problem.
  • With MEDUSA, we give up the theoretical
    guarantees of optimality to obtain a tractable
    algorithm.

30
MEDUSA The main ideas
  • Markovian Exploration with Decision based on the
    Use of Samples Algorithm
  • Use Dirichlet distributions to represent current
    knowledge about the parameters of the POMDP
    model.
  • Sample models from the distribution.
  • Use models to take actions that could be good.
  • Use queries to improve current knowledge.

31
Dirichlet distributions
  • Let X ∈ {1, 2, ..., N}. X is drawn from a
    multinomial distribution with parameters
    (θ1, ..., θN) iff P(X = i) = θi.
  • The Dirichlet distribution is a distribution over
    multinomial distribution parameters (over
    (θ1, ..., θN) tuples such that θi > 0 and
    Σi θi = 1).

32
Dirichlet distributions
  • Dirichlet distributions have parameters
    <α1, ..., αN> s.t. αi > 0.
  • We can sample from Dirichlet distributions by
    using Gamma distributions (see the sketch below).
  • The most likely parameters in a Dirichlet
    distribution (its mode) are given below.
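
The formula on the original slide is an image; for αi > 1 the most likely parameters (the mode of the Dirichlet) are

    θi* = (αi − 1) / (Σj αj − N)

A minimal sampling sketch via the Gamma method (NumPy, with assumed argument names):

```python
import numpy as np

def sample_dirichlet(alpha, rng=None):
    """Draw one multinomial parameter vector from Dirichlet(alpha):
    sample x_i ~ Gamma(alpha_i, 1) independently and normalize."""
    rng = rng or np.random.default_rng()
    x = rng.gamma(shape=np.asarray(alpha, dtype=float), scale=1.0)
    return x / x.sum()
```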

33
Dirichlet distributions
  • We can also compute the probability of
    multinomial distribution parameters according to
    the Dirichlet.
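
The expression on the original slide is an image; the standard Dirichlet density over (θ1, ..., θN) is

    P(θ1, ..., θN) = (1 / B(α)) · Πi θi^(αi − 1),   where B(α) = Πi Γ(αi) / Γ(Σi αi)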

34
The MEDUSA algorithm
  • Step 1: initialize the Dirichlet distribution.
  • Step 2: sample k (= 20) POMDPs from the Dirichlet
    distribution and compute their probabilities
    according to the Dirichlet. Normalize them to get
    the weights.
  • Step 3: solve the k POMDPs with an approximate
    method (PBVI, finite horizon).

35
The MEDUSA algorithm
  • Step 4: run the experiment.
  • At each time step:
  • Compute the optimal actions for all POMDPs.
  • Execute one of them.
  • Update the belief for each POMDP.
  • If some conditions are met, do a state query.
    Update the Dirichlet parameters according to this
    query.
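
A minimal sketch of one plausible form of this Dirichlet update, assuming the query reveals the hidden state before and after the action so the experienced transition and observation can be counted (array names and the learning rate lam are assumptions):

```python
def query_update(alpha_T, alpha_O, s_prev, a, s, z, lam=1.0):
    """Increment, by a learning rate lam, the alpha-parameters of the
    transition and observation that were actually experienced.
    Assumed layouts (NumPy arrays): alpha_T[s, a, s2] parametrizes T(s2|s, a),
    alpha_O[s2, a, z] parametrizes O(z|s2, a)."""
    alpha_T[s_prev, a, s] += lam
    alpha_O[s, a, z] += lam
```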

36
The MEDUSA algorithm
  • At each time step:
  • Recompute the POMDP weights.
  • At fixed intervals, erase the POMDP with the
    lowest weight and redraw another one from the
    current Dirichlet distribution.
  • Compute the belief of the new POMDP according to
    the action-observation history up to the current
    time.

37
Theoretical analysis
  • We can compute the policy to which MEDUSA
    converges with an infinite number of models using
    integrals over the whole space of models.
  • Under some assumptions on the POMDP, we can
    prove that MEDUSA converges to the true model.

38
MEDUSA on Tiger
Evolution of mean discounted reward with time
steps (query at every step)
39
Diminishing the complexity
  • The algorithm is flexible: we can use a wide
    variety of priors.
  • Some parameters may be certain. Parameters can
    also be made dependent (by using the same alpha
    parameters for different distributions).
  • So if we have additional information about the
    POMDP's dynamics, we can diminish the number of
    alpha-parameters.

40
Diminishing the complexity
  • On the Tiger problem:
  • If we know that:
  • The Listen action does not change the state,
  • The problem is symmetric,
  • Opening a door brings an uninformative
    observation and puts the tiger back behind each
    door with 0.5 probability,
  • then we can diminish the number of
    alpha-parameters from 24 to 2.

41
MEDUSA on simplified Tiger
Evolution of mean discounted reward with time
steps (query at every step). Blue: normal problem;
black: simplified problem.
42
Learning without query
  • The alternate belief β keeps track of the
    knowledge brought by the last query.
  • The non-query update updates each parameter
    proportionally to the probability a query would
    have had of updating it, given the alternate
    belief and the last action/observation (see the
    sketch below).
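
A minimal sketch of a non-query update of this form, assuming the same array layouts as before; this is one plausible reading of the rule, not the authors' exact equations (which appear on slide 67).

```python
import numpy as np

def non_query_update(alpha_T, alpha_O, beta, a, z, T, O, lam=0.01):
    """Increment each alpha-parameter in proportion to the probability, under
    the alternate belief beta, that a query would have revealed the
    corresponding (s, s') transition given the last action a and observation z."""
    # joint[s, s2] = beta(s) * T(s2|s,a) * O(z|s2,a), normalized
    joint = beta[:, None] * T[:, a, :] * O[:, a, z][None, :]
    joint /= joint.sum()
    alpha_T[:, a, :] += lam * joint                # transition counts
    alpha_O[:, a, z] += lam * joint.sum(axis=0)    # observation counts over s'
```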

43
Learning without query
  • Non-query learning:
  • Has high variance: the learning rate needs to be
    lower, therefore more time steps are needed.
  • Is prone to local minima: convergence to the
    correct values is not guaranteed.
  • Can converge to the right solution if the initial
    prior is good enough.
  • MEDUSA should use non-query learning only once it
    has done enough query learning.

44
Choosing when to query
  • There are different heuristics to choose when to
    do a query:
  • Always (up to a certain number of queries).
  • When the models disagree.
  • When the value functions of the models differ.
  • When the beliefs in the different models differ.
  • When information from the last query has been
    lost.
  • Not when a query would bring no information.

46
Non-Query Learning on Tiger
Mean discounted reward and number of queries.
Blue: query learning; black: non-query learning.
47
Picking actions during learning
  • Take one model and do its best action.
  • Consider every model and every action, and do the
    action with the highest overall value.
  • Compute the mean value of every action and
    probabilistically pick one of them according to
    the Boltzmann method (see the sketch below).
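
A minimal sketch of the third option, Boltzmann (softmax) action selection over model-averaged action values (array names and the temperature parameter are assumptions):

```python
import numpy as np

def boltzmann_action(q_values, weights, temperature=1.0, rng=None):
    """Average each action's value over the weighted models, then sample
    an action with Boltzmann (softmax) probabilities.
    q_values: array (num_models, num_actions); weights: array (num_models,)."""
    rng = rng or np.random.default_rng()
    mean_q = np.asarray(weights) @ np.asarray(q_values)
    prefs = np.exp((mean_q - mean_q.max()) / temperature)
    probs = prefs / prefs.sum()
    return rng.choice(len(probs), p=probs)
```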

49
Different action-pickings on Tiger
Evolution of mean discounted reward with time
steps (query at every step). Blue: highest overall
value; black: pick one model.
50
Learning with non-stationary models
  • If the parameters of the model change
    unpredictably with time:
  • At every time step, decay the alpha-parameters by
    some factor ν so that new experience weighs more
    than old experience (see the sketch below).
  • If the parameters do not vary too much, non-query
    learning is sufficient to keep track of their
    evolution.
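
A minimal sketch of this decay step, with an assumed decay factor name nu:

```python
import numpy as np

def decay_alphas(alpha_T, alpha_O, nu=0.99):
    """Multiply every Dirichlet alpha-parameter by a decay factor nu so that
    recent experience outweighs old experience."""
    return nu * np.asarray(alpha_T), nu * np.asarray(alpha_O)
```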

51
Non-stationary Tiger problem: change in p
(probability of a correct observation with the
Listen action) at time 0.
Evolution of mean discounted reward with time
steps.
52
The Hallway problem
  • 60 states
  • 5 actions
  • 17 observations
  • With a reasonable prior, the number of
    alpha-parameters is 84.

53
Hallway reward evolution
Evolution of mean discounted reward with time
steps
54
Hallway query number
Evolution of number of queries with time steps
55
Benchmark POMDPs
56
MEDUSA and Robotics
  • We have interfaced MEDUSA with a robotic
    simulator (Carmen) which can be used to simulate
    POMDP learning problems.
  • We present experimental results on the HIDE
    problem.

57
The HIDE problem
  • The robot is trying to capture a moving agent on
    the following map.
  • The movements of the robot are deterministic, but
    the behavior of the person is unknown and is
    learned during execution.
  • The problem is formulated as a POMDP with 362
    states, 22 observations and 5 actions. To model
    the behavior of the moving agent we learn 52
    alpha-parameters.

58
Results on the HIDE problem
Evolution of mean discounted reward with time
steps
59
Results on the HIDE problem
Evolution of number of queries with time steps
60
Conclusion
  • Advantages:
  • Learned models can be re-used.
  • If a parameter of the environment changes
    slightly, MEDUSA can detect it online and without
    queries.
  • Convergence is theoretically guaranteed.
  • The number of queries requested and the length of
    training are tractable, even in large problems.
  • MEDUSA can be applied to robotics.

61
Conclusion
  • But:
  • The assumption of an oracle is strong. However,
    we do not need to know the query result
    immediately.
  • The algorithm has many parameters that require
    tuning to work properly.
  • Convergence is guaranteed only under certain
    conditions (for a certain policy, for certain
    POMDPs, for an infinite number of models and a
    perfect policy).

62
References
  • Jaulmes, R., Pineau, J., Precup, D. Learning in
    Non-Stationary Partially Observable Markov
    Decision Processes. ECML Workshop on
    Non-Stationarity in RL, 2005.
  • Jaulmes, R., Pineau, J., Precup, D. Active
    Learning in Partially Observable Markov Decision
    Processes. ECML, 2005.
  • Cohn, D.A., Ghahramani, Z., Jordan, M.I. Active
    Learning with Statistical Models. NIPS, 1995.
  • Pineau, J., Gordon, G., Thrun, S. Point-Based
    Value Iteration: an anytime algorithm for POMDPs.
    IJCAI, 2003.
  • Dearden, R., Friedman, N., Andre, D. Model-Based
    Bayesian Exploration. UAI, 1999.
  • Singh, S., Littman, M., Jong, N.K., Pardoe, D.,
    Stone, P. Learning Predictive State
    Representations. ICML, 2003.

63
Questions?
64
Definitions
65
Convergence of the policy
66
Convergence of MEDUSA
67
Non-query update equations