Learning and Planning for POMDPs

1
Learning and Planning for POMDPs
  • Eyal Even-Dar, Tel-Aviv University
  • Sham Kakade, University of Pennsylvania
  • Yishay Mansour, Tel-Aviv University

2
Talk Outline
  • Bounded Rationality and Partially Observable MDPs
  • Mathematical Model of POMDPs
  • Learning in POMDPs
  • Planning in POMDPs
  • Tracking in POMDPs

3
Bounded Rationality
  • Rationality
  • Players with unlimited computational power
  • Bounded rationality
  • Computational limitations
  • Finite automata
  • Challenge: play optimally against a finite
    automaton
  • The size of the automaton is unknown

4
Bounded Rationality and RL
  • Model
  • Perform an action
  • See an observation
  • Either immediate or delayed rewards
  • This is a POMDP
  • Unknown size is a serious challenge

5
Classical Reinforcement Learning: Agent-Environment Interaction
[Diagram: the agent sends an action to the environment; the environment returns a reward and the next state]
6
Reinforcement Learning - Goal
  • Maximize the return.
  • Discounted return: Σ_{t=1}^∞ γ^t r_t, with 0 < γ < 1
  • Undiscounted (average) return: (1/T) Σ_{t=1}^T r_t
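
A minimal sketch of the two return criteria on a finite reward sequence (the discount factor 0.9 below is an arbitrary illustrative choice, not from the talk):

```python
# Minimal sketch: the two return criteria on a finite reward sequence.

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^t * r_t for t = 1..T, with 0 < gamma < 1."""
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))

def average_return(rewards):
    """(1/T) * sum of r_t -- the undiscounted (average-reward) criterion."""
    return sum(rewards) / len(rewards)

print(discounted_return([0, 1, 0, 10, 1]), average_return([0, 1, 0, 10, 1]))
```
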
7
Markov Decision Process
  • S: the states
  • A: the actions
  • P_sa(·): next-state distribution
  • R(s,a): reward distribution

[Diagram: from s1, action a leads to s2 with probability 0.3 and to s3 with probability 0.7; E[R(s3,a)] = 10]
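
A minimal sketch of how such a tabular MDP can be written down in code; the transition from s1 and the reward at s3 follow the diagram, while the remaining entries are illustrative assumptions:

```python
# Minimal sketch of a tabular MDP. P[s][a] maps next states to probabilities,
# R[s][a] is the expected reward E[R(s,a)].

states = ["s1", "s2", "s3"]
actions = ["a"]

P = {
    "s1": {"a": {"s2": 0.3, "s3": 0.7}},  # from the diagram
    "s2": {"a": {"s1": 1.0}},             # assumption: not specified on the slide
    "s3": {"a": {"s1": 1.0}},             # assumption: not specified on the slide
}

R = {
    "s1": {"a": 0.0},                     # assumption: zero reward
    "s2": {"a": 0.0},                     # assumption: zero reward
    "s3": {"a": 10.0},                    # E[R(s3,a)] = 10 from the diagram
}
```
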
8
Reinforcement Learning Model: Policy
  • Policy π: a mapping from states to distributions
    over actions
  • Optimal policy π*: attains the optimal return from
    any start state
  • Theorem: there exists a stationary deterministic
    optimal policy

9
Planning and Learning in MDPs
  • Planning
  • Input: a complete model
  • Output: an optimal policy π* (see the sketch below)
  • Learning
  • Interaction with the environment
  • Achieve a near-optimal return
  • For MDPs, both planning and learning can be done
    efficiently
  • Polynomial in the number of states, with the model
    represented in tabular form
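
A minimal sketch of one standard planning routine for tabular MDPs, value iteration for the discounted criterion (an illustrative choice; P and R are the hypothetical tables from the MDP sketch above):

```python
# Minimal sketch: value iteration for a tabular, discounted MDP.

def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        V_new = {
            s: max(R[s][a] + gamma * sum(p * V[s2]
                                         for s2, p in P[s][a].items())
                   for a in actions)
            for s in states
        }
        diff = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if diff < tol:
            break
    # Greedy (stationary, deterministic) policy w.r.t. the converged values.
    policy = {s: max(actions,
                     key=lambda a: R[s][a] + gamma *
                     sum(p * V[s2] for s2, p in P[s][a].items()))
              for s in states}
    return V, policy
```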

10
Partially Observable: Agent-Environment Interaction
[Diagram: the agent sends an action; the environment returns a reward and a signal correlated with the state (an observation) rather than the state itself]
11
Partially Observable Markov Decision Process
  • S: the states
  • A: the actions
  • P_sa(·): next-state distribution
  • R(s,a): reward distribution
  • O: the observations
  • O(s,a): observation distribution

[Diagram: the same transitions as before (action a from s1 leads to s2 with probability 0.3 and to s3 with probability 0.7; E[R(s3,a)] = 10), but now each state emits observations: s1 gives O1/O2/O3 with probabilities 0.8/0.1/0.1, s2 with 0.1/0.8/0.1, and s3 with 0.1/0.1/0.8]
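
A minimal sketch extending the earlier MDP tables with the observation distributions from the diagram (the name O_table is illustrative):

```python
# Minimal sketch: observation model O_table[s][a][o] on top of the MDP tables.

observations = ["O1", "O2", "O3"]

O_table = {
    "s1": {"a": {"O1": 0.8, "O2": 0.1, "O3": 0.1}},
    "s2": {"a": {"O1": 0.1, "O2": 0.8, "O3": 0.1}},
    "s3": {"a": {"O1": 0.1, "O2": 0.1, "O3": 0.8}},
}
```
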
12
Partially Observable: Problems in Planning
  • The optimal policy is not stationary; furthermore,
    it is history dependent
  • Example

13
Partially Observable: Complexity / Hardness Results

  Policy              Horizon      Approximation   Complexity
  Stationary          Finite       ε-additive      NP-complete
  History dependent   Finite       ε-additive      PSPACE-complete
  Stationary          Discounted   ε-additive      NP-complete

  [LGM01, L95]
14
Learning in POMDPs: Difficulties
  • Suppose an agent knows its state initially; can it
    keep track of its state?
  • Easy given a completely accurate model.
  • With an inaccurate model: our new tracking result.
  • How can the agent return to the same state?
  • What is the meaning of very long histories?
  • Do we really need to keep all the history?

15
Planning in POMDPs: Belief State Algorithm
  • A Bayesian setting
  • A prior over the initial state
  • Given an action and an observation, it defines a
    posterior
  • Belief state: a distribution over states (see the
    update sketch below)
  • View the possible belief states as the states
  • Infinite number of states
  • Also assumes a perfect model
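
A minimal sketch of the standard belief-state update, written against the hypothetical P and O_table tables from the earlier sketches:

```python
# Minimal sketch: Bayesian belief update b'(s') = Pr(s' | b, a, o).
# P[s][a][s'] is the transition probability; O_table[s'][a][o] is the
# probability of observing o after action a when the next state is s'.

def belief_update(belief, action, obs, states, P, O_table):
    new_belief = {}
    for s2 in states:
        # Predict: probability of reaching s2 under the current belief...
        pred = sum(belief[s] * P[s][action].get(s2, 0.0) for s in states)
        # ...then weight by the likelihood of the observation.
        new_belief[s2] = O_table[s2][action].get(obs, 0.0) * pred
    z = sum(new_belief.values())   # Pr(o | b, a); assumed non-zero here
    return {s2: p / z for s2, p in new_belief.items()}

b0 = {"s1": 1.0, "s2": 0.0, "s3": 0.0}
b1 = belief_update(b0, "a", "O3", states, P, O_table)  # concentrates on s3
```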

16
Learning in POMDPs: Popular Methods
  • Policy gradient methods
  • Find a locally optimal policy in a restricted class
    of policies (parameterized policies)
  • Need to assume a reset to the start state!
  • Cannot guarantee asymptotic results
  • [Peshkin et al., Baxter & Bartlett]

17
Learning in POMDPs
  • Trajectory trees [KMN]
  • Assume a generative model
  • A strong RESET procedure
  • Find a near-best policy in a restricted class of
    policies
  • Finite horizon policies
  • Parameterized policies

18
Trajectory Tree [KMN]
[Diagram: a trajectory tree rooted at s0; at each node every action (a1, a2) is tried, and the tree branches on the sampled observations (o1, ..., o4)]
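
A minimal sketch of how such a tree can be grown from a generative model; the sampler below and the exact branching are illustrative assumptions, not the precise KMN construction:

```python
import random

# Minimal sketch: grow a depth-limited trajectory tree from a generative
# model. generative_model(s, a) is a hypothetical sampler returning
# (next_state, observation, reward).

def generative_model(state, action):
    next_state = random.choice(["s1", "s2", "s3"])
    obs = random.choice(["o1", "o2", "o3", "o4"])
    reward = 10.0 if next_state == "s3" else 0.0
    return next_state, obs, reward

def build_tree(state, actions, depth):
    if depth == 0:
        return {}
    tree = {}
    for a in actions:                          # try every action at each node
        s2, o, r = generative_model(state, a)  # one sample per action
        tree[(a, o)] = (r, build_tree(s2, actions, depth - 1))
    return tree

tree = build_tree("s0", ["a1", "a2"], depth=3)
```
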
19
Our setting
  • Return: the average reward criterion
  • One long trajectory
  • No RESET
  • Connected environment (unichain POMDP)
  • Goal: achieve the optimal return (average reward)
    with probability 1

20
Homing strategies - POMDPs
  • A homing strategy is a strategy that identifies the
    state
  • It "knows how to return home"
  • It enables an approximate reset during a long
    trajectory

21
Homing strategies
  • Learning finite automata [Rivest & Schapire]
  • Use a homing sequence to identify the state
  • The homing sequence is exact
  • It can lead to many states
  • Uses the finite automata learning of [Angluin 87]
  • Diversity-based learning [Rivest & Schapire]
  • Similar to our setting
  • Major difference: deterministic transitions

22
Homing strategies - POMDPs
  • Definition
  • H is an (ε,K)-homing strategy if,
  • for every two belief states x1 and x2,
  • after K steps of following H,
  • the expected belief states b1 and b2 are within ε
    distance (see the sketch below)
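
A minimal sketch of what the definition asks for, under the simplifying assumption that H is an open-loop action sequence; in that case the expected belief after K steps is just the prior pushed through the transition kernels (the observation updates average out), and L1 distance is used below:

```python
# Minimal sketch: check the (eps, K)-homing condition for an open-loop action
# sequence H, using the hypothetical tabular P from the earlier sketches.

def expected_belief(belief, H, states, P):
    b = dict(belief)
    for a in H:                      # push the belief through each action
        b = {s2: sum(b[s] * P[s][a].get(s2, 0.0) for s in states)
             for s2 in states}
    return b

def is_homing(H, eps, b1, b2, states, P):
    e1 = expected_belief(b1, H, states, P)
    e2 = expected_belief(b2, H, states, P)
    return sum(abs(e1[s] - e2[s]) for s in states) <= eps   # L1 distance
```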

23
Homing strategies: Random Walk
  • If the POMDP is strongly connected, then the Markov
    chain induced by a random walk is irreducible
  • Following the random walk therefore ensures
    convergence to the steady-state distribution

24
Homing strategies: Random Walk
  • What if the Markov chain is periodic?
  • For example, a cycle
  • Use a "stay" action to overcome periodicity
    problems

25
Homing strategies: Amplification
  • Claim
  • If H is an (ε,K)-homing sequence, then repeating H
    for T times is an (ε^T, KT)-homing sequence
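
A short check of the claim, under the reading (an assumption about the intended definition) that ε bounds the contraction of the distance between expected belief states:

```latex
% Assumption: one application of H contracts the distance by a factor eps.
\|\mathbb{E}_{H}[b_1] - \mathbb{E}_{H}[b_2]\| \le \varepsilon\,\|b_1 - b_2\|
\;\Longrightarrow\;
\|\mathbb{E}_{H^T}[b_1] - \mathbb{E}_{H^T}[b_2]\| \le \varepsilon^{T}\,\|b_1 - b_2\|,
\qquad\text{after } K \cdot T \text{ steps in total.}
```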

26
Reinforcement learning with homing
  • Usually, algorithms must balance between
    exploration and exploitation
  • Now they must balance between exploration,
    exploitation and homing
  • Homing is performed during both exploration and
    exploitation

27
Policy testing algorithm
  • Theorem
  • For any connected POMDP, the policy testing
    algorithm obtains the optimal average reward with
    probability 1
  • After T time steps it competes with policies of
    horizon log log T

28
Policy testing
  • Enumerate the policies
  • Gradually increase the horizon
  • Run in phases (see the sketch below)
  • Test policy πk
  • Average several runs, resetting between runs
  • Run the best policy so far
  • Ensures a good average return
  • Again, reset between runs.
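
A minimal sketch of the phased test-then-exploit structure described above; the hooks run_policy and homing_reset and the schedule parameters are placeholders, not the paper's exact choices:

```python
# Minimal sketch of the policy-testing loop: in each phase, estimate the
# return of the next candidate policy (homing between runs), then exploit the
# best policy found so far. run_policy(pi, t) returns the average reward of
# one run; homing_reset() performs an approximate reset.

def policy_testing(candidate_policies, run_policy, homing_reset,
                   n_test_runs=10, n_exploit_runs=100, horizon=8):
    best_policy, best_return = None, float("-inf")
    for pi in candidate_policies:               # enumerate policies by phase
        # Test phase: average several runs of pi, resetting between runs.
        total = 0.0
        for _ in range(n_test_runs):
            total += run_policy(pi, horizon)
            homing_reset()
        avg = total / n_test_runs
        if avg > best_return:
            best_policy, best_return = pi, avg
        # Exploit phase: run the best policy so far, again with resets.
        for _ in range(n_exploit_runs):
            run_policy(best_policy, horizon)
            homing_reset()
    return best_policy
```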

29
Model based algorithm
  • Theorem
  • For any connected POMDP, the model based
    algorithm obtains the optimal average reward with
    probability 1
  • After T time steps it competes with policies of
    horizon log T

30
Model based algorithm
  • For t = 1 to ∞ (see the sketch below)
  • Exploration: for K1(t) times do
  • Run a random walk for t steps and build an
    empirical model
  • Use the homing sequence to approximate a reset
  • Compute the optimal policy on the empirical model
  • Exploitation: for K2(t) times do
  • Run the empirical optimal policy for t steps
  • Use the homing sequence to approximate a reset
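
A minimal sketch of that explore/exploit schedule; the schedules K1, K2 and the hooks random_trajectory, homing_reset, plan_on_model, run_policy are hypothetical placeholders, not the paper's exact parameters:

```python
# Minimal sketch of the model-based schedule: alternate between building an
# empirical model from random-walk trajectories (exploration) and running the
# policy planned on that model (exploitation), homing in between.
# The loop runs forever by design, matching "for t = 1 to infinity".

from itertools import count

def add_to_model(model, trajectory):
    # Count (history, action) -> (observation, reward) occurrences.
    for history, action, obs, reward in trajectory:
        model.setdefault((history, action), []).append((obs, reward))

def model_based(random_trajectory, homing_reset, plan_on_model, run_policy,
                K1=lambda t: 10 * t, K2=lambda t: 100 * t):
    model = {}
    for t in count(1):                       # t = 1, 2, 3, ...
        for _ in range(K1(t)):               # exploration
            add_to_model(model, random_trajectory(t))
            homing_reset()                   # approximate reset
        policy = plan_on_model(model, horizon=t)
        for _ in range(K2(t)):               # exploitation
            run_policy(policy, t)
            homing_reset()                   # approximate reset
```
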
31
Model based algorithm
[Diagram: the empirical model represented as a tree of histories rooted at s0, branching on the actions a1, a2 and the observations o1, o2]
32
Model based algorithm: Computing the optimal policy
  • Bounding the error in the model
  • Significant nodes: sampling, approximate reset
  • Insignificant nodes
  • Compute an ε-optimal t-horizon policy at each step

33
Model based algorithm: Convergence w.p. 1 proof
  • Proof idea
  • At any stage, K1(t) is large enough that we compute
    an ε_t-optimal t-horizon policy
  • K2(t) is large enough that the influence of all
    earlier phases is bounded by ε_t
  • For a large enough horizon, the influence of the
    homing sequence is also bounded

34
Model based algorithm: Convergence rate
  • The model based algorithm produces an ε-optimal
    policy with probability 1 - δ in time polynomial in
    1/ε, |A|, |O|, log(1/δ) and the homing sequence
    length, and exponential in the horizon time of the
    optimal policy
  • Note: the algorithm does not depend on |S|

35
Planning in POMDP
  • Unfortunately, not today
  • Basic results
  • Tight connections with multiplicity automata
  • A well-established theory starting in the 1960s
  • Rank of the Hankel matrix
  • Similar to PSRs
  • Always at most the number of states
  • Planning algorithm
  • Exponential in the rank of the Hankel matrix

36
Tracking in POMDPs
  • Belief state algorithm
  • Assumes perfect tracking
  • Requires a perfect model.
  • With an imperfect model, tracking can be impossible
  • For example, with no observables
  • New results
  • Informative observables imply efficient
    tracking.
  • Towards a spectrum of partially