Title: Learning and Planning for POMDPs
1. Learning and Planning for POMDPs
- Eyal Even-Dar, Tel-Aviv University
- Sham Kakade, University of Pennsylvania
- Yishay Mansour, Tel-Aviv University
2. Talk Outline
- Bounded Rationality and Partially Observable MDPs
- Mathematical Model of POMDPs
- Learning in POMDPs
- Planning in POMDPs
- Tracking in POMDPs
3. Bounded Rationality
- Rationality
  - Players with unlimited computational power
- Bounded rationality
  - Computational limitations
  - Finite automata
- Challenge: play optimally against a finite automaton
  - The size of the automaton is unknown
4. Bounded Rationality and RL
- Model
  - Perform an action
  - See an observation
  - Either immediate rewards or delayed reward
- This is a POMDP
- The unknown size is a serious challenge
5. Classical Reinforcement Learning: Agent-Environment Interaction
[Diagram: the agent sends an action to the environment; the environment returns a reward and the next state.]
6. Reinforcement Learning: Goal
- Maximize the return.
- Discounted return: $\sum_{t=1}^{\infty} \gamma^t r_t$, with $0 < \gamma < 1$
- Undiscounted return: $\frac{1}{T} \sum_{t=1}^{T} r_t$ (a small worked example follows below)
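As a quick sanity check (my own numerical example, not from the talk): if every reward is $r_t = 1$ and $\gamma = 0.9$, the discounted return is $\sum_{t=1}^{\infty} 0.9^t = \frac{0.9}{1 - 0.9} = 9$, while the undiscounted average return is simply $1$ for every horizon $T$.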
7. Markov Decision Process
- P_{s,a}(·): next-state distribution
- R(s,a): reward distribution
[Figure: a three-state fragment; from s1, an action leads to s2 with probability 0.3 and to s3 with probability 0.7, and E[R(s3,a)] = 10.]
8. Reinforcement Learning Model: Policy
- Policy π
  - A mapping from states to distributions over actions
- Optimal policy π*
  - Attains the optimal return from any start state
- Theorem
  - There exists a stationary deterministic optimal policy
9. Planning and Learning in MDPs
- Planning
  - Input: a complete model
  - Output: an optimal policy π*
- Learning
  - Interaction with the environment
  - Achieve a near-optimal return
- For MDPs, both planning and learning can be done efficiently
  - Polynomial in the number of states (tabular representation); a minimal planning sketch follows below
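A minimal value-iteration sketch in Python, to make the "efficient planning" claim concrete. The transition and reward numbers are invented for illustration (only the 0.3/0.7 split and E[R(s3,a)] = 10 echo the figure on slide 7); this is a generic sketch, not the authors' code.

```python
import numpy as np

# Tabular MDP: P[a][s, s'] = Pr(s' | s, a), R[s, a] = expected reward.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.array([
    [[0.0, 0.3, 0.7],   # action 0 from s1 (cf. the figure on slide 7)
     [0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0]],
    [[0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]],
])
R = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [10.0, 0.0]])

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: Q[s, a] = R[s, a] + gamma * sum_s' P[a][s, s'] V[s']
    Q = R + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)   # greedy (stationary, deterministic) policy
print(V, policy)
```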
10. Partially Observable: Agent-Environment Interaction
[Diagram: the agent sends an action to the environment; the environment returns a reward and a signal (observation) correlated with the state.]
11. Partially Observable Markov Decision Process
- P_{s,a}(·): next-state distribution
- R(s,a): reward distribution
- O: set of observations
- O(s,a): observation distribution
[Figure: the same three-state fragment, now with an observation distribution per state, e.g. s1: O1 = .8, O2 = .1, O3 = .1; s2: O1 = .1, O2 = .8, O3 = .1; s3: O1 = .1, O2 = .1, O3 = .8; E[R(s3,a)] = 10.]
12. Partial Observability: Problems in Planning
- The optimal policy is not stationary; furthermore, it is history dependent
- Example
13. Partial Observability: Complexity and Hardness Results

Policy              Horizon     Approximation   Complexity
Stationary          Finite      ε-additive      NP-complete
History dependent   Finite      ε-additive      PSPACE-complete
Stationary          Discounted  ε-additive      NP-complete

[LGM01, L95]
14. Learning in POMDPs: Difficulties
- Suppose an agent knows its state initially; can it keep track of its state?
  - Easy given a completely accurate model
  - Inaccurate model: our new tracking result
- How can the agent return to the same state?
- What is the meaning of very long histories?
  - Do we really need to keep all the history?!
15. Planning in POMDPs: The Belief-State Algorithm
- A Bayesian setting
  - Prior over the initial state
  - Each action and observation defines a posterior: the belief state, a distribution over states (see the update sketch below)
- View the possible belief states as states
  - Infinite number of states
- Also assumes a perfect model
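A minimal sketch of the belief-state (Bayes filter) update. The array names T and Z and the helper's signature are my own assumptions for illustration, not notation from the talk.

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """One Bayes-filter step from prior belief b after action a and observation o.

    T[a][s, s'] = Pr(s' | s, a)   (transition model)
    Z[a][s', o] = Pr(o | s', a)   (observation model)
    Returns the posterior belief over states.
    """
    predicted = b @ T[a]                 # predict: Pr(s' | b, a)
    posterior = predicted * Z[a][:, o]   # correct: weight by Pr(o | s', a)
    norm = posterior.sum()
    if norm == 0:
        raise ValueError("observation has zero probability under the model")
    return posterior / norm
```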
16. Learning in POMDPs: Popular Methods
- Policy gradient methods
  - Find a locally optimal policy in a restricted class of policies (parameterized policies)
  - Need to assume a reset to the start state!
  - Cannot guarantee asymptotic results
  - [Peshkin et al., Baxter & Bartlett]
17. Learning in POMDPs
- Trajectory trees [KMN]
  - Assume a generative model
  - A strong RESET procedure
  - Find a near-best policy in a restricted class of policies
    - Finite-horizon policies
    - Parameterized policies
  - (A rough sketch follows after the figure below)
18. Trajectory Tree [KMN]
[Figure: a trajectory tree rooted at s0; each node branches on the actions a1, a2 and the resulting observations o1-o4.]
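A rough sketch of how one trajectory tree might be grown from a generative model. The helper generative_model and the tree layout are my own assumptions; the talk only specifies that a generative model and a strong RESET are available.

```python
def build_trajectory_tree(state, depth, actions, generative_model):
    """Recursively sample one trajectory tree of the given depth.

    generative_model(state, action) is assumed to return a sampled
    (next_state, observation, reward) triple from the POMDP.
    Every action is tried at every node, so the tree has |A|^depth leaves;
    the same tree can then be reused to evaluate every policy in the class.
    """
    if depth == 0:
        return {"leaf": True}
    node = {"leaf": False, "children": {}}
    for a in actions:
        next_state, obs, reward = generative_model(state, a)
        node["children"][a] = {
            "observation": obs,
            "reward": reward,
            "subtree": build_trajectory_tree(next_state, depth - 1,
                                             actions, generative_model),
        }
    return node
```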
19. Our Setting
- Return: average-reward criterion
- One long trajectory
- No RESET
- Connected environment (unichain POMDP)
- Goal: achieve the optimal return (average reward) with probability 1
20. Homing Strategies in POMDPs
- A homing strategy is a strategy that identifies the state
  - It knows how to return home
- It enables an approximate reset during a long trajectory
21. Homing Strategies
- Learning finite automata [Rivest & Schapire]
  - Use a homing sequence to identify the state
  - The homing sequence is exact
  - It can lead to many states
  - Use the finite-automata learning of [Angluin 87]
- Diversity-based learning [Rivest & Schapire]
  - Similar to our setting
  - Major difference: deterministic transitions
22. Homing Strategies in POMDPs
- Definition
  - H is an (ε, K)-homing strategy if,
  - for every two belief states x1 and x2,
  - after K steps of following H,
  - the expected belief states b1 and b2 are within distance ε (see the sketch below)
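A small sketch of what the definition asks for, checked on a finite set of candidate beliefs. The helper expected_belief_after (returning the expected belief after running H for K steps from a given initial belief) is hypothetical, not part of the talk.

```python
import numpy as np

def is_eps_K_homing(H, K, eps, beliefs, expected_belief_after, model):
    """Check the (eps, K)-homing condition on a finite set of candidate beliefs.

    For every pair of starting beliefs x1, x2, the expected beliefs after
    following strategy H for K steps must be within L1 distance eps.
    """
    for i, x1 in enumerate(beliefs):
        for x2 in beliefs[i + 1:]:
            b1 = expected_belief_after(x1, H, K, model)
            b2 = expected_belief_after(x2, H, K, model)
            if np.abs(b1 - b2).sum() > eps:
                return False
    return True
```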
23. Homing Strategies: Random Walk
- If the POMDP is strongly connected, then the random-walk Markov chain is irreducible
- Following the random walk assures that we converge to the steady-state distribution
24. Homing Strategies: Random Walk
- What if the Markov chain is periodic?
  - e.g., a cycle
- Use a stay action to overcome periodicity problems (a lazy-walk sketch follows below)
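One standard way to read the stay action is as a lazy random walk: with probability 1/2 stay put, otherwise take a uniformly random action. The sketch below is my own illustration (not the talk's construction) showing that even a deterministic cycle becomes aperiodic this way.

```python
import numpy as np

def lazy_walk_matrix(P_actions):
    """Transition matrix of a lazy uniform random walk.

    P_actions: list of per-action transition matrices P_a[s, s'].
    With probability 1/2 we stay, otherwise we pick an action uniformly at
    random; the self-loops remove any periodicity of the underlying chain.
    """
    n = P_actions[0].shape[0]
    uniform = sum(P_actions) / len(P_actions)
    return 0.5 * np.eye(n) + 0.5 * uniform

# A deterministic 3-cycle (period 3) becomes aperiodic under the lazy walk.
cycle = np.roll(np.eye(3), 1, axis=1)
print(lazy_walk_matrix([cycle]))
```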
25. Homing Strategies: Amplification
- Claim
  - If H is an (ε, K)-homing strategy, then repeating H for T times is an (ε^T, KT)-homing strategy
26. Reinforcement Learning with Homing
- Usually, algorithms must balance exploration and exploitation
- Now they must balance exploration, exploitation, and homing
- Homing is performed in both exploration and exploitation
27. The Policy-Testing Algorithm
- Theorem
  - For any connected POMDP, the policy-testing algorithm obtains the optimal average reward with probability 1
  - After T time steps it competes with policies of horizon log log T
28. Policy Testing
- Enumerate the policies
  - Gradually increase the horizon
- Run in phases (a rough sketch follows below)
  - Test policy p_k
    - Average over runs, resetting between runs
  - Run the best policy so far
    - Ensures a good average return
    - Again, reset between runs
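A very rough sketch of one phase, under my own assumptions about the helpers run_policy (execute a policy for a horizon and return the empirical return) and home (run the homing strategy as an approximate reset); this is a schematic illustration, not the authors' pseudocode.

```python
def policy_testing_phase(policies, horizon, n_test, n_exploit,
                         run_policy, home):
    """One phase: test every candidate policy, then exploit the best one.

    run_policy(pi, horizon) -> empirical return of one run (assumed helper).
    home()                  -> approximate reset via the homing strategy.
    """
    estimates = {}
    for pi in policies:                       # exploration: test each policy
        returns = []
        for _ in range(n_test):
            returns.append(run_policy(pi, horizon))
            home()                            # approximate reset between runs
        estimates[pi] = sum(returns) / n_test

    best = max(estimates, key=estimates.get)
    for _ in range(n_exploit):                # exploitation: run the best policy
        run_policy(best, horizon)
        home()
    return best, estimates[best]
```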
29. The Model-Based Algorithm
- Theorem
  - For any connected POMDP, the model-based algorithm obtains the optimal average reward with probability 1
  - After T time steps it competes with policies of horizon log T
30. The Model-Based Algorithm
- For t = 1 to ∞:
  - Exploration: for K1(t) times do
    - Run randomly for t steps and build an empirical model
    - Use the homing sequence to approximately reset
  - Compute the optimal policy on the empirical model
  - Exploitation: for K2(t) times do
    - Run the empirical optimal policy for t steps
    - Use the homing sequence to approximately reset
- (A schematic sketch follows below)
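A schematic rendering of the loop above. The helpers random_step, update_model, plan_on_model, run_policy, and home are hypothetical stand-ins for the talk's components; K1 and K2 are the phase-length functions from the slide.

```python
from itertools import count

def model_based_learning(K1, K2, random_step, update_model,
                         plan_on_model, run_policy, home):
    """Phased model-based learning with homing (schematic sketch only).

    K1(t), K2(t): number of exploration / exploitation runs at horizon t.
    The outer loop never terminates, matching 'for t = 1 to infinity'.
    """
    model = {}
    for t in count(1):                        # horizon grows without bound
        for _ in range(K1(t)):                # exploration phase
            for _ in range(t):
                experience = random_step()    # act randomly for t steps
                update_model(model, experience)
            home()                            # approximate reset via homing
        policy = plan_on_model(model, horizon=t)
        for _ in range(K2(t)):                # exploitation phase
            run_policy(policy, t)
            home()
```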
31. The Model-Based Algorithm
[Figure: the empirical model as a tree rooted at s0, branching on the actions a1, a2 and the observations o1, o2 at each level.]
32. The Model-Based Algorithm: Computing the Optimal Policy
- Bounding the error in the model
  - Significant nodes
    - Sampling
    - Approximate reset
  - Insignificant nodes
- Compute an ε-optimal t-horizon policy in each step
33. The Model-Based Algorithm: Convergence w.p. 1 (Proof)
- Proof idea
  - At any stage, K1(t) is large enough that we compute an ε_t-optimal t-horizon policy
  - K2(t) is large enough that the influence of all previous phases is bounded by ε_t
  - For a large enough horizon, the influence of the homing sequence is also bounded
34. The Model-Based Algorithm: Convergence Rate
- The model-based algorithm produces an ε-optimal policy, with probability 1 - δ, in time polynomial in 1/ε, |A|, |O|, log(1/δ), and the homing-sequence length, and exponential in the horizon time of the optimal policy
- Note: the algorithm does not depend on |S|
35. Planning in POMDPs
- Unfortunately, not today
- Basic results
  - Tight connections with multiplicity automata
    - A well-established theory starting in the 60s
  - Rank of the Hankel matrix (a brief reminder follows below)
    - Similar to PSRs
    - Always at most the number of states
  - Planning algorithm
    - Exponential in the rank of the Hankel matrix
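As a rough reminder of the object behind that rank bound (my own phrasing and notation, not a quote from the talk): index rows by histories $h$ and columns by tests $t$, both action-observation sequences, and set $H_{h,t} = \Pr(t \mid h)$. The rank of this Hankel-style matrix is at most the number of hidden states, which is the sense in which it bounds the state count and plays the same role as the PSR dimension.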
36. Tracking in POMDPs
- The belief-state algorithm
  - Assumes perfect tracking
  - Perfect model
- With an imperfect model, tracking can be impossible
  - For example, no observables
- New results
  - Informative observables imply efficient tracking
  - Towards a spectrum of partially ...