Title: Tutorial on Finite State Controllers and Policy Search
1. Tutorial on Finite State Controllers and Policy Search
July 14, 2008 AAAI Workshop on Advancements in POMDP Solvers
Pascal Poupart, University of Waterloo, ppoupart@cs.uwaterloo.ca
2. Outline
- Policy representations
- Policy iteration
- Bounded controllers
- Bounded policy search
- Bounded policy iteration
- Non-convex optimization
- Maximum likelihood
- Synthesis
3. Policy Representations
- π: H → A (histories to actions)
- a0, o0, a1, o1, ..., an, on → a
- Problem: growing history
- π: B → A (beliefs to actions)
- Problem: we can't enumerate all beliefs
- Alternatively, π: Γ → A (α-vectors to actions)
[Figure: α-vectors α1, α2, α3 over the belief space, labeled with actions a1, a2, a3]
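A minimal Python sketch (my addition, not from the slides) of the third representation: store a set of α-vectors, each tagged with an action, and act by maximizing the dot product with the current belief. The names below are illustrative.

import numpy as np

def alpha_vector_policy(belief, alphas, actions):
    # belief: length-|S| array; alphas: list of length-|S| alpha-vectors; actions: action per vector.
    values = [np.dot(belief, alpha) for alpha in alphas]
    return actions[int(np.argmax(values))]     # action of the maximizing alpha-vector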
4. Policy Iteration
- Sondik 71, 78 (description from Hansen 97)
- Think of the POMDP as a continuous belief MDP
- Apply policy iteration for MDPs
Policy improvement: π(b) ← argmax_a R(b,a) + γ Σ_o Pr(o|b,a) V(τ(b,a,o))
Policy evaluation: compute V^π(b). How?
5. Finite State Controller
- Nodes → actions: σ(n) = a
- Edges → observations: β(n,o) = n'
- Policy: π = ⟨σ, β⟩
[Figure: a finite state controller whose nodes are labeled with actions a1, a2, a3 and whose edges are labeled with observations o1, o2]
6. Policy Evaluation
V_n(b) = R(b, σ(n)) + γ Σ_o Pr(o|b, σ(n)) V_{β(n,o)}(τ(b, σ(n), o))
[Figure: the controller's value function V over the belief space]
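A small sketch of how the recurrence above is computed in practice (my addition; the tabular arrays T[a,s,s'], Z[a,s',o], R[s,a] and the function name are assumptions, not the tutorial's code): for a deterministic controller, the α-vector of each node satisfies a linear system over node/state pairs, which can be solved directly.

import numpy as np

def evaluate_controller(T, Z, R, sigma, beta, gamma):
    # T[a,s,s'] transition probs, Z[a,s',o] observation probs, R[s,a] rewards,
    # sigma[n] = action of node n, beta[n,o] = successor node (integer arrays).
    # Solves V_n(s) = R(s,sigma(n)) + gamma * sum_{s',o} T(s,a,s') Z(s',a,o) V_{beta(n,o)}(s').
    N, O = beta.shape
    S = R.shape[0]
    A_mat = np.eye(N * S)
    b = np.zeros(N * S)
    for n in range(N):
        a = sigma[n]
        for s in range(S):
            row = n * S + s
            b[row] = R[s, a]
            for o in range(O):
                for sp in range(S):
                    col = beta[n, o] * S + sp
                    A_mat[row, col] -= gamma * T[a, s, sp] * Z[a, sp, o]
    V = np.linalg.solve(A_mat, b).reshape(N, S)
    return V   # V[n] is the alpha-vector of node n, so V_n(b) = b . V[n]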
7. Policy Iteration
- Sondik 71, 78 (description from Hansen 97)
Policy improvement: π(b) ← argmax_a R(b,a) + γ Σ_o Pr(o|b,a) V(τ(b,a,o))
Policy evaluation: V_n(b) = R(b, σ(n)) + γ Σ_o Pr(o|b, σ(n)) V_{β(n,o)}(τ(b, σ(n), o))
8. Improved Policy Iteration
Policy improvement: π(b) ← argmax_a R(b,a) + γ Σ_o Pr(o|b,a) V(τ(b,a,o))
Policy evaluation: V_n(b) = R(b, σ(n)) + γ Σ_o Pr(o|b, σ(n)) V_{β(n,o)}(τ(b, σ(n), o))
9. Policy Improvement (Hansen)
- Create new nodes for all possible σ and β
- Total of |A|·|N|^|O| new nodes
[Figure: the candidate nodes, labeled with actions a1, a2 and observation edges o1, o2]
10. Policy Improvement (Hansen)
- Retain only the (blue) dominating nodes
- i.e. those that are part of the upper surface
[Figure: the candidate nodes; the dominating ones are highlighted]
11. Policy Improvement (Hansen)
- Prune pointwise-dominated (black) nodes
- i.e. those dominated by a single other node
[Figure: the nodes that remain after pruning the pointwise-dominated ones]
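A hedged sketch of the improvement step of slides 9-11 (my addition, using the same illustrative arrays as the evaluation sketch): enumerate all |A|·|N|^|O| candidate nodes, back up their α-vectors from the current node values, and drop candidates that are pointwise dominated by a single other vector. The full upper-surface test of slide 10 requires an LP per vector and is omitted here.

import itertools
import numpy as np

def backed_up_candidates(T, Z, R, V, gamma):
    # One candidate node per (action, successor-node-per-observation) combination.
    A, S, _ = T.shape
    N, O = V.shape[0], Z.shape[2]
    candidates = []
    for a in range(A):
        for succ in itertools.product(range(N), repeat=O):
            alpha = R[:, a].copy()
            for o in range(O):
                # gamma * sum_{s'} T(s,a,s') Z(s',a,o) V_{succ[o]}(s')
                alpha += gamma * (T[a] * Z[a, :, o]) @ V[succ[o]]
            candidates.append((a, succ, alpha))
    return candidates

def prune_pointwise_dominated(candidates, eps=1e-9):
    # Keep a candidate unless a single other candidate is at least as good at every state.
    kept = []
    for i, (_, _, ai) in enumerate(candidates):
        dominated = any(np.all(aj >= ai + eps)
                        for j, (_, _, aj) in enumerate(candidates) if j != i)
        if not dominated:
            kept.append(candidates[i])
    return kept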
12. Exponential Growth
- Problem: controllers tend to grow exponentially!
- At each iteration, up to |A|·|N|^|O| nodes may be added
- Solution: bounded controllers
13. Policy Search for Bounded Controllers
- Gradient ascent: Meuleau et al. 99; Aberdeen & Baxter 02
- Branch and bound: Meuleau et al. 99
- Stochastic local search: Braziunas & Boutilier 04
- Bounded policy iteration: Poupart & Boutilier 03
- Non-convex optimization: Amato et al. 07
- Maximum likelihood: Toussaint et al. 06
14. Stochastic Controllers
- Policy search is often done with stochastic controllers
- σ(n) = Pr(a|n)
- β(n,o) = Pr(n'|o,n)
- Why?
- Continuous parameterization
- More expressive policy space
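For concreteness (my addition), executing a stochastic controller just means sampling from these two distributions at every step. The function env_step below is a hypothetical stand-in for the POMDP environment.

import numpy as np

def run_controller(sigma, beta, env_step, n0, steps, rng=None):
    # sigma: (N,A) array with sigma[n] = Pr(a|n); beta: (N,O,N) array with beta[n,o] = Pr(n'|n,o);
    # env_step(a) -> (observation, reward).  Returns the total reward collected.
    rng = rng or np.random.default_rng()
    n, total = n0, 0.0
    for _ in range(steps):
        a = rng.choice(sigma.shape[1], p=sigma[n])     # sample action from Pr(a|n)
        o, r = env_step(a)
        total += r
        n = rng.choice(beta.shape[2], p=beta[n, o])    # sample next node from Pr(n'|n,o)
    return total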
15. Bounded Policy Improvement
- Improve each node in turn (Poupart & Boutilier 03)
- Replace it with a dominating stochastic node
17. Node Improvement
- Linear programming
- O(|S||A||O|) constraints
- O(|A||O||N|) variables
Objective: max ε
Variables: Pr(a,n'|n,o)
Constraints:
  V_n + ε ≤ Σ_{a,n'} [ Pr(a,n'|n,o_k) R^a + γ Σ_o T^{a,o} Pr(a,n'|n,o) V_{n'} ]
  Σ_{n'} Pr(a,n'|n,o_k) = Σ_{n'} Pr(a,n'|n,o)   ∀ a, o
  (plus Σ_{a,n'} Pr(a,n'|n,o) = 1 and Pr ≥ 0)
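A sketch of this LP using scipy.optimize.linprog (my reconstruction; the variable layout, the choice of o_k as observation 0, and the tabular arrays follow the earlier sketches and are assumptions, not the original code).

import numpy as np
from scipy.optimize import linprog

def improve_node(n, T, Z, R, V, gamma):
    # Variables: eps, then Pr(a,n'|n,o) for every (a,n',o); maximize eps subject to the
    # value constraint (one row per state) and the probability constraints.
    A, S, _ = T.shape
    O, N = Z.shape[2], V.shape[0]
    nvar = 1 + A * N * O
    idx = lambda a, m, o: 1 + (a * N + m) * O + o
    c = np.zeros(nvar); c[0] = -1.0                     # linprog minimizes, so use -eps
    A_ub = np.zeros((S, nvar)); b_ub = np.zeros(S)
    for s in range(S):                                  # V_n(s) + eps <= backed-up value at s
        A_ub[s, 0] = 1.0
        b_ub[s] = -V[n, s]
        for a in range(A):
            for m in range(N):
                A_ub[s, idx(a, m, 0)] -= R[s, a]
                for o in range(O):
                    A_ub[s, idx(a, m, o)] -= gamma * np.dot(T[a, s] * Z[a, :, o], V[m])
    A_eq, b_eq = [], []
    norm = np.zeros(nvar)                               # sum_{a,n'} Pr(a,n'|n,o_0) = 1
    for a in range(A):
        for m in range(N):
            norm[idx(a, m, 0)] = 1.0
    A_eq.append(norm); b_eq.append(1.0)
    for a in range(A):                                  # marginal over n' identical for all o
        for o in range(1, O):
            row = np.zeros(nvar)
            for m in range(N):
                row[idx(a, m, o)] = 1.0
                row[idx(a, m, 0)] -= 1.0
            A_eq.append(row); b_eq.append(0.0)
    bounds = [(None, None)] + [(0, None)] * (A * N * O)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=bounds)
    return res   # res.x[0] is eps; the remaining entries are the new node parameters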
18. Synthetic Network Management
- Poupart & Boutilier 04
- 3legs25: 33,554,432 states, 51 actions, 2 observations
19. Sparse Node Improvement
- Hansen 08
- Observation
- Controllers are mostly deterministic
- Few non-zero parameters
- Proposal
- Column generation
- Solve several reduced LPs
- O(|O|) variables (instead of O(|A||O||N|))
- Can be several orders of magnitude faster
20. Non-convex Optimization
- Amato et al. 07
- Quadratically constrained problem
- |N| times more variables and constraints than the LPs in BPI
Objective: max b_0 · V_{n_0}
Variables: Pr(a,n'|n,o), V_n
Constraints:
  V_n = Σ_{a,n'} [ Pr(a,n'|n,o_k) R^a + γ Σ_o T^{a,o} Pr(a,n'|n,o) V_{n'} ]
  Σ_{n'} Pr(a,n'|n,o_k) = Σ_{n'} Pr(a,n'|n,o)   ∀ a, n, o
21. Alternating Optimization
- Bounded policy iteration:
- Policy evaluation: fix Pr(a,n'|n,o) and optimize V_n
- Policy improvement: fix V_n on the right-hand side and optimize Pr(a,n'|n,o) and V_n on the left-hand side
Objective: max b_0 · V_{n_0}
Variables: Pr(a,n'|n,o), V_n
Constraints:
  V_n = Σ_{a,n'} [ Pr(a,n'|n,o_k) R^a + γ Σ_o T^{a,o} Pr(a,n'|n,o) V_{n'} ]
  Σ_{n'} Pr(a,n'|n,o_k) = Σ_{n'} Pr(a,n'|n,o)   ∀ a, n, o
22. Graphical Model
- Meuleau et al. 99
- Influence diagram that includes the controller
[Figure: influence diagram with controller nodes n, observations o, actions a, and states s]
23. Likelihood Maximization
- Toussaint et al. 06
- Mixture of DBNs with normalized terminal reward
- Maximize reward likelihood
- Expectation-Maximization
[Figure: mixture of DBNs of increasing horizon, weighted (1-γ), γ(1-γ), ..., γ^k(1-γ), with controller nodes n, observations o, actions a, states s, and a terminal reward r in each component]
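A worked detail that the slide leaves implicit (my addition, assuming rewards are rescaled to [0,1] as in Toussaint et al.'s construction): with the horizon-k DBN given mixture weight γ^k(1-γ), the likelihood of the binary reward variable is proportional to the expected discounted reward, so maximizing reward likelihood by EM maximizes the controller's value.

\Pr(r = 1 \mid \theta)
  \;=\; \sum_{k=0}^{\infty} (1-\gamma)\,\gamma^{k}\,\mathbb{E}\!\left[R_k \mid \theta\right]
  \;=\; (1-\gamma)\,\mathbb{E}\!\left[\,\sum_{k=0}^{\infty}\gamma^{k} R_k \;\middle|\; \theta\right]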
24. Local Optima Analysis
- Non-convex optimization problem
- All existing algorithms get trapped in local optima
- What do we know about local optima?
25. Local Optima Analysis
- BPI is in a local optimum: each node's value function is tangent to the backed-up value function
- GA is in a local optimum: each node reachable from the initial belief state has a value function tangent to the backed-up value function
[Figure: value function, backed-up value function, and their tangent points]
26. Escape Technique for BPI
- Idea: create new nodes to improve the belief states reachable in one step from the tangent belief states
- Theorem: no improvement at the belief states reachable in one step from the tangent belief states ⟹ the policy is optimal at the tangent belief states
[Figure: a tangent belief b and the beliefs reachable from it in one step via T^{a,o}]
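A rough sketch of the escape idea (my interpretation of this slide, reusing the earlier illustrative arrays): from a tangent belief b, enumerate the beliefs reachable in one step, and report those where a one-step DP backup of the current node values beats the controller's current value; a new node can then be added to improve such a belief.

import numpy as np

def escape_candidates(b, T, Z, R, V, gamma, tol=1e-9):
    # b: tangent belief (length |S|); V[n,s]: current node values.
    A, S, _ = T.shape
    O, N = Z.shape[2], V.shape[0]
    out = []
    for a in range(A):
        for o in range(O):
            bp = (b @ T[a]) * Z[a, :, o]                 # unnormalized successor belief
            if bp.sum() < 1e-12:
                continue
            bp = bp / bp.sum()
            current = max(bp @ V[n] for n in range(N))   # controller's value at bp
            best_backup = -np.inf
            for a2 in range(A):                          # one-step DP backup at bp
                val = bp @ R[:, a2]
                for o2 in range(O):
                    val += gamma * max((bp @ (T[a2] * Z[a2, :, o2])) @ V[n] for n in range(N))
                best_backup = max(best_backup, val)
            if best_backup > current + tol:
                out.append((a, o, bp, best_backup - current))
    return out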
27. Summary
- Bounded controller advantages:
- Easily interpretable policy
- No need for belief monitoring
- Real-time policy execution
- Policy search as optimization (Amato et al. 07)
- Wide range of optimization techniques
- Policy search as likelihood maximization (Toussaint et al. 06)
- Wide range of inference techniques
- Bounded controller drawback:
- Local optima
28. Other Policy Search Techniques
- Policy search via density estimation: Ng et al. 99
- PEGASUS: Ng & Jordan 00
- Gradient-based policy search: Baxter & Bartlett 00
- Natural policy gradient: Kakade 02
- Covariant policy search: Bagnell & Schneider 03
- Policy search by dynamic programming: Bagnell et al. 04
- Point-based policy iteration: Ji, Parr et al. 07