Title: Tutorial on Finite State Controllers and Policy Search
1. Tutorial on Finite State Controllers and Policy Search
July 14, 2008 AAAI Workshop on Advancements in POMDP Solvers
Pascal Poupart, University of Waterloo, ppoupart@cs.uwaterloo.ca
2. Outline
- Policy representations
- Policy iteration
- Bounded controllers
- Bounded policy search
- Bounded policy iteration
- Non-convex optimization
- Maximum likelihood
- Synthesis
3. Policy Representations
- π: H → A (histories to actions)
- a0, o0, a1, o1, ..., an, on → a
- Problem: growing history
- π: B → A (beliefs to actions)
- Problem: we can't enumerate all beliefs
- Alternatively, π: Γ → A (α-vectors to actions)
[Figure: α-vectors α1, α2, α3 over the belief space, labeled with actions a1, a2, a3]
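A minimal Python sketch (my addition, not from the slides) of the third representation: store a set of α-vectors, each tagged with an action, and act by maximizing the dot product with the current belief. The names below are illustrative.

import numpy as np

def alpha_vector_policy(belief, alphas, actions):
    # belief: length-|S| array; alphas: list of length-|S| alpha-vectors; actions: action per vector.
    values = [np.dot(belief, alpha) for alpha in alphas]
    return actions[int(np.argmax(values))]     # action of the maximizing alpha-vector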
4. Policy Iteration
- Sondik 71, 78 (description from Hansen 97)
- Think of the POMDP as a continuous belief MDP
- Apply policy iteration for MDPs
Policy improvement: π(b) ← argmax_a R(b,a) + γ Σ_o Pr(o|b,a) V(τ(b,a,o))
Policy evaluation: compute V^π(b). How?
5. Finite State Controller
- Nodes → actions: σ(n) = a
- Edges → observations: β(n,o) = n'
- Policy: π = ⟨σ, β⟩
[Figure: a finite state controller whose nodes are labeled with actions a1, a2, a3 and whose edges are labeled with observations o1, o2]
6. Policy Evaluation
V_n(b) = R(b, σ(n)) + γ Σ_o Pr(o|b, σ(n)) V_{β(n,o)}(τ(b, σ(n), o))
[Figure: the controller's value function V over the belief space]
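A small sketch of how the recurrence above is computed in practice (my addition; the tabular arrays T[a,s,s'], Z[a,s',o], R[s,a] and the function name are assumptions, not the tutorial's code): for a deterministic controller, the α-vector of each node satisfies a linear system over node/state pairs, which can be solved directly.

import numpy as np

def evaluate_controller(T, Z, R, sigma, beta, gamma):
    # T[a,s,s'] transition probs, Z[a,s',o] observation probs, R[s,a] rewards,
    # sigma[n] = action of node n, beta[n,o] = successor node (integer arrays).
    # Solves V_n(s) = R(s,sigma(n)) + gamma * sum_{s',o} T(s,a,s') Z(s',a,o) V_{beta(n,o)}(s').
    N, O = beta.shape
    S = R.shape[0]
    A_mat = np.eye(N * S)
    b = np.zeros(N * S)
    for n in range(N):
        a = sigma[n]
        for s in range(S):
            row = n * S + s
            b[row] = R[s, a]
            for o in range(O):
                for sp in range(S):
                    col = beta[n, o] * S + sp
                    A_mat[row, col] -= gamma * T[a, s, sp] * Z[a, sp, o]
    V = np.linalg.solve(A_mat, b).reshape(N, S)
    return V   # V[n] is the alpha-vector of node n, so V_n(b) = b . V[n]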
7. Policy Iteration
- Sondik 71, 78 (description from Hansen 97)
Policy improvement: π(b) ← argmax_a R(b,a) + γ Σ_o Pr(o|b,a) V(τ(b,a,o))
Policy evaluation: V_n(b) = R(b, σ(n)) + γ Σ_o Pr(o|b, σ(n)) V_{β(n,o)}(τ(b, σ(n), o))
8. Improved Policy Iteration
Policy improvement: π(b) ← argmax_a R(b,a) + γ Σ_o Pr(o|b,a) V(τ(b,a,o))
Policy evaluation: V_n(b) = R(b, σ(n)) + γ Σ_o Pr(o|b, σ(n)) V_{β(n,o)}(τ(b, σ(n), o))
9. Policy Improvement (Hansen)
- Create new nodes for all possible σ and β
- Total of |A|·|N|^|O| new nodes
[Figure: the candidate nodes, labeled with actions a1, a2 and observation edges o1, o2]
10. Policy Improvement (Hansen)
- Retain only the (blue) dominating nodes
- i.e. those that are part of the upper surface
[Figure: the candidate nodes; the dominating ones are highlighted]
11. Policy Improvement (Hansen)
- Prune pointwise-dominated (black) nodes
- i.e. those dominated by a single other node
[Figure: the nodes that remain after pruning the pointwise-dominated ones]
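A hedged sketch of the improvement step of slides 9-11 (my addition, using the same illustrative arrays as the evaluation sketch): enumerate all |A|·|N|^|O| candidate nodes, back up their α-vectors from the current node values, and drop candidates that are pointwise dominated by a single other vector. The full upper-surface test of slide 10 requires an LP per vector and is omitted here.

import itertools
import numpy as np

def backed_up_candidates(T, Z, R, V, gamma):
    # One candidate node per (action, successor-node-per-observation) combination.
    A, S, _ = T.shape
    N, O = V.shape[0], Z.shape[2]
    candidates = []
    for a in range(A):
        for succ in itertools.product(range(N), repeat=O):
            alpha = R[:, a].copy()
            for o in range(O):
                # gamma * sum_{s'} T(s,a,s') Z(s',a,o) V_{succ[o]}(s')
                alpha += gamma * (T[a] * Z[a, :, o]) @ V[succ[o]]
            candidates.append((a, succ, alpha))
    return candidates

def prune_pointwise_dominated(candidates, eps=1e-9):
    # Keep a candidate unless a single other candidate is at least as good at every state.
    kept = []
    for i, (_, _, ai) in enumerate(candidates):
        dominated = any(np.all(aj >= ai + eps)
                        for j, (_, _, aj) in enumerate(candidates) if j != i)
        if not dominated:
            kept.append(candidates[i])
    return kept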
12. Exponential Growth
- Problem: controllers tend to grow exponentially!
- At each iteration, up to |A|·|N|^|O| nodes may be added
- Solution: bounded controllers
13. Policy Search for Bounded Controllers
- Gradient ascent: Meuleau et al. 99; Aberdeen & Baxter 02
- Branch and bound: Meuleau et al. 99
- Stochastic local search: Braziunas & Boutilier 04
- Bounded policy iteration: Poupart & Boutilier 03
- Non-convex optimization: Amato et al. 07
- Maximum likelihood: Toussaint et al. 06
14. Stochastic Controllers
- Policy search is often done with stochastic controllers
- σ(n) = Pr(a|n)
- β(n,o) = Pr(n'|o,n)
- Why?
- Continuous parameterization
- More expressive policy space
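For concreteness (my addition), executing a stochastic controller just means sampling from these two distributions at every step. The function env_step below is a hypothetical stand-in for the POMDP environment.

import numpy as np

def run_controller(sigma, beta, env_step, n0, steps, rng=None):
    # sigma: (N,A) array with sigma[n] = Pr(a|n); beta: (N,O,N) array with beta[n,o] = Pr(n'|n,o);
    # env_step(a) -> (observation, reward).  Returns the total reward collected.
    rng = rng or np.random.default_rng()
    n, total = n0, 0.0
    for _ in range(steps):
        a = rng.choice(sigma.shape[1], p=sigma[n])     # sample action from Pr(a|n)
        o, r = env_step(a)
        total += r
        n = rng.choice(beta.shape[2], p=beta[n, o])    # sample next node from Pr(n'|n,o)
    return total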
15. Bounded Policy Improvement
- Improve each node in turn (Poupart & Boutilier 03)
- Replace it with a dominating stochastic node
17. Node Improvement
- Linear programming
- O(|S||A||O|) constraints
- O(|A||O||N|) variables
Objective: max ε
Variables: Pr(a,n'|n,o)
Constraints:
  V_n + ε ≤ Σ_{a,n'} [ Pr(a,n'|n,o_k) R^a + γ Σ_o T^{a,o} Pr(a,n'|n,o) V_{n'} ]
  Σ_{n'} Pr(a,n'|n,o_k) = Σ_{n'} Pr(a,n'|n,o)   ∀ a, o
  (plus Σ_{a,n'} Pr(a,n'|n,o) = 1 and Pr ≥ 0)
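A sketch of this LP using scipy.optimize.linprog (my reconstruction; the variable layout, the choice of o_k as observation 0, and the tabular arrays follow the earlier sketches and are assumptions, not the original code).

import numpy as np
from scipy.optimize import linprog

def improve_node(n, T, Z, R, V, gamma):
    # Variables: eps, then Pr(a,n'|n,o) for every (a,n',o); maximize eps subject to the
    # value constraint (one row per state) and the probability constraints.
    A, S, _ = T.shape
    O, N = Z.shape[2], V.shape[0]
    nvar = 1 + A * N * O
    idx = lambda a, m, o: 1 + (a * N + m) * O + o
    c = np.zeros(nvar); c[0] = -1.0                     # linprog minimizes, so use -eps
    A_ub = np.zeros((S, nvar)); b_ub = np.zeros(S)
    for s in range(S):                                  # V_n(s) + eps <= backed-up value at s
        A_ub[s, 0] = 1.0
        b_ub[s] = -V[n, s]
        for a in range(A):
            for m in range(N):
                A_ub[s, idx(a, m, 0)] -= R[s, a]
                for o in range(O):
                    A_ub[s, idx(a, m, o)] -= gamma * np.dot(T[a, s] * Z[a, :, o], V[m])
    A_eq, b_eq = [], []
    norm = np.zeros(nvar)                               # sum_{a,n'} Pr(a,n'|n,o_0) = 1
    for a in range(A):
        for m in range(N):
            norm[idx(a, m, 0)] = 1.0
    A_eq.append(norm); b_eq.append(1.0)
    for a in range(A):                                  # marginal over n' identical for all o
        for o in range(1, O):
            row = np.zeros(nvar)
            for m in range(N):
                row[idx(a, m, o)] = 1.0
                row[idx(a, m, 0)] -= 1.0
            A_eq.append(row); b_eq.append(0.0)
    bounds = [(None, None)] + [(0, None)] * (A * N * O)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=bounds)
    return res   # res.x[0] is eps; the remaining entries are the new node parameters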
18. Synthetic Network Management
- Poupart & Boutilier 04
- 3legs25: 33,554,432 states, 51 actions, 2 observations
19. Sparse Node Improvement
- Hansen 08
- Observation
- Controllers are mostly deterministic
- Few non-zero parameters
- Proposal
- Column generation
- Solve several reduced LPs
- O(|O|) variables (instead of O(|A||O||N|))
- Can be several orders of magnitude faster
20. Non-convex Optimization
- Amato et al. 07
- Quadratically constrained problem
- |N| times more variables and constraints than the LPs in BPI
Objective: max b_0 · V_{n_0}
Variables: Pr(a,n'|n,o), V_n
Constraints:
  V_n = Σ_{a,n'} [ Pr(a,n'|n,o_k) R^a + γ Σ_o T^{a,o} Pr(a,n'|n,o) V_{n'} ]
  Σ_{n'} Pr(a,n'|n,o_k) = Σ_{n'} Pr(a,n'|n,o)   ∀ a, n, o
21. Alternating Optimization
- Bounded policy iteration:
- Policy evaluation: fix Pr(a,n'|n,o) and optimize V_n
- Policy improvement: fix V_n on the right-hand side and optimize Pr(a,n'|n,o) and V_n on the left-hand side
Objective: max b_0 · V_{n_0}
Variables: Pr(a,n'|n,o), V_n
Constraints:
  V_n = Σ_{a,n'} [ Pr(a,n'|n,o_k) R^a + γ Σ_o T^{a,o} Pr(a,n'|n,o) V_{n'} ]
  Σ_{n'} Pr(a,n'|n,o_k) = Σ_{n'} Pr(a,n'|n,o)   ∀ a, n, o
22. Graphical Model
- Meuleau et al. 99
- Influence diagram that includes the controller
[Figure: influence diagram with controller nodes n, observations o, actions a, and states s]
23. Likelihood Maximization
- Toussaint et al. 06
- Mixture of DBNs with normalized terminal reward
- Maximize reward likelihood
- Expectation-Maximization
[Figure: mixture of DBNs of increasing horizon, weighted (1-γ), γ(1-γ), ..., γ^k(1-γ), with controller nodes n, observations o, actions a, states s, and a terminal reward r in each component]
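A worked detail that the slide leaves implicit (my addition, assuming rewards are rescaled to [0,1] as in Toussaint et al.'s construction): with the horizon-k DBN given mixture weight γ^k(1-γ), the likelihood of the binary reward variable is proportional to the expected discounted reward, so maximizing reward likelihood by EM maximizes the controller's value.

\Pr(r = 1 \mid \theta)
  \;=\; \sum_{k=0}^{\infty} (1-\gamma)\,\gamma^{k}\,\mathbb{E}\!\left[R_k \mid \theta\right]
  \;=\; (1-\gamma)\,\mathbb{E}\!\left[\,\sum_{k=0}^{\infty}\gamma^{k} R_k \;\middle|\; \theta\right]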
24. Local Optima Analysis
- Non-convex optimization problem
- All existing algorithms get trapped in local optima
- What do we know about local optima?
25. Local Optima Analysis
- BPI is in a local optimum: each node's value function is tangent to the backed-up value function
- GA is in a local optimum: each node reachable from the initial belief state has a value function tangent to the backed-up value function
[Figure: value function, backed-up value function, and their tangent points]
26. Escape Technique for BPI
- Idea: create new nodes to improve the belief states reachable in one step from the tangent belief states
- Theorem: no improvement at the belief states reachable in one step from the tangent belief states ⟹ the policy is optimal at the tangent belief states
[Figure: a tangent belief b and the beliefs reachable from it in one step via T^{a,o}]
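A rough sketch of the escape idea (my interpretation of this slide, reusing the earlier illustrative arrays): from a tangent belief b, enumerate the beliefs reachable in one step, and report those where a one-step DP backup of the current node values beats the controller's current value; a new node can then be added to improve such a belief.

import numpy as np

def escape_candidates(b, T, Z, R, V, gamma, tol=1e-9):
    # b: tangent belief (length |S|); V[n,s]: current node values.
    A, S, _ = T.shape
    O, N = Z.shape[2], V.shape[0]
    out = []
    for a in range(A):
        for o in range(O):
            bp = (b @ T[a]) * Z[a, :, o]                 # unnormalized successor belief
            if bp.sum() < 1e-12:
                continue
            bp = bp / bp.sum()
            current = max(bp @ V[n] for n in range(N))   # controller's value at bp
            best_backup = -np.inf
            for a2 in range(A):                          # one-step DP backup at bp
                val = bp @ R[:, a2]
                for o2 in range(O):
                    val += gamma * max((bp @ (T[a2] * Z[a2, :, o2])) @ V[n] for n in range(N))
                best_backup = max(best_backup, val)
            if best_backup > current + tol:
                out.append((a, o, bp, best_backup - current))
    return out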
27. Summary
- Bounded controller advantages:
- Easily interpretable policy
- No need for belief monitoring
- Real-time policy execution
- Policy search as optimization (Amato et al. 07)
- Wide range of optimization techniques
- Policy search as likelihood maximization (Toussaint et al. 06)
- Wide range of inference techniques
- Bounded controller drawback:
- Local optima
28. Other Policy Search Techniques
- Policy search via density estimation: Ng et al. 99
- PEGASUS: Ng & Jordan 00
- Gradient-based policy search: Baxter & Bartlett 00
- Natural policy gradient: Kakade 02
- Covariant policy search: Bagnell & Schneider 03
- Policy search by dynamic programming: Bagnell et al. 04
- Point-based policy iteration: Ji, Parr et al. 07