Title: A Hybridized Planner for Stochastic Domains
1. A Hybridized Planner for Stochastic Domains
- Mausam and Daniel S. Weld
- University of Washington, Seattle
- Piergiorgio Bertoli
- ITC-IRST, Trento
2. Planning under Uncertainty (ICAPS'03 Workshop)
- Qualitative (disjunctive) uncertainty
- Which real problem can you solve?
- Quantitative (probabilistic) uncertainty
- Which real problem can you model?
3. The Quantitative View
- Markov Decision Process
- models uncertainty with probabilistic outcomes
- general decision-theoretic framework
- algorithms are slow
- do we need the full power of decision theory?
- is an unconverged partial policy any good?
4. The Qualitative View
- Conditional Planning
- models uncertainty as a logical disjunction of outcomes
- exploits classical planning techniques → FAST
- ignores probabilities → poor solutions
- how bad are pure qualitative solutions?
- can we improve the qualitative policies?
5. HybPlan: A Hybridized Planner
- combines probabilistic and disjunctive planners
- produces good solutions in intermediate running times
- anytime: makes effective use of available resources
- bounds termination with a quality guarantee
- Quantitative view: completes the partial probabilistic policy by using qualitative policies in some states
- Qualitative view: improves the qualitative policies in the more important regions
6. Outline
- Motivation
- Planning with Probabilistic Uncertainty (RTDP)
- Planning with Disjunctive Uncertainty (MBP)
- Hybridizing RTDP and MBP (HybPlan)
- Experiments
- Conclusions and Future Work
7. Markov Decision Process
- ⟨S, A, Pr, C, s0, G⟩
- S: a set of states
- A: a set of actions
- Pr: probabilistic transition model
- C: cost model
- s0: start state
- G: a set of goals
- Find a policy (S → A) that
- minimizes the expected cost to reach a goal
- over an indefinite horizon
- in a fully observable Markov decision process
This yields the optimal cost function J* and the optimal policy π* (in symbols below).
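In symbols, using a standard formulation consistent with the tuple above (not copied verbatim from the slides):

```latex
J^{\pi}(s_0) = \mathbb{E}\left[\, \sum_{t=0}^{T-1} C\bigl(s_t, \pi(s_t)\bigr) \,\middle|\, s_0, \pi \right],
\qquad
\pi^{*} = \operatorname*{arg\,min}_{\pi : S \to A} J^{\pi}(s_0)
```

where T is the first (random) time at which a state in G is reached.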
8. Example
[Gridworld figure: start state s0 and a Goal; annotations mark a longer path, a region where all states are dead-ends, and a wrong direction from which the goal is still reachable.]
9. Optimal State Costs
[Figure: the gridworld annotated with the optimal cost-to-goal of each state; dead-end states have infinite cost.]
10. Optimal Policy
[Figure: the gridworld with the optimal action marked at each state, leading to the Goal.]
11-13. Bellman Backup
- Create a better approximation to the cost function at s
- Trial: simulate the greedy policy and update the states visited along the way
- Real-Time Dynamic Programming (Barto et al. '95; Bonet & Geffner '03): repeat trials until the cost function converges
The update itself is the standard Bellman backup, shown below.
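In the notation of slide 7's tuple (and matching the Q/J notation used on slide 20):

```latex
Q_{n+1}(s,a) = C(s,a) + \sum_{s' \in S} \Pr(s' \mid s, a)\, J_n(s'),
\qquad
J_{n+1}(s) = \min_{a \in A} Q_{n+1}(s,a)
```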
14. Planning with Disjunctive Uncertainty
- ⟨S, A, T, s0, G⟩
- S: a set of states
- A: a set of actions
- T: disjunctive transition model
- s0: the start state
- G: a set of goals
- Find a strong-cyclic policy (S → A) that
- guarantees reaching a goal
- over an indefinite horizon
- in a fully observable planning problem
15. Model Based Planner (Bertoli et al.)
- States, transitions, etc. are represented logically
- Uncertainty → multiple possible successor states
- Planning algorithm: iteratively removes bad states
- Bad = states that don't reach anywhere (dead-ends) or reach only other bad states
An explicit-state sketch of this pruning follows below.
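MBP itself works symbolically; the following explicit-state Python sketch only illustrates the bad-state-removal fixpoint (all names and data structures here are illustrative assumptions, not MBP's actual implementation):

```python
def strong_cyclic_states(states, goals, transitions):
    """Return the set of states from which a strong-cyclic policy exists.
    transitions[s][a]: set of possible successor states (disjunctive outcomes).
    A state survives iff some action keeps every outcome inside the surviving
    set AND a goal remains reachable from it; everything else is 'bad'."""
    good = set(states)
    while True:
        # actions whose every disjunctive outcome stays inside `good`
        safe = {s: [a for a, succ in transitions[s].items() if succ <= good]
                for s in good - set(goals)}
        # backward reachability of the goals through safe actions only
        reach = set(goals) & good
        grew = True
        while grew:
            grew = False
            for s, acts in safe.items():
                if s not in reach and any(transitions[s][a] & reach for a in acts):
                    reach.add(s)
                    grew = True
        if reach == good:
            return good      # fixpoint: no more bad states to remove
        good = reach         # drop newly discovered bad states and repeat
```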
16. MBP Policy
[Figure: the MBP policy on the gridworld; it reaches the Goal but is a sub-optimal solution.]
17. Outline
- Motivation
- Planning with Probabilistic Uncertainty (RTDP)
- Planning with Disjunctive Uncertainty (MBP)
- Hybridizing RTDP and MBP (HybPlan)
- Experiments
- Conclusions and Future Work
18. HybPlan Top-Level Code
- 0. run MBP to find a solution to the goal
- 1. run RTDP for some time
- 2. compute the partial greedy policy (πrtdp)
- 3. compute the hybridized policy (πhyb):
  πhyb(s) = πrtdp(s) if visited(s) > threshold
  πhyb(s) = πmbp(s) otherwise
- 4. clean πhyb by removing dead-ends and probability-1 cycles
- 5. evaluate πhyb
- 6. save the best policy obtained so far
- repeat steps 1-6 until (1) resources are exhausted or (2) a satisfactory policy is found (a sketch in code follows below)
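A minimal Python sketch of this loop. The helpers `rtdp_trials`, `greedy_policy`, `clean`, `evaluate`, and `lower_bound` are hypothetical names for the steps above, not the authors' implementation:

```python
import time

def hybplan(mdp, mbp_policy, threshold, max_seconds, target_error):
    """Top-level HybPlan loop, mirroring steps 1-6 above.
    mbp_policy: dict state -> action computed by MBP (step 0)."""
    best_policy, best_cost = None, float('inf')
    start = time.time()
    while time.time() - start < max_seconds:
        rtdp_trials(mdp, seconds=0.5)               # 1. run RTDP for some time
        pi_rtdp, visited = greedy_policy(mdp)       # 2. partial greedy policy
        pi_hyb = {s: pi_rtdp[s] if visited.get(s, 0) > threshold
                     else mbp_policy[s]
                  for s in set(pi_rtdp) | set(mbp_policy)}   # 3. hybridize
        clean(pi_hyb, mdp, mbp_policy)              # 4. remove dead-ends, prob-1 cycles
        cost = evaluate(pi_hyb, mdp)                # 5. evaluate pi_hyb at s0
        if cost < best_cost:                        # 6. keep the best policy so far
            best_policy, best_cost = pi_hyb, cost
        if best_cost - lower_bound(mdp) <= target_error:
            break                                   # satisfactory policy found
    return best_policy
```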
19. First RTDP Trial
[Figure: the gridworld at the start of the first trial; every state's cost estimate is initialized to 0.]
20. Bellman Backup
[Figure: the gridworld during the first trial; all cost estimates are still 0.]
Q1(s,N) = 1 + 0.5·0 + 0.5·0 = 1
Q1(s,S) = Q1(s,W) = Q1(s,E) = 1
J1(s) = min_a Q1(s,a) = 1; let the greedy action be North
21. Simulation of Greedy Action
[Figure: the greedy action (North) is simulated; the backed-up state now has cost 1.]
22. Continuing First Trial
[Figure: the trial continues from the sampled successor state.]
23. Continuing First Trial
[Figure: further backups along the trial; more visited states take cost 1.]
24. Finishing First Trial
[Figure: the trial reaches the Goal; costs along the visited path have been updated.]
25. Cost Function after First Trial
[Figure: the cost function after the first trial; states visited during the trial carry updated costs, all others remain 0.]
26. Partial Greedy Policy
2. compute the partial greedy policy (πrtdp)
[Figure: greedy actions are defined only at the states visited during the trial.]
27. Construct Hybridized Policy with MBP
3. compute the hybridized policy (πhyb) (threshold = 0)
[Figure: πrtdp at visited states is completed with πmbp elsewhere; the combined policy reaches the Goal.]
28. Evaluate Hybridized Policy
5. evaluate πhyb; 6. store πhyb
[Figure: the costs of πhyb computed over the gridworld.]
After the first trial: J(πhyb) = 5
A minimal evaluation sketch follows below.
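Step 5 ("evaluate πhyb") is ordinary policy evaluation; here is an iterative sketch under the same illustrative conventions (the explicit dictionaries `P` and `C` are assumptions):

```python
def evaluate(policy, P, C, goals, sweeps=10000, tol=1e-8):
    """Expected cost to reach a goal under `policy`, by iterative
    evaluation restricted to the policy's actions.
    P[(s, a)]: dict successor -> probability;  C[(s, a)]: action cost."""
    J = {s: 0.0 for s in policy}
    for _ in range(sweeps):
        delta = 0.0
        for s, a in policy.items():
            if s in goals:
                continue                        # goal states cost nothing
            new = C[(s, a)] + sum(p * J.get(s2, 0.0)
                                  for s2, p in P[(s, a)].items())
            delta = max(delta, abs(new - J[s]))
            J[s] = new
        if delta < tol:
            break                               # converged
    return J
```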
29. Second Trial
[Figure: the second RTDP trial and its cost updates.]
30. Partial Greedy Policy
[Figure: the partial greedy policy after the second trial.]
31. Absence of MBP Policy
The MBP policy doesn't exist here: there is no path to the goal.
[Figure: a region of the gridworld from which the Goal is unreachable.]
32. Third Trial
[Figure: the third RTDP trial and its cost updates.]
33. Partial Greedy Policy
[Figure: the partial greedy policy after the third trial.]
34-38. Probability-1 Cycles
repeat: find a state s in the cycle, set πhyb(s) := πmbp(s); until the cycle is broken
[Figure sequence: cycle states are redirected to their MBP actions one at a time until the policy escapes the cycle and reaches the Goal.]
A sketch of this repair loop in code follows.
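A Python sketch of the repair loop on slides 34-38. For brevity it detects any cycle in the policy's successor graph rather than only probability-1 traps, and `successors(s, a)` is an illustrative assumption:

```python
def break_cycles(pi_hyb, pi_mbp, successors, goals):
    """Reassign one in-cycle state at a time to its MBP action
    until the greedy policy graph is cycle-free."""
    while True:
        cycle = find_cycle(pi_hyb, successors, goals)
        if cycle is None:
            return pi_hyb
        s = next(iter(cycle))        # pick any state on the cycle
        pi_hyb[s] = pi_mbp[s]        # MBP's action is goal-directed

def find_cycle(pi, successors, goals):
    """DFS over policy states; returns the states of a cycle, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {s: WHITE for s in pi}

    def dfs(s, stack):
        color[s] = GRAY
        stack.append(s)
        for s2 in successors(s, pi[s]):
            if s2 in goals or s2 not in pi:
                continue             # leaving the policy graph closes no cycle
            if color[s2] == GRAY:    # back edge: cycle found
                return set(stack[stack.index(s2):])
            if color[s2] == WHITE:
                found = dfs(s2, stack)
                if found:
                    return found
        color[s] = BLACK
        stack.pop()
        return None

    for s in pi:
        if color[s] == WHITE:
            found = dfs(s, [])
            if found:
                return found
    return None
```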
39. Error Bound
[Figure: the evaluated costs of πhyb over the gridworld.]
After the first trial: J(πhyb)(s0) = 5, while RTDP's current cost function gives J(s0) = 1
⇒ Error(πhyb) ≤ 5 − 1 = 4
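Why subtracting the two values bounds the error (a standard argument, assuming RTDP's cost function is initialized admissibly and hence stays a lower bound on J*):

```latex
J_n(s_0) \le J^{*}(s_0) \le J^{\pi_{hyb}}(s_0)
\;\Longrightarrow\;
\mathrm{Error}(\pi_{hyb}) = J^{\pi_{hyb}}(s_0) - J^{*}(s_0)
\le J^{\pi_{hyb}}(s_0) - J_n(s_0) = 5 - 1 = 4
```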
40. Termination
- when a policy with the required error bound is found
- when the planning time is exhausted
- when the available memory is exhausted
Properties
- outputs a proper policy
- anytime algorithm (once MBP terminates)
- HybPlan ≈ RTDP, if infinite resources are available
- HybPlan ≈ MBP, if resources are extremely limited
- HybPlan is better than both, otherwise
41. Outline
- Motivation
- Planning with Probabilistic Uncertainty (RTDP)
- Planning with Disjunctive Uncertainty (MBP)
- Hybridizing RTDP and MBP (HybPlan)
- Experiments
- Anytime Properties
- Scalability
- Conclusions and Future Work
42. Domains
- NASA Rover domain
- Factory domain
- Elevator domain
43-44. Anytime Properties
[Plots comparing the anytime behavior of HybPlan against RTDP.]
45. Scalability
[Plot: scalability results.]
46. Conclusions
- First algorithm that integrates disjunctive and probabilistic planners
- Experiments show that HybPlan
- is anytime
- scales better than RTDP
- produces better-quality solutions than MBP
- can interleave planning and execution
47. Hybridized Planning: A General Notion
- Hybridize other pairs of planners
- an optimal or close-to-optimal planner
- a sub-optimal but fast planner
- to yield a planner that produces a good-quality solution in intermediate running times
- Examples
- POMDP: RTDP/PBVI with POND/MBP/BBSP
- Oversubscription planning: A* with greedy solutions
- Concurrent MDP: sampled RTDP with single-action RTDP