Title: Probabilistic Planning via Determinization in Hindsight (FF-Hindsight)
1. Probabilistic Planning via Determinization in Hindsight (FF-Hindsight)
- Sungwook Yoon
- Joint work with
- Alan Fern, Bob Givan and Rao Kambhampati
2. Probabilistic Planning Competition
[Diagram: the client (a participant) sends actions to the server (the competition host), which simulates the actions.]
3. The Winner Was
- FF-Replan
  - A replanner that uses FF
  - The probabilistic domain is determinized
- Interesting contrast
  - Many probabilistic planning techniques work in theory but not in practice
  - FF-Replan has no theory, yet works in practice
4. The Paper's Objectives
- A better determinization approach (determinization in hindsight)
- Theoretical analysis of the new determinization (in hindsight)
- A new view of FF-Replan
- Experimental studies with determinization in hindsight (FF-Hindsight)
5. Probabilistic Planning (goal-oriented)
[Figure: a lookahead tree from the initial state I. Actions A1 and A2 each have probabilistic outcomes at Time 1 and Time 2, with the left outcomes more likely; some leaves are dead ends, others are goal states. The objective is to maximize goal achievement.]
6. All-Outcome Replanning (FFRA), ICAPS-07
[Figure: an action with two probabilistic effects, Effect 1 (Probability 1) and Effect 2 (Probability 2), is split into two deterministic actions: Action1, which always yields Effect 1, and Action2, which always yields Effect 2. A minimal sketch of this determinization follows.]
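A minimal sketch of all-outcome determinization, assuming hypothetical `ProbAction`/`DetAction` containers rather than the planner's actual PPDDL machinery:

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

@dataclass(frozen=True)
class ProbAction:
    """A probabilistic action: one precondition, several weighted effects."""
    name: str
    precond: FrozenSet[str]
    outcomes: Tuple[Tuple[float, FrozenSet[str]], ...]  # (probability, effect literals)

@dataclass(frozen=True)
class DetAction:
    """A deterministic action produced by determinization."""
    name: str
    precond: FrozenSet[str]
    effect: FrozenSet[str]

def all_outcome_determinize(actions: List[ProbAction]) -> List[DetAction]:
    """Split every probabilistic effect into its own deterministic action.

    The probabilities are dropped entirely, which is exactly why the static
    determinization used by FF-Replan ignores how likely each outcome is.
    """
    det_actions = []
    for a in actions:
        for i, (_prob, effect) in enumerate(a.outcomes, start=1):
            det_actions.append(DetAction(f"{a.name}-{i}", a.precond, effect))
    return det_actions
```

Applied to the action in the figure, this yields Action1 (always Effect 1) and Action2 (always Effect 2).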
7. Probabilistic Planning: All-Outcome Determinization
[Figure: the same lookahead tree, but each probabilistic action A1, A2 is replaced by its deterministic outcome actions A1-1, A1-2, A2-1, A2-2 at Time 1 and Time 2; the planner then searches this determinized tree to find a goal state.]
9. Problem of FF-Replan and a Better Alternative: Sampling
FF-Replan's static determinization does not respect the outcome probabilities. We need a probabilistic, dynamic determinization: sample future outcomes and determinize in hindsight. Each sampled future becomes a known-future deterministic problem (a minimal sketch follows).
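One way to realize a sampled future as a known-future deterministic problem, sketched with assumed helper names (`mdp.outcomes`, `SampledFuture`); this illustrates the idea, not the paper's implementation:

```python
import random
from typing import Dict, Tuple

class SampledFuture:
    """A future fixes the outcome of every (state, action, time) query in advance,
    so planning against it is an ordinary deterministic problem."""

    def __init__(self, mdp, horizon: int, rng: random.Random):
        self.mdp = mdp          # assumed to expose outcomes(state, action) -> [(prob, next_state)]
        self.horizon = horizon
        self.rng = rng
        self._fixed: Dict[Tuple, object] = {}

    def next_state(self, state, action, t):
        """Return the (lazily) fixed successor for (state, action, t)."""
        key = (state, action, t)
        if key not in self._fixed:
            probs, succs = zip(*[(p, s) for p, s in self.mdp.outcomes(state, action)])
            self._fixed[key] = self.rng.choices(succs, weights=probs, k=1)[0]
        return self._fixed[key]
```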
10. Probabilistic Planning (goal-oriented)
[Figure: the same lookahead tree as Slide 5, shown again before sampling begins.]
11. Start Sampling
Note: sampling will reveal which action is better at state I, A1 or A2.
12. Hindsight Sample 1
[Figure: the first sampled future over the same tree; under this future, A1 reaches the goal and A2 does not. Running tally: A1: 1, A2: 0.]
13. Hindsight Sample 2
[Figure: the second sampled future. Running tally: A1: 2, A2: 1.]
14. Hindsight Sample 3
[Figure: the third sampled future. Running tally: A1: 2, A2: 1.]
15. Hindsight Sample 4
[Figure: the fourth sampled future. Running tally: A1: 3, A2: 1.]
16. Summary of the Idea: The Decision Process (Estimating the Q-Value, Q(s,a))
s: the current state; each action a maps s to a successor state a(s) ∈ S.
1. For each action a, draw future samples. Each sample is a deterministic planning problem.
2. Solve the deterministic problems. For goal-oriented problems, the solution length is used to score Q(s,a).
3. Aggregate the solutions for each action into Q(s,a).
4. Select the action with the best aggregated value, argmax_a Q(s,a). (A minimal sketch follows.)
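A minimal sketch of this decision process, with `sample_future` and `solve_deterministic` (e.g., a call out to FF on the determinized problem) left as placeholders. The aggregation shown counts goal-reaching futures, as in the running tallies of Slides 12-15; the paper also uses solution length for goal-oriented problems:

```python
import random

def hindsight_action(state, actions, horizon, num_samples,
                     sample_future, solve_deterministic, rng=random):
    """Pick an action by estimated hindsight Q-value Q(s, a).

    For each action: draw independent futures, solve the resulting known-future
    deterministic problems, and aggregate the per-future results (here, the
    fraction of futures in which a plan to the goal was found).
    """
    q = {}
    for a in actions:
        solved = 0
        for _ in range(num_samples):
            future = sample_future(state, horizon)        # sampled independently per action
            plan = solve_deterministic(state, a, future)  # deterministic planner call (e.g., FF)
            solved += plan is not None
        q[a] = solved / num_samples
    best = max(q.values())
    # Random tie breaking among the best actions is essential (see Slide 29).
    return rng.choice([a for a in actions if q[a] == best])
```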
17. Mathematical Summary of the Algorithm
- An H-horizon future $F_H$ for the MDP $M = (S, A, T, R)$ is a mapping from state, action, and time ($h < H$) to a state: $S \times A \times h \to S$.
- Value of a policy $\pi$ under $F_H$: $R(s, F_H, \pi)$.
- Hindsight value: $V_{HS}(s,H) = E_{F_H}[\max_{\pi} R(s, F_H, \pi)]$.
- Compare this with the real value: $V(s,H) = \max_{\pi} E_{F_H}[R(s, F_H, \pi)]$.
- $V_{FFRa}(s) = \max_{F} V(s,F) \ge V_{HS}(s,H) \ge V(s,H)$ (see the note below).
- $Q(s,a,H) = R(a) + E_{F_{H-1}}[\max_{\pi} R(a(s), F_{H-1}, \pi)]$.
- In our proposal, the computation of $\max_{\pi} R(s, F_{H-1}, \pi)$ is done approximately by FF [Hoffmann and Nebel 2001].
Each future is a deterministic problem, solved by FF.
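The inequality chain above rests on exchanging the maximization and the expectation; a short, standard justification (not specific to this paper):

```latex
% The max of expectations is bounded by the expectation of the max, and an
% expectation over futures is bounded by the best single future:
\max_{\pi} E_{F_H}\!\left[R(s, F_H, \pi)\right]
  \;\le\; E_{F_H}\!\left[\max_{\pi} R(s, F_H, \pi)\right]
  \;\le\; \max_{F}\max_{\pi} R(s, F, \pi),
\qquad\text{i.e.}\qquad
V(s,H) \;\le\; V_{HS}(s,H) \;\le\; \max_{F} V(s,F) = V_{FFRa}(s).
```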
18. Key Technical Results
- The importance of independent sampling across states, actions, and time
- The necessity of random tie breaking in decision making
- We characterize FF-Replan in terms of hindsight decision making: $V_{FFRa}(s) = \max_F V(s,F)$
- Theorem 1: when there is a policy that achieves the goal with probability 1 within the horizon, the hindsight decision-making algorithm reaches the goal with probability 1.
- Theorem 2: a polynomial number of samples suffices with respect to the horizon, the number of actions, and the minimum Q-value advantage.
19. Empirical Results
[Table: IPPC-04 problems; the numbers are counts of solved trials.]
For ZenoTravel, using importance sampling improved the number of solved trials to 26.
20. Empirical Results
These domains were developed specifically to beat FF-Replan. As expected, FF-Replan did not do well, but FF-Hindsight did very well, showing probabilistic reasoning ability while retaining scalability.
21. Conclusion
[Figure: determinization connects the deterministic planning side (classical planning, machine learning for planning, net-benefit optimization, temporal planning) to the probabilistic planning side (Markov decision processes, machine learning for MDPs, temporal MDPs), transferring scalability.]
22. Conclusion
- Devised an algorithm that exploits the significant advances in deterministic planning in the context of probabilistic planning
- Made many deterministic planning techniques available to probabilistic planning
- Most learning-for-planning techniques were developed solely for deterministic planning; these techniques are now relevant to probabilistic planning too
- Advanced net-benefit style planners can be used for reward-maximization style probabilistic planning problems
23. Discussion
- Mercier and Van Hentenryck analyzed the difference between $V(s,H) = \max_{\pi} E_{F_H}[R(s,F_H,\pi)]$ and $V_{HS}(s,H) = E_{F_H}[\max_{\pi} R(s,F_H,\pi)]$.
- Ng and Jordan analyzed the difference between $V(s,H) = \max_{\pi} E_{F_H}[R(s,F_H,\pi)]$ and its sample estimate $\hat{V}(s,H) = \max_{\pi} \frac{1}{m}\sum_{i=1}^{m} R(s,F_H^{(i)},\pi)$, where $m$ is the number of samples.
24. IPPC-2004 Results
[Table: the numbers are successful runs. Winner of IPPC-04: FFRs. Entries using human control knowledge, entries using learned knowledge, and the second-place winners are annotated.]
25. IPPC-2006 Results
[Table: the numbers are percentages of successful runs. Unofficial winner of IPPC-06: FFRa.]
27. Sampling Problem: Time Dependency Issue
[Figure: a small example with states Start, S1, S2, S3, a Goal, and a Dead End. From Start, actions A and B lead to different intermediate states; from there, actions C and D reach the Goal or the Dead End with probabilities p and 1-p, and these probabilities differ between the branch through S1/S2 and the branch through S3.]
28. Sampling Problem: Time Dependency Issue
[Figure: the same example as Slide 27, without the probability annotations.]
S3 is a worse state than S1, but under a shared sampled future it looks as if there is always a path to the Goal. We need to sample independently across actions.
29. Action Selection Problem: Random Tie Breaking Is Essential
[Figure: from Start, action A always stays in Start, while actions B and C lead toward S1 or the Goal, each succeeding with probability p and failing with probability 1-p.]
In the Start state, action C is clearly better, but A can be used to wait until the sampled future in which C's Goal-reaching effect is realized, so the two can tie in hindsight value; random tie breaking keeps the agent from waiting forever.
30. Sampling Problem: Importance Sampling (IS)
[Figure: from Start, action B leads to S1 with very high probability and to the Goal with extremely low probability.]
- Sampling uniformly would find the problem unsolvable.
- Use importance sampling (a sketch follows).
- Identifying the region that needs importance sampling is left for further study.
- In the benchmarks, ZenoTravel needs the IS idea.
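A hedged sketch of generic importance sampling over action outcomes; the proposal distribution and weight bookkeeping below are illustrative, not necessarily the exact scheme used for ZenoTravel:

```python
import random

def sample_outcome_importance(outcomes, proposal, rng=random):
    """Sample one outcome from `proposal` instead of the true distribution.

    `outcomes` and `proposal` are lists of (probability, next_state) over the
    same next_states; the returned weight p/q corrects the sampling bias so
    that weighted estimates remain unbiased.
    """
    q_probs = [q for q, _ in proposal]
    idx = rng.choices(range(len(proposal)), weights=q_probs, k=1)[0]
    p, next_state = outcomes[idx]
    q, _ = proposal[idx]
    return next_state, p / q  # importance weight
```

For the figure above, proposing the Goal outcome with, say, probability 0.5 instead of its tiny true probability makes it actually appear among the sampled futures, while the weight p/q keeps the resulting estimates unbiased.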
31. Theoretical Results
- Theorem 1: For goal-achieving probabilistic planning problems, if there is a policy that solves the problem with probability 1 within a bounded horizon, then hindsight planning solves the problem with probability 1. If there is no such policy, hindsight planning returns a success ratio of less than 1.
  - If there is a future under which no plan can achieve the goal, that future can be sampled.
- Theorem 2: The number of future samples w needed to correctly identify the best action satisfies $w > 4\Delta^{-2} T \ln(AH/\delta)$, where $\Delta$ is the minimum Q-value advantage of the best action over the other actions and $\delta$ is the confidence parameter. The bound follows from the Chernoff bound (an illustrative instantiation follows).
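A purely illustrative instantiation of the Theorem 2 bound, with assumed values $\Delta = 0.1$, $T = 20$, $A = 10$, $H = 20$, $\delta = 0.05$ (these numbers are not from the paper):

```latex
w \;>\; 4\,\Delta^{-2}\, T \,\ln\!\frac{AH}{\delta}
  \;=\; 4 \cdot 100 \cdot 20 \cdot \ln\!\frac{10 \cdot 20}{0.05}
  \;=\; 8000 \cdot \ln 4000 \;\approx\; 8000 \cdot 8.29 \;\approx\; 6.6 \times 10^{4},
% so the required number of sampled futures grows polynomially in the problem
% parameters and only logarithmically in A, H, and 1/\delta.
```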
32. Probabilistic Planning: Expectimax Solution
[Figure: the exact expectimax tree for the same problem: Max nodes over actions alternate with Expectation nodes over probabilistic outcomes at Time 1 and Time 2, terminating in goal states and dead ends. The objective is to maximize goal achievement.]