Title: Stochastic Planning with Concurrent, Durative Actions
1. (4/3)
2. (FO)MDPs: The Plan
- The general model has no initial state, complex cost and reward functions, and finite/infinite/indefinite horizons
- Standard algorithms are Value and Policy Iteration
  - Have to look at the entire state space
- Can be made even more general with
  - Partial observability (POMDPs)
  - Continuous state spaces
  - Multiple agents (DEC-POMDPs/MDPs)
  - Durative actions
  - Concurrent MDPs
  - Semi-MDPs
- Directions
  - Efficient algorithms for special cases (TODAY, 4/10)
  - Combining learning of the model and planning with the model: Reinforcement Learning (4/8)
3. Markov Decision Process (MDP)
Slide annotations: value function = expected long-term reward from a state; Q-value Q(s,a) = expected long-term reward of doing a in s; V(s) = max_a Q(s,a); greedy policy w.r.t. a value function; value of a policy; optimal value function.
- S: a set of states
- A: a set of actions
- Pr(s'|s,a): transition model (aka M^a_{s,s'})
- C(s,a,s'): cost model
- G: set of goals
- s0: start state
- γ: discount factor
- R(s,a,s'): reward model
4. Examples of MDPs
- Goal-directed, Indefinite Horizon, Cost Minimization MDP: <S, A, Pr, C, G, s0>
  - Most often studied in the planning community
- Infinite Horizon, Discounted Reward Maximization MDP: <S, A, Pr, R, γ>
  - Most often studied in reinforcement learning
- Goal-directed, Finite Horizon, Probability Maximization MDP: <S, A, Pr, G, s0, T>
  - Also studied in the planning community
- Oversubscription Planning (non-absorbing goals), Reward Maximization MDP: <S, A, Pr, G, R, s0>
  - Relatively recent model
5. SSPP (Stochastic Shortest Path Problem): An MDP with Init and Goal States
- MDPs don't have a notion of an initial and a goal state (process orientation instead of task orientation)
  - Goals are sort of modeled by reward functions
    - Allows pretty expressive goals (in theory)
  - Normal MDP algorithms don't use initial-state information (since the policy is supposed to cover the entire state space anyway)
    - Could consider envelope extension methods
      - Compute a deterministic plan (which gives the policy for some of the states); extend the policy to other states that are likely to be encountered during execution
    - RTDP methods
- SSPPs are a special case of MDPs where
  - (a) the initial state is given
  - (b) there are absorbing goal states
  - (c) actions have costs; all states have zero rewards
- A proper policy for an SSPP is a policy that is guaranteed to ultimately put the agent in one of the absorbing states
- For an SSPP, it is worth finding a partial policy that only covers the relevant states (states that are reachable from the initial and goal states on any optimal policy)
  - Value/Policy Iteration don't consider the notion of relevance
  - Consider heuristic state-search algorithms
    - The heuristic can be seen as an estimate of the value of a state
6. Bellman Equations for Cost Minimization MDP (absorbing goals), also called Stochastic Shortest Path
- <S, A, Pr, C, G, s0>
- Define J*(s), the optimal cost, as the minimum expected cost to reach a goal from this state.
- J* should satisfy the following equation (a small value-iteration sketch follows below):
  J*(s) = 0 if s ∈ G
  J*(s) = min_a Q*(s,a) otherwise, where
  Q*(s,a) = Σ_{s'} Pr(s'|s,a) [ C(s,a,s') + J*(s') ]
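To make the backup concrete, here is a minimal value-iteration sketch for this cost-minimization formulation. The interface names (actions, transition, cost, goals) are illustrative assumptions for the sketch, not anything defined on the slides.

```python
# Minimal sketch: Bellman backup and value iteration for the SSP formulation above.

def bellman_backup(s, J, actions, transition, cost, goals):
    """Return min_a sum_s' Pr(s'|s,a)*(C(s,a,s') + J(s')) and the best action."""
    if s in goals:
        return 0.0, None
    best_q, best_a = float("inf"), None
    for a in actions(s):
        q = sum(p * (cost(s, a, s2) + J[s2]) for s2, p in transition(s, a).items())
        if q < best_q:
            best_q, best_a = q, a
    return best_q, best_a

def value_iteration(states, actions, transition, cost, goals, eps=1e-6):
    J = {s: 0.0 for s in states}            # could instead be initialized with a heuristic
    while True:
        residual = 0.0
        for s in states:                    # note: sweeps over the *entire* state space
            new_j, _ = bellman_backup(s, J, actions, transition, cost, goals)
            residual = max(residual, abs(new_j - J[s]))
            J[s] = new_j
        if residual < eps:
            break
    # greedy policy w.r.t. the converged J
    policy = {s: bellman_backup(s, J, actions, transition, cost, goals)[1] for s in states}
    return J, policy
```

The full sweep over all states is exactly the inefficiency that the heuristic-search methods later in the deck avoid.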
7. Bellman Equations for Infinite Horizon Discounted Reward Maximization MDP
- <S, A, Pr, R, s0, γ>
- Define V*(s), the optimal value, as the maximum expected discounted reward from this state.
- V* should satisfy the following equation:
  V*(s) = max_a Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ V*(s') ]
8. Bellman Equations for Probability Maximization MDP
- <S, A, Pr, G, s0, T>
- Define P*(s,t), the optimal probability, as the maximum probability of reaching a goal from this state within t timesteps.
- P* should satisfy the following equation:
  P*(s,t) = 1 if s ∈ G
  P*(s,0) = 0 if s ∉ G
  P*(s,t) = max_a Σ_{s'} Pr(s'|s,a) P*(s',t-1) otherwise
9. Modeling Softgoal Problems as Deterministic MDPs
- Consider the net-benefit problem, where actions have costs and goals have utilities, and we want a plan with the highest net benefit
- How do we model this as an MDP?
- (Wrong idea) Make every state in which any subset of goals holds into a sink state with reward equal to the cumulative sum of the utilities of those goals.
  - Problem: what if achieving g1 & g2 will necessarily lead you through a state where g1 is already true?
- (Correct version) Make a new fluent called "done" and a dummy action called Done-Deal. It is applicable in any state and asserts the fluent "done". All "done" states are sink states. Their reward is equal to the sum of the utilities of the goals that hold in that state.
10. Ideas for Efficient Algorithms..
- Use heuristic search (and reachability information)
  - LAO*, RTDP
- Use execution and/or simulation
  - Actual execution: Reinforcement Learning
    - (The main motivation for RL is to learn the model)
  - Simulation: simulate the given model to sample possible futures
    - Policy rollout, hindsight optimization, etc.
- Use factored representations
  - Factored representations for actions, reward functions, values, and policies
  - Directly manipulating factored representations during the Bellman update
11. Heuristic Search vs. Dynamic Programming (Value/Policy Iteration)
- VI and PI approaches use the Dynamic Programming update
  - Set the value of a state in terms of the maximum expected value achievable by doing actions from that state
  - They do the update for every state in the state space
    - Wasteful if we know the initial state(s) the agent is starting from
- Heuristic search (e.g. A*/AO*) explores only the part of the state space that is actually reachable from the initial state
- Even within the reachable space, heuristic search can avoid visiting many of the states
  - Depending on the quality of the heuristic used...
- But what is the heuristic?
  - An admissible heuristic is a lower bound on the cost to reach a goal from any given state
  - i.e., it is a lower bound on V*!
12. Connection with Heuristic Search
[Figure: three search graphs from s0 to G: a regular graph, an acyclic AND/OR graph, and a cyclic AND/OR graph]
13. Connection with Heuristic Search
[Figure: the same three graphs from s0 to G]
- Regular graph: solution is a (shortest) path (A*)
- Acyclic AND/OR graph: solution is an (expected-shortest) acyclic graph (AO*, Nilsson '71)
- Cyclic AND/OR graph: solution is an (expected-shortest) cyclic graph (LAO*, Hansen & Zilberstein '98)
- All algorithms are able to make effective use of reachability information!
- Sanity check: Why can't we handle the cycles by duplicate elimination as in A* search?
14. LAO* [Hansen & Zilberstein '98]
- Add s0 to the fringe and to the greedy graph
- Repeat:
  - Expand a state on the fringe (in the greedy graph)
  - Initialize all new states to their heuristic value
  - Perform value iteration for all expanded states
  - Recompute the greedy graph
- Until the greedy graph is free of fringe states
- Output the greedy graph as the final policy (a compact sketch of this loop follows below)
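Below is a compact LAO*-style sketch of the loop above. It assumes an explicit-MDP interface (actions, successors, cost, goals, and an admissible heuristic h); this is a simplified illustration of the idea, not Hansen and Zilberstein's implementation.

```python
# LAO*-style sketch: expand the greedy graph from s0, run VI over expanded states.

def lao_star(s0, actions, successors, cost, goals, h, eps=1e-6):
    J = {s0: h(s0)}              # value estimates, initialized by the heuristic
    expanded = set()

    def q(s, a):
        return sum(p * (cost(s, a, s2) + J.setdefault(s2, h(s2)))
                   for s2, p in successors(s, a).items())

    def greedy_action(s):
        return min(actions(s), key=lambda a: q(s, a))

    def greedy_graph():
        """States reachable from s0 under the current greedy policy."""
        reach, stack = set(), [s0]
        while stack:
            s = stack.pop()
            if s in reach or s in goals:
                continue
            reach.add(s)
            if s in expanded:
                stack.extend(successors(s, greedy_action(s)))
        return reach

    while True:
        fringe = [s for s in greedy_graph() if s not in expanded]
        if not fringe:
            break                                   # greedy graph is fringe-free
        expanded.add(fringe[0])                     # expand one fringe state
        while True:                                 # VI over the expanded states
            residual = 0.0
            for s in expanded:
                new_j = 0.0 if s in goals else min(q(s, a) for a in actions(s))
                residual = max(residual, abs(new_j - J[s]))
                J[s] = new_j
            if residual < eps:
                break
    return J, {s: greedy_action(s) for s in expanded}
```

For simplicity this expands one fringe state per iteration and runs full VI over the expanded set; the speedups on slide 25 (expand the whole fringe, partial VI/PI) modify exactly these two steps.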
15. LAO* Iteration 1
[Figure] Add s0 to the fringe and to the greedy graph.
16. LAO* Iteration 1
[Figure] Expand a state on the fringe in the greedy graph.
17. LAO* Iteration 1
[Figure] Initialize all new states to their heuristic values h; perform VI on the expanded states (J1).
18. LAO* Iteration 1
[Figure] Recompute the greedy graph.
19. LAO* Iteration 2
[Figure] Expand a state on the fringe; initialize the new states.
20. LAO* Iteration 2
[Figure] Perform VI (J2); compute the greedy policy.
21. LAO* Iteration 3
[Figure] Expand a fringe state.
22. LAO* Iteration 3
[Figure] Perform VI (J3); recompute the greedy graph.
23. LAO* Iteration 4
[Figure] The greedy graph after further expansion and VI (J4).
24. LAO* Iteration 4
[Figure] Stops when all nodes in the greedy graph have been expanded.
25. Comments
- Dynamic Programming + Heuristic Search
- Admissible heuristic ⇒ optimal policy
- Expands only part of the reachable state space
- Outputs a partial policy
  - One that is closed w.r.t. Pr and s0
- Speedups
  - Expand all states in the fringe at once
  - Perform policy iteration instead of value iteration
  - Perform partial value/policy iteration
  - Weighted heuristic: f = (1-w)·g + w·h
  - ADD-based symbolic techniques (symbolic LAO*)
26. AO* Search for Solving SSP Problems
- Main issues:
  - The cost of a node is the expected cost of its children
  - The AND tree can have LOOPS ⇒ cost backup is complicated
- Intermediate nodes are given admissible heuristic estimates
  - These can be just the shortest paths (or their estimates)
27. LAO* -- turning bottom-up labeling into a full DP
28. How to Derive Heuristics?
- The deterministic shortest route is a heuristic on the expected cost J*(s)
- But how do you compute it?
  - Idea 1: Most-likely-outcome determinization: consider only the most likely transition for each action
  - Idea 2: All-outcome determinization: for each stochastic action, make multiple deterministic actions that correspond to the various outcomes
  - Which is admissible? Which is more informed?
  - Idea 3: Sampling-based determinization
    - Construct a sample determinization by simulating each stochastic action to pick an outcome; find the cost of the shortest path in that determinization
    - Take multiple samples, and average the shortest-path costs
- Determinization involves converting AND arcs in the AND/OR graph into OR arcs (a sketch of the first two determinizations follows below)
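Here is a toy sketch of the first two determinizations above. Each probabilistic action is assumed to be a dict {"cost": c, "outcomes": [(prob, effect), ...]}, where "effect" stands in for whatever effect representation the planner uses; the layout and names are illustrative, not PPDDL or any planner's API.

```python
# Toy sketch: most-likely-outcome vs. all-outcome determinization.

def most_likely_determinization(prob_actions):
    """Keep only the single most likely outcome of each action."""
    det = {}
    for name, a in prob_actions.items():
        _, effect = max(a["outcomes"], key=lambda o: o[0])
        det[name] = {"cost": a["cost"], "effect": effect}
    return det

def all_outcome_determinization(prob_actions):
    """Make one deterministic action per outcome of each probabilistic action."""
    det = {}
    for name, a in prob_actions.items():
        for i, (_, effect) in enumerate(a["outcomes"]):
            det[f"{name}_outcome{i}"] = {"cost": a["cost"], "effect": effect}
    return det

# Example:
pickup = {"cost": 1, "outcomes": [(0.8, "holding"), (0.2, "dropped")]}
print(most_likely_determinization({"pickup": pickup}))
print(all_outcome_determinization({"pickup": pickup}))
```

On the slide's question: with costs preserved, every trajectory a policy can realize is a path in the all-outcome determinization, so its shortest-path cost is a lower bound on J*(s) and yields an admissible heuristic; the most-likely determinization is generally not admissible, but is often more informed.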
29. Real Time Dynamic Programming [Barto, Bradtke & Singh '95]
- Trial: simulate the greedy policy starting from the start state; perform a Bellman backup on each visited state (a sketch of one trial follows below)
- RTDP: repeat trials until the cost function converges
- Notice that you can also do the trial above by executing rather than simulating. In that case, we would be doing reinforcement learning. (In fact, RTDP was originally developed for reinforcement learning.)
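A minimal sketch of an RTDP trial as described above, reusing the same assumed interface as the earlier sketches (actions, successors, cost, goals, heuristic h); random sampling of the successor plays the role of the simulator.

```python
# RTDP sketch: follow the greedy action, backing up each visited state.
import random

def rtdp_trial(s0, J, actions, successors, cost, goals, h, max_steps=1000):
    s = s0
    for _ in range(max_steps):
        if s in goals:
            break
        # Bellman backup at the current state (unseen states initialized with h)
        def q(a):
            return sum(p * (cost(s, a, s2) + J.setdefault(s2, h(s2)))
                       for s2, p in successors(s, a).items())
        best_a = min(actions(s), key=q)
        J[s] = q(best_a)
        # simulate the greedy action to pick the next state
        succ = successors(s, best_a)
        s = random.choices(list(succ), weights=list(succ.values()))[0]
    return J

def rtdp(s0, actions, successors, cost, goals, h, trials=1000):
    J = {s0: h(s0)}
    for _ in range(trials):     # in practice: until J converges (or s0 is labeled solved)
        rtdp_trial(s0, J, actions, successors, cost, goals, h)
    return J
```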
30. RTDP Approach: Interleave Planning & Execution (Simulation)
- Start from the current state S. Expand the tree (either uniformly to k levels, or non-uniformly, going deeper in some branches).
- Evaluate the leaf nodes; back up the values to S. Update the stored value of S.
- Pick the action that leads to the best value. Do it (or simulate it). Loop back.
- Leaf nodes are evaluated by:
  - Using their cached values: if a node has been evaluated by RTDP in the past, use its remembered value; else use the heuristic value
  - If not, using heuristics to estimate them: (a) immediate reward values, (b) reachability heuristics
- Sort of like depth-limited game playing (expectimax)
  - Who is the game against?
- Can also do reinforcement learning this way: the transition probabilities (the M_ij) are not known correctly in RL
31. RTDP Trial
[Figure: one RTDP trial from s0. Q_{n+1}(s0,a) is computed from the successors' J_n values for each action a1, a2, a3; J_{n+1}(s0) is set to the best Q-value, the greedy action (here a2) is simulated, and the trial continues toward the goal.]
32. Greedy "On-Policy" RTDP Without Execution
- Using the current utility values, select the action with the highest expected utility (the greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back, until the values stabilize.
33. (no transcript)
34. Comments
- Properties
  - If all states are visited infinitely often, then Jn → J*
- Advantages
  - Anytime: more probable states are explored quickly
- Disadvantages
  - Complete convergence is slow!
  - No termination condition
35. Labeled RTDP [Bonet & Geffner '03]
- Initialize J0 with an admissible heuristic
  - ⇒ Jn increases monotonically
- Label a state as solved if Jn for that state has converged (a simplified sketch of this check follows below)
- Backpropagate the "solved" labeling
- Stop trials when they reach any solved state
- Terminate when s0 is solved
[Figure: once the successors of s under its best action have converged (the alternatives having high Q-costs), J(s) won't change, so s can be labeled solved; states s and t on a cycle get solved together.]
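A rough sketch in the spirit of the "solved" labeling above (a simplified checkSolved-style procedure): a state is labeled solved once every state reachable from it under the current greedy policy has a small Bellman residual. The helper arguments (backup, greedy_action, successors) are assumptions matching the earlier sketches, not Bonet and Geffner's actual code.

```python
# Simplified "solved" labeling check for Labeled RTDP.

def check_solved(s, J, solved, backup, greedy_action, successors, goals, eps=1e-6):
    """backup(x, J) returns the Bellman-backed-up value of x."""
    open_list, closed, converged = [s], set(), True
    while open_list:
        x = open_list.pop()
        if x in closed or x in solved or x in goals:
            continue
        closed.add(x)
        if abs(backup(x, J) - J.get(x, 0.0)) > eps:   # unseen states default to 0 here
            converged = False          # residual too large: x (and hence s) not solved yet
            continue                    # don't expand beyond unconverged states
        for s2 in successors(x, greedy_action(x, J)):
            open_list.append(s2)
    if converged:
        solved.update(closed)           # label the whole greedy envelope as solved
    else:
        for x in closed:
            J[x] = backup(x, J)         # otherwise back up the visited states
    return converged
```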
36. Properties
- Admissible J0 ⇒ optimal J*
- Heuristic-guided
  - Explores a subset of the reachable state space
- Anytime
  - Focuses attention on more probable states
- Fast convergence
  - Focuses attention on unconverged states
- Terminates in finite time
37. Recent Advances: Bounded RTDP [McMahan, Likhachev & Gordon '05]
- Associate with each state
  - a lower bound (lb), used for simulation
  - an upper bound (ub), used for policy computation
  - gap(s) = ub(s) - lb(s)
- Terminate a trial when gap(s) < ε
- Bias sampling towards unconverged states (a small sketch follows below)
  - proportional to Pr(s'|s,a)·gap(s')
- Perform backups in reverse order over the current trajectory.
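A small sketch of the gap-biased successor sampling described above: the next state is drawn with probability proportional to Pr(s'|s,a)·gap(s'). The data-structure names (successors dict, ub/lb dicts) are assumptions for the sketch.

```python
# Gap-biased successor sampling for Bounded RTDP (sketch).
import random

def sample_successor(s, a, successors, ub, lb):
    succ = successors(s, a)                              # {s': Pr(s'|s,a)}
    weights = [p * max(ub[s2] - lb[s2], 0.0) for s2, p in succ.items()]
    if sum(weights) == 0.0:                              # all successors converged
        return None                                      # caller can end the trial here
    return random.choices(list(succ), weights=weights)[0]
```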
38. Recent Advances: Focused RTDP [Smith & Simmons '06]
- Similar to Bounded RTDP, except:
  - a more sophisticated definition of priority that combines the gap and the probability of reaching the state
  - adaptively increasing the maximum trial length
Recent Advances: Learning DFS [Bonet & Geffner '06]
- An Iterative Deepening A* equivalent for MDPs
- Finds strongly connected components to check whether a state is solved
39. Other Advances
- Ordering the Bellman backups to maximize information flow
  - [Wingate & Seppi '05]
  - [Dai & Hansen '07]
- Partitioning the state space and combining value iterations from different partitions
  - [Wingate & Seppi '05]
  - [Dai & Goldsmith '07]
- External-memory version of value iteration
  - [Edelkamp, Jabbar & Bonet '07]
40. Policy Gradient Approaches [Williams '92]
- Direct policy search
  - Parameterized policy Pr(a|s,w)
  - No value function
  - Flexible memory requirements
- Policy gradient
  - J(w) = E_w[ Σ_{t=0..∞} γ^t c_t ]
  - Gradient descent (w.r.t. w)
  - Reaches a local optimum
  - Works in continuous/discrete spaces
[Figure: a parameterized (possibly non-stationary) policy with parameters w maps state s to a distribution Pr(a=a1|s,w), ..., Pr(a=ak|s,w) over actions, from which action a is chosen.]
41. Policy Gradient Algorithm
- J(w) = E_w[ Σ_{t=0..∞} γ^t c_t ]   (failure probability, makespan, ...)
- Minimize J by
  - computing the gradient
  - stepping the parameters away: w_{t+1} = w_t - α ∇J(w)
  - until convergence
- Gradient estimate [Sutton et al. '99; Baxter & Bartlett '01]
  - Monte Carlo estimate from a trace s1, a1, c1, ..., sT, aT, cT (a small sketch follows below)
    - e_{t+1} = e_t + ∇_w log Pr(a_{t+1}|s_t, w_t)
    - w_{t+1} = w_t - α γ^t c_t e_{t+1}
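A minimal sketch of the Monte Carlo eligibility-trace update above, using a softmax policy over a tabular state index for concreteness. The toy environment interface (env.reset / env.step returning (state, cost, done)) and all sizes are assumptions, not from the slides.

```python
# Eligibility-trace policy-gradient sketch (softmax policy, cost minimization).
import numpy as np

def softmax_policy(w, s):
    prefs = w[s]                               # w has shape (n_states, n_actions)
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def grad_log_pi(w, s, a):
    """Gradient of log Pr(a|s,w) for the softmax policy (nonzero only in row s)."""
    g = np.zeros_like(w)
    p = softmax_policy(w, s)
    g[s] = -p
    g[s, a] += 1.0
    return g

def policy_gradient(env, n_states, n_actions, episodes=500, alpha=0.01, gamma=0.95):
    w = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done, t = env.reset(), False, 0
        e = np.zeros_like(w)                   # eligibility trace
        while not done:
            a = np.random.choice(n_actions, p=softmax_policy(w, s))
            s2, cost, done = env.step(a)
            e += grad_log_pi(w, s, a)          # e_{t+1} = e_t + grad_w log Pr(a_t|s_t,w)
            w -= alpha * (gamma ** t) * cost * e   # w_{t+1} = w_t - alpha * gamma^t * c_t * e_{t+1}
            s, t = s2, t + 1
    return w
```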
42. Policy Gradient Approaches
- Often used in reinforcement learning
  - Partial observability
  - Model-free (Pr(s'|s,a) and Pr(o|s) are unknown)
  - Learns a policy from observations and costs
[Figure: a reinforcement learner with parameterized policy Pr(a|o,w) interacts with the world/simulator (which implements Pr(s'|s,a) and Pr(o|s)): it sends action a and receives observation o and cost c.]
43. Modeling Complex Problems
- Modeling time
  - a continuous variable in the state space
  - discretization issues
  - large state space
- Modeling concurrency
  - many actions may execute at once
  - large action space
- Modeling time and concurrency
  - large state and action space!!
[Figure: two plots of J(s) as a function of time t]
44. Ideas for Efficient Algorithms..
- Use heuristic search (and reachability information)
  - LAO*, RTDP
- Use execution and/or simulation
  - Actual execution: Reinforcement Learning
    - (The main motivation for RL is to learn the model)
  - Simulation: simulate the given model to sample possible futures
    - Policy rollout, hindsight optimization, etc.
- Use factored representations
  - Factored representations for actions, reward functions, values, and policies
  - Directly manipulating factored representations during the Bellman update
45. Factored Representations: Actions
- Actions can be represented directly in terms of their effects on the individual state variables (fluents). The CPTs of the BNs can be represented compactly too!
- Write a Bayes network relating the values of the fluents in the state before and after the action
  - Bayes networks representing fluents at different time points are called Dynamic Bayes Networks
  - We look at 2TBNs (2-time-slice dynamic Bayes nets)
- Go further by using the STRIPS assumption
  - Fluents not affected by the action are not represented explicitly in the model
  - Called the Probabilistic STRIPS Operator (PSO) model (a small sketch follows below)
46. Action CLK [figure]
47. Envelope Extension Methods
- For each action, take the most likely outcome and discard the rest.
- Find a plan (deterministic path) from the initial state to the goal state. This is a (very partial) policy, defined just for the states that fall on the maximum-probability state sequence.
- Consider the states that are most likely to be encountered while traveling this path.
- Find a policy for those states too.
- The tricky part is to show that we can converge to the optimal policy.
48. (no transcript)
49. (no transcript)
50. (no transcript)
51. Factored Representations: Reward, Value, and Policy Functions
- Reward functions can be represented in factored form too. Possible representations include:
  - Decision trees (made up of fluents)
  - ADDs (algebraic decision diagrams)
- Value functions are like reward functions (so they too can be represented similarly)
- The Bellman update can then be done directly on the factored representations (a toy decision-tree example follows below)
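A toy example of a factored reward function as a decision tree over boolean fluents, as listed above: internal nodes test a fluent, leaves hold rewards. The nested-tuple encoding is an illustrative choice, not SPUDD's ADD format.

```python
# Decision-tree reward function over fluents (sketch).

# ("fluent", true_branch, false_branch); a bare number is a leaf.
reward_tree = ("delivered",
                  10.0,                       # delivered           -> reward 10
               ("holding_package",
                  1.0,                        # not delivered, held -> reward 1
                  0.0))                       # otherwise           -> reward 0

def eval_tree(tree, state):
    """Walk the decision tree using the fluent values in `state`."""
    while isinstance(tree, tuple):
        fluent, if_true, if_false = tree
        tree = if_true if state.get(fluent, False) else if_false
    return tree

print(eval_tree(reward_tree, {"holding_package": True}))   # -> 1.0
```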
52. (no transcript)
53. SPUDD's use of ADDs
54. Direct manipulation of ADDs in SPUDD
55. (no transcript)
56. (no transcript)
57. (no transcript)
58. Policy Rollout: Time Complexity
[Figure: trajectories under PR[π,h,w] from state s]
- To compute PR[π,h,w](s), for each action we need to compute w trajectories of length h
- Total of |A|·h·w calls to the simulator
59. Policy Rollout
- Often π' is significantly better than π, i.e. one step of policy iteration can provide substantial improvement
- Using simulation to approximate π' is known as policy rollout (a sketch follows below)
- PolicyRollout(s, π, h, w)
  - FOR each action a: Q(a) = EstimateQ(s, a, π, h, w)
  - RETURN arg max_a Q(a)
- We will denote the rollout policy by PR[π,h,w]
- Note that PR[π,h,w] is stochastic
- Questions
  - What is the complexity of computing PR[π,h,w](s)?
  - Can we approximate k iterations of policy iteration using sampling?
  - How good is the rollout policy compared to the policy-iteration improvement?
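A minimal sketch of the PolicyRollout procedure above: estimate Q(s,a) by simulating action a once and then following the base policy for the remaining steps, averaged over w trajectories, and return the action with the best estimate. The simulator interface simulate(s, a) -> (next_state, reward) is an assumption for the sketch.

```python
# Policy rollout sketch (reward maximization, as on the slide).

def estimate_q(s, a, base_policy, h, w, simulate, gamma=1.0):
    total = 0.0
    for _ in range(w):                          # w sampled trajectories
        s2, r = simulate(s, a)
        ret, discount = r, gamma
        for _ in range(h - 1):                  # then follow the base policy
            a2 = base_policy(s2)
            s2, r = simulate(s2, a2)
            ret += discount * r
            discount *= gamma
        total += ret
    return total / w

def policy_rollout(s, base_policy, h, w, actions, simulate, gamma=1.0):
    q = {a: estimate_q(s, a, base_policy, h, w, simulate, gamma) for a in actions(s)}
    return max(q, key=q.get)                    # arg max_a Q(a); |A|*h*w simulator calls
```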
60. Multi-Stage Policy Rollout
[Figure: trajectories under PR[PR[π,h,w],h,w] from state s; each step of the inner rollout requires |A|·h·w simulator calls]
- Approximates the policy resulting from two steps of PI
- Requires (|A|·h·w)^2 calls to the simulator
- In general, exponential in the number of PI iterations
61. Policy Rollout Quality
- How good is PR[π,h,w](s) compared to π'?
- In general, for a fixed h and w there is always an MDP such that the quality of the rollout policy is arbitrarily worse than π'.
- If we make an assumption about the MDP, then it is possible to select h and w so that the rollout quality is close to π'.
  - This is a bit involved.
- In your homework you will solve a related problem:
  - Choose h and w so that with high probability PR[π,h,w](s) selects an action that maximizes Q^π(s,a)
62. Rollout Summary
- We are often able to write simple, mediocre policies
  - Network routing policy
  - Policy for the card game Hearts
  - Policy for Backgammon
  - Solitaire-playing policy
- Policy rollout is a general and easy way to improve upon such policies
- We often observe substantial improvement, e.g. in
  - Compiler instruction scheduling
  - Backgammon
  - Network routing
  - Combinatorial optimization
  - The game of Go
  - Solitaire
63. Example: Rollout for Go
- Go is a popular board game for which there are still no highly-ranked computer programs
  - High branching factor
  - Difficult to design heuristics
  - Unlike Chess, where the best computers compete with the best humans
- What if we play Go using level-1 rollouts of the random policy?
  - With a few additional tweaks you get the best Go program to date! (CrazyStone, winner of the 11th Computer Go Olympiad) http://remi.coulom.free.fr/CrazyStone/
64. FF-Replan: A Baseline for Probabilistic Planning
- Sungwook Yoon
- Alan Fern
- Robert Givan
65. Replanning Approach
- A deterministic planner for probabilistic planning?
- Winner of IPPC-2004 and (unofficial) winner of IPPC-2006
- Why was it conceived?
- Why did it work?
  - Domain-by-domain analysis
- Any extensions?
66. IPPC-2004 Pre-released Domains
- Blocksworld
- Boxworld
67. IPPC Performance Test
- Client-server interaction
- The problem definition is known a priori
- Performance is recorded in the server log
- For each problem, 30 repeated test runs are conducted
68. Single-Outcome Replanning (FFR_s)
- A natural approach given the competition setting and the domains; cf. Intro to AI (Russell and Norvig)
- Hash the state-action mapping
- Replace probabilistic effects with a single deterministic effect
- Ground the goal
[Figure: a probabilistic action with Effects 1, 2, 3 (Probabilities 1, 2, 3, leading to states A, B, C) is replaced by a deterministic action with only Effect 2.]
69. IPPC-2004 Domains
- Blocksworld
- Boxworld
- Fileworld
- Tireworld
- Tower of Hanoi
- ZenoTravel
- Exploding Blocksworld
70. IPPC-2004 Results

Domain    | NMRC | J1  | Classy | NMR | mGPT | C  | FFR_s | FFR_a
BW        | 252  | 270 | 255    | 30  | 120  | 30 | 210   | 270
Box       | 134  | 150 | 100    | 0   | 30   | 0  | 150   | 150
File      | -    | -   | -      | 3   | 30   | 3  | 14    | 29
Zeno      | -    | -   | -      | 30  | 30   | 30 | 0     | 30
Tire-r    | -    | -   | -      | 30  | 30   | 30 | 30    | 30
Tire-g    | -    | -   | -      | 9   | 16   | 30 | 7     | 7
TOH       | -    | -   | -      | 15  | 0    | 0  | 0     | 11
Exploding | -    | -   | -      | 0   | 0    | 0  | 3     | 5

(Numbers are successful runs. Annotations on the original slide mark the columns that used human control knowledge, the column that used learned knowledge, and the 2nd-place winners.)
- NMR: Non-Markovian Reward Decision Process Planner
- Classy: Approximate Policy Iteration with a Policy Language Bias
- mGPT: Heuristic Search Probabilistic Planning
- C: Symbolic Heuristic Search
71. Reasons for the Success
- Determinization and efficient pre-processing of a complex planning language
  - The input language is quite complex (PPDDL)
  - Classical planning has developed efficient preprocessing techniques for complex input languages and scales well
- Grounding the goal also helped
  - Classical planners have a hard time dealing with lifted goals
- The domains in the competition
  - 17 of 20 problems were dead-end free
  - Amenable to the replanning approach
72. All-Outcome Replanning (FFR_a)
- Selecting one outcome is troublesome
  - Which outcome to take?
- Let's use all the outcomes
- All we have to do is translate each deterministic action back to the original probabilistic action during the server-client interaction with MDPSIM
- A novel approach
[Figure: a probabilistic action with Effects 1, 2, 3 (Probabilities 1, 2, 3) is split into three deterministic actions Action1, Action2, Action3, one per effect. A small sketch of this translation appears below.]
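A toy sketch of the idea above: split each probabilistic action into one deterministic action per outcome, and keep a map so that a deterministic action chosen by the planner can be translated back to the original probabilistic action when talking to the simulator. The data layout and names are assumptions, not FF-Replan's or MDPSIM's actual interface.

```python
# All-outcome compilation with a back-translation map (sketch).

def all_outcome_compile(prob_actions):
    """prob_actions: {name: [(prob, effect), ...]} -> (det_actions, back_map)."""
    det_actions, back_map = {}, {}
    for name, outcomes in prob_actions.items():
        for i, (_, effect) in enumerate(outcomes):
            det_name = f"{name}__o{i}"
            det_actions[det_name] = effect      # deterministic version of outcome i
            back_map[det_name] = name           # translate back when executing
    return det_actions, back_map

det, back = all_outcome_compile({"pickup": [(0.9, "holding"), (0.1, "dropped")]})
# The deterministic planner plans with `det`; when it chooses "pickup__o0",
# the replanner sends back["pickup__o0"] == "pickup" to the simulator.
print(det, back["pickup__o0"])
```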
73. IPPC-2006 Domains
- Blocksworld
- Exploding Blocksworld
- ZenoTravel
- Tireworld
- Elevator
- Drive
- PitchCatch
- Schedule
- Random
  - Randomly generate a syntactically correct domain
    - E.g., don't delete facts that are not in the precondition
  - Randomly generate a state; this is the initial state
  - Take a random walk from that state, using the random domain; the resulting state is a goal state
    - So there is at least one path from the initial state to the goal state
  - If the probability of the path is bigger than a threshold α, stop; otherwise take a random walk again
  - A special reset action is provided that takes any state back to the initial state
74. IPPC-2006 Results (numbers are percentages of successful runs)

Domain     | FFR_a | FPG | FOALP | sfDP | Paragraph | FFR_s
BW         | 86    | 63  | 100   | 29   | 0         | 77
Zenotravel | 100   | 27  | 0     | 7    | 7         | 7
Random     | 100   | 65  | 0     | 0    | 5         | 73
Elevator   | 93    | 76  | 100   | 0    | 0         | 93
Exploding  | 52    | 43  | 24    | 31   | 31        | 52
Drive      | 71    | 56  | 0     | 0    | 9         | 0
Schedule   | 51    | 54  | 0     | 0    | 1         | 0
PitchCatch | 54    | 23  | 0     | 0    | 0         | 0
Tire       | 82    | 75  | 82    | 0    | 91        | 69

- FPG: Factored Policy Gradient Planner
- FOALP: First Order Approximate Linear Programming
- sfDP: Symbolic Stochastic Focused Dynamic Programming with Decision Diagrams
- Paragraph: A Graphplan-Based Probabilistic Planner
75. Discussion
- The novel all-outcome replanning technique outperforms the naïve replanner
- The replanner performed well even on the genuinely probabilistic domains
  - Drive
  - The complexity of the domain might have contributed to this phenomenon
- The replanner did not win the domains where it was supposed to be at its very best
  - Blocksworld
76. Weaknesses of Replanning
- Ignorance of the probabilistic effects
  - Should try not to use actions with detrimental effects
  - Detrimental effects can sometimes easily be found
- Ignorance of prior planning during replanning
  - Plan stability work by Fox, Gerevini, Long and Serina
- No learning
  - There is an obvious learning opportunity, since it solves a problem repeatedly
77. Potential Improvements
- Intelligent replanning
  - Policy rollout
  - Policy learning
  - Hindsight optimization
- During (determinized) planning, when a previously seen state is met, stop planning
  - May reduce the replanning time
- The hashed state-action mapping can be viewed as a partial policy
  - Currently, the mapping is always fixed
  - When there is a failure, we can update the policy, that is, penalize the state-actions in the failure trajectory
  - During planning, try not to use those actions in those states
  - E.g., after an explosion in Exploding Blocksworld, do not use the putdown action
[Figure: hindsight optimization. From the current state, each candidate action A1, A2, ... is evaluated by averaging the rewards of FF-Replan runs on sampled futures (the action outcomes that will really happen); the max-scoring action is selected and the process repeats until the goal state.]