Stochastic Planning with Concurrent, Durative Actions

Transcript and Presenter's Notes

Title: Stochastic Planning with Concurrent, Durative Actions


1
4/3
2
(FO)MDPs: The plan
  • General model has no initial state, complex cost
    and reward functions, and finite/infinite/indefinite
    horizons
  • Standard algorithms are Value and Policy
    iteration
  • Have to look at the entire state space
  • Can be made even more general with
  • Partial observability (POMDPs)
  • Continuous state spaces
  • Multiple agents (DECPOMDPS/MDPS)
  • Durative actions
  • Concurrent MDPs
  • Semi-MDPs
  • Directions
  • Efficient algorithms for special cases
  • TODAY 4/10
  • Combining Learning of the model and planning
    with the model
  • Reinforcement Learning 4/8

3
Markov Decision Process (MDP)
Value function: expected long-term reward from the state.
Q values: expected long-term reward of doing a in s.
V(s) = max_a Q(s,a).
Greedy policy w.r.t. a value function. Value of a policy.
Optimal value function.
  • S: a set of states
  • A: a set of actions
  • Pr(s'|s,a): transition model
  • (aka M^a_{s,s'})
  • C(s,a,s'): cost model
  • G: set of goals
  • s0: start state
  • γ: discount factor
  • R(s,a,s'): reward model

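As a concrete companion to these components, here is a minimal Python sketch (my illustration, not code from the slides) of an MDP container together with the Q-value and greedy-policy definitions given above; the dictionary-based interface is an assumption made for the example.

class MDP:
    def __init__(self, states, actions, trans, cost, goals, s0, gamma=1.0):
        self.S, self.A = states, actions      # A[s] = actions applicable in s
        self.P = trans                        # P[(s, a)] = list of (s_next, prob)
        self.C = cost                         # C[(s, a, s_next)] = cost of the transition
        self.G, self.s0, self.gamma = goals, s0, gamma

    def q(self, J, s, a):
        """Expected long-term cost of doing a in s, then following the values J."""
        return sum(p * (self.C[(s, a, t)] + self.gamma * J[t])
                   for t, p in self.P[(s, a)])

    def greedy_action(self, J, s):
        """Greedy policy w.r.t. the value function J (cost-minimization version)."""
        return min(self.A[s], key=lambda a: self.q(J, s, a))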
4
Examples of MDPs
  • Goal-directed, Indefinite Horizon, Cost
    Minimization MDP
  • <S, A, Pr, C, G, s0>
  • Most often studied in the planning community
  • Infinite Horizon, Discounted Reward Maximization
    MDP
  • <S, A, Pr, R, γ>
  • Most often studied in reinforcement learning
  • Goal-directed, Finite Horizon, Prob. Maximization
    MDP
  • <S, A, Pr, G, s0, T>
  • Also studied in the planning community
  • Oversubscription Planning: non-absorbing goals,
    Reward Max. MDP
  • <S, A, Pr, G, R, s0>
  • Relatively recent model

5
SSP (Stochastic Shortest Path Problem): an MDP with
Init and Goal states
  • MDPs don't have a notion of an initial and
    goal state. (Process orientation instead of
    task orientation)
  • Goals are sort of modeled by reward functions
  • Allows pretty expressive goals (in theory)
  • Normal MDP algorithms don't use initial state
    information (since the policy is supposed to cover
    the entire state space anyway).
  • Could consider envelope extension methods
  • Compute a deterministic plan (which gives the
    policy for some of the states); extend the policy
    to other states that are likely to happen during
    execution
  • RTDP methods
  • SSPs are a special case of MDPs where
  • (a) the initial state is given
  • (b) there are absorbing goal states
  • (c) actions have costs; all states have zero
    rewards
  • A proper policy for an SSP is a policy which is
    guaranteed to ultimately put the agent in one of
    the absorbing states
  • For SSPs, it is worth finding a partial
    policy that only covers the relevant states
    (states that are reachable from init and goal
    states on any optimal policy)
  • Value/Policy Iteration don't consider the notion
    of relevance
  • Consider heuristic state-search algorithms
  • The heuristic can be seen as an estimate of the
    value of a state.

6
Bellman Equations for Cost Minimization
MDP (absorbing goals), also called Stochastic
Shortest Path
  • <S, A, Pr, C, G, s0>
  • Define J*(s), the optimal cost, as the minimum
    expected cost to reach a goal from this state.
  • J* should satisfy the following equation:

J*(s) = 0 if s ∈ G;  J*(s) = min_a Q*(s,a) otherwise
Q*(s,a) = Σ_{s'} Pr(s'|s,a) [ C(s,a,s') + J*(s') ]
7
Bellman Equations for infinite horizon discounted
reward maximization MDP
  • <S, A, Pr, R, s0, γ>
  • Define V*(s), the optimal value, as the maximum
    expected discounted reward from this state.
  • V* should satisfy the following equation:

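The equation itself appeared as an image on the original slide; a standard reconstruction, consistent with the notation above, is:

V*(s) = max_a Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ V*(s') ]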
8
Bellman Equations for probability maximization
MDP
  • <S, A, Pr, G, s0, T>
  • Define P*(s,t), the optimal probability, as the maximum
    probability of reaching a goal from this state by the
    t-th timestep.
  • P* should satisfy the following equation:

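The equation itself appeared as an image on the original slide; a standard reconstruction, consistent with the notation above, is:

P*(s,t) = 1                                        if s ∈ G
P*(s,0) = 0                                        if s ∉ G
P*(s,t) = max_a Σ_{s'} Pr(s'|s,a) P*(s',t-1)       otherwise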
9
Modeling Softgoal problems as deterministic MDPs
  • Consider the net-benefit problem, where actions
    have costs and goals have utilities, and we want
    a plan with the highest net benefit
  • How do we model this as an MDP?
  • (wrong idea) Make every state in which any
    subset of goals holds into a sink state with
    reward equal to the cumulative sum of the utilities
    of the goals.
  • Problem: what if achieving g1 & g2 will necessarily
    lead you through a state where g1 is already
    true?
  • (correct version) Make a new fluent called
    "done" and a dummy action called Done-Deal. It is
    applicable in any state and asserts the fluent
    "done". All "done" states are sink states. Their
    reward is equal to the sum of the rewards of the
    individual states.

10
Ideas for Efficient Algorithms..
  • Use heuristic search (and reachability
    information)
  • LAO*, RTDP
  • Use execution and/or simulation
  • Actual execution: reinforcement learning
    (the main motivation for RL is to learn the
    model)
  • Simulation: simulate the given model to sample
    possible futures
  • Policy rollout, hindsight optimization, etc.
  • Use factored representations
  • Factored representations for actions, reward
    functions, values and policies
  • Directly manipulating factored representations
    during the Bellman update

11
Heuristic Search vs. Dynamic Programming
(Value/Policy Iteration)
  • VI and PI approaches use Dynamic Programming
    Update
  • Set the value of a state in terms of the maximum
    expected value achievable by doing actions from
    that state.
  • They do the update for every state in the state
    space
  • Wasteful if we know the initial state(s) that the
    agent is starting from
  • Heuristic search (e.g. A*/AO*) explores only the
    part of the state space that is actually
    reachable from the initial state
  • Even within the reachable space, heuristic search
    can avoid visiting many of the states.
  • Depending on the quality of the heuristic used..
  • But what is the heuristic?
  • An admissible heuristic is a lower bound on the
    cost to reach a goal from any given state
  • It is a lower bound on J*!

12
Connection with Heuristic Search
[Figure: three problem graphs from s0 to the goal G: a regular graph, an acyclic AND/OR graph, and a cyclic AND/OR graph]
13
Connection with Heuristic Search
[Figure: the same three graphs, annotated with the corresponding solution form and algorithm]
regular graph: solution = (shortest) path; algorithm = A*
acyclic AND/OR graph: solution = (expected-shortest) acyclic graph; algorithm = AO* [Nilsson'71]
cyclic AND/OR graph: solution = (expected-shortest) cyclic graph; algorithm = LAO* [Hansen & Zilberstein'98]
All algorithms are able to make effective use of
reachability information!
Sanity check: Why can't we handle the cycles by
duplicate elimination as in A* search?
14
LAO* [Hansen & Zilberstein'98]
  • add s0 to the fringe and to the greedy graph
  • repeat
  • expand a state on the fringe (in the greedy graph)
  • initialize all new states with their heuristic
    value
  • perform value iteration for all expanded states
  • recompute the greedy graph
  • until the greedy graph is free of fringe states
  • output the greedy graph as the final policy
    (a code sketch of this loop follows below)

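The following is a minimal Python sketch of the loop above (my illustration, not the authors' implementation). The problem interface -- actions(s), transition(s, a) returning (successor, probability) pairs, cost(s, a), and an admissible heuristic(s) -- is an assumption made for the example.

def lao_star(s0, goals, actions, transition, cost, heuristic, eps=1e-6):
    J = {s0: heuristic(s0)}         # value estimates, initialized by the heuristic
    expanded = set()                # states whose successors have been generated

    def q(s, a):
        return cost(s, a) + sum(p * J[t] for t, p in transition(s, a))

    def greedy_graph():
        """States reachable from s0 under the current greedy action choices."""
        reach, stack, policy = set(), [s0], {}
        while stack:
            s = stack.pop()
            if s in reach or s in goals:
                continue
            reach.add(s)
            if s not in expanded:
                continue            # fringe state: stop here
            a = min(actions(s), key=lambda act: q(s, act))
            policy[s] = a
            stack.extend(t for t, _ in transition(s, a))
        return reach, policy

    while True:
        reach, policy = greedy_graph()
        fringe = [s for s in reach if s not in expanded]
        if not fringe:
            return policy, J        # greedy graph is free of fringe states
        s = fringe[0]               # expand one fringe state
        expanded.add(s)
        for a in actions(s):
            for t, _ in transition(s, a):
                J.setdefault(t, 0.0 if t in goals else heuristic(t))
        # value iteration restricted to the expanded states in the greedy graph
        while True:
            delta = 0.0
            for u in [v for v in reach if v in expanded and v not in goals]:
                new = min(q(u, a) for a in actions(u))
                delta = max(delta, abs(new - J[u]))
                J[u] = new
            if delta < eps:
                break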
15
LAO* Iteration 1
[Figure: the greedy graph initially contains only s0]
add s0 to the fringe and to the greedy graph
16
LAO* Iteration 1
[Figure: s0 is expanded; its successors become fringe states]
expand a state on the fringe (in the greedy graph)
17
LAO* Iteration 1
[Figure: the new states are initialized with heuristic values h, and J1 is computed for s0]
  • initialise all new states with their
    heuristic values
  • perform VI on the expanded states

18
LAO* Iteration 1
[Figure: the greedy graph is recomputed using the updated values J1]
recompute the greedy graph
19
LAO* Iteration 2
[Figure: another fringe state is expanded and its successors are initialized with h]
expand a state on the fringe; initialise the new states
20
LAO* Iteration 2
[Figure: updated values J2 over the expanded states]
perform VI; compute the greedy policy
21
LAO* Iteration 3
[Figure: a fringe state adjacent to the goal G is expanded]
expand a fringe state
22
LAO* Iteration 3
[Figure: updated values J3; the greedy graph is recomputed]
perform VI; recompute the greedy graph
23
LAO* Iteration 4
[Figure: updated values J4 over the greedy graph, which now reaches the goal G]
24
LAO* Iteration 4
[Figure: the final greedy graph with values J4]
Stops when all nodes in the greedy graph have been
expanded
25
Comments
  • Dynamic Programming + Heuristic Search
  • admissible heuristic ⇒ optimal policy
  • expands only part of the reachable state space
  • outputs a partial policy
  • one that is closed w.r.t. Pr and s0
  • Speedups
  • expand all states on the fringe at once
  • perform policy iteration instead of value
    iteration
  • perform partial value/policy iteration
  • weighted heuristic: f = (1-w)·g + w·h
  • ADD-based symbolic techniques (symbolic LAO*)

26
AO* search for solving SSP problems
Main issues:
 -- the cost of a node is the
    expected cost of its children
 -- the AND tree can have LOOPS ⇒ cost backup
    is complicated
Intermediate nodes are given admissible heuristic
estimates -- these can be just the shortest paths
(or their estimates)
27
LAO*: turning bottom-up labeling into a full DP
28
How to derive heuristics?
  • The deterministic shortest route is a heuristic on
    the expected cost J*(s)
  • But how do you compute it?
  • Idea 1: Most-likely-outcome determinization.
    Consider only the most likely transition for each
    action
  • Idea 2: All-outcomes determinization. For each
    stochastic action, make multiple deterministic
    actions that correspond to the various outcomes
  • Which is admissible? Which is more informed?
  • How about Idea 3: sampling-based
    determinization?
  • Construct a sample determinization by
    'simulating' each stochastic action to pick the
    outcome. Find the cost of the shortest path in that
    determinization
  • Take multiple samples, and take the average of
    the shortest-path costs.

Determinization involves converting AND arcs in
the AND/OR graph to OR arcs (see the code sketch below)
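A minimal Python sketch of the all-outcomes determinization and the shortest-path heuristic computed on it (my illustration; the action format and the applicable(state, name) test are assumptions, not from the slides):

import heapq

def all_outcomes_determinize(stochastic_actions):
    """Each (name, cost, [(prob, successor_fn), ...]) becomes several deterministic actions."""
    det = []
    for name, cost, outcomes in stochastic_actions:
        for i, (_, succ_fn) in enumerate(outcomes):
            det.append((f"{name}_o{i}", cost, succ_fn))   # probabilities are dropped
    return det

def det_shortest_path_heuristic(s, goals, det_actions, applicable):
    """Dijkstra over the determinized graph: an admissible estimate of J*(s)."""
    dist, frontier, tie = {s: 0.0}, [(0.0, 0, s)], 1
    while frontier:
        d, _, u = heapq.heappop(frontier)
        if u in goals:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for name, cost, succ_fn in det_actions:
            if not applicable(u, name):
                continue
            v = succ_fn(u)
            if d + cost < dist.get(v, float("inf")):
                dist[v] = d + cost
                heapq.heappush(frontier, (d + cost, tie, v))
                tie += 1
    return float("inf")    # no goal reachable in the determinization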
29
Real Time Dynamic Programming [Barto, Bradtke &
Singh'95]
  • Trial: simulate the greedy policy starting from the start
    state
  • perform Bellman backups on the visited states
  • RTDP: repeat trials until the cost function converges
    (a code sketch follows below)

Notice that you can also do the Trial above by
executing rather than simulating. In that
case, we will be doing reinforcement learning.
(In fact, RTDP was originally developed for
reinforcement learning)
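A minimal Python sketch of RTDP trials (my illustration; the interface assumptions are the same as in the LAO* sketch above, plus a maximum trial length):

import random

def rtdp(s0, goals, actions, transition, cost, heuristic,
         n_trials=1000, max_steps=200):
    J = {}                                   # states not yet backed up default to the heuristic

    def value(s):
        return 0.0 if s in goals else J.get(s, heuristic(s))

    def q(s, a):
        return cost(s, a) + sum(p * value(t) for t, p in transition(s, a))

    for _ in range(n_trials):                # RTDP = repeated trials
        s = s0
        for _ in range(max_steps):           # one trial of the greedy policy
            if s in goals:
                break
            a = min(actions(s), key=lambda act: q(s, act))
            J[s] = q(s, a)                   # Bellman backup on the visited state
            succs = transition(s, a)
            s = random.choices([t for t, _ in succs],
                               weights=[p for _, p in succs])[0]   # simulate the outcome
    return J

Replacing simulation with actual execution, as the note above says, corresponds to the reinforcement-learning setting (where the transition model used in the backup would itself have to be learned).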
30
RTDP Approach: Interleave Planning and Execution
(Simulation)
Start from the current state S. Expand the tree
(either uniformly to k levels, or non-uniformly,
going deeper in some branches). Evaluate the leaf
nodes; back up the values to S. Update the stored
value of S. Pick the action that leads to the best
value. Do it (or simulate it). Loop back.
Leaf nodes are evaluated by:
 → using their cached values: if this node has been
   evaluated using RTDP analysis in the past, use its
   remembered value, else use the heuristic value
 → if not, using heuristics to estimate it:
   a. immediate reward values
   b. reachability heuristics
Sort of like depth-limited game playing (expectimax)
 -- Who is the game against?
Can also do reinforcement learning this way
 → the transition probabilities M^a_{s,s'} are not known in RL
31
RTDP Trial
[Figure: one RTDP trial from s0: Q_{n+1}(s0,a) is computed for each action a1, a2, a3 from the current J_n values of their successors, J_{n+1}(s0) is set to the minimum, the greedy action (here a_greedy = a2) is taken, and the sampled successor is visited next, continuing toward the Goal]
32
Greedy On-Policy RTDP without execution
→ Using the current utility values, select the
action with the highest expected utility
(the greedy action) at each state, until you reach
a terminating state. Update the values along
this path. Loop back until the values stabilize.
33
(No Transcript)
34
Comments
  • Properties
  • if all states are visited infinitely often, then
    Jn → J*
  • Advantages
  • Anytime: more probable states are explored quickly
  • Disadvantages
  • complete convergence is slow!
  • no termination condition

35
Labeled RTDP [Bonet & Geffner'03]
  • Initialise J0 with an admissible heuristic
  • ⇒ Jn increases monotonically
  • Label a state as solved
  • if the Jn for that state has converged
  • Backpropagate the solved labeling
  • Stop trials when they reach any solved state
  • Terminate when s0 is solved

[Figure: a state s is labeled solved when, under its best action, the states on the way to the goal G have converged (all other actions have high Q costs), so J(s) won't change; here s and t get solved together]
36
Properties
  • admissible J0 ⇒ optimal J*
  • heuristic-guided
  • explores a subset of the reachable state space
  • anytime
  • focuses attention on more probable states
  • fast convergence
  • focuses attention on unconverged states
  • terminates in finite time

37
Recent Advances: Bounded RTDP [McMahan, Likhachev &
Gordon'05]
  • Associate with each state
  • a lower bound (lb): for simulation
  • an upper bound (ub): for policy computation
  • gap(s) = ub(s) - lb(s)
  • Terminate a trial when gap(s) < ε
  • Bias sampling towards unconverged states,
    proportional to Pr(s'|s,a)·gap(s')
  • Perform backups in reverse order for the current
    trajectory.

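A small Python sketch of the gap-biased successor sampling described above (my illustration only):

import random

def sample_successor(s, a, transition, ub, lb):
    """Draw the next trial state in proportion to Pr(s'|s,a) * gap(s')."""
    succs = transition(s, a)                          # list of (s_next, prob)
    weights = [p * max(ub(t) - lb(t), 0.0) for t, p in succs]
    if sum(weights) == 0.0:
        return None                                   # all successors have converged
    return random.choices([t for t, _ in succs], weights=weights)[0]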
38
Recent Advances: Focused RTDP [Smith & Simmons'06]
  • Similar to Bounded RTDP, except
  • a more sophisticated definition of priority that
    combines the gap and the probability of reaching the state
  • adaptively increasing the max trial length

Recent Advances: Learning DFS [Bonet & Geffner'06]
  • Iterative Deepening A* equivalent for MDPs
  • Finds strongly connected components to check whether a
    state is solved.

39
Other Advances
  • Ordering the Bellman backups to maximise
    information flow
  • [Wingate & Seppi'05]
  • [Dai & Hansen'07]
  • Partition the state space and combine value
    iterations from different partitions
  • [Wingate & Seppi'05]
  • [Dai & Goldsmith'07]
  • External-memory version of value iteration
  • [Edelkamp, Jabbar & Bonet'07]

40
Policy Gradient Approaches [Williams'92]
  • direct policy search
  • parameterised policy Pr(a|s,w)
  • no value function
  • flexible memory requirements
  • policy gradient
  • J(w) = E_w[ Σ_{t=0..∞} γ^t c_t ]
  • gradient descent (w.r.t. w)
  • reaches a local optimum
  • continuous/discrete spaces

[Figure: a parameterised, non-stationary policy: the state s and the parameters w determine Pr(a=a1|s,w), Pr(a=a2|s,w), ..., Pr(a=ak|s,w), from which the action a is drawn]
41
Policy Gradient Algorithm
  • J(w) = E_w[ Σ_{t=0..∞} γ^t c_t ]  (failure prob., makespan, ...)
  • minimise J by
  • computing the gradient
  • stepping the parameters: w_{t+1} = w_t - α ∇J(w)
  • until convergence
  • Gradient estimate [Sutton et al.'99, Baxter &
    Bartlett'01]
  • Monte Carlo estimate from a trace s1, a1, c1, ...,
    sT, aT, cT
  • e_{t+1} = e_t + ∇_w log Pr(a_{t+1}|s_t, w_t)
  • w_{t+1} = w_t - α γ^t c_t e_{t+1}
    (a code sketch of this estimate follows below)

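A minimal Python sketch of the Monte Carlo estimate above (a REINFORCE-style update with an eligibility trace). The softmax policy, the features(s, a) function, the env_reset/env_step interface and the step size alpha are illustrative assumptions, not part of the slides.

import math, random

def softmax_probs(s, w, actions, features):
    """Pr(a|s,w) for a linear softmax policy."""
    scores = [sum(wi * fi for wi, fi in zip(w, features(s, a))) for a in actions]
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    return [e / z for e in exps]

def policy_gradient_episode(w, env_reset, env_step, actions, features,
                            gamma=0.95, alpha=0.01, max_steps=200):
    """One trace s1,a1,c1,...: apply w_{t+1} = w_t - alpha * gamma^t * c_t * e_{t+1}."""
    s, done = env_reset(), False
    e = [0.0] * len(w)                                # eligibility trace e_t
    for t in range(max_steps):
        if done:
            break
        probs = softmax_probs(s, w, actions, features)
        i = random.choices(range(len(actions)), weights=probs)[0]
        a = actions[i]
        # grad_w log Pr(a|s,w) for a linear softmax: f(s,a) - sum_b Pr(b|s,w) f(s,b)
        avg = [sum(p * fb for p, fb in zip(probs, col))
               for col in zip(*[features(s, b) for b in actions])]
        grad = [fa - av for fa, av in zip(features(s, a), avg)]
        e = [ei + gi for ei, gi in zip(e, grad)]      # e_{t+1} = e_t + grad
        s, c, done = env_step(s, a)                   # observe the cost c_t
        w = [wi - alpha * (gamma ** t) * c * ei for wi, ei in zip(w, e)]
    return w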
42
Policy Gradient Approaches
  • often used in reinforcement learning
  • partial observability
  • model-free (Pr(s'|s,a), Pr(o|s) are unknown)
  • to learn a policy from observations and costs

[Figure: a reinforcement learner with parameters w outputs Pr(a=a1|o,w), ..., Pr(a=ak|o,w); the chosen action a is sent to the world/simulator (whose Pr(s'|s,a) and Pr(o|s) are unknown), which returns an observation o and a cost c]
43
Modeling Complex Problems
  • Modeling time
  • continuous variable in the state space
  • discretisation issues
  • large state space
  • Modeling concurrency
  • many actions may execute at once
  • large action space
  • Modeling time and concurrency
  • large state and action space!!

[Figure: the value J(s) plotted as a function of time t]
44
Ideas for Efficient Algorithms..
  • Use heuristic search (and reachability
    information)
  • LAO*, RTDP
  • Use execution and/or simulation
  • Actual execution: reinforcement learning
    (the main motivation for RL is to learn the
    model)
  • Simulation: simulate the given model to sample
    possible futures
  • Policy rollout, hindsight optimization, etc.
  • Use factored representations
  • Factored representations for actions, reward
    functions, values and policies
  • Directly manipulating factored representations
    during the Bellman update

45
Factored Representations: Actions
  • Actions can be represented directly in terms of
    their effects on the individual state variables
    (fluents). The CPTs of the BNs can be represented
    compactly too!
  • Write a Bayes network relating the values of
    fluents in the state before and after the action
  • Bayes networks representing fluents at different
    time points are called Dynamic Bayes Networks
  • We look at 2-TBNs (2-time-slice dynamic Bayes nets)
  • Go further by using the STRIPS assumption
  • Fluents not affected by the action are not
    represented explicitly in the model
  • Called the Probabilistic STRIPS Operator (PSO) model
    (a small example follows below)

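A small Python sketch of a PSO-style action, using made-up fluent and operator names (an illustration, not an operator from the slides): each outcome carries a probability and STRIPS-style add/delete lists, and fluents the action does not affect are simply not mentioned.

PICKUP = {
    "name": "pickup-block",
    "precondition": {"hand-empty", "block-on-table"},
    "outcomes": [
        (0.9, {"add": {"holding-block"}, "del": {"hand-empty", "block-on-table"}}),
        (0.1, {"add": set(), "del": set()}),          # the gripper slips: nothing changes
    ],
}

def applicable(state, action):
    return action["precondition"] <= state            # state = set of true fluents

def apply_outcome(state, outcome):
    _, eff = outcome
    return (state - eff["del"]) | eff["add"]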
46
Action CLK
47
Envelope Extension Methods
  • For each action, take the most likely outcome and
    discard the rest.
  • Find a plan (deterministic path) from Init to
    Goal state. This is a (very partial) policy for
    just the states that fall on the maximum
    probability state sequence.
  • Consider states that are most likely to be
    encountered while traveling this path.
  • Find policy for those states too.
  • Tricky part is to show that we can converge to
    the optimal policy

48
(No Transcript)
49
(No Transcript)
50
(No Transcript)
51
Factored Representations: Reward, Value and
Policy Functions
  • Reward functions can be represented in factored
    form too. Possible representations include
  • Decision trees (made up of fluents)
  • ADDs (Algebraic decision diagrams)
  • Value functions are like reward functions (so
    they too can be represented similarly)
  • Bellman update can then be done directly using
    factored representations..

52
(No Transcript)
53
SPUDD's use of ADDs
54
Direct manipulation of ADDs in SPUDD
55
(No Transcript)
56
(No Transcript)
57
(No Transcript)
58
Policy Rollout: Time Complexity
Following PR[π,h,w]
[Figure: the tree of sampled trajectories from s]
  • To compute PR[π,h,w](s), for each action we need
    to compute w trajectories of length h
  • Total of |A|·h·w calls to the simulator

59
Policy Rollout
  • Often π' is significantly better than π, i.e. one
    step of policy iteration can provide substantial
    improvement
  • Using simulation to approximate π' is known as
    policy rollout
  • PolicyRollout(s, π, h, w)
  • FOR each action a, Q(a) = EstimateQ(s, a, π, h, w)
  • RETURN arg max_a Q(a)
  • We will denote the rollout policy by PR[π,h,w]
  • Note that PR[π,h,w] is stochastic
  • Questions
  • What is the complexity of computing PR[π,h,w](s)?
  • Can we approximate k iterations of policy
    iteration using sampling?
  • How good is the rollout policy compared to the
    policy-iteration improvement?
    (a code sketch follows below)

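A minimal Python sketch of PolicyRollout(s, π, h, w) as defined above (my illustration; the simulator interface simulate(s, a) returning (next_state, reward) is an assumption):

def estimate_q(s, a, pi, h, w, simulate, goals):
    """Average return of doing a in s and then following pi for up to h steps, over w trajectories."""
    total = 0.0
    for _ in range(w):
        state, ret, act = s, 0.0, a
        for _ in range(h):
            state, r = simulate(state, act)      # simulator samples the stochastic outcome
            ret += r
            if state in goals:
                break
            act = pi(state)                      # follow the base policy afterwards
        total += ret
    return total / w

def policy_rollout(s, pi, h, w, simulate, actions, goals):
    """PR[pi,h,w](s): a simulation-based one-step policy-iteration improvement of pi."""
    qs = {a: estimate_q(s, a, pi, h, w, simulate, goals) for a in actions(s)}
    return max(qs, key=qs.get)

The |A|·h·w simulator calls per decision quoted on the previous slide are visible directly in the two nested loops.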
60
Multi-Stage Policy Rollout
Following PR[PR[π,h,w],h,w]
[Figure: from s, each action a1..an is evaluated by trajectories under PR[π,h,w]; each step requires |A|·h·w simulator calls]
  • Approximates the policy resulting from two steps of
    PI.
  • Requires (|A|·h·w)^2 calls to the simulator
  • In general, exponential in the number of PI
    iterations

61
Policy Rollout Quality
  • How good is PR[π,h,w](s) compared to π'?
  • In general, for a fixed h and w there is always an
    MDP such that the quality of the rollout policy
    is arbitrarily worse than π'.
  • If we make an assumption about the MDP, then it
    is possible to select h and w so that the rollout
    quality is close to π'.
  • This is a bit involved.
  • In your homework you will solve a related
    problem:
  • Choose h and w so that, with high probability,
    PR[π,h,w](s) selects an action that maximizes
    Q^π(s,a)

62
Rollout Summary
  • We often are able to write simple, mediocre
    policies
  • Network routing policy
  • Policy for card game of Hearts
  • Policy for game of Backgammon
  • Solitaire playing policy
  • Policy rollout is a general and easy way to
    improve upon such policies
  • Often observe substantial improvement, e.g.
  • Compiler instruction scheduling
  • Backgammon
  • Network routing
  • Combinatorial optimization
  • Game of GO
  • Solitaire

63
Example Rollout for Go
  • Go is a popular board game for which there are
    still no highly-ranked computer programs
  • High branching factor
  • Difficult to design heuristics
  • Unlike Chess where the best computers compete
    with the best humans
  • What if we play Go using level-1 rollouts of the
    random policy?
  • With a few additional tweaks you get the best Go
    program to date! (CrazyStone, winner of the 11th
    Computer Go Olympiad)
    http://remi.coulom.free.fr/CrazyStone/

64
FF-Replan: A Baseline for Probabilistic Planning
  • Sungwook Yoon
  • Alan Fern
  • Robert Givan

FF-Replan Sungwook Yoon
65
Replanning Approach
  • A deterministic planner for probabilistic planning?
  • Winner of IPPC-2004 and (unofficial) winner of
    IPPC-2006
  • Why was it conceived?
  • Why did it work?
  • Domain-by-domain analysis
  • Any extensions?

FF-Replan Sungwook Yoon
66
IPPC-2004 Pre-released Domains
Blocksworld
Boxworld
FF-Replan Sungwook Yoon
67
IPPC Performance Test
  • Client-server interaction
  • The problem definition is known a priori
  • Performance is recorded in the server log
  • For each problem, 30 repeated tests are conducted

FF-Replan Sungwook Yoon
68
Single Outcome Replanning (FFR_S)
  • A natural approach given the competition setting
    and the domains (cf. Intro to AI, Russell and Norvig)
  • Hash the state-action mapping
  • Replace probabilistic effects with a single
    deterministic effect
  • Ground the goal

[Figure: a probabilistic action with outcomes Effect 1 (Probability 1), Effect 2 (Probability 2) and Effect 3 (Probability 3) is replaced by a single deterministic action with Effect 2]
FF-Replan Sungwook Yoon
69
IPPC-2004 Domains
Blocksworld
Boxworld
Fileworld
Tireworld
Tower of Hanoi
ZenoTravel
Exploding Blocksworld
FF-Replan Sungwook Yoon
70
IPPC-2004 Results
Numbers = successful runs
(Column annotations on the original slide: Human Control Knowledge, Learned Knowledge, 2nd Place Winners)

Domain      NMRC  J1   Classy  NMR  mGPT   C   FFR_S  FFR_A
BW           252  270    255    30   120   30    210    270
Box          134  150    100     0    30    0    150    150
File           -    -      -     3    30    3     14     29
Zeno           -    -      -    30    30   30      0     30
Tire-r         -    -      -    30    30   30     30     30
Tire-g         -    -      -     9    16   30      7      7
TOH            -    -      -    15     0    0      0     11
Exploding      -    -      -     0     0    0      3      5

NMR: Non-Markovian Reward Decision Process Planner
Classy: Approximate Policy Iteration with a Policy Language Bias
mGPT: Heuristic Search Probabilistic Planning
C: Symbolic Heuristic Search
71
Reason for the Success
  • Determinization and efficient pre-processing of a
    complex planning language
  • The input language is quite complex (PPDDL)
  • Classical planning has developed efficient
    preprocessing techniques for complex input
    languages and scales well
  • Grounding the goal also helped
  • Classical planning has a hard time dealing with
    lifted goals
  • The domains in the competition
  • 17 of 20 problems were dead-end free
  • Amenable to the replanning approach
FF-Replan Sungwook Yoon
72
All Outcome Replanning (FFR_A)
  • Selecting one outcome is troublesome
  • Which outcome to take?
  • Let's use all the outcomes
  • All we have to do is translate each deterministic
    action back to the original probabilistic action
    during the server-client interaction with MDPSIM
  • Novel approach (a sketch of the overall replanning
    loop follows below)

[Figure: a probabilistic action with outcomes Effect 1 (Probability 1), Effect 2 (Probability 2) and Effect 3 (Probability 3) is split into three deterministic actions: Action1 with Effect 1, Action2 with Effect 2, and Action3 with Effect 3]
FF-Replan Sungwook Yoon
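A hedged sketch of the overall FF-Replan loop (my reconstruction of the idea, not the authors' code): determinize the domain with the all-outcomes translation, call a classical planner (FF in the original system; classical_plan here is a stand-in), cache the resulting state-to-action mapping, and replan whenever execution reaches an unmapped state.

def ff_replan(s0, goals, determinize, classical_plan, execute, prob_action_of,
              max_steps=1000):
    policy = {}                                   # hashed state -> deterministic action
    s = s0
    for _ in range(max_steps):
        if s in goals:
            return True
        if s not in policy:                       # off the planned path: replan from s
            det_domain = determinize()            # all-outcomes determinization
            plan = classical_plan(det_domain, s, goals)   # e.g. a call to FF
            if not plan:
                return False                      # dead end in the determinization
            state = s
            for det_action, next_state in plan:   # plan = (action, predicted state) pairs
                policy[state] = det_action
                state = next_state
        a = prob_action_of(policy[s])             # map back to the probabilistic action
        s = execute(a)                            # the environment/MDPSIM picks the real outcome
    return False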
73
IPPC-2006 Domains
  • Blocksworld
  • Exploding Blocksworld
  • ZenoTravel
  • Tireworld
  • Elevator
  • Drive
  • PitchCatch
  • Schedule
  • Random
  • Randomly generate a syntactically correct domain
  • E.g., don't delete facts that are not in the
    precondition
  • Randomly generate a state;
    this is the initial state
  • Take a random walk from that state, using the random
    domain
  • The resulting state is a goal state
  • There is at least one path from the initial state
    to the goal state
  • If the probability of the path is bigger than a
    threshold, then stop; otherwise take a random walk again
  • A special reset action is provided that takes any
    state to the initial state

FF-Replan Sungwook Yoon
74
IPPC-2006 Results
Numbers = percentage of successful runs

Domain       FFR_A  FPG  FOALP  sfDP  Paragraph  FFR_S
BW             86    63    100    29       0       77
Zenotravel    100    27      0     7       7        7
Random        100    65      0     0       5       73
Elevator       93    76    100     0       0       93
Exploding      52    43     24    31      31       52
Drive          71    56      0     0       9        0
Schedule       51    54      0     0       1        0
PitchCatch     54    23      0     0       0        0
Tire           82    75     82     0      91       69

FPG: Factored Policy Gradient Planner
FOALP: First Order Approximate Linear Programming
sfDP: Symbolic Stochastic Focused Dynamic Programming with Decision Diagrams
Paragraph: A Graphplan-Based Probabilistic Planner
75
Discussion
  • The novel all-outcomes replanning technique
    outperforms the naïve replanner
  • The replanner performed well even on the real
    probabilistic domains
  • Drive
  • The complexity of the domain might have
    contributed to this phenomenon
  • The replanner did not win the domains where it was
    supposed to be at its very best
  • Blocksworld

FF-Replan Sungwook Yoon
76
Weakness of the Replanning
  • Ignorance of the probabilistic effects
  • Try not to use actions with detrimental effects
  • Detrimental effects can sometimes easily be found
  • Ignorance of prior planning during replanning
  • Plan Stability: work by Fox, Gerevini, Long and
    Serina
  • No learning
  • There is an obvious learning opportunity, since
    it solves a problem repetitively

FF-Replan Sungwook Yoon
77
Potential Improvements
  • Intelligent replanning
  • Policy rollout
  • Policy learning
  • Hindsight optimization
  • During (determinized) planning, when it meets a
    previously seen state, stop planning
  • May reduce the replanning time
  • The hashed state-action mapping can be viewed as
    a partial policy
  • Currently, the mapping is always fixed
  • When there is a failure, we can update the
    policy, that is, give a penalty to the
    state-actions in the failure trajectory
  • During planning, try not to use those actions in
    those states
  • E.g., after an explosion in Exploding Blocksworld,
    do not use the putdown action

[Figure: hindsight optimization: from the current state, for each candidate action A1, A2, ..., FF-Replan is run on sampled futures (the action outcomes that 'really will happen'), the rewards of reaching the goal state are averaged, and the action with the maximum average is selected]
FF-Replan Sungwook Yoon
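A hedged sketch of the hindsight-optimization idea in the figure above (my illustration; sample_determinization, simulate and solve_det are stand-ins, not functions from FF-Replan): for each action, sample several determinized futures, let the deterministic planner evaluate each one, and pick the action with the best average.

def hindsight_action(s, actions, sample_determinization, simulate, solve_det,
                     n_samples=10):
    best_action, best_value = None, float("-inf")
    for a in actions(s):
        total = 0.0
        for _ in range(n_samples):
            det = sample_determinization()        # fix the outcome of every future action
            s_next = simulate(s, a, det)          # outcome of a under this sampled future
            total += solve_det(det, s_next)       # reward found by the deterministic planner
        avg = total / n_samples
        if avg > best_value:
            best_action, best_value = a, avg
    return best_action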