Title: NMS PI meeting, September 27-29, 2000
1 GraphPlan, Satplan and Markov Decision Processes
Sungwook Yoon
- Based in part on slides by Alan Fern
2 GraphPlan
- Many planning systems use ideas from Graphplan
- IPP, STAN, SGP, Blackbox, Medic
- Can run much faster than POP-style planners
- History
- Before GraphPlan came out, most planning researchers were working on POP-style planners
- GraphPlan started them thinking about other, more efficient algorithms
- Recent planning algorithms run much faster than GraphPlan
- However, most of them have been influenced by GraphPlan
3 Big Picture
- A big source of inefficiency in search algorithms is the large branching factor
- GraphPlan reduces the branching factor by searching in a special data structure
- Phase 1: Create a Planning Graph
- built from the initial state
- contains actions and propositions that are possibly reachable from the initial state
- does not include unreachable actions or propositions
- Phase 2: Solution Extraction
- Backward search for a solution in the planning graph, starting from the goal
4 Planning Graph
A literal is just a positive or negative proposition.
- Sequence of levels that correspond to time-steps in the plan
- Each level contains a set of literals and a set of actions
- Literals are those that could possibly be true at the time step
- Actions are those whose preconditions could be satisfied at the time step
- Idea: construct a superset of the literals that could possibly be achieved after an n-level layered plan
- Gives a compact (but approximate) representation of the states that are reachable by n-level plans
5 Planning Graph
[Figure: planning graph with alternating proposition and action layers]
6 Planning Graph
- Maintenance actions (persistence actions)
- represent what happens if no action affects the literal
- include an action with precondition c and effect c, for each literal c
[Figure: planning graph with persistence actions between proposition and action layers]
7 Graph expansion
- Initial proposition layer
- Just the initial conditions
- Action layer n
- If all of an action's preconditions are in proposition layer n, then add the action to layer n
- Proposition layer n+1
- For each action at layer n (including persistence actions)
- Add all its effects (both positive and negative) at layer n+1
- (Also allow propositions at layer n to persist to n+1)
- Propagate mutex information (we'll talk about this in a moment)
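The expansion rules above can be sketched in Python. This is a minimal sketch under our own assumptions, not GraphPlan's actual data structures. A literal is a hypothetical `(proposition, polarity)` pair, and each action carries a precondition set and an effect set of literals. Mutex propagation is omitted, and the `stack(A,B)` effects follow the usual blocks-world definition.

```python
from typing import FrozenSet, List, NamedTuple, Tuple

# Hypothetical encoding: a literal is (proposition, polarity),
# so ("clear(B)", False) stands for ¬clear(B).
Literal = Tuple[str, bool]


class Action(NamedTuple):
    name: str
    pre: FrozenSet[Literal]  # preconditions
    eff: FrozenSet[Literal]  # positive and negative effects


def persistence(lit: Literal) -> Action:
    """Maintenance (no-op) action with precondition c and effect c."""
    prop, positive = lit
    label = prop if positive else "~" + prop
    return Action(f"persist({label})", frozenset([lit]), frozenset([lit]))


def expand(props: FrozenSet[Literal], actions: List[Action]):
    """Build action layer n and proposition layer n+1 (mutexes omitted)."""
    # An action enters layer n if all its preconditions are in proposition layer n.
    layer = [a for a in actions if a.pre <= props]
    # Persistence actions let every literal of layer n carry over to layer n+1.
    layer += [persistence(lit) for lit in sorted(props)]
    # Proposition layer n+1 holds every effect of every action in layer n.
    next_props = frozenset(lit for a in layer for lit in a.eff)
    return layer, next_props


s0 = frozenset({("holding(A)", True), ("clear(B)", True)})
stack = Action(
    "stack(A,B)",
    frozenset({("holding(A)", True), ("clear(B)", True)}),
    frozenset({("holding(A)", False), ("clear(B)", False),
               ("on(A,B)", True), ("clear(A)", True), ("handempty", True)}),
)
a0, s1 = expand(s0, [stack])
print([a.name for a in a0])
```

Sorting the literals only serves to emit the persistence actions in a deterministic order.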
8 Example
stack(A,B): precondition holding(A), clear(B); effect ¬holding(A), ¬clear(B), on(A,B), clear(A), handempty
[Figure: one-level planning graph s0 → a0 → s1. s0: holding(A), clear(B). a0: stack(A,B) plus persistence actions. s1 includes holding(A), ¬holding(A), handempty, on(A,B), clear(B), ¬clear(B)]
9 Example (continued)
Notice that not all literals in s1 can be made true simultaneously after 1 level, e.g. holding(A) and ¬holding(A), or on(A,B) and clear(B).
10 Mutual Exclusion (Mutex)
- Between pairs of actions
- no valid plan could contain both at layer n
- E.g., stack(a,b), unstack(a,b)
- Between pairs of literals
- no valid plan could produce both at layer n
- E.g., clear(a) and ¬clear(a); on(a,b) and clear(b)
- GraphPlan checks pairs only
- mutex relationships can help rule out possibilities during search in phase 2 of Graphplan
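The slide does not list the rules GraphPlan uses to detect mutexes. The sketch below assumes the standard rules from the GraphPlan literature: inconsistent effects, interference, and competing needs between actions, plus complementary literals and inconsistent support between literals. It reuses the hypothetical `(proposition, polarity)` literal encoding. On the one-level `stack(A,B)` graph it marks on(A,B) and clear(B) as mutex, matching the example.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, NamedTuple, Set, Tuple

Literal = Tuple[str, bool]  # hypothetical (proposition, polarity) encoding


class Action(NamedTuple):
    name: str
    pre: FrozenSet[Literal]
    eff: FrozenSet[Literal]


def neg(lit: Literal) -> Literal:
    return (lit[0], not lit[1])


def action_mutexes(layer: List[Action], prop_mutex: Set[FrozenSet[Literal]]) -> Set[FrozenSet[str]]:
    """Mutex pairs (by action name) within one action layer."""
    result = set()
    for a, b in combinations(layer, 2):
        # Inconsistent effects: one action negates an effect of the other.
        inconsistent = any(neg(e) in b.eff for e in a.eff)
        # Interference: one action negates a precondition of the other.
        interference = any(neg(e) in b.pre for e in a.eff) or any(neg(e) in a.pre for e in b.eff)
        # Competing needs: some preconditions are mutex in the previous proposition layer.
        competing = any(frozenset((p, q)) in prop_mutex for p in a.pre for q in b.pre)
        if inconsistent or interference or competing:
            result.add(frozenset((a.name, b.name)))
    return result


def literal_mutexes(layer: List[Action], act_mutex: Set[FrozenSet[str]],
                    props: Iterable[Literal]) -> Set[FrozenSet[Literal]]:
    """Mutex pairs in the proposition layer produced by `layer`."""
    result = set()
    for p, q in combinations(sorted(props), 2):
        if p == neg(q):  # complementary literals
            result.add(frozenset((p, q)))
            continue
        ach_p = [a.name for a in layer if p in a.eff]
        ach_q = [a.name for a in layer if q in a.eff]
        # Inconsistent support: every pair of achievers is mutex.
        if all(x != y and frozenset((x, y)) in act_mutex for x in ach_p for y in ach_q):
            result.add(frozenset((p, q)))
    return result


H, C = ("holding(A)", True), ("clear(B)", True)
layer = [
    Action("stack(A,B)", frozenset({H, C}),
           frozenset({neg(H), neg(C), ("on(A,B)", True), ("clear(A)", True), ("handempty", True)})),
    Action("persist(holding(A))", frozenset({H}), frozenset({H})),
    Action("persist(clear(B))", frozenset({C}), frozenset({C})),
]
am = action_mutexes(layer, prop_mutex=set())
s1 = {lit for a in layer for lit in a.eff}
lm = literal_mutexes(layer, am, s1)
```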
11 Solution Extraction: Backward Search
Repeat until the goal set is empty:
If the goals are present and non-mutex:
1) Choose a set of non-mutex actions to achieve each goal
2) Add their preconditions to the next goal set
12 Searching for a solution plan
- Backward chain on the planning graph
- Achieve goals level by level
- At level k, pick a subset of non-mutex actions to achieve the current goals. Their preconditions become the goals for level k-1.
- Build the goal subset by picking each goal and choosing an action to add. Use one already selected if possible (backtrack if you can't pick a non-mutex action)
- If we reach the initial proposition level and the current goals are in that level (i.e. they are true in the initial state), then we have found a successful layered plan
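Here is a minimal recursive sketch of this backward search. It assumes the graph has already been built and is passed in as hypothetical per-layer action lists (`layers`) and mutex sets (`mutex`). For brevity it neither checks goal mutexes nor memoizes failed goal sets, although real GraphPlan implementations do both.

```python
from typing import FrozenSet, List, Optional, Set, Tuple

Literal = Tuple[str, bool]                                   # hypothetical (proposition, polarity)
Action = Tuple[str, FrozenSet[Literal], FrozenSet[Literal]]  # (name, pre, eff)


def extract(goals: FrozenSet[Literal], k: int, s0: FrozenSet[Literal],
            layers: List[List[Action]], mutex: List[Set[FrozenSet[str]]]) -> Optional[List[List[str]]]:
    """Achieve `goals` at proposition level k by backward chaining.

    layers[i] is action layer i (between proposition levels i and i+1) and
    mutex[i] holds its mutex pairs as frozensets of action names.
    Returns the chosen action names per layer, or None if the search fails.
    """
    if k == 0:  # initial level: goals must hold in the initial state
        return [] if goals <= s0 else None
    return _assign(sorted(goals), [], k, s0, layers, mutex)


def _assign(todo, chosen, k, s0, layers, mutex):
    if not todo:  # every goal covered: preconditions become the goals for level k-1
        subgoals = frozenset(p for a in chosen for p in a[1])
        plan = extract(subgoals, k - 1, s0, layers, mutex)
        return None if plan is None else plan + [[a[0] for a in chosen]]
    goal, rest = todo[0], todo[1:]
    if any(goal in a[2] for a in chosen):  # reuse an already-selected action
        return _assign(rest, chosen, k, s0, layers, mutex)
    for a in layers[k - 1]:
        if goal in a[2] and all(frozenset((a[0], b[0])) not in mutex[k - 1] for b in chosen):
            plan = _assign(rest, chosen + [a], k, s0, layers, mutex)
            if plan is not None:
                return plan
    return None  # no non-mutex achiever works: backtrack


H, C = ("holding(A)", True), ("clear(B)", True)
stack = ("stack(A,B)", frozenset({H, C}),
         frozenset({("holding(A)", False), ("clear(B)", False),
                    ("on(A,B)", True), ("clear(A)", True), ("handempty", True)}))
layers = [[stack,
           ("persist(holding(A))", frozenset({H}), frozenset({H})),
           ("persist(clear(B))", frozenset({C}), frozenset({C}))]]
mutex = [{frozenset({"stack(A,B)", "persist(holding(A))"}),
          frozenset({"stack(A,B)", "persist(clear(B))"})}]
s0 = frozenset({H, C})
```

With goals {on(A,B), handempty} at level 1, the search returns `[["stack(A,B)"]]`. With goals {on(A,B), clear(B)} it fails, because the only achiever of clear(B) is a persistence action that is mutex with stack(A,B).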
13 GraphPlan algorithm
- Grow the planning graph (PG) to a level n such that all goals are reachable and not mutex
- necessary but insufficient condition for the existence of an n-level plan that achieves the goals
- if the PG levels off before non-mutex goals are achieved, then fail
- Search the PG for a valid plan
- If none is found, add a level to the PG and try again
- If the PG levels off and still no valid plan is found, then return failure
- Correctness follows from PG properties
14 Important Ideas
- Plan graph construction is polynomial time
- Though construction can be expensive when there are many objects and hence many propositions
- The plan graph captures important properties of the planning problem
- Necessarily unreachable literals and actions
- Possibly reachable literals and actions
- Mutually exclusive literals and actions
- Significantly prunes the search space compared to POP-style planners
- The plan graph provides a sound termination procedure
- Knows when no plan exists
- Plan graphs can also be used for deriving admissible (and good non-admissible) heuristics
- See your book (we may come back to this idea later)
15 Encoding Planning as Satisfiability: Basic Idea
- Bounded planning problem (P,n)
- P is a planning problem; n is a positive integer
- Find a solution for P of length n
- Create a propositional formula that represents
- Initial state
- Goal
- Action dynamics
- for n time steps
- We will define the formula for (P,n) such that
1) any model (i.e. satisfying truth assignment) of the formula represents a solution to (P,n)
2) if (P,n) has a solution, then the formula is satisfiable
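16 Example of Complete Formula for (P,1)
$$
\begin{aligned}
&\mathit{at}(r1,l1,0) \land \lnot\mathit{at}(r1,l2,0) \;\land\\
&\mathit{at}(r1,l2,1) \;\land\\
&(\mathit{move}(r1,l1,l2,0) \Rightarrow \mathit{at}(r1,l1,0)) \;\land\\
&(\mathit{move}(r1,l1,l2,0) \Rightarrow \mathit{at}(r1,l2,1)) \;\land\\
&(\mathit{move}(r1,l1,l2,0) \Rightarrow \lnot\mathit{at}(r1,l1,1)) \;\land\\
&(\mathit{move}(r1,l2,l1,0) \Rightarrow \mathit{at}(r1,l2,0)) \;\land\\
&(\mathit{move}(r1,l2,l1,0) \Rightarrow \mathit{at}(r1,l1,1)) \;\land\\
&(\mathit{move}(r1,l2,l1,0) \Rightarrow \lnot\mathit{at}(r1,l2,1)) \;\land\\
&(\lnot\mathit{move}(r1,l1,l2,0) \lor \lnot\mathit{move}(r1,l2,l1,0)) \;\land\\
&(\lnot\mathit{at}(r1,l1,0) \land \mathit{at}(r1,l1,1) \Rightarrow \mathit{move}(r1,l2,l1,0)) \;\land\\
&(\lnot\mathit{at}(r1,l2,0) \land \mathit{at}(r1,l2,1) \Rightarrow \mathit{move}(r1,l1,l2,0)) \;\land\\
&(\mathit{at}(r1,l1,0) \land \lnot\mathit{at}(r1,l1,1) \Rightarrow \mathit{move}(r1,l1,l2,0)) \;\land\\
&(\mathit{at}(r1,l2,0) \land \lnot\mathit{at}(r1,l2,1) \Rightarrow \mathit{move}(r1,l2,l1,0))
\end{aligned}
$$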
The formula has propositions for actions and state variables at each possible timestep.
We'll now discuss how to construct such a formula.
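As a sanity check of the (P,1) formula, a tiny brute-force search can evaluate it on all 2^6 assignments of its six propositions. This is only an illustration, and the helpers `phi` and `implies` are hypothetical names of our own. Enumeration is exponential in the number of propositions, which is why SATPLAN relies on real SAT solvers instead.

```python
from itertools import product

VARS = ["at(r1,l1,0)", "at(r1,l2,0)", "at(r1,l1,1)", "at(r1,l2,1)",
        "move(r1,l1,l2,0)", "move(r1,l2,l1,0)"]


def implies(p, q):
    return (not p) or q


def phi(v):
    """The (P,1) formula, one list entry per conjunct."""
    a10, a20 = v["at(r1,l1,0)"], v["at(r1,l2,0)"]
    a11, a21 = v["at(r1,l1,1)"], v["at(r1,l2,1)"]
    m12, m21 = v["move(r1,l1,l2,0)"], v["move(r1,l2,l1,0)"]
    return all([
        a10, not a20,                                                  # initial state
        a21,                                                           # goal
        implies(m12, a10), implies(m12, a21), implies(m12, not a11),   # move(r1,l1,l2,0)
        implies(m21, a20), implies(m21, a11), implies(m21, not a21),   # move(r1,l2,l1,0)
        not m12 or not m21,                                            # complete exclusion
        implies(not a10 and a11, m21),                                 # explanatory frame axioms
        implies(not a20 and a21, m12),
        implies(a10 and not a11, m12),
        implies(a20 and not a21, m21),
    ])


assignments = (dict(zip(VARS, bits)) for bits in product([False, True], repeat=len(VARS)))
models = [v for v in assignments if phi(v)]
print(len(models), models[0]["move(r1,l1,l2,0)"])  # prints: 1 True
```

Exactly one assignment survives: the one with move(r1,l1,l2,0) true and move(r1,l2,l1,0) false.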
17 Overall Approach
- Do iterative deepening like we did with Graphplan
- for n = 0, 1, 2, ...
- encode (P,n) as a satisfiability problem $\Phi$
- if $\Phi$ is satisfiable, then
- From the set of truth values that satisfies $\Phi$, a solution plan can be constructed, so return it and exit
- With a complete satisfiability tester, this approach will produce optimal layered plans for solvable problems
- We can use a GraphPlan analysis to determine an upper bound on n, giving a way to detect unsolvability
18 Fluents (will be used as propositions)
- If plan $\langle a_0, a_1, \ldots, a_{n-1} \rangle$ is a solution to (P,n), then it generates a sequence of states $\langle s_0, s_1, \ldots, s_n \rangle$
- A fluent is a proposition used to describe what's true in each $s_i$
- on(A,B,i) is a fluent that's true iff on(A,B) is in $s_i$
- We'll use $e_i$ to denote the fluent for a fact e in state $s_i$
- e.g. if e = at(r1,loc1)
- then $e_i$ = at(r1,loc1,i)
- $a_i$ is a fluent saying that a is the action taken at step i
- e.g., if a = move(r1,loc2,loc1)
- then $a_i$ = move(r1,loc2,loc1,i)
- The set of all possible fluents for (P,n) forms the set of primitive propositions used to construct our formula for (P,n)
19 Encoding Planning Problems
- We can encode (P,n) so that we consider either layered plans or totally ordered plans
- an advantage of considering layered plans is that fewer time steps are necessary (i.e. smaller n translates into smaller formulas)
- for simplicity we first consider totally ordered plans
- Encode (P,n) as a formula $\Phi$ such that
- $\langle a_0, a_1, \ldots, a_{n-1} \rangle$ is a solution for (P,n)
- if and only if
- $\Phi$ can be satisfied in a way that makes the fluents $a_0, \ldots, a_{n-1}$ true
- $\Phi$ will be a conjunction of many other formulas
20 Formulas in $\Phi$
- Formula describing the initial state (let E be the set of possible facts in the planning problem)
- $\bigwedge \{ e_0 \mid e \in s_0 \} \land \bigwedge \{ \lnot e_0 \mid e \in E \setminus s_0 \}$
- Describes the complete initial state (both positive and negative facts)
- E.g. $\mathit{on}(A,B,0) \land \lnot\mathit{on}(B,A,0)$
- Formula describing the goal (G is the set of goal facts)
- $\bigwedge \{ e_n \mid e \in G \}$
- says that the goal facts must be true in the final state at timestep n
- E.g. on(B,A,n)
- Is this enough?
- Of course not. The formulas say nothing about actions.
21 Formulas in $\Phi$
- For every action a and timestep i, formulas describing what fluents must be true if a were the ith step of the plan
- $a_i \Rightarrow \bigwedge \{ e_i \mid e \in \mathrm{Precond}(a) \}$, as preconditions must be true
- $a_i \Rightarrow \bigwedge \{ e_{i+1} \mid e \in \mathrm{ADD}(a) \}$, as ADD effects must be true in i+1
- $a_i \Rightarrow \bigwedge \{ \lnot e_{i+1} \mid e \in \mathrm{DEL}(a) \}$, as DEL effects must be false in i+1
- Complete exclusion axiom
- For all actions $a \neq b$ and timesteps i, formulas saying a and b can't occur at the same time: $\lnot a_i \lor \lnot b_i$
- this guarantees there can be at most one action at a time
- Is this enough?
- The formulas say nothing about what happens to facts if they are not affected by an action
- This is known as the frame problem
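Below is a minimal sketch of generating these formulas as CNF clauses for totally ordered plans. The names `fluent`, `encode` and `models` are hypothetical, facts are plain strings, and preconditions are positive STRIPS facts. The frame problem is handled with explanatory frame axioms: a fact may change truth value between i and i+1 only if some action that adds or deletes it occurs at step i. This is the form used in the (P,1) example formula. The brute-force `models` generator is for illustration only.

```python
from itertools import combinations, product


def fluent(x, i):
    """Add time step i as a final argument: at(r1,l1) -> at(r1,l1,i)."""
    return f"{x[:-1]},{i})" if x.endswith(")") else f"{x}({i})"


def encode(facts, actions, s0, goal, n):
    """CNF for (P,n) with totally ordered plans; a clause is a list of (variable, polarity)."""
    cnf = [[(fluent(e, 0), e in s0)] for e in facts]   # complete initial state
    cnf += [[(fluent(e, n), True)] for e in goal]        # goal facts at step n
    for i in range(n):
        for a, (pre, add, delete) in actions.items():
            ai = fluent(a, i)
            cnf += [[(ai, False), (fluent(e, i), True)] for e in pre]           # a_i => preconditions
            cnf += [[(ai, False), (fluent(e, i + 1), True)] for e in add]       # a_i => ADD effects
            cnf += [[(ai, False), (fluent(e, i + 1), False)] for e in delete]   # a_i => not DEL effects
        for a, b in combinations(actions, 2):                                   # complete exclusion
            cnf.append([(fluent(a, i), False), (fluent(b, i), False)])
        for e in facts:                                                         # explanatory frame axioms
            adders = [(fluent(a, i), True) for a, spec in actions.items() if e in spec[1]]
            deleters = [(fluent(a, i), True) for a, spec in actions.items() if e in spec[2]]
            cnf.append([(fluent(e, i), True), (fluent(e, i + 1), False)] + adders)    # e became true
            cnf.append([(fluent(e, i), False), (fluent(e, i + 1), True)] + deleters)  # e became false
    return cnf


def models(cnf):
    """Brute-force every assignment (illustration only; exponential in the variable count)."""
    names = sorted({v for clause in cnf for v, _ in clause})
    for bits in product([False, True], repeat=len(names)):
        m = dict(zip(names, bits))
        if all(any(m[v] == pol for v, pol in clause) for clause in cnf):
            yield m


facts = ["at(r1,l1)", "at(r1,l2)"]
actions = {  # name: (preconditions, ADD, DEL)
    "move(r1,l1,l2)": ({"at(r1,l1)"}, {"at(r1,l2)"}, {"at(r1,l1)"}),
    "move(r1,l2,l1)": ({"at(r1,l2)"}, {"at(r1,l1)"}, {"at(r1,l2)"}),
}
cnf = encode(facts, actions, s0={"at(r1,l1)"}, goal={"at(r1,l2)"}, n=1)
sols = list(models(cnf))
print(len(cnf), len(sols))  # prints: 14 1
```

On the one-robot example the encoder produces 14 clauses, one per conjunct of the (P,1) formula, and the clause set has exactly one model.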
22 Example
- Planning domain
- one robot r1
- two adjacent locations l1, l2
- one operator (move the robot)
- Encode (P,n) where n = 1
- Initial state: at(r1,l1)
- Encoding: $\mathit{at}(r1,l1,0) \land \lnot\mathit{at}(r1,l2,0)$
- Goal: at(r1,l2)
- Encoding: at(r1,l2,1)
- Action schema: see next slide
23 Extracting a Plan
- Suppose we find an assignment of truth values that satisfies $\Phi$
- This means P has a solution of length n
- For i = 0, ..., n-1, there will be exactly one action a such that $a_i$ is true
- This is the ith action of the plan
- Example (from the previous slides)
- $\Phi$ can be satisfied with move(r1,l1,l2,0) true
- Thus ⟨move(r1,l1,l2)⟩ is a solution for (P,1)
- It's the only solution; there is no other way to satisfy $\Phi$
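Reading the plan off a satisfying assignment can be sketched as follows. The sketch assumes the model is a dict from fluent names to truth values, such as a solver wrapper might return, and the function names are hypothetical. The exclusion axioms only rule out two actions at the same step, so the sketch checks that exactly one action fluent is true at each step and raises an error otherwise.

```python
def fluent(x, i):
    """Add time step i as a final argument: move(r1,l1,l2) -> move(r1,l1,l2,i)."""
    return f"{x[:-1]},{i})" if x.endswith(")") else f"{x}({i})"


def extract_plan(model, action_names, n):
    """Return the action whose fluent a_i is true at each step i = 0..n-1."""
    plan = []
    for i in range(n):
        chosen = [a for a in action_names if model.get(fluent(a, i), False)]
        if len(chosen) != 1:  # exclusion axioms alone only guarantee "at most one"
            raise ValueError(f"expected one action at step {i}, found {chosen}")
        plan.append(chosen[0])
    return plan


# The unique model of the (P,1) formula for the one-robot example.
model = {
    "at(r1,l1,0)": True, "at(r1,l2,0)": False,
    "at(r1,l1,1)": False, "at(r1,l2,1)": True,
    "move(r1,l1,l2,0)": True, "move(r1,l2,l1,0)": False,
}
plan = extract_plan(model, ["move(r1,l1,l2)", "move(r1,l2,l1)"], n=1)
print(plan)  # prints: ['move(r1,l1,l2)']
```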
24 What SATPLAN Shows
- General propositional reasoning can compete with state-of-the-art specialized planning systems
- New, highly tuned variations of DP (Davis-Putnam) are surprisingly powerful
- Radically new stochastic approaches to SAT can provide very low exponential scaling
- Why does it work?
- More flexible than forward or backward chaining
- Randomized algorithms are less likely to get trapped along bad paths
25 BlackBox (GraphPlan + SatPlan)
- The BlackBox procedure combines planning-graph expansion and satisfiability checking
- It is roughly as follows:
- for n = 0, 1, 2, ...
- Graph expansion:
- create a planning graph that contains n levels
- Check whether the planning graph satisfies a necessary (but insufficient) condition for plan existence
- If it does, then
- Encode (P,n) as a satisfiability problem $\Phi$ but include only the actions in the planning graph
- If $\Phi$ is satisfiable, then return the solution
26 Blackbox
Can be thought of as an implementation of GraphPlan that uses a different plan extraction technique from GraphPlan's backward chaining.
[Figure: Blackbox architecture: STRIPS → Plan Graph (with mutex computation) → Translator → CNF → Simplifier → CNF → General Stochastic / Systematic SAT engines → Solution]
27 Classical Planning Assumptions
[Figure: agent interacting with the World through Actions and Percepts; annotations: sole source of change, perfect, deterministic, fully observable, instantaneous]
28 Stochastic/Probabilistic Planning: Markov Decision Process (MDP) Model
[Figure: agent interacting with the World through Actions and Percepts; annotations: sole source of change, perfect, stochastic, fully observable, instantaneous]
29 Types of Uncertainty
- Disjunctive (used by non-deterministic planning)
- Next state could be one of a set of states.
- Stochastic/Probabilistic
- Next state is drawn from a probability distribution over the set of states.
- How are these models related?
30 Markov Decision Processes
- An MDP has four components: S, A, R, T
- (finite) state set S (|S| = n)
- (finite) action set A (|A| = m)
- (Markov) transition function $T(s,a,s') = \Pr(s' \mid s,a)$
- Probability of going to state s' after taking action a in state s
- How many parameters does it take to represent?
- bounded, real-valued reward function R(s)
- Immediate reward we get for being in state s
- For example, in a goal-based domain R(s) may equal 1 for goal states and 0 for all others
- Can be generalized to include action costs: R(s,a)
- Can be generalized to be a stochastic function
- Can easily generalize to countable or continuous state and action spaces (but algorithms will be different)
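One way to store a finite MDP is sketched below with NumPy, under our own conventions: `T[s, a, s2]` holds Pr(s2 | s, a) and `R[s]` holds R(s). The sizes and the random probabilities are placeholders. The sketch also answers the parameter question for the tabular case. T has n * m * n entries, but each row T[s, a, :] must sum to 1, so only n * m * (n - 1) of them are free.

```python
import numpy as np

n, m = 3, 2                                # |S| = n, |A| = m (placeholder sizes)
rng = np.random.default_rng(0)
T = rng.random((n, m, n))                  # T[s, a, s2] = Pr(s2 | s, a), random placeholder values
T /= T.sum(axis=2, keepdims=True)          # normalize each row T[s, a, :] into a distribution
R = np.zeros(n)
R[n - 1] = 1.0                             # goal-based reward: 1 in a goal state, 0 elsewhere

assert np.allclose(T.sum(axis=2), 1.0)
print(T.size, n * m * (n - 1))             # prints: 18 12 (table entries, free parameters)
```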
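31 Graphical View of MDP
[Figure: MDP unrolled over time, with state nodes $S_t, S_{t+1}, S_{t+2}$, action nodes $A_t, A_{t+1}$, and reward nodes $R_t, R_{t+1}, R_{t+2}$]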
32 Assumptions
- First-Order Markovian dynamics (history independence)
- $\Pr(S_{t+1} \mid A_t, S_t, A_{t-1}, S_{t-1}, \ldots, S_0) = \Pr(S_{t+1} \mid A_t, S_t)$
- Next state only depends on current state and current action
- First-Order Markovian reward process
- $\Pr(R_t \mid A_t, S_t, A_{t-1}, S_{t-1}, \ldots, S_0) = \Pr(R_t \mid A_t, S_t)$
- Reward only depends on current state and action
- As described earlier, we will assume reward is specified by a deterministic function R(s)
- i.e. $\Pr(R_t = R(S_t) \mid A_t, S_t) = 1$
- Stationary dynamics and reward
- $\Pr(S_{t+1} \mid A_t, S_t) = \Pr(S_{k+1} \mid A_k, S_k)$ for all t, k
- The world dynamics do not depend on the absolute time
- Full observability
- Though we can't predict exactly which state we will reach when we execute an action, once it is realized, we know what it is
33 Policies (plans for MDPs)
- Nonstationary policy
- $\pi: S \times T \to A$, where T is the non-negative integers
- $\pi(s,t)$ is the action to do at state s with t stages-to-go
- What if we want to keep acting indefinitely?
- Stationary policy
- $\pi: S \to A$
- $\pi(s)$ is the action to do at state s (regardless of time)
- specifies a continuously reactive controller
- These assume or have these properties:
- full observability
- history-independence
- deterministic action choice
Why not just consider sequences of actions? Why not just replan?
34 Value of a Policy
- How good is a policy $\pi$?
- How do we measure accumulated reward?
- Value function $V: S \to \mathbb{R}$ associates a value with each state (or each state and time for non-stationary $\pi$)
- $V_\pi(s)$ denotes the value of the policy at state s
- Depends on immediate reward, but also on what you achieve subsequently by following $\pi$
- An optimal policy is one that is no worse than any other policy at any state
- The goal of MDP planning is to compute an optimal policy (the method depends on how we define value)
35 Policy Evaluation
- Value equation for fixed policy
- How can we compute the value function for a policy?
- we are given R and Pr
- simple linear system with n variables (each variable is the value of a state) and n constraints (one value equation for each state)
- Use linear algebra (e.g. matrix inverse)
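The slide does not reproduce the value equation itself. The sketch below assumes the common infinite-horizon discounted setting, with a discount factor β in [0, 1) that does not appear on this slide. Under that assumption the equation for a fixed policy is $V_\pi(s) = R(s) + \beta \sum_{s'} T(s,\pi(s),s')\,V_\pi(s')$, and the resulting n-by-n linear system can be solved directly. The sketch calls `np.linalg.solve` rather than forming the inverse explicitly, which is the numerically preferred way to do the same linear algebra. The function name and the two-state example are hypothetical.

```python
import numpy as np


def evaluate_policy(T, R, policy, beta):
    """Solve V = R + beta * T_pi V, i.e. (I - beta * T_pi) V = R.

    T[s, a, s2] = Pr(s2 | s, a), R[s] = R(s), policy[s] = action index,
    beta = discount factor in [0, 1) (an assumption; not on the slide).
    """
    n = len(R)
    T_pi = T[np.arange(n), policy]          # (n, n) transition matrix when following the policy
    return np.linalg.solve(np.eye(n) - beta * T_pi, R)


# Hypothetical two-state example: state 1 is absorbing and rewarding.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = 1.0   # state 0, action 0: stay
T[0, 1, 1] = 1.0   # state 0, action 1: move to state 1
T[1, :, 1] = 1.0   # state 1: stay under both actions
R = np.array([0.0, 1.0])
V = evaluate_policy(T, R, policy=np.array([1, 0]), beta=0.9)
print(V)
```

With β = 0.9, state 1 has value 1 / (1 - 0.9) = 10, and state 0, which moves there in one step with no immediate reward, has value 0.9 * 10 = 9.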
36 Value Iteration vs. Policy Iteration
- Which is faster, VI or PI?
- It depends on the problem
- VI takes more iterations than PI, but PI requires more time on each iteration
- PI must perform policy evaluation on each step, which involves solving a linear system
- Complexity
- There are at most exp(n) policies, so PI is no worse than exponential time in the number of states
- Empirically, O(n) iterations are required
- Still no polynomial bound on the number of PI iterations (open problem)!
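To make the comparison concrete, here is a minimal sketch of both algorithms on a hypothetical two-state MDP. It again assumes discounted value with β = 0.9, which is an assumption and not part of the slide. Ties in the greedy step are broken toward the lower action index. On this toy problem, policy iteration stops after 2 iterations, each of which solves a linear system. Value iteration needs well over a hundred cheap sweeps to reach a 1e-8 tolerance. One tiny example illustrates the trade-off but says nothing about worst-case bounds.

```python
import numpy as np


def q_values(T, R, V, beta):
    """Q[s, a] = R(s) + beta * sum over s2 of T[s, a, s2] * V[s2]."""
    return R[:, None] + beta * (T @ V)


def value_iteration(T, R, beta, eps=1e-8, max_iters=100_000):
    V = np.zeros(len(R))
    for it in range(1, max_iters + 1):
        V_new = q_values(T, R, V, beta).max(axis=1)   # Bellman backup for every state
        if np.max(np.abs(V_new - V)) < eps:
            return q_values(T, R, V_new, beta).argmax(axis=1), V_new, it
        V = V_new
    raise RuntimeError("value iteration did not converge")


def policy_iteration(T, R, beta, max_iters=10_000):
    n = len(R)
    policy = np.zeros(n, dtype=int)
    for it in range(1, max_iters + 1):
        T_pi = T[np.arange(n), policy]
        V = np.linalg.solve(np.eye(n) - beta * T_pi, R)        # policy evaluation (linear system)
        new_policy = q_values(T, R, V, beta).argmax(axis=1)    # greedy policy improvement
        if np.array_equal(new_policy, policy):
            return policy, V, it
        policy = new_policy
    raise RuntimeError("policy iteration did not converge")


T = np.zeros((2, 2, 2))
T[0, 0, 0] = 1.0   # state 0, action 0: stay
T[0, 1, 1] = 1.0   # state 0, action 1: move to state 1
T[1, :, 1] = 1.0   # state 1 is absorbing
R = np.array([0.0, 1.0])

pi_vi, V_vi, it_vi = value_iteration(T, R, beta=0.9)
pi_pi, V_pi, it_pi = policy_iteration(T, R, beta=0.9)
print(it_vi, it_pi)
```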