Title: 11/22: Conditional Planning
1. 11/22: Conditional Planning / Replanning
- Current Standings sent
- Semester project report due 11/30
- Homework 4 will be due before the last class
- Next class: Review of MDPs (please read Chapter 16 and the class slides)
2. Sensing Actions
- Sensing actions in essence partition a belief state
- Sensing a formula f splits a belief state B into B|f and B|¬f
- Both partitions now need to be taken to the goal state -- Tree plan
- AO* search
- Heuristics will have to compare two generalized AND branches
- In the figure, the lower branch has an expected cost of 11,000
- The upper branch has a fixed sensing cost of 300 plus, based on the outcome, a cost of 7 or 12,000
- If we consider the worst-case cost, we assume the cost is 12,300
- If we consider both outcomes to be equally likely, we assume a cost of 6,303.5 units (worked out below)
- If we know the actual probabilities with which the sensing action returns one result as against the other, we can use them to get the expected cost
[Figure: the sensing action As costs 300 and leads to two branches with costs 7 and 12,000; the alternative action A leads to a branch with expected cost 11,000]
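A worked version of the numbers above, assuming (as the slide's second case does) that the two sensing outcomes are equally likely:

```latex
\text{worst case: } 300 + \max(7,\ 12000) = 12300
\qquad
\text{equally likely: } 300 + \tfrac{1}{2}\cdot 7 + \tfrac{1}{2}\cdot 12000 = 6303.5
```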
3. Sensing: General observations
- Sensing can be thought of in terms of
- Specific state variables whose values can be found
- OR sensing actions that evaluate the truth of some boolean formula over the state variables
- e.g. Sense(p) vs. Sense(p ∨ (q ∧ r))
- A general action may have both causative effects and sensing effects
- A sensing effect changes the agent's knowledge, and not the world
- A causative effect changes the world (and may give certain knowledge to the agent)
- A pure sensing action only has sensing effects; a pure causative action only has causative effects.
4. Progression/Regression with Sensing
- When applied to a belief state, AT RUN TIME the sensing effects of an action wind up reducing the cardinality of that belief state
- basically by removing all states that are not consistent with the sensed effects
- AT PLAN TIME, sensing actions PARTITION belief states (see the sketch below)
- If you apply Sense-f? to a belief state B, you get a partition of B: B1 = B|f and B2 = B|¬f
- You will have to make a plan that takes both partitions to the goal state
- Introduces branches in the plan
- If you regress the two belief states B|f and B|¬f over a sensing action Sense-f?, you get the belief state B
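A minimal sketch of the run-time vs. plan-time views described above, with belief states encoded as sets of states (an illustrative assumption, not the slides' representation):

```python
def run_time_sense(belief, formula, observed):
    """At RUN TIME an observation prunes B: keep only states consistent with it."""
    return frozenset(s for s in belief if formula(s) == observed)

def plan_time_sense(belief, formula):
    """At PLAN TIME Sense-f? partitions B into B|f and B|not-f; both must be planned for."""
    b_f    = frozenset(s for s in belief if formula(s))
    b_notf = frozenset(s for s in belief if not formula(s))
    return b_f, b_notf

# Example: a belief state uncertain about p (states are frozensets of true fluents)
b = frozenset({frozenset({"p"}), frozenset()})
holds_p = lambda s: "p" in s

assert run_time_sense(b, holds_p, True) == frozenset({frozenset({"p"})})   # cardinality reduced
b_p, b_notp = plan_time_sense(b, holds_p)        # two partitions, each needs its own branch
assert b_p | b_notp == b and not (b_p & b_notp)  # a genuine partition of B
```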
5-6. (No Transcript)
7. Note: full vs. partial observability is independent of sensing individual fluents vs. sensing formulas.
(assuming single-literal sensing)
Let P be the set of all state variables and B the subset that can be sensed: if a state variable p is in B, then there is some action Ap that can sense whether p is true or false.
If P = B, the problem is fully observable; if B is empty, the problem is non-observable; if B is a proper subset of P, it is partially observable.
8. Full observability: the state space is partitioned into singleton observation classes.
Non-observability: the entire state space is a single observation class.
Partial observability: between 1 and |S| observation classes.
9. Hardness classes for planning with sensing
- Planning with sensing is hard or easy depending on (easy case listed first):
- Whether the sensory actions give us full or partial observability
- Whether the sensory actions sense individual fluents or formulas over fluents
- Whether the sensing actions are always applicable or have preconditions that need to be achieved before the action can be done
10. A Simple Progression Algorithm in the presence of pure sensing actions
- Call the procedure Plan(BI, G, nil), where
- Procedure Plan(B, G, P):
- If G is satisfied in all states of B, then return P
- Non-deterministically choose:
- I. Non-deterministically choose a causative action a that is applicable in B.
- Return Plan(a(B), G, P;a)
- II. Non-deterministically choose a sensing action s that senses a formula f (could be a single state variable)
- Let p1 = Plan(B|f, G, nil) and p2 = Plan(B|¬f, G, nil)
- /* B|f is the set of states of B in which f is true */
- Return P; (s: if f then p1 else p2)
If we always pick I and never do II then we will produce conformant plans (if we succeed). (A runnable sketch of this procedure follows below.)
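A sketch of the procedure above, with the nondeterministic choices replaced by depth-bounded backtracking; the set-based encoding of states, belief states, and actions is an illustrative assumption, not the slides' formalism:

```python
# States are frozensets of true fluents; a belief state is a frozenset of states.
# A causative action is (name, preconds, adds, deletes), each a set of fluents.
# A sensing action is (name, sensed_fluent).

def progress(action, state):
    _, _, adds, deletes = action
    return frozenset((state - deletes) | adds)

def applicable(action, belief):
    _, preconds, _, _ = action
    return all(preconds <= s for s in belief)        # applicable in every possible state

def plan(belief, goal, causative, sensing, depth=8):
    """Return a plan as a list of action names and (sense_name, f, plan_if_f, plan_if_not_f)
    branch nodes; None if no plan is found within the depth bound."""
    if all(goal <= s for s in belief):               # G satisfied in all states of B
        return []
    if depth == 0:
        return None
    # Choice I: a causative action applicable in B (always picking I gives conformant plans)
    for a in causative:
        if applicable(a, belief):
            rest = plan(frozenset(progress(a, s) for s in belief),
                        goal, causative, sensing, depth - 1)
            if rest is not None:
                return [a[0]] + rest
    # Choice II: a sensing action for fluent f, partitioning B into B|f and B|not-f
    for name, f in sensing:
        b_f  = frozenset(s for s in belief if f in s)
        b_nf = frozenset(s for s in belief if f not in s)
        if not b_f or not b_nf:
            continue                                 # sensing f tells us nothing here
        p1 = plan(b_f,  goal, causative, sensing, depth - 1)
        p2 = plan(b_nf, goal, causative, sensing, depth - 1)
        if p1 is not None and p2 is not None:
            return [(name, f, p1, p2)]
    return None
```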
11. Remarks on progression with sensing actions
- Progression is implicitly finding an AND subtree of an AND/OR graph
- If we look for AND subgraphs, we can represent DAGs.
- The amount of sensing done in the eventual solution plan is controlled by how often we pick step I vs. step II (if we always pick I, we get conformant solutions).
- Progression is as clueless about whether to sense, and what to sense, as it is about which causative action to apply
- Need heuristic support
12. Heuristics for sensing
- We need to compare the cumulative distance of B1 and B2 to the goal with that of B3 to the goal (see the sketch below)
- Notice that planning cost is related to plan size, while plan execution cost is related to the length of the deepest branch (or the expected length of a branch)
- If we use the conformant belief-state distance (as discussed last class), then we will be overestimating the distance (since sensing may allow us to take a shorter branch)
- Bryce (ICAPS 05, submitted) starts with the conformant relaxed plan and introduces sensory actions into the plan to estimate the cost more accurately
[Figure: the child belief states B1 and B2 produced by a sensing action, and the belief state B3 produced by an alternative causative action]
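One way to read the plan-size vs. execution-cost distinction above in code, with purely hypothetical distance estimates standing in for the belief-state heuristics discussed last class:

```python
# Hypothetical heuristic distances for the sensing branches (B1, B2) and for the
# single successor (B3) of the competing causative action
h_B1, h_B2, h_B3 = 4, 9, 11
sensing_cost = 1

# Plan-size view: a branching plan must contain both branches
plan_size_estimate = sensing_cost + h_B1 + h_B2             # 14

# Execution-cost views: only one branch is actually executed
worst_branch_estimate = sensing_cost + max(h_B1, h_B2)      # 10
expected_estimate = sensing_cost + 0.5 * h_B1 + 0.5 * h_B2  # 7.5

# The conformant distance of the parent belief state ignores sensing altogether,
# and hence over-estimates relative to the branch-based estimates.
```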
13. Sensing: More things under the mat (which we won't lift for now)
- Sensing extends the notion of goals (and action preconditions).
- Findout goals: "Check if Rao is awake" vs. "Wake up Rao"
- Presents some tricky issues in terms of goal satisfaction!
- You cannot use causative effects to support findout goals
- But what if the causative effects are supporting another needed goal and wind up affecting the findout goal as a side effect? (e.g. Have-gong-go-off & find-out-if-rao-is-awake)
- Quantification is no longer syntactic sugaring in effects and preconditions in the presence of sensing actions
- "rm *" can satisfy the effect "forall files, remove(file)" without KNOWING what the files in the directory are!
- This is an alternative to finding each file's name and doing rm <file-name>
- Sensing actions can have preconditions (as well as other causative effects); they can have cost
- The problem of OVER-SENSING (sort of like a beginning driver who looks in all directions every 3 millimeters of driving; also "Sphexishness") [XII/Puccini project]
- Handling over-sensing using local closed-world assumptions
- Listing a file doesn't destroy your knowledge about the size of a file, but compressing it does. If you don't recognize this, you will always be checking the size of the file after each and every action
14. Very simple example
Problem: Init: don't know p
Goal: g
A1: p => r, ¬p    A2: ¬p => r, p    A3: r => g    O5: observe(p)
Plan: O5; if p then A1; A3 else A2; A3
Notice that in this case we also have a conformant plan: A1; A2; A3.
Whether or not the conformant plan is cheaper depends on how costly the sensing action O5 is compared to A1 and A2.
15. Very simple example
Problem: Init: don't know p
Goal: g
A1: p => r, ¬p    A2: ¬p => r, p    A3: r => g    O5: observe(p)
Plan: O5; if p then A1; A3 else A2; A3
[Figure: the plan as a tree -- O5 senses p; on Yes do A1 then A3, on No do A2 then A3]
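A small simulation of this example under the conditional-effect reading of the actions given above (A1: p => r, ¬p; A2: ¬p => r, p; A3: r => g); that reading is a reconstruction, so treat the encoding as illustrative:

```python
def A1(s):
    if s["p"]:
        s.update(r=True, p=False)
def A2(s):
    if not s["p"]:
        s.update(r=True, p=True)
def A3(s):
    if s["r"]:
        s["g"] = True

def run_conditional(s):
    # O5; if p then A1; A3 else A2; A3  (O5 observes p at run time)
    (A1 if s["p"] else A2)(s)
    A3(s)

def run_conformant(s):
    for a in (A1, A2, A3):      # A1; A2; A3 reaches g without any sensing
        a(s)

for init_p in (True, False):
    for run in (run_conditional, run_conformant):
        s = dict(p=init_p, r=False, g=False)
        run(s)
        assert s["g"]           # both plans work from either possible initial state
```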
16. A more interesting example: Medication
This domain is partially observable because the states (¬D, I, ¬B) and (¬D, ¬I, ¬B) cannot be distinguished.
The patient is not Dead and may be Ill. The test paper is not Blue. We want to make the patient not Dead and not Ill.
We have three actions:
Medicate, which makes the patient not ill if he is ill (and which is presumably unsafe to apply blindly, since otherwise just medicating would be a conformant plan);
Stain, which makes the test paper blue if the patient is ill;
Sense-paper, which can tell us whether the paper is blue or not.
No conformant plan is possible here. Also, notice that I cannot be sensed directly but only through B.
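A sketch of the Medication domain and the obvious contingent plan (Stain; Sense-paper; Medicate only if the paper turned blue). The assumption that Medicate harms a healthy patient is mine, added so that the conformant plan really does fail:

```python
def stain(s):
    if s["I"]:
        s["B"] = True
def medicate(s):
    if s["I"]:
        s["I"] = False
    else:
        s["D"] = True          # assumed: medicating a healthy patient is fatal
def contingent_plan(s):
    stain(s)
    if s["B"]:                 # Sense-paper: branch on the observed value of B
        medicate(s)

# Initial belief: not Dead, paper not Blue, Ill unknown
for ill in (True, False):
    s = {"D": False, "I": ill, "B": False}
    contingent_plan(s)
    assert not s["D"] and not s["I"]   # goal reached in both possible worlds
```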
17-21. (No Transcript)
22. Goal-directed conditional planning
- Recall that regression of the two belief states B|f and B|¬f over a sensing action Sense-f will result in a belief state B
- Search with this definition leads to two challenges:
- We have to combine search states into single ones (a sort of reverse AO* operation)
- We may need to explicitly condition a goal formula in the partially observable case (especially when certain fluents can only be indirectly sensed)
- An example is the Medicate domain, where I has to be found through B
- If you have a goal state B, you can always write it as B∧f and B∧¬f for any arbitrary f! (The goal "Happy" is achieved by achieving the twin goals "Happy∧rich" as well as "Happy∧¬rich")
- Of course, we need to pick f such that f/¬f can be sensed (i.e. f and ¬f define an observational class feature)
- This step seems to go against the grain of goal-directedness -- we may not know what to sense based on what our goal is, after all!
Regression for the partially observable case is still not well understood.
23. Regression
24. Handling the combination during regression
- We have to combine search states into single ones (a sort of reverse AO* operation)
- Two ideas:
- In addition to the normal regression children, also generate children from any pair of regressed states on the search fringe (has a breadth-first feel; can be expensive!). Tuan Le does this
- Do a contingent regression. Specifically, go ahead and generate B from B|f using Sense-f, but now you have to go forward from the ¬f branch of Sense-f to the goal too. CNLP does this
See the example.
25. Need for explicit conditioning during regression (not needed for the fully observable case)
- If you have a goal state B, you can always write it as B∧f and B∧¬f for any arbitrary f! (The goal "Happy" is achieved by achieving the twin goals "Happy∧rich" as well as "Happy∧¬rich")
- Of course, we need to pick f such that f/¬f can be sensed (i.e. f and ¬f define an observational class feature)
- This step seems to go against the grain of goal-directedness -- we may not know what to sense based on what our goal is, after all!
Notice the analogy to conditioning in evaluating a probabilistic query.
Consider the Medicate problem: coming from the goal ¬D∧¬I, we will never see the connection to sensing Blue!
26-32. (No Transcript)
33. We now have yet another way of handling unsafe links -- conditioning to put the threatening step in a different world!
Similar processing can be done for regression (PO planning is nothing but least-committed regression planning).
34. Sensing: More things under the mat
- Sensing extends the notion of goals too.
- "Check if Rao is awake" vs. "Wake up Rao"
- Presents some tricky issues in terms of goal satisfaction!
- Handling quantified effects and preconditions in the presence of sensing actions
- "rm *" can satisfy the effect "forall files, remove(file)" without KNOWING what the files in the directory are!
- Sensing actions can have preconditions (as well as other causative effects)
- The problem of OVER-SENSING (sort of like the beginning driver; also "Sphexishness") [XII/Puccini project]
- Handling over-sensing using local closed-world assumptions
- Listing a file doesn't destroy your knowledge about the size of a file, but compressing it does. If you don't recognize this, you will always be checking the size of the file after each and every action
- A general action may have both causative effects and sensing effects
- A sensing effect changes the agent's knowledge, and not the world
- A causative effect changes the world (and may give certain knowledge to the agent)
- A pure sensing action only has sensing effects; a pure causative action only has causative effects.
- Recent work on conditional planning has mostly considered simplistic sensing actions that have no preconditions and only have pure sensing effects.
- Sensing has cost!
35. 11/24
- Replanning
- MDPs
- HW4 updated: see the paper task
- Only MDP stuff to be added
36. Sensing: More things under the mat (which we won't lift for now) -- Review
(This review slide repeats the content of slide 13.)
37. Sensing: Limited Contingency planning
- In many real-world scenarios, having a plan that works in all contingencies is too hard
- An idea is to make a plan for some of the contingencies and monitor/replan as necessary.
- Qn: What contingencies should we plan for?
- The ones that are most likely to occur (this needs likelihoods)
- Qn: What do we do if an unexpected contingency arises?
- Monitor (the observable parts of) the world
- When the world leaves the expected states, replan starting from the current state.
38. Things are more complicated if the world is partially observable → we need to insert sensing actions to sense fluents that can only be indirectly sensed.
39. Triangle Tables
40. This involves disjunctive goals!
41. Replanning: Respecting Commitments
- In the real world, where you make commitments based on your plan, you cannot just throw away the plan at the first sign of failure
- One heuristic is to reuse as much of the old plan as possible while replanning.
- A more systematic approach is to:
- Capture the commitments made by the agent based on the current plan
- Give these commitments as additional soft constraints to the planner
42. Replanning as a universal antidote
- If the domain is observable and lenient to failures, and we are willing to do replanning, then we can always handle non-deterministic as well as stochastic actions with classical planning!
- Solve the deterministic relaxation of the problem
- Start executing it, while monitoring the world state
- When an unexpected state is encountered, replan (see the sketch below)
- A planner that did this, called FF-Replan, won the Probabilistic Track of the First International Planning Competition.
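A minimal sketch of this determinize-and-replan loop, in the spirit of FF-Replan but not its actual implementation; `classical_plan`, `execute`, and `most_likely_outcome` are assumed hooks for the determinized planner, the real (stochastic) world, and the relaxation's prediction:

```python
def replan_loop(state, goal_test, classical_plan, execute, max_steps=1000):
    """Plan in a deterministic relaxation, execute, monitor, and replan on surprises."""
    plan = classical_plan(state)                    # plan in the determinized problem
    for _ in range(max_steps):
        if goal_test(state):
            return True
        if not plan:                                # plan ran out short of the goal
            plan = classical_plan(state)
            if not plan:
                return False
        action = plan.pop(0)
        predicted = action.most_likely_outcome(state)   # what the relaxation expects
        state = execute(action, state)                  # stochastic outcome in the world
        if state != predicted:                          # monitoring detects a surprise
            plan = classical_plan(state)                # replan from the observed state
    return False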
43. "20 years of research into decision-theoretic planning, ...and FF-Replan is the result?"
"30 years of research into programming languages, ...and C is the result?"
44. Models of Planning

                        Uncertainty
Observation    Deterministic    Disjunctive    Probabilistic
Complete       Classical        Contingent     (FO)MDP
Partial        ???              Contingent     POMDP
None           ???              Conformant     (NO)MDP
45. MDPs as Utility-based problem-solving agents
[Equations/figure not transcribed; only the loop header "Repeat" remains]
46. [Value-update equations not transcribed; only the loop header "Repeat" remains]
Can generalize to have action costs C(a,s).
If the Mij matrix is not known a priori, then we have a reinforcement learning scenario.
47. (Value)
How about the deterministic case? U(si) is then just the shortest-path cost to the goal.
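The update these slides display as an image (hence only "Repeat" survives above) is presumably the standard Bellman value-update; reconstructing it here as a hedged reference, in the slides' Mij notation for transition probabilities:

```latex
U_{t+1}(s_i) \;=\; R(s_i) \;+\; \gamma \,\max_{a} \sum_{j} M^{a}_{ij}\, U_{t}(s_j)
```

With action costs, the maximized term becomes -C(a, s_i) + γ Σj M^a_ij U_t(s_j); in the deterministic, goal-based special case U(s_i) essentially reduces to the shortest-path cost to the goal, as slide 47 notes.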
48. [Not transcribed; only the loop header "Repeat" remains]
49. (No Transcript)
50. Policies change with rewards.. [figure not transcribed]
51. What does a solution to an MDP look like?
- The solution should tell the optimal action to do in each state (called a Policy)
- A policy is a function from states to actions (but see the finite-horizon case below)
- Not a sequence of actions anymore
- Needed because of the non-deterministic actions
- If there are |S| states and |A| actions that we can do at each state, then there are |A|^|S| policies
- How do we get the best policy?
- Pick the policy that gives the maximal expected reward
- For each policy π:
- Simulate the policy (take actions suggested by the policy) to get behavior traces
- Evaluate the behavior traces
- Take the average value of the behavior traces (a brute-force sketch of this follows below)
- How long should behavior traces be?
- Each trace is no longer than k (finite-horizon case)
- The policy will be horizon-dependent (the optimal action depends not just on what state you are in, but how far away your horizon is)
- E.g. financial portfolio advice for yuppies vs. retirees.
- No limit on the size of the trace (infinite-horizon case)
- The policy is not horizon-dependent
- Qn: Is there a simpler way than having to evaluate |A|^|S| policies?
- Yes
We will concentrate on infinite-horizon problems (infinite horizon doesn't necessarily mean that all behavior traces are infinite; they could be finite and end in a sink state).
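A brute-force sketch of the evaluate-every-policy idea above (Monte Carlo policy evaluation over sampled behavior traces, then enumeration of all |A|^|S| policies); the `step` and `reward` interfaces are illustrative assumptions:

```python
import itertools

def evaluate_policy(policy, start, step, reward, gamma=0.95, traces=200, horizon=100):
    """Average discounted return of `policy` over sampled behavior traces."""
    total = 0.0
    for _ in range(traces):
        s, ret, disc = start, 0.0, 1.0
        for _ in range(horizon):
            ret += disc * reward(s)
            s = step(s, policy[s])              # sample one successor state
            disc *= gamma
        total += ret
    return total / traces

def best_policy(states, actions, start, step, reward):
    """Enumerate all |A|^|S| policies -- only feasible for toy problems."""
    best, best_val = None, float("-inf")
    for choice in itertools.product(actions, repeat=len(states)):
        policy = dict(zip(states, choice))
        val = evaluate_policy(policy, start, step, reward)
        if val > best_val:
            best, best_val = policy, val
    return best
```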
52. (No Transcript)
53. [Figure: the stochastic action model -- the intended outcome occurs with probability .8, each of the two unintended outcomes with probability .1]
54. Updates can be done synchronously OR asynchronously -- convergence is guaranteed as long as each state is updated infinitely often.
Why are values coming down first? Why are some states reaching their optimal value faster?
[Figure: value-iteration progress on the example with the .8/.1/.1 action model]
55. Terminating Value Iteration
- The basic idea is to terminate the value iteration when the values have converged (i.e., they are not changing much from iteration to iteration)
- Set a threshold ε and stop when the change across two consecutive iterations is less than ε
- There is a minor problem, since the value is a vector
- We can bound the maximum change that is allowed in any of the dimensions between two successive iterations by ε
- The max norm ||.|| of a vector is the maximal value among all its dimensions. We are basically terminating when ||Ui - Ui+1|| < ε (see the sketch below)
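A compact value-iteration sketch using exactly the max-norm stopping test described above; the dictionary-based MDP interface is an illustrative assumption:

```python
def value_iteration(states, actions, P, R, gamma=0.95, eps=1e-4):
    """actions(s) -> iterable of actions; P[s][a] -> list of (prob, next_state); R[s] -> reward."""
    U = {s: 0.0 for s in states}
    while True:
        U_new = {}
        for s in states:
            best = max(sum(p * U[s2] for p, s2 in P[s][a]) for a in actions(s))
            U_new[s] = R[s] + gamma * best
        # Max-norm termination: stop when no dimension changed by more than eps
        if max(abs(U_new[s] - U[s]) for s in states) < eps:
            return U_new
        U = U_new
```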
56. Policies converge earlier than values
- There are a finite number of policies but an infinite number of value functions.
- So entire regions of the value-vector space are mapped to a specific policy
- So policies may converge faster than values. Search in the space of policies!
- Given a utility vector Ui we can compute the greedy policy πUi
- The policy loss of πUi is ||UπUi - U|| (where U is the optimal utility vector)
- (the max-norm difference of two vectors is the maximum amount by which they differ on any dimension)
[Figure: the value-function space for an MDP with 2 states and 2 actions, with axes V(S1) and V(S2); regions of the space map to the greedy policies P1-P4, and the optimal utility vector U is marked]
57. n linear equations with n unknowns.
We can either solve the linear equations exactly, or solve them approximately by running value iteration a few times (the update won't have the max operation).
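The linear system being referred to is the policy-evaluation equation for a fixed policy π: with the action fixed there is no max, so the n equations are linear in the n unknowns. Reconstructed here in the slides' Mij notation:

```latex
U^{\pi}(s_i) \;=\; R(s_i) \;+\; \gamma \sum_{j} M^{\pi(s_i)}_{ij}\, U^{\pi}(s_j), \qquad i = 1,\dots,n
```

Solving these exactly (e.g. by Gaussian elimination) or approximately (a few max-free value-update sweeps) is the policy-evaluation step of policy iteration.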
58. Other ways of solving MDPs
- Value and Policy iteration are the bedrock methods for solving MDPs. Both give optimality guarantees
- Both of them tend to be very inefficient for large (several-thousand-state) MDPs
- Many ideas are used to improve efficiency while giving up optimality guarantees
- E.g. consider the policy only for the more likely states (envelope extension methods)
- Interleave search and execution (Real Time Dynamic Programming)
- Do limited-depth analysis based on reachability to find the value of a state (and thereby the best action you should be doing, which is the action that is sending you the best value)
- The values of the leaf nodes are set to be their immediate rewards
- If all the leaf nodes are terminal nodes, then the backed-up value will be the true optimal value. Otherwise, it is an approximation
RTDP
59. What if you see this as a game?
If you are a perpetual optimist, then V2 = max(V3, V4)
Min-Max!
60. Incomplete observability (the dreaded POMDPs)
- To model partial observability, all we need to do is to look at the MDP in the space of belief states (belief states are fully observable even when world states are not)
- A policy maps belief states to actions
- In practice, this causes (humongous) problems
- The space of belief states is continuous (even if the underlying world is discrete and finite). GET IT? GET IT??
- Even approximate policies are hard to find (PSPACE-hard).
- Problems with a few dozen world states are hard to solve currently
- Depth-limited exploration (such as that done in adversarial games) is the only option
[Figure: a belief state such as {s1: 0.3, s2: 0.4, s4: 0.3} evolving under action sequences such as "5 LEFTs" and "5 UPs"; the figure basically shows that belief states change as we take actions]
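For concreteness, a minimal sketch of how such a probabilistic belief state is updated after an action and an observation (the standard Bayes-filter step, not anything specific to these slides; the T and O model interfaces are assumptions):

```python
def update_belief(belief, action, obs, T, O):
    """belief: {state: prob}; T(s, a) -> {s2: prob}; O(s2, a) -> {obs: prob}."""
    new = {}
    for s, p in belief.items():
        for s2, pt in T(s, action).items():
            new[s2] = new.get(s2, 0.0) + p * pt          # predict through the action model
    for s2 in list(new):
        new[s2] *= O(s2, action).get(obs, 0.0)           # weight by observation likelihood
    z = sum(new.values())
    return {s2: p / z for s2, p in new.items() if z > 0} # normalize
```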
61-64. (No Transcript)
65. MDPs and Deterministic Search
- Problem-solving agent search corresponds to what special case of MDP?
- Actions are deterministic; goal states are all equally valued, and are all sink states.
- Is it worth solving the problem using MDPs?
- The construction of an optimal policy is an overkill
- The policy, in effect, gives us the optimal path from every state to the goal state(s)
- The value function, or its approximations, on the other hand are useful. How?
- As heuristics for the problem-solving agent's search
- This shows an interesting connection between dynamic programming and state-search paradigms
- DP solves many related problems on the way to solving the one problem we want
- State search tries to solve just the problem we want
- We can use DP to find heuristics to run state search..
66. Modeling Softgoal problems as deterministic MDPs
67. SSPP -- Stochastic Shortest Path Problem: an MDP with Init and Goal states
- MDPs don't have a notion of an initial and goal state. (Process orientation instead of task orientation)
- Goals are sort of modeled by reward functions
- Allows pretty expressive goals (in theory)
- Normal MDP algorithms don't use initial-state information (since the policy is supposed to cover the entire search space anyway).
- Could consider envelope extension methods
- Compute a deterministic plan (which gives the policy for some of the states); extend the policy to other states that are likely to be encountered during execution
- RTDP methods
- SSPPs are a special case of MDPs where
- (a) the initial state is given
- (b) there are absorbing goal states
- (c) actions have costs; goal states have zero costs.
- A proper policy for an SSPP is a policy which is guaranteed to ultimately put the agent in one of the absorbing states
- For an SSPP, it would be worth finding a partial policy that only covers the relevant states (states that are reachable from the init state en route to the goal states under an optimal policy)
- Value/Policy iteration don't consider the notion of relevance
- Consider heuristic state-search algorithms
- The heuristic can be seen as an estimate of the value of a state.
68. AO* search for solving SSP problems
Main issues:
-- The cost of a node is the expected cost of its children
-- The AND tree can have LOOPS → cost backup is complicated
Intermediate nodes are given admissible heuristic estimates -- these can just be the shortest-path costs (or their estimates).
69. LAO* -- turning bottom-up labeling into a full DP
70. RTDP Approach: Interleave Planning & Execution (Simulation)
Start from the current state S. Expand the tree (either uniformly to k levels, or non-uniformly, going deeper in some branches). Evaluate the leaf nodes; back up the values to S. Update the stored value of S. Pick the action that leads to the best value. Do it, or simulate it. Loop back.
Leaf nodes are evaluated:
→ using their cached values, if the node has been evaluated by RTDP analysis in the past (use its remembered value, else the heuristic value)
→ if not, using heuristics to estimate them: a. immediate reward values, b. reachability heuristics
Sort of like depth-limited game playing (expectimax) -- Who is the game against?
Can also do reinforcement learning this way → the Mij are not known correctly in RL.
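A sketch in the spirit of the RTDP loop just described (greedy action choice under current values, Bellman backups along simulated trajectories, heuristic values at states not yet visited); the model interface is an assumption, not the exact algorithm from any particular paper:

```python
import random

def rtdp(start, goals, A, T, cost, h, trials=100, horizon=100):
    """Trial-based RTDP sketch.
    A(s) -> iterable of actions; T(s, a) -> list of (prob, next_state);
    cost(s, a) >= 0; h(s) is a heuristic value estimate (e.g. shortest-path based)."""
    V = {}                                              # values cached by earlier trials
    value = lambda s: V.get(s, h(s))                    # else fall back to the heuristic
    def q(s, a):
        return cost(s, a) + sum(p * value(s2) for p, s2 in T(s, a))
    for _ in range(trials):
        s = start
        for _ in range(horizon):
            if s in goals:
                break
            a = min(A(s), key=lambda act: q(s, act))    # greedy action under current values
            V[s] = q(s, a)                              # Bellman backup at the visited state
            probs, succs = zip(*T(s, a))                # simulate (or execute) the action
            s = random.choices(succs, weights=probs)[0]
    return V
```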
71. Greedy On-Policy RTDP without execution
→ Using the current utility values, select the action with the highest expected utility (the greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back until the values stabilize.
72. (No Transcript)
73. Envelope Extension Methods
- For each action, take the most likely outcome and discard the rest.
- Find a plan (deterministic path) from the Init state to the Goal state. This is a (very partial) policy, defined just for the states that fall on the maximum-probability state sequence.
- Consider states that are most likely to be encountered while traveling this path.
- Find a policy for those states too.
- The tricky part is to show that we can converge to the optimal policy.