Transcript and Presenter's Notes

Title: 11/22: Conditional Planning


1
11/22: Conditional Planning & Replanning
  • Current Standings sent
  • Semester project report due 11/30
  • Homework 4 will be due before the last class
  • Next class: Review of MDPs (please read Chapter
    16 and the class slides)

2
Sensing Actions
  • Sensing actions in essence partition a belief
    state
  • Sensing a formula f splits a belief state B into
    B∧f and B∧¬f
  • Both partitions need to be taken to the goal
    state now
  • Tree plan
  • AO* search
  • Heuristics will have to compare two generalized
    AND branches
  • In the figure, the lower branch has an expected
    cost of 11,000
  • The upper branch has a fixed sensing cost of 300
    and, based on the outcome, a cost of 7 or 12,000
  • If we consider the worst-case cost, we assume the
    cost is 12,300
  • If we consider both outcomes to be equally likely,
    we assume a cost of 6,303.5 units
  • If we know the actual probabilities that the
    sensing action returns one result as against the
    other, we can use them to get the expected cost
    (see the arithmetic sketch after the figure)

[Figure: two generalized AND branches -- a sensing action As with cost 300
whose outcome branches cost 7 and 12,000, and an action A whose branch has
an expected cost of 11,000]
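A few lines of arithmetic make these aggregations explicit. This is a minimal sketch in Python using the costs from the figure (300, 7, 12,000, 11,000); the probability passed to expected_cost is an illustrative assumption:

    # Comparing the two generalized AND branches from the figure.
    sense_cost = 300            # fixed cost of the sensing action
    branch_costs = [7, 12_000]  # costs of the two sensing outcomes
    lower_branch = 11_000       # expected cost of the non-sensing branch

    worst_case = sense_cost + max(branch_costs)          # 12,300
    equally_likely = sense_cost + sum(branch_costs) / 2  # 6,303.5

    def expected_cost(p_first_outcome):
        """Expected cost when the sensing-outcome probabilities are known."""
        p = p_first_outcome
        return sense_cost + p * branch_costs[0] + (1 - p) * branch_costs[1]

    print(worst_case, equally_likely, expected_cost(0.5), lower_branch)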
3
Sensing: General observations
  • Sensing can be thought in terms of
  • Specific state variables whose values can be
    found
  • OR sensing actions that evaluate the truth of some
    Boolean formula over the state variables.
  • Sense(p), Sense(p ∨ (q ∧ r))
  • A general action may have both causative effects
    and sensing effects
  • Sensing effect changes the agent's knowledge, and
    not the world
  • Causative effect changes the world (and may give
    certain knowledge to the agent)
  • A pure sensing action only has sensing effects; a
    pure causative action only has causative effects.

4
Progression/Regression with Sensing
  • When applied to a belief state, AT RUN TIME the
    sensing effects of an action wind up reducing the
    cardinality of that belief state
  • basically by removing all states that are not
    consistent with the sensed effects
  • AT PLAN TIME, Sensing actions PARTITION belief
    states
  • If you apply Sense-f? to a belief state B, you
    get a partition of B1 = B∧f and B2 = B∧¬f
  • You will have to make a plan that takes both
    partitions to the goal state
  • Introduces branches in the plan
  • If you regress two belief states B∧f and B∧¬f over
    a sensing action Sense-f?, you get the belief
    state B

5
(No Transcript)
6
(No Transcript)
7
Note: Full vs. partial observability is
independent of sensing individual fluents vs.
sensing formulas.
(Assuming single-literal sensing:) if a state
variable p is in B, then there is some action Ap
that can sense whether p is true or false.
If B = P, the problem is fully observable. If B is
empty, the problem is non-observable. If B is a
proper subset of P, it is partially observable.
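As a small illustration of this classification (a sketch; P is the set of all state variables and B the set of variables for which a single-literal sensing action exists, as above):

    def observability_class(P, B):
        """Classify a problem by which state variables can be sensed.
        P: all state variables; B: variables with a sensing action Ap."""
        B = B & P
        if B == P:
            return "fully observable"
        if not B:
            return "non-observable"
        return "partially observable"

    print(observability_class({"p", "q", "r"}, {"p", "q", "r"}))  # fully observable
    print(observability_class({"p", "q", "r"}, set()))            # non-observable
    print(observability_class({"p", "q", "r"}, {"p"}))            # partially observable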
8
Full observability: the state space is partitioned
into singleton observation classes.
Non-observability: the entire state space is a
single observation class.
Partial observability: between 1 and |S|
observation classes.
9
Hardness classes for planning with sensing
  • Planning with sensing is hard or easy depending
    on (easy case listed first)
  • Whether the sensory actions give us full or
    partial observability
  • Whether the sensory actions sense individual
    fluents or formulas on fluents
  • Whether the sensing actions are always applicable
    or have preconditions that need to be achieved
    before the action can be done

10
A Simple Progression Algorithm in the presence of
pure sensing actions
  • Call the procedure Plan(BI, G, nil) where
  • Procedure Plan(B, G, P)
  • If G is satisfied in all states of B, then
    return P
  • Non-deterministically choose
  • I. Non-deterministically choose a causative
    action a that is applicable in B.
  • Return Plan(a(B), G, P;a)
  • II. Non-deterministically choose a sensing action
    s that senses a formula f (could be a single
    state variable)
  • Let p1 = Plan(B∧f, G, nil); p2 = Plan(B∧¬f, G, nil)
  • /* B∧f is the set of states of B in which f is
    true */
  • Return P;(s? p1 : p2)

If we always pick I and never do II, then we will
produce conformant plans (if we succeed). A rough
Python rendering of this procedure follows.
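A depth-bounded sketch of this procedure, with the nondeterministic choices replaced by explicit search. The representation (belief states as frozensets of world states, world states as frozensets of true propositions, causative actions as state-to-state functions, sensing actions named by the proposition they observe) is an assumption made for the sketch, not part of the slide:

    def plan(B, goal, causative, sensing, prefix=(), depth=6):
        """Plan(B, G, P): return a (possibly branching) plan tuple, or None."""
        if all(goal <= s for s in B):          # G satisfied in every state of B
            return prefix
        if depth == 0:
            return None
        # I. Try a causative action: progress the whole belief state through it.
        for name, apply_a in causative.items():
            nb = frozenset(apply_a(s) for s in B)
            p = plan(nb, goal, causative, sensing, prefix + (name,), depth - 1)
            if p is not None:
                return p
        # II. Try a sensing action: partition B into B^f and B^~f, plan for both.
        for f in sensing:
            bf = frozenset(s for s in B if f in s)
            bnf = frozenset(s for s in B if f not in s)
            if not bf or not bnf:
                continue                       # sensing f would not split B
            p1 = plan(bf, goal, causative, sensing, (), depth - 1)
            p2 = plan(bnf, goal, causative, sensing, (), depth - 1)
            if p1 is not None and p2 is not None:
                return prefix + ((f"sense({f})", p1, p2),)   # branching step
        return None

Dropping step II entirely (never choosing a sensing action) turns this into a conformant planner, as noted above.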
11
Remarks on the progression with sensing actions
  • Progression is implicitly finding an AND subtree
    of an AND/OR Graph
  • If we look for AND subgraphs, we can represent
    DAGs.
  • The amount of sensing done in the eventual
    solution plan is controlled by how often we pick
    step I vs. step II (if we always pick I, we get
    conformant solutions).
  • Progression is as clueless about whether to do
    sensing, and which sensing to do, as it is about
    which causative action to apply
  • Need heuristic support

12
Heuristics for sensing
  • We need to compare the cumulative distance of B1
    and B2 to goal with that of B3 to goal
  • Notice that planning cost is related to plan size,
    while plan execution cost is related to the length
    of the deepest branch (or the expected length of a
    branch)
  • If we use the conformant belief state distance
    (as discussed last class), then we will be
    overestimating the distance (since sensing may
    allow us to do shorter branches)
  • Bryce (ICAPS 05, submitted) starts with the
    conformant relaxed plan and introduces sensory
    actions into the plan to estimate the cost more
    accurately

[Figure: belief states B1 and B2 (the two partitions) and B3, whose distances
to the goal are being compared]
13
Sensing: More things under the mat (which we
won't lift for now)
  • Sensing extends the notion of goals (and action
    preconditions).
  • Findout goals: "Check if Rao is awake" vs. "Wake
    up Rao"
  • Presents some tricky issues in terms of goal
    satisfaction!
  • You cannot use causative effects to support
    findout goals
  • But what if the causative effects are supporting
    another needed goal and wind up affecting the
    goal as a side-effect? (e.g. Have-gong-go-off &
    find-out-if-rao-is-awake)
  • Quantification is no longer syntactic sugar in
    effects and preconditions in the presence of
    sensing actions
  • rm can satisfy the effect "forall files,
    remove(file)" without KNOWING what the files in
    the directory are!
  • This is an alternative to finding each file's name
    and doing rm <file-name>
  • Sensing actions can have preconditions (as well
    as other causative effects); they can have cost
  • The problem of OVER-SENSING (sort of like a
    beginning driver who looks in all directions every
    3 millimeters of driving; also "Sphexishness")
    [XII/Puccini project]
  • Handling over-sensing using local closed-world
    assumptions
  • Listing a file doesn't destroy your knowledge
    about the size of a file, but
  • compressing it does. If you don't recognize
    it, you will always be checking the size of the
    file after each and every action

14
Very simple Example
Problem: Init: don't know p
Goal: g
A1: p => r, ¬p    A2: ¬p => r, p    A3: r => g
O5: observe(p)
Plan: O5; if p then (A1; A3) else (A2; A3)
Notice that in this case we also have a
conformant plan: A1; A2; A3. Whether or not
the conformant plan is cheaper depends on
how costly the sensing action O5 is compared to A1
and A2. (A rough encoding of this example follows.)
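A rough encoding of this example (a sketch; the secondary effects on p are omitted since they do not affect reaching g):

    # Conditional effects as guard -> add; a state is a frozenset of true fluents.
    def a1(s): return s | {"r"} if "p" in s else s          # p => r
    def a2(s): return s | {"r"} if "p" not in s else s      # ~p => r
    def a3(s): return s | {"g"} if "r" in s else s          # r => g

    def run_contingent(state):
        """Execute O5; if p then A1;A3 else A2;A3 on a concrete world state."""
        branch = [a1, a3] if "p" in state else [a2, a3]     # O5: observe(p)
        for act in branch:
            state = act(state)
        return state

    # The contingent plan reaches g from either initial world:
    print("g" in run_contingent(frozenset({"p"})))   # True
    print("g" in run_contingent(frozenset()))        # True

    # The conformant plan A1; A2; A3 also works here, without sensing:
    for s0 in (frozenset({"p"}), frozenset()):
        print("g" in a3(a2(a1(s0))))                 # True in both worlds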
15
Very simple Example
Problem: Init: don't know p
Goal: g
A1: p => r, ¬p    A2: ¬p => r, p    A3: r => g
O5: observe(p)
Plan: O5; if p then (A1; A3) else (A2; A3)
[Figure: the plan as a tree -- O5 senses p; the Y branch does A1 then A3, the
N branch does A2 then A3]
16
A more interesting example: Medication
This domain is partially observable because the
states (¬D, I, ¬B) and (¬D, ¬I, ¬B) cannot be
distinguished.
The patient is not Dead and may be Ill. The test
paper is not Blue. We want to make the patient
not Dead and not Ill. We have three actions:
Medicate, which makes the patient not Ill if he is
Ill; Stain, which makes the test paper Blue if the
patient is Ill; Sense-paper, which can tell us if
the paper is Blue or not.
No conformant plan is possible here. Also, notice
that I cannot be sensed directly, but only
through B.
17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
(No Transcript)
21
(No Transcript)
22
Goal-directed conditional planning
  • Recall that regression of two belief states B∧f
    and B∧¬f over a sensing action Sense-f will
    result in a belief state B
  • Search with this definition leads to two
    challenges
  • We have to combine search states into single ones
    (a sort of reverse AO* operation)
  • We may need to explicitly condition a goal
    formula in the partially observable case (especially
    when certain fluents can only be indirectly
    sensed)
  • An example is the Medicate domain, where I has to
    be found through B
  • If you have a goal state B, you can always write
    it as B∧f and B∧¬f for any arbitrary f! (The goal
    Happy is achieved by achieving the twin goals
    Happy∧rich as well as Happy∧¬rich)
  • Of course, we need to pick the f such that f/¬f
    can be sensed (i.e. f and ¬f define an
    observational class feature)
  • This step seems to go against the grain of
    goal-directedness: we may not know what to sense
    based on what our goal is, after all!

Regression for the PO case is still
not well-understood
23
Regression
24
Handling the combination during regression
  • We have to combine search states into single ones
    (a sort of reverse AO* operation)
  • Two ideas
  • In addition to the normal regression children,
    also generate children from any pair of regressed
    states on the search fringe (has a breadth-first
    feel; can be expensive!). Tuan Le does this
  • Do a contingent regression. Specifically, go
    ahead and generate B from B∧f using Sense-f, but
    now you have to go forward from the ¬f
    branch of Sense-f to the goal too. CNLP does this;
    see the example

25
Need for explicit conditioning during regression
(not needed for Fully Observable case)
  • If you have a goal state B, you can always write
    it as B∧f and B∧¬f for any arbitrary f! (The goal
    Happy is achieved by achieving the twin goals
    Happy∧rich as well as Happy∧¬rich)
  • Of course, we need to pick the f such that f/¬f
    can be sensed (i.e. f and ¬f define an
    observational class feature)
  • This step seems to go against the grain of
    goal-directedness: we may not know what to sense
    based on what our goal is, after all!

Notice the analogy to conditioning in
evaluating a probabilistic query.
Consider the Medicate problem. Coming from the
goal of ¬D∧¬I, we will never see the
connection to sensing Blue!
26
(No Transcript)
27
(No Transcript)
28
(No Transcript)
29
(No Transcript)
30
(No Transcript)
31
(No Transcript)
32
(No Transcript)
33
We now have yet another way of handling unsafe
links: conditioning to put the threatening
step in a different world!
Similar processing can be done for regression
(PO planning is nothing but least-committed
regression planning)
34
Sensing: More things under the mat
  • Sensing extends the notion of goals too.
  • "Check if Rao is awake" vs. "Wake up Rao"
  • Presents some tricky issues in terms of goal
    satisfaction!
  • Handling quantified effects and preconditions in
    the presence of sensing actions
  • rm can satisfy the effect "forall files,
    remove(file)" without KNOWING what the files in
    the directory are!
  • Sensing actions can have preconditions (as well
    as other causative effects)
  • The problem of OVER-SENSING (sort of like the
    beginning driver; also "Sphexishness") [XII/Puccini
    project]
  • Handling over-sensing using local closed-world
    assumptions
  • Listing a file doesn't destroy your knowledge
    about the size of a file, but
  • compressing it does. If you don't recognize
    it, you will always be checking the size of the
    file after each and every action
  • A general action may have both causative effects
    and sensing effects
  • Sensing effect changes the agent's knowledge, and
    not the world
  • Causative effect changes the world (and may give
    certain knowledge to the agent)
  • A pure sensing action only has sensing effects; a
    pure causative action only has causative effects.
  • The recent work on conditional planning has
    considered mostly simplistic sensing actions that
    have no preconditions and only have pure sensing
    effects.
  • Sensing has cost!

35
11/24
  • Replanning
  • MDPs
  • HW4 updated: see the paper task
  • Only MDP stuff to be added

36
Sensing: More things under the mat (which we
won't lift for now)
Review
  • Sensing extends the notion of goals (and action
    preconditions).
  • Findout goals: "Check if Rao is awake" vs. "Wake
    up Rao"
  • Presents some tricky issues in terms of goal
    satisfaction!
  • You cannot use causative effects to support
    findout goals
  • But what if the causative effects are supporting
    another needed goal and wind up affecting the
    goal as a side-effect? (e.g. Have-gong-go-off &
    find-out-if-rao-is-awake)
  • Quantification is no longer syntactic sugar in
    effects and preconditions in the presence of
    sensing actions
  • rm can satisfy the effect "forall files,
    remove(file)" without KNOWING what the files in
    the directory are!
  • This is an alternative to finding each file's name
    and doing rm <file-name>
  • Sensing actions can have preconditions (as well
    as other causative effects); they can have cost
  • The problem of OVER-SENSING (sort of like a
    beginning driver who looks in all directions every
    3 millimeters of driving; also "Sphexishness")
    [XII/Puccini project]
  • Handling over-sensing using local closed-world
    assumptions
  • Listing a file doesn't destroy your knowledge
    about the size of a file, but
  • compressing it does. If you don't recognize
    it, you will always be checking the size of the
    file after each and every action

37
Sensing: Limited Contingency planning
  • In many real-world scenarios, having a plan that
    works in all contingencies is too hard
  • An idea is to make a plan for some of the
    contingencies and monitor/Replan as necessary.
  • Qn: What contingencies should we plan for?
  • The ones that are most likely to occur (need
    likelihoods)
  • Qn: What do we do if an unexpected contingency
    arises?
  • Monitor (the observable parts of the world)
  • When the world goes outside the expected states,
    replan starting from that state.

38
Things are more complicated if the world is
partially observable: we need to insert
sensing actions to sense fluents
that can only be indirectly sensed
39
Triangle Tables
40
This involves disjunctive goals!
41
Replanning: Respecting Commitments
  • In the real world, where you make commitments based
    on your plan, you cannot just throw away the plan
    at the first sign of failure
  • One heuristic is to reuse as much of the old plan
    as possible while doing replanning.
  • A more systematic approach is to
  • Capture the commitments made by the agent based
    on the current plan
  • Give these commitments as additional soft
    constraints to the planner

42
Replanning as a universal antidote
  • If the domain is observable and lenient to
    failures, and we are willing to do replanning,
    then we can always handle non-deterministic as
    well as stochastic actions with classical
    planning!
  • Solve the deterministic relaxation of the
    problem
  • Start executing it, while monitoring the world
    state
  • When an unexpected state is encountered, replan
  • A planner that did this in the First Intl.
    Planning Competition (Probabilistic Track), called
    FF-Replan, won the competition. (A schematic of
    this loop follows.)
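A schematic of that execute-monitor-replan loop (a sketch; solve_deterministic, execute, and observe are placeholder hooks for a classical planner over the deterministic relaxation and for the environment interface, not names from the slide):

    def replan_loop(init_state, goal_test, solve_deterministic, execute, observe,
                    max_replans=100):
        """FF-Replan-style control: plan on the deterministic relaxation,
        execute while monitoring, and replan when the state diverges."""
        state = init_state
        for _ in range(max_replans):
            if goal_test(state):
                return True
            # A classical plan: a list of (action, expected_next_state) pairs.
            plan = solve_deterministic(state)
            for action, expected_next in plan:
                execute(action)
                state = observe()               # monitor the (observable) world
                if state != expected_next:      # unexpected outcome: replan
                    break
            else:
                if goal_test(state):
                    return True
        return False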

43
"20 years of research into decision-theoretic
planning... and FF-Replan is the result?"
"30 years of research into programming
languages... and C is the result?"
44
Models of Planning

Observability \ Uncertainty   Deterministic   Disjunctive   Probabilistic
Complete                      Classical       Contingent    (FO)MDP
Partial                       ???             Contingent    POMDP
None                          ???             Conformant    (NO)MDP
45
MDPs as Utility-based problem solving agents
46
can generalize to have action costs C(a,s)
If Mij matrix is not known a priori, then we
have a reinforcement learning scenario..
47
(Value)
How about the deterministic case? U(si) is the
shortest path to the goal.
48
49
(No Transcript)
50
Policies change with rewards..
51
What does a solution to an MDP look like?
  • The solution should tell the optimal action to do
    in each state (called a Policy)
  • Policy is a function from states to actions (see
    the finite-horizon case below)
  • Not a sequence of actions anymore
  • Needed because of the non-deterministic actions
  • If there are |S| states and |A| actions that we
    can do at each state, then there are |A|^|S|
    policies
  • How do we get the best policy?
  • Pick the policy that gives the maximal expected
    reward
  • For each policy π
  • Simulate the policy (take actions suggested by
    the policy) to get behavior traces
  • Evaluate the behavior traces
  • Take the average value of the behavior traces.
  • How long should behavior traces be?
  • Each trace is no longer than k (Finite Horizon
    case)
  • Policy will be horizon-dependent (optimal action
    depends not just on what state you are in, but
    how far is your horizon)
  • E.g. financial portfolio advice for yuppies vs.
    retirees.
  • No limit on the size of the trace (Infinite
    horizon case)
  • Policy is not horizon dependent
  • Qn: Is there a simpler way than having to
    evaluate |A|^|S| policies?
  • Yes

We will concentrate on infinite-horizon
problems (infinite horizon doesn't
necessarily mean that all behavior
traces are infinite. They could be finite
and end in a sink state). A sketch of the
simulate-and-average evaluation is below.
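A minimal sketch of that evaluation (the MDP interface -- sample_next and reward -- the policy dictionary, and the horizon k are assumptions made for illustration):

    def evaluate_policy(policy, start, sample_next, reward, k=50, n_traces=1000):
        """Estimate a policy's expected reward by averaging behavior traces
        of length at most k (the finite-horizon case)."""
        total = 0.0
        for _ in range(n_traces):
            s, value = start, 0.0
            for _ in range(k):
                a = policy[s]             # action suggested by the policy
                value += reward(s, a)
                s = sample_next(s, a)     # sample the next state from the Mij
            total += value
        return total / n_traces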
52
(No Transcript)
53
[Figure: action outcome probabilities 0.8, 0.1, 0.1]
54
Updates can be done synchronously OR
asynchronously; convergence is guaranteed
as long as each state is updated
infinitely often
Why are values coming down first? Why are some
states reaching optimal value faster?
[Figure: action outcome probabilities 0.8, 0.1, 0.1]
55
Terminating Value Iteration
  • The basic idea is to terminate the value
    iteration when the values have converged (i.e.,
    are not changing much from iteration to iteration)
  • Set a threshold ε and stop when the change across
    two consecutive iterations is less than ε
  • There is a minor problem, since the value is a
    vector
  • We can bound the maximum change that is allowed
    in any of the dimensions between two successive
    iterations by ε
  • The max norm ||.|| of a vector is the maximal value
    among all its dimensions. We are basically
    terminating when ||Ui - Ui+1|| < ε (sketched below)
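A compact sketch of value iteration with this termination test (the transition model T[s][a], given as a list of (probability, next-state) pairs, the rewards R, and the discount gamma are illustrative assumptions):

    def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-4):
        """Iterate Bellman updates until the max-norm change drops below eps."""
        U = {s: 0.0 for s in states}
        while True:
            U_new, delta = {}, 0.0
            for s in states:
                U_new[s] = R[s] + gamma * max(
                    sum(p * U[s2] for p, s2 in T[s][a]) for a in actions)
                delta = max(delta, abs(U_new[s] - U[s]))
            U = U_new
            if delta < eps:                # ||Ui - Ui+1|| < eps
                return U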

56
Policies converge earlier than values
  • There are a finite number of policies but an
    infinite number of value functions.
  • So entire regions of the value-vector space are
    mapped to a specific policy
  • So policies may be converging faster than
    values. Search in the space of policies
  • Given a utility vector Ui we can compute the
    greedy policy πUi (see the sketch after the
    figure)
  • The policy loss of πUi is ||U^πUi - U||
  • (the max-norm difference of two vectors is the
    maximum amount by which they differ on any
    dimension)

[Figure: the value space with axes V(S1) and V(S2); regions of the space map
to policies P1-P4, with the optimal utility point U]
Consider an MDP with 2 states and 2 actions
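Extracting the greedy policy from a utility vector, and the max norm used in the policy loss, can be written directly (a sketch; T is the same assumed transition model as in the value-iteration sketch above):

    def greedy_policy(U, states, actions, T):
        """pi_Ui: in each state, pick the action with the best expected utility."""
        return {s: max(actions,
                       key=lambda a: sum(p * U[s2] for p, s2 in T[s][a]))
                for s in states}

    def max_norm_diff(U1, U2):
        """||U1 - U2||: the maximum amount by which two value vectors differ
        on any dimension."""
        return max(abs(U1[s] - U2[s]) for s in U1)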
57
n linear equations with n unknowns.
We can either solve the linear equations exactly,
or solve them approximately by running the
value iteration a few times (the update won't
have the max operation)
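With the policy fixed, the max disappears and the n equations U(s) = R(s) + gamma * sum over s' of T(s, pi(s), s') U(s') are linear, so they can be solved exactly, for instance as follows (a sketch using numpy; T, R, and gamma are the same assumed inputs as above):

    import numpy as np

    def policy_evaluation_exact(policy, states, T, R, gamma=0.9):
        """Solve the linear system (I - gamma * T_pi) U = R for a fixed policy."""
        idx = {s: i for i, s in enumerate(states)}
        A = np.eye(len(states))
        b = np.array([R[s] for s in states], dtype=float)
        for s in states:
            for p, s2 in T[s][policy[s]]:
                A[idx[s], idx[s2]] -= gamma * p
        return dict(zip(states, np.linalg.solve(A, b)))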
58
Other ways of solving MDPs
  • Value and Policy iteration are the bed-rock
    methods for solving MDPs. Both give optimality
    guarantees
  • Both of them tend to be very inefficient for
    large (several thousand state) MDPs
  • Many ideas are used to improve the efficiency
    while giving up optimality guarantees
  • E.g. Consider the part of the policy for more
    likely states (envelope extension method)
  • Interleave search and execution (Real Time
    Dynamic Programming)
  • Do limited-depth analysis based on reachability
    to find the value of a state (and thereby the
    best action you should be doing, which is the
    action that gives you the best value)
  • The values of the leaf nodes are set to be their
    immediate rewards
  • If all the leaf nodes are terminal nodes, then
    the backed up value will be true optimal value.
    Otherwise, it is an approximation

RTDP
59
What if you see this as a game?
If you are a perpetual optimist, then V2 =
max(V3, V4)
Min-Max!
60
Incomplete observability (the dreaded POMDPs)
  • To model partial observability, all we need to do
    is to look at MDP in the space of belief states
    (belief states are fully observable even when
    world states are not)
  • Policy maps belief states to actions
  • In practice, this causes (humongous) problems
  • The space of belief states is continuous (even
    if the underlying world is discrete and finite).
    GET IT? GET IT??
  • Even approximate policies are hard to find
    (PSPACE-hard).
  • Problems with few dozen world states are hard to
    solve currently
  • Depth-limited exploration (such as that done in
    adversarial games) is the only option

Belief state: s1: 0.3, s2: 0.4, s4: 0.3
[Figure: belief states after 5 LEFTs and after 5 UPs]
This figure basically shows that belief states
change as we take actions (a one-step belief
update is sketched below)
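That change can be made concrete with a one-step belief update (a sketch; trans_prob and obs_prob stand in for the POMDP's transition and observation models and are not names from the slide):

    def belief_update(belief, action, observation, trans_prob, obs_prob):
        """One step of belief tracking: predict through the action model,
        then condition on the observation and renormalize."""
        predicted = {}
        for s, p in belief.items():
            for s2, pt in trans_prob(s, action).items():
                predicted[s2] = predicted.get(s2, 0.0) + p * pt
        weighted = {s: obs_prob(s, action, observation) * p
                    for s, p in predicted.items()}
        z = sum(weighted.values())
        return {s: p / z for s, p in weighted.items()} if z else predicted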
61
(No Transcript)
62
(No Transcript)
63
(No Transcript)
64
(No Transcript)
65
MDPs and Deterministic Search
  • Problem solving agent search corresponds to what
    special case of MDP?
  • Actions are deterministic; goal states are all
    equally valued, and are all sink states.
  • Is it worth solving the problem using MDPs?
  • The construction of the optimal policy is an
    overkill
  • The policy, in effect, gives us the optimal path
    from every state to the goal state(s)
  • The value function, or its approximations, on the
    other hand, is useful. How?
  • As heuristics for the problem-solving agent's
    search
  • This shows an interesting connection between
    dynamic programming and state search paradigms
  • DP solves many related problems on the way to
    solving the one problem we want
  • State search tries to solve just the problem we
    want
  • We can use DP to find heuristics to run state
    search..

66
Modeling Softgoal problems as deterministic MDPs
67
SSPP (Stochastic Shortest Path Problem): an MDP with
Init and Goal states
  • MDPs don't have a notion of an initial and a
    goal state (process orientation instead of
    task orientation)
  • Goals are sort of modeled by reward functions
  • Allows pretty expressive goals (in theory)
  • Normal MDP algorithms don't use initial state
    information (since the policy is supposed to cover
    the entire search space anyway).
  • Could consider envelope extension methods
  • Compute a deterministic plan (which gives the
    policy for some of the states); extend the policy
    to other states that are likely to be encountered
    during execution
  • RTDP methods
  • SSPPs are a special case of MDPs where
  • (a) initial state is given
  • (b) there are absorbing goal states
  • (c) Actions have costs. Goal states have zero
    costs.
  • A proper policy for an SSPP is a policy which is
    guaranteed to ultimately put the agent in one of
    the absorbing states
  • For SSPPs, it would be worth finding a partial
    policy that only covers the relevant states
    (states that are reachable from the init and goal
    states on any optimal policy)
  • Value/Policy Iteration don't consider the notion
    of relevance
  • Consider heuristic state search algorithms
  • Heuristic can be seen as the estimate of the
    value of a state.

68
AO* search for solving SSP problems
Main issues: -- The cost of a node is the
expected cost of its children -- The AND
tree can have LOOPS, so cost backup
is complicated
Intermediate nodes are given admissible heuristic
estimates -- these can be just the shortest paths
(or their estimates)
69
LAO*: turning bottom-up labeling into a full DP
70
RTDP Approach: Interleave Planning & Execution
(Simulation)
Start from the current state S. Expand the tree
(either uniformly to k levels, or
non-uniformly, going deeper in some
branches). Evaluate the leaf nodes; back up the
values to S. Update the stored value of S. Pick
the action that leads to the best value. Do it, or
simulate it. Loop back. Leaf nodes are evaluated
by: using their cached values (if a
node has been evaluated using RTDP
analysis in the past, you use its
remembered value; else use the
heuristic value); if not, use heuristics to
estimate: a. immediate reward values,
b. reachability heuristics.
Sort of like depth-limited game-playing
(expectimax) -- Who is the game against? Can
also do reinforcement learning this way
(the Mij are not known correctly in RL).
A sketch of one such backup is below.
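A sketch of one such backup (depth-limited lookahead with cached values at the leaves; the interfaces T(s, a) returning (probability, next-state) pairs, R(s, a), value_cache, and heuristic are assumptions):

    def rtdp_backup(s, depth, actions, T, R, value_cache, heuristic, gamma=0.95):
        """Depth-limited expectimax lookahead from s; leaves use cached values
        if available, else the heuristic. Returns (best value, best action)."""
        if depth == 0:
            return value_cache.get(s, heuristic(s)), None
        best_value, best_action = float("-inf"), None
        for a in actions:
            q = R(s, a) + gamma * sum(
                p * rtdp_backup(s2, depth - 1, actions, T, R,
                                value_cache, heuristic, gamma)[0]
                for p, s2 in T(s, a))
            if q > best_value:
                best_value, best_action = q, a
        value_cache[s] = best_value        # update the stored value of s
        return best_value, best_action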
71
Greedy "On-Policy" RTDP without execution:
Using the current utility values, select the
action with the highest expected utility
(the greedy action) at each state, until you reach
a terminating state. Update the values along
this path. Loop back until the values stabilize.
72
(No Transcript)
73
Envelope Extension Methods
  • For each action, take the most likely outcome and
    discard the rest.
  • Find a plan (deterministic path) from Init to
    Goal state. This is a (very partial) policy for
    just the states that fall on the maximum
    probability state sequence.
  • Consider states that are most likely to be
    encountered while traveling this path.
  • Find policy for those states too.
  • The tricky part is to show that we can converge to
    the optimal policy