Transcript and Presenter's Notes

Title: 11/22: Conditional Planning


1
11/22: Conditional Planning & Replanning
  • Current Standings sent
  • Semester project report due 11/30
  • Homework 4 will be due before the last class
  • Next class: Review of MDPs (please read Chapter
    16 and the class slides)

2
Sensing Actions
  • Sensing actions in essence partition a belief
    state
  • Sensing a formula f splits a belief state B into
    B∧f and B∧¬f
  • Both partitions need to be taken to the goal
    state now
  • Tree plan
  • AO* search
  • Heuristics will have to compare two generalized
    AND branches
  • In the figure, the lower branch has an expected
    cost of 11,000
  • The upper branch has a fixed sensing cost of 300
    and, based on the outcome, a cost of 7 or 12,000
  • If we consider the worst-case cost, we assume the
    cost is 12,300
  • If we consider both outcomes to be equally likely,
    we assume a cost of 6,303.5 units
  • If we know the actual probabilities that the
    sensing action returns one result as against the
    other, we can use them to get the expected cost
    (see the arithmetic sketch after the figure)

[Figure: two generalized AND branches -- a sensing action As with cost 300
whose outcome branches cost 7 and 12,000, and an action A whose branch has
an expected cost of 11,000]
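A few lines of arithmetic make these aggregations explicit. This is a minimal sketch in Python using the costs from the figure (300, 7, 12,000, 11,000); the probability passed to expected_cost is an illustrative assumption:

    # Comparing the two generalized AND branches from the figure.
    sense_cost = 300            # fixed cost of the sensing action
    branch_costs = [7, 12_000]  # costs of the two sensing outcomes
    lower_branch = 11_000       # expected cost of the non-sensing branch

    worst_case = sense_cost + max(branch_costs)          # 12,300
    equally_likely = sense_cost + sum(branch_costs) / 2  # 6,303.5

    def expected_cost(p_first_outcome):
        """Expected cost when the sensing-outcome probabilities are known."""
        p = p_first_outcome
        return sense_cost + p * branch_costs[0] + (1 - p) * branch_costs[1]

    print(worst_case, equally_likely, expected_cost(0.5), lower_branch)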
3
Sensing: General observations
  • Sensing can be thought in terms of
  • Specific state variables whose values can be
    found
  • OR sensing actions that evaluate the truth of some
    Boolean formula over the state variables.
  • Sense(p), Sense(p ∨ (q ∧ r))
  • A general action may have both causative effects
    and sensing effects
  • Sensing effect changes the agent's knowledge, and
    not the world
  • Causative effect changes the world (and may give
    certain knowledge to the agent)
  • A pure sensing action only has sensing effects; a
    pure causative action only has causative effects.

4
Progression/Regression with Sensing
  • When applied to a belief state, AT RUN TIME the
    sensing effects of an action wind up reducing the
    cardinality of that belief state
  • basically by removing all states that are not
    consistent with the sensed effects
  • AT PLAN TIME, Sensing actions PARTITION belief
    states
  • If you apply Sense-f? to a belief state B, you
    get a partition of B1 = B∧f and B2 = B∧¬f
  • You will have to make a plan that takes both
    partitions to the goal state
  • Introduces branches in the plan
  • If you regress two belief states B∧f and B∧¬f over
    a sensing action Sense-f?, you get the belief
    state B

5
(No Transcript)
6
(No Transcript)
7
Note: Full vs. partial observability is
independent of sensing individual fluents vs.
sensing formulas.
(Assuming single-literal sensing:) if a state
variable p is in B, then there is some action Ap
that can sense whether p is true or false.
If B = P, the problem is fully observable. If B is
empty, the problem is non-observable. If B is a
proper subset of P, it is partially observable.
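As a small illustration of this classification (a sketch; P is the set of all state variables and B the set of variables for which a single-literal sensing action exists, as above):

    def observability_class(P, B):
        """Classify a problem by which state variables can be sensed.
        P: all state variables; B: variables with a sensing action Ap."""
        B = B & P
        if B == P:
            return "fully observable"
        if not B:
            return "non-observable"
        return "partially observable"

    print(observability_class({"p", "q", "r"}, {"p", "q", "r"}))  # fully observable
    print(observability_class({"p", "q", "r"}, set()))            # non-observable
    print(observability_class({"p", "q", "r"}, {"p"}))            # partially observable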
8
Full observability: the state space is partitioned
into singleton observation classes.
Non-observability: the entire state space is a
single observation class.
Partial observability: between 1 and |S|
observation classes.
9
Hardness classes for planning with sensing
  • Planning with sensing is hard or easy depending
    on (easy case listed first)
  • Whether the sensory actions give us full or
    partial observability
  • Whether the sensory actions sense individual
    fluents or formulas on fluents
  • Whether the sensing actions are always applicable
    or have preconditions that need to be achieved
    before the action can be done

10
A Simple Progression Algorithm in the presence of
pure sensing actions
  • Call the procedure Plan(BI, G, nil) where
  • Procedure Plan(B, G, P)
  • If G is satisfied in all states of B, then
    return P
  • Non-deterministically choose
  • I. Non-deterministically choose a causative
    action a that is applicable in B.
  • Return Plan(a(B), G, P;a)
  • II. Non-deterministically choose a sensing action
    s that senses a formula f (could be a single
    state variable)
  • Let p1 = Plan(B∧f, G, nil); p2 = Plan(B∧¬f, G, nil)
  • /* B∧f is the set of states of B in which f is
    true */
  • Return P;(s? p1 : p2)

If we always pick I and never do II, then we will
produce conformant plans (if we succeed). A rough
Python rendering of this procedure follows.
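A depth-bounded sketch of this procedure, with the nondeterministic choices replaced by explicit search. The representation (belief states as frozensets of world states, world states as frozensets of true propositions, causative actions as state-to-state functions, sensing actions named by the proposition they observe) is an assumption made for the sketch, not part of the slide:

    def plan(B, goal, causative, sensing, prefix=(), depth=6):
        """Plan(B, G, P): return a (possibly branching) plan tuple, or None."""
        if all(goal <= s for s in B):          # G satisfied in every state of B
            return prefix
        if depth == 0:
            return None
        # I. Try a causative action: progress the whole belief state through it.
        for name, apply_a in causative.items():
            nb = frozenset(apply_a(s) for s in B)
            p = plan(nb, goal, causative, sensing, prefix + (name,), depth - 1)
            if p is not None:
                return p
        # II. Try a sensing action: partition B into B^f and B^~f, plan for both.
        for f in sensing:
            bf = frozenset(s for s in B if f in s)
            bnf = frozenset(s for s in B if f not in s)
            if not bf or not bnf:
                continue                       # sensing f would not split B
            p1 = plan(bf, goal, causative, sensing, (), depth - 1)
            p2 = plan(bnf, goal, causative, sensing, (), depth - 1)
            if p1 is not None and p2 is not None:
                return prefix + ((f"sense({f})", p1, p2),)   # branching step
        return None

Dropping step II entirely (never choosing a sensing action) turns this into a conformant planner, as noted above.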
11
Remarks on the progression with sensing actions
  • Progression is implicitly finding an AND subtree
    of an AND/OR Graph
  • If we look for AND subgraphs, we can represent
    DAGs.
  • The amount of sensing done in the eventual
    solution plan is controlled by how often we pick
    step I vs. step II (if we always pick I, we get
    conformant solutions).
  • Progression is as clueless about whether to do
    sensing, and which sensing to do, as it is about
    which causative action to apply
  • Need heuristic support

12
Heuristics for sensing
  • We need to compare the cumulative distance of B1
    and B2 to goal with that of B3 to goal
  • Notice that planning cost is related to plan size,
    while plan execution cost is related to the length
    of the deepest branch (or the expected length of a
    branch)
  • If we use the conformant belief state distance
    (as discussed last class), then we will be
    overestimating the distance (since sensing may
    allow us to do shorter branches)
  • Bryce (ICAPS 05, submitted) starts with the
    conformant relaxed plan and introduces sensory
    actions into the plan to estimate the cost more
    accurately

[Figure: belief states B1 and B2 (the two partitions) and B3, whose distances
to the goal are being compared]
13
Sensing: More things under the mat (which we
won't lift for now)
  • Sensing extends the notion of goals (and action
    preconditions).
  • Findout goals: "Check if Rao is awake" vs. "Wake
    up Rao"
  • Presents some tricky issues in terms of goal
    satisfaction!
  • You cannot use causative effects to support
    findout goals
  • But what if the causative effects are supporting
    another needed goal and wind up affecting the
    goal as a side-effect? (e.g. Have-gong-go-off &
    find-out-if-rao-is-awake)
  • Quantification is no longer syntactic sugar in
    effects and preconditions in the presence of
    sensing actions
  • rm can satisfy the effect "forall files,
    remove(file)" without KNOWING what the files in
    the directory are!
  • This is an alternative to finding each file's name
    and doing rm <file-name>
  • Sensing actions can have preconditions (as well
    as other causative effects); they can have cost
  • The problem of OVER-SENSING (sort of like a
    beginning driver who looks in all directions every
    3 millimeters of driving; also "Sphexishness")
    [XII/Puccini project]
  • Handling over-sensing using local closed-world
    assumptions
  • Listing a file doesn't destroy your knowledge
    about the size of a file, but
  • compressing it does. If you don't recognize
    it, you will always be checking the size of the
    file after each and every action

14
Very simple Example
Problem: Init: don't know p
Goal: g
A1: p => r, ¬p    A2: ¬p => r, p    A3: r => g
O5: observe(p)
Plan: O5; if p then (A1; A3) else (A2; A3)
Notice that in this case we also have a
conformant plan: A1; A2; A3. Whether or not
the conformant plan is cheaper depends on
how costly the sensing action O5 is compared to A1
and A2. (A rough encoding of this example follows.)
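A rough encoding of this example (a sketch; the secondary effects on p are omitted since they do not affect reaching g):

    # Conditional effects as guard -> add; a state is a frozenset of true fluents.
    def a1(s): return s | {"r"} if "p" in s else s          # p => r
    def a2(s): return s | {"r"} if "p" not in s else s      # ~p => r
    def a3(s): return s | {"g"} if "r" in s else s          # r => g

    def run_contingent(state):
        """Execute O5; if p then A1;A3 else A2;A3 on a concrete world state."""
        branch = [a1, a3] if "p" in state else [a2, a3]     # O5: observe(p)
        for act in branch:
            state = act(state)
        return state

    # The contingent plan reaches g from either initial world:
    print("g" in run_contingent(frozenset({"p"})))   # True
    print("g" in run_contingent(frozenset()))        # True

    # The conformant plan A1; A2; A3 also works here, without sensing:
    for s0 in (frozenset({"p"}), frozenset()):
        print("g" in a3(a2(a1(s0))))                 # True in both worlds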
15
Very simple Example
Problem: Init: don't know p
Goal: g
A1: p => r, ¬p    A2: ¬p => r, p    A3: r => g
O5: observe(p)
Plan: O5; if p then (A1; A3) else (A2; A3)
[Figure: the plan as a tree -- O5 senses p; the Y branch does A1 then A3, the
N branch does A2 then A3]
16
A more interesting example: Medication
This domain is partially observable because the
states (¬D, I, ¬B) and (¬D, ¬I, ¬B) cannot be
distinguished.
The patient is not Dead and may be Ill. The test
paper is not Blue. We want to make the patient
not Dead and not Ill. We have three actions:
Medicate, which makes the patient not Ill if he is
Ill; Stain, which makes the test paper Blue if the
patient is Ill; Sense-paper, which can tell us if
the paper is Blue or not.
No conformant plan is possible here. Also, notice
that I cannot be sensed directly, but only
through B.
17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
(No Transcript)
21
(No Transcript)
22
Goal-directed conditional planning
  • Recall that regression of two belief states B∧f
    and B∧¬f over a sensing action Sense-f will
    result in a belief state B
  • Search with this definition leads to two
    challenges
  • We have to combine search states into single ones
    (a sort of reverse AO* operation)
  • We may need to explicitly condition a goal
    formula in the partially observable case (especially
    when certain fluents can only be indirectly
    sensed)
  • An example is the Medicate domain, where I has to
    be found through B
  • If you have a goal state B, you can always write
    it as B∧f and B∧¬f for any arbitrary f! (The goal
    Happy is achieved by achieving the twin goals
    Happy∧rich as well as Happy∧¬rich)
  • Of course, we need to pick the f such that f/¬f
    can be sensed (i.e. f and ¬f define an
    observational class feature)
  • This step seems to go against the grain of
    goal-directedness: we may not know what to sense
    based on what our goal is, after all!

Regression for the PO case is still
not well-understood
23
Regression
24
Handling the combination during regression
  • We have to combine search states into single ones
    (a sort of reverse AO* operation)
  • Two ideas
  • In addition to the normal regression children,
    also generate children from any pair of regressed
    states on the search fringe (has a breadth-first
    feel; can be expensive!). Tuan Le does this
  • Do a contingent regression. Specifically, go
    ahead and generate B from B∧f using Sense-f, but
    now you have to go forward from the ¬f
    branch of Sense-f to the goal too. CNLP does this;
    see the example

25
Need for explicit conditioning during regression
(not needed for Fully Observable case)
  • If you have a goal state B, you can always write
    it as B∧f and B∧¬f for any arbitrary f! (The goal
    Happy is achieved by achieving the twin goals
    Happy∧rich as well as Happy∧¬rich)
  • Of course, we need to pick the f such that f/¬f
    can be sensed (i.e. f and ¬f define an
    observational class feature)
  • This step seems to go against the grain of
    goal-directedness: we may not know what to sense
    based on what our goal is, after all!

Notice the analogy to conditioning in
evaluating a probabilistic query.
Consider the Medicate problem. Coming from the
goal of ¬D∧¬I, we will never see the
connection to sensing Blue!
26
(No Transcript)
27
(No Transcript)
28
(No Transcript)
29
(No Transcript)
30
(No Transcript)
31
(No Transcript)
32
(No Transcript)
33
We now have yet another way of handling unsafe
links: conditioning to put the threatening
step in a different world!
Similar processing can be done for regression
(PO planning is nothing but least-committed
regression planning)
34
Sensing: More things under the mat
  • Sensing extends the notion of goals too.
  • "Check if Rao is awake" vs. "Wake up Rao"
  • Presents some tricky issues in terms of goal
    satisfaction!
  • Handling quantified effects and preconditions in
    the presence of sensing actions
  • rm can satisfy the effect "forall files,
    remove(file)" without KNOWING what the files in
    the directory are!
  • Sensing actions can have preconditions (as well
    as other causative effects)
  • The problem of OVER-SENSING (sort of like the
    beginning driver; also "Sphexishness") [XII/Puccini
    project]
  • Handling over-sensing using local closed-world
    assumptions
  • Listing a file doesn't destroy your knowledge
    about the size of a file, but
  • compressing it does. If you don't recognize
    it, you will always be checking the size of the
    file after each and every action
  • A general action may have both causative effects
    and sensing effects
  • Sensing effect changes the agent's knowledge, and
    not the world
  • Causative effect changes the world (and may give
    certain knowledge to the agent)
  • A pure sensing action only has sensing effects; a
    pure causative action only has causative effects.
  • The recent work on conditional planning has
    considered mostly simplistic sensing actions that
    have no preconditions and only have pure sensing
    effects.
  • Sensing has cost!

35
11/24
  • Replanning
  • MDPs
  • HW4 updated: see the paper task
  • Only MDP stuff to be added

36
Sensing: More things under the mat (which we
won't lift for now)
Review
  • Sensing extends the notion of goals (and action
    preconditions).
  • Findout goals: "Check if Rao is awake" vs. "Wake
    up Rao"
  • Presents some tricky issues in terms of goal
    satisfaction!
  • You cannot use causative effects to support
    findout goals
  • But what if the causative effects are supporting
    another needed goal and wind up affecting the
    goal as a side-effect? (e.g. Have-gong-go-off &
    find-out-if-rao-is-awake)
  • Quantification is no longer syntactic sugar in
    effects and preconditions in the presence of
    sensing actions
  • rm can satisfy the effect "forall files,
    remove(file)" without KNOWING what the files in
    the directory are!
  • This is an alternative to finding each file's name
    and doing rm <file-name>
  • Sensing actions can have preconditions (as well
    as other causative effects); they can have cost
  • The problem of OVER-SENSING (sort of like a
    beginning driver who looks in all directions every
    3 millimeters of driving; also "Sphexishness")
    [XII/Puccini project]
  • Handling over-sensing using local closed-world
    assumptions
  • Listing a file doesn't destroy your knowledge
    about the size of a file, but
  • compressing it does. If you don't recognize
    it, you will always be checking the size of the
    file after each and every action

37
Sensing: Limited Contingency planning
  • In many real-world scenarios, having a plan that
    works in all contingencies is too hard
  • An idea is to make a plan for some of the
    contingencies and monitor/Replan as necessary.
  • Qn: What contingencies should we plan for?
  • The ones that are most likely to occur (need
    likelihoods)
  • Qn: What do we do if an unexpected contingency
    arises?
  • Monitor (the observable parts of the world)
  • When the world goes outside the expected states,
    replan starting from that state.

38
Things are more complicated if the world is
partially observable: we need to insert
sensing actions to sense fluents
that can only be indirectly sensed
39
Triangle Tables
40
This involves disjunctive goals!
41
Replanning: Respecting Commitments
  • In the real world, where you make commitments based
    on your plan, you cannot just throw away the plan
    at the first sign of failure
  • One heuristic is to reuse as much of the old plan
    as possible while doing replanning.
  • A more systematic approach is to
  • Capture the commitments made by the agent based
    on the current plan
  • Give these commitments as additional soft
    constraints to the planner

42
Replanning as a universal antidote
  • If the domain is observable and lenient to
    failures, and we are willing to do replanning,
    then we can always handle non-deterministic as
    well as stochastic actions with classical
    planning!
  • Solve the deterministic relaxation of the
    problem
  • Start executing it, while monitoring the world
    state
  • When an unexpected state is encountered, replan
  • A planner that did this in the First Intl.
    Planning Competition (Probabilistic Track), called
    FF-Replan, won the competition. (A schematic of
    this loop follows.)
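A schematic of that execute-monitor-replan loop (a sketch; solve_deterministic, execute, and observe are placeholder hooks for a classical planner over the deterministic relaxation and for the environment interface, not names from the slide):

    def replan_loop(init_state, goal_test, solve_deterministic, execute, observe,
                    max_replans=100):
        """FF-Replan-style control: plan on the deterministic relaxation,
        execute while monitoring, and replan when the state diverges."""
        state = init_state
        for _ in range(max_replans):
            if goal_test(state):
                return True
            # A classical plan: a list of (action, expected_next_state) pairs.
            plan = solve_deterministic(state)
            for action, expected_next in plan:
                execute(action)
                state = observe()               # monitor the (observable) world
                if state != expected_next:      # unexpected outcome: replan
                    break
            else:
                if goal_test(state):
                    return True
        return False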

43
"20 years of research into decision-theoretic
planning... and FF-Replan is the result?"
"30 years of research into programming
languages... and C is the result?"
44
Models of Planning

Observability \ Uncertainty   Deterministic   Disjunctive   Probabilistic
Complete                      Classical       Contingent    (FO)MDP
Partial                       ???             Contingent    POMDP
None                          ???             Conformant    (NO)MDP
45
MDPs as Utility-based problem solving agents
46
can generalize to have action costs C(a,s)
If Mij matrix is not known a priori, then we
have a reinforcement learning scenario..
47
(Value)
How about the deterministic case? U(si) is the
shortest path to the goal.
48
49
(No Transcript)
50
Policies change with rewards..
51
What does a solution to an MDP look like?
  • The solution should tell the optimal action to do
    in each state (called a Policy)
  • Policy is a function from states to actions (see
    the finite-horizon case below)
  • Not a sequence of actions anymore
  • Needed because of the non-deterministic actions
  • If there are |S| states and |A| actions that we
    can do at each state, then there are |A|^|S|
    policies
  • How do we get the best policy?
  • Pick the policy that gives the maximal expected
    reward
  • For each policy π
  • Simulate the policy (take actions suggested by
    the policy) to get behavior traces
  • Evaluate the behavior traces
  • Take the average value of the behavior traces.
  • How long should behavior traces be?
  • Each trace is no longer than k (Finite Horizon
    case)
  • Policy will be horizon-dependent (optimal action
    depends not just on what state you are in, but
    how far is your horizon)
  • E.g. financial portfolio advice for yuppies vs.
    retirees.
  • No limit on the size of the trace (Infinite
    horizon case)
  • Policy is not horizon dependent
  • Qn: Is there a simpler way than having to
    evaluate |A|^|S| policies?
  • Yes

We will concentrate on infinite-horizon
problems (infinite horizon doesn't
necessarily mean that all behavior
traces are infinite. They could be finite
and end in a sink state). A sketch of the
simulate-and-average evaluation is below.
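A minimal sketch of that evaluation (the MDP interface -- sample_next and reward -- the policy dictionary, and the horizon k are assumptions made for illustration):

    def evaluate_policy(policy, start, sample_next, reward, k=50, n_traces=1000):
        """Estimate a policy's expected reward by averaging behavior traces
        of length at most k (the finite-horizon case)."""
        total = 0.0
        for _ in range(n_traces):
            s, value = start, 0.0
            for _ in range(k):
                a = policy[s]             # action suggested by the policy
                value += reward(s, a)
                s = sample_next(s, a)     # sample the next state from the Mij
            total += value
        return total / n_traces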
52
(No Transcript)
53
[Figure: action outcome probabilities 0.8, 0.1, 0.1]
54
Updates can be done synchronously OR
asynchronously; convergence is guaranteed
as long as each state is updated
infinitely often
Why are values coming down first? Why are some
states reaching optimal value faster?
[Figure: action outcome probabilities 0.8, 0.1, 0.1]
55
Terminating Value Iteration
  • The basic idea is to terminate the value
    iteration when the values have converged (i.e.,
    are not changing much from iteration to iteration)
  • Set a threshold ε and stop when the change across
    two consecutive iterations is less than ε
  • There is a minor problem, since the value is a
    vector
  • We can bound the maximum change that is allowed
    in any of the dimensions between two successive
    iterations by ε
  • The max norm ||.|| of a vector is the maximal value
    among all its dimensions. We are basically
    terminating when ||Ui - Ui+1|| < ε (sketched below)
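A compact sketch of value iteration with this termination test (the transition model T[s][a], given as a list of (probability, next-state) pairs, the rewards R, and the discount gamma are illustrative assumptions):

    def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-4):
        """Iterate Bellman updates until the max-norm change drops below eps."""
        U = {s: 0.0 for s in states}
        while True:
            U_new, delta = {}, 0.0
            for s in states:
                U_new[s] = R[s] + gamma * max(
                    sum(p * U[s2] for p, s2 in T[s][a]) for a in actions)
                delta = max(delta, abs(U_new[s] - U[s]))
            U = U_new
            if delta < eps:                # ||Ui - Ui+1|| < eps
                return U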

56
Policies converge earlier than values
  • There are a finite number of policies but an
    infinite number of value functions.
  • So entire regions of the value-vector space are
    mapped to a specific policy
  • So policies may be converging faster than
    values. Search in the space of policies
  • Given a utility vector Ui we can compute the
    greedy policy πUi (see the sketch after the
    figure)
  • The policy loss of πUi is ||U^πUi - U||
  • (the max-norm difference of two vectors is the
    maximum amount by which they differ on any
    dimension)

[Figure: the value space with axes V(S1) and V(S2); regions of the space map
to policies P1-P4, with the optimal utility point U]
Consider an MDP with 2 states and 2 actions
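Extracting the greedy policy from a utility vector, and the max norm used in the policy loss, can be written directly (a sketch; T is the same assumed transition model as in the value-iteration sketch above):

    def greedy_policy(U, states, actions, T):
        """pi_Ui: in each state, pick the action with the best expected utility."""
        return {s: max(actions,
                       key=lambda a: sum(p * U[s2] for p, s2 in T[s][a]))
                for s in states}

    def max_norm_diff(U1, U2):
        """||U1 - U2||: the maximum amount by which two value vectors differ
        on any dimension."""
        return max(abs(U1[s] - U2[s]) for s in U1)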
57
n linear equations with n unknowns.
We can either solve the linear equations exactly,
or solve them approximately by running the
value iteration a few times (the update won't
have the max operation)
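With the policy fixed, the max disappears and the n equations U(s) = R(s) + gamma * sum over s' of T(s, pi(s), s') U(s') are linear, so they can be solved exactly, for instance as follows (a sketch using numpy; T, R, and gamma are the same assumed inputs as above):

    import numpy as np

    def policy_evaluation_exact(policy, states, T, R, gamma=0.9):
        """Solve the linear system (I - gamma * T_pi) U = R for a fixed policy."""
        idx = {s: i for i, s in enumerate(states)}
        A = np.eye(len(states))
        b = np.array([R[s] for s in states], dtype=float)
        for s in states:
            for p, s2 in T[s][policy[s]]:
                A[idx[s], idx[s2]] -= gamma * p
        return dict(zip(states, np.linalg.solve(A, b)))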
58
Other ways of solving MDPs
  • Value and Policy iteration are the bed-rock
    methods for solving MDPs. Both give optimality
    guarantees
  • Both of them tend to be very inefficient for
    large (several thousand state) MDPs
  • Many ideas are used to improve the efficiency
    while giving up optimality guarantees
  • E.g. Consider the part of the policy for more
    likely states (envelope extension method)
  • Interleave search and execution (Real Time
    Dynamic Programming)
  • Do limited-depth analysis based on reachability
    to find the value of a state (and thereby the
    best action you should be doing, which is the
    action that gives you the best value)
  • The values of the leaf nodes are set to be their
    immediate rewards
  • If all the leaf nodes are terminal nodes, then
    the backed up value will be true optimal value.
    Otherwise, it is an approximation

RTDP
59
What if you see this as a game?
If you are a perpetual optimist, then V2 =
max(V3, V4)
Min-Max!
60
Incomplete observability (the dreaded POMDPs)
  • To model partial observability, all we need to do
    is to look at MDP in the space of belief states
    (belief states are fully observable even when
    world states are not)
  • Policy maps belief states to actions
  • In practice, this causes (humongous) problems
  • The space of belief states is continuous (even
    if the underlying world is discrete and finite).
    GET IT? GET IT??
  • Even approximate policies are hard to find
    (PSPACE-hard).
  • Problems with few dozen world states are hard to
    solve currently
  • Depth-limited exploration (such as that done in
    adversarial games) is the only option

Belief state: s1: 0.3, s2: 0.4, s4: 0.3
[Figure: belief states after 5 LEFTs and after 5 UPs]
This figure basically shows that belief states
change as we take actions (a one-step belief
update is sketched below)
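That change can be made concrete with a one-step belief update (a sketch; trans_prob and obs_prob stand in for the POMDP's transition and observation models and are not names from the slide):

    def belief_update(belief, action, observation, trans_prob, obs_prob):
        """One step of belief tracking: predict through the action model,
        then condition on the observation and renormalize."""
        predicted = {}
        for s, p in belief.items():
            for s2, pt in trans_prob(s, action).items():
                predicted[s2] = predicted.get(s2, 0.0) + p * pt
        weighted = {s: obs_prob(s, action, observation) * p
                    for s, p in predicted.items()}
        z = sum(weighted.values())
        return {s: p / z for s, p in weighted.items()} if z else predicted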
61
(No Transcript)
62
(No Transcript)
63
(No Transcript)
64
(No Transcript)
65
MDPs and Deterministic Search
  • Problem solving agent search corresponds to what
    special case of MDP?
  • Actions are deterministic; goal states are all
    equally valued, and are all sink states.
  • Is it worth solving the problem using MDPs?
  • The construction of the optimal policy is an
    overkill
  • The policy, in effect, gives us the optimal path
    from every state to the goal state(s)
  • The value function, or its approximations, on the
    other hand, is useful. How?
  • As heuristics for the problem-solving agent's
    search
  • This shows an interesting connection between
    dynamic programming and state search paradigms
  • DP solves many related problems on the way to
    solving the one problem we want
  • State search tries to solve just the problem we
    want
  • We can use DP to find heuristics to run state
    search..

66
Modeling Softgoal problems as deterministic MDPs
67
SSPP (Stochastic Shortest Path Problem): an MDP with
Init and Goal states
  • MDPs don't have a notion of an initial and a
    goal state (process orientation instead of
    task orientation)
  • Goals are sort of modeled by reward functions
  • Allows pretty expressive goals (in theory)
  • Normal MDP algorithms don't use initial state
    information (since the policy is supposed to cover
    the entire search space anyway).
  • Could consider envelope extension methods
  • Compute a deterministic plan (which gives the
    policy for some of the states); extend the policy
    to other states that are likely to be encountered
    during execution
  • RTDP methods
  • SSPPs are a special case of MDPs where
  • (a) initial state is given
  • (b) there are absorbing goal states
  • (c) Actions have costs. Goal states have zero
    costs.
  • A proper policy for an SSPP is a policy which is
    guaranteed to ultimately put the agent in one of
    the absorbing states
  • For SSPPs, it would be worth finding a partial
    policy that only covers the relevant states
    (states that are reachable from the init and goal
    states on any optimal policy)
  • Value/Policy Iteration don't consider the notion
    of relevance
  • Consider heuristic state search algorithms
  • Heuristic can be seen as the estimate of the
    value of a state.

68
AO* search for solving SSP problems
Main issues: -- The cost of a node is the
expected cost of its children -- The AND
tree can have LOOPS, so cost backup
is complicated
Intermediate nodes are given admissible heuristic
estimates -- these can be just the shortest paths
(or their estimates)
69
LAO*: turning bottom-up labeling into a full DP
70
RTDP Approach: Interleave Planning & Execution
(Simulation)
Start from the current state S. Expand the tree
(either uniformly to k levels, or
non-uniformly, going deeper in some
branches). Evaluate the leaf nodes; back up the
values to S. Update the stored value of S. Pick
the action that leads to the best value. Do it, or
simulate it. Loop back. Leaf nodes are evaluated
by: using their cached values (if a
node has been evaluated using RTDP
analysis in the past, you use its
remembered value; else use the
heuristic value); if not, use heuristics to
estimate: a. immediate reward values,
b. reachability heuristics.
Sort of like depth-limited game-playing
(expectimax) -- Who is the game against? Can
also do reinforcement learning this way
(the Mij are not known correctly in RL).
A sketch of one such backup is below.
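A sketch of one such backup (depth-limited lookahead with cached values at the leaves; the interfaces T(s, a) returning (probability, next-state) pairs, R(s, a), value_cache, and heuristic are assumptions):

    def rtdp_backup(s, depth, actions, T, R, value_cache, heuristic, gamma=0.95):
        """Depth-limited expectimax lookahead from s; leaves use cached values
        if available, else the heuristic. Returns (best value, best action)."""
        if depth == 0:
            return value_cache.get(s, heuristic(s)), None
        best_value, best_action = float("-inf"), None
        for a in actions:
            q = R(s, a) + gamma * sum(
                p * rtdp_backup(s2, depth - 1, actions, T, R,
                                value_cache, heuristic, gamma)[0]
                for p, s2 in T(s, a))
            if q > best_value:
                best_value, best_action = q, a
        value_cache[s] = best_value        # update the stored value of s
        return best_value, best_action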
71
Greedy "On-Policy" RTDP without execution:
Using the current utility values, select the
action with the highest expected utility
(the greedy action) at each state, until you reach
a terminating state. Update the values along
this path. Loop back until the values stabilize.
72
(No Transcript)
73
Envelope Extension Methods
  • For each action, take the most likely outcome and
    discard the rest.
  • Find a plan (deterministic path) from Init to
    Goal state. This is a (very partial) policy for
    just the states that fall on the maximum
    probability state sequence.
  • Consider states that are most likely to be
    encountered while traveling this path.
  • Find policy for those states too.
  • The tricky part is to show that we can converge to
    the optimal policy