Title: 5/6: Summary and Decision Theoretic Planning
1. 5/6 Summary and Decision Theoretic Planning
- Last homework socket opened (two more problems to be added: Scheduling, MDPs)
- Project 3 due today
- Sapa homework points sent
2. Current Grades
3. Sapa homework grades
5. All that water under the bridge
- Actions, Proofs, Planning Strategies (Week 2, 1/28-1/30)
- More PO planning, dealing with partially instantiated actions, and start of deriving heuristics (Week 3, 2/4-2/6)
- Reachability Heuristics contd. (2/11-2/13)
- Heuristics for Partial Order Planning; Graphplan search (2/18-2/20)
- EBL for Graphplan; solving the planning graph by compilation strategies (2/25-2/27)
- Compilation to SAT, ILP and Naive Encoding (3/4-3/6)
- Knowledge-based planners
- Metric-Temporal Planning: Issues and Representation
- Search Techniques; Heuristics
- Tracking multiple objective heuristics (cost propagation); partialization; LPG
- Temporal Constraint Networks; Scheduling
- Incompleteness and Uncertainty: Belief States; Conformant planning (4/22-4/24)
- Conditional Planning (4/29-5/1)
- Decision Theoretic Planning
6. Problems, Solutions, Success Measures: 3 orthogonal dimensions
- Conformant Plans: Don't look, just do
  - Sequences
- Contingent/Conditional Plans: Look, and based on what you see, do; look again
  - Directed acyclic graphs
- Policies: If in (belief) state S, do action a
  - (belief) state -> action tables
- Incompleteness in the initial state
- Un-observability (or partial observability) of states
- Non-deterministic actions
- Uncertainty in state or effects
- Complex reward functions (allowing degrees of satisfaction)
- Deterministic Success: Must reach the goal state with probability 1
- Probabilistic Success: Must succeed with probability > k (0 < k < 1)
- Maximal Expected Reward: Maximize the expected reward (an optimization problem)
7. The Trouble with Probabilities
- Once we have probabilities associated with the action effects, as well as with the constituents of a belief state, the belief space size explodes
  - It becomes infinitely large: we may be able to find a plan if one exists, but exhaustively searching to prove that a plan doesn't exist is out of the question
  - Conformant probabilistic planning is known to be semi-decidable
  - So, solving POMDPs is semi-decidable too
- Probabilities also introduce the notion of partial satisfaction and of the expected value of a plan (rather than a 0-1 valuation)
8. Useful as normative modeling tools in tons of places: planning, (reinforcement) learning, multi-agent interactions...
MDPs are generalizations of Markov chains where transitions are under the control of an agent.
HMMs are thus generalized to POMDPs.
9. a.k.a. action cost C(a,s)
If the transition matrix M_ij is not known a priori, then we have a reinforcement learning scenario.
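A minimal sketch of this notation in Python, with a made-up 2-state, 2-action MDP (nothing here beyond the M_ij and C(a,s) names comes from the slides):

```python
# Hypothetical 2-state, 2-action MDP in the slide's notation:
#   M[a][i][j] = probability of moving from state i to state j under action a
#   C[(a, s)]  = cost of taking action a in state s
# If M had to be estimated from experience instead of being given, this would
# become a reinforcement-learning problem rather than a planning problem.
M = {
    "stay": [[1.0, 0.0],
             [0.0, 1.0]],
    "move": [[0.2, 0.8],
             [0.9, 0.1]],
}
C = {("stay", 0): 0.0, ("stay", 1): 0.0,
     ("move", 0): 1.0, ("move", 1): 1.0}

def q_value(a, i, V):
    """Cost of doing a in state i, plus the expected value of the successors under V."""
    return C[(a, i)] + sum(M[a][i][j] * V[j] for j in range(len(V)))

print(q_value("move", 0, [0.0, 5.0]))  # 1.0 + (0.2*0.0 + 0.8*5.0) = 5.0
```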
10. MDPs vs. Markov Chains
- Markov chains are transition systems where transitions happen automatically
- HMMs (hidden Markov models) are Markov chains where the current state is only partially observable; they have been very useful in many different areas
  - The analogous generalization of HMMs (to agent-controlled transitions) leads to POMDPs
- MDPs are generalizations of Markov chains where transitions are under the control of an agent
12. Policies change with rewards...
16. Updates can be done synchronously OR asynchronously
--convergence is guaranteed as long as each state is updated infinitely often
Why are the values coming down first? Why do some states reach their optimal value faster?
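As a concrete illustration, here is a minimal asynchronous value-iteration sketch; the toy MDP (states, actions, transition table T, rewards R, discount gamma) is invented for the example and is not from the slides:

```python
import random

# Asynchronous value iteration: states are updated one at a time in an arbitrary
# order; convergence holds as long as every state keeps getting updated.
states = ["s0", "s1", "g"]
actions = ["a", "b"]
gamma = 0.95

# T[s][a] = list of (next_state, probability) pairs
T = {
    "s0": {"a": [("s1", 1.0)],             "b": [("s0", 0.5), ("s1", 0.5)]},
    "s1": {"a": [("g", 0.8), ("s0", 0.2)], "b": [("s1", 1.0)]},
    "g":  {"a": [("g", 1.0)],              "b": [("g", 1.0)]},
}
R = {"s0": 0.0, "s1": 0.0, "g": 1.0}    # reward collected in each state

V = {s: 0.0 for s in states}
for _ in range(5000):                   # plenty of updates for this toy example
    s = random.choice(states)           # asynchronous: pick any state to back up
    V[s] = R[s] + gamma * max(
        sum(p * V[s2] for s2, p in T[s][a]) for a in actions
    )

print({s: round(v, 3) for s, v in V.items()})
```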
17. Policies converge earlier than values
Given a utility vector U_i we can compute the greedy policy π_{U_i}.
The policy loss of π is ||U^π - U|| (the max norm of the difference of two vectors is the maximum amount by which they differ on any dimension).
So search in the space of policies.
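A small sketch of the two quantities named here, using the same dictionary-based MDP format as the value-iteration sketch above (the function names and arguments are mine, not the slides'):

```python
def greedy_policy(U, states, actions, T):
    """pi_U: in each state, pick the action with the highest expected utility under U."""
    return {
        s: max(actions, key=lambda a: sum(p * U[s2] for s2, p in T[s][a]))
        for s in states
    }

def max_norm(U1, U2):
    """||U1 - U2||: the largest amount by which the two vectors differ on any state."""
    return max(abs(U1[s] - U2[s]) for s in U1)

# The policy loss of pi would then be max_norm(U_pi, U_star), where U_pi is the
# utility of actually executing pi (see the policy-evaluation sketch for slide 18)
# and U_star is the optimal utility vector.
```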
18. We can either solve the linear equations exactly, or solve them approximately by running the value-iteration update a few times (the update won't have the max factor).
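A minimal sketch of the approximate variant, in the same hypothetical MDP format as above; the exact variant would instead solve the |S| linear equations U(s) = R(s) + gamma * sum_s' T(s, pi(s), s') U(s'), e.g. with numpy.linalg.solve:

```python
def evaluate_policy(pi, U0, states, T, R, gamma, k=20):
    """Approximate policy evaluation: k Bellman backups with the action fixed
    by the policy pi, so there is no max over actions."""
    U = dict(U0)
    for _ in range(k):
        U = {s: R[s] + gamma * sum(p * U[s2] for s2, p in T[s][pi[s]])
             for s in states}
    return U
```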
19. The Big Computational Issues in MDPs
- MDP models are quite easy to specify and understand conceptually. The big issue is compactness and efficiency.
- Policy construction is polynomial in the size of the state space (which is bad news!)
  - For POMDPs, the state space is the belief space (which is infinite)
- Compact representations are needed for
  - Actions
  - Reward function
  - Policy
  - Value
- Efficient methods are needed for
  - Policy/value updates
- Representations that have been tried include
  - Decision trees
  - Neural nets
  - Bayesian nets
  - ADDs (algebraic decision diagrams, a generalization of BDDs in which the leaf nodes can take real values instead of T/F)
20. SPUDD: Using ADDs to Represent Actions, Rewards and Policies
21. MDPs and Planning Problems
- FOMDPs (fully observable MDPs) can be used to model planning problems with fully observable states but non-deterministic transitions
- POMDPs (partially observable MDPs), a generalization of the MDP framework in which the current state can only be partially observed, are needed to handle planning problems with partial observability
- POMDPs can be solved by converting them into FOMDPs, but the conversion takes us from world states to belief states (which form a continuous space)
22. SSP (Stochastic Shortest Path Problem): An MDP with Init and Goal states
- MDPs don't have a notion of an initial and a goal state (process orientation instead of task orientation)
  - Goals are, in a sense, modeled by reward functions
    - This allows pretty expressive goals (in theory)
  - Normal MDP algorithms don't use initial-state information (since the policy is supposed to cover the entire state space anyway)
  - Could consider envelope extension methods: compute a deterministic plan (which gives the policy for some of the states), then extend the policy to other states that are likely to be reached during execution
  - Or RTDP methods
- SSPs are a special case of MDPs where
  - (a) an initial state is given
  - (b) there are absorbing goal states
  - (c) actions have costs, and goal states have zero cost
  (the Bellman equation for this formulation is sketched right after this list)
- A proper policy for an SSP is a policy that is guaranteed to ultimately put the agent in one of the absorbing goal states
- For SSPs, it is worth finding a partial policy that covers only the relevant states (the states reachable from the initial state, en route to the goal, under an optimal policy)
  - Value/Policy Iteration don't consider the notion of relevance
  - Consider heuristic state-search algorithms instead, where the heuristic can be seen as an estimate of the value of a state:
    - (L)AO* algorithms
    - RTDP algorithms
    - (or envelope extension methods)
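For concreteness, the optimality condition implied by (a)-(c) is the standard SSP Bellman equation (textbook form, not copied from the slide):

```latex
V^*(s) = \min_{a \in A(s)} \Big[\, C(s,a) + \sum_{s'} P(s' \mid s,a)\, V^*(s') \Big],
\qquad V^*(g) = 0 \ \text{for every absorbing goal state } g.
```

A proper policy is exactly one that reaches some absorbing goal state with probability 1 from every state it covers, which is what keeps these expected costs finite.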
23. AO* search for solving SSP problems
Main issues:
-- The cost of a node is the expected cost of its children (a tiny sketch of this backup follows below)
-- The AND/OR graph can have LOOPS, so the cost backup is complicated
Intermediate nodes are given admissible heuristic estimates
--these can be just the shortest-path costs (or estimates of them)
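A tiny sketch of that expected-cost backup (the data structures here are hypothetical, not the actual AO* bookkeeping):

```python
def backup(action_cost, outcomes, value):
    """Cost of an action node: its own cost plus the expected cost of its
    outcome children.  outcomes: list of (child, probability); value: dict
    mapping each child to its current cost estimate."""
    return action_cost + sum(p * value[child] for child, p in outcomes)

# At an OR node we would then take the min over its applicable actions, and,
# because the AND/OR graph may contain loops, repeat these backups until the
# costs stop changing rather than doing a single bottom-up pass.
```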
24. LAO*: turning bottom-up labeling into a full DP
25. RTDP Approach: Interleave Planning and Execution (Simulation)
Start from the current state S.
Expand the tree (either uniformly to k levels, or non-uniformly, going deeper in some branches).
Evaluate the leaf nodes; back up the values to S.
Update the stored value of S.
Pick the action that leads to the best value; do it or simulate it.
Loop back.
Leaf nodes are evaluated by:
-- using their cached values: if the node has been evaluated by RTDP analysis in the past, use its remembered value
-- otherwise, using heuristics to estimate
   a. immediate reward values
   b. reachability heuristics
Sort of like depth-limited game-playing (expectimax)
--Who is the game against?
Can also do reinforcement learning this way: the M_ij are not known correctly in RL.
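A minimal sketch of this depth-limited expectimax lookahead, in the same hypothetical MDP format as the earlier sketches; `cached` holds values remembered from earlier RTDP analysis and `heuristic` is any estimate such as the ones in (a)-(b):

```python
def lookahead(s, depth, actions, T, R, gamma, cached, heuristic):
    """Depth-limited expectimax value of state s."""
    if depth == 0:
        # Leaf: reuse a value remembered from an earlier RTDP pass, else a heuristic.
        return cached.get(s, heuristic(s))
    return R[s] + gamma * max(             # the agent maximizes over its actions...
        sum(p * lookahead(s2, depth - 1, actions, T, R, gamma, cached, heuristic)
            for s2, p in T[s][a])          # ...while "nature" supplies the expectation
        for a in actions
    )
```

The opponent in this "game" is nature: instead of an adversarial min layer there is an expectation over the chosen action's outcomes.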
26. Greedy "On-Policy" RTDP without execution
Using the current utility values, select the action with the highest expected utility (the greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back until the values stabilize.
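A sketch of one such greedy trial, in the same hypothetical MDP format as the earlier sketches (`terminals` is the set of terminating states):

```python
import random

def rtdp_trial(s, V, actions, T, R, gamma, terminals):
    """Run one greedy trial from s, updating stored values along the way."""
    while s not in terminals:
        # Greedy backup: value of each action under the current utilities V.
        q = {a: R[s] + gamma * sum(p * V[s2] for s2, p in T[s][a]) for a in actions}
        a = max(q, key=q.get)
        V[s] = q[a]                        # update the stored value of s
        # Simulate the greedy action by sampling one of its outcomes.
        succs, probs = zip(*T[s][a])
        s = random.choices(succs, weights=probs)[0]
    return V

# Repeating rtdp_trial from the start state until V stops changing is the
# "loop back until the values stabilize" step on the slide.
```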
28. Envelope Extension Methods
- For each action, take the most likely outcome and discard the rest (see the determinization sketch after this list).
- Find a plan (a deterministic path) from the Init state to the Goal state. This is a (very partial) policy for just the states that fall on the maximum-probability state sequence.
- Consider the states that are most likely to be encountered while traveling this path.
- Find a policy for those states too.
- The tricky part is to show that we can converge to the optimal policy.
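A sketch of the determinization step in the first bullet, for an MDP given in the same hypothetical dictionary format as the earlier sketches:

```python
def most_likely_determinization(T):
    """Map each (state, action) to its single most probable successor,
    turning the MDP into an ordinary deterministic planning problem."""
    return {
        s: {a: max(outcomes, key=lambda o: o[1])[0] for a, outcomes in acts.items()}
        for s, acts in T.items()
    }

# A deterministic plan found in this relaxed problem gives a partial policy for
# the states on the most-likely trajectory; the envelope is then grown by adding
# off-trajectory states that are likely to be reached during execution.
```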
29. Incomplete observability (the dreaded POMDPs)
- To model partial observability, all we need to do is look at the MDP in the space of belief states (belief states are fully observable even when world states are not); see the belief-update sketch after this list
  - The policy then maps belief states to actions
- In practice, this causes (humongous) problems
  - The space of belief states is continuous (even if the underlying world is discrete and finite)
  - Even approximate policies are hard to find (PSPACE-hard)
  - Problems with a few dozen world states are currently hard to solve
  - Depth-limited exploration (such as that done in adversarial games) is the only option
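What "looking at the MDP in the space of belief states" means operationally is a Bayes-filter update of the belief after each action and observation; a minimal sketch, with a hypothetical model format (P_trans[s][a] is a dict of successor probabilities, P_obs[s] a dict of observation probabilities):

```python
def belief_update(belief, action, observation, P_trans, P_obs):
    """belief: dict state -> probability.  Returns the new belief after doing
    `action` and then seeing `observation` (assumes the observation has
    nonzero probability under the predicted belief)."""
    states = list(belief)
    # Prediction: push the belief through the transition model.
    predicted = {s2: sum(belief[s] * P_trans[s][action].get(s2, 0.0) for s in states)
                 for s2 in states}
    # Correction: weight by the observation likelihood and renormalize.
    unnorm = {s2: P_obs[s2].get(observation, 0.0) * predicted[s2] for s2 in states}
    z = sum(unnorm.values())
    return {s2: v / z for s2, v in unnorm.items()}
```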