5/6: Summary and Decision Theoretic Planning

1
5/6: Summary and Decision Theoretic Planning
  • Last homework socket opened (two more problems
    to be added: Scheduling, MDPs)
  • Project 3 due today
  • Sapa homework points sent out

2
Current Grades
3
Sapa homework grades
4
(No Transcript)
5
All that water under the bridge
  • Actions, Proofs, Planning Strategies (Week 2:
    1/28, 1/30)
  • More PO planning, dealing with partially
    instantiated actions, and start of deriving
    heuristics (Week 3: 2/4, 2/6)
  • Reachability Heuristics contd. (2/11, 2/13)
  • Heuristics for partial order planning; Graphplan
    search (2/18, 2/20)
  • EBL for Graphplan; solving the planning graph by
    compilation strategies (2/25, 2/27)
  • Compilation to SAT, ILP and Naive
    Encoding (3/4, 3/6)
  • Knowledge-based planners
  • Metric-Temporal Planning: Issues and
    Representation
  • Search Techniques; Heuristics
  • Tracking multiple objective heuristics (cost
    propagation); partialization; LPG
  • Temporal Constraint Networks; Scheduling
  • 4/22, 4/24: Incompleteness and Uncertainty; Belief
    States; Conformant planning
  • 4/29, 5/1: Conditional Planning
  • Decision Theoretic Planning

6
Problems, Solutions, Success Measures: 3
orthogonal dimensions
Solutions:
  • Conformant Plans: Don't look, just do
    -- sequences of actions
  • Contingent/Conditional Plans: Look, and based on
    what you see, do; look again
    -- directed acyclic graphs
  • Policies: If in (belief) state S, do action a
    -- (belief) state -> action tables
Problems:
  • Incompleteness in the initial state
  • Un- (partial) observability of states
  • Non-deterministic actions
  • Uncertainty in state or effects
  • Complex reward functions (allowing degrees of
    satisfaction)
Success Measures:
  • Deterministic Success: Must reach the goal state
    with probability 1
  • Probabilistic Success: Must succeed with
    probability > k (0 < k < 1)
  • Maximal Expected Reward: Maximize the expected
    reward (an optimization problem)

7
The Trouble with Probabilities
  • Once we have probabilities associated with the
    action effects, as well as with the constituents of
    a belief state, the belief space size explodes
  • It can become infinitely large: we may be able to
    find a plan if one exists, but exhaustively
    searching to prove that no plan exists is out of
    the question
  • Conformant probabilistic planning is known to be
    semi-decidable
  • So, solving POMDPs is semi-decidable too
  • Probabilities also introduce the notion of partial
    satisfaction and of the expected value of a plan
    (rather than a 0-1 valuation)

8
Useful as normative modeling tools in tons of
places: planning, (reinforcement) learning,
multi-agent interactions...
MDPs are generalizations of Markov chains where
transitions are under the control of an agent.
HMMs are thus generalized to POMDPs.
9
a.k.a. action cost C(a,s)
If the transition matrix M_ij is not known a priori, then we
have a reinforcement learning scenario.
10
MDPs vs. Markov Chains
  • Markov chains are transition systems, where
    transitions happen automatically
  • HMMs (hidden Markov models) are Markov chains
    where the current state is only partially observable.
    They have been very useful in many different areas.
  • The corresponding generalization of HMMs leads to POMDPs
  • MDPs are generalizations of Markov chains where
    transitions are under the control of an agent
    (a minimal concrete sketch follows)
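
To make the distinction concrete, here is a minimal sketch of how such an MDP might be written down as plain data; the state and action names and the numbers are invented for illustration (in a reinforcement learning setting the transition matrix would not be known a priori).

    # Minimal illustrative MDP: states, actions, transition matrix M, reward/cost.
    # All names and numbers below are made up for the example.

    GAMMA = 0.9                      # discount factor
    STATES = ["s0", "s1", "goal"]
    ACTIONS = ["a", "b"]

    # T[s][a] = list of (probability, next_state) pairs: the M_ij entries for action a
    T = {
        "s0":   {"a": [(0.8, "s1"), (0.2, "s0")], "b": [(1.0, "s0")]},
        "s1":   {"a": [(0.9, "goal"), (0.1, "s0")], "b": [(1.0, "s1")]},
        "goal": {"a": [(1.0, "goal")], "b": [(1.0, "goal")]},
    }

    # R[s][a]: immediate reward; negative rewards play the role of action costs C(a, s)
    R = {s: {a: (0.0 if s == "goal" else -1.0) for a in ACTIONS} for s in STATES}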

11
(No Transcript)
12
Policies change with rewards..
13
(No Transcript)
14
(No Transcript)
15
(No Transcript)
16
Updates can be done synchronously OR
asynchronously; convergence is guaranteed
as long as each state is updated
infinitely often.
Why are values coming down first? Why are some
states reaching their optimal values faster?
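
As a concrete illustration, here is a sketch of value iteration over the toy MDP sketched earlier (the T, R, STATES, ACTIONS and GAMMA names come from that sketch and are my assumptions, not part of the slides).

    # Value iteration sketch. Updating U in place while sweeping the states is an
    # asynchronous update; building a fresh dictionary each sweep would be the
    # synchronous version. Either converges as long as every state keeps being updated.

    def q_value(T, R, U, s, a, gamma):
        return R[s][a] + gamma * sum(p * U[s2] for p, s2 in T[s][a])

    def value_iteration(T, R, states, actions, gamma, eps=1e-6):
        U = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:                      # in-place sweep (asynchronous)
                best = max(q_value(T, R, U, s, a, gamma) for a in actions)
                delta = max(delta, abs(best - U[s]))
                U[s] = best
            if delta < eps:
                return U

    U = value_iteration(T, R, STATES, ACTIONS, GAMMA)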
17
Policies converge earlier than values.
Given a utility vector U_i we can compute the
greedy policy π_Ui. The policy loss of π is
||U^π − U|| (the max-norm difference of two
vectors is the maximum amount by which they
differ on any dimension).
So search in the space of policies
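
A sketch of how the greedy policy can be read off a utility vector, plus the max-norm distance used to talk about policy loss; the function names are mine, and the T, R, gamma conventions are those of the earlier toy sketch.

    # Greedy policy: in each state pick the action with the best one-step lookahead value.
    def greedy_policy(T, R, U, states, actions, gamma):
        return {s: max(actions,
                       key=lambda a: R[s][a] + gamma * sum(p * U[s2] for p, s2 in T[s][a]))
                for s in states}

    # Max norm ||U1 - U2||: the largest difference on any single state.
    def max_norm(U1, U2):
        return max(abs(U1[s] - U2[s]) for s in U1)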
18
We can either solve the linear equations exactly,
or solve them approximately by running the
value-iteration update a few times (the update
won't have the max factor).
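
A sketch of policy iteration with this approximate evaluation step: a few max-free Bellman updates for the fixed policy instead of solving the linear equations exactly. It reuses the greedy_policy helper sketched above; all names are mine.

    # Approximate policy evaluation: k simplified Bellman updates with no max.
    def evaluate_policy(T, R, pi, U, states, gamma, k=10):
        for _ in range(k):
            U = {s: R[s][pi[s]] + gamma * sum(p * U[s2] for p, s2 in T[s][pi[s]])
                 for s in states}
        return U

    def policy_iteration(T, R, states, actions, gamma):
        pi = {s: actions[0] for s in states}       # start from an arbitrary policy
        U = {s: 0.0 for s in states}
        while True:
            U = evaluate_policy(T, R, pi, U, states, gamma)
            new_pi = greedy_policy(T, R, U, states, actions, gamma)
            if new_pi == pi:                       # the policy stabilizes before the values do
                return pi, U
            pi = new_pi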
19
The Big Computational Issues in MDPs
  • MDP models are quite easy to specify and
    understand conceptually. The big issue is
    compactness and efficiency
  • Policy construction is polynomial in the size of
    the state space (which is bad news!)
  • For POMDPs, the state space is the belief space
    (which is infinite)
  • Compact representations are needed for:
    - actions
    - the reward function
    - the policy
    - the value function
  • Efficient methods are needed for:
    - policy/value updates
  • Representations that have been tried include:
    - decision trees
    - neural nets
    - Bayesian nets
    - ADDs (algebraic decision diagrams, a
      generalization of BDDs in which the leaf nodes
      can have real-valued valuations instead of T/F)

20
SPUDD: Using ADDs to Represent Actions, Rewards
and Policies
21
MDPs and Planning Problems
  • FOMDPs (fully observable MDPs) can be used to
    model planning problems with fully observable
    states but non-deterministic transitions
  • POMDPs (partially observable MDPs), a
    generalization of the MDP framework in which the
    current state can only be partially observed, are
    needed to handle planning problems with
    partial observability
  • POMDPs can be solved by converting them into
    FOMDPs, but the conversion takes us from world
    states to belief states (which form a continuous
    space); a belief-update sketch follows
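
For concreteness, a sketch of the belief-state update that this conversion relies on: after doing action a and observing o, the new belief is b'(s') proportional to O(o | s') * sum_s T(s, a, s') * b(s). Here T[s][a] is taken to be a dict of next-state probabilities and O[s'] a dict of observation probabilities; both conventions and the function name are my assumptions.

    # POMDP -> belief-MDP: Bayesian belief update after action a and observation o.
    def belief_update(b, a, o, T, O, states):
        new_b = {}
        for s2 in states:
            predicted = sum(T[s][a].get(s2, 0.0) * b.get(s, 0.0) for s in states)
            new_b[s2] = O[s2].get(o, 0.0) * predicted
        total = sum(new_b.values())
        if total == 0.0:
            raise ValueError("observation o has zero probability under belief b")
        return {s2: p / total for s2, p in new_b.items()}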

22
SSPP (Stochastic Shortest Path Problem): An MDP with
Init and Goal states
  • MDPs don't have a notion of an initial and a
    goal state (process orientation instead of
    task orientation)
    - Goals are sort of modeled by reward functions
    - This allows pretty expressive goals (in theory)
    - Normal MDP algorithms don't use initial state
      information (since the policy is supposed to
      cover the entire state space anyway)
  • Could consider envelope extension methods:
    - Compute a deterministic plan (which gives the
      policy for some of the states); extend the policy
      to other states that are likely to be encountered
      during execution
    - RTDP methods
  • SSPPs are a special case of MDPs where:
    - (a) the initial state is given
    - (b) there are absorbing goal states
    - (c) actions have costs; goal states have zero
      cost
  • A proper policy for an SSPP is a policy that is
    guaranteed to ultimately put the agent in one of
    the absorbing goal states (see the sketch after
    this list)
  • For SSPPs it is worth finding a partial policy
    that only covers the relevant states (states that
    are reachable from the init and goal states on
    any optimal policy)
    - Value/policy iteration don't consider the notion
      of relevance
    - Consider heuristic state-search algorithms
      instead: the heuristic can be seen as an estimate
      of the value of a state
    - (L)AO* or RTDP algorithms
      (or envelope extension methods)
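
A sketch of what "proper" means operationally: under policy pi, every state reachable from the initial state can itself still reach an absorbing goal state, which for a finite chain means the goal set is reached with probability 1. Here T[s][a] maps next states to probabilities, goals is a set, and the function names are mine.

    def states_reachable(pi, T, s0, goals):
        # states reachable from s0 while following pi (goal states are absorbing)
        seen, stack = set(), [s0]
        while stack:
            s = stack.pop()
            if s in seen:
                continue
            seen.add(s)
            if s not in goals:
                stack.extend(s2 for s2, p in T[s][pi[s]].items() if p > 0)
        return seen

    def is_proper(pi, T, s0, goals):
        # proper iff every reachable state can itself still reach the goal set under pi
        return all(states_reachable(pi, T, s, goals) & goals
                   for s in states_reachable(pi, T, s0, goals))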

23
AO* search for solving SSP problems
Main issues:
  -- The cost of a node is the expected cost of its children
  -- The AND tree can have LOOPS, so cost backup
     is complicated
Intermediate nodes are given admissible heuristic
estimates
  -- these can be just the shortest paths (or estimates of them)
24
LAO* -- turning bottom-up labeling into a full DP
25
RTDP Approach: Interleave Planning and Execution
(Simulation)
Start from the current state S. Expand the tree
(either uniformly to k levels, or non-uniformly,
going deeper in some branches). Evaluate the leaf
nodes; back up the values to S. Update the stored
value of S. Pick the action that leads to the best
value; do it (or simulate it). Loop back.
Leaf nodes are evaluated by:
  -- using their cached values (if a node has been
     evaluated by RTDP in the past, use its
     remembered value)
  -- if not, using heuristics to estimate them:
     a. immediate reward values
     b. reachability heuristics
This is sort of like depth-limited game playing
(expectimax) -- who is the game against? We can
also do reinforcement learning this way; in RL,
the M_ij are not known correctly.
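
A sketch of one such trial in a cost-based (SSP-style) setting: simulate greedily from the current state, backing up values along the way, and fall back to a heuristic h for states with no cached value. T[s][a] maps next states to probabilities, C[s][a] is the action cost, and all names are illustrative.

    import random

    def rtdp_trial(V, T, C, s0, goals, actions, h, max_steps=1000):
        s = s0
        for _ in range(max_steps):
            if s in goals:
                break
            # one-step lookahead: cached value if we have one, else the heuristic
            def q(a):
                return C[s][a] + sum(p * V.get(s2, h(s2)) for s2, p in T[s][a].items())
            best_a = min(actions, key=q)
            V[s] = q(best_a)                  # update the stored value of s
            # execute or simulate the chosen action
            next_states, probs = zip(*T[s][best_a].items())
            s = random.choices(next_states, weights=probs)[0]
        return V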
26
Greedy On-Policy RTDP without execution
Using the current utility values, select the
action with the highest expected utility
(the greedy action) at each state, until you reach
a terminating state. Update the values along
this path. Loop back until the values stabilize.
27
(No Transcript)
28
Envelope Extension Methods
  • For each action, take the most likely outcome and
    discard the rest (see the sketch after this list).
  • Find a plan (deterministic path) from the Init
    state to the Goal state. This is a (very partial)
    policy, covering just the states that fall on the
    maximum-probability state sequence.
  • Consider states that are most likely to be
    encountered while traveling this path.
  • Find a policy for those states too.
  • The tricky part is showing that we can converge to
    the optimal policy.
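
A sketch of the first two steps, assuming T[s][a] maps next states to probabilities: determinize the domain by keeping each action's most likely outcome, then search the resulting graph for an Init-to-Goal action sequence (function names are mine).

    from collections import deque

    def most_likely_determinization(T):
        # keep only the single most probable outcome of each action
        return {s: {a: max(outcomes, key=outcomes.get) for a, outcomes in acts.items()}
                for s, acts in T.items()}

    def find_plan(det, s0, goals):
        # breadth-first search for an action sequence from s0 to a goal state
        parent = {s0: None}
        frontier = deque([s0])
        while frontier:
            s = frontier.popleft()
            if s in goals:
                plan = []
                while parent[s] is not None:
                    a, prev = parent[s]
                    plan.append(a)
                    s = prev
                return list(reversed(plan))
            for a, s2 in det[s].items():
                if s2 not in parent:
                    parent[s2] = (a, s)
                    frontier.append(s2)
        return None                            # no deterministic path found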

29
Incomplete observability (the dreaded POMDPs)
  • To model partial observability, all we need to do
    is look at the MDP in the space of belief states
    (belief states are fully observable even when
    world states are not)
    - The policy maps belief states to actions
  • In practice, this causes (humongous) problems:
    - The space of belief states is continuous (even
      if the underlying world is discrete and finite)
    - Even approximate policies are hard to find
      (PSPACE-hard)
    - Problems with a few dozen world states are
      currently hard to solve
    - Depth-limited exploration (such as that done in
      adversarial games) is the only option

30
(No Transcript)
31
(No Transcript)