Title: 5/6: Summary and Decision Theoretic Planning
1. 5/6 Summary and Decision Theoretic Planning
- Last homework socket opened (two more problems to be added: Scheduling, MDPs)
- Project 3 due today
- Sapa homework points sent
2. Current Grades
3. Sapa homework grades
5. All that water under the bridge
- Actions, Proofs, Planning Strategies (Week 2, 1/28-1/30)
- More PO planning, dealing with partially instantiated actions, and start of deriving heuristics (Week 3, 2/4-2/6)
- Reachability Heuristics contd. (2/11-2/13)
- Heuristics for Partial Order Planning; Graphplan search (2/18-2/20)
- EBL for Graphplan; solving the planning graph by compilation strategies (2/25-2/27)
- Compilation to SAT, ILP and Naive Encoding (3/4-3/6)
- Knowledge-based planners
- Metric-Temporal Planning: Issues and Representation
- Search Techniques; Heuristics
- Tracking multiple objective heuristics (cost propagation); partialization; LPG
- Temporal Constraint Networks; Scheduling
- Incompleteness and Uncertainty: Belief States; Conformant planning (4/22-4/24)
- Conditional Planning (4/29-5/1)
- Decision Theoretic Planning
6. Problems, Solutions, Success Measures: 3 orthogonal dimensions
- Conformant Plans: Don't look, just do
  - Sequences
- Contingent/Conditional Plans: Look, and based on what you see, do; look again
  - Directed acyclic graphs
- Policies: If in (belief) state S, do action a
  - (belief) state -> action tables
- Incompleteness in the initial state
- Un-observability (or partial observability) of states
- Non-deterministic actions
- Uncertainty in state or effects
- Complex reward functions (allowing degrees of satisfaction)
- Deterministic Success: Must reach the goal state with probability 1
- Probabilistic Success: Must succeed with probability > k (0 < k < 1)
- Maximal Expected Reward: Maximize the expected reward (an optimization problem)
7. The Trouble with Probabilities
- Once we have probabilities associated with the action effects, as well as with the constituents of a belief state, the belief space size explodes
  - It becomes infinitely large: we may be able to find a plan if one exists, but exhaustively searching to prove that a plan doesn't exist is out of the question
  - Conformant probabilistic planning is known to be semi-decidable
  - So, solving POMDPs is semi-decidable too
- Probabilities also introduce the notion of partial satisfaction and of the expected value of a plan (rather than a 0-1 valuation)
8. Useful as normative modeling tools in tons of places: planning, (reinforcement) learning, multi-agent interactions...
MDPs are generalizations of Markov chains where transitions are under the control of an agent.
HMMs are thus generalized to POMDPs.
9. a.k.a. action cost C(a,s)
If the transition matrix M_ij is not known a priori, then we have a reinforcement learning scenario.
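A minimal sketch of this notation in Python, with a made-up 2-state, 2-action MDP (nothing here beyond the M_ij and C(a,s) names comes from the slides):

```python
# Hypothetical 2-state, 2-action MDP in the slide's notation:
#   M[a][i][j] = probability of moving from state i to state j under action a
#   C[(a, s)]  = cost of taking action a in state s
# If M had to be estimated from experience instead of being given, this would
# become a reinforcement-learning problem rather than a planning problem.
M = {
    "stay": [[1.0, 0.0],
             [0.0, 1.0]],
    "move": [[0.2, 0.8],
             [0.9, 0.1]],
}
C = {("stay", 0): 0.0, ("stay", 1): 0.0,
     ("move", 0): 1.0, ("move", 1): 1.0}

def q_value(a, i, V):
    """Cost of doing a in state i, plus the expected value of the successors under V."""
    return C[(a, i)] + sum(M[a][i][j] * V[j] for j in range(len(V)))

print(q_value("move", 0, [0.0, 5.0]))  # 1.0 + (0.2*0.0 + 0.8*5.0) = 5.0
```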
10. MDPs vs. Markov Chains
- Markov chains are transition systems where transitions happen automatically
- HMMs (hidden Markov models) are Markov chains where the current state is only partially observable; they have been very useful in many different areas
  - The analogous generalization of HMMs (to agent-controlled transitions) leads to POMDPs
- MDPs are generalizations of Markov chains where transitions are under the control of an agent
12. Policies change with rewards...
16. Updates can be done synchronously OR asynchronously
--convergence is guaranteed as long as each state is updated infinitely often
Why are the values coming down first? Why do some states reach their optimal value faster?
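As a concrete illustration, here is a minimal asynchronous value-iteration sketch; the toy MDP (states, actions, transition table T, rewards R, discount gamma) is invented for the example and is not from the slides:

```python
import random

# Asynchronous value iteration: states are updated one at a time in an arbitrary
# order; convergence holds as long as every state keeps getting updated.
states = ["s0", "s1", "g"]
actions = ["a", "b"]
gamma = 0.95

# T[s][a] = list of (next_state, probability) pairs
T = {
    "s0": {"a": [("s1", 1.0)],             "b": [("s0", 0.5), ("s1", 0.5)]},
    "s1": {"a": [("g", 0.8), ("s0", 0.2)], "b": [("s1", 1.0)]},
    "g":  {"a": [("g", 1.0)],              "b": [("g", 1.0)]},
}
R = {"s0": 0.0, "s1": 0.0, "g": 1.0}    # reward collected in each state

V = {s: 0.0 for s in states}
for _ in range(5000):                   # plenty of updates for this toy example
    s = random.choice(states)           # asynchronous: pick any state to back up
    V[s] = R[s] + gamma * max(
        sum(p * V[s2] for s2, p in T[s][a]) for a in actions
    )

print({s: round(v, 3) for s, v in V.items()})
```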
17. Policies converge earlier than values
Given a utility vector U_i we can compute the greedy policy π_{U_i}.
The policy loss of π is ||U^π - U|| (the max norm of the difference of two vectors is the maximum amount by which they differ on any dimension).
So search in the space of policies.
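A small sketch of the two quantities named here, using the same dictionary-based MDP format as the value-iteration sketch above (the function names and arguments are mine, not the slides'):

```python
def greedy_policy(U, states, actions, T):
    """pi_U: in each state, pick the action with the highest expected utility under U."""
    return {
        s: max(actions, key=lambda a: sum(p * U[s2] for s2, p in T[s][a]))
        for s in states
    }

def max_norm(U1, U2):
    """||U1 - U2||: the largest amount by which the two vectors differ on any state."""
    return max(abs(U1[s] - U2[s]) for s in U1)

# The policy loss of pi would then be max_norm(U_pi, U_star), where U_pi is the
# utility of actually executing pi (see the policy-evaluation sketch for slide 18)
# and U_star is the optimal utility vector.
```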
18. We can either solve the linear equations exactly, or solve them approximately by running the value-iteration update a few times (the update won't have the max factor).
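A minimal sketch of the approximate variant, in the same hypothetical MDP format as above; the exact variant would instead solve the |S| linear equations U(s) = R(s) + gamma * sum_s' T(s, pi(s), s') U(s'), e.g. with numpy.linalg.solve:

```python
def evaluate_policy(pi, U0, states, T, R, gamma, k=20):
    """Approximate policy evaluation: k Bellman backups with the action fixed
    by the policy pi, so there is no max over actions."""
    U = dict(U0)
    for _ in range(k):
        U = {s: R[s] + gamma * sum(p * U[s2] for s2, p in T[s][pi[s]])
             for s in states}
    return U
```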
19. The Big Computational Issues in MDPs
- MDP models are quite easy to specify and understand conceptually. The big issue is compactness and efficiency.
- Policy construction is polynomial in the size of the state space (which is bad news!)
  - For POMDPs, the state space is the belief space (which is infinite)
- Compact representations are needed for
  - Actions
  - Reward function
  - Policy
  - Value
- Efficient methods are needed for
  - Policy/value updates
- Representations that have been tried include
  - Decision trees
  - Neural nets
  - Bayesian nets
  - ADDs (algebraic decision diagrams, a generalization of BDDs in which the leaf nodes can take real values instead of T/F)
20. SPUDD: Using ADDs to Represent Actions, Rewards and Policies
21. MDPs and Planning Problems
- FOMDPs (fully observable MDPs) can be used to model planning problems with fully observable states but non-deterministic transitions
- POMDPs (partially observable MDPs), a generalization of the MDP framework in which the current state can only be partially observed, are needed to handle planning problems with partial observability
- POMDPs can be solved by converting them into FOMDPs, but the conversion takes us from world states to belief states (which form a continuous space)
22. SSP (Stochastic Shortest Path Problem): An MDP with Init and Goal states
- MDPs don't have a notion of an initial and a goal state (process orientation instead of task orientation)
  - Goals are, in a sense, modeled by reward functions
    - This allows pretty expressive goals (in theory)
  - Normal MDP algorithms don't use initial-state information (since the policy is supposed to cover the entire state space anyway)
  - Could consider envelope extension methods: compute a deterministic plan (which gives the policy for some of the states), then extend the policy to other states that are likely to be reached during execution
  - Or RTDP methods
- SSPs are a special case of MDPs where
  - (a) an initial state is given
  - (b) there are absorbing goal states
  - (c) actions have costs, and goal states have zero cost
  (the Bellman equation for this formulation is sketched right after this list)
- A proper policy for an SSP is a policy that is guaranteed to ultimately put the agent in one of the absorbing goal states
- For SSPs, it is worth finding a partial policy that covers only the relevant states (the states reachable from the initial state, en route to the goal, under an optimal policy)
  - Value/Policy Iteration don't consider the notion of relevance
  - Consider heuristic state-search algorithms instead, where the heuristic can be seen as an estimate of the value of a state:
    - (L)AO* algorithms
    - RTDP algorithms
    - (or envelope extension methods)
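For concreteness, the optimality condition implied by (a)-(c) is the standard SSP Bellman equation (textbook form, not copied from the slide):

```latex
V^*(s) = \min_{a \in A(s)} \Big[\, C(s,a) + \sum_{s'} P(s' \mid s,a)\, V^*(s') \Big],
\qquad V^*(g) = 0 \ \text{for every absorbing goal state } g.
```

A proper policy is exactly one that reaches some absorbing goal state with probability 1 from every state it covers, which is what keeps these expected costs finite.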
23. AO* search for solving SSP problems
Main issues:
-- The cost of a node is the expected cost of its children (a tiny sketch of this backup follows below)
-- The AND/OR graph can have LOOPS, so the cost backup is complicated
Intermediate nodes are given admissible heuristic estimates
--these can be just the shortest-path costs (or estimates of them)
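A tiny sketch of that expected-cost backup (the data structures here are hypothetical, not the actual AO* bookkeeping):

```python
def backup(action_cost, outcomes, value):
    """Cost of an action node: its own cost plus the expected cost of its
    outcome children.  outcomes: list of (child, probability); value: dict
    mapping each child to its current cost estimate."""
    return action_cost + sum(p * value[child] for child, p in outcomes)

# At an OR node we would then take the min over its applicable actions, and,
# because the AND/OR graph may contain loops, repeat these backups until the
# costs stop changing rather than doing a single bottom-up pass.
```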
24. LAO*: turning bottom-up labeling into a full DP
25. RTDP Approach: Interleave Planning and Execution (Simulation)
Start from the current state S.
Expand the tree (either uniformly to k levels, or non-uniformly, going deeper in some branches).
Evaluate the leaf nodes; back up the values to S.
Update the stored value of S.
Pick the action that leads to the best value; do it or simulate it.
Loop back.
Leaf nodes are evaluated by:
-- using their cached values: if the node has been evaluated by RTDP analysis in the past, use its remembered value
-- otherwise, using heuristics to estimate
   a. immediate reward values
   b. reachability heuristics
Sort of like depth-limited game-playing (expectimax)
--Who is the game against?
Can also do reinforcement learning this way: the M_ij are not known correctly in RL.
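A minimal sketch of this depth-limited expectimax lookahead, in the same hypothetical MDP format as the earlier sketches; `cached` holds values remembered from earlier RTDP analysis and `heuristic` is any estimate such as the ones in (a)-(b):

```python
def lookahead(s, depth, actions, T, R, gamma, cached, heuristic):
    """Depth-limited expectimax value of state s."""
    if depth == 0:
        # Leaf: reuse a value remembered from an earlier RTDP pass, else a heuristic.
        return cached.get(s, heuristic(s))
    return R[s] + gamma * max(             # the agent maximizes over its actions...
        sum(p * lookahead(s2, depth - 1, actions, T, R, gamma, cached, heuristic)
            for s2, p in T[s][a])          # ...while "nature" supplies the expectation
        for a in actions
    )
```

The opponent in this "game" is nature: instead of an adversarial min layer there is an expectation over the chosen action's outcomes.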
26. Greedy "On-Policy" RTDP without execution
Using the current utility values, select the action with the highest expected utility (the greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back until the values stabilize.
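A sketch of one such greedy trial, in the same hypothetical MDP format as the earlier sketches (`terminals` is the set of terminating states):

```python
import random

def rtdp_trial(s, V, actions, T, R, gamma, terminals):
    """Run one greedy trial from s, updating stored values along the way."""
    while s not in terminals:
        # Greedy backup: value of each action under the current utilities V.
        q = {a: R[s] + gamma * sum(p * V[s2] for s2, p in T[s][a]) for a in actions}
        a = max(q, key=q.get)
        V[s] = q[a]                        # update the stored value of s
        # Simulate the greedy action by sampling one of its outcomes.
        succs, probs = zip(*T[s][a])
        s = random.choices(succs, weights=probs)[0]
    return V

# Repeating rtdp_trial from the start state until V stops changing is the
# "loop back until the values stabilize" step on the slide.
```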
28. Envelope Extension Methods
- For each action, take the most likely outcome and discard the rest (see the determinization sketch after this list).
- Find a plan (a deterministic path) from the Init state to the Goal state. This is a (very partial) policy for just the states that fall on the maximum-probability state sequence.
- Consider the states that are most likely to be encountered while traveling this path.
- Find a policy for those states too.
- The tricky part is to show that we can converge to the optimal policy.
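A sketch of the determinization step in the first bullet, for an MDP given in the same hypothetical dictionary format as the earlier sketches:

```python
def most_likely_determinization(T):
    """Map each (state, action) to its single most probable successor,
    turning the MDP into an ordinary deterministic planning problem."""
    return {
        s: {a: max(outcomes, key=lambda o: o[1])[0] for a, outcomes in acts.items()}
        for s, acts in T.items()
    }

# A deterministic plan found in this relaxed problem gives a partial policy for
# the states on the most-likely trajectory; the envelope is then grown by adding
# off-trajectory states that are likely to be reached during execution.
```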
29. Incomplete observability (the dreaded POMDPs)
- To model partial observability, all we need to do is look at the MDP in the space of belief states (belief states are fully observable even when world states are not); see the belief-update sketch after this list
  - The policy then maps belief states to actions
- In practice, this causes (humongous) problems
  - The space of belief states is continuous (even if the underlying world is discrete and finite)
  - Even approximate policies are hard to find (PSPACE-hard)
  - Problems with a few dozen world states are currently hard to solve
  - Depth-limited exploration (such as that done in adversarial games) is the only option
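What "looking at the MDP in the space of belief states" means operationally is a Bayes-filter update of the belief after each action and observation; a minimal sketch, with a hypothetical model format (P_trans[s][a] is a dict of successor probabilities, P_obs[s] a dict of observation probabilities):

```python
def belief_update(belief, action, observation, P_trans, P_obs):
    """belief: dict state -> probability.  Returns the new belief after doing
    `action` and then seeing `observation` (assumes the observation has
    nonzero probability under the predicted belief)."""
    states = list(belief)
    # Prediction: push the belief through the transition model.
    predicted = {s2: sum(belief[s] * P_trans[s][action].get(s2, 0.0) for s in states)
                 for s2 in states}
    # Correction: weight by the observation likelihood and renormalize.
    unnorm = {s2: P_obs[s2].get(observation, 0.0) * predicted[s2] for s2 in states}
    z = sum(unnorm.values())
    return {s2: v / z for s2, v in unnorm.items()}
```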