Title: Introduction to Planning Under Uncertainty
1. Introduction to Planning Under Uncertainty
- Michael L. Littman
- Rutgers University
- Department of Computer Science
- Rutgers Laboratory for Real-Life Reinforcement Learning
2. Planning?
- Selecting
  - explicit decision making
- and Executing
  - influences an environment
- a Sequence of Actions
  - outcome depends on multiple steps
- to Accomplish some Objective
  - performance is measured
3. Uncertainty?
- Four principal types:
  - outcome uncertainty: don't know which outcome will occur (e.g., MDPs).
  - effect uncertainty: don't know the possible outcomes (e.g., RL).
  - state uncertainty: don't know the current state (e.g., POMDPs).
  - agent uncertainty: don't know what other agents will do (e.g., games).
- Classical planning: no uncertainty.
- Real-world problems: one or more of the above.
4. A Model-based Viewpoint
- Types of uncertainty can be combined:
  - outcome: MDPs
  - state: POMDPs
  - agent: partially observable SGs
  - effect: RL in partially observable SGs
- What uncertainty does your problem have?
- What model most directly captures it?
- What algorithms are most appropriate?
5. Example Planning Problem
6. MDPs
- Markov decision process:
  - finite set of states s in S
  - finite set of actions a in A
  - transition probabilities T(s, a, s')
  - rewards (or costs) R(s, a)
- Objective: select actions to maximize total expected reward (often discounted by γ).
7. About MDPs
- Control problems: outcome uncertainty.
- Specified via a set of matrices.
- Complexity: P-complete.
- Bellman equations:
  - V(s) = max_a [R(s,a) + γ Σ_{s'} T(s,a,s') V(s')]
  - Q(s,a) = R(s,a) + γ Σ_{s'} T(s,a,s') max_{a'} Q(s',a')
- Algorithms:
  - value iteration (workhorse), V ← F(V) (sketch after this slide).
  - policy iteration (often fast, loose analysis)
  - modified policy iteration (compromise)
  - linear programming
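
As a concrete companion to the Bellman equation and the "value iteration (workhorse)" bullet, here is a minimal numpy sketch of tabular value iteration. The array shapes T[s, a, s'] and R[s, a] follow the slide's notation; the tolerance-based stopping rule is an assumption, not something specified in the talk.

import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration: repeatedly apply V <- F(V) until convergence.

    T: transition probabilities, shape (S, A, S), T[s, a, s'] = Pr(s' | s, a)
    R: rewards, shape (S, A)
    Returns the (near-)optimal value function and a greedy policy.
    """
    S, A, _ = T.shape
    V = np.zeros(S)
    while True:
        # Q(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') V(s')
        Q = R + gamma * T @ V          # shape (S, A)
        V_new = Q.max(axis=1)          # V(s) = max_a Q(s,a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new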
8. LP for MDP
- min Σ_s V_s  subject to  V_s ≥ R(s,a) + γ Σ_{s'} T(s,a,s') V_{s'}  for all s, a (a code sketch follows this slide)
- polynomial time in the worst case (the only approach known to be)
- key to extensions (constraints, approximation)
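
A sketch of the same LP in code, assuming scipy is available; the dense constraint construction is purely illustrative, not how a production solver would be set up.

import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(T, R, gamma=0.95):
    """Solve an MDP by linear programming.

    Minimize sum_s V_s subject to
        V_s >= R(s,a) + gamma * sum_s' T(s,a,s') V_s'   for all s, a.
    T has shape (S, A, S); R has shape (S, A).
    """
    S, A, _ = T.shape
    c = np.ones(S)                       # objective: minimize sum of V_s
    # Rewrite each constraint as  gamma * T(s,a,.) . V - V_s <= -R(s,a).
    A_ub = np.zeros((S * A, S))
    b_ub = np.zeros(S * A)
    for s in range(S):
        for a in range(A):
            row = gamma * T[s, a].copy()
            row[s] -= 1.0
            A_ub[s * A + a] = row
            b_ub[s * A + a] = -R[s, a]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
    return res.x                         # optimal value function V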
9. Propositional Representations
- Instead of big matrices, use more AI-like operators.
- states: assignments to propositional variables.
- transitions:
  - dynamic Bayesian networks
  - tree-based representations
  - circuits
  - STRIPS operators extended for probabilities
    - PPDDL 1.0
10PPDDL 1.0 Example
- (action drive
- parameters (?from - location ?to
- location) - precondition (and (at-vehicle
?from) (road ?from ?to) - (not
(has-flat-tire))) - effect (probabilistic
- .85 (and (at-vehicle ?to)
- (not (at-vehicle ?from))
-
(decrease reward 1)) - .15 (and (has-flat-tire)
- (decrease reward 1)) ))
11. Planning Ahead

(:action pick-up-tire
  :parameters (?loc - location)
  :precondition (and (at-vehicle ?loc)
                     (spare-at ?loc)
                     (not (vehicle-has-spare)))
  :effect (and (vehicle-has-spare)
               (decrease reward 1)))

(:action change-tire
  :precondition (has-flat-tire)
  :effect (and (when (vehicle-has-spare)
                 (and (not (has-flat-tire))
                      (not (vehicle-has-spare))
                      (decrease reward 1)))
               (when (not (vehicle-has-spare))
                 (and (not (has-flat-tire))
                      (decrease reward 100)))))
12. Implications
- PPDDL 1.0: first probabilistic planning competition.
  - Part of IPC-4 at ICAPS-04 (Vancouver in June).
- Worst-case complexity (propositional):
  - polynomial horizon: PSPACE-complete
  - infinite horizon: EXP-complete
  - known representations are interconvertible
- Worst-case complexity (relational):
  - not sure, probably worse
PLUG
13. Propositional MDP Algorithms
- Many have been proposed:
  - variations of POP search
  - VI with structured V: apply V(s) = max_a [R(s,a) + γ Σ_{s'} T(s,a,s') V(s')] using a structured representation of V(s)
  - VI with parameterized V: V(s) ≈ w · f(s) (sketch after this slide)
  - conversion to probabilistic SAT
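
To make the "VI with parameterized V" bullet concrete, here is a minimal sketch (not from the talk) of one fitted value-iteration backup with a linear approximation V(s) ≈ w·f(s). The feature matrix and the full least-squares refit over all states are illustrative assumptions; real systems would typically back up only a sampled subset of states.

import numpy as np

def fitted_vi_step(w, features, T, R, gamma=0.95):
    """One backup of value iteration with a linear value function V(s) ~ w . f(s).

    features: matrix of shape (S, k); row s is the feature vector f(s).
    The Bellman backup is computed at every state, then w is refit by
    least squares to the backed-up values.
    """
    V = features @ w                     # current approximate values
    Q = R + gamma * T @ V                # Q(s,a) backups, shape (S, A)
    targets = Q.max(axis=1)              # max_a Q(s,a)
    w_new, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return w_new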
14. Reinforcement Learning
- Add effect uncertainty to outcome uncertainty:
  - actions described by unknown probabilities.
- Given the opportunity to interact with the environment:
  - s, a, r, s, a, r, s, a, r, ...
- Want to find a policy that maximizes the sum of the r's.
- Don't miss out too badly while learning:
  - exploration vs. exploitation
- Real-Life Reinforcement Learning Symposium, Fall 2004 in DC.
PLUG 2
15. Algorithmic Approaches
- Model-free (Q-learning et al.) (sketch after this slide):
  - iteratively estimate V, Q.
  - methods typically converge in the limit.
- Model-based:
  - use experience to estimate T, R.
  - becomes a planning problem.
  - can be made a polytime approximation.
- These methods can interact badly with function approximation for V or Q.
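
A minimal sketch of the model-free route: tabular Q-learning with ε-greedy exploration. The env.reset()/env.step() interface returning (next state, reward, done) is an assumed convention, not something specified in the talk.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Model-free RL sketch: tabular Q-learning with epsilon-greedy exploration."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # exploration vs. exploitation
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q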
16. Planning and Learning
17. Polynomial-Time RL
- Let M be a Markov decision process over N states. Let P(T, ε, M) be the set of all policies that get within ε of their true return in the first T steps, and let opt(P(T, ε, M)) be the optimal asymptotic expected undiscounted return achievable in P(T, ε, M). There exists an algorithm A, taking inputs ε, δ, N, T, and opt(P(T, ε, M)), such that after a total number of actions and computation time bounded by a polynomial in 1/ε, 1/δ, N, T, and R_max, with probability at least 1 − δ, the total undiscounted return will be at least opt(P(T, ε, M)) − ε.
18. Explicit Explore or Exploit (E3)
- (Initialization) Initially, the set S of known states is empty.
- (Balanced Wandering) Any time the current state is not in S, the algorithm performs balanced wandering.
- (Discovery of New Known States) Any time a state i has been visited m_known times during balanced wandering, it enters the known set S and no longer participates in balanced wandering.
- (Off-line Optimizations) Compute optimal policies for M_r (maximize reward, avoiding unknown states) and M_d (minimize steps to an unknown state).
- Execute the M_r policy if it is within ε/2 of optimal; otherwise the M_d policy is likely to quickly discover a state outside S (a schematic sketch follows this slide).
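
The following is a schematic Python sketch of the control loop just described, not Kearns and Singh's exact E3 algorithm; the helper names (balanced_action, plan_exploit, plan_explore, near_optimal) are hypothetical placeholders for the balanced-wandering and off-line planning steps.

def e3_action(s, eps, known, counts, m_known, model, helpers):
    """Schematic E3-style decision rule (a sketch, not the exact algorithm).

    counts[s][a] tracks how often action a was tried in state s during
    balanced wandering.  `helpers` is assumed (hypothetically) to provide:
      balanced_action(counts_s)  -> least-tried action in state s
      plan_exploit(model, known) -> policy for M_r (avoid unknown states)
      plan_explore(model, known) -> policy for M_d (reach an unknown state fast)
      near_optimal(policy, eps)  -> True if the policy is within eps/2 of optimal
    """
    if s not in known:
        # Balanced wandering in a not-yet-known state.
        a = helpers.balanced_action(counts[s])
        counts[s][a] += 1
        if sum(counts[s].values()) >= m_known:
            known.add(s)                  # s has been visited enough: now "known"
        return a
    # Off-line optimizations over the known-state model.
    exploit = helpers.plan_exploit(model, known)
    if helpers.near_optimal(exploit, eps):
        return exploit(s)                 # exploit M_r
    explore = helpers.plan_explore(model, known)
    return explore(s)                     # otherwise, head for an unknown state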
19. POMDPs
- Partially observable Markov decision process: an MDP plus
  - finite set of observations z in Z
  - observation probabilities O(s, z)
- Decision maker sees observations, not states.
- Outcome uncertainty plus state uncertainty.
- The information state (belief state) is Markov (belief-update sketch after this slide).
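
As a concrete illustration of the information state, here is a minimal numpy sketch of the Bayesian belief update; the array shapes and the convention O[s, z] = Pr(z | s) are assumptions consistent with the slide's O(s, z).

import numpy as np

def belief_update(b, a, z, T, O):
    """Bayesian belief-state update for a POMDP.

    b: current belief over states, shape (S,)
    T: transition probabilities, shape (S, A, S)
    O: observation probabilities, O[s, z] = Pr(z | s), shape (S, Z)
    Returns b' with b'(s') proportional to O(s', z) * sum_s T(s, a, s') b(s).
    """
    predicted = b @ T[:, a, :]           # sum_s b(s) T(s, a, s')
    b_next = O[:, z] * predicted
    return b_next / b_next.sum()         # normalize; the sum is Pr(z | b, a)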
20. Sondik's Observation
- b is the belief state or information state (a vector over states).
- V'(b) = max_a [R(b,a) + γ Σ_{b'} T(b,a,b') V(b')]
- V'(b) = max_a [R(a)·b + γ Σ_z Pr(z | b,a) V(b^{a,z})]
- Closure property: if V is a piecewise-linear and convex function of b, then V' is a piecewise-linear and convex function of b!
- The maximum value is achieved by a choice of vector for each b (as a function of z).
- This yields a VI approach (evaluation sketch after this slide).
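
A small sketch (an assumed representation, not code from the talk) of evaluating a piecewise-linear convex value function stored as a set of alpha-vectors, so that V(b) = max_α α·b.

import numpy as np

def pwlc_value(b, alpha_vectors):
    """Evaluate a piecewise-linear convex POMDP value function.

    alpha_vectors: iterable of vectors over states; V(b) = max_alpha alpha . b.
    Returns the value at belief b and the index of the maximizing vector
    (which identifies the action/plan to follow at b).
    """
    alphas = np.asarray(alpha_vectors)   # shape (n_vectors, S)
    values = alphas @ b
    best = int(values.argmax())
    return values[best], best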
21. POMDP Value Functions
[Figure: value function plotted over the belief space (x-axis Pr(i2), values down to −100). A finite-horizon value function has a finite representation: piecewise-linear and convex (Sondik '71). Animation by Thrun.]
22. Algorithms
- If V has n vectors, V' has at most |A| n^|Z|.
- But most are dominated.
- Witness algorithm:
  - each needed combination is testified to by some b.
  - search for b's at which V(b) does not equal the one-step lookahead and add the corresponding combinations.
- Point-based approximations:
  - quit early; bound the complexity of V.
23. POMDPs
- Complexity:
  - exact, polynomial horizon: PSPACE-complete.
  - exact, infinite horizon: incomputable!
  - discounted approximation: finite.
  - single step of exact VI: if polynomial, then RP = NP.
  - Witness and incremental pruning are polynomial if the intermediate Q functions stay small.
- Pet peeve: it's not about the number of states!
24. RL in POMDPs
- If we add effect uncertainty, harder still.
- Active approaches:
  - model-based: EM, PSRs, instance-based methods
  - memoryless: multistep updates are key (TD(λ), λ > 0); can do well.
  - policy search: don't bother with values; smart generate-and-test.
25. Stochastic Games
- Stochastic (or Markov) games: an MDP plus
  - a set of n players
  - finite sets of actions a_i in A_i for the players
  - transition probabilities T(s, a_1, ..., a_n, s')
  - rewards (or costs) R_i(s, a_1, ..., a_n)
- Objective: select actions to maximize total expected reward.
- But how should the other players be handled?
26. Dealing with Agent Uncertainty
- Assume the agent follows a fixed strategy:
  - can model as an MDP.
- Assume the agent is selected from a fixed population:
  - can model as a POMDP.
- Paranoid: assume the agent minimizes our reward:
  - can model as a zero-sum game.
- Find an equilibrium (no incentive to change):
  - game-theoretic approach.
27. Example: Grid Game 3
- Actions: U, D, R, L, X. No move on collision. Semiwalls (passable 50% of the time).
- Rewards: −1 per step, −10 for a collision, 100 for the goal, 0 if back to the initial configuration. Both players can reach the goal.
28. Complexity, Algorithms
- Zero sum, one state: equivalent to LP (sketch after this slide).
- Zero sum, multistate: approximate via VI, but optimal values can be irrational (not exact).
- General sum, one state (terminal): open.
- General sum, one state (repeated): polytime using threats.
- General sum, multistate: only approximations.
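
To illustrate "zero sum, one state: equivalent to LP," here is a hedged sketch that computes a maximin mixed strategy for a one-shot matrix game with scipy.optimize.linprog; the payoff-matrix convention (rows belong to the maximizing player) is an assumption.

import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A):
    """Maximin strategy for the row player of a zero-sum matrix game.

    A[i, j] is the row player's payoff for row i against column j.
    Variables are (x_1..x_m, v): maximize v subject to
        sum_i x_i A[i, j] >= v  for every column j,
        sum_i x_i = 1,  x >= 0.
    """
    m, n = A.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                  # minimize -v  <=>  maximize v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])     # v - sum_i x_i A[i, j] <= 0
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])                        # strategy sums to 1
    bounds = [(0, None)] * m + [(None, None)]     # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]                   # mixed strategy, game value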
29. Beyond
- RL in stochastic games:
  - convergent, polytime algorithms for zero sum.
  - the objective is still debated for general sum.
- Partially observable stochastic games:
  - Oy. Modeling, complexity, and approximation work. Relevant but tough.
- RL in partially observable SGs:
  - Well, that's life, isn't it?
30. Discussion
- Complexity tends to rise when handling more uncertainty.
- Use a model that is appropriate for your problem, not one that is too rich!
- Very active area of research in machine learning, AI, and planning.
- Let's be precise about what we are solving!