Introduction to Planning Under Uncertainty

1
Introduction to Planning Under Uncertainty
  • Michael L. Littman
  • Rutgers University
  • Department of Computer Science
  • Rutgers Laboratory for Real Life Reinforcement
    Learning

2
Planning?
  • Selecting: explicit decision making
  • and Executing: influences an environment
  • a Sequence of Actions: outcome depends on multiple steps
  • to Accomplish some Objective: performance is measured

3
Uncertainty?
  • Four principal types
  • outcome uncertainty: don't know which outcome will occur
    (e.g., MDPs).
  • effect uncertainty: don't know the possible outcomes (e.g., RL).
  • state uncertainty: don't know the current state (e.g., POMDPs).
  • agent uncertainty: don't know what other agents will do
    (e.g., games).
  • Classical planning: no uncertainty.
  • Real-world problems: one or more of the above.

4
A Model-based Viewpoint
  • Types of uncertainty can be combined:
  • outcome: MDPs
  • state: POMDPs
  • agent: partially observable SGs
  • effect: RL in partially observable SGs
  • What uncertainty does your problem have?
  • What model most directly captures it?
  • Which algorithms are most appropriate?

5
Example Planning Problem
[Figure: example planning problem; image not included in the transcript.]
6
MDPs
  • Markov decision process
  • finite set of states s in S
  • finite set of actions a in A
  • transition probabilities T(s, a, s')
  • rewards (or costs) R(s, a)
  • Objective: select actions to maximize total expected reward
    (often discounted by γ).

7
About MDPs
  • Control problems: outcome uncertainty.
  • Specified via a set of matrices.
  • Complexity: P-complete.
  • Bellman equations:
  • V(s) = max_a ( R(s,a) + γ Σ_s' T(s,a,s') V(s') )
  • Q(s,a) = R(s,a) + γ Σ_s' T(s,a,s') max_a' Q(s',a')
  • Algorithms:
  • value iteration (workhorse), V ← F(V); see the sketch below.
  • policy iteration (often fast, loose analysis)
  • modified policy iteration (a compromise)
  • linear programming

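To make the Bellman backup concrete, here is a minimal value-iteration sketch in Python/NumPy; the two-state, two-action MDP (T, R, gamma) is invented purely for illustration.

    import numpy as np

    # Toy MDP, invented numbers: 2 states, 2 actions.
    # T[s, a, s2] = transition probability, R[s, a] = expected reward.
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.0, 1.0], [0.5, 0.5]]])
    R = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    gamma = 0.95

    # Value iteration: repeatedly apply the Bellman backup V <- F(V).
    V = np.zeros(T.shape[0])
    for _ in range(10_000):
        Q = R + gamma * np.einsum('sat,t->sa', T, V)  # Q(s,a) = R(s,a) + g * sum_s' T V
        V_new = Q.max(axis=1)                         # V(s) = max_a Q(s,a)
        if np.max(np.abs(V_new - V)) < 1e-8:
            break
        V = V_new

    print("V* ~", V, "greedy policy:", Q.argmax(axis=1))
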
8
LP for MDP
  • minimize Σ_s V_s   subject to   V_s ≥ R(s,a) + γ Σ_s' T(s,a,s') V_s'
    for all s, a
  • polynomial time in the worst case (the only algorithm known to be)
  • key to extensions (constraints, approximation); see the sketch
    below.

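A sketch of that LP with SciPy's linprog (assuming SciPy is available; the MDP numbers are the same invented toy as above): minimize Σ_s V_s subject to V_s ≥ R(s,a) + γ Σ_s' T(s,a,s') V_s' for every (s, a).

    import numpy as np
    from scipy.optimize import linprog

    # Toy MDP (invented numbers): 2 states, 2 actions.
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.0, 1.0], [0.5, 0.5]]])   # T[s, a, s']
    R = np.array([[1.0, 0.0],
                  [0.0, 2.0]])                 # R[s, a]
    gamma = 0.95
    nS, nA = R.shape

    # One inequality per (s, a):  V_s - gamma * sum_s' T(s,a,s') V_s' >= R(s,a).
    # linprog wants A_ub @ x <= b_ub, so negate both sides.
    A_ub, b_ub = [], []
    for s in range(nS):
        for a in range(nA):
            A_ub.append(-np.eye(nS)[s] + gamma * T[s, a])
            b_ub.append(-R[s, a])

    # Objective: minimize sum_s V_s; values may be negative, so leave bounds free.
    res = linprog(c=np.ones(nS), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * nS, method="highs")
    print("V* ~", res.x)
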
9
Propositional Representations
  • Instead of big matrices, more AI-like operators.
  • states: assignments to propositional variables.
  • transitions:
  • dynamic Bayesian networks
  • tree-based representations
  • circuits
  • STRIPS operators extended for probabilities
  • PPDDL 1.0

10
PPDDL 1.0 Example
    (:action drive
       :parameters (?from - location ?to - location)
       :precondition (and (at-vehicle ?from) (road ?from ?to)
                          (not (has-flat-tire)))
       :effect (probabilistic
                 .85 (and (at-vehicle ?to)
                          (not (at-vehicle ?from))
                          (decrease reward 1))
                 .15 (and (has-flat-tire)
                          (decrease reward 1))))

11
Planning Ahead
    (:action pick-up-tire
       :parameters (?loc - location)
       :precondition (and (at-vehicle ?loc) (spare-at ?loc)
                          (not (vehicle-has-spare)))
       :effect (and (vehicle-has-spare) (decrease reward 1)))

    (:action change-tire
       :precondition (has-flat-tire)
       :effect (and (when (vehicle-has-spare)
                          (and (not (has-flat-tire))
                               (not (vehicle-has-spare))
                               (decrease reward 1)))
                    (when (not (vehicle-has-spare))
                          (and (not (has-flat-tire))
                               (decrease reward 100)))))

12
Implications
  • PPDDL 1.0: the first probabilistic planning competition
  • Part of IPC-4 at ICAPS-04 (Vancouver in June).
  • Worst-case complexity (propositional):
  • polynomial horizon: PSPACE-complete
  • infinite horizon: EXP-complete
  • Known representations are interconvertible.
  • Worst-case complexity (relational):
  • not sure; probably worse

PLUG
13
Propositional MDP Algorithms
  • Many have been proposed
  • variations of POP search
  • V(s) = max_a ( R(s,a) + γ Σ_s' T(s,a,s') V(s') )
  • VI with a structured representation of V(s)
  • VI with a parameterized V: V(s) ≈ w · f(s)
    (see the sketch after this list)
  • Conversion to probabilistic SAT

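A minimal sketch of "VI with a parameterized V": each sweep computes Bellman backups at the states and refits the weights w of V(s) ≈ w · f(s) by least squares. The toy MDP and the feature map F are invented for illustration.

    import numpy as np

    # Toy MDP (invented numbers) and a simple feature map.
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.0, 1.0], [0.5, 0.5]]])   # T[s, a, s']
    R = np.array([[1.0, 0.0],
                  [0.0, 2.0]])                 # R[s, a]
    gamma = 0.95

    F = np.array([[1.0, 0.0],                  # rows are f(s); here the features
                  [1.0, 1.0]])                 # are rich enough to be exact

    w = np.zeros(F.shape[1])
    for _ in range(500):
        V = F @ w                                               # approximate values
        targets = (R + gamma * np.einsum('sat,t->sa', T, V)).max(axis=1)
        w, *_ = np.linalg.lstsq(F, targets, rcond=None)         # refit V(s) ~ w . f(s)

    print("fitted V ~", F @ w)
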
14
Reinforcement Learning
  • Add effect uncertainty to outcome uncertainty
  • actions described by unknown probabilities.
  • Given opportunity to interact with environment
  • s, a, r, s, a, r, s, a, r, …
  • Want to find a policy that maximizes the sum of the r's.
  • Don't miss out too badly while learning:
  • exploration vs. exploitation
  • Real-Life Reinforcement Learning Symposium, Fall 2004 in DC.

PLUG 2
15
Algorithmic Approaches
  • Model-free (Q-learning et al.); see the sketch below.
  • Iteratively estimate V, Q.
  • Methods typically converge in the limit.
  • Model-based:
  • Use experience to estimate T, R.
  • Becomes a planning problem.
  • Can be made a polynomial-time approximation.
  • These methods can interact badly with function
    approximation for V or Q.

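A minimal model-free sketch (tabular Q-learning with ε-greedy exploration). The toy MDP below is used only as a simulator the learner samples from; its T and R are never read by the update rule. All numbers are invented.

    import numpy as np

    # Toy MDP (invented numbers), used only as a black-box simulator.
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.0, 1.0], [0.5, 0.5]]])   # T[s, a, s']
    R = np.array([[1.0, 0.0],
                  [0.0, 2.0]])                 # R[s, a]
    gamma, alpha, eps = 0.95, 0.1, 0.1
    nS, nA = R.shape
    rng = np.random.default_rng(0)

    Q = np.zeros((nS, nA))
    s = 0
    for _ in range(200_000):
        # epsilon-greedy: mostly exploit the current Q, sometimes explore.
        a = rng.integers(nA) if rng.random() < eps else int(Q[s].argmax())
        s2 = int(rng.choice(nS, p=T[s, a]))    # sample the (unknown) dynamics
        r = R[s, a]
        # Q-learning update toward the one-step bootstrapped target.
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

    print("learned Q ~\n", Q)
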
16
Planning and Learning
17
Polynomial Time RL
  • Let M be a Markov decision process over N states. Let P(T, ε, M)
    be the set of all policies that get within ε of their true return
    in the first T steps, and let opt(P(T, ε, M)) be the optimal
    asymptotic expected undiscounted return achievable in P(T, ε, M).
    There exists an algorithm A, taking inputs ε, δ, N, T, and
    opt(P(T, ε, M)), such that after a total number of actions and
    computation time bounded by a polynomial in 1/ε, 1/δ, N, T, and
    Rmax, with probability at least 1 − δ, the total undiscounted
    return will be at least opt(P(T, ε, M)) − ε.

18
Explicit Explore Exploit
  • (Initialization) Initially, the set S of known
    states is empty.
  • (Balanced Wandering) Any time the current state
    is not in S, the algorithm performs balanced
    wandering.
  • (Discovery of New Known States) Any time a state i has been
    visited m_known times during balanced wandering, it enters the
    known set S and no longer participates in balanced wandering.
  • (Off-line optimization) Compute optimal policies for M_r
    (maximize reward, avoiding unknown states) and M_d (minimize
    steps to an unknown state).
  • Execute M_r's policy if it is within ε/2 of optimal; otherwise
    M_d's policy is likely to quickly discover a state outside S.
    (A partial sketch of the bookkeeping follows below.)

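A partial sketch of the bookkeeping behind E³, under assumptions: only balanced wandering and the known-set promotion are shown, the visit threshold m_known is an arbitrary constant, and the off-line planning over M_r and M_d is left to a caller-supplied policy. None of the names below come from the talk.

    import numpy as np
    from collections import defaultdict

    M_KNOWN = 50  # visit threshold (assumed value)

    class E3Bookkeeping:
        """Balanced wandering + known-set tracking only; not the full E^3."""

        def __init__(self, n_actions):
            self.visits = defaultdict(lambda: np.zeros(n_actions, dtype=int))
            self.known = set()

        def select_action(self, state, exploit_policy=None):
            if state in self.known and exploit_policy is not None:
                # Inside the known set: follow whatever the off-line
                # planner (M_r or M_d) recommends.
                return exploit_policy(state)
            # Balanced wandering: take the action tried least often here.
            return int(np.argmin(self.visits[state]))

        def record(self, state, action):
            self.visits[state][action] += 1
            # Promote the state once it has been visited m_known times.
            if state not in self.known and self.visits[state].sum() >= M_KNOWN:
                self.known.add(state)

    bk = E3Bookkeeping(n_actions=4)
    a = bk.select_action("s0")
    bk.record("s0", a)
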
19
POMDPs
  • Partially observable Markov decision process
  • MDP plus
  • finite set of observations z in Z
  • observation probabilities O(s, z)
  • The decision maker sees observations, not states.
  • Outcome uncertainty plus state uncertainty.
  • The information state (belief state) is Markov; see the
    belief-update sketch below.

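Because only observations are seen, the planner maintains a belief state. A minimal sketch of the Bayesian belief update b'(s') ∝ O(s', z) Σ_s T(s, a, s') b(s), with invented toy numbers:

    import numpy as np

    def belief_update(b, a, z, T, O):
        """b'(s') proportional to O(s', z) * sum_s T(s, a, s') * b(s)."""
        b_next = O[:, z] * (b @ T[:, a, :])
        return b_next / b_next.sum()   # normalize; assumes Pr(z | b, a) > 0

    # Toy POMDP pieces (invented numbers): 2 states, 2 actions, 2 observations.
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.0, 1.0], [0.5, 0.5]]])   # T[s, a, s']
    O = np.array([[0.8, 0.2],
                  [0.3, 0.7]])                 # O[s, z]

    b = np.array([0.5, 0.5])
    print(belief_update(b, a=0, z=1, T=T, O=O))
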
20
Sondik's Observation
  • b is the belief state, or information state (a vector).
  • V'(b) = max_a ( R(b,a) + γ Σ_b' T(b,a,b') V(b') )
  • V'(b) = max_a ( R(a)·b + γ Σ_z Pr(z | b, a) V(b_z^a) )
  • Closure property: if V is a piecewise-linear and convex function
    of b, then V' is a piecewise-linear and convex function of b!
  • The maximum value is achieved by a choice of vector for each b
    (as a function of z).
  • This yields a VI approach.

21
POMDP Value Functions
[Figure: POMDP value function over the belief space (x-axis: Pr(i2)).
A finite-horizon value function has a finite representation: it is
piecewise-linear and convex (Sondik '71). Animation by Thrun.]
22
Algorithms
  • If V has n vectors, V' has at most |A| n^|Z|.
  • But most are dominated.
  • Witness algorithm:
  • Each needed combination is testified to by some b.
  • Search for b's for which V(b) does not equal the one-step
    lookahead, and add the corresponding combinations.
  • Point-based approximations (see the sketch below):
  • Quit early. Bound the complexity of V.

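A sketch of one point-based backup in the spirit of PBVI (assumed details; the toy POMDP numbers and belief set B are invented): the alpha-vector set is backed up only at a fixed set of belief points, which bounds how large the representation of V can grow.

    import numpy as np

    def point_based_backup(B, Gamma, T, R, O, gamma):
        """One backup of the alpha-vector set Gamma at the belief points B."""
        nS, nA = R.shape
        nZ = O.shape[1]
        # Projected vectors: G[a, z, k, s] = gamma * sum_s' T(s,a,s') O(s',z) Gamma[k, s']
        G = gamma * np.einsum('saq,qz,kq->azks', T, O, Gamma)
        new_Gamma = []
        for b in B:
            best_val, best_vec = -np.inf, None
            for a in range(nA):
                vec = R[:, a].copy()
                for z in range(nZ):
                    # Pick the projected vector that is best at this belief point.
                    k = int((G[a, z] @ b).argmax())
                    vec = vec + G[a, z, k]
                if vec @ b > best_val:
                    best_val, best_vec = vec @ b, vec
            new_Gamma.append(best_vec)
        return np.unique(np.array(new_Gamma), axis=0)

    # Toy POMDP (invented numbers): 2 states, 2 actions, 2 observations.
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.0, 1.0], [0.5, 0.5]]])            # T[s, a, s']
    O = np.array([[0.8, 0.2], [0.3, 0.7]])              # O[s, z]
    R = np.array([[1.0, 0.0], [0.0, 2.0]])              # R[s, a]
    B = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])  # fixed belief points

    Gamma = np.zeros((1, 2))                            # start from the zero vector
    for _ in range(50):
        Gamma = point_based_backup(B, Gamma, T, R, O, gamma=0.95)
    print("values at B ~", (B @ Gamma.T).max(axis=1))
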
23
POMDPs
  • Complexity:
  • Exact, polynomial horizon: PSPACE-complete.
  • Exact, infinite horizon: incomputable!
  • Discounted approximation: finite.
  • Single step of exact VI: if polynomial, then RP = NP.
  • Witness, incremental pruning: polynomial if the intermediate Q
    functions stay small.
  • Pet peeve: it's not about the number of states!

24
RL in POMDPs
  • If we add effect uncertainty, it is harder still.
  • Active approaches:
  • model-based:
  • EM, PSRs, instance-based methods
  • memoryless:
  • multistep updates are key (TD(λ > 0)); can do well.
  • policy search:
  • don't bother with values; smart generate-and-test.

25
Stochastic Games
  • Stochastic (or Markov) Games
  • MDP plus
  • set of n players
  • finite set of actions ai in Ai for the players
  • transition probabilities T(s, a1, …, an, s')
  • rewards (or costs) Ri(s, a1, …, an)
  • Objective: select actions to maximize total expected reward.
  • But how should the other players be handled?

26
Dealing with Agent Uncertainty
  • Assume the agent follows a fixed strategy:
  • can model as an MDP
  • Assume the agent is selected from a fixed population:
  • can model as a POMDP
  • Paranoid: assume the agent minimizes our reward:
  • can model as a zero-sum game
  • Find an equilibrium (no incentive to change):
  • the game-theoretic approach

27
Example Grid Game 3
Actions: U, D, R, L, X. No move on collision. Semiwalls (50%).
Rewards: -1 for a step, -10 for a collision, +100 for the goal,
0 if back to the initial configuration. Both players can reach
the goal.
28
Complexity, Algorithms
  • Zero sum, one state: equivalent to an LP (see the sketch below).
  • Zero sum, multistate: approximate via VI, but optimal values can
    be irrational (not exact).
  • General sum, one state (terminal): open.
  • General sum, one state (repeated): polytime using threats.
  • General sum, multistate: only approximations.

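A sketch of the "zero sum, one state: equivalent to an LP" point, assuming SciPy: the row player maximizes the game value v subject to earning at least v against every column of an invented payoff matrix.

    import numpy as np
    from scipy.optimize import linprog

    # Invented zero-sum payoff matrix A[i, j]: row player's reward.
    A = np.array([[ 0.0,  1.0, -1.0],
                  [-1.0,  0.0,  1.0],
                  [ 1.0, -1.0,  0.0]])   # rock-paper-scissors style
    m, n = A.shape

    # Variables x = (p_1..p_m, v). Maximize v  <=>  minimize -v.
    c = np.concatenate([np.zeros(m), [-1.0]])
    # For every column j: sum_i p_i A[i, j] >= v  ->  -A[:, j]·p + v <= 0.
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probabilities sum to 1; v is free, p is nonnegative.
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
    b_eq = [1.0]
    bounds = [(0, None)] * m + [(None, None)]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    p, v = res.x[:m], res.x[m]
    print("minimax strategy ~", p, "game value ~", v)
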
29
Beyond
  • RL in stochastic games:
  • convergent, polytime algorithms for zero sum
  • objective still debated for general sum
  • Partially observable stochastic games:
  • Oy. Modeling, complexity, and approximation work. Relevant but
    tough.
  • RL in partially observable SGs:
  • Well, that's life, isn't it?

30
Discussion
  • Complexity tends to rise when handling more
    uncertainty.
  • Use a model that is appropriate for your problem, not one that
    is too rich!
  • Very active area of research in machine learning,
    AI, planning
  • Let's be precise about what we are solving!