Title: Introduction to Planning Under Uncertainty
1. Introduction to Planning Under Uncertainty
- Michael L. Littman
- Rutgers University
- Department of Computer Science
- Rutgers Laboratory for Real-Life Reinforcement Learning
2. Planning?
- Selecting
  - explicit decision making
- and Executing
  - influences an environment
- a Sequence of Actions
  - outcome depends on multiple steps
- to Accomplish some Objective
  - performance is measured
3. Uncertainty?
- Four principal types:
  - outcome uncertainty: don't know which outcome will occur (e.g., MDPs).
  - effect uncertainty: don't know the possible outcomes (e.g., RL).
  - state uncertainty: don't know the current state (e.g., POMDPs).
  - agent uncertainty: don't know what other agents will do (e.g., games).
- Classical planning: no uncertainty.
- Real-world problems: one or more of the above.
4. A Model-based Viewpoint
- Types of uncertainty can be combined:
  - outcome: MDPs
  - state: POMDPs
  - agent: partially observable SGs
  - effect: RL in partially observable SGs
- What uncertainty does your problem have?
- What model most directly captures it?
- What algorithms are most appropriate?
5. Example Planning Problem
6. MDPs
- Markov decision process:
  - finite set of states s in S
  - finite set of actions a in A
  - transition probabilities T(s, a, s')
  - rewards (or costs) R(s, a)
- Objective: select actions to maximize total expected reward (often discounted by γ).
7. About MDPs
- Control problems: outcome uncertainty.
- Specified via a set of matrices.
- Complexity: P-complete.
- Bellman equations:
  - V(s) = max_a [R(s,a) + γ Σ_{s'} T(s,a,s') V(s')]
  - Q(s,a) = R(s,a) + γ Σ_{s'} T(s,a,s') max_{a'} Q(s',a')
- Algorithms:
  - value iteration (workhorse), V ← F(V) (sketch after this slide).
  - policy iteration (often fast, loose analysis)
  - modified policy iteration (compromise)
  - linear programming
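
As a concrete companion to the Bellman equation and the "value iteration (workhorse)" bullet, here is a minimal numpy sketch of tabular value iteration. The array shapes T[s, a, s'] and R[s, a] follow the slide's notation; the tolerance-based stopping rule is an assumption, not something specified in the talk.

import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration: repeatedly apply V <- F(V) until convergence.

    T: transition probabilities, shape (S, A, S), T[s, a, s'] = Pr(s' | s, a)
    R: rewards, shape (S, A)
    Returns the (near-)optimal value function and a greedy policy.
    """
    S, A, _ = T.shape
    V = np.zeros(S)
    while True:
        # Q(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') V(s')
        Q = R + gamma * T @ V          # shape (S, A)
        V_new = Q.max(axis=1)          # V(s) = max_a Q(s,a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new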
8. LP for MDP
- min Σ_s V_s  subject to  V_s ≥ R(s,a) + γ Σ_{s'} T(s,a,s') V_{s'}  for all s, a (a code sketch follows this slide)
- polynomial time in the worst case (the only approach known to be)
- key to extensions (constraints, approximation)
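
A sketch of the same LP in code, assuming scipy is available; the dense constraint construction is purely illustrative, not how a production solver would be set up.

import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(T, R, gamma=0.95):
    """Solve an MDP by linear programming.

    Minimize sum_s V_s subject to
        V_s >= R(s,a) + gamma * sum_s' T(s,a,s') V_s'   for all s, a.
    T has shape (S, A, S); R has shape (S, A).
    """
    S, A, _ = T.shape
    c = np.ones(S)                       # objective: minimize sum of V_s
    # Rewrite each constraint as  gamma * T(s,a,.) . V - V_s <= -R(s,a).
    A_ub = np.zeros((S * A, S))
    b_ub = np.zeros(S * A)
    for s in range(S):
        for a in range(A):
            row = gamma * T[s, a].copy()
            row[s] -= 1.0
            A_ub[s * A + a] = row
            b_ub[s * A + a] = -R[s, a]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
    return res.x                         # optimal value function V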
9. Propositional Representations
- Instead of big matrices, use more AI-like operators.
- states: assignments to propositional variables.
- transitions:
  - dynamic Bayesian networks
  - tree-based representations
  - circuits
  - STRIPS operators extended for probabilities
    - PPDDL 1.0
10PPDDL 1.0 Example
- (action drive
- parameters (?from - location ?to
- location) - precondition (and (at-vehicle
?from) (road ?from ?to) - (not
(has-flat-tire))) - effect (probabilistic
- .85 (and (at-vehicle ?to)
- (not (at-vehicle ?from))
-
(decrease reward 1)) - .15 (and (has-flat-tire)
- (decrease reward 1)) ))
11. Planning Ahead

(:action pick-up-tire
  :parameters (?loc - location)
  :precondition (and (at-vehicle ?loc)
                     (spare-at ?loc)
                     (not (vehicle-has-spare)))
  :effect (and (vehicle-has-spare)
               (decrease reward 1)))

(:action change-tire
  :precondition (has-flat-tire)
  :effect (and (when (vehicle-has-spare)
                 (and (not (has-flat-tire))
                      (not (vehicle-has-spare))
                      (decrease reward 1)))
               (when (not (vehicle-has-spare))
                 (and (not (has-flat-tire))
                      (decrease reward 100)))))
12. Implications
- PPDDL 1.0: first probabilistic planning competition.
  - Part of IPC-4 at ICAPS-04 (Vancouver in June).
- Worst-case complexity (propositional):
  - polynomial horizon: PSPACE-complete
  - infinite horizon: EXP-complete
  - known representations are interconvertible
- Worst-case complexity (relational):
  - not sure, probably worse
PLUG
13. Propositional MDP Algorithms
- Many have been proposed:
  - variations of POP search
  - VI with structured V: apply V(s) = max_a [R(s,a) + γ Σ_{s'} T(s,a,s') V(s')] using a structured representation of V(s)
  - VI with parameterized V: V(s) ≈ w · f(s) (sketch after this slide)
  - conversion to probabilistic SAT
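
To make the "VI with parameterized V" bullet concrete, here is a minimal sketch (not from the talk) of one fitted value-iteration backup with a linear approximation V(s) ≈ w·f(s). The feature matrix and the full least-squares refit over all states are illustrative assumptions; real systems would typically back up only a sampled subset of states.

import numpy as np

def fitted_vi_step(w, features, T, R, gamma=0.95):
    """One backup of value iteration with a linear value function V(s) ~ w . f(s).

    features: matrix of shape (S, k); row s is the feature vector f(s).
    The Bellman backup is computed at every state, then w is refit by
    least squares to the backed-up values.
    """
    V = features @ w                     # current approximate values
    Q = R + gamma * T @ V                # Q(s,a) backups, shape (S, A)
    targets = Q.max(axis=1)              # max_a Q(s,a)
    w_new, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return w_new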
14. Reinforcement Learning
- Add effect uncertainty to outcome uncertainty:
  - actions described by unknown probabilities.
- Given the opportunity to interact with the environment:
  - s, a, r, s, a, r, s, a, r, ...
- Want to find a policy that maximizes the sum of the r's.
- Don't miss out too badly while learning:
  - exploration vs. exploitation
- Real-Life Reinforcement Learning Symposium, Fall 2004 in DC.
PLUG 2
15. Algorithmic Approaches
- Model-free (Q-learning et al.) (sketch after this slide):
  - iteratively estimate V, Q.
  - methods typically converge in the limit.
- Model-based:
  - use experience to estimate T, R.
  - becomes a planning problem.
  - can be made a polytime approximation.
- These methods can interact badly with function approximation for V or Q.
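
A minimal sketch of the model-free route: tabular Q-learning with ε-greedy exploration. The env.reset()/env.step() interface returning (next state, reward, done) is an assumed convention, not something specified in the talk.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Model-free RL sketch: tabular Q-learning with epsilon-greedy exploration."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # exploration vs. exploitation
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q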
16. Planning and Learning
17. Polynomial-Time RL
- Let M be a Markov decision process over N states. Let P(T, ε, M) be the set of all policies that get within ε of their true return in the first T steps, and let opt(P(T, ε, M)) be the optimal asymptotic expected undiscounted return achievable in P(T, ε, M). There exists an algorithm A, taking inputs ε, δ, N, T, and opt(P(T, ε, M)), such that after a total number of actions and computation time bounded by a polynomial in 1/ε, 1/δ, N, T, and R_max, with probability at least 1 − δ, the total undiscounted return will be at least opt(P(T, ε, M)) − ε.
18. Explicit Explore or Exploit (E3)
- (Initialization) Initially, the set S of known states is empty.
- (Balanced Wandering) Any time the current state is not in S, the algorithm performs balanced wandering.
- (Discovery of New Known States) Any time a state i has been visited m_known times during balanced wandering, it enters the known set S and no longer participates in balanced wandering.
- (Off-line Optimizations) Compute optimal policies for M_r (maximize reward, avoiding unknown states) and M_d (minimize steps to an unknown state).
- Execute the M_r policy if it is within ε/2 of optimal; otherwise the M_d policy is likely to quickly discover a state outside S (a schematic sketch follows this slide).
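
The following is a schematic Python sketch of the control loop just described, not Kearns and Singh's exact E3 algorithm; the helper names (balanced_action, plan_exploit, plan_explore, near_optimal) are hypothetical placeholders for the balanced-wandering and off-line planning steps.

def e3_action(s, eps, known, counts, m_known, model, helpers):
    """Schematic E3-style decision rule (a sketch, not the exact algorithm).

    counts[s][a] tracks how often action a was tried in state s during
    balanced wandering.  `helpers` is assumed (hypothetically) to provide:
      balanced_action(counts_s)  -> least-tried action in state s
      plan_exploit(model, known) -> policy for M_r (avoid unknown states)
      plan_explore(model, known) -> policy for M_d (reach an unknown state fast)
      near_optimal(policy, eps)  -> True if the policy is within eps/2 of optimal
    """
    if s not in known:
        # Balanced wandering in a not-yet-known state.
        a = helpers.balanced_action(counts[s])
        counts[s][a] += 1
        if sum(counts[s].values()) >= m_known:
            known.add(s)                  # s has been visited enough: now "known"
        return a
    # Off-line optimizations over the known-state model.
    exploit = helpers.plan_exploit(model, known)
    if helpers.near_optimal(exploit, eps):
        return exploit(s)                 # exploit M_r
    explore = helpers.plan_explore(model, known)
    return explore(s)                     # otherwise, head for an unknown state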
19. POMDPs
- Partially observable Markov decision process: an MDP plus
  - finite set of observations z in Z
  - observation probabilities O(s, z)
- Decision maker sees observations, not states.
- Outcome uncertainty plus state uncertainty.
- The information state (belief state) is Markov (belief-update sketch after this slide).
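
As a concrete illustration of the information state, here is a minimal numpy sketch of the Bayesian belief update; the array shapes and the convention O[s, z] = Pr(z | s) are assumptions consistent with the slide's O(s, z).

import numpy as np

def belief_update(b, a, z, T, O):
    """Bayesian belief-state update for a POMDP.

    b: current belief over states, shape (S,)
    T: transition probabilities, shape (S, A, S)
    O: observation probabilities, O[s, z] = Pr(z | s), shape (S, Z)
    Returns b' with b'(s') proportional to O(s', z) * sum_s T(s, a, s') b(s).
    """
    predicted = b @ T[:, a, :]           # sum_s b(s) T(s, a, s')
    b_next = O[:, z] * predicted
    return b_next / b_next.sum()         # normalize; the sum is Pr(z | b, a)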
20. Sondik's Observation
- b is the belief state or information state (a vector over states).
- V'(b) = max_a [R(b,a) + γ Σ_{b'} T(b,a,b') V(b')]
- V'(b) = max_a [R(a)·b + γ Σ_z Pr(z | b,a) V(b^{a,z})]
- Closure property: if V is a piecewise-linear and convex function of b, then V' is a piecewise-linear and convex function of b!
- The maximum value is achieved by a choice of vector for each b (as a function of z).
- This yields a VI approach (evaluation sketch after this slide).
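
A small sketch (an assumed representation, not code from the talk) of evaluating a piecewise-linear convex value function stored as a set of alpha-vectors, so that V(b) = max_α α·b.

import numpy as np

def pwlc_value(b, alpha_vectors):
    """Evaluate a piecewise-linear convex POMDP value function.

    alpha_vectors: iterable of vectors over states; V(b) = max_alpha alpha . b.
    Returns the value at belief b and the index of the maximizing vector
    (which identifies the action/plan to follow at b).
    """
    alphas = np.asarray(alpha_vectors)   # shape (n_vectors, S)
    values = alphas @ b
    best = int(values.argmax())
    return values[best], best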
21. POMDP Value Functions
[Figure: value function plotted over the belief space (x-axis Pr(i2), values down to −100). A finite-horizon value function has a finite representation: piecewise-linear and convex (Sondik '71). Animation by Thrun.]
22. Algorithms
- If V has n vectors, V' has at most |A| n^|Z|.
- But most are dominated.
- Witness algorithm:
  - each needed combination is testified to by some b.
  - search for b's at which V(b) does not equal the one-step lookahead and add the corresponding combinations.
- Point-based approximations:
  - quit early; bound the complexity of V.
23. POMDPs
- Complexity:
  - exact, polynomial horizon: PSPACE-complete.
  - exact, infinite horizon: incomputable!
  - discounted approximation: finite.
  - single step of exact VI: if polynomial, then RP = NP.
  - Witness and incremental pruning are polynomial if the intermediate Q functions stay small.
- Pet peeve: it's not about the number of states!
24. RL in POMDPs
- If we add effect uncertainty, harder still.
- Active approaches:
  - model-based: EM, PSRs, instance-based methods
  - memoryless: multistep updates are key (TD(λ), λ > 0); can do well.
  - policy search: don't bother with values; smart generate-and-test.
25. Stochastic Games
- Stochastic (or Markov) games: an MDP plus
  - a set of n players
  - finite sets of actions a_i in A_i for the players
  - transition probabilities T(s, a_1, ..., a_n, s')
  - rewards (or costs) R_i(s, a_1, ..., a_n)
- Objective: select actions to maximize total expected reward.
- But how should the other players be handled?
26. Dealing with Agent Uncertainty
- Assume the agent follows a fixed strategy:
  - can model as an MDP.
- Assume the agent is selected from a fixed population:
  - can model as a POMDP.
- Paranoid: assume the agent minimizes our reward:
  - can model as a zero-sum game.
- Find an equilibrium (no incentive to change):
  - game-theoretic approach.
27. Example: Grid Game 3
- Actions: U, D, R, L, X. No move on collision. Semiwalls (passable 50% of the time).
- Rewards: −1 per step, −10 for a collision, 100 for the goal, 0 if back to the initial configuration. Both players can reach the goal.
28. Complexity, Algorithms
- Zero sum, one state: equivalent to LP (sketch after this slide).
- Zero sum, multistate: approximate via VI, but optimal values can be irrational (not exact).
- General sum, one state (terminal): open.
- General sum, one state (repeated): polytime using threats.
- General sum, multistate: only approximations.
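
To illustrate "zero sum, one state: equivalent to LP," here is a hedged sketch that computes a maximin mixed strategy for a one-shot matrix game with scipy.optimize.linprog; the payoff-matrix convention (rows belong to the maximizing player) is an assumption.

import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A):
    """Maximin strategy for the row player of a zero-sum matrix game.

    A[i, j] is the row player's payoff for row i against column j.
    Variables are (x_1..x_m, v): maximize v subject to
        sum_i x_i A[i, j] >= v  for every column j,
        sum_i x_i = 1,  x >= 0.
    """
    m, n = A.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                  # minimize -v  <=>  maximize v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])     # v - sum_i x_i A[i, j] <= 0
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])                        # strategy sums to 1
    bounds = [(0, None)] * m + [(None, None)]     # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]                   # mixed strategy, game value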
29. Beyond
- RL in stochastic games:
  - convergent, polytime algorithms for zero sum.
  - the objective is still debated for general sum.
- Partially observable stochastic games:
  - Oy. Modeling, complexity, and approximation work. Relevant but tough.
- RL in partially observable SGs:
  - Well, that's life, isn't it?
30. Discussion
- Complexity tends to rise when handling more uncertainty.
- Use a model that is appropriate for your problem, not one that is too rich!
- Very active area of research in machine learning, AI, and planning.
- Let's be precise about what we are solving!