Title: Density Estimation and MDPs
1. Density Estimation and MDPs
- Ronald Parr
- Stanford University
Joint work with Daphne Koller, Andrew Ng (U.C. Berkeley) and Andres Rodriguez
2. What we aim to do
- Plan for/control complex systems
- Challenges
- very large state spaces
- hidden state information
- Examples
- Drive a car
- Ride a bicycle
- Operate a factory
- Contribution: novel uses of density estimation
3. Talk Outline
- (PO)MDP overview
- Traditional (PO)MDP solution methods
- Density Estimation
- (PO)MDPs meet density estimation
- Reinforcement learning for PO domains
- Dynamic programming w/function approx.
- Policy search
- Experimental Results
4. The MDP Framework
- Markov Decision Process
- Stochastic state transitions
- Reward (or cost) function
(Figure: transition diagram with two actions from a state, stochastic outcomes with probabilities 0.7/0.3 for one action and 0.5/0.5 for the other, and rewards of 5 and -1 on the resulting states.)
5. MDPs
- Uncertain action outcomes
- Cost minimization (reward maximization)
- Examples
- Ride bicycle
- Drive car
- Operate factory
- Assume that full state is known
6. Value Determination in MDPs
- Compute the expected, discounted value of a plan: V = E[ Σ_t γ^t R(s_t) ]
- s_t: random variable for the state at time t
- γ: discount factor
- R(s_t): reward for state s_t
e.g. expected value of factory output
7. Dynamic Programming (DP)
- Successive approximations
- Fixed point is V
- O(|S|^2) per iteration
- For n state variables, |S| = 2^n
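A minimal sketch (not from the talk) of the successive-approximation backup for value determination, using an illustrative 3-state chain with made-up transitions and rewards:

```python
import numpy as np

# Illustrative 3-state Markov chain under a fixed policy.
P = np.array([[0.7, 0.3, 0.0],   # P[s, s'] = transition probability
              [0.0, 0.5, 0.5],
              [0.2, 0.0, 0.8]])
R = np.array([5.0, -1.0, 0.0])   # reward for each state
gamma = 0.9                      # discount factor

V = np.zeros(3)                  # start from an arbitrary guess
for _ in range(1000):
    V_new = R + gamma * P @ V    # one DP backup: O(|S|^2) work
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V)                         # the fixed point is the true value function
```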
8. Partial Observability
- Examples
- road hazards
- intentions of other agents
- status of equipment
- Complication: the true state is not known
- state depends upon history
- information state = dist. over true states
9. DP for POMDPs
- DP still works, but
- s is now a belief state, i.e. a prob. dist.
- For n state variables, a dist. over 2^n states
- Representing s exactly is difficult
- Representing V exactly is nightmarish
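A minimal sketch of exact belief-state updating on a tiny, made-up POMDP; this is precisely the step that becomes intractable when the underlying state space has 2^n states:

```python
import numpy as np

# Tiny POMDP: 3 hidden states, 2 observations, one action shown.
T = np.array([[0.8, 0.2, 0.0],    # T[s, s'] = P(s' | s, a)
              [0.1, 0.7, 0.2],
              [0.0, 0.3, 0.7]])
O = np.array([[0.9, 0.1],         # O[s', o] = P(o | s')
              [0.5, 0.5],
              [0.2, 0.8]])

def belief_update(b, obs):
    """One step of Bayesian filtering: predict, then condition on obs."""
    predicted = b @ T                 # P(s' | history, a)
    unnorm = predicted * O[:, obs]    # multiply in the observation likelihood
    return unnorm / unnorm.sum()      # renormalize to a distribution

b = np.array([1 / 3, 1 / 3, 1 / 3])   # uncertain initial location
b = belief_update(b, obs=1)
print(b)
```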
10. Density Estimation
- Efficiently represent dist. over many vars.
- Broadly interpreted, includes
- Statistical learning
- Bayes net learning
- Mixture models
- Tracking
- Kalman filters
- DBNs
11. Example: Dynamic Bayesian Networks
(Figure: a DBN over state variables X, Y, Z, showing how each variable at time t+1 depends on a small set of variables at time t.)
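For concreteness, a minimal sketch of how a factored (DBN) transition model might be represented, assuming binary variables X, Y, Z with illustrative parent sets and CPT entries:

```python
# A factored (DBN) transition model over binary variables X, Y, Z:
# each variable at time t+1 depends only on a few variables at time t.
# The parent sets and CPT numbers below are illustrative.
parents = {"X": ["X"], "Y": ["X", "Y"], "Z": ["Y", "Z"]}

# cpt[var][parent assignment tuple] = P(var' = 1 | parents at time t)
cpt = {
    "X": {(0,): 0.1, (1,): 0.9},
    "Y": {(0, 0): 0.2, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95},
    "Z": {(0, 0): 0.3, (0, 1): 0.7, (1, 0): 0.4, (1, 1): 0.8},
}

def transition_prob(state, next_state):
    """P(next_state | state) as a product of per-variable CPT entries."""
    p = 1.0
    for var in ("X", "Y", "Z"):
        pa = tuple(state[v] for v in parents[var])
        p1 = cpt[var][pa]
        p *= p1 if next_state[var] == 1 else 1.0 - p1
    return p

print(transition_prob({"X": 1, "Y": 0, "Z": 1}, {"X": 1, "Y": 1, "Z": 0}))
```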
12. Problem: Variable Correlation
(Figure: unrolling the DBN over times t0, t1, t2; over time the state variables become correlated.)
13. Solution: the BK algorithm
- Break the distribution into smaller clusters
- Approximation/marginalization step
- Exact step
- With mixing and bounded projection error, the total error is bounded
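A minimal sketch of the BK projection (marginalization) step on a made-up three-variable joint, using illustrative clusters {X, Y} and {Z}:

```python
import numpy as np

# The exact step produces a joint over all variables; the BK projection step
# replaces it with a product of marginals over small clusters.
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()                      # exact joint P(X, Y, Z)

marg_xy = joint.sum(axis=2)               # marginalize out Z    -> P(X, Y)
marg_z = joint.sum(axis=(0, 1))           # marginalize out X, Y -> P(Z)

# BK approximation: product of the cluster marginals.
approx = marg_xy[:, :, None] * marg_z[None, None, :]
print(np.abs(approx - joint).sum())       # projection error for this one step
```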
14. Density Estimation meets POMDPs
- Problems
- Representing the state
- Representing the value function
- Solution
- Use the BK algorithm for state estimation
- Use reinforcement learning for V (e.g. Parr & Russell 95, Littman et al. 95)
- Represent V with a neural net
- Rodriguez, Parr and Koller, NIPS 99
15. Approximate POMDP RL
(Figure: the environment emits observations O and rewards R; a belief-state estimator maintains the belief; a reinforcement learner performs action selection over beliefs and sends actions A back to the environment.)
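A minimal sketch of the control loop in the diagram above; `env`, `update_belief`, and `q_values` are hypothetical placeholders for the environment, the (factored) belief-state estimator, and the neural-net value estimates:

```python
import numpy as np

def run_episode(env, update_belief, q_values, n_actions, epsilon=0.1, horizon=100):
    """Sketch of the loop: observation -> belief update -> RL action selection.

    Hypothetical interfaces:
      env.reset() -> obs; env.step(a) -> (obs, reward, done)
      update_belief(belief, action, obs) -> new (factored) belief
      q_values(belief) -> array of Q estimates, one per action (e.g. a neural net)
    """
    obs = env.reset()
    belief = update_belief(None, None, obs)       # initial belief from first obs
    total_reward = 0.0
    for _ in range(horizon):
        q = q_values(belief)
        if np.random.rand() < epsilon:            # exploration
            action = np.random.randint(n_actions)
        else:
            action = int(np.argmax(q))
        obs, reward, done = env.step(action)
        belief = update_belief(belief, action, obs)
        total_reward += reward
        if done:
            break
    return total_reward
```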
16. Navigation Problem
- Uncertain initial location
- 4-way sonar
- Need for information gathering actions
- 60 states (15 positions x 4 orientations)
17. Navigation Results
18. Machine Maintenance
(Figure: machines producing widgets.)
- 4 maintenance states per machine
- Reward for output
- Components degrade, reducing output
- Repair requires expensive total disassembly
19. Maintenance Results
20. Maintenance Results (Turnerized)
Decomposed NN has fewer inputs, learns faster
21. Summary
- Advances
- Use of factored belief state
- Scales POMDP RL to larger state spaces
- Limitations
- No help with regular MDPs
- Can be slow
- No convergence guarantees
22. Goal: DP with guarantees
- Focus on value determination in MDPs
- Efficient exact DP step
- Efficient projection (function approximation)
- Non-expansive function approximation
(convergence, bounded error)
23. A Value Determination Problem
(Figure: a network of machines M1-M6 feeding one another; reward for output.)
Machines require their predecessors to work. They go offline/online stochastically.
24. Efficient, Stable DP
Idea: restrict the class of value functions.
(Figure: iteration alternating a DP step with value function approximation (VFA), starting from V0.)
VFA: neural network, regression, etc.
Issues: stability, closeness of the result to V, efficiency.
25. Stability
- Naïve function approximation is unstable (Boyan & Moore 95, Bertsekas & Tsitsiklis 96)
- Simple examples where the approximate V diverges
- Weighted linear regression is stable (Nelson 1958, Van Roy 1998)
- Weights must correspond to the stationary distribution ρ of the policy
26. Stable Approximate DP
Iterate: exact DP step, then weighted linear regression.
The error in the final result is bounded in terms of the lowest error achievable by the approximation and the effective contraction rate.
27. Efficiency Issues
DP and projection consider every state individually.
Iterate: exact DP step, then weighted linear regression.
Must do these steps efficiently!
28. Compact Models → Compact V?
(Figure: DBN over state variables X, Y, Z from time t to t+1.)
Suppose R = 1 if Z = T.
Start with a uniform value function V_{t+1}.
29. Value Function Growth
(Figure: one DP backup from V_{t+1} to V_t, with R = 1 when Z = T.)
The reward depends upon Z.
30. Value Function Growth
(Figure: the next DP backup, from V_t to V_{t-1}.)
Z depends upon the previous Y and Z.
31. Value Function Growth
Eventually, V has 2^n partitions.
See Boutilier, Dearden & Goldszmidt (IJCAI 95) for a method that avoids the worst case when possible.
32. Compact Reward Functions
(Figure: the total reward R is a sum of component rewards R1, R2, ..., each depending on only a small subset of the state variables X, U, V, W.)
33. Basis Functions
- V ≈ w1·h1(X1) + w2·h2(X2) + ...
- Use compact basis functions
- h(Xi): a basis function defined over the variables in Xi
Examples:
- h = function of the status of subgoals
- h = function of the inventory in different stores
- h = function of the status of machines in a factory
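A minimal sketch of such a linear value function; the state representation and the two basis functions are illustrative:

```python
import numpy as np

# A linear value function V(s) ≈ sum_i w_i * h_i(s), where each basis
# function looks only at a small subset of the state variables.
def h1(state):            # depends only on machine M1's status
    return 1.0 if state["M1"] == "working" else 0.0

def h2(state):            # depends only on the inventory at one store
    return min(state["inventory"], 10) / 10.0

basis = [h1, h2]
weights = np.array([2.0, 3.5])      # obtained by projection / learning

def value(state):
    return float(weights @ np.array([h(state) for h in basis]))

print(value({"M1": "working", "inventory": 4}))
```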
34. Efficient DP
Observe that DP is a linear operation, so it can be applied to each basis function separately.
DP maps a basis function over X1 to a function over Y1 = X1 ∪ parents(X1).
35. Growth of Basis Functions
Suppose h1 = f(Y); then DP(h1) = f(X, Y).
Each basis function is replaced by a function with a potentially larger domain.
(Figure: DBN over X, Y, Z from time t to t+1.)
Need to control the growth in function domains.
36. Projection
(Figure: each DP step is followed by a projection step P.)
Regression projects back into the original space.
37. Efficient Projection
Want to project all points (one per state).
A has 2^n rows (one per state) and k columns (one per basis function), with entries h_j(s_i).
The projection matrix (A^T A)^{-1} is only k x k.
38. Efficient dot product
Need to compute dot products of basis functions, h_i · h_j = Σ_s h_i(s) h_j(s).
Observe: the number of unique terms in the summation is the product of the number of unique terms in the bases, |X_i| x |X_j|.
Complexity of the dot product is O(|X_i| x |X_j|).
Compute the other products needed for the regression using the same observation.
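A minimal sketch of the observation: the dot product of two compact basis functions over all 2^n states can be computed by enumerating only assignments to the union of their variable sets (binary variables and uniform weighting assumed; the functions below are made up):

```python
from itertools import product

n = 20                                    # total number of binary state variables

# Two basis functions over small, overlapping variable subsets (illustrative).
X_i, X_j = ("A", "B"), ("B", "C")
def h_i(a): return 1.0 + a["A"] + 2 * a["B"]
def h_j(a): return 3.0 * a["B"] - a["C"]

def compact_dot(h1, vars1, h2, vars2, n_total):
    """Sum of h1(s) * h2(s) over all 2^n states, enumerating only X_i ∪ X_j."""
    union = sorted(set(vars1) | set(vars2))
    n_free = n_total - len(union)         # variables neither function depends on
    total = 0.0
    for bits in product((0, 1), repeat=len(union)):
        assignment = dict(zip(union, bits))
        total += h1(assignment) * h2(assignment)
    return total * (2 ** n_free)          # each assignment occurs 2^n_free times

print(compact_dot(h_i, X_i, h_j, X_j, n))
```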
39. Want: Weighted Projection
- Stability required weighted regression
- But the stationary dist. ρ may not be compact
- Boyen-Koller approximation (UAI 98)
- Provides a factored approximation of ρ with bounded error
- Dot product → weighted dot product
40. Weighted dot products
Need to compute weighted dot products, Σ_s ρ(s) h_i(s) h_j(s).
If ρ is factored and the basis functions are compact, let Z_ij be all the variables in the enclosing BK clusters of X_i and X_j; the weighted dot product then requires summing only over assignments to Z_ij.
41. Stability
Idea: if the error in the approximate stationary distribution is not too large, then we're OK.
Theorem: if the approximate stationary distribution is close enough to ρ, then the weighted-projection DP iteration still converges, with bounded error.
42. Approximate DP summary
- Get compact, approx. stationary distribution
- Start with linear value function
- Repeat until convergence
- Exact DP replaces bases with larger fns.
- Project value function back into linear space
- Efficient because of
- Factored transition model
- Compact basis functions
- Compact approx. stationary distribution
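A minimal sketch of the whole loop on a small explicit model, with the DP step and the weighted regression written as ordinary matrix operations; in the talk both steps are instead computed from the factored structure, and all numbers here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, k, gamma = 16, 3, 0.9
A = rng.random((n_states, k))                 # A[s, i] = h_i(s), basis functions
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)             # transition matrix of the policy
R = rng.random(n_states)

# Stationary distribution of P (in the talk: a factored BK approximation).
evals, evecs = np.linalg.eig(P.T)
rho = np.abs(np.real(evecs[:, np.argmax(np.real(evals))]))
rho /= rho.sum()
D = np.diag(rho)

w = np.zeros(k)
for _ in range(500):
    backup = R + gamma * P @ (A @ w)          # exact DP step on the linear V
    # rho-weighted linear regression of the backup onto the basis functions
    w_new = np.linalg.solve(A.T @ D @ A, A.T @ D @ backup)
    if np.max(np.abs(w_new - w)) < 1e-10:
        break
    w = w_new
print(w)                                      # weights of the approximate V
```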
43. Sample Revisited
(Figure: the network of machines M1-M6; reward for output.)
Machines require their predecessors to work; they fail stochastically.
44. Results: Stability and Weighted Projection
(Plot: weighted sum of squared errors (0 to 0.5) vs. number of basis functions added (2 to 10), comparing unweighted and weighted projection.)
45. Approximate vs. Exact V
(Plot: value (0 to 3.5) for each state (0 to 60), comparing the exact and approximate value functions.)
46. Summary
- Advances
- Stable, approximate DP for large models
- Efficient DP, projection steps
- Limitations
- Prediction only, no policy improvement
- non-trivial to add policy improvement
- Policy representation may grow
47. Direct Policy Search
Idea: search a space of smoothly parameterized policies.
- Policy: a smooth function of parameters θ
- Value function V(θ): the value of the policy w.r.t. the starting dist.
See Williams 83, Marbach & Tsitsiklis 98, Baird & Moore 99, Meuleau et al. 99, Peshkin et al. 99, Konda & Tsitsiklis 00, Sutton et al. 00.
48. Policy Search with Density Estimation
- Typically compute value gradient
- Works for both MDPs and POMDPs
- Gradient computation methods
- Single trajectories
- Exact (small models)
- Value function
- Our approach
- Take all trajectories simultaneously
- Ng, Parr & Koller, NIPS 99
49. Policy Evaluation
Idea: model rollout.
(Figure: start from the initial dist., propagate an approximate dist. through the model step by step, project it at each step, and read off the expected cost.)
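A minimal sketch of policy evaluation by density propagation on a small, made-up discrete model; the `project` step stands in for forcing the distribution into a compact family:

```python
import numpy as np

# Evaluate a policy by propagating an (approximate) state distribution through
# the model and summing expected rewards, instead of sampling trajectories.
rng = np.random.default_rng(2)
n_states, horizon, gamma = 8, 50, 0.95
P_theta = rng.random((n_states, n_states))          # transitions under the policy
P_theta /= P_theta.sum(axis=1, keepdims=True)
R = rng.random(n_states)

def project(phi):
    """Stand-in for the projection step (here it only renormalizes).
    In the talk this is where the distribution is forced into a compact
    family, e.g. a product of cluster marginals or a Gaussian."""
    return phi / phi.sum()

phi = np.full(n_states, 1.0 / n_states)              # initial distribution
value_estimate = 0.0
for t in range(horizon):
    value_estimate += (gamma ** t) * float(phi @ R)  # expected reward at time t
    phi = project(phi @ P_theta)                     # propagate, then project
print(value_estimate)
```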
50. Rollout-Based Policy Search
Idea: estimate V(θ) by rollout and search the parameter space, e.g. using simplex search.
Theorem (informal): if the rollout estimate is uniformly close to V(θ), then optimizing the estimate yields a policy whose true value is close to the best in the class.
N.B. Given density estimation, this turns policy search into simple function maximization.
51. Simple BAT net
52. Simplex Search Results
53. Gradient Ascent
Simplex search is weak; better to use gradient ascent.
Assume a differentiable model, approximation step, and estimated density.
Combined propagation/estimation operator.
54. Apply the Chain Rule
Rollout has a recursive formulation: each projected dist. is a function of the previous one and of θ.
Differentiate the value estimate by applying the chain rule through the recursion.
c.f. neural networks (backpropagation).
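A minimal sketch of the chain rule through a rollout, using a toy one-dimensional summary of the propagated distribution (c.f. backpropagation through time); the dynamics, reward, and parameter value are illustrative:

```python
gamma, horizon, theta = 0.95, 50, 0.3

def step(m, theta):        # m_{t+1} = f(m_t, theta): toy propagation operator
    return 0.9 * m + theta

def reward(m):             # expected reward as a function of the summary
    return 2.0 * m - m ** 2

m, dm_dtheta = 0.0, 0.0    # state summary and its derivative w.r.t. theta
value, dvalue_dtheta = 0.0, 0.0
for t in range(horizon):
    value += (gamma ** t) * reward(m)
    dvalue_dtheta += (gamma ** t) * (2.0 - 2.0 * m) * dm_dtheta  # dr/dm * dm/dtheta
    # chain rule through the recursion: dm_{t+1}/dtheta = (df/dm) dm/dtheta + df/dtheta
    dm_dtheta = 0.9 * dm_dtheta + 1.0
    m = step(m, theta)
print(value, dvalue_dtheta)
```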
55. What if the full model is not available?
Assume a generative model: a black box that takes a state and an action and returns a sampled next state.
56. Rollout with sampling
- Generate samples from the fitted dist.
- Pass them through the generative model to get samples from the next dist.
- Weight the samples
- Fit a new dist. to the weighted samples
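A minimal sketch of the sample-propagate-fit loop with a 1-D Gaussian as the fitted family; the generative model and all numbers are stand-ins, and the weighting step is shown with uniform weights:

```python
import numpy as np

rng = np.random.default_rng(3)

def generative_model(state, theta):
    """Black-box stand-in: returns one sampled next state (1-D, illustrative)."""
    return 0.9 * state + theta + 0.1 * rng.normal()

# The fitted distribution is a 1-D Gaussian (mean, std).
mean, std = 0.0, 1.0
theta, horizon, n_samples = 0.5, 20, 300

for t in range(horizon):
    samples = mean + std * rng.normal(size=n_samples)     # sample from fitted phi_t
    next_samples = np.array([generative_model(s, theta) for s in samples])
    weights = np.full(n_samples, 1.0 / n_samples)         # weighting step (uniform
                                                          # here; the talk reweights)
    mean = float(weights @ next_samples)                  # fit phi_{t+1} to the
    std = float(np.sqrt(weights @ (next_samples - mean) ** 2)) + 1e-8  # weighted samples
print(mean, std)
```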
57. Gradient Ascent + Sampling
If model fitting is differentiable, why not proceed as before?
Problem: the samples are from the wrong distribution.
58. Thought Experiment
Consider a new θ.
Redo the estimation, reweighting the old samples and holding everything else fixed.
59. Notes on reweighting
- No samples are actually reused!
- Used for differentiation only
- Accurate, since differentiation considers an infinitesimal change in θ
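A minimal sketch of the reweighting idea: samples drawn under θ are reweighted by the ratio of densities under a perturbed θ', making the refit (and hence the value estimate) a differentiable function of θ' without drawing new samples; the Gaussian transition density and the finite-difference check are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

def log_density(x, theta):
    """Log-density of the (illustrative) transition model N(theta, 1)."""
    return -0.5 * (x - theta) ** 2

theta = 0.5
samples = theta + rng.normal(size=1000)        # drawn once, under theta

def reweighted_mean(theta_new):
    """Refit (here: a weighted mean) using the OLD samples, reweighted by the
    ratio of new to old densities; differentiable in theta_new."""
    logw = log_density(samples, theta_new) - log_density(samples, theta)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return float(w @ samples)

# The samples are "reused" only for differentiation, as in the slide above.
eps = 1e-4
grad = (reweighted_mean(theta + eps) - reweighted_mean(theta - eps)) / (2 * eps)
print(grad)    # approximately 1: the fitted mean shifts one-for-one with theta
```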
60. Bicycle Example
- Bicycle simulator from Randlov & Astrom 98
- 9 actions for combinations of lean and torque
- 6-dimensional state + absorbing goal state
- Fitted a 6D multivariate Gaussian
- Used a horizon of 200 steps, 300 samples/step
- softmax action selection
- Achieved results comparable to Randlov & Astrom
- 5 km vs. 7 km for good trials
- 1.5 km vs. 1.7 km for best runs
61. Conclusions
- 3 new uses for density estimation in (PO)MDPs
- POMDP RL
- Function approx. with density estimation
- Structured MDPs
- Value determination with guarantees
- Policy search
- Search space of parameterized policies