Density Estimation and MDPs

Transcript and Presenter's Notes

Title: Density Estimation and MDPs


1
Density Estimation and MDPs
  • Ronald Parr
  • Stanford University

Joint work with Daphne Koller, Andrew Ng (U.C.
Berkeley) and Andres Rodriguez
2
What we aim to do
  • Plan for/control complex systems
  • Challenges
  • very large state spaces
  • hidden state information
  • Examples
  • Drive a car
  • Ride a bicycle
  • Operate a factory
  • Contribution: novel uses of density estimation

3
Talk Outline
  • (PO)MDP overview
  • Traditional (PO)MDP solution methods
  • Density Estimation
  • (PO)MDPs meet density estimation
  • Reinforcement learning for PO domains
  • Dynamic programming w/function approx.
  • Policy search
  • Experimental Results

4
The MDP Framework
  • Markov Decision Process
  • Stochastic state transitions
  • Reward (or cost) function

[Figure: small example MDP with two actions, stochastic transition probabilities (0.7/0.3 and 0.5/0.5), and rewards of 5 and -1]
5
MDPs
  • Uncertain action outcomes
  • Cost minimization (reward maximization)
  • Examples
  • Ride bicycle
  • Drive car
  • Operate factory
  • Assume that full state is known

6
Value Determination in MDPs
  • Compute expected, discounted value of plan
  • s_t - random variable for the state at time t
  • γ - discount factor
  • R(s_t) - reward for state s_t

e.g. Expected value of factory output
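Written out, the quantity being computed is the standard expected discounted return (the notation here is assumed; the slide's own formula was not preserved):

    V^{\pi}(s) \;=\; E\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\middle|\; s_0 = s\right]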
7
Dynamic Programming (DP)
  • Successive approximations
  • Fixed point is V
  • O(|S|^2) per iteration
  • For n state variables, |S| = 2^n
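A minimal sketch of this successive-approximation step for a fixed policy, assuming an explicit transition matrix P and reward vector R (names and setup are illustrative, not from the slides); each sweep is the O(|S|^2) backup V <- R + γ P V:

    import numpy as np

    def value_determination(P, R, gamma, tol=1e-8):
        """Successive approximation for a fixed policy: V <- R + gamma * P @ V.
        P: (S, S) transition matrix, R: (S,) reward vector.
        Each sweep touches every (s, s') pair, hence O(|S|^2) work."""
        V = np.zeros(len(R))
        while True:
            V_new = R + gamma * P @ V          # one exact DP (backup) step
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new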

8
Partial Observability
  • Examples
  • road hazards
  • intentions of other agents
  • status of equipment
  • Complication: the true state is not known
  • the state estimate depends upon the history
  • information state = a distribution over true states

9
DP for POMDPs
  • DP still works, but
  • s is now a belief state, i.e. prob. dist.
  • For n state variables, a dist. over |S| = 2^n states
  • Representing s exactly is difficult
  • Representing V exactly is nightmarish
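The belief state referred to here is maintained by the usual Bayes-filter update (notation assumed); after taking action a and receiving observation o:

    b'(s') \;\propto\; O(o \mid s', a) \sum_{s} P(s' \mid s, a)\, b(s)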

10
Density Estimation
  • Efficiently represent dist. over many vars.
  • Broadly interpreted, includes
  • Statistical learning
  • Bayes net learning
  • Mixture models
  • Tracking
  • Kalman filters
  • DBNs

11
Example Dynamic Bayesian Networks
[Figure: DBN with state variables X, Y, Z in time slices t and t+1]
12
Problem: Variable Correlation
[Figure: variables become increasingly correlated across time slices t0, t1, t2]
13
Solution: the BK algorithm
Break the belief state into smaller clusters.
Alternate an exact propagation step with an approximation/marginalization step.
With mixing, a bounded projection error implies that the total
error is bounded.
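A compact way to state the approximation/marginalization step (cluster notation assumed): the exactly propagated belief is replaced by the product of its marginals over the chosen clusters,

    \hat{b}(X_1, \dots, X_n) \;=\; \prod_{c} \hat{b}(\mathbf{X}_c)

where each factor is the marginal of the exact one-step update over cluster c.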
14
Density Estimation meets POMDPs
  • Problems
  • Representing state
  • Representing value function
  • Solution
  • Use BK algorithm for state estimation
  • Use reinforcement learning for V (e.g. Parr &
    Russell 95, Littman et al. 95)
  • Represent V with neural net
  • Rodriguez, Parr and Koller, NIPS 99

15
Approximate POMDP RL
[Figure: the environment sends observations (O) and rewards (R) to a belief state estimation module; the reinforcement learner selects actions (A) from the belief state and sends them back to the environment]
16
Navigation Problem
  • Uncertain initial location
  • 4-way sonar
  • Need for information gathering actions
  • 60 states (15 positions x 4 orientations)

17
Navigation Results
18
Machine Maintenance
[Figure: machines producing widgets]
4 maintenance states per machine. Reward for output. Components degrade, reducing output. Repair requires an expensive total disassembly.
19
Maintenance Results
20
Maintenance Results (Turnerized)
Decomposed NN has fewer inputs, learns faster
21
Summary
  • Advances
  • Use of factored belief state
  • Scales POMDP RL to larger state spaces
  • Limitations
  • No help with regular MDPs
  • Can be slow
  • No convergence guarantees

22
Goal: DP with guarantees
  • Focus on value determination in MDPs
  • Efficient exact DP step
  • Efficient projection (function approximation)
  • Non-expansive function approximation
    (convergence, bounded error)

23
A Value Determination Problem
[Figure: a network of machines M1 through M6; reward for output]
Machines require their predecessors to work. They go offline/online stochastically.
24
Efficient, Stable DP
Idea: Restrict the class of value functions.
[Figure: alternate DP steps with value function approximation (VFA), starting from V0]
VFA: neural network, regression, etc.
Issues: stability, closeness of the approximation to V, efficiency
25
Stability
  • Naïve function approximation is unstable (Boyan &
    Moore 95, Bertsekas & Tsitsiklis 96)
  • Simple examples where the approximate V diverges
  • Weighted linear regression is stable (Nelson
    1958, Van Roy 1998)
  • Weights must correspond to the stationary
    distribution ρ of the policy
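Concretely, the stable projection is weighted least squares with weights taken from the stationary distribution (matrix notation assumed): with A the matrix of basis values and D = diag(ρ),

    \Pi_{\rho} V \;=\; A \left(A^{\top} D A\right)^{-1} A^{\top} D\, V

which is an orthogonal projection, and hence a non-expansion, in the ρ-weighted norm.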

26
Stable Approximate DP
[Diagram: alternate the exact DP step with weighted linear regression (projection)]
The error in the final result is bounded in terms of the lowest error possible within the function class and the effective contraction rate.
27
Efficiency Issues
Naïvely, the DP and projection steps consider every state individually.
[Diagram: a DP step followed by weighted linear regression]
Must do these steps efficiently!!!
28
Compact Models ⇒ Compact V?
[Figure: DBN over state variables X, Y, Z in time slices t and t+1]
Suppose R = 1 if Z = T.
Start with a uniform value function V_{t+1}.
29
Value Function Growth
[Figure: one DP step, from V_{t+1} back to V_t, over the XYZ model with R = 1]
Reward depends upon Z
30
Value Function Growth
[Figure: another DP step, from V_t back to V_{t-1}]
Z depends upon the previous Y and Z
31
Value Function Growth
Eventually, V has 2^n partitions.
[Figure: repeated DP steps]
See Boutilier, Dearden & Goldszmidt (IJCAI 95) for a method that avoids this worst case when possible.
32
Compact Reward Functions
[Figure: the total reward R is a sum of local rewards R1, R2, ..., each defined over a small set of variables such as X, U, V, W]
33
Basis Functions
  • V ≈ w1·h1(X1) + w2·h2(X2) + ...
  • Use compact basis functions
  • h(Xi): a basis function defined over the vars in Xi

Examples: h = function of the status of subgoals; h = function of the inventory in different stores; h = function of the status of machines in a factory
34
Efficient DP
Observe that DP is a linear operation, so it can be applied to each basis function separately.
[Figure: DP applied to each of the basis functions h1, h2, ...]
Y1 = X1 ∪ parents(X1)
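In symbols (operator notation assumed), writing the backup as T V = R + γ P V, the expectation part distributes over the basis functions:

    T\Big(\sum_i w_i h_i\Big) \;=\; R + \gamma \sum_i w_i\, (P h_i), \qquad (P h_i)(\mathbf{x}) \;=\; \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x})\, h_i\big(\mathbf{x}'_{X_i}\big)

where each (P h_i) is a function of only the variables in Y_i = X_i ∪ parents(X_i).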
35
Growth of Basis Functions
[Figure: DBN over X, Y, Z in time slices t and t+1]
Suppose h1 = f(Y); then DP(h1) is some function of X and Y. Each basis function is replaced by a function with a potentially larger domain.
Need to control the growth in function domains.
36
Projection
[Figure: a DP step followed by a projection P]
Regression projects the result back into the original space.
37
Efficient Projection
Want to project all 2^n points (states).
A is the 2^n × k matrix of basis function values for k basis functions, with rows (h1(s1), h2(s1), ...), (h1(s2), h2(s2), ...), and so on.
The projection matrix (A^T A)^{-1} is only k × k.
38
Efficient dot product
Need to compute the dot products between basis functions (the entries of A^T A).
Observe: the no. of unique terms in the summation is the product of the no. of unique terms in the bases, |Xi| × |Xj|.
Complexity of the dot product is O(|Xi| × |Xj|).
The other quantities needed are computed using the same observation.
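A minimal sketch of that observation for binary state variables (data structures and names are illustrative): each basis function depends only on its own small variable set, so the dot product over all 2^n states reduces to a sum over assignments of the union of the two domains, scaled by the number of states that agree with each assignment.

    from itertools import product

    def basis_dot_product(h_i, dom_i, h_j, dom_j, n):
        """Dot product of two compact basis functions over all 2**n binary states.
        h_i, h_j take a dict {var: 0/1} over their own domain; dom_i, dom_j list
        the variables each depends on; n is the total number of state variables."""
        joint = sorted(set(dom_i) | set(dom_j))
        # every assignment to the remaining variables contributes identically
        multiplier = 2 ** (n - len(joint))
        total = 0.0
        for values in product([0, 1], repeat=len(joint)):
            assign = dict(zip(joint, values))
            total += h_i({v: assign[v] for v in dom_i}) * h_j({v: assign[v] for v in dom_j})
        return multiplier * total

    # e.g. two bases over {Z} and {Y, Z} in a 20-variable model: 4 terms, not 2**20
    value = basis_dot_product(lambda a: a["Z"], ["Z"],
                              lambda a: a["Y"] * a["Z"], ["Y", "Z"], n=20)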
39
Want Weighted Projection
  • Stability requires weighted regression
  • But, the stationary dist. ρ may not be compact
  • Boyen-Koller Approximation (UAI 98)
  • Provides a factored approximation of ρ with bounded error
  • Dot product → weighted dot product

40
Weighted dot products
Need to compute weighted dot products of the form Σ_s ρ(s) h_i(s) h_j(s).
If the approximate stationary distribution is factored and the basis functions are compact, the sum only needs to range over the vars in the enclosing BK clusters.
41
Stability
Idea: If the error in the approximate stationary distribution is not too large, then we're OK.
Theorem: a bound on that error, together with the projection error bound, implies a bound on the error of the final result.
42
Approximate DP summary
  • Get compact, approx. stationary distribution
  • Start with linear value function
  • Repeat until convergence
  • Exact DP replaces bases with larger fns.
  • Project value function back into linear space
  • Efficient because of
  • Factored transition model
  • Compact basis functions
  • Compact approx. stationary distribution
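Putting these steps together, a high-level sketch of the loop in dense matrix form (so the factored, efficient versions of the DP and projection steps are abstracted away; names are illustrative):

    import numpy as np

    def approximate_value_determination(P, R, rho, A, gamma, n_iters=100):
        """Approximate DP with a weighted linear projection.
        P: (S, S) transition matrix, R: (S,) rewards, rho: (S,) approximate
        stationary distribution, A: (S, k) basis function values."""
        D = np.diag(rho)
        proj = np.linalg.solve(A.T @ D @ A, A.T @ D)   # k x S weighted projection
        w = np.zeros(A.shape[1])                       # start with a linear value function
        for _ in range(n_iters):
            backed_up = R + gamma * P @ (A @ w)        # exact DP step on the current linear V
            w = proj @ backed_up                       # project back into the basis space
        return w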

43
Sample Revisited
[Figure: the network of machines M1 through M6; reward for output]
Machines require their predecessors to work; they fail stochastically.
44
Results: Stability and Weighted Projection
[Plot: weighted sum of squared errors vs. number of basis functions added (2 to 10), comparing unweighted and weighted projection]
45
Approximate vs. Exact V
[Plot: value per state (states 0 to 60), comparing the exact and approximate value functions]
46
Summary
  • Advances
  • Stable, approximate DP for large models
  • Efficient DP, projection steps
  • Limitations
  • Prediction only, no policy improvement
  • non-trivial to add policy improvement
  • Policy representation may grow

47
Direct Policy Search
Idea: Search over smoothly parameterized policies.
Policy function: parameterized by θ
Value function (w.r.t. the starting dist.)
See Williams 83, Marbach & Tsitsiklis 98, Baird &
Moore 99, Meuleau et al. 99, Peshkin et al.
99, Konda & Tsitsiklis 00, Sutton et al. 00
48
Policy Search with Density Estimation
  • Typically compute value gradient
  • Works for both MDPs and POMDPs
  • Gradient computation methods
  • Single trajectories
  • Exact (small models)
  • Value function
  • Our approach
  • Take all trajectories simultaneously
  • Ng, Parr & Koller, NIPS 99

49
Policy Evaluation
Idea: Model rollout. Starting from the initial dist., repeatedly propagate an approximate dist. through the model, project it, and read off the cost at each step.
50
Rollout Based Policy Search
Idea: Estimate the value of each candidate policy by rollout, then search the parameter space, e.g. using simplex search.
Theorem: if the estimated value is uniformly close to the true value, then optimizing the estimate reaches a near-optimal policy.
N.B. Given density estimation, this turns
policy search into simple function maximization
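As a toy illustration of that "simple function maximization", a sketch using Nelder-Mead simplex search over policy parameters; estimated_value is a stand-in for the rollout/projection estimate described above (all names are placeholders):

    import numpy as np
    from scipy.optimize import minimize

    def estimated_value(theta):
        # Stand-in for the rollout-based value estimate: the real version would
        # propagate the approximate state distribution under the policy with
        # parameters theta and sum the projected, discounted rewards.
        return -np.sum((theta - 1.0) ** 2)              # dummy objective so the sketch runs

    theta0 = np.zeros(5)                                # initial policy parameters
    result = minimize(lambda th: -estimated_value(th),  # maximize by minimizing the negative
                      theta0, method="Nelder-Mead")     # simplex search
    best_theta = result.x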
51
Simple BAT net
52
Simplex Search Results
53
Gradient Ascent
Simplex search is weak; better to use gradient ascent.
Assume a differentiable model, approximation, and estimated density.
Use a combined propagation/estimation operator.
54
Apply the Chain Rule
The rollout gives a recursive formulation; differentiate it step by step (c.f. backpropagation in neural networks).
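The pattern is the ordinary chain rule through the rollout recursion; writing the combined propagation/estimation operator as φ_{t+1} = G(φ_t, θ) (symbols assumed), the parameter gradient accumulates step by step, just as in backpropagation:

    \frac{\partial \phi_{t+1}}{\partial \theta} \;=\; \frac{\partial G}{\partial \phi}\bigg|_{\phi_t} \frac{\partial \phi_t}{\partial \theta} \;+\; \frac{\partial G}{\partial \theta}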
55
What if full model is not available?
Assume a generative model:
[Figure: a black box that takes a state and an action and returns a sampled next state]
56
Rollout with sampling
[Diagram: generate samples from the current fitted distribution, weight the samples, and fit a new distribution to the weighted samples]
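A minimal sketch of one sample-and-refit step, assuming a black-box generative model step(state, action), a stochastic policy, and a Gaussian fit (names are illustrative; the sample-weighting used on the slide is omitted here):

    import numpy as np

    def propagate_fitted_gaussian(mean, cov, policy, step, n_samples=300, rng=None):
        """One rollout step with sampling: draw states from the current fitted
        Gaussian, push them through the black-box generative model under the
        policy, and fit a new Gaussian to the resulting next states."""
        rng = rng or np.random.default_rng()
        states = rng.multivariate_normal(mean, cov, size=n_samples)
        next_states = np.array([step(s, policy(s, rng)) for s in states])
        return next_states.mean(axis=0), np.cov(next_states, rowvar=False)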
57
Gradient Ascent Sampling
If model fitting is differentiable, why not just differentiate through the whole rollout?
Problem: the samples are from the wrong distribution.
58
Thought Experiment
Consider a new value of the policy parameters.
Redo the estimation, reweighting the old samples; keep everything else the same.
59
Notes on reweighting
  • No samples are actually reused!
  • Used for differentiation only
  • Accurate, since differentiation considers an
    infinitesimal change in the parameters

60
Bicycle Example
  • Bicycle simulator from Randlov & Astrom 98
  • 9 actions for combinations of lean and torque
  • 6-dimensional state, plus an absorbing goal state
  • Fitted to a 6D multivariate Gaussian
  • Used a horizon of 200 steps, 300 samples/step
  • softmax action selection
  • Achieved results comparable to R&A
  • 5 km vs. 7 km for good trials
  • 1.5 km vs. 1.7 km for best runs

61
Conclusions
  • 3 new uses for density estimation in (PO)MDPs
  • POMDP RL
  • Function approx. with density estimation
  • Structured MDPs
  • Value determination with guarantees
  • Policy search
  • Search space of parameterized policies