Title: Kernelized Value Function Approximation for Reinforcement Learning
1. Kernelized Value Function Approximation for Reinforcement Learning
- Gavin Taylor and Ronald Parr
- Duke University
2. Overview
3. Overview - Contributions
- Construct new model-based VFA
- Equate novel VFA with previous work
- Decompose Bellman Error into reward error and transition error
- Use decomposition to understand VFA
4. Outline
- Motivation, Notation, and Framework
- Kernel-Based Models
- Model-Based VFA
- Interpretation of Previous Work
- Bellman Error Decomposition
- Experimental Results and Conclusions
5. Markov Reward Processes
- M = (S, P, R, γ)
- Value V(s): expected, discounted sum of rewards from state s
- Bellman equation (written out below)
- Bellman equation in matrix notation
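The equations referenced above are the standard Bellman equations for a Markov Reward Process, written here for reference (V and R treated as vectors over states and P as the transition matrix in the second form):

```latex
% Bellman equation for an MRP
V(s) = R(s) + \gamma \sum_{s'} P(s' \mid s)\, V(s')

% Matrix notation
V = R + \gamma P V \quad\Longrightarrow\quad V = (I - \gamma P)^{-1} R
```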
6. Kernels
- Properties
- Symmetric function between two points
- PSD kernel matrix K (see the sketch after this list)
- Uses
- Dot-product in high-dimensional space (kernel trick)
- Gain expressiveness
- Risks
- Overfitting
- High computational cost
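As an illustration of these properties (not from the slides), here is a minimal sketch of a Gaussian (RBF) kernel, a symmetric function of two points whose kernel matrix K is PSD; the bandwidth value is arbitrary:

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Gaussian (RBF) kernel: k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

# 50 sampled 2-D states (purely illustrative)
S = np.random.default_rng(0).random((50, 2))

# Kernel matrix K with K[i, j] = k(s_i, s_j)
K = rbf_kernel(S, S)
print(np.allclose(K, K.T))                    # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # PSD (eigenvalues >= 0 up to numerical error)
```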
7. Outline
- Motivation, Notation, and Framework
- Kernel-Based Models
- Model-Based VFA
- Interpretation of Previous Work
- Bellman Error Decomposition
- Experimental Results and Conclusions
8. Kernelized Regression
- Apply kernel trick to least-squares regression
- t: target values
- K: kernel matrix, where K_ij = k(x_i, x_j)
- k(x): column vector, where k(x)_i = k(x, x_i)
- Σ: regularization matrix (prediction formula written below)
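The resulting predictor is the standard kernel ridge-regression form; writing it with the symbols above (using Σ for the regularization matrix is my notational assumption):

```latex
% Regularized, kernelized least-squares prediction at a query point x
\hat{t}(x) = k(x)^{\top} \left( K + \Sigma \right)^{-1} t
```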
9. Kernel-Based Models
- Approximate reward model
- Approximate transition model
- Want to predict k(s′) (not s′ itself)
- Construct matrix K′, where K′_ij = k(s′_i, s_j) (both models written out below)
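A sketch of both models as instances of the kernelized regression above; the regularizer names Σ_R and Σ_P (reward and transition regularization matrices) are my assumption:

```latex
% Approximate reward model: kernel regression with sampled rewards r as targets
\hat{R}(s) = k(s)^{\top} \left( K + \Sigma_R \right)^{-1} r

% Approximate transition model: predict the next kernel values k(s') from s,
% using the rows of K' (row i is k(s'_i)^\top) as regression targets
\widehat{k(s')} = K'^{\top} \left( K + \Sigma_P \right)^{-1} k(s)
```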
10. Model-based Value Function
11. Model-based Value Function (derivation sketched below)
- Unregularized
- Regularized
- Whole state space
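A sketch of where these forms come from, assuming the value function is linear in the kernel values, \hat{V}(s) = k(s)^\top w, and using the approximate models above; the regularized expression is what this setup yields and may differ in detail from the paper's exact statement:

```latex
% Bellman equation at the sampled states with the approximate models
% (\hat{r} = K (K + \Sigma_R)^{-1} r, \quad \hat{K}' = K (K + \Sigma_P)^{-1} K'):
K w = \hat{r} + \gamma \hat{K}' w

% Regularized weights:
w = \left( K - \gamma K (K + \Sigma_P)^{-1} K' \right)^{-1} K (K + \Sigma_R)^{-1} r

% Unregularized (\Sigma_R = \Sigma_P = 0):
\hat{V}(s) = k(s)^{\top} (K - \gamma K')^{-1} r

% Whole state space: evaluate \hat{V}(s) = k(s)^{\top} w at any state s
```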
12. Previous Work
- Kernel Least-Squares Temporal Difference Learning (KLSTD), Xu et al., 2005
- Rederive LSTD, replacing dot products with kernels
- No regularization
- Gaussian Process Temporal Difference Learning (GPTD), Engel et al., 2005
- Model value directly with a GP
- Gaussian Processes in Reinforcement Learning (GPRL), Rasmussen and Kuss, 2004
- Model transitions and value with GPs
- Deterministic reward
13. Equivalency
- Each of these methods matches the model-based VFA for a particular choice of regularization:
- GPTD noise parameter
- GPRL regularization parameter
14. Outline
- Motivation, Notation, and Framework
- Kernel-Based Models
- Model-Based VFA
- Interpretation of Previous Work
- Bellman Error Decomposition
- Experimental Results and Conclusions
15. Model Error
- Error in reward approximation
- Error in transition approximation: expected next kernel values vs. approximate next kernel values (both errors written out below)
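One way to write these two errors explicitly, using notation assumed here (Δ_R and Δ_{K′}):

```latex
% Reward error: true reward minus approximate reward
\Delta_R(s) = R(s) - \hat{R}(s)

% Transition error: expected next kernel values minus approximate next kernel values
\Delta_{K'}(s) = \mathbb{E}_{s' \mid s}\!\left[ k(s') \right] - \widehat{k(s')}
```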
16. Bellman Error
- The Bellman Error is a linear combination of the reward error and the transition error (decomposition sketched below)
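A sketch of the decomposition under the definitions above, using the fact that the model-based value function satisfies \hat{V}(s) = \hat{R}(s) + \gamma\,\widehat{k(s')}^{\top} w at the sampled states (my notation, not the slide's):

```latex
BE(s) = R(s) + \gamma \, \mathbb{E}_{s' \mid s}\!\left[ \hat{V}(s') \right] - \hat{V}(s)
      = \underbrace{\Delta_R(s)}_{\text{reward error}}
        \; + \; \gamma \, \underbrace{\Delta_{K'}(s)^{\top}}_{\text{transition error}} \, w
```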
17. Outline
- Motivation, Notation, and Framework
- Kernel-Based Models
- Model-Based VFA
- Interpretation of Previous Work
- Bellman Error Decomposition
- Experimental Results and Conclusions
18. Experiments
- Version of the two-room problem, Mahadevan and Maggioni, 2006
- Use Bellman Error decomposition to tune regularization parameters (see the sketch below)
[Figure: reward function of the two-room domain]
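A hypothetical sketch of this kind of tuning (not the paper's code): on a toy chain MRP, fit the kernel-based reward and transition models for a grid of regularization values and score each by held-out reward error and next-kernel (transition) error. The toy domain, the shared RBF bandwidth, and all names are my assumptions; the paper's experiments instead compute these errors with the true model.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(X, Y, bw=0.5):
    """Gaussian kernel matrix between two sets of 1-D states."""
    return np.exp(-((X[:, None] - Y[None, :]) ** 2) / (2 * bw ** 2))

# Toy chain MRP samples: noisy drift on [0, 1], reward peaked near s = 1
n = 200
s = rng.random(n)
s_next = np.clip(s + 0.05 + 0.1 * rng.standard_normal(n), 0.0, 1.0)
r = np.exp(-10.0 * (s - 1.0) ** 2)

# Split samples into training and held-out sets
idx = rng.permutation(n)
tr, te = idx[:150], idx[150:]
K_tr = rbf(s[tr], s[tr])        # K_ij  = k(s_i, s_j) on training samples
Kp_tr = rbf(s_next[tr], s[tr])  # K'_ij = k(s'_i, s_j) on training samples
K_te = rbf(s[te], s[tr])        # kernel values of held-out states vs. training states
Kp_te = rbf(s_next[te], s[tr])  # held-out next-kernel values (transition targets)

I = np.eye(len(tr))
grid = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]

# Tune the reward regularizer by held-out reward error
reward_errs = {sig: np.mean((K_te @ np.linalg.solve(K_tr + sig * I, r[tr]) - r[te]) ** 2)
               for sig in grid}
# Tune the transition regularizer by held-out next-kernel (transition) error
trans_errs = {sig: np.mean((K_te @ np.linalg.solve(K_tr + sig * I, Kp_tr) - Kp_te) ** 2)
              for sig in grid}

print("best reward regularizer:    ", min(reward_errs, key=reward_errs.get))
print("best transition regularizer:", min(trans_errs, key=trans_errs.get))
```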
19. Experiments
20. Conclusion
- Novel, model-based view of kernelized RL built around kernel regression
- Previous work differs from the model-based view only in its approach to regularization
- Bellman Error can be decomposed into transition and reward error
- Transition and reward error can be used to tune parameters
21. Thank you!
22. What about policy improvement?
- Wrap policy iteration around kernelized VFA
- Example: KLSPI
- Bellman error decomposition will be policy-dependent
- Choice of regularization parameters may be policy-dependent
- Our results do not apply to SARSA variants of kernelized RL, e.g., GPSARSA
23. What's left?
- Kernel selection
- Kernel selection (not just parameter tuning)
- Varying kernel parameters across states
- Combining kernels (see Kolter and Ng, 2009)
- Computation costs in large problems
- K is n × n for n samples
- Inverting K is expensive
- Role of sparsification and its interaction with regularization
24. Comparing model-based approaches
- Transition model
- GPRL models s′ as a GP
- TP (this work) approximates k(s′) given k(s)
- Reward model
- GPRL: deterministic reward
- TP: reward approximated with regularized, kernelized regression
25. Don't you have to know the model?
- For our experimental graphs: reward and transition errors calculated with the true R, K
- In practice: cross-validation could be used to tune parameters to minimize reward and transition errors
26. Why is the GPTD regularization term asymmetric?
- GPTD is equivalent to TP for a particular choice of regularization term
- That term can be viewed as propagating the regularizer through the transition model
- Is this a good idea?
- Our contribution: tools to evaluate this question
27. What about Variances?
- Variances can play an important role in Bayesian interpretations of kernelized RL
- Can guide exploration
- Can ground regularization parameters
- Our analysis focuses on the mean
- Variances are a valid topic for future work
28. Does this apply to the recent work of Farahmand et al.?
- Not directly
- All methods assume (s, r, s′) data
- Farahmand et al. include next states (s′) in their kernel, i.e., k(s, s′) and k(s′, s′)
- Previous work, and ours, includes only s in the kernel k(s, s)
29. How is This Different from Parr et al., ICML 2008?
- Parr et al. consider linear fixed-point solutions, not kernelized methods
- Equivalence between linear fixed-point methods was fairly well understood already
- Our contribution:
- We provide a unifying view of previous kernel-based methods
- We extend the equivalence between model-based and direct methods to the kernelized case