Reinforcement Learning in MDPs


1
Reinforcement Learning in MDPs by Least-Squares Policy Iteration
Presented by Lihan He, Machine Learning Reading Group, Duke University, 09/16/2005
2
Outline
  • MDP and Q-function
  • Value function approximation
  • LSPI: Least-Squares Policy Iteration
  • Proto-value functions
  • RPI: Representation Policy Iteration

3
Markov Decision Process (MDP)
An MDP is a model M = <S, A, T, R> with
  • a set of environment states S,
  • a set of actions A,
  • a transition function T: S × A × S → [0,1], T(s,a,s') = P(s'|s,a),
  • a reward function R: S × A → ℝ.
A policy is a function π: S → A.
The value function (expected cumulative reward) V^π: S → ℝ satisfies the Bellman equation
V^π(s) = R(s, π(s)) + γ Σ_s' P(s'|s, π(s)) V^π(s')
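To make the Bellman equation concrete, here is a minimal Python sketch (my own illustration, not from the slides) that evaluates a fixed policy on a small hypothetical two-state MDP by solving the linear system (I - γ P_π) V^π = R_π:

import numpy as np

# Hypothetical 2-state, 2-action MDP used only for illustration.
# T[s, a, s'] = P(s'|s,a); R[s, a] = immediate reward.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def evaluate_policy(policy, T, R, gamma):
    # Solve V = R_pi + gamma * P_pi V exactly (model-based policy evaluation).
    n = T.shape[0]
    P_pi = np.array([T[s, policy[s]] for s in range(n)])   # (n, n) transitions under pi
    R_pi = np.array([R[s, policy[s]] for s in range(n)])   # (n,) rewards under pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

policy = np.array([0, 1])                # pi(s) for s = 0, 1
print(evaluate_policy(policy, T, R, gamma))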
4
Markov Decision Process (MDP)
An example of a grid-world environment
5
State-action value function Q
The state-action value function Q^π(s,a) of a policy π is defined over all possible combinations of states and actions; it gives the expected, discounted, total reward obtained when taking action a in state s and following policy π thereafter:
Q^π(s,a) = R(s,a) + γ Σ_s' P(s'|s,a) Q^π(s', π(s'))
Given a policy π, we have one value Q^π(s,a) for each state-action pair.
In matrix form, the Bellman equation above becomes
Q^π = R + γ P Π_π Q^π
where Q^π and R are vectors of size |S||A|, P is a stochastic matrix of size (|S||A| × |S|) with
P((s,a), s') = P(s'|s,a),
and Π_π is the (|S| × |S||A|) matrix that, for each state s', selects the state-action pair (s', π(s')).
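The matrix form can be checked directly in code. The sketch below (again my own, reusing the hypothetical two-state MDP from the earlier sketch) builds P and Π_π explicitly and solves the matrix Bellman equation for Q^π:

import numpy as np

# Same hypothetical 2-state, 2-action MDP as in the earlier sketch.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.7, 0.3]]])   # T[s, a, s'] = P(s'|s,a)
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # R[s, a]
gamma = 0.9

def q_of_policy(policy, T, R, gamma):
    # Solve Q = R + gamma * P * Pi_pi * Q in matrix form.
    nS, nA = R.shape
    P = T.reshape(nS * nA, nS)             # (|S||A| x |S|): P[(s,a), s'] = P(s'|s,a)
    Pi = np.zeros((nS, nS * nA))           # (|S| x |S||A|) policy selector
    for s in range(nS):
        Pi[s, s * nA + policy[s]] = 1.0    # picks the pair (s, pi(s))
    Rvec = R.reshape(nS * nA)
    Q = np.linalg.solve(np.eye(nS * nA) - gamma * P @ Pi, Rvec)
    return Q.reshape(nS, nA)               # Q[s, a] = Q^pi(s, a)

print(q_of_policy(np.array([0, 1]), T, R, gamma))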
6
How does policy iteration work?
Policy iteration alternates two steps: policy evaluation, which solves Q^π = R + γ P Π_π Q^π for the current policy π, and policy improvement, which makes the new policy greedy with respect to Q^π. The evaluation step requires the model P.
For model-free reinforcement learning, we don't have the model P.
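For contrast with the model-free algorithms that follow, here is a brief sketch (mine) of model-based policy iteration built on the exact q_of_policy evaluation from the previous sketch; LSPI will replace the evaluation step with a sample-based estimate.

import numpy as np

def policy_iteration(T, R, gamma, q_of_policy, max_iter=100):
    # Alternate exact policy evaluation (needs the model P) and greedy improvement.
    nS, nA = R.shape
    policy = np.zeros(nS, dtype=int)            # arbitrary initial policy
    for _ in range(max_iter):
        Q = q_of_policy(policy, T, R, gamma)    # policy evaluation
        new_policy = Q.argmax(axis=1)           # policy improvement (greedy in Q)
        if np.array_equal(new_policy, policy):
            break                               # policy is stable
        policy = new_policy
    return policy, Q

# Usage, with T, R, gamma and q_of_policy from the previous sketch:
# policy, Q = policy_iteration(T, R, gamma, q_of_policy)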
7
Value function approximation
Approximate the state-action value function with a linear architecture:
Q^π(s,a) ≈ Σ_j φ_j(s,a) w_j, i.e. Q^π ≈ Φ w,
where Φ is the (|S||A| × k) matrix whose rows are the basis-function values φ(s,a)^T and w is the vector of k parameters. The basis functions are fixed, but arbitrarily selected (non-linear) functions of s and a.
8
Value function approximation
Examples of basis functions:
  • Polynomials: use the indicator function I(a = a_i) to decouple actions so that each action gets its own parameters.
  • Radial basis functions (RBFs)
  • Proto-value functions
  • Other manually designed bases, tailored to the specific problem
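As an illustration (my own sketch, with hypothetical parameters), the following shows polynomial and RBF state features and the indicator-based decoupling that produces φ(s,a):

import numpy as np

def poly_features(s, degree=3):
    # Polynomial features [1, s, s^2, ..., s^degree] of a scalar state s.
    return np.array([s ** i for i in range(degree + 1)], dtype=float)

def rbf_features(s, centers, width=1.0):
    # Radial basis function features of a scalar state s.
    c = np.asarray(centers, dtype=float)
    return np.exp(-(s - c) ** 2 / (2.0 * width ** 2))

def phi(s, a, n_actions, state_features):
    # phi(s, a): the state features are copied into the block belonging to
    # action a and all other blocks stay zero -- the I(a = a_i) decoupling.
    f = state_features(s)
    out = np.zeros(n_actions * f.size)
    out[a * f.size:(a + 1) * f.size] = f
    return out

# Examples with 2 actions:
print(phi(0.5, 1, 2, poly_features))
print(phi(0.5, 0, 2, lambda s: rbf_features(s, centers=[0.0, 0.5, 1.0])))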
9
Value function approximation
Least-Squares Fixed-Point Approximation
Let Φ w^π denote the approximation of Q^π in the space spanned by the basis functions.
Substitute it into the Bellman equation Q^π = R + γ P Π_π Q^π, and remember that the approximation is the projection of its Bellman backup onto the Φ space (projection Φ (Φ^T Φ)^{-1} Φ^T); by projection theory, finally we get
Φ^T (Φ - γ P Π_π Φ) w^π = Φ^T R
10
Least-Squares Policy Iteration
Solving for the parameters w^π is equivalent to solving the linear system
A w^π = b
where A = Φ^T (Φ - γ P Π_π Φ) and b = Φ^T R.
11
Least-Squares Policy Iteration
If samples D = {(s_i, a_i, r_i, s_i')} are drawn from the underlying transition probability, A and b can be learned (in block) as
A = Σ_i φ(s_i, a_i) (φ(s_i, a_i) - γ φ(s_i', π(s_i')))^T
b = Σ_i φ(s_i, a_i) r_i
or incrementally (real-time) as
A^(t+1) = A^(t) + φ(s_t, a_t) (φ(s_t, a_t) - γ φ(s_t', π(s_t')))^T
b^(t+1) = b^(t) + φ(s_t, a_t) r_t
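The accumulation above translates almost directly into code. Below is a minimal sketch (mine) of the sample-based estimate, assuming a basis function phi(s, a) like the one sketched earlier and a list of (s, a, r, s') samples; names and signatures are my own:

import numpy as np

def lstdq(samples, policy, phi, gamma, k):
    # Accumulate the estimates of A and b from samples (s, a, r, s') and solve for w.
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))    # next action chosen by pi
        A += np.outer(f, f - gamma * f_next)    # block update of A
        b += r * f                              # block update of b
    # Least-squares solve; a pseudo-inverse can be used if A is singular.
    return np.linalg.solve(A, b)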
12
Least-Squares Policy Iteration
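The algorithm box from this slide did not survive the transcript. As a stand-in, here is a minimal sketch (mine) of an LSPI-style loop built on the lstdq sketch above: re-estimate w for the greedy policy induced by the current weights until the weights stop changing.

import numpy as np

def lspi(samples, phi, gamma, k, n_actions, eps=1e-6, max_iter=50):
    # Repeat: estimate w for the greedy policy induced by the current weights,
    # until the weight vector stops changing.
    w = np.zeros(k)
    for _ in range(max_iter):
        policy = lambda s, w=w: max(range(n_actions), key=lambda a: phi(s, a) @ w)
        w_new = lstdq(samples, policy, phi, gamma, k)   # from the sketch above
        if np.linalg.norm(w_new - w) <= eps:
            return w_new
        w = w_new
    return w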
13
Proto-Value Function
Proto-value functions are good bases for value
function approximation.
  • No need to design the bases manually
  • The data tell us what the corresponding proto-value functions are
  • Generated from the topology of the underlying state space
  • Do not require estimating the underlying state transition probabilities
  • Capture the intrinsic smoothness constraints that true value functions have

14
Proto-Value Function
1. Graph representation of the underlying state
transition
2. Adjacency matrix A
(example: the adjacency matrix of a small state graph, with entry 1 for each pair of directly connected states and 0 otherwise)
15
Proto-Value Function
3. Combinatorial Laplacian L
L = T - A
where T is the diagonal matrix whose entries are the row sums of the adjacency matrix A.
4. Proto-value functions
Eigenvectors of the combinatorial Laplacian L.
Each eigenvector provides one basis function φ_j(s); combined with the indicator function for action a, we get φ_j(s,a).
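A short sketch (my own, assuming a fully connected grid with 4-neighbour moves) of steps 1-4: build the adjacency matrix of the state graph, form L = T - A, and take the lowest-order eigenvectors as proto-value functions:

import numpy as np

def grid_adjacency(rows, cols):
    # Adjacency matrix of a rows x cols grid with 4-neighbour connectivity.
    n = rows * cols
    A = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                j = r * cols + (c + 1)
                A[i, j] = A[j, i] = 1.0
            if r + 1 < rows:
                j = (r + 1) * cols + c
                A[i, j] = A[j, i] = 1.0
    return A

def proto_value_functions(A, k):
    # Combinatorial Laplacian L = T - A, with T = diag of row sums (degrees);
    # the k lowest-order eigenvectors are the proto-value functions.
    T = np.diag(A.sum(axis=1))
    L = T - A
    _, eigvecs = np.linalg.eigh(L)      # symmetric: eigenvalues sorted ascending
    return eigvecs[:, :k]               # column j is the basis phi_j(s)

basis = proto_value_functions(grid_adjacency(10, 10), k=10)
print(basis.shape)                      # (100, 10): 100 states, 10 bases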
16
Example of proto-value functions: grid world with 1260 states (figure: adjacency matrix and zoomed-in detail)
17
Proto-value functions: low-order eigenvectors as basis functions
18
Optimal value function vs. its approximation using 10 proto-value functions as bases (figures)
19
Representation Policy Iteration (offline)
Input: D, k, γ, ε, π_0(w_0)
1. Construct basis functions
  1. Use sample D to learn a graph that encodes the
    underlying state space topology.
  2. Compute the lowest-order k eigenvectors of the
    combinatorial Laplacian on the graph. The basis
    functions f(s,a) are produced by combining the k
    proto-value functions with indicator function of
    action a.

2. π' ← π_0
3. repeat
   π ← π'
   w ← solution of A w = b estimated from the samples in D under policy π (value function update)
   π'(s) ← argmax_a φ(s,a)^T w (policy update)
   until the change in w between successive iterations is at most ε
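Condensed into code, the offline loop can look like the following sketch (my own, under the assumptions of the earlier sketches: proto_value_functions and lstdq as defined above, and integer state indices on the graph):

import numpy as np

def rpi_offline(samples, A_graph, k, gamma, n_actions, eps=1e-6, max_iter=50):
    # Step 1: basis construction from the graph Laplacian.
    V = proto_value_functions(A_graph, k)           # from the earlier sketch
    def phi(s, a):                                  # combine with the action indicator
        out = np.zeros(k * n_actions)
        out[a * k:(a + 1) * k] = V[s]
        return out
    # Steps 2-3: least-squares policy iteration with these bases.
    w = np.zeros(k * n_actions)
    for _ in range(max_iter):
        policy = lambda s, w=w: max(range(n_actions), key=lambda a: phi(s, a) @ w)
        w_new = lstdq(samples, policy, phi, gamma, k * n_actions)  # earlier sketch
        if np.linalg.norm(w_new - w) <= eps:
            break
        w = w_new
    return w, phi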
20
Representation Policy Iteration (online)
Input: D_0, k, γ, ε, π_0(w_0)
1. Initialization
2. π' ← π^(0)
3. repeat
(a) π^(t) ← π'
(b) Execute π^(t) to collect new data D^(t) = {(s_t, a_t, r_t, s_t')}.
(c) If the new samples D^(t) change the topology of the graph G, compute a new set of basis functions.
(d) w^(t) ← solution of A w = b updated with the new samples (value function update)
(e) π'(s) ← argmax_a φ(s,a)^T w^(t) (policy update)
until the change in w^(t) between successive iterations is at most ε
21
Example: Chain MDP with 50 states; reward 1 at states 10 and 41, 0 otherwise.
Optimal policy: Right in states 1-9 and 26-41; Left in states 11-25 and 42-50.
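For reference, a small sketch (mine) constructing a chain MDP consistent with this slide (50 states, reward 1 at states 10 and 41). The action success probability is an assumption; the slide does not state the noise level.

import numpy as np

def chain_mdp(n_states=50, reward_states=(10, 41), p_success=0.9):
    # Chain MDP: actions 0 = Left, 1 = Right; reward 1 in states 10 and 41
    # (1-indexed as on the slide). p_success is an assumed value, not given here.
    n_actions = 2
    T = np.zeros((n_states, n_actions, n_states))
    R = np.zeros((n_states, n_actions))
    for s in range(n_states):
        left, right = max(s - 1, 0), min(s + 1, n_states - 1)
        T[s, 0, left] += p_success                  # Left succeeds
        T[s, 0, right] += 1.0 - p_success           # ... or slips Right
        T[s, 1, right] += p_success                 # Right succeeds
        T[s, 1, left] += 1.0 - p_success            # ... or slips Left
        if (s + 1) in reward_states:
            R[s, :] = 1.0
    return T, R

T, R = chain_mdp()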
22
Example: Chain MDP, 20 basis functions used. Value function and its approximation at each iteration (figure).
23
Example: Chain MDP, performance comparison (figures): policy L1 error with respect to the optimal policy; steps to convergence.
24
References
M. Lagoudakis and R. Parr, "Least-Squares Policy Iteration." Journal of Machine Learning Research 4 (2003), 1107-1149. -- Gives the LSPI algorithm for reinforcement learning.
S. Mahadevan, "Proto-Value Functions: Developmental Reinforcement Learning." Proceedings of ICML 2005. -- How to build basis functions for the LSPI algorithm.
C. Kwok and D. Fox, "Reinforcement Learning for Sensing Strategies." Proceedings of IROS 2004. -- An application of LSPI.