Title: Least-Squares Methods for Reinforcement Learning
Slide 1: Least-Squares Methods for Reinforcement Learning
Slide 2: Outline
- Introduction
- Least-Squares Method for Reinforcement Learning
- Simulation Example
- Conclusion
Slide 3: Introduction
- Learning From Interaction
- Interact with environment
- Consequences of actions to achieve goals
- No explicit teacher but experience
- Examples
- Chess player in a game
- Someone preparing a meal
- The actions of a gazelle calf after it is born
Slide 4: Introduction
- Characteristics
- Decision making in uncertain environment
- Actions
- Affect the future situation
- Effects cannot be fully predicted
- Goals are explicit
- Use experience to improve performance
Slide 5: Introduction
- What is to be learned
- Mapping from situations to actions
- Maximizes a scalar reward or reinforcement signal
- Learning
- Does not need to be told which actions to take
- Must discover which actions yield the most reward by trying them
Slide 6: Introduction
- Challenge
- Action may affect not only immediate reward but also the next situation, and consequently all subsequent rewards
- Trial and error search
- Delayed reward
Slide 7: Introduction
- Exploration and exploitation
- Exploit what it already knows in order to obtain reward
- Explore in order to make better action selections in the future
- Neither can be pursued exclusively without failing at the task
- Trade-off
Slide 8: Introduction
- Components of an agent
- Policy
- Decision-making function
- Reward (total reward, average reward, discounted reward)
- Good and bad events for the agent
- Value
- Rewards in the long run
- Model of environment
- Behavior of the environment
Slide 9: Introduction
- Markov property and Markov decision processes
- Independence of path: all that matters is contained in the current state signal
- A reinforcement learning task that satisfies the Markov property is called a Markov decision process (MDP)
- Finite Markov decision process (finite MDP)
Slide 10: Introduction
- Three categories of methods for solving the reinforcement learning problem
- Dynamic programming
- Complete and accurate model of the environment
- A full backup operation on each state
- Monte Carlo methods
- A backup for each state based on the entire sequence of observed rewards from that state until the end of the episode
- Temporal-difference learning
- Approximate the optimal value function, and view the approximation as an adequate guide
Slide 11: LS Method for Reinforcement Learning
- For a stochastic dynamic system
  $s_{t+1} = f(s_t, a_t, w_t)$
  where $s_t$ is the current state, $a_t$ is the control decision generated by the policy, and $w_t$ is a disturbance independently sampled from some fixed distribution
- The MDP can be denoted by a quadruple $(S, A, P, R)$: $S$ is the state set, $A$ is the action set, $P(s' \mid s, a)$ is the state transition probability, and $R(s, a)$ denotes the reward function
- The policy is a mapping $\pi: S \rightarrow A$; under a fixed policy $\pi$, the state sequence $\{s_t\}$ is a Markov chain
Slide 12: LS Method for Reinforcement Learning
- For each policy $\pi$, the value function is defined by
  $V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t)) \mid s_0 = s\right]$
- The optimal value function is defined by
  $V^*(s) = \max_\pi V^\pi(s)$
Slide 13: LS Method for Reinforcement Learning
- The optimal action can be generated through
  $a^*(s) = \arg\max_a \left[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^*(s')\right]$
- Introducing the Q-value function
  $Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^\pi(s')$
- Now the optimal action can be generated through
  $a^*(s) = \arg\max_a Q^*(s, a)$
Slide 14: LS Method for Reinforcement Learning
- The exact Q-values for all state-action pairs can be obtained by solving the Bellman equations (full backups)
  $Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, Q^\pi(s', \pi(s'))$
  or, in matrix format,
  $Q^\pi = R + \gamma P_\pi Q^\pi$
  where $P_\pi$ denotes the transition probability from $(s, a)$ to $(s', \pi(s'))$
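To make the full-backup idea concrete, here is a minimal sketch (a hypothetical two-state, two-action MDP with made-up rewards and transition probabilities) that solves the matrix Bellman equation $Q^\pi = R + \gamma P_\pi Q^\pi$ directly with a linear solver:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP under a fixed policy pi with
# pi(s0) = a0 and pi(s1) = a1. Rows/columns index the state-action
# pairs (s0,a0), (s0,a1), (s1,a0), (s1,a1).
gamma = 0.9
R = np.array([1.0, 0.0, 0.5, 2.0])          # reward for each (s, a)
P_pi = np.array([[0.8, 0.0, 0.0, 0.2],      # mass only on pairs (s', pi(s'))
                 [0.1, 0.0, 0.0, 0.9],
                 [0.5, 0.0, 0.0, 0.5],
                 [0.2, 0.0, 0.0, 0.8]])

# Full backup: solve (I - gamma * P_pi) Q = R for the exact Q-values
Q = np.linalg.solve(np.eye(4) - gamma * P_pi, R)
print(Q)
```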
Slide 15: LS Method for Reinforcement Learning
- Q-learning is a popular variant of temporal-difference learning for approximating Q-value functions
- In the absence of a model of the MDP, it uses sample data $(s_t, a_t, r_t, s_{t+1})$
- The temporal difference is defined as
  $d_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)$
- For one-step Q-learning, the update equation is
  $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t d_t$
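A minimal tabular sketch of this update (the state and action indices, step size, and discount factor below are illustrative assumptions):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One-step Q-learning: move Q(s, a) toward the TD target by alpha * d_t."""
    d = r + gamma * np.max(Q[s_next]) - Q[s, a]   # temporal difference d_t
    Q[s, a] += alpha * d
    return Q

# Example: 5 states, 2 actions, one observed sample (s, a, r, s').
Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=3)
```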
Slide 16: LS Method for Reinforcement Learning
- The final decision based on Q-learning is $a^*(s) = \arg\max_a Q(s, a)$
- Reasons for the development of approximation methods
- Size of the state-action space
- The overwhelming computational requirement
- Categories of approximation methods for reinforcement learning
- Model approximation
- Policy approximation
- Value function approximation
Slide 17: LS Method for Reinforcement Learning
- Model-Free Least-Squares Q-learning
- Linear function approximator
  $\hat{Q}(s, a; w) = \sum_{j=1}^{k} \phi_j(s, a)\, w_j = \phi(s, a)^T w$
  where the $\phi_j(s, a)$ are basis functions and $w$ is a vector of scalar weights
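The sketch below shows one possible way to realize such an approximator; the Gaussian radial basis functions, their centers, and the per-action feature blocks are illustrative assumptions, not the configuration used in the original slides:

```python
import numpy as np

def phi(s, a, centers, n_actions=2):
    """Illustrative features: Gaussian RBFs over a scalar state, replicated
    per action, so that Q_hat(s, a) = phi(s, a)^T w is linear in w."""
    rbf = np.exp(-0.5 * (s - centers) ** 2)
    feats = np.zeros(len(centers) * n_actions)
    feats[a * len(centers):(a + 1) * len(centers)] = rbf
    return feats

centers = np.array([-1.0, 0.0, 1.0])        # assumed RBF centers
w = np.zeros(len(centers) * 2)              # weight vector to be learned
q_hat = phi(0.3, 1, centers) @ w            # approximate Q(s = 0.3, a = 1)
```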
18LS Method for Reinforcement Learning
is
matrix and
If the model of MDP
is available
19LS Method for Reinforcement Learning
The policy
where
and
If the model of MDP
is not available Model-Free
Given Samples
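A sketch of this sample-based estimation (LSTDQ-style; the function name, arguments, and the small ridge term are assumptions added for illustration):

```python
import numpy as np

def lstdq(samples, policy, phi, k, gamma=0.9, ridge=1e-6):
    """Estimate the weight vector from samples (s, a, r, s') without a model:
    accumulate A~ and b~, then solve A~ w = b~."""
    A = ridge * np.eye(k)                    # small ridge term keeps A~ invertible
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next)) # features of the next greedy pair
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.solve(A, b)
```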
20LS Method for Reinforcement Learning
- Optimal policy can be found
The greedy policy is represented by the parameter
and can be determined on demand for any given
state.
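A short sketch of such an on-demand greedy policy (names are illustrative; `phi` is the feature map from the earlier sketch):

```python
def greedy_action(s, w, phi, actions):
    """The policy is stored only as the weight vector w: for a queried state,
    evaluate phi(s, a)^T w for every action and return the best one."""
    return max(actions, key=lambda a: phi(s, a) @ w)
```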
Slide 21: LS Method for Reinforcement Learning
- Simulation
- The system is hard to model but easy to simulate
- Simulation implicitly indicates the features of the system in terms of the state-visiting frequency
- Orthogonal least-squares algorithm for training an RBF network (sketched after this list)
- A systematic learning approach for solving the center-selection problem
- The newly added center always maximizes the amount of energy of the desired network output
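A compact sketch of the orthogonal least-squares forward-selection idea; the candidate matrix `Phi` (columns correspond to candidate RBF centers evaluated on the sampled states) and the numerical tolerance are assumptions:

```python
import numpy as np

def ols_select_centers(Phi, y, n_centers):
    """Forward selection of RBF centers by orthogonal least squares: at each
    step, pick the candidate whose orthogonalized column explains the largest
    share of the desired output energy y^T y (error-reduction ratio)."""
    n, m = Phi.shape
    selected, W = [], []                       # chosen indices, orthogonal columns
    for _ in range(n_centers):
        best_j, best_err, best_w = None, -1.0, None
        for j in range(m):
            if j in selected:
                continue
            w = Phi[:, j].astype(float)
            for wk in W:                       # Gram-Schmidt against chosen columns
                w = w - (wk @ Phi[:, j]) / (wk @ wk) * wk
            denom = w @ w
            if denom < 1e-12:                  # (nearly) dependent candidate, skip
                continue
            err = (w @ y) ** 2 / (denom * (y @ y))
            if err > best_err:
                best_j, best_err, best_w = j, err, w
        if best_j is None:
            break
        selected.append(best_j)
        W.append(best_w)
    return selected
```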
Slide 22: LS Method for Reinforcement Learning
- Hybrid Least-Squares Method (block diagram): the agent exchanges actions, states, and rewards with the environment; simulation data is passed to orthogonal least-squares regression, which produces the feature configuration; that configuration feeds the Least-Squares Policy Iteration (LSPI) algorithm, which outputs the optimal policy
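Putting the pieces together, here is a minimal sketch of the LSPI stage of this pipeline; the feature map `phi` stands in for the OLS-selected RBF configuration, and the iteration limit, tolerance, and ridge term are assumptions:

```python
import numpy as np

def lspi(samples, phi, k, actions, gamma=0.9, n_iter=20, tol=1e-4):
    """Least-Squares Policy Iteration: repeatedly evaluate the current greedy
    policy from the same batch of samples (LSTDQ step) and improve it, until
    the weight vector stops changing."""
    w = np.zeros(k)
    for _ in range(n_iter):
        greedy = lambda s, w=w: max(actions, key=lambda a: phi(s, a) @ w)
        A, b = 1e-6 * np.eye(k), np.zeros(k)   # ridge term for invertibility
        for s, a, r, s_next in samples:        # LSTDQ under the current policy
            f = phi(s, a)
            f_next = phi(s_next, greedy(s_next))
            A += np.outer(f, f - gamma * f_next)
            b += f * r
        w_new = np.linalg.solve(A, b)
        if np.linalg.norm(w_new - w) < tol:    # weights converged, policy fixed
            return w_new
        w = w_new
    return w
```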
Slide 23: LS Method for Reinforcement Learning
Slide 24: Simulation
Cart-Pole System
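For reference, a common textbook formulation of the cart-pole dynamics; the physical parameters and Euler integration step below are standard assumptions, not necessarily those used in the original simulation:

```python
import numpy as np

# Assumed parameters: gravity, cart mass, pole mass, pole half-length, time step
G, M_CART, M_POLE, LENGTH, DT = 9.8, 1.0, 0.1, 0.5, 0.02

def cart_pole_step(state, force):
    """Advance the cart-pole one Euler step; state = (x, x_dot, theta, theta_dot)."""
    x, x_dot, theta, theta_dot = state
    total_m = M_CART + M_POLE
    sin_t, cos_t = np.sin(theta), np.cos(theta)
    temp = (force + M_POLE * LENGTH * theta_dot**2 * sin_t) / total_m
    theta_acc = (G * sin_t - cos_t * temp) / (
        LENGTH * (4.0 / 3.0 - M_POLE * cos_t**2 / total_m))
    x_acc = temp - M_POLE * LENGTH * theta_acc * cos_t / total_m
    return (x + DT * x_dot, x_dot + DT * x_acc,
            theta + DT * theta_dot, theta_dot + DT * theta_acc)
```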
Slide 25: Simulation
Slide 26: Simulation
Slide 27: Conclusion
- From the reinforcement learning perspective, the intractability of solutions to sequential decision problems requires value function approximation methods
- At present, linear function approximators are the best alternative as an approximation architecture, mainly due to their transparent structure
- The model-free least-squares policy iteration (LSPI) method is a promising algorithm that uses a linear approximation architecture to achieve policy optimization in the spirit of Q-learning; it may converge in surprisingly few steps
- Inspired by the orthogonal least-squares regression method for selecting the centers of an RBF neural network, a new hybrid learning method for LSPI can produce a more robust and human-independent solution
Slide 28: Questions? Comments?