Title: Least-Squares Methods for Reinforcement Learning
Slide 1: Least-Squares Methods for Reinforcement Learning
Slide 2: Outline
- Introduction
- Least-Squares Method for Reinforcement Learning
- Simulation Example
- Conclusion
Slide 3: Introduction
- Learning From Interaction
- Interact with environment
- Consequences of actions to achieve goals
- No explicit teacher but experience
- Examples
- Chess player in a game
- Someone preparing a meal
- The actions of a gazelle calf after it is born
Slide 4: Introduction
- Characteristics
- Decision making in uncertain environment
- Actions
- Affect the future situation
- Effects cannot be fully predicted
- Goals are explicit
- Use experience to improve performance
Slide 5: Introduction
- What is to be learned
- Mapping from situations to actions
- Maximizes a scalar reward or reinforcement signal
- Learning
- Does not need to be told which actions to take
- Must discover which actions yield the most reward by trying them
Slide 6: Introduction
- Challenge
- Action may affect not only immediate reward but also the next situation, and consequently all subsequent rewards
- Trial and error search
- Delayed reward
Slide 7: Introduction
- Exploration and exploitation
- Exploit what it already knows in order to obtain reward
- Explore in order to make better action selections in the future
- Neither can be pursued exclusively without failing at the task
- Trade-off
Slide 8: Introduction
- Components of an agent
- Policy
- Decision-making function
- Reward (total reward, average reward, discounted reward)
- Good and bad events for the agent
- Value
- Rewards in the long run
- Model of environment
- Behavior of the environment
Slide 9: Introduction
- Markov property and Markov decision processes
- Independence of path: all that matters is contained in the current state signal
- A reinforcement learning task that satisfies the Markov property is called a Markov decision process (MDP)
- Finite Markov decision process (finite MDP)
Slide 10: Introduction
- Three categories of methods for solving the reinforcement learning problem
- Dynamic programming
- Complete and accurate model of the environment
- A full backup operation on each state
- Monte Carlo methods
- A backup for each state based on the entire sequence of observed rewards from that state until the end of the episode
- Temporal-difference learning
- Approximate the optimal value function, and view the approximation as an adequate guide
Slide 11: LS Method for Reinforcement Learning
- For a stochastic dynamic system
  $s_{t+1} = f(s_t, a_t, w_t)$
  where $s_t$ is the current state, $a_t$ is the control decision generated by the policy, and $w_t$ is a disturbance independently sampled from some fixed distribution
- The MDP can be denoted by a quadruple $(S, A, P, R)$: $S$ is the state set, $A$ is the action set, $P(s' \mid s, a)$ is the state transition probability, and $R(s, a)$ denotes the reward function
- The policy is a mapping $\pi: S \rightarrow A$; under a fixed policy $\pi$, the state sequence $\{s_t\}$ is a Markov chain
Slide 12: LS Method for Reinforcement Learning
- For each policy $\pi$, the value function is defined by
  $V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t)) \mid s_0 = s\right]$
- The optimal value function is defined by
  $V^*(s) = \max_\pi V^\pi(s)$
Slide 13: LS Method for Reinforcement Learning
- The optimal action can be generated through
  $a^*(s) = \arg\max_a \left[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^*(s')\right]$
- Introducing the Q-value function
  $Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^\pi(s')$
- Now the optimal action can be generated through
  $a^*(s) = \arg\max_a Q^*(s, a)$
Slide 14: LS Method for Reinforcement Learning
- The exact Q-values for all state-action pairs can be obtained by solving the Bellman equations (full backups)
  $Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, Q^\pi(s', \pi(s'))$
  or, in matrix format,
  $Q^\pi = R + \gamma P_\pi Q^\pi$
  where $P_\pi$ denotes the transition probability from $(s, a)$ to $(s', \pi(s'))$
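To make the full-backup idea concrete, here is a minimal sketch (a hypothetical two-state, two-action MDP with made-up rewards and transition probabilities) that solves the matrix Bellman equation $Q^\pi = R + \gamma P_\pi Q^\pi$ directly with a linear solver:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP under a fixed policy pi with
# pi(s0) = a0 and pi(s1) = a1. Rows/columns index the state-action
# pairs (s0,a0), (s0,a1), (s1,a0), (s1,a1).
gamma = 0.9
R = np.array([1.0, 0.0, 0.5, 2.0])          # reward for each (s, a)
P_pi = np.array([[0.8, 0.0, 0.0, 0.2],      # mass only on pairs (s', pi(s'))
                 [0.1, 0.0, 0.0, 0.9],
                 [0.5, 0.0, 0.0, 0.5],
                 [0.2, 0.0, 0.0, 0.8]])

# Full backup: solve (I - gamma * P_pi) Q = R for the exact Q-values
Q = np.linalg.solve(np.eye(4) - gamma * P_pi, R)
print(Q)
```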
Slide 15: LS Method for Reinforcement Learning
- Q-learning is a popular variant of temporal-difference learning for approximating Q-value functions
- In the absence of a model of the MDP, it uses sample data $(s_t, a_t, r_t, s_{t+1})$
- The temporal difference is defined as
  $d_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)$
- For one-step Q-learning, the update equation is
  $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t d_t$
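A minimal tabular sketch of this update (the state and action indices, step size, and discount factor below are illustrative assumptions):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One-step Q-learning: move Q(s, a) toward the TD target by alpha * d_t."""
    d = r + gamma * np.max(Q[s_next]) - Q[s, a]   # temporal difference d_t
    Q[s, a] += alpha * d
    return Q

# Example: 5 states, 2 actions, one observed sample (s, a, r, s').
Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=3)
```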
Slide 16: LS Method for Reinforcement Learning
- The final decision based on Q-learning is $a^*(s) = \arg\max_a Q(s, a)$
- Reasons for the development of approximation methods
- Size of the state-action space
- The overwhelming computational requirement
- Categories of approximation methods for reinforcement learning
- Model approximation
- Policy approximation
- Value function approximation
Slide 17: LS Method for Reinforcement Learning
- Model-Free Least-Squares Q-learning
- Linear function approximator
  $\hat{Q}(s, a; w) = \sum_{j=1}^{k} \phi_j(s, a)\, w_j = \phi(s, a)^T w$
  where the $\phi_j(s, a)$ are basis functions and $w$ is a vector of scalar weights
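The sketch below shows one possible way to realize such an approximator; the Gaussian radial basis functions, their centers, and the per-action feature blocks are illustrative assumptions, not the configuration used in the original slides:

```python
import numpy as np

def phi(s, a, centers, n_actions=2):
    """Illustrative features: Gaussian RBFs over a scalar state, replicated
    per action, so that Q_hat(s, a) = phi(s, a)^T w is linear in w."""
    rbf = np.exp(-0.5 * (s - centers) ** 2)
    feats = np.zeros(len(centers) * n_actions)
    feats[a * len(centers):(a + 1) * len(centers)] = rbf
    return feats

centers = np.array([-1.0, 0.0, 1.0])        # assumed RBF centers
w = np.zeros(len(centers) * 2)              # weight vector to be learned
q_hat = phi(0.3, 1, centers) @ w            # approximate Q(s = 0.3, a = 1)
```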
18LS Method for Reinforcement Learning
is
matrix and
If the model of MDP
is available
19LS Method for Reinforcement Learning
The policy
where
and
If the model of MDP
is not available Model-Free
Given Samples
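A sketch of this sample-based estimation (LSTDQ-style; the function name, arguments, and the small ridge term are assumptions added for illustration):

```python
import numpy as np

def lstdq(samples, policy, phi, k, gamma=0.9, ridge=1e-6):
    """Estimate the weight vector from samples (s, a, r, s') without a model:
    accumulate A~ and b~, then solve A~ w = b~."""
    A = ridge * np.eye(k)                    # small ridge term keeps A~ invertible
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next)) # features of the next greedy pair
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.solve(A, b)
```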
20LS Method for Reinforcement Learning
- Optimal policy can be found
The greedy policy is represented by the parameter
and can be determined on demand for any given
state.
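A short sketch of such an on-demand greedy policy (names are illustrative; `phi` is the feature map from the earlier sketch):

```python
def greedy_action(s, w, phi, actions):
    """The policy is stored only as the weight vector w: for a queried state,
    evaluate phi(s, a)^T w for every action and return the best one."""
    return max(actions, key=lambda a: phi(s, a) @ w)
```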
Slide 21: LS Method for Reinforcement Learning
- Simulation
- The system is hard to model but easy to simulate
- Simulation implicitly indicates the features of the system in terms of the state-visiting frequency
- Orthogonal least-squares algorithm for training an RBF network (sketched after this list)
- A systematic learning approach for solving the center-selection problem
- The newly added center always maximizes the amount of energy of the desired network output
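A compact sketch of the orthogonal least-squares forward-selection idea; the candidate matrix `Phi` (columns correspond to candidate RBF centers evaluated on the sampled states) and the numerical tolerance are assumptions:

```python
import numpy as np

def ols_select_centers(Phi, y, n_centers):
    """Forward selection of RBF centers by orthogonal least squares: at each
    step, pick the candidate whose orthogonalized column explains the largest
    share of the desired output energy y^T y (error-reduction ratio)."""
    n, m = Phi.shape
    selected, W = [], []                       # chosen indices, orthogonal columns
    for _ in range(n_centers):
        best_j, best_err, best_w = None, -1.0, None
        for j in range(m):
            if j in selected:
                continue
            w = Phi[:, j].astype(float)
            for wk in W:                       # Gram-Schmidt against chosen columns
                w = w - (wk @ Phi[:, j]) / (wk @ wk) * wk
            denom = w @ w
            if denom < 1e-12:                  # (nearly) dependent candidate, skip
                continue
            err = (w @ y) ** 2 / (denom * (y @ y))
            if err > best_err:
                best_j, best_err, best_w = j, err, w
        if best_j is None:
            break
        selected.append(best_j)
        W.append(best_w)
    return selected
```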
Slide 22: LS Method for Reinforcement Learning
- Hybrid Least-Squares Method (block diagram): the agent exchanges actions, states, and rewards with the environment; simulation data is passed to orthogonal least-squares regression, which produces the feature configuration; that configuration feeds the Least-Squares Policy Iteration (LSPI) algorithm, which outputs the optimal policy
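Putting the pieces together, here is a minimal sketch of the LSPI stage of this pipeline; the feature map `phi` stands in for the OLS-selected RBF configuration, and the iteration limit, tolerance, and ridge term are assumptions:

```python
import numpy as np

def lspi(samples, phi, k, actions, gamma=0.9, n_iter=20, tol=1e-4):
    """Least-Squares Policy Iteration: repeatedly evaluate the current greedy
    policy from the same batch of samples (LSTDQ step) and improve it, until
    the weight vector stops changing."""
    w = np.zeros(k)
    for _ in range(n_iter):
        greedy = lambda s, w=w: max(actions, key=lambda a: phi(s, a) @ w)
        A, b = 1e-6 * np.eye(k), np.zeros(k)   # ridge term for invertibility
        for s, a, r, s_next in samples:        # LSTDQ under the current policy
            f = phi(s, a)
            f_next = phi(s_next, greedy(s_next))
            A += np.outer(f, f - gamma * f_next)
            b += f * r
        w_new = np.linalg.solve(A, b)
        if np.linalg.norm(w_new - w) < tol:    # weights converged, policy fixed
            return w_new
        w = w_new
    return w
```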
Slide 23: LS Method for Reinforcement Learning
Slide 24: Simulation
Cart-Pole System
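For reference, a common textbook formulation of the cart-pole dynamics; the physical parameters and Euler integration step below are standard assumptions, not necessarily those used in the original simulation:

```python
import numpy as np

# Assumed parameters: gravity, cart mass, pole mass, pole half-length, time step
G, M_CART, M_POLE, LENGTH, DT = 9.8, 1.0, 0.1, 0.5, 0.02

def cart_pole_step(state, force):
    """Advance the cart-pole one Euler step; state = (x, x_dot, theta, theta_dot)."""
    x, x_dot, theta, theta_dot = state
    total_m = M_CART + M_POLE
    sin_t, cos_t = np.sin(theta), np.cos(theta)
    temp = (force + M_POLE * LENGTH * theta_dot**2 * sin_t) / total_m
    theta_acc = (G * sin_t - cos_t * temp) / (
        LENGTH * (4.0 / 3.0 - M_POLE * cos_t**2 / total_m))
    x_acc = temp - M_POLE * LENGTH * theta_acc * cos_t / total_m
    return (x + DT * x_dot, x_dot + DT * x_acc,
            theta + DT * theta_dot, theta_dot + DT * theta_acc)
```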
Slide 25: Simulation
Slide 26: Simulation
Slide 27: Conclusion
- From the reinforcement learning perspective, the intractability of solutions to sequential decision problems requires value function approximation methods
- At present, linear function approximators are the best alternative as an approximation architecture, mainly due to their transparent structure
- The model-free least-squares policy iteration (LSPI) method is a promising algorithm that uses a linear approximation architecture to achieve policy optimization in the spirit of Q-learning; it may converge in surprisingly few steps
- Inspired by the orthogonal least-squares regression method for selecting the centers of an RBF neural network, a new hybrid learning method for LSPI can produce a more robust and human-independent solution
Slide 28: Questions? Comments?