Least-Squares Methods for Reinforcement Learning

1
Least-Squares Methods For Reinforcement Learning
  • Li, Hailin

2
Outline
  • Introduction
  • Least-Squares Method for Reinforcement Learning
  • Simulation Example
  • Conclusion
3
Introduction
  • Learning From Interaction
  • Interact with environment
  • Consequences of actions to achieve goals
  • No explicit teacher but experience
  • Examples
  • A chess player in a game
  • Someone preparing food
  • The actions of a gazelle calf after it is born

4
Introduction
  • Characteristics
  • Decision making in uncertain environment
  • Actions
  • Affect the future situation
  • Effects cannot be fully predicted
  • Goals are explicit
  • Use experience to improve performance

5
Introduction
  • What is to be learned
  • Mapping from situations to actions
  • Maximizes a scalar reward or reinforcement signal
  • Learning
  • The agent does not need to be told which actions
    to take
  • It must discover which actions yield the most
    reward by trying them

6
Introduction
  • Challenge
  • An action may affect not only the immediate reward
    but also the next situation, and consequently all
    subsequent rewards
  • Trial and error search
  • Delayed reward

7
Introduction
  • Exploration and exploitation
  • Exploit what it already knows in order to obtain
    reward
  • Explore in order to make better action selections
    in the future
  • Neither can be pursued exclusively without
    failing at the task
  • Trade-off

8
Introduction
  • Components of an agent
  • Policy
  • Decision-making function
  • Reward (Total reward, Average reward, Discounted
    reward)
  • Good and bad events for the agent
  • Value
  • Rewards in the long run
  • Model of environment
  • Behavior of the environment

9
Introduction
  • Markov property and Markov decision processes
  • Independence of path: all that matters is contained
    in the current state signal
  • A reinforcement learning task that satisfies the
    Markov property is called a Markov decision
    process, MDP
  • Finite Markov Decision Process (MDP)

10
Introduction
  • Three categories of methods for solving the
    reinforcement learning problem
  • Dynamic programming
  • Complete and accurate model of the environment
  • A full backup operation on each state
  • Monte Carlo methods
  • A backup for each state based on the entire
    sequence of observed rewards from that state
    until the end of the episode
  • Temporal-difference learning
  • Approximates the optimal value function and views
    the approximation as an adequate guide

11
LS Method for Reinforcement Learning
  • Consider a stochastic dynamic system
    x_{t+1} = f(x_t, a_t, w_t)
    where x_t is the current state, a_t is the control
    decision generated by the policy, and w_t is a
    disturbance independently sampled from some fixed
    distribution
  • The MDP can be denoted by a quadruple (S, A, P, R)
  • S is the state set, A is the action set, P(x, a, x')
    is the state transition probability, and R(x, a)
    denotes the reward function
  • The policy is a mapping pi : S -> A
  • Under a fixed policy pi, the state sequence {x_t}
    is a Markov chain
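
As an illustration only (not from the slides), here is a minimal Python
sketch of how the quadruple (S, A, P, R) and a policy pi : S -> A might be
represented; the class and function names are hypothetical.

```python
import numpy as np

# A minimal sketch of the MDP quadruple (S, A, P, R); not the presenter's code.
class FiniteMDP:
    def __init__(self, P, R, gamma=0.95):
        # P[a, x, x'] = probability of moving from state x to x' under action a
        # R[x, a]     = expected reward for taking action a in state x
        self.P, self.R, self.gamma = P, R, gamma
        self.n_actions, self.n_states, _ = P.shape

    def step(self, x, a, rng):
        # The "disturbance" is the random draw from the transition distribution.
        x_next = rng.choice(self.n_states, p=self.P[a, x])
        return x_next, self.R[x, a]

# A deterministic policy pi: S -> A stored as an array of action indices.
def make_random_policy(n_states, n_actions, rng):
    return rng.integers(0, n_actions, size=n_states)
```

Here rng would be a NumPy generator, e.g. np.random.default_rng(0).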
12
LS Method for Reinforcement Learning
  • For each policy pi, the value function V^pi is
    defined by the equation
    V^pi(x) = E[ sum_{t=0..inf} gamma^t R(x_t, pi(x_t)) | x_0 = x ]
    where gamma in (0, 1) is the discount factor
  • The optimal value function is defined by
    V*(x) = max_pi V^pi(x)
13
LS Method for Reinforcement Learning
The optimal action can be generated through
  a*(x) = argmax_a [ R(x, a) + gamma sum_{x'} P(x, a, x') V*(x') ]
which requires the model of the MDP.
Introducing the Q value function
  Q^pi(x, a) = R(x, a) + gamma sum_{x'} P(x, a, x') V^pi(x')
Now the optimal action can be generated through
  a*(x) = argmax_a Q*(x, a)
without explicit use of the model.
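
As a small illustrative aside (not in the slides), the practical difference
between the two expressions above: acting greedily with respect to V* needs
the model (P and R) for a one-step lookahead, while acting greedily with
respect to Q* is a plain argmax. The mdp object follows the hypothetical
FiniteMDP sketch earlier.

```python
import numpy as np

def greedy_from_V(mdp, V, x):
    # Needs the model: one-step lookahead through P and R.
    q = np.array([mdp.R[x, a] + mdp.gamma * mdp.P[a, x] @ V
                  for a in range(mdp.n_actions)])
    return int(np.argmax(q))

def greedy_from_Q(Q, x):
    # Model-free: the Q table already summarizes the lookahead.
    return int(np.argmax(Q[x]))
```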
14
LS Method for Reinforcement Learning
  • The exact Q-values for all state-action pairs can
    be obtained by solving the Bellman equations
    (full backups)
    Q^pi(x, a) = R(x, a) + gamma sum_{x'} P(x, a, x') Q^pi(x', pi(x'))
    or, in matrix format,
    Q^pi = R + gamma P_pi Q^pi
    where P_pi denotes the transition probability from
    (x, a) to (x', pi(x'))
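
As a hedged numerical sketch (not from the presentation), the full-backup
idea for a small MDP with a known model: the fixed-policy Bellman system is
linear in the Q-values and can be solved directly. It re-uses the
hypothetical FiniteMDP layout from the earlier sketch.

```python
import numpy as np

def solve_bellman_Q(mdp, policy):
    """Exact Q^pi by solving (I - gamma * P_pi) q = r over state-action pairs."""
    nS, nA = mdp.n_states, mdp.n_actions
    n = nS * nA
    P_pi = np.zeros((n, n))   # transitions from (x, a) to (x', pi(x'))
    r = np.zeros(n)
    for x in range(nS):
        for a in range(nA):
            i = x * nA + a
            r[i] = mdp.R[x, a]
            for x_next in range(nS):
                j = x_next * nA + policy[x_next]
                P_pi[i, j] += mdp.P[a, x, x_next]
    q = np.linalg.solve(np.eye(n) - mdp.gamma * P_pi, r)
    return q.reshape(nS, nA)   # Q^pi as a (states x actions) table
```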
15
LS Method for Reinforcement Learning
  • Traditional Q-learning
  • A popular variant of temporal-difference learning
    used to approximate Q value functions
  • In the absence of a model of the MDP, it uses
    sample data (x_t, a_t, r_t, x_{t+1})
  • The temporal difference is defined as
    d_t = r_t + gamma max_{a'} Q(x_{t+1}, a') - Q(x_t, a_t)
  • For one-step Q-learning, the update equation is
    Q(x_t, a_t) <- Q(x_t, a_t) + alpha_t d_t
    where alpha_t is the learning rate
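
A minimal tabular sketch of the one-step update above, assuming a constant
learning rate and an epsilon-greedy behaviour policy; neither setting is
specified in the slides.

```python
import numpy as np

def q_learning_update(Q, x, a, r, x_next, gamma=0.95, alpha=0.1):
    # Temporal difference d_t, then move Q(x_t, a_t) a step toward the target.
    td = r + gamma * np.max(Q[x_next]) - Q[x, a]
    Q[x, a] += alpha * td
    return Q

def epsilon_greedy(Q, x, epsilon, rng):
    # The exploration/exploitation trade-off from the introduction.
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[x]))
```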
16
LS Method for Reinforcement Learning
The final decision based upon Q-learning is
  a*(x) = argmax_a Q(x, a)
The reasons for the development of approximation
methods
  • Size of state-action space
  • The overwhelming requirement for computation

The categories of approximation methods for
Machine Learning
  • Model Approximation
  • Policy Approximation
  • Value Function Approximation

17
LS Method for Reinforcement Learning
  • Model-Free Least-Squares Q-learning
  • Linear function approximator
    Q_hat(x, a; w) = sum_{j=1..k} phi_j(x, a) w_j = phi(x, a)^T w
  • phi_1, ..., phi_k are fixed basis functions
  • w is a vector of k scalar weights
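
One common (but here only assumed) choice of basis is a set of Gaussian RBFs
over the state, replicated per action; the sketch below shows how phi(x, a)
and Q_hat(x, a) = phi(x, a)^T w could be evaluated.

```python
import numpy as np

def rbf_features(x, a, centers, width, n_actions):
    """phi(x, a): Gaussian RBFs of the state, placed in the block of the
    chosen action; the blocks of the other actions stay zero."""
    rbf = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * width ** 2))
    phi = np.zeros(len(centers) * n_actions)
    phi[a * len(centers):(a + 1) * len(centers)] = rbf
    return phi

def q_hat(x, a, w, centers, width, n_actions):
    # Q_hat(x, a; w) = phi(x, a)^T w
    return rbf_features(x, a, centers, width, n_actions) @ w
```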
18
LS Method for Reinforcement Learning
  • For a fixed policy pi, the weight vector w^pi
    satisfies the linear system
    A w^pi = b
    where A = Phi^T (Phi - gamma P_pi Phi) is a
    (k x k) matrix and b = Phi^T R is a k-vector,
    if the model of the MDP is available
19
LS Method for Reinforcement Learning
The weights of the policy satisfy
  w^pi = A^{-1} b
where A and b are as defined on the previous slide.
If the model of the MDP is not available (model-free),
then given samples (x_i, a_i, r_i, x_i'), A and b can
be estimated as
  A = sum_i phi(x_i, a_i) [ phi(x_i, a_i) - gamma phi(x_i', pi(x_i')) ]^T
  b = sum_i phi(x_i, a_i) r_i
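
A compact sketch of the sample-based estimation above (the LSTDQ step of
LSPI), with illustrative names; phi is any feature map such as the RBF
sketch earlier, and the policy being evaluated is greedy with respect to the
previous weights.

```python
import numpy as np

def lstdq(samples, w_old, phi, actions, gamma=0.95):
    """Estimate A, b from samples (x, a, r, x') and return w = A^{-1} b."""
    k = len(w_old)
    A = np.zeros((k, k))
    b = np.zeros(k)
    for (x, a, r, x_next) in samples:
        f = phi(x, a)
        # action the current greedy policy would take in x'
        a_next = max(actions, key=lambda ap: phi(x_next, ap) @ w_old)
        A += np.outer(f, f - gamma * phi(x_next, a_next))
        b += r * f
    return np.linalg.solve(A + 1e-6 * np.eye(k), b)  # tiny ridge term for stability
```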
20
LS Method for Reinforcement Learning
  • The optimal policy can be found by iterating
    policy evaluation (solving for w) and greedy
    policy improvement
  • The greedy policy
    pi(x) = argmax_a phi(x, a)^T w
    is represented by the parameter vector w and can be
    determined on demand for any given state
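
Putting the pieces together, a hedged sketch of the iteration: evaluate the
current greedy policy with LSTDQ, switch to the new weights, and stop when
the weights no longer change. The lstdq and phi names refer to the
illustrative sketches above.

```python
import numpy as np

def lspi(samples, phi, actions, k, gamma=0.95, tol=1e-4, max_iter=20):
    """LSPI sketch: the greedy policy is never stored explicitly,
    only the weight vector w that represents it."""
    w = np.zeros(k)
    for _ in range(max_iter):
        w_new = lstdq(samples, w, phi, actions, gamma)
        if np.linalg.norm(w_new - w) < tol:   # weights converged
            return w_new
        w = w_new
    return w

def greedy_action(x, w, phi, actions):
    # The policy is determined on demand for any given state.
    return max(actions, key=lambda a: phi(x, a) @ w)
```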
21
LS Method for Reinforcement Learning
  • Simulation
  • System is hard to model but easy to simulate
  • Implicitly indicates the features of the system in
    terms of the state visiting frequency
  • Orthogonal least-squares algorithm for training
    an RBF network
  • A systematic learning approach for solving the
    center selection problem
  • Each newly added center maximizes the increment of
    explained energy of the desired network output
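
A simplified, hedged sketch of forward center selection in the spirit of
orthogonal least-squares regression: at each step the candidate center whose
addition removes the most residual energy of the desired output is kept.
This is a plain greedy refit rather than the classical orthogonalized
formulation, and all names and the Gaussian width are illustrative.

```python
import numpy as np

def select_centers(candidates, X, y, n_centers, width=1.0):
    """Greedy forward selection of RBF centers for targets y on samples X."""
    def design(centers):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * width ** 2))   # Gaussian RBF design matrix

    chosen, remaining = [], list(range(len(candidates)))
    for _ in range(n_centers):
        best_i, best_err = None, np.inf
        for i in remaining:
            Phi = design(candidates[chosen + [i]])
            w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
            err = np.sum((y - Phi @ w) ** 2)       # unexplained (residual) energy
            if err < best_err:
                best_i, best_err = i, err
        chosen.append(best_i)
        remaining.remove(best_i)
    return candidates[chosen]
```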

22
LS Method for Reinforcement Learning
Hybrid Least-Squares Method (block diagram)
  • The agent interacts with the environment, sending
    actions and receiving states and rewards
  • Simulation data are passed to an orthogonal
    least-squares regression that produces the feature
    configuration (the RBF centers)
  • The Least-Squares Policy Iteration (LSPI) algorithm
    then uses this feature configuration to compute the
    optimal policy
23
LS Method for Reinforcement Learning
24
Simulation
Cart-Pole System
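
For reference, a hedged sketch of the standard cart-pole dynamics used as a
benchmark in the RL literature (simple Euler integration with the usual mass
and length constants); the presentation's exact simulation parameters are
not given, so these values are assumptions.

```python
import numpy as np

# Assumed constants, typical of the cart-pole benchmark
GRAVITY, M_CART, M_POLE, HALF_LEN = 9.8, 1.0, 0.1, 0.5
FORCE_MAG, DT = 10.0, 0.02

def cartpole_step(state, action):
    """One Euler step; state = (x, x_dot, theta, theta_dot), action in {0, 1}."""
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    total_mass = M_CART + M_POLE
    cos_t, sin_t = np.cos(theta), np.sin(theta)

    temp = (force + M_POLE * HALF_LEN * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_LEN * (4.0 / 3.0 - M_POLE * cos_t ** 2 / total_mass))
    x_acc = temp - M_POLE * HALF_LEN * theta_acc * cos_t / total_mass

    new_state = (x + DT * x_dot, x_dot + DT * x_acc,
                 theta + DT * theta_dot, theta_dot + DT * theta_acc)
    # Failure when the pole falls past ~12 degrees or the cart leaves the track
    failed = abs(new_state[2]) > 12 * np.pi / 180 or abs(new_state[0]) > 2.4
    return new_state, (-1.0 if failed else 0.0), failed
```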
25
Simulation
26
Simulation
27
Conclusion
  • From the reinforcement learning perspective, the
    intractability of exact solutions to sequential
    decision problems requires value function
    approximation methods
  • At present, linear function approximators are the
    best alternative as an approximation architecture,
    mainly due to their transparent structure
  • The model-free least-squares policy iteration
    (LSPI) method is a promising algorithm that uses a
    linear approximation architecture to achieve policy
    optimization in the spirit of Q-learning. It may
    converge in surprisingly few iterations
  • Inspired by the orthogonal least-squares regression
    method for selecting the centers of an RBF neural
    network, a new hybrid learning method for LSPI can
    produce a more robust and human-independent
    solution

28
Questions / Comments?