An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING

An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING. Andrew G. Barto, Department of Computer Science, University of Massachusetts Amherst. Lecture 3.

1
An Introduction to COMPUTATIONAL REINFORCEMENT
LEARNING
  • Andrew G. Barto
  • Department of Computer Science
  • University of Massachusetts Amherst
  • Lecture 3

Autonomous Learning Laboratory Department of
Computer Science
2
The Overall Plan
  • Lecture 1
  • What is Computational Reinforcement Learning?
  • Learning from evaluative feedback
  • Markov decision processes
  • Lecture 2
  • Dynamic Programming
  • Basic Monte Carlo methods
  • Temporal Difference methods
  • A unified perspective
  • Connections to neuroscience
  • Lecture 3
  • Function approximation
  • Model-based methods
  • Dimensions of Reinforcement Learning

4
Lecture 3, Part 1: Generalization and Function
Approximation
Objectives of this part
  • Look at how experience with a limited part of the
    state set can be used to produce good behavior
    over a much larger part
  • Overview of function approximation (FA) methods
    and how they can be adapted to RL

5
Value Prediction with FA
As usual: Policy Evaluation (the prediction
problem): for a given policy π, compute
the state-value function $V^\pi$.
In earlier chapters, value functions were stored
in lookup tables.
6
Adapt Supervised Learning Algorithms
Training info = desired (target) outputs
Inputs → Supervised Learning System → Outputs
Training example = {input, target output}
Error = (target output − actual output)
7
Backups as Training Examples
As a training example: {input, target output}
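One way to read this slide is that every backup yields one supervised (input, target) pair. The sketch below assumes a one-step TD(0) backup, which is only one possible choice of backup; the function name `td0_training_example` and the tiny value table are hypothetical.

```python
def td0_training_example(state, reward, next_state, value_fn, gamma=1.0):
    """Turn one observed transition into a supervised (input, target) pair.

    Assumes a TD(0)-style target: reward + gamma * V(next_state).
    """
    target = reward + gamma * value_fn(next_state)
    return state, target


# Hypothetical current value estimates, used only for illustration.
V = {"A": 0.0, "B": 0.5}
example = td0_training_example("A", 1.0, "B", V.get, gamma=0.9)
```

The pair in `example` could then be handed to any supervised learner as (input, desired output).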
8
Any FA Method?
  • In principle, yes
  • artificial neural networks
  • decision trees
  • multivariate regression methods
  • etc.
  • But RL has some special requirements
  • usually want to learn while interacting
  • ability to handle nonstationarity
  • other?

9
Gradient Descent Methods
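Parameter vector (a column vector): $\theta_t = (\theta_t(1), \theta_t(2), \ldots, \theta_t(n))^T$, where $T$ denotes transpose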
10
Performance Measures
  • Many are applicable, but
  • a common and simple one is the mean-squared error
    (MSE) over a distribution P:
    $\mathrm{MSE}(\theta_t) = \sum_{s \in S} P(s)\,[V^\pi(s) - V_t(s)]^2$
  • Why P?
  • Why minimize MSE?
  • Let us assume that P is always the distribution
    of states with which backups are done.
  • The on-policy distribution: the distribution
    created while following the policy being
    evaluated. Stronger results are available for
    this distribution.

11
Gradient Descent
Iteratively move down the gradient
12
Gradient Descent Cont.
For the MSE given above and using the chain rule
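The equation for this step did not survive in the transcript. A reconstruction under the usual definitions (step size $\alpha$, parameter vector $\theta_t$, approximate value $V_t$, state distribution $P$) would read roughly as follows; the factor $\tfrac{1}{2}$ is a convention that cancels the 2 produced by the chain rule.

```latex
\begin{align*}
\mathrm{MSE}(\theta_t) &= \sum_{s \in S} P(s)\,\bigl[V^{\pi}(s) - V_t(s)\bigr]^2 \\
\theta_{t+1} &= \theta_t - \tfrac{1}{2}\,\alpha\,\nabla_{\theta_t}\mathrm{MSE}(\theta_t) \\
             &= \theta_t + \alpha \sum_{s \in S} P(s)\,\bigl[V^{\pi}(s) - V_t(s)\bigr]\,\nabla_{\theta_t} V_t(s)
\end{align*}
```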
13
Gradient Descent Cont.
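Use just the sample gradient instead:
$\theta_{t+1} = \theta_t + \alpha\,[V^\pi(s_t) - V_t(s_t)]\,\nabla_{\theta_t} V_t(s_t)$
Since each sample gradient is an unbiased
estimate of the true gradient, this converges to
a local minimum of the MSE if the step size α
decreases appropriately with t.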
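A minimal sketch of sample-gradient descent, assuming a linear approximator, a uniform state distribution P, and access to the true values as targets (which, as the next slide notes, is not the realistic case). The names `sgd_value_fit`, `features`, and `true_value` are hypothetical.

```python
import random


def sgd_value_fit(features, true_value, states, alpha0=0.5, steps=5000, seed=0):
    """Fit V(s) = theta . phi(s) by following one sample gradient per step."""
    rng = random.Random(seed)
    theta = [0.0] * len(features(states[0]))
    for t in range(1, steps + 1):
        s = rng.choice(states)                 # sample a state from P (uniform here)
        phi = features(s)
        v = sum(w * x for w, x in zip(theta, phi))
        alpha = alpha0 / (1.0 + t / 100.0)     # step size decreases with t
        error = true_value(s) - v
        # For a linear approximator, the gradient of V w.r.t. theta is phi.
        theta = [w + alpha * error * x for w, x in zip(theta, phi)]
    return theta


states = [0, 1, 2, 3, 4]
features = lambda s: [1.0, s / 4.0]            # bias plus scaled state index
true_value = lambda s: 2.0 * (s / 4.0) - 1.0   # exactly representable target
theta = sgd_value_fit(features, true_value, states)
```

Because the target here is exactly representable, `theta` should end up close to `[-1.0, 2.0]`.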
14
But We Don't Have These Targets
15
What about TD(λ) Targets?
16
On-Line Gradient-Descent TD(λ)
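The algorithm box for this slide is missing from the transcript. In the standard backward view, and assuming accumulating eligibility traces that start from a zero vector, each time step applies:

```latex
\begin{align*}
\delta_t &= r_{t+1} + \gamma\, V_t(s_{t+1}) - V_t(s_t) \\
e_t &= \gamma \lambda\, e_{t-1} + \nabla_{\theta_t} V_t(s_t) \\
\theta_{t+1} &= \theta_t + \alpha\, \delta_t\, e_t
\end{align*}
```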
17
Linear Methods
18
Nice Properties of Linear FA Methods
  • The gradient is very simple
  • For MSE, the error surface is a simple quadratic
    surface with a single minimum.
  • Linear gradient-descent TD(λ) converges if
  • Step size decreases appropriately
  • On-line sampling (states sampled from the
    on-policy distribution)
  • Converges to a parameter vector $\theta_\infty$ with
    the property

$\mathrm{MSE}(\theta_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma}\,\mathrm{MSE}(\theta^*)$, where $\theta^*$ is the best parameter vector
(Tsitsiklis and Van Roy, 1997)
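A minimal sketch of linear gradient-descent TD(λ) with accumulating traces. The test problem (a five-state random walk with reward 1 on the right exit) and the one-hot features are assumptions chosen to keep the example small; with one-hot features the linear method reduces to a table, so the estimates should approach the true values 1/6, 2/6, ..., 5/6.

```python
import random


def linear_td_lambda(n_states=5, episodes=3000, gamma=1.0, lam=0.8, seed=0):
    """On-line linear TD(lambda) on a random walk, one-hot features."""
    rng = random.Random(seed)
    theta = [0.0] * n_states                  # one weight per feature
    for ep in range(episodes):
        alpha = 0.1 / (1.0 + ep / 200.0)      # decreasing step size
        e = [0.0] * n_states                  # eligibility trace vector
        s = n_states // 2                     # start in the middle
        while True:
            s_next = s + rng.choice((-1, 1))
            done = s_next < 0 or s_next >= n_states
            r = 1.0 if s_next >= n_states else 0.0
            v_next = 0.0 if done else theta[s_next]
            delta = r + gamma * v_next - theta[s]
            e = [gamma * lam * x for x in e]
            e[s] += 1.0                       # gradient of V(s) is the one-hot vector
            theta = [w + alpha * delta * x for w, x in zip(theta, e)]
            if done:
                break
            s = s_next
    return theta


theta = linear_td_lambda()
```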
19
Coarse Coding
20
Learning and Coarse Coding
21
Tile Coding
  • Binary feature for each tile
  • Number of features present at any one time is
    constant
  • Binary features mean the weighted sum is easy to
    compute
  • Easy to compute indices of the features present
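
The properties listed above can be seen in a small one-dimensional sketch. The uniform offsets, the extra tile per tiling, and the function names are assumptions for illustration, not the layout used in the lecture.

```python
import math


def active_tiles(x, low, high, n_tilings=4, tiles_per_tiling=8):
    """Return one active tile index per tiling for a scalar x in [low, high]."""
    width = (high - low) / tiles_per_tiling
    indices = []
    for k in range(n_tilings):
        offset = k * width / n_tilings        # each tiling is shifted a little
        tile = int(math.floor((x - low + offset) / width))
        tile = min(max(tile, 0), tiles_per_tiling)
        # Each tiling owns tiles_per_tiling + 1 tiles to cover the shifted range.
        indices.append(k * (tiles_per_tiling + 1) + tile)
    return indices


def value(x, theta, low=0.0, high=1.0):
    """With binary features the weighted sum is a sum over active tiles."""
    return sum(theta[i] for i in active_tiles(x, low, high))


theta = [0.0] * (4 * 9)
tiles = active_tiles(0.37, 0.0, 1.0)
```

Exactly `n_tilings` features are active for any input, so evaluating `value` costs the same everywhere.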

22
Tile Coding Cont.
Irregular tilings
Hashing
CMAC: Cerebellar Model Arithmetic Computer
(Albus, 1971)
23
Radial Basis Functions (RBFs)
e.g., Gaussians
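A minimal sketch of Gaussian RBF features for a scalar input. The evenly spaced centers and the shared width `sigma` are assumptions; unlike tile coding, each feature takes a graded value in (0, 1].

```python
import math


def rbf_features(x, centers, sigma):
    """Gaussian radial basis features: exp(-(x - c)^2 / (2 sigma^2))."""
    return [math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2)) for c in centers]


centers = [0.0, 0.25, 0.5, 0.75, 1.0]
phi = rbf_features(0.5, centers, sigma=0.25)
```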
24
Can you beat the curse of dimensionality?
  • Can you keep the number of features from going up
    exponentially with the dimension?
  • Function complexity, not dimensionality, is the
    problem.
  • Kanerva coding
  • Select a bunch of binary prototypes
  • Use Hamming distance as distance measure
  • Dimensionality is no longer a problem, only
    complexity
  • Lazy learning schemes
  • Remember all the data
  • To get new value, find nearest neighbors and
    interpolate
  • e.g., locally-weighted regression
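
A rough sketch of Kanerva coding as described in the bullets above: a feature is active when the state lies within a fixed Hamming radius of its prototype. The prototypes and the radius below are arbitrary, hypothetical choices.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def kanerva_features(state_bits, prototypes, radius):
    """Binary feature per prototype: 1 if within the Hamming radius."""
    return [1 if hamming(state_bits, p) <= radius else 0 for p in prototypes]


prototypes = [
    (0, 0, 0, 0, 0, 0),
    (1, 1, 1, 0, 0, 0),
    (1, 1, 1, 1, 1, 1),
]
phi = kanerva_features((1, 1, 0, 0, 0, 0), prototypes, radius=2)
```

The number of features equals the number of prototypes, independent of how many bits describe the state.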

25
Control with FA
  • Learning state-action values
  • The general gradient-descent rule
  • Gradient-descent Sarsa(λ) (backward view)

26
Linear Gradient Descent Sarsa(λ)
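The algorithm box is missing from the transcript. The sketch below shows a single update step under stated assumptions: binary features given as lists of active indices for the current and next state-action pairs, and accumulating traces (the original slide may have used replacing traces). The function name is hypothetical.

```python
def sarsa_lambda_step(theta, e, active, reward, next_active,
                      alpha, gamma, lam, terminal=False):
    """Apply one linear Sarsa(lambda) update in place and return the TD error."""
    q = sum(theta[i] for i in active)
    q_next = 0.0 if terminal else sum(theta[i] for i in next_active)
    delta = reward + gamma * q_next - q
    for i in range(len(e)):
        e[i] *= gamma * lam                   # decay all traces
    for i in active:
        e[i] += 1.0                           # accumulate on active features
    for i in range(len(theta)):
        theta[i] += alpha * delta * e[i]
    return delta


theta = [0.0] * 4
e = [0.0] * 4
delta = sarsa_lambda_step(theta, e, [0, 1], 1.0, [2, 3],
                          alpha=0.5, gamma=0.9, lam=0.8)
```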
27
GPI: Linear Gradient Descent Watkins's Q(λ)
28
Mountain-Car Task
29
Mountain-Car Results
30
Baird's Counterexample
31
Baird's Counterexample Cont.
32
Should We Bootstrap?
33
Summary
  • Generalization
  • Adapting supervised-learning function
    approximation methods
  • Gradient-descent methods
  • Linear gradient-descent methods
  • Radial basis functions
  • Tile coding
  • Kanerva coding
  • Nonlinear gradient-descent methods?
    Backpropagation?
  • Subtleties involving function approximation,
    bootstrapping and the on-policy/off-policy
    distinction

35
Lecture 3, Part 2: Model-Based Methods
Objectives of this part
  • Use of environment models
  • Integration of planning and learning methods

36
Models
  • Model: anything the agent can use to predict how
    the environment will respond to its actions
  • Distribution model: description of all
    possibilities and their probabilities
  • e.g., the transition probabilities and expected
    rewards for all states, actions, and next states
  • Sample model: produces sample experiences
  • e.g., a simulation model
  • Both types of models can be used to produce
    simulated experience
  • Often sample models are much easier to come by

37
Planning
  • Planning: any computational process that uses a
    model to create or improve a policy
  • Planning in AI
  • state-space planning
  • plan-space planning (e.g., partial-order planner)
  • We take the following (unusual) view
  • all state-space planning methods involve
    computing value functions, either explicitly or
    implicitly
  • they all apply backups to simulated experience

38
Planning Cont.
  • Classical DP methods are state-space planning
    methods
  • Heuristic search methods are state-space planning
    methods
  • A planning method based on Q-learning

Random-Sample One-Step Tabular Q-Planning
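The algorithm box did not survive extraction. A minimal sketch of the idea, assuming a deterministic sample model stored as a dictionary from (state, action) to (reward, next state); the toy model and all names are hypothetical.

```python
import random


def q_planning(model, actions, n_updates=2000, alpha=0.5, gamma=0.9, seed=0):
    """Repeatedly apply one-step tabular Q-learning to simulated experience."""
    rng = random.Random(seed)
    Q = {}
    pairs = sorted(model)
    for _ in range(n_updates):
        s, a = rng.choice(pairs)              # 1. pick a state-action pair at random
        r, s_next = model[(s, a)]             # 2. ask the sample model for an outcome
        best_next = max(Q.get((s_next, b), 0.0) for b in actions)
        q = Q.get((s, a), 0.0)
        Q[(s, a)] = q + alpha * (r + gamma * best_next - q)   # 3. Q-learning backup
    return Q


actions = ["left", "right"]
model = {
    ("s0", "left"): (0.0, "s0"), ("s0", "right"): (0.0, "s1"),
    ("s1", "left"): (0.0, "s0"), ("s1", "right"): (1.0, "goal"),
}
Q = q_planning(model, actions)
```

No real experience is used here: every backup is applied to an outcome generated by the model.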
39
Learning, Planning, and Acting
  • Two uses of real experience
  • model learning: to improve the model
  • direct RL: to directly improve the value function
    and policy
  • Improving value function and/or policy via a
    model is sometimes called indirect RL or
    model-based RL. Here, we call it planning.

40
Direct vs. Indirect RL
  • Indirect (model-based) methods
  • make fuller use of experience: get a better
    policy with fewer environment interactions
  • Direct methods
  • simpler
  • not affected by bad models

But they are very closely related and can be
usefully combined: planning, acting, model
learning, and direct RL can occur simultaneously
and in parallel
41
The Dyna Architecture (Sutton 1990)
42
The Dyna-Q Algorithm
direct RL
model learning
planning
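The algorithm listing is missing here, so the following is only a sketch of how the three labelled steps can fit together, assuming a deterministic environment, an ε-greedy policy, and a five-state corridor as a stand-in for the maze. `dyna_q` and `corridor_step` are hypothetical names.

```python
import random


def dyna_q(env_step, start, actions, episodes=30, n_planning=5,
           alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q, model = {}, {}
    q = lambda s, a: Q.get((s, a), 0.0)

    def backup(s, a, r, s2, done):
        target = r + (0.0 if done else gamma * max(q(s2, b) for b in actions))
        Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))

    for _ in range(episodes):
        s, done = start, False
        while not done:
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                best = max(q(s, b) for b in actions)
                a = rng.choice([b for b in actions if q(s, b) == best])
            r, s2, done = env_step(s, a)
            backup(s, a, r, s2, done)              # direct RL
            model[(s, a)] = (r, s2, done)          # model learning
            for _plan in range(n_planning):        # planning
                ps, pa = rng.choice(sorted(model))
                backup(ps, pa, *model[(ps, pa)])
            s = s2
    return Q


def corridor_step(s, a):
    """States 0..4; reaching state 4 ends the episode with reward 1."""
    s2 = min(max(s + a, 0), 4)
    return (1.0, s2, True) if s2 == 4 else (0.0, s2, False)


Q = dyna_q(corridor_step, start=0, actions=[-1, 1])
```

Setting `n_planning=0` recovers plain one-step Q-learning, which makes the contribution of planning easy to compare.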
43
Dyna-Q on a Simple Maze
Rewards are 0 until the goal is reached, when the reward is 1
44
Dyna-Q Snapshots Midway in 2nd Episode
45
When the Model is Wrong: Blocking Maze
The changed environment is harder
46
Shortcut Maze
The changed environment is easier
47
What is Dyna-Q+?
  • Uses an exploration bonus
  • Keeps track of time since each state-action pair
    was tried for real
  • An extra reward is added for transitions caused
    by state-action pairs, related to how long ago
    they were tried: the longer unvisited, the more
    reward for visiting
  • The agent actually plans how to visit long
    unvisited states
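
The slide does not give the form of the bonus. The sketch below assumes the square-root form, reward plus κ times the square root of the elapsed time, which is one common choice; the value of `kappa` and the function name are hypothetical.

```python
import math


def bonus_reward(model_reward, current_time, last_tried_time, kappa=0.001):
    """Reward used in planning backups: modeled reward plus exploration bonus.

    Assumed form: r + kappa * sqrt(tau), where tau is the number of time
    steps since the state-action pair was last tried for real.
    """
    tau = current_time - last_tried_time
    return model_reward + kappa * math.sqrt(tau)


r_plus = bonus_reward(0.0, current_time=100, last_tried_time=0, kappa=0.1)
```

Because the bonus enters the planning backups, it propagates through the value function, which is how the agent ends up planning routes to long-unvisited states.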

48
Prioritized Sweeping
  • Which states or state-action pairs should be
    generated during planning?
  • Work backwards from states whose values have just
    changed
  • Maintain a queue of state-action pairs whose
    values would change a lot if backed up,
    prioritized by the size of the change
  • When a new backup occurs, insert predecessors
    according to their priorities
  • Always perform backups from the first pair in
    the queue
  • Moore and Atkeson, 1993; Peng and Williams, 1993
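
A simplified sketch of the queue mechanics described above, assuming a deterministic model, a known predecessor table, and a priority threshold `theta`. Duplicate queue entries are tolerated for brevity, which a fuller implementation would avoid; all names are hypothetical.

```python
import heapq


def prioritized_sweeping(Q, model, predecessors, actions, start_pair,
                         alpha=1.0, gamma=0.9, theta=1e-4, n_backups=5):
    """Back up pairs in order of how much their values would change."""
    def td_error(s, a):
        r, s2 = model[(s, a)]
        best = max(Q.get((s2, b), 0.0) for b in actions)
        return r + gamma * best - Q.get((s, a), 0.0)

    queue = []
    p = abs(td_error(*start_pair))
    if p > theta:
        heapq.heappush(queue, (-p, start_pair))    # negate: heapq is a min-heap
    done = 0
    while queue and done < n_backups:
        _, (s, a) = heapq.heappop(queue)           # largest priority first
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error(s, a)
        done += 1
        for ps, pa in predecessors.get(s, []):     # work backwards
            p = abs(td_error(ps, pa))
            if p > theta:
                heapq.heappush(queue, (-p, (ps, pa)))
    return done


actions = ["right"]
model = {("s0", "right"): (0.0, "s1"),
         ("s1", "right"): (0.0, "s2"),
         ("s2", "right"): (1.0, "goal")}
predecessors = {"s1": [("s0", "right")], "s2": [("s1", "right")]}
Q = {}
n_done = prioritized_sweeping(Q, model, predecessors, actions, ("s2", "right"))
```

In this chain, three backups carry the goal reward all the way back to `s0`, whereas randomly chosen backups would waste most of their updates.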

49
Prioritized Sweeping
50
Prioritized Sweeping vs. Dyna-Q
Both use N = 5 backups per environmental interaction
51
Rod Maneuvering (Moore and Atkeson 1993)
52
Full and Sample (One-Step) Backups
53
Full vs. Sample Backups
b successor states, equally likely; initial error
= 1; assume all next states' values are correct
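To make the comparison concrete, the sketch below contrasts one full backup with a sequence of sample backups for a single state-action pair with b = 4 equally likely successors. The successor values, the 1/t step size, and the function names are assumptions for illustration.

```python
import random


def full_backup(successors, Q, actions, gamma):
    """Expectation over all successors: list of (prob, reward, next_state)."""
    return sum(p * (r + gamma * max(Q[(s2, a)] for a in actions))
               for p, r, s2 in successors)


def sample_backup(q_sa, successors, Q, actions, gamma, alpha, rng):
    """Move q_sa toward the target from one sampled successor."""
    weights = [p for p, _, _ in successors]
    _, r, s2 = rng.choices(successors, weights=weights, k=1)[0]
    target = r + gamma * max(Q[(s2, a)] for a in actions)
    return q_sa + alpha * (target - q_sa)


rng = random.Random(0)
actions = ["a"]
Q = {("x%d" % i, "a"): float(i) for i in range(4)}     # assumed correct values
successors = [(0.25, 0.0, "x%d" % i) for i in range(4)]

full = full_backup(successors, Q, actions, gamma=1.0)
estimate = 0.0
for t in range(1, 4001):
    estimate = sample_backup(estimate, successors, Q, actions, 1.0, 1.0 / t, rng)
```

One full backup costs about b sample backups, but sample backups start reducing the error after the first one, which is the trade-off the slide's plot compares.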
54
Trajectory Sampling
  • Trajectory sampling: perform backups along
    simulated trajectories
  • This samples from the on-policy distribution
  • Advantages when function approximation is used
  • Focusing of computation can cause vast
    uninteresting parts of the state space to be
    (usefully) ignored

Initial states
Irrelevant states
Reachable under optimal control
55
Trajectory Sampling Experiment
  • one-step full tabular backups
  • uniform: cycled through all state-action pairs
  • on-policy: backed up along simulated trajectories
  • 200 randomly generated undiscounted episodic
    tasks
  • 2 actions for each state, each with b equally
    likely next states
  • 0.1 probability of transition to terminal state
  • expected reward on each transition selected from
    mean 0 variance 1 Gaussian

56
Heuristic Search
  • Used for action selection, not for changing a
    value function (heuristic evaluation function)
  • Backed-up values are computed, but typically
    discarded
  • Extension of the idea of a greedy policy, only
    deeper
  • Also suggests ways to select states to back up:
    smart focusing

57
Summary
  • Emphasized close relationship between planning
    and learning
  • Important distinction between distribution models
    and sample models
  • Looked at some ways to integrate planning and
    learning
  • synergy among planning, acting, model learning
  • Distribution of backups: focus of the computation
  • trajectory sampling: backup along trajectories
  • prioritized sweeping
  • heuristic search
  • Size of backups: full vs. sample; deep vs.
    shallow

59
Lecture 3, Part 3: Dimensions of Reinforcement
Learning
Objectives of this part
  • Review the treatment of RL taken in this course
  • What have we left out?
  • What are the hot research areas?

60
Three Common Ideas
  • Estimation of value functions
  • Backing up values along real or simulated
    trajectories
  • Generalized Policy Iteration: maintain an
    approximate optimal value function and
    approximate optimal policy, use each to improve
    the other

61
Backup Dimensions
62
Other Dimensions
  • Function approximation
  • tables
  • aggregation
  • other linear methods
  • many nonlinear methods
  • On-policy/Off-policy
  • On-policy: learn the value function of the policy
    being followed
  • Off-policy: try to learn the value function for
    the best policy, irrespective of what policy is
    being followed

63
Still More Dimensions
  • Definition of return: episodic, continuing,
    discounted, etc.
  • Action values vs. state values vs. afterstate
    values
  • Action selection/exploration: ε-greedy, softmax,
    more sophisticated methods
  • Synchronous vs. asynchronous
  • Replacing vs. accumulating traces
  • Real vs. simulated experience
  • Location of backups (search control)
  • Timing of backups: part of selecting actions or
    only afterward?
  • Memory for backups: how long should backed up
    values be retained?
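
Two of the action-selection rules named in the list above can be sketched briefly. The temperature parameterization of softmax (a Gibbs or Boltzmann distribution over action values) is one common form and is assumed here; the function names are hypothetical.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def softmax_probs(q_values, temperature):
    """Action probabilities proportional to exp(q / temperature)."""
    m = max(q_values)                          # subtract max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [x / total for x in exps]


greedy_action = epsilon_greedy([0.1, 0.9, 0.3], 0.0, random.Random(0))
probs = softmax_probs([1.0, 2.0, 3.0], temperature=1.0)
```

A high temperature makes the softmax choice nearly uniform, while a low temperature makes it nearly greedy.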

64
Frontier Dimensions
  • Prove convergence for bootstrapping control
    methods.
  • Trajectory sampling
  • Non-Markov case
  • Partially Observable MDPs (POMDPs)
  • Bayesian approach: belief states
  • construct state from sequence of observations
  • Try to do the best you can with non-Markov states
  • Modularity and hierarchies
  • Learning and planning at several different levels
  • Theory of options

65
More Frontier Dimensions
  • Using more structure
  • factored state spaces: dynamic Bayes nets
  • factored action spaces

66
Still More Frontier Dimensions
  • Incorporating prior knowledge
  • advice and hints
  • trainers and teachers
  • shaping
  • Lyapunov functions
  • etc.