An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING

Transcript and Presenter's Notes

Title: An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING


1
An Introduction to COMPUTATIONAL REINFORCEMENT
LEARNING
  • Andrew G. Barto
  • Department of Computer Science
  • University of Massachusetts Amherst
  • Lecture 3

Autonomous Learning Laboratory Department of
Computer Science
2
The Overall Plan
  • Lecture 1
  • What is Computational Reinforcement Learning?
  • Learning from evaluative feedback
  • Markov decision processes
  • Lecture 2
  • Dynamic Programming
  • Basic Monte Carlo methods
  • Temporal Difference methods
  • A unified perspective
  • Connections to neuroscience
  • Lecture 3
  • Function approximation
  • Model-based methods
  • Dimensions of Reinforcement Learning

3
The Overall Plan
  • Lecture 1
  • What is Computational Reinforcement Learning?
  • Learning from evaluative feedback
  • Markov decision processes
  • Lecture 2
  • Dynamic Programming
  • Basic Monte Carlo methods
  • Temporal Difference methods
  • A unified perspective
  • Connections to neuroscience
  • Lecture 3
  • Function approximation
  • Model-based methods
  • Dimensions of Reinforcement Learning

4
Lecture 3, Part 1: Generalization and Function
Approximation
Objectives of this part:
  • Look at how experience with a limited part of the
    state set can be used to produce good behavior over
    a much larger part
  • Overview of function approximation (FA) methods
    and how they can be adapted to RL

5
Value Prediction with FA
As usual, Policy Evaluation (the prediction
problem): for a given policy π, compute
the state-value function \(V^\pi\).
In earlier chapters, value functions were stored
in lookup tables.
6
Adapt Supervised Learning Algorithms
Training info: desired (target) outputs
Inputs → Supervised Learning System → Outputs
Training example: (input, target output)
Error = (target output − actual output)
7
Backups as Training Examples
A backup can be treated as a training example:
  input: a description of the backed-up state
  target output: the backup's target value
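
For example, the TD(0) backup, written in standard notation as a training pair:

\[ \underbrace{s_t}_{\text{input}} \;\longmapsto\; \underbrace{r_{t+1} + \gamma V_t(s_{t+1})}_{\text{target output}} \]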
8
Any FA Method?
  • In principle, yes
  • artificial neural networks
  • decision trees
  • multivariate regression methods
  • etc.
  • But RL has some special requirements
  • usually want to learn while interacting
  • ability to handle nonstationarity
  • other?

9
Gradient Descent Methods
We seek a parameter vector
\(\vec{\theta}_t = \bigl(\theta_t(1), \theta_t(2), \ldots, \theta_t(n)\bigr)^{\mathsf{T}}\)
(a column vector; the superscript denotes transpose), and assume
\(V_t(s)\) is a sufficiently smooth differentiable function of
\(\vec{\theta}_t\) for all s.
10
Performance Measures
  • Many are applicable, but
  • a common and simple one is the mean-squared error
    (MSE) over a distribution P (see the formula
    below this list)
  • Why P?
  • Why minimize MSE?
  • Let us assume that P is always the distribution
    of states with which backups are done.
  • The on-policy distribution: the distribution
    created while following the policy being
    evaluated. Stronger results are available for
    this distribution.
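
A standard way to write this objective, with \(V_t\) the approximate value function under parameter vector \(\vec{\theta}_t\):

\[ \mathrm{MSE}(\vec{\theta}_t) = \sum_{s \in S} P(s)\,\bigl[ V^\pi(s) - V_t(s) \bigr]^2 \]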

11
Gradient Descent
Iteratively move down the gradient:
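In generic form, with step-size parameter α and an objective f to be minimized:

\[ \vec{\theta}_{t+1} = \vec{\theta}_t - \alpha\,\nabla_{\vec{\theta}_t} f(\vec{\theta}_t) \]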
12
Gradient Descent Cont.
For the MSE given above and using the chain rule:
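In standard form, the resulting update is

\[ \vec{\theta}_{t+1} = \vec{\theta}_t - \tfrac{1}{2}\,\alpha\,\nabla_{\vec{\theta}_t}\mathrm{MSE}(\vec{\theta}_t) = \vec{\theta}_t + \alpha \sum_{s \in S} P(s)\,\bigl[ V^\pi(s) - V_t(s) \bigr]\,\nabla_{\vec{\theta}_t} V_t(s) \]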
13
Gradient Descent Cont.
Use just the sample gradient instead (see the
update below).
Since each sample gradient is an unbiased
estimate of the true gradient, this converges to
a local minimum of the MSE if α decreases
appropriately with t.
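The per-sample update, with \(s_t\) drawn according to P:

\[ \vec{\theta}_{t+1} = \vec{\theta}_t + \alpha\,\bigl[ V^\pi(s_t) - V_t(s_t) \bigr]\,\nabla_{\vec{\theta}_t} V_t(s_t) \]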
14
But We Don't Have These Targets
15
What about TD(λ) Targets?
16
On-Line Gradient-Descent TD(λ)
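A minimal sketch of the algorithm with linear features and accumulating traces; the environment interface (env.reset, env.step), the policy function, and the feature map are hypothetical stand-ins:

    # A minimal sketch of on-line gradient-descent TD(lambda) for policy
    # evaluation with linear function approximation and accumulating traces.
    import numpy as np

    def td_lambda(env, policy, features, n_features,
                  alpha=0.01, gamma=0.99, lam=0.8, n_episodes=100):
        theta = np.zeros(n_features)                 # parameter vector
        for _ in range(n_episodes):
            e = np.zeros(n_features)                 # eligibility trace vector
            s, done = env.reset(), False
            while not done:
                s_next, r, done = env.step(policy(s))
                phi = features(s)
                v = theta @ phi                      # V_t(s_t)
                v_next = 0.0 if done else theta @ features(s_next)
                delta = r + gamma * v_next - v       # TD error
                e = gamma * lam * e + phi            # for linear V, grad V(s) = phi(s)
                theta += alpha * delta * e           # gradient-descent update
                s = s_next
        return theta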
17
Linear Methods
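In the linear case the approximate value function is a weighted sum of features, and its gradient is just the feature vector:

\[ V_t(s) = \vec{\theta}_t^{\,\mathsf{T}}\vec{\phi}_s = \sum_{i=1}^{n} \theta_t(i)\,\phi_s(i), \qquad \nabla_{\vec{\theta}_t} V_t(s) = \vec{\phi}_s \]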
18
Nice Properties of Linear FA Methods
  • The gradient is very simple
  • For MSE, the error surface is a simple quadratic
    surface with a single minimum.
  • Linear gradient-descent TD(λ) converges if
  • the step size decreases appropriately, and
  • sampling is on-line (states sampled from the
    on-policy distribution)
  • Converges to a parameter vector \(\vec{\theta}_\infty\) with the
    property
    \[ \mathrm{MSE}(\vec{\theta}_\infty) \le \frac{1-\gamma\lambda}{1-\gamma}\,\mathrm{MSE}(\vec{\theta}^{\,*}) \]
    where \(\vec{\theta}^{\,*}\) is the best parameter vector
    (Tsitsiklis and Van Roy, 1997)
19
Coarse Coding
20
Learning and Coarse Coding
21
Tile Coding
  • Binary feature for each tile
  • Number of features present at any one time is
    constant
  • Binary features mean the weighted sum is easy to
    compute
  • Easy to compute indices of the features present
    (see the sketch after this list)
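
A minimal sketch of computing the active tile indices for a 2-D state with several offset grid tilings; the layout and names here (active_tiles, tiles_per_dim) are illustrative assumptions, not the CMAC implementation discussed on the next slide:

    # A toy tile coder for a 2-D state: several overlapping grid tilings, each
    # contributing exactly one active (binary) feature index per state.
    import numpy as np

    def active_tiles(state, n_tilings=8, tiles_per_dim=10, low=0.0, high=1.0):
        """Return one active feature index per tiling for a state in [low, high]^2."""
        state = np.asarray(state, dtype=float)
        width = (high - low) / tiles_per_dim             # width of one tile
        per_tiling = (tiles_per_dim + 1) ** 2            # tilings extend one tile past the edge
        indices = []
        for t in range(n_tilings):
            offset = t * width / n_tilings               # shift each tiling slightly
            coords = np.floor((state - low + offset) / width).astype(int)
            coords = np.clip(coords, 0, tiles_per_dim)
            # flatten (tiling, row, column) into a single feature index
            idx = t * per_tiling + coords[0] * (tiles_per_dim + 1) + coords[1]
            indices.append(int(idx))
        return indices                                    # exactly n_tilings features are active

    # With binary features, the approximate value is a sum of the weights at
    # the active indices:  V(s) = sum(theta[i] for i in active_tiles(s))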

22
Tile Coding Cont.
Irregular tilings
Hashing
CMAC: "Cerebellar Model Arithmetic Computer"
(Albus, 1971)
23
Radial Basis Functions (RBFs)
e.g., Gaussians
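A Gaussian RBF feature with center \(c_i\) and width \(\sigma_i\) takes the standard form:

\[ \phi_s(i) = \exp\!\left( -\frac{\lVert s - c_i \rVert^2}{2\sigma_i^2} \right) \]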
24
Can you beat the curse of dimensionality?
  • Can you keep the number of features from going up
    exponentially with the dimension?
  • Function complexity, not dimensionality, is the
    problem.
  • Kanerva coding
  • Select a bunch of binary prototypes
  • Use Hamming distance as the distance measure
  • Dimensionality is no longer a problem, only
    complexity
  • Lazy learning schemes
  • Remember all the data
  • To get a new value, find nearest neighbors and
    interpolate
  • e.g., locally-weighted regression

25
Control with FA
  • Learning state-action values
  • The general gradient-descent rule
  • Gradient-descent Sarsa(λ) (backward view);
    see the updates sketched below
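In standard notation, with training target \(v_t\) and eligibility trace vector \(\vec{e}_t\), these take the form

\[ \vec{\theta}_{t+1} = \vec{\theta}_t + \alpha\,\bigl[v_t - Q_t(s_t, a_t)\bigr]\,\nabla_{\vec{\theta}_t} Q_t(s_t, a_t) \]

and, for the backward view of Sarsa(λ),

\[ \vec{\theta}_{t+1} = \vec{\theta}_t + \alpha\,\delta_t\,\vec{e}_t, \quad \delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t), \quad \vec{e}_t = \gamma\lambda\,\vec{e}_{t-1} + \nabla_{\vec{\theta}_t} Q_t(s_t, a_t) \]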

26
Linear Gradient-Descent Sarsa(λ)
27
GPI Linear Gradient-Descent Watkins's Q(λ)
28
Mountain-Car Task
29
Mountain-Car Results
30
Baird's Counterexample
31
Baird's Counterexample Cont.
32
Should We Bootstrap?
33
Summary
  • Generalization
  • Adapting supervised-learning function
    approximation methods
  • Gradient-descent methods
  • Linear gradient-descent methods
  • Radial basis functions
  • Tile coding
  • Kanerva coding
  • Nonlinear gradient-descent methods?
    Backpropagation?
  • Subtleties involving function approximation,
    bootstrapping, and the on-policy/off-policy
    distinction

34
The Overall Plan
  • Lecture 1
  • What is Computational Reinforcement Learning?
  • Learning from evaluative feedback
  • Markov decision processes
  • Lecture 2
  • Dynamic Programming
  • Basic Monte Carlo methods
  • Temporal Difference methods
  • A unified perspective
  • Connections to neuroscience
  • Lecture 3
  • Function approximation
  • Model-based methods
  • Dimensions of Reinforcement Learning

35
Lecture 3, Part 2: Model-Based Methods
Objectives of this part:
  • Use of environment models
  • Integration of planning and learning methods

36
Models
  • Model: anything the agent can use to predict how
    the environment will respond to its actions
  • Distribution model: a description of all
    possibilities and their probabilities
  • e.g., the transition probabilities \(\mathcal{P}^a_{ss'}\) and
    expected rewards \(\mathcal{R}^a_{ss'}\) for all s, a, s′
  • Sample model: produces sample experiences
  • e.g., a simulation model
  • Both types of models can be used to produce
    simulated experience
  • Often sample models are much easier to come by

37
Planning
  • Planning: any computational process that uses a
    model to create or improve a policy
  • Planning in AI
  • state-space planning
  • plan-space planning (e.g., partial-order planner)
  • We take the following (unusual) view
  • all state-space planning methods involve
    computing value functions, either explicitly or
    implicitly
  • they all apply backups to simulated experience

38
Planning Cont.
  • Classical DP methods are state-space planning
    methods
  • Heuristic search methods are state-space planning
    methods
  • A planning method based on Q-learning
    (sketched below):

Random-Sample One-Step Tabular Q-Planning
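A minimal sketch, assuming a small finite state and action set and a hypothetical sample_model(s, a) that returns a sampled (reward, next state) pair:

    # A minimal sketch of Random-Sample One-Step Tabular Q-Planning.
    import random
    from collections import defaultdict

    def q_planning(sample_model, states, actions,
                   n_steps=10000, alpha=0.1, gamma=0.95):
        Q = defaultdict(float)                      # tabular action values Q[(s, a)]
        for _ in range(n_steps):
            s = random.choice(states)               # 1. pick a state and action at random
            a = random.choice(actions)
            r, s_next = sample_model(s, a)          # 2. ask the sample model for (r, s')
            best_next = max(Q[(s_next, b)] for b in actions)
            # 3. apply one step of tabular Q-learning to the simulated experience
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        return Q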
39
Learning, Planning, and Acting
  • Two uses of real experience:
  • model learning: to improve the model
  • direct RL: to directly improve the value function
    and policy
  • Improving the value function and/or policy via a
    model is sometimes called indirect RL or
    model-based RL. Here, we call it planning.

40
Direct vs. Indirect RL
  • Indirect (model-based) methods:
  • make fuller use of experience: get a better policy
    with fewer environment interactions
  • Direct methods:
  • simpler
  • not affected by bad models

But they are very closely related and can be
usefully combined: planning, acting, model
learning, and direct RL can occur simultaneously
and in parallel.
41
The Dyna Architecture (Sutton 1990)
42
The Dyna-Q Algorithm
(algorithm box; its steps combine direct RL, model
learning, and planning; a sketch follows below)
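A minimal sketch of tabular Dyna-Q with an ε-greedy policy and a deterministic model; the environment interface (env.reset, env.step) is a hypothetical stand-in:

    # A minimal sketch of tabular Dyna-Q with an epsilon-greedy policy and a
    # deterministic model of the environment.
    import random
    from collections import defaultdict

    def dyna_q(env, actions, n_episodes=50, n_planning=5,
               alpha=0.1, gamma=0.95, epsilon=0.1):
        Q = defaultdict(float)          # tabular action values Q[(s, a)]
        model = {}                      # learned model: (s, a) -> (r, s_next)

        def q_update(s, a, r, s_next):  # one-step Q-learning backup
            best = max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

        for _ in range(n_episodes):
            s, done = env.reset(), False
            while not done:
                if random.random() < epsilon:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda b: Q[(s, b)])
                s_next, r, done = env.step(a)
                q_update(s, a, r, s_next)            # direct RL on real experience
                model[(s, a)] = (r, s_next)          # model learning
                for _ in range(n_planning):          # planning: simulated backups
                    (ps, pa), (pr, pn) = random.choice(list(model.items()))
                    q_update(ps, pa, pr, pn)
                s = s_next
        return Q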
43
Dyna-Q on a Simple Maze
reward is 0 until the goal is reached, when it is 1
44
Dyna-Q Snapshots Midway in 2nd Episode
45
When the Model Is Wrong: Blocking Maze
The changed environment is harder
46
Shortcut Maze
The changed environment is easier
47
What is Dyna-Q+?
  • Uses an exploration bonus (see the formula below):
  • Keeps track of the time since each state-action
    pair was tried for real
  • An extra reward is added for transitions caused
    by state-action pairs, related to how long ago
    they were tried: the longer unvisited, the more
    reward for visiting
  • The agent actually plans how to visit
    long-unvisited states
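
Concretely, if the modeled reward for a transition is r and the corresponding pair has not been tried for τ time steps, the planning backups use a bonus-augmented reward of the form

\[ r + \kappa\sqrt{\tau} \]

for some small κ > 0.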

48
Prioritized Sweeping
  • Which states or state-action pairs should be
    generated during planning?
  • Work backwards from states whose values have just
    changed:
  • Maintain a queue of state-action pairs whose
    values would change a lot if backed up,
    prioritized by the size of the change
  • When a new backup occurs, insert predecessors
    according to their priorities
  • Always perform backups from the first in the queue
    (see the sketch after this list)
  • (Moore and Atkeson, 1993; Peng and Williams, 1993)
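
A minimal sketch of the queue-driven planning loop; the model and predecessors mappings are hypothetical stand-ins assumed to be filled in from real experience:

    # A minimal sketch of the prioritized-sweeping planning loop. Q is a
    # defaultdict(float) of action values; `model` maps (s, a) -> (r, s_next);
    # `predecessors` maps a state to the (s_prev, a_prev, r_prev) triples known
    # to lead to it.
    import heapq
    import itertools

    def prioritized_sweep(Q, model, predecessors, actions,
                          n_backups=5, alpha=0.1, gamma=0.95, theta=1e-4):
        pqueue, tie = [], itertools.count()     # max-priority queue via negated keys

        def priority(s, a, r, s_next):          # how much the value would change
            return abs(r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])

        def push(s, a, p):
            if p > theta:
                heapq.heappush(pqueue, (-p, next(tie), (s, a)))

        for (s, a), (r, s_next) in model.items():            # seed the queue
            push(s, a, priority(s, a, r, s_next))

        for _ in range(n_backups):
            if not pqueue:
                break
            _, _, (s, a) = heapq.heappop(pqueue)              # highest priority first
            r, s_next = model[(s, a)]
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
            for (sp, ap, rp) in predecessors.get(s, ()):      # queue predecessors of s
                push(sp, ap, priority(sp, ap, rp, s))
        return Q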

49
Prioritized Sweeping
50
Prioritized Sweeping vs. Dyna-Q
Both use N = 5 backups per environmental interaction
51
Rod Maneuvering (Moore and Atkeson 1993)
52
Full and Sample (One-Step) Backups
53
Full vs. Sample Backups
b successor states, equally likely; initial error
of 1; assume all next states' values are correct
54
Trajectory Sampling
  • Trajectory sampling: perform backups along
    simulated trajectories
  • This samples from the on-policy distribution
  • Advantages when function approximation is used
  • Focusing of computation can cause vast
    uninteresting parts of the state space to be
    (usefully) ignored

(Figure: initial states, irrelevant states, and the
region reachable under optimal control)
55
Trajectory Sampling Experiment
  • one-step full tabular backups
  • uniform: cycled through all state-action pairs
  • on-policy: backed up along simulated trajectories
  • 200 randomly generated undiscounted episodic
    tasks
  • 2 actions for each state, each with b equally
    likely next states
  • 0.1 probability of transition to the terminal state
  • expected reward on each transition drawn from a
    Gaussian with mean 0 and variance 1

56
Heuristic Search
  • Used for action selection, not for changing a
    value function (heuristic evaluation function)
  • Backed-up values are computed, but typically
    discarded
  • Extension of the idea of a greedy policy, only
    deeper
  • Also suggests ways to select states to back up:
    smart focusing

57
Summary
  • Emphasized the close relationship between planning
    and learning
  • Important distinction between distribution models
    and sample models
  • Looked at some ways to integrate planning and
    learning:
  • synergy among planning, acting, and model learning
  • Distribution of backups: the focus of the computation
  • trajectory sampling: backing up along trajectories
  • prioritized sweeping
  • heuristic search
  • Size of backups: full vs. sample; deep vs. shallow

58
The Overall Plan
  • Lecture 1
  • What is Computational Reinforcement Learning?
  • Learning from evaluative feedback
  • Markov decision processes
  • Lecture 2
  • Dynamic Programming
  • Basic Monte Carlo methods
  • Temporal Difference methods
  • A unified perspective
  • Connections to neuroscience
  • Lecture 3
  • Function approximation
  • Model-based methods
  • Dimensions of Reinforcement Learning

59
Lecture 3, Part 3: Dimensions of Reinforcement
Learning
Objectives of this part:
  • Review the treatment of RL taken in this course
  • What have we left out?
  • What are the hot research areas?

60
Three Common Ideas
  • Estimation of value functions
  • Backing up values along real or simulated
    trajectories
  • Generalized Policy Iteration: maintain an
    approximate optimal value function and an
    approximate optimal policy, and use each to
    improve the other

61
Backup Dimensions
62
Other Dimensions
  • Function approximation:
  • tables
  • aggregation
  • other linear methods
  • many nonlinear methods
  • On-policy/Off-policy:
  • On-policy: learn the value function of the policy
    being followed
  • Off-policy: try to learn the value function of the
    best policy, irrespective of the policy being
    followed

63
Still More Dimensions
  • Definition of return: episodic, continuing,
    discounted, etc.
  • Action values vs. state values vs. afterstate
    values
  • Action selection/exploration: ε-greedy, softmax,
    more sophisticated methods
  • Synchronous vs. asynchronous
  • Replacing vs. accumulating traces
  • Real vs. simulated experience
  • Location of backups (search control)
  • Timing of backups: part of selecting actions, or
    only afterward?
  • Memory for backups: how long should backed-up
    values be retained?

64
Frontier Dimensions
  • Prove convergence for bootstrapping control
    methods
  • Trajectory sampling
  • The non-Markov case:
  • Partially Observable MDPs (POMDPs)
  • Bayesian approach: belief states
  • construct state from a sequence of observations
  • Try to do the best you can with non-Markov states
  • Modularity and hierarchies:
  • Learning and planning at several different levels
  • Theory of options

65
More Frontier Dimensions
  • Using more structure:
  • factored state spaces: dynamic Bayes nets
  • factored action spaces

66
Still More Frontier Dimensions
  • Incorporating prior knowledge
  • advice and hints
  • trainers and teachers
  • shaping
  • Lyapunov functions
  • etc.