Title: Chapter 6: Temporal Difference Learning
1. Chapter 6: Temporal Difference Learning
Objectives of this chapter
- Introduce Temporal Difference (TD) learning
- Focus first on policy evaluation, or prediction, methods
- Then extend to control methods
2. TD Prediction
Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function v_π
Recall the simple every-visit Monte Carlo method:
    V(S_t) ← V(S_t) + α [ G_t − V(S_t) ]
    target: the actual return after time t
The simplest temporal-difference method, TD(0):
    V(S_t) ← V(S_t) + α [ R_{t+1} + γ V(S_{t+1}) − V(S_t) ]
    target: an estimate of the return
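To make the TD(0) update concrete, here is a minimal sketch of tabular TD(0) policy evaluation in Python. The `env.reset()`/`env.step(action)` interface and the `policy(state)` callable are assumptions chosen for illustration, not part of the slides.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation: estimate V_pi from experience."""
    V = defaultdict(float)  # value estimates; unseen states default to 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # bootstrapped target: R_{t+1} + gamma * V(S_{t+1}), with V(terminal) = 0
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```

A Monte Carlo version would instead wait until the end of the episode and move V(S_t) toward the full return G_t.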
3. Simple Monte Carlo
4. Simplest TD Method
5. cf. Dynamic Programming
6. TD methods bootstrap and sample
- Bootstrapping: update involves an estimate
  - MC does not bootstrap
  - DP bootstraps
  - TD bootstraps
- Sampling: update does not involve an expected value
  - MC samples
  - DP does not sample
  - TD samples
7. Example: Driving Home
[Figure: sequence of states on the drive home, with elapsed times of 5, 15, 10, 10, and 3 minutes between successive states]
8. Driving Home
Changes recommended by Monte Carlo methods (α = 1)
Changes recommended by TD methods (α = 1)
9. Advantages of TD Learning
- TD methods do not require a model of the environment, only experience
- TD, but not MC, methods can be fully incremental
  - You can learn before knowing the final outcome
    - Less memory
    - Less peak computation
  - You can learn without the final outcome
    - From incomplete sequences
- Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?
10. Convergence of TD
- In the mean, for a constant (sufficiently small) step-size (Sutton 1988)
- With probability 1 and a decreasing step-size (Dayan 1992)
- With probability 1, extended by
  - Jaakkola, Jordan, and Singh (1994)
  - Tsitsiklis (1994)
11. Random Walk Example
Values learned by TD(0) after various numbers of
episodes
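For concreteness, a sketch of the five-state random walk itself, assuming the setup of the textbook's example (states A through E, every episode starting in the center state C, all rewards 0 except +1 on termination off the right end). It matches the hypothetical `env` interface used in the TD(0) sketch above.

```python
import random

class RandomWalk:
    """Five-state random walk A..E; episodes terminate off either end."""
    STATES = ["A", "B", "C", "D", "E"]

    def reset(self):
        self.pos = 2  # start in the center state C
        return self.STATES[self.pos]

    def step(self, action):
        # the action is ignored: the walk moves left or right with equal probability
        self.pos += random.choice([-1, 1])
        if self.pos < 0:
            return None, 0.0, True   # left termination, reward 0
        if self.pos >= len(self.STATES):
            return None, 1.0, True   # right termination, reward +1
        return self.STATES[self.pos], 0.0, False

# e.g. V = td0_prediction(RandomWalk(), policy=lambda s: None, num_episodes=100)
# The true values are 1/6, 2/6, ..., 5/6 for A through E.
```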
12. TD and MC on the Random Walk
Data averaged over 100 sequences of episodes
13. Optimality of TD(0)
Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence. Compute updates according to TD(0), but only update the estimates after each complete pass through the data.
For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α. Constant-α MC also converges under these conditions, but to a different answer!
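A sketch of what batch updating means for TD(0). Episodes are assumed to be stored as lists of (state, reward, next_state) transitions, with None marking the terminal state; this representation is chosen here for illustration.

```python
from collections import defaultdict

def batch_td0(episodes, alpha=0.01, gamma=1.0, tol=1e-6, max_sweeps=100000):
    """Batch TD(0): sweep the fixed data repeatedly, accumulating TD(0)
    increments but applying them only after each complete pass."""
    V = defaultdict(float)
    for _ in range(max_sweeps):
        increments = defaultdict(float)
        for episode in episodes:
            for state, reward, next_state in episode:
                target = reward + (gamma * V[next_state] if next_state is not None else 0.0)
                increments[state] += alpha * (target - V[state])
        for state, delta in increments.items():
            V[state] += delta  # apply the summed increments for this pass
        if max(abs(d) for d in increments.values()) < tol:
            break  # the estimates have stopped changing: convergence
    return V
```

A batch constant-α MC routine would be identical except that the target for each visited state is the actual return that followed it in that episode.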
14. Random Walk under Batch Updating
After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. All repeated 100 times.
15. You are the Predictor
Suppose you observe the following 8 episodes:
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
What is V(A)? What is V(B)?
16. You are the Predictor
17. You are the Predictor
- The prediction that best matches the training data is V(A) = 0
  - This minimizes the mean-squared error on the training set
  - This is what a batch Monte Carlo method gets
- If we consider the sequentiality of the problem, then we would set V(A) = 0.75
  - This is correct for the maximum-likelihood estimate of a Markov model generating the data
  - i.e., if we fit a Markov model to the data, assume it is exactly correct, and then compute what it predicts (how?)
  - This is called the certainty-equivalence estimate
  - This is what TD(0) gets (see the arithmetic sketched below)
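A quick check of the two answers on the eight episodes above; the arithmetic is the point, the code is only a scratchpad.

```python
# Eight episodes: one A-episode (A, 0, B, 0) and seven pure B-episodes.

# Batch Monte Carlo averages the observed returns from each state:
V_B = 6 / 8      # six of the eight returns from B were 1  -> 0.75
V_A_mc = 0 / 1   # the single return observed from A was 0 -> 0.0

# Certainty-equivalence (what batch TD(0) converges to) first fits a Markov
# model to the data: A went to B with probability 1 and reward 0, so
V_A_ce = 0 + V_B  # -> 0.75

print(V_A_mc, V_A_ce, V_B)  # 0.0 0.75 0.75
```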
18. Learning An Action-Value Function
19. Sarsa: On-Policy TD Control
Turn this into a control method by always updating the policy to be greedy with respect to the current estimate.
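A minimal sketch of tabular Sarsa, assuming the same hypothetical `env` interface as in the earlier sketches and ε-greedy action selection with respect to the current estimates.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Random action with probability epsilon, otherwise greedy w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """On-policy TD control: update Q(S,A) toward R + gamma * Q(S',A')."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            if done:
                target = reward  # Q of the terminal state is defined as 0
            else:
                next_action = epsilon_greedy(Q, next_state, actions, epsilon)
                target = reward + gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            if not done:
                state, action = next_state, next_action
    return Q
```

The name comes from the quintuple used in each update: (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}).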
20. History of SARSA
- First proposed by Rummery and Niranjan (1994) under the name "modified Q-learning"
- Sutton introduced the name SARSA in 1996
- Convergence of the tabular case shown by Singh, Jaakkola, Littman, and Szepesvari (2000)
21. Windy Gridworld
Undiscounted, episodic; reward is −1 on every step until the goal is reached.
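A sketch of the windy gridworld dynamics. The 7×10 grid, per-column wind strengths, start, and goal below are taken from the textbook's example rather than from the text of this slide, so treat them as assumptions.

```python
class WindyGridworld:
    """7x10 gridworld in which the wind pushes the agent upward by a
    per-column amount; reward is -1 on every step until the goal."""
    WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]          # upward push per column
    START, GOAL = (3, 0), (3, 7)                    # (row, column)
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

    def reset(self):
        self.state = self.START
        return self.state

    def step(self, action):
        row, col = self.state
        d_row, d_col = action
        # apply the move plus the wind of the current column, then clip to the grid
        new_row = min(max(row + d_row - self.WIND[col], 0), 6)
        new_col = min(max(col + d_col, 0), 9)
        self.state = (new_row, new_col)
        done = self.state == self.GOAL
        return self.state, -1.0, done

# e.g. Q = sarsa(WindyGridworld(), WindyGridworld.ACTIONS, num_episodes=500)
```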
22. Results of Sarsa on the Windy Gridworld
23. Q-Learning: Off-Policy TD Control
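For comparison with the Sarsa sketch above, a minimal sketch of tabular Q-learning under the same assumed `env` interface. The behavior policy is ε-greedy, but the update bootstraps from max_a Q(S', a), which is what makes it off-policy.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Off-policy TD control: behave epsilon-greedily, learn about the greedy policy."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # behavior policy: epsilon-greedy with respect to the current Q
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # target uses the greedy action in S', regardless of what will be taken
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```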
24. History of Q-learning
- First introduced by Watkins (1989), with some convergence analysis
- Better convergence analysis by Watkins and Dayan (1992)
- General convergence results by
  - Jaakkola, Jordan, and Singh (1994)
  - Tsitsiklis (1994)
- Finite-sample convergence (convergence rate) by Szepesvari (1999)
25. Cliffwalking
ε-greedy, ε = 0.1
26. The Book
- Part I: The Problem
  - Introduction
  - Evaluative Feedback
  - The Reinforcement Learning Problem
- Part II: Elementary Solution Methods
  - Dynamic Programming
  - Monte Carlo Methods
  - Temporal Difference Learning
- Part III: A Unified View
  - Eligibility Traces
  - Generalization and Function Approximation
  - Planning and Learning
  - Dimensions of Reinforcement Learning
  - Case Studies
27. Unified View
28. System Path and Markov Chain
- System path
- Probability of a path
- Return of a path
- Under a fixed policy π, an MDP becomes a Markov chain
29. Classification of MDPs
- 1 and 2 communicate
- 1 and 3 do not communicate
- 1 is recurrent
- 3 is transient
- When two states communicate, they are of the same type
  - 1 and 2 are recurrent
  - 3 and 4 are transient
- States that communicate with each other form a subset called a class
  - 1 and 2 form a recurrent class
  - 3 and 4 form a transient class
- A recurrent class R is an absorbing class if no state outside R can be reached from any state in R
30. Classification of MDPs
- Recurrent (or ergodic): the Markov chain corresponding to every deterministic stationary policy consists of a single recurrent class
- Unichain: the Markov chain corresponding to every deterministic stationary policy consists of a single recurrent class plus a possibly empty set of transient states
- Communicating: for every pair of states s and s' in S, there exists a deterministic stationary policy under which s' is accessible from s
31. Asymptotic Behavior of a Markov Chain
- Stationary distribution: if the Markov chain has a single recurrent class, then a stationary distribution d exists and satisfies d^T P = d^T, i.e., d(s') = Σ_s d(s) P(s' | s)
- Example (see the numerical sketch below)
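A small numerical illustration of the stationary-distribution equation, using a hypothetical two-state transition matrix and solving d^T P = d^T together with the normalization constraint.

```python
import numpy as np

# Hypothetical two-state transition matrix P (rows sum to 1).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# d satisfies d^T P = d^T with the entries of d summing to 1,
# i.e. (P^T - I) d = 0 plus the normalization row.
A = np.vstack([P.T - np.eye(2), np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
d, *_ = np.linalg.lstsq(A, b, rcond=None)

print(d)      # approximately [0.8333, 0.1667]
print(d @ P)  # equals d again, confirming stationarity
```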
32. Average Reward Optimality
- Average reward of state s under policy π: ρ_π(s) = lim_{n→∞} (1/n) E_π[ r_1 + r_2 + … + r_n | s_0 = s ]
- If the MDP is unichain, then the average reward of a policy is independent of the start state: ρ_π(s) = ρ_π for all s
- Optimal policy: a policy with maximum average reward
33. Average-Adjusted Value Function
- Differential (or relative, or average-adjusted) value function:
  V_π(s) = E_π[ Σ_{t=1}^{∞} ( r_t − ρ_π ) | s_0 = s ]
- Theorem: for any unichain MDP, there exist a value function V* and a scalar ρ* satisfying
  ρ* + V*(s) = max_a [ r(s, a) + Σ_{s'} P(s' | s, a) V*(s') ]   for all s,
  such that the greedy policy resulting from V* achieves the optimal average reward ρ* = max_π ρ_π over all policies.
34. R-Learning
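A minimal sketch of the tabular R-learning update as presented in the first edition of Sutton and Barto, assuming ε-greedy action selection and hypothetical step sizes alpha (for Q) and beta (for the average-reward estimate). The environment is treated as continuing, matching the access-control task on the next slide.

```python
import random
from collections import defaultdict

def r_learning(env, actions, num_steps, alpha=0.1, beta=0.01, epsilon=0.1):
    """Average-reward TD control: learn relative action values Q and an
    estimate rho of the average reward; rho is updated only on greedy steps."""
    Q = defaultdict(float)
    rho = 0.0
    state = env.reset()
    for _ in range(num_steps):
        greedy_value = max(Q[(state, a)] for a in actions)
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        was_greedy = Q[(state, action)] == greedy_value
        next_state, reward, _ = env.step(action)
        # average-adjusted (relative) TD error
        delta = reward - rho + max(Q[(next_state, a)] for a in actions) - Q[(state, action)]
        Q[(state, action)] += alpha * delta
        if was_greedy:
            rho += beta * delta
        state = next_state
    return Q, rho
```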
35. Access-Control Queuing Task
Apply R-learning:
- n servers
- Customers have four different priorities, which pay rewards of 1, 2, 4, or 8 if served
- At each time step, the customer at the head of the queue is either accepted (assigned to a server) or removed from the queue
- The proportion of randomly distributed high-priority customers in the queue is h
- A busy server becomes free with probability p on each time step
- Statistics of arrivals and departures are unknown
n = 10, h = 0.5, p = 0.06
36. Afterstates
- Usually, a state-value function evaluates states in which the agent can take an action.
- But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe.
- Why is this useful?
- What is this in general?
37. Summary
- TD prediction
- Introduced one-step, tabular, model-free TD methods
- Extend prediction to control by employing some form of GPI
  - On-policy control: Sarsa
  - Off-policy control: Q-learning and R-learning
- These methods bootstrap and sample, combining aspects of DP and MC methods
38. Questions
- What can I tell you about RL?
- What is common to all three classes of methods: DP, MC, and TD?
- What are the principal strengths and weaknesses of each?
- In what sense is our RL view complete?
- In what senses is it incomplete?
- What are the principal things missing?
- The broad applicability of these ideas
- What does the term "bootstrapping" refer to?
- What is the relationship between DP and learning?