Chapter 6: Temporal Difference Learning
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

1
Chapter 6: Temporal Difference Learning
Objectives of this chapter
  • Introduce Temporal Difference (TD) learning
  • Focus first on policy evaluation, or prediction,
    methods
  • Then extend to control methods

2
TD Prediction
Policy Evaluation (the prediction problem):
for a given policy π, compute the state-value function V^π

Recall the simple every-visit Monte Carlo method:

    V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]

    target: R_t, the actual return after time t

The simplest TD method, TD(0):

    V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

    target: r_{t+1} + γ V(s_{t+1}), an estimate of the return
3
Simple Monte Carlo
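
To make the Monte Carlo target concrete, here is a minimal Python sketch of constant-α every-visit MC prediction. The environment interface (reset()/step(a) returning state, reward, done) and the policy function are assumptions for illustration, not something given on the slides.

    from collections import defaultdict

    def mc_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
        """Constant-alpha, every-visit Monte Carlo prediction of V under `policy`."""
        V = defaultdict(float)
        for _ in range(num_episodes):
            # Generate one complete episode before doing any learning.
            trajectory = []                   # (state, reward received on leaving it)
            s, done = env.reset(), False
            while not done:
                s_next, r, done = env.step(policy(s))
                trajectory.append((s, r))
                s = s_next
            # Walk backwards accumulating the actual return R_t,
            # then move each visited V(s_t) toward that return.
            G = 0.0
            for s_t, r in reversed(trajectory):
                G = r + gamma * G             # the MC target: the actual return
                V[s_t] += alpha * (G - V[s_t])
        return V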
4
Simplest TD Method
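
The same prediction problem with the TD(0) target, under the same assumed interface; the only substantive change is that the update happens after every step, using the current estimate of the next state's value.

    from collections import defaultdict

    def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
        """Tabular TD(0) prediction of V under `policy` (minimal sketch)."""
        V = defaultdict(float)
        for _ in range(num_episodes):
            s, done = env.reset(), False
            while not done:
                s_next, r, done = env.step(policy(s))
                # TD target: one sampled reward plus the bootstrapped estimate
                # of the next state's value (taken as 0 if s_next is terminal).
                target = r + (0.0 if done else gamma * V[s_next])
                V[s] += alpha * (target - V[s])
                s = s_next
        return V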
5
cf. Dynamic Programming
6
TD methods bootstrap and sample
  • Bootstrapping: update involves an estimate
  • MC does not bootstrap
  • DP bootstraps
  • TD bootstraps
  • Sampling: update does not involve an expected
    value
  • MC samples
  • DP does not sample
  • TD samples

7
Example Driving Home
State                        Elapsed Time   Predicted    Predicted
                             (minutes)      Time to Go   Total Time
leaving office, friday at 6       0             30           30
reach car, raining                5             35           40
exiting highway                  20             15           35
2ndary road, behind truck        30             10           40
entering home street             40              3           43
arrive home                      43              0           43
(elapsed time between successive states: 5, 15, 10, 10, 3 minutes)
8
Driving Home
Changes recommended by Monte Carlo methods (α = 1)
Changes recommended by TD methods (α = 1)
9
Advantages of TD Learning
  • TD methods do not require a model of the
    environment, only experience
  • TD, but not MC, methods can be fully incremental
  • You can learn before knowing the final outcome
  • Less memory
  • Less peak computation
  • You can learn without the final outcome
  • From incomplete sequences
  • Both MC and TD converge (under certain
    assumptions to be detailed later), but which is
    faster?

10
Convergence of TD
  • In the mean for a constant (sufficiently small)
    step-size (Sutton 1988)
  • With probability 1 and decreasing step-size
    (Dayan 1992)
  • Convergence with probability 1 later extended by
  • Jaakkola, Jordan and Singh (1994)
  • Tsitsiklis (1994)

11
Random Walk Example
Values learned by TD(0) after various numbers of
episodes
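
A sketch of this experiment, assuming the standard five-state random walk from the book (states A through E, start in the center, reward +1 only on exiting to the right, undiscounted, all values initialized to 0.5); the function and variable names are mine.

    import random

    TRUE_VALUES = [1/6, 2/6, 3/6, 4/6, 5/6]          # true V(A)..V(E)

    def td0_random_walk(num_episodes, alpha=0.1):
        """TD(0) on the 5-state random walk; returns the learned value estimates."""
        V = [0.5] * 5                                # states 0..4 are A..E
        for _ in range(num_episodes):
            s = 2                                    # every episode starts in C
            while 0 <= s <= 4:
                s_next = s + random.choice((-1, 1))
                r = 1.0 if s_next == 5 else 0.0      # +1 only for the right terminal
                v_next = V[s_next] if 0 <= s_next <= 4 else 0.0
                V[s] += alpha * (r + v_next - V[s])  # TD(0) update, gamma = 1
                s = s_next
        return V

    print(td0_random_walk(100))                      # compare against TRUE_VALUES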
12
TD and MC on the Random Walk
Data averaged over 100 sequences of episodes
13
Optimality of TD(0)
Batch Updating: train completely on a finite
amount of data, e.g., train repeatedly on
10 episodes until convergence. Compute
updates according to TD(0), but only update
the estimates after each complete pass through
the data.
For any finite Markov prediction task, under
batch updating, TD(0) converges for sufficiently
small α. Constant-α MC also converges under
these conditions, but to a different answer!
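
A sketch of batch updating with TD(0), assuming each stored episode is a list of (s, r, s_next, done) transitions; increments are accumulated over a full pass through the data and only then applied.

    def batch_td0(episodes, num_states, alpha=0.001, gamma=1.0, tol=1e-8):
        """Sweep a fixed batch of episodes with TD(0) until the values stop changing."""
        V = [0.0] * num_states
        while True:
            increments = [0.0] * num_states
            for episode in episodes:
                for s, r, s_next, done in episode:
                    target = r + (0.0 if done else gamma * V[s_next])
                    increments[s] += alpha * (target - V[s])
            V = [v + d for v, d in zip(V, increments)]   # apply after the full pass
            if max(abs(d) for d in increments) < tol:
                return V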
14
Random Walk under Batch Updating
After each new episode, all previous episodes
were treated as a batch, and the algorithm was
trained until convergence. All repeated 100 times.
15
You are the Predictor
Suppose you observe the following 8 episodes:
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
16
You are the Predictor
17
You are the Predictor
  • The prediction that best matches the training
    data is V(A) = 0
  • This minimizes the mean-squared error on the
    training set
  • This is what a batch Monte Carlo method gets
  • If we consider the sequentiality of the problem,
    then we would set V(A) = 0.75
  • This is correct for the maximum-likelihood
    estimate of a Markov model generating the data
  • i.e., if we do a best-fit Markov model, assume it
    is exactly correct, and then compute what it
    predicts (how? A always goes to B with reward 0,
    and from B the episode ends with reward 1 in 6 of
    the 8 cases, so V(A) = V(B) = 6/8 = 0.75; a short
    script below checks both answers)
  • This is called the certainty-equivalence estimate
  • This is what TD(0) gets
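
A short script that checks both answers on these eight episodes (undiscounted returns assumed; the episode encoding is mine, not from the slides).

    # Each episode is a list of (state, reward) steps.
    episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)] for _ in range(6)] + [[("B", 0)]]

    # Batch Monte Carlo: average the observed returns that followed each state.
    returns = {"A": [], "B": []}
    for ep in episodes:
        rewards = [r for _, r in ep]
        for t, (s, _) in enumerate(ep):
            returns[s].append(sum(rewards[t:]))
    mc_V = {s: sum(g) / len(g) for s, g in returns.items()}

    # Certainty equivalence: fit a Markov model and solve it exactly.
    # A -> B with reward 0 in every observed case; from B the episode ends
    # with reward 1 in 6 of 8 cases, so V(B) = 6/8 and V(A) = 0 + V(B).
    ce_V = {"B": 6 / 8, "A": 0 + 6 / 8}

    print(mc_V)   # {'A': 0.0, 'B': 0.75}
    print(ce_V)   # {'A': 0.75, 'B': 0.75}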

18
Learning An Action-Value Function
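
The equation this slide displayed is the one-step backup for action values; reconstructed here from Section 6.4 of the book (first-edition notation), it estimates Q^π for the current behavior policy π and is applied after every transition from a nonterminal state s_t:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]

with Q(s_{t+1}, a_{t+1}) taken to be 0 whenever s_{t+1} is terminal. The quintuple (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}) is what gives the Sarsa rule of the next slide its name.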
19
Sarsa: On-Policy TD Control
Turn this into a control method by always
updating the policy to be greedy with respect to
the current estimate
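
A minimal tabular Sarsa sketch with an ε-greedy policy; the environment interface (reset()/step(a) returning state, reward, done), the action list, and the parameter values are assumptions for illustration.

    import random
    from collections import defaultdict

    def sarsa(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
        """On-policy TD control: learn Q with the Sarsa update under epsilon-greedy."""
        Q = defaultdict(float)

        def eps_greedy(s):
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(s, a)])

        for _ in range(num_episodes):
            s, done = env.reset(), False
            a = eps_greedy(s)
            while not done:
                s_next, r, done = env.step(a)
                a_next = eps_greedy(s_next)      # chosen by the same policy being improved
                target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s, a = s_next, a_next
            # The policy improves implicitly: eps_greedy always reads the latest Q.
        return Q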
20
History of SARSA
  • First proposed by Rummery and Niranjan (1994)
    under the name modified Q-learning
  • Sutton introduced the name SARSA in 1996
  • Convergence of tabular case by
  • Singh, Jaakkola, Littman and Szepesvari (2000)

21
Windy Gridworld
undiscounted, episodic, reward = −1 until the goal is reached
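
A sketch of the gridworld itself, compatible with the sarsa() sketch above; the grid size (7 x 10), wind strengths, start (3, 0), and goal (3, 7) follow the book's example, while the class interface and names are my own.

    class WindyGridworld:
        """7x10 grid; wind pushes the agent upward by the current column's strength."""
        WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
        MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

        def reset(self):
            self.pos = (3, 0)                         # start state
            return self.pos

        def step(self, action):
            row, col = self.pos
            d_row, d_col = self.MOVES[action]
            row = min(max(row + d_row - self.WIND[col], 0), 6)  # wind from the old column
            col = min(max(col + d_col, 0), 9)
            self.pos = (row, col)
            done = self.pos == (3, 7)                 # goal state
            return self.pos, -1.0, done               # reward -1 every step until the goal

    # e.g.: Q = sarsa(WindyGridworld(), list(WindyGridworld.MOVES), num_episodes=200)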
22
Results of Sarsa on the Windy Gridworld
23
Q-Learning: Off-Policy TD Control
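
For contrast with Sarsa, a sketch of one-step Q-learning under the same assumed interface; the bootstrapped target uses the max over next actions regardless of the action the behavior policy actually takes, which is what makes it off-policy.

    import random
    from collections import defaultdict

    def q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
        """Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))"""
        Q = defaultdict(float)
        for _ in range(num_episodes):
            s, done = env.reset(), False
            while not done:
                # Behave epsilon-greedily...
                if random.random() < epsilon:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda b: Q[(s, b)])
                s_next, r, done = env.step(a)
                # ...but back up the greedy (max) action value of the next state.
                max_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
                Q[(s, a)] += alpha * (r + gamma * max_next - Q[(s, a)])
                s = s_next
        return Q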
24
History of Q-learning
  • First introduced by Watkins (1989) with some
    convergence analysis
  • Better convergence analysis by Watkins and Dayan
    (1992)
  • General convergence results by
  • Jaakkola, Jordan and Singh (1994)
  • Tsitsiklis (1994)
  • Finite sample convergence (convergence rate) by
    Szepesvari (1999)

25
Cliffwalking
ε-greedy, ε = 0.1
26
The Book
  • Part I: The Problem
  • Introduction
  • Evaluative Feedback
  • The Reinforcement Learning Problem
  • Part II: Elementary Solution Methods
  • Dynamic Programming
  • Monte Carlo Methods
  • Temporal Difference Learning
  • Part III: A Unified View
  • Eligibility Traces
  • Generalization and Function Approximation
  • Planning and Learning
  • Dimensions of Reinforcement Learning
  • Case Studies

27
Unified View
28
System Path and Markov Chain
  • System path
  • Probability of a path
  • Return of a path (all three written out below)
  • Under a fixed policy π, an MDP becomes a Markov
    chain
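
These quantities were displayed as equations on the slide; a standard formulation (my notation, for a path of length T with discount γ) is:

    path:              h = (s_0, a_0, s_1, a_1, ..., s_T)
    probability of h:  P(h) = Π_{t=0}^{T-1} π(a_t | s_t) P(s_{t+1} | s_t, a_t)
    return of h:       R(h) = Σ_{t=0}^{T-1} γ^t r_{t+1}

With π fixed, the action choice folds into the transition probabilities, P(s′ | s) = Σ_a π(a | s) P(s′ | s, a), which is exactly a Markov chain over states.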

29
Classification of MDPs
  • 1 and 2 communicate
  • 1 and 3 do not communicate
  • 1 is recurrent
  • 3 is transient
  • When two states communicate, they are of the same
    type
  • 1 and 2 are recurrent
  • 3 and 4 are transient
  • States that communicate with each other form a
    subset called a class
  • 1 and 2 form a recurrent class
  • 3 and 4 form a transient class
  • A recurrent class R is an absorbing class if no
    state outside R can be reached from any state
    in R

30
Classification of MDPs
  • Recurrent or Ergodic if the Markov chain
    corresponding to every deterministic stationary
    policy consists of a single recurrent class
  • Unichain if the Markov chain corresponding to
    every deterministic stationary policy consists of
    a single recurrent class plus a possibly empty
    set of transient states
  • Communicating if for every pair of states s and
    s′ in S, there exists a deterministic stationary
    policy under which s′ is accessible from s

31
Asymptotic Behavior of a Markov Chain
  • Stationary Distribution: if the Markov chain has
    a single recurrent class, then a stationary
    distribution exists and satisfies the balance
    equation (written out below)
  • Example
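
The condition referred to above, written out in my notation (ψ for the stationary distribution, P for the chain's transition probabilities):

    ψ(s′) = Σ_s ψ(s) P(s′ | s)  for all s′,   with  Σ_s ψ(s) = 1

i.e., ψ is a left eigenvector of the transition matrix with eigenvalue 1, and it gives the long-run fraction of time the chain spends in each state.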

32
Average Reward Optimality
  • Average reward of state s under policy π
    (equations below)
  • If the MDP is unichain, then the average reward
    of policy π is the same for every start state
  • Optimal policy: a policy with maximum average
    reward
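
The equations behind these bullets, in my notation (ρ for the average reward):

    ρ^π(s) = lim_{n→∞} (1/n) E[ r_1 + r_2 + ... + r_n | s_0 = s, π ]
    unichain case:   ρ^π(s) = ρ(π), the same for every start state s
    optimality:      ρ* = max_π ρ(π)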

33
Average Adjusted Value Function
  • Differential (also called Relative or
    Average-Adjusted) Value Function
  • Theorem: for any unichain MDP, there exists a
    value function and a scalar average reward
    satisfying the equation written out below
  • The greedy policy with respect to that value
    function achieves the optimal average reward,
    i.e., the maximum average reward over all
    policies
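
A sketch of that equation, in my notation with h* for the differential value function and ρ* for the optimal average reward:

    h*(s) + ρ* = max_a [ r(s, a) + Σ_{s′} P(s′ | s, a) h*(s′) ]   for all s

A greedy policy with respect to h* then attains ρ* = max_π ρ(π).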

34
R-Learning
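
A minimal sketch of the R-learning loop, following the algorithm box in the book's Section 6.7; the continuing-task environment interface (step(a) returning state and reward only), the action list, and the parameter values are assumptions.

    import random
    from collections import defaultdict

    def r_learning(env, actions, num_steps, alpha=0.1, beta=0.01, epsilon=0.1):
        """Average-reward, off-policy TD control (R-learning), as a rough sketch."""
        Q = defaultdict(float)        # action values relative to the average reward
        rho = 0.0                     # running estimate of average reward per step
        s = env.reset()
        for _ in range(num_steps):
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            was_greedy = Q[(s, a)] == max(Q[(s, b)] for b in actions)
            s_next, r = env.step(a)                  # continuing task: no terminal states
            max_next = max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (r - rho + max_next - Q[(s, a)])
            if was_greedy:            # adjust rho only on non-exploratory steps
                rho += beta * (r - rho + max_next - max(Q[(s, b)] for b in actions))
            s = s_next
        return Q, rho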
35
Access-Control Queuing Task
Apply R-learning
  • n servers
  • Customers have four different priorities, which
    pay reward of 1, 2, 4, or 8, if served
  • At each time step, customer at head of queue is
    accepted (assigned to a server) or removed from
    the queue
  • Proportion of randomly distributed high priority
    customers in queue is h
  • Busy server becomes free with probability p on
    each time step
  • Statistics of arrivals and departures are unknown

n = 10, h = 0.5, p = 0.06
36
Afterstates
  • Usually, a state-value function evaluates states
    in which the agent can take an action.
  • But sometimes it is useful to evaluate states
    after the agent has acted, as in tic-tac-toe.
  • Why is this useful?
  • What is this in general?

37
Summary
  • TD prediction
  • Introduced one-step tabular model-free TD methods
  • Extend prediction to control by employing some
    form of GPI
  • On-policy control: Sarsa
  • Off-policy control: Q-learning and R-learning
  • These methods bootstrap and sample, combining
    aspects of DP and MC methods

38
Questions
  • What can I tell you about RL?
  • What is common to all three classes of methods?
    DP, MC, TD
  • What are the principal strengths and weaknesses
    of each?
  • In what sense is our RL view complete?
  • In what senses is it incomplete?
  • What are the principal things missing?
  • The broad applicability of these ideas
  • What does the term bootstrapping refer to?
  • What is the relationship between DP and learning?