From Sutton - PowerPoint PPT Presentation

Slides: 34
Provided by: andyb1

Transcript and Presenter's Notes

1
Reinforcement Learning: An Introduction
  • From Sutton & Barto

2
Chapter 7: Eligibility Traces
3
N-step TD Prediction
  • Idea: Look farther into the future when you do a TD
    backup (1, 2, 3, ..., n steps)

4
Mathematics of N-step TD Prediction
  • Monte Carlo
  • TD
  • Use V to estimate remaining return
  • n-step TD
  • 2-step return
  • n-step return
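A sketch of the returns named above, in the notation of the first edition of Sutton and Barto's book. The symbols are assumptions about what the slide showed: r_{t+1} is the reward following step t, V_t is the current value estimate, and T is the final time step.

```latex
\begin{align*}
\text{Monte Carlo:} \quad & R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T \\
\text{TD (1-step):} \quad & R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1}) \\
\text{2-step:} \quad & R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2}) \\
\text{n-step:} \quad & R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})
\end{align*}
```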

5
Learning with N-step Backups
  • Backup (on-line or off-line)
  • Error reduction property of n-step returns
  • Using this, you can show that n-step methods
    converge
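For reference, the n-step backup and the error reduction property can be written as follows. This is a sketch in the book's first-edition notation, where R_t^{(n)} denotes the n-step return and V^π the true value function of the policy; the slide's own equations may have been laid out differently.

```latex
\begin{align*}
\Delta V_t(s_t) &= \alpha \left[ R_t^{(n)} - V_t(s_t) \right] \\
\max_s \left| E_\pi\!\left\{ R_t^{(n)} \mid s_t = s \right\} - V^\pi(s) \right|
  &\le \gamma^n \max_s \left| V(s) - V^\pi(s) \right|
\end{align*}
```

The second line says that the worst-case error of the expected n-step return is at most γ^n times the worst-case error of the current estimate V.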

6
Random Walk Examples
  • How does 2-step TD work here?
  • How about 3-step TD?

7
A Larger Example
  • Task: 19-state random walk
  • Do you think there is an optimal n? For
    everything?

8
Averaging N-step Returns
One backup
  • n-step methods were introduced to help with TD(λ)
    understanding
  • Idea: back up an average of several returns
  • e.g. back up half of 2-step and half of 4-step
  • Called a complex backup
  • Draw each component
  • Label with the weights for that component
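A minimal Python sketch of a complex backup target, assuming an episode stored as plain lists. The function names `n_step_return` and `complex_return` and the list layout are illustrative choices, not something taken from the slides.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from time t.

    rewards[k] is the reward received on leaving step k (r_{k+1} in the book);
    values[k] is the current estimate V(s_k). The episode ends after
    len(rewards) steps, and the terminal state has value 0.
    """
    T = len(rewards)
    steps = min(n, T - t)  # truncate at the end of the episode
    G = sum(gamma**k * rewards[t + k] for k in range(steps))
    if t + n < T:  # bootstrap only if s_{t+n} is non-terminal
        G += gamma**n * values[t + n]
    return G


def complex_return(rewards, values, t, weights, gamma):
    """Weighted average of n-step returns; weights maps n -> weight."""
    assert abs(sum(weights.values()) - 1.0) < 1e-12, "weights must sum to 1"
    return sum(w * n_step_return(rewards, values, t, n, gamma)
               for n, w in weights.items())


rewards = [0.0, 0.0, 0.0, 0.0, 1.0]   # five-step episode, reward at the end
values = [0.1, 0.2, 0.3, 0.4, 0.5]    # current estimates V(s_0), ..., V(s_4)

# Half of the 2-step return plus half of the 4-step return, as on the slide.
target = complex_return(rewards, values, t=0, weights={2: 0.5, 4: 0.5}, gamma=1.0)
```

With these made-up numbers the 2-step return is V(s_2) = 0.3 and the 4-step return is V(s_4) = 0.5, so the averaged target is 0.4.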

9
Forward View of TD(λ)
  • TD(λ) is a method for averaging all n-step
    backups
  • weight by λ^(n-1) (time since visitation)
  • λ-return
  • Backup using λ-return
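In the book's first-edition notation, the λ-return and the backup that uses it can be sketched as below. R_t^{(n)} is the n-step return, and the factor (1 - λ) normalizes the weights so that they sum to 1.

```latex
\begin{align*}
R_t^\lambda &= (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)} \\
\Delta V_t(s_t) &= \alpha \left[ R_t^\lambda - V_t(s_t) \right]
\end{align*}
```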

10
λ-return Weighting Function
Until termination
After termination
11
Relation to TD(0) and MC
  • λ-return can be rewritten as
  • If λ = 1, you get MC
  • If λ = 0, you get TD(0)

Until termination
After termination
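The rewritten form splits the λ-return into the n-step returns that bootstrap before the episode ends ("until termination") and the complete return, which receives all of the remaining weight λ^(T-t-1) ("after termination"). The Python sketch below follows that split; the function name `lambda_return` and the list layout are assumptions made for illustration.

```python
def lambda_return(rewards, values, t, gamma, lam):
    """lambda-return from time t for an episode of T = len(rewards) steps.

    rewards[k] is r_{k+1}; values[k] is the current estimate V(s_k);
    the terminal state s_T has value 0.
    """
    T = len(rewards)
    discounted_sum = 0.0  # r_{t+1} + ... + gamma^(n-1) r_{t+n}, built up step by step
    total = 0.0
    for n in range(1, T - t):  # n-step returns that still bootstrap from V
        discounted_sum += gamma**(n - 1) * rewards[t + n - 1]
        n_step = discounted_sum + gamma**n * values[t + n]
        total += (1 - lam) * lam**(n - 1) * n_step   # "until termination" part
    # Add the final reward to obtain the complete (Monte Carlo) return R_t.
    discounted_sum += gamma**(T - t - 1) * rewards[T - 1]
    total += lam**(T - t - 1) * discounted_sum       # "after termination" part
    return total


rewards = [0.0, 0.0, 1.0]   # three-step episode with made-up numbers
values = [0.2, 0.4, 0.6]

td0_target = lambda_return(rewards, values, 0, gamma=0.9, lam=0.0)  # r_1 + gamma * V(s_1)
mc_target = lambda_return(rewards, values, 0, gamma=0.9, lam=1.0)   # full discounted return
```

Setting `lam=0.0` leaves only the 1-step return (the TD(0) target), and `lam=1.0` leaves only the complete return (the MC target).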
12
Forward View of TD(λ) II
  • Look forward from each state to determine update
    from future states and rewards

13
λ-return on the Random Walk
  • Same 19-state random walk as before
  • Why do you think intermediate values of λ are
    best?

14
Backward View
  • Shout δ_t backwards over time
  • The strength of your voice decreases with
    temporal distance by γλ

15
Backward View of TD(λ)
  • The forward view was for theory
  • The backward view is for mechanism
  • New variable called eligibility trace
  • On each step, decay all traces by γλ and
    increment the trace for the current state by 1
  • Accumulating trace
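Written out in the book's first-edition notation, the backward view can be sketched as three equations: the TD error δ_t that is "shouted" backwards, the accumulating trace e_t(s), and the resulting update applied to every state.

```latex
\begin{align*}
\delta_t &= r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t) \\
e_t(s) &=
  \begin{cases}
    \gamma \lambda \, e_{t-1}(s) & \text{if } s \ne s_t \\
    \gamma \lambda \, e_{t-1}(s) + 1 & \text{if } s = s_t
  \end{cases} \\
\Delta V_t(s) &= \alpha \, \delta_t \, e_t(s) \quad \text{for all } s
\end{align*}
```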

16
On-line Tabular TD(λ)
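A minimal Python sketch of on-line tabular TD(λ) with accumulating traces, applied to one recorded episode. The function name `td_lambda_episode` and the `(state, reward, next_state)` tuple layout are assumptions made for this example, not the slide's own pseudocode.

```python
import numpy as np


def td_lambda_episode(V, transitions, alpha, gamma, lam):
    """Apply on-line tabular TD(lambda) with accumulating traces to one episode.

    V is a 1-D float array of state values, updated in place.
    transitions is a list of (state, reward, next_state) tuples, with
    next_state set to None when the episode terminates.
    """
    e = np.zeros_like(V)                   # eligibility traces, reset each episode
    for s, r, s_next in transitions:
        v_next = 0.0 if s_next is None else V[s_next]
        delta = r + gamma * v_next - V[s]  # TD error for this step
        e *= gamma * lam                   # decay every trace
        e[s] += 1.0                        # accumulate for the current state
        V += alpha * delta * e             # update all states at once
    return V


V = np.zeros(3)
episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, None)]  # reward only on the last step
td_lambda_episode(V, episode, alpha=0.5, gamma=1.0, lam=0.5)
```

In this toy episode the single non-zero TD error on the last step is credited to all three visited states in proportion to their traces (0.25, 0.5, 1), giving values 0.125, 0.25 and 0.5.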
17
Relation of Backwards View to MC & TD(0)
  • Using update rule
  • As before, if you set λ to 0, you get TD(0)
  • If you set λ to 1, you get MC but in a better way
  • Can apply TD(1) to continuing tasks
  • Works incrementally and on-line (instead of
    waiting until the end of the episode)

18
Forward View = Backward View
  • The forward (theoretical) view of TD(λ) is
    equivalent to the backward (mechanistic) view for
    off-line updating
  • The book shows
  • On-line updating with small α is similar

algebra shown in book
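The equivalence result for off-line updating, as stated in the book's first edition, can be sketched as follows. The left side sums the backward-view (TD) updates over an episode, the right side sums the forward-view (λ-return) updates, and I_{ss_t} is an indicator that equals 1 when s = s_t and 0 otherwise.

```latex
\[
\sum_{t=0}^{T-1} \Delta V_t^{TD}(s)
  = \sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t) \, I_{s s_t}
  \qquad \text{for all } s
\]
```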
19
n-step TD vs TD(λ)
  • Same 19-state random walk
  • TD(l) performs a bit better

20
Control: Sarsa(λ)
  • Save eligibility for state-action pairs instead
    of just states

21
Sarsa(λ) Algorithm
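A hedged Python sketch of a single Sarsa(λ) update with accumulating traces, keeping one trace per state-action pair. The function name `sarsa_lambda_step` and the array layout are illustrative assumptions; action selection (for example ε-greedy) is left to the caller.

```python
import numpy as np


def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, alpha, gamma, lam):
    """One on-line Sarsa(lambda) update with accumulating traces.

    Q and E are float arrays of shape (n_states, n_actions), updated in place.
    Pass s_next=None for a terminal transition.
    """
    q_next = 0.0 if s_next is None else Q[s_next, a_next]
    delta = r + gamma * q_next - Q[s, a]  # on-policy TD error
    E[s, a] += 1.0                        # mark the pair just taken
    Q += alpha * delta * E                # update every state-action pair
    E *= gamma * lam                      # decay all traces for the next step


Q = np.zeros((2, 2))
E = np.zeros((2, 2))
sarsa_lambda_step(Q, E, s=0, a=1, r=0.0, s_next=1, a_next=0,
                  alpha=0.5, gamma=0.9, lam=0.8)
sarsa_lambda_step(Q, E, s=1, a=0, r=1.0, s_next=None, a_next=None,
                  alpha=0.5, gamma=0.9, lam=0.8)
```

After the rewarded second step, the earlier pair (0, 1) also moves toward the reward, scaled by its decayed trace of 0.72.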
22
Sarsa(λ) Gridworld Example
  • With one trial, the agent has much more
    information about how to get to the goal
  • not necessarily the best way
  • Can considerably accelerate learning

23
Three Approaches to Q(λ)
  • How can we extend this to Q-learning?
  • If you mark every state-action pair as eligible,
    you back up over a non-greedy policy
  • Watkins: Zero out the eligibility trace after a
    non-greedy action. Do max when backing up at
    first non-greedy choice.

24
Watkins's Q(λ)
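A minimal Python sketch of one Watkins's Q(λ) update, following the rule on the previous slide: back up the max, then cut all traces if the next action is non-greedy. The function name `watkins_q_lambda_step` and its return value are assumptions made for illustration.

```python
import numpy as np


def watkins_q_lambda_step(Q, E, s, a, r, s_next, a_next, alpha, gamma, lam):
    """One Watkins's Q(lambda) update with accumulating traces.

    Q and E are float arrays of shape (n_states, n_actions), updated in place.
    a_next is the action the behaviour policy will actually take in s_next.
    Pass s_next=None for a terminal transition.
    Returns True if the traces were cut because a_next was non-greedy.
    """
    if s_next is None:
        best_next, greedy = 0.0, True
    else:
        best_next = np.max(Q[s_next])
        # Decide greediness before Q changes; ties for the max count as greedy.
        greedy = bool(Q[s_next, a_next] == best_next)
    delta = r + gamma * best_next - Q[s, a]  # back up the max, as in Q-learning
    E[s, a] += 1.0
    Q += alpha * delta * E
    if greedy:
        E *= gamma * lam                     # keep (decayed) traces
    else:
        E[:] = 0.0                           # exploratory action: cut all traces
    return not greedy


Q = np.array([[0.0, 0.0], [1.0, 2.0]])
E = np.zeros((2, 2))
# The next action 0 in state 1 is non-greedy (Q = 1.0 < 2.0), so traces are cut.
cut = watkins_q_lambda_step(Q, E, s=0, a=0, r=0.0, s_next=1, a_next=0,
                            alpha=0.5, gamma=1.0, lam=0.9)
```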
25
Peng's Q(λ)
  • Disadvantage to Watkins's method
  • Early in learning, the eligibility trace will be
    cut (zeroed out) frequently resulting in little
    advantage to traces
  • Peng
  • Backup max action except at end
  • Never cut traces
  • Disadvantage
  • Complicated to implement

26
Naïve Q(λ)
  • Idea: is it really a problem to back up
    exploratory actions?
  • Never zero traces
  • Always back up max at current action (unlike Peng's
    or Watkins's)
  • Is this truly naïve?
  • Works well in preliminary empirical studies

What is the backup diagram?
27
Comparison Task
  • Compared Watkins's, Peng's, and Naïve (called
    McGovern's here) Q(λ) on several tasks.
  • See McGovern and Sutton (1997). Towards a Better
    Q(λ) for other tasks and results (stochastic
    tasks, continuing tasks, etc.)
  • Deterministic gridworld with obstacles
  • 10x10 gridworld
  • 25 randomly generated obstacles
  • 30 runs
  • α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05,
    accumulating traces

From McGovern and Sutton (1997). Towards a
better Q(λ)
28
Comparison Results
From McGovern and Sutton (1997). Towards a
better Q(λ)
29
Convergence of the Q(λ)s
  • None of the methods are proven to converge.
  • Much extra credit if you can prove any of them.
  • Watkins's is thought to converge to Q*
  • Peng's is thought to converge to a mixture of Q^π
    and Q*
  • Naïve: Q*?

30
Eligibility Traces for Actor-Critic Methods
  • Critic: On-policy learning of V^π. Use TD(λ) as
    described before.
  • Actor: Needs eligibility traces for each
    state-action pair.
  • We change the update equation
  • Can change the other actor-critic update

to
to
where
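A sketch of the two changes, reconstructed from the book's first-edition treatment of actor-critic methods with traces. Here p_t(s,a) are the actor's action preferences and π_t(s,a) the resulting action probabilities; the step-size symbol α and the exact layout are assumptions, since the slide's equations are not available.

```latex
\begin{align*}
\text{Simple actor update:} \quad
  & p_{t+1}(s_t, a_t) = p_t(s_t, a_t) + \alpha \delta_t
  \;\longrightarrow\;
  p_{t+1}(s, a) = p_t(s, a) + \alpha \delta_t e_t(s, a) \\
\text{Other actor update:} \quad
  & p_{t+1}(s_t, a_t) = p_t(s_t, a_t) + \alpha \delta_t \left[ 1 - \pi_t(s_t, a_t) \right]
  \;\longrightarrow\;
  p_{t+1}(s, a) = p_t(s, a) + \alpha \delta_t e_t(s, a) \\
\text{where} \quad
  & e_t(s, a) =
  \begin{cases}
    \gamma \lambda \, e_{t-1}(s, a) + 1 - \pi_t(s_t, a_t) & \text{if } s = s_t \text{ and } a = a_t \\
    \gamma \lambda \, e_{t-1}(s, a) & \text{otherwise}
  \end{cases}
\end{align*}
```

In the first case the trace e_t(s,a) is the ordinary accumulating trace for state-action pairs, as in Sarsa(λ).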
31
Replacing Traces
  • Using accumulating traces, frequently visited
    states can have eligibilities greater than 1
  • This can be a problem for convergence
  • Replacing traces: Instead of adding 1 when you
    visit a state, set that trace to 1
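The difference between the two trace types can be sketched in a few lines of Python. The helper name `update_traces` and the repeated-visit scenario are assumptions chosen to show accumulating traces growing past 1.

```python
import numpy as np


def update_traces(e, s, gamma, lam, replacing=False):
    """Decay all traces by gamma * lambda, then mark state s as just visited."""
    e *= gamma * lam
    if replacing:
        e[s] = 1.0    # replacing trace: reset to 1
    else:
        e[s] += 1.0   # accumulating trace: add 1
    return e


acc = np.zeros(3)
rep = np.zeros(3)
for _ in range(3):  # visit state 0 three times in a row
    update_traces(acc, 0, gamma=0.9, lam=0.9)
    update_traces(rep, 0, gamma=0.9, lam=0.9, replacing=True)
```

After three consecutive visits with γλ = 0.81, the accumulating trace for state 0 has grown to about 2.47, while the replacing trace stays at 1.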

32
Replacing Traces Example
  • Same 19-state random walk task as before
  • Replacing traces perform better than accumulating
    traces over more values of λ

33
Why Replacing Traces?
  • Replacing traces can significantly speed learning
  • They can make the system perform well for a
    broader set of parameters
  • Accumulating traces can do poorly on certain
    types of tasks

Why is this task particularly onerous for
accumulating traces?
34
More Replacing Traces
  • Off-line replacing trace TD(1) is identical to
    first-visit MC
  • Extension to action-values
  • When you revisit a state, what should you do with
    the traces for the other actions?
  • Singh and Sutton say to set them to zero
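The Singh and Sutton recommendation can be sketched as a replacing trace for state-action pairs, in the form given in the book's first edition: the action taken gets trace 1, the other actions in the revisited state are reset to 0, and all other pairs decay.

```latex
\[
e_t(s, a) =
  \begin{cases}
    1 & \text{if } s = s_t \text{ and } a = a_t \\
    0 & \text{if } s = s_t \text{ and } a \ne a_t \\
    \gamma \lambda \, e_{t-1}(s, a) & \text{if } s \ne s_t
  \end{cases}
\]
```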

35
Implementation Issues with Traces
  • Could require much more computation
  • But most eligibility traces are VERY close to
    zero
  • If you implement it in Matlab, backup is only one
    line of code and is very fast (Matlab is
    optimized for matrices)
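The same one-line backup can be sketched with NumPy arrays, used here as a stand-in for Matlab. The numbers are made up for illustration.

```python
import numpy as np

alpha, delta = 0.1, 2.0                    # illustrative step size and TD error
V = np.zeros(5)                            # state values
e = np.array([0.0, 0.05, 0.3, 1.0, 0.0])   # eligibility traces

V += alpha * delta * e                     # the whole backup, one vectorized line
```

States whose trace is zero are left unchanged, which is why near-zero traces cost little beyond the array operation itself.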

36
Variable λ
  • Can generalize to variable λ
  • Here λ is a function of time
  • Could define
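With a time-varying λ_t, the accumulating trace can be sketched as below, following the book's first edition. Making λ depend on the current state is one possible definition, shown on the last line.

```latex
\begin{align*}
e_t(s) &=
  \begin{cases}
    \gamma \lambda_t \, e_{t-1}(s) & \text{if } s \ne s_t \\
    \gamma \lambda_t \, e_{t-1}(s) + 1 & \text{if } s = s_t
  \end{cases} \\
\lambda_t &= \lambda(s_t)
\end{align*}
```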

37
Conclusions
  • Provides efficient, incremental way to combine MC
    and TD
  • Includes advantages of MC (can deal with lack of
    Markov property)
  • Includes advantages of TD (using TD error,
    bootstrapping)
  • Can significantly speed learning
  • Does have a cost in computation

38
The two views