1
Reinforcement Learning: Eligibility Traces
2
Content
  • n-step TD prediction
  • Forward View of TD(λ)
  • Backward View of TD(λ)
  • Equivalence of the Forward and Backward Views
  • Sarsa(λ)
  • Q(λ)
  • Eligibility Traces for Actor-Critic Methods
  • Replacing Traces
  • Implementation Issues

3
Reinforcement Learning: Eligibility Traces
  • n-Step TD Prediction

4
Elementary Methods
Monte Carlo Methods
Dynamic Programming
TD(0)
5
Monte Carlo vs. TD(0)
  • Monte Carlo
  • observes the rewards for all steps in an episode
  • TD(0)
  • observes only one step

6
n-Step TD Prediction
7
n-Step TD Prediction
8
Backups
Monte Carlo
TD(0)
n-step TD
9
n-Step TD Backup
online
offline
When updating offline, the new V(s) takes effect only in the next
episode.
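The equations on this slide are images in the original; in the standard notation (assumed here), the n-step return and the corresponding update are

  R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})
  \Delta V_t(s_t) = \alpha \left[ R_t^{(n)} - V_t(s_t) \right]

With online updating the increment is applied at each step; with offline updating the increments are accumulated and applied only at the end of the episode.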
10
Error Reduction Property
online
offline
[Equation comparing the maximum error using the n-step return with the maximum error using V (the current value estimate)]
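The inequality itself is not in the transcript; it presumably takes the standard form, bounding the worst-case error of the expected n-step return by \gamma^n times the worst-case error of the current estimate:

  \max_s \left| E_\pi\{ R_t^{(n)} \mid s_t = s \} - V^\pi(s) \right| \le \gamma^n \max_s \left| V_t(s) - V^\pi(s) \right|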
11
Example (Random Walk)
Consider 2-step TD, 3-step TD, and so on.
Which n is optimal?
12
Example (19-state Random Walk)
13
Exercise (Random Walk)
14
Exercise (Random Walk)
  • Evaluate the value function for the random policy.
  • Approximate the value function using n-step TD (try
    different values of n and α), and compare their
    performance.
  • Find the optimal policy.

15
Reinforcement Learning: Eligibility Traces
  • The Forward View of TD(λ)

16
Averaging n-step Returns
  • We are not limited to using a single n-step TD
    return
  • For example, we could average several n-step
    returns, such as
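The example equation is missing from the transcript; any non-negative weights summing to one will do, for instance an equal mix of the 2-step and 4-step returns:

  R_t^{avg} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}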

17
TD(λ) and the λ-Return
  • TD(λ) is a method for averaging all n-step
    backups
  • each n-step return is weighted by λ^{n-1}
    (where n counts the steps since the visit)
  • the resulting average is called the λ-return
  • backup using the λ-return
[Backup diagram: n-step backups with weights w1, w2, w3, ..., w(T-t-1)]
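In the notation assumed above, the λ-return averages all n-step returns with geometrically decaying weights; for an episodic task ending at time T, the leftover weight goes to the complete return R_t:

  R_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}
  R_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t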
21
TD(λ) and the λ-Return
What happens when λ = 0? When λ = 1?
  • λ = 0 gives TD(0); λ = 1 gives the Monte Carlo return
22
Forward View of TD(λ)
A theoretical view
23
TD(λ) on the Random Walk
24
Reinforcement Learning: Eligibility Traces
  • The Backward View of TD(λ)

25
Why Backward View?
  • The forward view is acausal
  • Not implementable
  • The backward view is causal
  • Implementable
  • In the offline case, it achieves the same result
    as the forward view

26
Eligibility Traces
  • Each state is associated with an additional
    memory variable ? eligibility trace, defined by
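The definition itself is an image in the original slides; the accumulating trace is presumably the standard one,

  e_t(s) = \gamma \lambda \, e_{t-1}(s)        if s \ne s_t
  e_t(s) = \gamma \lambda \, e_{t-1}(s) + 1    if s = s_t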

29
Eligibility as Recency of Visiting
  • At any time, the traces record which states have
    recently been visited, where "recently" is
    defined in terms of γλ.
  • The traces indicate the degree to which each
    state is eligible for undergoing learning changes
    should a reinforcing event occur.
  • Reinforcing event:

The moment-by-moment 1-step TD errors
30
Reinforcing Event
The moment-by-moment 1-step TD errors
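In the notation used so far, this 1-step TD error is presumably

  \delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)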
31
TD(λ)
Eligibility Traces
Reinforcing Events
Value updates
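Putting the pieces together, the value update applied to every state at every step is presumably

  \Delta V_t(s) = \alpha \, \delta_t \, e_t(s)    for all s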
32
Online TD(λ)
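The algorithm on this slide appears only as a figure; the following is a minimal tabular sketch of online TD(λ) with accumulating traces. The environment interface (env.reset(), env.step(a) returning (next_state, reward, done)) and the policy callable are illustrative assumptions, not part of the original slides.

import numpy as np

def td_lambda(env, policy, n_states, alpha=0.1, gamma=0.9, lam=0.9, episodes=100):
    V = np.zeros(n_states)                   # state-value estimates
    for _ in range(episodes):
        e = np.zeros(n_states)               # eligibility traces
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # 1-step TD error (the reinforcing event)
            delta = r + gamma * V[s_next] * (not done) - V[s]
            e[s] += 1.0                      # accumulating trace for the visited state
            V += alpha * delta * e           # update all states in proportion to their traces
            e *= gamma * lam                 # decay all traces
            s = s_next
    return V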
33
Backward View of TD(λ)
34
Backward View vs. MC and TD(0)
  • Setting λ to 0, we recover TD(0)
  • Setting λ to 1, we get MC, but in a better way
  • TD(1) can be applied to continuing tasks
  • It works incrementally and online (instead of
    waiting until the end of the episode)

What about 0 < λ < 1?
35
Reinforcement Learning: Eligibility Traces
  • Equivalence of the Forward and Backward Views

36
Offline TD(λ) Algorithms
Offline forward TD(λ) (the λ-return algorithm)
Offline backward TD(λ)
37
Forward View ≡ Backward View
Forward updates
Backward updates
See the proof
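The updates on this slide are images; the offline equivalence presumably being asserted is that, summed over an episode, the backward-view (trace-based) updates for each state equal the forward-view (λ-return) updates:

  \sum_{t=0}^{T-1} \alpha \, \delta_t \, e_t(s) \;=\; \sum_{t=0}^{T-1} \alpha \left[ R_t^{\lambda} - V_t(s_t) \right] \mathbb{1}\{ s_t = s \}    for all s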
39
TD(λ) on the Random Walk
40
Reinforcement Learning: Eligibility Traces
  • Sarsa(λ)

41
Sarsa(λ)
  • TD(λ):
  • use eligibility traces for policy evaluation
  • How can eligibility traces be used for control?
  • Learn Qt(s, a) rather than Vt(s).

42
Sarsa(λ)
Eligibility Traces
Reinforcing Events
Updates
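The equations on this slide are images; in the usual notation, Sarsa(λ) keeps a trace for every state-action pair and uses the 1-step Sarsa error:

  e_t(s,a) = \gamma \lambda \, e_{t-1}(s,a) + 1    if (s,a) = (s_t, a_t), else \gamma \lambda \, e_{t-1}(s,a)
  \delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)
  Q_{t+1}(s,a) = Q_t(s,a) + \alpha \, \delta_t \, e_t(s,a)    for all s, a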
43
Sarsa(λ)
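The pseudocode on this slide is a figure; below is a minimal tabular Sarsa(λ) sketch with accumulating traces and an ε-greedy behavior policy. The env interface and the helper names are illustrative assumptions.

import numpy as np

def epsilon_greedy(Q, s, eps, rng):
    # ε-greedy policy with respect to the current Q
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa_lambda(env, n_states, n_actions, alpha=0.05, gamma=0.9, lam=0.9,
                 eps=0.05, episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        e = np.zeros_like(Q)                 # one trace per state-action pair
        s = env.reset()
        a = epsilon_greedy(Q, s, eps, rng)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, s2, eps, rng)
            delta = r + gamma * Q[s2, a2] * (not done) - Q[s, a]
            e[s, a] += 1.0                   # accumulating trace
            Q += alpha * delta * e           # update all pairs in proportion to their traces
            e *= gamma * lam                 # decay all traces
            s, a = s2, a2
    return Q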
44
Sarsa(λ): Traces in the Grid World
  • After a single trial, the agent has much more
    information about how to get to the goal
  • not necessarily by the best path
  • This considerably accelerates learning

45
Reinforcement Learning: Eligibility Traces
  • Q(λ)

46
Q-Learning
  • An off-policy method
  • it breaks off from time to time to take
    exploratory actions
  • so a simple time trace cannot be applied directly
  • How can eligibility traces be combined with
    Q-learning?
  • Three methods
  • Watkins's Q(λ)
  • Peng's Q(λ)
  • Naïve Q(λ)

47
Watkins's Q(λ)
Estimation policy (e.g., greedy)
Behavior policy (e.g., ε-greedy)
Greedy Path
Non-Greedy Path
48
Backups in Watkins's Q(λ)
How should the eligibility traces be defined?
Two cases:
  • Both the behavior and estimation policies follow the
    greedy path.
  • The behavior policy takes a non-greedy action
    before the episode ends.

Case 1
Case 2
49
Watkins's Q(λ)
50
Watkins's Q(λ)
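Slides 49-50 show the trace definitions and pseudocode as figures; the sketch below illustrates Watkins's Q(λ) under the same assumed env interface as before, cutting the traces to zero whenever the behavior policy deviates from the greedy (estimation) policy.

import numpy as np

def watkins_q_lambda(env, n_states, n_actions, alpha=0.05, gamma=0.9, lam=0.9,
                     eps=0.05, episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        e = np.zeros_like(Q)
        s = env.reset()
        a = int(np.argmax(Q[s])) if rng.random() >= eps else int(rng.integers(n_actions))
        done = False
        while not done:
            s2, r, done = env.step(a)
            # behavior action (ε-greedy) and estimation (greedy) action at s2
            a2 = int(np.argmax(Q[s2])) if rng.random() >= eps else int(rng.integers(n_actions))
            a_star = int(np.argmax(Q[s2]))
            delta = r + gamma * Q[s2, a_star] * (not done) - Q[s, a]
            e[s, a] += 1.0
            Q += alpha * delta * e
            if a2 == a_star:
                e *= gamma * lam             # greedy step: decay traces as usual
            else:
                e[:] = 0.0                   # exploratory step: cut all traces
            s, a = s2, a2
    return Q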
51
Peng's Q(λ)
  • Cutting off traces loses much of the advantage of
    using eligibility traces.
  • If exploratory actions are frequent, as they
    often are early in learning, then only rarely
    will backups of more than one or two steps be
    done, and learning may be little faster than
    1-step Q-learning.
  • Peng's Q(λ) is an alternative version of Q(λ) meant
    to remedy this.

52
Backups in Peng's Q(λ)
Peng, J. and Williams, R. J. (1996). Incremental
Multi-Step Q-Learning. Machine Learning,
22(1/2/3).
  • Never cuts traces
  • Backs up the max action except at the end
  • The book says it outperforms Watkins's Q(λ) and
    performs almost as well as Sarsa(λ)
  • Disadvantage: difficult to implement

53
Peng's Q(λ)
Peng, J. and Williams, R. J. (1996). Incremental
Multi-Step Q-Learning. Machine Learning,
22(1/2/3).
See the reference above for notation.
54
Naïve Q(λ)
  • Idea: is it really a problem to back up
    exploratory actions?
  • Never zero the traces
  • Always back up the max at the current action (unlike
    Peng's or Watkins's)
  • Is this truly naïve?
  • Works well in preliminary empirical studies

55
Naïve Q(λ)
56
Comparisons
  • McGovern, Amy and Sutton, Richard S. (1997).
    Towards a better Q(λ). Presented at the Fall 1997
    Reinforcement Learning Workshop.
  • Deterministic gridworld with obstacles
  • 10x10 gridworld
  • 25 randomly generated obstacles
  • 30 runs
  • α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05
  • accumulating traces

57
Comparisons
58
Convergence of the Q(λ) Methods
  • None of the methods has been proven to converge.
  • Much extra credit if you can prove any of them.
  • Watkins's is thought to converge to Q*
  • Peng's is thought to converge to a mixture of Q^π
    and Q*
  • Naïve: Q^π?

59
Reinforcement Learning: Eligibility Traces
  • Eligibility Traces for Actor-Critic Methods

60
Actor-Critic Methods
  • Critic: on-policy learning of V^π. Use TD(λ) as
    described before.
  • Actor: needs eligibility traces for each
    state-action pair.

61
Policy Parameters Update
Method 1
62
Policy Parameters Update
Method 2
63
Reinforcement Learning: Eligibility Traces
  • Replacing Traces

64
Accumulating/Replacing Traces
Replacing Traces
Accumulating Traces
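In the usual notation, the two trace types on this slide differ only in how a visit to state s_t is handled (both decay by \gamma\lambda for all other states):

  accumulating:  e_t(s_t) = \gamma \lambda \, e_{t-1}(s_t) + 1
  replacing:     e_t(s_t) = 1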
65
Why Replacing Traces?
  • Using accumulating traces, frequently visited
    states can have eligibilities greater than 1
  • This can be a problem for convergence
  • Replacing traces can significantly speed learning
  • They can make the system perform well for a
    broader set of parameters
  • Accumulating traces can do poorly on certain
    types of tasks

66
Example (19-State Random Walk)
67
Extension to Action Values
  • When you revisit a state, what should you do with
    the traces for the other actions?
  • Singh and Sutton (1996) suggest setting the traces
    of all other actions from the revisited state to 0.

68
Reinforcement Learning: Eligibility Traces
  • Implementation Issues

69
Implementation Issues
  • In practice we cannot afford to maintain and update
    every trace, no matter how small.
  • Dropping very small trace values is recommended and
    encouraged.
  • If you implement it in Matlab, the backup is only one
    line of code and is very fast (Matlab is
    optimized for matrix operations).
  • Use with neural networks and backpropagation
    generally only doubles the needed computation.
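As one way to realize the suggestion above of dropping very small traces, the traces can be kept in a sparse map and pruned as they decay; this is an illustrative sketch, not code from the slides.

def decay_traces(traces, gamma, lam, threshold=1e-4):
    # traces: dict mapping state -> eligibility value
    decayed = {}
    for s, tr in traces.items():
        tr *= gamma * lam
        if tr > threshold:                   # keep only non-negligible traces
            decayed[s] = tr
    return decayed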

70
Variable λ
  • We can generalize to a variable λ
  • Here λ is a function of time
  • E.g.,
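The example itself is an image; with a time-dependent λ_t the accumulating trace presumably generalizes to

  e_t(s) = \gamma \lambda_t \, e_{t-1}(s) + 1    if s = s_t, else \gamma \lambda_t \, e_{t-1}(s)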

71
Reinforcement Learning: Eligibility Traces
  • Proof

72
Proof
An accumulating eligibility trace can be written
explicitly (non-recursively) as
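In the notation used throughout, the explicit form is presumably

  e_t(s) = \sum_{k=0}^{t} (\gamma \lambda)^{t-k} \, \mathbb{1}\{ s_k = s \}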
73
Proof
74
Proof
75
Proof