Title: Reinforcement Learning: Eligibility Traces
1. Reinforcement Learning: Eligibility Traces
2. Content
- n-step TD prediction
- Forward View of TD(λ)
- Backward View of TD(λ)
- Equivalence of the Forward and Backward Views
- Sarsa(λ)
- Q(λ)
- Eligibility Traces for Actor-Critic Methods
- Replacing Traces
- Implementation Issues
3. Reinforcement Learning: Eligibility Traces
- n-Step TD Prediction
4. Elementary Methods
Monte Carlo Methods
Dynamic Programming
TD(0)
5. Monte Carlo vs. TD(0)
- Monte Carlo
- observes the rewards for all steps in an episode
- TD(0)
- observes the reward for one step only
6. n-Step TD Prediction
7. n-Step TD Prediction
8. Backups
Monte Carlo
TD(0)
n-step TD
9. n-Step TD Backup
- online
- offline
When updating offline, the new V(s) is not used until the next episode.
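For reference, the n-step return and the corresponding backup used above are (in standard notation, with V_t the current value estimates):
R_t^{(n)} = r_{t+1} + γ r_{t+2} + ... + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})
ΔV_t(s_t) = α [ R_t^{(n)} − V_t(s_t) ]
Online: the increment is applied immediately, V_{t+1}(s) = V_t(s) + ΔV_t(s). Offline: the increments are accumulated within an episode and applied only at its end, which is why the new V(s) is first used in the next episode.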
10. Error Reduction Property
For both online and offline updating, the worst-case error of the expected n-step return is at most γ^n times the worst-case error of the current value estimate V:
max_s | E_π[ R_t^{(n)} | s_t = s ] − V^π(s) |  ≤  γ^n  max_s | V(s) − V^π(s) |
(maximum error using the n-step return vs. maximum error using the current value V).
11. Example (Random Walk)
Consider 2-step TD, 3-step TD, and so on. Which n is optimal?
12. Example (19-State Random Walk)
13. Exercise (Random Walk)
14. Exercise (Random Walk)
- Evaluate the value function for the random policy.
- Approximate the value function using n-step TD (try different values of n and α) and compare their performance.
- Find the optimal policy.
15. Reinforcement Learning: Eligibility Traces
- The Forward View of TD(λ)
16. Averaging n-Step Returns
- We are not limited to using a single n-step TD return.
- For example, we could back up toward an average of n-step TD returns, as in the example below.
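As a concrete illustration (the particular example on the original slide is not recoverable; this is a standard one): back up toward the average of the 2-step and 4-step returns,
R_t^avg = ½ R_t^{(2)} + ½ R_t^{(4)}.
Any set of non-negative weights that sum to 1 gives a valid backup in this way.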
17. TD(λ) and the λ-Return
- TD(λ) is a method for averaging all n-step backups.
- Each n-step backup is weighted in proportion to λ^{n−1} (decaying with the time since visitation).
- The resulting average is called the λ-return.
- Backup using the λ-return.
(Backup diagram: weights w_1, w_2, w_3, ..., w_{T−t−1} on the 1-step through terminal backups.)
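The λ-return itself appeared as an equation on the slide; in Sutton and Barto's notation it is
R_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} R_t^{(n)},
or, for an episode that terminates at time T,
R_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} R_t^{(n)} + λ^{T−t−1} R_t,
where R_t is the complete return. The factor (1 − λ) normalizes the weights λ^{n−1} so that they sum to 1.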
19. Forward View of TD(λ)
A theoretical view
20. TD(λ) on the Random Walk
21. Reinforcement Learning: Eligibility Traces
- The Backward View of TD(λ)
22. Why a Backward View?
- The forward view is acausal (it uses future rewards), so it is not directly implementable.
- The backward view is causal and implementable.
- In the offline case, it achieves exactly the same result as the forward view.
23. Eligibility Traces
- Each state is associated with an additional memory variable, its eligibility trace, defined below.
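The defining equation appeared as a figure on the original slide; the accumulating-trace definition it refers to is
e_t(s) = γλ e_{t−1}(s)        if s ≠ s_t
e_t(s) = γλ e_{t−1}(s) + 1    if s = s_t
where γ is the discount rate and λ is the trace-decay parameter.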
26. Eligibility ≈ Recency of Visiting
- At any time, the traces record which states have recently been visited, where "recently" is defined in terms of γλ.
- The traces indicate the degree to which each state is eligible for undergoing learning changes should a reinforcing event occur.
- Reinforcing event: the moment-by-moment 1-step TD errors.
27. Reinforcing Event
The moment-by-moment 1-step TD errors:
δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)
28. TD(λ)
- Eligibility traces
- Reinforcing events
- Value updates
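Written out (these were shown as equations on the slide; this is the standard tabular TD(λ) form they refer to, with 1(·) the indicator function):
e_t(s) = γλ e_{t−1}(s) + 1(s = s_t)                 (eligibility traces, accumulating)
δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)           (reinforcing event)
V_{t+1}(s) = V_t(s) + α δ_t e_t(s)   for all s      (value updates)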
29. Online TD(λ)
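The slide showed the boxed algorithm; a minimal tabular sketch in Python follows (the environment interface env.reset()/env.step() and the policy function are assumptions of this illustration, not part of the original slides):

import numpy as np

def online_td_lambda(env, n_states, policy, alpha=0.1, gamma=1.0, lam=0.9, episodes=100):
    """Tabular online TD(lambda) with accumulating traces."""
    V = np.zeros(n_states)
    for _ in range(episodes):
        e = np.zeros(n_states)              # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # 1-step TD error
            e[s] += 1.0                     # accumulate trace for the visited state
            V += alpha * delta * e          # update every state in proportion to its trace
            e *= gamma * lam                # decay all traces
            s = s_next
    return V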
30. Backward View of TD(λ)
31. Backward View vs. MC and TD(0)
- Setting λ to 0, we get TD(0).
- Setting λ to 1, we get MC, but in a better way:
- TD(1) can be applied to continuing tasks.
- It works incrementally and online (instead of waiting until the end of the episode).
What about 0 < λ < 1?
32. Reinforcement Learning: Eligibility Traces
- Equivalence of the Forward and Backward Views
33. Offline TD(λ)s
- Offline forward TD(λ): the λ-return algorithm
- Offline backward TD(λ)
34. Forward View = Backward View
- Forward updates
- Backward updates
(See the proof on the slides at the end.)
35. Forward View = Backward View
- Forward updates
- Backward updates
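Stated explicitly (the standard equivalence result these slides illustrate): for offline updating, the sum of the backward-view increments over an episode equals the sum of the forward-view increments,
Σ_{t=0}^{T−1} α δ_t e_t(s) = Σ_{t=0}^{T−1} α [ R_t^λ − V_t(s_t) ] 1(s = s_t)   for all s,
so offline backward TD(λ) and the offline λ-return algorithm produce identical value updates.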
36. TD(λ) on the Random Walk
37. Reinforcement Learning: Eligibility Traces
- Sarsa(λ)
38. Sarsa(λ)
- TD(λ) uses eligibility traces for policy evaluation.
- How can eligibility traces be used for control?
- Learn Q_t(s, a) rather than V_t(s).
39. Sarsa(λ)
- Eligibility traces
- Reinforcing events
- Updates
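In state-action form (these were shown as equations on the slide; this is the standard Sarsa(λ) formulation):
e_t(s, a) = γλ e_{t−1}(s, a) + 1(s = s_t, a = a_t)
δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)
Q_{t+1}(s, a) = Q_t(s, a) + α δ_t e_t(s, a)   for all s, a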
40. Sarsa(λ)
41. Sarsa(λ): Traces in the Grid World
- After one trial, the agent has much more information about how to get to the goal (though not necessarily the best way).
- Traces can considerably accelerate learning.
42. Reinforcement Learning: Eligibility Traces
- Q(λ)
43. Q-Learning
- An off-policy method.
- The behavior policy breaks from the greedy policy from time to time to take exploratory actions, so a simple time-decayed trace cannot be applied directly.
- How to combine eligibility traces with Q-learning? Three methods:
- Watkins's Q(λ)
- Peng's Q(λ)
- Naïve Q(λ)
44. Watkins's Q(λ)
- Estimation policy (e.g., greedy)
- Behavior policy (e.g., ε-greedy)
(Figure: greedy path vs. non-greedy path.)
45. Backups in Watkins's Q(λ)
How should the eligibility traces be defined? Two cases:
- Case 1: both the behavior and estimation policies take the greedy path.
- Case 2: the behavior policy takes a non-greedy action before the episode ends.
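The rule that handles both cases, in standard form, cuts the traces whenever a non-greedy (exploratory) action is taken:
e_t(s, a) = 1(s = s_t, a = a_t) + γλ e_{t−1}(s, a)   if Q_{t−1}(s_t, a_t) = max_a Q_{t−1}(s_t, a)
e_t(s, a) = 1(s = s_t, a = a_t)                      otherwise
δ_t = r_{t+1} + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t)
Q_{t+1}(s, a) = Q_t(s, a) + α δ_t e_t(s, a)   for all s, a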
46. Watkins's Q(λ)
47. Watkins's Q(λ)
48. Peng's Q(λ)
- Cutting off traces loses much of the advantage of using eligibility traces.
- If exploratory actions are frequent, as they often are early in learning, then backups of more than one or two steps are rarely done, and learning may be little faster than 1-step Q-learning.
- Peng's Q(λ) is an alternative version of Q(λ) meant to remedy this.
49. Backups in Peng's Q(λ)
Peng, J. and Williams, R. J. (1996). Incremental Multi-Step Q-Learning. Machine Learning, 22(1/2/3).
- Traces are never cut.
- Backups use the max action except at the end.
- The book reports that it outperforms Watkins's Q(λ) and performs almost as well as Sarsa(λ).
- Disadvantage: difficult to implement.
50. Peng's Q(λ)
Peng, J. and Williams, R. J. (1996). Incremental Multi-Step Q-Learning. Machine Learning, 22(1/2/3).
See the paper above for notation.
51. Naïve Q(λ)
- Idea: is it really a problem to back up exploratory actions?
- Traces are never zeroed.
- Always back up the max at the current action (unlike Peng's or Watkins's).
- Is this truly naïve?
- Works well in preliminary empirical studies.
52. Naïve Q(λ)
53. Comparisons
- McGovern, Amy and Sutton, Richard S. (1997). Towards a Better Q(λ). Presented at the Fall 1997 Reinforcement Learning Workshop.
- Deterministic gridworld with obstacles
- 10x10 gridworld
- 25 randomly generated obstacles
- 30 runs
- α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05
- Accumulating traces
54. Comparisons
55. Convergence of the Q(λ)s
- None of the methods is proven to converge.
- Much extra credit if you can prove any of them.
- Watkins's is thought to converge to Q*.
- Peng's is thought to converge to a mixture of Qπ and Q*.
- Naïve Q(λ): Qπ?
56. Reinforcement Learning: Eligibility Traces
- Eligibility Traces for Actor-Critic Methods
57. Actor-Critic Methods
- Critic: on-policy learning of Vπ. Use TD(λ) as described before.
- Actor: needs eligibility traces for each state-action pair.
58. Policy Parameters Update
Method 1
59. Policy Parameters Update
Method 2
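The equations for the two methods were shown as figures; in Sutton and Barto's notation (p(s, a) denotes the actor's action preferences) they are:
Method 1: e_t(s, a) = γλ e_{t−1}(s, a) + 1(s = s_t, a = a_t)
Method 2: e_t(s, a) = γλ e_{t−1}(s, a) + 1(s = s_t, a = a_t) [1 − π_t(s_t, a_t)]
with, in both cases, p_{t+1}(s, a) = p_t(s, a) + α δ_t e_t(s, a) for all s, a.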
60. Reinforcement Learning: Eligibility Traces
- Replacing Traces
61. Accumulating vs. Replacing Traces
- Replacing traces
- Accumulating traces
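The two update rules compared here (shown as equations on the slide) are, in standard form:
Accumulating: e_t(s) = γλ e_{t−1}(s) + 1   if s = s_t,   else γλ e_{t−1}(s)
Replacing:    e_t(s) = 1                   if s = s_t,   else γλ e_{t−1}(s)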
62. Why Replacing Traces?
- With accumulating traces, frequently visited states can have eligibilities greater than 1, which can be a problem for convergence.
- Replacing traces can significantly speed learning.
- They can make the system perform well over a broader range of parameters.
- Accumulating traces can do poorly on certain types of tasks.
63. Example (19-State Random Walk)
64. Extension to Action Values
- When you revisit a state, what should you do with the traces for the other actions?
- Singh and Sutton (1996) suggest setting the traces of all other actions from the revisited state to 0.
65. Reinforcement Learning: Eligibility Traces
- Implementation Issues
66. Implementation Issues
- For practical use we cannot compute every trace down to the smallest value.
- Dropping very small trace values is recommended and encouraged.
- If you implement it in Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrix operations); see the sketch below.
- Use with neural networks and backpropagation generally only doubles the needed computation.
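A minimal NumPy sketch of that vectorized backup (the variable names and values below are illustrative, not from the slides; they follow the TD(λ) equations given earlier):

import numpy as np

V = np.zeros(19)                 # value table (e.g., 19-state random walk)
e = np.zeros(19)                 # eligibility traces
alpha, gamma, lam = 0.1, 1.0, 0.9
s, delta = 3, 0.5                # example current state and 1-step TD error

e[s] += 1.0                      # accumulate trace for the visited state
V += alpha * delta * e           # the whole backup is one vectorized line
e *= gamma * lam                 # decay all traces
e[e < 1e-8] = 0.0                # drop very small traces, as recommended above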
67. Variable λ
- The method can be generalized to a variable λ.
- Here λ is a function of time, λ_t.
- E.g., see the trace update below.
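One standard choice (an assumption here, since the slide's own example is not recoverable) is to let λ depend on the state visited at time t, λ_t = λ(s_t), giving the trace update
e_t(s) = γ λ_t e_{t−1}(s) + 1(s = s_t).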
68. Proof
An accumulating eligibility trace can be written explicitly (non-recursively) as shown below.
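Unrolling the recursion e_t(s) = γλ e_{t−1}(s) + 1(s = s_t), with e_{−1}(s) = 0, gives
e_t(s) = Σ_{k=0}^{t} (γλ)^{t−k} 1(s = s_k).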
69. Proof
70. Proof
71. Proof