Title: Reinforcement Learning: Eligibility Traces
1. Reinforcement Learning: Eligibility Traces
2. Content
- n-step TD prediction
- Forward View of TD(λ)
- Backward View of TD(λ)
- Equivalence of the Forward and Backward Views
- Sarsa(λ)
- Q(λ)
- Eligibility Traces for Actor-Critic Methods
- Replacing Traces
- Implementation Issues
3. Reinforcement Learning: Eligibility Traces
- n-Step TD Prediction
4. Elementary Methods
Monte Carlo Methods
Dynamic Programming
TD(0)
5. Monte Carlo vs. TD(0)
- Monte Carlo
- observes the rewards for all steps in an episode
- TD(0)
- observes the reward for one step only
6. n-Step TD Prediction
7. n-Step TD Prediction
8. Backups
Monte Carlo
TD(0)
n-step TD
9. n-Step TD Backup
- online
- offline
When updating offline, the new V(s) is not used until the next episode.
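For reference, the n-step return and the corresponding backup used above are (in standard notation, with V_t the current value estimates):
R_t^{(n)} = r_{t+1} + γ r_{t+2} + ... + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})
ΔV_t(s_t) = α [ R_t^{(n)} − V_t(s_t) ]
Online: the increment is applied immediately, V_{t+1}(s) = V_t(s) + ΔV_t(s). Offline: the increments are accumulated within an episode and applied only at its end, which is why the new V(s) is first used in the next episode.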
10. Error Reduction Property
For both online and offline updating, the worst-case error of the expected n-step return is at most γ^n times the worst-case error of the current value estimate V:
max_s | E_π[ R_t^{(n)} | s_t = s ] − V^π(s) |  ≤  γ^n  max_s | V(s) − V^π(s) |
(maximum error using the n-step return vs. maximum error using the current value V).
11. Example (Random Walk)
Consider 2-step TD, 3-step TD, and so on. Which n is optimal?
12. Example (19-State Random Walk)
13. Exercise (Random Walk)
14. Exercise (Random Walk)
- Evaluate the value function for the random policy.
- Approximate the value function using n-step TD (try different values of n and α) and compare their performance.
- Find the optimal policy.
15. Reinforcement Learning: Eligibility Traces
- The Forward View of TD(λ)
16. Averaging n-Step Returns
- We are not limited to using a single n-step TD return.
- For example, we could back up toward an average of n-step TD returns, as in the example below.
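As a concrete illustration (the particular example on the original slide is not recoverable; this is a standard one): back up toward the average of the 2-step and 4-step returns,
R_t^avg = ½ R_t^{(2)} + ½ R_t^{(4)}.
Any set of non-negative weights that sum to 1 gives a valid backup in this way.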
17. TD(λ) and the λ-Return
- TD(λ) is a method for averaging all n-step backups.
- Each n-step backup is weighted in proportion to λ^{n−1} (decaying with the time since visitation).
- The resulting average is called the λ-return.
- Backup using the λ-return.
(Backup diagram: weights w_1, w_2, w_3, ..., w_{T−t−1} on the 1-step through terminal backups.)
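The λ-return itself appeared as an equation on the slide; in Sutton and Barto's notation it is
R_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} R_t^{(n)},
or, for an episode that terminates at time T,
R_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} R_t^{(n)} + λ^{T−t−1} R_t,
where R_t is the complete return. The factor (1 − λ) normalizes the weights λ^{n−1} so that they sum to 1.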
19. Forward View of TD(λ)
A theoretical view
20. TD(λ) on the Random Walk
21. Reinforcement Learning: Eligibility Traces
- The Backward View of TD(λ)
22. Why a Backward View?
- The forward view is acausal (it uses future rewards), so it is not directly implementable.
- The backward view is causal and implementable.
- In the offline case, it achieves exactly the same result as the forward view.
23. Eligibility Traces
- Each state is associated with an additional memory variable, its eligibility trace, defined below.
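The defining equation appeared as a figure on the original slide; the accumulating-trace definition it refers to is
e_t(s) = γλ e_{t−1}(s)        if s ≠ s_t
e_t(s) = γλ e_{t−1}(s) + 1    if s = s_t
where γ is the discount rate and λ is the trace-decay parameter.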
26. Eligibility ≈ Recency of Visiting
- At any time, the traces record which states have recently been visited, where "recently" is defined in terms of γλ.
- The traces indicate the degree to which each state is eligible for undergoing learning changes should a reinforcing event occur.
- Reinforcing event: the moment-by-moment 1-step TD errors.
27. Reinforcing Event
The moment-by-moment 1-step TD errors:
δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)
28. TD(λ)
- Eligibility traces
- Reinforcing events
- Value updates
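Written out (these were shown as equations on the slide; this is the standard tabular TD(λ) form they refer to, with 1(·) the indicator function):
e_t(s) = γλ e_{t−1}(s) + 1(s = s_t)                 (eligibility traces, accumulating)
δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)           (reinforcing event)
V_{t+1}(s) = V_t(s) + α δ_t e_t(s)   for all s      (value updates)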
29. Online TD(λ)
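The slide showed the boxed algorithm; a minimal tabular sketch in Python follows (the environment interface env.reset()/env.step() and the policy function are assumptions of this illustration, not part of the original slides):

import numpy as np

def online_td_lambda(env, n_states, policy, alpha=0.1, gamma=1.0, lam=0.9, episodes=100):
    """Tabular online TD(lambda) with accumulating traces."""
    V = np.zeros(n_states)
    for _ in range(episodes):
        e = np.zeros(n_states)              # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # 1-step TD error
            e[s] += 1.0                     # accumulate trace for the visited state
            V += alpha * delta * e          # update every state in proportion to its trace
            e *= gamma * lam                # decay all traces
            s = s_next
    return V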
30. Backward View of TD(λ)
31. Backward View vs. MC and TD(0)
- Setting λ to 0, we get TD(0).
- Setting λ to 1, we get MC, but in a better way:
- TD(1) can be applied to continuing tasks.
- It works incrementally and online (instead of waiting until the end of the episode).
What about 0 < λ < 1?
32. Reinforcement Learning: Eligibility Traces
- Equivalence of the Forward and Backward Views
33. Offline TD(λ)s
- Offline forward TD(λ): the λ-return algorithm
- Offline backward TD(λ)
34. Forward View = Backward View
- Forward updates
- Backward updates
(See the proof on the slides at the end.)
35. Forward View = Backward View
- Forward updates
- Backward updates
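Stated explicitly (the standard equivalence result these slides illustrate): for offline updating, the sum of the backward-view increments over an episode equals the sum of the forward-view increments,
Σ_{t=0}^{T−1} α δ_t e_t(s) = Σ_{t=0}^{T−1} α [ R_t^λ − V_t(s_t) ] 1(s = s_t)   for all s,
so offline backward TD(λ) and the offline λ-return algorithm produce identical value updates.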
36. TD(λ) on the Random Walk
37. Reinforcement Learning: Eligibility Traces
- Sarsa(λ)
38. Sarsa(λ)
- TD(λ) uses eligibility traces for policy evaluation.
- How can eligibility traces be used for control?
- Learn Q_t(s, a) rather than V_t(s).
39. Sarsa(λ)
- Eligibility traces
- Reinforcing events
- Updates
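In state-action form (these were shown as equations on the slide; this is the standard Sarsa(λ) formulation):
e_t(s, a) = γλ e_{t−1}(s, a) + 1(s = s_t, a = a_t)
δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)
Q_{t+1}(s, a) = Q_t(s, a) + α δ_t e_t(s, a)   for all s, a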
40. Sarsa(λ)
41. Sarsa(λ): Traces in the Grid World
- After one trial, the agent has much more information about how to get to the goal (though not necessarily the best way).
- Traces can considerably accelerate learning.
42. Reinforcement Learning: Eligibility Traces
- Q(λ)
43. Q-Learning
- An off-policy method.
- The behavior policy breaks from the greedy policy from time to time to take exploratory actions, so a simple time-decayed trace cannot be applied directly.
- How to combine eligibility traces with Q-learning? Three methods:
- Watkins's Q(λ)
- Peng's Q(λ)
- Naïve Q(λ)
44. Watkins's Q(λ)
- Estimation policy (e.g., greedy)
- Behavior policy (e.g., ε-greedy)
(Figure: greedy path vs. non-greedy path.)
45. Backups in Watkins's Q(λ)
How should the eligibility traces be defined? Two cases:
- Case 1: both the behavior and estimation policies take the greedy path.
- Case 2: the behavior policy takes a non-greedy action before the episode ends.
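The rule that handles both cases, in standard form, cuts the traces whenever a non-greedy (exploratory) action is taken:
e_t(s, a) = 1(s = s_t, a = a_t) + γλ e_{t−1}(s, a)   if Q_{t−1}(s_t, a_t) = max_a Q_{t−1}(s_t, a)
e_t(s, a) = 1(s = s_t, a = a_t)                      otherwise
δ_t = r_{t+1} + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t)
Q_{t+1}(s, a) = Q_t(s, a) + α δ_t e_t(s, a)   for all s, a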
46. Watkins's Q(λ)
47. Watkins's Q(λ)
48. Peng's Q(λ)
- Cutting off traces loses much of the advantage of using eligibility traces.
- If exploratory actions are frequent, as they often are early in learning, then backups of more than one or two steps are rarely done, and learning may be little faster than 1-step Q-learning.
- Peng's Q(λ) is an alternative version of Q(λ) meant to remedy this.
49. Backups in Peng's Q(λ)
Peng, J. and Williams, R. J. (1996). Incremental Multi-Step Q-Learning. Machine Learning, 22(1/2/3).
- Traces are never cut.
- Backups use the max action except at the end.
- The book reports that it outperforms Watkins's Q(λ) and performs almost as well as Sarsa(λ).
- Disadvantage: difficult to implement.
50. Peng's Q(λ)
Peng, J. and Williams, R. J. (1996). Incremental Multi-Step Q-Learning. Machine Learning, 22(1/2/3).
See the paper above for notation.
51. Naïve Q(λ)
- Idea: is it really a problem to back up exploratory actions?
- Traces are never zeroed.
- Always back up the max at the current action (unlike Peng's or Watkins's).
- Is this truly naïve?
- Works well in preliminary empirical studies.
52. Naïve Q(λ)
53. Comparisons
- McGovern, Amy and Sutton, Richard S. (1997). Towards a Better Q(λ). Presented at the Fall 1997 Reinforcement Learning Workshop.
- Deterministic gridworld with obstacles
- 10x10 gridworld
- 25 randomly generated obstacles
- 30 runs
- α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05
- Accumulating traces
54. Comparisons
55. Convergence of the Q(λ)s
- None of the methods is proven to converge.
- Much extra credit if you can prove any of them.
- Watkins's is thought to converge to Q*.
- Peng's is thought to converge to a mixture of Qπ and Q*.
- Naïve Q(λ): Qπ?
56. Reinforcement Learning: Eligibility Traces
- Eligibility Traces for Actor-Critic Methods
57. Actor-Critic Methods
- Critic: on-policy learning of Vπ. Use TD(λ) as described before.
- Actor: needs eligibility traces for each state-action pair.
58. Policy Parameters Update
Method 1
59. Policy Parameters Update
Method 2
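The equations for the two methods were shown as figures; in Sutton and Barto's notation (p(s, a) denotes the actor's action preferences) they are:
Method 1: e_t(s, a) = γλ e_{t−1}(s, a) + 1(s = s_t, a = a_t)
Method 2: e_t(s, a) = γλ e_{t−1}(s, a) + 1(s = s_t, a = a_t) [1 − π_t(s_t, a_t)]
with, in both cases, p_{t+1}(s, a) = p_t(s, a) + α δ_t e_t(s, a) for all s, a.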
60. Reinforcement Learning: Eligibility Traces
- Replacing Traces
61. Accumulating vs. Replacing Traces
- Replacing traces
- Accumulating traces
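The two update rules compared here (shown as equations on the slide) are, in standard form:
Accumulating: e_t(s) = γλ e_{t−1}(s) + 1   if s = s_t,   else γλ e_{t−1}(s)
Replacing:    e_t(s) = 1                   if s = s_t,   else γλ e_{t−1}(s)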
62. Why Replacing Traces?
- With accumulating traces, frequently visited states can have eligibilities greater than 1, which can be a problem for convergence.
- Replacing traces can significantly speed learning.
- They can make the system perform well over a broader range of parameters.
- Accumulating traces can do poorly on certain types of tasks.
63. Example (19-State Random Walk)
64. Extension to Action Values
- When you revisit a state, what should you do with the traces for the other actions?
- Singh and Sutton (1996) suggest setting the traces of all other actions from the revisited state to 0.
65. Reinforcement Learning: Eligibility Traces
- Implementation Issues
66. Implementation Issues
- For practical use we cannot compute every trace down to the smallest value.
- Dropping very small trace values is recommended and encouraged.
- If you implement it in Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrix operations); see the sketch below.
- Use with neural networks and backpropagation generally only doubles the needed computation.
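A minimal NumPy sketch of that vectorized backup (the variable names and values below are illustrative, not from the slides; they follow the TD(λ) equations given earlier):

import numpy as np

V = np.zeros(19)                 # value table (e.g., 19-state random walk)
e = np.zeros(19)                 # eligibility traces
alpha, gamma, lam = 0.1, 1.0, 0.9
s, delta = 3, 0.5                # example current state and 1-step TD error

e[s] += 1.0                      # accumulate trace for the visited state
V += alpha * delta * e           # the whole backup is one vectorized line
e *= gamma * lam                 # decay all traces
e[e < 1e-8] = 0.0                # drop very small traces, as recommended above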
67. Variable λ
- The method can be generalized to a variable λ.
- Here λ is a function of time, λ_t.
- E.g., see the trace update below.
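One standard choice (an assumption here, since the slide's own example is not recoverable) is to let λ depend on the state visited at time t, λ_t = λ(s_t), giving the trace update
e_t(s) = γ λ_t e_{t−1}(s) + 1(s = s_t).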
68. Proof
An accumulating eligibility trace can be written explicitly (non-recursively) as shown below.
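Unrolling the recursion e_t(s) = γλ e_{t−1}(s) + 1(s = s_t), with e_{−1}(s) = 0, gives
e_t(s) = Σ_{k=0}^{t} (γλ)^{t−k} 1(s = s_k).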
69. Proof
70. Proof
71. Proof