Title: From Sutton
1. Reinforcement Learning: An Introduction
2. Chapter 7: Eligibility Traces
3. N-step TD Prediction
- Idea: Look farther into the future when you do a TD backup (1, 2, 3, ..., n steps)
4. Mathematics of N-step TD Prediction
- Monte Carlo
- TD
- Use V to estimate remaining return
- n-step TD
- 2-step return
- n-step return
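The n-step return sums the first n discounted rewards and then uses V to estimate the remaining return. The sketch below assumes the textbook definition; the function name, the list-based episode layout, and the sample numbers are illustrative assumptions and do not come from the slides.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from time t (a sketch, not the authors' code).

    rewards[k] is the reward received on leaving state s_k,
    values[k] is the current estimate V(s_k), and the episode
    ends after len(rewards) steps in a terminal state of value 0.
    """
    T = len(rewards)
    steps = min(n, T - t)              # truncate at the end of the episode
    g = 0.0
    for k in range(steps):
        g += (gamma ** k) * rewards[t + k]
    if t + steps < T:                  # bootstrap from V unless terminal
        g += (gamma ** steps) * values[t + steps]
    return g


rewards = [0.0, 0.0, 1.0]              # hypothetical 3-step episode
values = [0.5, 0.6, 0.7]               # hypothetical estimates V(s_0..s_2)
one_step = n_step_return(rewards, values, t=0, n=1, gamma=0.9)     # TD target
two_step = n_step_return(rewards, values, t=0, n=2, gamma=0.9)     # 2-step return
monte_carlo = n_step_return(rewards, values, t=0, n=3, gamma=0.9)  # full return
```

With n = 1 this is the ordinary TD target, and once n reaches the episode length it is the Monte Carlo return.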
5. Learning with N-step Backups
- Backup (on-line or off-line)
- Error reduction property of n-step returns
- Using this, you can show that n-step methods converge
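The slide's equations did not survive in this text. As a reconstruction in the textbook's notation (an assumption about what the slide showed), the n-step backup and the error reduction property can be written as:

```latex
% n-step backup toward the n-step return
\Delta V_t(s_t) = \alpha \left[ R_t^{(n)} - V_t(s_t) \right]

% error reduction property: the worst-case error of the expected
% n-step return is at most \gamma^n times the worst-case error of V
\max_s \left| E_\pi\!\left\{ R_t^{(n)} \mid s_t = s \right\} - V^\pi(s) \right|
  \le \gamma^n \max_s \left| V(s) - V^\pi(s) \right|
```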
6. Random Walk Examples
- How does 2-step TD work here?
- How about 3-step TD?
7. A Larger Example
- Task: 19-state random walk
- Do you think there is an optimal n? For everything?
8. Averaging N-step Returns
- n-step methods were introduced to help with understanding TD(λ)
- Idea: back up an average of several returns
- e.g. back up half of the 2-step return and half of the 4-step return
- Called a complex backup
- Draw each component
- Label with the weights for that component
(Backup diagram label: One backup)
9. Forward View of TD(λ)
- TD(λ) is a method for averaging all n-step backups
- Weight by λ^(n-1) (time since visitation)
- λ-return
- Backup using λ-return
10. λ-return Weighting Function
- Until termination
- After termination
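The two labels above most likely annotated the two terms of the λ-return for an episode ending at time T. A reconstruction in the textbook's notation (assumed, since the slide's equation is not preserved here):

```latex
% lambda-return as a weighted average of all n-step returns
R_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}

% the same quantity split at the end of the episode
R_t^\lambda =
  \underbrace{(1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)}}_{\text{until termination}}
  + \underbrace{\lambda^{T-t-1} R_t}_{\text{after termination}}
```

The weights sum to 1: each n-step return before termination gets weight (1 − λ)λ^(n−1), and all remaining weight goes to the complete return R_t.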
11. Relation to TD(0) and MC
- λ-return can be rewritten as the sum of two terms: one until termination, one after termination
- If λ = 1, you get MC
- If λ = 0, you get TD(0)
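The two limiting cases can be checked numerically. This is a minimal sketch that assumes the episodic form of the λ-return from the textbook; the function names and the three-step episode are hypothetical.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return; bootstraps from values unless the episode has ended."""
    T = len(rewards)
    steps = min(n, T - t)
    g = sum((gamma ** k) * rewards[t + k] for k in range(steps))
    if t + steps < T:
        g += (gamma ** steps) * values[t + steps]
    return g


def lambda_return(rewards, values, t, gamma, lam):
    """Episodic lambda-return as a weighted average of n-step returns."""
    T = len(rewards)
    total = 0.0
    # "Until termination": n = 1 .. T-t-1, weight (1 - lam) * lam^(n-1)
    for n in range(1, T - t):
        weight = (1.0 - lam) * lam ** (n - 1)
        total += weight * n_step_return(rewards, values, t, n, gamma)
    # "After termination": remaining weight lam^(T-t-1) on the full return
    total += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return total


rewards = [0.0, 0.0, 1.0]          # hypothetical episode
values = [0.5, 0.6, 0.7]           # hypothetical value estimates
td0_target = lambda_return(rewards, values, 0, gamma=0.9, lam=0.0)
mc_target = lambda_return(rewards, values, 0, gamma=0.9, lam=1.0)
mixed = lambda_return(rewards, values, 0, gamma=0.9, lam=0.5)
```

Here `td0_target` equals the one-step return and `mc_target` equals the full Monte Carlo return, while `mixed` lies between them.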
12. Forward View of TD(λ) II
- Look forward from each state to determine update from future states and rewards
13. λ-return on the Random Walk
- Same 19-state random walk as before
- Why do you think intermediate values of λ are best?
14. Backward View
- Shout δ_t backwards over time
- The strength of your voice decreases with temporal distance by γλ
15. Backward View of TD(λ)
- The forward view was for theory
- The backward view is for mechanism
- New variable called the eligibility trace
- On each step, decay all traces by γλ and increment the trace for the current state by 1
- Accumulating trace
16. On-line Tabular TD(λ)
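The algorithm box for this slide is not preserved in the text. As a rough sketch of on-line tabular TD(λ) with accumulating traces, following the textbook's update order: the small five-state random walk, the parameter values, and all names below are assumptions chosen for illustration, not the slide's own example.

```python
import random


def td_lambda_random_walk(num_states=5, episodes=3000, alpha=0.02,
                          gamma=1.0, lam=0.8, seed=0):
    """On-line tabular TD(lambda) on a random walk (illustrative sketch).

    States 0..num_states-1 are non-terminal; stepping off the right end
    gives reward 1, stepping off the left end gives reward 0.
    """
    rng = random.Random(seed)
    V = [0.5] * num_states                  # initial value estimates
    for _ in range(episodes):
        e = [0.0] * num_states              # traces start at zero each episode
        s = num_states // 2                 # start in the middle
        while True:
            s_next = s + rng.choice((-1, 1))
            done = s_next < 0 or s_next >= num_states
            r = 1.0 if s_next >= num_states else 0.0
            v_next = 0.0 if done else V[s_next]
            delta = r + gamma * v_next - V[s]    # TD error
            e[s] += 1.0                          # accumulating trace
            for i in range(num_states):
                V[i] += alpha * delta * e[i]     # update every state by its trace
                e[i] *= gamma * lam              # then decay the trace
            if done:
                break
            s = s_next
    return V


V = td_lambda_random_walk()
true_values = [(i + 1) / 6 for i in range(5)]    # known values for this walk
```

The estimates in `V` should end up near `true_values`; how close depends on the step size and the number of episodes.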
17. Relation of Backward View to MC and TD(0)
- Using update rule
- As before, if you set λ to 0, you get TD(0)
- If you set λ to 1, you get MC but in a better way
- Can apply TD(1) to continuing tasks
- Works incrementally and on-line (instead of waiting until the end of the episode)
18. Forward View = Backward View
- The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating
- The book shows the algebra for this equivalence
- On-line updating with small α is similar
19. n-step TD vs. TD(λ)
- Same 19-state random walk
- TD(λ) performs a bit better
20. Control: Sarsa(λ)
- Save eligibility for state-action pairs instead of just states
21. Sarsa(λ) Algorithm
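The algorithm box itself is missing from this text. The sketch below is one way to write tabular Sarsa(λ) with accumulating traces; the one-dimensional corridor task, the tie-breaking rule, and every name and parameter value are assumptions made for illustration.

```python
import random


def sarsa_lambda_corridor(n=6, episodes=1, alpha=0.5, gamma=0.9,
                          lam=0.9, epsilon=0.1, seed=0):
    """Tabular Sarsa(lambda) on a corridor; state n-1 is the goal (reward 1)."""
    rng = random.Random(seed)
    actions = (-1, 1)                       # move left or right
    Q = {(s, a): 0.0 for s in range(n) for a in actions}

    def choose(s):
        # epsilon-greedy with random tie-breaking
        if rng.random() < epsilon:
            return rng.choice(actions)
        best = max(Q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if Q[(s, a)] == best])

    for _ in range(episodes):
        e = {sa: 0.0 for sa in Q}           # one trace per state-action pair
        s, a = 0, choose(0)
        while True:
            s2 = min(max(s + a, 0), n - 1)  # walls at both ends
            done = s2 == n - 1
            r = 1.0 if done else 0.0
            a2 = None if done else choose(s2)
            q_next = 0.0 if done else Q[(s2, a2)]
            delta = r + gamma * q_next - Q[(s, a)]
            e[(s, a)] += 1.0                # accumulating trace
            for sa in Q:
                Q[sa] += alpha * delta * e[sa]
                e[sa] *= gamma * lam
            if done:
                break
            s, a = s2, a2
    return Q


Q_traces = sarsa_lambda_corridor(lam=0.9)
Q_one_step = sarsa_lambda_corridor(lam=0.0)
updated_with_traces = sum(1 for v in Q_traces.values() if v != 0.0)
updated_one_step = sum(1 for v in Q_one_step.values() if v != 0.0)
```

After a single episode, the one-step version has updated only the pair that led directly into the goal, while the trace version has updated every pair visited along the way.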
22. Sarsa(λ) Gridworld Example
- With one trial, the agent has much more information about how to get to the goal
- Not necessarily the best way
- Can considerably accelerate learning
23. Three Approaches to Q(λ)
- How can we extend this to Q-learning?
- If you mark every state-action pair as eligible, you back up over a non-greedy policy
- Watkins: Zero out the eligibility trace after a non-greedy action. Do max when backing up at the first non-greedy choice.
24. Watkins's Q(λ)
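The algorithm for this slide is not preserved. A minimal sketch of a single Watkins's Q(λ) update, assuming the textbook's rule of backing up the max and cutting all traces after an exploratory action; the function signature, dictionary layout, and sample values are hypothetical.

```python
def watkins_q_lambda_step(Q, e, s, a, r, s2, a2, actions,
                          alpha, gamma, lam, done=False):
    """One Watkins's Q(lambda) update after taking a in s and picking a2 in s2."""
    if done:
        target, greedy_next = r, True
    else:
        best = max(Q[(s2, b)] for b in actions)
        target = r + gamma * best              # always back up the max
        greedy_next = Q[(s2, a2)] == best      # was the chosen a2 greedy?
    delta = target - Q[(s, a)]
    e[(s, a)] = e.get((s, a), 0.0) + 1.0       # accumulating trace
    for sa in list(e):
        Q[sa] += alpha * delta * e[sa]
        # decay after a greedy choice, cut to zero after an exploratory one
        e[sa] = gamma * lam * e[sa] if greedy_next else 0.0
    return delta


actions = ("left", "right")
Q = {(s, a): 0.0 for s in range(3) for a in actions}
Q[(1, "right")] = 1.0
Q[(2, "left")] = 0.5
e = {}
# Step 1: the next action is greedy, so the trace survives (decayed).
d1 = watkins_q_lambda_step(Q, e, 0, "right", 0.0, 1, "right",
                           actions, alpha=0.5, gamma=0.9, lam=0.8)
trace_after_greedy = e[(0, "right")]
# Step 2: the next action is exploratory, so all traces are cut.
d2 = watkins_q_lambda_step(Q, e, 1, "right", 0.0, 2, "right",
                           actions, alpha=0.5, gamma=0.9, lam=0.8)
```

The second step still propagates its TD error back to the earlier pair through the surviving trace before the traces are zeroed.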
25. Peng's Q(λ)
- Disadvantage to Watkins's method
- Early in learning, the eligibility trace will be cut (zeroed out) frequently, resulting in little advantage to traces
- Peng
- Backup max action except at end
- Never cut traces
- Disadvantage
- Complicated to implement
26. Naïve Q(λ)
- Idea: is it really a problem to back up exploratory actions?
- Never zero traces
- Always back up max at current action (unlike Peng's or Watkins's)
- Is this truly naïve?
- Works well in preliminary empirical studies
What is the backup diagram?
27. Comparison Task
- Compared Watkins's, Peng's, and Naïve (called McGovern's here) Q(λ) on several tasks.
- See McGovern and Sutton (1997), Towards a Better Q(λ), for other tasks and results (stochastic tasks, continuing tasks, etc.)
- Deterministic gridworld with obstacles
- 10x10 gridworld
- 25 randomly generated obstacles
- 30 runs
- α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05, accumulating traces
From McGovern and Sutton (1997), Towards a Better Q(λ)
28. Comparison Results
From McGovern and Sutton (1997), Towards a Better Q(λ)
29. Convergence of the Q(λ)s
- None of the methods are proven to converge.
- Much extra credit if you can prove any of them.
- Watkins's is thought to converge to Q*
- Peng's is thought to converge to a mixture of Q^π and Q*
- Naïve: Q*?
30. Eligibility Traces for Actor-Critic Methods
- Critic: On-policy learning of V^π. Use TD(λ) as described before.
- Actor: Needs eligibility traces for each state-action pair.
- We change the update equation from its one-step form to a form that uses traces
- Can change the other actor-critic update in the same way, with a correspondingly modified trace
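The equations for this slide are not preserved in the text. A reconstruction following the textbook's treatment of actor-critic traces, where p denotes the actor's action preferences; the use of α as the actor step size is an assumption:

```latex
% first actor update: one-step form, then trace form
p_{t+1}(s_t, a_t) = p_t(s_t, a_t) + \alpha \delta_t
\quad \text{to} \quad
p_{t+1}(s, a) = p_t(s, a) + \alpha \delta_t e_t(s, a)

% other actor update: one-step form, then trace form
p_{t+1}(s_t, a_t) = p_t(s_t, a_t) + \alpha \delta_t \left[ 1 - \pi_t(s_t, a_t) \right]
\quad \text{to} \quad
p_{t+1}(s, a) = p_t(s, a) + \alpha \delta_t e_t(s, a)

% where the trace for the second form is
e_t(s, a) =
\begin{cases}
\gamma \lambda e_{t-1}(s, a) + 1 - \pi_t(s_t, a_t) & \text{if } s = s_t \text{ and } a = a_t \\
\gamma \lambda e_{t-1}(s, a) & \text{otherwise}
\end{cases}
```

In the first form, e_t(s, a) is the ordinary accumulating trace over state-action pairs, as in Sarsa(λ).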
31. Replacing Traces
- Using accumulating traces, frequently visited states can have eligibilities greater than 1
- This can be a problem for convergence
- Replacing traces: Instead of adding 1 when you visit a state, set that trace to 1
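The difference between the two trace types shows up as soon as a state is visited repeatedly. A small sketch; the function name, the single state "A", and the parameter values are illustrative assumptions.

```python
def update_trace(e, visited, gamma, lam, replacing):
    """Decay every trace by gamma * lam, then mark the visited state."""
    for s in e:
        e[s] *= gamma * lam
    if replacing:
        e[visited] = 1.0          # replacing trace: set to 1
    else:
        e[visited] += 1.0         # accumulating trace: add 1
    return e


accumulating = {"A": 0.0}
replacing = {"A": 0.0}
for _ in range(3):                # visit state A three times in a row
    update_trace(accumulating, "A", gamma=1.0, lam=0.9, replacing=False)
    update_trace(replacing, "A", gamma=1.0, lam=0.9, replacing=True)
```

After three consecutive visits the accumulating trace has grown to 1 + 0.9 + 0.81 = 2.71, while the replacing trace is still exactly 1.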
32. Replacing Traces Example
- Same 19-state random walk task as before
- Replacing traces perform better than accumulating traces over more values of λ
33. Why Replacing Traces?
- Replacing traces can significantly speed learning
- They can make the system perform well for a broader set of parameters
- Accumulating traces can do poorly on certain types of tasks
Why is this task particularly onerous for accumulating traces?
34. More Replacing Traces
- Off-line replacing trace TD(1) is identical to first-visit MC
- Extension to action-values
- When you revisit a state, what should you do with the traces for the other actions?
- Singh and Sutton say to set them to zero
35. Implementation Issues with Traces
- Could require much more computation
- But most eligibility traces are VERY close to zero
- If you implement it in Matlab, backup is only one line of code and is very fast (Matlab is optimized for matrices)
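One way to exploit the fact that most traces are very close to zero is to store only the live traces and drop any that decay below a threshold. This is a sketch of that idea, not something the slides prescribe; the `cutoff` parameter, the dictionary representation, and the sample values are assumptions.

```python
def sparse_td_backup(V, e, delta, alpha, gamma, lam, cutoff=1e-4):
    """Apply one TD(lambda) backup, touching only states with a live trace."""
    for s in list(e):
        V[s] = V.get(s, 0.0) + alpha * delta * e[s]
        e[s] *= gamma * lam
        if e[s] < cutoff:         # hypothetical threshold for "close to zero"
            del e[s]


V = {}
e = {"s0": 1.0, "s1": 0.5, "s2": 2e-4}    # hypothetical live traces
sparse_td_backup(V, e, delta=1.0, alpha=0.1, gamma=0.9, lam=0.5)
live_states = sorted(e)
```

The cost of each backup then scales with the number of recently visited states rather than with the size of the state space, at the price of a small truncation error.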
36. Variable λ
- Can generalize to variable λ
- Here λ is a function of time
- Could define
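The slide's equations are missing here. In the textbook's notation, the accumulating trace with a time-varying λ can be written as below; the state-dependent choice shown last is one possibility discussed in the textbook and may not be the definition this slide gave.

```latex
e_t(s) =
\begin{cases}
\gamma \lambda_t e_{t-1}(s) & \text{if } s \ne s_t \\
\gamma \lambda_t e_{t-1}(s) + 1 & \text{if } s = s_t
\end{cases}

% one possible definition: let lambda depend on the current state
\lambda_t = \lambda(s_t)
```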
37. Conclusions
- Provides efficient, incremental way to combine MC and TD
- Includes advantages of MC (can deal with lack of Markov property)
- Includes advantages of TD (using TD error, bootstrapping)
- Can significantly speed learning
- Does have a cost in computation
38. The two views