Title: Accumulation vs. replacement; model-free vs. model-based RL
1Accumulation vs. replacement; model-free vs. model-based RL
2Today in history
- Last time
- Explanations of Q-learning
- Action selection
- On/off-policy learning
- Use of experience
- Eligibility traces
- SARSA
- Today
- SARSA(λ)
- Replacing vs accumulating traces
- Thinking about eligibility
- R3 discussion
3Administrivia
- Select presentation days
- Tues, May 1
- Alex, Blake, Diane
- Thu, May 3
- Hairong, Jesse, Josh
4Presentation hints
Terran's packaged rant...
5Presentation hints
- Formal presentation to an audience
- Trying to convince audience of something
- E.g., you have invented a great idea and proven that it works
- Subtext: you're smart and they should invest in you
- Think of it as a sales pitch (sort-of)
- Get the core idea across
- Don't dwell on tedious detail
- Don't be fluffy
6Presentation hints
- Practice!
- Time will be tight -- time yourself
- Get friends/colleagues to help you practice
- Practice!
- Think about order of material presentation
- Practice!
7Presentation hints
- Avoid
- using
- every
- clever
- powerpoint
- trick
And be careful with cute, but pointless images
8Presentation hints
Oh, and avoid using bizarre fonts and really tiny
font sizes just so that you can cram as much junk
on the screen as possible. Remember, it's more
important that the audience actually understand
your material than that you convey more volume
of material in the same time. It's essentially
pointless to ream through bunches of text or
incredible amounts of math if nobody in the
audience gets it. At best, they will be bored
and zone out for most of your talk. At worst,
they will be actively put off or annoyed by your
presentation. And, presumably, you want them all
to like you and be impressed with your material
and ideas, so it's counterproductive to
antagonize your audience. Remember, at some
point, your project, future funding, and/or job
may depend on a presentation like this, so it
behooves you to keep your audience happy. I have
actually seen people give abysmally bad
presentations and be completely rejected from the
job opening because of their poor presentations.
Now that that has been said, I still need to fill
out this page with a large blob of text so that
it's as intimidating as possible. Honestly, I
don't expect anybody to actually read this far
even in the online copy, let alone in class. If
you do actually get this far while I'm flashing
this page up in class, do please shout out. I'll
be most impressed and you'll get brownie points
for speed reading. Even if you happen to read
this far in the online copy, please send me a
note, just to satisfy my curiosity about who's
determined enough to get that far. Hm. Still
half a page to fill. This is a pretty
drastically condensed slide. Let's see. Need
more text. Maybe a little web mining... Ok,
here we go: APRIL is the cruellest month,
breeding / Lilacs out of the dead land, mixing /
Memory and desire, stirring / Dull roots with
spring rain. / Winter kept us warm, covering /
Earth in forgetful snow, feeding / A little life
with dried tubers. / Summer surprised us, coming
over the Starnbergersee / With a shower of rain
we stopped in the colonnade, / And went on in
sunlight, into the Hofgarten, / And drank coffee,
and talked for an hour. / Bin gar keine Russin,
stamm' aus Litauen, echt deutsch. / And when we
were children, staying at the archduke's, / My
cousin's, he took me out on a sled, / And I was
frightened. He said, Marie, / Marie, hold on
tight. And down we went. / In the mountains,
there you feel free. / I read, much of the night,
and go south in the winter. / / What are the
roots that clutch, what branches grow / Out of
this stony rubbish? Son of man, / You cannot say,
or guess, for you know only / A heap of broken
images, where the sun beats, / And the dead tree
gives no shelter, the cricket no relief, / And
the dry stone no sound of water. Only / There is
shadow under this red rock, / (Come in under the
shadow of this red rock), / And I will show you
something different from either / Your shadow at
morning striding behind you / Or your shadow at
evening rising to meet you / I will show you
fear in a handful of dust. / Frisch weht der Wind
/ Der Heimat zu. / Mein Irisch Kind, / Wo weilest
du? / 'You gave me hyacinths first a year ago /
'They called me the hyacinth girl.' / Yet when
we came back, late, from the Hyacinth garden, /
Your arms full, and your hair wet, I could not /
Speak, and my eyes failed, I was neither / Living
nor dead, and I knew nothing, / Looking into the
heart of light, the silence. / Od' und leer das
Meer.
9Presentation hints
Oh yeah. Don't switch slides too quickly.
10Presentation hints
- Be sure to look at audience
- Don't just read from your slides
- Don't stare at the screen the whole time
- Be careful w/ laser pointers
- Practice!
11Back to RL...
12The Q-learning algorithm
- Algorithm: Q_learn
- Inputs: State space S; Act. space A
- Discount γ (0<γ<1); Learning rate α (0<α<1)
- Outputs: Q
- Repeat
- s = get_current_world_state()
- a = pick_next_action(Q,s)
- (r,s') = act_in_world(a)
- Q(s,a) = Q(s,a) + α(r + γ max_a' Q(s',a') - Q(s,a))
- Until (bored)
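A minimal Python sketch of the Q-learning backup above, under a few assumptions: Q is a dictionary keyed by (state, action) pairs that defaults to 0, and the state and action names are made up for illustration. It shows only the update rule, not action selection or the world itself.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning backup to Q in place and return the TD error."""
    # Off-policy: back up from the best next action, whatever the agent does next.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error

# Hypothetical two-action world; unseen table entries start at 0.
Q = defaultdict(float)
actions = ["left", "right"]
Q[("s1", "right")] = 2.0

# One observed step: in s0, took "right", got reward 1, landed in s1.
q_learning_update(Q, "s0", "right", r=1.0, s_next="s1", actions=actions)
print(Q[("s0", "right")])  # 0.1 * (1 + 0.9 * 2 - 0), about 0.28
```

The max over next actions is what makes this off-policy: the backup does not depend on which action pick_next_action actually chooses in s'.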
13SARSA-learning algorithm
- Algorithm: SARSA_learn
- Inputs: State space S; Act. space A
- Discount γ (0<γ<1); Learning rate α (0<α<1)
- Outputs: Q
- Q = random(S,A) // Initialize
- s = get_current_world_state()
- a = pick_next_action(Q,s)
- Repeat
- (r,s') = act_in_world(a)
- a' = pick_next_action(Q,s')
- Q(s,a) = Q(s,a) + α(r + γ Q(s',a') - Q(s,a))
- a = a'; s = s'
- Until (bored)
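A rough Python sketch of the SARSA loop, for comparison with Q-learning. The world here is an assumption made up for illustration (a 5-cell corridor with reward 1 at the right end), as are the epsilon-greedy action picker, the episode structure, and the zero initialization of Q (the slide initializes Q randomly).

```python
import random
from collections import defaultdict

ACTIONS = [-1, 1]   # hypothetical: step left or right along the corridor
GOAL = 4            # hypothetical terminal cell

def act_in_world(s, a):
    """Toy world (an assumption): reward 1 on reaching GOAL, else 0."""
    s_next = min(max(s + a, 0), GOAL)
    return (1.0 if s_next == GOAL else 0.0), s_next

def pick_next_action(Q, s, epsilon, rng):
    """Epsilon-greedy selection with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(s, a)] == best])

def sarsa_learn(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        s = 0
        a = pick_next_action(Q, s, epsilon, rng)
        while s != GOAL:
            r, s_next = act_in_world(s, a)
            a_next = pick_next_action(Q, s_next, epsilon, rng)
            # On-policy backup: uses the action actually chosen for s_next.
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q

Q = sarsa_learn()
print(Q[(3, 1)], Q[(3, -1)])
```

The only difference from the Q-learning backup is the target: Q(s',a') for the action the agent will really take, rather than the max over a'.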
14Radioactive breadcrumbs
- Can now define eligibility traces for SARSA
- In addition to Q(s,a) table, keep an e(s,a) table
- Records eligibility (real number) for each state/action pair
- At every step ((s,a,r,s',a') tuple)
- Increment e(s,a) for current (s,a) pair by 1
- Update all Q(s,a) vals in proportion to their e(s,a)
- Decay all e(s,a) by factor of γλ
- Leslie Kaelbling calls this the "radioactive breadcrumbs" form of RL
15SARSA(λ)-learning alg.
- Algorithm: SARSA(λ)_learn
- Inputs: S, A, γ (0<γ<1), α (0<α<1), λ (0<λ<1)
- Outputs: Q
- e(s,a) = 0 // for all s, a
- s = get_curr_world_st(); a = pick_nxt_act(Q,s)
- Repeat
- (r,s') = act_in_world(a)
- a' = pick_next_action(Q,s')
- δ = r + γ Q(s',a') - Q(s,a)
- e(s,a) += 1
- foreach (s'',a'') pair in (S×A)
- Q(s'',a'') = Q(s'',a'') + α e(s'',a'') δ
- e(s'',a'') *= γλ
- a = a'; s = s'
- Until (bored)
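One step of the SARSA(λ) loop, sketched in Python under some assumptions: Q and e are dictionaries defaulting to 0, the trajectory and state names are invented, and the loop runs only over pairs that already have a trace (pairs with e = 0 would receive a zero update anyway, so this should match looping over all of S×A).

```python
from collections import defaultdict

def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next,
                      alpha=0.5, gamma=0.9, lam=0.8):
    """One SARSA(lambda) backup with accumulating traces; returns delta."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    e[(s, a)] += 1.0                      # drop a fresh breadcrumb
    for key in list(e):
        Q[key] += alpha * e[key] * delta  # credit in proportion to eligibility
        e[key] *= gamma * lam             # breadcrumbs decay
    return delta

Q = defaultdict(float)
e = defaultdict(float)

# Hypothetical three-step trajectory; only the last step is rewarded.
sarsa_lambda_step(Q, e, "s0", "go", 0.0, "s1", "go")
sarsa_lambda_step(Q, e, "s1", "go", 0.0, "s2", "go")
sarsa_lambda_step(Q, e, "s2", "go", 1.0, "end", "go")

for s in ("s0", "s1", "s2"):
    print(s, Q[(s, "go")])
```

With these settings the single reward updates all three pairs at once, by roughly 0.2592, 0.36, and 0.5. One-step SARSA would have updated only (s2, go) on this pass.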
17The trail of crumbs
Sutton & Barto, Sec 7.5
18The trail of crumbs
λ = 0
Sutton & Barto, Sec 7.5
19The trail of crumbs
Sutton & Barto, Sec 7.5
20Eligibility for a single state
[Figure: plot of e(s_i,a_j) over time, with the 1st visit, 2nd visit, ... marked]
Sutton & Barto, Sec 7.5
21Eligibility trace followup
- Eligibility trace allows
- Tracking where the agent has been
- Backup of rewards over longer periods
- Credit assignment: state/action pairs rewarded for having contributed to getting to the reward
- Why does it work?
22The forward view of elig.
- Original SARSA did one step backup
[Diagram: one-step info backup. Q(s,a) is updated from r_t and Q(s_{t+1},a_{t+1}); the rest of the trajectory follows.]
23The forward view of elig.
- Original SARSA did one step backup
- Could also do a two step backup
[Diagram: two-step info backup. Q(s,a) is updated from r_t, r_{t+1}, and Q(s_{t+2},a_{t+2}); the rest of the trajectory follows.]
24The forward view of elig.
- Original SARSA did one step backup
- Could also do a two step backup
- Or even an n step backup
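The n-step backup target can be written out as a small Python function. This is a sketch under assumed conventions: rewards[k] holds r_{t+k}, q_tail is the current estimate Q(s_{t+n},a_{t+n}), and the numbers in the example are invented.

```python
def n_step_return(rewards, q_tail, gamma, n):
    """Target for an n-step backup of Q(s_t, a_t).

    rewards[k] is r_{t+k}; q_tail is Q(s_{t+n}, a_{t+n}).
    """
    if n < 1 or n > len(rewards):
        raise ValueError("need 1 <= n <= len(rewards)")
    discounted = sum(gamma ** k * rewards[k] for k in range(n))
    return discounted + gamma ** n * q_tail

rewards = [0.0, 0.0, 1.0]   # hypothetical rewards r_t, r_{t+1}, r_{t+2}

# n = 1 is the original SARSA target: r_t + gamma * Q(s_{t+1}, a_{t+1})
one_step = n_step_return(rewards, q_tail=2.0, gamma=0.9, n=1)

# n = 3 relies on three real rewards and leans less on the Q estimate
three_step = n_step_return(rewards, q_tail=0.0, gamma=0.9, n=3)
print(one_step, three_step)
```

Larger n swaps estimated value for observed reward, at the cost of waiting n steps before the backup can be made.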
25The forward view of elig.
- Small-step backups (n=1, n=2, etc.) are slow and nearsighted
- Large-step backups (n=100, n=1000, n=∞) are expensive and may miss near-term effects
- Want a way to combine them
- Can take a weighted average of different backups
- E.g.
26The forward view of elig.
[Diagram: average of two backups, weighted 1/3 and 2/3]
27The forward view of elig.
- How do you know which number of steps to avg over? And what the weights should be?
- Accumulating eligibility traces are just a clever way to easily avg. over all n
28The forward view of elig.
[Diagram: n-step backups weighted by λ^0, λ^1, λ^2, ..., λ^(n-1)]
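A short Python sketch of the weighting. It assumes the normalization used in Sutton & Barto, where the n-step backup gets weight (1-λ)λ^(n-1) so the weights sum to 1, and where, in an episode that ends after a finite number of steps, the final (complete) return absorbs the leftover weight. The function names and numbers are illustrative.

```python
def lambda_weights(lam, horizon):
    """Weights on the 1-step .. horizon-step backups; they sum to 1."""
    weights = [(1 - lam) * lam ** (n - 1) for n in range(1, horizon)]
    weights.append(lam ** (horizon - 1))   # last backup takes the remaining tail
    return weights

def lambda_return(n_step_returns, lam):
    """Weighted average of n-step targets, listed in order n = 1, 2, ..."""
    weights = lambda_weights(lam, len(n_step_returns))
    return sum(w * g for w, g in zip(weights, n_step_returns))

print(lambda_weights(0.0, 3))   # all weight on the one-step backup
print(lambda_weights(1.0, 3))   # all weight on the longest backup
print(lambda_return([1.8, 1.2, 0.81], lam=0.5))   # hypothetical n-step targets
```

So λ acts as a dial between the nearsighted one-step backup (λ=0) and the full-trajectory backup (λ=1).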
29Replacing traces
- Kind just described are accumulating e-traces
- Every time you go back to a state/action, add another 1 to e(s,a)
- There are also replacing eligibility traces
- Every time you go back to a state/action, reset e(s,a) to 1
- Works better sometimes
Sutton & Barto, Sec 7.8
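To see the difference between the two kinds of trace, here is a small Python sketch that tracks e(s,a) for a single state/action pair. The visit times and parameter values are invented, and the trace is recorded right after the visit bump (the value the Q update would use on that step).

```python
def trace_history(visits, steps, gamma=0.9, lam=0.8, replacing=False):
    """Return e(s,a) at each time step for one state/action pair.

    visits: set of time steps at which the pair is visited.
    """
    e, history = 0.0, []
    for t in range(steps):
        e *= gamma * lam                      # decay carried over from last step
        if t in visits:
            e = 1.0 if replacing else e + 1.0
        history.append(e)
    return history

visits = {0, 1, 3}                            # hypothetical revisit pattern
accumulating = trace_history(visits, steps=5)
replacing = trace_history(visits, steps=5, replacing=True)
print(accumulating)   # climbs above 1 when the pair is revisited quickly
print(replacing)      # never exceeds 1
```

With accumulating traces, a pair that is revisited many times in a row can pick up more credit than the pair that actually led to the reward; replacing traces cap the eligibility at 1.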