Accumulation vs. replacement; model-free vs. model-based RL

Provided by: csU94
Learn more at: http://www.cs.unm.edu

Transcript and Presenter's Notes

1
Accumulation vs. replacement; model-free vs.
model-based RL
2
Today in history
  • Last time
  • Explanations of Q-learning
  • Action selection
  • On/off-policy learning
  • Use of experience
  • Eligibility traces
  • SARSA
  • Today
  • SARSA(λ)
  • Replacing vs accumulating traces
  • Thinking about eligibility
  • R3 discussion

3
Administrivia
  • Select presentation days
  • Tues, May 1
  • Alex, Blake, Diane
  • Thu, May 3
  • Hairong, Jesse, Josh

4
Presentation hints
Terran's packaged rant...
5
Presentation hints
  • Formal presentation to an audience
  • Trying to convince audience of something
  • E.g., you have invented a great idea and proven
    that it works
  • Subtext: you're smart and they should invest in
    you
  • Think of it as a sales pitch (sort-of)
  • Get the core idea across
  • Don't dwell on tedious detail
  • Don't be fluffy

6
Presentation hints
  • Practice!
  • Time will be tight -- time yourself
  • Get friends/colleagues to help you practice
  • Practice!
  • Think about order of material presentation
  • Practice!

7
Presentation hints
  • Avoid
  • using
  • every
  • clever
  • powerpoint
  • trick

And be careful with cute, but pointless images
8
Presentation hints
Oh, and avoid using bizarre fonts and really tiny
font sizes just so that you can cram as much junk
on the screen as possible. Remember: it's more
important that the audience actually understand
your material than that you convey more volume
of material in the same time. It's essentially
pointless to ream through bunches of text or
incredible amounts of math if nobody in the
audience gets it. At best, they will be bored
and zone out for most of your talk. At worst,
they will be actively put off or annoyed by your
presentation. And, presumably, you want them all
to like you and be impressed with your material
and ideas, so it's counterproductive to
antagonize your audience. Remember: at some
point, your project, future funding, and/or job
may depend on a presentation like this, so it
behooves you to keep your audience happy. I have
actually seen people give abysmally bad
presentations and be completely rejected from the
job opening because of their poor presentations.
Now that that has been said, I still need to fill
out this page with a large blob of text so that
it's as intimidating as possible. Honestly, I
don't expect anybody to actually read this far
even in the online copy, let alone in class. If
you do actually get this far while I'm flashing
this page up in class, do please shout out. I'll
be most impressed and you'll get brownie points
for speed reading. Even if you happen to read
this far in the online copy, please send me a
note, just to satisfy my curiosity about who's
determined enough to get that far. Hm. Still
half a page to fill. This is a pretty
drastically condensed slide. Let's see. Need
more text. Maybe a little web mining... Ok,
here we go: APRIL is the cruellest month,
breeding / Lilacs out of the dead land, mixing /
Memory and desire, stirring / Dull roots with
spring rain. / Winter kept us warm, covering /
Earth in forgetful snow, feeding / A little life
with dried tubers. / Summer surprised us, coming
over the Starnbergersee / With a shower of rain
we stopped in the colonnade, / And went on in
sunlight, into the Hofgarten, / And drank coffee,
and talked for an hour. / Bin gar keine Russin,
stamm' aus Litauen, echt deutsch. / And when we
were children, staying at the archduke's, / My
cousin's, he took me out on a sled, / And I was
frightened. He said, Marie, / Marie, hold on
tight. And down we went. / In the mountains,
there you feel free. / I read, much of the night,
and go south in the winter. / / What are the
roots that clutch, what branches grow / Out of
this stony rubbish? Son of man, / You cannot say,
or guess, for you know only / A heap of broken
images, where the sun beats, / And the dead tree
gives no shelter, the cricket no relief, / And
the dry stone no sound of water. Only / There is
shadow under this red rock, / (Come in under the
shadow of this red rock), / And I will show you
something different from either / Your shadow at
morning striding behind you / Or your shadow at
evening rising to meet you / I will show you
fear in a handful of dust. / Frisch weht der Wind
/ Der Heimat zu. / Mein Irisch Kind, / Wo weilest
du? / 'You gave me hyacinths first a year ago /
'They called me the hyacinth girl.' / Yet when
we came back, late, from the Hyacinth garden, /
Your arms full, and your hair wet, I could not /
Speak, and my eyes failed, I was neither / Living
nor dead, and I knew nothing, / Looking into the
heart of light, the silence. / Od' und leer das
Meer.
9
Presentation hints
Oh yeah. Don't switch slides too quickly.
10
Presentation hints
  • Be sure to look at audience
  • Don't just read from your slides
  • Don't stare at screen whole time
  • Be careful w/ laser pointers
  • Practice!

11
Back to RL...
12
The Q-learning algorithm
  • Algorithm Q_learn
  • Inputs: State space S; Act. space A
  • Discount γ (0<γ<1); Learning rate α (0<α<1)
  • Outputs: Q
  • Repeat
    • s = get_current_world_state()
    • a = pick_next_action(Q,s)
    • (r,s') = act_in_world(a)
    • Q(s,a) = Q(s,a) + α*(r + γ*max_a'(Q(s',a')) - Q(s,a))
  • Until (bored)
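
A minimal Python sketch of the loop above, for readers who want to run it. The chain world, the epsilon-greedy `pick_next_action`, the zero-initialized table, and every parameter value are assumptions made for illustration; the slide leaves the world and the action-selection rule unspecified.

```python
import random

def q_learn(n_states=5, gamma=0.9, alpha=0.5, epsilon=0.1, episodes=200, seed=0):
    """Tabular Q-learning on a hypothetical chain world (not from the slides).

    States are 0..n_states-1; action 0 moves left, action 1 moves right.
    Entering the rightmost state pays reward 1 and ends the episode.
    """
    rng = random.Random(seed)
    goal = n_states - 1
    Q = [[0.0, 0.0] for _ in range(n_states)]

    def pick_next_action(s):
        # Epsilon-greedy (an assumed rule), breaking ties at random.
        if rng.random() < epsilon or Q[s][0] == Q[s][1]:
            return rng.randrange(2)
        return 0 if Q[s][0] > Q[s][1] else 1

    def act_in_world(s, a):
        s_next = max(0, s - 1) if a == 0 else s + 1
        return (1.0 if s_next == goal else 0.0), s_next

    for _ in range(episodes):
        s = 0
        while s != goal:
            a = pick_next_action(s)
            r, s_next = act_in_world(s, a)
            # Off-policy target: the best action in s', whatever is done next.
            target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

Q = q_learn()
```

The `max` over the next state's row is what makes the update off-policy: the target ignores which action the agent actually takes next.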

13
SARSA-learning algorithm
  • Algorithm SARSA_learn
  • Inputs: State space S; Act. space A
  • Discount γ (0<γ<1); Learning rate α (0<α<1)
  • Outputs: Q
  • Q = random(S,A) // Initialize
  • s = get_current_world_state()
  • a = pick_next_action(Q,s)
  • Repeat
    • (r,s') = act_in_world(a)
    • a' = pick_next_action(Q,s')
    • Q(s,a) = Q(s,a) + α*(r + γ*Q(s',a') - Q(s,a))
    • a = a'; s = s'
  • Until (bored)
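
The same loop as a runnable Python sketch. As before, the chain world, the epsilon-greedy action picker, and the parameter values are illustrative assumptions; the table is also initialized to zeros here rather than with `random(S,A)`, to keep the run easy to reason about.

```python
import random

def sarsa_learn(n_states=5, gamma=0.9, alpha=0.5, epsilon=0.1, episodes=200, seed=0):
    """Tabular SARSA on a hypothetical chain world (not from the slides)."""
    rng = random.Random(seed)
    goal = n_states - 1
    Q = [[0.0, 0.0] for _ in range(n_states)]  # zeros instead of random(S,A)

    def pick_next_action(s):
        # Epsilon-greedy (an assumed rule), breaking ties at random.
        if rng.random() < epsilon or Q[s][0] == Q[s][1]:
            return rng.randrange(2)
        return 0 if Q[s][0] > Q[s][1] else 1

    def act_in_world(s, a):
        s_next = max(0, s - 1) if a == 0 else s + 1
        return (1.0 if s_next == goal else 0.0), s_next

    for _ in range(episodes):
        s = 0
        a = pick_next_action(s)
        while s != goal:
            r, s_next = act_in_world(s, a)
            a_next = pick_next_action(s_next)
            # On-policy target: uses the action a' that will actually be taken.
            Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
            s, a = s_next, a_next
    return Q

Q = sarsa_learn()
```

The only change from Q-learning is the target: `Q[s_next][a_next]` replaces the `max`, so the learned values reflect the exploring policy itself.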

14
Radioactive breadcrumbs
  • Can now define eligibility traces for SARSA
  • In addition to Q(s,a) table, keep an e(s,a) table
  • Records eligibility (real number) for each
    state/action pair
  • At every step ((s,a,r,s',a') tuple)
  • Increment e(s,a) for current (s,a) pair by 1
  • Update all Q(s,a) vals in proportion to their
    e(s,a)
  • Decay all e(s,a) by factor of γλ
  • Leslie Kaelbling calls this the "radioactive
    breadcrumbs" form of RL
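
To see how the crumbs fade, the bookkeeping alone can be sketched in a few lines of Python. The state and action names and the choice γ=1, λ=0.5 (picked so the numbers stay exact) are assumptions for illustration; the Q update itself is left out.

```python
from collections import defaultdict

def crumb_trail(steps, gamma=1.0, lam=0.5):
    """Return the e(s,a) table as seen at each step's update, for a fixed path."""
    e = defaultdict(float)
    snapshots = []
    for s, a in steps:
        e[(s, a)] += 1.0            # drop a fresh crumb on the current pair
        snapshots.append(dict(e))   # the values the Q update would be scaled by
        for key in e:
            e[key] *= gamma * lam   # every crumb decays by gamma*lambda
    return snapshots

trail = crumb_trail([("s0", "right"), ("s1", "right"), ("s2", "right")])
print(trail[-1])
# {('s0', 'right'): 0.25, ('s1', 'right'): 0.5, ('s2', 'right'): 1.0}
```

The most recent pair carries full eligibility and older pairs carry geometrically less, which is why a reward found now still credits earlier steps, only more weakly.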

15
SARSA(λ)-learning alg.
  • Algorithm SARSA(λ)_learn
  • Inputs: S, A, γ (0<γ<1), α (0<α<1), λ (0<λ<1)
  • Outputs: Q
  • e(s,a) = 0 // for all s, a
  • s = get_curr_world_st(); a = pick_nxt_act(Q,s)
  • Repeat
    • (r,s') = act_in_world(a)
    • a' = pick_next_action(Q,s')
    • δ = r + γ*Q(s',a') - Q(s,a)
    • e(s,a) += 1
    • foreach (s'',a'') pair in (S×A)
      • Q(s'',a'') = Q(s'',a'') + α*e(s'',a'')*δ
      • e(s'',a'') *= γλ
    • a = a'; s = s'
  • Until (bored)
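
A hedged Python sketch of this algorithm with accumulating traces. The chain world, the epsilon-greedy picker, the zero-initialized Q table, and the per-episode trace reset are all assumptions added to make the example runnable; none of them come from the slide.

```python
import random

def sarsa_lambda_learn(n_states=5, gamma=0.9, alpha=0.5, lam=0.8,
                       epsilon=0.1, episodes=200, seed=0):
    """Tabular SARSA(lambda), accumulating traces, hypothetical chain world."""
    rng = random.Random(seed)
    goal = n_states - 1
    Q = [[0.0, 0.0] for _ in range(n_states)]

    def pick_next_action(s):
        # Epsilon-greedy (an assumed rule), breaking ties at random.
        if rng.random() < epsilon or Q[s][0] == Q[s][1]:
            return rng.randrange(2)
        return 0 if Q[s][0] > Q[s][1] else 1

    def act_in_world(s, a):
        s_next = max(0, s - 1) if a == 0 else s + 1
        return (1.0 if s_next == goal else 0.0), s_next

    for _ in range(episodes):
        e = [[0.0, 0.0] for _ in range(n_states)]  # assumed: reset each episode
        s = 0
        a = pick_next_action(s)
        while s != goal:
            r, s_next = act_in_world(s, a)
            a_next = pick_next_action(s_next)
            delta = r + gamma * Q[s_next][a_next] - Q[s][a]
            e[s][a] += 1.0                          # accumulate on current pair
            for si in range(n_states):
                for ai in range(2):
                    Q[si][ai] += alpha * e[si][ai] * delta
                    e[si][ai] *= gamma * lam        # decay every trace
            s, a = s_next, a_next
    return Q

Q1 = sarsa_lambda_learn(episodes=1)
```

After a single episode, every pair on the path to the goal already has a positive value, because the final reward is spread back along the trace. One-step SARSA would have updated only the last pair.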

17
The trail of crumbs
Sutton & Barto, Sec 7.5
18
The trail of crumbs
λ=0
Sutton & Barto, Sec 7.5
19
The trail of crumbs
Sutton & Barto, Sec 7.5
20
Eligibility for a single state
[Figure: eligibility e(s_i,a_j) over time, marked at the 1st visit, 2nd visit, ...]
Sutton & Barto, Sec 7.5
21
Eligibility trace followup
  • Eligibility trace allows
  • Tracking where the agent has been
  • Backup of rewards over longer periods
  • Credit assignment: state/action pairs rewarded
    for having contributed to getting to the reward
  • Why does it work?

22
The forward view of elig.
  • Original SARSA did one-step backup

[Figure: one-step backup diagram. Labels: info backup; rest of trajectory; Q(s_{t+1},a_{t+1}); r_t; Q(s,a)]
23
The forward view of elig.
  • Original SARSA did one-step backup
  • Could also do a two-step backup

[Figure: two-step backup diagram. Labels: info backup; rest of trajectory; Q(s_{t+2},a_{t+2}); r_{t+1}; r_t; Q(s,a)]
24
The forward view of elig.
  • Original SARSA did one-step backup
  • Could also do a two-step backup
  • Or even an n-step backup
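
Written out, the backup targets in this progression look as follows. The symbol R_t^{(n)} for the n-step target is notation introduced here for illustration; the indexing follows the slides, where the one-step backup uses r_t and Q(s_{t+1},a_{t+1}).

```latex
\begin{align*}
R_t^{(1)} &= r_t + \gamma\, Q(s_{t+1}, a_{t+1}) \\
R_t^{(2)} &= r_t + \gamma\, r_{t+1} + \gamma^{2}\, Q(s_{t+2}, a_{t+2}) \\
R_t^{(n)} &= \sum_{k=0}^{n-1} \gamma^{k}\, r_{t+k} + \gamma^{n}\, Q(s_{t+n}, a_{t+n}) \\
Q(s_t, a_t) &\leftarrow Q(s_t, a_t) + \alpha \left( R_t^{(n)} - Q(s_t, a_t) \right)
\end{align*}
```

Each extra step replaces one layer of estimated value with an observed reward, at the cost of waiting longer before the update can be made.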

25
The forward view of elig.
  • Small-step backups (n=1, n=2, etc.) are slow and
    nearsighted
  • Large-step backups (n=100, n=1000, n=∞) are
    expensive and may miss near-term effects
  • Want a way to combine them
  • Can take a weighted average of different backups
  • E.g.

26
The forward view of elig.
[Figure: weighted average of two backups, with weights 1/3 and 2/3]
27
The forward view of elig.
  • How do you know which number of steps to avg
    over? And what the weights should be?
  • Accumulating eligibility traces are just a clever
    way to easily avg. over all n
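
One standard way to make "average over all n" concrete is the λ-return from Sutton & Barto, which weights the n-step target by (1-λ)λ^(n-1) and gives the leftover weight to the full return once the episode ends. The Python sketch below computes it for a recorded trajectory; the function names, the list layout, and the sample numbers are assumptions for illustration.

```python
def n_step_return(rewards, q_boot, n, gamma):
    """n-step target: n discounted rewards plus a bootstrapped Q value.

    rewards[k] is r_{t+k}; q_boot[k] is Q(s_{t+k+1}, a_{t+k+1}),
    with 0.0 for the terminal state.
    """
    g = sum(gamma ** k * rewards[k] for k in range(n))
    return g + gamma ** n * q_boot[n - 1]

def lambda_return(rewards, q_boot, gamma, lam):
    """Average of all n-step targets with weights (1-lam)*lam**(n-1)."""
    T = len(rewards)
    total = 0.0
    for n in range(1, T):
        total += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, q_boot, n, gamma)
    # All remaining weight goes to the complete (T-step) return.
    total += lam ** (T - 1) * n_step_return(rewards, q_boot, T, gamma)
    return total

# Hypothetical three-step episode ending in a reward of 1.
rewards = [0.0, 0.0, 1.0]
q_boot = [0.5, 0.5, 0.0]
one_step = lambda_return(rewards, q_boot, gamma=0.9, lam=0.0)
full = lambda_return(rewards, q_boot, gamma=0.9, lam=1.0)
```

Setting `lam=0.0` recovers the one-step SARSA target and `lam=1.0` recovers the full return; values in between blend the two ends.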

28
The forward view of elig.
[Figure: backups of every length n, weighted by λ^0, λ^1, λ^2, ..., λ^(n-1)]
29
Replacing traces
  • Kind just described are accumulating e-traces
  • Every time you go back to state, add extra e.
  • There are also replacing eligibility traces
  • Every time you go back to a state/action, reset
    e(s,a) to 1
  • Works better sometimes

Sutton & Barto, Sec 7.8
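
The difference between the two kinds of trace shows up as soon as a pair is revisited. A small Python sketch for a single state/action pair follows; the visit pattern and the decay factor γλ=0.5 are made-up values chosen so the arithmetic is exact.

```python
def trace_values(visited, gamma_lambda=0.5, replacing=False):
    """Trace of one state/action pair; visited[t] is True if it occurs at step t."""
    e, out = 0.0, []
    for hit in visited:
        e *= gamma_lambda                      # decay carried over from last step
        if hit:
            e = 1.0 if replacing else e + 1.0  # reset to 1 vs. add 1
        out.append(e)                          # value used in this step's update
    return out

visits = [True, True, False, True]
accumulating = trace_values(visits)
replacing = trace_values(visits, replacing=True)
print(accumulating)  # [1.0, 1.5, 0.75, 1.375]
print(replacing)     # [1.0, 1.0, 0.5, 1.0]
```

With accumulation, a pair that is revisited often can build up a trace above 1 and receive outsized updates; a replacing trace caps it at 1.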