Does RL Occur Naturally? - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Does RL Occur Naturally?
  • C. R. Gallistel
  • Rutgers Center for Cognitive Science

2
Turing's Vision (47-48)
  • It would be quite possible to have the machine
    try out behaviors and accept or reject them
  • What we want is a machine that can learn from
    experience. The possibility of letting the
    machine alter its own instructions provides the
    mechanism for this
  • It might be possible to carry through the
    organizing of a learning machine with only two
    interfering inputs, one for reward (R) or
    pleasure and the other for pain or punishment
    (P). It is intended that pain stimuli occur when
    the machine's behavior is wrong, pleasure stimuli
    when it is particularly right.

3
A Different Vision
  • Policy (what to do given a state of the world) is
    pre-specified and immutable
  • Learning consists in determining the state of the
    world: it's all model estimation
  • Appropriate sampling behavior is itself
    prespecified

4
The Deep Reasons
  • Wolpert & Macready's No Free Lunch theorems
  • Chomsky's Poverty of the Stimulus argument
  • Bottom line: reinforcement learning takes too
    long
  • Because there is not enough information in the R
    and P signals
  • Because learning in the absence of a highly
    structured hypothesis space is a practical
    impossibility (we don't live long enough)

5
Learning by Integrating
  • The ant knows where it is
  • This knowledge is acquired (learned)
  • It is acquired by path integration (a minimal
    sketch follows below)

-- Harkness & Maroudas, 1985
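
A minimal sketch of what path integration computes, under simplifying assumptions (a flat 2-D world, noiseless heading and odometry signals; the names and numbers are illustrative, not from the source):

    import math

    def integrate_path(steps):
        # Accumulate a position estimate from (compass heading in degrees,
        # distance) steps; the nest is taken to be the origin.
        x = y = 0.0
        for heading_deg, distance in steps:
            theta = math.radians(heading_deg)
            x += distance * math.sin(theta)  # east component
            y += distance * math.cos(theta)  # north component
        return x, y

    def homing_vector(x, y):
        # Compass bearing (deg) and distance of the course back to the nest.
        return math.degrees(math.atan2(-x, -y)) % 360.0, math.hypot(x, y)

    # Example: 10 m east, then 10 m north; home now lies 14.1 m to the SW.
    print(homing_vector(*integrate_path([(90, 10.0), (0, 10.0)])))  # (225.0, 14.14...)
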
6
Building a Map
  • The ant remembers where the food was (records its
    coordinates)
  • Bees and ants make a map on the GPS principle
    (record location coordinates and views)
  • They do not discover by trial and error that this
    is a good thing to do
  • As in a GPS unit, the computational machinery to
    determine a course from an arbitrary location to
    an arbitrary location is built in (see the sketch
    below)
  • No RL here
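
A minimal illustration of the "GPS principle" as stated above: once locations are stored as coordinates, a course between any two of them is a fixed computation, with nothing left to discover by trial and error (coordinates and names are made up for illustration):

    import math

    def course(from_xy, to_xy):
        # Compass bearing (deg) and distance from one stored location to another.
        dx = to_xy[0] - from_xy[0]  # east displacement
        dy = to_xy[1] - from_xy[1]  # north displacement
        return math.degrees(math.atan2(dx, dy)) % 360.0, math.hypot(dx, dy)

    nest = (0.0, 0.0)
    food = (30.0, 40.0)        # coordinates recorded on an earlier trip
    print(course(nest, food))  # -> (36.9 deg, 50.0)
    print(course(food, nest))  # -> (216.9 deg, 50.0)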

7
Ranging Behavior
  • When leaving a new food source or a new nest
    (hive), bees and wasps fly backwards in an
    ever-increasing zigzag
  • Determining the distances of visual features by
    parallax
  • Innately specified sampling (model-building)
    behavior

Wehner, 1981
8
Also in the Locust
  • Locust scanning (peering)
  • Sobel, 1990
  • Moved the target so as to make the image motion, a,
    independent of the distance, D
  • Reproduced the function relating take-off velocity
    to D (see the parallax sketch below)
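
A minimal sketch of the distance-from-parallax computation behind both the ranging flights and locust peering, under the textbook small-angle assumption that a sideways movement at speed v makes the image of a feature at distance D sweep past at angular speed of roughly v / D (the numbers are made up):

    def distance_from_parallax(lateral_speed, image_angular_speed):
        # D ~= v / omega: nearer features produce faster image motion for
        # the same self-motion, so distance falls out of the ratio.
        return lateral_speed / image_angular_speed

    # Peering sideways at 3 cm/s while the target's image sweeps at
    # 0.3 rad/s implies a target roughly 10 cm away.
    print(distance_from_parallax(3.0, 0.3))  # -> 10.0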

9
Learning by Parameter Estimation
  • Animals (including insects) use the sun as a
    compass reference
  • To do this, they must learn the solar ephemeris:
    the sun's compass bearing as a function of the
    time of day (where it is when)
  • The solar ephemeris varies with latitude and
    season (see the parameter-fitting sketch below)
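
A toy illustration of learning by parameter estimation: the functional form below (a smooth east-to-west shift that is fastest around solar noon) stands in for the innate ephemeris form, and only its two parameters are fit to a handful of observations. The form, the parameter grid, and the data are invented for illustration only.

    import numpy as np

    def ephemeris(t, t_noon, k):
        # Illustrative form: azimuth runs from ~east (90 deg) in the morning
        # to ~west (270 deg) in the afternoon, shifting fastest near noon.
        return 90.0 + 180.0 / (1.0 + np.exp(-k * (t - t_noon)))

    def fit_parameters(times, azimuths):
        # Grid-search the two free parameters against sparse observations.
        best = None
        for t_noon in np.arange(10.0, 14.0, 0.05):
            for k in np.arange(0.2, 3.0, 0.05):
                err = np.sum((ephemeris(times, t_noon, k) - azimuths) ** 2)
                if best is None or err < best[0]:
                    best = (err, t_noon, k)
        return best[1], best[2]

    # A few late-afternoon sightings (hour of day, sun azimuth in degrees) ...
    obs_t = np.array([16.0, 17.0, 18.0])
    obs_az = np.array([250.0, 258.0, 263.0])
    params = fit_parameters(obs_t, obs_az)
    # ... fix the parameters, and the fitted form then generalizes to times
    # of day never observed, e.g. mid-morning:
    print(ephemeris(9.0, *params))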

10
Learning from the Dance
  • A returning forager does a dance to tell the other
    foragers the location (range and bearing) of the
    source
  • The compass bearing of the source, g, is specified
    by giving its current solar bearing, s
  • Range is specified by the number of waggles
  • Hopeless as an RL problem?
  • compass bearing of sun = g (compass bearing of
    source) - s (solar bearing of source); the
    arithmetic is written out below
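
The arithmetic in the last bullet, written out; the angle conventions (compass degrees, clockwise from north) are assumed for illustration:

    def source_compass_bearing(sun_azimuth, solar_bearing):
        # What a recruit must fly: g = compass bearing of sun + s.
        return (sun_azimuth + solar_bearing) % 360.0

    def implied_sun_azimuth(source_bearing, solar_bearing):
        # What an observer can infer from a dance: sun = g - s.
        return (source_bearing - solar_bearing) % 360.0

    # Sun at azimuth 240 deg; a source due west (270 deg) is danced as s = 30 deg.
    print(source_compass_bearing(240.0, 30.0))  # -> 270.0
    print(implied_sun_azimuth(270.0, 30.0))     # -> 240.0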

11
Ephemeris Framework
12
Deceived Dancing
Dyer, 1987
13
Poverty of Stimulus
  • Dyer & Dickinson, 1994
  • Incubator-raised bees were allowed to forage at a
    station due west of the hive, but only in the late
    afternoon, when the sun was declining in the west
  • On a heavily overcast day, the hive was moved to a
    new field whose landmark line had a different
    compass orientation, and the bees were allowed to
    forage in the morning (with the feeder again west
    of the hive)
  • The experimenter observes the dances of returning
    foragers to estimate where they believe the sun
    to be

14
Bees Believe Earth is Round
15
Implications
  • The form of the solar ephemeris equation is built
    into the nervous system
  • Only its parameters are estimated from
    observation
  • This solves the poverty-of-the-stimulus problem:
    the information about the universal properties of
    the ephemeris is in the priors
  • A neural net without this prior information could
    not generalize as the bees do

16
Language Learning
  • Same story?
  • An innate universal grammar specifies the
    structure common to all languages
  • Distinctions between languages are due to
    differences in parameters (e.g., head-final
    versus head-first)
  • Learning a language reduces to learning the
    (binary?) parameter values
  • Mark Baker (2001), The Atoms of Language

17
Natural Learning Curves
  • Gallistel et al. (PNAS, 2004)
  • Analyzed individual(!) learning curves from
    standard paradigms in pigeons, rats, rabbits,
    and mice
  • Pavlovian conditioning (autoshaping in pigeon,
    rat, and mouse)
  • Eyeblink conditioning in the rabbit
  • Maze learning in the rat
  • Water maze in the mouse
  • Regardless of paradigm, the typical curve cannot
    be distinguished from a step function
  • The latency and size of the step vary between
    subjects
  • Averaging across these steps produces a gradual
    learning curve: its gradualness is an averaging
    artifact (see the toy demonstration below)
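
A toy demonstration of the averaging artifact: every simulated subject is a pure step function, with the step latency varying across subjects, yet the group average rises smoothly (all numbers are invented):

    import numpy as np

    rng = np.random.default_rng(0)
    n_subjects, n_trials = 40, 100
    trials = np.arange(n_trials)

    # Each subject performs at 0 until an abrupt step at a random trial, 1 after.
    latencies = rng.integers(5, 80, size=n_subjects)
    individual = (trials[None, :] >= latencies[:, None]).astype(float)

    # The group mean climbs gradually even though no individual curve does.
    print(individual.mean(axis=0)[::10].round(2))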

18
Matching
  • Subjects forage back and forth between locations
    where food becomes available unpredictably (on
    random-rate schedules with unlimited holds)
  • Subjects match the ratio of the time they invest
    in the locations (expected stay durations, T1/T2)
    to the ratio of the incomes they have derived
    from them (I1/I2)
  • Matching equates returns: Ri = Ii/Ti, and
    I1/T1 = I2/T2 iff T1/T2 = I1/I2 (checked
    numerically below)
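
A numerical check of the identity in the last bullet, with made-up incomes (rewards per unit of session time) and stay-duration ratios:

    def returns(incomes, stays):
        # Return at each location: reward per unit of time invested, Ri = Ii/Ti.
        return [i / t for i, t in zip(incomes, stays)]

    incomes = (6.0, 2.0)                 # I1, I2
    print(returns(incomes, (3.0, 1.0)))  # T1/T2 = I1/I2 = 3 -> [2.0, 2.0], equal
    print(returns(incomes, (1.0, 1.0)))  # equal stays -> [6.0, 2.0], unequal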

19
RL Models
  • Most assume hill-climbing discovery of the policy
    that equates returns
  • The policy is one-dimensional (the ratio of
    expected stay durations)
  • Try out a given policy (stay ratio)
  • Determine the direction of the inequality in
    returns
  • Adjust the investment ratio accordingly (schematic
    sketch below)
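
A schematic of the kind of hill-climbing model described above (not any specific published model): try a stay ratio, see which side currently yields the better return, and nudge the ratio that way.

    def hill_climb_stay_ratio(income_ratio, steps=20, step=0.2, start=1.0):
        # income_ratio = I1/I2; returns are equated exactly when the stay
        # ratio T1/T2 equals it, so the climber should creep toward that value.
        stay_ratio = start
        for _ in range(steps):
            relative_return = income_ratio / stay_ratio  # (I1/T1) / (I2/T2)
            if relative_return > 1.0:    # location 1 currently pays better
                stay_ratio *= 1.0 + step
            elif relative_return < 1.0:  # location 2 currently pays better
                stay_ratio /= 1.0 + step
        return stay_ratio

    print(hill_climb_stay_ratio(3.0))  # approaches 3.0 over many adjustments

The point of the next slide is that real subjects do not look like this: their adjustment after a change is abrupt, not incremental.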

20
But (Gallistel et al., 2001)
  • The adjustment of the investment ratio after a
    step change in the relative rates of reward is
    quick and step-like

21
Bayesian Ideal Detector Analysis
22
Second Example
23
Δ Incomes, Not Δ Returns
  • Evidence of a change in behavior appears as soon
    as there is evidence of a change in incomes
  • And (often) before there is evidence of a change
    in returns (see the change-detection sketch below)
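
A minimal sketch of detecting a change in income from the event record alone: treat visits as Bernoulli trials, put flat priors on the reward probability, and compare "one rate throughout" against "the rate changed at some trial". This is only an illustration of the idea, not the analysis used in the cited work.

    from math import lgamma, log, exp

    def log_marginal(successes, trials):
        # Log marginal likelihood of a Bernoulli sequence, uniform prior on p.
        return lgamma(successes + 1) + lgamma(trials - successes + 1) - lgamma(trials + 2)

    def log_odds_of_change(rewards):
        # Log odds for "reward probability changed once" vs "never changed",
        # with a uniform prior over the possible change points.
        n = len(rewards)
        no_change = log_marginal(sum(rewards), n)
        terms = [log_marginal(sum(rewards[:c]), c) +
                 log_marginal(sum(rewards[c:]), n - c)
                 for c in range(1, n)]
        m = max(terms)
        change = m + log(sum(exp(t - m) for t in terms)) - log(n - 1)
        return change - no_change

    # Income jumps from ~1 reward per 10 visits to ~9 per 10 halfway through.
    visits = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
    print(log_odds_of_change(visits))  # clearly positive: evidence of a change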

24
Evidence of Absence of Evidence
  • Upper panel: the odds that the subjects' stay
    durations had changed, as a function of session
    time
  • Lower panel: the odds that the subjects' returns
    had changed. There was no evidence of a change in
    the returns!

25
Implications
  • Matching is an innate policy
  • It depends only on estimates of incomes
  • Anti-aliasing sampling behavior, which detects
    periodic structure in reward provision, is built
    into the policy
  • Estimates of the incomes to be expected are based
    on small samples, taken only when a change in
    income is detected
  • Here, too, learning is model updating, not
    policy-value updating
  • Subjects perversely ignore returns (policy values)

26
Conclusions
  • Most (all?) natural learning looks like model
    estimation
  • Efficient model estimation is made possible by
  • Informative priors (a highly structured
    problem-specific hypothesis space)
  • Innately specified efficient sampling routines