A Bayesian Approach to the Poverty of the Stimulus - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: A Bayesian Approach to the Poverty of the Stimulus

1
A Bayesian Approach to the Poverty of the
Stimulus
  • Amy Perfors
  • MIT

With Josh Tenenbaum (MIT) and Terry Regier
(University of Chicago)
2
Innate
Learned
3
Explicit Structure
Innate
Learned
No explicit Structure
4
Language has hierarchical phrase structure
No
Yes
5
Why believe that language has hierarchical phrase structure?
  • Formal properties: an information-theoretic, simplicity-based argument (Chomsky, 1956)
  • Dependency structure of language
  • A finite-state grammar cannot capture the infinite sets of English sentences with dependencies like this
  • If we restrict ourselves to only a finite set of sentences, then in theory a finite-state grammar could account for them, but this grammar will be "so complex as to be of little use or interest"

6
Why believe that structure dependence is innate?
  • The Argument from the Poverty of the Stimulus
    (PoS)

Data: Simple declarative: "The girl is happy", "They are eating". Simple interrogative: "Is the girl happy?", "Are they eating?"

Hypotheses:
  • Linear: move the first "is" (auxiliary) in the sentence to the beginning
  • Hierarchical: move the auxiliary in the main clause to the beginning

Test: Complex declarative: "The girl who is sleeping is happy."

Result: Children say "Is the girl who is sleeping happy?", NOT "Is the girl who sleeping is happy?"

(Chomsky, 1965, 1980; Crain & Nakayama, 1987)
7
Why believe it's not innate?
  • There are actually enough complex interrogatives (Pullum & Scholz '02)
  • Children's behavior can be explained via statistical learning of natural language data (Lewis & Elman '01; Reali & Christiansen '05)
  • It is not necessary to assume a grammar with explicit structure

8
Explicit Structure
Innate
Learned
No explicit Structure
9
Explicit Structure
Innate
Learned
No explicit Structure
10
Our argument
11
Our argument
  • We suggest that, contra the PoS claim:
  • It is possible, given the nature of the input and certain domain-general assumptions about the learning mechanism, that an ideal, unbiased learner can realize that language has a hierarchical phrase structure; therefore this knowledge need not be innate
  • The reason: grammars with hierarchical phrase structure offer an optimal tradeoff between simplicity and fit to natural language data
12
Plan
  • Model
  • Data: corpus of child-directed speech (CHILDES)
  • Grammars
  • Linear & hierarchical
  • Both: hand-designed + result of local search
  • Linear: automatic, unsupervised ML
  • Evaluation
  • Complexity vs. fit
  • Results
  • Implications
13
The model: Data
  • Corpus from CHILDES database (Adam, Brown corpus)
  • 55 files, age range 2;3 to 5;2
  • Sentences spoken by adults to children
  • Each word replaced by its syntactic category: det, n, adj, prep, pro, prop, to, part, vi, v, aux, comp, wh, c
  • Ungrammatical sentences and the most grammatically complex sentence types were removed; kept 21792 of 25876 utterances
  • Removed: topicalized sentences (66), sentences with serial verb constructions (459), subordinate phrases (845), sentential complements (1636), conjunctions (634), and ungrammatical sentences (444)

14
Data
  • Final corpus contained 2336 individual sentence
    types corresponding to 21792 sentence tokens

15
Data variation
  • Amount of evidence available at different points
    in development

16
Data variation
  • Amount of evidence available at different points
    in development
  • Amount comprehended at different points in
    development

17
Data: amount available
  • Rough estimate: split by age

Epoch    Files    Age            Sentence types    % of types
0        1        2;3            173               7.4
1        11       2;3 to 2;8     879               38
2        22       2;3 to 3;1     1295              55
3        33       2;3 to 3;5     1735              74
4        44       2;3 to 4;2     2090              89
5        55       2;3 to 5;2     2336              100
18
Data: amount comprehended
  • Rough estimate: split by frequency

Level    Frequency (at least)    Sentence types    % of types    % of tokens
1        500                     8                 0.3           28
2        100                     37                1.6           55
3        50                      67                2.9           64
4        25                      115               4.9           71
5        10                      268               12            82
6        1 (all)                 2336              100           100
19
The model
  • Data
  • Child-directed speech (CHILDES)
  • Grammars
  • Linear & hierarchical
  • Both: hand-designed + result of local search
  • Linear: automatic, unsupervised ML
  • Evaluation
  • Complexity vs. fit

20
Grammar types
Linear:
  • Flat grammar. Rules: a list of each sentence.
  • Regular grammar. Rules: NT → t NT, NT → t.
  • 1-state grammar. Rules: anything is accepted.
Hierarchical:
  • Context-free grammar. Rules: NT → NT NT, NT → t NT, NT → NT, NT → t.
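To make these schemas concrete, here is an illustrative Python sketch (the specific rules, the category inventory, and the helper is_right_linear are hypothetical examples for exposition, not the grammars used in the study). It shows how a regular grammar restricts every production to the forms NT → t NT or NT → t, while a context-free grammar allows arbitrary mixes of terminals and nonterminals; a flat grammar would simply list every sentence as a production of S, and a 1-state grammar would accept any sequence of categories.

```python
# Hypothetical toy grammars over a few syntactic categories; illustration only.
TERMINALS = {"det", "n", "pro", "aux", "vi", "v"}

# Regular grammar: every production is "terminal" or "terminal followed by one nonterminal".
regular_grammar = {
    "S":  [["pro", "VP"], ["det", "N1"]],
    "N1": [["n", "VP"]],
    "VP": [["aux", "V1"], ["vi"]],
    "V1": [["vi"]],
}

# Context-free grammar: productions may expand to any mix of terminals and nonterminals.
context_free_grammar = {
    "S":  [["NP", "VP"]],
    "NP": [["det", "n"], ["pro"]],
    "VP": [["aux", "VP"], ["vi"], ["v", "NP"]],
}

def is_right_linear(grammar):
    """True if every production has the regular-grammar shape NT -> t NT or NT -> t."""
    for productions in grammar.values():
        for prod in productions:
            if len(prod) == 1 and prod[0] in TERMINALS:
                continue
            if len(prod) == 2 and prod[0] in TERMINALS and prod[1] in grammar:
                continue
            return False
    return True

print(is_right_linear(regular_grammar))       # True
print(is_right_linear(context_free_grammar))  # False
```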
21
Specific hierarchical grammars: hand-designed

CFG-S (Standard CFG)
Description: designed to be as linguistically plausible as possible
77 rules, 15 non-terminals

CFG-L (Larger CFG)
Description: derived from CFG-S; contains additional productions corresponding to different expansions of the same NT (puts less probability mass on recursive productions)
133 rules, 15 non-terminals
22
Specific linear grammars: hand-designed
Continuum from exact fit / no compression to poor fit / high compression:
FLAT (exact fit, no compression): a list of each sentence; 2336 rules, 0 non-terminals
1-STATE (poor fit, high compression): anything accepted; 26 rules, 0 non-terminals
23
Specific linear grammars: hand-designed
Continuum from exact fit / no compression to poor fit / high compression:
FLAT (exact fit, no compression): a list of each sentence; 2336 rules, 0 non-terminals
REG-N: narrowest regular grammar derived from the CFG; 289 rules, 85 non-terminals
1-STATE (poor fit, high compression): anything accepted; 26 rules, 0 non-terminals
24
Specific linear grammars: hand-designed
Continuum from exact fit / no compression to poor fit / high compression:
FLAT (exact fit, no compression): a list of each sentence; 2336 rules, 0 non-terminals
REG-N: narrowest regular grammar derived from the CFG; 289 rules, 85 non-terminals
REG-M: mid-level regular grammar derived from the CFG; 169 rules, 14 non-terminals
1-STATE (poor fit, high compression): anything accepted; 26 rules, 0 non-terminals
25
Specific linear grammars: hand-designed
Continuum from exact fit / no compression to poor fit / high compression:
FLAT (exact fit, no compression): a list of each sentence; 2336 rules, 0 non-terminals
REG-N: narrowest regular grammar derived from the CFG; 289 rules, 85 non-terminals
REG-M: mid-level regular grammar derived from the CFG; 169 rules, 14 non-terminals
REG-B: broadest regular grammar derived from the CFG; 117 rules, 10 non-terminals
1-STATE (poor fit, high compression): anything accepted; 26 rules, 0 non-terminals
26
Automated search
Local search around hand-designed grammars
Linear: unsupervised, automatic HMM learning (Goldwater & Griffiths, 2007)
A Bayesian model for acquisition of a trigram HMM (designed for POS tagging, but given a corpus of syntactic categories, it learns a regular grammar)
27
The model
  • Data
  • Child-directed speech (CHILDES)
  • Grammars
  • Linear & hierarchical
  • Hand-designed + result of local search
  • Linear: automatic, unsupervised ML
  • Evaluation
  • Complexity vs. fit

28
Grammars
T: type of grammar
G: specific grammar
D: data
unbiased (uniform)
29
Grammars
T: type of grammar
G: specific grammar
D: data
complexity (prior)
data fit (likelihood)
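Read together, these labels suggest the standard hierarchical Bayesian factorization over the T → G → D chain shown on the slides. As a sketch of that reading (assuming a uniform prior over grammar types, as the "unbiased (uniform)" label indicates):

P(G, T | D) ∝ P(D | G) × P(G | T) × P(T)

where P(T) is the unbiased (uniform) prior over grammar types, P(G | T) is the complexity prior over specific grammars, and P(D | G) is the data fit (likelihood).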
30
Tradeoff Complexity vs. Fit
  • Low prior probability = more complex
  • Low likelihood = poor fit to the data

Fit: high; Simplicity: low
Fit: low; Simplicity: high
Fit: moderate; Simplicity: moderate
31
Measuring complexity: prior
  • Designing a grammar (God's-eye view)
  • Grammars with more rules and non-terminals will have lower prior probability

n: number of nonterminals
Ni: number of items in production i
Pk: number of productions of nonterminal k
V: vocabulary size
θk: production probability parameters for nonterminal k
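As an illustration only, and not the paper's exact formula, here is a minimal Python sketch of a prior with this character: a grammar is scored by the cost of choosing its number of nonterminals (n), each nonterminal's number of productions (Pk), each production's length (Ni), and each item within each production. The geometric distributions and the uniform choice over the n nonterminals plus the V vocabulary items are assumptions of the sketch, and the production-probability parameters θk are ignored here.

```python
import math

def log_geometric(k, p=0.5):
    # log probability of a count k = 1, 2, 3, ... under a geometric distribution
    return (k - 1) * math.log(1 - p) + math.log(p)

def grammar_log_prior(grammar, vocab_size, p=0.5):
    """Illustrative complexity prior: more nonterminals, more productions,
    and longer productions all lower the log prior probability.

    grammar: dict mapping each nonterminal to a list of productions,
             where each production is a list of symbols.
    """
    n = len(grammar)                    # number of nonterminals
    log_prior = log_geometric(n, p)     # cost of choosing n
    for productions in grammar.values():
        log_prior += log_geometric(len(productions), p)      # cost of P_k
        for production in productions:
            log_prior += log_geometric(len(production), p)   # cost of N_i
            # each item chosen uniformly from the nonterminals plus the vocabulary (V)
            log_prior += len(production) * math.log(1.0 / (n + vocab_size))
    return log_prior

# A grammar with fewer, shorter rules gets a higher (less negative) log prior.
small = {"S": [["pro", "vi"]]}
large = {"S": [["NP", "VP"]], "NP": [["det", "n"], ["pro"]], "VP": [["aux", "vi"], ["vi"]]}
print(grammar_log_prior(small, vocab_size=14), grammar_log_prior(large, vocab_size=14))
```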
32
Measuring fit: likelihood
  • Probability of that grammar generating the data
  • Product of the probability of each parse

Example: pro aux det n
0.5 × 0.25 × 1.0 × 0.25 × 0.5 = 0.016
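A minimal Python sketch of this computation (assuming, for illustration, that each sentence type contributes the probability of a single parse; the actual model's handling of alternative parses may differ):

```python
import math

def parse_probability(rule_probs):
    # probability of one parse = product of the probabilities of the rules it uses
    prob = 1.0
    for p in rule_probs:
        prob *= p
    return prob

def corpus_log_likelihood(parsed_corpus):
    # fit to the data = sum of log parse probabilities over all sentence types
    return sum(math.log(parse_probability(rules)) for rules in parsed_corpus)

# The slide's example: a parse of "pro aux det n" using rules with probabilities
# 0.5, 0.25, 1.0, 0.25, and 0.5 has probability 0.5 * 0.25 * 1.0 * 0.25 * 0.5 ~= 0.016.
print(parse_probability([0.5, 0.25, 1.0, 0.25, 0.5]))  # 0.015625
```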
33
Plan
  • Model
  • Data: corpus of child-directed speech (CHILDES)
  • Grammars
  • Linear & hierarchical
  • Hand-designed + result of local search
  • Linear: automated, unsupervised ML
  • Evaluation
  • Complexity vs. fit
  • Results
  • Implications
34
Results: data split by frequency levels (estimate of comprehension)
Log posterior probability (lower magnitude = better)
35
Results: data split by age (estimate of availability)
36
Results: data split by age (estimate of availability)
Log posterior probability (lower magnitude = better)
37
Generalization: How well does each grammar predict sentences it hasn't seen?
38
Generalization: How well does each grammar predict sentences it hasn't seen?
Complex interrogatives
39
Take-home messages
  • Shown that, given reasonable domain-general assumptions, an unbiased rational learner could realize that languages have a hierarchical structure based on typical child-directed input
  • This paradigm is valuable: it makes the assumptions explicit and enables us to rigorously evaluate how different representations capture the tradeoff between simplicity and fit to data
  • In some ways, higher-order knowledge may be easier to learn than specific details (the "blessing of abstraction")

40
Implications for innateness?
  • Ideal learner
  • Strong(er) assumptions:
  • The learner can find the best grammar in the space of possibilities
  • Weak(er) assumptions:
  • The learner has the ability to parse the corpus into syntactic categories
  • The learner can represent both linear and hierarchical grammars
  • We assume a particular way of calculating complexity and data fit
  • Have we actually found representative grammars?

41
The End
Thanks also to the following for many helpful discussions: Virginia Savova, Jeff Elman, Danny Fox, Adam Albright, Fei Xu, Mark Johnson, Ken Wexler, Ted Gibson, Sharon Goldwater, Michael Frank, Charles Kemp, Vikash Mansinghka, Noah Goodman
42
(No Transcript)
43
Grammars
T: grammar type
G: specific grammar
D: data
44
Grammars
T: grammar type
G: specific grammar
D: data
45
The Argument from the Poverty of the Stimulus
(PoS)
P1. Children show a specific pattern of behavior B.
P2. A particular generalization G must be grasped in order to produce B.
P3. It is impossible to reasonably induce G simply on the basis of the data D that children receive.
C1. Some abstract knowledge T, limiting which specific generalizations G are possible, is necessary.
46
The Argument from the Poverty of the Stimulus
(PoS)
P1. Children show a specific pattern of behavior B.
P2. A particular generalization G must be grasped in order to produce B.
P3. It is impossible to reasonably induce G simply on the basis of the data D that children receive.
C1. Some abstract knowledge T, limiting which specific generalizations G are possible, is necessary.
Corollary: The abstract knowledge T could not itself be learned, or could not be learned before G is known.
C2. T must be innate.
47
The Argument from the Poverty of the Stimulus
(PoS)
G: a specific grammar
D: typical child-directed speech input
B: children don't make certain mistakes (they don't seem to entertain structure-independent hypotheses)
T: language has hierarchical phrase structure

P1. Children show a specific pattern of behavior B.
P2. A particular generalization G must be grasped in order to produce B.
P3. It is impossible to reasonably induce G simply on the basis of the data D that children receive.
C1. Some abstract knowledge T, limiting which specific generalizations G are possible, is necessary.
Corollary: The abstract knowledge T could not itself be learned, or could not be learned before G is known.
C2. T must be innate.
48
Data
  • Final corpus contained 2336 individual sentence
    types corresponding to 21792 sentence tokens
  • Why types?
  • Grammar learning depends on what sentences are generated, not on how many of each type there are
  • Much more computationally tractable
  • The distribution of sentence tokens depends on many factors other than the grammar, e.g., pragmatics, semantics, discussion topics (Goldwater, Griffiths, & Johnson '05)

49
Specific linear grammars: hand-designed
Continuum from exact fit / no compression to poor fit / high compression:
FLAT (exact fit, no compression): a list of each sentence; 2336 rules, 0 non-terminals
REG-N: narrowest regular grammar derived from the CFG; 289 rules, 85 non-terminals
REG-M: mid-level regular grammar derived from the CFG; 169 rules, 14 non-terminals
REG-B: broadest regular grammar derived from the CFG; 117 rules, 10 non-terminals
1-STATE (poor fit, high compression): anything accepted; 26 rules, 0 non-terminals
50
Why these results?
  • Natural language actually is generated from a
    grammar that looks more like a CFG
  • The other grammars overfit and therefore do not
    capture important language-specific
    generalizations

51
(No Transcript)
52
Computing the prior
Context-free grammar (CFG): NT → NT NT, NT → t NT, NT → NT, NT → t
Regular grammar (REG): NT → t NT, NT → t
53
(No Transcript)
54
Likelihood, intuitively
Z: ruled out because it does not explain some of the data points. X and Y both explain the data points, but X is the more likely source.
55
(No Transcript)
56
Possible empirical tests
  • Present people with data from which the model learns FLAT, REG, and CFG grammars; see which novel productions they generalize to
  • Non-linguistic? To small children?
  • Examples of learning regular grammars in real life: does the model do the same?

57
Do people learn regular grammars?
Children's songs: line-level grammar
X s1 s2 s3: "Spanish dancer, do the splits. Spanish dancer, give a kick. Spanish dancer, turn around."
s1 s2 s3 w1 w1 w1: "Miss Mary Mack, Mack, Mack / All dressed in black, black, black / With silver buttons, buttons, buttons / All down her back, back, back / She asked her mother, mother, mother, ..."
58
Do people learn regular grammars?
Bubble gum, bubble gum, chew and blow, Bubble
gum, bubble gum, scrape your toe, Bubble gum,
bubble gum, tastes so sweet,
Children's songs: song-level grammar (X X s1 s2 s3)
Teddy bear, teddy bear, turn around. Teddy bear,
teddy bear, touch the ground. Teddy bear, teddy
bear, show your shoe. Teddy bear, teddy bear,
that will do. Teddy bear, teddy bear, go
upstairs.
Dolly Dimple walks like this, Dolly Dimple talks
like this, Dolly Dimple smiles like this, Dolly
Dimple throws a kiss.
59
Do people learn regular grammars?
Dough a Thing I Buy Beer With Ray a guy who buys
me beer Me, the one who wants a beer Fa, a long
way to the beer So, I think I'll have a beer La,
-gers great but so is beer! Tea, no thanks I'll
have a beer
Songs containing items represented as lists
(where order matters)
A my name is Alice And my husband's name is
Arthur, We come from Alabama, Where we sell
artichokes. B my name is Barney And my wife's
name is Bridget, We come from Brooklyn, Where we
sell bicycles.
Cinderella, dressed in yella, Went upstairs to
kiss a fella, Made a mistake and kissed a
snake, How many doctors did it take? 1, 2, 3,
60
Do people learn regular grammars?
Most of the song is a template, with a repeated (varying) element:

"If I were the marrying kind / I thank the lord I'm not sir / The kind of rugger I would be / Would be a [rugby position/item] sir / Cos I'd [verb phrase] / And you'd [verb phrase] / We'd all [verb phrase] together"

"You put your [body part] in / You put your [body part] out / You put your [body part] in and you shake it all about / You do the hokey pokey / And you turn yourself around / And that's what it's all about!"

"If you're happy and you know it, [verb] your [body part] / If you're happy and you know it, then your face will surely show it / If you're happy and you know it, [verb] your [body part]"
61
Do people learn regular grammars?
Other interesting structures
I know a song that never ends, It goes on and on
my friends, I know a song that never ends, And
this is how it goes (repeat)
There was a farmer had a dog, And Bingo was his
name-O. B-I-N-G-O! B-I-N-G-O! B-I-N-G-O! And
Bingo was his name-O! (each subsequent verse,
replace a letter with a clap)
Oh, Sir Richard, do not touch me (each
subsequent verse, remove the last word at the end
of the sentence)
62
(No Transcript)
63
New PRG: 1-state
Log(prior) = 0: no free parameters
(State diagram: S → End, accepting any sequence of det, n, pro, prop, prep, adj, aux, wh, comp, to, v, vi, part)
64
Another PRG: standard + noise
  • For instance, the level-1 "PRG + noise" grammar would be the best regular grammar for the corpus at level 1, plus the 1-state model
  • This could parse all levels of evidence
  • Perhaps this would be better than a more complicated PRG at later levels of evidence

65
(No Transcript)
66
Results: frequency levels (comprehension estimates)
(Figure: log posterior, smaller is better; log prior (P) and log likelihood (L) shown in absolute value for each grammar)
67
Results: availability by age
(Figure: log posterior, smaller is better; log prior (P) and log likelihood (L) shown in absolute value for each grammar)
68
(No Transcript)
69
Specific grammars of each type
  • One type of hand-designed grammar

69 productions, 14 nonterminals
390 productions, 85 nonterminals
70
Specific grammars of each type
  • The other type of hand-designed grammar

126 productions, 14 nonterminals
170 productions, 14 nonterminals
71
(No Transcript)
72
The Argument from the Poverty of the Stimulus
(PoS)
P1. It is impossible to have made some generalization G simply on the basis of data D.
P2. Children show behavior B.
P3. Behavior B is not possible without having made G.

G: a specific grammar
D: typical child-directed speech input
B: children don't make certain mistakes (they don't seem to entertain structure-independent hypotheses)
T: language has hierarchical phrase structure

C1. Some constraints T, which limit what type of generalizations G are possible, must be innate.
73
1: Children hear complex interrogatives
  • Well, a few, but not many
  • Adam (CHILDES): 0.048%
  • No yes-no questions
  • Four wh-questions (e.g., "What is the music it's playing?")
  • Nina (CHILDES): 0.068%
  • No yes-no questions
  • 14 wh-questions
  • In all, most estimates are << 1% of input

(Legate & Yang, 2002)
74
1: Children hear complex interrogatives
  • Well, a few, but not many
  • Adam (CHILDES): 0.048%
  • No yes-no questions
  • Four wh-questions (e.g., "What is the music it's playing?")
  • Nina (CHILDES): 0.068%
  • No yes-no questions
  • 14 wh-questions
  • In all, most estimates are << 1% of input

How much is enough?
(Legate & Yang, 2002)
75
2: Can get the behavior without structure
  • There is enough statistical information in the input to be able to conclude which type of complex interrogative is ungrammatical

Rare: comp adj aux
Common: comp aux adj
(Reali & Christiansen, 2004; Lewis & Elman, 2001)
76
2: Can get the behavior without structure
  • Response: there is enough statistical information in the input to be able to conclude that "Are eagles that alive can fly?" is ungrammatical
  • Sidesteps the question: does not address the innateness of structure (knowledge X)
  • Explanatorily opaque

Rare: comp adj aux
Common: comp aux adj
(Reali & Christiansen, 2004; Lewis & Elman, 2001)
77
Why do linguists believe that language has hierarchical phrase structure?
  • Formal properties: an information-theoretic, simplicity-based argument (Chomsky, 1956)
  • A sentence S has an (i,j) dependency if replacement of the ith symbol ai of S by bi requires a corresponding replacement of the jth symbol aj of S by bj
  • If S has an m-termed dependency set in L, at least 2^m states are necessary in the finite-state grammar that generates L
  • Therefore, if L is a finite-state language, then there is an m such that no sentence S of L has a dependency set of more than m terms in L
  • The mirror language, made up of sentences consisting of a string X followed by X in reverse (e.g., aa, abba, babbab, aabbaa, etc.), has the property that for any m we can find a dependency set D = {(1,2m), (2,2m-1), ..., (m,m+1)}. Therefore it cannot be captured by any finite-state grammar
  • English has infinite sets of sentences whose dependency sets have more than any fixed number of terms. E.g., in "the man who said that S5 is arriving today", there is a dependency between "man" and "is". Therefore English cannot be finite-state
  • There is the possible counterargument that, since any finite corpus could be captured by a finite-state grammar, English fails to be finite-state only in the limit; in practice, it could be
  • Easy counterargument: simplicity considerations. Chomsky: "If the processes have a limit, then the construction of a finite-state grammar will not be literally impossible (since a list is a trivial finite-state grammar), but this grammar will be so complex as to be of little use or interest."
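A worked instance of the mirror-language step, using the definitions above: for m = 3, the string abccba (X = abc followed by its reverse) has the dependency set D = {(1,6), (2,5), (3,4)}, so a finite-state grammar generating the language would need at least 2^3 = 8 states; since m can be made arbitrarily large, no finite number of states, and hence no finite-state grammar, suffices.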

78
The big picture
Innate
Learned
79
Grammar Acquisition (Chomsky)
Innate
Learned
80
The Argument from the Poverty of the Stimulus
(PoS)
P1. Children show behavior B
81
The Argument from the Poverty of the Stimulus
(PoS)
P1. Children show behavior B.
P2. Behavior B is not possible without having some specific grammar or rule G.
82
The Argument from the Poverty of the Stimulus
(PoS)
P1. Children show behavior B.
P2. Behavior B is not possible without having some specific grammar or rule G.
P3. It is impossible to have learned G simply on the basis of data D.
83
The Argument from the Poverty of the Stimulus
(PoS)
P1. Children show behavior B.
P2. Behavior B is not possible without having some specific grammar or rule G.
P3. It is impossible to have learned G simply on the basis of data D.
C1. Some constraints T, which limit what type of grammars are possible, must be innate.
84
Replies to the PoS argument
There are enough complex interrogatives in D
P1. It is impossible to have made some generalization G simply on the basis of data D.
P2. Children show behavior B.
P3. Behavior B is not possible without having made G.
(e.g., Pullum & Scholz, 2002)
C1. Some constraints T, which limit what type of generalizations G are possible, must be innate.
85
Replies to the PoS argument
There are enough complex interrogatives in D
P1. It is impossible to have made some generalization G simply on the basis of data D.
P2. Children show behavior B.
P3. Behavior B is not possible without having made G.
(Pullum & Scholz, 2002)
There is a route to B other than G (statistical learning)
(e.g., Lewis & Elman, 2001; Reali & Christiansen, 2005)
C1. Some constraints T, which limit what type of generalizations G are possible, must be innate.
86
Innate
Learned
87
Explicit structure
Innate
Learned
No explicit structure
88
Explicit structure
Innate
Learned
No explicit structure
89-109
(No Transcript)
110
Our argument
  • Assumptions: equipped with
  • The capacity to represent both linear and hierarchical grammars (no bias)
  • A rational Bayesian learning mechanism (probability calculation)
  • The ability to effectively search the space of possible grammars
111
Take-home message
  • Shown that given reasonable domain-general
    assumptions, an unbiased rational learner could
    realize that languages have a hierarchical
    structure based on typical child-directed input

112
Take-home message
  • Shown that given reasonable domain-general
    assumptions, an unbiased rational learner could
    realize that languages have a hierarchical
    structure based on typical child-directed input
  • Can use this paradigm to explore the role of
    recursive elements in a grammar
  • The winning grammar contains additional
    non-recursive counterparts for complex NPs
  • Perhaps language, while fundamentally recursive,
    contains duplicate non-recursive elements that
    more precisely match the input?

113
The role of recursion
  • Evaluated an additional grammar (CFG-DL) that contained no recursive complex NPs at all; instead, it had multiply-embedded, depth-limited ones
  • No sentence in the corpus occurred with more than two levels of nesting

114
The role of recursion: results
Log posterior probability (lower magnitude = better)
115
The role of recursion: Results
116
The role of recursion: Implications
  • The optimal tradeoff results in a grammar that goes beyond the data in interesting ways
  • Auxiliary fronting
  • Recursive complex NPs
  • A grammar with recursive complex NPs is more optimal, even though:
  • Recursive productions hurt the likelihood
  • There are no sentences with more than two levels of nesting in the input