Title: A Bayesian Approach to the Poverty of the Stimulus
1A Bayesian Approach to the Poverty of the
Stimulus
With Josh Tenenbaum (MIT) and Terry Regier
(University of Chicago)
2Innate
Learned
3Explicit Structure
Innate
Learned
No explicit Structure
4Language has hierarchical phrase structure
No
Yes
5Why believe that language has hierarchical phrase structure?
- Formal properties: an information-theoretic, simplicity-based argument (Chomsky, 1956)
- Dependency structure of language
- A finite-state grammar cannot capture the infinite sets of English sentences with dependencies like this
- If we restrict ourselves to only a finite set of sentences, then in theory a finite-state grammar could account for them, but this grammar will be so complex as to be of little use or interest.
6Why believe that structure dependence is innate?
- The Argument from the Poverty of the Stimulus (PoS)
Data: simple declaratives (The girl is happy; They are eating) and simple interrogatives (Is the girl happy? Are they eating?)
Hypotheses:
- Linear: move the first "is" (auxiliary) in the sentence to the beginning
- Hierarchical: move the auxiliary in the main clause to the beginning
Test: complex declarative (The girl who is sleeping is happy.)
Result: children say "Is the girl who is sleeping happy?", NOT "Is the girl who sleeping is happy?"
Chomsky, 1965, 1980; Crain & Nakayama, 1987
7Why believe it's not innate?
- There are actually enough complex interrogatives in the input (Pullum & Scholz, 2002)
- Children's behavior can be explained via statistical learning of natural language data (Lewis & Elman, 2001; Reali & Christiansen, 2005)
- It is not necessary to assume a grammar with explicit structure
8Explicit Structure
Innate
Learned
No explicit Structure
9Explicit Structure
Innate
Learned
No explicit Structure
10Our argument
11Our argument
- We suggest that, contra the PoS claim:
- It is possible, given the nature of the input and certain domain-general assumptions about the learning mechanism, for an ideal, unbiased learner to realize that language has hierarchical phrase structure; therefore this knowledge need not be innate
- The reason: grammars with hierarchical phrase structure offer an optimal tradeoff between simplicity and fit to natural language data
12Plan
- Model
- Data: corpus of child-directed speech (CHILDES)
- Grammars
- Linear & hierarchical
- Both: hand-designed & the result of local search
- Linear only: automatic, unsupervised ML
- Evaluation
- Complexity vs. fit
- Results
- Implications
13The model: Data
- Corpus from the CHILDES database (Adam, from the Brown corpus)
- 55 files, age range 2;3 to 5;2
- Sentences spoken by adults to children
- Each word replaced by its syntactic category: det, n, adj, prep, pro, prop, to, part, vi, v, aux, comp, wh, c
- Ungrammatical sentences and the most grammatically complex sentence types were removed; 21792 of 25876 utterances were kept
- Removed: topicalized sentences (66), sentences with serial verb constructions (459), subordinate phrases (845), sentential complements (1636), conjunctions (634), and ungrammatical sentences (444)
(A schematic sketch of this preprocessing step follows below)
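To make the preprocessing concrete, here is a minimal sketch (not the authors' code) of the word-to-category step and the collapse into sentence types. The tiny lexicon is hypothetical and stands in for the full mapping used on the corpus.

```python
# A minimal sketch (not the authors' code) of the preprocessing step:
# each word is replaced by its syntactic category, and the resulting
# category strings are collapsed into sentence types with token counts.
from collections import Counter

# Hypothetical toy lexicon; the real mapping covers the whole corpus.
LEXICON = {
    "is": "aux", "the": "det", "girl": "n", "happy": "adj",
    "who": "comp", "sleeping": "vi",
}

def to_categories(sentence):
    """Map a word string to its sequence of syntactic categories."""
    return " ".join(LEXICON[word] for word in sentence.lower().split())

utterances = ["Is the girl happy", "The girl is happy", "Is the girl happy"]
type_counts = Counter(to_categories(u) for u in utterances)
print(type_counts)   # Counter({'aux det n adj': 2, 'det n aux adj': 1})
```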
14Data
- Final corpus contained 2336 individual sentence
types corresponding to 21792 sentence tokens
15Data variation
- Amount of evidence available at different points
in development
16Data variation
- Amount of evidence available at different points in development
- Amount comprehended at different points in development
17Data: amount available
- Rough estimate: split by age
Epoch     Age           Files   Sentence types   % of types
Epoch 0   2;3           1       173              7.4
Epoch 1   2;3 to 2;8    11      879              38
Epoch 2   2;3 to 3;1    22      1295             55
Epoch 3   2;3 to 3;5    33      1735             74
Epoch 4   2;3 to 4;2    44      2090             89
Epoch 5   2;3 to 5;2    55      2336             100
18Data: amount comprehended
- Rough estimate: split by frequency
Level     Frequency (tokens)   Sentence types   % of types   % of tokens
Level 1   500                  8                0.3          28
Level 2   100                  37               1.6          55
Level 3   50                   67               2.9          64
Level 4   25                   115              4.9          71
Level 5   10                   268              12           82
Level 6   1 (all)              2336             100          100
19The model
- Data
- Child-directed speech (CHILDES)
- Grammars
- Linear & hierarchical
- Both: hand-designed & the result of local search
- Linear only: automatic, unsupervised ML
- Evaluation
- Complexity vs. fit
20Grammar types
Linear grammar types:
- Flat grammar: the rules are simply a list of each sentence
- 1-state grammar: anything is accepted
- Regular grammar: rules of the form NT -> t NT and NT -> t
Hierarchical grammar type:
- Context-free grammar: rules of the form NT -> NT NT, NT -> t NT, NT -> NT, and NT -> t
(The slide also gave example productions for each type; a small representational sketch follows below)
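As an illustration of how the rule shapes differ, here is a hedged Python sketch of the two grammar classes. The particular nonterminals, rules, and probabilities are invented for this example, not taken from the grammars in the talk.

```python
# Illustrative sketch of how the two grammar classes constrain rule shapes.
# A probabilistic grammar is represented as a dict mapping each nonterminal
# to a list of (right-hand side, probability) pairs. All names and numbers
# below are made up for illustration only.

REGULAR = {  # every right-hand side: one terminal, optionally followed by one nonterminal
    "S":    [(("pro", "VP"), 0.5), (("det", "NP"), 0.5)],
    "NP":   [(("n", "VP"), 1.0)],
    "VP":   [(("aux", "PRED"), 0.6), (("vi",), 0.4)],
    "PRED": [(("adj",), 1.0)],
}

CFG = {      # right-hand sides may freely mix terminals and nonterminals
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("det", "n"), 0.7), (("pro",), 0.3)],
    "VP": [(("aux", "adj"), 0.5), (("vi",), 0.5)],
}

def is_regular(grammar):
    """True iff every right-hand side has the form (t,) or (t, NT)."""
    nonterminals = set(grammar)
    for expansions in grammar.values():
        for rhs, _prob in expansions:
            ok = (
                (len(rhs) == 1 and rhs[0] not in nonterminals)
                or (len(rhs) == 2 and rhs[0] not in nonterminals and rhs[1] in nonterminals)
            )
            if not ok:
                return False
    return True

print(is_regular(REGULAR), is_regular(CFG))   # True False
```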
21Specific hierarchical grammars: hand-designed
- CFG-S (standard CFG): designed to be as linguistically plausible as possible; 77 rules, 15 non-terminals
- CFG-L (larger CFG): derived from CFG-S; contains additional productions corresponding to different expansions of the same NT (this puts less probability mass on recursive productions); 133 rules, 15 non-terminals
(The slide also gave example productions for each grammar)
22Specific linear grammars: hand-designed
The linear grammars span a continuum from exact fit with no compression to poor fit with high compression:
- FLAT: a list of each sentence; 2336 rules, 0 non-terminals (exact fit, no compression)
- 1-STATE: anything accepted; 26 rules, 0 non-terminals (poor fit, high compression)
23Specific linear grammars: hand-designed
- Adds REG-N: narrowest regular grammar derived from the CFG; 289 rules, 85 non-terminals
24Specific linear grammars: hand-designed
- Adds REG-M: mid-level regular grammar derived from the CFG; 169 rules, 14 non-terminals
25Specific linear grammars: hand-designed
The full set, ordered from exact fit / no compression to poor fit / high compression:
- FLAT: a list of each sentence; 2336 rules, 0 non-terminals
- REG-N: narrowest regular grammar derived from the CFG; 289 rules, 85 non-terminals
- REG-M: mid-level regular grammar derived from the CFG; 169 rules, 14 non-terminals
- REG-B: broadest regular grammar derived from the CFG; 117 rules, 10 non-terminals
- 1-STATE: anything accepted; 26 rules, 0 non-terminals
26Automated search
- Local search around the hand-designed grammars (a schematic sketch follows below)
- Linear: unsupervised, automatic HMM learning (Goldwater & Griffiths, 2007)
- A Bayesian model for the acquisition of a trigram HMM (designed for POS tagging, but given a corpus of syntactic categories it learns a regular grammar)
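The deck does not spell out the search procedure, so the following is only a minimal sketch, under assumed helper functions, of what "local search around a grammar" typically looks like: propose small edits and keep any edit that improves the posterior. The helpers log_prior, log_likelihood, and propose_neighbors are hypothetical stand-ins, not part of the authors' implementation.

```python
# A minimal sketch of greedy local search over grammars: propose small edits
# (e.g., add, delete, or merge productions) and keep an edit whenever it
# improves the log posterior (log prior + log likelihood) on the corpus.
# log_prior, log_likelihood, and propose_neighbors are hypothetical helpers.

def local_search(grammar, corpus, log_prior, log_likelihood, propose_neighbors,
                 max_steps=1000):
    def score(g):
        return log_prior(g) + log_likelihood(corpus, g)

    best, best_score = grammar, score(grammar)
    for _ in range(max_steps):
        improved = False
        for candidate in propose_neighbors(best):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score, improved = candidate, candidate_score, True
                break   # greedily accept the first improving edit
        if not improved:
            return best, best_score   # local optimum reached
    return best, best_score
```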
27The model
- Data
- Child-directed speech (CHILDES)
- Grammars
- Linear & hierarchical
- Hand-designed & the result of local search
- Linear only: automatic, unsupervised ML
- Evaluation
- Complexity vs. fit
28Grammars
T = type of grammar, G = specific grammar, D = data
(Diagram: the prior over grammar types, P(T), is unbiased, i.e. uniform)
29Grammars
T = type of grammar, G = specific grammar, D = data
P(G, T | D) ∝ P(D | G) P(G | T) P(T)
P(G | T) is the complexity term (the prior); P(D | G) is the data-fit term (the likelihood)
30Tradeoff: complexity vs. fit
- Low prior probability = more complex
- Low likelihood = poor fit to the data
(Illustration: one hypothesis with high fit but low simplicity, one with low fit but high simplicity, and one with moderate fit and moderate simplicity)
31Measuring complexity: the prior
- Designing a grammar (a God's-eye view)
- Grammars with more rules and non-terminals have lower prior probability
- Notation: n = number of nonterminals; P_k = number of productions of nonterminal k; N_i = number of items in production i; V = vocabulary size; theta_k = production-probability parameters for nonterminal k (a sketch of a prior in this form follows below)
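The prior itself appeared as an equation image; only the legend above survives in this transcript. A hedged sketch of a prior consistent with that legend is given below; the exact distributions used on the original slide are an assumption here.

```latex
% A hedged sketch of a prior consistent with the legend above: grammars with
% more nonterminals, more productions, and longer productions receive lower
% prior probability, and each item in a production is drawn uniformly from
% the vocabulary of size V. The specific factorization is assumed, not
% recovered from the slide.
P(G \mid T) \;=\; p(n) \prod_{k=1}^{n} \Big[\, p(P_k)\, p(\theta_k)
    \prod_{i=1}^{P_k} p(N_i) \Big(\tfrac{1}{V}\Big)^{N_i} \Big]
```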
32Measuring fit: the likelihood
- The probability of the grammar generating the data
- Computed as the product of the probabilities of each parse
- Example: a parse of "pro aux det n" that uses rules with probabilities 0.5, 0.25, 1.0, 0.25, and 0.5 has probability 0.5 x 0.25 x 1.0 x 0.25 x 0.5 ≈ 0.016 (a code sketch follows below)
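Here is a minimal sketch (not the authors' code) of this computation: a parse probability is the product of the probabilities of the rules it uses, and the corpus log likelihood is the sum of log sentence probabilities. The rule probabilities are those from the slide's example.

```python
# A minimal sketch of the likelihood computation. The probability of a
# single parse is the product of the probabilities of the rules it uses;
# the log likelihood of the corpus is the sum of the log probabilities of
# its sentences. Rule probabilities below come from the slide's example.
import math

def parse_probability(rule_probs):
    """Probability of one parse = product of the probabilities of its rules."""
    p = 1.0
    for prob in rule_probs:
        p *= prob
    return p

# The slide's example parse of "pro aux det n":
print(parse_probability([0.5, 0.25, 1.0, 0.25, 0.5]))   # 0.015625, i.e. ~0.016

def corpus_log_likelihood(sentence_probs):
    """Log likelihood of the corpus: sum of log P(sentence) over sentence types."""
    return sum(math.log(p) for p in sentence_probs)

# A grammar's overall score is then log posterior = log prior + log likelihood
# (up to a normalizing constant), trading off complexity against fit.
```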
33 Plan
- Model
- Data corpus of child-directed speech (CHILDES)
- Grammars
- Linear & hierarchical
- Hand-designed & the result of local search
- Linear only: automated, unsupervised ML
- Evaluation
- Complexity vs. fit
- Results
- Implications
34Results: data split by frequency levels (estimate of comprehension)
(Plot: log posterior probability; lower magnitude = better)
35Results: data split by age (estimate of availability)
36Results: data split by age (estimate of availability)
(Plot: log posterior probability; lower magnitude = better)
37Generalization: how well does each grammar predict sentences it hasn't seen?
38Generalization: how well does each grammar predict sentences it hasn't seen?
(Plot highlights complex interrogatives)
39Take-home messages
- We have shown that, given reasonable domain-general assumptions, an unbiased rational learner could realize that languages have a hierarchical structure based on typical child-directed input
- This paradigm is valuable: it makes all assumptions explicit and enables us to rigorously evaluate how different representations capture the tradeoff between simplicity and fit to data
- In some ways, higher-order knowledge may be easier to learn than specific details (the "blessing of abstraction")
40Implications for innateness?
- Ideal learner
- Strong(er) assumptions:
- The learner can find the best grammar in the space of possibilities
- Weak(er) assumptions:
- The learner has the ability to parse the corpus into syntactic categories
- The learner can represent both linear and hierarchical grammars
- We assume a particular way of calculating complexity and data fit
- Have we actually found representative grammars?
41The End
Thanks also to the following for many helpful discussions: Virginia Savova, Jeff Elman, Danny Fox, Adam Albright, Fei Xu, Mark Johnson, Ken Wexler, Ted Gibson, Sharon Goldwater, Michael Frank, Charles Kemp, Vikash Mansinghka, Noah Goodman
43Grammars
(Diagram: T = grammar type, G = specific grammar, D = data)
44Grammars
(Diagram: T = grammar type, G = specific grammar, D = data)
45The Argument from the Poverty of the Stimulus (PoS)
P1. Children show a specific pattern of behavior B.
P2. A particular generalization G must be grasped in order to produce B.
P3. It is impossible to reasonably induce G simply on the basis of the data D that children receive.
C1. Some abstract knowledge T, limiting which specific generalizations G are possible, is necessary.
(Diagram: nodes T, G, B, D)
46The Argument from the Poverty of the Stimulus (PoS)
P1. Children show a specific pattern of behavior B.
P2. A particular generalization G must be grasped in order to produce B.
P3. It is impossible to reasonably induce G simply on the basis of the data D that children receive.
C1. Some abstract knowledge T, limiting which specific generalizations G are possible, is necessary.
Corollary: the abstract knowledge T could not itself be learned, or could not be learned before G is known.
C2. T must be innate.
(Diagram: nodes T, G, B, D)
47The Argument from the Poverty of the Stimulus (PoS)
G: a specific grammar. D: typical child-directed speech input. B: children don't make certain mistakes (they don't seem to entertain structure-independent hypotheses). T: language has hierarchical phrase structure.
P1. Children show a specific pattern of behavior B.
P2. A particular generalization G must be grasped in order to produce B.
P3. It is impossible to reasonably induce G simply on the basis of the data D that children receive.
C1. Some abstract knowledge T, limiting which specific generalizations G are possible, is necessary.
Corollary: the abstract knowledge T could not itself be learned, or could not be learned before G is known.
C2. T must be innate.
48Data
- Final corpus contained 2336 individual sentence types corresponding to 21792 sentence tokens
- Why types?
- Grammar learning depends on which sentences are generated, not on how many of each type there are
- Much more computationally tractable
- The distribution of sentence tokens depends on many factors other than the grammar (e.g., pragmatics, semantics, discussion topics)
Goldwater, Griffiths, & Johnson, 2005
49Specific linear grammars: hand-designed
The full set, ordered from exact fit / no compression to poor fit / high compression:
- FLAT: a list of each sentence; 2336 rules, 0 non-terminals
- REG-N: narrowest regular grammar derived from the CFG; 289 rules, 85 non-terminals
- REG-M: mid-level regular grammar derived from the CFG; 169 rules, 14 non-terminals
- REG-B: broadest regular grammar derived from the CFG; 117 rules, 10 non-terminals
- 1-STATE: anything accepted; 26 rules, 0 non-terminals
50Why these results?
- Natural language actually is generated from a grammar that looks more like a CFG
- The other grammars overfit and therefore do not capture important language-specific generalizations
(Illustration: the FLAT grammar)
52Computing the prior
CFG (context-free grammar): NT -> NT NT, NT -> t NT, NT -> NT, NT -> t
REG (regular grammar): NT -> t NT, NT -> t
54Likelihood, intuitively
Z is ruled out because it does not explain some of the data points. X and Y both explain the data points, but X is the more likely source.
56Possible empirical tests
- Present people with data that the model learns FLAT, REG, and CFG grammars from; see which novel productions they generalize to
- Non-linguistic versions? With small children?
- Examples of learning regular grammars in real life: does the model do the same?
57Do people learn regular grammars?
Children's songs: line-level grammar
X s1 s2 s3: "Spanish dancer, do the splits. / Spanish dancer, give a kick. / Spanish dancer, turn around."
s1 s2 s3 w1 w1 w1: "Miss Mary Mack, Mack, Mack / All dressed in black, black, black / With silver buttons, buttons, buttons / All down her back, back, back / She asked her mother, mother, mother, ..."
58Do people learn regular grammars?
Children's songs: song-level grammar
X X s1 s2 s3:
"Bubble gum, bubble gum, chew and blow, / Bubble gum, bubble gum, scrape your toe, / Bubble gum, bubble gum, tastes so sweet, ..."
"Teddy bear, teddy bear, turn around. / Teddy bear, teddy bear, touch the ground. / Teddy bear, teddy bear, show your shoe. / Teddy bear, teddy bear, that will do. / Teddy bear, teddy bear, go upstairs."
"Dolly Dimple walks like this, / Dolly Dimple talks like this, / Dolly Dimple smiles like this, / Dolly Dimple throws a kiss."
59Do people learn regular grammars?
Dough a Thing I Buy Beer With Ray a guy who buys
me beer Me, the one who wants a beer Fa, a long
way to the beer So, I think I'll have a beer La,
-gers great but so is beer! Tea, no thanks I'll
have a beer
Songs containing items represented as lists
(where order matters)
A my name is Alice And my husband's name is
Arthur, We come from Alabama, Where we sell
artichokes. B my name is Barney And my wife's
name is Bridget, We come from Brooklyn, Where we
sell bicycles.
Cinderella, dressed in yella, Went upstairs to
kiss a fella, Made a mistake and kissed a
snake, How many doctors did it take? 1, 2, 3,
60Do people learn regular grammars?
If I were the marrying kind I thank the lord I'm
not sir The kind of rugger I would be Would be a
rugby position/item sir Cos I'd verb
phrase And you'd verb phrase We'd all verb
phrase together
Most of the song is a template, with repeated
(varying) element
You put your body part in You put your body
part out You put your body part in and you
shake it all about You do the hokey pokey And
you turn yourself around And that's what it's all
about!
If you're happy and you know it, verb your body part / If you're happy and you know it, then your face will surely show it / If you're happy and you know it, verb your body part
61Do people learn regular grammars?
Other interesting structures
I know a song that never ends, It goes on and on
my friends, I know a song that never ends, And
this is how it goes (repeat)
There was a farmer had a dog, And Bingo was his
name-O. B-I-N-G-O! B-I-N-G-O! B-I-N-G-O! And
Bingo was his name-O! (each subsequent verse,
replace a letter with a clap)
Oh, Sir Richard, do not touch me (each
subsequent verse, remove the last word at the end
of the sentence)
63New PRG: 1-state
(Diagram: a single state S, followed by End, emitting the categories det, n, pro, prop, prep, adj, aux, wh, comp, to, v, vi, part)
Log(prior) = 0: no free parameters
64Another PRG: standard grammar plus noise
- For instance, a "level-1 PRG + noise" would be the best regular grammar for the corpus at level 1, plus the 1-state model
- This could parse all levels of evidence
- Perhaps this would be better than a more complicated PRG at later levels of evidence
66Results: frequency levels (comprehension estimates)
(Plot: log posterior, smaller is better; paired bars P and L show the absolute log prior and log likelihood for each grammar)
67Results: availability by age
(Plot: log posterior, smaller is better; paired bars P and L show the absolute log prior and log likelihood for each grammar)
69Specific grammars of each type
- One type of hand-designed grammar
69 productions, 14 nonterminals
390 productions, 85 nonterminals
70Specific grammars of each type
- The other type of hand-designed grammar
126 productions, 14 nonterminals
170 productions, 14 nonterminals
72The Argument from the Poverty of the Stimulus (PoS)
P1. It is impossible to have made some generalization G simply on the basis of data D.
P2. Children show behavior B.
P3. Behavior B is not possible without having made G.
G: a specific grammar. D: typical child-directed speech input. B: children don't make certain mistakes (they don't seem to entertain structure-independent hypotheses). T: language has hierarchical phrase structure.
C1. Some constraints T, which limit what type of generalizations G are possible, must be innate.
73Reply 1: Children hear complex interrogatives
- Well, a few, but not many
- Adam (CHILDES): 0.048%
- No yes-no questions
- Four wh-questions (e.g., "What is the music it's playing?")
- Nina (CHILDES): 0.068%
- No yes-no questions
- 14 wh-questions
- In all, most estimates are << 1% of input
Legate & Yang, 2002
74Reply 1: Children hear complex interrogatives
- Well, a few, but not many
- Adam (CHILDES): 0.048%
- No yes-no questions
- Four wh-questions (e.g., "What is the music it's playing?")
- Nina (CHILDES): 0.068%
- No yes-no questions
- 14 wh-questions
- In all, most estimates are << 1% of input
How much is enough?
Legate & Yang, 2002
75Reply 2: Can get the behavior without structure
- There is enough statistical information in the input to be able to conclude which type of complex interrogative is ungrammatical
- Rare: comp adj aux; Common: comp aux adj
Reali & Christiansen, 2004; Lewis & Elman, 2001
76Reply 2: Can get the behavior without structure
- Response: there is enough statistical information in the input to be able to conclude that "Are eagles that alive can fly?" is ungrammatical
- This sidesteps the question: it does not address the innateness of structure (knowledge X)
- It is explanatorily opaque
- Rare: comp adj aux; Common: comp aux adj
Reali & Christiansen, 2004; Lewis & Elman, 2001
77Why do linguists believe that language has hierarchical phrase structure?
- Formal properties: an information-theoretic, simplicity-based argument (Chomsky, 1956)
- A sentence S has an (i, j) dependency if replacement of the ith symbol a_i of S by b_i requires a corresponding replacement of the jth symbol a_j of S by b_j
- If S has an m-termed dependency set in L, at least 2^m states are necessary in the finite-state grammar that generates L
- Therefore, if L is a finite-state language, then there is an m such that no sentence S of L has a dependency set of more than m terms in L
- The mirror language, made up of sentences consisting of a string X followed by X in reverse (e.g., aa, abba, babbab, aabbaa, etc.), has the property that for any m we can find a dependency set D = {(1, 2m), (2, 2m-1), ..., (m, m+1)}. Therefore it cannot be captured by any finite-state grammar (a compact restatement follows below this slide)
- English has infinite sets of sentences with dependency sets of more than any fixed number of terms. E.g., in "the man who said that S5 is arriving today", there is a dependency between "man" and "is". Therefore English cannot be finite-state
- There is the possible counterargument that, since any finite corpus could be captured by a finite-state grammar, English is only not finite-state in the limit; in practice it could be
- Easy counterargument: simplicity considerations. Chomsky: "If the processes have a limit, then the construction of a finite-state grammar will not be literally impossible (since a list is a trivial finite-state grammar), but this grammar will be so complex as to be of little use or interest."
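For readers who want the formal step spelled out, here is a compact, hedged restatement of the mirror-language argument; the notation is mine, not the slide's.

```latex
% A compact, hedged restatement of the mirror-language argument above
% (after Chomsky, 1956); the notation is mine, not the slide's.
\[
  L_{\text{mirror}} = \{\, X X^{R} : X \in \{a,b\}^{*} \,\}, \qquad
  X X^{R} \text{ with } |X| = m \text{ has the dependency set }
  \{(1,2m),\, (2,2m-1),\, \dots,\, (m,m+1)\}.
\]
\[
  \text{Distinct prefixes } X \neq Y \in \{a,b\}^{m}
  \text{ must lead to distinct automaton states (otherwise } Y X^{R}
  \text{ would wrongly be accepted),}
\]
\[
  \text{so a finite-state grammar for } L_{\text{mirror}}
  \text{ needs at least } 2^{m} \text{ states for every } m,
  \text{ which no finite machine can supply.}
\]
```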
78The big picture
Innate
Learned
79Grammar Acquisition (Chomsky)
Innate
Learned
80The Argument from the Poverty of the Stimulus (PoS)
P1. Children show behavior B.
81The Argument from the Poverty of the Stimulus (PoS)
P1. Children show behavior B.
P2. Behavior B is not possible without having some specific grammar or rule G.
82The Argument from the Poverty of the Stimulus (PoS)
P1. Children show behavior B.
P2. Behavior B is not possible without having some specific grammar or rule G.
P3. It is impossible to have learned G simply on the basis of data D.
83The Argument from the Poverty of the Stimulus (PoS)
P1. Children show behavior B.
P2. Behavior B is not possible without having some specific grammar or rule G.
P3. It is impossible to have learned G simply on the basis of data D.
C1. Some constraints T, which limit what type of grammars are possible, must be innate.
(Diagram built up across these slides: nodes D, G, B, T)
84Replies to the PoS argument
Reply: there are enough complex interrogatives in D (e.g., Pullum & Scholz, 2002)
P1. It is impossible to have made some generalization G simply on the basis of data D.
P2. Children show behavior B.
P3. Behavior B is not possible without having made G.
C1. Some constraints T, which limit what type of generalizations G are possible, must be innate.
85Replies to the PoS argument
Reply: there are enough complex interrogatives in D (Pullum & Scholz, 2002)
Reply: there is a route to B other than G, namely statistical learning (e.g., Lewis & Elman, 2001; Reali & Christiansen, 2005)
P1. It is impossible to have made some generalization G simply on the basis of data D.
P2. Children show behavior B.
P3. Behavior B is not possible without having made G.
C1. Some constraints T, which limit what type of generalizations G are possible, must be innate.
86Innate
Learned
87Explicit structure
Innate
Learned
No explicit structure
88Explicit structure
Innate
Learned
No explicit structure
110Our argument
- Assumptions: the learner is equipped with
- The capacity to represent both linear and hierarchical grammars (no bias)
- A rational Bayesian learning mechanism (probability calculation)
- The ability to effectively search the space of possible grammars
111Take-home message
- We have shown that, given reasonable domain-general assumptions, an unbiased rational learner could realize that languages have a hierarchical structure based on typical child-directed input
112Take-home message
- We have shown that, given reasonable domain-general assumptions, an unbiased rational learner could realize that languages have a hierarchical structure based on typical child-directed input
- We can use this paradigm to explore the role of recursive elements in a grammar
- The winning grammar contains additional non-recursive counterparts for complex NPs
- Perhaps language, while fundamentally recursive, contains duplicate non-recursive elements that more precisely match the input?
113The role of recursion
- We evaluated an additional grammar (CFG-DL) that contained no recursive complex NPs at all; instead, it used multiply-embedded, depth-limited ones
- No sentence in the corpus occurred with more than two levels of nesting
114The role of recursion: results
(Plot: log posterior probability; lower magnitude = better)
115The role of recursion: results
116The role of recursion: implications
- The optimal tradeoff results in a grammar that goes beyond the data in interesting ways
- Auxiliary fronting
- Recursive complex NPs
- A grammar with recursive complex NPs is more optimal, even though
- Recursive productions hurt in the likelihood
- There are no sentences with more than two levels of nesting in the input