Title: A Bayesian Approach to the Poverty of the Stimulus
1A Bayesian Approach to the Poverty of the
Stimulus
With Josh Tenenbaum (MIT) and Terry Regier
(University of Chicago)
2Innate
Learned
3Explicit Structure
Innate
Learned
No explicit Structure
4Language has hierarchical phrase structure
No
Yes
5Why believe that language has hierarchical phrase structure?
- Formal properties: an information-theoretic, simplicity-based argument (Chomsky, 1956)
- Dependency structure of language
- A finite-state grammar cannot capture the infinite sets of English sentences with dependencies like this
- If we restrict ourselves to only a finite set of sentences, then in theory a finite-state grammar could account for them, but this grammar will be so complex as to be of little use or interest.
6Why believe that structure dependence is innate?
- The Argument from the Poverty of the Stimulus (PoS)
Data: simple declaratives (The girl is happy; They are eating) and simple interrogatives (Is the girl happy? Are they eating?)
Hypotheses:
- Linear: move the first "is" (auxiliary) in the sentence to the beginning
- Hierarchical: move the auxiliary in the main clause to the beginning
Test: complex declarative (The girl who is sleeping is happy.)
Result: children say "Is the girl who is sleeping happy?", NOT "Is the girl who sleeping is happy?"
Chomsky, 1965, 1980; Crain & Nakayama, 1987
7Why believe it's not innate?
- There are actually enough complex interrogatives in the input (Pullum & Scholz, 2002)
- Children's behavior can be explained via statistical learning of natural language data (Lewis & Elman, 2001; Reali & Christiansen, 2005)
- It is not necessary to assume a grammar with explicit structure
8Explicit Structure
Innate
Learned
No explicit Structure
9Explicit Structure
Innate
Learned
No explicit Structure
10Our argument
11Our argument
- We suggest that, contra the PoS claim:
- It is possible, given the nature of the input and certain domain-general assumptions about the learning mechanism, for an ideal, unbiased learner to realize that language has hierarchical phrase structure; therefore this knowledge need not be innate
- The reason: grammars with hierarchical phrase structure offer an optimal tradeoff between simplicity and fit to natural language data
12Plan
- Model
- Data: corpus of child-directed speech (CHILDES)
- Grammars
- Linear & hierarchical
- Both: hand-designed & the result of local search
- Linear only: automatic, unsupervised ML
- Evaluation
- Complexity vs. fit
- Results
- Implications
13The model: Data
- Corpus from the CHILDES database (Adam, from the Brown corpus)
- 55 files, age range 2;3 to 5;2
- Sentences spoken by adults to children
- Each word replaced by its syntactic category: det, n, adj, prep, pro, prop, to, part, vi, v, aux, comp, wh, c
- Ungrammatical sentences and the most grammatically complex sentence types were removed; 21792 of 25876 utterances were kept
- Removed: topicalized sentences (66), sentences with serial verb constructions (459), subordinate phrases (845), sentential complements (1636), conjunctions (634), and ungrammatical sentences (444)
(A schematic sketch of this preprocessing step follows below)
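To make the preprocessing concrete, here is a minimal sketch (not the authors' code) of the word-to-category step and the collapse into sentence types. The tiny lexicon is hypothetical and stands in for the full mapping used on the corpus.

```python
# A minimal sketch (not the authors' code) of the preprocessing step:
# each word is replaced by its syntactic category, and the resulting
# category strings are collapsed into sentence types with token counts.
from collections import Counter

# Hypothetical toy lexicon; the real mapping covers the whole corpus.
LEXICON = {
    "is": "aux", "the": "det", "girl": "n", "happy": "adj",
    "who": "comp", "sleeping": "vi",
}

def to_categories(sentence):
    """Map a word string to its sequence of syntactic categories."""
    return " ".join(LEXICON[word] for word in sentence.lower().split())

utterances = ["Is the girl happy", "The girl is happy", "Is the girl happy"]
type_counts = Counter(to_categories(u) for u in utterances)
print(type_counts)   # Counter({'aux det n adj': 2, 'det n aux adj': 1})
```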
14Data
- Final corpus contained 2336 individual sentence
types corresponding to 21792 sentence tokens
15Data variation
- Amount of evidence available at different points
in development
16Data variation
- Amount of evidence available at different points in development
- Amount comprehended at different points in development
17Data: amount available
- Rough estimate: split by age
Epoch     Age           Files   Sentence types   % of types
Epoch 0   2;3           1       173              7.4
Epoch 1   2;3 to 2;8    11      879              38
Epoch 2   2;3 to 3;1    22      1295             55
Epoch 3   2;3 to 3;5    33      1735             74
Epoch 4   2;3 to 4;2    44      2090             89
Epoch 5   2;3 to 5;2    55      2336             100
18Data: amount comprehended
- Rough estimate: split by frequency
Level     Frequency (tokens)   Sentence types   % of types   % of tokens
Level 1   500                  8                0.3          28
Level 2   100                  37               1.6          55
Level 3   50                   67               2.9          64
Level 4   25                   115              4.9          71
Level 5   10                   268              12           82
Level 6   1 (all)              2336             100          100
19The model
- Data
- Child-directed speech (CHILDES)
- Grammars
- Linear & hierarchical
- Both: hand-designed & the result of local search
- Linear only: automatic, unsupervised ML
- Evaluation
- Complexity vs. fit
20Grammar types
Linear grammar types:
- Flat grammar: the rules are simply a list of each sentence
- 1-state grammar: anything is accepted
- Regular grammar: rules of the form NT -> t NT and NT -> t
Hierarchical grammar type:
- Context-free grammar: rules of the form NT -> NT NT, NT -> t NT, NT -> NT, and NT -> t
(The slide also gave example productions for each type; a small representational sketch follows below)
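As an illustration of how the rule shapes differ, here is a hedged Python sketch of the two grammar classes. The particular nonterminals, rules, and probabilities are invented for this example, not taken from the grammars in the talk.

```python
# Illustrative sketch of how the two grammar classes constrain rule shapes.
# A probabilistic grammar is represented as a dict mapping each nonterminal
# to a list of (right-hand side, probability) pairs. All names and numbers
# below are made up for illustration only.

REGULAR = {  # every right-hand side: one terminal, optionally followed by one nonterminal
    "S":    [(("pro", "VP"), 0.5), (("det", "NP"), 0.5)],
    "NP":   [(("n", "VP"), 1.0)],
    "VP":   [(("aux", "PRED"), 0.6), (("vi",), 0.4)],
    "PRED": [(("adj",), 1.0)],
}

CFG = {      # right-hand sides may freely mix terminals and nonterminals
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("det", "n"), 0.7), (("pro",), 0.3)],
    "VP": [(("aux", "adj"), 0.5), (("vi",), 0.5)],
}

def is_regular(grammar):
    """True iff every right-hand side has the form (t,) or (t, NT)."""
    nonterminals = set(grammar)
    for expansions in grammar.values():
        for rhs, _prob in expansions:
            ok = (
                (len(rhs) == 1 and rhs[0] not in nonterminals)
                or (len(rhs) == 2 and rhs[0] not in nonterminals and rhs[1] in nonterminals)
            )
            if not ok:
                return False
    return True

print(is_regular(REGULAR), is_regular(CFG))   # True False
```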
21Specific hierarchical grammars: hand-designed
- CFG-S (standard CFG): designed to be as linguistically plausible as possible; 77 rules, 15 non-terminals
- CFG-L (larger CFG): derived from CFG-S; contains additional productions corresponding to different expansions of the same NT (this puts less probability mass on recursive productions); 133 rules, 15 non-terminals
(The slide also gave example productions for each grammar)
22Specific linear grammars: hand-designed
The linear grammars span a continuum from exact fit with no compression to poor fit with high compression:
- FLAT: a list of each sentence; 2336 rules, 0 non-terminals (exact fit, no compression)
- 1-STATE: anything accepted; 26 rules, 0 non-terminals (poor fit, high compression)
23Specific linear grammars: hand-designed
- Adds REG-N: narrowest regular grammar derived from the CFG; 289 rules, 85 non-terminals
24Specific linear grammars: hand-designed
- Adds REG-M: mid-level regular grammar derived from the CFG; 169 rules, 14 non-terminals
25Specific linear grammars: hand-designed
The full set, ordered from exact fit / no compression to poor fit / high compression:
- FLAT: a list of each sentence; 2336 rules, 0 non-terminals
- REG-N: narrowest regular grammar derived from the CFG; 289 rules, 85 non-terminals
- REG-M: mid-level regular grammar derived from the CFG; 169 rules, 14 non-terminals
- REG-B: broadest regular grammar derived from the CFG; 117 rules, 10 non-terminals
- 1-STATE: anything accepted; 26 rules, 0 non-terminals
26Automated search
- Local search around the hand-designed grammars (a schematic sketch follows below)
- Linear: unsupervised, automatic HMM learning (Goldwater & Griffiths, 2007)
- A Bayesian model for the acquisition of a trigram HMM (designed for POS tagging, but given a corpus of syntactic categories it learns a regular grammar)
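The deck does not spell out the search procedure, so the following is only a minimal sketch, under assumed helper functions, of what "local search around a grammar" typically looks like: propose small edits and keep any edit that improves the posterior. The helpers log_prior, log_likelihood, and propose_neighbors are hypothetical stand-ins, not part of the authors' implementation.

```python
# A minimal sketch of greedy local search over grammars: propose small edits
# (e.g., add, delete, or merge productions) and keep an edit whenever it
# improves the log posterior (log prior + log likelihood) on the corpus.
# log_prior, log_likelihood, and propose_neighbors are hypothetical helpers.

def local_search(grammar, corpus, log_prior, log_likelihood, propose_neighbors,
                 max_steps=1000):
    def score(g):
        return log_prior(g) + log_likelihood(corpus, g)

    best, best_score = grammar, score(grammar)
    for _ in range(max_steps):
        improved = False
        for candidate in propose_neighbors(best):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score, improved = candidate, candidate_score, True
                break   # greedily accept the first improving edit
        if not improved:
            return best, best_score   # local optimum reached
    return best, best_score
```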
27The model
- Data
- Child-directed speech (CHILDES)
- Grammars
- Linear & hierarchical
- Hand-designed & the result of local search
- Linear only: automatic, unsupervised ML
- Evaluation
- Complexity vs. fit
28Grammars
T = type of grammar, G = specific grammar, D = data
(Diagram: the prior over grammar types, P(T), is unbiased, i.e. uniform)
29Grammars
T = type of grammar, G = specific grammar, D = data
P(G, T | D) ∝ P(D | G) P(G | T) P(T)
P(G | T) is the complexity term (the prior); P(D | G) is the data-fit term (the likelihood)
30Tradeoff: complexity vs. fit
- Low prior probability = more complex
- Low likelihood = poor fit to the data
(Illustration: one hypothesis with high fit but low simplicity, one with low fit but high simplicity, and one with moderate fit and moderate simplicity)
31Measuring complexity: the prior
- Designing a grammar (a God's-eye view)
- Grammars with more rules and non-terminals have lower prior probability
- Notation: n = number of nonterminals; P_k = number of productions of nonterminal k; N_i = number of items in production i; V = vocabulary size; theta_k = production-probability parameters for nonterminal k (a sketch of a prior in this form follows below)
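The prior itself appeared as an equation image; only the legend above survives in this transcript. A hedged sketch of a prior consistent with that legend is given below; the exact distributions used on the original slide are an assumption here.

```latex
% A hedged sketch of a prior consistent with the legend above: grammars with
% more nonterminals, more productions, and longer productions receive lower
% prior probability, and each item in a production is drawn uniformly from
% the vocabulary of size V. The specific factorization is assumed, not
% recovered from the slide.
P(G \mid T) \;=\; p(n) \prod_{k=1}^{n} \Big[\, p(P_k)\, p(\theta_k)
    \prod_{i=1}^{P_k} p(N_i) \Big(\tfrac{1}{V}\Big)^{N_i} \Big]
```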
32Measuring fit: the likelihood
- The probability of the grammar generating the data
- Computed as the product of the probabilities of each parse
- Example: a parse of "pro aux det n" that uses rules with probabilities 0.5, 0.25, 1.0, 0.25, and 0.5 has probability 0.5 x 0.25 x 1.0 x 0.25 x 0.5 ≈ 0.016 (a code sketch follows below)
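Here is a minimal sketch (not the authors' code) of this computation: a parse probability is the product of the probabilities of the rules it uses, and the corpus log likelihood is the sum of log sentence probabilities. The rule probabilities are those from the slide's example.

```python
# A minimal sketch of the likelihood computation. The probability of a
# single parse is the product of the probabilities of the rules it uses;
# the log likelihood of the corpus is the sum of the log probabilities of
# its sentences. Rule probabilities below come from the slide's example.
import math

def parse_probability(rule_probs):
    """Probability of one parse = product of the probabilities of its rules."""
    p = 1.0
    for prob in rule_probs:
        p *= prob
    return p

# The slide's example parse of "pro aux det n":
print(parse_probability([0.5, 0.25, 1.0, 0.25, 0.5]))   # 0.015625, i.e. ~0.016

def corpus_log_likelihood(sentence_probs):
    """Log likelihood of the corpus: sum of log P(sentence) over sentence types."""
    return sum(math.log(p) for p in sentence_probs)

# A grammar's overall score is then log posterior = log prior + log likelihood
# (up to a normalizing constant), trading off complexity against fit.
```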
33 Plan
- Model
- Data corpus of child-directed speech (CHILDES)
- Grammars
- Linear & hierarchical
- Hand-designed & the result of local search
- Linear only: automated, unsupervised ML
- Evaluation
- Complexity vs. fit
- Results
- Implications
34Results: data split by frequency levels (estimate of comprehension)
(Plot: log posterior probability; lower magnitude = better)
35Results: data split by age (estimate of availability)
36Results: data split by age (estimate of availability)
(Plot: log posterior probability; lower magnitude = better)
37Generalization: how well does each grammar predict sentences it hasn't seen?
38Generalization: how well does each grammar predict sentences it hasn't seen?
(Plot highlights complex interrogatives)
39Take-home messages
- We have shown that, given reasonable domain-general assumptions, an unbiased rational learner could realize that languages have a hierarchical structure based on typical child-directed input
- This paradigm is valuable: it makes all assumptions explicit and enables us to rigorously evaluate how different representations capture the tradeoff between simplicity and fit to data
- In some ways, higher-order knowledge may be easier to learn than specific details (the "blessing of abstraction")
40Implications for innateness?
- Ideal learner
- Strong(er) assumptions:
- The learner can find the best grammar in the space of possibilities
- Weak(er) assumptions:
- The learner has the ability to parse the corpus into syntactic categories
- The learner can represent both linear and hierarchical grammars
- We assume a particular way of calculating complexity and data fit
- Have we actually found representative grammars?
41The End
Thanks also to the following for many helpful discussions: Virginia Savova, Jeff Elman, Danny Fox, Adam Albright, Fei Xu, Mark Johnson, Ken Wexler, Ted Gibson, Sharon Goldwater, Michael Frank, Charles Kemp, Vikash Mansinghka, Noah Goodman
43Grammars
(Diagram: T = grammar type, G = specific grammar, D = data)
44Grammars
(Diagram: T = grammar type, G = specific grammar, D = data)
45The Argument from the Poverty of the Stimulus (PoS)
P1. Children show a specific pattern of behavior B.
P2. A particular generalization G must be grasped in order to produce B.
P3. It is impossible to reasonably induce G simply on the basis of the data D that children receive.
C1. Some abstract knowledge T, limiting which specific generalizations G are possible, is necessary.
(Diagram: nodes T, G, B, D)
46The Argument from the Poverty of the Stimulus (PoS)
P1. Children show a specific pattern of behavior B.
P2. A particular generalization G must be grasped in order to produce B.
P3. It is impossible to reasonably induce G simply on the basis of the data D that children receive.
C1. Some abstract knowledge T, limiting which specific generalizations G are possible, is necessary.
Corollary: the abstract knowledge T could not itself be learned, or could not be learned before G is known.
C2. T must be innate.
(Diagram: nodes T, G, B, D)
47The Argument from the Poverty of the Stimulus (PoS)
G: a specific grammar. D: typical child-directed speech input. B: children don't make certain mistakes (they don't seem to entertain structure-independent hypotheses). T: language has hierarchical phrase structure.
P1. Children show a specific pattern of behavior B.
P2. A particular generalization G must be grasped in order to produce B.
P3. It is impossible to reasonably induce G simply on the basis of the data D that children receive.
C1. Some abstract knowledge T, limiting which specific generalizations G are possible, is necessary.
Corollary: the abstract knowledge T could not itself be learned, or could not be learned before G is known.
C2. T must be innate.
48Data
- Final corpus contained 2336 individual sentence types corresponding to 21792 sentence tokens
- Why types?
- Grammar learning depends on which sentences are generated, not on how many of each type there are
- Much more computationally tractable
- The distribution of sentence tokens depends on many factors other than the grammar (e.g., pragmatics, semantics, discussion topics)
Goldwater, Griffiths, & Johnson, 2005
49Specific linear grammars: hand-designed
The full set, ordered from exact fit / no compression to poor fit / high compression:
- FLAT: a list of each sentence; 2336 rules, 0 non-terminals
- REG-N: narrowest regular grammar derived from the CFG; 289 rules, 85 non-terminals
- REG-M: mid-level regular grammar derived from the CFG; 169 rules, 14 non-terminals
- REG-B: broadest regular grammar derived from the CFG; 117 rules, 10 non-terminals
- 1-STATE: anything accepted; 26 rules, 0 non-terminals
50Why these results?
- Natural language actually is generated from a grammar that looks more like a CFG
- The other grammars overfit and therefore do not capture important language-specific generalizations
(Illustration: the FLAT grammar)
52Computing the prior
CFG (context-free grammar): NT -> NT NT, NT -> t NT, NT -> NT, NT -> t
REG (regular grammar): NT -> t NT, NT -> t
54Likelihood, intuitively
Z is ruled out because it does not explain some of the data points. X and Y both explain the data points, but X is the more likely source.
56Possible empirical tests
- Present people with data that the model learns FLAT, REG, and CFG grammars from; see which novel productions they generalize to
- Non-linguistic versions? With small children?
- Examples of learning regular grammars in real life: does the model do the same?
57Do people learn regular grammars?
Children's songs: line-level grammar
X s1 s2 s3: "Spanish dancer, do the splits. / Spanish dancer, give a kick. / Spanish dancer, turn around."
s1 s2 s3 w1 w1 w1: "Miss Mary Mack, Mack, Mack / All dressed in black, black, black / With silver buttons, buttons, buttons / All down her back, back, back / She asked her mother, mother, mother, ..."
58Do people learn regular grammars?
Children's songs: song-level grammar
X X s1 s2 s3:
"Bubble gum, bubble gum, chew and blow, / Bubble gum, bubble gum, scrape your toe, / Bubble gum, bubble gum, tastes so sweet, ..."
"Teddy bear, teddy bear, turn around. / Teddy bear, teddy bear, touch the ground. / Teddy bear, teddy bear, show your shoe. / Teddy bear, teddy bear, that will do. / Teddy bear, teddy bear, go upstairs."
"Dolly Dimple walks like this, / Dolly Dimple talks like this, / Dolly Dimple smiles like this, / Dolly Dimple throws a kiss."
59Do people learn regular grammars?
Dough a Thing I Buy Beer With Ray a guy who buys
me beer Me, the one who wants a beer Fa, a long
way to the beer So, I think I'll have a beer La,
-gers great but so is beer! Tea, no thanks I'll
have a beer
Songs containing items represented as lists
(where order matters)
A my name is Alice And my husband's name is
Arthur, We come from Alabama, Where we sell
artichokes. B my name is Barney And my wife's
name is Bridget, We come from Brooklyn, Where we
sell bicycles.
Cinderella, dressed in yella, Went upstairs to
kiss a fella, Made a mistake and kissed a
snake, How many doctors did it take? 1, 2, 3,
60Do people learn regular grammars?
If I were the marrying kind I thank the lord I'm
not sir The kind of rugger I would be Would be a
rugby position/item sir Cos I'd verb
phrase And you'd verb phrase We'd all verb
phrase together
Most of the song is a template, with repeated
(varying) element
You put your body part in You put your body
part out You put your body part in and you
shake it all about You do the hokey pokey And
you turn yourself around And that's what it's all
about!
If you're happy and you know it, verb your body part / If you're happy and you know it, then your face will surely show it / If you're happy and you know it, verb your body part
61Do people learn regular grammars?
Other interesting structures
I know a song that never ends, It goes on and on
my friends, I know a song that never ends, And
this is how it goes (repeat)
There was a farmer had a dog, And Bingo was his
name-O. B-I-N-G-O! B-I-N-G-O! B-I-N-G-O! And
Bingo was his name-O! (each subsequent verse,
replace a letter with a clap)
Oh, Sir Richard, do not touch me (each
subsequent verse, remove the last word at the end
of the sentence)
63New PRG: 1-state
(Diagram: a single state S, followed by End, emitting the categories det, n, pro, prop, prep, adj, aux, wh, comp, to, v, vi, part)
Log(prior) = 0: no free parameters
64Another PRG: standard grammar plus noise
- For instance, a "level-1 PRG + noise" would be the best regular grammar for the corpus at level 1, plus the 1-state model
- This could parse all levels of evidence
- Perhaps this would be better than a more complicated PRG at later levels of evidence
66Results: frequency levels (comprehension estimates)
(Plot: log posterior, smaller is better; paired bars P and L show the absolute log prior and log likelihood for each grammar)
67Results: availability by age
(Plot: log posterior, smaller is better; paired bars P and L show the absolute log prior and log likelihood for each grammar)
69Specific grammars of each type
- One type of hand-designed grammar
69 productions, 14 nonterminals
390 productions, 85 nonterminals
70Specific grammars of each type
- The other type of hand-designed grammar
126 productions, 14 nonterminals
170 productions, 14 nonterminals
72The Argument from the Poverty of the Stimulus (PoS)
P1. It is impossible to have made some generalization G simply on the basis of data D.
P2. Children show behavior B.
P3. Behavior B is not possible without having made G.
G: a specific grammar. D: typical child-directed speech input. B: children don't make certain mistakes (they don't seem to entertain structure-independent hypotheses). T: language has hierarchical phrase structure.
C1. Some constraints T, which limit what type of generalizations G are possible, must be innate.
73Reply 1: Children hear complex interrogatives
- Well, a few, but not many
- Adam (CHILDES): 0.048%
- No yes-no questions
- Four wh-questions (e.g., "What is the music it's playing?")
- Nina (CHILDES): 0.068%
- No yes-no questions
- 14 wh-questions
- In all, most estimates are << 1% of input
Legate & Yang, 2002
74Reply 1: Children hear complex interrogatives
- Well, a few, but not many
- Adam (CHILDES): 0.048%
- No yes-no questions
- Four wh-questions (e.g., "What is the music it's playing?")
- Nina (CHILDES): 0.068%
- No yes-no questions
- 14 wh-questions
- In all, most estimates are << 1% of input
How much is enough?
Legate & Yang, 2002
75Reply 2: Can get the behavior without structure
- There is enough statistical information in the input to be able to conclude which type of complex interrogative is ungrammatical
- Rare: comp adj aux; Common: comp aux adj
Reali & Christiansen, 2004; Lewis & Elman, 2001
76Reply 2: Can get the behavior without structure
- Response: there is enough statistical information in the input to be able to conclude that "Are eagles that alive can fly?" is ungrammatical
- This sidesteps the question: it does not address the innateness of structure (knowledge X)
- It is explanatorily opaque
- Rare: comp adj aux; Common: comp aux adj
Reali & Christiansen, 2004; Lewis & Elman, 2001
77Why do linguists believe that language has hierarchical phrase structure?
- Formal properties: an information-theoretic, simplicity-based argument (Chomsky, 1956)
- A sentence S has an (i, j) dependency if replacement of the ith symbol a_i of S by b_i requires a corresponding replacement of the jth symbol a_j of S by b_j
- If S has an m-termed dependency set in L, at least 2^m states are necessary in the finite-state grammar that generates L
- Therefore, if L is a finite-state language, then there is an m such that no sentence S of L has a dependency set of more than m terms in L
- The mirror language, made up of sentences consisting of a string X followed by X in reverse (e.g., aa, abba, babbab, aabbaa, etc.), has the property that for any m we can find a dependency set D = {(1, 2m), (2, 2m-1), ..., (m, m+1)}. Therefore it cannot be captured by any finite-state grammar (a compact restatement follows below this slide)
- English has infinite sets of sentences with dependency sets of more than any fixed number of terms. E.g., in "the man who said that S5 is arriving today", there is a dependency between "man" and "is". Therefore English cannot be finite-state
- There is the possible counterargument that, since any finite corpus could be captured by a finite-state grammar, English is only not finite-state in the limit; in practice it could be
- Easy counterargument: simplicity considerations. Chomsky: "If the processes have a limit, then the construction of a finite-state grammar will not be literally impossible (since a list is a trivial finite-state grammar), but this grammar will be so complex as to be of little use or interest."
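For readers who want the formal step spelled out, here is a compact, hedged restatement of the mirror-language argument; the notation is mine, not the slide's.

```latex
% A compact, hedged restatement of the mirror-language argument above
% (after Chomsky, 1956); the notation is mine, not the slide's.
\[
  L_{\text{mirror}} = \{\, X X^{R} : X \in \{a,b\}^{*} \,\}, \qquad
  X X^{R} \text{ with } |X| = m \text{ has the dependency set }
  \{(1,2m),\, (2,2m-1),\, \dots,\, (m,m+1)\}.
\]
\[
  \text{Distinct prefixes } X \neq Y \in \{a,b\}^{m}
  \text{ must lead to distinct automaton states (otherwise } Y X^{R}
  \text{ would wrongly be accepted),}
\]
\[
  \text{so a finite-state grammar for } L_{\text{mirror}}
  \text{ needs at least } 2^{m} \text{ states for every } m,
  \text{ which no finite machine can supply.}
\]
```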
78The big picture
Innate
Learned
79Grammar Acquisition (Chomsky)
Innate
Learned
80The Argument from the Poverty of the Stimulus (PoS)
P1. Children show behavior B.
81The Argument from the Poverty of the Stimulus (PoS)
P1. Children show behavior B.
P2. Behavior B is not possible without having some specific grammar or rule G.
82The Argument from the Poverty of the Stimulus (PoS)
P1. Children show behavior B.
P2. Behavior B is not possible without having some specific grammar or rule G.
P3. It is impossible to have learned G simply on the basis of data D.
83The Argument from the Poverty of the Stimulus (PoS)
P1. Children show behavior B.
P2. Behavior B is not possible without having some specific grammar or rule G.
P3. It is impossible to have learned G simply on the basis of data D.
C1. Some constraints T, which limit what type of grammars are possible, must be innate.
(Diagram built up across these slides: nodes D, G, B, T)
84Replies to the PoS argument
Reply: there are enough complex interrogatives in D (e.g., Pullum & Scholz, 2002)
P1. It is impossible to have made some generalization G simply on the basis of data D.
P2. Children show behavior B.
P3. Behavior B is not possible without having made G.
C1. Some constraints T, which limit what type of generalizations G are possible, must be innate.
85Replies to the PoS argument
Reply: there are enough complex interrogatives in D (Pullum & Scholz, 2002)
Reply: there is a route to B other than G, namely statistical learning (e.g., Lewis & Elman, 2001; Reali & Christiansen, 2005)
P1. It is impossible to have made some generalization G simply on the basis of data D.
P2. Children show behavior B.
P3. Behavior B is not possible without having made G.
C1. Some constraints T, which limit what type of generalizations G are possible, must be innate.
86Innate
Learned
87Explicit structure
Innate
Learned
No explicit structure
88Explicit structure
Innate
Learned
No explicit structure
110Our argument
- Assumptions: the learner is equipped with
- The capacity to represent both linear and hierarchical grammars (no bias)
- A rational Bayesian learning mechanism (probability calculation)
- The ability to effectively search the space of possible grammars
111Take-home message
- We have shown that, given reasonable domain-general assumptions, an unbiased rational learner could realize that languages have a hierarchical structure based on typical child-directed input
112Take-home message
- We have shown that, given reasonable domain-general assumptions, an unbiased rational learner could realize that languages have a hierarchical structure based on typical child-directed input
- We can use this paradigm to explore the role of recursive elements in a grammar
- The winning grammar contains additional non-recursive counterparts for complex NPs
- Perhaps language, while fundamentally recursive, contains duplicate non-recursive elements that more precisely match the input?
113The role of recursion
- We evaluated an additional grammar (CFG-DL) that contained no recursive complex NPs at all; instead, it used multiply-embedded, depth-limited ones
- No sentence in the corpus occurred with more than two levels of nesting
114The role of recursion: results
(Plot: log posterior probability; lower magnitude = better)
115The role of recursion: results
116The role of recursion: implications
- The optimal tradeoff results in a grammar that goes beyond the data in interesting ways
- Auxiliary fronting
- Recursive complex NPs
- A grammar with recursive complex NPs is more optimal, even though
- Recursive productions hurt in the likelihood
- There are no sentences with more than two levels of nesting in the input