Title: Minimum Description Length: An Adequate Syntactic Theory?
1. Minimum Description Length: An Adequate Syntactic Theory?
2. Linguistic Theory
Chomsky's Conceptualization of Language Acquisition
3. Diachronic Theories
Language Acquisition Device
Primary Linguistic Data
Arena of Language Use
Hurford's Diachronic Spiral
4. Learnability
- Poverty of the stimulus
- Language is really complex
- Obscure and abstract rules constrain wh-movement, pronoun binding, passive formation, etc.
- Examples of E-language don't give sufficient information to determine this
5. WH-movement
- Who_i do you think Lord Emsworth will invite t_i?
- Who_i do you think that Lord Emsworth will invite t_i?
- Who_i do you think t_i will arrive first?
- *Who_i do you think that t_i will arrive first?
6. Negative Evidence
- Some constructions seem impossible to learn without negative evidence
- John gave a painting to the museum
- John gave the museum a painting
- John donated a painting to the museum
- *John donated the museum a painting
7. Implicit Negative Evidence
- If constructions don't appear, can we just assume they're not grammatical?
- No: we only see a tiny proportion of possible, grammatical sentences
- People generalize from examples they have seen to form new utterances
- Under exactly what circumstances does a child conclude that a nonwitnessed sentence is ungrammatical? (Pinker, 1989)
8. Learnability Proofs
- Gold (1967): for languages to be learnable in the limit we must have
- negative evidence
- or a priori restrictions on possible languages
- But learnable in the limit means being sure that we have determined the correct language
9. Statistical Learnability
- Horning (1969)
- If grammars are statistical
- so that utterances are produced with frequencies corresponding to the grammar
- then languages are learnable
- But we can never be sure when the correct grammar has been found
- This just gets more likely as we see more data
10. Horning's Proof
- Used Bayes' rule
- More complex grammars are less probable a priori: P(h)
- Statistical grammars can assign probabilities to data: P(d|h)
- Search through all possible grammars, starting with the simplest
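The evaluation behind Horning's proof can be written as the standard Bayesian posterior (my rendering of the relation named on this slide, not a formula copied from Horning):

\[
P(h \mid d) \;=\; \frac{P(d \mid h)\,P(h)}{P(d)} \;\propto\; P(d \mid h)\,P(h)
\]

Since P(d) is the same for every grammar, comparing grammars only requires the prior P(h) and the likelihood P(d|h).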
11. MDL
- Horning's evaluation method for grammars can be seen as a form of Minimum Description Length
- Simplest is best (Occam's Razor)
- Simplest means specifiable with the least amount of information
- Information theory (Shannon, 1948) allows us to link probability and information
- Amount of information = -log(probability)
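As a minimal illustration of the probability-information link (my own sketch, not part of the original slides): code length in bits is the negative base-2 logarithm of probability, so rarer events cost more bits.

import math

def code_length_bits(probability):
    # Shannon information: an event with probability p needs -log2(p) bits
    return -math.log2(probability)

print(code_length_bits(1/2))     # 1.0 bit
print(code_length_bits(1/8))     # 3.0 bits
print(code_length_bits(1/1000))  # ~9.97 bits: rarer events cost more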
12. Encoding Grammars and Data
[Diagram: a binary string (1010100111010100101101010001100111100011010110) is fed to a decoder, which recovers the grammar and then the data coded in terms of that grammar]
Data: The comedian died / A kangaroo burped / The aeroplane laughed / Some comedian burped
Grammar:
A → B C
B → D E
E → kangaroo, aeroplane, comedian
D → the, a, some
C → died, laughed, burped
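A rough sketch (my own, assuming uniform choices) of how the data could be coded in terms of the toy grammar above: each sentence is one determiner, one noun and one verb, each picked from three equally likely options, so it costs log2(27) ≈ 4.75 bits.

import math

# Lexical choices licensed by the toy grammar (D, E and C each expand three ways)
D = ["the", "a", "some"]
E = ["kangaroo", "aeroplane", "comedian"]
C = ["died", "laughed", "burped"]

def sentence_cost_bits():
    # Assuming each expansion is equally likely, a sentence is one choice
    # from each of D, E and C: -log2(1/3) bits per choice
    return sum(-math.log2(1 / len(options)) for options in (D, E, C))

data = ["The comedian died", "A kangaroo burped",
        "The aeroplane laughed", "Some comedian burped"]
print(sentence_cost_bits())              # ~4.75 bits per sentence
print(sentence_cost_bits() * len(data))  # ~19.0 bits for the four sentences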
13. Complexity and Probability
- More complex grammar
- Longer coding length, so lower probability
- More restrictive grammar
- Fewer choices for data, so each possibility has a higher probability
14.
- The most restrictive grammar just lists all possible utterances
- Only the observed data is grammatical, so it has a high probability
- A simple grammar could be made that allowed any sentences
- The grammar would have a high probability
- But the data would have a very low one
- MDL finds a middle ground between always generalizing and never generalizing
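This trade-off is the standard two-part MDL objective (my formulation of what the slide describes): choose the grammar that minimizes grammar cost plus data cost.

\[
G^{*} \;=\; \arg\min_{G}\;\bigl[\,L(G) + L(D \mid G)\,\bigr]
\]

A grammar that merely lists the data makes L(D|G) small but L(G) large; a grammar that allows anything makes L(G) small but L(D|G) large.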
15. Rampant Synonymy?
- Inductive inference (Solomonoff, 1960a)
- Kolmogorov complexity (Kolmogorov, 1965)
- Minimum Message Length (Wallace and Boulton, 1968)
- Algorithmic Information Theory (Chaitin, 1969)
- Minimum Description Length (Rissanen, 1978)
- Minimum Coding Length (Ellison, 1992)
- Bayesian Learning (Stolcke, 1994)
- Minimum Representation Length (Brent, 1996)
16. Evaluation and Search
- The MDL principle gives us an evaluation criterion for grammars (with respect to corpora)
- But it doesn't solve the problem of how to find the grammars in the first place
- A search mechanism is needed
17. Two Learnability Problems
- How to determine which of two or more grammars is best given some data
- How to guide the search for grammars so that we can find the correct one, without considering every logically possible grammar
18. MDL in Linguistics
- Solomonoff (1960b): Mechanization of Linguistic Learning
- Learning phrase structure grammars for simple toy languages: Stolcke (1994), Langley and Stromsten (2000)
- Or real corpora: Chen (1995), Grünwald (1994)
- Or for language modelling in speech recognition systems: Starkie (2001)
19. Not Just Syntax!
- Phonology: Ellison (1992), Rissanen and Ristad (1994)
- Morphology: Brent (1993), Goldsmith (2001)
- Segmenting continuous speech: de Marcken (1996), Brent and Cartwright (1997)
20. MDL and Parameter Setting
- Briscoe (1999) and Rissanen and Ristad (1994) used MDL as part of parameter-setting learning mechanisms
MDL and Iterated Learning
- Briscoe (1999) used MDL as part of an expression-induction model
- Brighton (2002) investigated the effect of bottlenecks on an MDL learner
- Roberts et al. (2005) modelled lexical exceptions to syntactic rules
21. An Example: My Model
- Learns simple phrase structure grammars
- Binary or non-branching rules
- A → B C
- D → E
- F → tomato
- All derivations start from the special symbol S
- A null symbol in the 3rd position indicates a non-branching rule
22. Encoding Grammars
- Grammars can be coded as lists of three symbols
- The first symbol is a rule's left-hand side, the second and third its right-hand side
- A, B, C, D, E, null, F, tomato, null
- First we have to encode the frequency of each symbol
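A minimal sketch (mine, not the model's actual code) of coding a grammar as a flat list of symbol triples and costing each symbol occurrence by its relative frequency:

import math
from collections import Counter

# Grammar as (LHS, RHS1, RHS2) triples; "null" marks a non-branching rule
rules = [("A", "B", "C"), ("D", "E", "null"), ("F", "tomato", "null")]

symbols = [s for rule in rules for s in rule]   # flatten to a symbol list
counts = Counter(symbols)
total = len(symbols)

# Cost of each symbol occurrence is -log2 of its relative frequency
# (the frequencies themselves would also need to be transmitted first)
grammar_bits = sum(-math.log2(counts[s] / total) for s in symbols)
print(round(grammar_bits, 2))   # ~26.53 bits for this toy grammar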
23. Encoding Data
- 1: S → NP VP (3)
- 2: NP → john (2)
- 3: NP → mary (1)
- 4: VP → screamed (2)
- 5: VP → died (1)
- Data: 1, 2, 4, 1, 2, 5, 1, 3, 4
- Probabilities: 1 → 3/3, 2 → 2/3, 4 → 2/3, 1 → 3/3, 2 → 2/3
- We must record the frequency of each rule
Total frequency for S: 3
Total frequency for NP: 3
Total frequency for VP: 3
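A sketch of the data cost for this example (my own arithmetic, using the rule probabilities above): each rule use costs -log2 of its frequency relative to the total frequency of its left-hand side.

import math

# rule id -> (frequency of the rule, total frequency of its left-hand side)
rule_probs = {1: (3, 3),  # S  -> NP VP
              2: (2, 3),  # NP -> john
              3: (1, 3),  # NP -> mary
              4: (2, 3),  # VP -> screamed
              5: (1, 3)}  # VP -> died

data = [1, 2, 4, 1, 2, 5, 1, 3, 4]   # rule sequence for the three sentences

data_bits = sum(-math.log2(f / total) for f, total in (rule_probs[r] for r in data))
print(round(data_bits, 2))   # ~5.51 bits (rule 1 is free: probability 3/3)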
24. Encoding in My Model
[Diagram: a binary string is fed to a decoder, which recovers, in turn, the symbol frequencies, the grammar, the rule frequencies and the data]
Symbol frequencies: S (1), NP (3), VP (3), john (1), mary (1), screamed (1), died (1), null (4)
Grammar: 1: S → NP VP; 2: NP → john; 3: NP → mary; 4: VP → screamed; 5: VP → died
Rule frequencies: Rule 1 → 3, Rule 2 → 2, Rule 3 → 1, Rule 4 → 2, Rule 5 → 1
Data: John screamed / John died / Mary screamed
25. Search Strategy
- Start with a simple grammar that allows all sentences
- Make a simple change and see if it improves the evaluation (add a rule, delete a rule, change a symbol in a rule, etc.)
- Annealing search
- First stage: just look at the data coding length
- Second stage: look at the overall evaluation
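A highly simplified sketch of this kind of local search (the helper functions propose_change and description_length are hypothetical; the real model's moves and annealing schedule will differ):

def search(initial_grammar, data, steps, propose_change, description_length):
    # Greedy local search: keep any single change that shortens the code.
    # propose_change(grammar)        -> a copy with a rule added, deleted or edited
    # description_length(grammar, d) -> grammar bits + data bits (or, in the first
    #                                   annealing stage, the data bits alone)
    grammar = initial_grammar
    best = description_length(grammar, data)
    for _ in range(steps):
        candidate = propose_change(grammar)          # e.g. add/delete/change a rule
        score = description_length(candidate, data)
        if score < best:                             # accept only improvements
            grammar, best = candidate, score
    return grammar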
26. Example: English
Learned Grammar:
S → NP VP
VP → ran
VP → screamed
VP → Vt NP
VP → Vs S
Vt → hit
Vt → kicked
Vs → thinks
Vs → hopes
NP → John
NP → Ethel
NP → Mary
NP → Noam
- John hit Mary
- Mary hit Ethel
- Ethel ran
- John ran
- Mary ran
- Ethel hit John
- Noam hit John
- Ethel screamed
- Mary kicked Ethel
- John hopes Ethel thinks Mary hit Ethel
- Ethel thinks John ran
- John thinks Ethel ran
- Mary ran
- Ethel hit Mary
- Mary thinks John hit Ethel
- John screamed
- Noam hopes John screamed
- Mary hopes Ethel hit John
- Noam kicked Mary
27. Evaluations
28. Dative Alternation
- Children learn the distinction between alternating and non-alternating verbs
- Previously unseen verbs are used productively in both constructions
- New verbs follow the regular pattern
- During learning children use non-alternating verbs in both constructions
- U-shaped learning
29. Training Data
- Three alternating verbs: gave, passed, lent
- One non-alternating verb: donated
- One verb seen only once: sent
- The museum lent Sam a painting
- John gave a painting to Sam
- Sam donated John to the museum
- The museum sent a painting to Sam
30. Dative Evaluations
31. Grammar Properties
- The learned grammar distinguishes alternating and non-alternating verbs
- sent appears in the alternating class
- With less data there is only one class of verbs, so donated can appear in both constructions
- All sentences generated by the grammar are grammatical
- But the structures are not right
32. Learned Structures
[Tree diagram: the learned parse of "John gave a painting to Sam", built from the categories S, X, Y, Z, NP, VA, DET, N and P]
33. Regular and Irregular Rules
- Why does the model place a newly seen verb in the regular class?
- Y → VA NP
- Y → VA Z
- Y → VP Z
- VA → passed
- VA → gave
- VA → lent
- VP → donated
- VA / VP → sent

                           sent doesn't alternate   sent alternates
Overall evaluation (bits)  1703.6                   1703.4
Grammar (bits)             322.2                    321.0
Data (bits)                1381.4                   1382.3

Regular constructions are preferred because the grammar is coded statistically
34. Why Use Statistical Grammars?
- Statistics are a valuable source of information
- They help to infer when absences are due to chance
- The learned grammar predicted that sent should appear in the double object construction
- but in 150 sentences it was only seen in the prepositional dative construction
- With a non-statistical grammar we need an explanation as to why this is
- A statistical grammar knows that sent is rare, which explains the absence of double object occurrences
35. Scaling Up: Onnis, Roberts and Chater (2003)
- Causative alternation
- John cut the string
- The string cut
- John arrived the train
- The train arrived
- John bounced the ball
- The ball bounced
36. Onnis et al.'s Data
- Two word classes: N and V
- NV and VN are the only allowable sentences
- 16 verbs alternate: NV and VN
- 10 verbs: NV only
- 10 verbs: VN only
- The coding scheme marks non-alternating verbs as exceptional (a cost in the coding scheme)
37. Onnis et al.'s Results
- < 16,000 sentences: all verbs alternate
- > 16,000 sentences: non-alternating verbs classified as exceptional
- No search mechanism: they just looked at the evaluations with and without exceptions
- In an expression-induction model quasi-regularities appear as a result of chance omissions
38. MDL and MML Issues
- Numeric parameters: accuracy
- Bayes optimal classification (not MAP learning): Monte Carlo methods
- If we see a sentence, work out its probability under each grammar
- A weighted sum gives the probability of the sentence
- Unseen data: zero probability?
39. One- and Two-Part Codes
[Diagram (two-part code): a binary string is fed to a decoder, which recovers the grammar and then the data coded in terms of that grammar]
[Diagram (one-part code): a binary string is fed to a decoder, which recovers the data and grammar combined]
40. Coding English Texts
- The grammar is a frequency for each letter and for space
- Counts start at one
- We decode a series of letters and update the counts for each letter
- All letters are coded in terms of their probabilities at that point in the decoding
- At the end we have a decoded text and grammar
41. Decoding Example
Letter   Count   Count   Count   Count
A        1       2       2       2
B        1       1       2       3
C        1       1       1       1
Space    1       1       1       1
Decoded string: A (P = 1/4), B (P = 1/5), B (P = 2/6)
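A small sketch (mine) of the adaptive scheme in this example: counts start at one and are updated after every symbol, so the probabilities used for "A B B" match the table: 1/4, 1/5, 2/6.

import math

def adaptive_code_lengths(text, alphabet=("A", "B", "C", " ")):
    # Code each symbol with its current count/total, then update the counts
    counts = {symbol: 1 for symbol in alphabet}   # counts start at one
    bits = []
    for symbol in text:
        total = sum(counts.values())
        bits.append(-math.log2(counts[symbol] / total))  # cost at this point
        counts[symbol] += 1                              # update after coding
    return bits

print(adaptive_code_lengths("ABB"))   # probabilities used: 1/4, 1/5, 2/6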
42. One-Part Grammars
- Grammars can also be coded using one-part codes
- Start with no grammar, but have a probability associated with adding a new rule
- Each time we decode data we first choose either to add a new rule or to use an existing one
- Examples are Dowman (2000) and Venkataraman (1997)
43. Conclusions
- MDL can solve the poverty of the stimulus problem
- But it doesn't solve the problem of constraining the search for grammars
- Coding schemes create learning biases
- Statistical grammars and statistical coding of grammars can help learning