Title: Minimum Description Length: An Adequate Syntactic Theory?
1. Minimum Description Length: An Adequate Syntactic Theory?
2. Linguistic Theory
Chomsky's Conceptualization of Language Acquisition
3. Diachronic Theories
Language Acquisition Device
Primary Linguistic Data
Arena of Language Use
Hurford's Diachronic Spiral
4. Learnability
- Poverty of the stimulus
- Language is really complex
- Obscure and abstract rules constrain wh-movement, pronoun binding, passive formation, etc.
- Examples of E-language don't give sufficient information to determine this
5. WH-movement
- Who_i do you think Lord Emsworth will invite t_i?
- Who_i do you think that Lord Emsworth will invite t_i?
- Who_i do you think t_i will arrive first?
- *Who_i do you think that t_i will arrive first?
6. Negative Evidence
- Some constructions seem impossible to learn without negative evidence
- John gave a painting to the museum
- John gave the museum a painting
- John donated a painting to the museum
- *John donated the museum a painting
7. Implicit Negative Evidence
- If constructions don't appear, can we just assume they're not grammatical?
- No: we only see a tiny proportion of possible, grammatical sentences
- People generalize from examples they have seen to form new utterances
- Under exactly what circumstances does a child conclude that a nonwitnessed sentence is ungrammatical? (Pinker, 1989)
8. Learnability Proofs
- Gold (1967): for languages to be learnable in the limit we must have
- negative evidence
- or a priori restrictions on possible languages
- But learnable in the limit means being sure that we have determined the correct language
9. Statistical Learnability
- Horning (1969)
- If grammars are statistical
- so that utterances are produced with frequencies corresponding to the grammar
- then languages are learnable
- But we can never be sure when the correct grammar has been found
- This just gets more likely as we see more data
10. Horning's Proof
- Used Bayes' rule
- More complex grammars are less probable a priori: P(h)
- Statistical grammars can assign probabilities to data: P(d|h)
- Search through all possible grammars, starting with the simplest
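The evaluation behind Horning's proof can be written as the standard Bayesian posterior (my rendering of the relation named on this slide, not a formula copied from Horning):

\[
P(h \mid d) \;=\; \frac{P(d \mid h)\,P(h)}{P(d)} \;\propto\; P(d \mid h)\,P(h)
\]

Since P(d) is the same for every grammar, comparing grammars only requires the prior P(h) and the likelihood P(d|h).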
11. MDL
- Horning's evaluation method for grammars can be seen as a form of Minimum Description Length
- Simplest is best (Occam's Razor)
- Simplest means specifiable with the least amount of information
- Information theory (Shannon, 1948) allows us to link probability and information
- Amount of information = -log(probability)
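As a minimal illustration of the probability-information link (my own sketch, not part of the original slides): code length in bits is the negative base-2 logarithm of probability, so rarer events cost more bits.

import math

def code_length_bits(probability):
    # Shannon information: an event with probability p needs -log2(p) bits
    return -math.log2(probability)

print(code_length_bits(1/2))     # 1.0 bit
print(code_length_bits(1/8))     # 3.0 bits
print(code_length_bits(1/1000))  # ~9.97 bits: rarer events cost more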
12. Encoding Grammars and Data
[Diagram: a binary string (1010100111010100101101010001100111100011010110) is fed to a decoder, which recovers the grammar and then the data coded in terms of that grammar]
Data: The comedian died / A kangaroo burped / The aeroplane laughed / Some comedian burped
Grammar:
A → B C
B → D E
E → kangaroo, aeroplane, comedian
D → the, a, some
C → died, laughed, burped
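A rough sketch (my own, assuming uniform choices) of how the data could be coded in terms of the toy grammar above: each sentence is one determiner, one noun and one verb, each picked from three equally likely options, so it costs log2(27) ≈ 4.75 bits.

import math

# Lexical choices licensed by the toy grammar (D, E and C each expand three ways)
D = ["the", "a", "some"]
E = ["kangaroo", "aeroplane", "comedian"]
C = ["died", "laughed", "burped"]

def sentence_cost_bits():
    # Assuming each expansion is equally likely, a sentence is one choice
    # from each of D, E and C: -log2(1/3) bits per choice
    return sum(-math.log2(1 / len(options)) for options in (D, E, C))

data = ["The comedian died", "A kangaroo burped",
        "The aeroplane laughed", "Some comedian burped"]
print(sentence_cost_bits())              # ~4.75 bits per sentence
print(sentence_cost_bits() * len(data))  # ~19.0 bits for the four sentences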
13. Complexity and Probability
- More complex grammar
- Longer coding length, so lower probability
- More restrictive grammar
- Fewer choices for data, so each possibility has a higher probability
14.
- The most restrictive grammar just lists all possible utterances
- Only the observed data is grammatical, so it has a high probability
- A simple grammar could be made that allowed any sentences
- The grammar would have a high probability
- But the data would have a very low one
- MDL finds a middle ground between always generalizing and never generalizing
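This trade-off is the standard two-part MDL objective (my formulation of what the slide describes): choose the grammar that minimizes grammar cost plus data cost.

\[
G^{*} \;=\; \arg\min_{G}\;\bigl[\,L(G) + L(D \mid G)\,\bigr]
\]

A grammar that merely lists the data makes L(D|G) small but L(G) large; a grammar that allows anything makes L(G) small but L(D|G) large.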
15. Rampant Synonymy?
- Inductive inference (Solomonoff, 1960a)
- Kolmogorov complexity (Kolmogorov, 1965)
- Minimum Message Length (Wallace and Boulton, 1968)
- Algorithmic Information Theory (Chaitin, 1969)
- Minimum Description Length (Rissanen, 1978)
- Minimum Coding Length (Ellison, 1992)
- Bayesian Learning (Stolcke, 1994)
- Minimum Representation Length (Brent, 1996)
16. Evaluation and Search
- The MDL principle gives us an evaluation criterion for grammars (with respect to corpora)
- But it doesn't solve the problem of how to find the grammars in the first place
- A search mechanism is needed
17. Two Learnability Problems
- How to determine which of two or more grammars is best given some data
- How to guide the search for grammars so that we can find the correct one, without considering every logically possible grammar
18. MDL in Linguistics
- Solomonoff (1960b): Mechanization of Linguistic Learning
- Learning phrase structure grammars for simple toy languages: Stolcke (1994), Langley and Stromsten (2000)
- Or real corpora: Chen (1995), Grünwald (1994)
- Or for language modelling in speech recognition systems: Starkie (2001)
19. Not Just Syntax!
- Phonology: Ellison (1992), Rissanen and Ristad (1994)
- Morphology: Brent (1993), Goldsmith (2001)
- Segmenting continuous speech: de Marcken (1996), Brent and Cartwright (1997)
20. MDL and Parameter Setting
- Briscoe (1999) and Rissanen and Ristad (1994) used MDL as part of parameter-setting learning mechanisms
MDL and Iterated Learning
- Briscoe (1999) used MDL as part of an expression-induction model
- Brighton (2002) investigated the effect of bottlenecks on an MDL learner
- Roberts et al. (2005) modelled lexical exceptions to syntactic rules
21. An Example: My Model
- Learns simple phrase structure grammars
- Binary or non-branching rules
- A → B C
- D → E
- F → tomato
- All derivations start from the special symbol S
- A null symbol in the 3rd position indicates a non-branching rule
22. Encoding Grammars
- Grammars can be coded as lists of three symbols
- The first symbol is a rule's left-hand side, the second and third its right-hand side
- A, B, C, D, E, null, F, tomato, null
- First we have to encode the frequency of each symbol
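A minimal sketch (mine, not the model's actual code) of coding a grammar as a flat list of symbol triples and costing each symbol occurrence by its relative frequency:

import math
from collections import Counter

# Grammar as (LHS, RHS1, RHS2) triples; "null" marks a non-branching rule
rules = [("A", "B", "C"), ("D", "E", "null"), ("F", "tomato", "null")]

symbols = [s for rule in rules for s in rule]   # flatten to a symbol list
counts = Counter(symbols)
total = len(symbols)

# Cost of each symbol occurrence is -log2 of its relative frequency
# (the frequencies themselves would also need to be transmitted first)
grammar_bits = sum(-math.log2(counts[s] / total) for s in symbols)
print(round(grammar_bits, 2))   # ~26.53 bits for this toy grammar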
23. Encoding Data
- 1: S → NP VP (3)
- 2: NP → john (2)
- 3: NP → mary (1)
- 4: VP → screamed (2)
- 5: VP → died (1)
- Data: 1, 2, 4, 1, 2, 5, 1, 3, 4
- Probabilities: 1 → 3/3, 2 → 2/3, 4 → 2/3, 1 → 3/3, 2 → 2/3
- We must record the frequency of each rule
Total frequency for S: 3
Total frequency for NP: 3
Total frequency for VP: 3
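A sketch of the data cost for this example (my own arithmetic, using the rule probabilities above): each rule use costs -log2 of its frequency relative to the total frequency of its left-hand side.

import math

# rule id -> (frequency of the rule, total frequency of its left-hand side)
rule_probs = {1: (3, 3),  # S  -> NP VP
              2: (2, 3),  # NP -> john
              3: (1, 3),  # NP -> mary
              4: (2, 3),  # VP -> screamed
              5: (1, 3)}  # VP -> died

data = [1, 2, 4, 1, 2, 5, 1, 3, 4]   # rule sequence for the three sentences

data_bits = sum(-math.log2(f / total) for f, total in (rule_probs[r] for r in data))
print(round(data_bits, 2))   # ~5.51 bits (rule 1 is free: probability 3/3)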
24. Encoding in My Model
[Diagram: a binary string is fed to a decoder, which recovers, in turn, the symbol frequencies, the grammar, the rule frequencies and the data]
Symbol frequencies: S (1), NP (3), VP (3), john (1), mary (1), screamed (1), died (1), null (4)
Grammar: 1: S → NP VP; 2: NP → john; 3: NP → mary; 4: VP → screamed; 5: VP → died
Rule frequencies: Rule 1 → 3, Rule 2 → 2, Rule 3 → 1, Rule 4 → 2, Rule 5 → 1
Data: John screamed / John died / Mary screamed
25. Search Strategy
- Start with a simple grammar that allows all sentences
- Make a simple change and see if it improves the evaluation (add a rule, delete a rule, change a symbol in a rule, etc.)
- Annealing search
- First stage: just look at the data coding length
- Second stage: look at the overall evaluation
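A highly simplified sketch of this kind of local search (the helper functions propose_change and description_length are hypothetical; the real model's moves and annealing schedule will differ):

def search(initial_grammar, data, steps, propose_change, description_length):
    # Greedy local search: keep any single change that shortens the code.
    # propose_change(grammar)        -> a copy with a rule added, deleted or edited
    # description_length(grammar, d) -> grammar bits + data bits (or, in the first
    #                                   annealing stage, the data bits alone)
    grammar = initial_grammar
    best = description_length(grammar, data)
    for _ in range(steps):
        candidate = propose_change(grammar)          # e.g. add/delete/change a rule
        score = description_length(candidate, data)
        if score < best:                             # accept only improvements
            grammar, best = candidate, score
    return grammar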
26. Example: English
Learned Grammar:
S → NP VP
VP → ran
VP → screamed
VP → Vt NP
VP → Vs S
Vt → hit
Vt → kicked
Vs → thinks
Vs → hopes
NP → John
NP → Ethel
NP → Mary
NP → Noam
- John hit Mary
- Mary hit Ethel
- Ethel ran
- John ran
- Mary ran
- Ethel hit John
- Noam hit John
- Ethel screamed
- Mary kicked Ethel
- John hopes Ethel thinks Mary hit Ethel
- Ethel thinks John ran
- John thinks Ethel ran
- Mary ran
- Ethel hit Mary
- Mary thinks John hit Ethel
- John screamed
- Noam hopes John screamed
- Mary hopes Ethel hit John
- Noam kicked Mary
27. Evaluations
28. Dative Alternation
- Children learn the distinction between alternating and non-alternating verbs
- Previously unseen verbs are used productively in both constructions
- New verbs follow the regular pattern
- During learning children use non-alternating verbs in both constructions
- U-shaped learning
29. Training Data
- Three alternating verbs: gave, passed, lent
- One non-alternating verb: donated
- One verb seen only once: sent
- The museum lent Sam a painting
- John gave a painting to Sam
- Sam donated John to the museum
- The museum sent a painting to Sam
30. Dative Evaluations
31. Grammar Properties
- The learned grammar distinguishes alternating and non-alternating verbs
- sent appears in the alternating class
- With less data there is only one class of verbs, so donated can appear in both constructions
- All sentences generated by the grammar are grammatical
- But the structures are not right
32. Learned Structures
[Tree diagram: the learned parse of "John gave a painting to Sam", built from the categories S, X, Y, Z, NP, VA, DET, N and P]
33. Regular and Irregular Rules
- Why does the model place a newly seen verb in the regular class?
- Y → VA NP
- Y → VA Z
- Y → VP Z
- VA → passed
- VA → gave
- VA → lent
- VP → donated
- VA / VP → sent

                           sent doesn't alternate   sent alternates
Overall evaluation (bits)  1703.6                   1703.4
Grammar (bits)             322.2                    321.0
Data (bits)                1381.4                   1382.3

Regular constructions are preferred because the grammar is coded statistically
34. Why Use Statistical Grammars?
- Statistics are a valuable source of information
- They help to infer when absences are due to chance
- The learned grammar predicted that sent should appear in the double object construction
- but in 150 sentences it was only seen in the prepositional dative construction
- With a non-statistical grammar we need an explanation as to why this is
- A statistical grammar knows that sent is rare, which explains the absence of double object occurrences
35. Scaling Up: Onnis, Roberts and Chater (2003)
- Causative alternation
- John cut the string
- The string cut
- John arrived the train
- The train arrived
- John bounced the ball
- The ball bounced
36. Onnis et al.'s Data
- Two word classes: N and V
- NV and VN are the only allowable sentences
- 16 verbs alternate: NV and VN
- 10 verbs: NV only
- 10 verbs: VN only
- The coding scheme marks non-alternating verbs as exceptional (a cost in the coding scheme)
37. Onnis et al.'s Results
- < 16,000 sentences: all verbs alternate
- > 16,000 sentences: non-alternating verbs classified as exceptional
- No search mechanism: they just looked at the evaluations with and without exceptions
- In an expression-induction model quasi-regularities appear as a result of chance omissions
38. MDL and MML Issues
- Numeric parameters: accuracy
- Bayes optimal classification (not MAP learning): Monte Carlo methods
- If we see a sentence, work out its probability under each grammar
- A weighted sum gives the probability of the sentence
- Unseen data: zero probability?
39. One- and Two-Part Codes
[Diagram (two-part code): a binary string is fed to a decoder, which recovers the grammar and then the data coded in terms of that grammar]
[Diagram (one-part code): a binary string is fed to a decoder, which recovers the data and grammar combined]
40. Coding English Texts
- The grammar is a frequency for each letter and for space
- Counts start at one
- We decode a series of letters and update the counts for each letter
- All letters are coded in terms of their probabilities at that point in the decoding
- At the end we have a decoded text and grammar
41. Decoding Example
Letter   Count   Count   Count   Count
A        1       2       2       2
B        1       1       2       3
C        1       1       1       1
Space    1       1       1       1
Decoded string: A (P = 1/4), B (P = 1/5), B (P = 2/6)
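A small sketch (mine) of the adaptive scheme in this example: counts start at one and are updated after every symbol, so the probabilities used for "A B B" match the table: 1/4, 1/5, 2/6.

import math

def adaptive_code_lengths(text, alphabet=("A", "B", "C", " ")):
    # Code each symbol with its current count/total, then update the counts
    counts = {symbol: 1 for symbol in alphabet}   # counts start at one
    bits = []
    for symbol in text:
        total = sum(counts.values())
        bits.append(-math.log2(counts[symbol] / total))  # cost at this point
        counts[symbol] += 1                              # update after coding
    return bits

print(adaptive_code_lengths("ABB"))   # probabilities used: 1/4, 1/5, 2/6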
42. One-Part Grammars
- Grammars can also be coded using one-part codes
- Start with no grammar, but have a probability associated with adding a new rule
- Each time we decode data we first choose either to add a new rule or to use an existing one
- Examples are Dowman (2000) and Venkataraman (1997)
43. Conclusions
- MDL can solve the poverty of the stimulus problem
- But it doesn't solve the problem of constraining the search for grammars
- Coding schemes create learning biases
- Statistical grammars and statistical coding of grammars can help learning