1
Minimum Description Length: An Adequate Syntactic Theory?
  • Mike Dowman
  • 3 June 2005

2
Linguistic Theory
Chomsky's Conceptualization of Language Acquisition
3
Diachronic Theories
Language Acquisition Device
Primary Linguistic Data
Arena of Language Use
Hurford's Diachronic Spiral
4
Learnability
  • Poverty of the stimulus
  • Language is really complex
  • → Obscure and abstract rules constrain
    wh-movement, pronoun binding, passive formation,
    etc.
  • → Examples of E-language don't give sufficient
    information to determine this

5
WH-movement
  • Who_i do you think Lord Emsworth will invite t_i?
  • Who_i do you think that Lord Emsworth will invite
    t_i?
  • Who_i do you think t_i will arrive first?
  • *Who_i do you think that t_i will arrive first?

6
Negative Evidence
  • Some constructions seem impossible to learn
    without negative evidence
  • John gave a painting to the museum
  • John gave the museum a painting
  • John donated a painting to the museum
  • *John donated the museum a painting

7
Implicit Negative Evidence
  • If constructions don't appear, can we just assume
    they're not grammatical?
  • → No: we only see a tiny proportion of possible
    grammatical sentences
  • People generalize from examples they have seen
    to form new utterances
  • "Under exactly what circumstances does a child
    conclude that a nonwitnessed sentence is
    ungrammatical?" (Pinker, 1989)

8
Learnability Proofs
  • Gold (1967): for languages to be learnable in the
    limit we must have
  • Negative evidence
  • or a priori restrictions on possible languages
  • But "learnable in the limit" means being sure that
    we have determined the correct language

9
Statistical Learnability
  • Horning (1969)
  • If grammars are statistical,
  • so that utterances are produced with frequencies
    corresponding to the grammar,
  • then languages are learnable
  • But we can never be sure when the correct grammar
    has been found
  • This just gets more likely as we see more data

10
Horning's Proof
  • Used Bayes' rule
  • More complex grammars are less probable a priori:
    P(h)
  • Statistical grammars can assign probabilities to
    data: P(d | h)
  • Search through all possible grammars, starting
    with the simplest

11
MDL
  • Horning's evaluation method for grammars can be
    seen as a form of Minimum Description Length
  • Simplest is best (Occam's Razor)
  • 'Simplest' means specifiable with the least amount
    of information
  • Information theory (Shannon, 1948) allows us to
    link probability and information
  • Amount of information = -log(probability), as the
    sketch below illustrates
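A minimal sketch in Python (with made-up probability values) of how Shannon's relation turns Horning's Bayesian comparison of P(h) and P(d | h) into a description-length comparison:

```python
import math

def code_length_bits(probability):
    # Shannon: an event with probability p needs -log2(p) bits to encode.
    return -math.log2(probability)

# Hypothetical numbers: the prior probability of a grammar, P(h),
# and the probability that grammar assigns to the corpus, P(d | h).
p_grammar = 2 ** -300
p_data_given_grammar = 2 ** -1000

# Maximising P(h) * P(d | h) is equivalent to minimising the total
# description length: grammar bits plus data-given-grammar bits.
total_bits = code_length_bits(p_grammar) + code_length_bits(p_data_given_grammar)
print(total_bits)  # ~1300 bits for this made-up grammar/corpus pair
```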

12
Encoding Grammars and Data
[Diagram: a single bit string, e.g. 1010100111010100101101010001100111100011010110, is fed into a decoder, which recovers the grammar and then the data coded in terms of that grammar.]
Grammar: A → B C; B → D E; E → kangaroo, aeroplane, comedian; D → the, a, some; C → died, laughed, burped
Data: The comedian died. A kangaroo burped. The aeroplane laughed. Some comedian burped.
13
Complexity and Probability
  • More complex grammar
  • Longer coding length, so lower probability
  • More restrictive grammar
  • Fewer choices for the data, so each possibility has
    a higher probability

14
  • Most restrictive grammar just lists all possible
    utterances
  • Only the observed data is grammatical, so it has
    a high probability
  • A simple grammar could be made that allowed any
    sentence at all
  • The grammar would have a high probability
  • But the data a very low one
  • MDL finds a middle ground between always
    generalizing and never generalizing (see the sketch
    below)
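A rough back-of-the-envelope sketch of this trade-off. The coding scheme, vocabulary and corpus size are all my own assumptions (not the model's exact costs); the point is only that the middle-ground grammar gives the shortest total description:

```python
import math

# Toy corpus: N two-word sentences, each a noun followed by a verb,
# over 3 nouns and 3 verbs; K distinct sentences were observed.
N, K = 100, 9
nouns, verbs = 3, 3
word_bits = math.log2(nouns + verbs)  # bits to name one of the 6 words

# 1. Over-general grammar ("a sentence is any two words"):
#    tiny grammar, but every data word allows 6 choices.
general = 2 * word_bits + N * 2 * word_bits

# 2. Listing grammar (never generalise: spell out each observed sentence):
#    long grammar, but each data sentence is just an index into the list.
listing = K * 2 * word_bits + N * math.log2(K)

# 3. Middle-ground grammar ("a sentence is a noun then a verb").
middle = 2 * 2 * word_bits + N * (math.log2(nouns) + math.log2(verbs))

print(round(general), round(listing), round(middle))  # ~522, ~364, ~327 bits
```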

15
Rampant Synonymy?
  • Inductive inference (Solomonoff, 1960a)
  • Kolmogorov complexity (Kolmogorov, 1965)
  • Minimum Message Length (Wallace and Boulton,
    1968)
  • Algorithmic Information Theory (Chaitin, 1969)
  • Minimum Description Length (Rissanen, 1978)
  • Minimum Coding Length (Ellison, 1992)
  • Bayesian Learning (Stolcke, 1994)
  • Minimum Representation Length (Brent, 1996)

16
Evaluation and Search
  • MDL principle gives us an evaluation criterion
    for grammars (with respect to corpora)
  • But it doesn't solve the problem of how to find
    the grammars in the first place
  • → A search mechanism is needed

17
Two Learnability Problems
  • How to determine which of two or more grammars is
    best given some data
  • How to guide the search for grammars so that we
    can find the correct one, without considering
    every logically possible grammar

18
MDL in Linguistics
  • Solomonoff (1960b): Mechanization of Linguistic
    Learning
  • Learning phrase structure grammars for simple
    toy languages: Stolcke (1994), Langley and
    Stromsten (2000)
  • Or for real corpora: Chen (1995), Grünwald (1994)
  • Or for language modelling in speech recognition
    systems: Starkie (2001)

19
Not Just Syntax!
  • Phonology: Ellison (1992), Rissanen and Ristad
    (1994)
  • Morphology: Brent (1993), Goldsmith (2001)
  • Segmenting continuous speech: de Marcken (1996),
    Brent and Cartwright (1997)

20
MDL and Parameter Setting
  • Briscoe (1999) and Rissanen and Ristad (1994)
    used MDL as part of parameter setting learning
    mechanisms

MDL and Iterated Learning
  • Briscoe (1999) used MDL as part of an
    expression-induction model
  • Brighton (2002) investigated the effect of
    bottlenecks on an MDL learner
  • Roberts et al. (2005) modelled lexical exceptions
    to syntactic rules

21
An Example: My Model
  • Learns simple phrase structure grammars
  • Binary or non-branching rules
  • A → B C
  • D → E
  • F → tomato
  • All derivations start from the special symbol S
  • A null symbol in third position indicates a
    non-branching rule

22
Encoding Grammars
  • Grammars can be coded as lists of symbols, three
    per rule
  • The first symbol is the rule's left-hand side, the
    second and third its right-hand side
  • A, B, C, D, E, null, F, tomato, null
  • First we have to encode the frequency of each
    symbol (see the sketch below)
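A small sketch (my own illustration, using the example rules above) of flattening a grammar into that three-symbols-per-rule list and counting symbol frequencies:

```python
import math
from collections import Counter

# The example rules: A -> B C (binary), D -> E and F -> tomato (non-branching).
rules = [("A", "B", "C"), ("D", "E", None), ("F", "tomato", None)]

symbols = []
for lhs, first, second in rules:
    # A 'null' in third position marks a non-branching rule.
    symbols += [lhs, first, second if second is not None else "null"]
print(symbols)  # ['A', 'B', 'C', 'D', 'E', 'null', 'F', 'tomato', 'null']

# Transmitting each symbol's frequency first lets the decoder give
# symbol s a code of about -log2(count(s) / total) bits.
counts = Counter(symbols)
grammar_bits = sum(-math.log2(counts[s] / len(symbols)) for s in symbols)
print(counts, round(grammar_bits, 1))
```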

23
Encoding Data
  • 1: S → NP VP (3)
  • 2: NP → john (2)
  • 3: NP → mary (1)
  • 4: VP → screamed (2)
  • 5: VP → died (1)
  • Data: 1, 2, 4, 1, 2, 5, 1, 3, 4
  • Probabilities: 1 → 3/3, 2 → 2/3, 4 → 2/3,
    1 → 3/3, 2 → 2/3
  • We must record the frequency of each rule (see the
    sketch below)

Total frequency for S: 3
Total frequency for NP: 3
Total frequency for VP: 3
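A sketch of the data coding just described, assuming each rule's probability is its frequency divided by the total frequency of its left-hand-side symbol, as in the figures above:

```python
import math

# rule number -> (left-hand-side symbol, frequency)
rules = {1: ("S", 3), 2: ("NP", 2), 3: ("NP", 1), 4: ("VP", 2), 5: ("VP", 1)}
lhs_totals = {"S": 3, "NP": 3, "VP": 3}

# The corpus as a sequence of rule choices (three sentences).
data = [1, 2, 4, 1, 2, 5, 1, 3, 4]

bits = 0.0
for r in data:
    lhs, freq = rules[r]
    p = freq / lhs_totals[lhs]   # e.g. rule 1 -> 3/3, rule 2 -> 2/3, ...
    bits += -math.log2(p)
print(round(bits, 2))            # data coding length in bits (about 5.51)
```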
24
Encoding in My Model
[Diagram: a single bit string is fed into a decoder, which recovers in turn the grammar, the symbol frequencies, the rule frequencies, and the data.]
Symbol frequencies: S (1), NP (3), VP (3), john (1), mary (1), screamed (1), died (1), null (4)
Grammar: 1: S → NP VP; 2: NP → john; 3: NP → mary; 4: VP → screamed; 5: VP → died
Rule frequencies: rule 1 → 3, rule 2 → 2, rule 3 → 1, rule 4 → 2, rule 5 → 1
Data: John screamed. John died. Mary screamed.
25
Search Strategy
  • Start with a simple grammar that allows all
    sentences
  • Make a simple change and see if it improves the
    evaluation (add a rule, delete a rule, change a
    symbol in a rule, etc.)
  • Annealing search
  • First stage: look only at the data coding length
  • Second stage: look at the overall evaluation (a
    rough skeleton follows below)
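A very rough skeleton of this kind of search, written under my own assumptions; the real model's move set, acceptance rule and annealing schedule are not specified here. The evaluation functions would implement the coding lengths from the earlier slides:

```python
def mdl_search(grammar, data, mutate, data_bits, total_bits, steps=10000):
    """Two-stage local search: stage 1 minimises only the data coding
    length, stage 2 the full grammar-plus-data evaluation."""
    for objective in (data_bits, total_bits):   # the two annealing stages
        score = objective(grammar, data)
        for _ in range(steps):
            candidate = mutate(grammar)         # add/delete a rule, change a symbol, ...
            candidate_score = objective(candidate, data)
            if candidate_score <= score:        # shorter description wins
                grammar, score = candidate, candidate_score
    return grammar
```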

26
Example: English
Learned grammar: S → NP VP; VP → ran; VP → screamed; VP → Vt NP; VP → Vs S; Vt → hit; Vt → kicked; Vs → thinks; Vs → hopes; NP → John; NP → Ethel; NP → Mary; NP → Noam
  • John hit Mary
  • Mary hit Ethel
  • Ethel ran
  • John ran
  • Mary ran
  • Ethel hit John
  • Noam hit John
  • Ethel screamed
  • Mary kicked Ethel
  • John hopes Ethel thinks Mary hit Ethel
  • Ethel thinks John ran
  • John thinks Ethel ran
  • Mary ran
  • Ethel hit Mary
  • Mary thinks John hit Ethel
  • John screamed
  • Noam hopes John screamed
  • Mary hopes Ethel hit John
  • Noam kicked Mary

27
Evaluations
28
Dative Alternation
  • Children learn the distinction between alternating
    and non-alternating verbs
  • Previously unseen verbs are used productively in
    both constructions
  • → New verbs follow the regular pattern
  • During learning children use non-alternating
    verbs in both constructions
  • → U-shaped learning

29
Training Data
  • Three alternating verbs: gave, passed, lent
  • One non-alternating verb: donated
  • One verb seen only once: sent
  • The museum lent Sam a painting
  • John gave a painting to Sam
  • Sam donated John to the museum
  • The museum sent a painting to Sam

30
Dative Evaluations
31
Grammar Properties
  • Learned grammar distinguishes alternating and
    non-alternating verbs
  • sent appears in the alternating class
  • With less data there is only one class of verbs, so
    donated can appear in both constructions
  • All sentences generated by the grammar are
    grammatical
  • But structures are not right

32
Learned Structures
[Parse tree for "John gave a painting to Sam": the learned grammar assigns non-standard constituents, using the categories S, X, Y, Z, NP, VA, DET, N and P.]
33
Regular and Irregular Rules
  • Why does the model place a newly seen verb in the
    regular class?
  • Y → VA NP
  • Y → VA Z
  • Y → VP Z
  • VA → passed
  • VA → gave
  • VA → lent
  • VP → donated
  • VA / VP → sent

                            sent doesn't alternate   sent alternates
Overall evaluation (bits)   1703.6                   1703.4
Grammar (bits)              322.2                    321.0
Data (bits)                 1381.4                   1382.3
Regular constructions are preferred because the
grammar is coded statistically
34
Why use Statistical Grammars?
  • Statistics are a valuable source of information
  • They help to infer when absences are due to
    chance
  • The learned grammar predicted that sent should
    appear in the double object construction
  • but in 150 sentences it was only seen in the
    prepositional dative construction
  • With a non-statistical grammar we need an
    explanation as to why this is
  • A statistical grammar knows that sent is rare,
    which explains the absence of double-object
    occurrences (see the rough calculation below)
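A rough illustration, with my own placeholder numbers rather than the model's, of why the absence of a rare verb from one construction carries little evidence:

```python
# Suppose 'sent' occurred only once in the 150-sentence corpus, and that an
# alternating verb would take the double-object frame half the time
# (0.5 and the count for 'donated' are assumed figures, not from the model).
k_sent, k_donated = 1, 30
p_double_object = 0.5

# Probability of never seeing the double-object frame purely by chance:
print((1 - p_double_object) ** k_sent)     # 0.5    -> easily a chance gap
print((1 - p_double_object) ** k_donated)  # ~9e-10 -> absence is real evidence
```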

35
Scaling Up: Onnis, Roberts and Chater (2003)
  • Causative alternation
  • John cut the string
  • The string cut
  • *John arrived the train
  • The train arrived
  • John bounced the ball
  • The ball bounced

36
Onnis et al.'s Data
  • Two word classes: N and V
  • NV and VN are the only allowable sentences
  • 16 verbs alternate (NV and VN)
  • 10 verbs: NV only
  • 10 verbs: VN only
  • The coding scheme marks non-alternating verbs as
    exceptional (at a cost in coding length)

37
Onnis et al.'s Results
  • < 16,000 sentences → all verbs alternate
  • > 16,000 sentences → non-alternating verbs
    classified as exceptional
  • No search mechanism: they just looked at the
    evaluations with and without exceptions
  • In an expression-induction model, quasi-regularities
    appear as a result of chance omissions

38
MDL and MML Issues
  • Numeric parameters: to what accuracy should they
    be coded?
  • Bayes-optimal classification (not MAP learning):
    Monte Carlo methods
  • → If we see a sentence, work out its probability
    under each grammar
  • → The weighted sum gives the probability of the
    sentence (see the sketch below)
  • Unseen data: zero probability?
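A minimal sketch of the Bayes-optimal prediction described above: weight each grammar's probability for the sentence by that grammar's posterior. The grammars and numbers are placeholders:

```python
def sentence_probability(sentence, grammars):
    # grammars: list of (posterior probability, likelihood function) pairs.
    return sum(posterior * likelihood(sentence) for posterior, likelihood in grammars)

# Two hypothetical grammars: one assigns the sentence a small probability,
# the other rules it out entirely.
grammars = [(0.7, lambda s: 0.01), (0.3, lambda s: 0.0)]
print(round(sentence_probability("who do you think will arrive first", grammars), 3))
# 0.007 -- unseen data need not get zero probability under model averaging
```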

39
One and Two Part Codes
[Diagram, two-part code: a bit string is fed into a decoder, which recovers first the grammar and then the data coded in terms of that grammar.]
[Diagram, one-part code: a bit string is fed into a decoder, which recovers the data and the grammar combined.]
40
Coding English Texts
  • The grammar is a frequency for each letter and for
    space
  • Counts start at one
  • We decode a series of letters and update the count
    for each letter
  • All letters are coded in terms of their probabilities
    at that point in the decoding
  • At the end we have both a decoded text and a
    grammar (see the sketch below)
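A sketch of this adaptive coding scheme; the alphabet and text are just the example worked through on the next slide:

```python
import math

def adaptive_code_length(text, alphabet):
    # Every symbol starts with a count of one; counts are updated after each
    # symbol is coded, so no separate frequency table has to be transmitted.
    counts = {symbol: 1 for symbol in alphabet}
    bits = 0.0
    for symbol in text:
        p = counts[symbol] / sum(counts.values())  # probability at this point
        bits += -math.log2(p)
        counts[symbol] += 1
    return bits

# The decoding example on the next slide: alphabet {A, B, C, space},
# string "ABB" coded with probabilities 1/4, 1/5 and 2/6.
print(round(adaptive_code_length("ABB", ["A", "B", "C", " "]), 2))  # 5.91 bits
```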

41
Decoding Example
Letter   Initial count   After 'A'   After 'B'   After 'B'
A        1               2           2           2
B        1               1           2           3
C        1               1           1           1
Space    1               1           1           1

Decoded string: A (P = 1/4), B (P = 1/5), B (P = 2/6)
42
One-Part Grammars
  • Grammars can also be coded using one-part codes
  • Start with no grammar, but have a probability
    associated with adding a new rule
  • Each time we decode a data item we first choose
    either to add a new rule or to use an existing one
  • Examples are Dowman (2000) and Venkataraman (1997)

43
Conclusions
  • MDL can solve the poverty-of-the-stimulus problem
  • But it doesn't solve the problem of constraining
    the search for grammars
  • Coding schemes create learning biases
  • Statistical grammars and statistical coding of
    grammars can help learning