Title: Capturing linguistic interaction in a grammar
1Capturing linguistic interaction in a grammar
- A method for empirically evaluatingthe grammar
of a parsed corpus
Sean Wallis Survey of English Usage University
College London s.wallis_at_ucl.ac.uk
2Capturing linguistic interaction...
- Parsed corpus linguistics
- Empirical evaluation of grammar
- Experiments
- Attributive AJPs
- Preverbal AVPs
- Embedded postmodifying clauses
- Conclusions
- Comparing grammars or corpora
- Potential applications
3Parsed corpus linguistics
- Several million-word parsed corpora exist
- Each sentence analysed in the form of a tree
- different languages have been analysed
- limited amount of spontaneous speech data
- Commitment to a particular grammar required
- different schemes have been applied
- problems computational completeness manual
consistency - Tools support linguistic research in corpora
4Parsed corpus linguistics
- An example tree from ICE-GB (spoken)
S1A-006 23
5Parsed corpus linguistics
- Three kinds of evidence may be obtained from a
parsed corpus - Frequency evidence of a particular known rule,
structure or linguistic event - Coverage evidence of new rules, etc.
- Interaction evidence of the relationship between
rules, structures and events - This evidence is necessarily framed within a
particular grammatical scheme - So how might we evaluate this grammar?
6Empirical evaluation of grammar
- Many theories, frameworks and grammars
- no agreed evaluation method exists
- linguistics is divided into competing camps
- status of parsed corpora suspect
- Possible method retrievability of events
- circularity you get out what you put in
- redundancy improvement by mere addition
- atomic based on single events, not pattern
- specificity based on particular phenomena
- New method retrievability of event sequences
7Experiment 1 attributive AJPs
- Adjectives before a noun in English
- Simple idea plot the frequency of NPs with at
least n 0, 1, 2, 3 attributive AJPs
8Experiment 1 attributive AJPs
- Adjectives before a noun in English
- Simple idea plot the frequency of NPs with at
least n 0, 1, 2, 3 attributive AJPs
Raw frequency
Log frequency
NB not a straight line
9Experiment 1 analysis of results
- If the log-frequency line is straight
- exponential fall in frequency (constant
probability) - no interaction between decisions (cf. coin
tossing) - Sequential probability analysis
- calculate probability of adding each AJP
- error bars (binomial)
- probability falls
- second lt first
- third lt second
- fourth lt second
- decisions interact
10Experiment 1 analysis of results
- If the log-frequency line is straight
- exponential fall in frequency (constant
probability) - no interaction between decisions (cf. coin
tossing) - Sequential probability analysis
- calculate probability of adding each AJP
- error bars (binomial)
- probability falls
- second lt first
- third lt second
- fourth lt second
- decisions interact
probability
11Experiment 1 analysis of results
- If the log-frequency line is straight
- exponential fall in frequency (constant
probability) - no interaction between decisions (cf. coin
tossing) - Sequential probability analysis
- calculate probability of adding each AJP
- error bars (binomial)
- probability falls
- decisions interact
- fit to a power law
- y m.x k
- find m and x
probability
y 0.1931x -1.2793
12Experiment 1 explanations?
- Feedback loop for each successive AJP, it is
more difficult to add a further AJP - Explanation 1 semantic constraints
- tend to say tall green ship
- do not tend to say tall short ship or green tall
ship - Explanation 2 communicative economy
- once speaker said tall green ship, tends to only
say ship - Further investigation required
- General principle
- significant change (usually, fall) in probability
is evidence of an interaction along grammatical
axis
13Experiments 2,3 variations
- ? Restrict head common and proper nouns
- Common nouns similar results
- Proper nouns and adjectives are often treated as
compounds (Northern England vs. lower Loire ) - ? Ignore grammar adjective noun strings
- Some misclassifications / miscounting (noise)
- she was beautiful, people said tall very
green ship - Similar results
- slightly weaker (third lt second ns at p0.01)
- Insufficient evidence for grammar
- null hypothesis simple lexical adjacency
14Experiment 4 preverbal AVPs
- Consider adverb phrases before a verb
- Results very different
- Probability does not fall significantly between
first and second AVP - Probability does fall between third and second
AVP - Possible constraints
- (weak) communicative
- not (strong) semantic
- Further investigationneeded
15Experiment 4 preverbal AVPs
- Consider adverb phrases before a verb
- Results very different
- Probability does not fall significantly between
first and second AVP - Probability does fall between third and second
AVP - Possible constraints
- (weak) communicative
- not (strong) semantic
- Further investigationneeded
- Not power law R2 lt 0.24
probability
16Experiment 5 embedded clauses
- Another way to specify nouns in English
- add clause after noun to explicate it
- the ship that was tall and green
- the ship in the port
- may be embedded
- the ship in the port with the ancient
lighthouse - or successively postmodified
- the ship in the portwith a very old mast
- Compare successive embedding and sequential
postmodifying clauses - Axis embedding depth / sequence length
17Experiment 5 method
- Extract examples with FTFs
- at least n levels of embedded postmodification
18Experiment 5 method
- Extract examples with FTFs
- at least n levels of embedded postmodification
- 0
- 1
- 2
(etc.)
19Experiment 5 method
- Extract examples with FTFs
- at least n levels of embedded postmodification
- 0
- 1
- 2
- problems
- multiple matching cases (use ICECUP IV to
classify) - overlapping cases (subtract extra case)
- co-ordination of clauses or NPs (use alternative
patterns)
(etc.)
20Experiment 5 analysis of results
- Probability of adding a further embedded clause
falls with each level - second lt first
- sequential lt embedding
- Embedding only
- third lt first
- insufficient data forthird lt second
- Conclusion
- Interaction along embedding and sequential axes
21Experiment 5 analysis of results
- Probability of adding a further embedded clause
falls with each level - second lt first
- sequential lt embedding
- Embedding only
- third lt first
- insufficient data forthird lt second
- Conclusion
- Interaction along embedding and sequential axes
embedded
sequential
probability
22Experiment 5 analysis of results
- Probability of adding a further embedded clause
falls with each level - second lt first
- sequential lt embedding
- Fitting to f m.x k
- k lt 0 fall ( f m/x k)
- k is high steep
- Conclusion
- Both match power law R2 gt 0.99
embedded
y 0.0539x -1.2206
sequential
y 0.0523x -1.6516
23Experiment 5 explanations?
- Lexical adjacency?
- No 87 of 2-level cases have at least one VP, NP
or clause between upper and lower heads - Misclassified cases of embedding?
- No very few (5) semantically ambiguous cases
- Language production constraints?
- Possibly, could also be communicative economy
- contrast spontaneous speech with other modes
- Positive proof of recursive tree grammar
- Established from parsed corpus
- cf. negative proof (NLP parsing problems)
24Conclusions
- A new method for evaluating interactions along
grammatical axes - General purpose, robust, structural
- More abstract than linguistic choice
experiments - Depends on a concept of grammatical distance
along an axis, based on the chosen grammar - Method has philosophical implications
- Grammar viewed as structure of linguistic choices
- Linguistics as an evaluable observational science
- Signature (trace) of language production
decisions - A unification of theoretical and corpus
linguistics?
25Comparing grammars or corpora
- Can we reliably retrieve known interaction
patterns with different grammars? - Do these patterns differ across corpora?
- Benefits over individual event retrieval
- non-circular generalisation across local syntax
- not subject to redundancy arbitrary terms makes
trends more difficult to retrieve - not atomic based on patterns of interaction
- general patterns may have multiple explanations
- Supplements retrieval of events
26Potential applications
- Corpus linguistics
- Optimising existing grammar
- e.g. co-ordination, compound nouns
- Theoretical linguistics
- Comparing different grammars, same language
- Comparing different languages or periods
- Psycholinguistics
- Search for evidence of language production
constraints in spontaneous speech corpora - speech and language therapy
- language acquisition and development
27Links and further reading
- Survey of English Usage
- www.ucl.ac.uk/english-usage
- Corpora and grammar
- .../projects/ice-gb
- Full paper
- .../staff/sean/resources/analysing-grammatical-int
eraction.pdf - Sequential analysis spreadsheet (Excel)
- .../staff/sean/resources/interaction-trends.xls