Title: LING 406 Intro to Computational Linguistics: Collocations
1. LING 406 Intro to Computational Linguistics: Collocations
- Richard Sproat
- URL: http://catarina.ai.uiuc.edu/L406_08
2. This Lecture
- What are collocations?
- Measures of association
- Pointwise Mutual Information
- Frequency-Weighted Mutual Information
- Pearson's χ²
- Dunning's likelihood ratios
- Non-binary collocations
3. Some characteristics of collocations
- Firth (1957): collocations of a given word are statements of the habitual or customary places of that word
- In plain English: collocations are expressions constructed out of two or more words that have some special property
- Non-compositionality: kick the bucket, white wine
- Non-substitutability: kick the pail, yellow wine
- Non-modifiability: kick the big bucket, very white wine
4. Some kinds of collocations
- Idioms: kick the bucket, red herring
- Nominal compounds: dog catcher, brown bread, sump pump, white wine
- Verb-particle constructions: give up, bowl over, chew out
5. Why care?
- Lexicography
- Machine translation
- Word segmentation
- Sense disambiguation
6. Simple frequency: NY Times Newswire 1990 (4 months)
7. Simple frequency: Justeson-Katz filtration
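As a rough illustration of the idea (not code from the lecture), a Justeson-Katz-style filter keeps only candidate bigrams whose part-of-speech pattern looks noun-phrase-like, e.g. adjective-noun or noun-noun. The pattern set and simplified tags below are assumptions made for the sketch.

    from collections import Counter

    # Allowed POS patterns for bigram candidates (A = adjective, N = noun).
    ALLOWED_PATTERNS = {("A", "N"), ("N", "N")}

    def jk_filter(tagged_tokens):
        """tagged_tokens: list of (word, simplified_tag) pairs."""
        counts = Counter()
        for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
            if (t1, t2) in ALLOWED_PATTERNS:
                counts[(w1, w2)] += 1
        return counts.most_common()

    # "white wine" (A N) passes the filter; "of the" (P D) does not.
    tagged = [("white", "A"), ("wine", "N"), ("of", "P"), ("the", "D")]
    print(jk_filter(tagged))   # [(('white', 'wine'), 1)]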
8. Statistical approaches to binary collocations
- Frequency in and of itself doesn't tell you that words are particularly associated with each other: if both words are frequent, you might expect their combination to be frequent just by chance.
- Statistical measures of association can give an estimate of how much more likely than chance a given combination is (see the sketch below).
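A minimal sketch (my own illustration, not from the slides) of the bookkeeping all of the measures below share: collect unigram and bigram counts from a tokenized corpus and arrange them into the 2x2 contingency table for a candidate pair.

    from collections import Counter

    def bigram_stats(tokens):
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        n = len(tokens) - 1          # number of bigram positions
        return unigrams, bigrams, n

    def contingency(w1, w2, unigrams, bigrams, n):
        """2x2 table for the bigram (w1, w2); boundary effects are ignored."""
        o11 = bigrams[(w1, w2)]          # w1 followed by w2
        o12 = unigrams[w1] - o11         # w1 followed by something else
        o21 = unigrams[w2] - o11         # something else followed by w2
        o22 = n - o11 - o12 - o21        # neither w1 first nor w2 second
        return o11, o12, o21, o22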
9. (Pointwise) Mutual Information
- Mutual Information was originally proposed as an information-theoretic measure of channel capacity (Fano 1961).
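Applied to word pairs, pointwise mutual information compares the observed joint probability with what independence would predict, PMI(x, y) = log2 [P(x, y) / (P(x) P(y))]. A small sketch using maximum-likelihood estimates from counts (illustrative, not the lecture's code):

    import math

    def pmi(c_xy, c_x, c_y, n):
        """PMI in bits from MLE estimates: P(x,y)=c_xy/n, P(x)=c_x/n, P(y)=c_y/n."""
        return math.log2((c_xy / n) / ((c_x / n) * (c_y / n)))

    # A pair seen 20 times, each word seen 1000 times, in a 10-million-token corpus:
    print(round(pmi(20, 1000, 1000, 10_000_000), 2))   # 7.64 bits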
10. 1995 AP Newswire: Collocations
11. 1995 AP Newswire: Non-Collocations
12. Problems with mutual information
- It is unreliable for small counts. (But this is really a problem with the MLE?)
- The second, and more serious, problem is that mutual information relates to estimated probability in a counterintuitive way: for a pair of words that always occur together, PMI equals log 1/P(x), so the score actually rises as the pair gets rarer, and low-frequency pairs dominate the top of the list.
13. Frequency-Weighted MI
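The slide's exact definition isn't preserved in this text; one common formulation, assumed here, simply weights PMI by the joint count so that well-attested pairs are favoured over rare ones:

    import math

    def freq_weighted_mi(c_xy, c_x, c_y, n):
        """Assumed variant: score = C(x, y) * PMI(x, y)."""
        pmi = math.log2((c_xy / n) / ((c_x / n) * (c_y / n)))
        return c_xy * pmi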
14. 1995 AP Newswire: collocations
15. Problems with Frequency-Weighted Mutual Information
- The main problem is that it tends to over-reward frequency
16. Pearson's χ²
17. Pearson's χ²
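For a 2x2 contingency table (O11 = the pair itself, O12 and O21 = each word with some other partner, O22 = neither), Pearson's χ² has the closed form N (O11 O22 - O12 O21)² / ((O11+O12)(O11+O21)(O12+O22)(O21+O22)). A small sketch, assuming the counts laid out as above:

    def chi_square_2x2(o11, o12, o21, o22):
        """Pearson's chi-square for a 2x2 contingency table (closed form)."""
        n = o11 + o12 + o21 + o22
        numerator = n * (o11 * o22 - o12 * o21) ** 2
        denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        return numerator / denominator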
18. 1995 AP Newswire: collocations
19. Problems with χ²
20. Dunning's (1993) likelihood ratios
- The underlying likelihood is binomial: b(k; n, x) = [n! / ((n-k)! k!)] x^k (1-x)^(n-k)
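A sketch of the standard formulation (following Dunning 1993 as it is usually presented, not copied from the slides): compare the likelihood of the data under "w2 is independent of w1" against "w2 depends on w1", using binomial likelihoods. The binomial coefficient above cancels in the ratio, so only the x^k (1-x)^(n-k) terms remain; -2 log λ is then approximately χ²-distributed.

    import math

    def _log_l(k, n, x):
        """log of x^k * (1-x)^(n-k), with a guard against log(0)."""
        eps = 1e-12
        x = min(max(x, eps), 1.0 - eps)
        return k * math.log(x) + (n - k) * math.log(1.0 - x)

    def dunning_llr(c12, c1, c2, n):
        """-2 log lambda for a bigram: c12 = C(w1 w2), c1 = C(w1), c2 = C(w2),
        n = corpus size. Larger values indicate stronger association."""
        p  = c2 / n                    # P(w2) under independence
        p1 = c12 / c1                  # P(w2 | w1)
        p2 = (c2 - c12) / (n - c1)     # P(w2 | not w1)
        log_lambda = (_log_l(c12, c1, p) + _log_l(c2 - c12, n - c1, p)
                      - _log_l(c12, c1, p1) - _log_l(c2 - c12, n - c1, p2))
        return -2.0 * log_lambda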
21. 1995 AP Newswire: collocations
22. Problems with likelihood ratios
23. Some Chinese examples: MI
24. Weighted mutual information
25. χ²
26. Likelihood ratios
27. Errors on top 500 by each measure (10-million-character ROCLING Corpus)
28. Extracting non-binary collocations
29. Smadja's (1993) Xtract
30. Smadja's (1993) Xtract
31. Smadja's (1993) Xtract
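The slide contents aren't preserved here, but the spirit of Xtract's first stage can be sketched as follows (the thresholds and statistics below are simplified assumptions, not Smadja's exact definitions): for a given head word, count each neighbour at every offset within a window of five words on either side, then keep neighbours that are both unusually frequent overall and concentrated at particular offsets rather than spread evenly across the window.

    from collections import defaultdict
    import statistics

    WINDOW = 5

    def positional_counts(tokens, head):
        """counts[w][d]: how often w appears at offset d (-5..+5, d != 0) from head."""
        counts = defaultdict(lambda: defaultdict(int))
        for i, tok in enumerate(tokens):
            if tok != head:
                continue
            for d in range(-WINDOW, WINDOW + 1):
                if d != 0 and 0 <= i + d < len(tokens):
                    counts[tokens[i + d]][d] += 1
        return counts

    def candidate_collocates(counts, k0=1.0, u0=2.0):
        """Keep words whose total frequency is >= k0 std devs above the mean and
        whose positional counts are peaked (variance over offsets >= u0)."""
        freqs = {w: sum(pos.values()) for w, pos in counts.items()}
        mean = statistics.mean(freqs.values())
        sd = statistics.pstdev(freqs.values()) or 1.0
        offsets = [d for d in range(-WINDOW, WINDOW + 1) if d != 0]
        keep = []
        for w, f in freqs.items():
            spread = statistics.pvariance([counts[w].get(d, 0) for d in offsets])
            if (f - mean) / sd >= k0 and spread >= u0:
                keep.append((w, f, spread))
        return sorted(keep, key=lambda t: -t[1])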
32. Summary
- Various statistical measures of collocation
- Each has its advantages and drawbacks
- Collocations are useful in a number of areas, which we'll turn to next