Transcript and Presenter's Notes

Title: LING 406 Intro to Computational Linguistics: Collocations


1
LING 406 Intro to Computational
Linguistics: Collocations
  • Richard Sproat
  • URL: http://catarina.ai.uiuc.edu/L406_08

2
This Lecture
  • What are collocations?
  • Measures of association
  • Pointwise Mutual Information
  • Frequency-Weighted Mutual Information
  • Pearson's χ²
  • Dunning's likelihood ratios
  • Non-binary collocations

3
Some characteristics of collocations
  • Firth (1957): Collocations of a given word are
    statements of the habitual or customary places of
    that word
  • In plain English: collocations are expressions
    constructed out of two or more words that have
    some special property
  • Non-compositionality: kick the bucket, white wine
  • Non-substitutability: kick the pail, yellow
    wine
  • Non-modifiability: kick the big bucket, very
    white wine

4
Some kinds of collocations
  • Idioms: kick the bucket, red herring
  • Nominal compounds: dog catcher, brown bread, sump
    pump, white wine
  • Verb-particle constructions: give up, bowl over,
    chew out

5
Why care?
  • Lexicography
  • Machine translation
  • Word segmentation
  • Sense disambiguation

6
Simple frequency: NY Times Newswire, 1990 (4 months)
7
Simple frequency: Justeson-Katz filtration
8
Statistical approaches to binary collocations
  • Frequency in and of itself doesn't tell you that
    words are particularly associated with each
    other: if both words are frequent, you might
    expect their combination to be frequent just by
    chance.
  • Statistical measures of association can give an
    estimate of how much more likely than chance a
    given combination is (sketched below).
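
A minimal sketch of the observed-versus-chance comparison (not from the
slides; all counts below are hypothetical placeholders):

    # Sketch: how often would the bigram w1 w2 occur if its words
    # were independent? Counts are hypothetical placeholders.
    N = 14_000_000   # total bigram tokens in the corpus
    c1 = 16_000      # count of w1
    c2 = 4_700       # count of w2
    c12 = 8          # count of the bigram w1 w2

    expected = c1 * c2 / N   # expected bigram count under independence
    print(f"observed = {c12}, expected under chance = {expected:.1f}")
    # A bigram can be frequent in absolute terms yet no more frequent
    # than chance; association measures quantify this gap.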

9
(Pointwise) Mutual Information
  • Mutual Information was originally proposed as an
    information-theoretic measure of channel capacity
    (Fano 1961).
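
The defining formula is not preserved in this transcript. The standard
pointwise MI of a word pair, with probabilities estimated by maximum
likelihood from corpus counts c(.) and corpus size N, is

    I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} \approx \log_2 \frac{N \cdot c(x, y)}{c(x)\,c(y)}

so a pair scores highly when it occurs much more often than the
independence baseline predicts.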

10
1995 AP Newswire Collocations
11
1995 AP Newswire Non-Collocations
12
Problems with mutual information
  • It is unreliable for small counts. (But this is
    really a problem with the MLE)?
  • The second, and more serious problem is that
    mutual information relates to estimated
    probability in a counterintuitive way
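
The second point can be made concrete with a standard observation (not
shown explicitly in this transcript): for a perfectly dependent pair,
with P(x, y) = P(x) = P(y),

    I(x, y) = \log_2 \frac{P(x)}{P(x)\,P(x)} = \log_2 \frac{1}{P(x)}

so the score grows as the pair becomes rarer, and perfectly associated
but infrequent bigrams outrank equally associated frequent ones.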

13
Frequency-weighted MI
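
The slide's formula is not preserved here. Assuming the usual
frequency-weighted variant, the pointwise MI score is scaled by the
joint count:

    C(x, y) \cdot \log_2 \frac{P(x, y)}{P(x)\,P(y)}

which rewards pairs that are both strongly associated and well attested.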
14
1995 AP Newswire collocations
15
Problems with Frequency-Weighted Mutual
Information
  • The main problem is that it tends to over-reward
    frequency

16
Pearson's χ-square
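
The formulas on these two slides are not preserved in the transcript.
For reference, the test compares observed and expected counts in the
2x2 contingency table of w1 / not-w1 by w2 / not-w2:

    \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

which for the 2x2 case reduces to

    \chi^2 = \frac{N\,(O_{11} O_{22} - O_{12} O_{21})^2}{(O_{11}+O_{12})(O_{11}+O_{21})(O_{12}+O_{22})(O_{21}+O_{22})}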
17
Pearson's χ-square
18
1995 AP Newswire collocations
19
Problems with χ-square
20
Dunning's (1993) likelihood ratios
Binomial coefficient: n! / ((n-k)! k!)
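
The rest of the derivation is not preserved in this transcript. In the
standard formulation (following Dunning 1993), the binomial likelihood
L(k; n, p) = p^k (1 - p)^{n - k} is used to compare an independence
hypothesis (p_1 = p_2 = p) against a dependence hypothesis (p_1 \neq p_2)
for the bigram w1 w2:

    \log \lambda = \log L(c_{12}; c_1, p) + \log L(c_2 - c_{12}; N - c_1, p)
                 - \log L(c_{12}; c_1, p_1) - \log L(c_2 - c_{12}; N - c_1, p_2)

with p = c_2 / N, p_1 = c_{12} / c_1, and p_2 = (c_2 - c_{12}) / (N - c_1);
-2 \log \lambda is asymptotically \chi^2-distributed, which makes scores
comparable across frequency ranges.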
21
1995 AP Newswire collocations
22
Problems with likelihood ratios
23
Some Chinese examples: MI
24
Weighted mutual information
25
χ-square
26
Likelihood ratios
27
Errors on top 500 by each measure (10-million-character ROCLING corpus)
28
Extracting non-binary collocations
29
Smadja's (1993) Xtract
30
Smadja's (1993) Xtract
31
Smadja's (1993) Xtract
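
The content of the Xtract slides is not preserved in this transcript.
As a rough sketch of the first stage described in Smadja (1993): for
each word pair within a small window, Xtract records the co-occurrence
frequency and its distribution over relative positions, keeping pairs
whose frequency is well above average ("strength", a z-score) and whose
positional distribution is sharply peaked ("spread", a variance); later
stages grow the surviving pairs into longer, possibly discontinuous
collocations. A minimal sketch of the two statistics, under these
assumptions:

    from statistics import mean, pstdev

    def strength(pair_freq, freqs_of_pairs_with_same_head):
        # z-score of this pair's frequency against all pairs that
        # share the same head word
        mu = mean(freqs_of_pairs_with_same_head)
        sigma = pstdev(freqs_of_pairs_with_same_head)
        return (pair_freq - mu) / sigma if sigma else 0.0

    def spread(position_counts):
        # variance of the histogram of co-occurrences over relative
        # positions (e.g. -5 .. +5); a flat histogram scores low,
        # a peaked one scores high
        avg = mean(position_counts)
        return mean((c - avg) ** 2 for c in position_counts)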
32
Summary
  • Various statistical measures of collocation
  • Each has its advantages and drawbacks
  • Collocations are useful in a number of areas,
    which we'll turn to next