1
COLLOCATIONS
  • He Zhongjun
  • 2007-04-13

2
Outline
  • Introduction
  • Approaches to find collocations
  • Frequency
  • Mean and Variance
  • Hypothesis test
  • Mutual information
  • Applications

3
Outline
  • Introduction
  • Approaches to find collocations
  • Frequency
  • Mean and Variance
  • Hypothesis test
  • Mutual information
  • Applications

4
What are collocations?
  • A collocation is an expression consisting of two
    or more words that correspond to some
    conventional way of saying things. (from the
    textbook)
  • A collocation is a sequence of two or more
    consecutive words that has characteristics of a
    syntactic and semantic unit, and whose exact and
    unambiguous meaning or connotation cannot be
    derived directly from the meaning or connotation
    of its components. (Choueka, 1988)

5
Examples
  • noun phrases
  • strong tea vs. powerful tea
  • verbs
  • make a decision vs. take a decision
  • knock on the door vs. hit the door
  • make up
  • Idioms
  • kick the bucket
  • Subtle, unexplainable, native speaker usage
  • broad daylight vs. bright daylight

6
Introduction Characteristics / Criteria
  • Non-compositionality
  • e.g. kick the bucket; white wine, white hair,
    white woman
  • Non-substitutability
  • e.g. white wine -> yellow wine?
  • Non-modifiability
  • e.g. as poor as a church mouse -> as poor as
    church mice?
  • Cannot be translated word for word

7
Outline
  • Introduction
  • Approaches to find collocations
  • Frequency
  • Mean and Variance
  • Hypothesis test
  • Mutual information
  • Applications

8
Frequency (2-1)
  • Counting
  • e.g. the counts of bigrams in a corpus

Not effective on its own: most of the most
frequent bigrams are pairs of function words!
A minimal counting sketch follows.
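A minimal sketch of the counting step in Python, assuming the corpus is already tokenized into a plain list of strings (loading and tokenization are left out):

```python
from collections import Counter

def bigram_counts(tokens):
    """Count every pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))

tokens = "the quick fox jumped over the lazy dog".split()
for bigram, count in bigram_counts(tokens).most_common(3):
    print(bigram, count)
```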
9
Frequency (2-2)
  • Filter by part-of-speech patterns (Justeson and
    Katz 1995)
  • Or use a stop list of function words

A simple quantitative technique plus simple
linguistic knowledge; a sketch of the tag filter
follows.
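A minimal sketch of a Justeson-and-Katz-style filter, assuming the corpus comes pre-tagged; the two bigram tag patterns shown (adjective-noun, noun-noun) are a subset of their pattern set, and the toy data is illustrative:

```python
from collections import Counter

# Bigram tag patterns that can form a phrase: A N and N N
# (a subset of the Justeson and Katz 1995 patterns).
GOOD_PATTERNS = {("A", "N"), ("N", "N")}

def filtered_bigrams(tagged_tokens):
    """Count only bigrams whose tag pattern matches a phrase pattern."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in GOOD_PATTERNS:
            counts[(w1, w2)] += 1
    return counts

tagged = [("strong", "A"), ("tea", "N"), ("of", "P"),
          ("New", "A"), ("York", "N")]
print(filtered_bigrams(tagged))
```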
10
Mean and Variance (4-1)
  • Fixed bigrams -> bigrams at a distance
  • she knocked on his door
  • They knocked at the door
  • 100 women knocked on Donaldson's door
  • A man knocked on the metal front door
  • Mean offset: (3 + 3 + 5 + 5) / 4 = 4.0
  • Sample deviation: s ≈ 1.15

11
Mean and Variance (4-2)
  • Mean: μ = (1/n) Σ di
  • Variance: s² = (1/(n − 1)) Σ (di − μ)²

Low variance means the two words usually occur at
about the same distance; a minimal sketch follows.
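A minimal sketch of collecting signed offsets for a word pair within a small window and computing the mean and sample deviation; the window size of 4 and the toy corpus are illustrative assumptions:

```python
from statistics import mean, stdev

def offsets(tokens, w1, w2, window=4):
    """Signed distances from each w1 to every w2 within the window."""
    found = []
    for i, tok in enumerate(tokens):
        if tok == w1:
            for d in range(-window, window + 1):
                j = i + d
                if d != 0 and 0 <= j < len(tokens) and tokens[j] == w2:
                    found.append(d)
    return found

corpus = "she knocked on his door they knocked at the door".split()
d = offsets(corpus, "knocked", "door")
print(d, mean(d), stdev(d))  # offsets, mean offset, sample deviation
```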
12
Mean and Variance (4-3)
The mean of −1.15 indicates that strong usually
occurs about one word to the left,
e.g. strong business support.
strong and for don't form a collocation: their
deviation is too high.
13
Mean and Variance (4-4)
If the mean is close to 1.0 and the deviation is
low, the method finds the same collocations as a
frequency-based method. It can also find looser
phrases whose words occur at variable distances.
14
Hypothesis Testing
  • What if high frequency and low variance are
    accidental?
  • e.g. new companies: new and companies are both
    frequently occurring words, yet new companies is
    not a collocation.
  • Hypothesis testing: assessing whether or not
    something is a chance event
  • Null hypothesis H0: there is no association
    between the words beyond chance occurrences
  • Compute the probability p that the event would
    occur if H0 were true
  • If p is below the significance level, reject H0
  • otherwise, accept H0

15
t-test (5-1)
t = (x̄ − μ) / sqrt(s² / N)

where μ is the distribution mean, x̄ the sample
mean, s² the sample variance, and N the sample
size.
Think of the corpus as a long sequence of N
bigrams: if the bigram of interest occurs, the
value is 1; otherwise it is 0 (a binomial
distribution).
16
t-test (5-2)
  • N(new) = 15828, N(companies) = 4675,
    N(tokens) = 14307668
  • N(new companies) = 8
  • P(new) = 15828/14307668, P(companies) =
    4675/14307668
  • Sample mean: x̄ = P(new companies) = 8/14307668
    ≈ 5.591 × 10^-7
  • H0: P(new companies) = P(new) P(companies)
    ≈ 3.615 × 10^-7
  • Mean: μ ≈ 3.615 × 10^-7 (assuming a Bernoulli
    trial)
  • Variance: s² = p(1 − p) ≈ p, since p is small
  • t ≈ 0.9999 < 2.576 (the critical value), so
    accept H0 (a minimal sketch follows)
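A minimal sketch of this t computation, using the counts from the slide and the p(1 − p) ≈ p variance approximation:

```python
import math

N = 14307668                   # corpus size in bigrams
x_bar = 8 / N                  # sample mean: observed P(new companies)
mu = (15828 / N) * (4675 / N)  # H0 mean: P(new) * P(companies)
s2 = x_bar * (1 - x_bar)       # Bernoulli variance, roughly x_bar

t = (x_bar - mu) / math.sqrt(s2 / N)
print(t)  # roughly 1.0, below the 2.576 critical value: accept H0
```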

17
t-test (5-3)
The t statistic can rank bigrams that have the
same frequency, which a frequency-based method
cannot do.
18
t-test (5-4)
  • Using the t-test to find words whose
    co-occurrence patterns best distinguish between
    two words
  • e.g. lexicography (Church et al., 1989)

19
t-test (5-5)
20
Pearson's chi-square test (4-1)
  • The t-test assumes probabilities are
    approximately normally distributed
  • The χ² test does not assume normality
  • Compare the observed frequencies O with the
    frequencies E expected under independence:
    X² = Σ (Oij − Eij)² / Eij. If the difference is
    large, reject H0

21
Pearson's chi-square test (4-2)
X² ≈ 1.55 < 3.841 (α = 0.05): accept H0, new and
companies occur independently! A minimal sketch
follows.
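A minimal sketch of Pearson's X² for a 2-by-2 contingency table, using its closed form; the three cell counts other than 8 are derived from the slide's totals:

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson's X^2 for a 2x2 table of observed counts."""
    n = o11 + o12 + o21 + o22
    # Closed form of sum((O - E)^2 / E) for the 2x2 case.
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22)
    return num / den

# Rows: first word is / is not "new"; columns: second word is / is not
# "companies". Cells derived from N(new), N(companies), N(new companies).
print(chi_square_2x2(8, 15820, 4667, 14287173))  # about 1.55 < 3.841
```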
22
Pearson's chi-square test (4-3)
  • Identification of translation pairs in aligned
    corpora (Church et al., 1991)

59 is the number of aligned sentence pairs that
contain cow in English and vache in French.
Reject H0: (cow, vache) is a translation pair.
(A sketch with illustrative counts follows.)
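The same 2-by-2 X² applies here. Only the 59 comes from the slide; the other three cell counts below are illustrative assumptions:

```python
# Rows: English sentence contains "cow" or not; columns: French
# sentence contains "vache" or not. Only 59 is from the slide; the
# remaining cells are assumed for illustration.
o11, o12, o21, o22 = 59, 6, 8, 570934
n = o11 + o12 + o21 + o22
x2 = n * (o11 * o22 - o12 * o21) ** 2 / (
    (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22))
print(x2)  # enormous compared with 3.841: reject H0
```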
23
Pearson's chi-square test (4-4)
  • Metric for corpus similarity (Kilgarriff and
    Rose, 1998)
  • H0: the two corpora are drawn from the same
    source

24
Likelihood ratios (3-1)
  • More appropriate for sparse data
  • Two alternative explanations for the occurrence
    frequency of a bigram (Dunning 1993)
  • H1: P(w2 | w1) = p = P(w2 | ¬w1) (independence)
  • H2: P(w2 | w1) = p1 ≠ p2 = P(w2 | ¬w1)
    (dependence)
  • log λ = log( L(H1) / L(H2) ), where L(H) is the
    likelihood of observing the data under H

25
Likelihood ratios (3-2)
  • c1, c2, c12 are the numbers of occurrences of
    w1, w2, and w1 w2; assuming a binomial
    distribution b(k; n, x) with p = c2/N,
    p1 = c12/c1, p2 = (c2 − c12)/(N − c1):
  • L(H1) = b(c12; c1, p) b(c2 − c12; N − c1, p)
  • L(H2) = b(c12; c1, p1) b(c2 − c12; N − c1, p2)
  • A minimal sketch follows.
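A minimal sketch of the −2 log λ score built from these formulas; it works in log space to avoid underflow, and the binomial coefficients are dropped because they cancel in the ratio:

```python
import math

def log_b(k, n, x):
    """log b(k; n, x) without the C(n, k) term, which cancels in the ratio."""
    x = min(max(x, 1e-12), 1 - 1e-12)  # guard against log(0)
    return k * math.log(x) + (n - k) * math.log(1 - x)

def log_likelihood_ratio(c1, c2, c12, n):
    """-2 log lambda for the bigram w1 w2 (Dunning 1993)."""
    p = c2 / n
    p1 = c12 / c1
    p2 = (c2 - c12) / (n - c1)
    log_l_h1 = log_b(c12, c1, p) + log_b(c2 - c12, n - c1, p)
    log_l_h2 = log_b(c12, c1, p1) + log_b(c2 - c12, n - c1, p2)
    return -2 * (log_l_h1 - log_l_h2)

# Counts for (new, companies) from the t-test slide.
print(log_likelihood_ratio(c1=15828, c2=4675, c12=8, n=14307668))
```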

26
Likelihood ratios (3-3)
If λ is a likelihood ratio of a particular form,
then −2 log λ is asymptotically χ²-distributed
(Mood et al., 1974).
The likelihood ratio test is more appropriate for
sparse data.
27
Mutual Information (7-1)
  • Information you gain about x when knowing y
  • Pointwise mutual information (Church et al.
    1991; Church and Hanks 1989):
    I(x, y) = log2( P(x, y) / (P(x) P(y)) )
    (a minimal sketch follows)
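A minimal sketch of pointwise mutual information from corpus counts; the usage line reuses the new companies counts from the t-test slide:

```python
import math

def pmi(c_xy, c_x, c_y, n):
    """I(x, y) = log2( P(x, y) / (P(x) P(y)) ) from raw counts."""
    return math.log2((c_xy / n) / ((c_x / n) * (c_y / n)))

print(pmi(c_xy=8, c_x=15828, c_y=4675, n=14307668))  # about 0.63 bits
```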

28
Mutual Information (7-2)
The amount of information about the occurrence of
Ayatollah at position i in the corpus increases
by 18.38 bits if we are told that Ruhollah occurs
at position i + 1.
29
Mutual Information (7-3)
Problem 1: information gain ≠ direct dependence
e.g. English house of commons / French chambre de
communes
30
Mutual Information (7-4)
  • χ² considers more than (house, communes)
  • MI considers only (house, communes)

31
Mutual Information (7-5)
Problem 2: data sparseness
32
Mutual Information (7-6)
  • For perfect dependence: I(x, y) = log2( 1 /
    P(y) ), which grows as the words become rarer
  • For perfect independence: I(x, y) = log2 1 = 0
  • MI is not a good measure of dependence, since
    the score depends on the frequency of the
    individual words.

33
Mutual Information (7-7)
  • Pointwise MI: MI(new, companies)
  • The uncertainty in predicting companies is
    reduced when we know that the previous word is
    new
  • Small sample: not a good measure when counts are
    low
  • MI ≈ 0 is a good indication of independence
  • Mutual information: MI(wi-1, wi)
  • How much information (entropy) is gained in
    going from a unigram model P(w) to a bigram
    model P(wi | wi-1)
  • Estimated using a large sample

34
Outline
  • Introduction
  • Approaches to find collocations
  • Frequency
  • Mean and Variance
  • Hypothesis test
  • Mutual information
  • Applications

35
Applications
  • Computational lexicography
  • Information Retrieval
  • Accuracy of retrieval can be improved if the
    similarity between a user query and a document is
    determined based on common collocations instead
    of words. (Fagan 1989)
  • Natural Language Generation (Smadja 1993)
  • Cross-language information retrieval (Hull and
    Grefenstette 1998)

36
Collocations and Word Sense Disambiguation
  • Association or co-occurrence
  • doctor and nurse
  • plane and airport
  • Both are important for word sense disambiguation
  • Collocation -> local context (one sense per
    collocation)
  • Drop me a line (a letter)
  • ... on the line ... (a phone line)
  • Co-occurrence -> topical or global context
  • Subject-based disambiguation

37
References
  • Choueka, Yaacov. 1988. Looking for needles in a
    haystack or locating interesting collocational
    expressions in large textual databases. In
    Proceedings of the RIAO, pp. 4338.
  • Justeson, John S., and Slava M. Katz. 1995.
    Technical terminology: some linguistic properties
    and an algorithm for identification in text.
    Natural Language Engineering 1:9-27.
  • Church, Kenneth Ward, and Patrick Hanks. 1989.
    Word association norms, mutual information and
    lexicography. In ACL 27, pp. 76-83.
  • Church, Kenneth, William Gale, Patrick Hanks, and
    Donald Hindle. 1991. Using statistics in lexical
    analysis. In Uri Zernik (ed.), Lexical
    Acquisition: Exploiting On-Line Resources to
    Build a Lexicon, pp. 115-164. Hillsdale, NJ:
    Lawrence Erlbaum.
  • Kilgarriff, Adam, and Tony Rose. 1998. Metrics
    for corpus similarity and homogeneity.
    Manuscript, ITRI, University of Brighton.
  • Dunning, Ted. 1993. Accurate methods for the
    statistics of surprise and coincidence.
    Computational Linguistics 19:61-74.
  • Mood, Alexander M., Franklin A. Graybill, and
    Duane C. Boes. 1974. Introduction to the Theory
    of Statistics. 3rd edition. New York:
    McGraw-Hill.
  • Fagan, Joel L. 1989. The effectiveness of a
    nonsyntactic approach to automatic phrase
    indexing for document retrieval. Journal of the
    American Society for Information Science
    40:115-132.
  • Smadja, Frank. 1993. Retrieving collocations from
    text: Xtract. Computational Linguistics
    19:143-177.
  • Hull, David A., and Gregory Grefenstette. 1998.
    Querying across languages: a dictionary-based
    approach to multilingual information retrieval.
    In Karen Sparck Jones and Peter Willett (eds.),
    Readings in Information Retrieval. San Francisco:
    Morgan Kaufmann.

38
  • Thanks!