Title: COLLOCATIONS
1. COLLOCATIONS
2. Outline
- Introduction
- Approaches to find collocations
- Frequency
- Mean and Variance
- Hypothesis test
- Mutual information
- Applications
4. What are collocations?
- A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. (the textbook)
- A collocation is a sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components. (Choueka, 1988)
5. Examples
- noun phrases
- strong tea vs. powerful tea
- verbs
- make a decision vs. take a decision
- knock on the door vs. hit the door
- make up
- idioms
- kick the bucket
- subtle, unexplainable native-speaker usage
- broad daylight vs. bright daylight
6. Introduction: Characteristics / Criteria
- Non-compositionality
- e.g. kick the bucket; white wine, white hair, white woman
- Non-substitutability
- e.g. white wine -> yellow wine?
- Non-modifiability
- e.g. as poor as a church mouse / as poor as church mice?
- Cannot be translated word by word
8. Frequency (2-1)
- Counting
- e.g. the counts of bigrams in a corpus
Not effective: most of the high-frequency pairs are function words!
9. Frequency (2-2)
- Filter by part-of-speech patterns (Justeson and Katz 1995)
- or use a stop list of function words
- a simple quantitative technique plus simple linguistic knowledge
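As a sketch of the filtering idea, a Justeson-and-Katz-style filter keeps only adjacent word pairs whose part-of-speech pattern can form a phrase. The simplified tag set (A = adjective, N = noun, P = preposition, D = determiner) and the tagged example sentence below are invented for illustration, not taken from the paper.

```python
# Sketch of a part-of-speech filter for bigram collocation candidates.
# The tag set and the tagged tokens are hypothetical illustrations.
ALLOWED_BIGRAM_PATTERNS = {("A", "N"), ("N", "N")}

def candidate_bigrams(tagged_tokens):
    """Yield adjacent word pairs whose tag pattern may form a collocation."""
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in ALLOWED_BIGRAM_PATTERNS:
            yield (w1, w2)

tagged = [("strong", "A"), ("tea", "N"), ("of", "P"), ("the", "D"),
          ("linear", "A"), ("regression", "N")]
print(list(candidate_bigrams(tagged)))
# pairs involving the function words "of" and "the" are filtered out
```

Pairs such as (tea, of) or (the, linear) never pass the pattern check, which is why this cheap filter removes most of the function-word noise that pure counting lets through.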
10. Mean and Variance (4-1)
- Fixed bigrams -> bigrams at a distance
- she knocked on his door
- They knocked at the door
- 100 women knocked on Donaldson's door
- A man knocked on the metal front door
- Mean offset: (3 + 3 + 5 + 5) / 4 = 4.0
- Sample deviation: sqrt(((3-4)^2 + (3-4)^2 + (5-4)^2 + (5-4)^2) / 3) = 1.15
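The mean offset and sample deviation can be checked in a few lines of Python; the offsets [3, 3, 5, 5] are the ones from the slide's four example sentences.

```python
from statistics import mean, stdev

# Offsets of "door" relative to "knocked" in the four example sentences
# on the slide; a positive offset means "door" appears after "knocked".
offsets = [3, 3, 5, 5]

d_mean = mean(offsets)     # 4.0, matching the slide
d_sd = stdev(offsets)      # sample deviation, about 1.15
print(d_mean, round(d_sd, 2))
```

Note that `stdev` divides by n - 1 (the sample deviation used here), not by n.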
11. Mean and Variance (4-2)
A low variance means the two words usually occur at about the same distance.
12. Mean and Variance (4-3)
The mean of -1.15 indicates that strong usually occurs on the left side,
e.g. strong business support.
strong and for don't form a collocation.
13Mean and Variance(4-4)
If the mean is close to 1.0 and the deviation is
low, it can find collocations as frequency-based
method. It can also find loose phrases.
14. Hypothesis Testing
- What if high frequency and low variance are accidental?
- e.g. new companies: new and companies are individually frequent words, but the pair is not a collocation.
- Hypothesis testing: assessing whether or not something is a chance event
- Null hypothesis H0: there is no association between the words beyond chance occurrences
- Compute the probability p that the event would occur if H0 were true
- If p is below the significance level, reject H0
- otherwise, accept H0
15. t-test (5-1)
- t statistic:
- t = (x_bar - mu) / sqrt(s^2 / N)
- where mu is the distribution mean, x_bar the sample mean, s^2 the sample variance, and N the sample size
Think of the corpus as a long sequence of N bigrams; if the bigram of interest occurs, the value is 1, otherwise 0 (a Bernoulli trial, hence a binomial distribution).
16t-test (5-2)
- N(new) 15828, N(companies) 4675,
N(tokens)14307668 - N(new companies) 8
- P(new) 15828/14307668, P(companies)
4675/14307668 - P(new companies) 8/14307668 5.59110-7
- H0 P(new companies) p(new)p(companies) 3.615
10-7 - mean (assuming
Bernoulli trial) - variance
- t 0.9999932 lt 2.576
Accept H0
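A minimal sketch of this computation, plugging the slide's counts into the t formula:

```python
import math

# Worked t-test for the bigram "new companies", using the counts on the slide.
N = 14307668                  # number of bigram tokens in the corpus
p_new = 15828 / N             # P(new)
p_companies = 4675 / N        # P(companies)

x_bar = 8 / N                 # sample mean: observed P(new companies)
mu = p_new * p_companies      # mean under H0 (independence)
s2 = x_bar * (1 - x_bar)      # Bernoulli sample variance, approx. x_bar

t = (x_bar - mu) / math.sqrt(s2 / N)
print(round(t, 4))            # about 0.9999, well below the 2.576 critical value
```

Since t stays below the critical value 2.576 (alpha = 0.005), H0 cannot be rejected: the data do not support calling new companies a collocation.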
17. t-test (5-3)
The t-test can rank bigrams that have the same frequency, which a frequency-based method cannot do.
18. t-test (5-4)
- Using the t-test to find words whose co-occurrence patterns best distinguish between two words
- e.g. in lexicography (Church et al., 1989)
19. t-test (5-5)
20. Pearson's chi-square test (4-1)
- The t-test assumes probabilities are approximately normally distributed
- The chi-square test does not assume normality
- Compare the observed frequencies with the frequencies expected under independence; if the difference is large, reject H0
21. Pearson's chi-square test (4-2)
Accept H0: new and companies occur independently!
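Using the same counts as the t-test slides, the 2x2 contingency table and the shortcut form of Pearson's statistic for 2x2 tables can be sketched as follows (the cell counts are derived from the slide's totals):

```python
# 2x2 chi-square test for "new companies"; cell counts derived from the
# slide's totals. Rows: w1 is/is not "new"; columns: w2 is/is not "companies".
N = 14307668
c_new, c_companies, c_both = 15828, 4675, 8

o11 = c_both                  # new companies
o12 = c_companies - c_both    # companies not preceded by new
o21 = c_new - c_both          # new not followed by companies
o22 = N - o11 - o12 - o21     # everything else

# Shortcut form of Pearson's chi-square statistic for a 2x2 table
chi2 = (N * (o11 * o22 - o12 * o21) ** 2
        / ((o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)))
print(round(chi2, 2))         # about 1.55, below the 3.841 critical value
```

With one degree of freedom and alpha = 0.05 the critical value is 3.841, so H0 is accepted here, in agreement with the t-test result.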
22. Pearson's chi-square test (4-3)
- Identification of translation pairs in aligned corpora (Church et al., 1991)
59 is the number of sentence pairs that contain cow in English and vache in French.
Reject H0: (cow, vache) is a translation pair.
23. Pearson's chi-square test (4-4)
- Metric for corpus similarity (Kilgarriff et al., 1998)
- H0: the two corpora are drawn from the same source
24. Likelihood ratios (3-1)
- More appropriate for sparse data
- Two alternative explanations for the occurrence frequency of a bigram (Dunning 1993)
- H1: P(w2|w1) = P(w2|not w1) = p (independence)
- H2: P(w2|w1) = p1 != p2 = P(w2|not w1) (dependence)
- log lambda = log( L(H1) / L(H2) ), where L(H) is the likelihood of observing the data under H
25. Likelihood ratios (3-2)
- c1, c2, c12 are the numbers of occurrences of w1, w2, and the bigram w1 w2; assuming a binomial distribution b(k; n, x):
- L(H1) = b(c12; c1, p) b(c2 - c12; N - c1, p), with p = c2 / N
- L(H2) = b(c12; c1, p1) b(c2 - c12; N - c1, p2), with p1 = c12 / c1, p2 = (c2 - c12) / (N - c1)
26. Likelihood ratios (3-3)
If lambda is a likelihood ratio of a particular form, then -2 log lambda is asymptotically chi-square distributed (Mood et al., 1974).
The likelihood ratio test is more appropriate for sparse data.
27. Mutual Information (7-1)
- Information you gain about x when knowing y
- Pointwise mutual information (Church et al. 1991; Church and Hanks 1989):
- I(x, y) = log2( P(x, y) / (P(x) P(y)) )
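Applied to the "new companies" counts used in the earlier slides, pointwise mutual information can be computed directly:

```python
import math

# Pointwise mutual information I(x, y) = log2(P(x, y) / (P(x) P(y))),
# computed for "new companies" with the counts from the earlier slides.
N = 14307668
p_x = 15828 / N               # P(new)
p_y = 4675 / N                # P(companies)
p_xy = 8 / N                  # P(new companies)

pmi = math.log2(p_xy / (p_x * p_y))
print(round(pmi, 2))          # about 0.63 bits: only a weak association
```

Compare this with the 18.38 bits of the Ayatollah / Ruhollah example on the next slide: high PMI signals a much stronger association than the near-zero value here.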
28. Mutual Information (7-2)
The amount of information about the occurrence of Ayatollah at position i in the corpus increases by 18.38 bits if we are told that Ruhollah occurs at position i+1.
29. Mutual Information (7-3)
Problem 1: information gain does not imply direct dependence
English house of commons <-> French chambre de communes
30. Mutual Information (7-4)
- chi-square takes all cells of the contingency table into account, not only (house, communes)
- MI considers only the joint cell (house, communes)
31. Mutual Information (7-5)
Problem 2: data sparseness
32Mutual Information (7-6)
- For Perfect dependence
- For perfect independence
- MI is a not good measure of dependence since the
score depends on the frequency of the individual
words.
33. Mutual Information (7-7)
- Pointwise MI: MI(new, companies)
- uncertainty reduced in predicting companies when knowing that the previous word is new
- small samples: not a good measure when counts are low
- MI close to 0 is a good indication of independence
- Mutual information: MI(w_i-1, w_i)
- how much information (entropy) is gained
- unigram model P(w) vs. bigram model P(w_i | w_i-1)
- estimated using a large sample
35. Applications
- Computational lexicography
- Information retrieval
- Accuracy of retrieval can be improved if the similarity between a user query and a document is determined based on common collocations instead of words. (Fagan 1989)
- Natural language generation (Smadja 1993)
- Cross-language information retrieval (Hull and Grefenstette 1998)
36. Collocations and Word Sense Disambiguation
- Association or co-occurrence
- doctor and nurse
- plane and airport
- Both are important for word sense disambiguation
- Collocation -> local context (one sense per collocation)
- Drop me a line (a letter)
- ... on the line ... (a phone line)
- Co-occurrence -> topical context or global context
- subject-based disambiguation
37. References
- Choueka, Yaacov. 1988. Looking for needles in a haystack, or locating interesting collocational expressions in large textual databases. In Proceedings of RIAO, pp. 4338.
- Justeson, John S., and Slava M. Katz. 1995. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering 1:9-27.
- Church, Kenneth Ward, and Patrick Hanks. 1989. Word association norms, mutual information and lexicography. In ACL 27, pp. 76-83.
- Church, Kenneth, William Gale, Patrick Hanks, and Donald Hindle. 1991. Using statistics in lexical analysis. In Uri Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, pp. 115-164. Hillsdale, NJ: Lawrence Erlbaum.
- Kilgarriff, Adam, and Tony Rose. 1998. Metrics for corpus similarity and homogeneity. Manuscript, ITRI, University of Brighton.
- Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19:61-74.
- Mood, Alexander M., Franklin A. Graybill, and Duane C. Boes. 1974. Introduction to the Theory of Statistics. 3rd edition. New York: McGraw-Hill.
- Fagan, Joel L. 1989. The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science 40:115-132.
- Smadja, Frank. 1993. Retrieving collocations from text: Xtract. Computational Linguistics 19:143-177.
- Hull, David A., and Gregory Grefenstette. 1998. Querying across languages: A dictionary-based approach to multilingual information retrieval. In Karen Sparck Jones and Peter Willett (eds.), Readings in Information Retrieval. San Francisco: Morgan Kaufmann.