Collocation and Word Similarity

Transcript and Presenter's Notes

Title: Collocation and Word Similarity


1
Collocation and Word Similarity
  • James Pustejovsky
  • CS 114

2
Collocation
  • Habitual co-occurrence of words
  • strong tea, weapons of mass destruction, make up
  • Collocations are characterized by limited
    compositionality.
  • Idioms: red tape, miss the boat, spill guts
  • stiff breeze, stiff wind

3
Term Extraction
  • The problem of extracting technical terms from a
    corpus is called term extraction.
  • The extracted terms include single words and
    collocations.

4
Frequency Counts
  • If we count the frequencies of adjacent word
    pairs (bigrams), the most frequent bigrams
    consist mostly of function words (a counting
    sketch follows this list)
  • 80871 of the
  • 58841 in the
  • 15494 to be
  • 13362 of a
  • 11428 New York
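A minimal sketch of this kind of bigram counting in Python; the tokens below are a made-up placeholder, not a real corpus:

    from collections import Counter

    def bigram_counts(tokens):
        # Count every pair of adjacent tokens in the corpus.
        return Counter(zip(tokens, tokens[1:]))

    # Hypothetical usage; tokens would come from a tokenized corpus.
    tokens = "of the people by the people for the people".split()
    for (w1, w2), n in bigram_counts(tokens).most_common(3):
        print(n, w1, w2)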

5
Filtering by POS Pattern
  • Only count pairs that fit a predetermined
    part-of-speech pattern (A adjective, N noun; a
    filter sketch follows this list)
  • A N
  • N N
  • A A N
  • A N N
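A minimal sketch of such a POS-pattern filter, assuming tokens are already tagged as (word, tag) pairs; the tag names and example data are illustrative, not from the slides:

    PATTERNS = {("A", "N"), ("N", "N"), ("A", "A", "N"), ("A", "N", "N")}

    def matching_ngrams(tagged):
        # Yield word sequences whose tag sequence matches an allowed pattern.
        for i in range(len(tagged)):
            for n in (2, 3):
                chunk = tagged[i:i + n]
                if len(chunk) == n and tuple(t for _, t in chunk) in PATTERNS:
                    yield tuple(w for w, _ in chunk)

    tagged = [("strong", "A"), ("tea", "N"), ("is", "V"), ("good", "A")]
    print(list(matching_ngrams(tagged)))   # [('strong', 'tea')]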

6
Measure the Strength of Association
  • Pointwise Mutual Information
  • compares the observed frequency with the
    expected frequency (formula below)
  • Other measures
  • Log likelihood ratio
  • t-test
  • χ² test
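For reference, the standard definition of pointwise mutual information, which compares the observed co-occurrence probability with what independence would predict (c(.) are corpus counts, N the number of bigrams):

    \mathrm{PMI}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}
                 \approx \log \frac{N \cdot c(x, y)}{c(x)\,c(y)}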

7
Computing Word Similarity
  • Learning the meaning of new words
  • A bottle of tezguino is on the table
  • Everyone likes tezguino
  • Tezguino makes you drunk
  • We make tezguino out of corn
  • Finding similar words is an initial step in
    automatic acquisition of meaning

8
Distributional Hypothesis
  • Zelig Harris 1968
  • Meanings of words are determined to a large
    extent by their distributional patterns.
  • Learning the meaning of a word is dependent, at
    least in part, on exposure to the word in the
    linguistic contexts of its use.

9
Automatic Thesaurus Construction
  • Corpus- and genre-specific
  • A text corpus by itself is of much less help in
    thesaurus construction than in dictionary
    compilation
  • Automatic methods can capture similarities that
    manually constructed thesauri usually miss. For
    example
  • ...who faced a number of charges such as
    prostitution and drugs

10
Similarity-based Smoothing
  • Example
  • He said the CIA had seen spectacular threat
    reporting about massive casualties in the United
    States in the spring and summer last year.
  • Suppose we have not seen spectacular threat
    before.
  • We should not say its probability is 0.
  • Consider the frequency counts of similar words
    (a smoothing sketch follows this list)
  • significant threat 8
  • bizarre threat 1
  • tremendous threat 2
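A minimal sketch of similarity-based smoothing, assuming similarity scores between adjectives are already available; the similarity values below are illustrative placeholders, while the counts are the ones on the slide:

    def smoothed_count(adj, noun, counts, sim):
        # Estimate a count for an unseen (adjective, noun) pair as a
        # similarity-weighted average over the adjective's similar words.
        weights = sim[adj]
        total = sum(weights.values())
        return sum(w * counts.get((a, noun), 0)
                   for a, w in weights.items()) / total

    counts = {("significant", "threat"): 8,
              ("bizarre", "threat"): 1,
              ("tremendous", "threat"): 2}
    sim = {"spectacular": {"significant": 0.4, "bizarre": 0.2, "tremendous": 0.4}}
    print(smoothed_count("spectacular", "threat", counts, sim))   # 4.2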

11
Similar Words
  • Distributional Hypothesis
  • If two words have similar sets of collocations,
    they are probably similar in meaning.
  • Representing words as feature vectors (a sketch
    follows this list)
  • Each collocation is a feature
  • We used an information-theoretic similarity
    measure (many others are possible too).
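A minimal sketch of the feature-vector representation; here the collocation context is simplified to adjacent words, which is an assumption for illustration, not the slides' actual feature set:

    from collections import Counter, defaultdict

    def collocation_vectors(tokens):
        # Each word's feature vector counts the contexts (neighbouring words)
        # it occurs with; each such collocation is one feature.
        vectors = defaultdict(Counter)
        for left, right in zip(tokens, tokens[1:]):
            vectors[left][("R", right)] += 1
            vectors[right][("L", left)] += 1
        return vectors

    vecs = collocation_vectors("strong tea and strong coffee".split())
    print(vecs["strong"])
    # Counter({('R', 'tea'): 1, ('L', 'and'): 1, ('R', 'coffee'): 1})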

12
Example absurd vs. preposterous
Subjects of preposterous: accusation 1, allegation 8,
disparity 1, idea 4, offer 2, packaging 1, reviewer 1,
route 1, suggestion 1, that 5, thought 1
Subjects of absurd: allegation 5, argument 2, charge 4,
claim 2, idea 4, it 48, notion 2, statement 3,
suggestion 2, that 14, thinking 1
13
Similarity between Feature Vectors
  • Given two objects o1 and o2, each represented as
    a feature vector, how do we compute the
    similarity between them?

14
Distance-based Measure
15
Jaccard's Coefficient
16
Cosine Coefficient
17
Dice Coefficient
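The formula images for these four slides were not preserved in this transcript. For feature sets X and Y (or weighted feature vectors x and y), the standard textbook definitions are (the exact slide formulas may have used weighted generalizations):

    \text{Euclidean distance:}\; d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}
    \text{Jaccard:}\; \mathrm{sim}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}
    \text{Cosine:}\; \mathrm{sim}(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}
    \text{Dice:}\; \mathrm{sim}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}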
18
Some Problems
  • Some features are more important than others
  • that is absurd vs. that is preposterous
  • the idea is absurd vs. the idea is preposterous
  • The raw frequency count is not a good feature
  • it is absurd occurred 48 times
  • the idea is absurd occurred 4 times.

19
A Better Representation
where m_ij is the pointwise mutual information
between feature i and object j (formula below)
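The slide's formula image was not preserved; the usual PMI weight for feature f_i and object (word) o_j is:

    m_{ij} = \mathrm{PMI}(f_i, o_j) = \log \frac{P(f_i, o_j)}{P(f_i)\,P(o_j)}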
20
Example suit
Mutual Information
  • lawsuit 0.259631
  • jacket 0.171716
  • shirt 0.158707
  • trouser 0.149233
  • coat 0.145872
  • blouse 0.139
  • sweater 0.138804
  • blazer 0.138382
  • dress 0.138045
  • pant 0.137097
  • uniform 0.128672
  • "business suit" 0.127607
  • tuxedo 0.127156
Frequency Count
  • lawsuit 0.909332
  • petition 0.793921
  • tuxedo 0.748025
  • "trench coat" 0.744021
  • "straw hat" 0.737379
  • "baseball cap" 0.735842
  • wig 0.735192
  • sweater 0.73493
  • "top hat" 0.734683
  • turban 0.730693
  • "polo shirt" 0.730284
  • T-shirt 0.729481
  • "business suit" 0.729432

21
Example duty
Mutual Information
  • responsibility 0.154056
  • obligation 0.134577
  • task 0.120646
  • "customs duty" 0.119004
  • "import duty" 0.117061
  • tariff 0.107041
  • job 0.102759
  • chore 0.0933882
  • "military service" 0.0933471
  • "excise tax" 0.0933373
  • tax 0.0867553
  • assignment 0.0862838
  • role 0.0859558
Frequency Count
  • destiny 0.734345
  • skill 0.72338
  • capability 0.71689
  • resource 0.678734
  • discretion 0.673192
  • task 0.670158
  • office 0.664695
  • prerogative 0.664287
  • status 0.661489
  • staff 0.659559
  • power 0.650249
  • presence 0.649828
  • money 0.646917

22
Recognizing Non-compositional Phrases (Idioms)
  • Non-compositional phrases (idioms) present a
    special challenge to NLP systems
  • Machine Translation
  • Word-by-word translation of idioms can lead to
    very misleading (even laughable) results.
  • draft beer
  • drag one's feet
  • Information retrieval
  • Words in an idiom must occur together
  • Expansion of individual words reduces precision.

23
Basic Idea of the Algorithm
  • If a dependency relationship between two words is
    an idiom (or part of an idiom), its mutual
    information is normally markedly different from
    the mutual information of the dependency
    relationships obtained by replacing either word
    with a similar word (a sketch of this check
    follows).
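A minimal sketch of that check, assuming mutual-information scores for dependency triples and similar-word lists are already computed; the threshold, relation label, and data are illustrative placeholders, and the actual algorithm's statistical test is not spelled out on the slides:

    def looks_non_compositional(mi, similar, head, rel, mod, gap=2.0):
        # Compare the MI of (head, rel, mod) with the MI of variants formed
        # by substituting similar words; a large gap suggests an idiom.
        variants = [(h, rel, mod) for h in similar.get(head, [])] + \
                   [(head, rel, m) for m in similar.get(mod, [])]
        scores = [mi[v] for v in variants if v in mi]
        if not scores:
            return False
        return mi[(head, rel, mod)] - max(scores) >= gap

    mi = {("beat", "obj", "bush"): 5.50, ("beat", "obj", "jungle"): 2.20}
    similar = {"bush": ["shrub", "tree", "jungle"]}
    print(looks_non_compositional(mi, similar, "beat", "obj", "bush"))  # True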

24
Mutual Information of a Dependency Relationship
  • A dependency relationship is a triple (head, rel,
    modifier)
  • where P(head, rel, modifier) is the probability
    of the dependency relationship, and P(head, rel)
    and P(rel, modifier) are the probabilities of the
    head and the modifier participating in relation
    rel (see the reconstructed formula below).
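The formula image on this slide was not preserved; a standard form consistent with the quantities named above (following Lin's work on non-compositional phrases) is:

    \mathrm{MI}(head, rel, mod) = \log \frac{P(head, rel, mod)\,P(rel)}{P(head, rel)\,P(rel, mod)}

where P(rel) is the probability of the relation itself; the exact formula on the slide may differ slightly.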

25
Example 1 beat the bushes
  • Similar words to beat
  • defeat 0.157, outscore 0.139, stab 0.13, "beat
    up" 0.127, assault 0.12, lead 0.118, shoot 0.118,
    rap 0.113, kill 0.106, trail 0.106
  • Similar words to bush
  • shrub 0.131, tree 0.115, vine 0.077, grass 0.076,
    Reagans 0.074, weed 0.071, "Princess Diana"
    0.069, flower 0.069, jungle 0.064, brush 0.063
  • Similar collocations to beat the bush (frequency,
    mutual information)
  • beat bush 38 5.50
  • beat jungle 1 2.20

26
Example 2 red tape
  • Similar words of red
  • blue 0.268, green 0.261, yellow 0.26, white
    0.242, pink 0.233, orange 0.224, purple 0.22,
    black 0.2, colored 0.193, gray 0.191
  • Collocations involving similar words
  • red tape 259 5.87
  • yellow tape 12 3.75
  • orange tape 2 2.64
  • black tape 9 1.07
  • colored tape 2 3.17

27
Example 3 economic impact
  • Similar words of economic
  • financial 0.305, political 0.243, social 0.219,
    fiscal 0.209, cultural 0.202, budgetary 0.2,
    technological 0.196, organizational 0.19,
    ecological 0.189, monetary 0.189
  • Similar words of impact
  • effect 0.227, implication 0.163, consequence
    0.156, significance 0.146, repercussion 0.141,
    fallout 0.141, potential 0.137, ramification
    0.129, risk 0.126, influence 0.125

28
  • economic impact 171 1.85
  • financial impact 127 1.72
  • political impact 46 0.50
  • social impact 15 0.94
  • budgetary impact 8 3.20
  • ecological impact 4 2.59
  • economic effect 84 0.70
  • economic implication 17 0.80
  • economic consequence 59 1.88
  • economic significance 10 0.84
  • economic fallout 7 1.66
  • economic repercussion 7 1.84
  • economic potential 27 1.24
  • economic ramification 8 2.19
  • economic risk 17 -0.33

29
Example Idioms Identified
  • animal NnnN party
  • animal party 30 3.68
  • assassination NnnN character
  • assassination character 35 6.87
  • table NjnabA negotiating
  • table negotiating 207 7.48
  • work NnnN paper
  • work paper 146 3.90
  • work newspaper 10 0.72

30
  • take Vcomp1N toll
  • take toll 498 3.53
  • take number 149 -0.18
  • toe Vcomp1N line
  • toe line 40 6.56
  • get Vcomp1N kick
  • get kick 84 3.48
  • get interception 8 1.10
  • get touchdown 11 0.70

31
Parser Errors
  • rising NnnN voice
  • repurchase NnnN will
  • remains NnnN fact
  • refuel Vcomp1N stop
  • question NjnabA asking
  • public NnnN going

32
  • (box) THE GOAT
  • Analyzed as [V box] [NP the goat]
  • 67 occurrences
  • Annualized average rate of return after expenses
    for the past 30 days, not a forecast of future
    returns
  • [S [NP forecast of future] [V returns]]
  • occurred 212 times

33
Idiosyncrasy in the World
  • economist NnnN Harvard
  • economist Harvard 14 5.20
  • economist university 3 1.32
  • economist Stanford 2 1.68
  • explore VsubjN philosopher
  • explore philosopher 19 7.60
  • explore writer 1 1.66
  • explore scientist 5 2.01
  • explore playwright 1 4.09
  • god NjnabA Hindu
  • god Hindu 14 87.5776 1 7.19046
  • god Buddhist 1 3.65768 0.298 4.64339
  • god Islamic 2 6.18425 0.193 4.06823