Title: Collocation and Word Similarity
1 Collocation and Word Similarity
2 Collocation
- Habitual co-occurrence of words
- strong tea, weapons of mass destruction, make up
- Collocations are characterized by limited compositionality.
- Idioms: red tape, miss the boat, spill guts
- stiff breeze, stiff wind
3 Term Extraction
- The problem of extracting technical terms from a corpus is called term extraction.
- The extracted terms include single words and collocations.
4 Frequency Counts
- If we count the frequency of adjacent word pairs (bigrams), the most frequent bigrams consist mostly of function words
- 80871 of the
- 58841 in the
- 15494 to be
- 13362 of a
- 11428 New York
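The counting step above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and a made-up toy sentence (the corpus behind the counts on the slide is not given); `bigram_counts` is a hypothetical helper name.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs (bigrams) in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))

# Toy text, invented for this sketch; a real run would use a tokenized corpus.
text = ("the mayor of the city of New York said the state of the economy "
        "in the city is good")
tokens = text.split()
counts = bigram_counts(tokens)

# Print the most frequent bigrams in the same "count word1 word2" layout as the slide.
for (w1, w2), n in counts.most_common(3):
    print(n, w1, w2)
```

Even in this tiny sample, a function-word pair such as "of the" is among the most frequent bigrams, which is why raw frequency alone is a poor collocation filter.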
5 Filtering by POS Pattern
- Only count word sequences that fit a predetermined part-of-speech pattern (A = adjective, N = noun)
- A N
- N N
- A A N
- A N N
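A rough sketch of the filter, assuming the text has already been tagged and that tags are reduced to one-letter codes (A, N, and others). The tagset, the toy sentence, and the name `filter_by_pos` are assumptions made for illustration only.

```python
from collections import Counter

# The four patterns from the slide, written as tag tuples.
PATTERNS = {("A", "N"), ("N", "N"), ("A", "A", "N"), ("A", "N", "N")}

def filter_by_pos(tagged):
    """Count word sequences whose tag sequence matches one of PATTERNS.

    `tagged` is a list of (word, tag) pairs using a simplified,
    hypothetical one-letter tagset.
    """
    counts = Counter()
    for n in (2, 3):  # the patterns are two or three tags long
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            tags = tuple(tag for _, tag in window)
            if tags in PATTERNS:
                counts[tuple(word for word, _ in window)] += 1
    return counts

# Hand-tagged toy input (D = determiner, P = preposition).
tagged = [("the", "D"), ("strong", "A"), ("tea", "N"), ("of", "P"),
          ("the", "D"), ("New", "N"), ("York", "N"),
          ("stock", "N"), ("exchange", "N")]
candidates = filter_by_pos(tagged)
print(candidates)
```

Function-word pairs such as "of the" never match a pattern, so they drop out before any frequency ranking is done.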
6 Measure the Strength of Association
- Pointwise Mutual Information
- compares the observed frequency with the frequency expected if the words occurred independently
- Other measures
- Log likelihood ratio
- t-test
- χ² test
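As a minimal sketch, pointwise mutual information can be computed from counts as log2 of P(x, y) / (P(x) P(y)). The function name `pmi`, the base-2 logarithm, and the counts in the usage lines are assumptions for illustration, not figures from the slides.

```python
import math

def pmi(pair_count, x_count, y_count, total):
    """Pointwise mutual information (base 2) of a word pair.

    Compares the observed probability of the pair with the probability
    expected if the two words occurred independently.
    """
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: the pair occurs 100 times more often than chance predicts.
strong_association = pmi(pair_count=50, x_count=1000, y_count=500, total=1_000_000)

# Hypothetical counts: the pair occurs exactly as often as independence predicts.
no_association = pmi(pair_count=1, x_count=1000, y_count=1000, total=1_000_000)

print(strong_association, no_association)
```

A value near zero means the pair is about as frequent as chance predicts; large positive values indicate a strong association.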
7 Computing Word Similarity
- Learning the meaning of new words
- A bottle of tezguino is on the table
- Everyone likes tezguino
- Tezguino makes you drunk
- We make tezguino out of corn
- Finding similar words is an initial step in
automatic acquisition of meaning
8 Distributional Hypothesis
- Zellig Harris (1968)
- Meanings of words are determined to a large extent by their distributional patterns.
- Learning the meaning of a word is dependent, at least in part, on exposure to the word in the linguistic contexts of its use.
9 Automatic Thesaurus Construction
- Corpus- and genre-specific
- Text corpus itself is of much less help in thesaurus construction than in dictionary compilation
- Capture similarity that is not usually captured by manually constructed thesauri. For example:
- ...who faced a number of charges such as prostitution and drugs
10 Similarity-based Smoothing
- Example
- He said the CIA had seen spectacular threat reporting about massive casualties in the United States in the spring and summer last year.
- Suppose we have not seen spectacular threat before.
- We should not say its probability is 0.
- Consider the frequency counts of similar words
- significant threat 8
- bizarre threat 1
- tremendous threat 2
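One possible way to turn this idea into a number is a similarity-weighted average of the counts of the similar pairs. The sketch below uses the three counts from the slide, but the similarity scores, the weighting scheme, and the name `smoothed_count` are assumptions; the slides do not say which smoothing formula was used.

```python
def smoothed_count(word, head, pair_counts, similar):
    """Estimate a count for the unseen pair (word, head) as the
    similarity-weighted average of counts of (w, head), where w ranges
    over the words listed as similar to `word`."""
    neighbours = similar[word]
    total_weight = sum(neighbours.values())
    weighted = sum(sim * pair_counts.get((w, head), 0)
                   for w, sim in neighbours.items())
    return weighted / total_weight

# Counts taken from the slide.
pair_counts = {
    ("significant", "threat"): 8,
    ("bizarre", "threat"): 1,
    ("tremendous", "threat"): 2,
}

# Hypothetical similarity scores, invented for this sketch.
similar = {"spectacular": {"significant": 0.2, "bizarre": 0.1, "tremendous": 0.2}}

estimate = smoothed_count("spectacular", "threat", pair_counts, similar)
print(estimate)
```

The estimate is nonzero even though "spectacular threat" itself was never observed, which is the point of the slide.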
11 Similar Words
- Distributional Hypothesis
- If two words have similar sets of collocations, they are probably similar.
- Representing words as feature vectors
- Each collocation is a feature
- We used an information-theoretic similarity measure (many others are possible too).
12 Example: absurd vs. preposterous
Subjects of preposterous: accusation 1, allegation 8, disparity 1, idea 4, offer 2, packaging 1, reviewer 1, route 1, suggestion 1, that 5, thought 1
Subjects of absurd: allegation 5, argument 2, charge 4, claim 2, idea 4, it 48, notion 2, statement 3, suggestion 2, that 14, thinking 1
13 Similarity between Feature Vectors
- Given two objects o1 and o2, which are represented as two feature vectors, how do we compute the similarity between the two objects?
14 Distance-based Measure
15 Jaccard's Coefficient
16 Cosine Coefficient
17 Dice Coefficient
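The four measures named on slides 14 to 17 can be sketched with their standard textbook definitions, applied to the subject counts of absurd and preposterous from slide 12. Whether the slides used exactly these variants is an assumption: here Jaccard and Dice treat each word as a set of features (ignoring counts), while the distance and cosine measures use the raw counts.

```python
import math

# Feature vectors from slide 12: subject -> frequency.
preposterous = {"accusation": 1, "allegation": 8, "disparity": 1, "idea": 4,
                "offer": 2, "packaging": 1, "reviewer": 1, "route": 1,
                "suggestion": 1, "that": 5, "thought": 1}
absurd = {"allegation": 5, "argument": 2, "charge": 4, "claim": 2, "idea": 4,
          "it": 48, "notion": 2, "statement": 3, "suggestion": 2, "that": 14,
          "thinking": 1}

def euclidean_distance(u, v):
    """Distance-based measure: smaller means more similar."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0) - v.get(k, 0)) ** 2 for k in keys))

def jaccard(u, v):
    """Shared features divided by all features of either object."""
    a, b = set(u), set(v)
    return len(a & b) / len(a | b)

def dice(u, v):
    """Twice the shared features divided by the sum of the feature counts."""
    a, b = set(u), set(v)
    return 2 * len(a & b) / (len(a) + len(b))

def cosine(u, v):
    """Cosine of the angle between the two count vectors."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

for measure in (euclidean_distance, jaccard, dice, cosine):
    print(measure.__name__, measure(absurd, preposterous))
```

The two words share four of their subjects (allegation, idea, suggestion, that), so the set-based Jaccard value is 4/18 and the Dice value is 8/22.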
18 Some Problems
- Some features are more important than others
- that is absurd vs. that is preposterous
- the idea is absurd vs. the idea is preposterous
- The raw frequency count is not a good feature
- it is absurd occurred 48 times
- the idea is absurd occurred 4 times.
19 A Better Representation
- Represent object j by the values m_ij, where m_ij is the pointwise mutual information between feature i and the object j
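A minimal sketch of this reweighting, assuming m_ij is estimated from co-occurrence counts as log2(c_ij N / (c_i c_j)). The counts for absurd and preposterous are a subset of those on slide 12; the third word, "good", and its counts are invented so that the feature totals are not dominated by two words. `pmi_vectors` is a hypothetical helper name.

```python
import math

def pmi_vectors(counts):
    """Replace each raw count c_ij with the PMI between feature i and object j."""
    total = sum(sum(feats.values()) for feats in counts.values())
    object_totals = {obj: sum(feats.values()) for obj, feats in counts.items()}
    feature_totals = {}
    for feats in counts.values():
        for feature, c in feats.items():
            feature_totals[feature] = feature_totals.get(feature, 0) + c
    return {
        obj: {feature: math.log2(c * total /
                                 (feature_totals[feature] * object_totals[obj]))
              for feature, c in feats.items()}
        for obj, feats in counts.items()
    }

counts = {
    "absurd": {"it": 48, "idea": 4, "that": 14},   # from slide 12
    "preposterous": {"idea": 4, "that": 5},        # from slide 12
    "good": {"it": 500, "idea": 6, "that": 150},   # hypothetical background word
}
vectors = pmi_vectors(counts)
print(vectors["absurd"])
```

With these numbers, "idea" gets a higher weight than "it" for absurd even though "it" has the far larger raw count, which is the behaviour the previous slide asks for.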
20 Example: suit
Mutual Information          | Frequency Count
lawsuit 0.259631            | lawsuit 0.909332
jacket 0.171716             | petition 0.793921
shirt 0.158707              | tuxedo 0.748025
trouser 0.149233            | "trench coat" 0.744021
coat 0.145872               | "straw hat" 0.737379
blouse 0.139                | "baseball cap" 0.735842
sweater 0.138804            | wig 0.735192
blazer 0.138382             | sweater 0.73493
dress 0.138045              | "top hat" 0.734683
pant 0.137097               | turban 0.730693
uniform 0.128672            | "polo shirt" 0.730284
"business suit" 0.127607    | T-shirt 0.729481
tuxedo 0.127156             | "business suit" 0.729432
21 Example: duty
Mutual Information            | Frequency Count
responsibility 0.154056       | destiny 0.734345
obligation 0.134577           | skill 0.72338
task 0.120646                 | capability 0.71689
"customs duty" 0.119004       | resource 0.678734
"import duty" 0.117061        | discretion 0.673192
tariff 0.107041               | task 0.670158
job 0.102759                  | office 0.664695
chore 0.0933882               | prerogative 0.664287
"military service" 0.0933471  | status 0.661489
"excise tax" 0.0933373        | staff 0.659559
tax 0.0867553                 | power 0.650249
assignment 0.0862838          | presence 0.649828
role 0.0859558                | money 0.646917
22 Recognizing Non-compositional Phrases (Idioms)
- Non-compositional phrases (idioms) present a special challenge to NLP systems
- Machine Translation
- Word-by-word translation of idioms can lead to very misleading (even laughable) results.
- draft beer
- drag one's feet
- Information retrieval
- Words in an idiom must occur together
- Expansion of individual words reduces precision.
23 Basic Idea of the Algorithm
- If a dependency relationship between two words is
an idiom (or part of an idiom), its mutual
information is normally different from the mutual
information of other dependency relationships
involving similar words (literally similar
dependency relationships).
24 Mutual Information of a Dependency Relationship
- A dependency relationship is a triple (head, rel, modifier)
- Its mutual information is defined in terms of P(head, rel, modifier), the probability of the dependency relationship, and of P(head, rel) and P(rel, modifier), the probabilities of head and modifier participating in relation rel.
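One way to compute such a value from counts is sketched below. It assumes a null model in which head and modifier are independent given the relation, so that mi = log2(P(h, r, m) / (P(h | r) P(m | r) P(r))); this particular formulation, the base-2 logarithm, the name `triple_mi`, and all counts are assumptions for illustration and may differ from the formula on the original slide.

```python
import math

def triple_mi(n_hrm, n_hr, n_rm, n_r):
    """Mutual information of a dependency triple (head, rel, modifier).

    n_hrm: count of the full triple
    n_hr:  count of head participating in rel (any modifier)
    n_rm:  count of modifier participating in rel (any head)
    n_r:   count of all triples with relation rel

    With P(h|r) = n_hr / n_r, P(m|r) = n_rm / n_r and P(r) = n_r / N,
    the corpus size N cancels and the ratio reduces to
    n_hrm * n_r / (n_hr * n_rm).
    """
    return math.log2(n_hrm * n_r / (n_hr * n_rm))

# Hypothetical counts: the triple is 100 times more frequent than the null model predicts.
associated = triple_mi(n_hrm=40, n_hr=200, n_rm=100, n_r=50_000)

# Hypothetical counts: the triple is exactly as frequent as the null model predicts.
independent = triple_mi(n_hrm=4, n_hr=200, n_rm=100, n_r=5_000)

print(associated, independent)
```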
25 Example 1: beat the bushes
- Similar words to beat
- defeat 0.157, outscore 0.139, stab 0.13, "beat up" 0.127, assault 0.12, lead 0.118, shoot 0.118, rap 0.113, kill 0.106, trail 0.106
- Similar words to bush
- shrub 0.131, tree 0.115, vine 0.077, grass 0.076, Reagans 0.074, weed 0.071, "Princess Diana" 0.069, flower 0.069, jungle 0.064, brush 0.063
- Similar collocations to beat the bush
- beat bush 38 5.50
- beat jungle 1 2.20
26 Example 2: red tape
- Similar words of red
- blue 0.268, green 0.261, yellow 0.26, white 0.242, pink 0.233, orange 0.224, purple 0.22, black 0.2, colored 0.193, gray 0.191
- Collocations involving similar words
- red tape 259 5.87
- yellow tape 12 3.75
- orange tape 2 2.64
- black tape 9 1.07
- colored tape 2 3.17
27 Example 3: economic impact
- Similar words of economic
- financial 0.305, political 0.243, social 0.219, fiscal 0.209, cultural 0.202, budgetary 0.2, technological 0.196, organizational 0.19, ecological 0.189, monetary 0.189
- Similar words of impact
- effect 0.227, implication 0.163, consequence 0.156, significance 0.146, repercussion 0.141, fallout 0.141, potential 0.137, ramification 0.129, risk 0.126, influence 0.125
28 Example 3: economic impact (continued)
- economic impact 171 1.85
- financial impact 127 1.72
- political impact 46 0.50
- social impact 15 0.94
- budgetary impact 8 3.20
- ecological impact 4 2.59
- economic effect 84 0.70
- economic implication 17 0.80
- economic consequence 59 1.88
- economic significance 10 0.84
- economic fallout 7 1.66
- economic repercussion 7 1.84
- economic potential 27 1.24
- economic ramification 8 2.19
- economic risk 17 -0.33
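The contrast between the red tape and economic impact examples can be sketched as a simple check. It assumes that the last column in the lists is the mutual information value (with the frequency before it) and uses a fixed margin as a stand-in decision rule; the slides only say the value is "normally different", so the margin of 1.0 and the name `looks_non_compositional` are hypothetical.

```python
def looks_non_compositional(mi, similar_mis, margin=1.0):
    """Flag a collocation when its mutual information differs from that of
    every collocation built from similar words by more than `margin`.

    `margin` is a hypothetical threshold chosen for this sketch.
    Returns False when there is nothing to compare against.
    """
    return bool(similar_mis) and all(abs(mi - other) > margin
                                     for other in similar_mis)

# Values copied from the red tape example.
red_tape = 5.87
red_tape_similar = [3.75, 2.64, 1.07, 3.17]

# Values copied from the economic impact example.
economic_impact = 1.85
economic_impact_similar = [1.72, 0.50, 0.94, 3.20, 2.59, 0.70, 0.80,
                           1.88, 0.84, 1.66, 1.84, 1.24, 2.19, -0.33]

print(looks_non_compositional(red_tape, red_tape_similar))
print(looks_non_compositional(economic_impact, economic_impact_similar))
```

Under this rule red tape is flagged, because no color substitute comes close to 5.87, while economic impact is not, because several substitutes such as financial impact (1.72) and economic consequence (1.88) have nearly the same value.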
29 Example Idioms Identified
- animal N:nn:N party
- animal party 30 3.68
- assassination N:nn:N character
- assassination character 35 6.87
- table N:jnab:A negotiating
- table negotiating 207 7.48
- work N:nn:N paper
- work paper 146 3.90
- work newspaper 10 0.72
30 Example Idioms Identified (continued)
- take V:comp1:N toll
- take toll 498 3.53
- take number 149 -0.18
- toe V:comp1:N line
- toe line 40 6.56
- get V:comp1:N kick
- get kick 84 3.48
- get interception 8 1.10
- get touchdown 11 0.70
31 Parser Errors
- rising N:nn:N voice
- repurchase N:nn:N will
- remains N:nn:N fact
- refuel V:comp1:N stop
- question N:jnab:A asking
- public N:nn:N going
32 Parser Errors (continued)
- (box) THE GOAT
- Analyzed as v box np the goat
- 67 occurrences
- Annualized average rate of return after expenses for the past 30 days, not a forecast of future returns
- s np forecast of future v returns
- occurred 212 times
33 Idiosyncrasy in the World
- economist N:nn:N Harvard
- economist Harvard 14 5.20
- economist university 3 1.32
- economist Stanford 2 1.68
- explore V:subj:N philosopher
- explore philosopher 19 7.60
- explore writer 1 1.66
- explore scientist 5 2.01
- explore playwright 1 4.09
- god N:jnab:A Hindu
- god Hindu 14 87.5776 1 7.19046
- god Buddhist 1 3.65768 0.298 4.64339
- god Islamic 2 6.18425 0.193 4.06823