Collocation and Word Similarity - PowerPoint PPT Presentation

Provided by: csBra

Transcript and Presenter's Notes

Title: Collocation and Word Similarity


1
Collocation and Word Similarity
  • James Pustejovsky
  • CS 114

2
Collocation
  • Habitual co-occurrence of words
  • strong tea, weapons of mass destruction, make up
  • Collocations are characterized by limited
    compositionality.
  • Idioms: red tape, miss the boat, spill guts
  • stiff breeze, stiff wind

3
Term Extraction
  • The problem of extracting technical terms from a
    corpus is called term extraction.
  • The extracted terms include single words and
    collocations.

4
Frequency Counts
  • If we count the frequency of adjacent word pairs
    (bigrams), the most frequent bigrams consist
    mostly of function words:
  • 80871 of the
  • 58841 in the
  • 15494 to be
  • 13362 of a
  • 11428 New York
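
A minimal sketch of the raw bigram count described above, assuming the corpus is already tokenized. The toy sentence is only an illustration and is not the corpus behind the counts on this slide.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs (bigrams) in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))

# Toy input, not the corpus used for the slide's figures.
tokens = "the cat sat on the mat and the cat slept".split()
counts = bigram_counts(tokens)

# Print the most frequent bigrams, count first, as on the slide.
for (w1, w2), n in counts.most_common(3):
    print(n, w1, w2)
```

On a real corpus the top of this list is dominated by pairs such as "of the", which is why raw frequency alone is a poor collocation detector.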

5
Filtering by POS Pattern
  • Only count pairs that fit a predetermined
    pattern (A = adjective, N = noun)
  • A N
  • N N
  • A A N
  • A N N
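
The filter can be sketched as follows, assuming the text has already been tagged and the tags reduced to coarse labels. The tag "O" (other) and the example sentence are assumptions made for illustration; any part-of-speech tagger could supply the input.

```python
from collections import Counter

# Tag patterns listed on the slide: A = adjective, N = noun.
PATTERNS = {("A", "N"), ("N", "N"), ("A", "A", "N"), ("A", "N", "N")}

def filter_by_pattern(tagged):
    """Count word n-grams whose tag sequence matches one of PATTERNS.

    `tagged` is a list of (word, tag) pairs using coarse tags.
    """
    counts = Counter()
    for n in (2, 3):  # the patterns are two or three tags long
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            tags = tuple(tag for _, tag in window)
            if tags in PATTERNS:
                counts[tuple(word for word, _ in window)] += 1
    return counts

# Hand-tagged toy sentence; "O" stands for any tag other than A or N.
tagged = [("she", "O"), ("drank", "O"), ("strong", "A"), ("tea", "N"),
          ("in", "O"), ("New", "N"), ("York", "N")]
candidates = filter_by_pattern(tagged)
```

Here only ("strong", "tea") and ("New", "York") survive; pairs such as ("in", "New") are dropped because their tag sequence matches no pattern.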

6
Measure the Strength of Association
  • Pointwise Mutual Information
  • compares the observed frequency with expected
    frequency
  • Other measures
  • Log likelihood ratio
  • t-test
  • χ² test
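
Pointwise mutual information compares the observed probability of a pair with the probability expected if the two words were independent: PMI(x, y) = log2(P(x, y) / (P(x) P(y))). A minimal sketch, assuming maximum-likelihood estimates from raw counts; the counts in the example are made up for illustration.

```python
import math

def pmi(pair_count, x_count, y_count, total):
    """Pointwise mutual information in bits, estimated from raw counts.

    pair_count: number of times x and y occur together
    x_count, y_count: number of times each word occurs
    total: number of observations (for example, bigrams) in the corpus
    """
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: the pair occurs 1000 times more often than chance.
score = pmi(pair_count=20, x_count=100, y_count=200, total=1_000_000)
```

A score near zero means the pair occurs about as often as independence predicts; large positive scores indicate strong association.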

7
Computing Word Similarity
  • Learning the meaning of new words
  • A bottle of tezguino is on the table
  • Everyone likes tezguino
  • Tezguino makes you drunk
  • We make tezguino out of corn
  • Finding similar words is an initial step in
    automatic acquisition of meaning

8
Distributional Hypothesis
  • Zellig Harris (1968)
  • Meanings of words are determined to a large
    extent by their distributional patterns.
  • Learning the meaning of a word is dependent, at
    least in part, on exposure to the word in the
    linguistic contexts of its use.

9
Automatic Thesaurus Construction
  • Corpus- and genre-specific
  • Text corpus itself is of much less help in
    thesaurus construction than in dictionary
    compilation
  • Capture similarity that is not usually captured
    by manually constructed thesauri. For example
  • ...who faced a number of charges such as
    prostitution and drugs

10
Similarity-based Smoothing
  • Example
  • He said the CIA had seen spectacular threat
    reporting about massive casualties in the United
    States in the spring and summer last year.
  • Suppose we have not seen spectacular threat
    before.
  • We should not say its probability is 0.
  • Consider the frequency counts of similar words
  • significant threat 8
  • bizarre threat 1
  • tremendous threat 2

11
Similar Words
  • Distributional Hypothesis
  • If two words have similar sets of collocations,
    they are probably similar.
  • Representing words as feature vectors
  • Each collocation is a feature
  • We used an information-theoretic similarity
    measure (many others are possible too).

12
Example: absurd vs. preposterous
  • Subjects of preposterous: accusation 1,
    allegation 8, disparity 1, idea 4, offer 2,
    packaging 1, reviewer 1, route 1, suggestion 1,
    that 5, thought 1
  • Subjects of absurd: allegation 5, argument 2,
    charge 4, claim 2, idea 4, it 48, notion 2,
    statement 3, suggestion 2, that 14, thinking 1

13
Similarity between Feature Vectors
  • Given two objects o1 and o2, which are
    represented as two feature vectors, how to
    compute the similarity between the two objects?

14
Distance-based Measure
15
Jaccard's Coefficient
16
Cosine Coefficient
17
Dice Coefficient
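
The four measures named above have standard textbook definitions. A minimal sketch follows, assuming Jaccard and Dice are computed over the sets of features that occur at all, cosine over raw count vectors, and Euclidean distance as the distance-based measure; the slides may have used other variants. The data are the subject counts for "absurd" and "preposterous" from the earlier example.

```python
import math

preposterous = {"accusation": 1, "allegation": 8, "disparity": 1, "idea": 4,
                "offer": 2, "packaging": 1, "reviewer": 1, "route": 1,
                "suggestion": 1, "that": 5, "thought": 1}
absurd = {"allegation": 5, "argument": 2, "charge": 4, "claim": 2, "idea": 4,
          "it": 48, "notion": 2, "statement": 3, "suggestion": 2, "that": 14,
          "thinking": 1}

def jaccard(u, v):
    """Shared features divided by all features seen with either word."""
    return len(set(u) & set(v)) / len(set(u) | set(v))

def dice(u, v):
    """Twice the shared features divided by the sum of the feature counts."""
    return 2 * len(set(u) & set(v)) / (len(u) + len(v))

def cosine(u, v):
    """Cosine of the angle between the two count vectors."""
    dot = sum(u[f] * v[f] for f in set(u) & set(v))
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    """Euclidean distance; a missing feature counts as 0."""
    features = set(u) | set(v)
    return math.sqrt(sum((u.get(f, 0) - v.get(f, 0)) ** 2 for f in features))

for name, fn in [("jaccard", jaccard), ("dice", dice),
                 ("cosine", cosine), ("euclidean", euclidean)]:
    print(name, round(fn(absurd, preposterous), 3))
```

The two words share four features (allegation, idea, suggestion, that) out of eighteen distinct ones. Note how the single large count for "it" dominates the norm of the "absurd" vector in the cosine and Euclidean measures.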
18
Some Problems
  • Some features are more important than others
  • that is absurd vs. that is preposterous
  • the idea is absurd vs. the idea is preposterous
  • The raw frequency count is not a good feature
  • it is absurd occurred 48 times
  • the idea is absurd occurred 4 times.

19
A Better Representation
Represent object j by the vector (m_1j, ..., m_nj),
where m_ij is the pointwise mutual information
between feature i and object j
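
A sketch of how raw counts could be turned into such weights, assuming maximum-likelihood probability estimates. The counts are a subset of the "absurd" and "preposterous" subject counts; with only two words the resulting weights are not meaningful, and a real system would estimate them from counts over the whole vocabulary.

```python
import math
from collections import defaultdict

def pmi_vectors(counts):
    """Replace raw counts[word][feature] with PMI(word, feature) in bits."""
    total = sum(c for feats in counts.values() for c in feats.values())
    word_totals = {w: sum(feats.values()) for w, feats in counts.items()}
    feature_totals = defaultdict(int)
    for feats in counts.values():
        for f, c in feats.items():
            feature_totals[f] += c
    # PMI = log2( P(w, f) / (P(w) P(f)) ) = log2( c * total / (n_w * n_f) )
    return {
        w: {f: math.log2(c * total / (word_totals[w] * feature_totals[f]))
            for f, c in feats.items()}
        for w, feats in counts.items()
    }

counts = {
    "absurd": {"it": 48, "idea": 4, "that": 14, "allegation": 5},
    "preposterous": {"idea": 4, "that": 5, "allegation": 8},
}
weights = pmi_vectors(counts)
```

Only features that were actually observed with a word receive a weight; unseen pairs are left out of the vector rather than assigned minus infinity.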
20
Example: suit
Mutual Information
  • lawsuit 0.259631
  • jacket 0.171716
  • shirt 0.158707
  • trouser 0.149233
  • coat 0.145872
  • blouse 0.139
  • sweater 0.138804
  • blazer 0.138382
  • dress 0.138045
  • pant 0.137097
  • uniform 0.128672
  • "business suit" 0.127607
  • tuxedo 0.127156
Frequency Count
  • lawsuit 0.909332
  • petition 0.793921
  • tuxedo 0.748025
  • "trench coat" 0.744021
  • "straw hat" 0.737379
  • "baseball cap" 0.735842
  • wig 0.735192
  • sweater 0.73493
  • "top hat" 0.734683
  • turban 0.730693
  • "polo shirt" 0.730284
  • T-shirt 0.729481
  • "business suit" 0.729432

21
Example: duty
Mutual Information
  • responsibility 0.154056
  • obligation 0.134577
  • task 0.120646
  • "customs duty" 0.119004
  • "import duty" 0.117061
  • tariff 0.107041
  • job 0.102759
  • chore 0.0933882
  • "military service" 0.0933471
  • "excise tax" 0.0933373
  • tax 0.0867553
  • assignment 0.0862838
  • role 0.0859558
Frequency Count
  • destiny 0.734345
  • skill 0.72338
  • capability 0.71689
  • resource 0.678734
  • discretion 0.673192
  • task 0.670158
  • office 0.664695
  • prerogative 0.664287
  • status 0.661489
  • staff 0.659559
  • power 0.650249
  • presence 0.649828
  • money 0.646917

22
Recognizing Non-compositional Phrases (Idioms)
  • Non-compositional phrases (idioms) present a
    special challenge to NLP systems
  • Machine Translation
  • Word-by-word translation of idioms can lead to
    very misleading (even laughable) results.
  • draft beer
  • drag one's feet
  • Information retrieval
  • Words in an idiom must occur together
  • Expansion of individual words reduces precision.

23
Basic Idea of the Algorithm
  • If a dependency relationship between two words is
    an idiom (or part of an idiom), its mutual
    information is normally different from the mutual
    information of other dependency relationships
    involving similar words (literally similar
    dependency relationships).

24
Mutual Information of a Dependency Relationship
  • A dependency relationship is a triple (head, rel,
    modifier)
  • where P(head, rel, modifier) is the probability
    of the dependency relationship. P(head, rel) and
    P(rel, modifier) are the probabilities of the
    head and the modifier participating in relation
    rel.
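
The formula itself is not part of this transcript. One common formulation, assumed here, treats head and modifier as conditionally independent given the relation, which gives MI = log2(P(head, rel, modifier) P(rel) / (P(head, rel) P(rel, modifier))). This form also needs P(rel), which the slide text does not mention, so treat it as an assumption. The counts in the example are hypothetical.

```python
import math

def triple_mi(n_hrm, n_hr, n_rm, n_r):
    """Mutual information (bits) of a (head, rel, modifier) triple.

    n_hrm: count of the full triple
    n_hr:  count of triples with this head and relation
    n_rm:  count of triples with this relation and modifier
    n_r:   count of all triples with this relation
    With maximum-likelihood estimates the corpus size cancels out,
    so only these four counts are needed.
    """
    return math.log2(n_hrm * n_r / (n_hr * n_rm))

# Hypothetical counts: the triple is 8 times more frequent than expected.
mi = triple_mi(n_hrm=10, n_hr=100, n_rm=50, n_r=4000)
```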

25
Example 1: beat the bushes
  • Similar words to beat
  • defeat 0.157, outscore 0.139, stab 0.13, "beat
    up" 0.127, assault 0.12, lead 0.118, shoot 0.118,
    rap 0.113, kill 0.106, trail 0.106
  • Similar words to bush
  • shrub 0.131, tree 0.115, vine 0.077, grass 0.076,
    Reagans 0.074, weed 0.071, "Princess Diana"
    0.069, flower 0.069, jungle 0.064, brush 0.063
  • Similar collocations to beat the bush
    (frequency, mutual information)
  • beat bush 38 5.50
  • beat jungle 1 2.20

26
Example 2: red tape
  • Similar words to red
  • blue 0.268, green 0.261, yellow 0.26, white
    0.242, pink 0.233, orange 0.224, purple 0.22,
    black 0.2, colored 0.193, gray 0.191
  • Collocations involving similar words
    (frequency, mutual information)
  • red tape 259 5.87
  • yellow tape 12 3.75
  • orange tape 2 2.64
  • black tape 9 1.07
  • colored tape 2 3.17
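
The comparison can be sketched as follows. This is a simplification: it flags a phrase when its mutual information differs from that of every substitute collocation by more than a fixed margin, whereas a full system would more plausibly use a statistical significance test. The function name and the margin value are assumptions; the numbers come from the examples in these slides.

```python
def looks_non_compositional(phrase_mi, substitute_mis, margin=1.0):
    """True if phrase_mi differs from every substitute's MI by more than margin."""
    return all(abs(phrase_mi - mi) > margin for mi in substitute_mis)

# Mutual information values from the "red tape" slide.
red_tape = 5.87
substitutes = {"yellow tape": 3.75, "orange tape": 2.64,
               "black tape": 1.07, "colored tape": 3.17}

flag = looks_non_compositional(red_tape, substitutes.values())
```

With these values "red tape" is flagged, because no colour substitute comes within the margin. A compositional phrase such as "economic impact" (1.85) is not flagged, since "financial impact" (1.72) lies close to it.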

27
Example 3: economic impact
  • Similar words to economic
  • financial 0.305, political 0.243, social 0.219,
    fiscal 0.209, cultural 0.202, budgetary 0.2,
    technological 0.196, organizational 0.19,
    ecological 0.189, monetary 0.189
  • Similar words to impact
  • effect 0.227, implication 0.163, consequence
    0.156, significance 0.146, repercussion 0.141,
    fallout 0.141, potential 0.137, ramification
    0.129, risk 0.126, influence 0.125

28
  • economic impact 171 1.85
  • financial impact 127 1.72
  • political impact 46 0.50
  • social impact 15 0.94
  • budgetary impact 8 3.20
  • ecological impact 4 2.59
  • economic effect 84 0.70
  • economic implication 17 0.80
  • economic consequence 59 1.88
  • economic significance 10 0.84
  • economic fallout 7 1.66
  • economic repercussion 7 1.84
  • economic potential 27 1.24
  • economic ramification 8 2.19
  • economic risk 17 -0.33

29
Example Idioms Identified
  • animal N:nn:N party
  • animal party 30 3.68
  • assassination N:nn:N character
  • assassination character 35 6.87
  • table N:jnab:A negotiating
  • table negotiating 207 7.48
  • work N:nn:N paper
  • work paper 146 3.90
  • work newspaper 10 0.72

30
  • take V:comp1:N toll
  • take toll 498 3.53
  • take number 149 -0.18
  • toe V:comp1:N line
  • toe line 40 6.56
  • get V:comp1:N kick
  • get kick 84 3.48
  • get interception 8 1.10
  • get touchdown 11 0.70

31
Parser Errors
  • rising N:nn:N voice
  • repurchase N:nn:N will
  • remains N:nn:N fact
  • refuel V:comp1:N stop
  • question N:jnab:A asking
  • public N:nn:N going

32
  • (box) THE GOAT
  • Analyzed as v box np the goat
  • 67 occurrences
  • Annualized average rate of return after expenses
    for the past 30 days, not a forecast of future
    returns
  • s np forecast of future v returns
  • occurred 212 times

33
Idiosyncrasy in the World
  • economist N:nn:N Harvard
  • economist Harvard 14 5.20
  • economist university 3 1.32
  • economist Stanford 2 1.68
  • explore V:subj:N philosopher
  • explore philosopher 19 7.60
  • explore writer 1 1.66
  • explore scientist 5 2.01
  • explore playwright 1 4.09
  • god N:jnab:A Hindu
  • god Hindu 14 87.5776 1 7.19046
  • god Buddhist 1 3.65768 0.298 4.64339
  • god Islamic 2 6.18425 0.193 4.06823