Title: COMP791A: Statistical Language Processing
1. COMP791A: Statistical Language Processing
- Word Sense Disambiguation
- Chap. 7
2. Overview of the problem
- Many words have several meanings or senses (homonyms or polysemous words)
  - Ex: chair → furniture or person
  - Ex: dishes → plates or food
- Need to determine which sense of a word is used in a specific sentence
- Note:
  - often, the different senses of a word are closely related
    - Ex: title → right of legal ownership, document that is evidence of the legal ownership, name of a work, ...
  - often, several senses can be activated in a single context (co-activation)
    - Ex: This could bring competition to the trade
    - competition → the act of competing AND the people who are competing
3. Word Sense Disambiguation (WSD)
- To determine which of the senses of an ambiguous word is invoked in a particular use of the word
- Potentially extremely useful problem
  - Ex: in machine translation
    - chair → (person) directeur
    - chair → (furniture) chaise
    - bureau → desk
    - bureau → office
- Can be done
  - with rule-based methods
  - with statistical methods
4. WordNet
- most widely used lexical database for English
- free!
- G. Miller at Princeton (www.cogsci.princeton.edu/wn)
- used in many applications of NLP
- EuroWordNet
  - Dutch, Italian, Spanish, German, French, Czech and Estonian
- includes entries for open-class words only (nouns, verbs, adjectives, adverbs)
5. WordNet Entries
- in WordNet 1.6 (now 2.0)
- 118,000 different word forms
- organized according to their meanings (senses)
- each entry has
  - a dictionary-style definition (gloss) of each sense
  - AND a set of domain-independent lexical relations among WordNet's entries (words) and senses
- senses are grouped into synsets (i.e., sets of synonyms)
6. Example 1: WordNet entry for the verb "serve"
7. Rule-based WSD
- They served green-lipped mussels from New Zealand.
- Which airlines serve Denver?
- semantic restrictions on the predicate of an argument:
  - argument mussels → needs a predicate with the sense provide-food → sense 6 of WordNet
  - argument Denver → needs a predicate with the sense attend-to → sense 10 of WordNet
8. Example 2: WordNet entry for "dish"
9. Rule-based WSD
- In our house, everybody has a career and none of them includes washing dishes.
- In her tiny kitchen, Ms. Chen works efficiently, stir-frying several simple dishes, including braised pig's ears and chicken livers with green peppers.
- semantic restrictions on the argument of a predicate:
  - predicate wash → needs an argument with the sense object → senses 1, 2 or 6 of WordNet
  - predicate stir-fry → needs an argument with the sense food → sense 2 of WordNet
10. Problem with rule-based WSD
- In some cases, the constraints on the predicate and on the argument are not enough to pinpoint one unique sense
  - ex: What kind of dishes do you recommend?
- Figures of speech
  - the meanings of words can be generated dynamically, instead of being fixed and stored in a lexicon or set of selectional restrictions
  - Ex: metaphor, metonymy
11. Problem with rule-based WSD (cont.)
- Metaphor
  - using words/phrases whose meanings are appropriate to different kinds of concepts, suggesting a likeness or analogy between them
  - "This deal does not scare Microsoft."
    - scare has 2 senses in WordNet: to cause fear, and to cause to lose courage
    - metaphor: the corporation is viewed as a person
  - "She is drowning in money."
    - metaphor: money is viewed as a liquid
12. Problem with rule-based WSD (cont.)
- Metonymy
  - referring to a concept by naming some other concept closely related to it
  - "We await word from the crown."
    - a monarch is not the same thing as a crown, but we often refer to the monarch as "the crown" because the two are associated
    - metonymy: the crown refers to the monarch
  - "The White House had no comment."
    - metonymy: the White House refers to the administration
13. WSD versus POS tagging
- butter can be a verb or a noun
  - I should butter my toast.
  - I like butter on my toast.
  - 2 different POS → 2 different usages with 2 different meanings
- So WSD can be viewed as POS tagging (classifying using semantic tags rather than POS tags)
- But the 2 tasks are considered different because:
  - nearby structural cues (ex: is the previous word a determiner?)
    - are important in POS tagging
    - are not effective for WSD
  - distant content words
    - are very effective for WSD
    - are not interesting for POS tagging
- So:
  - in POS tagging, we typically only look at the local context
  - in WSD, we use content words in a larger context
14. Approaches to Statistical WSD
- Supervised Disambiguation
  - based on a labeled training set
  - the learning system has a training set of feature-encoded inputs AND their appropriate sense labels (categories)
- Based on Lexical Resources
  - use of external lexical resources such as dictionaries and thesauri
- Discourse properties
- Unsupervised Disambiguation
  - based on unlabeled corpora
  - the learning system has a training set of feature-encoded inputs BUT NOT their appropriate sense labels (categories)
15. Approaches to Statistical WSD
- → Supervised Disambiguation
  - Naïve Bayes
  - Decision Trees
- Use of Lexical Resources
  - Dictionary-based
  - Thesaurus-based
  - Translation-based
- Discourse properties
- Unsupervised Disambiguation
16. Supervised WSD: Overview
- A word is assumed to have a finite number of discrete senses.
- The sense of a word depends on the senses of the surrounding words
  - ex: bass → fish, musical instrument, ...
17. Supervised WSD: Overview (cont.)
- WSD is viewed as a typical classification problem
  - use machine learning techniques to train a system that learns a classifier (a function f) to assign unseen examples one of a fixed number of senses (categories)
  - f(input) = correct sense
- Input:
  - Target word: the word to be disambiguated
  - Context (feature vector): a vector of relevant linguistic features that represents the target word's context (ex: a window of words around the target word)
18. Examples of Feature Vectors
- Take a window of n words around the target word
- Encode information about the words around the target word
  - typical features include: words, root forms, POS tags, frequency, ...
- Example: "An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps."
  - with position information:
    - (guitar, NN1), (and, CJC), (player, NN1), (stand, VVB)
  - no position information, but word frequency:
    - vocabulary: fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band
    - vector: [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
  - other features:
    - followed by "player"?, contains "show" in the sentence?, ...
    - [yes, no, ...]
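
A minimal sketch of how such a frequency vector could be computed (the function and variable names here are our own, not from the slides):

from collections import Counter

def context_vector(tokens, target_index, vocabulary, window=5):
    """Encode the context of tokens[target_index] as a bag-of-words
    frequency vector over a fixed vocabulary (positions ignored)."""
    start = max(0, target_index - window)
    end = min(len(tokens), target_index + window + 1)
    # count every word in the window except the target word itself
    counts = Counter(tokens[start:target_index] + tokens[target_index + 1:end])
    return [counts[w] for w in vocabulary]

vocab = ["fishing", "big", "sound", "player", "fly", "rod",
         "pound", "double", "runs", "playing", "guitar", "band"]
sent = ("an electric guitar and bass player stand off to one side "
        "not really part of the scene").split()
print(context_vector(sent, sent.index("bass"), vocab))
# [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]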
19. Supervised WSD
- Training corpus:
  - each occurrence of the ambiguous word w is annotated with a semantic label (its contextually appropriate sense sk)
- Several approaches from ML:
  - Bayesian classification
  - Decision trees
  - Neural networks
  - K-nearest neighbors (kNN)
  - ...
20. Approaches to Statistical WSD
- → Supervised Disambiguation
  - → Naïve Bayes
  - Decision Trees
- Use of Lexical Resources
  - Dictionary-based
  - Thesaurus-based
  - Translation-based
- Discourse properties
- Unsupervised Disambiguation
21. Naïve Bayes Classification
- Goal: choose the most probable sense s* for a word given a vector V of surrounding words
- the vector contains:
  - frequencies of words
  - vocabulary: fishing, big, sound, player, fly, rod, ...
  - vector: [0, 0, 0, 2, 1, 0, ...]
- Bayes decision rule:
  \( s^* = \arg\max_{s_k} P(s_k \mid V) \)
- where:
  - S is the set of possible senses for the target word
  - sk is a sense in S
  - V is the feature vector (the representation of the context)
- Using Bayes' rule:
  \( P(s_k \mid V) = \frac{P(V \mid s_k)\,P(s_k)}{P(V)} \)
22. Decision Rule for Naïve Bayes
- \( s^* = \arg\max_{s_k} \frac{P(V \mid s_k)\,P(s_k)}{P(V)} \)
- But P(V) is the same for all possible senses, so it does not affect the final ranking of the senses and we can drop it:
  \( s^* = \arg\max_{s_k} P(V \mid s_k)\,P(s_k) \)
- To make the computations simpler, we often take the log of probabilities:
  \( s^* = \arg\max_{s_k} \left[ \log P(V \mid s_k) + \log P(s_k) \right] \)
23. Naïve Bayes WSD
- Training a Naïve Bayes classifier: estimating P(vj|sk) and P(sk) from a sense-tagged training corpus
- using Maximum-Likelihood Estimation, perhaps with appropriate smoothing:
  - \( P(v_j \mid s_k) = \frac{C(v_j, s_k)}{\sum_t C(v_t, s_k)} \) : nb of occurrences of feature vj over the total nb of features appearing in context windows of sense sk
  - \( P(s_k) = \frac{C(s_k)}{C(w)} \) : nb of occurrences of sense sk over the nb of all occurrences of the ambiguous word w
24. Naïve Bayes Algorithm
- // 1. training
- for all senses sk of word w:
  - for all words vj in the vocabulary:
    - compute P(vj|sk)
- for all senses sk of word w:
  - compute P(sk)
- // 2. disambiguation
- for all senses sk of word w:
  - score(sk) = log P(sk)
  - for all words vj in the context window:
    - score(sk) = score(sk) + log P(vj|sk)
- choose s* = the sense with the greatest score(sk)
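
The algorithm above translates almost line for line into code. Below is a minimal sketch (naming is our own, and add-one smoothing is one concrete choice of the "appropriate smoothing" mentioned earlier):

import math
from collections import Counter, defaultdict

def train_naive_bayes(instances, alpha=1.0):
    """instances: list of (context_words, sense) pairs for one ambiguous word.
    Returns log P(sk), smoothed log P(vj|sk), and the vocabulary."""
    sense_counts = Counter(sense for _, sense in instances)
    word_counts = defaultdict(Counter)   # sense -> counts of context words
    vocab = set()
    for words, sense in instances:
        word_counts[sense].update(words)
        vocab.update(words)
    log_prior = {s: math.log(c / len(instances)) for s, c in sense_counts.items()}
    log_like = {}
    for s in sense_counts:
        total = sum(word_counts[s].values())
        # add-one (Laplace) smoothing so unseen context words do not give log(0)
        log_like[s] = {v: math.log((word_counts[s][v] + alpha) /
                                   (total + alpha * len(vocab)))
                       for v in vocab}
    return log_prior, log_like, vocab

def disambiguate(context_words, log_prior, log_like, vocab):
    """score(sk) = log P(sk) + sum over context words of log P(vj|sk)."""
    def score(s):
        return log_prior[s] + sum(log_like[s][v] for v in context_words if v in vocab)
    return max(log_prior, key=score)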
25. Example
- Training corpus (context window = ±3 words):
  - Today the World Bank/BANK1 and partners are calling for greater relief ...
  - Welcome to the Bank/BANK1 of America, the nation's leading financial institution ...
  - Welcome to America's Job Bank/BANK1. Visit our site and ...
  - Web site of the European Central Bank/BANK1, located in Frankfurt ...
  - The Asian Development Bank/BANK1 (ADB), a multilateral development finance ...
  - ... lounging against verdant banks/BANK2 carving out the ...
  - ... for swimming, had warned her off the banks/BANK2 of the Potomac. Nobody ...
- Training:
  - P(the|BANK1) = 5/30      P(the|BANK2) = 3/12
  - P(world|BANK1) = 1/30    P(world|BANK2) = 0/12
  - P(and|BANK1) = 1/30      P(and|BANK2) = 0/12
  - ...
  - P(off|BANK1) = 0/30      P(off|BANK2) = 1/12
  - P(Potomac|BANK1) = 0/30  P(Potomac|BANK2) = 1/12
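
To see the decision rule at work on these estimates, take a (made-up) test context containing the words "the" and "world". Since 5 of the 7 training occurrences are BANK1, P(BANK1) = 5/7 and P(BANK2) = 2/7, so:

\( score(BANK1) = \log(5/7) + \log P(the \mid BANK1) + \log P(world \mid BANK1) = \log(5/7) + \log(5/30) + \log(1/30) \)
\( score(BANK2) = \log(2/7) + \log(3/12) + \log(0/12) = -\infty \)

so BANK1 wins. The zero count for P(world|BANK2) also shows why smoothing matters: without it, a single context word never seen with a sense vetoes that sense outright.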
26. Naïve Bayes Assumption
- Independence assumption:
  - the features (contextual words) are conditionally independent given the sense
  - the probability of an entire feature vector given a sense is the product of the probabilities of its individual features given that sense:
  \( P(V \mid s_k) = \prod_j P(v_j \mid s_k) \)
- Consequences:
  - bag-of-words model: the structure and linear ordering of words within the context is ignored
  - the presence of one word in the bag is independent of another
- The independence assumption is incorrect, but it is useful in WSD
  - (Gale, Church & Yarowsky, 1992) report 90% correct disambiguation with 6 ambiguous nouns in the Hansard corpus
27. Approaches to Statistical WSD
- → Supervised Disambiguation
  - Naïve Bayes
  - → Decision Trees
- Use of Lexical Resources
  - Dictionary-based
  - Thesaurus-based
  - Translation-based
- Discourse properties
- Unsupervised Disambiguation
28. Decision Tree Classifier
- the Bayes classifier uses information from all words in the context window
- but some words are more reliable than others as indicators of which sense is used
29. Decision Tree Classifier (cont.)
- Look for features that are very good indicators of the result
- Place these features (as questions) in the nodes of a decision tree
- Split the examples so that those with different values for the chosen feature end up in different sets
- Repeat the same process with another feature
- A sequence of tests is applied to each feature vector:
  - if a test succeeds → return the sense associated with the test
  - otherwise → apply the next test
  - if all features have been tested → return a default sense (the most common one)
30. Example: bass

Observation | Includes "fish"? | "striped bass"? | Includes "guitar"? | "bass player"? | Includes "piano"? | Sense
1 | Yes | Yes | No | No | No | fish
2 | Yes | Yes | No | No | No | fish
3 | No | No | Yes | No | No | instrument
4 | No | Yes | No | No | No | fish
5 | Yes | Yes | No | No | No | fish
6 | No | No | Yes | Yes | Yes | instrument
7 | No | Yes | No | No | No | fish

(Figure: the corresponding decision tree, with yes/no branches on these features.)
31. Another Example: The restaurant
(Figure: a table of restaurant examples, each with input attributes and an output value.)
32. A first decision tree
- But is it the best decision tree we can build?
33. A better decision tree
- 4 tests instead of 9; 11 branches instead of 21
34. Choosing the best feature
- The key problem is choosing which feature to use to split a given set of examples
- Most used strategy: information theory
- Entropy (or self-information):
  \( H(S) = -\sum_i p_i \log_2 p_i \)
35. Choosing the best feature (cont.)
- The "discriminating power" (information gain) of an attribute A given a set S:
  \( Gain(A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v) \)
- if the training set contains p positive examples and n negative examples:
  \( H(S) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n} \)
36. Some intuition

Size | Color | Shape | Output
Big | Red | Circle | +
Small | Red | Circle | +
Small | Red | Square | −
Big | Blue | Circle | −

- Size is the least discriminating attribute (i.e., smallest information gain)
- Shape and Color are the most discriminating attributes (i.e., highest information gain)
37. A small example

Size | Color | Shape | Output
Big | Red | Circle | +
Small | Red | Circle | +
Small | Red | Square | −
Big | Blue | Circle | −

- So first separate according to either Color or Shape (the root of the tree); the computation is worked out below
- Note: by definition, 0·log(0) = 0
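
Working the formulas out on this table (2 positive and 2 negative examples, so H(S) = 1 bit):

- Size: Big → {+, −} and Small → {+, −}, each with entropy 1, so Gain(Size) = 1 − (2/4·1 + 2/4·1) = 0
- Color: Red → {+, +, −} with entropy −(2/3)log₂(2/3) − (1/3)log₂(1/3) ≈ 0.918, and Blue → {−} with entropy 0, so Gain(Color) = 1 − (3/4·0.918 + 1/4·0) ≈ 0.311
- Shape: Circle → {+, +, −} and Square → {−}, the same split as Color, so Gain(Shape) ≈ 0.311

which is why Color or Shape, and not Size, should be the root.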
38. The restaurant example
- With the data on p. 27, computing the gain of each attribute shows that Gain(Patrons) is the highest
- So the root of the tree should be the attribute Patrons (we gain more information)
- proceed recursively for the subtrees
39. Back to WSD
- Need to translate the French word prendre
- can be seen as WSD
- possible translations/senses: take, make, rise, speak

Observation | Tense | Word to the left | Direct object | Word to the right | Sense
1 | | | mesure | | take
2 | | | note | | take
3 | | | exemple | | take
4 | | | décision | | make
5 | | | parole | | speak
6 | | | parole | | rise
40. Back to WSD (cont.)
- (Brown et al., 1991) found, on the Canadian Hansard:

Ambiguous word | Possible senses/translations | Best feature | Example
prendre | take, make, rise, speak | Direct object | prendre une mesure → to take; prendre une décision → to make
vouloir | to want, to like | Tense | present → to want; conditional → to like
cent | ... | Word to the left | pour ... → ...; number ... → ...
41. Training Set
- With supervised methods, we need a large sense-tagged training set... where do you get it from?
- Using a "real" training set:
  - the main standard hand sense-tagged corpora:
    - SEMCOR corpus: a portion of the Brown corpus, tagged with WordNet senses
    - SENSEVAL corpus (www.senseval.org): from the standard WSD competition (like MUC, TREC & DUC)
    - Open Mind Word Expert (OMWE)
- Using pseudowords (see the sketch after this list):
  - artificial ambiguous words created by conflating two or more words
  - Ex: occurrences of banana and door can be replaced by banana-door
  - the disambiguation algorithm can then be tested on this data to disambiguate the pseudoword banana-door into either banana or door
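
A minimal sketch of how such a pseudoword corpus could be generated (names are our own):

import re

def make_pseudoword_corpus(text, w1="banana", w2="door"):
    """Conflate two words into one artificial ambiguous pseudoword.
    Returns the rewritten text plus the list of true senses, which
    serves as a free gold standard for evaluating a WSD system."""
    gold = [m.group(0).lower() for m in re.finditer(rf"\b({w1}|{w2})\b", text, re.I)]
    conflated = re.sub(rf"\b({w1}|{w2})\b", f"{w1}-{w2}", text, flags=re.I)
    return conflated, gold

text = "The door was open. She ate a banana near the door."
print(make_pseudoword_corpus(text))
# ('The banana-door was open. She ate a banana-door near the banana-door.',
#  ['door', 'banana', 'door'])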
42. Problems
- With supervised (or unsupervised) methods, a large amount of work is needed to create a classifier for each ambiguous word!
- So most work based on these techniques reports results on only a few words (2 to 12)
- Scaling up these approaches to deal with all ambiguous words is an immense amount of work!
- Solutions:
  - use lexical resources (ex: machine-readable dictionaries)
  - use discourse properties to improve disambiguation:
    - ambiguous words tend to be used in only one sense in any given discourse and with any given collocate
43. Approaches to Statistical WSD
- Supervised Disambiguation
  - Naïve Bayes
  - Decision Trees
- → Use of Lexical Resources
  - → Dictionary-based
  - Thesaurus-based
  - Translation-based
- Discourse properties
- Unsupervised Disambiguation
44. WSD based on sense definitions
- (Lesk, 1986)
- A word's dictionary definitions are likely to be good indicators of the senses they define.
- Method:
  - express each dictionary definition (sense) of the ambiguous word as a bag of words
  - express the context of the ambiguous word as a single bag of words, built from the dictionary definitions of the context words
  - choose the definition of the ambiguous word that has the greatest overlap with the words occurring in its context
45. Example
- "cone" in the dictionary:
  - DEF-1: solid body which narrows to a point → BAG: {body, narrows, point, solid}
  - DEF-2: something of this shape, whether solid or hollow → BAG: {hollow, shape, something, solid}
  - DEF-3: fruit of certain evergreen trees → BAG: {evergreen, fruit, tree}
- To disambiguate "cone" in "pine cone":
  - "pine" in the dictionary:
    - DEF-1: kind of evergreen tree
    - DEF-2: waste away through sorrow or illness
    - → context BAG: {evergreen, illness, kind, sorrow, tree, waste}
  - so for "cone":
    - score(DEF-1) = |{body, narrows, point, solid} ∩ {evergreen, illness, kind, sorrow, tree, waste}| = 0
    - score(DEF-2) = |{hollow, shape, something, solid} ∩ {evergreen, illness, kind, sorrow, tree, waste}| = 0
    - score(DEF-3) = |{evergreen, fruit, tree} ∩ {evergreen, illness, kind, sorrow, tree, waste}| = 2 → DEF-3 ("fruit of certain evergreen trees") is chosen
46. The algorithm
- for all senses sk of word w:
  - score(sk) = overlap(
    - the words in the dictionary definition of sense sk,
    - the union of the words in the definitions of all the words in the context window
  - )
- pick the sense s* with the highest score(sk)
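
A minimal sketch of this simplified Lesk algorithm (the toy stop-word list and all names are our own; real implementations filter function words more carefully):

STOPWORDS = {"a", "of", "or", "to", "this", "which", "whether", "through"}

def bag(definition):
    """Bag of content words for one dictionary definition."""
    return {w for w in definition.lower().split() if w not in STOPWORDS}

def lesk(word, context_words, definitions):
    """Pick the sense of `word` whose definition overlaps most with the
    bag of words built from the definitions of the context words."""
    context_bag = set()
    for w in context_words:
        for d in definitions.get(w, []):
            context_bag.update(bag(d))
    scores = [len(bag(d) & context_bag) for d in definitions[word]]
    return scores.index(max(scores))

defs = {
    "cone": ["solid body which narrows to a point",
             "something of this shape whether solid or hollow",
             "fruit of certain evergreen tree"],
    "pine": ["kind of evergreen tree",
             "waste away through sorrow or illness"],
}
print(lesk("cone", ["pine"], defs))   # 2, i.e. DEF-3, as in the example above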
47. Analysis
- Accuracies of 50–70% on short samples of text
- Problem:
  - dictionary entries for the target words are usually relatively short and may not provide sufficient material to build adequate classifiers
  - because the words in the context and the words in the definitions must overlap directly
- One solution:
  - expand the classifier with words whose definitions make use of the target word
  - Example:
    - if "deposit" does not occur in the definition of "bank", but "bank" occurs in the definition of "deposit", we can expand the classifier for "bank" to include "deposit" as a relevant feature
- However:
  - just knowing that "deposit" is related to "bank" does not help much if we do not know to which sense of "bank" it is related
  - → to make use of "deposit" as a feature, we have to know which sense of "bank" was being used in the definition of "deposit"
- Solution: use a thesaurus
48. Approaches to Statistical WSD
- Supervised Disambiguation
  - Naïve Bayes
  - Decision Trees
- → Use of Lexical Resources
  - Dictionary-based
  - → Thesaurus-based
  - Translation-based
- Discourse properties
- Unsupervised Disambiguation
49. Thesaurus-Based Disambiguation
- Thesauri include tags (subject codes) in their entries that correspond to broad semantic categories
- Each word is assigned one or more subject codes corresponding to its different meanings
  - Ex: ANIMAL/INSECT (category 414), TOOLS/MACHINERY (category 348)
- The semantic categories of the words in a context determine the semantic category of the whole context, and this category determines which word senses are used
- Algorithm:
  - for each subject code, count the number of words in the context that have that subject code
  - select the subject code with the highest count
- Accuracy: ~50% (but evaluated on difficult and highly ambiguous words)
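
A minimal sketch of this counting scheme (the dictionary of subject codes is a toy stand-in for a real thesaurus such as Roget's):

from collections import Counter

def thesaurus_wsd(context_words, subject_codes):
    """Select the subject code (broad semantic category) that covers
    the most words in the context; subject_codes maps each word to
    the set of codes listed for it in the thesaurus."""
    votes = Counter()
    for w in context_words:
        votes.update(subject_codes.get(w, ()))
    return votes.most_common(1)[0][0] if votes else None

codes = {   # toy entries, not real Roget categories
    "fish": {"ANIMAL/INSECT"},
    "rod":  {"ANIMAL/INSECT", "TOOLS/MACHINERY"},
    "boat": {"TRAVEL"},
}
print(thesaurus_wsd(["fish", "rod", "boat"], codes))   # ANIMAL/INSECT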
50. Some Results (Yarowsky, 1992)

Word | Sense | Roget category | Accuracy
bass | musical instrument | MUSIC | 99%
bass | fish | ANIMAL/INSECT | 100%
star | space object | UNIVERSE | 96%
star | celebrity | ENTERTAINER | 95%
star | star-shaped object | INSIGNIA | 82%
interest | curiosity | REASONING | 88%
interest | advantage | INJUSTICE | 34%
interest | financial | DEBT | 90%
interest | share | PROPERTY | 38%
51. Approaches to Statistical WSD
- Supervised Disambiguation
  - Naïve Bayes
  - Decision Trees
- → Use of Lexical Resources
  - Dictionary-based
  - Thesaurus-based
  - → Translation-based
- Discourse properties
- Unsupervised Disambiguation
52. Translation-Based WSD
- Words can be disambiguated by looking at how they are translated in other languages
- Example: the word "interest":

 | sense1 | sense2
Definition | legal share | attention, concern
German translation | Beteiligung | Interesse
English phrase | acquire an interest | show interest
Translation | erwarb eine Beteiligung | Interesse zeigen

- To disambiguate "interest" in "showed interest":
  - the German translation of "show" is "zeigen"
  - in a German corpus, we always find "Interesse zeigen" and never find "Beteiligung zeigen"
  - so in the original phrase "showed interest", "interest" has sense2
- To disambiguate "interest" in "acquired an interest":
  - the German translation of "acquired" is "erwarb"
  - in the German corpus: C(erwarb, Beteiligung) > C(erwarb, Interesse), so "interest" has sense1
53. Approaches to Statistical WSD
- Supervised Disambiguation
  - Naïve Bayes
  - Decision Trees
- Use of Lexical Resources
  - Dictionary-based
  - Thesaurus-based
  - Translation-based
- → Discourse properties
- Unsupervised Disambiguation
54. Discourse Properties (Yarowsky, 1995)
- So far, all methods have considered each occurrence of an ambiguous word separately
- But:
  - One sense per discourse: one document → one sense
    - i.e., assign the majority sense of the discourse to all occurrences of the target word
  - One sense per collocation: some nearby words give very strong clues, i.e., the words of a collocation ↔ the sense of the target word
- (Yarowsky, 1995) shows a reduction of the error rate by 27% when using the discourse constraint!
- we can combine these 2 heuristics
55. Approaches to Statistical WSD
- Supervised Disambiguation
  - Naïve Bayes
  - Decision Trees
- Use of Lexical Resources
  - Dictionary-based
  - Thesaurus-based
  - Translation-based
- Discourse properties
- → Unsupervised Disambiguation
56. Unsupervised Disambiguation
- Disambiguate word senses:
  - without supporting tools such as dictionaries and thesauri
  - without a labeled training text
- Without such resources, we cannot really identify/label the senses
  - i.e., we cannot say bank-1 or bank-2
  - we do not even know the different senses of a word!
- But we can:
  - cluster/group the contexts of an ambiguous word into a number of groups
  - discriminate between these groups without actually labeling them
57. Clustering
- Represent each instance of the ambiguous word as a vector <f1, f2, f3, ..., fV>
  - V is the vocabulary size
  - fi is the frequency of word i in the context
- each vector can be visually represented in a V-dimensional space

(Figure: context vectors V1, V2, V3 plotted along the axes word1, word2, word3.)
58. Clustering
- hypothesis: the same sense of a word will have similar neighboring words
- Disambiguation algorithm:
  - identify the context vectors corresponding to all occurrences of a particular word
  - partition them into regions of high density
  - tag each such region with a sense
- Disambiguating an occurrence of the word:
  - compute the context vector of the occurrence
  - find the closest region centroid
  - assign the occurrence the sense of that centroid
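
A minimal sketch of this procedure using plain k-means over the context vectors (the number of senses k must be guessed; all names are our own):

import numpy as np

def kmeans(vectors, k, iters=20, seed=0):
    """Partition context vectors into k regions; return centroids and labels."""
    vectors = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # assign each context vector to its nearest centroid ...
        dists = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # ... then move each centroid to the mean of its region
        for j in range(k):
            if (labels == j).any():
                centroids[j] = vectors[labels == j].mean(axis=0)
    return centroids, labels

def assign_sense(context_vector, centroids):
    """Disambiguate one new occurrence: the 'sense' of the closest centroid."""
    return int(((centroids - context_vector) ** 2).sum(axis=1).argmin())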
59. Evaluating WSD
- Metrics:
  - Accuracy: the % of words that are tagged correctly
  - Precision & Recall, defined from these counts (see the formulas after this list):
    - Good = nb of correct answers provided by the system
    - Bad = nb of wrong answers provided by the system
    - Null = nb of cases in which the system doesn't provide any answer
- compared to a gold standard:
  - SEMCOR corpus, SENSEVAL corpus, original text without pseudowords, ...
- Difficulty in evaluation:
  - the nature of the senses to distinguish has a huge impact on the results
  - coarse-grained VS fine-grained sense distinctions
    - ex: chair → person VS furniture
    - ex: bank → financial institution VS building
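
With the Good/Bad/Null counts above, the formulas are standardly defined as:

\( precision = \frac{Good}{Good + Bad} \)   (correct answers among the answers the system gave)
\( recall = \frac{Good}{Good + Bad + Null} \)   (correct answers among all cases)

so a cautious system that often abstains can score high precision but low recall; when the system answers every case, precision, recall and accuracy coincide.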
60. Bounds on Performance
- Upper and lower bounds on performance:
  - measure how well an algorithm performs relative to the difficulty of the task
- Upper bound:
  - human performance: around 97–99% with few and clearly distinct senses
  - inter-judge agreement:
    - with words with clear, distinct senses → 95% and up
    - with polysemous words with related senses → 65–70%
- Lower bound (or baseline):
  - usually the assignment of the most frequent sense
  - 90% is excellent for a word with 2 equiprobable senses
  - 90% is trivial for a word with 2 senses with a probability ratio of 9 to 1!
61. SENSEVAL (www.senseval.org)
- Standard WSD competition, like MUC, TREC & DUC
- Goals:
  - provide a common framework to compare WSD systems
  - standardise the task (especially the evaluation procedures)
  - build and distribute new lexical resources
- Senseval-1 (1998):
  - English, French and Italian
  - HECTOR senses (Oxford University Press)
- Senseval-2 (2001):
  - 13 languages, including Chinese
  - WordNet senses
- Senseval-3 (March 2004):
  - 7 languages (with various tasks)
  - WordNet senses
62. Training text for "arm" (SENSEVAL-1)

<instance id="arm.n.om.053"> <answer instance="arm.n.om.053" senseid="arm10800"/>
<context>
Many <p="JJ"/> terrestrial <p="JJ"/> vertebrate <p="JJ"/> animals <p="NNS"/> have <p="VBP"/> four <p="CD"/> <ne="_NUM"/> limbs <p="NNS"/> . <p="."/> Those <p="DT"/> attached <p="VBN"/> to <p="TO"/> the <p="DT"/> thoracic <p="JJ"/> portion <p="NN"/> of <p="IN"/> the <p="DT"/> body <p="NN"/> are <p="VBP"/> called <p="VBN"/> " <p="""/> <head> arms <p="NNS"/> </head> . <p="."/> " <p="""/>
</context> </instance>

<instance id="arm.n.om.045"> <answer instance="arm.n.om.045" senseid="arm10602"/>
<context> You <p="PRP"/> are <p="VBP"/> likely <p="JJ"/> to <p="TO"/> find <p="VB"/> a <p="DT"/> rocking_chair <p="NN"/> with <p="IN"/> <head> arms <p="NNS"/> </head> in <p="IN"/> a <p="DT"/> museum <p="NN"/>
</context> </instance>

<instance id="arm.n.la.029"> <answer instance="arm.n.la.029" senseid="arm10601"/>
<context>
" <p="""/> Unlike <p="IN"/> Linder <p="NNP"/> , <p=","/> who <p="WP"/> was <p="VBD"/> reportedly <p="RB"/> carrying <p="VBG"/> a <p="DT"/> Kalashnikov <p="NNP"/> assault_rifle <p="NN"/> for <p="IN"/> protection <p="NN"/> , <p=","/> APSNICA <p="NNP"/> volunteers <p="NNS"/> do <p="VBP"/> not <p="RB"/> bear <p="VB"/> <head> arms <p="NNS"/> </head> . <p="."/>
</context> </instance>
63. What is a word sense, anyway?
- A mental representation of the different meanings of a word
- Experiments in psycholinguistics:
  - ask subjects to classify index cards with sentences containing an ambiguous word into different piles
    - but inter-subject agreement is low
  - rely on introspection
    - but introspection tends to rationalize often non-rational decisions
  - ask subjects to classify ambiguous words according to dictionary definitions
    - some results show high inter-subject agreement, some show low agreement!