Title: COMP791A: Statistical Language Processing
1. COMP791A: Statistical Language Processing
- Word Sense Disambiguation
- Chap. 7
2. Overview of the problem
- Many words have several meanings or senses (homonyms or polysemous words)
  - Ex: chair → furniture or person
  - Ex: dishes → plates or food
- Need to determine which sense of a word is used in a specific sentence
- Note:
  - often, the different senses of a word are closely related
    - Ex: title → right of legal ownership, document that is evidence of the legal ownership, name of a work, ...
  - often, several senses can be activated in a single context (co-activation)
    - Ex: This could bring competition to the trade
    - competition → the act of competing AND the people who are competing
3. Word Sense Disambiguation (WSD)
- To determine which of the senses of an ambiguous word is invoked in a particular use of the word
- Potentially extremely useful problem
  - Ex: in machine translation
    - chair → (person) directeur
    - chair → (furniture) chaise
    - bureau → desk
    - bureau → office
- Can be done
  - with rule-based methods
  - with statistical methods
4. WordNet
- most widely used lexical database for English
- free!
- G. Miller at Princeton (www.cogsci.princeton.edu/wn)
- used in many applications of NLP
- EuroWordNet
  - Dutch, Italian, Spanish, German, French, Czech and Estonian
- includes entries for open-class words only (nouns, verbs, adjectives, adverbs)
5. WordNet Entries
- in WordNet 1.6 (now 2.0)
- 118,000 different word forms
- organized according to their meanings (senses)
- each entry has
  - a dictionary-style definition (gloss) of each sense
  - AND a set of domain-independent lexical relations among WordNet's entries (words) and senses
- senses are grouped into synsets (i.e., sets of synonyms)
6. Example 1: WordNet entry for the verb "serve"
7. Rule-based WSD
- They served green-lipped mussels from New Zealand.
- Which airlines serve Denver?
- semantic restrictions on the predicate of an argument:
  - argument mussels → needs a predicate with the sense provide-food → sense 6 of WordNet
  - argument Denver → needs a predicate with the sense attend-to → sense 10 of WordNet
8. Example 2: WordNet entry for "dish"
9. Rule-based WSD
- In our house, everybody has a career and none of them includes washing dishes.
- In her tiny kitchen, Ms. Chen works efficiently, stir-frying several simple dishes, including braised pig's ears and chicken livers with green peppers.
- semantic restrictions on the argument of a predicate:
  - predicate wash → needs an argument with the sense object → senses 1, 2 or 6 of WordNet
  - predicate stir-fry → needs an argument with the sense food → sense 2 of WordNet
10. Problem with rule-based WSD
- In some cases, the constraints on the predicate and on the argument are not enough to pinpoint one unique sense
  - ex: What kind of dishes do you recommend?
- Figures of speech
  - the meanings of words can be generated dynamically, instead of being fixed and stored in a lexicon or set of selectional restrictions
  - Ex: metaphor, metonymy
11. Problem with rule-based WSD (cont.)
- Metaphor
  - using words/phrases whose meanings are appropriate to different kinds of concepts, suggesting a likeness or analogy between them
  - "This deal does not scare Microsoft."
    - scare has 2 senses in WordNet: to cause fear, and to cause to lose courage
    - metaphor: the corporation is viewed as a person
  - "She is drowning in money."
    - metaphor: money is viewed as a liquid
12. Problem with rule-based WSD (cont.)
- Metonymy
  - referring to a concept by naming some other concept closely related to it
  - "We await word from the crown."
    - a monarch is not the same thing as a crown, but we often refer to the monarch as "the crown" because the two are associated
    - metonymy: the crown refers to the monarch
  - "The White House had no comment."
    - metonymy: the White House refers to the administration
13. WSD versus POS tagging
- butter can be a verb or a noun
  - I should butter my toast.
  - I like butter on my toast.
  - 2 different POS → 2 different usages with 2 different meanings
- So WSD can be viewed as POS tagging (classifying using semantic tags rather than POS tags)
- But the 2 tasks are considered different because:
  - nearby structural cues (ex: is the previous word a determiner?)
    - are important in POS tagging
    - are not effective for WSD
  - distant content words
    - are very effective for WSD
    - are not interesting for POS tagging
- So:
  - in POS tagging, we typically only look at the local context
  - in WSD, we use content words in a larger context
14. Approaches to Statistical WSD
- Supervised Disambiguation
  - based on a labeled training set
  - the learning system has a training set of feature-encoded inputs AND their appropriate sense labels (categories)
- Based on Lexical Resources
  - use of external lexical resources such as dictionaries and thesauri
- Discourse properties
- Unsupervised Disambiguation
  - based on unlabeled corpora
  - the learning system has a training set of feature-encoded inputs BUT NOT their appropriate sense labels (categories)
15. Approaches to Statistical WSD
- → Supervised Disambiguation
  - Naïve Bayes
  - Decision Trees
- Use of Lexical Resources
  - Dictionary-based
  - Thesaurus-based
  - Translation-based
- Discourse properties
- Unsupervised Disambiguation
16. Supervised WSD: Overview
- A word is assumed to have a finite number of discrete senses.
- The sense of a word depends on the senses of the surrounding words
  - ex: bass → fish, musical instrument, ...
17. Supervised WSD: Overview (cont.)
- WSD is viewed as a typical classification problem
  - use machine learning techniques to train a system that learns a classifier (a function f) to assign unseen examples one of a fixed number of senses (categories)
  - f(input) = correct sense
- Input:
  - Target word: the word to be disambiguated
  - Context (feature vector): a vector of relevant linguistic features that represents the target word's context (ex: a window of words around the target word)
18. Examples of Feature Vectors
- Take a window of n words around the target word
- Encode information about the words around the target word
  - typical features include: words, root forms, POS tags, frequency, ...
- Example: "An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps."
  - with position information:
    - (guitar, NN1), (and, CJC), (player, NN1), (stand, VVB)
  - no position information, but word frequency:
    - vocabulary: fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band
    - vector: [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
  - other features:
    - followed by "player"?, contains "show" in the sentence?, ...
    - [yes, no, ...]
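
A minimal sketch of how such a frequency vector could be computed (the function and variable names here are our own, not from the slides):

from collections import Counter

def context_vector(tokens, target_index, vocabulary, window=5):
    """Encode the context of tokens[target_index] as a bag-of-words
    frequency vector over a fixed vocabulary (positions ignored)."""
    start = max(0, target_index - window)
    end = min(len(tokens), target_index + window + 1)
    # count every word in the window except the target word itself
    counts = Counter(tokens[start:target_index] + tokens[target_index + 1:end])
    return [counts[w] for w in vocabulary]

vocab = ["fishing", "big", "sound", "player", "fly", "rod",
         "pound", "double", "runs", "playing", "guitar", "band"]
sent = ("an electric guitar and bass player stand off to one side "
        "not really part of the scene").split()
print(context_vector(sent, sent.index("bass"), vocab))
# [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]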
19. Supervised WSD
- Training corpus:
  - each occurrence of the ambiguous word w is annotated with a semantic label (its contextually appropriate sense sk)
- Several approaches from ML:
  - Bayesian classification
  - Decision trees
  - Neural networks
  - K-nearest neighbors (kNN)
  - ...
20. Approaches to Statistical WSD
- → Supervised Disambiguation
  - → Naïve Bayes
  - Decision Trees
- Use of Lexical Resources
  - Dictionary-based
  - Thesaurus-based
  - Translation-based
- Discourse properties
- Unsupervised Disambiguation
21. Naïve Bayes Classification
- Goal: choose the most probable sense s* for a word given a vector V of surrounding words
- the vector contains:
  - frequencies of words
  - vocabulary: fishing, big, sound, player, fly, rod, ...
  - vector: [0, 0, 0, 2, 1, 0, ...]
- Bayes decision rule:
  \( s^* = \arg\max_{s_k} P(s_k \mid V) \)
- where:
  - S is the set of possible senses for the target word
  - sk is a sense in S
  - V is the feature vector (the representation of the context)
- Using Bayes' rule:
  \( P(s_k \mid V) = \frac{P(V \mid s_k)\,P(s_k)}{P(V)} \)
22. Decision Rule for Naïve Bayes
- \( s^* = \arg\max_{s_k} \frac{P(V \mid s_k)\,P(s_k)}{P(V)} \)
- But P(V) is the same for all possible senses, so it does not affect the final ranking of the senses and we can drop it:
  \( s^* = \arg\max_{s_k} P(V \mid s_k)\,P(s_k) \)
- To make the computations simpler, we often take the log of probabilities:
  \( s^* = \arg\max_{s_k} \left[ \log P(V \mid s_k) + \log P(s_k) \right] \)
23. Naïve Bayes WSD
- Training a Naïve Bayes classifier: estimating P(vj|sk) and P(sk) from a sense-tagged training corpus
- using Maximum-Likelihood Estimation, perhaps with appropriate smoothing:
  - \( P(v_j \mid s_k) = \frac{C(v_j, s_k)}{\sum_t C(v_t, s_k)} \) : nb of occurrences of feature vj over the total nb of features appearing in context windows of sense sk
  - \( P(s_k) = \frac{C(s_k)}{C(w)} \) : nb of occurrences of sense sk over the nb of all occurrences of the ambiguous word w
24. Naïve Bayes Algorithm
- // 1. training
- for all senses sk of word w:
  - for all words vj in the vocabulary:
    - compute P(vj|sk)
- for all senses sk of word w:
  - compute P(sk)
- // 2. disambiguation
- for all senses sk of word w:
  - score(sk) = log P(sk)
  - for all words vj in the context window:
    - score(sk) = score(sk) + log P(vj|sk)
- choose s* = the sense with the greatest score(sk)
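
The algorithm above translates almost line for line into code. Below is a minimal sketch (naming is our own, and add-one smoothing is one concrete choice of the "appropriate smoothing" mentioned earlier):

import math
from collections import Counter, defaultdict

def train_naive_bayes(instances, alpha=1.0):
    """instances: list of (context_words, sense) pairs for one ambiguous word.
    Returns log P(sk), smoothed log P(vj|sk), and the vocabulary."""
    sense_counts = Counter(sense for _, sense in instances)
    word_counts = defaultdict(Counter)   # sense -> counts of context words
    vocab = set()
    for words, sense in instances:
        word_counts[sense].update(words)
        vocab.update(words)
    log_prior = {s: math.log(c / len(instances)) for s, c in sense_counts.items()}
    log_like = {}
    for s in sense_counts:
        total = sum(word_counts[s].values())
        # add-one (Laplace) smoothing so unseen context words do not give log(0)
        log_like[s] = {v: math.log((word_counts[s][v] + alpha) /
                                   (total + alpha * len(vocab)))
                       for v in vocab}
    return log_prior, log_like, vocab

def disambiguate(context_words, log_prior, log_like, vocab):
    """score(sk) = log P(sk) + sum over context words of log P(vj|sk)."""
    def score(s):
        return log_prior[s] + sum(log_like[s][v] for v in context_words if v in vocab)
    return max(log_prior, key=score)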
25. Example
- Training corpus (context window = ±3 words):
  - Today the World Bank/BANK1 and partners are calling for greater relief ...
  - Welcome to the Bank/BANK1 of America, the nation's leading financial institution ...
  - Welcome to America's Job Bank/BANK1. Visit our site and ...
  - Web site of the European Central Bank/BANK1, located in Frankfurt ...
  - The Asian Development Bank/BANK1 (ADB), a multilateral development finance ...
  - ... lounging against verdant banks/BANK2 carving out the ...
  - ... for swimming, had warned her off the banks/BANK2 of the Potomac. Nobody ...
- Training:
  - P(the|BANK1) = 5/30      P(the|BANK2) = 3/12
  - P(world|BANK1) = 1/30    P(world|BANK2) = 0/12
  - P(and|BANK1) = 1/30      P(and|BANK2) = 0/12
  - ...
  - P(off|BANK1) = 0/30      P(off|BANK2) = 1/12
  - P(Potomac|BANK1) = 0/30  P(Potomac|BANK2) = 1/12
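
To see the decision rule at work on these estimates, take a (made-up) test context containing the words "the" and "world". Since 5 of the 7 training occurrences are BANK1, P(BANK1) = 5/7 and P(BANK2) = 2/7, so:

\( score(BANK1) = \log(5/7) + \log P(the \mid BANK1) + \log P(world \mid BANK1) = \log(5/7) + \log(5/30) + \log(1/30) \)
\( score(BANK2) = \log(2/7) + \log(3/12) + \log(0/12) = -\infty \)

so BANK1 wins. The zero count for P(world|BANK2) also shows why smoothing matters: without it, a single context word never seen with a sense vetoes that sense outright.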
26. Naïve Bayes Assumption
- Independence assumption:
  - the features (contextual words) are conditionally independent given the sense
  - the probability of an entire feature vector given a sense is the product of the probabilities of its individual features given that sense:
  \( P(V \mid s_k) = \prod_j P(v_j \mid s_k) \)
- Consequences:
  - bag-of-words model: the structure and linear ordering of words within the context is ignored
  - the presence of one word in the bag is independent of another
- The independence assumption is incorrect, but it is useful in WSD
  - (Gale, Church & Yarowsky, 1992) report 90% correct disambiguation with 6 ambiguous nouns in the Hansard corpus
27. Approaches to Statistical WSD
- → Supervised Disambiguation
  - Naïve Bayes
  - → Decision Trees
- Use of Lexical Resources
  - Dictionary-based
  - Thesaurus-based
  - Translation-based
- Discourse properties
- Unsupervised Disambiguation
28. Decision Tree Classifier
- the Bayes classifier uses information from all words in the context window
- but some words are more reliable than others as indicators of which sense is used
29. Decision Tree Classifier (cont.)
- Look for features that are very good indicators of the result
- Place these features (as questions) in the nodes of a decision tree
- Split the examples so that those with different values for the chosen feature end up in different sets
- Repeat the same process with another feature
- A sequence of tests is applied to each feature vector:
  - if a test succeeds → return the sense associated with the test
  - otherwise → apply the next test
  - if all features have been tested → return a default sense (the most common one)
30. Example: bass

Observation | Includes "fish"? | "striped bass"? | Includes "guitar"? | "bass player"? | Includes "piano"? | Sense
1 | Yes | Yes | No | No | No | fish
2 | Yes | Yes | No | No | No | fish
3 | No | No | Yes | No | No | instrument
4 | No | Yes | No | No | No | fish
5 | Yes | Yes | No | No | No | fish
6 | No | No | Yes | Yes | Yes | instrument
7 | No | Yes | No | No | No | fish

(Figure: the corresponding decision tree, with yes/no branches on these features.)
31. Another Example: The restaurant
(Figure: a table of restaurant examples, each with input attributes and an output value.)
32. A first decision tree
- But is it the best decision tree we can build?
33. A better decision tree
- 4 tests instead of 9; 11 branches instead of 21
34. Choosing the best feature
- The key problem is choosing which feature to use to split a given set of examples
- Most used strategy: information theory
- Entropy (or self-information):
  \( H(S) = -\sum_i p_i \log_2 p_i \)
35. Choosing the best feature (cont.)
- The "discriminating power" (information gain) of an attribute A given a set S:
  \( Gain(A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v) \)
- if the training set contains p positive examples and n negative examples:
  \( H(S) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n} \)
36. Some intuition

Size | Color | Shape | Output
Big | Red | Circle | +
Small | Red | Circle | +
Small | Red | Square | −
Big | Blue | Circle | −

- Size is the least discriminating attribute (i.e., smallest information gain)
- Shape and Color are the most discriminating attributes (i.e., highest information gain)
37. A small example

Size | Color | Shape | Output
Big | Red | Circle | +
Small | Red | Circle | +
Small | Red | Square | −
Big | Blue | Circle | −

- So first separate according to either Color or Shape (the root of the tree); the computation is worked out below
- Note: by definition, 0·log(0) = 0
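
Working the formulas out on this table (2 positive and 2 negative examples, so H(S) = 1 bit):

- Size: Big → {+, −} and Small → {+, −}, each with entropy 1, so Gain(Size) = 1 − (2/4·1 + 2/4·1) = 0
- Color: Red → {+, +, −} with entropy −(2/3)log₂(2/3) − (1/3)log₂(1/3) ≈ 0.918, and Blue → {−} with entropy 0, so Gain(Color) = 1 − (3/4·0.918 + 1/4·0) ≈ 0.311
- Shape: Circle → {+, +, −} and Square → {−}, the same split as Color, so Gain(Shape) ≈ 0.311

which is why Color or Shape, and not Size, should be the root.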
38. The restaurant example
- With the data on p. 27, computing the gain of each attribute shows that Gain(Patrons) is the highest
- So the root of the tree should be the attribute Patrons (we gain more information)
- proceed recursively for the subtrees
39. Back to WSD
- Need to translate the French word prendre
- can be seen as WSD
- possible translations/senses: take, make, rise, speak

Observation | Tense | Word to the left | Direct object | Word to the right | Sense
1 | | | mesure | | take
2 | | | note | | take
3 | | | exemple | | take
4 | | | décision | | make
5 | | | parole | | speak
6 | | | parole | | rise
40. Back to WSD (cont.)
- (Brown et al., 1991) found, on the Canadian Hansard:

Ambiguous word | Possible senses/translations | Best feature | Example
prendre | take, make, rise, speak | Direct object | prendre une mesure → to take; prendre une décision → to make
vouloir | to want, to like | Tense | present → to want; conditional → to like
cent | ... | Word to the left | pour ... → ...; number ... → ...
41. Training Set
- With supervised methods, we need a large sense-tagged training set... where do you get it from?
- Using a "real" training set:
  - the main standard hand sense-tagged corpora:
    - SEMCOR corpus: a portion of the Brown corpus, tagged with WordNet senses
    - SENSEVAL corpus (www.senseval.org): from the standard WSD competition (like MUC, TREC & DUC)
    - Open Mind Word Expert (OMWE)
- Using pseudowords (see the sketch after this list):
  - artificial ambiguous words created by conflating two or more words
  - Ex: occurrences of banana and door can be replaced by banana-door
  - the disambiguation algorithm can then be tested on this data to disambiguate the pseudoword banana-door into either banana or door
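
A minimal sketch of how such a pseudoword corpus could be generated (names are our own):

import re

def make_pseudoword_corpus(text, w1="banana", w2="door"):
    """Conflate two words into one artificial ambiguous pseudoword.
    Returns the rewritten text plus the list of true senses, which
    serves as a free gold standard for evaluating a WSD system."""
    gold = [m.group(0).lower() for m in re.finditer(rf"\b({w1}|{w2})\b", text, re.I)]
    conflated = re.sub(rf"\b({w1}|{w2})\b", f"{w1}-{w2}", text, flags=re.I)
    return conflated, gold

text = "The door was open. She ate a banana near the door."
print(make_pseudoword_corpus(text))
# ('The banana-door was open. She ate a banana-door near the banana-door.',
#  ['door', 'banana', 'door'])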
42. Problems
- With supervised (or unsupervised) methods, a large amount of work is needed to create a classifier for each ambiguous word!
- So most work based on these techniques reports results on only a few words (2 to 12)
- Scaling up these approaches to deal with all ambiguous words is an immense amount of work!
- Solutions:
  - use lexical resources (ex: machine-readable dictionaries)
  - use discourse properties to improve disambiguation:
    - ambiguous words tend to be used in only one sense in any given discourse and with any given collocate
43. Approaches to Statistical WSD
- Supervised Disambiguation
  - Naïve Bayes
  - Decision Trees
- → Use of Lexical Resources
  - → Dictionary-based
  - Thesaurus-based
  - Translation-based
- Discourse properties
- Unsupervised Disambiguation
44. WSD based on sense definitions
- (Lesk, 1986)
- A word's dictionary definitions are likely to be good indicators of the senses they define.
- Method:
  - express each dictionary definition (sense) of the ambiguous word as a bag of words
  - express the context of the ambiguous word as a single bag of words, built from the dictionary definitions of the context words
  - choose the definition of the ambiguous word that has the greatest overlap with the words occurring in its context
45. Example
- "cone" in the dictionary:
  - DEF-1: solid body which narrows to a point → BAG: {body, narrows, point, solid}
  - DEF-2: something of this shape, whether solid or hollow → BAG: {hollow, shape, something, solid}
  - DEF-3: fruit of certain evergreen trees → BAG: {evergreen, fruit, tree}
- To disambiguate "cone" in "pine cone":
  - "pine" in the dictionary:
    - DEF-1: kind of evergreen tree
    - DEF-2: waste away through sorrow or illness
    - → context BAG: {evergreen, illness, kind, sorrow, tree, waste}
  - so for "cone":
    - score(DEF-1) = |{body, narrows, point, solid} ∩ {evergreen, illness, kind, sorrow, tree, waste}| = 0
    - score(DEF-2) = |{hollow, shape, something, solid} ∩ {evergreen, illness, kind, sorrow, tree, waste}| = 0
    - score(DEF-3) = |{evergreen, fruit, tree} ∩ {evergreen, illness, kind, sorrow, tree, waste}| = 2 → DEF-3 ("fruit of certain evergreen trees") is chosen
46. The algorithm
- for all senses sk of word w:
  - score(sk) = overlap(
    - the words in the dictionary definition of sense sk,
    - the union of the words in the definitions of all the words in the context window
  - )
- pick the sense s* with the highest score(sk)
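
A minimal sketch of this simplified Lesk algorithm (the toy stop-word list and all names are our own; real implementations filter function words more carefully):

STOPWORDS = {"a", "of", "or", "to", "this", "which", "whether", "through"}

def bag(definition):
    """Bag of content words for one dictionary definition."""
    return {w for w in definition.lower().split() if w not in STOPWORDS}

def lesk(word, context_words, definitions):
    """Pick the sense of `word` whose definition overlaps most with the
    bag of words built from the definitions of the context words."""
    context_bag = set()
    for w in context_words:
        for d in definitions.get(w, []):
            context_bag.update(bag(d))
    scores = [len(bag(d) & context_bag) for d in definitions[word]]
    return scores.index(max(scores))

defs = {
    "cone": ["solid body which narrows to a point",
             "something of this shape whether solid or hollow",
             "fruit of certain evergreen tree"],
    "pine": ["kind of evergreen tree",
             "waste away through sorrow or illness"],
}
print(lesk("cone", ["pine"], defs))   # 2, i.e. DEF-3, as in the example above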
47. Analysis
- Accuracies of 50–70% on short samples of text
- Problem:
  - dictionary entries for the target words are usually relatively short and may not provide sufficient material to build adequate classifiers
  - because the words in the context and the words in the definitions must overlap directly
- One solution:
  - expand the classifier with words whose definitions make use of the target word
  - Example:
    - if "deposit" does not occur in the definition of "bank", but "bank" occurs in the definition of "deposit", we can expand the classifier for "bank" to include "deposit" as a relevant feature
- However:
  - just knowing that "deposit" is related to "bank" does not help much if we do not know to which sense of "bank" it is related
  - → to make use of "deposit" as a feature, we have to know which sense of "bank" was being used in the definition of "deposit"
- Solution: use a thesaurus
48. Approaches to Statistical WSD
- Supervised Disambiguation
  - Naïve Bayes
  - Decision Trees
- → Use of Lexical Resources
  - Dictionary-based
  - → Thesaurus-based
  - Translation-based
- Discourse properties
- Unsupervised Disambiguation
49. Thesaurus-Based Disambiguation
- Thesauri include tags (subject codes) in their entries that correspond to broad semantic categories
- Each word is assigned one or more subject codes corresponding to its different meanings
  - Ex: ANIMAL/INSECT (category 414), TOOLS/MACHINERY (category 348)
- The semantic categories of the words in a context determine the semantic category of the whole context, and this category determines which word senses are used
- Algorithm:
  - for each subject code, count the number of words in the context that have that subject code
  - select the subject code with the highest count
- Accuracy: ~50% (but evaluated on difficult and highly ambiguous words)
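
A minimal sketch of this counting scheme (the dictionary of subject codes is a toy stand-in for a real thesaurus such as Roget's):

from collections import Counter

def thesaurus_wsd(context_words, subject_codes):
    """Select the subject code (broad semantic category) that covers
    the most words in the context; subject_codes maps each word to
    the set of codes listed for it in the thesaurus."""
    votes = Counter()
    for w in context_words:
        votes.update(subject_codes.get(w, ()))
    return votes.most_common(1)[0][0] if votes else None

codes = {   # toy entries, not real Roget categories
    "fish": {"ANIMAL/INSECT"},
    "rod":  {"ANIMAL/INSECT", "TOOLS/MACHINERY"},
    "boat": {"TRAVEL"},
}
print(thesaurus_wsd(["fish", "rod", "boat"], codes))   # ANIMAL/INSECT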
50. Some Results (Yarowsky, 1992)

Word | Sense | Roget category | Accuracy
bass | musical instrument | MUSIC | 99%
bass | fish | ANIMAL/INSECT | 100%
star | space object | UNIVERSE | 96%
star | celebrity | ENTERTAINER | 95%
star | star-shaped object | INSIGNIA | 82%
interest | curiosity | REASONING | 88%
interest | advantage | INJUSTICE | 34%
interest | financial | DEBT | 90%
interest | share | PROPERTY | 38%
51. Approaches to Statistical WSD
- Supervised Disambiguation
  - Naïve Bayes
  - Decision Trees
- → Use of Lexical Resources
  - Dictionary-based
  - Thesaurus-based
  - → Translation-based
- Discourse properties
- Unsupervised Disambiguation
52. Translation-Based WSD
- Words can be disambiguated by looking at how they are translated in other languages
- Example: the word "interest":

 | sense1 | sense2
Definition | legal share | attention, concern
German translation | Beteiligung | Interesse
English phrase | acquire an interest | show interest
Translation | erwarb eine Beteiligung | Interesse zeigen

- To disambiguate "interest" in "showed interest":
  - the German translation of "show" is "zeigen"
  - in a German corpus, we always find "Interesse zeigen" and never find "Beteiligung zeigen"
  - so in the original phrase "showed interest", "interest" has sense2
- To disambiguate "interest" in "acquired an interest":
  - the German translation of "acquired" is "erwarb"
  - in the German corpus: C(erwarb, Beteiligung) > C(erwarb, Interesse), so "interest" has sense1
53. Approaches to Statistical WSD
- Supervised Disambiguation
  - Naïve Bayes
  - Decision Trees
- Use of Lexical Resources
  - Dictionary-based
  - Thesaurus-based
  - Translation-based
- → Discourse properties
- Unsupervised Disambiguation
54. Discourse Properties (Yarowsky, 1995)
- So far, all methods have considered each occurrence of an ambiguous word separately
- But:
  - One sense per discourse: one document → one sense
    - i.e., assign the majority sense of the discourse to all occurrences of the target word
  - One sense per collocation: some nearby words give very strong clues, i.e., the words of a collocation ↔ the sense of the target word
- (Yarowsky, 1995) shows a reduction of the error rate by 27% when using the discourse constraint!
- we can combine these 2 heuristics
55. Approaches to Statistical WSD
- Supervised Disambiguation
  - Naïve Bayes
  - Decision Trees
- Use of Lexical Resources
  - Dictionary-based
  - Thesaurus-based
  - Translation-based
- Discourse properties
- → Unsupervised Disambiguation
56. Unsupervised Disambiguation
- Disambiguate word senses:
  - without supporting tools such as dictionaries and thesauri
  - without a labeled training text
- Without such resources, we cannot really identify/label the senses
  - i.e., we cannot say bank-1 or bank-2
  - we do not even know the different senses of a word!
- But we can:
  - cluster/group the contexts of an ambiguous word into a number of groups
  - discriminate between these groups without actually labeling them
57. Clustering
- Represent each instance of the ambiguous word as a vector <f1, f2, f3, ..., fV>
  - V is the vocabulary size
  - fi is the frequency of word i in the context
- each vector can be visually represented in a V-dimensional space

(Figure: context vectors V1, V2, V3 plotted along the axes word1, word2, word3.)
58. Clustering
- hypothesis: the same sense of a word will have similar neighboring words
- Disambiguation algorithm:
  - identify the context vectors corresponding to all occurrences of a particular word
  - partition them into regions of high density
  - tag each such region with a sense
- Disambiguating an occurrence of the word:
  - compute the context vector of the occurrence
  - find the closest region centroid
  - assign the occurrence the sense of that centroid
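
A minimal sketch of this procedure using plain k-means over the context vectors (the number of senses k must be guessed; all names are our own):

import numpy as np

def kmeans(vectors, k, iters=20, seed=0):
    """Partition context vectors into k regions; return centroids and labels."""
    vectors = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # assign each context vector to its nearest centroid ...
        dists = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # ... then move each centroid to the mean of its region
        for j in range(k):
            if (labels == j).any():
                centroids[j] = vectors[labels == j].mean(axis=0)
    return centroids, labels

def assign_sense(context_vector, centroids):
    """Disambiguate one new occurrence: the 'sense' of the closest centroid."""
    return int(((centroids - context_vector) ** 2).sum(axis=1).argmin())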
59. Evaluating WSD
- Metrics:
  - Accuracy: the % of words that are tagged correctly
  - Precision & Recall, defined from these counts (see the formulas after this list):
    - Good = nb of correct answers provided by the system
    - Bad = nb of wrong answers provided by the system
    - Null = nb of cases in which the system doesn't provide any answer
- compared to a gold standard:
  - SEMCOR corpus, SENSEVAL corpus, original text without pseudowords, ...
- Difficulty in evaluation:
  - the nature of the senses to distinguish has a huge impact on the results
  - coarse-grained VS fine-grained sense distinctions
    - ex: chair → person VS furniture
    - ex: bank → financial institution VS building
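
With the Good/Bad/Null counts above, the formulas are standardly defined as:

\( precision = \frac{Good}{Good + Bad} \)   (correct answers among the answers the system gave)
\( recall = \frac{Good}{Good + Bad + Null} \)   (correct answers among all cases)

so a cautious system that often abstains can score high precision but low recall; when the system answers every case, precision, recall and accuracy coincide.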
60. Bounds on Performance
- Upper and lower bounds on performance:
  - measure how well an algorithm performs relative to the difficulty of the task
- Upper bound:
  - human performance: around 97–99% with few and clearly distinct senses
  - inter-judge agreement:
    - with words with clear, distinct senses → 95% and up
    - with polysemous words with related senses → 65–70%
- Lower bound (or baseline):
  - usually the assignment of the most frequent sense
  - 90% is excellent for a word with 2 equiprobable senses
  - 90% is trivial for a word with 2 senses with a probability ratio of 9 to 1!
61. SENSEVAL (www.senseval.org)
- Standard WSD competition, like MUC, TREC & DUC
- Goals:
  - provide a common framework to compare WSD systems
  - standardise the task (especially the evaluation procedures)
  - build and distribute new lexical resources
- Senseval-1 (1998):
  - English, French and Italian
  - HECTOR senses (Oxford University Press)
- Senseval-2 (2001):
  - 13 languages, including Chinese
  - WordNet senses
- Senseval-3 (March 2004):
  - 7 languages (with various tasks)
  - WordNet senses
62. Training text for "arm" (SENSEVAL-1)

<instance id="arm.n.om.053"> <answer instance="arm.n.om.053" senseid="arm10800"/>
<context>
Many <p="JJ"/> terrestrial <p="JJ"/> vertebrate <p="JJ"/> animals <p="NNS"/> have <p="VBP"/> four <p="CD"/> <ne="_NUM"/> limbs <p="NNS"/> . <p="."/> Those <p="DT"/> attached <p="VBN"/> to <p="TO"/> the <p="DT"/> thoracic <p="JJ"/> portion <p="NN"/> of <p="IN"/> the <p="DT"/> body <p="NN"/> are <p="VBP"/> called <p="VBN"/> " <p="""/> <head> arms <p="NNS"/> </head> . <p="."/> " <p="""/>
</context> </instance>

<instance id="arm.n.om.045"> <answer instance="arm.n.om.045" senseid="arm10602"/>
<context> You <p="PRP"/> are <p="VBP"/> likely <p="JJ"/> to <p="TO"/> find <p="VB"/> a <p="DT"/> rocking_chair <p="NN"/> with <p="IN"/> <head> arms <p="NNS"/> </head> in <p="IN"/> a <p="DT"/> museum <p="NN"/>
</context> </instance>

<instance id="arm.n.la.029"> <answer instance="arm.n.la.029" senseid="arm10601"/>
<context>
" <p="""/> Unlike <p="IN"/> Linder <p="NNP"/> , <p=","/> who <p="WP"/> was <p="VBD"/> reportedly <p="RB"/> carrying <p="VBG"/> a <p="DT"/> Kalashnikov <p="NNP"/> assault_rifle <p="NN"/> for <p="IN"/> protection <p="NN"/> , <p=","/> APSNICA <p="NNP"/> volunteers <p="NNS"/> do <p="VBP"/> not <p="RB"/> bear <p="VB"/> <head> arms <p="NNS"/> </head> . <p="."/>
</context> </instance>
63. What is a word sense, anyway?
- A mental representation of the different meanings of a word
- Experiments in psycholinguistics:
  - ask subjects to classify index cards with sentences containing an ambiguous word into different piles
    - but inter-subject agreement is low
  - rely on introspection
    - but introspection tends to rationalize often non-rational decisions
  - ask subjects to classify ambiguous words according to dictionary definitions
    - some results show high inter-subject agreement, some show low agreement!