Supervised Methods of Word Sense Disambiguation

Author: Rada. Last modified by: umdcs. Created: 9/17/2004. Company: UNT.

1

Part 4
Supervised Methods of Word Sense Disambiguation

2
Outline

What is Supervised Learning?
Task Definition
Single Classifiers
Naïve Bayesian Classifiers
Decision Lists and Trees
Ensembles of Classifiers

3
What is Supervised Learning?

Collect a set of examples that illustrate the
various possible classifications or outcomes of
an event.
Identify patterns in the examples associated with
each particular class of the event.
Generalize those patterns into rules.
Apply the rules to classify a new event.

4
Learn from these examples: when do I go to the store?

Day  Go to Store?  Hot Outside?  Slept Well?  Ate Well?
1    YES           YES           NO           NO
2    NO            YES           NO           YES
3    YES           NO            NO           NO
4    NO            NO            NO           YES
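
The "identify patterns, then generalize them into rules" steps can be made concrete on this table with a one-rule learner. This is a minimal sketch, not a method named on the slides; the function name `one_rule` and the dictionary keys are hypothetical.

```python
from collections import Counter, defaultdict

# The four training days from the table; "go" is the class to predict.
examples = [
    {"hot": "YES", "slept": "NO", "ate": "NO",  "go": "YES"},
    {"hot": "YES", "slept": "NO", "ate": "YES", "go": "NO"},
    {"hot": "NO",  "slept": "NO", "ate": "NO",  "go": "YES"},
    {"hot": "NO",  "slept": "NO", "ate": "YES", "go": "NO"},
]

def one_rule(examples, features, label):
    """Return (feature, rule, n_correct) for the single most predictive feature."""
    best = None
    for feature in features:
        # Count class labels for each value the feature takes.
        counts = defaultdict(Counter)
        for ex in examples:
            counts[ex[feature]][ex[label]] += 1
        # Rule: each feature value predicts its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        n_correct = sum(c.most_common(1)[0][1] for c in counts.values())
        if best is None or n_correct > best[2]:
            best = (feature, rule, n_correct)
    return best

feature, rule, n_correct = one_rule(examples, ["hot", "slept", "ate"], "go")
print(feature, rule, n_correct)
```

On these four days the learner picks "Ate Well?" as the deciding feature: I go to the store exactly when I did not eat well.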
5
Outline

What is Supervised Learning?
Task Definition
Single Classifiers
Naïve Bayesian Classifiers
Decision Lists and Trees
Ensembles of Classifiers

6
Task Definition

Supervised WSD: a class of methods that induces a
classifier from manually sense-tagged text using
machine learning techniques.
Resources
Sense Tagged Text
Dictionary (implicit source of sense inventory)
Syntactic Analysis (POS tagger, Chunker, Parser,
...)
Scope
Typically one target word per context
Part of speech of target word resolved
Lends itself to lexical sample formulation
Reduces WSD to a classification problem where a
target word is assigned the most appropriate
sense from a given set of possibilities based on
the context in which it occurs

7
Sense Tagged Text
Bonnie and Clyde are two really famous criminals, I think they were bank/1 robbers.
My bank/1 charges too much for an overdraft.
I went to the bank/1 to deposit my check and get a new ATM card.
The University of Minnesota has an East and a West Bank/2 campus right on the Mississippi River.
My grandfather planted his pole in the bank/2 and got a great big catfish!
The bank/2 is pretty muddy, I can't walk there.
8
Two Bags of Words (Co-occurrences in the window
of context)
FINANCIAL_BANK_BAG: a an and are ATM Bonnie card charges check Clyde criminals deposit famous for get I much My new overdraft really robbers the they think to too two went were
RIVER_BANK_BAG: a an and big campus can't catfish East got grandfather great has his I in is Minnesota Mississippi muddy My of on planted pole pretty right River The the there University walk West
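
Bags like these can be collected mechanically from the sense-tagged sentences. The sketch below assumes the `word/sense` tagging convention shown on the previous slide; the helper name `build_bags` and the tokenizing pattern are illustrative choices, and the result may differ in small details (such as letter case) from the bags listed on the slide.

```python
import re
from collections import defaultdict

tagged_sentences = [
    "Bonnie and Clyde are two really famous criminals, I think they were bank/1 robbers.",
    "My bank/1 charges too much for an overdraft.",
    "I went to the bank/1 to deposit my check and get a new ATM card.",
    "The University of Minnesota has an East and a West Bank/2 campus "
    "right on the Mississippi River.",
    "My grandfather planted his pole in the bank/2 and got a great big catfish!",
    "The bank/2 is pretty muddy, I can't walk there.",
]

# A token is a run of letters/apostrophes, optionally followed by "/<sense>".
TOKEN = re.compile(r"[A-Za-z']+(?:/\d+)?")

def build_bags(sentences):
    """Map each sense id to the set of words co-occurring with the tagged target."""
    bags = defaultdict(set)
    for sentence in sentences:
        tokens = TOKEN.findall(sentence)
        senses = [t.split("/")[1] for t in tokens if "/" in t]
        context = [t for t in tokens if "/" not in t]  # drop the target itself
        for sense in senses:
            bags[sense].update(context)
    return bags

bags = build_bags(tagged_sentences)
print(sorted(bags["1"]))
print(sorted(bags["2"]))
```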
9
Simple Supervised Approach

Given a sentence S containing "bank":
For each word Wi in S
    If Wi is in FINANCIAL_BANK_BAG then
        Sense_1 = Sense_1 + 1
    If Wi is in RIVER_BANK_BAG then
        Sense_2 = Sense_2 + 1
If Sense_1 > Sense_2 then print "Financial"
else if Sense_2 > Sense_1 then print "River"
else print "Can't Decide"
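
A runnable version of this counting procedure might look like the following. The two bags are copied from the earlier slide; the function name `disambiguate` and the punctuation stripping are assumptions added for illustration.

```python
FINANCIAL_BANK_BAG = set(
    "a an and are ATM Bonnie card charges check Clyde criminals deposit "
    "famous for get I much My new overdraft really robbers the they think "
    "to too two went were".split()
)
RIVER_BANK_BAG = set(
    "a an and big campus cant catfish East got grandfather great has his I "
    "in is Minnesota Mississippi muddy My of on planted pole pretty right "
    "River The the there University walk West".split()
)

def disambiguate(sentence):
    """Count bag hits for each sense and return the sense with more hits."""
    sense_1 = sense_2 = 0
    for word in sentence.split():
        word = word.strip(".,!?")  # crude punctuation handling (assumption)
        if word in FINANCIAL_BANK_BAG:
            sense_1 += 1
        if word in RIVER_BANK_BAG:
            sense_2 += 1
    if sense_1 > sense_2:
        return "Financial"
    if sense_2 > sense_1:
        return "River"
    return "Can't Decide"

print(disambiguate("I need to deposit a check at the bank"))
print(disambiguate("The bank of the river is muddy"))
```

Words that appear in both bags (such as "the" or "I") add to both counters and so do not change the decision.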

10
Supervised Methodology

Create a sample of training data where a given
target word is manually annotated with a sense
from a predetermined set of possibilities.
One tagged word per instance/lexical sample
disambiguation
Select a set of features with which to represent
context.
co-occurrences, collocations, POS tags, verb-obj
relations, etc...
Convert sense-tagged training instances to
feature vectors.
Apply a machine learning algorithm to induce a
classifier.
Form: structure or relation among features
Parameters: strength of feature interactions
Convert a held out sample of test data into
feature vectors.
correct sense tags are known but not used
Apply classifier to test instances to assign a
sense tag.
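
The "convert instances to feature vectors" steps can be sketched for the simplest case, binary bag-of-words features. The toy instances and the helper names `build_vocabulary` and `to_vector` are hypothetical; a real system would add collocations, POS tags, and so on.

```python
def build_vocabulary(instances):
    """Collect every context word seen in training, in sorted order."""
    return sorted({w for words, _ in instances for w in words})

def to_vector(words, vocabulary):
    """Binary vector: 1 if the vocabulary word occurs in the context, else 0."""
    present = set(words)
    return [1 if w in present else 0 for w in vocabulary]

# Toy training instances: (context words, manually assigned sense tag).
train = [
    (["my", "charges", "overdraft"], "bank/1"),
    (["deposit", "my", "check"], "bank/1"),
    (["muddy", "walk"], "bank/2"),
]

vocabulary = build_vocabulary(train)
X_train = [to_vector(words, vocabulary) for words, _ in train]
y_train = [sense for _, sense in train]

# A held-out test instance uses the SAME vocabulary; unseen words are ignored.
test_vector = to_vector(["muddy", "pole"], vocabulary)
print(vocabulary)
print(X_train, y_train, test_vector)
```

The test vector is built from the training vocabulary only, which keeps the known test sense tag (and any test-only words) out of training.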

11
Outline

What is Supervised Learning?
Task Definition
Naïve Bayesian Classifier
Decision Lists and Trees
Ensembles of Classifiers

12
Naïve Bayesian Classifier

The Naïve Bayesian Classifier is well known in the Machine
Learning community for good performance across a
range of tasks (e.g., Domingos and Pazzani, 1997).
Word Sense Disambiguation is no exception.
Assumes conditional independence among features,
given the sense of a word.
The form of the model is assumed, but parameters
are estimated from training instances
When applied to WSD, features are often a bag of
words that come from the training data
Usually thousands of binary features that
indicate if a word is present in the context of
the target word (or not)

13
Bayesian Inference

Given observed features, what is most likely
sense?
Estimate probability of observed features given
sense
Estimate unconditional probability of sense
Unconditional probability of features is a
normalizing term; it doesn't affect sense
classification

14
Naïve Bayesian Model
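
Under the conditional independence assumption described on the preceding slides, the model is usually written in the following standard textbook form. The symbols $S$ (the sense) and $F_1, \ldots, F_n$ (the features) are chosen here to match the worked example on the next slide.

```latex
P(F_1, F_2, \ldots, F_n, S) = P(S) \prod_{i=1}^{n} P(F_i \mid S)
\qquad
\hat{S} = \arg\max_{S} \; P(S) \prod_{i=1}^{n} P(F_i \mid S)
```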
15
The Naïve Bayesian Classifier

Given 2,000 instances of "bank", 1,500 for bank/1
(financial sense) and 500 for bank/2 (river
sense):
P(S1) = 1,500/2,000 = .75
P(S2) = 500/2,000 = .25
Given "credit" occurs 200 times with bank/1 and 4
times with bank/2:
P(F1="credit") = 204/2,000 = .102
P(F1="credit" | S1) = 200/1,500 = .133
P(F1="credit" | S2) = 4/500 = .008
Given a test instance that has one feature,
"credit":
P(S1 | F1="credit") = .133 × .75 / .102 = .978
P(S2 | F1="credit") = .008 × .25 / .102 = .020
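
The same arithmetic can be checked in a few lines of code. This sketch handles a single feature only, and the function name `naive_bayes_posteriors` is hypothetical. Because it does not round intermediate values, it gives about .980 for bank/1 rather than the slide's .978.

```python
def naive_bayes_posteriors(sense_counts, feature_counts):
    """Posterior P(sense | feature) for one observed feature, from raw counts."""
    total = sum(sense_counts.values())
    joint = {}
    for sense, n in sense_counts.items():
        prior = n / total                      # P(S)
        likelihood = feature_counts[sense] / n  # P(F | S)
        joint[sense] = prior * likelihood       # P(S) * P(F | S)
    evidence = sum(joint.values())              # P(F), the normalizing term
    return {sense: j / evidence for sense, j in joint.items()}

posteriors = naive_bayes_posteriors(
    sense_counts={"bank/1": 1500, "bank/2": 500},
    feature_counts={"bank/1": 200, "bank/2": 4},   # occurrences of "credit"
)
best_sense = max(posteriors, key=posteriors.get)
print(posteriors, best_sense)
```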

16
Comparative Results

(Leacock et al., 1993) compared Naïve Bayes with
a Neural Network and a Context Vector approach
when disambiguating six senses of "line".
(Mooney, 1996) compared Naïve Bayes with a Neural
Network, Decision Tree/List Learners, Disjunctive
and Conjunctive Normal Form learners, and a
perceptron when disambiguating six senses of
"line".
(Pedersen, 1998) compared Naïve Bayes with
Decision Tree, Rule Based Learner, Probabilistic
Model, etc. when disambiguating "line" and 12 other
words.
All found that the Naïve Bayesian Classifier
performed as well as any of the other methods!

17
Outline

What is Supervised Learning?
Task Definition
Naïve Bayesian Classifiers
Decision Lists and Trees
Ensembles of Classifiers

18
Decision Lists and Trees

Very widely used in Machine Learning.
Decision trees were used very early for WSD research
(e.g., Kelly and Stone, 1975; Black, 1988).
Represent disambiguation problem as a series of
questions (presence of feature) that reveal the
sense of a word.
List decides between two senses after one
positive answer
Tree allows for decision among multiple senses
after a series of answers
Uses a smaller, more refined set of features than
bag of words and Naïve Bayes.
More descriptive and easier to interpret.

19
Decision List for WSD (Yarowsky, 1994)

Identify collocational features from sense tagged
data.
Word immediately to the left or right of target
I have my bank/1 statement.
The river bank/2 is muddy.
Pair of words to immediate left or right of
target
The world's richest bank/1 is here in New York.
The river bank/2 is muddy.
Words found within k positions to left or right
of target, where k is often 10-50
My credit is just horrible because my bank/1 has
made several mistakes with my account and the
balance is very low.
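
The three kinds of collocational feature listed above can be extracted with a short function. This is a sketch under assumptions: the feature labels (`word-1`, `pair-right`, `within-k`, and so on) and the function name `collocation_features` are invented for illustration and are not Yarowsky's notation.

```python
def collocation_features(tokens, target_index, k=10):
    """Collect collocational features around tokens[target_index]."""
    feats = set()
    n = len(tokens)
    # Word immediately to the left / right of the target.
    if target_index > 0:
        feats.add(("word-1", tokens[target_index - 1]))
    if target_index + 1 < n:
        feats.add(("word+1", tokens[target_index + 1]))
    # Pair of words immediately to the left / right of the target.
    if target_index >= 2:
        feats.add(("pair-left", tuple(tokens[target_index - 2:target_index])))
    if target_index + 2 < n:
        feats.add(("pair-right", tuple(tokens[target_index + 1:target_index + 3])))
    # Any word within k positions of the target, on either side.
    for i in range(max(0, target_index - k), min(n, target_index + k + 1)):
        if i != target_index:
            feats.add(("within-k", tokens[i]))
    return feats

tokens = "the river bank is muddy".split()
feats = collocation_features(tokens, target_index=2, k=10)
print(sorted(feats, key=str))
```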

20
Building the Decision List

Sort order of collocation tests using log of
conditional probabilities.
Words most indicative of one sense (and not the
other) will be ranked highly.

21
Computing DL score

Given 2,000 instances of "bank", 1,500 for bank/1
(financial sense) and 500 for bank/2 (river
sense):
P(S1) = 1,500/2,000 = .75
P(S2) = 500/2,000 = .25
Given "credit" occurs 200 times with bank/1 and 4
times with bank/2:
P(F1="credit") = 204/2,000 = .102
P(F1="credit" | S1) = 200/1,500 = .133
P(F1="credit" | S2) = 4/500 = .008
From Bayes Rule:
P(S1 | F1="credit") = .133 × .75 / .102 = .978
P(S2 | F1="credit") = .008 × .25 / .102 = .020
DL Score = abs(log(.978/.020)) = 3.89
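
The score on this slide is reproduced when the logarithm is taken as the natural log, which the sketch below assumes. The function name `dl_score` is hypothetical, and both probabilities must be nonzero; handling zero counts (smoothing) is outside what the slide covers.

```python
import math

def dl_score(p_sense1_given_feature, p_sense2_given_feature):
    """Absolute log-ratio of the two posterior probabilities (natural log)."""
    return abs(math.log(p_sense1_given_feature / p_sense2_given_feature))

# Posteriors for the feature "credit", using the rounded values from the slide.
score = dl_score(0.978, 0.020)
print(round(score, 2))  # 3.89
```

Taking the absolute value makes the score measure how strongly a feature points to either sense; which sense it points to is recorded separately.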

22
Using the Decision List

Sort by DL-score, then go through the test instance looking
for a matching feature. The first match reveals the sense.

DL-score  Feature             Sense
3.89      credit within bank  Bank/1 financial
2.20      bank is muddy       Bank/2 river
1.09      pole within bank    Bank/2 river
0.00      of the bank         N/A
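
Applying the list is a first-match search over the sorted entries, sketched below. The code makes two simplifying assumptions: each feature is a small test on the token list, and "within bank" means "anywhere in the instance". The names `classify` and `has_phrase` are hypothetical.

```python
def has_phrase(tokens, phrase):
    """True if the token sequence `phrase` occurs contiguously in `tokens`."""
    m = len(phrase)
    return any(tokens[i:i + m] == phrase for i in range(len(tokens) - m + 1))

# (DL-score, feature test, sense) entries taken from the table above.
decision_list = [
    (3.89, lambda t: "credit" in t, "bank/1 financial"),
    (2.20, lambda t: has_phrase(t, ["bank", "is", "muddy"]), "bank/2 river"),
    (1.09, lambda t: "pole" in t, "bank/2 river"),
    (0.00, lambda t: has_phrase(t, ["of", "the", "bank"]), None),  # N/A
]

def classify(tokens, decision_list, default=None):
    """Try tests from highest to lowest score; the first match decides."""
    ranked = sorted(decision_list, key=lambda entry: entry[0], reverse=True)
    for score, test, sense in ranked:
        if test(tokens):
            return sense
    return default

print(classify("i planted my pole in the bank".split(), decision_list))
print(classify("my credit is bad and the bank is muddy".split(), decision_list))
```

In the second call both "credit" and "bank is muddy" match, but "credit" has the higher score, so the financial sense is returned.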
23
Using the Decision List
24
Learning a Decision Tree

Identify the feature that most cleanly divides
the training data into the known senses.
"Cleanly" is measured by information gain or gain
ratio.
Create subsets of training data according to
feature values.
Find another feature that most cleanly divides a
subset of the training data.
Continue until each subset of training data is
pure or as clean as possible.
Well known decision tree learning algorithms
include ID3 and C4.5 (Quinlan, 1986, 1993).
In Senseval-1 a modified decision list (which
supported some conditional branching) was most
accurate for English Lexical Sample task
(Yarowsky, 2000)
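
Information gain, the measure of "cleanly" mentioned above, is the drop in class entropy after splitting on a feature. The sketch below uses the standard definition with base-2 logs; the toy data and variable names are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

senses     = ["bank/1", "bank/1", "bank/2", "bank/2"]
has_credit = [1, 1, 0, 0]   # splits the senses perfectly
has_the    = [1, 0, 1, 0]   # tells us nothing about the sense

print(information_gain(has_credit, senses))
print(information_gain(has_the, senses))
```

A tree learner would pick `has_credit` as the first question here, since it yields pure subsets, and would never need `has_the`.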

25
Supervised WSD with Individual Classifiers

Most supervised Machine Learning algorithms have
been applied to Word Sense Disambiguation, most
work reasonably well.
Features tend to differentiate among methods more
than the learning algorithms.
Good sets of features tend to include
Co-occurrences or keywords (global)
Collocations (local)
Bigrams (local and global)
Part of speech (local)
Predicate-argument relations
Verb-object, subject-verb,
Heads of Noun and Verb Phrases

26
Convergence of Results

Accuracy of different systems applied to the same
data tends to converge on a particular value, with no
one system shockingly better than another.
Senseval-1: a number of systems in the range of
74-78% accuracy for the English Lexical Sample task.
Senseval-2: a number of systems in the range of
61-64% accuracy for the English Lexical Sample task.
Senseval-3: a number of systems in the range of
70-73% accuracy for the English Lexical Sample task.
What to do next?

27
Outline

What is Supervised Learning?
Task Definition
Naïve Bayesian Classifiers
Decision Lists and Trees
Ensembles of Classifiers

28
Ensembles of Classifiers

Classifier error has two components (Bias and
Variance).
Some algorithms (e.g., decision trees) try to
build a representation of the training data: Low
Bias/High Variance.
Others (e.g., Naïve Bayes) assume a parametric
form and don't represent the training data: High
Bias/Low Variance.
Combining classifiers with different bias/variance
characteristics can lead to improved
overall accuracy.
Bagging a decision tree can smooth out the
effect of small variations in the training data
(Breiman, 1996)
Sample with replacement from the training data to
learn multiple decision trees.
Outliers in training data will tend to be
obscured/eliminated.
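
Bagging can be sketched as: draw bootstrap samples, train one model per sample, and let the models vote. To keep the example short, a one-word decision stump stands in for a full decision tree; that substitution, the toy data, and all function names are assumptions made for illustration.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Sample len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Toy base learner: split on the one word that best separates the senses."""
    majority = Counter(s for _, s in sample).most_common(1)[0][0]
    vocab = sorted({w for words, _ in sample for w in words})
    best = (-1, None, majority, majority)
    for word in vocab:
        with_w = Counter(s for words, s in sample if word in words)
        without = Counter(s for words, s in sample if word not in words)
        pred_with = with_w.most_common(1)[0][0]
        pred_without = without.most_common(1)[0][0] if without else majority
        correct = with_w[pred_with] + without[pred_without]
        if correct > best[0]:
            best = (correct, word, pred_with, pred_without)
    _, word, pred_with, pred_without = best
    return lambda words: pred_with if word in words else pred_without

def bagged_classifier(data, n_models=11, seed=0):
    """Train n_models stumps on bootstrap samples; predict by majority vote."""
    rng = random.Random(seed)
    models = [train_stump(bootstrap(data, rng)) for _ in range(n_models)]
    def predict(words):
        votes = Counter(model(words) for model in models)
        return votes.most_common(1)[0][0]
    return predict

data = [
    ({"credit", "account"}, "bank/1"),
    ({"credit", "loan"}, "bank/1"),
    ({"deposit", "credit"}, "bank/1"),
    ({"muddy", "river"}, "bank/2"),
    ({"river", "pole"}, "bank/2"),
    ({"muddy", "river", "catfish"}, "bank/2"),
]

single = train_stump(data)        # one model trained on all the data
predict = bagged_classifier(data)  # ensemble of models on resampled data
print(single({"credit"}), predict({"river", "muddy"}))
```

Because each stump sees a slightly different resampling of the data, an unusual training instance influences only some of the models, and the vote dilutes its effect.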

29
Ensemble Considerations

Must choose different learning algorithms with
significantly different bias/variance
characteristics.
Naïve Bayesian Classifier versus Decision Tree
Must choose feature representations that yield
significantly different (independent?) views of
the training data.
Lexical versus syntactic features
Must choose how to combine classifiers.
Simple Majority Voting
Averaging of probabilities across multiple
classifier output
Maximum Entropy combination (e.g., Klein et
al., 2002)
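
The first two combination schemes are easy to sketch, and they can disagree. In the example below the three classifier outputs are made-up numbers, and the function names `majority_vote` and `average_probabilities` are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the sense predicted by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]

def average_probabilities(distributions):
    """Average each sense's probability across classifiers; return (best, averages)."""
    senses = set().union(*distributions)
    n = len(distributions)
    avg = {s: sum(d.get(s, 0.0) for d in distributions) / n for s in senses}
    return max(avg, key=avg.get), avg

# Made-up outputs of three classifiers for one test instance.
outputs = [
    {"bank/1": 0.90, "bank/2": 0.10},
    {"bank/1": 0.40, "bank/2": 0.60},
    {"bank/1": 0.45, "bank/2": 0.55},
]

voted = majority_vote([max(d, key=d.get) for d in outputs])
averaged, avg = average_probabilities(outputs)
print(voted, averaged, avg)
```

Two of the three classifiers narrowly prefer bank/2, so voting picks bank/2. Averaging picks bank/1 because the first classifier is far more confident, which is why the choice of combination method matters.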

30
Ensemble Results

(Pedersen, 2000) achieved state of the art for
"interest" and "line" data using an ensemble of Naïve
Bayesian Classifiers.
Many Naïve Bayesian Classifiers trained on
varying sized windows of context / bags of words.
Classifiers combined by a weighted vote
(Florian and Yarowsky, 2002) achieved state of
the art for Senseval-1 and Senseval-2 data using
combination of six classifiers.
Rich set of collocational and syntactic features.
Combined via linear combination of top three
classifiers.
Many Senseval-2 and Senseval-3 systems employed
ensemble methods.

31
References

(Black, 1988) E. Black (1988) An experiment in
computational discrimination of English word
senses. IBM Journal of Research and Development
(32) pg. 185-194.
(Breiman, 1996) L. Breiman. (1996) The heuristics
of instability in model selection. Annals of
Statistics (24) pg. 2350-2383.
(Domingos and Pazzani, 1997) P. Domingos and M.
Pazzani. (1997) On the Optimality of the Simple
Bayesian Classifier under Zero-One Loss, Machine
Learning (29) pg. 103-130.
(Domingos, 2000) P. Domingos. (2000) A Unified
Bias Variance Decomposition for Zero-One and
Squared Loss. In Proceedings of AAAI. Pg.
564-569.
(Florian and Yarowsky, 2002) R. Florian and D.
Yarowsky. (2002) Modeling Consensus Classifier
Combination for Word Sense Disambiguation. In
Proceedings of EMNLP, pp 25-32.
(Kelly and Stone, 1975). E. Kelly and P. Stone.
(1975) Computer Recognition of English Word
Senses, North Holland Publishing Co., Amsterdam.
(Klein et al., 2002) D. Klein, K. Toutanova, H.
Tolga Ilhan, S. Kamvar, and C. Manning, Combining
Heterogeneous Classifiers for Word-Sense
Disambiguation, Proceedings of Senseval-2. pg.
87-89.
(Leacock et al., 1993) C. Leacock, J. Towell, E.
Voorhees. (1993) Corpus based statistical sense
resolution. In Proceedings of the ARPA Workshop
on Human Language Technology. pg. 260-265.
(Mooney, 1996) R. Mooney. (1996) Comparative
experiments on disambiguating word senses: An
illustration of the role of bias in machine
learning. Proceedings of EMNLP. pg. 82-91.
(Pedersen, 1998) T. Pedersen. (1998) Learning
Probabilistic Models of Word Sense
Disambiguation. Ph.D. Dissertation. Southern
Methodist University.
(Pedersen, 2000) T. Pedersen (2000) A simple
approach to building ensembles of Naive Bayesian
classifiers for word sense disambiguation. In
Proceedings of NAACL.
(Quinlan, 1986) J.R. Quinlan. (1986) Induction
of Decision Trees. Machine Learning (1). pg.
81-106.
(Quinlan, 1993) J.R. Quinlan. (1993) C4.5:
Programs for Machine Learning. San Francisco,
Morgan Kaufmann.
(Yarowsky, 1994) D. Yarowsky. (1994) Decision
lists for lexical ambiguity resolution:
Application to accent restoration in Spanish and
French. In Proceedings of ACL. pp. 88-95.
(Yarowsky, 2000) D. Yarowsky. (2000)
Hierarchical decision lists for word sense
disambiguation. Computers and the Humanities, 34.