Title: Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation
1. Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation
- Saif Mohammad, University of Toronto, http://www.cs.toronto.edu/~smm
- Ted Pedersen, University of Minnesota, http://www.d.umn.edu/~tpederse
2. Word Sense Disambiguation
- Harry cast a bewitching spell.
- Humans immediately understand spell to mean a charm or incantation, not a reading out letter by letter or a period of time.
- Words with multiple senses: polysemy, ambiguity!
- Humans utilize background knowledge and context.
- Machines lack background knowledge.
- Automatically identifying the intended sense of a word in written text, based on its context, remains a hard problem.
- Best accuracies in a recent international evaluation exercise were around 65%.
3. Why Do We Need WSD?
- Information Retrieval
- Query: cricket bat
- Documents pertaining to the insect (cricket) or the mammal (bat) are irrelevant.
- Machine Translation
- Consider English-to-Hindi translation.
- Should head be translated to sar (upper part of the body) or adhyaksh (leader)?
- Machine-human interaction
- Instructions to machines.
- Interactive home system: turn on the lights
- Domestic android: get the door
- Applications are widespread and will affect our way of life.
4. Terminology
- Harry cast a bewitching spell
- Target word: the word whose intended sense is to be identified, e.g., spell.
- Context: the sentence housing the target word, and possibly one or two sentences around it, e.g., Harry cast a bewitching spell.
- Instance: the target word along with its context.
- WSD is a classification problem wherein the occurrence of the target word is assigned to one of its many possible senses.
5. Corpus-Based Supervised Machine Learning
- "A computer program is said to learn from experience if its performance at tasks improves with experience." - Mitchell
- Task: word sense disambiguation of given test instances.
- Performance: the ratio of instances correctly disambiguated to the total test instances, i.e., accuracy.
- Experience: manually created instances in which target words are marked with their intended sense, i.e., training instances.
- Harry cast a bewitching spell / incantation
6. Decision Trees
- A kind of classifier.
- Assigns a class by asking a series of questions.
- Questions correspond to features of the instance.
- The question asked depends on the answer to the previous question.
- Inverted tree structure.
- Interconnected nodes.
- The topmost node is called the root.
- Each node corresponds to a question / feature.
- Each possible value of a feature has a corresponding branch.
- Leaves terminate every path from the root.
- Each leaf is associated with a class.
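As a rough illustration, here is a minimal decision-tree WSD sketch. It assumes scikit-learn is available; the toy instances and sense labels are made up, and the authors' own system (SyntaLex, described later) uses Weka's decision tree learner instead.

# Minimal decision-tree WSD sketch (assumes scikit-learn; toy data).
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Each training instance: binary context features (here, unigrams)
# paired with the manually assigned sense of the target word "spell".
train = [
    ({"cast": 1, "bewitching": 1}, "incantation"),
    ({"checked": 1, "word": 1},    "spell_out"),
    ({"dry": 1, "weather": 1},     "period_of_time"),
]
X, y = zip(*train)

vec = DictVectorizer()          # maps feature dicts to vectors
clf = DecisionTreeClassifier()  # one feature question per node, senses at leaves
clf.fit(vec.fit_transform(X), y)

test = {"witch": 1, "cast": 1}
print(clf.predict(vec.transform([test])))  # e.g. ['incantation']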
7. WSD Tree
[Figure: an example decision tree for WSD. The root node tests Feature 1; internal nodes test Features 2, 3 and 4, with one branch per feature value (0 or 1); each leaf assigns one of Senses 1-4.]
8. Choice of Learning Algorithm
- Why use decision trees for WSD?
- They have drawbacks, e.g., training data fragmentation.
- What about other learning algorithms such as neural networks?
- Context is a rich source of discrete features.
- The learned model is likely meaningful.
- It may provide insight into the interaction of features.
- Pedersen (2001): choosing the right features is of greater significance than the learning algorithm itself.
- A Decision Tree of Bigrams is an Accurate Predictor of Word Sense, T. Pedersen, in Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-01), June 2-7, 2001, Pittsburgh, PA.
9. Lexical Features
- Surface form
- A word as we observe it in text.
- Case (n): 1. object of investigation; 2. frame or covering; 3. a weird person.
- Surface forms: case, cases, casing.
- An occurrence of casing suggests sense 2.
- Unigrams and Bigrams
- One-word and two-word sequences in text.
- The interest rate is low
- Unigrams: the, interest, rate, is, low
- Bigrams: the interest, interest rate, rate is, is low
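A minimal sketch of how such features might be extracted; the whitespace tokenizer and function names are illustrative assumptions, not the authors' code.

# Unigram and bigram extraction sketch (naive whitespace tokenization).
def unigrams(sentence):
    return sentence.lower().split()

def bigrams(sentence):
    toks = unigrams(sentence)
    return [(toks[i], toks[i + 1]) for i in range(len(toks) - 1)]

print(unigrams("The interest rate is low"))
# ['the', 'interest', 'rate', 'is', 'low']
print(bigrams("The interest rate is low"))
# [('the', 'interest'), ('interest', 'rate'), ('rate', 'is'), ('is', 'low')]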
10. Part of Speech Tagging
- Brill Tagger: the most widely used tool.
- Accuracy around 95%.
- Source code available.
- Easily understood rules.
- Pre-tagging is the act of manually assigning tags to selected words in a text prior to tagging.
- The Brill tagger does not guarantee pre-tagging.
- A patch to the tagger is provided (BrillPatch).
- Guaranteed Pre-Tagging for the Brill Tagger, Mohammad, S. and Pedersen, T., in Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, February 2003, Mexico.
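As a concept-only sketch (not the Brill tagger's or BrillPatch's actual code), guaranteed pre-tagging means that manually assigned tags must survive in the tagger's output; the data structures below are hypothetical.

# Check that pre-assigned tags survive tagging (illustrative only).
def check_pretags(pretags, tagged_output):
    # pretags: {token_index: tag}; tagged_output: list of (word, tag) pairs.
    return all(tagged_output[i][1] == tag for i, tag in pretags.items())

pretags = {3: "VB"}  # we insist that "turn" be tagged as a verb
output = [("Why", "WRB"), ("did", "VBD"), ("Jack", "NNP"),
          ("turn", "VB"), ("left", "NN")]
print(check_pretags(pretags, output))  # True: the pre-tag survived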
11. Part of Speech Features
- A word used in different senses is likely to have different sets of POS tags around it.
- Why did Jack turn/VB against/IN his/PRP team/NN?
- Why did Jack turn/VB left/NN at/IN the/DT crossing?
- Features used:
- Individual word POS: P-2, P-1, P0, P1, P2
- P1 = JJ implies that the word to the right of the target word is an adjective.
- Combinations of the above.
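The window features might be computed as in the sketch below, assuming the context is already POS-tagged; the function name and the tag sequence are illustrative.

# P-2 .. P2 part-of-speech window features around the target word.
def pos_window(tags, target_index, width=2):
    feats = {}
    for offset in range(-width, width + 1):
        i = target_index + offset
        if 0 <= i < len(tags):
            feats["P%d" % offset] = tags[i]
    return feats

# "Why did Jack turn left at the crossing", target word "turn" at index 3.
tags = ["WRB", "VBD", "NNP", "VB", "NN", "IN", "DT", "NN"]
print(pos_window(tags, 3))
# {'P-2': 'VBD', 'P-1': 'NNP', 'P0': 'VB', 'P1': 'NN', 'P2': 'IN'}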
12. Parse Features
- The Collins Parser is used to parse the data.
- Source code available.
- Uses part of speech tagged data as input.
- Head word of a phrase.
- the hard work, the hard surface
- The phrase itself: noun phrase, verb phrase, and so on.
- Parent: head word of the parent phrase.
- fasten the line, cross the line
- Parent phrase.
- http://www.ai.mit.edu/people/mcollins
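A sketch of how phrase-level features could be read off a bracketed parse, using NLTK's Tree for illustration; the rightmost-word head rule is a simplifying assumption, not the Collins head-finding rules.

# Phrase and (naive) head-word features from a bracketed parse.
from nltk import Tree

parse = Tree.fromstring(
    "(S (NP (NNP Harry)) (VP (VBD cast) (NP (DT a) (JJ bewitching) (NN spell))))"
)

def smallest_phrase(tree, target):
    # Smallest phrase-level subtree whose leaves contain the target word.
    candidates = [t for t in tree.subtrees()
                  if t.height() > 2 and target in t.leaves()]
    return min(candidates, key=lambda t: t.height())

phrase = smallest_phrase(parse, "spell")
print(phrase.label())       # NP     (the phrase feature)
print(phrase.leaves()[-1])  # spell  (naive rightmost-word "head")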
13. Sample Parse Tree

(SENTENCE (NOUN PHRASE (NNP Harry))
          (VERB PHRASE (VBD cast)
                       (NOUN PHRASE (DT a) (JJ bewitching) (NN spell))))
14. Sense-Tagged Data
- Senseval-2 data
- 4,328 test instances and 8,611 training instances, ranging over 73 different nouns, verbs and adjectives.
- Senseval-1 data
- 8,512 test instances and 13,276 training instances, ranging over 35 nouns, verbs and adjectives.
- line, hard, serve, interest data
- 4,149, 4,337, 4,378 and 2,476 sense-tagged instances with line, hard, serve and interest as the head words, respectively.
- Around 50,000 sense-tagged instances in all!
15. Experiments
16. Lexical Features (accuracy, %)

               Sval-2   Sval-1   line   hard   serve   interest
Majority        47.7     56.3    54.3   81.5    42.2     54.9
Surface Form    49.3     62.9    54.3   81.5    44.2     64.0
Unigram         55.3     66.9    74.5   83.4    73.3     75.7
Bigram          55.1     66.9    72.9   89.5    72.1     79.9
17. Individual Word POS (Senseval-1)

           All    Nouns   Verbs   Adj.
Majority   56.3   57.2    56.9    64.3
P-2        57.5   58.2    58.6    64.0
P-1        59.2   62.2    58.2    64.3
P0         60.3   62.5    58.2    64.3
P1         63.9   65.4    64.4    66.2
P2         59.9   60.0    60.8    65.2
18. Individual Word POS (Senseval-2)

           All    Nouns   Verbs   Adj.
Majority   47.7   51.0    39.7    59.0
P-2        47.1   51.9    38.0    57.9
P-1        49.6   55.2    40.2    59.0
P0         49.9   55.7    40.6    58.2
P1         53.1   53.8    49.1    61.0
P2         48.9   50.2    43.2    59.4
19. Combining POS Features

                       Sval-2   Sval-1   line   hard   serve   interest
Majority                47.7     56.3    54.3   81.5    42.2     54.9
P0, P1                  54.3     66.7    54.1   81.9    60.2     70.5
P-1, P0, P1             54.6     68.0    60.4   84.8    73.0     78.8
P-2, P-1, P0, P1, P2    54.6     67.8    62.3   86.2    75.7     80.6
20. Parse Features (Senseval-1)

            All    Nouns   Verbs   Adj.
Majority    56.3   57.2    56.9    64.3
Head        64.3   70.9    59.8    66.9
Parent      60.6   62.6    60.3    65.8
Phrase      58.5   57.5    57.2    66.2
Par. Phr.   57.9   58.1    58.3    66.2
21. Parse Features (Senseval-2)

            All    Nouns   Verbs   Adj.
Majority    47.7   51.0    39.7    59.0
Head        51.7   58.5    39.8    64.0
Parent      50.0   56.1    40.1    59.3
Phrase      48.3   51.7    40.3    59.5
Par. Phr.   48.5   53.0    39.1    60.3
22. Thoughts
- Both lexical and syntactic features perform comparably.
- But do they get the same instances right?
- How redundant are the individual feature sets?
- Are there instances correctly disambiguated by one feature set and not by the other?
- How complementary are the individual feature sets?
- Is the effort to combine lexical and syntactic features justified?
23. Measures
- Baseline Ensemble: the accuracy of a hypothetical ensemble which predicts the sense correctly only if both individual feature sets do so.
- Quantifies redundancy amongst feature sets.
- Optimal Ensemble: the accuracy of a hypothetical ensemble which predicts the sense correctly if either of the individual feature sets does so.
- The difference from the individual accuracies quantifies complementarity.
- We used a simple ensemble which sums the probabilities assigned to each sense by the individual feature sets to decide the intended sense.
24. Best Combinations (accuracy, %)

Data       Maj.   Set 1 (acc.)      Set 2 (acc.)        Base   Ens.   Opt.   Best
Sval-2     47.7   Unigrams (55.3)   P-1,P0,P1 (55.3)    43.6   57.0   67.9   66.7
Sval-1     56.3   Unigrams (66.9)   P-1,P0,P1 (68.0)    57.6   71.1   78.0   81.1
line       54.3   Unigrams (74.5)   P-1,P0,P1 (60.4)    55.1   74.2   82.0   88.0
hard       81.5   Bigrams  (89.5)   Head, Par. (87.7)   86.1   88.9   91.3   83.0
serve      42.2   Unigrams (73.3)   P-1,P0,P1 (73.0)    58.4   81.6   89.9   83.0
interest   54.9   Bigrams  (79.9)   P-1,P0,P1 (78.8)    67.6   83.2   90.1   89.0
25. Conclusions
- There is a significant amount of complementarity across lexical and syntactic features.
- The combination of the two is justified.
- We show that simple lexical and part of speech features can achieve state of the art results.
- How best to capitalize on the complementarity is still an open issue.
26. Conclusions (continued)
- The part of speech of the word immediately to the right of the target word was found most useful.
- POS of words immediately to the right of the target word works best for verbs and adjectives.
- Nouns are helped by tags on either side.
- (P0, P1) was found to be most potent when the training data per word is small (Sval data).
- A larger POS context (P-2, P-1, P0, P1, P2) is beneficial when the training data per word is large (line, hard, serve and interest data).
- The head word of the phrase is particularly useful for adjectives.
- Nouns are helped by both head and parent.
27. Code and Data Resources
- SyntaLex: a system to do WSD using lexical and syntactic features. Weka's decision tree learning algorithm is utilized.
- posSenseval: part of speech tags any data in the Senseval-2 data format. Brill Tagger used.
- parseSenseval: parses data in the format output by the Brill Tagger. Output is in the Senseval-2 data format with part of speech and parse information as XML tags. Uses the Collins Parser.
- Packages to convert the line, hard, serve and interest data to the Senseval-1 and Senseval-2 data formats.
- BrillPatch: a patch to the Brill Tagger to employ Guaranteed Pre-Tagging.
- http://www.d.umn.edu/~tpederse/code.html
- http://www.d.umn.edu/~tpederse/data.html
28. Senseval-3 (March 1 to April 15, 2004)
- Around 8,000 training and 4,000 test instances. Results expected shortly.