Title: Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation
1. Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation
- Saif Mohammad, University of Toronto, http://www.cs.toronto.edu/~smm
- Ted Pedersen, University of Minnesota, http://www.d.umn.edu/~tpederse
2. Word Sense Disambiguation
- Harry cast a bewitching spell.
- Humans immediately understand spell to mean a charm or incantation, not a reading out letter by letter or a period of time.
- Words with multiple senses: polysemy, ambiguity!
- Humans utilize background knowledge and context.
- Machines lack background knowledge.
- Automatically identifying the intended sense of a word in written text, based on its context, remains a hard problem.
- Best accuracies in a recent international evaluation exercise were around 65%.
3. Why Do We Need WSD?
- Information Retrieval
- Query: cricket bat
- Documents pertaining to the insect (cricket) or the mammal (bat) are irrelevant.
- Machine Translation
- Consider English-to-Hindi translation.
- Should head be translated to sar (upper part of the body) or adhyaksh (leader)?
- Machine-human interaction
- Instructions to machines.
- Interactive home system: turn on the lights
- Domestic android: get the door
- Applications are widespread and will affect our way of life.
4. Terminology
- Harry cast a bewitching spell
- Target word: the word whose intended sense is to be identified, e.g., spell.
- Context: the sentence housing the target word, and possibly one or two sentences around it, e.g., Harry cast a bewitching spell.
- Instance: the target word along with its context.
- WSD is a classification problem wherein the occurrence of the target word is assigned to one of its many possible senses.
5. Corpus-Based Supervised Machine Learning
- "A computer program is said to learn from experience if its performance at tasks improves with experience." - Mitchell
- Task: word sense disambiguation of given test instances.
- Performance: the ratio of instances correctly disambiguated to the total test instances, i.e., accuracy.
- Experience: manually created instances in which target words are marked with their intended sense, i.e., training instances.
- Harry cast a bewitching spell / incantation
6. Decision Trees
- A kind of classifier.
- Assigns a class by asking a series of questions.
- Questions correspond to features of the instance.
- The question asked depends on the answer to the previous question.
- Inverted tree structure.
- Interconnected nodes.
- The topmost node is called the root.
- Each node corresponds to a question / feature.
- Each possible value of a feature has a corresponding branch.
- Leaves terminate every path from the root.
- Each leaf is associated with a class.
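As a rough illustration, here is a minimal decision-tree WSD sketch. It assumes scikit-learn is available; the toy instances and sense labels are made up, and the authors' own system (SyntaLex, described later) uses Weka's decision tree learner instead.

# Minimal decision-tree WSD sketch (assumes scikit-learn; toy data).
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Each training instance: binary context features (here, unigrams)
# paired with the manually assigned sense of the target word "spell".
train = [
    ({"cast": 1, "bewitching": 1}, "incantation"),
    ({"checked": 1, "word": 1},    "spell_out"),
    ({"dry": 1, "weather": 1},     "period_of_time"),
]
X, y = zip(*train)

vec = DictVectorizer()          # maps feature dicts to vectors
clf = DecisionTreeClassifier()  # one feature question per node, senses at leaves
clf.fit(vec.fit_transform(X), y)

test = {"witch": 1, "cast": 1}
print(clf.predict(vec.transform([test])))  # e.g. ['incantation']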
7. WSD Tree
[Figure: an example decision tree for WSD. The root node tests Feature 1; internal nodes test Features 2, 3 and 4, with one branch per feature value (0 or 1); each leaf assigns one of Senses 1-4.]
8. Choice of Learning Algorithm
- Why use decision trees for WSD?
- They have drawbacks, e.g., training data fragmentation.
- What about other learning algorithms such as neural networks?
- Context is a rich source of discrete features.
- The learned model is likely meaningful.
- It may provide insight into the interaction of features.
- Pedersen (2001): choosing the right features is of greater significance than the learning algorithm itself.
- A Decision Tree of Bigrams is an Accurate Predictor of Word Sense, T. Pedersen, in Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-01), June 2-7, 2001, Pittsburgh, PA.
9. Lexical Features
- Surface form
- A word as we observe it in text.
- Case (n): 1. object of investigation; 2. frame or covering; 3. a weird person.
- Surface forms: case, cases, casing.
- An occurrence of casing suggests sense 2.
- Unigrams and Bigrams
- One-word and two-word sequences in text.
- The interest rate is low
- Unigrams: the, interest, rate, is, low
- Bigrams: the interest, interest rate, rate is, is low
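A minimal sketch of how such features might be extracted; the whitespace tokenizer and function names are illustrative assumptions, not the authors' code.

# Unigram and bigram extraction sketch (naive whitespace tokenization).
def unigrams(sentence):
    return sentence.lower().split()

def bigrams(sentence):
    toks = unigrams(sentence)
    return [(toks[i], toks[i + 1]) for i in range(len(toks) - 1)]

print(unigrams("The interest rate is low"))
# ['the', 'interest', 'rate', 'is', 'low']
print(bigrams("The interest rate is low"))
# [('the', 'interest'), ('interest', 'rate'), ('rate', 'is'), ('is', 'low')]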
10. Part of Speech Tagging
- Brill Tagger: the most widely used tool.
- Accuracy around 95%.
- Source code available.
- Easily understood rules.
- Pre-tagging is the act of manually assigning tags to selected words in a text prior to tagging.
- The Brill tagger does not guarantee pre-tagging.
- A patch to the tagger is provided (BrillPatch).
- Guaranteed Pre-Tagging for the Brill Tagger, Mohammad, S. and Pedersen, T., in Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, February 2003, Mexico.
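As a concept-only sketch (not the Brill tagger's or BrillPatch's actual code), guaranteed pre-tagging means that manually assigned tags must survive in the tagger's output; the data structures below are hypothetical.

# Check that pre-assigned tags survive tagging (illustrative only).
def check_pretags(pretags, tagged_output):
    # pretags: {token_index: tag}; tagged_output: list of (word, tag) pairs.
    return all(tagged_output[i][1] == tag for i, tag in pretags.items())

pretags = {3: "VB"}  # we insist that "turn" be tagged as a verb
output = [("Why", "WRB"), ("did", "VBD"), ("Jack", "NNP"),
          ("turn", "VB"), ("left", "NN")]
print(check_pretags(pretags, output))  # True: the pre-tag survived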
11. Part of Speech Features
- A word used in different senses is likely to have different sets of POS tags around it.
- Why did Jack turn/VB against/IN his/PRP team/NN?
- Why did Jack turn/VB left/NN at/IN the/DT crossing?
- Features used:
- Individual word POS: P-2, P-1, P0, P1, P2
- P1 = JJ implies that the word to the right of the target word is an adjective.
- Combinations of the above.
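The window features might be computed as in the sketch below, assuming the context is already POS-tagged; the function name and the tag sequence are illustrative.

# P-2 .. P2 part-of-speech window features around the target word.
def pos_window(tags, target_index, width=2):
    feats = {}
    for offset in range(-width, width + 1):
        i = target_index + offset
        if 0 <= i < len(tags):
            feats["P%d" % offset] = tags[i]
    return feats

# "Why did Jack turn left at the crossing", target word "turn" at index 3.
tags = ["WRB", "VBD", "NNP", "VB", "NN", "IN", "DT", "NN"]
print(pos_window(tags, 3))
# {'P-2': 'VBD', 'P-1': 'NNP', 'P0': 'VB', 'P1': 'NN', 'P2': 'IN'}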
12. Parse Features
- The Collins Parser is used to parse the data.
- Source code available.
- Uses part of speech tagged data as input.
- Head word of a phrase.
- the hard work, the hard surface
- The phrase itself: noun phrase, verb phrase, and so on.
- Parent: head word of the parent phrase.
- fasten the line, cross the line
- Parent phrase.
- http://www.ai.mit.edu/people/mcollins
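A sketch of how phrase-level features could be read off a bracketed parse, using NLTK's Tree for illustration; the rightmost-word head rule is a simplifying assumption, not the Collins head-finding rules.

# Phrase and (naive) head-word features from a bracketed parse.
from nltk import Tree

parse = Tree.fromstring(
    "(S (NP (NNP Harry)) (VP (VBD cast) (NP (DT a) (JJ bewitching) (NN spell))))"
)

def smallest_phrase(tree, target):
    # Smallest phrase-level subtree whose leaves contain the target word.
    candidates = [t for t in tree.subtrees()
                  if t.height() > 2 and target in t.leaves()]
    return min(candidates, key=lambda t: t.height())

phrase = smallest_phrase(parse, "spell")
print(phrase.label())       # NP     (the phrase feature)
print(phrase.leaves()[-1])  # spell  (naive rightmost-word "head")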
13. Sample Parse Tree

(SENTENCE (NOUN PHRASE (NNP Harry))
          (VERB PHRASE (VBD cast)
                       (NOUN PHRASE (DT a) (JJ bewitching) (NN spell))))
14. Sense-Tagged Data
- Senseval-2 data
- 4,328 test instances and 8,611 training instances, ranging over 73 different nouns, verbs and adjectives.
- Senseval-1 data
- 8,512 test instances and 13,276 training instances, ranging over 35 nouns, verbs and adjectives.
- line, hard, serve, interest data
- 4,149, 4,337, 4,378 and 2,476 sense-tagged instances with line, hard, serve and interest as the head words, respectively.
- Around 50,000 sense-tagged instances in all!
15. Experiments
16. Lexical Features (accuracy, %)

               Sval-2   Sval-1   line   hard   serve   interest
Majority        47.7     56.3    54.3   81.5    42.2     54.9
Surface Form    49.3     62.9    54.3   81.5    44.2     64.0
Unigram         55.3     66.9    74.5   83.4    73.3     75.7
Bigram          55.1     66.9    72.9   89.5    72.1     79.9
17. Individual Word POS (Senseval-1)

           All    Nouns   Verbs   Adj.
Majority   56.3   57.2    56.9    64.3
P-2        57.5   58.2    58.6    64.0
P-1        59.2   62.2    58.2    64.3
P0         60.3   62.5    58.2    64.3
P1         63.9   65.4    64.4    66.2
P2         59.9   60.0    60.8    65.2
18. Individual Word POS (Senseval-2)

           All    Nouns   Verbs   Adj.
Majority   47.7   51.0    39.7    59.0
P-2        47.1   51.9    38.0    57.9
P-1        49.6   55.2    40.2    59.0
P0         49.9   55.7    40.6    58.2
P1         53.1   53.8    49.1    61.0
P2         48.9   50.2    43.2    59.4
19. Combining POS Features

                       Sval-2   Sval-1   line   hard   serve   interest
Majority                47.7     56.3    54.3   81.5    42.2     54.9
P0, P1                  54.3     66.7    54.1   81.9    60.2     70.5
P-1, P0, P1             54.6     68.0    60.4   84.8    73.0     78.8
P-2, P-1, P0, P1, P2    54.6     67.8    62.3   86.2    75.7     80.6
20. Parse Features (Senseval-1)

            All    Nouns   Verbs   Adj.
Majority    56.3   57.2    56.9    64.3
Head        64.3   70.9    59.8    66.9
Parent      60.6   62.6    60.3    65.8
Phrase      58.5   57.5    57.2    66.2
Par. Phr.   57.9   58.1    58.3    66.2
21. Parse Features (Senseval-2)

            All    Nouns   Verbs   Adj.
Majority    47.7   51.0    39.7    59.0
Head        51.7   58.5    39.8    64.0
Parent      50.0   56.1    40.1    59.3
Phrase      48.3   51.7    40.3    59.5
Par. Phr.   48.5   53.0    39.1    60.3
22. Thoughts
- Both lexical and syntactic features perform comparably.
- But do they get the same instances right?
- How redundant are the individual feature sets?
- Are there instances correctly disambiguated by one feature set and not by the other?
- How complementary are the individual feature sets?
- Is the effort to combine lexical and syntactic features justified?
23. Measures
- Baseline Ensemble: the accuracy of a hypothetical ensemble which predicts the sense correctly only if both individual feature sets do so.
- Quantifies redundancy amongst feature sets.
- Optimal Ensemble: the accuracy of a hypothetical ensemble which predicts the sense correctly if either of the individual feature sets does so.
- The difference from the individual accuracies quantifies complementarity.
- We used a simple ensemble which sums the probabilities assigned to each sense by the individual feature sets to decide the intended sense.
24. Best Combinations (accuracy, %)

Data       Maj.   Set 1 (acc.)      Set 2 (acc.)        Base   Ens.   Opt.   Best
Sval-2     47.7   Unigrams (55.3)   P-1,P0,P1 (55.3)    43.6   57.0   67.9   66.7
Sval-1     56.3   Unigrams (66.9)   P-1,P0,P1 (68.0)    57.6   71.1   78.0   81.1
line       54.3   Unigrams (74.5)   P-1,P0,P1 (60.4)    55.1   74.2   82.0   88.0
hard       81.5   Bigrams  (89.5)   Head, Par. (87.7)   86.1   88.9   91.3   83.0
serve      42.2   Unigrams (73.3)   P-1,P0,P1 (73.0)    58.4   81.6   89.9   83.0
interest   54.9   Bigrams  (79.9)   P-1,P0,P1 (78.8)    67.6   83.2   90.1   89.0
25. Conclusions
- There is a significant amount of complementarity across lexical and syntactic features.
- The combination of the two is justified.
- We show that simple lexical and part of speech features can achieve state of the art results.
- How best to capitalize on the complementarity is still an open issue.
26. Conclusions (continued)
- The part of speech of the word immediately to the right of the target word was found most useful.
- POS of words immediately to the right of the target word works best for verbs and adjectives.
- Nouns are helped by tags on either side.
- (P0, P1) was found to be most potent when the training data per word is small (Sval data).
- A larger POS context (P-2, P-1, P0, P1, P2) is beneficial when the training data per word is large (line, hard, serve and interest data).
- The head word of the phrase is particularly useful for adjectives.
- Nouns are helped by both head and parent.
27. Code and Data Resources
- SyntaLex: a system to do WSD using lexical and syntactic features. Weka's decision tree learning algorithm is utilized.
- posSenseval: part of speech tags any data in the Senseval-2 data format. Brill Tagger used.
- parseSenseval: parses data in the format output by the Brill Tagger. Output is in the Senseval-2 data format with part of speech and parse information as XML tags. Uses the Collins Parser.
- Packages to convert the line, hard, serve and interest data to the Senseval-1 and Senseval-2 data formats.
- BrillPatch: a patch to the Brill Tagger to employ Guaranteed Pre-Tagging.
- http://www.d.umn.edu/~tpederse/code.html
- http://www.d.umn.edu/~tpederse/data.html
28. Senseval-3 (March 1 to April 15, 2004)
- Around 8,000 training and 4,000 test instances. Results expected shortly.