
1
Combining Lexical and Syntactic Features for
Supervised Word Sense Disambiguation
  • Saif Mohammad (University of Toronto)
    http://www.cs.toronto.edu/~smm
  • Ted Pedersen (University of Minnesota)
    http://www.d.umn.edu/~tpederse

2
Word Sense Disambiguation
  • Harry cast a bewitching spell.
  • Humans immediately understand spell to mean a
    charm or incantation.
  • Could it instead mean reading out letter by
    letter, or a period of time? We rule these
    senses out effortlessly.
  • Words with multiple senses: polysemy, ambiguity!
  • Humans utilize background knowledge and context.
  • Machines lack background knowledge.
  • Automatically identifying the intended sense of a
    word in written text, based on its context,
    remains a hard problem.
  • Best accuracies in recent international
    evaluation exercises: around 65%.

3
Why do we need WSD?
  • Information Retrieval
  • Query: cricket bat
  • Documents pertaining to the insect and the
    mammal are irrelevant.
  • Machine Translation
  • Consider English to Hindi translation.
  • Is head translated to sar (upper part of the
    body) or adhyaksh (leader)?
  • Machine-human interaction
  • Instructions to machines.
  • Interactive home system: turn on the lights
  • Domestic android: get the door
  • Applications are widespread and will affect our
    way of life.

4
Terminology
  • Harry cast a bewitching spell
  • Target word: the word whose intended sense is to
    be identified.
  • spell
  • Context: the sentence containing the target word
    and possibly 1 or 2 sentences around it.
  • Harry cast a bewitching spell
  • Instance: the target word along with its context.
  • WSD is a classification problem wherein an
    occurrence of the target word is assigned to one
    of its many possible senses.

5
Corpus-Based Supervised Machine Learning
  • A computer program is said to learn from
    experience if its performance at tasks improves
    with experience. - Mitchell
  • Task: word sense disambiguation of given test
    instances.
  • Performance: ratio of instances correctly
    disambiguated to the total number of test
    instances, i.e., accuracy.
  • Experience: manually created instances in which
    target words are marked with the intended sense
    (training instances).
  • Harry cast a bewitching spell (tagged sense:
    incantation)

6
Decision Trees
  • A kind of classifier.
  • Assigns a class by asking a series of questions.
  • Questions correspond to features of the instance.
  • The question asked depends on the answer to the
    previous question.
  • Inverted tree structure of interconnected nodes.
  • The topmost node is called the root.
  • Each node corresponds to a question / feature.
  • Each possible value of a feature has a
    corresponding branch.
  • Leaves terminate every path from the root.
  • Each leaf is associated with a class (see the
    sketch below).
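
As a concrete illustration, here is a minimal sketch of such a
classifier. It assumes scikit-learn (a stand-in for the Weka learner
mentioned later in the talk); the binary feature vectors and sense
labels are invented toy data.

```python
# Minimal sketch of a decision tree sense classifier, assuming
# scikit-learn as a stand-in for the Weka learner used in this work.
# The binary feature vectors and sense labels below are toy data.
from sklearn.tree import DecisionTreeClassifier

# Each row is one instance of the target word; each column answers one
# feature question (e.g. "does the bigram 'interest rate' occur?").
X_train = [[1, 0, 1, 0],
           [0, 1, 0, 1],
           [1, 1, 0, 0],
           [0, 0, 1, 1]]
y_train = ["SENSE 1", "SENSE 2", "SENSE 1", "SENSE 3"]

clf = DecisionTreeClassifier(criterion="entropy")  # information-gain splits
clf.fit(X_train, y_train)

# A new instance is classified by walking the tree from the root,
# answering one feature question per node until a leaf (sense) is reached.
print(clf.predict([[1, 0, 0, 1]]))
```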

7
WSD Tree
[Figure: a sample WSD decision tree. Internal nodes ask binary
questions about Features 1-4; branches are labeled 0/1; each leaf
assigns one of SENSE 1-4.]
8
Choice of Learning Algorithm
  • Why use decision trees for WSD?
  • They have drawbacks: training data fragmentation.
  • What about other learning algorithms such as
    neural networks?
  • Context is a rich source of discrete features.
  • The learned model is likely to be meaningful and
    may provide insight into the interaction of
    features.
  • Pedersen (2001): choosing the right features is
    of greater significance than the learning
    algorithm itself.
  • A Decision Tree of Bigrams is an Accurate
    Predictor of Word Sense, T. Pedersen, Proceedings
    of the Second Meeting of the North American
    Chapter of the Association for Computational
    Linguistics (NAACL-01), June 2-7, 2001,
    Pittsburgh, PA.

9
Lexical Features
  • Surface form
  • A word form as it is observed in text.
  • case (n): 1. object of investigation; 2. frame or
    covering; 3. a weird person.
  • Surface forms: case, cases, casing.
  • An occurrence of casing suggests sense 2.
  • Unigrams and Bigrams
  • One-word and two-word sequences in text.
  • The interest rate is low
  • Unigrams: the, interest, rate, is, low
  • Bigrams: the interest, interest rate, rate is,
    is low (extraction sketched below)
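
A small sketch of reading these features off a tokenized context
(plain Python, no external tools assumed):

```python
# Sketch: unigram and bigram features from a tokenized context.
def lexical_features(tokens):
    unigrams = set(tokens)
    bigrams = set(zip(tokens, tokens[1:]))  # adjacent word pairs
    return unigrams, bigrams

unigrams, bigrams = lexical_features("the interest rate is low".split())
print(unigrams)  # {'the', 'interest', 'rate', 'is', 'low'}
print(bigrams)   # {('the', 'interest'), ('interest', 'rate'), ...}
```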

10
Part of Speech Tagging
  • Brill Tagger: the most widely used tool.
  • Accuracy: around 95%.
  • Source code available.
  • Easily understood rules.
  • Pre-tagging is the act of manually assigning tags
    to selected words in a text prior to tagging.
  • The Brill tagger does not guarantee that
    pre-tagged words retain their tags.
  • A patch to the tagger is provided (BrillPatch); a
    rough sketch of the idea follows.
  • Guaranteed Pre-Tagging for the Brill Tagger,
    Mohammad, S. and Pedersen, T., Proceedings of the
    Fourth International Conference on Intelligent
    Text Processing and Computational Linguistics
    (CICLing-2003), February 2003, Mexico.
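
A rough sketch of the idea, with two stand-ins: NLTK's default tagger
in place of the Brill tagger, and an after-the-fact override in place
of true guaranteed pre-tagging (which, unlike this override, lets the
pinned tag influence how neighboring words are tagged):

```python
# Rough sketch only: NLTK's default tagger stands in for the Brill
# tagger, and a post-hoc override mimics pre-tagging. The real
# BrillPatch makes the pinned tag influence the tagging of neighboring
# words, which this after-the-fact override cannot do.
import nltk  # assumes the default POS tagger model is downloaded

tokens = "Why did Jack turn left at the crossing".split()
tagged = nltk.pos_tag(tokens)

pretags = {3: ("turn", "VB")}  # pin the target word's tag by position
tagged = [pretags.get(i, pair) for i, pair in enumerate(tagged)]
print(tagged)
```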

11
Part of Speech Features
  • A word used in different senses is likely to have
    different sets of POS tags around it.
  • Why did Jack turn/VB against/IN his/PRP team/NN ?
  • Why did Jack turn/VB left/NN at/IN the/DT
    crossing ?
  • Features used:
  • Individual word POS: P-2, P-1, P0, P1, P2
  • P1 = JJ implies that the word to the right of the
    target word is an adjective.
  • Combinations of the above (see the sketch below).
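
A sketch of extracting the five individual POS features from a tagged
sentence; padding edge positions with "NONE" is an assumption:

```python
# Sketch: the individual POS features P-2 .. P2 around a target word,
# given the sentence as (word, tag) pairs. "NONE" pads sentence edges.
def pos_window(tagged, target_index, width=2):
    features = {}
    for offset in range(-width, width + 1):
        i = target_index + offset
        features[f"P{offset}"] = tagged[i][1] if 0 <= i < len(tagged) else "NONE"
    return features

tagged = [("Why", "WRB"), ("did", "VBD"), ("Jack", "NNP"), ("turn", "VB"),
          ("left", "NN"), ("at", "IN"), ("the", "DT"), ("crossing", "NN")]
print(pos_window(tagged, target_index=3))
# {'P-2': 'VBD', 'P-1': 'NNP', 'P0': 'VB', 'P1': 'NN', 'P2': 'IN'}
```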

12
Parse Features
  • The Collins Parser is used to parse the data.
  • Source code available.
  • Takes part-of-speech tagged data as input.
  • Head word of the phrase housing the target word:
    the hard work, the hard surface.
  • The phrase itself: noun phrase, verb phrase, and
    so on.
  • Parent: head word of the parent phrase:
    fasten the line, cross the line.
  • The parent phrase (encoding sketched below).
  • http://www.ai.mit.edu/people/mcollins
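
Once the parser output has been read in, the four features amount to a
small record per instance. A sketch with illustrative field names (not
from the original system); the values follow the cross the line
example above:

```python
# Sketch: encoding the four parse features for one instance, assuming
# head/parent information has already been read from the parser output.
# Field names are illustrative; values follow "cross the line".
def parse_features(head, phrase, parent_head, parent_phrase):
    return {"head": head,                    # head word of the target's phrase
            "phrase": phrase,                # phrase type (NP, VP, ...)
            "parent_head": parent_head,      # head word of the parent phrase
            "parent_phrase": parent_phrase}  # parent phrase type

print(parse_features("line", "NP", "cross", "VP"))
```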

13
Sample Parse Tree
(SENTENCE
  (NOUN PHRASE Harry/NNP)
  (VERB PHRASE cast/VBD
    (NOUN PHRASE a/DT bewitching/JJ spell/NN)))
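
The same tree in executable form, assuming NLTK (not a tool used in
the talk) and labels shortened to S/NP/VP:

```python
# The slide's parse tree rebuilt as an nltk.Tree (labels shortened).
from nltk import Tree

t = Tree.fromstring(
    "(S (NP (NNP Harry))"
    "   (VP (VBD cast) (NP (DT a) (JJ bewitching) (NN spell))))")
t.pretty_print()
```
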
14
Sense-Tagged Data
  • Senseval-2 data
  • 4,328 test instances and 8,611 training
    instances, ranging over 73 different nouns, verbs
    and adjectives.
  • Senseval-1 data
  • 8,512 test instances and 13,276 training
    instances, ranging over 35 nouns, verbs and
    adjectives.
  • line, hard, serve, interest data
  • 4,149, 4,337, 4,378 and 2,476 sense-tagged
    instances with line, hard, serve and interest as
    the head words, respectively.
  • Around 50,000 sense-tagged instances in all!

15
Experiments
16
Lexical Features (accuracy, %)

Feature       Sval-2  Sval-1  line  hard  serve  interest
Majority      47.7    56.3    54.3  81.5  42.2   54.9
Surface Form  49.3    62.9    54.3  81.5  44.2   64.0
Unigram       55.3    66.9    74.5  83.4  73.3   75.7
Bigram        55.1    66.9    72.9  89.5  72.1   79.9
17
Individual Word POS, Senseval-1 (accuracy, %)

Feature   All   Nouns  Verbs  Adj.
Majority  56.3  57.2   56.9   64.3
P-2       57.5  58.2   58.6   64.0
P-1       59.2  62.2   58.2   64.3
P0        60.3  62.5   58.2   64.3
P1        63.9  65.4   64.4   66.2
P2        59.9  60.0   60.8   65.2
18
Individual Word POS, Senseval-2 (accuracy, %)

Feature   All   Nouns  Verbs  Adj.
Majority  47.7  51.0   39.7   59.0
P-2       47.1  51.9   38.0   57.9
P-1       49.6  55.2   40.2   59.0
P0        49.9  55.7   40.6   58.2
P1        53.1  53.8   49.1   61.0
P2        48.9  50.2   43.2   59.4
19
Combining POS Features (accuracy, %)

Features              Sval-2  Sval-1  line  hard  serve  interest
Majority              47.7    56.3    54.3  81.5  42.2   54.9
P0, P1                54.3    66.7    54.1  81.9  60.2   70.5
P-1, P0, P1           54.6    68.0    60.4  84.8  73.0   78.8
P-2, P-1, P0, P1, P2  54.6    67.8    62.3  86.2  75.7   80.6
20
Parse Features, Senseval-1 (accuracy, %)

Feature        All   Nouns  Verbs  Adj.
Majority       56.3  57.2   56.9   64.3
Head           64.3  70.9   59.8   66.9
Parent         60.6  62.6   60.3   65.8
Phrase         58.5  57.5   57.2   66.2
Parent Phrase  57.9  58.1   58.3   66.2
21
Parse Features, Senseval-2 (accuracy, %)

Feature        All   Nouns  Verbs  Adj.
Majority       47.7  51.0   39.7   59.0
Head           51.7  58.5   39.8   64.0
Parent         50.0  56.1   40.1   59.3
Phrase         48.3  51.7   40.3   59.5
Parent Phrase  48.5  53.0   39.1   60.3
22
Thoughts
  • Both lexical and syntactic features perform
    comparably.
  • But do they get the same instances right?
  • How redundant are the individual feature sets?
  • Are there instances correctly disambiguated by
    one feature set and not by the other?
  • How complementary are the individual feature
    sets?
  • Is the effort to combine lexical and syntactic
    features justified?

23
Measures
  • Baseline Ensemble: accuracy of a hypothetical
    ensemble that predicts the sense correctly only
    if both individual feature sets do so.
  • Quantifies redundancy amongst feature sets.
  • Optimal Ensemble: accuracy of a hypothetical
    ensemble that predicts the sense correctly if
    either of the individual feature sets does so.
  • Its difference from the individual accuracies
    quantifies complementarity.
  • We used a simple ensemble that sums the
    probabilities assigned to each sense by the
    individual feature sets to decide the intended
    sense (sketched below).
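
A sketch of all three measures over the per-instance outputs of two
classifiers; the correctness flags and probability values below are
invented inputs:

```python
# Sketch: the three ensemble measures over per-instance outputs of two
# classifiers. The correctness flags and probabilities are toy inputs.
def baseline_ensemble(correct1, correct2):
    # Correct only when BOTH feature sets are correct: quantifies redundancy.
    return sum(a and b for a, b in zip(correct1, correct2)) / len(correct1)

def optimal_ensemble(correct1, correct2):
    # Correct when EITHER feature set is correct: an upper bound.
    return sum(a or b for a, b in zip(correct1, correct2)) / len(correct1)

def sum_rule(probs1, probs2):
    # The simple ensemble used: add the two per-sense probability
    # distributions and pick the sense with the highest total.
    return max(probs1, key=lambda s: probs1[s] + probs2.get(s, 0.0))

lexical_correct   = [True, True, False, False]
syntactic_correct = [True, False, True, False]
print(baseline_ensemble(lexical_correct, syntactic_correct))  # 0.25
print(optimal_ensemble(lexical_correct, syntactic_correct))   # 0.75
print(sum_rule({"charm": 0.6, "period": 0.4},
               {"charm": 0.1, "period": 0.7}))                # period
```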

24
Best Combinations (accuracy, %)

Data      Majority  Set 1 (acc.)     Set 2 (acc.)        Base  Ens.  Opt.  Best
Sval-2    47.7      Unigrams (55.3)  P-1,P0,P1 (55.3)    43.6  57.0  67.9  66.7
Sval-1    56.3      Unigrams (66.9)  P-1,P0,P1 (68.0)    57.6  71.1  78.0  81.1
line      54.3      Unigrams (74.5)  P-1,P0,P1 (60.4)    55.1  74.2  82.0  88.0
hard      81.5      Bigrams (89.5)   Head,Parent (87.7)  86.1  88.9  91.3  83.0
serve     42.2      Unigrams (73.3)  P-1,P0,P1 (73.0)    58.4  81.6  89.9  83.0
interest  54.9      Bigrams (79.9)   P-1,P0,P1 (78.8)    67.6  83.2  90.1  89.0

(Base = baseline ensemble; Ens. = sum-rule ensemble; Opt. = optimal
ensemble; Best = best previously reported accuracy.)
25
Conclusions
  • There is a significant amount of complementarity
    across lexical and syntactic features.
  • Combination of the two is justified.
  • We show that simple lexical and part-of-speech
    features can achieve state-of-the-art results.
  • How best to capitalize on the complementarity is
    still an open issue.

26
Conclusions (continued)
  • The part of speech of the word immediately to
    the right of the target word (P1) was found most
    useful.
  • POS of words to the right of the target word
    helps most for verbs and adjectives; nouns are
    helped by tags on either side.
  • (P0, P1) was found to be most potent when the
    training data per word is small (Senseval data).
  • A larger POS context (P-2, P-1, P0, P1, P2) is
    beneficial when the training data per word is
    large (line, hard, serve and interest data).
  • The head word of the phrase is particularly
    useful for adjectives.
  • Nouns are helped by both the head and the parent.

27
Code, Data Resources
  • SyntaLex: a system for WSD using lexical and
    syntactic features. Weka's decision tree learning
    algorithm is utilized.
  • posSenseval: part-of-speech tags any data in the
    Senseval-2 data format. The Brill Tagger is used.
  • parseSenseval: parses data in the format output
    by the Brill Tagger. Output is in the Senseval-2
    data format with part of speech and parse
    information as XML tags. Uses the Collins Parser.
  • Packages to convert the line, hard, serve and
    interest data to the Senseval-1 and Senseval-2
    data formats.
  • BrillPatch: a patch to the Brill Tagger to employ
    Guaranteed Pre-Tagging.
  • http://www.d.umn.edu/~tpederse/code.html
  • http://www.d.umn.edu/~tpederse/data.html

28
Senseval-3 (March 1 to April 15, 2004): around
8,000 training and 4,000 test instances. Results
expected shortly.
  • Thank You