Title: Unsupervised Word Sense Disambiguation Rivaling Supervised Methods David Yarowsky
1 Unsupervised Word Sense Disambiguation Rivaling Supervised Methods
David Yarowsky
- G22.2591 Presentation, Sonjia Waxmonsky
2 Introduction
- Presents an unsupervised learning algorithm for word
sense disambiguation that can be applied to
completely untagged text
- Based on a supervised machine learning algorithm
that uses decision lists
- Performance matches that of a supervised system
3 Properties of Language
- One sense per collocation
  - Nearby words provide strong and consistent clues
  as to the sense of a target word
- One sense per discourse
  - The sense of a target word is highly consistent
  within a single document
4 Decision List Algorithm
- Supervised algorithm
- Based on the One sense per collocation property
- Start with a large set of possible collocations
- Calculate the log-likelihood ratio of word-sense
probability for each collocation
- A higher log-likelihood ratio means more predictive evidence
- Collocations are ordered in a decision list, with
the most predictive collocations ranked highest
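The ranking step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the presenter's or the paper's implementation: the function name `build_decision_list`, the two sense labels "A" and "B", and the additive smoothing constant `alpha` are all hypothetical. Each collocation is scored by the absolute value of log(P(Sense A | collocation) / P(Sense B | collocation)), estimated from counts.

```python
import math
from collections import defaultdict

def build_decision_list(tagged_examples, alpha=0.1):
    """Rank collocations by a smoothed log-likelihood ratio.

    tagged_examples: iterable of (collocations, sense), sense in {"A", "B"}.
    alpha: additive smoothing (an assumption; it avoids division by zero).
    Returns [(score, collocation, sense)], most predictive first.
    """
    counts = defaultdict(lambda: {"A": 0, "B": 0})
    for collocations, sense in tagged_examples:
        for c in set(collocations):
            counts[c][sense] += 1

    rules = []
    for c, n in counts.items():
        # P(A | c) / P(B | c) reduces to a ratio of (smoothed) counts.
        llr = math.log((n["A"] + alpha) / (n["B"] + alpha))
        if llr == 0:
            continue  # collocation carries no evidence either way
        rules.append((abs(llr), c, "A" if llr > 0 else "B"))

    # Highest score first; ties broken alphabetically for determinism.
    rules.sort(key=lambda r: (-r[0], r[1]))
    return rules
```

A collocation seen equally often with both senses scores zero and is dropped, so it never appears in the list.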
5 Decision List Algorithm
- Decision list is used to classify instances of
the target word
- Example context: "the loss of animal and plant species through
extinction"
- Classification is based on the highest-ranking
rule that matches the target context

LogL   Collocation                 Sense
9.31   flower (within ±k words)    A (living)
9.24   job (within ±k words)       B (factory)
9.03   fruit (within ±k words)     A (living)
9.02   plant species               A (living)
...    ...
6 Advantage of Decision Lists
- Multiple collocations may match a single context
- But only the single most predictive piece of
evidence is used to classify the target word
- Result: The classification procedure combines a
large amount of non-independent information
without complex modeling
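As a rough sketch of how such a list might be applied, the following Python snippet classifies the example context from slide 5 using the four rules shown there. The rule encoding ("window" versus "adjacent"), the helper names, and the window size `k=10` are assumptions made for illustration; the slides leave k unspecified.

```python
# Rules copied from the decision list on slide 5: (LogL, collocation, sense, kind).
# "window" rules match a word within +/- k words of the target (assumed encoding);
# "adjacent" rules match a literal phrase that contains the target word.
RULES = [
    (9.31, "flower", "A (living)", "window"),
    (9.24, "job", "B (factory)", "window"),
    (9.03, "fruit", "A (living)", "window"),
    (9.02, "plant species", "A (living)", "adjacent"),
]

def phrase_at(tokens, target_index, phrase):
    """True if `phrase` occurs in `tokens` at a position covering the target."""
    words = phrase.split()
    for start in range(target_index - len(words) + 1, target_index + 1):
        if start >= 0 and tokens[start:start + len(words)] == words:
            return True
    return False

def classify(tokens, target_index, rules=RULES, k=10):
    """Return (sense, LogL, collocation) of the best matching rule, or None."""
    lo = max(0, target_index - k)
    window = tokens[lo:target_index] + tokens[target_index + 1:target_index + 1 + k]
    # Walk the rules from most to least predictive; the first match decides.
    for score, colloc, sense, kind in sorted(rules, reverse=True):
        if kind == "window" and colloc in window:
            return sense, score, colloc
        if kind == "adjacent" and phrase_at(tokens, target_index, colloc):
            return sense, score, colloc
    return None

tokens = "the loss of animal and plant species through extinction".split()
print(classify(tokens, tokens.index("plant")))
```

Only one rule decides the outcome even when several match, which is why no model of how the collocations depend on each other is needed.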
7 Bootstrapping Algorithm
Sense-A: life
Sense-B: factory
- All occurrences of the target word are identified
- A small training set of seed data is tagged with
word sense
8 Selecting Training Seeds
- Initial training set should accurately
distinguish among possible senses
- Strategies
  - Select a single, defining seed collocation for
  each possible sense
    - Ex: life and manufacturing for target plant
  - Use words from dictionary definitions
  - Hand-label most frequent collocates
9 Bootstrapping Algorithm
- Iterative procedure
  - Train decision list algorithm on seed set
  - Classify residual data with decision list
  - Create new seed set by identifying samples that
  are tagged with a probability above a certain
  threshold
  - Retrain classifier on new seed set
10 Bootstrapping Algorithm
- Seed set grows and residual set shrinks.
11 Bootstrapping Algorithm
- Convergence: Stop when residual set stabilizes
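The iterative procedure on the preceding slides might look roughly like the Python sketch below. It is a simplified illustration, not the published algorithm: contexts are reduced to sets of collocation strings, the names `train`, `classify`, and `bootstrap` are hypothetical, `alpha` is an assumed smoothing constant, and a log-likelihood `threshold` stands in for the probability threshold (the two are monotonically related).

```python
import math
from collections import defaultdict

def train(examples, labels, alpha=0.1):
    """Build a decision list [(score, collocation, sense)] from labeled examples."""
    counts = defaultdict(lambda: {"A": 0, "B": 0})
    for i, sense in labels.items():
        for c in examples[i]:
            counts[c][sense] += 1
    rules = []
    for c, n in counts.items():
        llr = math.log((n["A"] + alpha) / (n["B"] + alpha))
        if llr != 0:
            rules.append((abs(llr), c, "A" if llr > 0 else "B"))
    return sorted(rules, key=lambda r: (-r[0], r[1]))

def classify(collocations, rules):
    """Return (sense, score) of the highest-ranked matching rule."""
    for score, c, sense in rules:
        if c in collocations:
            return sense, score
    return None, 0.0

def bootstrap(examples, seeds, threshold=1.0, max_iter=50):
    """examples: list of sets of collocations; seeds: {collocation: sense}."""
    # Step 1: tag the examples that contain exactly one sense's seed collocation.
    labels = {}
    for i, ex in enumerate(examples):
        senses = {seeds[c] for c in ex if c in seeds}
        if len(senses) == 1:
            labels[i] = senses.pop()

    rules = []
    for _ in range(max_iter):
        rules = train(examples, labels)
        # Re-tag every example, so earlier seed labels can be revised.
        new_labels = {}
        for i, ex in enumerate(examples):
            sense, score = classify(ex, rules)
            if sense is not None and score >= threshold:
                new_labels[i] = sense
        if new_labels == labels:  # residual set has stabilized
            break
        labels = new_labels
    return rules, labels
```

With seeds such as `{"life": "A", "manufacturing": "B"}`, examples that share collocations with the seed-tagged contexts are pulled into the labeled set on later iterations, while examples that never match a confident rule stay in the residual.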
12 Final Decision List
- Original seed collocations may not necessarily be
at the top of the list
- Possible for a sample in the original seed data to
be reclassified
- Initial misclassifications in seed data can be
corrected
13 One Sense per Discourse
- Algorithm can be improved by applying the One Sense
per Discourse constraint
- After algorithm has converged
  - Identify tokens tagged with low confidence, label
  with dominant tag of that document
- After each iteration
  - Extend tag to all examples in a single document
  after enough examples are tagged with a single
  sense
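The post-convergence variant of this constraint could be sketched as follows. This is a minimal illustration with assumed details: the function name, the tuple layout `(doc_id, sense, score)`, the confidence cutoff `min_score`, and the rule that ties leave a token unchanged are all hypothetical choices, not taken from the slides.

```python
from collections import Counter, defaultdict

def apply_one_sense_per_discourse(tokens, min_score=2.0):
    """Relabel low-confidence tokens with their document's dominant sense.

    tokens: list of (doc_id, sense_or_None, score).
    min_score: assumed confidence cutoff below which a tag is overridden.
    Returns the list of (possibly revised) senses, in input order.
    """
    # Count confident tags per document.
    votes = defaultdict(Counter)
    for doc, sense, score in tokens:
        if sense is not None and score >= min_score:
            votes[doc][sense] += 1

    out = []
    for doc, sense, score in tokens:
        ranked = votes[doc].most_common(2)
        dominant = None
        # Require a clear winner so that ties leave the token unchanged.
        if ranked and (len(ranked) == 1 or ranked[0][1] > ranked[1][1]):
            dominant = ranked[0][0]
        if (sense is None or score < min_score) and dominant is not None:
            out.append(dominant)
        else:
            out.append(sense)
    return out
```

Documents with no confident tags, or with an even split between senses, are left as the classifier tagged them.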
14 Evaluation
- Test corpus extracted from a 460 million word
corpus of multiple sources (news articles,
transcripts, novels, etc.)
- Performance of multiple models compared with
  - supervised decision lists
  - unsupervised learning algorithm of Schütze
  (1992), based on alignment of clusters with word
  senses
15 Results
- Applying the One sense per discourse constraint
improves performance

Word    Senses             Unsupervised            Unsupervised, applying One Sense per Discourse
                           (Dictionary seed data)  After last iter.    After each iter.
plant   living/factory     97.3                    98.3                98.6
space   volume/outer       92.3                    93.3                93.6
tank    vehicle/container  94.6                    97.8                96.5
motion  legal/physical     97.4                    98.5                97.9
...
Average -                  94.8                    96.1                96.5
Accuracy (%)
16 Results
- Accuracy exceeds that of the Schütze algorithm for all target
words, and matches that of the supervised algorithm

Word    Senses             Supervised  Unsupervised /  Unsupervised /
                                       Schütze         Bootstrapping
plant   living/factory     97.7        92              98.6
space   volume/outer       93.9        90              93.6
tank    vehicle/container  97.1        95              96.5
motion  legal/physical     98.0        92              97.9
...
Average -                  96.1        92.2            96.5
Accuracy (%)