Title:

Category-Based Pseudowords

Slides: 19

Provided by: nak9

Learn more at: https://biotext.berkeley.edu

1
Category-Based Pseudowords

2
Word sense disambiguation

WSD task: determine the sense of a particular
instance of a multi-sense word, given its context
classic ambiguous example: bank
homography: river bank vs. financial institution
polysemy: financial institution vs. building

3
Evaluation

Ideally, using a sense-tagged corpus
general purpose, e.g. the SENSEVAL corpus
specific domain, e.g. biomedical
the National Library of Medicine test collection,
which contains instances of 50 highly frequent
ambiguous concepts from the UMLS Metathesaurus
Moving to a new domain
a sense-tagged corpus may be unavailable
even when available, may be unsuitable
What if we use a different sense distinction
e.g. MeSH instead of the UMLS Metathesaurus?
What if we are also interested in less frequent
words, e.g. need to evaluate an all-words system?

4
Pseudowords

building a sense-tagged corpus is very expensive,
so create an artificial one
pseudoword: a composite of two or more
words, chosen at random (Gale et al., 92),
(Schuetze, 92)
e.g. banana and door → banana_door
accepted as an upper bound on the true system's
accuracy
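
A minimal sketch of how a pseudoword corpus can be built from plain text, assuming whitespace tokenization and a single two-word pseudoword. The function name make_pseudoword_corpus and the toy sentences are illustrative and do not come from the slides. Each occurrence of either constituent is replaced by the composite, and the original word is kept as the gold "sense".

```python
def make_pseudoword_corpus(sentences, w1, w2):
    """Replace every occurrence of w1 or w2 with the pseudoword w1_w2.

    Returns one (tokens, gold) pair per sentence, where gold lists
    (position, original word) for every replaced token.
    """
    pseudo = f"{w1}_{w2}"
    tagged = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Remember which constituent stood at each replaced position.
        gold = [(i, t) for i, t in enumerate(tokens) if t in (w1, w2)]
        tokens = [pseudo if t in (w1, w2) else t for t in tokens]
        tagged.append((tokens, gold))
    return tagged


corpus = make_pseudoword_corpus(
    ["She ate a banana", "Close the door please"], "banana", "door"
)
```

A disambiguation system then sees only banana_door and is scored on how often it recovers the word stored in gold.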

5
Problems

Chosen entirely at random, and thus
difficult to characterize in terms of the type of
ambiguity being modeled
optimistic in their estimates (Gaustad, 01)
highly likely to combine semantically distinct
words
real ambiguous words have senses similar in
meaning and difficult to distinguish

6
The solution

7
MeSH and Medline

8
Pseudowords generation (1)

Build a list C of the category pairs and their
frequencies in the training corpus
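
One possible way to build C, sketched under the assumption that every corpus term comes with the set of categories it can belong to, and that each occurrence of an ambiguous term contributes one count to every pair of its categories. The mapping term_categories and the category labels below are toy placeholders, not real MeSH data.

```python
from collections import Counter
from itertools import combinations


def count_category_pairs(corpus_terms, term_categories):
    """Count how often each unordered category pair occurs in the corpus."""
    pair_counts = Counter()
    for term in corpus_terms:
        # Sort so that (A, B) and (B, A) are counted as the same pair.
        cats = sorted(term_categories.get(term, ()))
        for pair in combinations(cats, 2):
            pair_counts[pair] += 1
    return pair_counts


term_categories = {
    "t1": {"catA", "catB"},   # ambiguous between two categories
    "t2": {"catC"},           # unambiguous, contributes no pair
    "t3": {"catB", "catD"},
}
C = count_category_pairs(["t1", "t2", "t1", "t3"], term_categories)
```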

9
Pseudowords generation (2)

Generate pseudowords with the following
characteristics
represent a real ambiguity class pair (one seen in the
training corpus)
the number of pseudowords drawn from a particular
class pair is proportional to the pair's
frequency
only unambiguous words are used as pseudoword
constituents
multi-word concepts are allowed as elements, e.g.
general systems theory
glutathione S-transferase

10
Pseudowords generation (3)

Pseudowords for the lower bound
in real texts, the more frequent sense of a
two-sense distinction occurs around 92% of the
time (Sanderson & van Rijsbergen, 99)
evenly distributed senses are harder
so we build a balanced list W of pairs
we calculate the mean corpus word frequency E and
then find the words with frequency in [E/2, 3E/2]
in the particular experiment
E = 45.21, which gave a list of 64,596 pairs
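
The frequency filter can be sketched as follows. It assumes that E is the mean count over word types and that both endpoints of the interval are included; neither detail is stated on the slide. The function name and the toy counts are illustrative.

```python
def balanced_candidates(word_freq):
    """Return the mean frequency E and the words with frequency in [E/2, 3E/2]."""
    mean = sum(word_freq.values()) / len(word_freq)
    lo, hi = mean / 2, 3 * mean / 2
    # Words inside the band have comparable frequencies, so a pseudoword
    # built from two of them has roughly balanced "senses".
    kept = {w for w, f in word_freq.items() if lo <= f <= hi}
    return mean, kept


E, candidates = balanced_candidates({"a": 10, "b": 40, "c": 50, "d": 100})
```

Restricting both constituents to this band keeps their frequency ratio at most 3:1, far from the 92% skew reported for real two-sense words.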

11
Pseudowords generation (4)

importance sampling
1) Select a category pair (c1, c2) from C by
sampling from a multinomial distribution with
parameters proportional to the frequencies of the
elements of C.
2) Sample uniformly to draw two random distinct
words w1 and w2 whose classes correspond to the
classes selected in step 1).
3) If the word pair (w1, w2) has been sampled
already, go to step 1) and try again.
we sampled 1,000 pseudowords (88,758 instances)
out of the 64,596 possible pairs
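
The three steps above can be sketched as a small rejection loop. The names pair_freq and words_by_class, the fixed seed, and the max_tries guard are assumptions added for illustration; the slides do not specify an implementation.

```python
import random


def sample_pseudowords(pair_freq, words_by_class, n, seed=0, max_tries=100_000):
    """Draw up to n distinct word pairs following steps 1) to 3)."""
    rng = random.Random(seed)
    pairs = list(pair_freq)
    weights = [pair_freq[p] for p in pairs]
    chosen = set()
    tries = 0
    while len(chosen) < n and tries < max_tries:
        tries += 1
        # Step 1: category pair, with probability proportional to its frequency.
        c1, c2 = rng.choices(pairs, weights=weights, k=1)[0]
        # Step 2: one word per class, drawn uniformly; the two must differ.
        w1 = rng.choice(words_by_class[c1])
        w2 = rng.choice(words_by_class[c2])
        if w1 == w2:
            continue
        # Step 3: a pair already in the set is silently ignored, so the loop retries.
        chosen.add((w1, w2))
    return sorted(chosen)


pair_freq = {("A", "B"): 3, ("A", "C"): 1}
words_by_class = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1"]}
pseudowords = sample_pseudowords(pair_freq, words_by_class, n=4)
```

The max_tries guard only prevents an endless loop when n exceeds the number of distinct pairs that can be formed.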

12
Sample pseudowords

13
Classifier

Naïve Bayes classifier
simple, commonly used for WSD, and among the best
performing
we used a symmetric context window
10, 20, 40 and 300 words on each side
category name as a proxy for the sense
ambiguous MeSH categories as target
UNambiguous MeSH categories as features (we use a
class-based model, and not a word-based one)
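
A compact sketch of a Naive Bayes sense classifier over a symmetric context window. It uses a multinomial model with add-one smoothing, which is an assumption since the slides do not name the variant or the smoothing. In the setting described above, the features would be the unambiguous category names found in the window; the toy features below are placeholders.

```python
import math
from collections import Counter, defaultdict


def window(tokens, i, size):
    """Return up to `size` tokens on each side of position i, excluding i."""
    return tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size]


class NaiveBayes:
    def __init__(self):
        self.sense_counts = Counter()            # number of examples per sense
        self.feat_counts = defaultdict(Counter)  # feature counts per sense
        self.vocab = set()

    def train(self, examples):
        """examples: iterable of (feature list, sense label)."""
        for feats, sense in examples:
            self.sense_counts[sense] += 1
            self.feat_counts[sense].update(feats)
            self.vocab.update(feats)

    def predict(self, feats):
        total = sum(self.sense_counts.values())
        best, best_score = None, -math.inf
        for sense, n in self.sense_counts.items():
            denom = sum(self.feat_counts[sense].values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for f in feats:
                if f in self.vocab:      # features unseen in training are skipped
                    score += math.log((self.feat_counts[sense][f] + 1) / denom)
            if score > best_score:
                best, best_score = sense, score
        return best


nb = NaiveBayes()
nb.train([(["x", "y"], "s1"), (["x", "x"], "s1"), (["z", "w"], "s2")])
```

Calling window(tokens, i, 10) on a tokenized document would give the features for the narrowest of the window sizes listed above.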

14
Abbreviations

we have no real disambiguated corpus
use abbreviations, as suggested in (Liu et
al., 02)
represent real ambiguous words
but may be due to accident
intermediate position between entirely random
pseudowords and real ambiguous words
we generated 98,841 abbreviations (332,020
instances in total) such that
their expansions are fully and unambiguously
mapped to MeSH
they represent exactly two distinct categories
we used the abbreviation extraction tool described in
(Schwartz & Hearst, 03)
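
The two selection criteria can be sketched as a simple filter. It assumes the abbreviations and their expansions have already been extracted, and that a lookup from expansion to category is available; both inputs, the function name, and the toy data are hypothetical.

```python
def select_abbreviations(expansions_by_abbr, category_of):
    """Keep abbreviations whose expansions all map to a category
    and together cover exactly two distinct categories."""
    selected = {}
    for abbr, expansions in expansions_by_abbr.items():
        cats = [category_of.get(e) for e in expansions]
        if None in cats:
            continue  # some expansion has no category mapping
        distinct = sorted(set(cats))
        if len(distinct) == 2:
            selected[abbr] = distinct
    return selected


expansions_by_abbr = {
    "AB": ["alpha beta", "apple berry"],   # two categories: kept
    "CD": ["cat dog", "unknown term"],     # unmapped expansion: dropped
    "EF": ["echo fox", "east field"],      # single category: dropped
}
category_of = {
    "alpha beta": "catA", "apple berry": "catB",
    "cat dog": "catA",
    "echo fox": "catC", "east field": "catC",
}
kept = select_abbreviations(expansions_by_abbr, category_of)
```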

15
Sample abbreviations

16
Evaluation

17
Conclusions

We introduced category-based pseudowords, based
on distributions from lexical category
co-occurrence
give a more accurate lower bound
allow detailed study (many samples) of a
particular sense ambiguity
represent a better-motivated word grouping in
pseudowords

18
Thank you!