Category-Based Pseudowords - PowerPoint PPT Presentation

Slides: 19
Provided by: nak9

Transcript and Presenter's Notes

1
Category-Based Pseudowords
  • Preslav Nakov and Marti Hearst
  • University of California at Berkeley
  • EECS and SIMS
  • Supported by Genentech and ARDA Aquaint

2
Word sense disambiguation
  • WSD task: determine the sense of a particular
    instance of a multi-sense word given its context
  • classic ambiguous example: bank
  • homography
    • river bank
    • financial institution
  • polysemy
    • financial institution
    • building

3
Evaluation
  • Ideally using a sense-tagged corpus
    • general purpose, e.g. the SENSEVAL corpus
    • specific domain, e.g. biomedical
      • the National Library of Medicine test collection:
        contains instances of 50 highly frequent
        ambiguous concepts from the UMLS Metathesaurus
  • Moving to a new domain
    • a sense-tagged corpus may be unavailable
    • even when available, it may be unsuitable
      • What if we use a different sense distinction,
        e.g. MeSH instead of the UMLS Metathesaurus?
      • What if we are also interested in less frequent
        words, e.g. need to evaluate an all-words system?

4
Pseudowords
  • building a sense-tagged corpus is very expensive,
    so create an artificial one
  • pseudoword: a composite of two or more
    words, chosen at random (Gale et al. 92),
    (Schuetze 92)
  • e.g. banana and door → banana_door
  • pseudoword accuracy is accepted as an upper bound
    on the system's true accuracy
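
A minimal sketch of how such an artificial corpus could be built: every occurrence of either constituent is replaced by the pseudoword, and the original word is kept as the gold "sense". The function name and the toy sentences are illustrative assumptions, not taken from the slides.

```python
import re

def make_pseudoword_corpus(sentences, w1, w2):
    """Replace every occurrence of w1 or w2 by the pseudoword w1_w2.

    Returns (rewritten, labels): the rewritten sentences and, for each
    replaced token in order, the original word (the gold "sense").
    """
    pseudo = f"{w1}_{w2}"
    pattern = re.compile(rf"\b({re.escape(w1)}|{re.escape(w2)})\b")
    rewritten, labels = [], []
    for sentence in sentences:
        labels.extend(pattern.findall(sentence))      # remember the true word
        rewritten.append(pattern.sub(pseudo, sentence))
    return rewritten, labels

# Toy data (hypothetical)
sentences = ["she ate a banana", "close the door", "the banana is by the door"]
text, gold = make_pseudoword_corpus(sentences, "banana", "door")
```

A disambiguation system is then scored on how often it recovers `gold` from the contexts in `text`.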

5
Problems
  • Chosen entirely at random, and thus
    • difficult to characterize in terms of the type of
      ambiguity being modeled
    • optimistic in their estimations (Gaustad 01)
    • highly likely to combine semantically distinct
      words
  • real ambiguous words have senses that are similar in
    meaning and difficult to distinguish

6
The solution
  • Use lexical category membership

7
MeSH and Medline
  • we use MeSH (Medical Subject Headings)
  • example: Eye has the following codes
    • A01.456.505.420 (child of Face)
    • A09.371 (child of Sense Organs)
  • average number of senses: 2.12
  • we cut after the first period to allow
    generalization (e.g. A01 and A09)
    • 71.18% - single class, 22.14% - two classes
    • the ambiguity drops to 1.39
  • Medline abstracts: 180,226
    • training: 120,150
    • testing: 60,076
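
The truncation step can be illustrated with a short sketch. It keeps only the part of each MeSH tree number before the first period and deduplicates, so a term whose codes all fall under one top-level category becomes unambiguous. The function name is an assumption, and the second example uses made-up codes.

```python
def top_level_categories(mesh_codes):
    """Cut each MeSH tree number after the first period and deduplicate."""
    return sorted({code.split(".", 1)[0] for code in mesh_codes})

# Example from the slide: Eye stays ambiguous between A01 and A09.
eye_categories = top_level_categories(["A01.456.505.420", "A09.371"])

# Hypothetical codes: two senses that collapse into a single class.
collapsed = top_level_categories(["Q77.100.200", "Q77.300"])
```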

8
Pseudowords generation (1)
  • Build a list C of the category couples and their
    frequencies in the training corpus
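
One way the list C might be collected is sketched below. It assumes, which the slides do not state, that a "category couple" is the pair of top-level categories of a term with exactly two classes, and that its frequency is the number of such term occurrences in the training corpus. All names and data are illustrative.

```python
from collections import Counter

def category_pair_counts(term_categories, corpus_terms):
    """Count category pairs over occurrences of two-class terms.

    term_categories: dict mapping a term to its top-level categories
    corpus_terms:    iterable of terms as they occur in the corpus
    """
    counts = Counter()
    for term in corpus_terms:
        cats = set(term_categories.get(term, ()))
        if len(cats) == 2:                        # assumed: two-class terms only
            counts[tuple(sorted(cats))] += 1
    return counts

# Toy data (hypothetical)
term_categories = {"eye": ["A01", "A09"], "face": ["A01"], "cold": ["C08", "G16"]}
corpus_terms = ["eye", "face", "eye", "cold", "unknown"]
C = category_pair_counts(term_categories, corpus_terms)
```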

9
Pseudowords generation (2)
  • Generate pseudowords with the following
    characteristics
    • represent a real ambiguity class pair (met in the
      training corpus)
    • the number of pseudowords drawn from a particular
      class pair is proportional to the pair's
      frequency
    • only unambiguous words are used as pseudoword
      constituents
    • multi-word concepts are allowed as elements, e.g.
      • general systems theory
      • glutathione S-transferase

10
Pseudowords generation (3)
  • Pseudowords for the lower bound
    • in real texts, the more frequent sense for a
      two-sense distinction occurs around 92% of the
      time (Sanderson & van Rijsbergen 99)
    • evenly distributed senses are harder
    • so we build a balanced list W of pairs
    • we calculate the mean corpus word frequency E and
      then find the words with frequency in [E/2, 3E/2]
  • in the particular experiment
    • E = 45.21, which gave a list of 64,596 pairs
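
The balanced list can be sketched as follows. The sketch assumes a closed interval [E/2, 3E/2], a mean taken over the supplied unambiguous words, and that a pair is kept only if its two categories form a pair observed in the training corpus. These details and all names are assumptions, not confirmed by the slides.

```python
from itertools import combinations
from statistics import mean

def balanced_pairs(word_freq, word_category, real_pairs):
    """Return (E, W): the mean frequency and the balanced word pairs."""
    e = mean(word_freq.values())
    # Words whose frequency lies in [E/2, 3E/2]
    mid = sorted(w for w, f in word_freq.items() if e / 2 <= f <= 3 * e / 2)
    pairs = []
    for w1, w2 in combinations(mid, 2):
        key = tuple(sorted((word_category[w1], word_category[w2])))
        if key in real_pairs:                     # a real ambiguity class pair
            pairs.append((w1, w2))
    return e, pairs

# Toy data (hypothetical)
word_freq = {"a": 10, "b": 40, "c": 50, "d": 100}
word_category = {"a": "A01", "b": "A01", "c": "A09", "d": "A09"}
E, W = balanced_pairs(word_freq, word_category, {("A01", "A09")})
```

Because both constituents have similar corpus frequency, the two "senses" of each resulting pseudoword are roughly evenly distributed, which is the harder case.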

11
Pseudowords generation (4)
  • importance sampling
    • 1) Select a category pair (c1, c2) from C by
      sampling from a multinomial distribution with
      parameters proportional to the frequencies of the
      elements of C.
    • 2) Sample uniformly to draw two random distinct
      words w1 and w2 whose classes correspond to the
      classes selected in step 1).
    • 3) If the word pair (w1, w2) has been sampled
      already, go to step 1) and try again.
  • we sampled 1,000 pseudowords (88,758 instances)
    out of the possible 64,596 pairs
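
The three steps above can be sketched as a rejection loop. This is an illustration under assumptions: the function name, the `max_tries` guard, and the toy data are not from the slides, and the weighted draw uses `random.choices` as a stand-in for the multinomial sampling.

```python
import random

def sample_pseudowords(pair_counts, words_by_category, n, seed=0, max_tries=100_000):
    """Draw up to n distinct pseudoword pairs following steps 1) to 3)."""
    rng = random.Random(seed)
    pairs = list(pair_counts)
    weights = [pair_counts[p] for p in pairs]
    chosen = set()
    tries = 0
    while len(chosen) < n and tries < max_tries:
        tries += 1
        # 1) category pair, with probability proportional to its frequency
        c1, c2 = rng.choices(pairs, weights=weights, k=1)[0]
        # 2) one word drawn uniformly from each category
        w1 = rng.choice(words_by_category[c1])
        w2 = rng.choice(words_by_category[c2])
        # 3) reject identical words and pairs sampled before, then retry
        if w1 == w2 or (w1, w2) in chosen:
            continue
        chosen.add((w1, w2))
    return sorted(chosen)

# Toy data (hypothetical)
pair_counts = {("A01", "A09"): 3, ("A01", "B02"): 1}
words_by_category = {"A01": ["eye", "face"], "A09": ["retina"], "B02": ["virus"]}
sampled = sample_pseudowords(pair_counts, words_by_category, n=4)
```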

12
Sample pseudowords
  • the more unusual pairs come from less frequent
    categories

13
Classifier
  • Naïve Bayes classifier
    • simple, commonly used for WSD, and among the best
      performing
  • we used a symmetric context window
    • 10, 20, 40 and 300 words on each side
  • category name as a proxy for the sense
    • ambiguous MeSH categories as target
    • UNambiguous MeSH categories as features (we use a
      class-based model, and not a word-based one)
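
A minimal sketch of such a class-based setup: features are the unambiguous category labels found in a symmetric window around the target, and a multinomial Naïve Bayes model picks the sense. Add-one smoothing and the use of `None` for tokens without an unambiguous category are assumptions; the slides do not describe these details.

```python
import math
from collections import Counter, defaultdict

def window_features(categories, position, size):
    """Category labels within `size` tokens on each side of `position`."""
    left = categories[max(0, position - size):position]
    right = categories[position + 1:position + 1 + size]
    return [c for c in left + right if c is not None]

class NaiveBayes:
    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        """examples: iterable of (feature list, sense label) pairs."""
        for features, label in examples:
            self.class_counts[label] += 1
            self.feature_counts[label].update(features)
            self.vocab.update(features)
        return self

    def predict(self, features):
        total = sum(self.class_counts.values())
        v = len(self.vocab)

        def score(label):
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + v      # add-one smoothing (assumed)
            s = math.log(self.class_counts[label] / total)
            for f in features:
                if f in self.vocab:               # ignore unseen features
                    s += math.log((counts[f] + 1) / denom)
            return s

        return max(self.class_counts, key=score)

# Toy training data (hypothetical category labels)
model = NaiveBayes().fit([(["A01", "A01", "B02"], "A09"), (["C03", "C03"], "D04")])
context = window_features(["A01", None, "TARGET", "B02", "C03"], position=2, size=1)
```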

14
Abbreviations
  • we have no real disambiguated corpus
  • use abbreviations, as suggested in (Liu et
    al. 02)
    • represent real ambiguous words
    • but the ambiguity may be accidental
    • intermediate position between entirely random
      pseudowords and real ambiguous words
  • we generated 98,841 abbreviations (332,020
    instances in total) such that
    • their expansions are fully and unambiguously
      mapped to MeSH
    • they represent exactly two distinct categories
  • used an abbreviation extraction tool described in
    (Schwartz & Hearst 03)

15
Sample abbreviations

16
Evaluation
  • Category based
    • baseline: choose the more frequent class (shown
      for abbreviations)
    • pessimistic: evenly distributed constituents
    • realistic: random constituents (frequency at
      least 5)
    • abbreviations
  • Non-category based
    • optimistic: completely random (the standard way
      to generate)
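
The baseline can be made concrete with a short sketch: always predicting the more frequent class yields an accuracy equal to that class's share of the instances. The function name and the toy label lists are illustrative assumptions.

```python
from collections import Counter

def majority_baseline_accuracy(gold_labels):
    """Accuracy obtained by always choosing the more frequent class."""
    counts = Counter(gold_labels)
    return max(counts.values()) / len(gold_labels)

# Skewed two-class distribution (hypothetical labels)
skewed = majority_baseline_accuracy(["A01"] * 92 + ["A09"] * 8)
# Evenly distributed constituents, as in the pessimistic setting
balanced = majority_baseline_accuracy(["A01"] * 50 + ["A09"] * 50)
```

With evenly distributed constituents the baseline falls to 0.5, so a classifier's gain over the baseline is easier to see in the pessimistic setting.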

17
Conclusions
  • We introduced category-based pseudowords based
    on distributions from lexical category
    co-occurrence
    • give a more accurate lower bound
    • allow detailed study (many samples) of a
      particular sense ambiguity
    • represent a better motivated word grouping in
      pseudowords

18
Thank you!
  • Your questions?