1
Word Sense DisambiguationInformation Retrieval
  • CMSC 35100
  • Natural Language Processing
  • May 20, 2003

2
Roadmap
  • Word Sense Disambiguation
  • Knowledge-based Approaches
  • Sense similarity in a taxonomy
  • Issues in WSD
  • Why they work, why they don't
  • Information Retrieval
  • Vector Space Model
  • Computing similarity
  • Term weighting
  • Enhancements: Expansion, Stemming, Synonyms

3
Resnik's WordNet Labeling: Detail
  • Assume a Source of Clusters
  • Assume a KB: Word Senses in the WordNet IS-A hierarchy
  • Assume a Text Corpus
  • Calculate Informativeness
  • For each KB node: sum occurrences of it and all its children
  • Informativeness: -log P(node)
  • Disambiguate w.r.t. Cluster & WordNet
  • Find the Most Informative Subsumer (MIS) for each pair; call its informativeness I
  • For each sense the MIS subsumes, vote I
  • Select the Sense with the Highest Vote
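The voting procedure above can be sketched in Python over a toy IS-A hierarchy. All node names and counts here are hypothetical stand-ins for WordNet nodes and corpus counts, chosen to mirror the "plant" example on the next slide:

```python
import math
from collections import defaultdict

# Toy IS-A hierarchy: child -> parent (hypothetical node names).
parent = {
    "plant_living": "living_thing", "animal": "living_thing",
    "plant_factory": "artifact", "product": "artifact",
    "living_thing": "entity", "artifact": "entity",
}

# Occurrences of each node and all its children (made-up corpus counts).
count = {"plant_living": 30, "animal": 40, "plant_factory": 10,
         "product": 20, "living_thing": 80, "artifact": 35, "entity": 120}
total = count["entity"]

def informativeness(node):
    # Resnik's information content: -log P(node).
    return -math.log(count[node] / total)

def ancestors(node):
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def most_informative_subsumer(a, b):
    common = set(ancestors(a)) & set(ancestors(b))
    return max(common, key=informativeness)

def label(target_senses, cluster_senses):
    # Each (target sense, cluster sense) pair finds its MIS with
    # informativeness I, and every target sense the MIS subsumes gets I votes.
    votes = defaultdict(float)
    for s in target_senses:
        for c in cluster_senses:
            mis = most_informative_subsumer(s, c)
            i = informativeness(mis)
            for t in target_senses:
                if mis in ancestors(t):
                    votes[t] += i
    return max(votes, key=votes.get)

# "plant" in a biology cluster (animal) should pick the living-thing sense.
print(label(["plant_living", "plant_factory"], ["animal"]))
```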

4
Sense Labeling Under WordNet
  • Use Local Content Words as Clusters
  • Biology: Plants, Animals, Rainforests, species
  • Industry: Company, Products, Range, Systems
  • Find Common Ancestors in WordNet
  • Biology: Plant & Animal isa Living Thing
  • Industry: Product & Plant isa Artifact isa Entity
  • Use the Most Informative ancestor
  • Result: Correct Selection

5
The Question of Context
  • Shared Intuition: Context determines Sense
  • Area of Disagreement
  • What is context?
  • Wide vs Narrow Window
  • Word Co-occurrences

6
Taxonomy of Contextual Information
  • Topical Content
  • Word Associations
  • Syntactic Constraints
  • Selectional Preferences
  • World Knowledge & Inference

7
A Trivial Definition of Context: All Words within
X Words of Target
  • Many words: Schutze - 1000 characters, several
    sentences
  • Unordered Bag of Words
  • Information Captured: Topic, Word Association
  • Limits on Applicability
  • Nouns vs. Verbs & Adjectives
  • Schutze: Nouns - 92%; Train (Verb) - 69%

8
Limits of Wide Context
  • Comparison of Wide-Context Techniques (LTV 93)
  • Neural Net, Context Vector, Bayesian Classifier,
    Simulated Annealing
  • Results: 2 senses - 90%; 3 senses - 70%
  • People: Sentences - 100%; Bag of Words - 70%
  • Inadequate Context
  • Need Narrow Context
  • Local Constraints Override
  • Retain Order, Adjacency

9
Surface Regularities = Useful Disambiguators?
  • Not Necessarily!
  • "Scratching her nose" vs. "Kicking the bucket"
    (deMarcken 1995)
  • Right for the Wrong Reason
  • Burglar: Rob, Thieves, Stray, Crate, Chase, Lookout
  • Learning the Corpus, not the Sense
  • The "Ste." Cluster: Dry, Oyster, Whisky, Hot, Float,
    Ice
  • Learning Nothing Useful, Wrong Question
  • Keeping: Bring, Hoping, Wiping, Could, Should, Some,
    Them, Rest

10
Interactions Below the Surface
  • Constraints Not All Created Equal
  • The Astronomer Married the Star
  • Selectional Restrictions Override Topic
  • No Surface Regularities
  • The emigration/immigration bill guaranteed
    passports to all Soviet citizens
  • No Substitute for Understanding

11
What is Similar?
  • Ad-hoc Definitions of Sense
  • Cluster in word space, WordNet Sense, Seed
    Sense - Circular
  • Schutze: Vector Distance in Word Space
  • Resnik: Informativeness of WordNet Subsumer +
    Cluster
  • Relation in Cluster, not WordNet is-a hierarchy
  • Yarowsky: No Similarity, Only Difference
  • Decision Lists - 1 per Pair
  • Find Discriminants

12
Information Retrieval
  • Query/Document similarity
  • Query: expression of user's information need
  • Documents: searchable units that encode concepts
  • Paragraphs, encyclopedia entries, web pages, ...
  • Collection: searchable group of documents
  • Elementary units: terms
  • E.g. words, phrases, stems, ...
  • Bag of words (typically)
  • man, dog, bit
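The bag-of-words representation can be shown in a couple of lines, using the slide's man/dog/bit vocabulary (the sentence itself is made up):

```python
from collections import Counter

# Bag of words: order is discarded, only term counts remain.
doc = "dog bit man man bit dog dog"
bag = Counter(doc.split())
print(bag)  # Counter({'dog': 3, 'bit': 2, 'man': 2})
```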

13
Vector Space Model
  • Represent documents and queries as vectors of
    term-based features
  • Features tied to occurrence of terms in
    collection
  • E.g.
  • Solution 1: Binary features: t = 1 if term present, 0
    otherwise
  • Similarity: number of terms in common
  • Dot product
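A minimal sketch of binary features and dot-product similarity; the vocabulary and documents are made up for illustration:

```python
def binary_vector(doc_terms, vocab):
    # Feature t = 1 if the term is present in the document, else 0.
    terms = set(doc_terms)
    return [1 if t in terms else 0 for t in vocab]

def dot(u, v):
    # With binary features the dot product counts shared terms.
    return sum(a * b for a, b in zip(u, v))

vocab = ["man", "dog", "bit", "cat"]
q = binary_vector(["man", "bit"], vocab)
d = binary_vector(["dog", "bit", "man"], vocab)
print(dot(q, d))  # 2 terms in common: "man" and "bit"
```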

14
Vector Space Model II
  • Problem: Not all terms are equally interesting
  • E.g. "the" vs. "dog" vs. "Levow"
  • Solution: Replace binary term features with
    weights
  • Document collection: term-by-document matrix
  • View as vector in multidimensional space
  • Nearby vectors are related
  • Normalize for vector length

15
Vector Similarity Computation
  • Similarity: Dot product
  • Normalization
  • Normalize weights in advance
  • Normalize post-hoc
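Both normalization options can be sketched as follows: normalizing post-hoc inside the similarity function, or normalizing the weights in advance so that similarity reduces to a plain dot product (the vectors here are arbitrary examples):

```python
import math

def cosine(u, v):
    # Dot product normalized post-hoc by the vector lengths.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def normalize(v):
    # Alternative: normalize weights in advance to unit length.
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v] if n else v

u, v = [1.0, 2.0, 0.0], [2.0, 1.0, 1.0]
print(cosine(u, v))
# Same value via pre-normalized vectors and a plain dot product:
print(sum(a * b for a, b in zip(normalize(u), normalize(v))))
```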

16
Term Weighting
  • Aboutness
  • To what degree is this term what the document is
    about?
  • Within-document measure
  • Term frequency (tf): occurrences of t in doc j
  • Specificity
  • How surprised are you to see this term?
  • Collection frequency
  • Inverse document frequency (idf)
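A minimal tf-idf sketch combining the two measures; the toy documents are hypothetical, and real systems typically smooth or scale these quantities:

```python
import math
from collections import Counter

# Toy collection (hypothetical documents).
docs = [
    "the dog bit the man".split(),
    "the man walked the dog".split(),
    "the cat sat".split(),
]
N = len(docs)

def tfidf(term, doc):
    tf = Counter(doc)[term]                 # within-document: aboutness
    df = sum(1 for d in docs if term in d)  # document frequency in collection
    idf = math.log(N / df) if df else 0.0   # specificity: rare = surprising
    return tf * idf

# "the" occurs in every document, so its idf (and weight) is zero;
# "cat" is rare in the collection, so it is weighted more heavily.
print(tfidf("the", docs[0]))
print(tfidf("cat", docs[2]))
```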

17
Term Selection & Formation
  • Selection
  • Some terms are truly useless
  • Too frequent, no content
  • E.g. the, a, and,
  • Stop words: ignore such terms altogether
  • Creation
  • Too many surface forms for the same concept
  • E.g. inflections of words: verb conjugations,
    plurals
  • Stem terms: treat all forms as the same underlying term
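A minimal sketch of both steps, with a hypothetical stop list and a deliberately crude suffix-stripping stemmer (production systems use, e.g., the Porter stemmer):

```python
# Hypothetical stop list: too frequent, no content.
STOP = {"the", "a", "and", "of"}

def stem(term):
    # Crude suffix stripping; real stemmers are far more careful.
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def index_terms(text):
    # Drop stop words, then map surface forms to a shared stem.
    return [stem(t) for t in text.lower().split() if t not in STOP]

print(index_terms("The dogs and a dog walking walked"))
```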

18
Query Refinement
  • Typical queries: very short, ambiguous
  • Cat: animal or Unix command?
  • Add more terms to disambiguate, improve matching
  • Relevance feedback
  • Retrieve with original query
  • Present results
  • Ask user to tag relevant/non-relevant
  • Push query toward relevant vectors, away from non-relevant
  • Rocchio: q' = αq + β·avg(R) - γ·avg(S), with α = 1, (β, γ) = (0.75, 0.25);
    R = relevant docs, S = non-relevant docs
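The relevance-feedback update (the standard Rocchio formulation, with the (0.75, 0.25) weights from the slide) can be sketched as follows; the query and document vectors are made up for illustration:

```python
# Rocchio relevance feedback: move the query vector toward relevant
# documents and away from non-relevant ones.
ALPHA, BETA, GAMMA = 1.0, 0.75, 0.25

def rocchio(query, rel_docs, nonrel_docs):
    dims = len(query)
    def centroid(docs):
        if not docs:
            return [0.0] * dims
        return [sum(d[i] for d in docs) / len(docs) for i in range(dims)]
    r, s = centroid(rel_docs), centroid(nonrel_docs)
    return [ALPHA * query[i] + BETA * r[i] - GAMMA * s[i]
            for i in range(dims)]

q = [1.0, 0.0, 0.0]            # original query: just "cat"
rel = [[1.0, 1.0, 0.0]]        # relevant doc also mentions "animal"
nonrel = [[1.0, 0.0, 1.0]]     # non-relevant doc mentions "unix"
print(rocchio(q, rel, nonrel))
```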