1
Word sense disambiguation and information
retrieval
  • Chapter 17
  • Jurafsky, D. and Martin, J. H.
  • SPEECH and LANGUAGE PROCESSING
  • Jarmo Ritola - jarmo.ritola@hut.fi

2
Lexical Semantic Processing
  • Word sense disambiguation
    - which sense of a word is being used
    - a non-trivial task
    - robust algorithms
  • Information retrieval
    - a broad field
    - storage and retrieval of requested text documents
    - the vector space model

3
Word Sense Disambiguation
  • (17.1) "..., everybody has a career and none of
    them includes washing DISHES"
  • (17.2) "In her tiny kitchen at home, Ms. Chen
    works efficiently, stir-frying several simple
    DISHES, including braised pig's ears and chicken
    livers with green peppers"
  • (17.6) "I'm looking for a restaurant that SERVES
    vegetarian DISHES"

4
Selectional Restriction
  • Rule-to-rule approach
  • Blocks the formation of representations with
    selectional restriction violations
  • Correct sense obtained as a side effect
  • PATIENT roles, mutual exclusion
  • dishes + stir-fry → food sense
  • dishes + wash → artifact sense
  • Need hierarchical types and restrictions
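
A minimal sketch of that last point: a small type hierarchy plus per-verb PATIENT restrictions can block the incompatible sense. The hierarchy, restriction entries, and sense names below are invented for illustration, following the slide's simplification.

    # Sketch: tiny type hierarchy; all entries are illustrative assumptions.
    HIERARCHY = {"food": "physical-object",
                 "artifact": "physical-object",
                 "physical-object": "entity"}

    def is_a(t, ancestor):
        """True if type t equals ancestor or inherits from it."""
        while t is not None:
            if t == ancestor:
                return True
            t = HIERARCHY.get(t)
        return False

    # Simplified PATIENT restrictions, following the slide.
    RESTRICTIONS = {"stir-fry": "food", "wash": "artifact"}
    # The two mutually exclusive senses of "dishes".
    SENSES = {"dishes/food": "food", "dishes/artifact": "artifact"}

    def surviving_senses(verb):
        """Senses whose type satisfies the verb's PATIENT restriction."""
        return [s for s, t in SENSES.items()
                if is_a(t, RESTRICTIONS[verb])]

    print(surviving_senses("stir-fry"))  # ['dishes/food']
    print(surviving_senses("wash"))      # ['dishes/artifact']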

5
S.R. Limitations
  • Selectional restrictions are too general
    - (17.7) "What kind of DISHES do you recommend?"
  • True restriction violations
    - (17.8) "you can't EAT gold for lunch"
      (a negative environment)
    - (17.9) "Mr. Kulkarni ATE glass"
  • Metaphoric and metonymic uses
  • Selectional association (Resnik)
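
Resnik's selectional association replaces hard restrictions with a graded preference score. A rough sketch of the computation, assuming the class probabilities P(c) and P(c|v) are already estimated (in Resnik's work, over WordNet classes from parsed corpora); the numbers below are toy values.

    import math

    def selectional_association(p_c_given_v, p_c):
        """A(v, c) = P(c|v) * log(P(c|v)/P(c)) / S(v), where S(v),
        the selectional preference strength, is the KL divergence
        of P(c|v) from P(c)."""
        s_v = sum(p * math.log(p / p_c[c])
                  for c, p in p_c_given_v.items() if p > 0)
        return {c: p * math.log(p / p_c[c]) / s_v
                for c, p in p_c_given_v.items() if p > 0}

    p_c = {"food": 0.3, "artifact": 0.7}            # toy prior
    p_c_given_eat = {"food": 0.9, "artifact": 0.1}  # toy conditional
    print(selectional_association(p_c_given_eat, p_c))
    # "eat" strongly prefers the food class.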

6
Robust Word Sense Disambiguation
  • Robust, stand-alone systems
  • Preprocessing
    - part-of-speech tagging, context selection,
      stemming, morphological processing, parsing
  • Feature selection, feature vector
  • Train a classifier to assign words to senses
    - supervised, bootstrapping, unsupervised
  • Does the system scale?

7
Inputs: Feature Vectors
  • Target word, context
  • Select relevant linguistic features
  • Encode them in a usable form
  • Numeric or nominal values
  • Collocational features
  • Co-occurrence features

8
Inputs: Feature Vectors (2)
  • (17.11) An electric guitar and BASS player stand
    off to one side, not really part of the scene,
    just as a sort of nod to gringo expectations
    perhaps.
  • Collocational
    - [guitar, NN1, and, CJC, player, NN1, stand, VVB]
  • Co-occurrence
    - [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
    - over the vocabulary [fishing, big, sound, player,
      fly, rod, pound, double, runs, playing, guitar,
      band]
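
A minimal sketch of encoding both feature types for the BASS example above; the POS tags are taken as given from the slide rather than produced by a real tagger, and the window of ±2 words follows the collocational example.

    # Sketch: encode collocational and co-occurrence features.
    VOCAB = ["fishing", "big", "sound", "player", "fly", "rod",
             "pound", "double", "runs", "playing", "guitar", "band"]

    def collocational(tagged, i, window=2):
        """Words and POS tags at offsets -window..+window around i."""
        feats = []
        for off in range(-window, window + 1):
            if off == 0:
                continue
            word, tag = tagged[i + off]
            feats.extend([word, tag])
        return feats

    def co_occurrence(words, vocab=VOCAB):
        """Binary vector: does each vocabulary word occur in context?"""
        present = set(words)
        return [1 if w in present else 0 for w in vocab]

    # POS tags taken as given (a real system would run a tagger).
    tagged = [("an", "AT0"), ("electric", "AJ0"), ("guitar", "NN1"),
              ("and", "CJC"), ("bass", "NN1"), ("player", "NN1"),
              ("stand", "VVB"), ("off", "AVP")]
    print(collocational(tagged, 4))  # guitar NN1 and CJC player NN1 stand VVB
    print(co_occurrence([w for w, _ in tagged]))  # 1s at player and guitar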

9
Supervised Learning
  • Feature-encoded inputs → categories
  • Naïve Bayes classifier
  • Decision-list classifiers
    - like case statements
    - tests ordered by sense likelihood
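
A compact naïve Bayes sketch for sense classification over nominal features. Add-one smoothing is an assumption here, one common choice rather than anything prescribed by the chapter; the training data would be feature vectors like those on slide 8, paired with hand-labeled senses.

    import math
    from collections import Counter, defaultdict

    def train_nb(examples):
        """examples: list of (feature_list, sense) pairs."""
        sense_counts, feat_counts = Counter(), defaultdict(Counter)
        for feats, sense in examples:
            sense_counts[sense] += 1
            for f in feats:
                feat_counts[sense][f] += 1
        return sense_counts, feat_counts

    def classify(feats, sense_counts, feat_counts, vocab_size):
        """argmax_s P(s) * prod_f P(f|s), in log space, add-one smoothed."""
        total = sum(sense_counts.values())
        def score(s):
            n = sum(feat_counts[s].values())
            logp = math.log(sense_counts[s] / total)
            for f in feats:
                logp += math.log((feat_counts[s][f] + 1) / (n + vocab_size))
            return logp
        return max(sense_counts, key=score)

    # Toy usage with invented data:
    sc, fc = train_nb([(["guitar", "player"], "music"),
                       (["fishing", "rod"], "fish")])
    print(classify(["guitar"], sc, fc, vocab_size=12))  # music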

10
Bootstrapping Approaches
  • Seeds: a small number of labeled instances
  • An initial classifier extracts a larger training set
  • Repeat → a series of classifiers with improving
    accuracy and coverage
  • Hand-labeling of examples
  • One sense per collocation
  • Also automatic selection from a machine-readable
    dictionary
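
A sketch of the bootstrapping loop in the spirit of Yarowsky's algorithm: seed-labeled examples train a classifier, confidently labeled new examples are folded in, and the cycle repeats. Here `train` and `predict_with_confidence` stand in for any supervised learner (e.g. the naïve Bayes classifier above), and the confidence threshold is an illustrative choice.

    def bootstrap(seeds, unlabeled, train, predict_with_confidence,
                  threshold=0.95, max_rounds=10):
        """Grow a labeled training set from a small seed set."""
        labeled, pool = list(seeds), list(unlabeled)
        for _ in range(max_rounds):
            model = train(labeled)
            confident, rest = [], []
            for feats in pool:
                sense, conf = predict_with_confidence(model, feats)
                if conf >= threshold:
                    confident.append((feats, sense))
                else:
                    rest.append(feats)
            if not confident:      # converged: nothing new to add
                break
            labeled.extend(confident)
            pool = rest
        return train(labeled)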

11
Unsupervised Methods
  • Unlabeled feature vectors are grouped into
    clusters according to a similarity metric
  • Clusters are labeled by hand
  • Agglomerative clustering
  • Challenges
    - the correct senses may not be known in advance
    - heterogeneous clusters
    - the number of clusters and the number of senses
      may differ
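
A minimal average-link agglomerative clustering sketch over feature vectors, using cosine similarity and stopping at a fixed cluster count k (one simple criterion among several).

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    def average_link(c1, c2):
        """Mean pairwise similarity between two clusters."""
        return sum(cosine(u, v) for u in c1 for v in c2) / (len(c1) * len(c2))

    def agglomerate(vectors, k):
        """Repeatedly merge the two most similar clusters until k remain."""
        clusters = [[v] for v in vectors]
        while len(clusters) > k:
            i, j = max(((a, b) for a in range(len(clusters))
                        for b in range(a + 1, len(clusters))),
                       key=lambda p: average_link(clusters[p[0]],
                                                  clusters[p[1]]))
            clusters[i] += clusters.pop(j)
        return clusters

    # e.g. agglomerate(feature_vectors, k=2); each resulting cluster
    # would then be labeled with a sense by hand.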

12
Dictionary Based Approaches
  • Large-scale disambiguation possible
  • Sense definitions retrieved from the dictionary
  • Choose the sense with the highest overlap between
    its definition and the context words
  • Dictionary entries are relatively short
    - often not enough overlap
    - remedies: expand the word lists, use subject codes
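
The overlap idea is essentially the simplified Lesk algorithm: pick the sense whose dictionary gloss shares the most words with the context. A sketch, with made-up glosses standing in for real dictionary entries.

    def disambiguate(word_senses, context):
        """word_senses: dict sense -> gloss text; context: list of words.
        Returns the sense whose gloss overlaps the context the most."""
        context_words = set(context)
        def overlap(gloss):
            return len(set(gloss.lower().split()) & context_words)
        return max(word_senses, key=lambda s: overlap(word_senses[s]))

    # Toy glosses, loosely paraphrased for illustration only.
    senses = {
        "bass/fish": "a type of fish found in rivers and sea fishing",
        "bass/music": "the lowest part in music played by a guitar player",
    }
    print(disambiguate(senses, ["guitar", "and", "bass", "player", "stand"]))
    # -> bass/music (two overlapping words beat one)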

13
Information Retrieval
  • Compositional semantics
  • Bag-of-words methods
  • Terminology
  • document
  • collection
  • term
  • query
  • Ad hoc retrieval

14
The Vector Space Model
  • List of terms within the collection
  • Document vector
    - presence/absence of terms
    - raw term frequency
    - normalization → only the direction of the vector
      matters
  • Similarity is the cosine of the angle between
    vectors
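
A sketch of the ranking step: after length normalization only the direction of each vector remains, so cosine similarity reduces to a dot product of unit vectors. The term space and counts below are invented.

    import math

    def normalize(vec):
        """Scale to unit length so only the direction matters."""
        n = math.sqrt(sum(x * x for x in vec))
        return [x / n for x in vec] if n else vec

    def rank(query_vec, doc_vecs):
        """Rank documents by cosine similarity to the query; with
        unit vectors, cosine is just a dot product."""
        q = normalize(query_vec)
        scores = [(sum(a * b for a, b in zip(q, normalize(d))), i)
                  for i, d in enumerate(doc_vecs)]
        return sorted(scores, reverse=True)

    # Toy term space [speech, language], counts invented:
    print(rank([1, 1], [[8, 2], [5, 5], [1, 9]]))
    # [5, 5] ranks first: it points exactly along the query.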

15
The Vector Space Model
  • Document collection
  • Term-by-document matrix of weights

16
Term Weighting
  • Enormous impact on retrieval effectiveness
  • Term frequency within a single document
  • Distribution of the term across the collection
  • Same weighting scheme for documents and queries
  • Alternative weighting methods for queries
    - e.g. AltaVista: the indexed collection contains
      on the order of 1,000,000,000 words, while the
      average query is only 2.3 words
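
The classic scheme combining both signals is tf-idf: a term's weight grows with its frequency in a document and shrinks with the number of documents containing it. A minimal sketch of one common variant (raw tf times log inverse document frequency).

    import math

    def tf_idf(collection):
        """collection: list of documents, each a list of terms.
        Returns one dict per document mapping term -> tf * idf."""
        n_docs = len(collection)
        df = {}                                 # document frequency
        for doc in collection:
            for term in set(doc):
                df[term] = df.get(term, 0) + 1
        weights = []
        for doc in collection:
            tf = {}
            for term in doc:
                tf[term] = tf.get(term, 0) + 1  # raw term frequency
            weights.append({t: n * math.log(n_docs / df[t])
                            for t, n in tf.items()})
        return weights

    docs = [["speech", "speech", "language"],
            ["language", "language", "retrieval"]]
    print(tf_idf(docs))  # a term occurring in every document gets weight 0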

17
Recall versus precision
  • Stemming
  • Stop list
  • Homonymy, polysemy, synonymy, hyponymy
  • Improving user queries
    - relevance feedback
    - query expansion: thesaurus, thesaurus generation,
      term clustering
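
Relevance feedback is classically implemented with the Rocchio update: move the query vector toward the centroid of documents the user marked relevant and away from the non-relevant ones. The alpha, beta, gamma weights below are conventional illustrative values, not fixed by the chapter.

    def rocchio(query, relevant, nonrelevant,
                alpha=1.0, beta=0.75, gamma=0.15):
        """All vectors are equal-length lists over the same term space;
        negative weights are clipped to zero, a common convention."""
        def centroid(vecs):
            if not vecs:
                return [0.0] * len(query)
            return [sum(col) / len(vecs) for col in zip(*vecs)]
        r, nr = centroid(relevant), centroid(nonrelevant)
        return [max(0.0, alpha * q + beta * a - gamma * b)
                for q, a, b in zip(query, r, nr)]

    # e.g. rocchio([1, 1], relevant=[[8, 2]], nonrelevant=[[5, 5]])
    # -> [6.25, 1.75]: the query tilts toward the relevant document.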

18
Summary
  • WSD: assigning words to senses
  • Selectional restrictions
  • Machine learning approaches (small scale)
    - supervised, bootstrapping, unsupervised
  • Machine-readable dictionaries (large scale)
  • Bag-of-words methods, the vector space model
  • Query improvement (relevance feedback)

19
Exercise - Relevance Feedback
The document collection is ordered according to
the raw term frequency of the words "speech" and
"language". The values and ordering are shown in
the table below.
  • You want to find documents with many "speech"
    occurrences but few "language" occurrences
    (e.g. a ratio of 8:2). Your initial query is
    "speech", "language", i.e. both terms have
    equal weights.
  • The search engine always returns the three most
    similar documents.
  • Show that with relevance feedback you get the
    documents you want.
  • How important is the correctness of the feedback
    from the user?
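
The exercise's table did not survive this transcript, so the intended mechanics can only be sketched with invented counts and an assumed Rocchio-style update (see slide 17): after one round of feedback, the speech-heavy documents rise into the top three.

    import math

    # Hypothetical (speech, language) term frequencies per document;
    # the original table is not reproduced, so these are invented.
    docs = [(8, 2), (2, 8), (5, 5), (9, 1), (1, 9), (4, 6)]

    def cos(q, d):
        return (q[0] * d[0] + q[1] * d[1]) / (math.hypot(*q) * math.hypot(*d))

    def top3(q):
        return sorted(range(len(docs)), key=lambda i: cos(q, docs[i]),
                      reverse=True)[:3]

    q = (1.0, 1.0)            # "speech", "language" with equal weights
    print(top3(q))            # [2, 5, 0]: mostly balanced documents

    # The user marks doc 0 (8, 2) relevant and docs 2 and 5 non-relevant;
    # a Rocchio-style update (beta=0.75, gamma=0.15, assumed values):
    q2 = (1 + 0.75 * 8 - 0.15 * (5 + 4) / 2,
          1 + 0.75 * 2 - 0.15 * (5 + 6) / 2)
    print(top3(q2))           # [0, 3, 2]: speech-heavy documents now lead

Mislabeled feedback moves the query just as decisively in the wrong direction, which is what the second question probes.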