
1
Word Sense Disambiguation and Information
Retrieval
  • By Guitao Gao
  • Qing Ma
  • Prof Jian-Yun Nie

2
Outline
  • Introduction
  • WSD Approaches
  • Conclusion

3
Introduction
  • Task of Information Retrieval
  • Content Representation
  • Indexing
  • Bag of words indexing
  • Problems
  • Synonymy: query expansion
  • Polysemy: Word Sense Disambiguation
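The two problems listed above can be seen in a minimal bag-of-words index. This is only a sketch: the three toy documents and the `build_index` helper are made up for illustration and are not from the slides.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

# Toy collection (hypothetical).
docs = {
    "d1": "the car plant closed",
    "d2": "the plant needs water",
    "d3": "the automobile factory closed",
}
index = build_index(docs)

# Polysemy: "plant" matches both the factory and the botany document.
print(sorted(index["plant"]))  # ['d1', 'd2']
# Synonymy: a query for "car" misses d3, which says "automobile".
print(sorted(index["car"]))    # ['d1']
```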

4
WSD Approaches
  • Disambiguation based on manually created rules
  • Disambiguation using machine readable
    dictionaries
  • Disambiguation using thesauri
  • Disambiguation based on unsupervised machine
    learning with corpora

5
Disambiguation based on manually created rules
  • Weiss' approach [Lesk 1988]
  • Set of rules to disambiguate five words
  • Context rule: within 5 words
  • Template rule: specific location
  • Accuracy: 90%
  • IR improvement: 1%
  • Small and Rieger's approach [Small 1982]
  • Expert system

6
Disambiguation using machine readable
dictionaries
  • Lesk's approach [Lesk 1988]
  • Senses are represented by different definitions
  • Look up the definitions of the context words
  • Find co-occurring words
  • Select the most similar sense
  • Accuracy: 50-70%.
  • Problem: not enough overlapping words between
    definitions
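The definition-overlap idea above can be sketched as follows. The dictionary entries, the stop list, and the function names are assumptions made up for this example; a real run would read the definitions from a machine readable dictionary.

```python
# Small stop list (an assumption; any stop list could be used here).
STOP = frozenset({"a", "an", "the", "of", "or", "in", "to", "and"})

def tokens(text):
    """Lowercased content words of a definition."""
    return {w for w in text.lower().split() if w not in STOP}

def lesk(target_senses, context_words, dictionary):
    """Pick the sense whose definition shares most words with the
    definitions of the context words."""
    context_bag = set()
    for word in context_words:
        for definition in dictionary.get(word, []):
            context_bag |= tokens(definition)
    scores = {sense: len(tokens(gloss) & context_bag)
              for sense, gloss in target_senses.items()}
    return max(scores, key=scores.get), scores

# Toy definitions (hypothetical paraphrases, not copied from a dictionary).
pine_senses = {
    "tree": "kind of evergreen tree with needle-shaped leaves",
    "yearn": "waste away through sorrow or illness",
}
toy_dictionary = {
    "cone": ["solid body which narrows to a point",
             "fruit of certain evergreen tree"],
}

best, scores = lesk(pine_senses, ["cone"], toy_dictionary)
print(best, scores)  # tree {'tree': 2, 'yearn': 0}
```

With definitions this short, a single context word often yields no overlap at all, which is the sparsity problem named on the slide.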

7
Disambiguation using machine readable dictionaries
  • Wilks' approach [Wilks 1990]
  • Attempt to solve Lesk's problem
  • Expanding dictionary definitions
  • Use Longman Dictionary of Contemporary English
    (LDOCE)
  • More word co-occurrence evidence collected
  • Accuracy: between 53% and 85%.

8
Wilks' approach [Wilks 1990]
  • Commonly co-occurring words in LDOCE [Wilks
    1990]

9
Disambiguation using machine readable dictionaries
  • Luk's approach [Luk 1995]
  • Statistical sense disambiguation
  • Use definitions from LDOCE
  • Co-occurrence data collected from the Brown corpus
  • Defining concepts: 1792 words used to write
    the definitions of LDOCE
  • LDOCE pre-processed: conceptual expansion

10
Luk's approach [Luk 1995]
Noun "sentence" and its conceptual expansion [Luk
1995]
11
Luk's approach [Luk 1995] cont.
  • Collect co-occurrence data of defining concepts
    by constructing a two-dimensional Concept
    Co-occurrence Data Table (CCDT)
  • Brown corpus divided into sentences
  • Collect conceptual co-occurrence data for each
    defining concept which occurs in the sentence
  • Insert the collected data into the Concept
    Co-occurrence Data Table.
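A minimal sketch of such a table, assuming co-occurrence simply means "both defining concepts appear in the same sentence". The toy sentences and concept set are made up, and the exact counting scheme in [Luk 1995] may differ.

```python
from collections import Counter
from itertools import combinations

def build_ccdt(sentences, defining_concepts):
    """Count, for every pair of defining concepts, the number of
    sentences in which both occur. Stored symmetrically."""
    table = Counter()
    for sentence in sentences:
        present = sorted({w for w in sentence.lower().split()
                          if w in defining_concepts})
        for a, b in combinations(present, 2):
            table[(a, b)] += 1
            table[(b, a)] += 1
    return table

# Toy corpus and concept set (hypothetical).
sentences = [
    "the plant needs water and light",
    "water and light help growth",
    "the factory needs workers",
]
concepts = {"plant", "water", "light", "growth", "factory"}

ccdt = build_ccdt(sentences, concepts)
print(ccdt[("light", "water")])    # 2
print(ccdt[("plant", "water")])    # 1
print(ccdt[("factory", "plant")])  # 0
```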

12
Luk's approach [Luk 1995] cont.
  • Score each sense S with respect to context C

[Luk 1995]
13
Luk's approach [Luk 1995] cont.
  • Select the sense with the highest score
  • Accuracy: 77%
  • Human accuracy: 71%

14
Approaches using Roget's Thesaurus [Yarowsky
1992]
  • Resources used
  • Roget's Thesaurus
  • Grolier Multimedia Encyclopedia
  • Senses of a word: categories in Roget's
    Thesaurus
  • 1042 broad categories covering areas like
    tools/machinery or animals/insects

15
Approaches using Roget's Thesaurus [Yarowsky
1992] cont.
  • tool, implement, appliance, contraption,
    apparatus, utensil, device, gadget, craft,
    machine, engine, motor, dynamo, generator, mill,
    lathe, equipment, gear, tackle, tackling,
    rigging, harness, trappings, fittings,
    accoutrements, paraphernalia, equipage, outfit,
    appointments, furniture, material, plant,
    appurtenances, a wheel, jack, clockwork,
    wheel-work, spring, screw,

Some words placed into the tools/machinery
category [Yarowsky 1992]
16
Approaches using Roget's Thesaurus [Yarowsky
1992] cont.
  • Collect context for each category
  • From the Grolier Encyclopedia
  • For each occurrence of each member of the category
  • Extract the 100 surrounding words

Sample occurrences of words in the tools/machinery
category [Yarowsky 1992]
17
Approaches using Roget's Thesaurus [Yarowsky
1992] cont.
  • Identify and weight salient words

Sample salient words for Roget categories 348 and
414 [Yarowsky 1992]
  • To disambiguate a word: sum up the weights of
    all salient words appearing in context
  • Accuracy: 92% when disambiguating 12 words

18
Introduction to WordNet (1)
  • Online thesaurus system
  • Synsets: sets of synonymous words
  • Hierarchical relationships
19
Introduction to WordNet (2)
[Sanderson 2000]
20
Voorhees' Disambiguation Experiment
  • Calculation of semantic distance: synset and
    context words
  • Word's sense: the synset closest to the context words
  • Retrieval result: worse than without disambiguation

21
Gonzalo's IR experiment (1)
  • Two questions
  • Can WordNet really offer any potential for text
    retrieval?
  • How is text retrieval performance affected by
    disambiguation errors?

22
Gonzalo's IR experiment (2)
  • Text collection: summary and document
  • Experiments
  • 1. Standard SMART run
  • 2. Indexed in terms of word senses
  • 3. Indexed in terms of synsets
  • 4. Introduction of disambiguation errors
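The three indexing schemes can be contrasted with a small sketch. It assumes each token is already hand-tagged with a sense number and a synset identifier; the identifiers, the tagged triples, and `index_terms` are invented for this example.

```python
def index_terms(tagged_text, mode):
    """tagged_text: list of (word, sense_number, synset_id) triples.
    Returns the indexing terms for the chosen scheme."""
    if mode == "word":
        return [word for word, _, _ in tagged_text]
    if mode == "sense":
        return [f"{word}#{sense}" for word, sense, _ in tagged_text]
    if mode == "synset":
        return [synset for _, _, synset in tagged_text]
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical tagged document and query; synset ids are made up.
doc = [("car", 1, "n0001"), ("plant", 1, "n0002")]
query = [("automobile", 1, "n0001"), ("factory", 1, "n0002")]

for mode in ("word", "sense", "synset"):
    shared = set(index_terms(doc, mode)) & set(index_terms(query, mode))
    print(mode, sorted(shared))
# word []
# sense []
# synset ['n0001', 'n0002']
```

Synset indexing lets synonyms match, which word and word-sense indexing cannot do; it also means a wrong synset assignment removes a match that plain word indexing would have found.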

23
Gonzalo's IR experiment (3)
  • Experiment: % correct documents
    retrieved
  • Indexing by synsets: 62.0
  • Indexing by word senses: 53.2
  • Indexing by words: 48.0
  • Indexing by synsets (5% errors): 62.0
  • Id. with 10% errors: 60.8
  • Id. with 20% errors: 56.1
  • Id. with 30% errors: 54.4
  • Id. with all possible: 52.6
  • Id. with 60% errors: 49.1

24
Gonzalo's IR experiment (4)
  • Disambiguation with WordNet can improve text
    retrieval
  • The solution lies in a reliable automatic WSD technique

25
Disambiguation With Unsupervised Learning
  • Yarowsky's Unsupervised Method
  • One Sense Per Collocation
  • e.g. plant (manufacturing/life)
  • One Sense Per Discourse
  • e.g. defense (war/sports)

26
Yarowsky's Unsupervised Method cont.
  • Algorithm details
  • Step 1: Store the word and its contexts as lines
  • e.g. "... zonal distribution of plant life ..."
  • Step 2: Identify a few words that represent each
    word sense
  • e.g. plant (manufacturing/life)
  • Step 3a: Get rules from the training set
  • plant + X => A, weight
  • plant + Y => B, weight
  • Step 3b: Use the rules created in 3a to classify
    all occurrences of plant in the sample set.

27
Yarowsky's Unsupervised Method cont.
  • Step 3c: Use the one-sense-per-discourse rule to
    filter or augment this addition
  • Step 3d: Repeat steps 3a-3c iteratively.
  • Step 4: The training converges on a stable
    residual set.
  • Step 5: The result is a set of rules. Those
    rules will be used to disambiguate the word
    plant.
  • e.g. plant + growth => life
  • plant + car => manufacturing
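The bootstrapping loop in the steps above can be sketched as follows. This is a simplified illustration, not Yarowsky's implementation: it assumes a smoothed log-likelihood ratio as the rule weight, omits the one-sense-per-discourse step (3c), and only adds labels (it never revises earlier ones). The contexts, seeds, `alpha`, and `threshold` are all made up.

```python
import math
from collections import Counter, defaultdict

SENSES = ("life", "manufacturing")

def learn_rules(labeled, alpha=0.1):
    """Step 3a: one rule per collocate, weighted by a smoothed
    log-likelihood ratio, strongest rule first."""
    counts = defaultdict(Counter)
    for words, sense in labeled:
        for w in set(words):
            counts[w][sense] += 1
    rules = []
    for w, c in counts.items():
        llr = math.log((c[SENSES[0]] + alpha) / (c[SENSES[1]] + alpha))
        rules.append((w, SENSES[0] if llr > 0 else SENSES[1], abs(llr)))
    return sorted(rules, key=lambda rule: -rule[2])

def apply_rules(words, rules, threshold):
    """Step 3b: the strongest matching rule above the threshold decides."""
    for collocate, sense, weight in rules:
        if weight < threshold:
            break
        if collocate in words:
            return sense
    return None

def bootstrap(contexts, seeds, threshold=1.0, max_iter=10):
    # Step 2: label the contexts that contain a seed collocate.
    labels = {i: sense for i, words in enumerate(contexts)
              for seed, sense in seeds.items() if seed in words}
    rules = []
    for _ in range(max_iter):  # Step 3d: iterate
        rules = learn_rules([(contexts[i], s) for i, s in labels.items()])
        new_labels = dict(labels)
        for i, words in enumerate(contexts):
            if i not in labels:
                sense = apply_rules(words, rules, threshold)
                if sense is not None:
                    new_labels[i] = sense
        if new_labels == labels:  # Step 4: nothing new was labeled
            break
        labels = new_labels
    return labels, rules

# Step 1: contexts of "plant", target word removed (toy data).
contexts = [
    ["growth", "life", "zonal"],
    ["car", "manufacturing", "assembly"],
    ["growth", "soil"],
    ["assembly", "workers"],
    ["soil", "species"],
    ["workers", "strike"],
]
seeds = {"life": "life", "manufacturing": "manufacturing"}

labels, rules = bootstrap(contexts, seeds)
print(labels[4], labels[5])  # life manufacturing
```

Contexts 4 and 5 share no word with the seed contexts; they are reached only in the second pass, through the collocates "soil" and "workers" learned in the first.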

28
Yarowsky's Unsupervised Method cont.
  • Advantages of this method
  • Better accuracy compared to other unsupervised
    methods
  • No need for costly hand-tagged training
    sets (as in supervised methods)

29
Schütze and Pedersen's approach [Schütze 1995]
  • Source of word sense definitions
  • Not using a dictionary or thesaurus
  • Using only the corpus to be
    disambiguated (Category B TREC-1 collection)
  • Thesaurus construction
  • Collect a (symmetric) term-term matrix C
  • Entry c_ij: number of times that words i and j
    co-occur in a symmetric window of total size k
  • Use SVD to reduce the dimensionality
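Collecting the term-term counts can be sketched as below. The sketch assumes that a window of total size k means k // 2 tokens on each side of the word; the slides do not spell this out. The SVD step is left out to keep the example free of external libraries.

```python
from collections import Counter

def cooccurrence(tokens, k):
    """Count how often each ordered pair of words co-occurs within a
    symmetric window (k // 2 tokens on each side; an assumption)."""
    half = k // 2
    counts = Counter()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - half), min(len(tokens), i + half + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

# Toy token stream (hypothetical).
tokens = "plant life needs water plant workers strike".split()
c = cooccurrence(tokens, k=2)

print(c[("plant", "water")])   # 1
print(c[("water", "plant")])   # 1
print(c[("plant", "strike")])  # 0
```

Because the window is symmetric, the resulting matrix is symmetric as well: every count for (i, j) equals the count for (j, i).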

30
Schütze and Pedersen's approach [Schütze 1995]
cont.
  • Thesaurus vector: columns
  • Semantic similarity: cosine between columns
  • Thesaurus: associate each word with its nearest
    neighbors
  • Context vector: sum of the thesaurus vectors of
    the context words
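These three notions can be sketched with plain lists standing in for the (reduced) matrix columns. The two-dimensional vectors and all helper names are invented for the example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbors(word, thesaurus, n=2):
    """The n words whose thesaurus vectors are most similar to word's."""
    others = [w for w in thesaurus if w != word]
    others.sort(key=lambda w: cosine(thesaurus[word], thesaurus[w]), reverse=True)
    return others[:n]

def context_vector(context_words, thesaurus):
    """Sum of the thesaurus vectors of the context words; unknown words are skipped."""
    dim = len(next(iter(thesaurus.values())))
    total = [0.0] * dim
    for w in context_words:
        for d, x in enumerate(thesaurus.get(w, [0.0] * dim)):
            total[d] += x
    return total

# Toy thesaurus vectors (hypothetical).
thesaurus = {
    "soil": [3.0, 0.0],
    "growth": [2.0, 1.0],
    "factory": [0.0, 3.0],
    "workers": [1.0, 2.0],
}

print(nearest_neighbors("soil", thesaurus, n=1))    # ['growth']
print(context_vector(["soil", "growth"], thesaurus))  # [5.0, 1.0]
```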

31
Schütze and Pedersen's approach [Schütze 1995]
cont.
  • Disambiguation algorithm
  • Identify context vectors corresponding to all
    occurrences of a particular word
  • Partition them into regions of high density
  • Tag a sense for each such region
  • Disambiguating a word
  • Compute context vector of its occurrence
  • Find the closest centroid of a region
  • Assign the occurrence the sense of that centroid
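The final assignment step can be sketched as a nearest-centroid lookup. The sketch assumes the context vectors have already been partitioned into regions by some clustering procedure, which is not shown; the vectors and sense labels are made up.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def assign_sense(vector, centroids):
    """Return the sense label whose centroid is closest by cosine."""
    return max(centroids, key=lambda sense: cosine(vector, centroids[sense]))

# Context vectors of one ambiguous word, already grouped (hypothetical).
clusters = {
    "sense_1": [[5.0, 1.0], [4.0, 0.0]],
    "sense_2": [[0.0, 4.0], [1.0, 5.0]],
}
centroids = {sense: centroid(vectors) for sense, vectors in clusters.items()}

print(centroids["sense_1"])                  # [4.5, 0.5]
print(assign_sense([3.0, 1.0], centroids))   # sense_1
```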

32
Schütze and Pedersen's approach [Schütze 1995]
cont.
  • Accuracy: 90%
  • Application to IR
  • Replacing the words by word senses
  • Average precision of sense-based retrieval over 11
    points of recall increased 4% with respect to
    word-based retrieval.
  • Combine the rankings for each document
  • Average precision increased 11%
  • Each occurrence is assigned n (2, 3, 4, 5) senses
  • Average precision increased 14% for n = 3
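For readers unfamiliar with the measure, a common way to compute average precision over 11 points of recall is sketched below. It uses the usual interpolation (the best precision at any recall at or above each level); whether the cited experiments used exactly this variant is an assumption.

```python
def eleven_point_average_precision(ranked_relevance, total_relevant):
    """ranked_relevance: booleans in rank order, True for a relevant document.
    total_relevant: number of relevant documents for the query."""
    points = []  # (recall, precision) after each relevant document
    hits = 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]
    return sum(interpolated) / len(levels)

# Relevant documents at ranks 1 and 3, two relevant documents in total.
score = eleven_point_average_precision([True, False, True], total_relevant=2)
print(round(score, 4))  # 0.8485
```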

33
Schütze and Pedersen's approach [Schütze 1995]
cont.
34
Conclusion
  • How much can WSD help improve IR effectiveness?
    Open question
  • Weiss: 1% improvement; Voorhees' method: negative
  • Krovetz and Croft, Sanderson: only useful for
    short queries
  • Schütze and Pedersen's approach and Gonzalo's
    experiment: positive results
  • WSD must be accurate to be useful for IR
  • Schütze and Pedersen's and Yarowsky's algorithms:
    promising for IR
  • Luk's approach: robust to sparse data, suitable
    for small corpora.

35
References
  • [Krovetz 1992] R. Krovetz and W.B. Croft.
    Lexical Ambiguity and Information Retrieval, in
    ACM Transactions on Information Systems, 10(1),
    1992
  • [Gonzalo 1998] J. Gonzalo, F. Verdejo, I. Chugur
    and J. Cigarran, Indexing with WordNet synsets
    can improve Text Retrieval, Proceedings of the
    COLING/ACL 98 Workshop on Usage of WordNet for
    NLP, Montreal, 1998
  • [Lesk 1988] M. Lesk, They said true things,
    but called them by wrong names: vocabulary
    problems in retrieval systems, in Proc. 4th
    Annual Conference of the University of Waterloo
    Centre for the New OED, 1988
  • [Luk 1995] A.K. Luk. Statistical sense
    disambiguation with relatively small corpora
    using dictionary definitions. In Proceedings of
    the 33rd Annual Meeting of the ACL, Columbus,
    Ohio, June 1995. Association for Computational
    Linguistics.
  • [Salton 1983] G. Salton and M.J. McGill.
    Introduction to Modern Information Retrieval. The
    SMART and SIRE experimental retrieval systems,
    New York: McGraw-Hill, 1983
  • [Sanderson 1997] M. Sanderson. Word Sense
    Disambiguation and Information Retrieval, PhD
    Thesis, Technical Report (TR-1997-7) of the
    Department of Computing Science at the University
    of Glasgow, Glasgow G12 8QQ, UK.
  • [Sanderson 2000] M. Sanderson, Retrieving
    with Good Sense,
    http://citeseer.nj.nec.com/sanderson00retrieving.html, 2000

36
References cont.
  • [Schütze 1995] H. Schütze and J.O. Pedersen.
    Information retrieval based on word senses, in
    Proceedings of the Symposium on Document Analysis
    and Information Retrieval, 4: 161-175.
  • [Small 1982] S. Small and C. Rieger, Parsing and
    comprehending with word experts (a theory and its
    realisation), in Strategies for Natural Language
    Processing, W.G. Lehnert and M.H. Ringle, Eds.,
    LEA: 89-148, 1982
  • [Voorhees 1993] E.M. Voorhees, Using WordNet
    to disambiguate word sense for text retrieval, in
    Proceedings of ACM SIGIR Conference, (16):
    171-180, 1993
  • [Weiss 1973] S.F. Weiss. Learning to
    disambiguate, in Information Storage and
    Retrieval, 9: 33-41, 1973
  • [Wilks 1990] Y. Wilks, D. Fass, C. Guo, J.E.
    McDonald, T. Plate, B.M. Slator.
    Providing Machine Tractable Dictionary Tools, in
    Machine Translation, 5: 99-154, 1990
  • [Yarowsky 1992] D. Yarowsky, Word sense
    disambiguation using statistical models of
    Roget's categories trained on large corpora, in
    Proceedings of COLING Conference: 454-460, 1992
  • [Yarowsky 1994] D. Yarowsky. Decision lists for
    lexical ambiguity resolution: Application to
    Accent Restoration in Spanish and French. In
    Proceedings of the 32nd Annual Meeting of the
    Association for Computational Linguistics, Las
    Cruces, NM, 1994
  • [Yarowsky 1995] D. Yarowsky. Unsupervised word
    sense disambiguation rivaling supervised
    methods. In Proceedings of the 33rd Annual
    Meeting of the Association for Computational
    Linguistics, pages 189-196, Cambridge, MA, 1995