Improving WSD lexical resources using Topic Signatures

Slides: 17
Provided by: gliozz
Transcript and Presenter's Notes


1
  • Improving WSD lexical resources using Topic
    Signatures
  • Alfio Gliozzo
  • gliozzo_at_itc.it
  • ITC-irst
  • http://tcc.itc.it/research/textec/topics/disambiguation/

2
Outline
  • The Knowledge Acquisition Bottleneck
  • Augmenting WordNet information
  • Topic signature definition
  • Topic signatures for WordNet synsets
  • Methodologies for Topic Signature Development
  • Problems in evaluating Topic Signature Quality
  • WSD using topic signature
  • Relations between topic signatures and WND
  • Topic signature disambiguation?

3
The Knowledge Acquisition Bottleneck
  • Many supervised NLP systems suffer from the lack
    of widely available semantically tagged corpora
    (Màrquez, 2000)
  • Typically 1000-2500 occurrences of each word are
    required to train an accurate WSD system
    (e.g. 78% accuracy for the lemma interest using
    2476 examples; Bruce and Wiebe, 1994)
  • Performance decreases dramatically with a small
    corpus (e.g. Senseval-2 systems: 10 examples per
    word sense, maximum accuracy 64%)
  • Ng (1997) estimated that the manual annotation
    effort necessary to build a complete sense-tagged
    English corpus for WSD would be about 16 man-years

4
Solutions (proposals)
  • Improving ML techniques by optimizing the learning
    curve: boosting (Escudero et al., 2000), SVM,
    feature selection (Moldovan et al., 2001)
  • Acquiring learning examples using automatic
    procedures
  • Bootstrapping (learn supervised classifiers on a
    small training corpus in order to produce large
    corpora of lower-quality examples for training)
  • Acquiring sense-tagged corpora from the WWW
    (Mihalcea and Moldovan, 1999)
  • Augmenting WordNet information
  • Computational theory to explain synonymy and
    polysemy phenomena (??????)

5
Acquiring sense-tagged corpora from the WWW (Mihalcea
and Moldovan, 1999)
  • Problem: automatically develop a large-scale
    training corpus for WSD
  • Solution: use WordNet information to build queries
    about word senses; concordances in the retrieved
    documents can be used as examples
  • Queries are built from the synset information
    (synonyms and glosses)
  • e.g. queries for the noun interest
  • Sense 1: "sense of concern" AND (interest OR
    involvement)
  • Sense 4: "fixed charge" AND interest; "percentage of
    amount" AND interest
  • Results: 20 lemmas (7 nouns, 7 verbs, 3 adjectives,
    3 adverbs), mean polysemy 6, 1080 examples manually
    checked, accuracy 91%

6
Lexical resources
  • To reduce the size of the learning corpus,
    lexical databases (e.g. WordNet) can be used
    as an additional source of information
  • The main problem with such resources is the lack
    of a uniform and objective methodology, so most
    of the information they contain comes from
    the lexicographer's intuition
  • WordNet has been criticized for its lack of
    relations between topically related senses and
    for the proliferation of word senses (Magnini and
    Cavaglià, 2000)

7
Augmenting WordNet information
  • Manual annotation
  • e.g. WN-Domains (Magnini and Cavaglià, 2000)
  • Acquisition from annotated corpora
  • e.g. corpus-based domain acquisition (Magnini et
    al., 2001)
  • Acquisition from dictionaries
  • Acquisition from non-annotated corpora and the WWW
  • Topic Signatures

8
Topic signature definition
  • Concepts are linked to lists of topically related
    words
  • A topic signature is defined as a family of
    related terms $\{t, \langle (w_1, s_1), \dots, (w_i, s_i), \dots \rangle\}$,
    where $t$ is the topic (i.e. the target concept, a
    word sense) and $w_i$ is a word associated with the
    topic with strength $s_i$
  • In order to acquire topic signatures, a set of
    documents related to the topic is required
  • Topic signatures are extracted by comparing the
    distribution of terms in topic-related and
    non-topic-related collections of documents

9
Topic signatures for synsets
  • The topic is the word sense
  • The collection of texts for a sense is the set of
    documents (sentences) in which the word sense
    occurs
  • The topic signature is a list of topic-related
    words (the most representative words for the
    document collection)
  • Example: topic signatures for the lemma waiter
  • Waiter, server (a person whose occupation is to
    serve at table, as in a restaurant): restaurant,
    waitress, dinner, lunch, etc.
  • Waiter (a person who waits or awaits): hospital,
    station, airport, boyfriend, girlfriend, etc.

10
Acquiring collections of documents related to
senses
  • Topic signatures are extracted from collections of
    documents related to the senses (Agirre et al., 2000)
  • Sense-tagged corpora (SemCor, Miller et al., 1993;
    WSJ part of DSO, Ng and Lee, 1996): the
    collection of all the documents (sentences)
    containing the synset
  • Query on collections of documents (Altavista):
    the collection of documents retrieved by a search
    engine with an appropriate query
11
Obtaining sense related documents from WWW
  • Queries are constructed using the information in
    the ontology, following the methodology of
    Mihalcea and Moldovan (1999)
  • Four procedures
  • 1) Use monosemous synonyms
  • 2) Use the defining phrases
  • 3) Use synonyms with the AND operator and words
    from the defining phrase with the NEAR operator
  • 4) Use synonyms and words from the defining
    phrases with the AND operator
  • Procedure i is applied only if procedure
    i-1 fails to retrieve any examples (150 documents
    per sense)
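
The fallback cascade above can be sketched in Python. This is only an illustration under stated assumptions: the function names, the way AND/OR/NEAR are combined into a query string, and the `search` callback are hypothetical and do not reproduce the syntax of any real search engine or the exact queries of Mihalcea and Moldovan (1999).

```python
from typing import Callable, List, Sequence


def build_queries(monosemous_synonyms: Sequence[str],
                  synonyms: Sequence[str],
                  defining_phrases: Sequence[str]) -> List[str]:
    """Return one query string per procedure, in the order 1) to 4).

    An empty string means the procedure cannot be applied to this sense.
    The concrete query syntax is an assumption made for illustration.
    """
    words = [w for phrase in defining_phrases for w in phrase.split()]
    # 1) monosemous synonyms
    q1 = " OR ".join(f'"{s}"' for s in monosemous_synonyms)
    # 2) defining phrases, searched as exact phrases
    q2 = " OR ".join(f'"{p}"' for p in defining_phrases)
    # 3) synonyms joined with AND, defining-phrase words joined with NEAR
    q3 = ""
    if synonyms and words:
        q3 = f"({' AND '.join(synonyms)}) AND ({' NEAR '.join(words)})"
    # 4) synonyms and defining-phrase words, all joined with AND
    q4 = " AND ".join(list(synonyms) + words)
    return [q1, q2, q3, q4]


def retrieve(queries: Sequence[str],
             search: Callable[[str], List[str]],
             max_docs: int = 150) -> List[str]:
    """Try each procedure in turn; stop at the first one that returns documents."""
    for query in queries:
        if not query:
            continue  # procedure not applicable, fall through to the next one
        docs = search(query)[:max_docs]
        if docs:
            return docs
    return []


# Sense 1 of "interest": no monosemous synonym is assumed here.
queries = build_queries([], ["interest", "involvement"], ["sense of concern"])
```

Passing the search engine as a callback keeps the cascade logic independent of how documents are actually fetched.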

12
Calculating TS weights
  • Tokenization and lemmatization of the document
    collections
  • For each document collection $i$ (representing the
    $i$-th sense of the word), a vector $vf_i$ containing
    the words and their frequencies is extracted:
    $(word_j, freq_{i,j})$
  • For each $vf_i$, the $\chi^2$ statistic is evaluated using
    the vectors belonging to the remaining senses as the
    contrast set, obtaining vectors of pairs
    $(word_j, w_{i,j})$, where $w_{i,j}$ is the $\chi^2$ weight
    of $word_j$ for sense $i$
  • N.B.: $m_{i,j}$ is the frequency of $word_j$
    in the whole corpus times the size (in
    number of tokens) of the corpus for sense $i$,
    normalized by the size of the whole corpus
    (the expected value of $freq_{i,j}$ under the null
    hypothesis)
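
A minimal Python sketch of this weighting step follows. The expected value `m` is computed as described above. The weight formula itself is an assumption: the sketch uses the usual chi-square term (freq - m)^2 / m and keeps only words that occur more often than expected, which may differ from the exact function used on the slide. The function name and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def topic_signature_weights(collections: List[List[str]]) -> List[Dict[str, float]]:
    """collections[i] is the list of lemmatized tokens gathered for sense i."""
    freqs = [Counter(tokens) for tokens in collections]   # the vectors vf_i
    sizes = [len(tokens) for tokens in collections]       # tokens per sense corpus
    total = sum(sizes)                                    # size of the whole corpus
    corpus_freq: Counter = Counter()
    for f in freqs:
        corpus_freq.update(f)                             # frequency in the whole corpus

    signatures = []
    for f, size in zip(freqs, sizes):
        weights = {}
        for word, freq in f.items():
            # Expected frequency of the word in this sense corpus (null hypothesis).
            m = corpus_freq[word] * size / total
            # ASSUMPTION: chi-square term, kept only for over-represented words.
            if freq > m:
                weights[word] = (freq - m) ** 2 / m
        signatures.append(weights)
    return signatures


# Toy data, invented for illustration only.
toy = [["church", "catholic", "catholic", "priest"],
       ["church", "building", "tower", "tower"]]
signatures = topic_signature_weights(toy)
```

In the toy data, "church" occurs exactly as often as expected in both collections, so it gets no weight for either sense.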

13
Example: the lemma church
  • Church1: Christian church, Christianity
  • catholic 59, spirit 43, protestant 42, rector 32
  • Church2

14
Evaluating TS quality
  • It's a difficult task
  • Strategies
  • Manual verification
  • WSD (Agirre et al., 2000)
  • WordNet domains comparison (Gliozzo, 2002)

15
Using TS in a WSD task
  • Topic signatures for WN synsets can be used to
    discriminate between senses of the same word in a
    WSD task
  • Underlying idea: the TS for the senses of a word are
    lists of the terms that best discriminate between
    the collections of texts for each sense
  • WSD algorithm
  • Let $S = l_{-n}, l_{1-n}, \dots, l_0, l_1, \dots, l_{n-1}, l_n$
    be the lemmatized sentence, in which $l_0$ is the
    lemma to disambiguate
  • Let $s_1, s_2, \dots, s_p$ be the WN synsets for $l_0$
  • Let $TS_j$ be the list of words in the topic
    signature for $s_j$
  • For each $j$ from 1 to $p$, evaluate the score $S_j$
  • Select the sense $s_j$ such that $S_j$ is maximized
16
WordNet domains comparison
  • The topic signature for a synset can be used to
    detect the semantic domain of the synset
  • Algorithm
  • Let $\langle (w_1, s_1), \dots, (w_n, s_n) \rangle$ be the topic
    signature for the synset $S$
  • Let $D = \{D_1, \dots, D_d\}$ be a complete set of domains
    (second level of WND)
  • Let $D_{w_k}$ be the normalized set of domains for word $w_k$
  • For $i$ from 1 to $d$, evaluate the score of domain $D_i$
  • Select the domain $D_i$ whose score is maximized
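
A possible reading of this algorithm in Python is sketched below. The domain score is an assumption: each signature word contributes its strength multiplied by the normalized weight of the domain for that word. The signature weights are those of the Church1 example; the word-to-domain table is hypothetical and is not taken from WordNet Domains.

```python
from collections import defaultdict
from typing import Dict, Optional


def detect_domain(signature: Dict[str, float],
                  word_domains: Dict[str, Dict[str, float]]) -> Optional[str]:
    """Return the domain with the highest score for a topic signature.

    signature maps a word to its strength; word_domains maps a word to its
    normalized domain weights (summing to 1 for each word).
    ASSUMPTION: score(domain) = sum over words of strength * domain weight.
    """
    scores: Dict[str, float] = defaultdict(float)
    for word, strength in signature.items():
        for domain, weight in word_domains.get(word, {}).items():
            scores[domain] += strength * weight
    return max(scores, key=scores.get) if scores else None


# Weights from the Church1 example.
church1 = {"catholic": 59, "spirit": 43, "protestant": 42, "rector": 32}
# Hypothetical normalized domain assignments, for illustration only.
word_domains = {
    "catholic": {"religion": 1.0},
    "spirit": {"religion": 0.5, "psychology": 0.5},
    "protestant": {"religion": 1.0},
    "rector": {"religion": 0.5, "university": 0.5},
}
domain = detect_domain(church1, word_domains)
```

Normalizing the domain weights per word prevents words that belong to many domains from dominating the score.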