Title: Domain Driven Disambiguation (DDD)
1. Domain Driven Disambiguation (DDD)
- Alfio Gliozzo, Bernardo Magnini, Carlo Strapparava
- gliozzo, magnini, strappa_at_itc.it
- ITC-irst
- http://tcc.itc.it/research/textec/topics/disambiguation/
2. Outline 1/2: DDD system architecture
- General object-oriented architecture for easily developing a wide variety of WSD systems that implement different Categorization and Feature Extraction (FE) techniques.
- Domain Driven Disambiguator (DDD) as an instance of the general architecture.
3. Outline 2/2: DDD Methodology
- WordNet Domains
- Textual properties of Semantic Domains.
- Text categorization using WN-Domains
- DDD classifier
4. A general WSD System Architecture
[Diagram: modules of the general architecture and the data passed between them]
- Corpus Reader: get-text (text-id)
- Feature Extractor: get-token-features (text, tok-id)
- Classifier: classify-instance (instance), import-model (model), build-model (instances)
- Other diagram labels: Text; Instance (Feat1, Value1) ... (FeatN, ValueN); sense; Sense; Model; Corpus; Supervised Learning; ARFF
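The three modules named in the diagram can be sketched as abstract interfaces. This is only an illustrative sketch, not the authors' implementation: the method names follow the diagram labels, while the type signatures, the sense-labelled format passed to build_model, and the three toy classes (DictCorpusReader, WindowExtractor, OverlapClassifier) are assumptions introduced here.

```python
from abc import ABC, abstractmethod
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Tuple

# An instance is a set of feature/value pairs: (Feat1, Value1) ... (FeatN, ValueN).
Instance = Dict[str, float]


class CorpusReader(ABC):
    @abstractmethod
    def get_text(self, text_id: str) -> List[str]:
        """Return the tokens of the text identified by text_id."""


class FeatureExtractor(ABC):
    @abstractmethod
    def get_token_features(self, text: Sequence[str], tok_id: int) -> Instance:
        """Return the features of the token at position tok_id."""


class Classifier(ABC):
    @abstractmethod
    def build_model(self, instances: Iterable[Tuple[Instance, str]]) -> None:
        """Learn a model from (instance, sense) pairs (assumed input format)."""

    @abstractmethod
    def import_model(self, model: Dict[str, Dict[str, float]]) -> None:
        """Load a previously built model (assumed format: sense -> feature weights)."""

    @abstractmethod
    def classify_instance(self, instance: Instance) -> str:
        """Return the sense chosen for the instance."""


# --- Hypothetical toy implementations, only to show how the modules plug together ---

class DictCorpusReader(CorpusReader):
    def __init__(self, texts: Dict[str, str]) -> None:
        self._texts = texts

    def get_text(self, text_id: str) -> List[str]:
        return self._texts[text_id].lower().split()


class WindowExtractor(FeatureExtractor):
    """Uses the words within `size` positions of the target token as features."""

    def __init__(self, size: int = 3) -> None:
        self.size = size

    def get_token_features(self, text: Sequence[str], tok_id: int) -> Instance:
        tokens = list(text)
        left = tokens[max(0, tok_id - self.size):tok_id]
        right = tokens[tok_id + 1:tok_id + 1 + self.size]
        return {f"ctx={w}": 1.0 for w in left + right}


class OverlapClassifier(Classifier):
    """Picks the sense whose accumulated training features overlap most with the instance."""

    def __init__(self) -> None:
        self._model: Dict[str, Counter] = {}

    def build_model(self, instances: Iterable[Tuple[Instance, str]]) -> None:
        model: Dict[str, Counter] = {}
        for instance, sense in instances:
            model.setdefault(sense, Counter()).update(instance)
        self._model = model

    def import_model(self, model: Dict[str, Dict[str, float]]) -> None:
        self._model = {sense: Counter(feats) for sense, feats in model.items()}

    def classify_instance(self, instance: Instance) -> str:
        def score(sense: str) -> float:
            return sum(self._model[sense][f] * v for f, v in instance.items())
        return max(self._model, key=score)


def disambiguate(reader: CorpusReader, extractor: FeatureExtractor,
                 classifier: Classifier, text_id: str, tok_id: int) -> str:
    """Run the pipeline: read the text, extract features for one token, classify."""
    text = reader.get_text(text_id)
    return classifier.classify_instance(extractor.get_token_features(text, tok_id))
```

Because disambiguate depends only on the abstract interfaces, a different feature extractor or classifier can be swapped in without touching the rest of the pipeline, which is the flexibility the general architecture is meant to provide.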
5. DDD system
- A specific implementation of the general WSD system architecture.
- The feature extractor and classifier modules are specialized.
[Diagram label: WN-D]
6. Domain Driven Disambiguation
- Underlying assumption
- The polysemy of terms tends to disappear in domain-specific corpora.
- Knowing the relevant semantic domain(s) of a text in advance makes Word Sense Disambiguation easier.
- Semantic domains play an important role in the disambiguation process.
7. WordNet Domains 1/3 (Magnini and Cavaglià, 2000)
- Developed at ITC-irst (Magnini and Cavaglià, 2000)
- All synsets in WordNet have been annotated with a Domain Label (e.g. SPORT, ART, ECONOMY, ...)
- The FACTOTUM label is used for generic synsets (not belonging to any domain).
8. WordNet Domains 2/3: Polysemy Reduction
9. WordNet Domains 3/3: Semantic Domains Organization
- 163 Domain labels collected from dictionaries
- Four-level hierarchy (based on the Dewey Decimal Classification)
- Mapping between DDC and WND (completeness)
- 41 basic domains used for the experiments (2nd level)
10. Domains and Texts (Magnini et al. 2002)
- Many words in a document belong to the domain of the text (statistics on Semcor).
- One Domain per Discourse is stronger than One Sense per Discourse (10 vs. 31 exceptions in Semcor)
11. TC using WND 1/2 (Magnini et al., 2001)
- Domain frequency is evaluated for each domain (excluding FACTOTUM) by counting its occurrences inside the text.
- Domain relevance is the normalized domain frequency value (in the range [0, 1]) computed on windows of fixed size (20-100 tokens).
- Text vectors collect the relevance values for each domain.
- Example: ((sport . 0.93) (economy . 0.70) (art . 0))
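The computation of a text vector can be sketched as follows. This is a minimal sketch under stated assumptions: TOY_LEXICON is a made-up word-to-domain mapping (not real WordNet Domains data), the window is centered on a target token, and the normalization divides the domain frequency by the window length. The slide only says that the relevance is normalized into [0, 1], so the actual normalization used by the authors may differ.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set

FACTOTUM = "factotum"

# Hypothetical toy lexicon (word -> domain labels of its senses);
# the entries are illustrative, not real WordNet Domains annotations.
TOY_LEXICON: Dict[str, Set[str]] = {
    "striker": {"sport"},
    "scored": {"sport"},
    "goal": {"sport", FACTOTUM},
    "bank": {"economy", "geography"},
    "the": {FACTOTUM},
}


def domain_frequency(tokens: Sequence[str], lexicon: Dict[str, Set[str]]) -> Counter:
    """Count, for each domain, the tokens that carry that label (FACTOTUM excluded)."""
    freq: Counter = Counter()
    for token in tokens:
        for domain in lexicon.get(token, ()):
            if domain != FACTOTUM:
                freq[domain] += 1
    return freq


def text_vector(tokens: Sequence[str], lexicon: Dict[str, Set[str]],
                domains: Sequence[str]) -> Dict[str, float]:
    """Domain relevance per domain: frequency / window length (assumed normalization)."""
    if not tokens:
        return {d: 0.0 for d in domains}
    freq = domain_frequency(tokens, lexicon)
    # Each token counts at most once per domain, so every value stays in [0, 1].
    return {d: freq[d] / len(tokens) for d in domains}


def window_around(tokens: Sequence[str], tok_id: int, size: int) -> List[str]:
    """Return up to `size` tokens starting size // 2 positions before tok_id."""
    start = max(0, tok_id - size // 2)
    return list(tokens[start:start + size])
```

A text vector for the context of a target word is then text_vector(window_around(tokens, tok_id, size), lexicon, domains), with size chosen in the 20-100 token range mentioned above.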
12. Text Categorization using WND 2/2: example
From the plush Connolly hide leather sofa and
chairs in the living room to the Bang and Olufsen
stereo, and remote control television complete
with video, you're surrounded by the HIGHEST
QUALITY. The inlaid chequerboard top of the
coffee table houses all kind of games, including
backgammon, chess and Scrabble. You'll also find
a selection of books, from Queen Victoria's
Highland journals, to the very latest bestselling
thriller. The dinner table and chairs are
elegant yet comfortable, and you can be assured
of the finest tableware and crystal for meals at
home.
16. Sense Vector
- Represents the domain(s) of the typical contexts in which the sense occurs
- DV for bank1 is ((economy . 1.75) (sport . 0.2) (law . 0) ...)
- DV for a generic sense (FACTOTUM) is uniform
- Can be evaluated from
- Sense Tagged Corpora, summing the Text Vectors of the contexts of the examples (supervised)
- WordNet-Domains (unsupervised)
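Both ways of obtaining a sense vector can be sketched as below. The supervised variant sums the text vectors of the example contexts, as stated above. The unsupervised variant is an assumption: the slides do not say how the WordNet Domains labels are turned into weights, so this sketch simply gives weight 1.0 to each labelled domain and a uniform vector to FACTOTUM senses. Function names and the example context vectors are hypothetical.

```python
from typing import Dict, Iterable, Sequence

FACTOTUM = "factotum"


def sense_vector_from_examples(context_vectors: Iterable[Dict[str, float]],
                               domains: Sequence[str]) -> Dict[str, float]:
    """Supervised: sum the text vectors of the contexts in which the sense was tagged."""
    total = {d: 0.0 for d in domains}
    for vector in context_vectors:
        for d in domains:
            total[d] += vector.get(d, 0.0)
    return total


def sense_vector_from_labels(labels: Iterable[str],
                             domains: Sequence[str]) -> Dict[str, float]:
    """Unsupervised: derive the vector from the domain labels of the synset.

    Assumed weighting: 1.0 for each labelled domain, uniform for FACTOTUM senses.
    """
    specific = {label for label in labels if label != FACTOTUM}
    if not specific:
        return {d: 1.0 / len(domains) for d in domains}
    return {d: (1.0 if d in specific else 0.0) for d in domains}


# Made-up context vectors for two tagged occurrences of one sense.
DOMAINS = ["economy", "sport", "law"]
examples = [{"economy": 1.0, "sport": 0.2}, {"economy": 0.75}]
bank1_vector = sense_vector_from_examples(examples, DOMAINS)
```

The example context vectors are chosen only so that the summed vector has the same shape as the bank1 vector on the slide; they are not data from the authors' experiments.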
17. DDD Classifier
- Two steps
- Context Categorization (text vector)
- Similarity between synset vectors and context vectors
- Example
- Bank1: depository financial institution ...
- Bank2: sloping land
- TEXT: He cashed a check at the bank
- Dot Product (values in the slide's order): 1.731878, 0.06185
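The second step, comparing each sense vector with the text vector by dot product and choosing the best-scoring sense, can be sketched as follows. The vectors below are made up for illustration and do not reproduce the scores shown on the slide; the function names are hypothetical.

```python
from typing import Dict

Vector = Dict[str, float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two sparse domain vectors (missing domains count as 0)."""
    return sum(weight * v.get(domain, 0.0) for domain, weight in u.items())


def classify(text_vector: Vector, sense_vectors: Dict[str, Vector]) -> str:
    """Return the sense whose vector is most similar to the text vector."""
    return max(sense_vectors, key=lambda sense: dot(sense_vectors[sense], text_vector))


# Illustrative vectors only: the context is mostly about ECONOMY.
context = {"economy": 0.9, "geography": 0.05}
senses = {
    "bank1": {"economy": 1.75, "sport": 0.2},  # depository financial institution
    "bank2": {"geography": 1.0},               # sloping land
}
best_sense = classify(context, senses)
```

With these toy vectors the economy-related sense wins because the context vector puts most of its weight on ECONOMY, which mirrors the "He cashed a check at the bank" example.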
18. IRST results at SENSEVAL-2
19. Domains and Texts (Magnini et al. 2002)
- Many words in a document belong to the domain of the text (statistics on Semcor).
20. Conclusions and Future Work
- The DDD architecture is an experimental framework for easily designing and developing several WSD techniques.
- At the moment the Domain Oriented Methodology has been (partially) explored and implemented.
- We plan to improve the system using machine learning techniques (e.g. Support Vector Machines, AdaBoost, ...)
- We plan to integrate the domain approach with other WSD techniques in order to also take syntactic information into account.
21. Other directions
- Acquisition of domain information from text-categorized corpora (Magnini et al., 2002a)
- Investigation of the connections among polysemy, synonymy and domains
- Using domain information to improve and develop ML and memory-based TC and IR techniques
- WDD
22. References 1/2
- (Magnini and Cavaglià, 2000) Bernardo Magnini and Gabriela Cavaglià. Integrating Subject Field Codes into WordNet. In Gavrilidou M., Crayannis G., Markantonatu S., Piperidis S. and Stainhaouer G. (Eds.) Proceedings of LREC-2000, Second International Conference on Language Resources and Evaluation, Athens, Greece, 31 May - 2 June 2000, pp. 1413-1418.
- (Magnini and Strapparava, 2000) Bernardo Magnini and Carlo Strapparava. Experiments in Word Domain Disambiguation for Parallel Texts. Proceedings of the ACL workshop on Word Senses and Multilinguality, pp. 27-33, October 7, 2000, Hong Kong.
- (Magnini and Strapparava, 2001) Bernardo Magnini and Carlo Strapparava. Using WordNet to Improve User Modelling in a Web Document Recommender System. Proceedings of the NAACL Workshop "WordNet and Other Lexical Resources: Applications, Extensions and Customizations", Pittsburgh, June 3-4, 2001, pp. 132-137. Also ITC-irst Technical Report Ref. No. 0106-02.
23. References 2/2
- (Magnini et al., 2001) Bernardo Magnini, Carlo Strapparava, Giovanni Pezzulo and Alfio Gliozzo. Using Domain Information for Word Sense Disambiguation. Proceedings of Senseval-2, Second International Workshop on Evaluating Word Sense Disambiguation Systems, pp. 111-114, 5-6 July 2001, Toulouse, France.
- (Magnini et al., 2002a) Bernardo Magnini, Carlo Strapparava, Giovanni Pezzulo and Alfio Gliozzo. Comparing Ontology-Based and Corpus-Based Domain Annotations in WordNet. In Proceedings of the First Global WordNet Conference, Mysore (India), 2002.
- (Magnini et al., 2002b) Bernardo Magnini, Carlo Strapparava, Giovanni Pezzulo and Alfio Gliozzo. The Role of Domain Information in Word Sense Disambiguation. To appear in the Special Issue of the Journal of Natural Language Engineering on "Evaluating Word Sense Disambiguation Systems".
- (Gliozzo, 2001) Alfio Gliozzo. Ruolo dei campi semantici nella struttura di un lessico computazionale: utilizzo per la disambiguazione automatica di senso. Thesis, University of Bologna, 2001.