Title: Website Term Browser Un sistema interactivo y multiling
1Website Term Browser: Un sistema interactivo y multilingüe de búsqueda textual basado en técnicas lingüísticas
(Website Term Browser: an interactive, multilingual text search system based on linguistic techniques)
Departamento de Lenguajes y Sistemas Informáticos, UNIVERSIDAD NACIONAL DE EDUCACIÓN A DISTANCIA
Doctoral thesis
- Anselmo Peñas Padilla
- Advisors
- Julio Gonzalo Arroyo
- María Felisa Verdejo Maíllo
2Structure
- I. Problem definition and goals
- II. Experiments in Lexical Ambiguity and Indexing
- III. Website Term Browser
- IV. Evaluation framework
3Classic Information Retrieval
I. Problem definition and goals
- Retrieve documents relevant to users' information needs
- Presupposes
- Static information needs
- Value is found in the retrieved set of documents (not in the searching process)
- Ignores
- The task (purpose) that gives rise to the information need
- Changes in the information needs
- Interactivity
- Imprecise information needs
- Users develop strategies without system aid
- Help users to express and refine their information needs
Information need
4Language barriers
I. Problem definition and goals
- Problems in query formulation
- Users don't know the appropriate domain terminology
- Users can't express their information need in a foreign language
- Translinguality
- Natural language characteristics
- Lexical ambiguity
- Terminology variation
- Help users to overcome language barriers
5General approaches
I. Problem definition and goals
Information Retrieval
Natural Language Processing
6Natural Language Processing
I. Problem definition and goals
- Help users to express and refine their information needs?
- Open field in IR
- Help users to overcome language barriers?
- Phrase extraction and normalization
- Explicit disambiguation (POS, WSD)
- Bad strategies or too much error in automatic processing?
- Conceptual indexing
7Goals
I. Problem definition and goals
- Study the role of automatic linguistic techniques within the classic IR model
- Phrase indexing, POS tagging, WSD
- Semantic distinction of phrases
- Viability of conceptual indexing
- Section II: Experiments in Lexical Ambiguity and Indexing
8Goals
I. Problem definition and goals
- Develop a model
- to help users to express and refine their information needs
- to help users to overcome language barriers
- Bringing the collection terminology to users
- Morpho-syntactic, semantic and translingual variations
- Without the need for thesaurus construction
- Establish an appropriate evaluation framework
- Sections III and IV: Website Term Browser
9Proposed approach
Information Retrieval
Natural Language Processing
10Structure
- I. Problem definition and goals
- II. Experiments in Lexical Ambiguity and Indexing
- III. Website Term Browser
- IV. Evaluation framework
11Contents
II. Experiments in Lexical Ambiguity and Indexing
- Morpho-syntactic ambiguity in IR
- Phrase indexing
- Semantic distinction of lexical compounds in IR
- Conceptual indexing
- ITEM Search Engine
- Conclusions
IR-SEMCOR, a hand-annotated test collection
12Morpho-syntactic ambiguity in IR
II. Experiments in Lexical Ambiguity and Indexing
Texts:
...particle crosses the wall...
...Canadian Red Cross...
...boat to cross Mississippi river...
14Phrase indexing
II. Experiments in Lexical Ambiguity and Indexing
Texts:
...a guide for the fisher who...
...information on cat care...
...arboreal carnivorous called fisher cat...
16Semantic distinction of compounds
II. Experiments in Lexical Ambiguity and Indexing
Types of lexical compounds
- Automatic classification through WordNet
- Endocentric: one component is a hyperonym
- Appositional: all components are hyperonyms
- Exocentric: no component is a hyperonym
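The three compound types can be illustrated with a minimal sketch. It assumes a toy hypernym table in place of WordNet; the table entries, the example compounds and the function name `classify_compound` are hypothetical and only serve the illustration. A partial match (some but not all components are hyperonyms) is treated as endocentric, which is an assumption for compounds longer than two words.

```python
# Toy stand-in for WordNet: compound -> set of its hyperonyms (hypothetical data).
TOY_HYPERNYMS = {
    "apple tree": {"tree", "plant"},
    "player coach": {"player", "coach", "person"},
    "fisher cat": {"mustelid", "carnivore"},
}


def classify_compound(compound, hypernyms=TOY_HYPERNYMS):
    """Classify a compound by how many of its components are its hyperonyms."""
    components = compound.split()
    ancestors = hypernyms.get(compound, set())
    matches = sum(1 for component in components if component in ancestors)
    if matches == 0:
        return "exocentric"      # no component is a hyperonym
    if matches == len(components):
        return "appositional"    # every component is a hyperonym
    return "endocentric"         # some, but not all, components are hyperonyms


for term in TOY_HYPERNYMS:
    print(term, "->", classify_compound(term))
```

With a real lexical database, the table lookup would be replaced by a walk up the hypernym chain of each sense of the compound.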
18Conceptual Indexing
II. Experiments in Lexical Ambiguity and Indexing
Texts:
...spring... ...muelle...
...spring... ...fountain... ...fuente...
...spring... ...springtime... ...primavera...
Conceptual index: n03114639, n05727069, n09151839
Query: spring
- This model can improve text retrieval (Gonzalo 1998; Gonzalo 1999)
- Depending on the WSD error rate
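A minimal sketch of synset-based indexing for the "spring" example follows. The assignment of each synset identifier to a particular sense, the toy disambiguation rule (keep the senses shared with the surrounding words, otherwise keep all senses) and the function names are assumptions made for illustration; they are not taken from the thesis.

```python
from collections import defaultdict

# Word -> candidate synsets. Which identifier belongs to which sense is ASSUMED here.
SYNSETS = {
    "spring": {"n03114639", "n05727069", "n09151839"},
    "muelle": {"n03114639"},
    "fountain": {"n05727069"},
    "fuente": {"n05727069"},
    "springtime": {"n09151839"},
    "primavera": {"n09151839"},
}


def toy_wsd(word, context):
    """Keep the senses shared with other context words; fall back to all senses."""
    senses = SYNSETS.get(word, set())
    context_senses = set()
    for other in context:
        if other != word:
            context_senses |= SYNSETS.get(other, set())
    return (senses & context_senses) or senses


def build_index(docs):
    """Map each synset to the documents in which it was selected."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            for synset in toy_wsd(word, words):
                index[synset].add(doc_id)
    return index


def search(query_words, index):
    """Return documents sharing at least one synset with the query."""
    hits = set()
    for word in query_words:
        for synset in toy_wsd(word, query_words):
            hits |= index.get(synset, set())
    return hits


docs = {
    "d1": ["spring", "muelle"],
    "d2": ["spring", "fountain", "fuente"],
    "d3": ["spring", "springtime", "primavera"],
}
index = build_index(docs)
print(sorted(search(["spring"], index)))     # ambiguous query: all senses are kept
print(sorted(search(["primavera"], index)))  # Spanish query reaches an English text
```

Because documents and queries meet at the synset level, the same index serves monolingual and cross-language queries; the quality of the result depends entirely on the disambiguation step.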
19Synset indexing with no errors in WSD
20Conceptual Indexing
II. Experiments in Lexical Ambiguity and Indexing
- Although explicit disambiguation strategies applied to indexing
- POS tagging
- Phrase indexing
- Word Sense Disambiguation
- don't produce a significant improvement in IR
- Conceptual indexing based on synsets
- Needs automatic WSD accuracy near the state of the art (60%)
- Permits Cross-Language Information Retrieval
- Qualitative evaluation justifies the development of a prototype
21Textual representation: the query is translated into the target language
Conceptual representation: query and documents are compared at a conceptual level
Selection of newspaper determines the target language
Selection of query language
Selection of WSD strategy
Retrieved documents
22ITEM Search Engine
II. Experiments in Lexical Ambiguity and Indexing
- Conceptual indexing seems attractive but there are some unsolved challenges
- Low accuracy in Word Sense Disambiguation due to
- Unrestricted domains in EWN
- Fine-grained sense distinctions
- Indexing units ≠ translation units
- Loss of information in word-by-word disambiguation
- High cost, low benefit
- Users perceive a slower and less transparent system
23Conclusions
II. Experiments in Lexical Ambiguity and Indexing
- Don't subordinate NLP to the classic IR model
- Even an improvement of 10% wouldn't change users' perception
- Think of users
- Find new paradigms in Information Access
- At a higher level, closer to users
- Consider users' tasks
- Consider users' interaction
- New places for NLP techniques in IR
- Interaction over partial NLP processing
- A proposal: Terminology Retrieval and Term Browsing
24Structure
- I. Problem definition and goals
- II. Experiments in Lexical Ambiguity and Indexing
- III. Website Term Browser
- IV. Evaluation framework
25Contents
III. Website Term Browser
- Terminology Retrieval
- Term extraction
- Indexing
- Retrieval model
- Query expansion and translation
- Website Term Browser interface
26Terminology Retrieval
III. Transition to an interactive model
- Term Browsing
- Navigate through relevant terminology
- Access information from retrieved terms
- Terminology Retrieval
- Retrieve relevant terms related to the query
- Phrase extraction
- Phrase indexing
- Phrase retrieval
- Recall is more important than precision in term extraction
- Relaxing linguistic processing is possible
- Premise: don't lose phrases
27Term extraction
III. Transition to an interactive model
- Syntactic pattern (Spanish, English, French, Italian, Catalan)
- phr_content phr_closed phr_content phr_content
- phr_content: noun, adjective, number, infinitive, participle
- phr_closed: article, preposition, conjunction
- Needs POS tagging
- High computational cost
- Tagging oriented to phrase detection
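The following sketch shows one way a shallow, recall-oriented pattern over the two tag classes could be applied. It assumes the reading that a candidate phrase starts and ends with a content word and may contain closed-class words in between; that reading, the three-way tag set ("content", "closed", "other") and the function name `extract_phrases` are assumptions, not the exact pattern used in WTB.

```python
def extract_phrases(tagged_tokens):
    """Return maximal multi-word candidates that start and end with a content word.

    tagged_tokens: iterable of (word, tag) pairs, tag in {"content", "closed", "other"}.
    """
    phrases, current, last_content = [], [], 0
    # A sentinel token with tag "other" flushes the last candidate.
    for word, tag in list(tagged_tokens) + [("", "other")]:
        if tag == "content":
            current.append(word)
            last_content = len(current)      # candidate may end here
        elif tag == "closed" and current:
            current.append(word)             # allowed only inside a candidate
        else:
            candidate = current[:last_content]  # drop trailing closed-class words
            if len(candidate) > 1:              # single words are not phrases
                phrases.append(" ".join(candidate))
            current, last_content = [], 0
    return phrases


tagged = [
    ("the", "closed"), ("degradation", "content"), ("of", "closed"),
    ("the", "closed"), ("ozone", "content"), ("layer", "content"),
    ("is", "other"), ("serious", "content"),
]
print(extract_phrases(tagged))
```

Only two coarse tag classes are needed, which is why tagging can be oriented to phrase detection rather than to full part-of-speech accuracy.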
28Indexing
III. Transition to an interactive model
- Steps
- Text pre-processing and listing of words
- Word tagging (oriented to phrase detection)
- Phrase detection and lemmatization of components
- Document indexing and statistics (document frequency)
- Phrase selection (subsumption and lexicalization degree)
- Phrase indexing
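Part of this pipeline can be sketched as follows: lemmatizing phrase components, recording document frequency, and indexing phrases by their component lemmas. The toy lemma table, the input phrases and the function names are hypothetical, and the phrase selection step (subsumption, lexicalization degree) is left out because its criteria are not given here.

```python
from collections import defaultdict

# Toy lemmatizer table (hypothetical); a real system would use a morphological analyzer.
TOY_LEMMAS = {"treaties": "treaty", "tests": "test"}


def lemmatize(word):
    word = word.lower()
    return TOY_LEMMAS.get(word, word)


def index_phrases(doc_phrases):
    """doc_phrases: document id -> list of phrases detected in that document."""
    phrase_docs = defaultdict(set)    # normalized phrase -> documents containing it
    lemma_phrases = defaultdict(set)  # component lemma -> phrases containing it
    for doc_id, phrases in doc_phrases.items():
        for phrase in phrases:
            lemmas = [lemmatize(word) for word in phrase.split()]
            key = " ".join(lemmas)
            phrase_docs[key].add(doc_id)
            for lemma in lemmas:
                lemma_phrases[lemma].add(key)
    doc_freq = {phrase: len(ids) for phrase, ids in phrase_docs.items()}
    return phrase_docs, lemma_phrases, doc_freq


phrase_docs, lemma_phrases, doc_freq = index_phrases({
    "d1": ["nuclear tests", "test ban treaties"],
    "d2": ["Nuclear test"],
})
print(doc_freq)
print(sorted(lemma_phrases["test"]))
```

Indexing phrases under each component lemma is what later allows a single query word to retrieve every phrase that contains it or one of its variants.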
29Retrieval model
III. Transition to an interactive model
query
30Query expansion and translation
III. Transition to an interactive model
For each query word: Expansion (Spanish) and Translation (English)
Tratados
- Expansion: acuerdo, capitulación, concertación, convenio, cuidar, pacto, manejar, procesar
- Translation: accord, discourse, handle, manage, pact, process, treat, treatise, treaty
Prohibición
- Expansion: embargo, entredicho, interdicción, interdicto, proscripción
- Translation: ban, interdiction, prohibition, proscription
Pruebas
- Expansion: cata, catadura, degustación, ensayo, escandallo, experimento, gustación, muestreo, tanteo
- Translation: demonstrate, establish, exhibit, experiment, experimentation, fall, fitting, indicate, point, present, proof, prove, run, sample, sampling, shew, show, taste, test, trial, try
de
Nucleares
- Expansion: nuclear
- Translation: nuclear
de
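The word-by-word expansion and translation can be sketched with plain lookup tables. The entries below are copied from this slide (the table for "pruebas" is a subset); the function names, the treatment of "de" as a skipped closed-class word and the coverage score are assumptions for illustration, not the retrieval model of WTB.

```python
# Variants taken from the slide; "pruebas" is abbreviated to a subset.
EXPANSION = {
    "prohibición": ["embargo", "entredicho", "interdicción", "interdicto", "proscripción"],
    "pruebas": ["ensayo", "experimento"],
    "nucleares": ["nuclear"],
}
TRANSLATION = {
    "prohibición": ["ban", "interdiction", "prohibition", "proscription"],
    "pruebas": ["test", "trial", "proof"],
    "nucleares": ["nuclear"],
}
CLOSED_CLASS = {"de"}  # assumed: closed-class words are neither expanded nor translated


def expand_query(words):
    """Attach same-language and cross-language variants to each content word."""
    expanded = {}
    for word in words:
        key = word.lower()
        if key in CLOSED_CLASS:
            continue
        expanded[key] = {
            "expansion": EXPANSION.get(key, []),
            "translation": TRANSLATION.get(key, []),
        }
    return expanded


def coverage(phrase, expanded):
    """Count how many query words have at least one variant in the phrase."""
    tokens = set(phrase.lower().split())
    return sum(
        1 for variants in expanded.values()
        if tokens & set(variants["expansion"] + variants["translation"])
    )


expanded = expand_query("Prohibición de Pruebas Nucleares".split())
print(coverage("nuclear test ban", expanded))
print(coverage("ensayo nuclear", expanded))
```

Scoring candidate phrases by how many query words they cover is one simple way to rank retrieved terms; phrases in any language compete in the same ranking.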
31Query in Spanish
Hierarchy of terms
Ranking of documents
English
Spanish
Catalan
33Structure
- I. Problem definition and goals
- II. Experiments in Lexical Ambiguity and Indexing
- III. Website Term Browser
- IV. Evaluation framework
34Evaluation of Terminology Retrieval
IV. Evaluation framework
- Compare
- Terminology Retrieval
- Hand-crafted Multilingual Thesaurus
36Evaluation of Terminology Retrieval
IV. Evaluation framework
- Recall of mono-lexical terms (lemmas)
- Monolingual: 85% - 95%
- Translingual: 55% - 65%
- Recall of poly-lexical terms (phrases)
- Monolingual: 40% - 65%
- Translingual: 10% - 45%
- Loss of recall due to
- Phrase extraction (mainly POS tagging): 3% - 17%
- Phrase indexing (mainly lemmatization): 2% - 34%
- Phrase selection: 12% - 37%
- Lack of connections between different languages in EWN
- Lack of adjective hierarchies in EWN
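Recall against a reference thesaurus can be computed as below, separately for mono-lexical and poly-lexical terms. This is a minimal sketch: the term sets are hypothetical examples, not data from the evaluation, and splitting by whitespace to tell lemmas from phrases is an assumption.

```python
def recall(retrieved, reference):
    """Fraction of reference terms that appear among the retrieved terms."""
    reference = set(reference)
    if not reference:
        return 0.0
    return len(reference & set(retrieved)) / len(reference)


def recall_by_type(retrieved, reference):
    """Report recall separately for single-word and multi-word reference terms."""
    mono = {term for term in reference if len(term.split()) == 1}
    poly = set(reference) - mono
    return {
        "mono-lexical": recall(retrieved, mono),
        "poly-lexical": recall(retrieved, poly),
    }


# Hypothetical thesaurus terms related to a query, and hypothetical system output.
reference = {"ozone", "pollution", "ozone layer", "greenhouse effect"}
retrieved = {"ozone", "pollution", "ozone layer", "ozone hole"}
print(recall_by_type(retrieved, reference))
```

Running the same measurement after each processing stage (extraction, indexing, selection) is one way to attribute the loss of recall to individual stages.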
37Usefulness of Term Browsing
IV. Evaluation framework
- Previous experiences in interactivity evaluation (TREC) need
- Precise queries
- Laboratory conditions
- Controlled users
- There aren't differences between systems
- Identifying better approaches is not possible
- A new framework is proposed here
- Real work environment
- Record users' interactions
- Compare the use of
- Term area provided by WTB
- Document ranking provided by Google
38QUERY
RECONSULT WITH TERM
EXPLORE TERM
EXPLORE DOCUMENT
39Usefulness of Term Browsing
IV. Evaluation framework
- 2318 sessions with interaction
- An average of 5.16 actions per session
- EXPLORE_TERM is used in 65%
LOG FILE
539 2001/03/14 12:10:33 QUERY UNED 193.146.241.164 ozone hole
2001/03/14 12:11:20 EXPLORE_TERM 539684 degradación de la capa de ozono
2001/03/14 12:11:29 EXPLORE_DOC http://www.uned.es/doctorado/0108.htm
...
EXPLORE_TERM RECONSULT EXPLORE_DOC ...
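Session statistics of this kind can be derived from an interaction log with a short script. The sketch below assumes a simplified record layout, one action per line in the form "session date time ACTION argument"; that layout, the second session in the sample and the function names are hypothetical, and whether QUERY counts as an action in the reported averages is not stated here.

```python
from collections import Counter, defaultdict

# Simplified sample log. Session 540 is invented for illustration.
SAMPLE_LOG = """\
539 2001/03/14 12:10:33 QUERY ozone hole
539 2001/03/14 12:11:20 EXPLORE_TERM degradación de la capa de ozono
539 2001/03/14 12:11:29 EXPLORE_DOC http://www.uned.es/doctorado/0108.htm
540 2001/03/14 12:15:02 QUERY doctorado
540 2001/03/14 12:15:40 EXPLORE_DOC http://www.uned.es/doctorado/0108.htm
"""


def parse_sessions(log_text):
    """Group the sequence of actions by session identifier."""
    sessions = defaultdict(list)
    for line in log_text.splitlines():
        parts = line.split(maxsplit=4)
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        session_id, _date, _time, action = parts[:4]
        sessions[session_id].append(action)
    return sessions


def summarize(sessions):
    """Average actions per session, share of sessions using EXPLORE_TERM,
    and distribution of the first action after the initial QUERY."""
    if not sessions:
        return {"actions_per_session": 0.0, "explore_term_share": 0.0,
                "first_after_query": Counter()}
    total_actions = sum(len(actions) for actions in sessions.values())
    with_term = sum(1 for actions in sessions.values() if "EXPLORE_TERM" in actions)
    first_after_query = Counter(
        actions[1] for actions in sessions.values()
        if len(actions) > 1 and actions[0] == "QUERY"
    )
    return {
        "actions_per_session": total_actions / len(sessions),
        "explore_term_share": with_term / len(sessions),
        "first_after_query": first_after_query,
    }


print(summarize(parse_sessions(SAMPLE_LOG)))
```

The same grouping by session also supports counting the last action before a session ends, by inspecting the tail of each action list instead of its head.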
40Usefulness of Term Browsing
IV. Evaluation framework
                                  All queries   1-word queries   >1-word queries
First action after QUERY
  EXPLORE_DOC                     42%           47%              39%
  EXPLORE_TERM                    51%           45%              55%
  RECONSULT                       7%            8%               6%
Last action before finishing the session with EXPLORE_DOC
  QUERY                           50%           57%              46%
  EXPLORE_TERM                    44%           38%              47%
  RECONSULT                       6%            5%               7%
41Structure
- I. Problem definition and goals
- II. Experiments in Lexical Ambiguity and Indexing
- III. Website Term Browser
- IV. Evaluation framework
42Conclusions
- Lexical ambiguity has been studied using IR-Semcor
- Evaluation free of automatic processing errors
- Explicit disambiguation at indexing doesn't seem to improve retrieval (POS, WSD, semantic distinction of lexical compounds)
- Conceptual indexing based on EuroWordNet synsets needs to solve some challenges
- Think of users to find new places for NLP
43Conclusions
- A search model based on extraction, retrieval and browsing of terminology has been developed
- User oriented
- Interaction over terminological information
- Intermediate way between free searching and thesaurus-guided searching
- Without the need for thesaurus construction
- Bringing the collection terminology to users
- Morpho-syntactic and semantic variations
- Translinguality
44Conclusions
- An evaluation framework for Terminology Retrieval and Term Browsing has been established
- Points the way to improve Terminology Retrieval
- Users appreciate Term Browsing
- WTB phrasal information can substantially complement the document ranking provided by search engines