Title: XRCE at CLEF 07 Domain-specific Track
1. XRCE at CLEF 07 Domain-specific Track
- Stephane Clinchant and Jean-Michel Renders
- (presented by Gabriela Csurka)
- Xerox Research Centre Europe
- France
2. Outline
- Introduction
- Mono-lingual Domain-Specific Information Retrieval
  - Query Language Model Refinement with PRF
  - Lexical Entailment
  - Results
- Cross-lingual Domain-Specific Information Retrieval
  - Machine translation (Matrax)
  - Dictionary Adaptation
  - Results
- Conclusion
3. CLEF domain-specific track
- Domain-Specific Information Retrieval: leveraging the structure of the data in the collections (i.e. controlled vocabularies and other metadata) to improve search.
- Multilingual database in the social science domain
  - German Social Science Information Centres databases
  - Social Science Research Projects databases
- Tasks
  - Mono-lingual retrieval: queries and documents are in the same language
  - Cross-lingual retrieval: queries in one language are used with a collection in a different language.
4. Information retrieval
- Simple keyword matching is not enough to retrieve the best documents for a query.
[Diagram: an information need is expressed as a query, which is matched against the documents d1, d2, ..., dn]
Drawing courtesy of C. Manning and P. Raghavan's lectures at Stanford University.
5. IR with Language Modeling
- Treat each document as a multinomial distribution model θ_d
[Diagram: the query expressing the information need is viewed as being generated from the document models of d1, d2, ..., dn]
- The documents d in the corpus can then be ranked either by the probability that θ_d generates the query, or by computing a cross-entropy similarity between θ_d and θ_q.
Drawing courtesy of C. Manning and P. Raghavan's lectures at Stanford University.
6. How we estimate θ_d
- A simple language model can be obtained by Maximum Likelihood from the frequency of words in the document: P_ML(w|θ_d) = tf(w, d) / |d|
- The probabilities are smoothed with the corpus language model; we used Jelinek-Mercer interpolation (illustrated in the sketch below): P(w|θ_d) = (1-λ) P_ML(w|θ_d) + λ P(w|C)
- The role of smoothing the language model is twofold:
  - it makes the LM more accurate (a query word can be absent from a document)
  - it has an IDF effect (it renormalizes the frequency of a word with respect to its occurrence in the corpus C)
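To make the scoring concrete, here is a minimal sketch (not the authors' code) of query-likelihood ranking with Jelinek-Mercer smoothing; the tokenization, the toy corpus and the value of the interpolation weight lam are illustrative assumptions.

```python
import math
from collections import Counter

def jm_score(query_terms, doc_terms, corpus_counts, corpus_len, lam=0.5):
    """Log query likelihood under a Jelinek-Mercer smoothed document model:
    P(w|theta_d) = (1 - lam) * tf(w, d) / |d| + lam * P(w|C)."""
    tf = Counter(doc_terms)
    dlen = len(doc_terms)
    score = 0.0
    for w in query_terms:
        p_ml = tf[w] / dlen if dlen else 0.0           # maximum-likelihood document model
        p_c = corpus_counts.get(w, 0) / corpus_len     # corpus (background) model
        p = (1 - lam) * p_ml + lam * p_c
        if p > 0:                                      # words unseen everywhere contribute nothing
            score += math.log(p)
    return score

# Toy usage: rank two documents for a query (all data illustrative).
docs = {"d1": "social science research projects".split(),
        "d2": "language models for information retrieval".split()}
corpus_counts = Counter(w for terms in docs.values() for w in terms)
corpus_len = sum(corpus_counts.values())
query = "social science research".split()
print(sorted(docs, key=lambda d: jm_score(query, docs[d], corpus_counts, corpus_len),
             reverse=True))
```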
7. Query LM Refinement with PRF
- Aim: adapt (refine) the LM of a particular query
- How: detect implicit relevant concepts present in the retrieved documents using pseudo-relevance feedback (PRF)
[Diagram: the query model θ_q retrieves the top-N texts ranked by query similarity; a feedback model θ_F is estimated from them; the documents are then re-ranked with the refined query model λ·θ_q + (1-λ)·θ_F to produce the final rank]
8. How to estimate θ_F
- Let F(q) = {d1, d2, ..., dN} be the N most relevant documents for query q
- Each di is assumed to be drawn from a mixture of θ_F and the corpus model, with θ_F assumed to be multinomial (peaked at the relevant terms).
- θ_F is then estimated by the EM algorithm from the global likelihood (see the sketch below): log P(F|θ_F) = Σ_{d∈F} Σ_w c(w, d) · log( (1-α) P(w|θ_F) + α P(w|C) )
- where P(w|C) is the word probability built upon the corpus and α (fixed to 0.5) is the mixture parameter
Zhai and Lafferty, SIGIR 2001.
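Below is a minimal sketch of this mixture-model feedback estimation (in the spirit of Zhai and Lafferty, SIGIR 2001); it is not the authors' implementation, and the number of EM iterations and the smoothing floor are illustrative choices.

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, corpus_prob, alpha=0.5, iters=30):
    """EM estimation of theta_F: every word occurrence in the feedback documents is
    assumed to be drawn from the mixture (1 - alpha) * P(w|theta_F) + alpha * P(w|C)."""
    counts = Counter(w for doc in feedback_docs for w in doc)
    theta_f = {w: 1.0 / len(counts) for w in counts}            # uniform initialization
    for _ in range(iters):
        expected = {}
        for w, c in counts.items():
            num = (1 - alpha) * theta_f[w]
            denom = num + alpha * corpus_prob.get(w, 1e-12)
            expected[w] = c * num / denom                       # E-step: counts credited to theta_F
        total = sum(expected.values())
        theta_f = {w: e / total for w, e in expected.items()}   # M-step: renormalize
    return theta_f
```

The refined query model is then the interpolation λ·θ_q + (1-λ)·θ_F shown on the previous slide.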
9. Lexical Entailment
- Lexical Entailment: a thesaurus built automatically from a given corpus
- Given by the probabilities that one term entails another term, estimated on the corpus
- The table is filtered using information gain, and an additional parameter enables us to increase the weight given to the self-entailment P(u|u).
- Applied to IR: words from the document are "translated" into the different query terms (illustrated in the sketch below)
- If we add background smoothing, we obtain the final smoothed document model used for ranking
- Pros: finds relations between terms that feedback cannot.
- Cons: heavier to compute, and queries get longer.
S. Clinchant, C. Goutte, E. Gaussier, "Lexical Entailment for Information Retrieval", ECIR 2006.
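As an illustration only (the exact formulation is in the ECIR 2006 paper cited above), the sketch below scores a document by translating its words into the query terms through an assumed entailment table entail[(u, w)] ≈ P(w | u), then applies background smoothing; the table contents and the mixture weight are made up for the example.

```python
import math
from collections import Counter

def le_score(query_terms, doc_terms, entail, corpus_prob, lam=0.5):
    """Score a document whose words are 'translated' into query terms through a
    lexical entailment table, then smoothed with the corpus (background) model."""
    tf = Counter(doc_terms)
    dlen = len(doc_terms)
    score = 0.0
    for w in query_terms:
        # Document probability of w obtained through entailment; the filtered table is
        # assumed to also contain the (boosted) self-entailment entries entail[(w, w)].
        p_le = sum(entail.get((u, w), 0.0) * (c / dlen) for u, c in tf.items())
        p = (1 - lam) * p_le + lam * corpus_prob.get(w, 1e-12)
        score += math.log(p)
    return score
```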
10. Other results for comparison
11. DS 07 - Monolingual Official Runs
- PRF + Lexical Entailment is a double Lexical Entailment model: a first LE model provides the system with an initial set of top-N documents, from which a mixture model for pseudo-feedback is built; a second retrieval is then performed, once again based on the LE model, applied to the enriched query.
12. Cross-Lingual IR
[Diagram: a query in the source language, expressing an information need, is matched against documents d1, d2, ..., dn in the target language, either by translating the query or by translating the documents]
13. What to translate?
- Document translation: translate the documents into the query language
  - Pros: the translation may be (theoretically) more precise, and the documents become readable by the user
  - Cons: huge volume to be translated
- Query translation: translate the query into the document language
  - Pros: flexibility (translation on demand) and less text to translate
  - Cons: less precise, and the retrieved documents still need to be translated to be readable.
14. How to translate?
- Statistical Machine Translation (Matrax)
  - Alignment model learnt on a parallel corpus (JRC-AC, with Giza word alignment)
  - Language model (N-gram) learnt on the GIRT corpus
  - Translate the source sentence into the K target sentences
  - Use them as a mono-lingual query
- Dictionary-based approach, with or without adaptation
  - Extract a probabilistic bilingual dictionary from different resources (standard dictionary, domain-specific thesaurus, JRC-AC)
  - Use the translated query with the mono-lingual retrieval approaches
  - Adapt the dictionary to a particular (query (feedback), target corpus) pair
  - Use the adapted query with the mono-lingual retrieval approaches
15. Dictionary-based CLIR without Adaptation
- Idea: rank the documents according to the cross-entropy similarity between the language model of the query and the language model of the document,
- using a probabilistic bilingual dictionary given by P(wt|ws), the probability that the source word ws is translated into the target word wt (a toy sketch follows below).
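A small sketch of this idea, under illustrative assumptions (toy dictionary, whitespace tokens): the source-query language model is pushed through P(wt|ws) to obtain a target-language query model, which can then be matched against the document models with the same query-likelihood / cross-entropy machinery as in the mono-lingual case.

```python
from collections import Counter, defaultdict

def translate_query_model(source_query_terms, dictionary):
    """Target-language query model: P(wt|theta_q) = sum_ws P(wt|ws) * P(ws|query),
    where dictionary[ws] = {wt: P(wt|ws)} is the probabilistic bilingual dictionary."""
    q_counts = Counter(source_query_terms)
    q_len = len(source_query_terms)
    target_model = defaultdict(float)
    for ws, c in q_counts.items():
        for wt, p in dictionary.get(ws, {}).items():
            target_model[wt] += p * (c / q_len)
    return dict(target_model)

# Toy usage with an illustrative German -> English dictionary.
dictionary = {"sozial": {"social": 0.9, "welfare": 0.1},
              "forschung": {"research": 1.0}}
print(translate_query_model("sozial forschung".split(), dictionary))
```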
16. Dictionary Adaptation
- Aim: adapt the dictionary to a particular query
- How: detect the implicit coherence present in the (pseudo-)relevant documents using PRF
- The first retrieval step (CL-LM) can also be seen as a dictionary disambiguation process.
[Diagram: the source query model θ_qs is translated with the general dictionary P(wt|ws) and used by CL-LM to retrieve the top-N ranked texts; from these, an adapted dictionary θ_st is estimated, and CL-LM produces the final rank by re-ranking the target documents with the adapted query model]
17. How to estimate θ_st
- Let F(q) be the set of relevant documents retrieved by the translated query using CL-LM.
- The global model likelihood then becomes the cross-lingual analogue of the mono-lingual feedback likelihood, with the translation model θ_st in place of θ_F.
- The estimation of θ_st is done by EM, initializing it with the general dictionary P(wt|ws) (a sketch follows below).
- Finally, we apply CL-LM again, but with the adapted query language model.
- Note:
  - This is an extension of the Query LM Refinement with PRF to the multi-lingual case
  - This algorithm realizes both the query enrichment and the dictionary adaptation
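The sketch below shows one plausible way to write that EM step; the generative assumption (each target word in the feedback documents is drawn from a mixture of the translated query model and the corpus model), the parameter alpha and the data structures are illustrative and may differ from the exact model used in the official runs.

```python
from collections import Counter, defaultdict

def adapt_dictionary(feedback_docs, source_query_model, dictionary, corpus_prob,
                     alpha=0.5, iters=20):
    """EM re-estimation of the translation probabilities P(wt|ws) from the
    pseudo-relevant target documents, initialized with the general dictionary
    (dictionary[ws] = {wt: P(wt|ws)}, source_query_model[ws] = P(ws|query))."""
    counts = Counter(w for doc in feedback_docs for w in doc)
    p_t_given_s = {ws: dict(row) for ws, row in dictionary.items()}
    for _ in range(iters):
        expected = defaultdict(lambda: defaultdict(float))
        for wt, c in counts.items():
            # E-step: responsibility of each source word (vs. the corpus model) for wt.
            contrib = {ws: (1 - alpha) * p_t_given_s.get(ws, {}).get(wt, 0.0) * p_s
                       for ws, p_s in source_query_model.items()}
            denom = sum(contrib.values()) + alpha * corpus_prob.get(wt, 1e-12)
            for ws, num in contrib.items():
                if num > 0:
                    expected[ws][wt] += c * num / denom
        # M-step: renormalize each adapted translation row P(.|ws).
        for ws, row in expected.items():
            total = sum(row.values())
            if total > 0:
                p_t_given_s[ws] = {wt: v / total for wt, v in row.items()}
    return p_t_given_s
```

The adapted rows replace the general dictionary entries for this query, which is what realizes the joint query enrichment and dictionary adaptation mentioned above.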
18. Other results for comparison
19. DS 07 - Bilingual Official Runs
20. Conclusion
- Mono-lingual Domain-Specific Information Retrieval
  - The Query LM refinement with PRF (LMPRF) gives better performance than the Lexical Entailment (Table 7), but unlike the latter it requires 2 retrieval steps.
  - Both outperform the non-adapted LM case (Table 9)
  - Combining them allowed further improvements (Table 7)
- Cross-lingual Domain-Specific Information Retrieval
  - Combining the query translation with further adaptation was beneficial (Table 9)
  - Matrax is better than dictionary-based IR without adaptation (Table 9)
  - The dictionary adaptation method gave better results than query translation with Matrax, independently of the further mono-lingual adaptation (Table 8)
  - Combining the Lexical Entailment with LMPRF was beneficial in the cross-lingual case too (Table 8)
21. Thank you for your attention!
- Not satisfied with the answer?
- You can always get an answer directly from the authors:
  - Stephane.Clinchant_at_xrce.xerox.com or
  - Jean-Michel.Render_at_xrce.xerox.com
22. Back-up
23. Matrax
[Diagram: Matrax architecture - the JRC-AC parallel corpus is pre-processed and used for bi-phrase library construction; a language model is built with the SRI LM toolkit on the GIRT corpus; model parameters are optimized on the training and development sets; the decoder uses the bi-phrase library, the language model and the optimized model parameters]