Title: Translation Enhancement: a New Relevance Feedback Method for Cross-Language Information Access
1. Translation Enhancement: a New Relevance Feedback Method for Cross-Language Information Access
Daqing He University of Pittsburgh dah44_at_pitt.edu
Dan Wu Wuhan University woodan_at_whu.edu.cn
2. Outline
- Motivations
- Translation Enhancement
- Experiments and Results
- Conclusions
3. Query Translation Based CLIR in TREC-like Environments
[Figure: query translation step in the CLIR pipeline]
4. Usages of RF Information
- Query expansion (QE) methods perform QE before query translation (pre-translation QE) and/or after query translation (post-translation QE)
- Post-translation QE, or the combination of the two, performed the best [Ballesteros & Croft 97; McNamee & Mayfield 02]
[Figure: RF loop — Query, Pre-translation Query Expansion (via Search on a Source Language Collection), Query Translation, Post-translation Query Expansion, Search on Target Language Collection, Relevance Feedback]
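The pre-/post-translation QE pipeline on this slide can be sketched as follows. The toy dictionary and expansion lists are purely illustrative, not the resources used in the experiments:

```python
# Illustrative sketch of the slide's pipeline: pre-translation QE on the
# source query, dictionary-based query translation, then post-translation QE
# on the target query. All data below is toy data, not the paper's resources.

TOY_DICT = {"election": ["选举"], "president": ["总统"]}
PRE_EXPANSIONS = {"election": ["vote"]}    # mined from a source-language collection
POST_EXPANSIONS = {"选举": ["投票"]}         # mined from the target collection

def pre_translation_qe(terms):
    """Expand the source-language query before translation."""
    return terms + [t for w in terms for t in PRE_EXPANSIONS.get(w, [])]

def translate(terms):
    """Translate each source term via the bilingual dictionary."""
    return [f for e in terms for f in TOY_DICT.get(e, [])]

def post_translation_qe(terms):
    """Expand the target-language query after translation."""
    return terms + [t for w in terms for t in POST_EXPANSIONS.get(w, [])]

query = ["election"]
target_query = post_translation_qe(translate(pre_translation_qe(query)))
# target_query is the expanded Chinese query sent to the target collection
```

Terms introduced by pre-translation QE that have no dictionary entry (here, "vote") are dropped at the translation step, which is one reason post-translation QE tends to help more.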
5. Translations in Query Translation based CLIR
[Figure: query translation and result translation flows]
6. What Can We Obtain From RF?
[Figure: query translation maps the query to f1, f2, …, fn; result translation maps returned documents to e1, e2, …, em; relevance feedback yields translation pairs e1 <-> f1, …, en <-> fn]
7. Usages of RF Information - II
- Query expansion (QE) methods expand pre-translation and/or post-translation queries
- Translation Enhancement (TE) improves query translation resources using the obtained relevant translation relationships
[Figure: RF loop — Query, Pre-translation Query Expansion (via Search on a Source Language Collection), Query Translation, Post-translation Query Expansion, Search on Target Language Collection, Relevance Feedback]
8. Benefits
- By applying extracted translation relationships back to query translation:
  - Makes query translation and result document translation consistent with each other
  - Helps future pre-translation query expansion
  - Tailors the query translation resources toward the user's current search
- New translation alternatives can be introduced by TE
  - Potentially resolves some out-of-vocabulary (OOV) terms
- TE does not replace QE
  - They work at different steps of RF in CLIA
  - TE can help pre-translation QE
  - Maybe they can be combined?
9. Contributions of This Work
- Many related works apply extracted translation relationships to improve CLIR effectiveness:
  - [Nie, Simard, Isabelle & Durand 99] used web-mined parallel texts for CLIR
  - [Xu, Weischedel & Nguyen 01] estimated translation probabilities based on a parallel corpus
  - [Lavrenko, Choquette & Croft 02] described a cross-lingual relevance model that uses a parallel corpus as one resource for translation
- Our contributions:
  - Studying methods for extracting translation relationships
  - Using translation relationships extracted from relevant returned document pairs to enhance query translation directly
  - Exploring the combination of TE and QE
10. Research Questions on TE
- How to obtain relevant translation relationships?
- How to enhance query translation with the relevant translation relationships?
- Does it make sense to integrate TE with other RF methods?
11. Obtain Translation Relationships
- Borrow ideas from mining parallel corpora
- Establish alignment at a certain level
  - Best: word alignment between docs and their translations
  - Minimum: sentence alignment
- When word alignment is available:
  - Translations based on Word Alignment (TWA): train GIZA to obtain a word alignment model, and get word alignments from the model
- When only sentence alignment is available:
  - Keep All Translations (KAT): keep all the translation relationships of the query terms identified in the sentence pairs in relevant docs
  - Keep One Best Translation (K1T): based on KAT, but keep the translation with the highest translation probability in the dictionary
  - Keep Most Frequent Translation (KFT): based on KAT, but keep the translation with the highest frequency in the relevant docs
12. Obtain Translation Relationships without Word Alignment

Dictionary: E1 → F11, …, F1m1; E2 → F21, …, F2m2; …; En → Fn1, …, Fnmn
Query: E1, E2, …, En
Relevant doc D1: E1 E2 E2 E1 E1 E2 E1 E1
Translation of D1: F11 F21 F22 F11 F12 F22 F11 F11
Extracted pairs: E1 → F11 (D1, 4); E1 → F12 (D1, 1); E2 → F21 (D1, 1); E2 → F22 (D1, 2)

Stemming and a back-off strategy are used to find more instances of query terms and their translations inside the relevant docs and their translated docs.

KAT: E1 → F11 (D1, 4); E1 → F12 (D1, 1); E2 → F21 (D1, 1); E2 → F22 (D1, 2); E1 → F11 (D2, 4); …
K1T: E1 → F11 (D1, 4); E2 → F21 (D1, 1)
KFT: E1 → F11 (D1, 4); E2 → F22 (D1, 2)
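The three sentence-alignment heuristics above can be sketched as follows. The data structures and function names are illustrative, and this sketch counts a co-occurrence once per aligned sentence pair rather than per token occurrence:

```python
# Sketch of the KAT / K1T / KFT extraction heuristics from the slides.
# `dictionary` maps each source term to {translation: probability};
# `sent_pairs` are aligned (source_tokens, target_tokens) sentence pairs
# drawn from relevant documents and their translations.

from collections import Counter

def extract_kat(query_terms, sent_pairs, dictionary):
    """KAT: keep every dictionary translation of a query term that
    co-occurs with it in an aligned sentence pair, with its count."""
    counts = Counter()
    for src_sent, tgt_sent in sent_pairs:
        for e in query_terms:
            if e not in src_sent:
                continue
            for f in dictionary.get(e, {}):
                if f in tgt_sent:
                    counts[(e, f)] += 1
    return counts

def extract_k1t(query_terms, sent_pairs, dictionary):
    """K1T: from the KAT pairs, keep for each term the translation
    with the highest dictionary probability."""
    best = {}
    for (e, f), n in extract_kat(query_terms, sent_pairs, dictionary).items():
        if e not in best or dictionary[e][f] > dictionary[e][best[e][0]]:
            best[e] = (f, n)
    return best

def extract_kft(query_terms, sent_pairs, dictionary):
    """KFT: from the KAT pairs, keep for each term the translation
    that co-occurs most frequently in the relevant documents."""
    best = {}
    for (e, f), n in extract_kat(query_terms, sent_pairs, dictionary).items():
        if e not in best or n > best[e][1]:
            best[e] = (f, n)
    return best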
13. Convert Extracted Relationships into Translation Probability
- P_{i,j} = P(j is a translation of i | j appears in Rel): the probability of translation alternative j being the translation of term i, given that j appears in the relevant document set
- tf_{j,k}: the frequency with which j is extracted as the translation of i from relevant document k
- n: the number of relevant documents
- m_i: the number of translation alternatives of term i
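The slide's equation image did not survive text extraction. Given the definitions above, a plausible reconstruction (an assumption on my part, not taken from the source) is a relative-frequency estimate:

```latex
P_{i,j} \;=\; P(j \mid i,\; j \in \mathrm{Rel})
        \;=\; \frac{\sum_{k=1}^{n} tf_{j,k}}
                   {\sum_{l=1}^{m_i} \sum_{k=1}^{n} tf_{l,k}}
```

i.e., the number of times j was extracted as a translation of i across all n relevant documents, normalized over all m_i translation alternatives of i.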
14. Enhanced Translation Probability
- Combine the translation probabilities obtained from the relevant document set with those in the original dictionary
- λ: a parameter that adjusts the relative weight of the translation probabilities from the relevant document set versus the general dictionary
- The combined probabilities are then normalized
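A minimal sketch of this interpolation step, assuming a simple linear mixture in which `lam` stands in for the slide's λ (the function name and data shapes are illustrative):

```python
# Sketch of the enhancement step: interpolate feedback-derived translation
# probabilities with the original dictionary probabilities, then renormalize
# so the enhanced distribution over alternatives sums to 1.

def enhance(dict_probs, rel_probs, lam=0.5):
    """lam * P_relevant + (1 - lam) * P_dictionary, renormalized.

    dict_probs / rel_probs: {translation: probability} for one source term.
    """
    alts = set(dict_probs) | set(rel_probs)
    combined = {f: lam * rel_probs.get(f, 0.0) + (1 - lam) * dict_probs.get(f, 0.0)
                for f in alts}
    total = sum(combined.values())
    return {f: p / total for f, p in combined.items()}
```

With lam = 0 the dictionary is unchanged; with lam = 1 only the feedback evidence survives. Translations seen in relevant documents gain mass at the expense of unseen alternatives.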
15. Experiment Goals and Objectives
- Is Translation Enhancement an effective RF method?
  - Test whether translation enhancement methods can improve CLIA in blind RF
- Can Translation Enhancement be combined with other RF methods?
  - Test whether combining translation enhancement with query expansion can improve CLIA in blind RF
- Is Translation Enhancement effective in a real interactive search environment?
  - Test whether translation enhancement can improve CLIA in interactive RF (not discussed in this talk)
16. Experiment Resources
- English-to-Chinese CLIR
  - English queries and Chinese documents
- Preprocessing tools
  - Stanford Chinese segmentation tool for Chinese documents
  - Porter stemmer for English queries and documents
  - An English and a Chinese stop word list
- Collections
  - TDT4 and TDT5 Chinese collection (83,627 documents)
  - TDT4 and TDT5 English MT collection (83,627 documents)
  - TDT4 and TDT5 English collection (306,498 documents)
- Translation resources
  - An English-Chinese bilingual lexicon with translation probabilities obtained from a large parallel corpus [Wang & Oard 06]
  - GIZA machine translation toolkit
- Indri 2.4 search engine
- Evaluation metric (TREC evaluation)
  - MAP: Mean Average Precision
17. Query Types
- Topics
  - 44 TDT4 and TDT5 English topics converted into TREC format
  - All topics manually translated into Chinese
- Query (TREC format)
  - Title (short "T" queries)
  - Title + Description (medium "TD" queries)
  - Title + Description + Narrative (long "TDN" queries)
18. Baselines
- Monolingual baseline
  - Use Chinese queries to search the Chinese collection
- Lower cross-language baseline
  - Use English queries to search the Chinese collection cross-lingually without any performance enhancement technique
  - Cumulative probability threshold (CPT) swept from 0.0 to 1.0 in increments of 0.1; the results below show the run with the best MAP
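The CPT pruning mentioned above can be sketched as follows (a minimal illustration under my reading of the slide; the function name is not from the source): keep the most probable translations of a term until their cumulative probability reaches the threshold.

```python
# Sketch of cumulative-probability-threshold (CPT) pruning of translation
# alternatives. A threshold of 0.0 keeps only the single best translation;
# a threshold of 1.0 keeps them all.

def cpt_select(trans_probs, threshold):
    """trans_probs: {translation: probability} for one source term.
    Returns translations kept, most probable first."""
    kept, cum = [], 0.0
    for f, p in sorted(trans_probs.items(), key=lambda x: -x[1]):
        kept.append(f)
        cum += p
        if cum >= threshold:
            break
    return kept
```

Sweeping the threshold from 0.0 to 1.0 in 0.1 steps, as on this slide, trades translation precision against coverage.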
19. Baselines - II
- Higher cross-lingual baseline
  - Same as the lower CL baseline, but with query expansion
  - Uses the default Indri pseudo-RF mechanism
    - Uses the top 20 documents of the result ranked list
    - The top 20 terms are used for expansion
    - The relative weight between the original query and expanded terms is tuned for each specific QE method
  - Pre-translation query expansion
  - Post-translation query expansion
  - Combined pre- and post-translation query expansion
20. TE Methods vs. Baselines
- All four TE methods performed better than the CL lower baseline
  - TWA improved the most; KAT improved the least
- TWA improved significantly for all three query types
- KFT improved significantly for the T and TD query types
- But only TWA achieved 93% of the monolingual baseline at TDN
21. TE Methods vs. QE Methods
- Pre-QE performed the worst among the QE methods
- All TE methods are at least comparable to the best QE method
- TWA outperforms the best QE method at TD and TDN
  - Significant at TDN
22. TE and QE Combination
- Combining TWA and Post-QE
  - Comparable to state-of-the-art CLIR performance
  - Significant improvement over the single runs in almost all query types
- (** p < 0.01; * 0.01 < p < 0.05)
23. TE in Resolving OOV Terms
- Through word alignment, some OOV terms can be resolved with high-quality translations
- 11 OOV terms found their translations through TWA; only 2 of them were wrong (marked in the table)
24. Conclusion
- Translation enhancement can improve CLIA in pseudo RF
  - The translation enhancement approach performs even better in processes where humans are involved (discussed in the paper)
- Translation enhancement can be combined with QE
  - TE and QE work on different parts of the RF process
  - Their combination significantly improves CLIR performance
- Translation enhancement can help resolve out-of-vocabulary terms in query translation
  - The quality of OOV resolution is reasonably high
- Future work
  - Extract translation relationships from statistical MT output, so that no word alignment is needed
  - Better integration of TE and QE
  - Interactive translation enhancement
25. Thank you!