Title: Resolving Translation Ambiguity using Ontological Chains for Cross Language Information Retrieval
1Resolving Translation Ambiguity using Ontological Chains for Cross Language Information Retrieval
- Institute of Computer and Information Science
- National Chiao Tung University
- Student: Je-Wei Liang
- Advisor: Dr. Hao-Ren Ke
- Dr. Wei-Pang Yang
2Outline
- Introduction
- Related Work
- Query Translation
- Word Sense Disambiguation
- Ontology-based CLIR
- Query Translation
- Resolving Translation Ambiguity
- Mono-lingual IR
- Evaluation
- Document set
- Topics
- Relevance Assessment
- Result
- Conclusion
- Reference
3Introduction: What is Cross Language Information Retrieval?
- It enables users to query in one language and retrieve relevant documents in other languages.
- Users need not know the exact translation of their queries.
4Introduction: Why CLIR?
- Chinese is the most spoken language in the world, but web pages are mostly in English.
- Through the aid of CLIR systems, Chinese speakers are able to retrieve English documents.
5Introduction: Ambiguity
- Ambiguity arises during segmentation, translation, and indexing in CLIR systems.
- Our research focuses on resolving translation ambiguity.
6Motivation and Objectives
- Motivation
- Queries are usually short.
- e.g. ?
- A word may have many possible translations.
- e.g. ?? can be translated as "man" or "gentleman"
- Lack of domain knowledge
- e.g. "galleries" is related to "library"
- Objectives
- Design an ontology-based CLIR System
- Query expansion using WordNet.
- Resolving translation ambiguity using ontological
chains.
7Related Work
8Related Work: Improved use of Contextual Information in CLIR [Fung98]
- ?? and "flu" have similar contexts.
- News stories about ??? in Hong Kong.
9Related Work: Improved use of Contextual Information in CLIR [Fung98]
10Related Work: WSD using Lexical Chain [Barzilay97]
- Word meanings are represented by synonym sets (synsets).
- Relations defined in WordNet
- Synonym / Antonym
- Hypernym / Hyponym (the "is a kind of" relation)
11Related Work: WSD using Lexical Chain [Barzilay97]
- The procedure for constructing lexical chains follows three steps:
- Select a set of candidate words (nouns).
- For each candidate word, find an appropriate chain, relying on a relatedness criterion among members of the chains.
- If one is found, insert the word in the chain and update the chain accordingly.
- Three kinds of relations are defined:
- Extra-strong: synonym, 10 points
- Strong: holonym, 7 points
- Medium-strong: hypernym, 4 points
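The three steps and the relation scores above can be sketched in a few lines. This is a simplified illustration, not the procedure of [Barzilay97] itself: it ignores word senses, the `RELATIONS` table is a hypothetical stand-in for WordNet lookups, and the greedy "best chain wins" rule is an assumption.

```python
# Relation scores taken from the slide: synonym 10, holonym 7, hypernym 4.
SCORES = {"synonym": 10, "holonym": 7, "hypernym": 4}

# Hypothetical toy relation table standing in for WordNet lookups.
RELATIONS = {
    frozenset(("machine", "device")): "synonym",
    frozenset(("device", "pump")): "hypernym",
    frozenset(("wheel", "machine")): "holonym",
}


def relation_score(a, b):
    """Score of the relation between two words, 0 if they are unrelated."""
    return SCORES.get(RELATIONS.get(frozenset((a, b))), 0)


def build_chains(candidates):
    """Greedy pass: put each word in its best-scoring chain, else start a new chain."""
    chains = []  # each chain is {"words": [...], "score": int}
    for word in candidates:
        best, best_gain = None, 0
        for chain in chains:
            gain = sum(relation_score(word, w) for w in chain["words"])
            if gain > best_gain:
                best, best_gain = chain, gain
        if best is None:
            chains.append({"words": [word], "score": 0})
        else:
            best["words"].append(word)
            best["score"] += best_gain
    return chains


chains = build_chains(["machine", "device", "pump", "boxer", "wheel"])
for chain in chains:
    print(chain["score"], chain["words"])
```

The unrelated word "boxer" ends up in a chain of its own, while the other four words accumulate into one chain whose score is the sum of the relation scores used to attach them.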
12WSD using Lexical Chain [Barzilay97]
13WSD using Lexical Chain [Barzilay97]
- Machine
- an efficient person
- E.g. "the boxer was a magnificent fighting machine"
14WSD using Lexical Chain [Barzilay97]
Score 11
Score 30
15Related Work: Building a Chinese-English WordNet [Chen02]
16Related Work: Building a Chinese-English WordNet [Chen02]
17Related Work: Building a Chinese-English WordNet [Chen02]
18Advantages and Disadvantages of Related Work
19System workflow
20Query Translation
21Query Translation
- For each query term, add all its synonyms, hypernyms and hyponyms to the query.
- Original query terms are more important than newly added terms, so we need to refine the term weights.
- Term weight in the query can be defined
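A minimal sketch of this expansion step follows. The weighting formula is not preserved in these slides, so the two weights used here (1.0 for original terms, 0.5 for expanded terms) are assumptions, and `THESAURUS` is a hypothetical stand-in for WordNet.

```python
# Hypothetical thesaurus entries standing in for WordNet lookups.
THESAURUS = {
    "fish": {
        "synonyms": [],
        "hypernyms": ["aquatic vertebrate"],
        "hyponyms": ["herring", "salmon"],
    },
}


def expand_query(terms, original_weight=1.0, expanded_weight=0.5):
    """Return {term: weight}; both weight values are illustrative assumptions."""
    weights = {}
    for term in terms:
        entry = THESAURUS.get(term, {})
        for relation in ("synonyms", "hypernyms", "hyponyms"):
            for related in entry.get(relation, []):
                weights.setdefault(related, expanded_weight)
    # Assign original terms last so an expanded duplicate never lowers them.
    for term in terms:
        weights[term] = original_weight
    return weights


weights = expand_query(["fish"])
print(weights)
```

Keeping the original terms at a higher weight than the added ones reflects the point above that original query terms matter more than newly added terms.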
22Query Translation: An Example
- The query ? is translated to "fish", which retrieves the following documents by looking up WordNet.
23Query Translation Translation Ambiguity
24Resolving Translation Ambiguity
25Ontology Construction
- Each ImageCLEF2004 document belongs to one or more categories.
- There are 946 distinct categories.
- Human experts group related categories to form a hierarchy.
- E.g. "fish processing" and "fisherman" are assigned to the parent node ??.
26Keyword Extraction
- Ontology Node Representation
- Each node is represented by a term-to-concept vector.
- Weight is defined as the product of term frequency and inverse concept frequency.
- <W1, W2, W3, ..., Wn>
- Pick the k most important terms as keywords.
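The weighting above can be sketched as follows. The exact form of the inverse concept frequency is not given in the slides; this sketch assumes log(number of concepts / number of concepts containing the term), by analogy with idf, and the node contents are made-up examples.

```python
import math
from collections import Counter


def top_keywords(concept_docs, k=2):
    """concept_docs maps an ontology node name to the tokens of its documents."""
    n_concepts = len(concept_docs)
    tf = {c: Counter(tokens) for c, tokens in concept_docs.items()}
    # Concept frequency: the number of concepts in which a term occurs.
    cf = Counter(term for counts in tf.values() for term in counts)
    keywords = {}
    for concept, counts in tf.items():
        # Assumed weight: term frequency times log-scaled inverse concept frequency.
        weights = {t: f * math.log(n_concepts / cf[t]) for t, f in counts.items()}
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        keywords[concept] = ranked[:k]
    return keywords


# Made-up node contents for illustration only.
nodes = {
    "fish processing": ["herring", "salting", "salting", "woman", "photo"],
    "university libraries": ["library", "library", "books", "photo"],
}
keywords = top_keywords(nodes, k=2)
print(keywords)
```

A term such as "photo" that occurs in every node gets weight zero and is never chosen as a keyword.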
27Keyword Extraction Example
- Keywords relevant to universities and
university libraries
28Building Ontological Chains
- Build an ontological chain for each query.
- Step 1: For each query, find the N most similar ontology leaf nodes.
- Step 2: Measure the pairwise semantic distance among the selected leaf nodes to obtain a semantic graph.
- Step 3: Find the connected components of the graph, and pick the strongest component as the ontological chain.
- Step 4: For each node in the chain, add its sibling nodes to the chain.
- Step 5: Calculate the mutual information (MI) of each English query term with respect to the chain.
- Step 6: Pick the terms having MI > T, where T is the threshold.
29Building Ontological Chains
- Step 1
- The similarity of query Q and ontology leaf node Li is defined as follows:
- tij is the number of distinct Chinese query terms in document j belonging to Li
- N is the number of documents belonging to Li
- E.g. ????????? is similar to "fish processing" and "fishwive"
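The similarity formula itself is not preserved in these slides. One plausible reading of the two definitions, used here purely as an assumption, is the average of tij over the N documents of the leaf node. The ontology contents and the English placeholder tokens below are hypothetical.

```python
def leaf_similarity(query_terms, leaf_docs):
    """Assumed similarity: the mean of t_ij over the N documents of a leaf node."""
    if not leaf_docs:
        return 0.0
    query = set(query_terms)
    # t_ij: number of distinct query terms found in document j of this leaf.
    matches = [len(query & set(doc)) for doc in leaf_docs]
    return sum(matches) / len(leaf_docs)


def most_similar_leaves(query_terms, ontology, n=2):
    """Return the n leaf nodes with the highest similarity to the query."""
    scores = {leaf: leaf_similarity(query_terms, docs) for leaf, docs in ontology.items()}
    return sorted(scores, key=lambda leaf: (-scores[leaf], leaf))[:n]


# Hypothetical leaf nodes, each with the token lists of its documents.
ontology = {
    "fish processing": [["man", "woman", "fish"], ["woman", "fish"]],
    "fishwive": [["woman", "fish"], ["fish"]],
    "golf": [["man"], ["club"]],
}
query = ["man", "woman", "fish"]
leaves = most_similar_leaves(query, ontology, n=2)
print(leaves)
```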
30Building Ontological Chains
- Step 2
- Define the semantic distance between two ontology leaf nodes as K / D.
- K is a constant, and D is the path length between the two nodes.
- E.g. the distance between "herring" and "fish processing" is K/3.
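The K/3 example can be reproduced with a small tree walk. The ontology fragment below is hypothetical and chosen only so that "herring" and "fish processing" are three edges apart; the value of K is likewise an assumption.

```python
# Hypothetical fragment of the ontology as a child -> parent map (root maps to None).
PARENT = {
    "fishing industry": None,
    "fish": "fishing industry",
    "herring": "fish",
    "fish processing": "fishing industry",
}


def ancestors(node):
    """Path from a node up to the root, starting with the node itself."""
    path = [node]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def path_length(a, b):
    """Number of edges between a and b through their lowest common ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    depth_in_b = {node: i for i, node in enumerate(up_b)}
    for i, node in enumerate(up_a):
        if node in depth_in_b:
            return i + depth_in_b[node]
    raise ValueError("nodes are not in the same tree")


def semantic_distance(a, b, k=1.0):
    """K / D for two distinct nodes; k is an assumed constant."""
    return k / path_length(a, b)


d = path_length("herring", "fish processing")
print(d, semantic_distance("herring", "fish processing", k=3.0))
```

Because the measure is K divided by the path length, nodes that are closer in the ontology receive a larger value.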
31Building Ontological Chains
- Calculate pairwise semantic distance and then
obtain a semantic graph.
32Building Ontological Chains
- Step 3: Employ the union-find algorithm to find connected components, and choose the maximum-weighted one.
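A minimal union-find sketch of this step follows. Two details are assumptions, since the slides do not state them: edges at or below a hypothetical `min_weight` are dropped before components are formed, and the weight of a component is taken to be the sum of its edge weights. The node names and edge weights are illustrative.

```python
def find(parent, x):
    """Return the representative of x, shortening the path as it goes."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def strongest_component(nodes, weighted_edges, min_weight=0.0):
    """Return the nodes of the connected component with the largest total edge weight."""
    parent = {n: n for n in nodes}
    kept = [(a, b, w) for a, b, w in weighted_edges if w > min_weight]
    for a, b, _ in kept:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb  # union the two sets
    members, totals = {}, {}
    for n in nodes:
        members.setdefault(find(parent, n), []).append(n)
    for a, _, w in kept:
        root = find(parent, a)
        totals[root] = totals.get(root, 0.0) + w
    best = max(members, key=lambda r: (totals.get(r, 0.0), len(members[r])))
    return sorted(members[best])


# Illustrative semantic graph; the weights are made up.
nodes = ["herring", "fish processing", "fishwive", "golf", "golf clubs"]
edges = [
    ("herring", "fish processing", 0.33),
    ("fish processing", "fishwive", 0.5),
    ("golf", "golf clubs", 0.5),
    ("herring", "golf", 0.1),
]
chain = strongest_component(nodes, edges, min_weight=0.2)
print(chain)
```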
33Building Ontological Chains
- Step 4: Add the sibling nodes "fish markets" and "fisherman".
- Steps 5 and 6: For the term ??, pick "man" instead of "gentleman".
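The MI computation is not spelled out in these slides. The sketch below assumes pointwise mutual information between a candidate translation and membership in the chain's documents, estimated from document counts; the document sets and the threshold value are hypothetical.

```python
import math


def pmi(term, chain_docs, all_docs):
    """Assumed MI: log P(term, chain) / (P(term) * P(chain)), from document counts."""
    n = len(all_docs)
    with_term = sum(term in doc for doc in all_docs)
    in_chain = len(chain_docs)
    both = sum(term in doc for doc in chain_docs)
    if both == 0 or with_term == 0 or in_chain == 0:
        return float("-inf")
    return math.log((both / n) / ((with_term / n) * (in_chain / n)))


def select_translations(candidates, chain_docs, all_docs, threshold=0.0):
    """Keep the candidate translations whose MI exceeds the threshold T."""
    return [t for t in candidates if pmi(t, chain_docs, all_docs) > threshold]


# Hypothetical documents: those in the chain, and the rest of the collection.
chain_docs = [{"man", "fish", "salting"}, {"woman", "fish"}, {"man", "herring"}]
other_docs = [{"gentleman", "golf"}, {"gentleman", "portrait"}, {"man", "golf"}]
all_docs = chain_docs + other_docs

chosen = select_translations(["man", "gentleman"], chain_docs, all_docs)
print(chosen)
```

With these made-up counts, "man" co-occurs with the chain more often than chance and is kept, while "gentleman" never appears in the chain's documents and is discarded.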
34Monolingual Information Retrieval System
35Document Vector Representation
- Each document has three kinds of features:
- Terms: W_{i,j} is defined as tf × idf
- Categories: Wc_{i,j} is defined by a boolean weighting scheme
- Temporal feature: Wt_{i,j} is defined by a boolean weighting scheme
- We use the cosine measure as our similarity function
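A minimal sketch of such a feature vector and the cosine measure follows. Representing the vector as a sparse dictionary keyed by (feature kind, value), and encoding the temporal feature as one boolean entry per year, are assumptions; the weights in the example are made up.

```python
import math


def document_vector(tfidf, categories, year):
    """Combine the three feature kinds into one sparse vector (a dict)."""
    vec = {("term", t): w for t, w in tfidf.items()}       # tf x idf weights
    vec.update({("category", c): 1.0 for c in categories})  # boolean weights
    vec[("year", year)] = 1.0                               # boolean weight (assumed encoding)
    return vec


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


doc = document_vector({"fish": 0.8, "salting": 0.6}, ["fish processing"], 1898)
qry = document_vector({"fish": 1.0}, ["fish processing"], 1898)
score = cosine(doc, qry)
print(round(score, 4))
```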
36Query Vector Representation
- Each feature is multiplied by a weighting factor.
- We define three temporal operations: before, in, and after.
- E.g. D1 was published in 1898, D2 was published in 1901, and Q uses the operation "before 1900".
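The three temporal operations can be sketched as a simple predicate on publication years. Treating "before" and "after" as strict comparisons is an assumption; the slides do not say whether the boundary year is included.

```python
def temporal_match(doc_year, operation, query_year):
    """Return True if the document year satisfies the temporal operation."""
    if operation == "before":
        return doc_year < query_year   # strict comparison is an assumption
    if operation == "in":
        return doc_year == query_year
    if operation == "after":
        return doc_year > query_year   # strict comparison is an assumption
    raise ValueError(f"unknown temporal operation: {operation}")


docs = {"D1": 1898, "D2": 1901}
matching = [name for name, year in docs.items() if temporal_match(year, "before", 1900)]
print(matching)
```

For the example above, only D1 (1898) satisfies "before 1900".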
37Evaluation Dataset Description
- ImageCLEF2004 bilingual ad hoc task.
- St Andrews University Library photographic collection.
- Photos are primarily historic in nature from areas in and around Scotland.
- Dataset Overview
- 28133 SGML documents consisting of text and images.
- 946 categories
38Evaluation Dataset Description
39Evaluation Topics
- Topics are based on real search requests, including query logs and requests from patrons.
40Evaluation Metrics Mean Average Precision
- Average Precision
- Average of the precision at each relevant document retrieved.
- E.g. average precision of the query ?????????.
- Mean Average Precision
- Mean of the individual average precision scores.
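Both metrics can be computed in a few lines. This sketch follows the definition above literally, averaging over the relevant documents that were actually retrieved; some evaluation tools divide by the total number of relevant documents instead. The rankings and relevance judgments are made-up examples.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document retrieved."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(runs):
    """runs is a list of (ranked list, set of relevant documents), one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Made-up rankings and relevance judgments for two queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # precision 1/1 and 2/3
    (["d5", "d6"], {"d6"}),                    # precision 1/2
]
ap_first = average_precision(*runs[0])
map_score = mean_average_precision(runs)
print(round(ap_first, 4), round(map_score, 4))
```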
41System Demonstration Retrieval
42System Demonstration Ontology
43System Demonstration Monolingual
44System Demonstration Dictionary-based
45System Demonstration Ontology-based
46Evaluation Precision/Recall at Top 100
- Ontology-based CLIR performs better than dictionary-lookup CLIR.
- The ontology-based CLIR system reaches 85% of the performance of the monolingual IR system.
- Without the ontology, CLIR reaches only 42% of the performance of the monolingual IR system.
47Evaluation 11-point Precision/Recall
- Ontology-based CLIR system performs better than
dictionary-lookup CLIR.
48Evaluation Mean Average Precision
- The ontology-based CLIR system performs better than dictionary-lookup CLIR.
- The ontology-based CLIR system reaches 92% of the performance of the monolingual IR system.
- Without the ontology, CLIR reaches only 81% of the performance of the monolingual IR system.
49System Demonstration Feedback
50System Demonstration Feedback
51System Demonstration Feedback
52Evaluation Relevance Feedback
- Pick M retrieved relevant documents as positive examples, and N retrieved non-relevant documents as negative examples.
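The slides do not state how the positive and negative examples update the query. The sketch below assumes a Rocchio-style update, a standard technique for relevance feedback; the coefficient values alpha, beta and gamma are conventional illustrative defaults, not values taken from this work.

```python
def rocchio(query, positives, negatives, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the positive centroid and away from the negative one."""
    terms = set(query)
    for doc in positives + negatives:
        terms.update(doc)
    updated = {}
    for t in terms:
        pos = sum(d.get(t, 0.0) for d in positives) / len(positives) if positives else 0.0
        neg = sum(d.get(t, 0.0) for d in negatives) / len(negatives) if negatives else 0.0
        weight = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
        if weight > 0:  # terms with non-positive weight are dropped
            updated[t] = weight
    return updated


# Made-up sparse vectors: one positive and one negative example.
query = {"fish": 1.0}
positives = [{"fish": 1.0, "salting": 1.0}]
negatives = [{"golf": 1.0}]
new_query = rocchio(query, positives, negatives)
print(sorted(new_query.items()))
```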
53Discussion
- With ontological chains, CLIR will perform better than monolingual IR.
- Without semantic features, documents are retrieved only if they share terms with the query.
- "catch", "fisherman" and "salting" are related to the query "man and woman processing fish" in the ontology.
- Ontological chains use semantic features and perform better than keyword matching.
- Our similarity function performs better than the cosine measure.
- Terms are independent in the vector space model.
- Our similarity function has the same effect as an AND operator.
54Discussion When the query is specific
- Our approach performs a little worse than keyword matching.
- E.g. 1908??????????
- Our approach may expand too many ontology nodes.
55Discussion When the query is general
- Our approach performs much better than keyword matching.
- E.g. ?????????
- "man", "woman", and "processing" are general terms and appear in many documents.
56Conclusion
- An ontology can be employed to represent domain knowledge in CLIR systems.
- The proposed ontological chain approach can be used to resolve translation ambiguity.
- The proposed ontological chain approach achieves better precision than the other approaches, especially when the number of translation candidates is very large.
57Future Work
- Semantic indexing can be used to resolve
polysemous words.
58Reference
- [Ballesteros98] L. Ballesteros and W.B. Croft, "Resolving ambiguity for cross language retrieval," Proc. 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 64-71, 1998.
- [Barzilay97] R. Barzilay and M. Elhadad, "Using Lexical Chains for Text Summarization," ACL/EACL Workshop on Intelligent Scalable Text Summarization, 1997.
- [Carbonell97] J. Carbonell, Y. Yang, R. Frederking, R.D. Brown, Y. Geng, and D. Lee, "Translingual Information Retrieval: A Comparative Evaluation," Proc. Fifteenth International Joint Conference on Artificial Intelligence, Vol. 1, pp. 708-715, 1997.
- [Chen02] H.H. Chen, C.C. Lin and W.C. Lin, "Building a Chinese-English wordnet for translingual applications," ACM Transactions on Asian Language Information Processing, Vol. 1, Issue 2, pp. 103-122, 2002.
- [CLEF04] Cross Language Evaluation Forum, available at http://clef.iei.pi.cnr.it:2002/2004.html
- [Frakes92] W.B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures & Algorithms, Prentice Hall, 1992.
- [Fung98] P. Fung and L.Y. Yee, "An IR Approach for Translating New Words from Nonparallel, Comparable Texts," Proc. of the 36th Annual Conference of the Association for Computational Linguistics, pp. 414-420, 1998.
59Reference
- [Gruber93] T.R. Gruber, "A translation approach to portable ontologies," Knowledge Acquisition, pp. 199-220, 1993.
- [ImageCLEF04] Cross Language Evaluation Forum, available at http://ir.shef.ac.uk/imageclef2004/
- [Kipfer01] B.A. Kipfer and R.L. Chapman, Roget's International Thesaurus, HarperResource, 2001.
- [Larkey03] L.S. Larkey and M.E. Connell, "Structured Queries, Language Modeling, and Relevance Modeling in Cross-Language Information Retrieval," Information Processing and Management, Special Issue on Cross Language Information Retrieval, 2003.
- [Littman98] M.L. Littman, S.T. Dumais, and T.K. Landauer, "Automatic cross-language information retrieval using latent semantic indexing," Cross-Language Information Retrieval, pp. 51-62, 1998.
- [Lu02] W.H. Lu, L.F. Chien and H.L. Lee, "Translation of web queries using anchor text mining," ACM Transactions on Asian Language Information Processing, Vol. 1, Issue 2, pp. 159-172, 2002.
- [Miller95] G. Miller, "WordNet: A Lexical Database for English," Communications of the ACM, 1995.
60Reference
- [Miller99] D.R.H. Miller, T. Leek, and R.M. Schwartz, "A hidden Markov model information retrieval system," Proc. of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 214-221, 1999.
- [Nie99] J.Y. Nie, M. Simard, P. Isabelle and R. Durand, "Cross-Language Information Retrieval Based on Parallel Texts and Automatic Mining of Parallel Texts from the Web," Proc. of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 74-81, 1999.
- [Porter80] M.F. Porter, "An algorithm for suffix stripping," Program, Vol. 14, No. 3, pp. 130-137, 1980.
- [Rocchio71] J. Rocchio, Relevance Feedback in Information Retrieval, Prentice-Hall, Inc., 1971.
- [Salton83] G. Salton and M.J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
- [Savoy03] J. Savoy, "Cross-language information retrieval experiments based on CLEF 2000 corpora," Information Processing & Management, Vol. 39, Issue 1, pp. 75-115, 2003.
- [Tarjan75] R.E. Tarjan, "Efficiency of a Good But Not Linear Set Union Algorithm," Journal of the ACM, Vol. 22, Issue 2, pp. 215-225, 1975.
61Reference
- [Xu01] J. Xu, R. Weischedel, and C. Nguyen, "Evaluating a probabilistic model for cross-lingual information retrieval," Proc. 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 105-110, 2001.
- [Zhang02] Y. Zhang and P. Vines, "Improved use of Contextual Information in Cross-language Information Retrieval," ACDS, 2002.