Title: Cross-Language Information Retrieval (CLIR)
1Cross-Language Information Retrieval (CLIR)
- 614 Information Retrieval Theory
- YooJin Ha
- 12/17/2004
2Content
- Definition
- Users need
- General problem
- Organization of CLIR model
- Matching methods
- Experiments
- Future expectation
3What is CLIR?
- CLIR (Cross-Language Information Retrieval) is a retrieval
system that takes queries in one language and retrieves
documents in other languages. It lets users access
information using the language of their choice.
- CLIR can be used to enhance users' ability to search and
retrieve documents in many languages.
4Who will be users of a CLIR system?
- 1. Bilingual users who read their second language well but
have weaker productive skills, and so cannot express their
information need in their second language as well as they
can in their first.
- 2. Monolingual users who are interested in or need such
information for their research, but who face practical
constraints (such as the cost and time needed to obtain a
full translation).
- (Ogden et al., 1999; Abdelali et al., 2002)
5Users need in query formulation
- According to my survey (on OPAC and WorldCat usage), most
users want to search in their own language.
- Americans take it for granted that they can present the
system with queries in English. This forces all users to
enter queries the same way, even though Korean users, for
example, would want to present queries in their native
language, just as Chinese and Japanese users would.
- If WorldCat could morph into a cross-language system,
catalogers and the general public would have greater access
to and use of it.
6Multilanguage examples
- Rutgers University Libraries WorldCat
- Arctos Interactive Multilingual Search with Ursa
7History
- 1964: The International Road Research Documentation system
used a controlled-vocabulary thesaurus in English, French,
and German.
- 1970: Salton experimented with retrieving documents in a
language other than that of the query, augmenting the SMART
system with a hand-constructed bilingual term thesaurus.
- 1978: ISO adopted standard 5964 on the construction of
multilingual thesauri.
- (Oard & Diekema, 1998)
8Problems
- Languages differ in structure, which affects techniques
such as stemming (widely used to increase recall).
- Normalization is difficult (e.g., Korean is morphologically
rich, so words are hard to decompose).
- A CLIR system must be adapted to the characteristics of
each language it handles.
- Disambiguation: it is difficult to find the correct term,
because a query term often has several possible meanings.
- None of these problems has been fully solved yet.
9Some solutions
- Stemming: statistical dependence based on co-occurrence of
terms (e.g., Spanish, which has more forms of each verb)
(Croft et al., 1996)
- A dictionary combined with a mechanism for stringing word
meanings together (e.g., German) (Sheridan & Ballerini, 1996)
- Automatic query enrichment using local feedback and local
context analysis (Ballesteros & Croft, 1998)
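A minimal sketch of the local feedback idea in the last bullet: the top-ranked documents from an initial retrieval are assumed to be relevant, and their most frequent terms are added to the query. The function name `expand_query`, its parameters, and the toy documents are illustrative assumptions, not taken from Ballesteros & Croft; local context analysis is not covered here.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_docs=2, new_terms=2):
    """Add the most frequent terms from the top-ranked documents to the query.

    ranked_docs is assumed to be the output of an initial retrieval run,
    best document first. Ties are broken alphabetically for determinism.
    """
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        for token in doc.lower().split():
            if token not in query_terms:
                counts[token] += 1
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked[:new_terms]]


# Toy ranked list (invented for illustration).
ranked_docs = [
    "euro currency union bank",
    "euro bank policy",
    "football match",
]
expanded = expand_query(["euro"], ranked_docs)
print(expanded)
```

In a CLIR setting the expansion could be run before translation, after translation, or both; the sketch does not take a position on which is better.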
10Machine translation (MT)
- Many people assume that MT can simply be applied to any
system. In practice it often makes errors, and only a few
examples of research on MT specifically for CLIR are
available.
- MT works well only for specific domains, such as those
with specialized terminology. (Oard & Dorr, 1996)
- SYSTRAN Language Translation Technology
- AltaVista - Babel Fish Translation
11Organization of CLIR system - Oard (1997)
system model
- 1. Document pre-processing (collection)
- 2. Query formulation
- 3. Matching
- 4. Selection
- 5. Examination
- 6. Relevance feedback
- (cited in Oard & Diekema, 1998)
12Matching strategies (Oard & Diekema, 1998)
13Organization of CLIR system
- Matching
- Cognate matching
- Translation: query translation, document translation,
interlingual techniques
- Translation knowledge: ontology, bilingual dictionary,
machine translation lexicons, corpora
- (Oard and Diekema, 1998)
14Matching - Cognate matching
- Guess the meaning based on similarities in spelling or
pronunciation (e.g., proper nouns and technical terms)
- - Buckley et al. (1998): letter sequences with similar
sounds (e.g., c, k, qu)
- - Davis extended this technique, using fuzzy matching to
discover Spanish cognates of English words.
- This method is mostly used in combination with the other
major translation methods.
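A rough sketch of cognate matching as described above: similar-sounding letter sequences are collapsed to one symbol, and the normalized spellings are then compared with fuzzy string matching. The sound classes (only c, k, qu, following the slide), the 0.75 cutoff, and the function names are assumptions for illustration; this is not the actual procedure of Buckley et al. or Davis.

```python
import difflib

# Assumed equivalence classes: "qu" and "c" are both rewritten as "k".
# Order matters: "qu" is handled before the single letter "c".
SOUND_CLASSES = [("qu", "k"), ("c", "k")]


def normalize(word):
    """Lowercase a word and collapse similar-sounding letter sequences."""
    word = word.lower()
    for sequence, replacement in SOUND_CLASSES:
        word = word.replace(sequence, replacement)
    return word


def find_cognates(term, vocabulary, cutoff=0.75):
    """Return vocabulary words whose normalized spelling is close to term."""
    target = normalize(term)
    scored = []
    for candidate in vocabulary:
        matcher = difflib.SequenceMatcher(None, target, normalize(candidate))
        ratio = matcher.ratio()
        if ratio >= cutoff:
            scored.append((ratio, candidate))
    # Best match first.
    return [word for ratio, word in sorted(scored, reverse=True)]


# Toy Spanish vocabulary (invented for illustration).
spanish_vocabulary = ["electrónico", "casa", "técnica"]
print(find_cognates("electronic", spanish_vocabulary))
```

Because cognate matching only works for words that look or sound alike, a sketch like this would normally be a fallback for terms that a dictionary or other translation resource does not cover.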
15Matching - Translation
- Query translation
- Document translation
- Interlingual techniques
16Matching - Translation
- Most systems apply query translation first because it is
more efficient than document translation in time and cost.
- Risk of query translation: queries often do not contain
enough context to permit good translation.
- Some experimental systems have provided translated
summaries for a small number of retrieved documents
(e.g., Keizai, MULINEX).
17Matching - Translation
- Interlingual techniques
- Based on multilingual thesauri: controlled-vocabulary
techniques are applied to convert both the query and the
documents into a unified, language-independent
representation (e.g., Latent Semantic Indexing (LSI);
Dumais et al., 1997).
18Matching - Translation Interlingua approach
(example)
- TextWise's CINDOR (Conceptual Interlingua approach)
- Both documents and queries are mapped into the Conceptual
Interlingua
- Permits matching and retrieval based on any combination of
languages involved, rather than relying on pairwise
translations
- Enables language-independent retrieval based on natural
language concepts and matching to synonyms in all languages
- (Liddy and Sheridan, 1999)
- Cindorsearch
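The interlingua idea can be sketched as follows: every term, whatever its language, is mapped to a language-independent concept identifier, and matching happens on concepts rather than on words. The concept lexicon, the identifiers, and the overlap-count ranking are invented for illustration; nothing here reflects how CINDOR itself is implemented.

```python
# Hypothetical concept lexicon: surface terms in any language map to a
# language-independent concept identifier.
CONCEPT_LEXICON = {
    "car": "C_AUTOMOBILE", "automobile": "C_AUTOMOBILE",
    "voiture": "C_AUTOMOBILE", "auto": "C_AUTOMOBILE",
    "price": "C_PRICE", "prix": "C_PRICE", "preis": "C_PRICE",
    "tax": "C_TAX", "steuer": "C_TAX", "impôt": "C_TAX",
}


def to_concepts(text):
    """Map a text to the set of concept identifiers found in it."""
    tokens = text.lower().split()
    return {CONCEPT_LEXICON[t] for t in tokens if t in CONCEPT_LEXICON}


def retrieve(query, documents):
    """Rank documents by the number of concepts shared with the query."""
    query_concepts = to_concepts(query)
    scored = []
    for doc_id, text in documents.items():
        overlap = len(query_concepts & to_concepts(text))
        if overlap:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by document id.
    return [doc_id for overlap, doc_id in
            sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Toy multilingual collection (invented for illustration).
documents = {
    "fr1": "le prix de la voiture",
    "de1": "die steuer",
    "en1": "weather report",
}
print(retrieve("car tax price", documents))
```

Note that no language pair is translated directly: an English query reaches French and German documents through the shared concept layer, which is the point of avoiding pairwise translation.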
19TextWise's CINDOR (Liddy and Sheridan, 1999)
20Matching
- 3. Translation knowledge
- - Ontology
- - Bilingual dictionary
- - Machine translation lexicons
- - Corpora
21Matching 3. Translation knowledge Ontology
- The most dominant and effective source for CLIR (e.g.,
thesauri, which support controlled-vocabulary and free-text
retrieval, hierarchical relationships (broader and narrower
terms), synonymy, and related terms)
- Example: the SPIDER system, which used a multilingual
thesaurus and query expansion with Italian and German
(Sheridan & Ballerini, 1996)
- It helps users define better queries, but requires that
users understand the thesaurus structure (the user must
create the query using only vocabulary from the thesaurus).
With easy explanations or examples, it can be a good aid.
22Matching 3. Translation knowledge Bilingual
dictionary
- Widely used. It is useful because it can improve users'
query formulation.
- Problems: ambiguity (selecting the correct terms) and
limited scope
- - Abdelali et al. (2002) attempted to present visual
dictionaries for users who do not know the target language.
It is a useful attempt at a query-interactive approach.
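The two problems named above, ambiguity and limited scope, show up even in a very small sketch of dictionary-based query translation. The English-to-Spanish entries and the function names below are illustrative assumptions, not a real lexical resource.

```python
# Toy English-to-Spanish dictionary (illustrative entries only).
BILINGUAL_DICT = {
    "bank": ["banco", "orilla"],   # financial institution / river bank
    "interest": ["interés"],
    "rate": ["tasa", "ritmo"],
}


def translate_query(query, dictionary):
    """Return (term, candidate translations) pairs for each query term.

    Terms missing from the dictionary are passed through untranslated,
    which is one way the limited-scope problem surfaces.
    """
    translated = []
    for term in query.lower().split():
        translated.append((term, dictionary.get(term, [term])))
    return translated


def ambiguous_terms(translated):
    """List the query terms that have more than one candidate translation."""
    return [term for term, candidates in translated if len(candidates) > 1]


translation = translate_query("bank interest rate Euribor", BILINGUAL_DICT)
for term, candidates in translation:
    print(term, "->", candidates)
print("needs disambiguation:", ambiguous_terms(translation))
```

A system could keep all candidates, pick one automatically, or show the ambiguous terms to the user; the sketch only exposes where a choice has to be made.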
23Translation knowledge Machine Translation
Lexicon
- To overcome the limitations of general-purpose transfer
dictionaries, Salton (1971) used tuned lexicons and thesauri
built from controlled vocabulary, with good success on
specific text retrieval problems.
- - Preparing special-purpose lexical resources remains a
daunting task.
- (cited in Ogden et al., 1999)
24Translation knowledge parallel/comparable corpora
- Text corpora contain examples of usage patterns in the
query language that can be matched to examples in the target
language if the sentences or paragraphs of the texts are
aligned with one another.
- Braschler and Schäuble (2000) applied this approach and
obtained useful alignment sets among German, French, English,
and Italian documents. One aspect of the theory is that end
users are best served by efficient, inexpensive systems.
25Example of an alignment pair(Corpus-based
approaches)
- German and French documents; only titles are shown.
- Takeshita zu Antrittsbesuch in Bonn eingetroffen
- (Takeshita has arrived in Bonn for first visit)
- Arrivee en RFA du premier ministre japonais
- (Arrival of the Japanese prime minister in the FRG)
- Alignments are produced by using indicators to find
similarities between pairs of documents.
- (Braschler & Schäuble, 2000)
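A minimal sketch of indicator-based alignment, under the assumption that the indicators are language-independent tokens such as numbers and capitalized words (a crude stand-in for proper nouns). The indicator definition, the Jaccard similarity, the threshold, and the toy texts are all assumptions for illustration; they are not the indicators or documents used by Braschler and Schäuble.

```python
import re


def indicators(text):
    """Extract crude indicators: numbers and capitalized words, lowercased."""
    tokens = re.findall(r"\w+", text)
    return {t.lower() for t in tokens if t.isdigit() or t[0].isupper()}


def similarity(text_a, text_b):
    """Jaccard overlap between the indicator sets of two documents."""
    a, b = indicators(text_a), indicators(text_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def align(source_docs, target_docs, threshold=0.2):
    """Pair each source document with its most similar target document."""
    pairs = []
    for s_id, s_text in source_docs.items():
        best_id, best_score = None, 0.0
        for t_id, t_text in target_docs.items():
            score = similarity(s_text, t_text)
            if score > best_score:
                best_id, best_score = t_id, score
        if best_id is not None and best_score >= threshold:
            pairs.append((s_id, best_id))
    return pairs


# Invented toy texts, not the actual documents from the example above.
german = {"de1": "Takeshita trifft Kohl in Bonn 1988", "de2": "Wetter in Berlin"}
french = {"fr1": "Takeshita rencontre Kohl à Bonn en 1988", "fr2": "Météo à Paris"}
print(align(german, french))
```

Capitalization is a weak signal for German, where all nouns are capitalized, so a real system would need better indicators; the sketch only shows the pairing step.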
26Translation knowledge parallel/comparable corpora
- Generalized Vector Space Model
- - uses a bilingual training corpus to build matrices of
document and term weights (Carbonell et al., 1997)
- Latent Semantic Indexing
- - Telcordia Latent Semantic Indexing (LSI) Demos
- - LSI by Dumais, Landauer & Littman (SIGIR 96)
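One common way to read the Generalized Vector Space Model in a cross-language setting is sketched below: a query is projected into "training document space" through the source-language term-document matrix A, a target-language document through the matrix B built from the paired translations, and the two projections are compared by cosine. The raw term counts, the tiny parallel corpus, and all names are assumptions for illustration, not the weighting or data of Carbonell et al.

```python
import math


def vocabulary(docs):
    """Sorted list of distinct tokens in a list of documents."""
    return sorted({token for doc in docs for token in doc.lower().split()})


def term_doc_matrix(training_docs, vocab):
    """Rows are vocabulary terms, columns are training documents (raw counts)."""
    return [[doc.lower().split().count(term) for doc in training_docs]
            for term in vocab]


def project(text, vocab, matrix):
    """Compute M^T v, where v is the term-count vector of text."""
    tokens = text.lower().split()
    counts = [tokens.count(term) for term in vocab]
    n_docs = len(matrix[0])
    return [sum(counts[i] * matrix[i][j] for i in range(len(vocab)))
            for j in range(n_docs)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy parallel training corpus: en_train[i] and es_train[i] are translations.
en_train = ["bank interest rate", "football match result"]
es_train = ["banco tasa interés", "fútbol partido resultado"]
en_vocab, es_vocab = vocabulary(en_train), vocabulary(es_train)
A = term_doc_matrix(en_train, en_vocab)
B = term_doc_matrix(es_train, es_vocab)

query_vec = project("interest rate", en_vocab, A)
finance_doc = project("tasa del banco", es_vocab, B)
sports_doc = project("resultado del partido", es_vocab, B)
print(cosine(query_vec, finance_doc), cosine(query_vec, sports_doc))
```

The English query and the Spanish documents share no words; they become comparable only because both are expressed in terms of the same aligned training documents.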
27Attempted experiment system
- ARCTOS: An interactive search engine whose selection
interface shows how thumbnail images can support document
selection without knowing the document's language. ARCTOS
was developed by the Computing Research Lab at New Mexico
State University.
- MULINEX (Multilingual Indexing, Navigation and Editing
Extensions for the WWW): A multilingual search engine for
the WWW. It has collected around 5000 French, English, and
German web pages related to the European Currency Union
(http://mulinex.dfki.de/demo.html)
28Attempted experiment system
- MUNDIAL
- MINDS (Multilingual Interactive Document Summarization):
summarization in Spanish, Japanese, Korean, and English
- CANAL (Cataloging with Multilingual Natural Language
Access/Linguistic Server)
- KEIZAI
- - Uses Unicode. For users who do not know the target
language (Japanese, Korean, and English)
29User Studies
- Keizai (1999): The project focused on understanding how to
create interfaces and systems that are useful to people,
through empirical user studies evaluating retrieval
visualization interfaces.
- Keizai is a cross-language interactive retrieval system
that uses URSA (Unicode Retrieval System Architecture),
developed at the Computing Research Laboratory at New Mexico
State University.
30System Evaluation
- TREC (Text REtrieval Conference) in the US
- - TREC-3: Spanish / TREC-4: Chinese
- - TREC-6: SMART - balance query (English and French)
- - TREC-7: English, French, German, and Italian
- CLEF (Cross-Language Evaluation Forum) in Europe
- NTCIR (National Institute of Informatics-NACSIS Test
Collection for IR Systems) in Japan
31System Evaluation
- Arctos project: user involvement. The user was asked to
interactively improve the query translation using bilingual
translation resources. The retrieved documents were
presented using document thumbnails and query highlighting.
- TITAN (Kikui et al., 1996): Japanese-English
- Multilingual Gisting (Resnik, 1997)
32User Interactive CLIR
- Ogden et al. (1999) implemented a more interactive
approach: their system, Arctos, provides the user with a
browser-based interface for entering English queries. After
an initial query is entered, it is translated using a simple
word-for-word or phrasal translator. The user can then
interactively improve the query translation using links to
online bilingual translation resources, and then submit the
query for retrieval against document collections in the
target language.
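The interaction described above can be sketched in two steps: an initial phrasal-then-word-for-word translation, followed by a refinement step in which the user overrides the system's default choices. The lexicons, the Spanish target language, and the function names are illustrative assumptions; they do not describe how Arctos is actually built.

```python
# Hypothetical toy lexicons (English to Spanish, illustrative only).
PHRASES = {("interest", "rate"): "tasa de interés"}
WORDS = {"bank": ["orilla", "banco"], "central": ["central"]}


def initial_translation(query):
    """Translate two-word phrases first, then fall back to single words.

    Returns (source unit, candidate translations) pairs. Unknown words
    are passed through unchanged.
    """
    tokens = query.lower().split()
    result = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            result.append((" ".join(pair), [PHRASES[pair]]))
            i += 2
        else:
            result.append((tokens[i], WORDS.get(tokens[i], [tokens[i]])))
            i += 1
    return result


def refine(translation, user_choices):
    """Build the final query, preferring the user's choice for each unit.

    Without a user choice, the first candidate is used as the default.
    """
    final = []
    for source, candidates in translation:
        final.append(user_choices.get(source, candidates[0]))
    return " ".join(final)


draft = initial_translation("central bank interest rate")
print(refine(draft, {}))                  # system defaults only
print(refine(draft, {"bank": "banco"}))   # after the user fixes "bank"
```

The default translation picks the wrong sense of "bank"; the second call shows how a single user correction repairs the query before it is submitted for retrieval.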
33User Interactive CLIR
- User-aided query translation
- The user can disambiguate query terms
- (e.g., query expansion)
- Query formulation
- Davis and Ogden (1997) implemented the QUILT system, which
displays the Spanish translation of English query terms.
Relevance is judged from a summary shown in English
translation.
- (cited in Oard and Diekema, 1998, p. 229)
34Conclusion
- More CLIR user studies are needed (on users'
information-seeking behavior and the characteristics of
their particular domains) so that systems reflect users'
real needs and provide up-to-date information.
- CLIR evaluations need to pay more attention to users'
interaction with the system, as well as to system
performance.
- More efficient interfaces are needed to reflect users'
needs.
35References
- Braschler, M., & Schäuble, P. (2000). Using Corpus-Based
Approaches in a System for Multilingual Information
Retrieval. Information Retrieval, 3, 273-284.
- Grefenstette, G. (Ed.). (1998). Cross-Language Information
Retrieval. Norwell, MA: Kluwer.
- Lee, K., Kageura, K., & Choi, K. (2004). Implicit ambiguity
resolution using incremental clustering in cross-language
information retrieval. Information Processing and
Management, 40, 145-159.
- Oard, D., & Diekema, A. (1998). Cross-Language Information
Retrieval. Annual Review of Information Science and
Technology (ARIST), 33, 223-256.
36References
- Ogden, D., Cowie, J., Davis, M., Ludovik, E.,
Molina-Salado, H., & Shin, H. (1999). Getting Information
from Documents You Cannot Read: An Interactive
Cross-Language Text Retrieval and Summarization System.
Joint ACM Digital Library/SIGIR Workshop on Multilingual
Information Discovery and Access (MIDAS).
- Peters, C. (Ed.). (2000). Cross-Language Information
Retrieval and Evaluation: Workshop of the Cross-Language
Evaluation Forum, CLEF 2000, Lisbon, Portugal, September
2000. Berlin: Springer-Verlag.
- Sheridan, P., & Ballerini, J. (1996). Experiments in
multilingual information retrieval using the SPIDER system.
ACM, 58-65.