Cross-Language Information Retrieval (CLIR)

1
Cross-Language Information Retrieval (CLIR)

614 Information Retrieval Theory
YooJin Ha
12/17/2004

2
Content

Definition
Users' needs
General problem
Organization of CLIR model
Matching methods
Experiments
Future expectation

3
What is CLIR?

CLIR (Cross-Language Information Retrieval) is a
retrieval system that takes queries in one
language and retrieves documents in other
languages. It allows users to access information
written in the languages of their choice.
CLIR can enhance users' ability to search and
retrieve documents in many languages.

4
Who will be users of a CLIR system?

1. Bilingual users who read their second language
well but have weaker productive skills in it, and
thus cannot express their information need in
their second language as well as they can in
their first language.
2. Monolingual users who are interested in or
need such information for their research but face
practical constraints (such as cost and time)
before committing resources to a full translation.
(Ogden et al., 1999; Abdelali et al., 2002)

5
Users' needs in query formulation

According to my survey (on OPAC and WorldCat
usage), most users want to search in their own
language. Americans take it for granted that they
can present the system with queries in English.
This forces all users to enter queries the same
way. Korean users, for example, would want to
present queries in their native language, just as
Chinese and Japanese users would. If WorldCat
could evolve into a cross-language system,
catalogers and the general public would have
greater access to and use of it.

6
Multilanguage examples

Rutgers University Libraries WorldCat
Arctos Interactive Multilingual Search with Ursa

7
History

1964: The International Road Research
Documentation system used a controlled-vocabulary
thesaurus in English, French, and German.
1970: Salton conducted experiments retrieving
documents in a language other than that of the
query, augmenting the SMART system with
hand-constructed bilingual thesauri.
1978: ISO adopted standard ISO 5964 on the
construction of multilingual thesauri.
(Oard & Diekema, 1998)

8
Problems

Languages differ in structure (e.g. stemming,
which is widely used to improve recall, depends
on the language).
Difficulties in normalization (e.g. Korean is
morphologically rich, so words are difficult to
decompose).
A CLIR system must be adapted to the
characteristics of each language it handles.
Disambiguation (a query term often has several
possible meanings, so it is difficult to find the
correct translation).
None of these has been fully resolved yet.

9
Some solutions

In stemming, statistical dependence based on
co-occurrence of terms (e.g. Spanish has more
forms of each verb) (Croft et al., 1996)
Dictionary combined with a mechanism to string
word meanings together (e.g. German) (Sheridan &
Ballerini, 1996)
Automatic query enrichment using local feedback
and local context analysis (Ballesteros & Croft,
1998)
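
A minimal sketch of query enrichment by local feedback, added for illustration. It is not the method of the cited papers: the ranking function, the parameter names (top_docs, new_terms) and the toy documents are all assumptions. The idea shown is only that the top-ranked documents for the initial query supply extra terms for a second, richer query.

```python
from collections import Counter

def tokenize(text):
    return text.lower().split()

def rank(query_terms, docs):
    """Rank document indices by the number of query-term occurrences."""
    scores = []
    for i, doc in enumerate(docs):
        counts = Counter(tokenize(doc))
        scores.append((sum(counts[t] for t in query_terms), i))
    # Highest score first, ties broken by document index; drop non-matching docs.
    return [i for s, i in sorted(scores, key=lambda p: (-p[0], p[1])) if s > 0]

def expand_query(query, docs, top_docs=2, new_terms=2, stopwords=frozenset()):
    """Add the most frequent terms of the top-ranked documents to the query."""
    terms = tokenize(query)
    pool = Counter()
    for i in rank(terms, docs)[:top_docs]:
        pool.update(t for t in tokenize(docs[i])
                    if t not in terms and t not in stopwords)
    # Most frequent first, alphabetical order as a deterministic tie-break.
    added = [t for t, _ in sorted(pool.items(), key=lambda p: (-p[1], p[0]))]
    return terms + added[:new_terms]

# Toy collection (hypothetical data).
docs = [
    "euro currency union bank",
    "currency bank euro rates",
    "football match results",
]
print(expand_query("euro", docs))
```

In a cross-language setting the same step can be run before translation (to add context to a short query) or after it (to recover from poor translations); which works better is an empirical question.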

10
Machine translation (MT)

Many people assume that MT can be applied to any
system. However, it often makes errors, and only
a few examples of research on MT specifically for
CLIR are available.
MT works well only for specific domains, such as
those with specialized terminology. (Oard & Dorr,
1996)
SYSTRAN Language Translation Technology
AltaVista - Babel Fish Translation

11
Organization of CLIR system - Oard (1997)
system model

1. Document pre-processing (collection)
2. Query formulation
3. Matching
4. Selection
5. Examination
6. Relevance feedback
(cited in Oard & Diekema, 1998)

12
Matching strategies (Oard & Diekema, 1998)

13
Organization of CLIR system

Matching
- Cognate matching
- Translation: query translation, document
translation, interlingual techniques
- Translation knowledge: ontology, bilingual
dictionary, machine translation lexicons, corpora
(Oard and Diekema, 1998)

14
Matching - Cognate matching

Guess the meaning based on similarities in
spelling or pronunciation (e.g. proper nouns and
technical terms)
- Buckley et al. (1998): letter sequences
with similar sounds (e.g. c, k, qu)
- Davis extended this technique, using fuzzy
matching to discover Spanish cognates for
English terms.
This method is mostly used in combination with
other major translation methods.
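
Cognate matching can be sketched roughly as follows. This is an illustrative approximation, not the procedure of Buckley et al. or Davis: the sound-class table, the 0.7 threshold and the function names are assumptions, and the similarity measure is the ratio from Python's standard difflib module.

```python
from difflib import SequenceMatcher

# Hypothetical sound classes: letter sequences that often sound alike
# are folded to one symbol before comparison (e.g. c, k, qu).
SOUND_CLASSES = [("qu", "k"), ("c", "k"), ("ph", "f")]

def fold(word):
    """Lowercase a word and replace similar-sounding letter sequences."""
    word = word.lower()
    for src, dst in SOUND_CLASSES:
        word = word.replace(src, dst)
    return word

def cognate_score(a, b):
    """Similarity in [0, 1] between the folded spellings of two words."""
    return SequenceMatcher(None, fold(a), fold(b)).ratio()

def find_cognates(term, vocabulary, threshold=0.7):
    """Return vocabulary words whose folded spelling is close to the term."""
    scored = [(cognate_score(term, w), w) for w in vocabulary]
    return [w for s, w in sorted(scored, reverse=True) if s >= threshold]

# Toy Spanish vocabulary (hypothetical data).
print(find_cognates("electric", ["eléctrico", "casa", "libro"]))
```

Because false friends also score highly, a filter like this is normally used as a fallback for terms that a dictionary cannot translate, such as proper nouns.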

15
Matching - Translation

Query translation
Document translation
Interlingual techniques

16
Matching - Translation

Most systems apply query translation first
because it is more efficient than document
translation in time and cost.
Risk of query translation: queries often do not
contain enough context to permit good translation.
Some experimental systems have tried to provide
translated summaries for a small number of
retrieved documents (e.g. Keizai, MULINEX).
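
To make the risk concrete, here is a small sketch of dictionary-based, word-for-word query translation. The dictionary entries and function names are invented for illustration. With no context to choose between senses, every alternative translation is kept, which is cheap but adds noise to the translated query.

```python
# Toy English-to-Spanish dictionary (hypothetical entries).
EN_ES = {
    "bank": ["banco", "orilla"],
    "interest": ["interés"],
    "rate": ["tasa", "tarifa"],
}

def translate_query(query, dictionary):
    """Translate each query term independently, keeping all alternatives."""
    translated = []
    for term in query.lower().split():
        # Terms missing from the dictionary (often proper nouns) pass through.
        translated.append(dictionary.get(term, [term]))
    return translated

def flatten(translated):
    """Turn the per-term alternatives into one bag of target-language terms."""
    return [t for alternatives in translated for t in alternatives]

alternatives = translate_query("bank interest Bonn", EN_ES)
print(alternatives)
print(flatten(alternatives))
```

Only the short query is translated here, never the collection, which is why this route is cheaper than document translation.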

17
Matching - Translation

Interlingual techniques
Based on multilingual thesauri.
Controlled-vocabulary techniques are applied to
convert both the query and the documents into a
unified, language-independent representation
(e.g. Latent Semantic Indexing (LSI), Dumais et
al., 1997).
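
The interlingual idea can be illustrated with a toy multilingual thesaurus that maps terms in each language to shared concept identifiers. The thesaurus contents, the concept IDs and the function names below are hypothetical. The sketch only shows that, once the query and the documents are both reduced to concepts, matching no longer depends on language pairs.

```python
# Hypothetical multilingual thesaurus: term -> language-independent concept ID.
THESAURUS = {
    "en": {"car": "C1", "automobile": "C1", "road": "C2"},
    "de": {"auto": "C1", "wagen": "C1", "strasse": "C2"},
    "fr": {"voiture": "C1", "route": "C2"},
}

def to_concepts(text, lang):
    """Map the known terms of a text to their concept IDs."""
    table = THESAURUS[lang]
    return {table[t] for t in text.lower().split() if t in table}

def retrieve(query, query_lang, docs):
    """docs is a list of (doc_id, lang, text); rank by shared concepts."""
    q = to_concepts(query, query_lang)
    hits = []
    for doc_id, lang, text in docs:
        overlap = len(q & to_concepts(text, lang))
        if overlap:
            hits.append((overlap, doc_id))
    # Largest overlap first, document ID as a deterministic tie-break.
    return [d for _, d in sorted(hits, key=lambda h: (-h[0], h[1]))]

docs = [
    ("d1", "de", "das auto auf der strasse"),
    ("d2", "fr", "la voiture rouge"),
    ("d3", "de", "das wetter heute"),
]
print(retrieve("car road", "en", docs))
```

Adding a language means adding one term-to-concept table rather than a translation resource for every language pair.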

18
Matching - Translation: Interlingua approach
(example)

TextWise's CINDOR (Conceptual Interlingua
approach)
Both documents and queries are mapped into the
Conceptual Interlingua.
Permits matching and retrieval based on any
combination of the languages involved, rather
than relying on pairwise translations.
Enables language-independent retrieval based on
natural language concepts and matching to
synonyms in all languages.
(Liddy and Sheridan, 1999)
Cindorsearch

19
TextWise's CINDOR (Liddy and Sheridan, 1999)

20
Matching

3. Translation knowledge
- Ontology,
- Bilingual dictionary,
- Machine translation lexicons,
- Corpora

21
Matching - 3. Translation knowledge: Ontology

The most dominant and effective source for CLIR
(e.g. thesauri, which support controlled-vocabulary
and free-text retrieval, hierarchical
relationships (broader and narrower terms),
synonymy, and related terms).
Example: the SPIDER system, using a multilingual
thesaurus and query expansion with Italian and
German (Sheridan & Ballerini, 1996).
It helps users define better queries but requires
them to understand the thesaurus structure. Given
easy explanations or examples, it can be a good
aid. (The user must create the query using only
vocabulary from the thesaurus.)

22
Matching - 3. Translation knowledge: Bilingual
dictionary

It has been widely used. It is useful because it
can improve users' query formulation.
Problems: ambiguity (selecting the correct terms)
and limited scope.
- Abdelali et al. (2002) attempted to present
visual dictionaries for users who have no
knowledge of the target language. It is a useful
attempt at a query-interactive approach.

23
Translation knowledge: Machine Translation
Lexicon

To overcome the limitations of general-purpose
transfer dictionaries, Salton (1971) used tuned
lexicons and thesauri built from controlled
vocabulary, with good success on specific text
retrieval problems.
- Preparing special-purpose lexical resources
remains a daunting task.
(cited in Ogden et al., 1999)

24
Translation knowledge: parallel/comparable corpora

Text corpora contain examples of usage patterns
in the query language that can be matched to
examples in the target language if the sentences
or paragraphs of the texts are aligned with one
another.
Braschler and Schäuble (2000) applied this
approach and obtained useful alignment sets
among German, French, English, and Italian
documents. One aspect of the theory is that end
users can be best served by efficient,
inexpensive systems.

25
Example of an alignment pair (Corpus-based
approaches)

German and French documents, only titles are
shown.
Takeshita zu Antrittsbesuch in Bonn
eingetroffen
(Takeshita has arrived in Bonn for first
visit)
Arrivée en RFA du premier ministre japonais
(Arrival of the Japanese prime minister in
the FRG)
Alignments are produced by using indicators to
find similarities between pairs of documents.
(Braschler & Schäuble, 2000)
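
A rough sketch of indicator-based alignment follows. The choice of indicators (numbers and capitalized words of four or more letters), the Jaccard measure, the threshold and the toy texts are all assumptions made for illustration. The slide does not say which indicators or similarity measure the cited system uses.

```python
import re

def indicators(text):
    """Language-independent cues: numbers and capitalized words (4+ letters)."""
    numbers = set(re.findall(r"\d+", text))
    names = set(re.findall(r"\b[A-Z][a-z]{3,}\b", text))
    return numbers | names

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def align(source_docs, target_docs, threshold=0.2):
    """Pair each source document with its most similar target document."""
    pairs = []
    for sid, stext in source_docs.items():
        s_ind = indicators(stext)
        best_id, best_score = None, 0.0
        for tid, ttext in target_docs.items():
            score = jaccard(s_ind, indicators(ttext))
            if score > best_score:
                best_id, best_score = tid, score
        if best_id is not None and best_score >= threshold:
            pairs.append((sid, best_id))
    return pairs

# Invented toy documents, not taken from the cited corpus.
german = {"g1": "Gipfel in Bonn am 12 Maerz mit 300 Delegierten"}
french = {
    "f1": "Sommet a Bonn le 12 mars avec 300 delegues",
    "f2": "Match a Paris le 5 juin",
}
print(align(german, french))
```

The two titles on the slide share no words at all, so cues of this kind would have to come from the document bodies rather than from the titles.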

26
Translation knowledge: parallel/comparable corpora

Generalized Vector Space Model
- uses a bilingual training corpus to build
matrices of document and term weights
(Carbonell et al., 1997)
Latent Semantic Indexing
- Telcordia Latent Semantic Indexing (LSI)
Demos
- LSI by Dumais, Landauer & Littman (SIGIR96)
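
The cross-language use of the Generalized Vector Space Model can be sketched as follows, under stated assumptions. Each term is represented by its weights over the documents of an aligned bilingual training corpus, so a query in one language and a document in the other can both be projected into the same "training document" space and compared by cosine. The tiny matrices, term names and helper names are invented, and the term weighting used in the cited work is not reproduced.

```python
import math

def doc_space(term_weights, term_doc):
    """Project a bag of terms into the space of training documents.

    term_doc[term] holds the term's weights over the training documents.
    """
    n = len(next(iter(term_doc.values())))
    vec = [0.0] * n
    for term, w in term_weights.items():
        for j, x in enumerate(term_doc.get(term, [0.0] * n)):
            vec[j] += w * x
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy training corpus of three aligned document pairs (hypothetical weights).
# Column j of each table refers to the j-th aligned pair.
EN = {"bank": [1, 0, 0], "money": [1, 1, 0], "river": [0, 0, 1]}
ES = {"banco": [1, 0, 0], "dinero": [1, 1, 0], "rio": [0, 0, 1]}

query = doc_space({"money": 1}, EN)             # English query
doc_a = doc_space({"dinero": 1, "banco": 1}, ES)  # Spanish document
doc_b = doc_space({"rio": 1}, ES)                 # Spanish document
print(cosine(query, doc_a), cosine(query, doc_b))
```

No dictionary is involved: the only link between the two languages is that column j in both tables refers to the same aligned pair of training documents.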

27
Experimental systems

ARCTOS: An interactive search engine whose
selection interface shows how thumbnail images
can be used to support selection without knowing
the document's language. ARCTOS was developed by
the Computing Research Lab at New Mexico State
University.
MULINEX (Multilingual Indexing, Navigation and
Editing Extensions for the WWW): a multilingual
search engine for the WWW. It has collected
around 5000 French, English, and German web pages
related to the European Currency Union
(http://mulinex.dfki.de/demo.html)

28
Experimental systems

MUNDIAL
MINDS (Multilingual Interactive Document
Summarization): summarization in Spanish,
Japanese, Korean, and English
CANAL (Cataloging with Multilingual Natural
Language Access/Linguistic Server)
KEIZAI
Uses Unicode. For users who do not know the
target language (Japanese, Korean, and English)

29
User Studies

Keizai (1999): The project focused on
understanding how to create interfaces and
systems that are useful to people, through
empirical user studies evaluating retrieval
visualization interfaces.
Keizai is a cross-language interactive retrieval
system that uses URSA (Unicode Retrieval System
Architecture), developed at the Computing
Research Laboratory at New Mexico State
University.

30
System Evaluation

TREC (Text REtrieval Conference) in the US
TREC-3: Spanish / TREC-4: Chinese
TREC-6: SMART - balance query,
English and French
TREC-7: English, French, German, and
Italian
CLEF (Cross-Language Evaluation Forum) in Europe
NTCIR (National Institute of Informatics-NACSIS
Test Collection for IR Systems) in Japan

31
System Evaluation

Arctos project: User involvement. The user was
asked to interactively improve the query
translation using bilingual translation
resources. The retrieved documents were presented
using document thumbnails and query highlighting.
TITAN: Kikui et al., 1996, Japanese-English
Multilingual Gisting: Resnik, 1997

32
User Interactive CLIR

Ogden et al. (1999) implemented a more
interactive approach. Their system, Arctos,
provides the user with a browser-based interface
in which to enter English queries. After an
initial query is entered, it is translated using
a simple word-for-word or phrasal translator. The
user can then interactively improve the query
translation using links to online bilingual
translation resources and submit the query for
retrieval against document collections in the
target language.

33
User Interactive CLIR

User-aided query translation
The user can disambiguate query terms
(e.g. query expansion).
Query formulation
Davis and Ogden (1997) implemented the QUILT
system, which displays the Spanish translations
of English query terms.
Judging relevance
Showing a summary in English translation
(Cited in Oard and Diekema, 1998, p. 229)
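
User-aided query translation might look roughly like the sketch below. It does not reproduce QUILT or Arctos: the dictionary, the function names and the way the user's choices are passed in are assumptions. The system proposes candidate translations for each term, the user picks a sense where possible, and any term left undecided keeps all of its alternatives.

```python
def candidate_translations(query, dictionary):
    """Candidate target-language translations for each query term."""
    return {t: dictionary.get(t, [t]) for t in query.lower().split()}

def apply_choices(candidates, choices):
    """Build the final query from the user's selections.

    choices maps a source term to the translation the user selected.
    """
    final = []
    for term, options in candidates.items():
        picked = choices.get(term)
        if picked in options:
            final.append(picked)
        else:
            # No (valid) selection: fall back to every alternative.
            final.extend(options)
    return final

# Toy English-to-Spanish dictionary (hypothetical entries).
EN_ES = {"spring": ["primavera", "muelle", "manantial"], "water": ["agua"]}

candidates = candidate_translations("spring water", EN_ES)
print(candidates)                                      # shown to the user
print(apply_choices(candidates, {"spring": "manantial"}))
```

The fallback branch means a user who cannot judge the target-language terms still gets an unfiltered translated query.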

34
Conclusion

More CLIR user studies are needed (on users'
information-seeking behavior and the
characteristics of their particular domains) in
order to reflect users' real needs and provide
up-to-date information.
CLIR evaluations need to pay more attention to
users' interaction with the system as well as to
system performance.
More efficient interfaces are needed in order to
reflect users' needs.

35
References

Braschler, M., & Schäuble, P. (2000). Using
Corpus-Based Approaches in a System for
Multilingual Information Retrieval. Information
Retrieval, 3, 273-284.
Grefenstette, G. (Ed.). (1998). Cross-Language
Information Retrieval. Norwell, MA: Kluwer.
Lee, K., Kageura, K., & Choi, K. (2004). Implicit
ambiguity resolution using incremental clustering
in cross-language information retrieval.
Information Processing and Management, 40,
145-159.
Oard, D., & Diekema, A. (1998). Cross-Language
Information Retrieval. Annual Review of
Information Science and Technology (ARIST), 33,
223-256.

36
References

Ogden, D., Cowie, J., Davis, M., Ludovik, E.,
Molina-Salado, H., & Shin, H. (1999). Getting
Information from Documents You Cannot Read: An
Interactive Cross-Language Text Retrieval and
Summarization System. Joint ACM Digital
Library/SIGIR Workshop on Multilingual
Information Discovery and Access (MIDAS).
Peters, C. (Ed.). (2000). Cross-Language
Information Retrieval and Evaluation: Workshop of
the Cross-Language Evaluation Forum, CLEF 2000,
Lisbon, Portugal, September 2000. Berlin:
Springer-Verlag.
Sheridan, P., & Ballerini, J. (1996). Experiments
in multilingual information retrieval using the
SPIDER system. ACM, 58-65.