1
Cross-Language Information Retrieval (CLIR)
  • 614 Information Retrieval Theory
  • YooJin Ha
  • 12/17/2004

2
Content
  • Definition
  • User needs
  • General problem
  • Organization of CLIR model
  • Matching methods
  • Experiments
  • Future expectation

3
What is CLIR?
  • CLIR (Cross-Language Information Retrieval) is a
    retrieval system that takes queries in one
    language and retrieves documents in other
    languages. It allows users to access information
    written in the languages of their choice.
  • CLIR can enhance users' ability to search and
    retrieve documents in many languages.

4
Who will be users of a CLIR system?
  • 1. Bilingual users who read their second
    language well but have weaker productive skills,
    and thus cannot express their information need
    in their second language as well as they can in
    their first language.
  • 2. Monolingual users who are interested in, or
    need, such information for their research but
    face practical constraints (such as cost and
    time) before resources for a full translation
    become available.
  • (Ogden et al., 1999; Abdelali et al., 2002)

5
Users' needs in query formulation
  • According to my survey (of OPAC and WorldCat
    usage), most users want to search in their own
    language.
  • Americans take it for granted that they can
    present the system with queries in English. This
    forces all users to enter queries the same way,
    whereas Korean users, for example, would want to
    present queries in their native language, just
    as Chinese and Japanese users would.
  • If WorldCat can morph into a cross-language
    system, catalogers and the general public would
    have greater access to and use of the system.

6
Multilanguage examples
  • Rutgers University Libraries WorldCat
  • Arctos Interactive Multilingual Search with Ursa

7
History
  • 1964: The International Road Research
    Documentation system used a controlled-vocabulary
    thesaurus in English, French, and German.
  • 1970: Salton ran experiments on retrieving
    documents in a language other than that of the
    query, augmenting the SMART system with a
    hand-constructed bilingual term thesaurus.
  • 1978: ISO adopted ISO standard 5964 on the
    construction of multilingual thesauri.
  • (Oard & Diekema, 1998)

8
Problems
  • Different languages have different structures
    (e.g., stemming, which is widely used to increase
    recall, depends on the language).
  • Difficulties in normalization (e.g., Korean is
    morphologically rich, so words are difficult to
    decompose).
  • A CLIR system must be adapted to the
    characteristics of whichever languages it will
    handle.
  • Disambiguation: it is difficult to find the
    correct term because a query term often has
    several possible meanings.
  • No approach to these problems has been fully
    successful yet.

9
Some solutions
  • In stemming: statistical dependence based on
    co-occurrence of terms (e.g., Spanish, which has
    more forms of each verb) (Croft et al., 1996)
  • A dictionary combined with a mechanism to string
    word meanings together (e.g., German) (Sheridan
    & Ballerini, 1996)
  • Automatic query enrichment using local
    feedback and local context analysis (Ballesteros
    & Croft, 1998)

10
Machine translation (MT)
  • Many people readily assume that MT can be applied
    to any system. However, it often makes errors,
    and there are only a few examples of research on
    MT with specific regard to CLIR.
  • MT works well only for specific domains, such as
    those with specialized terminology (Oard & Dorr,
    1996).
  • SYSTRAN Language Translation Technology
  • AltaVista - Babel Fish Translation

11
Organization of CLIR system - Oard (1997) system model
  • 1. Document pre-processing (collection)
  • 2. Query formulation
  • 3. Matching
  • 4. Selection
  • 5. Examination
  • 6. Relevance feedback
  • (cited in Oard & Diekema, 1998)

12
Matching strategies (Oard & Diekema, 1998)

13
Organization of CLIR system
  • Matching
  • Cognate matching
  • Translation: query translation, document
    translation, interlingual techniques
  • Translation knowledge: ontology, bilingual
    dictionary, machine translation lexicons, corpora
  • (Oard and Diekema, 1998)

14
Matching - Cognate matching
  • Guess the meaning based on similarities in
    spelling or pronunciation (e.g., proper nouns and
    technical terms)
  • - Buckley et al. (1998): matched letter sequences
    with similar sounds (e.g., c, k, qu)
  • - Davis extended this technique, using fuzzy
    matching to discover Spanish cognates for
    English terms.
  • This method is mostly combined with other major
    translation methods.
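
As a rough illustration of the cognate-matching idea on this slide, the sketch below collapses letter sequences with similar sounds (c, k, qu) into one symbol and then scores spelling similarity with Python's standard difflib module. The normalization rules, the 0.8 threshold, the sample vocabulary, and all function names are assumptions made for illustration; they are not the rules used by Buckley et al. or by Davis.

```python
from difflib import SequenceMatcher

# Hypothetical sound classes: "qu", "c" and "k" are all mapped to "k".
SOUND_RULES = [("qu", "k"), ("c", "k")]


def normalize(word):
    """Lowercase a word and collapse similar-sounding letter sequences."""
    word = word.lower()
    for old, new in SOUND_RULES:
        word = word.replace(old, new)
    return word


def cognate_score(a, b):
    """Return a similarity ratio in [0, 1] between two normalized words."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_cognates(term, vocabulary, threshold=0.8):
    """Return vocabulary words scoring at or above the threshold, best first."""
    scored = sorted(((cognate_score(term, w), w) for w in vocabulary), reverse=True)
    return [w for score, w in scored if score >= threshold]


# Invented Spanish vocabulary for the demonstration.
spanish_vocab = ["electrónico", "técnica", "casa", "química"]
print(find_cognates("electronic", spanish_vocab))
```

Because a fuzzy score like this also accepts false friends, it fits the slide's point that cognate matching is usually combined with another translation method rather than used alone.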

15
Matching - Translation
  • Query translation
  • Document translation,
  • Interlingual techniques

16
Matching - Translation
  • Most systems apply query translation first
    because it is more efficient than document
    translation in time and cost.
  • Risk of query translation: queries often do not
    contain enough context to permit good translation.
  • Some experimental systems provide translated
    summaries for a small number of retrieved
    documents (e.g., Keizai, MULINEX).

17
Matching - Translation
  • Interlingual techniques
  • Based on multilingual thesauri:
    controlled-vocabulary techniques are applied to
    convert both the query and the documents into a
    unified, language-independent representation
    (e.g., Latent Semantic Indexing (LSI); Dumais et
    al., 1997).

18
Matching - Translation: Interlingua approach (example)
  • TextWise's CINDOR (Conceptual Interlingua
    approach)
  • Both documents and queries are mapped into the
    Conceptual Interlingua.
  • Permits matching and retrieval based on any
    combination of the languages involved, rather than
    relying on pairwise translations.
  • Enables language-independent retrieval based on
    natural language concepts and matching to
    synonyms in all languages.
  • (Liddy and Sheridan, 1999)
  • Cindorsearch
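
The interlingua idea can be sketched in a few lines: terms from every language are mapped to language-independent concept identifiers, and matching happens on those identifiers instead of on translated words. The concept lexicon, concept IDs, and function names below are invented for illustration and do not reflect how CINDOR is actually implemented.

```python
# Hypothetical concept lexicon: (language, term) -> language-independent concept ID.
CONCEPTS = {
    ("en", "bank"): "C_FINANCIAL_INSTITUTION",
    ("es", "banco"): "C_FINANCIAL_INSTITUTION",
    ("de", "bank"): "C_FINANCIAL_INSTITUTION",
    ("en", "currency"): "C_MONEY",
    ("es", "moneda"): "C_MONEY",
    ("de", "währung"): "C_MONEY",
}


def to_concepts(text, lang):
    """Map the known terms of a text to their concept IDs."""
    return {CONCEPTS[(lang, t)] for t in text.lower().split() if (lang, t) in CONCEPTS}


def rank(query, query_lang, documents):
    """Rank (doc_id, lang, text) documents by number of shared concepts."""
    query_concepts = to_concepts(query, query_lang)
    results = []
    for doc_id, lang, text in documents:
        overlap = len(query_concepts & to_concepts(text, lang))
        if overlap:
            results.append((doc_id, overlap))
    # Highest overlap first; ties broken by document ID.
    results.sort(key=lambda r: (-r[1], r[0]))
    return results


docs = [
    ("d1", "es", "el banco cambia la moneda"),
    ("d2", "de", "die währung ist stabil"),
    ("d3", "es", "la casa es grande"),
]
print(rank("currency bank", "en", docs))
```

Adding a language in this scheme means adding entries to the lexicon, not building a new translation resource for every language pair, which is the advantage over pairwise translation that the slide describes.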

19
TextWise's CINDOR (Liddy and Sheridan, 1999)

20
Matching
  • 3. Translation knowledge
  • - Ontology,
  • - Bilingual dictionary,
  • - Machine translation lexicons,
  • - Corpora

21
Matching - 3. Translation knowledge: Ontology
  • The most dominant and effective source for CLIR
    (e.g., thesauri, which support controlled-vocabulary
    and free-text retrieval, hierarchical
    relationships (broader and narrower terms),
    synonymy, and related terms)
  • (e.g., the SPIDER system: multilingual thesaurus
    and query expansion with Italian and German)
    (Sheridan & Ballerini, 1996)
  • It helps users define better queries but requires
    them to understand the thesaurus structure (the
    user must create the query using only vocabulary
    from the thesaurus). Given easy explanations or
    examples, it can be a good aid.

22
Matching - 3. Translation knowledge: Bilingual dictionary
  • Widely used. It is useful because it can improve
    users' query formulation.
  • Problems: ambiguity (selecting the correct
    terms) and limited scope
  • - Abdelali et al. (2002) attempted to present
    visual dictionaries for users who do not know
    the target language. It is a useful attempt at a
    query-interactive approach.
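
The two problems named on this slide, ambiguity and limited scope, show up even in a toy dictionary lookup. The following is a minimal sketch; the English-to-Spanish entries and the function names are invented for illustration and are not taken from any of the cited systems.

```python
# Hypothetical English-to-Spanish dictionary; real dictionaries are far larger.
EN_ES = {
    "bank": ["banco", "orilla"],  # financial institution vs. river bank
    "interest": ["interés"],
    "rate": ["tasa", "ritmo"],
}


def translate_query(query, dictionary):
    """Map each query term to its list of candidate translations.

    Terms missing from the dictionary (limited scope) pass through unchanged.
    """
    return {term: dictionary.get(term, [term]) for term in query.lower().split()}


def ambiguous_terms(translated):
    """Return the source terms that have more than one candidate translation."""
    return [term for term, candidates in translated.items() if len(candidates) > 1]


translated = translate_query("bank interest rate Euribor", EN_ES)
print(translated)
print(ambiguous_terms(translated))
```

Here "bank" and "rate" each come back with two candidates, and "Euribor" is not covered at all, so the system has to pick a sense, keep every candidate, or ask the user.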

23
Translation knowledge: Machine translation lexicon
  • To overcome the limitations of general-purpose
    transfer dictionaries, Salton (1971) used tuned
    lexicons and thesauri built from controlled
    vocabulary, with good success in specific text
    retrieval problems.
  • - Preparing special-purpose lexical resources
    remains a daunting task.
  • (cited in Ogden et al., 1999)

24
Translation knowledge: parallel/comparable corpora
  • Text corpora contain examples of usage patterns
    in the query language that can be matched to
    examples in the target language if the sentences
    or paragraphs of the texts are aligned to one
    another.
  • Braschler and Schäuble (2000) applied this
    approach and obtained useful alignment sets
    among German, French, English, and Italian
    documents. One aspect of the theory is
    that end users can be best served by efficient,
    inexpensive systems.

25
Example of an alignment pair (corpus-based approaches)
  • German and French documents; only titles are
    shown.
  • Takeshita zu Antrittsbesuch in Bonn
    eingetroffen
  • (Takeshita has arrived in Bonn for a first
    visit)
  • Arrivée en RFA du premier ministre japonais
  • (Arrival of the Japanese prime minister in
    the FRG)
  • Alignments are produced by using indicators to
    find similarities between pairs of documents.
  • (Braschler & Schäuble, 2000)
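
One way to picture alignment by indicators is to extract features that survive translation, such as numbers and names, and pair the documents that share the most of them. The sketch below assumes that capitalized words and digit strings serve as indicators and that two shared indicators are enough to accept a pair; both choices, and the sample texts (which extend the titles above with invented wording), are illustrative assumptions and not the procedure of Braschler and Schäuble.

```python
import re


def indicators(text):
    """Extract simple language-independent indicators: numbers and capitalized words."""
    tokens = re.findall(r"\w+", text)
    return {t for t in tokens if t.isdigit() or t[0].isupper()}


def align(source_docs, target_docs, min_shared=2):
    """Pair each source document with the target document sharing the most indicators."""
    pairs = []
    for s_id, s_text in source_docs.items():
        s_ind = indicators(s_text)
        best_id, best_shared = None, 0
        for t_id, t_text in target_docs.items():
            shared = len(s_ind & indicators(t_text))
            if shared > best_shared:
                best_id, best_shared = t_id, shared
        if best_shared >= min_shared:
            pairs.append((s_id, best_id, best_shared))
    return pairs


# Invented sample texts for the demonstration.
german = {
    "de1": "Takeshita zu Antrittsbesuch in Bonn eingetroffen",
    "de2": "Bundesbank senkt Leitzins 1988",
}
french = {
    "fr1": "Arrivée en RFA du premier ministre japonais Takeshita à Bonn",
    "fr2": "La Bundesbank baisse son taux directeur en 1988",
}
print(align(german, french))
```

Note that the two titles shown on the slide share no words at all, so a real system would need to draw indicators from the document bodies or from additional sources such as dates and descriptors.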

26
Translation knowledge: parallel/comparable corpora
  • Generalized Vector Space Model
  • - uses a bilingual training corpus to build
    matrices of document and term weights
    (Carbonell et al., 1997)
  • Latent Semantic Indexing
  • - Telcordia Latent Semantic Indexing (LSI)
    Demos
  • - LSI by Dumais, Landauer & Littman (SIGIR '96)
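
In one common formulation of the cross-language Generalized Vector Space Model, a query and a document are each projected onto the documents of a parallel training corpus (the query through the source-language halves, the document through the target-language halves), and the two projections are compared with cosine similarity. The sketch below follows that formulation with raw term counts; the three-pair training corpus and the function names are invented for illustration, and the weighting in Carbonell et al. may differ.

```python
import math
from collections import Counter

# Hypothetical parallel training corpus: (English text, Spanish text) pairs.
TRAINING = [
    ("bank interest rate", "banco tasa interés"),
    ("river water flood", "río agua inundación"),
    ("football match goal", "fútbol partido gol"),
]


def to_training_space(text, side):
    """Project a text onto the training pairs (one dimension per pair).

    side=0 compares against the English halves, side=1 against the Spanish halves.
    """
    terms = Counter(text.lower().split())
    vector = []
    for pair in TRAINING:
        doc_terms = Counter(pair[side].split())
        vector.append(sum(n * doc_terms[t] for t, n in terms.items()))
    return vector


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity(english_query, spanish_doc):
    """Compare an English query with a Spanish document without translating either."""
    return cosine(to_training_space(english_query, 0), to_training_space(spanish_doc, 1))


print(similarity("interest rate", "la tasa del banco"))
print(similarity("interest rate", "el río tiene agua"))
```

No dictionary is involved: the query and the document never share a word, yet they match because both resemble the same training pair.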

27
Experimental systems
  • ARCTOS: An interactive search engine with a
    selection interface that shows how
    thumbnail images can be used to support selection
    without knowing the document's language. ARCTOS
    was developed by the Computing Research Lab at
    New Mexico State University.
  • MULINEX (Multilingual Indexing, Navigation and
    Editing Extensions for the WWW): A multilingual
    search engine for the WWW. It has collected
    around 5000 French, English, and German web pages
    related to the European Currency Union
    (http://mulinex.dfki.de/demo.html).

28
Experimental systems
  • MUNDIAL
  • MINDS (Multilingual Interactive Document
    Summarization): Summarization in Spanish,
    Japanese, Korean, and English
  • CANAL (Cataloging with Multilingual Natural
    Language Access/Linguistic Server)
  • KEIZAI
  • - Uses Unicode. Intended for users who do not know
    the target language (Japanese, Korean, and
    English)

29
User Studies
  • Keizai (1999): The project focused on
    understanding how to create interfaces and systems
    that are useful to people, through empirical user
    studies evaluating retrieval visualization
    interfaces.
  • Keizai is a cross-language interactive retrieval
    system that uses URSA (Unicode Retrieval System
    Architecture), developed at the Computing
    Research Laboratory at New Mexico State
    University.

30
System Evaluation
  • TREC (Text REtrieval Conference) in the US
  • - TREC-3: Spanish / TREC-4: Chinese
  • - TREC-6: SMART - balance query (English and
    French)
  • - TREC-7: English, French, German, and Italian
  • CLEF (Cross-Language Evaluation Forum) in Europe
  • NTCIR (National Institute of Informatics-NACSIS
    Test Collection for IR Systems) in Japan

31
System Evaluation
  • Arctos project: User involvement. The user was
    asked to interactively improve the query
    translation using bilingual translation
    resources. The retrieved documents were presented
    using document thumbnails and query highlighting.
  • TITAN (Kikui et al., 1996): Japanese-English
  • Multilingual gisting (Resnik, 1997)

32
User Interactive CLIR
  • Ogden et al. (1999) implemented a more interactive
    approach: the system, Arctos, provides the user
    with a browser-based interface in which to
    enter English queries. After an initial query is
    entered, it is translated using a simple
    word-for-word or phrasal translator. The user
    can then interactively improve the query
    translation using links to online bilingual
    translation resources and then submit the query
    for retrieval against document collections in the
    target language.
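
The workflow described here (an automatic first-pass translation that the user then corrects) can be sketched as a phrase-first, word-for-word translator that accepts user overrides. The phrase table, word list, and `choices` parameter below are hypothetical and only illustrate the interaction pattern; they are not taken from the Arctos implementation.

```python
# Hypothetical English-to-Spanish resources; the first listed sense is the default.
PHRASES = {"prime minister": "primer ministro"}
WORDS = {"visit": ["visita"], "bank": ["banco", "orilla"], "river": ["río"]}


def translate(query, choices=None):
    """Translate phrase-first, then word-for-word.

    `choices` maps a source word to the translation the user selected,
    overriding the default (first) dictionary sense.
    """
    choices = choices or {}
    tokens = query.lower().split()
    output, i = [], 0
    while i < len(tokens):
        two_words = " ".join(tokens[i:i + 2])
        if two_words in PHRASES:
            # A known two-word phrase is translated as a unit.
            output.append(PHRASES[two_words])
            i += 2
            continue
        word = tokens[i]
        candidates = WORDS.get(word, [word])  # unknown words pass through
        output.append(choices.get(word, candidates[0]))
        i += 1
    return " ".join(output)


first_pass = translate("river bank")
corrected = translate("river bank", choices={"bank": "orilla"})
print(first_pass)
print(corrected)
```

The first pass picks the default sense of "bank"; the second call applies the user's choice before the query is submitted for retrieval.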

33
User Interactive CLIR
  • User-aided query translation
  • - The user can disambiguate query terms
    (e.g., query expansion)
  • Query formulation
  • - Davis and Ogden (1997) implemented the
    QUILT system, which displays the
    Spanish translation of English query terms.
  • - Relevance is judged by showing a summary in
    English translation.
  • (cited in Oard and Diekema, 1998, p. 229)

34
Conclusion
  • More CLIR user studies are needed (on users'
    information-seeking behavior and the
    characteristics of their particular domains) in
    order to reflect their real needs and provide
    up-to-date information.
  • CLIR evaluations need to pay more attention to
    users' interaction with the system as well as
    to system performance.
  • More efficient interfaces are needed in order to
    reflect users' needs.

35
References
  • Braschler, M., & Schäuble, P. (2000). Using
    Corpus-Based Approaches in a System for
    Multilingual Information Retrieval. Information
    Retrieval, 3, 273-284.
  • Grefenstette, G. (Ed.). (1998). Cross-Language
    Information Retrieval. Norwell, MA: Kluwer.
  • Lee, K., Kageura, K., & Choi, K. (2004). Implicit
    ambiguity resolution using incremental clustering
    in cross-language information retrieval.
    Information Processing and Management, 40,
    145-159.
  • Oard, D., & Diekema, A. (1998). Cross-Language
    Information Retrieval. Annual Review of
    Information Science and Technology (ARIST), 33,
    223-256.

36
References
  • Ogden, D., Cowie, J., Davis, M., Ludovik, E.,
    Molina-Salgado, H., & Shin, H. (1999). Getting
    Information from Documents You Cannot Read: An
    Interactive Cross-Language Text Retrieval and
    Summarization System. Joint ACM Digital
    Library/SIGIR Workshop on Multilingual
    Information Discovery and Access (MIDAS).
  • Peters, C. (Ed.). (2000). Cross-Language
    Information Retrieval and Evaluation: Workshop of
    the Cross-Language Evaluation Forum, CLEF 2000,
    Lisbon, Portugal, September 2000. Berlin:
    Springer-Verlag.
  • Sheridan, P., & Ballerini, J. (1996). Experiments
    in multilingual information retrieval using the
    SPIDER system. ACM, 58-65.