Title: Multilingual Information Retrieval
1Multilingual Information Retrieval
- by Jeanine Lilleng, IDI, NTNU
2What is information retrieval?
- Information retrieval (IR) deals with the representation, storage, organisation of, and access to information items.
- Baeza-Yates and Ribeiro-Neto, in Modern Information Retrieval
3Problem
- The available amount of information is huge and ever increasing.
- We have little experience with handling these huge amounts of information.
- This information must be accessible to be usable.
- The technology for navigating these huge amounts of information is still quite immature.
4Applications of information retrieval
- Searching for information on the Internet.
- Searching for papers and books in digital libraries, or in normal libraries with a digital inventory.
- Any search in textual information.
5Motivation for Multilingual Information Retrieval
- The existing division of information due to language is artificial.
- Norwegian bilingual research community
- Still very immature technology.
- Existing technology should be adapted to be used with Norwegian.
6Multilingual information retrieval
- IR in information expressed in more than one language.
- IR in multilingual collections.
Collection: Documents or books that have something in common.
Multilingual Collection: A collection with content expressed in more than one language.
7MLIR solution
- Translate the query and/or the documents. This enables us to use traditional IR methods on the translated queries/documents.
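The query-translation route described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny Norwegian-to-English lexicon `NO_TO_EN`, the document set `DOCS`, and the function names `translate_query` and `rank` are all hypothetical, and the ranking is a plain TF-IDF sum standing in for whatever traditional IR method is actually used.

```python
import math
from collections import Counter

# Hypothetical toy Norwegian-to-English lexicon (illustrative only).
NO_TO_EN = {"hund": "dog", "katt": "cat", "mat": "food"}

# Hypothetical monolingual (English) collection.
DOCS = {
    "d1": "the dog eats dog food",
    "d2": "the cat sleeps",
    "d3": "food for the cat",
}

def translate_query(query):
    """Replace each query term by its translation; unknown terms pass through."""
    return [NO_TO_EN.get(term, term) for term in query.lower().split()]

def rank(terms, docs):
    """Rank documents by a simple TF-IDF sum over the (translated) query terms."""
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in terms:
            # Document frequency: number of documents containing the term.
            df = sum(1 for toks in tokenized.values() if term in toks)
            if df:
                score += tf[term] * math.log(1 + n_docs / df)
        scores[doc_id] = score
    # Keep only matching documents, best score first.
    return sorted((d for d in scores if scores[d] > 0), key=lambda d: -scores[d])

print(rank(translate_query("hund mat"), DOCS))
```

Once the Norwegian query "hund mat" has been translated, the retrieval step is ordinary monolingual IR, which is the point of the slide.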
8Issues in Multilingual Information Retrieval
- Takes more time to do the necessary processing.
- Inaccuracies due to translations can cause problems.
- Methods created for information retrieval are mostly language dependent and only applicable to one language at a time.
9Strategies for doing MLIR
- Machine translation
- Statistical methods
- Dictionary / Thesaurus driven methods
Thesaurus (Treasury): An extended dictionary including references between words and preferred words to be used.
10Machine Translation
- Automatically generated translation of query and/or document.
- Based on AI technology
- Pro: Makes documents in foreign languages available for people not speaking the language
- Con: Expensive
- Con: Language dependent
- Con: Bi-lingual technology
- Con: Cultural differences and ambiguities can introduce errors
11Statistical methods
- Several probable translations are suggested with different probabilities.
- Uses parallel corpora to mine probable translations.
- Pro: Methods are mostly language independent
- Pro: Domain independent
- Con: Requires parallel corpora
- Con: Computationally expensive
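As a rough illustration of mining probable translations from a parallel corpus, the sketch below runs a few EM iterations of IBM Model 1, a well-known word-alignment model. It is chosen here only as an example and is not necessarily the method meant on the slide. The three-sentence corpus `PARALLEL` and the names `train_ibm1` and `translations` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy parallel corpus (Norwegian, English); illustrative only.
PARALLEL = [
    ("en bok", "a book"),
    ("en bil", "a car"),
    ("min bil", "my car"),
]

def train_ibm1(pairs, iterations=20):
    """Estimate P(target word | source word) with IBM Model 1 EM (no NULL word)."""
    corpus = [(src.split(), tgt.split()) for src, tgt in pairs]
    tgt_vocab = {e for _, tgt in corpus for e in tgt}
    # Uniform initialisation over every co-occurring (target, source) pair.
    prob = {(e, f): 1.0 / len(tgt_vocab)
            for src, tgt in corpus for e in tgt for f in src}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each target word over the source words of its sentence.
        for src, tgt in corpus:
            for e in tgt:
                norm = sum(prob[(e, f)] for f in src)
                for f in src:
                    delta = prob[(e, f)] / norm
                    count[(e, f)] += delta
                    total[f] += delta
        # M-step: renormalise the expected counts per source word.
        prob = {(e, f): c / total[f] for (e, f), c in count.items()}
    return prob

def translations(word, prob):
    """Return candidate translations of a source word, most probable first."""
    cands = [(e, p) for (e, f), p in prob.items() if f == word]
    return sorted(cands, key=lambda pair: -pair[1])

model = train_ibm1(PARALLEL)
for source_word in ("en", "bok", "bil", "min"):
    print(source_word, translations(source_word, model))
```

Each source word ends up with several candidate translations and a probability for each, which matches the first bullet above. The nested loops over every sentence pair in every iteration also hint at why such methods are computationally expensive on realistic corpora.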
12Dictionary / Thesaurus driven methods
- Translation is based on dictionaries and/or thesauri.
- Pro: Computationally inexpensive method
- Pro: Can capture/represent domain knowledge
- Con: Domain and language dependent
- Con: Expensive to create dictionaries/thesauri
- Con: Ambiguities are introduced when one word has several translations.
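The ambiguity problem in the last bullet can be made concrete with a small sketch. The dictionary `DICTIONARY` below is a hypothetical toy resource with a few Norwegian words, and `translate_terms` and `candidate_queries` are illustrative names, not part of any existing tool.

```python
from itertools import product

# Hypothetical toy Norwegian-to-English dictionary; a real system would
# load a curated bilingual dictionary or thesaurus instead.
DICTIONARY = {
    "gift": ["poison", "married"],
    "tre": ["tree", "three"],
    "hus": ["house"],
}

def translate_terms(query):
    """Map each query term to its list of candidate translations.

    Terms missing from the dictionary are passed through untranslated.
    """
    return [DICTIONARY.get(term, [term]) for term in query.lower().split()]

def candidate_queries(query):
    """Enumerate every combination of candidate translations."""
    return [" ".join(combo) for combo in product(*translate_terms(query))]

print(candidate_queries("tre hus"))
print(candidate_queries("gift tre"))
```

The lookup itself is cheap, but the number of candidate queries multiplies with every ambiguous term: two ambiguous words with two senses each already give four target-language queries, most of them wrong.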
13Combination of methods
- Most current research is based on a combination of the above-mentioned strategies.
- This makes sense because different approaches have different shortcomings.
- Recent results confirm this.
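One simple way to combine strategies, shown purely as an assumption-laden sketch, is to interpolate dictionary evidence with corpus-derived translation probabilities. The candidate lists, the probabilities (made-up numbers), the `weight` parameter and the function name `combined_translations` are all hypothetical.

```python
# Hypothetical candidate translations from a bilingual dictionary.
DICT_CANDIDATES = {"gift": ["poison", "married"]}

# Hypothetical translation probabilities of the kind a parallel corpus
# might yield; the numbers are made up for illustration.
CORPUS_PROBS = {"gift": {"poison": 0.7, "married": 0.2, "gift": 0.1}}

def combined_translations(word, weight=0.5):
    """Score candidates by linear interpolation of the two sources.

    `weight` is the share given to the dictionary; the rest goes to the
    corpus statistics. The dictionary spreads its mass uniformly over
    its candidates.
    """
    dict_cands = DICT_CANDIDATES.get(word, [])
    stat = CORPUS_PROBS.get(word, {})
    dict_score = {c: 1.0 / len(dict_cands) for c in dict_cands}
    candidates = set(dict_score) | set(stat)
    scores = {
        c: weight * dict_score.get(c, 0.0) + (1 - weight) * stat.get(c, 0.0)
        for c in candidates
    }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

print(combined_translations("gift"))
```

In this sketch the dictionary limits the influence of noisy corpus alignments, while the corpus statistics break the tie between the dictionary's equally weighted senses, so each source covers a shortcoming of the other.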
14Aims in multilingual information retrieval
- Adapt information retrieval techniques to multilingual information retrieval.
- Create new methods, developed for multilingual information retrieval.
Simplify the searching process. Create new ways to manage the ever increasing information overflow.
15My Thesis
- Research on MLIR
- Seen from a Norwegian perspective: bilingual Norwegians.
- Experiment with combination of approaches.
- Preferably low cost, language independent methods.