Multilingual Text Retrieval - PowerPoint PPT Presentation

Title: Multilingual Text Retrieval


1
Multilingual Text Retrieval
Applications of Multilingual Text Retrieval
W. Bruce Croft, John Broglio and Hideo Fujii
Computer Science Department, University of
Massachusetts
Explorative Multilingual Text Retrieval Based on
Fuzzy Multilingual Keyword Classification
Rowena Chau and Chung-Hsing Yeh
School of Business Systems, Monash University,
Victoria, Australia
By Dennis Pereira
2
Multilingual Text Retrieval
  • Topic: Text and Multimedia
  • It would be interesting to study how pictures,
    video and other non-textual media are retrieved.
  • A brief scan of the research on the topic
    indicates that a human uses metadata to describe
    a photo or video, and the retrieval engine then
    indexes that metadata.
  • However, what form does the metadata take? Is it
    in the native language of the user who produced
    it? If so, how do we retrieve such information?

3
Multilingual Text Retrieval
Multilingual text retrieval is problematic for
many retrieval engines. Fortunately, there is
some indication that all languages share
properties that allow the same techniques to be
used for retrieval across languages. From an
English speaker's point of view, Asian languages
seem to be the most problematic language types
on which to perform retrieval.
4
Multilingual Text Retrieval
Why are Asian languages problematic? Languages
such as Chinese and Japanese write concepts with
characters that work more like pictures than
like spelled-out words. So a concept in English
may be a two-word phrase such as "artificial
intelligence", while in Japanese the concept is a
series of four characters (人工知能).
5
Multilingual Text Retrieval
But Asian languages aren't the only languages
that can cause problems. Even European (Romance)
languages have hurdles that must be cleared in
order to perform retrieval on them. An example of
a Romance language problem is the use of accents
on otherwise "regular" letters. "Artificial
intelligence" in English is "Inteligência
Artificial" in Portuguese.
6
Multilingual Text Retrieval
Why is this a problem? We must represent the data
in binary form so that the computer can
distinguish one character from another. In the
U.S., text is normally encoded in ASCII, a 7-bit
standard with 128 characters. Single-byte
extensions of ASCII add characters with accents,
such as the one used in "Inteligência
Artificial". Unfortunately, a single-byte
extension has room for at most 256 characters,
far too few for Asian or other symbolic
languages.
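A minimal sketch of the problem, assuming Python's built-in codecs. The helper name `encodable` is hypothetical, "latin-1" stands in for one of the extended ASCII encodings, and the Japanese string is the usual four-character rendering of "artificial intelligence".

```python
def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of `text` can be stored in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


portuguese = "Inteligência Artificial"
japanese = "人工知能"  # assumed example: "artificial intelligence" in Japanese

# Plain ASCII rejects the accented letter, the single-byte extension accepts
# it, and only a Unicode encoding can hold the Japanese characters as well.
for name in ("ascii", "latin-1", "utf-8"):
    print(name, encodable(portuguese, name), encodable(japanese, name))
```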
7
Multilingual Text Retrieval
From a retrieval point of view, a file can be
saved in many different encodings, and the
retrieval engine must handle all of them. Files
in symbolic languages are never stored as ASCII.
More likely they are stored in a specialized
format capable of handling that character set. A
more universal approach is to store and retrieve
everything using Unicode. Unicode is a standard
that assigns a code to the characters of the
world's writing systems, so that text in any
language can be stored in a single format.
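One way to picture the "store everything as Unicode" step is sketched below. It assumes the encoding of each incoming file is already known; the helper `to_unicode`, the sample strings and the choice of UTF-8 for storage are illustrative assumptions, not part of either paper.

```python
import unicodedata


def to_unicode(raw: bytes, encoding: str) -> str:
    """Decode bytes from a known legacy encoding and normalise to NFC."""
    return unicodedata.normalize("NFC", raw.decode(encoding))


# Two files as they might arrive, each in its own legacy encoding.
files = [
    ("Inteligência Artificial".encode("latin-1"), "latin-1"),
    ("人工知能".encode("shift_jis"), "shift_jis"),
]

# After decoding, every document is an ordinary Unicode string ...
documents = [to_unicode(raw, encoding) for raw, encoding in files]

# ... and can be written back out in one single format.
utf8_store = [doc.encode("utf-8") for doc in documents]
```

Normalising to NFC is a design choice: it makes an accented letter compare equal whether it arrived as one precomposed character or as a base letter plus a combining accent.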
8
Multilingual Text Retrieval
With Unicode established as our standard, we can
then attempt to perform retrieval. We have two
approaches that complement each other nicely.
Croft et al. offer the first, a traditional
keyword retrieval approach: give the system the
information need and it returns results ranked by
a statistical model. Chau and Yeh offer the
second, which differs from the first in that the
information need is unclear.
9
Multilingual Text Retrieval
Chau and Yeh argue that their approach is useful
when the information need is unclear or, in the
case of Asian languages, when typing in the
search concepts is not trivial. There is no
keyboard with a key for every Chinese character,
for example. Their approach therefore analyzes a
set of parallel corpora that can be used to
classify keywords into concept classes. By doing
this, the user can type a query in English and
retrieve documents in Chinese.
10
Multilingual Text Retrieval
How is that done? Two documents are parallel if
they are interpretations of each other. They
don't need to be exact translations, because
often there is no one-to-one mapping of
expressions in one language to those of another.
The idea is that concepts, each represented by a
set of characters, will be used consistently in
both versions of the document, allowing these
terms to be classified as members of a particular
category.
11
Multilingual Text Retrieval
Words can fall into more than one category, each
with a level of membership represented by a
weight. This weight is determined by the authors'
algorithm for fuzzy clustering. These categories
are used to create concept classes. The user
presents the system with an elementary
information need by giving a term, or set of
terms, in whatever language the system is capable
of handling, and the terms are expanded to
include the entire class.
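The expansion step can be sketched as follows. This is not the authors' fuzzy clustering algorithm: the membership weights in `MEMBERSHIP` are made up for illustration, and the `threshold` cut-off and the function name `expand_query` are assumptions.

```python
# Hypothetical membership table: term -> {concept class: weight in [0, 1]}.
# In the approach described above these weights would come from fuzzy
# clustering of a parallel corpus; here they are invented.
MEMBERSHIP = {
    "intelligence": {"ai": 0.9, "espionage": 0.4},
    "artificial": {"ai": 0.8},
    "人工知能": {"ai": 0.95},
    "spy": {"espionage": 0.9},
}


def expand_query(terms, threshold=0.5):
    """Expand query terms to all terms of the concept classes they belong to."""
    # Concept classes that some query term belongs to strongly enough.
    classes = {
        cls
        for term in terms
        for cls, weight in MEMBERSHIP.get(term, {}).items()
        if weight >= threshold
    }
    # Every term, in any language, that is a strong member of those classes.
    expanded = {}
    for term, memberships in MEMBERSHIP.items():
        weight = max(
            (w for cls, w in memberships.items() if cls in classes),
            default=0.0,
        )
        if weight >= threshold:
            expanded[term] = weight
    return expanded


print(expand_query(["intelligence"]))
```

With these invented weights, the English query term pulls in the Japanese term from the same class, while "spy" stays out because "intelligence" is only a weak member of the "espionage" class.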
12
Multilingual Text Retrieval
Chau and Yeh's retrieval approach is a vector
model. The concept class is represented as a
vector that can then be compared with the set of
documents to determine which documents are
returned and how they are ranked. This is an
interesting way to perform query expansion in
other languages. It may be a useful approach for
future systems that require retrieval across
languages. It may also be a useful approach for
expanding a basic query into a more specific and
focused query.
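A minimal sketch of the comparison, assuming cosine similarity as the measure (the slides only say "vector model", so the measure, the weights and the document vectors below are all illustrative assumptions):

```python
import math


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical concept class vector spanning two languages.
concept_vector = {"intelligence": 0.9, "artificial": 0.8, "人工知能": 0.95}

# Hypothetical document vectors (term -> weight).
documents = {
    "doc_en": {"artificial": 1.0, "intelligence": 1.0, "research": 1.0},
    "doc_ja": {"人工知能": 2.0, "研究": 1.0},
    "doc_other": {"weather": 1.0, "forecast": 1.0},
}

# Rank documents by similarity to the concept class, best first.
ranked = sorted(
    documents,
    key=lambda doc_id: cosine(concept_vector, documents[doc_id]),
    reverse=True,
)
print(ranked)
```

Because the concept vector holds terms from both languages, the English and the Japanese document both score above the unrelated one.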
13
Multilingual Text Retrieval
14
Multilingual Text Retrieval
Croft et al. argue that the methods used for
English retrieval extend to other languages.
Examples are stemming and, for ranking purposes,
the probabilistic tf.idf weights. However, a
problem arises with languages other than English
in that their words may take many different forms,
which can reduce the usefulness of stemming. This
problem is solved with language-specific knowledge
of common prefixes and suffixes.
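Both ideas can be sketched in a few lines. The suffix lists are deliberately tiny and hypothetical, and the weighting shown is one common tf.idf variant (term frequency times log of N over document frequency), not necessarily the formula used by Croft et al.

```python
import math
from collections import Counter

# Hypothetical language-specific suffix lists; a real stemmer needs far more.
SUFFIXES = {
    "en": ["ing", "ed", "s"],
    "pt": ["ções", "ção", "es", "s"],
}


def stem(word: str, lang: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES.get(lang, []):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tf_idf(docs):
    """Weight each term by term frequency times log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]


docs = [
    [stem(w, "en") for w in "retrieving indexed documents".split()],
    [stem(w, "en") for w in "indexing documents".split()],
]
weights = tf_idf(docs)
```

After stemming, "indexed" and "indexing" share one index term, and terms that occur in every document get a weight of zero because they do not help distinguish documents.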
15
Multilingual Text Retrieval
Another problem that arises when performing a
query on Asian languages is tokenization. There
are no clear delimiters marking word breaks in
Japanese. How, then, do we index a Japanese
document? One solution is to index each
character. Then, when the query is submitted, the
system attempts to match each character to
produce a result. Another solution is to use some
knowledge of the language and try to determine
the word boundaries from the probability that a
character ends a word.
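The first solution, indexing each character, can be sketched as below. The sample documents and the scoring rule (count of distinct query characters present) are assumptions for illustration, not the ranking used by Croft et al.

```python
from collections import defaultdict


def build_char_index(docs: dict) -> dict:
    """Map each non-space character to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for ch in text:
            if not ch.isspace():
                index[ch].add(doc_id)
    return index


def search(index: dict, query: str) -> list:
    """Score documents by how many distinct query characters they contain."""
    scores = defaultdict(int)
    for ch in set(query):
        for doc_id in index.get(ch, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical documents; no word boundaries are needed to index them.
docs = {
    "d1": "人工知能の研究",
    "d2": "人工衛星の打ち上げ",
    "d3": "天気予報",
}
index = build_char_index(docs)
results = search(index, "人工知能")
print(results)
```

The second document shares two characters with the query without being about the same concept, which illustrates why character matching alone can hurt precision.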
16
Multilingual Text Retrieval
Japanese is composed of different classes of
characters (kanji, hiragana and katakana). Here
word boundaries are detected when the type of
character changes.
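A rough sketch of this heuristic, assuming the character class can be read off the Unicode code point ranges for hiragana, katakana and the main kanji block. The function names and the sample sentence are illustrative.

```python
from itertools import groupby


def char_class(ch: str) -> str:
    """Classify a character by its Unicode block."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def segment(text: str) -> list:
    """Split text wherever the character class changes."""
    return ["".join(group) for _, group in groupby(text, key=char_class)]


print(segment("人工知能のシステムを研究する"))
# → ['人工知能', 'の', 'システム', 'を', '研究', 'する']
```

The heuristic is only approximate: a verb written as kanji followed by hiragana endings is split in two, and a long run of kanji may contain several words.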
17
Multilingual Text Retrieval
Croft and his team found that indexing both
individual characters and words improves the
precision of retrieval, especially at lower
recall. In other words, when fewer documents are
returned, a larger share of them are relevant.
The data they used to show this is a set of
articles from Nikkei compared with articles from
the Wall Street Journal on the same topics during
the same time frame. 25 queries were run on each
set, with the English queries translated from the
Japanese.
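For readers unfamiliar with the two measures, here is a small sketch of precision and recall at a cut-off in a ranked list. The ranking and the relevance judgments are invented, not data from the experiment.

```python
def precision_recall_at(ranked: list, relevant: set, k: int) -> tuple:
    """Precision and recall over the top-k documents of a ranked list."""
    retrieved = ranked[:k]
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranking and relevance judgments for one query.
ranked = ["d3", "d1", "d7", "d2", "d9", "d4"]
relevant = {"d1", "d3", "d4"}

for k in (2, 4, 6):
    print(k, precision_recall_at(ranked, relevant, k))
```

In this made-up ranking the top two results are both relevant, so precision is highest at the shortest cut-off; reaching full recall requires going further down the list and accepting lower precision.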
18
Multilingual Text Retrieval
Their results were shown in a graph of precision
against recall (not reproduced in this transcript).
19
Multilingual Text Retrieval
The graph shows that at higher recall, the
precision is almost the same, indicating that the
data sets were correctly selected. The underlying
algorithms of this system were the same for both
English and Japanese. After the terms are
indexed, the retrieval process runs the same way
across all languages. The limitation of this
system is that the query must be entered in the
language of the documents needing to be retrieved.
20
Multilingual Text Retrieval
It would be interesting to attempt to integrate
the fuzzy classification algorithm proposed by
Chau and Yeh with the retrieval system described
by Croft et al. Doing so may increase the ability
to perform multilingual text retrieval since, in
my opinion, the system used by Chau and Yeh is a
simple one, meant to show that it is possible to
retrieve documents using fuzzy clustering. Adding
the capability of fuzzy classification to a
robust system like that of Croft et al. may prove
to be a substantial improvement to the retrieval
field.
21
Multilingual Text Retrieval
In conclusion, we see how two different groups of
people address a similar problem, one from a
computer science point of view and the other from
a business application point of view. Both
approaches attempt to retrieve multilingual text.
The computer science point of view assumes that
the query is given and not a problem to acquire;
the business application point of view assumes
that the query is the problem.
22
Multilingual Text Retrieval
Explorative Multilingual Text Retrieval Based on
Fuzzy Multilingual Keyword Classification
Rowena Chau and Chung-Hsing Yeh, 2000. Proceedings
of the 5th International Workshop on Information
Retrieval with Asian Languages.
http://portal.acm.org/citation.cfm?id=355219
Applications of Multilingual Text Retrieval
W. Bruce Croft, John Broglio and Hideo Fujii, 1996.
Proceedings of the 29th Annual Hawaii International
Conference on System Sciences.
http://ieeexplore.ieee.org/iel2/3511/10449/00495303.pdf