Title: Indonesian Information Retrieval
1Indonesian Information Retrieval
- By Dr. Stéphane Bressan
- And Vinsensius Berlian Vega S N
- School of Computing
- National University of Singapore
2THE INTERNET
3Objective
- To build a search engine dedicated to assisting the Indonesian speaker
- Indexing the Indonesian Web
- Queries in Indonesian
- Effective and efficient retrieval of Indonesian Web documents
4Search Engine Architecture
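[Diagram: the URL Indexer sends a url request to the World Wide Web and receives (url, document); index keywords extracted from each document are stored as (index keywords, url) in Data Storage; the search engine sends keywords to Data Storage and gets back urls.]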
5Retrieval Model
- Vector Space Model
- Documents and queries are represented as vectors in term-dimensional space
- Similarity measure and ranking system used: Normalized Weighted Cosine
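The ranking idea can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the slides do not state the weighting scheme, so the smoothed tf-idf weights, the function names, and the toy documents are all assumptions.

```python
import math
from collections import Counter


def tf_idf_vector(tokens, doc_freq, n_docs):
    """Build a sparse term vector; tf-idf weighting is an assumed choice."""
    counts = Counter(tokens)
    return {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1)
        for term, tf in counts.items()
    }


def cosine(u, v):
    """Normalized cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, docs):
    """Rank documents (url -> token list) by similarity to the query."""
    n_docs = len(docs)
    doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))
    q_vec = tf_idf_vector(query_tokens, doc_freq, n_docs)
    scored = [
        (cosine(q_vec, tf_idf_vector(tokens, doc_freq, n_docs)), url)
        for url, tokens in docs.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical two-document collection.
docs = {"a": ["mobil", "baru"], "b": ["pemerintah", "indonesia"]}
ranking = rank(["mobil"], docs)
```

Because the cosine is normalized by both vector lengths, long documents are not favored merely for containing more terms.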
6Performance measure
- Recall/Precision
- Recall: fraction of the relevant documents that are retrieved, |Ra| / |R|
- Precision: fraction of the retrieved documents that are relevant, |Ra| / |A|
- Notation: R = relevant docs, A = retrieved docs, Ra = relevant retrieved docs
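The two measures can be computed directly from the sets of relevant and retrieved documents. A small sketch; the function name and the example URLs are illustrative only.

```python
def recall_precision(relevant, retrieved):
    """Return (recall, precision) for sets of document identifiers."""
    relevant = set(relevant)      # R
    retrieved = set(retrieved)    # A
    hits = relevant & retrieved   # Ra: relevant documents that were retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Four relevant documents, three retrieved, two of them relevant.
recall, precision = recall_precision({"a", "b", "c", "d"}, {"a", "b", "x"})
```

Here recall is 2/4 and precision is 2/3.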
7Specific Issues
- Segmentation
- Stemming
- Language Identification
8Segmentation
- Character set recognized
- Roman alphabet
- No diacritics (e.g. è, é, or ä) in Indonesian, but some might need to be included, as foreign terms might be absorbed unchanged (e.g. café)
- Dash (-), as it is used in writing plurals (e.g. mobil "car", mobil-mobil "cars"), compound words (e.g. tanda-tangan), and affixed foreign/abbreviated terms (e.g. di-email, NEM-nya)
9Segmentation
- Character set recognized (cont.)
- Digit 2 should receive special treatment as it
is common to use it to denote plurals (e.g.
mobil2), including affixed plurals (e.g.
mobil2-nya) and plurals of affixed-terms (e.g.
peternak2)
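The character-set rules on the two segmentation slides can be approximated with a regular expression. This is only a sketch under assumptions: the actual EBNF rules are not shown in the slides, and the pattern, the function names, and the decision to drop plain numbers are hypothetical choices.

```python
import re

# A token is a run of letters (accented letters allowed), optionally followed
# by the plural marker "2", with further parts joined by dashes.
TOKEN = re.compile(r"[^\W\d_]+2?(?:-[^\W\d_]+2?)*")


def tokenize(text):
    """Split text into word tokens; plain numbers and punctuation are dropped."""
    return TOKEN.findall(text)


def expand_plural(token):
    """Rewrite the shorthand plural 'mobil2' as the reduplicated 'mobil-mobil'."""
    return re.sub(r"([^\W\d_]+)2", r"\1-\1", token)


tokens = tokenize("Mobil2-nya di-email ke café, NEM-nya 2000.")
```

Expanding the shorthand before indexing would let "mobil2" and "mobil-mobil" map to the same index keyword.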
10Stemming
- Indonesian language is morphologically rich, i.e. affixes are heavily used
- Examples of affixes: cars, government, reading
- All of the known Indonesian stemmers are dictionary based
- The challenge is to build an effective non-dictionary based stemmer
11Stemming
- Affixes could appear as
- prefixes (e.g. berbicara)
- suffixes (e.g. peranan)
- infixes (e.g. gembung > gelembung)
- circumfixes (e.g. memperbaiki)
- Letters might change due to affixes (e.g. peperintah > pemerintah)
- Forms: inflections (e.g. pukul vs. dipukul) and derivations (e.g. perintah vs. pemerintah)
- Inflections: grammatical variants (e.g. due to gender, time)
- Derivations: semantic variants (i.e. change in meaning)
12Stemming
- Morphologically rich, but most affixes play a derivational function, so stemming might not be as crucial as in other morphologically rich languages (e.g. Slovene [Popovic92], French)
- Might be applied iteratively (e.g. memperbaiki (two prefixes and a suffix))
- The set of affixes is growing [Kridalaksana89] (e.g. buatin), which is especially apparent in colloquial writing style and day-to-day conversation
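Iterative, non-dictionary-based affix stripping can be sketched as below. This is an illustration only, not the Definite Clause Grammar stemmers described later: the affix lists are small, assumed samples, and the sketch ignores infixes and letter changes such as peperintah > pemerintah.

```python
# Assumed, incomplete affix lists; longer affixes are tried first.
PREFIXES = ("meng", "mem", "men", "me", "ber", "per", "ter", "di", "ke", "se")
SUFFIXES = ("kan", "nya", "lah", "an", "i")
MIN_STEM = 3  # assumed lower bound on the length of a stem


def strip_once(word):
    """Remove at most one prefix or suffix, keeping at least MIN_STEM letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            return word[len(prefix):]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[:-len(suffix)]
    return word


def stem(word):
    """Apply strip_once repeatedly until no affix can be removed."""
    word = word.lower()
    while True:
        shorter = strip_once(word)
        if shorter == word:
            return word
        word = shorter


stems = [stem(w) for w in ("memperbaiki", "berbicara", "dipukul")]
```

With these rules "memperbaiki" loses two prefixes and a suffix in three passes. Without a dictionary the rules also over-stem: "peranan" is reduced to "anan" because "per" is read as a prefix, which shows why an effective non-dictionary stemmer is a challenge.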
13Language Identification
- Need to be able to effectively separate
Indonesian and non-Indonesian documents
14Current Progress and Results
- Segmentation
- Stemming
- Language Identification
15Current Progress and Results
- Segmentation
- A formal segmentation rule, specified in the form of EBNF rules, has been developed
16Current Progress and Results
- Stemming Algorithms
- Implemented two non-dictionary-based stemmers in the form of Definite Clause Grammars using Prolog
17Current Progress and Results
- We propose a language distinction (as opposed to language identification) technique that could identify whether a document is written in Indonesian given only Indonesian samples (Positive Only Learning)
[Figure: training documents, the learning algorithm, and an unlabeled document (?) to be classified]
18Trigrams
- Our technique is based on the frequency of trigrams
- Trigrams are three-character substrings of a word, e.g. the trigrams of "hello" are "hel", "ell", and "llo"
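Extracting and counting trigrams takes only a few lines. In this sketch, splitting on whitespace and lowercasing are assumptions; the slides do not say how words are delimited for this step.

```python
from collections import Counter


def word_trigrams(word):
    """Return the three-character substrings of a word, in order."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


def trigram_frequencies(text):
    """Count trigrams over all whitespace-separated words of a text."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(word_trigrams(word))
    return counts


hello = word_trigrams("hello")
freqs = trigram_frequencies("mobil mobil")
```

Words shorter than three characters contribute no trigrams.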
19Positive Only Learning
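$$h_x(d) = \begin{cases} 1 & \text{if } \sum_i w_i \, f_{i,d} > q \\ 0 & \text{otherwise} \end{cases}$$

- $h_x(d)$: hypothesis that document $d$ is written in language $x$
- $f_{i,d}$: frequency of trigram $i$ in document $d$
- $w_i$: weight associated with trigram $i$ (from the training set)
- $q$: some threshold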
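One way to read the definitions above is as a thresholded weighted sum of trigram frequencies. The sketch below follows that reading under stated assumptions: the slides do not say how the weights are derived, so each weight is taken to be the trigram's relative frequency in the positive samples, document frequencies are normalized by document length, and the sample texts and the threshold 0.01 are made up for illustration.

```python
from collections import Counter


def trigram_counts(text):
    """Count the three-character substrings of each whitespace-separated word."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(word[i:i + 3] for i in range(len(word) - 2))
    return counts


def learn_weights(positive_samples):
    """Assumed weighting: relative trigram frequency over the positive samples."""
    totals = Counter()
    for text in positive_samples:
        totals.update(trigram_counts(text))
    n = sum(totals.values())
    return {tri: c / n for tri, c in totals.items()} if n else {}


def score(text, weights):
    """Weighted sum of the document's (length-normalized) trigram frequencies."""
    counts = trigram_counts(text)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return sum(weights.get(tri, 0.0) * c / n for tri, c in counts.items())


def is_language(text, weights, q):
    """Accept the document when its score exceeds the threshold q."""
    return score(text, weights) > q


# Only Indonesian samples are used for training (positive only learning).
weights = learn_weights([
    "pemerintah indonesia membangun jalan baru",
    "mobil-mobil itu milik pemerintah",
])
accepted = is_language("pemerintah membangun jalan", weights, q=0.01)
rejected = is_language("the government builds roads", weights, q=0.01)
```

No negative samples are needed: a document in another language scores low simply because few of its trigrams carry any weight.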
22Continuous Learning
- Given the high performance of the method, can the algorithm trust its own decision to learn more?
- Would it then be able to adapt to changes in language trends?
23Continuous Learning
- The idea is to treat new documents that are
judged to be Indonesian as new samples
(Continuous Learning)
[Figure: documents judged to be Indonesian are added back to the training documents]
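The feedback loop can be sketched as a classifier that updates its own trigram statistics whenever it accepts a document. As before, this is an assumption-laden illustration: the class name, the relative-frequency weights, the normalization, and the threshold value are hypothetical and not taken from the slides.

```python
from collections import Counter


def trigram_counts(text):
    """Count the three-character substrings of each whitespace-separated word."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(word[i:i + 3] for i in range(len(word) - 2))
    return counts


class ContinuousLearner:
    """Positive-only trigram classifier that learns from its own decisions."""

    def __init__(self, seed_samples, q):
        self.q = q                # acceptance threshold (assumed value)
        self.totals = Counter()   # trigram counts over all accepted samples
        for text in seed_samples:
            self.totals.update(trigram_counts(text))

    def score(self, text):
        counts = trigram_counts(text)
        n_doc = sum(counts.values())
        n_train = sum(self.totals.values())
        if n_doc == 0 or n_train == 0:
            return 0.0
        return sum(
            (self.totals[tri] / n_train) * (c / n_doc)
            for tri, c in counts.items()
        )

    def classify_and_learn(self, text):
        """Classify a document; if accepted, treat it as a new training sample."""
        accepted = self.score(text) > self.q
        if accepted:
            self.totals.update(trigram_counts(text))
        return accepted


learner = ContinuousLearner(["pemerintah indonesia membangun jalan baru"], q=0.01)
size_before = sum(learner.totals.values())
first = learner.classify_and_learn("pemerintah membangun jalan")
size_after = sum(learner.totals.values())
second = learner.classify_and_learn("the government builds roads")
```

Rejected documents leave the statistics untouched, so only text the classifier already trusts can shift the weights. A wrongly accepted document would also be learned from, which is the risk behind the question of whether the algorithm can trust its own decisions.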
24Experimental Results
25Conclusion
- "The Jungle is Neutral", F. S. Chapman
- "The Web is neutral", S. F. Bressan
26Conclusion
- We must be proactive in the usage and promotion of our languages, and in the sharing and development of our cultures.
- Computer scientists must design and build tools that cater for the cultural specificities.