Indonesian Information Retrieval


Provided by: vinsensius

1
Indonesian Information Retrieval

By Dr. Stéphane Bressan
And Vinsensius Berlian Vega S N
School of Computing
National University of Singapore

2
THE INTERNET
3
Objective

To build a search engine dedicated to assisting
Indonesian speakers
Indexing the Indonesian Web
Queries in Indonesian
Effective and efficient retrieval of Indonesian
Web documents

4
Search Engine Architecture
[Diagram. Components: World Wide Web, URL Indexer, Data Storage, search engine. Labelled flows: url request; (url, document); document; (index keywords, url); url; index keywords; keywords; url]
5
Retrieval Model

Vector Space Model
Documents and queries are represented as vectors
in a term-dimensional space
The similarity measure used for ranking is the
normalized weighted cosine
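
As a minimal sketch of cosine ranking in a vector space model: documents and queries are sparse term-weight vectors stored as dicts. How the weights are computed is not stated on the slide, so the weights below are placeholder values, and the names `cosine` and `rank` are illustrative only.

```python
import math


def cosine(doc_vec, query_vec):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * query_vec.get(term, 0.0) for term, w in doc_vec.items())
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_d * norm_q)


def rank(docs, query_vec):
    """Return document keys ordered by decreasing similarity to the query."""
    return sorted(docs, key=lambda key: cosine(docs[key], query_vec), reverse=True)


# Placeholder weights, for illustration only.
docs = {
    "a": {"mobil": 1.0, "baru": 1.0},
    "b": {"ikan": 2.0},
}
query = {"mobil": 1.0}
ranking = rank(docs, query)
```

Dividing by both vector norms is what makes the measure independent of document length.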

6
Performance measure

Recall/Precision
Recall: fraction of the relevant documents that are retrieved, |Ra| / |R|
Precision: fraction of the retrieved documents that are relevant, |Ra| / |A|

R: relevant docs
A: retrieved docs
Ra: relevant retrieved docs
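
The two measures can be sketched with sets, using R, A and Ra as on the slide. The document identifiers are made up for the example.

```python
def recall_precision(relevant, retrieved):
    """Return (recall, precision) for a set R of relevant and a set A of retrieved docs."""
    ra = relevant & retrieved  # Ra: relevant documents that were retrieved
    recall = len(ra) / len(relevant) if relevant else 0.0
    precision = len(ra) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical document identifiers.
R = {"d1", "d2", "d3", "d4"}
A = {"d1", "d2", "d5"}
recall, precision = recall_precision(R, A)
```

Here two of the four relevant documents are retrieved, and two of the three retrieved documents are relevant.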
7
Specific Issues

Segmentation
Stemming
Language Identification

8
Segmentation

Character set recognized
Roman alphabet
No diacritics (e.g. è, é, or ä) in Indonesian, but some might need to be included, as foreign terms might be absorbed unchanged (e.g. café)
Dash (-), as it is used in writing plurals (e.g. mobil = car, mobil-mobil = cars), compound words (e.g. tanda-tangan), and affixed foreign/abbreviated terms (e.g. di-email, NEM-nya)

9
Segmentation

Character set recognized (cont.)
The digit 2 should receive special treatment, as it is commonly used to denote plurals (e.g. mobil2), including affixed plurals (e.g. mobil2-nya) and plurals of affixed terms (e.g. peternak2)
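
The character-set rules above can be approximated with a regular expression. This is only a sketch under assumptions: the pattern, the Latin-1 letter ranges and the name `segment` are illustrative and are not the segmentation rules developed by the authors.

```python
import re

# Roman letters plus Latin-1 accented letters, for unchanged loanwords such as "café".
LETTERS = "A-Za-zÀ-ÖØ-öø-ÿ"
# A run of letters, optionally followed by the plural marker "2" (but not by a longer number).
PART = rf"[{LETTERS}]+(?:2(?!\d))?"
# Parts may be joined by dashes: mobil-mobil, tanda-tangan, di-email, mobil2-nya.
TOKEN = re.compile(rf"{PART}(?:-{PART})*")


def segment(text):
    """Return the word tokens found in text, in order of appearance."""
    return TOKEN.findall(text)


tokens = segment("Mobil2-nya di-email ke café, NEM-nya 2 kali.")
```

A standalone "2" is not treated as a word here, because the pattern only accepts the digit directly after a run of letters.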

10
Stemming

Indonesian is morphologically rich, i.e. affixes are heavily used
Examples of affixed words (English glosses): cars, government, reading
All of the known Indonesian stemmers are dictionary-based
The challenge is to build an effective non-dictionary-based stemmer

11
Stemming

Affixes could appear as
prefixes (e.g. berbicara)
suffixes (e.g. peranan)
infixes (e.g. gembung > gelembung)
circumfixes (e.g. memperbaiki)
Letters might change due to affixes (e.g. peperintah > pemerintah)
Forms: inflections (e.g. pukul vs. dipukul) and derivations (e.g. perintah vs. pemerintah)
Inflections: grammatical variants (e.g. due to gender, tense)
Derivations: semantic variants (i.e. change in meaning)

12
Stemming

Morphologically rich, but most affixes play a derivational function > stemming might not be as crucial as in other morphologically rich languages (e.g. Slovene [Popovic92], French)
Affixes might be applied iteratively (e.g. memperbaiki (two prefixes and a suffix))
The set of affixes is growing [Kridalaksana89] (e.g. buatin), which is especially apparent in colloquial writing style and day-to-day conversation
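
To illustrate iterative affix removal without a dictionary, here is a deliberately naive sketch. The affix lists, the minimum stem length and the name `strip_affixes` are assumptions made for this example; this is not the stemmer described in the presentation, and it handles neither infixes nor letter changes such as pemerintah.

```python
# Small, incomplete affix lists chosen only for this illustration.
PREFIXES = ("mem", "per", "ber", "di")
SUFFIXES = ("nya", "kan", "an", "i")
MIN_STEM = 4  # assumed lower bound, to avoid stripping a word down to nothing


def strip_affixes(word):
    """Repeatedly strip one known prefix and one known suffix until nothing changes."""
    stem = word.lower()
    changed = True
    while changed:
        changed = False
        for prefix in PREFIXES:
            if stem.startswith(prefix) and len(stem) - len(prefix) >= MIN_STEM:
                stem = stem[len(prefix):]
                changed = True
                break
        for suffix in SUFFIXES:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM:
                stem = stem[:-len(suffix)]
                changed = True
                break
    return stem


stems = [strip_affixes(w) for w in ("memperbaiki", "dipukul", "berbicara")]
```

For memperbaiki the loop removes mem-, -i and per- over two passes, leaving baik. Without a dictionary, a word that merely begins or ends with an affix-like string would be stripped as well, which is part of what makes the non-dictionary approach challenging.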

13
Language Identification

Need to be able to effectively separate
Indonesian and non-Indonesian documents

14
Current Progress and Results

Segmentation
Stemming
Language Identification

15
Current Progress and Results

Segmentation
Formal segmentation rules, specified in EBNF,
have been developed

16
Current Progress and Results

Stemming Algorithms
Implemented two non-dictionary-based stemmers in
the form of Definite Clause Grammars, using Prolog

17
Current Progress and Results

We propose a language distinction (as opposed to
language identification) technique that could
identify whether a document is written in
Indonesian given only Indonesian samples
(Positive Only Learning)

[Diagram: training documents (positive samples only) are fed to the learning algorithm]
18
Trigrams

Our technique is based on the frequency of trigrams
Trigrams are three-character substrings of a word
e.g. the trigrams of hello are hel, ell, and llo
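
Trigram extraction can be sketched in a few lines. The function name `trigrams` is illustrative.

```python
def trigrams(word):
    """Return the three-character substrings of word, in order."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


hello_trigrams = trigrams("hello")
```

Words shorter than three characters yield no trigrams.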

19
Positive Only Learning

Algorithm

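$$h_x(d) = \begin{cases} 1 & \text{if } \sum_i w_i \, f_{i,d} > \theta \\ 0 & \text{otherwise} \end{cases}$$

h_x(d): hypothesis that document d is written in language x
f_{i,d}: frequency of trigram i in document d
w_i: weight associated with trigram i (from the training set)
θ: some threshold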
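
A minimal sketch of a threshold decision of this kind, trained on positive samples only. Several details are assumptions made for the example and are not taken from the slides: the score is a weighted sum of trigram frequencies, both the weights and the document frequencies are relative frequencies, and the sample sentences and the threshold value are invented.

```python
from collections import Counter


def trigrams(word):
    return [word[i:i + 3] for i in range(len(word) - 2)]


def trigram_freqs(text):
    """Relative frequency of each trigram over all words of text."""
    counts = Counter(t for w in text.lower().split() for t in trigrams(w))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def train(samples):
    """Weights w_i: relative trigram frequencies in the positive samples (assumed)."""
    return trigram_freqs(" ".join(samples))


def score(weights, doc):
    """Weighted sum of the trigram frequencies f_{i,d} of the document."""
    return sum(weights.get(t, 0.0) * f for t, f in trigram_freqs(doc).items())


def is_language(weights, doc, theta):
    """Accept the document if its score exceeds the threshold theta."""
    return score(weights, doc) > theta


# Invented toy samples and threshold.
weights = train(["saya makan nasi", "dia makan ikan"])
accepted = is_language(weights, "makan nasi", theta=0.05)
rejected = is_language(weights, "the quick brown", theta=0.05)
```

Only Indonesian samples are needed for training. A document in another language shares few trigrams with them, so its score stays low.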
20
Positive Only Learning
21
Positive Only Learning

Experimental results

22
Continuous Learning

Given the high performance of the method, can the
algorithm trust its own decisions in order to learn more?
Would it then be able to adapt to changes in
language trends?

23
Continuous Learning

The idea is to treat new documents that are
judged to be Indonesian as new samples
(Continuous Learning)

[Diagram: documents judged to be Indonesian are added to the training documents]
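
The feedback loop can be sketched as follows: a document that passes the threshold is added to the training counts. The scoring rule, the class name `ContinuousLearner`, the sample sentences and the threshold are the same kind of assumptions as in any toy example and are not the authors' implementation.

```python
from collections import Counter


def trigram_counts(text):
    """Count the three-character substrings of every word in text."""
    return Counter(
        w[i:i + 3] for w in text.lower().split() for i in range(len(w) - 2)
    )


class ContinuousLearner:
    def __init__(self, samples, theta):
        self.theta = theta
        self.counts = Counter()
        for sample in samples:
            self.counts.update(trigram_counts(sample))

    def score(self, doc):
        """Weighted sum of relative trigram frequencies (assumed scoring rule)."""
        doc_counts = trigram_counts(doc)
        total, n = sum(self.counts.values()), sum(doc_counts.values())
        if not total or not n:
            return 0.0
        return sum(self.counts[t] / total * c / n for t, c in doc_counts.items())

    def observe(self, doc):
        """Judge doc and, if it is accepted, treat it as a new training sample."""
        accepted = self.score(doc) > self.theta
        if accepted:
            self.counts.update(trigram_counts(doc))
        return accepted


learner = ContinuousLearner(["saya makan nasi", "dia makan ikan"], theta=0.05)
first = learner.observe("makan nasi goreng")   # accepted, so "gor" etc. are learned
second = learner.observe("the quick brown")    # rejected, so the counts are unchanged
```

Because accepted documents change the weights, a wrongly accepted document also shifts the model. This is the question of whether the algorithm can trust its own decisions.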

24
Experimental Results
25
Conclusion

The Jungle is Neutral, F. S. Chapman

The Web is neutral, S. F. Bressan
26
Conclusion

We must be proactive in the usage and promotion
of our languages, and in the sharing and
development of our cultures.
Computer scientists must design and build tools
that cater for cultural specificities.
