Indonesian Information Retrieval - PowerPoint PPT Presentation

Slides: 27
Provided by: vinsensius
Transcript and Presenter's Notes

Title: Indonesian Information Retrieval


1
Indonesian Information Retrieval
  • By Dr. Stéphane Bressan
  • And Vinsensius Berlian Vega S N
  • School of Computing
  • National University of Singapore

2
THE INTERNET
3
Objective
  • To build a search engine dedicated to assisting
    Indonesian speakers
  • Indexing the Indonesian Web
  • Queries in Indonesian
  • Effective and efficient retrieval of Indonesian
    Web documents

4
Search Engine Architecture
[Diagram: World Wide Web, URL Indexer, Data Storage, and search engine, connected by flows labelled: url request; (url, document); document; (index keywords, url); url; index keywords; keywords; url]
5
Retrieval Model
  • Vector Space Model
  • Documents and queries are represented as vectors
    in a space with one dimension per term
  • The similarity measure used for ranking is the
    normalized weighted cosine
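
A minimal sketch of cosine ranking in the vector space model, in Python. The slides do not say how term weights are computed, so the tf x idf weighting here is an assumption, and the names `weighted_vector`, `cosine`, and `rank` are hypothetical.

```python
import math
from collections import Counter


def weighted_vector(tokens, idf):
    """Map a token list to a sparse vector of term weights.

    Assumption: weight = term count * idf; unknown terms get idf 1.0.
    """
    counts = Counter(tokens)
    return {term: count * idf.get(term, 1.0) for term, count in counts.items()}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, docs, idf):
    """Return (score, url) pairs, best match first.

    docs maps a url to the list of index keywords of that document.
    """
    query = weighted_vector(query_tokens, idf)
    scored = [(cosine(query, weighted_vector(tokens, idf)), url)
              for url, tokens in docs.items()]
    return sorted(scored, reverse=True)
```

Dividing by both vector norms is what makes the measure "normalized": a long document does not outrank a short one merely because it repeats terms more often.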

6
Performance measure
  • Recall/Precision
  • Recall: fraction of the relevant documents that are
    retrieved, |Ra| / |R|
  • Precision: fraction of the retrieved documents that
    are relevant, |Ra| / |A|

R: relevant docs
A: retrieved docs
Ra: relevant retrieved docs (the intersection of R and A)
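
The two measures can be computed directly from the sets R and A. A short Python sketch; the function name `recall_precision` and the document ids in the usage note are made up for illustration.

```python
def recall_precision(relevant, retrieved):
    """Return (recall, precision) for one query.

    relevant  -- ids of the relevant documents (R)
    retrieved -- ids of the documents the engine returned (A)
    """
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # Ra
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision
```

For example, with R = {d1, d2, d3, d4} and A = {d1, d2, d5}, two of the four relevant documents are found (recall 0.5) and two of the three retrieved documents are relevant (precision 2/3).
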
7
Specific Issues
  • Segmentation
  • Stemming
  • Language Identification

8
Segmentation
  • Character set recognized
  • Roman alphabet
  • No diacritics (e.g. è, é, or ä) in Indonesian,
    but some might need to be included, as foreign
    terms might be absorbed unchanged (e.g. café)
  • Dash (-), as it is used in writing plurals (e.g.
    mobil "car", mobil-mobil "cars"), compound words
    (e.g. tanda-tangan), and affixed
    foreign/abbreviated terms (e.g. di-email, NEM-nya)

9
Segmentation
  • Character set recognized (cont.)
  • The digit 2 should receive special treatment, as
    it is commonly used to denote plurals (e.g.
    mobil2), including affixed plurals (e.g.
    mobil2-nya) and plurals of affixed terms (e.g.
    peternak2)
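
The character-set rules above can be approximated with a regular expression. This is a sketch in Python under stated assumptions, not the EBNF rule set the authors developed: the pattern, the names `tokenize` and `expand_plural`, and the choice to rewrite "2" as a hyphenated reduplication are all illustrative.

```python
import re

# A word is a run of letters (accented letters included), optionally
# followed by the plural marker "2", with further hyphen-joined parts.
# "2" followed by another digit is not treated as a plural marker.
WORD = re.compile(r"[^\W\d_]+(?:2(?!\d))?(?:-[^\W\d_]+(?:2(?!\d))?)*")

# Letters followed by the plural marker "2".
PLURAL_2 = re.compile(r"([^\W\d_]+)2(?!\d)")


def tokenize(text):
    """Split text into word tokens, keeping hyphens and the plural 2."""
    return WORD.findall(text)


def expand_plural(token):
    """Rewrite the shorthand plural, e.g. mobil2 to mobil-mobil."""
    return PLURAL_2.sub(r"\1-\1", token)
```

With these definitions, `tokenize("Mobil2-nya di-email ke café")` keeps `Mobil2-nya`, `di-email`, and `café` as single tokens, and `expand_plural("mobil2-nya")` gives `mobil-mobil-nya`.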

10
Stemming
  • The Indonesian language is morphologically rich,
    i.e. affixes are heavily used
  • Examples of affixed words (English glosses): cars,
    government, reading
  • All of the known Indonesian stemmers are
    dictionary-based
  • The challenge is to build an effective
    non-dictionary-based stemmer

11
Stemming
  • Affixes can appear as:
  • prefixes (e.g. berbicara)
  • suffixes (e.g. peranan)
  • infixes (e.g. gembung -> gelembung)
  • circumfixes (e.g. memperbaiki)
  • Letters might change due to affixes (e.g.
    pe- + perintah -> pemerintah)
  • Forms: inflections (e.g. pukul vs. dipukul) and
    derivations (e.g. perintah vs. pemerintah)
  • Inflections: grammatical variants (e.g. due to
    gender, time)
  • Derivations: semantic variants (i.e. a change in
    meaning)

12
Stemming
  • Indonesian is morphologically rich, but most
    affixes play a derivational role, so stemming
    might not be as crucial as in other
    morphologically rich languages (e.g. Slovene
    [Popovic92], French)
  • Affixes might be applied iteratively (e.g.
    memperbaiki: two prefixes and a suffix)
  • The set of affixes is growing [Kridalaksana89]
    (e.g. buatin), which is especially apparent in
    colloquial writing and day-to-day conversation
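
To illustrate iterative, non-dictionary affix stripping, here is a small Python sketch. It is not the authors' stemmer (theirs are Definite Clause Grammars in Prolog). The affix lists, the minimum stem length, and the name `strip_affixes` are assumptions chosen to cover the slide examples, and letter changes such as pe- + perintah -> pemerintah are not handled.

```python
# Hypothetical, deliberately tiny affix inventory.
PREFIXES = ("mem", "ber", "per", "di")
SUFFIXES = ("kan", "nya", "an", "i")
MIN_STEM = 4  # assumption: never cut a stem below four letters


def strip_affixes(word):
    """Repeatedly remove one known prefix and one known suffix.

    No dictionary is consulted, so over-stemming is possible.
    """
    stem = word.lower()
    changed = True
    while changed:
        changed = False
        for prefix in PREFIXES:
            if stem.startswith(prefix) and len(stem) - len(prefix) >= MIN_STEM:
                stem = stem[len(prefix):]
                changed = True
                break
        for suffix in SUFFIXES:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM:
                stem = stem[:-len(suffix)]
                changed = True
                break
    return stem
```

On the slide examples this gives `memperbaiki` -> `baik` (two prefixes and a suffix removed over two passes), `berbicara` -> `bicara`, and `dipukul` -> `pukul`. Without a dictionary the loop cannot tell a real affix from a stem that merely starts or ends with the same letters, which is the difficulty the slides point to.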

13
Language Identification
  • We need to be able to effectively separate
    Indonesian from non-Indonesian documents

14
Current Progress and Results
  • Segmentation
  • Stemming
  • Language Identification

15
Current Progress and Results
  • Segmentation
  • A formal segmentation rule, specified in the form
    of EBNF rules, has been developed

16
Current Progress and Results
  • Stemming Algorithms
  • Implemented two non-dictionary-based stemmers in
    the form of Definite Clause Grammars in Prolog

17
Current Progress and Results
  • We propose a language distinction (as opposed to
    language identification) technique that can
    decide whether a document is written in
    Indonesian given only Indonesian samples
    (Positive Only Learning)

[Diagram labels: Training documents; Learning Algorithm]
18
Trigrams
  • Our technique is based on frequency of trigrams
  • Trigrams are three-character substrings of a word,
    e.g. the trigrams of "hello" are hel, ell, and
    llo
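
Extracting the trigrams of a word is a sliding window of width three. A short Python sketch (the function name `trigrams` is hypothetical):

```python
def trigrams(word):
    """Return the three-character substrings of a word, in order."""
    return [word[i:i + 3] for i in range(len(word) - 2)]
```

`trigrams("hello")` returns `["hel", "ell", "llo"]`, matching the example above; words shorter than three characters yield no trigrams.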

19
Positive Only Learning
  • Algorithm

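\[
h_x(d) =
\begin{cases}
1 & \text{if } \sum_i w_i \, f_{i,d} > \theta \\
0 & \text{otherwise}
\end{cases}
\]

where
h_x(d): hypothesis that document d is written in language x
f_{i,d}: frequency of trigram i in document d
w_i: weight associated with trigram i (from the training set)
\theta: some threshold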
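
A possible reading of the decision rule, sketched in Python: weights are learned from positive (Indonesian) samples only, and a document is accepted when its weighted trigram score exceeds the threshold. The slides do not give the weighting scheme, so using relative trigram frequencies both for the weights and for the document frequencies is an assumption, and every function name here is hypothetical.

```python
from collections import Counter


def word_trigrams(text):
    """All three-character substrings of each whitespace-separated word."""
    grams = []
    for word in text.lower().split():
        grams.extend(word[i:i + 3] for i in range(len(word) - 2))
    return grams


def relative_frequencies(text):
    """Map each trigram to its share of all trigrams in the text."""
    counts = Counter(word_trigrams(text))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def train_weights(samples):
    """Assumption: weight of a trigram = its relative frequency in the
    positive training samples. No negative samples are needed."""
    return relative_frequencies(" ".join(samples))


def score(weights, document):
    """Weighted sum over the trigrams of the document."""
    freqs = relative_frequencies(document)
    return sum(weights.get(g, 0.0) * f for g, f in freqs.items())


def is_language(weights, document, threshold):
    """Accept the document when its score exceeds the threshold."""
    return score(weights, document) > threshold
```

Trigrams never seen in the training samples contribute nothing to the score, so text in another language tends to fall below the threshold even though no example of that language was ever shown to the learner.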
20
Positive Only Learning
21
Positive Only Learning
  • Experimental results

22
Continuous Learning
  • Given the high performance of the method, can the
    algorithm trust its own decision to learn more?
  • Would it then be able to adapt to changes in
    language trends?

23
Continuous Learning
  • The idea is to treat new documents that are
    judged to be Indonesian as new samples
    (Continuous Learning)

[Diagram label: Training documents]
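
The feedback loop can be sketched as a classifier that adds every accepted document to its own training counts. This Python sketch assumes the same relative-frequency weighting as a plain weighted-trigram scorer; the class name `ContinuousLearner` and its methods are hypothetical, and the slides do not specify how the authors update their weights.

```python
from collections import Counter


def word_trigrams(text):
    """All three-character substrings of each whitespace-separated word."""
    grams = []
    for word in text.lower().split():
        grams.extend(word[i:i + 3] for i in range(len(word) - 2))
    return grams


class ContinuousLearner:
    def __init__(self, samples, threshold):
        self.counts = Counter()  # trigram counts over all training text
        self.threshold = threshold
        for sample in samples:
            self._absorb(sample)

    def _absorb(self, text):
        """Add the trigrams of a text to the training counts."""
        self.counts.update(word_trigrams(text))

    def score(self, document):
        """Weighted sum of the document's relative trigram frequencies."""
        grams = word_trigrams(document)
        total = sum(self.counts.values())
        if not grams or not total:
            return 0.0
        freqs = Counter(grams)
        return sum((self.counts[g] / total) * (c / len(grams))
                   for g, c in freqs.items())

    def classify_and_learn(self, document):
        """Judge the document; if accepted, treat it as a new sample."""
        accepted = self.score(document) > self.threshold
        if accepted:
            self._absorb(document)
        return accepted
```

Because accepted documents change the weights, new trigrams (for instance from emerging colloquial affixes) can enter the model over time; the same mechanism also means that a wrongly accepted document shifts the model, which is why the question of whether the algorithm can trust its own decisions matters.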

24
Experimental Results
25
Conclusion
  • "The Jungle is Neutral", F. S. Chapman
  • "The Web is neutral", S. F. Bressan
26
Conclusion
  • We must be proactive in the usage and promotion
    of our languages, and in the sharing and
    development of our cultures.
  • Computer scientists must design and build tools
    that cater for cultural specificities.