Developing IR systems via Sansarn Look

Slides: 26

Provided by: wikiNe

Transcript and Presenter's Notes

1
Developing IR systems via Sansarn Look!
2
System Structure
Database: (1) Researchers (2) Research projects (3) Publications (4) Patents
3
Introducing Sansarn Look

Sansarn Look is a program for developing and managing data search systems through a Web browser.
Users can create indexes from existing files on the local machine's hard disk or from documents on the Web.
The program works through a Web browser interface, which makes data management, issuing commands and configuring the other system components easy.
Outstanding characteristics of Sansarn Look
  Searches Thai-language keywords correctly and covers most of the content related to the entered keyword.
  Improves search quality and efficiency. For example, Query Suggestion is presented to users to help them find an effective keyword.

4
System Capabilities

Can search with keywords in both Thai and English.
Supports structured and unstructured data by merging a DBMS with an information retrieval system.
Includes intelligent information analysis, such as:
  Statistical analysis
  Information visualization
  NLP
  Text mining

5
(1) Database Connectivity: Mirror Site

A standard data structure, such as an XML schema, is used to map each field's value to the matching Dublin Core metadata tag.
The remote server of the data resource generates HTML documents from its existing database, one HTML document per database record (e.g., a research project).
The remote server then gathers the HTML documents into a list. The ThaiReSearch server sends a crawler to fetch and store the documents on that list via HTTP.
The ThaiReSearch server parses the tags and uses the data to create indexes for searching.
Users search the data directly through the indexes on the ThaiReSearch server.
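
The "one record, one HTML document" step can be sketched as follows. This is only an illustration under assumptions: the class name, the field names and the use of the common name="DC.element" meta-tag convention are not taken from the slides, and the real system may lay out its documents differently.

```java
import java.util.Map;

/** Sketch: renders one database record as one HTML document with Dublin Core meta tags. */
final class DublinCoreHtml {

    /** Escapes the characters that are unsafe inside HTML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /**
     * @param fields Dublin Core element name (e.g. "title") mapped to its value;
     *               the element names used by the real system are an assumption
     * @param body   free text of the record, placed in the HTML body
     */
    static String render(Map<String, String> fields, String body) {
        StringBuilder html = new StringBuilder("<html>\n<head>\n");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            html.append("<meta name=\"DC.").append(escape(e.getKey()))
                .append("\" content=\"").append(escape(e.getValue())).append("\">\n");
        }
        html.append("</head>\n<body>").append(escape(body)).append("</body>\n</html>\n");
        return html.toString();
    }
}
```

A crawler that fetches such a page can read the meta tags back and index each Dublin Core element as its own field.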

6
(2) Database Connectivity: Remote Server

A standard data structure, such as an XML schema, is used to map each field's value to the matching Dublin Core metadata tag.
The remote server of the data resource generates HTML documents from its existing database, one HTML document per database record (e.g., a research project). At the same time, the remote server creates the indexes for searching.
When a user searches, the query is sent through the ThaiReSearch server to the remote server of the data resource. The remote server searches its index database and sends the results back to the ThaiReSearch server, which displays them to the user.
The ThaiReSearch server and the remote server communicate via a Web Service.
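
The query flow above can be sketched as a front end that owns no index and only forwards queries. The interface and class names are hypothetical, the support for several data sources is an assumption, and the actual Web Service call is replaced by a plain method call.

```java
import java.util.ArrayList;
import java.util.List;

/** What the front-end server expects from a data source (hypothetical interface). */
interface RemoteSearchService {
    List<String> search(String query);
}

/** Front end that holds no index: it forwards each query and relays the results. */
final class ForwardingFrontEnd {
    private final List<RemoteSearchService> sources;

    ForwardingFrontEnd(List<RemoteSearchService> sources) {
        this.sources = sources;
    }

    List<String> search(String query) {
        List<String> merged = new ArrayList<>();
        for (RemoteSearchService source : sources) {
            // In the described system this would be a Web Service request.
            merged.addAll(source.search(query));
        }
        return merged;
    }
}
```

Every search pays for one round trip per remote server, which is the search-time cost this technique trades for not having to copy the data.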

7
Comparison of Database Connectivity Techniques

(1) Mirror site
  Advantages
    Security of the data resource.
    Searching takes less time than with the remote server technique.
  Disadvantages
    Not suitable for large databases, because storing the data takes time.
    Difficult to update data.
(2) Remote server
  Advantages
    Saves the time needed to store data, and the data does not have to be updated.
  Disadvantages
    Searching takes more time than with a mirror site.

8
Sansarn Look!-Lucene API
9
Sansarn Look! Architecture
10
Sansarn Look! Hardware Requirements

Platform: Windows and Linux
CPU speed: at least 1 GHz (recommended)
Memory: 512 MB
HD space (system): 100 MB
HD space (data/index): depends on data size
Java: JDK 1.5.0

11
Characteristics of Sansarn Look

Built on Lucene, a widely used open-source IR library
Supports full-text and field-text search
Supports local files and Web pages
Supports English and Thai texts
High-performance indexing for the Thai language
User-friendly browser interface
Configurable
Additional features for the Thai language
  Query suggestion
  Query spell-check
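
The slides do not say how query spell-check is implemented. One common approach, shown here purely as a sketch under that assumption, is to suggest the vocabulary word with the smallest edit distance to the query. The class name, method names and the distance threshold are all illustrative.

```java
import java.util.List;

final class SpellCheckSketch {

    /** Classic dynamic-programming Levenshtein distance between two strings. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the vocabulary word closest to the query, or null if none is within maxDistance. */
    static String suggest(String query, List<String> vocabulary, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String word : vocabulary) {
            int d = editDistance(query, word);
            if (d < bestDistance) {
                best = word;
                bestDistance = d;
            }
        }
        return best;
    }
}
```

In a real system the vocabulary would typically come from the terms already stored in the index.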

12
(No Transcript)
13
(No Transcript)
14
(No Transcript)
15
Sansarn Look! Roadmap

2006: Sansarn Look! version 1.0
ThaiAnalyzer for indexing: word segmentation (LexTo)
Query prediction (i-Key)
Query approximation
Thai unknown-word collection
Desktop search: Linux and Windows
Focused crawler: domain-specific

16
Sansarn Look! Roadmap (cont'd)

2007: Sansarn Look! version 2.0
Word segmentation (LexTo)
  Syllable segmentation for unknown words
  Dictionary with unknown words added
Soundex search
Ranking: combining keyword (on-page) factors with link-analysis (off-page) factors
Search result clustering
Focused crawler: language-specific
Thailand's Web archive

17
Sansarn Look! Roadmap (cont'd)

2008-10 (long-term plan)
Search result visualization
Question Answering retrieval
Cross-language (Thai-English) retrieval
Semantic search via ontology

18
Lucene: An Open-Source IR Library

Lucene is a high-performance, scalable Information Retrieval (IR) library.
Facts
  created by Doug Cutting
  free, open-source, originally implemented in Java
  a member of the popular Apache Jakarta projects
  licensed under the Apache Software License
  not a complete application
  also ported to Perl, Python, C, .NET
  active user and developer communities
  widely used, e.g., by IBM and Microsoft

19
Lucene: A Typical Application Integration
20
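Lucene API: org.apache.lucene.document

Sample usage
title: Search Engine Performance
keyword: Internet
category: computer
content: Search engines are evaluated based on some measures.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
// title: stored, indexed and tokenized
doc.add(Field.Text("title", "Search Engine Performance"));
// keyword: stored and indexed as a single term
doc.add(Field.Keyword("keyword", "Internet"));
// category: stored only, not searchable
doc.add(Field.UnIndexed("category", "computer"));
// content: indexed and tokenized, but the text itself is not stored
doc.add(Field.UnStored("content", "Search engines are evaluated based on ..."));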
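
The four Field factory methods used above differ only in three yes/no properties: whether the value is stored, indexed, and tokenized. The enum below is a hypothetical model of that, not part of Lucene, and it describes the String-valued factory methods of the Lucene 1.x API that the sample uses.

```java
/** Sketch: the four field kinds from the sample, modelled as three independent flags. */
enum FieldKind {
    TEXT(true, true, true),         // Field.Text: stored, indexed, tokenized
    KEYWORD(true, true, false),     // Field.Keyword: stored, indexed as one term
    UNINDEXED(true, false, false),  // Field.UnIndexed: stored only
    UNSTORED(false, true, true);    // Field.UnStored: indexed and tokenized, not stored

    final boolean stored;
    final boolean indexed;
    final boolean tokenized;

    FieldKind(boolean stored, boolean indexed, boolean tokenized) {
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }

    /** A field can be matched by a query only if it was indexed. */
    boolean searchable() {
        return indexed;
    }

    /** A field's original text can be shown in results only if it was stored. */
    boolean retrievable() {
        return stored;
    }
}
```

Read this way, the sample stores the short fields for display and indexes the long content field without keeping a copy of its text.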

21
Lucene API: org.apache.lucene.index

A Term is <fieldName, text>
Inverted index
  Term -> <df, <docNum, <position>>>
Lucene's index algorithm
  Two basic algorithms
    (1) make an index for a single document
    (2) merge a set of indices
  Incremental algorithm
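
The inverted-index mapping and the two basic algorithms can be illustrated with a toy in-memory structure. This is a sketch only: the class and method names are made up, terms are plain strings rather than field/text pairs, and Lucene's on-disk segment format and merge policy are far more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy positional inverted index: term -> (docNum -> positions). */
final class InvertedIndexSketch {
    final Map<String, TreeMap<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Algorithm (1): build an index for a single, already tokenized document. */
    static InvertedIndexSketch ofDocument(int docNum, List<String> tokens) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        for (int pos = 0; pos < tokens.size(); pos++) {
            index.postings
                 .computeIfAbsent(tokens.get(pos), t -> new TreeMap<>())
                 .computeIfAbsent(docNum, d -> new ArrayList<>())
                 .add(pos);
        }
        return index;
    }

    /** Algorithm (2): merge a set of indices; document numbers are assumed to be distinct. */
    static InvertedIndexSketch merge(List<InvertedIndexSketch> parts) {
        InvertedIndexSketch merged = new InvertedIndexSketch();
        for (InvertedIndexSketch part : parts) {
            part.postings.forEach((term, docs) ->
                docs.forEach((docNum, positions) ->
                    merged.postings
                          .computeIfAbsent(term, t -> new TreeMap<>())
                          .computeIfAbsent(docNum, d -> new ArrayList<>())
                          .addAll(positions)));
        }
        return merged;
    }

    /** df: the number of documents that contain the term. */
    int documentFrequency(String term) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }
}
```

Indexing incrementally then amounts to building a small index for each new document and merging it with the existing ones.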

22
Lucene API: org.apache.lucene.index

IndexWriter is the core component for creating an index. It creates and stores indexes from documents that have been converted into Document instances, as in the example below.

import java.io.File;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.*;

// dirName, analyzer and doc are assumed to be defined elsewhere
File indexDir = new File(dirName);
Directory directory = FSDirectory.getDirectory(indexDir, true);
IndexWriter writer = new IndexWriter(directory, analyzer, true);
writer.addDocument(doc);
writer.close();

Note
  directory: the location of the index database
  analyzer: the method used to analyze the text
  true: tells Lucene to create a new index database, replacing the old one if it exists

23
Lucene API: org.apache.lucene.search

IndexSearcher is the core component for searching the index database. It opens the index database in read-only mode and finds the documents that match the user's query.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;

Directory directory = FSDirectory.getDirectory(indexDir, false);
IndexSearcher is = new IndexSearcher(directory);

Hits holds the results returned by calling the search method of IndexSearcher.

// A TermQuery is not analyzed: the term text must equal the indexed token exactly
Query query = new TermQuery(new Term("content", "Thailand"));
Hits hits = is.search(query);

24
Lucene API: org.apache.lucene.search

Primitive queries
  TermQuery: matches docs containing a Term
  PhraseQuery: matches docs containing a sequence of Terms
  BooleanQuery: matches docs that match a combination of other queries
Derived queries
  PrefixQuery, WildcardQuery, etc.
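
To make the differences between these query types concrete, the sketch below evaluates them directly against a tokenized document. It is not how Lucene executes queries (Lucene looks terms up in the inverted index), the names are hypothetical, and the Boolean case covers AND only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** A query decides whether a tokenized document matches (hypothetical sketch type). */
interface QuerySketch {
    boolean matches(List<String> tokens);

    /** TermQuery analogue: the document contains the term. */
    static QuerySketch term(String term) {
        return tokens -> tokens.contains(term);
    }

    /** PhraseQuery analogue: the terms occur consecutively and in order. */
    static QuerySketch phrase(String... terms) {
        List<String> wanted = Arrays.asList(terms);
        return tokens -> Collections.indexOfSubList(tokens, wanted) >= 0;
    }

    /** BooleanQuery analogue (AND only): every sub-query must match. */
    static QuerySketch allOf(QuerySketch... clauses) {
        return tokens -> Arrays.stream(clauses).allMatch(c -> c.matches(tokens));
    }

    /** PrefixQuery analogue: some token starts with the prefix. */
    static QuerySketch prefix(String prefix) {
        return tokens -> tokens.stream().anyMatch(t -> t.startsWith(prefix));
    }
}
```

For a document tokenized as search, engine, performance, phrase("search", "engine") matches while phrase("engine", "search") does not, because a phrase depends on token positions and not only on token presence.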

25
Lucene API: org.apache.lucene.analysis

An Analyzer is a TokenStream factory.
A TokenStream is an iterator over Tokens.
  Input is a character iterator (Reader)
A Token is a tuple <text, type, start, length, positionIncrement>
  text (e.g., pisa)
  type (e.g., word, sent, para)
  start, length: offsets, in characters (e.g., <5,4>)
  positionIncrement (normally 1)
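
The token tuple can be illustrated with a small whitespace tokenizer that records the character offsets of each token. The class below is a sketch with made-up names, not Lucene's Token class, and it assigns every token the type "word" and a position increment of 1.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a token as the tuple <text, type, start, length, positionIncrement>. */
final class TokenSketch {
    final String text;
    final String type;
    final int start;              // offset of the first character in the input
    final int length;             // number of characters
    final int positionIncrement;  // distance from the previous token, normally 1

    TokenSketch(String text, String type, int start, int length, int positionIncrement) {
        this.text = text;
        this.type = type;
        this.start = start;
        this.length = length;
        this.positionIncrement = positionIncrement;
    }

    /** Splits the input at whitespace and records the offsets of each token. */
    static List<TokenSketch> tokenize(String input) {
        List<TokenSketch> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (Character.isWhitespace(input.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            tokens.add(new TokenSketch(input.substring(start, i), "word", start, i - start, 1));
        }
        return tokens;
    }
}
```

Tokenizing the input "from pisa to rome" yields pisa as the second token with start 5 and length 4, which corresponds to the <5,4> example above.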

26
Lucene API: org.apache.lucene.analysis

Lucene provides classes for various text analysis approaches
  WhitespaceAnalyzer: divides text at whitespace.
  SimpleAnalyzer: divides text at non-letter characters and normalizes token text to lower case.
  StopAnalyzer: normalizes token text to lower case and removes stop words from the token stream.
  StandardAnalyzer: combines the above tasks with grammar-based tokenization; suitable for English texts.
These classes are designed for the English language. To make the analysis applicable to another language, a language-specific analyzer package must be developed to segment or tokenize texts into a set of words.
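
Thai is written without spaces between words, so whitespace-based tokenizing does not apply. One simple family of techniques is dictionary-based longest matching, sketched below. This is an assumption-laden illustration and not the LexTo implementation: the class name is made up, and the example runs on unspaced Latin text with a tiny dictionary so that it stays readable in any encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Sketch: greedy longest-match word segmentation against a dictionary. */
final class LongestMatchSegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength;

    LongestMatchSegmenter(Set<String> dictionary) {
        this.dictionary = dictionary;
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    /** Scans left to right; characters not covered by the dictionary become one-character tokens. */
    List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = i + 1; // fallback: a single unknown character
            for (int len = Math.min(maxWordLength, text.length() - i); len > 1; len--) {
                if (dictionary.contains(text.substring(i, i + len))) {
                    end = i + len;
                    break;
                }
            }
            words.add(text.substring(i, end));
            i = end;
        }
        return words;
    }
}
```

With the dictionary {thai, sea, search, engine}, the input "thaisearchengine" is split into thai, search, engine: the longer entry "search" wins over "sea". Greedy matching can choose a wrong split when words overlap, and unknown words fall apart into single characters, which is why the roadmap lists separate handling of unknown words.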