Title: Developing IR systems via Sansarn Look
1. Developing IR systems via Sansarn Look!
2. System Structure
- Database: (1) Researchers, (2) Research projects, (3) Publications, (4) Patents
3. Introducing Sansarn Look
- Sansarn Look is a program for developing and managing a data search system through a Web browser.
- Users can create indexes from existing files on the local machine's hard disk or from documents on the Web.
- The program works through the Web browser's interface, which makes it easy to manage data, issue commands, and configure the other components of the system.
- Distinguishing characteristics of Sansarn Look
  - Searches Thai-language keywords correctly and covers most of the content relevant to the entered keyword.
  - Improves search quality and efficiency. For example, Query Suggestion is presented to users to help them find an effective keyword.
4. Qualification of System
- Can search with keywords in both Thai and English.
- Supports structured and unstructured data by merging a DBMS with an Information Retrieval system.
- Contains Intelligent Information Analysis components:
  - Statistical Analysis
  - Information Visualization
  - NLP
  - Text Mining
5. (1) Database Connectivity: Mirror Site
- Uses a standard data structure, such as an XML schema, to map field values to Dublin Core metadata tags.
- The Remote Server of the data resource generates HTML documents from the existing database, one HTML document per database record (e.g., a research project).
- The Remote Server then gathers the HTML documents into a list. The ThaiReSearch server sends a Crawler that fetches and stores the documents on that list via HTTP.
- The ThaiReSearch server parses the tags and uses the data to create indexes for searching.
- Users search the data directly through the indexes on the ThaiReSearch server.
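The record-to-HTML step described above can be sketched as follows. This is a minimal illustration, not the ThaiReSearch implementation: the class name, the field names (`title`, `creator`), and the sample values are assumptions, and the only convention relied on is the common practice of writing Dublin Core elements as `<meta name="DC.element" content="...">` tags.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: renders one database record as one HTML document
// whose <head> carries the record's fields as Dublin Core meta tags.
class DublinCoreHtmlSketch {

    // Escape characters that are unsafe inside HTML attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // One record (element name -> value) becomes one HTML document.
    static String toHtml(Map<String, String> record) {
        StringBuilder html = new StringBuilder();
        html.append("<html>\n<head>\n");
        for (Map.Entry<String, String> field : record.entrySet()) {
            html.append("<meta name=\"DC.").append(escape(field.getKey()))
                .append("\" content=\"").append(escape(field.getValue()))
                .append("\">\n");
        }
        html.append("</head>\n<body></body>\n</html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        // Sample record with made-up values, for illustration only.
        Map<String, String> record = new LinkedHashMap<String, String>();
        record.put("title", "Thai word segmentation");
        record.put("creator", "A. Researcher");
        System.out.print(toHtml(record));
    }
}
```

A crawler that fetches such a page can recover each field by reading the `name` and `content` attributes, which is what makes field-level indexing possible on the mirror side.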
6. (2) Data Connectivity: Remote Server
- Uses a standard data structure, such as an XML schema, to map field values to Dublin Core metadata tags.
- The Remote Server of the data resource generates HTML documents from the existing database, one HTML document per database record (e.g., a research project), and at the same time creates the indexes for searching.
- When a user searches, the query is sent through the ThaiReSearch server to the Remote Server of the data resource. The Remote Server searches its index database and sends the results back to the ThaiReSearch server for display to the user.
- The ThaiReSearch server and the Remote Server communicate via a Web Service.
7. Comparison of Database Connectivity Techniques
- (1) Mirror site
  - Advantages
    - Security of the data resource.
    - Searching takes less time than with the Remote Server technique.
  - Disadvantages
    - Not suitable for large databases; storing the data takes time.
    - Difficult to update data.
- (2) Remote server
  - Advantage
    - Saves the time needed to store data, and there is no need to update data.
  - Disadvantage
    - Searching takes more time than with the Mirror Site technique.
8. Sansarn Look! - Lucene API
9. Sansarn Look! Architecture
10. Sansarn Look! Hardware Requirement
- Platform: Windows and Linux
- CPU speed: at least 1 GHz (recommended)
- Memory: 512 MB
- HD space (system): 100 MB
- HD space (data/index): depends on data size
- Java: JDK 1.5.0
11. Characteristics of Sansarn Look
- Built on Lucene, a widely used open-source IR library
- Supports full-text and field-text search
- Supports local files and Web pages
- Supports English and Thai texts
- High-performance indexing for the Thai language
- User-friendly browser interface
- Configurability
- Additional features for the Thai language
  - Query suggestion
  - Query spell-check
15. Sansarn Look! Roadmap
- 2006: Sansarn Look! version 1.0
  - ThaiAnalyzer for indexing: word segmentation (LexTo)
  - Query prediction (i-Key)
  - Query approximation
  - Thai unknown word collection
  - Desktop search: Linux and Windows
  - Focused crawler: domain-specific
16. Sansarn Look! Roadmap (cont'd)
- 2007: Sansarn Look! version 2.0
  - Word segmentation (LexTo)
    - Syllable segmentation for unknown words
    - Dictionary with unknown words added
  - Soundex search
  - Ranking
    - combining keyword (on-page) factors with link-analysis (off-page) factors
  - Search result clustering
  - Focused crawler: language-specific
  - Thailand's Web archive
17. Sansarn Look! Roadmap (cont'd)
- 2008-10 (long-term plan)
  - Search result visualization
  - Question Answering retrieval
  - Cross-language (Thai-English) retrieval
  - Semantic search via ontology
18. Lucene: An Open Source IR Library
- Lucene is a high-performance, scalable Information Retrieval (IR) library.
- Facts
  - created by Doug Cutting
  - free, open-source, originally implemented in Java
  - a member of the popular Apache Jakarta projects
  - licensed under the Apache Software License
  - not a complete application
  - also ported to Perl, Python, C, .NET
  - active user and developer communities
  - widely used, e.g., by IBM and Microsoft
19. Lucene: A Typical Application Integration
20. Lucene API: org.apache.lucene.document
- Sample usage
  - title: Search Engine Performance
  - keyword: Internet
  - category: computer
  - content: Search engines are evaluated based on some measures.
- Document doc = new Document();
- doc.add(Field.Text("title", "Search Engine Performance"));
- doc.add(Field.Keyword("keyword", "Internet"));
- doc.add(Field.UnIndexed("category", "computer"));
- doc.add(Field.UnStored("content", "Search engines are evaluated based on ..."));
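The four factory methods used in the sample differ in whether a field's value is stored, indexed, and tokenized. The sketch below records those flags for the String-valued variants as they behave in the Lucene 1.x API. The enum and its names are illustrative only and are not part of Lucene.

```java
// Illustrative summary of the Lucene 1.x Field factory methods
// (String-valued variants). Not a Lucene class.
class FieldKindSketch {

    enum Kind {
        TEXT(true, true, true),        // Field.Text: stored, indexed, tokenized
        KEYWORD(true, true, false),    // Field.Keyword: stored, indexed as a single term
        UNINDEXED(true, false, false), // Field.UnIndexed: stored only, not searchable
        UNSTORED(false, true, true);   // Field.UnStored: searchable, original text not kept

        final boolean stored;
        final boolean indexed;
        final boolean tokenized;

        Kind(boolean stored, boolean indexed, boolean tokenized) {
            this.stored = stored;
            this.indexed = indexed;
            this.tokenized = tokenized;
        }
    }

    static String describe(Kind kind) {
        return kind + ": stored=" + kind.stored
                + ", indexed=" + kind.indexed
                + ", tokenized=" + kind.tokenized;
    }

    public static void main(String[] args) {
        for (Kind kind : Kind.values()) {
            System.out.println(describe(kind));
        }
    }
}
```

In the sample, this is why the long content field uses UnStored (searchable without keeping a copy of the text), while the category field uses UnIndexed (returned with results but not searchable).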
21. Lucene API: org.apache.lucene.index
- A Term is <fieldName, text>
- Inverted index
  - Term -> <df, <docNum, <position>>>
- Lucene's index algorithm
  - Two basic algorithms
    - (1) make an index for a single document
    - (2) merge a set of indices
  - Incremental algorithm
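The index structure and the two basic algorithms listed above can be sketched with plain JDK collections. This is a toy in-memory model under stated assumptions (whitespace tokenization, lower-casing, a single unnamed field), not Lucene's on-disk format or its actual merge code. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy positional inverted index: term -> (docNum -> positions).
// The document frequency (df) of a term is the size of its inner map.
class InvertedIndexSketch {

    final Map<String, Map<Integer, List<Integer>>> postings =
            new TreeMap<String, Map<Integer, List<Integer>>>();

    void add(String term, int docNum, int position) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        if (docs == null) {
            docs = new TreeMap<Integer, List<Integer>>();
            postings.put(term, docs);
        }
        List<Integer> positions = docs.get(docNum);
        if (positions == null) {
            positions = new ArrayList<Integer>();
            docs.put(docNum, positions);
        }
        positions.add(position);
    }

    // (1) Make an index for a single document.
    static InvertedIndexSketch indexDocument(int docNum, String text) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        String[] words = text.toLowerCase().trim().split("\\s+");
        for (int position = 0; position < words.length; position++) {
            if (words[position].isEmpty()) continue; // empty input
            index.add(words[position], docNum, position);
        }
        return index;
    }

    // (2) Merge a set of indices into one.
    static InvertedIndexSketch merge(List<InvertedIndexSketch> indices) {
        InvertedIndexSketch merged = new InvertedIndexSketch();
        for (InvertedIndexSketch index : indices) {
            for (Map.Entry<String, Map<Integer, List<Integer>>> term : index.postings.entrySet()) {
                for (Map.Entry<Integer, List<Integer>> doc : term.getValue().entrySet()) {
                    for (int position : doc.getValue()) {
                        merged.add(term.getKey(), doc.getKey(), position);
                    }
                }
            }
        }
        return merged;
    }

    int df(String term) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }
}
```

Indexing each new document on its own and merging it into the existing indices is what makes the approach incremental: adding a document does not require rebuilding the whole index.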
22. Lucene API: org.apache.lucene.index
- IndexWriter is the core component for creating an index. It builds and stores the index from documents that have been converted into instances of Document, as in the example below:
  - File indexDir = new File(dirName);
  - Directory directory = FSDirectory.getDirectory(indexDir, true);
  - IndexWriter writer = new IndexWriter(directory, analyzer, true);
  - writer.addDocument(doc);
  - writer.close();
- Note
  - directory: specifies the location of the index database
  - analyzer: specifies the method used to analyze the text
  - true: tells Lucene to create a new index database, replacing the old one if it exists
23. Lucene API: org.apache.lucene.search
- IndexSearcher is the core component for searching the index database. It opens the index database in read-only mode and finds the documents that match the user's query.
  - Directory directory = FSDirectory.getDirectory(indexDir, false);
  - IndexSearcher is = new IndexSearcher(directory);
- Hits holds the results returned by calling the search function of IndexSearcher.
  - Query query = new TermQuery(new Term("content", "Thailand"));
  - Hits hits = is.search(query);
24. Lucene API: org.apache.lucene.search
- Primitive queries
  - TermQuery: matches docs containing a Term
  - PhraseQuery: matches docs containing a sequence of Terms
  - BooleanQuery: matches docs matching other queries
- Derived queries
  - PrefixQuery, WildcardQuery, etc.
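The matching behaviour of these query types can be illustrated with set operations over a toy index that maps each term to the set of document numbers containing it. This sketch only models which documents match. It does no scoring, and it is not how Lucene implements its query classes. The class and method names are hypothetical. A phrase match is left out because it additionally needs term positions.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of query matching over term -> document-number sets.
class QuerySketch {

    // Like a TermQuery: documents containing the term.
    static Set<Integer> term(Map<String, Set<Integer>> index, String text) {
        Set<Integer> docs = index.get(text);
        return docs == null ? new TreeSet<Integer>() : new TreeSet<Integer>(docs);
    }

    // Like a BooleanQuery whose clauses are all required.
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<Integer>(a);
        result.retainAll(b);
        return result;
    }

    // Like a BooleanQuery whose clauses are all optional.
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<Integer>(a);
        result.addAll(b);
        return result;
    }

    // Like a BooleanQuery with one required and one prohibited clause.
    static Set<Integer> andNot(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<Integer>(a);
        result.removeAll(b);
        return result;
    }

    // A derived query: a prefix match is the OR of term matches
    // for every indexed term that starts with the prefix.
    static Set<Integer> prefix(Map<String, Set<Integer>> index, String prefix) {
        Set<Integer> result = new TreeSet<Integer>();
        for (Map.Entry<String, Set<Integer>> entry : index.entrySet()) {
            if (entry.getKey().startsWith(prefix)) {
                result.addAll(entry.getValue());
            }
        }
        return result;
    }
}
```

The `prefix` method shows in what sense such queries are derived: they can be expressed in terms of the primitive ones.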
25. Lucene API: org.apache.lucene.analysis
- An Analyzer is a TokenStream factory.
- A TokenStream is an iterator over Tokens.
  - Input is a character iterator (Reader)
- A Token is a tuple <text, type, start, length, positionIncrement>
  - text (e.g., pisa)
  - type (e.g., word, sent, para)
  - start, length: offsets, in characters (e.g., <5,4>)
  - positionIncrement (normally 1)
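The token tuple above can be made concrete with a small whitespace tokenizer. This is a sketch with hypothetical class names, not Lucene's Token or TokenStream classes. It assigns every token the type "word" and a positionIncrement of 1.

```java
import java.util.ArrayList;
import java.util.List;

// Toy tokenizer that produces <text, type, start, length, positionIncrement> tuples.
class TokenSketch {

    static final class Token {
        final String text;
        final String type;
        final int start;             // offset of the first character
        final int length;            // number of characters
        final int positionIncrement; // distance from the previous token

        Token(String text, String type, int start, int length, int positionIncrement) {
            this.text = text;
            this.type = type;
            this.start = start;
            this.length = length;
            this.positionIncrement = positionIncrement;
        }
    }

    // Split the input at whitespace, recording character offsets.
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<Token>();
        int i = 0;
        while (i < input.length()) {
            // Skip the whitespace before the next token.
            while (i < input.length() && Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token itself.
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(new Token(input.substring(start, i), "word", start, i - start, 1));
            }
        }
        return tokens;
    }
}
```

For the input "from pisa tower", the second token has text pisa, start 5, and length 4, which corresponds to the <5,4> offsets in the example above.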
26. Lucene API: org.apache.lucene.analysis
- Lucene provides classes for various text analysis approaches:
  - WhitespaceAnalyzer divides text at whitespace.
  - SimpleAnalyzer divides text at non-letter characters and normalizes token text to lower case.
  - StopAnalyzer normalizes token text to lower case and removes stop words from the token stream.
  - StandardAnalyzer combines all of the above tasks and is suitable for English texts.
- These classes are designed for the English language. To apply the analysis to another language, a language-specific analyzer package must be developed to segment or tokenize texts into a set of words.
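Thai is written without spaces between words, so splitting at whitespace is not enough. One common approach to this kind of segmentation is dictionary-based greedy longest matching, sketched below. This is an illustration under stated assumptions and is not claimed to be the algorithm used by LexTo or ThaiAnalyzer. The class name and the tiny dictionary are hypothetical, and characters that match no dictionary word are simply emitted one at a time. The Thai example ภาษาไทย ("Thai language") is written with Unicode escapes so that the source stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy dictionary-based segmenter using greedy longest matching.
class LongestMatchSegmenterSketch {

    static List<String> segment(String text, Set<String> dictionary) {
        // Longest dictionary entry bounds how far ahead we need to look.
        int maxLen = 0;
        for (String word : dictionary) {
            maxLen = Math.max(maxLen, word.length());
        }
        List<String> words = new ArrayList<String>();
        int i = 0;
        while (i < text.length()) {
            int end = -1;
            // Try the longest candidate first, then shorter ones.
            for (int len = Math.min(maxLen, text.length() - i); len >= 1; len--) {
                if (dictionary.contains(text.substring(i, i + len))) {
                    end = i + len;
                    break;
                }
            }
            if (end < 0) {
                end = i + 1; // unknown text: fall back to a single character
            }
            words.add(text.substring(i, end));
            i = end;
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<String>();
        dictionary.add("\u0E20\u0E32\u0E29\u0E32"); // "phasa" (language)
        dictionary.add("\u0E44\u0E17\u0E22");       // "thai"
        // Segments the unspaced string into the two dictionary words.
        System.out.println(segment("\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22", dictionary));
    }
}
```

Greedy matching can choose the wrong split when a string is ambiguous, and it has no good answer for words missing from the dictionary, which is one reason handling of unknown words matters for a Thai analyzer.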