1
Developing IR systems via Sansarn Look!
2
System Structure
Database (1) Researchers (2) Research
projects (3) Publications (4) Patents
3
Introducing Sansarn Look
  • Sansarn Look is a program for developing and
    managing data-search systems through a Web
    browser.
  • Users can build indexes from existing files on
    the local machine's hard disk or from documents
    on the Web.
  • The program works through a Web-browser
    interface, so data management, issuing commands,
    and configuring the other components of the
    system can all be done easily.
  • Distinguishing characteristics of Sansarn Look:
  • Searches Thai-language keywords correctly,
    covering most of the content relevant to the
    entered keyword.
  • Improves search effectiveness; for example,
    Query Suggestion helps users find more effective
    keywords.

4
Qualifications of the System
  • Searches keywords in both Thai and English.
  • Supports structured and unstructured data by
    merging a DBMS with an Information Retrieval
    system.
  • Contains Intelligent Information Analysis
    components:
  • Statistical Analysis
  • Information Visualization
  • NLP
  • Text Mining

5
(1) Database Connectivity: Mirror Site
  • Uses a standard data structure, such as an XML
    schema matched to Dublin Core metadata tags, to
    specify each field's value.
  • The remote server of the data source generates
    HTML documents from its existing database, one
    HTML document per database record (e.g., one
    research project).
  • The remote server then gathers the HTML documents
    into a list; the ThaiReSearch server sends a
    crawler to fetch and store the documents from
    that list over HTTP.
  • The ThaiReSearch server parses the tags and uses
    the data to build search indexes.
  • Users search the data directly against the
    ThaiReSearch server's indexes.
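The one-record-per-HTML-document step above can be sketched as follows. This is a minimal illustration, not the actual ThaiReSearch code: the class name `MirrorPageGenerator`, the `renderRecord` helper, and the sample field values are assumptions; only the Dublin Core `DC.*` meta-tag convention comes from the slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MirrorPageGenerator {
    // Render one database record (e.g. a research project) as one HTML
    // document, exposing each field as a Dublin Core <meta> tag so a
    // crawler can parse the fields back out for indexing.
    public static String renderRecord(Map<String, String> record) {
        StringBuilder html = new StringBuilder("<html><head>\n");
        for (Map.Entry<String, String> field : record.entrySet()) {
            html.append("<meta name=\"DC.").append(field.getKey())
                .append("\" content=\"").append(field.getValue()).append("\">\n");
        }
        html.append("</head><body>")
            .append(record.getOrDefault("Title", ""))
            .append("</body></html>");
        return html.toString();
    }

    public static void main(String[] args) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("Title", "Thai Word Segmentation");
        record.put("Creator", "NECTEC");
        System.out.println(renderRecord(record));
    }
}
```

A real deployment would also escape the field values and emit one such page per record into the list the crawler fetches.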

6
(2) Database Connectivity: Remote Server
  • Uses a standard data structure, such as an XML
    schema matched to Dublin Core metadata tags, to
    specify each field's value.
  • The remote server of the data source generates
    HTML documents from its existing database, one
    HTML document per record (e.g., one research
    project), and at the same time builds its own
    search indexes.
  • When a user searches, the query is sent through
    the ThaiReSearch server to the remote server of
    the data source; the remote server searches its
    index database and returns the results to the
    ThaiReSearch server for display to the user.
  • Connectivity between the ThaiReSearch server and
    the remote server is done via a Web Service.
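The query-forwarding flow above can be sketched as a small broker on the ThaiReSearch side. The `RemoteSearch` interface and `federatedSearch` method are hypothetical names standing in for the actual Web Service interface, which the slide does not specify.

```java
import java.util.ArrayList;
import java.util.List;

public class FederatedSearchSketch {
    // Hypothetical stand-in for a remote server's search Web Service.
    interface RemoteSearch {
        List<String> search(String query);
    }

    // Broker: forward the user's query to each remote data source and
    // merge the returned hits for display to the user.
    static List<String> federatedSearch(String query, List<RemoteSearch> remotes) {
        List<String> merged = new ArrayList<>();
        for (RemoteSearch remote : remotes) {
            merged.addAll(remote.search(query));
        }
        return merged;
    }
}
```

In practice each `RemoteSearch` call would be an HTTP/SOAP request, and the broker would also handle ranking and time-outs.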

7
Comparison of database connectivity techniques
  • (1) Mirror site
  • Advantages
  • Security of the data source.
  • Search time is lower than with the remote-server
    technique.
  • Disadvantages
  • Not suitable for large databases; storing the
    data takes time.
  • Data are difficult to keep up to date.
  • (2) Remote server
  • Advantages
  • Saves time storing data, and no separate update
    step is needed.
  • Disadvantages
  • Search time is greater than with the mirror-site
    technique.

8
Sansarn Look!-Lucene API
9
Sansarn Look! Architecture
10
Sansarn Look! Hardware Requirements
  • Platform: Windows and Linux
  • CPU: at least 1 GHz (recommended)
  • Memory: 512 MB
  • HD space (system): 100 MB
  • HD space (data/index): depends on data size
  • Java: JDK 1.5.0

11
Characteristics of Sansarn Look
  • Built on Lucene, a widely used open-source IR
    library
  • Supports full-text and field-text search
  • Supports local files and Web pages
  • Supports English and Thai texts
  • High-performance indexing for Thai
  • User-friendly browser interface
  • Configurable
  • Additional features for Thai:
  • Query suggestion
  • Query spell-check

12
(No Transcript)
13
(No Transcript)
14
(No Transcript)
15
Sansarn Look! Roadmap
  • 2006: Sansarn Look! version 1.0
  • ThaiAnalyzer for indexing: word segmentation
    (LexTo)
  • Query prediction (i-Key)
  • Query approximation
  • Thai unknown-word collection
  • Desktop search: Linux and Windows
  • Focused crawler: domain-specific

16
Sansarn Look! Roadmap (cont'd)
  • 2007: Sansarn Look! version 2.0
  • Word segmentation (LexTo)
  • Syllable segmentation for unknown words
  • Dictionary with unknown words added
  • Soundex search
  • Ranking: combining keyword (on-page) and
    link-analysis (off-page) factors
  • Search result clustering
  • Focused crawler: language-specific
  • Thailand's Web archive

17
Sansarn Look! Roadmap (cont'd)
  • 2008-10: long-term plan
  • Search result visualization
  • Question Answering retrieval
  • Cross-language (Thai-English) retrieval
  • Semantic search via ontology

18
Lucene: An Open Source IR Library
  • Lucene is a high-performance, scalable
    Information Retrieval (IR) library.
  • Facts:
  • created by Doug Cutting
  • free, open-source, originally implemented in Java
  • a member of the popular Apache Jakarta projects
  • licensed under the Apache Software License
  • not a complete application
  • also ported to Perl, Python, C, and .NET
  • active user and developer communities
  • widely used, e.g., by IBM and Microsoft

19
Lucene: A Typical Application Integration
20
Lucene: org.apache.lucene.document
  • Sample usage:
  • title: Search Engine Performance
  • keyword: Internet
  • category: computer
  • content: Search engines are evaluated based on
    some measures.
  • Document doc = new Document();
  • doc.add(Field.Text("title", "Search Engine
    Performance"));
  • doc.add(Field.Keyword("keyword", "Internet"));
  • doc.add(Field.UnIndexed("category",
    "computer"));
  • doc.add(Field.UnStored("content", "Search
    engines are evaluated based on ..."));

21
Lucene API: org.apache.lucene.index
  • A Term is <fieldName, text>
  • Inverted index:
  • Term → <df, <docNum, <position>*>*>
  • Lucene's index algorithm
  • Two basic algorithms:
  • (1) make an index for a single document
  • (2) merge a set of indices
  • Incremental algorithm
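The postings structure Term → <df, <docNum, <position>*>*> and basic algorithm (1), indexing a single document, can be illustrated with a self-contained sketch. This is a toy class of our own, not Lucene's implementation: df is the number of documents a term occurs in, and each posting records the term's positions within one document.

```java
import java.util.*;

public class TinyInvertedIndex {
    // term -> (docNum -> positions); df(term) is the posting-list size.
    private final Map<String, SortedMap<Integer, List<Integer>>> index = new HashMap<>();

    // Algorithm (1): add a single document's tokens to the index.
    public void addDocument(int docNum, String[] tokens) {
        for (int pos = 0; pos < tokens.length; pos++) {
            index.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                 .computeIfAbsent(docNum, d -> new ArrayList<>())
                 .add(pos);
        }
    }

    // Document frequency: how many documents contain the term.
    public int df(String term) {
        return index.getOrDefault(term, Collections.emptySortedMap()).size();
    }

    // Token positions of the term inside one document.
    public List<Integer> positions(String term, int docNum) {
        return index.getOrDefault(term, Collections.emptySortedMap())
                    .getOrDefault(docNum, Collections.emptyList());
    }
}
```

Algorithm (2), merging a set of such indices, would amount to merging the posting lists term by term, which is what makes the incremental approach cheap.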

22
Lucene API: org.apache.lucene.index
  • IndexWriter is the core component for creating
    indexes. It builds and stores indexes from
    documents that have been converted into Document
    instances, as in the example below:
  • File indexDir = new File(dirName);
  • Directory fsDir = FSDirectory.getDirectory(indexDir, true);
  • IndexWriter writer = new IndexWriter(fsDir, analyzer, true);
  • writer.addDocument(doc);
  • writer.close();
  • Notes:
  • fsDir specifies the location of the index database
  • analyzer specifies the method for analyzing the text
  • true tells Lucene to create a new index database,
    replacing the old one if it exists

23
Lucene API: org.apache.lucene.search
  • IndexSearcher is the core component for searching
    an index database. It opens the index database in
    read-only mode and finds documents matching the
    user's query.
  • Directory directory = FSDirectory.getDirectory(indexDir, false);
  • IndexSearcher is = new IndexSearcher(directory);
  • Hits holds the search results, returned by
    calling the search method of IndexSearcher:
  • Query query = new TermQuery(new Term("content", "Thailand"));
  • Hits hits = is.search(query);

24
Lucene API: org.apache.lucene.search
  • Primitive queries:
  • TermQuery: matches docs containing a Term
  • PhraseQuery: matches docs with a sequence of Terms
  • BooleanQuery: matches docs matching other queries
  • Derived queries:
  • PrefixQuery, WildcardQuery, etc.
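How a BooleanQuery combines the results of the primitive queries can be shown with a toy postings map: a TermQuery becomes a posting-list lookup, and two required (AND) clauses reduce to a set intersection. The class, method names, and data here are illustrative, not Lucene internals.

```java
import java.util.*;

public class QuerySketch {
    // Toy postings: term -> set of matching doc numbers.
    static final Map<String, Set<Integer>> POSTINGS = Map.of(
        "search", Set.of(1, 2, 4),
        "engine", Set.of(2, 3, 4),
        "thai",   Set.of(4));

    // TermQuery analogue: docs containing a single term.
    static Set<Integer> termQuery(String term) {
        return POSTINGS.getOrDefault(term, Set.of());
    }

    // BooleanQuery analogue with two MUST clauses:
    // intersect the sub-queries' result sets.
    static Set<Integer> mustQuery(String a, String b) {
        Set<Integer> result = new TreeSet<>(termQuery(a));
        result.retainAll(termQuery(b));
        return result;
    }
}
```

SHOULD (OR) clauses would instead union the sets, and a PhraseQuery would additionally check that positions are adjacent.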

25
Lucene API: org.apache.lucene.analysis
  • An Analyzer is a TokenStream factory.
  • A TokenStream is an iterator over Tokens.
  • Input is a character iterator (Reader).
  • A Token is a tuple <text, type, start, length,
    positionIncrement>
  • text (e.g., "pisa")
  • type (e.g., "word", "sent", "para")
  • start/length offsets, in characters (e.g., <5,4>)
  • positionIncrement (normally 1)
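The Token tuple can be mirrored by a small class, with a whitespace tokenizer emitting tokens the way a TokenStream iterator would. The names below are our sketch, not Lucene's API; start/length are character offsets as on the slide.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenSketch {
    // Mirrors the tuple <text, type, start, length, positionIncrement>.
    public static final class Token {
        public final String text, type;
        public final int start, length, positionIncrement;
        Token(String text, String type, int start, int length, int positionIncrement) {
            this.text = text; this.type = type; this.start = start;
            this.length = length; this.positionIncrement = positionIncrement;
        }
    }

    // Whitespace tokenization with character offsets.
    public static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            // Skip whitespace, then consume one run of non-whitespace.
            while (i < input.length() && Character.isWhitespace(input.charAt(i))) i++;
            int start = i;
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) i++;
            if (i > start) {
                tokens.add(new Token(input.substring(start, i), "word", start, i - start, 1));
            }
        }
        return tokens;
    }
}
```

A stop-word filter layered on top would drop tokens and raise the next token's positionIncrement above 1 to record the gap.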

26
Lucene API: org.apache.lucene.analysis
  • Lucene provides classes for various text-analysis
    approaches:
  • WhitespaceAnalyzer: divides text at whitespace
  • SimpleAnalyzer: normalizes token text to lower
    case
  • StopAnalyzer: normalizes token text to lower case
    and removes stop words from the token stream
  • StandardAnalyzer: combines all of the above;
    suitable for English texts
  • These classes are designed for English. To make
    the analysis applicable to a particular language,
    a language-specific analyzer package must be
    developed to segment or tokenize texts into a set
    of words.
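The pipeline these analyzer classes build up (whitespace splitting, lower-casing, stop-word removal) can be approximated with standard-library code. The helpers below are illustrative stand-ins for the Lucene classes, not the classes themselves, and the stop-word list is a made-up sample.

```java
import java.util.*;
import java.util.stream.Collectors;

public class AnalyzerSketch {
    // Sample stop-word list (illustrative, not Lucene's default list).
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "of");

    // WhitespaceAnalyzer-style step: split at whitespace only.
    static List<String> whitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // SimpleAnalyzer-style step: normalize tokens to lower case.
    static List<String> lowercase(List<String> tokens) {
        return tokens.stream().map(t -> t.toLowerCase(Locale.ROOT))
                     .collect(Collectors.toList());
    }

    // StopAnalyzer-style step: drop stop words from the stream.
    static List<String> removeStops(List<String> tokens) {
        return tokens.stream().filter(t -> !STOP_WORDS.contains(t))
                     .collect(Collectors.toList());
    }
}
```

A Thai analyzer cannot reuse the whitespace step, since Thai is written without spaces between words; that is exactly why a language-specific segmenter such as LexTo is needed.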