The Lucene Search Engine: Powerful, Flexible - PowerPoint PPT Presentation

1
The Lucene Search Engine: Powerful, Flexible
FREE!!
  • By Anita Pillai

2
Introduction
  • Lucene is an open-source API maintained by the
    Apache Software Foundation's Jakarta Project.
  • It has been implemented in Java, with ports for
    C, .NET, Perl and Python.
  • Because it is written in a modular fashion, it
    gives a developer a tremendous amount of freedom
    to decide how to use it to suit an application.
  • The deciding factor for a search engine is its
    effectiveness.
  • Two factors determine effectiveness:
  • Accuracy (recall): The percentage of the relevant
    documents in the available set that were
    actually returned for a particular query.
  • Precision: The percentage of the documents
    returned that are actually about the particular
    query.
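The two measures above can be made concrete with a little arithmetic. The following is a minimal sketch, not part of Lucene: the class name, the document IDs, and the relevance judgments are all made up for illustration, and the first measure is computed under the name information retrieval texts usually give it (recall).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy effectiveness metrics; every ID and judgment here is hypothetical.
final class Effectiveness {

    // Fraction of the relevant documents that the engine actually returned.
    static double recall(Set<Integer> relevant, List<Integer> returned) {
        if (relevant.isEmpty()) return 0.0;
        return (double) overlap(relevant, returned) / relevant.size();
    }

    // Fraction of the (distinct) returned documents that are relevant.
    static double precision(Set<Integer> relevant, List<Integer> returned) {
        Set<Integer> distinct = new HashSet<>(returned);
        if (distinct.isEmpty()) return 0.0;
        return (double) overlap(relevant, returned) / distinct.size();
    }

    // Number of distinct returned documents that are also relevant.
    private static int overlap(Set<Integer> relevant, List<Integer> returned) {
        Set<Integer> common = new HashSet<>(returned);
        common.retainAll(relevant);
        return common.size();
    }

    public static void main(String[] args) {
        Set<Integer> relevant = new HashSet<>(Arrays.asList(3, 6, 7, 10));
        List<Integer> returned = Arrays.asList(10, 6, 12);
        // 2 of the 4 relevant documents came back; 2 of the 3 results are relevant.
        System.out.println("recall    = " + recall(relevant, returned));
        System.out.println("precision = " + precision(relevant, returned));
    }
}
```

A query that returns every document in the collection scores perfectly on the first measure and poorly on the second, which is why the two are reported together.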

3
Search Engine Concepts
  • There are two paths through a search engine: the
    index path and the query path.
  • The index path shows how the index gets filled
    with documents.
  • The documents are fed to an analyzer, which
    transforms them into the appropriate weighted
    terms (or scores) and passes them to the
    IndexWriter.

4
  • The query path through the search engine shows
    how the index is queried for documents.
  • The same analyzer is used to derive a set of
    terms from the user's query; these are, in turn,
    passed to the IndexSearcher to perform the search
    of the index.
  • Because an index rarely holds the entire
    document, a set of Hits is returned, where each
    hit represents what is retained about the
    document within the index.

5
Understanding indexing strategies
  • Original text: "When the user starts the
    application for the first time, Scishare will use
    the default settings that identifies the user
    called PSEUDO USER. Pseudo users are provided
    with automatically generated X.509 certificates
    and have access to public resources."
  • After removing stop words: "user starts
    application first time, Scishare use default
    settings identifies user called PSEUDO USER.
    Pseudo users provided automatically generated
    X.509 certificates access public resources."
  • After stemming: "user start application first
    time, Scishare use default setting identifies
    user called PSEUDO USER. Pseudo users provided
    automatically generated X.509 certificates access
    public resources."
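The two steps above can be sketched in a few lines of plain Java. This is a toy illustration only: the class name, the tiny stop list, and the suffix-stripping rules are assumptions made for the example, and they are far cruder than a real stemmer or the analyzers that ship with Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer: lowercases, drops stop words, then strips plural-like endings.
final class ToyAnalyzer {

    // Hypothetical stop list; real lists are much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "for", "have", "that", "the",
            "to", "when", "will", "with"));

    // Split on anything that is not a letter or digit and drop stop words.
    static List<String> removeStopWords(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Naive stemming: "ies" becomes "y", and a trailing "s" is removed
    // from longer words unless the word ends in "ss".
    static String stem(String token) {
        if (token.length() > 4 && token.endsWith("ies")) {
            return token.substring(0, token.length() - 3) + "y";
        }
        if (token.length() > 4 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Full pipeline: stop-word removal followed by stemming.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : removeStopWords(text)) {
            terms.add(stem(token));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze(
                "When the user starts the application for the first time"));
    }
}
```

Both steps shrink the vocabulary that has to be indexed, and stemming also lets a query for "setting" match a document that says "settings".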

6
  • Inverted index
  • Scishare: <10, 0.99>, <6, 0.97>, <12, 0.80>
  • Pseudo user: <10, 0.93>, <3, 0.92>, <7, 0.60>
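An inverted index like the one on this slide can be modeled as a map from each term to a list of postings. The sketch below assumes that each pair is a document ID followed by the term's weight in that document; the slide does not say so explicitly, and the class and method names are hypothetical rather than Lucene's.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy inverted index: term -> postings of (document ID, weight).
final class ToyInvertedIndex {

    // One posting; the (docId, weight) reading of the slide is an assumption.
    static final class Posting {
        final int docId;
        final double weight;

        Posting(int docId, double weight) {
            this.docId = docId;
            this.weight = weight;
        }
    }

    private final Map<String, List<Posting>> postings = new HashMap<>();

    // Record that a term occurs in a document with the given weight.
    void add(String term, int docId, double weight) {
        postings.computeIfAbsent(term, k -> new ArrayList<>())
                .add(new Posting(docId, weight));
    }

    // Document IDs for a term, highest weight first; empty if the term is unknown.
    List<Integer> lookup(String term) {
        List<Posting> found = new ArrayList<>(
                postings.getOrDefault(term, Collections.<Posting>emptyList()));
        found.sort((a, b) -> Double.compare(b.weight, a.weight));
        List<Integer> ids = new ArrayList<>();
        for (Posting p : found) {
            ids.add(p.docId);
        }
        return ids;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("scishare", 10, 0.99);
        index.add("scishare", 6, 0.97);
        index.add("scishare", 12, 0.80);
        index.add("pseudo user", 10, 0.93);
        index.add("pseudo user", 3, 0.92);
        index.add("pseudo user", 7, 0.60);
        System.out.println(index.lookup("scishare")); // prints [10, 6, 12]
    }
}
```

Looking up a term costs one map access regardless of how many documents were indexed, which is the point of inverting the document-to-terms relationship.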

7
Lucene
  • Lucene is a set of classes that an application
    uses according to its requirements.
  • You start by indexing your documents. To index
    documents, you write a method that performs the
    following steps:
  • Gather a list of files to be indexed.
  • Create an instance of a Document object to handle
    an InputStream to each file.
  • Create an instance of Analyzer. This could be the
    included StandardAnalyzer or one as sophisticated
    as you can make it.
  • Create an IndexWriter with the following: the
    location of the index, the instance of the
    Analyzer just created, and a flag that tells it
    whether to create the index (if it is missing).
  • Add the Document objects to the IndexWriter using
    the addDocument method.

8
  • Imports: StandardAnalyzer, IndexWriter,
    FileDocument

    private void indexFiles() throws Exception {
        // Arguments: index location, analyzer, and the create flag.
        IndexWriter writer = new IndexWriter("lib/indexInfo",
                new StandardAnalyzer(), true);
        indexDocs(writer, new File("lib/parseDoc"));
        writer.optimize();
        writer.close();
    }

    // Walk the directory tree and add every file to the index.
    public static void indexDocs(IndexWriter writer, File file)
            throws Exception {
        if (file.isDirectory()) {
            String[] files = file.list();
            for (int i = 0; i < files.length; i++) {
                indexDocs(writer, new File(file, files[i]));
            }
        } else {
            writer.addDocument(FileDocument.Document(file));
        }
    }

9
Queries
  • Create a QueryParser by passing in the default
    field to search (as a String) and an instance of
    Analyzer.
  • Call parse on QueryParser to return a Query
    object.
  • Initialize an IndexSearcher with the location of
    the index you wish to search.
  • Pass the Query object into the search method of
    IndexSearcher, which returns a Hits object: a
    ranked list of the matching Document objects.

10
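    // searchStr and path are assumed to be declared by the enclosing method.
    try {
        File lib = new File("lib");
        Analyzer analyzer = new StandardAnalyzer();
        Searcher searcher = new IndexSearcher(
                lib.getCanonicalPath() + "/indexInfo");
        // Parse the search string against the "contents" field.
        Query query = QueryParser.parse(searchStr, "contents", analyzer);
        Hits hits = searcher.search(query);
        // Hits are ranked; read the "path" field of each match.
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            path = doc.get("path");
        }
        searcher.close();
    } catch (Exception e) {
        e.printStackTrace();
    }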

11
  • The built-in query parser supports most queries,
    but if it is insufficient, you can always fall
    back on the set of query-building constructs
    provided. The query parser can parse queries like
    these:
  • free AND "text search": Search for documents
    containing "free" and the phrase "text search".
  • +text search: Search for documents containing
    "text" and preferentially containing "search".
  • giants -football: Search for "giants" but omit
    documents containing "football".
  • author:pillai java: Search for documents
    containing "pillai" in the author field and
    "java" in the body.
  • Lucene also lets you write your own Analyzer to
    accommodate the sophistication desired for a
    particular application.
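The required and prohibited terms in the query examples above can be illustrated with a toy matcher. This is a sketch under stated assumptions, not Lucene's QueryParser: the class name and the matching rules are made up for the example, and it ignores phrases, field prefixes, the AND keyword, and scoring. A term prefixed with "+" must be present, a term prefixed with "-" must be absent, and when no term is required, at least one plain term has to match.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy matcher for "+required", "-prohibited" and plain (optional) terms.
final class ToyQuery {

    // True if the document's term set satisfies the query under the toy rules.
    static boolean matches(String query, Set<String> docTerms) {
        boolean sawRequired = false;
        boolean optionalHit = false;
        for (String clause : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (clause.isEmpty()) {
                continue;
            }
            if (clause.startsWith("+")) {
                sawRequired = true;
                if (!docTerms.contains(clause.substring(1))) {
                    return false; // a required term is missing
                }
            } else if (clause.startsWith("-")) {
                if (docTerms.contains(clause.substring(1))) {
                    return false; // a prohibited term is present
                }
            } else if (docTerms.contains(clause)) {
                optionalHit = true;
            }
        }
        // With no required terms, at least one optional term must match.
        return sawRequired || optionalHit;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("giants", "baseball"));
        System.out.println(matches("giants -football", doc)); // prints true
        System.out.println(matches("+text search", doc));     // prints false
    }
}
```

In a real engine the optional terms would also raise the score of documents that contain them, which is what "preferentially containing" refers to; this sketch only decides whether a document matches at all.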

12
Applications
  • Searchable email.
  • Searchable web pages.
  • Website search.
  • Content search.
  • Version control and content management.

13
Conclusion
  • The primary goal for Lucene is "simplicity
    without loss of power or performance."
  • Lucene's design leaves the user in charge of the
    functions that he or she needs to know about
    (selecting and retrieving documents, storing the
    index data) and hides the details of how the
    underlying search engine works.
  • Because Lucene is an API, it can be very
    effectively used to index an e-mail Inbox, a
    database, or a set of news feeds. Its
    applications are limited only by how you choose
    to use it.
  • http://jakarta.apache.org/lucene/docs/index.html

14
THANK YOU