Lucene Continued - PowerPoint PPT Presentation

Slides: 16
Provided by: inter2
1
Lucene (Continued)
2
Lucene
High Level Infrastructure
  • When you look at building your search solution,
    you often find that the process is split into two
    main tasks: building an index, and searching that
    index. This is definitely the case with Lucene
    (the only time this isn't the case is when your
    search goes directly to the database).
  • We wanted to keep the search interface fairly
    simple, so the code that interacts with the
    system sees two main interfaces: IndexBuilder
    and IndexSearch.

3
Lucene
Index Builder
  • Any process that needs to build an index goes
    through the IndexBuilder. This is a simple
    interface that gives you two entry points to
    the indexing process:
  • By passing individual configuration settings to
    the class, e.g. the path to the index, whether
    you want to do an incremental build, and how
    often to optimize the Lucene index as you add
    records.
  • By passing in an index plan name. IndexBuilder
    will then look up the settings it needs from the
    configuration system. This allows you to tweak
    your variables in an external file, rather than
    in code.

4
Lucene
Index Sources
  • The Index Builder abstracts the details of
    Lucene, and of the Index Sources that are used
    to create the index itself.
  • The search interface is also kept very simple. A
    search is done via
    IndexSearch.search(String inputQuery, int resultsStart, int resultsCount)
  • E.g. to look for the terms EJB and WebLogic,
    returning up to the first 10 results:
    IndexSearch.search("EJB and WebLogic", 0, 10)
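
The paging behaviour of the two integer parameters can be illustrated with a small in-memory sketch. Only the search(String, int, int) signature comes from the slide; the class name PagedSearchSketch, the matching rule (every term must appear in the document) and the treatment of "and" as a connective are assumptions made for illustration, and no Lucene classes are used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical in-memory stand-in for the IndexSearch interface on the slide.
// Only the search(String, int, int) signature comes from the presentation.
class PagedSearchSketch {
    private final List<String> documents;

    PagedSearchSketch(List<String> documents) {
        this.documents = documents;
    }

    // Returns at most resultsCount matches, skipping the first resultsStart.
    List<String> search(String inputQuery, int resultsStart, int resultsCount) {
        List<String> terms = new ArrayList<>();
        for (String t : inputQuery.toLowerCase(Locale.ROOT).split("\\s+")) {
            // Assumption: "and" is a connective between terms, not a search term.
            if (!t.isEmpty() && !t.equals("and")) {
                terms.add(t);
            }
        }
        List<String> hits = new ArrayList<>();
        for (String doc : documents) {
            List<String> words = Arrays.asList(doc.toLowerCase(Locale.ROOT).split("\\W+"));
            if (words.containsAll(terms)) {
                hits.add(doc);
            }
        }
        // Clamp the requested window to the hits that actually exist.
        int from = Math.min(Math.max(resultsStart, 0), hits.size());
        int to = (int) Math.min((long) from + Math.max(resultsCount, 0), hits.size());
        return new ArrayList<>(hits.subList(from, to));
    }
}
```

With this sketch, search("EJB and WebLogic", 0, 10) returns the first page of up to 10 hits, and search("EJB and WebLogic", 10, 10) would return the second page.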

5
Lucene
How to tweak the ranking of records
  • There is one piece of logic that goes above and
    beyond munging the data into a Lucene-friendly
    form. It is in this class that we calculate any
    boosts that we want to place on fields, or on the
    document itself. It turns out that we end up with
    the following boosters:
  • The date boost has been really important for us.
    We have data that goes back a long time, and we
    seemed to be returning old reports too often.
    The date-based booster has gotten around this,
    allowing the newest content to bubble up.
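
The slides do not give the formula behind the date boost, so the following is only a sketch of one plausible approach: a multiplier that decays linearly with document age. The class name DateBoostSketch, the one-year window and the maximum boost of 2.0 are all assumed values, not the presenters' settings.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical date booster; the formula and constants are illustrative only.
class DateBoostSketch {
    // Documents newer than this many days receive extra weight (assumed value).
    static final long WINDOW_DAYS = 365;
    // Multiplier applied to a document published today (assumed value).
    static final float MAX_BOOST = 2.0f;

    // Returns a multiplier in [1.0, MAX_BOOST] that decays linearly with age.
    static float boostFor(LocalDate published, LocalDate today) {
        // Dates in the future are treated as "published today".
        long ageDays = Math.max(0, ChronoUnit.DAYS.between(published, today));
        if (ageDays >= WINDOW_DAYS) {
            return 1.0f; // old content keeps its unboosted score
        }
        float freshness = 1.0f - (float) ageDays / WINDOW_DAYS;
        return 1.0f + (MAX_BOOST - 1.0f) * freshness;
    }
}
```

The resulting multiplier would be applied as the document's boost at indexing time, so that newer reports outrank older ones with otherwise similar scores.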

6
Lucene
Lucene Index Anatomy
  • You can successfully use Lucene without
    understanding this directory structure. Feel free
    to skip this section and treat the directory as a
    black box without regard to what is inside. When
    you are ready to dig deeper you'll find that the
    files you created in the last section contain
    statistics and other data to facilitate rapid
    searching and ranking. An index contains a
    sequence of documents. In our indexing example,
    each document represents information about a text
    file.

7
Lucene
Documents
  • Documents are the primary retrievable units from
    a Lucene query. Documents consist of a sequence
    of fields. Fields have a name ("contents" and
    "filename" in our example). Field values are a
    sequence of terms.

8
Lucene
Terms
  • A term is the smallest piece of a particular
    field. Fields have three attributes of interest:
  • Stored -- Original text is available in the
    documents returned from a search.
  • Indexed -- Makes this field searchable.
  • Tokenized -- The text added is run through an
    analyzer and broken into relevant pieces (only
    makes sense for indexed fields).

9
Lucene
Analysis
Tokenized fields are where the real fun happens.
In our example, we are indexing the contents of
text files. The goal is to have the words in the
text file be searchable, but for practical
purposes it doesn't make sense to index every
word. Some words like "a", "and", and "the" are
generally considered irrelevant for searching and
can be optimized out -- these are called stop
words.
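
The effect of stop-word removal during analysis can be sketched in a few lines. This is not Lucene's analyzer: the class name StopWordSketch, the lower-casing step, the split on non-alphanumeric characters and the three-word stop list (taken from the examples above) are simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified stand-in for an analyzer; not Lucene's implementation.
class StopWordSketch {
    // Stop list limited to the three examples named in the text.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "and", "the"));

    // Lower-cases the text, splits it into tokens, and drops stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }
}
```

Only the terms that survive this step would be written to the index, which is why a query for a stop word alone finds nothing in a tokenized field.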
10
Lucene
Some Inverted Index Strategies
  • batch-based: use file-sorting algorithms
    (textbook)
  • fastest to build and fastest to search, but
    slow to update
  • b-tree based: update in place
    (http://lucene.sf.net/papers/sigir90.ps)
  • fast to search, but update/build does not scale
    and the implementation is complex
  • segment based: lots of small indexes (Verity)
  • fast to build and fast to update, but slower to
    search

11
Lucene
What you have to do
  • Lucene handles the indexing, searching and
    retrieving, but it doesn't handle:
  • managing the process (instantiating the objects
    and hooking them together, both for indexing and
    for searching)
  • selecting the data files
  • parsing the data files
  • getting the search string from the user
  • displaying the search results to the user

12
Lucene
  • two basic algorithms:
  • make an index for a single document
  • merge a set of indices
  • incremental algorithm:
  • maintain a stack of segment indices
  • create an index for each incoming document
  • push new indexes onto the stack
  • let b = 10 be the merge factor and M = ∞, then:
      for (size = 1; size < M; size *= b) {
        if (there are b indexes with size docs on top of the stack) {
          pop them off the stack;
          merge them into a single index;
          push the merged index onto the stack;
        } else {
          break;
        }
      }
  • optimization: single-doc indexes are kept in RAM,
    which saves system calls
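
The incremental algorithm can be simulated by tracking only the number of documents in each segment. In this sketch the class name SegmentStackSketch is hypothetical, a "merge" is modelled as adding up document counts rather than merging real index files, and the loop is given no upper bound on segment size, which is an assumption about the slide's M.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Simulation of the segment stack; each entry is the doc count of one segment.
class SegmentStackSketch {
    private final int mergeFactor; // "b" on the slide
    private final Deque<Integer> stack = new ArrayDeque<>();

    SegmentStackSketch(int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("merge factor must be at least 2");
        }
        this.mergeFactor = mergeFactor;
    }

    // Index one document, then merge while b equally sized segments are on top.
    void addDocument() {
        stack.push(1); // a new single-document index
        for (long size = 1; topSegmentsHaveSize(size); size *= mergeFactor) {
            for (int i = 0; i < mergeFactor; i++) {
                stack.pop();
            }
            stack.push((int) (size * mergeFactor)); // the merged segment
        }
    }

    // True if the top b segments each hold exactly `size` documents.
    private boolean topSegmentsHaveSize(long size) {
        if (stack.size() < mergeFactor) {
            return false;
        }
        Iterator<Integer> it = stack.iterator(); // iterates from the top down
        for (int i = 0; i < mergeFactor; i++) {
            if (it.next() != size) {
                return false;
            }
        }
        return true;
    }

    // Segment sizes, top of the stack first.
    List<Integer> segmentSizes() {
        return new ArrayList<>(stack);
    }
}
```

With a merge factor of 10, adding 25 documents leaves two segments of 10 documents and five single-document segments, so the number of segments grows only logarithmically with the number of documents.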

Indexing In Depth
13
Lucene
Lucene's Disjunctive Search Algorithm
  • described in http://lucene.sf.net/papers/riao97.ps
  • since all postings must be processed, the goal
    is to minimize per-posting computation
  • merges postings through a fixed-size array of
    accumulator buckets
  • performs boolean logic with bit masks
  • scales well with large queries
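
The bucket-and-bit-mask idea can be sketched as follows. This is a simplified illustration, not Lucene's code: the class name BucketOrSketch, the window of 8 buckets, the use of a match count in place of a real score, and the limit of 32 query terms (one bit each in an int) are assumptions. Doc ids are assumed to be small, non-negative and sorted within each postings list.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified disjunction over postings lists using a fixed-size bucket window.
class BucketOrSketch {
    static final int BUCKETS = 8; // tiny window chosen for readability (assumption)

    // postings[t] is the sorted doc-id list for query term t (at most 32 terms).
    // requiredMask has bit t set if term t must match; 0 means a plain OR.
    // Each result is {docId, numberOfMatchingTerms}, in doc-id order.
    static List<int[]> search(int[][] postings, int requiredMask) {
        int[] pos = new int[postings.length]; // read position per term
        int[] counts = new int[BUCKETS];      // accumulator per bucket
        int[] masks = new int[BUCKETS];       // which terms hit each bucket
        List<int[]> hits = new ArrayList<>();
        int maxDoc = -1;
        for (int[] p : postings) {
            if (p.length > 0) {
                maxDoc = Math.max(maxDoc, p[p.length - 1]);
            }
        }
        for (int base = 0; base <= maxDoc; base += BUCKETS) {
            int end = base + BUCKETS;
            // Pass 1: every posting in this window costs one add and one OR.
            for (int t = 0; t < postings.length; t++) {
                while (pos[t] < postings[t].length && postings[t][pos[t]] < end) {
                    int slot = postings[t][pos[t]++] - base;
                    counts[slot]++;
                    masks[slot] |= 1 << t;
                }
            }
            // Pass 2: collect buckets that satisfy the boolean constraint, then reset.
            for (int slot = 0; slot < BUCKETS; slot++) {
                if (masks[slot] != 0 && (masks[slot] & requiredMask) == requiredMask) {
                    hits.add(new int[] {base + slot, counts[slot]});
                }
                counts[slot] = 0;
                masks[slot] = 0;
            }
        }
        return hits;
    }
}
```

The per-posting work stays constant regardless of how many terms the query has, which is the property the slide refers to when it says the approach scales well with large queries.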

14
Lucene
Lucene's Conjunctive Search Algorithm
Local and Replication mode
  • Algorithm:
  • use a linked list of pointers into the doc lists
  • initially sorted by doc
  • loop:
  • if all are at the same doc, record a hit
  • skip first to-or-past last and move it to the
    end of the list
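
The loop described above can be sketched as an intersection of sorted doc-id arrays. The class names ConjunctionSketch and Cursor are hypothetical, and the linear skipTo is a simplification; the sketch assumes each postings list is sorted, free of duplicates, and contains ids below Integer.MAX_VALUE.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedList;
import java.util.List;

// Sketch of the "skip first to-or-past last" conjunction loop.
class ConjunctionSketch {
    // A pointer into one term's sorted doc list.
    static final class Cursor {
        final int[] docs;
        int i = 0;

        Cursor(int[] docs) {
            this.docs = docs;
        }

        int doc() {
            return docs[i];
        }

        // Advance to the first doc >= target; false if the list is exhausted.
        boolean skipTo(int target) {
            while (i < docs.length && docs[i] < target) {
                i++;
            }
            return i < docs.length;
        }
    }

    // Returns the doc ids present in every postings list.
    static List<Integer> intersect(int[]... postings) {
        List<Integer> hits = new ArrayList<>();
        LinkedList<Cursor> cursors = new LinkedList<>();
        for (int[] p : postings) {
            if (p.length == 0) {
                return hits; // an empty list means no doc can match all terms
            }
            cursors.add(new Cursor(p));
        }
        if (cursors.isEmpty()) {
            return hits;
        }
        cursors.sort(Comparator.comparingInt(Cursor::doc)); // initially sorted by doc
        while (true) {
            Cursor first = cursors.getFirst();
            Cursor last = cursors.getLast();
            if (first.doc() == last.doc()) {
                // The list stays sorted, so first == last means all cursors agree.
                hits.add(first.doc());
                if (!first.skipTo(first.doc() + 1)) {
                    return hits;
                }
            } else if (!first.skipTo(last.doc())) {
                return hits; // one list ran out, so no further hits are possible
            }
            // The advanced cursor now holds the largest doc, so it goes to the end.
            cursors.addLast(cursors.removeFirst());
        }
    }
}
```

Because the cursor that is furthest behind always jumps straight to the largest current doc, long runs of non-matching documents in a frequent term's list are skipped rather than visited one by one.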

15
Lucene
Summary
Lucene is suited to fast key-value pair retrieval
and to read-oriented data.
