Lucene Continued - PowerPoint PPT Presentation

Provided by: inter2

1
Lucene (Continued)
2
Lucene
High Level Infrastructure

When you look at building your search solution,
you often find that the process is split into two
main tasks: building an index, and searching that
index. This is definitely the case with Lucene
(the only time this isn't the case is if your
search goes directly to the database).
We wanted to keep the search interface fairly
simple, so the code that interacts with the
system sees two main interfaces: IndexBuilder
and IndexSearch.

3
Lucene
Index Builder

Any process that needs to build an index goes
through the IndexBuilder. This is a simple
interface which gives you two entry points to
the indexing process:
By passing individual configuration settings to
the class, e.g. the path to the index, whether
you want to do an incremental build, and how
often to optimize the Lucene index as you add
records.
By passing in an index plan name. IndexBuilder
will then look up the settings it needs from the
configuration system. This allows you to tweak
your variables in an external file, rather than
in code.
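
The two entry points could look roughly like the sketch below. It is only an illustration: the class names IndexSettings, IndexPlanRegistry and IndexBuilderSketch, the field names, and the in-memory plan lookup are all assumptions, since the slides do not show the real IndexBuilder code or the configuration system.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical settings holder; the three fields mirror the examples on the slide.
final class IndexSettings {
    final String indexPath;
    final boolean incremental;
    final int optimizeEvery; // optimize after this many added records

    IndexSettings(String indexPath, boolean incremental, int optimizeEvery) {
        this.indexPath = indexPath;
        this.incremental = incremental;
        this.optimizeEvery = optimizeEvery;
    }
}

// Stand-in for the external configuration system (assumed; kept in memory here).
final class IndexPlanRegistry {
    private static final Map<String, IndexSettings> PLANS = new HashMap<>();

    static void register(String planName, IndexSettings settings) {
        PLANS.put(planName, settings);
    }

    static IndexSettings lookup(String planName) {
        IndexSettings settings = PLANS.get(planName);
        if (settings == null) {
            throw new IllegalArgumentException("Unknown index plan: " + planName);
        }
        return settings;
    }
}

final class IndexBuilderSketch {
    private final IndexSettings settings;

    // Entry point 1: individual configuration settings passed by the caller.
    IndexBuilderSketch(String indexPath, boolean incremental, int optimizeEvery) {
        this.settings = new IndexSettings(indexPath, incremental, optimizeEvery);
    }

    // Entry point 2: an index plan name resolved through the configuration system.
    IndexBuilderSketch(String planName) {
        this.settings = IndexPlanRegistry.lookup(planName);
    }

    IndexSettings settings() {
        return settings;
    }
}
```

With a plan registered under a name such as "nightly", `new IndexBuilderSketch("nightly")` picks up the same settings that would otherwise be passed by hand.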

4
Lucene
Index Sources

The Index Builder abstracts the details of
Lucene, and the Index Sources that are used to
create the index itself.
The search interface is also kept very simple. A
search is done via:
IndexSearch.search(String inputQuery, int resultsStart, int resultsCount)
E.g. to look for the terms EJB and WebLogic,
returning up to the first 10 results:
IndexSearch.search("EJB and WebLogic", 0, 10)
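
The resultsStart and resultsCount arguments describe a window over the ranked hits. A minimal sketch of that windowing, assuming the hits are already ranked and held in a list (the ResultWindow class and its page method are hypothetical names, not part of IndexSearch):

```java
import java.util.ArrayList;
import java.util.List;

final class ResultWindow {
    // Returns up to resultsCount hits, starting at offset resultsStart.
    // A start offset past the end of the list yields an empty page.
    static <T> List<T> page(List<T> rankedHits, int resultsStart, int resultsCount) {
        if (resultsStart < 0 || resultsCount < 0) {
            throw new IllegalArgumentException("start and count must be non-negative");
        }
        int from = Math.min(resultsStart, rankedHits.size());
        // Widen to long so that from + resultsCount cannot overflow.
        int to = (int) Math.min((long) from + resultsCount, rankedHits.size());
        return new ArrayList<>(rankedHits.subList(from, to));
    }
}
```

Calling `page(hits, 0, 10)` corresponds to the first 10 results in the example above, and `page(hits, 10, 10)` would be the next page.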

5
Lucene
How to tweak the ranking of records

There is one piece of logic that goes above and
beyond munging the data into a Lucene-friendly
form. It is in this class that we calculate any
boosts that we want to place on fields, or on the
document itself. It turns out that we end up with
the following boosters:
The date boost has been really important for us.
We have data that goes back a long time, and we
seemed to be returning old reports too often.
The date-based booster trick has gotten around
this, allowing the newest content to bubble up.
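
The slides do not give the boost formula, so the following is only a sketch of one possible date-based booster. It assumes an exponential decay with a configurable half-life; the class name DateBoost, the method boostFor and the halfLifeDays parameter are all hypothetical. How the value is then attached to a document or field depends on the Lucene version in use and is not shown.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

final class DateBoost {
    // Assumed formula: boost = 1 + 0.5^(age / halfLife).
    // A document published today gets 2.0; the extra weight halves every
    // halfLifeDays, so very old documents approach the neutral boost 1.0.
    static float boostFor(LocalDate published, LocalDate today, double halfLifeDays) {
        if (halfLifeDays <= 0) {
            throw new IllegalArgumentException("halfLifeDays must be positive");
        }
        // Dates in the future are treated as age zero.
        long ageDays = Math.max(0, ChronoUnit.DAYS.between(published, today));
        return (float) (1.0 + Math.pow(0.5, ageDays / halfLifeDays));
    }
}
```

With a half-life of 365 days, a one-year-old report would get a boost of 1.5 under this assumed formula, so newer content outranks it when the text scores are otherwise equal.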

6
Lucene
Lucene Index Anatomy

You can successfully use Lucene without
understanding this directory structure. Feel free
to skip this section and treat the directory as a
black box without regard to what is inside. When
you are ready to dig deeper you'll find that the
files you created in the last section contain
statistics and other data to facilitate rapid
searching and ranking. An index contains a
sequence of documents. In our indexing example,
each document represents information about a text
file.

7
Lucene
Documents

Documents are the primary retrievable units from
a Lucene query. Documents consist of a sequence
of fields. Fields have a name ("contents" and
"filename" in our example). Field values are a
sequence of terms.

8
Lucene
Terms

A term is the smallest piece of a particular
field. Fields have three attributes of interest:
Stored -- Original text is available in the
documents returned from a search.
Indexed -- Makes this field searchable.
Tokenized -- The text added is run through an
analyzer and broken into relevant pieces (only
makes sense for indexed fields).
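
To make the three attributes concrete, here is a small stand-alone sketch of a field with stored, indexed and tokenized flags. It is not Lucene's Field class; FieldSketch and its factory methods are hypothetical names chosen for illustration, and the only rule it enforces is the one stated above (tokenizing only makes sense for indexed fields).

```java
final class FieldSketch {
    final String name;
    final String value;
    final boolean stored;    // original text is returned with search results
    final boolean indexed;   // field is searchable
    final boolean tokenized; // value is run through an analyzer before indexing

    FieldSketch(String name, String value, boolean stored, boolean indexed, boolean tokenized) {
        if (tokenized && !indexed) {
            throw new IllegalArgumentException("tokenized only makes sense for indexed fields");
        }
        this.name = name;
        this.value = value;
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }

    // Full text such as file contents: searchable word by word, also stored.
    static FieldSketch text(String name, String value) {
        return new FieldSketch(name, value, true, true, true);
    }

    // An exact value such as a filename: searchable as one term, also stored.
    static FieldSketch keyword(String name, String value) {
        return new FieldSketch(name, value, true, true, false);
    }

    // Display-only data: returned with hits but not searchable.
    static FieldSketch storedOnly(String name, String value) {
        return new FieldSketch(name, value, true, false, false);
    }
}
```

In the running example, "contents" would be built with `text(...)` and "filename" with `keyword(...)` under these assumptions.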

9
Lucene
Analysis
Tokenized fields are where the real fun happens.
In our example, we are indexing the contents of
text files. The goal is to have the words in the
text file be searchable, but for practical
purposes it doesn't make sense to index every
word. Some words like "a", "and", and "the" are
generally considered irrelevant for searching and
can be optimized out -- these are called stop
words.
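
A rough idea of what such an analysis step does, written in plain Java rather than with Lucene's analyzer classes. The class name StopWordAnalyzerSketch, the three-word stop list (taken from the examples above) and the lowercase-and-split tokenization are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordAnalyzerSketch {
    // Tiny illustrative stop list; real stop lists are longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "and", "the"));

    // Lowercases the text, splits it on anything that is not a letter or
    // digit, and drops empty tokens and stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }
}
```

For the input "The quick fox and a dog" this yields the terms quick, fox and dog; only those would end up in the index.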
10
Lucene
Some Inverted Index Strategies

batch-based: use file-sorting algorithms
(textbook)
  + fastest to build
  + fastest to search
  - slow to update
b-tree based: update in place
(http://lucene.sf.net/papers/sigir90.ps)
  + fast to search
  - update/build does not scale
  - complex implementation
segment based: lots of small indexes (Verity)
  + fast to build
  + fast to update
  - slower to search

11
Lucene
What you have to do

Lucene handles the indexing, searching and
retrieving, but it doesn't handle:
managing the process (instantiating the objects
and hooking them together, both for indexing and
for searching)
selecting the data files
parsing the data files
getting the search string from the user
displaying the search results to the user

12
Lucene

two basic algorithms:
make an index for a single document
merge a set of indices
incremental algorithm:
maintain a stack of segment indices
create index for each incoming document
push new indexes onto the stack
let b = 10 be the merge factor; M = ∞
for (size = 1; size < M; size *= b) {
  if (there are b indexes with size docs on top of the stack) {
    pop them off the stack;
    merge them into a single index;
    push the merged index onto the stack;
  } else {
    break;
  }
}
optimization: single-doc indexes kept in RAM,
saves system calls
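
The incremental algorithm above can be sketched in runnable form as follows. This is a toy model, not Lucene's implementation: a "segment" is just a list of document strings, merging is list concatenation, and the class name SegmentStackSketch and its methods are hypothetical. The sketch assumes no upper bound on the size of a merged segment unless one is passed in.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

final class SegmentStackSketch {
    private final int mergeFactor;   // b in the pseudocode
    private final long maxMergeSize; // M in the pseudocode
    // Head of the deque is the top of the stack (newest segment).
    private final Deque<List<String>> stack = new ArrayDeque<>();

    SegmentStackSketch(int mergeFactor) {
        this(mergeFactor, Long.MAX_VALUE); // assumed: effectively unbounded M
    }

    SegmentStackSketch(int mergeFactor, long maxMergeSize) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        this.mergeFactor = mergeFactor;
        this.maxMergeSize = maxMergeSize;
    }

    void addDocument(String doc) {
        // Create a single-document index and push it onto the stack.
        List<String> segment = new ArrayList<>();
        segment.add(doc);
        stack.push(segment);
        // Merge cascades upward while b same-sized segments sit on top.
        for (long size = 1; size < maxMergeSize; size *= mergeFactor) {
            if (!topSegmentsHaveSize(size)) {
                break;
            }
            List<String> merged = new ArrayList<>();
            for (int i = 0; i < mergeFactor; i++) {
                // Segments pop newest first; prepend to keep document order.
                merged.addAll(0, stack.pop());
            }
            stack.push(merged);
        }
    }

    private boolean topSegmentsHaveSize(long size) {
        if (stack.size() < mergeFactor) {
            return false;
        }
        Iterator<List<String>> fromTop = stack.iterator();
        for (int i = 0; i < mergeFactor; i++) {
            if (fromTop.next().size() != size) {
                return false;
            }
        }
        return true;
    }

    // Segment sizes from the bottom of the stack to the top.
    List<Integer> segmentSizes() {
        List<Integer> sizes = new ArrayList<>();
        Iterator<List<String>> fromBottom = stack.descendingIterator();
        while (fromBottom.hasNext()) {
            sizes.add(fromBottom.next().size());
        }
        return sizes;
    }
}
```

With a merge factor of 3, adding five documents leaves segments of sizes 3, 1, 1 on the stack; the ninth document triggers two cascading merges and leaves a single segment of 9.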
notes