Title: Lucene Continued
High Level Infrastructure
- When you look at building your search solution,
you often find that the process is split into two
main tasks: building an index, and searching that
index. This is definitely the case with Lucene
(the only time this isn't the case is when your
search goes directly to the database).
- We wanted to keep the search interface fairly
simple, so the code that interacts with the
system sees two main interfaces: IndexBuilder
and IndexSearch.
Index Builder
- Any process which needs to build an index goes
through the IndexBuilder. This is a simple
interface which gives you two entry points to
the indexing process:
  - By passing individual configuration settings
  to the class, e.g. the path to the index, whether
  you want to do an incremental build, and how
  often to optimize the Lucene index as you add
  records.
  - By passing in an index plan name. IndexBuilder
  will then look up the settings it needs from the
  configuration system. This allows you to tweak
  your variables in an external file, rather than
  in code.
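A minimal sketch of what the two entry points might look like. Everything here is hypothetical: the class name IndexBuilderSketch, its fields, the defaults, and the property keys of the form plan.<name>.indexPath are assumptions for illustration, and a java.util.Properties object stands in for the configuration system.

```java
import java.util.Properties;

// Hypothetical sketch only; none of these names come from the real IndexBuilder.
final class IndexBuilderSketch {
    final String indexPath;     // where the Lucene index lives
    final boolean incremental;  // add to an existing index instead of rebuilding it
    final int optimizeEvery;    // optimize after this many added records

    // Entry point 1: pass the individual configuration settings directly.
    IndexBuilderSketch(String indexPath, boolean incremental, int optimizeEvery) {
        this.indexPath = indexPath;
        this.incremental = incremental;
        this.optimizeEvery = optimizeEvery;
    }

    // Entry point 2: pass an index plan name and look the settings up
    // in external configuration (assumed key layout: plan.<name>.<setting>).
    static IndexBuilderSketch fromPlan(String planName, Properties config) {
        String prefix = "plan." + planName + ".";
        String path = config.getProperty(prefix + "indexPath");
        if (path == null) {
            throw new IllegalArgumentException("Unknown index plan: " + planName);
        }
        boolean incremental =
            Boolean.parseBoolean(config.getProperty(prefix + "incremental", "false"));
        int optimizeEvery =
            Integer.parseInt(config.getProperty(prefix + "optimizeEvery", "1000"));
        return new IndexBuilderSketch(path, incremental, optimizeEvery);
    }
}
```

With this shape, changing how often a plan optimizes means editing a properties file rather than recompiling the caller.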
Index Sources
- The Index Builder abstracts the details of
Lucene, and of the Index Sources that are used to
create the index itself.
- The search interface is also kept very simple. A
search is done via:
  - IndexSearch.search(String inputQuery, int
  resultsStart, int resultsCount)
- E.g. look for the terms EJB and WebLogic,
returning up to the first 10 results:
  - IndexSearch.search("EJB and WebLogic", 0, 10)
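To make the resultsStart / resultsCount parameters concrete, here is a small sketch of how such a window could be cut from a ranked hit list. It assumes resultsStart is a zero-based offset (consistent with the 0, 10 example above). PagingSketch and page are hypothetical names, not part of IndexSearch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: slices an already-ranked hit list into one page.
final class PagingSketch {
    // Returns at most resultsCount hits, starting at zero-based offset resultsStart.
    static <T> List<T> page(List<T> rankedHits, int resultsStart, int resultsCount) {
        if (resultsStart < 0 || resultsCount < 0) {
            throw new IllegalArgumentException("start and count must be non-negative");
        }
        int from = Math.min(resultsStart, rankedHits.size());
        // Use long arithmetic so a huge resultsCount cannot overflow.
        int to = (int) Math.min((long) from + resultsCount, rankedHits.size());
        return new ArrayList<>(rankedHits.subList(from, to));
    }
}
```

Under that assumption, a second page of ten would be requested with a start of 10 and a count of 10.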
How to tweak the ranking of records
- There is one piece of logic that goes above and
beyond munging the data into a Lucene-friendly
form. It is in this class that we calculate any
boosts that we want to place on fields, or on the
document itself. It turns out that we end up with
the following boosters:
- The date boost has been really important for us.
We have data that goes back a long time, and we
seemed to be returning old reports too often.
The date-based booster has gotten around this,
allowing the newest content to bubble up.
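The slides do not give the boost formula, so the following is only one possible shape for a date-based booster: a linear decay from a maximum boost for content dated today down to no boost (1.0) at a chosen age. The class name, the horizonDays and maxBoost parameters, and the linear curve itself are all assumptions.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical date booster; the real formula is not given in the source.
final class DateBoostSketch {
    // Documents dated today get maxBoost; documents horizonDays old or older
    // get 1.0 (no boost); ages in between are interpolated linearly.
    static float dateBoost(LocalDate published, LocalDate today,
                           long horizonDays, float maxBoost) {
        // Future-dated documents are treated as age 0.
        long ageDays = Math.max(0, ChronoUnit.DAYS.between(published, today));
        if (ageDays >= horizonDays) {
            return 1.0f;
        }
        float freshness = 1.0f - (float) ageDays / horizonDays;
        return 1.0f + (maxBoost - 1.0f) * freshness;
    }
}
```

In a Lucene version that supports index-time boosts, a value like this would be attached to the document or to a field when it is added, so that newer reports score higher for the same query.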
Lucene Index Anatomy
- You can successfully use Lucene without
understanding this directory structure. Feel free
to skip this section and treat the directory as a
black box without regard to what is inside. When
you are ready to dig deeper you'll find that the
files you created in the last section contain
statistics and other data to facilitate rapid
searching and ranking. An index contains a
sequence of documents. In our indexing example,
each document represents information about a text
file.
Documents
- Documents are the primary retrievable units from
a Lucene query. Documents consist of a sequence
of fields. Fields have a name ("contents" and
"filename" in our example). Field values are a
sequence of terms.
Terms
- A term is the smallest piece of a particular
field. Fields have three attributes of interest:
  - Stored -- Original text is available in the
  documents returned from a search.
  - Indexed -- Makes this field searchable.
  - Tokenized -- The text added is run through an
  analyzer and broken into relevant pieces (only
  makes sense for indexed fields).
Analysis
Tokenized fields are where the real fun happens.
In our example, we are indexing the contents of
text files. The goal is to have the words in the
text file be searchable, but for practical
purposes it doesn't make sense to index every
word. Some words like "a", "and", and "the" are
generally considered irrelevant for searching and
can be optimized out -- these are called stop
words.
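As a rough picture of what tokenizing with stop-word removal does, the sketch below lowercases the text, splits it on non-alphanumeric characters, and drops the three stop words named above. It is plain Java written for illustration, not Lucene's analyzer code. The class name and the tiny stop list are assumptions, and the simple split only handles ASCII letters and digits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative stand-in for an analyzer; not Lucene's implementation.
final class StopWordSketch {
    // Deliberately tiny stop list, taken from the examples in the text.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "and", "the"));

    // Breaks text into lowercase terms and discards stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }
}
```

For example, analyze("The cat and a dog") yields the terms cat and dog. Only those terms would be indexed for the field.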
Some Inverted Index Strategies
- batch-based: use file-sorting algorithms
(textbook)
  - fastest to build
  - fastest to search
  - slow to update
- b-tree based: update in place
(http://lucene.sf.net/papers/sigir90.ps)
  - fast to search
  - update/build does not scale
  - complex implementation
- segment based: lots of small indexes (Verity)
  - fast to build
  - fast to update
  - slower to search
What you have to do
- Lucene handles the indexing, searching and
retrieving, but it doesn't handle:
  - managing the process (instantiating the objects
  and hooking them together, both for indexing and
  for searching)
  - selecting the data files
  - parsing the data files
  - getting the search string from the user
  - displaying the search results to the user
Indexing In Depth
- two basic algorithms
  - make an index for a single document
  - merge a set of indices
- incremental algorithm
  - maintain a stack of segment indices
  - create index for each incoming document
  - push new indexes onto the stack
  - let b=10 be the merge factor; M=∞
      for (size = 1; size < M; size *= b) {
        if (there are b indexes with size docs
            on top of the stack) {
          pop them off the stack;
          merge them into a single index;
          push the merged index onto the stack;
        } else {
          break;
        }
      }
- optimization: single-doc indexes kept in RAM,
saves system calls
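The incremental algorithm above can be simulated in a few lines. In this sketch each segment index is represented only by its document count, so "merging" is just adding counts. The class and method names are assumptions, and M is treated as unbounded.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Simulation of the segment stack; a segment is modeled by its doc count only.
final class SegmentStackSketch {
    private final int mergeFactor;                          // b in the pseudocode
    private final Deque<Long> stack = new ArrayDeque<>();   // head of the deque = top of the stack

    SegmentStackSketch(int mergeFactor) {
        if (mergeFactor < 2) throw new IllegalArgumentException("merge factor must be >= 2");
        this.mergeFactor = mergeFactor;
    }

    // Index one incoming document, then merge while b equal-sized segments sit on top.
    void addDocument() {
        stack.push(1L);                                     // index for the single new document
        for (long size = 1; ; size *= mergeFactor) {        // no upper bound M in this sketch
            if (!topHasRunOf(size)) break;
            long merged = 0;
            for (int i = 0; i < mergeFactor; i++) merged += stack.pop();
            stack.push(merged);                             // the merged index replaces the b small ones
        }
    }

    // True if the top b segments all hold exactly `size` documents.
    private boolean topHasRunOf(long size) {
        if (stack.size() < mergeFactor) return false;
        Iterator<Long> fromTop = stack.iterator();
        for (int i = 0; i < mergeFactor; i++) {
            if (fromTop.next() != size) return false;
        }
        return true;
    }

    // Segment sizes listed from the top of the stack downward.
    List<Long> segmentSizes() {
        return new ArrayList<>(stack);
    }
}
```

With b=10, adding 111 documents one at a time leaves three segments of sizes 1, 10 and 100 (top to bottom), which shows why the number of segments grows only logarithmically with the number of documents.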
Lucene's Disjunctive Search Algorithm
- described in http://lucene.sf.net/papers/riao97.ps
- since all postings must be processed
  - goal is to minimize per-posting computation
- merges postings through a fixed-size array of
accumulator buckets
- performs boolean logic with bit masks
- scales well with large queries
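A much-simplified illustration of the bucket-and-bit-mask idea, not Lucene's actual code. Postings are processed in windows the size of a small bucket array. Each bucket accumulates a match count and a bit per matching clause, and the required/prohibited logic is a pair of mask tests at the end of each window. The window size of 8, the use of a match count as the "score", and all names are assumptions. Doc ids are assumed to be non-negative and sorted, with at most 32 clauses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified sketch of accumulator buckets with bit-mask boolean logic.
final class BucketOrSketch {
    static final int BUCKETS = 8; // tiny window so the example is easy to trace

    // postings[c] = sorted doc ids for clause c; bit c in the masks refers to clause c.
    // Returns doc id -> number of matching clauses, in doc order.
    static Map<Integer, Integer> search(int[][] postings, int requiredMask, int prohibitedMask) {
        int[] bits = new int[BUCKETS];     // which clauses matched the doc in this bucket
        int[] counts = new int[BUCKETS];   // accumulated "score" (here: match count)
        int[] cursor = new int[postings.length];
        int maxDoc = -1;
        for (int[] p : postings) {
            if (p.length > 0) maxDoc = Math.max(maxDoc, p[p.length - 1]);
        }
        Map<Integer, Integer> hits = new LinkedHashMap<>();
        for (int base = 0; base <= maxDoc; base += BUCKETS) {
            // Per posting: one array index, one OR, one increment.
            for (int c = 0; c < postings.length; c++) {
                int[] p = postings[c];
                while (cursor[c] < p.length && p[cursor[c]] < base + BUCKETS) {
                    int slot = p[cursor[c]++] - base;
                    bits[slot] |= 1 << c;
                    counts[slot]++;
                }
            }
            // Per window: apply the boolean logic with two mask tests, then reset.
            for (int i = 0; i < BUCKETS; i++) {
                int b = bits[i];
                if (b != 0 && (b & requiredMask) == requiredMask && (b & prohibitedMask) == 0) {
                    hits.put(base + i, counts[i]);
                }
                bits[i] = 0;
                counts[i] = 0;
            }
        }
        return hits;
    }
}
```

The work done for each posting stays constant no matter how many clauses the query has, which is the property the slide is pointing at.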
Lucene's Conjunctive Search Algorithm
Local and Replication mode
- Algorithm
  - use linked list of pointers to doc list
  - initially sorted by doc
  - loop
    - if all are at same doc, record hit
    - skip first to-or-past last and move to end of
    list
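The loop above can be sketched over plain sorted int arrays. This is an illustration of the steps as listed, not Lucene's code. The Cursor class and all names are assumptions, and each doc list is assumed to be sorted in ascending order without duplicates.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedList;
import java.util.List;

// Sketch of the conjunctive (AND) loop over sorted doc lists.
final class ConjunctionSketch {
    // Pointer into one term's sorted doc list.
    static final class Cursor {
        final int[] docs;
        int pos = 0;
        Cursor(int[] docs) { this.docs = docs; }
        boolean exhausted() { return pos >= docs.length; }
        int doc() { return docs[pos]; }
        // Advance to the first doc >= target ("to-or-past").
        void skipTo(int target) {
            while (!exhausted() && doc() < target) pos++;
        }
    }

    static List<Integer> intersect(int[][] docLists) {
        List<Integer> hits = new ArrayList<>();
        LinkedList<Cursor> cursors = new LinkedList<>();
        for (int[] docs : docLists) {
            if (docs.length == 0) return hits;   // an empty list means no doc can match
            cursors.add(new Cursor(docs));
        }
        if (cursors.isEmpty()) return hits;
        cursors.sort(Comparator.comparingInt(Cursor::doc));  // initially sorted by doc
        while (true) {
            Cursor first = cursors.getFirst();
            Cursor last = cursors.getLast();
            if (first.doc() == last.doc()) {
                // The list is sorted, so first == last means all cursors agree.
                hits.add(first.doc());
                first.skipTo(last.doc() + 1);    // move past the recorded hit
            } else {
                first.skipTo(last.doc());        // skip first to-or-past last
            }
            if (first.exhausted()) return hits;
            cursors.addLast(cursors.removeFirst()); // first now has the largest doc
        }
    }
}
```

Because the cursor that was just advanced always ends up with the largest doc, moving it to the end keeps the list sorted without re-sorting on each step.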
Summary
- Lucene: for fast retrieval of key/value pairs,
and for read-oriented data.