Lucene Boot Camp - PowerPoint PPT Presentation

1
Lucene Boot Camp
  • Grant Ingersoll
  • Lucid Imagination
  • Nov. 12, 2007
  • Atlanta, Georgia

2
Intro
  • My Background
  • Your Background
  • Brief History of Lucene
  • Goals for Tutorial
  • Understand Lucene core capabilities
  • Real examples, real code, real data
  • Ask Questions!!!!!

3
Schedule
  • 10:00-10:10 Introducing Lucene and Search
  • 10:10-12:00 Indexing, Analysis, Searching,
    Performance
  • 12:00-12:05 Break
  • 12:05-1:00 More on Indexing, Analysis, Searching,
    Performance
  • 1:00-2:30 Lunch
  • 2:30-2:40 Recap, Questions, Content
  • 2:40-4:00 Class Example
  • 4:00-4:20 Break
  • 4:20-5:00 Class Example
  • 5:00-5:20 Lucene Contributions (time permitting)
  • 5:20-5:25 Open Discussion (time permitting)
  • 5:25-5:30 Resources/Wrap Up

4
Lucene is
  • NOT a crawler
  • See Nutch
  • NOT an application
  • See PoweredBy on the Wiki
  • NOT a library for doing Google PageRank or other
    link analysis algorithms
  • See Nutch
  • A library for enabling text based search

5
A Few Words about Solr
  • HTTP-based Search Server
  • XML Configuration
  • XML, JSON, Ruby, PHP, Java support
  • Caching, Replication
  • Many, many nice features that Lucene users need
  • http://lucene.apache.org/solr

6
Search Basics
  • Goal: Identify documents that are similar to the
    input query
  • Lucene uses a modified Vector Space Model (VSM)
  • Boolean + VSM
  • TF-IDF
  • The words in the document and the query each
    define a Vector in an n-dimensional space
  • Sim(q1, d1) = cos θ
  • In Lucene, the boolean approach restricts which
    documents get scored

dj = <w1,j, w2,j, …, wn,j>  q = <w1,q, w2,q, …, wn,q>
w = weight assigned to a term
7
Indexing
  • Process of preparing and adding text to Lucene
  • Optimized for searching
  • Key Point: Lucene only indexes Strings
  • What does this mean?
  • Lucene doesn't care about XML, Word, PDF, etc.
  • There are many good open source extractors
    available
  • It's our job to convert whatever file format we
    have into something Lucene can use

8
Indexing Classes
  • Analyzer
  • Creates tokens using a Tokenizer and filters them
    through zero or more TokenFilters
  • IndexWriter
  • Responsible for converting text into internal
    Lucene format

9
Indexing Classes
  • Directory
  • Where the Index is stored
  • RAMDirectory, FSDirectory, others
  • Document
  • A collection of Fields
  • Can be boosted
  • Field
  • Free text, keywords, dates, etc.
  • Defines attributes for storing, indexing
  • Can be boosted
  • Field Constructors and parameters
  • Open up Fieldable and Field in IDE

10
How to Index
  • Create IndexWriter
  • For each input
  • Create a Document
  • Add Fields to the Document
  • Add the Document to the IndexWriter
  • Close the IndexWriter
  • Optimize (optional)
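The steps above, written against the Lucene 2.x API of the day, might look like the following sketch. The index path, field names, and file handling are illustrative assumptions, not from the slides.

```java
import java.io.File;
import java.io.FileReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SimpleIndexer {
  public static void main(String[] args) throws Exception {
    // Create the IndexWriter (true = create a new index)
    IndexWriter writer = new IndexWriter("/tmp/index",
        new StandardAnalyzer(), true);
    // For each input: create a Document, add Fields, add to the writer
    for (File file : new File(args[0]).listFiles()) {
      Document doc = new Document();
      // Stored, un-tokenized: usable for display and exact lookup
      doc.add(new Field("path", file.getPath(),
          Field.Store.YES, Field.Index.UN_TOKENIZED));
      // Reader-based Field: tokenized, not stored
      doc.add(new Field("contents", new FileReader(file)));
      writer.addDocument(doc);
    }
    writer.optimize(); // optional
    writer.close();
  }
}
```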

11
Task 1.a
  • From the Boot Camp Files, use the
    basic.ReutersIndexer skeleton to start
  • Index the small Reuters Collection using the
    IndexWriter, a Directory and StandardAnalyzer
  • Boost every 10th document by 3
  • Questions to Answer
  • What Fields should I define?
  • What attributes should each Field have?
  • What Fields should OMIT_NORMS?
  • Pick a field to boost and give a reason why you
    think it should be boosted

12
Use the Luke
13
Searching
  • Key Classes
  • Searcher
  • Provides methods for searching
  • Take a moment to look at the Searcher class
    declaration
  • IndexSearcher, MultiSearcher,
    ParallelMultiSearcher
  • IndexReader
  • Loads a snapshot of the index into memory for
    searching
  • Hits
  • Storage/caching of results from searching
  • QueryParser
  • JavaCC grammar for creating Lucene Queries
  • http://lucene.apache.org/java/docs/queryparsersyntax.html
  • Query
  • Logical representation of the program's
    information need

14
Query Parsing
  • Basic syntax
  • title:hockey (body:stanley AND body:cup)
  • OR/AND must be uppercase
  • Default operator is OR (can be changed)
  • Supports fairly advanced syntax, see the website
  • http://lucene.apache.org/java/docs/queryparsersyntax.html
  • Doesn't always play nice, so beware
  • Many applications construct queries
    programmatically or restrict syntax
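Parsing and running a query with the 2.x QueryParser might look like this sketch; it assumes an existing index at /tmp/index with "title" and "body" fields.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchSketch {
  public static void main(String[] args) throws Exception {
    IndexSearcher searcher = new IndexSearcher("/tmp/index");
    // "body" is the default field for terms with no field: prefix
    QueryParser parser = new QueryParser("body", new StandardAnalyzer());
    Query query = parser.parse("title:hockey (body:stanley AND body:cup)");
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
      // hits.doc(i) loads the stored fields of the i-th result
      System.out.println(hits.doc(i).get("title") + " : " + hits.score(i));
    }
    searcher.close();
  }
}
```

Note the Analyzer passed to the QueryParser should generally match the one used at indexing time.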

15
Task 1.b
  • Using the ReutersIndexerTest.java skeleton in the
    boot camp files
  • Search your newly created index using queries you
    develop
  • Delete a Document by the doc id
  • Hints
  • Use an IndexSearcher
  • Create a Query using the QueryParser
  • Display the results from the Hits
  • Questions
  • What is the default field for the QueryParser?
  • What Analyzer to use?

16
Task 1 Results
  • Locks
  • Lucene maintains locks on files to prevent index
    corruption
  • Located in same directory as index
  • Scores from Hits are normalized
  • Scores across queries are NOT comparable
  • Lucene 2.3 has some transactional semantics for
    indexing, but is not a DB

17
Deletion and Updates
  • Deletions can be a bit confusing
  • Both IndexReader and IndexWriter have delete
    methods
  • Updates are always a delete and an add
  • Updates are always a delete and an add
  • Yes, that is a repeat!
  • Nature of data structures used in search

18
Analysis
  • Analysis is the process of creating Tokens to be
    indexed
  • Analysis is usually done to improve results
    overall, but it comes with a price
  • Lucene comes with many different Analyzers,
    Tokenizers and TokenFilters, each with their own
    goals
  • See contrib/analyzers
  • StandardAnalyzer is included with the core JAR
    and does a good job for most English and
    Latin-based tasks
  • Often times you want the same content analyzed in
    different ways
  • Consider a catch-all Field in addition to other
    Fields

19
Commonly Used Analyzers
  • StandardAnalyzer
  • WhitespaceAnalyzer
  • PerFieldAnalyzerWrapper
  • SimpleAnalyzer

20
Indexing in a Nutshell
  • For each Document
  • For each Field to be tokenized
  • Create the tokens using the specified Tokenizer
  • Tokens consist of a String, position, type and
    offset information
  • Pass the tokens through the chained TokenFilters
    where they can be changed or removed
  • Add the end result to the inverted index
  • Position information can be altered
  • Useful when removing words or to prevent phrases
    from matching

21
Inverted Index
aardvark
Little Red Riding Hood
0
hood
0
1
little
0
2
Robin Hood
1
red
0
riding
0
robin
1
Little Women
2
women
2
zoo
22
Tokenization
  • Split words into Tokens to be processed
  • Tokenization is fairly straightforward for most
    languages that use a space for word segmentation
  • More difficult for some East Asian languages
  • See the CJK Analyzer

23
Modifying Tokens
  • TokenFilters are used to alter the token stream
    to be indexed
  • Common tasks
  • Remove stopwords
  • Lower case
  • Stem/Normalize: Wi-Fi -> Wi Fi
  • Add Synonyms
  • StandardAnalyzer does things that you may not want

24
Custom Analyzers
  • Solution: write your own Analyzer
  • Better solution: write a configurable Analyzer so
    you only need one Analyzer that you can easily
    change for your projects
  • See Solr
  • Tokenizers and TokenFilters must be newly
    constructed for each input
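A minimal custom Analyzer in the 2.x style chains a Tokenizer through TokenFilters inside tokenStream(), which is invoked per input. The particular filter chain here is illustrative:

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class MyAnalyzer extends Analyzer {
  // Called for each field value; new Tokenizer/TokenFilter
  // instances are constructed for each input
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new LowerCaseFilter(result);  // ignore case
    result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS);
    return result;
  }
}
```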

25
Special Cases
  • Dates and numbers need special treatment to be
    searchable
  • o.a.l.document.DateTools
  • org.apache.solr.util.NumberUtils
  • Altering Position Information
  • Increase Position Gap between sentences to
    prevent phrases from crossing sentence boundaries
  • Index synonyms at the same position so query can
    match regardless of synonym used
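Indexing a date with DateTools might look like the sketch below; the "date" field name is an assumption. Day resolution keeps the number of distinct terms manageable for range queries and sorting.

```java
import java.util.Date;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DateFieldSketch {
  public static void addDate(Document doc, Date date) {
    // Day resolution yields a string like "19870226";
    // lexicographic order then matches chronological order
    String value = DateTools.dateToString(date, DateTools.Resolution.DAY);
    doc.add(new Field("date", value,
        Field.Store.YES, Field.Index.UN_TOKENIZED));
  }
}
```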

26
5 minute Break
27
Indexing Performance
  • Behind the Scenes
  • Lucene indexes Documents into memory
  • At certain trigger points, memory (segments) are
    flushed to the Directory
  • Segments are periodically merged
  • Lucene 2.3 has significant performance
    improvements

28
IndexWriter Performance Factors
  • maxBufferedDocs
  • Minimum number of docs buffered before a merge
    occurs and a new segment is created
  • Usually: larger = faster, but more RAM
  • mergeFactor
  • How often segments are merged
  • Smaller = less RAM, better for incremental
    updates
  • Larger = faster, better for batch indexing
  • maxFieldLength
  • Limit the number of terms in a Document
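The knobs above are plain setters on IndexWriter; the values below are illustrative starting points for batch indexing, not recommendations from the slides.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TuningSketch {
  public static IndexWriter openForBatch(String path) throws Exception {
    IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
    writer.setMergeFactor(30);        // merge less often: faster batch indexing
    writer.setMaxBufferedDocs(1000);  // buffer more docs in RAM per segment
    writer.setMaxFieldLength(50000);  // cap terms indexed per field
    // In 2.3+, prefer sizing the buffer directly instead of the two
    // settings above:
    // writer.setRAMBufferSizeMB(64.0);
    return writer;
  }
}
```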

29
Lucene 2.3 IndexWriter Changes
  • setRAMBufferSizeMB
  • New model for automagically controlling indexing
    factors based on the amount of memory in use
  • Obsoletes setMaxBufferedDocs and setMergeFactor
  • Takes storage and term vectors out of the merge
    process
  • Turn off auto-commit if there are stored fields
    and term vectors
  • Provides significant performance increase

30
Index Threading
  • IndexWriter and IndexReader are thread-safe and
    can be shared between threads without external
    synchronization
  • One open IndexWriter per Directory
  • Parallel Indexing
  • Index to separate Directory instances
  • Merge using IndexWriter.addIndexes
  • Could also distribute and collect

31
Benchmarking Indexing
  • contrib/benchmark
  • Try out different algorithms between Lucene 2.2
    and trunk (2.3)
  • contrib/benchmark/conf
  • indexing.alg
  • indexing-multithreaded.alg
  • Info
  • Mac Pro 2 x 2GHz Dual-Core Xeon
  • 4 GB RAM
  • ant run-task -Dtask.alg=./conf/indexing.alg
    -Dtask.mem=1024M

32
Benchmarking Results
Your results will depend on analysis, etc.
33
Searching
  • Earlier we touched on basics of search using the
    QueryParser
  • Now look at
  • Searcher/IndexReader Lifecycle
  • Query classes
  • More details on the QueryParser
  • Filters
  • Sorting

34
Lifecycle
  • Recall that the IndexReader loads a snapshot of
    the index into memory
  • This means updates made since loading the index
    will not be seen
  • Business rules are needed to define how often to
    reload the index, if at all
  • IndexReader.isCurrent() can help
  • Loading an index is an expensive operation
  • Do not open a Searcher/IndexReader for every
    search

35
Query Classes
  • TermQuery is the basis for all non-span queries
  • BooleanQuery combines multiple Query instances as
    clauses
  • should
  • required
  • PhraseQuery finds terms occurring near each
    other, position-wise
  • slop is the edit distance between two terms
  • Take 2-3 minutes to explore Query implementations
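The same query classes can be combined programmatically instead of going through the QueryParser; field names here are illustrative.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

public class QuerySketch {
  public static BooleanQuery build() {
    BooleanQuery bq = new BooleanQuery();
    // SHOULD: optional clause, contributes to the score if it matches
    bq.add(new TermQuery(new Term("title", "hockey")),
        BooleanClause.Occur.SHOULD);
    // MUST: required clause
    PhraseQuery phrase = new PhraseQuery();
    phrase.add(new Term("body", "stanley"));
    phrase.add(new Term("body", "cup"));
    phrase.setSlop(1); // allow one position of edit distance between terms
    bq.add(phrase, BooleanClause.Occur.MUST);
    return bq;
  }
}
```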

36
Spans
  • Spans provide information about where matches
    took place
  • Not supported by the QueryParser
  • Can be used in BooleanQuery clauses
  • Take 2-3 minutes to explore SpanQuery classes
  • SpanNearQuery useful for doing phrase matching
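A SpanNearQuery for phrase-like matching might be sketched as follows (field name illustrative):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpanSketch {
  public static SpanNearQuery build() {
    SpanQuery[] clauses = new SpanQuery[] {
      new SpanTermQuery(new Term("body", "stanley")),
      new SpanTermQuery(new Term("body", "cup"))
    };
    // slop 0, in order: behaves like an exact phrase, but the spans
    // expose the positions where each match occurred
    return new SpanNearQuery(clauses, 0, true);
  }
}
```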

37
QueryParser
  • MultiFieldQueryParser
  • Boolean operators cause confusion
  • Better to think in terms of required (+ operator)
    and not allowed (- operator)
  • Check JIRA for QueryParser issues
  • http://www.gossamer-threads.com/lists/lucene/java-user/40945
  • Most applications either modify QP, create their
    own, or restrict to a subset of the syntax
  • Your users may not need all the flexibility of
    the QP

38
Sorting
  • Lucene default sort is by score
  • Searcher has several methods that take in a Sort
    object
  • Sorting should be addressed during indexing
  • Sorting is done on Fields containing a single
    term that can be used for comparison
  • The SortField defines the different sort types
    available
  • AUTO, STRING, INT, FLOAT, CUSTOM, SCORE, DOC
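Wiring this together might look like the sketch below; the field names are illustrative, and each sort field must be indexed as a single un-tokenized term.

```java
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class SortSketch {
  public static Hits search(IndexSearcher searcher, Query query)
      throws Exception {
    Sort sort = new Sort(new SortField[] {
      new SortField("rating", SortField.INT, true), // descending rating
      new SortField("date", SortField.STRING)       // then ascending date
    });
    return searcher.search(query, sort);
  }
}
```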

39
Sorting II
  • Look at Searcher, Sort and SortField
  • Custom sorting is done with a SortComparatorSource
  • Sorting can be very expensive
  • Terms are cached in the FieldCache
  • SortFilterTest.java example

40
Filters
  • Filters restrict the search space to a subset of
    Documents
  • Use Cases
  • Search within a Search
  • Restrict by date
  • Rating
  • Security
  • Author

41
Filter Classes
  • QueryWrapperFilter (QueryFilter)
  • Restrict to subset of Documents that match a
    Query
  • RangeFilter
  • Restrict to Documents that fall within a range
  • Better alternative to RangeQuery
  • CachingWrapperFilter
  • Wrap another Filter and provide caching
  • SortFilterTest.java example
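A cached date filter might be sketched as below, assuming dates were indexed with DateTools at day resolution (so Feb. 26, 1987 is the term "19870226"):

```java
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RangeFilter;

public class FilterSketch {
  // Wrap once and reuse: the filter's bits are cached per IndexReader
  static final Filter FEB_26_1987 = new CachingWrapperFilter(
      new RangeFilter("date", "19870226", "19870226", true, true));

  public static Hits search(IndexSearcher searcher, Query q)
      throws Exception {
    return searcher.search(q, FEB_26_1987);
  }
}
```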

42
Expert Results
  • Searcher has several expert methods
  • Hits is not always what you need due to
  • Caching
  • Normalized Scores
  • Reexecutes Query repeatedly as results are
    accessed
  • HitCollector allows low-level access to all
    Documents as they are scored
  • TopDocs represents top n docs that match
  • TopDocsTest in examples
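Bypassing Hits with TopDocs or a HitCollector might look like this sketch:

```java
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class ExpertSketch {
  public static void run(IndexSearcher searcher, Query query)
      throws Exception {
    // Top-n results in one pass, with raw (un-normalized) scores
    TopDocs top = searcher.search(query, null, 10);
    for (ScoreDoc sd : top.scoreDocs) {
      System.out.println("doc " + sd.doc + " scored " + sd.score);
    }
    // Or see every matching doc as it is scored
    searcher.search(query, new HitCollector() {
      public void collect(int doc, float score) {
        // called once per matching document; keep this method fast
      }
    });
  }
}
```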

43
Searchers
  • MultiSearcher
  • Search over multiple Searchables, including
    remote
  • MultiReader
  • Not a Searcher, but can be used with
    IndexSearcher to achieve same results for local
    indexes
  • ParallelMultiSearcher
  • Like MultiSearcher, but threaded
  • RemoteSearchable
  • RMI based remote searching
  • Look at MultiSearcherTest in example code

44
Search Performance
  • Search speed is based on a number of factors
  • Query Type(s)
  • Query Size
  • Analysis
  • Occurrences of Query Terms
  • Optimize
  • Index Size
  • Index type (RAMDirectory, other)
  • Usual Suspects
  • CPU
  • Memory
  • I/O
  • Business Needs

45
Query Types
  • Be careful with WildcardQuery as it rewrites to a
    BooleanQuery containing all the terms that match
    the wildcards
  • Avoid starting a WildcardQuery with a wildcard
  • Use ConstantScoreRangeQuery instead of RangeQuery
  • Be careful with range queries and dates
  • User mailing list and Wiki have useful tips for
    optimizing date handling

46
Query Size
  • Stopword removal
  • Search an "all" field instead of many fields with
    the same terms
  • Disambiguation
  • May be useful when doing synonym expansion
  • Difficult to automate and may be slower
  • Some applications may allow the user to
    disambiguate
  • Relevance Feedback/More Like This
  • Use most important words
  • Important can be defined in a number of ways

47
Usual Suspects
  • CPU
  • Profile your application
  • Memory
  • Examine your heap size, garbage collection
    approach
  • I/O
  • Cache your Searcher
  • Define business logic for refreshing based on
    indexing needs
  • Warm your Searcher before going live -- See Solr
  • Business Needs
  • Do you really need to support Wildcards?
  • What about date range queries down to the
    millisecond?

48
Explanations
  • explain(Query, int) method is useful for
    understanding why a Document scored the way it
    did
  • ExplainsTest in sample code
  • Open Luke and try some queries and then use the
    explain button
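Asking the Searcher why a document matched might look like:

```java
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class ExplainSketch {
  public static void show(IndexSearcher searcher, Query query, int docId)
      throws Exception {
    Explanation exp = searcher.explain(query, docId);
    // Prints the score breakdown: tf, idf, boosts, norms per clause
    System.out.println(exp.toString());
  }
}
```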

49
FieldSelector
  • Prior to version 2.1, Lucene always loaded all
    Fields in a Document
  • FieldSelector API addition allows Lucene to skip
    large Fields
  • Options Load, Lazy Load, No Load, Load and
    Break, Load for Merge, Size, Size and Break
  • Makes storage of original content more viable
    without large cost of loading it when not used
  • FieldSelectorTest in example code
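Loading only one small stored field while skipping the rest might be sketched as follows (the "title" field name is illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.FieldSelectorResult;
import org.apache.lucene.index.IndexReader;

public class SelectorSketch {
  public static String title(IndexReader reader, int docId)
      throws Exception {
    FieldSelector onlyTitle = new FieldSelector() {
      public FieldSelectorResult accept(String fieldName) {
        // Load the title, then stop reading the rest of the document
        return "title".equals(fieldName)
            ? FieldSelectorResult.LOAD_AND_BREAK
            : FieldSelectorResult.NO_LOAD;
      }
    };
    Document doc = reader.document(docId, onlyTitle);
    return doc.get("title");
  }
}
```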

50
Scoring and Similarity
  • Lucene has a sophisticated scoring mechanism
    designed to meet most needs
  • Has hooks for modifying scores
  • Scoring is handled by the Query, Weight and
    Scorer classes

51
Affecting Relevance
  • FunctionQuery from Solr (variation in Lucene)
  • Override Similarity
  • Implement own Query and related classes
  • Payloads
  • HitCollector
  • Take 5 to examine these

52
Lunch
1:00-2:30
53
Recap
  • Indexing
  • Searching
  • Performance
  • Odds and Ends
  • Explains
  • FieldSelector
  • Relevance

54
Next Up
  • Dealing with Content
  • File Formats
  • Extraction
  • Large Task
  • Miscellaneous
  • Wrapping Up

55
File Formats
  • Several open source libraries and projects exist
    for extracting content to use in Lucene
  • PDF: PDFBox
  • http://www.pdfbox.org/
  • Word: POI, Open Office, TextMining
  • http://www.textmining.org/textmining.zip
  • XML: SAX or Pull parser
  • HTML: Neko, JTidy
  • http://people.apache.org/andyc/neko/doc/html/
  • http://jtidy.sourceforge.net/
  • Tika
  • http://incubator.apache.org/tika/
  • Aperture
  • http://aperture.sourceforge.net

56
Aperture Basics
  • Crawlers
  • Data Connectors
  • Extraction Wrappers
  • POI, PDFBox, HTML, XML, etc.
  • http://aperture.wiki.sourceforge.net/Extractors
    will give you info on what comes back from
    Aperture
  • LuceneApertureCallbackHandler in example code

57
Large Task
  • Using the skeleton files in the
    com.lucenebootcamp.training.full package
  • Get some content
  • Web, file system
  • Different file formats
  • Index it
  • Plan out your fields, boosts, field properties
  • Support updates and deletes
  • Optional
  • How fast can you make it go? Divide and conquer?
    Multithreaded?

58
Large Task
  • Search Content
  • Allow for arbitrary user queries across multiple
    Fields via command line or simple web interface
  • How fast can you make it?
  • Support
  • Sort
  • Filter
  • Explains
  • How much slower is it to retrieve an explanation?

59
Large Task
  • Document Retrieval
  • Display/write out one or more documents
  • Support FieldSelector

60
Large Task
  • Optional Tasks
  • Hit Highlighting using contrib/Highlighter
  • Multithreaded indexing and Search
  • Explore other Field construction options
  • Binary fields, term vectors
  • Use Lucene trunk version and try out some of the
    changes in indexing
  • Try out Solr or Nutch at http://lucene.apache.org/
  • What do they offer that Lucene Java doesn't
    that you might need?

61
Large Task Metadata
  • Pair up if you want
  • Ask questions
  • 2 hours
  • Use Luke to check your index!
  • Explore other parts of Lucene that you are
    interested in
  • Be prepared to discuss/share with the class

62
Large Task Post-Mortem
  • Volunteers to share?

63
Term Information
  • TermEnum gives access to terms and how many
    Documents they occur in
  • IndexReader.terms()
  • IndexReader.termPositions()
  • TermDocs gives access to the frequency of a term
    in a Document
  • IndexReader.termDocs()
  • Term Vectors give access to term frequency
    information in a given Document
  • IndexReader.getTermFreqVector
  • TermsTest in sample code
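Walking every term in the index with TermEnum might look like this sketch:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class TermsSketch {
  public static void dump(IndexReader reader) throws Exception {
    TermEnum terms = reader.terms();
    while (terms.next()) {
      Term t = terms.term();
      // docFreq() = number of Documents that contain this term
      System.out.println(t.field() + ":" + t.text()
          + " " + terms.docFreq());
    }
    terms.close();
  }
}
```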

64
Lucene Contributions
  • Many people have generously contributed code to
    help solve common problems
  • These are in contrib directory of the source
  • Popular
  • Analyzers
  • Highlighter
  • Queries and MoreLikeThis
  • Snowball Stemmers
  • Spellchecker

65
Open Discussion
  • Multilingual Best Practices
  • UNICODE
  • One Index versus many
  • Advanced Analysis
  • Distributed Lucene
  • Crawling
  • Hadoop
  • Nutch
  • Solr

66
Resources
  • http://lucene.apache.org/
  • http://en.wikipedia.org/wiki/Vector_space_model
  • Modern Information Retrieval by Baeza-Yates and
    Ribeiro-Neto
  • Lucene In Action by Hatcher and Gospodnetic
  • Wiki
  • Mailing Lists
  • java-user@lucene.apache.org
  • Discussions on how to use Lucene
  • java-dev@lucene.apache.org
  • Discussions on how to develop Lucene
  • Issue Tracking
  • https://issues.apache.org/jira/secure/Dashboard.jspa
  • We always welcome patches
  • Ask on the mailing list before reporting a bug

67
Resources
  • trainer@lucenebootcamp.com

68
Finally
  • Please take the time to fill out a survey to help
    me improve this training
  • Located in base directory of source
  • Email it to me at trainer@lucenebootcamp.com
  • There are several Lucene related talks on Friday

69
Extras
70
Task 2
  • Take 10-15 minutes, pair up, and write an
    Analyzer and Unit Test
  • Examine results in Luke
  • Run some searches
  • Ideas
  • Combine existing Tokenizers and TokenFilters
  • Normalize abbreviations
  • Filter out all words beginning with the letter A
  • Identify/Mark sentences
  • Questions
  • What would help improve search results?

71
Task 2 Results
  • Share what you did and why
  • Improving Results (in most cases)
  • Stemming
  • Ignore Case
  • Stopword Removal
  • Synonyms
  • Pay attention to business needs

72
Grab Bag
  • Accessing Term Information
  • TermEnum
  • TermDocs
  • Term Vectors
  • FieldSelector
  • Scoring and Similarity
  • File Formats

73
Task 6
  • Count and print all the unique terms in the index
    and their frequencies
  • Notes
  • Half of the class write it using TermEnum and
    TermDocs
  • Other Half write it using Term Vectors
  • Time your Task
  • Only count the title and body content

74
Task 6 Results
  • Term Vector approach is faster on smaller
    collections
  • TermEnum approach is faster on larger collections

75
Task 4
  • Re-index your collection
  • Add in a rating field that randomly assigns a
    number between 0 and 9
  • Write searches to sort by
  • Date
  • Title
  • Rating, Date, Doc Id
  • A Custom Sort
  • Questions
  • How to sort the title?
  • How to sort multiple Fields?

76
Task 4 Results
  • Add a separate stitle Field to use for sorting
    the title

77
Task 5
  • Create and search using Filters to
  • Restrict to all docs written on Feb. 26, 1987
  • Restrict to all docs with the word "computer" in
    the title
  • Also
  • Create a Filter where the length of the body or
    title is greater than X

78
Task 5 Results
  • Solr has more advanced Filter mechanisms that may
    be worth using
  • Cache filters

79
Task 7
  • Pair up if you like and take 30-40 minutes to
  • Pick two file formats to work on
  • Identify content in that format
  • Can you index contents on your hard drive?
  • Project Gutenberg, Creative Commons, Wikipedia
  • Combine w/ Reuters collection
  • Extract the content and index it using the
    appropriate library
  • Store the content as a Field
  • Search the content
  • Load Documents with and without FieldSelector and
    measure performance

80
Task 7 (cont.)
  • Include score and explanation in results
  • Dump results to XML or HTML
  • Be prepared to share with class what you did
  • What libraries did you use?
  • What content did you use?
  • What is your Document structure?
  • What issues did you have?

81
20 Minute Break
82
Task 7 Results
  • Explain what your group did
  • Build a Content Handler Framework
  • Or help out with Tika

83
Task 8
  • Building on Task 7
  • Incorporate one or more contrib packages into
    your solution