Full-Text Search with Lucene

Slides: 23

Provided by: ping153

Learn more at: http://people.apache.org

Transcript and Presenter's Notes

1
Full-Text Search with Lucene

Yonik Seeley
yonik_at_apache.org
02 May 2007
Amsterdam, Netherlands

Slides: http://www.apache.org/~yonik
2
What is Lucene?

High-performance, scalable, full-text search library
Focus: indexing + searching documents
100% Java, no dependencies, no config files
No crawlers or document parsing
Users: Wikipedia, Technorati, Monster.com, Nabble, TheServerSide, Akamai, SourceForge
Applications: Eclipse, JIRA, Roller, OpenGrok, Nutch, Solr, many commercial products

3
Inverted Index

Documents (ID: title)
0: Little Red Riding Hood
1: Robin Hood
2: Little Women

Terms (sorted) and the IDs of the documents containing them
aardvark
hood: 0, 1
little: 0, 2
red: 0
riding: 0
robin: 1
women: 2
zoo
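
The term-to-document mapping above can be sketched in a few lines of plain Java. This is only an illustration of the data structure, not Lucene's implementation: the class and method names are hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexSketch {
    // Builds a sorted map: term -> ascending list of IDs of the documents containing it.
    // The document ID is simply the position of the text in the input list.
    static Map<String, List<Integer>> build(List<String> docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String token : docs.get(docId).toLowerCase().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                List<Integer> postings = index.computeIfAbsent(token, t -> new ArrayList<>());
                // Documents are visited in ID order, so only the last entry can be a duplicate.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index =
            build(Arrays.asList("Little Red Riding Hood", "Robin Hood", "Little Women"));
        index.forEach((term, postings) -> System.out.println(term + " -> " + postings));
    }
}
```

Looking up "hood" in the resulting map yields documents 0 and 1, which is the lookup a term query performs against the index.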
4
Basic Application

Document
super_name: Spider-Man
name: Peter Parker
category: superhero
powers: agility, spider-sense

Indexing: Document -> addDocument() -> IndexWriter -> Lucene Index
Searching: Query (powers:agility) -> search() -> IndexSearcher -> Hits (matching docs)

Get the Lucene jar file
Write indexing code to get data and create Document objects
Write code to create query objects
Write code to use/display results
5
Indexing Documents

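import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// true = create a new index, replacing any existing one
IndexWriter writer = new IndexWriter(directory, analyzer, true);
Document doc = new Document();
doc.add(new Field("super_name", "Sandman", Field.Store.YES, Field.Index.TOKENIZED));
// the same field name can be added more than once to hold several values
doc.add(new Field("name", "William Baker", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("name", "Flint Marko", Field.Store.YES, Field.Index.TOKENIZED));
// ...
writer.addDocument(doc);
writer.close();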

6
Field Options

Indexed: necessary for searching or sorting
Tokenized: text analysis done before indexing
Stored: you get these back on a search hit
Compressed
Binary: currently for stored-only fields

7
Searching an Index

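import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

IndexSearcher searcher = new IndexSearcher(directory);
QueryParser parser = new QueryParser("defaultField", analyzer);
Query query = parser.parse("powers:agility");
Hits hits = searcher.search(query);
System.out.println("matches: " + hits.length());
Document doc = hits.doc(0); // look at first match (assumes at least one hit)
System.out.println("name: " + doc.get("name"));
searcher.close();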

8
Scoring

VSM: Vector Space Model
tf: term frequency, number of matching terms in field
lengthNorm: number of tokens in field
idf: inverse document frequency
coord: coordination factor, number of matching terms
document boost
query clause boost
http://lucene.apache.org/java/docs/scoring.html
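
To make the factors above concrete, here is a small sketch in plain Java. The formulas are the ones commonly documented for Lucene's classic default similarity (square root of term frequency, logarithmic idf, inverse square root of field length); treat them as an assumption to check against the scoring page linked above. The class name and the simplified termScore() are hypothetical, and query normalization is left out.

```java
class ScoringSketch {
    // Term frequency factor: grows with the number of occurrences, but sub-linearly.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency: rare terms weigh more than common ones.
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // Length normalization: a match in a short field counts more than in a long one.
    static double lengthNorm(int numTokens) {
        return 1.0 / Math.sqrt(numTokens);
    }

    // Coordination factor: fraction of the query terms that the document matched.
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    // Simplified per-term contribution (query normalization omitted).
    static double termScore(int freq, int docFreq, int numDocs, int numTokens, double boost) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * boost * lengthNorm(numTokens);
    }

    public static void main(String[] args) {
        // Same term, same frequency: the shorter field scores higher.
        System.out.println(termScore(1, 9, 10, 4, 1.0));
        System.out.println(termScore(1, 9, 10, 16, 1.0));
    }
}
```

The two calls in main() differ only in field length, which shows how lengthNorm favors a match in a 4-token field over the same match in a 16-token field.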

9
Query Construction

Lucene QueryParser
Example: queryParser.parse("name:Spider-Man")
good for human-entered queries, debugging, IPC
does text analysis and constructs appropriate queries
not all query types supported
Programmatic query construction
Example: new TermQuery(new Term("name", "Spider-Man"))
explicit, no escaping necessary
does not do text analysis for you

10
Query Examples

justice league
EQUIV: justice OR league
QueryParser default is "optional"
+justice +league -name:aquaman
EQUIV: justice AND league NOT name:aquaman
"justice league" -name:aquaman
title:spiderman^10 description:spiderman
description:"spiderman movie"~10

11
Query Examples 2

releaseDate:[2000 TO 2007]
Range search: lexicographic ordering, so beware of numbers
Wildcard searches: sup?r, su*r, super*
spider~
Fuzzy search: Levenshtein distance
Optional minimum similarity: spider~0.7
(Superman AND "Lex Luthor") OR (+Batman +Joker)
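
The warning about lexicographic ordering in range searches is easy to demonstrate with plain string comparison. The sketch below uses hypothetical helper names and does not call Lucene; zero-padding numbers to a fixed width is a common workaround and is an assumption here, not something stated on the slide.

```java
import java.util.Locale;

class RangeOrderingSketch {
    // True when lower <= value <= upper under plain string (lexicographic) comparison.
    static boolean inRange(String value, String lower, String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
    }

    // Zero-pads a non-negative number to a fixed width so string order equals numeric order.
    static String pad(int n, int width) {
        return String.format(Locale.ROOT, "%0" + width + "d", n);
    }

    public static void main(String[] args) {
        // Four-digit years all have the same width, so the range behaves as expected.
        System.out.println(inRange("2005", "2000", "2007"));
        // Mixed widths break: "10" sorts before "5" as a string.
        System.out.println(inRange("10", "5", "20"));
        // Padding every value to the same width restores numeric order.
        System.out.println(inRange(pad(10, 4), pad(5, 4), pad(20, 4)));
    }
}
```

Values padded this way must be indexed and queried in the same padded form, otherwise the terms in the index will not line up with the range bounds.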

12
Deleting Documents

IndexReader.deleteDocument(int id)
exclusive with IndexWriter
powerful
Deleting with IndexWriter
deleteDocuments(Term t)
updateDocument(Term t, Document d)
Deleting does not immediately reclaim space

13
Index Structure

IndexWriter params
MaxBufferedDocs
MergeFactor
MaxMergeDocs
MaxFieldLength

segments_3
_0.fnm _0.fdt _0.fdx _0.frq _0.tis _0.tii _0.prx _0.nrm _0_1.del
_1.fnm _1.fdt _1.fdx
14
Performance

Indexing Performance
Index documents in batches
Raise merge factor
Raise maxBufferedDocs
Searching Performance
Reuse IndexSearcher
Lower merge factor
optimize
Use cached filters (see QueryFilter)
Instead of: +superhero +lang:english
Use: superhero filtered by lang:english
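
The idea behind a cached filter can be sketched with java.util.BitSet: the set of documents passing the filter is computed once, kept, and then intersected with the matches of each later query. This is a conceptual sketch only. The class, the string cache key and the computations counter are hypothetical and do not mirror the QueryFilter API.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

class CachedFilterSketch {
    private final Map<String, BitSet> cache = new HashMap<>();
    int computations = 0; // how many times a filter actually had to be computed

    // Returns the cached bit set for filterKey, computing it only on first use.
    BitSet filterBits(String filterKey, Supplier<BitSet> compute) {
        BitSet bits = cache.get(filterKey);
        if (bits == null) {
            bits = compute.get();
            computations++;
            cache.put(filterKey, bits);
        }
        return bits;
    }

    // Documents that match the query and also pass the filter.
    static BitSet applyFilter(BitSet queryMatches, BitSet filter) {
        BitSet result = (BitSet) queryMatches.clone();
        result.and(filter);
        return result;
    }

    // Convenience helper: a bit set with the given document IDs set.
    static BitSet bits(int... docIds) {
        BitSet set = new BitSet();
        for (int id : docIds) {
            set.set(id);
        }
        return set;
    }

    public static void main(String[] args) {
        CachedFilterSketch filters = new CachedFilterSketch();
        BitSet superhero = bits(1, 2, 3); // documents matching the query "superhero"
        BitSet english = filters.filterBits("lang:english", () -> bits(0, 2, 3));
        System.out.println(applyFilter(superhero, english));
    }
}
```

The filter clause is evaluated once and reused across queries, so repeated searches with the same restriction pay only for the cheap bit set intersection.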

15
Analysis & Search Relevancy

Document Indexing Analysis
Input: LexCorp BFG-9000
WhitespaceTokenizer: LexCorp | BFG-9000
WordDelimiterFilter (catenateWords=1): Lex | Corp | LexCorp | BFG | 9000
LowercaseFilter: lex | corp | lexcorp | bfg | 9000

Query Analysis
Input: Lex corp bfg9000
WhitespaceTokenizer: Lex | corp | bfg9000
WordDelimiterFilter (catenateWords=0): Lex | corp | bfg | 9000
LowercaseFilter: lex | corp | bfg | 9000

A Match!
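
The two analysis chains on this slide can be imitated in plain Java to see why the query matches. This is a much-simplified stand-in for the real tokenizer and filters. The class and method names are hypothetical, the word splitting handles only non-alphanumeric characters, lower-to-upper case changes and letter/digit changes, and the final containsAll() check ignores term positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class AnalysisSketch {
    // Splits one token on non-alphanumerics, lower->upper case changes and letter<->digit changes.
    static List<String> splitParts(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (!Character.isLetterOrDigit(c)) {
                flush(cur, parts);
                continue;
            }
            if (cur.length() > 0) {
                char prev = cur.charAt(cur.length() - 1);
                boolean caseChange = Character.isLowerCase(prev) && Character.isUpperCase(c);
                boolean typeChange = Character.isLetter(prev) != Character.isLetter(c);
                if (caseChange || typeChange) {
                    flush(cur, parts);
                }
            }
            cur.append(c);
        }
        flush(cur, parts);
        return parts;
    }

    private static void flush(StringBuilder cur, List<String> parts) {
        if (cur.length() > 0) {
            parts.add(cur.toString());
            cur.setLength(0);
        }
    }

    // Whitespace split, then word-delimiter split, then lowercasing. With catenateWords set,
    // each run of two or more adjacent alphabetic parts is also emitted joined together.
    static List<String> analyze(String text, boolean catenateWords) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            StringBuilder run = new StringBuilder();
            int runLength = 0;
            for (String part : splitParts(token)) {
                out.add(part.toLowerCase(Locale.ROOT));
                if (Character.isLetter(part.charAt(0))) {
                    run.append(part);
                    runLength++;
                } else {
                    if (catenateWords && runLength > 1) {
                        out.add(run.toString().toLowerCase(Locale.ROOT));
                    }
                    run.setLength(0);
                    runLength = 0;
                }
            }
            if (catenateWords && runLength > 1) {
                out.add(run.toString().toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> indexed = analyze("LexCorp BFG-9000", true);
        List<String> query = analyze("Lex corp bfg9000", false);
        System.out.println(indexed + " / " + query + " / match: " + indexed.containsAll(query));
    }
}
```

Catenating words only on the indexing side means the index holds both the split and the joined forms, so a query typed as "Lex corp" or as "lexcorp" can find the same document.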
16
Tokenizers