Lucene Boot Camp - PowerPoint PPT Presentation

1
Lucene Boot Camp
  • Grant Ingersoll
  • Lucid Imagination
  • Nov. 12, 2007
  • Atlanta, Georgia

2
Intro
  • My Background
  • Your Background
  • Brief History of Lucene
  • Goals for Tutorial
  • Understand Lucene core capabilities
  • Real examples, real code, real data
  • Ask Questions!!!!!

3
Schedule
  • 10:00-10:10 Introducing Lucene and Search
  • 10:10-12:00 Indexing, Analysis, Searching,
    Performance
  • 12:00-12:05 Break
  • 12:05-1:00 More on Indexing, Analysis, Searching,
    Performance
  • 1:00-2:30 Lunch
  • 2:30-2:40 Recap, Questions, Content
  • 2:40-4:00 Class Example
  • 4:00-4:20 Break
  • 4:20-5:00 Class Example
  • 5:00-5:20 Lucene Contributions (time permitting)
  • 5:20-5:25 Open Discussion (time permitting)
  • 5:25-5:30 Resources/Wrap Up

4
Lucene is
  • NOT a crawler
  • See Nutch
  • NOT an application
  • See PoweredBy on the Wiki
  • NOT a library for doing Google PageRank or other
    link analysis algorithms
  • See Nutch
  • A library for enabling text based search

5
A Few Words about Solr
  • HTTP-based Search Server
  • XML Configuration
  • XML, JSON, Ruby, PHP, Java support
  • Caching, Replication
  • Many, many nice features that Lucene users need
  • http://lucene.apache.org/solr

6
Search Basics
  • Goal: Identify documents that are similar to the
    input query
  • Lucene uses a modified Vector Space Model (VSM)
  • Boolean + VSM
  • TF-IDF
  • The words in the document and the query each
    define a Vector in an n-dimensional space
  • Sim(q1, d1) = cos θ
  • In Lucene, the boolean approach restricts which
    documents get scored

dj = <w1,j, w2,j, …, wn,j>  q = <w1,q, w2,q, …, wn,q>
w = weight assigned to a term
7
Indexing
  • Process of preparing and adding text to Lucene
  • Optimized for searching
  • Key Point: Lucene only indexes Strings
  • What does this mean?
  • Lucene doesn't care about XML, Word, PDF, etc.
  • There are many good open source extractors
    available
  • It's our job to convert whatever file format we
    have into something Lucene can use

8
Indexing Classes
  • Analyzer
  • Creates tokens using a Tokenizer and filters them
    through zero or more TokenFilters
  • IndexWriter
  • Responsible for converting text into internal
    Lucene format

9
Indexing Classes
  • Directory
  • Where the Index is stored
  • RAMDirectory, FSDirectory, others
  • Document
  • A collection of Fields
  • Can be boosted
  • Field
  • Free text, keywords, dates, etc.
  • Defines attributes for storing, indexing
  • Can be boosted
  • Field Constructors and parameters
  • Open up Fieldable and Field in IDE

10
How to Index
  • Create IndexWriter
  • For each input
  • Create a Document
  • Add Fields to the Document
  • Add the Document to the IndexWriter
  • Close the IndexWriter
  • Optimize (optional)
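The steps above, written against the Lucene 2.x API of the day, might look like the following sketch. The index path, field names, and file handling are illustrative assumptions, not from the slides.

```java
import java.io.File;
import java.io.FileReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SimpleIndexer {
  public static void main(String[] args) throws Exception {
    // Create the IndexWriter (true = create a new index)
    IndexWriter writer = new IndexWriter("/tmp/index",
        new StandardAnalyzer(), true);
    // For each input: create a Document, add Fields, add to the writer
    for (File file : new File(args[0]).listFiles()) {
      Document doc = new Document();
      // Stored, un-tokenized: usable for display and exact lookup
      doc.add(new Field("path", file.getPath(),
          Field.Store.YES, Field.Index.UN_TOKENIZED));
      // Reader-based Field: tokenized, not stored
      doc.add(new Field("contents", new FileReader(file)));
      writer.addDocument(doc);
    }
    writer.optimize(); // optional
    writer.close();
  }
}
```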

11
Task 1.a
  • From the Boot Camp Files, use the
    basic.ReutersIndexer skeleton to start
  • Index the small Reuters Collection using the
    IndexWriter, a Directory and StandardAnalyzer
  • Boost every 10th document by 3
  • Questions to Answer
  • What Fields should I define?
  • What attributes should each Field have?
  • What Fields should OMIT_NORMS?
  • Pick a field to boost and give a reason why you
    think it should be boosted

12
Use the Luke
13
Searching
  • Key Classes
  • Searcher
  • Provides methods for searching
  • Take a moment to look at the Searcher class
    declaration
  • IndexSearcher, MultiSearcher,
    ParallelMultiSearcher
  • IndexReader
  • Loads a snapshot of the index into memory for
    searching
  • Hits
  • Storage/caching of results from searching
  • QueryParser
  • JavaCC grammar for creating Lucene Queries
  • http://lucene.apache.org/java/docs/queryparsersyntax.html
  • Query
  • Logical representation of the program's
    information need

14
Query Parsing
  • Basic syntax
  • title:hockey (body:stanley AND body:cup)
  • OR/AND must be uppercase
  • Default operator is OR (can be changed)
  • Supports fairly advanced syntax, see the website
  • http://lucene.apache.org/java/docs/queryparsersyntax.html
  • Doesn't always play nice, so beware
  • Many applications construct queries
    programmatically or restrict syntax
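Parsing and running a query with the 2.x QueryParser might look like this sketch; it assumes an existing index at /tmp/index with "title" and "body" fields.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchSketch {
  public static void main(String[] args) throws Exception {
    IndexSearcher searcher = new IndexSearcher("/tmp/index");
    // "body" is the default field for terms with no field: prefix
    QueryParser parser = new QueryParser("body", new StandardAnalyzer());
    Query query = parser.parse("title:hockey (body:stanley AND body:cup)");
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
      // hits.doc(i) loads the stored fields of the i-th result
      System.out.println(hits.doc(i).get("title") + " : " + hits.score(i));
    }
    searcher.close();
  }
}
```

Note the Analyzer passed to the QueryParser should generally match the one used at indexing time.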

15
Task 1.b
  • Using the ReutersIndexerTest.java skeleton in the
    boot camp files
  • Search your newly created index using queries you
    develop
  • Delete a Document by the doc id
  • Hints
  • Use an IndexSearcher
  • Create a Query using the QueryParser
  • Display the results from the Hits
  • Questions
  • What is the default field for the QueryParser?
  • What Analyzer to use?

16
Task 1 Results
  • Locks
  • Lucene maintains locks on files to prevent index
    corruption
  • Located in same directory as index
  • Scores from Hits are normalized
  • Scores across queries are NOT comparable
  • Lucene 2.3 has some transactional semantics for
    indexing, but is not a DB

17
Deletion and Updates
  • Deletions can be a bit confusing
  • Both IndexReader and IndexWriter have delete
    methods
  • Updates are always a delete and an add
  • Updates are always a delete and an add
  • Yes, that is a repeat!
  • Nature of data structures used in search

18
Analysis
  • Analysis is the process of creating Tokens to be
    indexed
  • Analysis is usually done to improve results
    overall, but it comes with a price
  • Lucene comes with many different Analyzers,
    Tokenizers and TokenFilters, each with their own
    goals
  • See contrib/analyzers
  • StandardAnalyzer is included with the core JAR
    and does a good job for most English and
    Latin-based tasks
  • Often times you want the same content analyzed in
    different ways
  • Consider a catch-all Field in addition to other
    Fields

19
Commonly Used Analyzers
  • StandardAnalyzer
  • WhitespaceAnalyzer
  • PerFieldAnalyzerWrapper
  • SimpleAnalyzer

20
Indexing in a Nutshell
  • For each Document
  • For each Field to be tokenized
  • Create the tokens using the specified Tokenizer
  • Tokens consist of a String, position, type and
    offset information
  • Pass the tokens through the chained TokenFilters
    where they can be changed or removed
  • Add the end result to the inverted index
  • Position information can be altered
  • Useful when removing words or to prevent phrases
    from matching

21
Inverted Index
aardvark
Little Red Riding Hood
0
hood
0
1
little
0
2
Robin Hood
1
red
0
riding
0
robin
1
Little Women
2
women
2
zoo
22
Tokenization
  • Split words into Tokens to be processed
  • Tokenization is fairly straightforward for most
    languages that use a space for word segmentation
  • More difficult for some East Asian languages
  • See the CJK Analyzer

23
Modifying Tokens
  • TokenFilters are used to alter the token stream
    to be indexed
  • Common tasks
  • Remove stopwords
  • Lower case
  • Stem/Normalize: Wi-Fi -> Wi Fi
  • Add Synonyms
  • StandardAnalyzer does things that you may not want

24
Custom Analyzers
  • Solution: write your own Analyzer
  • Better solution: write a configurable Analyzer so
    you only need one Analyzer that you can easily
    change for your projects
  • See Solr
  • Tokenizers and TokenFilters must be newly
    constructed for each input
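A minimal custom Analyzer in the 2.x style chains a Tokenizer through TokenFilters inside tokenStream(), which is invoked per input. The particular filter chain here is illustrative:

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class MyAnalyzer extends Analyzer {
  // Called for each field value; new Tokenizer/TokenFilter
  // instances are constructed for each input
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new LowerCaseFilter(result);  // ignore case
    result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS);
    return result;
  }
}
```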

25
Special Cases
  • Dates and numbers need special treatment to be
    searchable
  • o.a.l.document.DateTools
  • org.apache.solr.util.NumberUtils
  • Altering Position Information
  • Increase Position Gap between sentences to
    prevent phrases from crossing sentence boundaries
  • Index synonyms at the same position so query can
    match regardless of synonym used
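Indexing a date with DateTools might look like the sketch below; the "date" field name is an assumption. Day resolution keeps the number of distinct terms manageable for range queries and sorting.

```java
import java.util.Date;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DateFieldSketch {
  public static void addDate(Document doc, Date date) {
    // Day resolution yields a string like "19870226";
    // lexicographic order then matches chronological order
    String value = DateTools.dateToString(date, DateTools.Resolution.DAY);
    doc.add(new Field("date", value,
        Field.Store.YES, Field.Index.UN_TOKENIZED));
  }
}
```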

26
5 minute Break
27
Indexing Performance
  • Behind the Scenes
  • Lucene indexes Documents into memory
  • At certain trigger points, memory (segments) are
    flushed to the Directory
  • Segments are periodically merged
  • Lucene 2.3 has significant performance
    improvements

28
IndexWriter Performance Factors
  • maxBufferedDocs
  • Minimum number of docs buffered before a merge
    occurs and a new segment is created
  • Usually: larger = faster, but more RAM
  • mergeFactor
  • How often segments are merged
  • Smaller = less RAM, better for incremental
    updates
  • Larger = faster, better for batch indexing
  • maxFieldLength
  • Limit the number of terms in a Document
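The knobs above are plain setters on IndexWriter; the values below are illustrative starting points for batch indexing, not recommendations from the slides.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TuningSketch {
  public static IndexWriter openForBatch(String path) throws Exception {
    IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
    writer.setMergeFactor(30);        // merge less often: faster batch indexing
    writer.setMaxBufferedDocs(1000);  // buffer more docs in RAM per segment
    writer.setMaxFieldLength(50000);  // cap terms indexed per field
    // In 2.3+, prefer sizing the buffer directly instead of the two
    // settings above:
    // writer.setRAMBufferSizeMB(64.0);
    return writer;
  }
}
```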

29
Lucene 2.3 IndexWriter Changes
  • setRAMBufferSizeMB
  • New model for automagically controlling indexing
    factors based on the amount of memory in use
  • Obsoletes setMaxBufferedDocs and setMergeFactor
  • Takes storage and term vectors out of the merge
    process
  • Turn off auto-commit if there are stored fields
    and term vectors
  • Provides significant performance increase

30
Index Threading
  • IndexWriter and IndexReader are thread-safe and
    can be shared between threads without external
    synchronization
  • One open IndexWriter per Directory
  • Parallel Indexing
  • Index to separate Directory instances
  • Merge using IndexWriter.addIndexes
  • Could also distribute and collect

31
Benchmarking Indexing
  • contrib/benchmark
  • Try out different algorithms between Lucene 2.2
    and trunk (2.3)
  • contrib/benchmark/conf
  • indexing.alg
  • indexing-multithreaded.alg
  • Info
  • Mac Pro 2 x 2GHz Dual-Core Xeon
  • 4 GB RAM
  • ant run-task -Dtask.alg=./conf/indexing.alg
    -Dtask.mem=1024M

32
Benchmarking Results
Your results will depend on analysis, etc.
33
Searching
  • Earlier we touched on basics of search using the
    QueryParser
  • Now look at
  • Searcher/IndexReader Lifecycle
  • Query classes
  • More details on the QueryParser
  • Filters
  • Sorting

34
Lifecycle
  • Recall that the IndexReader loads a snapshot of
    the index into memory
  • This means updates made since loading the index
    will not be seen
  • Business rules are needed to define how often to
    reload the index, if at all
  • IndexReader.isCurrent() can help
  • Loading an index is an expensive operation
  • Do not open a Searcher/IndexReader for every
    search

35
Query Classes
  • TermQuery is the basis for all non-span queries
  • BooleanQuery combines multiple Query instances as
    clauses
  • should
  • required
  • PhraseQuery finds terms occurring near each
    other, position-wise
  • slop is the edit distance between two terms
  • Take 2-3 minutes to explore Query implementations
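The same query classes can be combined programmatically instead of going through the QueryParser; field names here are illustrative.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

public class QuerySketch {
  public static BooleanQuery build() {
    BooleanQuery bq = new BooleanQuery();
    // SHOULD: optional clause, contributes to the score if it matches
    bq.add(new TermQuery(new Term("title", "hockey")),
        BooleanClause.Occur.SHOULD);
    // MUST: required clause
    PhraseQuery phrase = new PhraseQuery();
    phrase.add(new Term("body", "stanley"));
    phrase.add(new Term("body", "cup"));
    phrase.setSlop(1); // allow one position of edit distance between terms
    bq.add(phrase, BooleanClause.Occur.MUST);
    return bq;
  }
}
```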

36
Spans
  • Spans provide information about where matches
    took place
  • Not supported by the QueryParser
  • Can be used in BooleanQuery clauses
  • Take 2-3 minutes to explore SpanQuery classes
  • SpanNearQuery useful for doing phrase matching
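A SpanNearQuery for phrase-like matching might be sketched as follows (field name illustrative):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpanSketch {
  public static SpanNearQuery build() {
    SpanQuery[] clauses = new SpanQuery[] {
      new SpanTermQuery(new Term("body", "stanley")),
      new SpanTermQuery(new Term("body", "cup"))
    };
    // slop 0, in order: behaves like an exact phrase, but the spans
    // expose the positions where each match occurred
    return new SpanNearQuery(clauses, 0, true);
  }
}
```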

37
QueryParser
  • MultiFieldQueryParser
  • Boolean operators cause confusion
  • Better to think in terms of required (+ operator)
    and not allowed (- operator)
  • Check JIRA for QueryParser issues
  • http://www.gossamer-threads.com/lists/lucene/java-user/40945
  • Most applications either modify QP, create their
    own, or restrict to a subset of the syntax
  • Your users may not need all the flexibility of
    the QP

38
Sorting
  • Lucene default sort is by score
  • Searcher has several methods that take in a Sort
    object
  • Sorting should be addressed during indexing
  • Sorting is done on Fields containing a single
    term that can be used for comparison
  • The SortField defines the different sort types
    available
  • AUTO, STRING, INT, FLOAT, CUSTOM, SCORE, DOC
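Wiring this together might look like the sketch below; the field names are illustrative, and each sort field must be indexed as a single un-tokenized term.

```java
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class SortSketch {
  public static Hits search(IndexSearcher searcher, Query query)
      throws Exception {
    Sort sort = new Sort(new SortField[] {
      new SortField("rating", SortField.INT, true), // descending rating
      new SortField("date", SortField.STRING)       // then ascending date
    });
    return searcher.search(query, sort);
  }
}
```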

39
Sorting II
  • Look at Searcher, Sort and SortField
  • Custom sorting is done with a SortComparatorSource
  • Sorting can be very expensive
  • Terms are cached in the FieldCache
  • SortFilterTest.java example

40
Filters
  • Filters restrict the search space to a subset of
    Documents
  • Use Cases
  • Search within a Search
  • Restrict by date
  • Rating
  • Security
  • Author

41
Filter Classes
  • QueryWrapperFilter (QueryFilter)
  • Restrict to subset of Documents that match a
    Query
  • RangeFilter
  • Restrict to Documents that fall within a range
  • Better alternative to RangeQuery
  • CachingWrapperFilter
  • Wrap another Filter and provide caching
  • SortFilterTest.java example
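A cached date filter might be sketched as below, assuming dates were indexed with DateTools at day resolution (so Feb. 26, 1987 is the term "19870226"):

```java
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RangeFilter;

public class FilterSketch {
  // Wrap once and reuse: the filter's bits are cached per IndexReader
  static final Filter FEB_26_1987 = new CachingWrapperFilter(
      new RangeFilter("date", "19870226", "19870226", true, true));

  public static Hits search(IndexSearcher searcher, Query q)
      throws Exception {
    return searcher.search(q, FEB_26_1987);
  }
}
```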

42
Expert Results
  • Searcher has several expert methods
  • Hits is not always what you need due to
  • Caching
  • Normalized Scores
  • Reexecutes Query repeatedly as results are
    accessed
  • HitCollector allows low-level access to all
    Documents as they are scored
  • TopDocs represents top n docs that match
  • TopDocsTest in examples
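Bypassing Hits with TopDocs or a HitCollector might look like this sketch:

```java
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class ExpertSketch {
  public static void run(IndexSearcher searcher, Query query)
      throws Exception {
    // Top-n results in one pass, with raw (un-normalized) scores
    TopDocs top = searcher.search(query, null, 10);
    for (ScoreDoc sd : top.scoreDocs) {
      System.out.println("doc " + sd.doc + " scored " + sd.score);
    }
    // Or see every matching doc as it is scored
    searcher.search(query, new HitCollector() {
      public void collect(int doc, float score) {
        // called once per matching document; keep this method fast
      }
    });
  }
}
```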

43
Searchers
  • MultiSearcher
  • Search over multiple Searchables, including
    remote
  • MultiReader
  • Not a Searcher, but can be used with
    IndexSearcher to achieve same results for local
    indexes
  • ParallelMultiSearcher
  • Like MultiSearcher, but threaded
  • RemoteSearchable
  • RMI based remote searching
  • Look at MultiSearcherTest in example code

44
Search Performance
  • Search speed is based on a number of factors
  • Query Type(s)
  • Query Size
  • Analysis
  • Occurrences of Query Terms
  • Optimize
  • Index Size
  • Index type (RAMDirectory, other)
  • Usual Suspects
  • CPU
  • Memory
  • I/O
  • Business Needs

45
Query Types
  • Be careful with WildcardQuery as it rewrites to a
    BooleanQuery containing all the terms that match
    the wildcards
  • Avoid starting a WildcardQuery with a wildcard
  • Use ConstantScoreRangeQuery instead of RangeQuery
  • Be careful with range queries and dates
  • User mailing list and Wiki have useful tips for
    optimizing date handling

46
Query Size
  • Stopword removal
  • Search an "all" field instead of many fields with
    the same terms
  • Disambiguation
  • May be useful when doing synonym expansion
  • Difficult to automate and may be slower
  • Some applications may allow the user to
    disambiguate
  • Relevance Feedback/More Like This
  • Use most important words
  • Important can be defined in a number of ways

47
Usual Suspects
  • CPU
  • Profile your application
  • Memory
  • Examine your heap size, garbage collection
    approach
  • I/O
  • Cache your Searcher
  • Define business logic for refreshing based on
    indexing needs
  • Warm your Searcher before going live -- See Solr
  • Business Needs
  • Do you really need to support Wildcards?
  • What about date range queries down to the
    millisecond?

48
Explanations
  • explain(Query, int) method is useful for
    understanding why a Document scored the way it
    did
  • ExplainsTest in sample code
  • Open Luke and try some queries and then use the
    explain button
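Asking the Searcher why a document matched might look like:

```java
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class ExplainSketch {
  public static void show(IndexSearcher searcher, Query query, int docId)
      throws Exception {
    Explanation exp = searcher.explain(query, docId);
    // Prints the score breakdown: tf, idf, boosts, norms per clause
    System.out.println(exp.toString());
  }
}
```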

49
FieldSelector
  • Prior to version 2.1, Lucene always loaded all
    Fields in a Document
  • FieldSelector API addition allows Lucene to skip
    large Fields
  • Options Load, Lazy Load, No Load, Load and
    Break, Load for Merge, Size, Size and Break
  • Makes storage of original content more viable
    without large cost of loading it when not used
  • FieldSelectorTest in example code
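Loading only one small stored field while skipping the rest might be sketched as follows (the "title" field name is illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.FieldSelectorResult;
import org.apache.lucene.index.IndexReader;

public class SelectorSketch {
  public static String title(IndexReader reader, int docId)
      throws Exception {
    FieldSelector onlyTitle = new FieldSelector() {
      public FieldSelectorResult accept(String fieldName) {
        // Load the title, then stop reading the rest of the document
        return "title".equals(fieldName)
            ? FieldSelectorResult.LOAD_AND_BREAK
            : FieldSelectorResult.NO_LOAD;
      }
    };
    Document doc = reader.document(docId, onlyTitle);
    return doc.get("title");
  }
}
```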

50
Scoring and Similarity
  • Lucene has a sophisticated scoring mechanism
    designed to meet most needs
  • Has hooks for modifying scores
  • Scoring is handled by the Query, Weight and
    Scorer classes

51
Affecting Relevance
  • FunctionQuery from Solr (variation in Lucene)
  • Override Similarity
  • Implement own Query and related classes
  • Payloads
  • HitCollector
  • Take 5 to examine these

52
Lunch
1:00-2:30
53
Recap
  • Indexing
  • Searching
  • Performance
  • Odds and Ends
  • Explains
  • FieldSelector
  • Relevance

54
Next Up
  • Dealing with Content
  • File Formats
  • Extraction
  • Large Task
  • Miscellaneous
  • Wrapping Up

55
File Formats
  • Several open source libraries and projects exist
    for extracting content to use in Lucene
  • PDF: PDFBox
  • http://www.pdfbox.org/
  • Word: POI, Open Office, TextMining
  • http://www.textmining.org/textmining.zip
  • XML: SAX or Pull parser
  • HTML: Neko, JTidy
  • http://people.apache.org/andyc/neko/doc/html/
  • http://jtidy.sourceforge.net/
  • Tika
  • http://incubator.apache.org/tika/
  • Aperture
  • http://aperture.sourceforge.net

56
Aperture Basics
  • Crawlers
  • Data Connectors
  • Extraction Wrappers
  • POI, PDFBox, HTML, XML, etc.
  • http://aperture.wiki.sourceforge.net/Extractors
    will give you info on what comes back from
    Aperture
  • LuceneApertureCallbackHandler in example code

57
Large Task
  • Using the skeleton files in the
    com.lucenebootcamp.training.full package
  • Get some content
  • Web, file system
  • Different file formats
  • Index it
  • Plan out your fields, boosts, field properties
  • Support updates and deletes
  • Optional
  • How fast can you make it go? Divide and conquer?
    Multithreaded?

58
Large Task
  • Search Content
  • Allow for arbitrary user queries across multiple
    Fields via command line or simple web interface
  • How fast can you make it?
  • Support
  • Sort
  • Filter
  • Explains
  • How much slower is it to retrieve an explanation?

59
Large Task
  • Document Retrieval
  • Display/write out one or more documents
  • Support FieldSelector

60
Large Task
  • Optional Tasks
  • Hit Highlighting using contrib/Highlighter
  • Multithreaded indexing and Search
  • Explore other Field construction options
  • Binary fields, term vectors
  • Use Lucene trunk version and try out some of the
    changes in indexing
  • Try out Solr or Nutch at http://lucene.apache.org/
  • What do they offer that Lucene Java doesn't
    that you might need?

61
Large Task Metadata
  • Pair up if you want
  • Ask questions
  • 2 hours
  • Use Luke to check your index!
  • Explore other parts of Lucene that you are
    interested in
  • Be prepared to discuss/share with the class

62
Large Task Post-Mortem
  • Volunteers to share?

63
Term Information
  • TermEnum gives access to terms and how many
    Documents they occur in
  • IndexReader.terms()
  • IndexReader.termPositions()
  • TermDocs gives access to the frequency of a term
    in a Document
  • IndexReader.termDocs()
  • Term Vectors give access to term frequency
    information in a given Document
  • IndexReader.getTermFreqVector
  • TermsTest in sample code
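Walking every term in the index with TermEnum might look like this sketch:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class TermsSketch {
  public static void dump(IndexReader reader) throws Exception {
    TermEnum terms = reader.terms();
    while (terms.next()) {
      Term t = terms.term();
      // docFreq() = number of Documents that contain this term
      System.out.println(t.field() + ":" + t.text()
          + " " + terms.docFreq());
    }
    terms.close();
  }
}
```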

64
Lucene Contributions
  • Many people have generously contributed code to
    help solve common problems
  • These are in contrib directory of the source
  • Popular
  • Analyzers
  • Highlighter
  • Queries and MoreLikeThis
  • Snowball Stemmers
  • Spellchecker

65
Open Discussion
  • Multilingual Best Practices
  • UNICODE
  • One Index versus many
  • Advanced Analysis
  • Distributed Lucene
  • Crawling
  • Hadoop
  • Nutch
  • Solr

66
Resources
  • http://lucene.apache.org/
  • http://en.wikipedia.org/wiki/Vector_space_model
  • Modern Information Retrieval by Baeza-Yates and
    Ribeiro-Neto
  • Lucene In Action by Hatcher and Gospodnetic
  • Wiki
  • Mailing Lists
  • java-user@lucene.apache.org
  • Discussions on how to use Lucene
  • java-dev@lucene.apache.org
  • Discussions on how to develop Lucene
  • Issue Tracking
  • https://issues.apache.org/jira/secure/Dashboard.jspa
  • We always welcome patches
  • Ask on the mailing list before reporting a bug

67
Resources
  • trainer@lucenebootcamp.com

68
Finally
  • Please take the time to fill out a survey to help
    me improve this training
  • Located in base directory of source
  • Email it to me at trainer@lucenebootcamp.com
  • There are several Lucene related talks on Friday

69
Extras
70
Task 2
  • Take 10-15 minutes, pair up, and write an
    Analyzer and Unit Test
  • Examine results in Luke
  • Run some searches
  • Ideas
  • Combine existing Tokenizers and TokenFilters
  • Normalize abbreviations
  • Filter out all words beginning with the letter A
  • Identify/Mark sentences
  • Questions
  • What would help improve search results?

71
Task 2 Results
  • Share what you did and why
  • Improving Results (in most cases)
  • Stemming
  • Ignore Case
  • Stopword Removal
  • Synonyms
  • Pay attention to business needs

72
Grab Bag
  • Accessing Term Information
  • TermEnum
  • TermDocs
  • Term Vectors
  • FieldSelector
  • Scoring and Similarity
  • File Formats

73
Task 6
  • Count and print all the unique terms in the index
    and their frequencies
  • Notes
  • Half of the class write it using TermEnum and
    TermDocs
  • Other Half write it using Term Vectors
  • Time your Task
  • Only count the title and body content

74
Task 6 Results
  • Term Vector approach is faster on smaller
    collections
  • TermEnum approach is faster on larger collections

75
Task 4
  • Re-index your collection
  • Add in a rating field that randomly assigns a
    number between 0 and 9
  • Write searches to sort by
  • Date
  • Title
  • Rating, Date, Doc Id
  • A Custom Sort
  • Questions
  • How to sort the title?
  • How to sort multiple Fields?

76
Task 4 Results
  • Add a separate stitle Field to use for sorting
    the title

77
Task 5
  • Create and search using Filters to
  • Restrict to all docs written on Feb. 26, 1987
  • Restrict to all docs with the word "computer" in
    the title
  • Also
  • Create a Filter where the length of the body or
    title is greater than X

78
Task 5 Results
  • Solr has more advanced Filter mechanisms that may
    be worth using
  • Cache filters

79
Task 7
  • Pair up if you like and take 30-40 minutes to
  • Pick two file formats to work on
  • Identify content in that format
  • Can you index contents on your hard drive?
  • Project Gutenberg, Creative Commons, Wikipedia
  • Combine w/ Reuters collection
  • Extract the content and index it using the
    appropriate library
  • Store the content as a Field
  • Search the content
  • Load Documents with and without FieldSelector and
    measure performance

80
Task 7 (cont.)
  • Include score and explanation in results
  • Dump results to XML or HTML
  • Be prepared to share with class what you did
  • What libraries did you use?
  • What content did you use?
  • What is your Document structure?
  • What issues did you have?

81
20 Minute Break
82
Task 7 Results
  • Explain what your group did
  • Build a Content Handler Framework
  • Or help out with Tika

83
Task 8
  • Building on Task 7
  • Incorporate one or more contrib packages into
    your solution