Title: Document Indexing and Scoring in Lucene and Nutch

Slide 1: Document Indexing and Scoring in Lucene and Nutch
- IST 441, Spring 2009
- Instructor: Dr. C. Lee Giles
- Presenter: Saurabh Kataria
Slide 2: Outline
- Architecture of Lucene and Nutch
- Indexing in Lucene
- Searching in Lucene
- Lucene's scoring function
Slide 3: Lucene's Open Architecture

[Diagram: crawlers (an FS crawler over the file system, Larm over the WWW, an IMAP server source) feed parsers (TXT parser, PDF parser, HTML parser) that produce Lucene documents; an analyzer (StopAnalyzer, StandardAnalyzer, or a language-specific CN/DE analyzer) tokenizes them for the indexer, which builds the index; searchers answer queries against that index. Crawling, parsing, and searching sit outside Lucene itself; Lucene provides analysis, indexing, and search.]
Slide 4: Nutch's Architecture
- Courtesy of Doug Cutting's presentation slides at WWW 2004
Slide 5: Nutch's Architecture
- Searcher: Given a query, it must quickly find a small relevant subset of a corpus of documents, then present them. Finding a large relevant subset is normally done with an inverted index of the corpus; ranking within that set produces the most relevant documents, which then must be summarized for display.
- Indexer: Creates the inverted index from which the searcher extracts results. It uses Lucene for storing indexes.
- Web DB: Stores the document contents for indexing and later summarization by the searcher, along with information such as the link structure of the document space and the time each document was last fetched.
- Fetcher: Requests web pages, parses them, and extracts links from them. Nutch's robot has been written entirely from scratch.
Slide 6: Lucene's Index (Conceptual)

[Diagram: an Index contains Documents; each Document contains Fields; each Field is a (Name, Value) pair.]
Slide 7: Create a Lucene Index (Step 1)
- Create a Lucene document and add fields:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public void createDoc(String title, String body)
    {
        Document doc = new Document();
        doc.add(new Field("text", body,
                Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field("title", title,
                Field.Store.YES, Field.Index.TOKENIZED));
    }
Slide 8: Create a Lucene Index (Step 2)
- Create an analyzer
- Options:
  - WhitespaceAnalyzer
    - divides text at whitespace
  - SimpleAnalyzer
    - divides text at non-letters
    - converts to lower case
  - StopAnalyzer
    - same as SimpleAnalyzer, but also removes stop words
  - StandardAnalyzer
    - good for most European languages
    - removes stop words
    - converts to lower case
Slide 9: Create a Lucene Index (Step 2)
- An example of analyzing a document
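To make the analyzer behavior concrete, here is a plain-Java sketch of what SimpleAnalyzer does to a piece of text: split at non-letter characters and lower-case each token. This is illustrative only, not Lucene's actual implementation (the real analyzers operate on TokenStreams); the class name is my own.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of SimpleAnalyzer's behavior: divide text at non-letters
// and convert tokens to lower case. Not Lucene code.
public class SimpleAnalyzerSketch {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c)); // letters extend the token
            } else if (current.length() > 0) {
                tokens.add(current.toString());           // non-letter ends the token
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        // Digits and punctuation are dropped, letters are lower-cased:
        System.out.println(tokenize("Penn State Football, 2009!"));
        // [penn, state, football]
    }
}
```

Note that a StopAnalyzer would additionally drop stop words such as "to" from the token list, and a WhitespaceAnalyzer would keep "Football," as a single token.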
Slide 10: Create a Lucene Index (Step 3)
- Create an index writer and add the Lucene document to the index:

    import java.io.IOException;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.analysis.SimpleAnalyzer;

    public void writeDoc(Document doc, String idxPath)
    {
        try {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.getDirectory(idxPath, true),
                    new SimpleAnalyzer(), true);
            writer.addDocument(doc);
            writer.close();
        } catch (IOException exp) {
            System.out.println("I/O Error!");
        }
    }
Slide 11: Lucene Index Behind the Scenes
- Inverted Index (Inverted File)

[Diagram: Doc 1 ("Penn State Football") and Doc 2 ("Football players") are inverted into a posting table that maps each term, e.g. "football" or "State", to the documents containing it.]
Slide 12: Posting Table
- The posting table is a fast look-up mechanism
  - Key: word
  - Value: posting id and satellite data (df, offset, ...)
- Lucene implements the posting table with Java's hash table
  - Instantiated from java.util.Hashtable
  - The hash function depends on the JVM
    - hc2 = hc1 * 31 + nextChar
- Posting table usage
  - Indexing: insertion (new terms), update (existing terms)
  - Searching: lookup, and construction of the document vector
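The idea above can be sketched with a hash map from term to posting list. This is a minimal illustration of the lookup mechanism, not Lucene's actual data structure (real posting lists also carry frequencies and offsets); the class and method names are my own. Java's `String.hashCode` is exactly the accumulation shown on the slide, h = 31*h + nextChar.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal posting-table sketch: term -> list of document ids.
public class PostingTable {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Indexing: insert a new term or update an existing one.
    public void add(String term, int docId) {
        postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
    }

    // Searching: constant-expected-time lookup by term.
    public List<Integer> lookup(String term) {
        return postings.getOrDefault(term, List.of());
    }

    // df(t): number of postings for the term.
    public int df(String term) {
        return lookup(term).size();
    }

    public static void main(String[] args) {
        PostingTable table = new PostingTable();
        // Doc 1: "penn state football"; Doc 2: "football players"
        table.add("penn", 1); table.add("state", 1); table.add("football", 1);
        table.add("football", 2); table.add("players", 2);
        System.out.println(table.lookup("football")); // [1, 2]
        System.out.println(table.df("state"));        // 1
    }
}
```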
Slide 13: Lucene Index Files: Field Infos File (.fnm)

    1, <content, 0x01>
Slide 14: Lucene Index Files: Term Dictionary File (.tis)

    4, <<0,football,1>,2> <<0,penn,1>,1>
       <<1,layers,1>,1> <<0,state,1>,2>

Document frequency can be obtained from this file.
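Each term dictionary entry stores the term as a (shared-prefix length, suffix) pair relative to the previous term, which is why "players" appears as <1,layers> when it follows "penn". A sketch of that prefix compression, under the assumption that terms arrive in sorted order (class and method names are my own):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of term-dictionary prefix compression: each sorted term is
// encoded as <length of prefix shared with previous term, suffix>.
public class PrefixCompression {
    public static List<String> compress(List<String> sortedTerms) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            // Count leading characters shared with the previous term.
            int shared = 0;
            int max = Math.min(prev.length(), term.length());
            while (shared < max && prev.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            out.add("<" + shared + "," + term.substring(shared) + ">");
            prev = term;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(compress(List.of("football", "penn", "players", "state")));
        // [<0,football>, <0,penn>, <1,layers>, <0,state>]
    }
}
```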
Slide 15: Lucene Index Files: Term Info Index (.tii)

    4, <football,1> <penn,3> <layers,2> <state,1>
Slide 16: Lucene Index Files: Frequency File (.frq)

    <<2, 2, 3> <3> <5> <3, 3>>

Term frequency can be obtained from this file.
Slide 17: Lucene Index Files: Position File (.prx)

    <<3, 64> <1>> <<1> <0>> <<0> <2>> <<2> <13>>
Slide 18: Query Process in Lucene

[Diagram: a query is first resolved against the field info and the term info index, both held in memory with constant-time access; from there, lookups proceed by random file access into the term dictionary (.tis), the frequency file (.frq), and the position file (.prx), each step in constant time.]
Slide 19: Search Lucene's Index (Step 1)
- Construct a query (automatic):

    import org.apache.lucene.search.Query;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public void formQuery(String querystring)
    {
        // "field" is the default field to search
        QueryParser qp = new QueryParser(field, new StandardAnalyzer());
        Query query = qp.parse(querystring);
    }
Slide 20: Search Lucene's Index (Step 1)
- Types of query
  - Boolean: IST441 Giles, IST441 OR Giles, java AND NOT SUN
  - wildcard: nu?ch, nutc*
  - phrase: "JAVA TOMCAT"
  - proximity: "lucene nutch"~10
  - fuzzy: roam~ matches roams and foam
  - date range
  - ...
Slide 21: Search Lucene's Index (Step 2)
- Search the index:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.*;

    public void searchIdx(String idxPath)
    {
        Directory fsDir = FSDirectory.getDirectory(idxPath, false);
        IndexSearcher is = new IndexSearcher(fsDir);
        // ...
        Hits hits = is.search(query);
        // ...
    }
Slide 22: Search Lucene's Index (Step 3)
- Display the results:

    for (int i = 0; i < hits.length(); i++)
    {
        Document doc = hits.doc(i);
        // show your results
        System.out.println("id: " + doc.get("id"));
    }
Slide 23: Default Scoring Function
- Similarity:

    score(Q,D) = coord(Q,D) * queryNorm(Q)
                 * Σ_{t in Q} ( tf(t in D) * idf(t)² * t.getBoost() * norm(D) )

- Question: what type of IR model does Lucene use?
- Factors
  - term-based factors
    - tf(t in D): term frequency of term t in document D
      - default implementation: tf(t in D) = frequency^½
    - idf(t): inverse document frequency of term t in the entire corpus
      - default implementation: idf(t) = ln(N/(n_t + 1)) + 1
Slide 24: Default Scoring Function
- coord(Q,D) = (overlap between Q and D) / (maximum overlap)
  - Maximum overlap is the maximum possible length of overlap between Q and D
- queryNorm(Q) = 1 / (sum of squared weights)^½
  - sum of squared weights = q.getBoost()² * Σ_{t in Q} ( idf(t) * t.getBoost() )²
  - If t.getBoost() = 1 and q.getBoost() = 1,
    then sum of squared weights = Σ_{t in Q} idf(t)²,
    and thus queryNorm(Q) = 1 / (Σ_{t in Q} idf(t)²)^½
- norm(D) = 1 / (number of terms)^½ (this is normalization by the total number of terms in a document; "number of terms" is the total number of terms appearing in document D)
Slide 25: Example
- D1: "hello, please say hello to him."
- D2: "say goodbye"
- Q: "you say hello"
- coord(Q,D) = (overlap between Q and D) / (maximum overlap)
  - coord(Q,D1) = 2/3, coord(Q,D2) = 1/2
- queryNorm(Q) = 1 / (sum of squared weights)^½
  - sum of squared weights = q.getBoost()² * Σ_{t in Q} ( idf(t) * t.getBoost() )²
  - with t.getBoost() = 1 and q.getBoost() = 1:
    sum of squared weights = Σ_{t in Q} idf(t)²
  - queryNorm(Q) = 1 / (0.5945² + 1²)^½ = 0.8596
- tf(t in D) = frequency^½
  - tf(you,D1) = 0, tf(say,D1) = 1, tf(hello,D1) = 2^½ = 1.4142
  - tf(you,D2) = 0, tf(say,D2) = 1, tf(hello,D2) = 0
- idf(t) = ln(N/(n_t + 1)) + 1
  - idf(you) = 0, idf(say) = ln(2/(2+1)) + 1 = 0.5945, idf(hello) = ln(2/(1+1)) + 1 = 1
- norm(D) = 1 / (number of terms)^½
  - norm(D1) = 1/6^½ = 0.4082, norm(D2) = 1/2^½ = 0.7071
- Score(Q,D1) = 2/3 * 0.8596 * (1 * 0.5945² + 1.4142 * 1²) * 0.4082 = 0.4135
- Score(Q,D2) = 1/2 * 0.8596 * (1 * 0.5945²) * 0.7071 = 0.1074
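The arithmetic of the worked example can be checked with a short program that plugs the slide's simplified formulas together (tf = frequency^½, idf = ln(N/(n+1)) + 1, norm(D) = 1/|D|^½). The class and method names are my own; this recomputes the slide's numbers, it is not Lucene's Similarity class.

```java
// Recompute the example: Q = "you say hello",
// D1 = "hello, please say hello to him." (6 terms), D2 = "say goodbye" (2 terms).
public class ScoringExample {
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1;
    }

    public static double scoreD1() {
        double idfSay = idf(2, 2);   // "say" occurs in both docs: 0.5945
        double idfHello = idf(2, 1); // "hello" occurs in one doc: 1
        double queryNorm = 1 / Math.sqrt(idfSay * idfSay + idfHello * idfHello);
        double sum = 1.0 * idfSay * idfSay                   // tf(say,D1) = 1
                   + Math.sqrt(2) * idfHello * idfHello;     // tf(hello,D1) = 2^0.5
        return (2.0 / 3) * queryNorm * sum * (1 / Math.sqrt(6)); // coord * norm(D1)
    }

    public static double scoreD2() {
        double idfSay = idf(2, 2);
        double idfHello = idf(2, 1);
        double queryNorm = 1 / Math.sqrt(idfSay * idfSay + idfHello * idfHello);
        double sum = 1.0 * idfSay * idfSay;                  // only "say" matches D2
        return (1.0 / 2) * queryNorm * sum * (1 / Math.sqrt(2)); // coord * norm(D2)
    }

    public static void main(String[] args) {
        System.out.printf("Score(Q,D1) = %.4f%n", scoreD1()); // 0.4135
        System.out.printf("Score(Q,D2) = %.4f%n", scoreD2()); // 0.1074
    }
}
```

D1 outranks D2 mainly because it matches two query terms ("say" and "hello") and "hello" is the rarer, higher-idf term, despite D1's longer length pulling its norm down.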