Title: Document Indexing and Scoring in Lucene and Nutch

Slide 1: Document Indexing and Scoring in Lucene and Nutch
- IST 441, Spring 2009
- Instructor: Dr. C. Lee Giles
- Presenter: Saurabh Kataria
Slide 2: Outline
- Architecture of Lucene and Nutch
- Indexing in Lucene
- Searching in Lucene
- Lucene's scoring function
Slide 3: Lucene's Open Architecture

[Diagram: crawlers (an FS crawler over the file system, Larm over the WWW, an IMAP server source) feed parsers (TXT parser, PDF parser, HTML parser) that produce Lucene documents; an analyzer (StopAnalyzer, StandardAnalyzer, or a language-specific CN/DE analyzer) tokenizes them for the indexer, which builds the index; searchers answer queries against that index. Crawling, parsing, and searching sit outside Lucene itself; Lucene provides analysis, indexing, and search.]
Slide 4: Nutch's Architecture
- Courtesy of Doug Cutting's presentation slides at WWW 2004
Slide 5: Nutch's Architecture
- Searcher: Given a query, it must quickly find a small relevant subset of a corpus of documents, then present them. Finding a large relevant subset is normally done with an inverted index of the corpus; ranking within that set produces the most relevant documents, which then must be summarized for display.
- Indexer: Creates the inverted index from which the searcher extracts results. It uses Lucene for storing indexes.
- Web DB: Stores the document contents for indexing and later summarization by the searcher, along with information such as the link structure of the document space and the time each document was last fetched.
- Fetcher: Requests web pages, parses them, and extracts links from them. Nutch's robot has been written entirely from scratch.
Slide 6: Lucene's Index (Conceptual)

[Diagram: an Index contains Documents; each Document contains Fields; each Field is a (Name, Value) pair.]
Slide 7: Create a Lucene Index (Step 1)
- Create a Lucene document and add fields:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public void createDoc(String title, String body)
    {
        Document doc = new Document();
        doc.add(new Field("text", body,
                Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field("title", title,
                Field.Store.YES, Field.Index.TOKENIZED));
    }
Slide 8: Create a Lucene Index (Step 2)
- Create an analyzer
- Options:
  - WhitespaceAnalyzer
    - divides text at whitespace
  - SimpleAnalyzer
    - divides text at non-letters
    - converts to lower case
  - StopAnalyzer
    - same as SimpleAnalyzer, but also removes stop words
  - StandardAnalyzer
    - good for most European languages
    - removes stop words
    - converts to lower case
Slide 9: Create a Lucene Index (Step 2)
- An example of analyzing a document
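To make the analyzer behavior concrete, here is a plain-Java sketch of what SimpleAnalyzer does to a piece of text: split at non-letter characters and lower-case each token. This is illustrative only, not Lucene's actual implementation (the real analyzers operate on TokenStreams); the class name is my own.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of SimpleAnalyzer's behavior: divide text at non-letters
// and convert tokens to lower case. Not Lucene code.
public class SimpleAnalyzerSketch {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c)); // letters extend the token
            } else if (current.length() > 0) {
                tokens.add(current.toString());           // non-letter ends the token
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        // Digits and punctuation are dropped, letters are lower-cased:
        System.out.println(tokenize("Penn State Football, 2009!"));
        // [penn, state, football]
    }
}
```

Note that a StopAnalyzer would additionally drop stop words such as "to" from the token list, and a WhitespaceAnalyzer would keep "Football," as a single token.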
Slide 10: Create a Lucene Index (Step 3)
- Create an index writer and add the Lucene document to the index:

    import java.io.IOException;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.analysis.SimpleAnalyzer;

    public void writeDoc(Document doc, String idxPath)
    {
        try {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.getDirectory(idxPath, true),
                    new SimpleAnalyzer(), true);
            writer.addDocument(doc);
            writer.close();
        } catch (IOException exp) {
            System.out.println("I/O Error!");
        }
    }
Slide 11: Lucene Index Behind the Scenes
- Inverted Index (Inverted File)

[Diagram: Doc 1 ("Penn State Football") and Doc 2 ("Football players") are inverted into a posting table that maps each term, e.g. "football" or "State", to the documents containing it.]
Slide 12: Posting Table
- The posting table is a fast look-up mechanism
  - Key: word
  - Value: posting id and satellite data (df, offset, ...)
- Lucene implements the posting table with Java's hash table
  - Instantiated from java.util.Hashtable
  - The hash function depends on the JVM
    - hc2 = hc1 * 31 + nextChar
- Posting table usage
  - Indexing: insertion (new terms), update (existing terms)
  - Searching: lookup, and construction of the document vector
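The idea above can be sketched with a hash map from term to posting list. This is a minimal illustration of the lookup mechanism, not Lucene's actual data structure (real posting lists also carry frequencies and offsets); the class and method names are my own. Java's `String.hashCode` is exactly the accumulation shown on the slide, h = 31*h + nextChar.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal posting-table sketch: term -> list of document ids.
public class PostingTable {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Indexing: insert a new term or update an existing one.
    public void add(String term, int docId) {
        postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
    }

    // Searching: constant-expected-time lookup by term.
    public List<Integer> lookup(String term) {
        return postings.getOrDefault(term, List.of());
    }

    // df(t): number of postings for the term.
    public int df(String term) {
        return lookup(term).size();
    }

    public static void main(String[] args) {
        PostingTable table = new PostingTable();
        // Doc 1: "penn state football"; Doc 2: "football players"
        table.add("penn", 1); table.add("state", 1); table.add("football", 1);
        table.add("football", 2); table.add("players", 2);
        System.out.println(table.lookup("football")); // [1, 2]
        System.out.println(table.df("state"));        // 1
    }
}
```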
Slide 13: Lucene Index Files: Field Infos File (.fnm)

    1, <content, 0x01>
Slide 14: Lucene Index Files: Term Dictionary File (.tis)

    4, <<0,football,1>,2> <<0,penn,1>,1>
       <<1,layers,1>,1> <<0,state,1>,2>

Document frequency can be obtained from this file.
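Each term dictionary entry stores the term as a (shared-prefix length, suffix) pair relative to the previous term, which is why "players" appears as <1,layers> when it follows "penn". A sketch of that prefix compression, under the assumption that terms arrive in sorted order (class and method names are my own):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of term-dictionary prefix compression: each sorted term is
// encoded as <length of prefix shared with previous term, suffix>.
public class PrefixCompression {
    public static List<String> compress(List<String> sortedTerms) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            // Count leading characters shared with the previous term.
            int shared = 0;
            int max = Math.min(prev.length(), term.length());
            while (shared < max && prev.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            out.add("<" + shared + "," + term.substring(shared) + ">");
            prev = term;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(compress(List.of("football", "penn", "players", "state")));
        // [<0,football>, <0,penn>, <1,layers>, <0,state>]
    }
}
```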
Slide 15: Lucene Index Files: Term Info Index (.tii)

    4, <football,1> <penn,3> <layers,2> <state,1>
Slide 16: Lucene Index Files: Frequency File (.frq)

    <<2, 2, 3> <3> <5> <3, 3>>

Term frequency can be obtained from this file.
Slide 17: Lucene Index Files: Position File (.prx)

    <<3, 64> <1>> <<1> <0>> <<0> <2>> <<2> <13>>
Slide 18: Query Process in Lucene

[Diagram: a query is first resolved against the field info and the term info index, both held in memory with constant-time access; from there, lookups proceed by random file access into the term dictionary (.tis), the frequency file (.frq), and the position file (.prx), each step in constant time.]
Slide 19: Search Lucene's Index (Step 1)
- Construct a query (automatic):

    import org.apache.lucene.search.Query;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public void formQuery(String querystring)
    {
        // "field" is the default field to search
        QueryParser qp = new QueryParser(field, new StandardAnalyzer());
        Query query = qp.parse(querystring);
    }
Slide 20: Search Lucene's Index (Step 1)
- Types of query
  - Boolean: IST441 Giles, IST441 OR Giles, java AND NOT SUN
  - wildcard: nu?ch, nutc*
  - phrase: "JAVA TOMCAT"
  - proximity: "lucene nutch"~10
  - fuzzy: roam~ matches roams and foam
  - date range
  - ...
Slide 21: Search Lucene's Index (Step 2)
- Search the index:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.*;

    public void searchIdx(String idxPath)
    {
        Directory fsDir = FSDirectory.getDirectory(idxPath, false);
        IndexSearcher is = new IndexSearcher(fsDir);
        // ...
        Hits hits = is.search(query);
        // ...
    }
Slide 22: Search Lucene's Index (Step 3)
- Display the results:

    for (int i = 0; i < hits.length(); i++)
    {
        Document doc = hits.doc(i);
        // show your results
        System.out.println("id: " + doc.get("id"));
    }
Slide 23: Default Scoring Function
- Similarity:

    score(Q,D) = coord(Q,D) * queryNorm(Q)
                 * Σ_{t in Q} ( tf(t in D) * idf(t)² * t.getBoost() * norm(D) )

- Question: what type of IR model does Lucene use?
- Factors
  - term-based factors
    - tf(t in D): term frequency of term t in document D
      - default implementation: tf(t in D) = frequency^½
    - idf(t): inverse document frequency of term t in the entire corpus
      - default implementation: idf(t) = ln(N/(n_t + 1)) + 1
Slide 24: Default Scoring Function
- coord(Q,D) = (overlap between Q and D) / (maximum overlap)
  - Maximum overlap is the maximum possible length of overlap between Q and D
- queryNorm(Q) = 1 / (sum of squared weights)^½
  - sum of squared weights = q.getBoost()² * Σ_{t in Q} ( idf(t) * t.getBoost() )²
  - If t.getBoost() = 1 and q.getBoost() = 1,
    then sum of squared weights = Σ_{t in Q} idf(t)²,
    and thus queryNorm(Q) = 1 / (Σ_{t in Q} idf(t)²)^½
- norm(D) = 1 / (number of terms)^½ (this is normalization by the total number of terms in a document; "number of terms" is the total number of terms appearing in document D)
Slide 25: Example
- D1: "hello, please say hello to him."
- D2: "say goodbye"
- Q: "you say hello"
- coord(Q,D) = (overlap between Q and D) / (maximum overlap)
  - coord(Q,D1) = 2/3, coord(Q,D2) = 1/2
- queryNorm(Q) = 1 / (sum of squared weights)^½
  - sum of squared weights = q.getBoost()² * Σ_{t in Q} ( idf(t) * t.getBoost() )²
  - with t.getBoost() = 1 and q.getBoost() = 1:
    sum of squared weights = Σ_{t in Q} idf(t)²
  - queryNorm(Q) = 1 / (0.5945² + 1²)^½ = 0.8596
- tf(t in D) = frequency^½
  - tf(you,D1) = 0, tf(say,D1) = 1, tf(hello,D1) = 2^½ = 1.4142
  - tf(you,D2) = 0, tf(say,D2) = 1, tf(hello,D2) = 0
- idf(t) = ln(N/(n_t + 1)) + 1
  - idf(you) = 0, idf(say) = ln(2/(2+1)) + 1 = 0.5945, idf(hello) = ln(2/(1+1)) + 1 = 1
- norm(D) = 1 / (number of terms)^½
  - norm(D1) = 1/6^½ = 0.4082, norm(D2) = 1/2^½ = 0.7071
- Score(Q,D1) = 2/3 * 0.8596 * (1 * 0.5945² + 1.4142 * 1²) * 0.4082 = 0.4135
- Score(Q,D2) = 1/2 * 0.8596 * (1 * 0.5945²) * 0.7071 = 0.1074
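The arithmetic of the worked example can be checked with a short program that plugs the slide's simplified formulas together (tf = frequency^½, idf = ln(N/(n+1)) + 1, norm(D) = 1/|D|^½). The class and method names are my own; this recomputes the slide's numbers, it is not Lucene's Similarity class.

```java
// Recompute the example: Q = "you say hello",
// D1 = "hello, please say hello to him." (6 terms), D2 = "say goodbye" (2 terms).
public class ScoringExample {
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1;
    }

    public static double scoreD1() {
        double idfSay = idf(2, 2);   // "say" occurs in both docs: 0.5945
        double idfHello = idf(2, 1); // "hello" occurs in one doc: 1
        double queryNorm = 1 / Math.sqrt(idfSay * idfSay + idfHello * idfHello);
        double sum = 1.0 * idfSay * idfSay                   // tf(say,D1) = 1
                   + Math.sqrt(2) * idfHello * idfHello;     // tf(hello,D1) = 2^0.5
        return (2.0 / 3) * queryNorm * sum * (1 / Math.sqrt(6)); // coord * norm(D1)
    }

    public static double scoreD2() {
        double idfSay = idf(2, 2);
        double idfHello = idf(2, 1);
        double queryNorm = 1 / Math.sqrt(idfSay * idfSay + idfHello * idfHello);
        double sum = 1.0 * idfSay * idfSay;                  // only "say" matches D2
        return (1.0 / 2) * queryNorm * sum * (1 / Math.sqrt(2)); // coord * norm(D2)
    }

    public static void main(String[] args) {
        System.out.printf("Score(Q,D1) = %.4f%n", scoreD1()); // 0.4135
        System.out.printf("Score(Q,D2) = %.4f%n", scoreD2()); // 0.1074
    }
}
```

D1 outranks D2 mainly because it matches two query terms ("say" and "hello") and "hello" is the rarer, higher-idf term, despite D1's longer length pulling its norm down.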