Document Indexing and Scoring in Lucene and Nutch

Provided by: andyh67
1
Document Indexing and Scoring in Lucene and Nutch
  • IST 441 Spring 2009
  • Instructor: Dr. C. Lee Giles
  • Presenter: Saurabh Kataria

2
Outline
  • Architecture of Lucene and Nutch
  • Indexing in Lucene
  • Searching in Lucene
  • Lucene's scoring function

3
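Lucene's Open Architecture
[Figure: Lucene's open architecture, in four stages. Crawling: FS Crawler (File System), Larm (WWW), IMAP Server. Parsing: TXT, PDF, and HTML parsers turn PDF, HTML, DOC, and TXT files into Lucene Documents. Indexing: indexers pass the documents through an analyzer (Stop Analyzer, CN/DE Analyzer, Standard Analyzer) into the Lucene Index. Searching: searchers query the Index.]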
4
Nutch's architecture
  • Courtesy of Doug Cutting's presentation slides at
    WWW 2004

5
Nutch's architecture
  • Searcher: Given a query, it must quickly find a
    small relevant subset of a corpus of documents,
    then present them. Finding a large relevant
    subset is normally done with an inverted index of
    the corpus; ranking within that set then produces
    the most relevant documents, which must be
    summarized for display.
  • Indexer: Creates the inverted index from which
    the searcher extracts results. It uses Lucene
    for storing indexes.
  • Web DB: Stores the document contents for indexing
    and later summarization by the searcher, along
    with information such as the link structure of
    the document space and the time each document was
    last fetched.
  • Fetcher: Requests web pages, parses them, and
    extracts links from them. Nutch's robot has been
    written entirely from scratch.

6
Lucene's index (conceptual)
[Figure: an Index contains Documents; each Document contains Fields; each Field is a Name/Value pair.]
7
Create a Lucene index (step 1)
  • Create a Lucene document and add fields
  • import org.apache.lucene.document.Document;
  • import org.apache.lucene.document.Field;
  • public void createDoc(String title, String body) {
  • Document doc = new Document();
  • doc.add(new Field("text", body,
    Field.Store.NO, Field.Index.TOKENIZED));   // indexed, not stored
  • doc.add(new Field("title", title,
    Field.Store.YES, Field.Index.TOKENIZED));  // indexed and stored
  • }

8
Create a Lucene index (step 2)
  • Create an Analyzer
  • Options:
  • WhitespaceAnalyzer: divides text at whitespace
  • SimpleAnalyzer: divides text at non-letters;
    converts to lower case
  • StopAnalyzer: SimpleAnalyzer that also removes
    stop words
  • StandardAnalyzer: good for most European
    languages; removes stop words; converts to
    lower case

9
Create a Lucene index (step 2)
  • An example of analyzing a document
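
The analyzer options from step 2 can be imitated in plain Java to see how the same text comes out under each one. This is only a sketch of the behaviour described on the slide, not Lucene's own classes: the class name `AnalyzerSketch`, its methods, and the small stop list are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java imitation of three analyzer behaviours (not Lucene's classes).
class AnalyzerSketch {
    // Hypothetical stop list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "to", "of"));

    // WhitespaceAnalyzer-like: split at whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.trim().split("\\s+")) {
            if (!tok.isEmpty()) out.add(tok);
        }
        return out;
    }

    // SimpleAnalyzer-like: split at non-letters, then convert to lower case.
    static List<String> simple(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.split("[^\\p{L}]+")) {
            if (!tok.isEmpty()) out.add(tok.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // StopAnalyzer-like: SimpleAnalyzer output with stop words removed.
    static List<String> stop(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : simple(text)) {
            if (!STOP_WORDS.contains(tok)) out.add(tok);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Say hello to the Penn-State team";
        System.out.println(whitespace(text)); // [Say, hello, to, the, Penn-State, team]
        System.out.println(simple(text));     // [say, hello, to, the, penn, state, team]
        System.out.println(stop(text));       // [say, hello, penn, state, team]
    }
}
```

The choice of analyzer changes which terms end up in the index, so the same analyzer is normally used at indexing time and at query time.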

10
Create a Lucene index (step 3)
  • Create an index writer and add the Lucene
    document to the index
  • import java.io.IOException;
  • import org.apache.lucene.analysis.SimpleAnalyzer;
  • import org.apache.lucene.index.IndexWriter;
  • import org.apache.lucene.store.FSDirectory;
  • public void writeDoc(Document doc, String
    idxPath) {   // e.g. idxPath = "/data/index"
  • try {
  • IndexWriter writer = new IndexWriter(FSDirectory
    .getDirectory(idxPath, true), new
    SimpleAnalyzer(), true);   // true: create a new index
  • writer.addDocument(doc);
  • writer.close();
  • } catch (IOException exp) {
  • System.out.println("I/O Error!");
  • }
  • }

11
Lucene Index Behind the Scenes
  • Inverted Index (Inverted File)

  • Doc 1: Penn State Football football
  • Doc 2: Football players State
  • [Figure: posting table built from the two documents]
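
As a rough illustration of what an inverted index holds for these two documents, the sketch below maps each lower-cased term to the list of document ids that contain it. It is plain Java, not Lucene's implementation; the class name `InvertedIndexSketch` and the whitespace tokenization are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Minimal inverted index: term -> ids of the documents containing it.
class InvertedIndexSketch {
    static Map<String, List<Integer>> build(String[] docs) {
        Map<String, List<Integer>> postings = new TreeMap<>(); // sorted by term
        for (int i = 0; i < docs.length; i++) {
            int docId = i + 1; // documents are numbered from 1, as on the slide
            for (String tok : docs[i].toLowerCase(Locale.ROOT).split("\\s+")) {
                if (tok.isEmpty()) continue;
                List<Integer> ids = postings.computeIfAbsent(tok, k -> new ArrayList<>());
                // Documents are processed in order, so a repeat can only be the last entry.
                if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) ids.add(docId);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        String[] docs = { "Penn State Football football", "Football players State" };
        System.out.println(build(docs));
        // {football=[1, 2], penn=[1], players=[2], state=[1, 2]}
    }
}
```

The length of each posting list is the term's document frequency: 2 for "football" and "state", 1 for "penn" and "players".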
12
Posting table
  • A posting table is a fast look-up mechanism
  • Key: word
  • Value: posting id, satellite data (df, offset,
    ...)
  • Lucene implements the posting table with Java's
    hash table
  • Objectified from java.util.Hashtable
  • Hash function depends on the JVM
  • hc2 = hc1 * 31 + nextChar
  • Posting table usage
  • Indexing: insertion (new terms), update (existing
    terms)
  • Searching: lookup, and construct the document
    vector
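
The recurrence hc2 = hc1 * 31 + nextChar can be checked directly: applied character by character, starting from 0, it gives the same value as `String.hashCode()`, whose documented formula is s[0]*31^(n-1) + ... + s[n-1]. The class name `StringHashSketch` below is hypothetical; the sketch only shows the hash, not how Lucene stores postings.

```java
// Character-by-character hash following hc2 = hc1 * 31 + nextChar.
class StringHashSketch {
    static int hash(String term) {
        int hc = 0;
        for (int i = 0; i < term.length(); i++) {
            hc = hc * 31 + term.charAt(i); // int overflow wraps, as in String.hashCode()
        }
        return hc;
    }

    public static void main(String[] args) {
        String term = "football";
        // Both lines print the same number.
        System.out.println(hash(term));
        System.out.println(term.hashCode());
    }
}
```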

13
Lucene Index Files: Field infos file (.fnm)
1, <content, 0x01>
14
Lucene Index Files: Term Dictionary file (.tis)
4, <<0,football,1>,2> <<0,penn,1>,1>
<<1,layers,1>,1> <<0,state,1>,2>
Document frequency can be obtained from this file.
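
The entry <<1,layers,1>,1> in the listing above appears to store "players" as a shared-prefix length (1 character in common with the preceding term "penn") plus the remaining suffix "layers". A minimal sketch of that prefix coding follows, assuming the terms arrive in sorted order; the class name `PrefixCodingSketch` is hypothetical, and the field number and document frequency are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Prefix (front) coding of a sorted term list: <sharedPrefixLength,suffix>.
class PrefixCodingSketch {
    static List<String> encode(String[] sortedTerms) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            int p = 0;
            int max = Math.min(prev.length(), term.length());
            // Count the leading characters shared with the previous term.
            while (p < max && prev.charAt(p) == term.charAt(p)) p++;
            out.add("<" + p + "," + term.substring(p) + ">");
            prev = term;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] terms = { "football", "penn", "players", "state" };
        System.out.println(encode(terms));
        // [<0,football>, <0,penn>, <1,layers>, <0,state>]
    }
}
```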
15
Lucene Index Files: Term Info index (.tii)
4, <football,1> <penn,3> <layers,2> <state,1>
16
Lucene Index Files: Frequency file (.frq)
<<2, 2, 3> <3> <5> <3, 3>>
Term frequency can be obtained from this file.
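
The numbers in the frequency listing can be decoded under one assumption, taken from the older (pre-4.0) Lucene file-format description as I understand it: each entry starts with a value whose half (integer division) is the gap from the previous document number, and whose lowest bit says whether the frequency is 1 (odd) or follows as the next value (even). The class name `FreqDecodeSketch` is hypothetical, and the real file stores these values as variable-length integers rather than an `int[]`.

```java
import java.util.ArrayList;
import java.util.List;

// Decodes one term's frequency entries into {docId, freq} pairs (assumed encoding).
class FreqDecodeSketch {
    static List<int[]> decode(int[] values) {
        List<int[]> out = new ArrayList<>();
        int doc = 0;
        int i = 0;
        while (i < values.length) {
            int docDelta = values[i++];
            doc += docDelta >>> 1;          // upper bits: gap to the previous doc id
            int freq;
            if ((docDelta & 1) == 1) {
                freq = 1;                   // odd: the term occurs once
            } else {
                if (i >= values.length) {
                    throw new IllegalArgumentException("missing frequency value");
                }
                freq = values[i++];         // even: explicit frequency follows
            }
            out.add(new int[] { doc, freq });
        }
        return out;
    }

    public static void main(String[] args) {
        // "football" on the slide: <2, 2, 3>
        for (int[] e : decode(new int[] { 2, 2, 3 })) {
            System.out.println("doc " + e[0] + ", freq " + e[1]);
        }
        // doc 1, freq 2
        // doc 2, freq 1
    }
}
```

Read this way, the four entries on the slide agree with the two sample documents: "football" occurs twice in Doc 1 and once in Doc 2, "penn" once in Doc 1, "players" once in Doc 2, and "state" once in each.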
17
Lucene Index Files: Position file (.prx)
<<3, 64> <1>> <<1> <0>> <<0> <2>> <<2> <13>>
18
Query Process in Lucene
[Figure: a Query passes through the following structures in turn]
  • Field info (in memory): constant time
  • Term Info Index (in memory): constant time
  • Term Dictionary (random file access): constant
    time
  • Frequency File (random file access): constant
    time
  • Position File (random file access): constant
    time
19
Search Lucene's index (step 1)
  • Construct a query (automatic)
  • import org.apache.lucene.analysis.standard.StandardAnalyzer;
  • import org.apache.lucene.queryParser.ParseException;
  • import org.apache.lucene.queryParser.QueryParser;
  • import org.apache.lucene.search.Query;
  • public void formQuery(String querystring)
    throws ParseException {
  • QueryParser qp = new QueryParser(field, new
    StandardAnalyzer());   // field: default field to search
  • Query query = qp.parse(querystring);
  • }

20
Search Lucene's index (step 1)
  • Types of query
  • Boolean: IST441 Giles; IST441 OR Giles; java
    AND NOT SUN
  • wildcard: nu?ch, nutc*
  • phrase: "JAVA TOMCAT"
  • proximity: "lucene nutch"~10
  • fuzzy: roam~ matches roams and foam
  • date range
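
Fuzzy matching of this kind is commonly based on edit distance, and Lucene's fuzzy query is documented as using the Levenshtein measure. The sketch below is a plain-Java Levenshtein distance, not Lucene's code; the class name `EditDistanceSketch` is hypothetical. It shows why "roam" is close to both "roams" and "foam": each is one edit away.

```java
// Levenshtein edit distance with two rolling rows.
class EditDistanceSketch {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // "" -> b[0..j): j insertions
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;                                    // a[0..i) -> "": i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1,  // insertion
                                           prev[j] + 1),    // deletion
                                  prev[j - 1] + cost);      // match or substitution
            }
            int[] tmp = prev; prev = cur; cur = tmp;        // reuse the two rows
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("roam", "roams")); // 1
        System.out.println(distance("roam", "foam"));  // 1
    }
}
```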

21
Search Lucene's index (step 2)
  • Search the index
  • import java.io.IOException;
  • import org.apache.lucene.document.Document;
  • import org.apache.lucene.search.*;
  • import org.apache.lucene.store.*;
  • public void searchIdx(String idxPath)
    throws IOException {
  • Directory fsDir = FSDirectory.getDirectory(idxPath,
    false);   // false: open an existing index
  • IndexSearcher is = new IndexSearcher(fsDir);
  • Hits hits = is.search(query);   // query built in step 1
  • }

22
Search Lucene's index (step 3)
  • Display the results
  • for (int i = 0; i < hits.length(); i++) {
  • Document doc = hits.doc(i);
  • // show your results
  • System.out.println("id=" + doc.get("id"));
  • }

23
Default Scoring Function
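  • Similarity
  • $\mathrm{score}(Q,D) = \mathrm{coord}(Q,D) \cdot \mathrm{queryNorm}(Q) \cdot \sum_{t \in Q} \left( \mathrm{tf}(t \text{ in } D) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(D) \right)$
  • Question
  • What type of IR model does Lucene use?
  • Factors
  • Term-based factors
  • tf(t in D): term frequency of term t in document
    D
  • default implementation: $\mathrm{tf}(t \text{ in } D) = \mathrm{frequency}^{1/2}$
  • idf(t): inverse document frequency of term t in
    the entire corpus
  • default implementation: $\mathrm{idf}(t) = \ln\left( N / (n_j + 1) \right) + 1$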

24
Default Scoring Function
  • $\mathrm{coord}(Q,D) = \dfrac{\text{overlap between } Q \text{ and } D}{\text{maximum overlap}}$
  • Maximum overlap is the maximum possible length of
    overlap between Q and D
  • $\mathrm{queryNorm}(Q) = 1 / (\text{sum of squared weights})^{1/2}$
  • $\text{sum of squared weights} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \in Q} \left( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \right)^2$
  • If $t.\mathrm{getBoost}() = 1$ and $q.\mathrm{getBoost}() = 1$,
  • then $\text{sum of squared weights} = \sum_{t \in Q} \mathrm{idf}(t)^2$,
  • thus $\mathrm{queryNorm}(Q) = 1 / \left( \sum_{t \in Q} \mathrm{idf}(t)^2 \right)^{1/2}$
  • $\mathrm{norm}(D) = 1 / (\text{number of terms})^{1/2}$ (This
    normalizes by the total number of terms in a
    document; the number of terms is the total number
    of terms that appear in document D.)

25
Example
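  • D1: hello, please say hello to him.
  • D2: say goodbye
  • Q: you say hello
  • $\mathrm{coord}(Q,D) = \dfrac{\text{overlap between } Q \text{ and } D}{\text{maximum overlap}}$
  • $\mathrm{coord}(Q,D_1) = 2/3$, $\mathrm{coord}(Q,D_2) = 1/2$
  • $\mathrm{queryNorm}(Q) = 1 / (\text{sum of squared weights})^{1/2}$
  • $\text{sum of squared weights} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \in Q} \left( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \right)^2$
  • $t.\mathrm{getBoost}() = 1$, $q.\mathrm{getBoost}() = 1$
  • $\text{sum of squared weights} = \sum_{t \in Q} \mathrm{idf}(t)^2$
  • $\mathrm{queryNorm}(Q) = 1 / (0.5945^2 + 1^2)^{1/2} = 0.8596$
  • $\mathrm{tf}(t \text{ in } D) = \mathrm{frequency}^{1/2}$
  • $\mathrm{tf}(\text{you},D_1) = 0$, $\mathrm{tf}(\text{say},D_1) = 1$, $\mathrm{tf}(\text{hello},D_1) = 2^{1/2} = 1.4142$
  • $\mathrm{tf}(\text{you},D_2) = 0$, $\mathrm{tf}(\text{say},D_2) = 1$, $\mathrm{tf}(\text{hello},D_2) = 0$
  • $\mathrm{idf}(t) = \ln\left( N / (n_j + 1) \right) + 1$
  • $\mathrm{idf}(\text{you}) = 0$, $\mathrm{idf}(\text{say}) = \ln(2/(2+1)) + 1 = 0.5945$, $\mathrm{idf}(\text{hello}) = \ln(2/(1+1)) + 1 = 1$
  • $\mathrm{norm}(D) = 1 / (\text{number of terms})^{1/2}$
  • $\mathrm{norm}(D_1) = 1/6^{1/2} = 0.4082$, $\mathrm{norm}(D_2) = 1/2^{1/2} = 0.7071$
  • $\mathrm{Score}(Q,D_1) = 2/3 \cdot 0.8596 \cdot (1 \cdot 0.5945^2 + 1.4142 \cdot 1^2) \cdot 0.4082 = 0.4135$
  • $\mathrm{Score}(Q,D_2) = 1/2 \cdot 0.8596 \cdot (1 \cdot 0.5945^2) \cdot 0.7071 = 0.1074$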
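
The worked example can be reproduced with a few lines of plain Java. This sketch follows the conventions of the example rather than Lucene's actual Similarity class: all boosts are 1, maximum overlap is taken as the smaller of the query length and the document length, and idf is set to 0 for a term that occurs in no document. The class name `ScoringSketch` and its methods are hypothetical, and query terms are assumed to be distinct.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Recomputes Score(Q, D) for the slide's example with all boosts equal to 1.
class ScoringSketch {
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // idf(t) = ln(N / (n + 1)) + 1; taken as 0 when no document contains t (as in the example).
    static double idf(String term, List<List<String>> corpus) {
        int n = 0;
        for (List<String> d : corpus) if (d.contains(term)) n++;
        return n == 0 ? 0.0 : Math.log((double) corpus.size() / (n + 1)) + 1.0;
    }

    static double score(List<String> query, List<String> doc, List<List<String>> corpus) {
        double sumOfSquaredWeights = 0.0;
        double sum = 0.0;
        int overlap = 0;
        for (String t : query) {
            double idf = idf(t, corpus);
            sumOfSquaredWeights += idf * idf;
            int freq = Collections.frequency(doc, t);
            if (freq > 0) overlap++;
            sum += Math.sqrt(freq) * idf * idf;           // tf(t in D) * idf(t)^2
        }
        // Assumption from the example: maximum overlap = min(|Q|, |D|).
        double coord = (double) overlap / Math.min(query.size(), doc.size());
        double queryNorm = 1.0 / Math.sqrt(sumOfSquaredWeights);
        double norm = 1.0 / Math.sqrt(doc.size());        // 1 / sqrt(number of terms)
        return coord * queryNorm * sum * norm;
    }

    public static void main(String[] args) {
        List<String> d1 = tokens("hello, please say hello to him.");
        List<String> d2 = tokens("say goodbye");
        List<String> q = tokens("you say hello");
        List<List<String>> corpus = Arrays.asList(d1, d2);
        System.out.println(String.format(Locale.ROOT, "%.4f", score(q, d1, corpus))); // 0.4135
        System.out.println(String.format(Locale.ROOT, "%.4f", score(q, d2, corpus))); // 0.1074
    }
}
```

Because norm(D) is the same for every term, it is factored out of the sum here, which matches how the two Score lines above are written.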