Nutch Search Engine Tool

Slides: 17
Provided by: ajayn

1
  • Nutch Search Engine Tool

2
Nutch overview
  • A full-fledged web search engine
  • Functionalities of Nutch
      • Internet and intranet crawling
      • Parsing different document formats (PDF, HTML,
        XML, JS, DOC, PPT, etc.)
      • Web interface for querying the index
      • Management of recrawls

3
Nutch Architecture
  • 4 main components
      • Crawler
      • Web Database (WebDB, LinkDB, segments)
      • Indexer
      • Searcher
  • Crawler and Searcher are highly decoupled,
    enabling independent scaling
  • Highly modular, plugin-based architecture

4
Nutch Architecture
Doug Cutting, "Nutch Open Source Web Search", 22
May 2004, WWW2004, New York

5
Steps in a Crawl-Index cycle
  1. Create a new WebDB (admin db -create).
  2. Inject root URLs into the WebDB (inject).
  3. Generate a fetchlist from the WebDB in a new
    segment (generate).
  4. Fetch content from URLs in the fetchlist (fetch).
  5. Update the WebDB with links from fetched pages
    (updatedb).
  6. Repeat steps 3-5 until the required depth is
    reached.
  7. Update segments with scores and links from the
    WebDB (updatesegs).
  8. Index the fetched pages (index).
  9. Eliminate duplicate content (and duplicate URLs)
    from the indexes (dedup).
  10. Merge the indexes into a single index for
    searching (merge).
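
The ten steps above can be sketched as a toy, in-memory model. This is only an illustration of the control flow: `TOY_WEB`, `crawl`, and `build_index` are hypothetical names, the "web" is a hard-coded dictionary, and nothing here calls Nutch. Step 7 (updatesegs) is not modelled because the toy pages carry no scores.

```python
# Toy model of the crawl/index cycle. TOY_WEB and every function name
# here are hypothetical; this does not use or imitate the Nutch API.
TOY_WEB = {
    "http://example.test/A": ["http://example.test/B"],
    "http://example.test/B": ["http://example.test/A", "http://example.test/C"],
    "http://example.test/C": [],
}


def crawl(seeds, depth, web=TOY_WEB):
    """Run up to `depth` generate/fetch/update rounds; return (webdb, segments)."""
    webdb = {}                                   # step 1: new WebDB (url -> fetched?)
    for url in seeds:                            # step 2: inject root URLs
        webdb.setdefault(url, False)
    segments = []
    for _ in range(depth):                       # step 6: repeat steps 3-5
        # step 3: the fetchlist is every known URL not fetched yet
        fetchlist = sorted(u for u, done in webdb.items() if not done)
        if not fetchlist:
            break                                # nothing left to fetch
        # step 4: "fetch" each URL; one segment per round
        segment = {u: list(web.get(u, [])) for u in fetchlist}
        segments.append(segment)
        for url, outlinks in segment.items():    # step 5: update the WebDB
            webdb[url] = True
            for link in outlinks:
                webdb.setdefault(link, False)    # newly discovered, unfetched
    return webdb, segments


def build_index(segments):
    """Steps 8-10 in miniature: index each segment, drop duplicate URLs, merge."""
    merged = {}
    for segment in segments:
        for url, outlinks in segment.items():
            merged.setdefault(url, outlinks)     # first copy of a URL wins
    return merged


if __name__ == "__main__":
    webdb, segments = crawl(["http://example.test/A"], depth=2)
    print(len(segments), "segments;", sorted(build_index(segments)))
```

With a depth of 2 the sketch fetches A and then B, and leaves C in the WebDB as discovered but unfetched, which is how the depth setting limits a crawl.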

6
Crawling (cont.)
  • Can effectively crawl up to 100M pages
  • Crawl statistics on KReSIT site (it.iitb)
      • Took 153 mins for a deep crawl (depth 10)
      • Crawled 4171 documents
      • Size of crawl on disk: 168MB
      • Size of index: 25MB
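
The crawl statistics above imply a few derived figures. The short calculation below is only arithmetic on the slide's numbers; it assumes 1 MB = 1024 KB, which the slide does not specify.

```python
# Figures taken from the slide; the MB-to-KB conversion is an assumption.
minutes = 153
documents = 4171
crawl_mb = 168
index_mb = 25

docs_per_minute = documents / minutes
kb_per_document = crawl_mb * 1024 / documents
index_ratio = index_mb / crawl_mb

print(f"{docs_per_minute:.1f} documents fetched per minute")
print(f"{kb_per_document:.1f} KB on disk per document")
print(f"index is {index_ratio:.1%} of the crawl size")
```

That works out to roughly 27 documents per minute, about 41 KB stored per document, and an index close to 15% of the crawl size.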

7
Web Database (WebDB)
  • Persistent data structure for mirroring the
    structure and properties of the web graph being
    crawled
  • The WebDB stores two types of entities
      • Pages
      • Links
  • Optimised for frequent updates

8
Crawl Structure of it.iitb

9
Page DB
  • Page Database
  • Used for fetch scheduling
  • Contains
      • pages indexed and sorted by MD5 and URL
      • outlinks, fetch information, page score
  • A set of APIs is provided to perform the various
    operations

10
Sample data of PageDB
  • Page 1: Version: 4
      • URL: http://keaton/tinysite/A.html
      • ID: fb8b9f0792e449cda72a9670b4ce833a
      • Next fetch: Thu Nov 24 11:13:35 GMT 2005
      • Retries since fetch: 0
      • Retry interval: 30 days
      • Num outlinks: 1
      • Score: 1.0
      • NextScore: 1.0
  • Page 2: Version: 4
      • URL: http://keaton/tinysite/B.html
      • ID: 404db2bd139307b0e1b696d3a1a772b4
      • Next fetch: Thu Nov 24 11:13:37 GMT 2005
      • Retries since fetch: 0
      • Retry interval: 30 days
      • Num outlinks: 3
      • Score: 1.0
      • NextScore: 1.0
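
The two sample records above, together with the Page DB's role in fetch scheduling, can be modelled with a small sketch. `Page`, `PageDB`, and their methods are hypothetical names chosen for illustration, not Nutch's Java classes, and the rule "next fetch = fetch time + retry interval" is an assumption about how the fields relate.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Page:
    """One Page DB record, using the fields shown in the sample data."""
    url: str
    md5: str
    next_fetch: datetime
    retry_interval: timedelta = timedelta(days=30)
    num_outlinks: int = 0
    score: float = 1.0


class PageDB:
    """Hypothetical in-memory stand-in that offers both sort orders."""

    def __init__(self):
        self._pages = {}

    def add(self, page):
        self._pages[page.url] = page

    def by_url(self):
        return sorted(self._pages.values(), key=lambda p: p.url)

    def by_md5(self):
        return sorted(self._pages.values(), key=lambda p: p.md5)

    def due(self, now):
        """Pages whose next-fetch time has arrived (fetch scheduling)."""
        return [p for p in self.by_url() if p.next_fetch <= now]

    def mark_fetched(self, url, when):
        # Assumption: the next fetch is scheduled one retry interval later.
        page = self._pages[url]
        page.next_fetch = when + page.retry_interval


if __name__ == "__main__":
    db = PageDB()
    db.add(Page("http://keaton/tinysite/A.html",
                "fb8b9f0792e449cda72a9670b4ce833a",
                datetime(2005, 11, 24, 11, 13, 35, tzinfo=timezone.utc),
                num_outlinks=1))
    db.add(Page("http://keaton/tinysite/B.html",
                "404db2bd139307b0e1b696d3a1a772b4",
                datetime(2005, 11, 24, 11, 13, 37, tzinfo=timezone.utc),
                num_outlinks=3))
    print([p.url for p in db.by_md5()])
```

Sorting by MD5 puts page B first because its ID starts with "404", while sorting by URL puts page A first, so the two orderings serve different lookups over the same records.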

11
Link DB
  • Link Database
  • Contains
      • links sorted by MD5
      • links sorted by URL
  • Represents the full link graph
  • Stores anchor text associated with each link
  • Used for
      • Link analysis
      • Anchor text indexing
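
A minimal sketch of the Link DB idea follows. It assumes each link records the MD5 of the source page, the target URL, and the anchor text; that layout and the names `Link`, `LinkDB`, `anchors_for`, and `inlink_counts` are illustrative assumptions, not the actual Nutch schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One edge of the link graph (field layout is an assumption)."""
    from_md5: str   # MD5 ID of the page that contains the link
    to_url: str     # URL the link points to
    anchor: str     # anchor text of the link


class LinkDB:
    """Hypothetical in-memory stand-in for the Link DB."""

    def __init__(self):
        self._links = []

    def add(self, link):
        self._links.append(link)

    def by_md5(self):
        return sorted(self._links, key=lambda l: (l.from_md5, l.to_url))

    def by_url(self):
        return sorted(self._links, key=lambda l: (l.to_url, l.from_md5))

    def anchors_for(self, url):
        """Anchor texts pointing at `url`, as used for anchor text indexing."""
        return [l.anchor for l in self.by_url() if l.to_url == url]

    def inlink_counts(self):
        """Inlinks per URL, the simplest possible form of link analysis."""
        return Counter(l.to_url for l in self._links)


if __name__ == "__main__":
    db = LinkDB()
    db.add(Link("404db2bd", "http://keaton/tinysite/A.html", "back to A"))
    db.add(Link("fb8b9f07", "http://keaton/tinysite/B.html", "see B"))
    print(db.anchors_for("http://keaton/tinysite/B.html"))
```

Keeping both sort orders lets the same links be walked from the source page (by MD5) or gathered for a target page (by URL), which is what anchor text indexing needs.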

12
Segments
  • Collection of pages fetched and indexed by the
    crawler in a single run
  • One segment directory for each crawl-fetch-update
    cycle at a particular depth
  • Contains raw text and parsed data of the files
    crawled
  • Used to return the cached copy of a page and to
    generate snippets on the results page

13
Segments
  • The segread tool gives a useful summary of all
    segments (Parsed, Started, Finished, Dir)
  • It can also be used to dump the segment data in
    raw text format. The dump switch gives the
    following details
      • Fetcher Output: <url, hash, fetch-date ..>.
        Entries that go into the WebDB
      • Content: raw content, including HTTP headers and
        other metadata. Stored as the cached copy of a page
      • ParseData, ParseText: the appropriate parser
        plugin, chosen by looking at the raw content, is
        used to generate this data
14
Nutch API

15
Plugins
  • Provide extensions to extension points
  • Each extension point defines an interface that
    must be implemented by the extension
  • Some core extension points
      • IndexingFilter: add metadata to indexed fields
      • Parser: parse a new type of document
      • NutchAnalyzer: language-specific analyzers
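
The extension-point idea can be sketched as an interface plus a registry. This is a Python analogy only: Nutch's real extension points are Java interfaces, and the method signature, `HostFieldFilter`, `PluginRegistry`, and `run_indexing_filters` below are hypothetical.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class IndexingFilter(ABC):
    """Stand-in for an extension point: the interface extensions must implement."""

    @abstractmethod
    def filter(self, document, url):
        """Return the document (a dict of fields) with extra metadata added."""


class HostFieldFilter(IndexingFilter):
    """Example extension: adds the URL's host name as an indexed field."""

    def filter(self, document, url):
        enriched = dict(document)                 # leave the input untouched
        enriched["host"] = urlparse(url).hostname
        return enriched


class PluginRegistry:
    """Maps each extension point to the extensions registered for it."""

    def __init__(self):
        self._extensions = {}

    def register(self, point, extension):
        if not isinstance(extension, point):
            raise TypeError("extension does not implement the extension point")
        self._extensions.setdefault(point, []).append(extension)

    def extensions(self, point):
        return list(self._extensions.get(point, []))


def run_indexing_filters(registry, document, url):
    """Pass the document through every registered IndexingFilter in order."""
    for indexing_filter in registry.extensions(IndexingFilter):
        document = indexing_filter.filter(document, url)
    return document


if __name__ == "__main__":
    registry = PluginRegistry()
    registry.register(IndexingFilter, HostFieldFilter())
    print(run_indexing_filters(registry, {"title": "A"},
                               "http://keaton/tinysite/A.html"))
```

Because the core only talks to the interface, new behaviour such as an extra indexed field is added by registering another extension rather than by changing the indexer.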

16
References
  • Nutch Docs: http://lucene.apache.org/nutch/
  • Nutch Wiki: http://wiki.apache.org/nutch/
  • Prasad Pingali, CLIA consortium, Nutch Workshop,
    2007
  • Tom White, Introduction to Nutch, java.net
    website (http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html)