IST 441 Nutch: Design - PowerPoint PPT Presentation

About This Presentation
Title:

IST 441 Nutch: Design

Description:

Nutch as a web crawler. Nutch as a complete web search engine. Installation/Usage (with Demo) ... Java based, open source, many customizable scripts available ... – PowerPoint PPT presentation

Slides: 25
Provided by: zhao6
Transcript and Presenter's Notes

Title: IST 441 Nutch: Design


1
IST 441 Nutch: Design & Crawling
  • TA
  • Saurabh Kataria
  • Instructor
  • C. Lee Giles

2
Outline
  • Overview/Basics
  • Crawling Technical Details
  • Nutch as a web crawler
  • Nutch as a complete web search engine
  • Installation/Usage (with Demo)

3
Overview
  • Complete web search engine
  • Nutch = Crawler + Indexer/Searcher (Lucene) + GUI
    + Plugins + MapReduce & Distributed FS (Hadoop)
  • Java based, open source, many customizable
    scripts available at http://lucene.apache.org/nutch/
  • Features
  • Customizable
  • Extensible (e.g. extend to Solr for enhanced
    portability)

4
Search Engine Basic Workflow
5
What is Nutch?
  • Open source search engine
  • Written in Java
  • Built on top of Apache Lucene

6
Advantages of Nutch
  • Scalable
  • Index local host or entire Internet
  • Portable
  • Runs anywhere with Java
  • Flexible
  • Plugin system API
  • Code is pretty easy to read and work with
  • Better than implementing it yourself!

7
Data Structures used by Nutch
  • Web Database or WebDB
  • Mirrors the properties/structure of web graph
    being crawled
  • Segment
  • Intermediate index
  • Contains pages fetched in a single run
  • Index
  • Final inverted index obtained by merging
    segments (Lucene)

8
WebDB
  • Customized graph database
  • Used by Crawler only
  • Persistent storage for pages & links
  • Page DB indexed by URL and hash; contains
    content, outlinks, fetch information & score
  • Link DB contains source-to-target links &
    anchor text

9
Segment
  • Collection of pages fetched in a single run
  • Contains
  • Output of the fetcher
  • List of the links to be fetched in the next run
    called fetchlist
  • Limited life span (default 30 days)

10
Index
  • To be discussed later

11
Crawling
  • Cyclic process
  • the crawler generates a set of fetchlists from
    the WebDB
  • fetchers download the content from the Web
  • the crawler updates the WebDB with new links that
    were found
  • and then the crawler generates a new set of
    fetchlists
  • repeat until the required depth is reached
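The cycle above can be sketched as a shell loop. This is a dry run, not a real invocation: the `nutch` function below just echoes the command it would run, and the `crawl/db` paths, seed file, and depth are illustrative placeholders.

```shell
# Dry-run sketch of the crawl cycle; "nutch" echoes instead of executing,
# so no Nutch install is needed. Paths and depth are placeholders.
nutch() { echo "bin/nutch $*"; }

DEPTH=3                                    # stop after this many rounds
nutch inject crawl/db urls.txt             # seed the WebDB with root URLs
for round in $(seq 1 "$DEPTH"); do
  nutch generate crawl/db crawl/segments   # write a fetchlist into a new segment
  nutch fetch "crawl/segments/round$round" # download the pages on the fetchlist
  nutch updatedb crawl/db "crawl/segments/round$round"  # fold new links into the WebDB
done
```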

12
Nutch as a crawler
  • (Diagram) Initial URLs seed the CrawlDB; "generate" reads the
    CrawlDB and writes a fetchlist into a segment, "get" fetches
    webpages/files, and "update" writes newly found links back into
    the CrawlDB.
13
Nutch as a complete web search engine
  • (Diagram) Partially crawled data is indexed with Lucene; the
    resulting index is searched through a GUI running on Tomcat.
14
Crawling a 10-stage process
  • bin/nutch crawl <url_file> -dir <dir> -depth <n> >
    crawl.log
  • 1. admin db create Create a new WebDB.
  • 2. inject Inject root URLs into the WebDB.
  • 3. generate Generate a fetchlist from the
    WebDB in a new segment.
  • 4. fetch Fetch content from URLs in the
    fetchlist.
  • 5. updatedb Update the WebDB with links from
    fetched pages.
  • 6. Repeat steps 3-5 until the required depth
    is reached.
  • 7. updatesegs Update segments with scores and
    links from the WebDB.
  • 8. index Index the fetched pages.
  • 9. dedup Eliminate duplicate content (and
    duplicate URLs) from the indexes.
  • 10. merge Merge the indexes into a single
    index for searching.
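The post-loop stages (7-10) can be sketched the same way. Again `nutch` only echoes, and the `crawl/…` directory arguments are assumptions for illustration:

```shell
# Dry-run sketch of stages 7-10; "nutch" echoes instead of executing,
# and the crawl/... paths are illustrative placeholders.
nutch() { echo "bin/nutch $*"; }

nutch updatesegs crawl/db crawl/segments  # push WebDB scores/links into segments
nutch index crawl/segments                # index the fetched pages (Lucene)
nutch dedup crawl/indexes                 # drop duplicate content and URLs
nutch merge crawl/index crawl/indexes     # merge into a single search index
```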

15
Demo Configuration
  • Configuration files (XML)
  • Required user parameters
  • http.agent.name
  • http.agent.description
  • http.agent.url
  • http.agent.email
  • Adjustable parameters for every component
  • E.g. for fetcher
  • Threads-per-host
  • Threads-per-ip
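A minimal conf/nutch-site.xml carrying the four required agent properties might look as follows; the property values and the $NUTCH_HOME location are placeholders, not values from the slides:

```shell
# Write a minimal nutch-site.xml with the four required agent properties.
# NUTCH_HOME and all property values are illustrative placeholders.
NUTCH_HOME=${NUTCH_HOME:-./nutch}
mkdir -p "$NUTCH_HOME/conf"
cat > "$NUTCH_HOME/conf/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>ist441-crawler</value>  <!-- identifies the crawler to web servers -->
  </property>
  <property>
    <name>http.agent.description</name>
    <value>IST 441 course crawler</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://ist441.ist.psu.edu</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>student@example.edu</value>
  </property>
</configuration>
EOF
```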

16
Configuration
  • URL Filters (Text file) (conf/crawl-urlfilter.txt)
  • Regular expression to filter URLs during crawling
  • E.g.
  • To ignore files with certain suffixes
  • -\.(gif|exe|zip|ico)$
  • To accept hosts in a certain domain
  • +^http://([a-z0-9]*\.)*apache.org/
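Put together, a conf/crawl-urlfilter.txt covering both examples could look like the sketch below. Rules are tried top-down and the first match wins; the trailing `-.` catch-all and the $NUTCH_HOME location are assumptions:

```shell
# Write a small crawl-urlfilter.txt covering the two example rules.
# NUTCH_HOME is an illustrative placeholder.
NUTCH_HOME=${NUTCH_HOME:-./nutch}
mkdir -p "$NUTCH_HOME/conf"
cat > "$NUTCH_HOME/conf/crawl-urlfilter.txt" <<'EOF'
# skip URLs ending in unwanted suffixes
-\.(gif|exe|zip|ico)$
# accept hosts in the apache.org domain
+^http://([a-z0-9]*\.)*apache.org/
# reject everything else
-.
EOF
```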

17
Installation & Usage
  • Installation
  • Software needed
  • Nutch release
  • Java
  • Apache Tomcat (for GUI)
  • Cygwin (for Windows)

18
Installation & Usage
  • Usage
  • Crawling
  • Initial URLs (text file or DMOZ file)
  • Required parameters (conf/nutch-site.xml)
  • URL filters (conf/crawl-urlfilter.txt)
  • Indexing
  • Automatic
  • Searching
  • Location of files (WAR file, index)
  • The tomcat server
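The searching setup can be sketched as a dry run: deploy the Nutch WAR into Tomcat and start Tomcat where the index lives. The `run` wrapper only echoes, and $TOMCAT_HOME, the WAR name, and the `crawl` directory are assumptions:

```shell
# Dry-run sketch of wiring up the search GUI; "run" echoes the commands
# it would execute. TOMCAT_HOME and file names are placeholders.
run() { echo "+ $*"; }

TOMCAT_HOME=${TOMCAT_HOME:-/opt/tomcat}
run cp nutch.war "$TOMCAT_HOME/webapps/ROOT.war"  # deploy the search webapp
run cd crawl                                      # so the webapp finds the index
run "$TOMCAT_HOME/bin/catalina.sh" start          # start the Tomcat server
```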

19
Demo the site we will crawl
  • http://ist441.ist.psu.edu

20
Site Structure
  • (Diagram) The demo site rooted at http://ist441.ist.psu.edu/
    contains A.html, B.html, A_dup.html, C.html, C_dup.html, and a
    link out to Wikipedia.org.
21
Demo Some Commands
  • Crawl
  • bin/nutch crawl <url_file> -dir <dir> -depth <n> >
    crawl.log
  • Analyze the WebDB
  • bin/nutch readdb stats
  • bin/nutch readdb dumppageurl
  • bin/nutch readdb dumplinks
  • bin/nutch readdb -linkurl
  • s=`ls -d <segments_dir>/* | head -1`
  • bin/nutch segread -dump $s
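The segment-picking idiom above (`ls -d … | head -1`) can be tried stand-alone; the dummy timestamped segment names below are placeholders mimicking how Nutch names segment directories:

```shell
# Create two dummy segment directories (Nutch names segments with
# timestamps), then grab the first one lexicographically.
mkdir -p crawl/segments/20080101000000 crawl/segments/20080102000000
s=$(ls -d crawl/segments/* | head -1)  # first (oldest) segment directory
echo "$s"
```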

22
Next Meeting
  • Indexing!
  • Searching!

23
Q & A
24
References
  • http://lucene.apache.org/nutch/ -- Official
    website
  • http://wiki.apache.org/nutch/ -- Nutch wiki
    (Seriously outdated. Take with a grain of salt.)
  • http://lucene.apache.org/nutch/release/ -- Nutch
    source code
  • www.nutchinstall.blogspot.com -- Installation guide
  • http://www.robotstxt.org/wc/robots.html -- The web
    robot pages