IST 441 Nutch: Design - PowerPoint PPT Presentation

About This Presentation
Title:

IST 441 Nutch: Design

Description:

Nutch as a web crawler. Nutch as a complete web search engine. Installation/Usage (with Demo) ... Java based, open source, many customizable scripts available ... – PowerPoint PPT presentation

Slides: 25
Provided by: zhao6
Transcript and Presenter's Notes

Title: IST 441 Nutch: Design


1
IST 441 Nutch: Design & Crawling
  • TA
  • Saurabh Kataria
  • Instructor
  • C. Lee Giles

2
Outline
  • Overview/Basics
  • Crawling Technical Details
  • Nutch as a web crawler
  • Nutch as a complete web search engine
  • Installation/Usage (with Demo)

3
Overview
  • Complete web search engine
  • Nutch = Crawler + Indexer/Searcher (Lucene) + GUI
    + Plugins + MapReduce & Distributed FS (Hadoop)
  • Java based, open source, many customizable
    scripts available at http://lucene.apache.org/nutch/
  • Features
  • Customizable
  • Extensible (e.g. extend to Solr for enhanced
    portability)

4
Search Engine Basic Workflow
5
What is Nutch?
  • Open source search engine
  • Written in Java
  • Built on top of Apache Lucene

6
Advantages of Nutch
  • Scalable
  • Index local host or entire Internet
  • Portable
  • Runs anywhere with Java
  • Flexible
  • Plugin system API
  • Code is pretty easy to read and work with
  • Better than implementing it yourself!

7
Data Structures used by Nutch
  • Web Database or WebDB
  • Mirrors the properties/structure of web graph
    being crawled
  • Segment
  • Intermediate index
  • Contains pages fetched in a single run
  • Index
  • Final inverted index obtained by merging
    segments (Lucene)

8
WebDB
  • Customized graph database
  • Used by Crawler only
  • Persistent storage for pages & links
  • Page DB indexed by URL and hash; contains
    content, outlinks, fetch information & score
  • Link DB contains source-to-target links &
    anchor text

9
Segment
  • Collection of pages fetched in a single run
  • Contains
  • Output of the fetcher
  • List of the links to be fetched in the next run
    called fetchlist
  • Limited life span (default 30 days)

10
Index
  • To be discussed later

11
Crawling
  • Cyclic process
  • the crawler generates a set of fetchlists from
    the WebDB
  • fetchers download the content from the Web
  • the crawler updates the WebDB with new links that
    were found
  • and then the crawler generates a new set of
    fetchlists
  • repeat until the required depth is reached
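The cycle above can be sketched as a shell loop. This is a dry run, not a real invocation: the `nutch` function below just echoes the command it would run, and the `crawl/db` paths, seed file, and depth are illustrative placeholders.

```shell
# Dry-run sketch of the crawl cycle; "nutch" echoes instead of executing,
# so no Nutch install is needed. Paths and depth are placeholders.
nutch() { echo "bin/nutch $*"; }

DEPTH=3                                    # stop after this many rounds
nutch inject crawl/db urls.txt             # seed the WebDB with root URLs
for round in $(seq 1 "$DEPTH"); do
  nutch generate crawl/db crawl/segments   # write a fetchlist into a new segment
  nutch fetch "crawl/segments/round$round" # download the pages on the fetchlist
  nutch updatedb crawl/db "crawl/segments/round$round"  # fold new links into the WebDB
done
```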

12
Nutch as a crawler
  • (Diagram) Initial URLs seed the CrawlDB; "generate" reads the
    CrawlDB and writes a fetchlist into a segment, "get" fetches
    webpages/files, and "update" writes newly found links back into
    the CrawlDB.
13
Nutch as a complete web search engine
  • (Diagram) Partially crawled data is indexed with Lucene; the
    resulting index is searched through a GUI running on Tomcat.
14
Crawling a 10-stage process
  • bin/nutch crawl <url_file> -dir <dir> -depth <n> >
    crawl.log
  • 1. admin db create Create a new WebDB.
  • 2. inject Inject root URLs into the WebDB.
  • 3. generate Generate a fetchlist from the
    WebDB in a new segment.
  • 4. fetch Fetch content from URLs in the
    fetchlist.
  • 5. updatedb Update the WebDB with links from
    fetched pages.
  • 6. Repeat steps 3-5 until the required depth
    is reached.
  • 7. updatesegs Update segments with scores and
    links from the WebDB.
  • 8. index Index the fetched pages.
  • 9. dedup Eliminate duplicate content (and
    duplicate URLs) from the indexes.
  • 10. merge Merge the indexes into a single
    index for searching.
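The post-loop stages (7-10) can be sketched the same way. Again `nutch` only echoes, and the `crawl/…` directory arguments are assumptions for illustration:

```shell
# Dry-run sketch of stages 7-10; "nutch" echoes instead of executing,
# and the crawl/... paths are illustrative placeholders.
nutch() { echo "bin/nutch $*"; }

nutch updatesegs crawl/db crawl/segments  # push WebDB scores/links into segments
nutch index crawl/segments                # index the fetched pages (Lucene)
nutch dedup crawl/indexes                 # drop duplicate content and URLs
nutch merge crawl/index crawl/indexes     # merge into a single search index
```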

15
Demo Configuration
  • Configuration files (XML)
  • Required user parameters
  • http.agent.name
  • http.agent.description
  • http.agent.url
  • http.agent.email
  • Adjustable parameters for every component
  • E.g. for fetcher
  • Threads-per-host
  • Threads-per-ip
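A minimal conf/nutch-site.xml carrying the four required agent properties might look as follows; the property values and the $NUTCH_HOME location are placeholders, not values from the slides:

```shell
# Write a minimal nutch-site.xml with the four required agent properties.
# NUTCH_HOME and all property values are illustrative placeholders.
NUTCH_HOME=${NUTCH_HOME:-./nutch}
mkdir -p "$NUTCH_HOME/conf"
cat > "$NUTCH_HOME/conf/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>ist441-crawler</value>  <!-- identifies the crawler to web servers -->
  </property>
  <property>
    <name>http.agent.description</name>
    <value>IST 441 course crawler</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://ist441.ist.psu.edu</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>student@example.edu</value>
  </property>
</configuration>
EOF
```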

16
Configuration
  • URL Filters (Text file) (conf/crawl-urlfilter.txt)
  • Regular expression to filter URLs during crawling
  • E.g.
  • To ignore files with certain suffixes
  • -\.(gif|exe|zip|ico)$
  • To accept hosts in a certain domain
  • +^http://([a-z0-9]*\.)*apache.org/
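Put together, a conf/crawl-urlfilter.txt covering both examples could look like the sketch below. Rules are tried top-down and the first match wins; the trailing `-.` catch-all and the $NUTCH_HOME location are assumptions:

```shell
# Write a small crawl-urlfilter.txt covering the two example rules.
# NUTCH_HOME is an illustrative placeholder.
NUTCH_HOME=${NUTCH_HOME:-./nutch}
mkdir -p "$NUTCH_HOME/conf"
cat > "$NUTCH_HOME/conf/crawl-urlfilter.txt" <<'EOF'
# skip URLs ending in unwanted suffixes
-\.(gif|exe|zip|ico)$
# accept hosts in the apache.org domain
+^http://([a-z0-9]*\.)*apache.org/
# reject everything else
-.
EOF
```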

17
Installation & Usage
  • Installation
  • Software needed
  • Nutch release
  • Java
  • Apache Tomcat (for GUI)
  • Cygwin (for Windows)

18
Installation & Usage
  • Usage
  • Crawling
  • Initial URLs (text file or DMOZ file)
  • Required parameters (conf/nutch-site.xml)
  • URL filters (conf/crawl-urlfilter.txt)
  • Indexing
  • Automatic
  • Searching
  • Location of files (WAR file, index)
  • The tomcat server
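The searching setup can be sketched as a dry run: deploy the Nutch WAR into Tomcat and start Tomcat where the index lives. The `run` wrapper only echoes, and $TOMCAT_HOME, the WAR name, and the `crawl` directory are assumptions:

```shell
# Dry-run sketch of wiring up the search GUI; "run" echoes the commands
# it would execute. TOMCAT_HOME and file names are placeholders.
run() { echo "+ $*"; }

TOMCAT_HOME=${TOMCAT_HOME:-/opt/tomcat}
run cp nutch.war "$TOMCAT_HOME/webapps/ROOT.war"  # deploy the search webapp
run cd crawl                                      # so the webapp finds the index
run "$TOMCAT_HOME/bin/catalina.sh" start          # start the Tomcat server
```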

19
Demo the site we will crawl
  • http://ist441.ist.psu.edu

20
Site Structure
  • (Diagram) The demo site rooted at http://ist441.ist.psu.edu/
    contains A.html, B.html, A_dup.html, C.html, C_dup.html, and a
    link out to Wikipedia.org.
21
Demo Some Commands
  • Crawl
  • bin/nutch crawl <url_file> -dir <dir> -depth <n> >
    crawl.log
  • Analyze the WebDB
  • bin/nutch readdb stats
  • bin/nutch readdb dumppageurl
  • bin/nutch readdb dumplinks
  • bin/nutch readdb -linkurl
  • s=`ls -d <segments_dir>/* | head -1`
  • bin/nutch segread -dump $s
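The segment-picking idiom above (`ls -d … | head -1`) can be tried stand-alone; the dummy timestamped segment names below are placeholders mimicking how Nutch names segment directories:

```shell
# Create two dummy segment directories (Nutch names segments with
# timestamps), then grab the first one lexicographically.
mkdir -p crawl/segments/20080101000000 crawl/segments/20080102000000
s=$(ls -d crawl/segments/* | head -1)  # first (oldest) segment directory
echo "$s"
```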

22
Next Meeting
  • Indexing!
  • Searching!

23
Q & A
24
References
  • http://lucene.apache.org/nutch/ -- Official
    website
  • http://wiki.apache.org/nutch/ -- Nutch wiki
    (Seriously outdated. Take with a grain of salt.)
  • http://lucene.apache.org/nutch/release/ -- Nutch
    source code
  • www.nutchinstall.blogspot.com -- Installation guide
  • http://www.robotstxt.org/wc/robots.html -- The web
    robot pages