A Brief Look at Web Crawlers

1
A Brief Look at Web Crawlers
  • Bin Tan
  • 03/15/07

2
Web Crawlers
  • A web crawler is a program or automated script
    that browses the World Wide Web in a methodical,
    automated manner
  • Uses
  • Creating an archive / index from the visited web
    pages to support offline browsing / search /
    mining
  • Automating maintenance tasks on a website
  • Harvesting specific information from web pages

3
High-level architecture
(Architecture diagram: seed URLs initialize the
frontier, the queue of URLs waiting to be
downloaded; a minimal sketch of this loop follows.)
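
A minimal sketch of the seed/frontier loop in Python, assuming a
tiny single-threaded crawler; the regex link extraction and names
like crawl and max_pages are illustrative, not from the slides.

    import re
    import urllib.request
    from collections import deque

    def crawl(seeds, max_pages=100):
        frontier = deque(seeds)      # queue of URLs still to fetch
        seen = set(seeds)            # never enqueue the same URL twice
        pages = {}                   # url -> html, input for an archive/index
        while frontier and len(pages) < max_pages:
            url = frontier.popleft() # FIFO pop = breadth-first selection
            try:
                html = urllib.request.urlopen(url, timeout=10) \
                           .read().decode("utf-8", "replace")
            except (OSError, ValueError):
                continue             # dead server or bad URL: skip it
            pages[url] = html
            for link in re.findall(r'href="(http[^"]+)"', html):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return pages
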
4
How easy is it to write a program to crawl all
uiuc.edu web pages?
5
All sorts of real problems
  • Managing multiple download threads is nontrivial
  • If you make requests to a server at short
    intervals, you'll overload it
  • Pages may be missing; servers may be down or
    sluggish (see the sketch below)
  • You may be trapped in dynamically generated pages
  • Web pages may use ill-formed HTML
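
A hedged sketch of defensive fetching for the problems above:
timeouts for sluggish servers, a retry budget with backoff for
flaky ones, and a depth cap as one simple guard against traps of
dynamically generated pages. All names here are illustrative.

    import time
    import urllib.request

    MAX_DEPTH = 10    # stop following links past this depth (trap guard)

    def fetch_with_retries(url, tries=3, timeout=10, delay=2.0):
        for attempt in range(tries):
            try:
                return urllib.request.urlopen(url, timeout=timeout).read()
            except OSError:
                time.sleep(delay * (attempt + 1))  # back off, then retry
        return None   # give up: page missing or server down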

6
This is only a small-scale crawl
  • (Shkapenyuk and Suel, 2002) "While it is fairly
    easy to build a slow crawler that downloads a few
    pages per second for a short period of time,
    building a high-performance system that can
    download hundreds of millions of pages over
    several weeks presents a number of challenges in
    system design, I/O and network efficiency, and
    robustness and manageability."

7
Data characteristics in large-scale crawls
  • Large volume, fast changes, and dynamic page
    generation lead to a wide selection of possibly
    crawlable URLs
  • Edwards et al.: "Given that the bandwidth for
    conducting crawls is neither infinite nor free,
    it is becoming essential to crawl the Web in not
    only a scalable, but efficient way, if some
    reasonable measure of quality or freshness is to
    be maintained."

8
Selection policy: which pages to download
  • Need to prioritize according to some page
    importance metric (a priority-frontier sketch
    follows)
  • Depth-first
  • Breadth-first
  • Partial PageRank calculation
  • OPIC (On-line Page Importance Computation)
  • Length of per-site queues
  • In focused crawling, prediction of similarity
    between page text and query
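
A sketch of a priority-driven frontier: instead of the plain FIFO
queue of breadth-first crawling, URLs are popped in order of an
importance score. The class name and scores are illustrative; a
real crawler would plug in partial PageRank, OPIC cash, or
per-site queue lengths as the score.

    import heapq
    import itertools

    class PriorityFrontier:
        def __init__(self):
            self._heap = []
            self._counter = itertools.count()  # tie-breaker for equal scores

        def push(self, url, score):
            # heapq is a min-heap, so negate: higher score pops first
            heapq.heappush(self._heap, (-score, next(self._counter), url))

        def pop(self):
            _, _, url = heapq.heappop(self._heap)
            return url

    frontier = PriorityFrontier()
    frontier.push("http://www.uiuc.edu/", score=1.0)
    frontier.push("http://example.com/deep/page", score=0.1)
    print(frontier.pop())   # -> http://www.uiuc.edu/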

9
Revisit policy: when to check for changes to the
pages
  • Pages are frequently updated, created, or deleted
  • Cost functions to minimize
  • Freshness (0 for stale pages, 1 for fresh pages)
  • Age (amount of time for which a page has been
    stale; both are sketched below)
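
The two measures above, as a small sketch; last_change, last_crawl,
and now are POSIX timestamps, and the argument names are
illustrative, not from the slides.

    def freshness(last_change, last_crawl):
        # 1 while our copy still matches the live page, 0 once it changed
        return 1 if last_crawl >= last_change else 0

    def age(last_change, last_crawl, now):
        # 0 while fresh; otherwise time elapsed since the page changed
        return 0 if last_crawl >= last_change else now - last_change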

10
Revisit Policy (cont.)
  • Uniform policy: revisiting all pages in the
    collection with the same frequency
  • Proportional policy: revisiting more often the
    pages that change more frequently
  • The optimal method for keeping average freshness
    high includes ignoring the pages that change too
    often, and the optimal method for keeping average
    age low is to use access frequencies that
    monotonically (and sub-linearly) increase with
    the rate of change of each page
  • Numerical methods are used for calculation, based
    on the distribution of page changes (a numerical
    illustration follows)
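
A numerical illustration, assuming the common Poisson change model
used in this line of work: if a page changes at rate lam (changes
per day) and is revisited every interval days, its time-averaged
freshness works out to (1 - exp(-lam * interval)) / (lam * interval).

    import math

    def avg_freshness(lam, interval):
        x = lam * interval
        return (1 - math.exp(-x)) / x

    print(avg_freshness(1.0, 1.0))   # changes ~daily, daily revisit: ~0.63
    print(avg_freshness(10.0, 1.0))  # changes ~10x/day: ~0.10, so daily
                                     # revisiting buys little, hence "ignore
                                     # the pages that change too often"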

11
Politeness policy: how to avoid overloading
websites
  • Badly behaved crawlers can be a nuisance
  • Robots exclusion protocol (robots.txt)
  • Interval/delay between connections (10 sec to 5
    min; a robots.txt sketch follows)
  • fixed
  • proportional to page download time
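
A sketch of both points using Python's standard library:
urllib.robotparser reads robots.txt, and a fixed delay separates
connections. The user-agent string and the 10-second fallback are
assumptions for illustration.

    import time
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.uiuc.edu/robots.txt")
    rp.read()

    USER_AGENT = "MyCrawler"                  # illustrative name
    DELAY = rp.crawl_delay(USER_AGENT) or 10  # honor Crawl-delay if given

    for url in ["http://www.uiuc.edu/", "http://www.uiuc.edu/about/"]:
        if rp.can_fetch(USER_AGENT, url):     # skip disallowed URLs
            # ... fetch url here ...
            time.sleep(DELAY)                 # fixed interval between requests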

12
Parallelization policy: how to coordinate
distributed web crawlers
  • Nutch: "A successful search engine requires more
    bandwidth to upload query result pages than its
    crawler needs to download pages" (a partitioning
    sketch follows)
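
One common coordination scheme (an assumption here, not stated on
the slide) is to statically partition the URL space by hashing the
host name, so each crawler process owns a disjoint set of sites
and no two processes fetch the same page.

    import zlib
    from urllib.parse import urlparse

    NUM_CRAWLERS = 4

    def owner(url):
        host = urlparse(url).netloc
        return zlib.crc32(host.encode()) % NUM_CRAWLERS

    # the same host always maps to the same crawler process
    print(owner("http://www.uiuc.edu/cs/"))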

13
Crawling the deep web
  • Many web spiders run by popular search engines
    ignore URLs with a query string
  • Google's Sitemaps protocol allows a webmaster to
    inform search engines about URLs on a website
    that are available for crawling (a parsing sketch
    follows)
  • Also, mod_oai is an Apache module that allows web
    crawlers to efficiently discover new, modified,
    and deleted web resources from a web server by
    using OAI-PMH, a protocol widely used in the
    digital libraries community
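
A sketch of reading a Sitemaps-protocol file to seed the frontier
with URLs a crawler could not discover by following links. The
namespace URI is fixed by the Sitemaps 0.9 protocol; the local
file name is illustrative.

    import xml.etree.ElementTree as ET

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    tree = ET.parse("sitemap.xml")
    for url in tree.findall("sm:url", NS):
        loc = url.find("sm:loc", NS).text
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        # loc seeds the frontier; lastmod can inform the revisit policy
        print(loc, lastmod)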

14
Example Web Crawler Software
  • wget
  • heritrix
  • nutch
  • others

15
Wget
  • Command-line tool, non-extensible
  • Configurable recursive downloading
  • Configurable host spanning
  • Breadth-first for HTTP, depth-first for FTP
  • Configurable include/exclude filters
  • Updates outdated pages based on timestamps
  • Supports the robots.txt protocol
  • Configurable connection delay
  • Single-threaded

16
Heritrix
  • Heritrix is the Internet Archive's web crawler,
    specially designed for web archiving
  • License: LGPL
  • Written in Java

17
(No Transcript)
18
Features
  • Highly modular, easily extensible
  • Scales to large data volumes
  • Implemented selection policies
  • Breadth-first, with options to throttle activity
    against particular hosts and to bias towards
    finishing hosts in progress or cycling among all
    hosts with pending URLs
  • Domain sensitive: allows specifying an upper
    bound on the number of pages downloaded per site
  • Adaptive revisiting: repeatedly visits all
    encountered URLs (wait time between visits is
    configurable)
  • Implements fixed / proportional connection delay
  • Detailed documentation
  • Web-based UI for crawler administration

19
(No Transcript)
20
Nutch
  • Nutch is an effort to build an open-source search
    engine, based on Lucene for the search and index
    component
  • License: Apache 2.0
  • Written in Java

21
Features
  • Modular, extensible
  • Breadth-first
  • Includes parsing and indexing components
  • Implements a MapReduce facility and a distributed
    file system (Hadoop)

22
Recrawl command lines
  • The generate/fetch/update cycle ($depth, $adddays,
    $webdb_dir, and $segments_dir are shell variables
    set beforehand):

    for ((i=1; i <= depth; i++))
    do
      bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
      segment=`ls -d $segments_dir/* | tail -1`
      bin/nutch fetch $segment
      bin/nutch updatedb $webdb_dir $segment
    done

23
Appendix: Parsers
  • HTML
  • lynx -dump
  • Beautiful Soup (Python; example below)
  • TidyLib (C)
  • PDF
  • xpdf
  • Others
  • Nutch plugins
  • Office API (Windows)
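
A quick example of the Beautiful Soup option above, chosen because
it tolerates the ill-formed HTML real crawls encounter. It needs
the third-party bs4 package (pip install beautifulsoup4).

    from bs4 import BeautifulSoup

    html = '<html><body><p>Broken <b>markup<p><a href="/x">link</a>'
    soup = BeautifulSoup(html, "html.parser")
    print(soup.get_text(" ", strip=True))               # text for indexing
    print([a.get("href") for a in soup.find_all("a")])  # -> ['/x']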