A Brief Look at Web Crawlers

1
A Brief Look at Web Crawlers
  • Bin Tan
  • 03/15/07

2
Web Crawlers
  • A web crawler is a program or automated script
    that browses the World Wide Web in a methodical,
    automated manner
  • Uses
  • Creating an archive / index from the visited web
    pages to support offline browsing / search /
    mining
  • Automating maintenance tasks on a website
  • Harvesting specific information from web pages

3
High-level architecture
(Architecture diagram: seed URLs initialize the
frontier, the queue of URLs waiting to be
downloaded; a minimal sketch of this loop follows.)
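
A minimal sketch of the seed/frontier loop in Python, assuming a
tiny single-threaded crawler; the regex link extraction and names
like crawl and max_pages are illustrative, not from the slides.

    import re
    import urllib.request
    from collections import deque

    def crawl(seeds, max_pages=100):
        frontier = deque(seeds)      # queue of URLs still to fetch
        seen = set(seeds)            # never enqueue the same URL twice
        pages = {}                   # url -> html, input for an archive/index
        while frontier and len(pages) < max_pages:
            url = frontier.popleft() # FIFO pop = breadth-first selection
            try:
                html = urllib.request.urlopen(url, timeout=10) \
                           .read().decode("utf-8", "replace")
            except (OSError, ValueError):
                continue             # dead server or bad URL: skip it
            pages[url] = html
            for link in re.findall(r'href="(http[^"]+)"', html):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return pages
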
4
How easy is it to write a program to crawl all
uiuc.edu web pages?
5
All sorts of real problems
  • Managing multiple download threads is nontrivial
  • If you make requests to a server at short
    intervals, you'll overload it
  • Pages may be missing; servers may be down or
    sluggish (see the sketch below)
  • You may be trapped in dynamically generated pages
  • Web pages may use ill-formed HTML
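
A hedged sketch of defensive fetching for the problems above:
timeouts for sluggish servers, a retry budget with backoff for
flaky ones, and a depth cap as one simple guard against traps of
dynamically generated pages. All names here are illustrative.

    import time
    import urllib.request

    MAX_DEPTH = 10    # stop following links past this depth (trap guard)

    def fetch_with_retries(url, tries=3, timeout=10, delay=2.0):
        for attempt in range(tries):
            try:
                return urllib.request.urlopen(url, timeout=timeout).read()
            except OSError:
                time.sleep(delay * (attempt + 1))  # back off, then retry
        return None   # give up: page missing or server down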

6
This is only a small-scale crawl
  • (Shkapenyuk and Suel, 2002) "While it is fairly
    easy to build a slow crawler that downloads a few
    pages per second for a short period of time,
    building a high-performance system that can
    download hundreds of millions of pages over
    several weeks presents a number of challenges in
    system design, I/O and network efficiency, and
    robustness and manageability."

7
Data characteristics in large-scale crawls
  • Large volume, fast changes, and dynamic page
    generation lead to a wide selection of possibly
    crawlable URLs
  • Edwards et al.: "Given that the bandwidth for
    conducting crawls is neither infinite nor free,
    it is becoming essential to crawl the Web in not
    only a scalable, but efficient way, if some
    reasonable measure of quality or freshness is to
    be maintained."

8
Selection policy: which pages to download
  • Need to prioritize according to some page
    importance metric (a priority-frontier sketch
    follows)
  • Depth-first
  • Breadth-first
  • Partial PageRank calculation
  • OPIC (On-line Page Importance Computation)
  • Length of per-site queues
  • In focused crawling, prediction of similarity
    between page text and query
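
A sketch of a priority-driven frontier: instead of the plain FIFO
queue of breadth-first crawling, URLs are popped in order of an
importance score. The class name and scores are illustrative; a
real crawler would plug in partial PageRank, OPIC cash, or
per-site queue lengths as the score.

    import heapq
    import itertools

    class PriorityFrontier:
        def __init__(self):
            self._heap = []
            self._counter = itertools.count()  # tie-breaker for equal scores

        def push(self, url, score):
            # heapq is a min-heap, so negate: higher score pops first
            heapq.heappush(self._heap, (-score, next(self._counter), url))

        def pop(self):
            _, _, url = heapq.heappop(self._heap)
            return url

    frontier = PriorityFrontier()
    frontier.push("http://www.uiuc.edu/", score=1.0)
    frontier.push("http://example.com/deep/page", score=0.1)
    print(frontier.pop())   # -> http://www.uiuc.edu/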

9
Revisit policy: when to check for changes to the
pages
  • Pages are frequently updated, created, or deleted
  • Cost functions to minimize
  • Freshness (0 for stale pages, 1 for fresh pages)
  • Age (amount of time for which a page has been
    stale; both are sketched below)
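
The two measures above, as a small sketch; last_change, last_crawl,
and now are POSIX timestamps, and the argument names are
illustrative, not from the slides.

    def freshness(last_change, last_crawl):
        # 1 while our copy still matches the live page, 0 once it changed
        return 1 if last_crawl >= last_change else 0

    def age(last_change, last_crawl, now):
        # 0 while fresh; otherwise time elapsed since the page changed
        return 0 if last_crawl >= last_change else now - last_change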

10
Revisit Policy (cont.)
  • Uniform policy: revisiting all pages in the
    collection with the same frequency
  • Proportional policy: revisiting more often the
    pages that change more frequently
  • The optimal method for keeping average freshness
    high includes ignoring the pages that change too
    often, and the optimal method for keeping average
    age low is to use access frequencies that
    monotonically (and sub-linearly) increase with
    the rate of change of each page
  • Numerical methods are used for calculation, based
    on the distribution of page changes (a numerical
    illustration follows)
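
A numerical illustration, assuming the common Poisson change model
used in this line of work: if a page changes at rate lam (changes
per day) and is revisited every interval days, its time-averaged
freshness works out to (1 - exp(-lam * interval)) / (lam * interval).

    import math

    def avg_freshness(lam, interval):
        x = lam * interval
        return (1 - math.exp(-x)) / x

    print(avg_freshness(1.0, 1.0))   # changes ~daily, daily revisit: ~0.63
    print(avg_freshness(10.0, 1.0))  # changes ~10x/day: ~0.10, so daily
                                     # revisiting buys little, hence "ignore
                                     # the pages that change too often"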

11
Politeness policy: how to avoid overloading
websites
  • Badly behaved crawlers can be a nuisance
  • Robots exclusion protocol (robots.txt)
  • Interval/delay between connections (10 sec to 5
    min; a robots.txt sketch follows)
  • fixed
  • proportional to page download time
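
A sketch of both points using Python's standard library:
urllib.robotparser reads robots.txt, and a fixed delay separates
connections. The user-agent string and the 10-second fallback are
assumptions for illustration.

    import time
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.uiuc.edu/robots.txt")
    rp.read()

    USER_AGENT = "MyCrawler"                  # illustrative name
    DELAY = rp.crawl_delay(USER_AGENT) or 10  # honor Crawl-delay if given

    for url in ["http://www.uiuc.edu/", "http://www.uiuc.edu/about/"]:
        if rp.can_fetch(USER_AGENT, url):     # skip disallowed URLs
            # ... fetch url here ...
            time.sleep(DELAY)                 # fixed interval between requests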

12
Parallelization policy: how to coordinate
distributed web crawlers
  • Nutch: "A successful search engine requires more
    bandwidth to upload query result pages than its
    crawler needs to download pages" (a partitioning
    sketch follows)
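
One common coordination scheme (an assumption here, not stated on
the slide) is to statically partition the URL space by hashing the
host name, so each crawler process owns a disjoint set of sites
and no two processes fetch the same page.

    import zlib
    from urllib.parse import urlparse

    NUM_CRAWLERS = 4

    def owner(url):
        host = urlparse(url).netloc
        return zlib.crc32(host.encode()) % NUM_CRAWLERS

    # the same host always maps to the same crawler process
    print(owner("http://www.uiuc.edu/cs/"))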

13
Crawling the deep web
  • Many web spiders run by popular search engines
    ignore URLs with a query string
  • Google's Sitemaps protocol allows a webmaster to
    inform search engines about URLs on a website
    that are available for crawling (a parsing sketch
    follows)
  • Also, mod_oai is an Apache module that allows web
    crawlers to efficiently discover new, modified,
    and deleted web resources from a web server by
    using OAI-PMH, a protocol widely used in the
    digital libraries community
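
A sketch of reading a Sitemaps-protocol file to seed the frontier
with URLs a crawler could not discover by following links. The
namespace URI is fixed by the Sitemaps 0.9 protocol; the local
file name is illustrative.

    import xml.etree.ElementTree as ET

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    tree = ET.parse("sitemap.xml")
    for url in tree.findall("sm:url", NS):
        loc = url.find("sm:loc", NS).text
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        # loc seeds the frontier; lastmod can inform the revisit policy
        print(loc, lastmod)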

14
Example Web Crawler Software
  • wget
  • heritrix
  • nutch
  • others

15
Wget
  • Command-line tool, non-extensible
  • Configurable recursive downloading
  • Configurable host spanning
  • Breadth-first for HTTP, depth-first for FTP
  • Configurable include/exclude filters
  • Updates outdated pages based on timestamps
  • Supports the robots.txt protocol
  • Configurable connection delay
  • Single-threaded

16
Heritrix
  • Heritrix is the Internet Archive's web crawler,
    specially designed for web archiving
  • License: LGPL
  • Written in Java

17
(No Transcript)
18
Features
  • Highly modular, easily extensible
  • Scales to large data volumes
  • Implemented selection policies
  • Breadth-first, with options to throttle activity
    against particular hosts and to bias towards
    finishing hosts in progress or cycling among all
    hosts with pending URLs
  • Domain sensitive: allows specifying an upper
    bound on the number of pages downloaded per site
  • Adaptive revisiting: repeatedly visits all
    encountered URLs (wait time between visits is
    configurable)
  • Implements fixed / proportional connection delay
  • Detailed documentation
  • Web-based UI for crawler administration

19
(No Transcript)
20
Nutch
  • Nutch is an effort to build an open-source search
    engine, based on Lucene for the search and index
    component
  • License: Apache 2.0
  • Written in Java

21
Features
  • Modular, extensible
  • Breadth-first
  • Includes parsing and indexing components
  • Implements a MapReduce facility and a distributed
    file system (Hadoop)

22
Recrawl command lines
  • The generate/fetch/update cycle ($depth, $adddays,
    $webdb_dir, and $segments_dir are shell variables
    set beforehand):

    for ((i=1; i <= depth; i++))
    do
      bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
      segment=`ls -d $segments_dir/* | tail -1`
      bin/nutch fetch $segment
      bin/nutch updatedb $webdb_dir $segment
    done

23
Appendix: Parsers
  • HTML
  • lynx -dump
  • Beautiful Soup (Python; example below)
  • TidyLib (C)
  • PDF
  • xpdf
  • Others
  • Nutch plugins
  • Office API (Windows)
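
A quick example of the Beautiful Soup option above, chosen because
it tolerates the ill-formed HTML real crawls encounter. It needs
the third-party bs4 package (pip install beautifulsoup4).

    from bs4 import BeautifulSoup

    html = '<html><body><p>Broken <b>markup<p><a href="/x">link</a>'
    soup = BeautifulSoup(html, "html.parser")
    print(soup.get_text(" ", strip=True))               # text for indexing
    print([a.get("href") for a in soup.find_all("a")])  # -> ['/x']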