1
High-Performance Web Crawling
Marc Najork and Allan Heydon, Compaq Systems Research Center
SRC Research Report 173, September 26, 2001
  • October 22, 2002
  • CSE-497
  • Jason P. Walters

2
The Mercator Project
  • Gerardus Mercator, 1512-1594. Flemish
    cartographer whose most important innovation was
    a map, later known as the Mercator projection, on
    which parallels and meridians are rendered as
    straight lines spaced so as to produce at any
    point an accurate ratio of latitude to longitude.
    Mercator also introduced the term atlas for a
    collection of maps. --Encyclopedia Britannica

Marc Najork
From http://research.compaq.com/SRC/mercator/research.html and http://research.microsoft.com/najork/
Our crawler, like the famous cartographer, aims at producing maps of the known (virtual) world that accurately depict its dimensions. Moreover, our crawler's extensibility means it can be used to produce not just one map, but many. --The Mercator Team
3
Project Goals
  • Project started in 1999; now part of AltaVista Search Engine 3
  • To design and implement a high-performance web
    crawler extensible by third parties
  • Modular/Flexible Design -- allow others to write new modules; platform independence (Java). A hypothetical module sketch follows this list.
  • Large Scale -- balance bandwidth, memory,
    performance
  • High Performance -- capable of 400 pages per
    second
  • Support multiple protocols -- HTTP, FTP, Gopher
  • Different document formats -- MIME types
  • Polite -- configurable policies regarding web
    servers
  • Continuous -- keep pages fresh (priority
    scheduling)
  • State recoverable -- checkpointing backs data up at intervals to reduce data loss
  • To actually crawl the web and gather statistics
  • Document types, hosts, and how dynamic pages are
  • Analyze quality of search engines and indices
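
The modular-design goal above is easiest to see as code. The interface below is a hypothetical sketch of what a third-party analysis module might look like in Java; the names and method signatures are illustrative assumptions, not Mercator's actual API.

    // Hypothetical sketch of a pluggable analysis module; names are
    // illustrative only and do not reflect Mercator's real interfaces.
    import java.util.concurrent.ConcurrentHashMap;

    interface Analyzer {
        // Called once per downloaded document.
        void process(String url, byte[] content, String mimeType);
    }

    // Example third-party module: counts documents per MIME type,
    // similar in spirit to the Tag Counter / GIF Counter modules.
    class MimeTypeCounter implements Analyzer {
        private final ConcurrentHashMap<String, Long> counts = new ConcurrentHashMap<>();

        @Override
        public void process(String url, byte[] content, String mimeType) {
            counts.merge(mimeType, 1L, Long::sum);
        }

        public void report() {
            counts.forEach((type, n) -> System.out.println(type + ": " + n));
        }
    }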

4
Architectural Overview
  • URL Frontier acts as a central repository for sites to be visited; the seed list starts here.
  • Protocol Modules (HTTP, FTP, Gopher) -- the resource is identified by its scheme; DNS requests are sent to a local DNS server (a custom implementation); the robots file is checked against a cache.
  • RIS (Rewind Input Stream) provides an I/O abstraction that prevents extra traffic.
  • Content Checker determines if a document has been seen before.
  • Link Extractor retrieves new URLs and makes them absolute.
  • Tag Counter, GIF Counter.
  • Your own customizable processing modules.
  • URL Filter removes any user-specified URLs, e.g., non-.edu sites, or document exclusions such as no GIFs or JPEGs.
  • DUE -- repository containing the list of already visited URLs.
  • New URLs are added to the URL Frontier and the loop repeats (a simplified sketch of this loop follows the list).
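
Read top to bottom, those bullets describe one pass of the crawl loop. The Java sketch below is a deliberately simplified, single-threaded rendering of that loop; every class and method in it is a placeholder for the corresponding component above, not one of Mercator's real classes.

    // Simplified, hypothetical sketch of the crawl loop described above.
    // All helper types here stand in for components from the list; none of
    // these are Mercator's actual classes.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Queue;
    import java.util.Set;

    class CrawlLoopSketch {
        static final Queue<String> frontier = new ArrayDeque<>();   // URL Frontier
        static final Set<String> seenUrls = new HashSet<>();        // stands in for the DUE
        static final Set<Long> seenContent = new HashSet<>();       // content-seen test
        static final HttpClient http = HttpClient.newHttpClient();

        public static void main(String[] args) throws Exception {
            frontier.add("http://example.com/");                    // seed list
            while (!frontier.isEmpty()) {
                String url = frontier.poll();
                HttpResponse<byte[]> resp = http.send(              // protocol module (HTTP only here)
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofByteArray());
                byte[] body = resp.body();
                long fp = java.util.Arrays.hashCode(body);           // crude 32-bit stand-in for a 64-bit fingerprint
                if (!seenContent.add(fp)) continue;                  // content checker: duplicate document
                for (String link : extractLinks(url, body)) {        // link extractor (returns absolute URLs)
                    if (!passesFilter(link)) continue;               // URL filter
                    if (seenUrls.add(link)) frontier.add(link);      // DUE: only new URLs re-enter the frontier
                }
            }
        }

        static List<String> extractLinks(String base, byte[] body) { return List.of(); } // placeholder
        static boolean passesFilter(String url) { return true; }                         // placeholder
    }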

5
Mercator Architecture
  • A 64-bit checksum is used as a fingerprint.
  • Running on a separate machine.
  • A background process wakes up to see if the crawl should be terminated, logs statistics, and checks whether it is time to create a checkpoint.
  • The robots file cache maps hostnames to their robots exclusion rules (a rough sketch of the fingerprint and robots cache follows these notes).
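
The fingerprint and robots-cache annotations above can be illustrated with a small sketch. Mercator used Rabin fingerprints; the CRC-32-based 64-bit value and the cache layout below are assumptions made only for illustration.

    // Hedged sketch: a 64-bit document fingerprint plus a robots-exclusion cache.
    // Mercator used Rabin fingerprints; this CRC-32 combination is only a simple
    // stand-in, and the cache structure is an assumption.
    import java.util.List;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.zip.CRC32;

    class FingerprintAndRobotsCache {
        // Derives a 64-bit value from two CRC-32 passes; NOT Mercator's actual scheme.
        static long fingerprint(byte[] content) {
            CRC32 lo = new CRC32();
            lo.update(content);
            CRC32 hi = new CRC32();
            hi.update(content);
            hi.update(0x5a);                         // perturb so the two halves differ
            return (hi.getValue() << 32) | lo.getValue();
        }

        // Maps hostname -> disallowed path prefixes parsed from that host's robots.txt.
        static final ConcurrentHashMap<String, List<String>> robotsCache = new ConcurrentHashMap<>();

        static boolean allowed(String host, String path) {
            List<String> disallowed = robotsCache.getOrDefault(host, List.of());
            return disallowed.stream().noneMatch(path::startsWith);
        }
    }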
6
DNS Module
  • DNS is a bottleneck, yet not CPU-bound! It's a waiting game.
  • It was determined that DNS queries are synchronized in Java and happen one at a time; this is also the case with BIND's gethostbyname function.
  • Because DNS servers refer to higher authorities if they cannot find an entry, a single query might take tens of seconds.
  • By forwarding DNS requests to a local machine running a custom DNS server that can process queries in parallel, lookup went from 70% of the elapsed time down to 14% (see the sketch after this list).
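
As a rough illustration of why overlapping lookups helps, the sketch below issues several DNS queries from a thread pool so that one slow lookup does not stall the others. It assumes a modern JVM and is not Mercator's custom multi-threaded resolver.

    // Hedged sketch: overlapping DNS lookups with a thread pool, so a slow
    // lookup (one that walks up to higher-authority servers) does not block
    // the rest. This is an illustration, not Mercator's custom DNS server.
    import java.net.InetAddress;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    class ParallelDnsSketch {
        public static void main(String[] args) throws Exception {
            List<String> hosts = List.of("example.com", "example.org", "example.net");
            ExecutorService pool = Executors.newFixedThreadPool(hosts.size());
            List<Future<InetAddress>> results = hosts.stream()
                .map(h -> pool.submit(() -> InetAddress.getByName(h)))   // each lookup waits independently
                .toList();
            for (int i = 0; i < hosts.size(); i++) {
                try {
                    System.out.println(hosts.get(i) + " -> " + results.get(i).get());
                } catch (Exception e) {
                    System.out.println(hosts.get(i) + " -> lookup failed");
                }
            }
            pool.shutdown();
        }
    }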

7
DUE
  • Fingerprints of the most popular queries are cached.
  • When the hash table reaches the breaking point it is copied to T; U is renamed and a new U is created at the breaking point.
  • Filtered URLs are checked against the popular queries and T; if found, they are disregarded; otherwise the fingerprint is added to T and the URL to U.
  • Results: double-buffering allows the second buffer to merge while new requests to the DUE are processed.
  • New entries to the Frontier: T is merged with F and added items are tagged; tagged entries get merged with the Frontier (a reduced sketch of this URL-seen test follows these notes).
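
A heavily reduced sketch of that URL-seen test is shown below: a small cache of popular fingerprints is consulted first, then a larger fingerprint set. The disk-resident file and the background merge that the real DUE performs are omitted, and the fingerprint function is a placeholder.

    // Hedged sketch of a URL-seen test (DUE): check a cache of popular URL
    // fingerprints first, then the full set. Mercator keeps most fingerprints
    // on disk and merges buffers in the background; that machinery is omitted.
    import java.util.HashSet;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Set;

    class UrlSeenSketch {
        static final int CACHE_SIZE = 1 << 16;

        // LRU cache of recently/popularly seen fingerprints.
        final Map<Long, Boolean> popular = new LinkedHashMap<Long, Boolean>(CACHE_SIZE, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<Long, Boolean> e) {
                return size() > CACHE_SIZE;
            }
        };
        final Set<Long> allFingerprints = new HashSet<>();   // stands in for the disk-resident set

        // Returns true if the URL is new and should be added to the frontier.
        boolean addIfUnseen(String url) {
            long fp = fingerprint(url);
            if (popular.containsKey(fp)) return false;        // cache hit: already seen
            boolean isNew = allFingerprints.add(fp);
            popular.put(fp, Boolean.TRUE);
            return isNew;
        }

        // Placeholder 64-bit URL fingerprint; Mercator used a stronger fingerprint function.
        static long fingerprint(String url) {
            long h = 1125899906842597L;
            for (int i = 0; i < url.length(); i++) h = 31 * h + url.charAt(i);
            return h;
        }
    }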
8
Prioritizing URL Frontier
The client controls the order of URLs coming in; the Frontier determines when a site will be crawled (it also keeps a history of URLs). A single FIFO will not work because of the locality of URLs to one another: web servers would be throttled. Multiple FIFOs are necessary. Front-end FIFOs are for a given priority; back-end FIFOs are for a particular host. The more back-end queues there are, the more parallel crawling threads there can be (a minimal sketch of this structure follows).
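
Below is a minimal sketch of that two-level structure, assuming lower numbers mean higher priority and one crawling thread per back-end queue; the sizing and host-assignment policy are simplifications, not the paper's exact scheme.

    // Hedged sketch of the prioritized URL frontier: front-end FIFOs per
    // priority level feed back-end FIFOs, each dedicated to a single host,
    // so no server is hit by two threads at once. Policies here are assumed.
    import java.net.URI;
    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Queue;

    class FrontierSketch {
        final Map<Integer, Queue<String>> frontQueues = new HashMap<>();  // priority -> FIFO (lower = higher priority, an assumption)
        final Map<String, Queue<String>> backQueues = new HashMap<>();    // host -> FIFO

        synchronized void add(String url, int priority) {
            frontQueues.computeIfAbsent(priority, p -> new ArrayDeque<>()).add(url);
        }

        // Drains the highest-priority front queue into per-host back queues.
        synchronized void refillBackQueues() {
            frontQueues.keySet().stream().sorted().findFirst().ifPresent(p -> {
                Queue<String> q = frontQueues.get(p);
                while (!q.isEmpty()) {
                    String url = q.poll();
                    String host = URI.create(url).getHost();
                    backQueues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
                }
                frontQueues.remove(p);
            });
        }

        // Each crawling thread owns one host's queue, which keeps crawling polite.
        synchronized String nextForHost(String host) {
            Queue<String> q = backQueues.get(host);
            return (q == null) ? null : q.poll();
        }
    }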
9
Parallel Implementation
The Host Splitter checks whether a filtered URL belongs to this process; otherwise it routes the URL to the correct process's DUE (a small sketch follows).
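
A small sketch of that routing decision, assuming hosts are assigned to processes by a simple modular hash; the slide does not specify the actual assignment policy.

    // Hedged sketch of the host splitter in the parallel configuration: a URL
    // is handled locally only if its host hashes to this process; otherwise it
    // is forwarded to the peer crawler responsible for that host. The modular
    // hash below is an assumed policy, not necessarily Mercator's.
    import java.net.URI;

    class HostSplitterSketch {
        final int numProcesses;
        final int myIndex;

        HostSplitterSketch(int numProcesses, int myIndex) {
            this.numProcesses = numProcesses;
            this.myIndex = myIndex;
        }

        int ownerOf(String url) {
            String host = URI.create(url).getHost();
            return Math.floorMod(host.hashCode(), numProcesses);
        }

        void route(String url) {
            if (ownerOf(url) == myIndex) {
                checkAgainstLocalDue(url);        // belongs to this process: local URL-seen test
            } else {
                forwardToPeer(ownerOf(url), url); // belongs elsewhere: send to that process's DUE
            }
        }

        void checkAgainstLocalDue(String url) { /* placeholder */ }
        void forwardToPeer(int peer, String url) { /* placeholder */ }
    }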
10
Crawling Results
Test platform: 4 Compaq AlphaServers, 4 GB RAM, 650 GB disk, 100 Mbit Ethernet, 160 Mbit/sec bandwidth
Checkpoints performed
11
Crawling Results
Test platform: 4 Compaq AlphaServers, 4 GB RAM, 650 GB disk, 100 Mbit Ethernet, 160 Mbit/sec bandwidth
12
CONCLUSION
Did you find this paper very well refined, clear
and articulate? Why? (Prof. Davison)
Why would Compaq Research Labs spend so much time
developing a web crawler?
What would you add to this web crawler?
What were the most significant contributions made
by Mercator? (Prof. Davison)
13
CONCLUSION
Najork wrote many papers:
  • Breadth-First Search Crawling Yields High-Quality Pages (May 2001)
  • Mercator: A Scalable, Extensible Web Crawler (Dec 1999)
  • Performance Limitations of the Java Core Libraries (June 1999)
  • Measuring Index Quality Using Random Walks on the Web (May 1999)
and received many patents:
  • Web crawler system using parallel queues. US Patent 6,377,984, issued 4/23/2002.
  • System and method for associating an extensible set of data. US Patent 6,351,755, issued 2/26/2002.
  • System and method for enforcing politeness. US Patent 6,321,265, issued 11/20/2001.
  • System and method for efficient representation of data set addresses in a web crawler. US Patent 6,301,614, issued 10/9/2001.
  • Web crawler system using plurality of parallel priority level queues. US Patent 6,263,364, issued 7/17/2001.
http://research.microsoft.com/najork/