Title: High-Performance Web Crawling
1. High-Performance Web Crawling
Marc Najork and Allan Heydon, Compaq Systems Research Center
SRC Research Report 173, September 26, 2001
- October 22, 2002
- CSE-497
- Jason P. Walters
2. The Mercator Project
- "Gerardus Mercator, 1512-1594. Flemish cartographer whose most important innovation was a map, later known as the Mercator projection, on which parallels and meridians are rendered as straight lines spaced so as to produce at any point an accurate ratio of latitude to longitude. Mercator also introduced the term atlas for a collection of maps." -- Encyclopedia Britannica
- Marc Najork
  From http://research.compaq.com/SRC/mercator/research.html and http://research.microsoft.com/najork/
- "Our crawler, like the famous cartographer, aims at producing maps of the known (virtual) world that accurately depict its dimensions. Moreover, our crawler's extensibility means it can be used to produce not just one map, but many." -- The Mercator Team
3. Project Goals
- Project started in 1999; now part of AltaVista Search Engine 3
- To design and implement a high-performance web crawler extensible by third parties
- Modular/flexible design -- allows others to write new modules; platform independence (Java)
- Large scale -- balance bandwidth, memory, and performance
- High performance -- capable of 400 pages per second
- Support multiple protocols -- HTTP, FTP, Gopher
- Different document formats -- MIME types
- Polite -- configurable policies regarding web servers
- Continuous -- keep pages fresh (priority scheduling)
- State recoverable -- checkpointing backs data up at intervals to reduce data loss
- To actually crawl the web and gather statistics
  - Document types, hosts, how dynamic pages are
  - Analyze quality of search engines and indices
4. Architectural Overview
- URL Frontier acts as a central repository for sites to be visited; the seed list starts here.
- Protocol Modules (HTTP, FTP, Gopher): the resource is identified by its scheme; DNS requests are sent to a local DNS server (a custom implementation); the robots file is checked against a cache.
- RIS (Rewind Input Stream) provides an I/O abstraction that prevents extra traffic.
- Content-Seen Checker determines if a document has been seen before.
- Link Extractor retrieves new URLs and makes them absolute.
- Tag Counter, GIF Counter
- Your own customizable processing modules
- URL Filter removes any user-specified URLs (e.g., anything not in .edu) and enforces document exclusions such as no GIFs or JPEGs.
- DUE (Duplicate URL Eliminator): a repository containing the list of already-visited URLs.
- New URLs are added to the URL Frontier and the loop repeats (a simplified version of this loop is sketched below).
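To make the loop above concrete, here is a minimal sketch in Java (Mercator's implementation language) of what one crawling worker thread might look like. All class and method names (Frontier, Fetcher, ContentSeen, LinkExtractor, UrlFilter, Due, and their methods) are hypothetical stand-ins for the components named in the bullets, not Mercator's actual API.

import java.util.List;

// Hypothetical interfaces mirroring the components listed above.
interface Frontier { String removeNext() throws InterruptedException; void add(String url); }
interface Fetcher { byte[] fetch(String url); }                 // one implementation per protocol
interface ContentSeen { boolean alreadySeen(byte[] doc); }      // checksum-based duplicate test
interface LinkExtractor { List<String> extractAbsoluteLinks(String base, byte[] doc); }
interface UrlFilter { boolean accepts(String url); }
interface Due { boolean seenBefore(String url); }               // Duplicate URL Eliminator

public class CrawlWorker implements Runnable {
    private final Frontier frontier;
    private final Fetcher fetcher;          // a real crawler would pick this by the URL's scheme
    private final ContentSeen contentSeen;
    private final LinkExtractor extractor;
    private final UrlFilter filter;
    private final Due due;

    public CrawlWorker(Frontier frontier, Fetcher fetcher, ContentSeen contentSeen,
                       LinkExtractor extractor, UrlFilter filter, Due due) {
        this.frontier = frontier; this.fetcher = fetcher; this.contentSeen = contentSeen;
        this.extractor = extractor; this.filter = filter; this.due = due;
    }

    @Override
    public void run() {
        try {
            while (true) {
                String url = frontier.removeNext();        // 1. take the next URL to crawl
                byte[] doc = fetcher.fetch(url);           // 2. download via the protocol module
                if (doc == null || contentSeen.alreadySeen(doc))
                    continue;                              // 3. skip documents seen before
                for (String link : extractor.extractAbsoluteLinks(url, doc)) {
                    if (filter.accepts(link) && !due.seenBefore(link))
                        frontier.add(link);                // 4. new URLs re-enter the Frontier
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();            // stop cleanly when the crawl ends
        }
    }
}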
5. Mercator Architecture
A 64-bit checksum is used as a fingerprint (sketched after these notes).
Running on a separate machine.
A background process wakes up to see if the crawl should be terminated, logs statistics, and checks whether it is time to create a checkpoint.
The robots file cache maps hostnames to their robots exclusion rules.
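As an illustration of the checksum-based fingerprint mentioned above, here is a minimal sketch of a content-seen test. The 64-bit FNV-1a hash and the purely in-memory HashSet are stand-ins chosen for brevity; Mercator's actual fingerprint function and its partly disk-based storage differ.

import java.util.HashSet;
import java.util.Set;

// Sketch of a content-seen test: hash each downloaded document to a 64-bit
// fingerprint and remember the fingerprints already encountered.
public class ContentSeenTest {
    private final Set<Long> fingerprints = new HashSet<>();

    /** Returns true if an identical document body was seen before. */
    public synchronized boolean alreadySeen(byte[] document) {
        long fp = fnv1a64(document);
        return !fingerprints.add(fp);   // add() returns false if fp was already present
    }

    // 64-bit FNV-1a hash, used here only as a simple stand-in fingerprint.
    private static long fnv1a64(byte[] data) {
        long hash = 0xcbf29ce484222325L;    // FNV offset basis
        for (byte b : data) {
            hash ^= (b & 0xffL);
            hash *= 0x100000001b3L;         // FNV prime
        }
        return hash;
    }
}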
6. DNS Module
- DNS is a bottleneck, yet not CPU-bound! It's a waiting game.
- It was determined that DNS queries are synchronized in the Java interface and happen one at a time; this is also the case with BIND's gethostbyname function.
- Because DNS servers refer to higher authorities when they cannot find an entry, a single query might take tens of seconds.
- By forwarding DNS requests to a local machine running a custom DNS server that can process queries in parallel, lookup went from 70% of the elapsed time down to 14% (the general idea is sketched below).
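The sketch below illustrates the general idea behind the fix: issue DNS lookups from many threads concurrently instead of serializing them. It simply spreads java.net.InetAddress lookups over a thread pool; Mercator instead used its own resolver interface that forwarded requests to a local, multi-threaded name server, so treat this as an illustration of the principle rather than the paper's implementation. The pool size of 50 and the example hostnames are arbitrary assumptions.

import java.net.InetAddress;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Issue DNS lookups concurrently so slow queries overlap instead of queueing.
public class ParallelDns {
    private final ExecutorService pool = Executors.newFixedThreadPool(50);

    /** Resolves a hostname asynchronously; the caller blocks only on get(). */
    public Future<InetAddress> resolve(String host) {
        return pool.submit(() -> InetAddress.getByName(host));
    }

    public static void main(String[] args) throws Exception {
        ParallelDns dns = new ParallelDns();
        List<Future<InetAddress>> results = List.of(
                dns.resolve("example.com"), dns.resolve("example.org"));
        for (Future<InetAddress> f : results)
            System.out.println(f.get());          // slow lookups proceed in parallel
        dns.pool.shutdown();
    }
}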
7. DUE
- An in-memory cache holds fingerprints of the most popular queries (URLs).
- When the hash table reaches its breaking point, it is copied to T.
- U is renamed and a new U is created at the breaking point.
- Filtered URLs are checked against the popular-query cache and T. If found, they are disregarded; otherwise the fingerprint is added to T and the URL to U.
- Double-buffering allows the second buffer to merge while new requests to the DUE are processed.
- New entries to the Frontier: T is merged with F and the added items are tagged; tagged entries get merged with the Frontier (a simplified sketch of the check follows).
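Below is a much-simplified sketch of the check flow just described: a URL's fingerprint is looked up in the cache of popular entries and in the in-memory table, and only unseen URLs are recorded for a later merge with the on-disk structures and hand-off to the Frontier. The disk merge, the double-buffering, the table size, and the hash function here are all omitted or made up for illustration; they are not Mercator's.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified Duplicate URL Eliminator: cache of popular fingerprints plus an
// in-memory table of recent fingerprints; the merge with the disk file is stubbed out.
public class DuplicateUrlEliminator {
    private final Set<Long> popularCache = new HashSet<>();   // fingerprints of popular URLs
    private final Map<Long, String> table = new HashMap<>();  // recent fingerprints -> URLs
    private static final int BREAKING_POINT = 1_000_000;      // illustrative table limit

    /** Returns true if the URL is new; new URLs are buffered for a later disk merge. */
    public synchronized boolean addIfUnseen(String url) {
        long fp = fingerprint(url);
        if (popularCache.contains(fp) || table.containsKey(fp))
            return false;                          // already seen: disregard
        table.put(fp, url);                        // fingerprint added to the table, URL buffered
        if (table.size() >= BREAKING_POINT)
            mergeWithDiskFile();                   // would merge with the disk file and tag new entries
        return true;
    }

    private void mergeWithDiskFile() {
        table.clear();                             // disk merge and Frontier hand-off omitted in this sketch
    }

    // Simple stand-in hash; Mercator uses a proper 64-bit URL fingerprint.
    private static long fingerprint(String url) {
        long h = 1125899906842597L;
        for (int i = 0; i < url.length(); i++) h = 31 * h + url.charAt(i);
        return h;
    }
}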
8. Prioritizing the URL Frontier
- The client controls the order of incoming URLs; the Frontier determines when a site will be crawled. (It also keeps a history of URLs.)
- A single FIFO will not work: because URLs to the same host tend to be near one another, requests would concentrate on a few web servers and throttle the crawl.
- Multiple FIFOs are necessary. Front-end FIFOs hold URLs of a given priority; back-end FIFOs each hold URLs for a particular host. The more back-end queues there are, the more parallel crawling threads there can be (a simplified two-level frontier is sketched below).
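The sketch below shows the two-level structure just described: front-end FIFOs hold URLs by priority and back-end FIFOs hold URLs for a single host, so each crawling thread can work on one host at a time. Politeness delays, the priority-biased selection policy, and the mapping of threads to back-end queues are all omitted; the class and method names are illustrative assumptions, not Mercator's.

import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Two-level frontier sketch: front-end queues per priority, back-end queues per host.
public class TwoLevelFrontier {
    private final List<Queue<String>> frontQueues = new ArrayList<>();      // one FIFO per priority (index 0 = highest)
    private final Map<String, Queue<String>> backQueues = new HashMap<>();  // one FIFO per host

    public TwoLevelFrontier(int priorities) {
        for (int i = 0; i < priorities; i++) frontQueues.add(new ArrayDeque<>());
    }

    /** Client-controlled ordering: a URL enters the front-end queue of its priority. */
    public synchronized void add(String url, int priority) {
        frontQueues.get(priority).add(url);
    }

    /** Moves one URL from the highest non-empty priority queue into its host's back-end queue. */
    public synchronized void refillBackQueues() {
        for (Queue<String> q : frontQueues) {
            String url = q.poll();
            if (url == null) continue;
            String host = URI.create(url).getHost();
            backQueues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
            return;
        }
    }

    /** A crawling thread takes its next URL from the back-end queue of the host it owns. */
    public synchronized String nextForHost(String host) {
        Queue<String> q = backQueues.get(host);
        return (q == null) ? null : q.poll();
    }
}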
9. Parallel Implementation
The Host Splitter checks whether a filtered URL belongs to this crawler process; otherwise it routes the URL to the DUE of the correct process (a sketch of the routing rule follows).
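A sketch of the ownership test a host splitter might apply: hash the URL's host name modulo the number of crawler processes and compare with this process's index. The hash function and the absence of any peer-to-peer forwarding code are simplifying assumptions; only the routing decision is shown.

import java.net.URI;

// Host splitter sketch: decide which crawler process owns a URL by hashing its host name.
public class HostSplitter {
    private final int myIndex;        // index of this crawler process (0..numProcesses-1)
    private final int numProcesses;   // total number of cooperating crawler processes

    public HostSplitter(int myIndex, int numProcesses) {
        this.myIndex = myIndex;
        this.numProcesses = numProcesses;
    }

    /** Returns true if this process is responsible for the URL's host;
     *  otherwise the URL would be forwarded to the owning process's DUE. */
    public boolean ownedLocally(String url) {
        String host = URI.create(url).getHost();
        int owner = Math.floorMod(host.hashCode(), numProcesses);
        return owner == myIndex;
    }
}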
10Crawling Results
Test Platform 4 Compaq AlphaServers 4GB, 650GB,
100MB Ethernet, Bandwidth 160MB/sec
Checkpoints Preformed
11. Crawling Results
Test platform: 4 Compaq AlphaServers, 4 GB RAM, 650 GB disk, 100 Mbit/sec Ethernet; bandwidth 160 Mbit/sec.
12. CONCLUSION
- Did you find this paper very well refined, clear, and articulate? Why? (Prof. Davison)
- Why would Compaq Research Labs spend so much time developing a web crawler?
- What would you add to this web crawler?
- What were the most significant contributions made by Mercator? (Prof. Davison)
13. CONCLUSION
Najork wrote many papers:
- Breadth-First Search Crawling Yields High-Quality Pages (May 2001)
- Mercator: A Scalable, Extensible Web Crawler (Dec 1999)
- Performance Limitations of the Java Core Libraries (June 1999)
- Measuring Index Quality Using Random Walks on the Web (May 1999)
and received many patents:
- Web crawler system using parallel queues. US Patent 6,377,984, issued 4/23/2002.
- System and method for associating an extensible set of data. US Patent 6,351,755, issued 2/26/2002.
- System and method for enforcing politeness. US Patent 6,321,265, issued 11/20/2001.
- System and method for efficient representation of data set addresses in a web crawler. US Patent 6,301,614, issued 10/9/2001.
- Web crawler system using plurality of parallel priority level queues. US Patent 6,263,364, issued 7/17/2001.
http://research.microsoft.com/najork/