Title: High-Performance Web Crawling
1. High-Performance Web Crawling
Marc Najork and Allan Heydon, Compaq Systems Research Center
SRC Research Report 173, September 26, 2001
- October 22, 2002
- CSE-497
- Jason P. Walters
2. The Mercator Project
- "Gerardus Mercator, 1512-1594. Flemish cartographer whose most important innovation was a map, later known as the Mercator projection, on which parallels and meridians are rendered as straight lines spaced so as to produce at any point an accurate ratio of latitude to longitude. Mercator also introduced the term atlas for a collection of maps." -- Encyclopedia Britannica
- Marc Najork
  From http://research.compaq.com/SRC/mercator/research.html and http://research.microsoft.com/najork/
- "Our crawler, like the famous cartographer, aims at producing maps of the known (virtual) world that accurately depict its dimensions. Moreover, our crawler's extensibility means it can be used to produce not just one map, but many." -- The Mercator Team
3. Project Goals
- Project started in 1999; now part of AltaVista Search Engine 3
- To design and implement a high-performance web crawler extensible by third parties
- Modular/flexible design -- allows others to write new modules; platform independence (Java)
- Large scale -- balance bandwidth, memory, and performance
- High performance -- capable of 400 pages per second
- Support multiple protocols -- HTTP, FTP, Gopher
- Different document formats -- MIME types
- Polite -- configurable policies regarding web servers
- Continuous -- keep pages fresh (priority scheduling)
- State recoverable -- checkpointing backs data up at intervals to reduce data loss
- To actually crawl the web and gather statistics
  - Document types, hosts, how dynamic pages are
  - Analyze quality of search engines and indices
4. Architectural Overview
- URL Frontier acts as a central repository for sites to be visited; the seed list starts here.
- Protocol Modules (HTTP, FTP, Gopher): the resource is identified by its scheme; DNS requests are sent to a local DNS server (a custom implementation); the robots file is checked against a cache.
- RIS (Rewind Input Stream) provides an I/O abstraction that prevents extra traffic.
- Content-Seen Checker determines if a document has been seen before.
- Link Extractor retrieves new URLs and makes them absolute.
- Tag Counter, GIF Counter
- Your own customizable processing modules
- URL Filter removes any user-specified URLs (e.g., anything not in .edu) and enforces document exclusions such as no GIFs or JPEGs.
- DUE (Duplicate URL Eliminator): a repository containing the list of already-visited URLs.
- New URLs are added to the URL Frontier and the loop repeats (a simplified version of this loop is sketched below).
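To make the loop above concrete, here is a minimal sketch in Java (Mercator's implementation language) of what one crawling worker thread might look like. All class and method names (Frontier, Fetcher, ContentSeen, LinkExtractor, UrlFilter, Due, and their methods) are hypothetical stand-ins for the components named in the bullets, not Mercator's actual API.

import java.util.List;

// Hypothetical interfaces mirroring the components listed above.
interface Frontier { String removeNext() throws InterruptedException; void add(String url); }
interface Fetcher { byte[] fetch(String url); }                 // one implementation per protocol
interface ContentSeen { boolean alreadySeen(byte[] doc); }      // checksum-based duplicate test
interface LinkExtractor { List<String> extractAbsoluteLinks(String base, byte[] doc); }
interface UrlFilter { boolean accepts(String url); }
interface Due { boolean seenBefore(String url); }               // Duplicate URL Eliminator

public class CrawlWorker implements Runnable {
    private final Frontier frontier;
    private final Fetcher fetcher;          // a real crawler would pick this by the URL's scheme
    private final ContentSeen contentSeen;
    private final LinkExtractor extractor;
    private final UrlFilter filter;
    private final Due due;

    public CrawlWorker(Frontier frontier, Fetcher fetcher, ContentSeen contentSeen,
                       LinkExtractor extractor, UrlFilter filter, Due due) {
        this.frontier = frontier; this.fetcher = fetcher; this.contentSeen = contentSeen;
        this.extractor = extractor; this.filter = filter; this.due = due;
    }

    @Override
    public void run() {
        try {
            while (true) {
                String url = frontier.removeNext();        // 1. take the next URL to crawl
                byte[] doc = fetcher.fetch(url);           // 2. download via the protocol module
                if (doc == null || contentSeen.alreadySeen(doc))
                    continue;                              // 3. skip documents seen before
                for (String link : extractor.extractAbsoluteLinks(url, doc)) {
                    if (filter.accepts(link) && !due.seenBefore(link))
                        frontier.add(link);                // 4. new URLs re-enter the Frontier
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();            // stop cleanly when the crawl ends
        }
    }
}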
5. Mercator Architecture
A 64-bit checksum is used as a fingerprint (sketched after these notes).
Running on a separate machine.
A background process wakes up to see if the crawl should be terminated, logs statistics, and checks whether it is time to create a checkpoint.
The robots file cache maps hostnames to their robots exclusion rules.
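As an illustration of the checksum-based fingerprint mentioned above, here is a minimal sketch of a content-seen test. The 64-bit FNV-1a hash and the purely in-memory HashSet are stand-ins chosen for brevity; Mercator's actual fingerprint function and its partly disk-based storage differ.

import java.util.HashSet;
import java.util.Set;

// Sketch of a content-seen test: hash each downloaded document to a 64-bit
// fingerprint and remember the fingerprints already encountered.
public class ContentSeenTest {
    private final Set<Long> fingerprints = new HashSet<>();

    /** Returns true if an identical document body was seen before. */
    public synchronized boolean alreadySeen(byte[] document) {
        long fp = fnv1a64(document);
        return !fingerprints.add(fp);   // add() returns false if fp was already present
    }

    // 64-bit FNV-1a hash, used here only as a simple stand-in fingerprint.
    private static long fnv1a64(byte[] data) {
        long hash = 0xcbf29ce484222325L;    // FNV offset basis
        for (byte b : data) {
            hash ^= (b & 0xffL);
            hash *= 0x100000001b3L;         // FNV prime
        }
        return hash;
    }
}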
6. DNS Module
- DNS is a bottleneck, yet not CPU-bound! It's a waiting game.
- It was determined that DNS queries are synchronized in the Java interface and happen one at a time; this is also the case with BIND's gethostbyname function.
- Because DNS servers refer to higher authorities when they cannot find an entry, a single query might take tens of seconds.
- By forwarding DNS requests to a local machine running a custom DNS server that can process queries in parallel, lookup went from 70% of the elapsed time down to 14% (the general idea is sketched below).
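The sketch below illustrates the general idea behind the fix: issue DNS lookups from many threads concurrently instead of serializing them. It simply spreads java.net.InetAddress lookups over a thread pool; Mercator instead used its own resolver interface that forwarded requests to a local, multi-threaded name server, so treat this as an illustration of the principle rather than the paper's implementation. The pool size of 50 and the example hostnames are arbitrary assumptions.

import java.net.InetAddress;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Issue DNS lookups concurrently so slow queries overlap instead of queueing.
public class ParallelDns {
    private final ExecutorService pool = Executors.newFixedThreadPool(50);

    /** Resolves a hostname asynchronously; the caller blocks only on get(). */
    public Future<InetAddress> resolve(String host) {
        return pool.submit(() -> InetAddress.getByName(host));
    }

    public static void main(String[] args) throws Exception {
        ParallelDns dns = new ParallelDns();
        List<Future<InetAddress>> results = List.of(
                dns.resolve("example.com"), dns.resolve("example.org"));
        for (Future<InetAddress> f : results)
            System.out.println(f.get());          // slow lookups proceed in parallel
        dns.pool.shutdown();
    }
}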
7. DUE
- An in-memory cache holds fingerprints of the most popular queries (URLs).
- When the hash table reaches its breaking point, it is copied to T.
- U is renamed and a new U is created at the breaking point.
- Filtered URLs are checked against the popular-query cache and T. If found, they are disregarded; otherwise the fingerprint is added to T and the URL to U.
- Double-buffering allows the second buffer to merge while new requests to the DUE are processed.
- New entries to the Frontier: T is merged with F and the added items are tagged; tagged entries get merged with the Frontier (a simplified sketch of the check follows).
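Below is a much-simplified sketch of the check flow just described: a URL's fingerprint is looked up in the cache of popular entries and in the in-memory table, and only unseen URLs are recorded for a later merge with the on-disk structures and hand-off to the Frontier. The disk merge, the double-buffering, the table size, and the hash function here are all omitted or made up for illustration; they are not Mercator's.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified Duplicate URL Eliminator: cache of popular fingerprints plus an
// in-memory table of recent fingerprints; the merge with the disk file is stubbed out.
public class DuplicateUrlEliminator {
    private final Set<Long> popularCache = new HashSet<>();   // fingerprints of popular URLs
    private final Map<Long, String> table = new HashMap<>();  // recent fingerprints -> URLs
    private static final int BREAKING_POINT = 1_000_000;      // illustrative table limit

    /** Returns true if the URL is new; new URLs are buffered for a later disk merge. */
    public synchronized boolean addIfUnseen(String url) {
        long fp = fingerprint(url);
        if (popularCache.contains(fp) || table.containsKey(fp))
            return false;                          // already seen: disregard
        table.put(fp, url);                        // fingerprint added to the table, URL buffered
        if (table.size() >= BREAKING_POINT)
            mergeWithDiskFile();                   // would merge with the disk file and tag new entries
        return true;
    }

    private void mergeWithDiskFile() {
        table.clear();                             // disk merge and Frontier hand-off omitted in this sketch
    }

    // Simple stand-in hash; Mercator uses a proper 64-bit URL fingerprint.
    private static long fingerprint(String url) {
        long h = 1125899906842597L;
        for (int i = 0; i < url.length(); i++) h = 31 * h + url.charAt(i);
        return h;
    }
}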
8. Prioritizing the URL Frontier
- The client controls the order of incoming URLs; the Frontier determines when a site will be crawled. (It also keeps a history of URLs.)
- A single FIFO will not work: because URLs to the same host tend to be near one another, requests would concentrate on a few web servers and throttle the crawl.
- Multiple FIFOs are necessary. Front-end FIFOs hold URLs of a given priority; back-end FIFOs each hold URLs for a particular host. The more back-end queues there are, the more parallel crawling threads there can be (a simplified two-level frontier is sketched below).
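The sketch below shows the two-level structure just described: front-end FIFOs hold URLs by priority and back-end FIFOs hold URLs for a single host, so each crawling thread can work on one host at a time. Politeness delays, the priority-biased selection policy, and the mapping of threads to back-end queues are all omitted; the class and method names are illustrative assumptions, not Mercator's.

import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Two-level frontier sketch: front-end queues per priority, back-end queues per host.
public class TwoLevelFrontier {
    private final List<Queue<String>> frontQueues = new ArrayList<>();      // one FIFO per priority (index 0 = highest)
    private final Map<String, Queue<String>> backQueues = new HashMap<>();  // one FIFO per host

    public TwoLevelFrontier(int priorities) {
        for (int i = 0; i < priorities; i++) frontQueues.add(new ArrayDeque<>());
    }

    /** Client-controlled ordering: a URL enters the front-end queue of its priority. */
    public synchronized void add(String url, int priority) {
        frontQueues.get(priority).add(url);
    }

    /** Moves one URL from the highest non-empty priority queue into its host's back-end queue. */
    public synchronized void refillBackQueues() {
        for (Queue<String> q : frontQueues) {
            String url = q.poll();
            if (url == null) continue;
            String host = URI.create(url).getHost();
            backQueues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
            return;
        }
    }

    /** A crawling thread takes its next URL from the back-end queue of the host it owns. */
    public synchronized String nextForHost(String host) {
        Queue<String> q = backQueues.get(host);
        return (q == null) ? null : q.poll();
    }
}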
9. Parallel Implementation
The Host Splitter checks whether a filtered URL belongs to this crawler process; otherwise it routes the URL to the DUE of the correct process (a sketch of the routing rule follows).
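A sketch of the ownership test a host splitter might apply: hash the URL's host name modulo the number of crawler processes and compare with this process's index. The hash function and the absence of any peer-to-peer forwarding code are simplifying assumptions; only the routing decision is shown.

import java.net.URI;

// Host splitter sketch: decide which crawler process owns a URL by hashing its host name.
public class HostSplitter {
    private final int myIndex;        // index of this crawler process (0..numProcesses-1)
    private final int numProcesses;   // total number of cooperating crawler processes

    public HostSplitter(int myIndex, int numProcesses) {
        this.myIndex = myIndex;
        this.numProcesses = numProcesses;
    }

    /** Returns true if this process is responsible for the URL's host;
     *  otherwise the URL would be forwarded to the owning process's DUE. */
    public boolean ownedLocally(String url) {
        String host = URI.create(url).getHost();
        int owner = Math.floorMod(host.hashCode(), numProcesses);
        return owner == myIndex;
    }
}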
10Crawling Results
Test Platform 4 Compaq AlphaServers 4GB, 650GB,
100MB Ethernet, Bandwidth 160MB/sec
Checkpoints Preformed
11. Crawling Results
Test platform: 4 Compaq AlphaServers, 4 GB RAM, 650 GB disk, 100 Mbit/sec Ethernet; bandwidth 160 Mbit/sec.
12. CONCLUSION
- Did you find this paper very well refined, clear, and articulate? Why? (Prof. Davison)
- Why would Compaq Research Labs spend so much time developing a web crawler?
- What would you add to this web crawler?
- What were the most significant contributions made by Mercator? (Prof. Davison)
13. CONCLUSION
Najork wrote many papers:
- Breadth-First Search Crawling Yields High-Quality Pages (May 2001)
- Mercator: A Scalable, Extensible Web Crawler (Dec 1999)
- Performance Limitations of the Java Core Libraries (June 1999)
- Measuring Index Quality Using Random Walks on the Web (May 1999)
and received many patents:
- Web crawler system using parallel queues. US Patent 6,377,984, issued 4/23/2002.
- System and method for associating an extensible set of data. US Patent 6,351,755, issued 2/26/2002.
- System and method for enforcing politeness. US Patent 6,321,265, issued 11/20/2001.
- System and method for efficient representation of data set addresses in a web crawler. US Patent 6,301,614, issued 10/9/2001.
- Web crawler system using plurality of parallel priority level queues. US Patent 6,263,364, issued 7/17/2001.
http://research.microsoft.com/najork/