Title: Crawling the Web
1. Crawling the Web
- Web pages
  - A few thousand characters long
  - Served over the Internet using the HyperText Transfer Protocol (HTTP)
  - Viewed at the client end using browsers
- Crawler
  - Fetches the pages to a computer
  - At that computer, automatic programs can analyze the hypertext documents
2. HTML
- HyperText Markup Language
- Lets the author
  - specify layout and typeface
  - embed diagrams
  - create hyperlinks
- A hyperlink is expressed as an anchor tag with an HREF attribute
  - HREF names another page using a Uniform Resource Locator (URL)
- A URL consists of (see the parsing sketch below)
  - a protocol field (e.g., http)
  - a server hostname (e.g., www.cse.iitb.ac.in)
  - a file path (e.g., /, the 'root' of the published file system)
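A minimal sketch (not from the slides) of splitting a URL into the protocol, host, and path fields listed above, using Python's standard urllib.parse module; the URL itself is just an example.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.cse.iitb.ac.in/")
print(parts.scheme)    # protocol field: 'http'
print(parts.hostname)  # server hostname: 'www.cse.iitb.ac.in'
print(parts.path)      # file path: '/', the root of the published file system
```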
3. HTTP (HyperText Transfer Protocol)
- Built on top of the Transmission Control Protocol (TCP)
- Steps (from the client end; sketched below)
  - Resolve the server host name to an Internet address (IP)
    - uses the Domain Name System (DNS)
    - DNS is a distributed database of name-to-IP mappings maintained at a set of known servers
  - Contact the server using TCP
    - connect to the default HTTP port (80) on the server
  - Send the HTTP request header (e.g., GET)
  - Fetch the response header
    - MIME (Multipurpose Internet Mail Extensions): a meta-data standard for email and Web content transfer
  - Fetch the HTML page
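A hedged sketch of the client-side steps above: resolve the host with DNS, open a TCP connection to port 80, send a GET request header, then read the response header and HTML page. The host name is an example, not prescribed by the slides.

```python
import socket

host = "www.cse.iitb.ac.in"                      # example host
ip = socket.gethostbyname(host)                  # step 1: DNS resolution
sock = socket.create_connection((ip, 80))        # step 2: TCP connect to port 80
sock.sendall(                                    # step 3: HTTP request header
    f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii"))

response = b""
while True:                                      # steps 4-5: response header + HTML page
    chunk = sock.recv(4096)
    if not chunk:
        break
    response += chunk
sock.close()

header, _, body = response.partition(b"\r\n\r\n")
print(header.decode("iso-8859-1"))               # status line and MIME headers
```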
4. Crawl all Web pages?
- Problem: there is no catalog of all accessible URLs on the Web
- Solution (sketched below)
  - start from a given set of URLs
  - progressively fetch them and scan them for new outlinking URLs
  - fetch these pages in turn
  - submit the text of each page to a text indexing system
  - and so on.
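A minimal sketch of this start-from-seeds strategy. The seed URL, the page limit, and the href regular expression are illustrative assumptions; a real crawler would use a proper HTML parser and hand each page to a text indexing system.

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

seeds = ["http://www.cse.iitb.ac.in/"]            # given set of URLs (example)
frontier = deque(seeds)
seen = set(seeds)

while frontier and len(seen) < 100:               # small bound so the sketch terminates
    url = frontier.popleft()
    try:
        page = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
    except OSError:
        continue
    # ... submit the page text to a text indexing system here ...
    for href in re.findall(r'href="([^"#]+)"', page, re.I):
        outlink = urljoin(url, href)              # scan for new outlinking URLs
        if outlink.startswith("http") and outlink not in seen:
            seen.add(outlink)
            frontier.append(outlink)              # fetch these pages in turn
```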
5. Crawling procedure
- Simple in principle
  - but a great deal of engineering goes into industry-strength crawlers
- Industrial crawlers crawl a substantial fraction of the Web
  - e.g., Alta Vista, Northern Light, Inktomi
- No guarantee that all accessible Web pages will be located in this fashion
- The crawler may never halt
  - pages will be added continually even as it is running
6. Crawling overheads
- Delays involved in
  - resolving the host name in the URL to an IP address using DNS
  - connecting a socket to the server and sending the request
  - receiving the requested page in response
- Solution: overlap the above delays by fetching many pages at the same time
7. Anatomy of a crawler
- Page-fetching threads
  - start with DNS resolution
  - finish when the entire page has been fetched
- Each page is
  - stored in compressed form on disk/tape
  - scanned for outlinks
- Work pool of outlinks
  - maintains network utilization without overloading it
  - dealt with by the load manager
- Continue until the crawler has collected a sufficient number of pages
8. Typical anatomy of a large-scale crawler.
9. Large-scale crawlers: performance and reliability considerations
- Need to fetch many pages at the same time
  - to utilize the network bandwidth
  - a single page fetch may involve several seconds of network latency
- Highly concurrent and parallelized DNS lookups
- Use of asynchronous sockets
  - explicit encoding of the state of a fetch context in a data structure
  - polling sockets to check for completion of network transfers
  - multi-processing or multi-threading is impractical at this scale
- Care in URL extraction
  - eliminating duplicates to reduce redundant fetches
  - avoiding spider traps
10. DNS caching, prefetching and resolution
- A customized DNS component with
  - a custom client for address resolution
  - a caching server
  - a prefetching client
11. Custom client for address resolution
- Tailored for concurrent handling of multiple outstanding requests
- Allows many resolution requests to be issued together (see the sketch below)
  - polling at a later time for completion of individual requests
- Facilitates load distribution among many DNS servers
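A sketch of the issue-many, poll-later pattern described above, using a thread pool as a stand-in for a custom asynchronous DNS client; the host names are examples only.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

resolver = ThreadPoolExecutor(max_workers=50)

# Issue many resolution requests together, without blocking on any of them.
pending = {host: resolver.submit(socket.gethostbyname, host)
           for host in ["www.cse.iitb.ac.in", "www.w3c.org"]}

# Poll at a later time for completion of individual requests.
for host, future in pending.items():
    if future.done() and not future.exception():
        print(host, "->", future.result())
```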
12. Caching server
- With a large cache, persistent across DNS restarts
- Residing largely in memory if possible
13. Prefetching client
- Steps
  - parse a page that has just been fetched
  - extract host names from HREF targets
  - make DNS resolution requests to the caching server
- Usually implemented using UDP
  - User Datagram Protocol
  - a connectionless, packet-based communication protocol
  - does not guarantee packet delivery
- Does not wait for resolution to be completed (see the sketch below)
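A minimal sketch of a prefetching client: extract host names from HREF targets and issue resolution requests without waiting for them to finish. The slides describe a UDP-based implementation; a thread pool is used here purely to keep the sketch self-contained, and the regex is an assumption.

```python
import re
import socket
from urllib.parse import urlsplit
from concurrent.futures import ThreadPoolExecutor

resolver = ThreadPoolExecutor(max_workers=50)

def prefetch_dns(page_html: str) -> None:
    """Warm the DNS cache for every host named in the page's HREF targets."""
    for href in re.findall(r'href="(https?://[^"]+)"', page_html, re.I):
        host = urlsplit(href).hostname
        if host:
            # Fire and forget: do not wait for resolution to complete.
            resolver.submit(socket.gethostbyname, host)
```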
14. Multiple concurrent fetches
- Managing multiple concurrent connections
  - a single download may take several seconds
  - open many socket connections to different HTTP servers simultaneously
- Multi-CPU machines are of little help
  - crawling performance is limited by the network and disk
- Two approaches
  - multi-threading
  - non-blocking sockets with event handlers
15. Multi-threading
- Logical threads
  - physical threads of control provided by the operating system (e.g., pthreads), OR
  - concurrent processes
- A fixed number of threads is allocated in advance
- Programming paradigm (sketched below)
  - create a client socket
  - connect the socket to the HTTP service on a server
  - send the HTTP request header
  - read the socket (recv) until no more characters are available
  - close the socket
- Uses blocking system calls
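A hedged sketch of this paradigm: a fixed number of worker threads allocated in advance, each performing blocking connect/send/recv calls on its own socket. The work queue and fetch_one helper are illustrative assumptions, not the slides' implementation.

```python
import socket
import threading
import queue

work = queue.Queue()          # work pool of (host, path) pairs to fetch
NUM_THREADS = 10              # fixed number of threads allocated in advance

def fetch_one(host: str, path: str) -> bytes:
    sock = socket.create_connection((host, 80))        # create + connect (blocking)
    sock.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii"))
    chunks = []
    while True:
        data = sock.recv(4096)                          # blocking read
        if not data:                                    # until no more characters
            break
        chunks.append(data)
    sock.close()
    return b"".join(chunks)

def worker() -> None:
    while True:
        host, path = work.get()
        try:
            page = fetch_one(host, path)
            # ... store the page and scan it for outlinks here ...
        except OSError:
            pass
        finally:
            work.task_done()

for _ in range(NUM_THREADS):
    threading.Thread(target=worker, daemon=True).start()
```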
16. Multi-threading: problems
- Performance penalty of
  - mutual exclusion for concurrent access to shared data structures
  - slow disk seeks
    - a great deal of interleaved, random input-output on disk
    - due to concurrent modification of the document repository by multiple threads
17. Non-blocking sockets and event handlers
- Non-blocking sockets
  - connect, send and recv calls return immediately without waiting for the network operation to complete
  - the status of the network operation is polled separately
- select system call (see the event-loop sketch below)
  - lets the application suspend until more data can be read from or written to a socket
  - times out after a pre-specified deadline
  - can monitor several sockets at the same time
- More efficient memory management
- Code that completes processing of a page is not interrupted by other completions
- No need for locks and semaphores on the work pool
  - only completed pages are appended to the log
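A sketch of the event-handler approach, assuming non-blocking sockets and the select system call. The state of each fetch context is encoded explicitly in a small dictionary keyed by socket; the host, path, and completion handling are illustrative.

```python
import select
import socket

def start_fetch(host: str, path: str):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)                       # non-blocking socket
    sock.connect_ex((host, 80))                   # returns immediately
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    return sock, {"request": request, "response": b""}

sock, ctx = start_fetch("www.cse.iitb.ac.in", "/")
contexts = {sock: ctx}                            # explicit per-fetch state

while contexts:
    writable = [s for s, c in contexts.items() if c["request"]]
    readable = [s for s, c in contexts.items() if not c["request"]]
    r, w, _ = select.select(readable, writable, [], 5.0)   # pre-specified deadline
    if not r and not w:
        break                                     # timed out with nothing ready

    for s in w:                                   # connection ready: send the request
        s.sendall(contexts[s]["request"])
        contexts[s]["request"] = b""
    for s in r:                                   # data ready: read without blocking
        data = s.recv(4096)
        if data:
            contexts[s]["response"] += data
        else:                                     # fetch complete
            # ... append contexts[s]["response"] to the page log here ...
            s.close()
            del contexts[s]
```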
18. Link extraction and normalization
- Goal: obtain a canonical form of each URL
- URL processing and filtering
  - avoid multiple fetches of pages known by different URLs
- Many IP addresses per host name
  - for load balancing on large sites
  - mirrored contents / contents on the same file system
- Proxy pass
  - mapping of different host names to a single IP address
  - needed to publish many logical sites
- Relative URLs
  - need to be interpreted w.r.t. a base URL
19. Canonical URL
- Formed by (sketched below)
  - using a standard string for the protocol
  - canonicalizing the host name
  - adding an explicit port number
  - normalizing and cleaning up the path
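A hedged sketch of URL canonicalization following the four steps above, built on urllib.parse. The exact normalization rules (default ports, trailing-slash handling) are assumptions, not the slides' rules.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower() or "http"             # standard protocol string
    host = (parts.hostname or "").lower()               # canonicalized host name
    port = parts.port or {"http": 80, "https": 443}.get(scheme, 80)  # explicit port
    path = posixpath.normpath(parts.path or "/")        # normalize and clean up the path
    if parts.path.endswith("/") and path != "/":
        path += "/"
    return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))

print(canonical_url("HTTP://www.CSE.IITB.ac.in/a/b/../c"))
# -> http://www.cse.iitb.ac.in:80/a/c
```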
20. Robot exclusion
- Check whether the server prohibits crawling a normalized URL (see the check below)
  - in the robots.txt file in the HTTP root directory of the server
  - specifies a list of path prefixes which crawlers should not attempt to fetch
  - meant for crawlers only
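A minimal sketch of a robots.txt check using Python's standard urllib.robotparser; the user-agent string and URLs are examples.

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.cse.iitb.ac.in/robots.txt")   # robots.txt in the HTTP root directory
rp.read()

if rp.can_fetch("MyCrawler", "http://www.cse.iitb.ac.in/some/path.html"):
    pass  # the path prefix is not excluded: safe to add the URL to the work pool
```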
21. Eliminating already-visited URLs
- Check whether a URL has already been fetched
  - before adding a new URL to the work pool
  - needs to be very quick
- Achieved by computing an MD5 hash function on the URL
- Exploiting spatio-temporal locality of access
  - two-level hash function (sketched below)
    - most significant bits (say, 24) derived by hashing the host name plus port
    - lower-order bits (say, 40) derived by hashing the path
    - the concatenated bits are used as a key in a B-tree
- Qualifying URLs are added to the frontier of the crawl
  - their hash values are added to the B-tree
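A sketch of the two-level hash key: 24 most significant bits from an MD5 hash of the host name plus port, 40 lower-order bits from an MD5 hash of the path, concatenated into one key. The bit widths follow the slides; the exact bit-packing and the in-memory set standing in for the B-tree are assumptions.

```python
import hashlib
from urllib.parse import urlsplit

def visit_key(url: str) -> int:
    parts = urlsplit(url)
    hostport = f"{parts.hostname}:{parts.port or 80}".encode("ascii")
    path = (parts.path or "/").encode("ascii", "replace")
    hi = int.from_bytes(hashlib.md5(hostport).digest(), "big") >> (128 - 24)  # top 24 bits
    lo = int.from_bytes(hashlib.md5(path).digest(), "big") >> (128 - 40)      # top 40 bits
    return (hi << 40) | lo          # concatenated 64-bit key, e.g. for a B-tree

seen_keys = set()                   # stand-in for the on-disk B-tree
key = visit_key("http://www.cse.iitb.ac.in/index.html")
if key not in seen_keys:
    seen_keys.add(key)              # qualifying URL: add to the frontier and record its key
```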
22. Spider traps
- The crawler must be protected from crashing on
  - ill-formed HTML
    - e.g., a page with 68 kB of null characters
  - misleading sites
    - an indefinite number of pages dynamically generated by CGI scripts
    - paths of arbitrary depth created using soft directory links and path remapping features in the HTTP server
23. Spider traps: solutions
- No automatic technique can be foolproof
- Check for URL length
- Guards (see the sketch below)
  - prepare regular crawl statistics
  - add dominating sites to the guard module
- Disable crawling of active content such as CGI form queries
- Eliminate URLs with non-textual data types
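A hedged sketch of simple guards against spider traps: a cap on URL length, skipping CGI form queries, and skipping obviously non-textual data types. The length limit and extension list are illustrative assumptions.

```python
from urllib.parse import urlsplit

MAX_URL_LENGTH = 256                                   # illustrative cap
NON_TEXTUAL = (".jpg", ".jpeg", ".gif", ".png", ".zip", ".exe", ".mp3")

def passes_guards(url: str) -> bool:
    if len(url) > MAX_URL_LENGTH:                      # check for URL length
        return False
    parts = urlsplit(url)
    if parts.query or "cgi-bin" in parts.path:         # active content / CGI form queries
        return False
    if parts.path.lower().endswith(NON_TEXTUAL):       # non-textual data types
        return False
    return True
```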
24. Avoiding repeated expansion of links on duplicate pages
- Goal: reduce redundancy in crawls
- Duplicate detection
  - mirrored Web pages and sites
- Detecting exact duplicates (sketched below)
  - check against MD5 digests of stored pages
  - represent a relative link v (relative to aliases u1 and u2) as the tuples (h(u1), v) and (h(u2), v)
- Detecting near-duplicates
  - even a single altered character will completely change the digest!
    - e.g., the date of update, or the name and email of the site administrator
  - Solution: shingling
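A minimal sketch of exact-duplicate detection by content digest, plus a tiny shingle-based resemblance measure for near-duplicates. MD5 and shingling follow the slides; the shingle width and Jaccard-style comparison are assumptions.

```python
import hashlib

seen_digests = {}                      # digest -> canonical URL of the stored page

def is_exact_duplicate(url: str, content: bytes) -> bool:
    digest = hashlib.md5(content).hexdigest()
    if digest in seen_digests:
        return True                    # links on this page were already expanded
    seen_digests[digest] = url
    return False

def shingles(text: str, w: int = 4) -> set:
    """w-word shingles; near-duplicate pages share most of their shingles."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}

def resemblance(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / max(1, len(sa | sb))
```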
25. Load monitor
- Keeps track of various system statistics
  - recent performance of the wide area network (WAN) connection
    - e.g., latency and bandwidth estimates
  - an operator-provided/estimated upper bound on the number of open sockets for the crawler
  - the current number of active sockets
26. Thread manager
- Responsible for
  - choosing units of work from the frontier
  - scheduling the issue of network requests
  - distributing these requests over multiple ISPs, if appropriate
- Uses statistics from the load monitor
27. Per-server work queues
- Servers protect themselves against denial of service (DoS) attacks
  - by limiting the speed or frequency of responses to any fixed client IP address
- Avoiding the appearance of DoS
  - limit the number of active requests to a given server IP address at any time
  - maintain a queue of requests for each server (see the sketch below)
  - use the HTTP/1.1 persistent socket capability
  - distribute attention relatively evenly among a large number of sites
- Access locality vs. politeness dilemma
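A hedged sketch of per-server work queues: one FIFO queue per server host, with a minimum delay between requests to the same server. The delay value and data structures are illustrative assumptions.

```python
import time
from collections import defaultdict, deque
from typing import Optional
from urllib.parse import urlsplit

MIN_DELAY = 2.0                            # assumed seconds between hits on one server
queues = defaultdict(deque)                # server host -> queue of URLs
last_hit = defaultdict(float)              # server host -> time of last request

def enqueue(url: str) -> None:
    queues[urlsplit(url).hostname].append(url)

def next_polite_url() -> Optional[str]:
    """Pick a URL from some server that has not been contacted recently."""
    now = time.monotonic()
    for host, q in queues.items():
        if q and now - last_hit[host] >= MIN_DELAY:
            last_hit[host] = now
            return q.popleft()
    return None                            # every non-empty queue is still 'cooling off'
```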
28. Text repository
- The crawler's last task
  - dumping fetched pages into a repository
- Decoupling the crawler from other functions is preferred for efficiency and reliability
- Page-related information is stored in two parts
  - meta-data
  - page contents
29. Storage of page-related information
- Meta-data
  - relational in nature
  - usually managed by custom software to avoid the overheads of a relational database system
    - the text index involves bulk updates
  - includes fields like content-type, last-modified date, content-length, HTTP status code, etc.
30. Page contents storage
- A typical HTML Web page compresses to 2-4 kB (using zlib)
- File systems have a 4-8 kB file block size
  - too large: one file per page wastes space
- Page storage is managed by a custom storage manager (see the sketch below)
  - provides simple access methods for
    - the crawler to add pages
    - subsequent programs (indexer, etc.) to retrieve documents
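A minimal sketch of such a custom storage manager: compress each page with zlib and append it to one large log file, recording (offset, length) per URL so the indexer can retrieve documents later. The file name and the in-memory index are illustrative assumptions.

```python
import zlib

class PageStore:
    def __init__(self, path: str = "pages.log"):
        self.log = open(path, "ab+")
        self.index = {}                               # URL -> (offset, compressed length)

    def add(self, url: str, html: bytes) -> None:     # used by the crawler
        blob = zlib.compress(html)                    # typically 2-4 kB per page
        self.log.seek(0, 2)                           # append at the end of the log
        offset = self.log.tell()
        self.log.write(blob)
        self.index[url] = (offset, len(blob))

    def get(self, url: str) -> bytes:                 # used by the indexer, etc.
        offset, length = self.index[url]
        self.log.seek(offset)
        return zlib.decompress(self.log.read(length))
```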
31. Page storage
- Small-scale systems
  - the repository fits within the disks of a single machine
  - use a storage manager (e.g., Berkeley DB)
    - manages disk-based databases within a single file
    - can be configured as a hash table or B-tree keyed on the URL
    - or as a sequential log of page records, to handle ordered access of pages
      - sufficient, since the indexer can handle pages in any order
32. Page storage
- Large-scale systems
  - the repository is distributed over a number of storage servers
- Storage servers
  - connected to the crawler through a fast local network (e.g., Ethernet)
  - pages hashed by URL to a storage server
- With 'T3'-grade leased lines, such a configuration can handle about 10 million pages (40 GB) per hour
33. Large-scale crawlers often use multiple ISPs and a bank of local storage servers to store the pages crawled.
34. Refreshing crawled pages
- A search engine's index should be fresh
- A Web-scale crawler never 'completes' its job
- High variance in the rate of page changes
- The HTTP protocol provides an If-modified-since request header (sketched below)
  - but this alone is impractical for a crawler
- Solution
  - at the commencement of a new crawling round, estimate which pages have changed
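A minimal sketch of a conditional GET using the If-Modified-Since request header mentioned above; urllib raises an HTTPError with code 304 when the page has not changed. The URL and date are examples.

```python
import urllib.error
import urllib.request

req = urllib.request.Request(
    "http://www.cse.iitb.ac.in/",
    headers={"If-Modified-Since": "Sat, 01 Jan 2022 00:00:00 GMT"})
try:
    page = urllib.request.urlopen(req, timeout=10).read()   # page changed: refetch it
except urllib.error.HTTPError as err:
    if err.code == 304:
        pass                                                 # not modified since last crawl
    else:
        raise
```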
35. Determining page changes
- The Expires HTTP response header
  - for pages that come with an expiry date
- Otherwise the crawler needs to guess whether revisiting the page will yield a modified version
  - maintain a score reflecting the probability that the page has been modified
  - the crawler fetches URLs in decreasing order of score
- Assumption: the recent past predicts the future
36. Estimating page change rates
- Brewington and Cybenko; Cho
  - algorithms for maintaining a crawl in which most pages are fresher than a specified epoch
- Prerequisite
  - the average interval at which the crawler checks for changes is smaller than the inter-modification times of a page
- Small-scale intermediate crawler runs
  - to monitor fast-changing sites
    - e.g., current news, weather, etc.
  - intermediate indices are patched into the master index
37. Putting together a crawler
- Reference implementation of the HTTP client protocol
  - World Wide Web Consortium (http://www.w3c.org/)
  - the w3c-libwww package
38. Design of the core components: the Crawler class
- Purpose: copy bytes from network sockets to storage media
- Three methods express the Crawler's contract with the user (sketched below)
  - pushing a URL to be fetched to the Crawler (fetchPush)
  - a termination callback handler (fetchDone), called with the same URL
  - a method (start) which starts the Crawler's event loop
- Implementation of the Crawler class
  - needs two helper classes, called DNS and Fetch
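A hedged Python sketch of the Crawler interface described above. The reference implementation (w3c-libwww) is a C library; here the method bodies are placeholders and the DNS and Fetch helpers are only stubbed, so this shows the contract, not the real implementation.

```python
class DNS:                      # helper: asynchronous name resolution (stub)
    pass

class Fetch:                    # helper: one in-progress page fetch (stub)
    pass

class Crawler:
    def __init__(self):
        self.work_pool = []     # URLs waiting to be fetched

    def fetchPush(self, url: str) -> None:
        """Push a URL to be fetched to the Crawler."""
        self.work_pool.append(url)

    def fetchDone(self, url: str, page: bytes) -> None:
        """Termination callback, invoked with the same URL when its fetch completes.
        A user subclasses Crawler and overrides this to store/scan the page."""

    def start(self) -> None:
        """Start the Crawler's event loop: resolve, fetch, and invoke fetchDone."""
        raise NotImplementedError   # event loop omitted in this sketch
```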