Title: Crawling the Web
1. Crawling the Web
- Web pages
  - A few thousand characters long
  - Served over the Internet using the HyperText Transfer Protocol (HTTP)
  - Viewed at the client end using browsers
- Crawler
  - Fetches the pages to a computer
  - At that computer, automatic programs can analyze the hypertext documents
2. HTML
- HyperText Markup Language
- Lets the author
  - specify layout and typeface
  - embed diagrams
  - create hyperlinks
- A hyperlink is expressed as an anchor tag with an HREF attribute
  - HREF names another page using a Uniform Resource Locator (URL)
- A URL consists of (see the parsing sketch below)
  - a protocol field (e.g., http)
  - a server hostname (e.g., www.cse.iitb.ac.in)
  - a file path (e.g., /, the 'root' of the published file system)
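A minimal sketch (not from the slides) of splitting a URL into the protocol, host, and path fields listed above, using Python's standard urllib.parse module; the URL itself is just an example.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.cse.iitb.ac.in/")
print(parts.scheme)    # protocol field: 'http'
print(parts.hostname)  # server hostname: 'www.cse.iitb.ac.in'
print(parts.path)      # file path: '/', the root of the published file system
```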
3. HTTP (HyperText Transfer Protocol)
- Built on top of the Transmission Control Protocol (TCP)
- Steps (from the client end; sketched below)
  - Resolve the server host name to an Internet address (IP)
    - uses the Domain Name System (DNS)
    - DNS is a distributed database of name-to-IP mappings maintained at a set of known servers
  - Contact the server using TCP
    - connect to the default HTTP port (80) on the server
  - Send the HTTP request header (e.g., GET)
  - Fetch the response header
    - MIME (Multipurpose Internet Mail Extensions): a meta-data standard for email and Web content transfer
  - Fetch the HTML page
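A hedged sketch of the client-side steps above: resolve the host with DNS, open a TCP connection to port 80, send a GET request header, then read the response header and HTML page. The host name is an example, not prescribed by the slides.

```python
import socket

host = "www.cse.iitb.ac.in"                      # example host
ip = socket.gethostbyname(host)                  # step 1: DNS resolution
sock = socket.create_connection((ip, 80))        # step 2: TCP connect to port 80
sock.sendall(                                    # step 3: HTTP request header
    f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii"))

response = b""
while True:                                      # steps 4-5: response header + HTML page
    chunk = sock.recv(4096)
    if not chunk:
        break
    response += chunk
sock.close()

header, _, body = response.partition(b"\r\n\r\n")
print(header.decode("iso-8859-1"))               # status line and MIME headers
```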
4. Crawl all Web pages?
- Problem: there is no catalog of all accessible URLs on the Web
- Solution (sketched below)
  - start from a given set of URLs
  - progressively fetch them and scan them for new outlinking URLs
  - fetch these pages in turn
  - submit the text of each page to a text indexing system
  - and so on.
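A minimal sketch of this start-from-seeds strategy. The seed URL, the page limit, and the href regular expression are illustrative assumptions; a real crawler would use a proper HTML parser and hand each page to a text indexing system.

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

seeds = ["http://www.cse.iitb.ac.in/"]            # given set of URLs (example)
frontier = deque(seeds)
seen = set(seeds)

while frontier and len(seen) < 100:               # small bound so the sketch terminates
    url = frontier.popleft()
    try:
        page = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
    except OSError:
        continue
    # ... submit the page text to a text indexing system here ...
    for href in re.findall(r'href="([^"#]+)"', page, re.I):
        outlink = urljoin(url, href)              # scan for new outlinking URLs
        if outlink.startswith("http") and outlink not in seen:
            seen.add(outlink)
            frontier.append(outlink)              # fetch these pages in turn
```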
5. Crawling procedure
- Simple in principle
  - but a great deal of engineering goes into industry-strength crawlers
- Industrial crawlers crawl a substantial fraction of the Web
  - e.g., Alta Vista, Northern Light, Inktomi
- No guarantee that all accessible Web pages will be located in this fashion
- The crawler may never halt
  - pages will be added continually even as it is running
6. Crawling overheads
- Delays involved in
  - resolving the host name in the URL to an IP address using DNS
  - connecting a socket to the server and sending the request
  - receiving the requested page in response
- Solution: overlap the above delays by fetching many pages at the same time
7. Anatomy of a crawler
- Page-fetching threads
  - start with DNS resolution
  - finish when the entire page has been fetched
- Each page is
  - stored in compressed form on disk/tape
  - scanned for outlinks
- Work pool of outlinks
  - maintains network utilization without overloading it
  - dealt with by the load manager
- Continue until the crawler has collected a sufficient number of pages
8. Typical anatomy of a large-scale crawler.
9. Large-scale crawlers: performance and reliability considerations
- Need to fetch many pages at the same time
  - to utilize the network bandwidth
  - a single page fetch may involve several seconds of network latency
- Highly concurrent and parallelized DNS lookups
- Use of asynchronous sockets
  - explicit encoding of the state of a fetch context in a data structure
  - polling sockets to check for completion of network transfers
  - multi-processing or multi-threading is impractical at this scale
- Care in URL extraction
  - eliminating duplicates to reduce redundant fetches
  - avoiding spider traps
10. DNS caching, prefetching and resolution
- A customized DNS component with
  - a custom client for address resolution
  - a caching server
  - a prefetching client
11. Custom client for address resolution
- Tailored for concurrent handling of multiple outstanding requests
- Allows many resolution requests to be issued together (see the sketch below)
  - polling at a later time for completion of individual requests
- Facilitates load distribution among many DNS servers
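A sketch of the issue-many, poll-later pattern described above, using a thread pool as a stand-in for a custom asynchronous DNS client; the host names are examples only.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

resolver = ThreadPoolExecutor(max_workers=50)

# Issue many resolution requests together, without blocking on any of them.
pending = {host: resolver.submit(socket.gethostbyname, host)
           for host in ["www.cse.iitb.ac.in", "www.w3c.org"]}

# Poll at a later time for completion of individual requests.
for host, future in pending.items():
    if future.done() and not future.exception():
        print(host, "->", future.result())
```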
12. Caching server
- With a large cache, persistent across DNS restarts
- Residing largely in memory if possible
13. Prefetching client
- Steps
  - parse a page that has just been fetched
  - extract host names from HREF targets
  - make DNS resolution requests to the caching server
- Usually implemented using UDP
  - User Datagram Protocol
  - a connectionless, packet-based communication protocol
  - does not guarantee packet delivery
- Does not wait for resolution to be completed (see the sketch below)
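A minimal sketch of a prefetching client: extract host names from HREF targets and issue resolution requests without waiting for them to finish. The slides describe a UDP-based implementation; a thread pool is used here purely to keep the sketch self-contained, and the regex is an assumption.

```python
import re
import socket
from urllib.parse import urlsplit
from concurrent.futures import ThreadPoolExecutor

resolver = ThreadPoolExecutor(max_workers=50)

def prefetch_dns(page_html: str) -> None:
    """Warm the DNS cache for every host named in the page's HREF targets."""
    for href in re.findall(r'href="(https?://[^"]+)"', page_html, re.I):
        host = urlsplit(href).hostname
        if host:
            # Fire and forget: do not wait for resolution to complete.
            resolver.submit(socket.gethostbyname, host)
```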
14. Multiple concurrent fetches
- Managing multiple concurrent connections
  - a single download may take several seconds
  - open many socket connections to different HTTP servers simultaneously
- Multi-CPU machines are of little help
  - crawling performance is limited by the network and disk
- Two approaches
  - multi-threading
  - non-blocking sockets with event handlers
15. Multi-threading
- Logical threads
  - physical threads of control provided by the operating system (e.g., pthreads), OR
  - concurrent processes
- A fixed number of threads is allocated in advance
- Programming paradigm (sketched below)
  - create a client socket
  - connect the socket to the HTTP service on a server
  - send the HTTP request header
  - read the socket (recv) until no more characters are available
  - close the socket
- Uses blocking system calls
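A hedged sketch of this paradigm: a fixed number of worker threads allocated in advance, each performing blocking connect/send/recv calls on its own socket. The work queue and fetch_one helper are illustrative assumptions, not the slides' implementation.

```python
import socket
import threading
import queue

work = queue.Queue()          # work pool of (host, path) pairs to fetch
NUM_THREADS = 10              # fixed number of threads allocated in advance

def fetch_one(host: str, path: str) -> bytes:
    sock = socket.create_connection((host, 80))        # create + connect (blocking)
    sock.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii"))
    chunks = []
    while True:
        data = sock.recv(4096)                          # blocking read
        if not data:                                    # until no more characters
            break
        chunks.append(data)
    sock.close()
    return b"".join(chunks)

def worker() -> None:
    while True:
        host, path = work.get()
        try:
            page = fetch_one(host, path)
            # ... store the page and scan it for outlinks here ...
        except OSError:
            pass
        finally:
            work.task_done()

for _ in range(NUM_THREADS):
    threading.Thread(target=worker, daemon=True).start()
```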
16. Multi-threading: problems
- Performance penalty of
  - mutual exclusion for concurrent access to shared data structures
  - slow disk seeks
    - a great deal of interleaved, random input-output on disk
    - due to concurrent modification of the document repository by multiple threads
17. Non-blocking sockets and event handlers
- Non-blocking sockets
  - connect, send and recv calls return immediately without waiting for the network operation to complete
  - the status of the network operation is polled separately
- select system call (see the event-loop sketch below)
  - lets the application suspend until more data can be read from or written to a socket
  - times out after a pre-specified deadline
  - can monitor several sockets at the same time
- More efficient memory management
- Code that completes processing of a page is not interrupted by other completions
- No need for locks and semaphores on the work pool
  - only completed pages are appended to the log
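A sketch of the event-handler approach, assuming non-blocking sockets and the select system call. The state of each fetch context is encoded explicitly in a small dictionary keyed by socket; the host, path, and completion handling are illustrative.

```python
import select
import socket

def start_fetch(host: str, path: str):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)                       # non-blocking socket
    sock.connect_ex((host, 80))                   # returns immediately
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    return sock, {"request": request, "response": b""}

sock, ctx = start_fetch("www.cse.iitb.ac.in", "/")
contexts = {sock: ctx}                            # explicit per-fetch state

while contexts:
    writable = [s for s, c in contexts.items() if c["request"]]
    readable = [s for s, c in contexts.items() if not c["request"]]
    r, w, _ = select.select(readable, writable, [], 5.0)   # pre-specified deadline
    if not r and not w:
        break                                     # timed out with nothing ready

    for s in w:                                   # connection ready: send the request
        s.sendall(contexts[s]["request"])
        contexts[s]["request"] = b""
    for s in r:                                   # data ready: read without blocking
        data = s.recv(4096)
        if data:
            contexts[s]["response"] += data
        else:                                     # fetch complete
            # ... append contexts[s]["response"] to the page log here ...
            s.close()
            del contexts[s]
```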
18. Link extraction and normalization
- Goal: obtain a canonical form of each URL
- URL processing and filtering
  - avoid multiple fetches of pages known by different URLs
- Many IP addresses per host name
  - for load balancing on large sites
  - mirrored contents / contents on the same file system
- Proxy pass
  - mapping of different host names to a single IP address
  - needed to publish many logical sites
- Relative URLs
  - need to be interpreted w.r.t. a base URL
19. Canonical URL
- Formed by (sketched below)
  - using a standard string for the protocol
  - canonicalizing the host name
  - adding an explicit port number
  - normalizing and cleaning up the path
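A hedged sketch of URL canonicalization following the four steps above, built on urllib.parse. The exact normalization rules (default ports, trailing-slash handling) are assumptions, not the slides' rules.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower() or "http"             # standard protocol string
    host = (parts.hostname or "").lower()               # canonicalized host name
    port = parts.port or {"http": 80, "https": 443}.get(scheme, 80)  # explicit port
    path = posixpath.normpath(parts.path or "/")        # normalize and clean up the path
    if parts.path.endswith("/") and path != "/":
        path += "/"
    return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))

print(canonical_url("HTTP://www.CSE.IITB.ac.in/a/b/../c"))
# -> http://www.cse.iitb.ac.in:80/a/c
```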
20. Robot exclusion
- Check whether the server prohibits crawling a normalized URL (see the check below)
  - in the robots.txt file in the HTTP root directory of the server
  - specifies a list of path prefixes which crawlers should not attempt to fetch
  - meant for crawlers only
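A minimal sketch of a robots.txt check using Python's standard urllib.robotparser; the user-agent string and URLs are examples.

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.cse.iitb.ac.in/robots.txt")   # robots.txt in the HTTP root directory
rp.read()

if rp.can_fetch("MyCrawler", "http://www.cse.iitb.ac.in/some/path.html"):
    pass  # the path prefix is not excluded: safe to add the URL to the work pool
```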
21. Eliminating already-visited URLs
- Check whether a URL has already been fetched
  - before adding a new URL to the work pool
  - needs to be very quick
- Achieved by computing an MD5 hash function on the URL
- Exploiting spatio-temporal locality of access
  - two-level hash function (sketched below)
    - most significant bits (say, 24) derived by hashing the host name plus port
    - lower-order bits (say, 40) derived by hashing the path
    - the concatenated bits are used as a key in a B-tree
- Qualifying URLs are added to the frontier of the crawl
  - their hash values are added to the B-tree
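A sketch of the two-level hash key: 24 most significant bits from an MD5 hash of the host name plus port, 40 lower-order bits from an MD5 hash of the path, concatenated into one key. The bit widths follow the slides; the exact bit-packing and the in-memory set standing in for the B-tree are assumptions.

```python
import hashlib
from urllib.parse import urlsplit

def visit_key(url: str) -> int:
    parts = urlsplit(url)
    hostport = f"{parts.hostname}:{parts.port or 80}".encode("ascii")
    path = (parts.path or "/").encode("ascii", "replace")
    hi = int.from_bytes(hashlib.md5(hostport).digest(), "big") >> (128 - 24)  # top 24 bits
    lo = int.from_bytes(hashlib.md5(path).digest(), "big") >> (128 - 40)      # top 40 bits
    return (hi << 40) | lo          # concatenated 64-bit key, e.g. for a B-tree

seen_keys = set()                   # stand-in for the on-disk B-tree
key = visit_key("http://www.cse.iitb.ac.in/index.html")
if key not in seen_keys:
    seen_keys.add(key)              # qualifying URL: add to the frontier and record its key
```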
22. Spider traps
- The crawler must be protected from crashing on
  - ill-formed HTML
    - e.g., a page with 68 kB of null characters
  - misleading sites
    - an indefinite number of pages dynamically generated by CGI scripts
    - paths of arbitrary depth created using soft directory links and path remapping features in the HTTP server
23. Spider traps: solutions
- No automatic technique can be foolproof
- Check for URL length
- Guards (see the sketch below)
  - prepare regular crawl statistics
  - add dominating sites to the guard module
- Disable crawling of active content such as CGI form queries
- Eliminate URLs with non-textual data types
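A hedged sketch of simple guards against spider traps: a cap on URL length, skipping CGI form queries, and skipping obviously non-textual data types. The length limit and extension list are illustrative assumptions.

```python
from urllib.parse import urlsplit

MAX_URL_LENGTH = 256                                   # illustrative cap
NON_TEXTUAL = (".jpg", ".jpeg", ".gif", ".png", ".zip", ".exe", ".mp3")

def passes_guards(url: str) -> bool:
    if len(url) > MAX_URL_LENGTH:                      # check for URL length
        return False
    parts = urlsplit(url)
    if parts.query or "cgi-bin" in parts.path:         # active content / CGI form queries
        return False
    if parts.path.lower().endswith(NON_TEXTUAL):       # non-textual data types
        return False
    return True
```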
24. Avoiding repeated expansion of links on duplicate pages
- Goal: reduce redundancy in crawls
- Duplicate detection
  - mirrored Web pages and sites
- Detecting exact duplicates (sketched below)
  - check against MD5 digests of stored pages
  - represent a relative link v (relative to aliases u1 and u2) as the tuples (h(u1), v) and (h(u2), v)
- Detecting near-duplicates
  - even a single altered character will completely change the digest!
    - e.g., the date of update, or the name and email of the site administrator
  - Solution: shingling
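A minimal sketch of exact-duplicate detection by content digest, plus a tiny shingle-based resemblance measure for near-duplicates. MD5 and shingling follow the slides; the shingle width and Jaccard-style comparison are assumptions.

```python
import hashlib

seen_digests = {}                      # digest -> canonical URL of the stored page

def is_exact_duplicate(url: str, content: bytes) -> bool:
    digest = hashlib.md5(content).hexdigest()
    if digest in seen_digests:
        return True                    # links on this page were already expanded
    seen_digests[digest] = url
    return False

def shingles(text: str, w: int = 4) -> set:
    """w-word shingles; near-duplicate pages share most of their shingles."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}

def resemblance(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / max(1, len(sa | sb))
```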
25. Load monitor
- Keeps track of various system statistics
  - recent performance of the wide area network (WAN) connection
    - e.g., latency and bandwidth estimates
  - an operator-provided/estimated upper bound on the number of open sockets for the crawler
  - the current number of active sockets
26. Thread manager
- Responsible for
  - choosing units of work from the frontier
  - scheduling the issue of network requests
  - distributing these requests over multiple ISPs, if appropriate
- Uses statistics from the load monitor
27. Per-server work queues
- Servers protect themselves against denial of service (DoS) attacks
  - by limiting the speed or frequency of responses to any fixed client IP address
- Avoiding the appearance of DoS
  - limit the number of active requests to a given server IP address at any time
  - maintain a queue of requests for each server (see the sketch below)
  - use the HTTP/1.1 persistent socket capability
  - distribute attention relatively evenly among a large number of sites
- Access locality vs. politeness dilemma
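A hedged sketch of per-server work queues: one FIFO queue per server host, with a minimum delay between requests to the same server. The delay value and data structures are illustrative assumptions.

```python
import time
from collections import defaultdict, deque
from typing import Optional
from urllib.parse import urlsplit

MIN_DELAY = 2.0                            # assumed seconds between hits on one server
queues = defaultdict(deque)                # server host -> queue of URLs
last_hit = defaultdict(float)              # server host -> time of last request

def enqueue(url: str) -> None:
    queues[urlsplit(url).hostname].append(url)

def next_polite_url() -> Optional[str]:
    """Pick a URL from some server that has not been contacted recently."""
    now = time.monotonic()
    for host, q in queues.items():
        if q and now - last_hit[host] >= MIN_DELAY:
            last_hit[host] = now
            return q.popleft()
    return None                            # every non-empty queue is still 'cooling off'
```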
28. Text repository
- The crawler's last task
  - dumping fetched pages into a repository
- Decoupling the crawler from other functions is preferred for efficiency and reliability
- Page-related information is stored in two parts
  - meta-data
  - page contents
29. Storage of page-related information
- Meta-data
  - relational in nature
  - usually managed by custom software to avoid the overheads of a relational database system
    - the text index involves bulk updates
  - includes fields like content-type, last-modified date, content-length, HTTP status code, etc.
30. Page contents storage
- A typical HTML Web page compresses to 2-4 kB (using zlib)
- File systems have a 4-8 kB file block size
  - too large: one file per page wastes space
- Page storage is managed by a custom storage manager (see the sketch below)
  - provides simple access methods for
    - the crawler to add pages
    - subsequent programs (indexer, etc.) to retrieve documents
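A minimal sketch of such a custom storage manager: compress each page with zlib and append it to one large log file, recording (offset, length) per URL so the indexer can retrieve documents later. The file name and the in-memory index are illustrative assumptions.

```python
import zlib

class PageStore:
    def __init__(self, path: str = "pages.log"):
        self.log = open(path, "ab+")
        self.index = {}                               # URL -> (offset, compressed length)

    def add(self, url: str, html: bytes) -> None:     # used by the crawler
        blob = zlib.compress(html)                    # typically 2-4 kB per page
        self.log.seek(0, 2)                           # append at the end of the log
        offset = self.log.tell()
        self.log.write(blob)
        self.index[url] = (offset, len(blob))

    def get(self, url: str) -> bytes:                 # used by the indexer, etc.
        offset, length = self.index[url]
        self.log.seek(offset)
        return zlib.decompress(self.log.read(length))
```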
31. Page storage
- Small-scale systems
  - the repository fits within the disks of a single machine
  - use a storage manager (e.g., Berkeley DB)
    - manages disk-based databases within a single file
    - can be configured as a hash table or B-tree keyed on the URL
    - or as a sequential log of page records, to handle ordered access of pages
      - sufficient, since the indexer can handle pages in any order
32. Page storage
- Large-scale systems
  - the repository is distributed over a number of storage servers
- Storage servers
  - connected to the crawler through a fast local network (e.g., Ethernet)
  - pages hashed by URL to a storage server
- With 'T3'-grade leased lines, such a configuration can handle about 10 million pages (40 GB) per hour
33. Large-scale crawlers often use multiple ISPs and a bank of local storage servers to store the pages crawled.
34. Refreshing crawled pages
- A search engine's index should be fresh
- A Web-scale crawler never 'completes' its job
- High variance in the rate of page changes
- The HTTP protocol provides an If-modified-since request header (sketched below)
  - but this alone is impractical for a crawler
- Solution
  - at the commencement of a new crawling round, estimate which pages have changed
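A minimal sketch of a conditional GET using the If-Modified-Since request header mentioned above; urllib raises an HTTPError with code 304 when the page has not changed. The URL and date are examples.

```python
import urllib.error
import urllib.request

req = urllib.request.Request(
    "http://www.cse.iitb.ac.in/",
    headers={"If-Modified-Since": "Sat, 01 Jan 2022 00:00:00 GMT"})
try:
    page = urllib.request.urlopen(req, timeout=10).read()   # page changed: refetch it
except urllib.error.HTTPError as err:
    if err.code == 304:
        pass                                                 # not modified since last crawl
    else:
        raise
```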
35. Determining page changes
- The Expires HTTP response header
  - for pages that come with an expiry date
- Otherwise the crawler needs to guess whether revisiting the page will yield a modified version
  - maintain a score reflecting the probability that the page has been modified
  - the crawler fetches URLs in decreasing order of score
- Assumption: the recent past predicts the future
36. Estimating page change rates
- Brewington and Cybenko; Cho
  - algorithms for maintaining a crawl in which most pages are fresher than a specified epoch
- Prerequisite
  - the average interval at which the crawler checks for changes is smaller than the inter-modification times of a page
- Small-scale intermediate crawler runs
  - to monitor fast-changing sites
    - e.g., current news, weather, etc.
  - intermediate indices are patched into the master index
37. Putting together a crawler
- Reference implementation of the HTTP client protocol
  - World Wide Web Consortium (http://www.w3c.org/)
  - the w3c-libwww package
38. Design of the core components: the Crawler class
- Purpose: copy bytes from network sockets to storage media
- Three methods express the Crawler's contract with the user (sketched below)
  - pushing a URL to be fetched to the Crawler (fetchPush)
  - a termination callback handler (fetchDone), called with the same URL
  - a method (start) which starts the Crawler's event loop
- Implementation of the Crawler class
  - needs two helper classes, called DNS and Fetch
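A hedged Python sketch of the Crawler interface described above. The reference implementation (w3c-libwww) is a C library; here the method bodies are placeholders and the DNS and Fetch helpers are only stubbed, so this shows the contract, not the real implementation.

```python
class DNS:                      # helper: asynchronous name resolution (stub)
    pass

class Fetch:                    # helper: one in-progress page fetch (stub)
    pass

class Crawler:
    def __init__(self):
        self.work_pool = []     # URLs waiting to be fetched

    def fetchPush(self, url: str) -> None:
        """Push a URL to be fetched to the Crawler."""
        self.work_pool.append(url)

    def fetchDone(self, url: str, page: bytes) -> None:
        """Termination callback, invoked with the same URL when its fetch completes.
        A user subclasses Crawler and overrides this to store/scan the page."""

    def start(self) -> None:
        """Start the Crawler's event loop: resolve, fetch, and invoke fetchDone."""
        raise NotImplementedError   # event loop omitted in this sketch
```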