Title: CRAWLER DESIGN
1 CRAWLER DESIGN
YÜCEL SAYGIN
These slides are based on the book Mining the Web by Soumen Chakrabarti. Refer to the Crawling the Web chapter for more information.
2 Challenges
- The amount of information
  - In 1994 the World Wide Web Worm indexed 110K pages
  - In 1997: millions of pages
  - In 2004: billions of pages
  - In 2010: ???? pages
- Complexity of the link graph
3 Basics
- HTTP: Hypertext Transfer Protocol
- TCP: Transmission Control Protocol
- IP: Internet Protocol
- HTML: Hypertext Markup Language
- URL: Uniform Resource Locator
<a href="http://www.cse.iitb.ac.in/">The IIT Bombay Computer Science Department</a>
(the URL above consists of a protocol, a server host name, and a file path)
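As a quick illustration (an addition to the slide, not from the book), Python's standard urllib.parse can split the URL from the example into exactly those three parts:

    from urllib.parse import urlsplit

    # Split the example URL from the slide into its components.
    parts = urlsplit("http://www.cse.iitb.ac.in/")
    print(parts.scheme)   # protocol         -> "http"
    print(parts.netloc)   # server host name -> "www.cse.iitb.ac.in"
    print(parts.path)     # file path        -> "/"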
4 Basics
- A click on a hyperlink is converted to a network request by the browser
- The browser then fetches and displays the web page pointed to by the URL
- The server host name (like www.cse.iitb.ac.in) needs to be translated into an IP address such as 144.16.111.14 to contact the server using TCP
<a href="http://www.cse.iitb.ac.in/">The IIT Bombay Computer Science Department</a>
(protocol, server host name, file path)
5 Basics
- DNS (the Domain Name System) is a distributed database of name-to-IP address mappings
- This database is maintained by well-known servers
- A click on the hyperlink is translated into something like
  - telnet www.cse.iitb.ac.in 80
  - 80 is the default HTTP port
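A minimal sketch of the same two steps in Python (an added illustration; the host name is the one from the slide): the DNS lookup maps the name to an IP address, and the TCP connection goes to port 80, the default HTTP port.

    import socket

    host = "www.cse.iitb.ac.in"          # host name from the slide
    ip = socket.gethostbyname(host)      # DNS: name -> IP address
    print(host, "->", ip)

    # Rough equivalent of "telnet www.cse.iitb.ac.in 80": open a TCP
    # connection to the default HTTP port.
    with socket.create_connection((host, 80), timeout=10) as sock:
        print("connected to", sock.getpeername())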
6 MIME Header
- MIME (Multipurpose Internet Mail Extensions) is a standard for email and web content transfer; HTTP responses carry MIME-style headers such as Content-Type
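To make the headers concrete, the sketch below (an added illustration using Python's http.client; www.example.com is only a placeholder host) issues a request and prints the MIME-style response headers, including Content-Type.

    import http.client

    # Fetch a page and inspect the MIME-style headers of the HTTP response.
    conn = http.client.HTTPConnection("www.example.com", 80, timeout=10)
    conn.request("GET", "/")
    resp = conn.getresponse()
    print(resp.status, resp.reason)
    for name, value in resp.getheaders():      # e.g. Content-Type: text/html
        print(f"{name}: {value}")
    conn.close()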
7 Crawling
- There is no directory of all accessible URLs
- The main strategy is to
  - start from a set of seed web pages
  - extract URLs from those pages
  - apply the same steps to the pages behind those URLs
- It may not be possible to retrieve all the pages on the Web with this technique, since new pages are added every day
- Use a queue structure and mark visited nodes (a minimal sketch follows)
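A minimal sketch of this strategy (an added illustration, not the book's code): a FIFO queue of URLs still to fetch, a set of visited URLs, and a deliberately naive regular expression for link extraction; the seed list and the page limit are arbitrary.

    import re
    import urllib.request
    from collections import deque
    from urllib.parse import urljoin

    HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

    def crawl(seeds, max_pages=50):
        queue = deque(seeds)          # work queue of URLs still to fetch
        visited = set(seeds)          # mark nodes so nothing is queued twice
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue              # skip pages that cannot be fetched
            fetched += 1
            # Extract URLs from the page and enqueue the unseen ones.
            for link in HREF_RE.findall(html):
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in visited:
                    visited.add(absolute)
                    queue.append(absolute)
        return visited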
8 Crawling
- Writing a basic crawler is easy
- Writing a large-scale crawler is challenging
- The basic steps of crawling are
  - URL-to-IP conversion using the DNS server
  - socket connection to the server and sending the request
  - receiving the requested page
- For small pages, the DNS lookup and the socket connection take more time than receiving the requested page
- We need to overlap the processing and waiting times of these three steps
9 Crawling
- Storage requirements are huge
- Need to store the list of URLs and the retrieved pages on disk
- Storing the URLs on disk is also needed for persistence
- Pages are stored in compressed form (Google uses zlib for compression, roughly 3 to 1)
11 Large-Scale Crawler Tips
- Fetch hundreds of pages at the same time to increase bandwidth utilization
- Use more than one DNS server for concurrent DNS lookups
- Using asynchronous sockets is better than multi-threading
- Eliminate duplicates to reduce the number of redundant fetches and to avoid spider traps (infinite sets of fake URLs)
12 DNS Caching
- Address mapping is a significant bottleneck
- A crawler can generate more requests per unit time than a DNS server can handle
- Caching the DNS entries helps
- The DNS cache needs to be refreshed periodically (for example, whenever the crawler is idle); a small caching sketch follows
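A minimal in-memory sketch of such a cache (an added illustration; the time-to-live value is arbitrary): a name is looked up once and the result is reused until the entry expires.

    import socket
    import time

    class DNSCache:
        """Cache name -> IP mappings so repeated lookups skip the DNS server."""

        def __init__(self, ttl=3600):
            self.ttl = ttl            # refresh entries after this many seconds
            self.entries = {}         # host -> (ip, time when resolved)

        def resolve(self, host):
            hit = self.entries.get(host)
            if hit is not None and time.time() - hit[1] < self.ttl:
                return hit[0]                       # fresh cached entry
            ip = socket.gethostbyname(host)         # fall back to a real DNS lookup
            self.entries[host] = (ip, time.time())
            return ip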
13 Concurrent Page Requests
- Can be achieved by
  - multithreading
  - non-blocking sockets with event handlers
- Multithreading
  - a set of threads is created
  - after the server name is translated to an IP address, a thread
    - creates a client socket
    - connects to the HTTP service on the server
    - sends the HTTP request header
    - reads the socket until EOF
    - closes the socket
- Blocking system calls are used to suspend the thread until the requested data is available
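The per-thread steps above can be sketched with blocking socket calls in Python (an added illustration, not the book's code; HTTP/1.0 is requested so the server closes the connection, producing the EOF the read loop waits for).

    import socket

    def fetch(ip, host, path="/", port=80):
        """Fetch one page using blocking calls, as a worker thread would."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # create a client socket
        sock.connect((ip, port))                                  # connect to the HTTP service
        request = (f"GET {path} HTTP/1.0\r\n"
                   f"Host: {host}\r\n\r\n")
        sock.sendall(request.encode("ascii"))                     # send the request header
        chunks = []
        while True:                                               # read the socket until EOF
            data = sock.recv(4096)                                # blocks until data is available
            if not data:
                break
            chunks.append(data)
        sock.close()                                              # close the socket
        return b"".join(chunks)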
14 Multithreading
- A fixed number of worker threads share a work queue of pages to fetch
- Handling concurrent access to shared data structures is a problem; mutual exclusion needs to be handled properly
- Disk access cannot be orchestrated when multiple concurrent threads are used
- Non-blocking sockets could be a better approach!
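A minimal worker-pool sketch under those constraints (an added illustration; the seed URL and the thread count are arbitrary): queue.Queue already provides mutual exclusion for the work queue itself, while the shared visited set still needs an explicit lock.

    import queue
    import threading
    import urllib.request

    work_queue = queue.Queue()        # thread-safe work queue of URLs to fetch
    visited = set()
    visited_lock = threading.Lock()   # mutual exclusion for the shared visited set

    def worker():
        while True:
            url = work_queue.get()
            with visited_lock:
                seen = url in visited
                visited.add(url)
            if not seen:
                try:
                    with urllib.request.urlopen(url, timeout=10) as resp:
                        page = resp.read()
                    # ... extract links from `page` and put new URLs on work_queue ...
                except OSError:
                    pass
            work_queue.task_done()

    # A fixed number of worker threads share the same work queue.
    for _ in range(8):
        threading.Thread(target=worker, daemon=True).start()

    work_queue.put("http://www.cse.iitb.ac.in/")   # seed URL
    work_queue.join()                              # wait until all queued work is done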
15 Non-blocking Sockets
- Connect, send, and receive calls return immediately without blocking for network data
- The status of the network can be polled later on
- The select system call lets the application wait for data to be available on any of the sockets
- This way, the completion of page fetches is serialized
- No need for locks or semaphores
- Pages can be appended to the file on disk without their writes getting interleaved
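A minimal sketch of the event-driven pattern (an added illustration using Python's selectors module, which wraps select): every socket is put into non-blocking mode, connect returns immediately, and a single thread drives all fetches from one polling loop.

    import selectors
    import socket

    def fetch_all(targets):
        """targets: list of (host, path); one thread drives every fetch via select."""
        sel = selectors.DefaultSelector()
        pages = {}
        for host, path in targets:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.setblocking(False)                    # non-blocking socket
            sock.connect_ex((host, 80))                # returns immediately
            req = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
            sel.register(sock, selectors.EVENT_WRITE,  # poll for "connection ready"
                         {"host": host, "request": req, "data": b""})
        while sel.get_map():
            ready = sel.select(timeout=10)
            if not ready:
                break                                  # give up on stalled sockets
            for key, events in ready:
                sock, state = key.fileobj, key.data
                try:
                    if events & selectors.EVENT_WRITE:
                        sock.sendall(state["request"])           # connected: send request
                        sel.modify(sock, selectors.EVENT_READ, state)
                    else:
                        chunk = sock.recv(4096)
                        if chunk:
                            state["data"] += chunk
                        else:                                    # EOF: page is complete
                            pages[state["host"]] = state["data"]
                            sel.unregister(sock)
                            sock.close()
                except OSError:
                    sel.unregister(sock)
                    sock.close()
        return pages

Because only this one thread ever appends a completed page to the repository, no locks or semaphores are needed.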
16 Link Extraction and Normalization
- An HTML page is searched for links to add to the work pool
- URLs extracted from pages need to be preprocessed before they are added to the work pool
- Duplicate elimination is necessary but difficult, since the mapping between hostnames and IP addresses is many-to-many, i.e., a computer may have many IP addresses and many hostnames
- Extracted URLs are converted to canonical form by
  - using the canonical hostname provided by the DNS response
  - adding an explicit port number
  - converting relative addresses to absolute addresses
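A minimal canonicalization sketch (an added illustration): it covers the explicit-port and relative-to-absolute rules, but simply lowercases the host instead of querying DNS for the canonical name.

    from urllib.parse import urljoin, urlsplit, urlunsplit

    def canonicalize(base_url, href):
        """Turn an extracted href into a canonical absolute URL."""
        absolute = urljoin(base_url, href)            # relative -> absolute
        parts = urlsplit(absolute)
        host = parts.hostname.lower() if parts.hostname else ""
        port = parts.port or (443 if parts.scheme == "https" else 80)
        netloc = f"{host}:{port}"                     # add an explicit port number
        path = parts.path or "/"
        return urlunsplit((parts.scheme, netloc, path, parts.query, ""))  # drop the fragment

    print(canonicalize("http://www.cse.iitb.ac.in/", "../people/index.html"))
    # -> http://www.cse.iitb.ac.in:80/people/index.html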
17 Some More Tips
- A server may disallow crawling via a robots.txt file found in the HTTP root directory
- robots.txt specifies a list of path prefixes that crawlers should not try to fetch
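Python's standard urllib.robotparser applies exactly these rules; the sketch below (an added illustration with an arbitrary host and crawler name) checks a URL before fetching it.

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.cse.iitb.ac.in/robots.txt")   # robots.txt in the HTTP root
    rp.read()                                            # fetch and parse the rules

    url = "http://www.cse.iitb.ac.in/some/path.html"
    if rp.can_fetch("MyCrawler", url):                   # check the disallowed path prefixes
        print("allowed to fetch", url)
    else:
        print("robots.txt disallows", url)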
18 Eliminating Already Visited URLs
- The IsUrlVisited module in the architecture does that job
- The same page can be linked from many different sites
- Checking whether a page has already been visited eliminates redundant page requests
- Comparing URL strings may take a long time, since it involves disk access and checking against all the stored URLs
19 Eliminating Already Visited URLs
- Duplicate checking is done by applying a hash function such as MD5, originally designed for digital signature applications
- The MD5 algorithm takes a message of arbitrary length as input and produces a 128-bit "fingerprint" or "message digest" as output
- It is computationally infeasible to produce two messages having the same message digest
- http://www.w3.org/TR/1998/REC-DSig-label/MD5-1_0
- Even the hashed URLs need to be stored on disk due to storage and persistence requirements
- Spatial and temporal locality of URL accesses means fewer disk accesses when URL hashes are cached
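A minimal sketch of URL fingerprinting with Python's hashlib (an added illustration): each URL is reduced to a fixed 128-bit digest, and the digests, rather than the full strings, are stored and compared.

    import hashlib

    def url_fingerprint(url):
        """128-bit MD5 digest of a URL, used as its identity for duplicate checks."""
        return hashlib.md5(url.encode("utf-8")).digest()   # 16 bytes

    seen = set()
    for url in ["http://www.cse.iitb.ac.in/", "http://www.cse.iitb.ac.in/"]:
        fp = url_fingerprint(url)
        if fp in seen:
            print("already visited:", url)
        else:
            seen.add(fp)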
20 Eliminating Already Visited URLs
- We need to exploit spatial locality as much as possible
- But MD5 distributes similar URL strings uniformly over its range
- So a two-block (two-level) hash function is used
  - use different hash functions for the host address and the path
- A B-tree can be used to index these keys, so that a retrieved disk page contains the URLs of the same host
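A minimal sketch of the two-level idea (an added illustration; the split sizes are arbitrary): the host is hashed separately from the path, and the host hash forms the leading bytes of the key, so keys for URLs on the same host end up next to each other when sorted, e.g., in a B-tree.

    import hashlib
    from urllib.parse import urlsplit

    def two_level_key(url):
        """Key = hash(host) + hash(path); same-host URLs share a common prefix."""
        parts = urlsplit(url)
        host_hash = hashlib.md5(parts.netloc.encode("utf-8")).digest()[:8]
        path_hash = hashlib.md5(parts.path.encode("utf-8")).digest()[:8]
        return host_hash + path_hash

    # Sorting by this key clusters URLs from the same host together, which is
    # what preserves spatial locality when the keys are stored in a B-tree.
    urls = ["http://a.example/x", "http://b.example/y", "http://a.example/z"]
    for url in sorted(urls, key=two_level_key):
        print(url)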
21 Spider Traps
- Malicious pages designed to crash crawlers
  - simply adding 64K of null characters in the middle of a URL can crash the lexical analyzer
  - infinitely deep web sites, built from dynamically generated links via CGI scripts
- Need to check the link length
- No technique is foolproof
  - generate periodic statistics for the crawler to eliminate dominating sites
  - disable crawling of active content
22 Avoiding Duplicate Pages
- A page can be accessed via different URLs
- Eliminating duplicate pages will also help eliminate spider traps
- MD5 of the page contents can be used for that purpose
- Minor changes cannot be handled with MD5
  - the page can instead be divided into blocks that are checked separately
23 Denial of Service
- HTTP servers protect themselves against denial-of-service (DoS) attacks
- DoS attacks send frequent requests to the same server to slow down its operation
- Therefore, frequent requests from the same IP are blocked
- Crawlers need to consider such cases, for courtesy and to avoid legal action
- Limit the number of active requests to a given server IP address at any time
- Maintain a queue of requests for each server (a politeness sketch follows)
- This will also reduce the effect of spider traps
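A minimal politeness sketch (an added illustration; the delay value is arbitrary): one FIFO queue per server and a record of the last request time, so the crawler never contacts the same host more often than the chosen interval.

    import collections
    import time

    class PoliteScheduler:
        """Hold one FIFO queue per server and rate-limit requests to each."""

        def __init__(self, min_delay=2.0):
            self.min_delay = min_delay                         # seconds between hits to one host
            self.queues = collections.defaultdict(collections.deque)
            self.last_request = {}                             # host -> time of last fetch

        def add(self, host, url):
            self.queues[host].append(url)

        def next_url(self):
            """Return a URL whose host has not been contacted too recently."""
            now = time.time()
            for host, q in self.queues.items():
                if q and now - self.last_request.get(host, 0.0) >= self.min_delay:
                    self.last_request[host] = now
                    return q.popleft()
            return None                                        # nothing is ready yet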
24 Text Repository
- The pages that are fetched are dumped into a text repository
- The text repository is significantly large
  - it needs to be compressed (Google uses zlib for roughly 3-to-1 compression)
- Google implements its own file system
- Berkeley DB (www.sleepycat.com) can also be used
  - stores a database within a single file
  - provides several access methods such as B-tree or sequential
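A minimal sketch of the compression step using Python's zlib binding (an added illustration; the actual ratio depends on the page):

    import zlib

    page = b"<html>" + b"the quick brown fox " * 500 + b"</html>"
    compressed = zlib.compress(page)                 # store this in the repository
    print(len(page), "->", len(compressed), "bytes") # typically a large reduction for HTML
    assert zlib.decompress(compressed) == page       # pages come back unchanged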
25 Refreshing Crawled Pages
- HTTP can be used to check whether a page has changed since the last time it was crawled
- But using HTTP to check every page for modification takes a lot of time
- If a page expires after a certain time, this can be extracted from the HTTP header
- If we had a score reflecting the probability that a page has changed since its last visit,
  - we could sort the pages by that score and crawl them in that order
- Use past behavior to model the future!
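One common way HTTP supports the check is a conditional GET with an If-Modified-Since header: the server answers 304 Not Modified if the page has not changed. A minimal sketch (an added illustration with a placeholder host and date):

    import http.client

    conn = http.client.HTTPConnection("www.example.com", 80, timeout=10)
    conn.request("GET", "/", headers={
        "If-Modified-Since": "Mon, 01 Jan 2024 00:00:00 GMT",  # time of the last crawl
    })
    resp = conn.getresponse()
    if resp.status == 304:
        print("page unchanged since the last crawl")   # no need to re-fetch
    else:
        body = resp.read()                             # page changed: refresh the stored copy
    conn.close()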
26 Your Crawler
- Use the w3c-libwww API to implement your crawler
- Start from a very simple implementation and build on from there!
- Sample code and algorithms are provided in the handouts