CRAWLER DESIGN - PowerPoint PPT Presentation

1
CRAWLER DESIGN
YÜCEL SAYGIN
These slides are based on the book Mining the Web
by Soumen Chakrabarti. Refer to the Crawling the
Web chapter for more information.
2
Challenges
  • The amount of information
  • In 1994 the World Wide Web Worm indexed 110K
    pages
  • In 1997 millions of pages
  • In 2004 billions of pages
  • In 2010: ???? of pages
  • Complexity of the link graph

3
Basics
  • HTTP: Hypertext Transfer Protocol
  • TCP: Transmission Control Protocol
  • IP: Internet Protocol
  • HTML: Hypertext Markup Language
  • URL: Uniform Resource Locator

<a href="http://www.cse.iitb.ac.in/"> The IIT
Bombay Computer Science Department</a>
(Parts of the URL: protocol, server host name, file path)
4
Basics
  • A click on the hyperlink is converted to a
    network request by the browser
  • The browser will then fetch and display the web
    page pointed to by the URL.
  • The server host name (like www.cse.iitb.ac.in)
    needs to be translated into an IP address such as
    144.16.111.14 to contact the server using TCP.

5
Basics
  • DNS (Domain Name System) is a distributed
    database of name-to-IP address mappings
  • This database is maintained by known servers
  • A click on the hyperlink is translated into
  • telnet www.cse.iitb.ac.in 80
  • 80 is the default HTTP port

6
MIME Header
MIME (Multipurpose Internet Mail Extensions): a
standard for email and web content transfer.
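
A minimal sketch of what these MIME-style headers look like, using Python's
standard http.client module (this code is not from the book; the host name is
just the example used on the earlier slides):

  import http.client

  # Ask only for the response headers of the front page.
  conn = http.client.HTTPConnection("www.cse.iitb.ac.in", 80, timeout=10)
  conn.request("HEAD", "/")
  response = conn.getresponse()
  for name, value in response.getheaders():
      print(f"{name}: {value}")   # e.g. Content-Type: text/html; charset=UTF-8
  conn.close()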
7
Crawling
  • There is no directory of all accessible URLs
  • The main strategy is to
  • start from a set of seed web pages
  • Extract URLs from those pages
  • Apply the same techniques to the pages from those
    URLs
  • It may not be possible to retrieve all the pages
    on the Web with this technique, since new pages
    are added every day

Use a queue structure and mark visited nodes (a
minimal sketch follows)
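
A minimal sketch of this strategy in Python, standard library only (an
illustration, not the book's crawler): a FIFO work queue of URLs, a visited
set, and a simple link extractor.

  from collections import deque
  from html.parser import HTMLParser
  from urllib.parse import urljoin
  from urllib.request import urlopen

  class LinkParser(HTMLParser):
      """Collect the href attributes of anchor tags."""
      def __init__(self):
          super().__init__()
          self.links = []
      def handle_starttag(self, tag, attrs):
          if tag == "a":
              for name, value in attrs:
                  if name == "href" and value:
                      self.links.append(value)

  def crawl(seeds, max_pages=100):
      queue = deque(seeds)      # URLs waiting to be fetched
      visited = set(seeds)      # marked nodes are never queued twice
      fetched = 0
      while queue and fetched < max_pages:
          url = queue.popleft()
          try:
              html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
          except OSError:
              continue          # unreachable page: skip it
          fetched += 1
          parser = LinkParser()
          parser.feed(html)
          for link in parser.links:
              absolute = urljoin(url, link)   # relative -> absolute URL
              if absolute not in visited:
                  visited.add(absolute)
                  queue.append(absolute)
      return visited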
8
Crawling
  • Writing a basic crawler is easy
  • Writing a large-scale crawler is challenging
  • Following are the basic steps of crawling
  • URL to IP conversion using the DNS server
  • Socket connection to the server and sending the
    request
  • Receiving the requested page
  • For small pages, DNS lookup and socket connection
    take more time than receiving the requested page
  • We need to overlap the processing and waiting
    times of the above three steps.

9
Crawling
  • Storage requirements are huge
  • Need to store the list of URLs and the retrieved
    pages on disk
  • Storing the URLs on disk is also needed for
    persistence
  • Pages are stored in compressed form (Google uses
    zlib for compression, roughly 3 to 1)

10
(No Transcript)
11
Large Scale Crawler Tips
  • Fetch hundreds of pages at the same time to
    increase bandwidth utilization
  • Use more than one DNS server for concurrent DNS
    lookup
  • Using asynchronous sockets is better than
    multi-threading
  • Eliminate duplicates to reduce the number of
    redundant fetches and to avoid spider traps
    (infinite set of fake URLs)

12
DNS Caching
  • Address mapping is a significant bottleneck
  • A crawler can generate more requests per unit
    time than a DNS server can handle
  • Caching the DNS entries helps
  • DNS cache needs to be refreshed periodically
    (whenever it is idle)
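
A minimal in-memory DNS cache in Python (an illustration; a production cache
would also expire and refresh entries, for example while the crawler is idle):

  import socket

  _dns_cache = {}   # host name -> IP address

  def resolve(host):
      """Look the host up once, then answer later requests from the cache."""
      if host not in _dns_cache:
          _dns_cache[host] = socket.gethostbyname(host)   # blocking DNS query
      return _dns_cache[host]

  # resolve("www.cse.iitb.ac.in") hits the DNS server only on the first call.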

13
Concurrent page requests
  • Can be achieved by
  • Multithreading
  • Non-blocking sockets with event handlers
  • Multithreading
  • A set of threads are created
  • After the server name is translated to IP
    address,
  • a thread creates a client socket
  • Connects to the HTTP service on the server
  • Sends the HTTP request header
  • Reads the socket until EOF
  • Closes the socket
  • Blocking system calls are used to suspend the
    thread until the requested data is available
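
A minimal sketch of those per-thread steps with a blocking client socket
(HTTP/1.0 is assumed so the server closes the connection after the response,
which makes reading until EOF correct; this is not the book's code):

  import socket

  def fetch(ip_address, host, path="/"):
      sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      sock.connect((ip_address, 80))           # blocking connect to the HTTP port
      request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
      sock.sendall(request.encode("ascii"))    # send the HTTP request header
      chunks = []
      while True:
          data = sock.recv(4096)               # blocks until data arrives or EOF
          if not data:
              break
          chunks.append(data)
      sock.close()
      return b"".join(chunks)                  # headers + body of the response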

14
Multithreading
  • A fixed number of worker threads share a
    work-queue of pages to fetch
  • Handling concurrent access to data structures is
    a problem. Mutual exclusion needs to be handled
    properly
  • Disk access cannot be orchestrated when multiple
    concurrent threads are used
  • Non-blocking sockets could be a better approach!
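
A minimal sketch of such a worker pool in Python; queue.Queue already provides
the mutual exclusion for the work queue, and an explicit lock protects the
shared result dictionary (the names and the pool size are illustrative):

  import queue
  import threading
  from urllib.request import urlopen

  work_queue = queue.Queue()        # thread-safe queue of URLs to fetch
  results = {}                      # url -> page bytes, shared by all workers
  results_lock = threading.Lock()

  def worker():
      while True:
          url = work_queue.get()
          if url is None:           # sentinel value: no more work
              break
          try:
              page = urlopen(url, timeout=10).read()
              with results_lock:    # mutual exclusion on the shared dictionary
                  results[url] = page
          except OSError:
              pass

  threads = [threading.Thread(target=worker) for _ in range(8)]
  for t in threads:
      t.start()
  work_queue.put("http://www.cse.iitb.ac.in/")
  for _ in threads:
      work_queue.put(None)          # one sentinel per worker thread
  for t in threads:
      t.join()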

15
Non-blocking sockets
  • Connect, send, and receive calls will return
    immediately without blocking for network data
  • The status of the network can be polled later on
  • The select system call lets the application wait
    for data to be available on the socket
  • This way completion of page fetching is
    serialized.
  • No need for locks or semaphores
  • Can append the pages to the file on disk without
    being interrupted
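
A minimal sketch of this event-driven style using Python's selectors module,
which wraps the select call (HTTP/1.0 and port 80 are assumed, error handling
is omitted, and the book's crawler is considerably more careful):

  import selectors
  import socket

  selector = selectors.DefaultSelector()
  pending = {}                          # socket -> list of received chunks

  def start_fetch(host, path="/"):
      sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      sock.setblocking(False)           # non-blocking socket
      try:
          sock.connect((host, 80))      # returns immediately
      except BlockingIOError:
          pass                          # connection still in progress
      request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
      selector.register(sock, selectors.EVENT_WRITE, request)
      pending[sock] = []

  def event_loop(outfile):
      """Poll the sockets; append each completed page to the open file."""
      while pending:
          for key, events in selector.select(timeout=1):
              sock = key.fileobj
              if events & selectors.EVENT_WRITE:
                  sock.sendall(key.data)     # the small request header fits the buffer
                  selector.modify(sock, selectors.EVENT_READ, b"")
              else:
                  data = sock.recv(4096)
                  if data:
                      pending[sock].append(data)
                  else:                      # EOF: this fetch is complete
                      outfile.write(b"".join(pending.pop(sock)))
                      selector.unregister(sock)
                      sock.close()

start_fetch() can be called many times before event_loop(), so many pages are
in flight while a single thread appends the completed pages to the repository
file, with no locks or semaphores.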

16
Link Extraction and Normalization
  • An HTML page is searched for links to add to the
    work-pool
  • URLs extracted from pages need to be preprocessed
    before they are added to the work-pool
  • Duplicate elimination is necessary but difficult
  • Since the mapping from URLs to hostnames is
    many-to-many, i.e., a computer may have many IP
    addresses and many hostnames.
  • Extracted URLs are converted to canonical form by
  • Using the canonical hostname provided by the DNS
    response
  • Adding an explicit port number
  • Converting the relative addresses to absolute
    addresses
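
A minimal sketch of these canonicalization steps with Python's urllib.parse
(the exact rules of the book's crawler may differ; port 80 is assumed for
HTTP, and the host name is taken as written rather than from a DNS response):

  from urllib.parse import urljoin, urlsplit, urlunsplit

  def canonicalize(base_url, href):
      absolute = urljoin(base_url, href)        # relative -> absolute address
      parts = urlsplit(absolute)
      host = (parts.hostname or "").lower()     # lower-case host name
      port = parts.port or 80                   # add an explicit port number
      return urlunsplit((parts.scheme, f"{host}:{port}",
                         parts.path or "/", parts.query, ""))

  # canonicalize("http://www.cse.iitb.ac.in/", "people/index.html")
  #   -> "http://www.cse.iitb.ac.in:80/people/index.html"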

17
Some more tips
  • A server may disallow crawling using a robots.txt
    file found in the HTTP root directory
  • robots.txt specifies a list of path prefixes that
    crawlers should not try to fetch
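
A minimal sketch using Python's standard robots.txt parser (the user-agent
name and the URLs are illustrative):

  from urllib.robotparser import RobotFileParser

  robots = RobotFileParser("http://www.cse.iitb.ac.in/robots.txt")
  robots.read()                    # fetch and parse the disallowed path prefixes
  allowed = robots.can_fetch("MyCrawler", "http://www.cse.iitb.ac.in/private/")
  # Only request the page when allowed is True.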

18
Eliminating already visited URLS
  • The IsUrlVisited module in the architecture does
    that job
  • The same page could be linked from many different
    sites
  • Checking if the page is already visited
    eliminates redundant page requests
  • Comparing the strings of URLs may take a long
    time since it involves disk access and checking
    against all the stored URLs

19
Eliminating already visited URLS
  • Duplicate checking is done by applying a hash
    function, MD5, originally designed for digital
    signature applications
  • The MD5 algorithm takes a message of arbitrary
    length as input and produces a 128-bit
    "fingerprint" or "message digest" as output
  • It is computationally infeasible to produce two
    messages having the same message digest
  • http://www.w3.org/TR/1998/REC-DSig-label/MD5-1_0
  • Even the hashed URLs need to be stored on disk
    due to storage and persistence requirements
  • Spatial and temporal locality of URL access means
    fewer disk accesses when URL hashes are cached
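
A minimal sketch of the MD5 fingerprint check with Python's hashlib (the
in-memory set stands in for the disk-backed, cached store described above):

  import hashlib

  seen = set()                     # fingerprints of already-visited URLs

  def is_url_visited(url):
      """Return True if the URL's 128-bit MD5 digest has been seen before."""
      digest = hashlib.md5(url.encode("utf-8")).hexdigest()
      if digest in seen:
          return True
      seen.add(digest)             # a real crawler also persists this to disk
      return False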

20
Eliminating already visited URLs
  • We need to utilize spatial locality as much as
    possible
  • But MD5 will distribute similar URL strings
    uniformly over its range.
  • A two-block or two-level hash function is used
  • Use different hash functions for the host address
    and the path
  • A B-tree could be used to index the host names,
    so that a retrieved page contains the URLs of the
    same host.

21
Spider Traps
  • Malicious pages designed to crash the crawlers
  • Simply add 64K of null characters in the middle
    of a URL to crash the lexical analyzer
  • Infinitely deep web sites
  • Using dynamically generated links via CGI scripts
  • Need to check the link length
  • No technique is foolproof
  • Generate periodic statistics for the crawler to
    eliminate dominating sites
  • Disable crawling active content

22
Avoiding duplicate pages
  • A page can be accessed via different URLs
  • Eliminating duplicate pages will also help
    eliminate spider traps
  • MD5 can be used for that purpose
  • Minor changes cannot be handled with MD5.
  • Can divide the page into blocks and hash each
    block separately
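
A minimal sketch of both ideas: an MD5 digest of the whole page catches exact
duplicates, and digests of fixed-size blocks (the block size here is an
illustrative choice, not the book's scheme) let pages with only minor changes
still share most of their fingerprints:

  import hashlib

  def page_digest(content):
      return hashlib.md5(content).hexdigest()      # exact-duplicate check

  def block_digests(content, block_size=1024):
      """Digest each fixed-size block so near-duplicates overlap heavily."""
      return {hashlib.md5(content[i:i + block_size]).hexdigest()
              for i in range(0, len(content), block_size)}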

23
Denial of Service
  • HTTP servers protect themselves against denial of
    service (DoS) attacks
  • DoS attacks will send frequent requests to the
    same server to slow down its operation
  • Therefore frequent requests from the same IP are
    prohibited
  • Crawlers need to consider such cases, for
    courtesy and to avoid legal action
  • Need to limit the active requests to a given
    server IP address at any time
  • Maintain a queue of requests for each server
  • This will also reduce the effect of spider traps
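
A minimal sketch of that courtesy policy (the class name and the limit are
illustrative): one queue of pending URLs per server, and at most a small
number of outstanding requests to any one server at a time.

  import collections
  import threading

  class HostQueues:
      def __init__(self, max_active_per_host=2):
          self.queues = collections.defaultdict(collections.deque)  # host -> URLs
          self.active = collections.Counter()      # host -> requests in flight
          self.max_active = max_active_per_host
          self.lock = threading.Lock()

      def add(self, host, url):
          with self.lock:
              self.queues[host].append(url)

      def next_url(self):
          """Return (host, url) for a host under its limit, or None."""
          with self.lock:
              for host, urls in self.queues.items():
                  if urls and self.active[host] < self.max_active:
                      self.active[host] += 1
                      return host, urls.popleft()
          return None

      def done(self, host):
          with self.lock:
              self.active[host] -= 1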

24
Text Repository
  • The pages that are fetched are dumped into a text
    repository
  • The text repository is very large
  • Needs to be compressed (Google uses zlib for
    roughly 3-to-1 compression)
  • Google implements its own file system
  • Berkeley DB (www.sleepycat.com) can also be used
  • Stores a database within a single file
  • Provides several access methods such as B-tree or
    sequential
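
A minimal sketch of a compressed, single-file repository (a plain append-only
file with zlib, standing in for Google's file system or Berkeley DB): each
record is the URL, a newline, and the zlib-compressed page, preceded by a
4-byte length so the file can be read back sequentially.

  import struct
  import zlib

  def append_page(repo_path, url, content):
      record = url.encode("utf-8") + b"\n" + zlib.compress(content)
      with open(repo_path, "ab") as repo:
          repo.write(struct.pack(">I", len(record)))   # 4-byte record length
          repo.write(record)

  def read_pages(repo_path):
      with open(repo_path, "rb") as repo:
          while header := repo.read(4):
              (length,) = struct.unpack(">I", header)
              url, _, compressed = repo.read(length).partition(b"\n")
              yield url.decode("utf-8"), zlib.decompress(compressed)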

25
Refreshing Crawled Pages
  • The HTTP protocol could be used to check if a
    page has changed since the last time it was
    crawled
  • But using HTTP to check if a page is modified
    takes a lot of time
  • If a page expires after a certain time, this
    could be extracted from the HTTP header.
  • If we had a score that reflects the probability
    of change since the last time it was visited,
  • We could sort the pages with respect to that
    score and crawl them in that order
  • Use the past behavior to model the future!
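
A minimal sketch of such a check with a conditional GET, standard library
only (the date string is whatever was recorded at the last crawl): the server
answers 304 Not Modified when the page has not changed, so the body is not
transferred again.

  import urllib.error
  import urllib.request

  def refresh(url, last_crawled_http_date):
      request = urllib.request.Request(
          url, headers={"If-Modified-Since": last_crawled_http_date})
      try:
          with urllib.request.urlopen(request, timeout=10) as response:
              return response.read()        # the page changed: new content
      except urllib.error.HTTPError as error:
          if error.code == 304:
              return None                   # unchanged since the last crawl
          raise

  # refresh("http://www.cse.iitb.ac.in/", "Sat, 01 Jan 2005 00:00:00 GMT")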

26
Your crawler
  • Use the w3c-libwww API to implement your crawler
  • Start from a very simple implementation and build
    on it from there!
  • Sample code and algorithms are provided in the
    handouts