1
Crawling the Web
  • Web pages
    • a few thousand characters long
    • served over the Internet using the Hypertext Transfer
      Protocol (HTTP)
    • viewed at the client end using browsers
  • Crawler
    • fetches the pages to a computer
  • At the computer
    • automatic programs can analyze hypertext documents

2
HTML
  • HyperText Markup Language
  • Lets the author
    • specify layout and typeface
    • embed diagrams
    • create hyperlinks
      • a hyperlink is expressed as an anchor tag with an HREF
        attribute
      • HREF names another page using a Uniform Resource Locator
        (URL)
  • URL (see the sketch below)
    • protocol field (HTTP)
    • a server hostname (www.cse.iitb.ac.in)
    • file path (/, the 'root' of the published file system)
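A minimal sketch (not from the original slides) of pulling these
three URL fields apart with Python's standard urllib.parse module;
the hostname is the example used above.

    from urllib.parse import urlsplit

    url = "http://www.cse.iitb.ac.in/"      # example URL from the slide
    parts = urlsplit(url)

    print(parts.scheme)    # protocol field  -> 'http'
    print(parts.hostname)  # server hostname -> 'www.cse.iitb.ac.in'
    print(parts.path)      # file path       -> '/', the 'root' of the site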

3
HTTP (Hypertext Transfer Protocol)
  • Built on top of the Transmission Control Protocol (TCP)
  • Steps (from the client end; see the sketch below)
    • resolve the server host name to an Internet address (IP)
      • use the Domain Name Service (DNS)
      • DNS is a distributed database of name-to-IP mappings
        maintained at a set of known servers
    • contact the server using TCP
      • connect to the default HTTP port (80) on the server
    • send the HTTP request header (e.g. GET)
    • fetch the response header
      • MIME (Multipurpose Internet Mail Extensions) is a
        meta-data standard for email and Web content transfer
    • fetch the HTML page
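A hedged sketch of these client-side steps using only Python's
standard socket module; the host name is the example from the
previous slide, not a required value.

    import socket

    host = "www.cse.iitb.ac.in"
    ip = socket.gethostbyname(host)       # resolve host name to an IP via DNS

    sock = socket.create_connection((ip, 80), timeout=10)  # TCP connect to port 80
    sock.sendall((
        "GET / HTTP/1.0\r\n"              # HTTP request header
        f"Host: {host}\r\n"
        "\r\n"
    ).encode("ascii"))

    response = b""
    while True:                           # read response header and HTML page
        chunk = sock.recv(4096)
        if not chunk:
            break
        response += chunk
    sock.close()

    header, _, body = response.partition(b"\r\n\r\n")
    print(header.decode("iso-8859-1"))    # MIME-style response headers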

4
Crawl all Web pages?
  • Problem: no catalog of all accessible URLs on the Web
  • Solution (a sketch of this loop follows the list)
    • start from a given set of URLs
    • progressively fetch and scan them for new outlinking URLs
    • fetch these pages in turn
    • submit the text in each page to a text indexing system
    • and so on
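A minimal, illustrative sketch (not the deck's implementation) of
this fetch-scan-enqueue loop; fetch_page() and index_text() are
small stand-ins for a real HTTP client and text indexer.

    import re
    import urllib.request
    from collections import deque
    from urllib.parse import urljoin

    def fetch_page(url):
        # stand-in fetcher: download the page and decode it as text
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")

    def index_text(url, html):
        # stand-in for a text indexing system
        print(f"indexed {url} ({len(html)} characters)")

    def crawl(seed_urls, max_pages=100):
        frontier = deque(seed_urls)            # start from a given set of URLs
        seen = set(seed_urls)
        while frontier and len(seen) <= max_pages:
            url = frontier.popleft()
            html = fetch_page(url)             # fetch the page
            index_text(url, html)              # submit its text to the indexer
            # scan for new outlinking URLs (crude HREF extraction)
            for href in re.findall(r'href="([^"]+)"', html, re.IGNORECASE):
                outlink = urljoin(url, href)
                if outlink not in seen:
                    seen.add(outlink)
                    frontier.append(outlink)   # fetch these pages in turn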

5
Crawling procedure
  • Simple procedure, but
    • a great deal of engineering goes into industry-strength
      crawlers
  • Industry crawlers crawl a substantial fraction of the Web
    • e.g. AltaVista, Northern Light, Inktomi
  • No guarantee that all accessible Web pages will be located
    in this fashion
  • Crawler may never halt
    • pages will be added continually even as it is running

6
Crawling overheads
  • Delays involved in
    • resolving the host name in the URL to an IP address using
      DNS
    • connecting a socket to the server and sending the request
    • receiving the requested page in response
  • Solution: overlap the above delays by fetching many pages at
    the same time

7
Anatomy of a crawler.
  • Page fetching threads
    • start with DNS resolution
    • finish when the entire page has been fetched
  • Each page
    • stored in compressed form to disk/tape
    • scanned for outlinks
  • Work pool of outlinks
    • maintained to keep the network utilized without overloading
      it
    • dealt with by a load manager
  • Continue till the crawler has collected a sufficient number
    of pages

8
Typical anatomy of a large-scale crawler.
9
Large-scale crawlers: performance and reliability considerations
  • Need to fetch many pages at the same time
    • to utilize the network bandwidth
    • a single page fetch may involve several seconds of network
      latency
  • Highly concurrent and parallelized DNS lookups
  • Use of asynchronous sockets
    • explicit encoding of the state of a fetch context in a data
      structure
    • polling sockets to check for completion of network
      transfers
    • multi-processing or multi-threading at this scale is
      impractical
  • Care in URL extraction
    • eliminating duplicates to reduce redundant fetches
    • avoiding spider traps

10
DNS caching, pre-fetching and resolution
  • A customized DNS component with
    • a custom client for address resolution
    • a caching server
    • a prefetching client

11
Custom client for address resolution
  • Tailored for concurrent handling of multiple outstanding
    requests
  • Allows issuing of many resolution requests together, then
    polling at a later time for completion of individual requests
    (see the sketch below)
  • Facilitates load distribution among many DNS servers
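A hedged sketch (not the deck's custom DNS client) of issuing many
resolution requests together and polling for completion later,
using Python's standard concurrent.futures and socket modules; the
host names are merely illustrative.

    import socket
    from concurrent.futures import ThreadPoolExecutor

    hosts = ["www.cse.iitb.ac.in", "www.w3.org", "example.com"]

    pool = ThreadPoolExecutor(max_workers=16)
    # issue all resolution requests at once; the caller does not block
    pending = {h: pool.submit(socket.gethostbyname, h) for h in hosts}

    # ... the crawler can do other work here ...

    # poll later for completion of individual requests
    for host, future in pending.items():
        if future.done():
            try:
                print(host, "->", future.result())
            except socket.gaierror:
                print(host, "-> resolution failed")
        else:
            print(host, "-> still resolving")
    pool.shutdown(wait=True)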

12
Caching server
  • With a large cache, persistent across DNS
    restarts
  • Residing largely in memory if possible.

13
Prefetching client
  • Steps
    • parse a page that has just been fetched
    • extract host names from HREF targets
    • make DNS resolution requests to the caching server
  • Usually implemented using UDP
    • User Datagram Protocol
    • connectionless, packet-based communication protocol
    • does not guarantee packet delivery
  • Does not wait for resolution to be completed

14
Multiple concurrent fetches
  • Managing multiple concurrent connections
    • a single download may take several seconds
    • open many socket connections to different HTTP servers
      simultaneously
  • Multi-CPU machines not very useful
    • crawling performance is limited by network and disk
  • Two approaches
    • using multi-threading
    • using non-blocking sockets with event handlers

15
Multi-threading
  • Logical threads
    • physical threads of control provided by the operating
      system (e.g. pthreads), OR
    • concurrent processes
  • Fixed number of threads allocated in advance
  • Programming paradigm (see the sketch below)
    • create a client socket
    • connect the socket to the HTTP service on a server
    • send the HTTP request header
    • read the socket (recv) until no more characters are
      available
    • close the socket
  • Uses blocking system calls
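A minimal sketch (not the deck's code) of this blocking-call
paradigm, with a fixed number of worker threads pulling URLs from
a shared work queue; the names, URLs and thread count are
assumptions made for illustration.

    import queue
    import socket
    import threading
    from urllib.parse import urlsplit

    def fetch_one(url):
        parts = urlsplit(url)
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # create a client socket
        sock.settimeout(10)
        sock.connect((parts.hostname, parts.port or 80))   # connect to the HTTP service
        sock.sendall((f"GET {parts.path or '/'} HTTP/1.0\r\n"
                      f"Host: {parts.hostname}\r\n\r\n").encode())  # send the request header
        data = b""
        while True:
            chunk = sock.recv(4096)        # blocking read ...
            if not chunk:                  # ... until no more characters are available
                break
            data += chunk
        sock.close()                       # close the socket
        return data

    def worker(work_pool):
        while True:
            url = work_pool.get()
            if url is None:                # sentinel: shut this thread down
                break
            try:
                print(url, len(fetch_one(url)), "bytes")
            except OSError as err:
                print(url, "failed:", err)
            work_pool.task_done()

    work_pool = queue.Queue()
    threads = [threading.Thread(target=worker, args=(work_pool,))
               for _ in range(4)]          # fixed number of threads, allocated in advance
    for t in threads:
        t.start()
    for u in ["http://example.com/", "http://www.w3.org/"]:
        work_pool.put(u)
    work_pool.join()
    for _ in threads:
        work_pool.put(None)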

16
Multi-threading Problems
  • Performance penalty of
    • mutual exclusion for concurrent access to data structures
    • slow disk seeks
      • a great deal of interleaved, random input-output on disk
      • due to concurrent modification of the document repository
        by multiple threads

17
Non-blocking sockets and event handlers
  • Non-blocking sockets
    • a connect, send or recv call returns immediately without
      waiting for the network operation to complete
    • the status of the network operation is polled separately
  • select system call (see the sketch below)
    • lets the application suspend until more data can be read
      from or written to the socket
    • times out after a pre-specified deadline
    • monitors several sockets at the same time
  • More efficient memory management
    • code that completes processing of one page is not
      interrupted by other completions
    • no need for locks and semaphores on the pool
    • only complete pages are appended to the log
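A hedged sketch (not the deck's crawler) of non-blocking sockets
driven by the select system call: each fetch context's state is
kept in a plain dictionary instead of a thread's stack, and a page
is handled only once its transfer completes.  Host names are
illustrative.

    import select
    import socket

    def start_fetch(host, path="/"):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setblocking(False)                 # calls now return immediately
        try:
            sock.connect((host, 80))
        except BlockingIOError:
            pass                                # expected for a non-blocking connect
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode()
        # explicit encoding of the state of the fetch in a data structure
        return {"sock": sock, "host": host, "out": request, "data": b""}

    contexts = {c["sock"]: c
                for c in (start_fetch("example.com"), start_fetch("www.w3.org"))}

    while contexts:
        want_write = [s for s, c in contexts.items() if c["out"]]
        readable, writable, _ = select.select(list(contexts), want_write, [], 5.0)
        for sock in writable:                   # connected: send the rest of the request
            ctx = contexts[sock]
            sent = sock.send(ctx["out"])
            ctx["out"] = ctx["out"][sent:]
        for sock in readable:
            ctx = contexts[sock]
            chunk = sock.recv(4096)             # poll for completed network transfers
            if chunk:
                ctx["data"] += chunk
            else:                               # transfer complete: handle the whole page
                print(ctx["host"], len(ctx["data"]), "bytes")
                sock.close()
                del contexts[sock]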

18
Link extraction and normalization
  • Goal: obtaining a canonical form of a URL
  • URL processing and filtering
    • avoid multiple fetches of pages known by different URLs
  • Many IP addresses per host name
    • for load balancing on large sites
    • mirrored contents / contents on the same file system
  • 'Proxy pass'
    • mapping of different host names to a single IP address
    • need to publish many logical sites
  • Relative URLs
    • need to be interpreted w.r.t. a base URL

19
Canonical URL
  • Formed by (see the sketch below)
    • using a standard string for the protocol
    • canonicalizing the host name
    • adding an explicit port number
    • normalizing and cleaning up the path
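A hedged sketch of one possible canonicalization following the
four steps above; real crawlers apply many more rules, and the
example URL is only illustrative.

    import posixpath
    from urllib.parse import urlsplit, urlunsplit

    def canonical_url(url):
        parts = urlsplit(url)
        scheme = parts.scheme.lower() or "http"            # standard protocol string
        host = (parts.hostname or "").lower().rstrip(".")  # canonical host name
        port = parts.port or {"http": 80, "https": 443}.get(scheme, 80)  # explicit port
        path = posixpath.normpath(parts.path or "/")       # normalize and clean the path
        if path == ".":
            path = "/"
        return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))

    print(canonical_url("HTTP://www.CSE.IITB.ac.in/a/b/../c"))
    # -> http://www.cse.iitb.ac.in:80/a/c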

20
Robot exclusion
  • Check whether the server prohibits crawling a normalized URL
    (an example check follows this list)
    • given in the robots.txt file in the HTTP root directory of
      the server
    • specifies a list of path prefixes which crawlers should not
      attempt to fetch
    • meant for crawlers only
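A small example of this check using Python's standard
urllib.robotparser; the URLs and user-agent name are illustrative.

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.cse.iitb.ac.in/robots.txt")  # robots.txt at the HTTP root
    rp.read()                                           # fetch and parse the path prefixes

    url = "http://www.cse.iitb.ac.in/private/page.html"
    if rp.can_fetch("MyCrawler", url):                  # check this crawler's user agent
        print("allowed to fetch", url)
    else:
        print("robots.txt prohibits", url)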

21
Eliminating already-visited URLs
  • Checking if a URL has already been fetched
    • before adding a new URL to the work pool
    • needs to be very quick
  • Achieved by computing an MD5 hash function on the URL
  • Exploiting spatio-temporal locality of access
    • two-level hash function (see the sketch below)
      • most significant bits (say, 24) derived by hashing the
        host name plus port
      • lower-order bits (say, 40) derived by hashing the path
      • concatenated bits used as a key in a B-tree
  • Qualifying URLs added to the frontier of the crawl
  • Hash values added to the B-tree
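A hedged sketch of the two-level (24 + 40 bit) URL key described
above, built from MD5 digests with Python's hashlib; the bit
widths follow the slide's "say, 24" and "say, 40".

    import hashlib
    from urllib.parse import urlsplit

    def url_key(url):
        parts = urlsplit(url)
        host = f"{parts.hostname}:{parts.port or 80}"
        path = parts.path or "/"
        host_digest = hashlib.md5(host.encode()).digest()
        path_digest = hashlib.md5(path.encode()).digest()
        host_bits = int.from_bytes(host_digest, "big") >> (128 - 24)
        path_bits = int.from_bytes(path_digest, "big") >> (128 - 40)
        return (host_bits << 40) | path_bits   # 64-bit key, host bits most significant

    key = url_key("http://www.cse.iitb.ac.in/faculty/index.html")
    print(f"{key:016x}")   # URLs from one server share the top 24 bits, so they
                           # cluster together when this key is used in a B-tree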

22
Spider traps
  • Protecting the crawler from crashing on
    • ill-formed HTML
      • e.g. a page with 68 kB of null characters
    • misleading sites
      • an indefinite number of pages dynamically generated by
        CGI scripts
      • paths of arbitrary depth created using soft directory
        links and path remapping features in the HTTP server

23
Spider Traps Solutions
  • No automatic technique can be foolproof
  • Check for URL length
  • Guards
    • prepare regular crawl statistics
    • add dominating sites to the guard module
    • disable crawling of active content such as CGI form queries
    • eliminate URLs with non-textual data types

24
Avoiding repeated expansion of links on duplicate
pages
  • Reduce redundancy in crawls
  • Duplicate detection
    • mirrored Web pages and sites
  • Detecting exact duplicates
    • checking against MD5 digests of stored URLs
    • representing a relative link v (relative to aliases u1 and
      u2) as tuples (h(u1), v) and (h(u2), v)
  • Detecting near-duplicates
    • even a single altered character will completely change the
      digest!
      • e.g. date of update, or name and email of the site
        administrator
    • solution: shingling (see the sketch below)
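A hedged sketch of w-shingling for near-duplicate detection: two
pages count as near-duplicates when the Jaccard overlap of their
shingle sets is high.  The window size w=4 is an assumption, not a
value from the slides.

    def shingles(text, w=4):
        words = text.lower().split()
        return {tuple(words[i:i + w]) for i in range(max(len(words) - w + 1, 1))}

    def resemblance(a, b, w=4):
        sa, sb = shingles(a, w), shingles(b, w)
        return len(sa & sb) / len(sa | sb)   # Jaccard coefficient of shingle sets

    page1 = ("this page was last updated in 2001 by the site administrator "
             "who can be reached at the email address on the contact page")
    page2 = page1.replace("2001", "2002")    # only the date of update differs
    print(round(resemblance(page1, page2), 2))                # high overlap
    print(round(resemblance(page1, "an unrelated page"), 2))  # near 0.0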

25
Load monitor
  • Keeps track of various system statistics
    • recent performance of the wide area network (WAN)
      connection
      • e.g. latency and bandwidth estimates
    • operator-provided/estimated upper bound on open sockets for
      the crawler
    • current number of active sockets

26
Thread manager
  • Responsible for
    • choosing units of work from the frontier
    • scheduling the issue of network resources
    • distributing these requests over multiple ISPs if
      appropriate
  • Uses statistics from the load monitor

27
Per-server work queues
  • Servers protect against denial of service (DoS) attacks
    • they limit the speed or frequency of responses to any fixed
      client IP address
  • Avoiding being treated as a DoS attacker
    • limit the number of active requests to a given server IP
      address at any time
    • maintain a queue of requests for each server (see the
      sketch below)
    • use the HTTP/1.1 persistent socket capability
    • distribute attention relatively evenly between a large
      number of sites
  • Access locality vs. politeness dilemma
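A hedged sketch of per-server work queues: each host gets its own
FIFO plus an "earliest next contact" time, so requests to one
server are spaced out while attention is spread over many sites.
The 2-second gap is an assumed politeness delay, not a value from
the slides.

    import time
    from collections import defaultdict, deque
    from urllib.parse import urlsplit

    MIN_GAP = 2.0                      # assumed per-host politeness interval (seconds)
    queues = defaultdict(deque)        # one FIFO of pending URLs per server
    next_ok = defaultdict(float)       # earliest time each server may be contacted

    def enqueue(url):
        queues[urlsplit(url).hostname].append(url)

    def next_url(now=None):
        # pick a URL from some host whose politeness interval has expired
        now = time.monotonic() if now is None else now
        for host, q in queues.items():
            if q and next_ok[host] <= now:
                next_ok[host] = now + MIN_GAP   # rest this host for a while
                return q.popleft()
        return None                             # every host with work is resting

    enqueue("http://www.cse.iitb.ac.in/a.html")
    enqueue("http://www.cse.iitb.ac.in/b.html")
    enqueue("http://example.com/index.html")
    print(next_url(), next_url(), next_url())
    # the two www.cse.iitb.ac.in URLs are never issued back to back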

28
Text repository
  • The crawler's last task
    • dumping fetched pages into a repository
  • Decoupling the crawler from other functions is preferred, for
    efficiency and reliability
  • Page-related information stored in two parts
    • meta-data
    • page contents

29
Storage of page-related information
  • Meta-data
    • relational in nature
    • usually managed by custom software to avoid relational
      database system overheads
      • the text index involves bulk updates
    • includes fields like content-type, last-modified date,
      content-length, HTTP status code, etc.

30
Page contents storage
  • A typical HTML Web page compresses to 2-4 kB (using zlib)
  • File systems have a 4-8 kB file block size
    • too large: one file per page wastes space
  • Page storage managed by a custom storage manager
    • simple access methods for
      • the crawler to add pages
      • subsequent programs (indexer etc.) to retrieve documents

31
Page Storage
  • Small-scale systems
    • repository fits within the disks of a single machine
  • Use of a storage manager (e.g. Berkeley DB)
    • manages disk-based databases within a single file
    • configured as a hash-table/B-tree with the URL as access
      key
  • To handle ordered access of pages
    • configured as a sequential log of page records (see the
      sketch below)
    • since the indexer can handle pages in any order
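A hedged sketch (much simpler than Berkeley DB) of a sequential
log of page records: each record is a zlib-compressed page with a
small length-prefixed header, so the crawler can append pages and
the indexer can scan them in order.  The file name is an
assumption.

    import struct
    import zlib

    LOG = "pages.log"

    def add_page(url, html):
        blob = zlib.compress(html.encode("utf-8"))   # typical page -> 2-4 kB
        key = url.encode("utf-8")
        with open(LOG, "ab") as log:                 # the crawler only appends
            log.write(struct.pack("!II", len(key), len(blob)))
            log.write(key)
            log.write(blob)

    def scan_pages():
        # yield (url, html) pairs in the order they were stored
        with open(LOG, "rb") as log:
            while header := log.read(8):
                key_len, blob_len = struct.unpack("!II", header)
                url = log.read(key_len).decode("utf-8")
                html = zlib.decompress(log.read(blob_len)).decode("utf-8")
                yield url, html

    add_page("http://example.com/", "<html><body>hello crawler</body></html>")
    for url, html in scan_pages():
        print(url, len(html), "characters")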

32
Page Storage
  • Large-scale systems
    • repository distributed over a number of storage servers
  • Storage servers
    • connected to the crawler through a fast local network
      (e.g. Ethernet)
    • pages hashed to servers by URL
  • 'T3'-grade leased lines
    • to handle 10 million pages (40 GB) per hour

33
Large-scale crawlers often use multiple ISPs and
a bank of local storage servers to store the
pages crawled.
34
Refreshing crawled pages
  • A search engine's index should be fresh
  • A Web-scale crawler never 'completes' its job
  • High variance in the rate of page changes
  • The If-Modified-Since request header of the HTTP protocol
    • impractical for a crawler (an example of the mechanism
      follows this list)
  • Solution
    • at the commencement of a new crawling round, estimate which
      pages have changed
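For reference, a small example of the If-Modified-Since mechanism
mentioned above, using Python's standard urllib; a 304 response
means the page has not changed.  The URL and date are
illustrative.

    import urllib.request
    from urllib.error import HTTPError

    req = urllib.request.Request("http://example.com/", headers={
        "If-Modified-Since": "Sat, 01 Jan 2022 00:00:00 GMT"  # time of last crawl
    })
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            print("page changed, fetched", len(resp.read()), "bytes")
    except HTTPError as err:
        if err.code == 304:
            print("not modified since the last crawl")
        else:
            raise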

35
Determining page changes
  • The Expires HTTP response header
    • for pages that come with an expiry date
  • Otherwise, need to guess whether revisiting the page will
    yield a modified version
    • maintain a score reflecting the probability that the page
      has been modified
    • the crawler fetches URLs in decreasing order of score
  • Assumption: the recent past predicts the future

36
Estimating page change rates
  • Brewington and Cybenko; Cho
    • algorithms for maintaining a crawl in which most pages are
      fresher than a specified epoch
  • Prerequisite
    • the average interval at which the crawler checks for
      changes is smaller than the inter-modification times of a
      page
  • Small-scale intermediate crawler runs
    • to monitor fast-changing sites
      • e.g. current news, weather, etc.
    • intermediate indices patched into the master index

37
Putting together a crawler
  • Reference implementation of the HTTP client protocol
    • available from the World Wide Web Consortium
      (http://www.w3c.org/)
    • the w3c-libwww package

38
Design of the core components: the Crawler class
  • Task: to copy bytes from network sockets to storage media
  • Three methods express the Crawler's contract with the user
    (sketched below)
    • pushing a URL to be fetched to the Crawler (fetchPush)
    • a termination callback handler (fetchDone) called with the
      same URL
    • a method (start) which starts the Crawler's event loop
  • Implementation of the Crawler class
    • needs two helper classes called DNS and Fetch
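The deck's reference implementation builds on the C-language
w3c-libwww package; purely to illustrate the three-method contract
described above, here is a hedged Python sketch (everything except
the names fetchPush, fetchDone and start is an assumption).

    import urllib.request

    class Crawler:
        def __init__(self):
            self.pending = []              # work pool of URLs pushed by the user

        def fetchPush(self, url):
            # the user pushes a URL to be fetched
            self.pending.append(url)

        def fetchDone(self, url, page, error=None):
            # termination callback, called with the same URL when its fetch ends;
            # a real user would override this to store the page and extract outlinks
            print(url, "->", f"error: {error}" if error else f"{len(page)} bytes")

        def start(self):
            # event loop: keep fetching until the work pool is empty
            while self.pending:
                url = self.pending.pop(0)
                try:
                    with urllib.request.urlopen(url, timeout=10) as resp:
                        self.fetchDone(url, resp.read())
                except OSError as err:
                    self.fetchDone(url, b"", error=err)

    crawler = Crawler()
    crawler.fetchPush("http://example.com/")
    crawler.start()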