Web Crawlers - PowerPoint PPT Presentation

Slides: 25
Provided by: vladimirb4

1
Web Crawlers
  • IST 497
  • Vladimir Belyavskiy
  • 11/21/02

2
Overview
  • Introduction to Crawlers
  • Focused Crawling
  • Issues to consider
  • Parallel Crawlers
  • Ambitions for the future
  • Conclusion

3
Introduction
  • What is a crawler?
  • Why are crawlers important?
  • Used by many
  • Main use is to create indexes for search engines
  • Tool was needed to keep track of web content
  • In March of 2002 there were 38,118,962 web sites

6
Focused Crawling
  • Focused Crawler selectively seeks out pages that
    are relevant to a pre-defined set of topics.
  • Topics specified by using exemplary documents
    (not keywords)
  • Crawl most relevant links
  • Ignore irrelevant parts.
  • Leads to significant savings in hardware and
    network resources.
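
The idea of crawling the most relevant links first can be sketched with a priority queue. This is only a minimal illustration, not the method from the talk: it assumes relevance is a bag-of-words cosine similarity against the example documents, and the names `relevance`, `FocusedFrontier` and the `threshold` parameter are hypothetical.

```python
import heapq
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a piece of text (lowercased)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevance(page_text, example_docs):
    """Score a page by its best match against the example documents."""
    page_vec = term_vector(page_text)
    return max((cosine(page_vec, term_vector(d)) for d in example_docs),
               default=0.0)


class FocusedFrontier:
    """Priority queue of URLs; the highest-scoring URL is crawled first."""

    def __init__(self, threshold=0.1):
        self.threshold = threshold  # assumed cutoff below which links are ignored
        self._heap = []
        self._seen = set()

    def add(self, url, score):
        """Queue a URL unless it was already seen or scores below the cutoff."""
        if url in self._seen or score < self.threshold:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (-score, url))  # negate: heapq is a min-heap
        return True

    def pop(self):
        """Return the (url, score) pair with the highest score."""
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

In a sketch like this, a link would typically inherit the score of the page it was found on, so links from irrelevant pages never enter the queue.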

7
Issues to consider
  • Where to start crawling?
  • Keyword search
  • User specifies keywords
  • Search for given criteria
  • Popular sites are found using weighted degree
    measures
  • Approach used for 966 Yahoo category searches
    (e.g. Business/Electronics)
  • User input
  • User gives example documents
  • Crawler compares documents to find matches

8
Issues to consider
  • URLs found are stored in a queue, stack, or
    deque
  • Which link do you crawl next?
  • Ordering metrics
  • Breadth-First
  • URLs are placed in the queue in order discovered
  • First link found is the first to crawl
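
Breadth-first ordering can be sketched on a small in-memory link graph. The `links` dictionary stands in for fetching and parsing real pages, and the function name and `limit` parameter are assumptions made for this illustration.

```python
from collections import deque


def breadth_first_order(seed, links, limit=None):
    """Return URLs in the order a breadth-first crawler would visit them.

    `links` maps each URL to the list of URLs found on that page.
    """
    queue = deque([seed])   # FIFO queue: first link found is first crawled
    seen = {seed}           # never queue the same URL twice
    order = []
    while queue and (limit is None or len(order) < limit):
        url = queue.popleft()
        order.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order
```

Replacing `queue.popleft()` with `queue.pop()` would treat the same container as a stack and give a depth-first order instead.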

10
Issues to consider
  • Backlink count
  • Counts the number of links to the page
  • Page with the greatest number of backlinks is
    given priority
  • PageRank
  • Backlinks are also counted
  • Backlinks from popular pages are given extra
    weight (e.g. Yahoo)
  • Works best of these ordering metrics
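
The difference between the two metrics can be illustrated on a toy link graph. This is a simplified sketch, assuming the whole graph is known in advance (a real crawler only sees the part it has downloaded). The damping factor of 0.85 and the handling of pages without outgoing links are conventional choices, not details from the talk.

```python
def backlink_counts(links):
    """Count, for every page, how many distinct pages link to it."""
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in set(targets):
            counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a dict-of-lists graph."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            # A page with no outgoing links spreads its rank over all pages.
            targets = links.get(page) or list(pages)
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank
```

With a graph in which several pages link to a hub and the hub links to page `a`, while a single obscure page links to page `b`, both `a` and `b` have a backlink count of 1, yet `a` receives the higher PageRank because its one backlink comes from a popular page.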

12
Issues to consider
  • Which pages should the crawler download?
  • Not enough space
  • Not enough time
  • How to keep content fresh?
  • Fixed Order - Explicit list of URLs to visit
  • Random Order - Start from seed and follow links
  • Purely Random - Refresh pages on demand

15
Issues to consider
  • Estimate frequency of changes
  • Visit pages once a week for five weeks
  • Estimate change frequency
  • Adjust revisit frequency based on the estimate
  • Most effective method
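
The estimate-then-adjust loop can be sketched as follows. This uses the naive estimator (changes detected divided by time observed), which is an assumption for illustration and not necessarily the estimator used in the cited work. The function names and the `min_days`/`max_days` bounds are hypothetical.

```python
import hashlib


def count_changes(snapshots):
    """Count how many consecutive visits saw different page content."""
    digests = [hashlib.sha256(s.encode("utf-8")).hexdigest() for s in snapshots]
    return sum(1 for prev, cur in zip(digests, digests[1:]) if prev != cur)


def estimate_change_rate(snapshots, interval_days=7.0):
    """Naive estimate of changes per day from evenly spaced visits."""
    intervals = len(snapshots) - 1
    if intervals < 1:
        raise ValueError("need at least two snapshots")
    return count_changes(snapshots) / (intervals * interval_days)


def revisit_interval_days(rate, min_days=1.0, max_days=30.0):
    """Revisit roughly once per expected change, within fixed bounds."""
    if rate <= 0:
        return max_days
    return min(max_days, max(min_days, 1.0 / rate))
```

Five weekly visits give four intervals, so a page that changed in two of them is estimated to change about once every 14 days. Weekly visits cannot tell one change from several within the same week, so this estimator undercounts pages that change more often than they are visited.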

16
Issues to consider
  • How to minimize the load on visited pages?
  • Crawler should obey the constraints
  • Crawler HTML tags (robots meta tags)
  • robots.txt file
  • User-agent
  • Disallow: /
  • Spider Traps
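
Checking robots.txt before fetching can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` user agent and the example.com URLs below are made up for illustration; a real crawler would download the file from the site being crawled.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask before every fetch; "ExampleBot" is a made-up user agent name.
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/data.html")
delay = parser.crawl_delay("ExampleBot")  # seconds to wait between requests
```

A file containing only `User-agent: *` and `Disallow: /` asks every crawler to stay away from the whole site. Spider traps are not covered by robots.txt and need separate safeguards, such as limits on URL depth or on pages fetched per site.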

18
Parallel Crawlers
  • The Web is too big to be crawled by a single
    crawler; work should be divided
  • Independent assignment
  • Each crawler starts with its own set of URLs
  • Follows links without consulting other crawlers
  • Reduces communication overhead
  • Some overlap is unavoidable

19
Parallel Crawlers
  • Dynamic assignment
  • Central coordinator divides web into partitions
  • Crawlers crawl their assigned partition
  • Links to other URLs are given to the central
    coordinator
  • Static assignment
  • Web is partitioned and one partition is
    assigned to each crawler
  • Crawler only crawls its part of the web
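
One common way to realize static assignment is to hash each URL's host name, so that every crawler can work out the owner of a URL without asking a coordinator. The slides do not say how the partitions are defined, so the hashing scheme and the function names below are assumptions.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url, num_crawlers):
    """Map a URL to a crawler index based on a hash of its host name."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


def partition(urls, num_crawlers):
    """Split a list of URLs into one bucket per crawler."""
    buckets = [[] for _ in range(num_crawlers)]
    for url in urls:
        buckets[assign_crawler(url, num_crawlers)].append(url)
    return buckets
```

Hashing the host rather than the full URL keeps all pages of a site with the same crawler, which makes per-site politeness limits easier to enforce.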

21
Evaluation
  • Content quality is better for a single-process
    crawler
  • Most multi-process crawlers either overlap or
    don't cover all of the content
  • Overall, crawlers are useful tools

22
Future
  • Query interface pages
  • Ex. http://www.weatherchannel.com
  • Detect web page changes better
  • Separate dynamic from static content
  • Share data better between servers and crawlers

24
The End
  • Any Questions?