COMP630D Presentation: Focused Crawler

By: Yang Yongsheng, Wang Hui. 22 Nov 2000.

1
COMP630D Presentation: Focused Crawler
  • By Yang Yongsheng, Wang Hui
  • 22.Nov.2000

2
Outline
  • Section 1
    • Introduction
    • Basic Techniques
    • Sougata's Improved Method
  • Section 2
    • Weaknesses of Basic Techniques
    • Our Modified Algorithm
    • Summary

3
Introduction
  • What is a focused crawler?
    • Given some keywords/URLs that represent one or more
      topics
    • Fetches web pages on those topics
    • Avoids downloading irrelevant web pages

4
Introduction (cont.)
  • Why focus?
    • The Web is a huge and rapidly growing database
    • A general-purpose crawler consumes too much time and
      too many resources
    • In most cases, we only need information about
      specific topics

5
Basic Techniques
[Flowchart of the basic crawler. Recoverable box labels: Example URLs, Keywords, Search engine, Seed URLs, Download queue, Queue empty? (Yes / No), End, Index pages and extract out-links, Classifier/Filter, Relevant pages, Distiller, Relevant links, Out-links]

6
Basic Techniques (II)
  • Classifier
    • ranks pages based on keywords
  • Distiller
    • ranks links based on hub weights
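
The crawl loop from slide 5 and the classifier/distiller roles above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the presenters' implementation: the web is a small in-memory dictionary (`TOY_WEB`, hypothetical) instead of real HTTP fetches, the classifier is a plain keyword-frequency score with a hypothetical `threshold`, and the distiller is replaced by a simple stand-in that queues each out-link with the score of the page it was found on (it does not compute hub weights).

```python
import heapq

# Hypothetical in-memory "web": url -> (page text, out-links). No network access.
TOY_WEB = {
    "http://a.example/": ("titanic movie cast and crew",
                          ["http://a.example/ship", "http://b.example/"]),
    "http://a.example/ship": ("titanic ship history iceberg", ["http://c.example/"]),
    "http://b.example/": ("cooking recipes and kitchen tips", ["http://d.example/"]),
    "http://c.example/": ("iceberg patrol after the titanic", []),
    "http://d.example/": ("more recipes", []),
}


def classify(text, keywords):
    """Classifier: fraction of the words on the page that are topic keywords."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in keywords for w in words) / len(words)


def focused_crawl(seeds, keywords, threshold=0.1, fetch=TOY_WEB.get):
    """Return (relevant, ignored) URL lists in download order."""
    # Download queue as a max-heap on priority (negated, since heapq is a min-heap).
    queue = [(-1.0, url) for url in seeds]
    heapq.heapify(queue)
    seen = set(seeds)
    relevant, ignored = [], []
    while queue:                        # "Queue empty?" -> No: keep going
        _, url = heapq.heappop(queue)
        page = fetch(url)
        if page is None:                # page could not be fetched
            continue
        text, out_links = page
        score = classify(text, keywords)
        if score < threshold:           # filter: drop the page and its links
            ignored.append(url)
            continue
        relevant.append(url)
        # Stand-in distiller: links inherit the score of the page they came from.
        for link in out_links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(queue, (-score, link))
    return relevant, ignored            # "Queue empty?" -> Yes: end


relevant, ignored = focused_crawl(["http://a.example/"], {"titanic", "iceberg"})
```

In this toy run the cooking page is fetched and rejected, so the page it links to is never downloaded, which is the saving a focused crawler is after.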

7
Performance
  • About 50% of pages are relevant
  • Bottleneck
    • downloading pages consumes too much time
    • network congestion
    • downloading many pages from a single site may
      take a long time
  • Need to improve precision

8
Sougata's improved method (I) (WWW9)
  • Nearness of the current page to the linked
    page, e.g. current page in A/B
    • same website
      • only pages in (A, B, C, D, E, F, G)
    • other website
      • pages that link to or from pages in (A, B, C, D, E, F, G)
  • URL contains any of the topic keywords
    • http://www.titanicmovie.com -> Titanic collection

[Figure: directory hierarchy with nodes A, B, C, D, E, F, G]
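
One possible reading of the nearness rule is a distance check in the directory hierarchy. The sketch below rests on assumptions: the slide lists the near directories (A to G) for a page in A/B but does not state the rule that produces them, so the hop-count measure and the `max_hops` parameter are hypothetical choices, as are all function names. The URL keyword check follows the Titanic example.

```python
import posixpath
from urllib.parse import urlsplit


def directory_parts(url):
    """Directory components of a URL path, e.g. /A/B/page.html -> ['A', 'B']."""
    path = urlsplit(url).path
    directory = path if path.endswith("/") else posixpath.dirname(path)
    return [p for p in directory.split("/") if p]


def directory_distance(current_url, linked_url):
    """Hops between two directories: up to the common ancestor, then down."""
    a, b = directory_parts(current_url), directory_parts(linked_url)
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def is_near_same_site(current_url, linked_url, max_hops=2):
    """Assumed rule: same host and at most max_hops directory hops apart."""
    if urlsplit(current_url).netloc != urlsplit(linked_url).netloc:
        return False
    return directory_distance(current_url, linked_url) <= max_hops


def url_has_keyword(url, keywords):
    """True if any topic keyword occurs in the URL (case-insensitive)."""
    lowered = url.lower()
    return any(k.lower() in lowered for k in keywords)
```

With `max_hops=2`, a page in A/B is near its parent directory, its child directories, and its sibling directories, while a page several levels away in an unrelated branch is skipped.
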
9
Sougata's improved method (II)
  • Irrelevant directories
    • keep a counter of indexed and ignored pages
      for each directory
    • if more than 25 pages have been downloaded from a
      directory and 90% of them were ignored, the directory
      is irrelevant; do not download any more pages from it
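
The per-directory counters could be kept along these lines. This is a sketch, not the original system's code: the class and method names are hypothetical, a directory is identified by host plus path directory, and "90% ignored" is interpreted as "at least 90%", which the slide does not specify.

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlsplit


class DirectoryFilter:
    """Tracks indexed and ignored page counts per directory."""

    def __init__(self, min_pages=25, ignored_ratio=0.9):
        self.min_pages = min_pages          # threshold from the slide: > 25 pages
        self.ignored_ratio = ignored_ratio  # threshold from the slide: 90% ignored
        self.indexed = defaultdict(int)
        self.ignored = defaultdict(int)

    @staticmethod
    def directory_of(url):
        """Key for a URL's directory, e.g. 'site.example/docs/'."""
        parts = urlsplit(url)
        path = parts.path
        directory = path if path.endswith("/") else posixpath.dirname(path)
        return parts.netloc + directory.rstrip("/") + "/"

    def record(self, url, was_indexed):
        """Count a downloaded page as indexed (relevant) or ignored."""
        key = self.directory_of(url)
        if was_indexed:
            self.indexed[key] += 1
        else:
            self.ignored[key] += 1

    def is_irrelevant(self, url):
        """True if the URL's directory should not be crawled any further."""
        key = self.directory_of(url)
        ignored = self.ignored.get(key, 0)
        total = self.indexed.get(key, 0) + ignored
        return total > self.min_pages and ignored >= self.ignored_ratio * total
```

A crawler would call `record` after classifying each downloaded page and check `is_irrelevant` before queuing a new link.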

10
Comparison of basic and Sougata's method
  • Download (%) = 100 × (pages downloaded by Sougata's
    crawler) / (pages downloaded by basic crawler)
  • Nearness (%): pages not downloaded because the linked
    page was not near the original page
  • Rejected Directories (%): pages not downloaded because
    the directory was irrelevant
  • Relevant Pages Missed (%) = 100 × (number of pages
    indexed by basic crawler but ignored by Sougata's
    crawler) / (total number of pages indexed by basic
    crawler)
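
The two ratio metrics can be computed directly from sets of URLs. The sketch below is illustrative only: the function names and the sample URLs are made up, and a page "ignored by Sougata's crawler" is assumed to mean any page the basic crawler indexed that Sougata's crawler did not index.

```python
def download_pct(sougata_downloaded, basic_downloaded):
    """Download (%): pages downloaded by Sougata's crawler relative to the basic one."""
    return 100.0 * len(sougata_downloaded) / len(basic_downloaded)


def relevant_missed_pct(basic_indexed, sougata_indexed):
    """Relevant Pages Missed (%): share of the basic crawler's indexed pages
    that Sougata's crawler did not index (assumed meaning of "ignored")."""
    missed = basic_indexed - sougata_indexed
    return 100.0 * len(missed) / len(basic_indexed)


# Made-up sample data, for illustration only.
basic_downloaded = {"p%d" % i for i in range(10)}
sougata_downloaded = {"p%d" % i for i in range(6)}
basic_indexed = {"p0", "p1", "p2", "p3", "p4"}
sougata_indexed = {"p0", "p1", "p2", "p3"}

download = download_pct(sougata_downloaded, basic_downloaded)
missed = relevant_missed_pct(basic_indexed, sougata_indexed)
```

A lower Download (%) means less crawling work, and a lower Relevant Pages Missed (%) means fewer relevant pages were lost to the pruning.
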
11
Section 2
  • Wang Hui will continue