COMP630D Presentation: Focused Crawler

COMP630D Presentation: Focused Crawler. By: Yang Yongsheng, Wang Hui. 22 Nov 2000. Section 2: Wang Hui will continue...

Slides: 12

Provided by: tcp7

1
COMP630D Presentation: Focused Crawler

By Yang Yongsheng, Wang Hui
22.Nov.2000

2
Outline

Section 1
Introduction
Basic Techniques
Sougata's Improved Method
Section 2
Weakness of Basic Techniques
Our Modified Algorithm
Summary

3
Introduction

What is a focused crawler?
Given some keywords/URLs that represent one or more topics
Fetches web pages on those topics
Avoids downloading irrelevant web pages

4
Introduction cont.

Why focus?
The Web is a huge and rapidly growing database
A general-purpose crawler consumes too much time and too many resources
In most cases, we only need information about specific topics

5
Basic Techniques
[Flowchart of the basic focused crawler. Boxes: Example URLs, Keywords, Search engine, Seed URLs, Download queue, Index pages and extract out-links, Classifier/Filter, Relevant pages, Distiller, Relevant links, Out-links. Decision: Queue empty? (Yes: End; No: continue crawling).]
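
The flowchart on this slide can be read as a loop over a download queue. The following is a minimal sketch of that loop, not the presenters' implementation: the toy in-memory web (`TOY_WEB`), the `is_relevant` stand-in classifier, and the `fetch` callback are all hypothetical, and the distiller step is left out for brevity.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, out-links).
TOY_WEB = {
    "http://a.example/": ("titanic movie cast",
                          ["http://a.example/ship", "http://b.example/"]),
    "http://a.example/ship": ("titanic ship history", []),
    "http://b.example/": ("cooking recipes", ["http://b.example/pasta"]),
    "http://b.example/pasta": ("pasta recipes", []),
}


def is_relevant(text, keywords):
    """Stand-in classifier: relevant if the page mentions any keyword."""
    words = set(text.lower().split())
    return any(k.lower() in words for k in keywords)


def focused_crawl(seed_urls, keywords, fetch):
    """Crawl from the seeds, following links only out of relevant pages."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    indexed = []
    while queue:  # "Queue empty?" answered No: keep crawling
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue  # page could not be fetched
        text, out_links = page
        if not is_relevant(text, keywords):
            continue  # irrelevant page: its out-links are not queued
        indexed.append(url)
        for link in out_links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed  # "Queue empty?" answered Yes: end


result = focused_crawl(["http://a.example/"], ["titanic"], TOY_WEB.get)
print(result)
```

In this toy run the cooking page is fetched but rejected, so the pasta page behind it is never downloaded.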
6
Basic Techniques (II)

Classifier
rank pages based on keywords
Distiller
rank links based on hub weights
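
As a rough illustration of the two components, the sketch below scores a page by keyword overlap and computes hub weights with a HITS-style iteration. Both functions are assumptions made for illustration: the slides do not say how keywords are weighted or how hub weights are computed, and the names `keyword_score` and `hub_weights` are hypothetical.

```python
def keyword_score(text, keywords):
    """Fraction of the topic keywords that occur in the page text."""
    words = set(text.lower().split())
    hits = sum(1 for k in keywords if k.lower() in words)
    return hits / len(keywords) if keywords else 0.0


def hub_weights(links, iterations=20):
    """HITS-style hub scores for a graph given as {page: [linked pages]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {n: sum(auth[t] for t in links.get(n, [])) for n in nodes}
        # Normalize so the scores stay bounded across iterations.
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub


graph = {"hub": ["p1", "p2", "p3"], "p1": ["p2"], "p2": [], "p3": []}
weights = hub_weights(graph)
best = max(weights, key=weights.get)
print(best, keyword_score("titanic movie cast", ["titanic", "ship"]))
```

A distiller built this way would prefer links found on pages with high hub weight, since those pages point to many well-linked pages.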

7
Performance

About 50% of pages are relevant
Bottleneck
downloading pages is too time-consuming
network congestion
downloading many pages from a single site may take a long time
Need to improve precision

8
Sougata's improved method (I) [WWW9]

Nearness of the current page to the linked page, e.g. current page in A/B
same website
only pages in (A, B, C, D, E, F, G)
other website
pages that link to or from pages in (A, B, C, D, E, F, G)
URL contains any of the topic keywords
http://www.titanicmovie.com -> Titanic collection

[Figure: diagram of directories A, B, C, D, E, F, G]
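
The nearness rules on this slide could be sketched as a link filter along the following lines. This is a simplified reading, not the method as published: the set of "near" directories is supplied by the caller because the slide's diagram does not survive, a link to another website is followed only when its URL contains a topic keyword, and the names `should_follow`, `directory_of`, and `url_mentions_topic` are hypothetical.

```python
from urllib.parse import urlparse


def url_mentions_topic(url, keywords):
    """True if any topic keyword appears anywhere in the URL."""
    lowered = url.lower()
    return any(k.lower() in lowered for k in keywords)


def directory_of(url):
    """Directory part of the URL path, always ending in '/'."""
    path = urlparse(url).path
    return path.rsplit("/", 1)[0] + "/"


def should_follow(current_url, link_url, near_dirs, keywords):
    """Decide whether a link found on current_url is worth downloading."""
    same_site = urlparse(current_url).netloc == urlparse(link_url).netloc
    if same_site:
        # Same website: only pages in the directories considered near.
        return directory_of(link_url) in near_dirs
    # Other website (simplified): keep links whose URL names the topic.
    return url_mentions_topic(link_url, keywords)


# Hypothetical set of directories near a current page in /A/B/.
near = {"/A/", "/A/B/", "/A/B/C/"}
current = "http://site.example/A/B/index.html"
print(should_follow(current, "http://site.example/A/B/C/x.html", near, ["titanic"]))
print(should_follow(current, "http://www.titanicmovie.com", near, ["titanic"]))
```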
9
Sougata's improved method (II)

Irrelevant Directories
keep a counter of indexed and ignored pages for each directory
if the number of downloaded pages > 25 and 90% of those pages are ignored, that directory is irrelevant;
do not download any more pages from it
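
The directory rule can be sketched with two counters per directory. The thresholds (more than 25 downloaded pages, 90% ignored) come from the slide; treating "90%" as "at least 90%", keying directories by host plus path, and the class name `DirectoryFilter` are assumptions made for this illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse

MIN_PAGES = 25           # rule applies once more than 25 pages are downloaded
IGNORED_FRACTION = 0.90  # assumed to mean "at least 90% ignored"


class DirectoryFilter:
    """Tracks indexed and ignored page counts for each directory."""

    def __init__(self):
        self.indexed = defaultdict(int)
        self.ignored = defaultdict(int)

    @staticmethod
    def directory_of(url):
        """Key a directory by host plus the path up to the last '/'."""
        parts = urlparse(url)
        return parts.netloc + parts.path.rsplit("/", 1)[0] + "/"

    def record(self, url, was_indexed):
        """Count one downloaded page as either indexed or ignored."""
        key = self.directory_of(url)
        if was_indexed:
            self.indexed[key] += 1
        else:
            self.ignored[key] += 1

    def is_irrelevant(self, url):
        """True if the directory of url should no longer be crawled."""
        key = self.directory_of(url)
        downloaded = self.indexed[key] + self.ignored[key]
        return (downloaded > MIN_PAGES
                and self.ignored[key] >= IGNORED_FRACTION * downloaded)


flt = DirectoryFilter()
for i in range(26):
    # 2 pages indexed, 24 ignored in the same directory.
    flt.record(f"http://site.example/ads/p{i}.html", was_indexed=(i < 2))
print(flt.is_irrelevant("http://site.example/ads/new.html"))
```

A crawler would call `is_irrelevant` before queuing a link and `record` after classifying each downloaded page.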

10
Comparison of basic and Sougata's method
Download (%) = 100 * (pages downloaded by Sougata's crawler) / (pages downloaded by basic crawler)
Nearness (%): pages not downloaded because the linked page was not near the original page
Rejected Directories (%): pages not downloaded because the directory was irrelevant
Relevant Pages Missed (%) = 100 * (number of pages indexed by basic crawler but ignored by Sougata's crawler) / (total number of pages indexed by basic crawler)
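
The two measures that this slide defines as ratios, Download (%) and Relevant Pages Missed (%), can be computed as in the sketch below. The function names are hypothetical and the inputs are made-up numbers for illustration only, not results from the presentation.

```python
def download_pct(improved_downloaded, basic_downloaded):
    """Download (%): pages downloaded by the improved crawler per 100
    pages downloaded by the basic crawler."""
    return 100.0 * improved_downloaded / basic_downloaded


def relevant_missed_pct(basic_indexed, improved_ignored):
    """Relevant Pages Missed (%): share of the pages indexed by the
    basic crawler that the improved crawler ignored."""
    missed = basic_indexed & improved_ignored
    return 100.0 * len(missed) / len(basic_indexed)


# Made-up inputs, for illustration only.
print(download_pct(400, 1000))
print(relevant_missed_pct({"a", "b", "c", "d"}, {"d", "x"}))
```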
11
Section 2