Title: COMP630D Presentation: Focused Crawler
1. COMP630D Presentation: Focused Crawler
- By Yang Yongsheng, Wang Hui
- 22.Nov.2000
2. Outline
- Section 1
- Introduction
- Basic Techniques
- Sougata's Improved Method
- Section 2
- Weakness of Basic Techniques
- Our Modified Algorithm
- Summary
3. Introduction
- What is a focused crawler?
- Given some keywords/URLs that represent one or more topics
- Fetch web pages on those topics
- Avoid downloading irrelevant web pages
4. Introduction (cont.)
- Why focus?
- The Web is a huge and rapidly growing database
- A general-purpose crawler consumes too much time and too many resources
- In most cases, we just need information about specific topics
5. Basic Techniques
[Flowchart. Nodes: Example URLs, Keywords, Search engine, Seed URLs, Download queue, Queue empty? (Yes: End; No: continue), Index pages and extract out-links, Classifier/Filter, Relevant pages, Distiller, Out-links, Relevant links]
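The loop in the flowchart can be sketched roughly as follows. This is a minimal illustration, not the presenters' implementation: the names `focused_crawl`, `fetch`, `is_relevant`, `select_links`, `max_pages`, and the toy `WEB` dictionary are all hypothetical, and the choice to drop the out-links of irrelevant pages is an assumption about how the classifier and distiller connect.

```python
from collections import deque

def focused_crawl(seed_urls, fetch, is_relevant, select_links, max_pages=100):
    """Run a download-queue loop; all callables are supplied by the caller."""
    queue = deque(seed_urls)            # download queue, seeded with seed URLs
    seen = set(seed_urls)               # URLs already queued, to avoid repeats
    relevant_pages = []
    downloaded = 0
    while queue and downloaded < max_pages:   # stop when the queue is empty
        url = queue.popleft()
        text, out_links = fetch(url)          # download page, extract out-links
        downloaded += 1
        if not is_relevant(text):             # classifier / filter step
            continue                          # assumption: ignore its links too
        relevant_pages.append(url)
        for link in select_links(url, out_links):   # distiller step
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return relevant_pages

# Toy in-memory "web": url -> (page text, out-links). Purely illustrative.
WEB = {
    "a": ("titanic movie review", ["b", "c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("titanic ship history", ["a", "d"]),
    "d": ("titanic soundtrack", []),
}

result = focused_crawl(
    ["a"],
    fetch=lambda url: WEB[url],
    is_relevant=lambda text: "titanic" in text,
    select_links=lambda url, links: links,
)
print(result)
```

Passing the fetcher, classifier, and distiller in as functions keeps the loop itself independent of how relevance is actually judged.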
6. Basic Techniques (II)
- Classifier: ranks pages based on keywords
- Distiller: ranks links based on hub weights
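One way to picture the two components is sketched below. The slides do not say how either score is computed, so both functions are assumptions: `keyword_score` uses a plain keyword frequency, and `hub_weights` uses a standard HITS-style iteration over a small link graph. All names and the example data are hypothetical.

```python
import math
import re

def keyword_score(text, keywords):
    """Fraction of the words in a page that are topic keywords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    wanted = {k.lower() for k in keywords}
    return sum(w in wanted for w in words) / len(words)

def hub_weights(graph, iterations=20):
    """HITS-style hub scores for a graph given as {page: [linked pages]}."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of pages linking to it.
        auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub of a page: sum of the authority scores of the pages it links to.
        hub = {n: sum(auth[t] for t in graph.get(n, [])) for n in nodes}
        # Normalise so the scores stay bounded across iterations.
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub

score = keyword_score("Titanic movie: the Titanic cast", ["titanic"])
hubs = hub_weights({"h1": ["p1", "p2", "p3"], "h2": ["p1"], "p1": []})
```

In this toy graph `h1` links to more well-cited pages than `h2`, so it receives the larger hub weight and its out-links would be ranked first.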
7. Performance
- About 50% of pages are relevant
- Bottleneck
- Downloading pages consumes too much time
- Network congestion
- Downloading many pages from a single site may take a long time
- Need to improve precision
8. Sougata's Improved Method (I) [WWW9]
- Nearness of the current page to the linked page, e.g. current page in A/B
- Same website: only pages in (A, B, C, D, E, F, G)
- Other websites: pages that link to or from pages in (A, B, C, D, E, F, G)
- URL contains any of the topic keywords, e.g. http://www.titanicmovie.com -> Titanic collection
[Figure: directory tree with nodes A, B, C, D, E, F, G]
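A rough sketch of the link tests on this slide is given below. It rests on several assumptions: the set of "near" directories is passed in by the caller because the figure defining it is not recoverable here, the cross-site rule is left unmodelled, and combining nearness with the URL keyword test by a logical OR is a guess. The function names and the `example.org` URLs are hypothetical.

```python
from posixpath import dirname
from urllib.parse import urlparse

def url_has_keyword(url, keywords):
    """True if any topic keyword appears anywhere in the URL."""
    url = url.lower()
    return any(k.lower() in url for k in keywords)

def is_near(current_url, linked_url, near_dirs):
    """Same-site rule only: the linked page must lie in one of near_dirs."""
    cur, link = urlparse(current_url), urlparse(linked_url)
    if cur.netloc != link.netloc:
        return False        # assumption: the cross-site rule is not modelled
    return dirname(link.path) in near_dirs

def should_follow(current_url, linked_url, near_dirs, keywords):
    """Assumed combination: follow a link if it is near OR its URL matches."""
    return (is_near(current_url, linked_url, near_dirs)
            or url_has_keyword(linked_url, keywords))

# Hypothetical near-directory set for a current page in /A/B.
near = {"/A", "/A/B", "/A/C"}
current = "http://example.org/A/B/x.html"

print(should_follow(current, "http://example.org/A/C/y.html", near, ["titanic"]))
print(should_follow(current, "http://www.titanicmovie.com/", near, ["titanic"]))
```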
9. Sougata's Improved Method (II)
- Irrelevant directories
- Keep a counter of indexed and ignored pages for each directory
- If the number of downloaded pages > 25 and 90% of those pages are ignored, the directory is irrelevant; do not download any more pages from it
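The per-directory counter could look something like the sketch below. The thresholds (more than 25 downloaded pages, 90% ignored) come from the slide; the class name `DirectoryFilter`, its methods, the choice to key directories by host plus path, and treating "90%" as "at least 90%" are assumptions made for illustration.

```python
from collections import defaultdict
from posixpath import dirname
from urllib.parse import urlparse

class DirectoryFilter:
    """Counts indexed and ignored pages per directory (hypothetical helper)."""

    def __init__(self, min_pages=25, ignored_ratio=0.9):
        self.min_pages = min_pages
        self.ignored_ratio = ignored_ratio
        self.indexed = defaultdict(int)
        self.ignored = defaultdict(int)

    @staticmethod
    def directory_of(url):
        # Assumption: a directory is identified by host plus path directory.
        parts = urlparse(url)
        return parts.netloc + dirname(parts.path)

    def record(self, url, was_indexed):
        """Update the counters after a page has been downloaded."""
        key = self.directory_of(url)
        if was_indexed:
            self.indexed[key] += 1
        else:
            self.ignored[key] += 1

    def is_irrelevant(self, url):
        """True once more than min_pages were downloaded and >= 90% ignored."""
        key = self.directory_of(url)
        downloaded = self.indexed[key] + self.ignored[key]
        if downloaded <= self.min_pages:
            return False
        return self.ignored[key] >= self.ignored_ratio * downloaded

# Example: 26 downloads from one directory, only 2 of them indexed.
dir_filter = DirectoryFilter()
for i in range(26):
    dir_filter.record(f"http://example.org/ads/p{i}.html", was_indexed=(i < 2))
print(dir_filter.is_irrelevant("http://example.org/ads/next.html"))
```

A crawler would call `is_irrelevant` before queueing a link and `record` after classifying each downloaded page.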
10. Comparison of Basic and Sougata's Methods
- Download (%) = 100 × (pages downloaded by Sougata's crawler) / (pages downloaded by the basic crawler)
- Nearness (%): pages not downloaded because the linked page was not near the original page
- Rejected Directories (%): pages not downloaded because the directory was irrelevant
- Relevant Pages Missed (%) = 100 × (number of pages indexed by the basic crawler but ignored by Sougata's crawler) / (total number of pages indexed by the basic crawler)
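The two measures with an explicit formula can be computed as in the short sketch below. The function names and the sample numbers are made up for illustration and are not results from the presentation. Nearness (%) and Rejected Directories (%) are omitted because the slide does not state their denominators.

```python
def download_pct(focused_downloaded, basic_downloaded):
    """Pages downloaded by the improved crawler as a percentage of the basic one."""
    return 100 * focused_downloaded / basic_downloaded

def relevant_missed_pct(basic_indexed, focused_ignored):
    """Percentage of pages the basic crawler indexed that the improved one ignored.

    Both arguments are sets of URLs.
    """
    return 100 * len(basic_indexed & focused_ignored) / len(basic_indexed)

# Illustrative numbers only.
print(download_pct(400, 1000))
print(relevant_missed_pct({"a", "b", "c", "d"}, {"d", "x"}))
```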
11. Section 2