Title: Exploiting Inter-Class Rules for Focused Crawling
1. Exploiting Inter-Class Rules for Focused Crawling
- Ismail Sengör Altingövde
- Bilkent University
- Ankara, Turkey
2. Our Research: The Big Picture
- Goal: Metadata-based modeling and querying of web resources
- Stages:
- Semi-automated metadata extraction from web resources (focused crawling fits here!)
- Extending SQL to support ranking and text-based operations in an integrated manner
- Developing query processing algorithms
- Prototyping a digital library application for CS resources
3. Overview
- Motivation
- Background and related work
- Inter-class rules for focused crawling
- Preliminary results
4. Motivation
- Crawlers: a.k.a. bots, spiders, robots
- Goal: Fetching all the pages on the Web, to allow succeeding useful tasks (e.g., indexing)
- "All pages" means roughly 4 billion pages today (according to Google)
- Requires enormous hardware and network resources
- Consider the growth rate and refresh rate of the Web
- What about hidden-Web and dynamic content?
5. Motivation
- Certain applications do need such powerful (and expensive) crawlers
- e.g., a general-purpose search engine
- And some others don't...
- e.g., a portal on computer science papers, or people's homepages...
6. Motivation
- Let's relax the problem space
- Focus on a restricted target space of Web pages
- that may be of some type (e.g., homepages)
- that may be of some topic (e.g., CS, quantum physics)
- The focused crawling effort would
- use far fewer resources,
- be more timely,
- be better qualified for indexing and searching purposes
7. Motivation
- Goal: Design and implement a focused Web crawler that would
- gather only pages on a particular topic (or class)
- use inter-class relationships while choosing the next page to download
- Once we have this, we can do many interesting things on top of the crawled pages
- I plan to be around for a few more years!!!
8. Background: A typical crawler
- Starts from a set of seed pages
- Follows all hyperlinks it encounters, to eventually traverse the entire Web
- Applies breadth-first search (BFS)
- Runs endlessly in cycles
- to revisit modified pages
- to access unseen content
9. Our simple BFS crawler
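Slide 9 shows the crawler architecture as a figure. As a rough illustration of the loop it depicts, a minimal BFS crawl skeleton might look like the sketch below; fetch_page and extract_links are hypothetical stand-ins for the downloader and link extractor, which the slides do not spell out.

```python
from collections import deque

def bfs_crawl(seed_urls, max_pages, fetch_page, extract_links):
    """Minimal breadth-first crawl loop: a FIFO frontier plus a URL-seen set."""
    frontier = deque(seed_urls)          # FIFO queue -> breadth-first visit order
    seen = set(seed_urls)                # URL-seen test (greatly simplified)
    pages = []
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        page = fetch_page(url)           # DNS, politeness, error handling omitted
        if page is None:
            continue
        pages.append((url, page))
        for link in extract_links(page): # assumed to return canonical URLs
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```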
10. Crawling issues...
- Multi-threading
- Use separate, dedicated threads for DNS resolution and actual page downloading
- Cache and prefetch DNS resolutions
- Content-seen test
- Avoid duplicate content, e.g., mirrors
- Link extraction and normalization
- Canonical URLs
11. More issues...
- URL-seen test
- Avoid being trapped in a cycle!
- Hash visited URLs with the MD5 algorithm and store them in a database
- 2-level hashing to exploit spatio-temporal locality
- Load balancing among hosts: Be polite!
- Robot exclusion protocol
- Meta tags
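A rough sketch of the URL-seen test above, with canonical URLs hashed by MD5 before being remembered; the single in-memory set stands in for the database and the 2-level hashing scheme, which the slides do not detail.

```python
import hashlib

class UrlSeenTest:
    """Keeps MD5 digests of canonical URLs; a stand-in for the on-disk store."""
    def __init__(self):
        self._digests = set()

    def is_new(self, canonical_url: str) -> bool:
        digest = hashlib.md5(canonical_url.encode("utf-8")).digest()
        if digest in self._digests:
            return False          # already visited -> avoid crawling in a cycle
        self._digests.add(digest)
        return True
```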
12. Even more issues?!
- Our crawler is simple, since issues like
- Refreshing crawled web pages
- Performance monitoring
- Hidden-Web content
- are left out...
- And some of the implemented features can still be improved
- Busy queue for the politeness policy!
13. Background: Focused crawling
- A focused crawler "seeks and acquires ... pages on a specific set of topics representing a relatively narrow segment of the Web." (Soumen Chakrabarti)
- The underlying paradigm is Best-First Search instead of Breadth-First Search
14. Breadth-First vs. Best-First Search
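The structural difference between the two strategies is only the frontier: a FIFO queue for breadth-first versus a max-priority queue for best-first. A minimal sketch of such a priority-queue frontier follows (illustrative, not the crawler's actual data structure; Python's heapq is a min-heap, so scores are negated).

```python
import heapq

class BestFirstFrontier:
    """Max-priority queue of (score, URL) pairs; the highest score is visited first."""
    def __init__(self):
        self._heap = []
        self._counter = 0                  # tie-breaker for equal scores

    def push(self, url: str, score: float) -> None:
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score             # best-scoring URL comes out first
```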
15. Two fundamental questions
- Q1: How to decide whether a downloaded page is on-topic or not?
- Q2: How to choose the next page to visit?
16. Early algorithms
- FISHSEARCH: Query-driven
- A1: Pages that match a query
- A2: The neighborhood of the pages above
- SHARKSEARCH
- Uses the TF-IDF cosine measure from IR to determine page relevance
- Cho et al.
- Reorder the crawl frontier based on a page importance score (PageRank, in-links, etc.)
17. Chakrabarti's crawler
- Chakrabarti's focused crawler
- A1: Determines page relevance using a text classifier
- A2: Adds URLs to a max-priority queue with their parent page's score and visits them in descending order!
- What is original here is the use of a text classifier!
18. The baseline crawler
- A simplified implementation of Chakrabarti's crawler
- It is used to present and evaluate our rule-based strategy
- Just two minor changes in our crawler architecture, and done!!!
19. Our baseline crawler
20. The baseline crawler
- An essential component is the text classifier
- A Naive Bayes classifier called Rainbow
- Training the classifier
- Data: Use a topic taxonomy (the Open Directory, Yahoo!)
- Better than modeling a negative class
21. Baseline crawler: Page relevance
- Testing the classifier
- The user determines the focus topics
- The crawler calls the classifier and obtains a score for each downloaded page
- The classifier returns a sorted list of classes and scores
- e.g., (A: 80%, B: 10%, C: 7%, D: 1%, ...)
- The classifier determines the page relevance!
22. Baseline crawler: Visit order
- The radius-1 hypothesis: If page u is an on-topic example and u links to v, then the probability that v is on-topic is higher than the probability that a randomly chosen Web page is on-topic.
23. Baseline crawler: Visit order
- Hard-focus crawling
- If a downloaded page is off-topic, stop following hyperlinks from this page
- Assume the target is class B
- and for page P, the classifier gives
- A: 80%, B: 10%, C: 7%, D: 1%, ...
- Do not follow P's links at all!
24. Baseline crawler: Visit order
- Soft-focus crawling
- obtains a page's relevance score (its score for the target topic)
- assigns this score to every URL extracted from this particular page, and adds them to the priority queue
- Example: A: 80%, B: 10%, C: 7%, D: 1%, ...
- Insert P's links with score 0.10 into the PQ
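A minimal sketch contrasting the two policies, assuming the classifier output is available as a dict of class scores and a priority-queue frontier like the one sketched earlier; the helper names are illustrative, not from the original crawler.

```python
def handle_links(page_links, class_scores, target_class, frontier, soft=True):
    """Hard focus: if the page's best class is not the target, follow nothing.
    Soft focus: enqueue every extracted link with the parent page's
    relevance score for the target class."""
    relevance = class_scores.get(target_class, 0.0)   # e.g., 0.10 for class B
    best_class = max(class_scores, key=class_scores.get)
    if not soft and best_class != target_class:
        return                                        # hard focus drops the links
    for link in page_links:
        frontier.push(link, relevance)                # soft focus: links inherit the score
```

For the example above, handle_links(links, {"A": 0.80, "B": 0.10, "C": 0.07, "D": 0.01}, "B", frontier) inserts all of P's links with score 0.10 under soft focus, and inserts nothing under hard focus.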
25. Rule-based crawler: Motivation
- Two important observations
- Pages refer not only to pages from the same class, but also to pages from other classes
- e.g., from bicycle pages to first-aid pages
- Relying only on the radius-1 hypothesis is not enough!
26. Rule-based crawler: Motivation
- The baseline crawler cannot support tunneling
- University homepages link to CS pages, which link to researcher homepages, which further link to CS papers
- Determining the score only w.r.t. similarity to the target class is not enough!
27. Our solution
- Extract rules that statistically capture linkage relationships among the classes (topics) and guide the crawler accordingly
- Intuitively, we determine relationships like "pages in class A refer to pages in class B with probability X"
- A → B (X)
28. Our solution
- When the crawler seeks class B and the page P at hand is of class A,
- consider all paths from A to B
- compute an overall score S
- add the links from P to the PQ with this score S
- Basically, we revise the radius-1 hypothesis with class linkage probabilities.
29. How to obtain rules?
30. An example scenario
- Assume our taxonomy has 4 classes
- department homepages (DH)
- course homepages (CH)
- personal homepages (PH)
- sports pages (SP)
- First, obtain the train-0 set
- Next, for each class, assume 10 pages pointed to by the pages in the train-0 set are fetched
31. An example scenario
The distribution of links to classes
Inter-class rules for the above distribution
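The two tables are shown on the slide as figures. The derivation they illustrate can be sketched as follows: for each source class, count how the out-links of its train-0 pages distribute over the classes of the fetched pages, and turn each fraction into a rule A → B (X). The counts below are purely illustrative, chosen only to be consistent with the rules used on slide 35; they are not the slide's actual table.

```python
from collections import Counter

def derive_rules(link_targets):
    """link_targets: {source_class: [class of each fetched linked page, ...]}
    Returns {(source, target): probability}, i.e. rules like DH -> CH (0.8)."""
    rules = {}
    for source, targets in link_targets.items():
        for target, count in Counter(targets).items():
            rules[(source, target)] = count / len(targets)
    return rules

# Illustrative distribution of 10 fetched links per class (not the slide's numbers):
example = {"DH": ["CH"] * 8 + ["PH"] * 1 + ["DH"] * 1,
           "CH": ["PH"] * 4 + ["CH"] * 6}
print(derive_rules(example))
# {('DH', 'CH'): 0.8, ('DH', 'PH'): 0.1, ('DH', 'DH'): 0.1, ('CH', 'PH'): 0.4, ('CH', 'CH'): 0.6}
```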
32. Seed and target classes are both from the class PH.
33. Seed and target classes are both from the class PH.
34. Rule-based crawler
- The rule-based approach successfully uses class linkage information
- to revise the radius-1 hypothesis
- to reach an immediate reward
35. Rule-based crawler: Tunneling
- The rule-based approach also supports tunneling through a simple application of transitivity.
- Consider URL2 (of class DH)
- A direct rule is DH → PH (0.1)
- An indirect rule is
- from DH → CH (0.8) and CH → PH (0.4)
- obtain DH → PH (0.8 × 0.4 = 0.32)
- And thus DH → PH (0.1 + 0.32 = 0.42)
36. Rule-based crawler: Tunneling
- Observe that
- In effect, the rule-based crawler becomes aware of the path DH → CH → PH, although it has only been trained with paths of length 1.
- The rule-based crawler can successfully imitate tunneling.
37. Rule-based score computation
- Chain the rules up to some predefined MAX-DEPTH (e.g., 2 or 3)
- Merge the paths with the SUM function
- If there are no rules whatsoever, fall back to the soft-focus score
- Note that
- the rule DB can be represented as a graph, and
- for a given target class, all cycle-free paths (except the self-loop of T) can be computed (e.g., with a modified BFS)
38. Rule-based score computation
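Slide 38 gives the computation as a figure. Below is a rough sketch of the procedure described on the previous slide, under the stated assumptions: rules stored as a directed graph, cycle-free paths chained by multiplying probabilities up to MAX-DEPTH, merged by SUM, with the soft-focus score as the fallback. Function and parameter names are illustrative.

```python
def rule_based_score(source, target, rules, soft_focus_score, max_depth=3):
    """rules: {(from_class, to_class): probability}. Chains cycle-free paths
    from `source` to `target` up to max_depth, multiplies probabilities along
    each path, and SUMs over paths; falls back to the soft-focus score."""
    graph = {}
    for (a, b), p in rules.items():
        graph.setdefault(a, []).append((b, p))

    total = 0.0
    def walk(node, prob, depth, visited):
        nonlocal total
        if depth > max_depth:
            return
        for nxt, p in graph.get(node, []):
            if nxt == target:
                total += prob * p              # a complete path: add its product
            elif nxt not in visited:           # keep paths cycle-free
                walk(nxt, prob * p, depth + 1, visited | {nxt})

    walk(source, 1.0, 1, {source})
    return total if total > 0 else soft_focus_score

# Example with the rules from slide 35:
rules = {("DH", "PH"): 0.1, ("DH", "CH"): 0.8, ("CH", "PH"): 0.4}
print(rule_based_score("DH", "PH", rules, soft_focus_score=0.1))  # 0.1 + 0.8*0.4 = 0.42
```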
39. Preliminary results: Set-up
- DMOZ taxonomy
- leaves with more than 150 URLs
- 1282 classes (topics)
- Train-0 set: 120K pages
- Train-1 set: 40K pages pointed to by 266 interrelated classes (all about science)
- Target topics are also from these 266 classes
40. Preliminary results: Set-up
- Harvest ratio: the average relevance, to the target topic, of all pages acquired by the crawler
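Written as a formula (the slide gives only the verbal definition; p_1, ..., p_N are the crawled pages and T is the target topic):

$$\text{harvest rate} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{relevance}(p_i, T)$$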
41. Preliminary results
- Seeds are from DMOZ and Yahoo!
- The harvest rate improves from 3% to 38%
- Coverage also differs
42. Harvest Rate
43. Future Work
- More sophisticated rule discovery techniques (e.g., the topic citation matrix of Chakrabarti et al.)
- On-line refinement of the rule database
- Using the entire taxonomy, not only the leaves
44. Acknowledgments
- We gratefully thank Ö. Rauf Atay for the implementation.
45. References
- I. S. Altingövde and Ö. Ulusoy, "Exploiting Inter-Class Rules for Focused Crawling," IEEE Intelligent Systems Magazine, to appear.
- S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers, 352 pages, 2003.
- S. Chakrabarti, M. H. van den Berg, and B. E. Dom, "Focused crawling: a new approach to topic-specific Web resource discovery," In Proc. of the 8th International WWW Conference (WWW8), 1999.
46. Any questions???