Exploiting Inter-Class Rules for Focused Crawling

1
Exploiting Inter-Class Rules for Focused Crawling
  • Ismail Sengör Altingövde
  • Bilkent University
  • Ankara, Turkey

2
Our Research: The Big Picture
  • Goal: Metadata-based modeling and querying of web
    resources
  • Stages:
  • Semi-automated metadata extraction from web
    resources (focused crawling fits here!)
  • Extending SQL to support ranking and text-based
    operations in an integrated manner
  • Developing query processing algorithms
  • Prototyping a digital library application for CS
    resources

3
Overview
  • Motivation
  • Background and related work
  • Interclass rules for focused crawling
  • Preliminary results

4
Motivation
  • Crawlers, a.k.a. bots, spiders, robots
  • Goal: Fetching all the pages on the Web, to allow
    subsequent useful tasks (e.g., indexing)
  • "All pages" means roughly 4 billion pages today
    (according to Google)
  • Requires enormous hardware and network resources
  • Consider the growth rate and refresh rate of the Web
  • What about hidden-Web and dynamic content?

5
Motivation
  • Certain applications do need such powerful (and
    expensive) crawlers
  • e.g., a general-purpose search engine
  • And some others don't...
  • e.g., a portal on computer science papers, or
    people's homepages...

6
Motivation
  • Let's relax the problem space
  • Focus on a restricted target space of Web pages
  • that may be of some type (e.g., homepages)
  • that may be of some topic (CS, quantum physics)
  • The focused crawling effort would
  • use far fewer resources,
  • be more timely,
  • be better suited for indexing and searching
    purposes

7
Motivation
  • Goal: Design and implement a focused Web crawler
    that would
  • gather only pages on a particular topic (or
    class)
  • use interclass relationships while choosing the
    next page to download
  • Once we have this, we can do many interesting
    things on top of the crawled pages
  • I plan to be around for a few more years!!!

8
Background: A typical crawler
  • Starts from a set of seed pages
  • Follows all hyperlinks it encounters, to
    eventually traverse the entire Web
  • Applies breadth-first search (BFS)
  • Runs endlessly in cycles
  • to revisit modified pages
  • to access unseen content
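
The traversal described on this slide can be sketched as follows. This is a minimal illustration, not the crawler from the talk: the `TOY_WEB` link graph and the `fetch_links` callback are hypothetical stand-ins for real page downloads, and the revisit cycles are left out.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for the Web:
# each key is a URL and each value is the list of URLs it links to.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["c", "d"],
    "c": [],
    "d": ["a"],
}

def bfs_crawl(seeds, fetch_links, max_pages=100):
    """Visit pages in breadth-first order, starting from the seed URLs."""
    frontier = deque(seeds)   # FIFO queue: the oldest discovered URL is visited first
    seen = set(seeds)         # URLs already queued, so links back to them are ignored
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)   # a real crawler would download and store the page here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

order = bfs_crawl(["seed"], lambda url: TOY_WEB.get(url, []))
print(order)
```

The `max_pages` limit is an added safeguard for the sketch; a production crawler would instead run continuously and re-queue pages for refresh.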

9
Our simple BFS crawler
10
Crawling issues...
  • Multi-threading
  • Use separate and dedicated threads for DNS
    resolution and actual page downloading
  • Cache and prefetch DNS resolutions
  • Content-seen test
  • Avoid duplicate content, e.g., mirrors
  • Link extraction and normalization
  • Canonical URLs
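
As an illustration of link normalization, the sketch below turns an extracted link into a canonical absolute URL using Python's standard `urllib.parse`. The particular rules chosen here (lowercase scheme and host, drop default ports and fragments, use `/` for an empty path) are assumptions for the example; the slides do not say which rules the crawler applies.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(base_url, href):
    """Resolve href against base_url and return a normalized absolute URL."""
    absolute = urljoin(base_url, href)      # resolves relative links and ".." segments
    parts = urlsplit(absolute)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()   # user info, if any, is dropped in this sketch
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{port}"             # keep only non-default ports
    path = parts.path or "/"
    # The fragment is discarded because it never changes which page is fetched.
    return urlunsplit((scheme, host, path, parts.query, ""))

print(canonicalize("http://Example.COM:80/a/b.html", "../c.html#top"))
```

With this, two differently written links to the same page map to one string, which is what the URL-seen test needs as input.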

11
More issues...
  • URL-seen test
  • Avoid being trapped in a cycle!
  • Hash visited URLs with the MD5 algorithm and store
    them in a database.
  • 2-level hashing to exploit spatio-temporal
    locality
  • Load balancing among hosts: Be polite!
  • Robot exclusion protocol
  • Meta tags
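
A possible shape for the URL-seen test is sketched below. The slide does not spell out the 2-level scheme, so this assumes one common reading: the key starts with a hash of the host name and ends with a hash of the full URL, so that URLs from the same host sit close together in a sorted store. The `SeenUrls` class is a hypothetical in-memory stand-in for the database.

```python
import hashlib
from urllib.parse import urlsplit

def md5_hex(text):
    """Return the MD5 digest of a string as 32 hexadecimal characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def url_key(url):
    """Two-level key: an 8-character host fingerprint, then the full-URL fingerprint."""
    host = urlsplit(url).hostname or ""
    return md5_hex(host)[:8] + md5_hex(url)

class SeenUrls:
    """In-memory stand-in for the database of visited URL hashes."""

    def __init__(self):
        self._keys = set()

    def add_if_new(self, url):
        """Record the URL and return True if it was not seen before."""
        key = url_key(url)
        if key in self._keys:
            return False
        self._keys.add(key)
        return True

seen = SeenUrls()
first = seen.add_if_new("http://example.com/a")
second = seen.add_if_new("http://example.com/a")
```

The host prefix length of 8 characters is arbitrary here and only serves to show the grouping idea.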

12
Even more issues?!
  • Our crawler is simple, since issues like
  • Refreshing crawled web pages
  • Performance monitoring
  • Hidden-Web content
  • are left out...
  • And some of the implemented features can still be
    improved
  • Busy queue for the politeness policy!

13
Background: Focused crawling
  • "A focused crawler seeks and acquires ...
    pages on a specific set of topics representing a
    relatively narrow segment of the Web." (Soumen
    Chakrabarti)
  • The underlying paradigm is Best-First Search
    instead of the Breadth-First Search
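
In code, the switch from breadth-first to best-first search mostly amounts to replacing the FIFO frontier with a priority queue. The sketch below uses Python's `heapq`; the `BestFirstFrontier` class and the example scores are hypothetical.

```python
import heapq
from itertools import count

class BestFirstFrontier:
    """Crawl frontier that always returns the highest-scored URL first."""

    def __init__(self):
        self._heap = []
        self._tie = count()   # insertion counter: equal scores come out in FIFO order

    def push(self, url, score):
        # heapq is a min-heap, so the score is negated to pop the maximum first.
        heapq.heappush(self._heap, (-score, next(self._tie), url))

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)

frontier = BestFirstFrontier()
frontier.push("u1", 0.2)
frontier.push("u2", 0.9)
frontier.push("u3", 0.5)
visit_order = [frontier.pop()[0] for _ in range(len(frontier))]
```

A breadth-first crawler would visit `u1`, `u2`, `u3` in discovery order; the best-first frontier visits them by descending score instead.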

14
Breadth vs. Best First Search
15
Two fundamental questions
  • Q1: How to decide whether a downloaded page is
    on-topic, or not?
  • Q2: How to choose the next page to visit?

16
Early algorithms
  • FISHSEARCH: Query driven
  • A1: Pages that match a query
  • A2: Neighborhood of the pages above
  • SHARKSEARCH
  • Use TF-IDF cosine measure from IR to determine
    page relevance
  • Cho et al.
  • Reorder crawl frontier based on page importance
    score (PageRank, in-links, etc.)
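
To make the TF-IDF cosine measure concrete, here is a small pure-Python sketch. It uses a smoothed IDF, which is only one of several common weighting variants and not necessarily the one used by SHARKSEARCH; the query and page texts are made up.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Return one {term: weight} dict per tokenized document (smoothed IDF variant)."""
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        term_freq = Counter(tokens)
        vectors.append({
            t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
            for t, tf in term_freq.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

query = "focused web crawler".split()
page_a = "a focused crawler downloads web pages on one topic".split()
page_b = "recipes for baking bread at home".split()
q_vec, a_vec, b_vec = tfidf_vectors([query, page_a, page_b])
print(cosine(q_vec, a_vec), cosine(q_vec, b_vec))
```

The page that shares terms with the query gets a positive score, while the unrelated page scores zero.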

17
Chakrabarti's crawler
  • Chakrabarti's focused crawler
  • A1: Determines the page relevance using a text
    classifier
  • A2: Adds URLs to a max-priority queue with their
    parent page's score and visits them in descending
    order!
  • What is original is using a text classifier!

18
The baseline crawler
  • A simplified implementation of Chakrabarti's
    crawler
  • It is used to present and evaluate our rule-based
    strategy
  • Just two minor changes in our crawler
    architecture, and done!!!

19
Our baseline crawler
20
The baseline crawler
  • An essential component is the text classifier
  • A Naive Bayes classifier called Rainbow
  • Training the classifier
  • Data: Use a topic taxonomy (The Open Directory,
    Yahoo).
  • Better than modeling a negative class
21
Baseline crawler: Page relevance
  • Testing the classifier
  • User determines focus topics
  • Crawler calls the classifier and obtains a score
    for each downloaded page
  • Classifier returns a sorted list of classes and
    scores
  • (A 80%, B 10%, C 7%, D 1%, ...)
  • The classifier determines the page relevance!
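
The classifier interface described above can be imitated with a toy multinomial Naive Bayes model. This is not Rainbow and makes no claim about its internals; `TinyNaiveBayes`, `relevance`, and the training texts are hypothetical and only show how a sorted (class, score) list yields a page relevance.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict_proba(self, tokens):
        """Return [(class, probability), ...] sorted by descending probability."""
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for t in tokens:
                if t in self.vocab:   # terms never seen in training are ignored
                    score += math.log((counts[t] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())   # subtract the maximum for numerical stability
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(exp.values())
        return sorted(((c, v / z) for c, v in exp.items()),
                      key=lambda cv: cv[1], reverse=True)

def relevance(ranked, target):
    """Page relevance = the score the classifier assigns to the target class."""
    return dict(ranked).get(target, 0.0)

docs = [["bike", "wheel", "ride"], ["bandage", "wound", "aid"]]
labels = ["cycling", "first_aid"]
clf = TinyNaiveBayes().fit(docs, labels)
ranked = clf.predict_proba(["bike", "ride"])
```

Here `relevance(ranked, "cycling")` plays the role of the score the crawler obtains for a downloaded page when the focus topic is cycling.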

22
Baseline crawler: Visit order
  • The radius-1 hypothesis: If page u is an on-topic
    example and u links to v, then the probability
    that v is on-topic is higher than the probability
    that a randomly chosen Web page is on-topic.

23
Baseline crawler: Visit order
  • Hard-focus crawling
  • If a downloaded page is off-topic, stop
    following hyperlinks from this page.
  • Assume the target is class B
  • And for page P, the classifier gives
  • A 80%, B 10%, C 7%, D 1%, ...
  • Do not follow P's links at all!
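
The hard-focus decision can be sketched as a simple filter. This assumes that "off-topic" means the best-scoring class is not the target class, which is a simplification; the function name and link names are hypothetical.

```python
def hard_focus_links(ranked_classes, target, links):
    """Return the links to follow: all of them if the page's best class is the target, else none."""
    best_class, _ = ranked_classes[0]   # ranked_classes is sorted by descending score
    return list(links) if best_class == target else []

# Classifier output for page P from the slide, written as probabilities.
page_p = [("A", 0.80), ("B", 0.10), ("C", 0.07), ("D", 0.01)]
followed_for_b = hard_focus_links(page_p, "B", ["u1", "u2"])
followed_for_a = hard_focus_links(page_p, "A", ["u1", "u2"])
```

With target B the page is judged off-topic and none of its links are queued; with target A all of them would be.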

24
Baseline crawler: Visit order
  • Soft-focus crawling
  • obtains a page's relevance score (a score on the
    page's relevance to the target topic)
  • assigns this score to every URL extracted from
    this particular page, and adds them to the priority
    queue
  • Example: A 80%, B 10%, C 7%, D 1%, ...
  • Insert P's links with score 0.10 into the PQ
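
A minimal sketch of soft-focus queuing, using the numbers from the example: every link on a page inherits the page's score for the target class. The function name, the second page, and the URL names are hypothetical; `heapq` is a min-heap, so scores are stored negated.

```python
import heapq

def soft_focus_enqueue(pq, ranked_classes, target, links):
    """Push every link with its parent page's relevance to the target class."""
    score = dict(ranked_classes).get(target, 0.0)
    for url in links:
        heapq.heappush(pq, (-score, url))   # negated so the highest score pops first
    return score

pq = []
page_p = [("A", 0.80), ("B", 0.10), ("C", 0.07), ("D", 0.01)]
score_p = soft_focus_enqueue(pq, page_p, "B", ["p_link1", "p_link2"])

# A second, hypothetical page that is mostly about the target class B.
page_q = [("B", 0.60), ("A", 0.30)]
soft_focus_enqueue(pq, page_q, "B", ["q_link1"])

neg_score, next_url = heapq.heappop(pq)
```

Unlike hard focus, the links of P are still queued, but they are visited only after links found on more relevant pages.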

25
Rule-based crawler: Motivation
  • Two important observations
  • Pages refer not only to pages from the same
    class, but also to pages from other classes.
  • e.g., from bicycle pages to first aid pages
  • Relying only on the radius-1 hypothesis is not
    enough!

26
Rule-based crawler: Motivation
  • The baseline crawler cannot support tunneling
  • University homepages link to CS pages, which
    link to researcher homepages, which further
    link to CS papers
  • Determining the score only w.r.t. the similarity to
    the target class is not enough!

27
Our solution
  • Extract rules that statistically capture linkage
    relationships among the classes (topics) and
    guide the crawler accordingly
  • Intuitively, we determine relationships like
    "pages in class A refer to pages in class B with
    probability X"
  • A → B (X)

28
Our solution
  • When the crawler seeks class B and the page P at hand
    is of class A,
  • consider all paths from A to B
  • compute an overall score S
  • add links from P to the PQ with this score S
  • Basically, we revise the radius-1 hypothesis with
    class linkage probabilities.

29
How to obtain rules?
30
An example scenario
  • Assume our taxonomy has 4 classes
  • department homepages (DH)
  • course homepages (CH)
  • personal homepages (PH)
  • sports pages (SP)
  • First, obtain the train-0 set
  • Next, for each class, assume 10 pages are fetched
    that are pointed to by the pages in the train-0 set
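
One way to turn such a sample into rules is to count, per source class, how the fetched pages are distributed over destination classes. The sketch below assumes each fetched page has already been assigned a class; the link counts are hypothetical, chosen only so that the resulting probabilities agree with the rules quoted on the tunneling slide (DH → CH 0.8, DH → PH 0.1, CH → PH 0.4).

```python
from collections import Counter, defaultdict

def extract_rules(class_links):
    """Turn (source_class, destination_class) pairs into {source: {destination: probability}}."""
    counts = defaultdict(Counter)
    for src, dst in class_links:
        counts[src][dst] += 1
    rules = {}
    for src, dst_counts in counts.items():
        total = sum(dst_counts.values())
        rules[src] = {dst: n / total for dst, n in dst_counts.items()}
    return rules

# Hypothetical sample: 10 fetched pages per source class.
class_links = (
    [("DH", "CH")] * 8 + [("DH", "PH")] * 1 + [("DH", "DH")] * 1
    + [("CH", "PH")] * 4 + [("CH", "CH")] * 6
)
rules = extract_rules(class_links)
print(rules["DH"])
```

Each inner dictionary sums to 1, so a rule A → B (X) reads as the observed fraction of links from class A pages that lead to class B pages.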

31
An example scenario
The distribution of links to classes
Inter-class rules for the above distribution
32
Seed and target classes are both from the class
PH.
33
Seed and target classes are both from the class
PH.
34
Rule-based crawler
  • The rule-based approach successfully uses class
    linkage information
  • to revise the radius-1 hypothesis
  • to reach an immediate reward

35
Rule-based crawler: Tunneling
  • The rule-based approach also supports tunneling by a
    simple application of transitivity.
  • Consider URL2 (of class DH)
  • A direct rule is DH → PH (0.1)
  • An indirect rule is
  • from DH → CH (0.8) and CH → PH (0.4)
  • obtain DH → PH (0.8 × 0.4 = 0.32)
  • And, thus DH → PH (0.1 + 0.32 = 0.42)

36
Rule-based crawler: Tunneling
  • Observe that
  • In effect, the rule-based crawler becomes aware
    of a path DH → CH → PH, although it has only been
    trained with paths of length 1.
  • The rule-based crawler can successfully imitate
    tunneling.

37
Rule-based score computation
  • Chain the rules up to some predefined MAX-DEPTH
    number (e.g., 2 or 3)
  • Merge the paths with the function SUM
  • If there are no rules whatsoever, fall back to the
    soft-focus score
  • Note that
  • The rule db can be represented as a graph, and
  • For a given target class T, all cycle-free paths
    (except the self loop of T) can be computed (e.g.,
    by modifying BFS)
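
The score computation can be sketched as a depth-limited walk over the rule graph. The sketch uses a recursive depth-first walk rather than a modified BFS, and it assumes one reading of the cycle rule: a path never revisits a class, and the only loop allowed is the direct rule T → T when the page itself is of class T. Function names are hypothetical; the rule values are the three quoted on the tunneling slides.

```python
def rule_score(rules, source, target, max_depth=2):
    """Sum the probability products of all cycle-free rule paths from source to target
    that chain at most max_depth rules. Returns None when no such path exists."""
    total = 0.0
    found = False

    def walk(node, prob, depth, on_path):
        nonlocal total, found
        if depth == max_depth:
            return
        for nxt, p in rules.get(node, {}).items():
            if nxt == target:
                # When source == target, only the direct self loop counts.
                if source != target or depth == 0:
                    total += prob * p
                    found = True
            elif nxt not in on_path:
                walk(nxt, prob * p, depth + 1, on_path | {nxt})

    walk(source, 1.0, 0, {source})
    return total if found else None

def link_score(rules, page_class, target, soft_focus_score, max_depth=2):
    """Use the rule-based score when rules apply, else fall back to the soft-focus score."""
    score = rule_score(rules, page_class, target, max_depth)
    return soft_focus_score if score is None else score

rules = {"DH": {"CH": 0.8, "PH": 0.1}, "CH": {"PH": 0.4}}
score_dh = rule_score(rules, "DH", "PH", max_depth=2)
```

For a DH page and target PH this reproduces the 0.1 + 0.32 = 0.42 of the tunneling example, while a class with no rules, such as SP, keeps its soft-focus score.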

38
Rule-based score computation
39
Preliminary results: Set-up
  • DMOZ taxonomy
  • leaves with more than 150 URLs
  • 1282 classes (topics)
  • Train-0 set: 120K pages
  • Train-1 set: 40K pages pointed to by 266
    interrelated classes (all about science)
  • Target topics are also from these 266 classes

40
Preliminary results: Set-up
  • Harvest ratio: the average relevance of all pages
    acquired by the crawler to the target topic
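
Read literally, the harvest ratio is the mean of the per-page relevance scores. A minimal sketch, with made-up scores and a hypothetical function name:

```python
def harvest_ratio(relevance_scores):
    """Mean relevance of all crawled pages to the target topic (0.0 for an empty crawl)."""
    if not relevance_scores:
        return 0.0
    return sum(relevance_scores) / len(relevance_scores)

# Hypothetical classifier scores for four crawled pages.
ratio = harvest_ratio([1.0, 0.0, 0.5, 0.5])
```

Returning 0.0 for an empty crawl is a convention chosen for the sketch, not something the slides define.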

41
Preliminary results
  • Seeds are from DMOZ and Yahoo!
  • Harvest rate improves from 3% to 38%
  • Coverage also differs

42
Harvest Rate
43
Future Work
  • Sophisticated rule discovery techniques (e.g.,
    topic citation matrix of Chakrabarti et al.)
  • On-line refinement of the rule database
  • Using the entire taxonomy, not only the leaves

44
Acknowledgments
  •  We gratefully thank Ö. Rauf Atay for the
    implementation.

45
References
  • I. S. Altingövde, Ö. Ulusoy, Exploiting
    Inter-Class Rules for Focused Crawling, IEEE
    Intelligent Systems Magazine, to appear.
  • S. Chakrabarti, Mining the Web: Discovering
    Knowledge from Hypertext Data. Morgan Kaufmann
    Publishers, 352 pages, 2003.
  • S. Chakrabarti, M. H. van den Berg, and B. E. Dom,
    Focused crawling: a new approach to
    topic-specific web resource discovery, In Proc.
    of 8th International WWW Conference (WWW8),
    1999.

46
Any questions???