Title: Accelerated Focused Crawling Through Online Relevance Feedback
1. Accelerated Focused Crawling Through Online Relevance Feedback
- Soumen Chakrabarti, IIT Bombay
- Kunal Punera, IIT Bombay
- Mallela Subramanyam, UT Austin
2. First-generation focused crawling
- If Pr(c|u) is large enough, then enqueue all outlinks v of u with priority Pr(c|u)
[Diagram: seed URLs populate a frontier URL priority queue; the crawler picks the best URL, fetches page u into the crawl database, and submits it for classification]
- Crawl regions of the Web pertaining to a specific topic c, avoiding irrelevant topics
- Guess the relevance of an unseen node v based on the relevance of u (u→v), as evaluated by a topic classifier (see the loop sketch below)
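A minimal sketch of this baseline loop (in Python, for illustration), assuming a black-box topic classifier that returns Pr(c|u) in [0,1]; the helpers fetch and extract_outlinks are hypothetical placeholders rather than the authors' implementation.

```python
import heapq

def baseline_focused_crawl(seed_urls, topic, classifier, fetch, extract_outlinks,
                           threshold=0.5, max_pages=10000):
    """First-generation focused crawler: every outlink of a relevant page
    inherits the parent's relevance Pr(c|u) as its crawl priority."""
    # heapq is a min-heap, so negate priorities to pop the best URL first
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    crawl_db = {}

    while frontier and len(crawl_db) < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)                      # download page u
        relevance = classifier(page, topic)    # Pr(c|u) from the topic classifier
        crawl_db[url] = (page, relevance)
        if relevance >= threshold:
            # Leap of faith: every outlink v of u gets priority Pr(c|u)
            for v in extract_outlinks(page):
                if v not in seen:
                    seen.add(v)
                    heapq.heappush(frontier, (-relevance, v))
    return crawl_db
```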
3. Baseline crawling results
- 20 topics from http://dmoz.org
- Half to two-thirds of pages fetched are irrelevant
- Every click on a link is a leap of faith
- Humans leap better than focused crawlers
- There are adequate clues in the text and DOM to leap better
4. How to seek out distant goals?
- Manually collect paths (context graphs) leading to relevant goals
- Use a classifier to predict the link distance to a goal from page text
- Prioritize link expansion by the estimated distance to relevant goals (a small sketch follows the diagram)
- No discrimination among different links on a page
[Diagram: a context graph with layers of pages at link distance 1, 2, and 3 from the goal; a classifier estimates the layer of each fetched page and the crawler expands pages estimated to be closest first]
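A rough sketch of this prioritization, assuming a hypothetical distance_classifier that maps page text to an estimated layer (link distance to a goal); note that the single page-level score is shared by all outlinks on the page.

```python
def context_graph_priority(page_text, distance_classifier):
    """Prior-work strategy: one score per page, so every outlink on the
    page inherits the same priority (no per-link discrimination)."""
    est_distance = distance_classifier(page_text)    # predicted layer: 0, 1, 2, ...
    return 1.0 / (1.0 + est_distance)                # closer to a goal => higher priority
```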
5. The apprentice-critic strategy
[Diagram: the baseline learner (critic), built over the Dmoz topic taxonomy with class models consisting of term statistics, judges each newly fetched page u as good or bad ("Pr(c|u) is large enough"); these judgments train the apprentice, which re-scores outlinks v on the frontier URL priority queue, from which the crawler picks the best URL and adds the fetched page to the crawl database]
6. Design rationale
- The baseline classifier specifies what to crawl
  - Could be a user-defined black box
  - Usually depends on large amounts of training data; relatively slow to train (hours)
- The apprentice classifier learns how to locate pages approved by the baseline classifier (a training-data sketch follows)
  - Uses local regularities in HTML and site structure
  - Needs less training data; faster to train (minutes)
  - Guards against high fan-out (>10)
  - No need to manually collect paths to goals
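A sketch of how the critic's verdicts can become apprentice training data, under the assumption that the crawler records, for each fetched URL v, the DOM context of the HREF (u→v) through which it was reached; the structure names here are illustrative.

```python
def collect_apprentice_instances(link_contexts, baseline_relevance):
    """Pair each link context with the critic's verdict on the page it led to.

    link_contexts: dict mapping a fetched URL v to the features of the HREF (u -> v)
    baseline_relevance: dict mapping v to Pr(c|v) as judged by the baseline classifier
    """
    instances = []
    for v, context in link_contexts.items():
        if v in baseline_relevance:
            # The target is the critic's relevance of v, i.e., how good the link turned out to be
            instances.append((context, baseline_relevance[v]))
    return instances
```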
7. Apprentice feature design
- The HREF source page u is represented by its DOM tree
- Leaf nodes are marked with offsets with respect to the HREF
- Many usable representations for a term at offset d (an extraction sketch follows the diagram)
  - A ⟨t,d⟩ tuple, e.g., ⟨download, -2⟩
  - ⟨t,p,d⟩, where p is the path length from t to the HREF through their least common ancestor (LCA)
[Diagram: a DOM tree rooted at a ul with li children containing a, tt, em, and font elements; text leaves are numbered by their offset relative to the HREF, e.g., @-2, @-1, @0, @0, @1, @2, @3]
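A simplified sketch of ⟨t,d⟩ feature extraction, assuming an HTML parser such as BeautifulSoup; the traversal and tokenization below are illustrative and not the authors' exact procedure.

```python
from bs4 import BeautifulSoup

def anchor_context_features(html, target_href, d_max=5):
    """Extract <t, d> features: terms t at text-leaf offset d relative to the HREF.

    Text leaves inside the anchor get offset 0; leaves before it get negative
    offsets and leaves after it positive offsets, counted in document order."""
    soup = BeautifulSoup(html, "html.parser")
    anchor = soup.find("a", href=target_href)
    if anchor is None:
        return []

    leaves = [s for s in soup.find_all(string=True) if s.strip()]
    anchor_leaves = set(anchor.find_all(string=True))
    anchor_idx = [i for i, s in enumerate(leaves) if s in anchor_leaves]
    if not anchor_idx:
        return []
    first, last = anchor_idx[0], anchor_idx[-1]

    features = []
    for i, leaf in enumerate(leaves):
        if i < first:
            d = i - first          # leaves before the anchor: -1, -2, ...
        elif i > last:
            d = i - last           # leaves after the anchor: +1, +2, ...
        else:
            d = 0                  # the anchor text itself
        if abs(d) <= d_max:
            for term in leaf.lower().split():
                features.append((term, d))
    return features
```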
8. Offsets of good ⟨t,d⟩ features
- Plot the information gain at each offset d, averaged over terms t (a measurement sketch follows this list)
- Maximum at d = 0, falling off on both sides, but
- A clear saw-tooth pattern for most topics. Why?
  - <li><a><li><a>
  - <a><br><a><br>
- These are topic-independent authoring idioms, best handled by the apprentice
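One way to reproduce such a plot, sketched below: for each offset d, compute the information gain of the event "term t occurs at offset d" with respect to the good/bad label, then average over terms. The entropy-based formulation here is a standard one and not necessarily the exact statistic used in the talk.

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

def avg_info_gain_by_offset(instances):
    """instances: list of (features, label), where features is a set of (term, d)
    pairs and label is 'good' or 'bad' according to the baseline classifier."""
    h_base = entropy(Counter(label for _, label in instances))
    n = len(instances)

    terms_at_offset = defaultdict(set)
    for feats, _ in instances:
        for t, d in feats:
            terms_at_offset[d].add(t)

    gains = defaultdict(list)
    for d, terms in terms_at_offset.items():
        for t in terms:
            with_f, without_f = Counter(), Counter()
            for feats, label in instances:
                (with_f if (t, d) in feats else without_f)[label] += 1
            # Conditional entropy of the label given presence/absence of the feature
            h_cond = sum(sum(c.values()) / n * entropy(c)
                         for c in (with_f, without_f) if c)
            gains[d].append(h_base - h_cond)

    # Average information gain per offset d
    return {d: sum(g) / len(g) for d, g in gains.items()}
```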
9. Apprentice learner design
- An instance ⟨(u,v), Pr(c|v)⟩ is represented by
  - ⟨t,d⟩ features for offsets d up to some dmax
  - HREF source page topics ⟨c, Pr(c|u)⟩ for every class c
  - Cocitation: if w1 and w2 are siblings, w1 good ⇒ w2 good
- Learning algorithm (see the scoring sketch below)
  - Want to map features to a score in [0,1]
  - Discretize [0,1] into ranges, each a class label λ
  - Each class label λ has an associated value qλ
  - Use a naïve Bayes classifier to find Pr(λ | (u,v))
  - Score = Σλ qλ · Pr(λ | (u,v))
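A minimal sketch of this discretize-and-average scheme, assuming scikit-learn's DictVectorizer and MultinomialNB over bag-of-⟨t,d⟩ counts; the number of bins and the feature encoding are illustrative choices, not those of the paper.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

# Discretize [0,1] into ranges; each range is a class label lambda with a value q_lambda
BIN_EDGES = np.linspace(0.0, 1.0, 11)               # 10 equal-width bins (illustrative)
BIN_VALUES = (BIN_EDGES[:-1] + BIN_EDGES[1:]) / 2   # q_lambda = midpoint of the bin

def to_bag(features):
    """Bag-of-features encoding of the (term, offset) pairs of one (u, v) instance."""
    bag = {}
    for t, d in features:
        key = f"{t}@{d}"
        bag[key] = bag.get(key, 0) + 1
    return bag

def train_apprentice(instances):
    """instances: list of (features, relevance) pairs collected from the baseline."""
    vec = DictVectorizer()
    X = vec.fit_transform([to_bag(f) for f, _ in instances])
    # Label = index of the discretized relevance bin
    y = np.digitize([r for _, r in instances], BIN_EDGES[1:-1])
    nb = MultinomialNB().fit(X, y)
    return vec, nb

def apprentice_score(vec, nb, features):
    """Expected relevance: sum over bins of q_lambda * Pr(lambda | (u, v))."""
    probs = nb.predict_proba(vec.transform([to_bag(features)]))[0]
    return float(sum(BIN_VALUES[label] * p for label, p in zip(nb.classes_, probs)))
```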
10. Apprentice accuracy
- Better elimination of useless outlinks with increasing dmax
- Good accuracy with dmax < 6
- Using DOM offset information improves accuracy
- Small additional accuracy gains from
  - The LCA distance in ⟨t,p,d⟩
  - Source page topics
  - Cocitation features
11. Offline apprentice trials
- Run the baseline crawler, then train the apprentice
- Start a new crawler at the same URLs
  - Let it fetch any page it schedules
  - (Recall) limit it to pages visited by the baseline crawler
- Baseline loss > recall loss > apprentice loss
- Small URL overlap
12. Online guidance from the apprentice
- Run the baseline crawler
- Train the apprentice
- Re-evaluate the frontier (see the sketch below)
  - The apprentice is not as optimistic as the baseline
  - Many URLs get downgraded
- Continue crawling with apprentice guidance
- Immediate reduction in the loss rate
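A sketch of frontier re-evaluation, continuing the earlier hypothetical helpers: once the apprentice is trained, the priorities the baseline assigned to pending URLs are recomputed from the stored link contexts.

```python
import heapq

def reprioritize_frontier(frontier, link_contexts, vec, nb):
    """Rebuild the frontier priority queue using apprentice scores in place of
    the parent page's Pr(c|u). frontier holds (-priority, url) entries;
    link_contexts maps each pending URL to its HREF context features."""
    rescored = []
    for _, url in frontier:
        feats = link_contexts.get(url, [])
        score = apprentice_score(vec, nb, feats)   # typically lower than the baseline's guess
        rescored.append((-score, url))
    heapq.heapify(rescored)
    return rescored
```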
13. Summary
- A new generation of focused crawler
  - Discriminates between links, learning online
- The apprentice is easy and fast to train online
  - Accurate with small dmax, around 4–6
  - DOM-derived features work better than plain text
- Effective division of labor (what vs. how)
- Loss rate reduced by 30–90%
  - The apprentice is better at guessing the relevance of unvisited nodes than the baseline crawler
  - Benefits are visible after 100–1000 page fetches
14. Ongoing work
- Extending to larger radius and deeper DOM and site structure
- Public-domain C software
  - Crawler
    - Asynchronous DNS, simple callback model
    - Can saturate a dedicated 4 Mbps link with a Pentium 2
  - HTML cleaner
    - Simple, customizable, table-driven patch logic
    - Robust to bad HTML; no crashes or memory leaks
  - HTML-to-DOM converter
    - Extensible DOM node class