Accelerated Focused Crawling Through Online Relevance Feedback - PowerPoint PPT Presentation

1
Accelerated Focused Crawling Through Online Relevance Feedback
  • Soumen Chakrabarti, IIT Bombay
  • Kunal Punera, IIT Bombay
  • Mallela Subramanyam, UT Austin

2
First-generation focused crawling
If Pr(c|u) is large enough, then enqueue all outlinks v of u with priority Pr(c|u)

[Diagram: seed URLs populate a frontier URL priority queue; the crawler picks the best URL, fetches page u into the crawl database, and submits it for classification.]
  • Crawl regions of the Web pertaining to a specific
    topic c, avoiding irrelevant topics
  • Guess relevance of unseen node v based on the
    relevance of u (u→v) evaluated by the topic classifier
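The baseline policy above can be sketched as a priority-queue loop. This is a minimal sketch, not the paper's implementation: `fetch` and `classify` are hypothetical stand-ins for the page downloader and the topic classifier Pr(c|u).

```python
import heapq

def focused_crawl(seed_urls, fetch, classify, threshold=0.5, max_pages=1000):
    """Baseline focused crawler: if a fetched page u is relevant enough,
    enqueue all of its outlinks with u's relevance as their priority."""
    # heapq is a min-heap, so priorities are negated to pop the best URL first
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    crawl_db = {}
    while frontier and len(crawl_db) < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)              # download page u
        relevance = classify(page)     # Pr(c|u) from the topic classifier
        crawl_db[url] = relevance
        if relevance >= threshold:
            for v in page.outlinks:    # guess Pr(c|v) ≈ Pr(c|u)
                if v not in seen:
                    seen.add(v)
                    heapq.heappush(frontier, (-relevance, v))
    return crawl_db
```

Note the leap of faith: every outlink inherits the parent's score, with no discrimination among links on the same page.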

3
Baseline crawling results
  • 20 topics from http://dmoz.org
  • Half to two-thirds of pages fetched are
    irrelevant
  • Every click on a link is a leap of faith
  • Humans leap better than focused crawlers
  • Adequate clues in the text and DOM to leap better

4
How to seek out distant goals?
  • Manually collect paths (context graphs) leading
    to relevant goals
  • Use a classifier to predict link-distance to
    goal from page text
  • Prioritize link expansion by estimated distance
    to relevant goals
  • No discrimination among different links on a page

[Diagram: context graph with a goal page and pages at link-distances 1-4; a classifier estimates distance from page text and feeds the crawler.]
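One way to turn such a layer classifier into a crawl priority is the expected link-distance to a goal. This is an illustrative sketch, not the exact context-graph scheme: the `distance_classifier` interface and the expectation step are assumptions.

```python
def context_graph_priority(page_text, distance_classifier):
    """Convert a layer classifier's output Pr(layer = i | text), i = 0..L,
    into a queue priority: the expected link-distance to a goal page
    (smaller is better, so it prioritizes pages nearest a goal)."""
    layer_probs = distance_classifier(page_text)
    return sum(i * p for i, p in enumerate(layer_probs))
```

A crawler would expand the frontier URL with the smallest such priority first.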
5
The apprentice-critic strategy
[Diagram: the crawler picks the best URL from the frontier priority queue, fetches page u into the crawl database, and submits it for classification. The baseline learner (critic), backed by the Dmoz topic taxonomy and class models consisting of term stats, marks pages good or bad when Pr(c|u) is large enough; its good/bad verdicts on each link u→v train the apprentice that prioritizes the frontier.]
6
Design rationale
  • Baseline classifier specifies what to crawl
  • Could be a user-defined black box
  • Usually depends on large amounts of training
    data, relatively slow to train (hours)
  • Apprentice classifier learns how to locate pages
    approved by the baseline classifier
  • Uses local regularities in HTML, site structure
  • Less training data, faster training (minutes)
  • Guards against high fan-out (10)
  • No need to manually collect paths to goals

7
Apprentice feature design
  • HREF source page u represented by DOM tree
  • Leaf nodes marked with offsets wrt the HREF
  • Many usable representations for term at offset d
  • A ⟨t, d⟩ tuple, e.g., ⟨download, -2⟩
  • ⟨t, p, d⟩, where p is the path length from t to d
    through their least common ancestor (LCA)

[Diagram: DOM tree of the HREF source page (a ul with li children containing a, em, font, and tt elements with TEXT leaves). Leaf text nodes are marked with offsets @-2, @-1, @0, @1, @2, @3 relative to the HREF.]
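A simplified sketch of ⟨t, d⟩ feature extraction. For brevity it treats the text around the HREF as flat token lists rather than walking the DOM tree, which is an assumption of this sketch; the function and parameter names are hypothetical.

```python
def href_features(tokens_before, anchor_tokens, tokens_after, d_max=5):
    """Build <t, d> features around an HREF: d is the offset relative to
    the anchor (negative before, 0 inside, positive after), clipped to d_max."""
    feats = []
    # tokens to the left of the HREF get offsets -1, -2, ... moving outward
    for i, t in enumerate(reversed(tokens_before)):
        d = -(i + 1)
        if -d <= d_max:
            feats.append((t, d))
    # anchor text itself sits at offset 0
    for t in anchor_tokens:
        feats.append((t, 0))
    # tokens to the right get offsets 1, 2, ...
    for i, t in enumerate(tokens_after):
        d = i + 1
        if d <= d_max:
            feats.append((t, d))
    return feats
```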
8
Offsets of good ⟨t, d⟩ features
  • Plot information gain at each d averaged over
    terms t
  • Max at d = 0, falls off on both sides, but
  • Clear saw-tooth pattern for most topics. Why?
  • <li><a><li><a>
  • <a><br><a><br>
  • Topic-independent authoring idioms, best handled
    by apprentice
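The plot averages, at each offset d, the information gain of ⟨t, d⟩ features over terms t. A minimal sketch of information gain for one boolean feature against good/bad link labels (standard definition; not code from the paper):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_present):
    """Reduction in label entropy from knowing whether the feature
    (e.g. a specific <t, d> term-offset pair) occurs near the HREF."""
    n = len(labels)
    with_f = [y for y, f in zip(labels, feature_present) if f]
    without = [y for y, f in zip(labels, feature_present) if not f]
    cond = sum(len(part) / n * entropy(part)
               for part in (with_f, without) if part)
    return entropy(labels) - cond
```

Averaging this quantity over all terms t at a fixed d yields one point of the saw-tooth curve.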

9
Apprentice learner design
  • Instance ⟨(u,v), Pr(c|v)⟩ represented by
  • ⟨t, d⟩ features for d up to some dmax
  • HREF source topics ⟨c, Pr(c|u)⟩ for each topic c
  • Cocitation: w1, w2 siblings, w1 good ⇒ w2 good
  • Learning algorithm
  • Want to map features to a score in [0, 1]
  • Discretize [0, 1] into ranges, each a label ℓ
  • Class label ℓ has an associated value qℓ
  • Use a naïve Bayes classifier to find Pr(ℓ|(u,v))
  • Score = Σℓ qℓ Pr(ℓ|(u,v))
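The scoring step can be sketched directly: discretize [0, 1] into bins, let a classifier output a distribution over bin labels, and take the expected bin value. The bin boundaries and representative values below are illustrative assumptions, not the paper's settings.

```python
def discretize(score, bins):
    """Map a relevance score in [0, 1] to the label (index) of its bin.
    bins is a list of (lo, hi) half-open ranges covering [0, 1]."""
    for label, (lo, hi) in enumerate(bins):
        if lo <= score < hi or (hi == 1.0 and score == 1.0):
            return label
    raise ValueError("score outside [0, 1]")

def apprentice_score(label_probs, bin_values):
    """Expected relevance of a link: sum over labels of
    q_label * Pr(label | (u, v)), with Pr(...) from naive Bayes."""
    return sum(q * p for q, p in zip(bin_values, label_probs))
```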

10
Apprentice accuracy
  • Better elimination of useless outlinks with
    increasing dmax
  • Good accuracy with dmax < 6
  • Using DOM offset info improves accuracy
  • Small accuracy gains
  • LCA distance in ⟨t, p, d⟩
  • Source page topics
  • Cocitation features

11
Offline apprentice trials
  • Run baseline, train apprentice
  • Start new crawler at the same URLs
  • Let it fetch any page it schedules
  • (Recall) limit it to pages visited by the
    baseline crawler
  • Baseline loss > recall loss > apprentice loss
  • Small URL overlap

12
Online guidance from apprentice
  • Run baseline
  • Train apprentice
  • Re-evaluate frontier
  • Apprentice not as optimistic as baseline
  • Many URLs downgraded
  • Continue crawling with apprentice guidance
  • Immediate reduction in loss rate
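Re-evaluating the frontier amounts to re-scoring every queued link with the freshly trained apprentice and rebuilding the heap. The entry layout and the `score_link` callback below are assumptions of this sketch.

```python
import heapq

def reprioritize_frontier(frontier, score_link):
    """Re-score every queued entry (neg_priority, url, features) with the
    trained apprentice and rebuild the min-heap on -score, so downgraded
    URLs sink and promising ones rise before crawling resumes."""
    rescored = [(-score_link(feats), url, feats)
                for _, url, feats in frontier]
    heapq.heapify(rescored)
    return rescored
```

Since the apprentice is typically less optimistic than the baseline, many entries come back with lower priorities than before.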

13
Summary
  • New generation of focused crawler
  • Discriminates between links, learning online
  • Apprentice easy and fast to train online
  • Accurate with small dmax around 4-6
  • DOM-derived features better than text
  • Effective division of labor (what vs. how)
  • Loss rate reduced by 30-90%
  • Apprentice better at guessing relevance of
    unvisited nodes than baseline crawler
  • Benefits visible after 100-1000 page fetches

14
Ongoing work
  • Extending to larger radius and deeper DOM site
    structure
  • Public domain C software
  • Crawler
  • Asynchronous DNS, simple callback model
  • Can saturate a dedicated 4 Mbps link with a Pentium-2
  • HTML cleaner
  • Simple, customizable, table-driven patch logic
  • Robust to bad HTML, no crash or memory leak
  • HTML to DOM converter
  • Extensible DOM node class