Title: Accelerated Focused Crawling Through Online Relevance Feedback
1. Accelerated Focused Crawling Through Online Relevance Feedback
- Soumen Chakrabarti, IIT Bombay
- Kunal Punera, IIT Bombay
- Mallela Subramanyam, UT Austin
2. First-generation focused crawling
- If Pr(c|u) is large enough, then enqueue all outlinks v of u with priority Pr(c|u)
[Diagram: seed URLs populate a frontier URL priority queue; the crawler picks the best URL, fetches page u into the crawl database, and submits it for classification]
- Crawl regions of the Web pertaining to a specific topic c, avoiding irrelevant topics
- Guess the relevance of an unseen node v based on the relevance of u (u→v), as evaluated by a topic classifier (see the loop sketch below)
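A minimal sketch of this baseline loop (in Python, for illustration), assuming a black-box topic classifier that returns Pr(c|u) in [0,1]; the helpers fetch and extract_outlinks are hypothetical placeholders rather than the authors' implementation.

```python
import heapq

def baseline_focused_crawl(seed_urls, topic, classifier, fetch, extract_outlinks,
                           threshold=0.5, max_pages=10000):
    """First-generation focused crawler: every outlink of a relevant page
    inherits the parent's relevance Pr(c|u) as its crawl priority."""
    # heapq is a min-heap, so negate priorities to pop the best URL first
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    crawl_db = {}

    while frontier and len(crawl_db) < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)                      # download page u
        relevance = classifier(page, topic)    # Pr(c|u) from the topic classifier
        crawl_db[url] = (page, relevance)
        if relevance >= threshold:
            # Leap of faith: every outlink v of u gets priority Pr(c|u)
            for v in extract_outlinks(page):
                if v not in seen:
                    seen.add(v)
                    heapq.heappush(frontier, (-relevance, v))
    return crawl_db
```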
3. Baseline crawling results
- 20 topics from http://dmoz.org
- Half to two-thirds of pages fetched are irrelevant
- Every click on a link is a leap of faith
- Humans leap better than focused crawlers
- There are adequate clues in the text and DOM to leap better
4. How to seek out distant goals?
- Manually collect paths (context graphs) leading to relevant goals
- Use a classifier to predict the link distance to a goal from page text
- Prioritize link expansion by the estimated distance to relevant goals (a small sketch follows the diagram)
- No discrimination among different links on a page
[Diagram: a context graph with layers of pages at link distance 1, 2, and 3 from the goal; a classifier estimates the layer of each fetched page and the crawler expands pages estimated to be closest first]
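A rough sketch of this prioritization, assuming a hypothetical distance_classifier that maps page text to an estimated layer (link distance to a goal); note that the single page-level score is shared by all outlinks on the page.

```python
def context_graph_priority(page_text, distance_classifier):
    """Prior-work strategy: one score per page, so every outlink on the
    page inherits the same priority (no per-link discrimination)."""
    est_distance = distance_classifier(page_text)    # predicted layer: 0, 1, 2, ...
    return 1.0 / (1.0 + est_distance)                # closer to a goal => higher priority
```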
5. The apprentice-critic strategy
[Diagram: the baseline learner (critic), built over the Dmoz topic taxonomy with class models consisting of term statistics, judges each newly fetched page u as good or bad ("Pr(c|u) is large enough"); these judgments train the apprentice, which re-scores outlinks v on the frontier URL priority queue, from which the crawler picks the best URL and adds the fetched page to the crawl database]
6. Design rationale
- The baseline classifier specifies what to crawl
  - Could be a user-defined black box
  - Usually depends on large amounts of training data; relatively slow to train (hours)
- The apprentice classifier learns how to locate pages approved by the baseline classifier (a training-data sketch follows)
  - Uses local regularities in HTML and site structure
  - Needs less training data; faster to train (minutes)
  - Guards against high fan-out (>10)
  - No need to manually collect paths to goals
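A sketch of how the critic's verdicts can become apprentice training data, under the assumption that the crawler records, for each fetched URL v, the DOM context of the HREF (u→v) through which it was reached; the structure names here are illustrative.

```python
def collect_apprentice_instances(link_contexts, baseline_relevance):
    """Pair each link context with the critic's verdict on the page it led to.

    link_contexts: dict mapping a fetched URL v to the features of the HREF (u -> v)
    baseline_relevance: dict mapping v to Pr(c|v) as judged by the baseline classifier
    """
    instances = []
    for v, context in link_contexts.items():
        if v in baseline_relevance:
            # The target is the critic's relevance of v, i.e., how good the link turned out to be
            instances.append((context, baseline_relevance[v]))
    return instances
```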
7. Apprentice feature design
- The HREF source page u is represented by its DOM tree
- Leaf nodes are marked with offsets with respect to the HREF
- Many usable representations for a term at offset d (an extraction sketch follows the diagram)
  - A ⟨t,d⟩ tuple, e.g., ⟨download, -2⟩
  - ⟨t,p,d⟩, where p is the path length from t to the HREF through their least common ancestor (LCA)
[Diagram: a DOM tree rooted at a ul with li children containing a, tt, em, and font elements; text leaves are numbered by their offset relative to the HREF, e.g., @-2, @-1, @0, @0, @1, @2, @3]
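A simplified sketch of ⟨t,d⟩ feature extraction, assuming an HTML parser such as BeautifulSoup; the traversal and tokenization below are illustrative and not the authors' exact procedure.

```python
from bs4 import BeautifulSoup

def anchor_context_features(html, target_href, d_max=5):
    """Extract <t, d> features: terms t at text-leaf offset d relative to the HREF.

    Text leaves inside the anchor get offset 0; leaves before it get negative
    offsets and leaves after it positive offsets, counted in document order."""
    soup = BeautifulSoup(html, "html.parser")
    anchor = soup.find("a", href=target_href)
    if anchor is None:
        return []

    leaves = [s for s in soup.find_all(string=True) if s.strip()]
    anchor_leaves = set(anchor.find_all(string=True))
    anchor_idx = [i for i, s in enumerate(leaves) if s in anchor_leaves]
    if not anchor_idx:
        return []
    first, last = anchor_idx[0], anchor_idx[-1]

    features = []
    for i, leaf in enumerate(leaves):
        if i < first:
            d = i - first          # leaves before the anchor: -1, -2, ...
        elif i > last:
            d = i - last           # leaves after the anchor: +1, +2, ...
        else:
            d = 0                  # the anchor text itself
        if abs(d) <= d_max:
            for term in leaf.lower().split():
                features.append((term, d))
    return features
```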
8. Offsets of good ⟨t,d⟩ features
- Plot the information gain at each offset d, averaged over terms t (a measurement sketch follows this list)
- Maximum at d = 0, falling off on both sides, but
- A clear saw-tooth pattern for most topics. Why?
  - <li><a><li><a>
  - <a><br><a><br>
- These are topic-independent authoring idioms, best handled by the apprentice
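One way to reproduce such a plot, sketched below: for each offset d, compute the information gain of the event "term t occurs at offset d" with respect to the good/bad label, then average over terms. The entropy-based formulation here is a standard one and not necessarily the exact statistic used in the talk.

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

def avg_info_gain_by_offset(instances):
    """instances: list of (features, label), where features is a set of (term, d)
    pairs and label is 'good' or 'bad' according to the baseline classifier."""
    h_base = entropy(Counter(label for _, label in instances))
    n = len(instances)

    terms_at_offset = defaultdict(set)
    for feats, _ in instances:
        for t, d in feats:
            terms_at_offset[d].add(t)

    gains = defaultdict(list)
    for d, terms in terms_at_offset.items():
        for t in terms:
            with_f, without_f = Counter(), Counter()
            for feats, label in instances:
                (with_f if (t, d) in feats else without_f)[label] += 1
            # Conditional entropy of the label given presence/absence of the feature
            h_cond = sum(sum(c.values()) / n * entropy(c)
                         for c in (with_f, without_f) if c)
            gains[d].append(h_base - h_cond)

    # Average information gain per offset d
    return {d: sum(g) / len(g) for d, g in gains.items()}
```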
9. Apprentice learner design
- An instance ⟨(u,v), Pr(c|v)⟩ is represented by
  - ⟨t,d⟩ features for offsets d up to some dmax
  - HREF source page topics ⟨c, Pr(c|u)⟩ for every class c
  - Cocitation: if w1 and w2 are siblings, w1 good ⇒ w2 good
- Learning algorithm (see the scoring sketch below)
  - Want to map features to a score in [0,1]
  - Discretize [0,1] into ranges, each a class label λ
  - Each class label λ has an associated value qλ
  - Use a naïve Bayes classifier to find Pr(λ | (u,v))
  - Score = Σλ qλ · Pr(λ | (u,v))
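A minimal sketch of this discretize-and-average scheme, assuming scikit-learn's DictVectorizer and MultinomialNB over bag-of-⟨t,d⟩ counts; the number of bins and the feature encoding are illustrative choices, not those of the paper.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

# Discretize [0,1] into ranges; each range is a class label lambda with a value q_lambda
BIN_EDGES = np.linspace(0.0, 1.0, 11)               # 10 equal-width bins (illustrative)
BIN_VALUES = (BIN_EDGES[:-1] + BIN_EDGES[1:]) / 2   # q_lambda = midpoint of the bin

def to_bag(features):
    """Bag-of-features encoding of the (term, offset) pairs of one (u, v) instance."""
    bag = {}
    for t, d in features:
        key = f"{t}@{d}"
        bag[key] = bag.get(key, 0) + 1
    return bag

def train_apprentice(instances):
    """instances: list of (features, relevance) pairs collected from the baseline."""
    vec = DictVectorizer()
    X = vec.fit_transform([to_bag(f) for f, _ in instances])
    # Label = index of the discretized relevance bin
    y = np.digitize([r for _, r in instances], BIN_EDGES[1:-1])
    nb = MultinomialNB().fit(X, y)
    return vec, nb

def apprentice_score(vec, nb, features):
    """Expected relevance: sum over bins of q_lambda * Pr(lambda | (u, v))."""
    probs = nb.predict_proba(vec.transform([to_bag(features)]))[0]
    return float(sum(BIN_VALUES[label] * p for label, p in zip(nb.classes_, probs)))
```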
10. Apprentice accuracy
- Better elimination of useless outlinks with increasing dmax
- Good accuracy with dmax < 6
- Using DOM offset information improves accuracy
- Small additional accuracy gains from
  - The LCA distance in ⟨t,p,d⟩
  - Source page topics
  - Cocitation features
11. Offline apprentice trials
- Run the baseline crawler, then train the apprentice
- Start a new crawler at the same URLs
  - Let it fetch any page it schedules
  - (Recall) limit it to pages visited by the baseline crawler
- Baseline loss > recall loss > apprentice loss
- Small URL overlap
12. Online guidance from the apprentice
- Run the baseline crawler
- Train the apprentice
- Re-evaluate the frontier (see the sketch below)
  - The apprentice is not as optimistic as the baseline
  - Many URLs get downgraded
- Continue crawling with apprentice guidance
- Immediate reduction in the loss rate
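A sketch of frontier re-evaluation, continuing the earlier hypothetical helpers: once the apprentice is trained, the priorities the baseline assigned to pending URLs are recomputed from the stored link contexts.

```python
import heapq

def reprioritize_frontier(frontier, link_contexts, vec, nb):
    """Rebuild the frontier priority queue using apprentice scores in place of
    the parent page's Pr(c|u). frontier holds (-priority, url) entries;
    link_contexts maps each pending URL to its HREF context features."""
    rescored = []
    for _, url in frontier:
        feats = link_contexts.get(url, [])
        score = apprentice_score(vec, nb, feats)   # typically lower than the baseline's guess
        rescored.append((-score, url))
    heapq.heapify(rescored)
    return rescored
```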
13. Summary
- A new generation of focused crawler
  - Discriminates between links, learning online
- The apprentice is easy and fast to train online
  - Accurate with small dmax, around 4–6
  - DOM-derived features work better than plain text
- Effective division of labor (what vs. how)
- Loss rate reduced by 30–90%
  - The apprentice is better at guessing the relevance of unvisited nodes than the baseline crawler
  - Benefits are visible after 100–1000 page fetches
14. Ongoing work
- Extending to larger radius and deeper DOM and site structure
- Public-domain C software
  - Crawler
    - Asynchronous DNS, simple callback model
    - Can saturate a dedicated 4 Mbps link with a Pentium 2
  - HTML cleaner
    - Simple, customizable, table-driven patch logic
    - Robust to bad HTML; no crashes or memory leaks
  - HTML-to-DOM converter
    - Extensible DOM node class