Augmenting Focused Crawling using Search Engine Queries

Provided by: wingCom
1
Augmenting Focused Crawling using Search Engine Queries
  • Wang Xuan
  • 10th Nov 2006

2
What is focused crawling?
  • Crawling vs. Focused crawling

[Figure labels: Seed Page, Target page]
3
Crawling methods
  • Web search algorithms
      • Breadth-first (used in standard crawling)
      • Best-first (used in focused crawling)
      • Both are local-search strategies
  • Web analysis algorithms
      • Content-based web analysis: page text, title, URL, page layout
      • Link-based web analysis: it is hard to analyze a page while
        the search graph is not yet completely known
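
The difference between the two search strategies above can be sketched on a toy link graph. This is a minimal illustration, not the presenter's implementation: the graph `LINKS` and the relevance table `SCORE` are hypothetical, and a real focused crawler would have to estimate a link's score before fetching it.

```python
import heapq
from collections import deque

# Hypothetical toy link graph: page -> outgoing links.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["target"],
    "c": [],
    "target": [],
}
# Hypothetical relevance scores in [0, 1]; assumed known here for simplicity.
SCORE = {"seed": 0.5, "a": 0.1, "b": 0.9, "c": 0.1, "target": 1.0}


def breadth_first(start):
    """Visit pages in FIFO order, as in standard crawling."""
    order, seen, frontier = [], {start}, deque([start])
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


def best_first(start):
    """Visit the highest-scoring frontier page first, as in focused crawling."""
    order, seen = [], {start}
    frontier = [(-SCORE[start], start)]  # negate scores: heapq is a min-heap
    while frontier:
        _, page = heapq.heappop(frontier)
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-SCORE[link], link))
    return order
```

On this graph, breadth-first reaches `target` last, while best-first reaches it third, because it follows the high-scoring page `b` before the low-scoring page `a`.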

4
Related works
  • Naïve Bayes Crawler: the relevance score is the cosine
    similarity between page and topic
  • IBM focused crawler: introduces a distiller to find topic hubs
  • CORA crawler: assigns a Q-value according to the number of
    target pages in the neighborhood
  • Context focused crawler: introduces a link hierarchy
  • Automatic Publication Data Gatherer: classifies the webpage
    without the page
  • PaSE: locates publications using a search engine
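
The cosine-similarity relevance score mentioned in the first item can be sketched as follows. This is only an illustration under simplifying assumptions: pages and topics are represented as raw term-frequency vectors, and the helper names `term_vector` and `cosine_similarity` are hypothetical, not taken from any of the systems listed.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercased bag-of-words term frequencies (a simplifying assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Example: score a page against a topic description.
topic = term_vector("focused crawling publication pages")
page = term_vector("A publication list on focused crawling")
relevance = cosine_similarity(page, topic)
```

A score near 1 means the page and the topic share most of their vocabulary; a score of 0 means they share none.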

5
General framework
Term Extraction module
Highly dependent on the seed pages
6
Baseline system
7
Three stages of the crawling
8
Framework for upgraded system
9
TargetURL
Search Engine
Term Extraction
More Pages
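
One possible reading of the labels above is a loop in which terms extracted from a found target page are sent to a search engine, and the results are added to the crawl frontier. The sketch below assumes that reading. The stopword list, the frequency-based `extract_terms`, and the `search` callable are all hypothetical; the slides do not say how terms are chosen or which search engine is queried.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "for", "to", "using"}


def extract_terms(text, k=3):
    """Return the k most frequent non-stopword terms in a page's text."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return [term for term, _ in Counter(words).most_common(k)]


def expand_frontier(target_text, search, frontier, seen):
    """Query a search engine with terms from a target page; queue new URLs.

    `search` is an assumed callable that maps a query string to a list of
    URLs. It stands in for whatever search engine API is actually used.
    """
    query = " ".join(extract_terms(target_text))
    for url in search(query):
        if url not in seen:
            seen.add(url)
            frontier.append(url)
    return query
```

Passing the search engine in as a callable keeps the sketch independent of any particular API and lets the loop be exercised with a stub.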
11
                          Baseline system   Upgraded system
Publication pages found   45                117
Precision (%)             3.21              8.36
Recall (%)                26.63             69.23
F1                        0.057             0.149
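
The F1 values can be checked against the reported precision and recall. The sketch below assumes that precision and recall are given in percent and that F1 is the usual harmonic mean expressed as a fraction. The slides do not state this, but the numbers agree under that assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall, converted from percent to fractions.
baseline = f1_score(3.21 / 100, 26.63 / 100)
upgraded = f1_score(8.36 / 100, 69.23 / 100)
print(round(baseline, 3), round(upgraded, 3))  # prints: 0.057 0.149
```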