Interactive Focused Crawler: Setup, Monitoring and Control Through User Feedback

1
Interactive Focused Crawler: Setup, Monitoring
and Control Through User Feedback
M. Tech. Project First Stage Presentation
Roger Menezes
under the guidance of Prof. Soumen Chakrabarti
2
Focused Crawler
  • Hypertext resource discovery agent
  • Selectively seeks out pages relevant to a
    pre-defined set of topics
  • Motivation
  • General-purpose crawlers have too much to do
    and are expensive
  • Topic-specific search engines and portals

3
Basic Framework
  • Pages link to related pages
  • Train classifier to recognize relevant topics
  • Maintain a crawl frontier
  • Prioritize all neighbors v of a fetched page u
    based on the score of u
  • Pick an unvisited link v of the highest priority
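The frontier logic above can be sketched as a priority queue; `fetch` and `score` stand in for the page fetcher and the baseline classifier, and all names here are illustrative:

```python
import heapq

def crawl(seeds, fetch, score, limit=100):
    # Max-priority frontier (heapq is a min-heap, so priorities are negated).
    frontier = [(-1.0, url) for url in seeds]   # seeds get top priority
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []                                # pages in fetch order
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        page = fetch(url)
        relevance = score(page)                 # score of the fetched page u
        for link in page["links"]:              # neighbors v inherit u's score
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance, link))
    return visited

# Toy four-page web (illustrative only).
web = {
    "a": {"text": "crawl topic", "links": ["b", "c"]},
    "b": {"text": "focused crawl", "links": []},
    "c": {"text": "sports", "links": ["d"]},
    "d": {"text": "crawl", "links": []},
}
order = crawl(["a"], lambda u: web[u],
              lambda p: 1.0 if "crawl" in p["text"] else 0.0)
```

Links found on relevant pages are popped before links found on irrelevant ones, which is the prioritization described above.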

4
Existing Architecture
  • Not all links lead to related pages
  • Web pages are noisy and multi-topic
  • Neighborhood textual content
  • Learn two classifiers
  • Baseline scores each fetched page
  • Apprentice scores each link in the crawl
    frontier

5
Existing Architecture (contd.)
6
Existing Architecture (contd.)
  • Baseline classifier trained over a standard
    taxonomy
  • Need for a standard taxonomy
  • Characterization of the negative class
  • User's example URLs routed to categories in the
    taxonomy
  • User modifies and confirms selection
  • Interest topics may not cleanly fit into the
    taxonomy
  • Bad User Interface

7
Existing Architecture (contd.)
  • Resource Discovery
  • Lack of monitoring tools
  • Topic Specificity
  • Incorporate User Feedback
  • Control of the crawler

8
Project Aim
  • Build the focus crawler as an application for the
    end-user
  • Better user interface
  • Obviating the standard taxonomy
  • Increasing user interactivity
  • More control over the application
  • Extensions to crawl frontier monitoring

9
Focused Crawler: Proposed Phases
  • Setting Up
  • Learning the Baseline Classifier
  • Monitoring the crawl

10
Setting Up
  • Compensate for the absence of a standard taxonomy
  • Learn classifier from only positive labeled
    examples and mixed unlabeled examples
  • Need sufficient positive examples
  • Identify good positive features
  • Web pages are noisy and multi-topic
  • Segment into different coherent regions

11
Gathering Related Pages
  • User supplies keywords/phrases and URLs
  • Fetch pages and segment them into coherent regions
  • User selects relevant segments
  • Extract keywords/phrases
  • Query search engine and gather more URLs
  • Repeatedly run Related Pages algorithm and
    collect more pages
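The loop above can be sketched as one orchestration function; every callable is a pluggable component (fetcher, segmenter, user-approval UI, search engine, Related Pages algorithm) and all names here are placeholders:

```python
def gather_related_pages(seed_phrases, seed_urls, fetch, segment, approve,
                         extract_phrases, search, related_pages, rounds=2):
    # One possible orchestration of the gathering loop; all callables
    # are stand-ins for the components named on this slide.
    phrases, urls = list(seed_phrases), list(seed_urls)
    collected = []
    for _ in range(rounds):
        for url in urls:
            segments = segment(fetch(url))                  # coherent regions
            relevant = [s for s in segments if approve(s)]  # user feedback
            collected.extend(relevant)
            phrases.extend(extract_phrases(relevant))       # new keywords
        # query the search engine and run the Related Pages algorithm
        urls = search(phrases) + related_pages(urls)
    return collected, phrases
```

Each round grows both the phrase list (for search-engine querying) and the URL pool (via the Related Pages algorithm).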

12
Block Diagram
[Block diagram: the User supplies words/phrases and URLs; approved
words/phrases query the Search Engine, whose result URLs are fetched and
extracted/analyzed for more words/phrases; approved URLs are fetched and
passed to the Related Pages algorithm to gather more pages.]
13
User Interface
  • Screen prompting user to select relevant regions
    in the segmented web pages

14
User Interface
Screen displaying to the user the list of
keywords/phrases to be used for search-engine
querying.
15
User Interface
Screen displaying to the user the list of top-ranked
URLs gathered from the search engine.
16
Learning the classifier
  • Training using labeled positive examples
  • PEBL (Positive Example Based Learning)
  • User tests the classifier and provides feedback
  • User ignorant of the standard taxonomy

17
User Interface
Screen shown while the classifier is being trained.
The progress bar at the top-right corner indicates
the remaining time.
18
Monitoring the Crawl Frontier
  • Two modes
  • Resource Discovery
  • Maintenance
  • Topic-specific websites based on different
    criteria.
  • Maintaining a window of top 10 relevant pages.
  • Maximum link depth traversed by the crawler
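The window of top-10 relevant pages can be maintained with a bounded min-heap; a minimal sketch (class and method names are assumed):

```python
import heapq

class TopKWindow:
    # Keep the k highest-relevance pages seen so far.
    def __init__(self, k=10):
        self.k = k
        self._heap = []          # min-heap of (relevance, url)

    def add(self, url, relevance):
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (relevance, url))
        elif relevance > self._heap[0][0]:
            # evict the least relevant page currently in the window
            heapq.heapreplace(self._heap, (relevance, url))

    def pages(self):
        # most relevant first
        return sorted(self._heap, reverse=True)
```

Each `add` is O(log k), so the monitor can keep up with the crawl stream.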

19
Monitoring..contd.
  • Running HITS and returning the top 10 hubs
  • Links for the different web pages crawled.
  • A graph depicting the average relevance of pages
    fetched against time.
  • Retraining of the apprentice.
  • Start/Stop the crawler

20
Review of Existing Algorithms
  • PEBL
  • Related Pages Algorithm
  • Page Segmentation Algorithm

21
PEBL
  • Exploits marginal property of SVMs
  • Algorithm (positive set P, unlabeled set M)
  • Distinguish strong positive features
  • Separate the most negative points N
  • Learn SVM based on P and N
  • Classify other examples in M using the SVM
  • Add all negatively classified points to N
  • Go to step 3, if points were added to N
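The loop above can be sketched on toy word-set documents. The "strong positive feature" rule used here (words present in every positive document) and the word-count contrast scorer standing in for the SVM are simplifying assumptions:

```python
from collections import Counter

def train_scorer(pos_docs, neg_docs):
    # Toy stand-in for the SVM of step 3: positive counts minus negative counts.
    pos = Counter(w for d in pos_docs for w in d)
    neg = Counter(w for d in neg_docs for w in d)
    return lambda doc: sum(pos[w] - neg[w] for w in doc)

def pebl(positives, unlabeled):
    # Step 1: strong positive features (here: words in every positive doc).
    counts = Counter(w for doc in positives for w in set(doc))
    strong = {w for w, c in counts.items() if c == len(positives)}
    # Step 2: initial negatives N = unlabeled docs with no strong feature.
    negatives = [d for d in unlabeled if not (set(d) & strong)]
    remaining = [d for d in unlabeled if set(d) & strong]
    # Steps 3-6: retrain and grow N until nothing new is classified negative.
    while True:
        clf = train_scorer(positives, negatives)
        newly_neg = [d for d in remaining if clf(d) < 0]
        if not newly_neg:
            return clf, remaining   # classifier + predicted positives
        negatives += newly_neg
        remaining = [d for d in remaining if clf(d) >= 0]
```

The convergence condition matches step 6: the loop stops in the first iteration that adds no points to N.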

22
PEBL..contd.
  • Can use the same random sample M for different
    topics
  • Need not use SVMs; the same technique applies to
    other learning algorithms

23
Related Pages Algorithm
  • Technique discussed in Dean and Henzinger
  • Connectivity based algorithms
  • Cocitation Algorithm
  • Siblings are related
  • Degree of cocitation: the number of common
    parents two web pages have.
  • Two pages u and v having high degree of
    cocitation are related.
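The cocitation measure above reduces to a set intersection over parent links; a minimal sketch (function names assumed):

```python
def cocitation_degree(parents, u, v):
    # parents: {page: set of pages that link to it}
    return len(parents.get(u, set()) & parents.get(v, set()))

def related_by_cocitation(parents, u, k=10):
    # Rank candidate pages by their degree of cocitation with u.
    scores = {v: cocitation_degree(parents, u, v) for v in parents if v != u}
    return sorted(scores, key=lambda v: -scores[v])[:k]
```

Pages sharing more parents with u rank higher, matching the intuition that siblings are related.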

24
Related Pages Algorithm (contd.)
  • Companion Algorithm
  • Given a URL u, build a vicinity graph for it.
  • Up to B parents of u, and up to BF children of
    each parent
  • Up to F children of u, and up to FB parents of
    each child
  • Contract duplicates and near duplicates
  • Compute edge weights
  • Run HITS and return the top 10 authorities
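The HITS step can be sketched as power iteration over hub and authority scores; vicinity-graph construction and edge weighting are omitted here, so this is a bare sketch of Kleinberg's iteration:

```python
def hits(graph, iters=50):
    # graph: {node: list of outlinks}; returns (hub, authority) score dicts.
    nodes = set(graph) | {v for out in graph.values() for v in out}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iters):
        # authority of n = sum of hub scores of pages linking to n
        auth = {n: sum(hub[u] for u in nodes if n in graph.get(u, ()))
                for n in nodes}
        norm = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # hub of n = sum of authority scores of pages n links to
        hub = {n: sum(auth[v] for v in graph.get(n, ())) for n in nodes}
        norm = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth
```

The top 10 authorities of the slide are then `sorted(auth, key=auth.get, reverse=True)[:10]`.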

25
Related Pages Algorithm (contd.)
  • Extensions
  • Transitive degree of cocitation
  • Page v having a high degree of cocitation with u
  • A page w having a high degree of cocitation with
    v, but not with u, can still be considered
    related to u
  • Use of DOM content analysis along with the above
    algorithms
  • Need an extension to handle more than one URL

26
Page Segmentation Algorithm
  • An algorithm similar to VIPS (VIsion-based Page
    Segmentation)
  • Web pages are multi-topic
  • Segments pages using visual cues like color,
    images, font sizes, etc.
  • Proceeds in three phases
  • Outputs a content structure tree whose nodes are
    regions in page having coherent content

27
VIPS
  • Visual Block Extraction
  • A visual block is a single unit of semantically
    coherent content
  • Given a DOM node, check it for coherency
  • Iteratively replace a node by its children until
    permissible coherency is achieved
  • Visual Separator Detection
  • Separators between the visual blocks are
    retrieved
  • Based on differences in font sizes, color, and
    the presence of HR tags
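The Visual Block Extraction step can be sketched as a recursion over the DOM tree; the `is_coherent` predicate stands in for the visual-cue coherency check, and the `Node` class is a minimal assumed DOM stand-in:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Minimal stand-in for a DOM node.
    tag: str
    children: list = field(default_factory=list)

def extract_blocks(node, is_coherent):
    # A coherent node (or a leaf) becomes one visual block; otherwise the
    # node is replaced by its children and extraction recurses.
    if is_coherent(node) or not node.children:
        return [node]
    blocks = []
    for child in node.children:
        blocks.extend(extract_blocks(child, is_coherent))
    return blocks
```

Incoherent subtrees are split until every emitted block passes the coherency test.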

28
VIPS..contd.
  • Content Structure Construction
  • Merge blocks on either side of a separator into
    regions
  • Check whether each region's degree of coherency
    meets the threshold
  • If not, repeat the process within that region

29
References
  • Krishna Bharat and Monika R. Henzinger. Improved
    algorithms for topic distillation in a
    hyperlinked environment. In Proceedings of
    SIGIR-98, 21st ACM International Conference on
    Research and Development in Information
    Retrieval, pages 104-111, Melbourne, AU, 1998.
  • S. Chakrabarti, K. Punera, and M. Subramanyam.
    Accelerated focused crawling through online
    relevance feedback. In WWW, Hawaii. ACM, May
    2002.
  • Soumen Chakrabarti, Martin van den Berg, and
    Byron Dom. Focused crawling: A new approach to
    topic-specific Web resource discovery. Computer
    Networks, 31(11-16):1623-1640, 1999.

30
References (contd.)
  • Jeffrey Dean and Monika R. Henzinger. Finding
    related pages in the World Wide Web. Computer
    Networks, 31(11-16):1467-1479, 1999.
  • Jon M. Kleinberg. Authoritative sources in a
    hyperlinked environment. Journal of the ACM,
    46(5):604-632, 1999.
  • H. Yu, J. Han, and K. C.-C. Chang. PEBL: Positive
    example-based learning for web page
    classification using SVM, 2002.
  • Shipeng Yu, Deng Cai, Ji-Rong Wen, and Wei-Ying
    Ma. Improving pseudo-relevance feedback in web
    information retrieval using web page
    segmentation. In Proceedings of WWW2003, The
    Twelfth International World Wide Web Conference,
    Budapest, Hungary, 2003.