Interactive Focused Crawler: Setup, Monitoring and Control Through User Feedback

1
Interactive Focused Crawler: Setup, Monitoring
and Control Through User Feedback
M. Tech. Project First Stage Presentation
Roger Menezes
under the guidance of Prof. Soumen Chakrabarti
2
Focused Crawler
  • Hypertext resource discovery agent
  • Selectively seeks out pages relevant to a
    pre-defined set of topics
  • Motivation
  • General-purpose crawlers have too much to do
    and are expensive
  • Topic-specific search engines and portals

3
Basic Framework
  • Pages link to related pages
  • Train classifier to recognize relevant topics
  • Maintain a crawl frontier
  • Prioritize all neighbors v of a fetched page u
    based on the score of u
  • Pick an unvisited link v of the highest priority
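The frontier logic above can be sketched as a priority queue; `fetch` and `score` stand in for the page fetcher and the baseline classifier, and all names here are illustrative:

```python
import heapq

def crawl(seeds, fetch, score, limit=100):
    # Max-priority frontier (heapq is a min-heap, so priorities are negated).
    frontier = [(-1.0, url) for url in seeds]   # seeds get top priority
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []                                # pages in fetch order
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        page = fetch(url)
        relevance = score(page)                 # score of the fetched page u
        for link in page["links"]:              # neighbors v inherit u's score
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance, link))
    return visited

# Toy four-page web (illustrative only).
web = {
    "a": {"text": "crawl topic", "links": ["b", "c"]},
    "b": {"text": "focused crawl", "links": []},
    "c": {"text": "sports", "links": ["d"]},
    "d": {"text": "crawl", "links": []},
}
order = crawl(["a"], lambda u: web[u],
              lambda p: 1.0 if "crawl" in p["text"] else 0.0)
```

Links found on relevant pages are popped before links found on irrelevant ones, which is the prioritization described above.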

4
Existing Architecture
  • Not all links lead to related pages
  • Web pages are noisy and multi-topic
  • Neighborhood textual content
  • Learn two classifiers
  • Baseline scores each fetched page
  • Apprentice scores each link in the crawl
    frontier

5
Existing Architecture (contd.)
6
Existing Architecture (contd.)
  • Baseline classifier trained over a standard
    taxonomy
  • Need for a standard taxonomy
  • Characterization of the negative class
  • User's example URLs routed to categories in the
    taxonomy
  • User modifies and confirms selection
  • Interest topics may not cleanly fit into the
    taxonomy
  • Bad User Interface

7
Existing Architecture (contd.)
  • Resource Discovery
  • Lack of monitoring tools
  • Topic Specificity
  • Incorporate User Feedback
  • Control of the crawler

8
Project Aim
  • Build the focus crawler as an application for the
    end-user
  • Better user interface
  • Obviating the standard taxonomy
  • Increasing user interactivity
  • More control over the application
  • Extensions to crawl frontier monitoring

9
Focused Crawler: Proposed Phases
  • Setting Up
  • Learning the Baseline Classifier
  • Monitoring the crawl

10
Setting Up
  • Compensate for the absence of a standard taxonomy
  • Learn classifier from only positive labeled
    examples and mixed unlabeled examples
  • Need sufficient positive examples
  • Identify good positive features
  • Web pages are noisy and multi-topic
  • Segment into different coherent regions

11
Gathering Related Pages
  • User supplies keywords/phrases and URLs
  • Fetch pages and segment them into coherent regions
  • User selects relevant segments
  • Extract keywords/phrases
  • Query search engine and gather more URLs
  • Repeatedly run Related Pages algorithm and
    collect more pages
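The loop above can be sketched as one orchestration function; every callable is a pluggable component (fetcher, segmenter, user-approval UI, search engine, Related Pages algorithm) and all names here are placeholders:

```python
def gather_related_pages(seed_phrases, seed_urls, fetch, segment, approve,
                         extract_phrases, search, related_pages, rounds=2):
    # One possible orchestration of the gathering loop; all callables
    # are stand-ins for the components named on this slide.
    phrases, urls = list(seed_phrases), list(seed_urls)
    collected = []
    for _ in range(rounds):
        for url in urls:
            segments = segment(fetch(url))                  # coherent regions
            relevant = [s for s in segments if approve(s)]  # user feedback
            collected.extend(relevant)
            phrases.extend(extract_phrases(relevant))       # new keywords
        # query the search engine and run the Related Pages algorithm
        urls = search(phrases) + related_pages(urls)
    return collected, phrases
```

Each round grows both the phrase list (for search-engine querying) and the URL pool (via the Related Pages algorithm).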

12
Block Diagram
[Block diagram: the User supplies words/phrases and URLs; approved
words/phrases query the Search Engine, whose result URLs are fetched and
extracted/analyzed for more words/phrases; approved URLs are fetched and
passed to the Related Pages algorithm to gather more pages.]
13
User Interface
  • Screen prompting user to select relevant regions
    in the segmented web pages

14
User Interface
Screen displaying to the user the list of
keywords/phrases to be used for search-engine
querying.
15
User Interface
Screen displaying to the user the list of top-ranked
URLs gathered from the search engine.
16
Learning the classifier
  • Training using labeled positive examples
  • PEBL (Positive Example Based Learning)
  • User tests the classifier and provides feedback
  • User ignorant of the standard taxonomy

17
User Interface
Screen shown while the classifier is being trained.
The progress bar at the top-right corner indicates
the remaining time.
18
Monitoring the Crawl Frontier
  • Two modes
  • Resource Discovery
  • Maintenance
  • Topic-specific websites based on different
    criteria.
  • Maintaining a window of top 10 relevant pages.
  • Maximum link depth traversed by the crawler
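The window of top-10 relevant pages can be maintained with a bounded min-heap; a minimal sketch (class and method names are assumed):

```python
import heapq

class TopKWindow:
    # Keep the k highest-relevance pages seen so far.
    def __init__(self, k=10):
        self.k = k
        self._heap = []          # min-heap of (relevance, url)

    def add(self, url, relevance):
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (relevance, url))
        elif relevance > self._heap[0][0]:
            # evict the least relevant page currently in the window
            heapq.heapreplace(self._heap, (relevance, url))

    def pages(self):
        # most relevant first
        return sorted(self._heap, reverse=True)
```

Each `add` is O(log k), so the monitor can keep up with the crawl stream.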

19
Monitoring..contd.
  • Running HITS and returning the top 10 hubs
  • Links for the different web pages crawled.
  • A graph depicting the average relevance of pages
    fetched against time.
  • Retraining of the apprentice.
  • Start/Stop the crawler

20
Review of Existing Algorithms
  • PEBL
  • Related Pages Algorithm
  • Page Segmentation Algorithm

21
PEBL
  • Exploits marginal property of SVMs
  • Algorithm (positive set P, unlabeled set M)
  • Distinguish strong positive features
  • Separate the most negative points N
  • Learn SVM based on P and N
  • Classify other examples in M using the SVM
  • Add all negatively classified points to N
  • Go to step 3, if points were added to N
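The loop above can be sketched on toy word-set documents. The "strong positive feature" rule used here (words present in every positive document) and the word-count contrast scorer standing in for the SVM are simplifying assumptions:

```python
from collections import Counter

def train_scorer(pos_docs, neg_docs):
    # Toy stand-in for the SVM of step 3: positive counts minus negative counts.
    pos = Counter(w for d in pos_docs for w in d)
    neg = Counter(w for d in neg_docs for w in d)
    return lambda doc: sum(pos[w] - neg[w] for w in doc)

def pebl(positives, unlabeled):
    # Step 1: strong positive features (here: words in every positive doc).
    counts = Counter(w for doc in positives for w in set(doc))
    strong = {w for w, c in counts.items() if c == len(positives)}
    # Step 2: initial negatives N = unlabeled docs with no strong feature.
    negatives = [d for d in unlabeled if not (set(d) & strong)]
    remaining = [d for d in unlabeled if set(d) & strong]
    # Steps 3-6: retrain and grow N until nothing new is classified negative.
    while True:
        clf = train_scorer(positives, negatives)
        newly_neg = [d for d in remaining if clf(d) < 0]
        if not newly_neg:
            return clf, remaining   # classifier + predicted positives
        negatives += newly_neg
        remaining = [d for d in remaining if clf(d) >= 0]
```

The convergence condition matches step 6: the loop stops in the first iteration that adds no points to N.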

22
PEBL..contd.
  • Can use the same random sample M for different
    topics
  • Need not use SVMs; the same technique applies to
    other learning algorithms

23
Related Pages Algorithm
  • Technique discussed in Dean and Henzinger
  • Connectivity based algorithms
  • Cocitation Algorithm
  • Siblings are related
  • Degree of cocitation: the number of common
    parents two web pages have.
  • Two pages u and v having high degree of
    cocitation are related.
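The cocitation measure above reduces to a set intersection over parent links; a minimal sketch (function names assumed):

```python
def cocitation_degree(parents, u, v):
    # parents: {page: set of pages that link to it}
    return len(parents.get(u, set()) & parents.get(v, set()))

def related_by_cocitation(parents, u, k=10):
    # Rank candidate pages by their degree of cocitation with u.
    scores = {v: cocitation_degree(parents, u, v) for v in parents if v != u}
    return sorted(scores, key=lambda v: -scores[v])[:k]
```

Pages sharing more parents with u rank higher, matching the intuition that siblings are related.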

24
Related Pages Algorithm (contd.)
  • Companion Algorithm
  • Given a URL u, build a vicinity graph for it.
  • Up to B parents of u, and up to BF children of
    each parent
  • Up to F children of u, and up to FB parents of
    each child
  • Contract duplicates and near duplicates
  • Compute edge weights
  • Run HITS and return the top 10 authorities
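The HITS step can be sketched as power iteration over hub and authority scores; vicinity-graph construction and edge weighting are omitted here, so this is a bare sketch of Kleinberg's iteration:

```python
def hits(graph, iters=50):
    # graph: {node: list of outlinks}; returns (hub, authority) score dicts.
    nodes = set(graph) | {v for out in graph.values() for v in out}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iters):
        # authority of n = sum of hub scores of pages linking to n
        auth = {n: sum(hub[u] for u in nodes if n in graph.get(u, ()))
                for n in nodes}
        norm = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # hub of n = sum of authority scores of pages n links to
        hub = {n: sum(auth[v] for v in graph.get(n, ())) for n in nodes}
        norm = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth
```

The top 10 authorities of the slide are then `sorted(auth, key=auth.get, reverse=True)[:10]`.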

25
Related Pages Algorithm (contd.)
  • Extensions
  • Transitive degree of cocitation
  • Page v having a high degree of cocitation with u
  • A page w having a high degree of cocitation with
    v, but not with u, can still be considered
    related to u
  • Use of DOM content analysis along with the above
    algorithms
  • Need an extension to handle more than one URL

26
Page Segmentation Algorithm
  • An algorithm similar to VIPS (VIsion-based Page
    Segmentation)
  • Web pages are multi-topic
  • Segments pages using visual cues like color,
    images, font sizes, etc.
  • Proceeds in three phases
  • Outputs a content structure tree whose nodes are
    regions in page having coherent content

27
VIPS
  • Visual Block Extraction
  • A visual block is a single unit of semantically
    coherent content
  • Given a DOM node, check it for coherency
  • Iteratively replace a node by its children until
    permissible coherency is achieved
  • Visual Separator Detection
  • Separators between the visual blocks are
    retrieved
  • Based on differences in font sizes, color, and
    the presence of HR tags
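The Visual Block Extraction step can be sketched as a recursion over the DOM tree; the `is_coherent` predicate stands in for the visual-cue coherency check, and the `Node` class is a minimal assumed DOM stand-in:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Minimal stand-in for a DOM node.
    tag: str
    children: list = field(default_factory=list)

def extract_blocks(node, is_coherent):
    # A coherent node (or a leaf) becomes one visual block; otherwise the
    # node is replaced by its children and extraction recurses.
    if is_coherent(node) or not node.children:
        return [node]
    blocks = []
    for child in node.children:
        blocks.extend(extract_blocks(child, is_coherent))
    return blocks
```

Incoherent subtrees are split until every emitted block passes the coherency test.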

28
VIPS..contd.
  • Content Structure Construction
  • Merge blocks on either side of a separator into
    regions
  • Check whether each region's degree of coherency
    meets the threshold
  • If not, repeat the process within that region

29
References
  • Krishna Bharat and Monika R. Henzinger. Improved
    algorithms for topic distillation in a
    hyperlinked environment. In Proceedings of
    SIGIR-98, 21st ACM International Conference on
    Research and Development in Information
    Retrieval, pages 104-111, Melbourne, AU, 1998.
  • S. Chakrabarti, K. Punera, and M. Subramanyam.
    Accelerated focused crawling through online
    relevance feedback. In WWW, Hawaii. ACM, May
    2002.
  • Soumen Chakrabarti, Martin van den Berg, and
    Byron Dom. Focused crawling: A new approach to
    topic-specific Web resource discovery. Computer
    Networks, 31(11-16):1623-1640, 1999.

30
References (contd.)
  • Jeffrey Dean and Monika R. Henzinger. Finding
    related pages in the World Wide Web. Computer
    Networks, 31(11-16):1467-1479, 1999.
  • Jon M. Kleinberg. Authoritative sources in a
    hyperlinked environment. Journal of the ACM,
    46(5):604-632, 1999.
  • H. Yu, J. Han, and K. C.-C. Chang. PEBL: Positive
    example-based learning for web page
    classification using SVM, 2002.
  • Shipeng Yu, Deng Cai, Ji-Rong Wen, and Wei-Ying
    Ma. Improving pseudo-relevance feedback in web
    information retrieval using web page
    segmentation. In Proceedings of WWW2003, The
    Twelfth International World Wide Web Conference,
    Budapest, Hungary, 2003.