MedIEQ Web Spider and Link scoring component

University of Economics Prague (UEP)

1
MedIEQ Web Spider and Link scoring component
  • Marek Ruzicka
  • Project meeting
  • TKK, Helsinki, Finland
  • 23 October 2006

2
Presentation overview
  • Navigation component (Spider)
  • Link scoring component
  • Current state
  • Next steps

3
Navigation Component (Spider)
  • Input: list of URLs from the Crawler
  • Spidering process
      • Retrieve the web page and convert its encoding to UTF-8
      • Extract all links on the page
      • Put internal links in the link queue
      • Repeat the process for each link in the queue
  • Configuration of the spider
      • Supported/activated link types
      • Supported/activated file (web page) types
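
The spidering loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the MedIEQ implementation: the names `spider`, `fetch`, `allowed_suffixes` and `max_pages` are assumptions, only `<a href>` links are treated as an activated link type, and "internal" is taken to mean "same host as the start URL".

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def to_utf8(raw, declared_encoding):
    """Decode bytes with the declared encoding and re-encode them as UTF-8."""
    text = raw.decode(declared_encoding or "utf-8", errors="replace")
    return text.encode("utf-8")


def spider(start_url, fetch, allowed_suffixes=(".html", ".htm", "/"), max_pages=100):
    """Breadth-first crawl restricted to the host of start_url.

    fetch(url) is a caller-supplied (hypothetical) function that returns
    (raw_bytes, encoding) or raises OSError when the page is unreachable.
    allowed_suffixes stands in for the "supported file types" setting.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            raw, encoding = fetch(url)
        except OSError:
            continue  # unreachable page: skip it in this sketch
        html = to_utf8(raw, encoding)
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html.decode("utf-8"))
        for href in parser.hrefs:
            link, _fragment = urldefrag(urljoin(url, href))
            parts = urlparse(link)
            internal = parts.netloc == host
            supported = parts.path == "" or parts.path.endswith(allowed_suffixes)
            if internal and supported and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` in as a parameter keeps the loop independent of any particular HTTP library; a real spider would plug in its own download routine there.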

[Diagram: the SPIDER extracts links and visits internal links; the Content Classification Component yields positively classified pages.]

4
Navigation Component (Spider)
  • Storing web pages
      • The content of each page is passed to the Content Classification Component (CCC)
      • Positively classified pages are stored locally for information extraction (IE)
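
A rough sketch of this storing step, under stated assumptions: `classify` is a hypothetical stand-in for the Content Classification Component, and the file naming scheme (a SHA-1 hash of the URL) is invented here for illustration only.

```python
import hashlib
from pathlib import Path


def store_if_relevant(url, html_utf8, classify, out_dir):
    """Pass page content to the classifier; save the page only if accepted.

    classify(text) is a placeholder for the CCC and returns True for a
    positively classified page. Returns the stored path, or None.
    """
    text = html_utf8.decode("utf-8")
    if not classify(text):
        return None
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Name the file after a hash of the URL so that a repeated run
    # overwrites the same file instead of creating duplicates.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    path = out_dir / name
    path.write_bytes(html_utf8)
    return path
```

For example, `store_if_relevant(url, page, lambda t: "contact" in t, "pages")` would keep only pages that mention "contact".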

[Diagram: the SPIDER extracts links and visits internal links; the Content Classification Component yields positively classified pages.]

5
Link Scoring Component
  • Link Scoring component
      • Extracts link objects (links including link text, surrounding text, alt text, etc.)
      • Consists of several modules (each specialized for a given type of content, e.g. contact pages)
      • If at least one module scores a link positively, the spider explores it
  • Link scoring modules
      • Created by ML or heuristics
      • Tested with heuristics so far
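
The "at least one module scores positively" rule can be illustrated with a short sketch. The `LinkObject` fields follow the slide; the class name, the `contact_module` heuristic and its keyword list are hypothetical examples, not the project's actual modules.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class LinkObject:
    """A link together with the context the slide lists."""
    url: str
    link_text: str = ""
    surrounding_text: str = ""
    alt_text: str = ""


# A link scoring module maps a link object to True (positive) or False.
ScoringModule = Callable[[LinkObject], bool]


def contact_module(link: LinkObject) -> bool:
    """Hypothetical heuristic module specialized for contact pages."""
    haystack = " ".join(
        [link.url, link.link_text, link.surrounding_text, link.alt_text]
    ).lower()
    return any(word in haystack for word in ("contact", "about us"))


def links_to_explore(
    links: Iterable[LinkObject], modules: List[ScoringModule]
) -> List[LinkObject]:
    """Keep a link if at least one module scores it positively."""
    return [link for link in links if any(module(link) for module in modules)]
```

Because each module is just a callable, a module trained by ML and a hand-written heuristic can sit in the same list.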

[Diagram: the SPIDER extracts link objects; the Link Scoring Component returns positively scored links; the Content Classification Component yields positively classified pages.]

6
Current state
  • Current state
      • Spider successfully retrieves about 95 web pages
      • List of unreachable pages is stored for the next run
      • Spider runs multi-threaded, one thread per web site
  • Spidering experience
      • The right number of threads depends strongly on hardware and network capacity
      • Common spider traps are usually harmless
      • There are still spider-killer pages in the medical domain
      • LSMs (link scoring modules) based on heuristics do not give good results
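
One way to combine "one thread per web site" with a tunable thread count and a persisted list of unreachable pages is sketched below. It is an assumption-laden illustration: `crawl_sites`, `load_unreachable`, `max_threads` and the JSON file format are all invented here, and `crawl_site` stands for whatever routine spiders a single site.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def crawl_sites(start_urls, crawl_site, unreachable_file, max_threads=8):
    """Crawl each site in its own worker thread and record failures.

    crawl_site(url) is a caller-supplied function that spiders one site and
    returns the list of URLs it could not reach. max_threads caps how many
    sites run at once; a suitable value depends on hardware and network.
    """
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        results = list(pool.map(crawl_site, start_urls))
    unreachable = sorted(url for failed in results for url in failed)
    # Persist the failures so that a later run can retry them.
    Path(unreachable_file).write_text(
        json.dumps(unreachable, indent=2), encoding="utf-8"
    )
    return unreachable


def load_unreachable(unreachable_file):
    """Return the URLs saved by a previous run, or an empty list."""
    path = Path(unreachable_file)
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))
```

Each site is still handled by a single thread, so per-site state such as the link queue needs no locking; only the number of sites crawled in parallel changes with `max_threads`.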

7
Next steps
  • Spider
      • Examine influence of spider traps on the spider
      • Avoid spider-killer pages
      • Enable Spider configuration by web interface
  • Link scoring component
      • Train link scoring modules using ML
      • Enable LSC configuration by web interface
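
The slides do not say how spider traps or spider-killer pages would be detected. Purely as an illustration of one common kind of guard, the sketch below flags URLs that are very long, very deep, or that repeat the same path segment. The function name and all three thresholds are assumptions, not values from the project.

```python
from urllib.parse import urlparse


def looks_like_trap(url, max_depth=8, max_repeats=2, max_length=200):
    """Heuristic check for URL shapes typical of spider traps.

    All thresholds are illustrative defaults, not values from the slides.
    """
    if len(url) > max_length:
        return True
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    # The same path segment repeated several times often signals a link loop.
    return any(segments.count(s) > max_repeats for s in set(segments))
```

A spider could call such a check before putting a link in the queue and log the rejected URLs, which would also give data for examining how much traps actually matter.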