Crawling the Web Forums

Provided by: Ank65

1
Crawling the Web Forums
  • By Ankush Goel
  • Instructor: Prof. Gail Kaiser
  • Spring 2009

2
Web Crawling
  • Automated traversal of the web to collect useful,
    informative pages effectively and efficiently.
  • Gathers information about the link structure
    interconnecting the informative pages.
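
As a rough illustration of the traversal described on this slide, the Python sketch below crawls a small in-memory link graph breadth-first, recording both the pages visited and the links between them. The `TOY_WEB` dictionary, its page names, and the `crawl` function are hypothetical stand-ins, not part of the presentation; a real crawler would fetch pages over HTTP and parse links out of the HTML.

```python
from collections import deque

# Hypothetical toy "web": page name -> pages it links to.
TOY_WEB = {
    "index": ["board", "login"],
    "board": ["thread-1", "thread-2", "login"],
    "thread-1": ["board"],
    "thread-2": ["board", "thread-1"],
    "login": [],
}

def crawl(web, seed):
    """Breadth-first traversal from `seed`.

    Returns the visit order and the list of (source, target) links seen.
    """
    frontier = deque([seed])   # pages waiting to be fetched
    visited = {seed}           # pages already queued or fetched
    order = []
    edges = []
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for target in web.get(page, []):
            edges.append((page, target))   # record link structure
            if target not in visited:
                visited.add(target)
                frontier.append(target)
    return order, edges

order, edges = crawl(TOY_WEB, "index")
print(order)
print(len(edges), "links recorded")
```

The frontier queue and the visited set are the two pieces of state any such traversal needs; everything else (politeness, parsing, storage) is left out of this sketch.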

3-8
Generic Crawler Architecture
9
Web Forums
  • A web application designed to manage user-created
    content.
  • An online discussion area where anyone can discuss
    their favorite topics.

10
Why a Generic Crawler Fails on Web Forums
  • Forums contain many functional links.
  • It cannot index the relationships among post pages.
  • It avoids crawling deep inside a web site, so deep
    pages are missed.
  • As a result, it is inefficient and ineffective.
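
The points above can be made concrete with a toy experiment. The sketch below runs a depth-limited breadth-first crawl over a made-up forum graph and counts how many of the fetched pages actually carry posts. Everything here is an assumption for illustration: the `FORUM` graph, the URL names, the rule that only `thread` pages hold posts, and the depth limit of 2.

```python
from collections import deque

# Hypothetical toy forum: page -> outgoing links (names are illustrative).
FORUM = {
    "home": ["board", "login", "register"],
    "board": ["thread?page=1", "login", "board?sort=date"],
    "board?sort=date": ["thread?page=1", "login"],
    "thread?page=1": ["thread?page=2", "reply", "login"],
    "thread?page=2": ["thread?page=3", "reply", "login"],
    "thread?page=3": ["reply", "login"],
    "login": [],
    "register": [],
    "reply": [],
}

def is_post_page(page):
    # Assumption: only thread pages carry user posts.
    return page.startswith("thread")

def crawl_with_depth_limit(web, seed, max_depth):
    """Breadth-first crawl that does not follow links beyond `max_depth`."""
    frontier = deque([(seed, 0)])
    seen = {seed}
    fetched = []
    while frontier:
        page, depth = frontier.popleft()
        fetched.append(page)
        if depth == max_depth:
            continue  # do not expand links any deeper
        for target in web[page]:
            if target not in seen:
                seen.add(target)
                frontier.append((target, depth + 1))
    return fetched

fetched = crawl_with_depth_limit(FORUM, "home", max_depth=2)
posts = [p for p in fetched if is_post_page(p)]
print(f"fetched {len(fetched)} pages, {len(posts)} with posts")
```

In this toy graph most fetches go to functional pages (login, register, a re-sorted board view), and the later pages of the thread are never reached.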

11
List-of-Post Page
12
iRobot
  • A tool for crawling web forums.
  • Intelligent enough to understand the structure of
    a forum before selecting traversal paths.
  • It addresses two issues: identifying important
    pages and identifying important links.

13
How It Works
  • Pre-samples a few pages to discover their
    repetitive regions.
  • Groups the pre-sampled pages into clusters based
    on their repetitive regions; each cluster can be
    considered a vertex in the sitemap.
  • Selects an optimal traversal path through the
    sitemap.
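
The clustering step can be sketched as follows. This is a simplified illustration, not the tool's actual algorithm: it assumes each sampled page has already been reduced to a set of repetitive-region labels, and it places two pages in the same cluster only when those sets are identical. The `SAMPLED` data, the region labels, and `build_sitemap` are all hypothetical. Path selection (the third step) is not covered here.

```python
from collections import defaultdict

# Hypothetical pre-sampled pages:
# URL -> (repetitive-region labels found on the page, outgoing links).
SAMPLED = {
    "/board/1": ({"thread-list", "nav"}, ["/thread/10", "/thread/11", "/login"]),
    "/board/2": ({"thread-list", "nav"}, ["/thread/11", "/login"]),
    "/thread/10": ({"post-list", "nav"}, ["/board/1", "/login"]),
    "/thread/11": ({"post-list", "nav"}, ["/board/2", "/login"]),
    "/login": ({"login-form"}, []),
}

def build_sitemap(pages):
    """Group pages by region signature and link the resulting clusters."""
    clusters = defaultdict(list)
    for url, (regions, _links) in pages.items():
        # Simplifying assumption: identical region sets -> same cluster.
        signature = tuple(sorted(regions))
        clusters[signature].append(url)

    cluster_of = {url: sig for sig, urls in clusters.items() for url in urls}

    # A sitemap edge exists when any page in one cluster links into another.
    edges = set()
    for url, (_regions, links) in pages.items():
        for target in links:
            if target in cluster_of:
                edges.add((cluster_of[url], cluster_of[target]))
    return dict(clusters), edges

clusters, edges = build_sitemap(SAMPLED)
for signature, urls in clusters.items():
    print(signature, "->", sorted(urls))
```

Each cluster key plays the role of a sitemap vertex, and `edges` holds the vertex-to-vertex links a traversal path would be chosen from.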

14
Repetitive Regions
15
Information Estimation Criteria
  • Important Pages
  • The more pages that share a similar structure
    (repetitive regions) with the page under
    consideration, the more important that page
    probably is.
  • An important page containing valuable information
    is probably larger than an uninformative page
    such as a login page.
  • The content of an informative page is more
    diverse than that of an uninformative page: a
    post page contains content created by thousands
    of different users and is far more diverse than
    an automatically generated duplicate error page.
  • Important Links
  • Location: links at similar locations within
    similar repetitive regions probably serve similar
    functions.
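
The three page criteria (number of structurally similar pages, page size, content diversity) can be turned into a toy ranking, sketched below. The slide gives no formula, so the feature definitions and the plain product in `naive_score` are assumptions made only for illustration, as are the `CLUSTERS` sample texts and all function names.

```python
def cluster_features(texts):
    """Compute the three criteria for one cluster of page texts."""
    words = [w for t in texts for w in t.lower().split()]
    return {
        # Criterion 1: how many pages share this structure.
        "pages": len(texts),
        # Criterion 2: average page size, in characters.
        "avg_size": sum(len(t) for t in texts) / len(texts),
        # Criterion 3: share of distinct words across the cluster.
        "diversity": len(set(words)) / len(words) if words else 0.0,
    }

def naive_score(features):
    # Assumption: a plain product; the slides do not specify a formula.
    return features["pages"] * features["avg_size"] * features["diversity"]

# Hypothetical clusters of page texts.
CLUSTERS = {
    "post": [
        "alice asks how to parse html",
        "bob suggests a streaming parser",
        "carol reports a crawler trap",
    ],
    "error": ["page not found", "page not found"],
    "login": ["please log in"],
}

ranked = sorted(
    CLUSTERS,
    key=lambda name: naive_score(cluster_features(CLUSTERS[name])),
    reverse=True,
)
print(ranked)
```

With this sample data the post cluster ranks first because it is larger and more varied, while the duplicated error pages score low on diversity.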

16-20
Flowchart
21-22
Comparison

23
Thank You