RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Slides: 16
Provided by: chinalan

1
RoadRunner: Towards Automatic Data Extraction
from Large Web Sites
Valter Crescenzi, Giansalvatore Mecca, Paola
Merialdo
2
Aims
  • Automatic wrapper generation
  • No user intervention required
  • No prior knowledge of data schema

3
Premise
  • Web pages that have a heavy data component are
    generated using scripts, XSLT or similar
    techniques from data retrieved using one or more
    queries from some sort of database.
  • All pages generated using the same script and
    queries are said to belong to the same class of
    pages.

4
The Approach
  • The aim is to compare web pages belonging to the
    same class in order to infer both the data schema
    and the wrapper that extracts that data.
  • The wrapper is a regular grammar, or more
    particularly a union-free regular expression.
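
As an illustration of what "union-free" means, here is a minimal sketch using Python's `re` module. The page fragments, field names and pattern are invented for this example and are not taken from the presentation. The pattern uses only concatenation, an optional group `(...)?` and an iterator `(...)+`, and never the alternation operator `|`.

```python
import re

# Hypothetical wrapper: concatenation, one optional group and one iterator.
# No alternation ("|") appears anywhere, so the expression is union-free.
WRAPPER = re.compile(
    r"<h1>(?P<title>[^<]+)</h1>"
    r"(?:<em>(?P<note>[^<]+)</em>)?"   # optional field
    r"<ul>(?:<li>[^<]+</li>)+</ul>"    # repeated field
)
ITEM = re.compile(r"<li>([^<]+)</li>")


def extract(page):
    """Return the extracted fields, or None if the page is outside the class."""
    match = WRAPPER.fullmatch(page)
    if match is None:
        return None
    return {
        "title": match.group("title"),
        "note": match.group("note"),   # None when the optional is absent
        "items": ITEM.findall(page),
    }


page_a = "<h1>Book A</h1><ul><li>John</li><li>Mary</li></ul>"
page_b = "<h1>Book B</h1><em>New</em><ul><li>Ann</li></ul>"
```

Both sample pages match the same wrapper even though one has the optional note and the other has two list items.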

5
The Algorithm
  • The HTML is cleaned up so that it conforms to the
    XHTML specification; in particular, all tags must
    be properly closed and nested.
  • The clean code is then parsed into a sequence of
    tokens, each of which is either an HTML tag or a
    string.
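
The tokenising step can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions: the input is assumed to be already cleaned (the parser does not repair malformed HTML), the `(kind, value)` token representation is a choice made for this sketch, and tag attributes are simply dropped.

```python
from html.parser import HTMLParser


class Tokenizer(HTMLParser):
    """Collects a flat list of ("tag", ...) and ("text", ...) tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored in this simplified sketch.
        self.tokens.append(("tag", f"<{tag}>"))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", f"</{tag}>"))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only strings between tags
            self.tokens.append(("text", text))


def tokenize(html):
    parser = Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

For example, `tokenize("<ul><li>John</li></ul>")` yields the tags `<ul>`, `<li>`, the string `John`, and the closing tags `</li>`, `</ul>`, in that order.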

6
The Algorithm cont...
  • We work on two objects at a time.
  • One is the wrapper, which is just a regular
    expression.
  • The other is a page.

7
The Algorithm cont...
  • The initial wrapper is just another of the sample
    pages.
  • There will be mismatches between the wrapper and
    the current page. It is these mismatches that
    drive the algorithm.
  • There are two kinds of mismatch: string
    mismatches and tag mismatches.

8
String Mismatches
  • A string mismatch occurs when the same position
    in the wrapper and the page contains different
    strings.
  • These differing strings are data, so the string
    in the wrapper is replaced by a PCDATA token
    that matches any data string.
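
A minimal sketch of this generalisation step follows. It assumes tokens are `(kind, value)` pairs and that the wrapper and page already have identical tag structure; both are assumptions of this example, and the `PCDATA` marker tuple is a hypothetical representation.

```python
# Hypothetical marker for "any data string" in the wrapper.
PCDATA = ("pcdata", "PCDATA")


def generalize_strings(wrapper, page):
    """Replace differing strings with PCDATA; tags must already agree."""
    if len(wrapper) != len(page):
        raise ValueError("this sketch handles string mismatches only")
    result = []
    for w, p in zip(wrapper, page):
        if w == p:
            result.append(w)            # identical token: keep as template
        elif w[0] in ("text", "pcdata") and p[0] == "text":
            result.append(PCDATA)       # differing strings are data
        else:
            raise ValueError(f"tag mismatch: {w} vs {p}")
    return result
```

Strings that are the same on both pages, such as a label like "Name:", stay in the wrapper as fixed template text, while strings that differ become data fields.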

9
Tag Mismatches
  • A tag mismatch occurs when the same position in
    the wrapper and the page contains either different
    HTML tags, or a tag in one and a string in the other.
  • These mismatches are used to infer the data
    schema.
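
The two kinds of mismatch can be told apart with a simple scan. The sketch below assumes tokens are `(kind, value)` pairs with kind `"tag"` or `"text"`. Treating a length difference as a tag mismatch is an assumption of this example; the slides do not cover that case.

```python
def first_mismatch(wrapper, page):
    """Return (index, kind) of the first mismatch, or None if none exists."""
    for i, (w, p) in enumerate(zip(wrapper, page)):
        if w == p:
            continue
        if w[0] == "text" and p[0] == "text":
            return i, "string"   # two different strings at the same position
        return i, "tag"          # different tags, or a tag against a string
    if len(wrapper) != len(page):
        # Assumption: one sequence ending early counts as a tag mismatch.
        return min(len(wrapper), len(page)), "tag"
    return None
```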

10
Finding Iterators
  • Having found a tag mismatch, we first look for an
    iterator, that is, a repeating group.
  • The end tag of the group is the tag immediately
    before the mismatch.
  • Because we have well-formed HTML, we just need to
    find the corresponding start tag.
  • Generalise the wrapper.
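
Locating the boundaries of a candidate repeating group can be sketched as a backward scan with a nesting counter. This covers only the boundary search described above, not the subsequent check that the group really repeats. The function name `find_square` and the `(kind, value)` token format are assumptions of this example.

```python
def find_square(tokens, mismatch):
    """Return (start, end) so that tokens[start:end] is the candidate
    repeating group ending just before the mismatch, or None."""
    if mismatch < 1:
        return None
    kind, value = tokens[mismatch - 1]
    if kind != "tag" or not value.startswith("</"):
        return None                       # no end tag before the mismatch
    name = value[2:-1]
    depth = 0
    # Walk backwards, counting nested elements with the same name.
    for i in range(mismatch - 1, -1, -1):
        k, v = tokens[i]
        if k != "tag":
            continue
        if v == f"</{name}>":
            depth += 1
        elif v == f"<{name}>":
            depth -= 1
            if depth == 0:
                return i, mismatch        # matching start tag found
    return None
```

The nesting counter is what makes the well-formedness assumption useful: without properly nested tags, the first start tag found need not be the matching one.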

11
Finding Optionals
  • If we didn't find an iterator, then we are
    dealing with optional data.
  • Either the wrapper or the page contains a
    sequence of tokens that is not present in the
    other.
  • Try to find where the wrapper and page start to
    match again.
  • Generalise the wrapper.
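
The search for the point where the two sequences match again can be sketched as follows. This is a simplified illustration: it takes the first later occurrence of the other side's current token on each side and prefers the shorter skipped span, which is a heuristic chosen for this example rather than something stated in the slides.

```python
def find_optional(wrapper, page, i):
    """Return (side, start, end) for the shortest span to skip at a
    mismatch at index i, or None. `side` says which sequence holds the
    optional tokens."""
    best = None
    for side, skipped, other in (("wrapper", wrapper, page),
                                 ("page", page, wrapper)):
        if i >= len(other):
            continue
        target = other[i]                 # token we want to resume on
        for j in range(i + 1, len(skipped)):
            if skipped[j] == target:
                if best is None or j - i < best[2] - best[1]:
                    best = (side, i, j)
                break                     # first resync point on this side
    return best
```

The returned span is the block that would be marked as optional when the wrapper is generalised.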

12
More Complex Cases
  • We have found the cause of the mismatch, and
    identified either an iterator or an optional.
  • This is not necessarily enough. We must now
    recursively examine the block of code we have
    found as it may contain more mismatches and
    therefore iterators or optionals.

13
More Complex Cases cont...
  • Identifying the boundaries of the mismatch may
    not be simple either.
  • We may have to choose between possibilities and
    backtrack if a choice does not work. That is, we
    have to search a tree of possibilities.
  • This tree may be very large if the web pages are
    large and complicated.
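
The need for backtracking can be illustrated with a deliberately small sketch. It assumes tokens are plain strings and considers only blocks that the page has and the wrapper lacks. At each mismatch, every later occurrence of the expected token is a candidate end boundary, and a candidate is abandoned if the rest of the page cannot be aligned.

```python
def align(wrapper, page, i=0, j=0):
    """Return the page spans (start, end) to skip so that the remaining
    page tokens equal the wrapper, or None if no choice works."""
    if i == len(wrapper):
        return [] if j == len(page) else None   # leftover tokens: dead end
    if j == len(page):
        return None
    if wrapper[i] == page[j]:
        return align(wrapper, page, i + 1, j + 1)
    # Mismatch: try each later occurrence of wrapper[i] as the boundary.
    for k in range(j + 1, len(page)):
        if page[k] == wrapper[i]:
            rest = align(wrapper, page, i, k)
            if rest is not None:
                return [(j, k)] + rest
    return None                                 # all candidates failed


wrapper = ["<p>", "t", "</p>", "<hr>"]
page = ["<p>", "t", "</p>", "<b>", "x", "</b>", "<hr>",
        "<i>", "y", "</i>", "<hr>"]
```

Here the first candidate boundary (the first `<hr>` in the page) leaves unmatched tokens behind, so the search backtracks and accepts the second `<hr>` instead. Each additional candidate multiplies the number of branches, which is why the tree grows quickly on large pages.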

14
Results
  • The authors tried the algorithm on a number of
    sites:
  • amazon.com
  • buy.com
  • rpmfind.net
  • uefa.com

15
Results cont...
  • The algorithm performed very well.
  • If the algorithm could handle the class then it
    was able to identify the schema and extract the
    data with a very high level of accuracy.
  • The failures were usually complete: they occurred
    on page structures that were too complex for
    union-free regular expressions to describe.