RoadRunner: Towards Automatic Data Extraction from Large Web Sites

About This Presentation

Slides: 16

Provided by: chinalan

Transcript and Presenter's Notes

1
RoadRunner: Towards Automatic Data Extraction
from Large Web Sites
Valter Crescenzi, Giansalvatore Mecca, Paola
Merialdo
2
Aims

Automatic wrapper generation
No user intervention required
No prior knowledge of data schema

3
Premise

Web pages that have a heavy data component are
generated using scripts, XSLT or similar
techniques from data retrieved using one or more
queries from some sort of database.
All pages generated using the same script and
queries are said to belong to the same class of
pages.

4
The Approach

The aim is to use comparisons between web pages
belonging to the same class to infer both the
data schema and the wrapper that extracts that
data.
The wrapper is a regular grammar, or more
particularly a union-free regular expression,
that is, one that uses no alternation operator.
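
To make the idea of a union-free wrapper concrete, here is a
minimal sketch of one way to represent and apply such a wrapper.
The nested-list encoding, the "opt" and "plus" markers, and the
function names are assumptions made for illustration, not the
representation used by the authors.

```python
# A wrapper is a list of items. Each item is one of:
#   "<tag>"          a literal tag token
#   "PCDATA"         a placeholder that matches any string token
#   ("opt", [...])   an optional sub-sequence
#   ("plus", [...])  a sub-sequence repeated one or more times
# There is deliberately no alternation (union) construct.

def match_item(item, tokens, pos):
    """Return the set of positions reachable after matching one item at pos."""
    if isinstance(item, tuple):
        kind, body = item
        if kind == "opt":
            # Either skip the body or match it once.
            return {pos} | match_seq(body, tokens, pos)
        if kind == "plus":
            reached, frontier = set(), {pos}
            while frontier:
                new = set()
                for p in frontier:
                    new |= match_seq(body, tokens, p)
                new -= reached          # keep only positions not seen before
                reached |= new
                frontier = new
            return reached
        raise ValueError(f"unknown construct: {kind}")
    if pos < len(tokens):
        if item == "PCDATA":
            ok = not tokens[pos].startswith("<")
        else:
            ok = item == tokens[pos]
        if ok:
            return {pos + 1}
    return set()

def match_seq(items, tokens, pos):
    """Return the set of positions reachable after matching all items from pos."""
    positions = {pos}
    for item in items:
        nxt = set()
        for p in positions:
            nxt |= match_item(item, tokens, p)
        positions = nxt
        if not positions:
            break
    return positions

def matches(wrapper, tokens):
    """True if the wrapper consumes the whole token sequence."""
    return len(tokens) in match_seq(wrapper, tokens, 0)
```

For example, the wrapper
["<ul>", ("plus", ["<li>", "PCDATA", "</li>"]), "</ul>"] accepts
any list with one or more text items, and rejects an empty list.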

5
The Algorithm

The HTML is first cleaned up so that it conforms
to the XHTML specification, in particular so that
all tags are properly closed and nested.
The clean code is then parsed into a sequence of
tokens, each of which is either an HTML tag or a
string.
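
The tokenising step can be sketched with Python's standard
html.parser module. This is only an illustration: it assumes the
input has already been cleaned into well-formed markup, it drops
tag attributes, and the (kind, value) token shape and the names
Tokenizer and tokenize are choices made here, not taken from the
presentation.

```python
from html.parser import HTMLParser

class Tokenizer(HTMLParser):
    """Collects tags and non-empty text runs as a flat list of tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored in this sketch.
        self.tokens.append(("tag", f"<{tag}>"))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", f"</{tag}>"))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace between tags
            self.tokens.append(("text", text))

def tokenize(html):
    """Return the page as a list of ("tag", ...) and ("text", ...) tokens."""
    parser = Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

Calling tokenize("<html><b>John Smith</b></html>") yields two
opening tags, one text token and two closing tags, in document
order.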

6
The Algorithm cont...

We work on two objects at a time.
One is the wrapper, which is just a regular
expression.
The other is a page.

7
The Algorithm cont...

The initial wrapper is simply one of the sample
pages.
Comparing the wrapper with the current page will
reveal mismatches. It is these mismatches that
drive the algorithm.
There are two kinds of mismatch: string
mismatches and tag mismatches.

8
String Mismatches

A string mismatch occurs when the same position
within the wrapper and the page contains
different strings.
These differing strings are data, so the string
in the wrapper is replaced by a token PCDATA
that matches any data string.
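
The replacement of differing strings can be sketched as follows.
The sketch assumes the two token lists differ only in their
strings, so they have the same length and line up position by
position. The (kind, value) token shape and the name
generalise_strings are assumptions made for illustration.

```python
PCDATA = ("pcdata", "PCDATA")  # placeholder that stands for any data string

def generalise_strings(wrapper, page):
    """Replace every string that differs between wrapper and page by PCDATA.

    Both arguments are lists of ("tag", value) / ("text", value) tokens.
    Raises ValueError if the lists differ in anything other than strings.
    """
    if len(wrapper) != len(page):
        raise ValueError("token lists differ in length: not a pure string mismatch")
    result = []
    for w, p in zip(wrapper, page):
        if w == p:
            result.append(w)                 # identical tokens stay as they are
        elif w == PCDATA and p[0] == "text":
            result.append(PCDATA)            # field was already generalised
        elif w[0] == "text" and p[0] == "text":
            result.append(PCDATA)            # string mismatch: a data field
        else:
            raise ValueError(f"tag mismatch between {w} and {p}")
    return result
```

Comparing a page containing "John Smith" with one containing
"Paul Jones" at the same position leaves the surrounding tags
untouched and turns that one position into PCDATA.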

9
Tag Mismatches

A tag mismatch occurs when the same position
within the wrapper and the page contains either
different HTML tags, or a tag in one and a string
in the other.
These mismatches are used to infer the data
schema.
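
The distinction between the two kinds of mismatch can be sketched
as a small classifier. It is deliberately simplified: tokens are
plain strings, a tag is anything wrapped in angle brackets, and
the two sequences are compared position by position, which only
holds until an iterator or optional has been introduced. The
names is_tag and first_mismatch are hypothetical.

```python
def is_tag(token):
    """Treat anything of the form <...> as a tag; everything else is a string."""
    return token.startswith("<") and token.endswith(">")

def first_mismatch(wrapper, page):
    """Return (position, kind) of the first mismatch, or None if there is none.

    kind is "string" when both tokens are strings, otherwise "tag".
    """
    for pos, (w, p) in enumerate(zip(wrapper, page)):
        if w == p:
            continue
        if not is_tag(w) and not is_tag(p):
            if w == "PCDATA":
                continue  # an already generalised field matches any string
            return pos, "string"
        return pos, "tag"
    if len(wrapper) != len(page):
        # Assumption: one side running out early is reported as a tag mismatch.
        return min(len(wrapper), len(page)), "tag"
    return None
```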

10
Finding Iterators

Having found a tag mismatch, we first look for an
iterator, that is, a repeating group.
The end tag of the group is the tag immediately
before the mismatch.
Because we have well-formed HTML, we just need to
find the corresponding start tag.
The wrapper is then generalised to allow the
group to repeat.
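
Locating the candidate repeating group can be sketched as a
backwards scan for the start tag that balances the end tag just
before the mismatch. The sketch assumes tokens are plain strings
and tags carry no attributes. The name find_repeating_group is
hypothetical, and the check that the group really does repeat is
left out.

```python
def find_repeating_group(tokens, mismatch_pos):
    """Return (start, end) so that tokens[start:end] is the candidate group.

    The group ends just before mismatch_pos. Returns None when the token
    before the mismatch is not an end tag or has no matching start tag.
    """
    if mismatch_pos == 0:
        return None
    end_tag = tokens[mismatch_pos - 1]
    if not end_tag.startswith("</"):
        return None
    start_tag = "<" + end_tag[2:]
    depth = 0
    # Walk backwards, counting nested elements with the same name.
    for i in range(mismatch_pos - 1, -1, -1):
        if tokens[i] == end_tag:
            depth += 1
        elif tokens[i] == start_tag:
            depth -= 1
            if depth == 0:
                return i, mismatch_pos
    return None
```

With tokens ["<ul>", "<li>", "A", "</li>", "</ul>"] and a mismatch
at position 4, the candidate group is the slice from position 1
up to (not including) position 4, that is, one complete list item.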

11
Finding Optionals

If we did not find an iterator, then we are
dealing with optional data.
Either the wrapper or the page contains a
sequence of tokens that is not present in the
other.
We try to find the point where the wrapper and
the page start to match again.
The wrapper is then generalised to mark the
skipped sequence as optional.
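
The search for the point where the two sequences line up again
can be sketched as a cross-search: look for the wrapper's current
token further along in the page, and failing that, look for the
page's current token further along in the wrapper. Tokens are
plain strings and the name find_optional is hypothetical. Taking
the first occurrence is a simplification, since a later
occurrence may be the correct one.

```python
def find_optional(wrapper, page, i, j):
    """Locate the optional block after a mismatch at wrapper[i] / page[j].

    Returns ("page", j, k) if page[j:k] is missing from the wrapper,
    ("wrapper", i, k) if wrapper[i:k] is missing from the page,
    or None if neither token reappears on the other side.
    """
    # Case 1: the page has extra tokens; wrapper[i] reappears later in the page.
    try:
        k = page.index(wrapper[i], j + 1)
        return ("page", j, k)
    except ValueError:
        pass
    # Case 2: the wrapper has extra tokens; page[j] reappears later in the wrapper.
    try:
        k = wrapper.index(page[j], i + 1)
        return ("wrapper", i, k)
    except ValueError:
        return None
```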

12
More Complex Cases

We have found the cause of the mismatch, and
identified either an iterator or an optional.
This is not necessarily enough. We must now
recursively examine the block of tokens we have
found, as it may contain more mismatches and
therefore further iterators or optionals.

13
More Complex Cases cont...

Identifying the boundaries of the mismatch may
not be simple either.
We may have to choose between possibilities and
backtrack if a choice does not work. That is, we
have to search a tree of possibilities.
This tree may be very large if the web pages are
large and complicated.

14
Results

The authors tried the algorithm on a number of
sites:
amazon.com
buy.com
rpmfind.net
uefa.com

15
Results cont...

The algorithm performed very well.
If the algorithm could handle the class then it
was able to identify the schema and extract the
data with a very high level of accuracy.
The failures were usually complete: they occurred
on page structures that were too complex for
union-free regular expressions to describe.
