RoadRunner: Towards Automatic Data Extraction from Large Web Sites

1
RoadRunner: Towards Automatic Data Extraction
from Large Web Sites
  • Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo
  • Presented by Lei Lei

2
Outline
  • Problem
  • Theoretical Background
  • Matching Technique
  • Examples and Experimental Results
  • Comparison with other works
  • Assessment

3
Introduction and Problems
  • Fast-growing amount of information
  • Little of it is machine-understandable
  • Wrappers and their key problem

4
Previous Work
  • Data-intensive Web sites
  • Grammar inference
  • Gold's work
  • The problem of learning from positive examples alone
  • Complexity of learning

5
Common Features of Wrappers
  • Extra information from user interaction
  • A priori knowledge
  • One HTML page at a time

6
Background
  • Nested types
  • Generate a Union-free Regular Expression (UFRE)
  • Locate the least upper bound in the RE lattice
    to generate a wrapper
  • This reduces to finding the least upper bound of two UFREs
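As a rough illustration of treating a wrapper as a union-free regular expression, the sketch below encodes a UFRE as nested Python tuples (plain tokens, PCDATA fields, optionals `(...)?` and iterators `(...)+`, with no `|`) and compiles it to a Python `re` pattern. The constructors `tok`, `opt`, `plus` and the tuple encoding are hypothetical names chosen for this example; they are not RoadRunner's own data structures.

```python
import re

# Hypothetical tuple encoding of UFRE nodes (an assumption for illustration).
PCDATA = ("pcdata",)

def tok(text):
    """A literal token such as a tag or a constant string."""
    return ("tok", text)

def opt(*items):
    """An optional sub-expression, written (...)? in the slides."""
    return ("opt", list(items))

def plus(*items):
    """An iterator (one or more repetitions), written (...)+."""
    return ("plus", list(items))

def to_regex(nodes):
    """Translate a list of UFRE nodes into a Python regex string."""
    parts = []
    for node in nodes:
        kind = node[0]
        if kind == "tok":
            parts.append(re.escape(node[1]))
        elif kind == "pcdata":
            parts.append(r"([^<]+)")  # a data field: any run of text
        elif kind == "opt":
            parts.append("(?:" + to_regex(node[1]) + ")?")
        elif kind == "plus":
            parts.append("(?:" + to_regex(node[1]) + ")+")
        else:
            raise ValueError("unknown node kind: %r" % (kind,))
    return "".join(parts)

# A small wrapper in the spirit of the book-list example on later slides.
wrapper = [
    tok("<b>"), PCDATA, tok("</b>"),
    opt(tok("<img/>")),
    tok("<ul>"), plus(tok("<li>"), PCDATA, tok("</li>")), tok("</ul>"),
]
pattern = re.compile(to_regex(wrapper))
```

Because the encoding has no union node, every expression built this way is union-free by construction; `pattern.fullmatch(page)` then tells whether a page conforms to the wrapper.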

7
Matching/Mismatching
  • Start with the first page and create an RE that
    defines the wrapper
  • Match each successive sample against the wrapper
  • Mismatches result in generalizations of the
    regular expression
  • Types of mismatches: string mismatches and tag mismatches
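The matching step can be sketched as a token-by-token comparison that reports the first mismatch and its type. This is a minimal sketch under simplifying assumptions: the wrapper is still a flat token list (no optionals or iterators yet), and the helper names `tokenize`, `is_tag` and `first_mismatch` are made up for this example.

```python
import re

def tokenize(html):
    """Split a page into tag tokens and non-empty text tokens."""
    raw = re.findall(r"<[^>]+>|[^<]+", html)
    return [t.strip() for t in raw if t.strip()]

def is_tag(token):
    return token.startswith("<") and token.endswith(">")

def first_mismatch(wrapper_tokens, sample_tokens):
    """Return (position, kind) of the first mismatch, or None if none.

    kind is "string" when both tokens are text, otherwise "tag".
    """
    for i, (w, s) in enumerate(zip(wrapper_tokens, sample_tokens)):
        if w == s:
            continue
        if not is_tag(w) and not is_tag(s):
            return i, "string"
        return i, "tag"
    if len(wrapper_tokens) != len(sample_tokens):
        # One list is a strict prefix of the other: treat as a tag mismatch.
        return min(len(wrapper_tokens), len(sample_tokens)), "tag"
    return None
```

For example, `first_mismatch(tokenize("<b>John</b>"), tokenize("<b>Paul</b>"))` reports a string mismatch at position 1.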

8
Example Pages
9
Simple Matching Example
  • String mismatch: discover fields
  • Replace the mismatching string with PCDATA
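A minimal sketch of this step, assuming both pages have already been tokenized and share the same tag structure, so that only text tokens differ. The function name `generalize_strings` and the use of the plain string `"PCDATA"` as a field marker are assumptions made for this example.

```python
PCDATA = "PCDATA"  # marker for a discovered data field (illustrative)

def is_tag(token):
    return token.startswith("<") and token.endswith(">")

def generalize_strings(wrapper, sample):
    """Replace every pair of differing text tokens with PCDATA.

    Raises ValueError if the pages differ in anything but text,
    since that would be a tag mismatch and needs other handling.
    """
    if len(wrapper) != len(sample):
        raise ValueError("token lists differ in length")
    result = []
    for w, s in zip(wrapper, sample):
        if w == s:
            result.append(w)        # identical tokens stay as they are
        elif not is_tag(w) and not is_tag(s):
            result.append(PCDATA)   # string mismatch: a field is discovered
        else:
            raise ValueError("tag mismatch between %r and %r" % (w, s))
    return result
```

Text that is identical on both pages (a label such as "Books of:") is kept as a constant, while text that differs becomes a field.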
10
Example (Cont.)
  • Tag mismatch: discover optionals
  • Find repeated and optional patterns
  • Cross-search
  • Wrapper generalization
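The cross-search idea can be sketched as follows: at a tag mismatch, look for the wrapper's tag further along in the sample, and for the sample's tag further along in the wrapper; whichever side has to skip tokens holds the optional. This is a simplified sketch on flat token lists that takes the first hit and does no backtracking; `find_optional` and `add_optional` are hypothetical helper names.

```python
def find_optional(wrapper, sample, i, j):
    """Cross-search at a tag mismatch between wrapper[i] and sample[j].

    Returns ("sample", tokens) if the sample contains extra tokens,
    ("wrapper", tokens) if the wrapper does, or None if neither search
    succeeds.
    """
    if wrapper[i] in sample[j + 1:]:
        k = sample.index(wrapper[i], j + 1)
        return "sample", sample[j:k]
    if sample[j] in wrapper[i + 1:]:
        k = wrapper.index(sample[j], i + 1)
        return "wrapper", wrapper[i:k]
    return None

def add_optional(wrapper, sample, i, j):
    """Return a generalized wrapper with an ("opt", tokens) node at i."""
    found = find_optional(wrapper, sample, i, j)
    if found is None:
        return None
    side, tokens = found
    node = ("opt", tokens)
    if side == "sample":
        # The optional is new to the wrapper: insert it before wrapper[i].
        return wrapper[:i] + [node] + wrapper[i:]
    # The optional is already in the wrapper: wrap the skipped tokens.
    return wrapper[:i] + [node] + wrapper[i + len(tokens):]
```

Whichever page carries the extra image tag, the generalized wrapper ends up with the same optional at the same place.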
11
Tag Mismatches: Discovering Iterators
  • Assume the mismatch is caused by repeated elements in
    a list
  • Match candidate squares against earlier squares
  • Generalize the wrapper by finding all contiguous
    repeated occurrences
  • E.g., (<li><i>Title:</i>PCDATA</li>)+
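The two mechanical parts of this step can be sketched in a few lines: picking a candidate square that ends at the next occurrence of the token just before the mismatch, and collapsing all contiguous occurrences of a square into a single iterator node. This is a simplified sketch on flat lists of string tokens; matching a candidate against an earlier square is reduced here to a token comparison in which `"PCDATA"` matches any text. The names `candidate_square` and `collapse_repeats` are assumptions for this example.

```python
def is_tag(token):
    return token.startswith("<") and token.endswith(">")

def candidate_square(tokens, pos):
    """Candidate repeated unit starting at the mismatch position pos.

    It ends at the next occurrence of the token just before the mismatch.
    """
    if pos < 1:
        return None
    terminal = tokens[pos - 1]
    for k in range(pos, len(tokens)):
        if tokens[k] == terminal:
            return tokens[pos:k + 1]
    return None

def token_matches(pattern_token, token):
    if pattern_token == "PCDATA":
        return not is_tag(token)  # a field matches any text token
    return pattern_token == token

def square_at(tokens, pos, square):
    end = pos + len(square)
    if end > len(tokens):
        return False
    return all(token_matches(p, t) for p, t in zip(square, tokens[pos:end]))

def collapse_repeats(tokens, square):
    """Replace each run of contiguous squares with one ("plus", square)."""
    if not square:
        raise ValueError("square must not be empty")
    out, pos = [], 0
    while pos < len(tokens):
        if square_at(tokens, pos, square):
            while square_at(tokens, pos, square):
                pos += len(square)
            out.append(("plus", list(square)))
        else:
            out.append(tokens[pos])
            pos += 1
    return out
```

Applied to a two-item list with the square `<li><i>Title:</i>PCDATA</li>`, `collapse_repeats` leaves `<ul>`, one iterator node, and `</ul>`.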

Example (Cont.)
  • Wrapper fragments shown in the figure: PCDATA, (<IMG src=.../>)?, PCDATA, PCDATA
12
A More Complex Example
13
Extraction Output
14
Experimental Results
15
Comparison with other works
16
Assessment
  • Quality of the extracted datasets
  • Assumptions made for simplicity's sake
  • - regularly structured pages
  • - no disjunctions
  • Search space for explaining mismatches
  • - uses a number of heuristics to prune the space
  • Limited backtracking because of the large number of alternatives
  • Patterns cannot be delimited by optionals
  • This results in pruning possible wrappers

17
Questions
  • Can RoadRunner be improved to work with more than
    2 pages at a time?
  • Can the process of manually naming fields be
    improved?
  • Can disjunction be introduced?