ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Description:

Author: Li Xu. Last modified by: Li Xu. Created: 5/5/2002 5:52:57 PM. Presentation format: On-screen Show.

Slides: 18
Provided by: LiXu8

1
ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites
  • Valter Crescenzi
  • Giansalvatore Mecca
  • Paolo Merialdo

VLDB 2001
2
Overview
  • Automatically generates a wrapper from large
    structured Web pages
  • Supports nested structures and lists
  • Efficient approach to large, complex pages with
    regular structure

3
Example Pages
  • Extracts the fields and hierarchical structure
  • Depends on well-structured HTML
  • Extracts only at the level of entire fields

4
Extracted Result
5
Approach
  • Given a set of example pages
  • Generate a Union-Free Regular Expression (UFRE)
  • A UFRE is an RE without any disjunctions
  • A strong assumption, but one that usually holds
  • Find the least upper bound on the RE lattice to
    generate a wrapper
  • Reduces to finding the least upper bound of two UFREs
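
To make the notion of a union-free expression concrete, the following is a minimal sketch of one possible in-memory representation. The class names (Token, PCData, Opt, Plus) and the render helper are hypothetical, chosen for illustration only; they are not taken from the ROADRUNNER system. The point is that the structure offers optionals and iterators but no "or" node.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Token:
    text: str  # a literal tag or fixed string, e.g. "<B>"


@dataclass(frozen=True)
class PCData:
    pass  # a data field: stands for any string value


@dataclass(frozen=True)
class Opt:
    body: Tuple["Node", ...]  # optional sub-expression: (body)?


@dataclass(frozen=True)
class Plus:
    body: Tuple["Node", ...]  # iterator sub-expression: (body)+


# Note: there is deliberately no "Or" node, so every expression is union-free.
Node = Union[Token, PCData, Opt, Plus]


def render(nodes: Tuple[Node, ...]) -> str:
    """Print an expression in the notation used on the slides."""
    parts = []
    for n in nodes:
        if isinstance(n, Token):
            parts.append(n.text)
        elif isinstance(n, PCData):
            parts.append("PCDATA")
        elif isinstance(n, Opt):
            parts.append("(" + render(n.body) + ")?")
        else:
            parts.append("(" + render(n.body) + ")+")
    return " ".join(parts)


# Hypothetical wrapper for a small "books by author" page.
wrapper = (
    Token("<HTML>"), Token("Books of"), Token("<B>"), PCData(), Token("</B>"),
    Opt((Token("<IMG/>"),)),
    Plus((Token("<LI>"), PCData(), Token("</LI>"))),
    Token("</HTML>"),
)
print(render(wrapper))
```

Because nesting is expressed by placing Opt and Plus inside one another, the same structure can describe lists within lists.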

6
Matching/Mismatching
  • Start with the first page and create an RE that
    defines the wrapper
  • Match each successive sample against the wrapper
  • Mismatches result in generalizations of the
    regular expression
  • Types of mismatches
  • String mismatches
  • Tag mismatches
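
The matching step can be sketched as a left-to-right comparison of two token lists that stops at the first mismatch and classifies it. This is a simplified illustration, assuming pages are already tokenized into tags and strings and that the wrapper is still a flat list (no optionals or iterators yet); the names is_tag and first_mismatch are hypothetical.

```python
def is_tag(tok: str) -> bool:
    """Assumption: a token is a tag if it looks like <...>; anything else is a string."""
    return tok.startswith("<") and tok.endswith(">")


def first_mismatch(wrapper, sample):
    """Return (position, kind) of the first mismatch, or None if the sample matches.

    kind is "string" when both tokens are strings, otherwise "tag".
    """
    for i, (w, s) in enumerate(zip(wrapper, sample)):
        if w == s:
            continue
        if w == "PCDATA" and not is_tag(s):
            continue  # the wrapper already has a field here; any string matches
        both_strings = not is_tag(w) and not is_tag(s)
        return i, ("string" if both_strings else "tag")
    if len(wrapper) != len(sample):
        # Assumption of this sketch: running out of tokens counts as a tag mismatch.
        return min(len(wrapper), len(sample)), "tag"
    return None


page1 = ["<HTML>", "Books of", "<B>", "John Smith", "</B>", "<UL>"]
page2 = ["<HTML>", "Books of", "<B>", "Paul Jones", "</B>", "<IMG/>", "<UL>"]
print(first_mismatch(page1, page2))
```

Each mismatch found this way would then be resolved by one of the generalizations described on the following slides before matching resumes.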

7
Example
8
String Mismatches
  • String mismatches are used to discover fields of
    the documents
  • Wrapper is generated by replacing John Smith
    with PCDATA

<HTML>Books of <B>John Smith → <HTML>Books of <B>PCDATA
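
As a rough illustration of this generalization, the sketch below replaces every position where two token lists carry different strings with PCDATA. It assumes the two lists have the same length and identical tags (that is, only string mismatches occur); the function name generalize_strings and the second author name are hypothetical.

```python
def is_tag(tok: str) -> bool:
    """Assumption: a token is a tag if it looks like <...>."""
    return tok.startswith("<") and tok.endswith(">")


def generalize_strings(wrapper, sample):
    """Resolve string mismatches by turning differing strings into PCDATA fields."""
    if len(wrapper) != len(sample):
        raise ValueError("sketch handles string mismatches only")
    out = []
    for w, s in zip(wrapper, sample):
        if w == s:
            out.append(w)          # identical token: part of the template
        elif not is_tag(w) and not is_tag(s):
            out.append("PCDATA")   # differing strings: a data field
        else:
            raise ValueError("tag mismatch; needs an optional or an iterator")
    return out


wrapper = ["<HTML>", "Books of", "<B>", "John Smith", "</B>"]
sample = ["<HTML>", "Books of", "<B>", "Paul Jones", "</B>"]
print(generalize_strings(wrapper, sample))
```

Strings that are identical in every sample, such as "Books of" here, stay in the wrapper as part of the page template.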
9
Example (Cont.)
10
Tag Mismatches
  • First check to see if mismatch is caused by an
    iterator
  • If not, could be an optional field in wrapper or
    sample
  • Cross search used to determine possible optionals
  • Image field determined to be optional
  • (<img src/>)?
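
The cross search can be sketched as follows: at a tag mismatch, look ahead in the sample for the wrapper's token and, failing that, look ahead in the wrapper for the sample's token. Whichever side has to be skipped is the candidate optional. This is a simplified illustration over flat token lists; find_optional and the "<IMG/>" token are hypothetical names.

```python
def find_optional(wrapper, i, sample, j):
    """Cross search at a tag mismatch between wrapper[i] and sample[j].

    Returns ("sample", tokens) if the skipped tokens occur only in the sample,
    ("wrapper", tokens) if they occur only in the wrapper, or None if neither
    search succeeds.
    """
    try:
        k = sample.index(wrapper[i], j + 1)   # wrapper's token found later in sample
        return "sample", sample[j:k]
    except ValueError:
        pass
    try:
        k = wrapper.index(sample[j], i + 1)   # sample's token found later in wrapper
        return "wrapper", wrapper[i:k]
    except ValueError:
        return None


wrapper = ["</B>", "<UL>", "<LI>"]
sample = ["</B>", "<IMG/>", "<UL>", "<LI>"]
print(find_optional(wrapper, 1, sample, 1))
```

In this example the image tag appears only in the sample, so the wrapper would be generalized by inserting it as an optional at position 1.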

11
Example (Cont.)
12
Tag Mismatches: Discovering Iterators
  • Assume mismatch is caused by repeated elements in
    a list
  • Match possible squares against earlier squares
  • Generalize the wrapper by finding all contiguous
    repeated occurrences
  • (<li><i>Title</i>PCDATA</li>)+
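
One way to picture square discovery: the token just before the mismatch is taken as the last token of a repeated unit, the next occurrence of that token closes a candidate square, and the candidate is accepted only if it has the same shape as the tokens immediately before it. The sketch below is a simplified illustration over a flat token list, not the system's actual procedure; candidate_square is a hypothetical name.

```python
def is_tag(tok: str) -> bool:
    """Assumption: a token is a tag if it looks like <...>."""
    return tok.startswith("<") and tok.endswith(">")


def candidate_square(tokens, j):
    """Find a repeated unit starting at mismatch position j.

    Returns the generalized unit (differing strings become PCDATA),
    or None if no square is found.
    """
    if j < 1:
        return None
    terminal = tokens[j - 1]              # last token of the previous occurrence
    try:
        k = tokens.index(terminal, j)     # where the candidate square ends
    except ValueError:
        return None
    square = tokens[j:k + 1]
    if j < len(square):
        return None                       # not enough earlier tokens to compare
    earlier = tokens[j - len(square):j]
    pattern = []
    for a, b in zip(earlier, square):
        if a == b:
            pattern.append(a)
        elif not is_tag(a) and not is_tag(b):
            pattern.append("PCDATA")      # same slot, different value
        else:
            return None                   # shapes differ: not a repetition
    return pattern


sample = ["<UL>",
          "<LI>", "<I>", "Title", "</I>", "A", "</LI>",
          "<LI>", "<I>", "Title", "</I>", "B", "</LI>",
          "<LI>", "<I>", "Title", "</I>", "C", "</LI>",
          "</UL>"]
# Suppose the wrapper knew only two list items, so the mismatch is at index 13.
print(candidate_square(sample, 13))
```

The returned unit corresponds to the body of the iterator; wrapping it as (unit)+ and collapsing all contiguous occurrences gives the generalized wrapper.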

13
Example (Cont.)
14
Recursive Example
15
Discussion
  • Assumptions
  • Pages are well-structured
  • Want to extract at the level of entire fields
  • Structure can be modeled without disjunctions
  • Search Space for explaining mismatches is huge
  • Uses a number of heuristics to prune space
  • Limited backtracking
  • Limit on number of choices to explore
  • Patterns cannot be delimited by optionals
  • These restrictions prune some possible wrappers

16
Experimental Result
17
Comparison with Other Works