Schema Matching and Data Extraction over HTML Tables

Slides: 33
Provided by: Cui79
Learn more at: https://www.deg.byu.edu
1
Schema Matching and Data Extraction over HTML Tables
  • Cui Tao
  • Data Extraction Research Group
  • Department of Computer Science
  • Brigham Young University

supported by NSF
2
Introduction
  • Many tables on the Web
  • Ontology-based extraction
  • Works well for unstructured or semi-structured data
  • What about structured data tables?
  • How to integrate data stored in different tables?
  • Detect the table of interest
  • Form attribute-value pairs (adjust if necessary)
  • Do extraction
  • Infer mappings from extraction patterns

3
Problem: Detecting the Table of Interest
4
Problem: Different Schemas
  • Different source table schemas
  • {Run #, Yr, Make, Model, Tran, Color, Dr}
  • {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
  • {Vehicle, Distance, Price, Mileage}
  • {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
  • Target database schema
  • {Car, Year, Make, Model, Mileage, Price, PhoneNr}
  • {Car, Feature}

5
Problem: Attribute is Value
6
Problem: Attribute-Value is Value
7
Problem: Value is not Value
8
Problem: Factored Values
9
Problem: Split Values
10
Problem: Merged Values
11
Problem: Information Behind Links
12
Solution
  • Detect the table of interest
  • Form attribute-value pairs (adjust if necessary)
  • Do extraction
  • Infer mappings from extraction patterns

13
Solution: Detect the Table of Interest
  • Top-level tables
  • Table size at least 3 rows and columns
  • Grid layout: same # of values
  • Attributes
  • Value density = (# of ontology-extracted values) / (total # of values in the table)
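
The top-level table tests listed on this slide can be sketched roughly as follows. This is a minimal illustration, not the presenter's implementation: the `recognizes` callback stands in for the ontology's value recognizers, the first row is assumed to hold the attributes, and the `min_density` threshold of 0.5 is a hypothetical value that the slides do not state.

```python
from typing import Callable, List

Table = List[List[str]]  # rows of cell strings; row 0 is assumed to hold attributes


def has_min_size(table: Table, min_rows: int = 3, min_cols: int = 3) -> bool:
    """Size test: at least min_rows rows, each with at least min_cols cells."""
    return len(table) >= min_rows and all(len(row) >= min_cols for row in table)


def is_grid(table: Table) -> bool:
    """Grid-layout test: every row has the same number of cells."""
    return len({len(row) for row in table}) == 1


def value_density(table: Table, recognizes: Callable[[str], bool]) -> float:
    """Recognized values divided by all non-empty values below the attribute row."""
    values = [cell for row in table[1:] for cell in row if cell.strip()]
    if not values:
        return 0.0
    return sum(1 for value in values if recognizes(value)) / len(values)


def is_table_of_interest(
    table: Table,
    recognizes: Callable[[str], bool],
    min_density: float = 0.5,  # hypothetical threshold, not taken from the slides
) -> bool:
    """Combine the size, grid, and value-density tests."""
    return (
        has_min_size(table)
        and is_grid(table)
        and value_density(table, recognizes) >= min_density
    )
```

A caller would pass each candidate table together with a recognizer built from the application ontology; tables that fail any of the three tests are skipped.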

14
Solution: Detect the Table of Interest
  • Linked-page tables
  • Table size at least 2 rows and columns
  • Attributes
  • Attribute-value-pair pattern
  • Page-spanning tables
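
One plausible reading of the attribute-value-pair pattern for linked-page tables is a table whose first column holds attribute names and whose second column holds the values. The sketch below checks for that shape. The function name, the `known_attributes` set, and the `min_share` threshold are all assumptions made for illustration; the slides do not define the pattern in detail.

```python
from typing import Iterable, List


def looks_like_attribute_value_table(
    table: List[List[str]],
    known_attributes: Iterable[str],
    min_share: float = 0.5,  # hypothetical threshold
) -> bool:
    """Return True if enough first-column cells are known attribute names."""
    # Size test for linked-page tables: at least 2 rows and 2 columns.
    if len(table) < 2 or any(len(row) < 2 for row in table):
        return False
    known = {name.lower() for name in known_attributes}
    # Normalize labels such as "Make:" to "make" before the lookup.
    labels = [row[0].strip().rstrip(":").lower() for row in table]
    hits = sum(1 for label in labels if label in known)
    return hits / len(labels) >= min_share
```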

15
Solution: Remove Factoring
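
The transcript keeps only this slide's title, so the following is a sketch under a stated assumption: a factored value is a row with exactly one non-empty cell (for example a row holding only a make) that applies to every row beneath it until the next such row. The function and the `factored_attribute` parameter are hypothetical names.

```python
from typing import List, Tuple


def remove_factoring(
    header: List[str],
    rows: List[List[str]],
    factored_attribute: str,
) -> Tuple[List[str], List[List[str]]]:
    """Copy each factored value into the data rows it applies to."""
    unfactored = []
    current = ""  # the factored value currently in effect
    for row in rows:
        filled = [cell.strip() for cell in row if cell.strip()]
        if len(filled) == 1:
            # Assumed factoring row: remember its value and drop the row.
            current = filled[0]
            continue
        unfactored.append([current] + list(row))
    return [factored_attribute] + list(header), unfactored
```

A data row that happens to contain a single value would be misread as a factoring row under this assumption, so a real implementation would need a stricter test.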
16
Solution: Replace Boolean Values
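
A minimal sketch of this step, assuming that a Boolean cell such as "Yes" is replaced by its column's attribute name, which is consistent with the deck's example pairs where Auto is paired with Auto and Air Cond. with Air Cond. The marker sets below are illustrative assumptions, not taken from the slides.

```python
from typing import List

TRUE_MARKERS = {"yes", "y", "x", "true"}   # assumed spellings of "true"
FALSE_MARKERS = {"no", "n", "false"}       # assumed spellings of "false"


def replace_boolean_values(header: List[str], row: List[str]) -> List[str]:
    """Replace true markers with the attribute name and false markers with ''."""
    replaced = []
    for attribute, cell in zip(header, row):
        marker = cell.strip().lower()
        if marker in TRUE_MARKERS:
            replaced.append(attribute)   # e.g. "Yes" under "Auto" becomes "Auto"
        elif marker in FALSE_MARKERS:
            replaced.append("")          # feature absent: leave no value
        else:
            replaced.append(cell)        # ordinary value: keep unchanged
    return replaced
```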
17
Solution: Form Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>,
<Colour, White>, <Price, 6300>, <Auto, Auto>,
<Air Cond., Air Cond.>, <AM/FM, AM/FM>
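
Pair formation for one table row can be sketched as below. It assumes the row has already been cleaned (factoring removed, Boolean values replaced) and that empty cells produce no pair, which matches the example above, where no pair is listed for CD. The function name is hypothetical.

```python
from typing import List, Tuple


def form_attribute_value_pairs(
    header: List[str], row: List[str]
) -> List[Tuple[str, str]]:
    """Pair each non-empty cell with the attribute of its column."""
    return [
        (attribute.strip(), value.strip())
        for attribute, value in zip(header, row)
        if value.strip()
    ]


header = ["Make", "Model", "Year", "Colour", "Price",
          "Auto", "Air Cond.", "AM/FM", "CD"]
row = ["Honda", "Civic EX", "1995", "White", "6300",
       "Auto", "Air Cond.", "AM/FM", ""]
pairs = form_attribute_value_pairs(header, row)
```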
18
Solution: Adjust Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>,
<Colour, White>, <Price, 6300>, <Auto, Auto>,
<Air Cond., Air Cond.>, <AM/FM, AM/FM>
19
Solution: Add Information Hidden Behind Links
20
Solution: Inferred Mapping Creation
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
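
The idea of inferring mappings from extraction patterns can be illustrated with a small sketch: for each source column, count how many of its values each target attribute's recognizer accepts, and map the column to the best-scoring target. The regular expressions, the `min_share` threshold, and the sample data are assumptions for illustration only; the actual system uses ontology-based recognizers that are not shown in the transcript.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical recognizers standing in for the ontology's extraction rules.
RECOGNIZERS: Dict[str, Pattern[str]] = {
    "Year": re.compile(r"(19|20)\d{2}"),
    "Price": re.compile(r"\$\d[\d,]*"),
    "Make": re.compile(r"Honda|Ford|Toyota", re.IGNORECASE),
}


def infer_mappings(
    header: List[str],
    rows: List[List[str]],
    recognizers: Dict[str, Pattern[str]],
    min_share: float = 0.5,  # hypothetical threshold
) -> Dict[str, str]:
    """Map each source attribute to the target whose recognizer fits best."""
    mappings = {}
    for col, source_attribute in enumerate(header):
        values = [row[col].strip() for row in rows
                  if col < len(row) and row[col].strip()]
        if not values:
            continue
        best_target, best_hits = None, 0
        for target, pattern in recognizers.items():
            hits = sum(1 for value in values if pattern.fullmatch(value))
            if hits > best_hits:
                best_target, best_hits = target, hits
        # Keep the mapping only if enough of the column was recognized.
        if best_target is not None and best_hits / len(values) >= min_share:
            mappings[source_attribute] = best_target
    return mappings
```

Columns that no recognizer covers well, such as a colour column here, are simply left unmapped.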
28
Experimental Results - Table Location
  • Car advertisement application domain

Precision: 86%, Recall: 92%
29
Experimental Results - Mapping
  • Car advertisement application domain
  • 46 recognized tables in the testing set
  • Total 319 mappings
  • Precision: 95.8%, Recall: 92.8%
  • Top-level tables: 77% of the 296 correct mappings
  • Linked tables: 19.6%
  • Both: 3.4%
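
As a check on how these figures relate, the sketch below computes recall for the car-advertisement domain. It assumes that the 319 total is the number of mappings that should have been found and that 296 is the number found correctly; under that reading the ratio reproduces the reported recall. The number of mappings the system reported is not given on the slide, so precision is defined here but not evaluated.

```python
def recall(correct: int, expected: int) -> float:
    """Correct mappings divided by the mappings that should have been found."""
    return correct / expected


def precision(correct: int, reported: int) -> float:
    """Correct mappings divided by all mappings the system reported."""
    return correct / reported


car_recall = recall(296, 319)
print(f"{car_recall:.1%}")  # prints 92.8%
```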

30
Experimental Results - Table Location
  • Cell-phone sales application domain

31
Experimental Results - Mapping
  • Cell-phone sales application domain
  • 11 recognized tables in the testing set
  • Total 97 mappings
  • Precision: 90.1%, Recall: 85.4%
  • Top-level tables: 85.4% of the 88 correct mappings
  • Linked tables: 50.5%
  • Both: 35.9%

32
Contribution
  • Provides an approach to extract information automatically from HTML tables
  • Suggests a different way to solve the problem of schema matching