Title: Schema Matching and Data Extraction over HTML Tables
1 Schema Matching and Data Extraction over HTML Tables
- Cui Tao
- Data Extraction Research Group
- Department of Computer Science
- Brigham Young University
supported by NSF
2 Introduction
- Many tables on the Web
- Ontology-based extraction
- Works well for unstructured or semi-structured data
- What about structured data tables?
- How to integrate data stored in different tables?
- Detect the table of interest
- Form attribute-value pairs (adjust if necessary)
- Do extraction
- Infer mappings from extraction patterns
3 Problem - Detecting the Table of Interest
4 Problem - Different Schemas
- Different source table schemas
- Run, Yr, Make, Model, Tran, Color, Dr
- Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD
- Vehicle, Distance, Price, Mileage
- Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy
- Target database schema
- Car, Year, Make, Model, Mileage, Price, PhoneNr
- Car, Feature
5 Problem - Attribute is Value
6 Problem - Attribute-Value is Value
7 Problem - Value is not Value
8 Problem - Factored Values
9 Problem - Split Values
10 Problem - Merged Values
11 Problem - Information Behind Links
12 Solution
- Detect the table of interest
- Form attribute-value pairs (adjust if necessary)
- Do extraction
- Infer mappings from extraction patterns
13 Solution - Detect the Table of Interest
- Top-level tables
- Table size: at least 3 rows and columns
- Grid layout: same number of values per row
- Attributes
- Value density: (number of ontology-extracted values) / (total number of values in the table)
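The size, grid, and value-density checks above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy recognizer, and the sample rows are hypothetical, and a real system would use the extraction ontology's recognizers in place of `toy_recognizer`.

```python
from typing import Callable, List

Table = List[List[str]]

def is_candidate_table(rows: Table, min_rows: int = 3, min_cols: int = 3) -> bool:
    """Size and grid checks: enough rows, and every row has the same width."""
    if len(rows) < min_rows:
        return False
    width = len(rows[0])
    return width >= min_cols and all(len(row) == width for row in rows)

def value_density(rows: Table, is_recognized: Callable[[str], bool]) -> float:
    """Recognized values divided by all non-empty values in the given rows."""
    values = [cell.strip() for row in rows for cell in row if cell.strip()]
    if not values:
        return 0.0
    return sum(1 for v in values if is_recognized(v)) / len(values)

# Hypothetical stand-in for the ontology's recognizers (assumption).
KNOWN_MAKES = {"honda", "ford", "toyota"}

def toy_recognizer(value: str) -> bool:
    v = value.lower()
    return v in KNOWN_MAKES or (v.isdigit() and len(v) == 4)

# Made-up example table: one attribute row followed by three value rows.
table = [
    ["Make", "Model", "Year"],
    ["Honda", "Civic EX", "1995"],
    ["Ford", "Taurus", "1998"],
    ["Toyota", "Camry", "1997"],
]

candidate = is_candidate_table(table)
density = value_density(table[1:], toy_recognizer)  # skip the attribute row
```

Here the toy recognizer accepts the makes and the four-digit years but not the models, so six of the nine values count toward the density. The threshold at which a table is accepted is not stated on the slide and is left out of the sketch.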
14 Solution - Detect the Table of Interest
- Linked-page tables
- Table size: at least 2 rows and columns
- Attributes
- Attribute-value-pair pattern
- Page-spanning tables
15 Solution - Remove Factoring
16 Solution - Replace Boolean Values
17 Solution - Form Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>,
<Colour, White>, <Price, 6300>, <Auto, Auto>,
<Air Cond., Air Cond.>, <AM/FM, AM/FM>
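As an illustration of how pairs like these could be formed from a table row, here is a minimal Python sketch. It assumes that Boolean cells appear as markers such as "Yes" or "No", that a true marker is replaced by the attribute name, and that false or empty cells produce no pair. The marker sets, the function name, and the sample row are hypothetical, not taken from the slides.

```python
from typing import List, Tuple

# Hypothetical sets of cell contents treated as Boolean markers (assumption).
TRUE_MARKERS = {"yes", "y", "x", "true"}
FALSE_MARKERS = {"no", "n", "false", ""}

def form_pairs(header: List[str], row: List[str]) -> List[Tuple[str, str]]:
    """Pair each attribute with its cell value, replacing Boolean markers."""
    pairs = []
    for attribute, value in zip(header, row):
        cell = value.strip()
        if cell.lower() in TRUE_MARKERS:
            # A true marker becomes the attribute name itself, e.g. <Auto, Auto>.
            pairs.append((attribute, attribute))
        elif cell.lower() in FALSE_MARKERS:
            # Assumption: a false or empty cell contributes no pair.
            continue
        else:
            pairs.append((attribute, cell))
    return pairs

header = ["Make", "Model", "Year", "Colour", "Price",
          "Auto", "Air Cond.", "AM/FM", "CD"]
# Made-up row; the Yes/No cells are an assumption about the source table.
row = ["Honda", "Civic EX", "1995", "White", "6300",
       "Yes", "Yes", "Yes", "No"]

pairs = form_pairs(header, row)
```

With this made-up row, the sketch yields the eight pairs listed above and no pair for CD.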
18 Solution - Adjust Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>,
<Colour, White>, <Price, 6300>, <Auto, Auto>,
<Air Cond., Air Cond.>, <AM/FM, AM/FM>
19 Solution - Add Information Hidden Behind Links
20 Solution - Inferred Mapping Creation
- Car, Year, Make, Model, Mileage, Price, PhoneNr
- Car, Feature
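One way to picture inferring mappings from extraction patterns is to count, for each source attribute, which target attribute its values were extracted into, and keep the dominant target when it covers enough of the column. The Python sketch below only illustrates that idea and is not the authors' algorithm. The function name, the `min_support` threshold, and the sample counts are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

def infer_mappings(
    extractions: Iterable[Tuple[str, str]],
    column_sizes: Dict[str, int],
    min_support: float = 0.5,
) -> Dict[str, str]:
    """Map each source attribute to the target attribute most of its values hit.

    extractions: one (source_attribute, target_attribute) pair per extracted value.
    column_sizes: number of values under each source attribute.
    """
    hits: Dict[str, Counter] = defaultdict(Counter)
    for source_attr, target_attr in extractions:
        hits[source_attr][target_attr] += 1

    mappings = {}
    for source_attr, counter in hits.items():
        target_attr, count = counter.most_common(1)[0]
        # Keep the mapping only if enough of the column was extracted that way.
        if count / column_sizes[source_attr] >= min_support:
            mappings[source_attr] = target_attr
    return mappings

# Made-up extraction log for a three-row table.
extractions = (
    [("Yr", "Year")] * 3
    + [("Make", "Make")] * 3
    + [("Color", "Feature")] * 2
    + [("Run", "Mileage")]  # a single spurious hit
)
column_sizes = {"Yr": 3, "Make": 3, "Color": 3, "Run": 3}

mappings = infer_mappings(extractions, column_sizes)
```

In this made-up log, Yr, Make, and Color are mapped to Year, Make, and Feature, while the single hit from Run to Mileage falls below the threshold and is dropped.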
28 Experimental Results - Table Location
- Car advertisement application domain
- Precision: 86%, Recall: 92%
29 Experimental Results - Mapping
- Car advertisement application domain
- 46 recognized tables in the testing set
- Total: 319 mappings
- Precision: 95.8%, Recall: 92.8%
- Top-level tables: 77% of the 296 correct mappings
- Linked tables: 19.6%
- Both: 3.4%
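The recall figure for the car domain can be re-derived from the counts on the slide: 296 correct mappings out of 319. The number of mappings the system declared is not given. The value 309 used below is back-computed from the reported precision and is an assumption, as are the function names.

```python
def precision(correct: int, declared: int) -> float:
    """Fraction of declared mappings that are correct."""
    return correct / declared

def recall(correct: int, expected: int) -> float:
    """Fraction of expected mappings that were correctly found."""
    return correct / expected

expected = 319   # total mappings (from the slide)
correct = 296    # correct mappings (from the slide)
declared = 309   # ASSUMPTION: back-computed from the reported precision

recall_pct = round(100 * recall(correct, expected), 1)
precision_pct = round(100 * precision(correct, declared), 1)

# Approximate counts behind the percentage breakdown of correct mappings.
breakdown = {
    "top-level": round(0.77 * correct),
    "linked": round(0.196 * correct),
    "both": round(0.034 * correct),
}
```

Under these numbers the three shares of the breakdown add up to the 296 correct mappings.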
30 Experimental Results - Table Location
- Cell-phone sales application domain
31 Experimental Results - Mapping
- Cell-phone sales application domain
- 11 recognized tables in the testing set
- Total: 97 mappings
- Precision: 90.1%, Recall: 85.4%
- Top-level tables: 85.4% of the 88 correct mappings
- Linked tables: 50.5%
- Both: 35.9%
32 Contribution
- Provides an approach to extract information automatically from HTML tables
- Suggests a different way to solve the problem of schema matching