Title: Schema Matching and Data Extraction over HTML Tables
1 Schema Matching and Data Extraction over HTML Tables
- Cui Tao
- Data Extraction Research Group
- Department of Computer Science
- Brigham Young University
supported by NSF
2 Introduction
- Many tables on the Web
- How to integrate data stored in different tables?
- Detect the table of interest
- Form attribute-value pairs (adjust if necessary)
- Do extraction
- Infer mappings from extraction patterns
3 Problem: Detecting the Table of Interest
4 Problem: Different Schemas
- Different source table schemas
  - Run #, Yr, Make, Model, Tran, Color, Dr
  - Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD
  - Vehicle, Distance, Price, Mileage
  - Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy
- Target database schema
  - Car, Year, Make, Model, Mileage, Price, PhoneNr
  - Car, Feature
5 Problem: Attribute is Value
6 Problem: Attribute-Value is Value
7 Problem: Value is not Value
8 Problem: Factored Values
9 Problem: Split Values
10 Problem: Merged Values
11 Problem: Information Behind Links
12 Solution
- Detect the table of interest
- Form attribute-value pairs (adjust if necessary)
- Do extraction
- Infer mappings from extraction patterns
13 Solution: Detect the Table of Interest
- Real table test
- Same number of values
- Table size
- Attribute test
- Density measure test
  - (# of ontology-extracted values) / (total # of values in the table)
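The density measure can be sketched as a simple ratio. This is a minimal illustration, not the authors' implementation: the `recognizes` callback is a hypothetical stand-in for the extraction ontology's recognizers, and the toy recognizer and sample rows are assumptions made up for the example.

```python
from typing import Callable, List


def density(table: List[List[str]], recognizes: Callable[[str], bool]) -> float:
    """Return the fraction of non-empty cell values that the recognizer accepts."""
    values = [cell for row in table for cell in row if cell.strip()]
    if not values:
        return 0.0
    extracted = sum(1 for value in values if recognizes(value))
    return extracted / len(values)


# Hypothetical stand-in for the ontology: a few car makes and four-digit years.
KNOWN_MAKES = {"Honda", "Ford", "Toyota"}


def toy_recognizer(value: str) -> bool:
    return value in KNOWN_MAKES or (value.isdigit() and len(value) == 4)


rows = [["Honda", "Civic EX", "1995"], ["Ford", "Taurus", "1998"]]
score = density(rows, toy_recognizer)  # 4 of the 6 values are recognized
```

A table would then count as a table of interest when its score exceeds some threshold; the slides do not state the threshold, so none is assumed here.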
14 Solution: Remove Factoring
15 Solution: Replace Boolean Values
16 Solution: Form Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, 6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
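Forming attribute-value pairs amounts to zipping a table's column attributes with the values of one row. The sketch below assumes a hypothetical helper name, `form_pairs`, and assumes that cells with no value are skipped; the empty CD cell is included only to illustrate that assumption.

```python
from typing import List, Tuple


def form_pairs(attributes: List[str], row: List[str]) -> List[Tuple[str, str]]:
    """Pair each column attribute with the row's value, skipping empty cells."""
    return [(attr, value) for attr, value in zip(attributes, row) if value.strip()]


header = ["Make", "Model", "Year", "Colour", "Price",
          "Auto", "Air Cond.", "AM/FM", "CD"]
# Row values taken from the pairs on the slide; the empty CD cell is an assumption.
row = ["Honda", "Civic EX", "1995", "White", "6300",
       "Auto", "Air Cond.", "AM/FM", ""]

pairs = form_pairs(header, row)
for attr, value in pairs:
    print(f"<{attr}, {value}>")
```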
17 Solution: Adjust Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, 6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
18 Solution: Add Information Hidden Behind Links
19-25 Solution: Inferred Mapping Creation
- Target database schema
  - Car, Year, Make, Model, Mileage, Price, PhoneNr
  - Car, Feature
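One way to picture inferring mappings from extraction patterns is to tally, for each source attribute, which target attribute its extracted values landed in, and keep the dominant one. This is a hedged sketch only: the function name `infer_mappings`, the `min_share` threshold, and the sample observations are all assumptions for illustration and are not taken from the slides.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def infer_mappings(extractions: Iterable[Tuple[str, str]],
                   min_share: float = 0.5) -> Dict[str, str]:
    """Map each source attribute to the target attribute that received
    the largest share of its extracted values, if that share is high enough."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for source_attr, target_attr in extractions:
        counts[source_attr][target_attr] += 1

    mappings: Dict[str, str] = {}
    for source_attr, tally in counts.items():
        target_attr, hits = tally.most_common(1)[0]
        if hits / sum(tally.values()) >= min_share:
            mappings[source_attr] = target_attr
    return mappings


# Hypothetical (source attribute, target attribute) observations.
observed = [("Yr", "Year"), ("Yr", "Year"), ("Yr", "Price"),
            ("Make", "Make"), ("Make", "Make")]
mappings = infer_mappings(observed)
```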
26 Experimental Results
- Car Advertisement Application domain
  - 10 training tables
    - 100% of the 57 mappings (no false mappings)
    - 94.6% precision of the values in linked pages (5.4% false declarations)
  - 50 test tables
    - 94.7% of the 300 mappings (no false mappings)
    - On the basis of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision
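For reference, the standard definitions of precision and recall are given below. The slides do not spell out their formulas, so applying these definitions to the reported figures is an assumption; under it, the 5.4% false declarations is the complement of the 94.6% precision.

```latex
\[
\text{precision} = \frac{\text{correct declarations}}{\text{correct declarations} + \text{false declarations}},
\qquad
\text{recall} = \frac{\text{correct declarations}}{\text{correct declarations} + \text{missed values}}
\]
\[
100\% - 94.6\% = 5.4\%
\]
```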
27 Other Applications
- Cell Phone Plan Application domain
- Soccer Player Application domain
28 Contribution
- Provides an approach to extract information automatically from HTML tables
- Suggests a different way to solve the problem of schema matching