Title: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure
1 Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure
- David W. Embley, Cui Tao, Stephen W. Liddle
- Brigham Young University
Funded by NSF
2 Information Exchange
Source
Target
Information Extraction
Schema Matching
3 Information Extraction
4 Extracting Pertinent Information from Documents
5 A Conceptual-Modeling Solution
6 Car-Ads Ontology
- Car [-> object]
- Car [0..1] has Year [1..*]
- Car [0..1] has Make [1..*]
- Car [0..1] has Model [1..*]
- Car [0..1] has Mileage [1..*]
- Car [0..*] has Feature [1..*]
- Car [0..1] has Price [1..*]
- PhoneNr [1..*] is for Car [0..*]
- PhoneNr [0..1] has Extension [1..*]
- Year matches [4]
- constant { extract "\d{2}";
- context "([^\$\d]|^)[4-9]\d[^\d]";
- substitute "^" -> "19"; },
-
-
- End
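The Year rule above can be read as three steps: find a two-digit number in a suitable context, extract just the digits, and prefix them with "19". The following Python sketch illustrates that reading. The regular expression is an assumed reconstruction of the slide's context expression, and the function name `extract_years` is hypothetical, not part of the authors' tools.

```python
import re

# Assumed reading of the Year rule's context expression: a two-digit number
# 40-99, preceded by start-of-text or a character that is neither '$' nor a
# digit, and followed by a non-digit. The capture group is the "extract" part.
YEAR_CONTEXT = re.compile(r"(?:[^\$\d]|^)([4-9]\d)[^\d]")


def extract_years(text: str) -> list[str]:
    """Return four-digit years, applying the substitute step (prefix "19")."""
    return ["19" + match.group(1) for match in YEAR_CONTEXT.finditer(text)]


# Hypothetical ad text: the '97 is recognized, the price is not.
print(extract_years("'97 CHEVY Cavalier, red, $11,995."))
```

Excluding '$' and digits from the left context is what keeps prices such as $11,995 from being mistaken for years.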
7 Recognition and Extraction
8 Schema Matching for HTML Tables with Unknown Structure
9 Table-Schema Matching (Basic Idea)
- Many Tables on the Web
- Ontology-Based Extraction
- Works well for unstructured or semistructured data
- What about structured data tables?
- Method
- Form attribute-value pairs
- Do extraction
- Infer mappings from extraction patterns
10 Problem: Different Schemas
- Target Database Schema
- {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
- Different Source Table Schemas
- {Run , Yr, Make, Model, Tran, Color, Dr}
- {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
- {Vehicle, Distance, Price, Mileage}
- {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
11 Problem: Attribute is Value
12 Problem: Attribute-Value is Value
13 Problem: Value is not Value
14 Problem: Implied Values
15 Problem: Missing Attributes
16 Problem: Compound Attributes
17 Problem: Factored Values
18 Problem: Split Values
19 Problem: Merged Values
20 Problem: Values not of Interest
21 Problem: Information Behind Links
22 Solution
- Form attribute-value pairs (adjust if necessary)
- Do extraction
- Infer mappings from extraction patterns
23 Solution: Remove Internal Factoring
Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond., AM/FM, CD))
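Once the nesting Make, (Model, (...)) is known, removing the factoring amounts to copying each factored value down to the rows it governs. The sketch below is one minimal way to do that, assuming the table has already been parsed into a list of dictionaries with blank cells where a value was factored out. The function name `unfactor` and the sample rows are hypothetical.

```python
def unfactor(rows, factored_columns):
    """Fill blank cells in factored columns so every row is self-contained.

    factored_columns is ordered from outermost to innermost nesting level,
    e.g. ["Make", "Model"]. A new outer value resets the inner levels.
    """
    last_seen = {}
    flat = []
    for row in rows:
        filled = dict(row)
        for depth, column in enumerate(factored_columns):
            if filled.get(column):
                last_seen[column] = filled[column]
                # A new value at this level invalidates remembered inner values.
                for inner in factored_columns[depth + 1:]:
                    last_seen.pop(inner, None)
            else:
                filled[column] = last_seen.get(column, "")
        flat.append(filled)
    return flat


# Hypothetical rows: Make and Model appear once, then are left blank.
rows = [
    {"Make": "ACURA", "Model": "Legend", "Year": "1992"},
    {"Make": "", "Model": "", "Year": "1994"},
    {"Make": "HONDA", "Model": "Civic EX", "Year": "1995"},
]
flat = unfactor(rows, ["Make", "Model"])
print(flat[1])
```

Discovering which columns are factored is the harder part and is not shown here; the sketch only covers the rewriting step after the nesting is known.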
24 Solution: Replace Boolean Values
25 Solution: Form Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, 6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>, <CD, >
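The pairs on this slide can be produced by first replacing Boolean cells with the column's attribute name (or the empty string) and then zipping the header with each row. The sketch below is an illustration under assumptions: the sets of Boolean spellings, the function names, and the Yes/No cells in the sample row are hypothetical choices, not taken from the authors' system.

```python
# Assumed spellings of Boolean cell values; a real table may use others.
BOOLEAN_TRUE = {"yes", "y"}
BOOLEAN_FALSE = {"no", "n"}


def replace_boolean(attribute, value):
    """Map a true cell to the attribute name and a false cell to ""."""
    normalized = value.strip().lower()
    if normalized in BOOLEAN_TRUE:
        return attribute
    if normalized in BOOLEAN_FALSE:
        return ""
    return value


def form_pairs(header, row):
    """Pair each column attribute with its (Boolean-replaced) cell value."""
    return [(attribute, replace_boolean(attribute, value))
            for attribute, value in zip(header, row)]


header = ["Make", "Model", "Year", "Colour", "Price",
          "Auto", "Air Cond.", "AM/FM", "CD"]
# Hypothetical source row; the Yes/No cells are assumed.
row = ["Honda", "Civic EX", "1995", "White", "6300", "Yes", "Yes", "Yes", "No"]
pairs = form_pairs(header, row)
print(pairs)
```

With these assumptions the last four pairs come out as ("Auto", "Auto"), ("Air Cond.", "Air Cond."), ("AM/FM", "AM/FM"), and ("CD", ""), matching the shape of the pairs listed on the slide.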
26 Solution: Adjust Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, 6300>, <Auto>, <Air Cond>, <AM/FM>
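The adjustment on this slide drops pairs whose value is empty and collapses pairs whose value merely repeats the attribute. A minimal sketch of that step follows; the function name `adjust_pairs` and the tuple representation are assumptions made for illustration.

```python
def adjust_pairs(pairs):
    """Drop empty-valued pairs and collapse <A, A> into the single item <A>."""
    adjusted = []
    for attribute, value in pairs:
        if not value:
            continue  # e.g. <CD, > carries no information
        if value == attribute:
            adjusted.append((attribute,))  # e.g. <Auto, Auto> becomes <Auto>
        else:
            adjusted.append((attribute, value))
    return adjusted


pairs = [("Make", "Honda"), ("Model", "Civic EX"), ("Year", "1995"),
         ("Colour", "White"), ("Price", "6300"), ("Auto", "Auto"),
         ("Air Cond.", "Air Cond."), ("AM/FM", "AM/FM"), ("CD", "")]
adjusted = adjust_pairs(pairs)
print(adjusted)
```

Collapsing the repeated name avoids presenting the same keyword twice to the extraction step, which could otherwise look like two occurrences of the feature.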
27 Solution: Do Extraction
28 Solution: Infer Mappings
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
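One way to picture inferring mappings from extraction patterns: run the recognizers for each target attribute over every source column and map a column to the target whose recognizer accepts most of its cells. The sketch below illustrates that idea only. The recognizer patterns, the 0.8 threshold, the function name, and the sample table are all hypothetical stand-ins for the ontology's data frames, not the authors' implementation.

```python
import re

# Hypothetical recognizers standing in for the ontology's data frames.
RECOGNIZERS = {
    "Year": re.compile(r"^(19|20)\d{2}$"),
    "Price": re.compile(r"^\$\d{1,3}(,\d{3})*$"),
    "Make": re.compile(r"^(Honda|Acura|Ford|Toyota)$", re.IGNORECASE),
}


def infer_mappings(header, rows, threshold=0.8):
    """Map each source column to the target attribute that recognizes it best."""
    mappings = {}
    if not rows:
        return mappings
    for index, column in enumerate(header):
        cells = [row[index].strip() for row in rows]
        best_target, best_rate = None, 0.0
        for target, pattern in RECOGNIZERS.items():
            rate = sum(1 for cell in cells if pattern.match(cell)) / len(cells)
            if rate >= threshold and rate > best_rate:
                best_target, best_rate = target, rate
        if best_target is not None:
            mappings[column] = best_target
    return mappings


# Hypothetical source table whose attribute names differ from the target's.
header = ["Yr", "Brand", "Asking"]
rows = [["1995", "Honda", "$6,300"], ["1992", "Acura", "$8,900"]]
mappings = infer_mappings(header, rows)
print(mappings)
```

Because the decision rests on the values rather than the column names, a column labelled "Yr" or "Asking" can still be mapped to Year or Price.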
29 Solution: Do Extraction
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
30 Solution: Do Extraction
$\pi_{\text{Price}}\, \text{Table}$
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
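31 Solution: Do Extraction
$$
\rho_{\text{Colour} \leftarrow \text{Feature}}\, \pi_{\text{Colour}}\, \text{Table}
\;\cup\; \rho_{\text{Auto} \leftarrow \text{Feature}}\, \pi_{\text{Auto}}\, \beta^{\text{Auto}}_{\text{Yes},\,\text{""}}\, \text{Table}
\;\cup\; \rho_{\text{Air Cond.} \leftarrow \text{Feature}}\, \pi_{\text{Air Cond.}}\, \beta^{\text{Air Cond.}}_{\text{Yes},\,\text{""}}\, \text{Table}
\;\cup\; \rho_{\text{AM/FM} \leftarrow \text{Feature}}\, \pi_{\text{AM/FM}}\, \beta^{\text{AM/FM}}_{\text{Yes},\,\text{""}}\, \text{Table}
\;\cup\; \rho_{\text{CD} \leftarrow \text{Feature}}\, \pi_{\text{CD}}\, \beta^{\text{CD}}_{\text{Yes},\,\text{""}}\, \text{Table}
$$
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}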
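The Feature mapping on this slide is a union over several source columns: Colour contributes its values directly, while each Yes/No column contributes its own name when the cell is true. The sketch below illustrates that union under assumptions: rows are dictionaries, the row index stands in for the Car identifier, and the function name and sample row are hypothetical.

```python
def feature_relation(rows, plain_columns, boolean_columns, true_value="Yes"):
    """Build the set of (car, feature) tuples as a union over source columns."""
    features = set()
    for car_id, row in enumerate(rows):  # row index stands in for Car
        for column in plain_columns:
            # Project the column and treat its value as a Feature.
            if row.get(column):
                features.add((car_id, row[column]))
        for column in boolean_columns:
            # Boolean replacement: a true cell contributes the column name.
            if row.get(column) == true_value:
                features.add((car_id, column))
    return features


# Hypothetical source row; the Yes/No cell values are assumed.
rows = [{"Colour": "White", "Auto": "Yes", "Air Cond.": "Yes",
         "AM/FM": "Yes", "CD": "No"}]
features = feature_relation(rows, ["Colour"],
                            ["Auto", "Air Cond.", "AM/FM", "CD"])
print(sorted(features))
```

Using a set mirrors the union: a feature reached through more than one column is recorded once per car.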
32 Experiment
- Tables from 60 sites
- 10 training tables
- 50 test tables
- 357 mappings (from all 60 sites)
- 172 direct mappings (same attribute and meaning)
- 185 indirect mappings (29 attribute synonyms, 5 Yes/No columns, 68 unions over columns for Feature, 19 factored values, and 89 columns of merged values that needed to be split)
33 Results
- 10 training tables
- 100% of the 57 mappings (no false mappings)
- 94.6% of the values in linked pages (5.4% false declarations)
- 50 test tables
- 94.7% of the 300 mappings (no false mappings)
- On the basis of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision
- 16 missed mappings
- 4 partial (not all unions included)
- 6 non-U.S. car-ads (unrecognized makes and models)
- 2 U.S. unrecognized makes and models
- 3 prices (missing or found MSRP instead)
- 1 mileage (mileages less than 1,000)
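The test-table figures are mutually consistent, which a short calculation can confirm: the five categories of missed mappings sum to 16, and 16 misses out of 300 mappings gives the reported recall. The category labels in this sketch are shortened paraphrases of the bullets above.

```python
test_mappings = 300

# Missed mappings by category, as listed on the slide.
missed = {
    "partial unions": 4,
    "non-U.S. makes and models": 6,
    "U.S. makes and models": 2,
    "prices": 3,
    "mileage": 1,
}

total_missed = sum(missed.values())
recall = (test_mappings - total_missed) / test_mappings

print(total_missed)      # 16
print(f"{recall:.1%}")   # 94.7%
```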
34 Conclusions
- Summary
- Transformed schema-matching problem to extraction
- Inferred semantic mappings
- Discovered source-to-target mapping rules
- Evidence of Success
- Tables (mappings): 95% (Recall), 100% (Precision)
- Linked Text (value extraction): 97% (Recall), 86% (Precision)
- Future Work
- Discover and exploit structure in linked text
- Broaden table understanding
- Integrate with current extraction tools
www.deg.byu.edu