Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure - PowerPoint PPT Presentation

About This Presentation
Title:

Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure

Description:

Car Year Make Model Mileage Price PhoneNr. 0001 1989 Subaru SW $1900 (336)835-8597 ... is trivial because we have OIDs for table rows (e.g. for each Car). ER 2002 ... – PowerPoint PPT presentation

Slides: 35
Provided by: davidw8
Learn more at: https://www.deg.byu.edu

Transcript and Presenter's Notes

Title: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure


1
Automatically Extracting Ontologically Specified
Data from HTML Tables with Unknown Structure
  • David W. Embley, Cui Tao, Stephen W. Liddle
  • Brigham Young University

Funded by NSF
2
Information Exchange
Source
Target
Information Extraction
Schema Matching
3
Information Extraction
4
Extracting Pertinent Information from Documents
5
A Conceptual-Modeling Solution
6
Car-Ads Ontology
  • Car [-> object];
  • Car [0..1] has Year [1..*];
  • Car [0..1] has Make [1..*];
  • Car [0..1] has Model [1..*];
  • Car [0..1] has Mileage [1..*];
  • Car [0..*] has Feature [1..*];
  • Car [0..1] has Price [1..*];
  • PhoneNr [1..*] is for Car [0..*];
  • PhoneNr [0..1] has Extension [1..*];
  • Year matches [4]
  • constant { extract "\d{2}";
  • context "([^\d])[4-9]\d[^\d]";
  • substitute "^" -> "19"; }
  • End;
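The Year data frame above can be sketched in Python. This is a hedged illustration only, not the authors' ontology engine; the exact context regex is an assumption, since the slide's pattern was garbled in transcription:

```python
import re

def recognize_year(text):
    """Sketch of the Car-Ads ontology's Year data frame: accept a
    4-digit year outright, or extract a 2-digit year such as '95'
    and substitute the '19' century prefix."""
    # Year matches [4]: a plain 4-digit year
    m = re.search(r"\b(19|20)\d{2}\b", text)
    if m:
        return m.group(0)
    # constant: extract \d{2} in a [4-9]\d context (e.g. "95 Honda")
    m = re.search(r"(?<!\d)([4-9]\d)(?!\d)", text)
    if m:
        # substitute "^" -> "19": prepend the century
        return "19" + m.group(1)
    return None
```

For example, `recognize_year("95 Honda Civic")` yields `"1995"`, matching the slide's substitute rule.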

7
Recognition and Extraction
8
Schema Matching for HTML Tables with Unknown
Structure
9
Table-Schema Matching (Basic Idea)
  • Many Tables on the Web
  • Ontology-Based Extraction
  • Works well for unstructured or semistructured
    data
  • What about structured data tables?
  • Method
  • Form attribute-value pairs
  • Do extraction
  • Infer mappings from extraction patterns

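The first of the three method steps, forming attribute-value pairs, can be sketched as pairing each data cell with its column attribute (a minimal illustration; the sample table data is hypothetical):

```python
def form_attribute_value_pairs(header, rows):
    """Pair each cell in a data row with its column attribute,
    yielding one list of (attribute, value) pairs per table row."""
    return [list(zip(header, row)) for row in rows]

# Hypothetical car-ads table in the style of the slides
header = ["Year", "Make", "Model", "Price"]
rows = [["1989", "Subaru", "SW", "$1900"]]
pairs = form_attribute_value_pairs(header, rows)
# pairs[0] == [("Year", "1989"), ("Make", "Subaru"), ("Model", "SW"), ("Price", "$1900")]
```

Extraction then runs over these pairs, and the mappings are inferred from which ontology concepts recognize which attributes and values.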
10
Problem: Different Schemas
  • Target Database Schema
  • (Car, Year, Make, Model, Mileage, Price, PhoneNr), (PhoneNr, Extension), (Car, Feature)
  • Different Source Table Schemas
  • Run, Yr, Make, Model, Tran, Color, Dr
  • Make, Model, Year, Colour, Price, Auto, Air
    Cond., AM/FM, CD
  • Vehicle, Distance, Price, Mileage
  • Year, Make, Model, Trim, Invoice/Retail, Engine,
    Fuel Economy

11
Problem: Attribute is Value
12
Problem: Attribute-Value is Value
13
Problem: Value is not Value
14
Problem: Implied Values
15
Problem: Missing Attributes
16
Problem: Compound Attributes
17
Problem: Factored Values
18
Problem: Split Values
19
Problem: Merged Values
20
Problem: Values not of Interest
21
Problem: Information Behind Links
22
Solution
  • Form attribute-value pairs (adjust if necessary)
  • Do extraction
  • Infer mappings from extraction patterns

23
Solution: Remove Internal Factoring
Discover Nesting: Make, (Model, (Year, Colour,
Price, Auto, Air Cond., AM/FM, CD))
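Removing internal factoring amounts to filling each factored attribute down into its nested rows, so every row carries a complete tuple. A sketch, under the assumption that factored-out cells appear as empty strings in the rows beneath their value:

```python
def unfactor(rows):
    """Fill factored (blank) cells with the value carried
    down from the most recent non-blank cell above."""
    carried = [""] * len(rows[0])
    result = []
    for row in rows:
        filled = [cell if cell != "" else carried[i]
                  for i, cell in enumerate(row)]
        carried = filled
        result.append(filled)
    return result

# Hypothetical factored table in the style of slide 23
table = [
    ["ACURA", "Integra", "1998"],
    ["",      "Legend",  "1995"],  # Make factored out
    ["",      "",        "1994"],  # Make and Model factored out
]
# unfactor(table)[2] == ["ACURA", "Legend", "1994"]
```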
24
Solution: Replace Boolean Values
[Slide shows a source-table excerpt: ACURA Legend rows with Yes/No feature columns]
25
Solution: Form Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>,
<Colour, White>, <Price, 6300>, <Auto, Auto>,
<Air Cond., Air Cond.>, <AM/FM, AM/FM>, <CD, >
26
Solution: Adjust Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>,
<Colour, White>, <Price, 6300>, <Auto>,
<Air Cond.>, <AM/FM>
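The adjustment step above can be sketched as: for boolean columns, a "Yes" value is replaced by the column's attribute name (so it can later be extracted as a Feature) and a "No" or blank pair is dropped. This is an illustrative encoding, not the authors' implementation:

```python
def adjust_pairs(pairs, boolean_attrs):
    """Adjust attribute-value pairs for boolean columns:
    (Attr, "Yes") becomes (Attr, Attr); (Attr, anything else)
    is dropped; non-boolean pairs pass through unchanged."""
    adjusted = []
    for attr, value in pairs:
        if attr in boolean_attrs:
            if value == "Yes":
                adjusted.append((attr, attr))
        else:
            adjusted.append((attr, value))
    return adjusted

pairs = [("Make", "Honda"), ("Year", "1995"), ("Auto", "Yes"), ("CD", "No")]
# adjust_pairs(pairs, {"Auto", "CD"})
# == [("Make", "Honda"), ("Year", "1995"), ("Auto", "Auto")]
```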
27
Solution: Do Extraction
28
Solution: Infer Mappings
(Car, Year, Make, Model, Mileage, Price, PhoneNr),
(PhoneNr, Extension), (Car, Feature)
29
Solution: Do Extraction
(Car, Year, Make, Model, Mileage, Price, PhoneNr),
(PhoneNr, Extension), (Car, Feature)
30
Solution: Do Extraction
π_Price(Table)
(Car, Year, Make, Model, Mileage, Price, PhoneNr),
(PhoneNr, Extension), (Car, Feature)
31
Solution: Do Extraction
ρ_Colour→Feature(π_Colour(Table)) ∪
ρ_Auto→Feature(π_Auto(β_Yes→Auto(Table))) ∪
ρ_Air Cond.→Feature(π_Air Cond.(β_Yes→Air Cond.(Table))) ∪
ρ_AM/FM→Feature(π_AM/FM(β_Yes→AM/FM(Table))) ∪
ρ_CD→Feature(π_CD(β_Yes→CD(Table)))
(Car, Year, Make, Model, Mileage, Price, PhoneNr),
(PhoneNr, Extension), (Car, Feature)
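The Feature mapping on this slide unions several source columns: Colour values directly, plus each boolean column's attribute name wherever its value is "Yes". A sketch in which dictionaries stand in for relational rows and plain list operations stand in for π, ρ, and ∪ (the row data is hypothetical):

```python
def map_features(rows, direct_cols, boolean_cols):
    """Union over columns: project each contributing column,
    rename it to Feature, and collect the values per row."""
    features = []
    for row in rows:
        # π + ρ over direct columns such as Colour
        fs = [row[c] for c in direct_cols if row.get(c)]
        # β: a "Yes" in a boolean column contributes the column name
        fs += [c for c in boolean_cols if row.get(c) == "Yes"]
        features.append(fs)
    return features

rows = [{"Colour": "White", "Auto": "Yes",
         "Air Cond.": "Yes", "AM/FM": "Yes", "CD": ""}]
# map_features(rows, ["Colour"], ["Auto", "Air Cond.", "AM/FM", "CD"])[0]
# == ["White", "Auto", "Air Cond.", "AM/FM"]
```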
32
Experiment
  • Tables from 60 sites
  • 10 training tables
  • 50 test tables
  • 357 mappings (from all 60 sites)
  • 172 direct mappings (same attribute and meaning)
  • 185 indirect mappings (29 attribute synonyms, 5
    Yes/No columns, 68 unions over columns for
    Feature, 19 factored values, and 89 columns of
    merged values that needed to be split)

33
Results
  • 10 training tables
  • 100% of the 57 mappings (no false mappings)
  • 94.6% of the values in linked pages (5.4% false
    declarations)
  • 50 test tables
  • 94.7% of the 300 mappings (no false mappings)
  • On the basis of sampling 3,000 values in linked
    pages, we obtained 97% recall and 86% precision
  • 16 missed mappings:
  • 4 partial (not all unions included)
  • 6 non-U.S. car ads (unrecognized makes and
    models)
  • 2 U.S. unrecognized makes and models
  • 3 prices (missing or found MSRP instead)
  • 1 mileage (mileages less than 1,000)

34
Conclusions
  • Summary
  • Transformed schema-matching problem to extraction
  • Inferred semantic mappings
  • Discovered source-to-target mapping rules
  • Evidence of Success
  • Tables (mappings): 95% recall, 100% precision
  • Linked text (value extraction): 97% recall, 86% precision
  • Future Work
  • Discover and exploit structure in linked text
  • Broaden table understanding
  • Integrate with current extraction tools

www.deg.byu.edu