Title: Semantically Conceptualizing and Annotating Tables
1Semantically Conceptualizing and Annotating Tables
- Stephen Lynn and David W. Embley
- Data Extraction Research Group
- Department of Computer Science
- Brigham Young University
Supported by the
2Overview
- Context
- WoK: Web of Knowledge
- TANGO: Table ANalysis for Generating Ontologies
- MOGO: Mini-Ontology GeneratOr
- Semantic Enrichment via MOGO
- Implementation
- Experimentation
- Enhancements
- Challenges and Opportunities
3WoK: a Web of Knowledge
4TANGO
fleck     velter
          gonsity (ld/gg)   hepth (gd)
burlam    1.2               120
falder    2.3               230
multon    2.5               400
TANGO repeatedly turns raw tables into conceptual
mini-ontologies and integrates them into a
growing ontology.
Growing Ontology
5MOGO
MOGO generates mini-ontologies from interpreted
tables.
6MOGO Overview
- Table
  - Interpretation
  - Yields a canonical table
- Canonical Table
  - Concept/Value Recognition
  - Relationship Discovery
  - Constraint Discovery
  - Yields a semantically enriched conceptual model
- Mini-ontology
  - Integration into a growing ontology
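One way to picture the canonical table that interpretation yields is as a mapping from label paths to data values. The sketch below is only an illustration under that assumption: the names `to_canonical`, `HEADER`, and `ROWS` are hypothetical, and the row nesting (states under regions) is supplied by hand rather than discovered.

```python
from typing import Dict, List, Optional, Tuple

LabelPath = Tuple[str, ...]

# Hypothetical hand-built input: each row carries its label path and its cell values.
HEADER: List[str] = ["Population (2000)", "Latitude", "Longitude"]
ROWS: List[Tuple[LabelPath, List[Optional[str]]]] = [
    (("Northeast",), ["2,122,869", None, None]),
    (("Northeast", "Delaware"), ["817,376", "45", "-90"]),
    (("Northeast", "Maine"), ["1,305,493", "44", "-93"]),
    (("Northwest",), ["9,690,665", None, None]),
    (("Northwest", "Oregon"), ["3,559,547", "45", "-120"]),
    (("Northwest", "Washington"), ["6,131,118", "43", "-120"]),
]


def to_canonical(
    header: List[str],
    rows: List[Tuple[LabelPath, List[Optional[str]]]],
    row_root: str = "Location",
) -> Dict[Tuple[LabelPath, LabelPath], str]:
    """Map (row label path, column label path) to the cell value, skipping empty cells."""
    table: Dict[Tuple[LabelPath, LabelPath], str] = {}
    for path, values in rows:
        for column, value in zip(header, values):
            if value is not None:
                table[((row_root,) + path, (column,))] = value
    return table


canonical = to_canonical(HEADER, ROWS)
print(canonical[(("Location", "Northeast", "Delaware"), ("Latitude",))])
```

Each later step (concept/value recognition, relationship discovery, constraint discovery) can then work from label paths and values instead of from the table's layout.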
7Sample Input
Region and State Information
Location        Population (2000)   Latitude   Longitude
Northeast       2,122,869
  Delaware      817,376             45         -90
  Maine         1,305,493           44         -93
Northwest       9,690,665
  Oregon        3,559,547           45         -120
  Washington    6,131,118           43         -120
Sample Output
8Concept/Value Recognition
- Lexical Clues
  - Labels as data values
  - Data value assignment
- Data Frame Clues
  - Labels as data values
  - Data value assignment
- Default
  - Recognize concepts and values by syntax and layout
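A minimal sketch of the "labels as data values" idea, applied to the labels of the sample table. The lexicons and the single regular-expression recognizer are stand-ins invented for this illustration; they are not MOGO's actual lexical resources or data frames.

```python
import re
from typing import Dict, List

# Hypothetical lexical clues: small word lists keyed by concept name.
LEXICONS: Dict[str, set] = {
    "Region": {"Northeast", "Northwest"},
    "State": {"Delaware", "Maine", "Oregon", "Washington"},
}

# Hypothetical data-frame clue: a recognizer for four-digit years.
DATA_FRAMES: Dict[str, "re.Pattern[str]"] = {
    "Year": re.compile(r"\b(?:19|20)\d{2}\b"),
}


def labels_as_values(labels: List[str]) -> Dict[str, List[str]]:
    """Return, per concept, the labels (or label fragments) recognized as its values."""
    found: Dict[str, List[str]] = {}
    for label in labels:
        for concept, words in LEXICONS.items():
            if label in words:
                found.setdefault(concept, []).append(label)
        for concept, pattern in DATA_FRAMES.items():
            for match in pattern.findall(label):
                found.setdefault(concept, []).append(match)
    return found


labels = ["Location", "Population (2000)", "Latitude", "Longitude",
          "Northeast", "Delaware", "Maine", "Northwest", "Oregon", "Washington"]
recognized = labels_as_values(labels)
print(recognized)
```

Labels that match no clue (here Location, Latitude, Longitude) would fall through to the default step, which recognizes concepts and values by syntax and layout.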
9Concept/Value Recognition
- Concept labels shown in the slide figure: Region, State
11Relationship Discovery
- Dimension Tree Mappings
- Lexical Clues
  - Generalization/Specialization
  - Aggregation
- Data Frames
- Ontology Fragment Merge
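As a rough illustration of how label nesting in a dimension tree can suggest relationships, the sketch below pairs the concept of each label with the concept of the label nested under it. The function name and the `concept_of` mapping are hypothetical, and deciding whether a pair is an aggregation or a generalization/specialization (the job of the lexical clues) is deliberately left out.

```python
from typing import Dict, Iterable, Set, Tuple


def nesting_relationships(
    paths: Iterable[Tuple[str, ...]],
    concept_of: Dict[str, str],
) -> Set[Tuple[str, str]]:
    """Collect (parent concept, child concept) pairs implied by nested labels."""
    relationships: Set[Tuple[str, str]] = set()
    for path in paths:
        for parent, child in zip(path, path[1:]):
            parent_concept = concept_of.get(parent)
            child_concept = concept_of.get(child)
            if parent_concept and child_concept and parent_concept != child_concept:
                relationships.add((parent_concept, child_concept))
    return relationships


# Label paths and concept assignments taken from the sample table (assigned by hand here).
paths = [
    ("Northeast", "Delaware"), ("Northeast", "Maine"),
    ("Northwest", "Oregon"), ("Northwest", "Washington"),
]
concept_of = {
    "Northeast": "Region", "Northwest": "Region",
    "Delaware": "State", "Maine": "State", "Oregon": "State", "Washington": "State",
}
candidates = nesting_relationships(paths, concept_of)
print(candidates)
```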
13Constraint Discovery
- Generalization/Specialization
- Computed Values
- Functional Relationships
- Optional Participation
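Three of these constraint kinds can be checked directly against the sample Region and State table. The checks below are a sketch under simple assumptions (populations parse as integers, nesting is given by hand); the helper names are hypothetical and do not come from MOGO.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Values copied from the sample table; None marks an empty cell.
ROWS: Dict[str, Dict[str, Optional[str]]] = {
    "Northeast": {"Population": "2,122,869", "Latitude": None, "Longitude": None},
    "Delaware": {"Population": "817,376", "Latitude": "45", "Longitude": "-90"},
    "Maine": {"Population": "1,305,493", "Latitude": "44", "Longitude": "-93"},
    "Northwest": {"Population": "9,690,665", "Latitude": None, "Longitude": None},
    "Oregon": {"Population": "3,559,547", "Latitude": "45", "Longitude": "-120"},
    "Washington": {"Population": "6,131,118", "Latitude": "43", "Longitude": "-120"},
}
CHILDREN: Dict[str, List[str]] = {
    "Northeast": ["Delaware", "Maine"],
    "Northwest": ["Oregon", "Washington"],
}


def to_int(text: str) -> int:
    """Parse an integer written with thousands separators."""
    return int(text.replace(",", ""))


def is_computed_sum(parent: str, attribute: str = "Population") -> bool:
    """Computed value: the parent's value equals the sum of its children's values."""
    total = sum(to_int(ROWS[child][attribute]) for child in CHILDREN[parent])
    return to_int(ROWS[parent][attribute]) == total


def is_functional(pairs: Iterable[Tuple[str, str]]) -> bool:
    """Functional relationship: no key is paired with two different values."""
    seen: Dict[str, str] = {}
    for key, value in pairs:
        if seen.setdefault(key, value) != value:
            return False
    return True


def is_optional(attribute: str) -> bool:
    """Optional participation: at least one row has no value for the attribute."""
    return any(row[attribute] is None for row in ROWS.values())


print(is_computed_sum("Northeast"), is_computed_sum("Northwest"))
print(is_functional((name, row["Population"]) for name, row in ROWS.items()))
print(is_optional("Latitude"), is_optional("Population"))
```

In the sample table each region's population is the sum of its states' populations, every location has exactly one population, and latitude and longitude are missing for regions.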
14Validation
- Concept/Value Recognition
  - Correctly identified concepts
  - Missed concepts
  - False positives
  - Data value assignments
- Relationship Discovery
  - Valid relationship sets
  - Invalid relationship sets
  - Missed relationship sets
- Constraint Discovery
  - Valid constraints
  - Invalid constraints
  - Missed constraints
                         Precision (%)   Recall (%)   F-measure (%)
Concept Recognition      87              94           90
Relationship Discovery   73              81           77
Constraint Discovery     89              91           90
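The F-measure column is consistent with the balanced F-measure, the harmonic mean of precision and recall. The sketch below assumes that definition and also shows how precision and recall would follow from counts of correct, incorrect, and missed items; the function names are illustrative only.

```python
def precision(correct: int, incorrect: int) -> float:
    """Fraction of generated items that are correct."""
    return correct / (correct + incorrect)


def recall(correct: int, missed: int) -> float:
    """Fraction of expected items that were generated."""
    return correct / (correct + missed)


def f_measure(p: float, r: float) -> float:
    """Balanced F-measure: harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)


# Precision and recall (in percent) as reported in the table above.
reported = {
    "Concept Recognition": (87, 94),
    "Relationship Discovery": (73, 81),
    "Constraint Discovery": (89, 91),
}
for task, (p, r) in reported.items():
    print(f"{task}: F = {f_measure(p, r):.1f}")
```

Rounded to whole percentages, these values agree with the reported F-measures of 90, 77, and 90.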
15Concept Recognition
- Counted
  - Correct/incorrect/missing concepts
  - Correct/incorrect/missing labels
  - Data value assignments
16Relationship Discovery
- Counted
  - Correct/incorrect/missing relationship sets
  - Correct/incorrect/missing aggregations and generalization/specializations
17Constraint Discovery
- Counted
  - Correct/incorrect/missing
    - Generalization/Specialization constraints
    - Computed value constraints
    - Functional constraints
    - Optional constraints
18Concept Recognition
- Successes
  - 98% of concepts identified
  - Missing label identification
  - 97% of values assigned to the correct concept
- Common problems
  - Finding an appropriate label
  - Duplicate concepts
19Relationship Discovery
- Recall of 92% for relationship sets
- Missing aggregations and generalization/specializations (only found in label nesting)
- Unnecessary relationship sets generated (they are computable)
20Constraint Discovery
- F-measure of 98% for functional relationship sets
- Computed value discovery
- Functional vs. non-functional? (lists in cells)
21MOGO Contributions
- Tool to generate mini-ontologies
- Accuracy encouraging
22Opportunities and Challenges: MOGO
- Enhancements
  - Check for inter-label relationships
  - Check for more complex computations
  - Check for lists in cells
- Wish List
  - Data-frame library
  - Atomic knowledge components
  - Instance recognizers
  - Library of molecular components
  - Semi-automatic construction of a WordNet-like resource for knowledge components
23Summary
- MOGO
- Semantic Enrichment
- Encouraging Results
- But More Possible
- Broader Implications: Vision and Challenges
- TANGO
- WoK
- Web of Data
- Semantic Annotation
- User-friendly Query Answering
www.deg.byu.edu embley_at_cs.byu.edu
24Opportunities and Challenges: TANGO
- Table Interpretation
  - Transforming tables to F-logic [Pivk07]
  - Layout-independent table representation [Jha08]
  - Table interpretation by sibling tables [Tao07]
- Semantic Enhancement / Ontology Generation
  - Naming unnamed table concepts [Pivk07]
  - MOGO [Lynn09]
- Semi-automatic Ontology Integration
  - Ontology Matching [Euzenat07]
  - Ontology-mapping tools [Falconer07]
  - Direct and indirect schema mappings for TANGO [Xu06]
25Opportunities and Challenges: WoK
- Web of Data
  - "The Semantic Web is a web of data." [W3C]
  - Upcoming special issue of Journal of Web Semantics
  - Enabling a Web of Knowledge [Tao09]
- Information Extraction
  - Domain-independent IE from web tables [Gatterbauer07]
  - Open IE [Banko07]
26Opportunities and Challenges: WoK
- Semantic Annotation wrt Ontologies
  - Linking Data to Ontologies [Poggi08]
  - TISP [Tao07]
  - FOCIH [Tao09]
- Reasoning and Query Answering
  - Description Logics [Baader03]
  - NLIDB Community
  - AskOntos [Ding06]
  - SerFR [Al-Muhammed07]
27References
- [Al-Muhammed07] Al-Muhammed and Embley, Ontology-Based Constraint Recognition for Free-Form Service Requests, Proceedings of the 23rd International Conference on Data Engineering, 2007.
- [Baader03] Baader, Calvanese, McGuinness, Nardi and Patel-Schneider, The Description Logic Handbook, Cambridge University Press, 2003.
- [Banko07] Banko, Cafarella, Soderland, Broadhead and Etzioni, Open Information Extraction from the Web, Proceedings of the International Joint Conference on Artificial Intelligence, 2007.
- [Ding06] Ding, Embley and Liddle, Automatic Creation and Simplified Querying of Semantic Web Content: An Approach Based on Information-Extraction Ontologies, Proceedings of the First Asian Semantic Web Conference, 2006.
- [Euzenat07] Euzenat and Shvaiko, Ontology Matching, Springer Verlag, 2007.
- [Falconer07] Falconer, Noy and Storey, Ontology Mapping: A User Survey, Proceedings of the Second International Workshop on Ontology Mapping, 2007.
- [Gatterbauer07] Gatterbauer, Bohunsky, Herzog and Pollak, Towards Domain-Independent Information Extraction from Web Tables, Proceedings of the Sixteenth International World Wide Web Conference, 2007.
- [Jha08] Jha and Nagy, Wang Notation Tool: Layout Independent Representation of Tables, Proceedings of the 19th International Conference on Pattern Recognition, 2008.
- [Pivk07] Pivk, Sure, Cimiano, Gams, Rajkovic and Studer, Transforming Arbitrary Tables into Logical Form with TARTAR, Data & Knowledge Engineering, 2007.
- [Poggi08] Poggi, Lembo, Calvanese, De Giacomo, Lenzerini and Rosati, Linking Data to Ontologies, Journal on Data Semantics, 2008.
- [Tao07] Tao and Embley, Automatic Hidden-Web Table Interpretation by Sibling Page Comparison, Proceedings of the 26th International Conference on Conceptual Modeling, 2007.
- [Tao09] Tao, Embley and Liddle, Enabling a Web of Knowledge, Technical Report, tango.byu.edu/papers, 2009.
- [Xu06] Xu and Embley, A Composite Approach to Automating Direct and Indirect Schema Mappings, Information Systems, 2006.