Title: Table Interpretation by Sibling Page Comparison
1. Table Interpretation by Sibling Page Comparison
Cui Tao and David W. Embley
Data Extraction Group
Department of Computer Science
Brigham Young University
Supported by NSF
2. Table Interpretation (in context)
- Context: Table Understanding
  - Table Recognition
  - Table Interpretation
  - Table Conceptualization
  - Table Understanding
- Applications
  - Not only understanding with respect to community knowledge
  - But also creation or augmentation of community knowledge
- Challenging Conceptual-Modeling Work
3. Table Interpretation (in context)
- Context: Table Understanding
  - Table Recognition
  - Table Interpretation with Sibling Pages (TISP)
  - Table Conceptualization
  - Table Understanding
- Applications
  - Not only understanding with respect to community knowledge
  - But also creation or augmentation of community knowledge
- Challenging Conceptual-Modeling Work
4. TISP: Table Recognition and Interpretation
- Recognize tables (discard non-tables)
- Locate table labels
- Locate table values
- Find label/value associations
5. Recognize Tables
Layout Tables (discard)
Data Table
Nested Data Tables
6. Locate Table Labels
Examples:
Identification.Gene model(s).Protein
Identification.Gene model(s).2
7. Locate Table Labels
Examples:
Identification.Gene model(s).Gene Model
Identification.Gene model(s).2
8. Locate Table Values
Value
9. Find Label/Value Associations
Example:
(Identification.Gene model(s).Protein, Identification.Gene model(s).2) → WPCE28918
10. Conceptual Table Interpretation
Table Ontology
Wang Notation [Wang96]:
(Identification.Gene model(s).Protein, Identification.Gene model(s).2) → WPCE28918
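
A minimal sketch of how a label/value association of this kind might be held in memory, assuming a plain Python dictionary keyed by tuples of dotted label paths. The names `interpretation` and `lookup` are illustrative only and are not taken from TISP.

```python
# Hypothetical in-memory form of a Wang-style interpretation:
# each key is a tuple of dotted label paths (one per dimension),
# each value is the table cell those labels jointly identify.
interpretation = {
    (
        "Identification.Gene model(s).Protein",
        "Identification.Gene model(s).2",
    ): "WPCE28918",
}


def lookup(table, *label_paths):
    """Return the value addressed by the given label paths, or None."""
    return table.get(tuple(label_paths))


value = lookup(
    interpretation,
    "Identification.Gene model(s).Protein",
    "Identification.Gene model(s).2",
)
print(value)
```

Keying on the full tuple of label paths keeps the association order-sensitive, which matches the way the example lists one label path per dimension.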
11. Interpretation Technique: Sibling Page Comparison
12. Interpretation Technique: Sibling Page Comparison
Same
13. Interpretation Technique: Sibling Page Comparison
Almost Same
14. Interpretation Technique: Sibling Page Comparison
Different
Same
15. Technique Details
- Unnest tables
- Match tables in sibling pages
  - Perfect match (table for layout → discard)
  - Reasonable match (sibling table)
- Determine/Use Table-Structure Pattern
  - Discover pattern
  - Pattern usage
  - Dynamic pattern adjustment
16. Table Unnesting
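
One possible reading of unnesting is to pull every table out of the page as its own tree, with any inner tables removed from the copy. The sketch below assumes well-formed XHTML so that the standard `xml.etree.ElementTree` parser can be used (real web pages usually need a more tolerant parser), and the function name `unnest_tables` is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def unnest_tables(root):
    """Return each <table> under root as a separate copy without inner tables."""
    flat = []
    for table in root.iter("table"):
        clone = copy.deepcopy(table)
        # Strip tables nested anywhere inside this copy.
        for parent in list(clone.iter()):
            for child in list(parent):
                if child.tag == "table":
                    parent.remove(child)
        flat.append(clone)
    return flat


# Illustrative page fragment: an outer table whose second cell holds a table.
html = (
    "<table><tr><td>Identification</td><td>"
    "<table><tr><td>Gene model(s)</td><td>Protein</td></tr></table>"
    "</td></tr></table>"
)
tables = unnest_tables(ET.fromstring(html))
texts = [[td.text for td in t.iter("td")] for t in tables]
print(texts)
```

This sketch does not record which outer cell each inner table came from; a fuller version would keep that link so the outer label can prefix the inner label paths.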
17. Match Based on DOM Tree
18. Simple Tree Matching Algorithm
[Yang91]
Match Score Categorization: Exact/Near-Exact, Sibling-Table, False
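
The simple tree matching recurrence can be sketched as follows: two trees match only if their roots carry the same label, and the children are aligned in order by dynamic programming. The `Node` class, the normalization by average tree size, and the two thresholds in `categorize` are assumptions made for illustration; the slides do not state the cut-off values TISP uses.

```python
class Node:
    """Minimal ordered, labeled tree node (a stand-in for a DOM node)."""

    def __init__(self, tag, children=None):
        self.tag = tag
        self.children = children or []


def size(node):
    """Count the nodes in the tree rooted at node."""
    return 1 + sum(size(c) for c in node.children)


def simple_tree_match(a, b):
    """Return the number of node pairs in a maximum ordered matching."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = dp[i - 1][j - 1] + simple_tree_match(
                a.children[i - 1], b.children[j - 1]
            )
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], paired)
    return dp[m][n] + 1  # +1 for the matched roots


def categorize(a, b, exact=0.95, sibling=0.5):
    """Bucket a normalized score; both thresholds are hypothetical."""
    score = simple_tree_match(a, b) / ((size(a) + size(b)) / 2)
    if score >= exact:
        return "exact/near-exact"
    if score >= sibling:
        return "sibling-table"
    return "false"


def row():
    return Node("tr", [Node("td"), Node("td")])


t1 = Node("table", [row()])
t2 = Node("table", [row(), row()])
print(simple_tree_match(t1, t2), categorize(t1, t1), categorize(t1, t2))
```

An exact or near-exact match between sibling pages suggests a table used for layout, which is discarded, while a moderate score suggests a sibling data table.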
19. Table Structure Patterns
- Regularity Expectations
  - (<tr><(td|th)> L <(td|th)> V)^n
  - <tr>(<(td|th)> L)^n (<tr>(<(td|th)> V)^n)+
  - ...
Pattern combinations are also possible.
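
A rough sketch of how the two regularity expectations above might be checked. It assumes two sibling tables have already been reduced to grids of cell text with the same shape, tags a cell as a label (L) when its text is identical in both and as a value (V) otherwise, and then tests the tagged rows against each pattern. The function names and the car-ad cell contents are made up for illustration.

```python
import re


def tag_cells(table_a, table_b):
    """Tag each cell 'L' if identical in both sibling tables, else 'V'."""
    return [
        ["L" if x == y else "V" for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(table_a, table_b)
    ]


def detect_pattern(tagged_rows):
    """Return a name for the matching structure pattern, or None."""
    # Encode every row as 'R' followed by its cell tags, e.g. 'RLVRLV'.
    encoded = "".join("R" + "".join(row) for row in tagged_rows)
    if re.fullmatch(r"(?:RLV)+", encoded):
        return "label-value rows"
    match = re.fullmatch(r"R(L+)((?:RV+)+)", encoded)
    if match:
        n = len(match.group(1))
        value_rows = match.group(2).split("R")[1:]
        if all(len(row) == n for row in value_rows):
            return "label row, then value rows"
    return None


# Made-up sibling tables, one label/value pair per row.
by_row = tag_cells(
    [["Make", "Ford"], ["Year", "2001"]],
    [["Make", "Honda"], ["Year", "1999"]],
)
# Made-up sibling tables with a label row followed by a value row.
by_column = tag_cells(
    [["Make", "Year"], ["Ford", "2001"]],
    [["Make", "Year"], ["Honda", "1999"]],
)
print(detect_pattern(by_row), detect_pattern(by_column))
```

Tagging by cell equality is a simplification: a value that happens to coincide on both pages would be mistaken for a label, so more sibling pages would be needed in practice.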
20. Pattern Usage
(Location.Genetic Position) → X12.69 +/- 0.000 cM mapping data
(Location.Genomic Position) → X13518823..13515773 bp
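
Once a label/value row pattern is known, applying it amounts to reading the first cell of each row as a label and the second as its value, then prefixing the label with the enclosing label path. The sketch below assumes the rows have already been pulled out of the table as pairs of strings; `extract_label_value_rows` is a hypothetical helper name.

```python
def extract_label_value_rows(rows, prefix=()):
    """Map each dotted label path to its value for rows of (label, value)."""
    extracted = {}
    for label, value in rows:
        path = ".".join(prefix + (label,))
        extracted[path] = value
    return extracted


rows = [
    ("Genetic Position", "X12.69 +/- 0.000 cM mapping data"),
    ("Genomic Position", "X13518823..13515773 bp"),
]
extracted = extract_label_value_rows(rows, prefix=("Location",))
for path, value in extracted.items():
    print(f"({path}) -> {value}")
```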
21. Dynamic Pattern Adjustment
<tr>(<(td|th)> L)^5 (<tr>(<(td|th)> V)^5)+
<tr>(<(td|th)> L)^5 (<tr>(<(td|th)> V)^5)+
<tr>(<(td|th)> L)^6 (<tr>(<(td|th)> V)^6)+
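
The adjustment from five label columns to six could be sketched like this, assuming a pattern is stored simply as the list of labels in the header row. When a later table has a different header but still has value rows of matching width, the stored labels are replaced before extraction. The helper `apply_pattern` and the car-ad data are invented for illustration.

```python
def apply_pattern(labels, table):
    """Extract records with a label-row pattern, adjusting labels if needed.

    Returns the (possibly adjusted) labels and one dict per value row.
    """
    header, *value_rows = table
    width = len(header)
    if any(len(row) != width for row in value_rows):
        raise ValueError("table does not fit a label-row pattern")
    if list(header) != list(labels):
        # The table still has the expected shape, so adopt its labels.
        labels = list(header)
    records = [dict(zip(labels, row)) for row in value_rows]
    return labels, records


# Pattern learned from the first sibling pages: five labels (made up).
known = ["Make", "Model", "Year", "Price", "Mileage"]
# A later table on the same site adds a sixth column (made up).
later_table = [
    ["Make", "Model", "Year", "Price", "Mileage", "Color"],
    ["Ford", "Focus", "2001", "4500", "80000", "Blue"],
]
labels, records = apply_pattern(known, later_table)
print(len(labels), records[0]["Color"])
```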
22. TISP Evaluation
- Applications
  - Commercial: car ads
  - Scientific: molecular biology
  - Geopolitical: US states and countries
- Data: more than 2,000 tables, 275 sibling tables, 35 web sites
- Evaluation
  - Initial two sibling pages
    - Correct separation of data tables from layout tables?
    - Correct pattern recognition?
  - Remaining tables in site
    - Information properly extracted?
    - Able to detect and adjust for pattern variations?
23. Experimental Results
- Table recognition: correctly discarded 157 of 158 layout tables
- Pattern recognition: correctly found 69 of 72 structure patterns
- Extraction and adjustments: 5 path adjustments and 34 label adjustments, all correct
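
For readers who prefer percentages, the two counts above convert as in the short sketch below. The dictionary keys are just descriptive labels chosen here.

```python
# Counts reported on the slide: (correct, total).
results = {
    "layout tables discarded": (157, 158),
    "structure patterns found": (69, 72),
}

rates = {
    name: round(100 * correct / total, 1)
    for name, (correct, total) in results.items()
}
for name, rate in rates.items():
    print(f"{name}: {rate}%")
```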
24. Discovered Difficulties
- Abundance of null entries
- Multiple tables as a single table
  - Recognize and group
  - Use box model [Gatterbauer07]
- Factored labels
25. Table Understanding
- Table Recognition
  - Data table vs. table for layout
  - Adjust (group table components, defactor labels, ...)
- Table Interpretation
  - Populate table ontology
  - Additional table-ontology elements (title, footnotes, ...)
- Table Conceptualization
  - Capture table semantics
  - Reverse engineer as a conceptual model
- Table Understanding
  - Embed within a community ontology
  - Alternatively, augment community knowledge
26. Knowledge Generation
fleck    velter            velter
fleck    gonsity (ld/gg)   hepth (gd)
burlam   1.2               120
falder   2.3               230
multon   2.5               400
- repeat
  - recognize table
  - interpret table
  - conceptualize table
  - merge
  - adjust
- until ontology developed
Growing Ontology
TANGO (Table Analysis for Generating Ontologies)
repeatedly turns raw tables into conceptual
mini-ontologies and integrates them into a
growing ontology.
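
The conceptualize and merge steps of this loop can be sketched very roughly as below. This is not TANGO's actual representation: it assumes each interpreted table is reduced to a list of labels whose first entry names the object concept and whose remaining entries name its attributes, and it leaves out recognition, interpretation, and the adjust step. The label "dunth" in the second table is invented in the same nonsense style as the slide's example.

```python
def conceptualize(table):
    """Build a mini-ontology: object concept -> set of attribute concepts."""
    key, *attributes = table["labels"]
    return {key: set(attributes)}


def merge(ontology, mini_ontology):
    """Fold a mini-ontology into the growing ontology by set union."""
    for concept, attributes in mini_ontology.items():
        ontology.setdefault(concept, set()).update(attributes)
    return ontology


def grow_ontology(tables):
    """Conceptualize and merge each table in turn."""
    ontology = {}
    for table in tables:
        merge(ontology, conceptualize(table))
    return ontology


tables = [
    {"labels": ["fleck", "gonsity", "hepth"]},  # labels from the slide's table
    {"labels": ["fleck", "gonsity", "dunth"]},  # hypothetical second table
]
ontology = grow_ontology(tables)
print(sorted(ontology["fleck"]))
```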
27. Conclusions and Future Opportunities
- Conclusions
  - Table interpretation: overall F-measure of 94.5%
  - Can successfully apply sibling-page technique
- Future Opportunities
  - Table understanding
  - Knowledge generation
  - Challenging conceptual-modeling work
www.deg.byu.edu