Table Interpretation by Sibling Page Comparison

Provided by: cui1
Learn more at: https://www.deg.byu.edu
Transcript and Presenter's Notes

1
Table Interpretation by Sibling Page Comparison
Cui Tao, David W. Embley
Data Extraction Group
Department of Computer Science
Brigham Young University
Supported by NSF
2
Table Interpretation (in context)
  • Context: Table Understanding
    • Table Recognition
    • Table Interpretation
    • Table Conceptualization
    • Table Understanding
  • Applications
    • Not only understanding with respect to community knowledge
    • But also creation or augmentation of community knowledge
  • Challenging Conceptual-Modeling Work

3
Table Interpretation (in context)
  • Context: Table Understanding
    • Table Recognition
    • Table Interpretation with Sibling Pages
    • Table Conceptualization
    • Table Understanding
  • Applications
    • Not only understanding with respect to community knowledge
    • But also creation or augmentation of community knowledge
  • Challenging Conceptual-Modeling Work

TISP
4
TISP: Table Recognition and Interpretation
  • Recognize tables (discard non-tables)
  • Locate table labels
  • Locate table values
  • Find label/value associations

5
Recognize Tables
Layout Tables (discard)
Data Table
Nested Data Tables
6
Locate Table Labels
Examples:
Identification.Gene model(s).Protein
Identification.Gene model(s).2
7
Locate Table Labels
Examples:
Identification.Gene model(s).Gene Model
Identification.Gene model(s).2
8
Locate Table Values
Value
9
Find Label/Value Associations
Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WPCE28918
10
Conceptual Table Interpretation
Table Ontology
Wang Notation [Wang96]: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WPCE28918
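
The label/value association above can be pictured as a lookup keyed by label paths. The following is only a minimal sketch: the class name InterpretedTable and the helper parse_label are hypothetical and are not part of TISP or of the table ontology shown in the slides. It assumes each label is a dot-separated path and each value is addressed by a tuple of such paths.

```python
from typing import Dict, Tuple

# Assumption: a label is a dot-separated path; a value is addressed by a
# tuple of label paths. These names are illustrative only.
LabelPath = Tuple[str, ...]
Key = Tuple[LabelPath, ...]


def parse_label(dotted: str) -> LabelPath:
    """Split 'A.B.C' into the path ('A', 'B', 'C')."""
    return tuple(part.strip() for part in dotted.split("."))


class InterpretedTable:
    """Stores label-path tuples mapped to cell values."""

    def __init__(self) -> None:
        self._cells: Dict[Key, str] = {}

    def associate(self, labels, value: str) -> None:
        """Record that the given labels jointly identify `value`."""
        self._cells[tuple(parse_label(label) for label in labels)] = value

    def lookup(self, *labels: str) -> str:
        """Return the value identified by the given labels."""
        return self._cells[tuple(parse_label(label) for label in labels)]


table = InterpretedTable()
table.associate(
    ["Identification.Gene model(s).Protein", "Identification.Gene model(s).2"],
    "WPCE28918",
)
```

Calling table.lookup with the same two label strings retrieves the stored value; any other label combination raises KeyError in this sketch.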
11
Interpretation Technique: Sibling Page Comparison
12
Interpretation Technique: Sibling Page Comparison
Same
13
Interpretation Technique: Sibling Page Comparison
Almost Same
14
Interpretation Technique: Sibling Page Comparison
Different
Same
15
Technique Details
  • Unnest tables
  • Match tables in sibling pages
    • Perfect match (table for layout → discard)
    • Reasonable match (sibling table)
  • Determine/Use Table-Structure Pattern
    • Discover pattern
    • Pattern usage
    • Dynamic pattern adjustment

16
Table Unnesting
17
Match Based on DOM Tree
18
Simple Tree Matching Algorithm [Yang91]
Match Score Categorization: Exact/Near-Exact, Sibling-Table, False
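
A minimal sketch of simple tree matching follows, using the usual dynamic-programming formulation over ordered child lists. Several things here are assumptions rather than details from the slides: trees are modeled as (tag, children) tuples instead of real DOM nodes, the score is normalized by the average tree size, and the thresholds in categorize are placeholder values chosen only for illustration.

```python
def simple_tree_matching(a, b):
    """Count the nodes in a maximum top-down matching of ordered trees a and b.

    Each tree is a (tag, [children]) tuple. Different root tags match nothing.
    """
    if a[0] != b[0]:
        return 0
    kids_a, kids_b = a[1], b[1]
    # m[i][j]: best matching between the first i children of a and first j of b.
    m = [[0] * (len(kids_b) + 1) for _ in range(len(kids_a) + 1)]
    for i in range(1, len(kids_a) + 1):
        for j in range(1, len(kids_b) + 1):
            m[i][j] = max(
                m[i][j - 1],
                m[i - 1][j],
                m[i - 1][j - 1] + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return m[len(kids_a)][len(kids_b)] + 1  # +1 for the matched roots


def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])


def match_score(a, b):
    """Assumed normalization: matched nodes divided by the average tree size."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)


def categorize(score, near_exact=0.9, sibling=0.5):
    """Placeholder thresholds; the slides do not state the actual cut-offs."""
    if score >= near_exact:
        return "Exact/Near-Exact"
    if score >= sibling:
        return "Sibling-Table"
    return "False"


t1 = ("table", [("tr", [("td", []), ("td", [])])])
t2 = ("table", [("tr", [("td", [])]), ("tr", [("td", [])])])
```

With these inputs, t1 matched against itself scores 1.0, while t1 against t2 matches three nodes out of an average size of 4.5.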
19
Table Structure Patterns
  • Regularity Expectations
    • (<tr><(td|th)> L <(td|th)> V)^n
    • <tr>(<(td|th)> L)^n (<tr>(<(td|th)> V)^n)+

Pattern combinations are also possible.
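
One way to picture pattern discovery is to classify each cell as a label (L) or a value (V) and then test the resulting row encoding against the two regularities. The sketch below makes simplifying assumptions that are not from the slides: tables are plain lists of rows of cell text with identical shape on both sibling pages, a cell counts as a label exactly when its text is the same on both pages, and the function names are hypothetical.

```python
import re


def classify_cells(table_a, table_b):
    """Mark a cell 'L' if its text is identical in both sibling tables, else 'V'.

    Simplification: a value that happens to coincide on both pages is
    misread as a label.
    """
    return [
        ["L" if x == y else "V" for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(table_a, table_b)
    ]


def encode(rows):
    """Encode each row as 'R' followed by its cell codes, e.g. 'RLVRLV'."""
    return "".join("R" + "".join(row) for row in rows)


def detect_pattern(rows):
    """Return which of the two regularities the encoded table satisfies."""
    code = encode(rows)
    # Every row is one label followed by one value.
    if re.fullmatch(r"(?:RLV)+", code):
        return "label-value rows"
    # One row of n labels, then one or more rows of n values each.
    m = re.fullmatch(r"R(L+)((?:RV+)+)", code)
    if m:
        n = len(m.group(1))
        value_rows = m.group(2).split("R")[1:]
        if all(len(row) == n for row in value_rows):
            return "header row + value rows"
    return None


page_1 = [["Name", "Price"], ["A", "1"], ["B", "2"]]
page_2 = [["Name", "Price"], ["C", "3"], ["D", "4"]]
pattern = detect_pattern(classify_cells(page_1, page_2))
```

The column-count check is done in code rather than in the regular expression because a plain regex cannot enforce that every value row has the same n as the label row.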
20
Pattern Usage
(Location.Genetic Position) = X12.69 +/- 0.000 cM mapping data
(Location.Genomic Position) = X13518823..13515773 bp
21
Dynamic Pattern Adjustment
<tr>(<(td|th)> L)^5 (<tr>(<(td|th)> V)^5)+
<tr>(<(td|th)> L)^5 (<tr>(<(td|th)> V)^5)+
<tr>(<(td|th)> L)^6 (<tr>(<(td|th)> V)^6)+
22
TISP Evaluation
  • Applications
    • Commercial: car ads
    • Scientific: molecular biology
    • Geopolitical: US states and countries
  • Data: > 2,000 tables, 275 sibling tables, 35 web sites
  • Evaluation
    • Initial two sibling pages
      • Correct separation of data tables from layout tables?
      • Correct pattern recognition?
    • Remaining tables in site
      • Information properly extracted?
      • Able to detect and adjust for pattern variations?

23
Experimental Results
  • Table recognition: correctly discarded 157 of 158
    layout tables
  • Pattern recognition: correctly found 69 of 72
    structure patterns
  • Extraction and adjustments: 5 path adjustments
    and 34 label adjustments → all correct
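
The counts above can be turned into success rates with simple arithmetic. This is only a sketch of that calculation; the helper name success_rate is hypothetical, and these per-step rates are not the overall F-measure reported in the conclusions.

```python
def success_rate(correct: int, total: int) -> float:
    """Percentage of correctly handled cases."""
    return 100.0 * correct / total


layout_rate = success_rate(157, 158)   # layout tables correctly discarded
pattern_rate = success_rate(69, 72)    # structure patterns correctly found

print(f"Table recognition:   {layout_rate:.1f}%")
print(f"Pattern recognition: {pattern_rate:.1f}%")
```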

24
Discovered Difficulties
  • Abundance of null entries
  • Multiple tables as a single table
    • Recognize and group
    • Use box model [Gatterbauer07]
  • Factored labels

25
Table Understanding
  • Table Recognition
    • Data table vs. table for layout
    • Adjust (group table components, defactor labels, ...)
  • Table Interpretation
    • Populate table ontology
    • Additional table-ontology elements (title, footnotes, ...)
  • Table Conceptualization
    • Capture table semantics
    • Reverse engineer as a conceptual model
  • Table Understanding
    • Embed within a community ontology
    • Alternatively, augment community knowledge

26
Knowledge Generation
fleck     velter
          gonsity (ld/gg)    hepth (gd)
burlam    1.2                120
falder    2.3                230
multon    2.5                400
  • repeat
    • recognize table
    • interpret table
    • conceptualize table
    • merge
    • adjust
  • until ontology developed

Growing Ontology
TANGO (Table Analysis for Generating Ontologies)
repeatedly turns raw tables into conceptual
mini-ontologies and integrates them into a
growing ontology.
27
Conclusions and Future Opportunities
  • Conclusions
    • Table Interpretation: overall F-measure of 94.5%
    • Can successfully apply sibling-page technique
  • Future Opportunities
    • Table understanding
    • Knowledge generation
    • Challenging conceptual-modeling work

www.deg.byu.edu