Title: From Tessellations to Table Interpretation
- R. C. Jandhyala¹, M. Krishnamoorthy¹, G. Nagy¹, R. Padmanabhan¹, S. Seth², W. Silversmith¹
- ¹ DocLab, Rensselaer Polytechnic Institute
- ² Computer Science and Engineering, University of Nebraska-Lincoln
- (Supported by NSF Grants 044114854 and 0414644, and Rensselaer Center for Open Source Software)
2Goal
- Construction of a narrow-domain ontology from semi-structured web data (table understanding).
3Outline
- Tilings (rectangular tessellations)
- X-Y trees (1984)
- Grammars
- Tables
- Wang Categories (1996)
 5Web tables
- Cannot precisely define human-understandable 
 tables.
- Convert to smaller set of admissible tables. 
- Why? Algorithmic ease. 
6Admissible Tables
- Have stub, headings and data cells. 
7Factor out layout-equivalent tables 
 9Rectangular Tessellations
- Partition of an isothetic rectangle into 
 rectangles.
- Uniquely defined by junction points (location and 
 type).
- Number of tessellations increases rapidly with 
 table size.
10XY Tessellations
- Special case of rectangular tessellations. 
- Successive horizontal and vertical cuts. 
- Easily represented by trees. 
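A minimal sketch of how successive cuts could produce an X-Y tree, assuming each tile is given as a hypothetical (x0, y0, x1, y1) tuple and that the tiles form a valid tiling of their bounding rectangle. The function name `xy_tree` and the "V"/"H" node labels are illustrative choices, not taken from the slides.

```python
def xy_tree(rects, axis=0):
    """Build an X-Y tree by recursive cuts; return None if the tiling cannot be sliced.

    rects: list of (x0, y0, x1, y1) tiles assumed to tile their bounding box.
    axis:  0 tries vertical cut lines (x positions) first, 1 horizontal ones.
    """
    if len(rects) == 1:
        return rects[0]                      # leaf: a single tile
    for ax in (axis, 1 - axis):
        lo, hi = ax, ax + 2                  # tuple indices of the low/high edge on this axis
        edges = sorted({r[lo] for r in rects} | {r[hi] for r in rects})
        # A cut position is valid when no tile straddles it.
        cuts = [c for c in edges[1:-1]
                if all(r[hi] <= c or r[lo] >= c for r in rects)]
        if not cuts:
            continue
        bounds = [edges[0]] + cuts + [edges[-1]]
        children = []
        for a, b in zip(bounds, bounds[1:]):
            group = [r for r in rects if a <= r[lo] and r[hi] <= b]
            child = xy_tree(group, 1 - ax)   # alternate the cut direction
            if child is None:
                return None
            children.append(child)
        return ("V" if ax == 0 else "H", children)
    return None                              # no full cut in either direction


# A wide header tile above two side-by-side tiles.
header_table = [(0, 0, 2, 1), (0, 1, 1, 2), (1, 1, 2, 2)]
# Five tiles arranged as a pinwheel (spiral) around a centre tile.
pinwheel = [(0, 0, 2, 1), (2, 0, 3, 2), (1, 2, 3, 3), (0, 1, 1, 3), (1, 1, 2, 2)]

print(xy_tree(header_table))
print(xy_tree(pinwheel))
```

For the header layout the sketch returns one horizontal cut whose second part is cut vertically. For the pinwheel it returns None, because no straight cut crosses the whole rectangle in either direction.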
11A tiling and its X-Y tree (aka slicing structure, puzzle tree, tree map)
12Non-slicing structures: no XY tree
In fact, X-Y tilings are an infinitesimal fraction of all tilings. Restricting attention to them is harmless, because tables never contain this spiral structure.
 13Fundamental Idea
- Use XY trees to automate table processing and 
 understanding.
14Table to XY tree: EX2XY
- Applicable to any XY tessellation.
- Input: Excel table (copy and paste, or import; then edit to make admissible).
- Output: XY tree, as XML for portability and as a parenthesized string for grammars.
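A small sketch of the XML form of the output, under stated assumptions: the XY tree is held as nested Python lists with strings as leaves, and each node becomes a `block` element with a dotted `id`, with leaf text inside a `content` element. Those element names and the dotted ids follow the XML fragment shown in this deck; the nesting scheme and the helper name `to_xml` are assumptions, and the `range` attribute is omitted because its meaning is not stated.

```python
import xml.etree.ElementTree as ET


def to_xml(node, block_id="1"):
    """Convert a nested-list XY tree into <block>/<content> elements."""
    block = ET.Element("block", {"id": block_id})
    if isinstance(node, str):
        # Leaf: one table cell with its text.
        ET.SubElement(block, "content").text = node
    else:
        # Internal node: children are numbered 1, 2, ... under the parent id.
        for n, child in enumerate(node, start=1):
            block.append(to_xml(child, f"{block_id}.{n}"))
    return block


# Hypothetical tree: a title above two rows of two cells each.
tree = ["Title", [["Stub", "Header"], ["Row label", "Data"]]]
xml_text = ET.tostring(to_xml(tree), encoding="unicode")
print(xml_text)
```

Dotted ids make every block addressable by its path from the root, which is one plausible reason to use them for portability.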
15Example
(http://www40.statcan.ca/l01/cst01/econ50-eng.htm)
 16After import into Excel 
 17After Editing 
18Output - XML
<block id='1.1.2.1' range='17,230,2'>
 <content>
  Real gross domestic product, expenditure-based, by province and territory (millions of chained (2002) dollars)
 </content>
</block>
20Table Grammars
- Can characterize entire families of tables.
- Developed a grammar for one family.
- Input: nested parenthesized notation.
- Output: accept/reject as an example of the family.
21Grammar
- For parsing column headers
-  S → A (Rule 1)
-  A → B (Rule 2)
-  B → c X B | c X (Rules 3 and 4)
-  X → c X | A X | A | c (Rules 5, 6, 7 and 8)
- S is the start symbol.
- A generates all admissible column headers.
- B generates category trees.
- c is a root category.
- X generates sub-categories.
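A hedged sketch of an accept/reject check for this grammar. It reads the eight rules as S → A, A → B, B → c X B | c X, X → c X | A X | A | c; the arrows and alternation bars are an assumption about symbols that did not survive on the slide. The memoized recognizer and the name `accepts` are illustrative, not the authors' parser.

```python
from functools import lru_cache

# Rules 1-8 as read above; any symbol that is not a key is a terminal.
RULES = {
    "S": [("A",)],
    "A": [("B",)],
    "B": [("c", "X", "B"), ("c", "X")],
    "X": [("c", "X"), ("A", "X"), ("A",), ("c",)],
}


def accepts(tokens):
    """Return True if the token sequence can be derived from S."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def derives(symbol, i, j):
        # Can `symbol` derive exactly tokens[i:j]?
        if symbol not in RULES:
            return j == i + 1 and tokens[i] == symbol
        return any(matches(body, i, j) for body in RULES[symbol])

    @lru_cache(maxsize=None)
    def matches(body, i, j):
        # Can the symbols in `body`, in order, derive tokens[i:j]?
        if len(body) == 1:
            return derives(body[0], i, j)
        # No symbol derives the empty string, so every part is non-empty.
        return any(derives(body[0], i, k) and matches(body[1:], k, j)
                   for k in range(i + 1, j))

    return derives("S", 0, len(tokens))


print(accepts("c c c".split()))
print(accepts(["c"]))
```

With the rules read this way, c is the only terminal, so a root category needs at least one sub-category and a lone c is rejected. The grouping of tokens into category trees lives in the derivation, not in the flat token string.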
22Table Grammars
- Cannot check if table is consistent. 
- Need further geometric alignment and lexical 
 checks.
24Logical Structure of Tables
- How to interpret a table?
- Describe the relationship between header cells and content cells (Wang, U. Waterloo, 1996).
- Wang notation
- Elegant description.
- Dimensionality: number of category trees.
- The Cartesian product of the categories maps to the data cells.
25Layout-independent Wang notation
Tables with different layouts but the same information have the same Wang notation.
 26Wang Category Trees for either table
- characteristic 
-  gonsity 
-  hepth 
- fleck burlam falder multon 
- Any data cell can be designated by a path 
 through each category tree.
- Leaves correspond to row or column headings.
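To make the path idea concrete, here is a small sketch that stores each category tree as nested dictionaries and lists one key per data cell. The root name "item" for the second tree is hypothetical (the slide shows only its leaves), and `leaf_paths` is an illustrative helper.

```python
from itertools import product


def leaf_paths(tree, prefix=()):
    """List every root-to-leaf path of a category tree given as nested dicts."""
    paths = []
    for label, subtree in tree.items():
        path = prefix + (label,)
        if subtree:
            paths.extend(leaf_paths(subtree, path))
        else:
            paths.append(path)               # a leaf is a row or column heading
    return paths


categories = [
    {"characteristic": {"gonsity": {}, "hepth": {}}},
    # "item" is a hypothetical root label for the second category tree.
    {"item": {"fleck": {}, "burlam": {}, "falder": {}, "multon": {}}},
]

dimensionality = len(categories)             # number of category trees
# One key per data cell: a path through each category tree.
keys = list(product(*(leaf_paths(tree) for tree in categories)))
print(dimensionality, len(keys))
print(keys[0])
```

Two leaves in the first tree and four in the second give eight keys, one for each data cell of the example.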
27Real Table Understanding
- Analyzing the logical structure alone is not sufficient.
- Need additional information from title, footnotes, captions, etc.
- Semantic analysis of the labels is also important; it needs external knowledge.
28Does Wang Notation always exist?
- Not always! 
- Inconsistent tables do not have Wang Notation. 
- Others can be edited using virtual headers. 
29XY tree to Wang Notation Algorithm
- Input: XY trees.
- Output: XML version of Wang notation.
- Checks for table consistency.
30Algorithm
- Locate principal regions - stub, headers and 
 content cells.
- Extract Wang categories. 
- Compute Cartesian product of category paths. 
- Match each key to the content of a delta cell.
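The four steps above can be sketched for the simplest admissible layout, assuming a plain grid with one header row and one stub column. This is not the XY2WANG implementation, which works on XY trees; the function name `interpret` and the placeholder values d11 to d24 are hypothetical.

```python
from itertools import product


def interpret(grid):
    """Map (row heading, column heading) keys to delta cells of a simple table."""
    # Step 1: locate the principal regions.
    header, body = grid[0], grid[1:]
    col_labels = header[1:]                  # column headings; header[0] is the stub head
    row_labels = [row[0] for row in body]    # stub
    delta = [row[1:] for row in body]        # data (delta) cells
    # Consistency check: every data row must supply one cell per column heading.
    if any(len(row) != len(col_labels) for row in delta):
        raise ValueError("inconsistent table: data rows do not match the column headings")
    # Steps 2-3: the categories are flat here, so the paths are the labels themselves.
    keys = product(row_labels, col_labels)   # row-major Cartesian product
    # Step 4: match each key to the delta cell in the same position.
    values = (cell for row in delta for cell in row)
    return dict(zip(keys, values))


grid = [
    ["characteristic", "fleck", "burlam", "falder", "multon"],
    ["gonsity", "d11", "d12", "d13", "d14"],   # placeholder cell contents
    ["hepth", "d21", "d22", "d23", "d24"],
]
table = interpret(grid)
print(table[("hepth", "multon")])
```

Because `product` enumerates keys row by row, pairing it with a row-major walk over the delta region is enough to match every key to its cell.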
31Conclusions
- Admissible layouts identified for ease of 
 processing.
- Algorithms developed for 
- extracting XY trees from tables. 
- extracting Wang notation from XY trees. 
- Family of tables identified using a grammar.
32Future work
- Augmentations - captions, aggregates, units, etc. 
 
- Expand the grammar. 
- Automate conversion of table to admissible 
 formats.
(http://www40.statcan.ca/l01/cst01/agri111a-eng.htm)
 33THANK YOU 
34Goal: construction of a narrow-domain ontology from semi-structured web data (table understanding)
- Currently multon is the best choice for rapitting velters. It is about 25% better than burlam or falder, which have the same girby (hepth/gonsity ratio).
- Check another table to see whether elmer is even 
 better.
- NOT TODAY!
35H-first tree can be transformed into V-first tree (and vice versa)
36EX2XY Algorithm
- Two workhorses:
- Vertical_cut: returns the leftmost sub-rectangle of a given rectangle.
- Horizontal_cut: returns the topmost sub-rectangle of a given rectangle.
37EX2XY Algorithm (contd.)
- Used in a pair of procedures P1 and P2. 
- P1 cuts vertically and submits the first sub-rectangle to P2 for horizontal cuts.
- P2 does the converse: it cuts horizontally and submits the first sub-rectangle to P1 for vertical cuts.
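A rough sketch of the alternating cuts, not the EX2XY code itself. It assumes the table is a grid of cell identifiers in which a merged cell repeats its identifier over every position it covers, and that distinct cells carry distinct identifiers. P1 and P2 are folded into one function with a `vertical` flag, and `first_cut` stands in for Vertical_cut and Horizontal_cut; all names are hypothetical.

```python
def first_cut(grid, top, left, bottom, right, vertical):
    """Width (or height) of the leftmost (or topmost) sub-rectangle of a region."""
    if vertical:
        for c in range(left + 1, right):
            # A vertical cut is possible where no merged cell spans the boundary.
            if all(grid[r][c - 1] != grid[r][c] for r in range(top, bottom)):
                return c - left
        return right - left
    for r in range(top + 1, bottom):
        if all(grid[r - 1][c] != grid[r][c] for c in range(left, right)):
            return r - top
    return bottom - top


def slice_region(grid, top, left, bottom, right, vertical=True, retried=False):
    """Return the parenthesized XY tree of a region; leaves are written as c."""
    cells = {grid[r][c] for r in range(top, bottom) for c in range(left, right)}
    if len(cells) == 1:
        return "c"                           # a single (possibly merged) cell
    parts = []
    while (left < right) if vertical else (top < bottom):
        size = first_cut(grid, top, left, bottom, right, vertical)
        if vertical:
            parts.append((top, left, bottom, left + size))
            left += size
        else:
            parts.append((top, left, top + size, right))
            top += size
    if len(parts) == 1:                      # no cut possible in this direction
        if retried:
            raise ValueError("not an XY tessellation")
        return slice_region(grid, *parts[0], not vertical, True)
    # Hand each sub-rectangle to the other procedure.
    return "(" + " ".join(slice_region(grid, *p, not vertical) for p in parts) + ")"


grid = [
    ["T", "T", "T"],                         # title merged across three columns
    ["s", "h1", "h2"],
    ["r1", "d1", "d2"],
]
print(slice_region(grid, 0, 0, 3, 3))
```

When the first direction yields no cut, the sketch retries once in the other direction; if that also fails, the region is a non-slicing structure and an error is raised.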
38Parenthesized notation
- P-notation has a 1:1 correspondence with general trees.
- For the above table, the XY tree sentence is
- Sxy = c c c c c c c c c c c c.
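The correspondence with trees can be illustrated by a round trip between a parenthesized string and nested lists. The concrete syntax below (space-separated tokens, one pair of parentheses per internal node) is an assumption, since the slide's sentence survives only as a run of c tokens; `parse` and `unparse` are illustrative names.

```python
def parse(text):
    """Turn a parenthesized string into nested lists (leaves stay as strings)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        token = tokens[pos]
        pos += 1
        if token == ")":
            raise ValueError("unexpected ')'")
        if token != "(":
            return token                     # leaf
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            children.append(node())
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1                             # consume ')'
        return children

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing tokens")
    return tree


def unparse(tree):
    """Inverse of parse: nested lists back to the parenthesized string."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(unparse(child) for child in tree) + ")"


sentence = "(c (c c c) (c c c))"             # illustrative, not the slide's table
tree = parse(sentence)
print(tree)
print(unparse(tree) == sentence)
```

Parsing and then unparsing returns the same string, which is the sense in which each well-formed sentence names exactly one ordered tree.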
39A table with six Wang dimensions 
40XY2WANG: other features
- Handles more complex scenarios:
- Higher dimensionality.
- Deeper nesting of headers.
- Repetitive headers.
41(http://www40.statcan.ca/l01/cst01/econ50-eng.htm)
 42Table Augmentations Example 
43Raghav's Experiment
 44Results 
 45Results (Contd.) 
46Conclusion
- Average total time to process a table: 231 seconds.
- Average table size: 587 cells before preprocessing.
- Average preprocessing time: 104 seconds.
- 3-category tables took approximately 27 seconds more than 2-category tables.
47Conclusion (Contd.)
- Tables with aggregates and footnotes take more time to process.
- Strong correlation between processing time and table size.
- For the future: automatically segment augmentations, categories and delta cells using visual cues.