PowerPoint Presentation - 48x36 Poster Template - PowerPoint PPT Presentation

About This Presentation
Description:

HTML Table Interpretation by Sibling Page Comparison in the Molecular ... have at least two helices, and participate in glycolysis' GenBank, PDB, KEGG ' ... – PowerPoint PPT presentation

Slides: 2
Provided by: deg7

Transcript and Presenter's Notes

HTML Table Interpretation by Sibling Page Comparison in the Molecular Biology Domain
Cui Tao and David W. Embley
Data Extraction Research Group, Department of Computer Science, Brigham Young University, Provo, UT 84602
PROBLEMS
  • Huge, evolving number of bio-databases
  • Molecular biology database collection:
    2004: 548 total, 162 more than 2003
    2005: 719 total, 171 more than 2004
  • Different access capabilities
  • Syntactic heterogeneity
  • Semantic heterogeneity
  • Updated at any time by independent authorities
Non-data tables
SAMPLE TABLE
GOALS
To help biologists cross-search various resources. Examples:
  • Find genes that are longer than 5 kbp, whose products have at least two
    helices, and that participate in glycolysis (GenBank, PDB, KEGG)
  • Find genes newly annotated after Jan. 2003 in the fly and worm genomes
    (FlyBase, WormBase)
[Sample table figure annotations: Values, Labels]
SOLUTION
  • Source page understanding
  • Table interpretation: table recognition, table pattern generalization,
    pattern adjustment
  • Information extraction and semantic annotation
  • Source location through semantic indexing
  • Cross-database query processing
Sibling pages sibling tables
Table-Interpretation Steps
  • HTML table → DOM tree
  • Tree matching → find sibling tables
  • Variable fields → values
  • Fixed fields → labels
  • Infer pattern
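The "variable fields are values, fixed fields are labels" idea can be illustrated with a minimal sketch. It assumes a simplification that is not the poster's method: each sibling table is already flattened to a grid of cell strings with the same shape, whereas the poster works on matched DOM trees. The function name `classify_cells` and the sample gene data are hypothetical.

```python
def classify_cells(sibling_tables):
    """Tag each cell position of the first table as 'label' or 'value'.

    A cell is a 'label' when every sibling table holds the same text at
    that position (fixed field), and a 'value' otherwise (variable field).
    Assumes at least two sibling tables, given as lists of rows of strings.
    """
    first, *rest = sibling_tables
    result = []
    for r, row in enumerate(first):
        out_row = []
        for c, text in enumerate(row):
            same_everywhere = all(
                r < len(t) and c < len(t[r]) and t[r][c] == text
                for t in rest
            )
            out_row.append("label" if same_everywhere else "value")
        result.append(out_row)
    return result


# Hypothetical sibling tables from two pages of the same site.
page_a = [["Gene", "abc-1"], ["Length", "5200 bp"]]
page_b = [["Gene", "xyz-2"], ["Length", "830 bp"]]
roles = classify_cells([page_a, page_b])
```

With only one table there is nothing to compare against, so every cell would come out as a label; the comparison only becomes informative with two or more sibling pages.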
Focus of this poster
  • EXPERIMENTAL RESULTS
  • Test Set: 10 web sites, 100 sibling pages, 862
    HTML tables
  • Table Recognition: correctly eliminated all but 3
    non-data tables
  • Pattern Generation: successfully recognized 28 of
    29 patterns
  • Dynamic Adjustment: 5 location adjustments, 12
    structure adjustments → all correct

Pattern Combinations
Matches any pre-defined pattern template?
Generates a specific structure pattern for the
table
Input: an HTML table
Output: a formal table notation (Wang notation)
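As a rough illustration of what such an output could look like, the sketch below builds a structure loosely modeled on Wang's abstract table model: a set of categories (dimensions with their labels) plus a mapping from label sets to data values. The representation, the function name `to_wang_like`, the dimension name "Attribute", and the sample data are all assumptions made for illustration; the poster does not show its actual output format.

```python
def to_wang_like(pairs, dimension="Attribute"):
    """Turn (label, value) pairs into a (categories, delta) structure.

    categories: maps each dimension name to its list of labels.
    delta: maps a set of qualified labels to the value that set indexes.
    """
    categories = {dimension: [label for label, _ in pairs]}
    delta = {
        frozenset([f"{dimension}.{label}"]): value
        for label, value in pairs
    }
    return categories, delta


# Hypothetical one-dimensional label/value table.
categories, delta = to_wang_like([("Gene", "abc-1"), ("Length", "5200 bp")])
```

A table with row and column headers would use two dimensions, with each key of `delta` holding one label from each dimension.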
We can:
  • Recognize data tables
  • Find labels and values
  • Infer table patterns
  • Dynamically adjust table patterns
Domain generality: work for other domains
value
Pre-defined structure templates
label
Dynamically adjust the structure pattern
Data Extraction Research Group
Department of Computer Science
Brigham Young University
Provo, UT 84602
Cui Tao, ctao_at_cs.byu.edu
David W. Embley, embley_at_cs.byu.edu
http://www.deg.byu.edu/
  • Consider all tagged tables
  • Unnest
  • Filter out tables containing no data
  • Sibling table match percentage = max match score /
    tree size
  • > the high threshold: exact match or near-exact
    match
  • < the low threshold: false match
  • In between: sibling tables
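The match-percentage rule can be sketched as follows. Several things here are assumptions rather than details from the poster: the match score uses a simple top-down tree matching as a stand-in for whatever tree matcher the authors used, trees are `(label, children)` tuples in which text nodes count as leaves, "tree size" is taken to be the node count of the table being classified, and the threshold values `HIGH` and `LOW` are placeholders.

```python
HIGH = 0.9  # hypothetical high threshold
LOW = 0.3   # hypothetical low threshold


def tree_size(node):
    """Number of nodes in a (label, children) tree."""
    _, children = node
    return 1 + sum(tree_size(child) for child in children)


def match_score(a, b):
    """Simple top-down tree matching: count of node pairs that align."""
    if a[0] != b[0]:
        return 0
    kids_a, kids_b = a[1], b[1]
    # Dynamic program over the two ordered child lists.
    m = [[0] * (len(kids_b) + 1) for _ in range(len(kids_a) + 1)]
    for i, ka in enumerate(kids_a, 1):
        for j, kb in enumerate(kids_b, 1):
            m[i][j] = max(m[i - 1][j], m[i][j - 1],
                          m[i - 1][j - 1] + match_score(ka, kb))
    return m[-1][-1] + 1  # +1 for the matched roots


def match_percentage(table, candidates):
    """Best match score against any candidate table, over the tree size."""
    best = max((match_score(table, c) for c in candidates), default=0)
    return best / tree_size(table)


def classify(pct, low=LOW, high=HIGH):
    if pct > high:
        return "exact or near-exact match"
    if pct < low:
        return "false match"
    return "sibling tables"


# Hypothetical tables: same structure, one differing text node.
t1 = ("table", [("tr", [("td", [("Gene", [])]), ("td", [("abc-1", [])])])])
t2 = ("table", [("tr", [("td", [("Gene", [])]), ("td", [("xyz-2", [])])])])
other = ("table", [("caption", [("Menu", [])])])
```

Here `t1` against `t2` aligns 5 of 6 nodes, which falls between the two thresholds, so the pair is treated as sibling tables. An identical copy scores 1.0 and is treated as an exact match, which fits tables repeated verbatim across pages, such as navigation menus.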