Table Extraction Using Conditional Random Fields - PowerPoint PPT Presentation

Learn more at: http://www.cs.cmu.edu

1
Table Extraction Using Conditional Random Fields
  • D. Pinto, A. McCallum, X. Wei and W. Bruce Croft
  • In SIGIR 2003
  • Presented by Vitor R. Carvalho
  • March 15th 2004

2
Warm up
  • Why table extraction?
  • Applications: question answering, data mining and
    IR
  • Tables: textual tokens laid out in tabular form
  • Tables: databases designed for human eyes
  • Related Work
  • Pyreddy and Croft, 1997: purely layout-based
    approach; a Character Alignment Graph (CAG) is
    used to identify the whole table
  • Ng et al., 1999: machine learning to identify
    row and column positions; no extraction of
    content
  • Hurst, 2000: combination of layout and language
    perspectives; text is broken into blocks by
    spatial and linguistic evidence
  • Pinto et al., 2002: based on CAG; heuristic
    method to extract table cells for a QA system

3
Objectives
  • In this paper
  • Only text tables are studied, not HTML tables
  • Table extraction can be broken down into 6
    subproblems
  • Locate the table (*)
  • Identify the row positions and types (*)
  • Identify column positions and types
  • Segment tables into cells
  • Tag cells as data or headers
  • Associate data cells with their corresponding
    headers
  • Only the (*) tasks are addressed in the paper
  • CRFs are compared to Maximum Entropy and HMM models

4
Example
  • From www.FedStats.com, July 2001

5
12 Line Labels
  • Non-extraction labels: NONTABLE, BLANKLINE, SEPARATOR
  • Header labels: TITLE, SUPERHEADER, TABLEHEADER, SUBHEADER,
    SECTIONHEADER
  • Data row labels: DATAROW, SECTIONDATAROW
  • Caption labels: TABLEFOOTNOTE, TABLECAPTION
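
The label inventory above can be written down as a small lookup structure. This is only a sketch: the label names come from the slide, while the group keys and the `label_group` helper are hypothetical names introduced here for illustration.

```python
# Label names are from the slide; the dict keys and helper are illustrative.
LINE_LABELS = {
    "non_extraction": ["NONTABLE", "BLANKLINE", "SEPARATOR"],
    "header": ["TITLE", "SUPERHEADER", "TABLEHEADER", "SUBHEADER", "SECTIONHEADER"],
    "data_row": ["DATAROW", "SECTIONDATAROW"],
    "caption": ["TABLEFOOTNOTE", "TABLECAPTION"],
}

# Flat list of all line labels, in group order.
ALL_LABELS = [label for group in LINE_LABELS.values() for label in group]


def label_group(label: str) -> str:
    """Return the group a line label belongs to (hypothetical helper)."""
    for group, labels in LINE_LABELS.items():
        if label in labels:
            return group
    raise ValueError(f"unknown label: {label}")
```

The four groups add up to the 12 labels named in the slide title.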

6
Feature Set
  • White Space Features
  • Presence of 4 consecutive white spaces, 4-space
    indents, 2 consecutive white spaces between
    non-space characters, a completely blank
    line, a single-space indent, etc.
  • Percentage of white space from the first
    non-white-space character on
  • Text Features
  • Presence of 3 cells on a line, etc.
  • Percentage of digits (0-9) on a line, alphabetic
    characters (a-z) on a line, header features (year
    strings, month abbreviations, etc.) on a line
  • Separator Features
  • Presence of 4 consecutive periods
  • Percentage of separator characters (such as - and !)
    on a line
  • Conjunction of Features
  • Conjunctions: current + previous line, current + next
    line, next + next-next line
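
A few of the per-line features listed above could be computed roughly as follows. This is a minimal sketch, not the authors' feature extractor: the feature names, the `SEPARATOR_CHARS` set (limited to the two characters legible on the slide) and the conjunction scheme (pairing each boolean feature with the same feature on the neighbouring line) are all assumptions made for illustration.

```python
import re
import string

# Assumption: only the separator characters legible on the slide.
SEPARATOR_CHARS = set("-!")


def line_features(line: str) -> dict:
    """Compute a handful of white space, text and separator features for one line."""
    n = len(line)
    body = line.lstrip()  # text from the first non-white-space character on

    def pct(count: int, total: int) -> float:
        return count / total if total else 0.0

    return {
        "four_consecutive_spaces": "    " in line,
        "four_space_indent": line.startswith("    "),
        "two_spaces_between_text": re.search(r"\S {2,}\S", line) is not None,
        "blank_line": line.strip() == "",
        "four_consecutive_periods": "...." in line,
        "pct_digits": pct(sum(c in string.digits for c in line), n),
        "pct_alpha": pct(sum(c in string.ascii_letters for c in line), n),
        "pct_separators": pct(sum(c in SEPARATOR_CHARS for c in line), n),
        "pct_space_after_first_char": pct(body.count(" "), len(body)),
    }


def add_conjunctions(per_line: list[dict]) -> list[dict]:
    """Add current+previous and current+next conjunctions of boolean features."""
    out = []
    for i, feats in enumerate(per_line):
        combined = dict(feats)
        for offset, tag in ((-1, "prev"), (1, "next")):
            j = i + offset
            if 0 <= j < len(per_line):
                for name, value in feats.items():
                    if isinstance(value, bool):
                        combined[f"{name}&{tag}_{name}"] = value and per_line[j][name]
        out.append(combined)
    return out
```

For example, `add_conjunctions([line_features(l) for l in text.splitlines()])` yields one feature dictionary per line. The next + next-next conjunction from the slide is left out to keep the sketch short.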

7
Task 1 Table Line Location
  • A table line is any line whose label is not NONTABLE,
    BLANKLINE or SEPARATOR
  • F-Measure = (2 × Precision × Recall) / (Recall + Precision)
  • Both CRFs used a Gaussian prior and were trained
    using L-BFGS
  • Training set (52 documents), development set (6
    documents), test set (62 documents)
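
The table line definition and the F-measure above can be sketched as a short scoring routine. The function names are hypothetical, and the gold and predicted label sequences in the usage note are made-up examples, not data from the paper.

```python
# Labels that do NOT count as table lines, per the slide.
NON_TABLE_LABELS = {"NONTABLE", "BLANKLINE", "SEPARATOR"}


def is_table_line(label: str) -> bool:
    """A table line is any line whose label is not a non-extraction label."""
    return label not in NON_TABLE_LABELS


def table_line_f_measure(gold: list, predicted: list) -> tuple:
    """Return (precision, recall, F-measure) for table line location."""
    true_pos = sum(is_table_line(g) and is_table_line(p) for g, p in zip(gold, predicted))
    predicted_pos = sum(is_table_line(p) for p in predicted)
    gold_pos = sum(is_table_line(g) for g in gold)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

With gold labels `["NONTABLE", "TITLE", "DATAROW", "DATAROW", "BLANKLINE"]` and predictions `["NONTABLE", "DATAROW", "DATAROW", "NONTABLE", "TITLE"]`, two of the three predicted table lines are correct and two of the three true table lines are found, so precision, recall and F-measure are all 2/3. Note that for this task a line predicted as DATAROW still counts as correct when its gold label is TITLE, since both are table lines.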

8
Task 2 Line Identification
  • How many of these lines were actually table lines?

9
Task 2 Line Identification

10
Additional Results
  • Pinto et al. heuristic method
  • 4 labels: CAPTIONS, HEADERS, DATA, NON-TABLE

11
Conclusions
  • The table extraction problem has complex
    linguistic and formatting characteristics. To
    attack this problem, a combination of
    textual and spatial features was used.
  • CRFs handle arbitrary, overlapping features
    very well, and offer the combined
    benefits of conditionally trained
    models and Markov finite-state context models.
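
One way to picture the Markov finite-state context is that each line's label is scored together with the previous line's label, and the best label sequence is found by Viterbi decoding. The sketch below assumes a linear chain with additive scores; the `viterbi` function and all scores in the usage note are hypothetical and are not weights or code from the paper.

```python
def viterbi(emission: list, transition: dict, labels: list) -> list:
    """Return the highest-scoring label sequence for a non-empty list of lines.

    emission:   one dict per line, mapping label -> score (missing = 0.0)
    transition: dict mapping (previous_label, current_label) -> score (missing = 0.0)
    """
    best = [{lab: emission[0].get(lab, 0.0) for lab in labels}]
    back = []
    for scores in emission[1:]:
        row, ptr = {}, {}
        for cur in labels:
            # Best previous label for this current label.
            prev = max(labels, key=lambda p: best[-1][p] + transition.get((p, cur), 0.0))
            row[cur] = best[-1][prev] + transition.get((prev, cur), 0.0) + scores.get(cur, 0.0)
            ptr[cur] = prev
        best.append(row)
        back.append(ptr)
    # Follow back-pointers from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

As a usage example with made-up scores: given labels `["NONTABLE", "DATAROW"]`, per-line scores `[{"NONTABLE": 1.0}, {"DATAROW": 0.4, "NONTABLE": 0.5}, {"DATAROW": 1.0}]` and a single transition score `{("DATAROW", "DATAROW"): 1.0}`, the middle line would be labelled NONTABLE if scored in isolation, but the decoded sequence labels it DATAROW because of the context of the following data row.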