The ODU Metadata Extraction Project

Learn more at: https://www.cs.odu.edu
1
  • The ODU Metadata Extraction Project
  • March 28, 2007
  • Dr. Steven J. Zeil
  • zeil_at_cs.odu.edu

2
Outline
  • Overview
  • Recent Developments
  • Independent Document Model
  • Validation
  • Diversifying: NASA and GPO collections
  • New Issues and Future Directions
  • Post-processing
  • Image-Based Classification

3
1. Overview
4
Input Processing and OCR
  • Select pages of interest
  • Apply Off-The-Shelf OCR software
  • Convert OCR output to XML model format

5
Form Processing
  • Scan document for form names
  • Select form template
  • Apply form extraction engine to document and
    template

6
Sample RDP
7
Sample RDP (cont.)
8
Metadata Extracted from Sample RDP (1/3)
  • <metadata templateName="sf298_2">
  • <ReportDate>18-09-2003</ReportDate>
  • <DescriptiveNote>Final Report</DescriptiveNote>
  • <DescriptiveNote>1 April 1996 - 31 August 2003</DescriptiveNote>
  • <UnclassifiedTitle>VALIDATION OF IONOSPHERIC MODELS</UnclassifiedTitle>
  • <ContractNumber>F19628-96-C-0039</ContractNumber>
  • <ContractNumber></ContractNumber>
  • <ProgramElementNumber>61102F</ProgramElementNumber>
  • <PersonalAuthor>Patricia H. Doherty Leo F. McNamara Susan H. Delay
    Neil J. Grossbard</PersonalAuthor>
  • <ProjectNumber>1010</ProjectNumber>
  • <TaskNumber>IM</TaskNumber>
  • <WorkUnitNumber>AC</WorkUnitNumber>
  • <CorporateAuthor>Boston College / Institute for Scientific Research
    140 Commonwealth Avenue Chestnut Hill, MA 02467-3862</CorporateAuthor>

9
Metadata Extracted from Sample RDP (2/3)
  • <ReportNumber></ReportNumber>
  • <MonitorNameAndAddress>Air Force Research Laboratory 29 Randolph Road
    Hanscom AFB, MA 01731-3010</MonitorNameAndAddress>
  • <MonitorAcronym>VSBP</MonitorAcronym>
  • <MonitorSeries>AFRL-VS-TR-2003-1610</MonitorSeries>
  • <DistributionStatement>Approved for public release; distribution
    unlimited.</DistributionStatement>
  • <Abstract>This document represents the final report for work
    performed under the Boston College contract F I9628-96C-0039. This
    contract was entitled Validation of Ionospheric Models. The
    objective of this contract was to obtain satellite and ground-based
    ionospheric measurements from a wide range of geographic locations
    and to utilize the resulting databases to validate the theoretical
    ionospheric models that are the basis of the Parameterized Real-time
    Ionospheric Specification Model (PRISM) and the Ionospheric Forecast
    Model (IFM). Thus our various efforts can be categorized as either
    observational databases or modeling studies.</Abstract>

10
Metadata Extracted from Sample RDP (3/3)
  • <Identifier>Ionosphere, Total Electron Content (TEC), Scintillation,
    Electron density, Parameterized Real-time Ionospheric Specification
    Model (PRISM), Ionospheric Forecast Model (IFM), Paramaterized
    Ionosphere Model (PIM), Global Positioning System (GPS)</Identifier>
  • <ResponsiblePerson>John Retterer</ResponsiblePerson>
  • <Phone>781-377-3891</Phone>
  • <ReportClassification>U</ReportClassification>
  • <AbstractClassification>U</AbstractClassification>
  • <AbstractLimitaion>SAR</AbstractLimitaion>
  • </metadata>

11
Non-Form Processing
  • Classification: compare document against known
    document layouts
  • Select template written for closest matching
    layout
  • Apply non-form extraction engine to document and
    template

12
Non-Form Sample (1/2)
13
Non-Form Sample (2/2)
14
Template Used for Sample Document
  • <structdef pagenumber="1" templateID="au">
  • <identifier min="1" max="1">
  • <begin inclusive="current">
  • <stringmatch case="yes" loc="beginwith">AU/</stringmatch>
  • </begin>
  • <end>onesection</end>
  • </identifier>
  • <CorporateAuthor min="1" max="1">
  • <begin inclusive="current">
  • <stringmatch case="no" loc="beginwith">
  • AIR COMMAND AIR WAR
  • </stringmatch>
  • </begin>
  • <end inclusive="current">
  • <stringmatch case="no" loc="beginwith">AIR UNIVERSITY</stringmatch>
  • </end>
  • </CorporateAuthor>
  • <UnclassifiedTitle min="1" max="1">
  • <begin inclusive="after">CorporateAuthor</begin>

15
Metadata Extracted From the Title Page of the
Sample Document
  • <paper templateid="au">
  • <identifier>AU/ACSC/012/1999-04</identifier>
  • <CorporateAuthor>AIR COMMAND AND STAFF COLLEGE
    AIR UNIVERSITY</CorporateAuthor>
  • <UnclassifiedTitle>INTEGRATING COMMERCIAL
    ELECTRONIC EQUIPMENT TO IMPROVE
    MILITARY CAPABILITIES
    </UnclassifiedTitle>
  • <PersonalAuthor>Jeffrey A. Bohler LCDR, USN</PersonalAuthor>
  • <advisor>Advisor CDR Albert L. St.Clair</advisor>
  • <ReportDate>April 1999</ReportDate>
  • </paper>

16
Post-Processing
  • Coerce extracted values into standard formats

17
Validation
  • Estimate quality of extracted metadata
  • Untrusted outputs referred (to humans) for review
    and correction

18
2. Recent Developments
  • Independent Document Model
  • Validation
  • Diversifying NASA and GPO Collections

19
A. Independent Document Model (IDM)
  • Platform independent Document Model
  • Motivation
  • Dramatic XML Schema Change between Omnipage 14
    and 15
  • Tie the template engine to stable specification
  • Protects from linking directly to specific OCR
    product
  • Allows us to include statistics for enhanced
    feature usage
  • Statistics (e.g., avgDocFontSize, avgPageFontSize,
    wordCount, avgDocWordCount, etc.)

20
Documents in IDM
  • A document consists of pages
  • pages are divided into regions
  • regions may be divided into
  • blocks of vertical whitespace
  • paragraphs
  • tables
  • images
  • paragraphs are divided into lines
  • lines are divided into words
  • All of these carry standard attributes for size,
    position, font, etc.
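
The containment hierarchy on this slide can be sketched as a set of nested record types. This is only an illustration: the class names, attribute names, and the `word_count` helper are hypothetical and are not taken from the actual IDM schema.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Box:
    """Standard attributes carried by every node (hypothetical names)."""
    x: float = 0.0
    y: float = 0.0
    width: float = 0.0
    height: float = 0.0
    font_size: float = 0.0


@dataclass
class Word(Box):
    text: str = ""


@dataclass
class Line(Box):
    words: List[Word] = field(default_factory=list)


@dataclass
class Paragraph(Box):
    lines: List[Line] = field(default_factory=list)


@dataclass
class Whitespace(Box):
    """A block of vertical whitespace."""


@dataclass
class Table(Box):
    pass


@dataclass
class Image(Box):
    pass


RegionChild = Union[Whitespace, Paragraph, Table, Image]


@dataclass
class Region(Box):
    children: List[RegionChild] = field(default_factory=list)


@dataclass
class Page(Box):
    regions: List[Region] = field(default_factory=list)


@dataclass
class Document:
    pages: List[Page] = field(default_factory=list)


def word_count(doc: Document) -> int:
    """Count words in all paragraphs, one example of a document statistic."""
    return sum(
        len(line.words)
        for page in doc.pages
        for region in page.regions
        for child in region.children
        if isinstance(child, Paragraph)
        for line in child.lines
    )
```

A statistic such as wordCount then becomes a simple traversal of this tree, independent of which OCR product produced the input.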

21
Generating IDM
  • Use XSLT 2.0 stylesheets to transform OCR output into IDM
  • Supporting a new OCR schema requires only a new
    XSLT stylesheet; the engine does not change

22
IDM Usage
  • OmniPage 14 XML Doc -> docTreeModelOmni14.xsl -> IDM XML Doc
  • OmniPage 15 XML Doc -> docTreeModelOmni15.xsl -> IDM XML Doc
  • Other OCR Output XML Doc -> docTreeModelOther.xsl -> IDM XML Doc
  • IDM XML Doc -> Form Based Extraction and Non Form Extraction
23
IDM Tool Status
  • Converters completed to generate IDM from
    OmniPage 14 and 15 XML
  • OmniPage 15 proved to have numerous errors in its
    representation of an OCR'd document
  • Consequently, not recommended
  • Form-based extraction engine revised to work from
    IDM
  • Non-form engine still works from our older
    CleanXML
  • converter from IDM to CleanXML completed as
    stop-gap measure
  • direct use of IDM deferred pending review of
    other engine modifications

24
B. Validation
  • Given a set of extracted metadata
  • mark each field with a confidence value
    indicating how trustworthy the extracted value is
  • mark the set with a composite confidence score
  • Fields and Sets with low confidence scores may be
    referred for additional processing
  • automated post-processing
  • human intervention and correction

25
Validating Extracted Metadata
  • Techniques must be independent of the extraction
    method
  • A validation specification is written for each
    collection, combining
  • Field-specific validation rules
  • statistical models derived for each field of
  • text length
  • percentage of words from English dictionary
  • percentage of phrases from knowledge base prepared for
    that field
  • pattern matching

26
Sample Validation Specification
  • Combines results from multiple fields
  • <val:validate collection="dtic"
    xmlns:val="jelly:edu.odu.cs.dtic.validation.ValidationTagLibrary">
  • <val:average>
  • <val:field name="UnclassifiedTitle">...</val:field>
  • <val:field name="PersonalAuthor">...</val:field>
  • <val:field name="CorporateAuthor">...</val:field>
  • <val:field name="ReportDate">...</val:field>
  • </val:average>
  • </val:validate>

27
Validation Spec Field Tests
  • Each field is subjected to one or more tests
  • <val:field name="PersonalAuthor">
  • <val:average>
  • <val:length/>
  • <val:max>
  • <val:phrases length="1"/>
  • <val:phrases length="2"/>
  • <val:phrases length="3"/>
  • </val:max>
  • </val:average>
  • </val:field>
  • <val:field name="ReportDate">
  • <val:reportFormat/>
  • </val:field>
  • ...
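
The nesting of average and max in the PersonalAuthor tests can be read as a small expression tree over individual test scores. The sketch below evaluates such a tree; the leaf names and the scores passed in are made up for illustration, and the real validator's tag library may combine results differently.

```python
from statistics import mean
from typing import Callable, Dict, List

# Combinators named on the slide; each reduces a list of scores to one score.
COMBINATORS: Dict[str, Callable[[List[float]], float]] = {
    "average": mean,
    "max": max,
}


def evaluate(node, leaf_scores: Dict[str, float]) -> float:
    """Evaluate a test tree.

    A node is either a leaf name (str) looked up in leaf_scores,
    or a pair (combinator_name, [child nodes]).
    """
    if isinstance(node, str):
        return leaf_scores[node]
    op, children = node
    return COMBINATORS[op]([evaluate(child, leaf_scores) for child in children])


# Mirrors the PersonalAuthor spec: average(length, max(phrases 1, 2, 3)).
personal_author = (
    "average",
    ["length", ("max", ["phrases1", "phrases2", "phrases3"])],
)

# Hypothetical leaf scores, not results from the project.
scores = {"length": 0.8, "phrases1": 0.2, "phrases2": 0.6, "phrases3": 0.4}
confidence = evaluate(personal_author, scores)  # average of 0.8 and 0.6
```

Taking the max over phrase lengths lets the best-matching phrase length speak for the field, while the outer average balances it against the length test.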

28
Sample Input Metadata Set
  • <metadata>
  • <UnclassifiedTitle>Thesis Title The Military
    Extraterritorial Jurisdiction Act</UnclassifiedTitle>
  • <PersonalAuthor>Name of Candidate LCDR
    Kathleen A. Kerrigan</PersonalAuthor>
  • <ReportDate>Accepted this 18th day of June 2004
    by</ReportDate>
  • </metadata>

29
Sample Validator Output
  • <metadata confidence="0.522">
  • <UnclassifiedTitle confidence="0.943">Thesis
    Title The Military Extraterritorial Jurisdiction
    Act</UnclassifiedTitle>
  • <PersonalAuthor confidence="0.622">Name of
    Candidate LCDR Kathleen A. Kerrigan</PersonalAuthor>
  • <ReportDate confidence="0.0" warning="ReportDate
    field does not match required pattern">Accepted
    this 18th day of June 2004 by</ReportDate>
  • </metadata>
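
The composite score in this output is consistent with a plain average of the three field scores that are present. The short check below reproduces that arithmetic; skipping the absent CorporateAuthor field and the review threshold of 0.5 are assumptions made for illustration, not documented behavior of the validator.

```python
# Field confidences copied from the sample validator output.
field_confidence = {
    "UnclassifiedTitle": 0.943,
    "PersonalAuthor": 0.622,
    "ReportDate": 0.0,
}

# Average over the fields that appear in the metadata set.
composite = sum(field_confidence.values()) / len(field_confidence)
print(round(composite, 3))  # 0.522

# Hypothetical threshold: fields scoring below it are referred for review.
REVIEW_THRESHOLD = 0.5
needs_review = [name for name, c in field_confidence.items() if c < REVIEW_THRESHOLD]
print(needs_review)  # ['ReportDate']
```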

30
Classification (a priori)
  • Previously, we had attempted various schemes for
    a priori classification
  • x-y trees
  • bin classification
  • Still investigating some
  • image-based recognition

31
Post-Hoc Classification
  • Apply all templates to document
  • results in multiple candidate sets of metadata
  • Score each candidate using the validator
  • Select the best-scoring set
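
The four steps above amount to an argmax over templates, with the validator as the scoring function. A minimal sketch follows; `select_best`, the two toy templates, and the toy validator are all hypothetical stand-ins for the project's extraction engine and validator.

```python
from typing import Callable, Dict, Optional, Tuple

Metadata = Dict[str, str]
Template = Callable[[str], Optional[Metadata]]


def select_best(
    document: str,
    templates: Dict[str, Template],
    validator: Callable[[Metadata], float],
) -> Tuple[Optional[str], Optional[Metadata], float]:
    """Apply every template, score each candidate, keep the best-scoring one."""
    best: Tuple[Optional[str], Optional[Metadata], float] = (None, None, -1.0)
    for name, extract in templates.items():
        candidate = extract(document)
        if candidate is None:  # template did not apply at all
            continue
        score = validator(candidate)
        if score > best[2]:
            best = (name, candidate, score)
    return best


# Toy templates: one misses the date, the other finds it on the last line.
def template_a(doc: str) -> Metadata:
    return {"UnclassifiedTitle": doc.split("\n")[0], "ReportDate": ""}


def template_b(doc: str) -> Metadata:
    lines = doc.split("\n")
    return {"UnclassifiedTitle": lines[0], "ReportDate": lines[-1]}


# Toy validator: fraction of fields that are non-empty.
def toy_validator(metadata: Metadata) -> float:
    return sum(1 for value in metadata.values() if value) / len(metadata)


name, metadata, score = select_best(
    "A TITLE\nApril 1999", {"a": template_a, "b": template_b}, toy_validator
)
```

With this arrangement no classifier has to be trained up front; the cost is running every template on every document, which is why a cheap pre-filter becomes attractive as the number of templates grows.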

32
Experimental Results
33
Interpretation of Results
  • Validator agreed with human on 125 out of 167
    cases
  • Of 42 cases where they disagreed
  • 37 were due to extra words in extracted
    metadata (e.g., military ranks in author names)
  • highlights need for post-processing to clean up
    metadata
  • 2 were mistakes by template
  • 2 were due to garbled characters by OCR
  • 1 due to a bug in the validator

34
C. Diversifying NASA and GPO Collections
  • Document collections differ in
  • whether forms are used and form layout
  • document layout
  • what metadata fields are present and which ones are
    collected

35
Changing Collections
  • Porting to a new document collection
  • identify pages of interest
  • training classifiers to recognize new document
    layouts (?)
  • templates for forms and document layouts
  • new validation scripts
  • collect statistics for collection model
  • new post-processing rules
  • No changes required to core engines and other
    software

36
NASA Technical Reports
  • Different layouts than DTIC
  • fewer total
  • tend to be visually more similar
  • mixture with and without RDPs

37
NASA Sample Document
38
Extracted Metadata for NASA Sample
  • <paper templateid="singleAuthor">
  • <metadata>
  • <UnclassifiedTitle>
  • A Computationally Efficient Meshless Local
    Petrov-Galerkin Method
  • for Axisymmetric Problems
  • </UnclassifiedTitle>
  • <PersonalAuthor>
  • I.S. Raju and T. Chen?
  • </PersonalAuthor>
  • <CorporateAuthor>
  • NASA Langley Research Center
  • Hampton, VA 23681
  • </CorporateAuthor>
  • <Abstract>
  • The Meshless Local Petrov-Galerkin (MLPG)
  • method is one of the recently developed
    element-free

39
Govt. Printing Office
  • Congressional acts and reports
  • EPA reports
  • Preliminary study with Acts of Congress and EPA reports
  • samples suggest layouts are more diverse than
    DTIC or NASA
  • metadata actually present in document varies
    widely

40
GPO Sample Act of Congress
41
Metadata Extracted for Act of Congress
  • <paper>
  • <metadata>
  • <public_law_report_num>
  • 118 STAT. 3984 PUBLIC LAW 108?493?DEC. 23,
    2004
  • </public_law_report_num>
  • <bill_number>H.R. 5394 components.</bill_number>
  • <congress_num>108th Congress</congress_num>
  • <type>An Act</type>
  • <acttype>
  • Dec. 23, 2004 To amend the Internal Revenue
    Code of 1986 to modify the taxation of arrow
  • H.R. 5394 components.
  • </acttype>
  • </metadata>
  • </paper>

42
GPO sample report
43
Metadata Extracted from GPO Sample Report
  • <paper>
  • <metadata>
  • <title>
  • CHINA'S PROLIFERATION PRACTICES
  • AND ROLE IN THE NORTH KOREA CRISIS
  • </title>
  • <type>
  • HEARING BEFORE THE
  • U.S.-CHINA ECONOMIC AND SECURITY
  • REVIEW COMMISSION
  • </type>
  • <session>ONE HUNDRED NINTH CONGRESS FIRST
    SESSION</session>
  • <date>MARCH 10, 2005</date>
  • <use>
  • Printed for the use of the
  • U.S.-China Economic and Security Review
    Commission
  • </use>
  • <online>Available via the World Wide Web
    http://www.uscc.gov</online>
  • </metadata>

44
3. New Issues and Future Directions
  • Post-Processing
  • Image-Based Classification

47
Post-processing
  • WYSIWYG
  • What You See is What You Get
  • WYG != WYW
  • What You Get is not What You Want

48
Example DTIC Date Format
  • Document may contain
  • March 28, 2007
  • 3/28/2007
  • 3/28/07
  • DTIC requires
  • 28 MAR 2007
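
Coercing the three input styles on this slide into the DTIC style can be sketched with the standard library alone. The function name `to_dtic_date` is hypothetical, the list of accepted input formats covers only the examples shown here, and zero-padding single-digit days is an assumption because the slide shows only a two-digit day.

```python
from datetime import datetime
from typing import Optional

# Input styles from the slide: "March 28, 2007", "3/28/2007", "3/28/07".
# Parsing month names with %B assumes an English locale.
INPUT_FORMATS = ("%B %d, %Y", "%m/%d/%Y", "%m/%d/%y")

# Fixed table so the output does not depend on the current locale.
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def to_dtic_date(text: str) -> Optional[str]:
    """Return the date as 'DD MON YYYY', or None if no known format matches."""
    for fmt in INPUT_FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # try the next input style
        return f"{parsed.day:02d} {MONTHS[parsed.month - 1]} {parsed.year}"
    return None


for raw in ("March 28, 2007", "3/28/2007", "3/28/07"):
    print(to_dtic_date(raw))  # 28 MAR 2007 each time
```

Returning None rather than guessing leaves unparseable values, such as a sentence captured in place of a date, for the validator or a human reviewer to catch.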

49
Example Personal Authors
50
Example Personal Authors (cont.)
  • We extract
  • <PersonalAuthor>Patricia H. Doherty Leo F.
    McNamara Susan H. Delay Neil J.
    Grossbard</PersonalAuthor>
  • DTIC requires
  • <PersonalAuthor>Patricia H. Doherty; Leo F.
    McNamara; Susan H. Delay; Neil J.
    Grossbard</PersonalAuthor>
  • NASA requires
  • <author>Patricia H. Doherty</author>
  • <author>Leo F. McNamara</author>
  • <author>Susan H. Delay</author>
  • <author>Neil J. Grossbard</author>

51
Post-Processing Requirements
  • Post-processing rules must vary by
  • metadata field
  • collection
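
One way to let rules vary by both field and collection is a lookup table keyed on the pair. The sketch below is not the project's architecture: the registry, the decorator, and both example rules (including the short list of ranks) are hypothetical.

```python
from typing import Callable, Dict, Tuple

Rule = Callable[[str], str]

# Registry keyed by (collection, field name).
RULES: Dict[Tuple[str, str], Rule] = {}


def rule(collection: str, field_name: str) -> Callable[[Rule], Rule]:
    """Decorator that registers a post-processing rule for one field of one collection."""
    def register(fn: Rule) -> Rule:
        RULES[(collection, field_name)] = fn
        return fn
    return register


# Hypothetical rule: drop a few military ranks from DTIC author names.
RANKS = {"LCDR", "CDR"}


@rule("dtic", "PersonalAuthor")
def strip_ranks(value: str) -> str:
    return " ".join(tok for tok in value.split() if tok.strip(",") not in RANKS)


# Hypothetical rule: collapse runs of whitespace in NASA titles.
@rule("nasa", "UnclassifiedTitle")
def collapse_whitespace(value: str) -> str:
    return " ".join(value.split())


def post_process(collection: str, metadata: Dict[str, str]) -> Dict[str, str]:
    """Apply the registered rule to each field; fields without a rule pass through."""
    return {
        name: RULES.get((collection, name), lambda v: v)(value)
        for name, value in metadata.items()
    }


cleaned = post_process(
    "dtic",
    {"PersonalAuthor": "LCDR Kathleen A. Kerrigan", "UnclassifiedTitle": "A  B"},
)
```

Porting to a new collection then means registering new rules under a new collection key, with no change to `post_process` itself.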

52
Post-Processing Architecture
53
Image-Based Classification
  • filter to find likely candidates for
    validator-based selection of template
  • Looking at a variety of techniques inspired by
    work in image recognition

54
Example Image-Based Classification
  • Example: represent a page using various colors to
    denote images, text, bold text, etc.
  • find visually most similar pages in documents of
    known classes
  • vote based on 5 most similar documents
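
The vote over the 5 most similar documents is a k-nearest-neighbor classification. In the sketch below each page is reduced to a short string of cell labels standing in for the colors (T text, B bold text, I image, . blank); that encoding, the distance measure, and the sample pages are all made up for illustration.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of cells whose labels differ (a hypothetical similarity measure)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def classify(page: Sequence[str],
             known: List[Tuple[Sequence[str], str]],
             k: int = 5) -> str:
    """Return the majority class among the k known pages closest to `page`."""
    nearest = sorted(known, key=lambda item: distance(page, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Pages of known classes, each an 8-cell sketch of a page layout.
known_pages = [
    ("TTBB..II", "au"),
    ("TTBB..I.", "au"),
    ("TTBB....", "au"),
    ("IIII....", "nasa"),
    ("III.....", "nasa"),
    ("..TTTTTT", "gpo"),
]

predicted = classify("TTBB..I.", known_pages)
```

Used as a filter, such a classifier would only need to narrow the set of templates handed to the validator, so occasional wrong votes are tolerable.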

55
Visual Matching Example (1/2)
56
Visual Matching Example (2/2)
57
Conclusions
  • Automated metadata extraction can be performed
    effectively on a wide variety of documents
  • Coping with heterogeneous collections is a major
    challenge
  • Much attention must be paid to support issues
  • validation, post-processing, etc.