Title: The ODU Metadata Extraction Project
1- The ODU Metadata Extraction Project
- March 28, 2007
- Dr. Steven J. Zeil
- zeil_at_cs.odu.edu
2Outline
- Overview
- Recent Developments
- Independent Document Model
- Validation
- Diversifying NASA and GPO Collections
- New Issues and Future Directions
- Post-processing
- Image-Based Classification
3 1. Overview
4Input Processing and OCR
- Select pages of interest
- Apply Off-The-Shelf OCR software
- Convert OCR output to XML model format
5Form Processing
- Scan document for form names
- Select form template
- Apply form extraction engine to document and
template
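A minimal sketch of the "scan for form names, select template" step. The template name sf298_2 comes from the sample that follows; the regular expressions and the function name select_form_template are assumptions made for illustration, not the project's actual rules.

```python
import re
from typing import Optional

# Hypothetical mapping from form-name patterns to template names.
# Only the template name "sf298_2" appears in the slides.
FORM_TEMPLATES = [
    (re.compile(r"standard\s+form\s+298", re.IGNORECASE), "sf298_2"),
    (re.compile(r"\bSF\s*298\b", re.IGNORECASE), "sf298_2"),
]


def select_form_template(page_text: str) -> Optional[str]:
    """Return the template for the first form name found in the OCR text."""
    for pattern, template in FORM_TEMPLATES:
        if pattern.search(page_text):
            return template
    return None  # no known form: fall through to non-form processing
```

A page with no recognized form name returns None, which is where the non-form path would take over.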
6Sample RDP
7Sample RDP (cont.)
8Metadata Extracted from Sample RDP (1/3)
- <metadata templateName="sf298_2">
- <ReportDate>18-09-2003</ReportDate>
- <DescriptiveNote>Final Report</DescriptiveNote>
- <DescriptiveNote>1 April 1996 - 31 August 2003</DescriptiveNote>
- <UnclassifiedTitle>VALIDATION OF IONOSPHERIC MODELS</UnclassifiedTitle>
- <ContractNumber>F19628-96-C-0039</ContractNumber> <ContractNumber></ContractNumber>
- <ProgramElementNumber>61102F</ProgramElementNumber>
- <PersonalAuthor>Patricia H. Doherty Leo F. McNamara
Susan H. Delay Neil J. Grossbard</PersonalAuthor>
- <ProjectNumber>1010</ProjectNumber>
- <TaskNumber>IM</TaskNumber>
- <WorkUnitNumber>AC</WorkUnitNumber>
- <CorporateAuthor>Boston College / Institute for
Scientific Research 140 Commonwealth Avenue
Chestnut Hill, MA 02467-3862</CorporateAuthor>
9Metadata Extracted from Sample RDP (2/3)
- <ReportNumber></ReportNumber>
- <MonitorNameAndAddress>Air Force Research
Laboratory 29 Randolph Road Hanscom AFB, MA
01731-3010</MonitorNameAndAddress>
- <MonitorAcronym>VSBP</MonitorAcronym>
- <MonitorSeries>AFRL-VS-TR-2003-1610</MonitorSeries>
- <DistributionStatement>Approved for public
release distribution unlimited.</DistributionStatement>
- <Abstract>This document represents the final report for work
performed under the Boston College contract F I9628-96C-0039. This
contract was entitled Validation of Ionospheric Models. The
objective of this contract was to obtain satellite and ground-based
ionospheric measurements from a wide range of geographic locations
and to utilize the resulting databases to validate the theoretical
ionospheric models that are the basis of the Parameterized Real-time
Ionospheric Specification Model (PRISM) and the Ionospheric Forecast
Model (IFM). Thus our various efforts can be categorized as either
observational databases or modeling studies.</Abstract>
10Metadata Extracted from Sample RDP (3/3)
- <Identifier>Ionosphere, Total Electron Content (TEC), Scintillation,
Electron density, Parameterized Real-time Ionospheric Specification
Model (PRISM), Ionospheric Forecast Model (IFM), Paramaterized
Ionosphere Model (PIM), Global Positioning System (GPS)</Identifier>
- <ResponsiblePerson>John Retterer</ResponsiblePerson>
- <Phone>781-377-3891</Phone>
- <ReportClassification>U</ReportClassification>
<AbstractClassification>U</AbstractClassification>
- <AbstractLimitaion>SAR</AbstractLimitaion>
- </metadata>
11Non-Form Processing
- Classification: compare document against known document layouts
- Select template written for closest matching layout
- Apply non-form extraction engine to document and template
12Non-Form Sample (1/2)
13Non-Form Sample (2/2)
14Template Used for Sample Document
- <structdef pagenumber="1" templateID="au">
- <identifier min="1" max="1">
- <begin inclusive="current">
- <stringmatch case="yes" loc="beginwith">AU/</stringmatch>
- </begin>
- <end>onesection</end>
- </identifier>
- <CorporateAuthor min="1" max="1">
- <begin inclusive="current">
- <stringmatch case="no" loc="beginwith">
- AIR COMMAND AIR WAR
- </stringmatch>
- </begin>
- <end inclusive="current">
- <stringmatch case="no" loc="beginwith">AIR UNIVERSITY</stringmatch>
- </end>
- </CorporateAuthor>
- <UnclassifiedTitle min="1" max="1">
- <begin inclusive="after">CorporateAuthor</begin>
15Metadata Extracted From the Title Page of the
Sample Document
- <paper templateid="au">
- <identifier>AU/ACSC/012/1999-04</identifier>
- <CorporateAuthor>AIR COMMAND AND STAFF COLLEGE
- AIR UNIVERSITY</CorporateAuthor>
- <UnclassifiedTitle>INTEGRATING COMMERCIAL
- ELECTRONIC EQUIPMENT TO IMPROVE
- MILITARY CAPABILITIES
- </UnclassifiedTitle>
- <PersonalAuthor>Jeffrey A. Bohler LCDR, USN</PersonalAuthor>
- <advisor>Advisor CDR Albert L. St.Clair</advisor>
- <ReportDate>April 1999</ReportDate>
- </paper>
16Post-Processing
- Coerce extracted values into standard formats
17Validation
- Estimate quality of extracted metadata
- Untrusted outputs referred (to humans) for review
and correction
18Recent Developments
- Independent Document Model
- Validation
- Diversifying NASA and GPO Collections
19 Independent Document Model (IDM)
- Platform-independent Document Model
- Motivation
- Dramatic XML Schema change between OmniPage 14 and 15
- Tie the template engine to a stable specification
- Protects from linking directly to a specific OCR product
- Allows us to include statistics for enhanced feature usage
- Statistics (e.g., avgDocFontSize, avgPageFontSize, wordCount, avgDocWordCount)
20Documents in IDM
- A document consists of pages
- pages are divided into regions
- regions may be divided into
- blocks of vertical whitespace
- paragraphs
- tables
- images
- paragraphs are divided into lines
- lines are divided into words
- All of these carry standard attributes for size,
position, font, etc.
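The containment hierarchy above can be sketched as plain data classes. This is an illustration only: the class and attribute names (Box, font_size, and so on) are assumptions, not the actual IDM schema, which is XML.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Box:
    """Standard attributes carried by every node (attribute names assumed)."""
    x: float = 0.0
    y: float = 0.0
    width: float = 0.0
    height: float = 0.0
    font_size: float = 0.0


@dataclass
class Word:
    text: str
    box: Box = field(default_factory=Box)


@dataclass
class Line:
    words: List[Word] = field(default_factory=list)
    box: Box = field(default_factory=Box)


@dataclass
class Paragraph:
    lines: List[Line] = field(default_factory=list)
    box: Box = field(default_factory=Box)


@dataclass
class Table:
    box: Box = field(default_factory=Box)


@dataclass
class Image:
    box: Box = field(default_factory=Box)


@dataclass
class Whitespace:
    """A block of vertical whitespace."""
    box: Box = field(default_factory=Box)


@dataclass
class Region:
    children: List[Union[Whitespace, Paragraph, Table, Image]] = field(default_factory=list)
    box: Box = field(default_factory=Box)


@dataclass
class Page:
    regions: List[Region] = field(default_factory=list)
    box: Box = field(default_factory=Box)


@dataclass
class Document:
    pages: List[Page] = field(default_factory=list)


def word_count(doc: Document) -> int:
    """Walk document -> pages -> regions -> paragraphs -> lines and count words."""
    return sum(
        len(line.words)
        for page in doc.pages
        for region in page.regions
        for child in region.children
        if isinstance(child, Paragraph)
        for line in child.lines
    )
```

A statistic such as wordCount then falls out of a single walk over the tree, which is the kind of derived value the model is meant to carry.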
21Generating IDM
- Use XSLT 2.0 stylesheets to transform OCR output XML into IDM
- Supporting a new OCR schema only requires a new XSLT stylesheet; the engine does not change
22IDM Usage
- OmniPage 14 XML Doc -> docTreeModelOmni14.xsl -> IDM XML Doc
- OmniPage 15 XML Doc -> docTreeModelOmni15.xsl -> IDM XML Doc
- Other OCR Output XML Doc -> docTreeModelOther.xsl -> IDM XML Doc
- IDM XML Doc -> Form Based Extraction, Non Form Extraction
23IDM Tool Status
- Converters completed to generate IDM from OmniPage 14 and 15 XML
- OmniPage 15 proved to have numerous errors in its representation of an OCR'd document
- Consequently, not recommended
- Form-based extraction engine revised to work from IDM
- Non-form engine still works from our older CleanXML
- Converter from IDM to CleanXML completed as stop-gap measure
- Direct use of IDM deferred pending review of other engine modifications
24B. Validation
- Given a set of extracted metadata
- mark each field with a confidence value indicating how trustworthy the extracted value is
- mark the set with a composite confidence score
- Fields and sets with low confidence scores may be referred for additional processing
- automated post-processing
- human intervention and correction
25Validating Extracted Metadata
- Techniques must be independent of the extraction method
- A validation specification is written for each collection, combining
- Field-specific validation rules
- statistical models derived for each field of
- text length
- % of words from English dictionary
- % of phrases from knowledge base prepared for that field
- pattern matching
26Sample Validation Specification
- Combines results from multiple fields
- <val:validate collection="dtic"
- xmlns:val="jelly:edu.odu.cs.dtic.validation.ValidationTagLibrary">
- <val:average>
- <val:field name="UnclassifiedTitle">...</val:field>
- <val:field name="PersonalAuthor">...</val:field>
- <val:field name="CorporateAuthor">...</val:field>
- <val:field name="ReportDate">...</val:field>
- </val:average>
- </val:validate>
27Validation Spec Field Tests
- Each field is subjected to one or more tests
- <val:field name="PersonalAuthor">
- <val:average>
- <val:length/>
- <val:max>
- <val:phrases length="1"/>
- <val:phrases length="2"/>
- <val:phrases length="3"/>
- </val:max>
- </val:average>
- </val:field>
- <val:field name="ReportDate">
- <val:reportFormat/>
- </val:field>
- ...
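One way to read the nested average and max elements in the PersonalAuthor spec is as combinators over tests that each return a confidence in [0, 1]. The sketch below assumes that reading. The two leaf tests are simplified stand-ins (the real validator uses statistical models per field), and KNOWN_PHRASES is a hypothetical knowledge base.

```python
from statistics import mean
from typing import Callable, Set

Test = Callable[[str], float]  # field value -> confidence in [0, 1]


def average(*tests: Test) -> Test:
    """Combine tests by averaging their confidences."""
    return lambda value: mean([t(value) for t in tests])


def maximum(*tests: Test) -> Test:
    """Combine tests by taking the most favorable confidence."""
    return lambda value: max(t(value) for t in tests)


def length_test(low: int, high: int) -> Test:
    """Stand-in for the length test: 1.0 if the word count is in range."""
    def run(value: str) -> float:
        return 1.0 if low <= len(value.split()) <= high else 0.0
    return run


def phrase_test(n: int, known: Set[str]) -> Test:
    """Stand-in for the phrases test: fraction of n-word phrases found in known."""
    def run(value: str) -> float:
        words = value.lower().split()
        grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
        if not grams:
            return 0.0
        return sum(g in known for g in grams) / len(grams)
    return run


# Hypothetical knowledge base for the PersonalAuthor field.
KNOWN_PHRASES = {"kathleen", "kerrigan", "kathleen a. kerrigan"}

personal_author = average(
    length_test(2, 12),
    maximum(
        phrase_test(1, KNOWN_PHRASES),
        phrase_test(2, KNOWN_PHRASES),
        phrase_test(3, KNOWN_PHRASES),
    ),
)
```

With these stand-ins, a clean name scores higher than the same name preceded by label text such as "Name of Candidate", which is the behavior the spec is meant to capture.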
28 Sample Input Metadata Set
- <metadata>
- <UnclassifiedTitle>Thesis Title The Military Extraterritorial Jurisdiction Act</UnclassifiedTitle>
- <PersonalAuthor>Name of Candidate LCDR Kathleen A. Kerrigan</PersonalAuthor>
- <ReportDate>Accepted this 18th day of June 2004 by</ReportDate>
- </metadata>
29Sample Validator Output
- <metadata confidence="0.522">
- <UnclassifiedTitle confidence="0.943">Thesis Title The Military Extraterritorial Jurisdiction Act</UnclassifiedTitle>
- <PersonalAuthor confidence="0.622">Name of Candidate LCDR Kathleen A. Kerrigan</PersonalAuthor>
- <ReportDate confidence="0.0" warning="ReportDate field does not match required pattern">Accepted this 18th day of June 2004 by</ReportDate>
- </metadata>
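The composite score in this output is consistent with a plain average of the three field confidences: (0.943 + 0.622 + 0.0) / 3 ≈ 0.522. A short check follows. The review threshold of 0.5 is an assumption for illustration, and how the validator treats fields that are absent from the set (such as CorporateAuthor here) is not stated in the slides.

```python
# Field confidences copied from the sample validator output.
field_confidence = {
    "UnclassifiedTitle": 0.943,
    "PersonalAuthor": 0.622,
    "ReportDate": 0.0,
}

# Composite score as the average of the fields that are present.
composite = sum(field_confidence.values()) / len(field_confidence)

# Hypothetical threshold below which a field is referred for review.
REVIEW_THRESHOLD = 0.5
needs_review = [name for name, score in field_confidence.items()
                if score < REVIEW_THRESHOLD]

print(round(composite, 3), needs_review)
```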
30Classification (a priori)
- Previously, we had attempted various schemes for a priori classification
- x-y trees
- bin classification
- Still investigating some
- image-based recognition
31Post-Hoc Classification
- Apply all templates to document
- results in multiple candidate sets of metadata
- Score each candidate using the validator
- Select the best-scoring set
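The four steps above can be sketched as a single selection function. The names classify_post_hoc, Extractor, and Validator are hypothetical; each extractor stands in for the extraction engine applied with one template, and the validator for the composite confidence score.

```python
from typing import Callable, Dict, Optional, Tuple

Metadata = Dict[str, str]
Extractor = Callable[[str], Optional[Metadata]]  # engine + one template
Validator = Callable[[Metadata], float]          # composite confidence


def classify_post_hoc(
    document: str,
    templates: Dict[str, Extractor],
    validator: Validator,
) -> Tuple[str, Metadata, float]:
    """Apply every template, score each candidate, return the best one."""
    candidates = []
    for name, extract in templates.items():
        metadata = extract(document)
        if metadata is not None:  # a template may fail to match at all
            candidates.append((name, metadata, validator(metadata)))
    if not candidates:
        raise ValueError("no template produced metadata")
    return max(candidates, key=lambda candidate: candidate[2])
```

The cost is one extraction per template per document, which is the motivation for a cheap filter that narrows the candidate templates first.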
32Experimental Results
33Interpretation of Results
- Validator agreed with human on 125 out of 167 cases
- Of 42 cases where they disagreed
- 37 were due to extra words in extracted metadata (e.g., military ranks in author names)
- highlights need for post-processing to clean up metadata
- 2 were mistakes by template
- 2 were due to garbled characters by OCR
- 1 due to a bug in the validator
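The counts above work out as follows. The snippet only restates the slide's numbers as rates; the dictionary keys are shorthand labels, not terms from the study.

```python
total_cases = 167
agreed = 125
disagreed = total_cases - agreed  # 42, matching the slide

# Causes of disagreement as listed on the slide.
causes = {
    "extra words": 37,
    "template mistakes": 2,
    "OCR garbling": 2,
    "validator bug": 1,
}
assert sum(causes.values()) == disagreed

agreement_rate = agreed / total_cases                   # about 0.749
extra_word_share = causes["extra words"] / disagreed    # about 0.881

print(f"agreement: {agreement_rate:.1%}, extra words: {extra_word_share:.1%}")
```

Roughly three quarters of the cases agree, and close to nine in ten of the disagreements trace back to extra words rather than to the validator itself.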
34C. Diversifying NASA and GPO Collections
- Document collections differ in
- whether forms are used and form layout
- document layout
- what metadata fields are present and which ones are collected
35Changing Collections
- Porting to a new document collection
- identify pages of interest
- training classifiers to recognize new document layouts (?)
- templates for forms and document layouts
- new validation scripts
- collect statistics for collection model
- new post-processing rules
- No changes required to core engines and other software
36NASA Technical Reports
- Different layouts than DTIC
- fewer total
- tend to be visually more similar
- mixture with and without RDPs
37NASA Sample Document
38Extracted Metadata for NASA Sample
- <paper templateid="singleAuthor">
- <metadata>
- <UnclassifiedTitle>
- A Computationally Efficient Meshless Local Petrov-Galerkin Method
- for Axisymmetric Problems
- </UnclassifiedTitle>
- <PersonalAuthor>
- I.S. Raju and T. Chen?
- </PersonalAuthor>
- <CorporateAuthor>
- NASA Langley Research Center
- Hampton, VA 23681
- </CorporateAuthor>
- <Abstract>
- The Meshless Local Petrov-Galerkin (MLPG)
- method is one of the recently developed element-free
39Govt. Printing Office
- Congressional acts and reports
- EPA reports
- Preliminary study with Acts of Congress and EPA reports
- samples suggest layouts are more diverse than DTIC or NASA
- metadata actually present in document varies widely
40GPO Sample Act of Congress
41Metadata Extracted for Act of Congress
- <paper>
- <metadata>
- <public_law_report_num>
- 118 STAT. 3984 PUBLIC LAW 108-493-DEC. 23, 2004
- </public_law_report_num>
- <bill_number>H.R. 5394 components.</bill_number>
- <congress_num>108th Congress</congress_num>
- <type>An Act</type>
- <acttype>
- Dec. 23, 2004 To amend the Internal Revenue Code of 1986 to modify the taxation of arrow
- H.R. 5394 components.
- </acttype>
- </metadata>
- </paper>
42GPO sample report
43Metadata Extracted from GPO Sample Report
- <paper>
- <metadata>
- <title>
- CHINA'S PROLIFERATION PRACTICES
- AND ROLE IN THE NORTH KOREA CRISIS
- </title>
- <type>
- HEARING BEFORE THE
- U.S.-CHINA ECONOMIC AND SECURITY
- REVIEW COMMISSION
- </type>
- <session>ONE HUNDRED NINTH CONGRESS FIRST SESSION</session>
- <date>MARCH 10, 2005</date>
- <use>
- Printed for the use of the
- U.S.-China Economic and Security Review Commission
- </use>
- <online>Available via the World Wide Web http://www.uscc.gov</online>
- </metadata>
44 3. New Issues and Future Directions
- Post-Processing
- Image-Based Classification
45Post-processing
46Post-processing
- WYSIWYG
- What You See is What You Get
- WYG ≠ WYW
47Post-processing
- WYSIWYG
- What You See is What You Get
- WYG ≠ WYW
- What You Get is not What You Want
48Example DTIC Date Format
- Document may contain
- March 28, 2007
- 3/28/2007
- 3/28/07
- DTIC requires
- 28 MAR 2007
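A minimal sketch of a date rule covering the three input forms listed above. The function name to_dtic_date and the list of accepted formats are assumptions. The code relies on English month names (the default C locale), and whether DTIC expects a zero-padded day for single-digit days is not stated in the slides.

```python
from datetime import datetime

# Input formats seen in documents, tried in order (assumed, not exhaustive).
INPUT_FORMATS = ("%B %d, %Y", "%m/%d/%Y", "%m/%d/%y")


def to_dtic_date(text: str) -> str:
    """Coerce a date string into the DD MON YYYY form, e.g. 28 MAR 2007."""
    cleaned = text.strip()
    for fmt in INPUT_FORMATS:
        try:
            parsed = datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # this format does not fit; try the next one
        return parsed.strftime("%d %b %Y").upper()
    raise ValueError(f"unrecognized date: {text!r}")
```

A value that matches none of the formats raises an error instead of being passed through, so it can be referred for review.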
49Example Personal Authors
50Example Personal Authors (cont.)
- We extract
- <PersonalAuthor>Patricia H. Doherty Leo F. McNamara Susan H. Delay Neil J. Grossbard</PersonalAuthor>
- DTIC requires
- <PersonalAuthor>Patricia H. Doherty Leo F. McNamara Susan H. Delay Neil J. Grossbard</PersonalAuthor>
- NASA requires
- <author>Patricia H. Doherty</author>
- <author>Leo F. McNamara</author>
- <author>Susan H. Delay</author>
- <author>Neil J. Grossbard</author>
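A sketch of how the single extracted author string might be split into one element per author for NASA. The pattern is a heuristic that assumes every name has the shape "First M. Last" with at least one middle initial, which happens to hold for this sample; names without initials would need other rules. The function names are hypothetical.

```python
import re
from typing import List
from xml.sax.saxutils import escape

# Heuristic: given name, one or more initials, then a surname.
NAME = re.compile(r"[A-Z][a-z]+(?: [A-Z]\.)+ [A-Z][A-Za-z'-]+")


def split_authors(value: str) -> List[str]:
    """Split a run of 'First M. Last' names into individual names."""
    return NAME.findall(value)


def to_nasa_elements(value: str) -> List[str]:
    """Render one <author> element per name found."""
    return [f"<author>{escape(name)}</author>" for name in split_authors(value)]
```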
51Post-Processing Requirements
- Post-processing rules must vary by
- metadata field
- collection
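Because rules vary by both field and collection, one simple arrangement is a table keyed on the pair. The sketch below assumes that arrangement. The rule bodies, the rank list, and the choice of which collection gets which rule are all hypothetical examples, not DTIC or NASA requirements.

```python
from typing import Callable, Dict, Tuple

Rule = Callable[[str], str]

# Hypothetical, incomplete list of rank abbreviations to drop.
RANKS = {"LCDR", "CDR"}


def collapse_whitespace(value: str) -> str:
    """Replace runs of whitespace with single spaces."""
    return " ".join(value.split())


def strip_ranks(value: str) -> str:
    """Drop military rank abbreviations from an author string."""
    return " ".join(w for w in value.split() if w.rstrip(",") not in RANKS)


# Rules are looked up by (collection, field); both keys matter.
RULES: Dict[Tuple[str, str], Rule] = {
    ("dtic", "PersonalAuthor"): strip_ranks,
    ("nasa", "UnclassifiedTitle"): collapse_whitespace,
}


def post_process(collection: str, metadata: Dict[str, str]) -> Dict[str, str]:
    """Apply the matching rule to each field; leave other fields unchanged."""
    return {
        name: RULES.get((collection, name), lambda v: v)(value)
        for name, value in metadata.items()
    }
```

Adding a collection then means adding table entries, with no change to the function that applies them.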
52Post-Processing Architecture
53Image-Based Classification
- filter to find likely candidates for validator-based selection of template
- Looking at a variety of techniques inspired by work in image recognition
54Example Image-Based Classification
- Example: represent a page using various colors to denote images, text, bold text, etc.
- find visually most similar pages in documents of known classes
- vote based on 5 most similar documents
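The example reads as a nearest-neighbor vote with k = 5. The sketch below assumes a page is reduced to a coarse grid of cell labels (one character per cell, standing in for the colors) and that similarity is the fraction of matching cells. Both are simplifications chosen for illustration; the slides do not specify the actual page representation or similarity measure.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# A page as rows of cell labels, e.g. "B" bold text, "T" text, "I" image, "." blank.
PageGrid = Sequence[str]


def similarity(a: PageGrid, b: PageGrid) -> float:
    """Fraction of grid cells whose labels match (grids assumed same size)."""
    cells = [(x, y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(x == y for x, y in cells) / len(cells)


def classify(page: PageGrid, labeled: List[Tuple[PageGrid, str]], k: int = 5) -> str:
    """Vote among the k labeled pages most similar to the given page."""
    nearest = sorted(
        labeled,
        key=lambda item: similarity(page, item[0]),
        reverse=True,
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

The winning class would then select which templates are passed on to validator-based selection.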
55Visual Matching Example (1/2)
56Visual Matching Example (2/2)
57Conclusions
- Automated metadata extraction can be performed effectively on a wide variety of documents
- Coping with heterogeneous collections is a major challenge
- Much attention must be paid to support issues
- validation, post-processing, etc.