Title: Automated Metadata Extraction
1 Automated Metadata Extraction
July 17-20, 2006
Kurt Maly (maly@cs.odu.edu)
2 Outline
- Background and Motivation
- Challenges and Approaches
- Metadata Extraction Experience at ODU CS
- Architecture for Metadata Extraction
- Experiments with DTIC Documents
- Experiments with limited GPO Documents
- Conclusions
3 Digital Library Research at ODU
http://dlib.cs.odu.edu/
4 Motivation
- Metadata enhances the value of a document collection
- Using metadata helps resource discovery
  - It may save a company about $8,200 per employee to use metadata in its intranet, by reducing the time employees spend searching, verifying, and organizing files (estimate made by Mike Doane at the DCMI 2003 workshop)
- Using metadata helps make collections interoperable with OAI-PMH
- Manual metadata extraction is costly and time-consuming
  - It would take about 60 employee-years to create metadata for 1 million documents (estimate made by Lou Rosenfeld at the DCMI 2003 workshop); automatic metadata extraction tools are essential to reduce the cost
- Automatic extraction tools are essential for rapid dissemination at reasonable cost
- OCR alone is not sufficient for making legacy documents searchable
5 Challenges
- A successful metadata extraction system must
  - extract metadata accurately
  - scale to large document collections
  - cope with heterogeneity within a collection
  - maintain accuracy, with minimal reprogramming/training cost, as the collection evolves over time
  - have a validation/correction process
6 Approaches
- Machine Learning
  - HMM
  - SVM
- Rule-Based
  - Ad Hoc
  - Expert Systems
  - Template-Based (ODU CS)
7 Comparison
- Machine-Learning Approach
  - good adaptability, but it has to be trained from samples, which is very time consuming
  - performance degrades with increasing heterogeneity
  - difficult to add new fields to be extracted
  - difficult to select the right features for training
- Rule-Based Approach
  - no need for training from samples
  - can extract different metadata from different documents
  - rule writing may require significant technical expertise
8 Metadata Extraction Experience at ODU CS
- DTIC (2004, 2005)
  - developed software to automate the task of extracting metadata and basic structure from DTIC PDF documents
  - explored alternatives including SVM, HMM, and expert systems
  - origin of the ODU template-based engine
- GPO (in progress)
- NASA (in progress)
  - feasibility study to apply the template-based approach to the CASI collection
9 Meeting the Challenges
- All techniques achieved reasonable accuracy for small collections
  - possible to scale to large homogeneous collections
- Heterogeneity remains a problem
  - ad hoc rule-based systems tend to become complex monoliths
  - expert systems tend to accumulate large rule sets with complex, poorly understood interactions
  - machine learning must choose between reduced accuracy and confidence, or state explosion
- Evolution is problematic for machine-learning approaches
  - older documents may have a higher rate of OCR errors
  - expensive retraining is required to accommodate changes in the collection
  - potential lag time during which accuracy decays until sufficient training instances are acquired
- Validation: a largely unexplored area
  - machine-learning approaches offer some support via confidence measures
10 Architecture for Metadata Extraction
11 Our Approach: Meeting the Challenges
- Bi-level architecture
- Classification based upon document similarity
- Simple templates (rule-based) written for each emerging class
12 Our Approach: Meeting the Challenges
- Heterogeneity
  - classification, in effect, reduces the problem to multiple homogeneous collections
  - multiple templates are required, but each template is comparatively simple
  - each template only needs to accommodate one class of documents that share a common layout and style
- Evolution
  - new classes of documents are accommodated by writing a new template
    - templates are comparatively simple
    - no lengthy retraining required
    - potentially rapid response to changes in the collection
  - enriching the template engine by introducing new features reduces the complexity of templates
- Validation
  - exploring a variety of techniques drawn from automated software testing and validation
13 Metadata Extraction: Template-Based
- Template-based approach
  - classify documents into classes based on similarity
  - for each document class, create a template, i.e., a set of rules
  - decoupling rules from code: a template is kept in a separate file
- Advantages
  - easy to extend: for a new document class, just create a template
  - rules are simpler
  - rules can be refined easily
14 Classes of documents
15 Template engine
16 Document features
- Layout features
  - boldness: whether text is in a bold font or not
  - font size: the font size used in the text, e.g., font size 12, font size 14, etc.
  - alignment: whether text is left-aligned, right-aligned, centered, or justified
  - geometric location: for example, a block starting at coordinates (0, 0) and ending at coordinates (100, 200)
  - geometric relation: for example, a block located below the title block
17 Document features
- Textual features (see the feature-computation sketch below)
  - special words: for example, a string starting with "abstract"
  - special patterns: for example, a string matching the regular expression [1-2][0-9][0-9][0-9]
  - statistical features: for example, a string with more than 20 words, a string with more than 100 letters, or a string with more than 50 letters in upper case
  - knowledge features: for example, a string containing a last name from a name dictionary
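As an illustration only (not from the slides), a minimal Python sketch of how such textual features might be computed for one line of text; the name dictionary here is a tiny hypothetical stand-in:

    import re

    NAME_DICTIONARY = {"maly", "smith", "jones"}  # hypothetical last-name lexicon

    def textual_features(line):
        words = line.split()
        return {
            # special word: string starting with "abstract"
            "starts_with_abstract": line.lower().startswith("abstract"),
            # special pattern: contains a year matching [1-2][0-9][0-9][0-9]
            "contains_year": bool(re.search(r"[1-2][0-9][0-9][0-9]", line)),
            # statistical features
            "more_than_20_words": len(words) > 20,
            "uppercase_letters": sum(1 for c in line if c.isupper()),
            # knowledge feature: contains a last name from the dictionary
            "contains_last_name": any(w.strip(".,").lower() in NAME_DICTIONARY
                                      for w in words),
        }

    print(textual_features("Kurt Maly, Old Dominion University, 2006"))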
18 Template language
- XML based
- Related to document features
- XML schema
- Simple document model
  - document → page → zone → region → column → row → paragraph → line → word → character
19 Template sample
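A rough illustration of what such a template's rules could express (a hypothetical format sketched in Python, not ODU's actual XML schema): a template for one document class pairs each metadata field with feature conditions.

    import re

    # Hypothetical template: every rule name below is invented for illustration.
    TEMPLATE = {
        "title": {"page": 1, "min_font_size": 14, "bold": True},
        "date":  {"page": 1, "pattern": r"[1-2][0-9][0-9][0-9]"},
    }

    def matches(block, rule):
        # block: dict of layout/textual features for one text block
        if "page" in rule and block["page"] != rule["page"]:
            return False
        if "min_font_size" in rule and block["font_size"] < rule["min_font_size"]:
            return False
        if "bold" in rule and block["bold"] != rule["bold"]:
            return False
        if "pattern" in rule and not re.search(rule["pattern"], block["text"]):
            return False
        return True

    def extract(blocks, template):
        # the first block satisfying a field's rule supplies that field's value
        return {field: next((b["text"] for b in blocks if matches(b, rule)), None)
                for field, rule in template.items()}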
20 Sample document (PDF)
21 Scan / OCR output
22 Clean XML output
23 Template (part)
24 Metadata extracted
25 Results Summary from DTIC Project
26 Experiment with Limited GPO Documents
- 14 GPO documents having a Technical Report Documentation Page
- 57 GPO documents without a Technical Report Documentation Page
- 16 Congressional Reports
- 16 Public Law Documents
27 GPO Report Documentation Page
28 GPO Document
29 Congressional Report
30 Public Law Document
31 Conclusions
- OCR software works very well on current documents
- The template-based approach allows automatic metadata extraction from
  - dynamically changing collections
  - heterogeneous, large collections
  - report documentation pages
- High degree of accuracy
- Feasibility of extracting structural metadata (e.g., table of contents, tables, equations, sections)
32 Metadata Extraction Part II: Automatic Categorization
33 Document Categorization
- Problem: given
  - a collection of documents, and
  - a taxonomy of subject areas
- Classification: determine the subject area(s) most pertinent to each document
- Indexing: select a set of keywords / index terms appropriate to each document
34 Classification Techniques
- Manual (a.k.a. Knowledge Engineering)
  - typically, rule-based expert systems
- Machine Learning
  - Probabilistic (e.g., Naïve Bayesian)
  - Decision Structures (e.g., Decision Trees)
  - Profile-Based
    - compare document to profile(s) of subject classes
    - similarity rules similar to those employed in I.R.
  - Support Vector Machines (SVM)
35 Classification via Machine Learning
- Usually train-and-test
  - exploit an existing collection in which documents have already been classified
    - a portion used as the training set
    - another portion used as a test set
  - permits measurement of classifier effectiveness
  - allows tuning of classifier parameters to yield maximum effectiveness
- Single- vs. multi-label
  - can one document be assigned to multiple categories?
36 Automatic Indexing
- Assign to each document up to k terms drawn from a controlled vocabulary
- Typically reduced to a multi-label classification problem (see the sketch below)
  - each keyword corresponds to a class of documents for which that keyword is an appropriate descriptor
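A minimal sketch of that reduction (scikit-learn is used here as a stand-in; the slides do not name the tooling for this step, and the corpus below is a toy): one binary classifier per controlled-vocabulary term, keeping the top-k scoring terms.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.svm import LinearSVC

    docs  = ["wing lift drag airflow", "rotor blade helicopter flight", "soil crop irrigation"]
    terms = [["aerodynamics"], ["helicopters"], ["agriculture"]]  # known index terms

    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(terms)

    # one binary SVM per keyword: each keyword defines a class of documents
    clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)

    def top_k_terms(text, k=2):
        scores = clf.decision_function(vec.transform([text]))[0]
        return [mlb.classes_[i] for i in np.argsort(scores)[::-1][:k]]

    print(top_k_terms("helicopter rotor airflow"))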
37 Case Study: SVM Categorization
- Document collection from DTIC
  - 10,000 documents
  - previously classified manually
- Taxonomy of
  - 25 broad subject fields, divided into a total of
  - 251 narrower groups
- Document lengths average 2705 ± 1464 words, with 623 ± 274 significant unique terms
- The collection has 32,457 significant unique terms
38 Document Collection
40 Sample Broad Subject Fields
- 01--Aviation Technology
- 02--Agriculture
- 03--Astronomy and Astrophysics
- 04--Atmospheric Sciences
- 05--Behavioral and Social Sciences
- 06--Biological and Medical Sciences
- 07--Chemistry
- 08--Earth Sciences and Oceanography
41 Sample Narrow Subject Groups
- Aviation Technology
- 01 Aerodynamics
- 02 Military Aircraft Operations
- 03 Aircraft
- 0301 Helicopters
- 0302 Bombers
- 0303 Attack and Fighter Aircraft
- 0304 Patrol and Reconnaissance Aircraft
42 Distribution among Categories
44 Baseline
- Establish a baseline for state-of-the-art machine learning techniques
  - classification
  - training an SVM for each subject area
  - off-the-shelf document modelling and SVM libraries
45 Why SVM?
- Prior studies have suggested good results with SVM
  - relatively immune to overfitting (fitting to coincidental relations encountered during training)
  - few model parameters
  - avoids problems of optimizing in a high-dimensional space
46 Machine Learning: Support Vector Machines
- Binary classifier
- Finds the hyperplane with the largest margin separating the two classes of training samples
- Subsequently classifies items based on which side of that hyperplane they fall
[Figure: training points plotted by font size and line number, separated by a maximum-margin hyperplane with its margin marked]
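A toy version of this picture (scikit-learn; the two features mirror the figure's axes, but the data points are invented):

    from sklearn import svm

    # each sample: (font size, line number); label 1 = title line, 0 = other
    X = [[18, 1], [16, 2], [20, 1], [10, 12], [10, 30], [12, 25]]
    y = [1, 1, 1, 0, 0, 0]

    clf = svm.SVC(kernel="linear")  # maximum-margin linear separator
    clf.fit(X, y)

    print(clf.predict([[17, 1]]))   # large font near the top  -> likely title
    print(clf.predict([[10, 40]]))  # small font far down page -> likely not title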
47 SVM Evaluation
48 Baseline SVM Evaluation (Interim Report)
- Training and testing process repeated for multiple subject categories
- Determine accuracy
  - overall
  - positive (ability to recognize new documents that belong in the class the SVM was trained for)
  - negative (ability to reject new documents that belong to other classes)
- Explore training issues
49 SVM Out of the Box
- 16 broad categories with 150 or more documents
- Lucene library for extracting terms and forming weighted term vectors
- LibSVM for SVM training and testing
  - no normalization or parameter tuning
- Training set of 100/100 (positive/negative samples)
- Test set of 50/50
50 Accuracy
51 OotB Interpretation
- Reasonable performance on broad categories given the modest training set size
- Accuracy measured as (# correct decisions / test set size)
- A related experiment showed that with normalization and optimized parameter selection, accuracy could be improved by as much as an additional 10%
52 Training Set Size
53 Training Set Size
- Accuracy plateaus for training set sizes well under the number of terms in the document model
54 Training Issues
- Training set size
  - concern: detailed subject groups may have too few known examples to perform effective SVM training in that subject
  - possible solution: the collection may have few positive examples, but it has many, many negative examples
- Positive/negative training mixes
  - effects on accuracy
55 Increased Negative Training
56 Training Set Composition
- Experiment performed with 50 positive training examples
  - OotB SVM training
- Increasing the number of negative training examples has little effect on overall accuracy
  - but positive accuracy is reduced
57 Interpretation
- May indicate a weakness in SVM
  - or simply further evidence of the importance of optimizing SVM parameters
- May indicate the unsuitability of treating SVM output as a simple boolean decision
  - might do better as a best fit in a multi-label classifier (see the sketch below)
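One hypothetical version of that "best fit" reading, assuming one trained binary SVM per category: pick the category whose decision value (signed distance from its hyperplane) is largest, rather than thresholding each SVM's output at zero.

    def best_fit_category(classifiers, x):
        # classifiers: dict mapping category name -> trained binary SVM
        scores = {name: clf.decision_function([x])[0]
                  for name, clf in classifiers.items()}
        return max(scores, key=scores.get)  # most confident category wins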
58 Conclusions
- State of the art for DTIC-like collections will give on the order of 75% accuracy
- Key problems that need to be addressed
  - establish a baseline for other methods
  - validation: recognizing trusted results, and when to fall back on human intervention
  - improve on the baseline with more sophisticated methods
  - possible application of knowledge bases
59 Additional Slides
60 Metadata Extraction: Machine-Learning Approach
- Learn the relationship between input and output from samples, and make predictions for new data
- This approach has good adaptability, but it has to be trained from samples
- HMM (Hidden Markov Model), SVM (Support Vector Machine)
61 Machine Learning - Hidden Markov Models
- "Hidden Markov Modeling is a probabilistic technique for the study of observed items arranged in a discrete-time series." -- Alan B. Poritz, "Hidden Markov Models: A Guided Tour", ICASSP 1988
- An HMM is a probabilistic finite state automaton
  - it transits from state to state
  - it emits a symbol when visiting each state
  - the states themselves are hidden
62 Hidden Markov Models
- A Hidden Markov Model consists of
  - a set of hidden states (e.g., coin1, coin2, coin3)
  - a set of observation symbols (e.g., H and T)
  - transition probabilities: the probabilities of moving from one state to another
  - emission probabilities: the probability of emitting each symbol in each state
  - initial probabilities: the probability of each state being chosen as the first state
63 HMM - Metadata Extraction
- A document is a sequence of words that is produced by some hidden states (title, author, etc.)
- The parameters of the HMM are learned from samples in advance
- Metadata extraction then amounts to finding the most probable sequence of states (title, author, etc.) for a given sequence of words, as sketched below
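A minimal Viterbi sketch of that decoding step (pure Python; the two states and all probabilities below are invented for illustration):

    import math

    states = ["title", "body"]
    start  = {"title": 0.6, "body": 0.4}
    trans  = {"title": {"title": 0.7, "body": 0.3},
              "body":  {"title": 0.1, "body": 0.9}}

    def emit(state, word):
        # toy emission model: the title state favors capitalized words
        return 0.7 if word[0].isupper() == (state == "title") else 0.3

    def viterbi(words):
        # best[s] = (log-probability of the best path ending in state s, that path)
        best = {s: (math.log(start[s] * emit(s, words[0])), [s]) for s in states}
        for w in words[1:]:
            best = {s: max(((lp + math.log(trans[p][s] * emit(s, w)), path + [s])
                            for p, (lp, path) in best.items()),
                           key=lambda t: t[0])
                    for s in states}
        return max(best.values(), key=lambda t: t[0])[1]

    print(viterbi("Automated Metadata Extraction from scanned pages".split()))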
64 Machine Learning: Support Vector Machines
- Binary classifier (classifies data into two classes)
- Represents data with pre-defined features
- Finds the hyperplane with the largest margin separating the two classes of samples
- Classifies data into the two classes based on which side of the hyperplane they fall
[Figure: an SVM example that classifies a line into two classes (title, not title) using two features, font size and line number (1, 2, 3, etc.); each dot represents a line; red dot = title, blue dot = not title; the separating hyperplane and margin are shown]
65 SVM - Metadata Extraction
- Widely used in pattern recognition areas such as face detection, isolated handwritten digit recognition, gene classification, etc.
- Basic idea
  - classes → metadata elements
  - extracting metadata from a document → classifying each line (or block) into the appropriate classes
- For example
  - extracting the document title → classifying each line to see whether it is part of the title or not
66 Metadata Extraction: Rule-Based
- Basic idea
  - use a set of rules, based on human observation, to define how to extract metadata
  - for example, a rule may be: "The first line is the title" (see the sketch after this list)
- Advantages
  - can be implemented straightforwardly
  - no need for training
- Disadvantages
  - lack of adaptability (works only for similar documents)
  - difficult to work with a large number of features
  - difficult to tune the system when errors occur, because rules are usually fixed
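A minimal sketch of such rules in Python (the rules are of the kind quoted above, purely illustrative, not ODU's actual rule set):

    def extract_rule_based(lines):
        metadata = {}
        nonempty = [ln.strip() for ln in lines if ln.strip()]
        # rule 1: the first line is the title
        if nonempty:
            metadata["title"] = nonempty[0]
        # rule 2: a line starting with "by " names the authors
        for ln in nonempty[1:]:
            if ln.lower().startswith("by "):
                metadata["creator"] = ln[3:]
                break
        return metadata

    print(extract_rule_based(["Automated Metadata Extraction", "by Kurt Maly"]))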
67 Metadata Extraction - Rule-Based
- Expert system approach
  - build a large rule base using a standard language such as Prolog
  - use an existing expert system engine (for example, SWI-Prolog)
- Advantages
  - can use an existing engine
- Disadvantages
  - building the rule base is time-consuming
68 Metadata Extraction Experience at ODU CS
- We have knowledge databases obtained from analyzing the Arc and DTIC collections
  - Authors (4 million strings from http://arc.cs.odu.edu)
  - Organizations (79 from DTIC 250, 200 from DTIC 600)
  - Universities (52 from DTIC 250)