Title: Evaluation of Different Algorithms for Metadata Extraction
1Evaluation of Different Algorithms for Metadata
Extraction
Work in Progress Metadata Extraction Project
Sponsored by DTIC
- Department of Computer Science
- Old Dominion University
- 6 / 03 / 2004
2Contents
- Introduction
- Metadata Extraction Using SVMs
- Support Vector Machines
- Multi-Class SVMs
- Metadata Extraction as Multi-Class SVMs
- Metadata Extraction Using HMM
- Hidden Markov Model
- Metadata Extraction as HMM
- Metadata Extraction Using Templates
- Experiments
- SVMs
- HMMs
- Templates
- Conclusion and Future Work
3Introduction
Motivation: evaluate different approaches to
metadata extraction for the DTIC test bed.
- Machine Learning Approach
- Support Vector Machines
- Hidden Markov Model
- Rule-based approach
- Using rules to specify how to extract metadata
4Introduction
- Deliverables
- Software tool to extract metadata and structure
from a set of PDF documents.
- Feasibility report on extracting complex objects
such as figures, equations, references, and
tables from the document and representing them in
a DTD-compliant XML format.
5Introduction
- Schedule (Starting Date March 2004)
- Months 0-2: Working with DTIC to identify the
set of documents and the metadata of interest.
- Months 3-8: Developing software for metadata and
structure extraction from the selected set of PDF
documents.
- Months 9-12: Feasibility study for extracting
complex objects and representing the complete
document in XML.
6Support Vector Machines
- Binary classifier (classifies data into two classes)
- Represents data with pre-defined features
- Learning: find the hyperplane with the largest margin
that separates the two classes.
- Classifying: assign data to one of the two classes based
on which side of the hyperplane they lie.
[Figure: points plotted against feature 1 and feature 2, separated by a hyperplane with its margin]
The figure shows an SVM example that classifies a
person into one of two classes, overweight or not
overweight. Two features are pre-defined:
weight (feature 1) and height (feature 2). Each
dot represents a person. Red dot: overweight.
Blue dot: not overweight.
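As a minimal sketch of the classification step, assuming a hyperplane has already been learned: a point is assigned to a class by the sign of w.x + b. The weights, the bias, and the two sample people are made-up values for illustration only, not results from this project.

```python
def classify(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane w.x + b = 0 the point x lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical hyperplane over (weight in kg, height in cm); the numbers are invented.
w = [0.1, -0.05]
b = 0.5

print(classify(w, b, [95.0, 170.0]))  # positive side of the hyperplane
print(classify(w, b, [60.0, 180.0]))  # negative side of the hyperplane
```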
7Multi-Class SVMs
- Combining binary SVMs into a multi-class classifier
- One-vs-rest
- Classes: in this class or not in this class
- Positive training samples: data in this class
- Negative training samples: the rest
- K binary SVMs (K is the number of classes)
- One-vs-one
- Classes: in class one or in class two
- Positive training samples: data in this class
- Negative training samples: data in the other
class
- K(K-1)/2 binary SVMs
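The two decompositions can be sketched by enumerating the binary problems each one creates. The function names are illustrative; the five class labels are the tags used for the DTIC100 data set.

```python
from itertools import combinations

def one_vs_rest_tasks(classes):
    """One binary problem per class: that class against all the others."""
    return [(c, [o for o in classes if o != c]) for c in classes]

def one_vs_one_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))

classes = ["title", "author", "affiliation", "date", "others"]
print(len(one_vs_rest_tasks(classes)))  # K = 5
print(len(one_vs_one_tasks(classes)))   # K(K-1)/2 = 10
```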
8Metadata Extraction as SVMs
- Each element (title, author, etc.) in the
metadata set can be viewed as a class.
- Classify each line (or paragraph) into a class
- Feature set
- Line features (number of words, etc.)
- Word features: use each word as a feature. In
practice, word clustering techniques are used to
reduce the number of features. Word clustering
techniques group words based on
similarity.
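A minimal sketch of turning one line into a feature dictionary, assuming a single line feature (the word count) plus one occurrence-count feature per lowercased word. The feature names are invented for illustration, and no word clustering is applied here.

```python
from collections import Counter

def line_features(line):
    """Map a text line to {feature name: value}."""
    words = line.split()
    features = Counter()
    features["line:num_words"] = len(words)      # line feature
    for word in words:
        features["word:" + word.lower()] += 1    # word features, weighted by occurrence
    return dict(features)

print(line_features("Old Dominion University"))
```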
9Hidden Markov Model
- A probabilistic finite state automaton
- A sequence of observation symbols is produced by
the underlying states (hidden states) based on:
- Transition probabilities: the probabilities of moving from
one state to another
- Emission probabilities: the probabilities of
emitting each symbol in each state
- Learning: determining the transition and emission
probabilities from training data.
- Decoding: finding the most probable sequence of
hidden states that produces the sequence of
observation symbols.
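When the training data is fully tagged, the learning step can be sketched as relative-frequency counting. This is a simplified illustration that assumes no smoothing and ignores initial-state probabilities; the one-header training set is invented.

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sequences):
    """Estimate transition and emission probabilities from (symbol, state) sequences."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for seq in tagged_sequences:
        for i, (symbol, state) in enumerate(seq):
            emit_counts[state][symbol] += 1
            if i + 1 < len(seq):
                trans_counts[state][seq[i + 1][1]] += 1

    def normalize(counts):
        return {s: {k: v / sum(c.values()) for k, v in c.items()}
                for s, c in counts.items()}

    return normalize(trans_counts), normalize(emit_counts)

# Invented toy training data: one tagged header.
data = [[("Metadata", "title"), ("Extraction", "title"),
         ("J.", "author"), ("Tang", "author")]]
trans, emit = estimate_hmm(data)
print(trans["title"])
print(emit["author"])
```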
10Metadata Extraction as HMM
- A document header can be viewed as a sequence of
symbols (words, etc.) produced by the hidden
states (title, author, etc.)
- Metadata Extraction
- For a sequence of symbols (a document header),
- find the most probable sequence of states (title,
author, etc.)
- For example,
- Input: Converting Existing Corpus to an OAI
Compliant Repository, K. Maly, M. Zubair, J. Tang
- Output: title title title title title title title
title author author author author author author
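Decoding is commonly done with the Viterbi algorithm. The sketch below assumes a two-state model (title, author) over coarse token classes. All probabilities and the token classes are made up for illustration and are not the model trained in the experiments.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in state s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p].get(s, 0.0) * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

# Hypothetical model parameters (invented numbers).
states = ["title", "author"]
start_p = {"title": 0.9, "author": 0.1}
trans_p = {"title": {"title": 0.8, "author": 0.2},
           "author": {"title": 0.1, "author": 0.9}}
emit_p = {"title": {"word": 0.9, "initial": 0.05, "name": 0.05},
          "author": {"word": 0.1, "initial": 0.5, "name": 0.4}}

print(viterbi(["word", "word", "initial", "name"], states, start_p, trans_p, emit_p))
# ['title', 'title', 'author', 'author']
```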
11Metadata Extraction Using Templates
- A rule-based approach
- But it decouples the code from the rules
- All templates share the same code
- One template per document type
- Template
- An XML file that describes the document features
- Uses rules to define how to extract metadata from
this type of document
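To illustrate how rules can live outside the code, here is a small sketch in which one generic function applies whatever rules an XML template contains. The template format (element names, attributes) and the regular-expression rules are entirely hypothetical; the project's actual template schema is not described here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical template: element and attribute names are invented for illustration.
TEMPLATE = r"""
<template doctype="example-report">
  <field name="date" pattern="\b(?:March|June|December)\s+\d{4}\b"/>
  <field name="pubnum" pattern="\b[A-Z]+-[A-Z]+-\d+/\d+\b"/>
</template>
"""

def extract(template_xml, text):
    """Apply every rule in the template to the text; the code itself knows no field names."""
    root = ET.fromstring(template_xml)
    result = {}
    for field in root.findall("field"):
        match = re.search(field.get("pattern"), text)
        if match:
            result[field.get("name")] = match.group(0)
    return result

print(extract(TEMPLATE, "GIT-CC-92/28 June 1992"))
```

Supporting a new document type would then mean writing a new template rather than changing `extract`.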
12Experiments
- SVM
- Apply SVM to different data sets
- Objective: evaluate the performance on different
data sets.
- Software used
- LibSVM
- Multi-class SVMs: using the one-vs-one approach
- Features: textual features only
- word-specific features, such as city names
- line-specific features, such as the number of words in
a line
13SVM Experiments with different data sets
- Data Sets
- Data Set 1: Seymore935
- Downloaded from http://www-2.cs.cmu.edu/kseymore/ie.html
- 935 manually tagged document headers
- 15 tags: title, author, affiliation, address,
note, email, date, abstract, introduction
(intro), phone, keywords, web, degree,
publication number (pubnum), and page
- All tags except title, author, affiliation, and
date are ignored
- Using the first 500 for training and the rest for
test
- Data Set 2: DTIC100
- Selected 100 PDF files from the DTIC website based on
the Z39.18 standard
- OCR the first pages and convert to text format
- Manually tagged these 100 document headers
- 5 tags: title, author, affiliation, date, and
others
- Using the first 75 for training and the rest for
test
- Data Set 3: DTIC33
- A subset of DTIC100
- 33 tagged document headers with identical layout
- 5 tags: title, author, affiliation, date, and
others
- Using the first 24 for training and the rest for
test
14SVM Experiments with different data sets
15SVM Experiments with different data sets
16SVM Experiments with different data sets
17Experiments
- SVM
- Use SVM with different feature sets
- Objective: evaluate the performance of different
feature sets.
- Software used
- LibSVM
- Multi-class SVMs: using the one-vs-one approach, i.e.,
training one SVM classifier for each pair of classes.
- Research from the LibSVM developers shows that the
one-vs-one approach performs better than the
one-vs-rest approach
- Data set: DTIC100
- Manually tagged the XML files with layout
information
- Feature Sets
- Text: textual features only
- Textfont: textual features and font size feature
- Textfontbold: textual features and bold feature
18SVM with different feature sets
19SVM with different feature sets
20SVM with different feature sets
21SVM with different feature sets
- More
- Using layout information for documents with
very different layouts does not improve
performance significantly.
- A further step is to use a document set
with similar layout. We ran the same experiment
with DTIC33 and got better recall.
However, because the data set is too small, we
cannot draw a conclusion yet.
22Experiments
- HMM
- Data Set Seymore935
- One state per field (tag)
- Using the first 500 for training and the rest for
test
- Experimental Result
- Overall accuracy: 93.0
23Experiments
- Template
- Data Set
- DTIC100: 100 XML files with font size and bold
information
- It is divided into 7 classes according to layout
information
- For each class, a template is developed after
checking the first one or two documents in the
class. This template is then applied to the remaining
documents to get performance data (recall and
precision)
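Recall and precision can be computed as in the sketch below, assuming each extraction is represented as a (document id, field, value) triple and counts as correct only on an exact match with the manual tag. The sample triples are invented.

```python
def precision_recall(extracted, expected):
    """extracted, expected: sets of (document_id, field, value) triples."""
    correct = len(extracted & expected)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall

# Invented example: 4 manually tagged values, 3 extracted, 2 of them correct.
expected = {("d1", "title", "A"), ("d1", "author", "B"),
            ("d2", "title", "C"), ("d2", "author", "D")}
extracted = {("d1", "title", "A"), ("d1", "author", "X"), ("d2", "title", "C")}

precision, recall = precision_recall(extracted, expected)
print(precision, recall)  # 2 of 3 extracted are correct; 2 of 4 expected are found
```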
24Experiments
25Experiments
26Experiments
- Templates with more data
- demo
27Discussions
- We have done experiments with the SVM, HMM, and
template approaches
- The template approach is flexible and produces good
results.
- SVM looks more promising than HMM
- Results are better
- It processes the data line by line (or paragraph
by paragraph) instead of word by word
- It is easy to process layout information
28Discussions
- SVM
- Reported to have good performance
- Drawbacks:
- - Difficulty in selecting proper features
- - Difficulty in labeling a lot of training data
- - Converting data into features and training are
time-consuming.
- Template Approach
- Flexible and straightforward (rules can be
understood by humans)
- Drawback: rules are fixed, so it is difficult to adjust them
when errors occur.
29Overall Approach for Handling Large Collection
- Manual Classification
- This approach assumes it is possible for humans to
classify the large set of documents into similar
classes (based on time period, source
organization, etc.)
- For each class, randomly select, say, 100
documents and develop a template. Evaluate the
template by statistical sampling and refine
the template until the error is under a tolerance
level. Next, apply the refined template to the
whole set.
- Auto-Classification
- This approach assumes it is not humanly possible
to classify the large set of documents. In this
case we develop a higher-level set of rules on a
smaller sample for classification. Evaluate the
classification approach based on statistical
sampling.
- Next, develop the template for each class, apply,
and refine as outlined in the manual
classification approach.
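The evaluate-by-sampling step could look roughly like the sketch below. The function names, the tolerance value, and the toy "documents" (integers standing in for documents, with every tenth one treated as an extraction error) are assumptions made only for illustration.

```python
import random

def estimate_error_rate(documents, is_correct, sample_size, rng):
    """Estimate a template's error rate from a random sample of documents."""
    if not documents or sample_size <= 0:
        raise ValueError("need at least one document and a positive sample size")
    sample = rng.sample(documents, min(sample_size, len(documents)))
    errors = sum(1 for doc in sample if not is_correct(doc))
    return errors / len(sample)

def needs_refinement(error_rate, tolerance):
    """The template is refined again while the error stays above the tolerance level."""
    return error_rate > tolerance

# Toy stand-in: ids 0..999, every tenth document is extracted incorrectly.
documents = list(range(1000))
is_correct = lambda doc: doc % 10 != 0

rate = estimate_error_rate(documents, is_correct, 100, random.Random(0))
print(rate, needs_refinement(rate, tolerance=0.05))
```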
30Future work
- Evaluate different approaches for the DTIC test
bed, including a hybrid approach that
integrates the SVM and template-based approaches.
31Future work
- Enlarge the data set
- Currently, the data set is small
- We need to enlarge the data set to evaluate the different
approaches
32Thanks
33SVM
- The margin is the width of separation between the
two classes.
- The optimal hyperplane is the one with the maximal margin
of separation between the two classes.
- The support vectors are the instances closest to
the optimal hyperplane.
34SVM (cont.)
- Geometric interpretation: the support vectors uniquely
define the optimal hyperplane
35SVM (cnt.)
- SVM determines the hyperplane between two
classes from the training set
- SVM classifies input data based on which side of
the hyperplane they lie on.
36SVM (cont.)
- Mathematical interpretation
- We want $w \cdot x_i + b \ge 1$ if $y_i = 1$ ($x_i$ in class 1) and
$w \cdot x_i + b \le -1$ if $y_i = -1$ ($x_i$ in class 2).
The margin is $2 / \|w\|$.
- The problem then becomes a constrained
optimization problem: maximize $2 / \|w\|$, or equivalently minimize
$\|w\|^2$, subject to $y_i (w \cdot x_i + b) - 1 \ge 0$.
37SVM (cont.)
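- Unique solution: $w = \sum_i \alpha_i y_i x_i$, summed over all support
vectors
- Decision function: $f(x) = \mathrm{sign}\left(\sum_i \alpha_i y_i \, x_i \cdot x + b\right)$
- All other $x_i$ are irrelevant to the solution.
- Lagrangian: $L_P = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i y_i (w \cdot x_i + b) + \sum_i \alpha_i$, which gives
$w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$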
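A small worked instance may help. Assume two training points, (1, 1) in class +1 and (-1, -1) in class -1, so both are support vectors. The multipliers 0.25 and b = 0 were worked out by hand for this toy case (they satisfy the sum constraint and put both points exactly on the margin). The sketch rebuilds w from the support vectors and evaluates the decision function.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weight_vector(alphas, labels, support_vectors):
    """w = sum_i alpha_i * y_i * x_i over the support vectors."""
    dim = len(support_vectors[0])
    return [sum(a * y * x[d] for a, y, x in zip(alphas, labels, support_vectors))
            for d in range(dim)]

def decision(alphas, labels, support_vectors, b, x):
    """f(x) = sign(sum_i alpha_i * y_i * (x_i . x) + b)."""
    score = sum(a * y * dot(sv, x) for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return 1 if score >= 0 else -1

# Toy problem with hand-derived multipliers (an assumption for illustration).
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]
b = 0.0

w = weight_vector(alphas, labels, support_vectors)
margin = 2.0 / math.sqrt(dot(w, w))
print(w, margin)
print(decision(alphas, labels, support_vectors, b, (2.0, 0.0)))
print(decision(alphas, labels, support_vectors, b, (-3.0, 1.0)))
```

Here the margin equals the distance between the two training points, as expected when both lie exactly on the margin boundaries.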
38SVM (cont.)
- Advantages
- Can manage a very large number of
attributes/features.
- Linear regression has an overfitting problem when
the number of attributes is much larger than the
size of the training set.
- The SVM solution is determined by the support vectors
only.
- Various kernel functions can be used to map the input
space into a feature space
- For data that are not linearly separable, SVM uses kernel functions
to map them to a linearly separable space.
- In this way, SVM uses linear separation to solve
non-linear problems.
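As one illustration of the kernel idea, take the degree-2 polynomial kernel in two dimensions (chosen here only as an example; the slides do not say which kernels were used). Its value equals an ordinary dot product after an explicit mapping into a three-dimensional feature space, so a linear separator in that space corresponds to a non-linear one in the input space.

```python
import math

def poly_kernel(x, z):
    """Degree-2 polynomial kernel k(x, z) = (x . z)^2 for 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def feature_map(x):
    """Explicit feature space for the kernel above: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, z = (1.0, 2.0), (3.0, 4.0)
explicit = sum(a * b for a, b in zip(feature_map(x), feature_map(z)))
print(poly_kernel(x, z), explicit)  # both are 121 up to rounding
```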
39Experiment (SVM)
- Our experiment (working on 500 tagged headers, as
the paper described)
- 1. Knowledge collection
- Collect the authors' names from Archon (CERN
collection)
- Download a British word list from the Internet
- Collect country names from the web
- Collect USA city names
- Collect Canadian province names and USA state names
- Collect month names and their abbreviations
- Frequent words for degree, pubnum, notenum,
affiliation, address.
- Regular expressions for email and URL
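The regular-expression part of the knowledge collection might look like the following sketch. The two patterns are deliberately simplified assumptions, not the expressions used in the experiment, and the sample tokens are placeholders.

```python
import re

# Simplified, illustrative patterns; real-world email and URL syntax is broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s<>\"]+")

def tag_token(token):
    """Return 'email', 'url', or None for a single whitespace-delimited token."""
    if EMAIL_RE.fullmatch(token):
        return "email"
    if URL_RE.fullmatch(token):
        return "url"
    return None

print(tag_token("user@example.org"))
print(tag_token("http://www.openarchives.org"))
print(tag_token("Networks"))
```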
40Experiment(SVM)
- 2. Word Clustering
- Converting the original data
- For example,
- <title> Protocols for Collecting Responses +L+ in
Multi-hop Radio Networks +L+ </title>
- <author> Chungki Lee James E. Burns +L+ Mostafa
H. Ammar +L+ </author>
- <pubnum> GIT-CC-92/28 +L+ </pubnum>
- <date> June 1992 +L+ </date>
- Will be converted to
- <title> Cap1DictWord DictWord Cap1DictWord
Cap1DictWord +L+ prep CapWord1LowerWord4-LowerWord3
Cap1DictWord Cap1DictWord +L+ </title>
- <author> CapWord1LowerWord6 mayName mayName
singleCap mayName +L+ CapWord1LowerWord6 singleCap mayName +L+
</author>
- <pubnum> CapWord3-CapWord2-Digs2/Digs2 +L+
</pubnum>
- <date> month Digs4 +L+ </date>
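Part of this conversion can be sketched as a token-shape function. The version below only covers the month lookup and the character-run patterns (CapWord, LowerWord, Digs). The dictionary, name, preposition, and single-capital classes from the example are left out, and the exact rules of the original tool are an assumption.

```python
import re

MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}

def cluster_token(token):
    """Replace a token by a coarse class name built from its character runs."""
    if token.lower() in MONTHS:
        return "month"
    parts = []
    for run in re.findall(r"[A-Z]+|[a-z]+|[0-9]+|[^A-Za-z0-9]", token):
        first = run[0]
        if "A" <= first <= "Z":
            parts.append("CapWord%d" % len(run))
        elif "a" <= first <= "z":
            parts.append("LowerWord%d" % len(run))
        elif "0" <= first <= "9":
            parts.append("Digs%d" % len(run))
        else:
            parts.append(run)  # keep punctuation such as '-' or '/' as is
    return "".join(parts)

for token in ["Chungki", "Multi-hop", "GIT-CC-92/28", "June", "1992"]:
    print(token, "->", cluster_token(token))
```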
41Experiment(SVM)
- 3. Get Features
- Treat each word in the converted file as a feature,
and use its occurrence count as the weight.
- 4. The 500 headers are divided into 450 training
and 50 test headers.
- 5. Train each of the 15 classifiers using the
one-versus-all approach.
42Hidden Markov Models Example
- Someone is trying to deduce the weather from a piece
of seaweed
- For some reason, he cannot access weather
information (sun, cloud, rain) directly
- But he can observe the dampness of a piece of
seaweed (soggy, damp, dryish, dry)
- And the state of the seaweed is probabilistically
related to the state of the weather
43Hidden Markov Models (cont.)
44HMM problems (cont.)
The most probable sequence of hidden states is
the sequence that maximizes the probability among
Pr(dry, damp, soggy | sunny, sunny, sunny),
Pr(dry, damp, soggy | sunny, sunny, cloudy),
Pr(dry, damp, soggy | sunny, sunny, rainy), ...,
Pr(dry, damp, soggy | rainy, rainy, rainy)
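The exhaustive comparison can be sketched by enumerating all 27 weather sequences for the three observations. The sketch scores each candidate by the joint probability of the state sequence and the observations, which is one common way to make the comparison precise. Every probability below is invented for illustration.

```python
from itertools import product

STATES = ["sunny", "cloudy", "rainy"]
# Invented model parameters.
START = {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2}
TRANS = {"sunny": {"sunny": 0.5, "cloudy": 0.4, "rainy": 0.1},
         "cloudy": {"sunny": 0.3, "cloudy": 0.4, "rainy": 0.3},
         "rainy": {"sunny": 0.2, "cloudy": 0.3, "rainy": 0.5}}
EMIT = {"sunny": {"dry": 0.6, "dryish": 0.2, "damp": 0.15, "soggy": 0.05},
        "cloudy": {"dry": 0.25, "dryish": 0.25, "damp": 0.25, "soggy": 0.25},
        "rainy": {"dry": 0.05, "dryish": 0.10, "damp": 0.35, "soggy": 0.50}}

def joint_probability(state_seq, observations):
    """Probability of following state_seq and emitting the observations along the way."""
    prob = START[state_seq[0]] * EMIT[state_seq[0]][observations[0]]
    for prev, cur, obs in zip(state_seq, state_seq[1:], observations[1:]):
        prob *= TRANS[prev][cur] * EMIT[cur][obs]
    return prob

observations = ["dry", "damp", "soggy"]
candidates = list(product(STATES, repeat=len(observations)))
best = max(candidates, key=lambda seq: joint_probability(seq, observations))
print(len(candidates), best)  # 27 ('sunny', 'cloudy', 'rainy')
```

The number of candidates grows exponentially with the length of the observation sequence, which is why dynamic programming is normally used instead of enumeration.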
45Hidden Markov Models (cont.)
- A Hidden Markov Model consists of two sets of
states and three sets of probabilities
- Hidden states: the (true) states of a system
that may be described by a Markov process (e.g.
weather states in our example).
- Observable symbols: the symbols of the process
that are visible (e.g. dampness of the
seaweed).
- Initial probabilities for the hidden states
- Transition probabilities for the hidden states
- Emission probabilities for each observable symbol
in each hidden state
46Digital Library Research at ODU
47Open Archives Initiative: OAI-PMH 2.0
http://www.openarchives.org
48Connecting Islands of Digital Libraries
Islands of digital libraries need to be
interconnected so that users can access different
information resources from anywhere. There is a need for
manipulating, organizing, and correlating
information from different repositories for better
discovery. The Open Archives Initiative Protocol for Metadata
Harvesting (OAI-PMH) is an international effort
to build bridges across islands of digital
libraries. OAI does for digital libraries what
the Internet did for islands of isolated networks.
49Background - Open Archives Initiative (OAI)
The goal of the Open Archives Initiative Protocol
for Metadata Harvesting is to supply and promote
an application-independent interoperability
framework. The OAI protocol permits metadata
harvesting of a data provider by a service
provider.
Data providers support the OAI protocol
as a means of exposing metadata about the content
in their systems.
Service providers issue OAI
protocol requests to the systems of data
providers and use the returned metadata as a
basis for building value-added services.
http://www.openarchives.org
The word "open" in
OAI is meant from the architectural perspective:
defining and promoting machine interfaces.
Openness does not mean "free" or "unlimited"
access to the information repositories that
conform to the OAI technical framework.
The OAI
is an international effort. Major sponsors are the
Council on Library and Information Resources
(CLIR), the Digital Library Federation (DLF), the
Scholarly Publishing Academic
50What does it mean making an existing digital
library OAI enabled ?
[Diagram: a digital library and its storage, with an OAI layer on top that exposes ONLY METADATA (DC and parallel metadata sets) to OAI service providers]
51Minimal Dublin Core Metadata OAI Requirement
http://dublincore.org/documents/dces/
Fifteen elements (optional)
- Title: A name given to the resource. Typically, a Title will be a
name by which the resource is formally known.
- Creator: An entity primarily responsible for making the content of the
resource. Examples of a Creator include a person, an organisation, or a service.
- Subject: The topic of the content of the resource. Typically, a Subject will be
expressed as keywords.
- Description: An account of the content of the resource.
Description may include but is not limited to an
abstract, table of contents, reference to a
graphical representation of content, or a
free-text account of the content.
- Publisher: An entity responsible for making the
resource available. Examples of a Publisher
include a person, an organisation, or a service.
52Dublin Core Metadata
- Contributor: An entity responsible for
making contributions to the content of the resource.
- Date: A date associated with
an event in the life cycle of the resource.
Typically, Date will be associated with the
creation or availability of the resource.
- Type: The nature or genre of the content of the resource.
- Format: The physical or digital manifestation of the
resource. Typically, Format may include the
media-type or dimensions of the resource. Format
may be used to determine the software, hardware,
or other equipment needed to display or operate
the resource.
- Identifier: An unambiguous reference to the resource within a
given context. Example formal identification
systems include the Uniform Resource Locator (URL).
53Dublin Core Metadata
- Source: A reference to a resource from
which the present resource is derived.
- Language: A language of the intellectual
content of the resource.
- Relation: A reference to a related resource.
- Coverage: The extent or scope of the content of
the resource. Coverage will typically include
spatial location (a place name or geographic
coordinates), temporal period (a period label,
date, or date range), or jurisdiction (such as a
named administrative entity).
- Rights: Information about rights held in and over
the resource.
54Beyond Dublin Core Metadata
Need to support parallel metadata sets to enable
the OAI service provider to take advantage of the
richer metadata fields for resource discovery.
The OAI metadata harvesting protocol supports
the notion of parallel metadata sets, allowing
collections to expose metadata in formats that
are specific to their applications and domains.
The OAI technical framework places no limitations
on the nature of such parallel sets, other than
that the metadata records be structured as XML
data that have a corresponding XML schema for
validation.
55- Metadata Harvesting
- Move away from distributed searching.
- - It cannot scale well to a large number of
participants.
- Extract metadata from various sources.
- - Build services on local copies of metadata.
- - Data remains at remote repositories.
RCDL 2003, St. Petersburg
56- OAI Request and OAI Response.
- - OAI Request for Metadata is embedded in HTTP.
- - OAI Response to OAI Request is encoded in XML.
- - XML Schema specification for OAI Response is
provided in OAI-PMH document.
57Service Provider
Data Provider
- Supporting protocol requests
- Identify
- ListMetadataFormats
- ListSets
- Harvesting protocol requests
- ListRecords
- ListIdentifiers
- GetRecord
58Service Provider
Data Provider
Identify
- Repository name
- Base-URL
- Admin e-mail
- OAI protocol version
- Description Container
59Service Provider
Data Provider
ListMetadataFormats
- REPEAT
- Format prefix
- Format XML schema
- /REPEAT
60Service Provider
Data Provider
ListSets
- REPEAT
- Set Specification
- Set Name
- /REPEAT
61Service Provider
Data Provider
Request: ListRecords (from=a, until=b, set=klm, metadataPrefix=oai_dc)
- REPEAT
- Identifier
- Datestamp
- Metadata
- About Container
- /REPEAT
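An OAI-PMH request is an HTTP request whose arguments are passed as key=value pairs, so a harvester can assemble it as in the sketch below. The base URL and the dates are placeholders (assumptions). The verb and argument names (`verb`, `from`, `until`, `set`, `metadataPrefix`) are the protocol's own.

```python
from urllib.parse import urlencode

def oai_request_url(base_url, verb, params=None):
    """Build an OAI-PMH GET request URL from a verb and its arguments."""
    query = {"verb": verb}
    query.update({k: v for k, v in (params or {}).items() if v is not None})
    return base_url + "?" + urlencode(query)

# Placeholder repository URL and dates; set "klm" is the slide's example value.
url = oai_request_url(
    "http://example.org/oai",
    "ListRecords",
    {"from": "2003-01-01", "until": "2003-12-31", "set": "klm", "metadataPrefix": "oai_dc"},
)
print(url)
```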
62Service Provider
Data Provider
Request: ListIdentifiers (from=a, until=b, metadataPrefix=oai_dc, set=klm)
- REPEAT
- Identifier
- Datestamp
- /REPEAT
63Service Provider
Data Provider
Request: GetRecord (identifier=oai:mlib:123a, metadataPrefix=oai_dc)
- Identifier
- Datestamp
- Metadata
- About
64OAI Mechanics
The request is encoded in HTTP
The response is encoded in XML
XML Schemas for the responses are defined in the
OAI-PMH document
Courtesy: Michael Nelson
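Since responses are XML, a service provider can read them with any namespace-aware XML parser. The sketch below parses a hand-written, trimmed GetRecord response. The record content is invented, the identifier is a guess at the slide's example value, and a real response carries additional elements such as the response date and the echoed request.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written, trimmed sample response (illustrative values only).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>oai:mlib:123a</identifier>
        <datestamp>2003-06-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example title</dc:title>
          <dc:creator>Example author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>"""

def parse_record(xml_text):
    """Pull the header fields and two Dublin Core elements out of a GetRecord response."""
    root = ET.fromstring(xml_text)
    record = root.find("oai:GetRecord/oai:record", NS)
    return {
        "identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
        "datestamp": record.findtext("oai:header/oai:datestamp", namespaces=NS),
        "title": record.findtext("oai:metadata/oai_dc:dc/dc:title", namespaces=NS),
        "creator": record.findtext("oai:metadata/oai_dc:dc/dc:creator", namespaces=NS),
    }

print(parse_record(SAMPLE))
```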