Evaluation of Different Algorithms for Metadata Extraction

Learn more at: https://www.cs.odu.edu


1
Evaluation of Different Algorithms for Metadata
Extraction
Work in Progress Metadata Extraction Project
Sponsored by DTIC

Department of Computer Science
Old Dominion University
6 / 03 / 2004

2
Contents

Introduction
Metadata Extraction Using SVMs
Support Vector Machines
Multi-Class SVMs
Metadata Extraction as Multi-Class SVMs
Metadata Extraction Using HMM
Hidden Markov Model
Metadata Extraction as HMM
Metadata Extraction Using Templates
Experiments
SVMs
HMMs
Templates
Conclusion and Future Work

3
Introduction
Motivation: Evaluate different approaches to
metadata extraction for the DTIC test bed.

Machine Learning Approach
Support Vector Machines
Hidden Markov Model
Rule-based approach
Using rules to specify how to extract metadata

4
Introduction

Deliverables
Software tool to extract metadata and structure
from a set of pdf documents.
Feasibility report on extracting complex objects
such as figures, equations, references, and
tables from the document and representing them in
a DTD-compliant XML format.

5
Introduction

Schedule (starting date: March 2004)
Months 0-2: Working with DTIC to identify the
set of documents and the metadata of interest.
Months 3-8: Developing software for metadata and
structure extraction from the selected set of PDF
documents.
Months 9-12: Feasibility study for extracting
complex objects and representing the complete
document in XML.

6
Support Vector Machines

Binary classifier (classifies data into two classes)
Represents data with pre-defined features
Learning: find the hyperplane with the largest margin
that separates the two classes.
Classifying: assign data to one of the two classes based
on which side of the hyperplane they fall.

Figure: SVM example with two axes (feature 1, feature 2), showing the separating hyperplane and the margin.
The figure shows an SVM example that classifies a
person into one of two classes, overweight or not
overweight. Two features are pre-defined:
weight (feature 1) and height (feature 2). Each
dot represents a person. Red dot: overweight.
Blue dot: not overweight.
7
Multi-Class SVMs

Combining binary SVMs into a multi-class classifier
One-vs-Rest
Classes: in this class or not in this class
Positive training samples: data in this class
Negative training samples: the rest
K binary SVMs (K is the number of classes)
One-vs-One
Classes: in class one or in class two
Positive training samples: data in this class
Negative training samples: data in the other
class
K(K-1)/2 binary SVMs
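
As a minimal sketch of the two decomposition schemes, the following Python enumerates the binary training tasks each one produces. The class names are illustrative only, taken from the tags used later in the experiments; no actual SVM training is performed.

```python
from itertools import combinations


def one_vs_rest_tasks(classes):
    """One binary task per class: (positive class, list of negative classes)."""
    return [(c, [other for other in classes if other != c]) for c in classes]


def one_vs_one_tasks(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))


# Illustrative class list (an assumption for this sketch).
classes = ["title", "author", "affiliation", "date", "other"]
print(len(one_vs_rest_tasks(classes)))  # K binary classifiers
print(len(one_vs_one_tasks(classes)))   # K(K-1)/2 binary classifiers
```

With K = 5 classes this gives 5 one-vs-rest tasks and 10 one-vs-one tasks.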

8
Metadata Extraction as SVMs

Each element (title, author, etc.) in the
metadata set can be viewed as a class.
Classify each line (or paragraph) into a class
Feature set
Line features (number of words, etc.)
Word features: use each word as a feature. In
practice, word clustering techniques are used to
reduce the number of features. Word clustering
techniques group words into clusters based on
similarity.
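
A rough sketch of what line features and word features could look like in Python. Only "number of words" comes from the slide; the other line features, the function names, and the default lowercase "clustering" are assumptions made for illustration.

```python
from collections import Counter


def line_features(line):
    """Simple per-line features; only num_words is named on the slide."""
    words = line.split()
    return {
        "num_words": len(words),
        # The two features below are hypothetical additions.
        "num_capitalized": sum(1 for w in words if w[:1].isupper()),
        "num_digits": sum(1 for ch in line if ch.isdigit()),
    }


def word_features(line, cluster=lambda w: w.lower()):
    """Count each (clustered) word; `cluster` maps a word to its group label."""
    return Counter(cluster(w) for w in line.split())


print(line_features("June 1992"))
print(word_features("to be or not to be"))
```

Passing a different `cluster` function is where a word clustering scheme would plug in, so that many distinct words collapse to one feature.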

9
Hidden Markov Model

A probabilistic finite state automaton
A sequence of observation symbols is produced by
the underlying states (hidden states) based on:
Transition probabilities: the probabilities of moving
from one state to another
Emission probabilities: the probabilities of
emitting each symbol in each state
Learning: determine the transition and emission
probabilities from training data.
Decoding: find the most probable sequence of
hidden states that produces the sequence of
observation symbols.

10
Metadata Extraction as HMM

A document header can be viewed as a sequence of
symbols (words, etc.) produced by the hidden
states (title, author, etc.)
Metadata Extraction
For a sequence of symbols (a document header),
find the most probable sequence of states (title,
author, etc.)
For example,
Input: Converting Existing Corpus to an OAI
Compliant Repository, K. Maly, M. Zubair, J. Tang
Output: title title title title title title title
title author author author author author author
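
The decoding step can be sketched with the Viterbi algorithm on a toy two-state model. All probabilities below, the two-symbol alphabet ("word" and "initial"), and the `symbol` helper are assumptions made for this sketch; they are not the parameters of the model evaluated in the experiments.

```python
import math
import re

STATES = ["title", "author"]
# Toy parameters, chosen by hand for illustration only.
START = {"title": 0.9, "author": 0.1}
TRANS = {"title": {"title": 0.8, "author": 0.2},
         "author": {"title": 0.1, "author": 0.9}}
EMIT = {"title": {"word": 0.95, "initial": 0.05},
        "author": {"word": 0.5, "initial": 0.5}}


def symbol(token):
    """Map a token to an observation symbol: 'K.' -> initial, else word."""
    return "initial" if re.fullmatch(r"[A-Z]\.", token) else "word"


def viterbi(symbols):
    """Return the most probable state sequence for the observed symbols."""
    # score[s] = best log-probability of any state path ending in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][symbols[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for sym in symbols[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            pointers[s] = best
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][sym])
        back.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


header = ("Converting Existing Corpus to an OAI Compliant Repository, "
          "K. Maly, M. Zubair, J. Tang")
print(viterbi([symbol(tok) for tok in header.split()]))
```

With these toy parameters the decoder labels the first eight tokens as title and the remaining six as author, matching the example output on the slide.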

11
Metadata Extraction Using Templates

A rule-based approach
that decouples the code from the rules
All document types share the same code
One template per document type
Template
An XML file that describes the document features
Uses rules to define how to extract metadata for
this type of document
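
To illustrate the idea of keeping rules in an XML file and sharing one piece of code, here is a small Python sketch. The template format shown (the `template` and `field` elements and the `name` and `pattern` attributes) is entirely hypothetical; the slides do not specify the actual template schema.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical template: one regular-expression rule per metadata field.
TEMPLATE = """
<template doctype="sample-report">
  <field name="date" pattern="[A-Z][a-z]+ [0-9]{4}"/>
  <field name="email" pattern="[A-Za-z0-9._-]+@[A-Za-z0-9.-]+"/>
</template>
"""


def extract(template_xml, lines):
    """Apply each field rule to the lines; keep the first match per field."""
    root = ET.fromstring(template_xml)
    result = {}
    for field in root.findall("field"):
        pattern = re.compile(field.get("pattern"))
        for line in lines:
            match = pattern.search(line)
            if match:
                result[field.get("name")] = match.group(0)
                break
    return result


lines = ["Some Report Title", "June 1992", "Contact: a.b@example.org"]
print(extract(TEMPLATE, lines))
```

Supporting a new document type then means writing a new template string or file; `extract` itself stays unchanged.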

12
Experiments

SVM
Apply SVM to different data sets
Objective: Evaluate performance on different
data sets.
Software used
LibSVM
Multi-class SVMs: using the one-vs-one approach
Features: textual features only
Word-specific features, such as city
Line-specific features, such as how many words are in
a line

13
SVM Experiments with different data sets

Data Sets
Data Set 1: Seymore935
Downloaded from http://www-2.cs.cmu.edu/kseymore/ie.html
935 manually tagged document headers
15 tags: title, author, affiliation, address,
note, email, date, abstract, introduction
(intro), phone, keywords, web, degree,
publication number (pubnum), and page
All tags except title, author, affiliation, and
date are ignored
The first 500 are used for training and the rest for
testing
Data Set 2: DTIC100
Selected 100 PDF files from the DTIC website based on
the Z39.18 standard
OCR the first pages and convert them to text format
Manually tagged these 100 document headers
5 tags: title, author, affiliation, date, and
others
The first 75 are used for training and the rest for
testing
Data Set 3: DTIC33
A subset of DTIC100
33 tagged document headers with identical layout
5 tags: title, author, affiliation, date, and
others
The first 24 are used for training and the rest for
testing

14
SVM Experiments with different data sets

Result

15
SVM Experiments with different data sets

Result

16
SVM Experiments with different data sets

Result

17
Experiments

SVM
Use SVM with different feature sets
Objective: Evaluate the performance of different
feature sets.
Software used
LibSVM
Multi-class SVMs: using the One-vs-One approach, i.e.,
training one SVM classifier for each pair of classes.
Research from the LibSVM developers shows that
the One-vs-One approach performs better than
the One-vs-Rest approach
Data set: DTIC100
Manually tagged the XML files with layout
information
Feature Sets
Text: textual features only
Textfont: textual features and font size feature
Textfontbold: textual features and bold feature

18
SVM with different feature sets

Result

19
SVM with different feature sets

Result

20
SVM with different feature sets

Result

21
SVM with different feature sets

More
Using layout information for documents with
very different layouts does not improve
performance significantly.
A further step is to use a document set
with similar layout. We ran the same experiment
with DTIC33 and got better recall.
However, because the data set is too small, we
cannot draw a conclusion yet.

22
Experiments

HMM
Data set: Seymore935
One state per field (tag)
The first 500 are used for training and the rest for
testing
Experimental result
Overall accuracy: 93.0%

23
Experiments

Template
Data Set
DTIC100: 100 XML files with font size and bold
information
The set is divided into 7 classes according to layout
information
For each class, a template is developed after
examining the first one or two documents in the
class. This template is applied to the remaining
documents to get performance data (recall and
precision)
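
For reference, precision and recall over extracted metadata values can be computed as in the sketch below. Comparing extracted and expected values by exact string match is an assumption of this sketch; the slides do not say how matches were judged.

```python
def precision_recall(extracted, expected):
    """Precision = correct / extracted; recall = correct / expected."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical values: 4 extracted, 3 expected, 2 in common.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
print(p, r)
```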

24
Experiments

Result

25
Experiments

Result

26
Experiments

Templates with more data
demo

27
Discussions

We have run experiments with the SVM, HMM, and
template approaches
The template approach is flexible and produces good
results.
SVM looks more promising than HMM
Results are better
It processes the data line by line (or paragraph
by paragraph) instead of word by word
It makes layout information easy to process

28
Discussions

SVM
+ Reported to have good performance
- Difficulty in selecting proper features
- Difficulty in labeling a lot of training data
- Converting data into features and training are
time-consuming.
Template Approach
+ Flexible and straightforward (rules can be
understood by humans)
- Rules are fixed: difficulty in adjusting rules
when errors occur.

29
Overall Approach for Handling Large Collection

Manual Classification
This approach assumes it is possible to manually
classify the large set of documents into classes
of similar documents (based on time period, source
organization, etc.).
For each class, randomly select, say, 100
documents and develop a template. Evaluate the
template by statistical sampling and refine
it until the error is under a tolerance
level. Then apply the refined template to the
whole set.
Auto-Classification
This approach assumes it is not feasible to
classify the large set of documents manually. In this
case we develop a higher-level set of rules on a
smaller sample for classification. Evaluate the
classification approach based on statistical
sampling.
Next, develop the template for each class, apply it,
and refine it as outlined in the manual
classification approach.
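
The evaluation-by-sampling step might look like the following Python sketch. The function name, the `extraction_is_correct` callback (standing in for a human or automated check of one document), and the fixed seed are all assumptions; the slides do not describe how the sampling was implemented.

```python
import random


def estimate_error_rate(documents, extraction_is_correct, sample_size, seed=0):
    """Estimate the template's error rate from a random sample of documents.

    `documents` must be a non-empty list; `extraction_is_correct` is a
    hypothetical callback that checks the extraction for one document.
    """
    rng = random.Random(seed)
    sample = rng.sample(documents, min(sample_size, len(documents)))
    errors = sum(1 for doc in sample if not extraction_is_correct(doc))
    return errors / len(sample)


docs = list(range(1000))  # stand-in document identifiers
print(estimate_error_rate(docs, lambda doc: doc % 10 != 0, sample_size=100))
```

A refinement loop would repeat this estimate after each template change and stop once the result falls below the chosen tolerance.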

30
Future work

Evaluate different approaches for the DTIC test
bed, including a hybrid approach that
integrates the SVM and template-based approaches.

31
Future work

Enlarge the data set
Currently, the data set is small
We need to enlarge the data set to evaluate the different
approaches

32
Thanks
33
SVM

The margin is the width of separation between the
two classes.
The optimal hyperplane is the one with the maximal margin
of separation between the two classes.
The support vectors are the instances closest to
the optimal hyperplane.

34
SVM (cont.)

Geometric interpretation: The support vectors uniquely
define the optimal hyperplane

35
SVM (cont.)

An SVM determines the hyperplane between two
classes from the training set
An SVM classifies input data based on which side
of the hyperplane it lies.

36
SVM (cont.)

Mathematical interpretation
We want:
$w \cdot x_i + b \ge 1$ if $y_i = 1$ ($x_i$ in class 1)
$w \cdot x_i + b \le -1$ if $y_i = -1$ ($x_i$ in class 2)
The margin is $2 / \|w\|$.
The problem then becomes a constrained
optimization problem: maximize $2 / \|w\|$, or equivalently minimize
$\|w\|^2$, subject to $y_i (w \cdot x_i + b) - 1 \ge 0$.

37
SVM (cont.)

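Unique solution: $w = \sum_i \alpha_i y_i x_i$, summed over all support
vectors
Decision function: $f(x) = \mathrm{sign}\left(\sum_i \alpha_i y_i \, x_i \cdot x + b\right)$
All other $x_i$ are irrelevant to the solution.
Lagrangian: $L_p = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i y_i (w \cdot x_i + b) + \sum_i \alpha_i$
Setting the derivatives with respect to $w$ and $b$ to zero gives $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$.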
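
The decision function, which sums over the support vectors only, can be sketched in plain Python. The two support vectors, their multipliers, and the bias below are a hand-worked toy case chosen for this sketch (they satisfy the constraints with equality and sum to zero when weighted by the labels); they are not values from the experiments.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decision(x, support_vectors, labels, alphas, b):
    """Sign of sum_i alpha_i * y_i * (x_i . x) + b; ties at 0 map to +1."""
    s = sum(a * y * dot(sv, x)
            for sv, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if s >= 0 else -1


# Toy problem: one support vector per class, symmetric about the origin.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]  # gives w = (0.5, 0.5), so w.x + b = +1 / -1 at the SVs
b = 0.0

print(decision((2.0, 3.0), support_vectors, labels, alphas, b))
print(decision((-2.0, 0.0), support_vectors, labels, alphas, b))
```

Training points that are not support vectors have a zero multiplier, so they never appear in the sum.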

38
SVM (cont.)

Advantages
Can manage a very large number of
attributes/features.
Linear regression has an overfitting problem when
the number of attributes is much larger than the
size of the training set.
The SVM solution is determined by the support vectors
only.
Various kernel functions can be used to map the input
space into a feature space
For a non-linear space, an SVM uses kernel functions
to map it to a linearly separable space.
In this way, an SVM uses linear separation to solve
non-linear problems.

39
Experiment (SVM)

Our experiment (working on 500 tagged headers, as
described in the paper)
1. Knowledge collection
Collect author names from Archon (CERN
collection)
Download a British word list from the Internet
Collect country names from the web
Collect USA city names
Collect Canadian province names and USA state names
Collect month names and their abbreviations
Frequent words for degree, pubnum, notenum,
affiliation, address.
Regular expressions for email and URL

40
Experiment (SVM)

2. Word Clustering
Converting the original data
For example,
<title> Protocols for Collecting Responses L in Multi-hop Radio Networks L </title>
<author> Chungki Lee James E. Burns L Mostafa H. Ammar L </author>
<pubnum> GIT-CC-92/28 L </pubnum>
<date> June 1992 L </date>
will be converted to
<title> Cap1DictWord DictWord Cap1DictWord Cap1DictWord L prep CapWord1LowerWord4-LowerWord3 Cap1DictWord Cap1DictWord L </title>
<author> CapWord1LowerWord6 mayName mayName singleCap mayName L CapWord1LowerWord6 singleCap mayName L </author>
<pubnum> CapWord3-CapWord2-Digs2/Digs2 L </pubnum>
<date> month Digs4 L </date>
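
A partial Python sketch of this kind of token-to-cluster conversion is shown below. The rules are inferred only from the pubnum and date rows of the example (digit runs, all-capital runs, month names); the dictionary and name lookups behind labels such as Cap1DictWord and mayName are not reproduced, and the function names are hypothetical.

```python
import re

MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}


def cluster_part(part):
    """Map one piece of a token to a cluster label, or leave it unchanged."""
    if part.isdigit():
        return f"Digs{len(part)}"
    if part.isalpha() and part.isupper():
        return f"CapWord{len(part)}"
    return part


def cluster_token(token):
    """Cluster a token, handling '-' and '/' separated pieces one by one."""
    if token.lower() in MONTHS:
        return "month"
    # The capturing group keeps the separators in the split result.
    return "".join(cluster_part(p) for p in re.split(r"([-/])", token))


print(cluster_token("GIT-CC-92/28"))
print(" ".join(cluster_token(t) for t in "June 1992".split()))
```

Under these assumed rules the two calls reproduce the pubnum and date rows of the converted example.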

41
Experiment (SVM)

3. Get features
Treat each word in the converted file as a feature, and
use its occurrence as the weight.
4. The 500 headers are divided into 450 for training
and 50 for testing.
5. Train each of the 15 classifiers using the
one-versus-all approach.

42
Hidden Markov Models Example

Someone is trying to deduce the weather from a piece
of seaweed
For some reason, he cannot access weather
information (sun, cloud, rain) directly
But he can observe the dampness of a piece of
seaweed (soggy, damp, dryish, dry)
And the state of the seaweed is probabilistically
related to the state of the weather

43
Hidden Markov Models (cont.)
44
HMM problems (cont.)
The most probable sequence of hidden states is
the sequence that maximizes
Pr(dry,damp,soggy | sunny,sunny,sunny),
Pr(dry,damp,soggy | sunny,sunny,cloudy),
Pr(dry,damp,soggy | sunny,sunny,rainy), ...,
Pr(dry,damp,soggy | rainy,rainy,rainy)
45
Hidden Markov Models (cont.)

A Hidden Markov Model consists of two sets of
states and three sets of probabilities
Hidden states: the (true) states of a system
that may be described by a Markov process (e.g.
weather states in our example).
Observable symbols: the symbols of the process
that are visible (e.g. dampness of the
seaweed).
Initial probabilities for hidden states
Transition probabilities for hidden states
Emission probabilities for each observable symbol
in each hidden state
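
The five components can be written down as plain data, as in this Python sketch for the seaweed example. Every probability value here is invented for illustration. The sketch scores each candidate hidden sequence by the joint probability of the hidden states and the observations (initial, transition, and emission terms multiplied together) and finds the best one by exhaustive enumeration, which is only practical for very short sequences.

```python
from itertools import product

HIDDEN_STATES = ["sun", "cloud", "rain"]
SYMBOLS = ["dry", "dryish", "damp", "soggy"]

# Invented numbers; each row sums to 1.
INITIAL = {"sun": 0.5, "cloud": 0.25, "rain": 0.25}
TRANSITION = {
    "sun":   {"sun": 0.5,  "cloud": 0.25, "rain": 0.25},
    "cloud": {"sun": 0.25, "cloud": 0.5,  "rain": 0.25},
    "rain":  {"sun": 0.25, "cloud": 0.25, "rain": 0.5},
}
EMISSION = {
    "sun":   {"dry": 0.5,   "dryish": 0.25,  "damp": 0.125, "soggy": 0.125},
    "cloud": {"dry": 0.25,  "dryish": 0.25,  "damp": 0.25,  "soggy": 0.25},
    "rain":  {"dry": 0.125, "dryish": 0.125, "damp": 0.25,  "soggy": 0.5},
}


def joint_probability(states, observations):
    """Pr(states, observations) under the model above."""
    p = INITIAL[states[0]] * EMISSION[states[0]][observations[0]]
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        p *= TRANSITION[prev][cur] * EMISSION[cur][obs]
    return p


def most_probable_states(observations):
    """Enumerate every hidden sequence and keep the highest-scoring one."""
    candidates = product(HIDDEN_STATES, repeat=len(observations))
    return max(candidates, key=lambda s: joint_probability(s, observations))


print(most_probable_states(["dry", "damp", "soggy"]))
```

With these invented numbers the best sequence for (dry, damp, soggy) is (sun, rain, rain). The number of candidates grows as 3^n with the sequence length n, which is why dynamic programming is used for decoding in practice.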

46
Digital Library Research at ODU
47
Open Archives Initiative: OAI-PMH 2.0
http://www.openarchives.org
48
Connecting Islands of Digital Libraries
Islands of digital libraries need to be
interconnected so that users can access different
information resources from anywhere.
Need for manipulating, organizing, and correlating
information from different repositories for better
discovery.
The Open Archives Initiative Protocol for Metadata
Harvesting (OAI-PMH) is an international effort
to build bridges across islands of digital
libraries.
OAI does for digital libraries what the
Internet did for islands of isolated networks.
49
Background - Open Archives Initiative (OAI)
The goal of the Open Archives Initiative Protocol
for Metadata Harvesting is to supply and promote
an application-independent interoperability
framework.
The OAI protocol permits metadata
harvesting of a data provider by a service
provider.
Data providers support the OAI protocol
as a means of exposing metadata about the content
in their systems.
Service providers issue OAI
protocol requests to the systems of data
providers and use the returned metadata as a
basis for building value-added services.
http://www.openarchives.org
The word "open" in
OAI is from the architectural perspective:
defining and promoting machine interfaces.
Openness does not mean "free" or "unlimited"
access to the information repositories that
conform to the OAI technical framework.
The OAI is an international effort. Major sponsors are
Council on Library and Information Resources
(CLIR), the Digital Library Federation (DLF), the
Scholarly Publishing Academic
50
What does it mean to make an existing digital
library OAI-enabled?
Digital Library
OAI Layer
Exposing metadata to OAI service providers DC
and Parallel metadata sets
ONLY METADATA
Storage
51
Minimal Dublin Core Metadata: OAI Requirement
http://dublincore.org/documents/dces/
Fifteen Elements (Optional)
Element Title: A name given to the resource. Typically, a Title will be a name by which the resource is formally known.
Element Creator: An entity primarily responsible for making the content of the resource. Examples of a Creator include a person, an organisation, or a service.
Element Subject: The topic of the content of the resource. Typically, a Subject will be expressed as keywords,
Element Description: An account of the content of the resource. Description may include but is not limited to an abstract, table of contents, reference to a graphical representation of content or a free-text account of the content.
Element Publisher: An entity responsible for making the resource available. Examples of a Publisher include a person, an organisation, or a service.
52
Dublin Core Metadata
Element Contributor: An entity responsible for making contributions to the content of the resource.
Element Date: A date associated with an event in the life cycle of the resource. Typically, Date will be associated with the creation or availability of the resource.
Element Type: The nature or genre of the content of the resource.
Element Format: The physical or digital manifestation of the resource. Typically, Format may include the media-type or dimensions of the resource. Format may be used to determine the software, hardware or other equipment needed to display or operate the resource.
Element Identifier: An unambiguous reference to the resource within a given context. Example formal identification systems include the Uniform Resource Locator (URL).
53
Dublin Core Metadata
Element Source: A reference to a resource from which the present resource is derived.
Element Language: A language of the intellectual content of the resource.
Element Relation: A reference to a related resource.
Element Coverage: The extent or scope of the content of the resource. Coverage will typically include spatial location (a place name or geographic coordinates), temporal period (a period label, date, or date range) or jurisdiction (such as a named administrative entity).
Element Rights: Information about rights held in and over the resource.
54
Beyond Dublin Core Metadata
Need to support parallel metadata sets to enable
the OAI service provider to take advantage of the
richer metadata fields for resource discovery.
The OAI metadata harvesting protocol supports
the notion of parallel metadata sets, allowing
collections to expose metadata in formats that
are specific to their applications and domains.
The OAI technical framework places no limitations
on the nature of such parallel sets, other than
that the metadata records be structured as XML
data that have a corresponding XML schema for
validation.
55

Metadata Harvesting
Move away from distributed searching.
- cannot scale well to large number of
participants.
Extract metadata from various sources.
- Build services on local copies of metadata.
- data remains at remote repositories

RCDL 2003, St. Petersburg
56

OAI Request and OAI Response.
- OAI Request for Metadata is embedded in HTTP.
- OAI Response to OAI Request is encoded in XML.
- XML Schema specification for OAI Response is
provided in OAI-PMH document.
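
Since an OAI request is an ordinary HTTP request with the verb and its arguments as query parameters, building one can be sketched in a few lines of Python. The base URL below is a placeholder, and the helper name `oai_request_url` is hypothetical; no network call is made.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, arguments=None):
    """Build an OAI-PMH GET request URL from a verb and its arguments."""
    query = urlencode({"verb": verb, **(arguments or {})})
    return f"{base_url}?{query}"


BASE_URL = "http://example.org/oai"  # placeholder repository address

print(oai_request_url(BASE_URL, "Identify"))
print(oai_request_url(BASE_URL, "ListRecords",
                      {"metadataPrefix": "oai_dc", "from": "2004-01-01"}))
```

The arguments are passed as a dictionary because `from` is a reserved word in Python and cannot be used as a keyword argument name.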

57
Service Provider
Data Provider

Supporting protocol requests
Identify
ListMetadataFormats
ListSets

Harvesting protocol requests
ListRecords
ListIdentifiers
GetRecord

58
Service Provider
Data Provider
Identify

Repository name
Base-URL
Admin e-mail
OAI protocol version
Description Container

59
Service Provider
Data Provider
ListMetadataFormats

REPEAT
Format prefix
Format XML schema
/REPEAT

60
Service Provider
Data Provider
ListSets

REPEAT
Set Specification
Set Name
/REPEAT

61
Service Provider
Data Provider
ListRecords: from=a, until=b, set=klm, metadataPrefix=oai_dc

REPEAT
Identifier
Datestamp
Metadata
About Container
/REPEAT

62
Service Provider
Data Provider

ListIdentifiers: from=a, until=b, metadataPrefix=oai_dc, set=klm

REPEAT
Identifier
Datestamp
/REPEAT

63
Service Provider
Data Provider
GetRecord: identifier=oai:mlib:123a, metadataPrefix=oai_dc

Identifier
Datestamp
Metadata
About

64
OAI Mechanics
Request is encoded in HTTP
Response is encoded in XML
XML Schemas for the responses are defined in the
OAI-PMH document
Courtesy Michael Nelson
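
Because the response is XML, a harvester can read it with a standard XML parser. The Python sketch below parses a trimmed, hand-written GetRecord response; the sample record (identifier, datestamp, title, creators) is invented, and a real response carries additional elements such as the response date and the echoed request.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented, trimmed sample response for illustration.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>oai:example:1</identifier>
        <datestamp>2004-06-03</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample Title</dc:title>
          <dc:creator>A. Author</dc:creator>
          <dc:creator>B. Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>"""


def parse_record(xml_text):
    """Pull the header fields and a few Dublin Core elements from a record."""
    root = ET.fromstring(xml_text)
    record = root.find(".//oai:record", NS)
    return {
        "identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
        "datestamp": record.findtext("oai:header/oai:datestamp", namespaces=NS),
        "titles": [e.text for e in record.findall(".//dc:title", NS)],
        "creators": [e.text for e in record.findall(".//dc:creator", NS)],
    }


print(parse_record(SAMPLE))
```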
