Transcript and Presenter's Notes

Title: Evaluation of Different Algorithms for Metadata Extraction


1
Evaluation of Different Algorithms for Metadata
Extraction
Work in Progress: Metadata Extraction Project
Sponsored by DTIC
  • Department of Computer Science
  • Old Dominion University
  • 6 / 03 / 2004

2
Contents
  • Introduction
  • Metadata Extraction Using SVMs
  • Support Vector Machines
  • Multi-Class SVMs
  • Metadata Extraction as Multi-Class SVMs
  • Metadata Extraction Using HMM
  • Hidden Markov Model
  • Metadata Extraction as HMM
  • Metadata Extraction Using Templates
  • Experiments
  • SVMs
  • HMMs
  • Templates
  • Conclusion and Future Work

3
Introduction
Motivation: evaluate different approaches to
metadata extraction for the DTIC test bed.
  • Machine Learning Approach
  • Support Vector Machines
  • Hidden Markov Model
  • Rule-based approach
  • Using rules to specify how to extract metadata

4
Introduction
  • Deliverables
  • Software tool to extract metadata and structure
    from a set of PDF documents.
  • Feasibility report on extracting complex objects
    such as figures, equations, references, and
    tables from the document and representing them in
    a DTD-compliant XML format.

5
Introduction
  • Schedule (starting date: March 2004)
  • Months 0-2: Working with DTIC to identify the
    set of documents and the metadata of interest
  • Months 3-8: Developing software for metadata and
    structure extraction from the selected set of PDF
    documents
  • Months 9-12: Feasibility study for extracting
    complex objects and representing the complete
    document in XML

6
Support Vector Machines
  • Binary classifier (classifies data into two classes)
  • Represent data with pre-defined features
  • Learning: find the hyperplane with the largest
    margin separating the two classes
  • Classification: assign data to a class based on
    which side of the hyperplane it lies

[Figure: two classes of dots separated by a hyperplane with a margin;
the axes are feature 1 and feature 2]
The figure shows an SVM example that classifies a
person into two classes, overweight and not
overweight. Two features are pre-defined:
weight (feature 1) and height (feature 2). Each
dot represents a person. Red dot: overweight.
Blue dot: not overweight.
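The classification step above can be sketched in a few lines. The hyperplane weights below are made-up numbers for the slide's weight/height example, not a trained model:

```python
def svm_predict(w, b, x):
    """Return +1 or -1 depending on which side of the
    hyperplane w.x + b = 0 the feature vector x lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical hyperplane: feature 1 = weight (kg), feature 2 = height (cm).
w = [1.0, -0.5]   # more weight pushes toward +1, more height toward -1
b = 0.0

print(svm_predict(w, b, [95, 170]))   # -> 1  (classified overweight)
print(svm_predict(w, b, [60, 180]))   # -> -1 (classified not overweight)
```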
7
Multi-Class SVMs
  • Combining binary SVMs into a multi-class classifier
  • One-vs-rest
  • Classes: in this class or not in this class
  • Positive training samples: data in this class
  • Negative training samples: the rest
  • K binary SVMs (K = the number of classes)
  • One-vs-one
  • Classes: in class one or in class two
  • Positive training samples: data in one class
  • Negative training samples: data in the other
    class
  • K(K-1)/2 binary SVMs
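A minimal sketch of the one-vs-one combination: each of the K(K-1)/2 binary classifiers votes for one of its two classes, and the class with the most votes wins. The toy classifiers and their rules are invented for illustration:

```python
from itertools import combinations

def one_vs_one_predict(classifiers, classes, x):
    """Each binary classifier votes for one of its two classes;
    the class with the most votes wins."""
    votes = {c: 0 for c in classes}
    for pair in combinations(classes, 2):
        votes[classifiers[pair](x)] += 1
    return max(votes, key=votes.get)

# Toy binary "SVMs": hand-written rules standing in for trained models.
classes = ["title", "author", "date"]
classifiers = {
    ("title", "author"): lambda x: "title" if x["n_words"] > 5 else "author",
    ("title", "date"):   lambda x: "date" if x["has_digits"] else "title",
    ("author", "date"):  lambda x: "date" if x["has_digits"] else "author",
}
print(one_vs_one_predict(classifiers, classes,
                         {"n_words": 8, "has_digits": False}))  # -> title
```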

8
Metadata Extraction as SVMs
  • Each element (title, author, etc.) in the
    metadata set can be viewed as a class.
  • Classify each line (or paragraph) into a class
  • Feature set
  • Line features (number of words, etc.)
  • Word features: use each word as a feature. In
    practice, word clustering techniques are used to
    reduce the number of features; they cluster words
    into groups based on similarity.
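The feature set described above might be computed along these lines. The feature names and the tiny vocabulary are illustrative, not the project's actual feature set:

```python
def line_features(line, vocab):
    """Build a feature dict for one header line: a few line-level
    features plus word-occurrence counts over a fixed vocabulary."""
    words = line.split()
    feats = {
        "n_words": len(words),
        "has_digits": int(any(w.isdigit() for w in words)),
        "all_caps": int(line.isupper()),
    }
    for v in vocab:
        feats["word:" + v] = sum(1 for w in words if w.lower() == v)
    return feats

print(line_features("Department of Computer Science",
                    ["university", "department", "abstract"]))
```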

9
Hidden Markov Model
  • A probabilistic finite state automaton
  • A sequence of observation symbols is produced by
    the underlying (hidden) states based on
  • Transition probabilities: the probabilities of
    moving from one state to another
  • Emission probabilities: the probabilities of
    emitting each symbol in each state
  • Learning: determine the transition and emission
    probabilities from training data
  • Decoding: find the most probable sequence of
    hidden states that produced the sequence of
    observation symbols
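Decoding is typically done with the Viterbi algorithm. The sketch below uses an illustrative two-state weather model in the spirit of the seaweed example later in the deck, and multiplies probabilities directly (real implementations use log probabilities):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for obs."""
    # V[t][s] = (probability of the best path ending in state s, that path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        V.append({})
        for s in states:
            prob, path = max(
                (V[-2][p][0] * trans_p[p][s] * emit_p[s][o],
                 V[-2][p][1] + [s])
                for p in states
            )
            V[-1][s] = (prob, path)
    return max(V[-1].values())[1]

# Illustrative two-state weather model (probabilities are made up).
start_p = {"sunny": 0.6, "rainy": 0.4}
trans_p = {"sunny": {"sunny": 0.7, "rainy": 0.3},
           "rainy": {"sunny": 0.4, "rainy": 0.6}}
emit_p = {"sunny": {"dry": 0.6, "damp": 0.3, "soggy": 0.1},
          "rainy": {"dry": 0.1, "damp": 0.4, "soggy": 0.5}}
print(viterbi(["dry", "damp", "soggy"], ["sunny", "rainy"],
              start_p, trans_p, emit_p))  # -> ['sunny', 'rainy', 'rainy']
```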

10
Metadata Extraction as HMM
  • A document header can be viewed as a sequence of
    symbols (words, etc.) produced by the hidden
    states (title, author, etc.)
  • Metadata Extraction
  • For a sequence of symbols (a document header),
    find the most probable sequence of states (title,
    author, etc.)
  • For example,
  • Input: Converting Existing Corpus to an OAI
    Compliant Repository, K. Maly, M. Zubair, J. Tang
  • Output: title title title title title title title
    title author author author author author author
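Turning a decoded state sequence back into metadata fields amounts to grouping contiguous runs of the same state; a small sketch using a shortened version of the slide's example:

```python
def group_by_state(words, states):
    """Collect contiguous runs of identical decoded states into
    (field, text) pairs."""
    fields = []
    for word, state in zip(words, states):
        if fields and fields[-1][0] == state:
            fields[-1][1].append(word)
        else:
            fields.append((state, [word]))
    return [(state, " ".join(ws)) for state, ws in fields]

words = ["Converting", "Existing", "Corpus", "K.", "Maly"]
states = ["title", "title", "title", "author", "author"]
print(group_by_state(words, states))
# -> [('title', 'Converting Existing Corpus'), ('author', 'K. Maly')]
```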

11
Metadata Extraction Using Templates
  • A rule-based approach
  • But it decouples the code and the rules
  • All document types share the same code
  • One template per document type
  • Template
  • An XML file that describes the document features
  • Uses rules to define how to extract metadata for
    this type of document
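A stripped-down sketch of the idea: here a "template" is just a set of per-field rules (regular expressions) that shared driver code applies line by line. The field names and patterns are illustrative; the project's actual templates are XML files with a richer rule language:

```python
import re

# A "template" as a set of per-field rules; the same driver code applies
# whichever template matches the document type. Patterns are illustrative.
template = {
    "date":  re.compile(r"\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*\s+\d{4}\b"),
    "email": re.compile(r"\b[\w.]+@[\w.]+\b"),
}

def extract(template, lines):
    """Apply each field's rule line by line; keep the first match."""
    found = {}
    for line in lines:
        for field, pattern in template.items():
            if field not in found:
                m = pattern.search(line)
                if m:
                    found[field] = m.group(0)
    return found

print(extract(template, ["Converting Existing Corpus",
                         "June 1992", "maly@cs.odu.edu"]))
# -> {'date': 'June 1992', 'email': 'maly@cs.odu.edu'}
```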

12
Experiments
  • SVM
  • Apply SVM to different data sets
  • Objective: evaluate the performance on different
    data sets
  • Software used:
  • LibSVM
  • Multi-class SVMs: using the one-vs-one approach
  • Features: textual features only
  • word-specific features, such as city
  • line-specific features, such as the number of
    words in a line

13
SVM Experiments with different data sets
  • Data Sets
  • Data Set 1: Seymore935
  • Downloaded from
    http://www-2.cs.cmu.edu/~kseymore/ie.html
  • 935 manually tagged document headers
  • 15 tags: title, author, affiliation, address,
    note, email, date, abstract, introduction
    (intro), phone, keywords, web, degree,
    publication number (pubnum), and page
  • All tags are ignored except title, author,
    affiliation, and date
  • The first 500 headers are used for training and
    the rest for testing
  • Data Set 2: DTIC100
  • 100 PDF files selected from the DTIC website
    based on the Z39.18 standard
  • The first pages are OCRed and converted to text
    format
  • These 100 document headers are manually tagged
  • 5 tags: title, author, affiliation, date, and
    others
  • The first 75 headers are used for training and
    the rest for testing
  • Data Set 3: DTIC33
  • A subset of DTIC100
  • 33 tagged document headers with identical layout
  • 5 tags: title, author, affiliation, date, and
    others
  • The first 24 headers are used for training and
    the rest for testing

14
SVM Experiments with different data sets
  • Result

15
SVM Experiments with different data sets
  • Result

16
SVM Experiments with different data sets
  • Result

17
Experiments
  • SVM
  • Use SVM with different feature sets
  • Objective: evaluate the performance of different
    feature sets
  • Software used:
  • LibSVM
  • Multi-class SVMs: using the one-vs-one approach,
    i.e., training one binary SVM classifier for each
    pair of classes
  • Research from the LibSVM developers shows that
    the one-vs-one approach performs better than the
    one-vs-rest approach
  • Data set: DTIC100
  • The XML files are manually tagged with layout
    information
  • Feature Sets
  • Text: textual features only
  • Textfont: textual features plus a font-size
    feature
  • Textfontbold: textual features plus a bold
    feature

18
SVM with different feature sets
  • Result

19
SVM with different feature sets
  • Result

20
SVM with different feature sets
  • Result

21
SVM with different feature sets
  • More
  • Using layout information for documents with very
    different layouts does not improve performance
    significantly.
  • A further step is to use a document set with
    similar layout. We ran the same experiment with
    DTIC33 and obtained better recall. However,
    because the data set is so small, we cannot draw
    conclusions yet.

22
Experiments
  • HMM
  • Data Set: Seymore935
  • One state per field (tag)
  • The first 500 headers are used for training and
    the rest for testing
  • Experimental Result
  • Overall accuracy: 93.0%

23
Experiments
  • Template
  • Data Set
  • DTIC100: 100 XML files with font size and bold
    information
  • It is divided into 7 classes according to layout
    information
  • For each class, a template is developed after
    inspecting the first one or two documents in the
    class. The template is then applied to the
    remaining documents to measure performance
    (recall and precision)

24
Experiments
  • Result

25
Experiments
  • Result

26
Experiments
  • Templates with more data
  • demo

27
Discussions
  • We have done experiments with the SVM, HMM, and
    template approaches
  • The template approach is flexible and produces
    good results
  • SVM looks more promising than HMM
  • Its results are better
  • It processes the data line by line (or paragraph
    by paragraph) instead of word by word
  • It can easily incorporate layout information

28
Discussions
  • SVM
  • + Reported to have good performance
  • - Difficulty in selecting proper features;
    difficulty in labeling a lot of training data;
    converting data into features and training are
    time-consuming
  • Template Approach
  • + Flexible and straightforward (the rules can be
    understood by humans)
  • - Rules are fixed; difficulty in adjusting rules
    when errors occur

29
Overall Approach for Handling Large Collection
  • Manual Classification
  • This approach assumes it is possible to manually
    classify the large set of documents into similar
    classes (based on time period, source
    organizations, etc.)
  • For each class, randomly select, say, 100
    documents and develop a template. Evaluate the
    template by statistical sampling and refine the
    template until the error is under a tolerance
    level. Then apply the refined template to the
    whole set.
  • Auto-Classification
  • This approach assumes it is not humanly possible
    to classify the large set of documents. In this
    case we develop a higher-level set of rules on a
    smaller sample for classification, and evaluate
    the classification approach based on statistical
    sampling.
  • Next, develop the template for each class, apply
    it, and refine it as outlined in the manual
    classification approach.

30
Future work
  • Evaluate different approaches for the DTIC test
    bed, including a hybrid approach that integrates
    the SVM and template-based approaches.

31
Future work
  • Enlarge the data set
  • Currently, the data set is small
  • We need to enlarge the data set to evaluate the
    different approaches

32
Thanks
33
SVM
  • The margin is the width of separation between the
    two classes.
  • Optimal hyperplane is the one with maximal margin
    of separation between the two classes.
  • The support vectors are the instances closest to
    the optimal hyperplane.

34
SVM (cont.)
  • Geometric interpretation: the support vectors
    uniquely define the optimal hyperplane

35
SVM (cont.)
  • SVM determines the hyperplane between the two
    classes from the training set
  • SVM makes the classification based on which side
    of the hyperplane the input data lies

36
SVM (cont.)
  • Mathematical interpretation
  • We want:
    w·xi + b ≥ +1 if yi = +1 (xi in class 1)
    w·xi + b ≤ -1 if yi = -1 (xi in class 2)
    The margin = 2/||w||
  • The problem then becomes a constrained
    optimization problem: maximize 2/||w|| (or
    minimize ||w||²) subject to yi(w·xi + b) - 1 ≥ 0

37
SVM (cont.)
  • Unique solution: w = Σ ai yi xi, summed over all
    support vectors
  • Decision function: f(x) = sign(Σ ai yi (xi·x) + b)
  • All other xi are irrelevant to the solution
  • Lagrangian: Lp = ½||w||² - Σ ai yi (w·xi + b) + Σ ai
    Setting its derivatives to zero gives
    w = Σ ai yi xi and Σ ai yi = 0
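The decision function can be evaluated directly from the support-vector expansion. The multipliers, labels, and support vectors below are made up, not the result of training:

```python
def svm_decision(support, b, x):
    """f(x) = sign( sum_i a_i y_i (x_i . x) + b ), where the sum runs
    over (a_i, y_i, x_i) triples for the support vectors only."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    s = sum(a * y * dot(xi, x) for a, y, xi in support) + b
    return 1 if s >= 0 else -1

# Two hypothetical support vectors with multipliers a=0.5 and labels +1/-1.
support = [(0.5, +1, [2.0, 0.0]), (0.5, -1, [0.0, 2.0])]
print(svm_decision(support, 0.0, [3.0, 1.0]))   # -> 1
print(svm_decision(support, 0.0, [0.0, 3.0]))   # -> -1
```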

38
SVM (cont.)
  • Advantages
  • Can manage a very large number of
    attributes/features
  • Linear regression has an overfitting problem when
    the number of attributes is much larger than the
    size of the training set
  • The SVM solution is determined by the support
    vectors only
  • Various kernel functions can be used to map the
    input space into a feature space
  • For a non-linearly separable space, SVM uses
    kernel functions to map it to a linearly
    separable space
  • In this way, SVM uses linear separation to solve
    non-linear problems

39
Experiment (SVM)
  • Our experiment (working on the 500 tagged headers
    as described in the paper)
  • 1. Knowledge collection
  • Collect author names from Archon (the CERN
    collection)
  • Download a British word list from the internet
  • Collect country names from the web
  • Collect USA city names
  • Collect Canadian province names and USA state
    names
  • Collect month names and their abbreviations
  • Collect frequent words for degree, pubnum,
    notenum, affiliation, and address
  • Regular expressions for email and URL

40
Experiment(SVM)
  • 2. Word Clustering
  • Converting the original data
  • For example,
  • <title> Protocols for Collecting Responses L in
    Multi-hop Radio Networks L </title>
  • <author> Chungki Lee James E. Burns L Mostafa
    H. Ammar L </author>
  • <pubnum> GIT-CC-92/28 L </pubnum>
  • <date> June 1992 L </date>
  • will be converted to
  • <title> Cap1DictWord DictWord Cap1DictWord
    Cap1DictWord L prep CapWord1LowerWord4-LowerWord3
    Cap1DictWord Cap1DictWord L </title>
  • <author> CapWord1LowerWord6 mayName mayName
    singleCap mayName L CapWord1LowerWord6 singleCap
    mayName L </author>
  • <pubnum> CapWord3-CapWord2-Digs2/Digs2 L
    </pubnum>
  • <date> month Digs4 L </date>
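The conversion above can be mimicked with a simple token-to-label mapping. The rules below are reverse-engineered from the slide's example output and are only illustrative:

```python
import re

MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}

def word_class(word, dictionary=frozenset(), names=frozenset()):
    """Collapse a raw token into a cluster label (rules are
    reverse-engineered from the slide; the real mapping is not shown)."""
    if word.lower() in MONTHS:
        return "month"
    if re.fullmatch(r"\d+", word):
        return "Digs%d" % len(word)
    if word.lower() in dictionary:
        return "Cap1DictWord" if word[:1].isupper() else "DictWord"
    if word in names:
        return "mayName"
    if re.fullmatch(r"[A-Z]", word):
        return "singleCap"
    return "Unknown"

print([word_class(w, dictionary={"protocols", "for", "collecting"})
       for w in "Protocols for Collecting".split()])
# -> ['Cap1DictWord', 'DictWord', 'Cap1DictWord']
```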

41
Experiment(SVM)
  • 3. Get Features
  • Treat each word in the converted file as a
    feature; use its occurrence count as the weight
  • 4. The 500 headers are divided into 450 training
    headers and 50 test headers
  • 5. Train each of the 15 classifiers using the
    one-versus-all approach

42
Hidden Markov Models Example
  • Someone tries to deduce the weather from a piece
    of seaweed
  • For some reason, he cannot access weather
    information (sun, cloud, rain) directly
  • But he can observe the dampness of the piece of
    seaweed (soggy, damp, dryish, dry)
  • And the state of the seaweed is probabilistically
    related to the state of the weather

43
Hidden Markov Models (cont.)
44
HMM problems (cont.)
The most probable sequence of hidden states is
the sequence that maximizes
Pr(dry,damp,soggy | sunny,sunny,sunny),
Pr(dry,damp,soggy | sunny,sunny,cloudy),
Pr(dry,damp,soggy | sunny,sunny,rainy), . . . ,
Pr(dry,damp,soggy | rainy,rainy,rainy)
45
Hidden Markov Models (cont.)
  • A Hidden Markov Model consists of two sets of
    states and three sets of probabilities
  • Hidden states: the (true) states of a system
    that may be described by a Markov process (e.g.,
    the weather states in our example)
  • Observable symbols: the symbols of the process
    that are visible (e.g., the dampness of the
    seaweed)
  • Initial probabilities for the hidden states
  • Transition probabilities for the hidden states
  • Emission probabilities for each observable symbol
    in each hidden state

46
Digital Library Research at ODU
47
Open Archives Initiative: OAI-PMH 2.0
http://www.openarchives.org
48
Connecting Islands of Digital Libraries
Islands of digital libraries need to be
interconnected for users to access different
information resources from anywhere. There is a
need for manipulating, organizing, and correlating
information from different repositories for better
discovery. The Open Archives Protocol for Metadata
Harvesting (OAI-PMH) is an international effort
to facilitate bridges across islands of digital
libraries. OAI does for digital libraries what
the Internet did for islands of isolated networks.
49
Background - Open Archives Initiative (OAI)
The goal of the Open Archives Initiative Protocol
for Metadata Harvesting is to supply and promote
an application-independent interoperability
framework. The OAI protocol permits metadata
harvesting of a data provider by a service
provider. Data Providers support the OAI protocol
as a means of exposing metadata about the content
in their systems. Service Providers issue OAI
protocol requests to the systems of data
providers and use the returned metadata as a
basis for building value-added services.
(http://www.openarchives.org) The word "open" in
OAI is from the architectural perspective:
defining and promoting machine interfaces.
Openness does not mean free or unlimited
access to the information repositories that
conform to the OAI technical framework. The OAI
is an international effort. Major sponsors are the
Council on Library and Information Resources
(CLIR), the Digital Library Federation (DLF), and
the Scholarly Publishing and Academic Resources
Coalition (SPARC).
50
What does it mean to make an existing digital
library OAI-enabled?
[Diagram: a Digital Library whose Storage is wrapped by an OAI Layer
that exposes ONLY METADATA (DC and parallel metadata sets) to OAI
service providers]
51
Minimal Dublin Core Metadata: OAI Requirement
http://dublincore.org/documents/dces/
Fifteen elements (all optional):
Element Title: A name given to the resource.
Typically, a Title will be a name by which the
resource is formally known.
Element Creator: An entity primarily responsible
for making the content of the resource. Examples
of a Creator include a person, an organisation,
or a service.
Element Subject: The topic of the content of the
resource. Typically, a Subject will be expressed
as keywords.
Element Description: An account of the content of
the resource. Description may include but is not
limited to an abstract, table of contents,
reference to a graphical representation of
content, or a free-text account of the content.
Element Publisher: An entity responsible for
making the resource available. Examples of a
Publisher include a person, an organisation, or a
service.
52
Dublin Core Metadata
Element Contributor: An entity responsible for
making contributions to the content of the
resource.
Element Date: A date associated with an event in
the life cycle of the resource. Typically, Date
will be associated with the creation or
availability of the resource.
Element Type: The nature or genre of the content
of the resource.
Element Format: The physical or digital
manifestation of the resource. Typically, Format
may include the media-type or dimensions of the
resource. Format may be used to determine the
software, hardware or other equipment needed to
display or operate the resource.
Element Identifier: An unambiguous reference to
the resource within a given context. Example
formal identification systems include the Uniform
Resource Locator (URL).
53
Dublin Core Metadata
Element Source: A reference to a resource from
which the present resource is derived.
Element Language: A language of the intellectual
content of the resource.
Element Relation: A reference to a related
resource.
Element Coverage: The extent or scope of the
content of the resource. Coverage will typically
include spatial location (a place name or
geographic coordinates), temporal period (a
period label, date, or date range) or
jurisdiction (such as a named administrative
entity).
Element Rights: Information about rights held in
and over the resource.
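A minimal unqualified Dublin Core record can be serialized in a few lines of Python. Namespace handling is simplified here, and a real oai_dc record also carries the oai_dc container element and schema attributes:

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

def dc_record(fields):
    """Serialize a dict of Dublin Core elements as an XML fragment."""
    root = ET.Element("metadata")
    for name, value in fields.items():
        ET.SubElement(root, "{%s}%s" % (DC_NS, name)).text = value
    return ET.tostring(root, encoding="unicode")

xml = dc_record({
    "title": "Converting Existing Corpus to an OAI Compliant Repository",
    "creator": "K. Maly",
})
print(xml)
```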
54
Beyond Dublin Core Metadata
Need to support parallel metadata sets to enable
the OAI service provider to take advantage of the
richer metadata fields for resource discovery.
The OAI metadata harvesting protocol supports
the notion of parallel metadata sets, allowing
collections to expose metadata in formats that
are specific to their applications and domains.
The OAI technical framework places no limitations
on the nature of such parallel sets, other than
that the metadata records be structured as XML
data that have a corresponding XML schema for
validation.
55
  • Metadata Harvesting
  • Move away from distributed searching
  • - Distributed searching cannot scale well to a
    large number of participants
  • Extract metadata from various sources
  • - Build services on local copies of the metadata
  • - The data remains at the remote repositories

RCDL 2003, St. Petersburg
56
  • OAI Requests and OAI Responses
  • - An OAI Request for metadata is embedded in
    HTTP
  • - The OAI Response to an OAI Request is encoded
    in XML
  • - The XML Schema specification for OAI Responses
    is provided in the OAI-PMH document

57
Service Provider
Data Provider
  • Supporting protocol requests
  • Identify
  • ListMetadataFormats
  • ListSets
  • Harvesting protocol requests
  • ListRecords
  • ListIdentifiers
  • GetRecord
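Each of these protocol requests is an HTTP request whose query parameters name the verb and its arguments; a small sketch (the base URL is hypothetical):

```python
from urllib.parse import urlencode

def oai_request_url(base_url, verb, **args):
    """Encode an OAI-PMH request: the verb and its arguments become
    HTTP query parameters against the repository's base URL."""
    return base_url + "?" + urlencode({"verb": verb, **args})

# "from" is a Python keyword, so it is passed via a dict.
url = oai_request_url("http://example.org/oai", "ListRecords",
                      metadataPrefix="oai_dc", set="klm",
                      **{"from": "2004-01-01", "until": "2004-06-01"})
print(url)
```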

58
Service Provider
Data Provider
Identify
  • Repository name
  • Base-URL
  • Admin e-mail
  • OAI protocol version
  • Description Container

59
Service Provider
Data Provider
ListMetadataFormats
  • REPEAT
  • Format prefix
  • Format XML schema
  • /REPEAT

60
Service Provider
Data Provider
ListSets
  • REPEAT
  • Set Specification
  • Set Name
  • /REPEAT

61
Service Provider
Data Provider
ListRecords (from=a, until=b, set=klm,
metadataPrefix=oai_dc)
  • REPEAT
  • Identifier
  • Datestamp
  • Metadata
  • About Container
  • /REPEAT

62
Service Provider
Data Provider

ListIdentifiers (from=a, until=b, set=klm,
metadataPrefix=oai_dc)
  • REPEAT
  • Identifier
  • Datestamp
  • /REPEAT

63
Service Provider
Data Provider
GetRecord (identifier=oai:mlib:123a,
metadataPrefix=oai_dc)
  • Identifier
  • Datestamp
  • Metadata
  • About

64
OAI Mechanics
Requests are encoded in HTTP
Responses are encoded in XML
XML Schemas for the responses are defined in the
OAI-PMH document
Courtesy: Michael Nelson