1
Automated Metadata Extraction
July 17-20, 2006
Kurt Maly, maly@cs.odu.edu
2
Outline
  • Background and Motivation
  • Challenges and Approaches
  • Metadata Extraction Experience at ODU CS
  • Architecture for Metadata Extraction
  • Experiments with DTIC Documents
  • Experiments with limited GPO Documents
  • Conclusions

3
Digital Library Research at ODU
http://dlib.cs.odu.edu/
4
Motivation
  • Metadata enhances the value of a document
    collection
  • Using metadata helps resource discovery
  • Using metadata in a company intranet may save
    about $8,200 per employee by reducing time spent
    searching, verifying, and organizing files
    (estimate given by Mike Doane at the DCMI 2003
    workshop)
  • Using metadata helps make collections
    interoperable with OAI-PMH
  • Manual metadata extraction is costly and
    time-consuming
  • It would take about 60 employee-years to create
    metadata for 1 million documents (estimate given
    by Lou Rosenfeld at the DCMI 2003 workshop), so
    automatic metadata extraction tools are essential
    to reduce the cost.
  • Automatic extraction tools are essential for
    rapid dissemination at reasonable cost
  • OCR is not sufficient for making legacy
    documents searchable.

5
Challenges
  • A successful metadata extraction system must
  • extract metadata accurately
  • scale to large document collections
  • cope with heterogeneity within a collection
  • maintain accuracy, with minimal
    reprogramming/training cost, as the collection
    evolves over time
  • have a validation/correction process

6
Approaches
  • Machine Learning
  • HMM
  • SVM
  • Rule-Based
  • Ad Hoc
  • Expert Systems
  • Template-Based (ODU CS)

7
Comparison
  • Machine-Learning Approach
  • Good adaptability, but it must be trained from
    samples, which is very time-consuming
  • Performance degrades with increasing
    heterogeneity
  • Difficult to add new fields to be extracted
  • Difficult to select the right features for
    training
  • Rule-based
  • No need for training from samples
  • Can extract different metadata from different
    documents
  • Rule writing may require significant technical
    expertise

8
Metadata Extraction Experience at ODU CS
  • DTIC (2004, 2005)
  • developed software to automate the task of
    extracting metadata and basic structure from DTIC
    PDF documents
  • explored alternatives including SVM, HMM, expert
    systems
  • origin of the ODU template-based engine
  • GPO (in progress)
  • NASA (in progress)
  • Feasibility study to apply template-based
    approach to CASI collection

9
Meeting the Challenges
  • All techniques achieved reasonable accuracy for
    small collections
  • possible to scale to large homogeneous
    collections
  • Heterogeneity remains a problem
  • Ad hoc rule-based systems tend to become complex
    monoliths
  • Expert systems tend toward large rule sets with
    complex, poorly understood interactions
  • Machine learning must choose between reduced
    accuracy/confidence and state explosion
  • Evolution is problematic for machine-learning
    approaches
  • older documents may have higher rate of OCR
    errors
  • expensive retraining required to accommodate
    changes in collection
  • potential lag time during which accuracy decays
    until sufficient training instances acquired
  • Validation: a largely unexplored area
  • Machine-learning approaches offer some support
    via confidence measures

10
Architecture for Metadata Extraction

11
Our Approach: Meeting the Challenges
  • Bi-level architecture
  • Classification based upon document similarity
  • Simple templates (rule-based) written for each
    emerging class

12
Our Approach: Meeting the Challenges
  • Heterogeneity
  • Classification, in effect, reduces the problem to
    multiple homogeneous collections
  • Multiple templates required, but each template is
    comparatively simple
  • only needs to accommodate one class of documents
    that share a common layout and style
  • Evolution
  • New classes of documents accommodated by writing
    a new template
  • templates are comparatively simple
  • no lengthy retraining required
  • potentially rapid response to changes in
    collection
  • Enriching the template engine by introducing new
    features to reduce complexity of templates
  • Validation
  • Exploring a variety of techniques drawn from
    automated software testing and validation

13
Metadata Extraction: Template-based
  • Template-based approach
  • Classify documents into classes based on
    similarity
  • For each document class, create a template, or a
    set of rules
  • Decoupling rules from coding
  • A template is kept in a separate file
  • Advantages
  • Easy to extend
  • For a new document class, just create a template
  • Rules are simpler
  • Rules can be refined easily

14
Classes of documents
15
Template engine
16
Document features
  • Layout features
  • Boldness, i.e., whether text is in bold font or
    not
  • Font size, i.e., the font size used in text, e.g.
    font size 12, font size 14, etc
  • Alignment, i.e., whether text is left-aligned,
    right-aligned, centered, or justified
  • Geometric location, for example, a block starting
    with coordinates (0, 0) and ending with
    coordinates (100, 200)
  • Geometric relation, for example, a block located
    below the title block.

17
Document features
  • Textual features
  • Special words, for example, a string starting
    with "Abstract"
  • Special patterns, for example, a string matching
    the regular expression [1-2][0-9][0-9][0-9]
  • Statistical features, for example, a string with
    more than 20 words, a string with more than 100
    letters, and a string with more than 50 letters
    in upper case
  • Knowledge features, for example, a string
    containing a last name from a name dictionary
    (see the sketch below)
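
A minimal Python sketch of how such textual features could be computed for a text block. The helper names and the tiny name dictionary are hypothetical illustrations, not the actual ODU code:

```python
import re

# Hypothetical name dictionary; the real system builds one from its knowledge base.
NAME_DICTIONARY = {"smith", "maly", "johnson"}

# The four-digit pattern from the bullet above.
SPECIAL_PATTERN = re.compile(r"[1-2][0-9][0-9][0-9]")

def textual_features(block: str) -> dict:
    """Compute simple textual features for one text block (illustration only)."""
    words = block.split()
    return {
        "starts_with_abstract": block.lstrip().lower().startswith("abstract"),
        "matches_special_pattern": bool(SPECIAL_PATTERN.search(block)),
        "more_than_20_words": len(words) > 20,
        "more_than_50_uppercase_letters": sum(c.isupper() for c in block) > 50,
        "contains_known_last_name": any(w.strip(".,").lower() in NAME_DICTIONARY
                                        for w in words),
    }

print(textual_features("ABSTRACT  This 2006 report by K. Maly describes ..."))
```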

18
Template language
  • XML based
  • Related to document features
  • XML schema
  • Simple document model
  • Document → page → zone → region → column → row →
    paragraph → line → word → character
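
The actual ODU template schema is not reproduced in this transcript; the following is only a hypothetical Python sketch of what an XML template for one document class, together with a toy engine that applies it to extracted lines, might look like:

```python
import xml.etree.ElementTree as ET

# Hypothetical template for one document class; the real ODU schema differs.
TEMPLATE = """
<template class="dtic-report">
  <field name="title">
    <rule minFontSize="14" bold="true" maxLine="5"/>
  </field>
  <field name="abstract">
    <rule beginsWith="abstract" minWords="20"/>
  </field>
</template>
"""

def extract(lines, template_xml):
    """Apply the toy template rules to a list of (text, font_size, bold) lines."""
    root = ET.fromstring(template_xml)
    metadata = {}
    for field in root.findall("field"):
        rule = field.find("rule")
        for i, (text, font_size, bold) in enumerate(lines):
            if "minFontSize" in rule.attrib and font_size < int(rule.get("minFontSize")):
                continue
            if rule.get("bold") == "true" and not bold:
                continue
            if "maxLine" in rule.attrib and i >= int(rule.get("maxLine")):
                continue
            if "beginsWith" in rule.attrib and \
               not text.lower().startswith(rule.get("beginsWith")):
                continue
            if "minWords" in rule.attrib and len(text.split()) < int(rule.get("minWords")):
                continue
            metadata.setdefault(field.get("name"), text)  # keep first matching line
    return metadata

lines = [("Automated Metadata Extraction", 18, True),
         ("Kurt Maly", 12, False),
         ("Abstract This report describes a template-based engine" + " word" * 20, 10, False)]
print(extract(lines, TEMPLATE))
```

Because each template only has to describe one class of documents with a shared layout, the rules stay short; adding a new class means writing a new file like this rather than retraining a model.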

19
Template sample
20
Sample document pdf
21
Scan OCR output
22
Clean XML output
23
Template (part)
24
Metadata extracted
25
Results Summary from DTIC Project
26
Experiment with Limited GPO Documents
  • 14 GPO Documents having Technical Report
    Documentation Page
  • 57 GPO Documents without Technical Report
    Documentation Page
  • 16 Congressional Reports
  • 16 Public Law Documents

27

GPO Report Documentation Page
28

GPO Document
29
Congressional Report
30

Public Law Document
31
Conclusions
  • OCR software works very well on current documents
  • Template-based approach allows automatic metadata
    extraction from
  • Dynamically changing collections
  • Heterogeneous, large collections
  • Report document pages
  • High degree of accuracy
  • Feasibility of structural metadata extraction
    (e.g., table of contents, tables, equations,
    sections)

32
Metadata Extraction Part II: Automatic Categorization

33
Document Categorization
  • Problem: given
  • a collection of documents, and
  • a taxonomy of subject areas
  • Classification: determine the subject area(s)
    most pertinent to each document
  • Indexing: select a set of keywords / index terms
    appropriate to each document

34
Classification Techniques
  • Manual (a.k.a. Knowledge Engineering)
  • typically, rule-based expert systems
  • Machine Learning
  • Probabilistic (e.g., Naïve Bayesian)
  • Decision Structures (e.g., Decision Trees)
  • Profile-Based
  • compare document to profile(s) of subject classes
  • similarity rules like those employed in IR
  • Support Vector Machines (e.g., SVM)

35
Classification via Machine Learning
  • Usually train-and-test
  • Exploit an existing collection in which documents
    have already been classified
  • a portion used as the training set
  • another portion used as a test set
  • permits measurement of classifier effectiveness
  • allows tuning of classifier parameters to yield
    maximum effectiveness
  • Single- vs. multi-label
  • can 1 document be assigned to multiple categories?

36
Automatic Indexing
  • Assign to each document up to k terms drawn from
    a controlled vocabulary
  • Typically reduced to a multi-label classification
    problem
  • each keyword corresponds to a class of documents
    for which that keyword is an appropriate
    descriptor
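
A minimal sketch of that reduction, using scikit-learn (the slides do not name a library, so this is purely illustrative): each controlled-vocabulary term gets its own binary classifier, and a document receives every term whose classifier accepts it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy documents with manually assigned index terms (illustration only).
docs = ["helicopter rotor aerodynamics wind tunnel test",
        "bomber navigation radar and avionics systems",
        "rotor blade fatigue in attack and fighter aircraft"]
terms = [{"aerodynamics", "helicopters"},
         {"bombers", "radar"},
         {"aerodynamics", "attack aircraft"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(terms)          # one binary column per index term
vec = TfidfVectorizer()
X = vec.fit_transform(docs)           # weighted term vectors

# One linear SVM per index term = the multi-label reduction described above.
clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)

new_doc = vec.transform(["fatigue tests of helicopter rotor blades"])
print(mlb.inverse_transform(clf.predict(new_doc)))
```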

37
Case Study: SVM Categorization
  • Document Collection from DTIC
  • 10,000 documents
  • previously classified manually
  • Taxonomy of
  • 25 broad subject fields, divided into a total of
  • 251 narrower groups
  • Document lengths average 2705 ± 1464 words, with
    623 ± 274 significant unique terms
  • Collection has 32,457 significant unique terms

38
Document Collection
39
(No Transcript)
40
Sample Broad Subject Fields
  • 01--Aviation Technology
  • 02--Agriculture
  • 03--Astronomy and Astrophysics
  • 04--Atmospheric Sciences
  • 05--Behavioral and Social Sciences
  • 06--Biological and Medical Sciences
  • 07--Chemistry
  • 08--Earth Sciences and Oceanography

41
Sample Narrow Subject Groups
  • Aviation Technology
  • 01 Aerodynamics
  • 02 Military Aircraft Operations
  • 03 Aircraft
  • 0301 Helicopters
  • 0302 Bombers
  • 0303 Attack and Fighter Aircraft
  • 0304 Patrol and Reconnaissance Aircraft

42
Distribution among Categories
43
(No Transcript)
44
Baseline
  • Establish baseline for state-of-the-art machine
    learning techniques
  • classification
  • training SVM for each subject area
  • off-the-shelf document modelling and SVM
    libraries

45
Why SVM?
  • Prior studies have suggested good results with
    SVM
  • relatively immune to overfitting (fitting to
    coincidental relations encountered during
    training)
  • few model parameters
  • avoids problems of optimizing in high-dimension
    space

46
Machine Learning: Support Vector Machines
  • Binary Classifier
  • Finds the plane with largest margin to separate
    the two classes of training samples
  • Subsequently classifies items based on which side
    of the hyperplane they fall

[Figure: hyperplane with maximum margin separating two
classes, plotted by font size vs. line number]
47
SVM Evaluation
48
Baseline SVM Evaluation (Interim Report)
  • Training/testing process repeated for multiple
    subject categories
  • Determine accuracy
  • overall
  • positive (ability to recognize new documents that
    belong in the class the SVM was trained for)
  • negative (ability to reject new documents that
    belong to other classes)
  • Explore Training Issues

49
SVM Out of the Box
  • 16 broad categories with 150 or more documents
  • Lucene library extracting terms and forming
    weighted term vectors
  • LibSVM for SVM training and testing
  • no normalization or parameter tuning
  • Training set of 100/100 (positive/negative
    samples)
  • Test set of 50/50
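
A rough sketch of that out-of-the-box setup. The original pipeline used the Lucene library for weighted term vectors and LibSVM for training; the scikit-learn calls and toy data below are stand-ins, with no normalization or parameter tuning, as in the experiment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

def train_and_test(pos_train, neg_train, pos_test, neg_test):
    """Train a binary SVM for one broad category and return overall test accuracy."""
    vec = TfidfVectorizer()                       # weighted term vectors (Lucene stand-in)
    X_train = vec.fit_transform(pos_train + neg_train)
    y_train = [1] * len(pos_train) + [0] * len(neg_train)
    clf = SVC().fit(X_train, y_train)             # default parameters, no tuning

    X_test = vec.transform(pos_test + neg_test)
    y_test = [1] * len(pos_test) + [0] * len(neg_test)
    return accuracy_score(y_test, clf.predict(X_test))

# Toy stand-in data; the experiment used 100/100 training and 50/50 test documents.
pos = ["helicopter rotor aerodynamics wind tunnel"] * 10
neg = ["soil chemistry and crop yield analysis"] * 10
print(train_and_test(pos[:8], neg[:8], pos[8:], neg[8:]))
```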

50
Accuracy
51
OotB Interpretation
  • Reasonable performance on broad categories given
    modest training set size.
  • Accuracy measured as (# correct decisions / test
    set size)
  • A related experiment showed that, with
    normalization and optimized parameter selection,
    accuracy could be improved by as much as an
    additional 10%

52
Training Set Size
53
Training Set Size
  • accuracy plateaus for training set sizes well
    under the number of terms in the document model

54
Training Issues
  • Training Set Size
  • Concern: detailed subject groups may have too few
    known examples to perform effective SVM training
    in that subject
  • Possible solution: the collection may have few
    positive examples, but it has many, many negative
    examples
  • Positive/Negative Training Mixes
  • effects on accuracy

55
Increased Negative Training
56
Training Set Composition
  • experiment performed with 50 positive training
    examples
  • OotB SVM training
  • increasing the number of negative training
    examples has little effect on overall accuracy
  • but positive accuracy reduced

57
Interpretation
  • may indicate a weakness in SVM
  • or simply further evidence of the importance of
    optimizing SVM parameters
  • may indicate unsuitability of treating SVM output
    as simple boolean decision
  • might do better as best fit in a multi-label
    classifier

58
Conclusions
  • State of the art for DTIC-like collections will
    give on the order of 75% accuracy
  • Key problems that need to be addressed
  • establish baseline for other methods
  • validation: recognizing trusted results
  • when to fall back on human intervention
  • improve on baseline by more sophisticated methods
  • possible application for knowledge bases

59
Additional Slides
60
Metadata Extraction: Machine-Learning Approach
  • Learn the relationship between input and output
    from samples and make predictions for new data
  • This approach has good adaptability but it has to
    be trained from samples.
  • HMM (Hidden Markov Model), SVM (Support Vector
    Machine)

61
Machine Learning - Hidden Markov Models
  • "Hidden Markov Modeling is a probabilistic
    technique for the study of observed items
    arranged in discrete-time series" -- Alan B.
    Poritz, "Hidden Markov Models: A Guided Tour",
    ICASSP 1988
  • An HMM is a probabilistic finite state automaton
  • Transits from state to state
  • Emits a symbol when visiting each state
  • States are hidden

62
Hidden Markov Models
  • A Hidden Markov Model consists of
  • A set of hidden states (e.g. coin1, coin2,
    coin3)
  • A set of observation symbols ( e.g. H and T)
  • Transition probabilities: the probabilities of
    moving from one state to another
  • Emission probabilities: the probability of
    emitting each symbol in each state
  • Initial probabilities: the probability of each
    state being chosen as the first state

63
HMM - Metadata Extraction
  • A document is a sequence of words that is
    produced by some hidden states (title, author,
    etc.)
  • The parameters of the HMM are learned from
    samples in advance
  • Metadata extraction then finds the most probable
    sequence of states (title, author, etc.) for a
    given sequence of words (see the sketch below)
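
A minimal Viterbi decoding sketch for that idea; the states, vocabulary, and probabilities below are made up for illustration, whereas a real model would learn them from labeled samples.

```python
# Toy HMM: states are metadata fields, observations are words.
STATES = ["title", "author", "body"]
START = {"title": 0.8, "author": 0.1, "body": 0.1}
TRANS = {"title":  {"title": 0.70, "author": 0.20, "body": 0.10},
         "author": {"title": 0.05, "author": 0.60, "body": 0.35},
         "body":   {"title": 0.05, "author": 0.05, "body": 0.90}}
EMIT = {"title":  {"metadata": 0.3, "extraction": 0.3},
        "author": {"kurt": 0.4, "maly": 0.5},
        "body":   {"the": 0.2, "report": 0.2}}
FLOOR = 0.01  # probability for words not in a state's emission table

def viterbi(words):
    """Return the most probable hidden-state (field) sequence for a word sequence."""
    V = [{s: (START[s] * EMIT[s].get(words[0], FLOOR), [s]) for s in STATES}]
    for w in words[1:]:
        layer = {}
        for s in STATES:
            prob, path = max((V[-1][p][0] * TRANS[p][s] * EMIT[s].get(w, FLOOR),
                              V[-1][p][1]) for p in STATES)
            layer[s] = (prob, path + [s])
        V.append(layer)
    return max(V[-1].values())[1]

print(viterbi(["metadata", "extraction", "kurt", "maly", "the", "report"]))
```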

64
Machine Learning: Support Vector Machines
  • Binary Classifier (classify data into two
    classes)
  • It represents data with pre-defined features
  • It finds the plane with largest margin to
    separate the two classes from samples
  • It classifies data into two classes based on
    which side of the hyperplane they are located

The figure shows an SVM example that classifies a line
into two classes (title, not title) using two features:
font size and line number (1, 2, 3, etc.). Each dot
represents a line: red dot = title, blue dot = not
title.
65
SVM - Metadata Extraction
  • Widely used in pattern recognition areas such as
    face detection, isolated handwritten digit
    recognition, gene classification, etc.
  • Basic idea
  • Classes → metadata elements
  • Extracting metadata from a document → classifying
    each line (or block) into appropriate classes
  • For example
  • Extract the document title from a document →
  • Classify each line to see whether it is part of
    the title or not (see the sketch below)
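
A toy version of that title/not-title example, assuming scikit-learn and only the two features from the earlier figure (font size and line number); a real system would use many more features and training lines.

```python
from sklearn.svm import SVC

# Each line is represented as (font size, line number); label 1 = title line.
X = [[18, 1], [18, 2], [12, 3], [10, 8], [10, 20], [11, 30]]
y = [1, 1, 0, 0, 0, 0]

clf = SVC(kernel="linear").fit(X, y)      # maximum-margin separating hyperplane
print(clf.predict([[17, 1], [10, 25]]))   # expected: [1 0] (title, not title)
```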

66
Metadata Extraction: Rule-based
  • Basic idea
  • Use a set of rules, based on human observation,
    to define how to extract metadata
  • For example, a rule may be: "The first line is
    the title" (see the sketch below)
  • Advantage
  • Can be implemented straightforwardly
  • No need for training
  • Disadvantage
  • Lack of adaptability (works only for similar
    documents)
  • Difficult to work with a large number of features
  • Difficult to tune the system when errors occur
    because rules are usually fixed
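
A minimal sketch of such hand-written rules; the two rules below are invented for illustration, not taken from any deployed system:

```python
def extract_rule_based(lines):
    """Apply two hard-coded rules: the first line is the title, and a line
    starting with 'Abstract' begins the abstract."""
    metadata = {"title": lines[0].strip()} if lines else {}
    for i, line in enumerate(lines):
        if line.strip().lower().startswith("abstract"):
            metadata["abstract"] = " ".join(l.strip() for l in lines[i:i + 5])
            break
    return metadata

doc = ["Automated Metadata Extraction",
       "Kurt Maly, Old Dominion University",
       "Abstract",
       "This report describes a template-based approach ..."]
print(extract_rule_based(doc))
```

Hard-coded rules like these are simple to implement and need no training, but, as noted above, they break down when the collection's layout varies.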

67
Metadata Extraction - Rule-based
  • Expert system approach
  • Build a large rule base using standard languages
    such as Prolog
  • Use an existing expert system engine (for
    example, SWI-Prolog)
  • Advantages
  • Can use existing engine
  • Disadvantages
  • Building rule base is time-consuming

68
Metadata Extraction Experience at ODU CS
  • We have a knowledge database obtained from
    analyzing the Arc and DTIC collections
  • Authors (4 million strings from
    http://arc.cs.odu.edu)
  • Organizations (79 from DTIC 250, 200 from DTIC
    600)
  • Universities (52 from DTIC 250)