Document Categorization - PowerPoint PPT Presentation

Provided by: wcar9 (https://www.cs.odu.edu)
Slides: 26

1
Document Categorization
  • Problem: given
  • a collection of documents, and
  • a taxonomy of subject areas
  • Classification: determine the subject area(s)
    most pertinent to each document
  • Indexing: select a set of keywords / index terms
    appropriate to each document

2
Classification Techniques
  • Manual (a.k.a. Knowledge Engineering)
  • typically, rule-based expert systems
  • Machine Learning
  • Probabilistic (e.g., Naïve Bayesian)
  • Decision Structures (e.g., Decision Trees)
  • Profile-Based
  • compare document to profile(s) of subject classes
  • similarity measures similar to those employed in
    IR
  • Support Vector Machines (SVM)

3
Machine Learning Procedures
  • Usually train-and-test
  • Exploit an existing collection in which documents
    have already been classified
  • a portion used as the training set
  • another portion used as a test set
  • permits measurement of classifier effectiveness
  • allows tuning of classifier parameters to yield
    maximum effectiveness
  • Single- vs. multi-label
  • can 1 document be assigned to multiple categories?
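The train-and-test procedure above can be sketched in plain Python. This is a minimal illustration of partitioning an already-classified collection and measuring classifier effectiveness; the labels and document names are hypothetical, not from the study:

```python
import random

def train_test_split(labeled_docs, train_fraction=0.8, seed=42):
    """Partition an already-classified collection into a training set
    and a test set, as in the train-and-test procedure."""
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)
    cut = int(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]

def accuracy(classifier, test_set):
    """Fraction of test documents whose predicted label matches the
    known (manually assigned) one."""
    hits = sum(1 for doc, label in test_set if classifier(doc) == label)
    return hits / len(test_set)

# Hypothetical collection: (document id, subject label) pairs.
collection = [(f"doc{i}", "aviation" if i % 2 == 0 else "chemistry")
              for i in range(10)]
train_set, test_set = train_test_split(collection)
```

Holding out a test set the classifier never saw during training is what permits an unbiased effectiveness measurement; tuning parameters against the test set would leak information back into training.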

4
Automatic Indexing
  • Assign to each document up to k terms drawn from
    a controlled vocabulary
  • Typically reduced to a multi-label classification
    problem
  • each keyword corresponds to a class of documents
    for which that keyword is an appropriate
    descriptor
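The reduction described above can be sketched as one binary classifier per controlled-vocabulary keyword; the keyword names and cue-term classifiers below are invented stand-ins for trained models:

```python
def assign_keywords(doc_terms, keyword_classifiers, k=5):
    """Run one binary classifier per controlled-vocabulary keyword and
    keep at most k keywords whose classifier accepts the document."""
    accepted = [kw for kw, clf in keyword_classifiers.items() if clf(doc_terms)]
    return accepted[:k]

# Toy stand-in classifiers: accept when a cue term appears in the document.
classifiers = {
    "aerodynamics": lambda terms: "lift" in terms,
    "helicopters":  lambda terms: "rotor" in terms,
    "chemistry":    lambda terms: "reagent" in terms,
}
print(assign_keywords({"rotor", "lift", "blade"}, classifiers))
# → ['aerodynamics', 'helicopters']
```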

5
Case Study: SVM Categorization
  • Document Collection from DTIC
  • 10,000 documents
  • previously classified manually
  • Taxonomy of
  • 25 broad subject fields, divided into a total of
  • 251 narrower groups
  • Document lengths average 2705±1464 words, 623±274
    significant unique terms.
  • Collection has 32,457 significant unique terms

6
Document Collection
7
(No Transcript)
8
Sample Broad Subject Fields
  • 01--Aviation Technology
  • 02--Agriculture
  • 03--Astronomy and Astrophysics
  • 04--Atmospheric Sciences
  • 05--Behavioral and Social Sciences
  • 06--Biological and Medical Sciences
  • 07--Chemistry
  • 08--Earth Sciences and Oceanography

9
Sample Narrow Subject Groups
  • Aviation Technology
  • 01 Aerodynamics
  • 02 Military Aircraft Operations
  • 03 Aircraft
  • 0301 Helicopters
  • 0302 Bombers
  • 0303 Attack and Fighter Aircraft
  • 0304 Patrol and Reconnaissance Aircraft
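Assuming the sample codes concatenate two-digit levels (e.g. "0301" = group 03, subgroup 01 within a field), a code could be split like this. The convention is an inference from the list above, not stated in the slides:

```python
def split_group_code(code):
    """Split a taxonomy code like '0301' into (group, subgroup), assuming
    two concatenated two-digit levels; a bare two-digit code has no subgroup."""
    assert code.isdigit() and len(code) in (2, 4)
    return (code[:2], code[2:]) if len(code) == 4 else (code, None)

print(split_group_code("0301"))  # → ('03', '01'), i.e. Helicopters under Aircraft
```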

10
Distribution among Categories
11
(No Transcript)
12
Baseline
  • Establish a baseline for conventional techniques
  • classification: training an SVM for each subject
    area
  • off-the-shelf document modelling and SVM
    libraries

13
Why SVM?
  • Prior studies have suggested good results with
    SVM
  • relatively immune to overfitting (fitting to
    coincidental relations encountered during
    training)
  • low dimensionality of model parameters

14
Machine Learning: Support Vector Machines
  • Binary classifier
  • Finds the hyperplane with the largest margin
    separating the two classes of training samples
  • Subsequently classifies items based on which side
    of the hyperplane they fall
(figure: separating hyperplane with maximum margin between the two classes)
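A linear SVM's decision rule and margin can be sketched directly from the definition above. This is a toy illustration of the geometry, not the LibSVM implementation used in the study:

```python
import math

def svm_classify(w, b, x):
    """Decide which side of the hyperplane w.x + b = 0 the point x falls on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def margin(w, b, x):
    """Geometric distance from x to the hyperplane; SVM training chooses the
    hyperplane that maximizes the smallest such distance over the samples."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Toy hyperplane x0 + x1 = 1, i.e. w = (1, 1), b = -1
assert svm_classify((1, 1), -1, (2, 2)) == 1   # falls on the positive side
assert svm_classify((1, 1), -1, (0, 0)) == -1  # falls on the negative side
```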
15
SVM Evaluation
16
Baseline SVM Evaluation
  • Training / testing process repeated for multiple
    subject categories
  • Determine accuracy
  • overall
  • positive (ability to recognize new documents that
    belong in the class the SVM was trained for)
  • negative (ability to reject new documents that
    belong to other classes)
  • Explore Training Issues
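The three accuracy measures above can be computed from per-document (predicted, actual) class-membership pairs. A minimal sketch; the example counts in the test are illustrative, not the study's results:

```python
def accuracies(predictions):
    """predictions: list of (predicted_in_class, actually_in_class) booleans.
    Returns (overall, positive, negative) accuracy."""
    pos = [p for p, a in predictions if a]          # documents truly in the class
    neg = [p for p, a in predictions if not a]      # documents truly outside it
    overall = sum(p == a for p, a in predictions) / len(predictions)
    positive = sum(pos) / len(pos)                  # recognized in-class documents
    negative = sum(not p for p in neg) / len(neg)   # rejected out-of-class documents
    return overall, positive, negative
```

Reporting positive and negative accuracy separately matters here because overall accuracy can look good even when a classifier rarely recognizes the (typically rarer) in-class documents.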

17
SVM Out of the Box
  • 16 broad categories with 150 or more documents
  • Lucene library for model preparation
  • LibSVM for SVM training and testing
  • no normalization or parameter tuning
  • Training set of 100/100 (positive/negative
    samples)
  • Test set of 50/50

18
(No Transcript)
19
OOtB Interpretation
  • Reasonable performance on broad categories, given
    the modest training set size.
  • A related experiment showed that with normalization
    and optimized parameter selection, accuracy could
    be improved by as much as an additional 10%

20
Training Set Size
21
Training Set Size
  • accuracy plateaus for training set sizes well
    under the number of terms in the document model

22
Training Issues
  • Training Set Size
  • Concern: detailed subject groups may have too few
    known examples to perform effective SVM training
    in that subject
  • Possible solution: the collection may have few
    positive examples, but it has many, many negative
    examples
  • Positive/Negative Training Mixes
  • effects on accuracy
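The composition experiment above might be set up as follows. The pool and mix sizes echo the slides, but the helper and names are hypothetical:

```python
import random

def training_mix(positives, negatives, n_pos=50, n_neg=50, seed=0):
    """Sample a training set with a fixed number of positive examples and a
    chosen number of negative ones, to study composition effects."""
    rng = random.Random(seed)
    return rng.sample(positives, n_pos), rng.sample(negatives, n_neg)

pos_pool = [f"in-class-{i}" for i in range(150)]        # scarce positives
neg_pool = [f"out-of-class-{i}" for i in range(9850)]   # plentiful negatives
for n_neg in (50, 100, 200, 400):
    tr_pos, tr_neg = training_mix(pos_pool, neg_pool, n_neg=n_neg)
    # an SVM would be trained on tr_pos + tr_neg here and the three
    # accuracies recorded for each mix
```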

23
Increased Negative Training
24
Training Set Composition
  • experiment performed with 50 positive training
    examples
  • OotB SVM training
  • increasing the number of negative training
    examples has little effect on overall accuracy
  • but positive accuracy is reduced

25
Interpretation
  • may indicate a weakness in SVM
  • or simply further evidence of the importance of
    optimizing SVM parameters
  • may indicate the unsuitability of treating SVM
    output as a simple Boolean decision
  • might do better as a best fit in a multi-label
    classifier
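The best-fit idea above can be sketched as picking the class whose SVM decision value (signed distance from its hyperplane) is largest, instead of thresholding each binary SVM at zero. The decision functions here are toy stand-ins, not trained models:

```python
def best_fit_label(doc, decision_functions):
    """Assign the class with the largest decision value for the document,
    even if every per-class Boolean decision would be 'reject'."""
    return max(decision_functions, key=lambda label: decision_functions[label](doc))

# Both classes would reject this document under a zero threshold
# (negative scores), but one is clearly the better fit.
scores = {"aviation": lambda d: -0.2, "chemistry": lambda d: -1.7}
print(best_fit_label("doc", scores))  # → aviation
```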