Document Categorization - PowerPoint PPT Presentation

Provided by: wcar9 (https://www.cs.odu.edu)
Slides: 26

1
Document Categorization
  • Problem: given
  • a collection of documents, and
  • a taxonomy of subject areas
  • Classification: determine the subject area(s)
    most pertinent to each document
  • Indexing: select a set of keywords / index terms
    appropriate to each document

2
Classification Techniques
  • Manual (a.k.a. Knowledge Engineering)
  • typically, rule-based expert systems
  • Machine Learning
  • Probabilistic (e.g., Naïve Bayesian)
  • Decision Structures (e.g., Decision Trees)
  • Profile-Based
  • compare document to profile(s) of subject classes
  • similarity measures similar to those employed in
    IR
  • Support Vector Machines (SVM)

3
Machine Learning Procedures
  • Usually train-and-test
  • Exploit an existing collection in which documents
    have already been classified
  • a portion used as the training set
  • another portion used as a test set
  • permits measurement of classifier effectiveness
  • allows tuning of classifier parameters to yield
    maximum effectiveness
  • Single- vs. multi-label
  • can 1 document be assigned to multiple categories?
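The train-and-test procedure above can be sketched in plain Python. This is a minimal illustration of partitioning an already-classified collection and measuring classifier effectiveness; the labels and document names are hypothetical, not from the study:

```python
import random

def train_test_split(labeled_docs, train_fraction=0.8, seed=42):
    """Partition an already-classified collection into a training set
    and a test set, as in the train-and-test procedure."""
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)
    cut = int(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]

def accuracy(classifier, test_set):
    """Fraction of test documents whose predicted label matches the
    known (manually assigned) one."""
    hits = sum(1 for doc, label in test_set if classifier(doc) == label)
    return hits / len(test_set)

# Hypothetical collection: (document id, subject label) pairs.
collection = [(f"doc{i}", "aviation" if i % 2 == 0 else "chemistry")
              for i in range(10)]
train_set, test_set = train_test_split(collection)
```

Holding out a test set the classifier never saw during training is what permits an unbiased effectiveness measurement; tuning parameters against the test set would leak information back into training.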

4
Automatic Indexing
  • Assign to each document up to k terms drawn from
    a controlled vocabulary
  • Typically reduced to a multi-label classification
    problem
  • each keyword corresponds to a class of documents
    for which that keyword is an appropriate
    descriptor
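The reduction described above can be sketched as one binary classifier per controlled-vocabulary keyword; the keyword names and cue-term classifiers below are invented stand-ins for trained models:

```python
def assign_keywords(doc_terms, keyword_classifiers, k=5):
    """Run one binary classifier per controlled-vocabulary keyword and
    keep at most k keywords whose classifier accepts the document."""
    accepted = [kw for kw, clf in keyword_classifiers.items() if clf(doc_terms)]
    return accepted[:k]

# Toy stand-in classifiers: accept when a cue term appears in the document.
classifiers = {
    "aerodynamics": lambda terms: "lift" in terms,
    "helicopters":  lambda terms: "rotor" in terms,
    "chemistry":    lambda terms: "reagent" in terms,
}
print(assign_keywords({"rotor", "lift", "blade"}, classifiers))
# → ['aerodynamics', 'helicopters']
```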

5
Case Study: SVM Categorization
  • Document Collection from DTIC
  • 10,000 documents
  • previously classified manually
  • Taxonomy of
  • 25 broad subject fields, divided into a total of
  • 251 narrower groups
  • Document lengths average 2705±1464 words, 623±274
    significant unique terms.
  • Collection has 32,457 significant unique terms

6
Document Collection
7
(No Transcript)
8
Sample Broad Subject Fields
  • 01--Aviation Technology
  • 02--Agriculture
  • 03--Astronomy and Astrophysics
  • 04--Atmospheric Sciences
  • 05--Behavioral and Social Sciences
  • 06--Biological and Medical Sciences
  • 07--Chemistry
  • 08--Earth Sciences and Oceanography

9
Sample Narrow Subject Groups
  • Aviation Technology
  • 01 Aerodynamics
  • 02 Military Aircraft Operations
  • 03 Aircraft
  • 0301 Helicopters
  • 0302 Bombers
  • 0303 Attack and Fighter Aircraft
  • 0304 Patrol and Reconnaissance Aircraft
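Assuming the sample codes concatenate two-digit levels (e.g. "0301" = group 03, subgroup 01 within a field), a code could be split like this. The convention is an inference from the list above, not stated in the slides:

```python
def split_group_code(code):
    """Split a taxonomy code like '0301' into (group, subgroup), assuming
    two concatenated two-digit levels; a bare two-digit code has no subgroup."""
    assert code.isdigit() and len(code) in (2, 4)
    return (code[:2], code[2:]) if len(code) == 4 else (code, None)

print(split_group_code("0301"))  # → ('03', '01'), i.e. Helicopters under Aircraft
```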

10
Distribution among Categories
11
(No Transcript)
12
Baseline
  • Establish a baseline for conventional techniques
  • classification: training an SVM for each subject
    area
  • off-the-shelf document modelling and SVM
    libraries

13
Why SVM?
  • Prior studies have suggested good results with
    SVM
  • relatively immune to overfitting (fitting to
    coincidental relations encountered during
    training)
  • low dimensionality of model parameters

14
Machine Learning: Support Vector Machines
  • Binary classifier
  • Finds the hyperplane with the largest margin
    separating the two classes of training samples
  • Subsequently classifies items based on which side
    of the hyperplane they fall
(figure: separating hyperplane with maximum margin between the two classes)
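A linear SVM's decision rule and margin can be sketched directly from the definition above. This is a toy illustration of the geometry, not the LibSVM implementation used in the study:

```python
import math

def svm_classify(w, b, x):
    """Decide which side of the hyperplane w.x + b = 0 the point x falls on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def margin(w, b, x):
    """Geometric distance from x to the hyperplane; SVM training chooses the
    hyperplane that maximizes the smallest such distance over the samples."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Toy hyperplane x0 + x1 = 1, i.e. w = (1, 1), b = -1
assert svm_classify((1, 1), -1, (2, 2)) == 1   # falls on the positive side
assert svm_classify((1, 1), -1, (0, 0)) == -1  # falls on the negative side
```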
15
SVM Evaluation
16
Baseline SVM Evaluation
  • Training / testing process repeated for multiple
    subject categories
  • Determine accuracy
  • overall
  • positive (ability to recognize new documents that
    belong in the class the SVM was trained for)
  • negative (ability to reject new documents that
    belong to other classes)
  • Explore Training Issues
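The three accuracy measures above can be computed from per-document (predicted, actual) class-membership pairs. A minimal sketch; the example counts in the test are illustrative, not the study's results:

```python
def accuracies(predictions):
    """predictions: list of (predicted_in_class, actually_in_class) booleans.
    Returns (overall, positive, negative) accuracy."""
    pos = [p for p, a in predictions if a]          # documents truly in the class
    neg = [p for p, a in predictions if not a]      # documents truly outside it
    overall = sum(p == a for p, a in predictions) / len(predictions)
    positive = sum(pos) / len(pos)                  # recognized in-class documents
    negative = sum(not p for p in neg) / len(neg)   # rejected out-of-class documents
    return overall, positive, negative
```

Reporting positive and negative accuracy separately matters here because overall accuracy can look good even when a classifier rarely recognizes the (typically rarer) in-class documents.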

17
SVM Out of the Box
  • 16 broad categories with 150 or more documents
  • Lucene library for model preparation
  • LibSVM for SVM training and testing
  • no normalization or parameter tuning
  • Training set of 100/100 (positive/negative
    samples)
  • Test set of 50/50

18
(No Transcript)
19
OOtB Interpretation
  • Reasonable performance on broad categories, given
    the modest training set size.
  • A related experiment showed that with normalization
    and optimized parameter selection, accuracy could
    be improved by as much as an additional 10%

20
Training Set Size
21
Training Set Size
  • accuracy plateaus for training set sizes well
    under the number of terms in the document model

22
Training Issues
  • Training Set Size
  • Concern: detailed subject groups may have too few
    known examples to perform effective SVM training
    in that subject
  • Possible solution: the collection may have few
    positive examples, but it has many, many negative
    examples
  • Positive/Negative Training Mixes
  • effects on accuracy
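The composition experiment above might be set up as follows. The pool and mix sizes echo the slides, but the helper and names are hypothetical:

```python
import random

def training_mix(positives, negatives, n_pos=50, n_neg=50, seed=0):
    """Sample a training set with a fixed number of positive examples and a
    chosen number of negative ones, to study composition effects."""
    rng = random.Random(seed)
    return rng.sample(positives, n_pos), rng.sample(negatives, n_neg)

pos_pool = [f"in-class-{i}" for i in range(150)]        # scarce positives
neg_pool = [f"out-of-class-{i}" for i in range(9850)]   # plentiful negatives
for n_neg in (50, 100, 200, 400):
    tr_pos, tr_neg = training_mix(pos_pool, neg_pool, n_neg=n_neg)
    # an SVM would be trained on tr_pos + tr_neg here and the three
    # accuracies recorded for each mix
```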

23
Increased Negative Training
24
Training Set Composition
  • experiment performed with 50 positive training
    examples
  • OotB SVM training
  • increasing the number of negative training
    examples has little effect on overall accuracy
  • but positive accuracy is reduced

25
Interpretation
  • may indicate a weakness in SVM
  • or simply further evidence of the importance of
    optimizing SVM parameters
  • may indicate the unsuitability of treating SVM
    output as a simple Boolean decision
  • might do better as a best fit in a multi-label
    classifier
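The best-fit idea above can be sketched as picking the class whose SVM decision value (signed distance from its hyperplane) is largest, instead of thresholding each binary SVM at zero. The decision functions here are toy stand-ins, not trained models:

```python
def best_fit_label(doc, decision_functions):
    """Assign the class with the largest decision value for the document,
    even if every per-class Boolean decision would be 'reject'."""
    return max(decision_functions, key=lambda label: decision_functions[label](doc))

# Both classes would reject this document under a zero threshold
# (negative scores), but one is clearly the better fit.
scores = {"aviation": lambda d: -0.2, "chemistry": lambda d: -1.7}
print(best_fit_label("doc", scores))  # → aviation
```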