Title: Document Classification
1 Document Classification
- Chun Fai Cheung
- December 8, 2006
- Reasoning About Uncertainty
2 Outline
- Motivation
- Document Classification
- Naïve Bayes Classifier
- Support Vector Machine
3 Motivation
- Sort data
- Search data
- Spam filtering
4 Document Classification
- Given a collection of words, determine the best-fit category for that collection of words.
- Example from the paper: classify Usenet posts using 1000 hand-labeled training articles, reaching roughly 50% accuracy.
5 Naïve Bayes Probability Model
- Assumes that the data is generated by the following model; the labels below annotate its terms
Number of mixture components, which equals the number of classes
Document i
Probability distribution defined by a set of parameters, i.e., the classifier
Probability of this type of document given only that it belongs to that class
Mixture component, also known as a class if we assume a one-to-one correspondence
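As a sketch of the model these labels describe, the standard mixture-model likelihood can be written as follows (notation assumed: d_i for document i, c_j for mixture component/class j, θ for the parameters):

```latex
% Likelihood of document d_i: a sum over classes of the class prior
% times the class-conditional probability of the document.
P(d_i \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)
```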
6 Naïve Bayes Probability Model
- Assumes a one-to-one correspondence between mixture-model components and classes
- The Naïve Bayes classifier needs to estimate the parameters of this generative model
7 Naïve Bayes Probability Model
- If we do not assume Naïve Bayes, we get this mess; the labels below annotate its terms
Probability of this word occurring in this position in this class
Probability of this document length (number of words)
Probability of this exact order of words occurring given a certain mixture component, or class
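A sketch of the full expansion these labels suggest, before any independence assumption; w_{d_i,k} denotes the word in position k of document d_i (notation assumed):

```latex
% Each word is conditioned on the class, the document length, and all preceding words.
P(d_i \mid c_j; \theta) =
  P(|d_i| \mid c_j)\prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta,\, |d_i|,\, w_{d_i,q}, q < k)
```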
8 Naïve Bayes Probability Model
- Assume word independence; not true, but it simplifies the model
Probability of this document length
Probability of this word occurring in this class
Assumed identically distributed for all classes, so it is no longer a parameter
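Under the word-independence assumption these labels describe, the document likelihood factors as (a sketch, same notation as above):

```latex
% Document length is assumed identically distributed across classes,
% so only the per-class word probabilities remain as parameters.
P(d_i \mid c_j; \theta) = P(|d_i|) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta)
```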
9 Naïve Bayes Probability Model
- Only the word probabilities matter under the Naïve Bayes assumption
Probability of this word occurring in class j
t is a number from 1, ..., |V|; V, the vocabulary, is a vector containing all known words
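A sketch of the parameter these labels describe, assuming V is the vocabulary:

```latex
% One parameter per (word, class) pair: the probability of word w_t in class c_j.
\theta_{w_t \mid c_j} = P(w_t \mid c_j; \theta), \qquad t \in \{1, \ldots, |V|\}
```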
10 Naïve Bayes Classifier
- Now that we know what is important, we can train the classifier
- Create labeled (class) data, a.k.a. training data; count the number of occurrences of each word and associate those counts with the provided class
11Naïve Bayes Classifier
Number of times ws occurs in document di
Training equation
Number of words in our vocabulary
Probability of a word given all we know is the
class equals the number of times word
occurs in the training data divided by the number
of occurrences of that word in that class.
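A sketch of the (Laplace-smoothed) training equation these labels describe, where N(w_t, d_i) is the count of word w_t in document d_i and P(c_j | d_i) is 1 or 0 for labeled documents:

```latex
% Smoothed word probability: count of w_t in class c_j's documents over the
% total word count in that class, each padded by the Laplace prior.
\hat{\theta}_{w_t \mid c_j} =
  \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i)\, P(c_j \mid d_i)}
       {|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i)\, P(c_j \mid d_i)}
```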
12 Naïve Bayes Classifier
Probability of a class given a document
Summation over all documents
Class prior probabilities
Number of classes
Number of documents
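A sketch of the class-prior estimate these labels describe (D is the set of documents, C the set of classes; notation assumed):

```latex
% Class prior: smoothed fraction of documents assigned to class c_j.
\hat{\theta}_{c_j} = P(c_j \mid \hat{\theta}) =
  \frac{1 + \sum_{i=1}^{|D|} P(c_j \mid d_i)}{|C| + |D|}
```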
13 Naïve Bayes Classifier
The probability of a class given a document
An application of Bayes' rule
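A sketch of that application of Bayes' rule, in the same notation:

```latex
% Posterior of class c_j for document d_i: prior times word likelihoods,
% normalized over all classes.
P(c_j \mid d_i; \hat{\theta}) =
  \frac{P(c_j \mid \hat{\theta}) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \hat{\theta})}
       {\sum_{r=1}^{|C|} P(c_r \mid \hat{\theta}) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_r; \hat{\theta})}
```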
14 Crude explanation
- If we have labeled documents, we can count how often each word occurs in documents of each class. When we later encounter an unlabeled document with high counts of those same words, we can classify it as belonging to that class.
15 Naïve Bayes Classifier
- Use the classifier to determine the likelihood of a document belonging to each class the classifier knows about, and choose the class with the highest likelihood, as sketched below.
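A minimal Python sketch of this train-then-pick-the-most-likely-class procedure, using Laplace smoothing; the function names (train_nb, classify) and the toy data are illustrative only, not from the paper:

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (list_of_words, class_label) pairs."""
    vocab = set()
    word_counts = defaultdict(Counter)   # class -> word -> count
    class_counts = Counter()             # class -> number of training documents
    for words, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return vocab, word_counts, class_counts

def classify(words, vocab, word_counts, class_counts):
    """Return the class with the highest posterior log-probability."""
    total_docs = sum(class_counts.values())
    best_class, best_logp = None, float("-inf")
    for c in class_counts:
        total_words = sum(word_counts[c].values())
        # log prior (smoothed) plus the sum of Laplace-smoothed log word likelihoods
        logp = math.log((1 + class_counts[c]) / (len(class_counts) + total_docs))
        for w in words:
            logp += math.log((1 + word_counts[c][w]) / (len(vocab) + total_words))
        if logp > best_logp:
            best_class, best_logp = c, logp
    return best_class

# Tiny illustrative training set (hypothetical data).
docs = [("buy cheap pills now".split(), "spam"),
        ("meeting agenda attached".split(), "ham")]
model = train_nb(docs)
print(classify("cheap pills".split(), *model))   # -> "spam"
```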
16 Unlabeled Data
- The paper presents the use of unlabeled data.
- Unlabeled data does not directly help us classify, since there are no labels to train the classifier with.
- However, unlabeled data can still provide useful information about the joint probability distribution over words.
17 Expectation Maximization
- Use the Bayes classifier to classify unlabeled documents, assigning each a probabilistically weighted class label
- Treat these weighted class labels as actual labels and retrain the classifier on them, together with the original labeled data, using the same training equation listed before
- Continue until the probabilities output by the classifier no longer change
18 Expectation Maximization Algorithm
M-step: use the training equation
E-step: use the classification equation
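A minimal, illustrative Python sketch of this loop, assuming bag-of-words documents; m_step plays the role of the training equation and e_step of the classification equation, and every name here is hypothetical rather than taken from the paper:

```python
import math
from collections import Counter

def m_step(docs, weights, classes, vocab):
    """Training equation: Laplace-smoothed parameters from weighted labels.
    weights[i][c] is the (possibly soft) probability that document i is in class c."""
    word_p, class_p = {}, {}
    for c in classes:
        counts = Counter()
        for words, w in zip(docs, weights):
            for word in words:
                counts[word] += w[c]
        total = sum(counts.values())
        word_p[c] = {t: (1 + counts[t]) / (len(vocab) + total) for t in vocab}
        class_p[c] = (1 + sum(w[c] for w in weights)) / (len(classes) + len(docs))
    return word_p, class_p

def e_step(words, word_p, class_p):
    """Classification equation: posterior P(class | document) by Bayes' rule."""
    logp = {c: math.log(class_p[c])
               + sum(math.log(word_p[c][w]) for w in words if w in word_p[c])
            for c in class_p}
    m = max(logp.values())
    z = sum(math.exp(v - m) for v in logp.values())
    return {c: math.exp(v - m) / z for c, v in logp.items()}

def em(labeled, unlabeled, classes, iters=10):
    """labeled: list of (word_list, class); unlabeled: list of word_list."""
    vocab = {w for d, _ in labeled for w in d} | {w for d in unlabeled for w in d}
    docs = [d for d, _ in labeled] + list(unlabeled)
    hard = [{c: float(c == y) for c in classes} for _, y in labeled]
    soft = [{c: 1.0 / len(classes) for c in classes} for _ in unlabeled]
    for _ in range(iters):
        word_p, class_p = m_step(docs, hard + soft, classes, vocab)  # re-train
        soft = [e_step(d, word_p, class_p) for d in unlabeled]       # re-label
    return word_p, class_p
```

Labeled documents keep their hard labels throughout; only the unlabeled documents receive probabilistically weighted labels, which are re-estimated on every iteration.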
19 Expectation Maximization
Matrix of all the class labels; we don't actually have these, this is what we're trying to estimate
Probability of a document given its class
20 Discussion of EM
- Helps a lot when the training set is small and the unlabeled data is large
- Can potentially hurt the performance of the classifier when there is both a lot of training data and a lot of unlabeled data
21 Augmented EM
- Since a large amount of training data combined with a large amount of unlabeled data hurt accuracy, a new strategy was devised.
- Use the labeled data to initialize the EM procedure.
- Use weights to determine whether, and how much, it is appropriate to use the unlabeled data.
22 Results
23 Results
24 Results Interpretation
- Accuracy is not the most important metric
- Recall/precision break-even point
25 Conclusion
- Naïve Bayes classifiers are simple and accurate, but for most practical applications they require large training sets, which may be expensive to obtain
- When only small training sets are available, they can be augmented using EM
- Supervised learning combined with unsupervised learning
26 Support Vector Machines
- Only a superficial treatment of this subject here
- A classification algorithm based on statistics
- Heavy use of statistical formulas and mathematics
27 Classification
- Can be viewed as a balancing act between the classifier's accuracy on the training set and its capacity to generalize to unknown data sets.
28 Background
- We have l observations
- Each observation consists of a pair: a vector x_i and a value y_i that represents some truth value associated with x_i
- Assume that there is some probability distribution P(x, y) relating the vector x and the truth value y (a cumulative probability distribution)
- A machine is made to learn the mapping from vector x to y
29 Background
- f(x, a) is the function that returns a value y for an input vector x and adjustable parameters a
- a can be viewed as the training weights
- y and f(x, a) each take one of two values, +1 and -1, each representing a specific category (the linearly separable case)
30 Risk
- The actual error of the trained machine can be calculated by the expected risk shown below
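A sketch of the standard expected-risk definition this bullet refers to (a denotes the adjustable parameters):

```latex
% Expected (actual) risk: average error over the unknown distribution P(x, y).
R(a) = \int \tfrac{1}{2}\,\lvert y - f(x, a)\rvert \, dP(x, y)
```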
31 Empirical Risk
- We only have a finite number of samples, so only the empirical risk below can be measured
- l is the number of samples
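A sketch of the corresponding empirical risk over the l samples:

```latex
% Empirical risk: measured error averaged over the finite sample.
R_{\mathrm{emp}}(a) = \frac{1}{2l} \sum_{i=1}^{l} \lvert y_i - f(x_i, a)\rvert
```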
32 Risk Bound
Some number between 0 and 1 (the confidence parameter)
h is the VC (Vapnik-Chervonenkis) dimension, a measure of capacity: how well can this classifier generalize?
VC confidence
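A sketch of the standard VC risk bound these labels describe, holding with probability 1 - η:

```latex
% Actual risk <= empirical risk + VC confidence; h is the VC dimension,
% l the number of samples, eta the confidence parameter in (0, 1).
R(a) \le R_{\mathrm{emp}}(a) +
  \sqrt{\frac{h\left(\log(2l/h) + 1\right) - \log(\eta/4)}{l}}
```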
33 VC Dimension
34 Linear SVMs Separable Case
- In this case there exists a separating hyperplane
- The hyperplane, and the distances from it to the nearest points, are given below
Slides borrowed from Amit David's ML Seminar
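A sketch of the separating hyperplane and, in canonical form, the distance from it to the nearest points on either side (w is the normal vector, b the offset; notation assumed):

```latex
% Separating hyperplane and the equal distances to the closest points of each class.
w \cdot x + b = 0, \qquad d_{+} = d_{-} = \frac{1}{\lVert w \rVert}
```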
35 Linear SVMs Separable Case 2
- The parallel hyperplanes, and the margin they imply, are given below
- The linear SVM seeks the maximum separation by minimizing the objective below
- The points that determine the hyperplane are the support vectors
Slides borrowed from Amit David's ML Seminar
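A sketch of the two parallel margin hyperplanes, the resulting margin, and the optimization problem the linear SVM solves in the separable case:

```latex
% Margin hyperplanes and margin width.
w \cdot x + b = +1, \qquad w \cdot x + b = -1
  \quad\Longrightarrow\quad \text{margin} = \frac{2}{\lVert w \rVert}

% Maximize the margin by minimizing ||w||^2 / 2, subject to every point
% lying on the correct side of its margin hyperplane.
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
  \quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1, \quad i = 1, \ldots, l
```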
36 Linear SVMs Separable Case 3
- Classifying uses the generalized form of the decision function, shown below
Slides borrowed from Amit David's ML Seminar
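A sketch of the generalized decision function, where the α_i are the multipliers that are nonzero only for the support vectors (notation assumed):

```latex
% Classify a new point by the sign of the decision function, which can be
% written entirely in terms of dot products with the support vectors.
f(x) = \operatorname{sign}(w \cdot x + b)
     = \operatorname{sign}\!\Big( \sum_{i} \alpha_i\, y_i\, (x_i \cdot x) + b \Big)
```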
37 Results