Title: Document Classification
1 Document Classification
- Chun Fai Cheung
- December 8, 2006
- Reasoning About Uncertainty
2 Outline
- Motivation
- Document Classification
- Naïve Bayes Classifier
- Support Vector Machine
3 Motivation
- Sort data
- Search data
- Spam filtering
4 Document Classification
- Given a collection of words, determine the best-fit category for that collection of words.
- Example from the paper: classify Usenet posts using 1000 hand-labeled training articles, reaching roughly 50% accuracy.
5 Naïve Bayes Probability Model
- Assumes that the data is generated by the following model; the labels below annotate its terms
Number of mixture components, which equals the number of classes
Document i
Probability distribution defined by a set of parameters, i.e., the classifier
Probability of this type of document given only that it belongs to that class
Mixture component, also known as a class if we assume a one-to-one correspondence
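As a sketch of the model these labels describe, the standard mixture-model likelihood can be written as follows (notation assumed: d_i for document i, c_j for mixture component/class j, θ for the parameters):

```latex
% Likelihood of document d_i: a sum over classes of the class prior
% times the class-conditional probability of the document.
P(d_i \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)
```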
6 Naïve Bayes Probability Model
- Assumes a one-to-one correspondence between mixture-model components and classes
- The Naïve Bayes classifier needs to estimate the parameters of this generative model
7 Naïve Bayes Probability Model
- If we do not assume Naïve Bayes, we get this mess; the labels below annotate its terms
Probability of this word occurring in this position in this class
Probability of this document length (number of words)
Probability of this exact order of words occurring given a certain mixture component, or class
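A sketch of the full expansion these labels suggest, before any independence assumption; w_{d_i,k} denotes the word in position k of document d_i (notation assumed):

```latex
% Each word is conditioned on the class, the document length, and all preceding words.
P(d_i \mid c_j; \theta) =
  P(|d_i| \mid c_j)\prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta,\, |d_i|,\, w_{d_i,q}, q < k)
```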
8 Naïve Bayes Probability Model
- Assume word independence; not true, but it simplifies the model
Probability of this document length
Probability of this word occurring in this class
Assumed identically distributed for all classes, so it is no longer a parameter
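Under the word-independence assumption these labels describe, the document likelihood factors as (a sketch, same notation as above):

```latex
% Document length is assumed identically distributed across classes,
% so only the per-class word probabilities remain as parameters.
P(d_i \mid c_j; \theta) = P(|d_i|) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta)
```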
9 Naïve Bayes Probability Model
- Only the word probabilities matter under the Naïve Bayes assumption
Probability of this word occurring in class j
t is a number from 1, ..., |V|; V, the vocabulary, is a vector containing all known words
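A sketch of the parameter these labels describe, assuming V is the vocabulary:

```latex
% One parameter per (word, class) pair: the probability of word w_t in class c_j.
\theta_{w_t \mid c_j} = P(w_t \mid c_j; \theta), \qquad t \in \{1, \ldots, |V|\}
```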
10 Naïve Bayes Classifier
- Now that we know what is important, we can train the classifier
- Create labeled (class) data, a.k.a. training data; count the number of occurrences of each word and associate those counts with the provided class
11Naïve Bayes Classifier
Number of times ws occurs in document di
Training equation
Number of words in our vocabulary
Probability of a word given all we know is the
class equals the number of times word
occurs in the training data divided by the number
of occurrences of that word in that class.
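A sketch of the (Laplace-smoothed) training equation these labels describe, where N(w_t, d_i) is the count of word w_t in document d_i and P(c_j | d_i) is 1 or 0 for labeled documents:

```latex
% Smoothed word probability: count of w_t in class c_j's documents over the
% total word count in that class, each padded by the Laplace prior.
\hat{\theta}_{w_t \mid c_j} =
  \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i)\, P(c_j \mid d_i)}
       {|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i)\, P(c_j \mid d_i)}
```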
12 Naïve Bayes Classifier
Probability of a class given a document
Summation over all documents
Class prior probabilities
Number of classes
Number of documents
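A sketch of the class-prior estimate these labels describe (D is the set of documents, C the set of classes; notation assumed):

```latex
% Class prior: smoothed fraction of documents assigned to class c_j.
\hat{\theta}_{c_j} = P(c_j \mid \hat{\theta}) =
  \frac{1 + \sum_{i=1}^{|D|} P(c_j \mid d_i)}{|C| + |D|}
```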
13 Naïve Bayes Classifier
The probability of a class given a document
An application of Bayes' rule
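A sketch of that application of Bayes' rule, in the same notation:

```latex
% Posterior of class c_j for document d_i: prior times word likelihoods,
% normalized over all classes.
P(c_j \mid d_i; \hat{\theta}) =
  \frac{P(c_j \mid \hat{\theta}) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \hat{\theta})}
       {\sum_{r=1}^{|C|} P(c_r \mid \hat{\theta}) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_r; \hat{\theta})}
```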
14 Crude explanation
- If we have labeled documents, we can count how often each word occurs in documents of each class. When we later encounter an unlabeled document with high counts of those same words, we can classify it as belonging to that class.
15 Naïve Bayes Classifier
- Use the classifier to determine the likelihood of a document belonging to each class the classifier knows about, and choose the class with the highest likelihood, as sketched below.
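A minimal Python sketch of this train-then-pick-the-most-likely-class procedure, using Laplace smoothing; the function names (train_nb, classify) and the toy data are illustrative only, not from the paper:

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (list_of_words, class_label) pairs."""
    vocab = set()
    word_counts = defaultdict(Counter)   # class -> word -> count
    class_counts = Counter()             # class -> number of training documents
    for words, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return vocab, word_counts, class_counts

def classify(words, vocab, word_counts, class_counts):
    """Return the class with the highest posterior log-probability."""
    total_docs = sum(class_counts.values())
    best_class, best_logp = None, float("-inf")
    for c in class_counts:
        total_words = sum(word_counts[c].values())
        # log prior (smoothed) plus the sum of Laplace-smoothed log word likelihoods
        logp = math.log((1 + class_counts[c]) / (len(class_counts) + total_docs))
        for w in words:
            logp += math.log((1 + word_counts[c][w]) / (len(vocab) + total_words))
        if logp > best_logp:
            best_class, best_logp = c, logp
    return best_class

# Tiny illustrative training set (hypothetical data).
docs = [("buy cheap pills now".split(), "spam"),
        ("meeting agenda attached".split(), "ham")]
model = train_nb(docs)
print(classify("cheap pills".split(), *model))   # -> "spam"
```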
16 Unlabeled Data
- The paper presents the use of unlabeled data.
- Unlabeled data does not directly help us classify, since there are no labels to train the classifier with.
- However, unlabeled data can still provide useful information about the joint probability distribution over words.
17 Expectation Maximization
- Use the Bayes classifier to classify unlabeled documents, assigning each a probabilistically weighted class label
- Treat these weighted class labels as actual labels and retrain the classifier on them, together with the original labeled data, using the same training equation listed before
- Continue until the probabilities output by the classifier no longer change
18 Expectation Maximization Algorithm
M-step: use the training equation
E-step: use the classification equation
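A minimal, illustrative Python sketch of this loop, assuming bag-of-words documents; m_step plays the role of the training equation and e_step of the classification equation, and every name here is hypothetical rather than taken from the paper:

```python
import math
from collections import Counter

def m_step(docs, weights, classes, vocab):
    """Training equation: Laplace-smoothed parameters from weighted labels.
    weights[i][c] is the (possibly soft) probability that document i is in class c."""
    word_p, class_p = {}, {}
    for c in classes:
        counts = Counter()
        for words, w in zip(docs, weights):
            for word in words:
                counts[word] += w[c]
        total = sum(counts.values())
        word_p[c] = {t: (1 + counts[t]) / (len(vocab) + total) for t in vocab}
        class_p[c] = (1 + sum(w[c] for w in weights)) / (len(classes) + len(docs))
    return word_p, class_p

def e_step(words, word_p, class_p):
    """Classification equation: posterior P(class | document) by Bayes' rule."""
    logp = {c: math.log(class_p[c])
               + sum(math.log(word_p[c][w]) for w in words if w in word_p[c])
            for c in class_p}
    m = max(logp.values())
    z = sum(math.exp(v - m) for v in logp.values())
    return {c: math.exp(v - m) / z for c, v in logp.items()}

def em(labeled, unlabeled, classes, iters=10):
    """labeled: list of (word_list, class); unlabeled: list of word_list."""
    vocab = {w for d, _ in labeled for w in d} | {w for d in unlabeled for w in d}
    docs = [d for d, _ in labeled] + list(unlabeled)
    hard = [{c: float(c == y) for c in classes} for _, y in labeled]
    soft = [{c: 1.0 / len(classes) for c in classes} for _ in unlabeled]
    for _ in range(iters):
        word_p, class_p = m_step(docs, hard + soft, classes, vocab)  # re-train
        soft = [e_step(d, word_p, class_p) for d in unlabeled]       # re-label
    return word_p, class_p
```

Labeled documents keep their hard labels throughout; only the unlabeled documents receive probabilistically weighted labels, which are re-estimated on every iteration.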
19 Expectation Maximization
Matrix of all the class labels; we don't actually have these, this is what we're trying to estimate
Probability of a document given its class
20 Discussion of EM
- Helps a lot when the training set is small and the unlabeled data is large
- Can potentially hurt the performance of the classifier when there is both a lot of training data and a lot of unlabeled data
21 Augmented EM
- Since a large amount of training data combined with a large amount of unlabeled data hurt accuracy, a new strategy was devised.
- Use the labeled data to initialize the EM procedure.
- Use weights to determine whether, and how much, it is appropriate to use the unlabeled data.
22 Results
23 Results
24 Results Interpretation
- Accuracy is not the most important metric
- Recall/precision break-even point
25 Conclusion
- Naïve Bayes classifiers are simple and accurate, but for most practical applications they require large training sets, which may be expensive to obtain
- When only small training sets are available, they can be augmented using EM
- Supervised learning combined with unsupervised learning
26 Support Vector Machines
- Only a superficial treatment of this subject here
- A classification algorithm based on statistics
- Heavy use of statistical formulas and mathematics
27 Classification
- Can be viewed as a balancing act between the classifier's accuracy on the training set and its capacity to generalize to unknown data sets.
28 Background
- We have l observations
- Each observation consists of a pair: a vector x_i and a value y_i that represents some truth value associated with x_i
- Assume that there is some probability distribution P(x, y) relating the vector x and the truth value y (a cumulative probability distribution)
- A machine is made to learn the mapping from vector x to y
29 Background
- f(x, a) is the function that returns a value y for an input vector x and adjustable parameters a
- a can be viewed as the training weights
- y and f(x, a) each take one of two values, +1 and -1, each representing a specific category (the linearly separable case)
30 Risk
- The actual error of the trained machine can be calculated by the expected risk shown below
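A sketch of the standard expected-risk definition this bullet refers to (a denotes the adjustable parameters):

```latex
% Expected (actual) risk: average error over the unknown distribution P(x, y).
R(a) = \int \tfrac{1}{2}\,\lvert y - f(x, a)\rvert \, dP(x, y)
```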
31 Empirical Risk
- We only have a finite number of samples, so only the empirical risk below can be measured
- l is the number of samples
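A sketch of the corresponding empirical risk over the l samples:

```latex
% Empirical risk: measured error averaged over the finite sample.
R_{\mathrm{emp}}(a) = \frac{1}{2l} \sum_{i=1}^{l} \lvert y_i - f(x_i, a)\rvert
```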
32 Risk Bound
Some number between 0 and 1 (the confidence parameter)
h is the VC (Vapnik-Chervonenkis) dimension, a measure of capacity: how well can this classifier generalize?
VC confidence
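A sketch of the standard VC risk bound these labels describe, holding with probability 1 - η:

```latex
% Actual risk <= empirical risk + VC confidence; h is the VC dimension,
% l the number of samples, eta the confidence parameter in (0, 1).
R(a) \le R_{\mathrm{emp}}(a) +
  \sqrt{\frac{h\left(\log(2l/h) + 1\right) - \log(\eta/4)}{l}}
```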
33 VC Dimension
34 Linear SVMs Separable Case
- In this case there exists a separating hyperplane
- The hyperplane, and the distances from it to the nearest points, are given below
Slides borrowed from Amit David's ML Seminar
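A sketch of the separating hyperplane and, in canonical form, the distance from it to the nearest points on either side (w is the normal vector, b the offset; notation assumed):

```latex
% Separating hyperplane and the equal distances to the closest points of each class.
w \cdot x + b = 0, \qquad d_{+} = d_{-} = \frac{1}{\lVert w \rVert}
```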
35 Linear SVMs Separable Case 2
- The parallel hyperplanes, and the margin they imply, are given below
- The linear SVM seeks the maximum separation by minimizing the objective below
- The points that determine the hyperplane are the support vectors
Slides borrowed from Amit David's ML Seminar
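A sketch of the two parallel margin hyperplanes, the resulting margin, and the optimization problem the linear SVM solves in the separable case:

```latex
% Margin hyperplanes and margin width.
w \cdot x + b = +1, \qquad w \cdot x + b = -1
  \quad\Longrightarrow\quad \text{margin} = \frac{2}{\lVert w \rVert}

% Maximize the margin by minimizing ||w||^2 / 2, subject to every point
% lying on the correct side of its margin hyperplane.
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
  \quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1, \quad i = 1, \ldots, l
```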
36 Linear SVMs Separable Case 3
- Classifying uses the generalized form of the decision function, shown below
Slides borrowed from Amit David's ML Seminar
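A sketch of the generalized decision function, where the α_i are the multipliers that are nonzero only for the support vectors (notation assumed):

```latex
% Classify a new point by the sign of the decision function, which can be
% written entirely in terms of dot products with the support vectors.
f(x) = \operatorname{sign}(w \cdot x + b)
     = \operatorname{sign}\!\Big( \sum_{i} \alpha_i\, y_i\, (x_i \cdot x) + b \Big)
```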
37 Results