1
Document Classification
  • Chun Fai Cheung
  • December 8, 2006
  • Reasoning About Uncertainty

2
Outline
  • Motivation
  • Document Classification
  • Naïve Bayes Classifier
  • Support Vector Machine

3
Motivation
  • Sort data
  • Search data
  • Spam filtering

4
Document Classification
  • Given a collection of words, determine the best-fit category for that collection of words.
  • Example from the paper
  • Classify Usenet posts using 1000 hand-labeled training articles with 50% accuracy

5
Naïve Bayes Probability Model
  • Assumes that data is generated using the
    following model

Annotations from the model equation on this slide (the equation itself was a graphic; a sketch follows below):
  • Number of mixture components, which equals the number of classes
  • Document i
  • Probability distribution defined by a set of parameters (the classifier)
  • Probability of this type of document given only that it is of that class
  • Mixture component, also known as a class if we assume a one-to-one correspondence
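A sketch of the standard mixture-model likelihood these annotations describe (assumed notation: d_i is document i, c_j a mixture component/class, \theta the parameters, |C| the number of classes):

  P(d_i \mid \theta) \;=\; \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)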
6
Naïve Bayes Probability Model
  • Assumption that there is a one-to-one
    correspondence between mixture model components
    and classes
  • Naïve Bayes classifier needs to estimate the
    parameters of the generative model

7
Naïve Bayes Probability Model
  • If we do not assume Naïve Bayes, we get this mess

Annotations from the equation graphic on this slide (a sketch follows below):
  • Probability of this word occurring in this position in this class
  • Probability of this document length (number of words)
  • Probability of this exact order of words occurring given a certain mixture component, or class
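A sketch of the full document likelihood, without the independence assumption, that these annotations describe (assumed notation: w_{d_i,k} is the word in position k of document d_i, |d_i| the document length):

  P(d_i \mid c_j; \theta) \;=\; P(|d_i|)\, \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta,\, w_{d_i,q}, q < k)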
8
Naïve Bayes Probability Model
  • Assume word independence; this is not true in practice, but it simplifies the model

Annotations from the equation graphic on this slide (a sketch follows below):
  • Probability of this document length
  • Probability of this word occurring in this class
  • The length term is assumed identically distributed for all classes, so it is no longer a parameter
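A sketch of the likelihood under the word-independence (Naïve Bayes) assumption, same assumed notation:

  P(d_i \mid c_j; \theta) \;=\; P(|d_i|)\, \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta)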
9
Naïve Bayes Probability Model
  • Only the word probabilities matter under the Naïve Bayes assumption

Annotations from the equation graphic on this slide (a sketch follows below):
  • Probability of this word occurring in class j
  • t is a number from 1, ..., |V|; V, the vocabulary, is a vector containing all known words
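A sketch of the parameters these annotations describe, the per-class word probabilities (assumed notation: w_t is the t-th vocabulary word):

  \theta_{w_t \mid c_j} \;=\; P(w_t \mid c_j; \theta), \qquad t = 1, \ldots, |V|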
10
Naïve Bayes Classifier
  • Now that we know what is important, we can train the classifier
  • Create labeled (class) data, a.k.a. training data; count the number of occurrences of each word and associate those counts with the provided class, as sketched below.

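The slides do not include code; the following is a minimal illustrative Python sketch of the counting step just described, assuming a bag-of-words representation and add-one (Laplace) smoothing as in the training equation on the next slide. All names are hypothetical, not from the presentation.

from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: iterable of (words, label) pairs, where words is a list of tokens."""
    word_counts = defaultdict(Counter)   # label -> Counter of word occurrences
    class_counts = Counter()             # label -> number of training documents
    vocab = set()
    for words, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)

    vocab_size = len(vocab)
    total_docs = sum(class_counts.values())
    word_probs, priors = {}, {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        # P(w | c): add-one smoothed relative frequency of each word in this class
        word_probs[label] = {w: (1 + word_counts[label][w]) / (vocab_size + total_words)
                             for w in vocab}
        # P(c): fraction of training documents carrying this label
        priors[label] = class_counts[label] / total_docs
    return word_probs, priors

# Example usage with two tiny hand-labeled documents:
docs = [(["cheap", "pills", "buy"], "spam"), (["meeting", "agenda"], "ham")]
word_probs, priors = train_naive_bayes(docs)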
11
Naïve Bayes Classifier
Training equation (the formula itself was a graphic; a sketch follows below); its annotations:
  • Number of times word w_s occurs in document d_i
  • Number of words in our vocabulary
  • The probability of a word, given only the class, equals the number of times that word occurs in the training documents of that class, divided by the total number of word occurrences in that class.
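A sketch of the smoothed (Laplace) estimate these annotations describe (assumed notation: N(w_t, d_i) is the count of word w_t in document d_i, D the set of training documents, and P(c_j | d_i) is 1 or 0 according to the document's label):

  \hat\theta_{w_t \mid c_j} \;=\; \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i)\, P(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i)\, P(c_j \mid d_i)}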
12
Naïve Bayes Classifier
Annotations from the class-prior equation on this slide (a sketch follows below):
  • Probability of a class given a document
  • Summation over all documents
  • Class prior probabilities
  • Number of classes
  • Number of documents
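A sketch of the smoothed class-prior estimate these annotations describe (assumed notation as above, with |C| classes and |D| documents):

  \hat P(c_j \mid \theta) \;=\; \frac{1 + \sum_{i=1}^{|D|} P(c_j \mid d_i)}{|C| + |D|}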
13
Naïve Bayes Classifier
The probability of a class given a document: an application of Bayes' rule (a sketch follows below).
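A sketch of the Bayes-rule classification equation, under the assumptions above:

  P(c_j \mid d_i; \hat\theta) \;=\; \frac{P(c_j \mid \hat\theta)\, \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \hat\theta)}{\sum_{r=1}^{|C|} P(c_r \mid \hat\theta)\, \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_r; \hat\theta)}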
14
Crude explanation
  • If we have labeled documents, we can count the occurrences of words in each type of document. When we later encounter an unlabeled document with high counts of those same words, we can classify the unlabeled document as belonging to the same class.

15
Naïve Bayes Classifier
  • Use the classifier to determine the likelihood of a document belonging to each class the classifier understands, and choose the class with the highest likelihood, as shown below.

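A sketch of the resulting decision rule (assumed notation as above):

  c(d_i) \;=\; \arg\max_{j} P(c_j \mid d_i; \hat\theta)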
16
Unlabeled Data
  • The paper presents the use of unlabeled data.
  • Unlabeled data does not directly help us classify, since it provides no labels to train the classifier with.
  • However, unlabeled data can still provide useful information about the joint probability distribution of the words.

17
Expectation Maximization
  • Use the Bayes classifier to classify unlabeled documents by assigning a probabilistically weighted class label
  • Treat these weighted class labels as actual labels and train the classifier on them, along with the original labeled data, using the same training equation listed before
  • Continue until the probabilities output by the classifier no longer change

18
Expectation Maximization Algorithm
Annotations from the algorithm figure on this slide (the two steps are sketched below):
  • Use the training equation
  • Use the classification equation
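A sketch of the two EM steps these annotations refer to, under the assumptions above (unlabeled documents receive probabilistic labels; labeled documents keep their fixed labels):

  E step: use the classification equation to compute P(c_j \mid d_i; \hat\theta) for every unlabeled document d_i.

  M step: re-estimate \hat\theta_{w_t \mid c_j} and \hat P(c_j \mid \theta) with the training equations above, using these probabilities in place of P(c_j \mid d_i).

  Repeat until \hat\theta stops changing.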
19
Expectation Maximization
Annotations from the equation graphic on this slide (one plausible reconstruction follows below):
  • Matrix of all the class labels; we don't actually have these, and this is what we're trying to estimate
  • Probability of a document given its class
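One plausible reconstruction of the missing formula, under the assumptions above, is the complete-data log-likelihood that EM works with, where z_{ij} is the unknown indicator that document d_i belongs to class c_j:

  l_c(\theta \mid D; z) \;=\; \sum_{i=1}^{|D|} \sum_{j=1}^{|C|} z_{ij} \log\!\left[ P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \right]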
20
Discussion of EM
  • Helps a lot when the labeled training set is small and the amount of unlabeled data is large
  • Can potentially hurt the performance of the classifier when there is a lot of labeled training data and a lot of unlabeled data

21
Augmented EM
  • Since a large amount of labeled training data combined with a large amount of unlabeled data can hurt accuracy, a new strategy was devised.
  • Use the labeled data to initialize the EM procedure.
  • Use weights to determine whether or not it is appropriate to use the unlabeled data

22
Results
23
Results
24
Results Interpretation
  • Accuracy is not the most important metric
  • Recall and Precision break-even point

25
Conclusion
  • Naïve Bayes classifiers are simple and accurate, but for most practical applications they require large training sets, which may be expensive to obtain
  • When only small training sets are available, they can be augmented with unlabeled data using EM
  • Supervised learning combined with unsupervised learning

26
Support Vector Machines
  • Superficial treatment of this subject
  • Classification algorithm based on statistics
  • Heavy use of statistical formulas and mathematics

27
Classification
  • Can be viewed as a balancing act between the classifier's accuracy in learning the training set and its capacity to generalize to unknown data sets.

28
Background
  • We have l observations
  • Each observation consists of a pair: a vector x_i and a value y_i that represents some truth value related to the vector x_i
  • Assume that there is some probability distribution that relates vector x and truth value y: P(x, y), a cumulative probability distribution
  • A machine is made to learn the mapping from vector x to y

29
Background
  • f(x, a) is the function that returns a value y for an input vector x and adjustable parameters a.
  • a can be viewed as the training weights
  • y and f(x, a) each take one of two values, 1 and -1, each representing a specific category (linearly separable case)

30
Risk
  • The actual error of the trained machine can be calculated by the expected risk, sketched below

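A sketch of the expected (actual) risk in standard notation (assumed: P(x, y) is the distribution above, f(x, a) the machine's output):

  R(a) \;=\; \int \tfrac{1}{2}\, \lvert y - f(x, a) \rvert \; dP(x, y)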
31
Empirical Risk
  • We only have a finite number of samples, so only the empirical risk can be measured (sketched below)
  • l is the number of samples

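A sketch of the empirical risk measured on the l samples (assumed notation as above):

  R_{\mathrm{emp}}(a) \;=\; \frac{1}{2l} \sum_{i=1}^{l} \lvert y_i - f(x_i, a) \rvert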
32
Risk Bound
Annotations from the bound on this slide (a sketch follows below):
  • Some number between 0 and 1
  • h is the VC (Vapnik-Chervonenkis) dimension, a measure of capacity: how well can this classifier generalize?
  • VC confidence
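A sketch of the standard VC risk bound these annotations describe (assumed notation: \eta is the "number between 0 and 1", so the bound holds with probability 1 - \eta; the square-root term is the VC confidence):

  R(a) \;\le\; R_{\mathrm{emp}}(a) + \sqrt{ \frac{ h\left( \log(2l/h) + 1 \right) - \log(\eta / 4) }{ l } }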
33
VC Dimension
34
Linear SVMs Separable Case
  • In this case there exists a separating hyperplane
  • The hyperplane, and the distances from it to the nearest points, are sketched below

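The formulas were slide graphics; a sketch of the standard separable-case setup (assumed notation: w is the weight/normal vector, b the offset, ||w|| its norm):

  Separating hyperplane: \; \mathbf{w} \cdot \mathbf{x} + b = 0

  Distance from the hyperplane to the nearest points on either side (with the canonical scaling): \; \frac{1}{\lVert \mathbf{w} \rVert}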
Slides borrowed from Amit David's ML Seminar
35
Linear SVMs Separable Case 2
  • The two parallel hyperplanes through the nearest points, and the resulting separation, are sketched below
  • The linear SVM seeks the maximum separation by minimizing the quantity sketched below
  • The points that determine the hyperplane are the support vectors

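A sketch of the standard formulation these bullets refer to (assumed notation as above, with training pairs (x_i, y_i)):

  Parallel hyperplanes: \; \mathbf{w} \cdot \mathbf{x} + b = +1 \quad \text{and} \quad \mathbf{w} \cdot \mathbf{x} + b = -1, \; so the separation between them is \; \frac{2}{\lVert \mathbf{w} \rVert}

  Maximum separation is obtained by minimizing \; \tfrac{1}{2} \lVert \mathbf{w} \rVert^{2} \; subject to \; y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 \; for all i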
Slides borrowed from Amit David's ML Seminar
36
Linear SVMs Separable Case 3
  • Classifying is done using the generalized form of the decision function, sketched below

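A sketch of the decision function in its general form (assumed notation as above):

  f(\mathbf{x}) \;=\; \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b)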
Slides borrowed from Amit David's ML Seminar
37
Results