Provided by: raluc8

Category:

more less

Transcript and Presenter's Notes

Title: A Comparison of Event Models for Naïve Bayes Text Classification

1
A Comparison of Event Models for Naïve Bayes Text
Classification

Andrew McCallum, Kamal Nigam

2
Motivation

Text classification with two different first-order probabilistic models, both making the naïve Bayes assumption
Present the differences and details of the two models
Compare the classification performance of the two models

3
The Models 1

Model 1: Multi-variate Bernoulli event model
One feature $X_w$ for each word $w$ in the dictionary
$X_w$ is true in document $d$ if $w$ appears in $d$
Naïve Bayes assumption: given the document's topic, the appearance of one word in the document tells us nothing about the chances that another word appears
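
As a minimal sketch of the binary features described above, the following Python builds one true/false value per dictionary word. The toy dictionary, the whitespace tokenizer, and the name `bernoulli_features` are illustrative assumptions, not part of the presentation.

```python
def bernoulli_features(document, dictionary):
    """Return {w: True/False} marking whether each dictionary word appears."""
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    present = set(document.lower().split())
    return {w: (w in present) for w in dictionary}


dictionary = ["bayes", "model", "text", "word"]
features = bernoulli_features("Text text model", dictionary)
print(features)
```

Note that "text" occurring twice makes no difference here: the feature only records presence or absence.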

4
The Models 2

Model 2: Multinomial event model
One feature $X_i$ for each word position $i$ in the document
Each feature's possible values are all the words in the dictionary
The value of $X_i$ is the word in position $i$
Naïve Bayes assumption: given the document's topic, the word in one position of the document tells us nothing about the words in other positions
Second assumption: word appearance does not depend on position, i.e. $P(X_i = w \mid c) = P(X_j = w \mid c)$ for all positions $i, j$, words $w$, and classes $c$
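
A small sketch of the positional features described above, and of why the second assumption lets them collapse into word counts. The whitespace tokenizer and the function names are assumptions made for illustration.

```python
from collections import Counter


def position_features(document):
    """Return the list X_1..X_n: the word found at each position."""
    # Assumption: lowercase whitespace tokenization.
    return document.lower().split()


def word_counts(document):
    """Collapse positions into counts; enough once position is ignored."""
    return Counter(position_features(document))


x = position_features("Text text model")
counts = word_counts("Text text model")
print(x, counts)
```

Because position carries no information under the second assumption, two documents with the same words in a different order get the same counts.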

5
Probabilistic Framework for Naïve Bayes 1

A generative model underlies both kinds of naïve Bayes classifier
Assumption: text data is generated by a parametric model
Training data is used to compute Bayes-optimal estimates of the model parameters
Using these estimates, the classifier classifies new test documents
It uses Bayes' rule to compute the posterior probability that each class generated the test document
Classification: select the most probable class

6
Probabilistic Framework for Naïve Bayes 2

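Documents are generated by a mixture model parameterized by $\theta$
Mixture components $c_j \in C = \{c_1, \dots, c_{|C|}\}$ correspond to the classes
Likelihood of a document $d_i$: the sum of total probability over all mixture components
$P(d_i \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)$
Select a component according to the priors $P(c_j \mid \theta)$
The selected component generates a document according to its own parameters, with distribution $P(d_i \mid c_j; \theta)$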
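
The sum over mixture components can be sketched in a few lines of Python. The class names and probability values below are made-up numbers chosen only to show the arithmetic.

```python
def document_likelihood(priors, class_conditionals):
    """Sum over components of P(c_j) * P(d_i | c_j)."""
    return sum(priors[c] * class_conditionals[c] for c in priors)


# Hypothetical values for a single document and two classes.
priors = {"sports": 0.5, "politics": 0.5}
class_conditionals = {"sports": 0.02, "politics": 0.06}
likelihood = document_likelihood(priors, class_conditionals)
print(likelihood)
```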

7
Multi-variate Bernoulli Model 1

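Vocabulary $V$
Each dimension $t$ of the vocabulary space, $t \in \{1, \dots, |V|\}$, corresponds to word $w_t$
$B_{it}$: dimension $t$ of the vector for document $d_i$
$B_{it} \in \{0, 1\}$ indicates whether word $w_t$ appears in document $d_i$
Probability of a document given its class:
$P(d_i \mid c_j; \theta) = \prod_{t=1}^{|V|} \bigl( B_{it}\, P(w_t \mid c_j; \theta) + (1 - B_{it})\,(1 - P(w_t \mid c_j; \theta)) \bigr)$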
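
A minimal sketch of the multi-variate Bernoulli document probability, computed in log space to avoid underflow on large vocabularies. The 0/1 vector and the per-word probabilities are made-up values, and the function name is an assumption.

```python
import math


def bernoulli_log_likelihood(bits, word_probs):
    """log P(d_i | c_j), where bits[t] is 0/1 and word_probs[t] = P(w_t | c_j)."""
    total = 0.0
    for b, p in zip(bits, word_probs):
        # A present word contributes p; an absent word contributes (1 - p).
        total += math.log(p if b else 1.0 - p)
    return total


bits = [1, 0, 1]
word_probs = [0.5, 0.25, 0.8]
log_p = bernoulli_log_likelihood(bits, word_probs)
print(math.exp(log_p))
```

Absent words matter in this model: the second word lowers the probability by a factor of 0.75 even though it never occurs in the document.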

8
Multi-variate Bernoulli Model 2

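Set of labeled training documents $D = \{d_1, \dots, d_{|D|}\}$
The class prior parameters are set by the maximum likelihood estimate, where $P(c_j \mid d_i) \in \{0, 1\}$ is given by the document's class label:
$P(c_j \mid \hat{\theta}) = \frac{\sum_{i=1}^{|D|} P(c_j \mid d_i)}{|D|}$ (4)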
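
With hard class labels, the maximum likelihood estimate of the class priors reduces to the fraction of training documents in each class. A sketch, assuming each document carries exactly one label (the label names are invented):

```python
from collections import Counter


def class_priors(labels):
    """Fraction of training documents that carry each class label."""
    counts = Counter(labels)
    n = len(labels)
    return {c: counts[c] / n for c in counts}


labels = ["student", "faculty", "student", "course"]
priors = class_priors(labels)
print(priors)
```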

9
Multinomial Model 1

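$N_{it}$: number of times word $w_t$ occurs in document $d_i$
Probability of a document given its class:
$P(d_i \mid c_j; \theta) = P(|d_i|)\, |d_i|! \prod_{t=1}^{|V|} \frac{P(w_t \mid c_j; \theta)^{N_{it}}}{N_{it}!}$
Probability of word $w_t$ in class $c_j$, estimated with Laplace smoothing over the $|D|$ labeled training documents:
$P(w_t \mid c_j; \hat{\theta}_j) = \frac{1 + \sum_{i=1}^{|D|} N_{it}\, P(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N_{is}\, P(c_j \mid d_i)}$
The class prior parameters are computed as in (4)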
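
A sketch of the multinomial estimates for a single class. It assumes add-one (Laplace) smoothing and hard class labels; the toy vocabulary, the documents, and the function names are illustrative. The log-likelihood omits the length and factorial terms because they are the same for every class.

```python
import math
from collections import Counter


def estimate_word_probs(docs, vocabulary):
    """Smoothed P(w_t | c_j) from the tokenized documents of one class."""
    counts = Counter(w for doc in docs for w in doc if w in vocabulary)
    total = sum(counts.values())
    # Assumption: add-one smoothing, so unseen words keep non-zero probability.
    return {w: (1 + counts[w]) / (len(vocabulary) + total) for w in vocabulary}


def multinomial_log_likelihood(doc, word_probs):
    """Sum of N_it * log P(w_t | c_j), dropping class-independent terms."""
    counts = Counter(w for w in doc if w in word_probs)
    return sum(n * math.log(word_probs[w]) for w, n in counts.items())


vocabulary = {"ball", "game", "vote"}
sports_docs = [["ball", "game", "ball"], ["game"]]
word_probs = estimate_word_probs(sports_docs, vocabulary)
score = multinomial_log_likelihood(["ball", "ball", "vote"], word_probs)
print(word_probs, score)
```

Unlike the Bernoulli model, repeated occurrences count here ("ball" contributes twice), and words absent from the document contribute nothing.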

10
Document Classification

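Compute the posterior probability of each class given the evidence of the test document:
$P(c_j \mid d_i; \hat{\theta}) = \frac{P(c_j \mid \hat{\theta})\, P(d_i \mid c_j; \hat{\theta}_j)}{P(d_i \mid \hat{\theta})}$
Select the class with the highest posterior probability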
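
The selection step can be sketched as an argmax over log prior plus log likelihood. The class names and probabilities are hypothetical; the function name `classify` is an assumption.

```python
import math


def classify(log_priors, log_likelihoods):
    """Return the class maximizing log P(c_j) + log P(d_i | c_j), with all scores."""
    # The evidence P(d_i) is shared by all classes, so the argmax can ignore it.
    scores = {c: log_priors[c] + log_likelihoods[c] for c in log_priors}
    return max(scores, key=scores.get), scores


log_priors = {"sports": math.log(0.5), "politics": math.log(0.5)}
log_likelihoods = {"sports": math.log(0.02), "politics": math.log(0.06)}
best, scores = classify(log_priors, log_likelihoods)
print(best)
```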

11
Feature Selection

We might not want to use all words, but just the reliable, good discriminators
In the training set, choose the k words that best discriminate the categories.
One way to rank words is by their mutual information with the class
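
A sketch of ranking words by mutual information between the class and a word's presence or absence, then keeping the top k. Treating each word as a present/absent indicator is an assumption of this sketch, as are the toy documents and the function names.

```python
import math


def mutual_information(docs, labels, word):
    """I(C; W) between the class and whether `word` is present in a document."""
    n = len(docs)
    mi = 0.0
    for c in sorted(set(labels)):
        p_c = labels.count(c) / n
        for present in (True, False):
            p_f = sum(1 for d in docs if (word in d) == present) / n
            joint = sum(
                1 for d, y in zip(docs, labels) if y == c and (word in d) == present
            ) / n
            if joint > 0:
                mi += joint * math.log(joint / (p_c * p_f))
    return mi


def select_features(docs, labels, vocabulary, k):
    """Keep the k words with the highest mutual information."""
    ranked = sorted(
        vocabulary, key=lambda w: mutual_information(docs, labels, w), reverse=True
    )
    return ranked[:k]


docs = [{"ball", "the"}, {"game", "the"}, {"vote", "the"}, {"law", "the"}]
labels = ["sports", "sports", "politics", "politics"]
top = select_features(docs, labels, ["the", "ball", "vote"], k=2)
print(top)
```

A word such as "the", which appears in every document, carries no information about the class and scores zero.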

12
Evaluating Categorization

Evaluation must be done on test data that is independent of the training data (usually a disjoint set of instances).
Classification accuracy: c/n, where n is the total number of test instances and c is the number of test instances correctly classified by the system.
Results can vary because of sampling error across different training and test sets.
Average the results over multiple training and test sets (splits of the overall data) for a more reliable estimate.
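
A sketch of accuracy (c/n) averaged over several random train/test splits. The number of trials, the 20% test fraction, the fixed seed, and the majority-class baseline used as the classifier are all assumptions chosen for illustration.

```python
import random


def accuracy(predictions, gold):
    """c / n: correctly classified test instances over all test instances."""
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    return correct / len(gold)


def random_split(items, test_fraction, rng):
    """Shuffle a copy of the data and cut it into disjoint train and test sets."""
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def average_accuracy(items, train_and_predict, trials=5, test_fraction=0.2, seed=0):
    """Mean accuracy over several random splits of (input, label) pairs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        train, test = random_split(items, test_fraction, rng)
        predictions = train_and_predict(train, [x for x, _ in test])
        scores.append(accuracy(predictions, [y for _, y in test]))
    return sum(scores) / len(scores)


def majority_baseline(train, test_inputs):
    """Hypothetical stand-in classifier: always predict the most common label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return [majority] * len(test_inputs)


data = [(i, "a") for i in range(10)]
mean_acc = average_accuracy(data, majority_baseline)
print(mean_acc)
```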

13
Yahoo! Science

Classify 13,589 Yahoo! web pages in the Science subtree into 95 different topics (hierarchy depth 2)

Vocabulary size: 44383 words
Preprocessing: stemming, removal of stopwords and of words appearing only once
The multi-variate Bernoulli model performs best with a small vocabulary
The multinomial model performs best with a larger vocabulary
The multinomial model achieves higher accuracy overall

14
WebKB

Classify web pages from CS departments into 7 categories
5000 pages
No stemming
No stopword removal ("my" is a very good indicator for student homepages)
Vocabulary size: 23830 words

15
Industry Sector

6440 company web pages
71 categories, 2-level deep hierarchy
No stemming
Remove words occurring only once
Vocabulary size: 29964 words

16
Newsgroups

20000 collected articles from 20 UseNet
discussion groups
No stemming
Remove stopwords and words appearing only once
Remaining vocabulary 42191 words

17
Reuters

12902 Reuters articles
135 overlapping topic categories
No stemming
Remove stopwords
Resulting vocabulary 19371 words

18
Conclusions

The multinomial model is found to perform better than the multi-variate Bernoulli model almost uniformly
Experiments with 5 real-world corpora
The multinomial model reduces error by an average of 27%, and sometimes by more than 50%
