A Comparison of Event Models for Naïve Bayes Text Classification

Provided by: raluc8

1
A Comparison of Event Models for Naïve Bayes Text
Classification
  • Andrew McCallum, Kamal Nigam

2
Motivation
  • Two text classification approaches use different
    first-order probabilistic models, both making
    the naïve Bayes assumption
  • Present the differences and details of the 2
    models
  • Compare the classification performance of
    the 2 models

3
The Models 1
  • Model 1: Multi-variate Bernoulli event model
  • One feature X_w for each word w in the dictionary
  • X_w is true in document d if w appears in d
  • Naïve Bayes assumption:
  • Given the document's topic, the appearance of one
    word in the document tells us nothing about the
    chances that another word appears
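
The binary features of this model can be sketched in a few lines of Python. This is only an illustration: the function name `bernoulli_features` and the toy vocabulary are assumptions, not part of the presentation.

```python
def bernoulli_features(document_words, vocabulary):
    """Return {word: bool}: X_w is True exactly when w occurs in the document."""
    present = set(document_words)
    # How often a word occurs is ignored; only presence or absence is recorded.
    return {w: (w in present) for w in vocabulary}


# Hypothetical toy data for illustration.
vocabulary = ["bayes", "model", "text", "word"]
doc = "text model text".split()
features = bernoulli_features(doc, vocabulary)
```

Note that "text" occurs twice in the toy document but is still represented by a single true value.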

4
The Models 2
  • Model 2: Multinomial event model
  • One feature X_i for each word position i in the
    document
  • Each feature's possible values are all the words
    in the dictionary
  • The value of X_i is the word in position i
  • Naïve Bayes assumption:
  • Given the document's topic, the word in one position
    in the document tells us nothing about the words
    in other positions
  • Second assumption:
  • Word appearance does not depend on position
  • P(X_i = w | c) = P(X_j = w | c) for all positions
    i, j, word w, and class c
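
A short sketch of what the two assumptions imply for the representation: one feature per position and, because position does not matter, a document that reduces to per-word counts. The helper names `position_features` and `word_counts` are hypothetical.

```python
from collections import Counter


def position_features(document_words):
    """X_i is the word found at position i (positions numbered from 1)."""
    return {i: w for i, w in enumerate(document_words, start=1)}


def word_counts(document_words, vocabulary):
    """Since word probability is the same at every position, counts suffice."""
    counts = Counter(w for w in document_words if w in vocabulary)
    return {w: counts[w] for w in vocabulary}


# Hypothetical toy data for illustration.
vocabulary = ["bayes", "model", "text", "word"]
doc = "text model text".split()
positions = position_features(doc)
counts = word_counts(doc, vocabulary)
```

Unlike the Bernoulli representation, the repeated word "text" now contributes a count of 2.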

5
Probabilistic Framework for Naïve Bayes 1
  • Generative model for both kinds of naïve Bayes
    classifiers
  • Assumption: text data is generated by a
    parametric model
  • Uses training data to compute Bayes-optimal
    estimates of the model parameters
  • Using the estimates, it classifies new test
    documents
  • Computes the posterior probability that each class
    generated the test document
  • Classification: selection of the most probable
    class

6
Probabilistic Framework for Naïve Bayes 2
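  • Documents are generated by a mixture model
    parameterized by θ
  • Mixture components correspond to the classes
    c_j in C = {c_1, ..., c_|C|}
  • Likelihood of a document d_i: sum of total
    probability over all mixture components
  • P(d_i | θ) = Σ_j P(c_j | θ) P(d_i | c_j; θ)
  • To generate a document, first select a component
    according to the priors P(c_j | θ)
  • The selected component then generates the document
    according to its own parameters, with distribution
    P(d_i | c_j; θ)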
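
The two-step generative story and the mixture likelihood can be sketched as follows. For illustration, each component is assumed to draw words independently from its own word distribution. All names and numbers below are hypothetical.

```python
import random


def generate_document(priors, word_probs, length, rng):
    """Pick a class from the priors, then let that class emit `length` words."""
    classes = list(priors)
    c = rng.choices(classes, weights=[priors[k] for k in classes])[0]
    vocab = list(word_probs[c])
    words = rng.choices(vocab, weights=[word_probs[c][w] for w in vocab], k=length)
    return c, words


def document_likelihood(priors, class_likelihoods):
    """Total probability: sum over classes of prior times class-conditional likelihood."""
    return sum(priors[c] * class_likelihoods[c] for c in priors)


# Hypothetical parameters for illustration.
priors = {"sports": 0.25, "politics": 0.75}
word_probs = {
    "sports": {"ball": 0.7, "vote": 0.3},
    "politics": {"ball": 0.2, "vote": 0.8},
}
label, words = generate_document(priors, word_probs, 5, random.Random(0))
likelihood = document_likelihood(priors, {"sports": 0.5, "politics": 0.1})
```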

7
Multi-variate Bernoulli Model 1
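  • Vocabulary V
  • t in {1, ..., |V|} indexes the dimensions of the
    vocabulary space
  • B_it: dimension t of the vector for document d_i
  • B_it in {0, 1} indicates whether word w_t appears in
    document d_i
  • Probability of a document given its class:
  • P(d_i | c_j; θ) = Π_t [ B_it P(w_t | c_j; θ)
    + (1 - B_it)(1 - P(w_t | c_j; θ)) ]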
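
A minimal sketch of the class-conditional probability under the multi-variate Bernoulli model, computed in log space to avoid underflow. Every vocabulary word contributes a factor: its probability if present in the document, one minus its probability if absent. The function name and the probabilities are assumptions for illustration, and each probability is assumed to lie strictly between 0 and 1.

```python
import math


def bernoulli_log_likelihood(document_words, word_probs):
    """log P(d | c), where word_probs maps each vocabulary word to P(w | c)."""
    present = set(document_words)
    total = 0.0
    for w, p in word_probs.items():
        # Absent words count too: they contribute log(1 - p).
        total += math.log(p if w in present else 1.0 - p)
    return total


# Hypothetical class-conditional word probabilities.
word_probs_c = {"text": 0.5, "model": 0.25}
log_lik = bernoulli_log_likelihood(["text"], word_probs_c)
```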

8
Multi-variate Bernoulli Model 2
  • Set of labeled training documents D = {d_1, ..., d_|D|}
  • The class prior parameters are set by the maximum
    likelihood estimate:
  • P(c_j | θ) = Σ_i P(c_j | d_i) / |D|   (4)
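
Parameter estimation for the Bernoulli model might look like the sketch below. The class priors are plain relative frequencies (the maximum likelihood estimate). The word probabilities use add-one smoothing, (1 + documents of the class containing the word) / (2 + documents of the class), which is an assumption here since the slide text does not spell it out. The name `train_bernoulli` is hypothetical.

```python
from collections import Counter, defaultdict


def train_bernoulli(labeled_docs, vocabulary):
    """labeled_docs is a list of (words, label) pairs."""
    vocab = set(vocabulary)
    class_counts = Counter(label for _, label in labeled_docs)
    n_docs = len(labeled_docs)
    # Maximum likelihood class priors: fraction of training documents per class.
    priors = {c: n / n_docs for c, n in class_counts.items()}
    # Document frequency of each word within each class.
    doc_freq = defaultdict(Counter)
    for words, label in labeled_docs:
        doc_freq[label].update(set(words) & vocab)
    word_probs = {
        c: {w: (1 + doc_freq[c][w]) / (2 + class_counts[c]) for w in vocabulary}
        for c in class_counts
    }
    return priors, word_probs


# Hypothetical toy training set.
docs = [(["text", "model"], "a"), (["text"], "a"), (["word"], "b")]
priors, word_probs = train_bernoulli(docs, ["text", "model", "word"])
```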

9
Multinomial Model 1
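  • N_it: number of times word w_t occurs in document
    d_i
  • Probability of a document given its class:
  • P(d_i | c_j; θ) = P(|d_i|) |d_i|! Π_t P(w_t | c_j; θ)^N_it / N_it!
  • Probability of word w_t in class c_j:
  • P(w_t | c_j; θ) = (1 + Σ_i N_it P(c_j | d_i))
    / (|V| + Σ_s Σ_i N_is P(c_j | d_i))
  • The class prior parameters are computed as in
    (4)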
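
A sketch of the multinomial counterpart, assuming add-one smoothing over the vocabulary for the word probabilities. The log-likelihood below drops the document-length and factorial terms, which are the same for every class and so do not affect which class scores highest. Function names and data are hypothetical.

```python
import math
from collections import Counter


def train_multinomial(labeled_docs, vocabulary):
    """Estimate P(w | c) from word counts; labeled_docs is a list of (words, label)."""
    vocab = set(vocabulary)
    counts = {}
    for words, label in labeled_docs:
        counts.setdefault(label, Counter()).update(w for w in words if w in vocab)
    word_probs = {}
    for c, cnt in counts.items():
        total = sum(cnt.values())
        # Add-one smoothing: no word gets probability zero.
        word_probs[c] = {w: (1 + cnt[w]) / (len(vocab) + total) for w in vocabulary}
    return word_probs


def multinomial_log_likelihood(document_words, word_probs_c):
    """Class-dependent part of log P(d | c): one log P(w | c) per word occurrence."""
    return sum(math.log(word_probs_c[w]) for w in document_words if w in word_probs_c)


# Hypothetical toy training set.
docs = [(["text", "text", "model"], "a"), (["word"], "b")]
word_probs = train_multinomial(docs, ["text", "model", "word"])
score = multinomial_log_likelihood(["text", "text"], word_probs["a"])
```

Here every occurrence of a word contributes to the score, whereas the Bernoulli model would count the repeated word once.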

10
Document Classification
  • Compute the posterior probability of each class
    given the evidence of the test document:
  • P(c_j | d_i; θ) = P(c_j | θ) P(d_i | c_j; θ) / P(d_i | θ)
  • Select the class with the highest posterior
    probability
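
The classification step is Bayes' rule followed by an argmax. The sketch below works with log probabilities and normalizes with the log-sum-exp trick for numerical stability. The function names and the numbers are illustrative assumptions.

```python
import math


def posteriors(log_priors, log_likelihoods):
    """Return P(c | d) for every class from log P(c) and log P(d | c)."""
    joint = {c: log_priors[c] + log_likelihoods[c] for c in log_priors}
    m = max(joint.values())
    # log P(d) = log of the sum over classes of P(c) P(d | c), computed stably.
    log_evidence = m + math.log(sum(math.exp(v - m) for v in joint.values()))
    return {c: math.exp(v - log_evidence) for c, v in joint.items()}


def classify(log_priors, log_likelihoods):
    """Select the class with the highest posterior probability."""
    post = posteriors(log_priors, log_likelihoods)
    return max(post, key=post.get)


# Hypothetical values: equal priors, likelihoods 0.2 and 0.6.
log_priors = {"a": math.log(0.5), "b": math.log(0.5)}
log_likelihoods = {"a": math.log(0.2), "b": math.log(0.6)}
post = posteriors(log_priors, log_likelihoods)
best = classify(log_priors, log_likelihoods)
```

Since the evidence term is the same for every class, the argmax could also be taken over the unnormalized log scores. Normalizing is only needed when the actual posterior values are wanted.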

11
Feature Selection
  • We might not want to use all words, but just
    reliable, good discriminators
  • In the training set, choose the k words that best
    discriminate between the categories.
  • One way is in terms of Mutual Information
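
One possible reading of mutual-information feature selection is sketched below. It scores each word by the mutual information between the class label and the binary event "word occurs in the document", then keeps the k highest-scoring words. Treating occurrence as binary is an assumption made for this illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(labeled_docs, word):
    """I(C; W) in nats, where W is True when `word` occurs in the document."""
    n = len(labeled_docs)
    joint = Counter((label, word in words) for words, label in labeled_docs)
    class_total = Counter(label for _, label in labeled_docs)
    feature_total = Counter(word in words for words, _ in labeled_docs)
    mi = 0.0
    for (c, f), k in joint.items():
        # P(c, f) * log( P(c, f) / (P(c) P(f)) ), with probabilities as counts / n.
        mi += (k / n) * math.log((k * n) / (class_total[c] * feature_total[f]))
    return mi


def select_features(labeled_docs, vocabulary, k):
    """Keep the k words with the highest mutual information with the class."""
    ranked = sorted(vocabulary, key=lambda w: mutual_information(labeled_docs, w), reverse=True)
    return ranked[:k]


# Hypothetical toy corpus: "the" appears everywhere and carries no class information.
docs = [
    (["ball", "the"], "sports"),
    (["ball", "the"], "sports"),
    (["vote", "the"], "politics"),
    (["vote", "the"], "politics"),
]
selected = select_features(docs, ["the", "ball", "vote"], 2)
```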

12
Evaluating Categorization
  • Evaluation must be done on test data that are
    independent of the training data (usually a
    disjoint set of instances).
  • Classification accuracy: c/n, where n is the total
    number of test instances and c is the number of
    test instances correctly classified by the
    system.
  • Results can vary based on sampling error due to
    different training and test sets.
  • Average results over multiple training and test
    sets (splits of the overall data) for more
    reliable estimates.
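
The evaluation procedure can be sketched as accuracy c/n averaged over several random, disjoint train/test splits. The split fraction, the number of splits, and the `train_and_predict` callback are assumptions introduced for this example.

```python
import random


def accuracy(predicted, actual):
    """c / n: fraction of test instances classified correctly."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def average_accuracy(examples, train_and_predict, n_splits=5, test_fraction=0.2, seed=0):
    """Average accuracy over random splits; examples is a list of (x, label) pairs.

    train_and_predict(train_examples, test_inputs) must return one label per test input.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_fraction))
        # Test and training sets are disjoint slices of the same shuffle.
        test, train = shuffled[:n_test], shuffled[n_test:]
        predicted = train_and_predict(train, [x for x, _ in test])
        scores.append(accuracy(predicted, [y for _, y in test]))
    return sum(scores) / len(scores)


example_accuracy = accuracy(["a", "b", "a"], ["a", "b", "b"])
```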

13
Yahoo! Science
  • Classify 13,589 Yahoo! webpages in the Science
    subtree into 95 different topics (hierarchy depth
    2)
  • Vocabulary size: 44,383 words
  • Preprocessing: stemming, removal of stopwords and
    of words appearing only once
  • The multi-variate Bernoulli performs best with a
    small vocabulary
  • The multinomial performs best with a larger
    vocabulary.
  • The multinomial achieves higher accuracy overall.

14
WebKB
  • Classify webpages from CS departments into 7
    categories
  • 5,000 pages
  • No stemming
  • No stopword removal ("my" is a very good indicator
    of student homepages)
  • Vocabulary size: 23,830 words

15
Industry Sector
  • 6,440 company web pages
  • 71 categories in a two-level hierarchy
  • No stemming
  • Remove words occurring only once
  • Vocabulary size: 29,964 words

16
Newsgroups
  • 20,000 articles collected from 20 UseNet
    discussion groups
  • No stemming
  • Remove stopwords and words appearing only once
  • Remaining vocabulary: 42,191 words

17
Reuters
  • 12,902 Reuters articles
  • 135 overlapping topic categories
  • No stemming
  • Remove stopwords
  • Resulting vocabulary: 19,371 words

18
Conclusions
  • The multinomial model is found to perform better
    than the multi-variate Bernoulli model almost
    uniformly
  • Experiments with 5 real-world corpora
  • The multinomial model reduces error by an average
    of 27% and sometimes by more than 50%