Title: Document Classification with Naïve Bayes -- How to Build Yahoo Automatically

Slide 1: Document Classification with Naïve Bayes -- How to Build Yahoo Automatically
- Andrew McCallum
- Just Research & CMU
- www.cs.cmu.edu/mccallum
Joint work with Kamal Nigam, Jason Rennie, Kristie Seymore, Tom Mitchell, Sebastian Thrun, Roni Rosenfeld, Andrew Ng.
Slide 6: Document Classification

[Figure: training documents grouped under categories (Multimedia, GUI, Garb.Coll., Semantics, ML, Planning), with example word vectors such as "planning temporal reasoning plan language..." (Planning), "programming semantics types language proof..." (Semantics), "learning algorithm reinforcement intelligence network..." (ML), and "garbage collection memory optimization region..." (Garb.Coll.); a testing document is assigned to the most likely category, here Planning.]
Slide 7: A Probabilistic Approach to Document Classification

Pick the most probable class, given the evidence:

  c* = argmax_c Pr(c | d)

- c is a class (like Planning)
- d is a document (like "language intelligence proof...")

Bayes Rule:

  Pr(c | d) = Pr(c) · Pr(d | c) / Pr(d)

Naïve Bayes:

  Pr(c | d) ∝ Pr(c) · ∏_i Pr(w_{d_i} | c)

(1) One mixture component per class. (2) Independence assumption: words are generated independently given the class.

- w_{d_i} is the i-th word in d (like "proof")
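A minimal sketch of this decision rule in Python. The classes, word probabilities, and UNSEEN floor are illustrative stand-ins, not trained values, and the product is computed in log space to avoid underflow:

```python
import math

# Illustrative "trained" parameters: class priors Pr(c) and
# per-class word probabilities Pr(w | c); the numbers are made up.
priors = {"Planning": 0.5, "Semantics": 0.5}
word_probs = {
    "Planning":  {"planning": 0.10, "plan": 0.08, "proof": 0.01, "language": 0.03},
    "Semantics": {"semantics": 0.10, "proof": 0.06, "language": 0.05, "types": 0.04},
}
UNSEEN = 1e-6  # stand-in floor for words never seen with a class

def classify(doc_words):
    """Return argmax_c [ log Pr(c) + sum_i log Pr(w_i | c) ]."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for w in doc_words:
            score += math.log(word_probs[c].get(w, UNSEEN))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify(["language", "intelligence", "proof"]))  # -> "Semantics"
```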
Slide 8: A Probabilistic Bayesian Approach

- Define a probabilistic generative model for documents with classes.
- Learn the parameters of this model by fitting them to the data and a prior.
Slide 9: Parameter Estimation in Naïve Bayes

Naïve Bayes:

  Pr(c | d) ∝ Pr(c) · ∏_i Pr(w_{d_i} | c)

Maximum a posteriori estimate of Pr(w | c), with a Dirichlet prior (AKA Laplace smoothing):

  Pr(w | c) = (1 + Σ_{d ∈ c} N(w, d)) / (|V| + Σ_{w' ∈ V} Σ_{d ∈ c} N(w', d))

where N(w, d) is the number of times word w occurs in document d, and V is the vocabulary.
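A minimal sketch of this estimator in Python; the labeled documents and class names are toy data made up for illustration:

```python
from collections import Counter

# Toy labeled training data: (class, document-as-word-list).
train = [
    ("Planning",  ["planning", "plan", "temporal", "reasoning"]),
    ("Planning",  ["plan", "language", "planning"]),
    ("Semantics", ["semantics", "types", "language", "proof"]),
]

vocab = {w for _, doc in train for w in doc}

def estimate_word_probs(klass):
    """Laplace-smoothed MAP estimate of Pr(w | c):
    (1 + count of w in class c) / (|V| + total words in class c)."""
    counts = Counter(w for c, doc in train if c == klass for w in doc)
    total = sum(counts.values())
    return {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}

probs = estimate_word_probs("Planning")
print(probs["planning"])  # 2 occurrences, |V| = 8, 7 class words: (1+2)/(8+7) = 0.2
```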
Two ways to improve this method:
(A) Make less restrictive assumptions about the model.
(B) Get better estimates of the model parameters, i.e. Pr(w | c).
Slide 10: The Scenario

- Training data with class labels: web pages the user says are interesting, and web pages the user says are uninteresting.
- Data available at training time, but without class labels: web pages the user hasn't seen or said anything about.

Can we use the unlabeled documents to increase accuracy?
Slide 11: Using the Unlabeled Data

1. Build a classification model using the limited labeled data.
2. Use the model to estimate the labels of the unlabeled documents.
3. Use all documents to build a new classification model, which is often more accurate because it is trained using more data.
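A minimal sketch of this loop, with hypothetical train_model and estimate_labels hooks standing in for the naïve Bayes steps above:

```python
def semi_supervised_fit(labeled, unlabeled, train_model, estimate_labels):
    """One round of the scheme above: fit on labeled data,
    guess labels for the unlabeled data, then refit on everything."""
    model = train_model(labeled)                  # step 1
    guessed = estimate_labels(model, unlabeled)   # step 2
    return train_model(labeled + guessed)         # step 3
```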
Slide 12: An Example

Labeled Data
- Baseball: "The new hitter struck out...", "Struck out in last inning...", "Homerun in the first inning..."
- Ice Skating: "Fell on the ice...", "Perfect triple jump...", "Katarina Witt's gold medal performance...", "New ice skates...", "Practice at the ice rink every day..."

Unlabeled Data
- "Tara Lipinski's substitute ice skates didn't hurt her performance. She graced the ice with a series of perfect jumps and won the gold medal."
- "Tara Lipinski bought a new house for her parents."
- "Pete Rose is not as good an athlete as Tara Lipinski..."
Before EM:
  Pr(Lipinski | Ice Skating) = 0.01
  Pr(Lipinski | Baseball) = 0.003

After EM:
  Pr(Lipinski | Ice Skating) = 0.02
  Pr(Lipinski | Baseball) = 0.001
Slide 13: Filling in Missing Labels with EM

[Dempster et al. '77; Ghahramani & Jordan '95; McLachlan & Krishnan '97]

Expectation-Maximization is a class of iterative algorithms for maximum-likelihood estimation with incomplete data.

- E-step: Use current estimates of the model parameters to guess the values of the missing labels.
- M-step: Use current guesses for the missing labels to calculate new estimates of the model parameters.
- Repeat E- and M-steps until convergence.

Finds the model parameters that locally maximize the probability of both the labeled and the unlabeled data.
Slide 14: EM for Text Classification

Expectation-step (estimate the class labels): for each unlabeled document d,

  Pr(c | d) ∝ Pr(c) · ∏_i Pr(w_{d_i} | c)

Maximization-step (new parameters using the estimates):

  Pr(w | c) = (1 + Σ_d Pr(c | d) · N(w, d)) / (|V| + Σ_{w' ∈ V} Σ_d Pr(c | d) · N(w', d))

where the sums now run over all documents, labeled documents contributing Pr(c | d) ∈ {0, 1}.
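A runnable sketch of these two steps on toy data; the documents, class names, fixed 10-iteration cap, and keeping labeled documents' labels hard are all illustrative choices:

```python
import math
from collections import Counter

CLASSES = ["Baseball", "Ice Skating"]

# Toy data in the spirit of the example slide.
labeled = [
    ("Baseball",    ["hitter", "struck", "out", "inning"]),
    ("Ice Skating", ["ice", "triple", "jump", "skates"]),
]
unlabeled = [
    ["lipinski", "ice", "skates", "jump"],
    ["lipinski", "gold", "medal", "ice"],
]

vocab = sorted({w for _, d in labeled for w in d} | {w for d in unlabeled for w in d})

def m_step(doc_weights):
    """doc_weights: list of (Pr(c|d)-per-class dict, doc). Returns Laplace-smoothed
    priors and Pr(w|c), using the posteriors as fractional counts."""
    priors, word_probs = {}, {}
    for c in CLASSES:
        weight_sum = sum(w[c] for w, _ in doc_weights)
        priors[c] = (1 + weight_sum) / (len(CLASSES) + len(doc_weights))
        counts = Counter()
        for w, doc in doc_weights:
            for word in doc:
                counts[word] += w[c]
        total = sum(counts.values())
        word_probs[c] = {v: (1 + counts[v]) / (len(vocab) + total) for v in vocab}
    return priors, word_probs

def e_step(priors, word_probs, doc):
    """Posterior Pr(c|d) via Bayes rule with the naive independence assumption."""
    logs = {c: math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in doc)
            for c in CLASSES}
    mx = max(logs.values())
    exps = {c: math.exp(l - mx) for c, l in logs.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

# Initialize from labeled data only (hard labels), then iterate EM.
hard = [({c2: 1.0 if c2 == c else 0.0 for c2 in CLASSES}, d) for c, d in labeled]
priors, word_probs = m_step(hard)
for _ in range(10):
    soft = [(e_step(priors, word_probs, d), d) for d in unlabeled]
    priors, word_probs = m_step(hard + soft)  # labeled labels stay fixed

print(word_probs["Ice Skating"]["lipinski"])  # rises above its Baseball counterpart
```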
Slide 15: WebKB Data Set

Classes: student, faculty, course, project

4 classes, 4,199 documents from CS academic departments.
Slide 16: Word Vector Evolution with EM

Iteration 0: intelligence DD artificial understanding DDw dist identical rus arrange games dartmouth natural cognitive logic proving prolog
Iteration 1: DD D lecture cc D DDDD handout due problem set tay DDam yurtas homework kfoury sec
Iteration 2: D DD lecture cc DDDD due D homework assignment handout set hw exam problem DDam postscript

(D is a digit)
Slide 17: EM as Clustering

[Figure: a few labeled points (X) among many unlabeled points; EM groups the unlabeled points into clusters around them.]
Slide 18: EM as Clustering, Gone Wrong

[Figure: the same setup where EM's clusters do not match the true class boundaries.]
Slide 19: 20 Newsgroups Data Set
sci.med
sci.crypt
sci.space
alt.atheism
sci.electronics
comp.graphics
talk.politics.misc
comp.windows.x
rec.sport.hockey
talk.politics.guns
talk.religion.misc
rec.sport.baseball
talk.politics.mideast
comp.sys.mac.hardware
comp.sys.ibm.pc.hardware
comp.os.ms-windows.misc
20 class labels, 20,000 documents, 62k unique words.
Slide 20: Newsgroups Classification Accuracy, varying labeled documents [chart]

Slide 21: Newsgroups Classification Accuracy, varying unlabeled documents [chart]

Slide 22: WebKB Classification Accuracy, varying labeled documents [chart]

Slide 23: WebKB Classification Accuracy, varying weight of unlabeled data [chart]

Slide 24: WebKB Classification Accuracy, varying labeled documents and selecting unlabeled weight by CV [chart]
Slide 25: Populating a hierarchy

- Naïve Bayes
  - Simple, robust document classification.
  - Many principled enhancements (e.g. shrinkage).
  - Requires some labeled training data.
- Keyword matching
  - Requires no labeled training data except the keywords themselves.
  - Brittle; breaks easily.
Slide 26: Combine Naïve Bayes and Keywords for Best of Both

- Classify unlabeled documents with keyword matching.
- Pretend these category labels are correct, and use this data to train naïve Bayes.
- Naïve Bayes acts to temper and round out the keyword class definitions.
- Brings in new probabilistically weighted keywords that are correlated with the few original keywords.
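A minimal sketch of the first two bullets, assuming hypothetical keyword lists and a train_naive_bayes hook like the estimators sketched earlier:

```python
# Hypothetical keyword lists per category (illustrative only).
KEYWORDS = {
    "Planning":  {"planning", "plan", "temporal"},
    "Semantics": {"semantics", "types", "denotational"},
}

def keyword_label(doc_words):
    """Assign the category whose keywords the document mentions most,
    or None if no keywords match at all."""
    hits = {c: sum(w in kws for w in doc_words) for c, kws in KEYWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None

def bootstrap(unlabeled_docs, train_naive_bayes):
    """Keyword-match the unlabeled docs, then treat the matches as if
    they were correctly labeled training data for naive Bayes."""
    pseudo = [(keyword_label(d), d) for d in unlabeled_docs]
    pseudo = [(c, d) for c, d in pseudo if c is not None]
    return train_naive_bayes(pseudo)
```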
Slide 27: Top words found by naïve Bayes and Shrinkage

ROOT: computer, university, science, system, paper
HCI: computer, system, multimedia, university, paper
IR: information, text, documents, classification, retrieval
Hardware: circuits, designs, computer, university, performance
AI: learning, university, computer, based, intelligence
Programming: programming, language, logic, university, programs
GUI: interface, design, user, sketch, interfaces
Cooperative: collaborative, CSCW, work, provide, group
Multimedia: multimedia, real, time, data, media
Planning: planning, temporal, reasoning, plan, problems
Machine Learning: learning, algorithm, university, networks
NLP: language, natural, processing, information, text
Semantics: semantics, denotational, language, construction, types
Garbage Collection: garbage, collection, memory, optimization, region
Slide 28: Classification Results

400 test documents; 70 classes in a hierarchy of depth 2-4. [results chart]
Slide 29: Conclusions

- Naïve Bayes is a method of document classification based on Bayesian statistics.
- It has many parameters to estimate, and so requires much labeled training data.
- We can build on its probabilistic, statistical foundations to improve performance (e.g. unlabeled data + EM).
- These techniques are accurate and robust enough to build useful Web services.