Title: Document Classification with Naïve Bayes -- How to Build Yahoo Automatically

Slide 1: Document Classification with Naïve Bayes -- How to Build Yahoo Automatically
- Andrew McCallum
- Just Research & CMU
- www.cs.cmu.edu/mccallum
Joint work with Kamal Nigam, Jason Rennie, Kristie Seymore, Tom Mitchell, Sebastian Thrun, Roni Rosenfeld, Andrew Ng.
Slide 6: Document Classification

[Figure: training documents grouped under categories (Multimedia, GUI, Garb.Coll., Semantics, ML, Planning), with example word vectors such as "planning temporal reasoning plan language..." (Planning), "programming semantics types language proof..." (Semantics), "learning algorithm reinforcement intelligence network..." (ML), and "garbage collection memory optimization region..." (Garb.Coll.); a testing document is assigned to the most likely category, here Planning.]
Slide 7: A Probabilistic Approach to Document Classification

Pick the most probable class, given the evidence:

  c* = argmax_c Pr(c | d)

- c is a class (like Planning)
- d is a document (like "language intelligence proof...")

Bayes Rule:

  Pr(c | d) = Pr(c) · Pr(d | c) / Pr(d)

Naïve Bayes:

  Pr(c | d) ∝ Pr(c) · ∏_i Pr(w_{d_i} | c)

(1) One mixture component per class. (2) Independence assumption: words are generated independently given the class.

- w_{d_i} is the i-th word in d (like "proof")
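A minimal sketch of this decision rule in Python. The classes, word probabilities, and UNSEEN floor are illustrative stand-ins, not trained values, and the product is computed in log space to avoid underflow:

```python
import math

# Illustrative "trained" parameters: class priors Pr(c) and
# per-class word probabilities Pr(w | c); the numbers are made up.
priors = {"Planning": 0.5, "Semantics": 0.5}
word_probs = {
    "Planning":  {"planning": 0.10, "plan": 0.08, "proof": 0.01, "language": 0.03},
    "Semantics": {"semantics": 0.10, "proof": 0.06, "language": 0.05, "types": 0.04},
}
UNSEEN = 1e-6  # stand-in floor for words never seen with a class

def classify(doc_words):
    """Return argmax_c [ log Pr(c) + sum_i log Pr(w_i | c) ]."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for w in doc_words:
            score += math.log(word_probs[c].get(w, UNSEEN))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify(["language", "intelligence", "proof"]))  # -> "Semantics"
```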
Slide 8: A Probabilistic Bayesian Approach

- Define a probabilistic generative model for documents with classes.
- Learn the parameters of this model by fitting them to the data and a prior.
Slide 9: Parameter Estimation in Naïve Bayes

Naïve Bayes:

  Pr(c | d) ∝ Pr(c) · ∏_i Pr(w_{d_i} | c)

Maximum a posteriori estimate of Pr(w | c), with a Dirichlet prior (AKA Laplace smoothing):

  Pr(w | c) = (1 + Σ_{d ∈ c} N(w, d)) / (|V| + Σ_{w' ∈ V} Σ_{d ∈ c} N(w', d))

where N(w, d) is the number of times word w occurs in document d, and V is the vocabulary.
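A minimal sketch of this estimator in Python; the labeled documents and class names are toy data made up for illustration:

```python
from collections import Counter

# Toy labeled training data: (class, document-as-word-list).
train = [
    ("Planning",  ["planning", "plan", "temporal", "reasoning"]),
    ("Planning",  ["plan", "language", "planning"]),
    ("Semantics", ["semantics", "types", "language", "proof"]),
]

vocab = {w for _, doc in train for w in doc}

def estimate_word_probs(klass):
    """Laplace-smoothed MAP estimate of Pr(w | c):
    (1 + count of w in class c) / (|V| + total words in class c)."""
    counts = Counter(w for c, doc in train if c == klass for w in doc)
    total = sum(counts.values())
    return {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}

probs = estimate_word_probs("Planning")
print(probs["planning"])  # 2 occurrences, |V| = 8, 7 class words: (1+2)/(8+7) = 0.2
```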
Two ways to improve this method:
(A) Make less restrictive assumptions about the model.
(B) Get better estimates of the model parameters, i.e. Pr(w | c).
Slide 10: The Scenario

- Training data with class labels: web pages the user says are interesting, and web pages the user says are uninteresting.
- Data available at training time, but without class labels: web pages the user hasn't seen or said anything about.

Can we use the unlabeled documents to increase accuracy?
Slide 11: Using the Unlabeled Data

1. Build a classification model using the limited labeled data.
2. Use the model to estimate the labels of the unlabeled documents.
3. Use all documents to build a new classification model, which is often more accurate because it is trained using more data.
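A minimal sketch of this loop, with hypothetical train_model and estimate_labels hooks standing in for the naïve Bayes steps above:

```python
def semi_supervised_fit(labeled, unlabeled, train_model, estimate_labels):
    """One round of the scheme above: fit on labeled data,
    guess labels for the unlabeled data, then refit on everything."""
    model = train_model(labeled)                  # step 1
    guessed = estimate_labels(model, unlabeled)   # step 2
    return train_model(labeled + guessed)         # step 3
```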
Slide 12: An Example

Labeled Data
- Baseball: "The new hitter struck out...", "Struck out in last inning...", "Homerun in the first inning..."
- Ice Skating: "Fell on the ice...", "Perfect triple jump...", "Katarina Witt's gold medal performance...", "New ice skates...", "Practice at the ice rink every day..."

Unlabeled Data
- "Tara Lipinski's substitute ice skates didn't hurt her performance. She graced the ice with a series of perfect jumps and won the gold medal."
- "Tara Lipinski bought a new house for her parents."
- "Pete Rose is not as good an athlete as Tara Lipinski..."
Before EM:
  Pr(Lipinski | Ice Skating) = 0.01
  Pr(Lipinski | Baseball) = 0.003

After EM:
  Pr(Lipinski | Ice Skating) = 0.02
  Pr(Lipinski | Baseball) = 0.001
Slide 13: Filling in Missing Labels with EM

[Dempster et al. '77; Ghahramani & Jordan '95; McLachlan & Krishnan '97]

Expectation-Maximization is a class of iterative algorithms for maximum-likelihood estimation with incomplete data.

- E-step: Use current estimates of the model parameters to guess the values of the missing labels.
- M-step: Use current guesses for the missing labels to calculate new estimates of the model parameters.
- Repeat E- and M-steps until convergence.

Finds the model parameters that locally maximize the probability of both the labeled and the unlabeled data.
Slide 14: EM for Text Classification

Expectation-step (estimate the class labels): for each unlabeled document d,

  Pr(c | d) ∝ Pr(c) · ∏_i Pr(w_{d_i} | c)

Maximization-step (new parameters using the estimates):

  Pr(w | c) = (1 + Σ_d Pr(c | d) · N(w, d)) / (|V| + Σ_{w' ∈ V} Σ_d Pr(c | d) · N(w', d))

where the sums now run over all documents, labeled documents contributing Pr(c | d) ∈ {0, 1}.
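A runnable sketch of these two steps on toy data; the documents, class names, fixed 10-iteration cap, and keeping labeled documents' labels hard are all illustrative choices:

```python
import math
from collections import Counter

CLASSES = ["Baseball", "Ice Skating"]

# Toy data in the spirit of the example slide.
labeled = [
    ("Baseball",    ["hitter", "struck", "out", "inning"]),
    ("Ice Skating", ["ice", "triple", "jump", "skates"]),
]
unlabeled = [
    ["lipinski", "ice", "skates", "jump"],
    ["lipinski", "gold", "medal", "ice"],
]

vocab = sorted({w for _, d in labeled for w in d} | {w for d in unlabeled for w in d})

def m_step(doc_weights):
    """doc_weights: list of (Pr(c|d)-per-class dict, doc). Returns Laplace-smoothed
    priors and Pr(w|c), using the posteriors as fractional counts."""
    priors, word_probs = {}, {}
    for c in CLASSES:
        weight_sum = sum(w[c] for w, _ in doc_weights)
        priors[c] = (1 + weight_sum) / (len(CLASSES) + len(doc_weights))
        counts = Counter()
        for w, doc in doc_weights:
            for word in doc:
                counts[word] += w[c]
        total = sum(counts.values())
        word_probs[c] = {v: (1 + counts[v]) / (len(vocab) + total) for v in vocab}
    return priors, word_probs

def e_step(priors, word_probs, doc):
    """Posterior Pr(c|d) via Bayes rule with the naive independence assumption."""
    logs = {c: math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in doc)
            for c in CLASSES}
    mx = max(logs.values())
    exps = {c: math.exp(l - mx) for c, l in logs.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

# Initialize from labeled data only (hard labels), then iterate EM.
hard = [({c2: 1.0 if c2 == c else 0.0 for c2 in CLASSES}, d) for c, d in labeled]
priors, word_probs = m_step(hard)
for _ in range(10):
    soft = [(e_step(priors, word_probs, d), d) for d in unlabeled]
    priors, word_probs = m_step(hard + soft)  # labeled labels stay fixed

print(word_probs["Ice Skating"]["lipinski"])  # rises above its Baseball counterpart
```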
Slide 15: WebKB Data Set

Classes: student, faculty, course, project

4 classes, 4,199 documents from CS academic departments.
Slide 16: Word Vector Evolution with EM

Iteration 0: intelligence DD artificial understanding DDw dist identical rus arrange games dartmouth natural cognitive logic proving prolog
Iteration 1: DD D lecture cc D DDDD handout due problem set tay DDam yurtas homework kfoury sec
Iteration 2: D DD lecture cc DDDD due D homework assignment handout set hw exam problem DDam postscript

(D is a digit)
Slide 17: EM as Clustering

[Figure: a few labeled points (X) among many unlabeled points; EM groups the unlabeled points into clusters around them.]
Slide 18: EM as Clustering, Gone Wrong

[Figure: the same setup where EM's clusters do not match the true class boundaries.]
Slide 19: 20 Newsgroups Data Set
sci.med
sci.crypt
sci.space
alt.atheism
sci.electronics
comp.graphics
talk.politics.misc
comp.windows.x
rec.sport.hockey
talk.politics.guns
talk.religion.misc
rec.sport.baseball
talk.politics.mideast
comp.sys.mac.hardware
comp.sys.ibm.pc.hardware
comp.os.ms-windows.misc
20 class labels, 20,000 documents, 62k unique words.
Slide 20: Newsgroups Classification Accuracy, varying labeled documents [chart]

Slide 21: Newsgroups Classification Accuracy, varying unlabeled documents [chart]

Slide 22: WebKB Classification Accuracy, varying labeled documents [chart]

Slide 23: WebKB Classification Accuracy, varying weight of unlabeled data [chart]

Slide 24: WebKB Classification Accuracy, varying labeled documents and selecting unlabeled weight by CV [chart]
Slide 25: Populating a hierarchy

- Naïve Bayes
  - Simple, robust document classification.
  - Many principled enhancements (e.g. shrinkage).
  - Requires some labeled training data.
- Keyword matching
  - Requires no labeled training data except the keywords themselves.
  - Brittle; breaks easily.
Slide 26: Combine Naïve Bayes and Keywords for Best of Both

- Classify unlabeled documents with keyword matching.
- Pretend these category labels are correct, and use this data to train naïve Bayes.
- Naïve Bayes acts to temper and round out the keyword class definitions.
- Brings in new probabilistically weighted keywords that are correlated with the few original keywords.
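A minimal sketch of the first two bullets, assuming hypothetical keyword lists and a train_naive_bayes hook like the estimators sketched earlier:

```python
# Hypothetical keyword lists per category (illustrative only).
KEYWORDS = {
    "Planning":  {"planning", "plan", "temporal"},
    "Semantics": {"semantics", "types", "denotational"},
}

def keyword_label(doc_words):
    """Assign the category whose keywords the document mentions most,
    or None if no keywords match at all."""
    hits = {c: sum(w in kws for w in doc_words) for c, kws in KEYWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None

def bootstrap(unlabeled_docs, train_naive_bayes):
    """Keyword-match the unlabeled docs, then treat the matches as if
    they were correctly labeled training data for naive Bayes."""
    pseudo = [(keyword_label(d), d) for d in unlabeled_docs]
    pseudo = [(c, d) for c, d in pseudo if c is not None]
    return train_naive_bayes(pseudo)
```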
Slide 27: Top words found by naïve Bayes and Shrinkage

ROOT: computer, university, science, system, paper
HCI: computer, system, multimedia, university, paper
IR: information, text, documents, classification, retrieval
Hardware: circuits, designs, computer, university, performance
AI: learning, university, computer, based, intelligence
Programming: programming, language, logic, university, programs
GUI: interface, design, user, sketch, interfaces
Cooperative: collaborative, CSCW, work, provide, group
Multimedia: multimedia, real, time, data, media
Planning: planning, temporal, reasoning, plan, problems
Machine Learning: learning, algorithm, university, networks
NLP: language, natural, processing, information, text
Semantics: semantics, denotational, language, construction, types
Garbage Collection: garbage, collection, memory, optimization, region
Slide 28: Classification Results

400 test documents; 70 classes in a hierarchy of depth 2-4. [results chart]
Slide 29: Conclusions

- Naïve Bayes is a method of document classification based on Bayesian statistics.
- It has many parameters to estimate, and so requires much labeled training data.
- We can build on its probabilistic, statistical foundations to improve performance (e.g. unlabeled data + EM).
- These techniques are accurate and robust enough to build useful Web services.