Title: Partially Supervised Classification of Text Documents by Bing Liu, Philip Yu, and Xiaoli Li
1 Partially Supervised Classification of Text Documents, by Bing Liu, Philip Yu, and Xiaoli Li
- Presented by Rick Knowles
- 7 April 2005
2 Agenda
- Problem Statement
- Related Work
- Theoretical Foundations
- Proposed Technique
- Evaluation
- Conclusions
3 Problem Statement: Common Approach
- Text categorization: automated assignment of text documents to pre-defined classes
- Common approach: supervised learning
- Manually label a set of documents with the pre-defined classes
- Use a learning algorithm to build a classifier
4 Problem Statement: Common Approach (cont.)
- Problem: the bottleneck of labeling the large number of training documents needed to build the classifier
- Nigam et al. have shown that adding a large dose of unlabeled data can help
5 A Different Approach: Partially Supervised Classification
- Two-class problem: positive and unlabeled
- Key feature: there are no labeled negative documents
- Can be posed as a constrained optimization problem
- A function that correctly classifies all positive docs and minimizes the number of mixed docs classified as positive will have an expected error rate of no more than e
- Example: finding matching (i.e., positive) documents in a large collection such as the Web
- Matching documents are positive
- All others are negative
6 Related Work
- Text Classification techniques
- Naïve Bayesian
- K-nearest neighbor
- Support vector machines
- Each requires labeled data for all classes
- Problem similar to traditional information retrieval
- IR rank-orders documents according to their similarity to the query document
- But it does not perform document classification
7 Theoretical Foundations
- Some discussion regarding the theoretical foundations, focused primarily on:
- Minimization of the probability of error
- Expected recall and precision of functions
- Pr[f(X) ≠ Y] = Pr[f(X) = 1] - Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1]   (1)
- Painful, painful, but it did show that you can build accurate classifiers with high probability when sufficient documents in P (the positive document set) and M (the unlabeled set) are available (a derivation of (1) is sketched below)
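A short derivation of (1), using only the fact that f(X) and Y are binary:

```latex
\begin{align*}
\Pr[f(X)\neq Y] &= \Pr[f(X)=1, Y=0] + \Pr[f(X)=0, Y=1] \\
  &= \Pr[f(X)=1] - \Pr[f(X)=1, Y=1] + \Pr[f(X)=0, Y=1] \\
  &= \Pr[f(X)=1] - \bigl(\Pr[Y=1] - \Pr[f(X)=0, Y=1]\bigr) + \Pr[f(X)=0, Y=1] \\
  &= \Pr[f(X)=1] - \Pr[Y=1] + 2\,\Pr[f(X)=0 \mid Y=1]\,\Pr[Y=1]
\end{align*}
```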
8 Theoretical Foundations (cont.)
- Two serious practical drawbacks to the theoretical method:
- The constrained optimization problem may not be easy to solve for the function class in which we are interested
- It is not easy to choose a desired recall level that will give a good classifier using the function class we are using
9 Proposed Technique
- Theory be darned!
- The paper introduces a practical technique based on the naïve Bayes classifier and the Expectation-Maximization (EM) algorithm
- After introducing the general technique, the authors offer an enhancement using spies
10 Proposed Technique: Terms
- D is the set of training documents
- V = <w1, w2, ..., w|V|> is the set of all words considered for classification
- w_{di,k} is the word in position k of document di
- N(wt, di) is the number of times wt occurs in di
- C = {c1, c2} is the set of predefined classes
- P is the set of positive documents
- M is the set of unlabeled documents
- S is the set of spy documents
- The posterior probability Pr[cj | di] ∈ {0, 1} depends on the class label of the document
11 Proposed Technique: Naïve Bayesian Classifier (NB-C)
- Pr[cj] = Σ_{i=1..|D|} Pr[cj | di] / |D|   (2)
- Pr[wt | cj] = (1 + Σ_{i=1..|D|} N(wt, di) Pr[cj | di]) / (|V| + Σ_{s=1..|V|} Σ_{i=1..|D|} N(ws, di) Pr[cj | di])   (3)
- and assuming the words are independent given the class:
- Pr[cj | di] = Pr[cj] Π_{k=1..|di|} Pr[w_{di,k} | cj] / Σ_{r=1..|C|} (Pr[cr] Π_{k=1..|di|} Pr[w_{di,k} | cr])   (4)
- The class with the highest Pr[cj | di] is assigned as the class of the doc (a sketch of (2)-(4) follows below)
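A minimal sketch of the NB-C equations in Python, assuming documents are word-count dictionaries and class memberships are probabilities Pr[cj | di]; the helper names are hypothetical, not the authors' code:

```python
from collections import defaultdict
from math import log, exp

def train_nb(docs, post, vocab):
    """docs: list of {word: count}; post: list of [Pr(c1|d), Pr(c2|d)]; vocab: list of words."""
    n_classes = 2
    # Equation (2): class priors from the (possibly probabilistic) labels
    prior = [sum(p[j] for p in post) / len(docs) for j in range(n_classes)]
    # Equation (3): word probabilities with add-one smoothing
    counts = [defaultdict(float) for _ in range(n_classes)]
    totals = [0.0] * n_classes
    for d, p in zip(docs, post):
        for w, n in d.items():
            for j in range(n_classes):
                counts[j][w] += p[j] * n
                totals[j] += p[j] * n
    cond = [{w: (1.0 + counts[j][w]) / (len(vocab) + totals[j]) for w in vocab}
            for j in range(n_classes)]
    return prior, cond

def classify_nb(doc, prior, cond):
    """Equation (4): posterior Pr[cj | d], computed in log space and normalized."""
    logs = []
    for j in range(len(prior)):
        s = log(prior[j]) if prior[j] > 0 else float("-inf")
        for w, n in doc.items():
            if w in cond[j]:
                s += n * log(cond[j][w])
        logs.append(s)
    m = max(logs)
    exps = [exp(s - m) for s in logs]
    z = sum(exps)
    return [e / z for e in exps]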
12 Proposed Technique: EM Algorithm
- Popular class of iterative algorithms for maximum likelihood estimation in problems with incomplete data
- Two steps:
- Expectation: fills in the missing data
- Maximization: parameters are re-estimated
- Rinse and repeat
- Using an NB-C, (4) is the E step (filling in the class probabilities) and (2) and (3) are the M step (re-estimating the parameters)
- The probability of a class now takes a value in [0, 1] instead of {0, 1}
13 Proposed Technique: EM Algorithm (cont.)
- All positive documents have the class value c1
- Need to determine the class value of each doc in the mixed set
- EM can help assign a probabilistic class label to each document dj in the mixed set: Pr[c1 | dj] and Pr[c2 | dj]
- After a number of iterations, all the probabilities will converge
14 Proposed Technique: Step 1 - Reinitialization (I-EM)
- Reinitialization:
- Build an initial NB-C using the document sets M and P
- For each document dj in P, Pr[c1 | dj] = 1 and Pr[c2 | dj] = 0
- For each document dj in M, Pr[c1 | dj] = 0 and Pr[c2 | dj] = 1
- Loop while classifier parameters change:
- For each document dj in M:
- Compute Pr[c1 | dj] using the current NB-C
- Pr[c2 | dj] = 1 - Pr[c1 | dj]
- Update Pr[wt | c1] and Pr[c1] given the probabilistically assigned classes for the dj (Pr[c1 | dj]) and the documents in P (a new NB-C is being built in the process)
- Works well on easy datasets (see the sketch below)
- Problem: our initialization is strongly biased towards positive documents
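A minimal sketch of I-EM, reusing the hypothetical train_nb / classify_nb helpers from the NB-C sketch; the slides loop until the parameters stop changing, while this sketch uses a fixed iteration cap for simplicity:

```python
def i_em(M, P, vocab, iters=8):
    docs = P + M
    # Reinitialization: Pr[c1|d] = 1 for P, Pr[c2|d] = 1 for M
    post = [[1.0, 0.0] for _ in P] + [[0.0, 1.0] for _ in M]
    prior, cond = train_nb(docs, post, vocab)
    for _ in range(iters):
        # E step: probabilistically relabel only the mixed documents
        for i in range(len(P), len(docs)):
            post[i] = classify_nb(docs[i], prior, cond)
        # M step: rebuild the NB-C from P plus the relabeled mixed docs
        prior, cond = train_nb(docs, post, vocab)
    return prior, cond, post[len(P):]
```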
15 Proposed Technique: Step 1 - Spies
- Problem: our initialization is strongly biased towards positive documents
- Need to identify some very likely negative documents from the mixed set
- We do this by taking spy documents from the positive set P and putting them in the mixed set M (10% of P was used)
- A threshold t is set, and those documents with a probabilistic label less than t are identified as negative (a noise level of 15% was used to set t)
[Figure: spy documents from the positive set (class c1) are added to the unlabeled mixed set (class c2); mixed documents scoring below the threshold form the likely-negative set]
16 Proposed Technique: Step 1 - Spies (cont.)
- N (most likely negative docs) = U (unlabeled docs) = {}
- S (spies) = sample(P, s%)
- MS = M ∪ S
- P = P - S
- Assign every document di in P the class c1
- Assign every document dj in MS the class c2
- Run I-EM(MS, P)
- Classify each document dj in MS
- Determine the probability threshold t using S
- For each document dj in M:
- If its probability Pr[c1 | dj] < t, then N = N ∪ {dj}
- Else U = U ∪ {dj} (the whole step is sketched below)
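A minimal sketch of Step 1 with spies, built on the i_em sketch above. The 10% spy fraction comes from the slides; exactly how the threshold t is derived from the spy scores is an assumption of this sketch:

```python
import random

def step1_spies(M, P, vocab, spy_frac=0.10, noise=0.15):
    S = random.sample(P, max(1, int(spy_frac * len(P))))  # S = sample(P, s%)
    P_rest = [d for d in P if d not in S]                  # P = P - S
    MS = M + S                                             # MS = M ∪ S
    _, _, post_MS = i_em(MS, P_rest, vocab)                # run I-EM(MS, P)
    pr_c1 = {id(d): p[0] for d, p in zip(MS, post_MS)}
    # Threshold t from the spies: allow `noise` of them to fall below t
    spy_scores = sorted(pr_c1[id(s)] for s in S)
    t = spy_scores[int(noise * len(spy_scores))]
    N = [d for d in M if pr_c1[id(d)] < t]                 # most likely negative docs
    U = [d for d in M if pr_c1[id(d)] >= t]                # remaining unlabeled docs
    return N, U, S
```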
17 Proposed Technique: Step 2 - Building the Final Classifier
- Using P, N and U as developed in the previous step
- Put all the spy documents S back in P
- Assign Pr[c1 | di] = 1 for all documents in P
- Assign Pr[c2 | di] = 1 for all documents in N; this will change with each iteration of EM
- Each doc dk in U is not assigned a label initially; at the end of the first iteration, it will have a probabilistic label Pr[c1 | dk]
- Run EM using the document sets P, N and U until it converges
- When EM stops, the final classifier has been produced (sketched below)
- This two-step technique is called S-EM (Spy EM)
18 Proposed Technique: Selecting a Classifier
- The local maximum that EM converges to may not cleanly separate the positive and negative documents
- Likely if there are many local clusters
- If so, from the set of classifiers developed over the iterations, select the one with the least estimated probability of error (see the sketch below)
- Refer to (1):
- Pr[f(X) ≠ Y] = Pr[f(X) = 1] - Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1]
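A minimal sketch of the selection step using (1). Pr[f(X)=1] is estimated on the mixed set M and Pr[f(X)=0 | Y=1] on the positive set P; Pr[Y=1] is unknown, so this sketch simply takes it as a parameter, which is an assumption rather than anything given on the slides:

```python
def estimated_error(clf, P, M, pr_y1=0.5):
    prior, cond = clf
    # Estimate Pr[f(X)=1] on the mixed set
    pr_f1 = sum(classify_nb(d, prior, cond)[0] > 0.5 for d in M) / len(M)
    # Estimate Pr[f(X)=0 | Y=1] on the positive set
    pr_f0_y1 = sum(classify_nb(d, prior, cond)[0] <= 0.5 for d in P) / len(P)
    return pr_f1 - pr_y1 + 2 * pr_f0_y1 * pr_y1

def select_classifier(classifiers, P, M):
    # Pick the classifier with the least estimated probability of error
    return min(classifiers, key=lambda clf: estimated_error(clf, P, M))
```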
19 Evaluation: Measurements
- Breakeven point
- The point where 0 = p - r, i.e., where precision p equals recall r
- Only evaluates the sorting order of the class probabilities of documents
- Not appropriate
- F score
- F = 2pr / (p + r)
- Measures performance on a particular class
- Reflects the average effect of both precision and recall
- Only when both p and r are large will F be large (see the example below)
- Accuracy
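A tiny check of the F score formula:

```python
def f_score(p, r):
    # F = 2pr / (p + r); guard against p = r = 0
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

print(f_score(0.95, 0.95))  # 0.95 -- both large, F is large
print(f_score(0.95, 0.10))  # ~0.18 -- one small, F collapses
```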
20 Evaluation: Results
- 2 large document corpora
- 20NG (removed UseNet headers and subject lines)
- WebKB (HTML tags removed)
- 8 iterations
- (F) = F score, (A) = accuracy, both in percent

          Pos size   M size   Pos in M   NB(F)   NB(A)   I-EM8(F)   I-EM8(A)   S-EM(F)   S-EM(A)
Average   405        4471     811        43.93   84.52   68.58      87.54      76.61     92.16
21 Evaluation: Results (cont.)
- Also varied the % of positive documents, both in P (a) and in M (b)
               Pos size   M size   Pos in M   NB(F)   NB(A)   I-EM8(F)   I-EM8(A)   S-EM(F)   S-EM(A)
a=20%, b=20%   405        3985     324        60.66   94.41   68.08      91.96      76.93     95.96
a=50%, b=20%   1013       3863     203        72.09   95.94   63.63      86.81      73.61     95.28
a=50%, b=50%   1013       4167     507        73.81   93.12   71.25      85.79      81.85     94.32
22 Conclusions
- This paper studied the problem of classification with only partial information: one class and a set of mixed documents
- Technique:
- Naïve Bayes classifier
- Expectation-Maximization algorithm
- Reinitialized using the positive documents and the most likely negative documents to compensate for the initial bias
- Uses an estimate of the classification error to select a good classifier
- Extremely accurate results