Title: Partially Supervised Classification of Text Documents by Bing Liu, Philip Yu, and Xiaoli Li
1 Partially Supervised Classification of Text Documents, by Bing Liu, Philip Yu, and Xiaoli Li
- Presented by Rick Knowles
- 7 April 2005
2 Agenda
- Problem Statement
- Related Work
- Theoretical Foundations
- Proposed Technique
- Evaluation
- Conclusions
3 Problem Statement: Common Approach
- Text categorization: automated assignment of text documents to pre-defined classes
- Common approach: supervised learning
- Manually label a set of documents with the pre-defined classes
- Use a learning algorithm to build a classifier
4 Problem Statement: Common Approach (cont.)
- Problem: the bottleneck of labeling the large number of training documents needed to build the classifier
- Nigam et al. have shown that adding a large dose of unlabeled data can help
5 A Different Approach: Partially Supervised Classification
- Two-class problem: positive and unlabeled
- Key feature: there are no labeled negative documents
- Can be posed as a constrained optimization problem
- A function that correctly classifies all positive docs and minimizes the number of mixed docs classified as positive will have an expected error rate of no more than e
- Example: finding matching (i.e., positive) documents in a large collection such as the Web
- Matching documents are positive
- All others are negative
6 Related Work
- Text Classification techniques
- Naïve Bayesian
- K-nearest neighbor
- Support vector machines
- Each requires labeled data for all classes
- Problem similar to traditional information retrieval
- IR rank-orders documents according to their similarity to the query document
- But it does not perform document classification
7 Theoretical Foundations
- Some discussion regarding the theoretical foundations, focused primarily on:
- Minimization of the probability of error
- Expected recall and precision of functions
- Pr[f(X) ≠ Y] = Pr[f(X) = 1] - Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1]   (1)
- Painful, painful, but it did show that you can build accurate classifiers with high probability when sufficient documents in P (the positive document set) and M (the unlabeled set) are available (a derivation of (1) is sketched below)
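A short derivation of (1), using only the fact that f(X) and Y are binary:

```latex
\begin{align*}
\Pr[f(X)\neq Y] &= \Pr[f(X)=1, Y=0] + \Pr[f(X)=0, Y=1] \\
  &= \Pr[f(X)=1] - \Pr[f(X)=1, Y=1] + \Pr[f(X)=0, Y=1] \\
  &= \Pr[f(X)=1] - \bigl(\Pr[Y=1] - \Pr[f(X)=0, Y=1]\bigr) + \Pr[f(X)=0, Y=1] \\
  &= \Pr[f(X)=1] - \Pr[Y=1] + 2\,\Pr[f(X)=0 \mid Y=1]\,\Pr[Y=1]
\end{align*}
```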
8 Theoretical Foundations (cont.)
- Two serious practical drawbacks to the theoretical method:
- The constrained optimization problem may not be easy to solve for the function class in which we are interested
- It is not easy to choose a desired recall level that will give a good classifier using the function class we are using
9 Proposed Technique
- Theory be darned!
- The paper introduces a practical technique based on the naïve Bayes classifier and the Expectation-Maximization (EM) algorithm
- After introducing the general technique, the authors offer an enhancement using spies
10 Proposed Technique: Terms
- D is the set of training documents
- V = <w1, w2, ..., w|V|> is the set of all words considered for classification
- w_{di,k} is the word in position k of document di
- N(wt, di) is the number of times wt occurs in di
- C = {c1, c2} is the set of predefined classes
- P is the set of positive documents
- M is the set of unlabeled documents
- S is the set of spy documents
- The posterior probability Pr[cj | di] ∈ {0, 1} depends on the class label of the document
11 Proposed Technique: Naïve Bayesian Classifier (NB-C)
- Pr[cj] = Σ_{i=1..|D|} Pr[cj | di] / |D|   (2)
- Pr[wt | cj] = (1 + Σ_{i=1..|D|} N(wt, di) Pr[cj | di]) / (|V| + Σ_{s=1..|V|} Σ_{i=1..|D|} N(ws, di) Pr[cj | di])   (3)
- and assuming the words are independent given the class:
- Pr[cj | di] = Pr[cj] Π_{k=1..|di|} Pr[w_{di,k} | cj] / Σ_{r=1..|C|} (Pr[cr] Π_{k=1..|di|} Pr[w_{di,k} | cr])   (4)
- The class with the highest Pr[cj | di] is assigned as the class of the doc (a sketch of (2)-(4) follows below)
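A minimal sketch of the NB-C equations in Python, assuming documents are word-count dictionaries and class memberships are probabilities Pr[cj | di]; the helper names are hypothetical, not the authors' code:

```python
from collections import defaultdict
from math import log, exp

def train_nb(docs, post, vocab):
    """docs: list of {word: count}; post: list of [Pr(c1|d), Pr(c2|d)]; vocab: list of words."""
    n_classes = 2
    # Equation (2): class priors from the (possibly probabilistic) labels
    prior = [sum(p[j] for p in post) / len(docs) for j in range(n_classes)]
    # Equation (3): word probabilities with add-one smoothing
    counts = [defaultdict(float) for _ in range(n_classes)]
    totals = [0.0] * n_classes
    for d, p in zip(docs, post):
        for w, n in d.items():
            for j in range(n_classes):
                counts[j][w] += p[j] * n
                totals[j] += p[j] * n
    cond = [{w: (1.0 + counts[j][w]) / (len(vocab) + totals[j]) for w in vocab}
            for j in range(n_classes)]
    return prior, cond

def classify_nb(doc, prior, cond):
    """Equation (4): posterior Pr[cj | d], computed in log space and normalized."""
    logs = []
    for j in range(len(prior)):
        s = log(prior[j]) if prior[j] > 0 else float("-inf")
        for w, n in doc.items():
            if w in cond[j]:
                s += n * log(cond[j][w])
        logs.append(s)
    m = max(logs)
    exps = [exp(s - m) for s in logs]
    z = sum(exps)
    return [e / z for e in exps]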
12 Proposed Technique: EM Algorithm
- Popular class of iterative algorithms for maximum likelihood estimation in problems with incomplete data
- Two steps:
- Expectation: fills in the missing data
- Maximization: parameters are re-estimated
- Rinse and repeat
- Using an NB-C, (4) is the E step (filling in the class probabilities) and (2) and (3) are the M step (re-estimating the parameters)
- The probability of a class now takes a value in [0, 1] instead of {0, 1}
13 Proposed Technique: EM Algorithm (cont.)
- All positive documents have the class value c1
- Need to determine the class value of each doc in the mixed set
- EM can help assign a probabilistic class label to each document dj in the mixed set: Pr[c1 | dj] and Pr[c2 | dj]
- After a number of iterations, all the probabilities will converge
14 Proposed Technique: Step 1 - Reinitialization (I-EM)
- Reinitialization:
- Build an initial NB-C using the document sets M and P
- For each document dj in P, Pr[c1 | dj] = 1 and Pr[c2 | dj] = 0
- For each document dj in M, Pr[c1 | dj] = 0 and Pr[c2 | dj] = 1
- Loop while classifier parameters change:
- For each document dj in M:
- Compute Pr[c1 | dj] using the current NB-C
- Pr[c2 | dj] = 1 - Pr[c1 | dj]
- Update Pr[wt | c1] and Pr[c1] given the probabilistically assigned classes for the dj (Pr[c1 | dj]) and the documents in P (a new NB-C is being built in the process)
- Works well on easy datasets (see the sketch below)
- Problem: our initialization is strongly biased towards positive documents
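A minimal sketch of I-EM, reusing the hypothetical train_nb / classify_nb helpers from the NB-C sketch; the slides loop until the parameters stop changing, while this sketch uses a fixed iteration cap for simplicity:

```python
def i_em(M, P, vocab, iters=8):
    docs = P + M
    # Reinitialization: Pr[c1|d] = 1 for P, Pr[c2|d] = 1 for M
    post = [[1.0, 0.0] for _ in P] + [[0.0, 1.0] for _ in M]
    prior, cond = train_nb(docs, post, vocab)
    for _ in range(iters):
        # E step: probabilistically relabel only the mixed documents
        for i in range(len(P), len(docs)):
            post[i] = classify_nb(docs[i], prior, cond)
        # M step: rebuild the NB-C from P plus the relabeled mixed docs
        prior, cond = train_nb(docs, post, vocab)
    return prior, cond, post[len(P):]
```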
15 Proposed Technique: Step 1 - Spies
- Problem: our initialization is strongly biased towards positive documents
- Need to identify some very likely negative documents from the mixed set
- We do this by taking spy documents from the positive set P and putting them in the mixed set M (10% of P was used)
- A threshold t is set, and those documents with a probabilistic label less than t are identified as negative (a noise level of 15% was used to set t)
[Figure: spy documents from the positive set (class c1) are added to the unlabeled mixed set (class c2); mixed documents scoring below the threshold form the likely-negative set]
16 Proposed Technique: Step 1 - Spies (cont.)
- N (most likely negative docs) = U (unlabeled docs) = {}
- S (spies) = sample(P, s%)
- MS = M ∪ S
- P = P - S
- Assign every document di in P the class c1
- Assign every document dj in MS the class c2
- Run I-EM(MS, P)
- Classify each document dj in MS
- Determine the probability threshold t using S
- For each document dj in M:
- If its probability Pr[c1 | dj] < t, then N = N ∪ {dj}
- Else U = U ∪ {dj} (the whole step is sketched below)
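A minimal sketch of Step 1 with spies, built on the i_em sketch above. The 10% spy fraction comes from the slides; exactly how the threshold t is derived from the spy scores is an assumption of this sketch:

```python
import random

def step1_spies(M, P, vocab, spy_frac=0.10, noise=0.15):
    S = random.sample(P, max(1, int(spy_frac * len(P))))  # S = sample(P, s%)
    P_rest = [d for d in P if d not in S]                  # P = P - S
    MS = M + S                                             # MS = M ∪ S
    _, _, post_MS = i_em(MS, P_rest, vocab)                # run I-EM(MS, P)
    pr_c1 = {id(d): p[0] for d, p in zip(MS, post_MS)}
    # Threshold t from the spies: allow `noise` of them to fall below t
    spy_scores = sorted(pr_c1[id(s)] for s in S)
    t = spy_scores[int(noise * len(spy_scores))]
    N = [d for d in M if pr_c1[id(d)] < t]                 # most likely negative docs
    U = [d for d in M if pr_c1[id(d)] >= t]                # remaining unlabeled docs
    return N, U, S
```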
17 Proposed Technique: Step 2 - Building the Final Classifier
- Using P, N and U as developed in the previous step
- Put all the spy documents S back in P
- Assign Pr[c1 | di] = 1 for all documents in P
- Assign Pr[c2 | di] = 1 for all documents in N; this will change with each iteration of EM
- Each doc dk in U is not assigned a label initially; at the end of the first iteration, it will have a probabilistic label Pr[c1 | dk]
- Run EM using the document sets P, N and U until it converges
- When EM stops, the final classifier has been produced (sketched below)
- This two-step technique is called S-EM (Spy EM)
18 Proposed Technique: Selecting a Classifier
- The local maximum that EM converges to may not cleanly separate the positive and negative documents
- Likely if there are many local clusters
- If so, from the set of classifiers developed over the iterations, select the one with the least estimated probability of error (see the sketch below)
- Refer to (1):
- Pr[f(X) ≠ Y] = Pr[f(X) = 1] - Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1]
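A minimal sketch of the selection step using (1). Pr[f(X)=1] is estimated on the mixed set M and Pr[f(X)=0 | Y=1] on the positive set P; Pr[Y=1] is unknown, so this sketch simply takes it as a parameter, which is an assumption rather than anything given on the slides:

```python
def estimated_error(clf, P, M, pr_y1=0.5):
    prior, cond = clf
    # Estimate Pr[f(X)=1] on the mixed set
    pr_f1 = sum(classify_nb(d, prior, cond)[0] > 0.5 for d in M) / len(M)
    # Estimate Pr[f(X)=0 | Y=1] on the positive set
    pr_f0_y1 = sum(classify_nb(d, prior, cond)[0] <= 0.5 for d in P) / len(P)
    return pr_f1 - pr_y1 + 2 * pr_f0_y1 * pr_y1

def select_classifier(classifiers, P, M):
    # Pick the classifier with the least estimated probability of error
    return min(classifiers, key=lambda clf: estimated_error(clf, P, M))
```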
19 Evaluation: Measurements
- Breakeven point
- The point where 0 = p - r, i.e., where precision p equals recall r
- Only evaluates the sorting order of the class probabilities of documents
- Not appropriate
- F score
- F = 2pr / (p + r)
- Measures performance on a particular class
- Reflects the average effect of both precision and recall
- Only when both p and r are large will F be large (see the example below)
- Accuracy
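A tiny check of the F score formula:

```python
def f_score(p, r):
    # F = 2pr / (p + r); guard against p = r = 0
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

print(f_score(0.95, 0.95))  # 0.95 -- both large, F is large
print(f_score(0.95, 0.10))  # ~0.18 -- one small, F collapses
```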
20 Evaluation: Results
- 2 large document corpora
- 20NG (removed UseNet headers and subject lines)
- WebKB (HTML tags removed)
- 8 iterations
- (F) = F score, (A) = accuracy, both in percent

          Pos size   M size   Pos in M   NB(F)   NB(A)   I-EM8(F)   I-EM8(A)   S-EM(F)   S-EM(A)
Average   405        4471     811        43.93   84.52   68.58      87.54      76.61     92.16
21 Evaluation: Results (cont.)
- Also varied the % of positive documents, both in P (a) and in M (b)
               Pos size   M size   Pos in M   NB(F)   NB(A)   I-EM8(F)   I-EM8(A)   S-EM(F)   S-EM(A)
a=20%, b=20%   405        3985     324        60.66   94.41   68.08      91.96      76.93     95.96
a=50%, b=20%   1013       3863     203        72.09   95.94   63.63      86.81      73.61     95.28
a=50%, b=50%   1013       4167     507        73.81   93.12   71.25      85.79      81.85     94.32
22 Conclusions
- This paper studied the problem of classification with only partial information: one class and a set of mixed documents
- Technique:
- Naïve Bayes classifier
- Expectation-Maximization algorithm
- Reinitialized using the positive documents and the most likely negative documents to compensate for the initial bias
- Uses an estimate of the classification error to select a good classifier
- Extremely accurate results