Partially Supervised Classification of Text Documents by Bing Liu, Philip Yu, and Xiaoli Li

Transcript and Presenter's Notes

Title: Partially Supervised Classification of Text Documents by Bing Liu, Philip Yu, and Xiaoli Li


1
Partially Supervised Classification of Text Documents
by Bing Liu, Philip Yu, and Xiaoli Li
  • Presented by Rick Knowles
  • 7 April 2005

2
Agenda
  • Problem Statement
  • Related Work
  • Theoretical Foundations
  • Proposed Technique
  • Evaluation
  • Conclusions

3
Problem Statement: Common Approach
  • Text categorization: the automated assignment of text
    documents to pre-defined classes
  • Common approach: supervised learning
  • Manually label a set of documents with pre-defined
    classes
  • Use a learning algorithm to build a classifier


4
Problem Statement: Common Approach (cont.)
  • Problem: the bottleneck associated with the large
    number of labeled training documents needed to build
    the classifier
  • Nigam, et al., have shown that using a large dose
    of unlabeled data can help

5
A different approach: Partially supervised
classification
  • Two-class problem: positive and unlabeled
  • Key feature: there is no labeled negative
    document
  • Can be posed as a constrained optimization
    problem
  • A function that correctly classifies all
    positive docs and minimizes the number of mixed
    docs classified as positive will have an expected
    error rate of no more than e
  • Example: finding matching (i.e., positive)
    documents in a large collection such as the
    Web
  • Matching documents are positive
  • All others are negative

6
Related Work
  • Text Classification techniques
  • Naïve Bayesian
  • K-nearest neighbor
  • Support vector machines
  • Each requires labeled data for all classes
  • The problem is similar to traditional information
    retrieval
  • Rank orders documents according to their
    similarities to the query document
  • Does not perform document classification

7
Theoretical Foundations
  • Some discussion of the theoretical foundations,
    focused primarily on:
  • Minimization of the probability of error
  • Expected recall and precision of functions
  • Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1]
    + 2·Pr[f(X) = 0 | Y = 1]·Pr[Y = 1]                    (1)
  • Painful, painful, but it did show that you can build
    accurate classifiers with high probability when
    sufficient documents in P (the positive document
    set) and M (the unlabeled set) are available.
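As a quick sanity check of equation (1), the snippet below plugs in made-up probabilities (all three values are illustrative assumptions, not numbers from the paper):

```python
# Hypothetical numbers plugged into equation (1).
pr_f1 = 0.5           # Pr[f(X) = 1]: fraction the classifier labels positive
pr_y1 = 0.4           # Pr[Y = 1]: true fraction of positive documents
pr_f0_given_y1 = 0.1  # Pr[f(X) = 0 | Y = 1]: 1 minus recall on the positives

# Pr[f(X) != Y] = Pr[f(X)=1] - Pr[Y=1] + 2 * Pr[f(X)=0 | Y=1] * Pr[Y=1]
err = pr_f1 - pr_y1 + 2 * pr_f0_given_y1 * pr_y1
print(err)  # about 0.18
```

For a fixed Pr[f(X) = 1], the error bound grows as recall on the positive class drops, which is what motivates constraining recall in the optimization view on the earlier slide.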

8
Theoretical Foundations (cont.)
  • Two serious practical drawbacks to the
    theoretical method
  • Constrained optimization problem may not be easy
    to solve for the function class in which we are
    interested
  • Not easy to choose a desired recall level that
    will give a good classifier using the function
    class we are using

9
Proposed Technique
  • Theory be darned!
  • Paper introduces a practical technique based on
    the naïve Bayes classifier and the
    Expectation-Maximization (EM) algorithm
  • After introducing a general technique, the
    authors offer an enhancement using spies

10
Proposed Technique: Terms
  • D is the set of training documents
  • V = ⟨w1, w2, …, w|V|⟩ is the set of all words
    considered for classification
  • wdi,k is the word in position k in document di
  • N(wt, di) is the number of times wt occurs in di
  • C = {c1, c2} is the set of predefined classes
  • P is the set of positive documents
  • M is the set of unlabeled documents
  • S is the set of spy documents
  • Posterior probability Pr[cj | di] ∈ {0, 1} depends
    on the class label of the document

11
Proposed Technique: naïve Bayesian classifier
(NB-C)
  • Pr[cj] = Σi Pr[cj | di] / |D|                        (2)
  • Pr[wt | cj] = (1 + Σi Pr[cj | di] N(wt, di)) /
    (|V| + Σs Σi Pr[cj | di] N(ws, di))                  (3)
    (i runs over the documents in D, s over the words in V)
  • and, assuming the words are independent given the
    class,
  • Pr[cj | di] = Pr[cj] Πk Pr[wdi,k | cj] /
    Σr Pr[cr] Πk Pr[wdi,k | cr]                          (4)
    (k runs over the positions in di, r over the classes in C)
  • The class with the highest Pr[cj | di] is assigned
    as the class of the doc
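A minimal sketch of equations (2)-(4) in Python, assuming hard labels (Pr[cj | di] ∈ {0, 1}) and toy two-word documents; the function names and data are illustrative, not from the paper:

```python
from collections import Counter

def train_nb(docs, labels, classes, vocab):
    """Estimate Pr[cj] (eq. 2) and Laplace-smoothed Pr[wt|cj] (eq. 3)."""
    n = len(docs)
    prior = {c: sum(1 for l in labels if l == c) / n for c in classes}
    counts = {c: Counter() for c in classes}
    for d, l in zip(docs, labels):
        counts[l].update(w for w in d if w in vocab)
    cond = {}
    for c in classes:
        total = sum(counts[c].values())
        # the "1 +" numerator and "|V| +" denominator are the Laplace terms
        cond[c] = {w: (1 + counts[c][w]) / (len(vocab) + total) for w in vocab}
    return prior, cond

def classify(doc, prior, cond, classes):
    """Eq. (4) up to the shared normalizer: return the argmax class."""
    def score(c):
        s = prior[c]
        for w in doc:
            if w in cond[c]:
                s *= cond[c][w]
        return s
    return max(classes, key=score)

# Toy data: c1 = positive (sports-like), c2 = negative.
docs = [["ball", "goal"], ["goal", "match"], ["stock", "market"]]
labels = ["c1", "c1", "c2"]
vocab = {w for d in docs for w in d}
prior, cond = train_nb(docs, labels, ["c1", "c2"], vocab)
print(classify(["goal", "ball"], prior, cond, ["c1", "c2"]))  # c1
```

With hard labels the sums in (2) and (3) reduce to plain counts; the EM slides that follow replace them with fractional (soft) counts.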
12
Proposed Technique: EM algorithm
  • Popular class of iterative algorithms for maximum
    likelihood estimation in problems with incomplete
    data.
  • Two steps
  • Expectation fills in the missing data
  • Maximization parameters are estimated
  • Rinse and repeat
  • Using a NB-C, (4) equates to the E step, and (2)
    and (3) are the M step
  • The probability of a class now takes a value in
    [0, 1] instead of {0, 1}

13
Proposed Technique: EM algorithm (cont.)
  • All positive documents have the class value c1
  • Need to determine class value of each doc in
    mixed set.
  • EM can help assign a probabilistic class label to
    each document dj in the mixed set
  • Pr[c1 | dj] and Pr[c2 | dj]
  • After a number of iterations, all the
    probabilities will converge

14
Proposed Technique: Step 1 - Reinitialization
(I-EM)
  • Reinitialization
  • Build an initial NB-C using the document sets M
    and P
  • For each dj in P, Pr[c1 | dj] = 1 and Pr[c2 | dj] = 0
  • For each dj in M, Pr[c1 | dj] = 0 and Pr[c2 | dj] = 1
  • Loop while the classifier parameters change
  • For each document dj ∈ M
  • Compute Pr[c1 | dj] using the current NB-C
  • Pr[c2 | dj] = 1 − Pr[c1 | dj]
  • Update Pr[wt | c1] and Pr[c1] given the
    probabilistically assigned class for dj
    (Pr[c1 | dj]) and P (a new NB-C is being built in
    the process)
  • Works well on easy datasets
  • Problem: the initialization is strongly
    biased towards positive documents
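The loop above can be sketched as a soft-label EM around naive Bayes (a simplified illustration of the same equations; the toy documents and helper names are assumptions, not the authors' code):

```python
from collections import Counter

def m_step(docs, post, vocab):
    """Re-estimate Pr[c1] and smoothed Pr[w|c] from soft labels (eqs. 2-3)."""
    prior1 = sum(post) / len(docs)
    num1, num2 = Counter(), Counter()
    for d, p in zip(docs, post):
        for w in d:
            num1[w] += p        # fractional count toward c1
            num2[w] += 1 - p    # fractional count toward c2
    tot1, tot2 = sum(num1.values()), sum(num2.values())
    cond1 = {w: (1 + num1[w]) / (len(vocab) + tot1) for w in vocab}
    cond2 = {w: (1 + num2[w]) / (len(vocab) + tot2) for w in vocab}
    return prior1, cond1, cond2

def e_step(doc, prior1, cond1, cond2):
    """Pr[c1 | d] via eq. (4)."""
    s1, s2 = prior1, 1 - prior1
    for w in doc:
        s1 *= cond1[w]
        s2 *= cond2[w]
    return s1 / (s1 + s2)

def i_em(P, M, iters=8):
    """I-EM: P stays pinned to c1; only the mixed set's labels are revised."""
    docs = P + M
    post = [1.0] * len(P) + [0.0] * len(M)  # init: P -> c1, M -> c2
    vocab = {w for d in docs for w in d}
    for _ in range(iters):
        prior1, cond1, cond2 = m_step(docs, post, vocab)
        post = [1.0] * len(P) + [e_step(d, prior1, cond1, cond2) for d in M]
    return post[len(P):]  # Pr[c1 | dj] for each dj in M
```

With P holding two sports-like toy documents and M holding one sports-like and two finance-like documents, the sports-like mixed document ends up with the highest Pr[c1 | dj], illustrating the bias the slide mentions: everything in M starts as negative, and only the strongly positive-looking documents climb out.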

15
Proposed Technique: Step 1 - Spies
  • Problem: the initialization is strongly biased
    towards positive documents
  • Need to identify some very likely negative
    documents from the mixed set
  • We do this by sending spy documents from the
    positive set P into the mixed set M
  • (10% of P was used)
  • A threshold t is set, and those documents with a
    probabilistic label less than t are identified as
    negative
  • (15% was the threshold used)

[Figure: the positive set (c1), with some documents sent
as spies into the mixed/unlabeled set (c2); unlabeled
documents scoring below the spies are marked likely
negative]
16
Proposed Technique: Step 1 - Spies (cont.)
  • N (most likely negative docs) = ∅; U (unlabeled
    docs) = ∅
  • S (spies) = Sample(P, s%)
  • MS = M ∪ S
  • P = P − S
  • Assign every document di in P the class c1
  • Assign every document dj in MS the class c2
  • Run I-EM(MS, P)
  • Classify each document dj in MS
  • Determine the probability threshold t using S
  • For each document dj in M
  • If its probability Pr[c1 | dj] < t, N = N ∪ {dj}
  • Else U = U ∪ {dj}
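The threshold step can be sketched as follows, assuming we already have Pr[c1 | dj] for the mixed documents and for the spies (the function name, the index-based return, and the 15% noise default are illustrative assumptions, echoing the threshold mentioned on the previous slide):

```python
def spy_split(prob_m, prob_spies, noise=0.15):
    """Choose t so that roughly (1 - noise) of the spies score at or
    above it, then split M: below t -> N (likely negative), else U."""
    spies_sorted = sorted(prob_spies)
    t = spies_sorted[int(noise * len(prob_spies))]
    N = [j for j, p in enumerate(prob_m) if p < t]
    U = [j for j, p in enumerate(prob_m) if p >= t]
    return N, U, t

# Spies mostly score high; one noisy spy scores low and sets the bar.
N, U, t = spy_split([0.01, 0.6, 0.03], [0.05, 0.8, 0.85, 0.9])
print(N, U, t)  # [0, 2] [1] 0.05
```

The idea is that spies are known positives, so any mixed document that the classifier ranks below (almost all of) the spies is very unlikely to be positive.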

17
Proposed Technique: Step 2 - Building the final
classifier
  • Using P, N, and U as developed in the previous
    step
  • Put all the spy documents S back in P
  • Assign Pr[c1 | di] = 1 for all documents in P
  • Assign Pr[c2 | di] = 1 for all documents in N.
    This will change with each iteration of EM
  • Each doc dk in U is not assigned a label
    initially. At the end of the first iteration, it
    will have a probabilistic label Pr[c1 | dk]
  • Run EM using the document sets P, N, and U until
    it converges
  • When EM stops, the final classifier has been
    produced
  • This two-step technique is called S-EM (Spy EM)

18
Proposed Technique: Selecting a classifier
  • At a local maximum, the final classifier may
    not cleanly separate the positive and negative
    documents
  • Likely if there are many local clusters
  • If so, from the set of classifiers developed over
    each iteration, select the one with the least
    probability of error
  • Refer to (1)
  • Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1]
    + 2·Pr[f(X) = 0 | Y = 1]·Pr[Y = 1]

19
Evaluation: Measurements
  • Breakeven point
  • The point where p − r = 0, i.e. precision p equals
    recall r
  • Only evaluates the sorting order of class
    probabilities of documents
  • Not appropriate
  • F score
  • F = 2pr / (p + r)
  • Measures performance on a particular class
  • Reflects the average effect of both precision and
    recall
  • Only when both p and r are large will F be large
  • Accuracy
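A quick computation showing why both p and r must be large for F to be large (the precision/recall values are made up):

```python
def f_score(p, r):
    """F = 2pr / (p + r): the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

print(f_score(0.8, 0.5))   # ~0.615: both reasonably high
print(f_score(0.8, 0.05))  # ~0.094: high precision cannot rescue low recall
```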

20
Evaluation: Results
  • 2 large document corpora
  • 20NG
  • Removed UseNet headers and subject lines
  • WebKB
  • HTML tags removed
  • 8 iterations

           Pos Size  M size  Pos in M  NB(F)  NB(A)  I-EM8(F)  I-EM8(A)  S-EM(F)  S-EM(A)
  Average  405       4471    811       43.93  84.52  68.58     87.54     76.61    92.16
21
Evaluation: Results (cont.)
  • Also varied the % of positive documents both in P
    (a) and in M (b)

               Pos Size  M size  Pos in M  NB(F)  NB(A)  I-EM8(F)  I-EM8(A)  S-EM(F)  S-EM(A)
  a=20% b=20%  405       3985    324       60.66  94.41  68.08     91.96     76.93    95.96
  a=50% b=20%  1013      3863    203       72.09  95.94  63.63     86.81     73.61    95.28
  a=50% b=50%  1013      4167    507       73.81  93.12  71.25     85.79     81.85    94.32
22
Conclusions
  • This paper studied the problem of classification
    with only partial information: one class and a
    set of mixed documents
  • Technique
  • Naïve Bayes classifier
  • Expectation Maximization algorithm
  • Reinitialized using the positive documents and
    the most likely negative documents to compensate
    for the bias
  • Use estimate of classification error to select a
    good classifier
  • Extremely accurate results