Title: PARTIALLY SUPERVISED CLASSIFICATION OF TEXT DOCUMENTS
1 PARTIALLY SUPERVISED CLASSIFICATION OF TEXT DOCUMENTS
- authors
- B. Liu, W.S. Lee, P.S. Yu, X. Li
- presented by
- Rafal Ladysz
2 WHAT IT IS ABOUT
- the paper shows that
- document classification with one class of positively labeled documents,
- accompanied by a set of unlabeled, mixed documents,
- enables building accurate classifiers
- using the EM algorithm based on NB classification
- strengthening EM with so-called spy documents
- experimental results for illustration
- we will browse through the paper and
- emphasize/refresh some of its theoretical aspects
- try to understand the methods described
- look at the results obtained and interpret them
3 AGENDA (informally)
- problem described
- document classification
- PSC - general assumptions
- PSC - some theory
- Bayes basics
- EM in general
- I-EM algorithm
- introducing spies
- I-S-EM algorithm
- selecting classifier
- experimental data
- results and conclusions
- references
4 KEY PROBLEM: a big picture
- no labeled negative training data (text documents)
- only a (small) set of relevant (positive) documents
- necessity to classify unlabeled text documents
- importance
- finding relevant texts on the web
- or in digital libraries
5 DOCUMENT CLASSIFICATION: some techniques used
- kNN (k-Nearest Neighbors)
- Linear Least Squares Fit
- SVM (Support Vector Machines)
- Naive Bayes - utilized here
6 PARTIALLY SUPERVISED CLASSIFICATION (PSC): theoretical foundations
- fixed distribution D over the space X × Y, where Y = {0, 1}
- X, Y are the sets of possible documents and classes (positive and negative), respectively
- an example is a labeled document
- two sets of documents
- the labeled positive set P of size n1, drawn from D_{X|Y=1}
- the unlabeled set M of size n2, drawn independently from X according to D_X
- remark: there might be some relevant documents in M (but we don't know about their existence!)
7 PSC cont.
- Pr_D[A]: probability of A ⊆ X × Y chosen randomly according to D
- T: a finite sample, a subset of our dataset
- Pr_T[A]: probability of A ⊆ T ⊆ X × Y chosen randomly
- the learning algorithm deals with F, a class of functions, and selects a function f from F
- f: X → {0, 1}, to be used by the classifier
- probability of error: Pr[f(X) ≠ Y]
- = Pr[f(X) = 1 ∧ Y = 0] + Pr[f(X) = 0 ∧ Y = 1]
- the sum of the false positive and false negative probabilities
8 PSC approximations (1)
- after transforming the expression for the probability of error Pr[f(X) ≠ Y]:
- Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2·Pr[f(X) = 0 | Y = 1]·Pr[Y = 1]
- notice Pr[Y = 1] = const (no changes of criteria)
- approximation 1
- keeping Pr[f(X) = 0 | Y = 1] small
- learning error ≈ Pr[f(X) = 1] − Pr[Y = 1]
- since Pr[Y = 1] = const ⇒
- ⇒ minimizing the error reduces to minimizing Pr[f(X) = 1]
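The decomposition on this slide can be reconstructed from the error definition on slide 7; the following derivation is standard probability algebra, not copied verbatim from the paper:

```latex
\begin{align*}
\Pr[f(X)\neq Y] &= \Pr[f(X)=1 \wedge Y=0] + \Pr[f(X)=0 \wedge Y=1] \\
                &= \Pr[f(X)=1] - \Pr[f(X)=1 \wedge Y=1] + \Pr[f(X)=0 \wedge Y=1] \\
                &= \Pr[f(X)=1] - \bigl(\Pr[Y=1] - \Pr[f(X)=0 \wedge Y=1]\bigr) + \Pr[f(X)=0 \wedge Y=1] \\
                &= \Pr[f(X)=1] - \Pr[Y=1] + 2\,\Pr[f(X)=0 \mid Y=1]\,\Pr[Y=1]
\end{align*}
```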
9 PSC approximations (2)
- error: Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2·Pr[f(X) = 0 | Y = 1]·Pr[Y = 1]
- approximation 2
- keeping Pr[f(X) = 0 | Y = 1] small
- AND minimizing Pr[f(X) = 1] ⇒
- ⇒ minimizing Pr_M[f(X) = 1] (assumption: most of M is irrelevant)
- AND keeping Pr_P[f(X) = 1] ≥ r
- where r is the recall: (relevant retrieved) / (all relevant)
- for large enough sets P (positive) and M (unlabeled)
10 CONSTRAINT OPTIMIZATION
- simply summarizing what has just been said
- good learning results are achievable if
- the learning algorithm minimizes the number of unlabeled examples labeled as positive
- the constraint that the fraction of errors on the positive examples is ≤ 1 − recall (declared upfront) is satisfied
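A compact restatement of this constrained problem, using the notation of slides 6-9 (P the positive sample, M the unlabeled sample, r the pre-declared recall):

```latex
\min_{f \in F} \; \Pr_{M}\bigl[f(X) = 1\bigr]
\qquad \text{subject to} \qquad
\Pr_{P}\bigl[f(X) = 0\bigr] \;\le\; 1 - r .
```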
11 COMPLEXITY FUNCTION (CF)
- VC-dimension: a complexity measure of F (the class of functions)
- meaning: the cardinality of the largest sample set T ⊆ X such that F_T = 2^T (F shatters T)
- thus the larger such T, the more functions in F
- conversely, the higher the VC-dimension, the more functions in F
- Naive Bayes: VC-dim ≤ 2m + 1
- where m is the cardinality of the classifier's vocabulary
12 CF: two cases
- no noise: ∃ f_t ∈ F such that ∀(X, Y) ~ D: Y = f_t(X) (a perfect function)
- it can be shown that selecting f ∈ F
- which minimizes Σ_{i=1..n2} f(X_i) over M
- AND has total recall on the set of positives (P)
- results in a function with small expected error
- noise: Y may or may not equal f_t(X)
- F may or may not contain the target function
- labels are noisy
- specifying a target expected recall is required
13 CF in noise: modus operandi
- the learning algorithm tries to output f ∈ F such that
- E[recall(f)] ≥ r (that's why a target recall is required)
- E[precision(f)] ≈ the best available among f ∈ F with recall(f) ≥ r
- how the algorithm achieves that
- selecting a set of positive examples from D_{X|Y=1} and unlabeled examples from D_X
- searching for a function f which minimizes Σ_{i=1..n2} f(Z_i) over the unlabeled examples
- under the constraint that the fraction of errors on the positives is ≤ 1 − r
14 PROBABILITY vs. LIKELIHOOD
- in the Webster dictionary apparently synonyms
- from the probabilistic point of view
- s_i: mutually exclusive states of nature
- assuming the prior probabilities P(s_i) are known
- observing experimental outcomes o_j gives more information
- suppose that ∀o_j ∀s_i the quantity P(o_j|s_i) is known
- it is the likelihood of the outcome o_j given the state s_i
- Bayes' theorem combines the prior probabilities with the likelihoods
- and determines the posterior probability of each s_i
- likelihood: the probability of the observed experimental outcome given a state
15 NAIVE BAYES in general
- formally, Bayes' theorem can be formulated as
- P(S_i|O_j) = P(O_j|S_i)·P(S_i) / Σ_{k=1..n} P(O_j|S_k)·P(S_k)
- and is called the Inverse Probability Law
- NB model assumptions
- words are randomly selected from the lexicon, with replacement
- word independence (words as components of a feature vector)
- even though simplistic, it works pretty well
- NB together with EM will be employed here
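A tiny numeric illustration of the Inverse Probability Law above; the priors and likelihoods are made-up values, chosen only to show how a likelihood P(O|S) converts a prior P(S) into a posterior P(S|O):

```python
priors = {"s1": 0.7, "s2": 0.3}        # P(S_i): prior beliefs about the states
likelihoods = {"s1": 0.2, "s2": 0.9}   # P(O|S_i): likelihood of the observed outcome O

# Denominator: total probability of the outcome, sum_k P(O|S_k) P(S_k)
evidence = sum(likelihoods[s] * priors[s] for s in priors)

# Posterior P(S_i|O) = P(O|S_i) P(S_i) / evidence
posteriors = {s: likelihoods[s] * priors[s] / evidence for s in priors}
print(posteriors)  # {'s1': 0.3414..., 's2': 0.6585...}
```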
16 NB-based text classification: formalism
- D: training set of documents
- each document is an ordered list of words w_t
- V = <w_1, w_2, ..., w_|V|>: the vocabulary used
- w_{d_i,k} is the word ∈ V in position k of document d_i
- C = {c_1, c_2, ..., c_|C|}: predefined classes, here {c_1, c_2}
- Pr[c_j|d_i]: the posterior probability needed
- total probability: Pr[c_j] = Σ_i Pr[c_j|d_i] / |D| (indeed Pr[d_i] = 1/|D|)
- in the NB model, the class with the highest Pr[c_j|d_i] is assigned to the document
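A minimal sketch of these NB estimates, assuming documents arrive as word-count vectors and classes carry probabilistic labels; the helper names nb_train / nb_posterior are mine, not the paper's, and the Laplace-smoothed estimates follow the standard multinomial NB form that the slides' eqs. (3)-(5) refer to:

```python
import numpy as np

def nb_train(counts, p_c1):
    """Multinomial NB with probabilistic class labels.

    counts: (n_docs, n_words) array of word counts N(w_t, d_i)
    p_c1:   (n_docs,) array of Pr[c1|d_i]; Pr[c2|d_i] = 1 - p_c1
    Returns class priors Pr[c_j] and Laplace-smoothed word probabilities Pr[w_t|c_j].
    """
    n_docs, n_words = counts.shape
    p = np.stack([p_c1, 1.0 - p_c1])        # (2, n_docs) class memberships
    prior = p.sum(axis=1) / n_docs          # Pr[c_j] = sum_i Pr[c_j|d_i] / |D|
    w_counts = p @ counts                   # (2, n_words) expected word counts per class
    word_prob = (1.0 + w_counts) / (n_words + w_counts.sum(axis=1, keepdims=True))
    return prior, word_prob

def nb_posterior(counts, prior, word_prob):
    """Pr[c1|d] for each document, computed in log space for numerical stability."""
    log_joint = np.log(prior) + counts @ np.log(word_prob).T   # (n_docs, 2)
    log_joint -= log_joint.max(axis=1, keepdims=True)          # avoid underflow
    joint = np.exp(log_joint)
    return joint[:, 0] / joint.sum(axis=1)
```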
17 ITERATIVE EXPECTATION-MAXIMIZATION ALGORITHM (I-EM): a concept
- a general method of maximum-likelihood estimation
- of the parameters of an underlying distribution
- when the data is incomplete
- two main applications of the EM algorithm
- when the data has missing values due to problems with the observation process
- when optimizing the likelihood function is analytically hard
- but the likelihood function can be simplified
- by assuming values for additional, hidden parameters
18 I-EM: mathematically
- θ^(i+1) = argmax_θ Σ_z P(Z = z | x, θ^(i)) · log L(x, Z = z | θ)
- where
- x is the observed (incomplete) data,
- Z represents all hidden (unknown, missing) data,
- θ stands for all (sought-after) parameters
- problem: determine the parameter θ on the basis of the observed data alone,
- i.e. without knowledge of the complete data set
- solution: exploit the observations and determine iteratively the missing data and θ
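Split into its two steps, this is the textbook EM formulation (the Q-function notation below is standard, not taken from the paper):

```latex
% E-step: expected complete-data log-likelihood under the current estimate
Q\bigl(\theta \mid \theta^{(i)}\bigr) = \sum_{z} P\bigl(Z = z \mid x, \theta^{(i)}\bigr)\,\log L(x, Z = z \mid \theta)

% M-step: re-estimate the parameters
\theta^{(i+1)} = \arg\max_{\theta}\; Q\bigl(\theta \mid \theta^{(i)}\bigr)
```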
22 I-EM properties
- simple but computationally demanding
- convergence behavior
- no guarantee of reaching the global optimum
- the initial point θ^(0) determines whether the global optimum is reachable (or the algorithm gets stuck in a local optimum)
- stable: the likelihood function increases in every iteration until a (local if not global) optimum is reached
- M(aximum) L(ikelihood) estimates are fixed points of EM
23 I-EM ALGORITHM: why and how
- for the classification (the main objective)
- the posterior probability Pr[c_j|d_i] is needed
- the probabilities will converge during the iterations
- EM: an iterative algorithm for maximum-likelihood estimation with incomplete data (interpolates)
- two steps
- 1. expectation: filling in the missing data
- 2. maximization: estimating the parameters
- then the next iteration is launched
24 I-EM: symbols used
- D: training set of documents
- each document: an ordered list of words
- w_{d_i,k}: the kth word in the ith document
- each w_{d_i,k} ∈ V = {w_1, w_2, ..., w_|V|} (the vocabulary)
- vocabulary: all words occurring in the documents to be classified
- C = {c_1, c_2}: predefined classes (only 2)
25 I-EM: application
- initial labeling
- ∀d_i ∈ P: assigned c_1, i.e. Pr[c_1|d_i] = 1, Pr[c_2|d_i] = 0
- ∀d_j ∈ M: assigned c_2, i.e. Pr[c_1|d_j] = 0, Pr[c_2|d_j] = 1 (vice versa)
- an NB classifier (NB-C) is created, then applied to the dataset M
- computing the posterior probability Pr[c_1|d_j] in M (eq. 5)
- assigning the newly computed probabilistic label to d_j
- Pr[c_1|d_i] = 1 is not affected during the process
- in each iteration
- Pr[c_1|d_j] is revised, then
- a new NB-C is built based on the new Pr[c_1|d_j] for M, together with P
- iterating continues till convergence occurs
26 I-EM pseudocode
- I-EM(M, P)
- 1. build the initial NB classifier NB-C using the M and P sets
- 2. loop while the NB-C parameters keep changing
- (i.e. as long as convergence is still taking place)
- 3. for each document d_j ∈ M
- 4. compute Pr[c_1|d_j] using the current NB-C (eq. 5)
- // Pr[c_2|d_j] = 1 − Pr[c_1|d_j]; c_1 and c_2 are mutually exclusive
- // if Pr[c_1|d_j] > Pr[c_2|d_j] then d_j is classified as c_1
- 5. update Pr[w_t|c_1] and Pr[c_1] (eqs. 3, 4)
- // given the probabilistically assigned classes Pr[c_1|d_j] for d_j and the set P,
- // a new NB-C is built during this processing
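The loop above as a runnable sketch, reusing the hypothetical nb_train / nb_posterior helpers from the slide 16 example; as a simplification, convergence is checked on the probabilistic labels rather than directly on the NB parameters:

```python
import numpy as np

def i_em(counts_P, counts_M, max_iter=50, tol=1e-4):
    """I-EM sketch: P is labeled positive, M starts out labeled negative."""
    counts = np.vstack([counts_P, counts_M])
    n_P = counts_P.shape[0]
    # Initial labeling: Pr[c1|d] = 1 on P, 0 on M
    p_c1 = np.concatenate([np.ones(n_P), np.zeros(counts_M.shape[0])])
    for _ in range(max_iter):
        prior, word_prob = nb_train(counts, p_c1)         # M-step (eqs. 3, 4)
        new_M = nb_posterior(counts_M, prior, word_prob)  # E-step (eq. 5)
        if np.abs(new_M - p_c1[n_P:]).max() < tol:        # labels stopped changing
            break
        p_c1[n_P:] = new_M                                # labels on P stay fixed at 1
    return prior, word_prob, p_c1[n_P:]
```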
27 I-EM: benefits and limitations
- EM helps assign probabilistic class labels Pr[c_1|d_j] and Pr[c_2|d_j] to each d_j in the mixed set of documents
- all the above probabilities converge over the iterations
- the final result is sensitive to the initial conditions assumed
- conclusion
- good handling of easy data (+/− easily separable)
- a niche for improvement on hard data
- source of the limitation: the initialization is strongly biased towards the positive data (documents)
- solution
- balanced initialization (+/−)
- find reliable negative documents for initializing c_2 in EM
28 I-EM extension
- I-EM helps identify the (most likely) negatives in M
- issue: how to get data (documents) as reliable as possible to do so
- idea: using spy documents from P in M
- approach
- select s ≈ 10% of the documents from P, denoted S
- add the S set to the M set
- the documents in S behave as the unknown positive documents in M do
- enabling inference within M
- I-EM is still in use
- but instead of M it operates on M ∪ S
29 SPIES: determining the threshold
- set of spy documents S = {s_1, s_2, ..., s_k}
- Pr[c_1|s_i]: the probabilistic label assigned to each spy s_i
- in the noiseless case: t = min Pr[c_1|s_i], i = 1, 2, ..., k
- equivalent to retrieving all spy documents
- in a more realistic scenario, noise and outliers exist
- the minimum probability might be unreliable, because e.g.
- for an outlier s_i in S, the posterior Pr[c_1|s_i] might be << Pr[c_1|d_j] for d_j ∈ M
- setting t
- sort the s_i in S according to Pr[c_1|s_i]
- set the noise level l (e.g. 15%) so that l% of the spy documents have probability < t
- thus, the Step-1 objective is
- identifying a set of reliable negative documents from the unlabeled set
- the unlabeled set is to be treated as negative data (documents)
30 SPY DOCUMENTS and the Step-1 algorithm
- the threshold t is used for decision making
- if Pr[c_1|d_j] < t, d_j is denoted as N(egative)
- if Pr[c_1|d_j] ≥ t for d_j ∈ M, d_j remains U(nlabeled)
- algorithm Step-1
- identifies the most likely negatives N from the unlabeled set U
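A sketch of Step-1 under the parameters quoted on slides 28-29 (spy fraction s ≈ 10%, noise level l = 15%); the helper i_em is the illustrative sketch from slide 26, not the authors' code:

```python
import numpy as np

def step1(counts_P, counts_M, spy_frac=0.10, noise=0.15, seed=0):
    """Step-1 sketch: plant spies in M, run I-EM, threshold on spy posteriors."""
    rng = np.random.default_rng(seed)
    n_P, n_M = counts_P.shape[0], counts_M.shape[0]
    is_spy = np.zeros(n_P, dtype=bool)
    is_spy[rng.choice(n_P, size=max(1, int(spy_frac * n_P)), replace=False)] = True

    # Run I-EM with the spies moved from P into M
    mixed = np.vstack([counts_M, counts_P[is_spy]])
    _, _, p_mixed = i_em(counts_P[~is_spy], mixed)

    # Threshold t: chosen so that noise% of the spies fall below it
    spy_probs = np.sort(p_mixed[n_M:])
    t = spy_probs[int(noise * len(spy_probs))]

    likely_negative = p_mixed[:n_M] < t   # LN: reliable negatives
    return likely_negative, t             # the rest of M stays unlabeled (U)
```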
31 STEP-1 effect
[diagram: BEFORE - M (unlabeled, c2, containing the spies) and P (positive, c1); AFTER - LN (likely negative, c2), U (unlabeled, still holding some spies and positives) and P (positive, c1)]
- initial situation: M = P ∪ N, with no clue which documents are positive and which negative; spies from P are added to M
- with the help of the spies, most positives in M get into the unlabeled set U, while most negatives get into LN
- the purity of LN is higher than that of M
32 STEP-2: building and selecting the final classifier
- EM is still in use, but now with P, LN and U
- the algorithm proceeds as follows
- put all spies S back into P (where they were before)
- ∀d_i ∈ P: c_1 (i.e. Pr[c_1|d_i] = 1), fixed through the iterations
- ∀d_j ∈ LN: c_2 (i.e. Pr[c_2|d_j] = 1), changing through EM
- ∀d_k ∈ U: initially assigned no label (labels are assigned after EM(1))
- run EM using P, LN and U until it converges
- the final classifier is produced when EM stops
- all of this constitutes S-EM (spy EM)
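A Step-2 sketch in the same vein (again illustrative helpers, not the authors' code); every intermediate classifier is kept, anticipating the selection sub-step discussed on slide 36:

```python
import numpy as np

def step2(counts_P, counts_M, likely_negative, max_iter=50, tol=1e-4):
    """Step-2 sketch: EM over P (fixed c1), LN (initially c2) and U (no label)."""
    counts_LN = counts_M[likely_negative]
    counts_U = counts_M[~likely_negative]
    n_P = counts_P.shape[0]
    full = np.vstack([counts_P, counts_LN, counts_U])
    # EM(1) is trained on P and LN only; U has no label yet
    counts = np.vstack([counts_P, counts_LN])
    p_c1 = np.concatenate([np.ones(n_P), np.zeros(counts_LN.shape[0])])
    classifiers = []
    for _ in range(max_iter):
        prior, word_prob = nb_train(counts, p_c1)
        classifiers.append((prior, word_prob))
        # E-step: revise the labels of LN and U; labels of P stay fixed at 1
        p_rest = nb_posterior(np.vstack([counts_LN, counts_U]), prior, word_prob)
        new_p = np.concatenate([np.ones(n_P), p_rest])
        if new_p.shape == p_c1.shape and np.abs(new_p - p_c1).max() < tol:
            break
        counts, p_c1 = full, new_p   # from EM(2) onward, U participates
    return classifiers               # Step-2b picks one of these (slide 36)
```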
33 STEP-2 comments
- the probabilities of the sets U and LN are allowed to change
- the set U participates in EM from EM(2) onward, with its documents assigned the probabilistic labels Pr[c_1|d_k]
- experimenting with a = 5%, 10% or 20% gave similar results; why?
- for the parameter a (used for creating LN):
- within a range of approximately 5-20%, if too many positives land in LN, then
- EM corrects it, slowly adding them back to the positives
34 STEP-1 AND STEP-2 SUMMARY
- Step 1
- identifying a set of reliable negative documents from the unlabeled set; the unlabeled set is treated as negative data
- Step 2
- building and selecting a classifier, consisting of two sub-steps
- a) building a set of classifiers by iteratively applying a classification algorithm; the EM algorithm is used again
- b) selecting a good classifier from the set of classifiers constructed above; this sub-step may be called "catching a good classifier"
35 SELECTING CLASSIFIER
- as said, EM is prone to the local-maxima trap
- if a local maximum separates the two classes well: no problem (or problem solved)
- otherwise (i.e. positives and negatives each consist of many clusters) the data may not be separable
- remedy: stop iterating EM at some point
- at what point?
36 SELECTING CLASSIFIER continued
- eq. (2) can be helpful: the error probability
- Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2·Pr[f(X) = 0 | Y = 1]·Pr[Y = 1]
- it can be shown that knowing the component Pr_M[Y = c_1] allows us to estimate the error
- method: estimating the change of the error probability between iterations i and i+1
- Δ_i can be computed (formula in section 4.5 of the paper)
- if Δ_i > 0 for the first time, then the classifier produced at iteration i is the last one to keep (no need to proceed beyond i)
37 EXPERIMENTAL DATA described
- two large document corpora
- 30 datasets created
- e.g. 20 Newsgroups, subdivided into 4 groups
- all headers removed
- e.g. WebKB (CS departments), subdivided into 7 categories
- objective
- recovering the positive documents placed into the mixed sets
- no need for a test set separate from the training set
- the unlabeled mixed set serves as the test set
38 DATA description cont.
- for each experiment
- the full positive set is divided into two subsets, P and R
- P: the positive set used in the algorithm, holding a% of the full positive set
- R: the remaining documents, of which b% have been put into the (nominally negative) mixed set M (not all of R goes into M)
- belief: in reality M is large and has a small proportion of positive documents
- the parameters a and b have been varied to cover different scenarios
39 EXPERIMENTAL RESULTS
- techniques used
- NB-C: applied directly to P (c_1) and M (c_2) to build a classifier, which then classifies the data in set M
- I-EM: applies the EM algorithm to P and M for as long as it converges (no spies yet); the final classifier is applied to M to identify its positives
- S-EM: spies are used to re-initialize I-EM and build the final classifier; the threshold t is used
40 RESULTS cont.
- Table 1: 30 results for different parameters a, b
- Table 2: summary of averages for other a, b settings
- F-score: F = 2pr/(p + r), where p and r are precision and recall, respectively
- S-EM outperforms NB and I-EM in F dramatically
- accuracy (of a classifier): A = c/(c + i), where c and i are the numbers of correct and incorrect decisions, respectively
- S-EM outperforms NB and I-EM in A as well
- comment: the datasets are skewed (positives are only a small fraction), thus A is not a reliable measure of a classifier's performance
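For concreteness, the two measures as one-liners (straightforward transcriptions of the formulas above):

```python
def f_score(p, r):
    """F = 2pr / (p + r): the harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def accuracy(c, i):
    """A = c / (c + i), with c correct and i incorrect decisions."""
    return c / (c + i)
```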
41 RESULTS cont.
- Table 3: F-score and accuracy A
- the results in this table show the great effect of re-initialization with spies
- S-EM outperforms I-EMbest
- re-initialization is not, however, the only factor of improvement
- S-EM outperforms S-EM4
- conclusion: both Step-1 (re-initializing) and Step-2 (selecting the best model) are needed!
42 REFERENCES other than in the paper
- http://www.cs.uic.edu/~liub/LPU/LPU-download.html
- http://www.ant.uni-bremen.de/teaching/sem/ws02_03/slides/em_mud.pdf
- http://www.mcs.vuw.ac.nz/~vignaux/docs/Adams_NLJ.html
- http://plato.stanford.edu/entries/bayes-theorem/
- http://www.math.uiuc.edu/~hildebr/361/cargoat1sol.pdf
- http://jimvb.home.mindspring.com/monthall.htm
- http://www2.sjsu.edu/faculty/watkins/mhall.htm
- http://www.aei-potsdam.mpg.de/~mpoessel/Mathe/3door.html
- http://ccrma-www.stanford.edu/~jos/bayes/Bayesian_Parameter_Estimation.html