Title: PARTIALLY SUPERVISED CLASSIFICATION OF TEXT DOCUMENTS
1 PARTIALLY SUPERVISED CLASSIFICATION OF TEXT DOCUMENTS
- authors
- B. Liu, W.S. Lee, P.S. Yu, X. Li
- presented by
- Rafal Ladysz
2 WHAT IT IS ABOUT
- the paper shows that
- document classification with one class of positively labeled documents,
- accompanied by a set of unlabeled, mixed documents,
- enables building accurate classifiers
- using the EM algorithm based on NB classification
- strengthening EM with so-called spy documents
- experimental results for illustration
- we will browse through the paper and
- emphasize/refresh some of its theoretical aspects
- try to understand the methods described
- look at the results obtained and interpret them
3 AGENDA (informally)
- problem described
- document classification
- PSC - general assumptions
- PSC - some theory
- Bayes basics
- EM in general
- I-EM algorithm
- introducing spies
- I-S-EM algorithm
- selecting classifier
- experimental data
- results and conclusions
- references
4 KEY PROBLEM: a big picture
- no labeled negative training data (text documents)
- only a (small) set of relevant (positive) documents
- necessity to classify unlabeled text documents
- importance
- finding relevant texts on the web
- or in digital libraries
5 DOCUMENT CLASSIFICATION: some techniques used
- kNN (k-Nearest Neighbors)
- Linear Least Squares Fit
- SVM (Support Vector Machines)
- Naive Bayes - utilized here
6 PARTIALLY SUPERVISED CLASSIFICATION (PSC): theoretical foundations
- fixed distribution D over the space X × Y, where Y = {0, 1}
- X, Y are the sets of possible documents and classes (positive and negative), respectively
- an example is a labeled document
- two sets of documents
- the labeled positive set P of size n1, drawn from D_{X|Y=1}
- the unlabeled set M of size n2, drawn independently from X according to D_X
- remark: there might be some relevant documents in M (but we don't know about their existence!)
7 PSC cont.
- Pr_D[A]: probability of A ⊆ X × Y chosen randomly according to D
- T: a finite sample, a subset of our dataset
- Pr_T[A]: probability of A ⊆ T ⊆ X × Y chosen randomly
- the learning algorithm deals with F, a class of functions, and selects a function f from F
- f: X → {0, 1}, to be used by the classifier
- probability of error: Pr[f(X) ≠ Y]
- = Pr[f(X) = 1 ∧ Y = 0] + Pr[f(X) = 0 ∧ Y = 1]
- the sum of the false positive and false negative probabilities
8 PSC approximations (1)
- after transforming the expression for the probability of error Pr[f(X) ≠ Y]:
- Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2·Pr[f(X) = 0 | Y = 1]·Pr[Y = 1]
- notice Pr[Y = 1] = const (no changes of criteria)
- approximation 1
- keeping Pr[f(X) = 0 | Y = 1] small
- learning error ≈ Pr[f(X) = 1] − Pr[Y = 1]
- since Pr[Y = 1] = const ⇒
- ⇒ minimizing the error reduces to minimizing Pr[f(X) = 1]
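The decomposition on this slide can be reconstructed from the error definition on slide 7; the following derivation is standard probability algebra, not copied verbatim from the paper:

```latex
\begin{align*}
\Pr[f(X)\neq Y] &= \Pr[f(X)=1 \wedge Y=0] + \Pr[f(X)=0 \wedge Y=1] \\
                &= \Pr[f(X)=1] - \Pr[f(X)=1 \wedge Y=1] + \Pr[f(X)=0 \wedge Y=1] \\
                &= \Pr[f(X)=1] - \bigl(\Pr[Y=1] - \Pr[f(X)=0 \wedge Y=1]\bigr) + \Pr[f(X)=0 \wedge Y=1] \\
                &= \Pr[f(X)=1] - \Pr[Y=1] + 2\,\Pr[f(X)=0 \mid Y=1]\,\Pr[Y=1]
\end{align*}
```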
9 PSC approximations (2)
- error: Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2·Pr[f(X) = 0 | Y = 1]·Pr[Y = 1]
- approximation 2
- keeping Pr[f(X) = 0 | Y = 1] small
- AND minimizing Pr[f(X) = 1] ⇒
- ⇒ minimizing Pr_M[f(X) = 1] (assumption: most of M is irrelevant)
- AND keeping Pr_P[f(X) = 1] ≥ r
- where r is the recall: (relevant retrieved) / (all relevant)
- for large enough sets P (positive) and M (unlabeled)
10 CONSTRAINT OPTIMIZATION
- simply summarizing what has just been said
- good learning results are achievable if
- the learning algorithm minimizes the number of unlabeled examples labeled as positive
- the constraint that the fraction of errors on the positive examples is ≤ 1 − recall (declared upfront) is satisfied
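A compact restatement of this constrained problem, using the notation of slides 6-9 (P the positive sample, M the unlabeled sample, r the pre-declared recall):

```latex
\min_{f \in F} \; \Pr_{M}\bigl[f(X) = 1\bigr]
\qquad \text{subject to} \qquad
\Pr_{P}\bigl[f(X) = 0\bigr] \;\le\; 1 - r .
```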
11 COMPLEXITY FUNCTION (CF)
- VC-dimension: a complexity measure of F (the class of functions)
- meaning: the cardinality of the largest sample set T ⊆ X such that F_T = 2^T (F shatters T)
- thus the larger such T, the more functions in F
- conversely, the higher the VC-dimension, the more functions in F
- Naive Bayes: VC-dim ≤ 2m + 1
- where m is the cardinality of the classifier's vocabulary
12 CF: two cases
- no noise: ∃ f_t ∈ F such that ∀(X, Y) ~ D: Y = f_t(X) (a perfect function)
- it can be shown that selecting f ∈ F
- which minimizes Σ_{i=1..n2} f(X_i) over M
- AND has total recall on the set of positives (P)
- results in a function with small expected error
- noise: Y may or may not equal f_t(X)
- F may or may not contain the target function
- labels are noisy
- specifying a target expected recall is required
13 CF in noise: modus operandi
- the learning algorithm tries to output f ∈ F such that
- E[recall(f)] ≥ r (that's why a target recall is required)
- E[precision(f)] ≈ the best available among f ∈ F with recall(f) ≥ r
- how the algorithm achieves that
- selecting a set of positive examples from D_{X|Y=1} and unlabeled examples from D_X
- searching for a function f which minimizes Σ_{i=1..n2} f(Z_i) over the unlabeled examples
- under the constraint that the fraction of errors on the positives is ≤ 1 − r
14 PROBABILITY vs. LIKELIHOOD
- in the Webster dictionary apparently synonyms
- from the probabilistic point of view
- s_i: mutually exclusive states of nature
- assuming the prior probabilities P(s_i) are known
- observing experimental outcomes o_j gives more information
- suppose that ∀o_j ∀s_i the quantity P(o_j|s_i) is known
- it is the likelihood of the outcome o_j given the state s_i
- Bayes' theorem combines the prior probabilities with the likelihoods
- and determines the posterior probability of each s_i
- likelihood: the probability of the observed experimental outcome given a state
15 NAIVE BAYES in general
- formally, Bayes' theorem can be formulated as
- P(S_i|O_j) = P(O_j|S_i)·P(S_i) / Σ_{k=1..n} P(O_j|S_k)·P(S_k)
- and is called the Inverse Probability Law
- NB model assumptions
- words are randomly selected from the lexicon, with replacement
- word independence (words as components of a feature vector)
- even though simplistic, it works pretty well
- NB together with EM will be employed here
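A tiny numeric illustration of the Inverse Probability Law above; the priors and likelihoods are made-up values, chosen only to show how a likelihood P(O|S) converts a prior P(S) into a posterior P(S|O):

```python
priors = {"s1": 0.7, "s2": 0.3}        # P(S_i): prior beliefs about the states
likelihoods = {"s1": 0.2, "s2": 0.9}   # P(O|S_i): likelihood of the observed outcome O

# Denominator: total probability of the outcome, sum_k P(O|S_k) P(S_k)
evidence = sum(likelihoods[s] * priors[s] for s in priors)

# Posterior P(S_i|O) = P(O|S_i) P(S_i) / evidence
posteriors = {s: likelihoods[s] * priors[s] / evidence for s in priors}
print(posteriors)  # {'s1': 0.3414..., 's2': 0.6585...}
```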
16 NB-based text classification: formalism
- D: training set of documents
- each document is an ordered list of words w_t
- V = <w_1, w_2, ..., w_|V|>: the vocabulary used
- w_{d_i,k} is the word ∈ V in position k of document d_i
- C = {c_1, c_2, ..., c_|C|}: predefined classes, here {c_1, c_2}
- Pr[c_j|d_i]: the posterior probability needed
- total probability: Pr[c_j] = Σ_i Pr[c_j|d_i] / |D| (indeed Pr[d_i] = 1/|D|)
- in the NB model, the class with the highest Pr[c_j|d_i] is assigned to the document
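A minimal sketch of these NB estimates, assuming documents arrive as word-count vectors and classes carry probabilistic labels; the helper names nb_train / nb_posterior are mine, not the paper's, and the Laplace-smoothed estimates follow the standard multinomial NB form that the slides' eqs. (3)-(5) refer to:

```python
import numpy as np

def nb_train(counts, p_c1):
    """Multinomial NB with probabilistic class labels.

    counts: (n_docs, n_words) array of word counts N(w_t, d_i)
    p_c1:   (n_docs,) array of Pr[c1|d_i]; Pr[c2|d_i] = 1 - p_c1
    Returns class priors Pr[c_j] and Laplace-smoothed word probabilities Pr[w_t|c_j].
    """
    n_docs, n_words = counts.shape
    p = np.stack([p_c1, 1.0 - p_c1])        # (2, n_docs) class memberships
    prior = p.sum(axis=1) / n_docs          # Pr[c_j] = sum_i Pr[c_j|d_i] / |D|
    w_counts = p @ counts                   # (2, n_words) expected word counts per class
    word_prob = (1.0 + w_counts) / (n_words + w_counts.sum(axis=1, keepdims=True))
    return prior, word_prob

def nb_posterior(counts, prior, word_prob):
    """Pr[c1|d] for each document, computed in log space for numerical stability."""
    log_joint = np.log(prior) + counts @ np.log(word_prob).T   # (n_docs, 2)
    log_joint -= log_joint.max(axis=1, keepdims=True)          # avoid underflow
    joint = np.exp(log_joint)
    return joint[:, 0] / joint.sum(axis=1)
```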
17 ITERATIVE EXPECTATION-MAXIMIZATION ALGORITHM (I-EM): a concept
- a general method of maximum-likelihood estimation
- of the parameters of an underlying distribution
- when the data is incomplete
- two main applications of the EM algorithm
- when the data has missing values due to problems with the observation process
- when optimizing the likelihood function is analytically hard
- but the likelihood function can be simplified
- by assuming values for additional, hidden parameters
18 I-EM: mathematically
- θ^(i+1) = argmax_θ Σ_z P(Z = z | x, θ^(i)) · log L(x, Z = z | θ)
- where
- x is the observed (incomplete) data,
- Z represents all hidden (unknown, missing) data,
- θ stands for all (sought-after) parameters
- problem: determine the parameter θ on the basis of the observed data alone,
- i.e. without knowledge of the complete data set
- solution: exploit the observations and determine iteratively the missing data and θ
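Split into its two steps, this is the textbook EM formulation (the Q-function notation below is standard, not taken from the paper):

```latex
% E-step: expected complete-data log-likelihood under the current estimate
Q\bigl(\theta \mid \theta^{(i)}\bigr) = \sum_{z} P\bigl(Z = z \mid x, \theta^{(i)}\bigr)\,\log L(x, Z = z \mid \theta)

% M-step: re-estimate the parameters
\theta^{(i+1)} = \arg\max_{\theta}\; Q\bigl(\theta \mid \theta^{(i)}\bigr)
```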
22 I-EM properties
- simple but computationally demanding
- convergence behavior
- no guarantee of reaching the global optimum
- the initial point θ^(0) determines whether the global optimum is reachable (or the algorithm gets stuck in a local optimum)
- stable: the likelihood function increases in every iteration until a (local if not global) optimum is reached
- M(aximum) L(ikelihood) estimates are fixed points of EM
23 I-EM ALGORITHM: why and how
- for the classification (the main objective)
- the posterior probability Pr[c_j|d_i] is needed
- the probabilities will converge during the iterations
- EM: an iterative algorithm for maximum-likelihood estimation with incomplete data (interpolates)
- two steps
- 1. expectation: filling in the missing data
- 2. maximization: estimating the parameters
- then the next iteration is launched
24 I-EM: symbols used
- D: training set of documents
- each document: an ordered list of words
- w_{d_i,k}: the kth word in the ith document
- each w_{d_i,k} ∈ V = {w_1, w_2, ..., w_|V|} (the vocabulary)
- vocabulary: all words occurring in the documents to be classified
- C = {c_1, c_2}: predefined classes (only 2)
25 I-EM: application
- initial labeling
- ∀d_i ∈ P: assigned c_1, i.e. Pr[c_1|d_i] = 1, Pr[c_2|d_i] = 0
- ∀d_j ∈ M: assigned c_2, i.e. Pr[c_1|d_j] = 0, Pr[c_2|d_j] = 1 (vice versa)
- an NB classifier (NB-C) is created, then applied to the dataset M
- computing the posterior probability Pr[c_1|d_j] in M (eq. 5)
- assigning the newly computed probabilistic label to d_j
- Pr[c_1|d_i] = 1 is not affected during the process
- in each iteration
- Pr[c_1|d_j] is revised, then
- a new NB-C is built based on the new Pr[c_1|d_j] for M, together with P
- iterating continues till convergence occurs
26 I-EM pseudocode
- I-EM(M, P)
- 1. build the initial NB classifier NB-C using the M and P sets
- 2. loop while the NB-C parameters keep changing
- (i.e. as long as convergence is still taking place)
- 3. for each document d_j ∈ M
- 4. compute Pr[c_1|d_j] using the current NB-C (eq. 5)
- // Pr[c_2|d_j] = 1 − Pr[c_1|d_j]; c_1 and c_2 are mutually exclusive
- // if Pr[c_1|d_j] > Pr[c_2|d_j] then d_j is classified as c_1
- 5. update Pr[w_t|c_1] and Pr[c_1] (eqs. 3, 4)
- // given the probabilistically assigned classes Pr[c_1|d_j] for d_j and the set P,
- // a new NB-C is built during this processing
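The loop above as a runnable sketch, reusing the hypothetical nb_train / nb_posterior helpers from the slide 16 example; as a simplification, convergence is checked on the probabilistic labels rather than directly on the NB parameters:

```python
import numpy as np

def i_em(counts_P, counts_M, max_iter=50, tol=1e-4):
    """I-EM sketch: P is labeled positive, M starts out labeled negative."""
    counts = np.vstack([counts_P, counts_M])
    n_P = counts_P.shape[0]
    # Initial labeling: Pr[c1|d] = 1 on P, 0 on M
    p_c1 = np.concatenate([np.ones(n_P), np.zeros(counts_M.shape[0])])
    for _ in range(max_iter):
        prior, word_prob = nb_train(counts, p_c1)         # M-step (eqs. 3, 4)
        new_M = nb_posterior(counts_M, prior, word_prob)  # E-step (eq. 5)
        if np.abs(new_M - p_c1[n_P:]).max() < tol:        # labels stopped changing
            break
        p_c1[n_P:] = new_M                                # labels on P stay fixed at 1
    return prior, word_prob, p_c1[n_P:]
```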
27 I-EM: benefits and limitations
- EM helps assign probabilistic class labels Pr[c_1|d_j] and Pr[c_2|d_j] to each d_j in the mixed set of documents
- all the above probabilities converge over the iterations
- the final result is sensitive to the initial conditions assumed
- conclusion
- good handling of easy data (+/− easily separable)
- a niche for improvement on hard data
- source of the limitation: the initialization is strongly biased towards the positive data (documents)
- solution
- balanced initialization (+/−)
- find reliable negative documents for initializing c_2 in EM
28 I-EM extension
- I-EM helps identify the (most likely) negatives in M
- issue: how to get data (documents) as reliable as possible to do so
- idea: using spy documents from P in M
- approach
- select s ≈ 10% of the documents from P, denoted S
- add the S set to the M set
- the documents in S behave as the unknown positive documents in M do
- enabling inference within M
- I-EM is still in use
- but instead of M it operates on M ∪ S
29 SPIES: determining the threshold
- set of spy documents S = {s_1, s_2, ..., s_k}
- Pr[c_1|s_i]: the probabilistic label assigned to each spy s_i
- in the noiseless case: t = min Pr[c_1|s_i], i = 1, 2, ..., k
- equivalent to retrieving all spy documents
- in a more realistic scenario, noise and outliers exist
- the minimum probability might be unreliable, because e.g.
- for an outlier s_i in S, the posterior Pr[c_1|s_i] might be << Pr[c_1|d_j] for d_j ∈ M
- setting t
- sort the s_i in S according to Pr[c_1|s_i]
- set the noise level l (e.g. 15%) so that l% of the spy documents have probability < t
- thus, the Step-1 objective is
- identifying a set of reliable negative documents from the unlabeled set
- the unlabeled set is to be treated as negative data (documents)
30 SPY DOCUMENTS and the Step-1 algorithm
- the threshold t is used for decision making
- if Pr[c_1|d_j] < t, d_j is denoted as N(egative)
- if Pr[c_1|d_j] ≥ t for d_j ∈ M, d_j remains U(nlabeled)
- algorithm Step-1
- identifies the most likely negatives N from the unlabeled set U
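A sketch of Step-1 under the parameters quoted on slides 28-29 (spy fraction s ≈ 10%, noise level l = 15%); the helper i_em is the illustrative sketch from slide 26, not the authors' code:

```python
import numpy as np

def step1(counts_P, counts_M, spy_frac=0.10, noise=0.15, seed=0):
    """Step-1 sketch: plant spies in M, run I-EM, threshold on spy posteriors."""
    rng = np.random.default_rng(seed)
    n_P, n_M = counts_P.shape[0], counts_M.shape[0]
    is_spy = np.zeros(n_P, dtype=bool)
    is_spy[rng.choice(n_P, size=max(1, int(spy_frac * n_P)), replace=False)] = True

    # Run I-EM with the spies moved from P into M
    mixed = np.vstack([counts_M, counts_P[is_spy]])
    _, _, p_mixed = i_em(counts_P[~is_spy], mixed)

    # Threshold t: chosen so that noise% of the spies fall below it
    spy_probs = np.sort(p_mixed[n_M:])
    t = spy_probs[int(noise * len(spy_probs))]

    likely_negative = p_mixed[:n_M] < t   # LN: reliable negatives
    return likely_negative, t             # the rest of M stays unlabeled (U)
```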
31 STEP-1 effect
[diagram: BEFORE - M (unlabeled, c2, containing the spies) and P (positive, c1); AFTER - LN (likely negative, c2), U (unlabeled, still holding some spies and positives) and P (positive, c1)]
- initial situation: M = P ∪ N, with no clue which documents are positive and which negative; spies from P are added to M
- with the help of the spies, most positives in M get into the unlabeled set U, while most negatives get into LN
- the purity of LN is higher than that of M
32 STEP-2: building and selecting the final classifier
- EM is still in use, but now with P, LN and U
- the algorithm proceeds as follows
- put all spies S back into P (where they were before)
- ∀d_i ∈ P: c_1 (i.e. Pr[c_1|d_i] = 1), fixed through the iterations
- ∀d_j ∈ LN: c_2 (i.e. Pr[c_2|d_j] = 1), changing through EM
- ∀d_k ∈ U: initially assigned no label (labels are assigned after EM(1))
- run EM using P, LN and U until it converges
- the final classifier is produced when EM stops
- all of this constitutes S-EM (spy EM)
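A Step-2 sketch in the same vein (again illustrative helpers, not the authors' code); every intermediate classifier is kept, anticipating the selection sub-step discussed on slide 36:

```python
import numpy as np

def step2(counts_P, counts_M, likely_negative, max_iter=50, tol=1e-4):
    """Step-2 sketch: EM over P (fixed c1), LN (initially c2) and U (no label)."""
    counts_LN = counts_M[likely_negative]
    counts_U = counts_M[~likely_negative]
    n_P = counts_P.shape[0]
    full = np.vstack([counts_P, counts_LN, counts_U])
    # EM(1) is trained on P and LN only; U has no label yet
    counts = np.vstack([counts_P, counts_LN])
    p_c1 = np.concatenate([np.ones(n_P), np.zeros(counts_LN.shape[0])])
    classifiers = []
    for _ in range(max_iter):
        prior, word_prob = nb_train(counts, p_c1)
        classifiers.append((prior, word_prob))
        # E-step: revise the labels of LN and U; labels of P stay fixed at 1
        p_rest = nb_posterior(np.vstack([counts_LN, counts_U]), prior, word_prob)
        new_p = np.concatenate([np.ones(n_P), p_rest])
        if new_p.shape == p_c1.shape and np.abs(new_p - p_c1).max() < tol:
            break
        counts, p_c1 = full, new_p   # from EM(2) onward, U participates
    return classifiers               # Step-2b picks one of these (slide 36)
```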
33 STEP-2 comments
- the probabilities of the sets U and LN are allowed to change
- the set U participates in EM from EM(2) onward, with its documents assigned the probabilistic labels Pr[c_1|d_k]
- experimenting with a = 5%, 10% or 20% gave similar results; why?
- for the parameter a (used for creating LN):
- within a range of approximately 5-20%, if too many positives land in LN, then
- EM corrects it, slowly adding them back to the positives
34 STEP-1 AND STEP-2 SUMMARY
- Step 1
- identifying a set of reliable negative documents from the unlabeled set; the unlabeled set is treated as negative data
- Step 2
- building and selecting a classifier, consisting of two sub-steps
- a) building a set of classifiers by iteratively applying a classification algorithm; the EM algorithm is used again
- b) selecting a good classifier from the set of classifiers constructed above; this sub-step may be called "catching a good classifier"
35 SELECTING CLASSIFIER
- as said, EM is prone to the local-maxima trap
- if a local maximum separates the two classes well: no problem (or problem solved)
- otherwise (i.e. positives and negatives each consist of many clusters) the data may not be separable
- remedy: stop iterating EM at some point
- at what point?
36 SELECTING CLASSIFIER continued
- eq. (2) can be helpful: the error probability
- Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2·Pr[f(X) = 0 | Y = 1]·Pr[Y = 1]
- it can be shown that knowing the component Pr_M[Y = c_1] allows us to estimate the error
- method: estimating the change of the error probability between iterations i and i+1
- Δ_i can be computed (formula in section 4.5 of the paper)
- if Δ_i > 0 for the first time, then the classifier produced at iteration i is the last one to keep (no need to proceed beyond i)
37 EXPERIMENTAL DATA described
- two large document corpora
- 30 datasets created
- e.g. 20 Newsgroups, subdivided into 4 groups
- all headers removed
- e.g. WebKB (CS departments), subdivided into 7 categories
- objective
- recovering the positive documents placed into the mixed sets
- no need for a test set separate from the training set
- the unlabeled mixed set serves as the test set
38 DATA description cont.
- for each experiment
- the full positive set is divided into two subsets, P and R
- P: the positive set used in the algorithm, holding a% of the full positive set
- R: the remaining documents, of which b% have been put into the (nominally negative) mixed set M (not all of R goes into M)
- belief: in reality M is large and has a small proportion of positive documents
- the parameters a and b have been varied to cover different scenarios
39 EXPERIMENTAL RESULTS
- techniques used
- NB-C: applied directly to P (c_1) and M (c_2) to build a classifier, which then classifies the data in set M
- I-EM: applies the EM algorithm to P and M for as long as it converges (no spies yet); the final classifier is applied to M to identify its positives
- S-EM: spies are used to re-initialize I-EM and build the final classifier; the threshold t is used
40 RESULTS cont.
- Table 1: 30 results for different parameters a, b
- Table 2: summary of averages for other a, b settings
- F-score: F = 2pr/(p + r), where p and r are precision and recall, respectively
- S-EM outperforms NB and I-EM in F dramatically
- accuracy (of a classifier): A = c/(c + i), where c and i are the numbers of correct and incorrect decisions, respectively
- S-EM outperforms NB and I-EM in A as well
- comment: the datasets are skewed (positives are only a small fraction), thus A is not a reliable measure of a classifier's performance
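For concreteness, the two measures as one-liners (straightforward transcriptions of the formulas above):

```python
def f_score(p, r):
    """F = 2pr / (p + r): the harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def accuracy(c, i):
    """A = c / (c + i), with c correct and i incorrect decisions."""
    return c / (c + i)
```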
41 RESULTS cont.
- Table 3: F-score and accuracy A
- the results in this table show the great effect of re-initialization with spies
- S-EM outperforms I-EMbest
- re-initialization is not, however, the only factor of improvement
- S-EM outperforms S-EM4
- conclusion: both Step-1 (re-initializing) and Step-2 (selecting the best model) are needed!
42 REFERENCES other than in the paper
- http://www.cs.uic.edu/~liub/LPU/LPU-download.html
- http://www.ant.uni-bremen.de/teaching/sem/ws02_03/slides/em_mud.pdf
- http://www.mcs.vuw.ac.nz/~vignaux/docs/Adams_NLJ.html
- http://plato.stanford.edu/entries/bayes-theorem/
- http://www.math.uiuc.edu/~hildebr/361/cargoat1sol.pdf
- http://jimvb.home.mindspring.com/monthall.htm
- http://www2.sjsu.edu/faculty/watkins/mhall.htm
- http://www.aei-potsdam.mpg.de/~mpoessel/Mathe/3door.html
- http://ccrma-www.stanford.edu/~jos/bayes/Bayesian_Parameter_Estimation.html