1
PARTIALLY SUPERVISED CLASSIFICATION OF TEXT DOCUMENTS
  • authors: B. Liu, W.S. Lee, P.S. Yu, X. Li
  • presented by: Rafal Ladysz

2
WHAT IT IS ABOUT
  • the paper shows
  • document classification given only
  • one class of positively labeled documents
  • accompanied by a set of unlabeled, mixed documents
  • that this is enough to build accurate classifiers
  • using the EM algorithm based on NB classification
  • strengthening EM with so-called spy documents
  • experimental results for illustration
  • we will browse through the paper and
  • emphasize/refresh some of its theoretical aspects
  • try to understand the methods described
  • look at the results obtained and interpret them

3
AGENDA (informally)
  • problem described
  • document classification
  • PSC - general assumptions
  • PSC - some theory
  • Bayes basics
  • EM in general
  • I-EM algorithm
  • introducing spies
  • I-S-EM algorithm
  • selecting classifier
  • experimental data
  • results and conclusions
  • references

4
KEY PROBLEM: the big picture
  • no labeled negative training data (text
    documents)
  • only a (small) set of relevant (positive)
    documents
  • necessity to classify unlabeled text documents
  • importance
  • finding relevant text on the web or in digital libraries

5
DOCUMENT CLASSIFICATION: some techniques used
  • kNN (k-Nearest Neighbors)
  • Linear Least Squares Fit
  • SVM
  • Naive Bayes: the one utilized here

6
PARTIALLY SUPERVISED CLASSIFICATION (PSC)
theoretical foundations
  • fixed distribution D over the space X x Y, where Y = {0, 1}
  • X, Y: the sets of possible documents and classes
    (positive and negative), respectively
  • an example is a labeled document (x, y)
  • two sets of documents:
  • P: labeled positive, of size n1, drawn from D_X|Y=1
  • M: unlabeled (mixed), of size n2, drawn independently from D_X
  • remark: there may be some relevant (positive) documents
    in M, but we don't know which ones!

7
PSC cont.
  • Pr_D[A]: probability of event A over X x Y chosen randomly according to D
  • T: a finite sample, a subset of our dataset
  • Pr_T[A]: empirical probability of A over the sample T ⊆ X x Y
  • the learning algorithm deals with F, a class of
    functions, and selects a function f from F
  • each f: X → {0, 1} is to be used as the classifier
  • probability of error: Pr[f(X) ≠ Y]
  •   = Pr[f(X) = 1 ∧ Y = 0] + Pr[f(X) = 0 ∧ Y = 1]
  • i.e. the sum of the false positive and false negative cases

8
PSC approximations (1)
  • after transforming the expression for the probability of error:
  • Pr[f(X) ≠ Y]
  •   = Pr[f(X) = 1] - Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1]
  • notice: Pr[Y = 1] is constant (no changes of criteria)
  • approximation 1
  • keeping Pr[f(X) = 0 | Y = 1] small
  • learning error ≈ Pr[f(X) = 1] - Pr[Y = 1]
  • since Pr[Y = 1] is constant ⇒
  • ⇒ minimizing the error means minimizing Pr[f(X) = 1]
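For completeness, a short sketch of why this decomposition holds (a standard manipulation, not copied from the paper):

    \Pr[f(X) \neq Y]
      = \Pr[f(X)=1, Y=0] + \Pr[f(X)=0, Y=1]
      = \Pr[f(X)=1] - \Pr[f(X)=1, Y=1] + \Pr[f(X)=0, Y=1]
      = \Pr[f(X)=1] - \big(\Pr[Y=1] - \Pr[f(X)=0, Y=1]\big) + \Pr[f(X)=0, Y=1]
      = \Pr[f(X)=1] - \Pr[Y=1] + 2\,\Pr[f(X)=0 \mid Y=1]\,\Pr[Y=1]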

9
PSC approximations (2)
  • error: Pr[f(X) ≠ Y]
  •   = Pr[f(X) = 1] - Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1]
  • approximation 2
  • keeping Pr[f(X) = 0 | Y = 1] small
  • AND minimizing Pr[f(X) = 1] ⇒
  • ⇒ minimizing Pr_M[f(X) = 1] (assumption: most of M is irrelevant)
  • AND keeping Pr_P[f(X) = 1] ≥ r
  • where r is the recall: (relevant retrieved) / (all relevant)
  • valid for large enough sets P (positive) and M (unlabeled)

10
CONSTRAINED OPTIMIZATION
  • simply summarizing what has just been said
  • good learning results are achievable if
  • the learning algorithm minimizes the number of
    unlabeled examples labeled as positive
  • while satisfying the constraint that the fraction of errors
    on the positive examples is ≤ 1 - r (recall r declared upfront)

11
COMPLEXITY FUNCTION (CF)
  • VC-dim: a complexity measure of F (the class of functions)
  • meaning: the cardinality of the largest sample set T,
  • T ⊆ X, such that |F_T| = 2^|T| (F shatters T)
  • thus the larger such T, the more functions in F
  • conversely, the higher the VC-dim, the more functions in F
  • Naive Bayes: VC-dim ≤ 2m + 1
  • where m is the cardinality of the classifier's vocabulary

12
CF two cases
  • no noise: ∃ f_t ∈ F such that ∀ (X, Y) ~ D: Y = f_t(X) (a perfect function)
  • it can be shown that selecting f ∈ F
  • which minimizes Σ_{i=1..n2} f(X_i) over the unlabeled set M
  • AND has total recall on the set of positives (P)
  • results in a function with small expected error
  • noise: Y may or may not equal f_t(X)
  • F may or may not contain the target function f_t
  • labels are noisy
  • specifying a target expected recall is required

13
CF in noise modus operandi
  • the learning algorithm tries to output f ∈ F such that
  • E[recall(f)] ≥ r (that is why a target recall is required)
  • E[precision(f)] is close to the best available among all f ∈ F
    with recall(f) ≥ r
  • how the algorithm achieves that:
  • takes a set of positive examples drawn
  • from D_X|Y=1 and unlabeled examples drawn from D_X
  • searches for a function f which minimizes Σ_{i=1..n2} f(Z_i)
    over the unlabeled examples
  • under the constraint that the fraction of errors on the positives is ≤ 1 - r

14
PROBABILITY vs. LIKELIHOOD
  • in the Webster dictionary apparently synonyms
  • from the probabilistic point of view:
  • s_i: some mutually exclusive states of nature
  • assuming the prior probabilities P(s_i) are known
  • observing experimental outcomes o_j gives more information
  • suppose that for every o_j and s_i, P(o_j | s_i) is known
  • it is the likelihood of the outcome o_j given the state s_i
  • Bayes' theorem combines the prior probabilities with the likelihoods
  • and determines the posterior probability for each s_i
  • likelihood: probability of the observed experimental
    outcome given a state

15
NAIVE BAYES in general
  • formally, Bayes' theorem can be formulated as
  • P(S_i | O_j) = P(O_j | S_i) P(S_i) / Σ_{k=1..n} P(O_j | S_k) P(S_k)
  • and is called the Inverse Probability Law
  • NB model assumptions:
  • words randomly selected from the lexicon, with replacement
  • word independence (words as components of a feature vector)
  • even though simplistic, it works pretty well
  • NB together with EM will be employed here (a numeric sketch follows below)
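To make the formula concrete, a tiny numeric illustration in Python; the two states and all probability values below are made up for illustration, not taken from the paper:

    # hypothetical states of nature and an observed outcome ("the word 'football' occurs")
    priors = {"relevant": 0.3, "irrelevant": 0.7}        # P(S_i), assumed values
    likelihoods = {"relevant": 0.08, "irrelevant": 0.01}  # P(O_j | S_i), assumed values

    # Bayes' theorem: P(S_i | O_j) = P(O_j | S_i) P(S_i) / sum_k P(O_j | S_k) P(S_k)
    evidence = sum(likelihoods[s] * priors[s] for s in priors)
    posteriors = {s: likelihoods[s] * priors[s] / evidence for s in priors}
    print(posteriors)  # the posterior mass shifts strongly towards "relevant"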

16
NB-based text classification - formalism
  • D: training set of
  • documents, each an ordered list of words w_t
  • V = <w_1, w_2, ..., w_|V|>: the vocabulary used
  • w_{d_i,k} is the word ∈ V at position k of document d_i
  • C = {c_1, c_2, ..., c_|C|}: predefined classes; here just {c_1, c_2}
  • Pr[c_j | d_i]: the posterior probability we need
  • total probability: Pr[c_j] = Σ_i Pr[c_j | d_i] / |D| (indeed Pr[d_i] = 1/|D|)
  • in the NB model the class with the highest Pr[c_j | d_i] is
    assigned to the document
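For reference, the Laplace-smoothed multinomial NB estimates this family of models relies on (my reconstruction of the standard formulas; the slides' mentions of eqs. 3-5 refer to the corresponding equations in the paper):

    \Pr[c_j] = \frac{\sum_{i=1}^{|D|} \Pr[c_j \mid d_i]}{|D|}

    \Pr[w_t \mid c_j] = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i)\,\Pr[c_j \mid d_i]}
                             {|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i)\,\Pr[c_j \mid d_i]}

    \Pr[c_j \mid d_i] = \frac{\Pr[c_j] \prod_{k=1}^{|d_i|} \Pr[w_{d_i,k} \mid c_j]}
                             {\sum_{r=1}^{|C|} \Pr[c_r] \prod_{k=1}^{|d_i|} \Pr[w_{d_i,k} \mid c_r]}

where N(w_t, d_i) denotes the number of occurrences of word w_t in document d_i.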

17
ITERATIVE EXPECTATION-MAXIMIZATION ALGORITHM
(I-EM) a concept
  • a general method for maximum likelihood estimation
  • of an underlying distribution's parameters
  • when the data is incomplete
  • two main applications of the EM algorithm
  • when the data has missing values due to problems
    with the observation process
  • when optimizing the likelihood function is
  • analytically hard
  • but the likelihood function can be simplified
  • by assuming values for additional, hidden
    parameters

18
I-EM - mathematically
  • θ^(i+1) = argmax_θ Σ_z P(Z = z | x, θ^(i)) L(x, Z = z | θ)
  • where
  • x is the observable data,
  • Z represents all hidden (unknown, missing) data,
  • θ stands for all (sought-after) parameters
  • problem: determine the parameters θ on the basis of the
    observations y only,
  • i.e. without knowledge of the complete data set x
  • solution: exploit y and determine x and θ iteratively

19-21
(No Transcript)
22
I-EM properties
  • simple but computationally demanding
  • convergence behavior:
  • no guarantee of reaching the global optimum
  • the initial point θ^(0) determines whether the global optimum
    is reachable (or the algorithm gets stuck in a local optimum)
  • stable: the likelihood increases in every
    iteration until a (local if not global) optimum is reached
  • M(aximum) L(ikelihood) estimates are fixed points of EM

23
I-EM ALGORITHM why and how
  • for the classification (the main objective)
  • the posterior probability Pr[c_j | d_i] is needed
  • these probabilities converge during the iterations
  • EM: an iterative algorithm for maximum likelihood
    estimation with incomplete data (it interpolates)
  • two steps:
  • 1. expectation: filling in the missing data
  • 2. maximization: estimating the parameters
  • then the next iteration is launched

24
I-EM symbols used
  • symbols used
  • D: training set of documents
  • each document: an ordered list of words
  • w_{d_i,k}: the k-th word in the i-th document
  • each w_{d_i,k} ∈ V = {w_1, w_2, ..., w_|V|} (the vocabulary)
  • vocabulary: all words occurring in the documents to be classified
  • C = {c_1, c_2}: predefined classes (only 2)

25
I-EM - application
  • initial labeling
  • ∀ d_i ∈ P → c_1, i.e. Pr[c_1|d_i] = 1, Pr[c_2|d_i] = 0
  • ∀ d_j ∈ M → c_2, i.e. Pr[c_1|d_j] = 0, Pr[c_2|d_j] = 1 (vice versa)
  • an NB-C is created, then applied to the dataset M
  • computing the posterior probability Pr[c_1|d_j] for each d_j in M (eq. 5)
  • assigning the newly computed probabilistic label to d_j
  • Pr[c_1|d_i] = 1 for d_i ∈ P is not affected during the process
  • in each iteration
  • Pr[c_1|d_j] is revised, then
  • a new NB-C is built based on the new Pr[c_1|d_j] for M and on P
  • iterating continues till convergence occurs

26
I-EM pseudocode
  • I-EM(M, P)
  • 1. build the initial NB classifier NB-C using the sets M and P
  • 2. loop while the NB-C parameters keep changing
  •    (i.e. as long as they have not converged)
  • 3.   for each document d_j ∈ M
  • 4.     compute Pr[c_1|d_j] using the current NB-C (eq. 5)
  •        // Pr[c_2|d_j] = 1 - Pr[c_1|d_j]; c_1 and c_2 are mutually exclusive
  •        // if Pr[c_1|d_j] > Pr[c_2|d_j] then d_j is classified as c_1
  • 5.   update Pr[w_t|c_1] and Pr[c_1] (eq. 3, 4)
  •      // given the probabilistically assigned classes for
  •      // d_j (i.e. Pr[c_1|d_j]) and the set P,
  •      // a new NB-C is built in this step
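A compact, runnable sketch of this loop in Python (NumPy), assuming documents are rows of bag-of-words count matrices; the helper names (build_nb, posterior_c1) and the convergence test are my own, not the paper's:

    import numpy as np

    def build_nb(X, p_c1):
        """Estimate NB parameters from counts X (docs x vocab) and per-doc Pr[c1|d]."""
        prior_c1 = p_c1.mean()                     # class prior (eq. 4 style)
        cnt1 = X.T @ p_c1                          # expected word counts in c1
        cnt2 = X.T @ (1.0 - p_c1)                  # expected word counts in c2
        logw1 = np.log((1.0 + cnt1) / (X.shape[1] + cnt1.sum()))  # Pr[w_t|c1], Laplace-smoothed
        logw2 = np.log((1.0 + cnt2) / (X.shape[1] + cnt2.sum()))  # Pr[w_t|c2]
        return prior_c1, logw1, logw2

    def posterior_c1(X, prior_c1, logw1, logw2):
        """Pr[c1|d] for each document row of X, via Bayes' rule in log space (eq. 5 style)."""
        log1 = np.log(prior_c1) + X @ logw1
        log2 = np.log(1.0 - prior_c1) + X @ logw2
        return 1.0 / (1.0 + np.exp(log2 - log1))

    def i_em(P, M, max_iter=50, tol=1e-4):
        """I-EM(M, P): P and M are count matrices; labels of P stay fixed at c1."""
        X = np.vstack([P, M])
        p_c1 = np.concatenate([np.ones(len(P)), np.zeros(len(M))])  # initial labeling
        for _ in range(max_iter):
            params = build_nb(X, p_c1)             # step 5: (re)build NB-C
            new_m = posterior_c1(M, *params)       # steps 3-4: revise Pr[c1|dj] for M
            if np.abs(new_m - p_c1[len(P):]).max() < tol:
                break                              # parameters stopped changing
            p_c1[len(P):] = new_m
        return params, p_c1[len(P):]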

27
I-EM benefits and limitations
  • the EM algorithm helps assign probabilistic class labels
    Pr[c_1|d_j] and Pr[c_2|d_j] to each d_j in the mixed set of documents
  • all of the above probabilities converge over the iterations
  • the final result is sensitive to the assumed initial conditions
  • conclusion
  • good handling of easy data (positives/negatives easily separable)
  • a niche for improvement on hard data
  • source of the limitation: initialization strongly
    biased towards the positive data (documents)
  • solution
  • balanced (+/-) initialization
  • find reliable negative documents for initializing c_2 in EM

28
I-EM extension
  • I-EM helps identify the (most likely) negatives in M
  • issue: how to get data (documents) that are as reliable
    as possible for doing so
  • idea: plant spy documents from P into M
  • approach (sketched in code below):
  • select s ≈ 10% of the documents from P, denoted S
  • add the set S to M
  • the spies in S behave just as the unknown positive documents in M do
  • enabling inference within M
  • I-EM is still in use
  • but instead of M it operates on M ∪ S
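A small sketch of how the spies might be planted, following the roughly 10% figure above (the random split and the array representation are my assumptions):

    import numpy as np

    def add_spies(P, M, spy_frac=0.10, seed=0):
        """Move roughly spy_frac of the positive documents into the mixed set as spies S."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(P))
        n_spy = max(1, int(spy_frac * len(P)))
        S, P_rest = P[idx[:n_spy]], P[idx[n_spy:]]   # spies and remaining positives
        M_plus_S = np.vstack([M, S])                 # I-EM now operates on M ∪ S
        return P_rest, M_plus_S, np.arange(len(M), len(M) + n_spy)  # spy row indices in M ∪ S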

29
SPIES determining threshold
  • set of spy documents S = {s_1, s_2, ..., s_k}
  • Pr[c_1|s_i]: the probabilistic label assigned to each spy s_i
  • in the noiseless case t = min{ Pr[c_1|s_i] : i = 1, 2, ..., k }
  • equivalent to retrieving all spy documents
  • in a more realistic scenario noise and outliers exist
  • the minimum probability might then be unreliable, because e.g.
  • for an outlier s_i in S the posterior Pr[c_1|s_i] might
    be << Pr[c_1|d_j] for d_j ∈ M
  • setting t (see the sketch below):
  • sort the s_i in S according to Pr[c_1|s_i]
  • set a noise level l (e.g. 15%) so that l% of the spy docs
    have probability < t
  • thus, the Step-1 objective is
  • identifying a set of reliable negative documents
    from the unlabeled set
  • the unlabeled set is to be treated as negative data (docs)
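The threshold rule above, as a sketch (noise_level plays the role of l; setting it to 0 recovers the noiseless minimum):

    import numpy as np

    def spy_threshold(spy_posteriors, noise_level=0.15):
        """Choose t so that about noise_level of the spies get Pr[c1|s_i] < t."""
        probs = np.sort(np.asarray(spy_posteriors))
        k = int(noise_level * len(probs))   # number of spies tolerated below t
        return probs[k]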

30
SPY DOCUMENTS and Step-1 algorithm
  • the threshold t is used for decision making
  • if Pr[c_1|d_j] < t: d_j is denoted N(egative)
  • if Pr[c_1|d_j] ≥ t: d_j remains U(nlabeled)
  • algorithm Step-1 (sketched below):
  • identifies the most likely negatives N within the unlabeled set U
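And the Step-1 decision rule itself, as a minimal sketch operating on the posteriors of the original M documents:

    def step1_partition(m_posteriors, t):
        """Split the mixed set: below t -> likely negative (LN/N), otherwise keep unlabeled (U)."""
        LN = [j for j, p in enumerate(m_posteriors) if p < t]
        U = [j for j, p in enumerate(m_posteriors) if p >= t]
        return LN, U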

31
STEP-1 effect
  • (diagram) initial situation: M = P ∪ N, unlabeled, with no clue which
    documents are positive and which are negative; spies from P are added to M
  • (diagram) after Step-1: with the help of the spies, most positives in M
    end up in the unlabeled set U (c1 side), while most negatives end up in
    LN (likely negative, c2 side), together with some spies
  • the purity of LN is higher than that of M
32
STEP-2 building and selecting final classifier
  • EM is still in use, but now with P, LN and U
  • the algorithm proceeds as follows (a sketch follows below):
  • put all spies S back into P (where they were before)
  • ∀ d_i ∈ P → c_1 (i.e. Pr[c_1|d_i] = 1), fixed through the iterations
  • ∀ d_j ∈ LN → c_2 (i.e. Pr[c_2|d_j] = 1), allowed to change through EM
  • ∀ d_k ∈ U: initially assigned no label (it will get one after EM(1))
  • run EM using P, LN and U until it converges
  • the final classifier is produced when EM stops
  • all of this constitutes S-EM (spy EM)
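A sketch of this re-initialized EM run, reusing the hypothetical build_nb / posterior_c1 helpers from the I-EM sketch earlier; treating U as unlabeled in EM(1) is approximated here by simply excluding it from the first parameter estimate:

    import numpy as np

    def s_em_step2(P, LN, U, max_iter=50, tol=1e-4):
        """Step-2 EM on count matrices: P fixed positive, LN starts negative, U joins from EM(2)."""
        X = np.vstack([P, LN, U])
        p_c1 = np.concatenate([np.ones(len(P)), np.zeros(len(LN) + len(U))])
        params = build_nb(np.vstack([P, LN]),                      # EM(1): U not used yet
                          np.concatenate([np.ones(len(P)), np.zeros(len(LN))]))
        for _ in range(max_iter):
            new = posterior_c1(np.vstack([LN, U]), *params)        # revise LN and U labels
            if np.abs(new - p_c1[len(P):]).max() < tol:
                break                                              # convergence: EM stops
            p_c1[len(P):] = new
            params = build_nb(X, p_c1)                             # rebuild NB-C on P, LN, U
        return params                                              # the final classifier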

33
STEP-2 comments
  • the probabilities of the documents in U and LN are allowed to change
  • the set U participates in EM from EM(2) onwards, with its
    documents assigned probabilistic labels Pr[c_1|d_k]
  • experimenting with a = 5%, 10% or 20% gave similar results - why?
  • for the parameter a (used for creating LN):
  • when it is within a range of approximately 5-20%, if
    too many positives end up in LN, then
  • EM corrects this, slowly adding them back to the positives

34
STEP-1 AND STEP-2 SUMMARY
  • Step 1
  • identifying a set of reliable negative documents from the
    unlabeled set; the unlabeled set is treated as negative data
  • Step 2
  • building and selecting a classifier; consists of two sub-steps:
  • a) building a set of classifiers by iteratively applying a
     classification algorithm; the EM algorithm is used again
  • b) selecting a good classifier from the set of classifiers
     constructed above; this sub-step may be called
     "catching a good classifier"

35
SELECTING CLASSIFIER
  • as said, EM is prone to the local-maxima trap
  • if a local maximum separates the two classes
    well: no problem (or rather, problem solved)
  • otherwise (i.e. when the positives and negatives each consist
    of many clusters) the data may not be separable
  • remedy: stop iterating EM at some point
  • at what point?

36
SELECTING CLASSIFIER continued
  • eq. (2), the error probability, can be helpful:
  • Pr[f(X) ≠ Y] = Pr[f(X) = 1] - Pr[Y = 1]
    + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1]
  • it can be shown that knowing the component Pr_M[Y = c_1]
    allows us to estimate the error
  • method: estimate the change of the error probability
    between iterations i and i+1
  • Δ_i can be computed (formula in Section 4.5 of the paper)
  • if Δ_i > 0 for the first time, then the i-th
    classifier produced is the last one to keep (no need
    to proceed beyond i)

37
EXPERIMENTAL DATA described
  • two large document corpora
  • 30 datasets created
  • e.g. 20 Newsgroups, subdivided into 4 groups
  • all headers removed
  • e.g. WebKB (CS departments), subdivided into 7 categories
  • objective
  • recovering the positive documents placed into the mixed sets
  • no need for a test set separate from the training set
  • the unlabeled mixed set serves as the test set

38
DATA description cont.
  • for each experiment:
  • the full positive set is divided into two subsets, P and R
  • P: the positive set used in the algorithm, holding a% of
    the full positive set
  • R: the remaining positive documents; b% of R is put into the
    mixed set M (not all of R goes into M)
  • belief: in reality M is large and has only a small
    proportion of positive documents
  • the parameters a and b have been varied to cover
    different scenarios

39
EXPERIMENTAL RESULTS
  • techniques compared:
  • NB-C: applied directly to P (as c_1) and M (as c_2) to
    build a classifier, which is then used to classify the data in set M
  • I-EM: applies the EM algorithm to P and M until it
    converges (no spies yet); the final classifier is
    applied to M to identify its positives
  • S-EM: spies are used to re-initialize I-EM and to build
    the final classifier; the threshold t is used

40
RESULTS cont.
  • Table 1: 30 results for different parameters a, b
  • Table 2: summary of averages for other a, b settings
  • F-score: F = 2pr/(p + r), where p and r are precision and
    recall, respectively
  • S-EM outperforms NB and I-EM in F dramatically
  • accuracy (of a classifier): A = c/(c + i), where c and i
    are the numbers of correct and incorrect decisions,
    respectively
  • S-EM outperforms NB and I-EM in A as well
  • comment: the datasets are skewed (positives are only a
    small fraction), thus A is not a reliable measure
    of classifier performance (see the note below)
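The two measures written out, with made-up numbers purely for illustration:

    def f_score(p, r):
        """F = 2pr / (p + r): harmonic mean of precision p and recall r."""
        return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

    def accuracy(c, i):
        """A = c / (c + i): fraction of correct decisions."""
        return c / (c + i)

    # e.g. precision 0.6, recall 0.9 -> F = 0.72; on a skewed dataset a classifier
    # that labels everything negative can still score a high A, which is why F is
    # the more informative measure here.
    print(f_score(0.6, 0.9), accuracy(940, 60))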

41
RESULTS cont.
  • Table 3: F-score and accuracy A
  • the results in this table show the great effect of
    re-initialization with spies
  • S-EM outperforms I-EMbest
  • re-initialization is not, however, the only factor
    of improvement
  • S-EM outperforms S-EM4
  • conclusion: both Step-1 (re-initializing) and
    Step-2 (selecting the best model) are needed!

42
REFERENCES (other than those in the paper)
  • http://www.cs.uic.edu/~liub/LPU/LPU-download.html
  • http://www.ant.uni-bremen.de/teaching/sem/ws02_03/slides/em_mud.pdf
  • http://www.mcs.vuw.ac.nz/~vignaux/docs/Adams_NLJ.html
  • http://plato.stanford.edu/entries/bayes-theorem/
  • http://www.math.uiuc.edu/~hildebr/361/cargoat1sol.pdf
  • http://jimvb.home.mindspring.com/monthall.htm
  • http://www2.sjsu.edu/faculty/watkins/mhall.htm
  • http://www.aei-potsdam.mpg.de/~mpoessel/Mathe/3door.html
  • http://ccrma-www.stanford.edu/~jos/bayes/Bayesian_Parameter_Estimation.html

43
  • THANK YOU