1
Learning Classifiers from Only Positive and Unlabeled Data
Charles Elkan, Keith Noto
Presented by Liu Yi
2
Traditional way of building classifiers
  • Take two sets of examples
  • One set consists of positive examples of the concept to be learnt
  • The other set consists of negative examples
  • Use the positive and negative examples to train a classifier

3
What if we have non-traditional inputs?
  • It is often the case that the available training data are an incomplete set of positive examples and a set of unlabeled examples.
  • E.g. a specialized molecular biology database.
  • Such a database defines a set of positive examples (genes/proteins related to a certain disease or function). There is no information about examples that should not be included, and it would be unnatural to build such a set.

4
Other examples
  • Learning a user's preference for web pages
  • The user's bookmarks can be considered as positive examples
  • All other web pages are unlabeled examples
  • A direct marketing company's current list of customers as positive examples
  • Text classification, where labeling is labor-intensive

5
Learning from Non-Traditional Input
  • x: an example
  • y: binary class label (either 0 or 1)
  • s: 1 if labeled, 0 if unlabeled
  • So intuitively, s = 1 implies y = 1. Also we have p(s = 1 | x, y = 0) = 0, since only positive examples are ever labeled.

6
Learning from Non-Traditional Input
  • The goal is to learn a function f(x) such that f(x) = p(y = 1 | x) as closely as possible. We call such a function f a traditional probabilistic classifier.
  • "Selected completely at random" assumption: the labeled positive examples are chosen uniformly at random from all positive examples, i.e. p(s = 1 | x, y = 1) = p(s = 1 | y = 1).

7
Learning from Non-Traditional Input
  • Let c = p(s = 1 | y = 1).
  • c is a constant probability, independent of x.
  • So, a training set is a random sample from a distribution p(x, y, s) that satisfies the previous equations.
  • Such a training set consists of two subsets, called the labeled (s = 1) and unlabeled (s = 0) sets.

8
Learning from Non-Traditional Input
  • Steps:
  • Feed the two sets as inputs to a standard training algorithm, using s as the target label. The algorithm will then yield a function g(x) such that g(x) = p(s = 1 | x) approximately (see the sketch below).
  • We call g(x) a nontraditional classifier.
  • Then we obtain f(x) from g(x).
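
A minimal sketch of this step, assuming scikit-learn; the names X (feature matrix) and s (0/1 labeled indicator) are hypothetical, not from the slides:

```python
# Train the nontraditional classifier g(x) = p(s = 1 | x).
# X and s are hypothetical inputs: features and the labeled indicator.
from sklearn.linear_model import LogisticRegression

def train_nontraditional(X, s):
    """Fit a probabilistic classifier that predicts p(s = 1 | x)."""
    g = LogisticRegression()
    g.fit(X, s)  # note: the target is s (labeled or not), not the true class y
    return g

# g.predict_proba(X)[:, 1] then approximates p(s = 1 | x).
```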

9
Learning from Non-Traditional Input
  • Suppose the "selected completely at random" assumption holds. Then p(y = 1 | x) = p(s = 1 | x) / c.
  • Easy to prove: since s = 1 implies y = 1, p(s = 1 | x) = p(y = 1 | x) p(s = 1 | y = 1, x) = p(y = 1 | x) c.
  • Everything is done; we just need to estimate the constant c (a conversion sketch follows below).
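
A sketch of the resulting conversion from g to f, continuing the hypothetical names above (g from the previous sketch, c_hat an estimate of c):

```python
# Convert the nontraditional classifier into a traditional one:
# f(x) = p(y = 1 | x) = g(x) / c. The names g, c_hat, X are hypothetical.
import numpy as np

def traditional_probs(g, X, c_hat):
    """Estimate p(y = 1 | x) by rescaling the nontraditional classifier."""
    g_x = g.predict_proba(X)[:, 1]         # g(x) = p(s = 1 | x)
    return np.clip(g_x / c_hat, 0.0, 1.0)  # divide by c; clip to keep a valid probability
```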

10
Estimating c
  • Let V be a validation set that is drawn from the overall distribution p(x, y, s) in the same manner as the nontraditional training set. Let P be the subset of examples in V that are labeled (and hence positive). The estimator of p(s = 1 | y = 1) is the average value of g(x) for x in P.
  • Formally, e1 = (1/n) Σ_{x in P} g(x), where n is the cardinality of P (see the sketch below).
  • There are other estimators for c.
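
A sketch of the e1 estimator, with hypothetical validation arrays V_X (features) and V_s (labeled indicators):

```python
# Estimate c = p(s = 1 | y = 1) as the average of g(x) over labeled examples in V.
def estimate_c(g, V_X, V_s):
    P = V_X[V_s == 1]                       # labeled (hence positive) examples in V
    return g.predict_proba(P)[:, 1].mean()  # e1 = (1/n) * sum of g(x) over P
```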

11
  • 500 positive data points and 1000 negative data points, each from a 2D Gaussian.
  • 20% of the positive data are used as labeled positive examples.
  • Based on a validation set of just 20 labeled examples, the estimated value is e1 = 0.1928, which is very close to the true value c = 0.2 (an end-to-end sketch follows below).
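
An end-to-end sketch of this synthetic experiment; the Gaussian parameters and the random seed are illustrative guesses, not the paper's:

```python
# Synthetic PU-learning experiment; the Gaussian means/scales are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
pos = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(500, 2))     # positives
neg = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(1000, 2))  # negatives
X = np.vstack([pos, neg])

# Label 20% of the positives completely at random, so the true c = 0.2.
s = np.zeros(len(X), dtype=int)
s[rng.choice(500, size=100, replace=False)] = 1

g = LogisticRegression().fit(X, s)  # nontraditional classifier
# For brevity, e1 is computed on the labeled training examples here,
# rather than on a separate validation set as in the slides.
c_hat = g.predict_proba(X[s == 1])[:, 1].mean()
print(c_hat)  # should land near the true value c = 0.2
```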

12
Weighting Unlabeled Examples
  • Let the goal be to estimate the expectation E_{p(x,y)}[h(x, y)] for any function h.

13
  • Each unlabeled example is duplicated:
  • one copy is made positive with weight w(x) = p(y = 1 | x, s = 0), and the other copy is made negative with weight 1 - w(x).
  • Under the assumption, w(x) = ((1 - c)/c) · g(x)/(1 - g(x)) (see the sketch below).
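
A sketch of the duplication-and-weighting scheme, reusing the hypothetical names g and c_hat from above; X_unlabeled is also an assumed name:

```python
# Duplicate each unlabeled example into a weighted positive and a weighted
# negative copy. g, c_hat, and X_unlabeled are hypothetical names.
import numpy as np

def weight_unlabeled(g, X_unlabeled, c_hat):
    g_x = g.predict_proba(X_unlabeled)[:, 1]
    g_x = np.clip(g_x, 1e-12, 1 - 1e-12)       # avoid division by zero
    w = (1 - c_hat) / c_hat * g_x / (1 - g_x)  # w(x) = p(y = 1 | x, s = 0)
    X_dup = np.vstack([X_unlabeled, X_unlabeled])
    y_dup = np.concatenate([np.ones(len(w)), np.zeros(len(w))])
    weights = np.concatenate([w, 1 - w])
    return X_dup, y_dup, weights  # usable with any learner that accepts sample_weight
```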

14
  • Applying the weighting with h(x, y) = y gives an estimator of p(y = 1): p(y = 1) ≈ (n + Σ_{x unlabeled} w(x)) / m, where n is the cardinality of the labeled training set and m is the cardinality of the whole training set (see the sketch below).
  • This result solves an open problem identified in D. Zhang and W. S. Lee, "A simple probabilistic approach to learning from positive and unlabeled examples," namely how to estimate p(y = 1) given only the type of nontraditional training set considered here.
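
A sketch of this estimator with the hypothetical names used above (w from weight_unlabeled, plus counts n_labeled and m_total):

```python
def estimate_prior(w, n_labeled, m_total):
    """Estimate p(y = 1): labeled examples count fully, unlabeled ones by weight w(x)."""
    return (n_labeled + w.sum()) / m_total
```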

15
Experiments on Real Data
  • Positive examples: 2453 records obtained from a specialized database named TCDB
  • Unlabeled examples: 4906 records selected randomly from SwissProt, excluding its intersection with TCDB
  • Q: the subset of actual positive examples inside U; 348 members
