1
Learning Classifiers from Only Positive and Unlabeled Data
Charles Elkan, Keith Noto
Presented by Liu Yi
2
Traditional way of building classifiers
  • Take two sets of examples
  • One set consists of positive examples of the concept to be learnt
  • The other set consists of negative examples
  • Use the positive and negative examples to train a classifier

3
What if we have non-traditional inputs?
  • It is often the case that the available training data are an incomplete set of positive examples and a set of unlabeled examples.
  • E.g. a specialized molecular biology database.
  • Such a database defines a set of positive examples (genes/proteins related to a certain disease or function). There is no information about examples that should not be included, and it would be unnatural to build such a set.

4
Other examples
  • Learning a user's preference for web pages
  • The user's bookmarks can be considered as positive examples
  • All other web pages are unlabeled examples
  • A direct marketing company's current list of customers as positive examples
  • Text classification, where labeling is labor-intensive

5
Learning from Non-Traditional Input
  • x: an example
  • y: binary class label (either 0 or 1)
  • s: 1 if labeled, 0 if unlabeled
  • So intuitively, s = 1 implies y = 1. Also we have p(s = 1 | x, y = 0) = 0, since only positive examples are ever labeled.

6
Learning from Non-Traditional Input
  • The goal is to learn a function f(x) such that f(x) = p(y = 1 | x) as closely as possible. We call such a function f a traditional probabilistic classifier.
  • "Selected completely at random" assumption: the labeled positive examples are chosen uniformly at random from all positive examples, i.e. p(s = 1 | x, y = 1) = p(s = 1 | y = 1).

7
Learning from Non-Traditional Input
  • Let c = p(s = 1 | y = 1).
  • c is a constant probability, independent of x.
  • So, a training set is a random sample from a distribution p(x, y, s) that satisfies the previous equations.
  • Such a training set consists of two subsets, called the labeled (s = 1) and unlabeled (s = 0) sets.

8
Learning from Non-Traditional Input
  • Steps:
  • Feed the two sets as inputs to a standard training algorithm, using s as the target label. The algorithm will then yield a function g(x) such that g(x) = p(s = 1 | x) approximately (see the sketch below).
  • We call g(x) a nontraditional classifier.
  • Then we obtain f(x) from g(x).
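
A minimal sketch of this step, assuming scikit-learn; the names X (feature matrix) and s (0/1 labeled indicator) are hypothetical, not from the slides:

```python
# Train the nontraditional classifier g(x) = p(s = 1 | x).
# X and s are hypothetical inputs: features and the labeled indicator.
from sklearn.linear_model import LogisticRegression

def train_nontraditional(X, s):
    """Fit a probabilistic classifier that predicts p(s = 1 | x)."""
    g = LogisticRegression()
    g.fit(X, s)  # note: the target is s (labeled or not), not the true class y
    return g

# g.predict_proba(X)[:, 1] then approximates p(s = 1 | x).
```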

9
Learning from Non-Traditional Input
  • Suppose the "selected completely at random" assumption holds. Then p(y = 1 | x) = p(s = 1 | x) / c.
  • Easy to prove: since s = 1 implies y = 1, p(s = 1 | x) = p(y = 1 | x) p(s = 1 | y = 1, x) = p(y = 1 | x) c.
  • Everything is done; we just need to estimate the constant c (a conversion sketch follows below).
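
A sketch of the resulting conversion from g to f, continuing the hypothetical names above (g from the previous sketch, c_hat an estimate of c):

```python
# Convert the nontraditional classifier into a traditional one:
# f(x) = p(y = 1 | x) = g(x) / c. The names g, c_hat, X are hypothetical.
import numpy as np

def traditional_probs(g, X, c_hat):
    """Estimate p(y = 1 | x) by rescaling the nontraditional classifier."""
    g_x = g.predict_proba(X)[:, 1]         # g(x) = p(s = 1 | x)
    return np.clip(g_x / c_hat, 0.0, 1.0)  # divide by c; clip to keep a valid probability
```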

10
Estimating c
  • Let V be a validation set that is drawn from the overall distribution p(x, y, s) in the same manner as the nontraditional training set. Let P be the subset of examples in V that are labeled (and hence positive). The estimator of p(s = 1 | y = 1) is the average value of g(x) for x in P.
  • Formally, e1 = (1/n) Σ_{x in P} g(x), where n is the cardinality of P (see the sketch below).
  • There are other estimators for c.
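
A sketch of the e1 estimator, with hypothetical validation arrays V_X (features) and V_s (labeled indicators):

```python
# Estimate c = p(s = 1 | y = 1) as the average of g(x) over labeled examples in V.
def estimate_c(g, V_X, V_s):
    P = V_X[V_s == 1]                       # labeled (hence positive) examples in V
    return g.predict_proba(P)[:, 1].mean()  # e1 = (1/n) * sum of g(x) over P
```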

11
  • 500 positive data points and 1000 negative data points, each from a 2D Gaussian.
  • 20% of the positive data are used as labeled positive examples.
  • Based on a validation set of just 20 labeled examples, the estimated value is e1 = 0.1928, which is very close to the true value c = 0.2 (an end-to-end sketch follows below).
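
An end-to-end sketch of this synthetic experiment; the Gaussian parameters and the random seed are illustrative guesses, not the paper's:

```python
# Synthetic PU-learning experiment; the Gaussian means/scales are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
pos = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(500, 2))     # positives
neg = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(1000, 2))  # negatives
X = np.vstack([pos, neg])

# Label 20% of the positives completely at random, so the true c = 0.2.
s = np.zeros(len(X), dtype=int)
s[rng.choice(500, size=100, replace=False)] = 1

g = LogisticRegression().fit(X, s)  # nontraditional classifier
# For brevity, e1 is computed on the labeled training examples here,
# rather than on a separate validation set as in the slides.
c_hat = g.predict_proba(X[s == 1])[:, 1].mean()
print(c_hat)  # should land near the true value c = 0.2
```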

12
Weighting Unlabeled Examples
  • Let the goal be to estimate the expectation E_{p(x,y)}[h(x, y)] for any function h.

13
  • Each unlabeled example is duplicated:
  • one copy is made positive with weight w(x) = p(y = 1 | x, s = 0), and the other copy is made negative with weight 1 - w(x).
  • Under the assumption, w(x) = ((1 - c)/c) · g(x)/(1 - g(x)) (see the sketch below).
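
A sketch of the duplication-and-weighting scheme, reusing the hypothetical names g and c_hat from above; X_unlabeled is also an assumed name:

```python
# Duplicate each unlabeled example into a weighted positive and a weighted
# negative copy. g, c_hat, and X_unlabeled are hypothetical names.
import numpy as np

def weight_unlabeled(g, X_unlabeled, c_hat):
    g_x = g.predict_proba(X_unlabeled)[:, 1]
    g_x = np.clip(g_x, 1e-12, 1 - 1e-12)       # avoid division by zero
    w = (1 - c_hat) / c_hat * g_x / (1 - g_x)  # w(x) = p(y = 1 | x, s = 0)
    X_dup = np.vstack([X_unlabeled, X_unlabeled])
    y_dup = np.concatenate([np.ones(len(w)), np.zeros(len(w))])
    weights = np.concatenate([w, 1 - w])
    return X_dup, y_dup, weights  # usable with any learner that accepts sample_weight
```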

14
  • Applying the weighting with h(x, y) = y gives an estimator of p(y = 1): p(y = 1) ≈ (n + Σ_{x unlabeled} w(x)) / m, where n is the cardinality of the labeled training set and m is the cardinality of the whole training set (see the sketch below).
  • This result solves an open problem identified in D. Zhang and W. S. Lee, "A simple probabilistic approach to learning from positive and unlabeled examples," namely how to estimate p(y = 1) given only the type of nontraditional training set considered here.
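
A sketch of this estimator with the hypothetical names used above (w from weight_unlabeled, plus counts n_labeled and m_total):

```python
def estimate_prior(w, n_labeled, m_total):
    """Estimate p(y = 1): labeled examples count fully, unlabeled ones by weight w(x)."""
    return (n_labeled + w.sum()) / m_total
```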

15
Experiments on Real Data
  • Positive examples: 2453 records obtained from a specialized database named TCDB
  • Unlabeled examples: 4906 records selected randomly from SwissProt, excluding its intersection with TCDB
  • Q: the subset of actual positive examples inside U; 348 members
