Title: A PAC Model for Learning from Labeled and Unlabeled Data
1. A PAC Model for Learning from Labeled and Unlabeled Data
- Maria-Florina Balcan, Avrim Blum
- Carnegie Mellon University, Computer Science Department
2. Outline of the talk
- Supervised Learning
  - PAC Model
  - Sample Complexity
  - Algorithm Design
- Semi-supervised Learning
  - A PAC-Style Model
  - Examples of results in our model
    - Sample Complexity
    - Algorithmic Issues: Co-training of linear separators
- Conclusions
  - Implications of our Analysis
3. Usual Supervised Learning Problem
- Imagine you want a computer program to help you decide which email messages are spam and which are important.
- Might represent each message by n features (e.g., return address, keywords, spelling, etc.).
- Take a sample S of data, labeled according to whether they were/weren't spam.
- Goal of algorithm is to use data seen so far to produce a good prediction rule (a "hypothesis") h for future data.
4. The concept learning setting
E.g.,
- Given data, some reasonable rules might be:
  - Predict SPAM if unknown AND (sex OR sales)
  - Predict SPAM if sales + sex - known > 0
  - ...
5. Supervised Learning: Big Questions
- Algorithm Design
  - How might we automatically generate rules that do well on observed data?
- Sample Complexity / Confidence Bounds
  - What kind of confidence do we have that they will do well in the future?
6. Supervised Learning Formalization (PAC)
- PAC model: nice/standard model for learning from labeled data.
- X - instance space
- S = {(x, l)} - set of labeled examples
  - examples assumed to be drawn i.i.d. from some distribution D over X and labeled by some target concept c
  - labels l ∈ {-1, 1} - binary classification
- Want to do optimization over S to find some hypothesis h, but we want h to have small error over D:
  err(h) = Pr_{x ~ D}[h(x) ≠ c(x)]
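As a quick aside (an illustrative sketch, not from the original slides), the empirical counterpart of err(h) is just the observed fraction of disagreements on S; the toy hypothesis and sample below are made up:

```python
def empirical_error(h, S):
    """Fraction of labeled examples (x, label) in S that h misclassifies;
    the finite-sample stand-in for err(h) = Pr_{x ~ D}[h(x) != c(x)]."""
    return sum(1 for x, label in S if h(x) != label) / len(S)

# Toy example: h predicts spam (+1) iff the message mentions "sales".
h = lambda msg: 1 if "sales" in msg else -1
S = [("cheap sales now", 1), ("meeting at noon", -1), ("quarterly sales report", -1)]
print(empirical_error(h, S))  # 0.333...: the internal sales report fools h
```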
7. Basic PAC Learning Definitions
- Algorithm A PAC-learns concept class C if for any target c in C, any distribution D over X, any ε, δ > 0:
  - A uses at most poly(n, 1/ε, 1/δ, size(c)) examples and running time.
  - With probability 1 - δ, A produces h in C of error at most ε.
- Notation:
  - err(h) - true error of h (over D)
  - err_S(h) - empirical error of h (over the sample S)
8. Sample Complexity: Uniform Convergence, Finite Hypothesis Spaces
- Realizable Case
  - 1. Prob. that a bad hypothesis (err > ε) is consistent with m examples is at most (1 - ε)^m.
  - 2. So, prob. there exists a bad consistent hypothesis is at most |C|(1 - ε)^m.
  - 3. Set this to δ and solve: the number of examples needed is at most (1/ε)(ln|C| + ln(1/δ)) (evaluated in the sketch below).
- If not too many rules to choose from, then it is unlikely some bad one will fool you just by chance.
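To make the bound concrete, here is a small calculator (an illustrative sketch, not part of the talk) for the quantity in step 3; the example class size is hypothetical:

```python
import math

def realizable_sample_size(num_hypotheses, eps, delta):
    """Evaluates m = (1/eps) * (ln|C| + ln(1/delta)): enough labeled examples
    so that, w.p. >= 1 - delta, no hypothesis with error > eps survives as
    consistent with all of them (realizable case)."""
    return math.ceil((math.log(num_hypotheses) + math.log(1 / delta)) / eps)

# Hypothetical class: all 2^20 monotone disjunctions over n = 20 variables.
print(realizable_sample_size(2**20, eps=0.05, delta=0.05))  # 338
```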
9. Sample Complexity: Uniform Convergence, Finite Hypothesis Spaces
- Realizable Case
- Agnostic Case
  - What if there is no perfect h?
  - Standard Hoeffding plus union bound: m = (1/(2ε²))(ln|C| + ln(2/δ)) examples suffice so that w.h.p. every h in C has |err(h) - err_S(h)| ≤ ε.
  - Gives hope for local optimization over the training data.
10. Shattering, VC-dimension
- Def: A set of points S is shattered by a concept class C if there are concepts in C that split S in all 2^|S| possible ways.
- VC-dimension of C is the size of the largest set of points that can be shattered by C.
- Example: C = the class of subintervals [a, b], 0 ≤ a ≤ b ≤ 1.
  - VC-dim(C) = 2: two points can be shattered, but for three points x1 < x2 < x3 no interval labels x1 and x3 positive while labeling x2 negative (verified by brute force in the sketch below).
- C[S] = the set of splittings of dataset S using concepts from C.
- C[m] = maximum number of ways to split m points using concepts in C, i.e. C[m] = max_{|S| = m} |C[S]|.
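The interval example can be checked exhaustively. The sketch below (illustrative, not from the slides) enumerates all splittings achievable by subintervals [a, b]:

```python
def interval_splits(points):
    """All labelings of `points` achievable by some subinterval [a, b]:
    x is labeled positive iff a <= x <= b. It suffices to try intervals
    whose endpoints are data points, plus the empty (all-negative) labeling."""
    labelings = {tuple(False for _ in points)}
    for a in points:
        for b in points:
            if a <= b:
                labelings.add(tuple(a <= x <= b for x in points))
    return labelings

print(len(interval_splits([0.2, 0.7])))       # 4 = 2^2: two points are shattered
print(len(interval_splits([0.2, 0.5, 0.7])))  # 7 < 2^3: (+, -, +) is unachievable
```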
11. Sample Complexity: Uniform Convergence, Infinite Hypothesis Spaces
- C[S] = the set of splittings of dataset S using concepts from C.
- C[m] = maximum number of ways to split m points using concepts in C, i.e. C[m] = max_{|S| = m} |C[S]|.
- C[m, D] = expected number of splits of m points drawn from D with concepts in C.
- Neat Fact 1: the previous results still hold if we replace |C| with C[2m].
- Neat Fact 2: can even replace with C[2m, D].
12. Sample Complexity: Uniform Convergence, Infinite Hypothesis Spaces
- For instance:
- Sauer's Lemma, C[m] = O(m^{VC-dim(C)}), implies that m = O((1/ε)(VC-dim(C) ln(1/ε) + ln(1/δ))) examples suffice in the realizable case.
13. Sample Complexity: Uniform Convergence, Infinite Hypothesis Spaces
14. Outline of the talk
- Supervised Learning
  - PAC Model
  - Sample Complexity
  - Algorithms
- Semi-supervised Learning
  - Proposed Model
  - Examples of results in our model
    - Sample Complexity
    - Algorithmic Issues: Co-training of linear separators
- Conclusions
  - Implications of our Analysis
15. Combining Labeled and Unlabeled Data (a.k.a. Semi-supervised Learning)
- Hot topic in recent years in Machine Learning.
- Many applications have lots of unlabeled data, but labeled data is rare or expensive:
  - Web page, document classification
  - OCR, image classification
16. Combining Labeled and Unlabeled Data
- Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
  - Transductive SVM
  - Co-training
  - Graph-based methods
17. Can we extend the PAC model to deal with Unlabeled Data?
- PAC model: nice/standard model for learning from labeled data.
- Goal: extend it naturally to the case of learning from both labeled and unlabeled data.
- Different algorithms are based on different assumptions about how data should behave.
- Question: how to capture many of the assumptions typically used?
18. Example of a typical assumption
- The separator goes through low-density regions of the space (large margin).
- Assume we are looking for a linear separator.
- Belief: there should exist one with large separation.
19. Another Example
- Agreement between two parts: co-training.
- Examples contain two sufficient sets of features, i.e. an example is x = ⟨x1, x2⟩, and the belief is that the two parts of the example are consistent, i.e. there exist c1, c2 such that c1(x1) = c2(x2) = c(x).
- For example, if we want to classify web pages: x = ⟨x1, x2⟩.
20. Co-training
[Figure: a web page as two views - x1 = text info on the page, x2 = link info pointing to the page]
21. Proposed Model
- Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution.
- "Learn C" becomes "learn (C, χ)" (i.e. learn class C under compatibility notion χ).
- Express relationships that one hopes the target function and underlying distribution will possess.
- Goal: use unlabeled data and the belief that the target is compatible to reduce C down to just the highly compatible functions in C.
22. Proposed Model, cont'd
- Goal: use unlabeled data and our belief to reduce size(C) down to size(highly compatible functions in C) in the previous bounds.
- Want to be able to analyze how much unlabeled data is needed to uniformly estimate compatibilities well.
- Require that the degree of compatibility be something that can be estimated from a finite sample.
23. Proposed Model, cont'd
- Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution.
- Require that the degree of compatibility be something that can be estimated from a finite sample.
- Require χ to be an expectation over individual examples:
  - χ(h, D) = E_{x ~ D}[χ(h, x)] - compatibility of h with D, where χ(h, x) ∈ [0, 1]
  - err_unl(h) = 1 - χ(h, D) - incompatibility of h with D (the unlabeled error rate of h)
24. Margins, Compatibility
- Margins: the belief is that there should exist a large-margin separator.
- Incompatibility of h and D (unlabeled error rate of h) = the probability mass within distance γ of h.
- Can be written as an expectation over individual examples, χ(h, D) = E_{x ~ D}[χ(h, x)], where (estimated from a finite sample in the sketch below):
  - χ(h, x) = 0 if dist(x, h) < γ
  - χ(h, x) = 1 if dist(x, h) ≥ γ
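A minimal sketch (not from the talk) of estimating this unlabeled error rate from a finite unlabeled sample, for a linear separator given by (w, b); all names and data here are hypothetical:

```python
import numpy as np

def unlabeled_error_margin(w, b, gamma, X_unlabeled):
    """Empirical err_unl(h) for the margin notion: the fraction of unlabeled
    points within distance gamma of the hyperplane w.x + b = 0
    (i.e., points x with chi(h, x) = 0)."""
    dists = np.abs(X_unlabeled @ w + b) / np.linalg.norm(w)
    return float(np.mean(dists < gamma))

# Hypothetical data: 1000 unlabeled points, separator x_1 = 0, margin 0.25.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
print(unlabeled_error_margin(np.array([1.0, 0.0]), 0.0, 0.25, X))  # ~0.2
```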
25. Margins, Compatibility
- Margins: the belief is that there should exist a large-margin separator.
- If we do not want to commit to γ in advance, we can define χ(h, x) to be a smooth function of dist(x, h), e.g. one that increases from 0 to 1 as dist(x, h) grows.
- Illegal notion of compatibility: the largest γ s.t. D has probability mass exactly zero within distance γ of h - this is not an expectation over individual examples and cannot be estimated from a finite sample.
26. Co-training, Compatibility
- Co-training: examples come as pairs ⟨x1, x2⟩ and the goal is to learn a pair of functions ⟨h1, h2⟩.
- Hope is that the two parts of the example are consistent.
- Legal (and natural) notion of compatibility (see the sketch below):
  - the compatibility of h = ⟨h1, h2⟩ and D is Pr_{⟨x1, x2⟩ ~ D}[h1(x1) = h2(x2)]
  - can be written as an expectation over examples: χ(h, ⟨x1, x2⟩) = 1 if h1(x1) = h2(x2), and 0 otherwise
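This compatibility is again directly estimable from unlabeled pairs. A minimal sketch (the views h1, h2 and the toy pairs are made up for illustration):

```python
def cotrain_unlabeled_error(h1, h2, pairs):
    """Empirical err_unl(<h1, h2>): the fraction of unlabeled pairs <x1, x2>
    on which the two views disagree, i.e. 1 - Pr[h1(x1) = h2(x2)]."""
    return sum(1 for x1, x2 in pairs if h1(x1) != h2(x2)) / len(pairs)

# Toy views: x1 = page text, x2 = anchor text of inbound links.
h1 = lambda text: 1 if "course" in text else -1
h2 = lambda links: 1 if "syllabus" in links else -1
pairs = [("course homepage", "syllabus and lectures"),
         ("news article", "breaking story")]
print(cotrain_unlabeled_error(h1, h2, pairs))  # 0.0: fully compatible on this sample
```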
27. Examples of results in our model: Sample Complexity - Uniform convergence bounds
- Finite Hypothesis Spaces, Doubly Realizable Case
- Assume χ(h, x) ∈ {0, 1} and define C_{D,χ}(ε) = {h ∈ C : err_unl(h) ≤ ε}.
- Theorem (informal): given (1/ε)(ln|C| + ln(2/δ)) unlabeled examples and (1/ε)(ln|C_{D,χ}(ε)| + ln(2/δ)) labeled examples, w.p. ≥ 1 - δ every h ∈ C that is consistent with the labeled data and fully compatible with the unlabeled data has err(h) ≤ ε.
- Bound the number of labeled examples as a measure of the helpfulness of D w.r.t. χ:
  - a helpful distribution is one in which C_{D,χ}(ε) is small.
28. Semi-Supervised Learning: Natural Formalization (PAC_χ)
- We will say an algorithm "PAC_χ-learns" if it runs in poly time using samples polynomial in the respective bounds.
- E.g., can think of ln|C| as the number of bits to describe the target without knowing D, and ln|C_{D,χ}(ε)| as the number of bits to describe the target knowing a good approximation to D, given the assumption that the target has low unlabeled error rate.
29. Examples of results in our model: Sample Complexity - Uniform convergence bounds
- Finite Hypothesis Spaces, c not fully compatible
- Theorem
30. Examples of results in our model: Sample Complexity - Uniform convergence bounds
- Infinite Hypothesis Spaces
- Assume χ(h, x) ∈ {0, 1} and let χ(C) = {χ_h : h ∈ C}, where χ_h(x) = χ(h, x).
31. Examples of results in our model: Sample Complexity - Uniform convergence bounds
- For S ⊆ X, denote by U_S the uniform distribution over S, and by C[m, U_S] the expected number of splits of m points from U_S with concepts in C.
- Assume err(c) = 0 and err_unl(c) = 0.
- Theorem
- The number of labeled examples depends on the unlabeled sample.
- Useful since we can imagine the learning algorithm performing some calculations over the unlabeled data and then deciding how many labeled examples to purchase.
32. Examples of results in our model: Sample Complexity, ε-Cover-based bounds
- For algorithms that behave in a specific way:
  - first use the unlabeled data to choose a representative set of compatible hypotheses
  - then use the labeled sample to choose among these
- Theorem
33. Examples of results in our model
- Let's look at some algorithms.
34. Examples of results in our model, Algorithmic Issues: Algorithm for a simple (C, χ)
- X = {0,1}^n, C = class of disjunctions (e.g., h = x1 ∨ x2 ∨ x3 ∨ x4 ∨ x7)
- For x ∈ X, let vars(x) be the set of variables set to 1 by x.
- For h ∈ C, let vars(h) be the set of variables disjoined by h.
- χ(h, x) = 1 if either vars(x) ⊆ vars(h) or vars(x) ∩ vars(h) = ∅.
- Strong notion of margin:
  - every variable is either a positive indicator or a negative indicator
  - no example should contain both positive and negative indicators
- Can give a simple PAC_χ-learning algorithm for this pair (C, χ).
35. Examples of results in our model, Algorithmic Issues: Algorithm for a simple (C, χ)
- Use the unlabeled sample U to build a graph G on n vertices:
  - put an edge between i and j if there exists x in U with i, j ∈ vars(x).
- Use the labeled data L to label the connected components.
- Output h s.t. vars(h) is the union of the positively-labeled components.
- If c is fully compatible, then no component will get both positive and negative labels.
- If |U| and |L| are as given in the bounds, then w.h.p. err(h) ≤ ε.
- Worked example (n = 6, implemented in the sketch below): unlabeled set U = {011000, 101000, 001000, 000011, 100100}; labeled set L = {(100000, +), (000011, -)}; target c = x1 ∨ x2 ∨ x3 ∨ x4. The edges merge variables 1-4 into one component and 5-6 into another; the labels mark the first component positive and the second negative, recovering c.
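A sketch of this algorithm in code, run on the slide's own 6-variable example (the union-find representation is an implementation choice, not prescribed by the talk):

```python
def find(parent, i):
    """Root of i's component, with path compression."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def learn_disjunction(n, U, L):
    """U: unlabeled bit-vectors; L: (bit-vector, label) pairs.
    Returns the set of variable indices in the learned disjunction."""
    parent = list(range(n))
    for x in U:
        on = [i for i in range(n) if x[i] == 1]
        for i in on[1:]:                      # join all variables set to 1 in x
            parent[find(parent, i)] = find(parent, on[0])
    positive = set()                          # roots of positively-labeled components
    for x, label in L:
        if label == 1:
            for i in range(n):
                if x[i] == 1:
                    positive.add(find(parent, i))
    return {i for i in range(n) if find(parent, i) in positive}

U = [(0,1,1,0,0,0), (1,0,1,0,0,0), (0,0,1,0,0,0), (0,0,0,0,1,1), (1,0,0,1,0,0)]
L = [((1,0,0,0,0,0), 1), ((0,0,0,0,1,1), -1)]
print(sorted(i + 1 for i in learn_disjunction(6, U, L)))  # [1, 2, 3, 4], i.e. x1 v x2 v x3 v x4
```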
36. Examples of results in our model, Algorithmic Issues: Algorithm for a simple (C, χ)
- Especially non-helpful distribution: the uniform distribution over all examples x with |vars(x)| = 1
  - get n components, so still need Ω(n) labeled examples
- Helpful distribution: one such that w.h.p. the number of components is small
  - need fewer labeled examples
37. Examples of results in our model, Algorithmic Issues: Co-training of linear separators
- Examples ⟨x1, x2⟩ ∈ R^n × R^n.
- Target functions c1 and c2 are linear separators; assume c1 = c2 = c, and that no pair crosses the target plane.
- For f a linear separator in R^n, err_unl(f) = the fraction of pairs that cross f's boundary.
- Consistency problem: given a set of labeled and unlabeled examples, find a separator that is consistent with the labeled examples and compatible with the unlabeled ones.
  - It is NP-hard [Flaxman].
38. Examples of results in our model, Algorithmic Issues: Co-training of linear separators
- Assume independence given the label (both points drawn from D+ or both from D-).
- [Blum & Mitchell '98] show one can co-train (in polynomial time) given enough labeled data to produce a weakly-useful hypothesis to begin with.
- We show one can learn with only a single labeled example.
- Key point: independence given the label implies that the functions with low err_unl rate are:
  - close to c
  - close to ¬c
  - close to the all-positive function
  - close to the all-negative function
39. Examples of results in our model, Algorithmic Issues: Co-training of linear separators
- Nice tool: a super-simple algorithm for weakly learning a large-margin separator:
  - pick a hyperplane at random (see the toy sketch below)
- If the margin is ≥ 1/poly(n), then a random hyperplane has at least a 1/poly(n) chance of being a weak predictor.
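A toy rendering of this weak learner (illustrative only; the accuracy test here peeks at labels for brevity, whereas the talk's algorithm works with err_unl):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_halfspace_weak_learner(X, y, tries=200):
    """Sample random hyperplanes through the origin and keep the first whose
    accuracy is bounded away from 1/2 (a weakly-useful predictor)."""
    n = X.shape[1]
    for _ in range(tries):
        w = rng.normal(size=n)            # a uniformly random direction
        acc = np.mean(np.sign(X @ w) == y)
        if abs(acc - 0.5) > 0.1:          # weakly useful: a 60/40 split or better
            return w if acc > 0.5 else -w
    return None

# Large-margin toy data: the true separator is the first coordinate.
X = rng.normal(size=(500, 5))
y = np.sign(X[:, 0])
w = random_halfspace_weak_learner(X, y)
print(w is not None and np.mean(np.sign(X @ w) == y) > 0.6)  # True w.h.p.
```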
40. Examples of results in our model, Algorithmic Issues: Co-training of linear separators
- Assume independence given the label.
- Draw a large unlabeled sample S = {⟨x1^i, x2^i⟩}.
- If we also assume a large margin:
  - run the super-simple algorithm poly(n) times
  - feed each random hyperplane into the [Blum & Mitchell] booster
  - examine all the hypotheses produced, and pick one h with small err_unl that is far from the all-positive and all-negative functions
  - use the labeled example to choose either h or ¬h
  - w.h.p. one of the random hyperplanes was a weakly-useful predictor, so on at least one of these runs we end up with a hypothesis h with small err(h), and hence with small err_unl(h)
- If we don't assume a large margin:
  - use the Outlier Removal Lemma to make sure that at least a 1/poly fraction of the points in S1 = {x1^i} have margin at least 1/poly - this is sufficient.
41. Implications of our analysis: Ways in which unlabeled data can help
- If the target is highly compatible with D and we have enough unlabeled data to estimate χ over all h ∈ C, then we can reduce the search space (from C down to just those h ∈ C whose estimated unlabeled error rate is low).
- By providing an estimate of D, unlabeled data can allow a more refined distribution-specific notion of hypothesis space size (such as Annealed VC-entropy or the size of the smallest ε-cover).
- If D is nice, so that the set of compatible h ∈ C has a small ε-cover and the elements of the cover are far apart, then we can learn from even fewer labeled examples than the 1/ε needed just to verify a good hypothesis.
42. Questions?
43. Thank you!