Title: A PAC Model for Learning from Labeled and Unlabeled Data
1. A PAC Model for Learning from Labeled and Unlabeled Data
- Maria-Florina Balcan, Avrim Blum
- Carnegie Mellon University, Computer Science Department
2. Outline of the talk
- Supervised Learning
  - PAC Model
  - Sample Complexity
  - Algorithm Design
- Semi-supervised Learning
  - A PAC-Style Model
  - Examples of results in our model
    - Sample Complexity
    - Algorithmic Issues: Co-training of linear separators
- Conclusions
  - Implications of our Analysis
3. Usual Supervised Learning Problem
- Imagine you want a computer program to help you decide which email messages are spam and which are important.
- Might represent each message by n features (e.g., return address, keywords, spelling, etc.).
- Take a sample S of data, labeled according to whether they were/weren't spam.
- Goal of algorithm is to use data seen so far to produce a good prediction rule (a "hypothesis") h for future data.
4. The concept learning setting
E.g.,
- Given data, some reasonable rules might be:
  - Predict SPAM if unknown AND (sex OR sales)
  - Predict SPAM if sales + sex - known > 0
  - ...
5. Supervised Learning: Big Questions
- Algorithm Design
  - How might we automatically generate rules that do well on observed data?
- Sample Complexity / Confidence Bounds
  - What kind of confidence do we have that they will do well in the future?
6. Supervised Learning Formalization (PAC)
- PAC model: nice/standard model for learning from labeled data.
- X - instance space
- S = {(x, l)} - set of labeled examples
  - examples assumed to be drawn i.i.d. from some distribution D over X and labeled by some target concept c
  - labels l ∈ {-1, 1} - binary classification
- Want to do optimization over S to find some hypothesis h, but we want h to have small error over D:
  err(h) = Pr_{x ~ D}[h(x) ≠ c(x)]
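As a quick aside (an illustrative sketch, not from the original slides), the empirical counterpart of err(h) is just the observed fraction of disagreements on S; the toy hypothesis and sample below are made up:

```python
def empirical_error(h, S):
    """Fraction of labeled examples (x, label) in S that h misclassifies;
    the finite-sample stand-in for err(h) = Pr_{x ~ D}[h(x) != c(x)]."""
    return sum(1 for x, label in S if h(x) != label) / len(S)

# Toy example: h predicts spam (+1) iff the message mentions "sales".
h = lambda msg: 1 if "sales" in msg else -1
S = [("cheap sales now", 1), ("meeting at noon", -1), ("quarterly sales report", -1)]
print(empirical_error(h, S))  # 0.333...: the internal sales report fools h
```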
7. Basic PAC Learning Definitions
- Algorithm A PAC-learns concept class C if for any target c in C, any distribution D over X, any ε, δ > 0:
  - A uses at most poly(n, 1/ε, 1/δ, size(c)) examples and running time.
  - With probability 1 - δ, A produces h in C of error at most ε.
- Notation:
  - err(h) - true error of h (over D)
  - err_S(h) - empirical error of h (over the sample S)
8. Sample Complexity: Uniform Convergence, Finite Hypothesis Spaces
- Realizable Case
  - 1. Prob. that a bad hypothesis (err > ε) is consistent with m examples is at most (1 - ε)^m.
  - 2. So, prob. there exists a bad consistent hypothesis is at most |C|(1 - ε)^m.
  - 3. Set this to δ and solve: the number of examples needed is at most (1/ε)(ln|C| + ln(1/δ)) (evaluated in the sketch below).
- If not too many rules to choose from, then it is unlikely some bad one will fool you just by chance.
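To make the bound concrete, here is a small calculator (an illustrative sketch, not part of the talk) for the quantity in step 3; the example class size is hypothetical:

```python
import math

def realizable_sample_size(num_hypotheses, eps, delta):
    """Evaluates m = (1/eps) * (ln|C| + ln(1/delta)): enough labeled examples
    so that, w.p. >= 1 - delta, no hypothesis with error > eps survives as
    consistent with all of them (realizable case)."""
    return math.ceil((math.log(num_hypotheses) + math.log(1 / delta)) / eps)

# Hypothetical class: all 2^20 monotone disjunctions over n = 20 variables.
print(realizable_sample_size(2**20, eps=0.05, delta=0.05))  # 338
```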
9. Sample Complexity: Uniform Convergence, Finite Hypothesis Spaces
- Realizable Case
- Agnostic Case
  - What if there is no perfect h?
  - Standard Hoeffding plus union bound: m = (1/(2ε²))(ln|C| + ln(2/δ)) examples suffice so that w.h.p. every h in C has |err(h) - err_S(h)| ≤ ε.
  - Gives hope for local optimization over the training data.
10. Shattering, VC-dimension
- Def: A set of points S is shattered by a concept class C if there are concepts in C that split S in all 2^|S| possible ways.
- VC-dimension of C is the size of the largest set of points that can be shattered by C.
- Example: C = the class of subintervals [a, b], 0 ≤ a ≤ b ≤ 1.
  - VC-dim(C) = 2: two points can be shattered, but for three points x1 < x2 < x3 no interval labels x1 and x3 positive while labeling x2 negative (verified by brute force in the sketch below).
- C[S] = the set of splittings of dataset S using concepts from C.
- C[m] = maximum number of ways to split m points using concepts in C, i.e. C[m] = max_{|S| = m} |C[S]|.
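The interval example can be checked exhaustively. The sketch below (illustrative, not from the slides) enumerates all splittings achievable by subintervals [a, b]:

```python
def interval_splits(points):
    """All labelings of `points` achievable by some subinterval [a, b]:
    x is labeled positive iff a <= x <= b. It suffices to try intervals
    whose endpoints are data points, plus the empty (all-negative) labeling."""
    labelings = {tuple(False for _ in points)}
    for a in points:
        for b in points:
            if a <= b:
                labelings.add(tuple(a <= x <= b for x in points))
    return labelings

print(len(interval_splits([0.2, 0.7])))       # 4 = 2^2: two points are shattered
print(len(interval_splits([0.2, 0.5, 0.7])))  # 7 < 2^3: (+, -, +) is unachievable
```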
11. Sample Complexity: Uniform Convergence, Infinite Hypothesis Spaces
- C[S] = the set of splittings of dataset S using concepts from C.
- C[m] = maximum number of ways to split m points using concepts in C, i.e. C[m] = max_{|S| = m} |C[S]|.
- C[m, D] = expected number of splits of m points drawn from D with concepts in C.
- Neat Fact 1: the previous results still hold if we replace |C| with C[2m].
- Neat Fact 2: can even replace with C[2m, D].
12. Sample Complexity: Uniform Convergence, Infinite Hypothesis Spaces
- For instance:
- Sauer's Lemma, C[m] = O(m^{VC-dim(C)}), implies that m = O((1/ε)(VC-dim(C) ln(1/ε) + ln(1/δ))) examples suffice in the realizable case.
13. Sample Complexity: Uniform Convergence, Infinite Hypothesis Spaces
14. Outline of the talk
- Supervised Learning
  - PAC Model
  - Sample Complexity
  - Algorithms
- Semi-supervised Learning
  - Proposed Model
  - Examples of results in our model
    - Sample Complexity
    - Algorithmic Issues: Co-training of linear separators
- Conclusions
  - Implications of our Analysis
15. Combining Labeled and Unlabeled Data (a.k.a. Semi-supervised Learning)
- Hot topic in recent years in Machine Learning.
- Many applications have lots of unlabeled data, but labeled data is rare or expensive:
  - Web page, document classification
  - OCR, image classification
16. Combining Labeled and Unlabeled Data
- Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
  - Transductive SVM
  - Co-training
  - Graph-based methods
17. Can we extend the PAC model to deal with Unlabeled Data?
- PAC model: nice/standard model for learning from labeled data.
- Goal: extend it naturally to the case of learning from both labeled and unlabeled data.
- Different algorithms are based on different assumptions about how data should behave.
- Question: how to capture many of the assumptions typically used?
18. Example of a typical assumption
- The separator goes through low-density regions of the space (large margin).
- Assume we are looking for a linear separator.
- Belief: there should exist one with large separation.
19. Another Example
- Agreement between two parts: co-training.
- Examples contain two sufficient sets of features, i.e. an example is x = ⟨x1, x2⟩, and the belief is that the two parts of the example are consistent, i.e. there exist c1, c2 such that c1(x1) = c2(x2) = c(x).
- For example, if we want to classify web pages: x = ⟨x1, x2⟩.
20. Co-training
[Figure: a web page as two views - x1 = text info on the page, x2 = link info pointing to the page]
21. Proposed Model
- Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution.
- "Learn C" becomes "learn (C, χ)" (i.e. learn class C under compatibility notion χ).
- Express relationships that one hopes the target function and underlying distribution will possess.
- Goal: use unlabeled data and the belief that the target is compatible to reduce C down to just the highly compatible functions in C.
22. Proposed Model, cont'd
- Goal: use unlabeled data and our belief to reduce size(C) down to size(highly compatible functions in C) in the previous bounds.
- Want to be able to analyze how much unlabeled data is needed to uniformly estimate compatibilities well.
- Require that the degree of compatibility be something that can be estimated from a finite sample.
23. Proposed Model, cont'd
- Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution.
- Require that the degree of compatibility be something that can be estimated from a finite sample.
- Require χ to be an expectation over individual examples:
  - χ(h, D) = E_{x ~ D}[χ(h, x)] - compatibility of h with D, where χ(h, x) ∈ [0, 1]
  - err_unl(h) = 1 - χ(h, D) - incompatibility of h with D (the unlabeled error rate of h)
24. Margins, Compatibility
- Margins: the belief is that there should exist a large-margin separator.
- Incompatibility of h and D (unlabeled error rate of h) = the probability mass within distance γ of h.
- Can be written as an expectation over individual examples, χ(h, D) = E_{x ~ D}[χ(h, x)], where (estimated from a finite sample in the sketch below):
  - χ(h, x) = 0 if dist(x, h) < γ
  - χ(h, x) = 1 if dist(x, h) ≥ γ
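A minimal sketch (not from the talk) of estimating this unlabeled error rate from a finite unlabeled sample, for a linear separator given by (w, b); all names and data here are hypothetical:

```python
import numpy as np

def unlabeled_error_margin(w, b, gamma, X_unlabeled):
    """Empirical err_unl(h) for the margin notion: the fraction of unlabeled
    points within distance gamma of the hyperplane w.x + b = 0
    (i.e., points x with chi(h, x) = 0)."""
    dists = np.abs(X_unlabeled @ w + b) / np.linalg.norm(w)
    return float(np.mean(dists < gamma))

# Hypothetical data: 1000 unlabeled points, separator x_1 = 0, margin 0.25.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
print(unlabeled_error_margin(np.array([1.0, 0.0]), 0.0, 0.25, X))  # ~0.2
```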
25. Margins, Compatibility
- Margins: the belief is that there should exist a large-margin separator.
- If we do not want to commit to γ in advance, we can define χ(h, x) to be a smooth function of dist(x, h), e.g. one that increases from 0 to 1 as dist(x, h) grows.
- Illegal notion of compatibility: the largest γ s.t. D has probability mass exactly zero within distance γ of h - this is not an expectation over individual examples and cannot be estimated from a finite sample.
26. Co-training, Compatibility
- Co-training: examples come as pairs ⟨x1, x2⟩ and the goal is to learn a pair of functions ⟨h1, h2⟩.
- Hope is that the two parts of the example are consistent.
- Legal (and natural) notion of compatibility (see the sketch below):
  - the compatibility of h = ⟨h1, h2⟩ and D is Pr_{⟨x1, x2⟩ ~ D}[h1(x1) = h2(x2)]
  - can be written as an expectation over examples: χ(h, ⟨x1, x2⟩) = 1 if h1(x1) = h2(x2), and 0 otherwise
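This compatibility is again directly estimable from unlabeled pairs. A minimal sketch (the views h1, h2 and the toy pairs are made up for illustration):

```python
def cotrain_unlabeled_error(h1, h2, pairs):
    """Empirical err_unl(<h1, h2>): the fraction of unlabeled pairs <x1, x2>
    on which the two views disagree, i.e. 1 - Pr[h1(x1) = h2(x2)]."""
    return sum(1 for x1, x2 in pairs if h1(x1) != h2(x2)) / len(pairs)

# Toy views: x1 = page text, x2 = anchor text of inbound links.
h1 = lambda text: 1 if "course" in text else -1
h2 = lambda links: 1 if "syllabus" in links else -1
pairs = [("course homepage", "syllabus and lectures"),
         ("news article", "breaking story")]
print(cotrain_unlabeled_error(h1, h2, pairs))  # 0.0: fully compatible on this sample
```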
27. Examples of results in our model: Sample Complexity - Uniform convergence bounds
- Finite Hypothesis Spaces, Doubly Realizable Case
- Assume χ(h, x) ∈ {0, 1} and define C_{D,χ}(ε) = {h ∈ C : err_unl(h) ≤ ε}.
- Theorem (informal): given (1/ε)(ln|C| + ln(2/δ)) unlabeled examples and (1/ε)(ln|C_{D,χ}(ε)| + ln(2/δ)) labeled examples, w.p. ≥ 1 - δ every h ∈ C that is consistent with the labeled data and fully compatible with the unlabeled data has err(h) ≤ ε.
- Bound the number of labeled examples as a measure of the helpfulness of D w.r.t. χ:
  - a helpful distribution is one in which C_{D,χ}(ε) is small.
28. Semi-Supervised Learning: Natural Formalization (PAC_χ)
- We will say an algorithm "PAC_χ-learns" if it runs in poly time using samples polynomial in the respective bounds.
- E.g., can think of ln|C| as the number of bits to describe the target without knowing D, and ln|C_{D,χ}(ε)| as the number of bits to describe the target knowing a good approximation to D, given the assumption that the target has low unlabeled error rate.
29. Examples of results in our model: Sample Complexity - Uniform convergence bounds
- Finite Hypothesis Spaces, c not fully compatible
- Theorem
30. Examples of results in our model: Sample Complexity - Uniform convergence bounds
- Infinite Hypothesis Spaces
- Assume χ(h, x) ∈ {0, 1} and let χ(C) = {χ_h : h ∈ C}, where χ_h(x) = χ(h, x).
31. Examples of results in our model: Sample Complexity - Uniform convergence bounds
- For S ⊆ X, denote by U_S the uniform distribution over S, and by C[m, U_S] the expected number of splits of m points from U_S with concepts in C.
- Assume err(c) = 0 and err_unl(c) = 0.
- Theorem
- The number of labeled examples depends on the unlabeled sample.
- Useful since we can imagine the learning algorithm performing some calculations over the unlabeled data and then deciding how many labeled examples to purchase.
32. Examples of results in our model: Sample Complexity, ε-Cover-based bounds
- For algorithms that behave in a specific way:
  - first use the unlabeled data to choose a representative set of compatible hypotheses
  - then use the labeled sample to choose among these
- Theorem
33. Examples of results in our model
- Let's look at some algorithms.
34. Examples of results in our model, Algorithmic Issues: Algorithm for a simple (C, χ)
- X = {0,1}^n, C = class of disjunctions (e.g., h = x1 ∨ x2 ∨ x3 ∨ x4 ∨ x7)
- For x ∈ X, let vars(x) be the set of variables set to 1 by x.
- For h ∈ C, let vars(h) be the set of variables disjoined by h.
- χ(h, x) = 1 if either vars(x) ⊆ vars(h) or vars(x) ∩ vars(h) = ∅.
- Strong notion of margin:
  - every variable is either a positive indicator or a negative indicator
  - no example should contain both positive and negative indicators
- Can give a simple PAC_χ-learning algorithm for this pair (C, χ).
35. Examples of results in our model, Algorithmic Issues: Algorithm for a simple (C, χ)
- Use the unlabeled sample U to build a graph G on n vertices:
  - put an edge between i and j if there exists x in U with i, j ∈ vars(x).
- Use the labeled data L to label the connected components.
- Output h s.t. vars(h) is the union of the positively-labeled components.
- If c is fully compatible, then no component will get both positive and negative labels.
- If |U| and |L| are as given in the bounds, then w.h.p. err(h) ≤ ε.
- Worked example (n = 6, implemented in the sketch below): unlabeled set U = {011000, 101000, 001000, 000011, 100100}; labeled set L = {(100000, +), (000011, -)}; target c = x1 ∨ x2 ∨ x3 ∨ x4. The edges merge variables 1-4 into one component and 5-6 into another; the labels mark the first component positive and the second negative, recovering c.
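A sketch of this algorithm in code, run on the slide's own 6-variable example (the union-find representation is an implementation choice, not prescribed by the talk):

```python
def find(parent, i):
    """Root of i's component, with path compression."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def learn_disjunction(n, U, L):
    """U: unlabeled bit-vectors; L: (bit-vector, label) pairs.
    Returns the set of variable indices in the learned disjunction."""
    parent = list(range(n))
    for x in U:
        on = [i for i in range(n) if x[i] == 1]
        for i in on[1:]:                      # join all variables set to 1 in x
            parent[find(parent, i)] = find(parent, on[0])
    positive = set()                          # roots of positively-labeled components
    for x, label in L:
        if label == 1:
            for i in range(n):
                if x[i] == 1:
                    positive.add(find(parent, i))
    return {i for i in range(n) if find(parent, i) in positive}

U = [(0,1,1,0,0,0), (1,0,1,0,0,0), (0,0,1,0,0,0), (0,0,0,0,1,1), (1,0,0,1,0,0)]
L = [((1,0,0,0,0,0), 1), ((0,0,0,0,1,1), -1)]
print(sorted(i + 1 for i in learn_disjunction(6, U, L)))  # [1, 2, 3, 4], i.e. x1 v x2 v x3 v x4
```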
36. Examples of results in our model, Algorithmic Issues: Algorithm for a simple (C, χ)
- Especially non-helpful distribution: the uniform distribution over all examples x with |vars(x)| = 1
  - get n components, so still need Ω(n) labeled examples
- Helpful distribution: one such that w.h.p. the number of components is small
  - need fewer labeled examples
37. Examples of results in our model, Algorithmic Issues: Co-training of linear separators
- Examples ⟨x1, x2⟩ ∈ R^n × R^n.
- Target functions c1 and c2 are linear separators; assume c1 = c2 = c, and that no pair crosses the target plane.
- For f a linear separator in R^n, err_unl(f) = the fraction of pairs that cross f's boundary.
- Consistency problem: given a set of labeled and unlabeled examples, find a separator that is consistent with the labeled examples and compatible with the unlabeled ones.
  - It is NP-hard [Flaxman].
38. Examples of results in our model, Algorithmic Issues: Co-training of linear separators
- Assume independence given the label (both points drawn from D+ or both from D-).
- [Blum & Mitchell '98] show one can co-train (in polynomial time) given enough labeled data to produce a weakly-useful hypothesis to begin with.
- We show one can learn with only a single labeled example.
- Key point: independence given the label implies that the functions with low err_unl rate are:
  - close to c
  - close to ¬c
  - close to the all-positive function
  - close to the all-negative function
39. Examples of results in our model, Algorithmic Issues: Co-training of linear separators
- Nice tool: a super-simple algorithm for weakly learning a large-margin separator:
  - pick a hyperplane at random (see the toy sketch below)
- If the margin is ≥ 1/poly(n), then a random hyperplane has at least a 1/poly(n) chance of being a weak predictor.
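A toy rendering of this weak learner (illustrative only; the accuracy test here peeks at labels for brevity, whereas the talk's algorithm works with err_unl):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_halfspace_weak_learner(X, y, tries=200):
    """Sample random hyperplanes through the origin and keep the first whose
    accuracy is bounded away from 1/2 (a weakly-useful predictor)."""
    n = X.shape[1]
    for _ in range(tries):
        w = rng.normal(size=n)            # a uniformly random direction
        acc = np.mean(np.sign(X @ w) == y)
        if abs(acc - 0.5) > 0.1:          # weakly useful: a 60/40 split or better
            return w if acc > 0.5 else -w
    return None

# Large-margin toy data: the true separator is the first coordinate.
X = rng.normal(size=(500, 5))
y = np.sign(X[:, 0])
w = random_halfspace_weak_learner(X, y)
print(w is not None and np.mean(np.sign(X @ w) == y) > 0.6)  # True w.h.p.
```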
40. Examples of results in our model, Algorithmic Issues: Co-training of linear separators
- Assume independence given the label.
- Draw a large unlabeled sample S = {⟨x1^i, x2^i⟩}.
- If we also assume a large margin:
  - run the super-simple algorithm poly(n) times
  - feed each random hyperplane into the [Blum & Mitchell] booster
  - examine all the hypotheses produced, and pick one h with small err_unl that is far from the all-positive and all-negative functions
  - use the labeled example to choose either h or ¬h
  - w.h.p. one of the random hyperplanes was a weakly-useful predictor, so on at least one of these runs we end up with a hypothesis h with small err(h), and hence with small err_unl(h)
- If we don't assume a large margin:
  - use the Outlier Removal Lemma to make sure that at least a 1/poly fraction of the points in S1 = {x1^i} have margin at least 1/poly - this is sufficient.
41. Implications of our analysis: Ways in which unlabeled data can help
- If the target is highly compatible with D and we have enough unlabeled data to estimate χ over all h ∈ C, then we can reduce the search space (from C down to just those h ∈ C whose estimated unlabeled error rate is low).
- By providing an estimate of D, unlabeled data can allow a more refined distribution-specific notion of hypothesis space size (such as Annealed VC-entropy or the size of the smallest ε-cover).
- If D is nice, so that the set of compatible h ∈ C has a small ε-cover and the elements of the cover are far apart, then we can learn from even fewer labeled examples than the 1/ε needed just to verify a good hypothesis.
42. Questions?
43. Thank you!