Title: Modern Topics in Learning Theory
1. Modern Topics in Learning Theory
- Maria-Florina Balcan
- 04/19/2006
2. Modern Topics in Learning Theory
- Semi-Supervised Learning
- Active Learning
- Kernels and Similarity Functions
- Tighter Data Dependent Bounds
3. Semi-Supervised Learning
- Hot topic in recent years in Machine Learning.
- Many applications have lots of unlabeled data, but labeled data is rare or expensive:
- Web page and document classification
- OCR, image classification
4. Combining Labeled and Unlabeled Data
- Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
- Transductive SVM [J98]
- Co-training [BM98, BBY04]
- Graph-based methods [BC01, ZGL03, BLRR04]
- Augmented PAC model for SSL [BB05, BB06]
5. Can we extend the PAC model to deal with unlabeled data?
- PAC model: a nice, standard model for learning from labeled data.
- Goal: extend it naturally to the case of learning from both labeled and unlabeled data.
- Different algorithms are based on different assumptions about how data should behave.
- Question: how can we capture many of the assumptions typically used?
6. Example of a typical assumption
- The separator goes through low-density regions of the space (large margin).
- Assume we are looking for a linear separator.
- Belief: there should exist one with large separation.
7. Another Example
- Agreement between two parts: co-training.
- Examples contain two sufficient sets of features, i.e. an example is x = ⟨x1, x2⟩, and the belief is that the two parts of the example are consistent, i.e. ∃ c1, c2 such that c1(x1) = c2(x2) = c(x).
- For example, if we want to classify web pages: x = ⟨x1, x2⟩, where x1 is the text on the page itself and x2 is the text of links pointing to the page.
8. Co-Training [BM98]
Works by using unlabeled data to propagate
learned information.
9. Proposed Model [BB05, BB06]
- Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution.
- "Learn C" becomes "learn (C, χ)" (i.e. learn class C under compatibility notion χ).
- Express relationships that one hopes the target function and underlying distribution will possess.
- Idea: use unlabeled data and the belief that the target is compatible to reduce C down to just the highly compatible functions in C.
10. Proposed Model, cont.
- Idea: use unlabeled data and our belief to reduce size(C) down to size(highly compatible functions in C) in our sample complexity bounds.
- Want to be able to analyze how much unlabeled data is needed to uniformly estimate compatibilities well.
- Require that the degree of compatibility be something that can be estimated from a finite sample.
11. Proposed Model, cont.
- Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution.
- Require that the degree of compatibility be something that can be estimated from a finite sample.
- Require χ to be an expectation over individual examples:
- χ(h, D) = E_{x~D}[χ(h, x)] is the compatibility of h with D, with χ(h, x) ∈ [0, 1];
- err_unl(h) = 1 − χ(h, D) is the incompatibility of h with D (the unlabeled error rate of h).
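Because χ is required to be an expectation over individual examples, err_unl(h) can be estimated by a simple average over a finite unlabeled sample. A minimal sketch in Python (the function name and interface are mine, not from the slides):

```python
def estimate_unlabeled_error(chi_h, unlabeled_sample):
    """Empirical estimate of err_unl(h) = 1 - E_{x~D}[chi(h, x)].

    chi_h(x) returns the per-example compatibility chi(h, x) in [0, 1].
    The average converges to the true compatibility as the sample grows,
    which is exactly why chi must be estimable from a finite sample.
    """
    n = len(unlabeled_sample)
    return 1.0 - sum(chi_h(x) for x in unlabeled_sample) / n
```

This is the quantity the model proposes to drive the reduction of C down to its highly compatible subset.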
12. Margins and Compatibility
- Margins: the belief is that there should exist a large-margin separator.
- Incompatibility of h and D (unlabeled error rate of h): the probability mass within distance γ of h.
- Can be written as an expectation over individual examples, χ(h, D) = E_{x~D}[χ(h, x)], where:
- χ(h, x) = 0 if dist(x, h) < γ
- χ(h, x) = 1 if dist(x, h) ≥ γ
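For linear separators, this χ is just a threshold on the distance from the point to the hyperplane. A sketch (names and interface are my own, not from the slides):

```python
import math

def margin_chi(w, b, x, gamma):
    """chi(h, x) for the margin notion: 1 if x lies at distance at
    least gamma from the hyperplane w.x + b = 0, else 0."""
    dist = abs(sum(wi * xi for wi, xi in zip(w, x)) + b) \
        / math.sqrt(sum(wi * wi for wi in w))
    return 1.0 if dist >= gamma else 0.0

def margin_unlabeled_error(w, b, gamma, sample):
    """Empirical err_unl(h): the fraction of sample points that fall
    within distance gamma of the separator."""
    return 1.0 - sum(margin_chi(w, b, x, gamma) for x in sample) / len(sample)
```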
13. Margins and Compatibility
- Margins: the belief is that there should exist a large-margin separator.
- If we do not want to commit to γ in advance, define χ(h, x) to be a smooth function of dist(x, h), e.g. one that grows with the distance of x from h up to some cap.
- Illegal notion of compatibility: the largest γ such that D has probability mass exactly zero within distance γ of h (illegal because it cannot be estimated from a finite sample).
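One concrete choice of such a smooth χ (my own illustrative pick, not necessarily the one shown on the slide) is a linear ramp that is 0 on the separator and saturates at 1 beyond a reference distance gamma0:

```python
def smooth_margin_chi(dist, gamma0):
    """Smooth per-example compatibility: ramps linearly from 0 at the
    separator (dist = 0) up to 1 at dist = gamma0, then stays at 1.
    Unlike the 'largest margin with zero mass' notion, its expectation
    can be estimated from a finite unlabeled sample."""
    return min(dist / gamma0, 1.0)
```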
14. Co-Training and Compatibility
- Co-training: examples come as pairs ⟨x1, x2⟩, and the goal is to learn a pair of functions ⟨h1, h2⟩.
- The hope is that the two parts of the example are consistent.
- Legal (and natural) notion of compatibility:
- the compatibility of ⟨h1, h2⟩ and D is the probability that the two parts agree, Pr_{⟨x1,x2⟩~D}[h1(x1) = h2(x2)];
- this can be written as an expectation over examples.
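Since agreement is checked one pair at a time, the empirical compatibility of ⟨h1, h2⟩ is just the fraction of unlabeled pairs on which the two views agree. A minimal sketch (names are mine):

```python
def cotraining_compat(h1, h2, unlabeled_pairs):
    """Empirical compatibility of the pair (h1, h2): the fraction of
    unlabeled examples <x1, x2> on which the two view-specific
    hypotheses predict the same label."""
    agree = sum(1 for x1, x2 in unlabeled_pairs if h1(x1) == h2(x2))
    return agree / len(unlabeled_pairs)
```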
15. Examples of results in our model: Sample Complexity
- Uniform convergence bounds
- Finite hypothesis spaces, doubly realizable case.
- Define C_{D,χ}(ε) = {h ∈ C : err_unl(h) ≤ ε}.
- Theorem
- Bound the number of labeled examples as a measure of the helpfulness of D with respect to χ:
- a helpful distribution is one in which C_{D,χ}(ε) is small.
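The displayed theorem did not survive extraction; from memory, the doubly-realizable bound of [BB05] has roughly the following shape (constants and exact form should be checked against the paper):

```latex
m_u = O\!\left(\frac{1}{\epsilon}\Big[\ln|C| + \ln\frac{2}{\delta}\Big]\right),
\qquad
m_l = O\!\left(\frac{1}{\epsilon}\Big[\ln|C_{D,\chi}(\epsilon)| + \ln\frac{2}{\delta}\Big]\right)
```

with the guarantee that, with probability at least 1 − δ, every h ∈ C that is fully compatible on the unlabeled sample and consistent with the labeled sample has err(h) ≤ ε.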
16. Semi-Supervised Learning: A Natural Formalization (PACχ)
- We will say an algorithm "PACχ-learns" if it runs in polynomial time using samples polynomial in the respective bounds.
- E.g., can think of ln|C| as the number of bits needed to describe the target without knowing D, and ln|C_{D,χ}(ε)| as the number of bits needed to describe the target knowing a good approximation to D, given the assumption that the target has low unlabeled error rate.
17. Examples of results in our model: Sample Complexity
- Uniform convergence bounds
- Finite hypothesis spaces, target c not fully compatible.
- Theorem
18. Examples of results in our model: Sample Complexity
- Uniform convergence bounds
- Infinite hypothesis spaces.
- Assume χ(h, x) ∈ {0, 1} and let χ(C) = {χ_h : h ∈ C}, where χ_h(x) = χ(h, x).
- C[m, D]: the expected number of splits of m points drawn from D with concepts in C.
19. Examples of results in our model: Sample Complexity
- Uniform convergence bounds
- For S ⊆ X, denote by U_S the uniform distribution over S, and by C[m, U_S] the expected number of splits of m points from U_S with concepts in C.
- Assume err(c) = 0 and err_unl(c) = 0.
- Theorem
- The number of labeled examples depends on the unlabeled sample.
- Useful, since one can imagine the learning algorithm performing some calculations over the unlabeled data and then deciding how many labeled examples to purchase.
20. Examples of results in our model: Sample Complexity, ε-cover-based bounds
- For algorithms that behave in a specific way:
- first use the unlabeled data to choose a representative set of compatible hypotheses;
- then use the labeled sample to choose among these.
- Theorem
21. Implications of our analysis: Ways in which unlabeled data can help
- If the target is highly compatible with D and we have enough unlabeled data to estimate χ over all h ∈ C, then we can reduce the search space (from C down to just those h ∈ C whose estimated unlabeled error rate is low).
- By providing an estimate of D, unlabeled data can allow a more refined distribution-specific notion of hypothesis space size (such as annealed VC-entropy or the size of the smallest ε-cover).
- If D is nice, so that the set of compatible h ∈ C has a small ε-cover and the elements of the cover are far apart, then we can learn from even fewer labeled examples than the 1/ε needed just to verify a good hypothesis.
22. Modern Topics in Learning Theory
- Semi-Supervised Learning
- Active Learning
- Kernels and Similarity Functions
- Data Dependent Bounds
23. Active Learning
- Unlabeled data is cheap and easy to obtain; labeled data is (much) more expensive.
- The learner has the ability to choose specific examples to be labeled.
- The learner works harder in order to use fewer labeled examples.
24. Membership queries
- The learner constructs the examples.
- Baum and Lang (1991) tried fitting a neural net to handwritten characters:
- the synthetic instances created were incomprehensible to humans.
25. A PAC-like model [CAL92]
- Underlying distribution P on the (x, y) data (agnostic setting).
- Learner has two abilities:
- draw an unlabeled sample from the distribution;
- ask for the label of one of these samples.
- Special case: assume the data is separable, i.e. some concept h ∈ C labels all points perfectly (realizable setting).
26. Can adaptive querying help? [CAL92, D04]
- Consider threshold functions on the real line.
- Start with 1/ε unlabeled points.
- Binary search: need just log(1/ε) labels, from which the labels of the rest can be inferred.
- Output a consistent hypothesis.
Exponential improvement in sample complexity!
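The binary-search argument above can be sketched directly; everything here (names, interface) is illustrative rather than from the slides:

```python
def active_learn_threshold(points, query_label):
    """Active learning of a threshold on the line via binary search.

    points: unlabeled sample, sorted ascending; query_label(x) is the
    labeling oracle, returning 0 left of the true threshold and 1 at
    or right of it. Uses O(log n) label queries to infer all n labels.
    """
    lo, hi = 0, len(points)  # invariant: label boundary index is in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if query_label(points[mid]) == 0:
            lo = mid + 1     # boundary lies strictly to the right of mid
        else:
            hi = mid         # points[mid] is already labeled 1
    # indices < lo are labeled 0, indices >= lo are labeled 1
    return points[lo] if lo < len(points) else float("inf")
```

On 1/ε unlabeled points this spends about log(1/ε) label queries, which is the exponential improvement the slide claims.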
27. Region of uncertainty [CAL92]
- Current version space: the part of C consistent with the labels seen so far.
- Region of uncertainty: the part of the data space about which there is still some uncertainty (i.e. disagreement within the version space).
- Example: data lies on a circle in R² and the hypotheses are linear separators.
(figure: the current version space and the corresponding region of uncertainty in data space)
28. Region of uncertainty [CAL92]
Algorithm: of the unlabeled points which lie in the region of uncertainty, pick one at random to query.
(figure: the current version space and the corresponding region of uncertainty in data space)
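The query rule above can be written as a small selective-sampling loop. This is an illustrative sketch for a finite hypothesis class, not code from CAL92:

```python
import random

def cal_loop(hypotheses, unlabeled, query_label, budget):
    """CAL-style selective sampling: maintain the version space, and
    only ever query points in the region of uncertainty (points on
    which the surviving hypotheses disagree)."""
    version_space = list(hypotheses)
    for _ in range(budget):
        uncertain = [x for x in unlabeled
                     if len({h(x) for h in version_space}) > 1]
        if not uncertain:
            break  # all remaining labels can be inferred, no query needed
        x = random.choice(uncertain)
        y = query_label(x)
        # keep only hypotheses consistent with the new label
        version_space = [h for h in version_space if h(x) == y]
    return version_space
```

Labels are never spent on points the version space already agrees on, which is where the savings over passive sampling come from.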
29. Region of uncertainty [CAL92]
- The number of labels needed depends on C and also on P.
- Example: C = linear separators in R^d, D = uniform distribution over the unit sphere:
- need only d^{3/2} log(1/ε) labels to find a hypothesis with error rate < ε;
- supervised learning needs d/ε labels.
Exponential improvement in sample complexity!
For a robust version of [CAL92], see [BBL06].