Title: Modern Topics in Learning Theory
1. Modern Topics in Learning Theory
- Maria-Florina Balcan
- 04/19/2006
2. Modern Topics in Learning Theory
- Semi-Supervised Learning
- Active Learning
- Kernels and Similarity Functions
- Tighter Data Dependent Bounds
3. Semi-Supervised Learning
- Hot topic in recent years in Machine Learning.
- Many applications have lots of unlabeled data, but labeled data is rare or expensive:
- Web page and document classification
- OCR, image classification
4. Combining Labeled and Unlabeled Data
- Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
- Transductive SVM [J98]
- Co-training [BM98, BBY04]
- Graph-based methods [BC01, ZGL03, BLRR04]
- Augmented PAC model for SSL [BB05, BB06]
5. Can we extend the PAC model to deal with unlabeled data?
- PAC model: a nice, standard model for learning from labeled data.
- Goal: extend it naturally to the case of learning from both labeled and unlabeled data.
- Different algorithms are based on different assumptions about how data should behave.
- Question: how can we capture many of the assumptions typically used?
6. Example of a typical assumption
- The separator goes through low-density regions of the space (large margin).
- Assume we are looking for a linear separator.
- Belief: there should exist one with large separation.
7. Another Example
- Agreement between two parts: co-training.
- Examples contain two sufficient sets of features, i.e. an example is x = ⟨x1, x2⟩, and the belief is that the two parts of the example are consistent, i.e. ∃ c1, c2 such that c1(x1) = c2(x2) = c(x).
- For example, if we want to classify web pages: x = ⟨x1, x2⟩, where x1 is the text on the page itself and x2 is the text of links pointing to the page.
8. Co-Training [BM98]
Works by using unlabeled data to propagate
learned information.
9. Proposed Model [BB05, BB06]
- Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution.
- "Learn C" becomes "learn (C, χ)" (i.e. learn class C under compatibility notion χ).
- Express relationships that one hopes the target function and underlying distribution will possess.
- Idea: use unlabeled data and the belief that the target is compatible to reduce C down to just the highly compatible functions in C.
10. Proposed Model, cont.
- Idea: use unlabeled data and our belief to reduce size(C) down to size(highly compatible functions in C) in our sample complexity bounds.
- Want to be able to analyze how much unlabeled data is needed to uniformly estimate compatibilities well.
- Require that the degree of compatibility be something that can be estimated from a finite sample.
11. Proposed Model, cont.
- Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution.
- Require that the degree of compatibility be something that can be estimated from a finite sample.
- Require χ to be an expectation over individual examples:
- χ(h, D) = E_{x~D}[χ(h, x)] is the compatibility of h with D, with χ(h, x) ∈ [0, 1];
- err_unl(h) = 1 − χ(h, D) is the incompatibility of h with D (the unlabeled error rate of h).
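Because χ is required to be an expectation over individual examples, err_unl(h) can be estimated by a simple average over a finite unlabeled sample. A minimal sketch in Python (the function name and interface are mine, not from the slides):

```python
def estimate_unlabeled_error(chi_h, unlabeled_sample):
    """Empirical estimate of err_unl(h) = 1 - E_{x~D}[chi(h, x)].

    chi_h(x) returns the per-example compatibility chi(h, x) in [0, 1].
    The average converges to the true compatibility as the sample grows,
    which is exactly why chi must be estimable from a finite sample.
    """
    n = len(unlabeled_sample)
    return 1.0 - sum(chi_h(x) for x in unlabeled_sample) / n
```

This is the quantity the model proposes to drive the reduction of C down to its highly compatible subset.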
12. Margins and Compatibility
- Margins: the belief is that there should exist a large-margin separator.
- Incompatibility of h and D (unlabeled error rate of h): the probability mass within distance γ of h.
- Can be written as an expectation over individual examples, χ(h, D) = E_{x~D}[χ(h, x)], where:
- χ(h, x) = 0 if dist(x, h) < γ
- χ(h, x) = 1 if dist(x, h) ≥ γ
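For linear separators, this χ is just a threshold on the distance from the point to the hyperplane. A sketch (names and interface are my own, not from the slides):

```python
import math

def margin_chi(w, b, x, gamma):
    """chi(h, x) for the margin notion: 1 if x lies at distance at
    least gamma from the hyperplane w.x + b = 0, else 0."""
    dist = abs(sum(wi * xi for wi, xi in zip(w, x)) + b) \
        / math.sqrt(sum(wi * wi for wi in w))
    return 1.0 if dist >= gamma else 0.0

def margin_unlabeled_error(w, b, gamma, sample):
    """Empirical err_unl(h): the fraction of sample points that fall
    within distance gamma of the separator."""
    return 1.0 - sum(margin_chi(w, b, x, gamma) for x in sample) / len(sample)
```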
13. Margins and Compatibility
- Margins: the belief is that there should exist a large-margin separator.
- If we do not want to commit to γ in advance, define χ(h, x) to be a smooth function of dist(x, h), e.g. one that grows with the distance of x from h up to some cap.
- Illegal notion of compatibility: the largest γ such that D has probability mass exactly zero within distance γ of h (illegal because it cannot be estimated from a finite sample).
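One concrete choice of such a smooth χ (my own illustrative pick, not necessarily the one shown on the slide) is a linear ramp that is 0 on the separator and saturates at 1 beyond a reference distance gamma0:

```python
def smooth_margin_chi(dist, gamma0):
    """Smooth per-example compatibility: ramps linearly from 0 at the
    separator (dist = 0) up to 1 at dist = gamma0, then stays at 1.
    Unlike the 'largest margin with zero mass' notion, its expectation
    can be estimated from a finite unlabeled sample."""
    return min(dist / gamma0, 1.0)
```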
14. Co-Training and Compatibility
- Co-training: examples come as pairs ⟨x1, x2⟩, and the goal is to learn a pair of functions ⟨h1, h2⟩.
- The hope is that the two parts of the example are consistent.
- Legal (and natural) notion of compatibility:
- the compatibility of ⟨h1, h2⟩ and D is the probability that the two parts agree, Pr_{⟨x1,x2⟩~D}[h1(x1) = h2(x2)];
- this can be written as an expectation over examples.
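Since agreement is checked one pair at a time, the empirical compatibility of ⟨h1, h2⟩ is just the fraction of unlabeled pairs on which the two views agree. A minimal sketch (names are mine):

```python
def cotraining_compat(h1, h2, unlabeled_pairs):
    """Empirical compatibility of the pair (h1, h2): the fraction of
    unlabeled examples <x1, x2> on which the two view-specific
    hypotheses predict the same label."""
    agree = sum(1 for x1, x2 in unlabeled_pairs if h1(x1) == h2(x2))
    return agree / len(unlabeled_pairs)
```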
15. Examples of results in our model: Sample Complexity
- Uniform convergence bounds
- Finite hypothesis spaces, doubly realizable case.
- Define C_{D,χ}(ε) = {h ∈ C : err_unl(h) ≤ ε}.
- Theorem
- Bound the number of labeled examples as a measure of the helpfulness of D with respect to χ:
- a helpful distribution is one in which C_{D,χ}(ε) is small.
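The displayed theorem did not survive extraction; from memory, the doubly-realizable bound of [BB05] has roughly the following shape (constants and exact form should be checked against the paper):

```latex
m_u = O\!\left(\frac{1}{\epsilon}\Big[\ln|C| + \ln\frac{2}{\delta}\Big]\right),
\qquad
m_l = O\!\left(\frac{1}{\epsilon}\Big[\ln|C_{D,\chi}(\epsilon)| + \ln\frac{2}{\delta}\Big]\right)
```

with the guarantee that, with probability at least 1 − δ, every h ∈ C that is fully compatible on the unlabeled sample and consistent with the labeled sample has err(h) ≤ ε.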
16. Semi-Supervised Learning: A Natural Formalization (PACχ)
- We will say an algorithm "PACχ-learns" if it runs in polynomial time using samples polynomial in the respective bounds.
- E.g., can think of ln|C| as the number of bits needed to describe the target without knowing D, and ln|C_{D,χ}(ε)| as the number of bits needed to describe the target knowing a good approximation to D, given the assumption that the target has low unlabeled error rate.
17. Examples of results in our model: Sample Complexity
- Uniform convergence bounds
- Finite hypothesis spaces, target c not fully compatible.
- Theorem
18. Examples of results in our model: Sample Complexity
- Uniform convergence bounds
- Infinite hypothesis spaces.
- Assume χ(h, x) ∈ {0, 1} and let χ(C) = {χ_h : h ∈ C}, where χ_h(x) = χ(h, x).
- C[m, D]: the expected number of splits of m points drawn from D with concepts in C.
19. Examples of results in our model: Sample Complexity
- Uniform convergence bounds
- For S ⊆ X, denote by U_S the uniform distribution over S, and by C[m, U_S] the expected number of splits of m points from U_S with concepts in C.
- Assume err(c) = 0 and err_unl(c) = 0.
- Theorem
- The number of labeled examples depends on the unlabeled sample.
- Useful, since one can imagine the learning algorithm performing some calculations over the unlabeled data and then deciding how many labeled examples to purchase.
20. Examples of results in our model: Sample Complexity, ε-cover-based bounds
- For algorithms that behave in a specific way:
- first use the unlabeled data to choose a representative set of compatible hypotheses;
- then use the labeled sample to choose among these.
- Theorem
21. Implications of our analysis: Ways in which unlabeled data can help
- If the target is highly compatible with D and we have enough unlabeled data to estimate χ over all h ∈ C, then we can reduce the search space (from C down to just those h ∈ C whose estimated unlabeled error rate is low).
- By providing an estimate of D, unlabeled data can allow a more refined distribution-specific notion of hypothesis space size (such as annealed VC-entropy or the size of the smallest ε-cover).
- If D is nice, so that the set of compatible h ∈ C has a small ε-cover and the elements of the cover are far apart, then we can learn from even fewer labeled examples than the 1/ε needed just to verify a good hypothesis.
22. Modern Topics in Learning Theory
- Semi-Supervised Learning
- Active Learning
- Kernels and Similarity Functions
- Data Dependent Bounds
23. Active Learning
- Unlabeled data is cheap and easy to obtain; labeled data is (much) more expensive.
- The learner has the ability to choose specific examples to be labeled.
- The learner works harder in order to use fewer labeled examples.
24. Membership queries
- The learner constructs the examples.
- Baum and Lang (1991) tried fitting a neural net to handwritten characters:
- the synthetic instances created were incomprehensible to humans.
25. A PAC-like model [CAL92]
- Underlying distribution P on the (x, y) data (agnostic setting).
- Learner has two abilities:
- draw an unlabeled sample from the distribution;
- ask for the label of one of these samples.
- Special case: assume the data is separable, i.e. some concept h ∈ C labels all points perfectly (realizable setting).
26. Can adaptive querying help? [CAL92, D04]
- Consider threshold functions on the real line.
- Start with 1/ε unlabeled points.
- Binary search: need just log(1/ε) labels, from which the labels of the rest can be inferred.
- Output a consistent hypothesis.
Exponential improvement in sample complexity!
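The binary-search argument above can be sketched directly; everything here (names, interface) is illustrative rather than from the slides:

```python
def active_learn_threshold(points, query_label):
    """Active learning of a threshold on the line via binary search.

    points: unlabeled sample, sorted ascending; query_label(x) is the
    labeling oracle, returning 0 left of the true threshold and 1 at
    or right of it. Uses O(log n) label queries to infer all n labels.
    """
    lo, hi = 0, len(points)  # invariant: label boundary index is in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if query_label(points[mid]) == 0:
            lo = mid + 1     # boundary lies strictly to the right of mid
        else:
            hi = mid         # points[mid] is already labeled 1
    # indices < lo are labeled 0, indices >= lo are labeled 1
    return points[lo] if lo < len(points) else float("inf")
```

On 1/ε unlabeled points this spends about log(1/ε) label queries, which is the exponential improvement the slide claims.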
27. Region of uncertainty [CAL92]
- Current version space: the part of C consistent with the labels seen so far.
- Region of uncertainty: the part of the data space about which there is still some uncertainty (i.e. disagreement within the version space).
- Example: data lies on a circle in R² and the hypotheses are linear separators.
(figure: the current version space and the corresponding region of uncertainty in data space)
28. Region of uncertainty [CAL92]
Algorithm: of the unlabeled points which lie in the region of uncertainty, pick one at random to query.
(figure: the current version space and the corresponding region of uncertainty in data space)
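The query rule above can be written as a small selective-sampling loop. This is an illustrative sketch for a finite hypothesis class, not code from CAL92:

```python
import random

def cal_loop(hypotheses, unlabeled, query_label, budget):
    """CAL-style selective sampling: maintain the version space, and
    only ever query points in the region of uncertainty (points on
    which the surviving hypotheses disagree)."""
    version_space = list(hypotheses)
    for _ in range(budget):
        uncertain = [x for x in unlabeled
                     if len({h(x) for h in version_space}) > 1]
        if not uncertain:
            break  # all remaining labels can be inferred, no query needed
        x = random.choice(uncertain)
        y = query_label(x)
        # keep only hypotheses consistent with the new label
        version_space = [h for h in version_space if h(x) == y]
    return version_space
```

Labels are never spent on points the version space already agrees on, which is where the savings over passive sampling come from.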
29. Region of uncertainty [CAL92]
- The number of labels needed depends on C and also on P.
- Example: C = linear separators in R^d, D = uniform distribution over the unit sphere:
- need only d^{3/2} log(1/ε) labels to find a hypothesis with error rate < ε;
- supervised learning needs d/ε labels.
Exponential improvement in sample complexity!
For a robust version of [CAL92], see [BBL06].