1
Semi-supervised learning
Guido Sanguinetti Department of Computer Science,
University of Sheffield
2
Programme
  • Different ways of learning
  • Unsupervised learning
  • EM algorithm
  • Supervised learning
  • Different ways of going semi-supervised
  • Generative models
  • Reweighting
  • Discriminative models
  • Regularisation: clusters and manifolds

3
Disclaimer
  • This is going to be a high-level introduction
    rather than a technical talk
  • We will spend some time talking about supervised
    and unsupervised learning as well
  • We will be unashamedly probabilistic, if not
    fully Bayesian
  • Main references: C. Bishop's book Pattern
    Recognition and Machine Learning, and M. Seeger's
    chapter in Semi-Supervised Learning, Chapelle,
    Schölkopf and Zien, eds.

4
Different ways to learn
  • Reinforcement learning: learn a strategy to
    optimise a reward.
  • Closely related to decision theory and control
    theory.
  • Supervised learning: learn a map from inputs to
    targets.
  • Unsupervised learning: estimate a density.

5
Unsupervised learning
  • We are given data x.
  • We want to estimate the density that generated
    the data.
  • Generally, assume a latent variable y, as in the
    graphical model below.
  • A continuous y leads to dimensionality reduction
    (cf. previous lecture), a discrete one to
    clustering.

(Graphical model diagram: parameters, latent variable y, observed data x.)
6
Example: mixture of Gaussians
  • Data: vector measurements x_i, i = 1, ..., N.
  • Latent variable: K-dimensional binary vectors y_i
    (class membership); y_ij = 1 means point i
    belongs to class j.
  • The parameters θ are in this case the covariances
    and the means of each component.
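
A minimal statement of the model in standard mixture-of-Gaussians notation (assumed here, not copied from the slide), with mixing proportions \pi_j:

    p(y_{ij} = 1) = \pi_j, \qquad p(x_i) = \sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)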

7
Estimating mixtures
  • Two objectives: estimate the parameters (mixing
    proportions π, means μ_j and covariances Σ_j), and
    estimate the posterior probabilities of class
    membership.
  • γ_ij are the responsibilities (formula below).
  • Could use gradient descent.
  • Better to use EM.
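
For reference, the responsibilities take the usual form (standard notation, assumed rather than taken from the slide):

    \gamma_{ij} = p(y_{ij} = 1 \mid x_i) = \frac{\pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}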

8
Expectation-Maximization
  • Iterative procedure to estimate the maximum
    likelihood values of parameters in models with
    latent variables.
  • We want to maximise the log-likelihood of the
    model (written out below).
  • Notice that the log of a sum is not nice.
  • The key mathematical tool is Jensen's inequality.
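
For a mixture model the quantity being maximised is (a standard form, using the notation assumed on the earlier slides):

    \log p(X \mid \theta) = \sum_{i=1}^{N} \log \sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j),

where the log of a sum prevents a closed-form solution.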

9
Jensen's inequality
  • For every concave function f, random variable x
    and probability distribution q(x)
  • A cartoon of the proof, which relies on the
    centre of mass of a convex polygon lying inside
    the polygon.
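
In symbols (the standard statement, not reproduced from the slide): for any concave f and any distribution q,

    f\left( \mathbb{E}_{q}[x] \right) \;\ge\; \mathbb{E}_{q}\left[ f(x) \right],

applied below with f = log.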

10
Bound on the log-likelihood
  • Jensen's inequality leads to the bound sketched
    below.
  • H[q] is the entropy of the distribution q. Notice
    the absence of the nasty log of a sum.
  • If and only if q(c) is the posterior p(c|x), the
    bound is saturated.
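
The bound in a standard form consistent with the slide's description (assumed notation):

    \log p(x \mid \theta) \;\ge\; \sum_{c} q(c) \log p(x, c \mid \theta) + H[q],

with equality exactly when q(c) = p(c \mid x, \theta).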

11
EM
  • For a fixed value of the parameters, the
    posterior will saturate the bound.
  • In the M-step, optimise the bound with respect to
    the parameters.
  • In the E-step, recompute the posterior with the
    new value of the parameters.
  • Exercise: EM for mixtures of Gaussians (a sketch
    follows).
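
A minimal Python sketch of the exercise (illustrative only, not the presenter's code; uses numpy and scipy):

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iter=100, seed=0):
        """EM for a K-component Gaussian mixture on data X of shape (N, D)."""
        N, D = X.shape
        rng = np.random.default_rng(seed)
        pi = np.full(K, 1.0 / K)                     # mixing proportions
        mu = X[rng.choice(N, K, replace=False)]      # initialise means at data points
        Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
        for _ in range(n_iter):
            # E-step: responsibilities gamma[i, j] = p(y_ij = 1 | x_i)
            dens = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                                    for j in range(K)])
            gamma = dens / dens.sum(axis=1, keepdims=True)
            # M-step: re-estimate pi, mu, Sigma from the responsibilities
            Nk = gamma.sum(axis=0)
            pi = Nk / N
            mu = (gamma.T @ X) / Nk[:, None]
            for j in range(K):
                diff = X - mu[j]
                Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nk[j] + 1e-6 * np.eye(D)
        return pi, mu, Sigma, gamma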

12
Supervised learning
  • The data consists of a set of inputs x and a set
    of output (target) values y.
  • The goal is to learn the functional relation
    f: x → y.
  • Evaluated using reconstruction error
    (classification accuracy), usually on a separate
    test set.
  • Continuous y leads to regression, discrete to
    classification

13
Classification-Generative
  • The generative approach starts with modelling
    p(x|c), as in unsupervised learning.
  • The model parameters are estimated (e.g. using
    maximum likelihood).
  • The assignment is based on the posterior
    probabilities p(c|x).
  • Requires estimation of many parameters.

(Graphical model diagram: parameters, class label c, observed data x.)
14
Example: discriminant analysis
  • Assume the class-conditional distributions to be
    Gaussian.
  • Estimate means and covariances using maximum
    likelihood (O(KD²) parameters).
  • Classify novel (test) data using the posterior.
  • Exercise: rewrite the posterior in terms of
    sigmoids (sketched below).
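
A sketch of the exercise's result for two classes with a shared covariance Σ (an assumption made here for illustration): Bayes' theorem gives

    p(c = 1 \mid x) = \frac{\pi_1 \mathcal{N}(x \mid \mu_1, \Sigma)}{\pi_1 \mathcal{N}(x \mid \mu_1, \Sigma) + \pi_2 \mathcal{N}(x \mid \mu_2, \Sigma)} = \sigma\!\left( w^{\top} x + b \right),

with w = \Sigma^{-1}(\mu_1 - \mu_2) and b collecting the remaining terms, i.e. a logistic sigmoid of a linear function of x.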

15
Classification-Discriminative
  • Also called diagnostic.
  • Modelling the class-conditional distributions
    involves significant overheads.
  • Discriminative techniques avoid this by modelling
    directly the posterior distribution p(c|x).
  • Closely related to the concept of transductive
    learning.

(Graphical model diagram: parameters µ and θ, observed data x, class label c.)
16
Example: logistic regression
  • Restrict to the two-class case.
  • Model the posteriors using the logistic sigmoid
    (sketched below).
  • Notice the similarity with the discriminant
    function in generative models.
  • The number of parameters to be estimated is D.
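
The standard form of the model the slide describes (notation assumed):

    p(c = 1 \mid x, w) = \sigma(w^{\top} x), \qquad \sigma(a) = \frac{1}{1 + e^{-a}},

so only the D components of w need to be estimated.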

17
Estimating logistic regression
  • Let c_i be 0 or 1.
  • The likelihood for the data (x_i, c_i) and the
    gradient of the negative log-likelihood are
    sketched below.
  • Overfitting: may need a prior on w.
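
In a standard form (assumed notation, with σ_i = σ(wᵀx_i)):

    p(\mathbf{c} \mid X, w) = \prod_{i} \sigma_i^{\,c_i} (1 - \sigma_i)^{1 - c_i},
    \qquad
    -\nabla_w \log p(\mathbf{c} \mid X, w) = \sum_{i} (\sigma_i - c_i)\, x_i .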

18
Semi-supervised learning
  • In many practical applications, the labelling
    process is expensive and time consuming.
  • Plenty of examples of inputs x, but few of the
    corresponding targets c.
  • The goal is still to predict p(c|x), and it is
    still evaluated using classification accuracy.
  • Semi-supervised learning is, in this sense, a
    special case of supervised learning where we use
    the extra unlabeled data to improve the
    predictive power.

19
Notation
  • We have a labeled or complete data set Dl,
    comprising Nl pairs (x, c).
  • We have an unlabeled or incomplete data set Du,
    comprising Nu vectors x.
  • The interesting (and common) case is when Nu >> Nl.

20
Baselines
  • One approach to semi-supervised learning would be
    to ignore the labels and estimate an unsupervised
    clustering model.
  • Another approach is to ignore the unlabeled data
    and just use the complete data.
  • No free lunch: it is possible to construct data
    distributions for which either of the baselines
    outperforms a given SSL method.
  • The art in SSL is designing good models for
    specific applications, rather than theory.

21
Generative SSL
  • Generative SSL methods are the most intuitive.
  • The graphical model for generative classification
    and unsupervised learning is the same.
  • The likelihoods of the labelled and unlabelled
    data are combined (see the sketch below).
  • Parameters are estimated by EM.

(Graphical model diagram: parameters, class label c, observed data x — the same as for generative classification.)
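
One standard way the two likelihoods can be combined (an assumed form consistent with the slide's description):

    p(\mathcal{D}_l, \mathcal{D}_u \mid \theta) = \prod_{i \in \mathcal{D}_l} p(x_i, c_i \mid \theta) \; \prod_{j \in \mathcal{D}_u} \sum_{c} p(x_j, c \mid \theta),

which EM can maximise by treating the missing labels as latent variables.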
22
Discriminant analysis
  • McLachlan (1977) considered SSL for a generative
    model with two Gaussian classes.
  • Assume the covariance is known; the means need to
    be estimated from the data.
  • Assume we have Nu = M1 + M2 << Nl.
  • Assume also that the labelled data is split
    evenly between the two classes, so Nl = 2n.

23
A surprising result
  • Consider the expected misclassification error.
  • Let R be the expected error obtained using only
    the labelled data, and R1 the expected error
    obtained using all the data (estimating the
    parameters with EM).
  • Under the stated assumptions, the first-order
    expansion (in Nu/Nl) of R − R1 is positive only
    for Nu < M, with M finite.
  • In other words, unlabelled data helps only up to
    a point.

24
A way out
  • Perhaps considering labelled and unlabelled data
    on the same footing is not ideal.
  • McLachlan went on to propose a modified estimator
    of the class means, weighting the contribution of
    the unlabelled points.
  • He then shows that, for a suitable weight, the
    expected error obtained in this way is, to first
    order, always lower than the one obtained without
    the unlabelled data.

25
A hornet's nest
  • In general, it seems sensible to reweight the
    log-likelihood (a sketch follows this list).
  • This is partly because, in the important case
    when Nl is small, we do not want the label
    information to be swamped.
  • But how do you choose λ?
  • Cross-validation on the labels?
  • Infeasible for small Nl.
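
A hedged sketch of the kind of reweighting the slide refers to, with λ ∈ [0, 1] an assumed weight (λ = 1: labelled data only; λ = 0: unlabelled data only):

    \mathcal{L}_{\lambda}(\theta) = \lambda \sum_{i \in \mathcal{D}_l} \log p(x_i, c_i \mid \theta) + (1 - \lambda) \sum_{j \in \mathcal{D}_u} \log p(x_j \mid \theta).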

26
Stability
  • An elegant solution was proposed by Corduneanu and
    Jaakkola (2002).
  • In the limit when all the data is labelled, the
    likelihood is unimodal.
  • In the limit when all the data is unlabelled, the
    likelihood is multimodal (e.g. under permutations
    of the class labels).
  • When moving from λ = 1 to λ = 0, we must encounter
    a critical λ at which the likelihood becomes
    multimodal.
  • That critical λ is the optimal choice.

27
Discriminative SSL
  • The alternative paradigm for SSL is the
    discriminative approach.
  • As in supervised learning, the discriminative
    approach models directly p(c|x), using the
    graphical model on this slide.
  • The total likelihood factorises as shown in the
    sketch below.

(Graphical model diagram: parameters µ and θ, observed data x, class label c.)
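
A hedged sketch of the factorisation, with θ parameterising the conditional and µ the input distribution (assumed a priori independent):

    p(c, x \mid \theta, \mu) = p(c \mid x, \theta)\, p(x \mid \mu).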
28
Surprise!
  • Using Bayes' theorem we obtain the posterior over
    θ as sketched below.
  • This is seriously worrying!
  • Why?
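
Written out under the factorisation above (assumed notation), the worry becomes visible:

    p(\theta \mid \mathcal{D}_l, \mathcal{D}_u) \;\propto\; p(\theta) \prod_{i \in \mathcal{D}_l} p(c_i \mid x_i, \theta),

i.e. the unlabelled data drops out of the posterior over θ altogether.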

29
Regularization
  • Bayesian methods cannot be employed in a straight
    discriminative SSL model.
  • There has been some limited success with
    non-Bayesian methods (e.g. Anderson 1978 for
    logistic regression over discrete variables).
  • The most promising avenue is to modify the
    graphical model used.
  • This is known as regularization.

(Graphical model diagram: parameters µ and θ, observed data x, class label c.)
30
Discriminative vs Generative
  • The idea behind regularization is that
    information about the generative structure of x
    feeds into the discriminative process.
  • Does it still make sense to talk about a
    discriminative approach?
  • Strictly speaking, perhaps not.
  • In practice, the hypotheses used for
    regularization are much weaker than modelling the
    full class-conditional distribution.

31
Cluster assumption
  • Perhaps the most widely used form of
    regularization
  • It states that the boundary between classes
    should cross areas of low data density
  • It is often reasonable, particularly in high
    dimensions.
  • Implemented in a host of Bayesian and
    non-Bayesian methods
  • Can be understood in terms of smoothness of the
    discriminant function

32
Manifold assumption
  • Another convenient assumption is that the data
    lies on a low-dimensional sub-manifold of a
    high-dimensional space
  • Often the starting point is to map the data to a
    high dimensional space via a feature map
  • The cluster assumption is then applied in the
    high dimensional space
  • The key concept is the graph Laplacian.
  • Ideas pioneered by Belkin and Niyogi

33
Manifolds cont.
  • We want discriminant functions which vary little
    in regions of high data density
  • Mathematically, this is equivalent to requiring
    that the norm of f under Δ be very small, where Δ
    is the Laplace–Beltrami operator.
  • The finite-sample approximation to Δ is given by
    the graph Laplacian.
  • The objective function for SSL is a combination
    of the error given by f and of its norm under Δ
    (a sketch follows).
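
A hedged sketch in the spirit of Belkin and Niyogi's manifold regularisation (assumed notation, not taken from the slide): with W a similarity matrix on the data points, D its diagonal degree matrix and L = D − W the graph Laplacian,

    f^{\top} L f = \frac{1}{2} \sum_{i,j} W_{ij} \left( f(x_i) - f(x_j) \right)^2,

and the SSL objective combines the labelled-data error with this smoothness penalty, e.g.

    \sum_{i \in \mathcal{D}_l} \ell\!\left( f(x_i), c_i \right) + \lambda\, f^{\top} L f .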