1
Semi-supervised learning
Guido Sanguinetti Department of Computer Science,
University of Sheffield
2
Programme
  • Different ways of learning
  • Unsupervised learning
  • EM algorithm
  • Supervised learning
  • Different ways of going semi-supervised
  • Generative models
  • Reweighting
  • Discriminative models
  • Regularisation: clusters and manifolds

3
Disclaimer
  • This is going to be a high-level introduction
    rather than a technical talk
  • We will spend some time talking about supervised
    and unsupervised learning as well
  • We will be unashamedly probabilistic, if not
    fully Bayesian
  • Main references: C. Bishop's book Pattern
    Recognition and Machine Learning, and M. Seeger's
    chapter in Semi-Supervised Learning, Chapelle,
    Schölkopf and Zien, eds.

4
Different ways to learn
  • Reinforcement learning: learn a strategy to
    optimise a reward.
  • Closely related to decision theory and control
    theory.
  • Supervised learning: learn a map from inputs to
    targets.
  • Unsupervised learning: estimate a density.

5
Unsupervised learning
  • We are given data x.
  • We want to estimate the density that generated
    the data.
  • Generally, assume a latent variable y, as in the
    graphical model below.
  • A continuous y leads to dimensionality reduction
    (cf. previous lecture), a discrete one to
    clustering.

(Graphical model diagram: parameters, latent variable y, observed data x.)
6
Example: mixture of Gaussians
  • Data: vector measurements x_i, i = 1, ..., N.
  • Latent variable: K-dimensional binary vectors y_i
    (class membership); y_ij = 1 means point i
    belongs to class j.
  • The parameters θ are in this case the covariances
    and the means of each component.
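
A minimal statement of the model in standard mixture-of-Gaussians notation (assumed here, not copied from the slide), with mixing proportions \pi_j:

    p(y_{ij} = 1) = \pi_j, \qquad p(x_i) = \sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)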

7
Estimating mixtures
  • Two objectives: estimate the parameters (mixing
    proportions π, means μ_j and covariances Σ_j), and
    estimate the posterior probabilities of class
    membership.
  • γ_ij are the responsibilities (formula below).
  • Could use gradient descent.
  • Better to use EM.
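
For reference, the responsibilities take the usual form (standard notation, assumed rather than taken from the slide):

    \gamma_{ij} = p(y_{ij} = 1 \mid x_i) = \frac{\pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}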

8
Expectation-Maximization
  • Iterative procedure to estimate the maximum
    likelihood values of parameters in models with
    latent variables.
  • We want to maximise the log-likelihood of the
    model (written out below).
  • Notice that the log of a sum is not nice.
  • The key mathematical tool is Jensen's inequality.
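
For a mixture model the quantity being maximised is (a standard form, using the notation assumed on the earlier slides):

    \log p(X \mid \theta) = \sum_{i=1}^{N} \log \sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j),

where the log of a sum prevents a closed-form solution.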

9
Jensen's inequality
  • For every concave function f, random variable x
    and probability distribution q(x)
  • A cartoon of the proof, which relies on the
    centre of mass of a convex polygon lying inside
    the polygon.
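
In symbols (the standard statement, not reproduced from the slide): for any concave f and any distribution q,

    f\left( \mathbb{E}_{q}[x] \right) \;\ge\; \mathbb{E}_{q}\left[ f(x) \right],

applied below with f = log.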

10
Bound on the log-likelihood
  • Jensen's inequality leads to the bound sketched
    below.
  • H[q] is the entropy of the distribution q. Notice
    the absence of the nasty log of a sum.
  • If and only if q(c) is the posterior p(c|x), the
    bound is saturated.
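
The bound in a standard form consistent with the slide's description (assumed notation):

    \log p(x \mid \theta) \;\ge\; \sum_{c} q(c) \log p(x, c \mid \theta) + H[q],

with equality exactly when q(c) = p(c \mid x, \theta).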

11
EM
  • For a fixed value of the parameters, the
    posterior will saturate the bound.
  • In the M-step, optimise the bound with respect to
    the parameters.
  • In the E-step, recompute the posterior with the
    new value of the parameters.
  • Exercise: EM for mixtures of Gaussians (a sketch
    follows).
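
A minimal Python sketch of the exercise (illustrative only, not the presenter's code; uses numpy and scipy):

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iter=100, seed=0):
        """EM for a K-component Gaussian mixture on data X of shape (N, D)."""
        N, D = X.shape
        rng = np.random.default_rng(seed)
        pi = np.full(K, 1.0 / K)                     # mixing proportions
        mu = X[rng.choice(N, K, replace=False)]      # initialise means at data points
        Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
        for _ in range(n_iter):
            # E-step: responsibilities gamma[i, j] = p(y_ij = 1 | x_i)
            dens = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                                    for j in range(K)])
            gamma = dens / dens.sum(axis=1, keepdims=True)
            # M-step: re-estimate pi, mu, Sigma from the responsibilities
            Nk = gamma.sum(axis=0)
            pi = Nk / N
            mu = (gamma.T @ X) / Nk[:, None]
            for j in range(K):
                diff = X - mu[j]
                Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nk[j] + 1e-6 * np.eye(D)
        return pi, mu, Sigma, gamma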

12
Supervised learning
  • The data consists of a set of inputs x and a set
    of output (target) values y.
  • The goal is to learn the functional relation
    f: x → y.
  • Evaluated using reconstruction error
    (classification accuracy), usually on a separate
    test set.
  • Continuous y leads to regression, discrete to
    classification

13
Classification-Generative
  • The generative approach starts with modelling
    p(x|c), as in unsupervised learning.
  • The model parameters are estimated (e.g. using
    maximum likelihood).
  • The assignment is based on the posterior
    probabilities p(c|x).
  • Requires estimation of many parameters.

(Graphical model diagram: parameters, class label c, observed data x.)
14
Example: discriminant analysis
  • Assume the class-conditional distributions to be
    Gaussian.
  • Estimate means and covariances using maximum
    likelihood (O(KD²) parameters).
  • Classify novel (test) data using the posterior.
  • Exercise: rewrite the posterior in terms of
    sigmoids (sketched below).
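
A sketch of the exercise's result for two classes with a shared covariance Σ (an assumption made here for illustration): Bayes' theorem gives

    p(c = 1 \mid x) = \frac{\pi_1 \mathcal{N}(x \mid \mu_1, \Sigma)}{\pi_1 \mathcal{N}(x \mid \mu_1, \Sigma) + \pi_2 \mathcal{N}(x \mid \mu_2, \Sigma)} = \sigma\!\left( w^{\top} x + b \right),

with w = \Sigma^{-1}(\mu_1 - \mu_2) and b collecting the remaining terms, i.e. a logistic sigmoid of a linear function of x.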

15
Classification-Discriminative
  • Also called diagnostic.
  • Modelling the class-conditional distributions
    involves significant overheads.
  • Discriminative techniques avoid this by modelling
    directly the posterior distribution p(c|x).
  • Closely related to the concept of transductive
    learning.

(Graphical model diagram: parameters µ and θ, observed data x, class label c.)
16
Example: logistic regression
  • Restrict to the two-class case.
  • Model the posteriors using the logistic sigmoid
    (sketched below).
  • Notice the similarity with the discriminant
    function in generative models.
  • The number of parameters to be estimated is D.
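
The standard form of the model the slide describes (notation assumed):

    p(c = 1 \mid x, w) = \sigma(w^{\top} x), \qquad \sigma(a) = \frac{1}{1 + e^{-a}},

so only the D components of w need to be estimated.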

17
Estimating logistic regression
  • Let c_i be 0 or 1.
  • The likelihood for the data (x_i, c_i) and the
    gradient of the negative log-likelihood are
    sketched below.
  • Overfitting: may need a prior on w.
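
In a standard form (assumed notation, with σ_i = σ(wᵀx_i)):

    p(\mathbf{c} \mid X, w) = \prod_{i} \sigma_i^{\,c_i} (1 - \sigma_i)^{1 - c_i},
    \qquad
    -\nabla_w \log p(\mathbf{c} \mid X, w) = \sum_{i} (\sigma_i - c_i)\, x_i .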

18
Semi-supervised learning
  • In many practical applications, the labelling
    process is expensive and time consuming.
  • Plenty of examples of inputs x, but few of the
    corresponding targets c.
  • The goal is still to predict p(c|x), and it is
    still evaluated using classification accuracy.
  • Semi-supervised learning is, in this sense, a
    special case of supervised learning where we use
    the extra unlabeled data to improve the
    predictive power.

19
Notation
  • We have a labeled or complete data set Dl,
    comprising Nl pairs (x, c).
  • We have an unlabeled or incomplete data set Du,
    comprising Nu vectors x.
  • The interesting (and common) case is when Nu >> Nl.

20
Baselines
  • One approach to semi-supervised learning would be
    to ignore the labels and estimate an unsupervised
    clustering model.
  • Another approach is to ignore the unlabeled data
    and just use the complete data.
  • No free lunch: it is possible to construct data
    distributions for which either of the baselines
    outperforms a given SSL method.
  • The art in SSL is designing good models for
    specific applications, rather than theory.

21
Generative SSL
  • Generative SSL methods are the most intuitive.
  • The graphical model for generative classification
    and unsupervised learning is the same.
  • The likelihoods of the labelled and unlabelled
    data are combined (see the sketch below).
  • Parameters are estimated by EM.

(Graphical model diagram: parameters, class label c, observed data x — the same as for generative classification.)
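
One standard way the two likelihoods can be combined (an assumed form consistent with the slide's description):

    p(\mathcal{D}_l, \mathcal{D}_u \mid \theta) = \prod_{i \in \mathcal{D}_l} p(x_i, c_i \mid \theta) \; \prod_{j \in \mathcal{D}_u} \sum_{c} p(x_j, c \mid \theta),

which EM can maximise by treating the missing labels as latent variables.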
22
Discriminant analysis
  • McLachlan (1977) considered SSL for a generative
    model with two Gaussian classes.
  • Assume the covariance is known; the means need to
    be estimated from the data.
  • Assume we have Nu = M1 + M2 << Nl.
  • Assume also that the labelled data is split
    evenly between the two classes, so Nl = 2n.

23
A surprising result
  • Consider the expected misclassification error.
  • Let R be the expected error obtained using only
    the labelled data, and R1 the expected error
    obtained using all the data (estimating the
    parameters with EM).
  • Under the stated assumptions, the first-order
    expansion (in Nu/Nl) of R − R1 is positive only
    for Nu < M, with M finite.
  • In other words, unlabelled data helps only up to
    a point.

24
A way out
  • Perhaps considering labelled and unlabelled data
    on the same footing is not ideal.
  • McLachlan went on to propose a modified estimator
    of the class means, weighting the contribution of
    the unlabelled points.
  • He then shows that, for a suitable weight, the
    expected error obtained in this way is, to first
    order, always lower than the one obtained without
    the unlabelled data.

25
A hornet's nest
  • In general, it seems sensible to reweight the
    log-likelihood (a sketch follows this list).
  • This is partly because, in the important case
    when Nl is small, we do not want the label
    information to be swamped.
  • But how do you choose λ?
  • Cross-validation on the labels?
  • Infeasible for small Nl.
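
A hedged sketch of the kind of reweighting the slide refers to, with λ ∈ [0, 1] an assumed weight (λ = 1: labelled data only; λ = 0: unlabelled data only):

    \mathcal{L}_{\lambda}(\theta) = \lambda \sum_{i \in \mathcal{D}_l} \log p(x_i, c_i \mid \theta) + (1 - \lambda) \sum_{j \in \mathcal{D}_u} \log p(x_j \mid \theta).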

26
Stability
  • An elegant solution was proposed by Corduneanu and
    Jaakkola (2002).
  • In the limit when all the data is labelled, the
    likelihood is unimodal.
  • In the limit when all the data is unlabelled, the
    likelihood is multimodal (e.g. under permutations
    of the class labels).
  • When moving from λ = 1 to λ = 0, we must encounter
    a critical λ at which the likelihood becomes
    multimodal.
  • That critical λ is the optimal choice.

27
Discriminative SSL
  • The alternative paradigm for SSL is the
    discriminative approach.
  • As in supervised learning, the discriminative
    approach models directly p(c|x), using the
    graphical model on this slide.
  • The total likelihood factorises as shown in the
    sketch below.

(Graphical model diagram: parameters µ and θ, observed data x, class label c.)
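
A hedged sketch of the factorisation, with θ parameterising the conditional and µ the input distribution (assumed a priori independent):

    p(c, x \mid \theta, \mu) = p(c \mid x, \theta)\, p(x \mid \mu).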
28
Surprise!
  • Using Bayes' theorem we obtain the posterior over
    θ as sketched below.
  • This is seriously worrying!
  • Why?
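
Written out under the factorisation above (assumed notation), the worry becomes visible:

    p(\theta \mid \mathcal{D}_l, \mathcal{D}_u) \;\propto\; p(\theta) \prod_{i \in \mathcal{D}_l} p(c_i \mid x_i, \theta),

i.e. the unlabelled data drops out of the posterior over θ altogether.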

29
Regularization
  • Bayesian methods cannot be employed in a straight
    discriminative SSL model.
  • There has been some limited success with
    non-Bayesian methods (e.g. Anderson 1978 for
    logistic regression over discrete variables).
  • The most promising avenue is to modify the
    graphical model used.
  • This is known as regularization.

(Graphical model diagram: parameters µ and θ, observed data x, class label c.)
30
Discriminative vs Generative
  • The idea behind regularization is that
    information about the generative structure of x
    feeds into the discriminative process.
  • Does it still make sense to talk about a
    discriminative approach?
  • Strictly speaking, perhaps not.
  • In practice, the hypotheses used for
    regularization are much weaker than modelling the
    full class-conditional distribution.

31
Cluster assumption
  • Perhaps the most widely used form of
    regularization
  • It states that the boundary between classes
    should cross areas of low data density
  • It is often reasonable, particularly in high
    dimensions.
  • Implemented in a host of Bayesian and
    non-Bayesian methods
  • Can be understood in terms of smoothness of the
    discriminant function

32
Manifold assumption
  • Another convenient assumption is that the data
    lies on a low-dimensional sub-manifold of a
    high-dimensional space
  • Often the starting point is to map the data to a
    high dimensional space via a feature map
  • The cluster assumption is then applied in the
    high dimensional space
  • The key concept is the graph Laplacian.
  • Ideas pioneered by Belkin and Niyogi

33
Manifolds cont.
  • We want discriminant functions which vary little
    in regions of high data density
  • Mathematically, this is equivalent to requiring
    that the norm of f under Δ be very small, where Δ
    is the Laplace–Beltrami operator.
  • The finite-sample approximation to Δ is given by
    the graph Laplacian.
  • The objective function for SSL is a combination
    of the error given by f and of its norm under Δ
    (a sketch follows).
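
A hedged sketch in the spirit of Belkin and Niyogi's manifold regularisation (assumed notation, not taken from the slide): with W a similarity matrix on the data points, D its diagonal degree matrix and L = D − W the graph Laplacian,

    f^{\top} L f = \frac{1}{2} \sum_{i,j} W_{ij} \left( f(x_i) - f(x_j) \right)^2,

and the SSL objective combines the labelled-data error with this smoothness penalty, e.g.

    \sum_{i \in \mathcal{D}_l} \ell\!\left( f(x_i), c_i \right) + \lambda\, f^{\top} L f .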