Title: Kernel Methods: the Emergence of a Well-founded Machine Learning
1Kernel Methods: the Emergence of a Well-founded Machine Learning
- John Shawe-Taylor
- Centre for Computational Statistics and Machine Learning, University College London
2Overview
- Celebration of 10 years of kernel methods
- what has been achieved and
- what can we learn from the experience?
- Some historical perspectives
- Theory or not?
- Applicable or not?
- Some emphases
- Role of theory: need for a plurality of approaches
- Importance of scalability
3Caveats
- Personal perspective with inevitable bias
- One very small slice through what is now a very big field
- Focus on theory with emphasis on frequentist analysis
- There is no pro-forma for scientific research
- But the role of theory is worth discussing
- Is it needed to give a firm foundation for proposed approaches?
4Motivation behind kernel methods
- Linear learning typically has nice properties
- Unique optimal solutions
- Fast learning algorithms
- Better statistical analysis
- But one big problem
- Insufficient capacity
5Historical perspective
- Minsky and Papert highlighted the weakness in their book Perceptrons
- Neural networks overcame the problem by gluing together many linear units with non-linear activation functions
- Solved the problem of capacity and led to a very impressive extension of the applicability of learning
- But ran into training problems of speed and multiple local minima
6Kernel methods approach
- The kernel methods approach is to stick with linear functions but work in a high dimensional feature space
- The expectation is that the feature space has a much higher dimension than the input space.
7Example
- Consider the mapping
- If we consider a linear equation in this feature space
- We actually have an ellipse, i.e. a non-linear shape, in the input space.
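The mapping and equation on this slide are images that do not survive in the text. A minimal sketch of the idea, assuming the standard quadratic feature map φ(x1, x2) = (x1², x2², √2·x1·x2): a linear function of φ(x) traces an ellipse in the original input space.

```python
import numpy as np

# Assumed quadratic feature map (the slide's own formula is not preserved here):
# phi(x1, x2) = (x1^2, x2^2, sqrt(2)*x1*x2)
def phi(x):
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

# A linear function in feature space, f(x) = <w, phi(x)> - c, with illustrative weights
w = np.array([1.0, 4.0, 0.0])
c = 1.0

# The set f(x) = 0 is x1^2 + 4*x2^2 = 1: an ellipse in the input space
for x in [(1.0, 0.0), (0.0, 0.5), (0.5, 0.25)]:
    print(x, np.dot(w, phi(np.array(x))) - c)   # the first two points lie on the ellipse
```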
8Capacity of feature spaces
- The capacity is proportional to the dimension, for example in the 2-dimensional case
9Form of the functions
- So kernel methods use linear functions in a feature space
- For regression this could be the function
- For classification we additionally require thresholding
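The functions referred to above are images in the original slides; in standard kernel-methods notation (with φ the feature map, w the weight vector and b a bias) they take the form:

\[
f(\mathbf{x}) = \langle \mathbf{w}, \phi(\mathbf{x}) \rangle + b \quad\text{(regression)},
\qquad
h(\mathbf{x}) = \operatorname{sign}\big( \langle \mathbf{w}, \phi(\mathbf{x}) \rangle + b \big) \quad\text{(classification)}.
\]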
10Problems of high dimensions
- Capacity may easily become too large and lead to over-fitting: being able to realise every classifier means we are unlikely to generalise well
- Computational costs involved in dealing with large vectors
11Overview
- Two theoretical approaches converged on very similar algorithms
- The frequentist approach led to the Support Vector Machine
- The Bayesian approach led to Bayesian inference using Gaussian Processes
- First we briefly discuss the Bayesian approach before mentioning some of the frequentist results
12Bayesian approach
- The Bayesian approach relies on a probabilistic analysis by positing
- a noise model
- a prior distribution over the function class
- Inference involves updating the prior distribution with the likelihood of the data
- Possible outputs
- MAP function
- Bayesian posterior average
13Bayesian approach
- Avoids overfitting by
- Controlling the prior distribution
- Averaging over the posterior
- For a Gaussian noise model (for regression) and a Gaussian process prior we obtain a kernel method where
- Kernel is the covariance of the prior GP
- Noise model translates into addition of a ridge to the kernel matrix
- MAP and averaging give the same solution
- Link with the infinite hidden node limit of single hidden layer Neural Networks; see the seminal paper
- Williams, Computation with infinite neural networks (1997)
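To make the bullets above concrete (kernel = prior covariance, noise model = ridge added to the kernel matrix, MAP = posterior mean), here is a minimal Gaussian process regression sketch; the RBF kernel, data and noise level are illustrative assumptions rather than anything from the talk.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Gaussian (RBF) covariance between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))                 # toy 1-D inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)  # noisy targets

noise_var = 0.1 ** 2                 # Gaussian noise model
K = rbf_kernel(X, X)                 # kernel = covariance of the prior GP
alpha = np.linalg.solve(K + noise_var * np.eye(len(X)), y)   # ridge added to the kernel matrix

X_star = np.linspace(-3, 3, 5)[:, None]
posterior_mean = rbf_kernel(X_star, X) @ alpha   # MAP and posterior average coincide
print(posterior_mean)
```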
14Bayesian approach
- Subject to assumptions about noise model and prior distribution
- Can get error bars on the output
- Compute evidence for the model and use for model selection
- Approach developed for different noise models
- eg classification
- Typically requires approximate inference
15Frequentist approach
- Source of randomness is assumed to be a distribution that generates the training data i.i.d., with the same distribution generating the test data
- Different/weaker assumptions than the Bayesian approach, so more general, but less analysis can typically be derived
- Main focus is on generalisation error analysis
16Capacity problem
- What do we mean by generalisation?
17Generalisation of a learner
18Example of Generalisation
- We consider the Breast Cancer dataset from the UCI repository
- Use the simple Parzen window classifier: the weight vector is
- where is the average of the positive (negative) training examples
- Threshold is set so the hyperplane bisects the line joining these two points.
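The weight-vector formula on this slide is not preserved; in the standard form of this classifier the weight vector is w = μ₊ − μ₋ (the difference of the class means) and the threshold is b = (‖μ₊‖² − ‖μ₋‖²)/2, which makes the hyperplane bisect the segment joining the two means. A minimal sketch under that assumption:

```python
import numpy as np

def parzen_window_classifier(X, y):
    """Simple mean-difference ('Parzen window') classifier; y takes values +1/-1."""
    mu_pos = X[y == 1].mean(axis=0)    # average of the positive training examples
    mu_neg = X[y == -1].mean(axis=0)   # average of the negative training examples
    w = mu_pos - mu_neg
    b = 0.5 * (mu_pos @ mu_pos - mu_neg @ mu_neg)   # hyperplane bisects the joining line
    return lambda X_new: np.sign(X_new @ w - b)

# toy usage with synthetic 9-dimensional data (the UCI data itself is not bundled here)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1, 1, (50, 9)), rng.normal(-1, 1, (50, 9))])
y = np.array([1] * 50 + [-1] * 50)
predict = parzen_window_classifier(X, y)
print("training error:", np.mean(predict(X) != y))
```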
19Example of Generalisation
- By repeatedly drawing random training sets S of size m we estimate the distribution of
- by using the test set error as a proxy for the true generalisation
- We plot the histogram and the average of the distribution for various sizes of training set: 648, 342, 273, 205, 137, 68, 34, 27, 20, 14, 7.
20Example of Generalisation
- Since the expected classifier is in all cases the same, we do not expect large differences in the average of the distribution, though the non-linearity of the loss function means they won't be exactly the same.
21Error distribution full dataset
22Error distribution dataset size 342
23Error distribution dataset size 273
24Error distribution dataset size 205
25Error distribution dataset size 137
26Error distribution dataset size 68
27Error distribution dataset size 34
28Error distribution dataset size 27
29Error distribution dataset size 20
30Error distribution dataset size 14
31Error distribution dataset size 7
32Observations
- Things can get bad if the number of training examples is small compared to the dimension (in this case the input dimension is 9)
- The mean can be a bad predictor of true generalisation, i.e. things can look okay in expectation but still go badly wrong
- Key ingredient of learning: keep flexibility high while still ensuring good generalisation
33Controlling generalisation
- The critical method of controlling generalisation
for classification is to force a large margin on
the training data
34Intuitive and rigorous explanations
- Makes classification robust to uncertainties in inputs
- Can randomly project into lower dimensional spaces and still have separation, so effectively low dimensional
- Rigorous statistical analysis shows effective dimension
- This is not structural risk minimisation over VC classes since the hierarchy depends on the data: data-dependent structural risk minimisation
- see S-T, Bartlett, Williamson and Anthony (1996 and 1998)
35Learning framework
- Since there are lower bounds in terms of the VC dimension, the margin is detecting a favourable distribution/task alignment: the luckiness framework captures this idea
- Now consider using an SVM on the same data and compare the distribution of generalisations
- SVM distribution in red
36Error distribution dataset size 205
37Error distribution dataset size 137
38Error distribution dataset size 68
39Error distribution dataset size 34
40Error distribution dataset size 27
41Error distribution dataset size 20
42Error distribution dataset size 14
43Error distribution dataset size 7
44Handling training errors
- So far only considered the case where the data can be separated
- For non-separable sets we can introduce a penalty proportional to the amount by which a point fails to meet the margin
- These amounts are often referred to as slack variables, from optimisation theory
45Support Vector Machines
- SVM optimisation
- Analysis of this case is given using the augmented space trick in
- S-T and Cristianini (1999 and 2002), On the generalisation of soft margin algorithms
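The optimisation problem on this slide is an image that is not preserved here; the standard soft-margin SVM primal it refers to, with slack variables ξᵢ penalising margin shortfall, is:

\[
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{m}\xi_i
\quad\text{subject to}\quad
y_i\big(\langle \mathbf{w}, \phi(\mathbf{x}_i)\rangle + b\big) \ge 1 - \xi_i,\quad \xi_i \ge 0.
\]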
46Complexity problem
- Let's apply the quadratic example
- to a 20x30 image of 600 pixels: this gives approximately 180000 dimensions!
- It would be computationally infeasible to work in this space
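The count can be checked directly: assuming the quadratic map consists of all unordered degree-2 monomials of the 600 pixel values, the feature space has

\[
\binom{600}{2} + 600 = \frac{600 \times 601}{2} = 180{,}300 \approx 180{,}000 \ \text{dimensions}.
\]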
47Dual representation
- Suppose the weight vector is a linear combination of the training examples
- Then we can evaluate the inner product with a new example
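In symbols (the slide's own equations are not preserved here), the dual representation reads:

\[
\mathbf{w} = \sum_{i=1}^{m} \alpha_i\, \phi(\mathbf{x}_i)
\quad\Longrightarrow\quad
f(\mathbf{x}) = \langle \mathbf{w}, \phi(\mathbf{x})\rangle
= \sum_{i=1}^{m} \alpha_i\, \langle \phi(\mathbf{x}_i), \phi(\mathbf{x})\rangle
= \sum_{i=1}^{m} \alpha_i\, k(\mathbf{x}_i, \mathbf{x}).
\]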
48Learning the dual variables
- The αi are known as dual variables
- Since any component orthogonal to the space spanned by the training data has no effect, there is a general result that weight vectors have a dual representation: the representer theorem.
- Hence, we can reformulate algorithms to learn the dual variables rather than the weight vector directly
49Dual form of SVM
- The dual form of the SVM can also be derived by taking the dual optimisation problem! This gives
- Note that the threshold must be determined from border examples
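The dual problem itself is not preserved here; the standard form being referred to is:

\[
\max_{\boldsymbol{\alpha}} \ \sum_{i=1}^{m}\alpha_i - \tfrac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y_i y_j\, k(\mathbf{x}_i,\mathbf{x}_j)
\quad\text{subject to}\quad
0 \le \alpha_i \le C,\quad \sum_{i=1}^{m}\alpha_i y_i = 0,
\]

with resulting classifier sign(Σᵢ αᵢ yᵢ k(xᵢ, x) + b), where the threshold b is set using the margin ("border") examples.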
50Using kernels
- The critical observation is that again only inner products are used
- Suppose that we now have a shortcut method of computing
- Then we do not need to explicitly compute the feature vectors, either in training or testing
51Kernel example
- As an example consider the mapping
- Here we have a shortcut
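The mapping and shortcut on this slide are images missing from the text. For the quadratic map assumed earlier the shortcut is k(x, z) = ⟨x, z⟩², which the sketch below verifies numerically.

```python
import numpy as np
from itertools import combinations_with_replacement

def phi(x):
    """All degree-2 monomials, with sqrt(2) on cross terms so that
    <phi(x), phi(z)> = <x, z>**2 (an assumed, standard construction)."""
    feats = []
    for i, j in combinations_with_replacement(range(len(x)), 2):
        coeff = 1.0 if i == j else np.sqrt(2)
        feats.append(coeff * x[i] * x[j])
    return np.array(feats)

rng = np.random.default_rng(0)
x, z = rng.standard_normal(5), rng.standard_normal(5)
print(phi(x) @ phi(z))   # explicit feature-space inner product
print((x @ z) ** 2)      # kernel shortcut: the same value, without building phi
```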
52Efficiency
- Hence, in the pixel example, rather than work with 180000 dimensional vectors, we compute a 600 dimensional inner product and then square the result!
- Can even work in infinite dimensional spaces, eg using the Gaussian kernel
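The Gaussian kernel formula is not preserved here; its standard form, whose feature space is infinite dimensional, is

\[
k(\mathbf{x}, \mathbf{z}) = \exp\!\Big(-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{2\sigma^2}\Big).
\]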
53Using Gaussian kernel for Breast 273
54Data size 342
55Constraints on the kernel
- There is a restriction on the function
- This restriction, for any training set, is enough to guarantee the function is a kernel
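The restriction itself is not rendered in the text above; it is the finitely positive semi-definite property: for every finite set of inputs the kernel matrix with entries Kᵢⱼ = k(xᵢ, xⱼ) must be symmetric and positive semi-definite. A small numerical check of this property:

```python
import numpy as np

def is_valid_kernel_matrix(K, tol=1e-10):
    """Check symmetry and positive semi-definiteness of a kernel matrix."""
    symmetric = np.allclose(K, K.T)
    min_eig = np.linalg.eigvalsh((K + K.T) / 2).min()
    return symmetric and min_eig >= -tol

X = np.random.default_rng(0).standard_normal((10, 3))
K = (X @ X.T) ** 2                      # quadratic kernel matrix: a valid kernel
print(is_valid_kernel_matrix(K))        # True
print(is_valid_kernel_matrix(-K))       # False: the negated matrix is not PSD
```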
56What have we achieved?
- Replaced the problem of neural network architecture by kernel definition
- Arguably more natural to define, but the restriction is a bit unnatural
- Not a silver bullet, as fit with the data is key
- Can be applied to non-vectorial (or high dimensional) data
- Gained more flexible regularisation/generalisation control
- Gained a convex optimisation problem
- i.e. NO local minima!
57Historical note
- The first use of kernel methods in machine learning was in combination with the perceptron algorithm
- Aizerman et al., Theoretical foundations of the potential function method in pattern recognition learning (1964)
- Apparently failed because of generalisation issues: the margin was not optimised, but this can be incorporated with one extra parameter!
58PAC-Bayes General perspectives
- The goal of different theories is to capture the key elements that enable an understanding, analysis and learning of different phenomena
- There are several theories of machine learning, notably Bayesian and frequentist
- Different assumptions and hence different ranges of applicability and ranges of results
- Bayesian able to make more detailed probabilistic predictions
- Frequentist makes only the i.i.d. assumption
59Evidence and generalisation
- Evidence is a measure used in Bayesian analysis to inform model selection
- Evidence is related to the posterior volume of weight space consistent with the sample
- The link between evidence and generalisation was hypothesised by MacKay
- The first such result was obtained by S-T and Williamson (1997), PAC Analysis of a Bayes Estimator
- Bound on generalisation in terms of the volume of the sphere that can be inscribed in the version space -- included a dependence on the dimension of the space but applies to non-linear function classes
- Used the Luckiness framework, where luckiness is measured by the volume of the ball
60PAC-Bayes Theorem
- The PAC-Bayes theorem has a similar flavour but bounds in terms of prior and posterior distributions
- First version proved by McAllester in 1999
- Improved proof and bound due to Seeger in 2002, with application to Gaussian processes
- Application to SVMs by Langford and S-T, also in 2002
- Gives the tightest bounds on generalisation for SVMs
- see Langford's tutorial in JMLR
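The theorem statement itself is not preserved here; one commonly quoted form (the Langford-Seeger version) is: for any prior P over classifiers and any δ in (0, 1], with probability at least 1 − δ over an i.i.d. sample S of size m, every posterior Q satisfies

\[
\mathrm{kl}\big(\hat{E}_S(Q)\,\big\|\,E_D(Q)\big) \;\le\; \frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{m+1}{\delta}}{m},
\]

where Ê_S(Q) and E_D(Q) are the empirical and true error rates of the Gibbs classifier drawn from Q, and kl is the relative entropy between Bernoulli distributions.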
64Examples of bound evaluation
65Principal Components Analysis (PCA)
- The classical method is principal component analysis: it looks for directions of maximum variance, given by the eigenvectors of the covariance matrix
66Dual representation of PCA
- Eigenvectors of the kernel matrix give the dual representation
- This means we can perform the PCA projection in a kernel defined feature space: kernel PCA
67Kernel PCA
- Need to take care of normalisation to obtain
- where λ is the corresponding eigenvalue.
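A minimal kernel PCA sketch consistent with the bullets above, normalising each dual eigenvector by the square root of its eigenvalue; the kernel and data used are illustrative.

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Kernel PCA from a kernel matrix K: returns normalised dual vectors and projections."""
    m = K.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m
    Kc = J @ K @ J                              # centre the data in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)       # ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    lam, U = eigvals[idx], eigvecs[:, idx]
    alpha = U / np.sqrt(lam)                    # normalisation by the corresponding eigenvalue
    return alpha, Kc @ alpha                    # dual vectors and training projections

X = np.random.default_rng(0).standard_normal((30, 5))
K = (X @ X.T) ** 2                              # illustrative quadratic kernel
alpha, Z = kernel_pca(K)
print(Z.shape)                                  # (30, 2)
```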
68Generalisation of k-PCA
- How reliable are estimates obtained from a sample when working in such high dimensional spaces?
- Using Rademacher complexity bounds one can show that if a low dimensional projection captures most of the variance in the training set, it will do well on test examples as well, even in high dimensional spaces
- see S-T, Williams, Cristianini and Kandola (2004), On the eigenspectrum of the Gram matrix and the generalisation error of kernel PCA
- The bound also gives a new bound on the difference between the process and empirical eigenvalues
69Latent Semantic Indexing
- Developed in information retrieval to overcome the problem of semantic relations between words in the bag-of-words representation
- Uses the Singular Value Decomposition of the term-document matrix
70Lower dimensional representation
- If we truncate the matrix U at k columns we obtain the best k dimensional representation in the least squares sense
- In this case we can write a semantic matrix as
71Latent Semantic Kernels
- Can perform the same transformation in a kernel
defined feature space by performing an eigenvalue
decomposition of the kernel matrix (this is
equivalent to kernel PCA)
72Related techniques
- Number of related techniques
- Probabilistic LSI (pLSI)
- Non-negative Matrix Factorisation (NMF)
- Multinomial PCA (mPCA)
- Discrete PCA (DPCA)
- All can be viewed as alternative decompositions
73Different criteria
- Vary by
- Different constraints (eg non-negative entries)
- Different prior distributions (eg Dirichlet, Poisson)
- Different optimisation criteria (eg max likelihood, Bayesian)
- Unlike LSI, these typically suffer from local minima and so require EM type iterative algorithms to converge to solutions
74Other subspace methods
- Kernel partial Gram-Schmidt orthogonalisation is equivalent to incomplete Cholesky decomposition: greedy kernel PCA
- Kernel Partial Least Squares implements a multi-dimensional regression algorithm popular in chemometrics; it takes account of the labels
- Kernel Canonical Correlation Analysis uses paired datasets to learn a semantic representation independent of the two views
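As an illustration of the first bullet, here is a minimal sketch of incomplete (pivoted) Cholesky decomposition of a kernel matrix, the greedy procedure equivalent to kernel partial Gram-Schmidt; the stopping rule and data are illustrative.

```python
import numpy as np

def incomplete_cholesky(K, max_rank=10, tol=1e-8):
    """Greedy low-rank factorisation K ~= R @ R.T using pivoted Cholesky."""
    m = K.shape[0]
    R = np.zeros((m, max_rank))
    d = np.diag(K).astype(float)          # residual diagonal
    pivots = []
    for t in range(max_rank):
        i = int(np.argmax(d))
        if d[i] < tol:                    # residual variance exhausted
            R = R[:, :t]
            break
        pivots.append(i)
        R[:, t] = (K[:, i] - R @ R[i, :]) / np.sqrt(d[i])
        d -= R[:, t] ** 2
    return R, pivots

X = np.random.default_rng(0).standard_normal((100, 5))
K = X @ X.T                                # linear kernel of rank 5
R, pivots = incomplete_cholesky(K, max_rank=20)
print(R.shape, np.abs(K - R @ R.T).max())  # rank-5 factor, near-zero residual
```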
75Paired corpora
- Can we use paired corpora to extract more information?
- Two views of the same semantic object: we hypothesise that both views contain all of the necessary information, eg a document and its translation into a second language
76Aligned text
77Canadian parliament corpus
LAND MINES Ms. Beth Phinney (Hamilton Mountain,
Lib.) Mr. Speaker, we are pleased that the Nobel
peace prize has been given to those working to
ban land mines worldwide. We hope this award
will encourage the United States to join the over
100 countries planning to come to
E12
LES MINES ANTIPERSONNEL Mme Beth Phinney
(Hamilton Mountain, Lib.) Monsieur le Président,
nous nous réjouissons du fait que le prix Nobel
ait été attribué à ceux qui oeuvrent en faveur de
l'interdiction des mines antipersonnel dans le
monde entier. Nous espérons que cela incitera
les Américains à se joindre aux représentants de
plus de 100 pays qui ont l'intention de venir à
F12
78Cross-lingual LSI via SVD
M. L. Littman, S. T. Dumais, and T. K. Landauer.
Automatic cross-language information retrieval
using latent semantic indexing. In G.
Grefenstette, editor, Cross-language information
retrieval. Kluwer, 1998.
79Cross-lingual kernel canonical correlation analysis
[Diagram: the English and French input spaces are mapped by feature maps into an English feature space and a French feature space]
80Kernel canonical correlation analysis
81regularization
- Using kernel functions may result in overfitting
- Theoretical analysis shows that, provided the norms of the weight vectors are small, the correlation will still hold for new data
- see S-T and Cristianini (2004), Kernel Methods for Pattern Analysis
- Need to control the flexibility of the projections fE and fF: add a diagonal to the matrix D
- the coefficient of this added diagonal is the regularization parameter
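A minimal regularised kernel CCA sketch in the spirit of the bullets above, using one common formulation in which a multiple of the identity is added to each kernel matrix before forming the constraint (the slide's matrix D corresponds to the block-diagonal constraint matrix here); the kernels, data and regularisation value are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def kcca(K_E, K_F, kappa=0.5, n_components=2):
    """Regularised kernel CCA via a generalised eigenproblem (one common formulation)."""
    m = K_E.shape[0]
    Z = np.zeros((m, m))
    A = np.block([[Z, K_E @ K_F], [K_F @ K_E, Z]])          # cross-correlation block
    R_E = K_E + kappa * np.eye(m)                            # regularisation: add a diagonal
    R_F = K_F + kappa * np.eye(m)
    B = np.block([[R_E @ R_E, Z], [Z, R_F @ R_F]])           # constraint (block-diagonal) matrix
    rho, W = eigh(A, B)                                      # ascending eigenvalues
    order = np.argsort(rho)[::-1][:n_components]
    return rho[order], W[:m, order], W[m:, order]            # correlations, alpha, beta

# toy paired views sharing a latent signal (illustrative), with linear kernels
rng = np.random.default_rng(0)
latent = rng.standard_normal((40, 3))
X_E = latent @ rng.standard_normal((3, 10)) + 0.1 * rng.standard_normal((40, 10))
X_F = latent @ rng.standard_normal((3, 8)) + 0.1 * rng.standard_normal((40, 8))
rho, alpha, beta = kcca(X_E @ X_E.T, X_F @ X_F.T)
print(rho)   # leading canonical correlations (high for these strongly correlated views)
```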
82Pseudo query test
83Experimental Results
- The goal was to retrieve the paired document.
- Experimental procedure
- (1) LSI/KCCA trained on paired documents,
- (2) All test documents projected into the LSI/KCCA semantic space,
- (3) Each query was projected into the LSI/KCCA semantic space and documents were retrieved using nearest neighbour based on cosine distance to the query.
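A minimal sketch of the retrieval step (3), assuming queries and documents have already been projected into a shared semantic space:

```python
import numpy as np

def retrieve(query_vec, doc_matrix):
    """Index of the document nearest to the query by cosine similarity;
    rows of doc_matrix are documents in the LSI/KCCA semantic space."""
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return int(np.argmax(D @ q))

docs = np.random.default_rng(0).standard_normal((100, 10))   # illustrative projections
print(retrieve(docs[42] + 0.01, docs))                        # retrieves document 42
```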
84English-French retrieval accuracy
85Applying to different data types
- Data
- Combined image and associated text obtained from the web
- Three categories: sport, aviation and paintball
- 400 examples from each category (1200 overall)
- Features extracted: HSV, Texture, Bag of words
- Tasks
- Classification of web pages into the 3 categories
- Text query -> image retrieval
86Classification error of baseline method
- Previous error rates obtained using probabilistic ICA, classification done for single feature groups
87Classification rates using KCCA
88Classification with multi-views
- If we use KCCA to generate a semantic feature space and then learn with an SVM, we can envisage combining the two steps into a single SVM-2K
- learns two SVMs, one on each representation, but constrains their outputs to be similar across the training (and any unlabelled) data
- Can give classification for single mode test data: applied to patent classification for Japanese patents
- Again theoretical analysis predicts good generalisation if the training SVMs have a good match, while the two representations have a small overlap
- see Farquhar, Hardoon, Meng, S-T, Szedmak (2005), Two view learning: SVM-2K, Theory and Practice
89Results
90Targeted Density Learning
- Learning a probability density function in an L1 sense is hard
- Learning for a single classification or regression task is easy
- Consider a set of tasks (the Touchstone Class) F that we think may arise, with a distribution emphasising those more likely to be seen
- Now learn a density that does well for the tasks in F
- Surprisingly, it can be shown that just including a small subset of the tasks gives us convergence over all the tasks
91Example plot
- Consider fitting the cumulative distribution up to the sample points, as proposed by Mukherjee and Vapnik
- The plot shows the loss as a function of the number of constraints (tasks) included
92Idealised view of progress
- Study problem to develop theoretical model
- Derive analysis that indicates factors that affect solution quality
- Translate into optimisation maximising the factors, relaxing to ensure convexity
- Develop efficient solutions using specifics of the task
93Role of theory
- Theoretical analysis plays a critical auxiliary role
- It can never be the complete picture
- Needs to capture the factors that most influence the quality of the result
- A means to an end: only as good as the result
- Important to be open to refinements/relaxations that can be shown to better characterise real-world learning and translate into efficient algorithms
94Conclusions and future directions
- Extensions to learning data with output structure that can be used to inform the learning
- More general subspace methods: algorithms and theoretical foundations
- Extensions to more complex tasks, eg system identification, reinforcement learning
- Linking learning into tasks with more detailed prior knowledge, eg stochastic differential equations for climate modelling
- Extending the theoretical framework for more general learning tasks, eg learning a density function