Title: Kernel Methods: the Emergence of a Well-founded Machine Learning
1Kernel Methods: the Emergence of a Well-founded Machine Learning
- John Shawe-Taylor
- Centre for Computational Statistics and Machine Learning, University College London
2Overview
- Celebration of 10 years of kernel methods
- what has been achieved and
- what can we learn from the experience?
- Some historical perspectives
- Theory or not?
- Applicable or not?
- Some emphases
- Role of theory: need for a plurality of approaches
- Importance of scalability
3Caveats
- Personal perspective with inevitable bias
- One very small slice through what is now a very big field
- Focus on theory with emphasis on frequentist analysis
- There is no pro-forma for scientific research
- But the role of theory is worth discussing
- Is it needed to give a firm foundation for proposed approaches?
4Motivation behind kernel methods
- Linear learning typically has nice properties
- Unique optimal solutions
- Fast learning algorithms
- Better statistical analysis
- But one big problem
- Insufficient capacity
5Historical perspective
- Minsky and Papert highlighted the weakness in their book Perceptrons
- Neural networks overcame the problem by gluing together many linear units with non-linear activation functions
- Solved the problem of capacity and led to a very impressive extension of the applicability of learning
- But ran into training problems of speed and multiple local minima
6Kernel methods approach
- The kernel methods approach is to stick with linear functions but work in a high dimensional feature space
- The expectation is that the feature space has a much higher dimension than the input space.
7Example
- Consider the mapping
- If we consider a linear equation in this feature space
- We actually have an ellipse, i.e. a non-linear shape, in the input space.
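The mapping and equation on this slide are images that do not survive in the text. A minimal sketch of the idea, assuming the standard quadratic feature map φ(x1, x2) = (x1², x2², √2·x1·x2): a linear function of φ(x) traces an ellipse in the original input space.

```python
import numpy as np

# Assumed quadratic feature map (the slide's own formula is not preserved here):
# phi(x1, x2) = (x1^2, x2^2, sqrt(2)*x1*x2)
def phi(x):
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

# A linear function in feature space, f(x) = <w, phi(x)> - c, with illustrative weights
w = np.array([1.0, 4.0, 0.0])
c = 1.0

# The set f(x) = 0 is x1^2 + 4*x2^2 = 1: an ellipse in the input space
for x in [(1.0, 0.0), (0.0, 0.5), (0.5, 0.25)]:
    print(x, np.dot(w, phi(np.array(x))) - c)   # the first two points lie on the ellipse
```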
8Capacity of feature spaces
- The capacity is proportional to the dimension, for example in the 2-dimensional case
9Form of the functions
- So kernel methods use linear functions in a feature space
- For regression this could be the function
- For classification we additionally require thresholding
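The functions referred to above are images in the original slides; in standard kernel-methods notation (with φ the feature map, w the weight vector and b a bias) they take the form:

\[
f(\mathbf{x}) = \langle \mathbf{w}, \phi(\mathbf{x}) \rangle + b \quad\text{(regression)},
\qquad
h(\mathbf{x}) = \operatorname{sign}\big( \langle \mathbf{w}, \phi(\mathbf{x}) \rangle + b \big) \quad\text{(classification)}.
\]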
10Problems of high dimensions
- Capacity may easily become too large and lead to over-fitting: being able to realise every classifier means we are unlikely to generalise well
- Computational costs involved in dealing with large vectors
11Overview
- Two theoretical approaches converged on very similar algorithms
- The frequentist approach led to the Support Vector Machine
- The Bayesian approach led to Bayesian inference using Gaussian Processes
- First we briefly discuss the Bayesian approach before mentioning some of the frequentist results
12Bayesian approach
- The Bayesian approach relies on a probabilistic analysis by positing
- a noise model
- a prior distribution over the function class
- Inference involves updating the prior distribution with the likelihood of the data
- Possible outputs
- MAP function
- Bayesian posterior average
13Bayesian approach
- Avoids overfitting by
- Controlling the prior distribution
- Averaging over the posterior
- For a Gaussian noise model (for regression) and a Gaussian process prior we obtain a kernel method where
- Kernel is the covariance of the prior GP
- Noise model translates into addition of a ridge to the kernel matrix
- MAP and averaging give the same solution
- Link with the infinite hidden node limit of single hidden layer Neural Networks; see the seminal paper
- Williams, Computation with infinite neural networks (1997)
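To make the bullets above concrete (kernel = prior covariance, noise model = ridge added to the kernel matrix, MAP = posterior mean), here is a minimal Gaussian process regression sketch; the RBF kernel, data and noise level are illustrative assumptions rather than anything from the talk.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Gaussian (RBF) covariance between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))                 # toy 1-D inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)  # noisy targets

noise_var = 0.1 ** 2                 # Gaussian noise model
K = rbf_kernel(X, X)                 # kernel = covariance of the prior GP
alpha = np.linalg.solve(K + noise_var * np.eye(len(X)), y)   # ridge added to the kernel matrix

X_star = np.linspace(-3, 3, 5)[:, None]
posterior_mean = rbf_kernel(X_star, X) @ alpha   # MAP and posterior average coincide
print(posterior_mean)
```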
14Bayesian approach
- Subject to assumptions about noise model and prior distribution
- Can get error bars on the output
- Compute evidence for the model and use for model selection
- Approach developed for different noise models
- eg classification
- Typically requires approximate inference
15Frequentist approach
- Source of randomness is assumed to be a distribution that generates the training data i.i.d., with the same distribution generating the test data
- Different/weaker assumptions than the Bayesian approach, so more general, but less analysis can typically be derived
- Main focus is on generalisation error analysis
16Capacity problem
- What do we mean by generalisation?
17Generalisation of a learner
18Example of Generalisation
- We consider the Breast Cancer dataset from the UCI repository
- Use the simple Parzen window classifier: the weight vector is
- where is the average of the positive (negative) training examples
- Threshold is set so the hyperplane bisects the line joining these two points.
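The weight-vector formula on this slide is not preserved; in the standard form of this classifier the weight vector is w = μ₊ − μ₋ (the difference of the class means) and the threshold is b = (‖μ₊‖² − ‖μ₋‖²)/2, which makes the hyperplane bisect the segment joining the two means. A minimal sketch under that assumption:

```python
import numpy as np

def parzen_window_classifier(X, y):
    """Simple mean-difference ('Parzen window') classifier; y takes values +1/-1."""
    mu_pos = X[y == 1].mean(axis=0)    # average of the positive training examples
    mu_neg = X[y == -1].mean(axis=0)   # average of the negative training examples
    w = mu_pos - mu_neg
    b = 0.5 * (mu_pos @ mu_pos - mu_neg @ mu_neg)   # hyperplane bisects the joining line
    return lambda X_new: np.sign(X_new @ w - b)

# toy usage with synthetic 9-dimensional data (the UCI data itself is not bundled here)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1, 1, (50, 9)), rng.normal(-1, 1, (50, 9))])
y = np.array([1] * 50 + [-1] * 50)
predict = parzen_window_classifier(X, y)
print("training error:", np.mean(predict(X) != y))
```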
19Example of Generalisation
- By repeatedly drawing random training sets S of size m we estimate the distribution of
- by using the test set error as a proxy for the true generalisation
- We plot the histogram and the average of the distribution for various sizes of training set: 648, 342, 273, 205, 137, 68, 34, 27, 20, 14, 7.
20Example of Generalisation
- Since the expected classifier is in all cases the same, we do not expect large differences in the average of the distribution, though the non-linearity of the loss function means they won't be exactly the same.
21Error distribution full dataset
22Error distribution dataset size 342
23Error distribution dataset size 273
24Error distribution dataset size 205
25Error distribution dataset size 137
26Error distribution dataset size 68
27Error distribution dataset size 34
28Error distribution dataset size 27
29Error distribution dataset size 20
30Error distribution dataset size 14
31Error distribution dataset size 7
32Observations
- Things can get bad if the number of training examples is small compared to the dimension (in this case the input dimension is 9)
- The mean can be a bad predictor of true generalisation, i.e. things can look okay in expectation but still go badly wrong
- Key ingredient of learning: keep flexibility high while still ensuring good generalisation
33Controlling generalisation
- The critical method of controlling generalisation
for classification is to force a large margin on
the training data
34Intuitive and rigorous explanations
- Makes classification robust to uncertainties in inputs
- Can randomly project into lower dimensional spaces and still have separation, so effectively low dimensional
- Rigorous statistical analysis shows effective dimension
- This is not structural risk minimisation over VC classes since the hierarchy depends on the data: data-dependent structural risk minimisation
- see S-T, Bartlett, Williamson and Anthony (1996 and 1998)
35Learning framework
- Since there are lower bounds in terms of the VC dimension, the margin is detecting a favourable distribution/task alignment: the luckiness framework captures this idea
- Now consider using an SVM on the same data and compare the distribution of generalisations
- SVM distribution in red
36Error distribution dataset size 205
37Error distribution dataset size 137
38Error distribution dataset size 68
39Error distribution dataset size 34
40Error distribution dataset size 27
41Error distribution dataset size 20
42Error distribution dataset size 14
43Error distribution dataset size 7
44Handling training errors
- So far only considered the case where the data can be separated
- For non-separable sets we can introduce a penalty proportional to the amount by which a point fails to meet the margin
- These amounts are often referred to as slack variables, from optimisation theory
45Support Vector Machines
- SVM optimisation
- Analysis of this case is given using the augmented space trick in
- S-T and Cristianini (1999 and 2002), On the generalisation of soft margin algorithms
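The optimisation problem on this slide is an image that is not preserved here; the standard soft-margin SVM primal it refers to, with slack variables ξᵢ penalising margin shortfall, is:

\[
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{m}\xi_i
\quad\text{subject to}\quad
y_i\big(\langle \mathbf{w}, \phi(\mathbf{x}_i)\rangle + b\big) \ge 1 - \xi_i,\quad \xi_i \ge 0.
\]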
46Complexity problem
- Let's apply the quadratic example
- to a 20x30 image of 600 pixels: this gives approximately 180000 dimensions!
- It would be computationally infeasible to work in this space
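The count can be checked directly: assuming the quadratic map consists of all unordered degree-2 monomials of the 600 pixel values, the feature space has

\[
\binom{600}{2} + 600 = \frac{600 \times 601}{2} = 180{,}300 \approx 180{,}000 \ \text{dimensions}.
\]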
47Dual representation
- Suppose the weight vector is a linear combination of the training examples
- Then we can evaluate the inner product with a new example
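In symbols (the slide's own equations are not preserved here), the dual representation reads:

\[
\mathbf{w} = \sum_{i=1}^{m} \alpha_i\, \phi(\mathbf{x}_i)
\quad\Longrightarrow\quad
f(\mathbf{x}) = \langle \mathbf{w}, \phi(\mathbf{x})\rangle
= \sum_{i=1}^{m} \alpha_i\, \langle \phi(\mathbf{x}_i), \phi(\mathbf{x})\rangle
= \sum_{i=1}^{m} \alpha_i\, k(\mathbf{x}_i, \mathbf{x}).
\]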
48Learning the dual variables
- The αi are known as dual variables
- Since any component orthogonal to the space spanned by the training data has no effect, there is a general result that weight vectors have a dual representation: the representer theorem.
- Hence, we can reformulate algorithms to learn the dual variables rather than the weight vector directly
49Dual form of SVM
- The dual form of the SVM can also be derived by taking the dual optimisation problem! This gives
- Note that the threshold must be determined from border examples
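The dual problem itself is not preserved here; the standard form being referred to is:

\[
\max_{\boldsymbol{\alpha}} \ \sum_{i=1}^{m}\alpha_i - \tfrac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y_i y_j\, k(\mathbf{x}_i,\mathbf{x}_j)
\quad\text{subject to}\quad
0 \le \alpha_i \le C,\quad \sum_{i=1}^{m}\alpha_i y_i = 0,
\]

with resulting classifier sign(Σᵢ αᵢ yᵢ k(xᵢ, x) + b), where the threshold b is set using the margin ("border") examples.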
50Using kernels
- The critical observation is that again only inner products are used
- Suppose that we now have a shortcut method of computing
- Then we do not need to explicitly compute the feature vectors, either in training or testing
51Kernel example
- As an example consider the mapping
- Here we have a shortcut
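The mapping and shortcut on this slide are images missing from the text. For the quadratic map assumed earlier the shortcut is k(x, z) = ⟨x, z⟩², which the sketch below verifies numerically.

```python
import numpy as np
from itertools import combinations_with_replacement

def phi(x):
    """All degree-2 monomials, with sqrt(2) on cross terms so that
    <phi(x), phi(z)> = <x, z>**2 (an assumed, standard construction)."""
    feats = []
    for i, j in combinations_with_replacement(range(len(x)), 2):
        coeff = 1.0 if i == j else np.sqrt(2)
        feats.append(coeff * x[i] * x[j])
    return np.array(feats)

rng = np.random.default_rng(0)
x, z = rng.standard_normal(5), rng.standard_normal(5)
print(phi(x) @ phi(z))   # explicit feature-space inner product
print((x @ z) ** 2)      # kernel shortcut: the same value, without building phi
```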
52Efficiency
- Hence, in the pixel example, rather than work with 180000 dimensional vectors, we compute a 600 dimensional inner product and then square the result!
- Can even work in infinite dimensional spaces, eg using the Gaussian kernel
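The Gaussian kernel formula is not preserved here; its standard form, whose feature space is infinite dimensional, is

\[
k(\mathbf{x}, \mathbf{z}) = \exp\!\Big(-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{2\sigma^2}\Big).
\]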
53Using Gaussian kernel for Breast 273
54Data size 342
55Constraints on the kernel
- There is a restriction on the function
- This restriction, for any training set, is enough to guarantee the function is a kernel
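The restriction itself is not rendered in the text above; it is the finitely positive semi-definite property: for every finite set of inputs the kernel matrix with entries Kᵢⱼ = k(xᵢ, xⱼ) must be symmetric and positive semi-definite. A small numerical check of this property:

```python
import numpy as np

def is_valid_kernel_matrix(K, tol=1e-10):
    """Check symmetry and positive semi-definiteness of a kernel matrix."""
    symmetric = np.allclose(K, K.T)
    min_eig = np.linalg.eigvalsh((K + K.T) / 2).min()
    return symmetric and min_eig >= -tol

X = np.random.default_rng(0).standard_normal((10, 3))
K = (X @ X.T) ** 2                      # quadratic kernel matrix: a valid kernel
print(is_valid_kernel_matrix(K))        # True
print(is_valid_kernel_matrix(-K))       # False: the negated matrix is not PSD
```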
56What have we achieved?
- Replaced the problem of neural network architecture by kernel definition
- Arguably more natural to define, but the restriction is a bit unnatural
- Not a silver bullet, as fit with the data is key
- Can be applied to non-vectorial (or high dimensional) data
- Gained more flexible regularisation/generalisation control
- Gained a convex optimisation problem
- i.e. NO local minima!
57Historical note
- The first use of kernel methods in machine learning was in combination with the perceptron algorithm
- Aizerman et al., Theoretical foundations of the potential function method in pattern recognition learning (1964)
- Apparently failed because of generalisation issues: the margin was not optimised, but this can be incorporated with one extra parameter!
58PAC-Bayes General perspectives
- The goal of different theories is to capture the key elements that enable an understanding, analysis and learning of different phenomena
- There are several theories of machine learning, notably Bayesian and frequentist
- Different assumptions and hence different ranges of applicability and ranges of results
- Bayesian able to make more detailed probabilistic predictions
- Frequentist makes only the i.i.d. assumption
59Evidence and generalisation
- Evidence is a measure used in Bayesian analysis to inform model selection
- Evidence is related to the posterior volume of weight space consistent with the sample
- The link between evidence and generalisation was hypothesised by MacKay
- The first such result was obtained by S-T and Williamson (1997), PAC Analysis of a Bayes Estimator
- Bound on generalisation in terms of the volume of the sphere that can be inscribed in the version space -- included a dependence on the dimension of the space but applies to non-linear function classes
- Used the Luckiness framework, where luckiness is measured by the volume of the ball
60PAC-Bayes Theorem
- The PAC-Bayes theorem has a similar flavour but bounds in terms of prior and posterior distributions
- First version proved by McAllester in 1999
- Improved proof and bound due to Seeger in 2002, with application to Gaussian processes
- Application to SVMs by Langford and S-T, also in 2002
- Gives the tightest bounds on generalisation for SVMs
- see Langford's tutorial in JMLR
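The theorem statement itself is not preserved here; one commonly quoted form (the Langford-Seeger version) is: for any prior P over classifiers and any δ in (0, 1], with probability at least 1 − δ over an i.i.d. sample S of size m, every posterior Q satisfies

\[
\mathrm{kl}\big(\hat{E}_S(Q)\,\big\|\,E_D(Q)\big) \;\le\; \frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{m+1}{\delta}}{m},
\]

where Ê_S(Q) and E_D(Q) are the empirical and true error rates of the Gibbs classifier drawn from Q, and kl is the relative entropy between Bernoulli distributions.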
64Examples of bound evaluation
65Principal Components Analysis (PCA)
- The classical method is principal component analysis: it looks for directions of maximum variance, given by the eigenvectors of the covariance matrix
66Dual representation of PCA
- Eigenvectors of the kernel matrix give the dual representation
- This means we can perform the PCA projection in a kernel defined feature space: kernel PCA
67Kernel PCA
- Need to take care of normalisation to obtain
- where λ is the corresponding eigenvalue.
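A minimal kernel PCA sketch consistent with the bullets above, normalising each dual eigenvector by the square root of its eigenvalue; the kernel and data used are illustrative.

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Kernel PCA from a kernel matrix K: returns normalised dual vectors and projections."""
    m = K.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m
    Kc = J @ K @ J                              # centre the data in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)       # ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    lam, U = eigvals[idx], eigvecs[:, idx]
    alpha = U / np.sqrt(lam)                    # normalisation by the corresponding eigenvalue
    return alpha, Kc @ alpha                    # dual vectors and training projections

X = np.random.default_rng(0).standard_normal((30, 5))
K = (X @ X.T) ** 2                              # illustrative quadratic kernel
alpha, Z = kernel_pca(K)
print(Z.shape)                                  # (30, 2)
```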
68Generalisation of k-PCA
- How reliable are estimates obtained from a sample when working in such high dimensional spaces?
- Using Rademacher complexity bounds one can show that if a low dimensional projection captures most of the variance in the training set, it will do well on test examples as well, even in high dimensional spaces
- see S-T, Williams, Cristianini and Kandola (2004), On the eigenspectrum of the Gram matrix and the generalisation error of kernel PCA
- The bound also gives a new bound on the difference between the process and empirical eigenvalues
69Latent Semantic Indexing
- Developed in information retrieval to overcome the problem of semantic relations between words in the bag-of-words representation
- Uses the Singular Value Decomposition of the term-document matrix
70Lower dimensional representation
- If we truncate the matrix U at k columns we obtain the best k dimensional representation in the least squares sense
- In this case we can write a semantic matrix as
71Latent Semantic Kernels
- Can perform the same transformation in a kernel
defined feature space by performing an eigenvalue
decomposition of the kernel matrix (this is
equivalent to kernel PCA)
72Related techniques
- Number of related techniques
- Probabilistic LSI (pLSI)
- Non-negative Matrix Factorisation (NMF)
- Multinomial PCA (mPCA)
- Discrete PCA (DPCA)
- All can be viewed as alternative decompositions
73Different criteria
- Vary by
- Different constraints (eg non-negative entries)
- Different prior distributions (eg Dirichlet, Poisson)
- Different optimisation criteria (eg max likelihood, Bayesian)
- Unlike LSI, these typically suffer from local minima and so require EM type iterative algorithms to converge to solutions
74Other subspace methods
- Kernel partial Gram-Schmidt orthogonalisation is equivalent to incomplete Cholesky decomposition: greedy kernel PCA
- Kernel Partial Least Squares implements a multi-dimensional regression algorithm popular in chemometrics; it takes account of the labels
- Kernel Canonical Correlation Analysis uses paired datasets to learn a semantic representation independent of the two views
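As an illustration of the first bullet, here is a minimal sketch of incomplete (pivoted) Cholesky decomposition of a kernel matrix, the greedy procedure equivalent to kernel partial Gram-Schmidt; the stopping rule and data are illustrative.

```python
import numpy as np

def incomplete_cholesky(K, max_rank=10, tol=1e-8):
    """Greedy low-rank factorisation K ~= R @ R.T using pivoted Cholesky."""
    m = K.shape[0]
    R = np.zeros((m, max_rank))
    d = np.diag(K).astype(float)          # residual diagonal
    pivots = []
    for t in range(max_rank):
        i = int(np.argmax(d))
        if d[i] < tol:                    # residual variance exhausted
            R = R[:, :t]
            break
        pivots.append(i)
        R[:, t] = (K[:, i] - R @ R[i, :]) / np.sqrt(d[i])
        d -= R[:, t] ** 2
    return R, pivots

X = np.random.default_rng(0).standard_normal((100, 5))
K = X @ X.T                                # linear kernel of rank 5
R, pivots = incomplete_cholesky(K, max_rank=20)
print(R.shape, np.abs(K - R @ R.T).max())  # rank-5 factor, near-zero residual
```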
75Paired corpora
- Can we use paired corpora to extract more information?
- Two views of the same semantic object: we hypothesise that both views contain all of the necessary information, eg a document and its translation into a second language
76Aligned text
77Canadian parliament corpus
LAND MINES Ms. Beth Phinney (Hamilton Mountain,
Lib.) Mr. Speaker, we are pleased that the Nobel
peace prize has been given to those working to
ban land mines worldwide. We hope this award
will encourage the United States to join the over
100 countries planning to come to
E12
LES MINES ANTIPERSONNEL Mme Beth Phinney
(Hamilton Mountain, Lib.) Monsieur le Président,
nous nous réjouissons du fait que le prix Nobel
ait été attribué à ceux qui oeuvrent en faveur de
l'interdiction des mines antipersonnel dans le
monde entier. Nous espérons que cela incitera
les Américains à se joindre aux représentants de
plus de 100 pays qui ont l'intention de venir à
F12
78Cross-lingual LSI via SVD
M. L. Littman, S. T. Dumais, and T. K. Landauer.
Automatic cross-language information retrieval
using latent semantic indexing. In G.
Grefenstette, editor, Cross-language information
retrieval. Kluwer, 1998.
79Cross-lingual kernel canonical correlation analysis
[Diagram: the English and French input spaces are mapped by feature maps into an English feature space and a French feature space]
80Kernel canonical correlation analysis
81regularization
- Using kernel functions may result in overfitting
- Theoretical analysis shows that, provided the norms of the weight vectors are small, the correlation will still hold for new data
- see S-T and Cristianini (2004), Kernel Methods for Pattern Analysis
- Need to control the flexibility of the projections fE and fF: add a diagonal to the matrix D
- the coefficient of this added diagonal is the regularization parameter
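A minimal regularised kernel CCA sketch in the spirit of the bullets above, using one common formulation in which a multiple of the identity is added to each kernel matrix before forming the constraint (the slide's matrix D corresponds to the block-diagonal constraint matrix here); the kernels, data and regularisation value are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def kcca(K_E, K_F, kappa=0.5, n_components=2):
    """Regularised kernel CCA via a generalised eigenproblem (one common formulation)."""
    m = K_E.shape[0]
    Z = np.zeros((m, m))
    A = np.block([[Z, K_E @ K_F], [K_F @ K_E, Z]])          # cross-correlation block
    R_E = K_E + kappa * np.eye(m)                            # regularisation: add a diagonal
    R_F = K_F + kappa * np.eye(m)
    B = np.block([[R_E @ R_E, Z], [Z, R_F @ R_F]])           # constraint (block-diagonal) matrix
    rho, W = eigh(A, B)                                      # ascending eigenvalues
    order = np.argsort(rho)[::-1][:n_components]
    return rho[order], W[:m, order], W[m:, order]            # correlations, alpha, beta

# toy paired views sharing a latent signal (illustrative), with linear kernels
rng = np.random.default_rng(0)
latent = rng.standard_normal((40, 3))
X_E = latent @ rng.standard_normal((3, 10)) + 0.1 * rng.standard_normal((40, 10))
X_F = latent @ rng.standard_normal((3, 8)) + 0.1 * rng.standard_normal((40, 8))
rho, alpha, beta = kcca(X_E @ X_E.T, X_F @ X_F.T)
print(rho)   # leading canonical correlations (high for these strongly correlated views)
```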
82Pseudo query test
83Experimental Results
- The goal was to retrieve the paired document.
- Experimental procedure
- (1) LSI/KCCA trained on paired documents,
- (2) All test documents projected into the LSI/KCCA semantic space,
- (3) Each query was projected into the LSI/KCCA semantic space and documents were retrieved using nearest neighbour based on cosine distance to the query.
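A minimal sketch of the retrieval step (3), assuming queries and documents have already been projected into a shared semantic space:

```python
import numpy as np

def retrieve(query_vec, doc_matrix):
    """Index of the document nearest to the query by cosine similarity;
    rows of doc_matrix are documents in the LSI/KCCA semantic space."""
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return int(np.argmax(D @ q))

docs = np.random.default_rng(0).standard_normal((100, 10))   # illustrative projections
print(retrieve(docs[42] + 0.01, docs))                        # retrieves document 42
```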
84English-French retrieval accuracy
85Applying to different data types
- Data
- Combined image and associated text obtained from the web
- Three categories: sport, aviation and paintball
- 400 examples from each category (1200 overall)
- Features extracted: HSV, Texture, Bag of words
- Tasks
- Classification of web pages into the 3 categories
- Text query -> image retrieval
86Classification error of baseline method
- Previous error rates obtained using probabilistic ICA, classification done for single feature groups
87Classification rates using KCCA
88Classification with multi-views
- If we use KCCA to generate a semantic feature space and then learn with an SVM, we can envisage combining the two steps into a single SVM-2K
- learns two SVMs, one on each representation, but constrains their outputs to be similar across the training (and any unlabelled) data
- Can give classification for single mode test data: applied to patent classification for Japanese patents
- Again theoretical analysis predicts good generalisation if the training SVMs have a good match, while the two representations have a small overlap
- see Farquhar, Hardoon, Meng, S-T, Szedmak (2005), Two view learning: SVM-2K, Theory and Practice
89Results
90Targeted Density Learning
- Learning a probability density function in an L1 sense is hard
- Learning for a single classification or regression task is easy
- Consider a set of tasks (the Touchstone Class) F that we think may arise, with a distribution emphasising those more likely to be seen
- Now learn a density that does well for the tasks in F
- Surprisingly, it can be shown that just including a small subset of the tasks gives us convergence over all the tasks
91Example plot
- Consider fitting the cumulative distribution up to the sample points, as proposed by Mukherjee and Vapnik
- The plot shows the loss as a function of the number of constraints (tasks) included
92Idealised view of progress
- Study problem to develop theoretical model
- Derive analysis that indicates factors that affect solution quality
- Translate into optimisation maximising the factors, relaxing to ensure convexity
- Develop efficient solutions using specifics of the task
93Role of theory
- Theoretical analysis plays a critical auxiliary role
- It can never be the complete picture
- Needs to capture the factors that most influence the quality of the result
- A means to an end: only as good as the result
- Important to be open to refinements/relaxations that can be shown to better characterise real-world learning and translate into efficient algorithms
94Conclusions and future directions
- Extensions to learning data with output structure that can be used to inform the learning
- More general subspace methods: algorithms and theoretical foundations
- Extensions to more complex tasks, eg system identification, reinforcement learning
- Linking learning into tasks with more detailed prior knowledge, eg stochastic differential equations for climate modelling
- Extending the theoretical framework for more general learning tasks, eg learning a density function