1
Kernel Methods: the Emergence of a Well-founded Machine Learning
  • John Shawe-Taylor
  • Centre for Computational Statistics and Machine
    Learning
  • University College London

2
Overview
  • Celebration of 10 years of kernel methods
  • what has been achieved and
  • what can we learn from the experience?
  • Some historical perspectives
  • Theory or not?
  • Applicable or not?
  • Some emphases
  • Role of theory: need for plurality of approaches
  • Importance of scalability

3
Caveats
  • Personal perspective with inevitable bias
  • One very small slice through what is now a very
    big field
  • Focus on theory with emphasis on frequentist
    analysis
  • There is no pro-forma for scientific research
  • But role of theory worth discussing
  • Needed to give firm foundation for proposed
    approaches?

4
Motivation behind kernel methods
  • Linear learning typically has nice properties
  • Unique optimal solutions
  • Fast learning algorithms
  • Better statistical analysis
  • But one big problem
  • Insufficient capacity

5
Historical perspective
  • Minsky and Papert highlighted the weakness in
    their book Perceptrons
  • Neural networks overcame the problem by gluing
    together many linear units with non-linear
    activation functions
  • Solved problem of capacity and led to very
    impressive extension of applicability of learning
  • But ran into training problems of speed and
    multiple local minima

6
Kernel methods approach
  • The kernel methods approach is to stick with
    linear functions but work in a high dimensional
    feature space
  • The expectation is that the feature space has a
    much higher dimension than the input space.

7
Example
  • Consider the mapping
  • If we consider a linear equation in this feature
    space
  • We actually have an ellipse, i.e. a non-linear
    shape in the input space.
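One standard choice of such a mapping, given here as an assumed illustration (the exact map on the slide may differ):

```latex
% Assumed degree-2 monomial embedding of a 2-d input
\phi : (x_1, x_2) \;\longmapsto\; \bigl(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\bigr)

% A linear equation in the feature space,
w_1 x_1^2 + w_2 x_2^2 + w_3 \sqrt{2}\, x_1 x_2 = c ,
% is a conic section (e.g. an ellipse) in the input space.
```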

8
Capacity of feature spaces
  • The capacity is proportional to the dimension; for
    example, linear threshold functions in a 2-dimensional
    space can realise every labelling of 3 points in
    general position (the VC dimension is dimension + 1).

9
Form of the functions
  • So kernel methods use linear functions in a
    feature space
  • For regression this could be the function
  • For classification we require thresholding (see the
    sketch below)
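A minimal sketch of the functional forms meant here, with feature map φ, weight vector w and bias b (the notation is assumed, not taken verbatim from the slide):

```latex
% Regression: a real-valued linear function in the feature space
f(\mathbf{x}) = \langle \mathbf{w}, \phi(\mathbf{x}) \rangle + b

% Classification: threshold the real-valued output
h(\mathbf{x}) = \operatorname{sign}\bigl(\langle \mathbf{w}, \phi(\mathbf{x}) \rangle + b\bigr)
```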

10
Problems of high dimensions
  • Capacity may easily become too large and lead to
    over-fitting: being able to realise every
    classifier means we are unlikely to generalise well
  • Computational costs involved in dealing with
    large vectors

11
Overview
  • Two theoretical approaches converged on very
    similar algorithms
  • Frequentist led to Support Vector Machine
  • Bayesian approach led to Bayesian inference using
    Gaussian Processes
  • First we briefly discuss the Bayesian approach
    before mentioning some of the frequentist results

12
Bayesian approach
  • The Bayesian approach relies on a probabilistic
    analysis by positing
  • a noise model
  • a prior distribution over the function class
  • Inference involves updating the prior
    distribution with the likelihood of the data
  • Possible outputs
  • MAP function
  • Bayesian posterior average

13
Bayesian approach
  • Avoids overfitting by
  • Controlling the prior distribution
  • Averaging over the posterior
  • For Gaussian noise model (for regression) and
    Gaussian process prior we obtain a kernel
    method where
  • Kernel is covariance of the prior GP
  • Noise model translates into addition of ridge to
    kernel matrix
  • MAP and averaging give the same solution
  • Link with the infinite hidden node limit of single
    hidden layer neural networks; see the seminal paper:
    Williams, Computing with infinite networks (1997)
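For concreteness, the textbook Gaussian-process regression predictor shows how the kernel and the noise ridge enter (a sketch of the standard result, not the slide's own formula):

```latex
% GP regression with Gaussian noise of variance \sigma^2 (standard result)
% The prior covariance k(\cdot,\cdot) plays the role of the kernel and the
% noise model adds a ridge \sigma^2 I to the kernel matrix K.
\bar{f}(\mathbf{x}_*) \;=\; \mathbf{k}_*^{\top}\,(K + \sigma^2 I)^{-1}\,\mathbf{y},
\qquad K_{ij} = k(\mathbf{x}_i,\mathbf{x}_j), \quad (\mathbf{k}_*)_i = k(\mathbf{x}_i,\mathbf{x}_*)
```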

14
Bayesian approach
  • Subject to assumptions about noise model and
    prior distribution
  • Can get error bars on the output
  • Compute evidence for the model and use for model
    selection
  • Approach developed for different noise models
  • eg classification
  • Typically requires approximate inference

15
Frequentist approach
  • Source of randomness is assumed to be a
    distribution that generates the training data
    i.i.d. with the same distribution generating
    the test data
  • Different/weaker assumptions than the Bayesian
    approach, so more general, but typically less
    detailed analysis can be derived
  • Main focus is on generalisation error analysis

16
Capacity problem
  • What do we mean by generalisation?

17
Generalisation of a learner
18
Example of Generalisation
  • We consider the Breast Cancer dataset from the
    UCI repository
  • Use the simple Parzen window classifier, whose weight
    vector is w = μ+ − μ− ,
  • where μ+ (μ−) is the average of the
    positive (negative) training examples.
  • The threshold is set so the hyperplane bisects the line
    joining these two points (a sketch in code is given
    below).
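A minimal NumPy sketch of this classifier; the function and variable names are illustrative, and ±1 labels are assumed:

```python
import numpy as np

def parzen_window_classifier(X_pos, X_neg):
    """Simple Parzen-window-style classifier described above (sketch).

    The weight vector is the difference of the class means, and the
    threshold places the decision hyperplane at the midpoint of the
    line joining them.
    """
    mu_pos = X_pos.mean(axis=0)          # average of positive examples
    mu_neg = X_neg.mean(axis=0)          # average of negative examples
    w = mu_pos - mu_neg                  # weight vector
    # threshold so that the hyperplane bisects the segment joining the means
    b = 0.5 * (np.dot(mu_pos, mu_pos) - np.dot(mu_neg, mu_neg))
    return w, b

def predict(w, b, X):
    return np.sign(X @ w - b)

# Usage sketch: w, b = parzen_window_classifier(X[y == 1], X[y == -1])
#               y_hat = predict(w, b, X_test)
```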

19
Example of Generalisation
  • By repeatedly drawing random training sets S of
    size m we estimate the distribution of the
    generalisation error
  • by using the test set error as a proxy for the
    true generalisation
  • We plot the histogram and the average of the
    distribution for various sizes of training set
  • 648, 342, 273, 205, 137, 68, 34, 27, 20, 14, 7.

20
Example of Generalisation
  • Since the expected classifier is in all cases the
    same
  • we do not expect large differences in the
    average of the distribution, though the
    non-linearity of the loss function means they
    won't be the same exactly.

21
Error distribution full dataset
22
Error distribution dataset size 342
23
Error distribution dataset size 273
24
Error distribution dataset size 205
25
Error distribution dataset size 137
26
Error distribution dataset size 68
27
Error distribution dataset size 34
28
Error distribution dataset size 27
29
Error distribution dataset size 20
30
Error distribution dataset size 14
31
Error distribution dataset size 7
32
Observations
  • Things can get bad if the number of training examples
    is small compared to the dimension (in this case the
    input dimension is 9)
  • The mean can be a bad predictor of true generalisation,
    i.e. things can look okay in expectation but still
    go badly wrong
  • Key ingredient of learning: keep flexibility
    high while still ensuring good generalisation

33
Controlling generalisation
  • The critical method of controlling generalisation
    for classification is to force a large margin on
    the training data

34
Intuitive and rigorous explanations
  • Makes classification robust to uncertainties in
    inputs
  • Can randomly project into lower dimensional
    spaces and still have separation so effectively
    low dimensional
  • Rigorous statistical analysis shows effective
    dimension
  • This is not structural risk minimisation over VC
    classes, since the hierarchy depends on the data:
    data-dependent structural risk minimisation
  • see S-T, Bartlett, Williamson and Anthony (1996
    and 1998)

35
Learning framework
  • Since there are lower bounds in terms of the VC
    dimension, the margin is detecting a favourable
    distribution/task alignment; the luckiness framework
    captures this idea
  • Now consider using an SVM on the same data and
    compare the distribution of generalisations
  • SVM distribution in red

36
Error distribution dataset size 205
37
Error distribution dataset size 137
38
Error distribution dataset size 68
39
Error distribution dataset size 34
40
Error distribution dataset size 27
41
Error distribution dataset size 20
42
Error distribution dataset size 14
43
Error distribution dataset size 7
44
Handling training errors
  • So far only considered case where data can be
    separated
  • For non-separable sets we can introduce a penalty
    proportional to the amount by which a point fails
    to meet the margin
  • These amounts are often referred to as slack
    variables from optimisation theory

45
Support Vector Machines
  • SVM optimisation (a sketch is given below)
  • Analysis of this case is given using the augmented
    space trick in
  • S-T and Cristianini (1999 and 2002), On the
    generalisation of soft margin algorithms.
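For reference, the standard 1-norm soft-margin SVM primal (a sketch; the slide may use an equivalent but differently parametrised form):

```latex
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \;\; \tfrac{1}{2}\|\mathbf{w}\|^{2} + C\sum_{i=1}^{m}\xi_i
\quad\text{s.t.}\quad
y_i\bigl(\langle\mathbf{w},\phi(\mathbf{x}_i)\rangle + b\bigr) \ge 1 - \xi_i,
\qquad \xi_i \ge 0,\;\; i = 1,\dots,m .
```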

46
Complexity problem
  • Let's apply the quadratic example
  • to a 20x30 image of 600 pixels: this gives
    approximately 180,000 dimensions!
  • Would be computationally infeasible to work in
    this space
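A quick count behind that figure: the number of degree-2 monomials (with repeats) in n = 600 pixel values is

```latex
\binom{n+1}{2} \;=\; \frac{n(n+1)}{2} \;=\; \frac{600 \cdot 601}{2} \;=\; 180\,300 \;\approx\; 180{,}000 .
```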

47
Dual representation
  • Suppose the weight vector is a linear combination of
    the (mapped) training examples
  • then we can evaluate the inner product with a new
    example using only pairwise inner products (see below)
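In symbols (a sketch, using αi for the dual coefficients as later slides do):

```latex
% Weight vector as a combination of mapped training examples
\mathbf{w} \;=\; \sum_{i=1}^{m} \alpha_i\, \phi(\mathbf{x}_i)

% Inner product with a new example needs only pairwise inner products
\langle \mathbf{w}, \phi(\mathbf{x}) \rangle
  \;=\; \sum_{i=1}^{m} \alpha_i\, \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}) \rangle
```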

48
Learning the dual variables
  • The αi are known as dual variables
  • Since any component orthogonal to the space
    spanned by the training data has no effect, there is a
    general result that weight vectors have a dual
    representation: the representer theorem.
  • Hence, we can reformulate algorithms to learn the dual
    variables rather than the weight vector directly

49
Dual form of SVM
  • The dual form of the SVM can also be derived by
    taking the dual optimisation problem! This gives the
    problem sketched below
  • Note that the threshold must be determined from
    border examples
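The standard soft-margin dual, given as a reference sketch; the bias is recovered from the "border" examples with 0 < αi < C:

```latex
\max_{\boldsymbol{\alpha}} \;\; \sum_{i=1}^{m}\alpha_i
  - \tfrac{1}{2}\sum_{i,j=1}^{m} \alpha_i\alpha_j\, y_i y_j\,
    \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle
\quad\text{s.t.}\quad
\sum_{i=1}^{m}\alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C .
```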

50
Using kernels
  • The critical observation is that again only inner
    products are used
  • Suppose that we now have a shortcut method of
    computing these inner products
  • Then we do not need to explicitly compute the
    feature vectors either in training or testing

51
Kernel example
  • As an example consider the mapping
  • Here we have a shortcut
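A tiny numeric check of this kind of shortcut; the specific degree-2 feature map below is an assumed illustration:

```python
import numpy as np

def phi(x):
    """Assumed degree-2 monomial feature map for 2-d inputs (illustrative only)."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

def quad_kernel(x, z):
    """Shortcut: the squared inner product in the input space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# The explicit feature-space inner product and the kernel shortcut agree.
assert np.isclose(np.dot(phi(x), phi(z)), quad_kernel(x, z))
```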

52
Efficiency
  • Hence, in the pixel example rather than work with
    180000 dimensional vectors, we compute a 600
    dimensional inner product and then square the
    result!
  • Can even work in infinite dimensional spaces, eg
    using the Gaussian kernel
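The Gaussian kernel has the standard form below, where σ is the width parameter; its feature space is infinite dimensional:

```latex
\kappa(\mathbf{x}, \mathbf{z}) \;=\; \exp\!\left(-\,\frac{\|\mathbf{x} - \mathbf{z}\|^{2}}{2\sigma^{2}}\right)
```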

53
Using Gaussian kernel for Breast Cancer, dataset size 273
54
Data size 342
55
Constraints on the kernel
  • There is a restriction on the kernel function: the
    kernel matrix it generates must be symmetric and
    positive semi-definite
  • This restriction holding for any training set is enough
    to guarantee the function is a kernel (a check is
    sketched below)
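A small NumPy sketch of checking this condition on a finite sample (the helper name is illustrative):

```python
import numpy as np

def is_valid_kernel_matrix(K, tol=1e-10):
    """Check the finitely-positive-semi-definite condition on a kernel matrix.

    A symmetric function is a kernel iff every kernel matrix it generates
    (on any finite set of points) is positive semi-definite.
    """
    if not np.allclose(K, K.T, atol=tol):
        return False
    eigenvalues = np.linalg.eigvalsh(K)      # eigenvalues of a symmetric matrix
    return bool(eigenvalues.min() >= -tol)

# Example: a Gaussian kernel matrix on random points is PSD.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)
print(is_valid_kernel_matrix(K))             # True
```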

56
What have we achieved?
  • Replaced problem of neural network architecture
    by kernel definition
  • Arguably more natural to define but restriction
    is a bit unnatural
  • Not a silver bullet as fit with data is key
  • Can be applied to non-vectorial (or high dim)
    data
  • Gained more flexible regularisation/
    generalisation control
  • Gained convex optimisation problem
  • i.e. NO local minima!

57
Historical note
  • First use of kernel methods in machine learning
    was in combination with perceptron algorithm
  • Aizerman et al., Theoretical foundations of the
    potential function method in pattern recognition
    learning (1964)
  • Apparently failed because of generalisation
    issues: the margin was not optimised, but this can be
    incorporated with one extra parameter!

58
PAC-Bayes: General perspectives
  • The goal of different theories is to capture the
    key elements that enable an understanding,
    analysis and learning of different phenomena
  • Several theories of machine learning notably
    Bayesian and frequentist
  • Different assumptions and hence different range
    of applicability and range of results
  • Bayesian able to make more detailed probabilistic
    predictions
  • Frequentist makes only i.i.d. assumption

59
Evidence and generalisation
  • Evidence is a measure used in Bayesian analysis
    to inform model selection
  • Evidence related to the posterior volume of
    weight space consistent with the sample
  • The link between evidence and generalisation was
    hypothesised by MacKay
  • The first such result was obtained by S-T and
    Williamson (1997), A PAC Analysis of a Bayesian
    Estimator
  • Bound on generalisation in terms of the volume of
    the sphere that can be inscribed in the version
    space -- included a dependence on the dimension of the
    space but applies to non-linear function classes
  • Used the Luckiness framework, where luckiness is
    measured by the volume of the ball

60
PAC-Bayes Theorem
  • PAC-Bayes theorem has a similar flavour but
    bounds in terms of prior and posterior
    distributions
  • First version proved by McAllester in 1999
  • Improved proof and bound due to Seeger in 2002
    with application to Gaussian processes
  • Application to SVMs by Langford and S-T also in
    2002
  • Gives tightest bounds on generalisation for SVMs
  • see Langford's tutorial in JMLR

61
(No Transcript)
62
(No Transcript)
63
(No Transcript)
64
Examples of bound evaluation
65
Principal Components Analysis (PCA)
  • The classical method is principal component analysis:
    it looks for directions of maximum variance, given
    by the eigenvectors of the covariance matrix

66
Dual representation of PCA
  • The eigenvectors of the kernel matrix give the dual
    representation
  • This means we can perform the PCA projection in a kernel
    defined feature space: kernel PCA

67
Kernel PCA
  • Need to take care of normalisation: the dual vector
    must be scaled by 1/√λ to obtain unit-norm directions,
  • where λ is the corresponding eigenvalue.
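A compact NumPy sketch of kernel PCA with this normalisation (an illustrative implementation under standard assumptions, not the slide's own code):

```python
import numpy as np

def kernel_pca(K, n_components):
    """Kernel PCA sketch: project onto the top eigenvectors of the
    (centred) kernel matrix, with the 1/sqrt(eigenvalue) normalisation
    mentioned above. Variable names are illustrative.
    """
    m = K.shape[0]
    # Centre the kernel matrix (i.e. centre the data in feature space).
    J = np.eye(m) - np.ones((m, m)) / m
    Kc = J @ K @ J
    # Eigen-decomposition; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:n_components]
    lam, alpha = eigvals[idx], eigvecs[:, idx]
    # Normalise dual vectors so the feature-space directions have unit norm
    # (retained eigenvalues are assumed positive).
    alpha = alpha / np.sqrt(lam)
    # Projections of the training points onto the principal directions.
    return Kc @ alpha, alpha, lam
```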

68
Generalisation of k-PCA
  • How reliable are estimates obtained from a sample
    when working in such high dimensional spaces?
  • Using Rademacher complexity bounds we can show that
    if a low dimensional projection captures most of
    the variance in the training set, it will do well on
    test examples as well, even in high dimensional
    spaces
  • see S-T, Williams, Cristianini and Kandola
    (2004) On the eigenspectrum of the Gram matrix
    and the generalisation error of kernel PCA
  • The bound also gives a new bound on the
    difference between the process and empirical
    eigenvalues

69
Latent Semantic Indexing
  • Developed in information retrieval to overcome the
    problem that semantic relations between words are
    lost in the bag-of-words (BoW) representation
  • Uses the Singular Value Decomposition of the
    term-document matrix

70
Lower dimensional representation
  • If we truncate the matrix U at k columns we obtain
    the best k-dimensional representation in the
    least squares sense
  • In this case we can write a semantic matrix mapping
    documents into the latent space (a sketch is given
    below)
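A minimal NumPy sketch of the truncation step (names are illustrative):

```python
import numpy as np

def lsi_semantic_matrix(term_doc, k):
    """Rank-k LSI sketch: truncate the SVD of the term-document matrix.

    Returns U_k, which can serve as the 'semantic matrix' mapping
    bag-of-words vectors into the k-dimensional latent space.
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k = U[:, :k]            # best rank-k basis in the least-squares sense
    return U_k

# Usage sketch: project a document's term vector d into the latent space,
#   d_latent = lsi_semantic_matrix(D, k).T @ d
```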

71
Latent Semantic Kernels
  • Can perform the same transformation in a kernel
    defined feature space by performing an eigenvalue
    decomposition of the kernel matrix (this is
    equivalent to kernel PCA)

72
Related techniques
  • Number of related techniques
  • Probabilistic LSI (pLSI)
  • Non-negative Matrix Factorisation (NMF)
  • Multinomial PCA (mPCA)
  • Discrete PCA (DPCA)
  • All can be viewed as alternative decompositions

73
Different criteria
  • Vary by
  • Different constraints (eg non-negative entries)
  • Different prior distributions (eg Dirichlet,
    Poisson)
  • Different optimisation criteria (eg max
    likelihood, Bayesian)
  • Unlike LSI, these typically suffer from local minima
    and so require EM-type iterative algorithms to
    converge to solutions

74
Other subspace methods
  • Kernel partial Gram-Schmidt orthogonalisation is
    equivalent to incomplete Cholesky decomposition:
    greedy kernel PCA
  • Kernel Partial Least Squares implements a
    multi-dimensional regression algorithm popular in
    chemometrics; it takes account of labels
  • Kernel Canonical Correlation Analysis uses paired
    datasets to learn a semantic representation
    independent of the two views

75
Paired corpora
  • Can we use paired corpora to extract more
    information?
  • Two views of the same semantic object: hypothesise
    that both views contain all of the necessary
    information, eg a document and its translation into a
    second language

76
Aligned text
77
Canadian parliament corpus
LAND MINES Ms. Beth Phinney (Hamilton Mountain,
Lib.) Mr. Speaker, we are pleased that the Nobel
peace prize has been given to those working to
ban land mines worldwide. We hope this award
will encourage the United States to join the over
100 countries planning to come to
E12
LES MINES ANTIPERSONNEL Mme Beth Phinney
(Hamilton Mountain, Lib.) Monsieur le Président,
nous nous réjouissons du fait que le prix Nobel
ait été attribué à ceux qui oeuvrent en faveur de
l'interdiction des mines antipersonnel dans le
monde entier. Nous espérons que cela incitera
les Américains à se joindre aux représentants de
plus de 100 pays qui ont l'intention de venir à
F12
78
Cross-lingual LSI via SVD
M. L. Littman, S. T. Dumais, and T. K. Landauer.
Automatic cross-language information retrieval
using latent semantic indexing. In G.
Grefenstette, editor, Cross-language information
retrieval. Kluwer, 1998.
79
Cross-lingual kernel canonical correlation analysis
[Diagram: documents in the English and French input spaces
are mapped by feature maps into an English feature space
and a French feature space, where correlated projection
directions fE and fF are learned]
80
Kernel canonical correlation analysis
81
Regularization
  • Using kernel functions may result in overfitting
  • Theoretical analysis shows that provided the
    norms of the weight vectors are small the
    correlation will still hold for new data
  • see S-T and Cristianini (2004), Kernel Methods
    for Pattern Analysis
  • Need to control the flexibility of the projections fE
    and fF: add a diagonal to the matrix D, scaled by the
    regularization parameter (a sketch is given below)
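One common way of writing the regularised KCCA objective, given as a sketch (here K_E and K_F are the kernel matrices of the two views and κ the regularisation parameter that adds the diagonal; the slide's exact parametrisation may differ):

```latex
% Regularised kernel CCA (one common parametrisation)
\max_{\boldsymbol{\alpha},\,\boldsymbol{\beta}}\;
\frac{\boldsymbol{\alpha}^{\top} K_E K_F\, \boldsymbol{\beta}}
     {\sqrt{\bigl(\boldsymbol{\alpha}^{\top} K_E^{2}\boldsymbol{\alpha}
                + \kappa\,\boldsymbol{\alpha}^{\top} K_E\boldsymbol{\alpha}\bigr)\,
            \bigl(\boldsymbol{\beta}^{\top} K_F^{2}\boldsymbol{\beta}
                + \kappa\,\boldsymbol{\beta}^{\top} K_F\boldsymbol{\beta}\bigr)}}
```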
82
Pseudo query test
83
Experimental Results
  • The goal was to retrieve the paired document.
  • Experimental procedure
  • (1) LSI/KCCA trained on paired documents,
  • (2) All test documents projected into the
    LSI/KCCA semantic space,
  • (3) Each query was projected into the LSI/KCCA
    semantic space and documents were retrieved using
    nearest neighbour based on cosine distance to the
    query.
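A minimal sketch of step (3), assuming the projected query and document vectors are already available as NumPy arrays:

```python
import numpy as np

def retrieve_nearest(query_vec, doc_vecs):
    """Return the index of the document whose projection is closest to
    the query under cosine distance (nearest neighbour retrieval).
    """
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    cosine_sim = D @ q                   # cosine similarity to every document
    return int(np.argmax(cosine_sim))    # nearest neighbour
```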

84
English-French retrieval accuracy
85
Applying to different data types
  • Data
  • Combined image and associated text obtained from
    the web
  • Three categories: sport, aviation and paintball
  • 400 examples from each category (1200 overall)
  • Features extracted: HSV, texture, bag of words
  • Tasks
  • Classification of web pages into the 3 categories
  • Text query -> image retrieval

86
Classification error of baseline method
  • Previous error rates were obtained using
    probabilistic ICA; classification was done for single
    feature groups

87
Classification rates using KCCA
88
Classification with multi-views
  • If we use KCCA to generate a semantic feature
    space and then learn with an SVM, we can envisage
    combining the two steps into a single SVM-2K
  • It learns two SVMs, one on each representation, but
    constrains their outputs to be similar across the
    training data (and any unlabelled data).
  • Can give classification for single-mode test data;
    applied to patent classification for Japanese
    patents
  • Again theoretical analysis predicts good
    generalisation if the training SVMs have a good
    match, while the two representations have a small
    overlap
  • see Farquhar, Hardoon, Meng, S-T, Szedmak (2005),
    Two view learning: SVM-2K, Theory and Practice

89
Results
90
Targeted Density Learning
  • Learning a probability density function in an L1
    sense is hard
  • Learning for a single classification or
    regression task is easy
  • Consider a set of tasks (Touchstone Class) F that
    we think may arise with a distribution
    emphasising those more likely to be seen
  • Now learn density that does well for the tasks in
    F.
  • Surprisingly it can be shown that just including
    a small subset of tasks gives us convergence over
    all the tasks

91
Example plot
  • Consider fitting cumulative distribution up to
    sample points as proposed by Mukherjee and
    Vapnik.
  • Plot shows loss as a function of the number of
    constraints (tasks) included

92
Idealised view of progress
  • Study the problem to develop a theoretical model
  • Derive an analysis that indicates the factors that
    affect solution quality
  • Translate into an optimisation maximising those
    factors, relaxing to ensure convexity
  • Develop efficient solutions using specifics of
    the task
93
Role of theory
  • Theoretical analysis plays a critical auxiliary
    role
  • Can never be the complete picture
  • Needs to capture the factors that most influence
    the quality of the result
  • A means to an end: only as good as the result
  • Important to be open to refinements/relaxations
    that can be shown to better characterise
    real-world learning and translate into efficient
    algorithms

94
Conclusions and future directions
  • Extensions to learning data with output structure
    that can be used to inform the learning
  • More general subspace methods: algorithms and
    theoretical foundations
  • Extensions to more complex tasks, eg system
    identification, reinforcement learning
  • Linking learning into tasks with more detailed
    prior knowledge eg stochastic differential
    equations for climate modelling
  • Extending theoretical framework for more general
    learning tasks eg learning a density function