Title: Overview of Kernel Methods
1 Overview of Kernel Methods
Steve Vincent
Adapted from John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis
2 Coordinate Transformation
Plot of x vs y
- Planetary position in a two-dimensional orthogonal coordinate system
3 Coordinate Transformation
Plot of x vs y
- Planetary position in a two-dimensional orthogonal coordinate system
Plot of x² vs y²
4 Non-linear Kernel Classification
If the data is not separable by a hyperplane in the input space, map it into a feature space where a linear separation is possible
5 Pattern Analysis Algorithm
- Computational efficiency
  - The algorithm must be computationally efficient; the degree of any polynomial involved should keep it practical for large data sets
- Robustness
  - Able to handle noisy data and identify approximate patterns
- Statistical stability
  - Output should not be sensitive to a particular dataset, only to the underlying source of the data
6 Kernel Method
- Mapping into an embedding or feature space defined by the kernel
- Learning algorithm for discovering linear patterns in that space
- Learning algorithm must work in dual space
  - Primal solution computes the weight vector explicitly
  - Dual solution gives the solution as a linear combination of the training examples
7 Kernel Methods: the mapping
[Figure: the map φ takes points from the original space into the feature (vector) space]
8 Linear Regression
- Given training data S = {(x_1, y_1), ..., (x_l, y_l)}
- Construct linear function g(x) = ⟨w, x⟩
- Creates pattern function f(x, y) = |y − g(x)| ≈ 0
9 1-d Regression
10 Least Squares Approximation
- Want g(x_i) ≈ y_i
- Define error ξ_i = y_i − g(x_i) = y_i − ⟨w, x_i⟩
- Minimize loss L(g, S) = Σ_i ξ_i² = Σ_i (y_i − ⟨w, x_i⟩)²
11 Optimal Solution
- Want w minimizing the loss
- Mathematical model: L(w, S) = ‖y − Xw‖²
- Optimality condition: ∂L/∂w = −2Xᵀy + 2XᵀXw = 0
- Solution satisfies the normal equations XᵀXw = Xᵀy (see the sketch below)
Solving the n×n system is O(n³)
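A minimal numpy sketch of the primal least-squares solution via the normal equations; the data, shapes, and variable names (X, y, w) are illustrative, not from the slides.

```python
import numpy as np

# Sketch: solve the normal equations X^T X w = X^T y for synthetic data.
rng = np.random.default_rng(0)
l, n = 100, 5                      # l training points, n features
X = rng.normal(size=(l, n))
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=l)

# Solving this n x n linear system costs O(n^3).
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)
```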
12 Ridge Regression
- The inverse of XᵀX typically does not exist (e.g. when n > l).
- Use the least norm solution for fixed λ > 0.
- Regularized problem: min_w  λ‖w‖² + ‖y − Xw‖²
- Optimality condition: (XᵀX + λI_n) w = Xᵀy
Requires O(n³) operations
13 Ridge Regression (cont)
- The inverse (XᵀX + λI_n)⁻¹ always exists for any λ > 0.
- Alternative (dual) representation: w = Xᵀα, where α = (XXᵀ + λI_l)⁻¹ y
Solving the l×l equation is O(l³)
14 Dual Ridge Regression
- To predict a new point x: g(x) = ⟨w, x⟩ = Σ_i α_i ⟨x_i, x⟩, with α = (G + λI_l)⁻¹ y
- Note we need only compute G, the Gram matrix, with entries G_ij = ⟨x_i, x_j⟩
Ridge regression requires only inner products between data points (see the sketch below)
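A hedged sketch contrasting the primal and dual ridge regression solutions; the variable names (X, y, lam) and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
l, n = 50, 10
X = rng.normal(size=(l, n))
y = rng.normal(size=l)
lam = 0.5

# Primal: w = (X^T X + lam I_n)^{-1} X^T y  -- an n x n system, O(n^3)
w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Dual: alpha = (G + lam I_l)^{-1} y with Gram matrix G = X X^T -- an l x l system, O(l^3)
G = X @ X.T                        # only inner products between data points
alpha = np.linalg.solve(G + lam * np.eye(l), y)

# Both representations give the same prediction on a new point x
x_new = rng.normal(size=n)
print(w @ x_new, alpha @ (X @ x_new))
```

The two printed values agree because w = Xᵀα, so the choice between primal and dual is purely a question of efficiency (n³ vs l³), as the next slide notes.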
15 Efficiency
- To compute
  - w in primal ridge regression is O(n³)
  - α in dual ridge regression is O(l³)
- To predict a new point x
  - primal: O(n)
  - dual: O(nl)
Dual is better if n >> l
16 Notes on Ridge Regression
- Regularization is key to addressing stability.
- Regularization lets the method work when n >> l.
- Dual is more efficient when n >> l.
- Dual only requires inner products of data.
17 Linear Regression in Feature Space
- Key idea
  - Map data to a higher dimensional space (feature space) and perform linear regression in the embedded space.
- Embedding map: φ: x ∈ Rⁿ ↦ φ(x) ∈ F
- Alternative form
18 Nonlinear Regression in Feature Space
In primal representation: g(x) = ⟨w, φ(x)⟩
19 Nonlinear Regression in Feature Space
In dual representation: g(x) = Σ_i α_i ⟨φ(x_i), φ(x)⟩
20 Kernel Methods: intuitive idea
- Find a mapping φ such that, in the new space, problem solving is easier (e.g. linear)
- The kernel represents the similarity between two objects (documents, terms, ...), defined as the dot-product in this new vector space
- But the mapping is left implicit
- Easy generalization of many dot-product (or distance) based pattern recognition algorithms
21 Derivation of Kernel
22 Kernel Function
- A kernel is a function K such that K(x, z) = ⟨φ(x), φ(z)⟩ for all x, z ∈ X
- There are many possible kernels.
- The simplest is the linear kernel, K(x, z) = ⟨x, z⟩ (see the sketch below).
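A small sketch of a few standard kernel functions; the parameter names (degree, c, sigma) and default values are illustrative assumptions.

```python
import numpy as np

def linear_kernel(x, z):
    # simplest kernel: plain dot product
    return np.dot(x, z)

def polynomial_kernel(x, z, degree=3, c=1.0):
    # (<x, z> + c)^degree
    return (np.dot(x, z) + c) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    # exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
```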
23 Kernel: more formal definition
- A kernel k(x,y)
  - is a similarity measure
  - defined by an implicit mapping φ
  - from the original space to a vector space (feature space)
  - such that k(x,y) = ⟨φ(x), φ(y)⟩
- This similarity measure and the mapping include
  - Invariance or other a priori knowledge
  - Simpler structure (linear representation of the data)
  - The class of functions the solution is taken from
  - Possibly infinite dimension (hypothesis space for learning)
  - but still computational efficiency when computing k(x,y)
24 Kernelization
- Such a K is called a Mercer kernel.
- Kernels were introduced in mathematics to solve integral equations.
- Kernels measure similarity of inputs.
25 Brief Comments on Hilbert Spaces
- A Hilbert space is a generalization of finite dimensional vector spaces with an inner product to possibly infinite dimension.
- Most interesting infinite dimensional vector spaces are function spaces.
- Hilbert spaces are the simplest among such spaces.
- Prime example: L² (the square integrable functions)
- Any continuous linear functional on a Hilbert space is given by an inner product with a vector (Riesz Representation Theorem).
- A representation of a vector w.r.t. a fixed basis is called a Fourier expansion.
26 Making Kernels
- The kernel function must be symmetric, K(x, z) = K(z, x),
- And satisfy the inequalities that follow from the Cauchy-Schwarz inequality: K(x, z)² ≤ K(x, x) K(z, z).
27 The Kernel Gram Matrix
- With kernel-method-based learning, the sole information used from the training data set is the kernel Gram matrix, K_ij = k(x_i, x_j)
- If the kernel is valid, K is symmetric positive semi-definite (all eigenvalues are non-negative), as checked in the sketch below
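A sketch that builds a Gram matrix for synthetic data and checks symmetry and positive semi-definiteness numerically; the data, kernel choice, and tolerance are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

# Gram matrix K_ij = k(x_i, x_j)
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

eigvals = np.linalg.eigvalsh(K)                 # eigvalsh: K is symmetric
print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue >= 0 (up to rounding):", eigvals.min() >= -1e-10)
```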
28 Mercer's Theorem
- Suppose X is compact (always true for finite examples).
- Suppose K is a Mercer kernel.
- Then it can be expanded, using eigenvalues λ_i and eigenfunctions ψ_i of K, as k(x, z) = Σ_i λ_i ψ_i(x) ψ_i(z)
- Now, using the eigenfunctions and their span, construct a function space H.
- H is called a Reproducing Kernel Hilbert Space (RKHS).
29Characterization of Kernels
- Prove (kernel function)
- K is symmetric
- V is an orthogonal matrix.
-
- Let
-
-
-
-
30 Characterization of Kernels
- Then for any x_i, x_j: ⟨φ(x_i), φ(x_j)⟩ = Σ_t λ_t (v_t)_i (v_t)_j = K_ij
- So a valid kernel matrix is positive semi-definite.
- Conversely, suppose there exists a negative eigenvalue λ_s with eigenvector v_s.
- Consider the point z = Σ_i (v_s)_i φ(x_i).
- Then ‖z‖² = v_sᵀ K v_s = λ_s < 0, a contradiction, so no negative eigenvalue can exist.
31Reproducing Kernel Hilbert Spaces
- Reproducing Kernel Hilbert Spaces (RKHS)
- 1 The Hilbert space L2 is too big for our
purpose, containing too many non-smooth
functions. One approach to obtaining restricted,
smooth spaces is the RKHS approach. - A RKHS is smaller then a general Hilbert space.
- Define the reproducing kernel map
- (to each x we associate a function k(.,x))
-
32Characterization of Kernels
- We now define an inner product.
- Now construct a vector space containing all
linear combination of the function k(.,x) - (This will be our
RKHS.) - Let and define
-
- Prove it as an inner product in RKHS.
33Characterization of Kernels
- Symmetry is obvious, and linearity is easy to
show. - Proveltf.fgt0gtf0.
-
- say k is the representer of evaluation.2
- By above,
- (reproduction property)
- So
-
- (From Cauchy-Schwartz).
- If ltf,fgt0gtf0, and This is our RKHS.
34 Characterization of Kernels
- Formal definition
  - For a compact subset X of R^d and a Hilbert space H of functions f: X → R, we say that H is a reproducing kernel Hilbert space if there exists k: X² → R such that
  - 1. k has the reproducing property: ⟨k(·, x), f⟩ = f(x)
  - 2. k spans H: span{k(·, x) : x ∈ X} = H
35 Popular Kernels based on vectors
By Hilbert-Schmidt Kernels (Courant and Hilbert 1953): K(u, v) = ⟨φ(u), φ(v)⟩ for certain φ and K, e.g. polynomial and Gaussian (RBF) kernels
36 Examples of Kernels
37 How to build new kernels
- Kernel combinations preserving validity: if k₁ and k₂ are valid kernels and a ≥ 0, then a·k₁, k₁ + k₂, k₁·k₂, and f(x)·k₁(x,z)·f(z) are valid kernels (see the sketch below)
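A sketch of these closure rules as code; the base kernels, weights a and b, and the scaling function f are illustrative assumptions.

```python
import numpy as np

def k1(x, z):
    return np.dot(x, z)                              # linear kernel

def k2(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))   # Gaussian kernel

def k_sum(x, z, a=1.0, b=2.0):
    # non-negative combination a*k1 + b*k2 is again a kernel
    return a * k1(x, z) + b * k2(x, z)

def k_prod(x, z):
    # pointwise product of kernels is again a kernel
    return k1(x, z) * k2(x, z)

def k_scaled(x, z, f=lambda u: 1.0 + np.sum(u ** 2)):
    # f(x) * k1(x, z) * f(z) is again a kernel for any real-valued f
    return f(x) * k1(x, z) * f(z)
```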
38 Important Points
- Kernel method = linear method + embedding into feature space.
- Kernel functions are used to do the embedding efficiently.
- The feature space is a higher dimensional space, so we must regularize.
- Choose a kernel appropriate to the domain.
39 Principal Component Analysis (PCA)
- Subtract the mean (centers the data).
- Compute the covariance matrix, S.
- Compute the eigenvectors of S, sort them according to their eigenvalues, and keep the first M vectors.
- Project the data points onto those vectors (see the sketch below).
- Also called the Karhunen-Loève transformation.
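A minimal numpy sketch of these PCA steps on synthetic data; the data and the choice M = 2 are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
M = 2

Xc = X - X.mean(axis=0)                 # subtract the mean
S = np.cov(Xc, rowvar=False)            # covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)    # eigenvectors/eigenvalues of S
order = np.argsort(eigvals)[::-1][:M]   # sort by eigenvalue, keep first M
projected = Xc @ eigvecs[:, order]      # project data onto those vectors
print(projected.shape)                  # (100, M)
```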
40Kernel PCA
- Principal Component Analysis (PCA) is one of the
fundamental technique in a wide range of areas. - Simply stated, PCA diagonalizes (or, finds
singular value decomposition (SVD) of) the
covariance matrix. - Equivalently, we may find SVD of the data matrix.
- Instead of PCA in the original input spaces, we
may perform PCA in the feature space. This is
called Kernel PCA. - Find eigenvalues and eigenvectors of the Gram
matrix. -
- For many applications, we need to find online
algorithms, i.e., algorithms that do not need to
store the Gram matrix.
41 PCA in dot-product form
- Assume we have centered observations, column vectors x_i (centered means Σ_i x_i = 0)
- PCA finds the principal axes by diagonalizing the covariance matrix C with the singular value decomposition
  (1)  C v = λ v        (v: eigenvector, λ: eigenvalue)
  (2)  C = (1/l) Σ_i x_i x_iᵀ        (covariance matrix)
42PCA in dot-product
Substituting equation 2 into 1, we get
(3)
Thus,
Scalar
(4)
All solutions v with ??0 lie in the span of
x1,x2,..,xl,,i.e.
(5)
43Kernel PCA algorithm
- If we do PCA in feature space, covariance matrix
(6)
Which can be diagonalized with nonnegative
eigenvalues satisfying
(7)
And we have shown that V lie in the span of
?(xi), so we have
(8)
44Kernel PCA
- Apply kernel trick,we have K(xi,xj) lt ?(xi),
?(xj)gt
(9)
And we can finally write the expression as the
eigenvalue Problem K? ??
(10)
45Kernel PCA algorithm outline
- Given a set of m-dimensional dataxk, calculate
K, for example, Gaussian K(xi,xj)exp(-xi-xj2/
d). - Carry out centering in feature space.
- Solve eigenvalue problem, K? ?? .
- For a test pattern x, we extract a nonlinear
component via
(11)
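A hedged sketch of the outline above with a Gaussian kernel; the data, sigma, and the component extracted are assumptions, and eigenvector normalization and centering of the test kernel vector are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
sigma = 1.0

# Gram matrix with a Gaussian kernel
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

# Centering in feature space: K <- K - 1K - K1 + 1K1, with 1 = (1/l) * ones
l = K.shape[0]
one = np.ones((l, l)) / l
Kc = K - one @ K - K @ one + one @ K @ one

# Solve the eigenvalue problem K alpha = lambda alpha
eigvals, alphas = np.linalg.eigh(Kc)
alphas = alphas[:, ::-1]                     # sort by decreasing eigenvalue
eigvals = eigvals[::-1]

# Nonlinear component of a test pattern x via y = sum_i alpha_i k(x_i, x)
x_test = rng.normal(size=3)
k_test = np.exp(-np.sum((X - x_test) ** 2, axis=1) / (2 * sigma ** 2))
print(k_test @ alphas[:, 0])
```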
46 Stability of Kernel Algorithms
Our objective for learning is to improve generalization performance: cross-validation, Bayesian methods, generalization bounds, ...
Suppose we find a pattern in a sample S. Is this pattern also likely to be present in new data?
We can use concentration inequalities (McDiarmid's theorem) to prove the following.
Theorem: Let S = {x_1, ..., x_l} be an i.i.d. sample from P and define the sample mean of f(x) as f̄ = (1/l) Σ_i f(x_i); then, with probability at least 1 − δ, |f̄ − E_P[f(x)]| ≤ ε(l, δ)
(the probability that the sample mean and the population mean differ by less than ε is more than 1 − δ, independent of P!)
A small simulation of this concentration effect is sketched below.
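A minimal simulation illustrating how the sample mean concentrates around the population mean; the distribution (Uniform(0,1)), sample size, ε, and number of trials are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
l, trials, eps = 200, 10_000, 0.1
pop_mean = 0.5                              # mean of Uniform(0, 1)

samples = rng.uniform(0.0, 1.0, size=(trials, l))
sample_means = samples.mean(axis=1)
frac_within = np.mean(np.abs(sample_means - pop_mean) < eps)
print(f"fraction of trials with |sample mean - population mean| < {eps}: {frac_within:.3f}")
```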
47Rademacher Complexity
Problem we only checked the generalization
performance for a single fixed
pattern f(x). What is we want to
search over a function class F? Intuition we
need to incorporate the complexity of this
function class.
Rademacher complexity captures the ability of the
function class to fit random noise. (
uniform distributed)
f1
(empirical RC)
f2
xi
48Generalization Bound
Theorem Let f be a function in F which maps to
0,1. (e.g. loss functions) Then, with
probability at least over random draws
of size every f satisfies
Relevance The expected pattern Ef0 will also
be present in a new data set,
if the last 2 terms are small
- Complexity function class F small
- number of training data large
49Linear Functions (in feature space)
Consider the function class
and a sample
Then, the empirical RC of FB is bounded by
Relevance Since
it follows that if we
control the norm in
kernel algorithms, we control the complexity of
the function class (regularization).
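A rough Monte Carlo check of the empirical Rademacher complexity of the norm-bounded linear class against the (2B/l)·√(tr K) bound; the data, B, and the number of Rademacher draws are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
l, n, B = 40, 5, 1.0
X = rng.normal(size=(l, n))
K = X @ X.T                                   # linear-kernel Gram matrix

# For ||w|| <= B:  sup_w (2/l) sum_i sigma_i <w, x_i> = (2B/l) * ||sum_i sigma_i x_i||
sigmas = rng.choice([-1.0, 1.0], size=(2000, l))
sups = 2 * B / l * np.linalg.norm(sigmas @ X, axis=1)

print("empirical RC estimate:", sups.mean())
print("bound (2B/l) sqrt(tr K):", 2 * B / l * np.sqrt(np.trace(K)))
```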
50Margin Bound (classification)
Theorem Choose cgt0 (the margin).
F f(x,y)-yg(x), y1,-1
S (0,1) probability of
violating bound.
(prob. of misclassification)
Relevance We our classification error on new
samples. Moreover, we have a strategy to improve
generalization choose the margin c as large
possible such that all samples are correctly
classified (e.g. support vector
machines).
51 Next Part
- Constructing Kernels
- Kernels for Text
  - Vector space kernels
- Kernels for Structured Data
  - Subsequence kernels
  - Trie-based kernels
- Kernels from Generative Models
  - P-kernels
  - Fisher kernels