Title: Overview of Kernel Methods
1 Overview of Kernel Methods
Steve Vincent
Adapted from John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis
2 Coordinate Transformation
Plot of x vs y
- Planetary position in a two-dimensional orthogonal coordinate system
3 Coordinate Transformation
Plot of x vs y
- Planetary position in a two-dimensional orthogonal coordinate system
Plot of x² vs y²
4 Non-linear Kernel Classification
If the data is not separable by a hyperplane in the input space, map it into a feature space where a linear separation is possible
5 Pattern Analysis Algorithm
- Computational efficiency
  - The algorithm must be computationally efficient; the degree of any polynomial involved should keep it practical for large data sets
- Robustness
  - Able to handle noisy data and identify approximate patterns
- Statistical stability
  - Output should not be sensitive to a particular dataset, only to the underlying source of the data
6 Kernel Method
- Mapping into an embedding or feature space defined by the kernel
- Learning algorithm for discovering linear patterns in that space
- Learning algorithm must work in dual space
  - Primal solution computes the weight vector explicitly
  - Dual solution gives the solution as a linear combination of the training examples
7 Kernel Methods: the mapping
[Figure: the map φ takes points from the original space into the feature (vector) space]
8 Linear Regression
- Given training data S = {(x_1, y_1), ..., (x_l, y_l)}
- Construct linear function g(x) = ⟨w, x⟩
- Creates pattern function f(x, y) = |y − g(x)| ≈ 0
9 1-d Regression
10 Least Squares Approximation
- Want g(x_i) ≈ y_i
- Define error ξ_i = y_i − g(x_i) = y_i − ⟨w, x_i⟩
- Minimize loss L(g, S) = Σ_i ξ_i² = Σ_i (y_i − ⟨w, x_i⟩)²
11 Optimal Solution
- Want w minimizing the loss
- Mathematical model: L(w, S) = ‖y − Xw‖²
- Optimality condition: ∂L/∂w = −2Xᵀy + 2XᵀXw = 0
- Solution satisfies the normal equations XᵀXw = Xᵀy (see the sketch below)
Solving the n×n system is O(n³)
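A minimal numpy sketch of the primal least-squares solution via the normal equations; the data, shapes, and variable names (X, y, w) are illustrative, not from the slides.

```python
import numpy as np

# Sketch: solve the normal equations X^T X w = X^T y for synthetic data.
rng = np.random.default_rng(0)
l, n = 100, 5                      # l training points, n features
X = rng.normal(size=(l, n))
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=l)

# Solving this n x n linear system costs O(n^3).
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)
```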
12 Ridge Regression
- The inverse of XᵀX typically does not exist (e.g. when n > l).
- Use the least norm solution for fixed λ > 0.
- Regularized problem: min_w  λ‖w‖² + ‖y − Xw‖²
- Optimality condition: (XᵀX + λI_n) w = Xᵀy
Requires O(n³) operations
13 Ridge Regression (cont)
- The inverse (XᵀX + λI_n)⁻¹ always exists for any λ > 0.
- Alternative (dual) representation: w = Xᵀα, where α = (XXᵀ + λI_l)⁻¹ y
Solving the l×l equation is O(l³)
14 Dual Ridge Regression
- To predict a new point x: g(x) = ⟨w, x⟩ = Σ_i α_i ⟨x_i, x⟩, with α = (G + λI_l)⁻¹ y
- Note we need only compute G, the Gram matrix, with entries G_ij = ⟨x_i, x_j⟩
Ridge regression requires only inner products between data points (see the sketch below)
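A hedged sketch contrasting the primal and dual ridge regression solutions; the variable names (X, y, lam) and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
l, n = 50, 10
X = rng.normal(size=(l, n))
y = rng.normal(size=l)
lam = 0.5

# Primal: w = (X^T X + lam I_n)^{-1} X^T y  -- an n x n system, O(n^3)
w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Dual: alpha = (G + lam I_l)^{-1} y with Gram matrix G = X X^T -- an l x l system, O(l^3)
G = X @ X.T                        # only inner products between data points
alpha = np.linalg.solve(G + lam * np.eye(l), y)

# Both representations give the same prediction on a new point x
x_new = rng.normal(size=n)
print(w @ x_new, alpha @ (X @ x_new))
```

The two printed values agree because w = Xᵀα, so the choice between primal and dual is purely a question of efficiency (n³ vs l³), as the next slide notes.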
15 Efficiency
- To compute
  - w in primal ridge regression is O(n³)
  - α in dual ridge regression is O(l³)
- To predict a new point x
  - primal: O(n)
  - dual: O(nl)
Dual is better if n >> l
16 Notes on Ridge Regression
- Regularization is key to addressing stability.
- Regularization lets the method work when n >> l.
- Dual is more efficient when n >> l.
- Dual only requires inner products of data.
17 Linear Regression in Feature Space
- Key idea
  - Map data to a higher dimensional space (feature space) and perform linear regression in the embedded space.
- Embedding map: φ: x ∈ Rⁿ ↦ φ(x) ∈ F
- Alternative form
18 Nonlinear Regression in Feature Space
In primal representation: g(x) = ⟨w, φ(x)⟩
19 Nonlinear Regression in Feature Space
In dual representation: g(x) = Σ_i α_i ⟨φ(x_i), φ(x)⟩
20 Kernel Methods: intuitive idea
- Find a mapping φ such that, in the new space, problem solving is easier (e.g. linear)
- The kernel represents the similarity between two objects (documents, terms, ...), defined as the dot-product in this new vector space
- But the mapping is left implicit
- Easy generalization of many dot-product (or distance) based pattern recognition algorithms
21 Derivation of Kernel
22 Kernel Function
- A kernel is a function K such that K(x, z) = ⟨φ(x), φ(z)⟩ for all x, z ∈ X
- There are many possible kernels.
- The simplest is the linear kernel, K(x, z) = ⟨x, z⟩ (see the sketch below).
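A small sketch of a few standard kernel functions; the parameter names (degree, c, sigma) and default values are illustrative assumptions.

```python
import numpy as np

def linear_kernel(x, z):
    # simplest kernel: plain dot product
    return np.dot(x, z)

def polynomial_kernel(x, z, degree=3, c=1.0):
    # (<x, z> + c)^degree
    return (np.dot(x, z) + c) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    # exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
```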
23 Kernel: more formal definition
- A kernel k(x,y)
  - is a similarity measure
  - defined by an implicit mapping φ
  - from the original space to a vector space (feature space)
  - such that k(x,y) = ⟨φ(x), φ(y)⟩
- This similarity measure and the mapping include
  - Invariance or other a priori knowledge
  - Simpler structure (linear representation of the data)
  - The class of functions the solution is taken from
  - Possibly infinite dimension (hypothesis space for learning)
  - but still computational efficiency when computing k(x,y)
24 Kernelization
- Such a K is called a Mercer kernel.
- Kernels were introduced in mathematics to solve integral equations.
- Kernels measure similarity of inputs.
25 Brief Comments on Hilbert Spaces
- A Hilbert space is a generalization of finite dimensional vector spaces with an inner product to possibly infinite dimension.
- Most interesting infinite dimensional vector spaces are function spaces.
- Hilbert spaces are the simplest among such spaces.
- Prime example: L² (the square integrable functions)
- Any continuous linear functional on a Hilbert space is given by an inner product with a vector (Riesz Representation Theorem).
- A representation of a vector w.r.t. a fixed basis is called a Fourier expansion.
26 Making Kernels
- The kernel function must be symmetric, K(x, z) = K(z, x),
- And satisfy the inequalities that follow from the Cauchy-Schwarz inequality: K(x, z)² ≤ K(x, x) K(z, z).
27 The Kernel Gram Matrix
- With kernel-method-based learning, the sole information used from the training data set is the kernel Gram matrix, K_ij = k(x_i, x_j)
- If the kernel is valid, K is symmetric positive semi-definite (all eigenvalues are non-negative), as checked in the sketch below
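A sketch that builds a Gram matrix for synthetic data and checks symmetry and positive semi-definiteness numerically; the data, kernel choice, and tolerance are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

# Gram matrix K_ij = k(x_i, x_j)
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

eigvals = np.linalg.eigvalsh(K)                 # eigvalsh: K is symmetric
print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue >= 0 (up to rounding):", eigvals.min() >= -1e-10)
```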
28 Mercer's Theorem
- Suppose X is compact (always true for finite examples).
- Suppose K is a Mercer kernel.
- Then it can be expanded, using eigenvalues λ_i and eigenfunctions ψ_i of K, as k(x, z) = Σ_i λ_i ψ_i(x) ψ_i(z)
- Now, using the eigenfunctions and their span, construct a function space H.
- H is called a Reproducing Kernel Hilbert Space (RKHS).
29Characterization of Kernels
- Prove (kernel function)
- K is symmetric
- V is an orthogonal matrix.
-
- Let
-
-
-
-
30 Characterization of Kernels
- Then for any x_i, x_j: ⟨φ(x_i), φ(x_j)⟩ = Σ_t λ_t (v_t)_i (v_t)_j = K_ij
- So a valid kernel matrix is positive semi-definite.
- Conversely, suppose there exists a negative eigenvalue λ_s with eigenvector v_s.
- Consider the point z = Σ_i (v_s)_i φ(x_i).
- Then ‖z‖² = v_sᵀ K v_s = λ_s < 0, a contradiction, so no negative eigenvalue can exist.
31Reproducing Kernel Hilbert Spaces
- Reproducing Kernel Hilbert Spaces (RKHS)
- 1 The Hilbert space L2 is too big for our
purpose, containing too many non-smooth
functions. One approach to obtaining restricted,
smooth spaces is the RKHS approach. - A RKHS is smaller then a general Hilbert space.
- Define the reproducing kernel map
- (to each x we associate a function k(.,x))
-
32Characterization of Kernels
- We now define an inner product.
- Now construct a vector space containing all
linear combination of the function k(.,x) - (This will be our
RKHS.) - Let and define
-
- Prove it as an inner product in RKHS.
33Characterization of Kernels
- Symmetry is obvious, and linearity is easy to
show. - Proveltf.fgt0gtf0.
-
- say k is the representer of evaluation.2
- By above,
- (reproduction property)
- So
-
- (From Cauchy-Schwartz).
- If ltf,fgt0gtf0, and This is our RKHS.
34 Characterization of Kernels
- Formal definition
  - For a compact subset X of R^d and a Hilbert space H of functions f: X → R, we say that H is a reproducing kernel Hilbert space if there exists k: X² → R such that
  - 1. k has the reproducing property: ⟨k(·, x), f⟩ = f(x)
  - 2. k spans H: span{k(·, x) : x ∈ X} = H
35 Popular Kernels based on vectors
By Hilbert-Schmidt Kernels (Courant and Hilbert 1953): K(u, v) = ⟨φ(u), φ(v)⟩ for certain φ and K, e.g. polynomial and Gaussian (RBF) kernels
36 Examples of Kernels
37 How to build new kernels
- Kernel combinations preserving validity: if k₁ and k₂ are valid kernels and a ≥ 0, then a·k₁, k₁ + k₂, k₁·k₂, and f(x)·k₁(x,z)·f(z) are valid kernels (see the sketch below)
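A sketch of these closure rules as code; the base kernels, weights a and b, and the scaling function f are illustrative assumptions.

```python
import numpy as np

def k1(x, z):
    return np.dot(x, z)                              # linear kernel

def k2(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))   # Gaussian kernel

def k_sum(x, z, a=1.0, b=2.0):
    # non-negative combination a*k1 + b*k2 is again a kernel
    return a * k1(x, z) + b * k2(x, z)

def k_prod(x, z):
    # pointwise product of kernels is again a kernel
    return k1(x, z) * k2(x, z)

def k_scaled(x, z, f=lambda u: 1.0 + np.sum(u ** 2)):
    # f(x) * k1(x, z) * f(z) is again a kernel for any real-valued f
    return f(x) * k1(x, z) * f(z)
```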
38 Important Points
- Kernel method = linear method + embedding into feature space.
- Kernel functions are used to do the embedding efficiently.
- The feature space is a higher dimensional space, so we must regularize.
- Choose a kernel appropriate to the domain.
39 Principal Component Analysis (PCA)
- Subtract the mean (centers the data).
- Compute the covariance matrix, S.
- Compute the eigenvectors of S, sort them according to their eigenvalues, and keep the first M vectors.
- Project the data points onto those vectors (see the sketch below).
- Also called the Karhunen-Loève transformation.
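A minimal numpy sketch of these PCA steps on synthetic data; the data and the choice M = 2 are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
M = 2

Xc = X - X.mean(axis=0)                 # subtract the mean
S = np.cov(Xc, rowvar=False)            # covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)    # eigenvectors/eigenvalues of S
order = np.argsort(eigvals)[::-1][:M]   # sort by eigenvalue, keep first M
projected = Xc @ eigvecs[:, order]      # project data onto those vectors
print(projected.shape)                  # (100, M)
```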
40Kernel PCA
- Principal Component Analysis (PCA) is one of the
fundamental technique in a wide range of areas. - Simply stated, PCA diagonalizes (or, finds
singular value decomposition (SVD) of) the
covariance matrix. - Equivalently, we may find SVD of the data matrix.
- Instead of PCA in the original input spaces, we
may perform PCA in the feature space. This is
called Kernel PCA. - Find eigenvalues and eigenvectors of the Gram
matrix. -
- For many applications, we need to find online
algorithms, i.e., algorithms that do not need to
store the Gram matrix.
41 PCA in dot-product form
- Assume we have centered observations, column vectors x_i (centered means Σ_i x_i = 0)
- PCA finds the principal axes by diagonalizing the covariance matrix C with the singular value decomposition
  (1)  C v = λ v        (v: eigenvector, λ: eigenvalue)
  (2)  C = (1/l) Σ_i x_i x_iᵀ        (covariance matrix)
42PCA in dot-product
Substituting equation 2 into 1, we get
(3)
Thus,
Scalar
(4)
All solutions v with ??0 lie in the span of
x1,x2,..,xl,,i.e.
(5)
43Kernel PCA algorithm
- If we do PCA in feature space, covariance matrix
(6)
Which can be diagonalized with nonnegative
eigenvalues satisfying
(7)
And we have shown that V lie in the span of
?(xi), so we have
(8)
44Kernel PCA
- Apply kernel trick,we have K(xi,xj) lt ?(xi),
?(xj)gt
(9)
And we can finally write the expression as the
eigenvalue Problem K? ??
(10)
45Kernel PCA algorithm outline
- Given a set of m-dimensional dataxk, calculate
K, for example, Gaussian K(xi,xj)exp(-xi-xj2/
d). - Carry out centering in feature space.
- Solve eigenvalue problem, K? ?? .
- For a test pattern x, we extract a nonlinear
component via
(11)
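A hedged sketch of the outline above with a Gaussian kernel; the data, sigma, and the component extracted are assumptions, and eigenvector normalization and centering of the test kernel vector are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
sigma = 1.0

# Gram matrix with a Gaussian kernel
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

# Centering in feature space: K <- K - 1K - K1 + 1K1, with 1 = (1/l) * ones
l = K.shape[0]
one = np.ones((l, l)) / l
Kc = K - one @ K - K @ one + one @ K @ one

# Solve the eigenvalue problem K alpha = lambda alpha
eigvals, alphas = np.linalg.eigh(Kc)
alphas = alphas[:, ::-1]                     # sort by decreasing eigenvalue
eigvals = eigvals[::-1]

# Nonlinear component of a test pattern x via y = sum_i alpha_i k(x_i, x)
x_test = rng.normal(size=3)
k_test = np.exp(-np.sum((X - x_test) ** 2, axis=1) / (2 * sigma ** 2))
print(k_test @ alphas[:, 0])
```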
46 Stability of Kernel Algorithms
Our objective for learning is to improve generalization performance: cross-validation, Bayesian methods, generalization bounds, ...
Suppose we find a pattern in a sample S. Is this pattern also likely to be present in new data?
We can use concentration inequalities (McDiarmid's theorem) to prove the following.
Theorem: Let S = {x_1, ..., x_l} be an i.i.d. sample from P and define the sample mean of f(x) as f̄ = (1/l) Σ_i f(x_i); then, with probability at least 1 − δ, |f̄ − E_P[f(x)]| ≤ ε(l, δ)
(the probability that the sample mean and the population mean differ by less than ε is more than 1 − δ, independent of P!)
A small simulation of this concentration effect is sketched below.
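A minimal simulation illustrating how the sample mean concentrates around the population mean; the distribution (Uniform(0,1)), sample size, ε, and number of trials are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
l, trials, eps = 200, 10_000, 0.1
pop_mean = 0.5                              # mean of Uniform(0, 1)

samples = rng.uniform(0.0, 1.0, size=(trials, l))
sample_means = samples.mean(axis=1)
frac_within = np.mean(np.abs(sample_means - pop_mean) < eps)
print(f"fraction of trials with |sample mean - population mean| < {eps}: {frac_within:.3f}")
```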
47Rademacher Complexity
Problem we only checked the generalization
performance for a single fixed
pattern f(x). What is we want to
search over a function class F? Intuition we
need to incorporate the complexity of this
function class.
Rademacher complexity captures the ability of the
function class to fit random noise. (
uniform distributed)
f1
(empirical RC)
f2
xi
48Generalization Bound
Theorem Let f be a function in F which maps to
0,1. (e.g. loss functions) Then, with
probability at least over random draws
of size every f satisfies
Relevance The expected pattern Ef0 will also
be present in a new data set,
if the last 2 terms are small
- Complexity function class F small
- number of training data large
49Linear Functions (in feature space)
Consider the function class
and a sample
Then, the empirical RC of FB is bounded by
Relevance Since
it follows that if we
control the norm in
kernel algorithms, we control the complexity of
the function class (regularization).
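A rough Monte Carlo check of the empirical Rademacher complexity of the norm-bounded linear class against the (2B/l)·√(tr K) bound; the data, B, and the number of Rademacher draws are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
l, n, B = 40, 5, 1.0
X = rng.normal(size=(l, n))
K = X @ X.T                                   # linear-kernel Gram matrix

# For ||w|| <= B:  sup_w (2/l) sum_i sigma_i <w, x_i> = (2B/l) * ||sum_i sigma_i x_i||
sigmas = rng.choice([-1.0, 1.0], size=(2000, l))
sups = 2 * B / l * np.linalg.norm(sigmas @ X, axis=1)

print("empirical RC estimate:", sups.mean())
print("bound (2B/l) sqrt(tr K):", 2 * B / l * np.sqrt(np.trace(K)))
```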
50Margin Bound (classification)
Theorem Choose cgt0 (the margin).
F f(x,y)-yg(x), y1,-1
S (0,1) probability of
violating bound.
(prob. of misclassification)
Relevance We our classification error on new
samples. Moreover, we have a strategy to improve
generalization choose the margin c as large
possible such that all samples are correctly
classified (e.g. support vector
machines).
51 Next Part
- Constructing Kernels
- Kernels for Text
  - Vector space kernels
- Kernels for Structured Data
  - Subsequence kernels
  - Trie-based kernels
- Kernels from Generative Models
  - P-kernels
  - Fisher kernels