Transcript and Presenter's Notes

Title: Overview of Kernel Methods


1
Overview of Kernel Methods
Steve Vincent
Adapted from John Shawe-Taylor and Nello
Cristianini, Kernel Methods for Pattern Analysis
2
Coordinate Transformation
Plot of x vs y
  • Planetary position in a two dimensional
    orthogonal coordinate system

3
Coordinate Transformation
Plot of x vs y
  • Planetary position in a two dimensional
    orthogonal coordinate system

Plot of x² vs y²
4
Non-linear Kernel Classification
If the data is not separable by a hyperplane in the
original space, map it into a feature space where it is
5
Pattern Analysis Algorithm
  • Computational efficiency
  • All algorithms must be computationally
    efficient, and the degree of any polynomial
    involved should keep the algorithm practical
    for large data sets
  • Robustness
  • Able to handle noisy data and identify
    approximate patterns
  • Statistical Stability
  • Output should not be sensitive to a particular
    dataset, only to the underlying source of the data

6
Kernel Method
  • Mapping into embedding or feature space defined
    by kernel
  • Learning algorithm for discovering linear
    patterns in that space
  • Learning algorithm must work in dual space
  • Primal solution computes the weight vector
    explicitly
  • Dual solution gives the solution as a linear
    combination of the training examples

7
Kernel Methods: the mapping
φ : Original Space → Feature (Vector) Space
8
Linear Regression
  • Given training data (x1, y1), ..., (xℓ, yℓ)
  • Construct a linear function g(x) = ⟨w, x⟩
  • This creates the pattern function f(x, y) = |y − g(x)| ≈ 0

9
1-d Regression
10
Least Squares Approximation
  • Want g(xi) = ⟨w, xi⟩ ≈ yi
  • Define error ξi = yi − g(xi)
  • Minimize loss L(w) = Σi ξi²

11
Optimal Solution
  • Want the w that minimizes the squared loss
  • Mathematical model: min_w L(w) = ‖y − Xw‖²,
    where X is the ℓ×n matrix with rows xiᵀ
  • Optimality condition: ∇L(w) = 0
  • Solution satisfies the normal equations XᵀXw = Xᵀy

Solving an n×n system is O(n³)
12
Ridge Regression
  • The inverse of XᵀX typically does not exist.
  • Use the least-norm solution for a fixed λ > 0
  • Regularized problem: min_w λ‖w‖² + ‖y − Xw‖²
  • Optimality condition: (XᵀX + λIn)w = Xᵀy

Requires O(n³) operations
13
Ridge Regression (cont)
  • The inverse (XᵀX + λIn)⁻¹ always exists for any λ > 0
  • Alternative (dual) representation:
    w = Xᵀα with α = (XXᵀ + λIℓ)⁻¹ y

Solving an ℓ×ℓ system is O(ℓ³)
14
Dual Ridge Regression
  • To predict a new point x:
    g(x) = ⟨w, x⟩ = Σi αi ⟨xi, x⟩
  • Note: we need only compute G = XXᵀ, the Gram matrix
    of inner products, Gij = ⟨xi, xj⟩

Ridge Regression requires only inner products
between data points
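A minimal NumPy sketch of this dual solution (the toy data, the value of λ, and the helper names are illustrative, not from the slides): solve (G + λI)α = y, then predict using only inner products with the training points.

```python
import numpy as np

def dual_ridge_fit(G, y, lam):
    """Solve (G + lam*I) alpha = y for the dual variables alpha (an ell x ell system)."""
    return np.linalg.solve(G + lam * np.eye(len(y)), y)

def dual_ridge_predict(alpha, k_new):
    """g(x) = sum_i alpha_i <x_i, x>; k_new holds those inner products."""
    return k_new @ alpha

# Toy data: ell = 5 examples in n = 3 dimensions (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = X @ np.array([1.0, -2.0, 0.5])

G = X @ X.T                          # Gram matrix of pairwise inner products
alpha = dual_ridge_fit(G, y, lam=0.1)
x_new = rng.normal(size=3)
print(dual_ridge_predict(alpha, X @ x_new))
```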
15
Efficiency
  • To compute
  • w in primal ridge regression is O(n³)
  • α in dual ridge regression is O(ℓ³)
  • To predict a new point x
  • primal
    O(n)
  • dual
    O(nℓ)

Dual is better if n ≫ ℓ
16
Notes on Ridge Regression
  • Regularization is key to addressing stability and
    generalization.
  • Regularization lets the method work when n ≫ ℓ.
  • The dual is more efficient when n ≫ ℓ.
  • The dual only requires inner products of the data.

17
Linear Regression in Feature Space
  • Key Idea:
  • Map the data to a higher-dimensional space (feature
    space) and perform linear regression in the embedded
    space.
  • Embedding map: φ : x ∈ Rⁿ ↦ φ(x) ∈ F

Alternative form: g(x) = ⟨w, φ(x)⟩ = Σi wi φi(x)
18
Nonlinear Regression in Feature Space
In the primal representation:
w = (ΦᵀΦ + λIN)⁻¹ Φᵀy and g(x) = ⟨w, φ(x)⟩,
where Φ is the matrix with rows φ(xi)ᵀ
19
Nonlinear Regression in Feature Space
In the dual representation:
g(x) = Σi αi ⟨φ(xi), φ(x)⟩ with α = (G + λIℓ)⁻¹ y,
where Gij = ⟨φ(xi), φ(xj)⟩
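A self-contained sketch of this dual form (kernel ridge regression), assuming an illustrative polynomial kernel and toy one-dimensional data that are not part of the slides; the only change from the linear case is that the Gram matrix of inner products is replaced by a kernel matrix.

```python
import numpy as np

def poly_kernel(A, B, degree=2, c=1.0):
    """Polynomial kernel k(a, b) = (<a, b> + c)**degree for all row pairs of A and B."""
    return (A @ B.T + c) ** degree

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(30, 1))
y = X[:, 0] ** 2 - 0.5 * X[:, 0]              # a nonlinear target function

lam = 1e-3
K = poly_kernel(X, X)                         # kernel matrix replaces the linear Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

x_test = np.array([[0.3]])
y_hat = poly_kernel(x_test, X) @ alpha        # g(x) = sum_i alpha_i k(x_i, x)
print(float(y_hat[0]))                        # close to 0.3**2 - 0.5*0.3 = -0.06
```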
20
Kernel Methods: intuitive idea
  • Find a mapping φ such that, in the new space,
    problem solving is easier (e.g. linear)
  • The kernel represents the similarity between two
    objects (documents, terms, ...), defined as the
    dot product in this new vector space
  • But the mapping is left implicit
  • Easy generalization of many dot-product (or
    distance) based pattern recognition algorithms

21
Derivation of Kernel
22
Kernel Function
  • A kernel is a function K such that
    K(x, z) = ⟨φ(x), φ(z)⟩ for some mapping φ.
  • There are many possible kernels.
  • The simplest is the linear kernel, K(x, z) = ⟨x, z⟩.
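A small Python check of this definition (the degree-2 kernel and the explicit map φ below are standard textbook examples, not taken from these slides): the kernel value coincides with the dot product of explicitly mapped features.

```python
import numpy as np

def linear_kernel(x, z):
    """The simplest kernel: k(x, z) = <x, z>."""
    return float(x @ z)

def quad_kernel(x, z):
    """k(x, z) = <x, z>**2, which equals <phi(x), phi(z)> for the map below."""
    return float(x @ z) ** 2

def phi(x):
    """Explicit degree-2 feature map for 2-d input: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(linear_kernel(x, z))                  # 1.0
print(quad_kernel(x, z), phi(x) @ phi(z))   # equal values: the kernel is an implicit dot product
```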

23
Kernel: more formal definition
  • A kernel k(x, y)
  • is a similarity measure
  • defined by an implicit mapping φ,
  • from the original space to a vector space
    (feature space)
  • such that k(x, y) = ⟨φ(x), φ(y)⟩
  • This similarity measure and the mapping include:
  • Invariance or other a priori knowledge
  • Simpler structure (linear representation of the
    data)
  • The class of functions the solution is taken from
  • Possibly infinite dimension (hypothesis space for
    learning)
  • but still computational efficiency when
    computing k(x, y)

24
Kernelization
  • A kernel K(x, y) = ⟨φ(x), φ(y)⟩ of this form is
    called a Mercer kernel.
  • Kernels were introduced in mathematics to solve
    integral equations.
  • Kernels measure similarity of inputs.

25
Brief Comments on Hilbert Spaces
  • A Hilbert space is a generalization of a finite-
    dimensional vector space with an inner product to
    possibly infinite dimension.
  • Most interesting infinite-dimensional vector
    spaces are function spaces.
  • Hilbert spaces are the simplest among such
    spaces.
  • Prime example: L² (the square-integrable
    functions)
  • Any continuous linear functional on a Hilbert
    space is given by an inner product with a vector
    (Riesz Representation Theorem).
  • A representation of a vector w.r.t. a fixed basis
    is called a Fourier expansion.

26
Making Kernels
  • The kernel function must be symmetric:
    k(x, z) = ⟨φ(x), φ(z)⟩ = ⟨φ(z), φ(x)⟩ = k(z, x),
  • and satisfy the inequality that follows from the
    Cauchy–Schwarz inequality:
    k(x, z)² ≤ k(x, x) k(z, z)

27
The Kernel Gram Matrix
  • With kernel-method-based learning, the sole
    information used from the training data set is the
    Kernel Gram Matrix, Kij = k(xi, xj)
  • If the kernel is valid, K is symmetric and
    positive semi-definite (all eigenvalues are
    non-negative), as checked in the sketch below
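A brief sketch of that check (the cubic kernel and random data are illustrative assumptions): build the Gram matrix and verify that its eigenvalues are non-negative up to numerical error.

```python
import numpy as np

def gram_matrix(X, kernel):
    """K[i, j] = kernel(x_i, x_j): the only training information a kernel method needs."""
    ell = len(X)
    K = np.empty((ell, ell))
    for i in range(ell):
        for j in range(ell):
            K[i, j] = kernel(X[i], X[j])
    return K

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 4))
K = gram_matrix(X, lambda a, b: (a @ b + 1.0) ** 3)   # cubic polynomial kernel (illustrative)

eigvals = np.linalg.eigvalsh(K)       # K is symmetric, so eigvalsh applies
print(np.all(eigvals >= -1e-10))      # valid kernel => no significantly negative eigenvalues
```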

28
Mercer's Theorem
  • Suppose X is compact (always true for finite
    examples).
  • Suppose K is a Mercer kernel.
  • Then it can be expanded, using eigenvalues λj and
    eigenfunctions ψj of K, as
    K(x, z) = Σj λj ψj(x) ψj(z)
  • Now, using the eigenfunctions and their span, we
    construct the feature space H.
  • H is called a Reproducing Kernel Hilbert Space
    (RKHS).

29
Characterization of Kernels
  • Prove that a symmetric, finitely positive
    semi-definite function is a kernel function.
  • K (the Gram matrix) is symmetric, so K = VΛVᵀ,
    where V is an orthogonal matrix and Λ is the
    diagonal matrix of eigenvalues λt.
  • Let φ(xi) = (√λt vti)t, the i-th row of V scaled by
    the square roots of the eigenvalues.

30
Characterization of Kernels
  • Then for any xi, xj:
    ⟨φ(xi), φ(xj)⟩ = Σt λt vti vtj = Kij
  • (positive semi-definiteness is also necessary)
  • Conversely, suppose there exists an eigenvector v
    of K with a negative eigenvalue λ < 0.
  • The point z = Σi vi φ(xi) would then satisfy
  • ⟨z, z⟩ = vᵀKv = λ < 0, a contradiction.

31
Reproducing Kernel Hilbert Spaces
  • Reproducing Kernel Hilbert Spaces (RKHS)
  • The Hilbert space L² is too big for our
    purpose, containing too many non-smooth
    functions. One approach to obtaining restricted,
    smooth spaces is the RKHS approach.
  • An RKHS is smaller than a general Hilbert space.
  • Define the reproducing kernel map φ : x ↦ k(·, x)
  • (to each x we associate a function k(·, x))

32
Characterization of Kernels
  • We now define an inner product.
  • First construct a vector space containing all
    linear combinations of the functions k(·, x)
  • (this will be our RKHS)
  • Let f(·) = Σi αi k(·, xi) and g(·) = Σj βj k(·, zj),
    and define ⟨f, g⟩ = Σi Σj αi βj k(xi, zj)
  • Prove that this is an inner product on the RKHS.

33
Characterization of Kernels
  • Symmetry is obvious, and linearity is easy to
    show.
  • Prove: ⟨f, f⟩ = 0 ⟹ f = 0.
  • We say k is the representer of evaluation:
  • ⟨k(·, x), f⟩ = f(x)
  • (the reproducing property)
  • So f(x)² = ⟨k(·, x), f⟩² ≤ k(x, x) ⟨f, f⟩
  • (from Cauchy–Schwarz).
  • Hence ⟨f, f⟩ = 0 ⟹ f = 0, and this is our RKHS.

34
Characterization of Kernels
  • Formal definition:
  • For a compact subset X of Rᵈ and a Hilbert space H
    of functions f : X → R, we say that H is a
    reproducing kernel Hilbert space if there exists
    k : X² → R such that
  • 1. k has the reproducing property
    (⟨k(·, x), f⟩ = f(x))
  • 2. k spans H (span{k(·, x) : x ∈ X} = H)

35
Popular Kernels based on vectors
By Hilbert–Schmidt kernels (Courant and Hilbert,
1953), K(x, y) = Σi λi φi(x) φi(y)
for certain λi and φi, e.g.
36
Examples of Kernels
37
How to build new kernels
  • Kernel combinations, preserving validity (see the
    sketch below)
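A sketch of some validity-preserving combinations (the base kernels and constants below are illustrative, and the list is not exhaustive): sums, positive scalings, products, and exponentials of valid kernels are again valid kernels.

```python
import numpy as np

def gaussian(a, b, sigma=1.0):
    """Gaussian kernel k1(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def poly(a, b, d=2, c=1.0):
    """Polynomial kernel k2(a, b) = (<a, b> + c)^d."""
    return (a @ b + c) ** d

# Validity-preserving combinations of the valid kernels k1, k2 above:
def k_sum(a, b):     return gaussian(a, b) + poly(a, b)   # k1 + k2
def k_scaled(a, b):  return 0.5 * gaussian(a, b)          # c * k1 with c > 0
def k_product(a, b): return gaussian(a, b) * poly(a, b)   # k1 * k2
def k_exp(a, b):     return np.exp(gaussian(a, b))        # exp(k1)

x, z = np.ones(3), np.array([0.0, 1.0, 2.0])
print(k_sum(x, z), k_scaled(x, z), k_product(x, z), k_exp(x, z))
```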

38
Important Points
  • Kernel method =
    linear method + embedding in feature space.
  • Kernel functions are used to do the embedding
    efficiently.
  • The feature space is higher dimensional, so we
    must regularize.
  • Choose a kernel appropriate to the domain.

39
Principal Component Analysis (PCA)
  • Subtract the mean (centers the data).
  • Compute the covariance matrix S.
  • Compute the eigenvectors of S, sort them
    according to their eigenvalues, and keep the first
    M vectors.
  • Project the data points onto those vectors (see
    the sketch below).
  • Also called the Karhunen–Loève transformation.
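A NumPy sketch following exactly these steps, on illustrative random data (the function name and data are assumptions for the example):

```python
import numpy as np

def pca(X, M):
    """PCA: center, covariance, eigendecompose, keep M components, project."""
    Xc = X - X.mean(axis=0)                    # subtract the mean
    S = np.cov(Xc, rowvar=False)               # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)       # eigenvectors of S
    order = np.argsort(eigvals)[::-1][:M]      # sort by eigenvalue, keep the first M
    W = eigvecs[:, order]
    return Xc @ W                              # project the data onto those vectors

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
print(pca(X, M=2).shape)                       # (100, 2)
```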

40
Kernel PCA
  • Principal Component Analysis (PCA) is one of the
    fundamental techniques in a wide range of areas.
  • Simply stated, PCA diagonalizes (or finds the
    singular value decomposition (SVD) of) the
    covariance matrix.
  • Equivalently, we may find the SVD of the data
    matrix.
  • Instead of PCA in the original input space, we
    may perform PCA in the feature space. This is
    called Kernel PCA.
  • Find eigenvalues and eigenvectors of the Gram
    matrix.
  • For many applications, we need to find online
    algorithms, i.e., algorithms that do not need to
    store the Gram matrix.

41
PCA in dot-product form
  • Assume we have centered observations, column
    vectors xi (centered means Σi xi = 0)
  • PCA finds the principal axes by diagonalizing the
    covariance matrix C via the eigenproblem

λv = Cv    (eigenvalue λ, eigenvector v)          (1)

C = (1/ℓ) Σj xj xjᵀ    (covariance matrix)        (2)
42
PCA in dot-product form
Substituting equation (2) into (1), we get

λv = (1/ℓ) Σj (xj · v) xj                          (3)

Thus, since (xj · v) is just a scalar,

v = (1/(λℓ)) Σj (xj · v) xj                        (4)

so all solutions v with λ ≠ 0 lie in the span of
x1, x2, ..., xℓ, i.e.

v = Σi αi xi                                       (5)
43
Kernel PCA algorithm
  • If we do PCA in feature space, the covariance
    matrix is

C̄ = (1/ℓ) Σj Φ(xj) Φ(xj)ᵀ                          (6)

which can be diagonalized with nonnegative
eigenvalues satisfying

λV = C̄V                                            (7)

and we have shown that V lies in the span of the
Φ(xi), so we have

V = Σi αi Φ(xi)                                     (8)
44
Kernel PCA
  • Applying the kernel trick, we have
    K(xi, xj) = ⟨Φ(xi), Φ(xj)⟩                      (9)

and we can finally write the expression as the
eigenvalue problem

Kα = λα                                            (10)
45
Kernel PCA algorithm outline
  • Given a set of m-dimensional data {xk}, calculate
    K, for example with the Gaussian kernel
    K(xi, xj) = exp(−‖xi − xj‖² / d).
  • Carry out centering in feature space.
  • Solve the eigenvalue problem Kα = λα.
  • For a test pattern x, we extract a nonlinear
    component via the projection onto the k-th
    eigenvector Vᵏ:

⟨Vᵏ, Φ(x)⟩ = Σi αiᵏ K(xi, x)                        (11)
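A compact NumPy sketch of this outline (the Gaussian width, toy data, and the eigenvector normalization step are illustrative assumptions): compute the Gaussian kernel matrix, center it in feature space, solve Kα = λα, and project.

```python
import numpy as np

def kernel_pca(X, n_components, width=1.0):
    """Kernel PCA sketch: Gaussian kernel matrix, centering in feature space,
    eigenproblem K alpha = lambda alpha, then projection of the training data."""
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / width)   # Gaussian kernel matrix

    ell = len(X)
    one = np.ones((ell, ell)) / ell
    Kc = K - one @ K - K @ one + one @ K @ one       # centering in feature space

    lam, alpha = np.linalg.eigh(Kc)                  # eigenvalues/eigenvectors of Kc
    order = np.argsort(lam)[::-1][:n_components]
    lam, alpha = lam[order], alpha[:, order]
    alpha = alpha / np.sqrt(np.maximum(lam, 1e-12))  # normalize so <V^k, V^k> = 1

    return Kc @ alpha                                # nonlinear components of the data

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
print(kernel_pca(X, n_components=2).shape)           # (50, 2)
```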
46
Stability of Kernel Algorithms
Our objective for learning is to improve
generalization performance: cross-validation,
Bayesian methods, generalization bounds, ...
Suppose we find a pattern in a sample S. Is this
pattern also likely to be present in new data?
We can use concentration inequalities (McDiarmid's
theorem) to prove the following.
Theorem: Let S = {x1, ..., xℓ} be an i.i.d. sample
from P and define the sample mean of f(x) as
ÊS[f] = (1/ℓ) Σi f(xi); then it follows that
the probability that the sample mean and the
population mean differ by less than ε is more than
1 − δ, independent of P! (One standard form of the
bound is written out below.)
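One standard form of this result, assuming f takes values in a bounded interval [a, b] (the exact constant below is Hoeffding's/McDiarmid's, given here as an assumption about the slide's missing formula):

```latex
% Sample mean of f over an i.i.d. sample S = (x_1, ..., x_\ell) drawn from P:
\hat{\mathbb{E}}_S[f] \;=\; \frac{1}{\ell}\sum_{i=1}^{\ell} f(x_i)
% If f(x) \in [a, b], then with probability at least 1 - \delta over the draw of S,
\bigl|\, \hat{\mathbb{E}}_S[f] - \mathbb{E}_{x\sim P}[f(x)] \,\bigr|
  \;\le\; (b - a)\,\sqrt{\frac{\ln(2/\delta)}{2\ell}},
% independent of the distribution P.
```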
47
Rademacher Complexity
Problem: we only checked the generalization
performance for a single fixed pattern f(x).
What if we want to search over a function class F?
Intuition: we need to incorporate the complexity of
this function class.
Rademacher complexity captures the ability of the
function class to fit random noise (σi uniformly
distributed on {−1, +1}); the empirical Rademacher
complexity is written out below.
[Figure: functions f1 and f2 from the class fitted
to random ±1 noise at the sample points xi
(empirical RC)]
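The empirical Rademacher complexity referred to here, in the form used by Shawe-Taylor and Cristianini (stated from the standard definition, as an assumption about the slide's formula):

```latex
% Empirical Rademacher complexity of a class F on the sample S = (x_1, ..., x_\ell),
% with sigma_i independent and uniform on {-1, +1} ("random noise"):
\hat{R}_\ell(F) \;=\; \mathbb{E}_{\sigma}\!\left[\,
  \sup_{f \in F} \left| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \right|
\,\right]
```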
48
Generalization Bound
Theorem: Let F be a class of functions mapping to
[0, 1] (e.g. loss functions). Then, with probability
at least 1 − δ over random draws of a sample of size
ℓ, every f in F satisfies
E[f] ≤ ÊS[f] + R̂ℓ(F) + a term of order √(ln(1/δ)/ℓ).
Relevance: the expected pattern E[f] ≈ 0 will also
be present in a new data set
if the last two terms are small:
- the complexity of the function class F is small
- the number of training data is large
49
Linear Functions (in feature space)
Consider the function class
FB = { x ↦ ⟨w, φ(x)⟩ : ‖w‖ ≤ B }
and a sample S = {x1, ..., xℓ}.
Then the empirical RC of FB is bounded as written
out below.
Relevance: since the bound depends only on B and on
the kernel values κ(xi, xi), it follows that if we
control the norm in kernel algorithms, we control
the complexity of the function class
(regularization).
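A standard form of the bound on the empirical Rademacher complexity of FB (stated from Shawe-Taylor and Cristianini's result, as an assumption about what the slide showed):

```latex
% Empirical Rademacher complexity of F_B = { x -> <w, phi(x)> : ||w|| <= B }:
\hat{R}_\ell(F_B) \;\le\; \frac{2B}{\ell}\,\sqrt{\sum_{i=1}^{\ell} \kappa(x_i, x_i)}
```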
50
Margin Bound (classification)
Theorem: Choose c > 0 (the margin). Consider the
class F = { f(x, y) = −y g(x) }, labels y ∈ {+1, −1},
a sample S of size ℓ, and δ ∈ (0, 1), the probability
of violating the bound. Then, with probability at
least 1 − δ, the probability of misclassification on
new data is bounded in terms of the margin violations
on S, the complexity of the class, and ℓ.
Relevance: we bound our classification error on new
samples. Moreover, we have a strategy to improve
generalization: choose the margin c as large as
possible such that all samples are correctly
classified (e.g. support vector machines).
51
Next Part
  • Constructing Kernels
  • Kernels for Text
  • Vector space kernels
  • Kernels for Structured Data
  • Subsequences kernels
  • Trie-based kernels
  • Kernels from Generative Models
  • P-kernels
  • Fisher kernels