Support Vector Machines

Transcript and Presenter's Notes
1
Support Vector Machines

Session 8
Dr. N.B. Venkateswarlu AITAM, Tekkali
2
Overview
  • Background
  • Linear Classifier
  • SVM
  • Margin
  • Non-Linear SVM
  • Kernel Functions
  • Java Demo Applets

3
Some Background
  • In the machine learning context, a vector looks
    like this: x = (x1, x2, ..., xn)
  • Each attribute is a dimension of the vector.
  • The inner product or dot product is defined as
    ⟨x, y⟩ = Σi xi yi
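As a concrete illustration (not part of the original slides), here is a minimal sketch of the dot product using NumPy; the vectors are made-up values.

```python
import numpy as np

# Two examples as vectors; each attribute is one dimension.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Inner (dot) product: sum of element-wise products.
print(np.dot(x, y))                      # 1*4 + 2*5 + 3*6 = 32.0
print(sum(a * b for a, b in zip(x, y)))  # same thing, written out
```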

4
Linearly Separable Classes
5
(No Transcript)
6
Separating Planes
7
Linear Classifiers
x → f(x, w, b) → y_est
f(x, w, b) = sign(w·x + b)
denotes +1    denotes -1
w·x + b > 0
w·x + b = 0
w·x + b < 0
How would you classify this data?
8
Linear Classifiers
x → f(x, w, b) → y_est
f(x, w, b) = sign(w·x + b)
denotes +1    denotes -1
How would you classify this data?
9
Linear Classifiers
x → f(x, w, b) → y_est
f(x, w, b) = sign(w·x + b)
denotes +1    denotes -1
How would you classify this data?
10
Linear Classifiers
x → f(x, w, b) → y_est
f(x, w, b) = sign(w·x + b)
denotes +1    denotes -1
Any of these would be fine... but which is best?
11
Linear Classifiers
x → f(x, w, b) → y_est
f(x, w, b) = sign(w·x + b)
denotes +1    denotes -1
How would you classify this data?
Misclassified to +1 class
12
Classifier Margin
x → f(x, w, b) → y_est
f(x, w, b) = sign(w·x + b)
denotes +1    denotes -1
Define the margin of a linear classifier as the
width that the boundary could be increased by
before hitting a datapoint.
13
Some Background: The Perceptron
  • Goal: find a plane in the n-dimensional input space
    that classifies the data.
  • The classifier trained has the form y = m·x + b, where
    y is the class label predicted by the perceptron,
    x is the example (instance, vector) to be
    classified, m is the weight vector and b is
    the offset.
  • Main Idea: each attribute is assigned a weight,
    negative, zero or positive. The sum of all these
    weights multiplied by the instance values for the
    attributes is (more or less) the class tag. The
    final decision rule is:
  • If yi > 0 then class positive; if yi < 0 then
    class negative.
  • So we can also write y = sign(m·x + b)
    (see the sketch below).
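A minimal sketch of this decision rule in Python with NumPy; the weight vector m and offset b below are made-up values for illustration, not a trained model.

```python
import numpy as np

def perceptron_predict(x, m, b):
    """Perceptron decision rule: weighted sum of the attributes plus
    the offset, thresholded at zero."""
    y = np.dot(m, x) + b
    return +1 if y > 0 else -1

# Hypothetical weights and offset for a 3-attribute example.
m = np.array([0.4, -0.2, 0.1])
b = -0.05
print(perceptron_predict(np.array([1.0, 0.5, 2.0]), m, b))  # prints +1 or -1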

14
Non-Linearly Separable Classes
15
Probable Misclassifications
16
SVMs
  • Support Vector Machines
  • To summarize: an SVM finds a hyperplane
    separating the training set in a feature space
    induced by a kernel function, which is used as the inner
    product in the algorithm.
  • The solution of the margin optimization process
    is sparse in α, which means that only a few
    examples are effectively used in the classifier.
    These examples are the closest to the classifying
    boundary, so they SUPPORT this hyperplane. The
    vectors support the classifier, hence the name Support
    Vector Machine.

17
Maximum Margin
x → f(x, w, b) → y_est
  • Maximizing the margin is good according to
    intuition and PAC theory
  • Implies that only support vectors are important;
    other training examples are ignorable.
  • Empirically it works very very well.

f(x, w, b) = sign(w·x + b)
denotes +1    denotes -1
The maximum margin linear classifier is the
linear classifier with the, um, maximum
margin. This is the simplest kind of SVM (Called
an LSVM)
Support Vectors are those datapoints that the
margin pushes up against
Linear SVM
18
Linear SVM Mathematically
x+
M = Margin Width
Predict Class = +1 zone
x-
w·x + b = +1
Predict Class = -1 zone
w·x + b = 0
w·x + b = -1
  • What we know:
  • w · x+ + b = +1
  • w · x- + b = -1
  • w · (x+ - x-) = 2
  • so M = 2 / ||w||

19
Linear SVM Mathematically
  • Goal: 1) Correctly classify all training data:
  • w·xi + b ≥ +1  if yi = +1
  • w·xi + b ≤ -1  if yi = -1
  • i.e. yi(w·xi + b) ≥ 1 for all i
  • 2) Maximize the margin M = 2 / ||w||,
    which is the same as minimizing ½ wᵀw
  • We can formulate a Quadratic Optimization Problem
    and solve for w and b:
  • Minimize ½ wᵀw
  • subject to yi(w·xi + b) ≥ 1 for all i

20
Example of linear SVM
Support vectors
margin
21
Support Vectors
22
Solving the Optimization Problem
Find w and b such that Φ(w) = ½ wᵀw is minimized
and for all (xi, yi): yi(wᵀxi + b) ≥ 1
  • Need to optimize a quadratic function subject to
    linear constraints.
  • Quadratic optimization problems are a well-known
    class of mathematical programming problems, and
    many (rather intricate) algorithms exist for
    solving them.
  • The solution involves constructing a dual problem
    where a Lagrange multiplier αi is associated with
    every constraint in the primal problem

Find α1...αN such that
Q(α) = Σαi - ½ ΣΣ αiαj yiyj xiᵀxj
is maximized and (1) Σαiyi = 0, (2) αi ≥ 0 for all αi
23
The Optimization Problem Solution
  • The solution has the form
  • Each non-zero αi indicates that the corresponding xi
    is a support vector.
  • Then the classifying function will have the form
  • Notice that it relies on an inner product between
    the test point x and the support vectors xi; we
    will return to this later.
  • Also keep in mind that solving the optimization
    problem involved computing the inner products
    xiᵀxj between all pairs of training points.

w = Σαiyixi    b = yk - wᵀxk for any xk
such that αk ≠ 0
f(x) = Σαiyi xiᵀx + b
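A hedged sketch using scikit-learn (an assumption; the slides themselves mention Java applets and SVM_light) showing that the expressions above can be recovered from a fitted linear SVM: dual_coef_ stores αi·yi for the support vectors, so w and the decision function can be rebuilt from them. The toy data are made up.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D, linearly separable data (made up for illustration).
X = np.array([[1, 1], [2, 2.5], [3, 1.5], [6, 5], [7, 7], [8, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1e3).fit(X, y)   # large C approximates a hard margin

# dual_coef_ holds alpha_i * y_i for the support vectors only,
# so w = sum_i alpha_i y_i x_i can be rebuilt from them:
w = clf.dual_coef_ @ clf.support_vectors_
b = clf.intercept_

# The library's decision function should match f(x) = w.x + b.
print(np.allclose(clf.decision_function(X), X @ w.ravel() + b))  # True
```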
24
Dataset with noise
  • Hard Margin: so far we have required all data points to be
    classified correctly
  • - No training error
  • What if the training set is noisy?
  • - Solution 1: use very powerful kernels

OVERFITTING!
25
Soft Margin Classification
Slack variables ξi can be added to allow
misclassification of difficult or noisy examples.
What should our quadratic optimization criterion
be? Minimize ½ wᵀw + C Σ ξi
26
Hard Margin vs. Soft Margin
  • The old formulation
  • The new formulation incorporating slack
    variables
  • Parameter C can be viewed as a way to control
    overfitting.

Find w and b such that Φ(w) = ½ wᵀw is minimized
and for all (xi, yi): yi(wᵀxi + b) ≥ 1
Find w and b such that Φ(w) = ½ wᵀw + C Σξi is
minimized and for all (xi, yi): yi(wᵀxi + b)
≥ 1 - ξi and ξi ≥ 0 for all i
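As a hedged illustration of the role of C (assuming scikit-learn; the noisy data below are made up): a small C tolerates more slack (softer margin), while a large C approaches the hard-margin formulation and fits the noise more closely.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy, overlapping blobs.
X = np.vstack([rng.normal(0, 1.2, (50, 2)), rng.normal(2, 1.2, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # Small C: wide, soft margin with many support vectors;
    # large C: narrow margin that tracks the noisy points.
    print(C, 'support vectors:', clf.support_vectors_.shape[0])
```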
27
Linear SVMs Overview
  • The classifier is a separating hyperplane.
  • The most important training points are the support
    vectors; they define the hyperplane.
  • Quadratic optimization algorithms can identify
    which training points xi are support vectors with
    non-zero Lagrange multipliers αi.
  • Both in the dual formulation of the problem and
    in the solution, training points appear only
    inside dot products:

Find α1...αN such that
Q(α) = Σαi - ½ ΣΣ αiαj yiyj xiᵀxj
is maximized and (1) Σαiyi = 0, (2) 0 ≤ αi ≤ C for all αi
f(x) = Σαiyi xiᵀx + b
28
Linear SVM for non-separable data
29
Non-linear SVM for linearly non-separable data
30
Non-linear SVMs
  • Datasets that are linearly separable with some
    noise work out great
  • But what are we going to do if the dataset is
    just too hard?
  • How about mapping the data to a
    higher-dimensional space?

31
Non-linear SVMs Feature spaces
  • General idea: the original input space can
    always be mapped to some higher-dimensional
    feature space where the training set is separable

Φ: x → φ(x)
32
Kernel methods approach
  • The kernel methods approach is to stick with
    linear functions but work in a high dimensional
    feature space
  • The expectation is that the feature space has a
    much higher dimension than the input space.

33
Kernel methods
(Block diagram: data, kernel, subspace, pattern analysis algorithm, identified pattern)
34
Mapping
35
Mapping is known as Kernel Functions
  • Here is a training set in the input space I and
    the same training set in the feature space F. We go
    from a 2D space to a 3D space.

The training set is not linearly separable in the
input space.
The training set is linearly separable in the
feature space. This is called the Kernel Trick.
36
Mapping is known as Kernel Functions
  • Here is a training set in the input space I and
    the same training set in the feature space F. We go
    from a 2D space to a 3D space.

The training set is not linearly separable in the
input space.
37
Mapping is known as Kernel Functions
The training set is not linearly separable in the
input space.
38
(No Transcript)
39
Classes separable after mapping
40
Example
  • Consider the mapping
  • If we consider a linear equation in this feature
    space
  • We actually have an ellipse, i.e. a non-linear
    shape, in the input space.

41
Capacity of feature spaces
  • The capacity is proportional to the dimension;
    for example:
  • 2-dim:

42
  • If data are mapped into a space of sufficiently
    high dimension, they will always be linearly
    separable (N data points in N-1 dimensions or
    more)
  • Problem: a linear separator in a space of d
    dimensions has d parameters, hence a risk of
    overfitting
  • Reason for maximal margin/optimal separator

43
Form of the functions
  • So kernel methods use linear functions in a
    feature space
  • For regression this could be the function
    f(x) = ⟨w, φ(x)⟩ + b
  • For classification, thresholding is required:
    f(x) = sign(⟨w, φ(x)⟩ + b)

44
Problems of high dimensions
  • Capacity may easily become too large and lead to
    overfitting: being able to realise every
    classifier means the model is unlikely to generalise well
  • Computational costs involved in dealing with
    large vectors

45
Kernel Functions
  • To make the data linearly separable we could:
  • Project the data from the input space to a new
    space called the feature space
  • This feature space has more dimensions than
    the input space, so we can separate the data THERE
  • using the normal Adatron (linear SVM)

46
Example of polynomial kernel.
  • d-th degree polynomial:
  • K(x, x') = (1 + ⟨x, x'⟩)^d
  • For a feature space with two inputs x1, x2 and
  • a polynomial kernel of degree 2:
  • K(x, x') = (1 + ⟨x, x'⟩)²
  • Let h(x) = (1, √2·x1, √2·x2, x1², x2², √2·x1x2),
  • then K(x, x') = ⟨h(x), h(x')⟩.

47
Kernel Functions
  • Let's use an example projection:
    φ(x1, x2) = (x1², √2·x1x2, x2²)
  • The inner product of two vectors x and y
    projected into the space F becomes
    ⟨φ(x), φ(y)⟩ = (x·y)²

48
(No Transcript)
49
Kernel Functions
  • But what happens if, instead of projecting the
    data, we just compute K(x, y) = (⟨x, y⟩)²?
  • This means taking the input-space inner
    product and squaring it.
  • It actually equals the feature-space inner
    product!!!

This is much less time consuming as we are
implicitly projecting the training set.
50
Kernel Functions
  • So we could define a kernel function as follows:
    it is the function that represents the inner
    product of some space in ANOTHER space.
  • Some spaces are known only by their kernel
    function
  • (i.e. their projection is UNKNOWN)
  • It's the case with these kernel functions:
  • Gaussian RBF kernel
  • Sigmoid kernel
  • An experimental kernel called KMOD

51
(No Transcript)
52
Radial Basis Functions
53
Sigmoidal Function
54
The Kernel Trick
  • The linear classifier relies on the dot product
    between vectors: K(xi, xj) = xiᵀxj
  • If every data point is mapped into a
    high-dimensional space via some transformation
    Φ: x → φ(x), the dot product becomes
  • K(xi, xj) = φ(xi)ᵀφ(xj)
  • A kernel function is some function that
    corresponds to an inner product in some expanded
    feature space.
  • Example
  • 2-dimensional vectors x = [x1, x2]; let
    K(xi, xj) = (1 + xiᵀxj)²
  • Need to show that K(xi, xj) = φ(xi)ᵀφ(xj):
  • K(xi, xj) = (1 + xiᵀxj)²
    = 1 + xi1²xj1² + 2 xi1xj1 xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
    = [1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2]ᵀ [1, xj1², √2 xj1xj2, xj2², √2 xj1, √2 xj2]
    = φ(xi)ᵀφ(xj), where φ(x) = [1, x1², √2 x1x2, x2², √2 x1, √2 x2]

55
What Functions are Kernels?
  • For some functions K(xi, xj), checking that
  • K(xi, xj) = φ(xi)ᵀφ(xj) can be
    cumbersome.
  • Mercer's theorem:
  • Every positive semi-definite symmetric function
    is a kernel
  • Positive semi-definite symmetric functions
    correspond to a positive semi-definite symmetric
    Gram matrix

K = [K(xi, xj)]  (the Gram matrix of all pairwise kernel values)
56
Examples of Kernel Functions
  • Linear: K(xi, xj) = xiᵀxj
  • Polynomial of power p: K(xi, xj) = (1 + xiᵀxj)^p
  • Gaussian (radial-basis function network):
    K(xi, xj) = exp(-||xi - xj||² / (2σ²))
  • Sigmoid: K(xi, xj) = tanh(β0 xiᵀxj + β1)
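A minimal sketch of these four kernels as plain NumPy functions; the parameter values (p, σ, β0, β1) are illustrative defaults, not values from the slides.

```python
import numpy as np

def linear_kernel(xi, xj):
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, p=2):
    return (1 + np.dot(xi, xj)) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=-1.0):
    return np.tanh(beta0 * np.dot(xi, xj) + beta1)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(xi, xj), polynomial_kernel(xi, xj),
      gaussian_kernel(xi, xj), sigmoid_kernel(xi, xj))
```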

57
Non-linear SVMs Mathematically
  • Dual problem formulation
  • The solution is
  • Optimization techniques for finding the αi remain
    the same!

Find α1...αN such that
Q(α) = Σαi - ½ ΣΣ αiαj yiyj K(xi, xj)
is maximized and (1) Σαiyi = 0, (2) αi ≥ 0 for all αi
f(x) = Σαiyi K(xi, x) + b
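A hedged sketch, assuming scikit-learn and synthetic data, that evaluates f(x) = Σ αi yi K(xi, x) + b directly from the dual coefficients and support vectors of a fitted RBF SVM and checks it against the library's own decision function.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)        # XOR-like labels

clf = SVC(kernel='rbf', gamma=0.5, C=1.0).fit(X, y)

# f(x) = sum_i alpha_i y_i K(x_i, x) + b, using only the support vectors.
X_test = rng.normal(size=(5, 2))
K_sv = rbf_kernel(clf.support_vectors_, X_test, gamma=0.5)   # K(x_i, x)
f = clf.dual_coef_ @ K_sv + clf.intercept_
print(np.allclose(f.ravel(), clf.decision_function(X_test)))  # True
```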
58
Nonlinear SVM - Overview
  • The SVM locates a separating hyperplane in the
    feature space and classifies points in that space
  • It does not need to represent the space
    explicitly; it simply defines a kernel function
  • The kernel function plays the role of the dot
    product in the feature space.

59
Properties of SVM
  • Flexibility in choosing a similarity function
  • Sparseness of solution when dealing with large
    data sets
  • - only support vectors are used to specify
    the separating hyperplane
  • Ability to handle large feature spaces
  • - complexity does not depend on the
    dimensionality of the feature space
  • Overfitting can be controlled by the soft-margin
    approach
  • Nice math property: a simple convex optimization
    problem which is guaranteed to converge to a
    single global solution
  • Feature Selection

60
SVM Applications
  • SVM has been used successfully in many real-world
    problems
  • - text (and hypertext) categorization
  • - image classification
  • - bioinformatics (Protein classification,
  • Cancer classification)
  • - hand-written character recognition

61
Application 1 Cancer Classification
  • High dimensional
  • - p > 1000, n < 100
  • Imbalanced
  • - fewer positive samples
  • Many irrelevant features
  • Noisy

FEATURE SELECTION: in the linear case, wi² gives
the ranking of dimension i
SVM is sensitive to noisy (mis-labeled) data
62 - 75
(No Transcript)
76
SVMs
  • There are many ways to implement this
    optimization process.
  • The Kernel-Adatron is one. It is the simplest, since
    it is derived from the very well known perceptron.
  • It is simple but the slowest (awfully, painfully
    slow), since it passes through all the examples
    MANY times (many epochs).
  • The first approach used was Quadratic Programming,
    since this optimization problem is quadratic.
  • Subject to the complex quadratic programming
    theory
  • Many numerical issues due to the method
  • An entire QP matrix for a sparse solution
  • Rather slow as well
  • Chunking chops the QP matrix into smaller chunks to
    gain speed from the sparseness. It does work, but
    sometimes the improvement is insignificant.
  • SMO stands for Sequential Minimal Optimization.
    It is chunking taken to its finest grain. Using heavy
    heuristics, points are optimized in pairs.
  • Very fast
  • Well documented; see John Platt's papers (search on Google)

77
Kernel function details
  • The first kernel shown is called the polynomial
    kernel: K(x, x') = (⟨x, x'⟩ + b)^n

where n is its order (in our example n = 2) and b
is called the lower-order term (in our example
b = 0).
  • Why do we need a lower-order term?
  • Because, without it, the origin of the input space maps to the
    origin of the feature space induced by the
    polynomial kernel.
  • It serves as an offset or a shift for the origin
    of the feature space from the input space origin.
  • When given the choice, it is strongly suggested
    that you use it.

78
Kernel function details
  • The most popular (and most powerful) kernel function
    is the Gaussian RBF kernel:
    K(x, x') = exp(-||x - x'||² / (2σ²))

A powerful kernel, as its effect is to create a
small classification "hyperball" around an
instance. This kernel doesn't have a projection
formula, since its dimension is infinite (you can
create as many balls as you want).
σ is a measure of the radius of the
hyperball around an instance. You want this
ball to be big enough that hyperballs connect
with each other (pattern recognition) but not so
big that they overlap the other class.
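A sketch of this σ trade-off, assuming scikit-learn and the kernel form above (so its gamma parameter corresponds to 1/(2σ²)); the data are synthetic and the exact accuracies will vary.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X_train = rng.normal(size=(60, 2))
y_train = np.where(X_train.sum(axis=1) + rng.normal(0, 0.8, 60) > 0, 1, -1)
X_test = rng.normal(size=(200, 2))
y_test = np.where(X_test.sum(axis=1) > 0, 1, -1)

for sigma in (0.01, 1.0, 10.0):
    gamma = 1.0 / (2 * sigma ** 2)          # scikit-learn uses gamma, not sigma
    clf = SVC(kernel='rbf', gamma=gamma, C=1.0).fit(X_train, y_train)
    print(f'sigma={sigma}: train acc={clf.score(X_train, y_train):.2f}, '
          f'test acc={clf.score(X_test, y_test):.2f}')
# A tiny sigma tends to memorise the training set (the "Christmas tree"
# effect described next) while generalising poorly on the test set.
```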
79
Kernel function details
  • The Gaussian RBF can overfit badly:
  • the "Christmas Tree" effect.
  • If you select σ too small for the distance
    between your examples, you will create small
    balls around all your instances.
  • You will get 100% accuracy on your training
    set. Wonderful!
  • But only your examples are classified (the balls are
    too small).
  • In my applet, enter a few examples manually
    (using the mouse). Make them sparse. Use the Gaussian
    RBF kernel with a significantly smaller σ than
    the proposed values. HERE is YOUR Christmas Tree:
  • your examples created small coloured balls, like
    Christmas tree balls.
  • The Christmas Tree effect is a joke. But your
    classifier is a joke too. No classification of
    new instances is possible with it.

80
Karush-Kuhn-Tucker conditions???
  • Just some conditions that are necessary for a
    solution in non-linear programming to be optimal

81
Common Kernels
  • Polynomial
  • Radial basis function
  • Sigmoid

82
Some RBFs
83
(No Transcript)
84
(No Transcript)
85
(No Transcript)
86
Soft-margin method
  • Modified maximum margin idea allows for
    mislabeled examples during training
  • Still creates a hyperplane that separates as
    cleanly as possible
  • The hyperplane maximizes the distance to the
    nearest cleanly split examples

87
SVM_light
  • SVM package
  • Allows you to choose type (linear vs non-linear)
    and/or Kernel function
  • Two executables: svm_learn and svm_classify

88
SVM_light
  • <class> <feature nr>:<feature value> ...
  • Example:
  • +1 1:0.7 2:0.3 3:0.5
  • -1 1:1.2 2:0.6 3:0.9

89
Mathematical background
  • The support vector (SV) machine is a new type of
    learning machine. It is based on statistical
    learning theory.

90
Objective
  • Use linear support vector machines (SVMs) to
    classify 2-D data.

91
Background
  • Suppose we want to find a decision function f
    with the property f(xi) = yi, ∀i:
  • yi(w·xi + b) ≥ 1, i = 1, ..., l    (1)
  •  
  • In practice, a separating hyperplane often does
    not exist. To allow for the possibility of
    examples violating (1), the slack variables
  • ξi ≥ 0, i = 1, ..., l    (2)
  • are introduced to get
  • yi(w·xi + b) ≥ 1 - ξi, i = 1, ..., l    (3)

92
Background(2)
The SV approach to minimizing the guaranteed risk
bound consists of the following.
Minimize (4) subject to the constraints
(2) and (3). Introducing Lagrange multipliers ?i
and using the Kuhn_Tucker theorem of optimization
theory, the solution can be shown to have an
expansion (5) with nonzero coefficients ?i
only where the corresponding example (xi, yi)
precisely meets the constraint (3). These xi are
called support vectors. All remaining examples of
the training set are irrelevant.
93
Background(3)
For those remaining examples, the constraint (3) is
satisfied automatically (with ξi = 0), and they do
not appear in the expansion (5). The coefficients
αi are found by solving the following quadratic
programming problem. Maximize
W(α) = Σαi - ½ ΣΣ αiαj yiyj (xi·xj)    (6)
subject to
0 ≤ αi ≤ C, i = 1, ..., l, and Σ αiyi = 0    (7)
By linearity of the dot product, the decision
function can be written as
f(x) = sign(Σ αiyi (x·xi) + b)    (8)
94
Background (4)
To allow for much more general decision surfaces,
one can first nonlinearly transform a set of
input vectors x1, ..., xl into a high-dimensional
feature space. The decision function becomes
f(x) = sign(Σ αiyi K(x, xi) + b)    (9)
where the RBF kernel is
K(x, xi) = exp(-||x - xi||² / (2σ²))    (10)
95
Principal Component Analysis
  • Given N data vectors in k dimensions, find c < k
    orthogonal vectors that can best be used to
    represent the data
  • The original data set is reduced to one
    consisting of N data vectors on c principal
    components (reduced dimensions)
  • Each data vector is a linear combination of the c
    principal component vectors
  • Works for numeric data only
  • Used when the number of dimensions is large
    (see the sketch below)
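A minimal sketch of this reduction, assuming scikit-learn's PCA and synthetic data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))          # N=100 data vectors in k=5 dimensions

pca = PCA(n_components=2)              # keep c=2 principal components
X_reduced = pca.fit_transform(X)       # each row is a linear combination
                                       # of the 2 principal component vectors
print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
```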

96
Principal Component Analysis
97
Principal Component Analysis
  • Aimed at finding a new co-ordinate system which has
    certain characteristics. Example:
  • Mean M = (4.5, 4.25)
  • Covariance matrix:
    2.57  1.86
    1.86  6.21
  • Eigenvalues: 6.99, 1.79
  • Eigenvectors:  0.387  0.922
                  -0.922  0.387

98
(No Transcript)
99
However, in some cases PCA does not work well.
100
Canonical Analysis
101
  • Unlike PCA, which uses the global mean and
    covariance, this uses the between-group and
    within-group covariance matrices and then
    calculates the canonical axes.

102
(No Transcript)
103
Underfitting and overfitting
104
Non-Separable Training Sets
105
SVMs work badly with outliers