Title: Support Vector Machines
1 Fast and Efficient Support Vector Machines for Multi-Class Pattern Classification Problems
Dr. Emad Ahmed El-Sebakhy
2 Presentation Outline
- Goals of Research
- What Pattern Classification Problems Mean
- Mathematical Description of Binary (Linear and Non-Linear) SVM Classifiers
- The Complexity and Running-Time Problems and the Suggested Techniques:
  - Conjugate gradient least squares algorithm
  - Simplifying the shape of the hyperplane by excluding some training samples that contribute to the convoluted hyperplane
- Multi-Class SVM Conjugate Least Squares Classifier
- Real-World Applications
- Simulation Study
- Conclusions and Future Work
3 Goals of Research
- To propose an efficient and fast SVM classifier for solving multi-class pattern classification (PC) problems.
- To assess the performance of the multi-class SVM classifier on real-world applications.
- To investigate the properties of the SVM classifier using simulation experiments.
- To compare the performance of the new SVM classifier to the seven most common classifiers in the statistics and computer science literature.
- To draw conclusions and recommendations.
4 What is a Support Vector Machine?
- SVM is a supervised learning algorithm.
- It uses concepts from computational learning theory.
- It is an optimally defined surface, typically nonlinear in the input space and linear in a higher-dimensional space.
- It is implicitly defined by a kernel function.
5 What is SVM used for?
- Regression and data-fitting
- Supervised and unsupervised learning
- Image Processing
- Speech Recognition
- Pattern Recognition
- Time-Series Analysis
- Adaptive Equalization
- Radar Point Source Location
- Medical Diagnosis
- Process Fault Detection
- Classification
6 Expected Risk and Structural Risk Minimization (SRM)
- Classification problem
- Decision functions
- Expected risk
- Ideal goal: find the decision function f(x) that minimizes the expected risk.
- Empirical risk minimization
- To avoid the over-fitting problem, use SRM or the minimum description length (MDL) principle, where the empirical risk is minimized together with a complexity penalty.
7 References
- Vapnik V. (1995), The Nature of Statistical Learning Theory, Springer.
- Vapnik V. (1998), Statistical Learning Theory, John Wiley, New York.
- Osuna E. and Girosi F. (1998), Reducing the run-time complexity of support vector machines, ICPR, Brisbane, Australia.
- Cristianini N. and Shawe-Taylor J. (2000), An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press.
- Burges C. J. C. (1998), A tutorial on support vector machines for pattern recognition, Knowledge Discovery and Data Mining, 2(2).
- Hsu C. and Lin C. (2002), A comparison of methods for multi-class support vector machines, IEEE Transactions on Neural Networks, 13, 415-425.
- Duan K., Keerthi S. S., and Poo A. N. (2003), Evaluation of simple performance measures for tuning SVM hyperparameters, Neurocomputing, 51, 41-59.
8 What's a Pattern Classification Problem?
Given a group of n objects with p predictors, drawn from c populations (c > 2), the goal of pattern classification is to construct a decision rule that classifies given objects into one and only one of the given populations.
Goal: determine a decision function (the classifier).
9 Pattern Classification Process
Training set → learning algorithm → building the classification model → model selection → testing and validation
10 Examples of Applications
11 Binary Linear Classifier
- If the training set is linearly separable, binary classification can be viewed as the task of separating the classes in feature space.
The decision function is f(x; w, b) = sign(wᵀx + b), with:
  wᵀx + b = 0  (separating hyperplane)
  wᵀx + b > 0  (one class)
  wᵀx + b < 0  (the other class)
12 Classification Margin
- The distance from an object x to the hyperplane is r = (wᵀx + b) / ||w||.
- Support vectors are the objects closest to the hyperplane.
- The margin ρ is the width by which the boundary could be increased before hitting a data point.
- The maximum margin linear classifier is the linear classifier with the maximum margin.
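To make the two quantities above concrete, here is a minimal MATLAB sketch (not from the original deck) that evaluates the signed distance r of a point to a hyperplane and the margin ρ of a canonical hyperplane; the values of w, b, and x are illustrative.

```matlab
% Minimal sketch (not the deck's code): distance of a point x to the
% hyperplane w'x + b = 0, and the margin of a canonical hyperplane.
dist_to_hyperplane = @(x, w, b) (w' * x + b) / norm(w);  % signed distance r
margin             = @(w) 2 / norm(w);                   % rho = 2 / ||w||

% Example usage with arbitrary numbers:
w = [2; -1]; b = -3; x = [1; 0.5];
r   = dist_to_hyperplane(x, w, b)
rho = margin(w)
```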
13 Linear SVM Classifier Mathematically
- Assuming all data are at least distance 1 from the hyperplane (canonical scaling), the following two constraints hold for a training set (xᵢ, yᵢ):
  wᵀxᵢ + b ≥ 1 if yᵢ = +1,  wᵀxᵢ + b ≤ −1 if yᵢ = −1
- For support vectors, the inequality becomes an equality; since each support vector's distance from the hyperplane is 1/||w||, the margin is ρ = 2/||w||.
Find w and b such that ρ = 2/||w|| is maximized and, for all (xᵢ, yᵢ), wᵀxᵢ + b ≥ 1 if yᵢ = +1 and wᵀxᵢ + b ≤ −1 if yᵢ = −1.
A better formulation:
Find w and b such that Φ(w) = ½ wᵀw is minimized subject to yᵢ(wᵀxᵢ + b) ≥ 1 for all (xᵢ, yᵢ).
14 Solving the Optimization Problem
- In most real applications we need to optimize a quadratic function subject to linear constraints. The solution involves constructing a dual problem in which a Lagrange multiplier αᵢ is associated with every constraint in the primal problem.
Lagrange formulation (primal form → dual form):
Find α₁, …, αₙ such that
  Q(α) = Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ xᵢᵀxⱼ
is maximized under the constraints
  (1) Σᵢ αᵢyᵢ = 0,  (2) αᵢ ≥ 0 for all αᵢ.
15 This is how to find a solution
The solution:
  w = Σᵢ αᵢyᵢxᵢ,  b = yₖ − wᵀxₖ for any xₖ such that αₖ ≠ 0
The classifying decision function is
  f(x) = sign(Σᵢ αᵢyᵢ xᵢᵀx + b)
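As a concrete illustration of this dual solution, here is a hedged MATLAB sketch (assuming the Optimization Toolbox function quadprog is available; this is not the author's code) that solves the hard-margin dual for a linearly separable set and then recovers w, b, and the decision rule exactly as above.

```matlab
% Hedged sketch (not the author's implementation): solve the hard-margin dual
% for a linearly separable training set X (n x p) with labels y (n x 1, +/-1),
% then recover w, b and the decision rule of this slide.
function [w, b, alpha] = linear_svm_dual(X, y)
    n = size(X, 1);
    H = (y * y') .* (X * X');              % H_ij = y_i y_j x_i' x_j
    f = -ones(n, 1);                       % maximize sum(a) - 0.5 a'Ha
    alpha = quadprog(H, f, [], [], y', 0, zeros(n, 1), []);
    w = X' * (alpha .* y);                 % w = sum_i alpha_i y_i x_i
    k = find(alpha > 1e-6, 1);             % index of any support vector
    b = y(k) - X(k, :) * w;                % b = y_k - w' x_k
end
% Classify new rows Xnew with: yhat = sign(Xnew * w + b)
```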
16 A Numerical Example
- Three inequalities: 1·w + b ≤ −1,  2·w + b ≥ +1,  3·w + b ≥ +1
- J = w²/2 − α₁(−w − b − 1) − α₂(2w + b − 1) − α₃(3w + b − 1)
- ∂J/∂w = 0 ⇒ w = −α₁ + 2α₂ + 3α₃
- ∂J/∂b = 0 ⇒ 0 = α₁ − α₂ − α₃
- The Kuhn-Tucker conditions imply:
  (a) α₁(−w − b − 1) = 0,  (b) α₂(2w + b − 1) = 0,  (c) α₃(3w + b − 1) = 0
- The solution (as we will see later) is α₁ = α₂ = 2 and α₃ = 0. This yields w = 2, b = −3.
- Hence the decision boundary (hyperplane) is 2x − 3 = 0, i.e. x = 1.5.
- This is shown as the dashed line in the figure.
17 (No Transcript)
18 (No Transcript)
19 Formulating the Dual Problem
- At the saddle point we have ∂L/∂w = 0 and ∂L/∂b = 0, i.e. w = Σᵢ αᵢyᵢxᵢ and Σᵢ αᵢyᵢ = 0; substituting these relations into the Lagrangian above gives the dual problem:
Maximize Q(α) = Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ xᵢᵀxⱼ subject to Σᵢ αᵢyᵢ = 0 and αᵢ ≥ 0 for i = 1, 2, …, n.
Note
20 Numerical Example (cont'd)
- Here Q(α) = α₁ + α₂ + α₃ − (0.5α₁² + 2α₂² + 4.5α₃² − 2α₁α₂ − 3α₁α₃ + 6α₂α₃),
- subject to the constraints −α₁ + α₂ + α₃ = 0 and
- α₁ ≥ 0, α₂ ≥ 0, α₃ ≥ 0.
- Use the Matlab Optimization Toolbox command
- x = fmincon(qalpha, X0, A, B, Aeq, Beq)
- The solution is (α₁, α₂, α₃) = (2, 2, 0), as expected (see the sketch below).
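A hedged sketch of how the fmincon call above could be set up for this example (the exact script is not in the deck; the non-negativity bounds are passed explicitly through A and B):

```matlab
% Hedged sketch of the slide-20 computation (not the original script).
% Maximizing Q(alpha) is done by minimizing -Q(alpha).
qalpha = @(a) -(a(1) + a(2) + a(3)) ...
         + 0.5*a(1)^2 + 2*a(2)^2 + 4.5*a(3)^2 ...
         - 2*a(1)*a(2) - 3*a(1)*a(3) + 6*a(2)*a(3);

X0  = zeros(3, 1);                    % starting point
A   = -eye(3);   B   = zeros(3, 1);   % -alpha_i <= 0, i.e. alpha_i >= 0
Aeq = [-1 1 1];  Beq = 0;             % sum_i y_i alpha_i = 0

alpha = fmincon(qalpha, X0, A, B, Aeq, Beq)   % expected: [2 2 0]'

% Recover w and b as in slide 16 (x = [1 2 3], y = [-1 1 1]):
x = [1 2 3]';  y = [-1 1 1]';
w = sum(alpha .* y .* x);                      % w = 2
k = find(alpha > 1e-6, 1);
b = y(k) - w * x(k)                            % b = -3, boundary x = 1.5
```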
21 Non-linear SVMs: Feature Spaces
- What if the training set is not linearly separable?
- The original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
  Φ: x → φ(x)
22 Implication of Minimizing ||w||
- Let D denote the diameter of the smallest hypersphere that encloses all the input training vectors x₁, x₂, …, xₙ.
- The set of optimal hyperplanes described by the equation
  w₀ᵀx + b₀ = 0
- has a Vapnik-Chervonenkis (VC) dimension h bounded from above as
  h ≤ min(⌈D²/ρ²⌉, p) + 1
- where p is the dimension of the input vectors and ρ = 2/||w₀|| is the margin of separation of the hyperplanes.
- The VC dimension determines the complexity of the classifier structure; usually, the smaller the better.
23 Non-separable Cases
- Recall that in the linearly separable case, each training sample pair (xᵢ, yᵢ) represents a linear inequality constraint
  yᵢ(wᵀxᵢ + b) ≥ 1,  i = 1, 2, …, n   (*)
- If the training samples are not linearly separable, the constraint can be modified to yield a soft constraint
  yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ,  i = 1, 2, …, n   (**)
- ξᵢ, 1 ≤ i ≤ n, are known as slack variables.
- Note that (*) is a normalized version of yᵢ f(xᵢ)/||w|| ≥ ρ. With the slack variable ξᵢ, that inequality becomes yᵢ f(xᵢ)/||w|| ≥ ρ(1 − ξᵢ). Hence, with the slack variables we allow some samples xᵢ to fall within the gap. Moreover, if ξᵢ > 1, then the corresponding (xᵢ, yᵢ) is misclassified because the sample falls on the wrong side of the hyperplane H.
24 Non-Separable Case
- Since ξᵢ > 1 implies misclassification, the cost function must include a term that minimizes the number of misclassified samples (those with ξᵢ > 1), weighted by a multiplier. But this formulation is non-convex, and a solution is difficult to find using existing nonlinear optimization algorithms.
- Hence, we may instead use an approximated cost function that penalizes Σᵢ ξᵢ.
- With this approximated cost function, the goal is to maximize ρ (minimize ||w||) while minimizing Σᵢ ξᵢ (ξᵢ ≥ 0):
  - ξᵢ is not counted if xᵢ is outside the gap and on the correct side.
  - 0 < ξᵢ < 1: xᵢ is inside the gap, but on the correct side.
  - ξᵢ > 1: xᵢ is on the wrong side (inside or outside the gap).
25 The Kernel Functions
- If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ).
- The linear classifier relies on the inner product between vectors, K(xᵢ, xⱼ) = xᵢᵀxⱼ.
- A kernel function is a function that corresponds to an inner product in some feature space.
- For a candidate K(xᵢ, xⱼ), one checks whether K(xᵢ, xⱼ) can be written in the form φ(xᵢ)ᵀφ(xⱼ) or not.
- Example: for two-dimensional vectors x = [x₁ x₂], let K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)².
- We need to show that K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ):
  K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)² = 1 + xᵢ₁²xⱼ₁² + 2xᵢ₁xⱼ₁xᵢ₂xⱼ₂ + xᵢ₂²xⱼ₂² + 2xᵢ₁xⱼ₁ + 2xᵢ₂xⱼ₂
  = [1, xᵢ₁², √2 xᵢ₁xᵢ₂, xᵢ₂², √2 xᵢ₁, √2 xᵢ₂]ᵀ [1, xⱼ₁², √2 xⱼ₁xⱼ₂, xⱼ₂², √2 xⱼ₁, √2 xⱼ₂]
  = φ(xᵢ)ᵀφ(xⱼ), where φ(x) = [1, x₁², √2 x₁x₂, x₂², √2 x₁, √2 x₂].
Which kernels are feasible? A kernel must satisfy Mercer's theorem, i.e. the kernel (Gram) matrix must be positive semi-definite.
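The expansion above can also be checked numerically; a small MATLAB sketch (not part of the original slides, test points chosen arbitrarily):

```matlab
% Hedged sketch: check numerically that (1 + xi'*xj)^2 = phi(xi)'*phi(xj)
% for the quadratic feature map written on this slide.
phi = @(x) [1; x(1)^2; sqrt(2)*x(1)*x(2); x(2)^2; sqrt(2)*x(1); sqrt(2)*x(2)];
K   = @(xi, xj) (1 + xi' * xj)^2;

xi = [0.3; -1.2];  xj = [2.0; 0.5];      % arbitrary test points
abs(K(xi, xj) - phi(xi)' * phi(xj))      % ~0, up to round-off
```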
26 Numerical Example: the XOR Problem
x = [x₁, x₂]ᵀ. Using K(x, y) = (1 + xᵀy)², one has
  φ(x) = [1, x₁², x₂², √2 x₁, √2 x₂, √2 x₁x₂]ᵀ
Note: dim φ(x) = 6 > dim x = 2. The dimension of the kernel matrix K equals Nₛ, the number of support vectors.
27 XOR Problem (Continued)
Note that K(xᵢ, xⱼ) can be calculated directly, without using φ!
The corresponding Lagrange multipliers are α = (1/8)·[1 1 1 1]ᵀ.
Hence the hyperplane is y = wᵀφ(x) = −x₁x₂.
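A hedged MATLAB sketch reproducing this XOR result (assuming quadprog is available and the usual label convention y = −x₁x₂ for the four corner points; this is not the deck's original code):

```matlab
% Hedged sketch: the XOR problem of slides 26-27 with K(x,z) = (1 + x'z)^2.
X = [ 1  1; -1 -1;  1 -1; -1  1];      % the four training points (rows)
y = [-1; -1;  1;  1];                  % assumed labels: y = -x1*x2

K = (1 + X * X').^2;                   % 4x4 Gram matrix, no phi needed
H = (y * y') .* K;

% Hard-margin dual: maximize sum(a) - 0.5 a'Ha  s.t.  y'a = 0, a >= 0
alpha = quadprog(H, -ones(4, 1), [], [], y', 0, zeros(4, 1), [])   % -> 1/8 each

f = K * (alpha .* y);                  % decision values (b = 0 here)
disp([sign(f), y])                     % predictions agree with the labels
```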
28 The Most Common Kernel Functions
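The kernel table belonging to this slide is not reproduced in this transcript. As an illustrative stand-in, here is a hedged sketch of the kernels most commonly listed for SVMs; the parameter names d, gamma, kappa, and theta are my own choice.

```matlab
% Hedged sketch: commonly used SVM kernels (parameter names are illustrative).
linear_k  = @(x, z)                x' * z;
poly_k    = @(x, z, d)             (1 + x' * z)^d;                 % polynomial of degree d
rbf_k     = @(x, z, gamma)         exp(-gamma * norm(x - z)^2);    % Gaussian / RBF
sigmoid_k = @(x, z, kappa, theta)  tanh(kappa * (x' * z) + theta); % sigmoid (Mercer only for some kappa, theta)
```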
29 SVM Using Nonlinear Kernels
[Network diagram: inputs x₁, …, xₙ feed kernel evaluations K(x, xⱼ) (nonlinear transforms), whose outputs are combined with weights α₀, …, α_P to produce the decision output f.]
- Using a kernel, low-dimensional feature vectors are mapped to a high-dimensional (possibly infinite-dimensional) kernel feature space where the data are likely to be linearly separable.
30 Non-Linear SVM Classifier Mathematically
- Slack variables ξᵢ can be added to allow misclassification of noisy examples.
Find w and b such that
  Φ(w, ξ) = ½ wᵀw + C Σᵢ ξᵢ
is minimized for all (xᵢ, yᵢ), where yᵢ(wᵀφ(xᵢ) + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0, i = 1, 2, …, n.
C is a user-specified positive number.
Using αᵢ and μᵢ as Lagrange multipliers for the two sets of constraints, the unconstrained cost function (Lagrangian) becomes
  L(w, b, ξ; α, μ) = ½ wᵀw + C Σᵢ ξᵢ − Σᵢ αᵢ[yᵢ(wᵀφ(xᵢ) + b) − 1 + ξᵢ] − Σᵢ μᵢξᵢ
31 The Lagrange Multipliers
The Kuhn-Tucker necessary conditions, written in matrix format, give the linear system
(1)
By substituting for w and ξ in the Lagrangian function, the dual quadratic problem is obtained.
32 The Proposed Suggestions
We note that for a large number of observations the matrix in (1) cannot be stored, so an iterative solution method is needed, which causes complexity and running-time problems.
A new suggestion to handle the computational complexity and running time of system (1) uses the following techniques:
- Use the large-scale conjugate gradient least squares technique, Golub and Van Loan (1989).
- Simplify the shape of the hyperplane by excluding some training samples that contribute to the convoluted hyperplane. Thereby, it is possible to separate the remaining samples with a less convoluted hyperplane.
33 The Conjugate Gradient Algorithm
The conjugate gradient method for solving a linear system Ax = d, with A symmetric positive definite, is given by the iteration in Table (1) (sketched below).
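Table (1) itself is not included in this transcript; below is a hedged MATLAB sketch of the standard conjugate gradient iteration it refers to.

```matlab
% Hedged sketch of the standard conjugate gradient iteration (Table (1) is not
% reproduced in this transcript). Solves A*x = d for symmetric positive definite A.
function x = conj_grad(A, d, tol, maxit)
    x   = zeros(size(d));
    r   = d - A * x;              % initial residual
    p   = r;                      % first search direction
    rho = r' * r;
    for k = 1:maxit
        q     = A * p;
        gamma = rho / (p' * q);   % step length
        x     = x + gamma * p;
        r     = r - gamma * q;
        rho_new = r' * r;
        if sqrt(rho_new) < tol, break; end
        p   = r + (rho_new / rho) * p;   % next conjugate direction
        rho = rho_new;
    end
end
```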
34 The advantage of the conjugate gradient method is that if the matrix A is symmetric positive definite and A = I + t, where rank(t) = d, then the conjugate gradient algorithm in Table (1) converges in at most d + 1 steps. Therefore, the system (1) can be written in the form
(2)
The coefficient matrices in systems (1) and (2) are not positive definite, so we reformulate them to obtain
(3)
35 Therefore, the solution of the system (3) is
36 We conclude that the solution of each of the systems (3) and (1) is:
- Easy to compute, with no need to store the matrix A.
- Faster: the new technique runs faster, and the computational running time is greatly reduced.
- Free of matrix inversion: we do not need to compute the inverse of the matrix, which is difficult when n is very large.
37 The Matlab code for a least squares support vector machine with an arbitrary kernel, computing b and α, is written as follows (a hedged sketch is given below):
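The listing referred to on this slide is not in the transcript. As a hedged stand-in, here is a minimal least squares SVM trainer for an arbitrary kernel handle that computes b and α by solving the usual LS-SVM linear system directly; the deck's own version solves the reformulated system with the conjugate gradient technique instead. The name gam (regularization parameter) is my own choice.

```matlab
% Hedged sketch (not the slide's original listing): least squares SVM with an
% arbitrary kernel handle, solving the standard LS-SVM system for b and alpha.
function [alpha, b] = lssvm_train(X, y, kernel, gam)
    n     = size(X, 1);
    Omega = zeros(n);
    for i = 1:n
        for j = 1:n
            Omega(i, j) = y(i) * y(j) * kernel(X(i, :)', X(j, :)');
        end
    end
    A   = [0, y'; y, Omega + eye(n) / gam];   % LS-SVM KKT system
    rhs = [0; ones(n, 1)];
    sol = A \ rhs;                            % a CG-based solver could be used here
    b     = sol(1);
    alpha = sol(2:end);
end
% Prediction for a new column vector xnew:
%   f = b; for i = 1:numel(y), f = f + alpha(i)*y(i)*kernel(X(i,:)', xnew); end
%   yhat = sign(f);
```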
38 Solving the Optimization Problem
- The dual problem for non-linear SVM classification:
Find α₁, …, αₙ such that
  Q(α) = Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ K(xᵢ, xⱼ)
is maximized subject to the constraints
  (1) Σᵢ αᵢyᵢ = 0,  (2) 0 ≤ αᵢ ≤ C for all αᵢ,
where I₀ = {i : 0 < αᵢ < C}.
- Again, the xᵢ with non-zero αᵢ will be the support vectors.
The decision function is
  f(x) = sign(Σᵢ αᵢyᵢ K(xᵢ, x) + b)   (4)
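A hedged sketch of evaluating the decision function (4) once the multipliers, bias, and kernel are known (an illustrative helper, not part of the deck):

```matlab
% Hedged sketch: evaluate the decision function (4) for new rows Xnew, given
% trained multipliers alpha, labels y, training inputs X, bias b and a kernel handle.
function yhat = svm_predict(Xnew, X, y, alpha, b, kernel)
    m = size(Xnew, 1);  n = size(X, 1);
    yhat = zeros(m, 1);
    for t = 1:m
        f = b;
        for i = 1:n
            f = f + alpha(i) * y(i) * kernel(X(i, :)', Xnew(t, :)');
        end
        yhat(t) = sign(f);             % f(x) = sign(sum_i a_i y_i K(x_i, x) + b)
    end
end
```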
39 Multiclass Least Squares SVM Classifier with the Conjugate Gradient Algorithm
Consider a training set drawn from c > 2 categories. We use a dedicated symbol to refer to the ith output unit for class k, for all k = 1, 2, …, c. The support vector machines optimization problem can then be written as
40 The corresponding Lagrangian can be formulated as
(5)
41 The necessary conditions for the Lagrangian in (5) are given by
(6)
42 (No Transcript)
43 As in the binary classification case, we use the conjugate gradient least squares technique to solve the system (6). We need to modify the structure of both matrices involved. In addition, we should not store the full matrix, due to its computational cost and complexity. Moreover, the matrices in (6) need to be reformatted so that the two linear subsystems of equations have positive definite matrices. Thus, the SVM classifier algorithm is:
44 (No Transcript)
45 (No Transcript)
46 Model Selection
- We select the best model as follows:
- Selection methods:
  - Backward-Forward (BF)
  - Forward-Backward (FB)
- Quality criterion: we use the minimum description length (MDL), that is,
  MDL = (goodness-of-fit term) + (penalty-for-complexity term)   (7)
The model with the smallest MDL is the best model.
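Equation (7) itself is not reproduced in the transcript; as an illustration of the "goodness of fit + penalty for complexity" structure indicated above, a common MDL-style criterion could be coded as follows (an assumption, not necessarily the deck's exact formula):

```matlab
% Hedged sketch: one common MDL-style criterion with the structure indicated on
% this slide (fit term + complexity penalty); not necessarily the deck's formula (7).
% rss: residual (or classification-error) sum of squares, n: observations, k: parameters.
mdl = @(rss, n, k) (n / 2) * log(rss / n) + (k / 2) * log(n);
% The model with the smallest mdl(rss, n, k) is preferred.
```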
47 Model Validation
We validate the classification model using two methods.
1. Internal Validation
- 1. The correct classification rate (CCR):
  CCR = (1/n) Σₖ nₖ,   (8)
  where nₖ is the number of correctly classified observations in class k. The model with the highest CCR has the better performance.
- 2. The average squared classification error (SSCE)   (9)
  The model with the smallest SSCE has the better performance.
Note: CCR and SSCE are computed on the training set.
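A hedged sketch of the two internal-validation measures (CCR follows the definition above; the exact form of (9) is not in the transcript, so the squared-error average below is an assumption):

```matlab
% Hedged sketch: internal validation measures of slide 47.
% y_true, y_pred: numeric class labels of the training set.
function [ccr, ssce] = internal_validation(y_true, y_pred)
    n    = numel(y_true);
    ccr  = sum(y_true == y_pred) / n;      % correct classification rate, eq. (8)
    ssce = mean((y_true - y_pred).^2);     % assumed form of the SSCE in eq. (9)
end
```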
48 2. External Validation
1. The average CCR over all runs
2. The average SSCE over all runs
3. The average MDL over all runs
4. The average execution time over all runs
Note: CCR, SSCE, MDL, and time are computed on the testing set.
We will return to all of these in the real-world applications and the simulation study.
49 (No Transcript)
50 (No Transcript)
51 (No Transcript)
52 SVM implementations I.
- SVMlight - satyr.net2.private/usr/local/bin
  - svm_learn, svm_classify
- bsvm - satyr.net2.private/usr/local/bin
  - svm-train, svm-classify, svm-scale
- libsvm - satyr.net2.private/usr/local/bin
  - svm-train, svm-predict, svm-scale, svm-toy
- mySVM
- MATLAB svm toolbox
- Differences: available kernel functions, optimization, multi-class handling, user interfaces
53 SVM implementations II.
- SVMlight
  - Simple text data format
  - Fast C routines
- bsvm
  - Multi-class classification
- LIBSVM
  - GUI: svm-toy
- MATLAB svm toolbox
  - Graphical interface (2D)
54 Thalassemias Data
p = 4 predictors, c = 3 classes
55 Performance Comparison: SVM-CGA vs. SVM-NP vs. SMO
[Plots: number of kernel evaluations (×10⁷) and margin, on the Thalassemias data]
56 Conclusions and Future Work
- The performance of the new SVM classifier on the applications fraud detection, Wisconsin breast cancer, Thalassemias, Fisher iris data, and bioinformatics is robust and efficient with respect to the parameters of the algorithm.
- Future work: apply the new SVM technique to determine the best path for oil extraction and to estimate oil reserves.
- Use the SVM kernel function with the CGA criterion to design a neural network model with bounded weights for pattern classification.
- Test the methodology on Internet security and encryption.
57 That's All Folks
If we all do a little, we will do a lot
Thank you