Title: Support Vector Machines
1 Fast and Efficient Support Vector Machines for Multi-Class Pattern Classification Problems
Dr. Emad Ahmed El-Sebakhy
2 Presentation Outline
- Goals of Research
- What Pattern Classification Problems Mean
- Mathematical Description of Binary (Linear and Non-Linear) SVM Classifiers
- The Complexity and Running-Time Problems and the Suggested Techniques:
  - Conjugate gradient least squares algorithm
  - Simplifying the shape of the hyperplane by excluding some training samples that contribute to the convoluted hyperplane
- Multi-Class SVM Conjugate Least Squares Classifier
- Real-World Applications
- Simulation Study
- Conclusions and Future Work
3 Goals of Research
- To propose an efficient and fast SVM classifier for solving multi-class pattern classification (PC) problems.
- To assess the performance of the multi-class SVM classifier on real-world applications.
- To investigate the properties of the SVM classifier using simulation experiments.
- To compare the performance of the new SVM classifier to the seven most common classifiers in the statistics and computer science literature.
- To draw conclusions and recommendations.
4 What is a Support Vector Machine?
- SVM is a supervised learning algorithm.
- It uses concepts from computational learning theory.
- It is an optimally defined surface, typically nonlinear in the input space and linear in a higher-dimensional space.
- It is implicitly defined by a kernel function.
5 What is SVM used for?
- Regression and data-fitting
- Supervised and unsupervised learning
- Image Processing
- Speech Recognition
- Pattern Recognition
- Time-Series Analysis
- Adaptive Equalization
- Radar Point Source Location
- Medical Diagnosis
- Process Fault Detection
- Classification
6 Expected Risk and Structural Risk Minimization (SRM)
- Classification problem
- Decision functions
- Expected risk
- Ideal goal: find the decision function f(x) that minimizes the expected risk.
- Empirical risk minimization
- To avoid the over-fitting problem, use SRM or the minimum description length (MDL) principle, where the empirical risk is minimized together with a complexity penalty.
7 References
- Vapnik V. (1995), The Nature of Statistical Learning Theory, Springer.
- Vapnik V. (1998), Statistical Learning Theory, John Wiley, New York.
- Osuna E. and Girosi F. (1998), Reducing the run-time complexity of support vector machines, ICPR, Brisbane, Australia.
- Cristianini N. and Shawe-Taylor J. (2000), An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press.
- Burges C. J. C. (1998), A tutorial on support vector machines for pattern recognition, Knowledge Discovery and Data Mining, 2(2).
- Hsu C. and Lin C. (2002), A comparison of methods for multi-class support vector machines, IEEE Transactions on Neural Networks, 13, 415-425.
- Duan K., Keerthi S. S., and Poo A. N. (2003), Evaluation of simple performance measures for tuning SVM hyperparameters, Neurocomputing, 51, 41-59.
8 What's a Pattern Classification Problem?
Given a group of n objects with p predictors, drawn from c populations (c > 2), the goal of pattern classification is to construct a decision rule that classifies given objects into one and only one of the given populations.
Goal: determine a decision function (the classifier).
9 Pattern Classification Process
Training set → learning algorithm → building the classification model → model selection → testing and validation
10 Examples of Applications
11 Binary Linear Classifier
- If the training set is linearly separable, binary classification can be viewed as the task of separating the classes in feature space.
The decision function is f(x; w, b) = sign(wᵀx + b), with:
  wᵀx + b = 0  (separating hyperplane)
  wᵀx + b > 0  (one class)
  wᵀx + b < 0  (the other class)
12 Classification Margin
- The distance from an object x to the hyperplane is r = (wᵀx + b) / ||w||.
- Support vectors are the objects closest to the hyperplane.
- The margin ρ is the width by which the boundary could be increased before hitting a data point.
- The maximum margin linear classifier is the linear classifier with the maximum margin.
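To make the two quantities above concrete, here is a minimal MATLAB sketch (not from the original deck) that evaluates the signed distance r of a point to a hyperplane and the margin ρ of a canonical hyperplane; the values of w, b, and x are illustrative.

```matlab
% Minimal sketch (not the deck's code): distance of a point x to the
% hyperplane w'x + b = 0, and the margin of a canonical hyperplane.
dist_to_hyperplane = @(x, w, b) (w' * x + b) / norm(w);  % signed distance r
margin             = @(w) 2 / norm(w);                   % rho = 2 / ||w||

% Example usage with arbitrary numbers:
w = [2; -1]; b = -3; x = [1; 0.5];
r   = dist_to_hyperplane(x, w, b)
rho = margin(w)
```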
13 Linear SVM Classifier Mathematically
- Assuming all data are at least distance 1 from the hyperplane (canonical scaling), the following two constraints hold for a training set (xᵢ, yᵢ):
  wᵀxᵢ + b ≥ 1 if yᵢ = +1,  wᵀxᵢ + b ≤ −1 if yᵢ = −1
- For support vectors, the inequality becomes an equality; since each support vector's distance from the hyperplane is 1/||w||, the margin is ρ = 2/||w||.
Find w and b such that ρ = 2/||w|| is maximized and, for all (xᵢ, yᵢ), wᵀxᵢ + b ≥ 1 if yᵢ = +1 and wᵀxᵢ + b ≤ −1 if yᵢ = −1.
A better formulation:
Find w and b such that Φ(w) = ½ wᵀw is minimized subject to yᵢ(wᵀxᵢ + b) ≥ 1 for all (xᵢ, yᵢ).
14 Solving the Optimization Problem
- In most real applications we need to optimize a quadratic function subject to linear constraints. The solution involves constructing a dual problem in which a Lagrange multiplier αᵢ is associated with every constraint in the primal problem.
Lagrange formulation (primal form → dual form):
Find α₁, …, αₙ such that
  Q(α) = Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ xᵢᵀxⱼ
is maximized under the constraints
  (1) Σᵢ αᵢyᵢ = 0,  (2) αᵢ ≥ 0 for all αᵢ.
15 This is how to find a solution
The solution:
  w = Σᵢ αᵢyᵢxᵢ,  b = yₖ − wᵀxₖ for any xₖ such that αₖ ≠ 0
The classifying decision function is
  f(x) = sign(Σᵢ αᵢyᵢ xᵢᵀx + b)
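As a concrete illustration of this dual solution, here is a hedged MATLAB sketch (assuming the Optimization Toolbox function quadprog is available; this is not the author's code) that solves the hard-margin dual for a linearly separable set and then recovers w, b, and the decision rule exactly as above.

```matlab
% Hedged sketch (not the author's implementation): solve the hard-margin dual
% for a linearly separable training set X (n x p) with labels y (n x 1, +/-1),
% then recover w, b and the decision rule of this slide.
function [w, b, alpha] = linear_svm_dual(X, y)
    n = size(X, 1);
    H = (y * y') .* (X * X');              % H_ij = y_i y_j x_i' x_j
    f = -ones(n, 1);                       % maximize sum(a) - 0.5 a'Ha
    alpha = quadprog(H, f, [], [], y', 0, zeros(n, 1), []);
    w = X' * (alpha .* y);                 % w = sum_i alpha_i y_i x_i
    k = find(alpha > 1e-6, 1);             % index of any support vector
    b = y(k) - X(k, :) * w;                % b = y_k - w' x_k
end
% Classify new rows Xnew with: yhat = sign(Xnew * w + b)
```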
16 A Numerical Example
- Three inequalities: 1·w + b ≤ −1,  2·w + b ≥ +1,  3·w + b ≥ +1
- J = w²/2 − α₁(−w − b − 1) − α₂(2w + b − 1) − α₃(3w + b − 1)
- ∂J/∂w = 0 ⇒ w = −α₁ + 2α₂ + 3α₃
- ∂J/∂b = 0 ⇒ 0 = α₁ − α₂ − α₃
- The Kuhn-Tucker conditions imply:
  (a) α₁(−w − b − 1) = 0,  (b) α₂(2w + b − 1) = 0,  (c) α₃(3w + b − 1) = 0
- The solution (as we will see later) is α₁ = α₂ = 2 and α₃ = 0. This yields w = 2, b = −3.
- Hence the decision boundary (hyperplane) is 2x − 3 = 0, i.e. x = 1.5.
- This is shown as the dashed line in the figure.
17 (No Transcript)
18 (No Transcript)
19 Formulating the Dual Problem
- At the saddle point we have ∂L/∂w = 0 and ∂L/∂b = 0, i.e. w = Σᵢ αᵢyᵢxᵢ and Σᵢ αᵢyᵢ = 0; substituting these relations into the Lagrangian above gives the dual problem:
Maximize Q(α) = Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ xᵢᵀxⱼ subject to Σᵢ αᵢyᵢ = 0 and αᵢ ≥ 0 for i = 1, 2, …, n.
Note
20 Numerical Example (cont'd)
- Here Q(α) = α₁ + α₂ + α₃ − (0.5α₁² + 2α₂² + 4.5α₃² − 2α₁α₂ − 3α₁α₃ + 6α₂α₃),
- subject to the constraints −α₁ + α₂ + α₃ = 0 and
- α₁ ≥ 0, α₂ ≥ 0, α₃ ≥ 0.
- Use the Matlab Optimization Toolbox command
- x = fmincon(qalpha, X0, A, B, Aeq, Beq)
- The solution is (α₁, α₂, α₃) = (2, 2, 0), as expected (see the sketch below).
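A hedged sketch of how the fmincon call above could be set up for this example (the exact script is not in the deck; the non-negativity bounds are passed explicitly through A and B):

```matlab
% Hedged sketch of the slide-20 computation (not the original script).
% Maximizing Q(alpha) is done by minimizing -Q(alpha).
qalpha = @(a) -(a(1) + a(2) + a(3)) ...
         + 0.5*a(1)^2 + 2*a(2)^2 + 4.5*a(3)^2 ...
         - 2*a(1)*a(2) - 3*a(1)*a(3) + 6*a(2)*a(3);

X0  = zeros(3, 1);                    % starting point
A   = -eye(3);   B   = zeros(3, 1);   % -alpha_i <= 0, i.e. alpha_i >= 0
Aeq = [-1 1 1];  Beq = 0;             % sum_i y_i alpha_i = 0

alpha = fmincon(qalpha, X0, A, B, Aeq, Beq)   % expected: [2 2 0]'

% Recover w and b as in slide 16 (x = [1 2 3], y = [-1 1 1]):
x = [1 2 3]';  y = [-1 1 1]';
w = sum(alpha .* y .* x);                      % w = 2
k = find(alpha > 1e-6, 1);
b = y(k) - w * x(k)                            % b = -3, boundary x = 1.5
```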
21 Non-linear SVMs: Feature Spaces
- What if the training set is not linearly separable?
- The original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
  Φ: x → φ(x)
22 Implication of Minimizing ||w||
- Let D denote the diameter of the smallest hypersphere that encloses all the input training vectors x₁, x₂, …, xₙ.
- The set of optimal hyperplanes described by the equation
  w₀ᵀx + b₀ = 0
- has a Vapnik-Chervonenkis (VC) dimension h bounded from above as
  h ≤ min(⌈D²/ρ²⌉, p) + 1
- where p is the dimension of the input vectors and ρ = 2/||w₀|| is the margin of separation of the hyperplanes.
- The VC dimension determines the complexity of the classifier structure; usually, the smaller the better.
23 Non-separable Cases
- Recall that in the linearly separable case, each training sample pair (xᵢ, yᵢ) represents a linear inequality constraint
  yᵢ(wᵀxᵢ + b) ≥ 1,  i = 1, 2, …, n   (*)
- If the training samples are not linearly separable, the constraint can be modified to yield a soft constraint
  yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ,  i = 1, 2, …, n   (**)
- ξᵢ, 1 ≤ i ≤ n, are known as slack variables.
- Note that (*) is a normalized version of yᵢ f(xᵢ)/||w|| ≥ ρ. With the slack variable ξᵢ, that inequality becomes yᵢ f(xᵢ)/||w|| ≥ ρ(1 − ξᵢ). Hence, with the slack variables we allow some samples xᵢ to fall within the gap. Moreover, if ξᵢ > 1, then the corresponding (xᵢ, yᵢ) is misclassified because the sample falls on the wrong side of the hyperplane H.
24 Non-Separable Case
- Since ξᵢ > 1 implies misclassification, the cost function must include a term that minimizes the number of misclassified samples (those with ξᵢ > 1), weighted by a multiplier. But this formulation is non-convex, and a solution is difficult to find using existing nonlinear optimization algorithms.
- Hence, we may instead use an approximated cost function that penalizes Σᵢ ξᵢ.
- With this approximated cost function, the goal is to maximize ρ (minimize ||w||) while minimizing Σᵢ ξᵢ (ξᵢ ≥ 0):
  - ξᵢ is not counted if xᵢ is outside the gap and on the correct side.
  - 0 < ξᵢ < 1: xᵢ is inside the gap, but on the correct side.
  - ξᵢ > 1: xᵢ is on the wrong side (inside or outside the gap).
25 The Kernel Functions
- If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ).
- The linear classifier relies on the inner product between vectors, K(xᵢ, xⱼ) = xᵢᵀxⱼ.
- A kernel function is a function that corresponds to an inner product in some feature space.
- For a candidate K(xᵢ, xⱼ), one checks whether K(xᵢ, xⱼ) can be written in the form φ(xᵢ)ᵀφ(xⱼ) or not.
- Example: for two-dimensional vectors x = [x₁ x₂], let K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)².
- We need to show that K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ):
  K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)² = 1 + xᵢ₁²xⱼ₁² + 2xᵢ₁xⱼ₁xᵢ₂xⱼ₂ + xᵢ₂²xⱼ₂² + 2xᵢ₁xⱼ₁ + 2xᵢ₂xⱼ₂
  = [1, xᵢ₁², √2 xᵢ₁xᵢ₂, xᵢ₂², √2 xᵢ₁, √2 xᵢ₂]ᵀ [1, xⱼ₁², √2 xⱼ₁xⱼ₂, xⱼ₂², √2 xⱼ₁, √2 xⱼ₂]
  = φ(xᵢ)ᵀφ(xⱼ), where φ(x) = [1, x₁², √2 x₁x₂, x₂², √2 x₁, √2 x₂].
Which kernels are feasible? A kernel must satisfy Mercer's theorem, i.e. the kernel (Gram) matrix must be positive semi-definite.
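The expansion above can also be checked numerically; a small MATLAB sketch (not part of the original slides, test points chosen arbitrarily):

```matlab
% Hedged sketch: check numerically that (1 + xi'*xj)^2 = phi(xi)'*phi(xj)
% for the quadratic feature map written on this slide.
phi = @(x) [1; x(1)^2; sqrt(2)*x(1)*x(2); x(2)^2; sqrt(2)*x(1); sqrt(2)*x(2)];
K   = @(xi, xj) (1 + xi' * xj)^2;

xi = [0.3; -1.2];  xj = [2.0; 0.5];      % arbitrary test points
abs(K(xi, xj) - phi(xi)' * phi(xj))      % ~0, up to round-off
```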
26 Numerical Example: the XOR Problem
x = [x₁, x₂]ᵀ. Using K(x, y) = (1 + xᵀy)², one has
  φ(x) = [1, x₁², x₂², √2 x₁, √2 x₂, √2 x₁x₂]ᵀ
Note: dim φ(x) = 6 > dim x = 2. The dimension of the kernel matrix K equals Nₛ, the number of support vectors.
27 XOR Problem (Continued)
Note that K(xᵢ, xⱼ) can be calculated directly, without using φ!
The corresponding Lagrange multipliers are α = (1/8)·[1 1 1 1]ᵀ.
Hence the hyperplane is y = wᵀφ(x) = −x₁x₂.
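A hedged MATLAB sketch reproducing this XOR result (assuming quadprog is available and the usual label convention y = −x₁x₂ for the four corner points; this is not the deck's original code):

```matlab
% Hedged sketch: the XOR problem of slides 26-27 with K(x,z) = (1 + x'z)^2.
X = [ 1  1; -1 -1;  1 -1; -1  1];      % the four training points (rows)
y = [-1; -1;  1;  1];                  % assumed labels: y = -x1*x2

K = (1 + X * X').^2;                   % 4x4 Gram matrix, no phi needed
H = (y * y') .* K;

% Hard-margin dual: maximize sum(a) - 0.5 a'Ha  s.t.  y'a = 0, a >= 0
alpha = quadprog(H, -ones(4, 1), [], [], y', 0, zeros(4, 1), [])   % -> 1/8 each

f = K * (alpha .* y);                  % decision values (b = 0 here)
disp([sign(f), y])                     % predictions agree with the labels
```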
28 The Most Common Kernel Functions
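The kernel table belonging to this slide is not reproduced in this transcript. As an illustrative stand-in, here is a hedged sketch of the kernels most commonly listed for SVMs; the parameter names d, gamma, kappa, and theta are my own choice.

```matlab
% Hedged sketch: commonly used SVM kernels (parameter names are illustrative).
linear_k  = @(x, z)                x' * z;
poly_k    = @(x, z, d)             (1 + x' * z)^d;                 % polynomial of degree d
rbf_k     = @(x, z, gamma)         exp(-gamma * norm(x - z)^2);    % Gaussian / RBF
sigmoid_k = @(x, z, kappa, theta)  tanh(kappa * (x' * z) + theta); % sigmoid (Mercer only for some kappa, theta)
```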
29 SVM Using Nonlinear Kernels
[Network diagram: inputs x₁, …, xₙ feed kernel evaluations K(x, xⱼ) (nonlinear transforms), whose outputs are combined with weights α₀, …, α_P to produce the decision output f.]
- Using a kernel, low-dimensional feature vectors are mapped to a high-dimensional (possibly infinite-dimensional) kernel feature space where the data are likely to be linearly separable.
30 Non-Linear SVM Classifier Mathematically
- Slack variables ξᵢ can be added to allow misclassification of noisy examples.
Find w and b such that
  Φ(w, ξ) = ½ wᵀw + C Σᵢ ξᵢ
is minimized for all (xᵢ, yᵢ), where yᵢ(wᵀφ(xᵢ) + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0, i = 1, 2, …, n.
C is a user-specified positive number.
Using αᵢ and μᵢ as Lagrange multipliers for the two sets of constraints, the unconstrained cost function (Lagrangian) becomes
  L(w, b, ξ; α, μ) = ½ wᵀw + C Σᵢ ξᵢ − Σᵢ αᵢ[yᵢ(wᵀφ(xᵢ) + b) − 1 + ξᵢ] − Σᵢ μᵢξᵢ
31 The Lagrange Multipliers
The Kuhn-Tucker necessary conditions, written in matrix format, give the linear system
(1)
By substituting for w and ξ in the Lagrangian function, the dual quadratic problem is obtained.
32 The Proposed Suggestions
We note that for a large number of observations the matrix in (1) cannot be stored, so an iterative solution method is needed, which causes complexity and running-time problems.
A new suggestion to handle the computational complexity and running time of system (1) uses the following techniques:
- Use the large-scale conjugate gradient least squares technique, Golub and Van Loan (1989).
- Simplify the shape of the hyperplane by excluding some training samples that contribute to the convoluted hyperplane. Thereby, it is possible to separate the remaining samples with a less convoluted hyperplane.
33 The Conjugate Gradient Algorithm
The conjugate gradient method for solving a linear system Ax = d, with A symmetric positive definite, is given by the iteration in Table (1) (sketched below).
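Table (1) itself is not included in this transcript; below is a hedged MATLAB sketch of the standard conjugate gradient iteration it refers to.

```matlab
% Hedged sketch of the standard conjugate gradient iteration (Table (1) is not
% reproduced in this transcript). Solves A*x = d for symmetric positive definite A.
function x = conj_grad(A, d, tol, maxit)
    x   = zeros(size(d));
    r   = d - A * x;              % initial residual
    p   = r;                      % first search direction
    rho = r' * r;
    for k = 1:maxit
        q     = A * p;
        gamma = rho / (p' * q);   % step length
        x     = x + gamma * p;
        r     = r - gamma * q;
        rho_new = r' * r;
        if sqrt(rho_new) < tol, break; end
        p   = r + (rho_new / rho) * p;   % next conjugate direction
        rho = rho_new;
    end
end
```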
34 The advantage of the conjugate gradient method is that if the matrix A is symmetric positive definite and A = I + t, where rank(t) = d, then the conjugate gradient algorithm in Table (1) converges in at most d + 1 steps. Therefore, the system (1) can be written in the form
(2)
The coefficient matrices in systems (1) and (2) are not positive definite, so we reformulate them to obtain
(3)
35 Therefore, the solution of the system (3) is
36 We conclude that the solution of each of the systems (3) and (1) is:
- Easy to compute, with no need to store the matrix A.
- Faster: the new technique runs faster, and the computational running time is greatly reduced.
- Free of matrix inversion: we do not need to compute the inverse of the matrix, which is difficult when n is very large.
37 The Matlab code for a least squares support vector machine with an arbitrary kernel, computing b and α, is written as follows (a hedged sketch is given below):
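The listing referred to on this slide is not in the transcript. As a hedged stand-in, here is a minimal least squares SVM trainer for an arbitrary kernel handle that computes b and α by solving the usual LS-SVM linear system directly; the deck's own version solves the reformulated system with the conjugate gradient technique instead. The name gam (regularization parameter) is my own choice.

```matlab
% Hedged sketch (not the slide's original listing): least squares SVM with an
% arbitrary kernel handle, solving the standard LS-SVM system for b and alpha.
function [alpha, b] = lssvm_train(X, y, kernel, gam)
    n     = size(X, 1);
    Omega = zeros(n);
    for i = 1:n
        for j = 1:n
            Omega(i, j) = y(i) * y(j) * kernel(X(i, :)', X(j, :)');
        end
    end
    A   = [0, y'; y, Omega + eye(n) / gam];   % LS-SVM KKT system
    rhs = [0; ones(n, 1)];
    sol = A \ rhs;                            % a CG-based solver could be used here
    b     = sol(1);
    alpha = sol(2:end);
end
% Prediction for a new column vector xnew:
%   f = b; for i = 1:numel(y), f = f + alpha(i)*y(i)*kernel(X(i,:)', xnew); end
%   yhat = sign(f);
```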
38 Solving the Optimization Problem
- The dual problem for non-linear SVM classification:
Find α₁, …, αₙ such that
  Q(α) = Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ K(xᵢ, xⱼ)
is maximized subject to the constraints
  (1) Σᵢ αᵢyᵢ = 0,  (2) 0 ≤ αᵢ ≤ C for all αᵢ,
where I₀ = {i : 0 < αᵢ < C}.
- Again, the xᵢ with non-zero αᵢ will be the support vectors.
The decision function is
  f(x) = sign(Σᵢ αᵢyᵢ K(xᵢ, x) + b)   (4)
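A hedged sketch of evaluating the decision function (4) once the multipliers, bias, and kernel are known (an illustrative helper, not part of the deck):

```matlab
% Hedged sketch: evaluate the decision function (4) for new rows Xnew, given
% trained multipliers alpha, labels y, training inputs X, bias b and a kernel handle.
function yhat = svm_predict(Xnew, X, y, alpha, b, kernel)
    m = size(Xnew, 1);  n = size(X, 1);
    yhat = zeros(m, 1);
    for t = 1:m
        f = b;
        for i = 1:n
            f = f + alpha(i) * y(i) * kernel(X(i, :)', Xnew(t, :)');
        end
        yhat(t) = sign(f);             % f(x) = sign(sum_i a_i y_i K(x_i, x) + b)
    end
end
```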
39 Multiclass Least Squares SVM Classifier with the Conjugate Gradient Algorithm
Consider a training set drawn from c > 2 categories. We use a dedicated symbol to refer to the ith output unit for class k, for all k = 1, 2, …, c. The support vector machines optimization problem can then be written as
40 The corresponding Lagrangian can be formulated as
(5)
41 The necessary conditions for the Lagrangian in (5) are given by
(6)
42 (No Transcript)
43 As in the binary classification case, we use the conjugate gradient least squares technique to solve the system (6). We need to modify the structure of both matrices involved. In addition, we should not store the full matrix, due to its computational cost and complexity. Moreover, the matrices in (6) need to be reformatted so that the two linear subsystems of equations have positive definite matrices. Thus, the SVM classifier algorithm is:
44 (No Transcript)
45 (No Transcript)
46 Model Selection
- We select the best model as follows:
- Selection methods:
  - Backward-Forward (BF)
  - Forward-Backward (FB)
- Quality criterion: we use the minimum description length (MDL), that is,
  MDL = (goodness-of-fit term) + (penalty-for-complexity term)   (7)
The model with the smallest MDL is the best model.
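Equation (7) itself is not reproduced in the transcript; as an illustration of the "goodness of fit + penalty for complexity" structure indicated above, a common MDL-style criterion could be coded as follows (an assumption, not necessarily the deck's exact formula):

```matlab
% Hedged sketch: one common MDL-style criterion with the structure indicated on
% this slide (fit term + complexity penalty); not necessarily the deck's formula (7).
% rss: residual (or classification-error) sum of squares, n: observations, k: parameters.
mdl = @(rss, n, k) (n / 2) * log(rss / n) + (k / 2) * log(n);
% The model with the smallest mdl(rss, n, k) is preferred.
```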
47 Model Validation
We validate the classification model using two methods.
1. Internal Validation
- 1. The correct classification rate (CCR):
  CCR = (1/n) Σₖ nₖ,   (8)
  where nₖ is the number of correctly classified observations in class k. The model with the highest CCR has the better performance.
- 2. The average squared classification error (SSCE)   (9)
  The model with the smallest SSCE has the better performance.
Note: CCR and SSCE are computed on the training set.
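A hedged sketch of the two internal-validation measures (CCR follows the definition above; the exact form of (9) is not in the transcript, so the squared-error average below is an assumption):

```matlab
% Hedged sketch: internal validation measures of slide 47.
% y_true, y_pred: numeric class labels of the training set.
function [ccr, ssce] = internal_validation(y_true, y_pred)
    n    = numel(y_true);
    ccr  = sum(y_true == y_pred) / n;      % correct classification rate, eq. (8)
    ssce = mean((y_true - y_pred).^2);     % assumed form of the SSCE in eq. (9)
end
```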
48 2. External Validation
1. The average CCR over all runs
2. The average SSCE over all runs
3. The average MDL over all runs
4. The average execution time over all runs
Note: CCR, SSCE, MDL, and time are computed on the testing set.
We will return to all of these in the real-world applications and the simulation study.
49 (No Transcript)
50 (No Transcript)
51 (No Transcript)
52 SVM implementations I.
- SVMlight - satyr.net2.private/usr/local/bin
  - svm_learn, svm_classify
- bsvm - satyr.net2.private/usr/local/bin
  - svm-train, svm-classify, svm-scale
- libsvm - satyr.net2.private/usr/local/bin
  - svm-train, svm-predict, svm-scale, svm-toy
- mySVM
- MATLAB svm toolbox
- Differences: available kernel functions, optimization, multi-class handling, user interfaces
53 SVM implementations II.
- SVMlight
  - Simple text data format
  - Fast C routines
- bsvm
  - Multi-class classification
- LIBSVM
  - GUI: svm-toy
- MATLAB svm toolbox
  - Graphical interface (2D)
54 Thalassemias Data
p = 4 predictors, c = 3 classes
55 Performance Comparison: SVM-CGA vs. SVM-NP vs. SMO
[Plots: number of kernel evaluations (×10⁷) and margin, on the Thalassemias data]
56 Conclusions and Future Work
- The performance of the new SVM classifier on the applications fraud detection, Wisconsin breast cancer, Thalassemias, Fisher iris data, and bioinformatics is robust and efficient with respect to the parameters of the algorithm.
- Future work: apply the new SVM technique to determine the best path for oil extraction and to estimate oil reserves.
- Use the SVM kernel function with the CGA criterion to design a neural network model with bounded weights for pattern classification.
- Test the methodology on Internet security and encryption.
57 That's All Folks
If we all do a little, we will do a lot
Thank you