Support Vector Machines
Transcript and Presenter's Notes

Title: Support Vector Machines


1
Fast and Efficient Support Vector Machines
for Multi-Class Pattern Classification Problems
Dr. Emad Ahmed El-Sebakhy
2
Presentation Outline
  • Goals of Research
  • Pattern Classification Problems Meaning
  • Mathematical Description of Binary (Linear &
    Non-Linear) SVM Classifiers
  • The Complexity and Running Time Problems and The
    Suggested Techniques.
  • Conjugate Gradient Least Squares Algorithm
  • Simplify the shape of the hyperplane by excluding
    the training samples that contribute to a
    convoluted hyperplane
  • Multi-SVM Conjugate Least Squares Classifier
  • Real-World Applications
  • Simulation Study
  • Conclusions and Future Work

3
Goals of Research
  • To propose an efficient and fast SVM classifier
    for solving multi-class pattern classification
    (PC) problems.
  • To assess the performance of SVM Multi-class
    classifier on real-world applications.
  • To investigate the properties of SVM classifier
    using simulation experiments.
  • To compare the performance of the new SVM
    classifier to the seven most common classifiers
    in the statistics and computer science
    literatures.
  • To draw conclusions and recommendations

4
What is a Support Vector Machine?
  • SVM is a supervised learning algorithm.
  • It uses concepts from computational learning
    theory.
  • It is an optimally defined surface, typically
    nonlinear in the input space, and linear in a
    higher dimensional space.
  • Implicitly defined by a kernel function.

5
What is SVM used for?
  • Regression and data-fitting
  • Supervised and unsupervised learning
  • Image Processing
  • Speech Recognition
  • Pattern Recognition
  • Time-Series Analysis
  • Adaptive Equalization
  • Radar Point Source Location
  • Medical Diagnosis
  • Process Faults Detection

Classification
6
Expected Risk, Structural Risk Minimization (SRM)
  • Classification Problem
  • Decision functions
  • Expected Risk
  • Ideal goal: find the decision function f(x) that
    minimizes the expected risk
  • Empirical Risk Minimization
  • To avoid the over-fitting problem, use SRM or
    minimum description length (MDL), where the
    empirical risk should be minimized

7
Reference
  • Vapnik V., (1995), The Nature of Statistical
    Learning Theory. Springer.
  • Vapnik V., (1998), Statistical learning theory,
    John Wiley, New York.
  • E. Osuna, F. Girosi, (1998) Reducing the run-time
    complexity of support vector machines, ICPR,
    Brisbane, Australia.
  • Cristianini N. and Shawe-Taylor J., (2000), An
    Introduction to Support Vector Machines and Other
    Kernel-based learning methods, Cambridge
    University Press.
  • Burges C. J. C., (1998), A Tutorial on Support
    Vector Machines for Pattern Recognition,
    Data Mining and Knowledge Discovery, 2(2).
  • Hsu C. and Lin C., (2002), A comparison of
    methods for multi-class support vector machines,
    IEEE Transactions on Neural Networks, 13,
    415-425.
  • Duan K., Keerthi S. S., Poo A. N., (2003),
    Evaluation of simple performance measures for
    tuning SVM hyperparameters, Neurocomputing, 51,
    41-59.

8
What's a Pattern Classification Problem?
Given a group of n objects with p predictors,
drawn from c populations (c > 2), the goal of
pattern classification is to construct a decision
rule that classifies a given object into one and
only one of the given populations, i.e. to
determine a decision function
Classifier
9
Pattern Classification Process
Training Set
Learning Algorithm
Building The Classification Model
Model Selection
Testing and Validation
10
Examples of Applications
11
Binary Linear Classifier
  • If the training set is linearly separable, binary
    classification can be viewed as the task of
    separating the classes in feature space

The hypersurface equation is f(x, w, b) = sign(wᵀx + b)
wᵀx + b = 0
wᵀx + b > 0
wᵀx + b < 0
12
Classification Margin
  • The distance from an object x to the hyperplane is
    r = (wᵀx + b) / ‖w‖
  • Support vectors are the objects closest to the
    hyperplane
  • Margin ρ: the width by which the boundary could be
    increased before hitting a data point
  • The maximum margin linear classifier is the
    linear classifier with the maximum margin.

(Figure: the margin ρ and the distances r to the hyperplane)
13
Linear SVM Classifier Mathematically
  • Assuming all data are at least distance 1 from the
    hyperplane, the following two constraints hold
    for a training set (xi, yi)
  • For support vectors, the inequality becomes an
    equality; since each support vector's distance from
    the hyperplane is 1/‖w‖, the margin is ρ = 2/‖w‖

wᵀxi + b ≥ 1 if yi = +1,   wᵀxi + b ≤ -1 if yi = -1

Find w and b such that ρ = 2/‖w‖ is
maximized and, for all (xi, yi): wᵀxi + b ≥ 1 if
yi = +1 and wᵀxi + b ≤ -1 if yi = -1

A better formulation:
Find w and b such that Φ(w) = ½ wᵀw is minimized
subject to, for all (xi, yi): yi(wᵀxi + b) ≥ 1
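As an illustration (not part of the original slides), this formulation is a quadratic program that MATLAB's quadprog can solve directly. The snippet is a minimal sketch using the one-dimensional toy data of the later numerical example; the variable names are assumptions.

```matlab
% Minimal sketch (not from the slides): hard-margin primal QP
%   minimize 0.5*w'*w   subject to   yi*(w'*xi + b) >= 1
% solved with quadprog over the variable z = [w; b].
X = [1; 2; 3];                 % toy 1-D inputs (same data as the later numerical example)
y = [-1; 1; 1];                % class labels
[n, p] = size(X);
H = blkdiag(eye(p), 0);        % quadratic term penalizes w only, not b
f = zeros(p + 1, 1);
A = -[y .* X, y];              % -yi*(xi'*w + b) <= -1  encodes  yi*(w'*xi + b) >= 1
bvec = -ones(n, 1);
z = quadprog(H, f, A, bvec);
w = z(1:p);  b = z(end);       % expected: w = 2, b = -3, boundary at x = 1.5
```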
14
Solving the Optimization Problem
  • In most real applications, we need to optimize a
    quadratic function subject to linear constraints.
    The solution involves constructing a dual problem
    in which a Lagrange multiplier αi is associated
    with every constraint in the primal problem

Lagrange formulation
dual form
primal form
Find α1 … αn such that
Q(α) = Σ αi - ½ ΣΣ αiαj yiyj xiᵀxj is maximized under the constraints
(1) Σ αiyi = 0,  (2) αi ≥ 0 for all αi
15
This is how to find a solution:
w = Σ αiyixi,   b = yk - wᵀxk for any xk such that
αk ≠ 0
The classifying decision function is
f(x) = sign(Σ αiyixiᵀx + b)
16
A Numerical Example
  • Three inequalities:
    1·w + b ≤ -1,   2·w + b ≥ 1,   3·w + b ≥ 1
  • J = w²/2 - α1(-w - b - 1) - α2(2w + b - 1) - α3(3w + b - 1)
  • ∂J/∂w = 0  ⇒  w = -α1 + 2α2 + 3α3
  • ∂J/∂b = 0  ⇒  0 = α1 - α2 - α3
  • The Kuhn-Tucker conditions imply
    (a) α1(-w - b - 1) = 0,  (b) α2(2w + b - 1) = 0,
    (c) α3(3w + b - 1) = 0
  • The solution (as we will see later) is α1 = α2 = 2
    and α3 = 0. This yields
  • w = 2, b = -3.
  • Hence the decision boundary (hyperplane) is
  • 2x - 3 = 0, or x = 1.5.
  • This is shown as the dashed line in the figure above.

17
(No Transcript)
18
(No Transcript)
19
Formulating the Dual Problem
  • At the saddle point, we have w = Σ αiyixi
    and Σ αiyi = 0; substituting these relations
    into the Lagrangian above, we obtain the Dual Problem

Maximize Q(α) = Σ αi - ½ ΣΣ αiαj yiyj xiᵀxj
subject to Σ αiyi = 0 and αi ≥ 0 for i = 1, 2, …, n.
Note
20
Numerical Example (cont'd)
  • or Q(α) = α1 + α2 + α3 - 0.5α1² - 2α2² - 4.5α3²
    + 2α1α2 + 3α1α3 - 6α2α3
  • subject to the constraints -α1 + α2 + α3 = 0, and
  • α1 ≥ 0, α2 ≥ 0, and α3 ≥ 0.
  • Use the MATLAB Optimization Toolbox command
    (a quadprog alternative is sketched below)
  • x = fmincon(qalpha, X0, A, B, Aeq, Beq)
  • The solution is [α1 α2 α3] = [2 2 0], as
    expected.
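A minimal sketch, not part of the original slides: the same dual solved with quadprog, which handles a quadratic objective with linear constraints directly.

```matlab
% Minimal sketch (not from the slides): the dual QP of the numerical example
% solved with quadprog. quadprog minimizes 0.5*a'*H*a + f'*a, so the
% maximization of Q(alpha) is turned into a minimization by negation.
x = [1; 2; 3];   y = [-1; 1; 1];          % toy data of the example
H = (y*y') .* (x*x');                     % H(i,j) = yi*yj*xi*xj
f = -ones(3, 1);                          % corresponds to -sum(alpha)
Aeq = y';  beq = 0;                       % sum(alpha .* y) = 0
lb = zeros(3, 1);                         % alpha >= 0
alpha = quadprog(H, f, [], [], Aeq, beq, lb, []);
% expected: alpha = [2; 2; 0]; then w = sum(alpha.*y.*x) = 2 and
% b = y(2) - w*x(2) = -3, i.e. the decision boundary x = 1.5
```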

21
Non-linear SVMs Feature spaces
  • What if the training set is not linearly
    separable?
  • The original feature space can always be mapped
    to some higher-dimensional feature space where
    the training set is separable

Φ: x → φ(x)
22
Implication of Minimizing ‖w‖
  • Let D denote the diameter of the smallest
    hypersphere that encloses all the input
    training vectors x1, x2, …, xn.
  • The set of optimal hyperplanes described by the
    equation
  • woᵀx + bo = 0
  • has a Vapnik-Chervonenkis (VC) dimension h bounded
    from above as
  • h ≤ min{ ⌈D²/ρ²⌉, p } + 1
  • where p is the dimension of the input vectors,
    and ρ = 2/‖wo‖ is the margin of separation of
    the hyperplanes.
  • The VC dimension determines the complexity of the
    classifier structure, and usually the smaller the
    better.

23
Non-separable Cases
  • Recall that in the linearly separable case, each
    training sample pair (xi, yi) represents a linear
    inequality constraint
  • yi(wᵀxi + b) ≥ 1,   i = 1, 2, …, n        (*)
  • If the training samples are not linearly
    separable, the constraint can be modified to
    yield a soft constraint
  • yi(wᵀxi + b) ≥ 1 - ξi,   i = 1, 2, …, n   (**)
  • ξi, 1 ≤ i ≤ n, are known as slack variables.
  • Note that originally, (*) is a normalized version
    of yi f(xi)/‖w‖ ≥ ρ. With the slack variable
    ξi, that equation becomes yi f(xi)/‖w‖ ≥
    ρ(1 - ξi). Hence, with the slack variable, we allow
    some samples xi to fall within the gap. Moreover, if
    ξi > 1, then the corresponding (xi, yi) is
    misclassified because the sample falls on
    the wrong side of the hyperplane H.

24
Non-Separable Case
  • With the approximated cost function (see below),
    the goal is to maximize ρ (minimize ‖w‖) while
    minimizing Σ ξi (ξi ≥ 0).
  • ξi is not counted if xi is outside the gap and on
    the correct side.
  • 0 < ξi < 1: xi is inside the gap, but on the
    correct side.
  • ξi > 1: xi is on the wrong side (inside or outside
    the gap).
  • Since ξi > 1 implies misclassification, the cost
    function must include a term to minimize the
    number of samples that are misclassified,
    where the weighting constant plays the role of a
    Lagrange multiplier. But this formulation is
    non-convex and a solution is difficult to find
    using existing nonlinear optimization algorithms.
  • Hence, we may instead use an approximated cost
    function

(Figure: the slack variable ξ, with the thresholds 0 and 1)
25
The Kernel Functions
  • If every data point is mapped into a
    high-dimensional space via some transformation
    Φ: x → φ(x), the inner product becomes
    K(xi, xj) = φ(xi)ᵀφ(xj).
  • The linear classifier relies on the inner product
    between vectors, K(xi, xj) = xiᵀxj.
  • A kernel function is a function that corresponds
    to an inner product in some feature space.
  • For each K(xi, xj), check whether K(xi, xj) can be
    written in the form φ(xi)ᵀφ(xj) or not.
  • Example: two-dimensional vectors x = [x1 x2];
    let K(xi, xj) = (1 + xiᵀxj)².
  • Need to show that K(xi, xj) = φ(xi)ᵀφ(xj):
  • K(xi, xj) = (1 + xiᵀxj)²
    = 1 + xi1²xj1² + 2xi1xj1xi2xj2 + xi2²xj2²
      + 2xi1xj1 + 2xi2xj2
    = [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]ᵀ
      [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
    = φ(xi)ᵀφ(xj), where
    φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]

Which kernels are feasible? A kernel must satisfy
Mercer's theorem, i.e. the kernel matrix must be
positive semi-definite. (The example above is checked
numerically in the sketch below.)
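A minimal sketch (not from the slides) that numerically checks the identity above; the vectors xi and xj are arbitrary illustrative values.

```matlab
% Minimal sketch (not from the slides): checking the kernel identity
% K(xi,xj) = (1 + xi'*xj)^2 = phi(xi)'*phi(xj) for the feature map on this slide.
phi = @(x) [1; x(1)^2; sqrt(2)*x(1)*x(2); x(2)^2; sqrt(2)*x(1); sqrt(2)*x(2)];
xi = [0.3; -1.2];  xj = [2.0; 0.7];          % arbitrary two-dimensional vectors
K_direct  = (1 + xi'*xj)^2;                  % kernel evaluated directly
K_feature = phi(xi)' * phi(xj);              % inner product in feature space
fprintf('direct = %.6f, feature-space = %.6f\n', K_direct, K_feature);  % equal
```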
26
Numerical Example XOR Problem
x = [x1, x2]ᵀ. Using K(x, y) = (1 + xᵀy)², one
has φ(x) = [1  x1²  x2²  √2 x1  √2 x2  √2 x1x2]ᵀ
Note: dim φ(x) = 6 > dim x = 2!  Dim(K) = Ns, the
number of support vectors.
27
XOR Problem (Continued)
Note that K(xi, xj) can be calculated directly
without using φ!
The corresponding Lagrange multipliers are
α = (1/8)[1 1 1 1]ᵀ.
Hence the hyperplane is y = wᵀφ(x) = -x1x2
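A minimal sketch (not from the slides) that reproduces the XOR solution numerically with quadprog; it assumes the four training points (±1, ±1) with XOR-style labels.

```matlab
% Minimal sketch (not from the slides): checking the XOR solution numerically.
% Four inputs (+/-1, +/-1) with XOR-style labels and the kernel (1 + x'y)^2.
X = [-1 -1; -1 1; 1 -1; 1 1];                  % one training point per row
y = [-1; 1; 1; -1];
K = (1 + X*X').^2;                             % 4-by-4 kernel matrix
H = (y*y') .* K;
alpha = quadprog(H, -ones(4,1), [], [], y', 0, zeros(4,1), []);
% expected: alpha = [1;1;1;1]/8
b = y(1) - sum(alpha .* y .* K(:,1));          % b from any support vector; here b = 0
f = @(x) sum(alpha .* y .* (1 + X*x).^2) + b;  % kernel expansion of the decision surface
f([0.5; 0.7])                                  % approximately -0.35, i.e. f(x) = -x1*x2
```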
28
The Most Common Kernel Functions
29
SVM Using Nonlinear Kernels
(Diagram: inputs x1, …, xn feed kernel evaluations
K(x, xj) through a nonlinear transform; the kernel
outputs are weighted by w to form the output f)
  • Using a kernel, low-dimensional feature vectors
    are mapped to a high-dimensional (possibly
    infinite-dimensional) kernel feature space where
    the data are likely to be linearly separable.

30
Non-Linear SVM Classifier Mathematically
  • Slack variables ξi can be added to allow
    misclassification of noisy examples.

Find w and b such that
Φ(w, ξ) = ½ wᵀw + C Σ ξi
is minimized subject to, for all (xi, yi):
yi(wᵀφ(xi) + b) ≥ 1 - ξi and ξi ≥ 0, i = 1, 2, …, n.

C is a user-specified positive number.
Using αi (for the margin constraints) and μi (for
ξi ≥ 0) as Lagrange multipliers, the unconstrained
cost function becomes
31
The Lagrange Multipliers
The Kuhn-Tucker necessary conditions in matrix
format:
(1)
By substituting for w and the slack variables in the
Lagrangian function, the dual quadratic problem is
32
The Proposed Suggestions
We note that, for a large number of observations,
the matrix in (1) cannot be stored, so an iterative
solution method is needed; this causes complexity
and running-time problems.
We suggest handling the computational complexity
and running time of system (1) using the following
techniques:
  • Use the large-scale conjugate gradient least
    squares technique, Golub and Van Loan (1989).
  • Simplify the shape of the hyperplane by excluding
    the training samples that contribute to a
    convoluted hyperplane. Thereby, it is possible to
    separate the remaining samples by a less
    convoluted hyperplane.

33
The Conjugate Gradient Algorithm
The conjugate gradient method for solving the
system Ax = b, with A symmetric positive definite,
is given by (Table 1):
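Table 1 itself was not captured in this transcript. The following is a minimal sketch of the standard conjugate gradient iteration for Ax = b with A symmetric positive definite; it may differ in details from the slide's table.

```matlab
% Minimal sketch (not the slide's Table 1): standard conjugate gradient for A*x = b,
% with A symmetric positive definite.
function x = conjgrad(A, b, tol, maxit)
    x = zeros(size(b));
    r = b - A*x;                         % residual
    p = r;                               % initial search direction
    rs_old = r'*r;
    for k = 1:maxit
        Ap = A*p;
        alpha = rs_old / (p'*Ap);        % step length
        x = x + alpha*p;                 % update the iterate
        r = r - alpha*Ap;                % update the residual
        rs_new = r'*r;
        if sqrt(rs_new) < tol, break; end
        p = r + (rs_new/rs_old)*p;       % new A-conjugate direction
        rs_old = rs_new;
    end
end
% In practice, MATLAB's built-in pcg(A, b, tol, maxit) does the same job.
```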
34
The advantage of the conjugate gradient method is
that if the matrix A is symmetric positive
definite and A = I + t, where rank(t) = d, then
the conjugate gradient algorithm in Table (1)
converges in at most d + 1 steps. Therefore, the
system (1) can be written in the form
(2)
The coefficient matrices in systems (1) and (2)
are not positive definite. So:
(3)
35
Therefore, the solution of the system (3) is
36
  • We conclude that the solution of each of the
    systems (3) and (1) is
  • Easy to compute, with no need to store the matrix
    A (a matrix-free solve is sketched below).
  • This new technique runs faster, and the
    computational running time is much lower.
  • Moreover, we do not need to compute the inverse of
    the matrix, which is difficult when n is very
    large.
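A minimal sketch (not from the slides) of a matrix-free solve with MATLAB's built-in pcg: the low-rank factor Z, its width d, and the right-hand side are illustrative assumptions standing in for the A = I + t structure described above.

```matlab
% Minimal sketch (not from the slides): matrix-free conjugate gradient with pcg.
% Instead of storing A = I + Z*Z' (n-by-n), pass a function handle that only
% applies A to a vector, so memory stays O(n*d). Z is a hypothetical n-by-d factor.
n = 5000;  d = 10;
Z = randn(n, d);
rhs = randn(n, 1);
Afun = @(v) v + Z*(Z'*v);               % applies (I + Z*Z')*v without forming A
x = pcg(Afun, rhs, 1e-8, 200);          % conjugate gradient, matrix-free
% with rank-d structure, CG converges in at most d+1 iterations in exact arithmetic
```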

37
The MATLAB code for the least squares support
vector machine with an arbitrary kernel, computing
b and α, is written as
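The listing on this slide was not captured in the transcript. Below is a minimal sketch of a least-squares SVM solve in the usual linear-system form; the RBF kernel, the regularization constant gam, and the toy data are assumptions, and this is not the author's original code.

```matlab
% Minimal sketch of an LS-SVM solve for b and alpha (not the author's listing).
% Assumptions: RBF kernel of width sig, regularization gam, toy data X (n-by-p), y in {-1,+1}.
rng(0);
X = [randn(20,2) + 1.5; randn(20,2) - 1.5];     % toy two-class data
y = [ones(20,1); -ones(20,1)];
sig = 1;  gam = 10;
sq = sum(X.^2, 2);
K  = exp(-(sq + sq' - 2*(X*X')) / (2*sig^2));   % kernel matrix
Omega = (y*y') .* K;
n = numel(y);
A = [0,          y';
     y, Omega + eye(n)/gam];                    % LS-SVM linear system in [b; alpha]
sol = A \ [0; ones(n,1)];
b = sol(1);  alpha = sol(2:end);
yhat = sign(K*(alpha .* y) + b);                % fitted labels on the training set
```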
38
Solving the Optimization Problem
  • The dual problem for non-linear SVM
    classification:
  • Again, the xi with non-zero αi will be the support
    vectors. The solution to the dual problem is

Find α1 … αn such that
Q(α) = Σ αi - ½ ΣΣ αiαj yiyj K(xi, xj) is maximized
subject to the constraints (1) Σ αiyi = 0,
(2) 0 ≤ αi ≤ C for all αi

b = yi - Σ αjyj K(xj, xi) for any i ∈ Io,
where Io = { i : 0 < αi < C }

The decision function is
f(x) = sign(Σ αiyi K(xi, x) + b)
(4)
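A minimal sketch (not from the slides) of solving this box-constrained dual with quadprog and recovering b from the unbounded support vectors in Io; the RBF kernel, the value of C, and the toy data are illustrative assumptions.

```matlab
% Minimal sketch (not from the slides): kernel dual with box constraint 0 <= alpha <= C,
% then b recovered from the unbounded support vectors Io = {i : 0 < alpha_i < C}.
rng(1);
X = [randn(25,2) + 1; randn(25,2) - 1];          % toy two-class data (assumed)
y = [ones(25,1); -ones(25,1)];
C = 10;  sig = 1;
sq = sum(X.^2, 2);
K = exp(-(sq + sq' - 2*(X*X')) / (2*sig^2));     % RBF kernel matrix
n = numel(y);
H = (y*y') .* K;
alpha = quadprog(H, -ones(n,1), [], [], y', 0, zeros(n,1), C*ones(n,1));
Io = find(alpha > 1e-6 & alpha < C - 1e-6);      % unbounded support vectors
b = mean(y(Io) - K(Io,:)*(alpha .* y));          % average over Io for stability
f = @(x) sign(sum(alpha .* y .* exp(-sum((X - x).^2, 2) / (2*sig^2))) + b);
f([0.5, 0.5])                                    % classify a new point (row vector)
```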
39
Multiclass Least squares SVMs Classifier With The
Conjugate Gradient Algorithm
Consider a training set drawn from c > 2
categories. We use one output unit per class; the
ith output unit corresponds to class k, for all
k = 1, 2, …, c. The support vector machines
optimization problem can be written as
40
The corresponding Lagrangian can be formulated as
(5)
41
The necessary conditions for the Lagrangian in (5)
are given by
(6)
42
(No Transcript)
43
As in the binary classification case, we use the
conjugate gradient least squares technique to
solve the system (6). We need to modify the
structure of both coefficient matrices, and we
should not store the full matrix, due to its
computational cost and complexity. Moreover, the
matrices in (6) need to be reformatted so that the
two linear subsystems of equations have positive
definite matrices. Thus, the SVM classifier
algorithm is
44
(No Transcript)
45
(No Transcript)
46
Model Selection
  • We select the best model as follows:
  • Selection methods
  • Backward-Forward (BF)
  • Forward-Backward (FB)
  • Quality criterion: we use the minimum description
    length (MDL), that is,

( 7 )
MDL = goodness of fit + penalty for complexity
The model with the smallest MDL is the best model.
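Equation (7) itself was not captured in the transcript. A commonly used two-part MDL criterion, shown here only as an illustrative assumption, adds the negative log-likelihood (goodness of fit) to 0.5·k·log(n) (penalty for complexity):

```matlab
% Hedged sketch: equation (7) was not transcribed; a common two-part MDL form is
%   MDL = -log-likelihood (goodness of fit) + 0.5*k*log(n) (penalty for complexity),
% where k is the number of model parameters and n the number of training objects.
mdl = @(negLogLik, k, n) negLogLik + 0.5 * k * log(n);
% example: compare two candidate models and keep the one with the smaller MDL
mdl_BF = mdl(120.4, 8, 300);     % hypothetical backward-forward model
mdl_FB = mdl(118.9, 12, 300);    % hypothetical forward-backward model
if mdl_BF < mdl_FB, best = 'BF'; else, best = 'FB'; end
```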
47
Model Validation
We validate the classification model using two
methods:
1. Internal Validation
  • 1. The correct classification rate (CCR)
  • 2. The average of the squared classification
    errors (SSCE)

( 8 )
where the count in (8) is the number of correctly
classified observations in class k. The model with
the highest CCR has the better performance.
( 9 )
The model with the smallest SSCE has the better
performance.
Note: CCR and SSCE are computed on the training
set.
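Equations (8) and (9) were not captured in the transcript. A straightforward reading of the two measures, under the assumption that true labels and predictions are stored as vectors ytrue and ypred, is:

```matlab
% Hedged sketch: CCR is described as the correct classification rate, SSCE as the
% average of the squared classification errors; ytrue and ypred are hypothetical.
ytrue = [1; 1; -1; -1; 1];                   % hypothetical labels
ypred = [1; -1; -1; -1; 1];                  % hypothetical model outputs
CCR  = mean(ypred == ytrue);                 % fraction classified correctly
SSCE = mean((ypred - ytrue).^2);             % average squared classification error
fprintf('CCR = %.2f, SSCE = %.2f\n', CCR, SSCE);
```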
48
2. External Validation
1. The average of CCR over all runs  2. The
average of SSCE over all runs  3. The average of
MDL over all runs  4. The average of execution
time over all runs
Note: CCR, SSCE, MDL, and time are computed on the
testing set.
We discuss all of these in the real applications
and the simulation study.
49
(No Transcript)
50
(No Transcript)
51
(No Transcript)
52
SVM implementations I.
  • SVMlight - satyr.net2.private/usr/local/bin
  • svm_learn, svm_classify
  • bsvm - satyr.net2.private/usr/local/bin
  • svm-train, svm-classify, svm-scale
  • libsvm - satyr.net2.private/usr/local/bin
  • svm-train, svm-predict, svm-scale, svm-toy
  • mySVM
  • MATLAB svm toolbox
  • Differences: available kernel functions,
    optimization, multi-class support, user interfaces

53
SVM implementations II.
  • SVMlight
  • Simple text data format
  • Fast, C routines
  • bsvm
  • Multiple class.
  • LIBSVM
  • GUI svm-toy
  • MATLAB svm toolbox
  • Graphical interface 2D

54
Thalassemias Data
p = 4, c = 3
55
Performance Comparison: SVM-CGA vs. SVM-NP vs. SMO
(Charts: number of kernel evaluations (×10^7);
margin, Thalassemias data)
56
Conclusions and Future Work
  • The performance of the new SVM classifier on the
    applications (fraud detection, Wisconsin breast
    cancer, thalassemias, Fisher iris data, and
    bioinformatics) is robust and efficient with
    respect to the parameters of the algorithm.
  • Future work: apply the new SVM technique to
    determine the best path for oil extraction and to
    estimate oil reserves.
  • Use the SVM kernel function with the CGA
    criterion to design a neural network model with
    bounded weights for pattern classification.
  • Test the methodology on Internet security and
    encryption.

57
That's All Folks
If we all do a little, we will do a lot
Thank you