Title: Support Vector Machines
1. Support Vector Machines
Summer Course: Data Mining
Support Vector Machines and other penalization classifiers
- Presenter: Georgi Nalbantov
August 2009
2. Contents
- Purpose
- Linear Support Vector Machines
- Nonlinear Support Vector Machines
- (Theoretical justifications of SVM)
- Marketing Examples
- Other penalization classification methods
- Conclusion and Q&A
- (some extensions)
3. Purpose
- Task to be solved (The Classification Task): classify cases (customers) into type 1 or type 2 on the basis of some known attributes (characteristics)
- Chosen tool to solve this task: Support Vector Machines
4. The Classification Task
- Given data on explanatory and explained variables, where the explained variable can take the two values -1 and +1, find a function that gives the best separation between the -1 cases and the +1 cases:
  - Given (x1, y1), ..., (xm, ym) ∈ R^n × {-1, +1}
  - Find f: R^n → {-1, +1}
  - "Best" function: the expected error on unseen data (xm+1, ym+1), ..., (xm+k, ym+k) is minimal
- Existing techniques to solve the classification task:
  - Linear and Quadratic Discriminant Analysis
  - Logit choice models (Logistic Regression)
  - Decision trees, Neural Networks, Least Squares SVM
5. Support Vector Machines: Definition
- Support Vector Machines are a non-parametric tool for classification/regression
- Support Vector Machines are used for prediction rather than description purposes
- Support Vector Machines have been developed by Vapnik and co-workers
6. Linear Support Vector Machines
- A direct marketing company wants to sell a new book: "The Art History of Florence"
- Nissan Levin and Jacob Zahavi, in Lattin, Carroll and Green (2003)
- Problem: how to identify buyers and non-buyers using two variables:
  - Months since last purchase
  - Number of art books purchased
[Figure: scatter plot of buyers vs non-buyers; x-axis: months since last purchase, y-axis: number of art books purchased]
7. Linear SVM: Separable Case
- Main idea of SVM: separate the groups by a line.
- However, there are infinitely many lines that have zero training error: which line shall we choose?
8. Linear SVM: Separable Case
- SVM use the idea of a margin around the separating line.
- The thinner the margin, the more complex the model.
- The best line is the one with the largest margin.
9. Linear SVM: Separable Case
- The line having the largest margin is w1·x1 + w2·x2 + b = 0
- Where:
  - x1 = months since last purchase
  - x2 = number of art books purchased
- Note:
  - w1·xi1 + w2·xi2 + b ≥ +1 for each buyer i
  - w1·xj1 + w2·xj2 + b ≤ -1 for each non-buyer j
[Figure: separating line w1·x1 + w2·x2 + b = 0 with margin boundaries w1·x1 + w2·x2 + b = +1 and w1·x1 + w2·x2 + b = -1]
10. Linear SVM: Separable Case
- The width of the margin is given by 2 / ||w||, where ||w|| = sqrt(w1^2 + w2^2)
- Note: maximizing the margin is therefore equivalent to minimizing ||w||
[Figure: the margin of width 2 / ||w|| between the lines w1·x1 + w2·x2 + b = +1 and w1·x1 + w2·x2 + b = -1]
11. Linear SVM: Separable Case
- The optimization problem for SVM is: minimize (1/2)·||w||^2 (equivalently, maximize the margin 2 / ||w||)
- subject to:
  - w1·xi1 + w2·xi2 + b ≥ +1 for each buyer i
  - w1·xj1 + w2·xj2 + b ≤ -1 for each non-buyer j
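To make the optimization problem above concrete, here is a minimal sketch (Python with scikit-learn) that fits a linear SVM on a handful of made-up points and reads off the margin width 2 / ||w|| from the previous slide; the data, and the very large C used to approximate the hard-margin (separable) case, are illustrative assumptions rather than the slides' own example.

```python
# Minimal sketch: hard-margin-style linear SVM on made-up, separable data.
import numpy as np
from sklearn.svm import SVC

# Illustrative points: x1 = months since last purchase, x2 = art books purchased
X = np.array([[2.0, 3.0], [1.0, 4.0], [3.0, 3.5],      # buyers (+1)
              [8.0, 0.0], [10.0, 1.0], [9.0, 0.5]])    # non-buyers (-1)
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin (separable) formulation
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                    # (w1, w2) of the line w1*x1 + w2*x2 + b = 0
b = clf.intercept_[0]
print("w =", w, "b =", b)
print("margin width =", 2.0 / np.linalg.norm(w))
```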
12. Linear SVM: Separable Case
- Support vectors are those points that lie on the boundaries of the margin
- The decision surface (line) is determined only by the support vectors; all other points are irrelevant
[Figure: the support vectors lying on the two margin boundaries]
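The role of the support vectors can be checked directly in code. In the hedged sketch below (same made-up data as above), the fitted model exposes which points are support vectors, and refitting on just those points yields essentially the same separating line, up to solver tolerance.

```python
# Sketch: only the support vectors determine the separating line.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 3.0], [1.0, 4.0], [3.0, 3.5],
              [8.0, 0.0], [10.0, 1.0], [9.0, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
sv = clf.support_                                        # indices of the support vectors
clf_sv = SVC(kernel="linear", C=1e6).fit(X[sv], y[sv])   # refit on support vectors only

print("support vectors:\n", clf.support_vectors_)
print("w from all points:          ", clf.coef_[0])
print("w from support vectors only:", clf_sv.coef_[0])   # (numerically) the same line
```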
13. Linear SVM: Nonseparable Case
- Non-separable case: there is no line that separates the two groups without errors
- Here, SVM minimizes L(w, C) = Complexity + Errors = (1/2)·||w||^2 + C·Σ ξ
- subject to:
  - w1·xi1 + w2·xi2 + b ≥ +1 - ξi for each buyer i
  - w1·xj1 + w2·xj2 + b ≤ -1 + ξj for each non-buyer j
  - ξi, ξj ≥ 0
- Training set: 1000 targeted customers
[Figure: overlapping buyers and non-buyers; points inside the margin or on the wrong side of it receive positive slack ξ]
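As a small illustration of the objective just stated, the sketch below evaluates L(w, C) = (1/2)·||w||^2 + C·Σ ξ for an arbitrary candidate line on made-up data; the slack of each observation is ξ = max(0, 1 - y·(w·x + b)), so points beyond their margin boundary contribute zero and violating points contribute their distance past it. The particular w, b, C and data are illustrative choices, not the slides' values.

```python
# Sketch: decompose the soft-margin objective into Complexity + Errors.
import numpy as np

X = np.array([[2.0, 3.0], [1.0, 4.0], [9.0, 0.5],     # labelled +1 (last one is an outlier)
              [8.0, 0.0], [10.0, 1.0], [3.0, 3.5]])   # labelled -1 (last one is an outlier)
y = np.array([1, 1, 1, -1, -1, -1])

w = np.array([-0.3, 0.3])    # illustrative candidate line w1*x1 + w2*x2 + b = 0
b = 1.0
C = 5.0

slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # one slack xi >= 0 per observation
complexity = 0.5 * np.dot(w, w)                  # (1/2) * ||w||^2
errors = slack.sum()                             # total margin violation

print("slacks:", slack)
print("L(w, C) =", complexity + C * errors)
```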
14. Linear SVM: The Role of C
- Smaller C: decreased complexity (wider margin), larger number of errors (worse fit on the data)
- Larger C: increased complexity (thinner margin), smaller number of errors (better fit on the data)
- Varying C thus varies both complexity and empirical error, by affecting the optimal w and the optimal number of training errors
[Figure: decision lines and margins for different values of C (e.g. C = 5)]
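The trade-off described above can be seen numerically in the hedged sketch below: on a synthetic two-class data set (an assumption, not the book-buyers data), a small C produces a wider margin but more training errors, while a large C does the opposite.

```python
# Sketch: how C trades margin width against training errors.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    train_errors = int((clf.predict(X) != y).sum())
    print(f"C = {C:>6}: margin width = {margin:.2f}, training errors = {train_errors}")
```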
15. Bias-Variance Trade-off
16. From Regression into Classification
- We have a linear model, such as y = w·x + b
- We have to estimate this relation using our training data set, having in mind the so-called accuracy, or 0-1 loss function (our evaluation criterion).
- The training data set we have consists of only a limited number of observations, for instance:
[Figure: training data]
17. From Regression into Classification
- The same linear model and 0-1 loss function as on the previous slide, with the training data now plotted.
[Figure: training data with x on the horizontal axis and y taking the values -1 and +1 on the vertical axis]
18. From Regression into Classification: Support Vector Machines
- A flatter line corresponds to a greater penalization
[Figure: a flatter fitted line through the -1/+1 data, with the margin indicated]
19. From Regression into Classification: Support Vector Machines
- A flatter line corresponds to a greater penalization
- Equivalently: a smaller slope corresponds to a bigger margin
[Figure: the same relation shown both in the (x, y) view and in the (x1, x2) input space, with the margin indicated]
20. Nonlinear SVM: Nonseparable Case
- Mapping into a higher-dimensional space
- Optimization task: minimize L(w, C), subject to the same margin constraints as in the linear case, now written in the transformed space
[Figure: data in the original (x1, x2) space that cannot be separated by a straight line]
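The sketch below illustrates the mapping idea with synthetic data and a hand-picked feature map (both assumptions): points on an inner disc versus an outer ring cannot be split by a line in the original (x1, x2) space, but become linearly separable once the extra coordinate x1^2 + x2^2 is added.

```python
# Sketch: an explicit map into a higher-dimensional space makes the data separable.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1.5, 1, -1)    # inner disc vs outer ring

linear_2d = SVC(kernel="linear", C=1.0).fit(X, y)
print("training accuracy in the original space:   ", linear_2d.score(X, y))

X3 = np.column_stack([X, X[:, 0]**2 + X[:, 1]**2])    # explicit map into R^3
linear_3d = SVC(kernel="linear", C=1.0).fit(X3, y)
print("training accuracy in the transformed space:", linear_3d.score(X3, y))
```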
21. Nonlinear SVM: Nonseparable Case
- Map the data into a higher-dimensional space, e.g. from R^2 into R^3
[Figure: the mapping illustrated along the x1 axis; the point (1, -1) is shown after the transformation]
22. Nonlinear SVM: Nonseparable Case
- Find the optimal hyperplane in the transformed space
[Figure: separating hyperplane in the transformed space]
23. Nonlinear SVM: Nonseparable Case
- Observe the decision surface in the original space (optional)
[Figure: the resulting nonlinear decision surface in the original (x1, x2) space]
24. Nonlinear SVM: Nonseparable Case
- Dual formulation of the (primal) SVM minimization problem
- Primal: minimize (1/2)·||w||^2 + C·Σi ξi, subject to yi·(w·xi + b) ≥ 1 - ξi and ξi ≥ 0
- Dual: maximize Σi αi - (1/2)·Σi Σj αi·αj·yi·yj·(xi·xj), subject to 0 ≤ αi ≤ C and Σi αi·yi = 0
25. Nonlinear SVM: Nonseparable Case
- Dual formulation of the (primal) SVM minimization problem
- Dual: maximize Σi αi - (1/2)·Σi Σj αi·αj·yi·yj·k(xi, xj), subject to 0 ≤ αi ≤ C and Σi αi·yi = 0
- k(xi, xj) is the kernel function: it replaces the dot product xi·xj with a dot product in the transformed space
26. Nonlinear SVM: Nonseparable Case
- The same dual formulation, with the kernel function k(xi, xj) in place of the dot product (see the code sketch below)
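The code sketch referred to above shows the point of the kernel function: the dual only ever needs the values k(xi, xj), not the transformed points themselves. Here the RBF kernel k(xi, xj) = exp(-γ·||xi - xj||^2) is computed explicitly as a Gram matrix and passed to the SVM, which matches using the built-in RBF kernel; the data set and the value γ = 0.5 are illustrative assumptions.

```python
# Sketch: training the SVM from kernel values only (precomputed Gram matrix).
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
gamma = 0.5

K = rbf_kernel(X, X, gamma=gamma)                       # m x m matrix of k(xi, xj)
svm_precomputed = SVC(kernel="precomputed", C=1.0).fit(K, y)
svm_builtin = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

print("training accuracy, precomputed kernel: ", svm_precomputed.score(K, y))
print("training accuracy, built-in RBF kernel:", svm_builtin.score(X, y))
```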
27. Strengths and Weaknesses of SVM
- Strengths of SVM:
  - Training is relatively easy
  - No local minima
  - It scales relatively well to high-dimensional data
  - The trade-off between classifier complexity and error can be controlled explicitly via C
  - Robustness of the results
  - The curse of dimensionality is avoided
- Weaknesses of SVM:
  - What is the best trade-off parameter C?
  - A good transformation of the original space is needed
28. The Ketchup Marketing Problem
- Two types of ketchup: Heinz and Hunts
- Seven attributes:
  - Feature Heinz
  - Feature Hunts
  - Display Heinz
  - Display Hunts
  - Feature*Display Heinz
  - Feature*Display Hunts
  - Log price difference between Heinz and Hunts
- Training data: 2498 cases (Heinz is chosen in 89.11% of them)
- Test data: 300 cases (Heinz is chosen in 88.33% of them)
29. The Ketchup Marketing Problem
- Cross-validation mean squared errors, SVM with RBF kernel
- Do a (5-fold) cross-validation procedure to find the best combination of the manually adjustable parameters (here C and σ)
[Figure: grids of cross-validation errors over C and σ for the linear, polynomial and RBF kernels, with the minimum and maximum errors indicated]
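A hedged sketch of this tuning step is given below: a 5-fold cross-validated grid search over C and the RBF width (scikit-learn's gamma plays the role of the slide's σ). The synthetic seven-feature data set stands in for the ketchup attributes, the grid values are arbitrary assumptions, and scikit-learn scores classifiers by accuracy rather than the slide's mean squared error.

```python
# Sketch: 5-fold cross-validated grid search over C and the RBF kernel width.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=7, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best (C, gamma):", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```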
30. The Ketchup Marketing Problem: Training Set
31. The Ketchup Marketing Problem: Training Set
32. The Ketchup Marketing Problem: Training Set
33. The Ketchup Marketing Problem: Training Set
34. The Ketchup Marketing Problem: Test Set
35. The Ketchup Marketing Problem: Test Set
36. The Ketchup Marketing Problem: Test Set
37. Part II
- Penalized classification and regression methods
- Support Hyperplanes
- Nearest Convex Hull classifier
- Soft Nearest Neighbor
- Application: an example Support Vector Regression financial study
- Conclusion
38. Classification: Support Hyperplanes
- Consider a (separable) binary classification case: training data (+, -) and a test point x.
- There are infinitely many hyperplanes that are semi-consistent (i.e. commit no error) with the training data.
39. Classification: Support Hyperplanes
- For the classification of the test point x, use the farthest-away hyperplane that is semi-consistent with the training data.
- The SH decision surface: each point on it has two support hyperplanes.
40. Classification: Support Hyperplanes
- Toy problem: experiment with Support Hyperplanes and Support Vector Machines
41. Classification
- Support Vector Machines and Support Hyperplanes
42. Classification
- Support Vector Machines and Nearest Convex Hull classification
43. Classification
- Support Vector Machines and Soft Nearest Neighbor
44. Classification: Support Hyperplanes
- Support Hyperplanes (with bigger penalization)
45. Classification: Nearest Convex Hull Classification
- Nearest Convex Hull classification (with bigger penalization)
46. Classification: Soft Nearest Neighbor
- Soft Nearest Neighbor (with bigger penalization)
47. Classification: Support Vector Machines, Nonseparable Case
48. Classification: Support Hyperplanes, Nonseparable Case
49. Classification: Nearest Convex Hull Classification, Nonseparable Case
50. Classification: Soft Nearest Neighbor, Nonseparable Case
51. Summary: Penalization Techniques for Classification
- Penalization methods for classification: Support Vector Machines (SVM), Support Hyperplanes (SH), Nearest Convex Hull classification (NCH), and Soft Nearest Neighbour (SNN). In all cases, the classification of a test point x is determined using the hyperplane h. Equivalently, x is labelled +1 (-1) if it is farther away from the set S- (S+).
52. Conclusion
- Support Vector Machines (SVM) can be applied to binary and multi-class classification problems
- SVM behave robustly in multivariate problems
- Further research in various marketing areas is needed to justify or refute the applicability of SVM
- Support Vector Regression (SVR) can also be applied
- http://www.kernel-machines.org
- Email: nalbantov_at_few.eur.nl