Title: Support Vector Machines
1. Support Vector Machines
Summer Course: Data Mining
Support Vector Machines and other penalization classifiers
- Presenter: Georgi Nalbantov
August 2009
2. Contents
- Purpose
- Linear Support Vector Machines
- Nonlinear Support Vector Machines
- (Theoretical justifications of SVM)
- Marketing Examples
- Other penalization classification methods
- Conclusion and Q&A
- (some extensions)
3. Purpose
- Task to be solved (The Classification Task): classify cases (customers) into type 1 or type 2 on the basis of some known attributes (characteristics)
- Chosen tool to solve this task: Support Vector Machines
4. The Classification Task
- Given data on explanatory and explained variables, where the explained variable can take the two values -1 and +1, find a function that gives the best separation between the -1 cases and the +1 cases:
  - Given (x1, y1), ..., (xm, ym) ∈ R^n × {-1, +1}
  - Find f: R^n → {-1, +1}
  - "Best" function: the expected error on unseen data (xm+1, ym+1), ..., (xm+k, ym+k) is minimal
- Existing techniques to solve the classification task:
  - Linear and Quadratic Discriminant Analysis
  - Logit choice models (Logistic Regression)
  - Decision trees, Neural Networks, Least Squares SVM
5. Support Vector Machines: Definition
- Support Vector Machines are a non-parametric tool for classification/regression
- Support Vector Machines are used for prediction rather than description purposes
- Support Vector Machines have been developed by Vapnik and co-workers
6. Linear Support Vector Machines
- A direct marketing company wants to sell a new book: "The Art History of Florence"
- Nissan Levin and Jacob Zahavi, in Lattin, Carroll and Green (2003)
- Problem: how to identify buyers and non-buyers using two variables:
  - Months since last purchase
  - Number of art books purchased
[Figure: scatter plot of buyers vs non-buyers; x-axis: months since last purchase, y-axis: number of art books purchased]
7. Linear SVM: Separable Case
- Main idea of SVM: separate the groups by a line.
- However, there are infinitely many lines that have zero training error: which line shall we choose?
8. Linear SVM: Separable Case
- SVM use the idea of a margin around the separating line.
- The thinner the margin, the more complex the model.
- The best line is the one with the largest margin.
9. Linear SVM: Separable Case
- The line having the largest margin is w1·x1 + w2·x2 + b = 0
- Where:
  - x1 = months since last purchase
  - x2 = number of art books purchased
- Note:
  - w1·xi1 + w2·xi2 + b ≥ +1 for each buyer i
  - w1·xj1 + w2·xj2 + b ≤ -1 for each non-buyer j
[Figure: separating line w1·x1 + w2·x2 + b = 0 with margin boundaries w1·x1 + w2·x2 + b = +1 and w1·x1 + w2·x2 + b = -1]
10. Linear SVM: Separable Case
- The width of the margin is given by 2 / ||w||, where ||w|| = sqrt(w1^2 + w2^2)
- Note: maximizing the margin is therefore equivalent to minimizing ||w||
[Figure: the margin of width 2 / ||w|| between the lines w1·x1 + w2·x2 + b = +1 and w1·x1 + w2·x2 + b = -1]
11. Linear SVM: Separable Case
- The optimization problem for SVM is: minimize (1/2)·||w||^2 (equivalently, maximize the margin 2 / ||w||)
- subject to:
  - w1·xi1 + w2·xi2 + b ≥ +1 for each buyer i
  - w1·xj1 + w2·xj2 + b ≤ -1 for each non-buyer j
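To make the optimization problem above concrete, here is a minimal sketch (Python with scikit-learn) that fits a linear SVM on a handful of made-up points and reads off the margin width 2 / ||w|| from the previous slide; the data, and the very large C used to approximate the hard-margin (separable) case, are illustrative assumptions rather than the slides' own example.

```python
# Minimal sketch: hard-margin-style linear SVM on made-up, separable data.
import numpy as np
from sklearn.svm import SVC

# Illustrative points: x1 = months since last purchase, x2 = art books purchased
X = np.array([[2.0, 3.0], [1.0, 4.0], [3.0, 3.5],      # buyers (+1)
              [8.0, 0.0], [10.0, 1.0], [9.0, 0.5]])    # non-buyers (-1)
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin (separable) formulation
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                    # (w1, w2) of the line w1*x1 + w2*x2 + b = 0
b = clf.intercept_[0]
print("w =", w, "b =", b)
print("margin width =", 2.0 / np.linalg.norm(w))
```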
12. Linear SVM: Separable Case
- Support vectors are those points that lie on the boundaries of the margin
- The decision surface (line) is determined only by the support vectors; all other points are irrelevant
[Figure: the support vectors lying on the two margin boundaries]
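The role of the support vectors can be checked directly in code. In the hedged sketch below (same made-up data as above), the fitted model exposes which points are support vectors, and refitting on just those points yields essentially the same separating line, up to solver tolerance.

```python
# Sketch: only the support vectors determine the separating line.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 3.0], [1.0, 4.0], [3.0, 3.5],
              [8.0, 0.0], [10.0, 1.0], [9.0, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
sv = clf.support_                                        # indices of the support vectors
clf_sv = SVC(kernel="linear", C=1e6).fit(X[sv], y[sv])   # refit on support vectors only

print("support vectors:\n", clf.support_vectors_)
print("w from all points:          ", clf.coef_[0])
print("w from support vectors only:", clf_sv.coef_[0])   # (numerically) the same line
```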
13. Linear SVM: Nonseparable Case
- Non-separable case: there is no line that separates the two groups without errors
- Here, SVM minimizes L(w, C) = Complexity + Errors = (1/2)·||w||^2 + C·Σ ξ
- subject to:
  - w1·xi1 + w2·xi2 + b ≥ +1 - ξi for each buyer i
  - w1·xj1 + w2·xj2 + b ≤ -1 + ξj for each non-buyer j
  - ξi, ξj ≥ 0
- Training set: 1000 targeted customers
[Figure: overlapping buyers and non-buyers; points inside the margin or on the wrong side of it receive positive slack ξ]
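As a small illustration of the objective just stated, the sketch below evaluates L(w, C) = (1/2)·||w||^2 + C·Σ ξ for an arbitrary candidate line on made-up data; the slack of each observation is ξ = max(0, 1 - y·(w·x + b)), so points beyond their margin boundary contribute zero and violating points contribute their distance past it. The particular w, b, C and data are illustrative choices, not the slides' values.

```python
# Sketch: decompose the soft-margin objective into Complexity + Errors.
import numpy as np

X = np.array([[2.0, 3.0], [1.0, 4.0], [9.0, 0.5],     # labelled +1 (last one is an outlier)
              [8.0, 0.0], [10.0, 1.0], [3.0, 3.5]])   # labelled -1 (last one is an outlier)
y = np.array([1, 1, 1, -1, -1, -1])

w = np.array([-0.3, 0.3])    # illustrative candidate line w1*x1 + w2*x2 + b = 0
b = 1.0
C = 5.0

slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # one slack xi >= 0 per observation
complexity = 0.5 * np.dot(w, w)                  # (1/2) * ||w||^2
errors = slack.sum()                             # total margin violation

print("slacks:", slack)
print("L(w, C) =", complexity + C * errors)
```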
14. Linear SVM: The Role of C
- Smaller C: decreased complexity (wider margin), larger number of errors (worse fit on the data)
- Larger C: increased complexity (thinner margin), smaller number of errors (better fit on the data)
- Varying C thus varies both complexity and empirical error, by affecting the optimal w and the optimal number of training errors
[Figure: decision lines and margins for different values of C (e.g. C = 5)]
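The trade-off described above can be seen numerically in the hedged sketch below: on a synthetic two-class data set (an assumption, not the book-buyers data), a small C produces a wider margin but more training errors, while a large C does the opposite.

```python
# Sketch: how C trades margin width against training errors.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    train_errors = int((clf.predict(X) != y).sum())
    print(f"C = {C:>6}: margin width = {margin:.2f}, training errors = {train_errors}")
```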
15. Bias-Variance Trade-off
16. From Regression into Classification
- We have a linear model, such as y = w·x + b
- We have to estimate this relation using our training data set, having in mind the so-called accuracy, or 0-1 loss function (our evaluation criterion).
- The training data set we have consists of only a limited number of observations, for instance:
[Figure: training data]
17. From Regression into Classification
- The same linear model and 0-1 loss function as on the previous slide, with the training data now plotted.
[Figure: training data with x on the horizontal axis and y taking the values -1 and +1 on the vertical axis]
18. From Regression into Classification: Support Vector Machines
- A flatter line corresponds to a greater penalization
[Figure: a flatter fitted line through the -1/+1 data, with the margin indicated]
19. From Regression into Classification: Support Vector Machines
- A flatter line corresponds to a greater penalization
- Equivalently: a smaller slope corresponds to a bigger margin
[Figure: the same relation shown both in the (x, y) view and in the (x1, x2) input space, with the margin indicated]
20. Nonlinear SVM: Nonseparable Case
- Mapping into a higher-dimensional space
- Optimization task: minimize L(w, C), subject to the same margin constraints as in the linear case, now written in the transformed space
[Figure: data in the original (x1, x2) space that cannot be separated by a straight line]
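The sketch below illustrates the mapping idea with synthetic data and a hand-picked feature map (both assumptions): points on an inner disc versus an outer ring cannot be split by a line in the original (x1, x2) space, but become linearly separable once the extra coordinate x1^2 + x2^2 is added.

```python
# Sketch: an explicit map into a higher-dimensional space makes the data separable.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1.5, 1, -1)    # inner disc vs outer ring

linear_2d = SVC(kernel="linear", C=1.0).fit(X, y)
print("training accuracy in the original space:   ", linear_2d.score(X, y))

X3 = np.column_stack([X, X[:, 0]**2 + X[:, 1]**2])    # explicit map into R^3
linear_3d = SVC(kernel="linear", C=1.0).fit(X3, y)
print("training accuracy in the transformed space:", linear_3d.score(X3, y))
```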
21. Nonlinear SVM: Nonseparable Case
- Map the data into a higher-dimensional space, e.g. from R^2 into R^3
[Figure: the mapping illustrated along the x1 axis; the point (1, -1) is shown after the transformation]
22. Nonlinear SVM: Nonseparable Case
- Find the optimal hyperplane in the transformed space
[Figure: separating hyperplane in the transformed space]
23. Nonlinear SVM: Nonseparable Case
- Observe the decision surface in the original space (optional)
[Figure: the resulting nonlinear decision surface in the original (x1, x2) space]
24. Nonlinear SVM: Nonseparable Case
- Dual formulation of the (primal) SVM minimization problem
- Primal: minimize (1/2)·||w||^2 + C·Σi ξi, subject to yi·(w·xi + b) ≥ 1 - ξi and ξi ≥ 0
- Dual: maximize Σi αi - (1/2)·Σi Σj αi·αj·yi·yj·(xi·xj), subject to 0 ≤ αi ≤ C and Σi αi·yi = 0
25. Nonlinear SVM: Nonseparable Case
- Dual formulation of the (primal) SVM minimization problem
- Dual: maximize Σi αi - (1/2)·Σi Σj αi·αj·yi·yj·k(xi, xj), subject to 0 ≤ αi ≤ C and Σi αi·yi = 0
- k(xi, xj) is the kernel function: it replaces the dot product xi·xj with a dot product in the transformed space
26. Nonlinear SVM: Nonseparable Case
- The same dual formulation, with the kernel function k(xi, xj) in place of the dot product (see the code sketch below)
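The code sketch referred to above shows the point of the kernel function: the dual only ever needs the values k(xi, xj), not the transformed points themselves. Here the RBF kernel k(xi, xj) = exp(-γ·||xi - xj||^2) is computed explicitly as a Gram matrix and passed to the SVM, which matches using the built-in RBF kernel; the data set and the value γ = 0.5 are illustrative assumptions.

```python
# Sketch: training the SVM from kernel values only (precomputed Gram matrix).
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
gamma = 0.5

K = rbf_kernel(X, X, gamma=gamma)                       # m x m matrix of k(xi, xj)
svm_precomputed = SVC(kernel="precomputed", C=1.0).fit(K, y)
svm_builtin = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

print("training accuracy, precomputed kernel: ", svm_precomputed.score(K, y))
print("training accuracy, built-in RBF kernel:", svm_builtin.score(X, y))
```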
27. Strengths and Weaknesses of SVM
- Strengths of SVM:
  - Training is relatively easy
  - No local minima
  - It scales relatively well to high-dimensional data
  - The trade-off between classifier complexity and error can be controlled explicitly via C
  - Robustness of the results
  - The curse of dimensionality is avoided
- Weaknesses of SVM:
  - What is the best trade-off parameter C?
  - A good transformation of the original space is needed
28. The Ketchup Marketing Problem
- Two types of ketchup: Heinz and Hunts
- Seven attributes:
  - Feature Heinz
  - Feature Hunts
  - Display Heinz
  - Display Hunts
  - Feature*Display Heinz
  - Feature*Display Hunts
  - Log price difference between Heinz and Hunts
- Training data: 2498 cases (Heinz is chosen in 89.11% of them)
- Test data: 300 cases (Heinz is chosen in 88.33% of them)
29. The Ketchup Marketing Problem
- Cross-validation mean squared errors, SVM with RBF kernel
- Do a (5-fold) cross-validation procedure to find the best combination of the manually adjustable parameters (here C and σ)
[Figure: grids of cross-validation errors over C and σ for the linear, polynomial and RBF kernels, with the minimum and maximum errors indicated]
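A hedged sketch of this tuning step is given below: a 5-fold cross-validated grid search over C and the RBF width (scikit-learn's gamma plays the role of the slide's σ). The synthetic seven-feature data set stands in for the ketchup attributes, the grid values are arbitrary assumptions, and scikit-learn scores classifiers by accuracy rather than the slide's mean squared error.

```python
# Sketch: 5-fold cross-validated grid search over C and the RBF kernel width.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=7, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best (C, gamma):", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```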
30. The Ketchup Marketing Problem: Training Set
31. The Ketchup Marketing Problem: Training Set
32. The Ketchup Marketing Problem: Training Set
33. The Ketchup Marketing Problem: Training Set
34. The Ketchup Marketing Problem: Test Set
35. The Ketchup Marketing Problem: Test Set
36. The Ketchup Marketing Problem: Test Set
37. Part II
- Penalized classification and regression methods
- Support Hyperplanes
- Nearest Convex Hull classifier
- Soft Nearest Neighbor
- Application: an example Support Vector Regression financial study
- Conclusion
38. Classification: Support Hyperplanes
- Consider a (separable) binary classification case: training data (+, -) and a test point x.
- There are infinitely many hyperplanes that are semi-consistent (i.e. commit no error) with the training data.
39. Classification: Support Hyperplanes
- For the classification of the test point x, use the farthest-away hyperplane that is semi-consistent with the training data.
- The SH decision surface: each point on it has two support hyperplanes.
40. Classification: Support Hyperplanes
- Toy problem: experiment with Support Hyperplanes and Support Vector Machines
41. Classification
- Support Vector Machines and Support Hyperplanes
42. Classification
- Support Vector Machines and Nearest Convex Hull classification
43. Classification
- Support Vector Machines and Soft Nearest Neighbor
44. Classification: Support Hyperplanes
- Support Hyperplanes (with bigger penalization)
45. Classification: Nearest Convex Hull Classification
- Nearest Convex Hull classification (with bigger penalization)
46. Classification: Soft Nearest Neighbor
- Soft Nearest Neighbor (with bigger penalization)
47. Classification: Support Vector Machines, Nonseparable Case
48. Classification: Support Hyperplanes, Nonseparable Case
49. Classification: Nearest Convex Hull Classification, Nonseparable Case
50. Classification: Soft Nearest Neighbor, Nonseparable Case
51. Summary: Penalization Techniques for Classification
- Penalization methods for classification: Support Vector Machines (SVM), Support Hyperplanes (SH), Nearest Convex Hull classification (NCH), and Soft Nearest Neighbour (SNN). In all cases, the classification of a test point x is determined using the hyperplane h. Equivalently, x is labelled +1 (-1) if it is farther away from the set S- (S+).
52. Conclusion
- Support Vector Machines (SVM) can be applied to binary and multi-class classification problems
- SVM behave robustly in multivariate problems
- Further research in various marketing areas is needed to justify or refute the applicability of SVM
- Support Vector Regression (SVR) can also be applied
- http://www.kernel-machines.org
- Email: nalbantov_at_few.eur.nl