SVM Classifier Introduction
1
SVM Classifier Introduction
  • Linear SVM (separable data)
  • Linear SVM (nonseparable data)
  • Nonlinear SVM (nonseparable data)

2
(No Transcript)
3
1) Linear SVM (separable data)
  • Hyperplane definition
  • Maximum margin
  • Scaling
  • Final formula

4
Hyperplane definition
  • If the data are linearly separable, a hyperplane
    f(x) = w·x + b = 0 exists such that
    y_i (w·x_i + b) > 0 for every training point (x_i, y_i).

5
||w|| and maximum margin
  • Given a point x and a hyperplane w·x + b, the
    distance from x to the hyperplane is
    |w·x + b| / ||w||, which is a function of ||w||.
  • Finding the hyperplane that classifies the
    training set correctly and has minimum norm
    (minimum ||w||²) therefore means finding the
    hyperplane with maximum margin from the points
    of the training set (see the sketch below).

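A minimal numerical sketch of this distance computation (the hyperplane parameters w, b and the point x below are made-up values for illustration):

    import numpy as np

    # Hypothetical hyperplane f(x) = w.x + b
    w = np.array([2.0, 1.0])
    b = -1.0

    def distance_to_hyperplane(x, w, b):
        """Distance of x from the hyperplane w.x + b = 0: |w.x + b| / ||w||."""
        return abs(np.dot(w, x) + b) / np.linalg.norm(w)

    x = np.array([1.0, 3.0])
    print(distance_to_hyperplane(x, w, b))  # |2*1 + 1*3 - 1| / sqrt(5) ≈ 1.789
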
6
Maximum margin
  • It can be shown that the generalization capacity
    of an SVM grows as the margin grows.
  • So we obtain the maximum generalization when the
    hyperplane has the maximum margin. This is the
    optimal separating hyperplane (OSH).

7
Maximum margin
8
Final formula
  • For separable data the optimal hyperplane solves
    min ½||w||² subject to y_i (w·x_i + b) ≥ 1 for all
    training points (x_i, y_i).

9
Lagrangian solution
  • We must minimize the Lagrangian
    L(w, b, α) = ½||w||² - Σ_i α_i [y_i (w·x_i + b) - 1].
  • This is a QP problem whose solution is
    w = Σ_i α_i y_i x_i,
  • where the α_i ≥ 0 are the Lagrange multipliers.

10
Optimum hyperplane
  • So the final optimum hyperplane is
    f(x) = Σ_i α_i y_i (x_i·x) + b (see the sketch below).

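A small sketch of evaluating this decision function, assuming the Lagrange multipliers, support vectors, their labels and the bias have already been obtained from the QP solution (the numeric values below are purely illustrative):

    import numpy as np

    def decision_function(x, alphas, sv_X, sv_y, b):
        """f(x) = sum_i alpha_i * y_i * (x_i . x) + b over the support vectors."""
        return np.sum(alphas * sv_y * (sv_X @ x)) + b

    sv_X = np.array([[1.0, 1.0], [2.0, 3.0]])  # support vectors (illustrative)
    sv_y = np.array([+1.0, -1.0])              # their labels
    alphas = np.array([0.5, 0.5])              # Lagrange multipliers
    b = 0.2

    x = np.array([1.5, 2.0])
    predicted_class = np.sign(decision_function(x, alphas, sv_X, sv_y, b))
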
11
2) Linear SVM (nonseparable data)
  • Generalization, soft margins, slacks.
  • Formulation

12
Optimum hyperplane generalization
  • In this phase the constraint of exact
    classification is relaxed (so we talk about soft
    margins).
  • We introduce slack variables ξ_i ≥ 0, so the
    constraint becomes y_i (w·x_i + b) ≥ 1 - ξ_i.

13
Optimum hyperplane generalization
  • The value of the slack variable ξ_i tells us where
    the point lies (see the sketch below):
  • ξ_i = 0: beyond the margin, correct classification
  • 0 < ξ_i ≤ 1: inside the margin, correct classification
  • ξ_i > 1: incorrect classification

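The slack of each training point can be read off its functional margin, as in this short sketch (ξ_i = max(0, 1 - y_i f(x_i)); the labels and decision values are invented for illustration):

    import numpy as np

    y = np.array([+1, +1, +1, -1])          # true labels
    f_x = np.array([1.7, 0.4, -0.3, -2.0])  # decision values f(x_i) = w.x_i + b

    xi = np.maximum(0.0, 1.0 - y * f_x)     # slack variables
    # xi = [0.0, 0.6, 1.3, 0.0]:
    #   xi = 0       -> beyond the margin, correct classification
    #   0 < xi <= 1  -> inside the margin, correct classification
    #   xi > 1       -> incorrect classification
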
14
Soft margins and slacks
15
Formulation - Primal
  • min ½||w||² + C Σ_i ξ_i subject to
    y_i (w·x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0.

16
Lagrangian solution
  • We must minimize the Lagrangian.
  • This is a QP problem with the same solution
    w = Σ_i α_i y_i x_i,
  • but now 0 ≤ α_i ≤ C.
  • C manages the trade-off between the size of the
    margin (lower values of C) and the number of
    errors tolerated in the training phase (if C → ∞
    we recover the perfectly separating hyperplane).
    See the sketch below.

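A quick way to see the role of C in practice is to train a linear SVM with different values of C; this sketch uses scikit-learn's SVC on made-up, overlapping data:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Two overlapping blobs: not perfectly separable.
    X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        # Small C -> wide margin, many margin violations (many support vectors);
        # large C -> narrow margin, fewer tolerated training errors.
        print(C, len(clf.support_vectors_), clf.score(X, y))
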
17
Final optimum hyperplane
  • So the final optimum hyperplane has the same form
    as in the separable case: f(x) = Σ_i α_i y_i (x_i·x) + b.

18
3) Nonlinear SVM (nonseparable data)
  • Mapping F(x)
  • Kernel functions
  • Loss functions
  • Formulation

19
Mapping F(x)
  • When the sets are not linearly separable, we
    introduce a mapping Φ(x) into a higher-dimensional
    space in order to obtain linearly separable sets.
  • So instead of increasing the complexity of the
    classifier (it is still a hyperplane), we increase
    the dimensionality of the feature space.

20
Mapping Φ(x)
21
Kernel functions
  • The transformed space can have a very large
    dimension, so the mapping Φ(·) can be very
    expensive to evaluate.
  • In the learning and classification phases we only
    need the scalar product Φ(x)·Φ(y).
  • By Mercer's theorem there exists a kernel function
    K(x,y) such that K(x,y) = Φ(x)·Φ(y).
  • So the discriminant function becomes
    f(x) = Σ_i α_i y_i K(x_i, x) + b.
  • The use of kernel functions therefore avoids the
    explicit mapping into the high-dimensional space
    (see the sketch below).

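A minimal check of the identity K(x,y) = Φ(x)·Φ(y) for the homogeneous degree-2 polynomial kernel in two dimensions, where K(x,y) = (x·y)² and Φ(x) = (x1², √2·x1·x2, x2²):

    import numpy as np

    def phi(x):
        """Explicit feature map of the degree-2 homogeneous polynomial kernel (2D input)."""
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    def kernel(x, y):
        """K(x, y) = (x . y)^2, evaluated without ever computing phi."""
        return np.dot(x, y) ** 2

    x = np.array([1.0, 2.0])
    y = np.array([3.0, 4.0])
    assert np.isclose(kernel(x, y), np.dot(phi(x), phi(y)))  # both equal (1*3 + 2*4)^2 = 121
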
22
Some Kernel functions
  • Linear kernel: K(x,y) = x·y
  • Polynomial kernel: K(x,y) = (x·y + 1)^d
  • Gaussian kernel (RBF, Radial Basis Function):
    K(x,y) = exp(-||x - y||² / 2σ²)
  • MLP kernel (multi-layer perceptron):
    K(x,y) = tanh(κ x·y + θ) (see the sketch below)

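These kernels can be written directly in a few lines; the hyperparameter values (d, c, sigma, kappa, theta) below are illustrative defaults, not prescribed ones:

    import numpy as np

    def linear_kernel(x, y):
        return np.dot(x, y)

    def polynomial_kernel(x, y, d=3, c=1.0):
        return (np.dot(x, y) + c) ** d

    def gaussian_kernel(x, y, sigma=1.0):
        return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

    def mlp_kernel(x, y, kappa=1.0, theta=0.0):
        # Sigmoid / MLP kernel; satisfies Mercer's condition only for some parameter choices.
        return np.tanh(kappa * np.dot(x, y) + theta)
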
23
Loss functions
  • Consider a true multilabel y and a predicted one t.
  • The basic goal is to learn a function f that
    approximates the unknown target function.
  • To evaluate the goodness of the approximation we
    need a loss function l(y,t) denoting the price we
    have to pay for predicting t when the true
    multilabel is y.

24
Loss functions conditions
  • Basic condition on the loss function:
  • l(y,t) should be monotonically decreasing with
    respect to the sets of incorrect multilabels.

25
Final formulation - Primal
  • With a single slack variable for each training
    example.

26
Lagrangian solution
  • Now we must minimize the Lagrangian.
  • This is a QP problem; from its solution we obtain
    the final optimum hyperplane.

27
Learning Hierarchical Multi-Category Text
Classification Models
  • Juho Rousu, Craig Saunders, Sandor Szedmak, John
    Shawe-Taylor

Proceedings of the 22nd International Conference
on Machine Learning (ICML 2005), Bonn, Germany, 2005.
28
Hierarchical Multilabel Classification: union of partial paths model
  • Goal: given a document x and a hierarchy T = (V,E),
    predict a multilabel y = (y_1, ..., y_k) in which
    the positive microlabels y_k form a union of
    partial paths in T.

A news article about David and Victoria Beckham
could belong to different partial paths and might
not belong to any leaf categories.
29
Frequently used learning strategies for hierarchies
  • Flatten the hierarchy: learn each microlabel
    independently with a classification learner of
    your choice (see the sketch below).
  • Computationally relatively inexpensive.
  • Does not make use of the dependencies between
    microlabels.
  • Hierarchical training: train node j with the
    examples (x,y) that belong to its parent.
  • Some of the dependencies between microlabels are
    learned.
  • However, the training data fragments toward the
    leaves, hence the estimation becomes less
    reliable.
  • The model is not explicitly trained in terms of a
    loss function for the hierarchy.
  • The authors try to improve on these approaches.

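A sketch of the "flatten the hierarchy" baseline: one independent classifier per microlabel, here with scikit-learn's LinearSVC as the learner of choice (the data layout, a matrix Y with one ±1 column per microlabel, is an assumption for illustration):

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_flat(X, Y):
        """Train one independent classifier per microlabel (column of Y in {-1, +1})."""
        return [LinearSVC().fit(X, Y[:, j]) for j in range(Y.shape[1])]

    def predict_flat(models, X):
        """Stack the per-node predictions into a multilabel matrix; the hierarchy is ignored."""
        return np.column_stack([m.predict(X) for m in models])
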
30
Multi-classification
  • The multilabel is a union of partial paths in the
    hierarchy.
  • Results are post-processed:
  • if the label of a node is predicted as -1, then
    all descendants of that node are also labelled
    negatively (done to obtain good accuracy).

31
Loss functions for multilabel classifications
  • Consider a true multilabel y and a predicted one t.
  • There are many choices, for example (see the
    sketch below):
  • Zero-one loss: l(y,t) = [y ≠ t]
  • Symmetric difference loss: l(y,t) = Σ_j [y_j ≠ t_j]
  • They don't take the hierarchy into account.

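Both losses are one-liners over the microlabel vectors (a sketch; multilabels are encoded here as ±1 vectors):

    import numpy as np

    def zero_one_loss(y, t):
        """1 if the prediction t differs from the true multilabel y anywhere, else 0."""
        return int(np.any(y != t))

    def symmetric_difference_loss(y, t):
        """Number of microlabels on which y and t disagree."""
        return int(np.sum(y != t))

    y = np.array([+1, +1, -1, -1])
    t = np.array([+1, -1, -1, +1])
    # zero_one_loss(y, t) -> 1, symmetric_difference_loss(y, t) -> 2
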
32
Hierarchical loss functions
  • Goal: take the hierarchy into account.
  • Hierarchical loss: only the first mistake along a
    path is penalized (Cesa-Bianchi et al.). See the
    sketch below.
  • Simplified hierarchical loss: a mistake in a child
    is penalized only if the parent was predicted
    correctly.

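A hedged sketch of the hierarchical loss in this spirit: a node contributes to the loss only if it is the first mistake on its path, i.e. all of its ancestors are predicted correctly. The tree is passed as a child-to-parent map; the function and argument names are hypothetical.

    def hierarchical_loss(y, t, parent, c=None):
        """Penalize node j only if y[j] != t[j] and every ancestor of j is predicted correctly.

        y, t   : dicts node -> label in {-1, +1} (true and predicted multilabels)
        parent : dict node -> parent node (the root maps to None)
        c      : optional dict of per-node down-scaling coefficients c_j (default 1)
        """
        loss = 0.0
        for j in y:
            if y[j] == t[j]:
                continue
            # The mistake counts only if all ancestors are predicted correctly.
            a = parent[j]
            while a is not None and y[a] == t[a]:
                a = parent[a]
            if a is None:  # reached the root without finding an earlier mistake
                loss += (c or {}).get(j, 1.0)
        return loss
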
33
Coefficients c_j
  • The coefficients c_j are used for down-scaling the
    loss when going deeper in the tree. They can be
    chosen in many ways (see the sketch below):
  • Uniform loss
  • Siblings loss
  • Subtree loss

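One plausible way to implement such coefficients (a sketch under assumptions, not necessarily the exact definitions used in the paper): uniform scaling gives every node weight 1, while a sibling-style scaling lets each child inherit an equal share of its parent's coefficient, so the loss shrinks with depth.

    def uniform_coefficients(nodes):
        """Uniform loss: c_j = 1 for every node j."""
        return {j: 1.0 for j in nodes}

    def sibling_coefficients(children, root):
        """Assumed sibling-style scaling: c_root = 1 and each child gets an equal
        share of its parent's coefficient, so c_j decreases with depth."""
        c = {root: 1.0}
        stack = [root]
        while stack:
            p = stack.pop()
            kids = children.get(p, [])
            for k in kids:
                c[k] = c[p] / len(kids)
                stack.append(k)
        return c

    # Example tree: a root with two children, one of which has two children of its own.
    children = {"root": ["A", "B"], "A": ["A1", "A2"]}
    # sibling_coefficients(children, "root")
    #   -> {"root": 1.0, "A": 0.5, "B": 0.5, "A1": 0.25, "A2": 0.25}
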
34
Maximum margin learning
  • The model class is defined on the edges of a
    Markov tree T = (V,E).
  • Φ(x) is the vector representation of the document
    x (bag of words, see the sketch below). In the
    training data some Φ(x) are duplicated with
    different weights.
  • Maximize the ratio between the probability of the
    correct labeling y_i and the worst competing
    labeling y.
  • With the exponential family, the problem
    translates into maximizing the minimum linear
    margin.

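For the document representation Φ(x), a bag-of-words vector can be built with a standard text vectorizer; a sketch with two made-up documents:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "beckham scores in the cup final",
        "victoria beckham launches a fashion line",
    ]
    vectorizer = CountVectorizer()
    Phi = vectorizer.fit_transform(docs)  # sparse matrix: one bag-of-words row per document
    print(vectorizer.get_feature_names_out())
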
35
Optimization problem - Primal
  • Using a single slack variable for each training
    example

36
Optimization problem - Dual
  • Where K is the joint kernel
  • Exponential number (in the size of the hierarchy)
    of primal constraints and dual variables per
    example.

37
Marginalized problem
  • To obtain a polynomial-size problem:
  • Edge-marginals of the dual variables.
  • Loss function decomposed by edges.
  • Kernel decomposed by edges.
  • Conditional gradient descent is used to optimize
    the marginalized problem (a few iterations are
    used to update the variables).

38
Prediction quality
Results on the REUTERS and WIPO datasets:
Flat SVM obtains the highest precision, but the
lowest recall and F1. The F1 values are similar
for all the hierarchical models.
39
References
  • Rousu, J., Saunders, C., Szedmak, S. and
    Shawe-Taylor, J. (2004). On Maximum Margin
    Hierarchical Classification. In Proceedings of the
    Workshop on Learning with Structured Outputs at
    NIPS 2004, Whistler.
  • Rousu, J., Saunders, C., Szedmák, S. and
    Shawe-Taylor, J. (2006). Kernel-Based Learning of
    Hierarchical Multilabel Classification Models.
    Journal of Machine Learning Research, 7:1601-1626.
  • Cesa-Bianchi, N., Gentile, C., Tironi, A. and
    Zaniboni, L. (2004). Incremental Algorithms for
    Hierarchical Classification. Neural Information
    Processing Systems.
  • Cesa-Bianchi, N., Gentile, C. and Zaniboni, L.
    (2006). Incremental Algorithms for Hierarchical
    Classification. Journal of Machine Learning
    Research, 7:31-54.