SVM Classifier Introduction
1
SVM Classifier Introduction
  • Linear SVM (separable data)
  • Linear SVM (nonseparable data)
  • Nonlinear SVM (nonseparable data)

2
(No Transcript)
3
1) Linear SVM (separable data)
  • Hyperplane definition
  • Maximum margin
  • Scaling
  • Final formula

4
Hyperplane definition
  • If the data are linearly separable, a hyperplane
    f(x) = w·x + b = 0 exists such that
    y_i (w·x_i + b) > 0 for every training point (x_i, y_i).

5
||w|| and maximum margin
  • Given a point x and a hyperplane w·x + b, the
    distance from x to the hyperplane is
    |w·x + b| / ||w||, which is a function of ||w||.
  • Finding the hyperplane that classifies the
    training set correctly and has minimum norm
    (minimum ||w||²) therefore means finding the
    hyperplane with maximum margin from the points
    of the training set (see the sketch below).

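A minimal numerical sketch of this distance computation (the hyperplane parameters w, b and the point x below are made-up values for illustration):

    import numpy as np

    # Hypothetical hyperplane f(x) = w.x + b
    w = np.array([2.0, 1.0])
    b = -1.0

    def distance_to_hyperplane(x, w, b):
        """Distance of x from the hyperplane w.x + b = 0: |w.x + b| / ||w||."""
        return abs(np.dot(w, x) + b) / np.linalg.norm(w)

    x = np.array([1.0, 3.0])
    print(distance_to_hyperplane(x, w, b))  # |2*1 + 1*3 - 1| / sqrt(5) ≈ 1.789
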
6
Maximum margin
  • It can be shown that the generalization capacity
    of an SVM grows as the margin grows.
  • So we obtain the maximum generalization when the
    hyperplane has the maximum margin. This is the
    optimal separating hyperplane (OSH).

7
Maximum margin
8
Final formula
  • For separable data the optimal hyperplane solves
    min ½||w||² subject to y_i (w·x_i + b) ≥ 1 for all
    training points (x_i, y_i).

9
Lagrangian solution
  • We must minimize the Lagrangian
    L(w, b, α) = ½||w||² - Σ_i α_i [y_i (w·x_i + b) - 1].
  • This is a QP problem whose solution is
    w = Σ_i α_i y_i x_i,
  • where the α_i ≥ 0 are the Lagrange multipliers.

10
Optimum hyperplane
  • So the final optimum hyperplane is
    f(x) = Σ_i α_i y_i (x_i·x) + b (see the sketch below).

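A small sketch of evaluating this decision function, assuming the Lagrange multipliers, support vectors, their labels and the bias have already been obtained from the QP solution (the numeric values below are purely illustrative):

    import numpy as np

    def decision_function(x, alphas, sv_X, sv_y, b):
        """f(x) = sum_i alpha_i * y_i * (x_i . x) + b over the support vectors."""
        return np.sum(alphas * sv_y * (sv_X @ x)) + b

    sv_X = np.array([[1.0, 1.0], [2.0, 3.0]])  # support vectors (illustrative)
    sv_y = np.array([+1.0, -1.0])              # their labels
    alphas = np.array([0.5, 0.5])              # Lagrange multipliers
    b = 0.2

    x = np.array([1.5, 2.0])
    predicted_class = np.sign(decision_function(x, alphas, sv_X, sv_y, b))
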
11
2) Linear SVM (nonseparable data)
  • Generalization, soft margins, slacks.
  • Formulation

12
Optimum hyperplane generalization
  • In this phase the constraint of exact
    classification is relaxed (so we talk about soft
    margins).
  • We introduce slack variables ξ_i ≥ 0, so the
    constraint becomes y_i (w·x_i + b) ≥ 1 - ξ_i.

13
Optimum hyperplane generalization
  • The value of the slack variable ξ_i tells us where
    the point lies (see the sketch below):
  • ξ_i = 0: beyond the margin, correct classification
  • 0 < ξ_i ≤ 1: inside the margin, correct classification
  • ξ_i > 1: incorrect classification

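The slack of each training point can be read off its functional margin, as in this short sketch (ξ_i = max(0, 1 - y_i f(x_i)); the labels and decision values are invented for illustration):

    import numpy as np

    y = np.array([+1, +1, +1, -1])          # true labels
    f_x = np.array([1.7, 0.4, -0.3, -2.0])  # decision values f(x_i) = w.x_i + b

    xi = np.maximum(0.0, 1.0 - y * f_x)     # slack variables
    # xi = [0.0, 0.6, 1.3, 0.0]:
    #   xi = 0       -> beyond the margin, correct classification
    #   0 < xi <= 1  -> inside the margin, correct classification
    #   xi > 1       -> incorrect classification
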
14
Soft margins and slacks
15
Formulation - Primal
  • min ½||w||² + C Σ_i ξ_i subject to
    y_i (w·x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0.

16
Lagrangian solution
  • We must minimize the Lagrangian.
  • This is a QP problem with the same solution
    w = Σ_i α_i y_i x_i,
  • but now 0 ≤ α_i ≤ C.
  • C manages the trade-off between the size of the
    margin (lower values of C) and the number of
    errors tolerated in the training phase (if C → ∞
    we recover the perfectly separating hyperplane).
    See the sketch below.

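A quick way to see the role of C in practice is to train a linear SVM with different values of C; this sketch uses scikit-learn's SVC on made-up, overlapping data:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Two overlapping blobs: not perfectly separable.
    X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        # Small C -> wide margin, many margin violations (many support vectors);
        # large C -> narrow margin, fewer tolerated training errors.
        print(C, len(clf.support_vectors_), clf.score(X, y))
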
17
Final optimum hyperplane
  • So the final optimum hyperplane has the same form
    as in the separable case: f(x) = Σ_i α_i y_i (x_i·x) + b.

18
3) Nonlinear SVM (nonseparable data)
  • Mapping F(x)
  • Kernel functions
  • Loss functions
  • Formulation

19
Mapping F(x)
  • When the sets are not linearly separable, we
    introduce a mapping Φ(x) into a higher-dimensional
    space in order to obtain linearly separable sets.
  • So instead of increasing the complexity of the
    classifier (it is still a hyperplane), we increase
    the dimensionality of the feature space.

20
Mapping Φ(x)
21
Kernel functions
  • The transformed space can have a very large
    dimension, so the mapping Φ(·) can be very
    expensive to evaluate.
  • In the learning and classification phases we only
    need the scalar product Φ(x)·Φ(y).
  • By Mercer's theorem there exists a kernel function
    K(x,y) such that K(x,y) = Φ(x)·Φ(y).
  • So the discriminant function becomes
    f(x) = Σ_i α_i y_i K(x_i, x) + b.
  • The use of kernel functions therefore avoids the
    explicit mapping into the high-dimensional space
    (see the sketch below).

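A minimal check of the identity K(x,y) = Φ(x)·Φ(y) for the homogeneous degree-2 polynomial kernel in two dimensions, where K(x,y) = (x·y)² and Φ(x) = (x1², √2·x1·x2, x2²):

    import numpy as np

    def phi(x):
        """Explicit feature map of the degree-2 homogeneous polynomial kernel (2D input)."""
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    def kernel(x, y):
        """K(x, y) = (x . y)^2, evaluated without ever computing phi."""
        return np.dot(x, y) ** 2

    x = np.array([1.0, 2.0])
    y = np.array([3.0, 4.0])
    assert np.isclose(kernel(x, y), np.dot(phi(x), phi(y)))  # both equal (1*3 + 2*4)^2 = 121
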
22
Some Kernel functions
  • Linear kernel: K(x,y) = x·y
  • Polynomial kernel: K(x,y) = (x·y + 1)^d
  • Gaussian kernel (RBF, Radial Basis Function):
    K(x,y) = exp(-||x - y||² / 2σ²)
  • MLP kernel (multi-layer perceptron):
    K(x,y) = tanh(κ x·y + θ) (see the sketch below)

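These kernels can be written directly in a few lines; the hyperparameter values (d, c, sigma, kappa, theta) below are illustrative defaults, not prescribed ones:

    import numpy as np

    def linear_kernel(x, y):
        return np.dot(x, y)

    def polynomial_kernel(x, y, d=3, c=1.0):
        return (np.dot(x, y) + c) ** d

    def gaussian_kernel(x, y, sigma=1.0):
        return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

    def mlp_kernel(x, y, kappa=1.0, theta=0.0):
        # Sigmoid / MLP kernel; satisfies Mercer's condition only for some parameter choices.
        return np.tanh(kappa * np.dot(x, y) + theta)
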
23
Loss functions
  • Consider a true multilabel y and a predicted one t.
  • The basic goal is to learn a function f that
    approximates the unknown target function.
  • To evaluate the goodness of the approximation we
    need a loss function l(y,t) denoting the price we
    have to pay for predicting t when the true
    multilabel is y.

24
Loss functions conditions
  • Basic condition on the loss function:
  • l(y,t) should be monotonically decreasing with
    respect to the sets of incorrect multilabels.

25
Final formulation - Primal
  • With a single slack variable for each training
    example.

26
Lagrangian solution
  • Now we must minimize the Lagrangian.
  • This is a QP problem; from its solution we obtain
    the final optimum hyperplane.

27
Learning Hierarchical Multi-Category Text
Classification Models
  • Juho Rousu, Craig Saunders, Sandor Szedmak, John
    Shawe-Taylor

Proceedings of the 22nd International Conference
on Machine Learning (ICML 2005), Bonn, Germany, 2005.
28
Hierarchical Multilabel Classification: union of partial paths model
  • Goal: given a document x and a hierarchy T = (V,E),
    predict a multilabel y = (y_1, ..., y_k) in which
    the positive microlabels y_k form a union of
    partial paths in T.

A news article about David and Victoria Beckham
could belong to different partial paths and might
not belong to any leaf categories.
29
Frequently used learning strategies for hierarchies
  • Flatten the hierarchy: learn each microlabel
    independently with a classification learner of
    your choice (see the sketch below).
  • Computationally relatively inexpensive.
  • Does not make use of the dependencies between
    microlabels.
  • Hierarchical training: train node j with the
    examples (x,y) that belong to its parent.
  • Some of the dependencies between microlabels are
    learned.
  • However, the training data fragments toward the
    leaves, hence the estimation becomes less
    reliable.
  • The model is not explicitly trained in terms of a
    loss function for the hierarchy.
  • The authors try to improve on these approaches.

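A sketch of the "flatten the hierarchy" baseline: one independent classifier per microlabel, here with scikit-learn's LinearSVC as the learner of choice (the data layout, a matrix Y with one ±1 column per microlabel, is an assumption for illustration):

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_flat(X, Y):
        """Train one independent classifier per microlabel (column of Y in {-1, +1})."""
        return [LinearSVC().fit(X, Y[:, j]) for j in range(Y.shape[1])]

    def predict_flat(models, X):
        """Stack the per-node predictions into a multilabel matrix; the hierarchy is ignored."""
        return np.column_stack([m.predict(X) for m in models])
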
30
Multi-classification
  • The multilabel is a union of partial paths in the
    hierarchy.
  • Results are post-processed:
  • if the label of a node is predicted as -1, then
    all descendants of that node are also labelled
    negatively (done to obtain good accuracy).

31
Loss functions for multilabel classifications
  • Consider a true multilabel y and a predicted one t.
  • There are many choices, for example (see the
    sketch below):
  • Zero-one loss: l(y,t) = [y ≠ t]
  • Symmetric difference loss: l(y,t) = Σ_j [y_j ≠ t_j]
  • They don't take the hierarchy into account.

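Both losses are one-liners over the microlabel vectors (a sketch; multilabels are encoded here as ±1 vectors):

    import numpy as np

    def zero_one_loss(y, t):
        """1 if the prediction t differs from the true multilabel y anywhere, else 0."""
        return int(np.any(y != t))

    def symmetric_difference_loss(y, t):
        """Number of microlabels on which y and t disagree."""
        return int(np.sum(y != t))

    y = np.array([+1, +1, -1, -1])
    t = np.array([+1, -1, -1, +1])
    # zero_one_loss(y, t) -> 1, symmetric_difference_loss(y, t) -> 2
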
32
Hierarchical loss functions
  • Goal: take the hierarchy into account.
  • Hierarchical loss: only the first mistake along a
    path is penalized (Cesa-Bianchi et al.). See the
    sketch below.
  • Simplified hierarchical loss: a mistake in a child
    is penalized only if the parent was predicted
    correctly.

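A hedged sketch of the hierarchical loss in this spirit: a node contributes to the loss only if it is the first mistake on its path, i.e. all of its ancestors are predicted correctly. The tree is passed as a child-to-parent map; the function and argument names are hypothetical.

    def hierarchical_loss(y, t, parent, c=None):
        """Penalize node j only if y[j] != t[j] and every ancestor of j is predicted correctly.

        y, t   : dicts node -> label in {-1, +1} (true and predicted multilabels)
        parent : dict node -> parent node (the root maps to None)
        c      : optional dict of per-node down-scaling coefficients c_j (default 1)
        """
        loss = 0.0
        for j in y:
            if y[j] == t[j]:
                continue
            # The mistake counts only if all ancestors are predicted correctly.
            a = parent[j]
            while a is not None and y[a] == t[a]:
                a = parent[a]
            if a is None:  # reached the root without finding an earlier mistake
                loss += (c or {}).get(j, 1.0)
        return loss
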
33
Coefficients c_j
  • The coefficients c_j are used for down-scaling the
    loss when going deeper in the tree. They can be
    chosen in many ways (see the sketch below):
  • Uniform loss
  • Siblings loss
  • Subtree loss

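One plausible way to implement such coefficients (a sketch under assumptions, not necessarily the exact definitions used in the paper): uniform scaling gives every node weight 1, while a sibling-style scaling lets each child inherit an equal share of its parent's coefficient, so the loss shrinks with depth.

    def uniform_coefficients(nodes):
        """Uniform loss: c_j = 1 for every node j."""
        return {j: 1.0 for j in nodes}

    def sibling_coefficients(children, root):
        """Assumed sibling-style scaling: c_root = 1 and each child gets an equal
        share of its parent's coefficient, so c_j decreases with depth."""
        c = {root: 1.0}
        stack = [root]
        while stack:
            p = stack.pop()
            kids = children.get(p, [])
            for k in kids:
                c[k] = c[p] / len(kids)
                stack.append(k)
        return c

    # Example tree: a root with two children, one of which has two children of its own.
    children = {"root": ["A", "B"], "A": ["A1", "A2"]}
    # sibling_coefficients(children, "root")
    #   -> {"root": 1.0, "A": 0.5, "B": 0.5, "A1": 0.25, "A2": 0.25}
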
34
Maximum margin learning
  • The model class is defined on the edges of a
    Markov tree T = (V,E).
  • Φ(x) is the vector representation of the document
    x (bag of words, see the sketch below). In the
    training data some Φ(x) are duplicated with
    different weights.
  • Maximize the ratio between the probability of the
    correct labeling y_i and the worst competing
    labeling y.
  • With the exponential family, the problem
    translates into maximizing the minimum linear
    margin.

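For the document representation Φ(x), a bag-of-words vector can be built with a standard text vectorizer; a sketch with two made-up documents:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "beckham scores in the cup final",
        "victoria beckham launches a fashion line",
    ]
    vectorizer = CountVectorizer()
    Phi = vectorizer.fit_transform(docs)  # sparse matrix: one bag-of-words row per document
    print(vectorizer.get_feature_names_out())
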
35
Optimization problem - Primal
  • Using a single slack variable for each training
    example

36
Optimization problem - Dual
  • Where K is the joint kernel
  • Exponential number (in the size of the hierarchy)
    of primal constraints and dual variables per
    example.

37
Marginalized problem
  • To obtain a polynomial-size problem:
  • Edge-marginals of the dual variables.
  • Loss function decomposed by edges.
  • Kernel decomposed by edges.
  • Conditional gradient descent is used to optimize
    the marginalized problem (a few iterations are
    used to update the variables).

38
Prediction quality
Results on the REUTERS and WIPO datasets:
Flat SVM obtains the highest precision, but the
lowest recall and F1. The F1 values are similar
for all the hierarchical models.
39
References
  • Rousu, J., Saunders, C., Szedmak, S. and
    Shawe-Taylor, J. (2004). On Maximum Margin
    Hierarchical Classification. In Proceedings of the
    Workshop on Learning with Structured Outputs at
    NIPS 2004, Whistler.
  • Rousu, J., Saunders, C., Szedmák, S. and
    Shawe-Taylor, J. (2006). Kernel-Based Learning of
    Hierarchical Multilabel Classification Models.
    Journal of Machine Learning Research, 7:1601-1626.
  • Cesa-Bianchi, N., Gentile, C., Tironi, A. and
    Zaniboni, L. (2004). Incremental Algorithms for
    Hierarchical Classification. Neural Information
    Processing Systems.
  • Cesa-Bianchi, N., Gentile, C. and Zaniboni, L.
    (2006). Incremental Algorithms for Hierarchical
    Classification. Journal of Machine Learning
    Research, 7:31-54.