Introduction to SVMs - PowerPoint PPT Presentation

Transcript and Presenter's Notes


1
Introduction to SVMs
  • Rebecca Fiebrink
  • IFT6080
  • 13 March 2006

2
Disclaimer
  • This is a high-level overview.
  • I am not an SVM expert.
  • Images are liberally stolen from tutorials by
    Schölkopf and Burges.

3
Separating planes: Basic idea
  • A separating plane is great for doing 2-class
    classification on linearly separable data
  • It's easy to reason about, too
  • Can discuss formal bounds on generalization error
  • Can think of straightforward ways to compute this
    plane

4
SVM: Basic idea
  • What about when the data is not linearly
    separable?
  • Map it into a higher dimension where it is
    separable
  • Use the kernel trick to implicitly map up,
    compute stuff, and map back in one step.

5
Contents of this presentation
  • Introduce idea of linear maximum-margin
    hyperplane for separable data
  • Extend to non-separable data
  • Extend to non-linear hyperplanes
  • The kernel trick
  • Finding the maximum-margin hyperplane
  • Theoretical and practical implications
  • Extensions of SVMs

6
Linear plane, separable data
  • Each data instance xi is a vector in R^d
  • Each xi has a class label yi ∈ {-1, +1}
  • A separating hyperplane can be defined by normal
    vector w and scalar b
  • The plane is specified so that
  • sgn(⟨xi, w⟩ + b) = sgn(yi)
  • and |⟨xi, w⟩ + b| ≥ 1
  • New data x can be classified as sgn(⟨x, w⟩ + b)
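  • A minimal sketch (illustrative values; w and b are assumed to be given,
    e.g. found by training) of classifying new points with this rule:

    import numpy as np

    # Hypothetical hyperplane parameters: normal vector w and offset b
    w = np.array([2.0, -1.0])
    b = -0.5

    def classify(x):
        # Sign of <x, w> + b (+1 or -1 for points off the plane)
        return int(np.sign(np.dot(x, w) + b))

    print(classify(np.array([1.0, 0.0])))   #  2.0 - 0.5 =  1.5 -> +1
    print(classify(np.array([0.0, 1.0])))   # -1.0 - 0.5 = -1.5 -> -1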

7
Maximum margin hyperplane
  • A maximum-margin hyperplane minimizes ||w|| while
    requiring that all training points are correctly
    classified (i.e., yi(⟨xi, w⟩ + b) ≥ 1 for all i); see
    the optimization problem sketched after this list
  • There is one solution for this plane (or
    equivalent global solutions)
  • The training points that lie on the margin
    (circled) are the support vectors
  • Removal of any of these points from the training
    set will change the solution hyperplane
  • Each support vector may have a different
    importance to the solution
  • (Only) these points can be used to classify new
    data
  • They specify the hyperplane
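  • Written out (a reconstruction of the standard hard-margin problem, not
    copied from the slide images):

    \min_{w,\,b}\ \tfrac{1}{2}\|w\|^2
    \quad \text{subject to} \quad
    y_i(\langle w, x_i \rangle + b) \ge 1,\ \ i = 1, \dots, n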

8
Linear plane, non-separable data
  • Introduce non-negative slack variables ξi (one
    for each training point)
  • This relaxes the requirement that all training points
    be correctly classified
  • A cost parameter (often written C) controls the balance
    between the size of the ξi's and the number of training
    errors (see the sketch after this list)
  • Classify new data in the same way as before
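  • The usual soft-margin formulation (a standard reconstruction; C is the
    cost parameter referred to above as the balance control):

    \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
    \quad \text{subject to} \quad
    y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0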

9
Mechanical analogy
  • Consider L = R^2
  • The hyperplane is a stiff sheet
  • Each support vector exerts a force normal to this
    sheet, scaled according to its importance
  • System is in equilibrium
  • Sum of forces is 0
  • Sum of torques is 0
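  • In terms of the multipliers αi introduced later (roughly following
    Burges' version of the analogy; stated here as an interpretation, not a
    quote), these two conditions correspond to

    \sum_i \alpha_i y_i = 0
    \qquad \text{and} \qquad
    w = \sum_i \alpha_i y_i\, x_i

    where the sums run over the support vectors.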

10
Nonlinear SVM
  • Map data from the lower-dimensional space L = R^d to a
    (much) higher-dimensional space H using some
    function Φ : L → H
  • Compute the maximum-margin hyperplane in H
    exactly as before

11
Nonlinear SVM
  • Could explicitly construct Φ, map all the
    training data into H, compute the hyperplane, map all
    the testing data into H to classify, etc.
  • But, the equations we use to find w and b can be
    expressed using the training data only in
    dot-product form
  • Additionally, the equations we use to classify
    new points can be expressed using only the
    support vectors and only in dot-product form
  • So
  • We can take a shortcut: use a function that
    directly computes dot products in H on vectors in
    L
  • We don't even have to know what H is, or how to
    explicitly get there; H just has to have an inner
    product defined
  • This is called the kernel trick
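  • A quick illustration (not from the original slides): the degree-2
    polynomial kernel K(x, z) = ⟨x, z⟩² on R² computes the same value as an
    explicit dot product after the map Φ(x) = (x1², √2·x1·x2, x2²):

    import numpy as np

    def phi(x):
        # Explicit feature map into H for the degree-2 polynomial kernel
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    def k(x, z):
        # Kernel: the same dot product, computed directly in L = R^2
        return np.dot(x, z) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, 4.0])
    print(np.dot(phi(x), phi(z)))   # 121.0 -- dot product in H
    print(k(x, z))                  # 121.0 -- same value, never leaving L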

12
Kernel Trick
  • Replace all dot-products in L with dot-products
    in H as follows
  • If we have a Φ we know and love
  • Change ⟨xi, xj⟩ to ⟨Φ(xi), Φ(xj)⟩
  • Call K(xi, xj) = ⟨Φ(xi), Φ(xj)⟩
  • Otherwise, if we have some K such that we know
    K(xi, xj) = ⟨Φ(xi), Φ(xj)⟩ for some Φ and H, somewhere
  • Change ⟨xi, xj⟩ to K(xi, xj) directly
  • Do the above for both the training and the
    testing of the SVM.

13
Mercer's condition
  • How do we know if K(xi, xj) computes a
    dot-product in some higher-dimensional space?
  • Mercer's condition: if it holds, K(x, y) is a valid
    kernel (see the statement after this list)
  • Mercer's condition doesn't tell us anything about
    Φ or H, only that they exist
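  • One common statement of the condition (following Burges' tutorial; the
    slide showed it only as an image): K is a valid kernel if

    \int\!\!\int K(x, y)\, g(x)\, g(y)\, dx\, dy \;\ge\; 0
    \quad \text{for every } g \text{ with } \int g(x)^2\, dx < \infty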


14
A few notes about the kernel trick
  • The image of Φ may live in a space of very high
    dimension, but it is just a (possibly very
    contorted) surface whose intrinsic dimension is
    just that of L
  • w typically doesn't have a representation in L;
    otherwise we could just classify linearly
  • Kernels aren't just for SVMs!

15
Choosing a kernel
  • Common kernels
  • Polynomial
  • Radial basis function
  • 2-layer sigmoidal neural network
  • Choosing a kernel corresponds to choosing a
    similarity measure for the data (Schölkopf)
  • "No free lunch" for kernels (Schölkopf)
  • The number of parameters one needs to set (tune)
    is a key consideration for practical use of SVMs
    (Hsu et al.)
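  • For concreteness (an illustrative sketch, not from the slides; the
    parameter names degree, c, gamma, and kappa stand for the tunable kernel
    parameters mentioned above):

    import numpy as np

    def polynomial_kernel(x, z, degree=3, c=1.0):
        # K(x, z) = (<x, z> + c)^degree
        return (np.dot(x, z) + c) ** degree

    def rbf_kernel(x, z, gamma=0.5):
        # K(x, z) = exp(-gamma * ||x - z||^2)
        return np.exp(-gamma * np.dot(x - z, x - z))

    def sigmoid_kernel(x, z, kappa=1.0, c=-1.0):
        # K(x, z) = tanh(kappa * <x, z> + c); the "2-layer sigmoidal neural
        # network" kernel (only satisfies Mercer's condition for some
        # parameter settings)
        return np.tanh(kappa * np.dot(x, z) + c)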

16
Quadratic programming
  • Finding the maximum-margin separating hyperplane
    is a quadratic programming (QP) problem
  • QP: a type of optimization where the objective
    function is allowed to have quadratic terms
  • In general, QP involves minimizing an objective f(x)
    w.r.t. x under some constraints (see the form sketched
    after this list)
  • If E is positive definite, then f(x) is a convex
    function; combined with linear constraints, the whole
    problem is convex
  • There are well-understood circumstances under
    which a solution is optimal: the Karush-Kuhn-Tucker
    (KKT) conditions
  • There is a collection of known methods for
    solving QP problems.
  • Thank you, Wikipedia.
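  • A sketch of the generic QP form implied above (a reconstruction; E, c,
    A, and b are the conventional names, with E the matrix referred to on
    this and the next slide):

    \min_{x}\ f(x) = \tfrac{1}{2}\, x^{\top} E\, x + c^{\top} x
    \quad \text{subject to} \quad A x \le b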

17
QP: what's it mean for SVM?
  • If Mercer's condition is met, then E is positive
    semi-definite, so the objective is convex.
  • This means that any solution we find is global.
  • We can identify (check) a solution using KKT
    conditions.
  • The solution process is still sort of ugly.

18
What are all those equations?
  • Problem is formulated as primal and dual
    Lagrangians
  • Primal and dual are complementary equations
  • You can solve either one
  • Primal: the objective function involves n variables
    and is subject to m constraints
  • Want to maximize (or minimize) the value of the
    objective function subject to the constraints
  • A solution is a vector of n values that achieves
    this maximum
  • Dual: the objective function involves m variables (one
    arising from each of the m primal constraints) and is
    subject to n constraints
  • Again, maximize/minimize the objective function
    subject to the constraints
  • Lagrange multipliers
  • A method for dealing with constraints
  • An unknown scalar multiplier αi is assigned to
    each constraint (see the sketch after this list)
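  • As a generic sketch (standard form, not copied from the slides): to
    minimize f(x) subject to constraints gi(x) ≥ 0, introduce one multiplier
    αi ≥ 0 per constraint and form the Lagrangian

    L(x, \alpha) = f(x) - \sum_{i=1}^{m} \alpha_i\, g_i(x)

    which is minimized over x and maximized over the αi.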

19
The details
  • Separable case and non-separable case (the formulas
    shown on the slide are reconstructed below)
  • Why?
  • Want a maximum margin, and margin = 1/||w||
  • Separable: require yi(⟨w, xi⟩ + b) ≥ 1
  • Non-separable: require yi(⟨w, xi⟩ + b) ≥ 1 - ξi
  • and, control complexity (balance between
    training errors and the size of the ξi's)

Primal: minimize
Dual: maximize
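  • Reconstructed in standard form (the slide showed these as images;
    K(xi, xj) = ⟨Φ(xi), Φ(xj)⟩ as before):

    \text{Primal:}\quad
    \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
    \quad \text{s.t.} \quad
    y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0

    \text{Dual:}\quad
    \max_{\alpha}\ \sum_i \alpha_i
      - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j)
    \quad \text{s.t.} \quad
    0 \le \alpha_i \le C,\ \ \sum_i \alpha_i y_i = 0

    (In the separable case, drop the ξi and replace 0 ≤ αi ≤ C with αi ≥ 0.)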
20
Theoretical implications
  • Structural risk minimization (SRM): we'd like to
    find a machine for which the training error is low
    and the bound on its generalization error is tight
  • A formal bound on the generalization performance of a
    learner (reconstructed after this list)
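  • The bound the slide refers to is presumably the usual VC bound (Vapnik;
    see Burges' tutorial): with probability at least 1 - η, for a machine
    trained on l examples,

    R(\alpha) \;\le\; R_{emp}(\alpha)
      + \sqrt{\frac{h\left(\log(2l/h) + 1\right) - \log(\eta/4)}{l}}

    where R is the true risk, R_emp the empirical (training) risk, and h
    the VC dimension (next slide).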

21
VC dimension
  • h in the previous slide
  • Measures the capacity of a classification
    algorithm
  • Capacity relates to ability to learn perfectly
    (shatter), which relates to (in)ability to
    generalize

22
SVM and VC
  • VC dimension of an SVM is dependent on the kernel
    used
  • In general, capacity is very high (or infinite)
  • This isn't necessarily a problem; other
    algorithms also have infinite VC dimension (e.g., kNN)

23
Gap-tolerant classifiers
  • We can come up with meaningful (but looser)
    bounds for gap-tolerant classifiers, a sort of
    idealized and generalized version of SVMs
  • In practice, these bounds can tell us useful
    things about the performance
  • According to Burges, the SVM's magic happens in the
    maximization of the margin

24
Practical implications for SVMs
  • Training can take a really long time
  • There are methods for solving the QP problem more
    efficiently (e.g., chunking), and these are used
    in practice
  • Nominal attributes are typically binarized, which
    results in more attributes (see the sketch after this list)
  • Unlike other classifiers (e.g., neural nets),
    SVMs always find a global optimum
  • Special things must be done to handle multi-class
    problems
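  • An illustrative binarization (one-hot) sketch, with made-up data:

    # Hypothetical nominal attribute with three possible values
    colors = ["red", "green", "blue", "red"]
    categories = sorted(set(colors))             # ['blue', 'green', 'red']
    # Each nominal value becomes a vector of binary attributes
    binarized = [[1 if c == cat else 0 for cat in categories] for c in colors]
    print(binarized)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]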

25
Extensions: More than one class
  • There are a variety of ways to handle this, some
    of them designed for extending 2-class
    classifiers in general
  • 1-vs-1 (pairwise): construct an SVM for every
    pair of the N classes (N(N-1)/2 in total). The class
    that gets the most votes wins (max-wins; see the
    sketch after this list)
  • 1-vs-all (1-vs-rest): construct an SVM for each
    class. The class that gets the strongest vote in
    its favor wins.
  • Others: e.g., Platt's DAGSVM
  • There are also a variety of ways to get more
    interesting outputs (e.g., probabilities)
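  • A minimal sketch of 1-vs-1 max-wins voting, assuming a dict of
    already-trained binary classifiers (one per class pair, each returning
    the winning label for a point); all names here are hypothetical:

    from collections import Counter
    from itertools import combinations

    def one_vs_one_predict(x, classes, pairwise_classifiers):
        # pairwise_classifiers[(a, b)](x) returns either a or b
        votes = Counter()
        for a, b in combinations(classes, 2):
            votes[pairwise_classifiers[(a, b)](x)] += 1
        return votes.most_common(1)[0][0]   # most-voted class wins

  • With N classes this queries N(N-1)/2 binary SVMs per test point.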

26
Extensions: Large datasets
  • Scalability is a problem.
  • Some algorithms use selective sampling/active
    learning to sample the training data
    intelligently
  • Some approaches reformulate the QP problem so
    that it can be solved more efficiently
  • Portions of the basic SVM computation can be
    parallelized

From Yu et al. 2003
27
For more information
  • My website: http://www.music.mcgill.ca/~rebecca/6080/SVM_bib.htm

28
References
  • Burges, C. 1999. A tutorial on support vector
    machines for pattern recognition. Given at DAGM.
    Available online: http://www.kernel-machines.org/tutorial.html.
  • Duda, R., P. Hart, and D. Stork. 2001. Pattern
    classification. 2nd ed. New York: John Wiley & Sons.
  • Hsu, C., C. Chang, and C. Lin. A practical guide
    to support vector classification. Available online:
    http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
  • Hsu, C., and C. Lin. 2002. A comparison of
    methods for multiclass support vector machines.
    IEEE Transactions on Neural Networks 13(2): 415-25.
  • Platt, J. 1999. Probabilities for support vector
    machines. In Advances in large margin classifiers,
    A. Smola, P. Bartlett, B. Schölkopf, and D.
    Schuurmans, eds. Cambridge, MA: MIT Press. 61-74. Original
    title: Probabilistic outputs for support vector
    machines and comparisons to regularized
    likelihood methods. Available online:
    http://research.microsoft.com/~jplatt/abstracts/SVprob.html.
  • Schölkopf, B. 2000. A short tutorial on kernels.
    Tutorial given at the NIPS'00 Kernel Workshop.
    Available online:
    http://www.dcs.rhbnc.ac.uk/colt/nips2000/kernels-tutorial.ps.gz.
  • Schölkopf, B., and A. Smola. 2002. Learning with
    kernels: Support vector machines, regularization,
    optimization, and beyond. Cambridge, MA: MIT Press.
  • www.kernel-machines.org
  • www.wikipedia.org
  • Yu, H., J. Yang, and J. Han. 2003. Classifying
    large data sets using SVMs with hierarchical
    clusters. Proceedings of SIGKDD 2003.