1
Support Vector Machines
  • Cristianini & Shawe-Taylor, Chapt. 6
  • Hua & Sun, J. Mol. Biol., 308, 397-407 (2001).

2
Overview
  • So far we have learned about
  • Learning machines, especially linear machines
    that classify points by separating them into
    categories using hyperplanes
  • Attribute space vs. Feature space
  • How to embed attribute vectors into a
    high-dimensional feature space, and how to do this
    implicitly using kernels
  • How to measure the performance of a learning
    machine, using measures based on the margin
  • How to optimize functions in the presence of
    constraints

3
Putting it together
  • A support vector machine
  • Maps attribute vectors into a high-dimensional
    feature space (implicitly) using a kernel
  • Given a training set, finds a hyperplane to
    categorize the data, by optimizing the
    performance of the resulting learning machine
  • Uses an algorithm that automatically focuses on
    those training points most critical to
    positioning the hyperplane; these are the support
    vectors. (A minimal end-to-end sketch follows
    below.)
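
As a concrete illustration of these three ingredients, here is a
minimal sketch using scikit-learn's SVC on an invented toy set (the
course project will use SVM Light instead):

# Kernel embedding, margin optimization, and support-vector
# identification in a few lines; the data is a toy example.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]])  # attribute vectors
y = np.array([-1, -1, 1, 1])                                    # class labels

clf = SVC(kernel="rbf", C=10.0)   # implicit feature-space embedding via kernel
clf.fit(X, y)                     # positions the separating hyperplane
print("support vectors:", clf.support_vectors_)  # the critical training points
print("prediction:", clf.predict([[2.5, 2.5]]))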

4
The Maximal Margin Approach
  • Given a training set, generate a separating
    hyperplane with maximal margin with respect to
    the data.
  • The data is implicitly embedded in a
    high-dimensional feature space using a kernel
  • The data must be separable; this is not practical
    for many real-world problems (even in high
    dimensions), because of noise.

5
The target function: f(x) = \langle w, x \rangle + b
The separating plane: \langle w, x \rangle + b = 0
The functional margin of a training point (x_i, y_i):
\gamma_i = y_i\,(\langle w, x_i \rangle + b)
The functional margin is affected by the scale factor
\lambda in (\lambda w, \lambda b); the position of the
plane is not.
6
How to set the weight vector
  • The geometric margin measures how far the
    training points are from the plane; it is the
    margin computed when the weight vector has unit
    length.
  • It is easy to see that the geometric margin is
    maximized if the size of the un-normalized weight
    vector is made as small as possible.
  • To see this, suppose that the functional margin
    is fixed at 1; the derivation below makes the
    step explicit.
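
The step the slide appeals to, reconstructed in the notation above:

% With the functional margin normalized to 1, the geometric margin is
\gamma = \min_i \frac{y_i(\langle w, x_i \rangle + b)}{\lVert w \rVert}
       = \frac{1}{\lVert w \rVert}
% so maximizing \gamma is equivalent to minimizing \lVert w \rVert.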

7
The maximal margin optimization problem
Given a separable training set
S = \{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}, \quad y_i \in \{-1, +1\}
Find the hyperplane (w, b) that solves the
optimization problem:
\min_{w, b}\; \tfrac{1}{2}\langle w, w \rangle
\quad\text{subject to}\quad
y_i(\langle w, x_i \rangle + b) \ge 1, \; i = 1, \ldots, \ell
8
The Lagrangian
L(w, b, \alpha) = \tfrac{1}{2}\langle w, w \rangle
- \sum_i \alpha_i \left[ y_i(\langle w, x_i \rangle + b) - 1 \right],
\quad \alpha_i \ge 0
Conditions (stationarity):
\frac{\partial L}{\partial w} = 0, \qquad \frac{\partial L}{\partial b} = 0
9
Leads to
w = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0
Substituting back into the Lagrangian gives the dual objective
W(\alpha) = \sum_i \alpha_i
- \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
10
A quadratic optimization problem
Find the vector of alphas that maximizes
W(\alpha) = \sum_i \alpha_i
- \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
subject to
\sum_i \alpha_i y_i = 0, \qquad \alpha_i \ge 0
This solves for the optimal hyperplane:
w^* = \sum_i \alpha_i^* y_i x_i
This comes from the dual formulation of the
optimization problem.
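
A numerical sketch of this dual problem on an invented toy set, using
SciPy's generic SLSQP solver (real SVM trainers use specialized
algorithms such as SMO; all names here are illustrative):

import numpy as np
from scipy.optimize import minimize

# Toy linearly separable training set: rows of X with labels in {-1, +1}
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j <x_i, x_j>

def neg_dual(a):                 # negated W(alpha), since we minimize
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(len(y)), method="SLSQP",
               bounds=[(0, None)] * len(y),                      # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
w = (alpha * y) @ X                  # w* = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                    # the support vectors
b = np.mean(y[sv] - X[sv] @ w)       # b from y_i(<w, x_i> + b) = 1
print("alpha:", alpha.round(3), "w:", w.round(3), "b:", round(b, 3))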
11
Observations
  • The weight vector is expressed as a linear
    combination of the training examples, just as in
    the case of the perceptron
  • Consider the inequality constraint: the optimal
    alphas are non-zero only for those training
    points that bump into the constraint; these
    have minimum distance to the hyperplane. They are
    the support vectors.

12
  • The target function can be expressed in terms of
    the dual variables:
    f(x) = \sum_i \alpha_i^* y_i \langle x_i, x \rangle + b^*
  • Also, we can compute the margin: since
    \langle w^*, w^* \rangle = \sum_{i \in SV} \alpha_i^*,
    the geometric margin is
    \gamma = 1 / \lVert w^* \rVert
           = \left( \sum_{i \in SV} \alpha_i^* \right)^{-1/2}

13
Using a Kernel
Find the vector of alphas that maximizes
W(\alpha) = \sum_i \alpha_i
- \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
subject to
\sum_i \alpha_i y_i = 0, \qquad \alpha_i \ge 0
where the kernel K replaces the feature-space inner product.
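
The Gaussian kernel used later in the project is easy to compute
directly; a sketch (gamma is an assumed width parameter):

import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||^2) for all pairs of rows
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

# In the dual problem this matrix simply replaces the inner products
# <x_i, x_j>; the decision function becomes
# f(x) = sum_i alpha_i y_i K(x_i, x) + b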
14
Look at Figure 6.2 in Cristianini & Shawe-Taylor.
15
Soft Margins
  • Real-life data is noisy: even if there is an
    underlying distribution that can be used to
    classify the data, the presence of random noise
    means that some points near the separating
    hyperplane may cross over and spoil the
    classification; the presence of noise may make
    the data inseparable, even in high dimensions.
  • The solution is to introduce margin slack
    variables, which allow the margin to be violated
    at individual training points.

16
Using slack variables
\min_{w, b, \xi}\;
\tfrac{1}{2}\langle w, w \rangle + C \sum_i \xi_i
\quad\text{(1-norm)}
\qquad\text{or}\qquad
\tfrac{1}{2}\langle w, w \rangle + C \sum_i \xi_i^2
\quad\text{(2-norm)}
subject to
y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0
17
Observations
  • The two versions of the optimization problem
    correspond to two different norms on the slack
    variables, and two different Lagrangians.
  • The parameter C is adjustable, and is varied
    while the performance of the machine is assessed
    (sketched below). C effectively imposes an
    additional constraint on the size of the alpha
    multipliers (0 <= alpha_i <= C in the 1-norm
    case); it affects the accuracy of the model and
    how well it regularizes, and can be adjusted to
    limit the influence of outliers.
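
The assessment loop described above, sketched with scikit-learn's
cross-validation on synthetic data for brevity (the project will
instead assess C on a held-out validation set with SVM Light):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Larger C punishes margin violations more; smaller C limits the
# influence of outliers (at the risk of underfitting).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    scores = cross_val_score(SVC(kernel="rbf", C=C), X, y, cv=5)
    print(f"C={C:<6} mean accuracy={scores.mean():.3f}")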

18
Hua & Sun, J. Mol. Biol., 308, 397-407 (2001).
Project 3
19
Protein Secondary Structure Prediction
  • Classic problem in computational biology; the
    first successful algorithm is due to Chou &
    Fasman in the 70s.
  • Contemporary methods use neural networks and
    hidden Markov models
  • Despite the sophistication of current approaches,
    there appear to be fundamental limits on the
    accuracy of predicting secondary structure on the
    basis of sequence alone.

20
An SVM classifier for secondary structure
  • The authors propose to use a set of SVMs to
    classify secondary structure. They justify this
    by pointing to the superiority of SVMs over
    competing learning methods in a number of
    different fields. Also:
  • The rigorous learning theory that describes SVMs
  • The relative simplicity of constructing them.

21
Implementation
  • Use the ever-popular sliding window.
  • Use a Gaussian kernel (they refer to this as a
    radial basis function, in keeping with neural
    network terminology)
  • Data sets: RS126 and CB513. We will use the RS126
    set (Rost & Sander, J. Mol. Biol., 232, 584-599
    (1993)), available on the class home page.
  • SVM code? We will use SVM Light (available on the
    class home page, precompiled for OS X).

22
Training data format
  • The RS126 set presents a collection of 126
    multiple alignments of related proteins.
  • We will use the head sequence of each
    alignment; these are conveniently collected in
    FASTA format here:
    http://antheprot-pbil.ibcp.fr/Rost.html
  • We will use the secondary structure assignments
    found using DSSP, with the category
    identifications used by Hua & Sun.
  • We will use a window of 11 residues, classifying
    the secondary structure assignment of the residue
    in the central position.

23
Generating Training Data
  • We will train three binary classifiers:
    Helix/not-Helix, Sheet/not-Sheet and
    Coil/not-Coil.
  • For each sequence:
  • Slide a window of selected length along the
    sequence; to control how many examples are
    generated, we will employ a user-selected stride
    when advancing the window. (A sketch follows
    below.)
  • Use the DSSP string to assign the type of each
    residue, with the mapping H,G,I -> H; E,B -> E;
    all others -> C (coil, represented as '-' in the
    data set).
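
A sketch of the window/label generation just described (the sequence
and DSSP strings below are toy inputs, and all names are illustrative):

def windows(seq, dssp, width=11, stride=1):
    """Yield (window, label) pairs; the label is the mapped DSSP type
    of the central residue: H,G,I -> H; E,B -> E; all others -> C."""
    mapping = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}
    half = width // 2
    for i in range(half, len(seq) - half, stride):
        yield seq[i - half : i + half + 1], mapping.get(dssp[i], "C")

for window, label in windows("MKTAYIAKQRQISFVKSHFSRQ",
                             "--HHHHHHHH--EEEE------"):
    print(window, label)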

24
Data Encoding
  • Since this is for educational/recreational
    purposes, let's code our amino acids using three
    attributes:
  • Molecular weight
  • Kyte-Doolittle hydrophobicity
  • Charge
  • Note that we can also encode the identities of
    the amino acids using orthogonal binary vectors
    (just like in our neural network). Hua & Sun
    also employ this approach. (Both encodings are
    sketched below.)
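
A sketch of both encodings; the three-attribute values for the full
alphabet come from the tables on the next two slides, and the charge
values shown here are illustrative assumptions:

# Three attributes per residue: molecular weight, Kyte-Doolittle
# hydrophobicity, side-chain charge (only two residues shown here).
MW = {"A": 71.04, "R": 156.10}
KD = {"A": 1.8, "R": -4.5}
CHARGE = {"A": 0.0, "R": 1.0}    # assumed charges at neutral pH

def encode3(res):
    return [MW[res], KD[res], CHARGE[res]]

# Orthogonal binary ("one-hot") encoding of residue identity
AAS = "ACDEFGHIKLMNPQRSTVWY"
def encode_orthogonal(res):
    return [1.0 if a == res else 0.0 for a in AAS]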

25
Implementing the Data Encoding
  • Use a single Perl script
  • Generate three training output files, one for
    H/not-H, one for E/not-E, one for C/not-C, and
    three corresponding validation files. Use 50% of
    the data for training, 50% for validation.
  • Generate the encoding expected by SVM Light; see
    http://svmlight.joachims.org/
    Example: -1 1:0.43 3:0.12 9284:0.2 # abcdef
    (a label, then attribute:value pairs, then a
    comment); see the sketch below.
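
A sketch of emitting one training line per window in this format
(encode3 is the illustrative encoder from the previous slide; feature
indices must be increasing positive integers):

def svmlight_line(label, values, comment=""):
    # "<label> <index>:<value> ... # <comment>"
    feats = " ".join(f"{i}:{v}" for i, v in enumerate(values, start=1))
    return f"{label:+d} {feats}" + (f" # {comment}" if comment else "")

# e.g. an 11-residue window becomes 33 attributes (3 per residue);
# a single residue's worth is shown here:
print(svmlight_line(+1, [71.04, 1.8, 0.0], "toy window"))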
26
Kyte-Doolittle Hydrophobicity Scale
Alanine         1.8     Lysine         -3.9
Arginine       -4.5     Methionine      1.9
Asparagine     -3.5     Phenylalanine   2.8
Aspartic acid  -3.5     Proline        -1.6
Cysteine        2.5     Serine         -0.8
Glutamine      -3.5     Threonine      -0.7
Glutamic acid  -3.5     Tryptophan     -0.9
Glycine        -0.4     Tyrosine       -1.3
Histidine      -3.2     Valine          4.2
Isoleucine      4.5
Leucine         3.8

http://arbl.cvmbs.colostate.edu/molkit/hydropathy/scales.html
27
Amino Acid MWs
Alanine         Ala  A   71.04     Lysine          Lys  K  128.09
Arginine        Arg  R  156.10     Methionine      Met  M  131.04
Aspartic acid   Asp  D  115.03     Phenylalanine   Phe  F  147.07
Asparagine      Asn  N  114.04     Proline         Pro  P   97.05
Cysteine        Cys  C  103.01     Serine          Ser  S   87.03
Glutamic acid   Glu  E  129.04     Threonine       Thr  T  101.05
Glutamine       Gln  Q  128.06     Tryptophan      Trp  W  186.08
Glycine         Gly  G   57.02     Tyrosine        Tyr  Y  163.06
Histidine       His  H  137.06     Valine          Val  V   99.07
Hydroxyproline  Hyp  -  113.05
Isoleucine      Ile  I  113.08
Leucine         Leu  L  113.08