Title: Support Vector Machine: Introduction and Applications
1 Support Vector Machine: Introduction and Applications
2 Outline
- Basic concept of SVM
- SVC formulations
- Kernel function
- Model selection (tuning SVM hyperparameters)
- SVM applications
3 Introduction
- Learning
- supervised learning (classification)
- unsupervised learning (clustering)
- Data classification
- training
- testing
4 Basic Concept of SVM
- Consider the linearly separable case
- Training data from two classes
5 (No Transcript)
6 Decision Function
- f(x) > 0 ⇒ class 1
- f(x) < 0 ⇒ class 2
- How do we find a good w and b?
- There are many possible choices of (w, b)
7 Support Vector Machines
- A promising technique for data classification
- Based on statistical learning theory: maximize the distance (margin) between the two classes
- Linear separating hyperplane
8 - Maximal margin: the distance between the two classes' closest points is maximized
9 Questions
- 1. Linearly non-separable case
- 2. How to solve for w and b?
- 3. Is this (w, b) good?
- 4. Multi-class case
10 Method to Handle the Non-separable (Nonlinear) Case
- Map the input data into a higher-dimensional feature space
11 Example
12 - Find a linear separating hyperplane
13 - Questions
- 1. How to choose the mapping φ?
- 2. Is it really better? Yes.
- Sometimes, even in a high-dimensional space, the data may still not be separable ⇒ allow training errors
14 Example
- Non-linear curves in the input space correspond to a linear hyperplane in a high-dimensional space (the feature space)
15 SVC Formulations (the Soft-Margin Hyperplane)
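As a sketch, the standard soft-margin SVC primal, with penalty parameter C and slack variables ξ_i for the l training examples, can be written as:

\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{l}\xi_{i}
\quad\text{subject to}\quad y_{i}\left(w^{\top}\phi(x_{i})+b\right)\ \ge\ 1-\xi_{i},\qquad \xi_{i}\ \ge\ 0,\quad i=1,\dots,l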
16 If f is convex, a point x satisfying the KKT conditions is optimal
17 How to Solve an Optimization Problem with Constraints? Use Lagrange Multipliers
- Given a constrained optimization problem (written out below)
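A standard sketch for a problem with inequality constraints and its Lagrangian, with one multiplier α_i per constraint, is:

\min_{x}\ f(x)\quad\text{subject to}\quad g_{i}(x)\le 0,\ \ i=1,\dots,m
L(x,\alpha)\ =\ f(x)+\sum_{i=1}^{m}\alpha_{i}\,g_{i}(x),\qquad \alpha_{i}\ \ge\ 0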
18 Why is the Dual Better than the Primal?
- Consider the following primal problem
- (P) variables: w (dimension of φ(x), a very big number), b (one variable), ξ (l variables)
- (D) variables: l
- Derive its dual.
19 Derive the Dual
- The primal Lagrangian for the problem is formed from the objective and the constraints with multipliers.
- The corresponding dual is found by differentiating with respect to w, ξ, and b.
20 Resubstituting the obtained relations into the primal yields the dual objective function W(α); hence, maximizing the Lagrangian over the dual variables is equivalent to maximizing W(α) (written out below).
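As a sketch consistent with the soft-margin primal above, the resulting dual problem in its standard form is:

\max_{\alpha}\ W(\alpha)\ =\ \sum_{i=1}^{l}\alpha_{i}\ -\ \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_{i}\alpha_{j}\,y_{i}y_{j}\,\phi(x_{i})^{\top}\phi(x_{j})
\text{subject to}\quad 0\le\alpha_{i}\le C,\qquad \sum_{i=1}^{l}\alpha_{i}y_{i}=0,\qquad\text{with}\quad w=\sum_{i=1}^{l}\alpha_{i}y_{i}\,\phi(x_{i})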
21 (No Transcript)
22 - The primal and dual problems share the same KKT conditions
- Primal: the number of variables is very large (a shortcoming)
- Dual: only l variables
- The high-dimensional inner product φ(x_i)·φ(x_j) dominates the computation; reducing its computational cost matters
- For special choices of φ, this inner product can be calculated efficiently (the kernel trick)
23 Kernel Function
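As a sketch, the kernels commonly used in practice (and offered by LIBSVM) replace the inner product φ(x_i)·φ(x_j):

K_{\text{linear}}(x_{i},x_{j})=x_{i}^{\top}x_{j}
K_{\text{poly}}(x_{i},x_{j})=\left(\gamma\,x_{i}^{\top}x_{j}+r\right)^{d}
K_{\text{RBF}}(x_{i},x_{j})=\exp\!\left(-\gamma\,\|x_{i}-x_{j}\|^{2}\right)
K_{\text{sigmoid}}(x_{i},x_{j})=\tanh\!\left(\gamma\,x_{i}^{\top}x_{j}+r\right)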
24 (No Transcript)
25 Model Selection (Tuning SVM Hyperparameters)
- Cross-validation can help avoid overfitting
- Example: 10-fold cross-validation. The l training data are split into 10 groups; each time, 9 groups are used as training data and 1 group as test data (see the sketch below).
- LOO (leave-one-out): cross-validation with l groups; each time, l-1 data are used for training and 1 for testing.
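A minimal sketch (not from the slides) of 10-fold and leave-one-out cross-validation, assuming scikit-learn and using its bundled breast cancer data as a stand-in for the l training examples:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)        # stand-in data set
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
print(cross_val_score(clf, X, y, cv=10).mean())   # 10-fold: 9 groups train, 1 group tests
# LOO on a small subset to keep the run short: l-1 examples train, 1 tests, l times
print(cross_val_score(clf, X[:100], y[:100], cv=LeaveOneOut()).mean())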
26 Model Selection
- A commonly used method for model selection is the grid method: train with every (C, γ) pair on a grid and keep the pair with the best cross-validation accuracy (see the sketch below).
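A minimal sketch (not from the slides) of the grid method, assuming scikit-learn; the grid values are illustrative only:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
param_grid = {"C": [2**k for k in range(-5, 16, 4)],       # coarse grid over the penalty C
              "gamma": [2**k for k in range(-15, 4, 4)]}   # and the RBF parameter gamma
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3) # 3-fold CV per grid point
search.fit(X, y)
print(search.best_params_, search.best_score_)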
27 Model Selection of SVMs Using a GA Approach
- Peng-Wei Chen, Jung-Ying Wang and Hahn-Ming Lee, 2004 IJCNN International Joint Conference on Neural Networks, 26-29 July 2004.
- Abstract: A new automatic search methodology for model selection of support vector machines, based on a GA-based tuning algorithm, is proposed to search for adequate hyperparameters of SVMs.
28 Model Selection of SVMs Using a GA Approach
- Procedure: GA-based Model Selection Algorithm
- Begin
-   Read in dataset
-   Initialize hyperparameters
-   While (not termination condition) do
-     Train SVMs
-     Estimate generalization error
-     Create new hyperparameters with the tuning algorithm
-   End
-   Output the best hyperparameters
- End
29 Experiment Setup
- The initial population is selected at random; each chromosome is a single bit string of fixed length 20.
- Each bit can take the value 0 or 1.
- The first 10 bits encode the integer value of C, and the remaining 10 bits encode the decimal value of the kernel parameter σ.
- The suggested population size N = 20 is used.
- A crossover rate of 0.8 and a mutation rate of 1/20 = 0.05 are chosen (see the sketch below).
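A minimal sketch (not the paper's implementation) of a GA search with these settings, assuming scikit-learn; the bit decoding of C and the kernel parameter is my reading of the slide, not the paper's exact scheme:

import random
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)        # stand-in data set
POP, BITS, CX_RATE, MUT_RATE = 20, 20, 0.8, 0.05  # values from the slide

def decode(bits):
    """First 10 bits -> integer C, last 10 bits -> fractional kernel parameter."""
    c = 1 + int("".join(map(str, bits[:10])), 2)             # C in 1..1024
    g = 1e-4 + int("".join(map(str, bits[10:])), 2) / 1023   # gamma in (0, 1]
    return c, g

def fitness(bits):
    c, g = decode(bits)
    return cross_val_score(SVC(C=c, gamma=g), X, y, cv=3).mean()  # estimated generalization

population = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(POP)]
for _ in range(5):                                    # termination condition: 5 generations
    parents = sorted(population, key=fitness, reverse=True)[:POP // 2]
    population = []
    while len(population) < POP:
        a, b = random.sample(parents, 2)
        if random.random() < CX_RATE:                 # one-point crossover
            cut = random.randrange(1, BITS)
            a = a[:cut] + b[cut:]
        population.append([bit ^ (random.random() < MUT_RATE) for bit in a])  # bit-flip mutation

print("best C, gamma:", decode(max(population, key=fitness)))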
30 SVM Application: Breast Cancer Diagnosis
- Software: WEKA
31 Coding for Weka
- @relation breast_training
- @attribute a1 real
- @attribute a2 real
- @attribute a3 real
- @attribute a4 real
- @attribute a5 real
- @attribute a6 real
- @attribute a7 real
- @attribute a8 real
- @attribute a9 real
- @attribute class {2,4}
32 Coding for Weka
- @data
- 5 ,1 ,1 ,1 ,2 ,1 ,3 ,1 ,1 ,2
- 5 ,4 ,4 ,5 ,7 ,10,3 ,2 ,1 ,2
- 3 ,1 ,1 ,1 ,2 ,2 ,3 ,1 ,1 ,2
- 6 ,8 ,8 ,1 ,3 ,4 ,3 ,7 ,1 ,2
- 8 ,10,10,7 ,10,10,7 ,3 ,8 ,4
- 8 ,10,5 ,3 ,8 ,4 ,4 ,10,3 ,4
- 10,3 ,5 ,4 ,3 ,7 ,3 ,5 ,3 ,4
- 6 ,10,10,10,10,10,8 ,10,10,4
- 1 ,1 ,1 ,1 ,2 ,10,3 ,1 ,1 ,2
- 2 ,1 ,2 ,1 ,2 ,1 ,3 ,1 ,1 ,2
- 2 ,1 ,1 ,1 ,2 ,1 ,1 ,1 ,5 ,2
33 Running Results Using Weka 3.3.6
- Predictor: Support Vector Machines (called the Sequential Minimal Optimization, SMO, algorithm in Weka)
- Weka SMO result for 400 training data
34 Weka SMO result for 283 test data
35 Software and Model Selection
- Software: LIBSVM
- Mapping function: use the Radial Basis Function (RBF) kernel
- Find the best penalty parameter C and kernel parameter g
- Use cross-validation to do the model selection
36 LIBSVM Model Selection Using the Grid Method
- -c 1000 -g 10: 3-fold accuracy 69.8389%
- -c 1000 -g 1000: 3-fold accuracy 69.8389%
- -c 1 -g 0.002: 3-fold accuracy 97.0717% (winner)
- -c 1 -g 0.004: 3-fold accuracy 96.9253%
37 Coding for LIBSVM
- Each training example is written in LIBSVM's sparse format: <label> <index1>:<value1> ... <index9>:<value9>
- For example, a row with class 2 and feature values 5,1,1,1,2,1,3,1,1 is written as: 2 1:5 2:1 3:1 4:1 5:2 6:1 7:3 8:1 9:1
38 Summary
39 Summary
40 Multi-class SVM
- One-against-all method: k SVM models (k = the number of classes); the i-th SVM is trained with all examples in the i-th class as positive and the rest as negative
- One-against-one method: k(k-1)/2 classifiers, where each one is trained on the data from two classes (see the sketch below)
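A minimal sketch (not from the slides) contrasting the two strategies, assuming scikit-learn; the iris data set is just a stand-in k = 3 class problem:

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                        # k = 3 classes
ova = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)   # k models (one-against-all)
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)    # k(k-1)/2 models (one-against-one)
print(len(ova.estimators_), len(ovo.estimators_))        # 3 and 3 for k = 3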
41 SVM Applications in Bioinformatics
- SVM-Cabins: Prediction of Solvent Accessibility Using Accumulation Cutoff Set and Support Vector Machine
- Prediction of protein secondary structure
- SVM application in protein fold assignment
42 Solvent Accessibility
- Water can contact residues at the surface of a protein
- Prediction of the solvent-accessible surface area (ASA) helps us understand the complete tertiary structure of proteins
43 Motivation
- Traditionally, ASA prediction is treated as a binary or multi-class classification problem, but the arbitrary choice of cutoff thresholds is a drawback when developing a prediction system.
- To overcome this, many statistical, regression, and machine learning methods have recently been proposed to predict the real value of solvent accessibility.
- We propose a novel method for real-value prediction of solvent accessibility.
44 Related Methods
- Statistical information (Wang et al., 2004)
- Multiple linear regression (Wang et al., 2005)
- Neural networks (Ahmad et al., 2003; Garg et al., 2005)
- Neural network-based regression (Adamczak et al., 2004)
- Support vector regression (Yuan and Huang, 2004)
45 Data Sets
- Rost and Sander data set (RS126)
- Cuff and Barton data set (CB502)
46 Evolutionary Information
- We use BLASTP to generate multiple sequence alignments of proteins
- An expectation value (E-value) of 0.01 is used, searching against the non-redundant protein sequence database (NCBI nr)
- The alignments are represented as profiles, i.e. position-specific scoring matrices (PSSMs)
47 Coding Scheme
- A moving window of 13 neighboring residues is used, and each window position has 22 values
- The data obtained from the PSSM, which include the 20 amino acid substitution scores, an indel (insertion/deletion) term, and an entropy term, are used directly as input to our algorithm (see the sketch below)
- The prediction is made for the central residue of the window
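A minimal sketch (not the paper's code) of this sliding-window coding, assuming a per-residue profile array of shape (sequence_length, 22); the names pssm, WINDOW, and encode are illustrative only:

import numpy as np

WINDOW = 13                                   # window size from the slide
HALF = WINDOW // 2

def encode(pssm):
    """Return one (WINDOW * 22)-dimensional feature vector per residue,
    zero-padding window positions that fall outside the sequence."""
    length, width = pssm.shape
    padded = np.vstack([np.zeros((HALF, width)), pssm, np.zeros((HALF, width))])
    return np.stack([padded[i:i + WINDOW].ravel() for i in range(length)])

pssm = np.random.rand(30, 22)                 # dummy profile for a 30-residue chain
print(encode(pssm).shape)                     # (30, 286) = (residues, 13 * 22)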
48 Intuitive Idea
- Use a multi-class classifier to assign a real value to the test datum
49 Two Problems
- The performance of an SVM is poor when the number of labeled positive data is small.
- This is mainly because the optimal hyperplane of the SVM may be biased when the positive data are much fewer than the negative data.
- Three traditional approaches to address this
- The crisp-set problem
50 Algorithm to Transfer N Binary-class SVM Models to Real Values of Solvent Accessibility
- We construct 13 accumulation cabins from the two end-points of the ASA real-value range.
- They are [0, 0], [0, 5], [0, 10], [0, 20], [0, 30], [0, 40], [0, 50], [50, 100], [60, 100], [70, 100], [80, 100], [90, 100], and [100, 100].
51 SVM Model Selection
52 Accuracy for Each Binary-class SVM Model
53 The Main Output Vector Patterns (over 97%) for the 13 Binary-class SVM Models
54 Algorithm to Assign the Prediction Result
55 An Example
- For example, consider the output vector 1110000111111
- Four binary-class SVM models predict it as belonging to the positive class, namely [0, 20], [0, 30], [0, 40], and [0, 50]
- We can infer that this datum must lie inside the cabin range [0, 20].
- We get the same result by taking the intersection of the above four contiguous positive cabin ranges
56 An Example
- The test datum should not fall inside the nine cabin ranges [0, 0], [0, 5], [0, 10], [50, 100], [60, 100], [70, 100], [80, 100], [90, 100], and [100, 100]
- So we can further infer that the datum lies inside the cabin range [10, 20]
- We can also use the set difference between the cabin range [0, 20] and the above nine negative cabin ranges to get the result [10, 20].
- Finally, we use the middle point, 15, of the cabin range [10, 20] as our real-valued ASA prediction (see the sketch below).
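A minimal sketch of the cabin set operations described in the two examples above; the cabin list comes from the slides, while the function and variable names are illustrative only:

CABINS = [(0, 0), (0, 5), (0, 10), (0, 20), (0, 30), (0, 40), (0, 50),
          (50, 100), (60, 100), (70, 100), (80, 100), (90, 100), (100, 100)]

def assign_asa(positive, negative):
    """positive/negative: cabin ranges the binary SVMs place the datum inside / outside."""
    lo = max(a for a, b in positive)          # intersection of the positive cabins
    hi = min(b for a, b in positive)
    for a, b in negative:                     # remove negative cabins anchored at 0 or 100
        if a == 0:
            lo = max(lo, b)
        if b == 100:
            hi = min(hi, a)
    return (lo + hi) / 2                      # midpoint as the real-valued ASA prediction

positive = [(0, 20), (0, 30), (0, 40), (0, 50)]
negative = [c for c in CABINS if c not in positive]
print(assign_asa(positive, negative))         # 15.0, matching the worked example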
57 Algorithm to Assign the Prediction Result
- Auto-correction of one-bit errors
- Because each binary-class SVM has at least 77% accuracy, when a 1 appears inside a run of contiguous 0s we are confident enough to correct it.
- About 1.5% of the test data have vector patterns with a one-bit error inside two contiguous 0s.
58 Algorithm to Assign the Prediction Result
- For the last 1.5% of test data patterns we could use our previous methods, such as a look-up table or multiple linear regression, to assign the real value of ASA.
- Here, however, we use the simplest approach: we compute the average ASA of each residue type in our experimental data set and assign that average to a new residue as its prediction.
- For example, the average ASA for Alanine (A) residues in the Barton502 data set was 22.8, which could then be assigned to all Alanine residues in the last 1.5% of test data.
59 Validation Method
- Seven-fold cross-validation was carried out for the RS126 data set.
- Five-fold cross-validation was carried out for the CB502 data set.
60 Assessment of Prediction Performance
- Mean absolute error (MAE), defined below
- Correlation coefficient between the predicted and experimental values of ASA
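As a sketch, the standard MAE over N residues, with predicted and observed (experimental) ASA values, is:

\mathrm{MAE}\ =\ \frac{1}{N}\sum_{i=1}^{N}\left|\,\mathrm{ASA}_{i}^{\mathrm{pred}}-\mathrm{ASA}_{i}^{\mathrm{obs}}\,\right|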
61 Results and Discussion
62 Results and Discussion
63 Results and Discussion
Table 2. Variation in prediction error for different ranges of ASA
64 Table 3. Mean absolute error for different amino acid types (all values are on a percentage scale)
65 Table 4. Effect of protein length on mean absolute error (number of protein chains)
66 Table 5. Comparison with other real-value prediction methods
67 Introduction to Secondary Structure
- The prediction of protein secondary structure is an important step toward determining the structural properties of proteins.
- Secondary structure consists of local folding regularities maintained by hydrogen bonds and is traditionally subdivided into three classes: alpha-helices, beta-sheets, and coil.
68 (No Transcript)
69 The Secondary Structure Prediction Task
70 Coding Example: Protein Secondary Structure Prediction
- Given an amino-acid sequence
- Predict a secondary-structure state (α, β, coil) for each residue in the sequence
- Coding: consider a moving window of n (typically 13-21) neighboring residues
- FGWYALVLAMFFYOYQEKSVMKKGD
71 Methods
- Statistical information (Figureau et al., 2003; Yan et al., 2004)
- Neural networks (Qian and Sejnowski, 1988; Rost and Sander, 1993; Pollastri et al., 2002; Cai et al., 2003; Kaur and Raghava, 2004; Wood and Hirst, 2004; Lin et al., 2005)
- Nearest-neighbor algorithms
- Hidden Markov models
- Support vector machines (Hua and Sun, 2001; Hyunsoo and Haesun, 2003; Ward et al., 2003; Guo et al., 2004)
72 Milestones
- In 1988, neural networks first achieved about 62% accuracy (Qian and Sejnowski, 1988; Holley and Karplus, 1989).
- In 1993, using evolutionary information, a neural network system improved the prediction accuracy to over 70% (Rost and Sander, 1993).
- Recently there have been approaches using neural networks (e.g. Baldi et al., 1999; Petersen et al., 2000; Pollastri and McLysaght, 2005) which achieve even higher accuracy (> 78%).
73 Benchmark (Data Sets Used in Protein Secondary Structure Prediction)
- Rost and Sander data set (Rost and Sander, 1993), referred to as RS126
- Note that the RS126 data set consists of 25,184 data points in three classes, of which 47% are coil, 32% are helix, and 21% are strand.
- Cuff and Barton data set (Cuff and Barton, 1999), referred to as CB513
- The prediction accuracy is verified by 7-fold cross-validation.
74 Secondary Structure Assignment
- Assignment follows the DSSP (Dictionary of Secondary Structures of Proteins) algorithm (Kabsch and Sander, 1983), which distinguishes eight secondary structure classes.
- We converted the eight types into three classes in the following way: H (α-helix), I (π-helix), and G (3₁₀-helix) as helix (α); E (extended strand) as β-strand (β); and all others as coil (c).
- Different conversion methods influence the prediction accuracy to some extent, as discussed by Cuff and Barton (Cuff and Barton, 1999).
75 Assessment of Prediction Accuracy
- Overall three-state accuracy Q3 (Qian and Sejnowski, 1988; Rost and Sander, 1993). Q3 is calculated as shown below.
- N is the total number of residues in the test data sets, and q_s is the number of residues of secondary structure type s that are predicted correctly.
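With these definitions, the standard form of Q3 can be sketched as:

Q_{3}\ =\ \frac{\sum_{s\in\{\mathrm{helix},\,\mathrm{strand},\,\mathrm{coil}\}} q_{s}}{N}\times 100\%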
76 Assessment of Prediction Accuracy
- A more refined measure of accuracy is the Matthews correlation coefficient (MCC), introduced by Matthews (1975) and computed per class as shown below.
- TP_i, TN_i, FP_i and FN_i are the numbers of true positives, true negatives, false positives, and false negatives for class i, respectively. A higher MCC is better.
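Using the quantities defined above, the standard per-class form of the MCC is:

\mathrm{MCC}_{i}\ =\ \frac{TP_{i}\,TN_{i}-FP_{i}\,FN_{i}}{\sqrt{(TP_{i}+FP_{i})(TP_{i}+FN_{i})(TN_{i}+FP_{i})(TN_{i}+FN_{i})}}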
77 Support Vector Machine Predictor
- We use the software LIBSVM (Chang and Lin, 2005) as our SVM predictor
- The RBF kernel is used for all experiments
- Optimal parameters are chosen for the support vector machines: the pair C = 10 and γ = 0.01 achieves the best prediction rate (see the sketch below)
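A minimal sketch (not from the slides) of a predictor with this configuration; scikit-learn's SVC stands in for LIBSVM, and the random window features and 3-state labels are dummies:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 13 * 20)), rng.integers(0, 3, 200)  # dummy window features
X_test, y_test = rng.random((50, 13 * 20)), rng.integers(0, 3, 50)

clf = SVC(kernel="rbf", C=10, gamma=0.01)     # parameters reported on the slide
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))              # Q3-style accuracy on the held-out data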
78 Coding Scheme
79 Coding Scheme for Support Vector Machines
- We use BLASTP to generate the alignments (profiles) of the proteins in our database
- An expectation value (E-value) of 10.0 is used, searching against the non-redundant protein sequence database (NCBI nr)
- The profile data obtained from BLASTPGP are normalized to [0, 1] and then used as inputs to our SVM predictor.
80 Last position-specific scoring matrix computed, weighted observed percentages rounded down, information per position, and relative weight of gapless real matches to pseudocounts
- (Excerpt of a PSSM for the first residues of the query sequence: one row per residue, with substitution scores and observed percentages for the 20 amino acid columns A R N D C Q E G H I L K M F P S T W Y V, followed by the per-position information content and relative weight.)
81 Results
- Results for the RS126 (Rost and Sander) protein set
- (Using seven-fold cross-validation)
82 Results
- Results for the CB513 protein set
- (Using seven-fold cross-validation)
83 SVM Application in Protein Fold Assignment
- "Fine-grained protein fold assignment by support vector machines using generalized n-peptide coding schemes and jury voting from multiple-parameter sets"
- Chin-Sheng Yu, Jung-Ying Wang, Jin-Moon Young, P.-C. Lyu, Chih-Jen Lin, Jenn-Kang Hwang
- Proteins: Structure, Function, and Genetics, 50, 531-536 (2003).
84 Data Sets
- The Ding and Dubchak data set, which consists of 386 proteins from the 27 most populated SCOP folds, in which protein pairs have sequence identity below 35% for aligned subsequences longer than 80 residues.
- These 27 protein folds cover most major structural classes and each contains at least 7 proteins.
85 Coding Scheme
- We denote the coding schemes by X if all 20 amino acids are used
- X when the amino acids are classified into four groups: charged, polar, aromatic, and nonpolar
- X if predicted secondary structures are used
- The symbol X takes the values D, T, Q, and P, denoting the distributions of dipeptides, 3-peptides, and 4-peptides, respectively.
86 Methods
87 Results
88 Results
89 Results
90 Results
91 Structure Example: Jury SVM Predictor
92 Structure Example: SVM Combiner
93