Title: SVM in Bioinformatics
1SVM in Bioinformatics
Understanding
Application
- Computer Science
- BISLab. KAIST
- Lee, Ki-Young
- 2003-12-10
2Contents
- Introduction
- What is SVM
- Basic idea, Why OHP
- Non-linear Separable Case
- Soft Margin Hyperplane
- Non-Linear SVM
- Multi-class Classification
- Success in using SVM
- Applications in Bioinformatics
- Areas, features, kernel function and parameters,
validation measures, current results, case by
case, current issues
- Conclusion
- Reference
3Introduction
SVM
Invented by Vapnik, as a by-product of statistical learning theory (SLT)
Simple, and training always finds the global optimum
Used for pattern recognition, regression, and
linear operator inversion
Considered too slow at first, but for most
applications this problem has been overcome since the
late 1990s
Small number of parameters to choose, so easy to use
4Basic Idea
Optimal Hyperplane (OHP): the separating hyperplane with maximum margin
simple kind of SVM (called an LSVM)
[Figure: swellfish and mackerel samples plotted by length and weight, separated by the OHP]
5Why maximum Margin Hyperplane
- Intuitively this feels safest
- There is some theory (using the VC dimension) that is
related to the proposition that this is a good
thing
- Empirically it works very well
[Figure: two candidate separating hyperplanes, one labeled OHP and one labeled not OHP]
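The margin argument above can be made concrete with a small sketch. The data points, the two candidate hyperplanes, and the function name `geometric_margin` are all made up for illustration; the only fact relied on is that the margin of a separating hyperplane w.x + b = 0 is the smallest value of y(w.x + b)/||w|| over the training points.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any labelled point to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical toy data: (length, weight) pairs; +1 and -1 are the two classes.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 3.0), (5.0, 3.0)]
labels = [-1, -1, 1, 1]

# Two hyperplanes that both separate the data without error.
margin_a = geometric_margin((1.0, 1.0), -5.0, points, labels)  # line x + y = 5
margin_b = geometric_margin((1.0, 0.0), -2.5, points, labels)  # line x = 2.5

print(margin_a, margin_b)
```

Both lines classify the toy data correctly, but the first leaves a wider gap to the nearest points, which is the property the OHP maximizes.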
6Soft Margin Hyperplanes
Linear separable
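The formula for this slide did not survive extraction. As a reminder, the standard soft-margin formulation from the SVM literature is given below; the symbols w, b, and the slack variables ξ_i are not taken from the slides, and C is the penalty parameter that appears later in the deck.

```latex
\min_{w,\,b,\,\xi} \ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n
```

A large C penalizes margin violations heavily and approaches the hard-margin case; a small C tolerates more violations in exchange for a wider margin.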
7Higher Dimensional Space
Hypersurface
Hyperplane
Only the inner product is needed to compute the dual
problem and the decision function
Kernelization
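A minimal sketch of why the inner product is enough: for a degree-2 polynomial kernel on 2-D inputs, the kernel value computed in the input space equals the inner product of explicitly mapped feature vectors. The feature map `phi` and the sample points are illustrative choices, not taken from the slides.

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a 2-D input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2, computed in the input space."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(z))  # inner product in the 3-D feature space
implicit = poly2_kernel(x, z)   # same value without ever building phi

print(explicit, implicit)
```

Because the dual problem and the decision function only use such inner products, the higher-dimensional space never has to be constructed explicitly.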
8Kernel Functions
Sigmoid Kernel, Fisher Kernel, String Kernel
There are many kernel functions!
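The kernel formulas on this slide were lost in extraction. The sketch below shows three common vector kernels in one widely used parameterization; the parameter names s, r, d, and std follow the case-study slide later in the deck, but the exact forms used by the author are an assumption.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def polynomial_kernel(x, z, s=1.0, r=1.0, d=2):
    """(s * <x, z> + r) ** d"""
    return (s * dot(x, z) + r) ** d

def gaussian_kernel(x, z, std=1.0):
    """exp(-||x - z||^2 / (2 * std^2))"""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2.0 * std ** 2))

def sigmoid_kernel(x, z, s=1.0, r=0.0):
    """tanh(s * <x, z> + r)"""
    return math.tanh(s * dot(x, z) + r)

print(polynomial_kernel((1.0, 2.0), (3.0, 4.0)))
print(gaussian_kernel((1.0, 2.0), (1.5, 2.5)))
print(sigmoid_kernel((1.0, 2.0), (0.1, 0.2)))
```

Fisher and string kernels are defined on sequences or generative models rather than fixed-length vectors, so they do not fit this simple signature.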
9Multi-class Classification
Using two-class SVM
One-against-others
One-against-one
Other variations
Using Multi-class SVM
10One-against-others method
Using N classifiers
Easy and simple
More than one classifier can generate
variation
All classifiers can generate -
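A minimal sketch of the one-against-others decision step. The tie-break used here, taking the class whose classifier returns the largest decision value, is a common convention and an assumption, not something stated on the slide; the class labels and numbers are made up.

```python
def one_against_others(decision_values):
    """Pick the class whose 'this class vs. the rest' classifier is most confident.

    decision_values maps each class label to the real-valued output f_k(x)
    of its binary SVM (positive means 'belongs to class k').
    """
    return max(decision_values, key=decision_values.get)

# Hypothetical outputs of N = 3 binary classifiers for one sample.
ambiguous = {"H": 0.4, "E": 1.3, "C": -0.7}       # two classifiers say '+'
all_negative = {"H": -0.9, "E": -1.2, "C": -0.1}  # every classifier says '-'

print(one_against_others(ambiguous))
print(one_against_others(all_negative))
```

Using the raw decision value rather than the sign gives an answer even when several classifiers, or none, claim the sample.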
11One-against-one method
Using NC2 = N(N-1)/2 classifiers
More accurate than one-against-others
Too many classifiers
Complicated, and more time required
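A minimal sketch of one-against-one prediction by majority vote. The `winners` table stands in for trained pairwise SVMs and is purely hypothetical; majority voting is one common way to combine the pairwise results, not necessarily the one the author had in mind.

```python
from collections import Counter
from itertools import combinations

def one_against_one(classes, pairwise_predict):
    """Majority vote over all N*(N-1)/2 pairwise classifiers.

    pairwise_predict(a, b) must return either a or b, the winner of the
    binary classifier trained on classes a and b only.
    """
    votes = Counter(pairwise_predict(a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0], votes

# Hypothetical stand-in for trained pairwise SVMs: a fixed winner per pair.
winners = {("H", "E"): "E", ("H", "C"): "H", ("E", "C"): "E"}
label, votes = one_against_one(["H", "E", "C"], lambda a, b: winners[(a, b)])

print(label, dict(votes))
```

With N = 3 classes this needs 3 classifiers; with N = 5 it already needs 10, which is the cost the slide refers to.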
12Multi-class SVM
13Success in using SVM
What kind of Data
all fields in ML
Which feature(s)
What kind of Kernel Function
How about the parameter values
Which method to solve multi-class classification
14SVMs in Bioinformatics
Introduction
Areas
Feature(s)
Kernel Function and Parameter values
Validation Measures
Current results
Case study
15Big Picture of Protein Synthesis
16Transcription
preRNA
17preRNA → mRNA (Splicing)
preRNA
mRNA
Where is the splice site?
18preRNA → mRNA (Splicing)
GU-AG motifs are parts of longer consensus
sequences that span the 5' and 3' splice sites
In vertebrates: 5' splice site 5'-AGGUAAGU-3',
3' splice site 5'-PyPyPyPyPyNCAG-3'
(Py = U or C, N = any nucleotide)
SVM
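As a baseline for comparison, the consensus motifs above can be scanned for with plain pattern matching. This is only a sketch: the RNA fragment is made up, and exact consensus matching is a naive filter, since real splice sites deviate from the consensus, which is the motivation for a learned classifier such as an SVM.

```python
import re

# Consensus motifs as written on the slide (RNA alphabet).
# Lookaheads are used so that overlapping candidates are all reported.
DONOR = re.compile(r"(?=AGGUAAGU)")             # 5' splice site
ACCEPTOR = re.compile(r"(?=[UC]{5}[ACGU]CAG)")  # 3' splice site: Py x5, N, CAG

def candidate_sites(rna):
    """Return 0-based start positions of exact consensus matches."""
    donors = [m.start() for m in DONOR.finditer(rna)]
    acceptors = [m.start() for m in ACCEPTOR.finditer(rna)]
    return donors, acceptors

# Hypothetical pre-mRNA fragment, made up for illustration.
rna = "CCAGGUAAGUAAAUCUCUCUGCAGGG"
donors, acceptors = candidate_sites(rna)

print(donors, acceptors)
```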
19Translation
Polypeptide
20Protein Structure Hierarchy
21Protein Primary Structure
22Protein Secondary Structure
23Protein Tertiary Structure
α-helix
turn or loop
β-sheet
24Protein Structure
Protein structure is very important
Protein structure can give clues to the function of a
protein
Traditional biological experiments are time
consuming and expensive
A computational method is needed!
SVM
25Areas in Bioinfo.
Microarray data analysis
Gene functional classification: Brown et al. (2000), Pavlidis et al. (2001)
Tissue classification: Mukherjee et al. (1999), Furey et al. (2000), Guyon et al. (2001)
Protein Synthesis
Splicing site prediction: Ying-Fei Sun (2003)
26Areas in Bioinfo.
Proteins
Structure prediction: Hua et al. (2001), Sujun Hua (2001), Yu-Dong Cai (2002, 2003), Chris H. Q. Ding (2001), Florian Markowetz (2003)
Fold recognition: Ding et al. (2001)
Family prediction: Jaakkola et al. (1998)
Function prediction: Jaakkola et al. (1998), C. Z. Cai (2003), Yu-Dong Cai (2003)
Protein-protein interaction prediction: Bock et al. (2001)
27Features
Physicochemical feature of residue / base
Hydrophobicity
Normalized volume
Polarity
Charge
Surface tension
28Features
Lower level structure
Amino acid composition
the number/frequency of each amino acid
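A minimal sketch of the amino acid composition feature: a 20-dimensional vector holding the frequency of each standard residue. The residue ordering and the handling of non-standard symbols are choices made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def amino_acid_composition(sequence):
    """Return the 20-D vector of residue frequencies for a protein sequence."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in sequence.upper():
        if residue in counts:  # skip non-standard symbols such as X
            counts[residue] += 1
    total = sum(counts.values())
    return [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]

features = amino_acid_composition("ACDDDEFGR")
print(features)
```

The resulting vector has the same length for every protein, so it can be fed directly to a vector kernel, at the price of discarding all sequence-order information.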
29Features
Pseudo-amino acid composition
Composition (AAC, 20-D) + Sequence (Correlation)
30Kernel Functions and parameters
Kernel functions
Parameters (for Kernel, C)
Cross validation
Gaussian Kernel
minimize VC-Dimension
Polynomial Kernel
Fisher Kernel
String Kernel
Spectrum kernel
Interpolated Kernel
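Of the sequence kernels listed above, the spectrum kernel is the simplest to sketch: it is the inner product of the k-mer count vectors of two sequences. The function names and the choice of k are illustrative; the two example strings are borrowed from the alignment example in the case study.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count every contiguous substring of length k."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def spectrum_kernel(s, t, k=3):
    """Inner product of the k-mer count vectors of two sequences."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    return sum(cs[kmer] * ct[kmer] for kmer in cs if kmer in ct)

value = spectrum_kernel("ACDDDEFGR", "ACDEFHR", k=2)
print(value)
```

Unlike composition features, this kernel works on sequences of different lengths without any explicit fixed-length encoding.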
31Validation Measures
Sensitivity (Sn) = TP / (TP + FN)
Specificity (Sp) = TN / (TN + FP)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Weighted average of Sn and Sp
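The measures above can be computed directly from the four confusion-matrix counts. The counts below are hypothetical and chosen only to illustrate the arithmetic, including the fact that accuracy is the average of Sn and Sp weighted by the share of positive and negative samples.

```python
def validation_measures(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return sn, sp, acc

# Hypothetical counts for a binary classifier.
tp, tn, fp, fn = 40, 45, 5, 10
sn, sp, acc = validation_measures(tp, tn, fp, fn)

# Accuracy as a weighted average of Sn and Sp.
positives, negatives = tp + fn, tn + fp
weighted = (sn * positives + sp * negatives) / (positives + negatives)

print(sn, sp, acc, weighted)
```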
32Current Results in Bioinfo.
SVM is Really a Good tool
Splicing sites: 85-92%
Protein secondary structure: 55-93.2%
Protein fold prediction: 28-56%
Protein domain prediction: 57-94.5%
Protein function prediction: 88-94%
Protein-protein interaction: 80-83%
at least as good as other tools
33Case Study 1
Identifying splicing sites in eukaryotic RNA
SVM approach
Ying-Fei Sun, Xiao-Dan Fan, Yan-Da Li
Computers in Biology and Medicine, 2003
34Identifying splicing sites in eukaryotic RNA
SVM approach
35Identifying splicing sites in eukaryotic RNA
SVM approach
Two sequences, ACDDDEFGR vs. ACDEFHR
6×1 + 1×(-2) + 2×(-1) = 2
6×1 + 1×(-2) + 2×(-1) = 2
6×1 + 1×(-1) + 2×(-2) = 1
Mismatch, substitution → -1
Gap or indel (insertion/deletion) → opening -2,
extension -1
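The scoring scheme above can be sketched as a function that scores an already-aligned pair of strings. The match score of +1 is inferred from the arithmetic on the slide, and the two gapped alignments below are assumptions, since the alignments shown on the original slide were lost; they are chosen so that the scores reproduce the totals 2 and 1.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Score two already-aligned, equal-length strings; '-' marks a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            # The first column of a gap pays the opening cost, later ones the extension cost.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score

one_gap = alignment_score("ACDDDEFGR", "ACD--EFHR")   # one gap of length 2
two_gaps = alignment_score("ACDDDEFGR", "AC-D-EFHR")  # two gaps of length 1

print(one_gap, two_gaps)
```

For simplicity the sketch does not distinguish a gap in the first sequence from a gap in the second when deciding whether a gap is being extended.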
36Identifying splicing sites in eukaryotic RNA
SVM approach
37Identifying splicing sites in eukaryotic RNA
SVM approach
SVM
With and without secondary structural information
Polynomial kernel (s = 1, r = 1, d = 4), Gaussian
kernel (std = 20)
Three-fold cross validation
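A minimal sketch of how three-fold cross validation partitions the data. How the authors actually split their samples is not stated on the slide; contiguous folds are used here for simplicity, and in practice the samples would normally be shuffled first.

```python
def k_fold_indices(n_samples, k=3):
    """Yield (train, test) index lists for k roughly equal, contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(k_fold_indices(10, k=3))
for train, test in folds:
    print(train, test)
```

Each sample is used for testing exactly once, and the reported measure is the average over the k test folds.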
38Identifying splicing sites in eukaryotic RNA
SVM approach
39Case Study 2
A Novel Method of Protein Secondary Structure
Prediction with High SOV SVM Approach
40A Novel Method of Protein Secondary Structure
Prediction with High SOV SVM Approach
SVM
Kernel: Gaussian RBF
Parameter values: γ = 0.10, C = 1.50
Binary and tertiary classifiers (3 cases),
One-against-others method
41A Novel Method of Protein Secondary Structure
Prediction with High SOV SVM Approach
Tertiary Classifiers
42A Novel Method of Protein Secondary Structure
Prediction with High SOV SVM Approach
Segment Overlap Measure (SOV)
Type and position of secondary structure segments
rather than a per-residue assignment of
conformational state
Natural variation of segment boundaries
Ambiguity in the position of segment ends
CCEEECCCCCCEEEEEECCC
CCCCCCCEEEEECCCEECCC
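The sketch below does not implement the SOV formula itself. It only contrasts the two views mentioned above on the two example strings: the per-residue agreement and the segments (runs of one state) that a segment-based measure compares. Which string is the observed structure and which is the prediction is not recoverable from the slide, so the names `a` and `b` are neutral.

```python
from itertools import groupby

def per_residue_accuracy(first, second):
    """Fraction of positions with the same state (often called Q3 for 3 states)."""
    return sum(x == y for x, y in zip(first, second)) / len(first)

def segments(states):
    """Split a state string into (state, start, end) runs, 0-based inclusive."""
    runs, pos = [], 0
    for state, group in groupby(states):
        length = len(list(group))
        runs.append((state, pos, pos + length - 1))
        pos += length
    return runs

a = "CCEEECCCCCCEEEEEECCC"
b = "CCCCCCCEEEEECCCEECCC"
q = per_residue_accuracy(a, b)
strands_a = [s for s in segments(a) if s[0] == "E"]
strands_b = [s for s in segments(b) if s[0] == "E"]

print(q, strands_a, strands_b)
```

Both strings contain two E segments, yet the segments are shifted relative to each other; a per-residue score and a segment-overlap score weigh that kind of disagreement differently.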
43A Novel Method of Protein Secondary Structure
Prediction with High SOV SVM Approach
Result
44A Novel Method of Protein Secondary Structure
Prediction with High SOV SVM Approach
Result
45Conclusion
- SVM is a useful tool in Bioinformatics
- Microarray data analysis
- Splicing Site Prediction
- Protein Structure Prediction
- Protein Function Prediction
- Protein-Protein Interaction Prediction
- Currently, many papers are being published
- More research is needed
- Feature selection, negative data generation,
kernel function and parameters
46Reference
- An Introduction to Lagrange Multipliers, Steuard Jensen, http://home.uchicago.edu/sbjensen/Tutorials/Lagrange.html
- Linear Algebra and Its Applications, David C. Lay, 1999, second edition
- Some Mathematical Tools for Machine Learning, Chris Burges, August, 2003
- Statistical Learning and VC Theory, Peter Bartlett, ISCAS, May 2001
- A Tutorial on Support Vector Machines for Pattern Recognition, Christopher J.C. Burges, Data Mining and Knowledge Discovery, 1998
- Support Vector Learning, B. Schölkopf, Ph.D. Thesis, 1997
- Kernel methods: a survey of current techniques, Colin Campbell, 2002
47Reference
- Predicting protein-protein interactions from primary structure, Joel R. Bock and David A. Gough, Bioinformatics, 2001
- Support Vector Machine Classification of Microarray Gene Expression Data, Michael P. S. Brown, 1999
- Protein function classification via support vector machine approach, C.Z. Cai, Mathematical Biosciences, 2003
- Prediction of Secondary Protein Structure with Binary Coding Patterns of Amino Acid and Nucleotide Physicochemical Properties, NIKOLA, 2002
- Identifying splicing sites in eukaryotic RNA: support vector machine approach, Ying-Fei Sun, Computers in Biology and Medicine, 2003
- A Novel Method of Protein Secondary Structure Prediction with High Segment Overlap Measure: Support Vector Machine Approach, Sujun Hua, J. Mol. Biol., 2001
- Gene Expression data analysis of human lymphoma using support vector machines and output coding ensembles, Giorgio Valentini, Artificial Intelligence in Medicine, 2002
48Reference
- Support Vector Machines for Prediction of Protein Domain Structural Class, Yu-Dong Cai, Xiao-Jun Liu, Xue-Biao Xu and Kuo-Chen Chou, J. theor. Biol., 2003
- Support vector machines for predicting rRNA-, RNA-, and DNA-binding proteins from amino acid sequence, Yu-dong Cai, Shuo Liang Lin, BBA, 2003
- Support Vector Machines for Predicting HIV Protease Cleavage Sites in Protein, Yu-Dong Cai, Xiao-Jun Liu, Xue-Biao Xu, Kuo-Chen Chou, J. Comput. Chem., 2002
- Transductive Support Vector Machines for Classification of Microarray Gene Expression Data, R. Semolini, 2003
49Questions or Comments