SVM in Bioinformatics - PowerPoint PPT Presentation

About This Presentation
Title:

SVM in Bioinformatics

Description:

Areas, features, kernel function and parameters, validation measures, current ... Primate and rodent subsets of Genetic Sequence Data Bank. Data acquisition ... – PowerPoint PPT presentation

Slides: 50
Provided by: aiKai

Transcript and Presenter's Notes

Title: SVM in Bioinformatics


1
SVM in Bioinformatics
Understanding
Application
  • Computer Science
  • BISLab. KAIST
  • Lee, Ki-Young
  • 2003-12-10

2
Contents
  • Introduction
  • What is SVM
  • Basic idea, Why OHP
  • Non-linear Separable Case
  • Soft Margin Hyperplane
  • Non-Linear SVM
  • Multi-class Classification
  • Success in using SVM
  • Applications in Bioinformatics
  • Areas, features, kernel function and parameters,
    validation measures, current results, case by
    case, current Issues
  • Conclusion
  • Reference

3
Introduction
SVM
Invented by Vapnik, as a by-product of statistical learning theory (SLT)
Simple, and training always finds the global optimum
Used for pattern recognition, regression, and
linear operator inversion
Considered too slow at first, but for most
applications this problem has been overcome since the
late 1990s
Few parameters to choose, so easy to use
4
Basic Idea
Optimal Hyperplane (OHP): the separating hyperplane with maximum margin
The simplest kind of SVM (called a linear SVM, LSVM)
Figure: swellfish vs. mackerel, plotted by weight and length
5
Why maximum Margin Hyperplane
  • Intuitively this feels safest

  • There is some theory (using the VC dimension) that
    supports the proposition that this is a good
    thing

  • Empirically it works very well

Figure labels: OHP, not OHP

6
Soft Margin Hyperplanes
Linear separable
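A common way to write the soft-margin hyperplane problem is sketched below. The notation (weight vector w, bias b, slack variables ξ_i, penalty parameter C, labels y_i in {-1, +1}) is the usual textbook one and is assumed here, not taken from the slides.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w\cdot x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n
```

With all ξ_i = 0 this reduces to the maximum-margin problem for linearly separable data. A larger C penalizes margin violations more heavily.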
7
Higher Dimensional Space
Hypersurface
Hyperplane
Only inner products are needed to compute the dual
problem and the decision function
Kernelization
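As a small illustration of why only inner products are needed, the sketch below checks that a degree-2 polynomial kernel on 2-D inputs gives the same value as an inner product in an explicit 3-D feature space. The feature map `phi` and the sample points are illustrative choices, not taken from the slides.

```python
import math

def dot(a, b):
    """Ordinary inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (illustrative)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(y))   # inner product in the higher-dimensional space
implicit = poly2_kernel(x, y)    # same value, computed in the input space
print(explicit, implicit)
```

The kernel never builds `phi(x)`, which is what makes very high-dimensional feature spaces affordable.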
8
Kernel Functions
Sigmoid Kernel, Fisher Kernel, String Kernel
There are many kernel functions!
9
Multi-class Classification
Using two-class SVM
One-against-others
One-against-one
Other variations
Using multi-class SVM
10
One-against-others method
Uses N classifiers, one per class
Easy and simple
More than one classifier can output +
(resolved by variations of the method)
All classifiers can output -
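A minimal sketch of the one-against-others decision step, assuming each of the N binary classifiers returns a real-valued score where a positive value means "this class". Picking the largest score when several (or none) are positive is one common variation and is an assumption here. The class labels H, E, C and the scores are made up.

```python
def one_vs_rest_predict(scores):
    """Pick a class from per-class binary classifier scores.

    scores: dict mapping class label -> real-valued output of that
    class's "this class vs. all others" classifier.
    Returns (label, status), where status reports the ambiguity case.
    """
    positives = [c for c, s in scores.items() if s > 0]
    label = max(scores, key=scores.get)  # largest score wins in every case
    if len(positives) == 1:
        status = "unique"
    elif len(positives) > 1:
        status = "multiple positive"
    else:
        status = "all negative"
    return label, status

print(one_vs_rest_predict({"H": 0.7, "E": -0.2, "C": -1.1}))
print(one_vs_rest_predict({"H": 0.4, "E": 0.9, "C": -0.3}))
print(one_vs_rest_predict({"H": -0.5, "E": -0.1, "C": -0.8}))
```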
11
One-against-one method
Uses N(N-1)/2 classifiers (N choose 2, one per pair of classes)
More accurate than one-against-others
Too many classifiers
Complicated, and more time required
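The pairwise scheme can be sketched as majority voting over all class pairs. The `pair_winner` callback stands in for the trained pairwise SVMs and is hypothetical. The toy "strengths" only exist to make the example runnable.

```python
from collections import Counter
from itertools import combinations

def n_pairwise(n_classes):
    """Number of binary classifiers needed: N choose 2."""
    return n_classes * (n_classes - 1) // 2

def one_vs_one_predict(classes, pair_winner):
    """Majority vote over all pairwise classifiers.

    pair_winner(a, b) must return either a or b (hypothetical stand-in
    for the decision of the SVM trained on classes a and b).
    """
    votes = Counter(pair_winner(a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0], votes

# Toy stand-in for trained classifiers: the class with higher strength wins.
strength = {"H": 0.2, "E": 0.9, "C": 0.5}
label, votes = one_vs_one_predict(
    ["H", "E", "C"], lambda a, b: a if strength[a] > strength[b] else b
)
print(n_pairwise(3), n_pairwise(10), label, dict(votes))
```

The classifier count grows quadratically with the number of classes (3 for 3 classes, 45 for 10), which is the cost noted on the slide.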
12
Multi-class SVM
13
Success in using SVM
What kind of Data
all fields in ML
Which feature(s)
What kind of kernel function
How about the parameter values
Which method to solve multi-class classification
14
SVMs in Bioinformatics
Introduction
Areas
Feature (s)
Kernel Function and Parameter values
Validation Measures
Current results
Case study
15
Big Picture of Protein Synthesis
16
Transcription
preRNA
17
preRNA → mRNA (Splicing)
preRNA
mRNA
Where is the splicing site?
18
preRNA → mRNA (Splicing)
GU-AG motifs are parts of longer consensus
sequences that span the 5' and 3' splice sites
In vertebrates:
5' splice site: 5'-AGGUAAGU-3'
3' splice site: 5'-PyPyPyPyPyNCAG-3'
(Py = U or C, N = any nucleotide)
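A minimal sketch that scans an RNA string for exact matches to the two consensus sequences above. Real splice sites often deviate from the consensus, so exact matching only yields candidates. The test sequence is made up for illustration.

```python
import re

# Consensus patterns as written on the slide (Py = U or C, N = any nucleotide).
DONOR = re.compile(r"AGGUAAGU")              # 5' splice site consensus
ACCEPTOR = re.compile(r"[CU]{5}[ACGU]CAG")   # 3' splice site consensus

def candidate_sites(rna):
    """Return start positions (0-based) of exact consensus matches."""
    donors = [m.start() for m in DONOR.finditer(rna)]
    acceptors = [m.start() for m in ACCEPTOR.finditer(rna)]
    return donors, acceptors

# Made-up sequence: flank + donor consensus + filler + acceptor region + flank.
rna = "GGC" + "AGGUAAGU" + "CCAU" + "CUUUCUUACAG" + "GA"
print(candidate_sites(rna))
```

A classifier such as an SVM is then used to separate true sites from such candidates and from sites that do not match the consensus exactly.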
SVM
19
Translation
Polypeptide
20
Protein Structure Hierarchy
21
Protein Primary Structure
22
Protein Secondary Structure
23
Protein Tertiary Structure
α-coil
turn or loop
β-sheet
24
Protein Structure
Protein structure is very important
Protein structure can give clues to the function of a
protein
Traditional biological experiments are
time-consuming and expensive
Computational methods are needed!
SVM
25
Areas in Bioinfo.
Microarray data analysis
Gene functional classification: Brown et al.
(2000), Pavlidis et al. (2001)
Tissue classification: Mukherjee et al. (1999),
Furey et al. (2000), Guyon et al. (2001)
Protein Synthesis
Splicing site prediction: Ying-Fei Sun (2003)
26
Areas in Bioinfo.
Proteins
Structure prediction: Hua et al. (2001), Sujun
Hua (2001), Yu-Dong Cai (2002, 2003), Chris H. Q.
Ding (2001), Florian Markowetz (2003)
Fold recognition: Ding et al. (2001)
Family prediction: Jaakkola et al. (1998)
Function prediction: Jaakkola et al. (1998), C.
Z. Cai (2003), Yu-Dong Cai (2003)
Protein-protein interaction prediction: Bock et
al. (2001)
27
Features
Physicochemical feature of residue / base
Hydrophobicity
Normalized volume
Polarity
Charge
Surface tension

28
Features
Lower level structure
Amino acid composition
the number/frequency of each amino acid
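The composition feature can be sketched as a 20-dimensional frequency vector. The alphabetical ordering of the one-letter codes is an arbitrary choice made here. Characters outside the 20 standard amino acids are simply ignored in this sketch.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes

def amino_acid_composition(seq):
    """Return the 20-D vector of amino acid frequencies for a sequence."""
    seq = seq.upper()
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts]

vec = amino_acid_composition("ACDDDEFGR")
print(len(vec), round(vec[AMINO_ACIDS.index("D")], 3))
```

Every protein, whatever its length, maps to a vector of the same size, which is what makes it usable as SVM input.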
29
Features
Pseudo-amino acid composition
Composition (AAC, 20-D) + sequence order (correlation)
30
Kernel Functions and parameters
Kernel functions
Gaussian kernel
Polynomial kernel
Fisher kernel
String kernel
Spectrum kernel
Interpolated kernel
Parameters (for the kernel, and C)
Cross validation
Minimize the VC dimension
31
Validation Measures
Sensitivity (Sn)
Sn = TP / (TP + FN)
Specificity (Sp)
Sp = TN / (TN + FP)
Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Weighted average of Sn and Sp
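A short sketch of these measures computed from confusion-matrix counts. The counts in the example are made up. The last lines check that accuracy equals the average of Sn and Sp weighted by the number of positive and negative examples.

```python
def validation_measures(tp, tn, fp, fn):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sn = tp / (tp + fn)                      # fraction of positives found
    sp = tn / (tn + fp)                      # fraction of negatives rejected
    acc = (tp + tn) / (tp + tn + fp + fn)    # fraction of all examples correct
    return sn, sp, acc

tp, tn, fp, fn = 40, 45, 5, 10               # made-up counts
sn, sp, acc = validation_measures(tp, tn, fp, fn)

# Accuracy as the weighted average of Sn and Sp:
n_pos, n_neg = tp + fn, tn + fp
weighted = (n_pos * sn + n_neg * sp) / (n_pos + n_neg)
print(sn, sp, acc, weighted)
```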
32
Current Results in Bioinfo.
SVM is really a good tool
Splicing sites: 85-92%
Protein secondary structure: 55-93.2%
Protein fold prediction: 28-56%
Protein domain prediction: 57-94.5%
Protein function prediction: 88-94%
Protein-protein interaction: 80-83%
At least as good as other tools
33
Case Study 1
Identifying splicing sites in eukaryotic RNA
SVM approach
Ying-Fei Sun, Xiao-Dan Fan, Yan-Da Li
Computers in Biology and Medicine, 2003
34
Identifying splicing sites in eukaryotic RNA
SVM approach
35
Identifying splicing sites in eukaryotic RNA
SVM approach
Two sequences, ACDDDEFGR vs. ACDEFHR
6×1 + 1×(-2) + 2×(-1) = 2
6×1 + 1×(-2) + 2×(-1) = 2
6×1 + 1×(-1) + 2×(-2) = 1
Mismatch (substitution): -1
Gap or indel (insertion/deletion): opening -2,
extension -1
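The scoring scheme can be sketched as a function over two already-aligned strings. Two assumptions are made here: a match scores +1 (inferred from the score lines), and the opening penalty is charged on the first column of a gap with the extension penalty on each further column. The two alignments of the slide's sequences are illustrative choices.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Score two aligned sequences of equal length ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    prev_gap = None  # which sequence had a gap in the previous column
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # Continuing a gap in the same sequence costs only the extension.
            score += gap_extend if prev_gap == side else gap_open
            prev_gap = side
        else:
            score += match if x == y else mismatch
            prev_gap = None
    return score

# One gap of length 2 and one mismatch (G vs. H): 6 - 2 - 1 - 1
print(alignment_score("ACDDDEFGR", "ACD--EFHR"))
# Two gaps of length 1 and one mismatch: 6 - 2 - 2 - 1
print(alignment_score("ACDDDEFGR", "AC-D-EFHR"))
```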
36
Identifying splicing sites in eukaryotic RNA
SVM approach
37
Identifying splicing sites in eukaryotic RNA
SVM approach
SVM
With and without secondary structural information
Polynomial kernel (s = 1, r = 1, d = 4), Gaussian
kernel (std = 20)
Three-fold cross validation
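A minimal sketch of the two kernels and a three-fold split. The parametrization is an assumption: s is read as a scale, r as an offset and d as the degree, giving k(x, y) = (s x·y + r)^d, and std is read as σ in exp(-||x - y||² / (2σ²)). The paper's exact forms may differ. The interleaved fold assignment is likewise just one simple choice.

```python
import math

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def polynomial_kernel(x, y, s=1.0, r=1.0, d=4):
    """Assumed form: k(x, y) = (s * x.y + r) ** d."""
    return (s * dot(x, y) + r) ** d

def gaussian_kernel(x, y, std=20.0):
    """Assumed form: k(x, y) = exp(-||x - y||^2 / (2 * std^2))."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2 * std ** 2))

def k_fold_splits(n, k=3):
    """Yield (train_indices, test_indices) for k interleaved folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

x, y = (1.0, 2.0), (0.5, -1.0)
print(polynomial_kernel(x, y), gaussian_kernel(x, y))
splits = list(k_fold_splits(7))
print(splits)
```

Each example is used for testing exactly once, and the reported performance is the average over the three held-out folds.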
38
Identifying splicing sites in eukaryotic RNA
SVM approach
39
Case Study 2
A Novel Method of Protein Secondary Structure
Prediction with High SOV SVM Approach
40
A Novel Method of Protein Secondary Structure
Prediction with High SOV SVM Approach
SVM
Kernel: Gaussian RBF
Parameter values: γ = 0.10, C = 1.50
Binary and tertiary classifiers (3 cases),
One-against-others method
41
A Novel Method of Protein Secondary Structure
Prediction with High SOV SVM Approach
Tertiary Classifiers
42
A Novel Method of Protein Secondary Structure
Prediction with High SOV SVM Approach
Segment Overlap Measure (SOV)
Evaluates the type and position of secondary structure
segments rather than a per-residue assignment of
conformational state
Allows for natural variation of segment boundaries
Allows for ambiguity in the position of segment ends
CCEEECCCCCCEEEEEECCC
CCCCCCCEEEEECCCEECCC
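To make the distinction concrete, the sketch below splits the two strings above into segments and computes their plain per-residue agreement. It does not compute SOV itself, whose definition is not given in the transcript. The names `first` and `second` are neutral because the slide does not say which string is observed and which is predicted.

```python
from itertools import groupby

def segments(ss):
    """Split a secondary structure string into (state, start, end) runs."""
    out, pos = [], 0
    for state, run in groupby(ss):
        length = len(list(run))
        out.append((state, pos, pos + length - 1))  # 0-based, inclusive
        pos += length
    return out

def per_residue_agreement(a, b):
    """Fraction of positions with the same state in both strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

first = "CCEEECCCCCCEEEEEECCC"
second = "CCCCCCCEEEEECCCEECCC"
print(segments(first))
print(segments(second))
print(per_residue_agreement(first, second))
```

Both strings contain two E segments, yet only half of the residues agree position by position. A segment-based measure is designed to handle this kind of boundary shift differently from a per-residue score.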
43
A Novel Method of Protein Secondary Structure
Prediction with High SOV SVM Approach
Result
44
A Novel Method of Protein Secondary Structure
Prediction with High SOV SVM Approach
Result
45
Conclusion
  • SVM is a useful tool in Bioinformatics
  • Microarray data analysis
  • Splicing Site Prediction
  • Protein Structure Prediction
  • Protein Function Prediction
  • Protein-Protein Interaction Prediction
  • Many papers are currently being published
  • More research is needed
  • Feature selection, negative data generation,
    kernel function and parameters

46
Reference
  • An Introduction to Lagrange Multipliers,
    Steuard Jensen, http://home.uchicago.edu/sbjensen/Tutorials/Lagrange.html
  • Linear Algebra and Its Applications, David C.
    Lay, 1999, second edition
  • Some Mathematical Tools for Machine Learning,
    Chris Burges, August, 2003
  • Statistical Learning and VC Theory, Peter
    Bartlett, ISCAS, May 2001
  • A Tutorial on Support Vector Machines for
    Pattern Recognition, Christopher J.C. Burges,
    Data Mining and Knowledge Discovery, 1998
  • Support Vector Learning, B. Schölkopf, Ph.D.
    Thesis, 1997
  • Kernel methods: a survey of current techniques,
    Colin Campbell, 2002

47
Reference
  • Predicting protein-protein interactions from
    primary structure, Joel R. Bock and David A.
    Gough, Bioinformatics, 2001
  • Support Vector Machine Classification of
    Microarray Gene Expression Data, Michael P. S.
    Brown, 1999
  • Protein function classification via support
    vector machine approach, C.Z. Cai, Mathematical
    Biosciences, 2003
  • Prediction of Secondary Protein Structure with
    Binary Coding Patterns of Amino Acid and
    Nucleotide Physicochemical Properties, NIKOLA,
    2002
  • Identifying splicing sites in eukaryotic RNA:
    support vector machine approach, Ying-Fei Sun,
    Computers in Biology and Medicine, 2003
  • A Novel Method of Protein Secondary Structure
    Prediction with High Segment Overlap Measure:
    Support Vector Machine Approach, Sujun Hua, J.
    Mol. Biol., 2001
  • Gene expression data analysis of human lymphoma
    using support vector machines and output coding
    ensembles, Giorgio Valentini, Artificial
    Intelligence in Medicine, 2002

48
Reference
  • Support Vector Machines for Prediction of
    Protein Domain Structural Class, Yu-Dong Cai,
    Xiao-Jun Liu, Xue-Biao Xu and Kuo-Chen Chou, J.
    theor. Biol., 2003
  • Support vector machines for predicting rRNA-,
    RNA-, and DNA-binding proteins from amino acid
    sequence, Yu-dong Cai, Shuo Liang Lin, BBA, 2003
  • Support Vector Machines for Predicting HIV
    Protease Cleavage Sites in Protein, Yu-Dong Cai,
    Xiao-Jun Liu, Xue-Biao Xu, Kuo-Chen Chou, J.
    Comput. Chem., 2002
  • Transductive Support Vector Machines for
    Classification of Microarray Gene Expression
    Data, R. Semolini, 2003

49
Questions or Comments