Title: CZ5226: Advanced Bioinformatics Lecture 7: Statistical Learning Methods Prof. Chen Yu Zong Tel: 6874
1. CZ5226 Advanced Bioinformatics, Lecture 7: Statistical Learning Methods
Prof. Chen Yu Zong
Tel: 6874-6877
Email: csccyz_at_nus.edu.sg
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, National University of Singapore
2. Classification of Drugs or Proteins by SVM
- A drug or a protein is classified as either belonging (+) or not belonging (-) to a class.
- Examples of drug classes: inhibitor of a protein, BBB penetrating, genotoxic.
- Examples of protein classes: enzyme EC 3.4 family, DNA-binding.
- By screening against all classes, the property of a drug or the function of a protein can be identified.
[Figure: a drug or protein is screened by the Class-1, Class-2, and Class-3 SVMs; the outcome shown is "Drug or Protein belongs to Family-3".]
3. Classification of Drugs or Proteins by SVM
- What is SVM?
- Support vector machines: a machine learning (statistical learning) method that learns from examples and classifies objects into one of two classes.
- Advantages of SVM
- Accommodates diversity among class members (no member is excluded for being unlike the others).
- Uses structure-derived physico-chemical features as the basis for drug or protein classification (no structure similarity or sequence similarity is required by the algorithm).
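For orientation, the two-class decision described above is usually written in the following textbook form for the linear case. The symbols are assumptions introduced here, not notation from the slides: x is the feature vector of a drug or protein, w a weight vector, and b an offset, the last two learned from examples.

```latex
f(\mathbf{x}) = \operatorname{sign}\left(\mathbf{w}\cdot\mathbf{x} + b\right),
\qquad
f(\mathbf{x}) = +1 \;\Rightarrow\; \text{member of the class},
\qquad
f(\mathbf{x}) = -1 \;\Rightarrow\; \text{non-member}
```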
4. SVM References
- C. Burges, "A tutorial on support vector machines for pattern recognition", Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998 (on-line).
- R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley, 2nd edition, 2001 (section 5.11, hard copy).
- S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (sections 3.6.2, 3.7.2, hard copy).
- Online lecture notes (http://www.cs.unr.edu/bebis/MathMethods/SVM/lecture.pdf)
- Publications of SVM drug prediction
- J. Chem. Inf. Comput. Sci. 44, 1630 (2004)
- J. Chem. Inf. Comput. Sci. 44, 1497 (2004)
- Toxicol. Sci. 79, 170 (2004)
5. SVM References
- Publications of SVM protein function prediction
- Bioinformatics. 2002;18:147
- Nucleic Acids Res. 2003;31:3692
- Proteins. 2004;55:66
- RNA. 2004;10:355
- J Biol Chem. 2004;279:23262
- Nucleic Acids Res. 2004;32(21):6437-6444
- Virology. 2005;331(1):136-143
- Publications of SVM peptide-binder prediction
- BMC Bioinformatics. 2002 Sep 11;3(1):25
- Bioinformatics. 2003 Oct 12;19(15):1978-84
- Protein Sci. 2004 Mar;13(3):596-607
- Genome Inform Ser Workshop Genome Inform. 2004;15(1):198-212
6. Other MHC-Peptide Prediction References
- J Comput Biol. 2004;11(4):683-94
- Methods. 2004 Dec;34(4):454-9
- Methods. 2004 Dec;34(4):444-53
- Methods. 2004 Dec;34(4):436-43
- Org Biomol Chem. 2004 Nov 21;2(22):3274-83
- Immunogenetics. 2004 Sep;56(6):405-19
- J Immunol. 2004 Jun 15;172(12):7495-502
- J Immunol. 2004 Jun 1;172(11):6783-9
- Appl Bioinformatics. 2003;2(1):63-6
- Appl Bioinformatics. 2003;2(3):155-8
- Bioinformatics. 2004 Jun 12;20(9):1388-97
- Proteins. 2004 Feb 15;54(3):534-56
- Novartis Found Symp. 2003;254:102-20; discussion 120-5, 216-22, 250-2
- Hum Immunol. 2003 Dec;64(12):1123-43
- J Mol Graph Model. 2004 Jan;22(3):195-207
- Neural Comput. 2003 Dec;15(12):2931-42
- Tissue Antigens. 2003 Nov;62(5):378-84
7. Other MHC-Peptide Prediction References
- Bioinformatics. 2003 Sep 22;19(14):1765-72
- Hybrid Hybridomics. 2003 Aug;22(4):229-34
- Nucleic Acids Res. 2003 Jul 1;31(13):3621-4
- Bioinformatics. 2003 May 22;19(8):1009-14
- Methods. 2003 Mar;29(3):236-47
- J Proteome Res. 2002 May-Jun;1(3):263-72
- J Mol Biol. 2003 Feb 28;326(4):1157-74
- BMC Bioinformatics. 2002 Sep 11;3(1):25
- Hum Immunol. 2002 Sep;63(9):701-9
- J Comput Biol. 2002;9(3):527-39
- Mol Med. 2002 Mar;8(3):137-48
- Immunol Cell Biol. 2002 Jun;80(3):280-5
- Immunol Cell Biol. 2002 Jun;80(3):270-9
- BMC Struct Biol. 2002 May 13;2(1):2
- Biologicals. 2001 Sep-Dec;29(3-4):179-81
- Bioinformatics. 2001 Dec;17(12):1236-7
- Bioinformatics. 2001 Oct;17(10):942-8
- J Med Chem. 2001 Oct 25;44(22):3572-81
- J Comput Aided Mol Des. 2001 Jun;15(6):573-86
8. Machine Learning Method
Inductive learning (example-based learning)
9. Machine Learning Method
A(1, 1, 1)  B(0, 1, 1)  C(1, 1, 1)  D(0, 1, 1)  E(0, 0, 0)  F(1, 0, 1)
10. SVM Method
Feature vectors in input space
Feature vector
A(1, 1, 1)  B(0, 1, 1)  C(1, 1, 1)  D(0, 1, 1)  E(0, 0, 0)  F(1, 0, 1)
11. SVM Method
12. SVM Method
13. SVM Method
14. Best Linear Separator?
15. Best Linear Separator?
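16. Find Closest Points in Convex Hulls
[Figure: closest points c and d in the convex hulls of the two classes]
17. Plane Bisects Closest Points
[Figure: separating plane bisecting the segment between c and d]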
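Once the closest points c and d are known, the bisecting plane follows directly: its normal is c - d and it passes through the midpoint of the segment. A minimal sketch, assuming c and d have already been found; the function names `bisecting_plane` and `side` and the sample coordinates are hypothetical.

```python
from typing import List, Sequence, Tuple


def bisecting_plane(c: Sequence[float], d: Sequence[float]) -> Tuple[List[float], float]:
    """Return (w, b) for the plane w . x + b = 0 that is perpendicular
    to the segment c-d and passes through its midpoint."""
    w = [ci - di for ci, di in zip(c, d)]
    midpoint = [(ci + di) / 2.0 for ci, di in zip(c, d)]
    b = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, b


def side(x: Sequence[float], w: Sequence[float], b: float) -> int:
    """+1 on the c side of the plane, -1 on the d side, 0 on the plane."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s > 0 else (-1 if s < 0 else 0)


# Illustrative closest points only.
c = [3.0, 1.0]
d = [1.0, 1.0]
w, b = bisecting_plane(c, d)
print(w, b)                            # [2.0, 0.0] -4.0
print(side(c, w, b), side(d, w, b))    # 1 -1
```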
18. Find Closest Points Using a Quadratic Program
Many existing and new solvers are available.
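One common way to write this quadratic program is sketched here. The notation is an assumption, not taken from the slides: the x_i are training points, and the coefficients alpha and beta express c and d as convex combinations of the points in the positive and negative class respectively.

```latex
\min_{\alpha,\,\beta}\ \tfrac{1}{2}\,\lVert \mathbf{c} - \mathbf{d} \rVert^{2}
\quad \text{subject to} \quad
\mathbf{c} = \sum_{i \in \text{class}\,+} \alpha_i \mathbf{x}_i, \quad
\mathbf{d} = \sum_{j \in \text{class}\,-} \beta_j \mathbf{x}_j, \quad
\sum_i \alpha_i = 1, \quad
\sum_j \beta_j = 1, \quad
\alpha_i \ge 0, \quad \beta_j \ge 0
```

The objective is quadratic in the coefficients and the constraints are linear, which is why general-purpose quadratic programming solvers apply.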
19. Best Linear Separator: Supporting Plane Method
Maximize the distance between two parallel supporting planes.
Distance = margin
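The supporting-plane idea can be illustrated numerically. Under the usual scaling, assumed here, the two planes are w . x + b = +1 and w . x + b = -1, no training point may lie between them, and their distance is 2 / ||w||. The sketch compares a few hand-picked candidate planes rather than solving the optimization; the data, the candidates, and the function names are hypothetical.

```python
import math
from typing import Sequence


def margin(w: Sequence[float]) -> float:
    """Distance between the planes w . x + b = +1 and w . x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def is_feasible(w, b, points, labels) -> bool:
    """True if every point satisfies y * (w . x + b) >= 1,
    i.e. no point lies strictly between the two supporting planes."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1.0
        for x, y in zip(points, labels)
    )


# Toy data: two points per class.
points = [(3.0, 1.0), (4.0, 2.0), (1.0, 1.0), (0.0, 0.0)]
labels = [1, 1, -1, -1]

# Hand-picked candidate (w, b) pairs, from narrow to wide margin.
candidates = [((2.0, 0.0), -4.0), ((1.0, 0.0), -2.0), ((0.5, 0.0), -1.0)]

feasible = [(w, b) for w, b in candidates if is_feasible(w, b, points, labels)]
best_w, best_b = max(feasible, key=lambda wb: margin(wb[0]))
print(best_w, best_b, margin(best_w))  # (1.0, 0.0) -2.0 2.0
```

The widest candidate is rejected because points fall between its planes; among the remaining ones, the plane with the largest margin is preferred.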
20. Best Linear Separator?
21. SVM Method
Border line is nonlinear
22. SVM Method
23. SVM Method
Non-linear transformation
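A small example of why a non-linear transformation helps: one-dimensional data whose positive class lies on both sides of the negative class cannot be split by a single threshold, but becomes linearly separable after mapping x to (x, x^2). The data, the map `phi`, and the separator weights are illustrative assumptions, not taken from the slides.

```python
from typing import List, Sequence, Tuple

xs: List[float] = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys: List[int] = [1, 1, -1, -1, -1, 1, 1]


def phi(x: float) -> Tuple[float, float]:
    """Non-linear map from the input space into a 2-D feature space."""
    return (x, x * x)


def separable_by_threshold(xs: Sequence[float], ys: Sequence[int]) -> bool:
    """Check whether any rule sign(s * (x - t)) reproduces the labels in 1-D."""
    pts = sorted(xs)
    thresholds = [(a + b) / 2.0 for a, b in zip(pts, pts[1:])]
    thresholds += [pts[0] - 1.0, pts[-1] + 1.0]
    for t in thresholds:
        for s in (1, -1):
            if all((1 if s * (x - t) > 0 else -1) == y for x, y in zip(xs, ys)):
                return True
    return False


def linear_rule(z: Sequence[float], w=(0.0, 1.0), b: float = -2.5) -> int:
    """Linear separator in feature space: sign(w . z + b)."""
    return 1 if w[0] * z[0] + w[1] * z[1] + b > 0 else -1


input_space_ok = separable_by_threshold(xs, ys)
feature_space_ok = all(linear_rule(phi(x)) == y for x, y in zip(xs, ys))
print(input_space_ok, feature_space_ok)  # False True
```

The linear rule in feature space corresponds to the non-linear border x^2 = 2.5 in the original input space.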
24. SVM Method
25. SVM Method
26. SVM Method
27. SVM Method
28. SVM for Classification of Drugs
- How to represent a drug?
- Each structure is represented by a specific feature vector assembled from structural and physico-chemical properties:
- Simple molecular properties (molecular weight, no. of rotatable bonds, etc.; 18 in total)
- Molecular connectivity and shape (28 in total)
- Electro-topological state, polarity (84 in total)
- Quantum chemical properties (electric charge, polarizability, etc.; 13 in total)
- Geometrical properties (molecular size vector, van der Waals volume, molecular surface, etc.; 16 in total)
- J. Chem. Inf. Comput. Sci. 44, 1630 (2004)
- J. Chem. Inf. Comput. Sci. 44, 1497 (2004)
- Toxicol. Sci. 79, 170 (2004)
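Assembling the drug feature vector amounts to concatenating the descriptor groups listed above in a fixed order, which with the stated group sizes gives 18 + 28 + 84 + 13 + 16 = 159 components. The sketch below only shows that bookkeeping: the group keys and the `assemble` function are hypothetical, and the descriptor values are placeholders, since computing real descriptors requires a cheminformatics toolkit not shown here.

```python
from typing import Dict, List

# Group sizes as stated on the slide; the key names are hypothetical labels.
GROUP_SIZES: Dict[str, int] = {
    "simple_molecular": 18,
    "connectivity_shape": 28,
    "electrotopological_state": 84,
    "quantum_chemical": 13,
    "geometrical": 16,
}


def assemble(groups: Dict[str, List[float]]) -> List[float]:
    """Concatenate the descriptor groups in a fixed order, checking sizes."""
    vector: List[float] = []
    for name, size in GROUP_SIZES.items():
        values = groups[name]
        if len(values) != size:
            raise ValueError(f"{name}: expected {size} values, got {len(values)}")
        vector.extend(values)
    return vector


# Placeholder descriptor values (all zeros) for one molecule.
placeholder = {name: [0.0] * size for name, size in GROUP_SIZES.items()}
vector = assemble(placeholder)
print(len(vector))  # 159
```

Keeping the group order fixed matters because the classifier treats each position in the vector as the same property for every molecule.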
29. SVM-based drug design and property prediction software
Useful for inhibitor/activator/substrate prediction, drug safety and pharmacokinetic prediction.
[Figure: workflow. The chemical structure of your drug is entered either through the internet (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi) or on a local machine (a computer loaded with SVMProt); the two routes are labeled Option 1 and Option 2. The structure is sent to a support vector machine classifier for every drug class to answer "Which class does your drug belong to?". The identified classes lead to a drug designed or a property predicted.]
J. Chem. Inf. Comput. Sci. 44, 1630 (2004); J. Chem. Inf. Comput. Sci. 44, 1497 (2004); Toxicol. Sci. 79, 170 (2004)
30. SVM Drug Prediction Results
- Protein inhibitor/activator/substrate prediction
- 86% of the 129 estrogen receptor activators and 84% of 101 non-activators correctly predicted.
- 81% of 116 P-glycoprotein substrates and 79% of 85 non-substrates correctly predicted.
- Drug toxicity prediction
- 97% of 102 TdP+ and 84% of 243 TdP- agents correctly predicted.
- 73% of 229 genotoxic and 93% of 631 non-genotoxic agents correctly predicted.
- Pharmacokinetics prediction
- 95% of 276 BBB+ and 82% of 139 BBB- agents correctly predicted.
- 90% of 131 human intestinal absorption and 80% of 65 non-absorption agents correctly predicted.
- J. Chem. Inf. Comput. Sci. 44, 1630 (2004)
- J. Chem. Inf. Comput. Sci. 44, 1497 (2004)
- Toxicol. Sci. 79, 170 (2004)
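The per-class figures above can be combined into a single overall accuracy by weighting each rate by its group size. The sketch treats each reported figure as a percentage of the stated group size, which is an assumption about how to read the slide; the function name `overall_accuracy` is hypothetical.

```python
def overall_accuracy(pos_pct: float, n_pos: int, neg_pct: float, n_neg: int) -> float:
    """Weighted fraction of correct predictions over both groups.

    pos_pct / neg_pct are the percentages of positives / negatives
    predicted correctly; n_pos / n_neg are the group sizes.
    """
    correct = pos_pct / 100.0 * n_pos + neg_pct / 100.0 * n_neg
    return correct / (n_pos + n_neg)


# Estrogen receptor activators vs non-activators, as listed on the slide.
er = overall_accuracy(86, 129, 84, 101)
# TdP-causing vs non-causing agents.
tdp = overall_accuracy(97, 102, 84, 243)
print(round(er, 2), round(tdp, 2))  # 0.85 0.88
```

When the two groups differ greatly in size, the overall figure is dominated by the larger group, which is one reason to report both rates separately as the slide does.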
31. SVM for Classification of Proteins
- How to represent a protein?
- Each sequence is represented by a specific feature vector assembled from encoded representations of tabulated residue properties:
- Amino acid composition
- Hydrophobicity
- Normalized van der Waals volume
- Polarity
- Polarizability
- Charge
- Surface tension
- Secondary structure
- Solvent accessibility
- Three descriptors, composition (C), transition (T), and distribution (D), are used to describe the global composition of each of these properties.
- Nucleic Acids Res. 2003;31:3692-3697
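The composition, transition, and distribution descriptors can be sketched for a single property as follows. Several details are assumptions rather than content of the slides: the three-group hydrophobicity encoding (polar, neutral, hydrophobic), the definition of D as the relative positions of the first, 25%, 50%, 75%, and 100% residues of each group, and the rounding used for those quantiles. Function names are hypothetical.

```python
import math
from typing import Dict, List

# Assumed three-group hydrophobicity encoding: 1 polar, 2 neutral, 3 hydrophobic.
HYDROPHOBICITY_GROUPS: Dict[int, str] = {1: "RKEDQN", 2: "GASTPHY", 3: "CLVIMFW"}
GROUP_IDS = (1, 2, 3)


def encode(sequence: str, groups: Dict[int, str]) -> List[int]:
    """Replace each residue by the id of its property group."""
    lookup = {aa: g for g, members in groups.items() for aa in members}
    return [lookup[aa] for aa in sequence]  # KeyError on unknown residues


def composition(codes: List[int]) -> List[float]:
    """C: fraction of residues in each group."""
    return [codes.count(g) / len(codes) for g in GROUP_IDS]


def transition(codes: List[int]) -> List[float]:
    """T: fraction of adjacent pairs that switch between two given groups."""
    pairs = list(zip(codes, codes[1:]))
    result = []
    for a, b in ((1, 2), (1, 3), (2, 3)):
        switches = sum(1 for p in pairs if p in ((a, b), (b, a)))
        result.append(switches / len(pairs))
    return result


def distribution(codes: List[int]) -> List[float]:
    """D: relative chain position of the first, 25%, 50%, 75% and 100%
    residue of each group (0.0 when the group is absent)."""
    result = []
    for g in GROUP_IDS:
        positions = [i + 1 for i, c in enumerate(codes) if c == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                result.append(0.0)
                continue
            k = max(1, math.ceil(frac * len(positions)))  # k-th occurrence
            result.append(positions[k - 1] / len(codes))
    return result


def ctd(sequence: str, groups: Dict[int, str] = HYDROPHOBICITY_GROUPS) -> List[float]:
    """Concatenate C (3), T (3) and D (15) into one 21-component block."""
    codes = encode(sequence, groups)
    return composition(codes) + transition(codes) + distribution(codes)


codes = encode("RGCRGC", HYDROPHOBICITY_GROUPS)
features = ctd("RGCRGC")
print(codes)          # [1, 2, 3, 1, 2, 3]
print(len(features))  # 21
```

Repeating the same calculation for each residue property and concatenating the blocks gives a fixed-length vector regardless of sequence length, which is what a classifier such as an SVM requires.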
32. SVM for Classification of Proteins
- How to represent a protein?
- From protein sequence to feature vector:
(C_amino acid composition, T_amino acid composition, D_amino acid composition, C_hydrophobicity, T_hydrophobicity, D_hydrophobicity, ...)
Nucleic Acids Res. 2003;31:3692-3697
33. SVM for Classification of Proteins
- How to represent a protein?
34. Protein function prediction software SVMProt
Useful for functional prediction of novel proteins, distantly-related proteins, and homologous proteins of different functions.
35. Protein function prediction software SVMProt
Protein families covered: 46 enzyme families, 3 receptor families, 4 transporter and channel families, 6 DNA- and RNA-binding families, 8 structural families, 2 regulator/factor families.
SVMProt web version at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi
Nucl. Acids Res. 31, 3692-3697 (2003)
36. Protein function prediction software SVMProt
[Figure: probability of correct prediction versus prediction score]
Nucl. Acids Res. 31, 3692-3697 (2003)
37. SVMProt Protein Functional Family Prediction Results
- Overall prediction accuracies
- 87% of the 34,582 proteins correctly assigned to their respective functional family.
- 97% of the 310,000 non-member proteins correctly predicted.
- Novel enzymes
- 67% of the 12 non-homologous enzymes (having no homologous proteins by PSI-BLAST search of NR databases) are correctly assigned.
- 83% of the 29 non-homologous enzymes (having no homologous proteins by PSI-BLAST search of the SwissProt database) are correctly assigned.
- 70% of the 20 pairs of homologous enzymes of different functions are correctly assigned.
- NR databases include all non-redundant GenBank CDS translations, PDB, SwissProt, PIR, and PRF databases.
- 92% of 12,900 enzymes correctly assigned by BLAST in 1997.
- Nucleic Acids Res. 2003;31:3692