Title: Machine Learning Algorithms for Protein Structure Prediction
1Machine Learning Algorithms for Protein Structure
Prediction
- Jianlin Cheng
- Institute for Genomics and Bioinformatics
- School of Information and Computer Sciences
- University of California Irvine
- 2006
2Outline
- Introduction
- 1D Prediction
- 2D Prediction (Beta-Sheet Topology)
- 3D Prediction (Fold Recognition)
- Publications and Bioinformatics Tools
3Importance of Protein Structure Prediction
Sequence (AGCWY) -> Structure -> Function (Cell)
4Four Levels of Protein Structure
Primary Structure (a directional sequence of
amino acids/residues)
N
C
Residue1
Residue2
Peptide bond
Secondary Structure (helix, strand, coil)
Alpha Helix
Beta Strand / Sheet
Coil
5Four Levels of Protein Structure
Quaternary Structure (complex)
Tertiary Structure
G Protein Complex
6 1D Secondary Structure Prediction
Input sequence: MWLKKFGINLLIGQSV
Method: Neural Networks and Alignments
Output (H = Helix, S = Strand, C = Coil): CCCCHHHHHCCCSSSSS
Accuracy: 78%
Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005
7 1D Solvent Accessibility Prediction
Input sequence: MWLKKFGINLLIGQSV
Method: Neural Networks and Alignments
Output (e = Exposed, b = Buried): eeeeeeebbbbbbbbeeeebbb
Accuracy: 79%
Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005
8 1D Disordered Region Prediction Using Neural Networks
Input sequence: MWLKKFGINLLIGQSV
Method: 1D-RNN
Output (D = Disordered Region): OOOOODDDDOOOOO
93% TP at 5% FP
Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2005
9 1D Protein Domain Prediction Using Neural Networks
Input: sequence MWLKKFGINLLIGQSV, with SS and SA
Method: 1D-RNN
Output (B = Boundary): NNNNNNNBBBBBNNNN
Inference/Cut: Domains (example: HIV capsid protein, Domain 1 and Domain 2)
Top ab-initio domain predictor in CAFASP4
Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2006.
10 1D Predict Single-Site Mutation From Sequence Using Support Vector Machine
MWLAVFILINLK -> Support Vector Machine
Correlation: 0.76
- First method to predict energy changes from sequence accurately
- Useful for protein engineering, protein design, and mutagenesis analysis
Cheng, Randall, and Baldi. Proteins, 2006
11 2D Contact Map Prediction
3D Structure -> 2D Contact Map (n x n matrix over residue positions i, j = 1, 2, ..., n)
Distance Threshold: 8 Å
Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005
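A minimal sketch of how a contact map could be derived from a 3D structure with the 8 Å distance threshold. The coordinates are made-up toy values, and the choice of one representative point per residue and of a strict "less than" comparison are assumptions; the slide does not specify either.

```python
import math

def contact_map(coords, threshold=8.0):
    """Return an n x n 0/1 matrix: 1 if residues i and j are closer than `threshold` angstroms."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # The diagonal is left at 0 here (an assumption, not from the slide).
            if i != j and math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = 1
    return cmap

# Hypothetical coordinates: four residues on a straight line, 3.8 angstroms apart.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
toy_map = contact_map(toy_coords)
```

In the toy example, residues 1 and 4 are 11.4 Å apart and are therefore the only pair not in contact.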
12 2D Disulfide Bond Prediction
Cysteine i
Support Vector Machine
yes
2D-RNN
Disulfide Bond
Graph Matching
Cysteine j
[1] Baldi, Cheng, Vullo. NIPS, 2004.
[2] Cheng, Saigo, Baldi. Proteins, 2005
13 2D Prediction of Beta-Sheet Topology
N terminus
- Ab-Initio Structure Prediction
- Fold Recognition
- Protein Design
- Protein Folding
Beta Sheet
Beta Strand
Cheng and Baldi, Bioinformatics, 2005
C terminus
Beta Residue Pair
14An Example of Beta-Sheet Topology
Level 1: Beta Sheets
Structure of Protein 1VJG; strand order in its beta sheets: 4 5 and 2 1 3 6 7
15An Example of Beta-Sheet Topology
Level 1: Beta Sheets
Level 2: Strand, Strand Pair, Strand Alignment, Pairing Direction (Antiparallel or Parallel)
Structure of Protein 1VJG; strand order in its beta sheets: 4 5 and 2 1 3 6 7
16An Example of Beta-Sheet Topology
Level 1: Beta Sheets
Level 2: Strand, Strand Pair, Strand Alignment, Pairing Direction (Antiparallel or Parallel)
Level 3: Beta Residue, Residue Pair, H-bond
Structure of Protein 1VJG; strand order in its beta sheets: 4 5 and 2 1 3 6 7
17Three-Stage Prediction of Beta-Sheets
- Stage 1
  - Predict beta-residue pairing probabilities using 2D-Recursive Neural Networks (2D-RNN, Baldi and Pollastri, 2003)
- Stage 2
  - Use beta-residue pairing probabilities to align beta-strands
- Stage 3
  - Predict beta-strand pairs and beta-sheet topology using graph algorithms
18Stage 1: Prediction of Beta-Residue Pairings Using 2D-Recursive Neural Networks
Input Matrix I (m x m)
Output / Target Matrix (m x m)
2D-RNN: O = f(I)
I_ij: input for residue pair (i, j)
O_ij: pairing probability; T_ij: 0/1 target
Example sequence: AHYHCKRWQNEDGHTPRKDECLIELMQDAQRMRK.
Input features: 20 for residues, 3 for SS, 2 for SA
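As a small illustration of the 0/1 target matrix T, the sketch below fills an m x m matrix from a list of paired beta residues. The function name, the index convention (0-based positions among the m beta residues) and the hairpin example are hypothetical and only serve to show the shape of the training target.

```python
def target_matrix(m, paired):
    """Build the m x m 0/1 target matrix T from (i, j) index pairs of paired beta residues."""
    T = [[0] * m for _ in range(m)]
    for i, j in paired:
        T[i][j] = 1
        T[j][i] = 1  # pairing is symmetric
    return T

# Hypothetical example: six beta residues forming an antiparallel hairpin.
T = target_matrix(6, [(0, 5), (1, 4), (2, 3)])
```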
19An Example (Target)
[Figure: Beta-Residue Pairing Map (Target Matrix) of Protein 1VJG, with strands 1 to 7 labeled]
20An Example (Target)
[Figure: Beta-Residue Pairing Map (Target Matrix) of Protein 1VJG, with strands 1 to 7 labeled and Antiparallel and Parallel pairings marked]
21An Example (Prediction)
22Stage 2 Beta-Strand Alignment
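- Use output probability matrix as scoring matrix
- Dynamic programming
- Disallow gaps and use the simplified search algorithm
Antiparallel: residues 1 ... m of one strand aligned against residues n ... 1 of the other
Parallel: residues 1 ... m of one strand aligned against residues 1 ... n of the other
Total number of alignments: 2(m+n-1)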
23Strand Alignment and Pairing Matrix
- The alignment score is the sum of the pairing probabilities of the aligned residues
- The best alignment is the alignment with the maximum score
- Strand Pairing Matrix
Strand Pairing Matrix of 1VJG
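A minimal sketch of gapless strand alignment as described on the last two slides: every ungapped shift of one strand against the other is scored in both directions, the score is the sum of the pairing probabilities of the aligned residues, and the maximum wins. The function names and the toy probability matrix are hypothetical, and a plain exhaustive enumeration stands in for the search actually used by the author.

```python
def gapless_alignments(prob, strand_a, strand_b):
    """Yield (score, direction, offset, pairs) for every ungapped alignment of two strands.

    prob[r][s] is the pairing probability of residues r and s;
    strand_a and strand_b are lists of residue indices.
    """
    m, n = len(strand_a), len(strand_b)
    for direction in ("parallel", "antiparallel"):
        b = strand_b if direction == "parallel" else strand_b[::-1]
        # m + n - 1 offsets per direction, each with at least one aligned pair
        for offset in range(-(n - 1), m):
            pairs = [(strand_a[i], b[i - offset])
                     for i in range(m) if 0 <= i - offset < n]
            score = sum(prob[r][s] for r, s in pairs)
            yield score, direction, offset, pairs

def best_alignment(prob, strand_a, strand_b):
    """Return the alignment with the maximum score."""
    return max(gapless_alignments(prob, strand_a, strand_b), key=lambda t: t[0])

# Hypothetical probabilities: residues 0-2 and 3-5 form an antiparallel hairpin.
prob = [[0.0] * 6 for _ in range(6)]
for i, j in [(0, 5), (1, 4), (2, 3)]:
    prob[i][j] = prob[j][i] = 0.9

all_alignments = list(gapless_alignments(prob, [0, 1, 2], [3, 4, 5]))
score, direction, offset, pairs = best_alignment(prob, [0, 1, 2], [3, 4, 5])
```

The best score for each strand pair is the kind of value that would populate a strand pairing matrix.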
24Stage 3 Prediction of Beta-Strand Pairings and
Beta-Sheet Topology
(a) Seven strands of protein 1VJG in sequence
order
(b) Beta-sheet topology of protein 1VJG
25Minimum Spanning Tree Like Algorithm
Strand Pairing Graph (SPG)
(a) Complete SPG
Strand Pairing Matrix
26Minimum Spanning Tree Like Algorithm
Strand Pairing Graph (SPG)
(b) True Weighted SPG
(a) Complete SPG
Strand Pairing Matrix
Goal: Find a set of connected subgraphs that maximize the sum of the alignment scores and satisfy the constraints
Algorithm: Minimum Spanning Tree Like Algorithm
27An Example of MST Like Algorithm
Step 1: Pair strands 4 and 5
Strand Pairing Matrix of 1VJG (lower triangle; rows and columns are strands 1 to 7):
     1    2    3    4    5    6    7
1    0
2    1.3  0
3    .94  .37  0
4    .02  .02  .04  0
5    .02  .02  .03  1.9  0
6    .10  .05  .74  .04  .04  0
7    .02  .02  .03  .02  .02  .20  0
Strands paired so far: 4 5
28An Example of MST Like Algorithm
Step 2: Pair strands 1 and 2 (score 1.3 in the Strand Pairing Matrix of 1VJG)
Strands paired so far: 4 5 | 2 1
29An Example of MST Like Algorithm
Step 3: Pair strands 1 and 3 (score .94 in the Strand Pairing Matrix of 1VJG)
Strands paired so far: 4 5 | 2 1 3
30An Example of MST Like Algorithm
Step 4: Pair strands 3 and 6 (score .74 in the Strand Pairing Matrix of 1VJG)
Strands paired so far: 4 5 | 2 1 3 6
31An Example of MST Like Algorithm
Step 5: Pair strands 6 and 7 (score .20 in the Strand Pairing Matrix of 1VJG)
Resulting topology: 4 5 | 2 1 3 6 7
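The five steps of the worked example can be reproduced with a short greedy sketch over the strand pairing matrix of 1VJG. The slides do not spell out the constraints, so two rules here are assumptions chosen for illustration: a strand takes at most two partners, and the search stops once every strand has at least one partner. Function names are hypothetical.

```python
# Strand pairing matrix of 1VJG from the slides (lower triangle, strands 1-7).
LOWER = [
    [0],
    [1.3, 0],
    [.94, .37, 0],
    [.02, .02, .04, 0],
    [.02, .02, .03, 1.9, 0],
    [.10, .05, .74, .04, .04, 0],
    [.02, .02, .03, .02, .02, .20, 0],
]

def greedy_strand_pairing(lower, max_partners=2):
    """Accept strand pairs in order of decreasing score, subject to the assumed constraints."""
    n = len(lower)
    candidates = sorted(
        ((lower[i][j], j + 1, i + 1) for i in range(n) for j in range(i)),
        key=lambda t: -t[0],
    )
    partners = {s: set() for s in range(1, n + 1)}
    chosen = []
    for score, a, b in candidates:
        if all(partners.values()):  # assumed stop rule: every strand is paired
            break
        if len(partners[a]) < max_partners and len(partners[b]) < max_partners:
            partners[a].add(b)
            partners[b].add(a)
            chosen.append((a, b))
    return chosen

def sheets(chosen, n):
    """Group strands into sheets (connected components of the chosen pairs)."""
    label = list(range(n + 1))
    def find(x):
        while label[x] != x:
            x = label[x]
        return x
    for a, b in chosen:
        label[find(a)] = find(b)
    groups = {}
    for s in range(1, n + 1):
        groups.setdefault(find(s), []).append(s)
    return sorted(groups.values())

steps = greedy_strand_pairing(LOWER)
predicted_sheets = sheets(steps, len(LOWER))
```

Under these assumptions the pair (2, 3) with score .37 is skipped because strand 3 already has two partners, which matches the order of steps on the slides.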
32 1. Beta Residue Pairing
Method | Specificity/Sensitivity (%) | Ratio of Improvement
BetaPairing | 41 | 17.8
CMAPpro (Pollastri and Baldi, 2002) | 27 | 11.7
2. Beta Strand Alignment
Method | Alignment Accuracy (%) | Pairing Direction (%)
BetaPairing | 66 | 84
Statistical Potential (Hubbard, 1994) | 40 | X
Pseudo-energy (Zhu and Braun, 1999) | 35 | X
Information Theory (Steward and Thornton, 2002) | 37 | X
3. Beta Strand Pairing
Method | Specificity (%) | Sensitivity (%) | % of non-local pairs
MST Like | 53 | 59 | 20
33 3D Structure Prediction
Query sequence: MWLKKFGINLLIGQSV
- Ab-Initio Structure Prediction
  Simulation with a physical force field (protein folding), or reconstruction from a contact map
  Select structure with minimum free energy
- Template-Based Structure Prediction
  Query protein (MWLKKFGINKH) -> Fold Recognition against the Protein Data Bank -> Template -> Alignment
34A Machine Learning Information Retrieval
Framework for Fold Recognition
Fold Recognition
Cheng and Baldi, Bioinformatics, 2006
Query Protein
Alignment
MWLKKFGIN
Template
Protein Data Bank
Machine Learning Ranking
35Classic Fold Recognition Approaches
Sequence - Sequence Alignment (Needleman and
Wunsch, 1970. Smith and Waterman, 1981)
Query
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
Template
ITAKPQWLKTSE------------SVTFLSFLLPQTQGLYHL
Alignment (similarity) score
Works for >40% sequence identity (close homologs in protein family)
36Classic Fold Recognition Approaches
Profile - Sequence Alignment (Altschul et al.,
1997)
Query Family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL
Average Score
Template:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
More sensitive for distant homologs in superfamily (>25% identity).
37Classic Fold Recognition Approaches
Profile - Sequence Alignment (Altschul et al.,
1997)
Position Specific Scoring Matrix or Hidden Markov Model of the query family:
Position:  1    2   ...  n
A          0.4
C          0.1
W          0.5
Query Family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL
Template:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
More sensitive for distant homologs in superfamily (>25% identity).
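To make the idea of a profile concrete, the sketch below computes per-position residue frequencies from the four query family sequences shown on this slide. It assumes the sequences are already aligned (they have equal length) and uses raw frequencies only; real position specific scoring matrices typically add pseudocounts and log-odds scaling, which are omitted here. The function name is hypothetical.

```python
from collections import Counter

# Query family sequences from the slide, assumed to be pre-aligned without gaps.
QUERY_FAMILY = [
    "ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL",
    "ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL",
    "ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL",
    "ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL",
]

def column_frequencies(aligned):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must have equal length")
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned)
        profile.append({aa: c / len(aligned) for aa, c in counts.items()})
    return profile

profile = column_frequencies(QUERY_FAMILY)
```

Position 6 of the family, for example, holds A in two sequences and E and Q in one each, so its column reads A 0.5, E 0.25, Q 0.25.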
38Classic Fold Recognition Approaches
Profile - Profile Alignment (Rychlewski et al.,
2000)
Query Family profile:
Position:  1    2   ...  n
A          0.1
C          0.4
W          0.5
Query Family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ILAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL
Template Family profile:
Position:  1    2   ...  m
A          0.3
C          0.5
W          0.2
Template Family:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
IPARPQWLKTSKRSTEWQSVTFLSFLLPYTQGLYHN
IGAKPQWLWTSERSTEWHSVTFLSFLLPQTQGLYHM
More sensitive for very distant homologs (>15% identity).
39Classic Fold Recognition Approaches
Sequence - Structure Alignment (Threading) (Bowie
et al., 1991. Jones et al., 1992. Godzik,
Skolnick, 1992. Lathrop, 1994)
Fit
Query
Fitness Score
MWLKKFGINLLIGQS.
Template Structure
Useful for recognizing similar folds without sequence similarity (no evolutionary relationship).
40Integration of Complementary Approaches
FR Server1
Query
Meta Server
FR server2
Consensus
(Lundstrom et al.,2001. Fischer, 2003)
FR server3
Internet
- Reliability depends on availability of external servers
- Make decisions on a handful of candidates
41Machine Learning Classification Approach
Class 1
Support Vector Machine (SVM)
Class 2
Proteins
Class m
Classify individual proteins to several or dozens
of structure classes (Jaakkola et al., 2000.
Leslie et al., 2002. Saigo et al., 2004)
Problem 1: can't scale up to thousands of protein classes
Problem 2: doesn't provide templates for structure modeling
42Machine Learning Information Retrieval Framework
Query-Template Pair -> Relevance Function (e.g., SVM) -> Score
Score 1, Score 2, ..., Score n -> Rank
- Extract pairwise features
- Comparison of two pairs (four proteins)
- Relevant or not (one score) vs. many classes
- Ranking of templates (retrieval)
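A minimal sketch of the retrieval step: each query-template pair gets one relevance score and the templates are returned in decreasing order of that score. The template names, feature values, and the use of a plain sum as the relevance function are placeholders; in the framework described here the score would come from a trained relevance function such as an SVM.

```python
def rank_templates(pair_features, relevance):
    """Sort template names by decreasing relevance score of their query-template pair."""
    scored = [(relevance(features), name) for name, features in pair_features.items()]
    return [name for score, name in sorted(scored, reverse=True)]

# Hypothetical pairwise features for one query against three templates.
pair_features = {
    "templateA": [0.9, 0.8],
    "templateB": [0.1, 0.2],
    "templateC": [0.5, 0.4],
}

# Stand-in relevance function (an assumption): the sum of the features.
ranking = rank_templates(pair_features, relevance=sum)
```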
43Pairwise Feature Extraction
- Sequence / Family Information Features
  - Cosine, correlation, and Gaussian kernel
- Sequence-Sequence Alignment Features
  - Palign, ClustalW
- Sequence-Profile Alignment Features
  - PSI-BLAST, IMPALA, HMMer, RPS-BLAST
- Profile-Profile Alignment Features
  - ClustalW, HHSearch, Lobster, Compass, PRC-HMM
- Structural Features
  - Secondary structure, solvent accessibility, contact map, beta-sheet topology
44Pairwise Feature Extraction
45Relevance Function Support Vector Machine
Learning
Feature Space
Positive Pairs (Same Folds)
Support Vector Machine
Negative Pairs (Different Folds)
Training/Learning
Hyperplane
Training Data Set
46Relevance Function Support Vector Machine
Learning
(2)
(1)
Margin
Margin
f(x)
K is Gaussian Kernel
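The equations on this slide did not survive in the transcript. As a hedged reminder of the standard form, a kernel SVM scores a feature vector x as f(x) = sum_i alpha_i y_i K(x, x_i) + b, with K(x, z) = exp(-gamma ||x - z||^2) for a Gaussian kernel. The sketch below evaluates that textbook expression; the support vectors, coefficients, bias, and gamma are made-up values, not the trained FOLDpro model.

```python
import math

def gaussian_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def decision_value(x, support_vectors, coeffs, bias=0.0, gamma=1.0):
    """f(x) = sum_i coeff_i * K(x, sv_i) + bias, where coeff_i = alpha_i * y_i."""
    return bias + sum(c * gaussian_kernel(x, sv, gamma)
                      for c, sv in zip(coeffs, support_vectors))

# Hypothetical model: one positive and one negative support vector.
support_vectors = [(1.0, 1.0), (0.0, 0.0)]
coeffs = [1.0, -1.0]

same_fold_score = decision_value((1.0, 1.0), support_vectors, coeffs)
different_fold_score = decision_value((0.0, 0.0), support_vectors, coeffs)
```

A positive value places the pair on the "same fold" side of the hyperplane, and the magnitude can serve as the relevance score used for ranking.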
47Training and Cross-Validation
- Standard benchmark (Lindahl's dataset, 976 proteins)
- 976 x 975 query-template pairs (about 7,468 positives)
- Each query (1, 2, 3, ..., 976) has 975 query-template pairs
- Train / Learn: 90% (queries 1-878)
- Test: 10% (queries 879-976); rank 975 templates for each query
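The split on this slide is made by query rather than by pair, so all 975 pairs of a given query fall on the same side. A small sketch of the fold shown (queries 1-878 for training, 879-976 for testing), with hypothetical helper names:

```python
N_PROTEINS = 976  # size of the benchmark stated on the slide

def pairs_for(query_ids, n=N_PROTEINS):
    """All (query, template) pairs for the given queries; a query is never its own template."""
    return [(q, t) for q in query_ids for t in range(1, n + 1) if t != q]

# The fold shown on the slide.
train_queries = range(1, 879)
test_queries = range(879, N_PROTEINS + 1)

test_pairs = pairs_for(test_queries)
total_pairs = N_PROTEINS * (N_PROTEINS - 1)
```

Repeating this with a different tenth of the queries held out each time would give a ten-fold cross-validation; that the folds are contiguous blocks of queries is an assumption based on the numbers shown.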
48Results for Top Five Ranked Templates
Method | Family | Superfamily | Fold
PSI-BLAST | 72.3 | 27.9 | 4.7
HMMER | 73.5 | 31.3 | 14.6
SAM-T98 | 75.4 | 38.9 | 18.7
BLASTLINK | 78.9 | 40.6 | 16.5
SSEARCH | 75.5 | 32.5 | 15.6
SSHMM | 71.7 | 31.6 | 24.0
THREADER | 58.9 | 24.7 | 37.7
FUGUE | 85.8 | 53.2 | 26.8
RAPTOR | 77.8 | 50.0 | 45.1
SPARKS3 | 86.8 | 67.7 | 47.4
FOLDpro | 89.9 | 70.0 | 48.3
- Family: close homologs, more identity
- Superfamily: distant homologs, less identity
- Fold: no evolutionary relation, no identity
49Specificity-Sensitivity Plot (Family)
50Specificity-Sensitivity Plot (Superfamily)
51Specificity-Sensitivity Plot (Fold)
52Advantages of MLIR Framework
- Integration
- Accuracy
- Extensibility
- Simplicity
- Reliability
- Completeness
- Potentials
Disadvantages
Slower than some alignment methods
53A CASP7 Example: T0290
Query sequence (173 residues):
RPRCFFDIAINNQPAGRVV
FELFSDVCPKTCENFRCLCTGEKGTGKSTQKPLHYKSCLFHRVVKDFMVQ
GGDFSEGNGRGGESIYGGFFEDESFAVKHNAAFLLSMANRGKDTNGSQFF
ITKPTPHLDGHHVVFGQVISGQEVVREIENQKTDAASKPFAEVRILSCGE
LIP
FOLDpro predicted structure compared with the experimental structure: RMSD 1 Å
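For reference, RMSD between a predicted and an experimental structure is the square root of the mean squared distance between corresponding atoms. The sketch below assumes the two coordinate sets are already superimposed and matched one to one; the superposition step itself is not shown, and the coordinates are toy values.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long, pre-superimposed coordinate lists."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Hypothetical coordinates: every atom is displaced by exactly 1 angstrom.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
experimental = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
toy_rmsd = rmsd(predicted, experimental)
```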
54Publications and Bioinformatics Tools
1. P. Baldi, J. Cheng, and A. Vullo. Large-Scale Prediction of Disulphide Bond Connectivity. NIPS, 2004. Tool: DIpro 1.0
2. J. Cheng, H. Saigo, and P. Baldi. Large-Scale Prediction of Disulphide Bridges Using Kernel Methods, Two-Dimensional Recursive Neural Networks, and Weighted Graph Matching. Proteins, 2006. Tool: DIpro 2.0
3. J. Cheng and P. Baldi. Three-Stage Prediction of Protein Beta-Sheets by Neural Networks, Alignments, and Graph Algorithms. Bioinformatics, 2005. Tool: BETApro
4. J. Cheng, A. Randall, M. Sweredoski, and P. Baldi. SCRATCH: a Protein Structure and Structural Feature Prediction Server. Nucleic Acids Research, 2005. Tools: SSpro 4 / ACCpro 4 / CMAPpro 2
5. J. Cheng, M. Sweredoski, and P. Baldi. Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data. Data Mining and Knowledge Discovery, 2005. Tool: DISpro
55Publications and Bioinformatics Tools
6. J. Cheng, L. Scharenbroich, P. Baldi, and E. Mjolsness. Sigmoid: Towards a Generative, Scalable, Software Infrastructure for Pathway Bioinformatics and Systems Biology. IEEE Intelligent Systems, 2005. Tool: Sigmoid
7. J. Cheng, A. Randall, and P. Baldi. Prediction of Protein Stability Changes for Single Site Mutations Using Support Vector Machines. Proteins, 2006. Tool: MUpro
8. S. A. Danziger, S. J. Swamidass, J. Zeng, L. R. Dearth, Q. Lu, J. H. Chen, J. Cheng, V. P. Hoang, H. Saigo, R. Luo, P. Baldi, R. K. Brachmann, and R. H. Lathrop. Functional Census of Mutation Sequence Spaces: The Example of p53 Cancer Rescue Mutants. IEEE Transactions on Computational Biology and Bioinformatics, 2006.
9. J. Cheng, M. Sweredoski, and P. Baldi. DOMpro: Protein Domain Prediction Using Profiles, Secondary Structure, Relative Solvent Accessibility, and Recursive Neural Networks. Data Mining and Knowledge Discovery, 2006. Tool: DOMpro
10. J. Cheng and P. Baldi. A Machine Learning Information Retrieval Approach to Protein Fold Recognition. Bioinformatics, 2006. Tool: FOLDpro
56Acknowledgements
- Pierre Baldi
- G. Wesley Hatfield, Eric Mjolsness, Hal Stern, Dennis Decoste, Suzanne Sandmeyer, Richard Lathrop, Gianluca Pollastri, Chin-Rang Yang
- Mike Sweredoski, Arlo Randall, Liza Larsen, Sam Danziger, Trent Su, Hiroto Saigo, Alessandro Vullo, Lucas Scharenbroich
58Markov Models
61 1D-Recursive Neural Network
62 2D-Recursive Neural Network
64 2D-RNNs
65 2D RNNs