Title: Protein secondary structure predictions By: Refael Vivanti
1. Protein secondary structure predictions
By Refael Vivanti and Tal Tabakman
2. Rising accuracy of protein secondary structure prediction
Burkhard Rost
3. Central dogma of biology
- DNA: AGCTCTCTGAGGCTT
- RNA: UCGAGAGACUCCGAA
- Strand of amino acids: AGHTY
- Protein: ?
4. Sequence-structure gap
- Today far more proteins have a known sequence than a known structure.
- The gap is increasing rapidly.
Problem: finding a protein's structure isn't that simple.
Solution: a good start is to find its secondary structure.
5. Comparing methods requires the same terms and tests.
- Secondary structure types:
  - H: helix
  - E: β-strand
  - L/C: other
seq:  A A P P L L L L M M M G I M M R R I M E E E E E
pred: C C C C H H H H C C C E E E
6. How to evaluate a prediction?
The Q3 test:
Q3 = 100% × (number of correctly predicted residues) / (total number of residues)
Of course, all methods should be tested on the same proteins.
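The Q3 test can be sketched in a few lines. This is a minimal illustration, assuming the observed and predicted states are given as equal-length strings over H, E and C; the function name `q3_score` and the two example strings are hypothetical and not taken from the slides.

```python
def q3_score(observed: str, predicted: str) -> float:
    """Percentage of residues whose predicted state matches the observed state."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings must have the same length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Hypothetical example: 14 residues, 2 of them predicted wrongly.
observed = "CCCCHHHHCCCEEE"
predicted = "CCCHHHHHCCCCEE"
print(round(q3_score(observed, predicted), 1))
```

Because Q3 is a per-residue average, it says nothing about whether the predicted segments have realistic lengths.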
7. Old methods
- First generation: single-residue statistics
  - Chou and Fasman (1974)
  - Some residues have a particular secondary structure preference.
  - Examples: Glu prefers α-helix; Val prefers β-strand.
- Second generation: segment statistics
  - Similar, but also considers adjacent residues.
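Single-residue statistics can be illustrated as a propensity: how often a residue type occurs in a given state, relative to how often any residue occurs in that state. The sketch below is an assumption about how such a table might be computed, not the published Chou-Fasman procedure, and the two toy proteins are invented so that Glu favors helix and Val favors strand.

```python
from collections import Counter


def state_propensities(sequences, structures, state):
    """Propensity of each residue type for `state`:
    (fraction of that residue found in `state`) / (fraction of all residues in `state`)."""
    residue_total = Counter()
    residue_in_state = Counter()
    for seq, ss in zip(sequences, structures):
        for aa, s in zip(seq, ss):
            residue_total[aa] += 1
            if s == state:
                residue_in_state[aa] += 1
    n_total = sum(residue_total.values())
    n_state = sum(residue_in_state.values())
    if n_state == 0:
        return {aa: 0.0 for aa in residue_total}
    background = n_state / n_total
    return {aa: (residue_in_state[aa] / residue_total[aa]) / background
            for aa in residue_total}


# Invented toy data: sequences with their observed states (H, E, C).
toy_sequences = ["EEAEVVGV", "AEEVGVVE"]
toy_structures = ["HHHHEECE", "CHHECEEH"]

helix = state_propensities(toy_sequences, toy_structures, "H")
strand = state_propensities(toy_sequences, toy_structures, "E")
print(helix["E"] > 1.0, strand["V"] > 1.0)
```

A propensity above 1 means the residue is found in that state more often than average; a second-generation method would pool such statistics over a segment of adjacent residues.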
8. Difficulties
Poor accuracy: Q3 below 66%.
Q3 for strands (E): 28-48%.
Predicted segments were too short.
9. Methods accuracy comparison
10. Third-generation methods
- Third-generation methods reached 77% accuracy.
- They rest on two new ideas:
  1. A biological idea: using evolutionary information.
  2. A technological idea: using neural networks.
11. How can evolutionary information help us?
Homologues have similar structures.
But their sequences can differ by up to 85%.
How a sequence varies depends on the structure.
12. How can evolutionary information help us?
Where can we find high sequence conservation?
Some examples:
- In defined secondary structures.
- In protein core segments (more hydrophobic).
- In amphipathic helices (a periodic pattern of hydrophobic and hydrophilic residues).
13. How can evolutionary information help us?
- Predictions based on multiple alignments were made manually.
- Problem: there isn't any well-defined algorithm!
- Solution: use neural networks.
14. Artificial Neural Networks
An attempt to imitate the structure of the human brain (assuming this is how it works).
When do we use it?
When we can't solve the problem ourselves!
15. Artificial Neural Network
- The basic structure of a neural network:
  - A large number of processors (neurons).
  - Highly connected.
  - Working together.
16. Artificial Neural Network
What does a neuron do?
- It gets signals from its neighbours.
- Each signal has a different weight.
- When the weighted input reaches a certain threshold, it sends a signal.
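The behaviour of a single neuron described above can be sketched as a weighted sum compared with a threshold. This is a minimal illustration; the function name `neuron_fires` and the numbers are hypothetical.

```python
def neuron_fires(signals, weights, threshold):
    """Return True when the weighted sum of the incoming signals reaches the threshold."""
    if len(signals) != len(weights):
        raise ValueError("each incoming signal needs exactly one weight")
    total = sum(s * w for s, w in zip(signals, weights))
    return total >= threshold


# Three neighbours with different weights; the neuron fires at a weighted sum of 1.0.
weights = [0.6, 0.9, 0.5]
print(neuron_fires([1, 0, 1], weights, threshold=1.0))  # weighted sum 1.1
print(neuron_fires([0, 1, 0], weights, threshold=1.0))  # weighted sum 0.9
```

Networks trained by back-propagation usually replace the hard threshold with a smooth function such as a sigmoid, so that small weight changes produce small output changes.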
17. Artificial Neural Network
General structure of an ANN
- Our ANNs have one-directional flow (feed-forward)!
18. Artificial Neural Network
- Because neurons form a complete system (they can implement any logic function), a neural network can, in principle, compute anything.
19. Artificial Neural Network
Network training and testing
[Diagram: training set -> neural network; incorrect output -> back-propagation; correct output -> test set]
- Training set: inputs for which we know the desired output.
- Back-propagation: an algorithm for adjusting the weights of the connections between neurons.
- Test set: inputs used for the final test of network performance.
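The training and testing cycle can be sketched with the smallest possible network: a single sigmoid neuron. With only one layer, back-propagation reduces to the simple weight update shown here; a real multi-layer network propagates the same error signal back through its hidden layers. The data, learning rate and function names are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, bias, inputs):
    """Output of one sigmoid neuron, between 0 and 1."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train(training_set, epochs=2000, rate=0.5):
    """Adjust the weights after every example, in proportion to the output error."""
    weights = [0.0] * len(training_set[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in training_set:
            error = target - predict(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


def accuracy(weights, bias, data):
    """Fraction of examples whose thresholded output matches the known answer."""
    hits = sum((predict(weights, bias, x) >= 0.5) == bool(t) for x, t in data)
    return hits / len(data)


# Training set: inputs with known desired outputs (toy data, two separable groups).
training_set = [((0.0, 0.0), 0), ((0.2, 0.1), 0), ((0.1, 0.3), 0),
                ((1.0, 1.0), 1), ((0.9, 0.8), 1), ((0.8, 1.0), 1)]
# Test set: inputs never shown during training, used only for the final check.
test_set = [((0.05, 0.05), 0), ((0.95, 0.95), 1)]

weights, bias = train(training_set)
print(accuracy(weights, bias, training_set), accuracy(weights, bias, test_set))
```

The important design point is the separation: the weights are fitted on the training set only, and the reported performance comes from the test set.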
20. Artificial Neural Network
- The network is a black box.
- Even when it succeeds, it's hard to understand how.
- It's difficult to extract an algorithm from the network.
- It's hard to deduce new scientific principles.
21. Structure of third-generation methods
1. Find homologues using large databases.
2. Create a profile representing the entire protein family.
3. Give the sequence and the profile to the ANN.
4. Output of the ANN: the secondary structure prediction.
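The profile step above can be sketched as per-column residue frequencies over an alignment of the family. This is a simplified assumption about what a profile contains (real methods also weight sequences and use log-odds scores); the function name `build_profile` and the aligned toy family are hypothetical.

```python
from collections import Counter


def build_profile(aligned_sequences):
    """Return one dict per alignment column mapping residue -> frequency.
    Gap characters ('-') are ignored."""
    length = len(aligned_sequences[0])
    if any(len(s) != length for s in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*aligned_sequences):
        counts = Counter(aa for aa in column if aa != "-")
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


# Hypothetical family of four aligned homologues.
family = ["ADCQE", "ADSQE", "AECQ-", "ADCKE"]
profile = build_profile(family)
for position, frequencies in enumerate(profile):
    print(position, frequencies)
```

A fully conserved column (such as the first one here) carries a different signal from a variable one, which is the evolutionary information the ANN receives in addition to the sequence itself.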
22. Structure of third-generation methods
- The ANN learning process
  - Training and testing sets: proteins with known sequence and structure.
  - Training:
    - Feed the training set to the ANN as input.
    - Compare the output to the known structure.
    - Back-propagate.
23. Third-generation methods: difficulties
Main problem: unwise selection of the training and test sets for the ANN.
- First problem: unbalanced training
  - Overall protein composition:
    - Helices: 32%
    - Strands: 21%
    - Coils: 47%
What will happen if we train the ANN with random segments?
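With the composition above, randomly chosen training segments are dominated by coils, and a predictor that always answers "coil" already scores a Q3 of 47%. One possible remedy, sketched here as an assumption rather than as the procedure of any particular method, is to draw the same number of training examples from each state. The function name `balanced_sample` and the toy pool are hypothetical.

```python
import random
from collections import Counter


def balanced_sample(examples, per_class, seed=0):
    """examples: list of (window, state) pairs.
    Return `per_class` randomly chosen examples of every state, shuffled."""
    rng = random.Random(seed)
    by_state = {}
    for window, state in examples:
        by_state.setdefault(state, []).append((window, state))
    sample = []
    for state, items in by_state.items():
        if len(items) < per_class:
            raise ValueError(f"not enough examples of state {state!r}")
        sample.extend(rng.sample(items, per_class))
    rng.shuffle(sample)
    return sample


# Toy pool mirroring the composition on the slide: 32 helix, 21 strand, 47 coil.
pool = ([(f"h{i}", "H") for i in range(32)]
        + [(f"e{i}", "E") for i in range(21)]
        + [(f"c{i}", "C") for i in range(47)])

training_examples = balanced_sample(pool, per_class=20)
print(Counter(state for _, state in training_examples))
```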
24. Third-generation methods: difficulties
- Second problem: unwise separation between training and test proteins
What will happen if homology or correlation exists between the test and training proteins?
Over-optimism!
Above 80% accuracy in testing.
- Third problem: similarity between test proteins.
25. Protein Secondary Structure Prediction Based on Position-Specific Scoring Matrices
David T. Jones
PSI-PRED: a third-generation method based on the iterated PSI-BLAST algorithm.
26. PSI-BLAST
PSSM: position-specific scoring matrix
[Diagram: sequence -> distant homologues -> PSSM]
- PSI-BLAST outperforms other algorithms in finding distant homologues.
- The PSSM is the input for PSI-PRED.
27. PSI-PRED
ANN architecture
- Two ANNs working together.
Sequence + PSSM -> 1st ANN -> prediction -> 2nd ANN -> final prediction
28. PSI-PRED
- Step 1
  - Create a PSSM from the sequence: 3 iterations of PSI-BLAST.
- Step 2: 1st ANN
  - Input: sequence + PSSM, in a window of 15 residues:
    A D C Q E I L H T S T T W Y V
  - Output: the secondary state prediction (E/H/C) for the central amino acid.
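The 15-residue window can be sketched as follows: for every residue, take the 7 residues on each side and predict the state of the one in the middle. In the method itself each window position would carry PSSM values; the plain letters, the padding character and the function name `residue_windows` are simplifying assumptions for illustration.

```python
def residue_windows(sequence, size=15, pad="-"):
    """Yield (central_residue, window) for every position in the sequence.
    The ends are padded so that every residue gets a full-size window."""
    if size % 2 == 0:
        raise ValueError("window size must be odd so that it has a central residue")
    half = size // 2
    padded = pad * half + sequence + pad * half
    for i, residue in enumerate(sequence):
        yield residue, padded[i:i + size]


windows = list(residue_windows("ADCQEILHTSTTWYV"))
print(windows[7])  # the window centred on H, the middle residue of the slide's example
print(windows[0])  # the first residue, padded on the left
```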
29. PSI-PRED
Using PSI-BLAST brings PSI-BLAST's difficulties with it:
- Each iteration extends the protein family and updates the PSSM.
- Inclusion of non-homologues leads to a misleading PSSM.
30. PSI-PRED
Step 3: 2nd ANN
- So why do we need a second ANN?
A possible output of the 1st ANN:
seq:  A A P P L L L L M M M G I M M R R I M E E E E E
pred: C C C C C H C C C C C E E E
What's wrong with that? A one-amino-acid helix doesn't exist.
Solution: an ANN that looks at the whole context!
Input: the output of the 1st ANN.
Output: the final prediction.
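The problem with the first-pass output can be made concrete by scanning a prediction for segments that are too short to be real. The sketch below only detects such segments; it is not the second ANN, which learns the correction from context. The minimum length of 3 and the function name `short_segments` are assumptions.

```python
from itertools import groupby


def short_segments(prediction, state="H", min_length=3):
    """Return (start, length) for every run of `state` shorter than `min_length`."""
    found = []
    position = 0
    for symbol, run in groupby(prediction):
        length = len(list(run))
        if symbol == state and length < min_length:
            found.append((position, length))
        position += length
    return found


first_pass = "CCCCCHCCCCCEEE"  # the first-ANN output shown on the slide
print(short_segments(first_pass))
```

The single H at index 5 is reported as a one-residue helix, which is exactly the kind of output the second ANN is meant to clean up.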
31. PSI-PRED
Training
- 10% of the proteins were held out as an internal test set.
Testing
- 187 proteins with highly resolved structures.
- PSI-BLAST was used to remove homologues.
- No structural similarities.
32. PSI-PRED
Jones's reported results
33. PSI-PRED
Reliability numbers
- The way the ANN tells us how sure it is about the assignment.
- They correlate with accuracy.
34. Performance evaluation
- With third-generation methods, accuracy jumped by 10%.
- Many third-generation methods exist today.
Which method is the best one? How do we recognize over-optimism?
35. Performance evaluation
CASP: Critical Assessment of Techniques for Protein Structure Prediction.
EVA: Automatic Evaluation of Automatic Prediction Servers.
37. Performance evaluation
Conclusion: PSI-PRED seems to be one of the most reliable methods today.
Reasons:
- The widest evolutionary information (PSI-BLAST profiles).
- Strict training and testing criteria for the ANN.
38. Improvements
The first third-generation method, PHD: Q3 of 72%.
Best results of third-generation methods: Q3 of 77%.
Sources of improvement:
- Larger protein databases.
- PSI-BLAST.
- PSI-PRED broke through, and many followed...
39. Improvements
How can we do better than that?
- Through larger databases (?).
- By combining methods. Example: combining the 4 best methods gives a Q3 of 78%!
- By finding out why certain proteins are predicted poorly.
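The slides do not say how the four methods were combined. One simple possibility, shown here purely as an assumption, is a per-residue majority vote over the individual predictions; the function name `consensus` and the four toy predictions are hypothetical.

```python
from collections import Counter


def consensus(predictions):
    """Per-residue majority vote over equal-length prediction strings.
    Ties go to the method listed first."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    result = []
    for column in zip(*predictions):
        counts = Counter(column)
        best = max(counts.values())
        # Walk the column in method order so that ties favor the earliest method.
        result.append(next(state for state in column if counts[state] == best))
    return "".join(result)


# Four hypothetical methods predicting the same 8 residues.
predictions = ["CHHHHCEE", "CCHHHCEE", "CHHHCCEC", "CHHHHCCE"]
print(consensus(predictions))
```

A vote helps only when the methods make different mistakes; methods trained on the same data with the same inputs tend to agree on their errors as well.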
40. Improvements
What is the limit of prediction improvement?
- Some regions of proteins are more mobile than others.
- For 12% of protein structure, the state is unknown even by manual methods.
- So the limit of accuracy is 88%!
41. Secondary structure prediction in practice
Secondary structure prediction is used for:
- finding structural switches
- genome analysis
- protein structure
42. Finding Structural Switches
Young et al.
- Predict the secondary structure with several methods.
- Different results, same preferences.
- Structural switch?
43. Bibliography
- Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999; 292:195-202.
- Rost B. Rising accuracy of protein secondary structure prediction. In: Protein structure determination, analysis, and modeling for drug discovery (ed. D Chasman). New York: Dekker, pp. 207-249.