Title: Protein Structure Prediction
1Protein Structure Prediction
- Shandar Ahmad
- Kyushu Institute of Technology,
- Iizuka 820 8502,
- Fukuoka-ken, Japan
- shandar_at_bse.kyutech.ac.jp
2Secondary structure The basic unit of protein
structure
- Protein structures are stabilized by hydrogen
bonds between atoms of the amino acid sequence.
- There is a choice of hydrogen bond partner for
each residue.
- The pattern of hydrogen bond pairing determines
secondary structure.
3Types of secondary structure
- Eight types of secondary structures have been
defined by Kabsch and Sander in DSSP (Dictionary
of secondary structures in proteins). They are
- 1. Alpha helix (H)
- 2. Isolated beta bridge (B)
- 3. Extended beta (E)
- 4. 3-10 helix (G)
- 5. Pi-helix (I)
- 6. Turn (T)
- 7. Bend (S)
- 8. Coil (C)
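For prediction work the eight DSSP classes are often collapsed into three states (helix, strand, coil). The following Python sketch shows one such reduction. The grouping used here (H, G, I to helix; E, B to strand; everything else to coil) is an assumption that does not come from the slides, and other groupings are also in use.

```python
# Assumed 8-to-3 state reduction; other schemes exist (e.g. only H -> helix).
DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # alpha, 3-10 and pi helices -> helix
    "E": "E", "B": "E",             # extended strand, isolated bridge -> strand
    "T": "C", "S": "C", "C": "C",   # turn, bend, coil -> coil
}

def reduce_states(dssp_string):
    """Map a string of DSSP codes to a three-state string (H, E, C)."""
    # Any symbol not listed above (for example a blank) falls back to coil.
    return "".join(DSSP_TO_3.get(code, "C") for code in dssp_string)

print(reduce_states("CHHHHGGGTTEEEEBSC"))
```

The choice of grouping changes the reported accuracy of a predictor, so it should be stated whenever three-state results are compared.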
4Alpha helix Hydrogen bond is formed between nth
and (n+4)th residues
5Beta strand (E), part of beta ladder
6Beta strand (E) cont..
7Turn structure (T)
8Other helices
9Bend Conformation
- A bend is caused by interactions with other
parts of the protein.
- Proline introduces a bend due to conformational
constraints.
- Water molecules cause a bend to maximize C=O
exposure to water.
10Some structural domains in proteins
11Methods to get secondary structure from
experimentally known structures
- DSSP is the most commonly used program to
calculate the secondary structure of proteins.
- DSSP also provides a database from which secondary
structure can be retrieved by PDB code.
- Database and programs can be accessed at
- http://www.cmbi.kun.nl/gv/dssp/
- The program can be downloaded for local calculations.
- PDB files also contain secondary structure in
their headers, but only in broad detail.
12Prediction of secondary structure
- Older methods
- Chou and Fasman method (1974)
- The Chou-Fasman method of secondary structure
prediction depends on assigning a set of
prediction values to a residue and then applying
a simple algorithm to those numbers.
- For example, the turn probability of four
consecutive residues starting at position j is
- $p(t) = f(j)\,f(j+1)\,f(j+2)\,f(j+3)$ (see next table)
- Online predictions: http://fasta.bioch.virginia.edu/fasta_www/chofas.htm
- Typical success rate of prediction is about 50%
13
Name           P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine        142   83    66       0.06   0.076   0.035   0.058
Arginine       98    93    95       0.070  0.106   0.099   0.085
Aspartic Acid  101   54    146      0.147  0.110   0.179   0.081
Asparagine     67    89    156      0.161  0.083   0.191   0.091
Cysteine       70    119   119      0.149  0.050   0.117   0.128
Glutamic Acid  151   37    74       0.056  0.060   0.077   0.064
Glutamine      111   110   98       0.074  0.098   0.037   0.098
Glycine        57    75    156      0.102  0.085   0.190   0.152
Histidine      100   87    95       0.140  0.047   0.093   0.054
Isoleucine     108   160   47       0.043  0.034   0.013   0.056
Leucine        121   130   59       0.061  0.025   0.036   0.070
Lysine         114   74    101      0.055  0.115   0.072   0.095
Methionine     145   105   60       0.068  0.082   0.014   0.055
Phenylalanine  113   138   60       0.059  0.041   0.065   0.065
Proline        57    55    152      0.102  0.301   0.034   0.068
Serine         77    75    143      0.120  0.139   0.125   0.106
Threonine      83    119   96       0.086  0.108   0.065   0.079
Tryptophan     108   137   96       0.077  0.013   0.064   0.167
Tyrosine       69    147   114      0.082  0.065   0.114   0.125
Valine         106   170   50       0.062  0.048   0.028   0.053
Source: http://prowl.rockefeller.edu/aainfo/chou.htm
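The turn probability formula and the bend frequencies tabulated above can be combined as in the following Python sketch. Only four residues are copied from the table to keep it short. The cutoff value is the one commonly quoted for the Chou-Fasman method and is an assumption here, since the slides do not give it. The full method also compares averaged P(turn), P(a) and P(b) values, which this sketch leaves out.

```python
# Bend frequencies f(i), f(i+1), f(i+2), f(i+3) copied from the table
# for four residues only; extend the dictionary for real use.
TURN_FREQ = {
    "N": (0.161, 0.083, 0.191, 0.091),  # Asparagine
    "P": (0.102, 0.301, 0.034, 0.068),  # Proline
    "G": (0.102, 0.085, 0.190, 0.152),  # Glycine
    "S": (0.120, 0.139, 0.125, 0.106),  # Serine
}

# Commonly quoted cutoff for p(t); an assumption, not given in the slides.
TURN_CUTOFF = 0.000075

def turn_probability(tetrapeptide):
    """Return p(t) = f(j) * f(j+1) * f(j+2) * f(j+3) for four residues."""
    if len(tetrapeptide) != 4:
        raise ValueError("expected exactly four residues")
    p = 1.0
    for position, residue in enumerate(tetrapeptide):
        # Each residue contributes the frequency for its own turn position.
        p *= TURN_FREQ[residue][position]
    return p

p_t = turn_probability("NPGS")
print(f"p(t) = {p_t:.6f}, above cutoff: {p_t > TURN_CUTOFF}")
```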
14Further improvements
- 1978: Garnier improved the method by including
statistically significant pair-wise interactions
between residues when scoring each conformation.
This improved the success rate to 62%.
- 1993: Levin improved the prediction level by using
multiple sequence alignments.
- The reasoning is as follows.
- Conserved regions in a multiple sequence
alignment provide a strong evolutionary
indicator of a role in the function of the
protein.
- Those regions are also likely to have conserved
structure, including secondary structure, and
strengthen the prediction by their joint
propensities.
- This improved the success rate to 69%.
15Neural network based methods
- In 1988-1989, Qian and Sejnowski and Holley and Karplus
introduced the first neural network based methods.
- Sequence information is sent to the neural
network and the output is classified as helix,
beta, or other secondary structures.
- See next figure
16(No Transcript)
17Encoding the amino acids
- Amino acid residues are coded as 21-bit binary
vectors.
- To predict the secondary structure of a residue,
the encoded residue and its neighbours
are sent to the neural network.
- The network is trained and validated on known
structural data.
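The encoding step can be illustrated with a short Python sketch. It assumes that the 21st bit marks window positions that fall outside the sequence, and it uses a window of 13 residues (six neighbours on each side). Both choices are assumptions for illustration and are not stated in the slides.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard residues
PAD = "-"                              # assumed 21st symbol for window overhang
ALPHABET = AMINO_ACIDS + PAD

def one_hot(symbol):
    """Return a 21-element list with a single 1 at the symbol's index."""
    vector = [0] * len(ALPHABET)
    vector[ALPHABET.index(symbol)] = 1   # raises ValueError for unknown symbols
    return vector

def encode_window(sequence, centre, half_width=6):
    """Encode the residue at `centre` plus `half_width` neighbours per side."""
    bits = []
    for pos in range(centre - half_width, centre + half_width + 1):
        # Positions before the first or after the last residue use the pad symbol.
        symbol = sequence[pos] if 0 <= pos < len(sequence) else PAD
        bits.extend(one_hot(symbol))
    return bits

features = encode_window("MKTAYIAKQR", centre=0)
print(len(features))   # 13 window positions x 21 bits each
```

Each residue of a protein yields one such input vector, and the network output for that vector is the predicted state of the central residue.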
18Other advanced methods of prediction
- PHD (PredictProtein)
- 1994: Rost and Sander combined neural networks
with multiple sequence alignments. The success
rate is 72%.
- http://www.embl-heidelberg.de/predictprotein/predictprotein.html
- Jpred (Cuff and Barton)
- http://www.compbio.dundee.ac.uk/www-jpred/
- Predator (Frishman D, Argos P)
- http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page/NPSA/npsa_preda.html
- PSIPRED (DT Jones)
- http://bioinf.cs.ucl.ac.uk/psipred/
19Solvent accessibility of amino acid residues
- This is another important property of amino acids
in proteins, which we want to predict.
- Solvent accessibility is defined as the area
of the surface of a residue that is exposed
to water (or any solvent).
- Higher solvent accessibility, or accessible
surface area (ASA), indicates a greater chance of
interactions with DNA, ligands etc. and of being in
the active sites.
20Accessible surface area or solvent accessibility
21Relative solvent accessibility
- Total ASA of an amino acid is normalised to a
percentage scale.
- Scaling is different for each of the 20 types of amino acids.
- The ASA of the extended state (Gly-X-Gly or Ala-X-Ala)
is used for scaling.
- Sometimes this relative ASA is used to say if a
residue is exposed or buried. E.g. if relative ASA is more
than 25%, it may be called exposed, and if less
than 25% it may be called buried.
- Different values of threshold (other than 25%)
are used by different people.
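The normalisation and the exposed/buried split can be sketched in Python as follows. The function names and the numbers in the example are illustrative assumptions. In particular, the extended-state ASA must be supplied by the caller from a Gly-X-Gly or Ala-X-Ala reference table, which is not reproduced here.

```python
def relative_asa(asa, extended_asa):
    """Return ASA as a percentage of the extended-state ASA of that residue type."""
    if extended_asa <= 0:
        raise ValueError("extended-state ASA must be positive")
    return 100.0 * asa / extended_asa

def burial_state(rel_asa, threshold=25.0):
    """Label a residue exposed if its relative ASA exceeds the threshold."""
    return "exposed" if rel_asa > threshold else "buried"

# Illustrative numbers only: 24 square angstroms observed, 120 assumed
# for the extended state of the same residue type.
rel = relative_asa(24.0, 120.0)
print(rel, burial_state(rel), burial_state(rel, threshold=16.0))
```

The same residue can change category when the threshold changes, which is why the threshold should always be reported with two-state results.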
22Solvent accessibility prediction methods
- The PHD server described above also gives ASA predictions.
It divides residues into buried and exposed
categories at a 16% threshold and gives a
prediction.
- A real value prediction method based on a neural
network was developed by us (Ahmad and Sarai
2003), which predicts ASA with a mean
absolute error of about 18% (better than any other prediction
method available).
- http://gibk26.bse.kyutech.ac.jp/shandar/netasa/rvp-net/
- This is the only server which also provides
graphical outputs.
23A graphical prediction of Solvent accessibility
by RVP-Net. Shandar Ahmad and Akinori Sarai, 2003
24Measuring prediction accuracy
- Different scales of prediction are used.
- Single residue accuracy or Q index
- (Qhelix, Qstrand, Qcoil, Q3) gives the percentage of
residues predicted correctly as helix, strand,
coil or for all three conformational states. The
definition of the Q index is as follows.
- For a single conformational state
- $Q_i = \dfrac{\text{number of residues correctly predicted in state } i}{\text{number of residues observed in state } i} \times 100$,
- where i is either helix, strand or coil.
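A minimal Python sketch of the Q index follows. It assumes that the observed and predicted structures are given as equal-length strings over H, E and C, and the function names are chosen only for illustration.

```python
def q_state(observed, predicted, state):
    """Percentage of residues observed in `state` that are also predicted in it."""
    n_observed = sum(1 for o in observed if o == state)
    if n_observed == 0:
        return float("nan")   # state absent from the observed structure
    n_correct = sum(
        1 for o, p in zip(observed, predicted) if o == state and p == state
    )
    return 100.0 * n_correct / n_observed

def q3(observed, predicted):
    """Percentage of all residues whose three-state label is predicted correctly."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    n_correct = sum(1 for o, p in zip(observed, predicted) if o == p)
    return 100.0 * n_correct / len(observed)

observed = "CHHHHHCCEEEEC"
predicted = "CHHHHCCCEEECC"
print(q_state(observed, predicted, "H"), q_state(observed, predicted, "E"))
print(round(q3(observed, predicted), 1))
```

In this example one helix residue and one strand residue are mispredicted as coil, so Qhelix and Qstrand fall below 100 while Qcoil stays at 100.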
25Other scores
- Correlation coefficient $R = S_{xy} / \sqrt{S_{xx} S_{yy}}$, where $S_{xy} = \sum (x - x_o)(y - y_o)$
- $S_{xx} = \sum (x - x_o)^2$
- $S_{yy} = \sum (y - y_o)^2$
- Subscript o represents the mean value of the
corresponding variable.
- Sensitivity = TP / (TP + FN)
- Specificity = TN / (TN + FP) (T-True, F-False,
P-Positive, N-Negative)
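These scores can be computed as in the Python sketch below. It assumes that R is the usual Pearson correlation, with the square root in the denominator, and the function names are illustrative.

```python
import math

def correlation(x, y):
    """Correlation R = Sxy / sqrt(Sxx * Syy) between two equal-length lists."""
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    syy = sum((b - y_mean) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def sensitivity(tp, fn):
    """Fraction of actual positives that are predicted positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of actual negatives that are predicted negative."""
    return tn / (tn + fp)

# Example: observed versus predicted relative ASA values (made-up numbers).
print(correlation([10, 40, 70, 20], [15, 35, 60, 30]))
print(sensitivity(tp=8, fn=2), specificity(tn=9, fp=1))
```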
26Segment overlap SOV score
- SOV: Segment OVerlap quantity measure for a
single conformational state
- $\mathrm{SOV}(i) = \dfrac{1}{N(i)} \sum_{S(i)} \dfrac{\mathrm{MINOV}(S_1,S_2) + \mathrm{DELTA}(S_1,S_2)}{\mathrm{MAXOV}(S_1,S_2)} \times \mathrm{LEN}(S_1)$
- Where
- S1 and S2 are the observed and predicted
secondary structure segments (in state i, which
can be either H, E or C)
- LEN(S1) is the number of residues in the segment S1
- MINOV(S1,S2) is the length of actual overlap
of S1 and S2, i.e. the extent for which both
segments have residues in state i, for example H
- MAXOV(S1,S2) is the length of the total extent
for which either of the segments S1 or S2 has a
residue in state i
- DELTA(S1,S2) is the integer value defined as being equal to
$\min\{\mathrm{MAXOV}(S_1,S_2) - \mathrm{MINOV}(S_1,S_2);\ \mathrm{MINOV}(S_1,S_2);\ \mathrm{INT}(\mathrm{LEN}(S_1)/2);\ \mathrm{INT}(\mathrm{LEN}(S_2)/2)\}$
- The sum is taken over S(i), all the pairs of
segments {S1,S2}, where S1 and S2 have at least
one residue in state i in common
- N(i) is the number of residues in state i
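The definitions above translate into the following Python sketch for a single state. It takes N(i) literally as the number of observed residues in state i, as stated on the slide. Published SOV variants normalise slightly differently when one observed segment overlaps several predicted ones, so this is a sketch under that assumption and not a reference implementation.

```python
from itertools import groupby

def segments(states, state):
    """Return (start, end) pairs, end exclusive, for runs of `state`."""
    runs, pos = [], 0
    for symbol, group in groupby(states):
        length = len(list(group))
        if symbol == state:
            runs.append((pos, pos + length))
        pos += length
    return runs

def sov(observed, predicted, state):
    """Segment overlap for one state, as a fraction (multiply by 100 for percent)."""
    n_state = observed.count(state)
    if n_state == 0:
        return float("nan")   # state absent from the observed structure
    total = 0.0
    for s1_start, s1_end in segments(observed, state):
        for s2_start, s2_end in segments(predicted, state):
            minov = min(s1_end, s2_end) - max(s1_start, s2_start)
            if minov <= 0:
                continue      # the two segments share no residue
            maxov = max(s1_end, s2_end) - min(s1_start, s2_start)
            len1, len2 = s1_end - s1_start, s2_end - s2_start
            delta = min(maxov - minov, minov, len1 // 2, len2 // 2)
            total += (minov + delta) / maxov * len1
    return total / n_state

# A helix predicted one residue short still scores fully, because DELTA
# tolerates small deviations at segment ends.
print(sov("HHHHHHCCCC", "CHHHHHCCCC", "H"))
# A helix predicted at half its length is penalised.
print(sov("HHHHHHHHCC", "HHHHCCCCCC", "H"))
```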
27Higher level predictions of protein structure
- This includes prediction of
- Structure class as defined by SCOP or CATH
- Folds, as defined by SCOP and CATH
- Complete three dimensional structure.
28Prediction of protein structure class
- Some secondary structure prediction servers also
predict class, e.g.
- http://www.cmpharm.ucsf.edu/jmc/pred2ary/
- (Chandonia and Karplus)
- All-helix and all-beta classes are easier to predict
than a/b and a+b structural classes.
29Protein fold prediction
- Approaches to fold prediction may be classified
into two categories
- Sequence to sequence prediction: based on getting
the best alignments with known structures and
predicting the fold.
- Sequence to structure methods: structure is
encoded as a sequence of residue environments.
A score is assigned to each residue and finally
the scores are added to detect the probability of a
given fold.
30Best fold predictors
- CASP is a biennial meeting for evaluating
structure prediction. The following methods were
found to be the best in 2002.
- Krzysztof Ginalski and Leszek Rychlewski
- Nucleic Acids Research, 2003, Vol. 31, No. 13,
3291-3292
- http://BioInfo.PL/Meta
- This is a meta server working on the 3D-Jury method. It
collects predictions from many servers and
develops a consensus model based prediction.
31(No Transcript)
32Pcons Consensus predictor
- Earlier version of the Meta server, but with a slight
difference in building the final prediction
- http://www.sbc.su.se/arne/pmodeller/
33ROSETTA predictor by Baker
- This is an ab initio method of structure
prediction.
- When there is no significant alignment available,
this is the only way to predict.
- Performs better than all other predictors.
- No online predictions, but the group website is here
- http://depts.washington.edu/bakerpg/highlights1.html
34Three dimensional structure prediction
- Methods are based on
- Ab initio, molecular dynamics and Monte Carlo
methods of energy minimization.
- Comparative modelling using sequence alignments
with known structures.
- Combination of the above two methods.
35Some prediction methods/ predicted model databases
- Modeller
- http://www.salilab.org/modeller/modeller.html
- SwissModel
- http://www.expasy.org/swissmod/
- FAMSBASE
- http://famsbase.bio.nagoya-u.ac.jp/famsbase/
- GenTHREADER and PSIPRED
- http://bioinf.cs.ucl.ac.uk/psipred/
- UCLA/DOE Fold Server (includes DASEY, in case no
alignments are found).
- http://fold.doe-mbi.ucla.edu/