Title: Protein Secondary Structure Prediction
1Protein Secondary Structure Prediction
Structural Bioinformatics
2The first high-resolution structure of a protein, myoglobin,
was solved in 1958 by Max Perutz and John Kendrew of
Cambridge University (they won the 1962 Nobel Prize in Chemistry).
On 12.12.2013 there were 89,110 protein structures in the protein structure
database. This is a great increase, but still an order of magnitude
lower than the number of sequences in the protein sequence
databases (close to 1,000,000).
3What can we do to bridge the gap?
MERFGYTRAANCEAP.
- Predicting the three-dimensional structure of a protein
from its sequence is very hard (sometimes impossible)
- However, we can predict the secondary structure
with relatively high accuracy
4What do we mean by Secondary Structure?
- Secondary structure elements are the building blocks of
the protein structure
5What do we mean by Secondary Structure?
- Secondary structure is usually divided into
three categories:
- Alpha helix
- Beta strand (sheet)
- Anything else: turn/loop
6The different secondary structures are combined
together to form the Tertiary Structure of the
Protein
7[Figure: secondary and tertiary structures of RBP and Globin]
8Secondary Structure Prediction
- Given a primary sequence, e.g.
- ADSGHYRFASGFTYKKMNCTEAA
- what secondary structure will it adopt
- (alpha helix, beta strand or random coil)?
9Secondary Structure Prediction Methods
- Statistical methods
- Based on amino acid frequencies
- HMM (Hidden Markov Model)
- Machine learning methods
- SVM, Neural networks
10Chou and Fasman (1974)
Statistical Methods for SS prediction
Name             P(a)   P(b)   P(turn)
Alanine           142     83      66
Arginine           98     93      95
Aspartic Acid     101     54     146
Asparagine         67     89     156
Cysteine           70    119     119
Glutamic Acid     151     37      74
Glutamine         111    110      98
Glycine            57     75     156
Histidine         100     87      95
Isoleucine        108    160      47
Leucine           121    130      59
Lysine            114     74     101
Methionine        145    105      60
Phenylalanine     113    138      60
Proline            57     55     152
Serine             77     75     143
Threonine          83    119      96
Tryptophan        108    137      96
Tyrosine           69    147     114
Valine            106    170      50
Each value is the propensity of an amino acid to be part of a
certain secondary structure (e.g. Proline has a
low propensity of being in an alpha helix or beta
sheet, so it acts as a breaker)
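As an illustration of how the propensity table can be read, here is a minimal Python sketch. The values are copied from the table above; the one-letter residue codes, the function names and the cutoff of 100 used to call a residue a "breaker" are assumptions made for this example, not part of the original Chou and Fasman rules.

```python
# Propensities from the table above, keyed by one-letter code:
# (P(a), P(b), P(turn))
PROPENSITY = {
    "A": (142, 83, 66),   "R": (98, 93, 95),    "D": (101, 54, 146),
    "N": (67, 89, 156),   "C": (70, 119, 119),  "E": (151, 37, 74),
    "Q": (111, 110, 98),  "G": (57, 75, 156),   "H": (100, 87, 95),
    "I": (108, 160, 47),  "L": (121, 130, 59),  "K": (114, 74, 101),
    "M": (145, 105, 60),  "F": (113, 138, 60),  "P": (57, 55, 152),
    "S": (77, 75, 143),   "T": (83, 119, 96),   "W": (108, 137, 96),
    "Y": (69, 147, 114),  "V": (106, 170, 50),
}
STATES = ("alpha helix", "beta strand", "turn")


def preferred_state(residue):
    """Return the state with the highest propensity for this residue."""
    scores = PROPENSITY[residue]
    return STATES[scores.index(max(scores))]


def is_breaker(residue, cutoff=100):
    """Assumed rule: low propensity for both helix and strand."""
    p_a, p_b, _ = PROPENSITY[residue]
    return p_a < cutoff and p_b < cutoff


for res in "APV":
    print(res, preferred_state(res), is_breaker(res))
```

With this assumed cutoff, Proline is flagged as a breaker, in line with the remark on the slide.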
11Secondary Structure Method Improvements
- Sliding window approach
- Most alpha helices are 12 residues long
- Most beta strands are 6 residues long
- Look at all windows of size 6/12
- Calculate a score for each window. If the score > threshold,
predict that this window is an alpha helix/beta sheet
TGTAGPQLKCHIQWMLPLKK
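The sliding window idea can be sketched in a few lines of Python. This is only a simplified illustration: the score is assumed to be the average helix propensity P(a) of the window, the threshold of 100 is an assumed value, and the function names are hypothetical. The full Chou and Fasman procedure has additional rules that are not shown here.

```python
# Helix propensities P(a) from the Chou and Fasman table (one-letter codes).
P_ALPHA = {
    "A": 142, "R": 98, "D": 101, "N": 67, "C": 70, "E": 151, "Q": 111,
    "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
    "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106,
}


def window_scores(sequence, table, size):
    """Average propensity of every window of `size` consecutive residues."""
    return [
        sum(table[res] for res in sequence[i:i + size]) / size
        for i in range(len(sequence) - size + 1)
    ]


def predict(sequence, table, size, threshold=100):
    """Mark every residue covered by a window scoring above `threshold`."""
    marks = [False] * len(sequence)
    for i, score in enumerate(window_scores(sequence, table, size)):
        if score > threshold:  # assumed decision rule
            for j in range(i, i + size):
                marks[j] = True
    return marks


seq = "TGTAGPQLKCHIQWMLPLKK"
helix = predict(seq, P_ALPHA, size=12)  # window of 12 for alpha helices
print(seq)
print("".join("H" if m else "-" for m in helix))
```

The same two functions can be reused for beta strands by passing a P(b) table and a window size of 6.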
12Improvements since the 1980s
- Adding information from conservation in MSA
- Smarter algorithms (e.g. Machine learning, HMM).
13HMM (Hidden Markov Model) approach for
predicting Secondary Structure
- HMM enables us to calculate the probability of
assigning a sequence to a secondary structure
TGTAGPQLKCHIQWML
HHHHHHHLLLLBBBBB
p = ?
14[Figure: HMM probability tables]
- Start probability: beginning with an α-helix
- Emission probability: the probability of observing Alanine as part of a
β-sheet
- Transition probability: α-helix followed by α-helix
- The probability of observing a residue which
belongs to an α-helix followed by a residue
belonging to a turn = 0.15
The tables are built from a large database of known
secondary structures
15- Example
- What is the probability that the sequence TGQ
will be in a helical structure?
TGQ
HHH
p = 0.45 x 0.041 x 0.8 x 0.028 x 0.8 x 0.0635
= 0.000020995 (0.0020995%)
Success rate of HMM-based methods: 75-80%
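The product in the example can be reproduced with a short Python sketch. The six numbers are taken from the slide; reading them as one start probability, one helix-to-helix transition probability and three emission probabilities is an assumption based on the HMM tables described earlier, and the variable and function names are hypothetical.

```python
# Assumed interpretation of the six factors in the worked example.
START_H = 0.45                                   # beginning with a helix
TRANS_HH = 0.8                                   # helix followed by helix
EMIT_H = {"T": 0.041, "G": 0.028, "Q": 0.0635}   # residue observed in a helix


def helix_path_probability(sequence):
    """Probability of the sequence along an all-helix (HHH...) path."""
    p = START_H
    for i, res in enumerate(sequence):
        if i > 0:
            p *= TRANS_HH  # stay in the helix state
        p *= EMIT_H[res]   # emit this residue from the helix state
    return p


p = helix_path_probability("TGQ")
print(f"p = {p:.4e}  ({p * 100:.7f}%)")
```

Comparing this value with the probabilities of other state paths (for example LLL or BBB) is what lets the HMM pick the most likely secondary structure assignment.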
16- What can we learn from secondary structure
predictions?
17Mad Cow Disease: PrPc to PrPsc
[Figure: structures of PrPc and PrPsc]
18How does the protein structure relate to the
primary protein sequence?
19SEQUENCE
- Early experiments have shown that the sequence
of the protein is sufficient to determine its
structure (Anfinsen)
- Protein structure is more
conserved than protein sequence and more closely
related to function.
20How (Can) Different Amino Acid Sequences Determine
Similar Protein Structure?
Lesk and Chothia 1980
21The Globin Family
22Different sequences can result in similar
structures
1ecd
2hhd
23- We can learn about the important features
which determine structure and function by
comparing the sequences and structures
24The Globin Family
25Why is Proline 36 conserved throughout the globin
family?
26Where are the gaps?
The gaps in the pairwise alignment map to
the loop regions
27How are remote homologs related in terms of their
structure?
RBP
β-lactoglobulin
28PSI-BLAST alignment of RBP and β-lactoglobulin,
iteration 3
Score = 159 bits (404), Expect = 1e-38
Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)

Query   3  WVWALLLLAAWAAAERD--------CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ  54
            V  L  LA  A                 S  V ENFD     G WY   K
Sbjct   1  MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG  59

Query  55  DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ  114
             I A  S  E G      K          V             PAK
Sbjct  60  NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL-----  112

Query 115  KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA  164
                 WI  TDY  YA  YSC                  R P  LPPE
Sbjct 113  MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET  159
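The identity and gap counts reported in the alignment header can be checked directly from the two aligned sequences. The Python sketch below is only an illustration: the function name is hypothetical, the three alignment blocks are concatenated into one string per sequence, and the "Positives" count is not reproduced because it depends on a substitution matrix.

```python
# Aligned sequences from the PSI-BLAST output above, blocks concatenated.
query = (
    "WVWALLLLAAWAAAERD--------CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ"
    "DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ"
    "KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA"
)
sbjct = (
    "MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG"
    "NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL-----"
    "MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET"
)


def alignment_stats(a, b):
    """Return (alignment length, identical positions, gapped positions)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    identities = sum(x == y and x != "-" for x, y in zip(a, b))
    gaps = sum(x == "-" or y == "-" for x, y in zip(a, b))
    return len(a), identities, gaps


length, identities, gaps = alignment_stats(query, sbjct)
print(f"Identities {identities}/{length} ({identities / length:.0%})")
print(f"Gaps {gaps}/{length} ({gaps / length:.0%})")
```

An identity of about 24% is low, which is why these two proteins are treated as remote homologs.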
29The Retinol Binding Protein
β-lactoglobulin
30Taken together
MERFGYTRAANCEAP.
FUNCTION
31Pfam
- Database that contains a large collection of
multiple sequence alignments of protein families
(common structures)
- Very useful for function prediction
http://pfam.sanger.ac.uk/
32The zinc-finger family (domain)
Known family of Transcription Factors
Protein sequence
ZINC FINGER DOMAIN
33Pfam
Based on profile hidden Markov models (HMMs),
which represent the protein family.
In comparison to a PSSM, an HMM is a model which considers
dependencies between the different columns in the
matrix (different residues) and is thus much more
powerful.
http://pfam.sanger.ac.uk/
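34Profile HMM (Hidden Markov Model) can accurately
represent a MSA
[Figure: profile HMM for alignment columns 16-19, with delete states D16-D19, match states M16-M19 and insert states I16-I19 (each emitting any residue, X); transitions are labeled 100% or 50%]
MSA columns 16 17 18 19:
D R T R
D R T S
S - - S
S P T R
D R T R
D P T S
D - - S
D - - S
D - - S
D - - R
Match state emission probabilities:
M16: D 0.8, S 0.2
M17: P 0.4, R 0.6
M18: T 1.0
M19: R 0.4, S 0.6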
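The match state emission probabilities on this slide can be recovered from alignment columns 16-19 by counting residues per column. The Python sketch below shows that counting step only; the function names are hypothetical, and real profile HMM builders (such as those behind Pfam) also use pseudocounts and sequence weighting, which are left out here.

```python
from collections import Counter

# Alignment columns 16-19 from the slide, one string per sequence.
msa = ["DRTR", "DRTS", "S--S", "SPTR", "DRTR",
       "DPTS", "D--S", "D--S", "D--S", "D--R"]


def match_emissions(alignment):
    """Residue frequencies per column, ignoring gap characters."""
    emissions = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        emissions.append({res: n / total for res, n in counts.items()})
    return emissions


def gap_fraction(alignment):
    """Fraction of sequences with a gap (deletion) in each column."""
    return [column.count("-") / len(column) for column in zip(*alignment)]


for number, probs in zip(range(16, 20), match_emissions(msa)):
    print(f"M{number}:", probs)
print("gap fraction:", gap_fraction(msa))
```

Half of the sequences have a gap in columns 17 and 18, which is consistent with the 50% transition labels in the figure.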
35Extra Slides (for your interest)
36Alpha Helix Pauling (1951)
- A consecutive stretch of 5-40 amino acids
(average 10).
- A right-handed spiral conformation.
- 3.6 amino acids per turn.
- Stabilized by hydrogen bonds.
3.6 residues = 5.4 Å
37Beta Strand Pauling and Corey (1951)
β-strand
- An extended polypeptide chain is called a
β-strand (consists of 5-10 amino acids).
- The chains are connected together by hydrogen
bonds to form a β-sheet.
β-sheet
38Loops
- Connect the secondary structure elements (alpha
helices and beta strands).
- Have various lengths and shapes.