Title: Protein folding
1 Protein folding: a bioinformatics perspective.
- Ugo Bastolla,
- Red Nacional de Bioinformática and
- Centro de Astrobiología (CSIC-INTA)
- Universidad Politécnica de Madrid, 14 January 2003
2 Proteins as interdisciplinary molecules
- Proteins are evolving molecular machines, at the border between Physics and Biology.
- They are molecular machines that obey the laws of statistical mechanics.
- They are evolving machines, produced through the action of mutation and natural selection.
- Bioinformatics integrates both sources of information to predict biological properties.
Thermodynamics sheds light on protein evolution, and evolutionary considerations shed light on protein folding.
3 Proteins are polymers formed by 20 amino-acid types linked by peptide bonds.
Soft degrees of freedom: the phi-psi torsion angles.
4 Torsion angles cluster at values corresponding to regular local structure (secondary structure), stabilized by hydrogen bonds.
5 Hierarchical organization of protein structure
6 Many proteins (e.g. antibodies) are formed by several, almost independently folding units called domains.
7 Protein Folding
AMTYHLDVVSAEQQMFSGLVEKIQVT..
Most proteins fold spontaneously into a well-defined three-dimensional conformation, the Native State. It is believed that the Native State is the state of minimal free energy available to the protein plus solvent system. This depends on the state of the solvent, for example on temperature and pH.
8 Statistical mechanics of protein folding
A chain of N residues has an exponentially large number of conformations, of order $e^{aN}$.
Boltzmann distribution in configuration space: $\mathrm{Prob}(C) \propto \exp(-E(C)/k_B T)$, where $k_B$ is the Boltzmann constant and T is the absolute temperature.
The effective free energy depends on the state of the solvent (averaged out) through temperature, pH, and the presence of denaturants.
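The Boltzmann weighting above can be illustrated numerically. This is a minimal sketch, assuming a handful of conformations with made-up effective energies; the function name `boltzmann_probabilities` and the energy values are hypothetical and not part of the original slides.

```python
import math

def boltzmann_probabilities(energies, kT=1.0):
    """Return Prob(C) proportional to exp(-E(C)/kT), normalized to sum to 1."""
    # Shift by the minimum energy so the exponentials cannot overflow.
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical effective energies of four conformations (arbitrary units).
energies = [-10.0, -8.0, -7.5, -3.0]
for kT in (0.5, 1.0, 5.0):
    probs = boltzmann_probabilities(energies, kT)
    print(kT, [round(p, 3) for p in probs])
```

Lowering `kT` concentrates the probability on the lowest-energy conformation, while raising it spreads the probability over many conformations, which is the qualitative picture of folding and unfolding.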
9 Lattice models of protein folding
- Exponentially large number of conformations, explored with Monte Carlo simulations.
- Well designed sequences fold fast to the lowest free energy state, which is stable thermodynamically and against mutations. They have a well correlated energy landscape.
- Qualitative features are reproduced, but no experimental comparison is possible.
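To make the size of the conformation space concrete, the sketch below enumerates all self-avoiding walks of a very short chain on the square lattice and scores them with a toy HP-type contact energy. The HP energy, the sequence and the function names are illustrative assumptions, not the model used in the slides, and exhaustive enumeration replaces Monte Carlo sampling only because the chain is tiny.

```python
def enumerate_walks(n_residues):
    """Yield every self-avoiding walk of n_residues sites on the square lattice."""
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))

    def extend(path, occupied):
        if len(path) == n_residues:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in moves:
            site = (x + dx, y + dy)
            if site not in occupied:
                path.append(site)
                occupied.add(site)
                yield from extend(path, occupied)
                occupied.remove(site)
                path.pop()

    yield from extend([(0, 0)], {(0, 0)})


def hp_energy(path, sequence):
    """Toy energy: -1 for each pair of non-bonded H residues on adjacent sites."""
    energy = 0
    for i in range(len(path)):
        for j in range(i + 2, len(path)):
            dist = abs(path[i][0] - path[j][0]) + abs(path[i][1] - path[j][1])
            if dist == 1 and sequence[i] == "H" and sequence[j] == "H":
                energy -= 1
    return energy


sequence = "HPHPPH"  # hypothetical 6-residue sequence
walks = list(enumerate_walks(len(sequence)))
best = min(walks, key=lambda w: hp_energy(w, sequence))
print(len(walks), hp_energy(best, sequence))
```

The count includes conformations related by rotation and reflection. It grows roughly geometrically with chain length, which is why longer chains require Monte Carlo sampling.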
10 The normalized energy gap $\alpha$ gives a quantitative measure of energy landscape correlations:
$\frac{E(C) - E(C_0)}{|E(C_0)|} > \alpha \, (1 - q(C, C_0))$
Random sequences: slow folding, low stability, small $\alpha$.
Designed sequences: fast folding, high stability, large $\alpha$.
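As a rough illustration, the normalized energy gap can be estimated from a list of alternative conformations. The sketch assumes that the gap is taken as the largest value for which the inequality holds for every alternative, and that the energy of the reference state is negative. The function name and the numbers are hypothetical.

```python
def normalized_energy_gap(e_native, alternatives):
    """Largest gap value such that (E - E0)/|E0| >= gap * (1 - q) for all alternatives.

    alternatives: iterable of (energy, overlap_with_native) pairs.
    """
    ratios = []
    for energy, q in alternatives:
        if q >= 1.0:
            continue  # same contact map as the native state: no constraint
        ratios.append((energy - e_native) / (abs(e_native) * (1.0 - q)))
    return min(ratios)


# Hypothetical landscape: (energy, overlap with the native state).
alternatives = [(-6.0, 0.2), (-8.0, 0.5), (-3.0, 0.1)]
print(normalized_energy_gap(-10.0, alternatives))
```

The conformation that limits the gap is the one with the lowest energy relative to its dissimilarity from the native state, which is why a single low-energy misfolded state is enough to make the gap small.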
11 Molecular Dynamics
- Model all atoms in the protein.
- Solvent either explicit or implicit.
- Molecular dynamics simulations: $d^2 x_i / dt^2 = F_i(x_1 \ldots x_N)$, $t \sim 10^{-12}$ sec.
- Force field ideally from first principles (but simplifications are needed!). Examples: CHARMM, AMBER.
- Very useful to model the functioning of an enzyme, but useless for folding prediction: time scales are too long, simulations can be trapped in energy minima, and it is not even clear whether the model is accurate enough.
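The equations of motion are integrated numerically in small time steps. The sketch below uses the velocity Verlet scheme on a single coordinate with a harmonic force. The integrator choice, the explicit mass parameter and the spring force are illustrative assumptions and are far simpler than a force field such as CHARMM or AMBER.

```python
import math

def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate mass * d2x/dt2 = force(x) for one coordinate; return final (x, v)."""
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
    return x, v


K_SPRING = 1.0  # hypothetical spring constant

def spring_force(x):
    return -K_SPRING * x

def total_energy(x, v, mass=1.0):
    return 0.5 * mass * v * v + 0.5 * K_SPRING * x * x


x, v = velocity_verlet(1.0, 0.0, spring_force, mass=1.0, dt=0.01, n_steps=10000)
print(total_energy(1.0, 0.0), total_energy(x, v), x, math.cos(100.0))
```

The total energy stays close to its initial value over many steps, which is the property that makes this family of integrators popular. The time step must remain much shorter than the fastest motion, which is one reason why reaching folding time scales is so expensive.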
12 The holy grail of protein folding
Develop a model simple enough to allow computation, yet realistic enough to be comparable with experiments. The only a priori reliable model needs quantum interactions (for instance, interactions between aromatic amino acids) and all atoms of the solvent (PROBLEM!). The simplest models have 2N torsion-angle degrees of freedom. The number of possible conformations is $O(e^{2N})$, astronomically large even for a fairly short chain, so it is impossible to compute all of them.
13 Homology modelling
- Just use biology, not physics! (homology = common origin)
- Proteins with more than 25% sequence similarity always have very similar structure, because structure is very conserved in evolution.
- Align query and template.
- Build the backbone from the aligned template.
- Build non-aligned regions (loops).
- Build side chains.
The more similar the sequences, the more similar the structures and the better the model.
14 Homology models tend to be rather good in conserved regions, but they are poor in more variable regions (loops). They are only reliable if sequence similarity is above 25% (this threshold has been decreased due to better alignment techniques), whereas most protein pairs have lower similarity.
15 The bioinformatic approach: look at known protein structures
- Related proteins with the same fold typically have low sequence similarity. Their similarity can hardly be recognized by aligning the sequences alone.
- Score how suitable a structure template is for a query sequence (effective energy function).
- Recognize the known structure which best fits an unknown sequence.
- There is no physical derivation for the scoring scheme, but thermodynamic estimates are sometimes possible.
16 Reduced representation of proteins
We represent protein structures as contact maps: $C_{ij} = 1$ if residues i and j are in contact, 0 otherwise.
Similarity is measured as the fraction of common contacts, or overlap:
$q(C, C') = \frac{\sum_{ij} C_{ij} C'_{ij}}{\max\left(\sum_{ij} C_{ij}, \sum_{ij} C'_{ij}\right)}$
Alternative structures of sequence $A = A_1 \ldots A_N$ are generated by aligning A without gaps with all structures in the PDB (gapless threading).
The energy is assumed to be of the form $E(C, A)/k_B T = \sum_{ij} C_{ij} U(A_i, A_j)$, depending on 210 parameters $U(a, b)$.
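The three ingredients of this slide (overlap, contact energy, gapless threading) can be sketched in a few lines. Contact maps are stored here as sets of residue pairs (i, j) with i < j, and the interaction parameters as a dictionary keyed by unordered pairs of residue types. These data structures, the two-letter alphabet and the parameter values are illustrative assumptions, not the 210 optimized parameters of the slides.

```python
def overlap(contacts_a, contacts_b):
    """q(C, C'): number of common contacts over the larger number of contacts."""
    if not contacts_a and not contacts_b:
        return 1.0
    return len(contacts_a & contacts_b) / max(len(contacts_a), len(contacts_b))


def contact_energy(contacts, sequence, u):
    """E(C, A)/kT: sum of U(A_i, A_j) over all contacts (i, j)."""
    return sum(u[frozenset((sequence[i], sequence[j]))] for i, j in contacts)


def gapless_threading(n_residues, long_contacts, long_length):
    """Yield the contact set of every window of n_residues in a longer structure."""
    for offset in range(long_length - n_residues + 1):
        end = offset + n_residues
        yield {(i - offset, j - offset) for i, j in long_contacts
               if offset <= i < end and offset <= j < end}


# Hypothetical two-letter alphabet: U(H,H), U(H,P), U(P,P).
u = {frozenset("H"): -1.0, frozenset("HP"): 0.0, frozenset("P"): 0.5}
sequence = "HPPH"
native = {(0, 3)}
template = {(0, 3), (1, 4), (2, 5), (0, 5)}   # contacts of a 6-residue structure

for decoy in gapless_threading(len(sequence), template, 6):
    print(sorted(decoy), contact_energy(decoy, sequence, u), overlap(decoy, native))
```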
17 Effective energy for simplified protein models
We have optimized the parameters of a contact energy function such that the Native State has the lowest energy and the energy landscape is well correlated for most independent proteins in the Protein Data Bank.
Our optimization method is based on the maximization of the Boltzmann average of the similarity with the native state:
$Q(A) = \sum_C \mathrm{Prob}(C) \, q(C, C_{nat})$, with $\mathrm{Prob}(C) \propto \exp(-E(C, A)/k_B T)$
When this parameter is maximal ($Q \approx 1$), the native state has the lowest energy and dissimilar states have high energy (the energy landscape is well correlated). This can be achieved for nearly all proteins in the PDB.
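The Boltzmann average of the similarity with the native state can be computed directly once energies and overlaps of the alternative conformations are known. The following sketch compares two made-up landscapes. The function name and all numbers are hypothetical, and the actual optimization of the energy parameters is not shown.

```python
import math

def native_similarity_average(energies, overlaps, kT=1.0):
    """Boltzmann average of q(C, C_nat) with weights exp(-E(C)/kT)."""
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    z = sum(weights)
    return sum(w * q for w, q in zip(weights, overlaps)) / z


overlaps = [1.0, 0.6, 0.3, 0.1]            # first entry is the native state
correlated = [-20.0, -12.0, -6.0, -2.0]    # energy rises as the overlap falls
rugged = [-20.0, -12.0, -6.0, -19.5]       # a dissimilar state competes with the native one

print(native_similarity_average(correlated, overlaps))
print(native_similarity_average(rugged, overlaps))
```

In the second landscape a low-energy state that is very different from the native one pulls the average well below 1, which is exactly what the optimization is meant to penalize.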
18 Effective energy applied to crystal structures
Prediction of unfolding free energies using crystal structures and the effective energy function:
$\Delta G / N k_B T = E_{nat} / N k_B T - s$
The Native States have the lowest energy and the energy landscapes are well correlated. The resulting normalized energy gap $\alpha$ (0.2-0.8) is much higher than for random sequences (< 0.1) and increases with chain length.
19 The main contribution to the energy parameters comes from hydrophobicity.
20 A facility for protein structure prediction at CAB (http://www.cab.inta.es/CAFASP/)
The PROTFINDER algorithm looks for the structure in the PDB database that best aligns (with gaps) to the query sequence. It took part in CAFASP4. It is available through a web server developed and maintained by Alain Lepinette of CAB.
21 Scoring function
For a sequence-structure alignment a(i):
$\mathrm{Score} = -\sum_{ij} C_{a(i) a(j)} U(A_i, A_j) - S_0 L_{ali} - G_0 N_{gaps} - G_1 L_{gaps}$
The terms are the contact free energy, the configurational entropy loss $S_0$ for each aligned residue, and the gap penalties $G_0$ (create) and $G_1$ (extend).
Sequence homology information is not used. A semi-deterministic algorithm is used to generate candidate alignments.
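The scoring function can be evaluated for a given alignment as sketched below. The alignment is represented as a dictionary from query positions to structure positions, assumed to be increasing. The way gaps are counted (one gap for every run of skipped query or structure positions between consecutive aligned residues) is an assumption made for illustration, as are the function name and the parameter values. This is not the PROTFINDER implementation.

```python
def alignment_score(a, sequence, contacts, u, s0, g0, g1):
    """Contact energy term minus entropy and gap penalties for the alignment a."""
    aligned = sorted(a)  # aligned query positions, in order
    energy = 0.0
    for x, i in enumerate(aligned):
        for j in aligned[x + 1:]:
            if (a[i], a[j]) in contacts:
                energy += u[frozenset((sequence[i], sequence[j]))]
    n_gaps = 0
    l_gaps = 0
    for i, j in zip(aligned, aligned[1:]):
        # Positions skipped in the query and in the structure between two aligned residues.
        for skipped in (j - i - 1, a[j] - a[i] - 1):
            if skipped > 0:
                n_gaps += 1
                l_gaps += skipped
    return -energy - s0 * len(aligned) - g0 * n_gaps - g1 * l_gaps


# Hypothetical parameters and a 4-residue example.
u = {frozenset("H"): -1.0, frozenset("HP"): 0.0, frozenset("P"): 0.5}
contacts = {(0, 3)}   # contacts of the template structure, as (p, q) with p < q
full = {0: 0, 1: 1, 2: 2, 3: 3}
partial = {0: 0, 3: 3}
print(alignment_score(full, "HPPH", contacts, u, s0=0.1, g0=0.5, g1=0.1))
print(alignment_score(partial, "HPPH", contacts, u, s0=0.1, g0=0.5, g1=0.1))
```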
22 Fold recognition
The ability to predict protein structures depends crucially on the similarity $q_{max}$ of the most similar structure available in the database.
- Very similar structure present in the database: it is correctly selected on the basis of the energy.
- No structure above a threshold of similarity: almost random prediction.
The high similarity needed is frequent in proteins of detectable homology, but not in very distant homologs.
23 Sequence-structure alignments obtained through
ProtFinder are very similar to those found in
databases of protein alignments (PFAM).
24 The CASP experiment evaluates protein structure
prediction methods.
25 Stability of orthologous proteins
Related proteins with the same fold have
typically low sequence similarity. What are the
common features of their sequences? How similar
are their thermodynamic properties?
26 With our tools we can compare thermodynamic properties of homologous proteins. We estimate two key parameters: the folding free energy $\Delta G$ and the normalized energy gap $\alpha$. We apply our energy function to families of orthologous proteins, predicting their Native Structure. In all cases, this coincides with the structure of the closest analog in the PDB, even though our algorithm does not use the information on sequence similarity.
27 List of organisms
- Free-living: B. subtilis, B. anthracis, C. crescentus, D. radiodurans, E. coli, E. acidophylus, H. influenzae, L. lactis, L. monocytogenes, L. innocua, M. tuberculosis, M. smegmatis, N. meningitidis, P. multocida, P. aeruginosa, P. putida, R. loti, R. meliloti, S. typhimurium, S. aureus, S. pyogenes, S. coelicolor, Synechococcus, T. pallidum, V. cholerae, X. fastidiosa, Z. mobilis
- Intracellular: B. burgdorferi, B. aphidicola (APS, BPS, SGR), C. jejuni, C. pneumoniae, C. trachomatis, H. pylori, M. capricolum, M. genitalium, M. pneumoniae, M. leprae, R. prowazekii, U. parvum, Y. pestis, W. glossinidia, Wolbachia sp.
- Thermophiles: A. aeolicus, B. stearothermophilus, T. maritima, T. aquaticus
- Archaea: A. pernix, A. fulgidus, M. jannaschii, M. thermoautotrophicum, P. furiosus
List of genes
- ATPE, ACKA, AROQ, COAD, DDL, DUT, EFTS, FLAV, FOLA, FTSJ, PDF, PTH, PTHP, RL14, RNH, RNPA, TRXA, TRXB, TPIS, TRPA, DNAK
28 Protein folding thermodynamics depends on hydrophobicity. More hydrophobic sequences have a more negative folding free energy (they are more stable against unfolding), but they have a lower energy gap (they are less stable against misfolding). Evolution has to look for a compromise between these properties! (Frustration)
29 Folding efficiency (normalized energy gap) is
correlated with genome size. Smaller genomes,
such as those of intracellular bacteria, have
reduced folding efficiency. Possible misfolding
problems are consistent with observed high
expression of chaperones in these bacteria.
30 Intracellular bacteria
The genomes of obligate intracellular organisms
(organelles, endosymbionts, parasites) share
important common features
- Very small genomes
- High AT content
- High hydrophobicity
- Reduced population size
- Reduced folding ability of proteins
These features can be explained from the point of
view of evolutionary theory
31 Our results show that the normalized energy gap $\alpha$ is smaller for intracellular bacteria than for free-living bacteria. This fact can be explained (a) because intracellular genomes have a mutation bias towards AT, hence express more hydrophobic proteins, and (b) because of the weaker selection experienced by intracellular bacteria due to their small populations. A smaller folding parameter implies that misfolding occurs much more often. This can lead to protein aggregation, which is very dangerous for cellular processes. To avoid aggregation, these bacteria express very high amounts of chaperones, proteins in charge of helping protein folding. The chaperone DNAK appears more stable in organisms with smaller genomes.
32 What do sequences with the same fold have in common?
Spectral decomposition of the interaction matrix: $E = \sum_{ik} C_{ik} U(A_i, A_k) \approx \sum_{ik} C_{ik} h(A_i) h(A_k)$
Sequences with the same fold have a similar Hydrophobicity Vector $h(A_i)$ (HV). The HV has a large correlation r(h, c) with the Principal Eigenvector (PE) of the contact matrix $C_{ij}$.
Therefore, sequences with the same fold have a common hydrophobic fingerprint that coincides with the PE of the contact matrix. The evolutionary average HV correlates with the PE much more strongly than the HV of a single sequence.
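The correlation between a hydrophobicity profile and the principal eigenvector of a contact matrix can be computed as in the sketch below. Power iteration is used here only for simplicity. The small contact matrix and the hydrophobicity values are hypothetical, and the function names are not taken from the slides.

```python
import math

def principal_eigenvector(contact_matrix, n_iter=500):
    """Power iteration for the eigenvector of the largest eigenvalue of a contact matrix."""
    n = len(contact_matrix)
    v = [1.0] * n
    for _ in range(n_iter):
        # Multiply by (C + I): the shift leaves the eigenvectors unchanged but avoids
        # oscillations when C has eigenvalues of equal size and opposite sign.
        w = [v[i] + sum(contact_matrix[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical contact map of a 5-residue chain in which residue 2 is the most buried.
contacts = [(0, 2), (1, 2), (2, 3), (2, 4), (0, 4)]
n = 5
C = [[0] * n for _ in range(n)]
for i, k in contacts:
    C[i][k] = C[k][i] = 1

pe = principal_eigenvector(C)
h = [0.4, 0.1, 0.9, 0.1, 0.4]   # hypothetical hydrophobicity values
print([round(x, 3) for x in pe], round(pearson(h, pe), 3))
```

Residues with many contacts, typically buried ones, get large PE components, so a sequence whose hydrophobic residues sit at those positions gives a high correlation.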
33 Bioinformatics
- Biological information is accumulating at a very fast pace.
- Need to classify this information for storage and retrieval (one could say that biology is the art of classifying!).
- Protein structures: decomposition, structural classification, hidden evolutionary relationships.
- Biological sequences: identification of protein sequences (genes), classification, structure and function prediction.
- Molecular interactions: reconstruction of metabolic networks and cellular regulatory networks (systems biology).
- Organisms: evolutionary classification (phylogeny).
- Biological literature: classification and retrieval.
34 Proteins are made of modules (domains) that are duplicated and combined in many possible ways to create ever new molecules.
36 The Protein Data Bank (PDB) contains roughly 24000 protein structures, determined either by X-ray crystallography or by NMR spectroscopy. Fewer than 4000 are different folds. The number of new folds (blue bar) is decreasing each year. Other classification schemes yield fewer than 1000 different folds. Evolution uses a reduced number of folds for a large number of biological functions.
37 CATH structural classification: 813 folds (Topology level) (Thornton, Orengo)
38 SCOP, Structural Classification of Proteins: 800 folds (Chothia, Murzin)
39 DALI: algorithm and server for automatic classification of protein structures (Holm and Sander).
It aligns protein structures by minimizing the dissimilarity score
$S = \sum_{ik} \frac{|r^a_{ik} - r^b_{ik}|}{r^a_{ik} + r^b_{ik}} \exp\left(-\frac{(r^a_{ik} - r^b_{ik})^2}{4 r_0^2}\right)$, with $r_0 = 20$ Å.
The sum runs over C-alpha atoms i, k. It generates the FSSP database of structurally similar proteins (S much smaller than for random pairs of structures, Z-score criterion).
40
- For each new structure:
  - Store it in the PDB with the proper format.
  - Decompose it into domains.
  - Classify the domains and discover new evolutionary relationships.
- For each new sequence:
  - Find the gene sequences in the genome (easy for prokaryotes, very difficult for eukaryotes because genes are interrupted by introns).
  - Find homologous domains, infer structure and function.
  - Decide whether structure determination is worthwhile.
41 Protein databases
- GenBank: protein sequences (not annotated), from genome projects.
- SwissProt: annotated protein sequences. Domain organization, structure, function and active site may be known from homology.
- Protein Data Bank (PDB): protein structures.
42 Sequence Alignment
43 Alignment is the main tool in Bioinformatics. It is justified by the fact that aligned elements have a common evolutionary origin (homology).
In the course of evolution, amino acids or nucleotides can be conserved, substituted (usually with minimal modification of the Native State), inserted or deleted. The last two processes generate gaps in the alignment.
The score for an alignment a(i) between two sequences $A^1_i$, $A^2_k$ is
$\mathrm{Score} = \sum_i S(A^1_i, A^2_{a(i)}) - G_0 N_{gaps} - G_1 L_{gaps}$
The 20 × 20 matrix S(a,b) is called the substitution matrix and is determined from aligned protein families. The most used are the BLOSUM62 and PAM250 matrices. $G_0$ is the gap opening penalty and $G_1$ is the gap extension penalty.
The number of possible alignments grows exponentially with sequence length, but the optimal alignment can be found exactly with an $O(L^3)$ algorithm using dynamic programming (Needleman-Wunsch, Smith-Waterman).
The optimal solution is often, but not always, the biologically relevant one. The gap parameters and the substitution matrix used are crucial! One has to check the statistical significance.
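The dynamic programming idea can be shown with a compact global alignment in the style of Needleman-Wunsch. To stay short, this sketch replaces the substitution matrix by a simple match/mismatch score and uses a single linear gap penalty instead of separate opening and extension penalties. Both simplifications, and the default values, are assumptions. With a linear gap penalty the table is filled in time proportional to the product of the two sequence lengths.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned1, aligned2) for one optimal global alignment."""
    n, m = len(seq1), len(seq2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align the two residues
                              score[i - 1][j] + gap,     # gap in seq2
                              score[i][j - 1] + gap)     # gap in seq1
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out1, out2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out1.append(seq1[i - 1])
            out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out1.append(seq1[i - 1])
            out2.append("-")
            i -= 1
        else:
            out1.append("-")
            out2.append(seq2[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out1)), "".join(reversed(out2))


best_score, top, bottom = needleman_wunsch("GATTACA", "GCATGCG")
print(best_score)
print(top)
print(bottom)
```

Changing `gap` or the match/mismatch values can change which alignment is optimal, which illustrates why the choice of parameters is so important.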
44 Multiple Sequence Alignments
- Multiple alignment of M sequences is an NP-hard problem: no algorithm polynomial in M is thought to exist. In fact, once the first two sequences have been aligned, the score for the next one has been modified!
- The most used solution is implemented in the algorithm CLUSTALW. It consists in aligning the easy pairs first:
  - Align all pairs of sequences with a fast algorithm.
  - Build a tree of their relationships.
  - Start by accurately aligning the two most closely related sequences (easiest). Represent both of them with a single profile.
  - Iterate, looking again for the two most closely related sequences or profiles.
45 Database search
- Often, we do not need accurate alignments but just a list of database entries that are evolutionarily related to our query sequence. The most used algorithms for this purpose are BLAST and FASTA.
- BLAST compares the query sequence to all sequences in a database like SwissProt or GenBank in a few seconds. For each pair of sequences, it finds all exact matches of length k, extends and combines them, and reports the probability (P-value) that such matches are found by chance.
- PSI-BLAST is an iterative procedure based on BLAST:
  - Find all sequences significantly related to the query.
  - Construct a profile (amino acid distribution per site) from the multiple alignment.
  - Iterate the search using the profile as query.
- In this way, very distant evolutionary relationships can be retrieved confidently. This method is very useful for protein structure prediction.
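The first stage described above, finding exact matches of length k, can be sketched with a simple word index. This covers only the seeding step in the simplified form given in the slide. Extension, combination of hits and the statistical evaluation are not shown, and the function name and example sequences are hypothetical.

```python
from collections import defaultdict

def exact_kmer_matches(query, subject, k=3):
    """Return sorted (query_pos, subject_pos) pairs where words of length k are identical."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)      # index every word of the query
    hits = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            hits.append((i, j))              # same word found in the subject
    return sorted(hits)


query = "MKVLAAGIVG"       # hypothetical sequences
subject = "TTKVLAAGPPIVG"
print(exact_kmer_matches(query, subject, k=3))
```

Hits that lie on the same diagonal (constant difference between the two positions) are natural candidates to be extended and combined into a longer local alignment.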
46 Phylogenetic trees
Evolving species can be placed on the leaves of a phylogenetic tree. The time elapsed since the last common ancestor of species A and B, d(A,B), is a distance that allows classification. This is based on the ultrametric property: in every triangle, the two longest sides are equal. Phylogenetic trees were once built by comparing external characters, but now they are built using macromolecules such as proteins, RNA and DNA.
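The ultrametric property can be checked directly on a distance table by testing every triangle. In this sketch distances are stored in a dictionary keyed by unordered pairs of species labels. The representation, the tolerance and the example values are assumptions made for illustration.

```python
from itertools import combinations

def is_ultrametric(dist, tol=1e-9):
    """Check that in every triangle the two longest sides are equal.

    dist maps frozenset({a, b}) to the distance between a and b.
    """
    labels = sorted(set().union(*dist))
    for a, b, c in combinations(labels, 3):
        sides = sorted((dist[frozenset((a, b))],
                        dist[frozenset((a, c))],
                        dist[frozenset((b, c))]))
        if abs(sides[2] - sides[1]) > tol:
            return False
    return True


# Hypothetical divergence times for the tree ((A,B),C).
clock_like = {frozenset("AB"): 2.0, frozenset("AC"): 6.0, frozenset("BC"): 6.0}
distorted = {frozenset("AB"): 2.0, frozenset("AC"): 6.0, frozenset("BC"): 5.0}
print(is_ultrametric(clock_like), is_ultrametric(distorted))
```

Distances estimated from real sequences satisfy the property only approximately, so in practice a tolerance larger than the default would be needed.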
47 The molecular clock
- Empirical observation: the number of amino acid substitutions between two orthologous proteins (e.g. myoglobin) of two species A and B is linearly correlated with their divergence time t(A,B). Fluctuations of the number of substitutions are small.
- $K(A,B) = a \, t(A,B)$
- If the divergence time is not known, the number of substitutions can be used to estimate it. K(A,B) can be obtained from the number of mismatches in the sequence alignment, using some model of evolution to correct for multiple substitutions.
- Methods to generate phylogenetic trees range from deterministic clustering algorithms to optimization methods. The two most used are:
  - Neighbor Joining: join the two closest sequences, recalculate distances, iterate. Very fast but not very accurate.
  - Maximum Likelihood: for a model of sequence evolution (independent sites needed!), calculate the likelihood of the observed sequences given the parameters and the tree. Exhaustive search of the ML tree is impossible, but approximate algorithms give good results.
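The correction for multiple substitutions can be illustrated with the Poisson correction, one of the simplest models of evolution, chosen here as an assumption because the slides do not name a specific model. The function names and the rate value are hypothetical.

```python
import math

def poisson_corrected_distance(n_mismatches, n_aligned):
    """Estimated substitutions per site, K = -ln(1 - p), p = fraction of mismatches."""
    p = n_mismatches / n_aligned
    if p >= 1.0:
        raise ValueError("saturated: every aligned site differs")
    return -math.log(1.0 - p)


def divergence_time(k, rate):
    """Invert K = rate * t for t, given an assumed constant rate."""
    return k / rate


k = poisson_corrected_distance(30, 150)   # hypothetical alignment: 30 mismatches in 150 sites
print(k, divergence_time(k, rate=0.1))    # hypothetical rate, in substitutions per site per time unit
```

The corrected distance is always larger than the raw fraction of mismatches and diverges as that fraction approaches 1, which reflects the saturation problem in distance estimates.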
49 Tree of seven replication proteins found in all bacterial genomes (using the BLAST algorithm), obtained with the Neighbor-Joining method. The numbers represent bootstrap values (number of times, out of 1000, that the plotted branching is observed using a random subset of all aligned positions). Some groups (clades) can be confidently reconstructed, for instance Proteobacteria and Gram-positive bacteria, but some divergences are too ancient and no similarity signal is found in their proteins.
50 Some problems with phylogenetics
- The protein tree, which we reconstruct, does not always coincide with the species tree if there has been gene transfer between species (frequent in bacteria) or gene duplication prior to species separation (paralogous proteins).
- The molecular clock is known to hold for neutral evolution (when the properties of the protein do not change), but adaptations happen at a much faster rate. The substitution rate can also vary between branches because of different mutation rates or generation times. When the rate is too variable, the estimates of branch lengths and the reconstructed trees are not reliable.
- The number of substitutions K(A,B) can be reliably estimated from the number of mismatches only when it is not saturated.
- An indication of these problems is that different proteins usually give different tree topologies.
51 Some courses on the web
- http://www.pdg.cnb.uam.es/cursos/BioInfo2002/pages/index.html Doctoral course: Bioinformatics
- http://www.cryst.bbk.ac.uk/PPS2/index.html Principles of Protein Structure Using the Internet
- http://www.biochemtech.uni-halle.de/PPS2/projects/day/TDayDi The Source of Stability in Proteins
- http://www.fst.reading.ac.uk/courses/fs916/index.htm Protein Structure and Function
- http://www.cm.utexas.edu/academic/courses/Spring2002/CH339K/Robertus/
- http://www.oup.com/lesk/bioinf Site of the book Introduction to Bioinformatics, by A.M. Lesk (Oxford)
52 Main databases and resources
- http://www.ncbi.nlm.nih.gov/ National Center for Biotechnology Information: genomes, PubMed (literature), genes, proteins...
- http://www.bmn.com/ BioMedNet (Medline): biological literature
- http://www.tigr.org/ The Institute for Genomic Research
- http://www.ebi.ac.uk/swissprot/ Swiss-Prot: annotated proteins
- http://pfam.wustl.edu/ Pfam: aligned protein families
- http://gibk26.bse.kyutech.ac.jp/jouhou/jouhoubank.html BioInfo Bank: several databases
- http://www.rcsb.org/pdb/ Protein Data Bank: protein structures
- http://www.ebi.ac.uk/dali/ FSSP: alignment of protein domains
- http://www.biochem.ucl.ac.uk/bsm/cath/ CATH: classification of domains
- http://scop.mrc-lmb.cam.ac.uk/scop/ SCOP: classification of domains
- http://www.ebi.ac.uk/
- http://pqs.ebi.ac.uk/ Protein Quaternary Structure (interactions)
- http://BioInfo.PL/cafasp/ Servers for automatic protein structure prediction