Plegamiento de prote - PowerPoint PPT Presentation

Title: How evolution modifies the stability of proteins (Cómo la evolución modifica la estabilidad de las proteínas). Author: van Ham. Last modified by: van Ham. Created: 11/29/2001 10:45:02 AM.



1
Protein folding: a bioinformatics perspective
(Plegamiento de proteínas: una perspectiva bioinformática)
  • Ugo Bastolla,
  • Red Nacional de Bioinformática and
  • Centro de Astrobiología (CSIC-INTA)
  • Universidad Politécnica de Madrid, 14 January
    2003

2
Proteins as interdisciplinary molecules
  • Proteins are evolving molecular machines, at the
    border between Physics and Biology.
  • They are molecular machines that obey the laws
    of statistical mechanics.
  • They are evolving machines, produced through the
    action of mutation and natural selection.
  • Bioinformatics integrates both sources of
    information to predict biological properties.
    Thermodynamics sheds light on protein evolution,
    and evolutionary considerations shed light on
    protein folding.

3
Proteins are polymers formed by 20 amino-acid
types bound by peptide bonds.
Soft degrees of freedom: the phi-psi angles.
4
Torsion angles cluster at values corresponding to
regular local structure (secondary structure),
stabilized by hydrogen bonds.
5
Hierarchical organization of protein structure
6
Many proteins (e.g. antibodies) are formed by
several, almost independently folding units
called domains.
7
Protein Folding
AMTYHLDVVSAEQQMFSGLVEKIQVT..
Most proteins fold spontaneously into a
well-defined three-dimensional conformation, the
Native State. It is believed that the Native
State is the state of minimal free energy
available to the protein-plus-solvent system.
This depends on the state of the solvent, e.g. on
temperature and pH.
8
Statistical mechanics of protein folding
A chain of N residues has an exponentially large
number of conformations, of order $e^{aN}$.
Boltzmann distribution in configuration space:
$P(C) \propto \exp(-E(C)/k_B T)$
Here $k_B$ is the Boltzmann constant and T is the
absolute temperature. The effective free energy
depends on the state of the solvent (averaged out)
through temperature, pH, and the presence of
denaturants.
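The Boltzmann distribution above can be illustrated with a minimal sketch. The three energies are hypothetical toy values in units of kBT, and the function name is an assumption made for this example only.

```python
import math

def boltzmann_probabilities(energies, kT=1.0):
    """Return P(C) proportional to exp(-E(C)/kT) for a list of energies."""
    # Shift by the minimum energy before exponentiating (numerical stability);
    # the shift cancels out in the normalized probabilities.
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    z = sum(weights)  # partition function, up to the constant shift factor
    return [w / z for w in weights]

# Toy example: three hypothetical conformations (energies in units of kT).
probs = boltzmann_probabilities([-5.0, -3.0, 0.0])
print(probs)
```

Lowering the temperature (smaller kT) concentrates the probability on the lowest-energy conformation; raising it spreads the probability over all conformations.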
9
Lattice models of protein folding
  • Exponentially large number of conformations,
    Monte Carlo simulations.
  • Well-designed sequences fold fast to the lowest
    free energy state, which is stable
    thermodynamically and against mutations. They
    have a well-correlated energy landscape.
  • Qualitative features are reproduced, but no
    experimental comparison is possible.

10
The normalized energy gap $\alpha$ gives a
quantitative measure of energy landscape
correlations:
$\frac{E(C) - E(C_0)}{|E(C_0)|} > \alpha \, (1 - q(C, C_0))$
  • Random sequences: slow folding, low stability,
    small $\alpha$
  • Designed sequences: fast folding, high
    stability, large $\alpha$
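As a rough sketch, the normalized energy gap can be estimated from a set of alternative conformations. This assumes the gap is the largest constant for which the inequality on this slide holds for every alternative, and that the ground-state energy is negative; both are assumptions of this example, as are the function name and the toy numbers.

```python
def normalized_energy_gap(e_ground, alternatives):
    """Estimate the normalized energy gap.

    e_ground: energy of the ground state C0 (assumed negative).
    alternatives: list of (energy, overlap_with_C0) pairs, overlap < 1.
    """
    if e_ground >= 0:
        raise ValueError("this sketch assumes a negative ground-state energy")
    ratios = [
        (e - e_ground) / (abs(e_ground) * (1.0 - q))
        for e, q in alternatives
        if q < 1.0  # skip conformations identical to the ground state
    ]
    # The tightest alternative conformation sets the gap.
    return min(ratios)

# Hypothetical toy landscape: ground state at -10, two alternatives.
alpha = normalized_energy_gap(-10.0, [(-8.0, 0.5), (-5.0, 0.0)])
print(alpha)
```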
11
Molecular Dynamics
  • Model all atoms in the protein
  • Solvent either explicit or implicit
  • Molecular dynamics simulations:
    $d^2 x_i / dt^2 = F_i(x_1, \ldots, x_N)$,
    $t \sim 10^{-12}$ s.
  • Force field ideally from first principles (but
    simplifications are needed!), e.g. CHARMM, AMBER.
  • Very useful to model the functioning of an
    enzyme, but useless for folding prediction: time
    scales are too long, simulations can be trapped
    in energy minima, and it is not even clear
    whether the model is accurate enough.
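The equations of motion above are integrated numerically in small time steps. Below is a minimal sketch using the velocity Verlet scheme on a single particle in a harmonic well. The integrator choice, the toy force and all parameter values are assumptions for illustration, not the setup of CHARMM or AMBER.

```python
import math

def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate m * d2x/dt2 = F(x) for one coordinate with velocity Verlet."""
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
    return x, v

K_SPRING = 1.0  # hypothetical spring constant of the toy potential

def spring_force(x):
    return -K_SPRING * x

def total_energy(x, v, mass=1.0):
    return 0.5 * mass * v * v + 0.5 * K_SPRING * x * x

# Start at x = 1 with zero velocity and integrate up to t = 10.
x_end, v_end = velocity_verlet(1.0, 0.0, spring_force, mass=1.0,
                               dt=0.01, n_steps=1000)
x_exact = math.cos(10.0)  # analytic solution for this toy system
print(x_end, x_exact, total_energy(x_end, v_end))
```

In a protein simulation the same loop runs over all atomic coordinates, and the force comes from the force field rather than a single spring.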

12
The Holy Grail of protein folding
Develop a model simple enough to allow
computation, yet realistic enough to be
comparable with experiments. The only a priori
reliable model needs quantum interactions (for
instance, interactions between aromatic amino
acids) and all atoms of the solvent
(PROBLEM!). The simplest models have 2N
torsion-angle degrees of freedom. The number of
possible conformations is $O(e^{2N})$, far too
many to enumerate even for a short chain.
13
Homology modelling
  • Just use biology, not physics!
    (homology = common origin)
  • Proteins with more than 25% sequence similarity
    always have very similar structures, because
    structure is strongly conserved in evolution.
  • Align query and template
  • Build the backbone from aligned template
  • Build non-aligned regions (loops)
  • Build side chains.

The more similar the sequences, the more similar
the structures and the better the model.
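Whether a template is close enough is usually judged from a pairwise alignment. Here is a minimal sketch that computes percent identity over aligned (non-gap) positions. The slide speaks of similarity in general; using identity, and normalizing by the number of aligned positions, are assumptions of this example (several conventions exist), and the two aligned strings are made up.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues among positions where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gap columns are not counted
        aligned += 1
        if a == b:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0

# Hypothetical query/template alignment with one gap in the template row.
identity = percent_identity("AMTYHLDV", "AMS-HLEV")
print(round(identity, 1))
```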
14
Homology models tend to be rather good in
conserved regions, but they are poor in more
variable regions (loops). They are only reliable
if sequence similarity is above 25% (this
threshold has been lowered thanks to better
alignment techniques), whereas most protein pairs
have lower similarity.
15
The bioinformatics approach: look at known
protein structures
  • Related proteins with the same fold typically
    have low sequence similarity. Their similarity
    can hardly be recognized by aligning the
    sequences alone.
  • Score: how suitable is a structure template for
    a query sequence? (Effective energy function)
  • Recognize the known structure that best fits an
    unknown sequence.
  • No physical derivation for the scoring scheme,
    but thermodynamic estimates are sometimes
    possible.

16
Reduced representation of proteins
We represent protein structures as contact maps
$C_{ij}$, where $C_{ij} = 1$ if residues i and j
are in contact and 0 otherwise. Similarity is
measured as the fraction of common contacts, or
overlap:
$q(C, C') = \frac{\sum_{ij} C_{ij} C'_{ij}}{\max\left(\sum_{ij} C_{ij}, \sum_{ij} C'_{ij}\right)}$
Alternative structures of a sequence
$A = A_1 \ldots A_N$ are generated by aligning A
without gaps with all structures in the PDB
(gapless threading). The energy is assumed to be
of the form
$E(C, A)/k_B T = \sum_{ij} C_{ij} U(A_i, A_j)$,
depending on 210 parameters $U(a, b)$.
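The overlap and the contact energy can be sketched in a few lines. Contact maps are stored here as sets of residue pairs, each pair counted once. The two-letter H/P alphabet and the interaction values in `U_TOY` are hypothetical stand-ins for the 210 optimized parameters, which the slides do not list.

```python
def contact_overlap(c1, c2):
    """Fraction of common contacts between two contact maps.

    Each map is a set of (i, j) tuples with i < j.
    """
    if not c1 or not c2:
        return 0.0
    return len(c1 & c2) / max(len(c1), len(c2))

def contact_energy(contacts, sequence, u):
    """Sum U(A_i, A_j) over all contacts (i, j), in units of kT."""
    return sum(u[tuple(sorted((sequence[i], sequence[j])))]
               for i, j in contacts)

# Hypothetical toy parameters: only H-H contacts are favourable.
U_TOY = {("H", "H"): -1.0, ("H", "P"): 0.0, ("P", "P"): 0.0}

native = {(0, 2), (0, 3), (1, 3)}
alternative = {(0, 2), (1, 3)}
q = contact_overlap(native, alternative)
e_native = contact_energy(native, "HPHH", U_TOY)
print(q, e_native)
```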
17
Effective energy for simplified protein models
We have optimized the parameters of a contact
energy function such that the Native State has
the lowest energy and the energy landscape is
well correlated for most independent proteins in
the Protein Data Bank. Our optimization method is
based on maximizing the Boltzmann average of the
similarity with the native state:
$Q(A) = \frac{\sum_C \exp(-E(C,A)/k_B T) \, q(C, C_{nat})}{\sum_C \exp(-E(C,A)/k_B T)}$
When this parameter is maximal ($Q \approx 1$),
the native state has the lowest energy and
dissimilar states have high energy (the energy
landscape is well correlated). This can be
achieved for nearly all proteins in the PDB.
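A minimal sketch of the Boltzmann-averaged overlap with the native state, assuming the average is normalized by the partition function over the same set of conformations. The energies and overlaps are hypothetical toy values.

```python
import math

def boltzmann_average_overlap(energies, overlaps, kT=1.0):
    """Boltzmann average of q(C, C_nat) over a set of conformations."""
    e_min = min(energies)  # shift for numerical stability
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    z = sum(weights)
    return sum(w * q for w, q in zip(weights, overlaps)) / z

# Well-correlated toy landscape: the native state (overlap 1) lies far
# below the dissimilar conformations.
q_good = boltzmann_average_overlap([-10.0, -2.0, -1.0], [1.0, 0.3, 0.1])

# Flat toy landscape: every conformation is equally likely.
q_flat = boltzmann_average_overlap([0.0, 0.0, 0.0], [1.0, 0.3, 0.1])
print(q_good, q_flat)
```

The first landscape gives a value close to 1, the second only the plain average of the overlaps, which is the contrast the optimization exploits.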
18
Effective energy applied to crystal structures
Prediction of unfolding free energies using
crystal structures and the effective energy
function:
$\Delta G / N k_B T = E_{nat} / N k_B T - s$
The Native States have the lowest energy and the
energy landscapes are well correlated. The
resulting normalized energy gap $\alpha$ (0.2-0.8)
is much higher than for random sequences (< 0.1)
and increases with chain length.
19
The main contribution to the energy parameters
comes from hydrophobicity
20
A facility for protein structure prediction at
CAB (http://www.cab.inta.es/CAFASP/)
The PROTFINDER algorithm looks for the structure
in the PDB database that best aligns (with gaps)
to the query sequence. It took part in CAFASP4.
It is available through a web server built and
maintained by Alain Lepinette of CAB.
21
Scoring function
For a sequence-structure alignment a(i):
$\mathrm{Score} = -\sum_{ij} C(a(i), a(j)) \, U(A_i, A_j) - S_0 L_{ali} - G_0 N_{gaps} - G_1 L_{gaps}$
  • Contact free energy (first term)
  • Configurational entropy loss: $S_0$ for each
    aligned residue
  • Gap penalties: $G_0$ (create) and $G_1$ (extend)
Sequence homology information is not used. A
semi-deterministic algorithm is used to generate
candidate alignments.
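The scoring function can be sketched as follows for one candidate alignment. The data layout (a list mapping each sequence position to a structure position or `None`), the toy parameters, and the choice to count each contact pair once are assumptions of this example. The gap counts are passed in rather than derived, because the slides do not say how gaps are counted.

```python
def threading_score(alignment, contacts, sequence, u,
                    s0, g0, g1, n_gaps, l_gaps):
    """Score of one sequence-structure alignment.

    alignment[i] is the structure position a(i) of residue i, or None.
    contacts is a set of (p, q) structure positions with p < q.
    """
    aligned = [i for i, pos in enumerate(alignment) if pos is not None]
    contact_term = 0.0
    for x, i in enumerate(aligned):
        for j in aligned[x + 1:]:
            pair = tuple(sorted((alignment[i], alignment[j])))
            if pair in contacts:
                contact_term += u[tuple(sorted((sequence[i], sequence[j])))]
    # Favourable (negative) contact energy raises the score; the entropy
    # term and the gap penalties lower it.
    return (-contact_term - s0 * len(aligned)
            - g0 * n_gaps - g1 * l_gaps)

# Hypothetical toy parameters and alignment (residue 2 is unaligned).
U_TOY = {("H", "H"): -1.0, ("H", "P"): 0.0, ("P", "P"): 0.0}
score = threading_score([0, 1, None, 2], {(0, 2)}, "HPHH", U_TOY,
                        s0=0.1, g0=0.5, g1=0.1, n_gaps=1, l_gaps=1)
print(score)
```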
22
Fold recognition
The ability to predict protein structures depends
crucially on the most similar structure available
in the database, $q_{max}$.
  • A very similar structure is present in the
    database: it is correctly selected on the basis
    of the energy.
  • No structure above a threshold of similarity:
    almost random prediction.
The high similarity needed is frequent in
proteins of detectable homology, but not in very
distant homologs.
23
Sequence-structure alignments obtained through
ProtFinder are very similar to those found in
databases of protein alignments (PFAM)
24
The CASP experiment evaluates protein structure
prediction methods
25
Stability of orthologous proteins
Related proteins with the same fold have
typically low sequence similarity. What are the
common features of their sequences? How similar
are their thermodynamic properties?
26
With our tools we can compare thermodynamic
properties of homologous proteins. We estimate
two key parameters: the folding free energy
$\Delta G$ and the normalized energy gap $\alpha$.
We apply our energy function to families of
orthologous proteins, predicting their Native
Structure. In all cases, this coincides with the
structure of the closest analog in the PDB, even
though our algorithm does not use information on
sequence similarity.
27
List of organisms
  • Free-living: B. subtilis, B. anthracis,
    C. crescentus, D. radiodurans, E. coli,
    E. acidophylus, H. influenzae, L. lactis,
    L. monocytogenes, L. innocua, M. tuberculosis,
    M. smegmatis, N. meningitidis, P. multocida,
    P. aeruginosa, P. putida, R. loti, R. meliloti,
    S. typhimurium, S. aureus, S. pyogenes,
    S. coelicolor, Synechococcus, T. pallidum,
    V. cholerae, X. fastidiosa, Z. mobilis
  • Intracellular: B. burgdorferi, B. aphidicola
    (APS, BPS, SGR), C. jejuni, C. pneumoniae,
    C. trachomatis, H. pylori, M. capricolum,
    M. genitalium, M. pneumoniae, M. leprae,
    R. prowazekii, U. parvum, Y. pestis,
    W. glossinidia, Wolbachia sp.
  • Thermophiles: A. aeolicus, B. stearothermophilus,
    T. maritima, T. aquaticus
  • Archaea: A. pernix, A. fulgidus, M. jannaschii,
    M. thermoautotrophicum, P. furiosus
List of genes
  • ATPE, ACKA, AROQ, COAD, DDL, DUT, EFTS, FLAV,
    FOLA, FTSJ, PDF, PTH, PTHP, RL14, RNH, RNPA,
    TRXA, TRXB, TPIS, TRPA, DNAK

28
Protein folding thermodynamics depends on
hydrophobicity. More hydrophobic sequences have a
more negative folding free energy (they are more
stable against unfolding), but they have a lower
energy gap (they are less stable against
misfolding). Evolution has to find a compromise
between these properties! (Frustration)
29
Folding efficiency (normalized energy gap) is
correlated with genome size. Smaller genomes,
such as those of intracellular bacteria, have
reduced folding efficiency. Possible misfolding
problems are consistent with observed high
expression of chaperones in these bacteria.
30
Intracellular bacteria
The genomes of obligate intracellular organisms
(organelles, endosymbionts, parasites) share
important common features
  • Very small genomes
  • High AT content
  • High hydrophobicity
  • Reduced population size
  • Reduced folding ability of proteins

These features can be explained from the point of
view of evolutionary theory
31
Our results show that the normalized energy gap
$\alpha$ is smaller for intracellular bacteria
than for free-living bacteria. This can be
explained (a) by the mutation bias of
intracellular genomes towards AT, which makes
them express more hydrophobic proteins, and (b)
by the weaker selection experienced by
intracellular bacteria due to their small
populations. A smaller folding parameter implies
that misfolding occurs much more often. This can
lead to protein aggregation, which is very
dangerous for cellular processes. To avoid
aggregation, these bacteria express very high
amounts of chaperones, proteins in charge of
helping protein folding. The chaperone DNAK
appears more stable in organisms with smaller
genomes.
32
What do sequences with the same fold have in
common?
Spectral decomposition of the interaction matrix:
$E = \sum_{ik} C_{ik} U(A_i, A_k) \approx \sum_{ik} C_{ik} h(A_i) h(A_k)$
Sequences with the same fold have a similar
Hydrophobicity Vector $h(A_i)$ (HV). The HV has a
large correlation r(h,c) with the Principal
Eigenvector (PE) of the contact matrix $C_{ij}$.
Therefore, sequences with the same fold have a
common hydrophobic fingerprint that coincides
with the PE of the contact matrix. The
evolutionary average HV correlates with the PE
much more strongly than the HV of a single
sequence.
33
Bioinformatics
  • Biological information is accumulating at a
    very fast pace.
  • This information must be classified for storage
    and retrieval. (One could say that biology is
    the art of classifying!)
  • Protein structures: decomposition, structural
    classification, hidden evolutionary
    relationships.
  • Biological sequences: identification of protein
    sequences (genes), classification, structure and
    function prediction.
  • Molecular interactions: reconstruction of
    metabolic networks and cellular regulatory
    networks (systems biology).
  • Organisms: evolutionary classification
    (phylogeny).
  • Biological literature: classification and
    retrieval.

34
Proteins are made of modules (domains) that are
duplicated and combined in many possible ways to
create ever new molecules.
36
The Protein Data Bank (PDB) contains roughly
24000 protein structures, determined either by
X-ray crystallography or by NMR spectroscopy.
Fewer than 4000 are different folds. The number
of new folds (blue bar) is decreasing each year.
Other classification schemes yield fewer than
1000 different folds. Evolution uses a reduced
number of folds for a large number of biological
functions.
37
CATH structural classification: 813 folds
(Topology level) (Thornton, Orengo)
38
SCOP (Structural Classification of Proteins): 800
folds (Chothia, Murzin)
39
DALI: algorithm and server for automatic
classification of protein structures (Holm and
Sander). It aligns protein structures by
minimizing the dissimilarity score
$S = \sum_{ik} \frac{|r^a_{ik} - r^b_{ik}|}{r^a_{ik} + r^b_{ik}} \exp\left(-\frac{(r^a_{ik} - r^b_{ik})^2}{4 r_0^2}\right)$,
with $r_0 = 20$ Å. The sum runs over C-alpha atoms
i, k. It generates the database FSSP of
structurally similar proteins (S much smaller
than for random pairs of structures, Z-score
criterion).
40
  • For each new structure:
      • Store it in the PDB with proper format.
      • Decompose it into domains.
      • Classify domains, discover new evolutionary
        relationships.
  • For each new sequence:
      • Find the gene sequences in the genome (easy
        for prokaryotes, very difficult for
        eukaryotes because genes are interrupted by
        introns).
      • Find homologous domains, infer structure and
        function.
      • Decide whether structure determination is
        worthwhile.

41
Protein databases
  • GenBank: protein sequences (not annotated),
    from genomic projects.
  • SwissProt: annotated protein sequences. Domain
    organization, structure, function, and active
    site may be known from homology.
  • Protein Data Bank (PDB): protein structures.
42
Sequence Alignment
43
Alignment is the main tool in Bioinformatics. It
is justified by the fact that aligned elements
have a common evolutionary origin (homology).
Amino acids or nucleotides can be conserved in
evolution, substituted (usually with minimal
modification of the Native State), inserted or
deleted. The last two processes generate gaps in
the alignment. The score for an alignment a(i)
between two sequences $A^1_i$, $A^2_k$ is
$\mathrm{Score} = \sum_i S(A^1_i, A^2_{a(i)}) - G_0 N_{gaps} - G_1 L_{gaps}$
The 20 × 20 matrix S(a,b) is called the
substitution matrix and is determined from
aligned protein families. The most used are the
BLOSUM62 and PAM250 matrices. $G_0$ is the gap
opening penalty and $G_1$ is the gap extension
penalty. The number of possible alignments grows
exponentially with sequence length, but the
optimal alignment can be found exactly with an
$O(L^2)$ algorithm using dynamic programming
(Needleman-Wunsch, Smith-Waterman). The optimal
solution is often, but not always, the
biologically relevant one. The gap parameters and
substitution matrix used are crucial! One has to
check the statistical significance.
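The dynamic programming idea can be sketched with a simplified global alignment (Needleman-Wunsch) that returns only the optimal score. As simplifying assumptions, it uses a flat match/mismatch score instead of a substitution matrix such as BLOSUM62, and a single linear gap penalty instead of separate opening and extension penalties.

```python
def needleman_wunsch_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score with a linear gap penalty."""
    n, m = len(seq_a), len(seq_b)
    # prev[j] = best score aligning the first i-1 letters of seq_a
    # with the first j letters of seq_b.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * gap] + [0] * m
        for j in range(1, m + 1):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s,   # align the two letters
                          prev[j] + gap,     # gap in seq_b
                          curr[j - 1] + gap) # gap in seq_a
        prev = curr
    return prev[m]

print(needleman_wunsch_score("ACGT", "AGT"))
```

The table has (n+1) × (m+1) cells and each is filled in constant time, so the cost grows with the product of the two sequence lengths. Recovering the alignment itself would require keeping the full table and tracing back from the last cell.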
44
Multiple Sequence Alignments
  • Multiple alignment of M sequences is an NP-hard
    problem: no algorithm polynomial in M is thought
    to exist. Once the first two sequences have been
    aligned, in fact, the score for the next one has
    been modified!
  • The most used solution, implemented in the
    algorithm CLUSTALW, consists in aligning the
    easy pairs first:
      • Align all pairs of sequences with a fast
        algorithm.
      • Build a tree of their relationships.
      • Start by accurately aligning the two most
        closely related sequences (the easiest).
        Represent both of them with a single profile.
      • Iterate, looking again for the two most
        closely related sequences or profiles.

45
Database search
  • Often, we do not need accurate alignments but
    just a list of database entries that are
    evolutionarily related to our query sequence.
    The most used algorithms for this purpose are
    BLAST and FASTA.
  • BLAST compares the query sequence to all
    sequences in a database like SwissProt or
    GenBank in a few seconds. For each pair of
    sequences, it finds all exact matches of length
    k, extends and combines them, and reports the
    probability (P value) that such matches are
    found by chance.
  • PSI-BLAST is an iterative procedure based on
    BLAST:
      • Find all sequences significantly related to
        the query.
      • Construct a profile (amino acid distribution
        per site) from the multiple alignment.
      • Iterate the search using the profile as
        query.
  • In this way, very distant evolutionary
    relationships can be retrieved confidently. This
    method is very useful for protein structure
    prediction.
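The first step described for BLAST, finding all exact matches of length k between two sequences, can be sketched with a k-mer index. This covers the seeding idea only: the extension, combination and statistics steps are omitted, the real program's seeding rules are more elaborate, and the two sequences are made up.

```python
from collections import defaultdict

def find_seed_matches(query, subject, k=3):
    """Return (query_pos, subject_pos) for every exact shared k-mer."""
    # Index every k-mer of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    # Scan the subject and look each k-mer up in the index.
    matches = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            matches.append((i, j))
    return matches

seeds = find_seed_matches("AMTYHLDV", "GGMTYHAA", k=3)
print(seeds)
```

Because the index lookup takes constant time per k-mer, the scan is linear in the subject length, which is what makes searching a whole database fast.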

46
Phylogenetic trees
Evolving species can be placed on the leaves of a
phylogenetic tree. The time elapsed since the
last common ancestor of species A and B, d(A,B),
is a distance that allows classification. This is
based on the ultrametric property: in every
triangle, the two longest sides are equal.
Phylogenetic trees were once built by comparing
external characters, but now they are built using
macromolecules such as proteins, RNA and DNA.
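The ultrametric property can be checked directly on a distance table. A minimal sketch follows; the species labels and distances are hypothetical, and storing distances in a dictionary keyed by unordered pairs is a choice made for this example.

```python
from itertools import combinations

def is_ultrametric(dist, labels, tol=1e-9):
    """True if, in every triangle, the two longest sides are equal."""
    for a, b, c in combinations(labels, 3):
        sides = sorted([dist[frozenset((a, b))],
                        dist[frozenset((a, c))],
                        dist[frozenset((b, c))]])
        if abs(sides[2] - sides[1]) > tol:
            return False
    return True

# Hypothetical divergence times: A and B split recently, C much earlier.
tree_like = {frozenset(("A", "B")): 6.0,
             frozenset(("A", "C")): 90.0,
             frozenset(("B", "C")): 90.0}
not_tree_like = dict(tree_like)
not_tree_like[frozenset(("B", "C"))] = 80.0

print(is_ultrametric(tree_like, ["A", "B", "C"]))
print(is_ultrametric(not_tree_like, ["A", "B", "C"]))
```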
47
The molecular clock
  • Empirical observation: the number of amino acid
    substitutions between two orthologous proteins
    (e.g. myoglobin) of two species A and B is
    linearly correlated with their divergence time
    t(A,B). Fluctuations of the number of
    substitutions are small.
  • $K(A,B) = a \, t(A,B)$
  • If the divergence time is not known, the number
    of substitutions can be used to estimate it.
    K(A,B) can be obtained from the number of
    mismatches in the sequence alignment, using some
    model of evolution to correct for multiple
    substitutions.
  • Methods to generate phylogenetic trees range
    from deterministic clustering algorithms to
    optimization methods. The two most used are:
      • Neighbor Joining: join the two closest
        sequences, recalculate distances, iterate.
        Very fast but not very accurate.
      • Maximum Likelihood: for a model of sequence
        evolution (independent sites needed!),
        calculate the likelihood of the observed
        sequences given the parameters and the tree.
        Exhaustive search for the ML tree is
        impossible, but approximate algorithms give
        good results.
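The correction for multiple substitutions mentioned above can be illustrated with the Poisson correction, one simple model chosen here as an assumption (the slides do not name a model). If p is the fraction of mismatched sites, the estimated number of substitutions per site is -ln(1 - p). The two aligned sequences are made up.

```python
import math

def poisson_corrected_distance(aligned_a, aligned_b, gap="-"):
    """Substitutions per site estimated from the fraction of mismatches."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    sites = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not sites:
        raise ValueError("no aligned sites")
    p = sum(a != b for a, b in sites) / len(sites)
    if p >= 1.0:
        raise ValueError("mismatches are saturated; distance undefined")
    # Multiple hits at the same site hide substitutions, so the corrected
    # distance is always at least as large as the raw mismatch fraction p.
    return -math.log(1.0 - p)

k_ab = poisson_corrected_distance("AAAAAAAAAA", "AAAAAAAATT")
print(k_ab)
```

As p approaches 1 the estimate diverges, which is the saturation problem: very old divergences cannot be dated reliably from mismatch counts.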

49
Tree of seven replication proteins found in all
bacterial genomes (using the BLAST algorithm),
obtained with the Neighbor-Joining method. The
numbers represent bootstrap values (number of
times, out of 1000, that the plotted branching is
observed using a random subset of all aligned
positions). Some groups (clades) can be
confidently reconstructed, for instance
Proteobacteria and Gram-positive bacteria, but
some divergences are too ancient and no
similarity signal is found in their proteins.
50
Some problems with phylogenetics
  • The protein tree, which we reconstruct, does not
    always coincide with the species tree, if there
    has been gene transfer between species (frequent
    in bacteria) or gene duplication prior to species
    separation (paralogous proteins).
  • The molecular clock is known to hold for neutral
    evolution (when the properties of the protein do
    not change), but adaptations happen at a much
    faster rate. The substitution rate can also vary
    between branches because of different mutation
    rates or generation times. When the rate is too
    variable, the estimates of branch lengths and the
    reconstructed trees are not reliable.
  • The number of substitutions K(A,B) can be
    reliably estimated from the number of mismatches
    only when it is not saturated.
  • An indication of these problems is that
    different proteins usually give different tree
    topologies.

51
Some courses on the web
  • http://www.pdg.cnb.uam.es/cursos/BioInfo2002/pages/index.html
    Curso de Doctorado: BIOINFORMÁTICA (doctoral
    course in bioinformatics)
  • http://www.cryst.bbk.ac.uk/PPS2/index.html
    Principles of Protein Structure Using the
    Internet
  • http://www.biochemtech.uni-halle.de/PPS2/projects/day/TDayDi
    The Source of Stability in Proteins
  • http://www.fst.reading.ac.uk/courses/fs916/index.htm
    Protein Structure and Function
  • http://www.cm.utexas.edu/academic/courses/Spring2002/CH339K/Robertus/
  • http://www.oup.com/lesk/bioinf
    Site of the book Introduction to Bioinformatics,
    by A.M. Lesk (Oxford)
52
Main databases and resources
  • http://www.ncbi.nlm.nih.gov/
    National Center for Biotechnology Information:
    genomes, PubMed (literature), genes, proteins...
  • http://www.bmn.com/
    BioMedNet (Medline): biological literature
  • http://www.tigr.org/
    The Institute for Genomic Research
  • http://www.ebi.ac.uk/swissprot/
    Swiss-Prot: annotated proteins
  • http://pfam.wustl.edu/
    Pfam: aligned protein families
  • http://gibk26.bse.kyutech.ac.jp/jouhou/jouhoubank.html
    BioInfo Bank: several databases
  • http://www.rcsb.org/pdb/
    Protein Data Bank: protein structures
  • http://www.ebi.ac.uk/dali/
    FSSP: alignment of protein domains
  • http://www.biochem.ucl.ac.uk/bsm/cath/
    CATH: classification of domains
  • http://scop.mrc-lmb.cam.ac.uk/scop/
    SCOP: classification of domains
  • http://www.ebi.ac.uk/
  • http://pqs.ebi.ac.uk/
    Protein Quaternary Structure (interactions)
  • http://BioInfo.PL/cafasp/
    Servers for automatic protein structure
    prediction