Protein homology I: Evolution and comparison of protein sequences - PowerPoint PPT Presentation

Slides: 38
Provided by: mattc5

Transcript and Presenter's Notes



1
Protein homology I: Evolution and comparison of
protein sequences
  • Biochem 565, Fall 2007
  • 09/10/07
  • Cordes

2
Outline
  • 1. What is homology?
  • 2. Sequence conservation and covariation
  • 3. Generalized substitution matrices
  • 4. Pairwise alignment--global vs. local
  • 5. Sequence identity and homology
  • 6. Sequence similarity and homology--use of substitution matrices
  • 7. Alignment scores and statistics
  • 8. Limitations of pairwise alignment
  • 9. Remote homologies--use of evolutionary profiles

3
Homology between different proteins--key terms
[Figure: boxes represent protein-coding genes. Gene
A1 undergoes gene duplication to give A1 and A2;
after speciation, each species carries its own A1
and A2. Labels: orthologs, paralogs.]
homologous: descended from a common ancestor, e.g.
A1 and A2 are homologous. Also sometimes
defined as "similar due to descent from a
common ancestor." Homology is either/or: there
is no such thing as percent homology!
"Homologous" is not a synonym for "similar"! It
is, however, possible for only a part of two
sequences to be homologous, for instance one
domain in a multidomain protein.
paralogous: related by gene duplication
orthologous: related by speciation
4
Sequence conservation among homologous proteins
If one aligns the sequences of multiple proteins
which are related by common ancestry and thus
form a family, one can then examine the extent of
amino-acid sequence evolution both in general and
at particular positions in the alignment.
bars here reflect level of conservation
some positions universally conserved
some are highly variable
coloring reflects conserved, chemically similar
amino acid types, e.g. charged residues such as
Arg, Lys
Multiple alignment of six bacteriophage Cro
proteins (made with ClustalX).
5
Overall rates of protein sequence evolution vary
widely
Table. Substitution rates in genes encoding
orthologous rodent and human proteins. Rates are in
substitutions per site per billion years.
protein      nonsynonymous rate   synonymous rate   KA/KS
histone 3    0.00                 6.38              0
actin α      0.01                 3.68              0.002
insulin      0.13                 4.02              0.03
myoglobin    0.56                 4.44              0.126
β-globin     0.80                 3.05              0.262
urokinase    1.28                 3.92              0.362
Nonsynonymous changes are nucleotide
substitutions which lead to amino acid changes.
Synonymous changes are those which do not. KA/KS
is the ratio of the nonsynonymous to the synonymous
substitution rate in the gene, and is a measure of the
functional selection on a protein. Proteins with
highly constrained function tend to evolve more
slowly and have lower KA/KS values. This
includes critical proteins with multiple levels
of function and regulation, such as histones.
adapted from Protein Evolution by L. Patthy,
Blackwell Science, 1999 and from Fundamentals of
Molecular Evolution by Li & Graur, Sinauer, 1991
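As a small illustration of the ratio described on this slide, the sketch below recomputes KA/KS from a few of the tabulated rates. The function name and the choice of rows are assumptions made for the example; real KA/KS estimates come from codon-level analysis of aligned genes, not from dividing two published numbers.

```python
def ka_ks(nonsyn_rate: float, syn_rate: float) -> float:
    """Ratio of nonsynonymous to synonymous substitution rates."""
    if syn_rate <= 0:
        raise ValueError("synonymous rate must be positive")
    return nonsyn_rate / syn_rate


# Rates (substitutions per site per billion years) copied from the table.
rates = {
    "histone 3": (0.00, 6.38),
    "myoglobin": (0.56, 4.44),
    "beta-globin": (0.80, 3.05),
}

ratios = {name: ka_ks(ka, ks) for name, (ka, ks) in rates.items()}

# Print from most constrained (lowest ratio) to least constrained.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:12s} KA/KS = {ratio:.3f}")
```

The lower the ratio, the stronger the inferred constraint against amino acid changes.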
6
Orthologous vs. paralogous proteins
As a general rule, orthologous proteins tend to
perform the same function in different species,
while paralogous proteins tend to have
diversified somewhat in function. For example,
the chymotrypsin serine proteases are orthologous
to each other, and they retain not only the same
general function (proteolysis using a catalytic
triad including a serine), but also the same
substrate specificity (cleavage at positions
following aromatic side chains). The
chymotrypsins are paralogous to the trypsins and
the elastases. These proteins share the same
general serine protease function but have evolved
different substrate specificities. These
proteins also have paralogs which have lost all
protease activity. Since the functional
constraints are different, sequence conservation
patterns might vary depending upon whether one is
comparing orthologs or paralogs, or close
paralogs vs. distant paralogs.
7
Classic phylogenetic studies of
sequence conservation: the globins
The globins are the best studied family in terms
of sequence conservation, partly because they
were one of the first families for which multiple
members were sequenced, and partly because some
of the earliest protein structures (in fact, the
earliest) solved were globins. The classic
papers of Perutz, Kendrew and Watson were the
first to correlate sequence conservation with
aspects of protein structure and function. They
drew their conclusion based on only a few aligned
sequences. Later globin studies, such as that of
Bashford, Chothia and Lesk, expanded the analyses
of globin sequence conservation to include
hundreds of sequences.
Perutz, Kendrew & Watson, J Mol Biol 13, 669 (1965)
Bashford, Chothia & Lesk, J Mol Biol 196, 199 (1987)
[Figure: Scapharca inaequivalvis oxygenated hemoglobin]
8
Conservation of functional residues
There were only 2 perfectly conserved residues
among the 8 known globin structures at the time
of the Bashford et al. study. These are residues
critical in binding of heme and/or interaction
with heme-bound oxygen. It will often be found that
the best conserved residues in related proteins
are those involved in critical aspects of the
general function.
[Figure labels: Phe 43, heme, His 87]
Residues involved in more specific aspects of
function may or may not be conserved, depending
upon the relationship between the proteins under
consideration. For example, residues involved in
substrate specificity for serine proteases may be
conserved among orthologs, such as the
chymotrypsins, but not between paralogs, such as
chymotrypsins and trypsins.
9
Conservation at buried positions
  • core residues, which are usually hydrophobic,
    often tolerate conservative substitutions, i.e.
    to other hydrophobics
  • overall core volume is well-conserved (Lim &
    Ptitsyn, 1970) though individual core positions
    tolerate variation in volume
  • this reflects what we know about packing and the
    effects of core mutations on stability--thus
    sequence conservation is partly related to
    maintaining a stable structure

[Figure: portion of alignment of prokaryotic and
eukaryotic globins, and the human hemoglobin beta
chain. Labels: Y140, H156, buried. Color key:
yellow = small neutral/polar; green = hydrophobic;
red/pink = polar/acidic; blue = basic.]
10
Conservation at solvent-exposed positions
  • solvent-exposed (surface) positions are mutable
    and usually tolerate mutation to many residue
    types including hydrophobics.
  • Bashford et al., however, noted that for globins
    at least, some surface positions do not tolerate
    large hydrophobics. Since polar-to-hydrophobic
    mutations on protein surfaces do not reduce
    stability, this conservation could reflect
    constraints on solubility.
  • Indeed, it is clear that the overall polar
    character of the surface is conserved for
    soluble, globular proteins, even though a certain
    number of hydrophobics may be tolerated.

[Figure: human hemoglobin beta chain with examples
of surface residues. Labels: Y140, H156. Color key:
yellow = small neutral/polar; green = hydrophobic;
red/pink = polar/acidic; blue = basic.]
11
Conservation of loops and turns
  • Spacer regions between secondary structures,
    such as loops and turns, are often hypermutable
    and vary not only in sequence but in length,
    tolerating insertion and deletion events
    (Insertions and deletions are much less often
    found within secondary structure elements. Why?)

[Figure: part of alignment of animal hemoglobin α
and β chains; human α chain structure]
Are the α and β chains related to each other by
paralogy or orthology?
12
Covariation analysis
Substitution patterns at different positions in a
sequence alignment are not necessarily
independent. This is sometimes referred to as
covariation or correlated evolution.
name  sequence
A     YADLGRIKS
B     YSDLGSEKE
C     IDDFGEIAA
D     IDDFGVIGT
For example, in the mini multiple alignment shown
above, the identity of the residue at the 4th
position is correlated to the identity of the
residue at the 1st position.
A statistical perturbation analysis can be used
to characterize this covariation. An alignment
of related sequences is perturbed by only
considering sequences in which, for example, the
first position is Y. The effect of this
perturbation on the residue distribution observed
at other positions is then measured. If the
distribution changes significantly, covariation
between sequence changes at the first site and
other sites in the alignment is inferred.
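The perturbation idea can be tried on the four-sequence mini alignment from this slide. The sketch below is only an illustration: the function names are made up, and the measure of "how much the distribution changed" (half the summed absolute difference in residue frequencies) is an assumption, since the slide does not say which statistic is used.

```python
from collections import Counter

# Mini alignment from the slide.
alignment = {
    "A": "YADLGRIKS",
    "B": "YSDLGSEKE",
    "C": "IDDFGEIAA",
    "D": "IDDFGVIGT",
}


def column_frequencies(seqs, pos):
    """Residue frequencies at a 0-based alignment position."""
    counts = Counter(seq[pos] for seq in seqs)
    total = sum(counts.values())
    return {res: n / total for res, n in counts.items()}


def perturb(seqs, pos, residue):
    """Keep only the sequences that carry `residue` at position `pos`."""
    return [seq for seq in seqs if seq[pos] == residue]


def distribution_shift(p, q):
    """Assumed measure: half the summed absolute frequency difference."""
    residues = set(p) | set(q)
    return 0.5 * sum(abs(p.get(r, 0.0) - q.get(r, 0.0)) for r in residues)


seqs = list(alignment.values())
subset = perturb(seqs, 0, "Y")  # perturbation: first position must be Y

# Position 4 (index 3) responds to the perturbation; position 3 (index 2) does not.
shift_pos4 = distribution_shift(column_frequencies(seqs, 3),
                                column_frequencies(subset, 3))
shift_pos3 = distribution_shift(column_frequencies(seqs, 2),
                                column_frequencies(subset, 2))
print(shift_pos4, shift_pos3)
```

With only four sequences nothing here is statistically meaningful; real analyses use hundreds of sequences and a significance test.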
13
Covariation and hydrophobic core packing
The hydrophobic core residues in related proteins
tend to be covariant due to constraints on core
packing. One sees compensatory volume changes
at different positions. Davidson and coworkers
found that for 266 aligned SH3 domain sequences,
the strongest covariation was observed for a
cluster of central hydrophobic residues. For
example, substitution of a smaller residue
(Ala→Gly) at 39 was strongly correlated to
substitution of a larger residue (Ile→Phe) at 50.
[Figure: hydrophobic core of SH3 domains, with most
frequently covarying residues shown in yellow]
S.M. Larson, A.A. DiNardo and A.R. Davidson, J
Mol Biol 303, 433 (2000)
14
Some recent studies (Suel et al) have suggested a
connection between covarying clusters of residues
and transduction of signals between distant sites
in proteins. For example, G-protein coupled
receptors bind a ligand on one side of a
membrane, and then transduce that signal to the
other side through conformational change. Suel
et al showed that the main clusters of covarying
residues tended to connect the ligand and
G-protein binding sites.
[Figure labels: ligand; covarying networks (brown);
membrane; G-protein binding sites]
Suel et al. Nat Struct Biol 2003
15
Sequence logos: a common way of representing
sequence conservation
Note how the logo simplifies reading the
conservation, but doesn't throw away information
like simply calculating a consensus sequence
would.
[Figure: alignment converted to a logo]
Logos represent sequence conservation in an easy
to read format, with letter heights essentially
representing the frequency with which a residue
type occurs at a position in an alignment,
relative to the frequency with which it would
occur at random. The units of the y-axis are
bits of information, which is to say that if a
residue did not occur more often than expected at
random, it would not offer us any information and
the letter height would be zero. Note that the
letter heights only become very high when a
residue really dominates in the alignment, like
Ala at the fifth position here.
weblogo server: http://weblogo.berkeley.edu
sequence logos paper: Schneider and Stephens,
Nucleic Acids Res 18, 6097 (1990).
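To make the "bits of information" on the logo's y-axis concrete, here is a minimal sketch of the calculation for one alignment column. It assumes a uniform background over 20 amino acids, and it leaves out the small-sample correction used in real logo software and any handling of gaps. The function names are illustrative.

```python
import math
from collections import Counter


def column_information(column: str, alphabet_size: int = 20) -> float:
    """Information content (bits) of one alignment column.

    Maximum possible entropy (log2 of the alphabet size) minus the
    observed Shannon entropy of the column.
    """
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def letter_heights(column: str) -> dict:
    """Height of each letter: its frequency times the column information."""
    info = column_information(column)
    total = len(column)
    return {res: (n / total) * info for res, n in Counter(column).items()}


print(letter_heights("AAAA"))  # one dominant residue: tall letter
print(letter_heights("AAGG"))  # split column: two shorter letters
print(letter_heights("AGVL"))  # variable column: four short letters
```

A fully conserved column reaches log2(20), about 4.3 bits, which is why letters only become very tall when one residue dominates.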
16
Generalized substitution matrices
Conservation patterns observed in related
proteins have been used to construct generalized
substitution matrices (e.g. BLOSUM, PAM, Gonnet)
which reflect the gross average likelihood of a
mutation occurring and being accepted in a
protein.
(handouts in class will have more notes on these)
[Figure: the PAM 250 matrix]
  • cysteine, tryptophan least mutable
  • polar more mutable than hydrophobic
  • polar more likely to be substituted by polar,
    hydrophobic by hydrophobic
17
Inferring homology between proteins
The simplest way of identifying homology is by
sequence comparison. If two protein sequences
are sufficiently similar (we'll talk about what
similarity means in a moment), they can be
statistically inferred to be homologous. In
addition, if a sequence obeys conservation
patterns observed in a known family of related
sequences, it can be inferred to be a member of
that family. For sequences of statistically
borderline similarity, structural and functional
comparison, if such information is available, can
be used as a supplement to establish common
ancestry. If similarity between two sequences is
really statistically weak, very strong structural
and functional similarity can still make a
convincing argument for homology. Finally, gene
context can play a role--for example, do two
genes occupy the same location within an operon
in different organisms? We will next focus on
identification of homology through sequence
comparison. We will begin with simple pairwise
comparison.
18
Pairwise alignment of sequences--global and local
F R T Y I A E W Q R T E P G A D H
F Q T Y A A D Y - R T E P S S D H

GLOBAL ALIGNMENT
entire length of sequence aligned--about 60%
identity over 17 residues. Note that allowance
for gaps improves the identity. The best
alignment would be determined by using some
optimization algorithm in combination with a
scoring scheme, e.g. 1 for every identity and 0
for every mismatch or gap (identity matrix).

- - - - - - - - - R T E P G A D H
- - - - - - - - - R T E P S S D H

LOCAL ALIGNMENT
only the best matching portion(s) of sequence is
(are) included in the alignment--75% identity
over 8 residues. How does a local
alignment algorithm decide where to stop? By
lengthening the alignment only insofar as it
increases the score. For example, one could
increase the score by 2 for every identical
amino acid, while assigning a penalty of -1 for
every mismatch or gap. Such penalties would
prevent the alignment from extending to
dissimilar regions.
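The "extend only while the score increases" rule can be sketched with a bare-bones local alignment scorer using the example scheme from this slide (+2 for an identity, -1 for a mismatch or gap). This is a simplified Smith-Waterman-style recursion with a linear gap penalty. It returns only the best score, not the alignment itself, and the function name is made up for the example.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between sequences a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                       # start a fresh alignment here
                previous[j - 1] + pair,  # extend with an aligned pair
                previous[j] + gap,       # gap in sequence b
                current[j - 1] + gap,    # gap in sequence a
            )
            best = max(best, current[j])
        previous = current
    return best


# The 8-residue local region from the slide: 6 identities, 2 mismatches.
print(local_alignment_score("RTEPGADH", "RTEPSSDH"))
# Flanking residues that never match do not get pulled into the alignment.
print(local_alignment_score("GGRTEPGG", "AARTEPAA"))
```

The zero in the `max` is what lets the alignment stop: any extension that would drive the running score below zero is simply dropped.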
19
Pairwise alignment of sequences--global vs. local
Local alignment is more versatile than global and
is thus more widely used. It can be used to align
proteins that are not related throughout their
lengths but share a conserved domain, as well as
proteins with very unevenly distributed sequence
similarity. Many such cases exist. Thus,
when one has no prior knowledge of what to
expect, local alignment routines are preferable.
This will especially be the case if one is using
pairwise alignment to search a database for
sequences that are related to a query sequence.
Thus, alignment algorithms for database searching
essentially always use local alignment. It
should be noted that the scoring scheme used can
be tailored to favor longer or shorter local
alignments. Global alignment is usually used to
align sequences that are approximately the same
length and are already known to be related. Once
we've aligned all or part of a pair of sequences,
how do we decide whether they are homologous?
20
Percent sequence identity and homology
Common rule of thumb: 30% identity indicates
homology. This is too simplistic!
[Figure annotations:]
a high level of identity between unrelated proteins
is common at short alignment lengths
(don't worry about this)
20-30% identity is called the "twilight zone":
it is difficult to assess relatedness from
identity
the 30% identity threshold for identification of
homology only works for long alignments,
i.e. >100-150 amino acids
from Brenner et al. PNAS 95, 6073 (1998)
21
Sequence identity and homology: false positives
Note also that gaps are allowed in
this alignment--identity would be lower if gaps
were not allowed. However, gaps are common among
true homologs.
sequence identity is 39% over 64 residues, yet
the two proteins are unrelated--this would be a
false positive using a 30% cutoff rule. Use of a
length-dependent cutoff would help.
from Brenner et al. PNAS 95, 6073 (1998)
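Percent identity is simple to compute once an alignment is fixed. The sketch below applies it to the toy global and local alignments from the earlier global vs. local slide. One assumption to flag: the denominator here is the number of alignment columns, gaps included. Other conventions exist (e.g. dividing by the shorter sequence length), and they give different percentages.

```python
def percent_identity(row1: str, row2: str) -> float:
    """Percent of alignment columns holding the same residue in both rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    identical = sum(1 for x, y in zip(row1, row2) if x == y and x != "-")
    return 100.0 * identical / len(row1)


global_identity = percent_identity("FRTYIAEWQRTEPGADH", "FQTYAADY-RTEPSSDH")
local_identity = percent_identity("RTEPGADH", "RTEPSSDH")
print(f"global: {global_identity:.1f}%  local: {local_identity:.1f}%")
```

Note how the short local alignment scores a higher percentage than the longer global one, which is exactly why a flat percent cutoff misleads at short alignment lengths.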
22
Sequence identity and homology: poor coverage
the two proteins have the same fold, both bind
heme and oxygen in same place: good independent
structural/functional evidence for
homology... Yet alignments of their sequences
reveal only 24% identity. There are also many
examples of related globins and other proteins
with much lower identity than this.
[Figure: 1MBO and 1HBB, myoglobin and hemoglobin]
Any reasonable sequence identity criterion,
whether it is a flat percent cutoff or a
length-dependent cutoff, will give incomplete
coverage--in other words, it will fail to
identify many distant but true relationships.
23
Sequence similarity
Sequence identity is one specific way of
assessing sequence similarity, and it's not a
very good one. If you just use sequence
identity, you are throwing away a lot of
information. Specifically, as we have recently
learned, not all mutations are equally likely to
occur and be accepted during the course of
evolution. Knowledge of what substitutions
commonly occur among related proteins can be put
to use in aligning sequences and identifying
relationships. Various methods have been
developed which use such knowledge to assess
sequence similarity. The most widely used and
familiar of these methods work by using
generalized amino acid substitution matrices (aka
scoring matrices) in tandem with effective local
pairwise alignment algorithms. This is coupled
with a statistical assessment of the significance
of the alignment score obtained between two
sequences using a given matrix. As we'll see,
it's possible (and often worth doing) to get much
more sophisticated than that, but that's where it
begins.
24
Scoring alignments using substitution matrices
would moving F over two positions to the left
improve or worsen the score?
   G   D   A   Y   M   -   -   F   R   D   W   I
   G   E   R   Y   M   Q   P   L   R   D   W   G
   6   2  -1   7   5 -11  -1   0   5   6  11  -4   = 25
-11 is the gap opening penalty; -1 is the gap
extension penalty
overall score is sum of scores at each position
substitution scores are just elements from the
BLOSUM 62 matrix
In theory, the score is related to the odds that
the alignment represents an actual homologous
relationship between two proteins. Because
scoring matrices are in log-odds form, the
overall alignment score is the sum of the scores
at each position rather than the product.
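The question posed on this slide can be checked by scoring both versions of the alignment. The sketch below hard-codes only the handful of BLOSUM 62 entries needed. All but one are the values shown on the slide; the F/Q entry (-3) comes from the standard BLOSUM 62 table. The gap convention follows the slide: the first gap column costs -11 and each further column of the same gap costs -1. Function and variable names are illustrative.

```python
# Only the BLOSUM 62 entries needed for this example (not the full matrix).
BLOSUM62_SUBSET = {
    ("G", "G"): 6, ("D", "E"): 2, ("A", "R"): -1, ("Y", "Y"): 7,
    ("M", "M"): 5, ("F", "L"): 0, ("R", "R"): 5, ("D", "D"): 6,
    ("W", "W"): 11, ("I", "G"): -4, ("F", "Q"): -3,
}
GAP_OPEN = -11    # first column of a gap, as on the slide
GAP_EXTEND = -1   # each additional column of the same gap


def pair_score(x: str, y: str) -> int:
    """Look up a substitution score; the matrix is symmetric."""
    if (x, y) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(x, y)]
    return BLOSUM62_SUBSET[(y, x)]


def alignment_score(row1: str, row2: str) -> int:
    """Sum column scores; consecutive gap columns count as one gap."""
    total = 0
    in_gap = False
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += GAP_EXTEND if in_gap else GAP_OPEN
            in_gap = True
        else:
            total += pair_score(x, y)
            in_gap = False
    return total


original = alignment_score("GDAYM--FRDWI", "GERYMQPLRDWG")
f_moved_left = alignment_score("GDAYMF--RDWI", "GERYMQPLRDWG")
print(original, f_moved_left)
```

Moving F to the left pairs it with Q instead of L, so the score drops from 25 to 22. The gap cost itself is unchanged.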
25
Common pairwise local alignment methods
Smith-Waterman dynamic programming algorithm:
See handout for description of dynamic
programming. Mathematically guaranteed to find the
highest scoring alignment for a given set of
input parameters. Tradeoff is that it is slow,
although computer speed is getting to the point
where this is less of a problem. The global
version of Smith-Waterman is called
Needleman-Wunsch. If one were simply comparing
any 2 sequences to see if they are homologous,
Smith-Waterman would be the method of choice.

BLAST (Basic Local Alignment Search Tool) and
FASTA: These two are very similar--both
achieve a speed advantage over Smith-Waterman by
initially looking for short words of 2 or 3
residues that (nearly) exactly match. Alignments
are then built from these initial seed matches.
The tradeoff for the speed advantage is that some
homologies may be missed. Because of their
speed, BLAST and FASTA are used in searches of
large databases for homologues.
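The word-seeding step can be illustrated with a few lines of code. This is a deliberately simplified sketch: it finds only exact shared words of length k, whereas BLAST also accepts near-matching words that score above a threshold, and neither program's extension step is shown. The function names are invented for the example.

```python
from collections import defaultdict


def index_words(seq: str, k: int = 3) -> dict:
    """Map every word of length k to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def seed_matches(query: str, subject: str, k: int = 3) -> list:
    """Return (query_pos, subject_pos) pairs where an exact word is shared."""
    index = index_words(subject, k)
    seeds = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            seeds.append((i, j))
    return seeds


# The two toy sequences from the global vs. local alignment slide.
seeds = seed_matches("FRTYIAEWQRTEPGADH", "FQTYAADYRTEPSSDH")
print(seeds)
```

Both seeds fall on the same diagonal (query position minus subject position is 1), which is the kind of clustering an alignment is then built from. Building the index once and looking words up is what makes this faster than filling a full dynamic programming table.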
26
Variables in local alignment-based search
algorithms
scoring matrix: the generalized log odds
substitution matrix used to score the
alignment--BLOSUM and PAM are the most
commonly used. BLOSUM 62 is default on BLAST
and BLOSUM 50 on most FASTA servers

gap penalties: gap opening penalty (for
initiating a gap) and gap extension penalty
(adding elements to an existing gap)

word size (ktup parameter in FASTA): BLAST and
FASTA are so fast partly because they start by
looking for short words that match exactly and
build up a longer alignment from these words. The
size of the starting words can be varied with this
parameter (the shorter the word the more it
slows down the program)

filter: filters sequence to get rid of low
complexity regions. Such regions can lead to
false positives due to their compositional bias.
27
Statistical significance of alignment scores: The
extreme value distribution
Raw alignment scores by themselves are not
particularly meaningful. In order to assess the
statistical significance of an alignment, i.e.
the chances that it represents a real
relationship, one must understand what the
distribution of alignment scores would be for
random pairs of sequences of similar length and
composition. Such scores obey what is called an
extreme value distribution, which is like a
normal distribution but has a positively skewed
tail. The exact characteristics of the
distribution will depend upon the scoring matrix,
the gap penalties employed, the composition of
the sequences, etc.
[Figure: example of extreme value distribution,
number of occurrences vs. alignment score. What is
the probability P that a random alignment will
have a given score or higher?]
Altschul et al. Nucleic Acids Research 25, 3389
(1997)
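For local alignment scores, the extreme value tail is usually written in the Karlin-Altschul form: the expected number of chance alignments scoring at least S between sequences of lengths m and n is E = K·m·n·exp(-λS), and P = 1 - exp(-E). The sketch below codes these formulas. The λ and K values are assumptions: they are the figures commonly quoted for BLOSUM 62 with gap penalties 11/1, and in practice they depend on the matrix, the gap penalties and the sequence composition, as noted on this slide. Sequence lengths of 150 are arbitrary, and edge-effect corrections are ignored.

```python
import math

# Assumed parameters (commonly quoted for BLOSUM 62, gap open 11, extend 1).
LAMBDA = 0.267
K = 0.041


def expected_chance_hits(raw_score: float, m: int, n: int) -> float:
    """Expected number of chance alignments scoring >= raw_score."""
    return K * m * n * math.exp(-LAMBDA * raw_score)


def p_value(raw_score: float, m: int, n: int) -> float:
    """Probability of at least one chance alignment scoring >= raw_score."""
    return -math.expm1(-expected_chance_hits(raw_score, m, n))


def bit_score(raw_score: float) -> float:
    """Normalized score, independent of the scoring system's scale."""
    return (LAMBDA * raw_score - math.log(K)) / math.log(2)


for score in (30, 50, 101):
    print(score, round(bit_score(score), 1), p_value(score, 150, 150))
```

The exponential dependence on the score is the key point: each additional point of raw score cuts the chance probability by a constant factor.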
28
Statistical significance of alignment scores:
Z-scores, P-values and E-values
A Z-score is the number of standard deviations
between the alignment score and the mean of a
normal distribution. The FASTA algorithm reports
Z-scores in its output.
A P-value is the probability that an alignment
between two random sequences will have a score
equal to or greater than the observed score, as
calculated from the extreme value distribution.
The E-value or expect value represents the number
of times that the observed score or higher would
be expected by chance when searching a database
of D sequences. For cases where P < 0.1, E ≈ DP.
Both FASTA and BLAST report E-values for
alignments.
Basically, to be confident that a match between
two sequences represents true homology, you
generally want an E-value < 0.01. That means
there's roughly a 1 in 100 chance that you have
a false positive. It has been shown (Brenner et
al. 1998) that FASTA and BLAST E-values do a
pretty good job of distinguishing true and false
positives.
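The two simplest quantities on this slide can be written down directly. In the sketch below, the background scores used for the Z-score are invented numbers standing in for scores against shuffled or unrelated sequences, and the function names are illustrative. The E-value uses the E ≈ DP relation from the slide.

```python
from statistics import mean, stdev


def z_score(score: float, background_scores: list) -> float:
    """Standard deviations between a score and the background mean."""
    return (score - mean(background_scores)) / stdev(background_scores)


def e_value(p: float, database_size: int) -> float:
    """Expected number of chance hits in a database of D sequences."""
    return database_size * p


# Invented background scores, e.g. from aligning against shuffled sequences.
background = [10, 12, 14, 16, 18]
print(round(z_score(30, background), 2))

# A pairwise P-value of 1e-6 in a database of 10,000 sequences.
print(e_value(1e-6, 10_000))
```

The second example shows why the database size matters: a pairwise P-value that looks tiny becomes an E-value right at the 0.01 confidence threshold once 10,000 comparisons are made.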
29
Sample BLAST output
alignment score
E-value
GenBank identifier
positives means positions at which scoring
matrix element is positive
percent positives is sometimes also called
percent similarity
30
BLAST and FASTA can identify some homologues in
the twilight zone--20 to 30% identity
Score = 43.5 bits (101), Expect = 0.001
Identities = 36/145 (24%), Positives = 56/145 (37%), Gaps = 2/145 (1%)

Query   2  LSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDL  61
           L   E   V   W KV  D  G     L RL    P T   F  F  L T
Sbjct   4  LTPEEKSAVTALWGKVNVDEVGG--EALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV  61

Query  62  KKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPG  121
           K HG  VL A    L            L   H  K                 VL
Sbjct  62  KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK  121

Query 122  DFGADAQGAMNKALELFRKDIAAKY  146
            F    Q A  K         A KY
Sbjct 122  EFTPPVQAAYQKVVAGVANALAHKY  146

BLAST alignment of hemoglobin and myoglobin
Even though sequence identity here is low, the
E-value is statistically significant
31
Pairwise alignments will not detect all homologies
Although pairwise local alignment routines
employing generalized substitution matrices do a
good job of avoiding false positives, their
coverage is imperfect (though it's better than
just using identity). That is, there will be
many relationships that they will miss because
the sequences have diverged too far.
A big challenge in bioinformatics is finding these
remote homologs.
[Figure annotations:]
homology identified independently in this trial
database by structural/functional similarity
EPQ means errors per query, ideally like E < 0.01
(1 in 100 chance of false positive)
SSEARCH is a program employing the
Smith-Waterman algorithm
from Brenner et al. PNAS 95, 6073 (1998)
32
Multiple alignment of sequences
Conservation patterns observed in families of
homologous sequences carry much more useful
information than do single sequences, both from
the point of view of understanding structure and
function for a family, as well as for identifying
whether a particular sequence is homologous to a
particular family. Obtaining this information
depends upon the ability to generate alignments
of multiple related sequences.
We aren't going to have time to talk about
methods for multiple alignment. Some of the
better known methods/websites, such as ClustalX
for global multiple alignment, will be listed as
links on the course website. I recommend Chapter
4 of David Mount's Bioinformatics for thorough
coverage of the topic. We're going to focus
instead on what one can do with multiply aligned
sequences.
33
Position-dependent scoring matrices or profiles
of sequence families can be generated from
multiple alignments
Each row in the matrix is constructed by weighting
a generalized substitution matrix by the appearance
of the different amino acids at that position in
the alignment. For example, this row might be made
from an equal weight of the E, G, V and L columns
in, say, a PAM250 matrix. The resulting matrix
contains position-dependent information about
sequence conservation within a particular family
of sequences, as opposed to a generalized
scoring matrix, which is constructed by averaging
general sequence conservation tendencies among
many families of related sequences.
Gribskov, McLachlan & Eisenberg, PNAS 84, 4355
(1987)
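The row construction described on this slide amounts to a frequency-weighted average of substitution matrix columns. The sketch below shows that averaging for a reduced four-letter alphabet. The matrix values are invented for illustration and are NOT PAM250 entries, and details of real profile methods (sequence weighting, position-specific gap penalties) are left out. All names are illustrative.

```python
from collections import Counter

# Toy substitution scores over a reduced alphabet. Invented values for
# illustration only; these are not taken from PAM250 or any real matrix.
TOY_MATRIX = {
    "E": {"E": 4, "G": 0, "V": -2, "L": -3},
    "G": {"E": 0, "G": 5, "V": -1, "L": -4},
    "V": {"E": -2, "G": -1, "V": 4, "L": 2},
    "L": {"E": -3, "G": -4, "V": 2, "L": 6},
}


def profile_row(column, matrix):
    """Score for each possible residue at one alignment position.

    Each score is the matrix entry averaged over the residues observed
    in the column, weighted by how often each one appears.
    """
    counts = Counter(column)
    total = len(column)
    return {
        target: sum((n / total) * matrix[res][target]
                    for res, n in counts.items())
        for target in matrix
    }


def build_profile(aligned_seqs, matrix):
    """One profile row per alignment column."""
    return [profile_row(column, matrix) for column in zip(*aligned_seqs)]


# A position where E, G, V and L each appear once (equal weights).
mixed_row = profile_row("EGVL", TOY_MATRIX)
# A position that is always L.
conserved_row = profile_row("LLLL", TOY_MATRIX)
print(mixed_row)
print(conserved_row)
```

The conserved position reproduces the matrix row for L, while the mixed position flattens toward zero. That difference between positions is the information a generalized matrix cannot carry.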
34
Examples of models generated from multiple
alignments
profiles
position-specific scoring matrices (PSSM)
(these two are almost the same thing)
hidden Markov models (HMM)
These models can be generated for lengthy
sequences or for short ungapped conserved regions
(blocks or motifs)
35
PSI-BLAST (Position-Specific Iterated)
Altschul et al. Nucleic Acids Research 25, 3389
(1997)
[Flowchart: query sequence -> initial BLAST search
-> hits with significant similarity (e.g. E < 0.005)
-> multiple alignment of hits -> PSSM -> iterated
BLAST search using the PSSM as query. Utility
obviously depends on getting some seed hits.]
The utility of PSI-BLAST in finding more remote
homologues than simple pairwise searches has been
demonstrated. An example of a similar program
that uses a hidden Markov model rather than a
PSSM is SAM-T99 (now SAM-T02)
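The control flow of the iteration on this slide can be sketched independently of any real search engine. In the code below, the three callables are hypothetical stand-ins supplied by the caller, not functions from BLAST or any library, and the toy versions exist only so the loop can be run. The 0.005 cutoff is the example value from the slide.

```python
def iterated_profile_search(query, search_sequence, search_profile,
                            build_profile, e_cutoff=0.005, max_rounds=5):
    """Iterate search -> profile -> search until no new hits appear.

    Hypothetical callables supplied by the caller:
      search_sequence(query)      -> iterable of (hit_name, e_value)
      build_profile(query, hits)  -> any profile object (e.g. a PSSM)
      search_profile(profile)     -> iterable of (hit_name, e_value)
    """
    hits = {name for name, e in search_sequence(query) if e < e_cutoff}
    for _ in range(max_rounds):
        if not hits:
            break  # no seed hits, so no profile can be built
        profile = build_profile(query, sorted(hits))
        found = {name for name, e in search_profile(profile) if e < e_cutoff}
        if found <= hits:
            break  # converged: the profile recovered nothing new
        hits |= found
    return hits


# Toy stand-ins so the loop can be exercised without a real database.
def toy_search_sequence(query):
    return [("a", 1e-5), ("b", 0.5)]


def toy_build_profile(query, hits):
    return tuple(hits)


def toy_search_profile(profile):
    found = [("a", 1e-6), ("b", 0.2)]
    if "a" in profile:
        found.append(("c", 1e-3))  # reachable once "a" is in the profile
    if "c" in profile:
        found.append(("d", 1e-4))  # reachable only after "c" is included
    return found


result = iterated_profile_search("QUERY", toy_search_sequence,
                                 toy_search_profile, toy_build_profile)
print(sorted(result))
```

In the toy run, hit "d" is found only in the second profile round, after "c" has been added to the profile. That mirrors how repeated rounds reach progressively more remote relatives. The early exit on an empty hit set reflects the slide's point that everything depends on getting some seed hits.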
36
Example of utility of PSI BLAST
two BRCT domains from BRCA1 used as query
initial BLAST with cutoff of E < 0.01 brings up
only BRCT domains from other BRCA1s (orthologues)
few false positives were found using E < 0.01 cutoff
repeated rounds of PSI-BLAST bring up many
others and reveal first plant protein to contain
BRCTs
Altschul et al. Nucleic Acids Research 25, 3389
(1997)
37
Searching profile databases
database of HMMs, PSSMs
query sequence
A number of researchers have used similarity
searches to cluster the known proteins into
homologous groups, and then generated profiles
for each cluster using HMMs or PSSMs. Servers
now allow one to do similarity searches of these
database profiles using a single query sequence.
This is qualitatively the reverse of what is done
in PSI-BLAST, in which one generates a profile
and uses it to match individual database
sequences. Some of these profiles represent
motifs or short ungapped blocks, whereas others
are the length of entire domains. Among the best
known collections of domain profiles are SMART
and Pfam. These two form part of what is now
called the Conserved Domain Database (CDD).
BLAST searches with the NCBI server will now
automatically do a search against the CDD unless
you opt not to.