Title: Sequence Evolution
1Sequence Evolution
- Consider how DNA and amino acid sequences evolve
- All comparative sequence analysis in bioinformatics depends on understanding evolution
- If you do not understand the mechanism, process, or model under which a sequence can evolve, how can you know how to compare different sequences?
2Evolutionary Distance
- Amount of DNA or protein sequence divergence between individuals or species
- Evolutionary distance is the total number of substitutions that have occurred in two sequences since their divergence from the common ancestor
- Measured as the number of substitutions that have occurred per site
3Evolution of DNA sequences
[Figure: the common ancestor sequence GCAAGAGATA diverges over time t into Mouse (GGAAGAGATA) and Rat (GCAAGAGATA); a C → G substitution occurs on the Mouse lineage]
4Number of Differences
Rat   GCAAGAGATA
Mouse GGAAGAGATA
- How many differences are there between the Mouse and Rat sequences?
- 1
- What proportion of sites are different?
- p = 1 / 10 = 0.1
- This is known as the p-distance
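The calculation above can be sketched in a few lines of Python. This is a minimal illustration, assuming the two sequences are already aligned and of equal length; the function name `p_distance` is ours, not part of the slides.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


rat = "GCAAGAGATA"
mouse = "GGAAGAGATA"
print(p_distance(rat, mouse))  # 1 difference over 10 sites
```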
5Continuing Evolution over time
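[Figure: the same tree with an intermediate time point; 1 myr after the common ancestor (GCAAGAGATA) one lineage carries the C → G substitution (GGAAGAGATA) and the other is unchanged (GCAAGAGATA); both continue to evolve over time t to the present-day Mouse and Rat sequences]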
6Relationship of p-distance with time
[Figure: p-distance (y-axis, 0.0 to 1.0) plotted against time in million years (x-axis, 0.0 to 3.5)]
7About p-distance
- What is the theoretical maximum p-distance?
- p_max = 0.75
- Why this value?
- There are 4 nucleotides (A, C, T, G). If two sequences are completely unrelated, there is a 25% chance of a site being identical due to random chance (and thus a 75% chance of it being different)
- Thus random sequences should be different at 75% of the sites
- Why does p-distance underestimate the actual number of substitutions?
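The 0.75 ceiling can be checked empirically. The sketch below assumes equal base frequencies and independent sites (the same assumptions as the argument above); the sequence length and random seed are arbitrary choices for illustration.

```python
import random


def random_sequence(length: int, rng: random.Random) -> str:
    """Draw each site independently and uniformly from the four nucleotides."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def p_distance(seq_a: str, seq_b: str) -> float:
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


rng = random.Random(42)          # fixed seed so the run is repeatable
x = random_sequence(100_000, rng)
y = random_sequence(100_000, rng)
p_random = p_distance(x, y)
print(round(p_random, 3))        # should be close to 0.75
```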
8Multiple Substitutions
Start: 1 difference
Multiple substitutions at the same site
Result: 1 additional difference, but 3 substitutions
9Back Substitutions
Start: 1 difference
Different substitutions at the same position (site), producing the original nucleotide
Result: 0 extra differences created for 2 substitutions
10Coincidental Substitutions
Start: 1 difference
Different substitutions at the same position (site) in different lineages
Result: 1 extra difference created for 2 substitutions
11Parallel Substitutions
Start: 1 difference
Exactly the same substitution from the same nucleotide at the same position (site) in different lineages
Result: 0 extra differences created for 2 substitutions
12Convergent Substitutions
Start: 2 differences
Exactly the same substitution from different nucleotides at the same position (site) in different lineages
Result: 1 difference; evolutionary divergence erased by two substitutions
13How to transform p-distance into true distance?
- Simple model: purely random; every nucleotide at every site in a sequence has an equal probability of mutating into any other nucleotide
- Known as the Jukes-Cantor (1969) model
14Jukes-Cantor model
        To
From    A   T   C   G
A       -   a   a   a
T       a   -   a   a
C       a   a   -   a
G       a   a   a   -
a is the probability of each specific change per year
The total probability of change of any nucleotide is r = 3a; r is equal to the rate of substitution per site per year
15Jukes-Cantor model
- Consider sequences X and Y, diverged from a common ancestor $t$ years ago
- $q_t$ is the proportion of sites which are identical
- $p_t$ is the proportion of sites which are different, i.e., $p_t = 1 - q_t$
- What happens at time $t + 1$?
- Sites which are identical at time $t$ will remain identical with probability $(1 - r)^2$
- This can be approximated as $1 - 2r$ because $r^2$ is a very small term
- Sites which were different at time $t$ can become identical at time $t + 1$ with probability $2r/3$
- If X and Y have nucleotides i and j at a site at time $t$, they will become identical if
- i in X changes to j, but j in Y stays the same
- j in Y changes to i, but i in X stays the same
- Probability of each scenario is $(1 - r)a = (1 - r)r/3$
- Total probability is $2(1 - r)r/3 = 2r/3 - 2r^2/3$
- Drop the $r^2$ term (very small) and we get $2r/3$
16Jukes-Cantor model
Therefore we can write the following:
$$q_{t+1} = (1 - 2r)\,q_t + \tfrac{2}{3}\,r\,(1 - q_t)$$
The first term is the number of formerly identical sites which are still identical, and the second term is the number of formerly different sites which are now identical.
This can be rewritten as
$$q_{t+1} - q_t = \tfrac{2r}{3} - \tfrac{8r}{3}\,q_t$$
Changing to calculus, this becomes
$$\frac{dq}{dt} = \frac{2r}{3} - \frac{8r}{3}\,q$$
When $q = 1$ at $t = 0$ (i.e., the sequences are identical),
$$q = 1 - \tfrac{3}{4}\left(1 - e^{-8rt/3}\right)$$
The expected number of substitutions per site, $d$, for two sequences is $2rt$. Substituting, we get
$$q = 1 - \tfrac{3}{4}\left(1 - e^{-4d/3}\right)$$
and solving for $d$:
$$d = -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}(1 - q)\right) = -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}\,p\right)$$
This is the Jukes-Cantor distance
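As a small worked example, the Jukes-Cantor correction d = -(3/4) ln(1 - (4/3) p) can be written as a function. This is a sketch only; the function name and the choice to raise an error at p ≥ 0.75 (where the logarithm is undefined) are our assumptions, not something the slides specify.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor distance from an observed p-distance.

    The correction is undefined at or above the saturation value p = 0.75.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Mouse vs. Rat example from the earlier slide: p = 0.1
print(round(jukes_cantor_distance(0.1), 4))  # slightly larger than 0.1
```

For small p the correction is tiny, and it grows rapidly as p approaches 0.75.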
17Jukes-Cantor (JC) Distance
[Figure: estimated number of substitutions per site (y-axis, 0.0 to 1.5) plotted against actual number of substitutions per site (x-axis, 0.0 to 1.5), with one curve for the JC distance and one for the p-distance]
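The relationship in this figure can be reproduced with a small simulation. The sketch below assumes the Jukes-Cantor model (each substitution hits a random site and changes it to one of the three other nucleotides with equal probability); the sequence length, seed, and tested distances are arbitrary illustration values.

```python
import math
import random


def simulate_p_distance(length: int, n_substitutions: int, rng: random.Random) -> float:
    """Apply random substitutions to a copy of a random ancestor; return the p-distance."""
    ancestor = [rng.choice("ACGT") for _ in range(length)]
    descendant = list(ancestor)
    for _ in range(n_substitutions):
        site = rng.randrange(length)
        # A substitution always changes the current nucleotide to a different one.
        descendant[site] = rng.choice([b for b in "ACGT" if b != descendant[site]])
    return sum(a != b for a, b in zip(ancestor, descendant)) / length


rng = random.Random(1)
length = 20_000
results = []
for d_actual in (0.1, 0.5, 1.0):
    p = simulate_p_distance(length, round(d_actual * length), rng)
    d_jc = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
    results.append((d_actual, p, d_jc))
    print(f"actual={d_actual:.2f}  p-distance={p:.3f}  JC={d_jc:.3f}")
```

The p-distance falls further below the actual number of substitutions as divergence grows, while the JC estimate stays close to it.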
18Is the Jukes-Cantor distance accurate?
- What are its assumptions?
- Each nucleotide (A, C, G, T) occurs with equal frequency (i.e., 25% each)
- All sites in a sequence have the same mutation rate
- The rates of all substitutions are identical (e.g., A → C = A → G = A → T)
- Reversibility: C → G = G → C
19Transitions & Transversions
The four DNA bases fall into two structural categories
Purines: Adenine, Guanine (double ring of 9 atoms)
Pyrimidines: Cytosine, Thymine (single ring of 6 atoms)
A mutation of the same type (purine to purine or pyrimidine to pyrimidine) is a transition. A mutation between types is a transversion.
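The definition translates directly into a classifier. The sketch below is illustrative; the function and constant names are our own.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(base_a: str, base_b: str) -> str:
    """Label a pair of nucleotides as identical, a transition, or a transversion."""
    if base_a == base_b:
        return "identical"
    pair = {base_a, base_b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"      # purine <-> purine or pyrimidine <-> pyrimidine
    return "transversion"        # purine <-> pyrimidine


print(classify_substitution("A", "G"))  # both purines
print(classify_substitution("C", "G"))  # pyrimidine vs. purine
```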
20Purines and Pyrimidines
[Figure: ring structures of the four bases A, T, C, and G, grouped into purines and pyrimidines]
21Transition Bias
Transitions are observed to occur more often than transversions
- Mutational Bias
- Biochemical mispairing (Topal & Fresco 1976)
- Selective Bias
- Transitions are more often synonymous
- Transitional amino acid changes are often less severe than transversional changes (Grantham 1974; Zhang 2000)
22Purines and Pyrimidines
[Figure: ring structures of the four bases A, T, C, and G, grouped into purines and pyrimidines]
23Kimura's Two-Parameter model (1980)
        To
From    A   T   C   G
A       -   b   b   a
T       b   -   a   b
C       b   a   -   b
G       a   b   b   -
a is the probability of a transitional change per year
b is the probability of a transversional change per year
The total probability of change of any nucleotide is r = a + 2b
24Kimura's Two-Parameter model (1980)
The total probability of change of any nucleotide is r = a + 2b
d is expected to be 2rt = 2at + 4bt
Therefore, using the same sort of approach as before:
$$d = -\tfrac{1}{2}\ln(1 - 2P - Q) - \tfrac{1}{4}\ln(1 - 2Q)$$
where P is the observed proportion of sites with transitional differences and Q is the observed proportion of sites with transversional differences
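A sketch of the Kimura two-parameter distance follows, using the standard form d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), with P and Q taken as proportions of sites. The helper names are ours, and the input sequences are assumed to be aligned and free of gaps.

```python
import math

PURINES = {"A", "G"}


def transition_transversion_proportions(seq_a: str, seq_b: str) -> tuple[float, float]:
    """Return (P, Q): proportions of sites with transitional / transversional differences."""
    transitions = transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        # Same structural class (both purines or both pyrimidines) means a transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    n = len(seq_a)
    return transitions / n, transversions / n


def k2p_distance(P: float, Q: float) -> float:
    return -0.5 * math.log(1.0 - 2.0 * P - Q) - 0.25 * math.log(1.0 - 2.0 * Q)


# Rat vs. Mouse from the earlier slide: one C/G difference, a transversion.
P, Q = transition_transversion_proportions("GCAAGAGATA", "GGAAGAGATA")
print(P, Q, round(k2p_distance(P, Q), 4))
```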
25Transition-Transversion Bias
The ratio of the transitional substitution rate to the transversional substitution rate, k = a / b, is known as the transition bias. When measured on a gene-by-gene basis, this value can vary from 0.15 to 48. This variation turns out to be strongly related to sequence length.
[Figure: k estimates for 3,712 Human-Mouse gene pairs plotted against number of nucleotides]
The variation is due to statistical sampling error and does not necessarily represent true differences among genes. For mammals, the bias is approximately 3.6 for neutrally evolving sites.
26Special Case CpG dinucleotides
In mammals, cytosine is usually methylated when it is followed immediately by a guanine (5' to 3' direction), and the methylated C will often spontaneously deaminate into a thymine: CG → TG. This transitional mutation occurs up to 10 times faster than any other mutation.
In humans, C and G each make up about 21% of the nucleotides (total GC content 42%). The expected proportion of dinucleotide pairs being C followed by G is therefore 0.21 × 0.21 ≈ 4%. The observed proportion of CpG in humans is 0.8%.
There are large stretches of chromosomes where Cs are not methylated; these show the expected proportions of CpG and are known as CpG islands.
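The expected-versus-observed comparison can be sketched as follows. The function name and the toy sequence are hypothetical; the 0.21 and 0.8% figures are the ones quoted above.

```python
def cpg_observed_expected(seq: str) -> tuple[float, float]:
    """Return (observed CpG proportion, expected proportion f_C * f_G) for one sequence."""
    seq = seq.upper()
    n = len(seq)
    f_c = seq.count("C") / n
    f_g = seq.count("G") / n
    dinucleotides = n - 1
    observed = sum(1 for i in range(dinucleotides) if seq[i:i + 2] == "CG") / dinucleotides
    return observed, f_c * f_g


# Numbers from the slide: expected 0.21 * 0.21, observed 0.8%.
expected_human = 0.21 * 0.21
observed_human = 0.008
print(round(expected_human, 4), round(observed_human / expected_human, 2))

# Toy sequence, purely for illustration.
obs, exp = cpg_observed_expected("ACGTCG")
print(obs, round(exp, 3))
```

The observed/expected ratio is a common way to express CpG depletion; values near 1 are what the text describes for CpG islands.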
27Nucleotide Frequencies
Both models assume that A, C, G, and T all occur with equal frequency. We've already discussed that this is not usually true.
28HKY model (1985)
        To
From    A      T      C      G
A       -      b·fT   b·fC   a·fG
T       b·fA   -      a·fC   b·fG
C       b·fA   a·fT   -      b·fG
G       a·fA   b·fT   b·fC   -
a is the probability of a transitional change per year
b is the probability of a transversional change per year
fX is the expected frequency of nucleotide X
The Tamura-Nei model (1993) is an almost identical variant where purine transitions and pyrimidine transitions are allowed to have different rates (a1 and a2)
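The HKY table can be generated programmatically, which makes the pattern easy to see. This is a sketch under stated assumptions: the values of a, b, and the base frequencies below are hypothetical and chosen only for illustration.

```python
BASES = "ATCG"
TRANSITION_PAIRS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky_rates(a: float, b: float, freqs: dict[str, float]) -> dict[tuple[str, str], float]:
    """Off-diagonal HKY rates: (a or b) times the frequency of the target nucleotide."""
    rates = {}
    for source in BASES:
        for target in BASES:
            if source == target:
                continue
            base_rate = a if (source, target) in TRANSITION_PAIRS else b
            rates[(source, target)] = base_rate * freqs[target]
    return rates


# Hypothetical parameter values, for illustration only.
freqs = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}
rates = hky_rates(a=4.0, b=1.0, freqs=freqs)
for source in BASES:
    print(source, [rates.get((source, target), "-") for target in BASES])
```

With equal frequencies (0.25 each) the table reduces to the Kimura two-parameter pattern, and with a = b as well it reduces to Jukes-Cantor.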
29There are many other models
- General reversible: each nucleotide pair has an independent rate
- AG, AT, AC, GT, GC, TC are all separately modeled
- 6 parameter model
- Reversible because A → G = G → A, etc.
- Non-reversible/unrestricted model: like above but without the assumption of reversibility
- 12 parameter model
30Site Equality
- All of these models assume that every site has the same substitution rate.
- Again, this is often not the case. Nucleotide sites might mutate at different rates because of
- Coding vs. non-coding regions
- Introns vs. exons
- 1st vs 2nd vs 3rd codon positions
- Local effects
- Most of the models can be adapted by allowing site rate variation, usually using models based on the gamma distribution
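To give a feel for gamma-distributed site rates, the sketch below draws a relative rate for each site from a gamma distribution with mean 1. The shape values (0.5 and 5.0), the number of sites, and the seed are hypothetical; a smaller shape value means more variation among sites.

```python
import random


def gamma_site_rates(n_sites: int, alpha: float, rng: random.Random) -> list[float]:
    """Relative rates with shape alpha and scale 1/alpha, so the mean rate is 1."""
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


rng = random.Random(0)
summary = {}
for alpha in (0.5, 5.0):
    rates = gamma_site_rates(50_000, alpha, rng)
    mean_rate = sum(rates) / len(rates)
    nearly_invariant = sum(r < 0.1 for r in rates) / len(rates)
    summary[alpha] = (mean_rate, nearly_invariant)
    print(f"alpha={alpha}: mean rate={mean_rate:.2f}, sites with rate < 0.1: {nearly_invariant:.1%}")
```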
31Evolutionary Rate
Although it is obviously dependent on circumstance, in mammals, the best estimate for the overall average rate of substitution at neutral sites is 2 × 10^-9 substitutions per site per year
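Combined with d = 2rt from the Jukes-Cantor slides, this rate gives a quick back-of-the-envelope estimate of expected divergence. The divergence time used below (10 million years) is a hypothetical value chosen only to make the arithmetic concrete.

```python
import math

r = 2e-9                  # substitutions per site per year (from the slide)
t = 10_000_000            # hypothetical divergence time in years

d = 2 * r * t             # both lineages accumulate substitutions, hence the factor 2
# Expected observed p-distance under Jukes-Cantor for that d.
p_expected = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

print(f"expected substitutions per site d = {d:.3f}")
print(f"expected p-distance under JC     = {p_expected:.4f}")
```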
32Codons
For the most part we've discussed DNA sequence evolution without regard to how DNA is processed. For coding genes, changes in DNA may lead to changes in proteins.
Synonymous substitutions: DNA substitutions in coding sequence that do not change the amino acid sequence
Nonsynonymous substitutions: DNA substitutions in coding sequence that do change the amino acid sequence
This is mediated through codons
33Codons
Mutations at the 2nd codon position are always nonsynonymous
Mutations at the 1st codon position are usually nonsynonymous (exceptions: Leu, Arg)
Mutations at the 3rd codon position can be either
Substitution rate varies by position as expected: 3rd > 1st > 2nd
34Codons
The codon pattern also explains an additional reason for transition bias: transitions are more likely to be synonymous than transversions
35Codons
Sites in which any mutation is synonymous are known as 4-fold degenerate. Example: GCx
Sites in which only transitions are synonymous are known as 2-fold degenerate. Example: AGx
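The two examples can be checked against the standard genetic code. The sketch below builds the codon table from the usual compact 64-letter encoding (bases ordered T, C, A, G); the helper name `synonymous_third_bases` is ours.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def synonymous_third_bases(codon: str) -> set[str]:
    """Third-position bases that leave the encoded amino acid unchanged."""
    target = CODON_TABLE[codon]
    return {b for b in BASES if CODON_TABLE[codon[:2] + b] == target}


print(sorted(synonymous_third_bases("GCT")))  # GCx: all four bases (4-fold degenerate)
print(sorted(synonymous_third_bases("AGT")))  # AGx: only the transition partner (2-fold)
print(sorted(synonymous_third_bases("AGA")))
```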
36Amino Acid/Protein Sequences
Builds directly from DNA sequence evolution
p-distance is the proportion of sites that differ between two amino acid sequences
p-distance will underestimate the true distance for all of the same reasons that it is a problem with DNA (convergence, etc.)
At its simplest, protein evolution can be modeled with the assumption that any amino acid can mutate into any other amino acid with equal probability. This is known as the Poisson model (it is the protein equivalent of the Jukes-Cantor model)
Using the same logic as JC, we find
$$d = -\ln(1 - p)$$
This is the Poisson distance
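A short sketch of the Poisson correction in its usual form, d = -ln(1 - p), where p is the amino acid p-distance. The function name and the example value of p are illustrative assumptions.

```python
import math


def poisson_distance(p: float) -> float:
    """Poisson-corrected amino acid distance from an observed p-distance."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return -math.log(1.0 - p)


# Hypothetical example: 30% of aligned amino acid sites differ.
print(round(poisson_distance(0.3), 4))
```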
37Amino Acid/Protein Sequences
- The Poisson distance suffers from many of the same problems as Jukes-Cantor. It assumes
- All sites mutate with equal frequency
- All amino acids mutate to each other with the same rate
- etc.
- The equal-rate assumption is particularly problematic. Just examine the codon table
38Amino Acid/Protein Evolution
Is a Proline equally likely to mutate into a
Threonine as it is to a Glycine?
39Amino Acid/Protein Evolution
An interesting thing is that the codon table is not random. Amino acids with similar physicochemical properties (e.g., polarity or charge) tend to be single mutational events apart, while those that are more different require more steps.
Evidence of selection: it is generally preferable to replace an amino acid with one with similar properties (less chance of negative effect on the organism), and the evolution of the codon table supports that logic.
40Amino Acid/Protein Evolution
Amino acids or codons are much more difficult to model than DNA because of the greater complexity.
Thus we often use empirical data-based models rather than theoretical models (such as HKY or general reversible).
The original and most famous of these is known as the PAM or Dayhoff model.
41PAM/Dayhoff Matrix
         A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
Ala A 9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
Arg R    1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
Asn N    4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
Asp D    6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
Cys C    1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Gln Q    3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
Glu E   10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
Gly G   21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
His H    1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
Ile I    2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
Leu L    3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
Lys K    2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
Met M    1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
Phe F    1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
Pro P   13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
Ser S   28   11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5    2    2
Thr T   22    2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0    2    9
Trp W    0    2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0 9976    1    0
Tyr Y    1    0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2 9945    1
Val V   13    2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0    2 9901
The probability of each specific amino acid substitution (multiplied by 10,000), scaled so that 1 substitution is expected per 100 sites
42Amino Acid/Protein Evolution
There are now many variants of this matrix, built with newer and more data, including such things as PAM 1, PAM 120, PAM 250, BLOSUM 1, BLOSUM 62, BLOSUM 45, etc. These are all based on the same principle, but are designed for different organisms or different expected divergences.
These matrices can be used to model protein evolution without direct reference to the underlying DNA or codons.
43Other Evolutionary Events
- Insertions and Deletions
- Chunks of DNA can be added or removed from the chromosome; these can be as small as a single base or as large as hundreds of thousands
- With only a few sequences it can be very difficult to distinguish whether a change in length was due to an insertion or deletion; thus they are often referred to with the combined term Indel
ACTCGTCATCGACTTAACGACTCATCGTA
2 base deletion
7 base insertion
ACGTCATCGACCGTACGTTTAACGACTCATCGTA
44Other Evolutionary Events
- Inversions
- Chunks of DNA can be flipped into reverse order
ACTCGTCATCGACTTAACGACTCATCGTA
ACTCGTCATATTCAGCACGACTCATCGTA
45Other Evolutionary Events
- Segmental duplications
- Large chunks of DNA can be accidentally repeated
- When this happens on a large scale, it leads to
multiple copies of entire genes and gene complexes
ACTCGTCATCGACTTAACGACTCATCGTA
ACTCGTCATCGACTTACGACTTAACGACTCATCGTA
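The indel, inversion, and segmental duplication examples on these slides can be reproduced with plain string operations. The slice positions below are our reading of where each event occurred; for the insertion in particular, other placements of the 7 inserted bases would give the same final sequence.

```python
original = "ACTCGTCATCGACTTAACGACTCATCGTA"

# Indel slide: a 2-base deletion ("TC" at positions 2-3), then a 7-base insertion.
after_deletion = original[:2] + original[4:]
after_indels = after_deletion[:11] + "CGTACGT" + after_deletion[11:]

# Inversion slide: the 7-base segment at positions 9-15 is flipped into reverse order.
segment = original[9:16]
inverted = original[:9] + segment[::-1] + original[16:]

# Segmental duplication slide: the same 7-base segment is repeated in tandem.
duplicated = original[:16] + segment + original[16:]

print(after_indels)
print(inverted)
print(duplicated)
```

Note that the inversion shown on the slide is a simple reversal of the segment, not a reverse complement.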
46Other Evolutionary Events
- Duplications also can occur through polyploidy
- Polyploidy is the duplication of the entire chromosome set
- A diploid organism has 2n chromosomes
- A breakdown in meiosis can lead to a tetraploid with 4n chromosomes
- This has been very common in plant evolution, but can also occur in animals
- There is strong evidence this happened at least twice in the early history of vertebrates
47Duplicated Genes
- What happens if a gene is duplicated (whether due to polyploidy or a segmental duplication)?
- We went from one copy of a gene to two copies
- At the beginning the genes are identical: redundancy
- Mutations will gradually alter the genes
- Possible outcomes
- One gene will become non-functional (a pseudogene); at this point selection acts to maintain the other gene (assuming it is necessary for life)
- This is the most common outcome of gene duplication
- One gene will gain a new useful function; now both genes are potentially maintained by selection
- This is the rarest outcome of gene duplication
48Duplicated Genes
- Most genes have more than one function
- A pair of duplicated genes can diverge so that each takes over a different function
- If this happens, selection will then work to preserve both genes because each can serve a specialized function where the original gene served as a generalist
49Duplicated Genes
Gene A
Promoter 1
Promoter 2
Each promoter serves a different function
50Gene Families
- A gene family is a set of genes related to each other through evolutionary duplication events; they often serve similar functions and have similar DNA sequences
- Although usually defined within a species, gene families can cross multiple species as well
- Gene families can contain anywhere from two to hundreds (maybe thousands) of genes
- Examples
- Hox gene family: a set of regulatory genes very heavily involved in the control of animal body plans
- Olfactory receptors: pretty much all olfactory receptor proteins are part of a single gene family
51Homology
Homology: similarity due to inheritance from a common ancestor
Homology is a critical concept in evolutionary biology; it is also an extremely important concept (although often not recognized as such) in bioinformatics
Similarity which is NOT due to inheritance from a common ancestor is Analogy. Analogy can be due to things such as convergence or parallel evolution
52Homology Example: Tetrapod Limbs
53Homology Example: Mammalian Necks
Mammals have 7 cervical vertebrae in their necks
[Figure: the 7 cervical vertebrae (numbered 1 to 7) of a Human and a Giraffe]
54Homology Example: Mammalian Necks - Exceptions
Only exceptions: Manatees have 6 cervical vertebrae. Two-toed sloths have 6. Three-toed sloths have 9.
Do you think the presence of six cervical
vertebrae is homologous in manatees and two-toed
sloths?
55Are these homologous?
- As wings?
- As tetrapod fore-limbs?
- Bird wings and bat wings?
To answer this question you must ask: Did the common ancestor have this trait?
56Sequence Homology
- With respect to sequences, homology has three distinct, perfectly valid meanings
- Sequences are homologous if they are descended from a common ancestor
- 16S rRNA is found in almost all living organisms; it is homologous among all living things because they all inherited it from their common ancestor
- Between a pair of sequences, specific sites are said to be homologous if the position within the sequence is the same as in the common ancestor
- Nucleotides (or amino acids) between a pair of sequences are said to be homologous if they (a) are at a homologous site, (b) show the same character (e.g., both sequences have adenine), and (c) both have that character because it was inherited from the common ancestor
57Sequence Homology
These are unaligned sequences: homologous sites are not in the same column
ACTCGTCATCGACTTAACGACTCATCGTA
ACGACCTCGTCCGTACGTTTACCGAATCATCCTA
58Sequence Homology
These are aligned sequences: homologous sites are in the same column
Homologous characters (the same nucleotide at a homologous site) are marked in yellow on the slide
ACTCGTCATCGAC-------TTAACGACTCATCGTA
A--CGACCTCGTCCGTACGTTTACCGAATCATCCTA
If we actually knew the ancestral sequence it might turn out that even some of these are not actually homologous!
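Once sequences are aligned, columns can be compared directly. The sketch below tallies the alignment from this slide; treating every identical column as a candidate homologous character and excluding gapped columns from the p-distance are simplifying assumptions on our part.

```python
aligned_1 = "ACTCGTCATCGAC-------TTAACGACTCATCGTA"
aligned_2 = "A--CGACCTCGTCCGTACGTTTACCGAATCATCCTA"

matches = mismatches = gapped = 0
for a, b in zip(aligned_1, aligned_2):
    if a == "-" or b == "-":
        gapped += 1          # one sequence has no residue at this site (indel)
    elif a == b:
        matches += 1         # same character at a homologous site
    else:
        mismatches += 1      # homologous site, different character

# p-distance computed over the columns where both sequences have a nucleotide.
p = mismatches / (matches + mismatches)
print(matches, mismatches, gapped, round(p, 3))
```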
59Sequence Homology
Although all homologous sequences are similar due to common inheritance, they can actually be homologous due to two separate mechanisms: speciation or duplication
Sequences which are homologous due to speciation events are known as orthologous sequences or orthologs
Sequences which are homologous due to duplication events are known as paralogous sequences or paralogs
This can make for complicated relationships among sequences
60Sequence Homology
[Figure: gene tree drawn against time, starting from the ancestral gene (Gene A1) in the ancestral species]
- Genes A2 and B2 are paralogs (related through duplication)
- Genes A2 and A3 are orthologs (related through speciation)
- Genes B2 and A3 are orthologs (related through speciation)
61Sequence Homology
[Figure: gene tree drawn against time, starting from the ancestral gene (Gene A1) in the ancestral species]
- Genes A2 and A3 are orthologs (related through speciation)
- Genes B2 and B3 are orthologs (related through speciation)
- Genes A2 and B2 are paralogs (related through duplication)
- Genes A3 and B3 are paralogs (related through duplication)
- Genes A2 and B3 are paralogs (related through duplication)
- Genes B2 and A3 are paralogs (related through duplication)
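The rule behind this list is that two genes are orthologs if their most recent common ancestor in the gene tree is a speciation event, and paralogs if it is a duplication event. The sketch below encodes one tree consistent with the relationships listed (a duplication of A1 followed by a speciation in each copy); the internal node names "A" and "B" are hypothetical labels, since the original figure is not available.

```python
# Child -> parent links of the assumed gene tree.
PARENT = {"A2": "A", "A3": "A", "B2": "B", "B3": "B", "A": "A1", "B": "A1"}
# Event that each internal node represents.
EVENT = {"A1": "duplication", "A": "speciation", "B": "speciation"}


def lineage(gene: str) -> list[str]:
    """Path from a gene up to the root of the gene tree."""
    path = [gene]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def relationship(gene_1: str, gene_2: str) -> str:
    """Classify two distinct genes by the event at their most recent common ancestor."""
    if gene_1 == gene_2:
        raise ValueError("need two different genes")
    ancestors_1 = set(lineage(gene_1))
    for node in lineage(gene_2):
        if node in ancestors_1:
            return "orthologs" if EVENT[node] == "speciation" else "paralogs"
    raise ValueError("genes do not share an ancestor in this tree")


for pair in [("A2", "A3"), ("B2", "B3"), ("A2", "B2"), ("A2", "B3")]:
    print(pair, relationship(*pair))
```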
62Sequence Homology
[Figure: gene tree drawn against time, starting from the ancestral gene (Gene A1) in the ancestral species]
Genes A2 and B3 appear to be orthologs, but they are actually paralogs. Note that the time of divergence of A2 and B3 predates the speciation event between species 2 and 3.