Sequence Evolution


Slides: 63

Provided by: michaelsr


1
Sequence Evolution

Consider how DNA and amino acid sequences evolve.
All comparative sequence analysis in bioinformatics depends on understanding evolution.
If you do not understand the mechanism, process, or model under which a sequence evolves, how can you know how to compare different sequences?

2
Evolutionary Distance

Amount of DNA or protein sequence divergence
between individuals or species
Evolutionary distance is the total number of
substitutions that have occurred in two sequences
since their divergence from the common ancestor
Measured as the number of substitutions that have
occurred per site

3
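Evolution of DNA sequences
[Figure: a common ancestor sequence, GCAAGAGATA, splits into two lineages over time t. A C → G substitution occurs on the mouse lineage. Mouse: GGAAGAGATA. Rat: GCAAGAGATA.]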
4
Number of Differences
Rat:   GCAAGAGATA
Mouse: GGAAGAGATA

How many differences are there between the Mouse and Rat sequences?
1
What proportion of sites are different?
p = 1 / 10 = 0.1
This is known as the p-distance
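
As a minimal sketch of the calculation on this slide, the p-distance can be computed by counting mismatched sites. The function name `p_distance` is illustrative and assumes two ungapped sequences of equal length.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of sites at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Count the sites where the two sequences carry different characters.
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


rat = "GCAAGAGATA"
mouse = "GGAAGAGATA"
print(p_distance(rat, mouse))  # 0.1
```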

5
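Continuing Evolution over time
[Figure: the common ancestor (GCAAGAGATA) and the C → G substitution as before; after 1 myr the two lineages read GGAAGAGATA and GCAAGAGATA, and both continue to evolve over time t toward Mouse and Rat.]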
6
Relationship of p-distance with time
[Figure: p-distance (y-axis, 0.0 to 1.0) plotted against time in million years (x-axis, 0.0 to 3.5)]
7
About p-distance

What is the theoretical maximum p-distance?
p_max = 0.75
Why this value?
There are 4 nucleotides (A, C, T, G). If two sequences are completely unrelated, there is a 25% chance of sites being identical due to random chance (and thus a 75% chance of them being different).
Thus random sequences should be different at 75% of the sites.
Why does p-distance underestimate the actual number of substitutions?
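
The 75% ceiling can be checked with a short calculation. This is only a sketch: `expected_identity` is a hypothetical helper, and it assumes the two nucleotides at a site are drawn independently from the same frequencies.

```python
def expected_identity(freqs: dict) -> float:
    """Chance that two independently drawn nucleotides are the same."""
    return sum(f * f for f in freqs.values())


# Equal base frequencies, as assumed on the slide.
equal = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
p_max = 1 - expected_identity(equal)
print(p_max)  # 0.75
```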

8
Multiple Substitutions
Start: 1 difference
Multiple substitutions at the same site
1 additional difference but 3 substitutions
9
Back Substitutions
Start: 1 difference
Different substitutions at the same position (site), producing the original nucleotide
0 extra differences created for 2 substitutions
10
Coincidental Substitutions
Start: 1 difference
Different substitutions at the same position (site) in different lineages
1 extra difference created for 2 substitutions
11
Parallel Substitutions
Start: 1 difference
Exactly the same substitutions from the same nucleotides at the same position (site) in different lineages
0 extra differences created for 2 substitutions
12
Convergent Substitutions
Start: 2 differences
Exactly the same substitutions from different nucleotides at the same position (site) in different lineages
1 difference: evolutionary divergence erased by two substitutions
13
How to transform p-distance into true distance?

Simple model: purely random. Every nucleotide at every site in a sequence has an equal probability of mutating into any other nucleotide.
Known as the Jukes-Cantor (1969) model

14
Jukes-Cantor model
From \ To    A    T    C    G
A            -    a    a    a
T            a    -    a    a
C            a    a    -    a
G            a    a    a    -
a is the probability of change per year
The total probability of change of any nucleotide is r = 3a; r is equal to the rate of substitution per site per year
15
Jukes-Cantor model

Consider sequences X and Y, diverged from a common ancestor t years ago
$q_t$ is the proportion of sites which are identical
$p_t$ is the proportion of sites which are different, i.e., $p_t = 1 - q_t$
What happens at time $t + 1$?
Sites which are identical at time t will remain identical with probability $(1 - r)^2$
This can be approximated as $1 - 2r$ because $r^2$ is a very small term
Sites which were different at time t can become identical at time $t + 1$ with probability $2r/3$
If X and Y have nucleotides i and j at a site at time t, they will become identical if
i in X changes to j, but j in Y stays the same
j in Y changes to i, but i in X stays the same
Probability of each scenario is $(1 - r)a = (1 - r)r/3$
Total probability is $2(1 - r)r/3 = 2r/3 - 2r^2/3$
Drop the $r^2$ term (very small) and we get $2r/3$

16
Jukes-Cantor model
Therefore we can write the following:
$q_{t+1} = (1 - 2r) q_t + \frac{2}{3} r (1 - q_t)$
The first term is the proportion of formerly identical sites which are still identical, and the second term is the proportion of formerly different sites which are now identical.
This can be rewritten as
$q_{t+1} - q_t = \frac{2r}{3} - \frac{8r}{3} q_t$
Changing to calculus, this becomes
$\frac{dq}{dt} = \frac{2r}{3} - \frac{8r}{3} q$
With q = 1 at t = 0 (i.e., the sequences are identical), the solution is
$q = 1 - \frac{3}{4} \left(1 - e^{-8rt/3}\right)$
The expected number of substitutions per site, d, for two sequences is 2rt. Substituting, we get
$q = 1 - \frac{3}{4} \left(1 - e^{-4d/3}\right)$
and solving for d:
$d = -\frac{3}{4} \ln\left(1 - \frac{4}{3}(1 - q)\right) = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p\right)$
This is the Jukes-Cantor distance
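
A minimal Python sketch of the Jukes-Cantor correction derived on this slide. The function names are illustrative, not from the presentation. `iterate_identity` simply steps the slide's recurrence for q year by year so it can be compared with the closed-form solution; the two should agree closely only when r is small.

```python
import math


def jc_distance(p: float) -> float:
    """Jukes-Cantor distance d = -(3/4) ln(1 - 4p/3) for a p-distance p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


def jc_expected_p(d: float) -> float:
    """Expected p-distance after d substitutions per site (inverse of jc_distance)."""
    return 0.75 * (1 - math.exp(-4 * d / 3))


def iterate_identity(r: float, steps: int) -> float:
    """Apply q(t+1) = (1 - 2r) q(t) + (2/3) r (1 - q(t)) starting from q = 1."""
    q = 1.0
    for _ in range(steps):
        q = (1 - 2 * r) * q + (2 / 3) * r * (1 - q)
    return q


# One difference in ten sites corrects to slightly more than 0.1.
print(round(jc_distance(0.1), 4))  # 0.1073
```

As d grows, `jc_expected_p(d)` approaches 0.75, which is the saturation value of the p-distance.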
17
Jukes-Cantor (JC) Distance
[Figure: estimated number of substitutions per site (y-axis) against actual number of substitutions per site (x-axis), with one curve for the JC distance and one for the p-distance]
18
Is the Jukes-Cantor distance accurate?

What are its assumptions?
Each nucleotide (A, C, G, T) occurs with equal frequency (i.e., 25% each)
All sites in a sequence have the same mutation rate
The rates of all substitutions are identical (e.g., A → C = A → G = A → T)
Reversibility: C → G = G → C

19
Transitions and Transversions
The four DNA bases fall into two structural categories
Purines
Adenine and Guanine
Double ring of 9 atoms
Pyrimidines
Cytosine and Thymine
Single ring of 6 atoms
A mutation within the same type (purine to purine or pyrimidine to pyrimidine) is a transition. A mutation between types is a transversion.
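
The definition above can be sketched in a few lines of Python. The helper names are illustrative, and the second function assumes two ungapped, equal-length DNA sequences; it returns the proportions of sites showing transitional and transversional differences.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(base_from: str, base_to: str) -> str:
    """Return 'transition' or 'transversion' for a change between two bases."""
    bases = PURINES | PYRIMIDINES
    if base_from not in bases or base_to not in bases:
        raise ValueError("bases must be one of A, C, G, T")
    if base_from == base_to:
        raise ValueError("no substitution: the bases are identical")
    # Same structural class (both purines or both pyrimidines) means transition.
    same_class = (base_from in PURINES) == (base_to in PURINES)
    return "transition" if same_class else "transversion"


def transition_transversion_proportions(seq_a: str, seq_b: str):
    """Proportions of sites with transitional and transversional differences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b:
            continue
        if classify_substitution(a, b) == "transition":
            transitions += 1
        else:
            transversions += 1
    n = len(seq_a)
    return transitions / n, transversions / n


print(classify_substitution("A", "G"))  # transition
print(classify_substitution("C", "G"))  # transversion
```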
20
[Figure: ring structures of the purines and pyrimidines, with the bases labelled A, T, C, G]
21
Transition Bias
Transitions are observed to occur more often than
transversions

Mutational Bias
Biochemical mispairing (Topal & Fresco 1976)
Selective Bias
Transitions are more often synonymous
Transitional amino acid changes are often less severe than transversional changes (Grantham 1974; Zhang 2000)

22
[Figure: ring structures of the purines and pyrimidines, with the bases labelled A, T, C, G]
23
Kimura's Two-Parameter model (1980)
From \ To    A    T    C    G
A            -    b    b    a
T            b    -    a    b
C            b    a    -    b
G            a    b    b    -
a is the probability of a transitional change per year
b is the probability of a transversional change per year
The total probability of change of any nucleotide is r = a + 2b
24
Kimura's Two-Parameter model (1980)
The total probability of change of any nucleotide is r = a + 2b
d is expected to be 2rt = 2at + 4bt
Therefore, using the same sort of approach as before:
$d = -\frac{1}{2} \ln(1 - 2P - Q) - \frac{1}{4} \ln(1 - 2Q)$
Where P is the observed proportion of transitional differences and Q is the observed proportion of transversional differences
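
A minimal sketch of the Kimura two-parameter distance in its standard published form, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), with P and Q taken as proportions of sites rather than raw counts. The function name is illustrative.

```python
import math


def k2p_distance(P: float, Q: float) -> float:
    """Kimura two-parameter distance from the proportions of transitional (P)
    and transversional (Q) differences."""
    term_all = 1 - 2 * P - Q
    term_tv = 1 - 2 * Q
    if term_all <= 0 or term_tv <= 0:
        raise ValueError("P and Q are too large for the distance to be defined")
    return -0.5 * math.log(term_all) - 0.25 * math.log(term_tv)


print(round(k2p_distance(0.1, 0.05), 4))  # 0.1702
```

When transitions and transversions are equally likely per change (P = p/3, Q = 2p/3), the expression reduces to the Jukes-Cantor distance for the same p.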
25
Transition-Transversion Bias
The ratio of the transitional substitution rate to the transversional substitution rate, k = a / b, is known as the transition bias.
When measured on a gene-by-gene basis, this value can vary from 0.15 to 48.
This variation turns out to be strongly related to sequence length.
[Figure: k estimates for 3,712 human-mouse gene pairs, plotted against number of nucleotides]
Variation is due to statistical sampling error and does not necessarily represent true differences among genes. For mammals, the bias is approximately 3.6 for neutrally evolving sites.
26
Special Case: CpG dinucleotides
In mammals, cytosine is usually methylated when it is followed immediately by a guanine (5' to 3' direction). A methylated C will often spontaneously deaminate into a thymine: CG → TG.
This transitional mutation occurs up to 10 times faster than any other mutation.
In humans, C and G each make up about 21% of the nucleotides (total GC content 42%).
The expected proportion of dinucleotide pairs being C followed by G is therefore 0.21 × 0.21 ≈ 4%.
The observed proportion of CpG in humans is 0.8%.
There are large stretches of chromosomes where Cs are not methylated; these show the expected proportions of CpG and are known as CpG islands.
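
The arithmetic on this slide can be sketched as follows. `cpg_observed_expected` is a hypothetical helper that follows the slide's simple independence argument (expected proportion = frequency of C times frequency of G); other conventions for the observed/expected CpG ratio exist. The human figures are taken from the slide at face value.

```python
def cpg_observed_expected(seq: str):
    """Return (observed, expected) proportions of CG among adjacent base pairs."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("need at least two bases")
    freq_c = seq.count("C") / n
    freq_g = seq.count("G") / n
    expected = freq_c * freq_g            # independence assumption
    observed = seq.count("CG") / (n - 1)  # n - 1 adjacent pairs in the sequence
    return observed, expected


# The slide's human figures.
expected_human = 0.21 * 0.21             # about 0.044, i.e. roughly 4%
observed_human = 0.008                   # 0.8%
ratio = observed_human / expected_human  # CpG is at roughly a fifth of the expected level
print(round(ratio, 2))  # 0.18
```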
27
Nucleotide Frequencies
Both models assume that A, C, G, and T all occur with equal frequency. We've already discussed that this is not usually true.
28
HKY model (1985)
From \ To    A      T      C      G
A            -      bfT    bfC    afG
T            bfA    -      afC    bfG
C            bfA    afT    -      bfG
G            afA    bfT    bfC    -
a is the probability of a transitional change per year
b is the probability of a transversional change per year
fX is the expected frequency of nucleotide X
Tamura-Nei model (1993) is an almost identical variant where purine transitions and pyrimidine transitions are allowed to have different rates (a1 and a2)
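
As a sketch of how the HKY table is filled in, the off-diagonal entries can be generated from a, b, and the base frequencies. The function name and the nested-dictionary layout are illustrative choices, not part of the presentation; diagonal entries are left out, as on the slide.

```python
BASES = ("A", "T", "C", "G")
PURINES = {"A", "G"}


def hky_rate_matrix(a: float, b: float, freqs: dict) -> dict:
    """Off-diagonal HKY entries: a * f[target] for transitions,
    b * f[target] for transversions."""
    matrix = {}
    for src in BASES:
        row = {}
        for dst in BASES:
            if src == dst:
                continue
            # Purine <-> purine or pyrimidine <-> pyrimidine is a transition.
            is_transition = (src in PURINES) == (dst in PURINES)
            row[dst] = (a if is_transition else b) * freqs[dst]
        matrix[src] = row
    return matrix


freqs = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}
m = hky_rate_matrix(a=2.0, b=1.0, freqs=freqs)
print(m["A"]["G"], m["A"]["T"])  # 0.4 0.3
```

With a = b and all four frequencies equal to 0.25, every entry is the same, which recovers the Jukes-Cantor table.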
29
There are many other models

General reversible: each nucleotide pair has an independent rate
AG, AT, AC, GT, GC, TC are all separately modeled
6 parameter model
Reversible because A → G = G → A, etc.
Non-reversible/unrestricted model: like above but without the assumption of reversibility
12 parameter model

30
Site Equality

All of these models assume that every site has
the same substitution rate.
Again, this is often not the case. Nucleotide
sites might mutate at different rates because
Coding vs. non-coding regions
Introns vs. exons
1st vs 2nd vs 3rd codon positions
Local effects
Most of the models can be adapted by allowing
site rate variation, usually using models based
on the gamma distribution

31
Evolutionary Rate
Although it is obviously dependent on circumstance, in mammals, the best estimate for the overall average rate of substitution at neutral sites is 2 × 10^-9 substitutions per site per year
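
Combined with the relation d = 2rt from the Jukes-Cantor slides, this rate gives a rough way to turn a distance into a divergence time. The sketch below assumes neutral sites and a constant rate; the function name is illustrative.

```python
NEUTRAL_RATE = 2e-9  # substitutions per site per year, the mammalian estimate above


def divergence_time_years(d: float, r: float = NEUTRAL_RATE) -> float:
    """Time since divergence, from d = 2rt (two lineages each evolving at rate r)."""
    if r <= 0:
        raise ValueError("rate must be positive")
    return d / (2 * r)


# A distance of 0.1 substitutions per site corresponds to about 25 million years.
print(round(divergence_time_years(0.1) / 1e6, 1))  # 25.0
```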
32
Codons
For the most part we've discussed DNA sequence evolution without regard to how DNA is processed.
For coding genes, changes in DNA may lead to changes in proteins.
Synonymous substitutions: DNA substitutions in coding sequence that do not change the amino acid sequence
Nonsynonymous substitutions: DNA substitutions in coding sequence that do change the amino acid sequence
This is mediated through codons
33
Codons
Mutations at the 2nd codon position are always nonsynonymous
Mutations at the 1st codon position are usually nonsynonymous (exceptions: Leu, Arg)
Mutations at the 3rd codon position can be either
Substitution rate varies by position as expected: 3rd > 1st > 2nd
34
Codons
The codon pattern also explains an additional reason for transition bias: transitions are more likely to be synonymous than transversions
35
Codons
Sites in which any mutation is synonymous are known as 4-fold degenerate. Example: GCx
Sites in which only transitions are synonymous are known as 2-fold degenerate. Example: AGx
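
The codon-position claims on these slides can be checked against the standard genetic code. This is a sketch under stated assumptions: the table is the standard code written in TCAG order, stop codons are excluded as starting points, and a change to a stop codon is counted as nonsynonymous. The function names are illustrative.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' marks a stop).
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
TRANSITION_PARTNER = {"A": "G", "G": "A", "C": "T", "T": "C"}


def synonymous_fraction(position: int) -> float:
    """Fraction of single-base changes at a codon position (0, 1 or 2)
    that leave the amino acid unchanged, over the 61 sense codons."""
    synonymous = total = 0
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid == "*":
            continue
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            if CODON_TABLE[mutant] == amino_acid:
                synonymous += 1
    return synonymous / total


def third_position_degeneracy(prefix: str) -> str:
    """Classify the third position that follows a two-base prefix such as 'GC'."""
    products = {b: CODON_TABLE[prefix + b] for b in BASES}
    if len(set(products.values())) == 1:
        return "4-fold"
    if all(products[b] == products[TRANSITION_PARTNER[b]] for b in BASES):
        return "2-fold"
    return "other"


print(third_position_degeneracy("GC"))  # 4-fold
print(third_position_degeneracy("AG"))  # 2-fold
```

Under these assumptions `synonymous_fraction(1)` is zero, and the fractions rank 3rd > 1st > 2nd, in line with the ordering of substitution rates stated above.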
36
Amino Acid/Protein Sequences
Builds directly from DNA sequence evolution.
p-distance is the proportion of sites that differ between two amino acid sequences.
p-distance will underestimate the true distance for all of the same reasons this value is a problem with DNA (convergence, etc.).
At its simplest, we can model protein evolution with the assumption that any amino acid can mutate into any other amino acid with equal probability.
This is known as the Poisson model (it is the protein equivalent of the Jukes-Cantor model).
Using the same logic as JC, we find $d = -\ln(1 - p)$
This is the Poisson distance
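
A minimal sketch of the Poisson distance in its standard form, d = -ln(1 - p), where p is the amino acid p-distance. The function name is illustrative.

```python
import math


def poisson_distance(p: float) -> float:
    """Poisson-corrected distance for an amino acid p-distance p."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return -math.log(1 - p)


print(round(poisson_distance(0.1), 4))  # 0.1054
```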
37
Amino Acid/Protein Sequences

The Poisson distance suffers from many of the
same problems as Jukes-Cantor. It assumes
All sites mutate with equal frequency
All amino acids mutate to each other with same
rate
etc.
The last one is particularly problematic. Just
examine the codon table

38
Amino Acid/Protein Evolution
Is a Proline equally likely to mutate into a
Threonine as it is to a Glycine?
39
Amino Acid/Protein Evolution
An interesting thing is that the coding table is not random.
Amino acids with similar physicochemical properties (e.g., polarity or charge) tend to be single mutational events apart, while those that are more different require more steps.
Evidence of selection: it is generally preferable to replace an amino acid with one with similar properties (less chance of negative effect on the organism), and the evolution of the coding table supports that logic.
40
Amino Acid/Protein Evolution
Amino acids or codons are much more difficult to model than DNA because of the greater complexity.
Thus we often use empirical, data-based models rather than theoretical models (such as HKY or general reversible).
The original and most famous of these is known as the PAM or Dayhoff model.
41
PAM/Dayhoff Matrix
         A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
Ala A 9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
Arg R    1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
Asn N    4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
Asp D    6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
Cys C    1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Gln Q    3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
Glu E   10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
Gly G   21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
His H    1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
Ile I    2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
Leu L    3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
Lys K    2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
Met M    1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
Phe F    1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
Pro P   13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
Ser S   28   11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5    2    2
Thr T   22    2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0    2    9
Trp W    0    2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0 9976    1    0
Tyr Y    1    0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2 9945    1
Val V   13    2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0    2 9901
Each entry is the probability of a specific amino acid substitution (multiplied by 10,000), scaled so that 1 substitution is expected per 100 amino acids
42
Amino Acid/Protein Evolution
There are now many variants of this matrix, built with newer and more data, including such things as PAM 1, PAM 120, PAM 250, BLOSUM 1, BLOSUM 62, BLOSUM 45, etc.
These are all based on the same principle, but are designed for different organisms or different expected divergences.
These matrices can be used to model protein evolution without direct reference to the underlying DNA or codons.
43
Other Evolutionary Events

Insertions and Deletions
Chunks of DNA can be added or removed from the chromosome; these can be as small as a single base or as large as hundreds of thousands
With only a few sequences it can be very difficult to distinguish whether a change in length was due to an insertion or deletion; thus they are often referred to with the combined term Indel

ACTCGTCATCGACTTAACGACTCATCGTA
2 base deletion
7 base insertion
ACGTCATCGACCGTACGTTTAACGACTCATCGTA
44
Other Evolutionary Events

Inversions
Chunks of DNA can be flipped into reverse order

ACTCGTCATCGACTTAACGACTCATCGTA
ACTCGTCATATTCAGCACGACTCATCGTA
45
Other Evolutionary Events

Segmental duplications
Large chunks of DNA can be accidentally repeated
When this happens on a large scale, it leads to
multiple copies of entire genes and gene complexes

ACTCGTCATCGACTTAACGACTCATCGTA
ACTCGTCATCGACTTACGACTTAACGACTCATCGTA
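
The example sequences on the indel, inversion, and duplication slides can be reproduced with simple string operations. This is a sketch: the helper names are illustrative, and the positions (0-based) are read off by comparing the slides' before and after sequences. The inversion follows the slide in simply reversing the segment.

```python
ANCESTOR = "ACTCGTCATCGACTTAACGACTCATCGTA"


def delete(seq: str, start: int, length: int) -> str:
    """Remove `length` bases beginning at index `start`."""
    return seq[:start] + seq[start + length:]


def insert(seq: str, position: int, fragment: str) -> str:
    """Insert `fragment` before index `position`."""
    return seq[:position] + fragment + seq[position:]


def invert(seq: str, start: int, end: int) -> str:
    """Reverse the order of the bases in seq[start:end]."""
    return seq[:start] + seq[start:end][::-1] + seq[end:]


def duplicate(seq: str, start: int, end: int) -> str:
    """Repeat seq[start:end] immediately after itself."""
    return seq[:end] + seq[start:end] + seq[end:]


# Insert first so that the deletion can still use the original coordinates.
indel = delete(insert(ANCESTOR, 13, "CGTACGT"), 2, 2)
inversion = invert(ANCESTOR, 9, 16)
duplication = duplicate(ANCESTOR, 9, 16)
print(indel)
print(inversion)
print(duplication)
```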
46
Other Evolutionary Events

Duplications also can occur through polyploidy
Polyploidy is the duplication of the entire
chromosome set
A diploid organism has 2n chromosomes
A breakdown in meiosis can lead to a tetraploid
with 4n chromosomes
This has been very common in plant evolution, but
can also occur in animals
There is strong evidence this happened at least
twice in the early history of vertebrates

47
Duplicated Genes

What happens if a gene is duplicated (whether due
to polyploidy or a segmental duplication)?
We went from one copy of a gene to two copies
At the beginning the genes are identical: redundancy
Mutations will gradually alter the genes
Possible outcomes
One gene will become non-functional (a pseudogene); at this point selection acts to maintain the other gene (assuming it is necessary for life)
This is the most common outcome of gene duplication
One gene will gain a new useful function; now both genes are potentially maintained by selection
This is the rarest outcome of gene duplication

48
Duplicated Genes

Most genes have more than one function
A pair of duplicated genes can diverge so that
each takes over a different function
If this happens, selection will then work to
preserve both genes because each can serve a
specialized function when the original gene
served as a generalist

49
Duplicated Genes
Gene A
Promoter 1
Promoter 2
Each promoter serves a different function
50
Gene Families

A gene family is a set of genes related to each other through evolutionary duplication events; they often serve similar functions and have similar DNA sequences
Although usually defined within a species, gene families can cross multiple species as well
Gene families can contain anywhere from two to hundreds (maybe thousands) of genes
Examples
Hox gene family: a set of regulatory genes very heavily involved in the control of animal body plans
Olfactory receptors: pretty much all olfactory receptor proteins are part of a single gene family

51
Homology
Homology: similarity due to inheritance from a common ancestor.
Homology is a critical concept in evolutionary biology; it is also an extremely important concept (although often not recognized as such) in bioinformatics.
Similarity which is NOT due to inheritance from a common ancestor is analogy. Analogy can be due to things such as convergence or parallel evolution.
52
Homology Example: Tetrapod Limbs
53
Homology Example: Mammalian Necks
Mammals have 7 cervical vertebrae in their necks
[Figure: the 7 cervical vertebrae, numbered 1 to 7, in a human and a giraffe]
54
Homology Example: Mammalian Necks - Exceptions
Only exceptions: manatees have 6 cervical vertebrae, two-toed sloths have 6, and three-toed sloths have 9
Do you think the presence of six cervical
vertebrae is homologous in manatees and two-toed
sloths?
55
Are these homologous?

Bat wings and bee wings?

As wings?
As tetrapod fore-limbs?

Bird wings and bat wings?

To answer this question you must ask: did the common ancestor have this trait?
56
Sequence Homology

With respect to sequences, homology has three distinct, perfectly valid meanings:
Sequences are homologous if they are descended from a common ancestor
16S rRNA is found in almost all living organisms; it is homologous among all living things because they all inherited it from their common ancestor
Between a pair of sequences, specific sites are
said to be homologous if the position within the
sequence is the same as in the common ancestor
Nucleotides (or amino acids) between a pair of
sequences are said to be homologous if they are
(a) at a homologous site, (b) show the same
character (e.g., both sequences have adenine),
and (c) they both have that character because it
was inherited from the common ancestor

57
Sequence Homology
These are unaligned sequences: homologous sites are not in the same column
ACTCGTCATCGACTTAACGACTCATCGTA
ACGACCTCGTCCGTACGTTTACCGAATCATCCTA
58
Sequence Homology
These are aligned sequences: homologous sites are in the same column
Homologous characters are marked in yellow on the original slide
ACTCGTCATCGAC-------TTAACGACTCATCGTA
A--CGACCTCGTCCGTACGTTTACCGAATCATCCTA
If we actually knew the ancestral sequence it might turn out that even some of these are not actually homologous!
59
Sequence Homology
Although all homologous sequences are similar due to common inheritance, they can actually be homologous due to two separate mechanisms: speciation or duplication.
Sequences which are homologous due to speciation events are known as orthologous sequences or orthologs.
Sequences which are homologous due to duplication events are known as paralogous sequences or paralogs.
This can make for complicated relationships among sequences.
60
Sequence Homology
[Figure: gene tree drawn against time, starting from the ancestral gene (Gene A1) in the ancestral species]

Genes A2 and B2 are paralogs (related through
duplication)
Genes A2 and A3 are orthologs (related through
speciation)
Genes B2 and A3 are orthologs (related through
speciation)

61
Sequence Homology
[Figure: gene tree drawn against time, starting from the ancestral gene (Gene A1) in the ancestral species]

Genes A2 and A3 are orthologs (related through
speciation)
Genes B2 and B3 are orthologs (related through
speciation)
Genes A2 and B2 are paralogs (related through
duplication)
Genes A3 and B3 are paralogs (related through
duplication)
Genes A2 and B3 are paralogs (related through
duplication)
Genes B2 and A3 are paralogs (related through
duplication)

62
Sequence Homology
Time
Ancestral Gene / Ancestral Species
Gene A1
Genes A2 and B3 appear to be orthologs, but they
are actually paralogs Note that the time of
divergence of A2 and B3 predates the speciation
event between species 2 and 3
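
The rule behind these slides can be sketched in code: two genes are orthologs if the event at their most recent common ancestor was a speciation, and paralogs if it was a duplication. The nested-tuple tree encoding and the function names below are illustrative assumptions; the two trees are reconstructed from the relationships listed on slides 60 and 61.

```python
def leaves(node) -> set:
    """Return the set of gene names below a node (a leaf is just a string)."""
    if isinstance(node, str):
        return {node}
    _event, left, right = node
    return leaves(left) | leaves(right)


def relationship(tree, gene_x: str, gene_y: str) -> str:
    """Classify two genes by the event at their most recent common ancestor."""
    if gene_x == gene_y or not {gene_x, gene_y} <= leaves(tree):
        raise ValueError("need two different genes that are both in the tree")
    event, left, right = tree
    # Descend while both genes still sit under the same child.
    for child in (left, right):
        if {gene_x, gene_y} <= leaves(child):
            return relationship(child, gene_x, gene_y)
    return "orthologs" if event == "speciation" else "paralogs"


# Slide 60: speciation first, then a duplication in the lineage leading to species 2.
slide_60 = ("speciation", ("duplication", "A2", "B2"), "A3")
# Slide 61: duplication first, then speciation in both the A and B lineages.
slide_61 = ("duplication", ("speciation", "A2", "A3"), ("speciation", "B2", "B3"))

print(relationship(slide_60, "B2", "A3"))  # orthologs
print(relationship(slide_61, "A2", "B3"))  # paralogs
```

If copies B2 and A3 were lost from the slide 61 tree, A2 and B3 would be the only survivors and could be mistaken for orthologs, which is the situation described on slide 62.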
