Sequence Evolution - PowerPoint PPT Presentation

Provided by: michaelsr

1
Sequence Evolution
  • Consider how DNA and amino acid sequences evolve
  • All comparative sequence analysis in
    bioinformatics depends on understanding evolution
  • If you do not understand the mechanism, process,
    or model under which a sequence evolves, how can
    you know how to compare different sequences?

2
Evolutionary Distance
  • Amount of DNA or protein sequence divergence
    between individuals or species
  • Evolutionary distance is the total number of
    substitutions that have occurred in two sequences
    since their divergence from the common ancestor
  • Measured as the number of substitutions that have
    occurred per site

3
Evolution of DNA sequences
[Figure: a common ancestor (GCAAGAGATA) diverges over time t into Mouse (GGAAGAGATA, after a C → G substitution) and Rat (GCAAGAGATA)]
4
Number of Differences
Rat:   GCAAGAGATA
Mouse: GGAAGAGATA
  • How many differences between Mouse and Rat
    sequences?
  • 1
  • What proportion of sites are different?
  • p = 1 / 10 = 0.1
  • This is known as the p-distance
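
The calculation above can be sketched in Python. This is a minimal illustration rather than anything taken from the slides: the function name `p_distance` is an assumption, and it expects two sequences that are already aligned to the same length.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)


rat = "GCAAGAGATA"
mouse = "GGAAGAGATA"
print(p_distance(rat, mouse))  # 1 difference over 10 sites
```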

5
Continuing Evolution over time
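[Figure: common ancestor GCAAGAGATA; after 1 myr a C → G substitution gives GGAAGAGATA in one lineage and GCAAGAGATA in the other; both lineages continue evolving over time t toward Mouse and Rat]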
6
Relationship of p-distance with time
[Figure: p-distance (y-axis, 0.0 to 1.0) plotted against time in million years (x-axis, 0.0 to 3.5)]
7
About p-distance
  • What is the theoretical maximum p-distance?
  • p_max = 0.75
  • Why this value?
  • There are 4 nucleotides (A, C, T, G); if two
    sequences are completely unrelated, there is a
    25% chance of sites being identical due to random
    chance (and thus a 75% chance of them being
    different)
  • Thus random sequences should be different at 75%
    of the sites
  • Why does p-distance underestimate the actual
    number of substitutions?
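
The 75% ceiling described above can be checked empirically. The sketch below is an illustration only: it assumes all four nucleotides are equally likely, uses a fixed random seed so the run is repeatable, and the helper names are hypothetical.

```python
import random


def p_distance(seq_a, seq_b):
    """Proportion of sites that differ between two equal-length sequences."""
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)


def random_sequence(length, rng):
    """Draw each site independently and uniformly from A, C, G, T."""
    return "".join(rng.choice("ACGT") for _ in range(length))


rng = random.Random(0)  # fixed seed (assumption) for a repeatable run
n_sites = 100_000
p = p_distance(random_sequence(n_sites, rng), random_sequence(n_sites, rng))
print(round(p, 3))  # should land close to the theoretical maximum of 0.75
```

Because unrelated sequences already sit near 0.75, the p-distance cannot keep growing as more substitutions accumulate, which is one way to see why it saturates.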

8
Multiple Substitutions
Start: 1 difference
Multiple substitutions at the same site
1 additional difference but 3 substitutions
9
Back Substitutions
Start: 1 difference
Different substitutions at the same position
(site) producing the original nucleotide
0 extra differences created for 2 substitutions
10
Coincidental Substitutions
Start: 1 difference
Different substitutions at the same position
(site) in different lineages
1 extra difference created for 2 substitutions
11
Parallel Substitutions
Start: 1 difference
Exactly the same substitutions from the same
nucleotides at the same position (site) in
different lineages
0 extra differences created for 2 substitutions
12
Convergent Substitutions
Start: 2 differences
Exactly the same substitutions from different
nucleotides at the same position (site) in
different lineages
1 difference: evolutionary divergence erased by
two substitutions
13
How to transform p-distance into true distance?
  • Simple model: purely random. Every nucleotide
    at every site in a sequence has an equal
    probability of mutating into any other nucleotide
  • Known as the Jukes-Cantor (1969) model

14
Jukes-Cantor model
Substitution probabilities (rows: from, columns: to)

From \ To   A   T   C   G
A           -   a   a   a
T           a   -   a   a
C           a   a   -   a
G           a   a   a   -

a is the probability of change per year
The total probability of change of any nucleotide
is r = 3a; r is equal to the rate of substitution
per site per year
15
Jukes-Cantor model
  • Consider sequences X and Y, diverged from a
    common ancestor t years ago
  • $q_t$ is the proportion of sites which are identical
  • $p_t$ is the proportion of sites which are
    different, i.e., $p_t = 1 - q_t$
  • What happens at time $t + 1$?
  • Sites which are identical at time t will remain
    identical with probability $(1 - r)^2$
  • This can be approximated as $1 - 2r$ because $r^2$ is
    a very small term
  • Sites which were different at time t can become
    identical at time $t + 1$ with probability $2r/3$
  • If X and Y have nucleotides i and j at a site at
    time t, they will become identical if
  • i in X changes to j, but j in Y stays the same
  • j in Y changes to i, but i in X stays the same
  • Probability of each scenario is $(1 - r)a = (1 - r)r/3$
  • Total probability is $2(1 - r)r/3 = 2r/3 - 2r^2/3$
  • Drop the $r^2$ term (very small) and we get $2r/3$
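
The two per-step probabilities above (identical sites stay identical with probability about 1 - 2r, different sites become identical with probability about 2r/3) can be iterated numerically. This is a sketch under stated assumptions: the rate `r` below is a hypothetical value chosen large so the trend shows up quickly, and the function name is illustrative.

```python
def next_identical_fraction(q_t: float, r: float) -> float:
    """One time step of the proportion of identical sites."""
    stay_identical = (1 - 2 * r) * q_t          # identical sites that remain identical
    become_identical = (2 * r / 3) * (1 - q_t)  # different sites that become identical
    return stay_identical + become_identical


r = 1e-3   # hypothetical rate per site per step, not a biological estimate
q = 1.0    # the two sequences start out identical
for _ in range(10_000):
    q = next_identical_fraction(q, r)
print(q)   # drifts toward 0.25, i.e. a p-distance of 0.75
```

The fixed point of the update is q = 1/4, which matches the 25% identity expected between unrelated sequences.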

16
Jukes-Cantor model
Therefore we can write the following:

$$q_{t+1} = (1 - 2r)\,q_t + \tfrac{2}{3}\,r\,(1 - q_t)$$

The first term is the proportion of formerly
identical sites which are still identical and the
second term is the proportion of formerly different
sites which are now identical. This can be
rewritten as

$$q_{t+1} - q_t = \frac{2r}{3} - \frac{8r}{3}\,q_t$$

Changing to calculus this becomes

$$\frac{dq}{dt} = \frac{2r}{3} - \frac{8r}{3}\,q$$

When q = 1 at t = 0 (i.e., the sequences are
identical)

$$q = 1 - \tfrac{3}{4}\left(1 - e^{-8rt/3}\right)$$

The expected number of substitutions per site, d,
for two sequences is 2rt. Substituting we get

$$q = 1 - \tfrac{3}{4}\left(1 - e^{-4d/3}\right)$$

and solving for d

$$d = -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}(1 - q)\right) = -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}\,p\right)$$

This is the Jukes-Cantor distance
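
A short sketch of the Jukes-Cantor correction, d = -(3/4) ln(1 - (4/3) p), together with its inverse. The function names are assumptions made for illustration; the guard on `p` reflects that the logarithm is undefined once p reaches 0.75.

```python
import math


def jc_distance(p: float) -> float:
    """Jukes-Cantor distance from an observed p-distance."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


def expected_p(d: float) -> float:
    """Expected p-distance after d substitutions per site under Jukes-Cantor."""
    return 0.75 * (1 - math.exp(-4 * d / 3))


print(jc_distance(0.1))  # slightly larger than the raw p-distance of 0.1
print(expected_p(1.0))   # well below 1.0: multiple hits hide substitutions
```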
17
Jukes-Cantor (JC) Distance
[Figure: estimated number of substitutions per site (y-axis, 0.0 to 1.5) against actual number of substitutions per site (x-axis, 0.0 to 1.5), with one curve for the JC distance and one for the p-distance]
18
Is the Jukes-Cantor distance accurate?
  • What are its assumptions?
  • Each nucleotide (A, C, G, T) occurs with equal
    frequency (i.e., 25% each)
  • All sites in a sequence have the same mutation
    rate
  • The rates of all substitutions are identical
    (e.g., A → C = A → G = A → T)
  • Reversibility: C → G = G → C

19
Transitions & Transversions
The four DNA bases fall into two structural
categories
Purines: Adenine, Guanine (double ring of 9 atoms)
Pyrimidines: Cytosine, Thymine (single ring of 6 atoms)
A mutation of the same type (purine to purine or
pyrimidine to pyrimidine) is a transition. A
mutation between types is a transversion.
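
The definition above translates directly into a small classifier. This is a minimal sketch; the function name `classify_substitution` is an assumption, and only the four unambiguous DNA bases are handled.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(old: str, new: str) -> str:
    """Return 'transition' or 'transversion' for a change from old to new."""
    if old == new:
        raise ValueError("no substitution: bases are identical")
    for base in (old, new):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    # Same structural class on both sides means a transition.
    same_class = (old in PURINES) == (new in PURINES)
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))  # purine to purine
print(classify_substitution("C", "G"))  # pyrimidine to purine
```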
20
[Figure: chemical ring structures of the four bases, grouped as purines (A, G) and pyrimidines (C, T)]
21
Transition Bias
Transitions are observed to occur more often than
transversions
  • Mutational Bias
  • Biochemical mispairing (Topal & Fresco 1976)
  • Selective Bias
  • Transitions are more often synonymous
  • Transitional amino acid changes are often less
    severe than transversional changes (Grantham
    1974; Zhang 2000)

22
[Figure: chemical ring structures of the four bases, grouped as purines (A, G) and pyrimidines (C, T)]
23
Kimura's Two-Parameter model (1980)
Substitution probabilities (rows: from, columns: to)

From \ To   A   T   C   G
A           -   b   b   a
T           b   -   a   b
C           b   a   -   b
G           a   b   b   -

a is the probability of a transitional change per
year
b is the probability of a transversional change
per year
The total probability of change of any nucleotide
is r = a + 2b
24
Kimura's Two-Parameter model (1980)
The total probability of change of any nucleotide
is r = a + 2b, so d is expected to be
2rt = 2at + 4bt. Therefore, using the same sort of
approach as before:

$$d = -\tfrac{1}{2}\ln(1 - 2P - Q) - \tfrac{1}{4}\ln(1 - 2Q)$$

where P is the observed proportion of sites with
transitional differences and Q is the observed
proportion of sites with transversional differences
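
A sketch of the Kimura two-parameter distance in its standard published form, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), assuming P and Q are the proportions of sites showing transitional and transversional differences. The function names are illustrative and the sequences are the Rat and Mouse examples from the earlier slides.

```python
import math

PURINES = {"A", "G"}


def transition_transversion_proportions(seq_a: str, seq_b: str):
    """Return (P, Q): proportions of sites with transitional / transversional differences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(seq_a, seq_b):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    n = len(seq_a)
    return transitions / n, transversions / n


def k2p_distance(P: float, Q: float) -> float:
    """Kimura two-parameter distance from the proportions P and Q."""
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


P, Q = transition_transversion_proportions("GCAAGAGATA", "GGAAGAGATA")
print(P, Q, k2p_distance(P, Q))
```

When transversional differences are exactly twice as common as transitional ones (Q = 2P), the expression collapses to the Jukes-Cantor distance with p = P + Q, which is a useful sanity check.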
25
Transition-Transversion Bias
The ratio of the transitional substitution rate
to the transversional substitution rate, k = a / b,
is known as the transition bias. When measured on
a gene-by-gene basis, this value can vary from
0.15 to 48. This variation turns out to be
strongly related to sequence length.
[Figure: k estimates for 3,712 human-mouse gene pairs, plotted against number of nucleotides]
Variation is due to statistical sampling error
and does not necessarily represent true
differences among genes. For mammals, the bias is
approximately 3.6 for neutrally evolving sites
26
Special Case CpG dinucleotides
In mammals, cytidine is usually methylated. When
a cytidine is followed immediately by a guanine
(5' to 3' direction), the C will often
spontaneously deaminate into a thymine: CG → TG.
This transitional mutation occurs up to 10
times faster than any other mutation.
In humans, C and G each make up about 21% of the
nucleotides (total GC content 42%). The expected
proportion of dinucleotide pairs being C followed
by G is therefore 0.21 × 0.21 ≈ 4%. The observed
proportion of CpG in humans is 0.8%.
There are large stretches of chromosomes where Cs
are not methylated; these show the expected
proportions of CpG and are known as CpG islands
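
The expected-versus-observed arithmetic above can be written out as follows. The helper `cpg_observed_and_expected` is a hypothetical name; it treats the sequence as a single strand and assumes independent neighbouring bases when computing the expectation, exactly as the slide does.

```python
def cpg_observed_and_expected(seq: str):
    """Return (observed, expected) proportions of CG dinucleotides in seq."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("need at least two nucleotides")
    f_c = seq.count("C") / len(seq)
    f_g = seq.count("G") / len(seq)
    expected = f_c * f_g                          # independence assumption
    observed = seq.count("CG") / (len(seq) - 1)   # CG cannot overlap itself
    return observed, expected


# The human figures quoted on the slide.
expected_human = 0.21 * 0.21
observed_human = 0.008
ratio = observed_human / expected_human
print(round(expected_human, 4), round(ratio, 2))
```

On the slide's numbers, CpG appears at roughly one fifth of the frequency that base composition alone would predict.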
27
Nucleotide Frequencies
Both models assume that A, C, G, and T all occur
with equal frequency. We've already discussed
that this is not usually true
28
HKY model (1985)
Substitution probabilities (rows: from, columns: to)

From \ To   A      T      C      G
A           -      b·fT   b·fC   a·fG
T           b·fA   -      a·fC   b·fG
C           b·fA   a·fT   -      b·fG
G           a·fA   b·fT   b·fC   -

a is the probability of a transitional change per
year
b is the probability of a transversional change
per year
fX is the expected frequency of nucleotide X
Tamura-Nei model (1993) is an almost identical
variant where purine transitions and pyrimidine
transitions are allowed to have different rates
(a1 and a2)
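
The HKY table can be generated programmatically: each off-diagonal entry is a (for a transition) or b (for a transversion) multiplied by the frequency of the target nucleotide. This is a sketch only; the function names and the numeric values of `a`, `b`, and the frequencies are hypothetical.

```python
PURINES = {"A", "G"}
BASES = "ATCG"  # same order as the table on the slide


def is_transition(x: str, y: str) -> bool:
    """True when x and y are both purines or both pyrimidines."""
    return (x in PURINES) == (y in PURINES)


def hky_matrix(a: float, b: float, freqs: dict) -> dict:
    """Off-diagonal HKY entries keyed by (from_base, to_base)."""
    return {
        (src, dst): (a if is_transition(src, dst) else b) * freqs[dst]
        for src in BASES
        for dst in BASES
        if src != dst
    }


freqs = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}  # hypothetical frequencies
matrix = hky_matrix(a=0.004, b=0.001, freqs=freqs)
for src in BASES:
    print(src, [matrix.get((src, dst), "-") for dst in BASES])
```

With equal frequencies the table has the Kimura two-parameter pattern (every transition entry equal, every transversion entry equal), and setting a = b as well gives the Jukes-Cantor pattern.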
29
There are many other models
  • General reversible: each nucleotide pair has an
    independent rate
  • AG, AT, AC, GT, GC, TC are all separately modeled
  • 6 parameter model
  • Reversible because A → G = G → A, etc.
  • Non-reversible/unrestricted model: like above
    but without the assumption of reversibility
  • 12 parameter model

30
Site Equality
  • All of these models assume that every site has
    the same substitution rate.
  • Again, this is often not the case. Nucleotide
    sites might mutate at different rates depending on:
  • Coding vs. non-coding regions
  • Introns vs. exons
  • 1st vs 2nd vs 3rd codon positions
  • Local effects
  • Most of the models can be adapted by allowing
    site rate variation, usually using models based
    on the gamma distribution

31
Evolutionary Rate
Although it is obviously dependent on
circumstance, in mammals, the best estimate for
the overall average rate of substitution at
neutral sites is 2 × 10^-9 substitutions per site
per year
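
Combining this rate with the earlier relationship d = 2rt gives a rough way to turn a distance into a divergence time. The sketch below assumes a constant neutral rate across both lineages, which is a strong simplification; the function name is hypothetical.

```python
NEUTRAL_RATE = 2e-9  # substitutions per site per year, the figure quoted above


def divergence_time_years(d: float, r: float = NEUTRAL_RATE) -> float:
    """Time since divergence implied by distance d, from d = 2 * r * t."""
    if r <= 0:
        raise ValueError("rate must be positive")
    return d / (2 * r)


# A distance of 0.1 substitutions per site corresponds to
# roughly 25 million years at this rate.
print(divergence_time_years(0.1))
```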
32
Codons
For the most part we've discussed DNA sequence
evolution without regard to how DNA is processed
For coding genes, changes in DNA may lead to
changes in proteins
Synonymous substitutions: DNA substitutions in
coding sequence that do not change the amino acid
sequence
Nonsynonymous substitutions: DNA substitutions in
coding sequence that do change the amino acid
sequence
This is mediated through codons
33
Codons
Mutations at the 2nd codon position are always
nonsynonymous
Mutations at the 1st codon position are usually
nonsynonymous (exceptions: Leu, Arg)
Mutations at the 3rd codon position can be either
Substitution rate varies by position as expected:
3rd > 1st > 2nd
34
Codons
The codon pattern also explains an additional
reason for transition bias: transitions are more
likely to be synonymous than transversions
35
Codons
Sites in which any mutation is synonymous are
known as 4-fold degenerate. Example: GCx
Sites in which only transitions are synonymous are
known as 2-fold degenerate. Example: AGx
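
Degeneracy can be computed directly from the standard genetic code. This sketch builds the codon table from the conventional TCAG ordering; the function name `degeneracy` and its 0-based `position` argument are assumptions made for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def degeneracy(codon: str, position: int) -> int:
    """Number of bases at `position` (0, 1, or 2) that keep the amino acid unchanged."""
    amino_acid = CODON_TABLE[codon]
    return sum(
        1
        for base in BASES
        if CODON_TABLE[codon[:position] + base + codon[position + 1:]] == amino_acid
    )


print(degeneracy("GCT", 2))  # third position of GCx (Ala)
print(degeneracy("AGT", 2))  # third position of AGx (Ser / Arg)
print(degeneracy("GCT", 1))  # second position of the same Ala codon
```

A value of 1 means every change at that site is nonsynonymous; the count includes the original base.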
36
Amino Acid/Protein Sequences
Builds directly from DNA sequence evolution
p-distance is the proportion of sites that differ
between two amino acid sequences
p-distance will underestimate the true distance
for all of the same reasons it does with DNA
(convergence, etc.)
At its simplest, we can model with the assumption
that any amino acid can mutate into any other
amino acid with equal probability
This is known as the Poisson model (it is the
protein equivalent of the Jukes-Cantor model)
Using the same logic as JC, we find
[Equation not preserved in the transcript]
This is the Poisson distance
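
As a hedged illustration, the Poisson correction as it is commonly defined is d = -ln(1 - p), where p is the amino acid p-distance. Whether the slide displayed exactly this form is an assumption, and the function name is hypothetical.

```python
import math


def poisson_distance(p: float) -> float:
    """Poisson-corrected distance from an amino acid p-distance (assumed form)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return -math.log(1 - p)


for p in (0.1, 0.3, 0.5):
    # The correction grows faster than p as sequences diverge.
    print(p, poisson_distance(p))
```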
37
Amino Acid/Protein Sequences
  • The Poisson distance suffers from many of the
    same problems as Jukes-Cantor. It assumes:
  • All sites mutate with equal frequency
  • All amino acids mutate to each other with the
    same rate
  • etc.
  • The assumption of equal rates between amino
    acids is particularly problematic. Just examine
    the codon table

38
Amino Acid/Protein Evolution
Is a Proline equally likely to mutate into a
Threonine as it is to a Glycine?
39
Amino Acid/Protein Evolution
Interestingly, the coding table is not random.
Amino acids with similar physicochemical
properties (e.g., polarity or charge) tend to be
single mutational events apart, while those that
are more different require more steps
Evidence of selection: it is generally
preferable to replace an amino acid with one with
similar properties (less chance of negative
effect on the organism), and the evolution of the
coding table supports that logic
40
Amino Acid/Protein Evolution
Amino acids or codons are much more difficult to
model than DNA because of the greater complexity
Thus we often use empirical, data-based models
rather than theoretical models (such as HKY or
general reversible)
The original and most famous of these is known as
the PAM or Dayhoff model
41
PAM/Dayhoff Matrix
          A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
Ala A  9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
Arg R     1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
Asn N     4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
Asp D     6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
Cys C     1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Gln Q     3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
Glu E    10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
Gly G    21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
His H     1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
Ile I     2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
Leu L     3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
Lys K     2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
Met M     1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
Phe F     1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
Pro P    13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
Ser S    28   11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5    2    2
Thr T    22    2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0    2    9
Trp W     0    2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0 9976    1    0
Tyr Y     1    0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2 9945    1
Val V    13    2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0    2 9901

The probability of specific amino acid
substitutions (multiplied by 10,000), scaled to
every 100 total substitutions
42
Amino Acid/Protein Evolution
There are now many variants of this matrix, built
with newer and more extensive data, including such
things as PAM 1, PAM 120, PAM 250, BLOSUM 1, BLOSUM
62, BLOSUM 45, etc. These are all based on the
same principle, but are designed for different
organisms or different expected divergences
These matrices can be used to model protein
evolution without direct reference to the
underlying DNA or codons
43
Other Evolutionary Events
  • Insertions and Deletions
  • Chunks of DNA can be added or removed from the
    chromosome; these can be as small as a single
    base or as large as hundreds of thousands
  • With only a few sequences it can be very
    difficult to distinguish whether a change in
    length was due to an insertion or deletion; thus
    they are often referred to with the combined term
    Indel

Before: ACTCGTCATCGACTTAACGACTCATCGTA
After (2 base deletion, 7 base insertion):
        ACGTCATCGACCGTACGTTTAACGACTCATCGTA
44
Other Evolutionary Events
  • Inversions
  • Chunks of DNA can be flipped into reverse order

ACTCGTCATCGACTTAACGACTCATCGTA
ACTCGTCATATTCAGCACGACTCATCGTA
45
Other Evolutionary Events
  • Segmental duplications
  • Large chunks of DNA can be accidentally repeated
  • When this happens on a large scale, it leads to
    multiple copies of entire genes and gene complexes

ACTCGTCATCGACTTAACGACTCATCGTA
ACTCGTCATCGACTTACGACTTAACGACTCATCGTA
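
The indel, inversion, and segmental duplication examples on these slides can be reproduced with plain string slicing. The coordinates below are inferred from comparing the before and after sequences and are an assumption (other placements of the 2 base deletion give the same result); the helper names are hypothetical. The inversion reverses the segment without complementing it, to match the slide's example.

```python
ORIGINAL = "ACTCGTCATCGACTTAACGACTCATCGTA"


def delete(seq: str, start: int, length: int) -> str:
    """Remove `length` bases beginning at 0-based index `start`."""
    return seq[:start] + seq[start + length:]


def insert(seq: str, position: int, fragment: str) -> str:
    """Insert `fragment` before 0-based index `position`."""
    return seq[:position] + fragment + seq[position:]


def invert(seq: str, start: int, end: int) -> str:
    """Reverse the order of seq[start:end], as drawn on the inversion slide."""
    return seq[:start] + seq[start:end][::-1] + seq[end:]


def duplicate(seq: str, start: int, end: int) -> str:
    """Repeat seq[start:end] immediately after itself (tandem duplication)."""
    return seq[:end] + seq[start:end] + seq[end:]


indel = insert(delete(ORIGINAL, 2, 2), 11, "CGTACGT")
inverted = invert(ORIGINAL, 9, 16)
duplicated = duplicate(ORIGINAL, 9, 16)
print(indel)
print(inverted)
print(duplicated)
```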
46
Other Evolutionary Events
  • Duplications also can occur through polyploidy
  • Polyploidy is the duplication of the entire
    chromosome set
  • A diploid organism has 2n chromosomes
  • A breakdown in meiosis can lead to a tetraploid
    with 4n chromosomes
  • This has been very common in plant evolution, but
    can also occur in animals
  • There is strong evidence this happened at least
    twice in the early history of vertebrates

47
Duplicated Genes
  • What happens if a gene is duplicated (whether due
    to polyploidy or a segmental duplication)?
  • We went from one copy of a gene to two copies
  • At the beginning the genes are identical:
    redundancy
  • Mutations will gradually alter the genes
  • Possible outcomes
  • One gene will become non-functional (a
    pseudogene); at this point selection acts to
    maintain the other gene (assuming it is
    necessary for life)
  • This is the most common outcome of gene
    duplication
  • One gene will gain a new useful function; now
    both genes are potentially maintained by
    selection
  • This is the rarest outcome of gene duplication

48
Duplicated Genes
  • Most genes have more than one function
  • A pair of duplicated genes can diverge so that
    each takes over a different function
  • If this happens, selection will then work to
    preserve both genes because each can serve a
    specialized function when the original gene
    served as a generalist

49
Duplicated Genes
[Figure: Gene A with Promoter 1 and Promoter 2]
Each promoter serves a different function
50
Gene Families
  • A gene family is a set of genes related to each
    other through evolutionary duplication events;
    they often serve similar functions and have
    similar DNA sequences
  • Although usually defined within a species, gene
    families can cross multiple species as well
  • Gene families can contain anywhere from two to
    hundreds (maybe thousands) of genes
  • Examples
  • Hox gene family: a set of regulatory genes very
    heavily involved in the control of animal body
    plans
  • Olfactory receptors: pretty much all olfactory
    receptor proteins are part of a single gene family

51
Homology
Homology: similarity due to inheritance from a
common ancestor
Homology is a critical concept in evolutionary
biology; it is also an extremely important concept
(although often not recognized as such) in
bioinformatics
Similarity which is NOT due to inheritance from a
common ancestor is analogy. Analogy can be due to
things such as convergence or parallel evolution
52
Homology Example: Tetrapod Limbs
53
Homology Example: Mammalian Necks
Mammals have 7 cervical vertebrae in their necks
[Figure: cervical vertebrae numbered 1 to 7 in a human and a giraffe]
54
Homology Example: Mammalian Necks - Exceptions
Only exceptions:
Manatees have 6 cervical vertebrae
Two-toed sloths have 6
Three-toed sloths have 9
Do you think the presence of six cervical
vertebrae is homologous in manatees and two-toed
sloths?
55
Are these homologous?
  • Bat wings and bee wings?
  • As wings?
  • As tetrapod fore-limbs?
  • Bird wings and bat wings?

To answer this question you must ask: did the
common ancestor have this trait?
56
Sequence Homology
  • With respect to sequences, homology has three
    distinct, perfectly valid meanings
  • Sequences are homologous if they are descended
    from a common ancestor
  • 16S rRNA is found in almost all living organisms;
    it is homologous among all living things because
    they all inherited it from their common ancestor
  • Between a pair of sequences, specific sites are
    said to be homologous if the position within the
    sequence is the same as in the common ancestor
  • Nucleotides (or amino acids) between a pair of
    sequences are said to be homologous if they are
    (a) at a homologous site, (b) show the same
    character (e.g., both sequences have adenine),
    and (c) they both have that character because it
    was inherited from the common ancestor

57
Sequence Homology
These are unaligned sequences; homologous sites
are not in the same column
ACTCGTCATCGACTTAACGACTCATCGTA
ACGACCTCGTCCGTACGTTTACCGAATCATCCTA
58
Sequence Homology
These are aligned sequences; homologous sites
are in the same column
Homologous characters are marked in yellow
ACTCGTCATCGAC-------TTAACGACTCATCGTA
A--CGACCTCGTCCGTACGTTTACCGAATCATCCTA
If we actually knew the ancestral sequence it
might turn out that even some of these are not
actually homologous!
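
Once sequences are aligned, each column can be classified by simple comparison. The sketch below counts matching columns, mismatching columns, and columns containing a gap for the two aligned sequences on this slide; the function name is an assumption, and a matching column is only a candidate homologous character, as the caveat above notes.

```python
ALIGNED_1 = "ACTCGTCATCGAC-------TTAACGACTCATCGTA"
ALIGNED_2 = "A--CGACCTCGTCCGTACGTTTACCGAATCATCCTA"


def column_summary(seq_a: str, seq_b: str):
    """Return (matches, mismatches, gap_columns) for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gap_columns = 0
    for x, y in zip(seq_a, seq_b):
        if x == "-" or y == "-":
            gap_columns += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gap_columns


matches, mismatches, gap_columns = column_summary(ALIGNED_1, ALIGNED_2)
# p-distance over the columns where both sequences have a nucleotide.
p = mismatches / (matches + mismatches)
print(matches, mismatches, gap_columns, round(p, 3))
```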
59
Sequence Homology
Although all homologous sequences are similar due
to common inheritance, they can actually be
homologous due to two separate mechanisms:
speciation or duplication
Sequences which are homologous due to speciation
events are known as orthologous sequences or
orthologs
Sequences which are homologous due to duplication
events are known as paralogous sequences or
paralogs
This can make for complicated relationships among
sequences
60
Sequence Homology
[Figure: gene tree drawn against time, starting from the ancestral gene (Gene A1) in the ancestral species]
  • Genes A2 and B2 are paralogs (related through
    duplication)
  • Genes A2 and A3 are orthologs (related through
    speciation)
  • Genes B2 and A3 are orthologs (related through
    speciation)

61
Sequence Homology
[Figure: gene tree drawn against time, starting from the ancestral gene (Gene A1) in the ancestral species]
  • Genes A2 and A3 are orthologs (related through
    speciation)
  • Genes B2 and B3 are orthologs (related through
    speciation)
  • Genes A2 and B2 are paralogs (related through
    duplication)
  • Genes A3 and B3 are paralogs (related through
    duplication)
  • Genes A2 and B3 are paralogs (related through
    duplication)
  • Genes B2 and A3 are paralogs (related through
    duplication)
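
The relationships listed above follow one rule: two genes are orthologs if their most recent common ancestor in the gene tree is a speciation event, and paralogs if it is a duplication. The sketch below encodes a tree consistent with this list (a duplication first, then a speciation in each copy) as nested tuples; the tree encoding and function names are assumptions made for illustration.

```python
# Internal nodes are (event, left_subtree, right_subtree); leaves are gene names.
GENE_TREE = (
    "duplication",
    ("speciation", "A2", "A3"),
    ("speciation", "B2", "B3"),
)


def leaves(node) -> set:
    """All gene names found below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def relationship(node, gene_x: str, gene_y: str) -> str:
    """Classify two genes by the event at their most recent common ancestor."""
    event, left, right = node
    for child in (left, right):
        # Descend while both genes still sit under the same child subtree.
        if not isinstance(child, str) and {gene_x, gene_y} <= leaves(child):
            return relationship(child, gene_x, gene_y)
    return "orthologs" if event == "speciation" else "paralogs"


for pair in [("A2", "A3"), ("B2", "B3"), ("A2", "B2"), ("A2", "B3")]:
    print(pair, relationship(GENE_TREE, *pair))
```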

62
Sequence Homology
[Figure: gene tree drawn against time, starting from the ancestral gene (Gene A1) in the ancestral species]
Genes A2 and B3 appear to be orthologs, but they
are actually paralogs. Note that the time of
divergence of A2 and B3 predates the speciation
event between species 2 and 3