Title: Slide 1
1http://creativecommons.org/licenses/by-sa/2.0/
2Multiple Alignments Molecular Evolution
Prof. Rui Alves
ralves_at_cmb.udl.es
973702406
Dept. Ciencies Mediques Basiques, 1st Floor, Room 1.08
Website of the Course: http://web.udl.es/usuaris/pg193845/Courses/Bioinformatics_2007/
Course: http://10.100.14.36/Student_Server/
3Part I Multiple Alignments
4Pairwise Alignment
- We have seen how pairwise alignments are made.
- Dynamic programming creates efficient algorithms for finding the optimal alignment.
- Break problem into smaller subproblems
- Solve subproblems optimally, recursively
- Use optimal solutions to construct an optimal solution for the original problem
- Alignments require a substitution (scoring) matrix and gap penalties.
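The dynamic programming idea can be sketched with a minimal global alignment scorer in the style of Needleman-Wunsch. This is only an illustration: the function name and the match, mismatch, and gap values are assumptions chosen for the example, not values from the course, and a real aligner would use a substitution matrix instead of a flat match/mismatch score.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    return score[-1][-1]
```

Each cell reuses the optimal scores of three smaller subproblems, which is the "solve subproblems optimally, then combine" step described above.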
5Sub. Matrix Basic idea
- Probability of substitution (mutation)
6PAM Matrices
- A family of matrices (PAM-N)
- Based upon an evolutionary model
- The score for a substitution of nucleotides/amino acids is based on how much we expect that substitution to be observed after a certain length of evolutionary time
- The scores are derived using a Markov model, i.e., the probability that one amino acid will change to another is not affected by changes that occurred at an earlier stage of evolutionary history
7Nucleic acid PAM matrices
- PAM = point accepted mutation
- 1 PAM = 1% probability of mutation at each sequence position.
- A uniform PAM1 matrix for a family of closely related sequences:
     A        G        T        C
A    0.99     0.00333  0.00333  0.00333
G    0.00333  0.99     0.00333  0.00333
T    0.00333  0.00333  0.99     0.00333
C    0.00333  0.00333  0.00333  0.99
8How did they get the values for PAM-1?
- Look at 71 groups of protein sequences where the proteins in each group are at least 85% similar (Why these groups?)
- Compute relative mutability of each amino acid (probability of change)
- From relative mutability, compute mutability probability for each amino acid pair X,Y (probability that X will change to Y over a certain evolutionary time)
9Transitions and transversions
- Transitions (A ↔ G or C ↔ T) are more likely than transversions (e.g., A ↔ T or G ↔ C)
- Assume that transitions are three times as likely
     A      G      T      C
A    0.99   0.006  0.002  0.002
G    0.006  0.99   0.002  0.002
T    0.002  0.002  0.99   0.006
C    0.002  0.002  0.006  0.99
10PAM-N Matrices
- N is a measure of evolutionary distance
- PAM-1 is modeled on an estimate of how long in evolutionary time it would take one amino acid out of 100 to change. That length of time is called 1 PAM unit, roughly 10 million years (abbreviated my).
- Values in a PAM-1 matrix show the probability that an amino acid will change over 10 my.
- To get the PAM-N matrix for any N, multiply PAM-(N-1) by PAM-1.
11Distant relatives
- If a family of proteins is, say, 80% homologous, use a PAM 2.
     A         G         T         C
A    0.98014   0.011888  0.003984  0.003984
G    0.011888  0.98014   0.003984  0.003984
T    0.003984  0.003984  0.98014   0.011888
C    0.003984  0.003984  0.011888  0.98014
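The PAM 2 values can be reproduced by multiplying the transition-biased PAM1 matrix by itself. The sketch below uses plain Python lists; the helper names `matmul` and `pam` are illustrative only.

```python
BASES = "AGTC"  # row and column order used in the slides

# PAM1 matrix with transitions three times as likely as transversions
PAM1 = [
    [0.99, 0.006, 0.002, 0.002],
    [0.006, 0.99, 0.002, 0.002],
    [0.002, 0.002, 0.99, 0.006],
    [0.002, 0.002, 0.006, 0.99],
]

def matmul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam(n, pam1=PAM1):
    """PAM-N obtained by multiplying PAM1 by itself n - 1 times."""
    result = pam1
    for _ in range(n - 1):
        result = matmul(result, pam1)
    return result

pam2 = pam(2)
for base, row in zip(BASES, pam2):
    print(base, [round(value, 6) for value in row])
```

Every row of the result still sums to 1, as expected for a matrix of substitution probabilities.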
12Computing Relative Mutability: A Measure of the Likelihood that an Amino Acid Will Mutate
- For each amino acid:
- changes = number of times the amino acid changed into something else
- exposure to mutation = (percentage occurrence of the amino acid in the group of sequences being analyzed) × (frequency of amino acid changes in the group)
- relative mutability = (changes / exposure to mutation) / 100
13Computing Relative Mutability of A
changes = times A changes into something else = 4
occurrence of A in group = 10 / 63 = 0.159
frequency of all amino acid changes in group = 6 × 2 = 12 (Note: Count changes backwards and forwards.)
exposure to mutation = (occurrence of A in group) × (frequency of all amino acid changes in group) = 12 × 0.159
relative mutability = (changes / exposure to mutation) / 100 = (4 / (12 × 0.159)) / 100 = 2.09 / 100 = 0.0209
Example from Fundamental Concepts of Bioinformatics by Krane and Raymer.
14How can we understand relative mutability intuitively?
relative mutability = changes / exposure to mutation = the number of times A changed in proportion to the probability that it COULD have changed.
Exposure to mutation: there were 6 times when something changed in the tree. Each time, that change could have been A changing to something else, or something else changing to A, giving 12 chances for a change involving A. But A appears in a sequence only 0.159 of the time.
15Computing Mutability Probability Between Amino Acid Pairs
- For each pair of amino acids X and Y:
- r = relative mutability of X
- c = number of times X becomes Y or vice versa
- p = number of changes involving X
- mutability probability of X to Y = (r × c) / p
16Computing Mutability Probability that A will change to G
r = relative mutability of A = 0.0209
c = number of times A becomes G or vice versa = 3
p = number of changes involving A = 4
mutability probability of A to G = (r × c) / p = (0.0209 × 3) / 4 = 0.0156
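The two worked examples (relative mutability of A, then the mutability probability of A to G) can be checked with a few lines of Python. The function and argument names are illustrative. The inputs are the counts from the example, and the results differ from the slide values only in the last digit because the slides truncate intermediate results.

```python
def relative_mutability(changes, occurrence, total_changes):
    """(changes / exposure to mutation) / 100, as defined in the slides."""
    exposure = occurrence * total_changes
    return (changes / exposure) / 100

def mutability_probability(r, c, p):
    """Mutability probability of X to Y: (r * c) / p."""
    return r * c / p

# A changed 4 times, occurs with frequency 0.159, and the tree has 12 counted changes
r_A = relative_mutability(changes=4, occurrence=0.159, total_changes=12)

# A and G were exchanged 3 times out of the 4 changes involving A
p_AG = mutability_probability(r_A, c=3, p=4)

print(round(r_A, 4), round(p_AG, 4))
```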
17Normalizing Mutability Probability, X to Y
- For each Y among all amino acids, compute mutability probability of X to Y as described above
- Get a total of these 20 probabilities. Divide them by a normalizing factor such that the probability that X will NOT change is 99% and the sum of probabilities that it will change to any other amino acid is 1%
- These are the numbers that go in the PAM-1 matrix!
18Converting Mutability Probabilities to Log Odds
Score for X to Y
- Compute the relative frequency of change for X to Y as follows:
- Get the X to Y mutability probability
- Divide by the frequency of X in the sequence data
- Convert to log base 10, multiply by 10
- In our example, we get log10(0.0156/0.1587) = log10(0.098)
- To compute log10(0.098), solve for x: 10^x = 0.098, so x = -1.01, since 10^-1.01 = 1/10^1.01 = 0.098
- Compute log odds score for Y to X
- Take the average of these two values
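The log-odds conversion can be sketched as follows. The function names are illustrative. Only the A to G values appear in the example, so the symmetric average is shown as a helper without inventing the G to A numbers.

```python
import math

def log_odds(p_xy, freq_x):
    """10 * log10 of the X to Y mutability probability over the frequency of X."""
    return 10 * math.log10(p_xy / freq_x)

def symmetric_log_odds(score_xy, score_yx):
    """Average of the X to Y and Y to X scores, as the final matrix entry."""
    return (score_xy + score_yx) / 2

# Values from the worked example: P(A to G) = 0.0156, frequency of A = 0.1587
score_ag = log_odds(0.0156, 0.1587)
print(round(score_ag, 2))
```

The result is negative because the ratio 0.098 is below 1.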
19Usefulness of Log Odds Scores
- A score of 0 indicates that the change from one amino acid to another is what is expected by chance
- A negative score means that the change is observed less often than expected by chance
- A positive score means that the change is observed more often than expected by chance
- Because the scores are in log form, they can be added (e.g., for the chance that X will change to Y and then Y to Z)
20Disadvantages of PAM Matrices
- An alignment tree must be constructed first, implying some circularity in the analysis
- The original PAM-1 matrix was based on a limited number of families, not necessarily representative of all protein families
- The Markov model does not take into account that multi-step mutations should be treated differently from single-step ones
21Most Commonly-Used Amino Acid Substitution Matrices
- PAM (Percent Accepted Mutation, also called Dayhoff Amino Acid Substitution Matrix)
- BLOSUM (BLOcks amino acid SUbstitution Matrix)
22BLOSUM Scoring Matrices
- Based on a larger set of protein families than PAM (about 500 families). The proteins in the families are known to be biochemically related.
- Focuses on blocks of conserved amino acid patterns in these families
- Designed to find conserved domains in protein families
- BLOSUM matrices with lower numbers are more useful for scoring matches in pairs that are expected to be less closely related through evolution; e.g., BLOSUM50 is used for more distantly-related proteins than BLOSUM62. (This is the opposite of the PAM matrices.)
23BLOSUM Matrices
- Target frequencies are identified directly and not by extrapolation
- Sequences more than x% identical are collapsed into a single sequence
- BLOSUM 50: >50% identity
- BLOSUM 62: >62% identity
24Building a BLOSUM Matrix
- BLOSUM 62
- Collapse sequences that have more than 62% identity into one
- Calculate probability of a given pair of AAs being in same column (qij)
- Calculate the frequency of a given AA (fi)
- Calculate log odds ratio sij = log2(qij / (fi fj)), i.e., the observed pair frequency relative to the frequency expected by chance. This is the value that goes into the BLOSUM matrix
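A minimal sketch of the log-odds step on a toy block is given below. It assumes the sequences have already been clustered, counts unordered residue pairs per column, and takes the chance expectation of a pair from the individual residue frequencies (doubled for pairs of different residues, since either order counts). The function name and the toy block are illustrative. Published BLOSUM matrices additionally scale and round the scores, which is omitted here.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_scores(block):
    """Log-odds scores from a list of equal-length aligned sequences."""
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    # q[(a, b)]: observed frequency of the unordered pair in a column
    q = {pair: n / total for pair, n in pair_counts.items()}
    # f[a]: residue frequency implied by the observed pairs
    f = Counter()
    for (a, b), q_ab in q.items():
        f[a] += q_ab / 2
        f[b] += q_ab / 2
    scores = {}
    for (a, b), q_ab in q.items():
        expected = f[a] * f[b] if a == b else 2 * f[a] * f[b]
        scores[(a, b)] = math.log2(q_ab / expected)
    return scores

toy_block = ["AAS", "AAS", "ASS"]  # hypothetical block of three sequences
scores = blosum_scores(toy_block)
```

In this toy block identical pairs score above zero and the A/S pair scores below zero, because A and S share a column less often than their frequencies alone would predict.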
25BLOSUM50
26Most Commonly-Used Amino Acid Substitution Matrices
- PAM (Percent Accepted Mutation, also called Dayhoff Amino Acid Substitution Matrix)
- BLOSUM (BLOcks amino acid SUbstitution Matrix)
- Gonnet (Matrix derived from alignments performed using the PAM series)
27What matrix to choose?
- BLOSUM Matrices perform better in local similarity searches
- BLOSUM 62 is the default matrix used for database searching
28Gap Penalty
29Gap Penalties
- Gaps in the alignment are necessary to increase score.
- They must be penalized; however, if the penalty is too high, no gaps will appear
- On the other hand, if the penalty is too low, gaps everywhere!!!
- The default settings of programs are usually OK for their default scoring matrices
30Once a gap, can we widen it?
>gi|729942|sp|P40601|LIP1_PHOLU Lipase 1 precursor (Triacylglycerol lipase)
Length = 645
Score = 33.5 bits (75), Expect = 5.9
Identities = 32/180 (17%), Positives = 70/180 (38%), Gaps = 9/180 (5%)

Query 2038 IYSLYGLYNVPYENLFVEAIASYSDNKIRSKSRRVIATTLETVGYQTANGKYKSESYTGQ 2097
Sbjct 441  VFTAYGLWRY-YDKGWISGDLHYLDMKYEDITRGIVLNDW----LRKENASTSGHQWGGR 495

Query 2098 LMAGYTYMMPENINLTPLAGLRYSTIKDKGYKETGTTYQNLTVKGKNYNTFDGLLGAKVS 2157
Sbjct 496  ITAGWDIPLTSAVTTSPIIQYAWDKSYVKGYRESGNNSTAMHFGEQRYDSQVGTLGWRLD 555

Query 2158 SNINVNEIVLTPELYAMVDYAFKNKVSAIDARLQGMTAPLPTNSFKQSKTSFDVGVGVTA 2217
Sbjct 556  TNFG----YFNPYAEVRFNHQFGDKRYQIRSAINSTQTSFVSESQKQDTHWREYTIGMNA 611

Real gaps are often more than one letter long.
31Affine gap penalty
LETVGY
W----L
Penalties for the four gap positions: -5 -1 -1 -1
- Separate penalties for gap opening and gap extension.
- This requires modifying the DP algorithm to store three values in each box.
32Scoring Gap Penalties
- Linear Gap Penalty Score
- Affine Penalty Score
- Opening a gap is costly; extending it not so much (open = 12, extension = 1)
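The difference between the two schemes can be sketched in a few lines. The sketch assumes the convention in which the opening penalty is charged for the first gap position and the extension penalty for every further position. Other programs charge open plus extend for the first position, so check each tool's documentation. The function names are illustrative.

```python
def linear_gap_penalty(length, per_position=1):
    """Every gap position costs the same amount."""
    return length * per_position

def affine_gap_penalty(length, gap_open=12, gap_extend=1):
    """First gap position costs gap_open, each further one gap_extend."""
    if length == 0:
        return 0
    return gap_open + (length - 1) * gap_extend

# A gap of length 4 with open = 5 and extension = 1 costs 5 + 1 + 1 + 1
print(affine_gap_penalty(4, gap_open=5, gap_extend=1))
```

With a linear penalty of 5 per position the same gap would cost 20, which is why affine penalties favor a few long gaps over many short ones.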
33Multiple Sequence Alignment
34MSA Introduction
- Goal of protein sequence alignment
- To discover biological (structural / functional) similarities
- If sequence similarity is weak, pairwise alignment can fail to identify important features (e.g., interaction residues)
- Simultaneous comparison of many sequences often finds similarities that are invisible in pairwise alignment (PA).
35Why do we care about sequence alignment?
- Identify regions of a gene (or protein) susceptible to mutation and regions where residue replacement does not change function.
- Information about the evolution of organisms.
- Orthologs are genes that are evolutionarily related, have a similar function, but now appear in different species.
- Homologous genes (genes that share an evolutionary origin) have similar sequences.
- Paralogs are evolutionarily related (share an origin) but no longer have the same function.
- You can uncover either orthologs or paralogs through sequence alignment.
36Multiple Sequence Alignment
- Often applied to proteins (not very good with DNA)
- Proteins that are similar in sequence are often similar in structure and function
- Sequence changes more rapidly in evolution than do structure and function.
37Work with proteins! If at all possible
- Twenty match symbols versus four, plus similarity! Way better signal to noise.
- Also guarantees no indels are placed within codons. So translate, then align.
- Nucleotide sequences will only reliably align if they are very similar to each other. And they will require extensive hand editing and careful consideration.
38Overview of Methods
- Dynamic programming: too computationally expensive to do a complete search; uses heuristics
- Progressive: starts with pair-wise alignment of most similar sequences; adds to that (LOCAL OPTIMIZATION)
- Iterative: makes an initial alignment of groups of sequences, adds to these (e.g., genetic algorithms) (GLOBAL OPTIMIZATION)
- Locally conserved patterns
- Statistical and probabilistic methods
39Dynamic Programming
- Computational complexity even worse than for pair-wise alignment because we're finding all the paths through an n-dimensional hyperspace (Remember the matrix; now add many dimensions)
- Can align fewer than 20 relatively short (200-300 residues) protein sequences in a reasonable amount of time; not much beyond that
40A Heuristic for Reducing the Search Space in
Dynamic Programming
- Consider the pair-wise alignments of each pair of sequences.
- Create alignments from these scores.
- Consider a multiple sequence alignment built from the individual pairwise alignments.
- These alignments circumscribe a space in which to search for a good (but not necessarily optimal) alignment of all n sequences.
41The details
- Create an alignment of alignments (AOA) based on pair-wise alignments (Pairs of sequences that have the best scores are paired first in the tree.)
- Do a first-cut msa by incrementally doing pair-wise alignments in the order of alikeness of sequences as indicated by the AOA. Most alike sequences aligned first.
- Use the pair-wise alignments and the first-cut msa to circumscribe a space within which to do a full msa that searches through this solution space.
- The score for a given alignment of all the sequences is the sum of the scores for each pair, where each of the pair-wise scores is multiplied by a weight indicating how far the pair-wise score differs from the first-cut msa alignment score.
42Heuristic Dynamic Programming Method for MSA
- Does not guarantee an optimal alignment of all the sequences in the group.
- Does get an optimal alignment within the space chosen.
43Progressive Methods
- Similar to dynamic programming method in that it uses the first step (i.e., it creates an AOA, aligns the most-alike pair, and incrementally adds sequences to the alignment.)
- Differs from dynamic programming method for MSA in that it doesn't refine the first-cut MSA by doing a full search through the reduced search space. (This is the computationally expensive part of DP MSA.)
45Progressive Method the details
- Generally proceeds as follows:
- Choose a starting pair of sequences and align them
- Align each next sequence to those already aligned, one at a time
- Heuristic method: doesn't guarantee an optimal alignment
- Details vary in implementation:
- How to choose the first sequence to align?
- Align all subsequent sequences cumulatively or in subfamilies?
- How to score?
46ClustalW
- Based on phylogenetic analysis
- An AOA is created using a pairwise distance matrix and the neighbor-joining algorithm
- The most closely-related pairs of sequences are aligned using dynamic programming
- Each of the alignments is analyzed and a profile of it is created
- Alignment profiles are aligned progressively for a total alignment
- W in ClustalW refers to a weighting of scores depending on how far a sequence is from the root on the AOA
48Once a gap, always a gap
49Basic Steps in Progressive Alignment
Once a gap, always a gap
51ClustalW Procedure
AOA
52Problems with Progressive Method
- Highly sensitive to the choice of initial pair to align. If they aren't very similar, it throws everything off.
- It's not trivial to come up with a suitable scoring matrix or gap penalties.
53Part II Molecular Evolution
54Theory of Evolution
- Evolution is the theory that allows us to understand how organisms came to be how they are
- In probabilistic terms, it is likely that all living beings today have originated from a single type of cell
- These cells divided and occupied ecological niches, where they adapted to the new environments through natural selection
55How did the first cell create different cells?
Neutral Mutation (e.g. by error in genome
replication)
58How did the first cell create different cells?
Deleterious Mutation (e.g. by error in genome
replication)
61How did the first cell create different cells?
Advantageous Mutation (e.g. by error in genome
replication)
63And then there was sex
64Why Sex???
- Asexual reproduction is quicker, easier → more offspring/individual.
- Sex may limit harmful mutations
- Asexual: all offspring get all mutations
- Sexual: Random distribution of mutations. Those with the most harmful ones tend not to reproduce.
- Generate beneficial gene combinations
- Adaptation to changing environment
- Adaptation to all aspects of constant environment
- Can separate beneficial mutations from harmful ones
- Sample a larger space of gene combinations
65What drives cells to adapt?
New Niche/ New conditions in old niche
66What drives cells to adapt?
New (better adapted) mutation
67How do New Genes and Proteins appear?
- Genes (Proteins) are built by combining domains
- New proteins may appear either by intradomain mutation or by combining existing domains of other proteins
Cell Division
68The Coalescent
- This model of cellular evolution has implications for molecular evolution
- Coalescent Theory
- a retrospective model of population genetics that traces all alleles of a gene in a sample from a population to a single ancestral copy shared by all members of the population, known as the most recent common ancestor
69Why is the coalescent the de facto standard today?
Alternatives?
- Current sequences have evolved from the same original sequence (Coalescent)
- Current sequences have converged to a similar sequence from multiple origins of life
70Back of the envelope support for ?
Back of the envelope support for divergence
71About the mutational process
- Point mutations
- Transitions (A↔G, C↔T) are more frequent than transversions (all other substitutions)
- In mammals, the CpG dinucleotide is frequently mutated to TG or CA (possibly related to the fact that most CpG dinucleotides are methylated at the C-residues)
- Microsatellites frequently increase or decrease in size (possibly due to polymerase slippage during replication)
- Gene and genome duplications (complete or partial), may lead to:
- pseudogenes: function-less copies of genes which rapidly accumulate (mostly deleterious) mutations, useful for estimating mutation rates!
- new genes after functional diversification
- Chromosomal rearrangements (inversions and translocations), may lead to:
- meiotic incompatibilities, speciation
- Estimated mutation rates
- Human nuclear DNA: 3-5 × 10^-9 per year
- Human mitochondrial DNA: 3-5 × 10^-8 per year
- RNA and retroviruses: 10^-2 per year
72Consequences of the coalescent model?
73So what if we accept the coalescent model?
A1  TSRISEIRR
A2  TSRISEIRR
A3  TSRISEIRR
A4  TSRISEIRR
A5  TSRISEIRR
A6  TSRISEIRR
A7  PSRISEIRR
A8  PKRISEVRR
A9  PKRISEVRR
A10 PQRISAIQR
A11 PQRISAIQR
A12 PQRISTIQR
A13 PQRISTIQR
A14 ASHLHNLQR
A15 TKHLQELQRE
A16 TKHLQELQRE
A17 TKHLQELQRE
A18 SKHLHELQRD
A19 PKNLHELQKD
A20 SKRLHEVQSE

A1-6   TSRISEIRR
A7     PSRISEIRR
A8-9   PKRISEVRR
A10-11 PQRISAIQR
A12-13 PQRISTIQR
A14    ASHLHNLQR
A15-17 TKHLQELQR
A18    SKHLHELQR
A19    PKNLHELQK
A20    SKRLHEVQS
74So what if we accept the coalescent model?
A1-6   TSRISEIRR
A7     PSRISEIRR
A8-9   PKRISEVRR
A10-11 PQRISAIQR
A12-13 PQRISTIQR
A14    ASHLHNLQR
A15-17 TKHLQELQR
A18    SKHLHELQR
A19    PKNLHELQK
A20    SKRLHEVQS
[Tree figure: A1-6 and A7 merge into A1-7; A10-11 and A12-13 merge into A10-13]
75So what if we accept the coalescent model?
A1-7    (p-t)  SRIS  E      IRR
A8-9    P      KRIS  E      VRR
A10-13  P      QRIS  (a-t)  IQR
A14     A      SHLH  N      LQR
A15-17  T      KHLQ  E      LQR
A18     S      KHLH  E      LQR
A19     P      KNLH  E      LQK
A20     S      KRLH  E      VQS
        4      3324  5      323
The study of sequence alignments can give information about the evolution of the different organisms!
76Phylogenetic trees
A tree is a graph reflecting the approximate distances between a set of objects. A tree is also called a dendrogram. There are different types of trees:
- Unrooted versus rooted trees: A rooted tree has an additional node representing the origin, in molecular phylogeny the last common ancestor of the sequences analyzed. In general, the root cannot be directly inferred from the data. It may be inferred from the paleontological record, from a trusted outlier, or on the basis of the molecular clock hypothesis.
- Scaled and unscaled trees: In an unscaled tree, the lengths of the branches are not important. Only the topology counts. In phylogeny, trees are usually scaled.
- Binary trees: each node branches into two daughter nodes. Other trees are usually not considered in phylogeny as they can easily be approximated by binary trees with very short edges between nodes.
Note: A rooted (or unrooted) tree connecting n objects (leaves) has 2n-1 (or 2n-2) nodes altogether and 2n-2 (or 2n-3) edges.
77Phylogenetic trees
[Figure: three example trees over leaves 1-5 with inferred nodes 6-8: a rooted tree; a rooted tree satisfying the molecular clock hypothesis (all leaves at the same distance from the root, drawn against a time axis); and an unrooted tree]
Note: 1-5 are called leaves, or leaf nodes. 6-8 are inferred nodes corresponding to ancestral species or molecules. Branches are also called edges. The edge lengths reflect evolutionary distances.
78Phylogenetic tree reconstruction, overview
- Computational challenge: There is an enormous number of different topologies even for a relatively small number of sequences:
- 3 sequences: 1
- 4 sequences: 3
- 5 sequences: 15
- 10 sequences: 2,027,025
- 20 sequences: 221,643,095,476,699,771,875
- Consequence: Most tree construction algorithms are heuristic methods not guaranteed to find the optimal topology.
- Input data for two major classes of algorithms:
- 1. Input data distance matrix; examples: UPGMA, neighbor-joining
- 2. Input data multiple alignment; examples: parsimony, maximum likelihood
- Distance matrix methods use distances computed from pairwise or multiple alignments as input.
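The topology counts listed above match the number of unrooted binary trees on n leaves, which is the product of the odd numbers from 3 up to 2n-5. A short sketch (the function name is illustrative) reproduces them:

```python
def unrooted_topologies(n):
    """Number of unrooted binary tree topologies for n >= 3 labeled leaves."""
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd numbers 3, 5, ..., 2n - 5
        count *= k
    return count

for n in (3, 4, 5, 10, 20):
    print(n, unrooted_topologies(n))
```

Each added sequence can attach to any existing edge, so the count multiplies by the current number of edges, and exhaustive search quickly becomes impractical.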
79Building phylogenetic trees of proteins
[Figure: Genome 1, Genome 2, Genome 3, ...]
80Distance based phylogenetic trees
A1 ACTDEEGGGGSRGHI
A2 A-TEEDGGAASRGHI
A3 ACFDDEGGGGSRGHL
Pairwise differences:
A1 vs A2: 5 substitutions
A1 vs A3: 3 substitutions
A2 vs A3: 8 substitutions
[Figure: resulting tree joining A1, A3, and A2, with branch labels 5 and 3]
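The substitution counts for the three aligned sequences can be reproduced by counting mismatching columns. The sketch assumes that a gap opposite a residue counts as one difference, which is what the counts of 5, 3, and 8 imply. The variable and function names are illustrative.

```python
from itertools import combinations

ALIGNMENT = {
    "A1": "ACTDEEGGGGSRGHI",
    "A2": "A-TEEDGGAASRGHI",
    "A3": "ACFDDEGGGGSRGHL",
}

def count_differences(s, t):
    """Number of aligned columns in which the two sequences differ."""
    return sum(1 for a, b in zip(s, t) if a != b)

# Distance for every pair of sequences
distances = {
    (x, y): count_differences(ALIGNMENT[x], ALIGNMENT[y])
    for x, y in combinations(ALIGNMENT, 2)
}

# A distance-based method would join the closest pair first
closest = min(distances, key=distances.get)
print(distances, closest)
```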
81Maximum likelihood phylogenetic trees
Probability of aa substitution
     A     -       E       D
A    1     0.01    0.2     0.09
-    0.01  1       0.0001  0.0001
E    0.2   0.0001  1       0.5
D    0.09  0.0001  0.5     1
Alignment
A1 ACTDEEGGGGSRGHI
A2 A-TEEDGGAASRGHI
A3 ACFDDEGGGGSRGHL
82Distance measures for phylogenetic tree
construction
Distance measures respect the following constraints: d = 0 if the sequences are identical, d > 0 if the sequences are different.
Distances between molecular sequences are computed from pair-wise alignment scores.
For closely related DNA sequences, one could simply use f, the fraction of non-identical residues (readily computed from the identity value returned by an alignment program). For more distantly related sequences, the Jukes-Cantor distance, d = -(3/4) ln(1 - 4f/3), is preferred. This measure is assumed to be proportional to evolutionary time. It takes into account that the percent identity value saturates at 25% over time.
For protein sequences aligned with the aid of a substitution matrix, an approximate distance is often computed from the following quantities:
Sobs = observed pairwise alignment score
Smax = maximum score (average of alignment scores of each sequence against itself)
Srand = expected score for random sequences of same length and composition
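The Jukes-Cantor correction for DNA sequences is simple to compute. In this sketch f is the fraction of non-identical aligned positions, and the function name is illustrative. The distance is undefined once f reaches 0.75, the level at which identity has saturated at 25%.

```python
import math

def jukes_cantor_distance(f):
    """Jukes-Cantor distance d = -(3/4) * ln(1 - 4f/3) for 0 <= f < 0.75."""
    if not 0 <= f < 0.75:
        raise ValueError("f must satisfy 0 <= f < 0.75")
    return -0.75 * math.log(1 - 4 * f / 3)

for f in (0.0, 0.1, 0.3, 0.5):
    print(f, round(jukes_cantor_distance(f), 4))
```

For small f the corrected distance is close to f itself. As f grows the correction becomes larger, reflecting multiple substitutions at the same site.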
83Maximum likelihood phylogenetic trees
Alignment
A1 ACTDEEGGGGSRGHI
A2 A-TEEDGGAASRGHI
A3 ACFDDEGGGGSRGHL
p(1,2): A1 vs A2, 5 substitutions
p(1,3): A1 vs A3, 3 substitutions
p(2,3): A2 vs A3, 8 substitutions
p(2,3) > p(1,2) > p(1,3)
[Figure: candidate trees over A1, A2, and A3]
84Maximum Likelihood/Parsimony
- Goal: To explain the MSA with a minimal number of mutational events, i.e., to find the tree with the minimal cost
- Input: a multiple sequence alignment (MSA)
- Major components:
- A cost function for a tree given an MSA which simultaneously defines the branch lengths
- An algorithm which finds a tree with the minimal cost
- Output:
- an un-rooted tree (topology plus branch-lengths)
- a total cost
- an ancestral sequence for each non-terminal node
85Statistical evaluation of trees bootstrapping
[Figure: unrooted tree with leaves 1-5 and inferred nodes 6-8]
- Motivation: Some branching patterns in a tree may be uncertain for statistical reasons (short sequences, small number of mutational events)
- Goal of bootstrapping: To assess the statistical robustness for each edge of the tree.
- Note that each edge divides the leaf nodes into two subsets. For instance, edge 7-8 divides the leaves into subsets {1,2,3} and {4,5}. However, is this short edge statistically robust?
- Method: Try to generate trees from subsets of input data as follows:
- Randomly modify input MSA by eliminating some columns and replacing them by existing ones. This results in duplication of columns.
- Compute tree for each modified input MSA.
- For each edge of the tree derived from the real MSA, determine the fraction of trees derived from modified MSAs which contain an edge that divides the leaves into the same subsets. This fraction is called the bootstrap value. Edges with low bootstrap values (e.g., <0.9) are considered unreliable.
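The resampling step and the bootstrap value can be sketched as follows. Tree construction itself is not shown. The sketch assumes each replicate tree has already been reduced to a set of splits, where a split is the `frozenset` of leaves on one side of an edge. Both function names and that representation are illustrative choices, not part of any particular package.

```python
import random

def bootstrap_msa(msa, rng=random):
    """Resample alignment columns with replacement, keeping the length."""
    n_cols = len(msa[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[i] for i in picks) for seq in msa]

def bootstrap_value(split, replicate_splits):
    """Fraction of replicate trees that contain the given split."""
    hits = sum(1 for splits in replicate_splits if split in splits)
    return hits / len(replicate_splits)

msa = ["ACTDEEGGGGSRGHI", "A-TEEDGGAASRGHI", "ACFDDEGGGGSRGHL"]
replicate = bootstrap_msa(msa, random.Random(0))

# Hypothetical splits from four replicate trees
replicates = [
    {frozenset({4, 5})},
    {frozenset({1, 2})},
    {frozenset({4, 5})},
    {frozenset({4, 5})},
]
support = bootstrap_value(frozenset({4, 5}), replicates)
```

In practice hundreds or thousands of replicates are generated, and a tree is computed for each with the same method used for the original alignment.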
86Statistical evaluation of trees bootstrapping
87Other Trees
- Use genomes
- Use Enzymomes
- Use whatever group of molecules are important for
a given function