Title: MSA- multiple sequence alignment
1MSA- multiple sequence alignment
- Aligning many sequences is often preferable to
pairwise comparisons.
- Problem: the computational complexity of multiple
alignments grows rapidly with the number of
sequences being aligned.
2Even using supercomputers or networks of
workstations, finding an optimal multiple sequence
alignment is an intractable problem for more than
20 or so sequences of average length and complexity.
3As a result, alignment methods using heuristics
have been developed. These methods (including
ClustalW) cannot guarantee an optimal alignment,
but can find near-optimal alignments for a larger
number of sequences.
4CLUSTALW
- The original Clustal was developed in 1988; ClustalW followed in 1994
- Begins by aligning closely related sequences and
then adds increasingly divergent sequences to
produce a complete MSA.
5- http://www.ncbi.nlm.nih.gov/
- http://www.ebi.ac.uk/clustalw/
6Introduction to Molecular Phylogeny
- Phylogeny- the evolutionary history of a group
7Mutations Happen!
- 3 types possible
- Deleterious
- Advantageous
- ???
8Important Point
- Much of the variation that is observed among
individuals must have little beneficial or
detrimental effect and be essentially selectively
neutral.
- Deleterious mutations are screened out.
Advantageous mutations are rare.
9Functional Constraints?
- Portions of genes that are especially important are
said to be under functional constraint and tend
to accumulate changes very slowly.
- Ex. histone proteins: practically every amino
acid is important. A yeast histone can replace a
human histone.
10Relative Rate of Change within the β-globin gene (4
mammals)
11Basis of Molecular Phylogenetics
- The evolution of species can be modeled as a
bifurcating process- speciation is initiated
when two populations become reproductively
isolated.
12Basis of Molecular Phylogenetics
- Once these two populations cease to interbreed,
it is inevitable that they diverge due to random
mutational processes.
13Basis of Molecular Phylogenetics
- Over time, this branching process may repeat
itself.
- A species is said to be related to some other
species with which it shares a direct common
ancestor.
15Basis of Molecular Phylogenetics
- The amount of DNA sequence difference between a
pair of organisms should indicate how recently
those two organisms shared a common ancestor.
17Basis of Molecular Phylogenetics
- The longer two populations remain reproductively
isolated, the more DNA divergence will occur.
- The longer two populations remain reproductively
isolated, the more protein divergence will occur.
18Molecular Phylogeny is relatively new.
- Evolution by Natural Selection: Darwin/Wallace,
1858
- Molecular Phylogeny: 1960s ??
19How it started . . .
- In 1959, scientists determined the
three-dimensional structures of two proteins that
are found in almost every animal: hemoglobin and
myoglobin.
- During the next two decades, myoglobin and
hemoglobin sequences were determined for dozens
of mammals, birds, reptiles, amphibians, fish,
etc.
20What they found . . .
- This tree agreed completely with observations
derived from paleontology and anatomy about the
common descent of the corresponding organisms.
- from Science and Creationism: A View from the
National Academy of Sciences, 2nd Ed., 1999.
21Organisms with high degrees of molecular
similarity are expected to be more closely
related than those that are dissimilar.
22Advantages of Molecular Phylogeny
- Can be used to decipher relationships between all
living things
- Relying on anatomy can be misleading: similar
traits can evolve in organisms that are not
closely related (e.g., convergent evolution led
to eyes in vertebrates, insects, and molluscs).
23Word of Caution
- Phylogenetic analysis is controversial. There
are a wide variety of different methods for
analyzing the data, and even the experts often
disagree on the best method for analyzing the
data.
24Why so controversial??
251 - Molecular vs. Classical
- How much weight is given to molecular
phylogenetic data when it conflicts with the findings
of the traditional taxonomist??
26. . .
27 28How many cars changed spaces during this 2-hour
interval?
- Parking lot A at 2:00
- Parking lot A at 4:00
292 - Molecular Phylogeny requires statistical
estimation.
- Parking lot A at 2:00
- Parking lot A at 4:00
30Phylogenetic Data Analysis requires 4 steps
- 1) Alignment
- 2) Determine the substitution model
- 3) Tree Building
- 4) Tree Evaluation
31STEP 1- Alignment
- Molecular phylogenetic analysis is dependent on a
good alignment. An evolutionary tree based on an
improper alignment is an erroneous tree.
33Homology
- It is critical to phylogenetic analysis that
homologous characters be compared across species.
- Webster's New Collegiate: fundamental
similarity of structure due to descent from a
common ancestral form.
36Compare homologous genes and homologous
characters
- For DNA and proteins, this means that gaps must
be placed correctly in multiple alignments to
ensure that the same position is being compared
for each species.
37Homologous Genes? When could you accidentally
compare nonhomologous genes?
- Be careful if you are comparing genes that are
members of a gene family.
- Comparing a tubulin-3 from one species with a
tubulin-6 from another will not generate accurate
results.
38What to align?
- Phylogenetic trees are generated by comparing DNA
or protein. The molecule of choice depends on
the question you are attempting to answer.
39DNA
- contains more evolutionary information than
protein
- ATT GCG AAA CAC
- ATA GCC AAG CTC
40Protein
- (same region analyzed: only 1 difference)
- Ile-Ala-Lys-His
- Ile-Ala-Lys-Leu
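The DNA and protein slides can be checked with a short sketch. It counts differences at the nucleotide level and again after translation. The codon table holds only the eight codons that appear on the slides (an illustrative subset, not the full genetic code), and the function names are made up for this example.

```python
# Codon table restricted to the codons used on the slides
# (an illustrative subset, NOT the full genetic code).
CODON_TO_AA = {
    "ATT": "Ile", "ATA": "Ile",
    "GCG": "Ala", "GCC": "Ala",
    "AAA": "Lys", "AAG": "Lys",
    "CAC": "His", "CTC": "Leu",
}

def codons(dna):
    """Split a DNA string into consecutive 3-base codons."""
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]

def translate(dna):
    """Translate a DNA string into a list of amino acid names."""
    return [CODON_TO_AA[c] for c in codons(dna)]

def count_differences(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

seq1 = "ATTGCGAAACAC"
seq2 = "ATAGCCAAGCTC"

dna_diffs = count_differences(seq1, seq2)
protein_diffs = count_differences(translate(seq1), translate(seq2))
print(dna_diffs, protein_diffs)  # 4 nucleotide differences, 1 amino acid difference
```

Three of the four nucleotide changes are silent (the codon still encodes the same amino acid), which is why the DNA comparison carries more information than the protein comparison.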
41DNA
- high rate of base substitution makes DNA best for
very short term studies, e.g. closely-related
species
43Homoplasy
- Return of a character to its original state, thus
masking intervening mutational events. Every
fourth mutation should result in a homoplasy.
44Protein
- more reliable alignment than DNA
- fewer homoplasies than DNA
- lower rate of substitution than DNA; better
for wide species comparisons
46rRNA: ribosomal RNA
- Best for very long term evolutionary studies
spanning biological kingdoms
- Selective processes constraining sequence
evolution should be roughly the same across
species boundaries
47STEP 2- Determine the substitution model.
48A nucleotide substitution rate matrix
     A    T    C    G
A    5   -4   -4   -4
T   -4    5   -4   -4
C   -4   -4    5   -4
G   -4   -4   -4    5
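One way to read the matrix above is as a scoring table: 5 for each identical pair, -4 for each mismatch. The sketch below sums those values over an ungapped alignment. Gap handling is left out, and the function name `alignment_score` is an assumption made for this example.

```python
BASES = "ATCG"
MATCH, MISMATCH = 5, -4

# Rebuild the 4x4 matrix from the slide: 5 on the diagonal, -4 elsewhere.
SCORES = {(a, b): (MATCH if a == b else MISMATCH) for a in BASES for b in BASES}

def alignment_score(seq_a, seq_b):
    """Sum the matrix value over each aligned column (no gaps allowed here)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(SCORES[(a, b)] for a, b in zip(seq_a, seq_b))

# The two DNA sequences from the earlier DNA slide: 8 matches, 4 mismatches.
score = alignment_score("ATTGCGAAACAC", "ATAGCCAAGCTC")
print(score)  # 8 * 5 + 4 * (-4) = 24
```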
49Step 3- Tree Building
50Step 3- Tree Building
Tree terminology
- Nodes: branching points
- Branches: lines
- Topology: branching pattern
51Branches can be rotated at a node without
changing the relationships.
53Unrooted trees explain phylogenetic
relationships; they say nothing about the
direction of evolution (the order of descent)
56There are two main tree drawing methods.
- Character Methods
- Distance Methods
- Both approaches are widely used and work well
with most data sets.
57Distance methods
- Distance: a measure of the overall pairwise
difference between two data sets.
- The raw material for tree reconstruction is
tabular summaries of the pairwise differences
between all data sets to be analyzed
58In distance methods, the first step is to
calculate a matrix of all pairwise differences
between a set of sequences.
Species    A     B     C     D
B          9   -----  -----  -----
C          8    11   -----  -----
D         12    15    10   -----
E         15    18    13     5
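A matrix like the one above can be produced by counting mismatches between every pair of aligned sequences. The sketch below does that for three short sequences invented for illustration (they are not the data behind the table).

```python
from itertools import combinations

# Hypothetical aligned sequences, invented for illustration only.
aligned = {
    "A": "ATTGCGAAACAC",
    "B": "ATAGCCAAGCTC",
    "C": "ATTGCGAAACTC",
}

def differences(s, t):
    """Number of aligned positions where the two sequences differ."""
    return sum(x != y for x, y in zip(s, t))

# One entry per unordered pair, like the lower triangle of the table.
matrix = {
    (p, q): differences(aligned[p], aligned[q])
    for p, q in combinations(sorted(aligned), 2)
}

# The pair with the fewest differences is the first candidate for joining.
closest_pair = min(matrix, key=matrix.get)
print(matrix, closest_pair)
```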
59Distance methods
- Identify the sequence pairs that have the
smallest number of sequence changes between them;
these are identified as neighbors. On a tree,
these sequences share a common ancestor and are
joined by a short branch.
60UPGMA, pairwise distance, and neighbor joining are
distance methods.
- They progressively group sequences, starting with
those that are most alike.
- UPGMA: unweighted pair-group method with
arithmetic mean
61Phylogenetic trees based on distance methods.
- The two sequences that are closest together are
connected at a node.
- The process is repeated until all sequences are
joined.
- Addition of the last sequence defines the root of
the tree.
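The joining procedure can be sketched with UPGMA, using the pairwise differences from the distance-matrix slide. This is a minimal illustration that returns only the branching order as nested tuples, with no branch lengths. The function name `upgma` and the nested-tuple representation are choices made for this example.

```python
# Pairwise differences copied from the distance-matrix slide.
DIST = {
    ("A", "B"): 9, ("A", "C"): 8, ("A", "D"): 12, ("A", "E"): 15,
    ("B", "C"): 11, ("B", "D"): 15, ("B", "E"): 18,
    ("C", "D"): 10, ("C", "E"): 13, ("D", "E"): 5,
}

def upgma(dist):
    """Repeatedly merge the two closest clusters; return a nested tuple."""
    d = {frozenset(pair): float(v) for pair, v in dist.items()}
    size = {name: 1 for pair in dist for name in pair}
    while len(size) > 1:
        pair = min(d, key=d.get)          # closest pair of clusters
        x, y = sorted(pair, key=str)      # fixed order for repeatable output
        merged = (x, y)
        del d[pair]
        for other in [c for c in size if c not in (x, y)]:
            dx = d.pop(frozenset((x, other)))
            dy = d.pop(frozenset((y, other)))
            # Size-weighted average = arithmetic mean over all original pairs.
            d[frozenset((merged, other))] = (
                (size[x] * dx + size[y] * dy) / (size[x] + size[y])
            )
        size[merged] = size.pop(x) + size.pop(y)
    return next(iter(size))

tree = upgma(DIST)
print(tree)  # (('D', 'E'), (('A', 'C'), 'B'))
```

D and E (5 differences) are joined first, then A and C (8), then B is added to the A/C cluster, and the final merge places the root.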
62The branch lengths may reflect the degree of
similarity (and theoretically reflect
evolutionary time).
- Scaled trees: branch lengths are proportional
to the number of differences between sequences.
- In the best of cases, scaled trees are additive
(the physical length of branches connecting any
two nodes is an accurate representation of their
accumulated differences).
64Phylogenetic trees based on distance methods.
- Relatively simple.
- Problem
- May not be accurate!!
65Character Methods
- There is no denying that distance-based methods
look at the big picture and pointedly ignore
much potentially valuable information.
66Character Methods
- Analysis of individual characters is translated
into evolutionary trees.
- Character: a well-defined feature that can exist
in a limited number of different states. (Ex.
DNA and protein sequences)
67The concept of parsimony is at the heart of all
character-based methods of phylogenetic
reconstruction.
- The process of attaching preference to one
evolutionary pathway over another on the basis of
which pathway requires the invocation of the
smallest number of mutational events.
68Character-based methods of phylogenetic
reconstruction.
- The relationship that requires the fewest
number of mutations to explain the current state
of affairs is most likely to be correct
69First step in character methods: identify all
of the informative sites
70Second step: calculate the minimum number of
substitutions at each informative site
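The first step can be sketched as follows, assuming the usual parsimony definition of an informative site: a column with at least two different states, each present in at least two sequences. The four-sequence alignment is invented for illustration.

```python
from collections import Counter

def is_informative(column):
    """True if at least two states each occur in two or more sequences."""
    counts = Counter(column)
    return sum(1 for n in counts.values() if n >= 2) >= 2

def informative_sites(alignment):
    """Return 0-based column indices of parsimony-informative sites."""
    return [i for i, col in enumerate(zip(*alignment)) if is_informative(col)]

# Hypothetical four-sequence alignment, invented for illustration.
alignment = [
    "AAGCT",
    "AAGTT",
    "AGACT",
    "AGATC",
]
sites = informative_sites(alignment)
print(sites)  # [1, 2, 3]
```

Column 0 is invariant and column 4 has a change in only one sequence, so neither helps choose between trees.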
71Final step
- After sequences are aligned, algorithms model
each tree.
72Maximum parsimony is a character method
- Character methods require a multiple sequence
alignment. Analysis of informative characters is
used to construct an evolutionary tree.
73Maximum Parsimony: the general scientific criterion
for choosing among competing hypotheses states
that we should accept the hypothesis that
explains the data most simply and efficiently.
- The tree requiring the _______ number of nucleic
acid or amino acid substitutions is selected.
74Maximum Parsimony
- The algorithm searches for a tree that requires
the smallest number of changes to explain the
differences observed among the groups under study.
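A minimal sketch of this search for four sequences: score each of the three possible groupings by the minimum number of changes it requires, then keep the smallest. The per-site count uses Fitch's small-parsimony algorithm, which is one common choice and an assumption here (the slides do not name an algorithm). The taxon names and alignment are invented for illustration.

```python
def fitch(tree, states):
    """Return (state_set, substitution_count) for one site on a binary tree.

    `tree` is a leaf name or a (left, right) tuple; `states` maps leaf -> base.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one substitution is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1

def tree_length(tree, alignment):
    """Total substitutions over all columns of a dict name -> sequence."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    return sum(
        fitch(tree, {name: alignment[name][i] for name in names})[1]
        for i in range(n_sites)
    )

# Hypothetical data: four taxa, five aligned sites.
alignment = {"W": "AAGCT", "X": "AAGTT", "Y": "AGACT", "Z": "AGATC"}
candidates = {
    "WX|YZ": (("W", "X"), ("Y", "Z")),
    "WY|XZ": (("W", "Y"), ("X", "Z")),
    "WZ|XY": (("W", "Z"), ("X", "Y")),
}
lengths = {name: tree_length(t, alignment) for name, t in candidates.items()}
best = min(lengths, key=lengths.get)
print(lengths, best)
```

Here the grouping WX|YZ needs 5 changes against 6 and 7 for the alternatives, so it is the most parsimonious tree.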
75Character methods are best suited for . . .
- Sequences that are quite similar.
- Small number of sequences
- The method is computationally time consuming as
all possible trees are examined.
76Phylogenetic trees based on maximum likelihood
- The aim is to find the tree (among all possible
trees) that has the highest likelihood of producing the
observed data (statistical methods).
77Phylogenetic trees based on maximum likelihood
- are similar to maximum parsimony methods but
also take into account the likelihood of specific
mutations (e.g., A → G).
78Mutation Rates Vary
- Transitions (purine to purine or pyrimidine to
pyrimidine) occur more frequently than
transversions (purine to pyrimidine or pyrimidine
to purine).
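The distinction is easy to apply in code. The sketch below classifies each mismatch between two aligned sequences, using the two DNA sequences from the earlier DNA slide. The function names are made up for this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify(a, b):
    """Label a pair of bases as 'identical', 'transition', or 'transversion'."""
    if a == b:
        return "identical"
    same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"

def count_ts_tv(seq_a, seq_b):
    """Return (number of transitions, number of transversions)."""
    kinds = [classify(a, b) for a, b in zip(seq_a, seq_b)]
    return kinds.count("transition"), kinds.count("transversion")

ts, tv = count_ts_tv("ATTGCGAAACAC", "ATAGCCAAGCTC")
print(ts, tv)  # 1 transition (A/G), 3 transversions
```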
79Many of the methods described require significant
amounts of computer time.
80Number of possible rooted and unrooted trees
# of Data Sets   # of Rooted Trees                  # of Unrooted Trees
2                1                                  1
3                3                                  1
4                15                                 3
5                105                                15
10               34,459,425                         2,027,025
15               213,458,046,676,875                7,905,853,580,625
20               8,200,794,532,637,891,559,375      221,643,095,476,699,771,875
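The counts in the table follow the standard formulas for bifurcating trees: (2n-3)!! rooted trees and (2n-5)!! unrooted trees for n sequences, where k!! is the product of every second integer from k down to 1. A short sketch (function names chosen for this example):

```python
def double_factorial(k):
    """Product k * (k-2) * (k-4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 sequences: (2n-3)!!"""
    return double_factorial(2 * n - 3)

def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 2 sequences: (2n-5)!!"""
    return double_factorial(2 * n - 5)

for n in (2, 3, 4, 5, 10, 15, 20):
    print(n, f"{rooted_trees(n):,}", f"{unrooted_trees(n):,}")
```

Note that the rooted count for n sequences equals the unrooted count for n + 1, because rooting a tree amounts to attaching one extra branch.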
82Programs take shortcuts.
- When a large number of trees is being compared, it
is impractical to score each tree completely. A shortcut
algorithm establishes an upper limit. As it
evaluates other trees, it throws out any tree
exceeding the upper bound before the calculation
is completed.
83- Here are some 194 of the phylogeny packages, and
16 free servers, that I know about. Updates to
these pages are made about twice a year.
84Tree Evaluation
- Every tree drawing program will generate a
tree. The important question is whether or not
the tree drawn is the right one.
- In some cases, there are many trees of similar
probabilities.
85Vertebrate β-globins
87Bootstrap method of assessing tree reliability
- Inferred tree is constructed from data set.
- Re-run the calculation on resampled versions of the
data (alignment columns drawn at random with replacement).
- Resampling is repeated many (100-1000) times.
89Bootstrap method
- Bootstrap trees are constructed from the
resampled data sets.
- Each bootstrap tree is compared to the original inferred
tree.
- The % of bootstrap trees supporting a node is
determined for each node in the tree.
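The resampling step can be sketched as below, assuming the standard bootstrap in which alignment columns are drawn with replacement until the replicate has the original length. Tree building on each replicate is left out. The alignment is hypothetical and the function names are made up for this example.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the length."""
    n_sites = len(alignment[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in picks) for seq in alignment]

def support_percent(n_supporting, n_replicates):
    """Percentage of bootstrap trees that contain a given node."""
    return 100.0 * n_supporting / n_replicates

rng = random.Random(0)  # fixed seed so the sketch is repeatable
alignment = ["AAGCT", "AAGTT", "AGACT", "AGATC"]  # hypothetical data
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]

# Each replicate would then be passed to the same tree-building method,
# and a node's support is the share of replicate trees that contain it.
print(len(replicates), support_percent(87, 100))
```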
90Molecular Clock
- Addition of time to phylogenetic tree. Units of
time are often in millions of years.
- Assumption: substitution rates are constant over
millions of years.
91Molecular Clock
- Rates of molecular evolution for genes with
similar functional constraints can be quite
uniform. (Clock may run at different rates in
different proteins.)
92The End
93- Evolutionary biology also has benefited greatly
from genome-sequencing projects. The wealth of
new genome data is helping to better resolve the
tree of life, particularly its major branches.
This has been especially true for prokaryotes,
where more than 80 genomes have been sequenced so
far and the results have greatly improved our
view of the early history of life.
94Problem: as the # of sequences increases, the
# of possible trees increases dramatically
# of sequences   # of trees
3                1
4                3
5                15
6                105
7                945
8                10,395
9                135,135
10               2,027,025
50               2.8 x 10^74
95Phylogenetic trees based on neighbor joining.
- Also utilizes a distance matrix
- Neighbor joining algorithm searches for sets of
neighbors that minimize the total length of the
tree.
- Can produce reasonable trees, especially when
evolutionary distances are short.
96- For vertebrates, many thorny issues remain to be
resolved, such as the phylogeny of families and
other major groups in the tree of life. For
example, it is not yet known whether humans are
closer to mice or to cattle because different
results have been obtained with different gene
analyses. On the other hand, there is no
guarantee that complete genome sequences will
immediately solve all phylogenetic questions, as
evidenced by the continuing debate over the
relationships among humans, flies, and nematodes.
We will need to develop new statistical methods
and bioinformatics tools to handle the greater
volume of data and to unravel the complexities of
molecular evolution.
97Today
- The examination of molecular structure offers an
extremely powerful tool for studying evolutionary
relationships. The quantity of information is
huge--as large as the thousands of different
proteins contained in living organisms, and
limited only by the time and resources of
molecular biologists.
98- Choice of individual genes or proteins.
99Determine the substitution model
- May be an amino acid substitution rate matrix
such as PAM or BLOSUM. ADD DEMO.
100Maximum parsimony and maximum likelihood are
character methods
- Character methods attempt to reconstruct
ancestral nodes of trees in order to fit the tree
to an evolutionary model. They therefore use more
of the information in the data, at the expense of
longer execution time.
101Distance matrices
- Scoring matrices include values for all possible
substitutions. Each mismatch between two
sequences adds to the distance, and each identity
subtracts from the distance.