Title: Bioinformatics Algorithms and Data Structures
1Bioinformatics Algorithms and Data Structures
- Chapter 17.1 Strings and Evolutionary Trees
- Lecturer: Dr. Rose
- Slides by Dr. Rose
- April 8, 2003
2Strings and Evolutionary Trees
- the great Tree of Life fills with its dead and
broken branches the crust of the earth, and
covers the surface with its ever-branching and
beautiful ramifications. - Darwin
3Strings and Evolutionary Trees
- There are three competing theories for creating classification trees:
- Evolutionary taxonomy
- Numerical taxonomy
- Cladistics
4Strings and Evolutionary Trees
- Evolutionary taxonomy
- classification informed by evolutionary theory
- Fills in internal nodes corresponding to common
ancestors.
5Strings and Evolutionary Trees
- Numerical taxonomy (Phenetics)
- Studies the relationship between groups of organisms based on their degree of similarity.
- Similarity can be in terms of molecular, phenotypic or anatomical data.
- The resulting graph, a tree-like network, is called a phenogram.
- Maximum Likelihood method.
6Strings and Evolutionary Trees
- Cladistics
- Characterized by character-state methods, e.g., maximum parsimony.
- Guiding principle:
- Not all character states shared by organisms provide evolutionary information.
- Important to restrict consideration to evolutionarily significant states.
7Strings and Evolutionary Trees
- Cladistics continued
- Willi Hennig Society
- http://www.cladistics.org/
8Strings and Evolutionary Trees
- Tree building algorithms
- Distance-based methods
- Input: distance data, such as sequence edit distance
- Output: a weighted tree whose pairwise distances match the evolutionary distances
- We will consider data that is
- Ultrametric (section 17.1)
- Additive but not ultrametric (section 17.2)
- Nonadditive data (no section)
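As a rough illustration of the kind of input a distance-based method takes, the sketch below computes unit-cost edit distances with the standard dynamic program and collects them into a distance table. The function name `edit_distance` and the three sample sequences are assumptions made for this example, not part of the lecture.

```python
def edit_distance(s, t):
    """Unit-cost edit distance (insert, delete, substitute) between s and t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]  # cost of deleting the first i characters of s
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # match or substitute
        prev = curr
    return prev[-1]

# Hypothetical sequences for three taxa
seqs = {"x": "ACGT", "y": "AGT", "z": "ACGGT"}
D = {a: {b: edit_distance(seqs[a], seqs[b]) for b in seqs} for a in seqs}
```

The table `D` is symmetric with zeros on the diagonal, which is the form of input the tree-building methods above expect.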
9Strings and Evolutionary Trees
- Tree building algorithms continued
- Maximum-parsimony methods
- Character-based methods
- Input: character data (often aligned sequences)
- Output: a tree with
- input taxa at the leaves
- inferred taxa at the internal nodes
- Goal: minimize the total cost of mutations, i.e., maximize parsimony.
- Seeks a tree that has the minimum cost over all
possible trees
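To make "cost of mutations" concrete, here is a minimal sketch that scores one fixed tree for a single character under unit mutation cost, using Fitch's small-parsimony algorithm. This is a standard technique brought in for illustration, not necessarily the method the lecture develops. The nested-tuple tree encoding, the taxon names, and the character states are all assumptions. A full maximum-parsimony method would also search over tree topologies, which this sketch does not do.

```python
def fitch(node, states):
    """Return (candidate state set, minimum number of changes) for a subtree.

    A leaf is a taxon name (str); an internal node is a pair (left, right).
    """
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:  # children can agree: no extra change needed here
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

# Hypothetical character states for four taxa
states = {"A": "G", "B": "G", "C": "T", "D": "T"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
cost_1 = fitch(tree_1, states)[1]  # 1 change
cost_2 = fitch(tree_2, states)[1]  # 2 changes
```

On this toy data `tree_1` is the more parsimonious of the two candidate trees.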
10Ultrametric trees and distances
- Before discussing ultrametric distances, consider additive distances.
- Defn. Additive distances are distances that can be fitted to an unrooted tree such that the distance between every pair of taxa equals the sum of the branch lengths on the path connecting them.
- (Table and figure from http://imbs.massey.ac.nz/Research/MolEvol/Farside/DNA/00312.html)
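The definition of additive distances can be sketched in code by going the other way: start from a small unrooted tree with branch lengths and derive the pairwise distances as path sums. The tree below, its node names, and its branch lengths are hypothetical and are not the table from the cited page.

```python
def path_lengths(tree, start):
    """Sum of branch lengths from `start` to every node of an acyclic graph."""
    dist = {start: 0}
    stack = [start]
    while stack:
        u = stack.pop()
        for v, w in tree[u].items():
            if v not in dist:  # a tree has exactly one path to each node
                dist[v] = dist[u] + w
                stack.append(v)
    return dist

# Hypothetical unrooted tree: leaves A, B, C, D and internal nodes x, y
tree = {
    "A": {"x": 2}, "B": {"x": 3},
    "C": {"y": 4}, "D": {"y": 1},
    "x": {"A": 2, "B": 3, "y": 5},
    "y": {"C": 4, "D": 1, "x": 5},
}
leaves = ["A", "B", "C", "D"]
D = {a: {b: path_lengths(tree, a)[b] for b in leaves} for a in leaves}
```

By construction `D` is additive. It is not ultrametric: for A, B and C the three distances are 5, 11 and 12, so the two largest differ.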
11Ultrametric trees and distances
- Ultrametric distances are more constrained than additive distances.
- Defn. Ultrametric distances are distances that
- fit a tree so that the distance between any two taxa is equal to the sum of the branches joining them, and
- for any three taxa i, j and k, the two largest of the three pairwise distances are equal, i.e.,
- If $d_{ik} > d_{jk}$ then $d_{ik} = d_{ij}$
- Else if $d_{ik} > d_{ij}$ then $d_{ik} = d_{jk}$
- Else $d_{ij} = d_{jk}$
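The three-taxon condition is easy to test directly. The sketch below checks every triple of a symmetric matrix. The function name `is_ultrametric` and both sample matrices are made up for illustration, and exact equality is assumed to be meaningful (integer data); measured distances would need a tolerance.

```python
from itertools import combinations

def is_ultrametric(D):
    """True if, for every triple of indices, the two largest distances are equal."""
    for i, j, k in combinations(range(len(D)), 3):
        _, second, largest = sorted((D[i][j], D[i][k], D[j][k]))
        if second != largest:
            return False
    return True

# Hypothetical ultrametric matrix over taxa A, B, C, D, E
D_good = [
    [0, 5, 5, 8, 8],
    [5, 0, 3, 8, 8],
    [5, 3, 0, 8, 8],
    [8, 8, 8, 0, 2],
    [8, 8, 8, 2, 0],
]
# Hypothetical matrix that fails: the distances 2, 3, 4 have no tie at the top
D_bad = [
    [0, 2, 3],
    [2, 0, 4],
    [3, 4, 0],
]
```

This brute-force check looks at all n choose 3 triples, so it takes O(n^3) time; it is meant as a definition check, not an efficient test.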
12Ultrametric trees and distances
- Q: What is an ultrametric tree?
- An ultrametric tree T for an n-by-n symmetric distance matrix D has the following properties:
- T has n leaves, one per unique row of D.
- Internal nodes are labeled by an entry from D and have two children.
- The numbers labeling internal nodes strictly decrease along any path from the root to a leaf.
- D(i, j) denotes the label of the least common ancestor of leaves i and j in T.
- The distances in D must be ultrametric.
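One way to see how T and D relate is to read D off a given tree: every pair of leaves split by an internal node gets that node's label. The sketch below assumes a nested-tuple encoding, where a leaf is a name and an internal node is `(label, left, right)`. The encoding, the function names, and the example tree are assumptions for illustration.

```python
def leaves_of(node):
    """List the leaf names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves_of(left) + leaves_of(right)

def matrix_from_tree(node, D=None):
    """Fill D[(i, j)] with the label of the least common ancestor of i and j."""
    if D is None:
        D = {}
    if isinstance(node, str):
        return D
    label, left, right = node
    # This node is the least common ancestor of every left/right pair of leaves.
    for i in leaves_of(left):
        for j in leaves_of(right):
            D[(i, j)] = D[(j, i)] = label
    matrix_from_tree(left, D)
    matrix_from_tree(right, D)
    return D

# Hypothetical tree: labels strictly decrease from the root (8) to the leaves
T = (8, (5, "A", (3, "B", "C")), (2, "D", "E"))
D = matrix_from_tree(T)
```

Here `D[("B", "C")]` is 3 and `D[("A", "E")]` is 8, matching the labels of the corresponding least common ancestors.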
13Ultrametric trees and distances
- Consider the following example from the textbook
- Verify the ultrametric condition, i.e., for any
three taxa, two of the distances will be the same
and larger than the third distance.
14Ultrametric trees and distances
- Interpretation of ultrametric trees as evolutionary trees:
- The leaves are the existing OTUs (operational taxonomic units).
- The internal nodes are the divergence events.
- A divergence event is a point where the evolutionary histories of two OTUs split.
15Ultrametric trees and distances
- Q: If taxa A and B diverge at time t, which statements are implied by the meaning of divergence?
- A is the ancestor of B.
- B is the ancestor of A.
- Neither A nor B is an ancestor of the other.
- Neither A nor B has a living ancestor.
16Ultrametric trees and distances
- If the branching order and the time of each divergence are known:
- The label at each internal node is the time of the divergence event corresponding to that node.
- The labels along any path from the root to a leaf must be strictly increasing.
- D(i, j) is the time at which taxa i and j diverged.
- The author calls T a min-ultrametric tree.
17Ultrametric trees and distances
- Equivalently, if the evolutionary history is known:
- The label at each internal node is the time that has passed since the divergence event corresponding to that node.
- The labels along any path from the root to a leaf must be strictly decreasing.
- D(i, j) is the time since taxa i and j diverged.
- T is an ultrametric tree for D.
18Ultrametric trees and distances
- Defn. The symmetric matrix D defines an ultrametric distance iff for any three indices i, j and k, the two largest of the three pairwise distances are equal, i.e.,
- If $d_{ik} > d_{jk}$ then $d_{ik} = d_{ij}$
- Else if $d_{ik} > d_{ij}$ then $d_{ik} = d_{jk}$
- Else $d_{ij} = d_{jk}$
- Call D ultrametric if it defines ultrametric distances.
- Thm. Distance matrix D has an ultrametric tree iff D is an ultrametric matrix. (Proof page 451)
19Ultrametric trees and distances
- Proof.
- (If T is ultrametric, then D is ultrametric.)
- If T is ultrametric (draw T),
- each internal node v is labeled D(i, j), where i and j are leaves and v is their least common ancestor.
- For any three leaves i, j, k in T, let u be their least common ancestor; then
- u is labeled by two of D(i, j), D(i, k), D(j, k), i.e., two of these are equal.
- Furthermore, the remaining one of D(i, j), D(i, k), D(j, k) is the smallest, since it labels a descendant of u.
- Therefore D is ultrametric.
20Ultrametric trees and distances
- Proof.
- (Conversely: if D is ultrametric, then there is an ultrametric tree T.)
- If D is ultrametric:
- The number of distinct entries d in row i determines the number of nodes on the path from the root to leaf i.
- Each node on this path is labeled, in decreasing order, with a distinct label.
- Any node v on this path labeled D(i, j) must be the least common ancestor of leaves i and j.
21Ultrametric trees and distances
- Proof, continued
- The path to leaf i partitions the n-1 remaining leaves into d-1 classes.
- Each distinct node on the path to i is labeled by the distance from i to the leaves in that partition.
- Example
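As a small stand-in example of this partitioning step, the sketch below groups the leaves other than i by their distance from i, largest distance first (the class nearest the root). The matrix, the taxon names, and the function name `partition_by_distance` are hypothetical.

```python
from collections import defaultdict

def partition_by_distance(D, names, i):
    """Group all leaves except i by their distance D[i][j], largest first."""
    classes = defaultdict(list)
    for j, name in enumerate(names):
        if j != i:
            classes[D[i][j]].append(name)
    return dict(sorted(classes.items(), reverse=True))

# Hypothetical ultrametric matrix over taxa A, B, C, D, E
names = ["A", "B", "C", "D", "E"]
D = [
    [0, 5, 5, 8, 8],
    [5, 0, 3, 8, 8],
    [5, 3, 0, 8, 8],
    [8, 8, 8, 0, 2],
    [8, 8, 8, 2, 0],
]
classes = partition_by_distance(D, names, 0)
# classes == {8: ["D", "E"], 5: ["B", "C"]}
```

Row A has d = 3 distinct entries (0, 5, 8), and the other four leaves fall into d - 1 = 2 classes, one per node on the path from the root to A.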
22Ultrametric trees and distances
- Proof, continued
- We want to recursively find the ultrametric tree for each of the d-1 partitions and then combine them.
- Consider the partition defined by internal node v.
- Let j be a leaf contained in this partition.
- Let l be some other leaf. There are three cases:
- l is in the same partition as j.
- l is in a partition between i and node v.
- l is in a partition between node v and the root.
23Ultrametric trees and distances
- Proof, continued
- The three cases. Let i = A, j = F.
- l is in the same partition as j. Example: l = B
- l is in a partition between i and v. Example: l = D
- l is in a partition between v and the root. Example: l = C
24Ultrametric trees and distances
- Proof, continued
- Case 1 (i = A, j = F, l = B)
- D(i, j) = D(i, l), thus D(j, l) ≤ D(i, j). Why?
- So we can add the subtree containing j and l, knowing that D(j, l) is correct.
25Ultrametric trees and distances
Typos in text
- Proof, continued
- Case 2 (i = A, j = F, l = D)
- D(i, l) < D(i, j), thus D(i, j) = D(j, l)
- So we can add the subtree at v containing j, knowing that D(j, l) is correct, i.e., v is labeled correctly.
26Ultrametric trees and distances
Typos in text
- Proof, continued
- Case 3 (i = A, j = F, l = C)
- D(i, l) > D(i, j), thus D(i, l) = D(j, l)
- So we can add the subtree at v containing j, knowing that D(j, l) is correct, i.e., it labels their least common ancestor.
27Ultrametric trees and distances
- Proof, continued
- In each of the three cases, the ultrametric tree defined by v can be correctly attached to v.
- Hence, using recursion, we can construct the ultrametric tree T for D.
28Ultrametric trees and distances
- Gusfield presents two related theorems.
- Thm. If D is ultrametric, then the ultrametric tree for D is unique.
- This is a consequence of the fact that the nodes that appear on the path to a given leaf i must appear in every ultrametric tree for D.
- Thm. If D is ultrametric, then the ultrametric tree for D can be constructed in O(n^2) time.
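The recursive construction from the proof can be sketched as follows. This is a minimal illustration, not Gusfield's algorithm as stated in the text: it assumes D is already known to be ultrametric and admits a binary tree, it does no validation, and it is not tuned to the O(n^2) bound. The encoding (a leaf is an index, an internal node is `(label, subtree, subtree)`) and the example matrix are assumptions.

```python
from collections import defaultdict

def build_ultrametric_tree(D, leaves):
    """Build the tree over `leaves` (indices into D), following the proof."""
    i, rest = leaves[0], leaves[1:]
    # Partition the remaining leaves by their distance from leaf i.
    classes = defaultdict(list)
    for j in rest:
        classes[D[i][j]].append(j)
    # Grow the root-to-i path upward from leaf i: the smallest distance
    # labels the deepest node, the largest labels the root.
    node = i
    for dist in sorted(classes):
        node = (dist, node, build_ultrametric_tree(D, classes[dist]))
    return node

# Hypothetical ultrametric matrix over taxa 0..4
D = [
    [0, 5, 5, 8, 8],
    [5, 0, 3, 8, 8],
    [5, 3, 0, 8, 8],
    [8, 8, 8, 0, 2],
    [8, 8, 8, 2, 0],
]
T = build_ultrametric_tree(D, [0, 1, 2, 3, 4])
# T == (8, (5, 0, (3, 1, 2)), (2, 3, 4))
```

Each class is handled by a recursive call, and the resulting subtree is attached at the path node carrying that class's distance, which mirrors the three-case argument above.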
29Ultrametric trees and distances
- Given ultrametric data we can
- reconstruct evolutionary history,
- find the relative divergence times,
- find the exact tree topology.
- Q: How do we get ultrametric data?
- Consider the molecular clock theory.
30Molecular Clock Theory
- Proposed by Emile Zuckerkandl and Linus Pauling.
- Idea: accepted mutations occur at a constant rate for a given protein.
- There are three important issues:
- Accepted mutations are those that still allow the protein to function properly.
- Lethal mutations will be selected against and should not accumulate.
31Molecular Clock Theory
- The theoretical clock rate is protein specific.
- Different proteins will have different clocks.
- Gusfield mentions hemoglobin and cytochrome c.
- Both are stable and similar in all mammals.
- However, hemoglobin mutates faster than
cytochrome c.
32Molecular Clock Theory
- The implication is that the number of mutations is proportional to the length of the time interval.
- Requirement: the interval must be long enough.
- The length of an interval can be measured by the number of mutations.
- This requires that the clock be calibrated.
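A toy sketch of what calibration means under a strictly constant clock: estimate the rate from one interval of known length, then convert mutation counts into time. The function names and all numbers are invented for illustration and are not measurements for any real protein.

```python
def calibrate_rate(mutations, known_time):
    """Accepted mutations per unit time, from one interval of known length."""
    return mutations / known_time

def elapsed_time(mutations, rate):
    """Time implied by a mutation count under a constant clock."""
    return mutations / rate

# Hypothetical calibration: 20 accepted mutations along a lineage known
# (for example, from the fossil record) to span 10 million years.
rate = calibrate_rate(20, 10.0)   # 2.0 mutations per million years
t = elapsed_time(7, rate)         # 3.5 million years for 7 mutations
```

The proportionality only holds to the extent that the rate is in fact constant over the intervals being compared.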
33Molecular Clock Theory
- Assumptions:
- All DNA mutates at the same rate.
- Observed differences in accepted mutation rates are due to different constraints:
- Natural selection at the organism level
- Physical chemistry at the molecular level
34Molecular Clock Theory
- Q: How would we use the molecular clock to collect ultrametric data?
- Find a common protein for two taxa of interest.
- Determine the number of accepted mutational differences, say k.
- By molecular clock theory, each taxon contributed k/2 accepted mutations.
35Molecular Clock Theory
- Do this for each pair of the n taxa.
- Result: n choose 2 numbers satisfying the requirement for an ultrametric tree.
- The ultrametric tree will describe the true evolutionary history of the n taxa.
- Great, huh?
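The procedure can be sketched on toy data: count the positions at which two aligned sequences differ (k) and record k/2 for each pair. The sequences below are invented so that a perfect clock holds; mismatch counting on pre-aligned strings is an assumption standing in for however accepted mutations would actually be determined.

```python
from itertools import combinations

def mismatches(s, t):
    """Number of positions at which two aligned sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t))

def clock_distances(seqs):
    """D[a][b] = k/2, where k is the number of differences between a and b."""
    names = list(seqs)
    D = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        k = mismatches(seqs[a], seqs[b])
        D[a][b] = D[b][a] = k / 2
    return D

# Hypothetical aligned sequences evolved under a perfect clock
seqs = {
    "A": "AAAATTTCAA",
    "B": "AAAATTTACA",
    "C": "GGGGAAAAAA",
}
D = clock_distances(seqs)
# D["A"]["B"] == 1.0, D["A"]["C"] == 4.0, D["B"]["C"] == 4.0
```

For this toy input the two largest of the three distances are equal, as the ultrametric condition requires.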
36Molecular Clock Theory
- If only the molecular clock theory were correct.
- In the real world, the situation is complicated.
- Molecular clock rates can and do diverge.
- Sometimes there are common mutation rates.
- Sometimes the mutation rates diverge.
37Additive Distance Trees