Bioinformatics Algorithms and Data Structures - PowerPoint PPT Presentation

About This Presentation

Title:

Bioinformatics Algorithms and Data Structures

Description:

UNIVERSITY OF SOUTH CAROLINA. College of Engineering & Information Technology ... If d_ik > d_jk then d_ik = d_ij. Else if d_ik > d_ij then d_ik = d_jk. Else d_ij = d_jk ...

Learn more at: https://www.cse.sc.edu

Transcript and Presenter's Notes

Title: Bioinformatics Algorithms and Data Structures

1
Bioinformatics Algorithms and Data Structures

Chapter 17.1 Strings and Evolutionary Trees
Lecturer Dr. Rose
Slides by Dr. Rose
April 8, 2003

2
Strings and Evolutionary Trees

the great Tree of Life fills with its dead and
broken branches the crust of the earth, and
covers the surface with its ever-branching and
beautiful ramifications. - Darwin

3
Strings and Evolutionary Trees

There are three competing approaches to creating
classification trees:
Evolutionary taxonomy
Numerical taxonomy
Cladistics

4
Strings and Evolutionary Trees

Evolutionary taxonomy
classification informed by evolutionary theory
Fills in internal nodes corresponding to common
ancestors.

5
Strings and Evolutionary Trees

Numerical taxonomy (Phenetics)
Studies the relationship between groups of
organisms based on the degree of similarity
Similarity can be in terms of molecular,
phenotypic or anatomical data.
The resulting graph, which is a tree-like network
is called a phenogram.
Maximum Likelihood method.

6
Strings and Evolutionary Trees

Cladistics
Characterized by character-state methods, e.g.,
maximum parsimony.
Guiding principle
not all character states shared by organisms
provide evolutionary information
Important to restrict consideration to
evolutionarily significant states.

7
Strings and Evolutionary Trees

Cladistics continued
Willi Hennig Society
http://www.cladistics.org/

8
Strings and Evolutionary Trees

Tree building algorithms
Distance-based methods
Input: distance data, such as sequence edit
distance
Output: a weighted tree whose pairwise distances
match the evolutionary distances
We will consider data that is:
Ultrametric (section 17.1)
Additive but not ultrametric (section 17.2)
Nonadditive (no section)

9
Strings and Evolutionary Trees

Tree building algorithms continued
Maximum-parsimony methods
Character-based methods
Input: character data (often aligned sequences)
Output: a tree with
input taxa at the leaves
inferred taxa at the internal nodes
Goal: minimize the total cost of mutations, i.e.,
maximize parsimony.
Seeks a tree that has the minimum cost over all
possible trees

10
Ultrametric trees and distances

Before discussing ultrametric distances, consider
additive distances
Defn. Additive distances are distances that can
be fitted to an unrooted tree such that every
pairwise distance between taxa equals the sum of
the branch lengths connecting them. (Table and
figure from
http://imbs.massey.ac.nz/Research/MolEvol/Farside/DNA/00312.html)
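
To make the definition concrete, here is a minimal Python sketch that computes the pairwise leaf distances implied by a weighted unrooted tree by summing branch lengths along each path. The tree, its branch lengths, and the function name leaf_distances are hypothetical, chosen only for illustration; they are not taken from the slides or the cited figure.

```python
from collections import defaultdict


def leaf_distances(edges, leaves):
    """Return {(a, b): path length} for every pair of leaves in a weighted tree."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    def distances_from(src):
        # Depth-first walk; in a tree there is exactly one path to each node.
        dist = {src: 0}
        stack = [src]
        while stack:
            node = stack.pop()
            for nxt, w in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + w
                    stack.append(nxt)
        return dist

    result = {}
    for a in leaves:
        d = distances_from(a)
        for b in leaves:
            if a < b:
                result[(a, b)] = d[b]
    return result


# Hypothetical unrooted tree: leaves A, B, C, D and internal nodes x, y.
edges = [("A", "x", 2), ("B", "x", 3), ("x", "y", 1), ("C", "y", 4), ("D", "y", 5)]
additive = leaf_distances(edges, ["A", "B", "C", "D"])
print(additive)
```

Any distance table produced this way is additive by construction. This particular one is not ultrametric: for A, B and C the distances are 5, 7 and 8, so the two largest are not equal.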

11
Ultrametric trees and distances

Ultrametric distances are more constrained than
additive distances.
Defn. Ultrametric distances are distances that
fit a tree so that the distance between any two
taxa is equal to the sum of the branches joining
them, and
for any three taxa i, j and k, the two largest
distances are equal, i.e.,
If d_ik > d_jk then d_ik = d_ij
Else if d_ik > d_ij then d_ik = d_jk
Else d_ij = d_jk
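
The three-taxa condition can be checked directly. The sketch below is one possible way to do it in Python: for every triple it sorts the three distances and tests whether the two largest agree. The function name is_ultrametric, the tolerance parameter, and both example matrices are assumptions made for illustration.

```python
from itertools import combinations


def is_ultrametric(D, tol=0.0):
    """True if, for every triple of indices, the two largest distances are equal."""
    n = len(D)
    for i, j, k in combinations(range(n), 3):
        smallest, middle, largest = sorted((D[i][j], D[i][k], D[j][k]))
        if largest - middle > tol:
            return False
    return True


# Hypothetical 4-taxon matrices (symmetric, zero diagonal).
ultra = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]
not_ultra = [
    [0, 5, 7, 8],
    [5, 0, 8, 9],
    [7, 8, 0, 9],
    [8, 9, 9, 0],
]
print(is_ultrametric(ultra), is_ultrametric(not_ultra))
```

The check examines all n choose 3 triples, so it takes cubic time. That is fine for a sanity check on small inputs.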

12
Ultrametric trees and distances

Q: What is an ultrametric tree?
An ultrametric tree T for an n-by-n symmetric
distance matrix D is a rooted tree with the
following properties:
T has n leaves, each labeled by a unique row of D.
Each internal node is labeled by an entry from D
and has at least two children.
The numbers labeling internal nodes strictly
decrease along any path from the root to a leaf.
For any two leaves i and j, D(i, j) is the label
of their least common ancestor in T.
The distances in D must be ultrametric.
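
As an illustration of the property that D(i, j) is the label of the least common ancestor, the following Python sketch reads a distance matrix off a labeled tree. The nested-tuple encoding (an internal node is a pair of a label and a list of children, a leaf is its row index) and the example tree are assumptions for this sketch, not a representation used in the slides.

```python
def matrix_from_tree(tree, n):
    """Fill D so that D[i][j] is the label of the least common ancestor of i and j."""
    D = [[0] * n for _ in range(n)]

    def visit(node):
        # A leaf is an int (its row index in D).
        if isinstance(node, int):
            return [node]
        label, children = node
        groups = [visit(child) for child in children]
        # Leaves in different child subtrees have this node as their LCA.
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                for i in groups[a]:
                    for j in groups[b]:
                        D[i][j] = D[j][i] = label
        return [leaf for group in groups for leaf in group]

    visit(tree)
    return D


# Hypothetical tree: root labeled 6, with subtrees labeled 2 and 4.
tree = (6, [(2, [0, 1]), (4, [2, 3])])
for row in matrix_from_tree(tree, 4):
    print(row)
```

Because the labels decrease from the root (6) to the lower nodes (2 and 4), the resulting matrix satisfies the three-taxa condition.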

13
Ultrametric trees and distances

Consider the following example from the textbook
Verify the ultrametric condition, i.e., for any
three taxa, two of the distances are equal
and at least as large as the third distance.

14
Ultrametric trees and distances

Interpretation of ultrametric as evolutionary
trees
The leaves are the existing OTUs
The internal nodes are the divergence events
A divergence event is a point where the
evolutionary histories of two OTUs split.

15
Ultrametric trees and distances

Q: If taxa A and B diverge at time t, which
statements are implied by the meaning of
divergence?
A is the ancestor of B.
B is the ancestor of A.
Neither A nor B is an ancestor of the other.
Neither A nor B has a living ancestor.

16
Ultrametric trees and distances

If the time of each divergence event (and hence
the branching order) is known:
The label at each internal node is the time of
the divergence event corresponding to that node.
The labels from the root to the leaves must be
strictly increasing.
D(i, j) is the time at which taxa i and j diverged.
The author calls T a min-ultrametric tree.

17
Ultrametric trees and distances

Equivalently, if the evolutionary history is
known:
The label at each internal node is the time that
has passed since the divergence event
corresponding to that node.
The labels from the root to the leaves must be
strictly decreasing.
D(i, j) is the time since taxa i and j diverged.
T is an ultrametric tree for D.

18
Ultrametric trees and distances

Defn. The symmetric matrix D defines an
ultrametric distance iff for any three indices i,
j and k, the two largest distances are equal,
i.e.,
If d_ik > d_jk then d_ik = d_ij
Else if d_ik > d_ij then d_ik = d_jk
Else d_ij = d_jk
Call D ultrametric if it defines ultrametric
distances.
Thm. Distance matrix D has an ultrametric tree
iff D is an ultrametric matrix. (Proof: page 451)

19
Ultrametric trees and distances

Proof.
⇒ (if T is ultrametric then D is ultrametric)
If T is ultrametric (draw T),
each internal node v is labeled D(i, j), where i
and j are leaves and v is their least common
ancestor.
For any three leaves i, j, k in T, let u be
their least common ancestor. Then
u is labeled by two of D(i, j), D(i, k), D(j,
k), i.e., two of these are equal.
Furthermore, the third distance labels u or a
descendant of u, so it is no larger than the
other two.
Therefore D is ultrametric.

20
Ultrametric trees and distances

Proof.
⇐ (if D is ultrametric then there is an ultrametric
T)
If D is ultrametric:
The number of distinct entries d in row i
determines the number of nodes on the path from
the root to leaf i.
Each node on this path is labeled in decreasing
order with a distinct label.
Any node v on this path labeled D(i, j) must be
the least common ancestor of leaves i and j.

21
Ultrametric trees and distances

Proof ⇐ continued
The path to leaf i partitions the n-1 remaining
leaves into d-1 classes.
Each distinct node on the path to i is labeled by
the distance from i to the leaves in that
class.
Example
Example

22
Ultrametric trees and distances

Proof ⇐ continued
We want to recursively find the ultrametric tree
for each of the d-1 partitions and then combine
them.
Consider the partition defined by internal node
v.
Let j be a leaf contained in this partition.
Let l be some other leaf. There are three cases:
l is in the same partition as j.
l is in a partition between i and node v.
l is in a partition between node v and the root.

23
Ultrametric trees and distances

Proof ⇐ continued
The three cases: let i = A, j = F
l is in the same partition as j. Example: l = B
l is in a partition between i and v. Example:
l = D
l is in a partition between v and the root.
Example: l = C

24
Ultrametric trees and distances

Proof ⇐ continued
Case 1 (i = A, j = F, l = B)
D(i, j) = D(i, l), thus D(j, l) ≤ D(i, j). Why?
⇒ So we can add the subtree containing j and l,
knowing that D(j, l) is correct.

25
Ultrametric trees and distances
Typos in text

Proof ⇐ continued
Case 2 (i = A, j = F, l = D)
D(i, l) < D(i, j), thus D(i, j) = D(j, l)
⇒ So we can add the subtree at v containing j
knowing that D(j, l) is correct, i.e., v is
labeled correctly.

26
Ultrametric trees and distances
Typos in text

Proof ⇐ continued
Case 3 (i = A, j = F, l = C)
D(i, l) > D(i, j), thus D(i, l) = D(j, l)
⇒ So we can add the subtree at v containing j
knowing that D(j, l) is correct, i.e., it labels
their least common ancestor.

27
Ultrametric trees and distances

Proof ⇐ continued
In each of the three cases, the ultrametric subtree
for the partition defined by v can be correctly
attached to v.
Hence, using recursion, we can construct the
ultrametric tree T for D.
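
The recursive argument can be turned into a short program. The following Python sketch is one way to follow the proof: pick a leaf i, group the remaining leaves by their distance to i, build a subtree for each group recursively, and attach the subtrees along the path from i up to the root. It assumes D is already ultrametric and does not verify it. The function name and the nested-tuple tree encoding (label, list of children) are assumptions for illustration, not Gusfield's own formulation.

```python
def build_ultrametric_tree(D, leaves=None):
    """Return a leaf index, or (label, [children]) for an internal node."""
    if leaves is None:
        leaves = list(range(len(D)))
    i, rest = leaves[0], leaves[1:]
    if not rest:
        return i

    # Partition the other leaves by their distance to leaf i.
    classes = {}
    for j in rest:
        classes.setdefault(D[i][j], []).append(j)

    # Walk up the path from leaf i to the root: smallest label first.
    node = i
    for dist in sorted(classes):
        sub = build_ultrametric_tree(D, classes[dist])
        if isinstance(sub, tuple) and sub[0] == dist:
            # Same label as the path node: merge the children so that
            # labels stay strictly decreasing from the root to the leaves.
            children = [node] + sub[1]
        else:
            children = [node, sub]
        node = (dist, children)
    return node


# Hypothetical ultrametric matrix for four taxa.
D = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]
print(build_ultrametric_tree(D))
```

The merge step handles ties: when all leaves of a class are also at distance dist from each other, they become additional children of the path node instead of forming a child with the same label.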

28
Ultrametric trees and distances

Gusfield presents two related theorems
Thm. If D is ultrametric, then the ultrametric
tree for D is unique.
This is a consequence of the fact that the nodes
that appear on the path to a given leaf i must
appear in every ultrametric tree for D.
Thm. If D is ultrametric, then the ultrametric
tree for D can be constructed in O(n^2) time.

29
Ultrametric trees and distances

Given ultrametric data we can
reconstruct evolutionary history:
Find the relative divergence times
Find the exact tree topology
Q: How do we get ultrametric data?
Consider the molecular clock theory.

30
Molecular Clock Theory

Proposed by Emile Zuckerkandl and Linus Pauling.
Idea: accepted mutations occur at a constant rate
for a given protein.
There are three important issues:
Accepted mutations are those that still allow
the protein to function properly.
Lethal mutations will be selected against and
should not accumulate.

31
Molecular Clock Theory

The theoretical clock rate is protein specific.
Different proteins will have different clocks.
Gusfield mentions hemoglobin and cytochrome c.
Both are stable and similar in all mammals.
However, hemoglobin mutates faster than
cytochrome c.

32
Molecular Clock Theory

The implication is that the number of mutations
is proportional to length of the time interval.
Requirement: the interval must be long enough.
The length of an interval can be measured by the
number of mutations.
This requires that the clock be calibrated.

33
Molecular Clock Theory

Assumptions:
All DNA mutates at the same rate.
Observed differences in the accepted mutation
rate are due to different constraints:
Natural selection at the organism level
Physical chemistry at the molecular level

34
Molecular Clock Theory

Q: How would we use the molecular clock to
collect ultrametric data?
Find a common protein for two taxa of interest.
Determine the number of accepted mutational
differences, say k.
By molecular clock theory, each taxon contributed
k/2 accepted mutations.

35
Molecular Clock Theory

Do this for each pair of the n taxa.
Result: n choose 2 numbers satisfying the
requirement for an ultrametric tree.
The ultrametric tree will describe the true
evolutionary history for the n taxa.
Great huh?
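
The data-collection procedure can be sketched in a few lines of Python. This is a toy illustration only: it assumes the sequences are already aligned to equal length and treats every mismatching position as one accepted mutation, which is a simplification. The taxon names, the sequences, and the function names are hypothetical.

```python
from itertools import combinations


def mutation_differences(seq_a, seq_b):
    """Count positions where two aligned, equal-length sequences differ (k)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def pairwise_differences(sequences):
    """Return {(a, b): k} for each of the n-choose-2 pairs of taxa."""
    names = sorted(sequences)
    return {
        (a, b): mutation_differences(sequences[a], sequences[b])
        for a, b in combinations(names, 2)
    }


# Hypothetical aligned sequences for the same protein in three taxa.
toy = {
    "taxon1": "ACGTAC",
    "taxon2": "ACGTGG",
    "taxon3": "TGATAG",
}
k = pairwise_differences(toy)
for pair, count in k.items():
    # Under the clock assumption each lineage contributed k/2 mutations.
    print(pair, "k =", count, "per lineage =", count / 2)
```

In this toy data taxon1 and taxon2 differ in 2 positions and each differs from taxon3 in 4, so the two largest of the three values are equal, as the ultrametric condition requires.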

36
Molecular Clock Theory