Title: Bioinformatics Algorithms and Data Structures
1Bioinformatics Algorithms and Data Structures
- Chapter 17.4-6: Strings and Evolutionary Trees
- Lecturer: Dr. Rose
- Slides by: Dr. Rose
- April 17, 2003
2Ultrametric Problem Centrality
- Four related tree problems
- Ultrametric
- Additive
- Binary perfect phylogeny
- Tree compatibility
- All can be solved as ultrametric tree problems.
- Recall: tree compatibility reduces to perfect phylogeny.
- Now we reduce the additive tree and (binary) perfect phylogeny problems to the ultrametric tree problem.
3Ultrametric Problem Additive Trees
- Goal: reduce the additive tree problem to the ultrametric problem
- Complexity: $O(n^2)$ reduction
- Approach: create a matrix D' that is ultrametric if and only if D is additive.
- We will start by describing a reduction that involves a tree T for D and T' for D'.
- We will then describe a direct reduction of D to D'.
4Ultrametric Problem Additive Trees
- Assume that D is additive.
- Assume that we know of an additive tree T for D
- Assume that each of the n taxa in D labels a leaf of T.
- Idea: label the nodes of T to create an ultrametric tree T'.
- Q: How can we do this?
5Ultrametric Problem Additive Trees
- A: We will do the following:
- Select one node as the root.
- Stretch the leaf edges so that all leaves are equidistant from the root.
- Let v be the row of D containing the largest entry.
- Let $m_v$ denote the value of this entry.
- Select node v as the root of T.
- This creates a directed tree.
6Ultrametric Problem Additive Trees
- Example
- A is the row of D containing the largest entry.
- Select node A as the root of T.
7Ultrametric Problem Additive Trees
- Stretch leaf edges
- For each leaf i, add $m_A - D(A, i)$ to the leaf edge.
- All leaves are now equidistant from A.
8Ultrametric Problem Additive Trees
- The resulting tree T' is
- a rooted edge-weighted tree
- distance $m_v$ from the root to every leaf
- each internal node is equidistant to the leaves in its subtree.
9Ultrametric Problem Additive Trees
- Since each internal node is equidistant to the leaves in its subtree:
- Label each internal node by this unique distance.
- These labels can be used to define an ultrametric matrix D'.
- D'(i, j) is the label at the least common ancestor of leaves i and j in T'.
- Q: How can we go directly from matrix D to matrix D' without involving T and T'?
10Ultrametric Problem Additive Trees
- Consider leaves i and j in T
- Let node w be their least common ancestor
- Let x be the distance from the root v to w.
- Let y be the distance from node w to leaf i.
11Ultrametric Problem Additive Trees
- Q: What is the distance from w to i in T'?
- A: $y + m_v - D(v, i)$.
- Q: Where does $m_v - D(v, i)$ come from?
- A: Recall that we add $m_v - D(v, i)$ to stretch the leaf edges.
12Ultrametric Problem Additive Trees
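- Gusfield presents the following lemma:
- Without knowing T or T' explicitly, we can deduce that $D'(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j))/2$
- Q: Is this equation correct?
- Substituting $D(i, j) = y + z$, $D(v, i) = x + y$, $D(v, j) = x + z$, where z is the distance from w to leaf j: $D'(i, j) = m_v + ((y + z) - (x + y) - (x + z))/2$
- $D'(i, j) = m_v - 2x/2 = m_v - x$
- Should it instead be
- $D'(i, j) = 2m_v + D(i, j) - D(v, i) - D(v, j)$?
- i.e., $D'(i, j) = 2m_v - 2x$?
- No: $2m_v - 2x$ is the length of the path between leaves i and j in T', whereas D'(i, j) is the label at their least common ancestor w (slide 9), which is $m_v - x$. The lemma is correct as stated.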
13Ultrametric Problem Additive Trees
- This brings us to the following theorem:
- If D is an additive matrix, then D' is ultrametric, where $D'(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j))/2$
- Proof. We've shown that
- $D'(i, j) = y + m_v - D(v, i)$
- $y = D(v, i) - x$
- $x = (D(v, i) + D(v, j) - D(i, j))/2$
- Putting it all together establishes the equation in the theorem.
- D' satisfies the ultrametric requirement.
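The direct reduction in the theorem can be sketched in a few lines of Python. This is a minimal sketch rather than the lecture's own code: the names `to_ultrametric` and `is_ultrametric`, the list-of-lists matrix layout, the zero diagonal of D', and the 4-taxon matrix are all assumptions made for illustration. The check uses the three-point condition, which says that in every triple of taxa the two largest pairwise values must be equal.

```python
import math
from itertools import combinations


def to_ultrametric(D):
    """Build D' from a symmetric distance matrix D (list of lists)."""
    n = len(D)
    # v is the row of D containing the largest entry, m_v that entry.
    v = max(range(n), key=lambda r: max(D[r]))
    m_v = max(D[v])
    D_prime = [[0.0] * n for _ in range(n)]  # diagonal left at 0 (assumption)
    for i in range(n):
        for j in range(n):
            if i != j:
                # D'(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j)) / 2
                D_prime[i][j] = m_v + (D[i][j] - D[v][i] - D[v][j]) / 2
    return D_prime, v, m_v


def is_ultrametric(M):
    """Three-point condition: in every triple the two largest values tie."""
    for i, j, k in combinations(range(len(M)), 3):
        _, mid, top = sorted((M[i][j], M[i][k], M[j][k]))
        if not math.isclose(mid, top):
            return False
    return True


# Hypothetical additive matrix for taxa A, B, C, D (not from the lecture).
D = [
    [0, 5, 7, 12],
    [5, 0, 8, 13],
    [7, 8, 0, 7],
    [12, 13, 7, 0],
]
D_prime, v, m_v = to_ultrametric(D)
print(v, m_v)                   # chosen root row and its largest entry
print(is_ultrametric(D_prime))  # whether D' passes the three-point check
```

On this matrix the sketch picks row B (largest entry 13), and for example D'(A, C) = 13 + (7 - 5 - 8)/2 = 10. Each entry of D' costs constant time, which matches the $O(n^2)$ bound stated for the reduction.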
14Ultrametric Problem Additive Trees
- Q: What is the value of y?
- A: $y = D(v, i) - x$.
- Q: What is the value of x in terms of values in D?
- A: $x = (D(v, i) + D(v, j) - D(i, j))/2$
15Ultrametric Problem Additive Trees
- So D additive $\Rightarrow$ D' ultrametric
- By contraposition: D' not ultrametric $\Rightarrow$ D not additive
- Q: Does D' ultrametric $\Rightarrow$ D additive?
- A: Theorem: D' ultrametric $\Rightarrow$ D additive
- Proof. (constructive)
- Let T' be the ultrametric tree for D'
- Assign weights to the edges of T'
- Note: the sum of the edge weights from a leaf to an ancestor must match the ancestor's label.
- For each edge (p, q), assign the weight p - q, where p and q are the labels of the parent and child endpoints (a leaf has label 0)
16Ultrametric Problem Additive Trees
- Assign weights to the edges of T' (continued)
- Note: the path distance between leaves i and j is twice the value labeling their least common ancestor.
- Hence the path distance is $2D'(i, j) = 2m_v + D(i, j) - D(v, i) - D(v, j)$
- Now shrink the edge into each leaf i by $m_v - D(v, i)$
- The path from leaf i to leaf j now has length D(i, j)
- The result is an additive tree for matrix D, obtained from the ultrametric tree of D'.
- Putting all of this together results in a method for constructing an additive tree for an additive matrix.
17Ultrametric Problem Additive Trees
- Additive Tree Algorithm
- Create matrix D' from D.
- Create the ultrametric tree T' from D'.
- Create T from T':
- Label edge (p, q) with the value p - q
- For each leaf i, shrink the leaf edge by $m_v - D(v, i)$
- Note: no step takes more than $O(n^2)$ time.
- Thm. An additive tree for an additive matrix can be constructed in $O(n^2)$ time.
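The four steps of the algorithm can be sketched end to end as follows. This is an illustrative sketch under stated assumptions, not the $O(n^2)$ construction the theorem refers to: the ultrametric tree is built by repeatedly merging the two closest clusters, which is simple but cubic as written. The names `additive_tree` and `path_length`, the parent/weight dictionaries, and the example matrix are hypothetical.

```python
def additive_tree(D):
    """Return (parent, weight) for a tree on leaves 0..n-1.

    parent[c] is the parent of node c and weight[c] is the length of
    the edge (parent[c], c). Internal nodes get ids n, n+1, ...
    """
    n = len(D)
    v = max(range(n), key=lambda r: max(D[r]))
    m_v = max(D[v])

    def d_prime(i, j):
        # Step 1: D'(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j)) / 2
        return m_v + (D[i][j] - D[v][i] - D[v][j]) / 2

    # Step 2: ultrametric tree for D', merging the closest clusters first.
    label = {i: 0.0 for i in range(n)}    # node labels; leaves are 0
    members = {i: [i] for i in range(n)}  # cluster id -> its leaves
    parent = {}
    next_id = n
    while len(members) > 1:
        ids = list(members)
        pairs = [(a, b) for k, a in enumerate(ids) for b in ids[k + 1:]]
        # For an ultrametric matrix any representative leaf gives the same value.
        a, b = min(pairs,
                   key=lambda ab: d_prime(members[ab[0]][0], members[ab[1]][0]))
        label[next_id] = d_prime(members[a][0], members[b][0])
        parent[a] = parent[b] = next_id
        members[next_id] = members.pop(a) + members.pop(b)
        next_id += 1

    # Step 3: edge (p, q) gets weight label(p) - label(q).
    weight = {q: label[p] - label[q] for q, p in parent.items()}
    # Step 4: shrink the edge into each leaf i by m_v - D(v, i).
    for i in range(n):
        weight[i] -= m_v - D[v][i]
    return parent, weight


def path_length(parent, weight, i, j):
    """Length of the tree path between nodes i and j."""
    def distances_up(c):
        dist = {c: 0.0}
        total = 0.0
        while c in parent:
            total += weight[c]
            c = parent[c]
            dist[c] = total
        return dist

    up_i, up_j = distances_up(i), distances_up(j)
    # The lowest common ancestor minimises the sum (weights are non-negative).
    return min(up_i[a] + up_j[a] for a in up_i if a in up_j)


# Hypothetical additive matrix for taxa A, B, C, D (not from the lecture).
D = [
    [0, 5, 7, 12],
    [5, 0, 8, 13],
    [7, 8, 0, 7],
    [12, 13, 7, 0],
]
parent, weight = additive_tree(D)
ok = all(path_length(parent, weight, i, j) == D[i][j]
         for i in range(4) for j in range(i + 1, 4))
print(ok)  # whether the tree reproduces every entry of D
```

Comparing tree path lengths against D, as the last lines do, is the same sanity check the worked example on the following slides performs by hand.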
18Ultrametric Problem Additive Trees
- Example: Given D, first find D'
- Recall: $D'(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j))/2$
19Ultrametric Problem Additive Trees
- Example: From D' find T'
- Recall: label inner edges (p, q) by p - q
20Ultrametric Problem Additive Trees
- Example: From T' find T
- Recall: shrink leaf edge i by $m_v - D(v, i)$
21Ultrametric Problem Additive Trees
- Example: Finally, compare the derived T with the original tree as a sanity check.
22Ultrametric Problem Perfect Phylogeny
- We now recast perfect phylogeny in terms of an ultrametric tree problem.
- Defn. $D_M$: the n by n matrix of shared characters
- More formally:
- Given the n by m character matrix M, define the n by n matrix $D_M$: for each pair of objects p and q, set $D_M(p, q)$ to be the number of characters that p and q both possess.
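The definition of $D_M$ translates directly into code. The sketch below is illustrative only: the function names, the 0/1 list-of-lists encoding of M, and the two small character matrices are assumptions. The second function tests the min-ultrametric condition in its three-point form, which requires that in every triple of objects the smallest of the three pairwise values occurs at least twice.

```python
from itertools import combinations


def shared_character_matrix(M):
    """D_M(p, q) = number of characters that rows p and q both possess."""
    n = len(M)
    return [[sum(a & b for a, b in zip(M[p], M[q])) for q in range(n)]
            for p in range(n)]


def is_min_ultrametric(DM):
    """In every triple, the minimum pairwise value must be attained twice."""
    for p, q, r in combinations(range(len(DM)), 3):
        low, mid, _ = sorted((DM[p][q], DM[p][r], DM[q][r]))
        if low != mid:
            return False
    return True


# Hypothetical character matrix (rows: taxa A..E, columns: 5 binary characters).
M = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 0, 1],  # C
    [0, 0, 1, 1, 0],  # D
    [0, 1, 0, 0, 0],  # E
]
DM = shared_character_matrix(M)
print(DM[0][2], is_min_ultrametric(DM))

# Two characters whose taxon sets overlap without nesting: no perfect phylogeny.
M_bad = [[1, 1], [1, 0], [0, 1]]
print(is_min_ultrametric(shared_character_matrix(M_bad)))
```

Only off-diagonal entries are inspected; the diagonal of `DM` simply counts the characters each taxon possesses.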
23Ultrametric Problem Perfect Phylogeny
- Lemma: If M has a perfect phylogeny, then $D_M$ is a min-ultrametric matrix.
- Proof: convert M's perfect phylogeny T to a min-ultrametric tree for $D_M$
- Let T be the perfect phylogeny for M.
- Label T's root with zero.
- Traverse T from top to bottom; for each node v:
- Let $p_v$ be the number labeling node v's parent.
- Let $e_v$ be the number of characters labeling the edge into v.
- Label node v with the sum $p_v + e_v$
24Ultrametric Problem Perfect Phylogeny
- The label of node v is the number of characters common to all leaves in the subtree rooted at v.
- If v is the immediate parent of leaves p and q, then the label of v is $D_M(p, q)$
- The numbers labeling nodes on any path from the root are strictly increasing.
- The result is a min-ultrametric tree for matrix $D_M$.
25Ultrametric Problem Perfect Phylogeny
- Algorithm: perfect phylogeny via ultrametrics
- Step 1: Create matrix $D_M$ from M.
- Step 2: Attempt to create a min-ultrametric tree T from $D_M$. If not possible, then M has no perfect phylogeny.
- Step 3: If T was successfully created in step 2:
- Attempt to label its edges with the m characters of M.
- If not possible, then M has no perfect phylogeny.
- Otherwise, the modified T is the perfect phylogeny.
- Note: a min-ultrametric tree T may exist even though M has no perfect phylogeny, hence the check in step 3.
26Ultrametric Problem Perfect Phylogeny
- Final notes on the centrality of the ultrametric problem.
- We can see that the following problems
- perfect phylogeny
- tree compatibility
- can be cast as ultrametric problems.
- This is not an efficient way to address these
problems.
27Maximum Parsimony
- Maximum parsimony
- Perfect phylogeny is a special instance
- Can be viewed as a Steiner tree problem on a hypercube
- Presentation approach:
- Introduce Steiner trees
- Hypercube graphs
- Maximum parsimony as a Steiner tree problem
28Maximum Parsimony
- Definitions
- Let N be a set of nodes
- Let E be a set of undirected edges with non-negative weights
- Let G = (N, E) be an undirected graph
- Let $X \subseteq N$ be a subset of nodes.
- A Steiner tree ST for X is any connected subtree of G that contains all nodes of X and possibly nodes in N - X.
- Weighted Steiner Tree Problem: Given G and X, find the Steiner tree of minimum total weight.
29Maximum Parsimony
- More Definitions
- A hypercube of dimension d is an undirected graph with $2^d$ nodes, labeled $0..2^d - 1$. Adjacent nodes differ in exactly one bit position of their labels.
- The weighted Steiner tree problem on hypercubes: G must be a hypercube.
30Maximum Parsimony
- More Definitions
- Maximum Parsimony: Occam's razor applied to phylogenetic reconstruction. A preference for trees requiring fewer evolutionary events to explain the data.
- Gusfield's definition:
- The Maximum Parsimony problem is the unweighted Steiner tree problem on a d-dimensional hypercube.
31Maximum Parsimony
- More about the hypercube formulation of MP
- The input taxa X are described as d-length binary vectors.
- Recall: adjacent nodes differ in only one label bit position.
- Correspondingly, taxa that differ by a single mutation will be adjacent.
- Hence there exists a Steiner tree for X with l edges iff there exists a corresponding phylogenetic tree that entails l character-state mutations.
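The hypercube view is easy to make concrete when taxa are stored as integers whose bits are the d binary characters. The following is a small sketch under that encoding assumption; `hamming` and `neighbors` are hypothetical helper names.

```python
def hamming(a, b):
    """Number of bit positions (characters) in which labels a and b differ."""
    return bin(a ^ b).count("1")


def neighbors(node, d):
    """Hypercube neighbours of `node`: flip each of the d bits in turn."""
    return [node ^ (1 << bit) for bit in range(d)]


taxon = 0b101                # hypothetical taxon with d = 3 characters
adjacent = neighbors(taxon, 3)
print(adjacent)
print([hamming(taxon, other) for other in adjacent])
```

Each neighbour is one mutation away, so a path of l hypercube edges between two taxa corresponds to l character-state changes.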
32Steiner interpretation of Perfect Phylogeny
- Define a nontrivial binary character to be a character contained by some taxa but not all.
- Consider an MP dataset of d nontrivial binary characters
- Q: What is the minimal number of mutations in the MP tree?
- A: At least d.
33Steiner interpretation of Perfect Phylogeny
- Q: What is the relation to binary perfect phylogeny?
- A: The binary perfect phylogeny problem is equivalent to asking if there is an MP solution with a cost of exactly d.
- Q: What about generalized perfect phylogeny?
- A: It's similar. The lower bound must reflect
- the number of character states in the input taxa.
- a character having r states in the input taxa is allowed only r-1 transitions.
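The lower bound described above amounts to summing r - 1 over the characters, where r is the number of states a character takes in the input taxa. A minimal sketch follows; the function name and both example matrices are assumptions for illustration.

```python
def parsimony_lower_bound(M):
    """Sum over characters (columns) of (number of observed states - 1)."""
    return sum(len(set(column)) - 1 for column in zip(*M))


# Five nontrivial binary characters: the bound is d = 5.
binary = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0],
]
# One 3-state and one 2-state character: the bound is (3 - 1) + (2 - 1) = 3.
multistate = [["A", "x"], ["C", "x"], ["G", "y"]]
print(parsimony_lower_bound(binary), parsimony_lower_bound(multistate))
```

A perfect phylogeny exists exactly when some tree meets this bound; computing the bound itself says nothing about whether such a tree exists.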
34Steiner interpretation of Perfect Phylogeny
- Complexity
- No known efficient solution for the Steiner tree problem on unweighted graphs.
- Polynomial-time solution for the generalized perfect phylogeny problem when r is fixed.
- Hence this particular Steiner tree problem can be answered in polynomial time.
35Steiner interpretation of Perfect Phylogeny
- MP approximations
- The weighted Steiner tree problem on hypercubes is NP-hard.
- There is an approximation method with an error bound of a factor of 11/6.
- Also, an MST can be used to find a Steiner tree with weight less than twice that of the optimal Steiner tree.
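The MST bound can be illustrated with Prim's algorithm run on the input taxa alone, using Hamming distance as the edge weight. This is a sketch under assumptions: taxa are encoded as integers, the names `hamming` and `mst_upper_bound` are hypothetical, and the three-taxon input was chosen so that the optimal Steiner tree is easy to see by hand.

```python
def hamming(a, b):
    """Number of differing bit positions between two taxon labels."""
    return bin(a ^ b).count("1")


def mst_upper_bound(taxa):
    """Prim's algorithm on the complete graph over the taxa.

    Returns (total weight, list of MST edges as index pairs).
    """
    n = len(taxa)
    in_tree = [False] * n
    in_tree[0] = True
    best = [hamming(taxa[0], t) for t in taxa]  # cheapest link into the tree
    link = [0] * n                              # tree endpoint of that link
    total, edges = 0, []
    for _ in range(n - 1):
        j = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k])
        in_tree[j] = True
        total += best[j]
        edges.append((link[j], j))
        for k in range(n):
            if not in_tree[k] and hamming(taxa[j], taxa[k]) < best[k]:
                best[k] = hamming(taxa[j], taxa[k])
                link[k] = j
    return total, edges


taxa = [0b110, 0b101, 0b011]  # hypothetical taxa, pairwise Hamming distance 2
total, edges = mst_upper_bound(taxa)
print(total)  # 4
```

Here the MST costs 4, while adding the Steiner node 111 connects all three taxa with 3 mutations, so the MST weight stays below twice the optimum.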
36Phylogenetic Alignment
- Recall
- Phylogenetic alignment was discussed in section 14.8
- The focus was on deriving a multiple alignment enlightened by evolutionary history.
- The tree focused emphasis on specific alignment groupings
- Internal node sequences were a secondary artifact
37Phylogenetic Alignment
- Phylogenetic alignment as a parsimony problem
- In contrast
- we are now interested in the internal sequences
- These sequences are waypoints in the evolutionary trajectory leading to the extant taxa
- Phylogenetic alignment is thus a parsimony problem
38Phylogenetic Alignment
- Hypothesis: the optimal phylogenetic alignment describes evolutionary history.
- Assumptions:
- Edit distance realistically models evolutionary distance
- A globally optimal phylogenetic alignment captures the essence of the evolutionary process
- We will look at minimum mutation, a variant of phylogenetic alignment
39Fitch-Hartigan minimum mutation problem
- Defn. minimum mutation problem: a variant of the phylogenetic alignment problem.
- Input comprised of:
- Tree
- Strings labeling the leaves
- A multiple alignment of those strings
40Fitch-Hartigan minimum mutation problem
- Q: If you are given the tree and the multiple alignment, what is left to compute?
- A: The mutations that account for the input data.
- These mutations should be
- a minimum sequence of site mutations that is
- compatible with the given tree and
- the given multiple alignment.
41Fitch-Hartigan minimum mutation problem
- Q: How is the input data used to determine the minimum sequence of mutations?
- The multiple alignment associates each amino acid with a specific position.
- The evolutionary history of the sequences is then treated as a combination of independent evolutionary histories, one per position.
- The tree guides the order of mutations for each position.
42Fitch-Hartigan minimum mutation problem
- Assumptions
- Each column of the alignment can be solved separately
- The strings labeling inner nodes adhere to the same alignment
- The problem reduces to a computation at a single position.
43Fitch-Hartigan minimum mutation problem
- Minimum mutation for a single position
- Input
- rooted tree with n nodes
- Each leaf is labeled by a single character
- Output
- Each interior node is labeled by a single character
- The labeling minimizes the number of edges between nodes with different labels.
44Fitch-Hartigan minimum mutation problem
- Algorithmic approach: dynamic programming
- Let $T_v$ denote the subtree rooted at node v
- Let C(v) be the cost of the optimal solution for $T_v$
- Let C(v, x) be the cost when v must be labeled by x
- Let $v_i$ denote the ith child of node v
- Base case: for each leaf v, specify C(v) and C(v, x) for all $x \in \Sigma$.
- $C(v) = 0$ and $C(v, x) = 0$ if leaf v is labeled by x.
- $C(v, x) = \infty$ if leaf v is not labeled by x.
45Fitch-Hartigan minimum mutation problem
- When v is an internal node:
- The recurrence relations start from the base cases.
- Bottom up from the leaves
- Backtracking is used after all C(v, x) are computed to extract the solution.
46Fitch-Hartigan minimum mutation problem
- Backtracking process
- The root r is labeled by the character x s.t. $C(r) = C(r, x)$
- The traversal is then top-down
- If v is labeled x, then $v_i$ is labeled
- character x if $C(v_i) + 1 > C(v_i, x)$
- o/w character y such that $C(v_i) = C(v_i, y)$
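The bottom-up pass and the backtracking rule can be sketched together for a single alignment column. Two things here are assumptions rather than content taken from the slides: the recurrence for internal nodes, $C(v, x) = \sum_i \min(C(v_i, x),\ C(v_i) + 1)$, chosen because it agrees with the backtracking rule above, and the names `min_mutation`, `children`, and `leaf_char` together with the small example tree.

```python
import math


def min_mutation(children, leaf_char, root, alphabet):
    """Return (C(root), labels) for one alignment column."""
    C = {}     # C[v][x]: cost of T_v when v must be labeled x
    best = {}  # best[v]: C(v), the optimal cost of T_v

    def solve(v):  # bottom-up pass
        kids = children.get(v, [])
        if not kids:
            # Base case: 0 for the leaf's own character, infinity otherwise.
            C[v] = {x: 0 if x == leaf_char[v] else math.inf for x in alphabet}
        else:
            for c in kids:
                solve(c)
            # Assumed recurrence: each child keeps x, or pays 1 to switch.
            C[v] = {x: sum(min(C[c][x], best[c] + 1) for c in kids)
                    for x in alphabet}
        best[v] = min(C[v].values())

    labels = {}

    def assign(v, x):  # top-down backtracking
        labels[v] = x
        for c in children.get(v, []):
            if best[c] + 1 > C[c][x]:
                assign(c, x)  # keeping the parent's character is optimal
            else:
                assign(c, min(alphabet, key=lambda y: C[c][y]))

    solve(root)
    assign(root, min(alphabet, key=lambda x: C[root][x]))
    return best[root], labels


# Hypothetical rooted tree: r -> (u, w), u -> (a, b), w -> (c, d).
children = {"r": ["u", "w"], "u": ["a", "b"], "w": ["c", "d"]}
leaf_char = {"a": "A", "b": "C", "c": "A", "d": "A"}
cost, labels = min_mutation(children, leaf_char, "r", "ACGT")
print(cost, labels)
```

Keeping C(v) alongside C(v, x) is what lets each node be evaluated once per character, instead of comparing every pair of parent and child characters.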
47Fitch-Hartigan minimum mutation problem
- Let's evaluate an example
- $C(v) = 0$ and $C(v, x) = 0$ if leaf v is labeled by x; o/w $C(v, x) = \infty$.
48Fitch-Hartigan minimum mutation problem
- Time complexity
- Bottom-up portion
- Let $s = |\Sigma|$
- Each node is evaluated w.r.t. each $x \in \Sigma$.
- For n nodes this gives O(ns)
- The backtracking portion is O(n)
- Overall: O(ns)
49Maximum Parsimony
- Most widely used tree building algorithm
- Differs from distance-based algorithms
- Does not actually build trees from distances
- Parsimony is used to compute the cost of a tree
- A search strategy is used to search through all topologies
- Goal: find the tree topology with the overall minimum cost
50Traditional Parsimony
- Algorithm: Traditional parsimony [Fitch 1971]
- Goal: count the number of substitutions at a site.
- Method: recursion, keeping track of
- C, the current cost
- $R_k$, the residues at k, the current node
51Traditional Parsimony
- Algorithm: Traditional parsimony [Fitch 1971]
- C = 0, k = root /* initialize the cost and the current node */
- TP(k):
- If k is a leaf then return $\{x_k\}$, the residue at leaf k
- $R_{left}$ = TP(k.left)
- $R_{right}$ = TP(k.right)
- if $R_{left} \cap R_{right} \neq \emptyset$, return $R_{left} \cap R_{right}$
- else
- C = C + 1
- return $R_{left} \cup R_{right}$
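The recursion above can be written as runnable Python with sets standing in for the residue sets. This is a sketch under assumptions: the tree is a dictionary mapping each internal node to its (left, right) children, and the names `fitch_cost`, `tree`, and `leaf_residue` and the example data are hypothetical.

```python
def fitch_cost(tree, leaf_residue, root):
    """Return (C, R_root): substitution count and residue set at the root."""
    cost = 0

    def tp(k):
        nonlocal cost
        if k in leaf_residue:            # leaf: its single observed residue
            return {leaf_residue[k]}
        left, right = tree[k]
        r_left, r_right = tp(left), tp(right)
        if r_left & r_right:             # non-empty intersection: no new cost
            return r_left & r_right
        cost += 1                        # disjoint sets: one substitution
        return r_left | r_right

    root_set = tp(root)
    return cost, root_set


# Hypothetical binary tree: r -> (u, w), u -> (a, b), w -> (c, d).
tree = {"r": ("u", "w"), "u": ("a", "b"), "w": ("c", "d")}
leaf_residue = {"a": "A", "b": "C", "c": "A", "d": "A"}
cost, root_set = fitch_cost(tree, leaf_residue, "r")
print(cost, root_set)
```

The counter is kept in the enclosing function rather than as a global, so repeated calls on different sites do not interfere with each other.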
52Traditional Parsimony
- Let's evaluate an example
- if $R_{left} \cap R_{right} \neq \emptyset$, return $R_{left} \cap R_{right}$
- else C = C + 1, return $R_{left} \cup R_{right}$
53Traditional Parsimony
- There is a traceback procedure for finding ancestral assignments.
- Q: How do you think the traceback works?
- A: Start from the root
- Pick a residue
- Pick the same residue for each child set if possible
- If a child set does not contain the parent's residue, randomly select a residue from its set.
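The traceback just described needs the residue set of every node, so the sketch below first stores those sets and then walks down from the root. It is illustrative only: the function names and example tree are hypothetical, and where the slide says to select a residue at random, this version takes the alphabetically smallest one so that the result is reproducible.

```python
def fitch_sets(tree, leaf_residue, k, R):
    """Fill R[k] with the residue set of every node under k; return R[k]."""
    if k in leaf_residue:
        R[k] = {leaf_residue[k]}
    else:
        left, right = tree[k]
        r_left = fitch_sets(tree, leaf_residue, left, R)
        r_right = fitch_sets(tree, leaf_residue, right, R)
        # Intersection if non-empty, otherwise union.
        R[k] = (r_left & r_right) or (r_left | r_right)
    return R[k]


def traceback(tree, R, root, choose=min):
    """Assign one residue per node, top-down from the root."""
    assignment = {root: choose(R[root])}
    stack = [root]
    while stack:
        k = stack.pop()
        for child in tree.get(k, ()):
            if assignment[k] in R[child]:
                assignment[child] = assignment[k]   # keep the parent's residue
            else:
                assignment[child] = choose(R[child])
            stack.append(child)
    return assignment


# Hypothetical binary tree: r -> (u, w), u -> (a, b), w -> (c, d).
tree = {"r": ("u", "w"), "u": ("a", "b"), "w": ("c", "d")}
leaf_residue = {"a": "A", "b": "C", "c": "A", "d": "A"}
R = {}
fitch_sets(tree, leaf_residue, "r", R)
assignment = traceback(tree, R, "r")
print(assignment)
```

To follow the slide literally, pass a random chooser such as `lambda s: random.choice(sorted(s))` as `choose`, after importing `random`.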
54Traditional Parsimony
- Let's perform the traceback on our example