Bioinformatics Algorithms and Data Structures - PowerPoint PPT Presentation

About This Presentation

Title:

Bioinformatics Algorithms and Data Structures

Description:

UNIVERSITY OF SOUTH CAROLINA. College of Engineering & Information Technology ... By contraposition: D' ultrametric D additive. Q: does D' ultrametric D additive? ... – PowerPoint PPT presentation

Slides: 55

Provided by: john244

Learn more at: https://www.cse.sc.edu

Transcript and Presenter's Notes

Title: Bioinformatics Algorithms and Data Structures

1
Bioinformatics Algorithms and Data Structures

Chapter 17.4-6 Strings and Evolutionary Trees
Lecturer Dr. Rose
Slides by Dr. Rose
April 17, 2003

2
Ultrametric Problem Centrality

Four related tree problems
Ultrametric
Additive
Binary perfect phylogeny
Tree compatibility
All can be solved as ultrametric tree problems.
Recall tree compatibility reduces to perfect
phylogeny.
Now we reduce the additive tree and (binary) perfect
phylogeny problems to the ultrametric tree
problem.

3
Ultrametric Problem Additive Trees

Goal: reduce the additive tree problem to the ultrametric problem.
Complexity: O(n^2) reduction.
Approach: create a matrix D' that is ultrametric if and only if D is additive.
We will start by describing a reduction that involves a tree T for D and T' for D'.
We will then describe a direct reduction of D to D'.

4
Ultrametric Problem Additive Trees

Assume that D is additive.
Assume that we know of an additive tree T for D
Assume that each of the n taxa in D labels a leaf
of T.
Idea: label the nodes of T to create an ultrametric tree T'.
Q: How can we do this?

5
Ultrametric Problem Additive Trees

A: We will do the following:
Select one node as the root.
Stretch the leaf edges so that all leaves are equidistant from the root.
Let v be the row of D containing the largest entry.
Let m_v denote the value of this entry.
Select node v as the root of T.
This creates a directed tree.

6
Ultrametric Problem Additive Trees

Example
A is the row of D containing the largest entry.
Select node A as the root of T.

7
Ultrametric Problem Additive Trees

Stretch leaf edges:
for each leaf i, add m_A - D(A, i) to the leaf edge.
All leaves are now equidistant from A.

8
Ultrametric Problem Additive Trees

The resulting tree T' is:
a rooted edge-weighted tree
with distance m_v from the root to every leaf;
each internal node is equidistant to the leaves in its subtree.

9
Ultrametric Problem Additive Trees

Since each internal node is equidistant to the leaves in its subtree:
Label each internal node by this unique distance.
These labels can be used to define an ultrametric matrix D'.
D'(i, j) is the label at the least common ancestor of leaves i and j in T'.
Q: How can we go directly from matrix D to matrix D' without involving T and T'?

10
Ultrametric Problem Additive Trees

Consider leaves i and j in T.
Let node w be their least common ancestor.
Let x be the distance from the root v to w.
Let y be the distance from node w to leaf i.

11
Ultrametric Problem Additive Trees

Q: What is the distance from w to i in T'?
A: y + m_v - D(v, i).
Q: Where does m_v - D(v, i) come from?
A: Recall we add m_v - D(v, i) to stretch the leaf edges.

12
Ultrametric Problem Additive Trees

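Gusfield presents the following lemma:
Without knowing T or T' explicitly, we can deduce that D'(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j))/2.
Q: Is this equation correct?
Let z be the distance from w to leaf j, so D(i, j) = y + z, D(v, i) = x + y, and D(v, j) = x + z.
D'(i, j) = m_v + ((y + z) - (x + y) - (x + z))/2 = m_v - 2x/2 = m_v - x.
Should it instead be D'(i, j) = 2m_v + D(i, j) - D(v, i) - D(v, j), i.e., D'(i, j) = 2m_v - 2x?
No: m_v - x is the distance from w to leaf i in T' (slide 11), which is the label at w and therefore D'(i, j) as defined on slide 9.
The value 2m_v - 2x is the leaf-to-leaf path length in T', i.e., 2D'(i, j).
Either matrix is ultrametric, so the choice does not affect the reduction (slide 9).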

13
Ultrametric Problem Additive Trees

This brings us to the following Theorem:
If D is an additive matrix, then D' is ultrametric, where D'(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j))/2.
Proof. We've shown that
D'(i, j) = y + m_v - D(v, i)
y = D(v, i) - x
x = (D(v, i) + D(v, j) - D(i, j))/2
Putting it all together establishes the equation in the theorem.
D' satisfies the ultrametric requirement.
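
The direct reduction from D to D' can be sketched in a few lines. This is a minimal illustration rather than the lecture's own code: the function names, the example matrix, and the tolerance parameter are assumptions, and the ultrametric test uses the standard three-point condition (in every triple of taxa the two largest values tie), which takes O(n^3) time instead of building a tree.

```python
from itertools import combinations

def to_ultrametric(D):
    """Build D' from a symmetric distance matrix D (list of lists)."""
    n = len(D)
    # v: a row of D containing the largest entry; m_v: that entry.
    v = max(range(n), key=lambda r: max(D[r]))
    m_v = max(D[v])
    Dp = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # D'(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j)) / 2
                Dp[i][j] = m_v + (D[i][j] - D[v][i] - D[v][j]) / 2
    return Dp, v, m_v

def is_ultrametric(Dp, tol=1e-9):
    """Three-point condition: in every triple the two largest values tie."""
    for i, j, k in combinations(range(len(Dp)), 3):
        _, mid, top = sorted((Dp[i][j], Dp[i][k], Dp[j][k]))
        if top - mid > tol:
            return False
    return True

# Hypothetical additive matrix for four taxa: leaf edges 1, 2, 4, 5 and
# one internal edge of length 3 separating taxa {0, 1} from {2, 3}.
D = [[0, 3, 8, 9],
     [3, 0, 9, 10],
     [8, 9, 0, 9],
     [9, 10, 9, 0]]
Dp, v, m_v = to_ultrametric(D)
print(v, m_v, is_ultrametric(Dp))
```

Building D' is the O(n^2) part; the three-point test is only a convenient way to confirm the result on a small example.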

14
Ultrametric Problem Additive Trees

Q: What is the value of y?
A: y = D(v, i) - x.
Q: What is the value of x in terms of values in D?
A: x = (D(v, i) + D(v, j) - D(i, j))/2

15
Ultrametric Problem Additive Trees

So D additive ⇒ D' ultrametric.
By contraposition: D' not ultrametric ⇒ D not additive.
Q: Does D' ultrametric ⇒ D additive?
A: Theorem: D' ultrametric ⇒ D additive.
Proof (constructive):
Let T' be the ultrametric tree for D'.
Assign weights to the edges of T'.
Note: the sum of the edge weights from a leaf to an ancestor must match the ancestor's label.
For each edge (p, q), assign the weight p - q, the difference between the labels of p and q.
16
Ultrametric Problem Additive Trees

Assign weights to edges of T' (continued):
Note: the path distance between leaves i and j is twice the value labeling their least common ancestor.
Hence the path distance is 2D'(i, j) = 2m_v + D(i, j) - D(v, i) - D(v, j).
Now shrink the edge into each leaf i by m_v - D(v, i).
The path from leaf i to leaf j is now D(i, j).
The result is an additive tree for matrix D, obtained from the ultrametric tree for D'.
Putting all of this together results in a method for constructing an additive tree for an additive matrix.

17
Ultrametric Problem Additive Trees

Additive Tree Algorithm
Create matrix D' from D.
Create ultrametric tree T' from D'.
Create T from T':
Label edge (p, q) with the value p - q.
For each leaf i, shrink the leaf edge by m_v - D(v, i).
Note: no step takes more than O(n^2) time.
Thm. An additive tree for an additive matrix can be constructed in O(n^2) time.
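
The arithmetic of the final shrinking step can be checked at the matrix level without building any tree. The sketch below assumes that the leaf-to-leaf path length in T' is 2D'(i, j) and subtracts the two shrink amounts; the function name and the example matrices are hypothetical, with D' taken to be the matrix obtained from D using v = 1 and m_v = 10.

```python
def shrink_leaf_edges(Dp, D_v, m_v):
    """Leaf-to-leaf path lengths after shrinking each leaf edge of T'.

    Dp  : ultrametric matrix D' (labels at least common ancestors)
    D_v : row v of the original matrix D
    m_v : largest entry of D
    """
    n = len(Dp)
    return [[0 if i == j else
             2 * Dp[i][j] - (m_v - D_v[i]) - (m_v - D_v[j])
             for j in range(n)]
            for i in range(n)]

# Hypothetical additive matrix D and its D' (v = 1, m_v = 10).
D = [[0, 3, 8, 9],
     [3, 0, 9, 10],
     [8, 9, 0, 9],
     [9, 10, 9, 0]]
Dp = [[0, 10, 8, 8],
      [10, 0, 10, 10],
      [8, 10, 0, 5],
      [8, 10, 5, 0]]

recovered = shrink_leaf_edges(Dp, D[1], 10)
print(recovered == D)
```

If the recovered matrix differs from D, either D was not additive or D' was not computed from it.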

18
Ultrametric Problem Additive Trees

Example: Given D, first find D'.
Recall: D'(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j))/2

19
Ultrametric Problem Additive Trees

Example: From D' find T'.
Recall: label inner edges (p, q) by p - q.

20
Ultrametric Problem Additive Trees

Example: From T' find T.
Recall: shrink leaf edge i by m_v - D(v, i).

21
Ultrametric Problem Additive Trees

Example Finally compare the derived T with the
original tree as a sanity check.

22
Ultrametric Problem Perfect Phylogeny

We now recast perfect phylogeny in terms of an
ultrametric tree problem.
Defn. D_M: the n by n matrix of shared characters.
More formally:
Given the n by m character matrix M, define the n by n matrix D_M as follows: for each pair of objects p and q, set D_M(p, q) to be the number of characters that p and q both possess.
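
As a small illustration of this definition, the sketch below computes D_M from a 0/1 character matrix stored as a list of rows. The function name and the example matrix are assumptions made for the example.

```python
def shared_character_matrix(M):
    """D_M(p, q) = number of characters that objects p and q both possess.

    M is an n-by-m list of rows with entries 0 or 1.
    """
    n = len(M)
    return [[sum(a & b for a, b in zip(M[p], M[q])) for q in range(n)]
            for p in range(n)]

# Hypothetical character matrix: 3 objects, 3 binary characters.
M = [[1, 1, 0],
     [1, 0, 0],
     [0, 0, 1]]
DM = shared_character_matrix(M)
print(DM)
```

The diagonal entry D_M(p, p) is simply the number of characters that p possesses.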

23
Ultrametric Problem Perfect Phylogeny

Lemma: If M has a perfect phylogeny, then D_M is a min-ultrametric matrix.
Proof: convert M's perfect phylogeny T to a min-ultrametric tree for D_M.
Let T be the perfect phylogeny for M.
Label T's root with zero.
Traverse T from top to bottom; for each node v:
Let p_v be the number labeling node v's parent.
Let e_v be the number of characters labeling the edge into v.
Label node v with the sum p_v + e_v.

24
Ultrametric Problem Perfect Phylogeny

The label of node v is the number of characters common to all leaves in the subtree rooted at v.
If v is the immediate parent of leaves p and q, then the label of v is D_M(p, q).
The numbers labeling nodes on any path from the root are strictly increasing.
The result is a min-ultrametric tree for matrix D_M.

25
Ultrametric Problem Perfect Phylogeny

Algorithm: perfect phylogeny via ultrametrics
1. Create matrix D_M from M.
2. Attempt to create a min-ultrametric tree T' from D_M. If not possible, then M has no perfect phylogeny.
3. If T' was successfully created in step 2, attempt to label its edges with the m characters of M. If not possible, then M has no perfect phylogeny. Otherwise, the modified T' is the perfect phylogeny T.
Note: D_M may have a min-ultrametric tree even though M has no perfect phylogeny, hence the check in step 3.
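
One quick way to see when step 2 must fail is the three-point condition: in a min-ultrametric matrix, the smallest of the three pairwise values in any triple is attained at least twice. The sketch below checks only this necessary condition and does not build the tree; the function name and both example matrices are assumptions.

```python
from itertools import combinations

def min_ultrametric_violation(DM):
    """Return a triple (p, q, r) whose minimum pairwise value is unique,
    or None if every triple passes the three-point condition."""
    for p, q, r in combinations(range(len(DM)), 3):
        low, mid, _ = sorted((DM[p][q], DM[p][r], DM[q][r]))
        if low != mid:
            return (p, q, r)
    return None

# Hypothetical D_M with shared counts 2, 1, 0: the minimum is unique.
DM_bad = [[3, 2, 1],
          [2, 2, 0],
          [1, 0, 1]]

# D_M for M = [[1,1,0],[0,1,1],[1,0,1]]: every pair shares one character.
DM_cyclic = [[2, 1, 1],
             [1, 2, 1],
             [1, 1, 2]]

print(min_ultrametric_violation(DM_bad))
print(min_ultrametric_violation(DM_cyclic))
```

The second matrix passes the test even though the three characters of its M overlap pairwise without nesting, so M has no perfect phylogeny. This is the situation the note about step 3 guards against.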

26
Ultrametric Problem Perfect Phylogeny

Final notes on the centrality of the ultrametric
problem.
We can see that the following problems
perfect phylogeny
tree compatibility
can be cast as ultrametric problems.
This is not an efficient way to address these
problems.

27
Maximum Parsimony

Maximum parsimony
Perfect phylogeny is a special instance
Can be viewed as a Steiner tree problem on a
hypercube
Presentation Approach
Introduce Steiner trees
Hypercube graphs
Maximum parsimony as a Steiner tree problem

28
Maximum Parsimony

Definitions
Let N be a set of nodes
Let E be a set of undirected edges with non-negative weights.
Let G = (N, E) be an undirected graph.
Let X ⊆ N be a subset of nodes.
A Steiner tree ST for X is any connected subtree of G that contains all nodes of X and possibly nodes in N - X.
Weighted Steiner Tree Problem: Given G and X, find the Steiner tree of minimum total weight.

29
Maximum Parsimony

More Definitions
A hypercube of dimension d is an undirected graph with 2^d nodes, labeled 0..2^d - 1. Adjacent nodes differ in exactly one bit position of their labels.
The weighted Steiner tree problem on hypercubes: G must be a hypercube.

30
Maximum Parsimony

More Definitions
Maximum Parsimony: Occam's razor applied to phylogenetic reconstruction. A preference for trees requiring fewer evolutionary events to explain data.
Gusfield's definition:
The Maximum Parsimony problem is the unweighted Steiner tree problem on a d-dimensional hypercube.

31
Maximum Parsimony

More about the hypercube formulation of MP
The input taxa X are described as d-length binary vectors.
Recall: adjacent nodes differ in exactly one bit position of their labels.
Correspondingly, taxa that differ by a single mutation will be adjacent.
Hence there is a Steiner tree for X with l edges iff there is a corresponding phylogenetic tree that entails l character-state mutations.

32
Steiner interpretation of Perfect Phylogeny

Define a nontrivial binary character to be a
character contained by some taxa but not all.
Consider an MP dataset of d nontrivial binary
characters
Q: What is the minimal number of mutations in the MP tree?
A: At least d.

33
Steiner interpretation of Perfect Phylogeny

Q: What is the relation to binary perfect phylogeny?
A: The binary perfect phylogeny problem is equivalent to asking if there is an MP solution with a cost of exactly d.
Q: What about generalized perfect phylogeny?
A: It's similar. The lower bound must reflect the number of character states in the input taxa: a character having r states in the input taxa is allowed only r - 1 transitions.
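
The lower bound described above is easy to compute: each character contributes the number of states it shows in the input minus one. A short sketch, where the function name and the example taxa are assumptions:

```python
def mutation_lower_bound(taxa):
    """Sum over characters (columns) of (number of observed states - 1).

    taxa is a list of equal-length sequences (strings or lists).
    """
    return sum(len(set(column)) - 1 for column in zip(*taxa))

# Two nontrivial binary characters give the bound d = 2.
print(mutation_lower_bound([[1, 0], [0, 1], [1, 1]]))
# Multi-state example: columns show 2, 1, and 2 states.
print(mutation_lower_bound(["ACG", "ACT", "GCT"]))
```

A perfect phylogeny exists exactly when some tree meets this bound; the function only computes the bound and says nothing about whether it is attainable.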

34
Steiner interpretation of Perfect Phylogeny

Complexity
No known efficient solution for the Steiner tree problem on unweighted graphs.
Polynomial-time solution for the generalized perfect phylogeny problem when r is fixed.
⇒ This particular Steiner tree problem can be answered in polynomial time.

35
Steiner interpretation of Perfect Phylogeny

MP approximations
The weighted Steiner tree problem on hypercubes
is NP-hard.
There is an approximate method with an error
bound of a factor of 11/6.
Also MST can be used to find a Steiner tree with
weight less than twice the optimal Steiner tree.

36
Phylogenetic Alignment

Recall
phylogenetic alignment was discussed in section
14.8
The focus was on deriving a multiple alignment
enlightened by evolutionary history.
The tree focused emphasis on specific alignment
groupings
Internal node sequences were a secondary artifact

37
Phylogenetic Alignment

Phylogenetic alignment as a parsimony problem
In contrast
we are now interested in the internal sequences
These sequences are waypoints in the evolutionary
trajectory leading to the extant taxa
phylogenetic alignment is thus a parsimony problem

38
Phylogenetic Alignment

Hypothesis optimal phylogenetic alignment
describes evolutionary history.
Assumptions
Edit distance realistically models evolutionary
distance
Globally optimal phylogenetic alignment captures
essence of the evolutionary process
We will look at minimum mutation, a variant of
phylogenetic alignment

39
Fitch-Hartigan minimum mutation problem

Defn. minimum mutation problem variant of
phylogenetic alignment problem.
Input comprised of
Tree
Strings labeling the leaves
A multiple alignment of those strings

40
Fitch-Hartigan minimum mutation problem

Q: If you are given the tree and the multiple alignment, what is left to compute?
A: The mutations that account for the input data.
These mutations should be a minimum sequence of site mutations that is compatible with the given tree and the given multiple alignment.

41
Fitch-Hartigan minimum mutation problem

Q How is the input data used to determine the
minimum sequence of mutations?
The multiple alignment associates each amino acid
with a specific position.
The evolutionary history of the sequences is then
treated as a combined but independent
evolutionary history of each position.
The tree guides the order of mutations for each
position.

42
Fitch-Hartigan minimum mutation problem

Assumptions
Each column of the alignment can be solved
separately
The strings labeling inner nodes adhere to the
same alignment
The problem reduces to a computation at a single
position.

43
Fitch-Hartigan minimum mutation problem

Minimum mutation for a single position
Input
rooted tree with n nodes
Each leaf is labeled by a single character
Output
Each interior node is labeled by a single
character
The labeling minimizes the number of edges
between nodes with different labels.

44
Fitch-Hartigan minimum mutation problem

Algorithmic approach: Dynamic Programming
Let T_v denote the subtree rooted at node v.
Let C(v) be the cost of the optimal solution for T_v.
Let C(v, x) be the cost when v must be labeled by x.
Let v_i denote the ith child of node v.
Base case: for each leaf v, specify C(v) and C(v, x) for all x ∈ Σ.
C(v) = 0; C(v, x) = 0 if leaf v is labeled by x.
C(v, x) = ∞ if leaf v is not labeled by x.

45
Fitch-Hartigan minimum mutation problem

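When v is an internal node: C(v, x) = sum over children v_i of min(C(v_i) + 1, C(v_i, x)), and C(v) = min over x ∈ Σ of C(v, x).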

The recurrence relations start from the base cases.
They are evaluated bottom-up from the leaves.
Backtracking is used after all C(v, x) are computed to extract the solution.

46
Fitch-Hartigan minimum mutation problem

Backtracking process
The root r is labeled by the character x s.t. C(r) = C(r, x).
The traversal is then top-down.
If v is labeled x, then v_i is labeled
character x if C(v_i) + 1 > C(v_i, x),
o/w character y such that C(v_i) = C(v_i, y).
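
The dynamic program and the backtracking rule can be sketched together. The internal-node recurrence assumed here is C(v, x) = sum over children v_i of min(C(v_i) + 1, C(v_i, x)), with C(v) = min over x of C(v, x); this is the form implied by the backtracking rule above. The tree representation (a dict from node to list of children), the function name, and the example are all assumptions.

```python
import math

def min_mutation(children, leaf_char, root, alphabet):
    """Single-position minimum mutation by dynamic programming.

    children  : dict mapping each internal node to a list of its children
    leaf_char : dict mapping each leaf to its character
    alphabet  : ordered sequence of characters (ties go to the earliest)
    Returns (optimal cost, dict node -> assigned character).
    """
    C, Cx = {}, {}

    def solve(v):                      # bottom-up pass
        kids = children.get(v, [])
        if not kids:                   # base case: leaf
            Cx[v] = {x: 0 if leaf_char[v] == x else math.inf for x in alphabet}
        else:
            for c in kids:
                solve(c)
            Cx[v] = {x: sum(min(C[c] + 1, Cx[c][x]) for c in kids)
                     for x in alphabet}
        C[v] = min(Cx[v].values())

    label = {}

    def assign(v, x):                  # top-down backtracking
        label[v] = x
        for c in children.get(v, []):
            if C[c] + 1 > Cx[c][x]:
                assign(c, x)           # keeping x is at least as cheap
            else:
                assign(c, min(alphabet, key=lambda a: Cx[c][a]))

    solve(root)
    assign(root, min(alphabet, key=lambda a: Cx[root][a]))
    return C[root], label

# Hypothetical tree: r -> (u, w), u -> (a, b), w -> (c, d).
children = {"r": ["u", "w"], "u": ["a", "b"], "w": ["c", "d"]}
leaf_char = {"a": "A", "b": "C", "c": "A", "d": "A"}
cost, label = min_mutation(children, leaf_char, "r", "ACGT")
print(cost, label)
```

The recursion is written for clarity; a very deep tree would need an explicit stack instead.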

47
Fitch-Hartigan minimum mutation problem

Let's evaluate an example.
C(v) = 0; C(v, x) = 0 if leaf v is labeled by x,
o/w C(v, x) = ∞.

48
Fitch-Hartigan minimum mutation problem

Time complexity
Bottom-up portion:
Let s = |Σ|.
Each node is evaluated wrt each x ∈ Σ.
For n nodes this gives O(ns).
The backtracking portion is O(n).
Overall: O(ns)

49
Maximum Parsimony

Most widely used tree building algorithm
Differs from distance-based algorithms
Does not actually build trees from distances
Parsimony is used to compute the cost of a tree
A search strategy is used to search through all
topologies
Goal find the tree topology with the overall
minimum cost

50
Traditional Parsimony

Algorithm: Traditional parsimony (Fitch 1971)
Goal: count the number of substitutions at a site.
Method: recursion, keeping track of
C, the current cost
R_k, the residues at k, the current node

51
Traditional Parsimony

Algorithm: Traditional parsimony (Fitch 1971)
C = 0, k = root   /* initialize the cost and the current node */
TP(k):
  If k is a leaf then return {x_k}
  R_left = TP(k.left)
  R_right = TP(k.right)
  if R_left ∩ R_right ≠ ∅ return R_left ∩ R_right
  else
    C = C + 1
    return R_left ∪ R_right
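
The recursion above maps almost line for line onto Python sets. In this sketch the tree is assumed to be a dict from each internal node to its (left, right) children, with leaves absent from the dict; the function name and the example tree are hypothetical.

```python
def fitch_cost(tree, leaf_residue, root):
    """Count substitutions at one site with Fitch's set recursion.

    tree         : dict internal node -> (left child, right child)
    leaf_residue : dict leaf -> residue
    Returns (cost, residue set at the root).
    """
    cost = 0

    def tp(k):
        nonlocal cost
        if k not in tree:                  # k is a leaf
            return {leaf_residue[k]}
        left, right = tree[k]
        r_left, r_right = tp(left), tp(right)
        common = r_left & r_right
        if common:                         # non-empty intersection
            return common
        cost += 1                          # one substitution is needed
        return r_left | r_right

    root_set = tp(root)
    return cost, root_set

tree = {"r": ("u", "w"), "u": ("a", "b"), "w": ("c", "d")}
leaf_residue = {"a": "A", "b": "C", "c": "A", "d": "A"}
cost, root_set = fitch_cost(tree, leaf_residue, "r")
print(cost, root_set)
```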

52
Traditional Parsimony

Let's evaluate an example.
if R_left ∩ R_right ≠ ∅ return R_left ∩ R_right
else C = C + 1, return R_left ∪ R_right

53
Traditional Parsimony

There is a traceback procedure for finding ancestral assignments.
Q: How do you think the traceback works?
A: Start from the root.
Pick a residue.
Pick the same residue for each child set if possible.
If a child set does not contain the parent's residue, randomly select a residue from its set.
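
A minimal sketch of this traceback follows. It recomputes the residue sets bottom-up so that the example is self-contained, and it replaces the random choice with the alphabetically smallest residue so that the output is reproducible; that substitution, the tree representation, and the names are assumptions.

```python
def fitch_assign(tree, leaf_residue, root):
    """Assign one residue per node using Fitch sets and a top-down traceback.

    tree         : dict internal node -> (left child, right child)
    leaf_residue : dict leaf -> residue
    """
    sets = {}

    def tp(k):                              # bottom-up: residue sets
        if k not in tree:
            sets[k] = {leaf_residue[k]}
        else:
            left, right = tree[k]
            a, b = tp(left), tp(right)
            sets[k] = (a & b) or (a | b)    # intersection if non-empty
        return sets[k]

    label = {}

    def down(k, parent_residue):            # top-down: pick residues
        s = sets[k]
        # Keep the parent's residue when possible; otherwise pick one
        # from the set (smallest here, instead of a random choice).
        label[k] = parent_residue if parent_residue in s else min(s)
        for child in tree.get(k, ()):
            down(child, label[k])

    tp(root)
    down(root, None)
    return label

tree = {"r": ("u", "w"), "u": ("a", "b"), "w": ("c", "d")}
leaf_residue = {"a": "A", "b": "C", "c": "A", "d": "A"}
label = fitch_assign(tree, leaf_residue, "r")
print(label)
```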

54
Traditional Parsimony