Title: Constructing Phylogenies from Quartets: Elucidation of Eutherian Superordinal Relationships
1Constructing Phylogenies from Quartets
Elucidation of Eutherian Superordinal
Relationships
A. Ben-Dor, B. Chor, D. Graur, R. Ophir, D.
Pelleg. Journal of Computational Biology, Vol. 5,
1998, pp. 377-390.
- Speaker Chuang-Chieh Lin
- National Chung Cheng University
2Outline
- Introduction and preliminaries
- Problem description
- The dynamic programming algorithm
- The space complexity and the time complexity
3Evolutionary trees
- Let S be a set of taxa and |S| = n.
- An evolutionary tree T on S is an unrooted,
leaf-labeled tree such that the leaves of T are
bijectively labeled by the taxa in S, and each
internal node of T has degree 3.
4Evolutionary trees
- For 4 taxa a, b, c, d, we have 3 possible
topologies: ad|bc, ab|cd, and ac|bd.
[Figure: the three possible topologies on a, b, c, d, labeled ad|bc, ab|cd, and ac|bd.]
5Evolutionary trees (contd.)
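- For 5 taxa a, b, c, d, e, how many possible
evolutionary trees can we derive?
- The answer is 5 × 3 = 15.
- Each of the 3 trees on a, b, c, d has 5 edges, so
there are 5 possible positions for e to be
inserted.
[Figure: a tree on a, b, c, d showing the possible insertion positions for e.]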
6Evolutionary trees (contd.)
- For n taxa, how many possible evolutionary trees
can we derive?
- The answer is (2n − 5)!!.
- This observation can be verified by induction on
n.
- For an odd positive integer n, it is defined that
n!! = n × (n − 2) × (n − 4) × ... × 3 × 1.
- If n = 15, (2n − 5)!! is approximately 8 × 10^12.
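The counts above can be checked numerically. The following is a minimal sketch; the function names `double_factorial` and `count_unrooted_trees` are hypothetical, chosen only for this illustration.

```python
def double_factorial(k):
    """Return k!! = k * (k - 2) * (k - 4) * ... down to 1 (or 2)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_trees(n):
    """Number of evolutionary trees on n >= 3 taxa, i.e. (2n - 5)!!."""
    return double_factorial(2 * n - 5)


if __name__ == "__main__":
    for n in (4, 5, 15):
        print(n, count_unrooted_trees(n))
```

For n = 4 and n = 5 this gives 3 and 15, matching the earlier slides; for n = 15 it gives 7905853580625, which is roughly 8 × 10^12.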
7n!!
- Let us analyze n!! in another way.
- For an integer m ≥ 0, let n = 2m + 1.
- Then we have n!! = (2m + 1)! / (2^m × m!).
8(2n − 5)!! = O(n^(n−2))
- For n taxa, we have (2n − 5)!! = O((n − 3)^(n−2))
= O(n^(n−2)) possible evolutionary trees.
9Quartet topologies
- A set of four taxa is called a quartet.
- Given an evolutionary tree T and a quartet {a, b,
c, d}, the quartet topology of {a, b, c, d}
induced by T is obtained by the following
procedure.
10Step 1: All leaves but a, b, c and d are deleted
from the tree. Edges adjacent to these leaves are
also deleted.
11Step 2: Internal nodes with degree two are
contracted and deleted, so their two adjacent
nodes become connected. This process is repeated
until no internal nodes of degree two are left.
For simplicity, we denote the quartet topology
above by bc|ad, which is a bipartition
of {a, b, c, d}.
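The two steps can be sketched in code. This is a minimal illustration under assumptions: the tree is given as an adjacency dictionary, and the function name `induced_quartet` and the example tree are hypothetical, not taken from the slides. Leaf deletion is repeated because removing leaves can expose new degree-one nodes that are not in the quartet.

```python
def induced_quartet(adj, quartet):
    """Return the topology induced on `quartet` as a set of two leaf pairs.

    adj: dict mapping each node of an unrooted tree to a set of neighbors.
    quartet: four leaf names.
    """
    g = {u: set(vs) for u, vs in adj.items()}  # copy, caller's tree is kept
    keep = set(quartet)

    # Step 1: delete every leaf outside the quartet, together with its edge.
    changed = True
    while changed:
        changed = False
        for u in list(g):
            if u not in keep and len(g[u]) <= 1:
                for v in g[u]:
                    g[v].discard(u)
                del g[u]
                changed = True

    # Step 2: contract internal nodes of degree two.
    for u in list(g):
        if u not in keep and len(g[u]) == 2:
            x, y = g[u]
            g[x].discard(u)
            g[y].discard(u)
            g[x].add(y)
            g[y].add(x)
            del g[u]

    # Four leaves and two internal nodes remain; leaves that share an
    # internal neighbor form one side of the bipartition.
    sides = {}
    for leaf in quartet:
        (neighbor,) = g[leaf]
        sides.setdefault(neighbor, set()).add(leaf)
    return frozenset(frozenset(s) for s in sides.values())


# Hypothetical example tree with leaves a..e and internal nodes x, y, z.
example = {
    "a": {"x"}, "b": {"x"}, "e": {"y"}, "c": {"z"}, "d": {"z"},
    "x": {"a", "b", "y"}, "y": {"x", "e", "z"}, "z": {"y", "c", "d"},
}
```

On this example tree, the quartet {a, b, c, d} yields ab|cd and the quartet {a, c, d, e} yields ae|cd.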
12- Note that each input quartet topology t is
accompanied by a positive weight C_t.
13Problem description
- Input
- A list of weighted quartet topologies over n
taxa.
- Output
- A binary tree with n leaves such that the total
weight of the satisfied quartet topologies is
maximized.
- This problem was shown to be NP-hard.
14Quartet method
- The fact that small phylogenies are easier to
infer than large ones leads to another approach:
the quartet method.
- First, consider subsets of 4 taxa, one at a time,
and infer the phylogenies (i.e., quartet
topologies) for these subsets.
- The next stage combines the multiple quartet
topologies into a single phylogeny.
15- Given a set of quartet topologies Q, how do we
determine whether an evolutionary tree T is
good or bad?
16- Suppose we are given an evolutionary tree T and a
set of quartet topologies Q.
- We say that T satisfies a quartet topology t_q of
a quartet q if the quartet topology of q induced
by T is exactly t_q.
For example, T satisfies ab|dg, ce|fg,
ad|bc, etc.
17Score
- We denote by S, where S ⊆ Q, the set of quartet
topologies that are satisfied by T, and let U = Q
− S.
- We define the score of the evolutionary tree T as
follows.
18Score (contd.)
- The latter term was chosen because there are
three possible topologies for every quartet.
- Therefore this term equals the expected increase.
- In a variant of the same method, the latter term
is zeroed, so the quartet topologies which are
not satisfied by T do not contribute to the score.
19Score (contd.)
- It can be easily derived that is
an upper bound on the score of any evolutionary
tree T.
20Preliminaries for the dynamic programming
algorithm
- For technical reasons, the following discussion
deals with rooted evolutionary trees.
- For a node v, its left and right children are
denoted by v_l and v_r, respectively.
21Preliminaries for the dynamic programming
algorithm (contd.)
- Given a rooted evolutionary tree T and a node v
in it, we denote by T(v) the subtree of T rooted
at v.
[Figure: a rooted tree with nodes u, w, and v, with the subtree T(v) marked.]
22Preliminaries for the dynamic programming
algorithm (contd.)
- We denote by L(T) the set of leaves (i.e., taxa)
of the tree T.
[Figure: a rooted tree with nodes u, w, and v, with the leaf set L(T(v)) marked.]
23Preliminaries for the dynamic programming
algorithm (contd.)
- For a pair of nodes u, v, the least common
ancestor of u and v, lca(u, v), is defined as an
ancestor p of both u and v such that no node in
T(p) other than p is an ancestor of both u and v.
24Preliminaries for the dynamic programming
algorithm (contd.)
[Figure: two example trees on the leaves a, b, c, d, each marking the lca of a and c.]
25Preliminaries for the dynamic programming
algorithm (contd.)
- Definition: Given a quartet topology t = ab|cd
and an evolutionary tree T, the quartet least
common ancestor of t, qlca(t), is defined as a
node p that is the lca of two or more pairs of
elements from {a, b, c, d}, such that no node in T(p)
except p is the lca of two or more pairs of
elements from {a, b, c, d}.
26Preliminaries for the dynamic programming
algorithm (contd.)
[Figure: two example trees on the leaves a, b, c, d, each marking the qlca for ab|cd.]
27Another equivalent definition for the quartet
least common ancestor
- Definition: Given a quartet topology t = ab|cd
and an evolutionary tree T, the qlca of t is a
node p such that
- |L(T(p)) ∩ {a, b, c, d}| ≥ 3.
- For any child s of p, |L(T(s)) ∩ {a, b, c, d}| ≤ 2.
28Some observations
- Every quartet topology t has a unique qlca(t).
- Given a tree T and a quartet topology t, the
subtree rooted at qlca(t) determines whether t is
satisfied in the evolutionary tree T.
- Let t = ab|cd and v = qlca(t). We look at v_l,
v_r, T(v_l) and T(v_r).
- At least one of these subtrees contains exactly
two taxa e, f from {a, b, c, d}.
- Then t is satisfied iff the pair {e, f} is either
{a, b} or {c, d}.
29Some observations (contd.)
- Given a quartet topology t = ab|cd and an
evolutionary tree T, let v = qlca(t). Then T
satisfies t if and only if at least one of the
following holds:
- a, b ∈ L(T(s)),
- c, d ∈ L(T(s)),
- where s = v_l or s = v_r.
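The qlca and the satisfaction test can be sketched on a rooted binary tree. This is a minimal illustration under assumptions: a leaf is a string, an internal node is a pair (left, right), and the names `leaves`, `qlca`, and `satisfies` as well as the example tree are hypothetical. The search uses the equivalent definition of the qlca (at least 3 quartet taxa below the node, at most 2 below each child) and assumes the tree contains at least three of the four taxa.

```python
def leaves(tree):
    """Return L(T) for a tree given as nested pairs of leaf names."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return leaves(left) | leaves(right)


def qlca(tree, quartet):
    """Return the subtree rooted at the qlca of `quartet`.

    Descend while some child still holds at least 3 of the quartet taxa;
    the node where no child does is the qlca.
    """
    q = set(quartet)
    node = tree
    while not isinstance(node, str):
        heavy = [child for child in node if len(leaves(child) & q) >= 3]
        if not heavy:
            break
        node = heavy[0]
    return node


def satisfies(tree, ab, cd):
    """Check whether the tree satisfies the topology ab|cd."""
    v = qlca(tree, set(ab) | set(cd))
    # t is satisfied iff {a, b} or {c, d} lies inside T(v_l) or T(v_r).
    return any(set(pair) <= leaves(child) for pair in (ab, cd) for child in v)


# Hypothetical rooted example tree on the taxa a, b, c, d, e.
example_tree = ((("a", "b"), "c"), ("d", "e"))
```

In this example tree, the qlca of {a, b, c, d} is the node above a, b, and c, so ab|cd is satisfied while ac|bd is not; the qlca of {a, b, d, e} is the root.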
30The algorithm
- We denote by SAT_Q(T(v)) the set of quartet
topologies t ∈ Q such that t is satisfied by T,
and qlca(t) is a node in T(v).
- Let TOP_Q(T(v)) ⊆ SAT_Q(T(v)) be the set of quartet
topologies in Q that have v as their qlca and are
satisfied by T.
31The algorithm (contd.)
- For a set A ⊆ Q of quartet topologies, let
denote the sum of their weights.
- The score of the subtree T(v) (with respect to Q)
is defined as
32The algorithm (contd.)
- By the above equation, we have
33The algorithm (contd.)
- Let S be a set of three or more taxa.
- Denote by opt_score_Q(S) the maximum score with
respect to Q among all trees that have S as their
set of leaves.
- We denote by opt_tree_Q(S) a tree which attains
the maximum score.
34The algorithm (contd.)
- For every proper partition of S into two subsets
S1 and S2, let T(S1, S2) denote a tree whose left
subtree equals opt_tree_Q(S1) and whose right
subtree equals opt_tree_Q(S2).
- We then have
35The algorithm (contd.)
- This implies that
- By employing the dynamic programming paradigm, we
can avoid wasteful repetitions.
- To do this, we scan the subsets S ⊆ {1, 2, ..., n}
by increasing size of S.
36The algorithm (contd.)
- For simplicity, the details of implementing the
dynamic programming algorithm are omitted.
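One possible shape of such a dynamic program is sketched below. It is an illustration under stated assumptions, not the authors' implementation: taxa are numbered 0 to n − 1, subsets are stored as bitmasks, the scoring variant is the one in which unsatisfied quartet topologies contribute nothing, and the names `best_tree` and `top_weight` are hypothetical. Subsets are scanned by increasing size, and every proper partition (S1, S2) of each subset is tried once.

```python
from itertools import combinations


def best_tree(n, weights):
    """Return (score, tree) maximizing the total weight of satisfied quartets.

    weights: dict mapping ((a, b), (c, d)) to a positive weight, meaning
    the topology ab|cd on taxa numbered 0..n-1.
    """
    quartets = [(sum(1 << x for x in ab), sum(1 << x for x in cd), w)
                for (ab, cd), w in weights.items()]

    def top_weight(s1, s2):
        # Weight of the topologies whose qlca is the node splitting s1 | s2
        # and that are satisfied by that split.
        total = 0
        for ab, cd, w in quartets:
            q = ab | cd
            in1 = bin(s1 & q).count("1")
            in2 = bin(s2 & q).count("1")
            if in1 + in2 >= 3 and in1 <= 2 and in2 <= 2:
                if any((p & s) == p for p in (ab, cd) for s in (s1, s2)):
                    total += w
        return total

    score, tree = {}, {}
    for size in range(1, n + 1):          # scan subsets by increasing size
        for combo in combinations(range(n), size):
            s = sum(1 << x for x in combo)
            if size == 1:
                score[s], tree[s] = 0, combo[0]
                continue
            low = s & -s                  # lowest taxon always goes to s1
            rest = s ^ low
            best = None
            sub = rest
            while sub:                    # all proper submasks of rest
                sub = (sub - 1) & rest
                s1 = low | sub
                s2 = s ^ s1
                cand = score[s1] + score[s2] + top_weight(s1, s2)
                if best is None or cand > best:
                    best, tree[s] = cand, (tree[s1], tree[s2])
            score[s] = best
    full = (1 << n) - 1
    return score[full], tree[full]
```

For example, `best_tree(4, {((0, 1), (2, 3)): 5, ((0, 2), (1, 3)): 1})` returns a score of 5, since the two topologies are on the same quartet and only one of them can be satisfied. The sketch stores one score and one tree per subset, so its table has 2^n − 1 entries.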
37The space complexity and the time complexity
- The time complexity
- The space complexity
- Θ(2^n).
38Thank you.
39References
- [S92] M. Steel. The complexity of reconstructing
trees from qualitative characters and subtrees.
Journal of Classification, 9 (1992), pp. 91-116.