Title: Introduction to phylogeny
1Supertrees: Algorithms and Databases
Roderic Page, University of Glasgow, r.page_at_bio.gla.ac.uk
DIMACS Working Group Meeting on Mathematical and Computational Aspects Related
to the Study of The Tree of Life
2What do we mean by the Tree of Life?
Our perception of what the tree is may affect
what we view as being the interesting problems:
Supertrees, datatypes, databases, taxonomy
or
Tree algorithms, models, genomics, lateral gene transfer
3Topics
- Supertrees (MinCut)
- Phylogenetic databases
4Tree terminology
[Figure: rooted tree (((a,b),c),d) with its parts labelled: a leaf, an edge, an internal node, the clusters {a,b} and {a,b,c}, and the root cluster {a,b,c,d}]
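The terminology above can be made concrete with a minimal sketch. It assumes a rooted tree is written as nested Python tuples with string leaf labels; the function name `clusters` and this representation are illustrative choices, not part of the talk.

```python
def clusters(tree):
    """Return (leaf set, list of clusters) for a nested-tuple rooted tree.

    A cluster is the set of leaves below one internal node; the last
    cluster in the list belongs to the root.
    """
    if isinstance(tree, str):              # a leaf is just its label
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clusters = clusters(child)
        leaves |= child_leaves
        found.extend(child_clusters)
    found.append(leaves)                   # cluster of this internal node
    return leaves, found


# Hypothetical encoding of the four-leaf tree used on the slide.
tree = ((("a", "b"), "c"), "d")
_, tree_clusters = clusters(tree)
for cluster in tree_clusters:
    print(sorted(cluster))
```

Each internal node contributes exactly one cluster, so the tree and its set of clusters carry the same information.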
5Nestings and triplets
[Figure: the rooted tree (((a,b),c),d)]
Nestings
{a,b} <_T {a,b,c,d}
{b,c} <_T {a,b,c,d}
Triplets
(bc)d
bc|d
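A small sketch of both notions, assuming the usual definitions: A nests in B when the most recent common ancestor (MRCA) of A is a proper descendant of the MRCA of B, and a triplet xy|z is a nesting of {x,y} in {x,y,z}. The nested-tuple tree encoding and all function names are assumptions made for illustration.

```python
from itertools import combinations


def leaf_set(tree):
    """Set of leaf labels below a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def mrca_cluster(tree, group):
    """Leaf set of the smallest subtree containing every label in group."""
    group = frozenset(group)
    if not isinstance(tree, str):
        for child in tree:
            if group <= leaf_set(child):
                return mrca_cluster(child, group)
    return leaf_set(tree)


def nests_in(tree, a, b):
    """True if A <_T B, i.e. MRCA(A) lies strictly below MRCA(B)."""
    return mrca_cluster(tree, a) < mrca_cluster(tree, b)


def triplets(tree):
    """All rooted triplets xy|z displayed by the tree, as (frozenset({x, y}), z)."""
    result = set()
    for x, y, z in combinations(sorted(leaf_set(tree)), 3):
        for pair, out in (((x, y), z), ((x, z), y), ((y, z), x)):
            if nests_in(tree, pair, (x, y, z)):
                result.add((frozenset(pair), out))
    return result


tree = ((("a", "b"), "c"), "d")
print(nests_in(tree, {"b", "c"}, {"a", "b", "c", "d"}))
print((frozenset("bc"), "d") in triplets(tree))
```

The proper-subset test on leaf sets stands in for "proper descendant", which is valid as long as no internal node has a single child.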
6Supertree
[Figure: input trees T1 (leaves a, b, c) and T2 (leaves b, c, d) and a supertree on leaves a, b, c, d]
7Some desirable properties of a supertree
method (Steel et al., 2000)
- The supertree can be computed in polynomial time
- A grouping in one or more trees that is not
contradicted by any other tree occurs in the
supertree
8Aho et al.'s algorithm (OneTree)
- Aho, A. V., Sagiv, Y., Szymanski, T. G., and
Ullman, J. D. 1981. Inferring a tree from lowest
common ancestors with an application to the
optimization of relational expressions. SIAM J.
Comput. 10: 405-421.
- Input: a set of rooted trees
- 1. If the set is compatible (i.e., the trees agree on a
tree), output that tree.
- 2. If the set is not compatible, stop!
9Aho et al.'s OneTree algorithm
[Figure: input trees T1 (leaves a, b, c) and T2 (leaves b, c, d) and the supertree that OneTree builds from them]
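The OneTree idea can be sketched on rooted triplets, a standard way to present Aho et al.'s algorithm: join x and y whenever some triplet xy|z lies inside the current leaf set, recurse on the connected components, and give up if the graph is connected. The function names and the two example triplets are hypothetical; the shapes of T1 and T2 in the figure are not recoverable from this transcript.

```python
def components(nodes, edges):
    """Connected components of an undirected graph, by depth-first search."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def one_tree(leaves, triplets):
    """Return a nested-tuple tree displaying every triplet, or None.

    Each triplet (x, y, z) stands for xy|z.
    """
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    if len(leaves) == 2:
        return tuple(sorted(leaves))
    relevant = [t for t in triplets if set(t) <= leaves]
    comps = components(leaves, [(x, y) for x, y, _ in relevant])
    if len(comps) == 1:                    # connected graph: incompatible input
        return None
    subtrees = []
    for comp in comps:
        sub = one_tree(comp, relevant)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)


# Assumed example: one tree contributes ab|c, the other bc|d.
print(one_tree("abcd", [("a", "b", "c"), ("b", "c", "d")]))
# Conflicting input: ab|c against bc|a.
print(one_tree("abc", [("a", "b", "c"), ("b", "c", "a")]))
```

The second call returns `None`, which corresponds to the "stop!" case on the slide.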
10Mincut supertrees
- Semple, C., and Steel, M. 2000. A supertree
method for rooted trees. Discrete Appl. Math.
105: 147-158.
- Modifies OneTree by cutting graph
- Requires rooted trees (no analogue of OneTree for
unrooted trees)
- Recursive
- Polynomial time
11Semple and Steel (2000)
[Figure: two input trees T1 and T2, one on leaves a, b, c, d, e and one on leaves a, b, c, d, and the graph S_{T1,T2} built from them]
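A minimal sketch of how such a graph can be built, assuming the usual construction: the vertices are the leaves, two leaves are joined when they share a proper cluster in some input tree, and the edge weight counts the trees in which they do. The two trees below are hypothetical stand-ins, since the shapes in the figure are not recoverable from this transcript, and the function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def leaf_set(tree):
    """Set of leaf labels below a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def proper_cluster_graph(trees):
    """Edge weights of the graph S: weight = number of trees nesting the pair."""
    weights = Counter()
    for tree in trees:
        # Two leaves share a proper cluster exactly when they sit below
        # the same child of the root.
        for child in tree:
            for a, b in combinations(sorted(leaf_set(child)), 2):
                weights[(a, b)] += 1
    return weights


# Hypothetical input trees on {a, b, c, d, e} and {a, b, c, d}.
t1 = ((("a", "b"), "c"), ("d", "e"))
t2 = (("a", "b"), ("c", "d"))
print(dict(proper_cluster_graph([t1, t2])))
```

With these assumed trees only the edge {a, b} is supported by both inputs, so it is the only edge of weight 2.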
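12Collapsing the graph (Semple and Steel mincut algorithm)
[Figure: the graph S_{T1,T2} with edge weights 1 and 2, the edge {a,b} annotated "This edge has maximum weight", and the collapsed graph S_{T1,T2}/E^max_{T1,T2} in which a and b are merged into a single vertex a,b]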
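The collapsing step can be sketched as follows. It assumes unit tree weights, so that a maximum-weight edge is one whose weight equals the number of input trees k, and it assumes that parallel edges created by merging have their weights added. The function name and the example weights are hypothetical.

```python
from collections import Counter


def collapse_max_edges(weights, k):
    """Contract every edge of weight k; return (vertex labels, merged weights)."""
    parent = {}

    def find(v):
        # Representative of the merged group containing v.
        parent.setdefault(v, v)
        while parent[v] != v:
            v = parent[v]
        return v

    for (a, b), w in weights.items():
        root_a, root_b = find(a), find(b)
        if w == k and root_a != root_b:
            parent[root_a] = root_b

    members = {}
    for v in list(parent):
        members.setdefault(find(v), []).append(v)
    label = {v: ",".join(sorted(members[find(v)])) for v in parent}

    merged = Counter()
    for (a, b), w in weights.items():
        la, lb = label[a], label[b]
        if la != lb:                       # edges inside a merged vertex vanish
            merged[tuple(sorted((la, lb)))] += w
    return label, merged


# Assumed weights for two input trees (k = 2); only {a, b} has weight 2.
weights = {("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 1,
           ("c", "d"): 1, ("d", "e"): 1}
label, merged = collapse_max_edges(weights, k=2)
print(label["a"], dict(merged))
```

Merging protects the pairs that every tree agrees on, because an edge inside a merged vertex can no longer be cut.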
13Cut the graph to get supertree
[Figure: the collapsed graph S_{T1,T2}/E^max_{T1,T2} with its minimum-weight edges cut, and the resulting supertree on leaves a, b, c, d, e]
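To illustrate the cutting step, the sketch below finds every edge that lies in at least one minimum-weight cut of a small connected graph. It is a brute-force enumeration over bipartitions, chosen only for clarity: it is exponential in the number of vertices and is not the polynomial-time method of Semple and Steel or of the implementation described in this talk. The example graph is hypothetical.

```python
from itertools import combinations


def min_cut_edges(vertices, weights):
    """Return (min cut weight, edges that occur in some minimum cut)."""
    vertices = sorted(vertices)
    first, rest = vertices[0], vertices[1:]
    best, cut_edges = None, set()
    # Fix `first` on one side so each bipartition is visited once; that
    # side never takes all vertices, so both sides stay non-empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            side = {first, *extra}
            crossing = {e for e in weights if (e[0] in side) != (e[1] in side)}
            total = sum(weights[e] for e in crossing)
            if best is None or total < best:
                best, cut_edges = total, set(crossing)
            elif total == best:
                cut_edges |= crossing
    return best, cut_edges


# Assumed collapsed graph: vertex "a,b" stands for the merged pair a and b.
collapsed = {("a,b", "c"): 2, ("c", "d"): 1, ("d", "e"): 1}
best, cut = min_cut_edges({"a,b", "c", "d", "e"}, collapsed)
print(best, sorted(cut))
```

Removing the returned edges leaves the components {a, b, c}, {d} and {e}, which become the subtrees below the root before the method recurses on each of them.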
14My mincut supertree implementation
darwin.zoology.gla.ac.uk/rpage/supertree
- Written in C
- Uses GTL (Graph Template Library) to handle
graphs (formerly a free alternative to LEDA)
- Finds all mincuts of a graph faster than Semple
and Steel's algorithm
15A counter example: two input trees...
[Figure: two input trees that share the leaves a, b, and c; the remaining leaves are x1, x2, x3 and y1, y2, y3, y4]
16Mincut gives this (strange) result
- Disputed relationships among a, b, and c are
resolved
- x1, x2, and x3 collapsed into polytomy
[Figure: mincut supertree on leaves a, b, c, x1, x2, x3, y1, y2, y3, y4]
17Problem: Cuts depend on connectivity (in this
example it is a function of tree size)
[Figure: graph on the vertices a, b, c, x1, x2, x3, y1, y2, y3, y4]
18So, mincut doesn't work
- But, Semple and Steel said it did
- My program seems to work
- Argh!!! What is happening?
19What mincut does and does not do
- Mincut supertree is guaranteed to include any
nesting which occurs in all input trees
- Makes no claims about nestings which occur in
only some of the trees
- Does exactly what it says on the tin
20Modifying mincut supertree
- Can we incorporate more of the information in the
input trees?
- Three categories of information
- Unanimous (all trees have that grouping)
- Contradicted (trees explicitly disagree)
- Uncontradicted (some trees have information that
no other tree disagrees with)
21Uncontradicted information (assume we have k input
trees)
For a pair of leaves a and b:
c = number of trees in which a and b co-occur
n = number of trees in which a and b are nested
c - n = 0 ⇒ uncontradicted (if c = k then
unanimous)
c - n > 0 ⇒ contradicted
22Uncontradicted information (assume we have k input
trees)
For a pair of leaves a and b:
f = number of trees in which a and b are in a fan
c = number of trees in which a and b co-occur
n = number of trees in which a and b are nested
c - n - f = 0 ⇒ uncontradicted (if c = k then
unanimous)
c - n - f > 0 ⇒ contradicted
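The rule above translates directly into a small function. This is a sketch that takes the counts c, n and f as given; how they are tallied from the input trees is not spelled out in this transcript, so that step is left out. The function name is illustrative.

```python
def classify_pair(c, n, f, k):
    """Classify a pair of leaves from its per-tree counts.

    c: trees containing both leaves, n: trees nesting them,
    f: trees placing them in a fan, k: number of input trees.
    """
    if c - n - f > 0:
        return "contradicted"
    return "unanimous" if c == k else "uncontradicted"


print(classify_pair(c=2, n=2, f=0, k=2))   # nested in both trees
print(classify_pair(c=1, n=1, f=0, k=2))   # nested in the only tree that has both
print(classify_pair(c=2, n=1, f=0, k=2))   # one tree fails to nest the pair
```

Setting f = 0 reproduces the simpler two-count rule in which only co-occurrence and nesting are tallied.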
23Classifying edges
[Figure: the graph S_{T1,T2} on the vertices a, b, c, x1, x2, x3, y1, y2, y3, y4, with each edge drawn as one of:]
Uncontradicted
Uncontradicted but adjacent to contradicted
Contradicted
24Modified mincut
- Species a, b, and c form a polytomy
- x1, x2, and x3 resolved as per the input tree
[Figure: modified mincut supertree on leaves a, b, c, x1, x2, x3, y1, y2, y3, y4]
25If no tree contradicts an item of information, is
that information always in the supertree?
(23)5
(12)5
(45)1
(34)1
26No! (Steel, Dress, Böcker 2000)
- The four trees display (12)5, (23)5, (34)1, and
(45)1
- No tree displays (IK)J or (JK)I for any (IJ)K
above
- Triplets are uncontradicted, but cannot form a
tree
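Both claims can be checked mechanically. The sketch below first confirms that none of the four triplets is contradicted by another, then applies the test from Aho et al.'s algorithm: join I and J for every triplet (IJ)K, and if the resulting graph on all leaves is connected, no tree displays every triplet. The function names are illustrative.

```python
def is_contradicted(triplet, triplets):
    """True if some other triplet resolves the same three leaves differently."""
    x, y, z = triplet
    shown = {(frozenset((a, b)), c) for a, b, c in triplets}
    return (frozenset((x, z)), y) in shown or (frozenset((y, z)), x) in shown


def aho_graph_is_connected(leaves, triplets):
    """Join x and y for each triplet xy|z and test connectivity."""
    adj = {v: set() for v in leaves}
    for x, y, _ in triplets:
        adj[x].add(y)
        adj[y].add(x)
    seen, stack = set(), [next(iter(leaves))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adj[v] - seen)
    return seen == set(leaves)


# (12)5, (23)5, (34)1, (45)1 written as (x, y, z) for xy|z.
triplets = [(1, 2, 5), (2, 3, 5), (3, 4, 1), (4, 5, 1)]
leaves = {1, 2, 3, 4, 5}
print(any(is_contradicted(t, triplets) for t in triplets))
print(aho_graph_is_connected(leaves, triplets))
```

The edges 1-2, 2-3, 3-4 and 4-5 form a path through all five leaves, so the graph is connected and the triplets are jointly incompatible even though no pair of them conflicts.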
27Future directions for supertrees
- Improve handling of uncontradicted information
- Add support for constraints
- Visualising very big trees
- Better integration into phylogeny
databases (www.treebase.org)
- darwin.zoology.gla.ac.uk/rpage/supertree
28Supertree Challenge (proposed by Mike Sanderson,
mjsanderson_at_ucdavis.edu)
The TreeBASE database currently contains over
1000 phylogenies with over 11,000 taxa among
them. Many of these trees share taxa with each
other and are therefore candidates for the
construction of composite phylogenies, or
"supertrees", by various algorithms. A
challenging problem is the construction of the
largest and "best" supertree possible from this
database. "Largest" and "best" may represent
conflicting goals, however, because resolution of
a supertree can be easily diminished by addition
of "inappropriate" trees or taxa.
29It's a scandal
- We cannot answer even the most basic question:
what is the phylogeny for group x?
- GenBank is currently the best phylogenetic
database (!)
- Can't even say how many species are in a given
group
- Little idea of who is doing what
31Tree of Life (tolweb.org)
- Provides text and images
- Relies on extensive manual effort (e.g., writing
text)
- Can't do any computations with it
- Limited research value
32TreeBASE (www.treebase.org)
- Relational database
- Query by author, taxon, study number
- Compute supertrees
- Submit NEXUS data files
33TreeBASE
34TreeBASE and mincut supertrees
- User selects two or more trees
- Clicks on button
- and script on darwin.zoology.gla.ac.uk is
run to create supertree
- Can view as PS, PDF, treefile, or in Java applet
(ATV)
35What's wrong with TreeBASE?
- No consistency of taxon names
(e.g., Human, Homo sapiens,
Homo sapiens X54666-1)
- No consistency of data names (e.g., gene names,
morphological characters, etc.)
36The same organism may have multiple names
37www.all-species.org
The ALL Species Foundation is a non-profit
organization dedicated to the complete inventory
of all species of life on Earth within the next
25 years - a human generation.
Press Release November 13, 2002
Starting December 1, the ALL Species Foundation
will close its San Francisco office because of a
lack of funding for the Foundation.
38The first challenge
- We need a taxonomic name server that can resolve
the name of any organism
- This server needs to reconcile multiple
classifications (e.g., GenBank, ITIS, etc.)
- Must handle at least 1 million names, perhaps 100
million
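As a toy illustration of the lookup such a server would perform, the sketch below maps every known spelling of a name to one accepted name. It is an in-memory dictionary only; the functions `build_name_index` and `resolve` are hypothetical, and reconciling whole classifications or scaling to millions of names is well beyond it. The example names are the ones quoted earlier for TreeBASE.

```python
def build_name_index(records):
    """Index (accepted name, [synonyms]) records by normalised spelling."""
    index = {}
    for accepted, synonyms in records:
        for name in [accepted, *synonyms]:
            index[name.strip().lower()] = accepted
    return index


def resolve(index, name):
    """Return the accepted name for a query, or None if it is unknown."""
    return index.get(name.strip().lower())


records = [("Homo sapiens", ["Human", "Homo sapiens X54666-1"])]
index = build_name_index(records)
print(resolve(index, "human"))
print(resolve(index, "Pan troglodytes"))
```

Normalising case and surrounding whitespace is the only cleaning done here; real taxon names would also need handling of authorship strings, misspellings and homonyms.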
39Second Challenge
- How do we query trees?
- Trees can be classifications or phylogenies
40SQL Queries on Trees
- Oracle SQL Transitive Closure Query (recursion)
- Nested queries
- Node path queries
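A recursive (transitive closure) query can be sketched with a parent-pointer table. The example uses SQLite's `WITH RECURSIVE` through Python's `sqlite3` module as a stand-in, not the Oracle syntax the slide refers to; the table `nodes`, its columns and the node names are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id TEXT PRIMARY KEY, parent TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [("root", None), ("n1", "root"), ("n2", "n1"),
     ("A", "n2"), ("B", "n1"), ("C", "root")],
)

# Start from the parent of the query node, then keep following parent
# pointers until the root (whose parent is NULL) is reached.
ANCESTORS = """
WITH RECURSIVE anc(id) AS (
    SELECT parent FROM nodes WHERE id = ? AND parent IS NOT NULL
    UNION ALL
    SELECT nodes.parent FROM nodes JOIN anc ON nodes.id = anc.id
    WHERE nodes.parent IS NOT NULL
)
SELECT id FROM anc
"""

ancestors_of_a = [row[0] for row in conn.execute(ANCESTORS, ("A",))]
print(ancestors_of_a)
```

Each recursion step costs one join, so the work grows with the depth of the node, which is one motivation for path-based encodings of trees.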
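411. All ancestors of node A
[Figure: tree with node A marked]
422. Least Common Ancestor (LCA) of A and B
[Figure: tree with nodes A and B marked]
433. Spanning Clade of A and B
[Figure: tree with nodes A and B marked]
444. Path Length from A to B
[Figure: tree with nodes A and B marked; path length 5]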
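The four queries can be sketched in memory on a tree stored as a child-to-parent map. The map `PARENT`, the node names and the function names are hypothetical, and "spanning clade" is taken here to mean all nodes in the subtree rooted at the LCA.

```python
# Hypothetical tree: root -> (n1, C), n1 -> (n2, B), n2 -> (A, D).
PARENT = {"n1": "root", "C": "root", "n2": "n1", "B": "n1",
          "A": "n2", "D": "n2"}


def ancestors(node, parent):
    """Ancestors of node, nearest first."""
    path = []
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(a, b, parent):
    """Least common ancestor of a and b (a node counts as its own ancestor)."""
    seen = {a, *ancestors(a, parent)}
    for node in [b, *ancestors(b, parent)]:
        if node in seen:
            return node
    return None


def spanning_clade(a, b, parent):
    """Every node in the subtree rooted at the LCA of a and b."""
    top = lca(a, b, parent)
    nodes = set(parent) | set(parent.values())
    return {v for v in nodes if v == top or top in ancestors(v, parent)}


def path_length(a, b, parent):
    """Number of edges on the path from a up to the LCA and down to b."""
    top = lca(a, b, parent)

    def steps_up(v):
        return [v, *ancestors(v, parent)].index(top)

    return steps_up(a) + steps_up(b)


print(ancestors("A", PARENT))
print(lca("A", "B", PARENT))
print(sorted(spanning_clade("A", "B", PARENT)))
print(path_length("A", "B", PARENT))
```

All four operations reduce to walking parent pointers, which is why a database needs either recursion or a precomputed encoding to answer them.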
46Node paths
/1/1/2
/1/2/2
/1/2/1
/1/1/1/2
/2
/1/1/1/1
/1/2
/1/1/1
/1/1
/1
47Node paths - selecting subtree
[Figure: the node-path tree from slide 46]
SELECT node WHERE (path LIKE '/1/1/%') AND
(path < '/1/10/')
48Node paths - selecting subtree
[Figure: the node-path tree from slide 46]
SELECT node WHERE (path LIKE '/1/1/%') AND
(path < '/1/10/') AND (num_children = 0)
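The two node-path selections can be tried out on the ten paths listed on these slides. This sketch uses SQLite through Python's `sqlite3` module; the table name `nodes`, its columns and the way `num_children` is filled in are assumptions, since the slides give no schema.

```python
import sqlite3

PATHS = ["/1", "/2", "/1/1", "/1/2", "/1/1/1", "/1/1/2",
         "/1/2/1", "/1/2/2", "/1/1/1/1", "/1/1/1/2"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (path TEXT PRIMARY KEY, num_children INTEGER)")
for path in PATHS:
    # A child's path is its parent's path plus one more "/label" component.
    children = sum(1 for other in PATHS if other.rsplit("/", 1)[0] == path)
    conn.execute("INSERT INTO nodes VALUES (?, ?)", (path, children))

# Everything below /1/1: a prefix match on the path string.
subtree = [row[0] for row in conn.execute(
    "SELECT path FROM nodes WHERE path LIKE ? AND path < ? ORDER BY path",
    ("/1/1/%", "/1/10/"))]

# Only the leaves below /1/1.
leaves = [row[0] for row in conn.execute(
    "SELECT path FROM nodes WHERE path LIKE ? AND path < ? "
    "AND num_children = 0 ORDER BY path",
    ("/1/1/%", "/1/10/"))]

print(subtree)
print(leaves)
```

The trailing slash in the pattern keeps a sibling such as /1/10 out of the result and also excludes the subtree root /1/1 itself; no recursion is needed.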
49Node paths - LCA
[Figure: the node-path tree from slide 46]
Common substring starting from left (i.e., the
longest common prefix of the two node paths)
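A minimal sketch of the prefix rule, comparing whole path components rather than single characters so that /1/1 and /1/10 are not mistaken for sharing the prefix /1/1. The function name `path_lca` and the convention that the root is written "/" are assumptions.

```python
def path_lca(p, q):
    """LCA of two nodes given as slash-separated node paths."""
    common = []
    for x, y in zip(p.split("/"), q.split("/")):
        if x != y:
            break
        common.append(x)
    # Both paths start with "/", so `common` always holds the empty first
    # component; if nothing else matches, the LCA is the root.
    return "/".join(common) or "/"


print(path_lca("/1/1/1/2", "/1/1/2"))
print(path_lca("/1/2/1", "/2"))
print(path_lca("/1/1", "/1/10"))
```

No table lookup or recursion is involved: the LCA is computed from the two path strings alone.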
50What do we do now?
- Set up a taxonomic name server (TNS)
- Develop a phylogenetic database linked to
TNS, PubMed, GenBank, etc.
- Develop easy ways to populate database (e.g.,
from TreeBASE, GenBank, journal databases)
- Develop standard set of tree queries
- Deploy