Title: Clustering of Leaf-labelled Trees on Free Leafset
1Clustering of Leaf-labelled Trees on Free Leafset
- Jakub Koperwas
- Krzysztof Walczak
- Institute of Computer Science
- Warsaw University of Technology
2Agenda
- Background
- Introduction to phylogenetic trees
- Consensus and distance methods
- Clustering of leaf-labelled trees
- Clustering motivation
- Clustering Quality Measure
- Clustering Approaches
- Free leafset extension
- Experiments and results
- Future work
3Phylogenetic Tree
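- Figure: a phylogenetic tree descending from a common ancestor, with leaves labelled by species and DNA sequence: cat (CACCTGT), dog (CAACTGT), mouse (CACCTAT), rat (CACTTGT), horse and cow (the sequence CACCTCT appears twice in the figure)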
4Tree Representation
- Splits (tree on leaves a, b, c, d, e, f): a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abcd|ef
- Clusters (tree on leaves a, b, c, d, e): abcde, ab, cde, cd, a, b, c, d, e
5Robinson Foulds Distance
- Splits for tree T1: a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abcd|ef
- Splits for tree T2: a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abce|fd
- Uncommon splits: abcd|ef, abce|fd
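The comparison above amounts to a symmetric difference of two split sets. The following is a minimal Python sketch, assuming each split is written with a `|` between its two sides. The helper names `parse_split`, `parse_tree`, and `robinson_foulds` are illustrative and do not come from the slides. The distance is taken here as the plain count of uncommon splits; some definitions halve that count.

```python
def parse_split(text):
    """Parse 'ab|cdef' into an unordered pair of leaf sets."""
    left, right = text.strip().split("|")
    return frozenset({frozenset(left), frozenset(right)})


def parse_tree(text):
    """Parse a comma-separated list of splits into a set of splits."""
    return frozenset(parse_split(s) for s in text.split(","))


def robinson_foulds(t1, t2):
    """Count the splits that occur in exactly one of the two trees."""
    return len(t1 ^ t2)


TRIVIAL = "a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde"
T1 = parse_tree(TRIVIAL + ", ab|cdef, abc|def, abcd|ef")
T2 = parse_tree(TRIVIAL + ", ab|cdef, abc|def, abce|fd")

print(robinson_foulds(T1, T2))  # 2: abcd|ef and abce|fd are the uncommon splits
```

Storing each split as an unordered pair of leaf sets means that `abce|fd` and `df|abce` compare equal, so the order in which the sides or the leaves are written does not affect the result.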
6Strict Consensus Tree
- Splits for tree T1: a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abcd|ef
- Splits for tree T2: a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abce|fd
- The common splits: a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def
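In split terms, the strict consensus tree is the intersection of the split sets of all trees in the profile. Here is a short Python sketch under the same assumption that splits are written with a `|` separator; `parse_tree`, `show`, and `strict_consensus` are hypothetical helper names.

```python
def parse_tree(text):
    """Parse 'ab|cdef, abc|def' into a set of unordered splits."""
    return frozenset(
        frozenset(frozenset(side) for side in s.strip().split("|"))
        for s in text.split(",")
    )


def show(split):
    """Render a split as text with sides and leaves in sorted order."""
    return "|".join(sorted("".join(sorted(side)) for side in split))


def strict_consensus(trees):
    """Keep only the splits present in every tree of the profile."""
    return frozenset.intersection(*trees)


TRIVIAL = "a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde"
T1 = parse_tree(TRIVIAL + ", ab|cdef, abc|def, abcd|ef")
T2 = parse_tree(TRIVIAL + ", ab|cdef, abc|def, abce|fd")

consensus = strict_consensus([T1, T2])
print(sorted(show(s) for s in consensus))
```

The result keeps the six single-leaf splits plus `ab|cdef` and `abc|def`; the two splits on which the trees disagree are dropped.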
7Clustering motivation
- Phylogenetic tree reconstruction methods may produce many candidate trees
- It is hard to apply consensus methods to obtain a single tree from a profile of hundreds of trees
- Clustering helps to designate a small number of candidate trees from a large number of trees
8Representative Tree
- Representative tree: a tree that captures the knowledge common to all trees in a cluster
- Strict Consensus Tree
- Majority-rule Consensus Tree
- Other
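The two named representatives differ only in how many trees must contain a split for it to be kept. The Python sketch below contrasts them on a made-up profile of three trees over leaves a to e. The profile is hypothetical, only non-trivial splits are listed, and the `|` separator and function names are assumptions for illustration.

```python
from collections import Counter


def parse_tree(text):
    """Parse 'ab|cde, abc|de' into a set of unordered splits."""
    return frozenset(
        frozenset(frozenset(side) for side in s.strip().split("|"))
        for s in text.split(",")
    )


def strict_consensus(trees):
    """Splits present in every tree."""
    return frozenset.intersection(*trees)


def majority_rule(trees):
    """Splits present in more than half of the trees."""
    counts = Counter(split for tree in trees for split in tree)
    return frozenset(s for s, n in counts.items() if 2 * n > len(trees))


# Hypothetical profile, not taken from the slides.
profile = [
    parse_tree("ab|cde, abc|de"),
    parse_tree("ab|cde, abc|de"),
    parse_tree("ab|cde, abd|ce"),
]

print(len(strict_consensus(profile)), len(majority_rule(profile)))
```

On this profile the strict consensus retains only `ab|cde`, while the majority-rule consensus also retains `abc|de`, which two of the three trees contain.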
9Information in Tree
- Figure: several example trees over the leaves a, b, c, d, e
10Information Loss
- Cluster Information Loss: the amount of information lost when a cluster of trees is replaced with one representative tree
- Clustering Information Loss: the amount of information lost when the input profile of trees is replaced with k representative trees
11Information Loss - Example
12K-best problem
- The K-best problem: find a partition of the dataset into k clusters (where k is a given value) such that the partition maximizes Information Gain with respect to a given type of representative tree
- Proposition
- K-means algorithm for the majority-rule consensus tree as representative tree
- Agg-inf algorithm for the strict consensus tree as representative tree
13K-means Approach (Majority-rule CT)
- The majority-rule consensus tree is a center tree and can therefore be used as a centroid
- The objective of k-means for trees is identical to the k-best objective if the majority-rule consensus tree is chosen as the representative tree
- Conclusion: K-means is a good candidate when a majority-rule consensus tree is used as the representative tree
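One way to picture this approach is a k-means-style loop in which trees are assigned to the nearest centroid by Robinson-Foulds distance and each centroid is then recomputed as the majority-rule consensus of its cluster. The Python sketch below is an illustration under stated assumptions, not the authors' implementation: the initialisation (first k trees), the iteration cap, the `|` split notation, and the four example trees are all hypothetical.

```python
from collections import Counter


def parse_tree(text):
    """Parse 'ab|cde, abc|de' into a set of unordered splits."""
    return frozenset(
        frozenset(frozenset(side) for side in s.strip().split("|"))
        for s in text.split(",")
    )


def rf(t1, t2):
    """Robinson-Foulds distance as the count of uncommon splits."""
    return len(t1 ^ t2)


def majority_rule(trees):
    """Splits present in more than half of the trees."""
    counts = Counter(split for tree in trees for split in tree)
    return frozenset(s for s, n in counts.items() if 2 * n > len(trees))


def kmeans_trees(trees, k, max_iter=20):
    """Assumed k-means loop: the first k trees seed the centroids."""
    centroids = list(trees[:k])
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for tree in trees:
            nearest = min(range(k), key=lambda j: rf(tree, centroids[j]))
            clusters[nearest].append(tree)
        # An empty cluster keeps its previous centroid.
        updated = [majority_rule(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return clusters, centroids


# Hypothetical trees over leaves a..e (non-trivial splits only).
A1 = parse_tree("ab|cde, abc|de")
A2 = parse_tree("ab|cde, abd|ce")
B1 = parse_tree("ae|bcd, ade|bc")
B2 = parse_tree("ae|bcd, ace|bd")

clusters, centroids = kmeans_trees([A1, B1, A2, B2], k=2)
print([len(c) for c in clusters])
```

Seeding with the first k trees keeps the example deterministic; a practical run would more likely use random restarts, since k-means is sensitive to initialisation.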
14Agglomerative approach (Strict CT)
- Typical Merging Strategies
- Single linkage
- Complete linkage
- Average Linkage
- Our Merging Strategy: minimize information loss after merging
- For Strict Consensus Tree as Representative Tree
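The merging strategy can be sketched as a greedy agglomerative loop that always merges the pair of clusters whose union adds the least information loss. The slide text does not define the loss measure, so the Python sketch below substitutes a stand-in: the loss of a cluster is assumed to be the number of split occurrences in its trees that are missing from its strict consensus tree. That measure, the function names, the `|` notation, and the example trees are all hypothetical.

```python
from itertools import combinations


def parse_tree(text):
    """Parse 'ab|cde, abc|de' into a set of unordered splits."""
    return frozenset(
        frozenset(frozenset(side) for side in s.strip().split("|"))
        for s in text.split(",")
    )


def loss(cluster):
    """Assumed loss: split occurrences absent from the strict consensus."""
    consensus = frozenset.intersection(*cluster)
    return sum(len(tree - consensus) for tree in cluster)


def agglomerate(trees, k):
    """Merge clusters greedily until only k remain."""
    clusters = [[t] for t in trees]
    while len(clusters) > k:
        def added_loss(pair):
            a, b = clusters[pair[0]], clusters[pair[1]]
            return loss(a + b) - loss(a) - loss(b)

        i, j = min(combinations(range(len(clusters)), 2), key=added_loss)
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical trees over leaves a..e (non-trivial splits only).
A1 = parse_tree("ab|cde, abc|de")
A2 = parse_tree("ab|cde, abd|ce")
B1 = parse_tree("ae|bcd, ade|bc")
B2 = parse_tree("ae|bcd, ace|bd")

result = agglomerate([A1, B1, A2, B2], k=2)
print([len(c) for c in result])
```

Unlike single, complete, or average linkage, the merge criterion here depends on the representative tree of the merged cluster rather than on pairwise distances between trees.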
15Free leafset extension
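- Figure: two trees with different leafsets, T1 on leaves a, b, c, d, e and T2 on leaves a, b, c, d, f
- T1: a|bcde, b|acde, c|abde, d|abce, ac|bde, abc|de
- T2: a|bcdf, b|acdf, c|abdf, d|abcf, ac|bdf, abc|df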
- No two splits can ever be equal
- Consensus methods always return the empty set
- Distance is always the total number of splits in all trees
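The effect can be checked directly on split sets: when two trees have different leafsets, every split partitions a different set of leaves, so no split of one tree can equal a split of the other. The Python sketch below uses the split lists of T1 and T2; the position of the `|` separator in each split and the helper name `parse_tree` are assumptions.

```python
def parse_tree(text):
    """Parse 'ac|bde, abc|de' into a set of unordered splits."""
    return frozenset(
        frozenset(frozenset(side) for side in s.strip().split("|"))
        for s in text.split(",")
    )


T1 = parse_tree("a|bcde, b|acde, c|abde, d|abce, ac|bde, abc|de")
T2 = parse_tree("a|bcdf, b|acdf, c|abdf, d|abcf, ac|bdf, abc|df")

common = T1 & T2          # what a strict consensus would keep
distance = len(T1 ^ T2)   # count of uncommon splits

print(len(common), distance, len(T1) + len(T2))
```

The set of common splits is empty and the distance equals the combined number of splits, even though the two trees agree completely on the leaves a, b, c, d.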
16Z-restriction
abcdefgy
abcdefgx
abcdefg
17Z-restricted consensus tree
Z = {a, b, c, d}
18Z-restricted distance
dRF(T1, T2) = 15
dRF restricted to Z = {a, b, c, d, e}: 2
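A Z-restricted distance can be sketched by intersecting both sides of every split with Z, discarding splits that lose a side, and then comparing the restricted split sets. The Python example below is a sketch under assumptions: the two trees are small illustrative ones over leafsets abcde and abcdf rather than the trees behind the values 15 and 2 above, and `restrict`, `parse_tree`, and the `|` notation are hypothetical.

```python
def parse_tree(text):
    """Parse 'ac|bde, abc|de' into a set of unordered splits."""
    return frozenset(
        frozenset(frozenset(side) for side in s.strip().split("|"))
        for s in text.split(",")
    )


def restrict(tree, z):
    """Restrict every split to the leaves in z; drop splits with an empty side."""
    z = frozenset(z)
    restricted = set()
    for split in tree:
        sides = frozenset(side & z for side in split)
        if len(sides) == 2 and all(sides):
            restricted.add(sides)
    return frozenset(restricted)


def rf(t1, t2):
    """Robinson-Foulds distance as the count of uncommon splits."""
    return len(t1 ^ t2)


# Illustrative trees on different leafsets.
T1 = parse_tree("a|bcde, b|acde, c|abde, d|abce, ac|bde, abc|de")
T2 = parse_tree("a|bcdf, b|acdf, c|abdf, d|abcf, ac|bdf, abc|df")

Z = "abcd"
print(rf(T1, T2), rf(restrict(T1, Z), restrict(T2, Z)))
```

After restriction, distinct splits can collapse into one (here `abc|de` and `d|abce` both become `abc|d`), which is why the restricted trees are stored as sets.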
19Z-restriction in Clustering
- 1. The Z-restricted majority-rule consensus tree is a middle tree
- K-means can be used
- 2.
- Agg-inf improvement can be used
20Pros and cons of z-restriction
- Pros
- Simple and efficient
- Nice Mathematical Features
- Cons
- Arbitrary Information discarding
- Hard to choose the z parameter
21Results Strict consensus (same leafset)
22Results (free leafset)
23Future Work
- Extending the presented methods with a frequent subsplits approach
- Developing an algorithm for a general representative tree
- Incorporating more biologically significant features into the clustering objective function
- Experiments on datasets from other disciplines, such as linguistics
24Thank You