SuperFine, a new supertree method

1
SuperFine, a new supertree method
  • Shel Swenson
  • September 17th 2009

2
Reconstructing the Tree of Life
  • Tree of Life challenges: millions of species,
    lots of missing data
  • Two possible approaches: combined analysis,
    supertree methods
3
Two competing approaches
[Figure; label: Species]
4
Combined Analysis Methods
5
Combined Analysis
Data matrix over gene 1, gene 2, and gene 3 (? = missing data):
S1  TCTAATGGAA  TATTGATACA  ??????????
S2  GCTAAGGGAA  ??????????  ??????????
S3  TCTAAGGGAA  TCTTGATACC  ??????????
S4  TCTAACGGAA  GGTAACCCTC  TAGTGATGCA
S5  GCTAAACCTC  ??????????  ??????????
S6  GGTGACCATC  ??????????  ??????????
S7  TCTAATGGAC  GCTAAACCTC  TAGTGATGCA
S8  TATAACGGAA  CATTCATACC  ??????????
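
A minimal sketch of how a combined-analysis matrix of this kind could be assembled: per-gene alignments are concatenated, and a species that lacks a gene is padded with "?". The assignment of each sequence block to gene 1, 2, or 3 below is an assumption (the transcript does not preserve column positions), and `concatenate` is a hypothetical helper, not a function from any phylogenetics package.

```python
def concatenate(gene_alignments, missing="?"):
    """Concatenate per-gene alignments into one supermatrix.

    gene_alignments: list of dicts mapping species name -> aligned sequence.
    A species absent from a gene receives a block of `missing` characters.
    """
    species = sorted({name for aln in gene_alignments for name in aln})
    matrix = {name: "" for name in species}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a gene must have equal length")
        width = lengths.pop()
        for name in species:
            matrix[name] += aln.get(name, missing * width)
    return matrix

# Assumed gene membership, inferred from sequence similarity only.
gene1 = {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA",
         "S4": "TCTAACGGAA", "S7": "TCTAATGGAC", "S8": "TATAACGGAA"}
gene2 = {"S4": "GGTAACCCTC", "S5": "GCTAAACCTC", "S6": "GGTGACCATC",
         "S7": "GCTAAACCTC"}
gene3 = {"S1": "TATTGATACA", "S3": "TCTTGATACC", "S4": "TAGTGATGCA",
         "S7": "TAGTGATGCA", "S8": "CATTCATACC"}

supermatrix = concatenate([gene1, gene2, gene3])
for name, row in supermatrix.items():
    print(name, row)
```

Only S4 and S7 have data for all three genes in this sketch; every other row carries at least one block of missing data.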
6
Two competing approaches
[Figure; labels: Species, Combined Analysis]
7
Why use supertree methods?
  • Missing data
  • Large dataset sizes
  • Incompatible data types (e.g., morphological
    features, biomolecular sequences, gene orders,
    even distances based upon biochemistry)
  • Unavailable sequence data (only trees)

8
Many Supertree Methods
  • MRP
  • weighted MRP
  • Min-Cut
  • Modified Min-Cut
  • Semi-strict Supertree
  • MRF
  • MRD
  • QILI
  • SDM
  • Q-imputation
  • PhySIC
  • Majority-Rule Supertrees
  • Maximum Likelihood Supertrees
  • and many more ...

9
Today's Outline
  • Supertree and combined analysis methods
  • Why we need better supertree methods
  • SuperFine: a new supertree method that is fast
    and more accurate than other supertree methods
  • Strict Consensus Merger (SCM)
  • Resolving polytomies
  • Performance of SuperFine (compared to MRP and
    combined analyses)
  • Applications and future work

10
Previous Simulation Studies
1. Generate model tree
2. Generate sequence data
3. Select subsets
11
What leads to missing data?
  • Evolution (gain and loss of genes)
  • Dataset selection
  • Limited resources (time, money, etc.)

12
My Simulation Study
  • Generate model trees (100-1000 taxa)
  • Simulate gene gain and loss and generate
    sequences
  • Simulate techniques for gene and taxon selection:
      • Clade-based datasets
      • Scaffold dataset
  • Generate source trees and a combined dataset
  • Apply supertree and combined analysis methods
  • Compare each estimated tree to the model tree,
    and record topological error

13
Experimental Parameters
  • Number of taxa in model tree: 100, 500, and 1000
  • Generate 5, 15, and 25 clade-based datasets,
    respectively
  • Scaffold density: 20%, 50%, 75%, and 100%
  • Six super-methods:
      • Combined analysis using ML and MP
      • MRP on ML and MP source trees
      • Weighted MRP on ML and MP source trees

(MRP = Matrix Representation with Parsimony)
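
A minimal sketch of the matrix-representation step of MRP, under simplifying assumptions: source trees are written as rooted nested tuples, every non-root clade becomes one column, taxa inside the clade are coded 1, other taxa of that source tree 0, and taxa absent from the source tree "?". The parsimony search on the resulting matrix is not shown, and the two source trees are made up for illustration.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def clades(tree):
    """Yield the leaf set below every internal node other than the root."""
    if not isinstance(tree, tuple):
        return
    for child in tree:
        if isinstance(child, tuple):
            yield leaves(child)
            yield from clades(child)

def mrp_matrix(source_trees):
    """Return a dict taxon -> row of '0', '1', '?' characters."""
    taxa = sorted(set().union(*(leaves(t) for t in source_trees)))
    rows = {taxon: "" for taxon in taxa}
    for tree in source_trees:
        present = leaves(tree)
        for clade in clades(tree):
            for taxon in taxa:
                if taxon not in present:
                    rows[taxon] += "?"   # taxon missing from this source tree
                elif taxon in clade:
                    rows[taxon] += "1"
                else:
                    rows[taxon] += "0"
    return rows

# Hypothetical source trees with overlapping taxon sets.
t1 = (("a", "b"), ("c", "d"), "e")
t2 = (("a", "c"), "d", "f")

for taxon, row in mrp_matrix([t1, t2]).items():
    print(taxon, row)
```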
14
Quantifying Topological Error
[Figure: True Tree and Estimated Tree]
15
Comparison of MRP-ML and CA-ML (False Negative Rate)
[Plot: x-axis Scaffold Density (%)]
16
We still need supertree methods!
  • Combined analysis cannot be used for:
      • Datasets that are very large
      • Incompatible data types
      • Unavailable sequence data

17
Outline
  • Supertree and combined analysis methods
  • Why we need better supertree methods
  • SuperFine: a new supertree method that is fast
    and more accurate than other supertree methods
  • Strict Consensus Merger (SCM)
  • Resolving polytomies
  • Performance of SuperFine (compared to MRP and
    combined analyses)
  • Applications and future work

18
Methods that Led to SuperFine
  • The Strict Consensus Merger (SCM) (Huson et al.
    1999)
  • Quartet MaxCut (QMC) (Snir and Rao 2008)

19
Strict Consensus Merger (SCM)
[Figure: two source trees, one on leaves a, b, c, d, e, f, g and one on leaves a, b, c, d, h, i, j]
20
Theorem
  • Let S be a collection of source trees and T be an
    SCM tree on S.
  • Then for every s in S, Σ(T|L(s)) ⊆ Σ(s), where
    T|L(s) is the induced subtree of T on the leaf set
    of s and Σ(t) is the set of bipartitions of a tree t.
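
A small sketch of how the containment in this theorem could be checked on toy data, reading it as "the bipartitions of T restricted to the leaves of s are a subset of the bipartitions of s". Trees are nested tuples, only non-trivial bipartitions are compared, and the source trees and candidate supertrees below are hand-written assumptions rather than output of an SCM implementation.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def bipartitions(tree, restrict_to=None):
    """Non-trivial bipartitions of a tree, optionally restricted to a leaf subset."""
    keep = leaves(tree) if restrict_to is None else leaves(tree) & set(restrict_to)
    found = set()

    def visit(node):
        for child in node:
            if isinstance(child, tuple):
                side = leaves(child) & keep
                other = keep - side
                # Keep only splits with at least two leaves on each side.
                if len(side) >= 2 and len(other) >= 2:
                    found.add(frozenset([frozenset(side), frozenset(other)]))
                visit(child)

    if isinstance(tree, tuple):
        visit(tree)
    return found

def contracts_every_source(supertree, sources):
    """True if, for every source s, the supertree restricted to L(s) adds no new split."""
    return all(bipartitions(supertree, leaves(s)) <= bipartitions(s) for s in sources)

s1 = (("a", "b"), "c", ("d", "e"))
s2 = (("a", "b"), "c", ("d", "f"))
merged = (("a", "b"), "c", ("d", "e", "f"))      # leaves d, e, f left unresolved
conflicting = (("a", "c"), "b", ("d", "e", "f"))  # groups a with c, unlike s1 and s2

print(contracts_every_source(merged, [s1, s2]))
print(contracts_every_source(conflicting, [s1, s2]))
```

The first candidate passes because every split it induces on a source's leaves already occurs in that source; the second fails because it separates a and c from b.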

21
Corollary
  • Let S be a collection of source trees, T be an SCM
    tree on S, and v be a vertex of T. Let T' be a
    subtree of T rooted at a vertex u adjacent to v,
    such that v is not a vertex of T'.
  • Then for every s in S, one of the following
    holds:
      • L(s) ⊆ L(T')
      • L(s) ∩ L(T') = ∅
      • L(s) ∩ L(T') | L(s) - L(T') ∈ Σ(s)

22
Intuition for the Theorem
[Figure: trees on leaves a, b, c, d, e, f, g and on leaves a, b, c, d, h, i, j, each leaf set shown twice]
23
Performance of SCM
  • Low false positive (FP) rate
    (estimated supertree has few false edges)
  • High false negative (FN) rate
    (estimated supertree is missing many true edges)
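
A minimal sketch of how FN and FP rates could be computed from bipartitions (internal edges). It assumes both trees are nested tuples on the same leaf set, that the FN rate is normalized by the number of internal edges of the true tree, and that the FP rate is normalized by the number of internal edges of the estimated tree; other normalizations are in use, and the two trees are made up.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def bipartitions(tree):
    """Return the non-trivial bipartitions (internal edges) of a tree."""
    everything = leaves(tree)
    found = set()

    def visit(node):
        for child in node:
            if isinstance(child, tuple):
                side = leaves(child)
                other = everything - side
                if len(side) >= 2 and len(other) >= 2:
                    found.add(frozenset([frozenset(side), frozenset(other)]))
                visit(child)

    if isinstance(tree, tuple):
        visit(tree)
    return found

def fn_fp_rates(true_tree, estimated_tree):
    """Return (FN rate, FP rate) of the estimated tree against the true tree."""
    true_splits = bipartitions(true_tree)
    est_splits = bipartitions(estimated_tree)
    fn = len(true_splits - est_splits) / len(true_splits) if true_splits else 0.0
    fp = len(est_splits - true_splits) / len(est_splits) if est_splits else 0.0
    return fn, fp

true_tree = ((("a", "b"), "c"), ("d", ("e", "f")))
estimated = (("a", "b"), "c", "d", ("e", "f"))   # one true edge is collapsed

fn_rate, fp_rate = fn_fp_rates(true_tree, estimated)
print(fn_rate, fp_rate)
```

The estimated tree here behaves like the SCM profile on this slide: it contains no false edge, but it is missing one of the three true internal edges.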

24
Methods that Led to SuperFine
  • The Strict Consensus Merger (SCM) (Huson et al.
    1999)
  • Quartet MaxCut (QMC) (Snir and Rao 2008)

25
Quartet MaxCut (QMC)
  • QMC is a heuristic for the following optimization
    problem:
  • Given a collection Q of quartet trees, find a
    supertree T, with leaf set L(T) = ∪_{q∈Q} L(q), that
    displays the maximum number of quartet trees in
    Q.

26
Maximizing the Number of Quartet Trees Displayed
  • 12|34, 23|45, 34|56, 45|67 are compatible quartet
    trees with supertree [Figure: supertree]
  • Adding the quartet 17|23 creates an incompatible
    set of quartet trees. An optimal supertree
    would be the same as above, because it agrees
    with 4 out of 5 quartet trees.
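
A short sketch of the counting behind this example. Two assumptions are made: each four-digit quartet is read as its first two labels split from its last two (12|34 and so on), and the supertree from the lost figure is taken to be the caterpillar tree on leaves 1 to 7. A tree displays a quartet when one of its internal edges separates the two pairs.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def clades(tree):
    """Yield the leaf set below every internal node other than the root."""
    if not isinstance(tree, tuple):
        return
    for child in tree:
        if isinstance(child, tuple):
            yield leaves(child)
            yield from clades(child)

def displays(tree, quartet):
    """True if some internal edge of the tree separates the quartet's two pairs."""
    left, right = set(quartet[0]), set(quartet[1])
    for clade in clades(tree):
        if left <= clade and not (right & clade):
            return True
        if right <= clade and not (left & clade):
            return True
    return False

# Assumed supertree: caterpillar on 1..7.
caterpillar = (1, 2, (3, (4, (5, (6, 7)))))
quartet_trees = [
    ((1, 2), (3, 4)),
    ((2, 3), (4, 5)),
    ((3, 4), (5, 6)),
    ((4, 5), (6, 7)),
    ((1, 7), (2, 3)),   # the added quartet
]

score = sum(displays(caterpillar, q) for q in quartet_trees)
print(score, "of", len(quartet_trees))
```

Under these assumptions the caterpillar displays the first four quartets and not the fifth, because on leaves 1, 2, 3, 7 it induces 12|37.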

27
QMC as a Supertree Method
  • Step 1: Encode source trees as a set of quartets
  • Step 2: Apply QMC
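
A minimal sketch of Step 1, assuming a source tree given as a nested tuple: every set of four leaves that the tree resolves contributes one quartet tree. This brute-force enumeration is for illustration on small trees only; Step 2 (running QMC itself) is not shown, and the source tree is hypothetical.

```python
from itertools import combinations

def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def clades(tree):
    """Yield the leaf set below every internal node other than the root."""
    if not isinstance(tree, tuple):
        return
    for child in tree:
        if isinstance(child, tuple):
            yield leaves(child)
            yield from clades(child)

def quartets(tree):
    """Return every quartet the tree resolves, each as a frozenset of two pairs."""
    all_clades = list(clades(tree))
    found = set()
    for a, b, c, d in combinations(sorted(leaves(tree)), 4):
        # Try the three possible splits of the four leaves.
        for left, right in (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))):
            l, r = set(left), set(right)
            if any((l <= cl and not (r & cl)) or (r <= cl and not (l & cl))
                   for cl in all_clades):
                found.add(frozenset([frozenset(left), frozenset(right)]))
    return found

source_tree = (("a", "b"), "c", ("d", "e"))
encoded = quartets(source_tree)
for quartet in sorted(encoded, key=lambda q: sorted(sorted(p) for p in q)):
    print(" | ".join("".join(sorted(pair)) for pair in sorted(quartet, key=sorted)))
```

A fully resolved tree on n leaves yields one quartet per four-leaf subset; an unresolved node yields none for the subsets it fails to separate.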

28
Idea behind SuperFine
  • First, construct a supertree with low false
    positives using SCM (the Strict Consensus
    Merger)
  • Then, refine the tree to reduce false negatives
    by resolving each polytomy using QMC (Quartet
    MaxCut)

29
Resolving a single polytomy, v
  • Step 1: Encode each source tree as a collection
    of quartet trees on {1,2,...,d}, where
    d = degree(v)
  • Step 2: Apply Quartet MaxCut (Snir and Rao) to
    the collection of quartet trees, to produce a
    tree t on leaf set {1,2,...,d}
  • Step 3: Replace the star tree at v by tree t

Why?
30
Back to Our Example
31
Where We Use the Theorem
For every s in S, Σ(T|L(s)) ⊆ Σ(s)
32
Step 1: Encode each source tree as a collection
of quartet trees on {1,2,...,d}
33
Step 2: Apply Quartet MaxCut (QMC) to the
collection of quartet trees
[Figure: QMC produces a tree on leaves 1, 2, 3, 4, 5, 6]
34
Theorem
  • For each polytomy v of degree d, the encoding of
    each source tree with leaf labels {1,2,...,d} is
    well-defined and produces no conflicting quartet
    trees.
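
A rough sketch of the relabeling this theorem refers to, under stated assumptions: the d groups of leaves around the polytomy are supplied by hand as `components` (a hypothetical input, not computed from an SCM tree here), each source-tree quartet is relabeled by the group index of its leaves, and only quartets touching four different groups are kept. The helper names and the two source trees are illustrative.

```python
from itertools import combinations

def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def clades(tree):
    """Yield the leaf set below every internal node other than the root."""
    if not isinstance(tree, tuple):
        return
    for child in tree:
        if isinstance(child, tuple):
            yield leaves(child)
            yield from clades(child)

def quartets(tree):
    """Yield (left_pair, right_pair) for every quartet the tree resolves."""
    all_clades = list(clades(tree))
    for a, b, c, d in combinations(sorted(leaves(tree)), 4):
        for left, right in (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))):
            l, r = set(left), set(right)
            if any((l <= cl and not (r & cl)) or (r <= cl and not (l & cl))
                   for cl in all_clades):
                yield left, right

def encode(source_tree, components):
    """Relabel source-tree quartets by component index 1..d around the polytomy."""
    label = {leaf: i for i, group in enumerate(components, start=1) for leaf in group}
    encoded = set()
    for left, right in quartets(source_tree):
        new_left = frozenset(label[x] for x in left)
        new_right = frozenset(label[x] for x in right)
        if len(new_left | new_right) == 4:   # four different components
            encoded.add(frozenset([new_left, new_right]))
    return encoded

def has_conflict(encoded):
    """True if two encoded quartets share four labels but split them differently."""
    seen = {}
    for quartet in encoded:
        key = frozenset().union(*quartet)
        if seen.setdefault(key, quartet) != quartet:
            return True
    return False

# Hypothetical leaf groups around a polytomy of degree d = 5.
components = [{"a", "b"}, {"c"}, {"d"}, {"e", "f"}, {"g"}]
s1 = ((("a", "b"), "c"), "d", "e")
s2 = (("a", "c"), ("f", "g"), "d")

all_quartets = encode(s1, components) | encode(s2, components)
print(len(all_quartets), has_conflict(all_quartets))
```

The relabeled quartets would then be handed to a quartet-based method such as QMC to produce a tree on {1,2,...,d}.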

35
Replace polytomy using tree from QMC
[Figure; labels: 1, 2, 3, 4, 5, 6 and d, f, g, h]
36
False Negative Rate
[Plot: x-axis Scaffold Density (%)]
37
False Negative Rate
[Plot: x-axis Scaffold Density (%)]
38
False Positive Rate
[Plot: x-axis Scaffold Density (%)]
39
Running Time
SuperFine vs. MRP
[Plots: three panels, x-axis Scaffold Density (%)]
40
Running Time
SuperFine vs. MRP
MRP: 8-12 sec.; SuperFine: 2-3 sec.
[Plots: three panels, x-axis Scaffold Density (%)]
41
Observations
  • SuperFine is much more accurate than MRP, with
    comparable performance only when the scaffold
    density is 100%
  • SuperFine is almost as accurate as CA-ML
  • SuperFine is extremely fast

42
Future Work
  • Exploring algorithm design space for SuperFine:
      • Different quartet encodings
      • Not using SCM in Step 1
      • Parallel version
      • Post-processing step to minimize Sum-of-FN to
        source trees
  • Using SuperFine to enable phylogeny estimation:
      • without an alignment
      • on many-marker combined datasets
  • Using SuperFine in conjunction with
    divide-and-conquer methods to create more
    accurate phylogenetic methods
  • Exploration of impact of source tree collections
    (in particular the scaffold) on supertree
    analyses
  • Revisiting specific biological supertrees