Understanding sets of trees

1
Understanding sets of trees
  • CS 394C
  • September 10, 2009

2
Basic challenge
  • Phylogenetic analyses are sometimes based upon a
    single marker, but often based upon many markers
  • Each marker can be analyzed separately, or the
    entire set can be combined into one
    super-matrix
  • Each matrix (each dataset) can result in many
    trees (almost no matter how you analyze the
    matrix)
  • What to do with huge numbers of trees?

3
What to do?
  • How to estimate evolutionary history from many
    trees
  • How to efficiently store large sets of trees
  • How to enable efficient queries of the set of
    trees

5
First, a few questions
  • Why are gene trees different from the species
    tree?
  • Why are estimated gene trees different from the
    true gene tree?
  • Under what conditions is the true evolutionary
    history not a tree? (i.e., what is
    reticulation?)

6
Reticulation
  • Evolutionary histories can be reticulate (meaning
    non-treelike)
  • Horizontal Gene Transfer (HGT)
  • Hybrid speciation
  • Recombination
  • Most phylogeny estimation methods produce trees.
  • Good resource about reticulate phylogenies: book
    chapter by Luay Nakhleh (see 394C webpage for the
    link)

7
  • We will assume that all evolutionary histories
    are treelike for the remainder of today's
    presentation.
  • Later in the course we'll discuss reticulate
    evolution

8
Estimated Gene Trees can differ from Species Trees
  • Biological reasons
  • Deep coalescent events (alleles)
  • Gene duplication and loss (gene families)
  • Computational reasons
  • Insufficient time
  • Poor methods (e.g., UPGMA)
  • Poor models (e.g., ML using Jukes-Cantor)
  • Data issues
  • Insufficient data (meaning not enough sites)
  • Poor alignments

9
Examples of problems
  • When true gene trees can differ from species
    tree
  • Given a collection of gene trees, find a species
    tree that minimizes the number of deep
    coalescent events
  • When true gene trees should equal the species
    tree
  • Given a collection of gene trees, find a species
    tree that minimizes the total distance to the
    gene trees

10
When gene trees can differ from species tree
  • Software/Algorithms for deep-coalescent (see
    PhyloNet from Nakhleh's webpage at Rice)
  • GLASS (Roch and Mossel) - distance-based
  • MDC (Than and Nakhleh) - parsimony
  • STEM (Kubatko) - ML
  • BEST (Liu et al.) - Bayesian
  • BUCKy (Ané et al.) - Bayesian
  • Software/Algorithms for duplication-loss
  • NOTUNG (Durand)
  • Duptree (Bansal et al.)
  • Hallett and Lagergren - algorithms/complexity

11
When gene trees should equal the species tree
  • The problem here is that estimated gene trees can
    differ from the true gene trees.
  • Although the problem is simple, it is still
    interesting -- computationally and
    mathematically.
  • Plus, we can still make novel contributions.

12
The very simplest problem
  • Easiest case
  • One species tree, and true gene trees agree with
    the species tree
  • Estimated trees are on the full set of taxa
  • Approaches
  • Consensus methods return a tree on the entire
    set S of taxa summarizing the input trees
  • Agreement methods return a tree on a subset of
    the taxa on which the trees agree
  • Clustering, then consensus/agreement

13
Consensus methods
  • These are the most usual ways of analyzing
    datasets of trees
  • Examples
  • Strict consensus
  • Majority consensus
  • Greedy consensus (aka extended majority)
  • Others less frequently used include Gordon's,
    Adams, the Strict Consensus Supertree, Local
    Consensus methods, and more.
  • See the survey paper by David Bryant for some of
    these
14
Simplest problems, cont.
  • Agreement methods return trees on subsets of S,
    on which the trees are the same (or compatible)
  • MAST: maximum agreement subtree (used in
    practice, sometimes)
  • MCST: maximum compatible subtree (Ganapathy et
    al., not used in practice)
  • The difference between these is how polytomies
    are handled

15
Soft vs. hard polytomies
  • Polytomy: a node of high degree (greater than
    three for an unrooted tree)
  • Polytomies arise in estimations when consensus
    methods are used
  • Polytomies also arise when contracting short
    branches in estimated trees
  • Polytomies can be hard (representing true
    radiations) or soft (representing lack of
    information)

16
Compatible source trees
  • Estimated trees can be compatible when we
    interpret polytomies as soft
  • Compatible means that there is a tree which is
    a common refinement.
  • Example: 123456, 123456, 123546.
  • We can compute the compatibility tree (when it
    exists) in O(nk) time, where n = |S| and there
    are k source trees
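
A minimal sketch of a compatibility test, assuming unrooted source trees on the same taxon set, each already reduced to its bipartitions (splits) and each split stored as the set of taxa on one side. It relies on the standard fact that a collection of splits fits on a single tree exactly when every pair of splits is compatible, and two splits A|A' and B|B' are compatible when at least one of the four intersections is empty. The function names and the example data are hypothetical, and this naive pairwise check is quadratic in the number of splits, so it does not achieve the O(nk) bound.

```python
from itertools import combinations

def splits_compatible(side_a, side_b, taxa):
    """True if the splits side_a | rest and side_b | rest can share a tree."""
    a, b = set(side_a), set(side_b)
    a_rest, b_rest = taxa - a, taxa - b
    # Compatible exactly when one of the four intersections is empty.
    return (not (a & b) or not (a & b_rest)
            or not (a_rest & b) or not (a_rest & b_rest))

def all_compatible(splits, taxa):
    """Pairwise compatibility of all splits pooled from the source trees."""
    return all(splits_compatible(x, y, taxa)
               for x, y in combinations(splits, 2))

# Hypothetical taxon set and splits, for illustration only.
taxa = set("abcdef")
print(all_compatible([{"a", "b"}, {"a", "b", "c"}, {"e", "f"}], taxa))
print(all_compatible([{"a", "b"}, {"b", "c"}], taxa))
```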

17
Computational complexity
  • Most consensus methods (which return a tree on
    the entire set S of taxa) are polynomial time.
  • Most agreement methods (which return a tree on
    the largest subset of the taxa on which the
    source trees agree) are based upon NP-hard
    problems. Some (e.g., MAST) have fixed-parameter
    polynomial time solutions.

18
Supertree problems
  • Realistic complexity: not all the source trees
    are on the same set of taxa.
  • Obvious problems
  • Find the tree on which all the source trees agree
    (if it exists).
  • Find the tree on which a maximum number of the
    source trees agree.
  • Both are NP-hard.

19
Quartet compatibility
  • Simple case: all the source trees are on four
    taxa.
  • We ask: does there exist a tree which agrees
    with all the source trees?
  • NP-hard!

20
Quartet tree amalgamation
  • Given collection of quartet trees, find a tree
    which agrees with a maximum number of these
    quartet trees
  • NP-hard, since compatibility is NP-hard
  • Hard to approximate, but PTAS if you have a tree
    on every quartet of taxa (Jiang et al.)
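
Finding the best tree is the hard part, but evaluating the objective for one candidate tree is simple. The sketch below assumes the candidate unrooted tree is given by its non-trivial splits (one side of each, as a set of taxa) and each quartet tree ab|cd is given as a pair of pairs; both encodings and the function names are assumptions made for illustration. A tree agrees with ab|cd when some split separates {a, b} from {c, d}.

```python
def displays(splits, quartet):
    """True if some split of the tree separates the two pairs of the quartet."""
    (a, b), (c, d) = quartet
    for side in splits:
        first_pair_inside = {a, b} <= side and not ({c, d} & side)
        second_pair_inside = {c, d} <= side and not ({a, b} & side)
        if first_pair_inside or second_pair_inside:
            return True
    return False

def quartet_score(splits, quartets):
    """Number of input quartet trees that the candidate tree agrees with."""
    return sum(displays(splits, q) for q in quartets)

# Hypothetical candidate tree ((a,b),c,(d,e)) and three quartet trees.
candidate = [{"a", "b"}, {"d", "e"}]
quartets = [
    (("a", "b"), ("c", "d")),
    (("a", "c"), ("b", "d")),
    (("a", "c"), ("d", "e")),
]
print(quartet_score(candidate, quartets))
```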

21
Quartet amalgamation algorithms
  • Quartet Puzzling (Strimmer and von Haeseler)
  • Q* (Berry et al.)
  • Quartet Cleaning (Berry et al.)
  • Weight Optimization (Ranwez and Gascuel)
  • Quartets MaxCut (Snir and Rao)
  • But see also the paper (St. John et al.)
    evaluating early quartet methods on the CS 394C
    webpage

22
What about rooted trees?
  • Given a set of rooted source trees, we ask:
  • Is there a tree on which all the rooted source
    trees are correct?

23
Rooted tree compatibility
  • Aho, Sagiv, Szymanski, and Ullman: polynomial
    time, recursive algorithm
  • If n = 1 (a single taxon), return the singleton
    tree.
  • If n > 1, then compute an equivalence relation on
    the set of taxa as follows.
  • For each rooted triple ((a,b),c) in the set, put
    a and b in the same equivalence class.
  • Compute transitive closure.
  • If only one equivalence class, reject (set is
    incompatible). Otherwise, recurse on each subset,
    and return tree obtained by making all
    recursively computed trees sibling subtrees.
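
The recursion above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the input is a list of rooted triples (a, b, c), each meaning ((a,b),c), it uses a small union-find structure to take the transitive closure, and at each level it keeps only the triples whose three taxa all lie in the current subset. The function name `build` and the nested-tuple output are choices made for the sketch.

```python
def build(taxa, triples):
    """Return a tree (nested tuples) displaying all triples, or None if none exists."""
    taxa = set(taxa)
    if len(taxa) == 1:
        return next(iter(taxa))            # singleton tree

    # Union-find over the current taxa.
    parent = {t: t for t in taxa}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Only triples entirely inside the current subset are relevant.
    relevant = [t for t in triples if set(t) <= taxa]
    for a, b, _ in relevant:
        parent[find(a)] = find(b)          # a and b share an equivalence class

    classes = {}
    for t in taxa:
        classes.setdefault(find(t), set()).add(t)
    if len(classes) == 1:
        return None                        # reject: the triples are incompatible

    subtrees = []
    for block in classes.values():
        subtree = build(block, relevant)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)                 # make the subtrees siblings

# Hypothetical inputs, for illustration only.
print(build("abcd", [("a", "b", "c"), ("c", "d", "a")]))
print(build("abc", [("a", "b", "c"), ("a", "c", "b")]))
```

The order of siblings in the returned tuples depends on set iteration order, so two runs may print the same tree with its children listed differently.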

24
Subtree compatibility
  • If source trees are rooted, then compatibility
    can be tested in polynomial time. Optimization
    problems are NP-hard, however.
  • If source trees are unrooted, then compatibility
    is NP-hard. And so optimization problems are
    also NP-hard.

25
Supertree problems, in practice
  • In practice, the most frequently used supertree
    method is MRP, for Matrix Representation with
    Parsimony.
  • There are, however, many other supertree methods!

26
Many Supertree Methods
  • MRP
  • weighted MRP
  • Min-Cut
  • Modified Min-Cut
  • Semi-strict Supertree
  • MRF
  • MRD
  • QILI
  • SDM
  • Q-imputation
  • PhySIC
  • Majority-Rule Supertrees
  • Maximum Likelihood Supertrees
  • and many more ...

27
MRP
  • Idea: take every source tree, and replace it
    with a matrix of 0,1,?.
  • Concatenate the matrices.
  • Apply Maximum Parsimony.
  • If all the source trees are compatible, then an
    exact solution to MRP will return the
    compatibility trees.
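
The encoding step can be sketched as follows, assuming rooted source trees given as nested tuples of taxon names and the usual clade-based coding: one column per clade below the root of a source tree, with 1 for taxa inside the clade, 0 for other taxa in that tree, and ? for taxa missing from that tree. The function names are hypothetical, and the sketch stops at the concatenated matrix; the maximum parsimony search would be run on it with separate software.

```python
def leaf_set(tree):
    """Taxa of a tree given as nested tuples of names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def proper_clades(tree):
    """Clades below the root, excluding single leaves, in post-order."""
    found = []

    def visit(node, is_root):
        if isinstance(node, str):
            return
        for child in node:
            visit(child, False)
        if not is_root:
            found.append(leaf_set(node))

    visit(tree, True)
    return found

def mrp_matrix(trees):
    """Concatenated 0/1/? matrix: one row per taxon, one column per clade."""
    all_taxa = sorted(set().union(*(leaf_set(t) for t in trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for tree in trees:
        present = leaf_set(tree)
        for clade in proper_clades(tree):
            for taxon in all_taxa:
                if taxon not in present:
                    rows[taxon].append("?")   # taxon absent from this source tree
                elif taxon in clade:
                    rows[taxon].append("1")
                else:
                    rows[taxon].append("0")
    return {taxon: "".join(chars) for taxon, chars in rows.items()}

# Hypothetical source trees on overlapping taxon sets.
sources = [(("a", "b"), "c"), (("b", "c"), "d")]
for taxon, row in mrp_matrix(sources).items():
    print(taxon, row)
```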

28
Homework, due 9/15
  • Read two papers (linked on the webpage)
  • St. John et al., about quartet-based methods
  • Moret et al., about sequence-length requirements
  • Pick one, write summary, and include questions

29
Question!
  • How do you feel about occasionally having class
    on some Monday or Friday, so we can have guest
    lectures?