Phylogenetic supertrees: seeing the data for the trees

1
Phylogenetic supertrees: seeing the data for the
trees
  • Olaf R. P. Bininda-Emonds
  • Technische Universität München

2
Outline
  • the fundamental issue: characters versus trees
  • open questions: are trees data?
  • loss of contact with primary character data
  • loss of information
  • novel solutions
  • data duplication
  • the nature of supertrees
  • analytical issues
  • conclusions
  • are supertrees a valid phylogenetic technique?

3
The fundamental issues
4
The basic distinction
  • Supertrees
  • source data: phylogenies
  • basic unit: membership criterion / statement of
    relationship
  • at best, can be viewed as a proxy for a shared
    derived character
  • Conventional studies
  • source data: measurable attribute of an organism
  • basic unit: character
  • can be viewed as a putative statement of
    relationship

5
The fundamental issue
  • supertrees combine trees, not real data
  • has led to many criticisms of supertree
    construction
  • but also lends advantages to the approach

6
Supertree construction
7
Supertree methods
  • Direct
  • strict consensus supertrees
  • MinCutSupertree (and variants)
  • semi-strict supertrees
  • Lanyon (1993)
  • Goloboff and Pol (2002)
  • Indirect
  • most matrix representation (MR) supertrees
  • parsimony (MRP and variants)
  • compatibility (MRC)
  • minimum flip supertrees (MRF)
  • average consensus (MRD)
  • gene tree parsimony
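
As an illustration of the matrix representation step behind MRP, the sketch below encodes every clade of every source tree as one binary pseudo-character: 1 for taxa inside the clade, 0 for the other taxa on that tree, and ? for taxa the tree does not contain. The tree format (a taxon set plus a list of clades), the function name mrp_matrix, and the toy trees are assumptions made for this example and do not come from the presentation. The all-zero outgroup row that is commonly added for rooting is left out.

```python
def mrp_matrix(source_trees):
    """Build a matrix representation of a set of rooted source trees.

    source_trees: list of (taxa, clades) pairs, where taxa is a set of
    taxon names and clades is a list of sets of taxon names.
    Returns a dict mapping each taxon to its row of '1', '0', '?' codes.
    """
    all_taxa = sorted(set().union(*(taxa for taxa, _ in source_trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for taxa, clades in source_trees:
        for clade in clades:          # one matrix column per clade
            for taxon in all_taxa:
                if taxon in clade:
                    rows[taxon].append("1")   # member of the clade
                elif taxon in taxa:
                    rows[taxon].append("0")   # on the tree, outside the clade
                else:
                    rows[taxon].append("?")   # missing from this source tree
    return {taxon: "".join(codes) for taxon, codes in rows.items()}


# Hypothetical source trees with partially overlapping taxon sets.
tree1 = ({"A", "B", "C", "D"}, [{"A", "B"}, {"A", "B", "C"}])
tree2 = ({"B", "C", "E"}, [{"C", "E"}])
matrix = mrp_matrix([tree1, tree2])
for taxon, row in matrix.items():
    print(taxon, row)
```

The resulting matrix would then be analyzed with an ordinary parsimony program; that step is outside the scope of this sketch.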

8
Are trees data?
9
Open questions
  • loss of contact with raw (character) data
  • loss of information
  • novel solutions
  • data duplication
  • the nature of supertrees: consensus or
    phylogenetic hypothesis?
  • analytical issues

10
Loss of information
  • a tree is a graphical representation of the
    primary signal in a character-based data set
  • strength of primary signal can be measured (e.g.,
    bootstrap frequencies)
  • but information regarding nature of any
    conflicting subsignals lost

11
Potential problems
  • all trees and clades on them have equal support a
    priori
  • prevents signal enhancement (sensu de Queiroz
    et al., 1995) in combined data sets
  • coherent subsignals in different data partitions,
    when combined, outweigh conflicting primary
    signals
  • throwing away of information should cause a
    supertree analysis to be less accurate than a
    total evidence one, where primary data are
    combined

12
No loss of accuracy
  • simulation studies indicate loss of information
    is not detrimental
  • MRP (and variants) (Bininda-Emonds and Sanderson,
    2001)
  • average consensus (Lapointe and Levasseur, 2001)
  • both methods perform roughly on a par with total
    evidence analyses of primary character data
  • and show similar behaviour to total evidence
    analyses

13
Maximizing contact
  • weighting according to evidential support in
    source trees
  • possible for all MR methods, average consensus,
    and MinCutSupertree (and gene tree parsimony?)
  • causes MRP to outperform total evidence analyses
    of primary character data in simulation
    (Bininda-Emonds and Sanderson, 2001)
  • bootstrapping of primary character data
  • both non-parametric (Moore et al., in prep) and
    parametric versions (Huelsenbeck et al., in prep)
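
One way to picture weighting by evidential support is to attach a weight to each matrix-representation column that reflects how well the corresponding clade is supported on its source tree. The sketch below is only an assumption about how that could look: the slides do not specify a weighting scheme, and the choice of bootstrap percentage divided by 100, the function name weighted_columns, and the example values are all hypothetical.

```python
def weighted_columns(source_trees):
    """Pair every source-tree clade with a weight derived from its support.

    source_trees: list of (taxa, clade_support) pairs, where clade_support
    maps a frozenset of taxon names to a support value between 0 and 100.
    Returns a list of (members, non_members, weight) tuples.
    """
    columns = []
    for taxa, clade_support in source_trees:
        for clade, support in clade_support.items():
            members = frozenset(clade)
            non_members = frozenset(taxa) - members
            # Assumed scheme: weight is the support rescaled to 0..1.
            columns.append((members, non_members, support / 100.0))
    return columns


# Hypothetical source tree with bootstrap percentages on two clades.
tree1 = (
    {"A", "B", "C", "D"},
    {frozenset({"A", "B"}): 95, frozenset({"A", "B", "C"}): 60},
)
columns = weighted_columns([tree1])
for members, non_members, weight in columns:
    print(sorted(members), sorted(non_members), weight)
```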

14
Non-parametric bootstrapped supertrees
[Figure: diagram with panels labelled "original data", "bootstrapped source trees", "bootstrapped supertree", and "consensus of supertrees"]
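
The resampling step of a non-parametric bootstrap can be sketched as drawing alignment columns with replacement from the original character data. The example below shows only that step, under stated assumptions: the alignment format, the function name bootstrap_alignment, and the toy sequences are hypothetical, and inferring a source tree from each pseudoreplicate, building the supertrees, and taking their consensus are not shown.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the columns (sites) of an alignment with replacement.

    alignment: dict mapping taxon name to a sequence string; all sequences
    must have the same length. rng: a random.Random instance.
    Returns a new alignment of the same dimensions.
    """
    n_sites = len(next(iter(alignment.values())))
    sampled_sites = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        taxon: "".join(seq[i] for i in sampled_sites)
        for taxon, seq in alignment.items()
    }


# Hypothetical toy alignment for one source data set.
alignment = {"A": "ACGT", "B": "ACGA", "C": "TCGA"}
rng = random.Random(0)  # fixed seed so the run is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(3)]
for replicate in replicates:
    print(replicate)
```

Each pseudoreplicate keeps the taxa and the number of sites of the original data; only the mix of columns changes.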
15
Open questions
  • loss of contact with raw (character) data
  • loss of information
  • novel solutions
  • data duplication
  • the nature of supertrees: consensus or
    phylogenetic hypothesis?
  • analytical issues

16
Novel clades
  • all supertree methods have the potential to yield
    novel statements
  • relationships between taxa that do not co-exist
    on any single source tree (sensu Sanderson et
    al., 1998)
  • defining characteristic of method


17
Unsupported clades
  • some supertree methods have the potential to make
    statements that are not only novel, but also
    contradicted (unsupported) by every source tree
  • violation of a weaker form of co-Pareto property
  • co-Pareto: a relationship of a given kind in the
    consensus is present in at least one input tree
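
To make the idea of an unsupported clade concrete, the sketch below counts, for one supertree clade, how many source trees support it and how many contradict it. The definitions used are one possible reading and are assumptions of this example: a source tree supports the clade if the clade, pruned to that tree's taxa, is a non-trivial clade on the tree, and contradicts it if the pruned clade partially overlaps some clade on the tree. The function names and toy trees are hypothetical.

```python
def restrict(clade, taxa):
    """Prune a clade to the taxa present on one source tree."""
    return frozenset(clade) & frozenset(taxa)


def is_supported(clade, taxa, source_clades):
    """True if the pruned clade is a non-trivial clade of the source tree."""
    pruned = restrict(clade, taxa)
    return 1 < len(pruned) < len(taxa) and pruned in {
        frozenset(c) for c in source_clades
    }


def is_contradicted(clade, taxa, source_clades):
    """True if the pruned clade partially overlaps any source-tree clade."""
    pruned = restrict(clade, taxa)
    return any(
        bool(pruned & c) and bool(pruned - c) and bool(c - pruned)
        for c in map(frozenset, source_clades)
    )


def tally(clade, source_trees):
    """Return (number of supporting trees, number of contradicting trees)."""
    n_support = sum(is_supported(clade, t, cs) for t, cs in source_trees)
    n_contra = sum(is_contradicted(clade, t, cs) for t, cs in source_trees)
    return n_support, n_contra


# Hypothetical source trees on the same four taxa.
sources = [
    ({"A", "B", "C", "D"}, [{"A", "B"}, {"A", "B", "C"}]),
    ({"A", "B", "C", "D"}, [{"C", "D"}]),
]
print(tally({"A", "B"}, sources))  # supported by the first tree only
print(tally({"B", "C"}, sources))  # supported by none, contradicted by both
```

Under these assumed definitions, a clade with zero supporting trees that every source tree contradicts corresponds to the unsupported case described on the slide.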

18
  • from Goloboff and Pol (2002)

19
Comparing supertree methods
  • indirect, optimization-based methods seem more
    prone to producing unsupported clades
  • more prone:
  • MRP (and variants)
  • MRF?
  • average consensus?
  • MinCutSupertree (and variants)?
  • gene tree parsimony?
  • less prone:
  • strict consensus supertrees
  • semi-strict supertrees
  • MRC

20
Questions: unsupported clades
  • how should they be treated?
  • how common are they?

21
Appropriateness
  • Conventional studies
  • unsupported clades (at level of resulting trees)
    arise via signal enhancement
  • have direct character support in the combined
    matrix
  • Supertrees
  • subsignals are invisible
  • unsupported clades lack any support among source
    trees → should be regarded as spurious (Pisani
    and Wilkinson, 2002)
  • not equivalent to signal enhancement

23
Incidence of unsupported clades
  • circumstantial evidence hints that they are rare
  • only a few reported in the literature
  • theoretical: Goloboff and Pol (2002); Wilkinson
    et al. (2001)
  • empirical: Bininda-Emonds and Bryant (1998);
    Wilkinson et al. (2001)
  • estimated that 8 of the 198 clades in the
    carnivore MRP supertree (≈ 4%) had no support
    among the source trees (Bininda-Emonds et al.,
    1999)
  • dinosaur MRP supertree (Pisani et al., 2002) has
    no unsupported clades

24
Unsupported clades are very rare
  • simulation results (MRP only)
  • occur most often with source trees that are
  • few in number (n 5)
  • large in size (up to 50 taxa)
  • possess identical taxon sets (consensus
    setting)
  • most often means < 0.21% of all simulated
    clades
  • overall incidence was 131 of 282 137 clades
    (< 0.05%)
  • empirical results
  • both the carnivore and lagomorph MRP supertrees
    have no unsupported clades whatsoever
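
The incidence figures quoted for unsupported clades are simple proportions, and a short calculation shows their scale. The counts are the ones given in the slides (8 of 198 clades in the earlier carnivore estimate, 131 of 282 137 simulated clades); the function name incidence_percent is an assumption of this sketch.

```python
def incidence_percent(count, total):
    """Express count out of total as a percentage."""
    return 100.0 * count / total


carnivore_estimate = incidence_percent(8, 198)
simulation_overall = incidence_percent(131, 282137)

print(f"carnivore estimate: {carnivore_estimate:.2f}%")
print(f"simulation overall: {simulation_overall:.3f}%")
```

The first value comes out at roughly 4 percent and the second stays below 0.05 percent, which matches the orders of magnitude given on the slides.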

25
Open questions
  • loss of contact with raw (character) data
  • loss of information
  • novel solutions
  • data duplication
  • the nature of supertrees: consensus or
    phylogenetic hypothesis?
  • analytical issues

26
Data duplication
  • character data are often recycled between
    phylogenetic analyses
  • e.g., total evidence analyses, molecular studies
    of the same gene
  • the same character data may contribute to more
    than one source tree
  • overrepresented in a supertree analysis → data
    duplication
  • also violates assumption of data independence

27
  • data duplication among cetartiodactyl source
    trees in the Liu et al. (2001) mammal order MRP
    supertree
  • from Gatesy et al. (2002)

28
Minimizing duplication
  • data duplication a potential problem for all
    supertree methods
  • use of trees does not directly reveal the source
    of the underlying data set
  • but can be minimized / avoided with careful data
    collection protocols
  • e.g., supertrees of Daubin et al. (2001) and
    Kennedy and Page (2002) lack data duplication

29
Is data duplication unavoidable?
  • no phylogenies are independent given a single
    Tree of Life
  • all characters and data sources have been subject
    to the same evolutionary processes and history
  • want to combine phylogenetic hypotheses that can
    reasonably be viewed as being independent

30
Is the problem overrated?
  • supertrees combine phylogenetic hypotheses
  • emergent property → composed of more than their
    raw character data
  • manipulation of data (weighting, alignment,
    recoding)
  • method and assumptions of analysis
  • for example
  • strongly conflicting molecular phylogenies for
    whales can be explained largely by the choice of
    outgroup (Messenger and McGuire, 1998)
  • alignment and weighting of primary data also
    important

31
Is data duplication overrated?
  • data duplication is often only partial
  • most combined data sets represent unique
    combinations of individual data sets
  • easy to deal with data sets that are supersets of
    others
  • signal enhancement means that each unique
    combination could justifiably be viewed as an
    independent hypothesis
  • also independent from constituent data sets

32
Are supertrees unfairly singled out?
  • data duplication also exists in conventional
    studies (but less obviously so and to a lesser
    known extent)
  • morphological → single features often described
    by multiple characters
  • molecular → secondary structure (e.g., stems in
    tRNA, protein folding) and codon structure mean
    primary mutations may require secondary
    compensatory ones
  • total evidence → mixing of phenotypic and
    genotypic data must represent data duplication at
    some level

33
Open questions
  • loss of contact with raw (character) data
  • loss of information
  • novel solutions
  • data duplication
  • the nature of supertrees: consensus or
    phylogenetic hypothesis?
  • analytical issues

34
The nature of supertrees
  • is the supertree itself a legitimate phylogenetic
    hypothesis?
  • many would say no, arguing instead that they
    are a
  • form of consensus
  • historical summary of systematic effort
  • therefore, supertrees should not be used to
    answer biological questions

35
Supertrees as consensus
  • association derives from
  • similar methodology (combining trees rather than
    data)
  • both containing polytomies
  • resulting topologies may be suboptimal given
    underlying data
  • why are consensus trees not valid phylogenetic
    hypotheses?
  • especially if polytomies viewed as soft rather
    than hard

36
Dealing with incongruence
  • all supertree methods must somehow deal with
    incongruence among source trees
  • ignore it: strict consensus, semi-strict,
    MinCutSupertree, MRC
  • fix it: MRF
  • explain it biologically: gene tree parsimony
  • optimize it: average consensus and MRP

37
Incongruence as homoplasy
  • a repeated criticism of MRP is that inferred
    homoplasy on supertree has no biological meaning
  • convergence and reversals meaningless with
    respect to a membership criterion
  • but why is MRP singled out?
  • similar arguments should apply at least to
    average consensus

38
Parsimony and parsimony
  • Principle of parsimony
  • a criterion for deciding among scientific
    theories or explanations
  • "Plurality should not be posited without
    necessity" → choose the simplest explanation of a
    phenomenon
  • Cladistic parsimony
  • specific application of principle of parsimony
  • prefer the tree with the fewest number of
    evolutionary steps (i.e., character state
    changes)
  • additional changes over minimum number represent
    homoplasy
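
The step counting behind cladistic parsimony can be sketched with Fitch's algorithm for a single unordered character on a rooted binary tree. Any steps beyond the minimum (number of observed states minus one) are the extra changes that the slide calls homoplasy. The nested-tuple tree format, the function names, and the toy characters are assumptions made for this example.

```python
def fitch(tree, states):
    """Fitch's algorithm for one character on a rooted binary tree.

    tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    states: dict mapping each leaf name to its character state.
    Returns (possible states at this node, number of state changes below it).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_states, left_steps = fitch(left, states)
    right_states, right_steps = fitch(right, states)
    shared = left_states & right_states
    if shared:
        # Children agree on at least one state: no change needed here.
        return shared, left_steps + right_steps
    # Children disagree: take the union and count one change.
    return left_states | right_states, left_steps + right_steps + 1


def extra_steps(tree, states):
    """Steps required on the tree beyond the minimum for this character."""
    _, steps = fitch(tree, states)
    minimum = len(set(states.values())) - 1
    return steps - minimum


# Hypothetical tree (((A,B),C),D) and two binary characters.
tree = ((("A", "B"), "C"), "D")
congruent = {"A": "1", "B": "1", "C": "0", "D": "0"}
conflicting = {"A": "1", "B": "0", "C": "1", "D": "0"}
print(extra_steps(tree, congruent))
print(extra_steps(tree, conflicting))
```

The first character fits the tree with a single change, while the second needs one change more than the minimum.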

39
Homoplasy and supertrees
  • notions of homoplasy, convergence, and reversals
    have nothing to do with parsimony per se
  • or really even with cladistic parsimony
  • post hoc biological interpretation of
    incongruence
  • incongruence on an MRP supertree is simply
    incongruence
  • idea of homoplasy in this context is
    epistemologically, not biologically meaningless

40
Open questions
  • loss of contact with raw (character) data
  • loss of information
  • novel solutions
  • data duplication
  • the nature of supertrees: consensus or
    phylogenetic hypothesis?
  • analytical issues

41
Limitations of total evidence
  • analytical limitations of combined primary data
    sets also result in a loss of information
  • data must be compatible
  • use of a single optimization criterion → usually
    MP, but ML now also possible
  • some data still not analyzable under either
    framework (e.g., DNA-DNA hybridization,
    morphometric data)
  • use of simplistic models of evolution
  • MP: differential weighting (including ti:tv
    ratio)
  • ML: same model for every partition
  • alignment problems

42
Advantages to supertrees
  • no loss of information: all phylogenetic
    hypotheses can be combined
  • even those that aren't based on any data
  • process amounts to partitioned analyses
  • each partition can be analyzed according to the
    most appropriate model of evolution and
    optimization criterion
  • can be done in parallel
  • results then combined with little loss of
    accuracy
  • or hopefully less than the loss of information
    that a total evidence analysis entails

43
A phylogeny of mammals
  • The superteam
  • have complete supertrees for
  • Carnivora
  • Chiroptera
  • Insectivora
  • Lagomorpha
  • Marsupialia
  • Primates
  • total of 1923 species (41.5%)
  • Molecular data
  • Murphy et al. (2001a)
  • 9779 bp from 18 genes for 64 species
  • Madsen et al. (2001)
  • 8655 bp from 4 genes for 82 species
  • Murphy et al. (2001b)
  • 16 397 bp from 22 genes for 44 species (< 1%)

44
Summary
45
Whither supertrees?
  • criticisms of supertree construction have been
    launched at two levels
  • at the supertree approach as a whole
  • at individual supertree methods

46
Of approaches
  • supertree problem inherently difficult because of
    missing data
  • results in the lack of a single right answer

47
Of approaches
  • trees are data
  • potential loss of information not detrimental
  • key is to think in terms of phylogenetic
    hypotheses
  • still awaiting a response from the cladistic
    community

48
and methods
  • any method will go astray if its assumptions are
    violated
  • e.g., parsimony and long-branch attraction,
    likelihood and wrong model, regression and data
    non-independence
  • for supertrees, key is to try to establish
  • what each method's boundary conditions are
  • how robust each method is to violations of its
    assumptions
  • what the properties of each method are (in
    relation to our desired objective)