Title: Phylogenetic supertrees: seeing the data for the trees
1Phylogenetic supertrees: seeing the data for the trees
- Olaf R. P. Bininda-Emonds
- Technische Universität München
2Outline
- the fundamental issue: characters versus trees
- open questions: are trees data?
- loss of contact with primary character data
- loss of information
- novel solutions
- data duplication
- the nature of supertrees
- analytical issues
- conclusions
- are supertrees a valid phylogenetic technique?
3The fundamental issues
4The basic distinction
- Supertrees
- source data: phylogenies
- basic unit: membership criterion / statement of relationship
- at best, can be viewed as a proxy for a shared derived character
- Conventional studies
- source data: measurable attribute of an organism
- basic unit: character
- can be viewed as a putative statement of relationship
5The fundamental issue
- supertrees combine trees, not real data
- has led to many criticisms of supertree construction
- but also lends advantages to the approach
6Supertree construction
7Supertree methods
- Direct
- strict consensus supertrees
- MinCutSupertree (and variants)
- semi-strict supertrees
- Lanyon (1993)
- Goloboff and Pol (2002)
- Indirect
- most matrix representation (MR) supertrees
- parsimony (MRP and variants)
- compatibility (MRC)
- minimum flip supertrees (MRF)
- average consensus (MRD)
- gene tree parsimony
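As a minimal sketch of what "matrix representation" means for the indirect methods, the Python below encodes each clade of each source tree as one binary pseudo-character: 1 for taxa inside the clade, 0 for other taxa on that tree, and ? for taxa absent from that tree. The function name `mrp_matrix` and the (taxa, clades) input format are assumptions made for illustration, and the all-zero outgroup row that MRP analyses normally add is omitted.

```python
def mrp_matrix(source_trees):
    """Encode source trees as a matrix of binary pseudo-characters.

    source_trees: list of (taxa, clades) pairs, where taxa is the set of
    leaf names on one tree and clades is a list of sets of leaf names.
    Returns a dict mapping each taxon to its row as a string of 1/0/?.
    """
    all_taxa = sorted(set().union(*(taxa for taxa, _ in source_trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for taxa, clades in source_trees:
        for clade in clades:
            for taxon in all_taxa:
                if taxon not in taxa:
                    rows[taxon].append("?")   # taxon missing from this tree
                elif taxon in clade:
                    rows[taxon].append("1")   # member of the clade
                else:
                    rows[taxon].append("0")   # on the tree, outside the clade
    return {taxon: "".join(cells) for taxon, cells in rows.items()}


# Two hypothetical source trees with partially overlapping taxon sets.
tree1 = ({"A", "B", "C", "D"}, [{"A", "B"}, {"A", "B", "C"}])
tree2 = ({"C", "D", "E"}, [{"D", "E"}])
matrix = mrp_matrix([tree1, tree2])
for taxon, row in matrix.items():
    print(taxon, row)
```

The resulting matrix would then be analysed with the chosen optimization criterion (parsimony for MRP, compatibility for MRC, and so on); that search step is not shown here.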
8Are trees data?
9Open questions
- loss of contact with raw (character) data
- loss of information
- novel solutions
- data duplication
- the nature of supertrees: consensus or phylogenetic hypothesis?
- analytical issues
10Loss of information
- a tree is a graphical representation of the primary signal in a character-based data set
- strength of primary signal can be measured (e.g., bootstrap frequencies)
- but information regarding the nature of any conflicting subsignals is lost
11Potential problems
- all trees and clades on them have equal support a priori
- prevents signal enhancement (sensu de Queiroz et al., 1995) in combined data sets
- coherent subsignals in different data partitions, when combined, outweigh conflicting primary signals
- throwing away of information should cause a supertree analysis to be less accurate than a total evidence one, where primary data are combined
12No loss of accuracy
- simulation studies indicate loss of information is not detrimental
- MRP (and variants) (Bininda-Emonds and Sanderson, 2001)
- average consensus (Lapointe and Levasseur, 2001)
- both methods perform about on a par with total evidence analyses of primary character data
- and show similar behaviour to total evidence analyses
13Maximizing contact
- weighting according to evidential support in source trees
- possible for all MR methods, average consensus, and MinCutSupertree (and gene tree parsimony?)
- causes MRP to outperform total evidence analyses of primary character data in simulation (Bininda-Emonds and Sanderson, 2001)
- bootstrapping of primary character data
- both non-parametric (Moore et al., in prep) and parametric versions (Huelsenbeck et al., in prep)
14Non-parametric bootstrapped supertrees
[Figure: workflow diagram with panels labelled original data, bootstrapped source trees, bootstrapped supertree, and consensus of supertrees]
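The resampling step behind a non-parametric bootstrapped supertree can be sketched as follows, assuming each primary data set is stored as a dict of equal-length character strings keyed by taxon. The function names are hypothetical, and the tree-building and supertree-building steps that would consume each replicate are deliberately left out.

```python
import random


def bootstrap_columns(matrix, rng):
    """Resample the characters (columns) of one data set with replacement.

    matrix: dict mapping taxon name -> string of character states,
    all strings having the same length.
    """
    n_chars = len(next(iter(matrix.values())))
    picks = [rng.randrange(n_chars) for _ in range(n_chars)]
    # The same column picks are applied to every taxon, so columns stay intact.
    return {taxon: "".join(seq[i] for i in picks) for taxon, seq in matrix.items()}


def bootstrap_replicates(datasets, n_reps, seed=0):
    """Return n_reps replicates, each holding one resampled copy of every data set."""
    rng = random.Random(seed)
    return [[bootstrap_columns(m, rng) for m in datasets] for _ in range(n_reps)]


# Hypothetical toy alignment used only to exercise the functions.
data = {"A": "ACGT", "B": "ACGA", "C": "TCGA"}
replicates = bootstrap_replicates([data, data], n_reps=3)
print(replicates[0][0])
```

In the workflow named on the slide, each replicate would yield one set of bootstrapped source trees, those would be combined into one bootstrapped supertree, and the supertrees from all replicates would then be summarized by a consensus.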
15Open questions
- loss of contact with raw (character) data
- loss of information
- novel solutions
- data duplication
- the nature of supertrees: consensus or phylogenetic hypothesis?
- analytical issues
16Novel clades
- all supertree methods have the potential to yield novel statements
- relationships between taxa that do not co-exist on any single source tree (sensu Sanderson et al., 1998)
- defining characteristic of method
17Unsupported clades
- some supertree methods have the potential to make statements that are not only novel, but also contradicted (unsupported) by every source tree
- violation of a weaker form of the co-Pareto property
- co-Pareto: a relationship of a given kind in the consensus is present in at least one input tree
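One way to screen a supertree for unsupported clades is sketched below. It uses a simplified reading of "supported", which is an assumption of this sketch and not necessarily the formal definition used by Goloboff and Pol (2002): a supertree clade counts as supported by a source tree if, after pruning the clade to that tree's taxa, it is still informative and appears as a clade on that tree. The function names and the (taxa, clades) input format are hypothetical.

```python
def supports(clade, tree_taxa, tree_clades):
    """True if the source tree contains the clade once it is pruned to tree_taxa."""
    inside = clade & tree_taxa
    outside = tree_taxa - clade
    if len(inside) < 2 or not outside:
        return False  # the pruned clade says nothing about this tree
    return any(frozenset(c) == inside for c in tree_clades)


def unsupported_clades(supertree_clades, source_trees):
    """Return the supertree clades that no source tree supports."""
    return [
        clade
        for clade in supertree_clades
        if not any(
            supports(frozenset(clade), frozenset(taxa), clades)
            for taxa, clades in source_trees
        )
    ]


# Hypothetical source trees and a hypothetical supertree.
sources = [
    ({"A", "B", "C", "D"}, [{"A", "B"}, {"A", "B", "C"}]),
    ({"C", "D", "E"}, [{"D", "E"}]),
]
supertree = [{"A", "B"}, {"C", "E"}]
print(unsupported_clades(supertree, sources))
```

In this toy case the clade grouping C with E is flagged, because the only source tree containing both taxa groups D with E instead.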
18- from Goloboff and Pol (2002)
19Comparing supertree methods
- indirect, optimization-based methods seem more
prone to producing unsupported clades
- ?
- MRP (and variants)
- MRF?
- average consensus?
- MinCutSupertree (and variants)?
- gene tree parsimony?
- ?
- strict consensus supertrees
- semi-strict supertrees
- MRC
20Questions: unsupported clades
- how should they be treated?
- how common are they?
21Appropriateness
- Conventional studies
- unsupported clades (at level of resulting trees) arise via signal enhancement
- have direct character support in the combined matrix
- Supertrees
- subsignals are invisible
- unsupported clades lack any support among source trees → should be regarded as spurious (Pisani and Wilkinson, 2002)
- not equivalent to signal enhancement
23Incidence of unsupported clades
- circumstantial evidence hints that they are rare
- only a few reported in the literature
- theoretical: Goloboff and Pol (2002); Wilkinson et al. (2001)
- empirical: Bininda-Emonds and Bryant (1998); Wilkinson et al. (2001)
- estimated that 8 of the 198 clades in the carnivore MRP supertree (about 4%) had no support among the source trees (Bininda-Emonds et al., 1999)
- dinosaur MRP supertree (Pisani et al., 2002) has no unsupported clades
24Unsupported clades are very rare
- simulation results (MRP only)
- occur most often with source trees that are
- few in number (n 5)
- large in size (up to 50 taxa)
- possess identical taxon sets (consensus setting)
- most often means < 0.21% of all simulated clades
- overall incidence was 131 of 282 137 clades (< 0.05%)
- empirical results
- both the carnivore and lagomorph MRP supertrees have no unsupported clades whatsoever
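The incidence figures quoted for unsupported clades can be checked with simple arithmetic. The short sketch below recomputes the overall simulated incidence (131 of 282 137 clades) and the earlier carnivore estimate (8 of 198 clades) as percentages; the helper name `percent` is introduced here purely for illustration.

```python
def percent(part, whole):
    """Express part/whole as a percentage."""
    return 100.0 * part / whole


simulated = percent(131, 282137)   # unsupported clades across all simulations
carnivore = percent(8, 198)        # earlier estimate for the carnivore supertree

print(f"simulated incidence: {simulated:.4f}%")
print(f"carnivore estimate:  {carnivore:.1f}%")
```

Both values agree with the bounds given on the slides: the simulated incidence falls below 0.05%, and the carnivore estimate is roughly 4%.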
25Open questions
- loss of contact with raw (character) data
- loss of information
- novel solutions
- data duplication
- the nature of supertrees: consensus or phylogenetic hypothesis?
- analytical issues
26Data duplication
- character data are often recycled between phylogenetic analyses
- e.g., total evidence analyses, molecular studies of the same gene
- the same character data may contribute to more than one source tree
- overrepresented in a supertree analysis → data duplication
- also violates the assumption of data independence
27- data duplication among cetartiodactyl source trees in the Liu et al. (2001) mammal order MRP supertree
- from Gatesy et al. (2002)
28Minimizing duplication
- data duplication is a potential problem for all supertree methods
- use of trees does not directly reveal the source of the underlying data set
- but can be minimized / avoided with careful data collection protocols
- e.g., supertrees of Daubin et al. (2001) and Kennedy and Page (2002) lack data duplication
29Is data duplication unavoidable?
- no phylogenies are independent given a single Tree of Life
- all characters and data sources have been subject to the same evolutionary processes and history
- want to combine phylogenetic hypotheses that can reasonably be viewed as being independent
30Is the problem overrated?
- supertrees combine phylogenetic hypotheses
- emergent property → composed of more than their raw character data
- manipulation of data (weighting, alignment, recoding)
- method and assumptions of analysis
- for example
- strongly conflicting molecular phylogenies for whales can be explained largely by the choice of outgroup (Messenger and McGuire, 1998)
- alignment and weighting of primary data also important
31Is data duplication overrated?
- data duplication is often only partial
- most combined data sets represent unique combinations of individual data sets
- easy to deal with data sets that are supersets of others
- signal enhancement means that each unique combination could justifiably be viewed as an independent hypothesis
- also independent from constituent data sets
32Are supertrees unfairly singled out?
- data duplication also exists in conventional studies (but less obviously so and to a lesser known extent)
- morphological → single features often described by multiple characters
- molecular → secondary structure (e.g., stems in tRNA, protein folding) and codon structure mean primary mutations may require secondary compensatory ones
- total evidence → mixing of phenotypic and genotypic data must represent data duplication at some level
33Open questions
- loss of contact with raw (character) data
- loss of information
- novel solutions
- data duplication
- the nature of supertrees: consensus or phylogenetic hypothesis?
- analytical issues
34The nature of supertrees
- is the supertree itself a legitimate phylogenetic hypothesis?
- many would say no, arguing instead that they are a
- form of consensus
- historical summary of systematic effort
- therefore, supertrees should not be used to answer biological questions
35Supertrees as consensus
- association derives from
- similar methodology (combining trees rather than data)
- both containing polytomies
- resulting topologies may be suboptimal given underlying data
- why are consensus trees not valid phylogenetic hypotheses?
- especially if polytomies viewed as soft rather than hard
36Dealing with incongruence
- all supertree methods must somehow deal with incongruence among source trees
- ignore it: strict consensus, semi-strict, MinCutSupertree, MRC
- fix it: MRF
- explain it biologically: gene tree parsimony
- optimize it: average consensus and MRP
37Incongruence as homoplasy
- a repeated criticism of MRP is that inferred homoplasy on the supertree has no biological meaning
- convergence and reversals meaningless with respect to a membership criterion
- but why is MRP singled out?
- similar arguments should apply at least to average consensus
38Parsimony and parsimony
- Principle of parsimony
- a criterion for deciding among scientific theories or explanations
- "Plurality should not be posited without necessity" → choose the simplest explanation of a phenomenon
- Cladistic parsimony
- specific application of principle of parsimony
- prefer the tree with the fewest evolutionary steps (i.e., character state changes)
- additional changes over minimum number represent homoplasy
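To make "fewest evolutionary steps" concrete, here is a minimal sketch of Fitch's small-parsimony count for a single character on a fixed, rooted binary tree. The nested-tuple tree format and the function name `fitch_steps` are assumptions of this sketch; it scores one given tree and does not search tree space.

```python
def fitch_steps(tree, states):
    """Count the minimum number of state changes for one character.

    tree: a leaf name (str) or a 2-tuple of subtrees, e.g. (("A", "B"), ("C", "D")).
    states: dict mapping each leaf name to its character state.
    """
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):
            return {states[node]}          # a leaf has exactly its observed state
        left, right = node
        a, b = visit(left), visit(right)
        common = a & b
        if common:
            return common                  # children agree: no change needed here
        steps += 1                         # children disagree: one change required
        return a | b

    visit(tree)
    return steps


tree = (("A", "B"), ("C", "D"))
congruent = {"A": "1", "B": "1", "C": "0", "D": "0"}    # fits the tree
conflicting = {"A": "1", "B": "0", "C": "1", "D": "0"}  # cuts across the tree
print(fitch_steps(tree, congruent), fitch_steps(tree, conflicting))
```

A variable binary character needs at least one change on any tree, so the second character's extra step is what cladistic parsimony would label homoplasy. In an MRP analysis the characters are clade memberships from source trees, so the same extra step simply records incongruence between a source tree and the supertree.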
39Homoplasy and supertrees
- notions of homoplasy, convergence, and reversals have nothing to do with parsimony per se
- or really even with cladistic parsimony
- post hoc biological interpretation of incongruence
- incongruence on an MRP supertree is simply incongruence
- idea of homoplasy in this context is epistemologically, not biologically meaningless
40Open questions
- loss of contact with raw (character) data
- loss of information
- novel solutions
- data duplication
- the nature of supertrees: consensus or phylogenetic hypothesis?
- analytical issues
41Limitations of total evidence
- analytical limitations of combined primary data sets also result in a loss of information
- data must be compatible
- use of a single optimization criterion → usually MP, but ML now also possible
- some data still not analyzable under either framework (e.g., DNA-DNA hybridization, morphometric data)
- use of simplistic models of evolution
- MP: differential weighting (including ti:tv ratio)
- ML: same model for every partition
- alignment problems
42Advantages to supertrees
- no loss of information: all phylogenetic hypotheses can be combined
- even those that aren't based on any data
- process amounts to partitioned analyses
- each partition can be analyzed according to the most appropriate model of evolution and optimization criterion
- can be done in parallel
- results then combined with little loss of accuracy
- or hopefully less than the loss of information that a total evidence analysis entails
43A phylogeny of mammals
- The superteam
- have complete supertrees for
- Carnivora
- Chiroptera
- Insectivora
- Lagomorpha
- Marsupialia
- Primates
- total of 1923 species (41.5%)
- Molecular data
- Murphy et al. (2001a)
- 9779 bp from 18 genes for 64 species
- Madsen et al. (2001)
- 8655 bp from 4 genes for 82 species
- Murphy et al. (2001b)
- 16 397 bp from 22 genes for 44 species (< 1%)
44Summary
45Whither supertrees?
- criticisms of supertree construction have been launched at two levels
- at the supertree approach as a whole
- at individual supertree methods
46Of approaches
- supertree problem inherently difficult because of missing data
- results in the lack of a single right answer
47Of approaches
- trees are data
- potential loss of information not detrimental
- key is to think in terms of phylogenetic hypotheses
- still awaiting a response from the cladistic community
48... and methods
- every method will go astray if its assumptions are violated
- e.g., parsimony and long-branch attraction, likelihood and wrong model, regression and data non-independence
- for supertrees, key is to try and establish
- what each method's boundary conditions are
- how robust each method is to violations of its assumptions
- what the properties of each method are (in relation to our desired objective)