Title: Gene tree discordance and multi-species coalescent models
1 Gene tree discordance and multi-species
coalescent models
Noah Rosenberg December 21, 2007
James Degnan
David Bryant
2 Gene trees and species trees
Different genes may produce different inferences
about species relationships
3 Coalescent model for evolution within species,
conditional on the species tree
[Figure labels: T2, T3]
Hudson (1983, Evolution); Tajima (1983, Genetics);
Nei (1987, Molecular Evolutionary Genetics, book);
Pamilo & Nei (1988, Molecular Biology and Evolution);
Takahata (1989, Genetics); Wu (1991, Genetics);
Hudson (1992, Genetics); Maddison (1997, Systematic Biology)
4 Assumptions of the multispecies coalescent model
conditional on a species tree
[Figure labels: T2, T3]
1. Coalescences occur within species, with the
same rate for each lineage pair.
2. The rate of coalescence is proportional to the
number of pairs of lineages.
3. When species splits are encountered, lineages
from all groups descended from the split are
allowed to coalesce.
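Assumptions 1 and 2 can be written as a waiting-time distribution. The block below is a sketch that assumes time is measured in coalescent units, so that each pair of lineages coalesces at rate 1. The symbols $k$, $t$, $\lambda_k$ and $\tau_k$ are introduced here for illustration and do not appear on the slides.

```latex
% k lineages inside one species-tree branch:
% the total coalescence rate equals the number of lineage pairs
\[
  \lambda_k = \binom{k}{2}, \qquad
  f_{\tau_k}(t) = \binom{k}{2}\, e^{-\binom{k}{2} t}, \quad t \ge 0 .
\]
% Special case k = 2 on a branch of length T:
\[
  P(\text{no coalescence within } T) = e^{-T}.
\]
```

The $k = 2$ case is the quantity $e^{-T}$ that appears in the three-species probabilities.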
5 Takahata and Nei (1985, Genetics); Tavaré (1984,
Theoretical Population Biology)
6 Probability of a concordant gene tree topology
Concordant gene tree
Discordant gene tree
The probability of the concordant gene tree is the sum of:
1. The probability that the gene tree is determined in the
2-species phase, or $1 - e^{-T}$
2. 1/3 of the probability that the gene tree is
determined in the ancestral phase, or $(1/3)e^{-T}$
Hudson (1983, Evolution); Nei (1987, Molecular
Evolutionary Genetics); Tajima (1983, Genetics)
7 Probability of the matching gene tree ((AB)C)
Probability of a particular discordant gene tree
((BC)A)
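The two curves named above can be computed directly. This is a minimal sketch, assuming a species tree ((A,B),C) whose internal branch has length T in coalescent units; the function name `three_taxon_probs` is made up for this example.

```python
import math

def three_taxon_probs(T):
    """Gene tree topology probabilities for species tree ((A,B),C).

    T is the internal branch length in coalescent units (an assumption
    of this sketch).
    """
    # Probability that the A and B lineages do NOT coalesce in the
    # 2-species phase, so the topology is decided in the ancestral phase.
    p_ancestral = math.exp(-T)
    return {
        "((AB)C)": (1 - p_ancestral) + p_ancestral / 3,  # matching tree
        "((AC)B)": p_ancestral / 3,                      # discordant
        "((BC)A)": p_ancestral / 3,                      # discordant
    }

probs = three_taxon_probs(1.0)
```

At T = 0 the three topologies are equally likely, and the matching topology gains probability as T grows.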
8 It would be desirable to have a general
computation of the probability that a particular
species tree topology with branch lengths gives
rise to a particular gene tree topology
9 Gene tree probabilities under the multispecies
coalescent model
A coalescent history gives the list of species
tree branches on which gene tree coalescences
occur.
[Figure: two trees, each with tips A, B, C]
Consider a species tree S (topology and branch
lengths)
Consider a gene tree G (topology only)
J. H. Degnan & L. A. Salter, Evolution 59: 24-37 (2005)
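Because every realization of G on S follows exactly one coalescent history, the gene tree probability can be written as a sum over histories. The notation below ($P_S$, $H_S(G)$, $h$) is chosen for this sketch and may differ from the notation in the cited paper.

```latex
\[
  P_S(G) \;=\; \sum_{h \in H_S(G)} P_S(G, h)
\]
% Three-taxon check: S = ((A,B),C) with internal branch length T, G = ((AB)C).
% There are two coalescent histories for the (AB) coalescence.
\[
  P_S(G) \;=\;
  \underbrace{1 - e^{-T}}_{\text{(AB) on the internal branch}}
  \;+\;
  \underbrace{\tfrac{1}{3}\, e^{-T}}_{\text{(AB) above the root}}
  \;=\; 1 - \tfrac{2}{3}\, e^{-T}.
\]
```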
10 The list of coalescent histories for an example
with five taxa
[Figure: a gene tree and a species tree on taxa A, B, C, D, E, with branches numbered 1 to 4]
12 The number of coalescent histories
13 The number of coalescent histories for the
matching gene tree
14 The number of coalescent histories for trees with
at most 5 taxa
15 Number of coalescent histories for special shapes
with n taxa
16 The number of coalescent histories for up to 11
taxa
17 Ratio of the largest and smallest number of
coalescent histories for n taxa
18 Which types of shapes have the most coalescent
histories?
[Figure: the number of coalescent histories for trees with
8 taxa; labels: Most, Least]
19 Caterpillar-like shapes with n taxa, based on 4-
and 5-taxon subtrees
$C_{n-1}$
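For a gene tree that matches the species tree, the number of coalescent histories can be counted with a short recursion over subtrees. The sketch below assumes that $C_{n-1}$ above denotes a Catalan number and checks the caterpillar counts against it. Trees are nested tuples with string leaves; all function names are made up for this example.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def extended_histories(tree, m):
    """Count coalescent histories for a matching gene tree when the branch
    above the root of `tree` is divided into m segments."""
    if isinstance(tree, str):          # a single leaf: nothing to coalesce
        return 1
    left, right = tree
    # The root coalescence falls in segment k (1 = closest to the root node).
    # Each child's own root coalescence can then use the child's branch or
    # segments 1..k above it, i.e. k + 1 segments in total.
    return sum(
        extended_histories(left, k + 1) * extended_histories(right, k + 1)
        for k in range(1, m + 1)
    )

def matching_histories(tree):
    """Coalescent histories when gene tree and species tree share `tree`."""
    return extended_histories(tree, 1)

def caterpillar(n):
    """Caterpillar topology on n >= 2 taxa, e.g. ((("t1","t2"),"t3"),"t4")."""
    tree = ("t1", "t2")
    for i in range(3, n + 1):
        tree = (tree, f"t{i}")
    return tree

def catalan(k):
    return comb(2 * k, k) // (k + 1)

for n in range(2, 10):
    print(n, matching_histories(caterpillar(n)), catalan(n - 1))
```

Under this recursion the 4-taxon symmetric shape `(("A","B"),("C","D"))` has 4 histories, while the 4-taxon caterpillar has 5.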
20 Largest values for caterpillar-like shapes based
on 7- and 8-taxon subtrees
21 Can a non-matching gene tree have more coalescent
histories?
Caterpillar species tree
1430 coalescent histories
1441 coalescent histories
22 Computing the probabilities of gene trees
What are the properties of the number of
coalescent histories?
23 For n > 3 taxa, can species trees be discordant
with the gene trees they are most likely to
produce?
24 The labeled history for a gene tree is its
sequence of coalescence events.
The two labeled histories below produce the same
labeled topology ((AB)(CD))
Randomly joining pairs of lineages leads to a
uniform distribution over the set of possible
labeled histories.
The number of labeled histories possible for four
taxa is $\binom{4}{2}\binom{3}{2}\binom{2}{2} = 18$.
25 If the branch lengths of the species tree are
sufficiently short, coalescences will occur more
anciently than the species tree root.
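When all coalescences occur above the species tree root, the gene tree is formed by randomly joining pairs of lineages, so its topology probability is proportional to its number of labeled histories. The sketch below enumerates every labeled history for four taxa by brute force. Topologies are represented as nested `frozenset` pairs, a choice made only for this example.

```python
from collections import Counter

def labeled_histories(lineages):
    """Yield the final topology once for every ordered sequence of joins."""
    if len(lineages) == 1:
        yield lineages[0]
        return
    for i in range(len(lineages)):
        for j in range(i + 1, len(lineages)):
            merged = frozenset([lineages[i], lineages[j]])   # unordered pair
            rest = [x for k, x in enumerate(lineages) if k not in (i, j)]
            yield from labeled_histories(rest + [merged])

# Count how many labeled histories lead to each labeled topology.
counts = Counter(labeled_histories(["A", "B", "C", "D"]))
total = sum(counts.values())

symmetric = frozenset([frozenset(["A", "B"]), frozenset(["C", "D"])])
asymmetric = frozenset([frozenset([frozenset(["A", "B"]), "C"]), "D"])
print(total, counts[symmetric], counts[asymmetric])
```

The symmetric topology ((AB)(CD)) is reached by two labeled histories and each asymmetric topology by one, so under random joining the symmetric topology has probability 2/18 and (((AB)C)D) has probability 1/18.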
26 Gene tree frequency distribution
((AB)(CD))   0.132
((AC)(BD))   0.094
((AD)(BC))   0.094
(((AB)C)D)   0.125
(((AB)D)C)   0.100
(((AC)B)D)   0.070
(((AC)D)B)   0.062
(((AD)B)C)   0.032
(((AD)C)B)   0.032
(((BC)A)D)   0.070
(((BC)D)A)   0.062
(((BD)A)C)   0.032
(((BD)C)A)   0.032
(((CD)A)B)   0.032
(((CD)B)A)   0.032
Species tree
Matching gene tree
27 Species tree is (((AB)C)D) but most likely gene
tree is ((AB)(CD))
[Figure labels: T2, T3]
Species tree is (((AB)C)D)
Most likely gene tree is not (((AB)C)D)
A species tree topology produces anomalous gene
trees (AGTs) if branch lengths can be chosen so that the
most likely gene tree topology differs from the
species tree topology.
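The definition can be checked empirically by simulating gene trees under the multispecies coalescent. The following is a rough sketch, not the exact computation used in the talk. It assumes one sampled lineage per species, branch lengths in coalescent units, and hypothetical internal branch lengths of 0.05. A species tree node is either a leaf name or a tuple `(left, right, length_of_branch_above)`, with `None` for the root branch.

```python
import random
from collections import Counter

def join(a, b):
    """Unordered pair of subtrees; leaves are strings."""
    return frozenset([a, b])

def coalesce_on_branch(lineages, length, rng):
    """Run the coalescent on one species branch; length None = root branch."""
    lineages = list(lineages)
    remaining = length
    while len(lineages) > 1:
        k = len(lineages)
        wait = rng.expovariate(k * (k - 1) / 2)   # rate = number of pairs
        if remaining is not None:
            if wait > remaining:
                break                # lineages leave the branch uncoalesced
            remaining -= wait
        i, j = rng.sample(range(k), 2)            # each pair equally likely
        merged = join(lineages[i], lineages[j])
        lineages = [x for idx, x in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
    return lineages

def simulate_gene_tree(node, rng):
    """Return the lineages that exit the top of the branch above `node`."""
    if isinstance(node, str):
        return [node]
    left, right, length = node
    entering = simulate_gene_tree(left, rng) + simulate_gene_tree(right, rng)
    return coalesce_on_branch(entering, length, rng)

def topology_frequencies(species_tree, n_loci, seed=0):
    rng = random.Random(seed)
    counts = Counter(simulate_gene_tree(species_tree, rng)[0]
                     for _ in range(n_loci))
    return {topo: c / n_loci for topo, c in counts.items()}

# Species tree (((A,B),C),D) with two short internal branches (assumed values).
species_tree = ((("A", "B", 0.05), "C", 0.05), "D", None)
freqs = topology_frequencies(species_tree, 20000)
matching = join(join(join("A", "B"), "C"), "D")
symmetric = join(join("A", "B"), join("C", "D"))
print(freqs[matching], freqs[symmetric])
```

With branches this short, the simulated frequency of ((AB)(CD)) should exceed that of the matching topology (((AB)C)D), which is the anomalous behavior described above.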
28 Does the 4-taxon symmetric species tree topology
produce anomalous gene trees?
29 - 3 species: no anomalous gene trees.
- 4 species: asymmetric but not symmetric species
trees have AGTs.
- 5 or more species?
Probability of the concordant gene tree
Probability of a particular discordant gene tree
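The three-species case follows from comparing the two probabilities directly. The short derivation below assumes T denotes the internal branch length of the three-species tree in coalescent units.

```latex
\[
  P(\text{concordant}) - P(\text{particular discordant})
  = \Bigl(1 - \tfrac{2}{3} e^{-T}\Bigr) - \tfrac{1}{3} e^{-T}
  = 1 - e^{-T} \;\ge\; 0
  \qquad \text{for all } T \ge 0 ,
\]
% with equality only at T = 0, so no branch length makes a discordant
% topology strictly more probable than the matching one.
```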
30 With 5 or more species, any species tree topology
produces at least one anomalous gene tree.
For n > 4, suppose a species tree topology is not
n-maximally probable. If its branches are short
enough, it produces AGTs that are n-maximally
probable.
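The sketch below assumes that "n-maximally probable" refers to labeled topologies with the largest number of labeled histories among n-taxon topologies, which is how the term is generally used in the anomalous gene tree literature; the slides do not define it. It counts labeled histories with the standard formula (number of internal nodes)! divided by the product, over internal nodes, of the number of internal nodes in each node's subtree. Trees are nested tuples with string leaves.

```python
from math import factorial

def _walk(tree):
    """Return (internal nodes in subtree, product of those counts over nodes)."""
    if isinstance(tree, str):
        return 0, 1
    left, right = tree
    n_left, prod_left = _walk(left)
    n_right, prod_right = _walk(right)
    n_internal = n_left + n_right + 1
    return n_internal, prod_left * prod_right * n_internal

def count_labeled_histories(tree):
    """Number of orderings of coalescences compatible with `tree`."""
    n_internal, prod = _walk(tree)
    return factorial(n_internal) // prod

# The three unlabeled shapes for five taxa, with one labeling each.
five_taxon_shapes = {
    "((((AB)C)D)E)": (((("A", "B"), "C"), "D"), "E"),
    "(((AB)(CD))E)": ((("A", "B"), ("C", "D")), "E"),
    "(((AB)C)(DE))": ((("A", "B"), "C"), ("D", "E")),
}
for name, tree in five_taxon_shapes.items():
    print(name, count_labeled_histories(tree))
```

Under this assumption, the five-taxon shape with the most labeled histories is (((AB)C)(DE)), and the caterpillar has the fewest.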
31 With 5 or more species, any species tree topology
produces at least one anomalous gene tree.
Proof (continued)
Suppose a species tree topology is n-maximally
probable.
For n > 8 an inductive argument reduces the
problem to the case of n = 5, 6, 7, or 8.
For n = 5, 6, 7, or 8 taxa it remains to show that
the n-maximally probable species tree topologies
produce AGTs.
32 With 5 or more species, any species tree topology
produces at least one anomalous gene tree.
Proof (continued)
For n = 5 the n-maximally probable species tree
topology produces AGTs.
33 With 5 or more species, any species tree topology
produces at least one anomalous gene tree.
Proof (continued)
For n = 5, 6, 7, or 8 the n-maximally probable
species tree topologies produce AGTs.
34 With 5 or more species, any species tree topology
produces at least one anomalous gene tree.
Proof (continued)
An inductive argument for n > 8 reduces the
problem to the case of n = 5, 6, 7, or 8.
For n > 8 one of the two most basal subtrees has
between 5 and n-1 taxa inclusive.
Choose branch lengths to produce an AGT for that
subtree, and make them long for the other subtree.
35 With 5 or more species, any species tree topology
produces at least one anomalous gene tree.
Proof (summary)
If the species tree topology is not n-maximally
probable, it has maximally probable AGTs.
By example, n-maximally probable species tree
topologies produce AGTs for n = 5, 6, 7, or 8.
For n > 8, induction reduces the problem to the
case of n = 5, 6, 7, or 8.
This completes the proof.
36 Some properties of anomalous gene trees
37 Anomalous gene trees can have the same unlabeled
shape as the species tree
[Figure labels: Species tree, Gene tree]
38 There exist mutually anomalous sets of tree
topologies (wicked forests).
39 AGTs can occur if some but not all species tree
branches are short
[Figure labels: T3, T4, T2]
40 Does the severity of AGTs increase with more taxa?
Maximal value for shared branch length that
still produces AGTs: 0.1568
41 Does the severity of AGTs increase with more taxa?
42 Number of AGTs for the 4-taxon asymmetric species
tree
43 Number of AGTs for 5-taxon species trees
44 Does the number of AGTs increase with more taxa?
45 What implications do gene tree probabilities have
for phylogenetic inference algorithms?
46 - Most commonly observed gene tree topology
Statistically inconsistent in estimating the
species tree
Species tree
Estimated species tree
47 - Estimated gene tree of concatenated sequence
Statistically inconsistent in estimating the
species tree
48 - Maximum likelihood based on the frequency
distribution of gene tree topologies
Statistically consistent even when anomalous gene
trees exist
Gene tree frequency distribution
Anomalous gene tree
((AB)(CD))   0.132
((AC)(BD))   0.094
((AD)(BC))   0.094
(((AB)C)D)   0.125
(((AB)D)C)   0.100
(((AC)B)D)   0.070
(((AC)D)B)   0.062
(((AD)B)C)   0.032
(((AD)C)B)   0.032
(((BC)A)D)   0.070
(((BC)D)A)   0.062
(((BD)A)C)   0.032
(((BD)C)A)   0.032
(((CD)A)B)   0.032
(((CD)B)A)   0.032
Species tree
Matching gene tree
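The likelihood idea can be illustrated on the smallest case. The sketch below is a three-taxon toy example only (three taxa have no anomalous gene trees), with made-up topology counts and a simple grid search over the internal branch length T in coalescent units. It is not the procedure used for the four-taxon distribution above, and every name in it is hypothetical.

```python
import math

def triple_probabilities(T):
    """(P(matching topology), P(each discordant topology)) for 3 taxa."""
    discordant = math.exp(-T) / 3
    return 1 - 2 * discordant, discordant

def log_likelihood(counts, species_topology, T):
    """Multinomial log-likelihood of observed gene tree topology counts."""
    p_conc, p_disc = triple_probabilities(T)
    return sum(
        n * math.log(p_conc if topo == species_topology else p_disc)
        for topo, n in counts.items()
    )

def ml_species_tree(counts, grid):
    """Search all candidate topologies and grid values of T."""
    candidates = ((topo, T) for topo in counts for T in grid)
    return max(candidates, key=lambda c: log_likelihood(counts, c[0], c[1]))

# Hypothetical counts of gene tree topologies observed across 100 loci.
counts = {"((AB)C)": 70, "((AC)B)": 16, "((BC)A)": 14}
grid = [i / 100 for i in range(0, 501)]       # T from 0.00 to 5.00
best_topology, best_T = ml_species_tree(counts, grid)
print(best_topology, best_T)
```

The same structure carries over to more taxa once a function for gene tree probabilities on a species tree is available in place of `triple_probabilities`.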
49 - Consensus among gene tree topologies
  - Majority rule consensus
  - Greedy consensus
  - Rooted triple consensus (R*)
50 - Tree obtained by agglomeration using minimum
pairwise coalescence times across a large number
of loci (Glass tree)
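A minimal sketch of this agglomeration follows. It assumes single-linkage clustering, where the distance between two clusters is the smallest of the pairwise minima, and uses invented coalescence times for two loci. The function names and the input layout are hypothetical.

```python
from itertools import combinations

def pair(a, b):
    return frozenset([a, b])

def pairwise_minima(loci):
    """loci: list of dicts mapping a species pair to its coalescence time
    at that locus. Returns the minimum time per pair across loci."""
    minima = {}
    for locus in loci:
        for species_pair, t in locus.items():
            minima[species_pair] = min(t, minima.get(species_pair, float("inf")))
    return minima

def agglomerate(species, minima):
    """Repeatedly merge the two clusters with the smallest minimum
    pairwise coalescence time (single linkage, an assumption here)."""
    clusters = {frozenset([s]): s for s in species}   # members -> subtree

    def distance(a, b):
        return min(minima[pair(x, y)] for x in a for y in b)

    while len(clusters) > 1:
        a, b = min(combinations(list(clusters), 2),
                   key=lambda ab: distance(*ab))
        clusters[a | b] = (clusters.pop(a), clusters.pop(b))
    return next(iter(clusters.values()))

# Invented pairwise coalescence times for two loci.
locus1 = {pair("A", "B"): 0.6, pair("A", "C"): 1.5, pair("B", "C"): 1.5,
          pair("A", "D"): 2.5, pair("B", "D"): 2.5, pair("C", "D"): 2.5}
locus2 = {pair("B", "C"): 1.1, pair("A", "B"): 1.4, pair("A", "C"): 1.4,
          pair("A", "D"): 2.0, pair("B", "D"): 2.0, pair("C", "D"): 2.0}

minima = pairwise_minima([locus1, locus2])
tree = agglomerate(["A", "B", "C", "D"], minima)
print(tree)
```

In this toy input the two loci have different gene tree topologies, and the result groups A with B first because that pair has the smallest minimum coalescence time.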
51 Summary
There exist algorithms for computing gene tree
probabilities on species trees
The number of coalescent histories increases
quickly - algorithmic improvements in gene tree
probability computations are likely possible
A species tree can disagree with the gene tree
that it is most likely to produce
This severe discordance only gets worse with more
taxa
HOWEVER, some algorithms can infer the correct
species tree even when gene tree discordance is
extreme
52 Acknowledgments
David Bryant, Mike DeGiorgio, James Degnan, Randa Tao
National Science Foundation DEB-0716904