Title: Tree searches and tests for phylogenetic signal
1Tree-searches and tests for phylogenetic signal
2Many issues glossed over
- What if characters disagree?
- How is the tree score determined?
- How can we root the trees?
- How do we find the optimal tree?
- How can we evaluate the robustness of our
conclusions?
3Terminology
Polytomy
Binary/dichotomous/fully-resolved
Polytomous/unresolved
4Unrooted trees
Polytomy
5Number of unrooted fully resolved trees for t taxa
$N_{\text{unrooted}}(t) = \prod_{i=3}^{t} (2i - 5)$
6How many places can you add another taxon?
- Two taxa
- Three taxa
- Four taxa
- Five taxa
7What is the number of rooted trees?
- The root is just one more taxon: use the same formula, but with t = number of taxa + 1
8The number of trees gets big
Number of tips    Number of binary unrooted trees
3                 1
4                 3
5                 15
6                 105
10                2,027,025
20                2.2 x 10^20
50                2.8 x 10^74
500               1 x 10^1277
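The counts in this table follow from the product formula for unrooted trees, and the rooted count reuses it with one extra taxon. A minimal Python sketch reproduces them; the function names `n_unrooted` and `n_rooted` are illustrative and not taken from any package, and very large values are reported only by order of magnitude.

```python
from math import prod


def n_unrooted(t: int) -> int:
    """Fully resolved unrooted trees for t taxa: product of (2i - 5) for i = 3..t."""
    if t < 3:
        raise ValueError("the formula applies to three or more taxa")
    return prod(2 * i - 5 for i in range(3, t + 1))


def n_rooted(t: int) -> int:
    """Rooted trees: the root behaves like one additional taxon."""
    return n_unrooted(t + 1)


for tips in (3, 4, 5, 6, 10, 20, 50, 500):
    count = n_unrooted(tips)
    digits = len(str(count))
    # Exact value for small counts, order of magnitude for the rest.
    shown = f"{count:,}" if digits <= 7 else f"about 10^{digits - 1}"
    print(f"{tips:>4} tips: {shown}")
```

Python integers are arbitrary precision, so the exact count is available even for 500 tips; only the printing is abbreviated.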
10How do you find the optimal tree?
- Exhaustive (
- Branch-and-bound (
  - Obtain the length of a random tree (initial upper bound)
  - As trees are built, determine their length
  - If the length exceeds the upper bound, then that tree and all its descendant trees are ignored
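The bounding idea above can be sketched in Python. This is an illustration under stated assumptions, not how any particular program implements it: tree length is computed with Fitch parsimony (unordered states, `?` treated as any state), trees are grown by attaching one taxon at a time to every branch, the initial upper bound comes from one arbitrary complete tree rather than a random one, and the matrix is the five-taxon example from the homoplasy slides. All function names are hypothetical.

```python
ALL_STATES = frozenset("ACGT")

MATRIX = {
    "Taxon 1": "ACATTTA",
    "Taxon 2": "ACGATTA",
    "Taxon 3": "AGGATAG",
    "Taxon 4": "GAAAAC?",
    "Taxon 5": "GATA?CG",
}


def fitch(tree, states):
    """Return (state set, steps) for one character on a rooted binary tree."""
    if isinstance(tree, str):  # a leaf is a taxon name
        s = states[tree]
        return (ALL_STATES if s == "?" else frozenset(s)), 0
    (ls, ln), (rs, rn) = fitch(tree[0], states), fitch(tree[1], states)
    shared = ls & rs
    return (shared, ln + rn) if shared else (ls | rs, ln + rn + 1)


def tree_length(tree, matrix):
    """Sum of Fitch steps over all characters (taxa absent from the tree are ignored)."""
    n_chars = len(next(iter(matrix.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in matrix.items()})[1]
        for i in range(n_chars)
    )


def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one branch of `tree`."""
    yield (tree, leaf)  # branch above this node
    if not isinstance(tree, str):
        left, right = tree
        for t in insertions(left, leaf):
            yield (t, right)
        for t in insertions(right, leaf):
            yield (left, t)


def branch_and_bound(matrix):
    taxa = list(matrix)
    # Assumption: the initial upper bound is the length of one arbitrary complete tree.
    start = (taxa[1], taxa[2])
    for leaf in taxa[3:]:
        start = (start, leaf)
    best = {"length": tree_length((taxa[0], start), matrix), "trees": []}

    def extend(rest, k):
        length = tree_length((taxa[0], rest), matrix)
        if length > best["length"]:
            return  # bound exceeded: skip this tree and all its descendants
        if k == len(taxa):  # complete tree
            if length < best["length"]:
                best["length"], best["trees"] = length, []
            best["trees"].append((taxa[0], rest))
            return
        for bigger in insertions(rest, taxa[k]):
            extend(bigger, k + 1)

    extend((taxa[1], taxa[2]), 3)
    return best["length"], best["trees"]


length, trees = branch_and_bound(MATRIX)
print(length, len(trees))
```

Pruning is safe here because adding a taxon can never shorten a Fitch-parsimony tree, so a partial tree that already exceeds the bound cannot lead to an optimal complete tree.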
11How do you find the optimal tree?
- Exhaustive (
- Branch-and-bound (
- Heuristic search (unlimited?)
12Heuristic searches
- Search for optimal trees by finding good trees
and then rearranging them in the hopes of finding
an even better tree
13Getting starting trees
- Random tree - not done
- User tree (e.g., a NJ tree)
- Build a tree by adding taxa to the location that is optimal
  - Can hold more than one tree at each step
14Taxon addition order
- As-is
  - In the order of the matrix (not done for parsimony)
- Simple taxon addition
  - Use a distance algorithm to decide order
- Closest taxon addition
  - Add the taxon that makes the optimal tree
- Random taxon addition order
  - Repeat many times
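Stepwise addition with as-is and random addition orders can be sketched as follows. The sketch assumes Fitch parsimony as the optimality criterion (`?` treated as any state), keeps only one tree at each step even though several can be held, and uses the five-taxon matrix from the homoplasy slides. Function names and the `replicates` and `seed` parameters are hypothetical.

```python
import random

ALL_STATES = frozenset("ACGT")

MATRIX = {
    "Taxon 1": "ACATTTA",
    "Taxon 2": "ACGATTA",
    "Taxon 3": "AGGATAG",
    "Taxon 4": "GAAAAC?",
    "Taxon 5": "GATA?CG",
}


def fitch(tree, states):
    """Return (state set, steps) for one character on a rooted binary tree."""
    if isinstance(tree, str):
        s = states[tree]
        return (ALL_STATES if s == "?" else frozenset(s)), 0
    (ls, ln), (rs, rn) = fitch(tree[0], states), fitch(tree[1], states)
    shared = ls & rs
    return (shared, ln + rn) if shared else (ls | rs, ln + rn + 1)


def tree_length(tree, matrix):
    n_chars = len(next(iter(matrix.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in matrix.items()})[1]
        for i in range(n_chars)
    )


def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one branch of `tree`."""
    yield (tree, leaf)
    if not isinstance(tree, str):
        left, right = tree
        for t in insertions(left, leaf):
            yield (t, right)
        for t in insertions(right, leaf):
            yield (left, t)


def stepwise_addition(matrix, order):
    """Add taxa in `order`, each time at the branch giving the shortest partial tree."""
    first, second, third, *remaining = order
    rest = (second, third)  # with `first` attached this is the only three-taxon tree
    for leaf in remaining:
        rest = min(
            insertions(rest, leaf),
            key=lambda cand: tree_length((first, cand), matrix),
        )
    tree = (first, rest)
    return tree, tree_length(tree, matrix)


def random_addition_search(matrix, replicates=10, seed=0):
    """Repeat stepwise addition with shuffled taxon orders and keep the shortest tree."""
    rng = random.Random(seed)
    best = None
    for _ in range(replicates):
        order = list(matrix)
        rng.shuffle(order)
        tree, length = stepwise_addition(matrix, order)
        if best is None or length < best[1]:
            best = (tree, length)
    return best


print(stepwise_addition(MATRIX, list(MATRIX)))  # as-is order
print(random_addition_search(MATRIX))
```

Because each step is greedy, different addition orders can end on different trees, which is the reason for repeating the random order many times.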
15Heuristic search
Suboptimal island of trees
Global optimum
Starting trees
Treespace
16Branch swapping
- Nearest-neighbour interchange (NNI)
17Branch swapping
- Subtree pruning and regrafting (SPR)
18Branch swapping
- Tree-bisection reconnection (TBR)
19Many issues glossed over
- What if characters disagree?
- How is the tree score determined?
- How can we root the trees?
- How do we find the optimal tree?
- How can we evaluate the robustness of our
conclusions?
20Even if the shortest tree is the best estimate of the true tree, the true tree might not be the shortest
We should consider suboptimal trees
We should use statistical tests to help us determine what to actually believe
21Questions we can ask
- Are the data random or do they have signal?
- How much homoplasy is there?
- To what extent are particular elements of the trees (clades) supported?
- What alternative results can we reject?
22How can we evaluate the reliability of the
tree(s) we obtain?
- Is there agreement within the data?
23The logic of looking at consistency indices
- If all the characters have the same signal then the tree is more trustworthy
- The more agreement there is, the less homoplasy (more consistency) the characters will show on the most parsimonious tree
- We need statistics to measure consistency
24How much homoplasy is there?
- Taxon 1 A C A T T T A
- Taxon 2 A C G A T T A
- Taxon 3 A G G A T A G
- Taxon 4 G A A A A C ?
- Taxon 5 G A T A ? C G
- Obs L 1 2 3 1 1 2 1
- Min L 1 2 2 1 1 2 1
Minimum length overall = 10; length of MP tree = 11
25Consistency index
$CI = \frac{\text{Min L}}{\text{Obs L}} = \frac{10}{11} = 0.91$
Homoplasy index
$HI = 1 - CI = 0.09$
26How much homoplasy is there?
- Taxon 1 A C A T T T A
- Taxon 2 A C G A T T A
- Taxon 3 A G G A T A G
- Taxon 4 G A A A A C ?
- Taxon 5 G A T A ? C G
- Obs L 1 2 3 1 1 2 1
- Min L 1 2 2 1 1 2 1
- CI 1 1 0.67 1 1 1 1
27CI is affected by uninformative characters
- Taxon 1 A C A T T T A
- Taxon 2 A C G A T T A
- Taxon 3 A G G A T A G
- Taxon 4 G A A A A C ?
- Taxon 5 G A T A ? C G
- Min L 1 2 2 1 1 2 1
- CI 1 1 0.67 1 1 1 1
Characters 4 and 5 are parsimony-uninformative; excluding them gives minimum length overall = 8, length of MP tree = 9, CI = 8/9 = 0.89
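The two ensemble CI values (all characters versus informative characters only) can be checked with a short Python sketch. It assumes unordered characters, ignores `?` when counting states, takes the observed steps per character from the slides rather than computing them from a tree, and treats a character as informative when at least two states each occur in at least two taxa. The helper names are illustrative.

```python
from collections import Counter

MATRIX = ["ACATTTA", "ACGATTA", "AGGATAG", "GAAAAC?", "GATA?CG"]  # Taxon 1..5
OBS_L = [1, 2, 3, 1, 1, 2, 1]  # steps per character on the MP tree, from the slides


def scored(column):
    """States of one character with missing data removed."""
    return [s for s in column if s != "?"]


def min_steps(column):
    """Fewest possible steps: number of distinct states minus one."""
    return len(set(scored(column))) - 1


def is_informative(column):
    """At least two states that each occur in two or more taxa."""
    counts = Counter(scored(column))
    return sum(1 for n in counts.values() if n >= 2) >= 2


def consistency_index(min_l, obs_l):
    return sum(min_l) / sum(obs_l)


columns = list(zip(*MATRIX))
min_l = [min_steps(c) for c in columns]
per_char_ci = [round(m / o, 2) for m, o in zip(min_l, OBS_L)]
ci_all = consistency_index(min_l, OBS_L)
keep = [i for i, c in enumerate(columns) if is_informative(c)]
ci_informative = consistency_index([min_l[i] for i in keep], [OBS_L[i] for i in keep])

print(min_l)                     # [1, 2, 2, 1, 1, 2, 1]
print(per_char_ci)               # [1.0, 1.0, 0.67, 1.0, 1.0, 1.0, 1.0]
print(round(ci_all, 2))          # 0.91
print(round(1 - ci_all, 2))      # 0.09 (homoplasy index)
print(round(ci_informative, 2))  # 0.89
```

Uninformative characters always fit any tree perfectly, so including them inflates the ensemble CI.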
28Retention Index
- Taxon 1 A C A T T T A
- Taxon 2 A C G A T T A
- Taxon 3 A G G A T A G
- Taxon 4 G A A A A C ?
- Taxon 5 G A T A ? C G
- Min L 1 2 2 1 1 2 1
- Max L 2 3 3 1 1 3 2
Maximum length overall = 15
29Retention index (RI)
$RI = \frac{\text{Max L} - \text{Obs L}}{\text{Max L} - \text{Min L}} = \frac{15 - 11}{15 - 10} = \frac{4}{5} = 0.80$
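The maximum length per character and the retention index can be sketched in Python from the totals given on these slides (Max L = 15, Obs L = 11, Min L = 10). The sketch assumes unordered characters, so the maximum number of steps for a character is the number of scored taxa minus the count of its most common state, and it ignores `?`. The helper names are illustrative.

```python
from collections import Counter

MATRIX = ["ACATTTA", "ACGATTA", "AGGATAG", "GAAAAC?", "GATA?CG"]  # Taxon 1..5
OBS_L = [1, 2, 3, 1, 1, 2, 1]  # steps per character on the MP tree, from the slides


def scored(column):
    """States of one character with missing data removed."""
    return [s for s in column if s != "?"]


def min_steps(column):
    return len(set(scored(column))) - 1


def max_steps(column):
    """Most steps on any tree: every taxon lacking the commonest state costs one step."""
    states = scored(column)
    return len(states) - max(Counter(states).values())


def retention_index(max_l, obs_l, min_l):
    return (max_l - obs_l) / (max_l - min_l)


columns = list(zip(*MATRIX))
min_l = [min_steps(c) for c in columns]
max_l = [max_steps(c) for c in columns]
print(max_l, sum(max_l))  # [2, 3, 3, 1, 1, 3, 2] 15

ri = retention_index(sum(max_l), sum(OBS_L), sum(min_l))
print(round(ri, 2))
```

Unlike CI, RI is unchanged by uninformative characters, because for them Max L equals Min L and they add the same amount to every term.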
30General trends observed with CI/RIs
- Strong negative correlation between taxon number and CI/RI
- Data sets with few characters can show unexpectedly high CI/RI
31How can we evaluate the significance of CI/RI?
- CI depends directly on tree length
- We can compare the observed tree length with the length we would obtain if there were no phylogenetic signal
The permutation tail probability (PTP) test
32Permuting data removes phylogenetic signal
- Taxon 1 ACATTTA
- Taxon 2 ACGATTA
- Taxon 3 AGGATAG
- Taxon 4 GAAAAC?
- Taxon 5 GATA?CG
Permuted data sets
- Taxon 1 GAAA?AA
- Taxon 2 ACAATC?
- Taxon 3 GAGTATG
- Taxon 4 AGTATCG
- Taxon 5 ACGATTA
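The permutation step can be sketched in Python: the states of each character are shuffled among taxa independently, which keeps every character's state frequencies but breaks the covariation among characters. The `ptp` helper assumes the usual definition of the PTP as the proportion of data sets (permuted plus original) whose shortest tree is as short as or shorter than the original; finding the shortest tree for each permuted matrix is left out. All names are hypothetical.

```python
import random

MATRIX = {
    "Taxon 1": "ACATTTA",
    "Taxon 2": "ACGATTA",
    "Taxon 3": "AGGATAG",
    "Taxon 4": "GAAAAC?",
    "Taxon 5": "GATA?CG",
}


def permute_characters(matrix, rng):
    """Shuffle the states of each character (column) independently among taxa."""
    taxa = list(matrix)
    columns = [list(col) for col in zip(*matrix.values())]
    for col in columns:
        rng.shuffle(col)
    return {
        taxon: "".join(col[i] for col in columns)
        for i, taxon in enumerate(taxa)
    }


def ptp(observed_length, permuted_lengths):
    """Proportion of data sets (permuted plus original) at least as short as the original."""
    as_short = sum(1 for length in permuted_lengths if length <= observed_length)
    return (as_short + 1) / (len(permuted_lengths) + 1)


rng = random.Random(42)  # fixed seed only so the sketch is repeatable
for taxon, seq in permute_characters(MATRIX, rng).items():
    print(taxon, seq)
```

With 99 permuted data sets that are all longer than the original, `ptp` returns 1/100 = 0.01, the smallest value that number of replicates can give.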
33Example with signal
Tree length    Number of replicates
1222           1
1669           1
1671           1
1672           1
1673           1
1674           1
1675           2
1676           2
1678           1
1679           2
1680           4
1681           5
1682           8
1683           4
1684           4
1685           2
1686           8
1687           7
1688           6
1689           8
1690           6
1691           3
1692           2
1693           3
1694           3
1695           3
1696           3
1697           2
1699           2
1702           1
1704           2
1705           1
34Example without signal
Tree length    Number of replicates
1924           3
1926           1
1927           4
1928           1
1929           2
1930           8
1931           6
1932           5
1933           4
1934           4
1935           5
1936           1
1937           8
1938           11
1939           7
1940           6
1941           7
1942           4
1943           2
1944           1
1945           1
1946           1
1947           1
1950           3
1952           1
1953           1
1955           1
1958           1
35The PTP test is slow
- Hillis and Huelsenbeck (1991) observed that the shape of the tree-length distribution depends on the amount of phylogenetic signal
36A data set without signal
mean = 599.182107, sd = 4.944738, g1 = -0.150922
Tree-length histogram (axis starts at 582.00000; number of trees in parentheses)
583.80000 (5)
585.60000 (25)
587.40000 (71)
589.20000 (209)
591.00000 (161)
592.80000 (521)
594.60000 (883)
596.40000 (1132)
598.20000 (1469)
600.00000 (788)
601.80000 (1631)
603.60000 (1486)
605.40000 (1047)
607.20000 (567)
609.00000 (157)
610.80000 (171)
612.60000 (57)
614.40000 (11)
616.20000 (3)
618.00000 (1)
37A data set with signal
mean = 611.572872, sd = 31.049455, g1 = -0.942643
Tree-length histogram (axis starts at 501.00000; number of trees in parentheses)
508.65000 (15)
516.30000 (60)
523.95000 (84)
531.60000 (135)
539.25000 (21)
546.90000 (26)
554.55000 (96)
562.20000 (166)
569.85000 (290)
577.50000 (737)
585.15000 (1118)
592.80000 (665)
600.45000 (120)
608.10000 (268)
615.75000 (497)
623.40000 (796)
631.05000 (1337)
638.70000 (2031)
646.35000 (1610)
654.00000 (323)
38Skewness test for phylogenetic signal
- Hillis and Huelsenbeck (1991) generated random data for different numbers of taxa/characters to find the null distribution of g1 scores
- One can compare observed g1 statistics with this null distribution
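The g1 statistic is the skewness of the distribution of tree lengths. A minimal Python sketch follows, assuming the plain moment-based form (third central moment divided by the 1.5 power of the second central moment); a given program may apply a sample-size correction, so small differences from reported values are possible. The function name is illustrative.

```python
def g1(tree_lengths):
    """Moment-based skewness of a collection of tree lengths."""
    n = len(tree_lengths)
    mean = sum(tree_lengths) / n
    m2 = sum((x - mean) ** 2 for x in tree_lengths) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in tree_lengths) / n  # third central moment
    if m2 == 0:
        raise ValueError("all tree lengths are identical; skewness is undefined")
    return m3 / m2 ** 1.5


# Toy inputs, not real data: a tail of a few short trees gives a negative g1.
print(g1([10, 20, 20, 20, 20]))  # negative: left-skewed
print(g1([10, 15, 20]))          # 0.0: symmetric
```

A strongly negative g1 means a few trees are much shorter than the bulk, which is the pattern the slides associate with phylogenetic signal.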
39Tests for phylogenetic signal (g1 and PTP)
- Are sensitive to any signal in the data
- For example
  - g1 of permuted data = -0.04 (ns)
  - Duplicate one taxon and g1 = -1.56
- Useful for identifying truly useless data (very rare)
- But otherwise does not tell you much about data quality
40Tests of signal
- These methods seek to determine overall data quality as a guide to whether we should believe particular results
- We can, instead, evaluate particular results
  - Clade support measures: bootstrap/decay
  - Statistical tests of alternative hypotheses