1
http://creativecommons.org/licenses/by-sa/2.0/
2
CIS786, Lecture 5
  • Usman Roshan

3
Previously
  • DCM decompositions in detail
  • DCM1 improved significantly over NJ
  • DCM2 did not always improve over TNT (for solving
    MP)
  • New DCM3 improved over DCM2 but not better than
    TNT

4
Previously
  • DCM decompositions in detail
  • DCM1 improved significantly over NJ
  • DCM2 did not always improve over TNT (for solving
    MP)
  • New DCM3 improved over DCM2 but not better than
    TNT
  • The DCM story continues

5
Disk Covering Methods (DCMs)
  • DCMs are divide-and-conquer booster methods. They
    divide the dataset into small subproblems,
    compute subtrees using a given base method, merge
    the subtrees, and refine the supertree.
  • DCMs to date:
  • DCM1: improves the statistical performance of
    distance-based methods
  • DCM2: improves heuristic search for MP and ML
  • DCM3: the latest, fastest, and best (in accuracy
    and optimality) DCM
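The divide / solve / merge / refine cycle above can be sketched generically. This is an illustrative sketch, not the authors' code: `decompose`, `base_method`, `merge`, and `refine` are hypothetical placeholders for a real DCM decomposition, a base heuristic (e.g. NJ or TNT), a supertree merger, and a refinement step.

```python
def dcm_boost(taxa, decompose, base_method, merge, refine):
    """Generic DCM booster: divide, solve subproblems, merge, refine."""
    subsets = decompose(taxa)                     # overlapping subsets sharing a separator
    subtrees = [base_method(s) for s in subsets]  # run the base method per subset
    supertree = merge(subtrees)                   # combine subtrees into one supertree
    return refine(supertree)                      # resolve what merging left unresolved
```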

6
DCM2 technique for speeding up MP searches
7
DCM1(NJ)
8
Computing tree for one threshold
9
Error as a function of evolutionary rate
(figure: error curves for NJ and DCM1-NJ+MP)
10
I. Comparison of DCMs (4583 sequences)
Base method is the TNT-ratchet. DCM2 takes almost
10 hours to produce a tree and is too slow to
run on larger datasets.
11
DCM2 decomposition on 500 rbcL genes (Zilla
dataset)
  • DCM2 decomposition
  • Blue: separator
  • Red: subset 1
  • Pink: subset 2
  • Visualization produced by the graphviz
    program, which draws the graph according to
    specified distances
  • Nodes: species in the dataset
  • Distances: p-distances (Hamming) between the
    DNA sequences
  • Separator is very large
  • Subsets are very large
  • Scattered subsets

12
DCM3 decomposition - example
13
Approx centroid-edge DCM3 decomposition example
  1. Locate the centroid edge e (O(n) time)
  2. Set the closest leaves around e to be the
    separator (O(n) time)
  3. Remaining leaves in subtrees around e form the
    subsets (unioned with the separator)
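Step 1 above can be illustrated on an abstract guide tree. A minimal sketch, assuming the tree is given as an adjacency dict and taking the centroid edge to be the edge whose removal splits the leaf set most evenly (illustrative code, not the DCM3 implementation):

```python
def centroid_edge(adj):
    """Return an edge (u, v) whose removal splits the leaves most evenly.

    adj maps each node to its list of neighbors; leaves have degree 1.
    A single DFS counts the leaves on one side of every edge, so the
    whole search runs in O(n) time, as stated on the slide.
    """
    leaves = {v for v, nbrs in adj.items() if len(nbrs) == 1}
    n = len(leaves)
    root = next(iter(leaves))
    count = {}                      # (parent, child) -> leaves below child

    def dfs(v, parent):
        c = 1 if v in leaves else 0
        for w in adj[v]:
            if w != parent:
                c += dfs(w, v)
        count[(parent, v)] = c
        return c

    dfs(root, None)
    best, best_balance = None, n + 1
    for (p, v), c in count.items():
        if p is None:               # skip the artificial root "edge"
            continue
        balance = abs(n - 2 * c)    # 0 means a perfect half/half split
        if balance < best_balance:
            best, best_balance = (p, v), balance
    return best
```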

14
DCM2 decomposition on 500 rbcL genes (Zilla
dataset)
  • DCM2 decomposition
  • Blue: separator
  • Red: subset 1
  • Pink: subset 2
  • Visualization produced by the graphviz
    program, which draws the graph according to
    specified distances
  • Nodes: species in the dataset
  • Distances: p-distances (Hamming) between the
    DNA sequences
  • Separator is very large
  • Subsets are very large
  • Scattered subsets

15
DCM3 decomposition on 500 rbcL genes (Zilla
dataset)
  • DCM3 decomposition
  • Blue: separator (and subset)
  • Red: subset 2
  • Pink: subset 3
  • Yellow: subset 4
  • Visualization produced by the graphviz
    program, which draws the graph according to
    specified distances
  • Nodes: species in the dataset
  • Distances: p-distances (Hamming) between the
    DNA sequences
  • Separator is small
  • Subsets are small
  • Compact subsets

16
Comparison of DCMs
(figure: MP performance curves for TNT, DCM2, DCM3, and Rec-DCM3)
  • Dataset: 4583 actinobacteria ssu rRNA from RDP.
    Base method is the TNT-ratchet.
  • DCM2 takes almost 10 hours to produce a tree and
    is too slow to run on larger datasets.
  • DCM3 followed by TNT-ratchet doesn't improve
    over TNT
  • Recursive-DCM3 followed by TNT-ratchet doesn't
    improve over TNT

17
Local optima is a problem
18
Local optima is a problem
(figure: average MP score above optimal, shown as a percentage of the optimal, versus hours)
19
Iterated local search: escape local optima by perturbation
(figure: local search reaches a local optimum; a perturbation moves away from it; local search then restarts from the perturbed solution)
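The loop in the diagram can be written down generically. A hedged sketch on a toy integer minimization problem (all function names are placeholders; in the phylogenetic setting `score` would be the MP score and `local_search` a TBR-style search):

```python
import random

def iterated_local_search(x0, score, local_search, perturb, iters=20, seed=0):
    """Search to a local optimum, then repeat: perturb, re-search, keep the better."""
    rng = random.Random(seed)
    best = local_search(x0)
    for _ in range(iters):
        candidate = local_search(perturb(best, rng))
        if score(candidate) < score(best):      # minimization, as with MP
            best = candidate
    return best
```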
20
Iterated local search: Recursive-Iterative-DCM3
(figure: the same loop with Recursive-DCM3 as the perturbation step)
21
Comparison of DCMs for solving MP
Rec-I-DCM3(TNT-ratchet) improves upon unboosted
TNT-ratchet
22
I. Comparison of DCMs (13,921 sequences)
Base method is the TNT-ratchet.
23
I. Comparison of DCMs (13,921 sequences)
Base method is the TNT-ratchet.
24
I. Comparison of DCMs (13,921 sequences)
Base method is the TNT-ratchet.
25
I. Comparison of DCMs (13,921 sequences)
Base method is the TNT-ratchet.
26
I. Comparison of DCMs (13,921 sequences)
Base method is the TNT-ratchet. Note the
improvement in DCMs as we move from the
default to recursion to iteration to
recursion+iteration.
27
Improving upon TNT
  • But what happens after 24 hours?
  • We studied boosting upon TNT-ratchet. Other TNT
    heuristics are actually better and improving upon
    them may not be possible. Can we improve upon the
    default TNT search?

28
Improving upon TNT
  • But what happens after 24 hours?
  • We studied boosting upon TNT-ratchet. Other TNT
    heuristics are actually better and improving upon
    them may not be possible. What about the default
    TNT search?
  • We select some real and large datasets.
    (Previously we showed that TNT reaches best known
    scores on small datasets)
  • We run 5 trials of TNT for two weeks and 5 of
    Rec-I-DCM3(TNT) for one week on each dataset

29
2000 Eukaryotes rRNA
30
6722 3-domain2-org rRNA
31
13921 Proteobacteria rRNA
32
How to run Rec-I-DCM3 then?
  • Unanswered question: what about better TNT
    heuristics? Can Rec-I-DCM3 improve upon them?
  • Rec-I-DCM3 improves upon default TNT, but we don't
    know what happens for better TNT heuristics.
  • Therefore, for a large-scale analysis, first
    figure out the best settings of the software
    (e.g. TNT or PAUP) on the dataset, and then use
    it in conjunction with Rec-I-DCM3 with various
    subset sizes

33
Maximum likelihood
34
Maximum likelihood
  • Four problems:
  • Given tree, edge lengths, and ancestral states,
    find the likelihood of the tree: polynomial time
  • Given tree and edge lengths, find the likelihood
    of the tree: polynomial time (dynamic programming)

35
Second case
Ron Shamir's lectures
36
Second case
Exponential time summation!
Ron Shamir's lectures
37
Second case
Exponential time summation!
Can be solved in polynomial time using dynamic
programming, similar to computing MP scores
Ron Shamir's lectures
38
Second case-DP
39
Second case-DP
40
Second case-DP
Complexity?
41
Second case-DP
Complexity? For each node and each site we do
k^2 work, so the total is O(mnk^2)
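The dynamic program can be made concrete for a single site. A minimal sketch under the Jukes-Cantor model (an illustrative model choice, not fixed by the slides): each node stores k = 4 conditional likelihoods, each filled with O(k^2) work, giving O(nk^2) per site and O(mnk^2) over m sites.

```python
import math

STATES = "ACGT"

def jc_prob(i, j, t):
    """Jukes-Cantor transition probability between states i and j on a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def conditional_likelihoods(node, leaf_state):
    """node is a leaf name (str) or a tuple (left, left_len, right, right_len)."""
    if isinstance(node, str):                       # leaf: indicator vector
        return [1.0 if s == leaf_state[node] else 0.0 for s in STATES]
    left, tl, right, tr = node
    L = conditional_likelihoods(left, leaf_state)
    R = conditional_likelihoods(right, leaf_state)
    out = []
    for i in range(4):                              # O(k^2) work at this node
        pl = sum(jc_prob(i, j, tl) * L[j] for j in range(4))
        pr = sum(jc_prob(i, j, tr) * R[j] for j in range(4))
        out.append(pl * pr)
    return out

def site_likelihood(root, leaf_state):
    """Sum over root states with a uniform 1/4 prior."""
    return sum(0.25 * v for v in conditional_likelihoods(root, leaf_state))
```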
42
Maximum likelihood
  • Four problems:
  • Given data, tree, edge lengths, and ancestral
    states, find the likelihood of the tree:
    polynomial time
  • Given data, tree, and edge lengths, find the
    likelihood of the tree: polynomial time (dynamic
    programming)
  • Given data and tree, find the likelihood: unknown
    complexity

43
Third case
  1. Assign arbitrary values to all edge lengths
    except one t_rv
  2. Now optimize a function of one parameter using
    EM or Newton-Raphson
  3. Repeat for other edges
  4. Stop when improvement in likelihood is less than
    delta
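Steps 1-4 amount to coordinate-wise optimization of the likelihood, one branch length at a time. A minimal sketch, using ternary search as the one-dimensional optimizer in place of EM or Newton-Raphson, with `loglik` a stand-in for the real likelihood function:

```python
def optimize_branch_lengths(loglik, lengths, delta=1e-6, lo=1e-8, hi=10.0):
    """Cycle over edges, maximizing loglik in one length at a time;
    stop when a full pass improves the likelihood by less than delta."""
    lengths = list(lengths)
    prev = loglik(lengths)
    while True:
        for e in range(len(lengths)):          # step 3: repeat for each edge
            a, b = lo, hi
            for _ in range(100):               # step 2: 1-D optimization
                m1, m2 = a + (b - a) / 3, b - (b - a) / 3
                lengths[e] = m1; f1 = loglik(lengths)
                lengths[e] = m2; f2 = loglik(lengths)
                if f1 < f2:
                    a = m1
                else:
                    b = m2
            lengths[e] = (a + b) / 2
        cur = loglik(lengths)
        if cur - prev < delta:                 # step 4: stop on small gain
            return lengths, cur
        prev = cur
```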

44
Maximum likelihood
  • Four problems
  • Given data, tree, edge lengths, and ancestral
    states find likelihood of tree polynomial time
  • Given data, tree and edge lengths find likelihood
    of tree polynomial time dynamic programming
  • Given data and tree, find likelihood unknown
    complexity
  • Given data find tree with best likelihood
    unknown complexity

45
ML is a very hard problem
  • Number of potential trees grows exponentially

Taxa   Trees
5      15
10     2,027,025
15     7,905,853,580,625
50     2.84 x 10^74
This is close to the number of atoms in the universe
(10^80)
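The counts in the table come from the double-factorial formula for unrooted binary trees on n labeled taxa, (2n-5)!! = 3 * 5 * 7 * ... * (2n-5):

```python
def num_unrooted_trees(n):
    """Number of distinct unrooted binary trees on n labeled taxa, (2n-5)!!."""
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n-5
        count *= k
    return count
```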
46
Local search
  • Greedy-ML tree followed by local search using TBR
    moves
  • Software packages: PAUP, PHYLIP, PhyML, RAxML
  • We now look at RAxML in detail
  • Major RAxML innovations:
  • Good starting trees
  • Subtree rearrangements
  • Lazy rescoring

47
TBR
48
TBR
49
TBR
50
TBR
51
TBR
52
TBR
53
Subtree Rearrangements
(figures, slides 53-60: a subtree is pruned and reinserted step by step among subtrees ST1-ST6; the numeric labels 1 and 2 mark the rearrangement distance)
61
Sequential RAxML
Compute randomized parsimony starting tree with
dnapars from PHYLIP
Every run starts from a distinct point in the
search space!
62
Sequential RAxML
Compute randomized parsimony starting tree with
dnapars from PHYLIP
Apply exhaustive subtree rearrangements
RAxML performs fast lazy rearrangements
63
Sequential RAxML
Compute randomized parsimony starting tree with
dnapars from PHYLIP
Apply exhaustive subtree rearrangements
Iterate while tree improves
64
Subtree Rearrangements
(figures, slides 64-71: the same stepwise rearrangements among subtrees ST1-ST6)
72
Subtree Rearrangements
(figure: subtrees ST1-ST6) Optimize all branches
73
Subtree Rearrangements
(figure: subtrees ST1-ST6) Need to optimize all branches?
74
Idea 1: Lazy Subtree Rearrangements
(figures, slides 74-75: subtrees ST1-ST6)
76
Why is Idea 1 useful?
  • Lazy subtree rearrangements:
  • Update fewer likelihood vectors → significantly
    faster
  • Allows higher rearrangement settings → better
    trees
  • Likelihood depends strongly on topology
  • Fast exploration of a large number of topologies
  • Fast pre-scoring of topologies
  • Mostly straightforward parallelization
  • Store the best 20 trees from each rearrangement
    step
  • Branch-length optimization of the best 20 trees
    only
  • Experimental results justify this mechanism
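The store-20-then-optimize mechanism is a pre-score-then-refine pattern: cheaply (lazily) score every candidate, keep a short list, and spend the expensive evaluation only on the short list. In this sketch `cheap_score` and `full_score` are hypothetical stand-ins for RAxML's lazy and exact evaluations:

```python
import heapq

def prescore_then_refine(candidates, cheap_score, full_score, keep=20):
    """Keep the `keep` best candidates by the cheap score, then pick the
    winner among them by the expensive score."""
    shortlist = heapq.nlargest(keep, candidates, key=cheap_score)
    return max(shortlist, key=full_score)
```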

77
Idea 2: Subsequent Application of Changes
(figures: before and after trees with subtrees ST1-ST6)
78
Why is Idea 2 useful?
  • During initial 5-10 rearrangement steps many
    improved topologies are encountered
  • Acceleration of likelihood improvement during
    initial optimization phase
  • Fast optimization of random starting trees
  • Subsequent application of changes is hard to
    parallelize

79
RAxML comparison to other programs
80
Parallel Distributed RAxML
  • Design goals:
  • minimize communication overhead
  • attain good speedup
  • Master-Worker architecture
  • 2 computational phases:
  • Computation of workers' parsimony trees
  • Rearrangement of subtrees at each worker
  • Program is non-deterministic → every run yields a
    distinct result, even with a fixed starting tree

81
Parallel RAxML Phase I
Distribute alignment file; compute parsimony
trees
Master Process
82
Parallel RAxML Phase I
Receive parsimony trees; select best as starting
tree
Master Process
83
Parallel RAxML Phase II
Distribute currently best tree
Master Process
84
Parallel RAxML Phase II
Workers issue work requests
Master Process
85
Parallel RAxML Phase II
Distribute subtree IDs
Master Process
86
Parallel RAxML Phase II
Distribute subtree IDs
Master Process
Only one integer must be sent to each node!
87
Parallel RAxML Phase II
(figures, slides 87-88: workers rearrange their assigned subtrees ST1-ST4)
89
Parallel RAxML Phase II
Receive result trees and continue with best tree
Master Process
90
ML trees on large alignments
  • Significant progress over last 3-4 years
  • Many programs can now infer large trees of 500
    organisms
  • Technical aspects are becoming the increasingly
    important limiting factor

Program             Largest Tree      Limitation
parallel GAML       3000 organisms    Memory
parallel IQPNNI     1500 organisms    Memory
PHYML               2500 organisms    Memory
MrBayes             1000 organisms    Memory
parallel RAxML      10000 organisms   Available resources (60 CPUs)
Rec-I-DCM3(RAxML)   7769 organisms    Available resources (16 CPUs)
DPRml               417 organisms     Memory
TREE-PUZZLE         257 organisms     Data structures
91
Improving RAxML
  • RAxML is the fastest heuristic for constructing
    highly optimal maximum likelihood phylogenies on
    large datasets
  • But can it be further improved?
  • We compare Rec-I-DCM3(RAxML) against RAxML on
    real and simulated data

92
Real data study
  • Dataset: 20 real datasets ranging from 101 to
    8780 sequences (DNA and rRNA)
  • Methods studied:
  • Recursive-Iterative-DCM3 (Rec-I-DCM3) with
    subset sizes:
  • 1/2 below 2K
  • 1/4 between 2K and 6K
  • 1/8 above 6K
  • Base and global methods: default and fast RAxML
  • RAxML: HKY model, tr/tv ratio estimated
  • 10 runs on datasets with at most 2K taxa, 5 runs
    on datasets with more than 2K taxa
  • RAxML run till completion
  • Rec-I-DCM3 run for the same amount of time as
    RAxML

93
Small datasets
500 rbcL DNA (Zilla dataset)
250 ARB RNA
94
Medium datasets
2025 ARB RNA
1000 ARB RNA
95
Large datasets
8780 ARB RNA
6722 RNA (Gutell)
96
Comparison across all datasets
Dataset size   Improvement (%)   Steps improvement   Max p   Avg p
101 (SC)       -0.004            -2.7                0.45    0.25
150 (SC)       0.007             3.2                 0.43    0.18
150 (ARB)      0                 0.3                 0.54    0.36
193 (Vinh)     0.06              38.6                0.78    0.64
200 (ARB)      -0.006            -6.5                0.54    0.35
218 (RDP)      0.014             21                  0.42    0.26
250 (ARB)      0.014             19                  0.55    0.34
439 (PG)       0                 0.1                 0.65    0.27
97
Comparison across all datasets
Dataset size   Improvement (%)   Steps improvement   Max p   Avg p
476 (PG)       -0.004            -4                  0.89    0.18
500 (rbcL)     0.011             11                  0.18    0.09
567 (PG)       0.006             13.9                0.33    0.17
854 (PG)       0.03              42                  0.32    0.14
921 (KJ)       0.06              109.6               0.39    0.15
1000 (ARB)     0.031             123                 0.55    0.35
1663 (ARB)     -0.004            -11.7               0.48    0.2
98
Comparison across all datasets
Dataset size            Improvement (%)   Steps improvement   Max p   Avg p
2025 (ARB)              -0.002            -6                  0.56    0.36
2415 (Bininda-Emonds)   0.004             23                  0.48    0.2
6673 (RG)               1.251             6877                1       0.29
7769 (RG)               2.338             13290               1       0.33
8780 (ARB)              0.03              270                 0.55    0.23
99
Summary
  • Out of 20 datasets Rec-I-DCM3(RAxML) finds better
    trees on 15
  • On datasets below 500 taxa, Rec-I-DCM3(RAxML)
    wins in 6 out of 9
  • On datasets of 500 taxa and above,
    Rec-I-DCM3(RAxML) wins in 9 out of 11
  • But what about accuracy?

100
Simulation study
  • Model trees: Beta-model with beta = -1 (software
    provided by Li-San Wang at UPenn)
  • Model of sequence evolution: General Time
    Reversible (GTR) with gamma-distributed site
    rates and invariant sites---all parameters
    determined by the NJ tree on the rbcL500 (Zilla)
    dataset using PAUP; seq-gen used for evolving
    sequences
  • Simulation parameters:
  • Model trees: 5 model phylogenies each of 1000,
    2000, and 4000 taxa
  • Evolutionary rates: branch lengths of each tree
    were scaled to yield low and moderate
    evolutionary rates (0.01 and 0.02)
  • Sequence length: 1000
  • Methods: RAxML under the GTRCAT model till
    completion, and one iteration of
    Rec-I-DCM3(RAxML---GTRCAT)

101
1000 taxa
(figure: final ML scores) Low rates: -1030237.6 and -1030180.6; moderate rates: -914164.8 and -914142.7
102
2000 taxa
(figure: final ML scores) Low rates: -2069504.6 and -2069493.9; moderate rates: -1824145.5 and -1824123.2
103
4000 taxa
(figure: final ML scores) Low rates: -4138244 and -4137922; moderate rates: -3665526.5 and -3665486.2
104
Summary
  • Rec-I-DCM3(RAxML) finds more accurate trees much
    faster than RAxML
  • ML scores are also improved
  • Improvement is pronounced on large and divergent
    datasets

105
Parallel Rec-I-DCM3
Use parallel RAxML developed by Du and Stamatakis
  1. Solve subproblems in parallel
  2. Merge subtrees in the proper subtree order

106
Real data study
  • Dataset: 6 real datasets ranging from 500 to 7769
    sequences (DNA and rRNA)
  • Max subset sizes: 100 for dataset 1, 125 for
    dataset 2, 500 for datasets 3-6
  • Methodology:
  • One iteration of Rec-I-DCM3
  • P-Rec-I-DCM3 run for the same amount of time as
    Rec-I-DCM3
  • 3 runs of each method on each dataset
  • Same starting tree for each method

107
P-Rec-I-DCM3 vs Rec-I-DCM3
Dataset                              Parallel LH   Sequential LH   Improvement in steps   Improvement (%)
500 rbcL (Zilla)                     -99945        -99967          22                     0.022
2560 rbcL (Kallersjo)                -354944       -355088         144                    0.041
4114 16s Actinobacteria (RDP)        -383108       -383524         416                    0.11
6281 ssu rRNA Eukaryotes (ERNA)      -1270379      -1270785        406                    0.032
6458 16s Firmicutes Bacteria (RDP)   -900875       -902077         1202                   0.13
7769 rRNA 3-dom2org (Gutell)         -540334       -541019         685                    0.13
108
Speedup values
Dataset     Processors   Base   Global   Overall
Dataset 1   4            4      2.4      2.6
Dataset 1   8            4.7    2.8      3.6
Dataset 1   16           4.85   2.78     3.5
Dataset 2   4            3      2.68     2.7
Dataset 2   8            5.3    3.2      3.45
Dataset 2   16           7      4.2      4.6
Dataset 3   4            1.95   2.6      2.2
Dataset 3   8            5.5    5        5.3
Dataset 3   16           6.7    5.7      6.2
109
Speedup values
Dataset     Processors   Base   Global   Overall
Dataset 4   4            2.9    2.3      2.6
Dataset 4   8            4.2    4.9      4.6
Dataset 4   16           8.3    5.3      6.3
Dataset 5   4            2.3    2.7      2.5
Dataset 5   8            4.8    4.4      4.7
Dataset 5   16           7.6    5.1      5.8
Dataset 6   4            3.2    1.95     2.2
Dataset 6   8            4.8    2.5      3
Dataset 6   16           5.4    2.8      3.3
110
Parallel performance limits
  • Performance appears sub-optimal because of
    significant load imbalance caused by different
    subproblem sizes
  • Optimal speedup = (total subproblem time) /
    (longest subproblem time)
  • Dataset 3:
  • 19 subproblems of which 3 require at least 5K
    seconds (max is 5569 seconds)
  • Optimal speedup = 37353/5569 ≈ 6.71
  • Dataset 6:
  • 43 subproblems of which the longest takes 12164
    seconds
  • Optimal speedup = 63620/12164 ≈ 5.23

Dataset     Processors   Base   Global   Overall
Dataset 3   4            1.9    2.6      2.2
Dataset 3   8            5.5    5        5.3
Dataset 3   16           6.7    5.7      6.2
Dataset 6   4            3.2    1.95     2.2
Dataset 6   8            4.8    2.5      3
Dataset 6   16           5.4    2.8      3.3
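The optimal-speedup figures above are just the total subproblem time divided by the time of the longest single subproblem (no schedule can finish before the longest subproblem does):

```python
def optimal_speedup(total_time, longest_time):
    """Upper bound on speedup under load imbalance: total work / critical path."""
    return total_time / longest_time

# the two cases from the slide
dataset3 = optimal_speedup(37353, 5569)    # about 6.71
dataset6 = optimal_speedup(63620, 12164)   # about 5.23
```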
111
Conclusions
  • RAxML is the fastest and most accurate ML
    heuristic to date---yet improvement is possible
    with Rec-I-DCM3 boosting
  • Rec-I-DCM3 and P-Rec-I-DCM3 could be used for
    tree-of-life reconstructions (fast and accurate)
  • Viewed as iterated local search,
    divide-and-conquer works well for escaping local
    optima (can this be used for other combinatorial
    optimization problems?)