1
http://creativecommons.org/licenses/by-sa/2.0/
2
CIS786, Lecture 5
  • Usman Roshan

3
Previously
  • DCM decompositions in detail
  • DCM1 improved significantly over NJ
  • DCM2 did not always improve over TNT (for solving
    MP)
  • New DCM3 improved over DCM2 but not better than
    TNT

4
Previously
  • DCM decompositions in detail
  • DCM1 improved significantly over NJ
  • DCM2 did not always improve over TNT (for solving
    MP)
  • New DCM3 improved over DCM2 but not better than
    TNT
  • The DCM story continues

5
Disk Covering Methods (DCMs)
  • DCMs are divide-and-conquer booster methods. They
    divide the dataset into small subproblems,
    compute subtrees using a given base method, merge
    the subtrees, and refine the supertree.
  • DCMs to date:
  • DCM1: improves the statistical performance of
    distance-based methods
  • DCM2: improves heuristic search for MP and ML
  • DCM3: the latest, fastest, and best (in accuracy
    and optimality) DCM
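The divide / solve / merge / refine cycle above can be sketched generically. This is an illustrative sketch, not the authors' code: `decompose`, `base_method`, `merge`, and `refine` are hypothetical placeholders for a real DCM decomposition, a base heuristic (e.g. NJ or TNT), a supertree merger, and a refinement step.

```python
def dcm_boost(taxa, decompose, base_method, merge, refine):
    """Generic DCM booster: divide, solve subproblems, merge, refine."""
    subsets = decompose(taxa)                     # overlapping subsets sharing a separator
    subtrees = [base_method(s) for s in subsets]  # run the base method per subset
    supertree = merge(subtrees)                   # combine subtrees into one supertree
    return refine(supertree)                      # resolve what merging left unresolved
```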

6
DCM2 technique for speeding up MP searches
7
DCM1(NJ)
8
Computing tree for one threshold
9
Error as a function of evolutionary rate
(figure: error curves for NJ and DCM1-NJ+MP)
10
I. Comparison of DCMs (4583 sequences)
Base method is the TNT-ratchet. DCM2 takes almost
10 hours to produce a tree and is too slow to
run on larger datasets.
11
DCM2 decomposition on 500 rbcL genes (Zilla
dataset)
  • DCM2 decomposition
  • Blue: separator
  • Red: subset 1
  • Pink: subset 2
  • Visualization produced by the graphviz
    program, which draws the graph according to
    specified distances
  • Nodes: species in the dataset
  • Distances: p-distances (Hamming) between the
    DNA sequences
  • Separator is very large
  • Subsets are very large
  • Scattered subsets

12
DCM3 decomposition - example
13
Approx centroid-edge DCM3 decomposition example
  1. Locate the centroid edge e (O(n) time)
  2. Set the closest leaves around e to be the
    separator (O(n) time)
  3. Remaining leaves in subtrees around e form the
    subsets (unioned with the separator)
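Step 1 above can be illustrated on an abstract guide tree. A minimal sketch, assuming the tree is given as an adjacency dict and taking the centroid edge to be the edge whose removal splits the leaf set most evenly (illustrative code, not the DCM3 implementation):

```python
def centroid_edge(adj):
    """Return an edge (u, v) whose removal splits the leaves most evenly.

    adj maps each node to its list of neighbors; leaves have degree 1.
    A single DFS counts the leaves on one side of every edge, so the
    whole search runs in O(n) time, as stated on the slide.
    """
    leaves = {v for v, nbrs in adj.items() if len(nbrs) == 1}
    n = len(leaves)
    root = next(iter(leaves))
    count = {}                      # (parent, child) -> leaves below child

    def dfs(v, parent):
        c = 1 if v in leaves else 0
        for w in adj[v]:
            if w != parent:
                c += dfs(w, v)
        count[(parent, v)] = c
        return c

    dfs(root, None)
    best, best_balance = None, n + 1
    for (p, v), c in count.items():
        if p is None:               # skip the artificial root "edge"
            continue
        balance = abs(n - 2 * c)    # 0 means a perfect half/half split
        if balance < best_balance:
            best, best_balance = (p, v), balance
    return best
```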

14
DCM2 decomposition on 500 rbcL genes (Zilla
dataset)
  • DCM2 decomposition
  • Blue: separator
  • Red: subset 1
  • Pink: subset 2
  • Visualization produced by the graphviz
    program, which draws the graph according to
    specified distances
  • Nodes: species in the dataset
  • Distances: p-distances (Hamming) between the
    DNA sequences
  • Separator is very large
  • Subsets are very large
  • Scattered subsets

15
DCM3 decomposition on 500 rbcL genes (Zilla
dataset)
  • DCM3 decomposition
  • Blue: separator (and subset)
  • Red: subset 2
  • Pink: subset 3
  • Yellow: subset 4
  • Visualization produced by the graphviz
    program, which draws the graph according to
    specified distances
  • Nodes: species in the dataset
  • Distances: p-distances (Hamming) between the
    DNA sequences
  • Separator is small
  • Subsets are small
  • Compact subsets

16
Comparison of DCMs
(figure: MP performance curves for TNT, DCM2, DCM3, and Rec-DCM3)
  • Dataset: 4583 actinobacteria ssu rRNA from RDP.
    Base method is the TNT-ratchet.
  • DCM2 takes almost 10 hours to produce a tree and
    is too slow to run on larger datasets.
  • DCM3 followed by TNT-ratchet doesn't improve
    over TNT
  • Recursive-DCM3 followed by TNT-ratchet doesn't
    improve over TNT

17
Local optima is a problem
18
Local optima is a problem
(figure: average MP score above optimal, shown as a percentage of the optimal, versus hours)
19
Iterated local search: escape local optima by perturbation
(figure: local search reaches a local optimum; a perturbation moves away from it; local search then restarts from the perturbed solution)
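The loop in the diagram can be written down generically. A hedged sketch on a toy integer minimization problem (all function names are placeholders; in the phylogenetic setting `score` would be the MP score and `local_search` a TBR-style search):

```python
import random

def iterated_local_search(x0, score, local_search, perturb, iters=20, seed=0):
    """Search to a local optimum, then repeat: perturb, re-search, keep the better."""
    rng = random.Random(seed)
    best = local_search(x0)
    for _ in range(iters):
        candidate = local_search(perturb(best, rng))
        if score(candidate) < score(best):      # minimization, as with MP
            best = candidate
    return best
```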
20
Iterated local search: Recursive-Iterative-DCM3
(figure: the same loop with Recursive-DCM3 as the perturbation step)
21
Comparison of DCMs for solving MP
Rec-I-DCM3(TNT-ratchet) improves upon unboosted
TNT-ratchet
22
I. Comparison of DCMs (13,921 sequences)
Base method is the TNT-ratchet.
23
I. Comparison of DCMs (13,921 sequences)
Base method is the TNT-ratchet.
24
I. Comparison of DCMs (13,921 sequences)
Base method is the TNT-ratchet.
25
I. Comparison of DCMs (13,921 sequences)
Base method is the TNT-ratchet.
26
I. Comparison of DCMs (13,921 sequences)
Base method is the TNT-ratchet. Note the
improvement in DCMs as we move from the
default to recursion to iteration to
recursion+iteration.
27
Improving upon TNT
  • But what happens after 24 hours?
  • We studied boosting upon TNT-ratchet. Other TNT
    heuristics are actually better and improving upon
    them may not be possible. Can we improve upon the
    default TNT search?

28
Improving upon TNT
  • But what happens after 24 hours?
  • We studied boosting upon TNT-ratchet. Other TNT
    heuristics are actually better and improving upon
    them may not be possible. What about the default
    TNT search?
  • We select some real and large datasets.
    (Previously we showed that TNT reaches best known
    scores on small datasets)
  • We run 5 trials of TNT for two weeks and 5 of
    Rec-I-DCM3(TNT) for one week on each dataset

29
2000 Eukaryotes rRNA
30
6722 3-domain2-org rRNA
31
13921 Proteobacteria rRNA
32
How to run Rec-I-DCM3 then?
  • Unanswered question: what about better TNT
    heuristics? Can Rec-I-DCM3 improve upon them?
  • Rec-I-DCM3 improves upon default TNT, but we don't
    know what happens for better TNT heuristics.
  • Therefore, for a large-scale analysis, first
    figure out the best settings of the software
    (e.g. TNT or PAUP) on the dataset, and then use
    it in conjunction with Rec-I-DCM3 with various
    subset sizes

33
Maximum likelihood
34
Maximum likelihood
  • Four problems:
  • Given tree, edge lengths, and ancestral states,
    find the likelihood of the tree: polynomial time
  • Given tree and edge lengths, find the likelihood
    of the tree: polynomial time (dynamic programming)

35
Second case
Ron Shamir's lectures
36
Second case
Exponential time summation!
Ron Shamir's lectures
37
Second case
Exponential time summation!
Can be solved in polynomial time using dynamic
programming, similar to computing MP scores
Ron Shamir's lectures
38
Second case-DP
39
Second case-DP
40
Second case-DP
Complexity?
41
Second case-DP
Complexity? For each node and each site we do
k^2 work, so the total is O(mnk^2)
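The dynamic program can be made concrete for a single site. A minimal sketch under the Jukes-Cantor model (an illustrative model choice, not fixed by the slides): each node stores k = 4 conditional likelihoods, each filled with O(k^2) work, giving O(nk^2) per site and O(mnk^2) over m sites.

```python
import math

STATES = "ACGT"

def jc_prob(i, j, t):
    """Jukes-Cantor transition probability between states i and j on a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def conditional_likelihoods(node, leaf_state):
    """node is a leaf name (str) or a tuple (left, left_len, right, right_len)."""
    if isinstance(node, str):                       # leaf: indicator vector
        return [1.0 if s == leaf_state[node] else 0.0 for s in STATES]
    left, tl, right, tr = node
    L = conditional_likelihoods(left, leaf_state)
    R = conditional_likelihoods(right, leaf_state)
    out = []
    for i in range(4):                              # O(k^2) work at this node
        pl = sum(jc_prob(i, j, tl) * L[j] for j in range(4))
        pr = sum(jc_prob(i, j, tr) * R[j] for j in range(4))
        out.append(pl * pr)
    return out

def site_likelihood(root, leaf_state):
    """Sum over root states with a uniform 1/4 prior."""
    return sum(0.25 * v for v in conditional_likelihoods(root, leaf_state))
```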
42
Maximum likelihood
  • Four problems:
  • Given data, tree, edge lengths, and ancestral
    states, find the likelihood of the tree:
    polynomial time
  • Given data, tree, and edge lengths, find the
    likelihood of the tree: polynomial time (dynamic
    programming)
  • Given data and tree, find the likelihood: unknown
    complexity

43
Third case
  1. Assign arbitrary values to all edge lengths
    except one t_rv
  2. Now optimize a function of one parameter using
    EM or Newton-Raphson
  3. Repeat for other edges
  4. Stop when improvement in likelihood is less than
    delta
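Steps 1-4 amount to coordinate-wise optimization of the likelihood, one branch length at a time. A minimal sketch, using ternary search as the one-dimensional optimizer in place of EM or Newton-Raphson, with `loglik` a stand-in for the real likelihood function:

```python
def optimize_branch_lengths(loglik, lengths, delta=1e-6, lo=1e-8, hi=10.0):
    """Cycle over edges, maximizing loglik in one length at a time;
    stop when a full pass improves the likelihood by less than delta."""
    lengths = list(lengths)
    prev = loglik(lengths)
    while True:
        for e in range(len(lengths)):          # step 3: repeat for each edge
            a, b = lo, hi
            for _ in range(100):               # step 2: 1-D optimization
                m1, m2 = a + (b - a) / 3, b - (b - a) / 3
                lengths[e] = m1; f1 = loglik(lengths)
                lengths[e] = m2; f2 = loglik(lengths)
                if f1 < f2:
                    a = m1
                else:
                    b = m2
            lengths[e] = (a + b) / 2
        cur = loglik(lengths)
        if cur - prev < delta:                 # step 4: stop on small gain
            return lengths, cur
        prev = cur
```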

44
Maximum likelihood
  • Four problems
  • Given data, tree, edge lengths, and ancestral
    states find likelihood of tree polynomial time
  • Given data, tree and edge lengths find likelihood
    of tree polynomial time dynamic programming
  • Given data and tree, find likelihood unknown
    complexity
  • Given data find tree with best likelihood
    unknown complexity

45
ML is a very hard problem
  • Number of potential trees grows exponentially

Taxa   Trees
5      15
10     2,027,025
15     7,905,853,580,625
50     2.84 x 10^74
This is close to the number of atoms in the universe
(10^80)
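The counts in the table come from the double-factorial formula for unrooted binary trees on n labeled taxa, (2n-5)!! = 3 * 5 * 7 * ... * (2n-5):

```python
def num_unrooted_trees(n):
    """Number of distinct unrooted binary trees on n labeled taxa, (2n-5)!!."""
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n-5
        count *= k
    return count
```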
46
Local search
  • Greedy-ML tree followed by local search using TBR
    moves
  • Software packages: PAUP, PHYLIP, PhyML, RAxML
  • We now look at RAxML in detail
  • Major RAxML innovations:
  • Good starting trees
  • Subtree rearrangements
  • Lazy rescoring

47
TBR
48
TBR
49
TBR
50
TBR
51
TBR
52
TBR
53
Subtree Rearrangements
(figures, slides 53-60: a subtree is pruned and reinserted step by step among subtrees ST1-ST6; the numeric labels 1 and 2 mark the rearrangement distance)
61
Sequential RAxML
Compute randomized parsimony starting tree with
dnapars from PHYLIP
Every run starts from a distinct point in the
search space!
62
Sequential RAxML
Compute randomized parsimony starting tree with
dnapars from PHYLIP
Apply exhaustive subtree rearrangements
RAxML performs fast lazy rearrangements
63
Sequential RAxML
Compute randomized parsimony starting tree with
dnapars from PHYLIP
Apply exhaustive subtree rearrangements
Iterate while tree improves
64
Subtree Rearrangements
(figures, slides 64-71: the same stepwise rearrangements among subtrees ST1-ST6)
72
Subtree Rearrangements
(figure: subtrees ST1-ST6) Optimize all branches
73
Subtree Rearrangements
(figure: subtrees ST1-ST6) Need to optimize all branches?
74
Idea 1: Lazy Subtree Rearrangements
(figures, slides 74-75: subtrees ST1-ST6)
76
Why is Idea 1 useful?
  • Lazy subtree rearrangements:
  • Update fewer likelihood vectors → significantly
    faster
  • Allows higher rearrangement settings → better
    trees
  • Likelihood depends strongly on topology
  • Fast exploration of a large number of topologies
  • Fast pre-scoring of topologies
  • Mostly straightforward parallelization
  • Store the best 20 trees from each rearrangement
    step
  • Branch-length optimization of the best 20 trees
    only
  • Experimental results justify this mechanism
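The store-20-then-optimize mechanism is a pre-score-then-refine pattern: cheaply (lazily) score every candidate, keep a short list, and spend the expensive evaluation only on the short list. In this sketch `cheap_score` and `full_score` are hypothetical stand-ins for RAxML's lazy and exact evaluations:

```python
import heapq

def prescore_then_refine(candidates, cheap_score, full_score, keep=20):
    """Keep the `keep` best candidates by the cheap score, then pick the
    winner among them by the expensive score."""
    shortlist = heapq.nlargest(keep, candidates, key=cheap_score)
    return max(shortlist, key=full_score)
```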

77
Idea 2: Subsequent Application of Changes
(figures: before and after trees with subtrees ST1-ST6)
78
Why is Idea 2 useful?
  • During initial 5-10 rearrangement steps many
    improved topologies are encountered
  • Acceleration of likelihood improvement during
    initial optimization phase
  • Fast optimization of random starting trees
  • Subsequent application of changes is hard to
    parallelize

79
RAxML comparison to other programs
80
Parallel Distributed RAxML
  • Design goals:
  • minimize communication overhead
  • attain good speedup
  • Master-Worker architecture
  • 2 computational phases:
  • Computation of workers' parsimony trees
  • Rearrangement of subtrees at each worker
  • Program is non-deterministic → every run yields a
    distinct result, even with a fixed starting tree

81
Parallel RAxML Phase I
Distribute alignment file; compute parsimony
trees
Master Process
82
Parallel RAxML Phase I
Receive parsimony trees; select best as starting
tree
Master Process
83
Parallel RAxML Phase II
Distribute currently best tree
Master Process
84
Parallel RAxML Phase II
Workers issue work requests
Master Process
85
Parallel RAxML Phase II
Distribute subtree IDs
Master Process
86
Parallel RAxML Phase II
Distribute subtree IDs
Master Process
Only one integer must be sent to each node!
87
Parallel RAxML Phase II
(figures, slides 87-88: workers rearrange their assigned subtrees ST1-ST4)
89
Parallel RAxML Phase II
Receive result trees and continue with best tree
Master Process
90
ML trees on large alignments
  • Significant progress over last 3-4 years
  • Many programs can now infer large trees of 500
    organisms
  • Technical aspects are becoming the increasingly
    important limiting factor

Program             Largest Tree      Limitation
parallel GAML       3000 organisms    Memory
parallel IQPNNI     1500 organisms    Memory
PHYML               2500 organisms    Memory
MrBayes             1000 organisms    Memory
parallel RAxML      10000 organisms   Available resources (60 CPUs)
Rec-I-DCM3(RAxML)   7769 organisms    Available resources (16 CPUs)
DPRml               417 organisms     Memory
TREE-PUZZLE         257 organisms     Data structures
91
Improving RAxML
  • RAxML is the fastest heuristic for constructing
    highly optimal maximum likelihood phylogenies on
    large datasets
  • But can it be further improved?
  • We compare Rec-I-DCM3(RAxML) against RAxML on
    real and simulated data

92
Real data study
  • Dataset: 20 real datasets ranging from 101 to
    8780 sequences (DNA and rRNA)
  • Methods studied:
  • Recursive-Iterative-DCM3 (Rec-I-DCM3) with
    subset sizes:
  • 1/2 below 2K
  • 1/4 between 2K and 6K
  • 1/8 above 6K
  • Base and global methods: default and fast RAxML
  • RAxML: HKY model, tr/tv ratio estimated
  • 10 runs on datasets with at most 2K taxa, 5 runs
    on datasets with more than 2K taxa
  • RAxML run till completion
  • Rec-I-DCM3 run for the same amount of time as
    RAxML

93
Small datasets
500 rbcL DNA (Zilla dataset)
250 ARB RNA
94
Medium datasets
2025 ARB RNA
1000 ARB RNA
95
Large datasets
8780 ARB RNA
6722 RNA (Gutell)
96
Comparison across all datasets
Dataset size   Improvement (%)   Steps improvement   Max p   Avg p
101 (SC)       -0.004            -2.7                0.45    0.25
150 (SC)       0.007             3.2                 0.43    0.18
150 (ARB)      0                 0.3                 0.54    0.36
193 (Vinh)     0.06              38.6                0.78    0.64
200 (ARB)      -0.006            -6.5                0.54    0.35
218 (RDP)      0.014             21                  0.42    0.26
250 (ARB)      0.014             19                  0.55    0.34
439 (PG)       0                 0.1                 0.65    0.27
97
Comparison across all datasets
Dataset size   Improvement (%)   Steps improvement   Max p   Avg p
476 (PG)       -0.004            -4                  0.89    0.18
500 (rbcL)     0.011             11                  0.18    0.09
567 (PG)       0.006             13.9                0.33    0.17
854 (PG)       0.03              42                  0.32    0.14
921 (KJ)       0.06              109.6               0.39    0.15
1000 (ARB)     0.031             123                 0.55    0.35
1663 (ARB)     -0.004            -11.7               0.48    0.2
98
Comparison across all datasets
Dataset size            Improvement (%)   Steps improvement   Max p   Avg p
2025 (ARB)              -0.002            -6                  0.56    0.36
2415 (Bininda-Emonds)   0.004             23                  0.48    0.2
6673 (RG)               1.251             6877                1       0.29
7769 (RG)               2.338             13290               1       0.33
8780 (ARB)              0.03              270                 0.55    0.23
99
Summary
  • Out of 20 datasets Rec-I-DCM3(RAxML) finds better
    trees on 15
  • On datasets below 500 taxa, Rec-I-DCM3(RAxML)
    wins in 6 out of 9
  • On datasets of 500 taxa and above,
    Rec-I-DCM3(RAxML) wins in 9 out of 11
  • But what about accuracy?

100
Simulation study
  • Model trees: Beta-model with beta = -1 (software
    provided by Li-San Wang at UPenn)
  • Model of sequence evolution: General Time
    Reversible (GTR) with gamma-distributed site
    rates and invariant sites---all parameters
    determined by the NJ tree on the rbcL500 (Zilla)
    dataset using PAUP; seq-gen used for evolving
    sequences
  • Simulation parameters:
  • Model trees: 5 model phylogenies each of 1000,
    2000, and 4000 taxa
  • Evolutionary rates: branch lengths of each tree
    were scaled to yield low and moderate
    evolutionary rates (0.01 and 0.02)
  • Sequence length: 1000
  • Methods: RAxML under the GTRCAT model till
    completion, and one iteration of
    Rec-I-DCM3(RAxML---GTRCAT)

101
1000 taxa
(figure: final ML scores) Low rates: -1030237.6 and -1030180.6; moderate rates: -914164.8 and -914142.7
102
2000 taxa
(figure: final ML scores) Low rates: -2069504.6 and -2069493.9; moderate rates: -1824145.5 and -1824123.2
103
4000 taxa
(figure: final ML scores) Low rates: -4138244 and -4137922; moderate rates: -3665526.5 and -3665486.2
104
Summary
  • Rec-I-DCM3(RAxML) finds more accurate trees much
    faster than RAxML
  • ML scores are also improved
  • Improvement is pronounced on large and divergent
    datasets

105
Parallel Rec-I-DCM3
Use parallel RAxML developed by Du and Stamatakis
  1. Solve subproblems in parallel
  2. Merge subtrees in the proper subtree order

106
Real data study
  • Dataset: 6 real datasets ranging from 500 to 7769
    sequences (DNA and rRNA)
  • Max subset sizes: 100 for dataset 1, 125 for
    dataset 2, 500 for datasets 3-6
  • Methodology:
  • One iteration of Rec-I-DCM3
  • P-Rec-I-DCM3 run for the same amount of time as
    Rec-I-DCM3
  • 3 runs of each method on each dataset
  • Same starting tree for each method

107
P-Rec-I-DCM3 vs Rec-I-DCM3
Dataset                              Parallel LH   Sequential LH   Improvement in steps   Improvement (%)
500 rbcL (Zilla)                     -99945        -99967          22                     0.022
2560 rbcL (Kallersjo)                -354944       -355088         144                    0.041
4114 16s Actinobacteria (RDP)        -383108       -383524         416                    0.11
6281 ssu rRNA Eukaryotes (ERNA)      -1270379      -1270785        406                    0.032
6458 16s Firmicutes Bacteria (RDP)   -900875       -902077         1202                   0.13
7769 rRNA 3-dom2org (Gutell)         -540334       -541019         685                    0.13
108
Speedup values
Dataset     Processors   Base   Global   Overall
Dataset 1   4            4      2.4      2.6
Dataset 1   8            4.7    2.8      3.6
Dataset 1   16           4.85   2.78     3.5
Dataset 2   4            3      2.68     2.7
Dataset 2   8            5.3    3.2      3.45
Dataset 2   16           7      4.2      4.6
Dataset 3   4            1.95   2.6      2.2
Dataset 3   8            5.5    5        5.3
Dataset 3   16           6.7    5.7      6.2
109
Speedup values
Dataset     Processors   Base   Global   Overall
Dataset 4   4            2.9    2.3      2.6
Dataset 4   8            4.2    4.9      4.6
Dataset 4   16           8.3    5.3      6.3
Dataset 5   4            2.3    2.7      2.5
Dataset 5   8            4.8    4.4      4.7
Dataset 5   16           7.6    5.1      5.8
Dataset 6   4            3.2    1.95     2.2
Dataset 6   8            4.8    2.5      3
Dataset 6   16           5.4    2.8      3.3
110
Parallel performance limits
  • Performance appears sub-optimal because of
    significant load imbalance caused by different
    subproblem sizes
  • Optimal speedup = (total subproblem time) /
    (longest subproblem time)
  • Dataset 3:
  • 19 subproblems of which 3 require at least 5K
    seconds (max is 5569 seconds)
  • Optimal speedup = 37353/5569 ≈ 6.71
  • Dataset 6:
  • 43 subproblems of which the longest takes 12164
    seconds
  • Optimal speedup = 63620/12164 ≈ 5.23

Dataset     Processors   Base   Global   Overall
Dataset 3   4            1.9    2.6      2.2
Dataset 3   8            5.5    5        5.3
Dataset 3   16           6.7    5.7      6.2
Dataset 6   4            3.2    1.95     2.2
Dataset 6   8            4.8    2.5      3
Dataset 6   16           5.4    2.8      3.3
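The optimal-speedup figures above are just the total subproblem time divided by the time of the longest single subproblem (no schedule can finish before the longest subproblem does):

```python
def optimal_speedup(total_time, longest_time):
    """Upper bound on speedup under load imbalance: total work / critical path."""
    return total_time / longest_time

# the two cases from the slide
dataset3 = optimal_speedup(37353, 5569)    # about 6.71
dataset6 = optimal_speedup(63620, 12164)   # about 5.23
```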
111
Conclusions
  • RAxML is the fastest and most accurate ML
    heuristic to date---yet improvement is possible
    with Rec-I-DCM3 boosting
  • Rec-I-DCM3 and P-Rec-I-DCM3 could be used for
    tree-of-life reconstructions (fast and accurate)
  • Viewed as iterated local search,
    divide-and-conquer works well for escaping local
    optima (can this be used for other combinatorial
    optimization problems?)