Title: Microarray Statistics
1DETECTING RECOMBINATION IN THE FELSENSTEIN ZONE
Alexander V. Mantzaris, Wolfgang Lehrach, Dirk Husmeier
2Inference of a phylogenetic tree
3Nucleotide substitution model
4Maximum likelihood tree
5Probabilistic phylogenetic model
6Steps of recombination
7Steps of recombination
8Steps of recombination
9Steps of recombination
10Steps of recombination
11Steps of recombination
12Sequence alignment changes from recombination
13Sequence alignment changes from recombination
14Multiple change point model (MCP)
- Suchard 2003: the MCP uses Reversible Jump MCMC
(RJMCMC) to place break points that model
changes in topology, rate of evolution and
substitution parameters
Legend: s = topology, r = rate, t = substitution parameters
15Dual multiple change point model (dual MCP)
- Decoupling the topology from the substitution
parameters removes the prior belief that these
changes must coincide (Minin 2005)
Legend: s = topology, r = rate, t = substitution parameters
16Analytic integration of branch lengths
- Following Suchard 2003, both the MCP and dual MCP
models analytically integrate the branch lengths
out of the model
- The approximation assumes independence of the
branch lengths to make the model more tractable
17Introduction of the prior into the model
18Felsenstein parameters
- Branch lengths fall into two groups, d2 and d3;
all branches within a group share the same length
19Data
- Sequence alignment data were produced synthetically
- Simulated with seq-gen (Rambaut, A. and Grassly, N. C., 1996)
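As a rough illustration of what a synthetic four-taxon alignment looks like, the sketch below simulates sites under the Jukes-Cantor (JC69) model in plain Python. It is not seq-gen and does not reproduce the settings used for the slides. The tree shape ((A,B),(C,D)), the choice of A and C as the long branches, and the names `evolve` and `simulate_alignment` are all assumptions made for the example.

```python
import math
import random

BASES = "ACGT"

def evolve(base, d, rng):
    """Evolve one nucleotide along a branch of length d under JC69."""
    p_same = 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)
    if rng.random() < p_same:
        return base
    # Under JC69 each of the three other bases is equally likely.
    return rng.choice([b for b in BASES if b != base])

def simulate_alignment(n_sites, d_long, d_short, seed=0):
    """Simulate tree ((A,B),(C,D)); A and C get d_long (an assumption),
    B, D and the internal branch get d_short."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_sites):
        x = rng.choice(BASES)          # state at the parent of A and B
        y = evolve(x, d_short, rng)    # state at the parent of C and D
        columns.append((evolve(x, d_long, rng), evolve(x, d_short, rng),
                        evolve(y, d_long, rng), evolve(y, d_short, rng)))
    return {name: "".join(col[i] for col in columns)
            for i, name in enumerate("ABCD")}

if __name__ == "__main__":
    alignment = simulate_alignment(60, d_long=0.85, d_short=0.15, seed=1)
    for name, seq in alignment.items():
        print(name, seq)
```

A mosaic (recombinant) alignment could be mimicked by concatenating segments simulated with different tree shapes or branch lengths.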
20Dualbrothers
d2 = 0.25, d3 = 0.25
Posterior probabilities per topology vs. position
21Dualbrothers
d2 = 0.45, d3 = 0.65
Posterior probabilities per topology vs. position
22Dualbrothers long branch attraction
Posterior probabilities per topology vs. position
23Dualbrothers long branch attraction
Posterior probabilities per topology vs. position
24Phylo-Factorial Hidden Markov model
- Two independent chains for rate and topology
inference
- Husmeier, Bioinformatics 2005
25Sampling in phylo-FHMM
26FHMM
Posterior probabilities per topology vs. position
27FHMM
d2 = 0.45, d3 = 0.65
Posterior probabilities per topology vs. position
28FHMM long branch attraction
Posterior probabilities per topology vs. position
29FHMM long branch attraction
Posterior probabilities per topology vs. position
30Long branch attraction
- Long branches can cause a wrong tree to be
inferred consistently, with the estimate
converging towards the most parsimonious tree
- Parsimony behaviour in a likelihood method
31Felsenstein zone
- Joseph Felsenstein, Systematic Zoology 1978
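The effect can be checked numerically on a toy case. The sketch below computes exact site-pattern probabilities for a four-taxon tree ((A,B),(C,D)) under a two-state symmetric model, in the spirit of Felsenstein's 1978 argument. Which branches are long is an assumption here (A and C long; B, D and the internal branch short), and the function names are made up for the example. When the pattern grouping A with C is more probable than the pattern grouping A with B, parsimony prefers the wrong tree as more data arrive.

```python
import math
from itertools import product

def change_prob(d):
    """Probability that a two-state symmetric process differs after length d."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def pattern_probs(d_a, d_b, d_c, d_d, d_int):
    """Exact probabilities of the three parsimony-informative patterns
    on the true tree ((A,B),(C,D)), by summing over all states."""
    def edge(d, same):
        p = change_prob(d)
        return 1.0 - p if same else p

    probs = {"AB|CD": 0.0, "AC|BD": 0.0, "AD|BC": 0.0}
    # x, y are the two internal-node states; sa..sd are the leaf states.
    for x, y, sa, sb, sc, sd in product((0, 1), repeat=6):
        pr = (0.5 * edge(d_int, x == y)
              * edge(d_a, x == sa) * edge(d_b, x == sb)
              * edge(d_c, y == sc) * edge(d_d, y == sd))
        if sa == sb and sc == sd and sa != sc:
            probs["AB|CD"] += pr      # supports the true tree
        elif sa == sc and sb == sd and sa != sb:
            probs["AC|BD"] += pr      # groups the two long branches
        elif sa == sd and sb == sc and sa != sb:
            probs["AD|BC"] += pr
    return probs

if __name__ == "__main__":
    print("equal branches:", pattern_probs(0.25, 0.25, 0.25, 0.25, 0.25))
    print("long A and C:  ", pattern_probs(0.85, 0.15, 0.85, 0.15, 0.15))
```

With all branches at 0.25 the pattern supporting the true tree is the most probable; with long branches of 0.85 and short branches of 0.15 the pattern joining the two long branches overtakes it.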
32Recpars on synthetic data set
- Viewing the most parsimonious tree with no
recombination
Legend: high score of correct topology = white;
low score of correct topology = black
33Model integrating out the branch lengths
- The data depend on the topology and the rate,
with a hyper-parameter for the rate distribution
Model 1
34Removing the branch length independence assumption
- Model incorporates the branch length sampling
Model 2
35Contrast to model without analytic integration
- This new model incorporates branch lengths
Model 1
Model 2
36Intra-model exploration: Annealed Importance
Sampling (AIS)
Objective: P(D|M)
Inter-model exploration: simultaneous MCMC
sampling of model topologies and parameters
37Annealed importance sampling
- Radford Neal 1998, based on importance sampling
with annealing transitions
The first sample is produced from the prior
Figure legend: prior, posterior
38Annealed importance sampling
Each annealing transition produces importance
weights
The weights allow the marginal likelihood to be
computed relative to the sampling distribution
(here the prior).
The average weight estimates the ratio of the
marginal likelihood to the prior normalisation
factor
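A minimal sketch of annealed importance sampling on a toy problem may help make the weights concrete. It is not the phylogenetic sampler from the slides: the model (one Gaussian observation with a standard normal prior on its mean), the linear temperature schedule, the random-walk step size and the function names are all assumptions chosen so that the exact marginal likelihood is available for comparison. Because the prior is normalised, the average weight estimates P(D) directly.

```python
import math
import random

def log_normal_pdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def ais_log_marginal(x_obs, noise_var, n_samples=1000, n_temps=20,
                     n_mcmc=2, seed=1):
    """Estimate log P(x_obs) for theta ~ N(0, 1), x_obs ~ N(theta, noise_var)."""
    rng = random.Random(seed)
    betas = [j / n_temps for j in range(n_temps + 1)]

    def log_lik(theta):
        return log_normal_pdf(x_obs, theta, noise_var)

    def log_target(theta, beta):
        # Annealed target: prior times likelihood raised to the power beta.
        return log_normal_pdf(theta, 0.0, 1.0) + beta * log_lik(theta)

    log_weights = []
    for _ in range(n_samples):
        theta = rng.gauss(0.0, 1.0)           # first sample comes from the prior
        log_w = 0.0
        for b_prev, b in zip(betas[:-1], betas[1:]):
            log_w += (b - b_prev) * log_lik(theta)   # weight contribution
            for _step in range(n_mcmc):       # Metropolis moves at temperature b
                prop = theta + rng.gauss(0.0, 0.5)
                log_ratio = log_target(prop, b) - log_target(theta, b)
                if rng.random() < math.exp(min(0.0, log_ratio)):
                    theta = prop
        log_weights.append(log_w)

    # Log of the mean weight, computed stably.
    m = max(log_weights)
    return m + math.log(sum(math.exp(lw - m) for lw in log_weights) / n_samples)

if __name__ == "__main__":
    estimate = ais_log_marginal(1.0, 0.25)
    exact = log_normal_pdf(1.0, 0.0, 1.25)    # analytic marginal for this toy model
    print("AIS estimate:", estimate, "exact:", exact)
```

In a model comparison setting, one such estimate would be produced per candidate model and the estimates compared.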
39AIS analytic integration
- 200 samples, 5 MCMC steps, 5 transitions
Model 1
Legend: high posterior probability of correct
topology = white; low posterior probability of
correct topology = black
40Annealed Importance Sampler
- Average posterior probabilities of correct
topology over 8 independent simulations
Model 2
Legend: high posterior probability of correct
topology = white; low posterior probability of
correct topology = black
41Model 1/Model 2 comparison with AIS
Model 2
Model 1
Model 2
42Remarks on Annealed importance sampling
- Sporadic wrong inferences deep in the Felsenstein
zone are assumed to be due to the sampler
occasionally missing the narrow high density
regions, as noted in Neal 1998
- Longer simulations alleviate this weakness, which
is unsuitable for large batch jobs
- The method is particularly good where the high
density region is less narrow
43Intra-model exploration: Annealed Importance
Sampling (AIS)
Objective: P(D|M)
Inter-model exploration: simultaneous MCMC
sampling of model topologies and parameters
44MCMC procedure without branch lengths
- The peeling algorithm is used to calculate the
probability of the data
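For readers unfamiliar with peeling (Felsenstein's pruning algorithm), here is a small sketch for a single alignment column. It assumes the JC69 substitution model and a tree written as nested pairs `((child, branch_length), (child, branch_length))` with leaf names as strings. The representation, the branch lengths and the function names are illustrative and are not taken from the authors' implementation.

```python
import math

BASES = "ACGT"

def jc69_matrix(d):
    """JC69 transition probabilities for a branch of length d."""
    e = math.exp(-4.0 * d / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def partial_likelihoods(node, site):
    """Return P(data below node | state at node) for each of the 4 states."""
    if isinstance(node, str):                 # leaf: indicator of observed base
        return [1.0 if BASES[i] == site[node] else 0.0 for i in range(4)]
    (left, d_left), (right, d_right) = node
    p_left = partial_likelihoods(left, site)
    p_right = partial_likelihoods(right, site)
    t_left, t_right = jc69_matrix(d_left), jc69_matrix(d_right)
    return [
        sum(t_left[i][j] * p_left[j] for j in range(4))
        * sum(t_right[i][j] * p_right[j] for j in range(4))
        for i in range(4)
    ]

def site_likelihood(tree, site):
    """Sum over root states with the uniform JC69 stationary distribution."""
    return sum(0.25 * p for p in partial_likelihoods(tree, site))

# Hypothetical tree ((A,B),(C,D)) rooted on the internal branch.
TREE = (((("A", 0.85), ("B", 0.15)), 0.075),
        ((("C", 0.85), ("D", 0.15)), 0.075))

if __name__ == "__main__":
    print(site_likelihood(TREE, {"A": "A", "B": "A", "C": "G", "D": "G"}))
```

The likelihood of a whole alignment is the product of the per-site values, usually accumulated as a sum of logarithms.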
45Region of long branch attraction
Model 1
- Long branch attraction appears where the theory
predicts it
Legend: high posterior probability of correct
topology = white; low posterior probability of
correct topology = black
46MCMC procedure including branch lengths
- The peeling algorithm is used to calculate the
probability of the data
47MCMC procedure including branch lengths
- Symmetric proposal distributions centred on the
current value; a Cauchy distribution is used
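The idea of a symmetric Cauchy proposal can be sketched on a one-parameter toy problem: sampling a single JC69 branch length given the number of differing sites between two sequences. The exponential prior, the proposal scale, the data counts and the function names are assumptions for illustration and do not come from the slides. Because the proposal is symmetric, the Hastings ratio reduces to the posterior ratio, and the heavy Cauchy tails occasionally propose large jumps.

```python
import math
import random

def log_posterior(d, n_sites, n_diff, prior_rate=1.0):
    """Unnormalised log posterior of a JC69 distance with an Exp(prior_rate) prior."""
    if d <= 0.0:
        return -math.inf
    p = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))   # P(site differs)
    if p <= 0.0:
        return -math.inf
    return (n_diff * math.log(p) + (n_sites - n_diff) * math.log(1.0 - p)
            - prior_rate * d)

def cauchy_step(rng, scale):
    """Draw from a zero-centred Cauchy distribution by inverse transform."""
    return scale * math.tan(math.pi * (rng.random() - 0.5))

def sample_branch_length(n_sites, n_diff, n_iter=5000, scale=0.02,
                         start=0.1, seed=0):
    rng = random.Random(seed)
    d, log_p = start, log_posterior(start, n_sites, n_diff)
    samples = []
    for _ in range(n_iter):
        prop = d + cauchy_step(rng, scale)        # symmetric around current value
        log_p_prop = log_posterior(prop, n_sites, n_diff)
        if rng.random() < math.exp(min(0.0, log_p_prop - log_p)):
            d, log_p = prop, log_p_prop
        samples.append(d)
    return samples

if __name__ == "__main__":
    draws = sample_branch_length(1000, 200)[1000:]   # discard burn-in
    print("posterior mean branch length:", sum(draws) / len(draws))
```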
48Convergence Diagnostics
- Gelman-Rubin tests were performed as a heuristic
test for convergence
- Reaching a potential scale reduction (PSR) factor
below 1.3 required 15K burn-in iterations and 5K
sampling iterations
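The PSR factor can be computed from several independent chains of a scalar quantity. The sketch below uses the basic Gelman-Rubin formula (within-chain variance W, between-chain variance B). The function name `psr_factor` and the synthetic chains are illustrative, and the slides do not state which variant of the diagnostic was used.

```python
import random
import statistics

def psr_factor(chains):
    """Potential scale reduction factor for equal-length chains of a scalar."""
    n = len(chains[0])
    chain_means = [statistics.mean(c) for c in chains]
    within = statistics.mean([statistics.variance(c) for c in chains])   # W
    between = n * statistics.variance(chain_means)                       # B
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5

if __name__ == "__main__":
    rng = random.Random(0)
    # Chains drawn from the same distribution: factor close to 1.
    mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
    # Chains stuck in different regions: factor well above 1.3.
    stuck = [[rng.gauss(mu, 1.0) for _ in range(1000)] for mu in (0.0, 3.0)]
    print("mixed chains:", psr_factor(mixed))
    print("stuck chains:", psr_factor(stuck))
```

A value near 1 is only a heuristic indication of convergence, which matches the cautious wording above.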
49Avoiding the Felsenstein zone
Model 2
- Metropolis-Hastings with Cauchy distribution
escapes long branch attraction
Legend: high posterior probability of correct
topology = white; low posterior probability of
correct topology = black
50Integration of branch length sampling into
phylo-FHMM
51Contrast to model without analytic integration
- This new model incorporates branch lengths
Model 1
Model 2
52Felsenstein parameters
- Branch lengths fall into two groups, d2 and d3;
all branches within a group share the same length
53FHMM long branch attraction
Posterior probabilities per topology vs. position
54Integration of branch length sampling into
phylo-FHMM
- The previous incorrect inference is corrected
(D2 = 0.15, D3 = 0.85)
Posterior probabilities per topology vs. position
55Mosaic sequence alignment, Model 2
- D2 = 0.25, D3 = 0.25: flanking 500 bp
- D2 = 0.15, D3 = 0.85: centre
Model 2
Model 1
56Conclusion
- Long branch attraction is evident in the region
of the Felsenstein zone for models using the
independence assumption for branch lengths, as in
Suchard 2003
- The more complex model escapes false positives
- Future work on real biological DNA sequence
alignments
57Appendix
58Reliability of sampling scheme
- For problems outside the Felsenstein zone the
scheme is reliable
- Within the Felsenstein zone, where the region of
high density is narrow, there can be occasions
with wrong inference from a lack of convergence
- Sufficient iterations are needed in any case