Title: Approximate Bayesian Methods for Population Genetics
1Approximate Bayesian Methods for Population
Genetics
Mark Beaumont University of Reading
2Outline of Talk
- Background and motivation.
- The ABC method
- Demographic models
- Comparison with full-likelihood.
- Examples.
- PAC with microsatellites
- Conclusions.
32
Frequency distribution of microsatellite lengths
Example genetic data (wildcats, Felis silvestris,
Randi et al, Mol. Biol. Evol., 2001)
Minimum spanning network of mitochondrial
haplotypes
4General Problem
In population genetics the data we observe have
many possible unobservable causes, which
generally follow a hierarchical structure. For
example, genetic data depends on some unknown
genealogical history, which in turn depends on
the mutation model, demographic history, and the
effects of selection. These, in turn, depend on
the ecology of the organism. Therefore we have
many competing explanations for the data and we
wish to choose among them. How to do this?
5Be pragmatic take a Bayesian approach
Bayesian analysis offers a flexible framework for
modelling uncertainty. MCMC has made this
possible for population genetic problems.
6A common framework for modelling in population
genetics is to use coalescent theory, but there
are many other possible frameworks.
74
Coalescent Theory
Statistical theory of gene genealogies
From Rosenberg and Nordborg,NRG, 2002.
8Coalescent Theory (Kingman, 1982 Hudson, 1983)
90
C t6
M t5
1
C t4
M t3
2
M t2
3
C t1
2
2
0
3
10(No Transcript)
11(No Transcript)
12(No Transcript)
13Markov Chain Monte Carlo
- Work with p(D,G F), which is easily calculated
from coalescent theory. - Starting with any Gi such that p(DGi) 1,
- modify Gi Gi1 (where p(DGi1)1),
- and Fi Fi1
- such that it is possible to calculate p(G
i1,F i1 Gi,Fi) -
and p(Gi,FiGi1,Fi1). - Then accept Gi1 and Fi1, with probability
- Â
- Likelihood ratio
Hastings term Ratio of priors - Â
- Otherwise Gi1 Gi, and Fi1 Fi.
14- The MCMC should converge on p(F,GD)
- From the output, we can look at marginal
posterior distribution of components of- - F (e.g. scaled mutation rate, or growth rate)
- G (e.g. time to most recent common ancestor,
number of mutations, etc.)
15Model of Population Change (Beaumont, Genetics,
1999)
Scaled parameters
16(No Transcript)
17(No Transcript)
18Accounting for ascertainment
Ignoring ascertainment
Beaumont (Genetics, 1999)
19Problems with MCMC-based methods of genealogical
inference
MCMC is useful, but
- Slow problems of convergence.
- Difficult to code up.
- Difficult to modify flexibly to different
scenarios. - Difficulty addressing the questions that
biologists want answered.
20Approximate Bayesian Computation ABC
- Key features
- Does not require a likelihood function to be
specified. - Based only on summary statistics computed from
the data. - Easy framework for model choice.
- Tavaré et al. (1997, Genetics) used a rejection
algorithm to infer demographic parameters
(population growth). - Pritchard et al. (1999, MBE) - introduced the
first ABC approach, using a rejection method. - Beaumont et al. (2002, Genetics) introduced a
regression method. - Marjoram et al (2003, PNAS) MCMC without
likelihoods in an ABC framework. - Sisson et al. (2007, PNAS) introduce a
Sequential ABC approach (SABC)
21Replace the data with summary statistics
- Key Points
- For most problems, we cant hit the data exactly.
- But similar data may have similar posterior
distributions. - If we replace the data with summary statistics,
then it is easier to decide how similar data
sets are to each other.
22Prior p(F)
Marginal likelihood p(D)
Likelihood p(D F)
Posterior distribution p(F D)
23- Simulate parameter values from prior
- Simulate data with these parameter values from a
simulation program. - Retain parameter values that give simulated data
that are similar to the real data. - Repeat a large number of times.
- Adjust the parameter values by weighted
regression.
24ABC local regression method
Parameter
Fit a (local) linear regression. Project points
along line.
Summary Statistic
How the points are weighted
1
0
Epanechnikov kernel
25Local Linear Regression
Assume we have observed a d dimensional vector of
summary statistics s, and we have n random draws
of a (scalar) parameter F1,,n and corresponding
summary statistics S1,,n. We scale s and S1,,n
so that S1,,n have unit variance.
26We want to minimize
where
Epanechnikov kernel
27The solution is
where
28Our best estimate of the posterior mean is then
where e1 is a d1 length vector (1,0,,0).
29Obtaining posterior densities and other summaries
using regression approach.
We make an assumption that the errors are
constant in the interval and adjust the parameter
values as
30Accuracy in the estimation of scaled mutation
rate q 2Nm
- Data-
- linked microsat loci
Standard Rejection
Relative mean square error
- Summary statistics-
- mean variance in length
- mean heterozygosity
- number of haplotypes
MCMC
Regression
Tolerance
31Model Comparison
In the Bayesian framework we can compare two
models, M1 and M2 by calculating the marginal
probability of the data, D, under each model
pM1(D) and pM2(D). The posterior probability of
model M1 is then
Using an ABC approach we can estimate this by
comparing under each model the proportion of
simulations that give rise to summary statistics
within the tolerance window under each model
(Pritchard et al., 1999). An alternative
approach is to use regression (Beaumont, in
press).
32Another Approach to Model Selection
Beaumont, M.A. (2006). Joint determination of
topology, divergence time, and immigration in
population trees. In Simulation, Genetics, and
Human Prehistory, eds. S. Matsumura, P. Forster,
C. Renfrew. (McDonald Institute Monographs.)
Cambridge McDonald Institute for Archaeological
Research. In Press.
- Directly estimate posterior probability of a
model rather than indirecly via comparison of
estimates of PMi(Ss). - Use regression framework. Treat model indicator
as a categorical variable Y that can take values
from (1,, nM). - We can then estimate the coefficients b in a
multinomial logit model in which
- Then get an estimate of P(Yj S s). Use
weighted regression, as before. - Implemented in the VGAM package (Thomas Yee,
http//www.stat.auckland.ac.nz/yee)
33Example Applications
Hamilton et al (Genetics, 2005 PNAS, 2005)
Models of dispersal Estoup et al (Evolution,
2005) Modelling invasions by cane toads Chan et
al (PLOS Genetics, 2006) Population dynamics from
aDNA. Miller et al (Science, 2006) Modelling
invasion by corn-borer beetle.
34Comparison with MCMC One of the problems with
the ABC method is how to choose summary
statistics, and how well this choice approximates
what would be the best case. Compare ABC
method with MCMC method (IM) for this complex
demography, so that we can identify summary
statistics that provide accurate inference.
35Demographic Models
IM is a program that uses MCMC to jointly infer
all these parameters, developed by Hey and
Nielsen (2004), based on Nielsen and Wakeley
(2001), available on Jody Heys website.
T
Nielsen and Wakeley (2001)
36NA2
T2
ma
NA1
T1
m3
m1
N1
N2
N3
m2
Pop 1
Pop 2
Pop 3
37Microsatellite data evolving by a stepwise
mutation model.
Beaumont, M.A. (2006). Joint determination of
topology, divergence time, and immigration in
population trees. In Simulation, Genetics, and
Human Prehistory, eds. S. Matsumura, P. Forster,
C. Renfrew. (McDonald Institute Monographs.)
Cambridge McDonald Institute for Archaeological
Research. In Press.
Summary Statistics
38Comparison with IM (MCMC) for the two-population
case.
2mNA 10
mT 4
m1/m 2
2mN2 2
2mN1 0.5
m2/m 1
Pop 1
Pop 2
Priors U(0,30) for scaled pop sizes U(0,10) for
scaled immigration rates U(0,8) for mT
39(No Transcript)
40(No Transcript)
41(No Transcript)
42Timing
For these runs IM took 44 hours the ABC method
took 25 minutes.
43A comparison of ABC with IM (MCMC) in Infinite
Sites Model. Work of Joao Lopes
30
44Summary Statistics used
- Sequence Data
- mean of pairwise differences
- in each population
- both populations joined together
- number of segregating sites
- in each population
- both populations joined together
- number of haplotypes
- in each population
- both populations joined together
45Simulated real data and Prior information
1000
1000
1000
500
0.01
0.01
0 10000
0 10000
0 10000
0 0.05
0 0.05
0 5000
Ne1
Ne2
Neanc
Tev
Mig2
Mig1
Mutation rate fixed at 0.001
ABC method
real data
MCMC method
prior distribution
46- Iterations -
- Mutation Rate -
- Tolerance -
- 500000
- 0.001 per sequence (fixed in prior)
- 2 (proportion of points accepted)
47ABC vs MCMC
Data 1 (no migration) Simulation 7
Ne1
Ne2
Neanc
Tev
Data 2 (migration 0.01) Simulation 9
Ne1
Ne2
Neanc
Tev
Mig2
Mig1
48ABC vs MCMC (500 000 iter, tol0.02)
MISE No migration
MISE Migration 0.01
49Timing
For these runs IM took 24 hours (no migration)
48 hours
(with migration). The ABC method took 25 minutes
(no migration)
40 minutes (with migration)
50Statistical Evaluation of Alternative Models of
Human Evolution Nelson J. R. Fagundes1,2,3,
Nicolas Ray3, Samuel Neuenschwander3,4, Mark
Beaumont5, Francisco M. Salzano2, Sandro L.
Bonatto1 Laurent Excoffier3
(Submitted)
An appropriate model of recent human evolution is
not only important to understand our own history,
but it is necessary to disentangle the effects of
demography and selection on genome diversity.
While most genetic data support the view that our
species originated recently in Africa, it is
still unclear if it completely replaced former
members of the Homo genus, or if some
interbreeding occurred during its range
expansion. Several scenarios of modern human
evolution have been proposed on the basis of
molecular and palaeontological data, but their
likelihood has never been statistically assessed.
Using DNA data from 50 nuclear loci sequenced in
African, Asian and American samples, we show here
by extensive simulations that a simple African
Replacement model with exponential growth has a
much higher probability (98) than alternative
multiregional evolution or assimilation
scenarios. A Bayesian analysis of the data under
this best supported model points to an origin of
our species 145 thousands years ago (Kya), an
exit out-of-Africa 54 Kya, and a recent
colonization of the Americas 9.5 Kya. We also
find that the African replacement model can not
only explain the shallow ancestry of mtDNA or
Y-chromosomes, but also the occurrence of deep
lineages at some autosomal loci, which has been
formerly interpreted as a sign of interbreeding
with H. erectus.
51(No Transcript)
52Current Issues
- What is the best route?
- MCMC with ABC (Marjoram et al., 2003)
- Improved conditional density estimation (Beaumont
et al, 2002) - Sequential Methods (Sissons et al, 2007)
- How to choose summary statistics?
- How to improve Model Choice?
53Acknowledgments
Joao Lopes David Balding Phillip Endicott
NERC EPSRC