Approximate Bayesian Methods for Population Genetics presentation

About This Presentation

Transcript and Presenter's Notes

Title: Approximate Bayesian Methods for Population Genetics

1
Approximate Bayesian Methods for Population
Genetics
Mark Beaumont University of Reading
2
Outline of Talk

Background and motivation.
The ABC method
Demographic models
Comparison with full-likelihood.
Examples.
PAC with microsatellites
Conclusions.

3
2
Frequency distribution of microsatellite lengths
Example genetic data (wildcats, Felis silvestris,
Randi et al, Mol. Biol. Evol., 2001)
Minimum spanning network of mitochondrial
haplotypes
4
General Problem
In population genetics the data we observe have
many possible unobservable causes, which
generally follow a hierarchical structure. For
example, genetic data depends on some unknown
genealogical history, which in turn depends on
the mutation model, demographic history, and the
effects of selection. These, in turn, depend on
the ecology of the organism. Therefore we have
many competing explanations for the data and we
wish to choose among them. How to do this?
5
Be pragmatic take a Bayesian approach
Bayesian analysis offers a flexible framework for
modelling uncertainty. MCMC has made this
possible for population genetic problems.
6
A common framework for modelling in population
genetics is to use coalescent theory, but there
are many other possible frameworks.
7
4
Coalescent Theory
Statistical theory of gene genealogies
From Rosenberg and Nordborg,NRG, 2002.
8
Coalescent Theory (Kingman, 1982 Hudson, 1983)
9
0
C t6
M t5
1
C t4
M t3
2
M t2
3
C t1
2
2
0
3
10
(No Transcript)
11
(No Transcript)
12
(No Transcript)
13
Markov Chain Monte Carlo

Work with p(D,G F), which is easily calculated
from coalescent theory.
Starting with any Gi such that p(DGi) 1,
modify Gi Gi1 (where p(DGi1)1),
and Fi Fi1
such that it is possible to calculate p(G
i1,F i1 Gi,Fi)
and p(Gi,FiGi1,Fi1).
Then accept Gi1 and Fi1, with probability
Likelihood ratio
Hastings term Ratio of priors
Otherwise Gi1 Gi, and Fi1 Fi.

The MCMC should converge on p(F,GD)
From the output, we can look at marginal
posterior distribution of components of-
F (e.g. scaled mutation rate, or growth rate)
G (e.g. time to most recent common ancestor,
number of mutations, etc.)

15
Model of Population Change (Beaumont, Genetics,
1999)
Scaled parameters
16
(No Transcript)
17
(No Transcript)
18
Accounting for ascertainment
Ignoring ascertainment
Beaumont (Genetics, 1999)
19
Problems with MCMC-based methods of genealogical
inference
MCMC is useful, but

Slow problems of convergence.
Difficult to code up.
Difficult to modify flexibly to different
scenarios.
Difficulty addressing the questions that
biologists want answered.

20
Approximate Bayesian Computation ABC

Key features
Does not require a likelihood function to be
specified.
Based only on summary statistics computed from
the data.
Easy framework for model choice.

Tavaré et al. (1997, Genetics) used a rejection
algorithm to infer demographic parameters
(population growth).
Pritchard et al. (1999, MBE) - introduced the
first ABC approach, using a rejection method.
Beaumont et al. (2002, Genetics) introduced a
regression method.
Marjoram et al (2003, PNAS) MCMC without
likelihoods in an ABC framework.
Sisson et al. (2007, PNAS) introduce a
Sequential ABC approach (SABC)

21
Replace the data with summary statistics

Key Points
For most problems, we cant hit the data exactly.
But similar data may have similar posterior
distributions.
If we replace the data with summary statistics,
then it is easier to decide how similar data
sets are to each other.

22
Prior p(F)
Marginal likelihood p(D)
Likelihood p(D F)
Posterior distribution p(F D)
23

Simulate parameter values from prior
Simulate data with these parameter values from a
simulation program.
Retain parameter values that give simulated data
that are similar to the real data.
Repeat a large number of times.
Adjust the parameter values by weighted
regression.

24
ABC local regression method
Parameter
Fit a (local) linear regression. Project points
along line.
Summary Statistic
How the points are weighted
1
0
Epanechnikov kernel
25
Local Linear Regression
Assume we have observed a d dimensional vector of
summary statistics s, and we have n random draws
of a (scalar) parameter F1,,n and corresponding
summary statistics S1,,n. We scale s and S1,,n
so that S1,,n have unit variance.
26
We want to minimize
where
Epanechnikov kernel
27
The solution is
where
28
Our best estimate of the posterior mean is then
where e1 is a d1 length vector (1,0,,0).
29
Obtaining posterior densities and other summaries
using regression approach.
We make an assumption that the errors are
constant in the interval and adjust the parameter
values as
30
Accuracy in the estimation of scaled mutation
rate q 2Nm

Data-
linked microsat loci

Standard Rejection
Relative mean square error

Summary statistics-
mean variance in length
mean heterozygosity
number of haplotypes

MCMC
Regression
Tolerance
31
Model Comparison
In the Bayesian framework we can compare two
models, M1 and M2 by calculating the marginal
probability of the data, D, under each model
pM1(D) and pM2(D). The posterior probability of
model M1 is then
Using an ABC approach we can estimate this by
comparing under each model the proportion of
simulations that give rise to summary statistics
within the tolerance window under each model
(Pritchard et al., 1999). An alternative
approach is to use regression (Beaumont, in
press).
32
Another Approach to Model Selection
Beaumont, M.A. (2006). Joint determination of
topology, divergence time, and immigration in
population trees. In Simulation, Genetics, and
Human Prehistory, eds. S. Matsumura, P. Forster,
C. Renfrew. (McDonald Institute Monographs.)
Cambridge McDonald Institute for Archaeological
Research. In Press.

Directly estimate posterior probability of a
model rather than indirecly via comparison of
estimates of PMi(Ss).
Use regression framework. Treat model indicator
as a categorical variable Y that can take values
from (1,, nM).
We can then estimate the coefficients b in a
multinomial logit model in which

Then get an estimate of P(Yj S s). Use
weighted regression, as before.
Implemented in the VGAM package (Thomas Yee,
http//www.stat.auckland.ac.nz/yee)

33
Example Applications
Hamilton et al (Genetics, 2005 PNAS, 2005)
Models of dispersal Estoup et al (Evolution,
2005) Modelling invasions by cane toads Chan et
al (PLOS Genetics, 2006) Population dynamics from
aDNA. Miller et al (Science, 2006) Modelling
invasion by corn-borer beetle.
34
Comparison with MCMC One of the problems with
the ABC method is how to choose summary
statistics, and how well this choice approximates
what would be the best case. Compare ABC
method with MCMC method (IM) for this complex
demography, so that we can identify summary
statistics that provide accurate inference.
35
Demographic Models
IM is a program that uses MCMC to jointly infer
all these parameters, developed by Hey and
Nielsen (2004), based on Nielsen and Wakeley
(2001), available on Jody Heys website.
T
Nielsen and Wakeley (2001)
36
NA2
T2
ma
NA1
T1
m3
m1
N1
N2
N3
m2
Pop 1
Pop 2
Pop 3
37
Microsatellite data evolving by a stepwise
mutation model.
Beaumont, M.A. (2006). Joint determination of
topology, divergence time, and immigration in
population trees. In Simulation, Genetics, and
Human Prehistory, eds. S. Matsumura, P. Forster,
C. Renfrew. (McDonald Institute Monographs.)
Cambridge McDonald Institute for Archaeological
Research. In Press.
Summary Statistics
38
Comparison with IM (MCMC) for the two-population
case.

Simulated 5 data sets

2mNA 10
mT 4
m1/m 2
2mN2 2
2mN1 0.5
m2/m 1
Pop 1
Pop 2
Priors U(0,30) for scaled pop sizes U(0,10) for
scaled immigration rates U(0,8) for mT
39
(No Transcript)
40
(No Transcript)
41
(No Transcript)
42
Timing
For these runs IM took 44 hours the ABC method
took 25 minutes.
43
A comparison of ABC with IM (MCMC) in Infinite
Sites Model. Work of Joao Lopes
30
44
Summary Statistics used

Sequence Data
mean of pairwise differences
in each population
both populations joined together
number of segregating sites
in each population
both populations joined together
number of haplotypes
in each population
both populations joined together

45
Simulated real data and Prior information
1000
1000
1000
500
0.01
0.01
0 10000
0 10000
0 10000
0 0.05
0 0.05
0 5000
Ne1
Ne2
Neanc
Tev
Mig2
Mig1
Mutation rate fixed at 0.001
ABC method
real data
MCMC method
prior distribution
46

Iterations -
Mutation Rate -
Tolerance -

500000
0.001 per sequence (fixed in prior)
2 (proportion of points accepted)

47
ABC vs MCMC
Data 1 (no migration) Simulation 7
Ne1
Ne2
Neanc
Tev
Data 2 (migration 0.01) Simulation 9
Ne1
Ne2
Neanc
Tev
Mig2
Mig1
48
ABC vs MCMC (500 000 iter, tol0.02)
MISE No migration
MISE Migration 0.01
49
Timing
For these runs IM took 24 hours (no migration)
48 hours
(with migration). The ABC method took 25 minutes
(no migration)
40 minutes (with migration)
50
Statistical Evaluation of Alternative Models of
Human Evolution Nelson J. R. Fagundes1,2,3,
Nicolas Ray3, Samuel Neuenschwander3,4, Mark
Beaumont5, Francisco M. Salzano2, Sandro L.
Bonatto1 Laurent Excoffier3
(Submitted)
An appropriate model of recent human evolution is
not only important to understand our own history,
but it is necessary to disentangle the effects of
demography and selection on genome diversity.
While most genetic data support the view that our
species originated recently in Africa, it is
still unclear if it completely replaced former
members of the Homo genus, or if some
interbreeding occurred during its range
expansion. Several scenarios of modern human
evolution have been proposed on the basis of
molecular and palaeontological data, but their
likelihood has never been statistically assessed.
Using DNA data from 50 nuclear loci sequenced in
African, Asian and American samples, we show here
by extensive simulations that a simple African
Replacement model with exponential growth has a
much higher probability (98) than alternative
multiregional evolution or assimilation
scenarios. A Bayesian analysis of the data under
this best supported model points to an origin of
our species 145 thousands years ago (Kya), an
exit out-of-Africa 54 Kya, and a recent
colonization of the Americas 9.5 Kya. We also
find that the African replacement model can not
only explain the shallow ancestry of mtDNA or
Y-chromosomes, but also the occurrence of deep
lineages at some autosomal loci, which has been
formerly interpreted as a sign of interbreeding
with H. erectus.
51
(No Transcript)
52
Current Issues

What is the best route?
MCMC with ABC (Marjoram et al., 2003)
Improved conditional density estimation (Beaumont
et al, 2002)
Sequential Methods (Sissons et al, 2007)
How to choose summary statistics?
How to improve Model Choice?

53
Acknowledgments
Joao Lopes David Balding Phillip Endicott
NERC EPSRC

Write a Comment

User Comments (0)

About PowerShow.com

Approximate Bayesian Methods for Population Genetics PowerPoint PPT Presentation