Approximate Bayesian Methods for Population Genetics

1
Approximate Bayesian Methods for Population Genetics
Mark Beaumont, University of Reading
2
Outline of Talk
  • Background and motivation.
  • The ABC method
  • Demographic models
  • Comparison with full-likelihood.
  • Examples.
  • PAC with microsatellites
  • Conclusions.

3
Example genetic data (wildcats, Felis silvestris;
Randi et al., Mol. Biol. Evol., 2001)
[Figure: frequency distribution of microsatellite lengths.]
[Figure: minimum spanning network of mitochondrial haplotypes.]
4
General Problem
In population genetics the data we observe have
many possible unobservable causes, which
generally follow a hierarchical structure. For
example, genetic data depend on some unknown
genealogical history, which in turn depends on
the mutation model, demographic history, and the
effects of selection. These, in turn, depend on
the ecology of the organism. Therefore we have
many competing explanations for the data and we
wish to choose among them. How to do this?
5
Be pragmatic: take a Bayesian approach
Bayesian analysis offers a flexible framework for
modelling uncertainty. MCMC has made this
possible for population genetic problems.
6
A common framework for modelling in population
genetics is to use coalescent theory, but there
are many other possible frameworks.
7
Coalescent Theory
Statistical theory of gene genealogies
From Rosenberg and Nordborg, NRG, 2002.
8
Coalescent Theory (Kingman, 1982; Hudson, 1983)
9
[Figure: genealogy diagram with events labelled C or M at times t1 to t6 and numeric labels (0 to 3) on nodes and tips.]
10
(No Transcript)
11
(No Transcript)
12
(No Transcript)
13
Markov Chain Monte Carlo
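  • Work with $p(D, G \mid \Phi)$, which is easily calculated
    from coalescent theory.
  • Starting with any $G_i$ such that $p(D \mid G_i) = 1$,
    modify $G_i \to G_{i+1}$ (where $p(D \mid G_{i+1}) = 1$)
    and $\Phi_i \to \Phi_{i+1}$, such that it is possible to calculate
    $p(G_{i+1}, \Phi_{i+1} \mid G_i, \Phi_i)$
    and $p(G_i, \Phi_i \mid G_{i+1}, \Phi_{i+1})$.
  • Then accept $G_{i+1}$ and $\Phi_{i+1}$ with probability
    $\min\left\{1,\ \frac{p(D, G_{i+1} \mid \Phi_{i+1})}{p(D, G_i \mid \Phi_i)} \times \frac{p(G_i, \Phi_i \mid G_{i+1}, \Phi_{i+1})}{p(G_{i+1}, \Phi_{i+1} \mid G_i, \Phi_i)} \times \frac{p(\Phi_{i+1})}{p(\Phi_i)}\right\}$
    (likelihood ratio × Hastings term × ratio of priors).
  • Otherwise $G_{i+1} = G_i$ and $\Phi_{i+1} = \Phi_i$.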
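
The accept/reject step on this slide can be illustrated on a toy problem. The following is a minimal sketch, not the genealogy sampler described in the talk: it updates only a scalar parameter (there is no genealogy G), and the exponential likelihood, the Uniform(0, 10) prior, the log-scale random-walk proposal and all function names are assumptions chosen for illustration.

```python
import math
import random


def log_prior(phi):
    # Assumed prior for illustration: Uniform(0, 10) on a scalar parameter.
    return 0.0 if 0.0 < phi < 10.0 else -math.inf


def log_likelihood(phi, data):
    # Assumed toy likelihood: data are exponential waiting times with rate phi.
    return sum(math.log(phi) - phi * x for x in data)


def metropolis_hastings(data, n_iter=5000, step=0.2, seed=1):
    """Return the chain of parameter values visited by the sampler."""
    rng = random.Random(seed)
    phi = 1.0
    chain = []
    for _ in range(n_iter):
        # Random walk on log(phi). The proposal is not symmetric in phi,
        # so the Hastings term q(phi | prop) / q(prop | phi) equals prop / phi.
        prop = phi * math.exp(step * rng.gauss(0.0, 1.0))
        log_accept = (
            log_likelihood(prop, data) - log_likelihood(phi, data)  # likelihood ratio
            + math.log(prop) - math.log(phi)                        # Hastings term
            + log_prior(prop) - log_prior(phi)                      # ratio of priors
        )
        if rng.random() < math.exp(min(0.0, log_accept)):
            phi = prop  # accept the proposed state
        # Otherwise the chain stays where it is.
        chain.append(phi)
    return chain
```

The three terms added together in `log_accept` correspond, on the log scale, to the likelihood ratio, Hastings term and ratio of priors named on the slide.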

14
  • The MCMC should converge on $p(\Phi, G \mid D)$.
  • From the output, we can look at the marginal
    posterior distribution of components of:
  • $\Phi$ (e.g. scaled mutation rate, or growth rate)
  • $G$ (e.g. time to most recent common ancestor,
    number of mutations, etc.)

15
Model of Population Change (Beaumont, Genetics,
1999)
Scaled parameters
16
(No Transcript)
17
(No Transcript)
18
Accounting for ascertainment
Ignoring ascertainment
Beaumont (Genetics, 1999)
19
Problems with MCMC-based methods of genealogical
inference
MCMC is useful, but:
  • Slow; problems of convergence.
  • Difficult to code up.
  • Difficult to modify flexibly to different
    scenarios.
  • Difficulty addressing the questions that
    biologists want answered.

20
Approximate Bayesian Computation (ABC)
  • Key features
  • Does not require a likelihood function to be
    specified.
  • Based only on summary statistics computed from
    the data.
  • Easy framework for model choice.
  • Tavaré et al. (1997, Genetics) used a rejection
    algorithm to infer demographic parameters
    (population growth).
  • Pritchard et al. (1999, MBE) introduced the
    first ABC approach, using a rejection method.
  • Beaumont et al. (2002, Genetics) introduced a
    regression method.
  • Marjoram et al. (2003, PNAS): MCMC without
    likelihoods in an ABC framework.
  • Sisson et al. (2007, PNAS) introduced a
    sequential ABC approach (SABC).

21
Replace the data with summary statistics
  • Key Points
  • For most problems, we can't hit the data exactly.
  • But similar data may have similar posterior
    distributions.
  • If we replace the data with summary statistics,
    then it is easier to decide how similar data
    sets are to each other.

22
Prior: $p(\Phi)$
Marginal likelihood: $p(D)$
Likelihood: $p(D \mid \Phi)$
Posterior distribution: $p(\Phi \mid D)$
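
The four quantities labelled on this slide are tied together by Bayes' theorem. Written out, with Φ for the parameters and D for the data as elsewhere in the talk (the integral form of the marginal likelihood assumes a continuous parameter):

```latex
p(\Phi \mid D) = \frac{p(D \mid \Phi)\, p(\Phi)}{p(D)},
\qquad
p(D) = \int p(D \mid \Phi)\, p(\Phi)\, d\Phi
```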
23
  • Simulate parameter values from prior
  • Simulate data with these parameter values from a
    simulation program.
  • Retain parameter values that give simulated data
    that are similar to the real data.
  • Repeat a large number of times.
  • Adjust the parameter values by weighted
    regression.
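
The first four steps above can be sketched as a rejection sampler. This is a minimal illustration under stated assumptions, not the simulation program used in the talk: the simulator is a toy constant-size coalescent that returns only the number of segregating sites, the prior is Uniform(0, `prior_max`), the tolerance is expressed as the proportion of simulations retained, and all function names are hypothetical.

```python
import random


def poisson_draw(mean, rng):
    """Poisson variate: count unit-rate exponential arrivals before time `mean`."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_segregating_sites(theta, n_samples, rng):
    """Number of segregating sites under a constant-size coalescent (toy simulator)."""
    # While k lineages remain, the waiting time is exponential with rate
    # k(k-1)/2 in coalescent time units and adds k * time to the tree length.
    tree_length = sum(k * rng.expovariate(k * (k - 1) / 2.0)
                      for k in range(2, n_samples + 1))
    # Mutations fall on the tree as a Poisson process with rate theta / 2.
    return poisson_draw(theta * tree_length / 2.0, rng)


def abc_rejection(observed, n_samples, n_sims, tolerance, prior_max, seed=0):
    """Return the (theta, statistic) pairs closest to the observed statistic."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_sims):
        theta = rng.uniform(0.0, prior_max)                       # 1. draw from the prior
        stat = simulate_segregating_sites(theta, n_samples, rng)  # 2. simulate data
        draws.append((abs(stat - observed), theta, stat))
    # 3. retain the closest fraction; sorting on distance only (stable sort)
    #    keeps tied simulations in their random simulation order.
    draws.sort(key=lambda d: d[0])
    n_keep = max(1, int(tolerance * n_sims))
    return [(theta, stat) for _, theta, stat in draws[:n_keep]]
```

For example, `abc_rejection(observed=20, n_samples=20, n_sims=5000, tolerance=0.02, prior_max=20.0)` keeps the 100 simulations whose number of segregating sites is closest to 20; the retained `theta` values approximate a sample from the posterior.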

24
ABC local regression method
Fit a (local) linear regression. Project points
along line.
[Figure: parameter plotted against summary statistic.]
[Figure: how the points are weighted (Epanechnikov kernel, weights between 0 and 1).]
25
Local Linear Regression
Assume we have observed a $d$-dimensional vector of
summary statistics $s$, and we have $n$ random draws
of a (scalar) parameter $\Phi_1, \ldots, \Phi_n$ and corresponding
summary statistics $S_1, \ldots, S_n$. We scale $s$ and $S_1, \ldots, S_n$
so that $S_1, \ldots, S_n$ have unit variance.
26
We want to minimize
where
Epanechnikov kernel
27
The solution is
where
28
Our best estimate of the posterior mean is then
where $e_1$ is a $(d+1)$-length vector $(1, 0, \ldots, 0)$.
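
The displayed formulas for the minimisation, its solution and the posterior mean are missing from this transcript. As a hedged reconstruction, the local-linear regression of Beaumont et al. (2002, Genetics) has the following form. The symbols α, β, δ, c, X and W do not appear in the transcript and are assumed notation: α is the intercept, β the vector of regression coefficients, δ the tolerance (kernel bandwidth) and c a normalising constant.

```latex
% Weighted least-squares criterion to minimise over \alpha and \beta
\sum_{i=1}^{n} \left\{ \Phi_i - \alpha - (S_i - s)^{T} \beta \right\}^{2}
  K_{\delta}\!\left( \lVert S_i - s \rVert \right)

% Epanechnikov kernel with bandwidth \delta
K_{\delta}(t) =
\begin{cases}
  c\,\delta^{-1} \left\{ 1 - (t/\delta)^{2} \right\}, & t \le \delta \\
  0, & t > \delta
\end{cases}

% Solution, with X the n x (d+1) design matrix whose i-th row is
% (1, S_{i1} - s_1, \ldots, S_{id} - s_d), W the diagonal matrix with
% entries K_\delta(\lVert S_i - s \rVert), and \Phi = (\Phi_1, \ldots, \Phi_n)^T
(\hat{\alpha}, \hat{\beta})^{T} = (X^{T} W X)^{-1} X^{T} W \Phi

% Estimate of the posterior mean at the observed summary statistics
\hat{\alpha} = e_1^{T} (X^{T} W X)^{-1} X^{T} W \Phi
```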
29
Obtaining posterior densities and other summaries
using regression approach.
We make an assumption that the errors are
constant in the interval and adjust the parameter
values as
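
The adjustment formula itself is not legible in this transcript. In the regression method of Beaumont et al. (2002) each accepted draw is shifted along the fitted line to the observed summary statistics, $\Phi_i^{*} = \Phi_i - (S_i - s)^{T} \hat{\beta}$, and the adjusted values are weighted by the kernel. The sketch below assumes a single summary statistic (d = 1) so that the weighted regression has a closed form; the function names and the return format are hypothetical.

```python
def epanechnikov(distance, delta):
    """Epanechnikov kernel weight; zero at or beyond the tolerance delta."""
    if distance >= delta:
        return 0.0
    return 0.75 / delta * (1.0 - (distance / delta) ** 2)


def local_linear_adjust(params, stats, observed, delta):
    """Weighted linear regression of a scalar parameter on one summary statistic.

    Returns (alpha_hat, beta_hat, adjusted), where alpha_hat estimates the
    posterior mean and adjusted is a list of (adjusted parameter, weight).
    """
    points = []
    for phi, s in zip(params, stats):
        weight = epanechnikov(abs(s - observed), delta)
        if weight > 0.0:  # draws outside the tolerance window are rejected
            points.append((s - observed, phi, weight))
    if not points:
        raise ValueError("no simulations fall within the tolerance window")
    total = sum(w for _, _, w in points)
    mean_x = sum(w * x for x, _, w in points) / total
    mean_y = sum(w * y for _, y, w in points) / total
    sxx = sum(w * (x - mean_x) ** 2 for x, _, w in points)
    if sxx == 0.0:
        raise ValueError("accepted summary statistics are all identical")
    sxy = sum(w * (x - mean_x) * (y - mean_y) for x, y, w in points)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x  # fitted value at the observed statistic
    # Project each accepted draw along the regression line to the observed value.
    adjusted = [(phi - beta * x, w) for x, phi, w in points]
    return alpha, beta, adjusted
```

Posterior densities and other summaries can then be computed from the weighted adjusted values, for instance with a weighted histogram or kernel density estimate.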
30
Accuracy in the estimation of scaled mutation
rate $\theta = 2N\mu$
  • Data:
  • linked microsat loci
  • Summary statistics:
  • mean variance in length
  • mean heterozygosity
  • number of haplotypes

[Figure: relative mean square error against tolerance for standard rejection, MCMC, and regression.]
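
The three summary statistics listed on this slide can be computed as in the following sketch. The slide does not say which estimators were used, so the sample variance of allele length and the unbiased expected heterozygosity are assumptions, as are the function name and the data layout (one tuple of allele lengths per sampled chromosome, one entry per linked locus).

```python
from collections import Counter
from statistics import mean, variance


def microsat_summaries(haplotypes):
    """Return (mean variance in length, mean heterozygosity, number of haplotypes).

    haplotypes: list of tuples; each tuple holds the allele length (repeat
    count) at every linked locus for one sampled chromosome.
    """
    n = len(haplotypes)
    loci = list(zip(*haplotypes))  # regroup the data locus by locus

    def heterozygosity(alleles):
        # Unbiased expected heterozygosity: n/(n-1) * (1 - sum of squared frequencies).
        counts = Counter(alleles)
        return n / (n - 1) * (1.0 - sum((c / n) ** 2 for c in counts.values()))

    mean_variance = mean(variance(alleles) for alleles in loci)
    mean_heterozygosity = mean(heterozygosity(alleles) for alleles in loci)
    n_haplotypes = len(set(haplotypes))
    return mean_variance, mean_heterozygosity, n_haplotypes
```

Because the loci are linked, the haplotype count treats the whole tuple across loci as one haplotype.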
31
Model Comparison
In the Bayesian framework we can compare two
models, $M_1$ and $M_2$, by calculating the marginal
probability of the data, $D$, under each model,
$p_{M_1}(D)$ and $p_{M_2}(D)$. The posterior probability of
model $M_1$ is then
$P(M_1 \mid D) = \frac{p_{M_1}(D)\, P(M_1)}{p_{M_1}(D)\, P(M_1) + p_{M_2}(D)\, P(M_2)}$,
where $P(M_1)$ and $P(M_2)$ are the prior model probabilities.
Using an ABC approach we can estimate this by
comparing the proportion of simulations under each
model that give rise to summary statistics
within the tolerance window
(Pritchard et al., 1999). An alternative
approach is to use regression (Beaumont, in
press).
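
The proportion-based estimate can be sketched as follows. This is a minimal illustration assuming equal prior probabilities for the models and an absolute tolerance on the distance between simulated and observed summary statistics; the function name and input format are hypothetical.

```python
def abc_model_probabilities(distances_by_model, tolerance):
    """Estimate posterior model probabilities from acceptance proportions.

    distances_by_model: dict mapping a model name to the list of distances
    between simulated and observed summary statistics under that model.
    Equal prior model probabilities are assumed.
    """
    acceptance = {}
    for name, distances in distances_by_model.items():
        accepted = sum(1 for d in distances if d <= tolerance)
        # Proportion of this model's simulations inside the tolerance window.
        acceptance[name] = accepted / len(distances)
    total = sum(acceptance.values())
    if total == 0.0:
        raise ValueError("no simulation from any model was accepted")
    return {name: rate / total for name, rate in acceptance.items()}
```

For instance, with distances `{"M1": [0.1, 0.2, 0.9, 1.5], "M2": [0.3, 1.2, 1.1, 2.0]}` and a tolerance of 0.5, model M1 has two of four simulations accepted and M2 one of four, giving estimated probabilities of 2/3 and 1/3.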
32
Another Approach to Model Selection
Beaumont, M.A. (2006). Joint determination of
topology, divergence time, and immigration in
population trees. In Simulation, Genetics, and
Human Prehistory, eds. S. Matsumura, P. Forster,
C. Renfrew. (McDonald Institute Monographs.)
Cambridge: McDonald Institute for Archaeological
Research. In press.
  • Directly estimate the posterior probability of a
    model rather than indirectly via comparison of
    estimates of $P_{M_i}(S = s)$.
  • Use regression framework. Treat model indicator
    as a categorical variable $Y$ that can take values
    from $(1, \ldots, n_M)$.
  • We can then estimate the coefficients $\beta$ in a
    multinomial logit model relating $Y$ to the
    summary statistics $S$.
  • Then get an estimate of $P(Y = j \mid S = s)$. Use
    weighted regression, as before.
  • Implemented in the VGAM package (Thomas Yee,
    http://www.stat.auckland.ac.nz/~yee)
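
The multinomial logit formula is not reproduced in this transcript. A standard form of the model, given here as an assumption rather than as the exact parameterisation on the slide, expresses the probability of each model as a normalised exponential of a linear function of the summary statistics. The intercepts $\beta_{0j}$ and the choice of the last model as the baseline category are assumed notation.

```latex
P(Y = j \mid S = s) =
  \frac{\exp\!\left( \beta_{0j} + \beta_j^{T} s \right)}
       {\sum_{k=1}^{n_M} \exp\!\left( \beta_{0k} + \beta_k^{T} s \right)},
\qquad j = 1, \ldots, n_M,
\qquad \beta_{0 n_M} = 0,\ \beta_{n_M} = 0
```

Fixing the coefficients of one category to zero makes the model identifiable; the fitted probabilities are then evaluated at the observed summary statistics.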

33
Example Applications
  • Hamilton et al. (Genetics, 2005; PNAS, 2005): models of dispersal.
  • Estoup et al. (Evolution, 2005): modelling invasions by cane toads.
  • Chan et al. (PLOS Genetics, 2006): population dynamics from aDNA.
  • Miller et al. (Science, 2006): modelling invasion by corn-borer beetle.
34
Comparison with MCMC
One of the problems with the ABC method is how to
choose summary statistics, and how closely inference
based on that choice approximates the best case
(inference from the full data). Compare the ABC
method with an MCMC method (IM) for this complex
demography, so that we can identify summary
statistics that provide accurate inference.
35
Demographic Models
IM is a program that uses MCMC to jointly infer
all these parameters. It was developed by Hey and
Nielsen (2004), based on Nielsen and Wakeley
(2001), and is available on Jody Hey's website.
[Figure: demographic model with divergence time T, from Nielsen and Wakeley (2001).]
36
[Figure: three-population tree (Pop 1, Pop 2, Pop 3) labelled with N1, N2, N3, NA1, NA2, T1, T2, m1, m2, m3 and ma.]
37
Microsatellite data evolving by a stepwise
mutation model.
Beaumont, M.A. (2006). Joint determination of
topology, divergence time, and immigration in
population trees. In Simulation, Genetics, and
Human Prehistory, eds. S. Matsumura, P. Forster,
C. Renfrew. (McDonald Institute Monographs.)
Cambridge: McDonald Institute for Archaeological
Research. In press.
Summary Statistics
38
Comparison with IM (MCMC) for the two-population
case.
  • Simulated 5 data sets

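Simulated parameter values (two populations, Pop 1 and Pop 2):
  • $2\mu N_A = 10$
  • $\mu T = 4$
  • $m_1/\mu = 2$
  • $2\mu N_2 = 2$
  • $2\mu N_1 = 0.5$
  • $m_2/\mu = 1$

Priors: U(0,30) for scaled population sizes; U(0,10) for
scaled immigration rates; U(0,8) for $\mu T$.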
39
(No Transcript)
40
(No Transcript)
41
(No Transcript)
42
Timing
For these runs IM took 44 hours; the ABC method
took 25 minutes.
43
A comparison of ABC with IM (MCMC) in the infinite
sites model. Work of Joao Lopes.
44
Summary Statistics used
  • Sequence data
  • Mean of pairwise differences (in each population;
    both populations joined together)
  • Number of segregating sites (in each population;
    both populations joined together)
  • Number of haplotypes (in each population;
    both populations joined together)
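
These nine statistics (three per population and three for the pooled sample) can be computed as in the sketch below. It assumes aligned sequences of equal length stored as strings, with at least two sequences per population; the function names and the output layout are hypothetical.

```python
from itertools import combinations


def mean_pairwise_differences(seqs):
    """Average number of differing sites over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


def segregating_sites(seqs):
    """Number of alignment columns showing more than one state."""
    return sum(len(set(column)) > 1 for column in zip(*seqs))


def number_of_haplotypes(seqs):
    """Number of distinct sequences in the sample."""
    return len(set(seqs))


def summarise(pop1, pop2):
    """Summary statistics for each population and for both joined together."""
    summaries = {}
    for label, seqs in (("pop1", pop1), ("pop2", pop2), ("pooled", pop1 + pop2)):
        summaries[label] = (
            mean_pairwise_differences(seqs),
            segregating_sites(seqs),
            number_of_haplotypes(seqs),
        )
    return summaries
```

Calling `summarise(["AAAA", "AAAT", "AAAT"], ["AACA", "GACA"])` returns, for each of `pop1`, `pop2` and `pooled`, a tuple of mean pairwise differences, segregating sites and haplotype count.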

45
Simulated real data and Prior information
Parameter   Simulated value   Prior range
Ne1         1000              0 to 10000
Ne2         1000              0 to 10000
Neanc       1000              0 to 10000
Tev         500               0 to 5000
Mig2        0.01              0 to 0.05
Mig1        0.01              0 to 0.05

Mutation rate fixed at 0.001.
[Figure legend: ABC method; real data; MCMC method; prior distribution.]
46
  • Iterations: 500000
  • Mutation rate: 0.001 per sequence (fixed in prior)
  • Tolerance: 2% (proportion of points accepted)

47
ABC vs MCMC
[Figure: Data 1 (no migration), simulation 7; panels for Ne1, Ne2, Neanc, Tev.]
[Figure: Data 2 (migration 0.01), simulation 9; panels for Ne1, Ne2, Neanc, Tev, Mig2, Mig1.]
48
ABC vs MCMC (500 000 iterations, tolerance = 0.02)
[Figure: MISE, no migration.]
[Figure: MISE, migration 0.01.]
49
Timing
For these runs IM took 24 hours (no migration) and
48 hours (with migration). The ABC method took
25 minutes (no migration) and 40 minutes (with migration).
50
Statistical Evaluation of Alternative Models of
Human Evolution
Nelson J. R. Fagundes, Nicolas Ray, Samuel
Neuenschwander, Mark Beaumont, Francisco M. Salzano,
Sandro L. Bonatto and Laurent Excoffier
(Submitted)
An appropriate model of recent human evolution is
not only important to understand our own history,
but it is necessary to disentangle the effects of
demography and selection on genome diversity.
While most genetic data support the view that our
species originated recently in Africa, it is
still unclear if it completely replaced former
members of the Homo genus, or if some
interbreeding occurred during its range
expansion. Several scenarios of modern human
evolution have been proposed on the basis of
molecular and palaeontological data, but their
likelihood has never been statistically assessed.
Using DNA data from 50 nuclear loci sequenced in
African, Asian and American samples, we show here
by extensive simulations that a simple African
Replacement model with exponential growth has a
much higher probability (98%) than alternative
multiregional evolution or assimilation
scenarios. A Bayesian analysis of the data under
this best supported model points to an origin of
our species 145 thousand years ago (Kya), an
exit out-of-Africa 54 Kya, and a recent
colonization of the Americas 9.5 Kya. We also
find that the African replacement model can not
only explain the shallow ancestry of mtDNA or
Y-chromosomes, but also the occurrence of deep
lineages at some autosomal loci, which has been
formerly interpreted as a sign of interbreeding
with H. erectus.
51
(No Transcript)
52
Current Issues
  • What is the best route?
  • MCMC with ABC (Marjoram et al., 2003)
  • Improved conditional density estimation (Beaumont
    et al., 2002)
  • Sequential methods (Sisson et al., 2007)
  • How to choose summary statistics?
  • How to improve Model Choice?

53
Acknowledgments
Joao Lopes, David Balding, Phillip Endicott
NERC, EPSRC