Title: The Coalescent and Measurably Evolving Populations
1The Coalescent and Measurably Evolving Populations
- Alexei Drummond
- Department of Computer Science
- University of Auckland, NZ
2Overview
- Introduction to the Coalescent
- Hepatitis C in Egypt
- An example using the coalescent
- Measurably evolving populations
- HIV-1 evolution within and among hosts
- An example using MEP concepts
- Summary Conclusions
3The coalescent
- The coalescent is a model of the ancestral
relationships of a small sample of individuals
taken from a large background population. - The coalescent describes a probability
distribution on ancestral genealogies (trees)
given a population history. - Therefore the coalescent can convert information
from ancestral genealogies into information about
population history and vice versa. - The coalescent is a model of ancestral
genealogies, not sequences, and its simplest form
assumes neutral evolution.
4The history of coalescent theory
- 1930-40s Genealogical arguments well known to
Wright Fisher - 1964 Crow Kimura Infinite Allele Model
- 1966 (Hubby Lewontin) (Harris) make first
surveys of population allele variation by protein
electrophoresis - 1968 Motoo Kimura proposes neutral explanation
of molecular evolution population variation.
So do King Jukes - 1971 Kimura Ohta proposes infinite sites
model. - 1975 Watterson makes explicit use of The
Coalescent - 1982 Kingman introduces The Coalescent.
- 1983 Hudson introduces The Coalescent with
Recombination - 1983 Kreitman publishes first major population
sequences. - 1987 Cann et al. traces human origin and
migrations with mitochondrial DNA.
5The history of coalescent theory
- 1988 Hughes Nei Genes with positive Darwinian
Selection. - 1989-90 Kaplan, Hudson, Takahata and others
Selection regimes with coalescent structure (MHC,
Incompatibility alleles). - 1991 MacDonald Kreitman Data with surplus of
replacement interspecific substitutions. - 1994-95 Griffiths-Tavaré Kuhner-Yamoto-Felsenst
ein introduces sampling techniques to estimate
parameters in population models. - 1997-98 Krone-Neuhauser introduces Ancestral
Selection Graph - 1999 Wiuf Donnelly uses coalescent theory to
estimate age of disease allele - 2000 Wiuf et al. introduces gene conversion
into coalescent. - 2000- A flood of SNP data haplotypes are on
their way.
6Population processes
COALESCENT THEORY
Genealogy
7Coalescent inference
Randomly sample individuals from population
Obtain gene sequences from sampled individuals
Reconstruct tree / trees from sequences
Infer coalescent results directly from sequences
Infer coalescent results from tree / trees
8Demographic History
- Change in population size through time
- Applications include
- Estimating history of human populations
- Conservation biology
- Reconstructing infectious disease epidemics
- Investigating viral dynamics within hosts
9Idealized Wright-Fisher populations
Diploid
Haploid
10Random mating in an ideal population
- A constant population size of N individuals
- Each individual in the new generation chooses
its parent from the previous generation at random
11Genetic drift extinction and ancestry
If you trace the ancestry of a sample of
individuals back in time you inevitably reach a
single most recent common ancestor. If you pick a
random individual and trace their descendents
forward in time, all the descendents of that
individual will with high probability eventually
die out.
12A sample genealogy from an idealized
Wright-Fisher population
Past
A sample genealogy of 3 sequences from a
population (N 10).
Discrete Generations
Present
13The coalescent distributions and expectations on
a sample genealogy
Past
Present
14The coalescent probability density distribution
Past
Present
The genealogy is an edge graph Eg and a vector of
times t.
Kingman (1982a,b)
15The coalescent estimating population size from a
sample genealogy
Past
Present
Felsenstein (1992)
16The coalescent estimating population size
confidence limits via ML
The confidence intervals are calculated from the
curvature of the likelihood. For a single
parameter model the 95 confidence limits are
defined by the points where the log-likelihood
drops 1.92 log-units below the maximum
log-likelihood.
Maximum likelihood can be used to estimate
population size by choosing a population size
that maximizes the probability of the observed
coalescent waiting times.
17The coalescent shapes of gene genealogies
The Coalescent
Exponential growth
Constant size
The coalescent can be used to convert coalescent
times into knowledge about population size and
its change though time.
18Constant population size N(t)N0
The Coalescent
small N0
large N0
TIME
19Coalescent and serial samples
The Coalescent
Constant population
Exponential growth
20Uncertainty in Genealogies
The Coalescent
How similar are these two trees? Both of them are
plausible given the data. We can use MCMC to get
the average result over all plausible trees,
21Coalescent Summary
The Coalescent
- The coalescent provides a theory of how
population size is related to the distribution of
coalescent events in a tree. - Big populations have old trees
- Exponentially growing populations have star-like
trees - Given a genealogy the most likely population size
can be estimated. - MCMC can be used to get a distribution of trees
from which a distribution of population sizes can
be estimated.
22Markov chain Monte Carlo (MCMC)
MCMC
- Imagine you would like to estimate two parameters
(?,?) from some data (D). - You want to find values of ? and ? that have high
probability given the data p(?,?D) - Say you have a likelihood function of the form
PrD ?,? - Bayes rule tells us that
- p(?,?D) PrD ?,?p(?,?) / PrD
- So that p(?,?D) ? PrD ?,?p(?,?)
23Markov chain Monte Carlo (MCMC)
MCMC
- p(?,?D) is called the posterior probability
(density) of ?,? given D - In an ideal world we want to know the posterior
density for all possible values of ?,?. - Then we could pick a credible region in two
dimensions that contained values of ?,? that
account for the majority of the posterior
probability mass. - This credible region would serve as an estimate
that includes incorporates our uncertainty and
this credible set could be used to address
hypotheses like ? is greater than x. - In reality we have to make due with a sample of
the posterior - so that we evaluate p(?,?D) for
a finite number (say 10,000,000) pairs of ?,?. - So which pairs should we choose?
24Markov chain Monte Carlo (MCMC)
MCMC
- Lets construct a random walk in 2-dimensional
space - In each step of the random walk we propose to
make an (unbiased) small jump from our current
position (?,?) to a new position (?,?) - If p(?,?D) gt p(?,?D) then we make the
proposed jump - However, if p(?,?D) lt p(?,?D), then we make
the proposed jump with probability ? p(?,?D)
/ p(?,?D), otherwise we stay where we are. - It can be shown (trust me!) that if you proceed
in this fashion for an infinite time then the
equilibrium distribution of this random walk will
be p(?,?D)! - That is, the random walk will visit a particular
region ?0, ?1 x ?0, ?1 of the state space
this often
25Markov chain Monte Carlo (MCMC)
MCMC
26Hepatitis C Virus (HCV)
Population genetics of Hepatitis C in Egypt
- Identified in 1989
- 9.6kb single-stranded RNA genome
- Polyprotein cleaved by proteases
- No efficient tissue culture system
27How important is HCV?
Population genetics of Hepatitis C in Egypt
- 170m infected
- 80 infections are chronic
- Liver cirrhosis cancer risk
- 10,000 deaths per year in USA
- No protective immunity?
28HCV Transmission
Population genetics of Hepatitis C in Egypt
Percutaneous exposure to infected blood
Blood transfusion / blood products
Injecting nasal drug use Sexual
vertical transmission Unsafe injections
Unidentified routes
29Estimating demographic history of HCV using the
coalescent
Population genetics of Hepatitis C in Egypt
- Egyptian HCV gene sequences
- n61
- E1 gene, 411bp
- All sequence contemporaneous
- Egypt has highest prevalence of HCV worldwide
(10-20) - But low prevalence in neighbouring states
- Why is Egypt so seriously affected?
- Parenteral antischistosomal therapy (PAT)
30Demographic model
Population genetics of Hepatitis C in Egypt
- The coalescent can be extended to model
deterministically varying populations. - The model we used was a const-exp-const model.
- A Bayesian MCMC method was developed to sample
the gene genealogy, the substitution model and
demographic function simultaneously.
31Estimated demographic history
Population genetics of Hepatitis C in Egypt
Based on a single tree
Averaged over all trees
32Parameter estimates
Population genetics of Hepatitis C in Egypt
33Uncertainty in parameter estimates
Population genetics of Hepatitis C in Egypt
Demographic parameters
Mutational parameters
Growth rate of the growth phase Grey box is the
prior
Rates at different codon positions, All
significantly different
34Full Bayesian Estimation
Population genetics of Hepatitis C in Egypt
- Marginalized over uncertainty in genealogy and
mutational processes - Yellow band represents the region over which PAT
was employed in Egypt
35Measurably evolving populations (MEPs)
Measurably evolving populations
- MEP pathogens
- HIV
- Hepatitis C
- Influenza A
- MEPs from ancient DNA
- Bison
- Brown Bears
- Adelie penguins
- Anything cold and numerous
- Even over short periods (less than a year) HIV
sequences can exhibit measurable evolutionary
change - Time-structure can not be ignored in our models
36Time structure in samples
Measurably evolving populations
time
Contemporary sample no time structure
Serial sample with time structure
1980
1990
2000
37Molecular evolution and population genetics of
MEPs
Measurably evolving populations
- Given sequence data that is time-structured
estimate true values of - substitution parameters
- Overall substitution rate and relative rates of
different substitutions - population history N(t)
- Ancestral genealogy
- Topology
- Coalescent times
m
Ne
time
A
B
C
D
E
38Molecular evolutionary model Felsensteins
likelihood (1981)
The probability of the sequence alignment,
can be efficiently calculated given a
tree and branch lengths (T), and a probabilistic
model of mutation represented by an instantaneous
rate matrix (Q). In phylogenetics, branch lengths
are usually unconstrained.
39Combining the coalescent with Felsensteins
likelihood
AC
b4
t2
AA
The molecular clock constraint
b1
b3
t3
b2
GA
b5
t4
AA
GA
AC
GC
GC
2n3 branch lengths
n1 waiting times
The joint posterior probability of the population
history (N), the genealogy (g) and the mutation
matrix (Q) are estimated using Markov chain Monte
Carlo (Drummond et al, Genetics, 2002)
40Full Bayesian Model
Measurably evolving populations
Probability of what we dont know given what we
do know.
Likelihood function
other priors
P(g, ?, Ne, Q D)
P(D g, ?, Q)fG(g Ne) f?(?)fN(Ne )fQ(Q)
Unknown normalizing constant
coalescent prior
Q substitution parameters Ne population
parameters g tree ? overall substitution rate
In the software package BEAST, MCMC integration
can be used to provide a chain of samples from
this density.
41HIV-1 (env) evolution in nine infected individuals
Measurably evolving populations
Shankarappa et al (1999)
42Molecular clock HIV-1 (env) evolution in 9
individuals
Measurably evolving populations
Shankarappa et al (1999)
43MEP Summary
Measurably evolving populations
- Most RNA viruses, including HCV and HIV are
measurably evolving - Most vertebrate populations that have
well-preserved recent fossil records are MEPs. - If sequence data comes from different times the
time-structure cant be ignored - Time structure permits the direct estimation of
- substitution rate
- Concerted changes in substitution rate
- coalescent times in calendar units
- Demographic function N(t) in calendar units
44Intermission
45What is HIV?
Population genetics of HIV
- HIV is a retrovirus.
- Within infected individuals HIV exhibits
extremely high genetic variability due to - Error-prone reverse transcriptase (RT) that
converts RNA to DNA (error rate is about one
mutation per genome per replication cycle). - DNA-dependent polymerase also error-prone
- High turnover of virus within infected individual
throughout infection.
46Patient 2 (Shankarappa et al, 1999)
Population genetics of HIV
0
0
- 210 sequences collected over a period of 9.5
years - 660 nucleotides from env C2-V5 region
- Effective population size and mutation rate were
co-estimated using Bayesian MCMC.
47A tree sampled from the posterior distribution
Population genetics of HIV
Ladder-like appearance
Lineage A
Lineage B
48Estimated substitution rate
Population genetics of HIV
- Patient 2
- 0.771.0 per year
- BUT.
- Long term rates in HIV
- Korber et al
- 0.24 (0.18-0.28) per year
- Only 1/4 of the intrapatient rate
49Bayesian MCMC of Shankarappa data
Measurably evolving populations
50Intra- and inter- patient rate estimates (C2V3
envelope)
Population genetics of HIV
p1 - p11
B
A
C
51Summary HIV intra-patient evolution
Population genetics of HIV
- HIV evolutionary rates appear to be faster
intra-patient then across pandemic - Different selection pressure at transmission?
- Transmitted viruses undergoing less rounds of
replication? - Latent viruses?
- Reversion of escape mutants?
- Effective population size is changing over time
(bottleneck in envelope at least)
52But how good is our best model?
Goodness-of-fit tests
- We can use standard statistical model-choice
criteria to choose between different models of
substitution and demography, but are any of the
models we consider any good at all? - One way to look at this is ask the following
question - Does our real data look anything like what we
would expect data from our model to look like? - So what aspect of the data should we look at?
- And what should we expect?
53We could look at branch length distributions
Goodness-of-fit tests
54Tree imbalance measures might also be interesting
Goodness-of-fit tests
4 cherries
3 cherries
2 cherries
55Posterior predictive simulation
Posterior predictive simulation
- A method of testing the goodness-of-fit of a
Bayesian model. - Run a Bayesian MCMC analysis on the data
- Calculate the value of your favourite summary
statistic, T(.) from the data, D - For each state in the chain
- Simulate a synthetic dataset, Di, using the
parameter values of state i. - Calculate T(Di) from the simulated data set.
- Compare the T(D) value with predictive
distribution of T(Di)
56So we need some summary statistics
Posterior predictive simulation
- Summary statistics that can be measured directly
from sequence alignment - Mean pairwise distance (?)
- Tajimas D
- Fu Lis D
- Number of segregating sites (S)
- Summary statistics that can be measured directly
from an genealogy - Genealogical mean pairwise distance (?)
- Genealogical Tajimas D
- Genealogical Fu Lis D
- Tree-imbalance statistics
- Age of the root
- Length of the tree
57Posterior predictive simulation (2)
Posterior predictive simulation
- Testing the goodness-of-fit of the neutral
coalescent model under variable demographic
functions. - Run a Bayesian MCMC analysis on the data
- For each state in the chain
- Simulate a coalescent genealogy (GiS) using the
population parameter values of state i. - Calculate T(GiS) from the ith simulated genealogy
- Calculate T(GiP) from the ith posterior genealogy
- Calculate the predictive probability by comparing
the posterior distribution of T(.) with
predictive distribution of T(.)
58Human influenza A (HA gene) trees
Goodness-of-fit tests
State 5m
State 10m
Posterior genealogy
Predictive simulations
59Human influenza A trees Genealogical Fu Lis
D statistic
Goodness-of-fit tests
60Puerto Rican Dengue-4 gene trees multivariate
summary statistics
Goodness-of-fit tests
61Results of test of neutrality
Goodness-of-fit tests
62Results for 28 HIV-1 infected individuals
Goodness-of-fit tests
63Is the population size constant?
Population genetics of HIV
Pop size
1000
mean
Ne / 30
100
lower
upper
10
0
20
40
60
80
100
120
months (post seroconversion)
Patient 2
64Virus population dynamics
Phylodynamics
Measles virus
Human influenza virus
65Dengue-4 Modeling complex demography
Phylodynamics
N(t) N0exp(-rt) -10566.421 N(t) scaled
translated case data -10478.572
Hospital case data courtesy of Shannon Bennett
66Population size changes
Population size changes
67The generalized skyline plot
Population size changes
- Visual framework for exploring the demographic
history of sampled DNA sequences - Input a single estimated ancestral genealogy (a
tree) - Output nonparametric plot of the population size
through time - Groups adjacent coalescent intervals
- Converts information within these intervals to
estimates of population size
Estimate of population size from single
coalescent interval
Estimate of population size from l adjacent
coalescent intervals.
68Examples
Generalized Skyline Plot
- I Constant population size
- N(t)N(0)
69Skyline Plot
Generalized Skyline Plot
- I Constant population size
- N(t)N(0)
II Exponential growth N(t)N(0)e-rt
70Skyline Plot
Generalized Skyline Plot
- III HIV-1 group M
- (tree estimated in Yusim et al (2001) Phil.
Trans. Roy. Soc. Lond. B 356 855-866) - Black curve is a parametric estimate obtained
from the same data under the expansion model - Results follow accepted demographic pattern for
the HIV pandemic
71The Bayesian skyline plot
Population size changes
Estimate a demographic function that has a
certain fixed number of steps (in this example
15) and then integrate over all possible
positions of the break points.
Explains the Dengue data quite well (test of
neutrality do not reject the data if we use the
Bayesian skyline plot to describe the demographic
history.
72Prior/Model population is auto-correlated
through time
Population size changes
73Validating the Bayesian skyline plot (1)
Population size changes
Simulated data Constant population
Simulated data Exponential growth
74Validating the Bayesian skyline plot (2)
Population size changes
75Comparing Bayesian skyline plot of Dengue-4 with
incidence data
Population size changes
76Example of Bayesian skyline plot
Population size changes
(1920-1980) Anti-schistosomal needle-based
treatment
Effective population size jumped from 300 to 100
to 10,000
77Comparison to parametric model
Population size changes
78http//evolve.zoo.ox.ac.uk/BEAST
79Coalescent with population structure
Structured populations
80Population subdivision - two demes
Structured populations
81Population subdivision - two demes
Structured populations
82Stepping stone model of subdivision
Structured populations
83Human migration
Structured populations
From Cavalli-Sforza,2001
84Simplified model of human evolution
Structured populations
Past
0.2
Mutation rate 2.5
Rate of common ancestry 1
Present
Africa
Non-Africa
85Why Bayesian?
- Probabilistic model-based inference
- Can make simple statements about the probability
of alternative hypotheses given the data - Markov chain Monte Carlo
- Convenient computational technique
- Allows for complex models if you can simulate
you can sample - Incorporates prior probabilities
- P(?D) ? P(D ?)P(?)
- Convenient means of assessing alternative sets of
assumptions - Allows incorporation of independent sources of
information - Easy to include sources of uncertainty
- Dont need to assume perfect knowledge of tree
(for example) - Can treat the tree and a nuisance parameter and
focus on parameters of interest (strength of
selection, mutation rate, growth rate, etc)
86Conclusions cautionary remarks
- Bayesian MCMC has advantages
- a useful tool for exploring prior hypotheses
- Good for assessing levels of uncertainty
- Complex models can be investigated on practical
datasets - Bayesian MCMC has disadvantages
- Diagnostics are difficult, and it is essentially
impossible to guarantee correctness - Model comparison can be difficult
- Requires large programs that are difficult to
optimize and debug.
87Conclusions cautionary remarks (2)
- Population genetics has advantages
- provides a framework for objective analysis of
genetic data - Allows interpretation of genetic data in terms of
biological properties of virus - Can be extended to include selection,
recombination et cetera - Population genetics has disadvantages
- Models are still too simple
- Assumptions are too strong
- Extending to complex models that include changing
selection pressures and recombination are
possible in MCMC but still very difficult!