The Coalescent and Measurably Evolving Populations

About This Presentation

Title:

The Coalescent and Measurably Evolving Populations

Description:

The genealogy is an edge graph Eg and a vector of times t. ... Given a genealogy the most likely population size can be estimated. ... – PowerPoint PPT presentation

Number of Views:224

Avg rating:3.0/5.0

Slides: 88

Provided by: alexeid

Learn more at: http://archive.dimacs.rutgers.edu

Category:

more less

Transcript and Presenter's Notes

Title: The Coalescent and Measurably Evolving Populations

1
The Coalescent and Measurably Evolving Populations

Alexei Drummond
Department of Computer Science
University of Auckland, NZ

2
Overview

Introduction to the Coalescent
Hepatitis C in Egypt
An example using the coalescent
Measurably evolving populations
HIV-1 evolution within and among hosts
An example using MEP concepts
Summary Conclusions

3
The coalescent

The coalescent is a model of the ancestral
relationships of a small sample of individuals
taken from a large background population.
The coalescent describes a probability
distribution on ancestral genealogies (trees)
given a population history.
Therefore the coalescent can convert information
from ancestral genealogies into information about
population history and vice versa.
The coalescent is a model of ancestral
genealogies, not sequences, and its simplest form
assumes neutral evolution.

4
The history of coalescent theory

1930-40s Genealogical arguments well known to
Wright Fisher
1964 Crow Kimura Infinite Allele Model
1966 (Hubby Lewontin) (Harris) make first
surveys of population allele variation by protein
electrophoresis
1968 Motoo Kimura proposes neutral explanation
of molecular evolution population variation.
So do King Jukes
1971 Kimura Ohta proposes infinite sites
model.
1975 Watterson makes explicit use of The
Coalescent
1982 Kingman introduces The Coalescent.
1983 Hudson introduces The Coalescent with
Recombination
1983 Kreitman publishes first major population
sequences.
1987 Cann et al. traces human origin and
migrations with mitochondrial DNA.

5
The history of coalescent theory

1988 Hughes Nei Genes with positive Darwinian
Selection.
1989-90 Kaplan, Hudson, Takahata and others
Selection regimes with coalescent structure (MHC,
Incompatibility alleles).
1991 MacDonald Kreitman Data with surplus of
replacement interspecific substitutions.
1994-95 Griffiths-Tavaré Kuhner-Yamoto-Felsenst
ein introduces sampling techniques to estimate
parameters in population models.
1997-98 Krone-Neuhauser introduces Ancestral
Selection Graph
1999 Wiuf Donnelly uses coalescent theory to
estimate age of disease allele
2000 Wiuf et al. introduces gene conversion
into coalescent.
2000- A flood of SNP data haplotypes are on
their way.

6
Population processes
COALESCENT THEORY
Genealogy
7
Coalescent inference
Randomly sample individuals from population
Obtain gene sequences from sampled individuals
Reconstruct tree / trees from sequences
Infer coalescent results directly from sequences
Infer coalescent results from tree / trees
8
Demographic History

Change in population size through time
Applications include
Estimating history of human populations
Conservation biology
Reconstructing infectious disease epidemics
Investigating viral dynamics within hosts

9
Idealized Wright-Fisher populations
Diploid
Haploid
10
Random mating in an ideal population

A constant population size of N individuals
Each individual in the new generation chooses
its parent from the previous generation at random

11
Genetic drift extinction and ancestry
If you trace the ancestry of a sample of
individuals back in time you inevitably reach a
single most recent common ancestor. If you pick a
random individual and trace their descendents
forward in time, all the descendents of that
individual will with high probability eventually
die out.
12
A sample genealogy from an idealized
Wright-Fisher population
Past
A sample genealogy of 3 sequences from a
population (N 10).
Discrete Generations
Present
13
The coalescent distributions and expectations on
a sample genealogy
Past
Present
14
The coalescent probability density distribution
Past
Present
The genealogy is an edge graph Eg and a vector of
times t.
Kingman (1982a,b)
15
The coalescent estimating population size from a
sample genealogy
Past
Present
Felsenstein (1992)
16
The coalescent estimating population size
confidence limits via ML
The confidence intervals are calculated from the
curvature of the likelihood. For a single
parameter model the 95 confidence limits are
defined by the points where the log-likelihood
drops 1.92 log-units below the maximum
log-likelihood.
Maximum likelihood can be used to estimate
population size by choosing a population size
that maximizes the probability of the observed
coalescent waiting times.
17
The coalescent shapes of gene genealogies
The Coalescent

Exponential growth
Constant size
The coalescent can be used to convert coalescent
times into knowledge about population size and
its change though time.
18
Constant population size N(t)N0
The Coalescent

small N0
large N0
TIME
19
Coalescent and serial samples
The Coalescent

Constant population
Exponential growth
20
Uncertainty in Genealogies
The Coalescent

How similar are these two trees? Both of them are
plausible given the data. We can use MCMC to get
the average result over all plausible trees,
21
Coalescent Summary
The Coalescent

The coalescent provides a theory of how
population size is related to the distribution of
coalescent events in a tree.
Big populations have old trees
Exponentially growing populations have star-like
trees
Given a genealogy the most likely population size
can be estimated.
MCMC can be used to get a distribution of trees
from which a distribution of population sizes can
be estimated.

22
Markov chain Monte Carlo (MCMC)
MCMC

Imagine you would like to estimate two parameters
(?,?) from some data (D).
You want to find values of ? and ? that have high
probability given the data p(?,?D)
Say you have a likelihood function of the form
PrD ?,?
Bayes rule tells us that
p(?,?D) PrD ?,?p(?,?) / PrD
So that p(?,?D) ? PrD ?,?p(?,?)

23
Markov chain Monte Carlo (MCMC)
MCMC

p(?,?D) is called the posterior probability
(density) of ?,? given D
In an ideal world we want to know the posterior
density for all possible values of ?,?.
Then we could pick a credible region in two
dimensions that contained values of ?,? that
account for the majority of the posterior
probability mass.
This credible region would serve as an estimate
that includes incorporates our uncertainty and
this credible set could be used to address
hypotheses like ? is greater than x.
In reality we have to make due with a sample of
the posterior - so that we evaluate p(?,?D) for
a finite number (say 10,000,000) pairs of ?,?.
So which pairs should we choose?

24
Markov chain Monte Carlo (MCMC)
MCMC

Lets construct a random walk in 2-dimensional
space
In each step of the random walk we propose to
make an (unbiased) small jump from our current
position (?,?) to a new position (?,?)
If p(?,?D) gt p(?,?D) then we make the
proposed jump
However, if p(?,?D) lt p(?,?D), then we make
the proposed jump with probability ? p(?,?D)
/ p(?,?D), otherwise we stay where we are.
It can be shown (trust me!) that if you proceed
in this fashion for an infinite time then the
equilibrium distribution of this random walk will
be p(?,?D)!
That is, the random walk will visit a particular
region ?0, ?1 x ?0, ?1 of the state space
this often

25
Markov chain Monte Carlo (MCMC)
MCMC
26
Hepatitis C Virus (HCV)
Population genetics of Hepatitis C in Egypt

Identified in 1989
9.6kb single-stranded RNA genome
Polyprotein cleaved by proteases
No efficient tissue culture system

27
How important is HCV?
Population genetics of Hepatitis C in Egypt

170m infected
80 infections are chronic
Liver cirrhosis cancer risk
10,000 deaths per year in USA
No protective immunity?

28
HCV Transmission
Population genetics of Hepatitis C in Egypt

Percutaneous exposure to infected blood
Blood transfusion / blood products
Injecting nasal drug use Sexual
vertical transmission Unsafe injections
Unidentified routes
29
Estimating demographic history of HCV using the
coalescent
Population genetics of Hepatitis C in Egypt

Egyptian HCV gene sequences
n61
E1 gene, 411bp
All sequence contemporaneous
Egypt has highest prevalence of HCV worldwide
(10-20)
But low prevalence in neighbouring states
Why is Egypt so seriously affected?
Parenteral antischistosomal therapy (PAT)

30
Demographic model
Population genetics of Hepatitis C in Egypt

The coalescent can be extended to model
deterministically varying populations.
The model we used was a const-exp-const model.
A Bayesian MCMC method was developed to sample
the gene genealogy, the substitution model and
demographic function simultaneously.

31
Estimated demographic history
Population genetics of Hepatitis C in Egypt

Based on a single tree
Averaged over all trees
32
Parameter estimates
Population genetics of Hepatitis C in Egypt

33
Uncertainty in parameter estimates
Population genetics of Hepatitis C in Egypt

Demographic parameters
Mutational parameters
Growth rate of the growth phase Grey box is the
prior
Rates at different codon positions, All
significantly different
34
Full Bayesian Estimation
Population genetics of Hepatitis C in Egypt

Marginalized over uncertainty in genealogy and
mutational processes
Yellow band represents the region over which PAT
was employed in Egypt

35
Measurably evolving populations (MEPs)
Measurably evolving populations

MEP pathogens
HIV
Hepatitis C
Influenza A
MEPs from ancient DNA
Bison
Brown Bears
Adelie penguins
Anything cold and numerous
Even over short periods (less than a year) HIV
sequences can exhibit measurable evolutionary
change
Time-structure can not be ignored in our models

36
Time structure in samples
Measurably evolving populations

time
Contemporary sample no time structure
Serial sample with time structure
1980
1990
2000
37
Molecular evolution and population genetics of
MEPs
Measurably evolving populations

Given sequence data that is time-structured
estimate true values of
substitution parameters
Overall substitution rate and relative rates of
different substitutions
population history N(t)
Ancestral genealogy
Topology
Coalescent times

m
Ne
time
A
B
C
D
E
38
Molecular evolutionary model Felsensteins
likelihood (1981)
The probability of the sequence alignment,
can be efficiently calculated given a
tree and branch lengths (T), and a probabilistic
model of mutation represented by an instantaneous
rate matrix (Q). In phylogenetics, branch lengths
are usually unconstrained.
39
Combining the coalescent with Felsensteins
likelihood
AC
b4
t2
AA
The molecular clock constraint
b1
b3
t3
b2
GA
b5
t4
AA
GA
AC
GC
GC
2n3 branch lengths
n1 waiting times
The joint posterior probability of the population
history (N), the genealogy (g) and the mutation
matrix (Q) are estimated using Markov chain Monte
Carlo (Drummond et al, Genetics, 2002)
40
Full Bayesian Model
Measurably evolving populations

Probability of what we dont know given what we
do know.
Likelihood function
other priors
P(g, ?, Ne, Q D)
P(D g, ?, Q)fG(g Ne) f?(?)fN(Ne )fQ(Q)

Unknown normalizing constant
coalescent prior
Q substitution parameters Ne population
parameters g tree ? overall substitution rate
In the software package BEAST, MCMC integration
can be used to provide a chain of samples from
this density.
41
HIV-1 (env) evolution in nine infected individuals
Measurably evolving populations

Shankarappa et al (1999)
42
Molecular clock HIV-1 (env) evolution in 9
individuals
Measurably evolving populations

Shankarappa et al (1999)
43
MEP Summary
Measurably evolving populations

Most RNA viruses, including HCV and HIV are
measurably evolving
Most vertebrate populations that have
well-preserved recent fossil records are MEPs.
If sequence data comes from different times the
time-structure cant be ignored
Time structure permits the direct estimation of
substitution rate
Concerted changes in substitution rate
coalescent times in calendar units
Demographic function N(t) in calendar units

44
Intermission

My brain is fried!

45
What is HIV?
Population genetics of HIV

HIV is a retrovirus.
Within infected individuals HIV exhibits
extremely high genetic variability due to
Error-prone reverse transcriptase (RT) that
converts RNA to DNA (error rate is about one
mutation per genome per replication cycle).
DNA-dependent polymerase also error-prone
High turnover of virus within infected individual
throughout infection.

46
Patient 2 (Shankarappa et al, 1999)
Population genetics of HIV

0
0

210 sequences collected over a period of 9.5
years
660 nucleotides from env C2-V5 region
Effective population size and mutation rate were
co-estimated using Bayesian MCMC.

47
A tree sampled from the posterior distribution
Population genetics of HIV

Ladder-like appearance
Lineage A
Lineage B
48
Estimated substitution rate
Population genetics of HIV

Patient 2
0.771.0 per year
BUT.
Long term rates in HIV
Korber et al
0.24 (0.18-0.28) per year
Only 1/4 of the intrapatient rate

49
Bayesian MCMC of Shankarappa data
Measurably evolving populations

50
Intra- and inter- patient rate estimates (C2V3
envelope)
Population genetics of HIV

p1 - p11
B
A
C
51
Summary HIV intra-patient evolution
Population genetics of HIV

HIV evolutionary rates appear to be faster
intra-patient then across pandemic
Different selection pressure at transmission?
Transmitted viruses undergoing less rounds of
replication?
Latent viruses?
Reversion of escape mutants?
Effective population size is changing over time
(bottleneck in envelope at least)

52
But how good is our best model?
Goodness-of-fit tests

We can use standard statistical model-choice
criteria to choose between different models of
substitution and demography, but are any of the
models we consider any good at all?
One way to look at this is ask the following
question
Does our real data look anything like what we
would expect data from our model to look like?
So what aspect of the data should we look at?
And what should we expect?

53
We could look at branch length distributions
Goodness-of-fit tests
54
Tree imbalance measures might also be interesting
Goodness-of-fit tests
4 cherries
3 cherries
2 cherries
55
Posterior predictive simulation
Posterior predictive simulation

A method of testing the goodness-of-fit of a
Bayesian model.
Run a Bayesian MCMC analysis on the data
Calculate the value of your favourite summary
statistic, T(.) from the data, D
For each state in the chain
Simulate a synthetic dataset, Di, using the
parameter values of state i.
Calculate T(Di) from the simulated data set.
Compare the T(D) value with predictive
distribution of T(Di)

56
So we need some summary statistics
Posterior predictive simulation

Summary statistics that can be measured directly
from sequence alignment
Mean pairwise distance (?)
Tajimas D
Fu Lis D
Number of segregating sites (S)

Summary statistics that can be measured directly
from an genealogy
Genealogical mean pairwise distance (?)
Genealogical Tajimas D
Genealogical Fu Lis D
Tree-imbalance statistics
Age of the root
Length of the tree

57
Posterior predictive simulation (2)
Posterior predictive simulation

Testing the goodness-of-fit of the neutral
coalescent model under variable demographic
functions.
Run a Bayesian MCMC analysis on the data
For each state in the chain
Simulate a coalescent genealogy (GiS) using the
population parameter values of state i.
Calculate T(GiS) from the ith simulated genealogy
Calculate T(GiP) from the ith posterior genealogy
Calculate the predictive probability by comparing
the posterior distribution of T(.) with
predictive distribution of T(.)

58
Human influenza A (HA gene) trees
Goodness-of-fit tests
State 5m
State 10m
Posterior genealogy
Predictive simulations
59
Human influenza A trees Genealogical Fu Lis
D statistic
Goodness-of-fit tests
60
Puerto Rican Dengue-4 gene trees multivariate
summary statistics
Goodness-of-fit tests
61
Results of test of neutrality
Goodness-of-fit tests
62
Results for 28 HIV-1 infected individuals
Goodness-of-fit tests
63
Is the population size constant?
Population genetics of HIV

Pop size
1000
mean
Ne / 30
100
lower
upper
10
0
20
40
60
80
100
120
months (post seroconversion)
Patient 2
64
Virus population dynamics
Phylodynamics

Measles virus
Human influenza virus
65
Dengue-4 Modeling complex demography
Phylodynamics

N(t) N0exp(-rt) -10566.421 N(t) scaled
translated case data -10478.572
Hospital case data courtesy of Shannon Bennett
66
Population size changes
Population size changes
67
The generalized skyline plot
Population size changes

Visual framework for exploring the demographic
history of sampled DNA sequences
Input a single estimated ancestral genealogy (a
tree)
Output nonparametric plot of the population size
through time
Groups adjacent coalescent intervals
Converts information within these intervals to
estimates of population size

Estimate of population size from single
coalescent interval
Estimate of population size from l adjacent
coalescent intervals.
68
Examples
Generalized Skyline Plot

I Constant population size
N(t)N(0)

69
Skyline Plot
Generalized Skyline Plot

I Constant population size
N(t)N(0)

II Exponential growth N(t)N(0)e-rt
70
Skyline Plot
Generalized Skyline Plot

III HIV-1 group M
(tree estimated in Yusim et al (2001) Phil.
Trans. Roy. Soc. Lond. B 356 855-866)
Black curve is a parametric estimate obtained
from the same data under the expansion model
Results follow accepted demographic pattern for
the HIV pandemic

71
The Bayesian skyline plot
Population size changes
Estimate a demographic function that has a
certain fixed number of steps (in this example
15) and then integrate over all possible
positions of the break points.
Explains the Dengue data quite well (test of
neutrality do not reject the data if we use the
Bayesian skyline plot to describe the demographic
history.
72
Prior/Model population is auto-correlated
through time
Population size changes
73
Validating the Bayesian skyline plot (1)
Population size changes
Simulated data Constant population
Simulated data Exponential growth
74
Validating the Bayesian skyline plot (2)
Population size changes
75
Comparing Bayesian skyline plot of Dengue-4 with
incidence data
Population size changes
76
Example of Bayesian skyline plot
Population size changes
(1920-1980) Anti-schistosomal needle-based
treatment
Effective population size jumped from 300 to 100
to 10,000
77
Comparison to parametric model
Population size changes
78
http//evolve.zoo.ox.ac.uk/BEAST
79
Coalescent with population structure
Structured populations
80
Population subdivision - two demes
Structured populations
81
Population subdivision - two demes
Structured populations
82
Stepping stone model of subdivision
Structured populations
83
Human migration
Structured populations
From Cavalli-Sforza,2001
84
Simplified model of human evolution
Structured populations
Past
0.2
Mutation rate 2.5
Rate of common ancestry 1
Present
Africa
Non-Africa
85
Why Bayesian?

Probabilistic model-based inference
Can make simple statements about the probability
of alternative hypotheses given the data
Markov chain Monte Carlo
Convenient computational technique
Allows for complex models if you can simulate
you can sample
Incorporates prior probabilities
P(?D) ? P(D ?)P(?)
Convenient means of assessing alternative sets of
assumptions
Allows incorporation of independent sources of
information
Easy to include sources of uncertainty
Dont need to assume perfect knowledge of tree
(for example)
Can treat the tree and a nuisance parameter and
focus on parameters of interest (strength of
selection, mutation rate, growth rate, etc)

86
Conclusions cautionary remarks

Bayesian MCMC has advantages
a useful tool for exploring prior hypotheses
Good for assessing levels of uncertainty
Complex models can be investigated on practical
datasets
Bayesian MCMC has disadvantages
Diagnostics are difficult, and it is essentially
impossible to guarantee correctness
Model comparison can be difficult
Requires large programs that are difficult to
optimize and debug.

87
Conclusions cautionary remarks (2)

Population genetics has advantages
provides a framework for objective analysis of
genetic data
Allows interpretation of genetic data in terms of
biological properties of virus
Can be extended to include selection,
recombination et cetera
Population genetics has disadvantages
Models are still too simple
Assumptions are too strong
Extending to complex models that include changing
selection pressures and recombination are
possible in MCMC but still very difficult!