The Coalescent and Measurably Evolving Populations - PowerPoint PPT Presentation

About This Presentation
Title:

The Coalescent and Measurably Evolving Populations

Description:

The genealogy is an edge graph Eg and a vector of times t. ... Given a genealogy the most likely population size can be estimated. ... – PowerPoint PPT presentation

Number of Views:224
Avg rating:3.0/5.0
Slides: 88
Provided by: alexeid
Category:

less

Transcript and Presenter's Notes

Title: The Coalescent and Measurably Evolving Populations


1
The Coalescent and Measurably Evolving Populations
  • Alexei Drummond
  • Department of Computer Science
  • University of Auckland, NZ

2
Overview
  • Introduction to the Coalescent
  • Hepatitis C in Egypt
  • An example using the coalescent
  • Measurably evolving populations
  • HIV-1 evolution within and among hosts
  • An example using MEP concepts
  • Summary Conclusions

3
The coalescent
  • The coalescent is a model of the ancestral
    relationships of a small sample of individuals
    taken from a large background population.
  • The coalescent describes a probability
    distribution on ancestral genealogies (trees)
    given a population history.
  • Therefore the coalescent can convert information
    from ancestral genealogies into information about
    population history and vice versa.
  • The coalescent is a model of ancestral
    genealogies, not sequences, and its simplest form
    assumes neutral evolution.

4
The history of coalescent theory
  • 1930-40s Genealogical arguments well known to
    Wright Fisher
  • 1964 Crow Kimura Infinite Allele Model
  • 1966 (Hubby Lewontin) (Harris) make first
    surveys of population allele variation by protein
    electrophoresis
  • 1968 Motoo Kimura proposes neutral explanation
    of molecular evolution population variation.
    So do King Jukes
  • 1971 Kimura Ohta proposes infinite sites
    model.
  • 1975 Watterson makes explicit use of The
    Coalescent
  • 1982 Kingman introduces The Coalescent.
  • 1983 Hudson introduces The Coalescent with
    Recombination
  • 1983 Kreitman publishes first major population
    sequences.
  • 1987 Cann et al. traces human origin and
    migrations with mitochondrial DNA.

5
The history of coalescent theory
  • 1988 Hughes Nei Genes with positive Darwinian
    Selection.
  • 1989-90 Kaplan, Hudson, Takahata and others
    Selection regimes with coalescent structure (MHC,
    Incompatibility alleles).
  • 1991 MacDonald Kreitman Data with surplus of
    replacement interspecific substitutions.
  • 1994-95 Griffiths-Tavaré Kuhner-Yamoto-Felsenst
    ein introduces sampling techniques to estimate
    parameters in population models.
  • 1997-98 Krone-Neuhauser introduces Ancestral
    Selection Graph
  • 1999 Wiuf Donnelly uses coalescent theory to
    estimate age of disease allele
  • 2000 Wiuf et al. introduces gene conversion
    into coalescent.
  • 2000- A flood of SNP data haplotypes are on
    their way.

6
Population processes
COALESCENT THEORY
Genealogy
7
Coalescent inference
Randomly sample individuals from population
Obtain gene sequences from sampled individuals
Reconstruct tree / trees from sequences
Infer coalescent results directly from sequences
Infer coalescent results from tree / trees
8
Demographic History
  • Change in population size through time
  • Applications include
  • Estimating history of human populations
  • Conservation biology
  • Reconstructing infectious disease epidemics
  • Investigating viral dynamics within hosts

9
Idealized Wright-Fisher populations
Diploid
Haploid
10
Random mating in an ideal population
  • A constant population size of N individuals
  • Each individual in the new generation chooses
    its parent from the previous generation at random

11
Genetic drift extinction and ancestry
If you trace the ancestry of a sample of
individuals back in time you inevitably reach a
single most recent common ancestor. If you pick a
random individual and trace their descendents
forward in time, all the descendents of that
individual will with high probability eventually
die out.
12
A sample genealogy from an idealized
Wright-Fisher population
Past
A sample genealogy of 3 sequences from a
population (N 10).
Discrete Generations
Present
13
The coalescent distributions and expectations on
a sample genealogy
Past
Present
14
The coalescent probability density distribution
Past
Present
The genealogy is an edge graph Eg and a vector of
times t.
Kingman (1982a,b)
15
The coalescent estimating population size from a
sample genealogy
Past
Present
Felsenstein (1992)
16
The coalescent estimating population size
confidence limits via ML
The confidence intervals are calculated from the
curvature of the likelihood. For a single
parameter model the 95 confidence limits are
defined by the points where the log-likelihood
drops 1.92 log-units below the maximum
log-likelihood.
Maximum likelihood can be used to estimate
population size by choosing a population size
that maximizes the probability of the observed
coalescent waiting times.
17
The coalescent shapes of gene genealogies
The Coalescent

Exponential growth
Constant size
The coalescent can be used to convert coalescent
times into knowledge about population size and
its change though time.
18
Constant population size N(t)N0
The Coalescent

small N0
large N0
TIME
19
Coalescent and serial samples
The Coalescent

Constant population
Exponential growth
20
Uncertainty in Genealogies
The Coalescent

How similar are these two trees? Both of them are
plausible given the data. We can use MCMC to get
the average result over all plausible trees,
21
Coalescent Summary
The Coalescent
  • The coalescent provides a theory of how
    population size is related to the distribution of
    coalescent events in a tree.
  • Big populations have old trees
  • Exponentially growing populations have star-like
    trees
  • Given a genealogy the most likely population size
    can be estimated.
  • MCMC can be used to get a distribution of trees
    from which a distribution of population sizes can
    be estimated.

22
Markov chain Monte Carlo (MCMC)
MCMC
  • Imagine you would like to estimate two parameters
    (?,?) from some data (D).
  • You want to find values of ? and ? that have high
    probability given the data p(?,?D)
  • Say you have a likelihood function of the form
    PrD ?,?
  • Bayes rule tells us that
  • p(?,?D) PrD ?,?p(?,?) / PrD
  • So that p(?,?D) ? PrD ?,?p(?,?)

23
Markov chain Monte Carlo (MCMC)
MCMC
  • p(?,?D) is called the posterior probability
    (density) of ?,? given D
  • In an ideal world we want to know the posterior
    density for all possible values of ?,?.
  • Then we could pick a credible region in two
    dimensions that contained values of ?,? that
    account for the majority of the posterior
    probability mass.
  • This credible region would serve as an estimate
    that includes incorporates our uncertainty and
    this credible set could be used to address
    hypotheses like ? is greater than x.
  • In reality we have to make due with a sample of
    the posterior - so that we evaluate p(?,?D) for
    a finite number (say 10,000,000) pairs of ?,?.
  • So which pairs should we choose?

24
Markov chain Monte Carlo (MCMC)
MCMC
  • Lets construct a random walk in 2-dimensional
    space
  • In each step of the random walk we propose to
    make an (unbiased) small jump from our current
    position (?,?) to a new position (?,?)
  • If p(?,?D) gt p(?,?D) then we make the
    proposed jump
  • However, if p(?,?D) lt p(?,?D), then we make
    the proposed jump with probability ? p(?,?D)
    / p(?,?D), otherwise we stay where we are.
  • It can be shown (trust me!) that if you proceed
    in this fashion for an infinite time then the
    equilibrium distribution of this random walk will
    be p(?,?D)!
  • That is, the random walk will visit a particular
    region ?0, ?1 x ?0, ?1 of the state space
    this often

25
Markov chain Monte Carlo (MCMC)
MCMC
26
Hepatitis C Virus (HCV)
Population genetics of Hepatitis C in Egypt
  • Identified in 1989
  • 9.6kb single-stranded RNA genome
  • Polyprotein cleaved by proteases
  • No efficient tissue culture system

27
How important is HCV?
Population genetics of Hepatitis C in Egypt
  • 170m infected
  • 80 infections are chronic
  • Liver cirrhosis cancer risk
  • 10,000 deaths per year in USA
  • No protective immunity?

28
HCV Transmission
Population genetics of Hepatitis C in Egypt

Percutaneous exposure to infected blood
Blood transfusion / blood products
Injecting nasal drug use Sexual
vertical transmission Unsafe injections
Unidentified routes
29
Estimating demographic history of HCV using the
coalescent
Population genetics of Hepatitis C in Egypt
  • Egyptian HCV gene sequences
  • n61
  • E1 gene, 411bp
  • All sequence contemporaneous
  • Egypt has highest prevalence of HCV worldwide
    (10-20)
  • But low prevalence in neighbouring states
  • Why is Egypt so seriously affected?
  • Parenteral antischistosomal therapy (PAT)

30
Demographic model
Population genetics of Hepatitis C in Egypt
  • The coalescent can be extended to model
    deterministically varying populations.
  • The model we used was a const-exp-const model.
  • A Bayesian MCMC method was developed to sample
    the gene genealogy, the substitution model and
    demographic function simultaneously.

31
Estimated demographic history
Population genetics of Hepatitis C in Egypt

Based on a single tree
Averaged over all trees
32
Parameter estimates
Population genetics of Hepatitis C in Egypt

33
Uncertainty in parameter estimates
Population genetics of Hepatitis C in Egypt

Demographic parameters
Mutational parameters
Growth rate of the growth phase Grey box is the
prior
Rates at different codon positions, All
significantly different
34
Full Bayesian Estimation
Population genetics of Hepatitis C in Egypt
  • Marginalized over uncertainty in genealogy and
    mutational processes
  • Yellow band represents the region over which PAT
    was employed in Egypt

35
Measurably evolving populations (MEPs)
Measurably evolving populations
  • MEP pathogens
  • HIV
  • Hepatitis C
  • Influenza A
  • MEPs from ancient DNA
  • Bison
  • Brown Bears
  • Adelie penguins
  • Anything cold and numerous
  • Even over short periods (less than a year) HIV
    sequences can exhibit measurable evolutionary
    change
  • Time-structure can not be ignored in our models

36
Time structure in samples
Measurably evolving populations

time
Contemporary sample no time structure
Serial sample with time structure
1980
1990
2000
37
Molecular evolution and population genetics of
MEPs
Measurably evolving populations
  • Given sequence data that is time-structured
    estimate true values of
  • substitution parameters
  • Overall substitution rate and relative rates of
    different substitutions
  • population history N(t)
  • Ancestral genealogy
  • Topology
  • Coalescent times

m
Ne
time
A
B
C
D
E
38
Molecular evolutionary model Felsensteins
likelihood (1981)
The probability of the sequence alignment,
can be efficiently calculated given a
tree and branch lengths (T), and a probabilistic
model of mutation represented by an instantaneous
rate matrix (Q). In phylogenetics, branch lengths
are usually unconstrained.
39
Combining the coalescent with Felsensteins
likelihood
AC
b4
t2
AA
The molecular clock constraint
b1
b3
t3
b2
GA
b5
t4
AA
GA
AC
GC
GC
2n3 branch lengths
n1 waiting times
The joint posterior probability of the population
history (N), the genealogy (g) and the mutation
matrix (Q) are estimated using Markov chain Monte
Carlo (Drummond et al, Genetics, 2002)
40
Full Bayesian Model
Measurably evolving populations

Probability of what we dont know given what we
do know.
Likelihood function
other priors
P(g, ?, Ne, Q D)
P(D g, ?, Q)fG(g Ne) f?(?)fN(Ne )fQ(Q)

Unknown normalizing constant
coalescent prior
Q substitution parameters Ne population
parameters g tree ? overall substitution rate
In the software package BEAST, MCMC integration
can be used to provide a chain of samples from
this density.
41
HIV-1 (env) evolution in nine infected individuals
Measurably evolving populations

Shankarappa et al (1999)
42
Molecular clock HIV-1 (env) evolution in 9
individuals
Measurably evolving populations

Shankarappa et al (1999)
43
MEP Summary
Measurably evolving populations
  • Most RNA viruses, including HCV and HIV are
    measurably evolving
  • Most vertebrate populations that have
    well-preserved recent fossil records are MEPs.
  • If sequence data comes from different times the
    time-structure cant be ignored
  • Time structure permits the direct estimation of
  • substitution rate
  • Concerted changes in substitution rate
  • coalescent times in calendar units
  • Demographic function N(t) in calendar units

44
Intermission
  • My brain is fried!

45
What is HIV?
Population genetics of HIV
  • HIV is a retrovirus.
  • Within infected individuals HIV exhibits
    extremely high genetic variability due to
  • Error-prone reverse transcriptase (RT) that
    converts RNA to DNA (error rate is about one
    mutation per genome per replication cycle).
  • DNA-dependent polymerase also error-prone
  • High turnover of virus within infected individual
    throughout infection.

46
Patient 2 (Shankarappa et al, 1999)
Population genetics of HIV

0
0
  • 210 sequences collected over a period of 9.5
    years
  • 660 nucleotides from env C2-V5 region
  • Effective population size and mutation rate were
    co-estimated using Bayesian MCMC.

47
A tree sampled from the posterior distribution
Population genetics of HIV

Ladder-like appearance
Lineage A
Lineage B
48
Estimated substitution rate
Population genetics of HIV
  • Patient 2
  • 0.771.0 per year
  • BUT.
  • Long term rates in HIV
  • Korber et al
  • 0.24 (0.18-0.28) per year
  • Only 1/4 of the intrapatient rate

49
Bayesian MCMC of Shankarappa data
Measurably evolving populations

50
Intra- and inter- patient rate estimates (C2V3
envelope)
Population genetics of HIV

p1 - p11
B
A
C
51
Summary HIV intra-patient evolution
Population genetics of HIV
  • HIV evolutionary rates appear to be faster
    intra-patient then across pandemic
  • Different selection pressure at transmission?
  • Transmitted viruses undergoing less rounds of
    replication?
  • Latent viruses?
  • Reversion of escape mutants?
  • Effective population size is changing over time
    (bottleneck in envelope at least)

52
But how good is our best model?
Goodness-of-fit tests
  • We can use standard statistical model-choice
    criteria to choose between different models of
    substitution and demography, but are any of the
    models we consider any good at all?
  • One way to look at this is ask the following
    question
  • Does our real data look anything like what we
    would expect data from our model to look like?
  • So what aspect of the data should we look at?
  • And what should we expect?

53
We could look at branch length distributions
Goodness-of-fit tests
54
Tree imbalance measures might also be interesting
Goodness-of-fit tests
4 cherries
3 cherries
2 cherries
55
Posterior predictive simulation
Posterior predictive simulation
  • A method of testing the goodness-of-fit of a
    Bayesian model.
  • Run a Bayesian MCMC analysis on the data
  • Calculate the value of your favourite summary
    statistic, T(.) from the data, D
  • For each state in the chain
  • Simulate a synthetic dataset, Di, using the
    parameter values of state i.
  • Calculate T(Di) from the simulated data set.
  • Compare the T(D) value with predictive
    distribution of T(Di)

56
So we need some summary statistics
Posterior predictive simulation
  • Summary statistics that can be measured directly
    from sequence alignment
  • Mean pairwise distance (?)
  • Tajimas D
  • Fu Lis D
  • Number of segregating sites (S)
  • Summary statistics that can be measured directly
    from an genealogy
  • Genealogical mean pairwise distance (?)
  • Genealogical Tajimas D
  • Genealogical Fu Lis D
  • Tree-imbalance statistics
  • Age of the root
  • Length of the tree

57
Posterior predictive simulation (2)
Posterior predictive simulation
  • Testing the goodness-of-fit of the neutral
    coalescent model under variable demographic
    functions.
  • Run a Bayesian MCMC analysis on the data
  • For each state in the chain
  • Simulate a coalescent genealogy (GiS) using the
    population parameter values of state i.
  • Calculate T(GiS) from the ith simulated genealogy
  • Calculate T(GiP) from the ith posterior genealogy
  • Calculate the predictive probability by comparing
    the posterior distribution of T(.) with
    predictive distribution of T(.)

58
Human influenza A (HA gene) trees
Goodness-of-fit tests
State 5m
State 10m
Posterior genealogy
Predictive simulations
59
Human influenza A trees Genealogical Fu Lis
D statistic
Goodness-of-fit tests
60
Puerto Rican Dengue-4 gene trees multivariate
summary statistics
Goodness-of-fit tests
61
Results of test of neutrality
Goodness-of-fit tests
62
Results for 28 HIV-1 infected individuals
Goodness-of-fit tests
63
Is the population size constant?
Population genetics of HIV

Pop size
1000
mean
Ne / 30
100
lower
upper
10
0
20
40
60
80
100
120
months (post seroconversion)
Patient 2
64
Virus population dynamics
Phylodynamics

Measles virus
Human influenza virus
65
Dengue-4 Modeling complex demography
Phylodynamics

N(t) N0exp(-rt) -10566.421 N(t) scaled
translated case data -10478.572
Hospital case data courtesy of Shannon Bennett
66
Population size changes
Population size changes
67
The generalized skyline plot
Population size changes
  • Visual framework for exploring the demographic
    history of sampled DNA sequences
  • Input a single estimated ancestral genealogy (a
    tree)
  • Output nonparametric plot of the population size
    through time
  • Groups adjacent coalescent intervals
  • Converts information within these intervals to
    estimates of population size

Estimate of population size from single
coalescent interval
Estimate of population size from l adjacent
coalescent intervals.
68
Examples
Generalized Skyline Plot
  • I Constant population size
  • N(t)N(0)

69
Skyline Plot
Generalized Skyline Plot
  • I Constant population size
  • N(t)N(0)

II Exponential growth N(t)N(0)e-rt
70
Skyline Plot
Generalized Skyline Plot
  • III HIV-1 group M
  • (tree estimated in Yusim et al (2001) Phil.
    Trans. Roy. Soc. Lond. B 356 855-866)
  • Black curve is a parametric estimate obtained
    from the same data under the expansion model
  • Results follow accepted demographic pattern for
    the HIV pandemic

71
The Bayesian skyline plot
Population size changes
Estimate a demographic function that has a
certain fixed number of steps (in this example
15) and then integrate over all possible
positions of the break points.
Explains the Dengue data quite well (test of
neutrality do not reject the data if we use the
Bayesian skyline plot to describe the demographic
history.
72
Prior/Model population is auto-correlated
through time
Population size changes
73
Validating the Bayesian skyline plot (1)
Population size changes
Simulated data Constant population
Simulated data Exponential growth
74
Validating the Bayesian skyline plot (2)
Population size changes
75
Comparing Bayesian skyline plot of Dengue-4 with
incidence data
Population size changes
76
Example of Bayesian skyline plot
Population size changes
(1920-1980) Anti-schistosomal needle-based
treatment
Effective population size jumped from 300 to 100
to 10,000
77
Comparison to parametric model
Population size changes
78
http//evolve.zoo.ox.ac.uk/BEAST
79
Coalescent with population structure
Structured populations
80
Population subdivision - two demes
Structured populations
81
Population subdivision - two demes
Structured populations
82
Stepping stone model of subdivision
Structured populations
83
Human migration
Structured populations
From Cavalli-Sforza,2001
84
Simplified model of human evolution
Structured populations
Past
0.2
Mutation rate 2.5
Rate of common ancestry 1
Present
Africa
Non-Africa
85
Why Bayesian?
  • Probabilistic model-based inference
  • Can make simple statements about the probability
    of alternative hypotheses given the data
  • Markov chain Monte Carlo
  • Convenient computational technique
  • Allows for complex models if you can simulate
    you can sample
  • Incorporates prior probabilities
  • P(?D) ? P(D ?)P(?)
  • Convenient means of assessing alternative sets of
    assumptions
  • Allows incorporation of independent sources of
    information
  • Easy to include sources of uncertainty
  • Dont need to assume perfect knowledge of tree
    (for example)
  • Can treat the tree and a nuisance parameter and
    focus on parameters of interest (strength of
    selection, mutation rate, growth rate, etc)

86
Conclusions cautionary remarks
  • Bayesian MCMC has advantages
  • a useful tool for exploring prior hypotheses
  • Good for assessing levels of uncertainty
  • Complex models can be investigated on practical
    datasets
  • Bayesian MCMC has disadvantages
  • Diagnostics are difficult, and it is essentially
    impossible to guarantee correctness
  • Model comparison can be difficult
  • Requires large programs that are difficult to
    optimize and debug.

87
Conclusions cautionary remarks (2)
  • Population genetics has advantages
  • provides a framework for objective analysis of
    genetic data
  • Allows interpretation of genetic data in terms of
    biological properties of virus
  • Can be extended to include selection,
    recombination et cetera
  • Population genetics has disadvantages
  • Models are still too simple
  • Assumptions are too strong
  • Extending to complex models that include changing
    selection pressures and recombination are
    possible in MCMC but still very difficult!
Write a Comment
User Comments (0)
About PowerShow.com