Title: Estimating evolutionary parameters for Neisseria meningitidis
1 Estimating evolutionary parameters for Neisseria meningitidis
- Based on the Czech MLST dataset
2 Testing a model of evolution: what you need
Simulation step / use of real data
- Starting sequence: choose codons at random from the observed distribution of codon usage (1).
- Mutational model: estimate evolutionary parameters from the observed data (2).
- Evolved sequence: statistically test for differences between simulated and observed patterns of variation (3).
Requirements
- 1 Codon usage frequencies
- 2 Mutational model of sequence evolution
- 3 Statistical test of hypothesis
3 Estimating Codon Usage Frequencies
(Requirement 1)
4 Estimating Codon Frequency Usage
- Methods available
- Empirical observation of the Z2491 genome
- Empirical observation of the MLST data
- Bayesian inference using the MLST data
5 Empirical observation of the Z2491 genome
Parkhill et al. (2000) Complete DNA sequence of a serogroup A strain of Neisseria meningitidis Z2491. Nature 404: 502-506.
Nakamura et al. (2000) Codon usage tabulated from the international DNA sequence databases: status for the year 2000. Nucleic Acids Res. 28: 292.
6 Empirical observation of the MLST data
Jolley et al. (2000) Carried meningococci in the Czech Republic: a diverse recombining population. Journal of Clinical Microbiology 38: 4492-4498.
7 Bayesian Inference
- Prior belief
- In the absence of any information, what might you expect codon usage to look like a priori? E.g. codon usage is unbiased and homogeneous, except for the stop codons, which have zero frequency because the sequences are coding.
- Empirical data
- Tally the codon usage in the MLST dataset.
- Posterior belief
- Modify the prior beliefs a posteriori, following exposure to real data. The degree to which your beliefs are modified depends on the conviction with which you held your prior beliefs. The posterior beliefs will fall somewhere between the empirical observations and the prior beliefs. I.e. the posterior distribution of codon usage will be a compromise between all non-stop codons having some non-zero frequency and the observed empirical patterns of variation in codon usage.
8 Assumptions made in the Bayesian Inference
- Refer to a triplet as a 3-base slot in the reading frame, and a codon as the specific combination of bases filling that slot.
- Codon usage was modelled multinomially, i.e. each triplet is a random draw from the 61 possible non-stop codons. This makes the following assumptions:
- The presence of one or another codon at any particular triplet is entirely independent of the codons at adjacent triplets.
- All triplets are identical with respect to the probable codon usage.
- We will never see any of the three STOP codons in our sequences.
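The compromise between prior belief and tallied data can be sketched with a Dirichlet prior on the multinomial codon frequencies. The slides do not state the form of the prior, so the symmetric Dirichlet, the `prior_weight` pseudocount and the function name below are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product

STOPS = {"TAA", "TAG", "TGA"}
# The 61 non-stop codons; stop codons have zero frequency by assumption.
SENSE_CODONS = ["".join(c) for c in product("ACGT", repeat=3)
                if "".join(c) not in STOPS]
SENSE_SET = set(SENSE_CODONS)

def posterior_codon_usage(sequences, prior_weight=1.0):
    """Posterior mean codon frequencies under a symmetric Dirichlet prior.

    prior_weight is a hypothetical pseudocount per codon: larger values
    express stronger conviction in the unbiased prior.
    """
    counts = Counter()
    for seq in sequences:
        # Tally in-frame triplets, treating each as an independent draw.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon not in SENSE_SET:
                raise ValueError("not a non-stop codon: " + codon)
            counts[codon] += 1
    total = sum(counts.values()) + prior_weight * len(SENSE_CODONS)
    return {c: (counts[c] + prior_weight) / total for c in SENSE_CODONS}
```

With a small `prior_weight` the result stays close to the empirical tally; with a large one it stays close to equal usage of all 61 codons.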
9 A priori belief in codon frequency usage
10 Empirical observation of the MLST data
Jolley et al. (2000) Carried meningococci in the Czech Republic: a diverse recombining population. Journal of Clinical Microbiology 38: 4492-4498.
11 A posteriori belief in codon frequency usage
12 Mutational Model of Sequence Evolution
(Requirement 2)
13 Phylogenetic Inference
14 Coalescent simulations
- The coalescent is a very fast way of simulating gene histories under neutral evolution.
- It works because, if all mutations are neutral, then the presence or absence of mutations on the tree cannot affect its topology.
- Therefore the tree topology can be simulated first, independently of the mutations.
- The mutations are then superimposed onto the topology.
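A minimal sketch of the first step, simulating the gene history before any mutations are placed, is given below. It uses the standard Kingman coalescent, in which k lineages wait an exponential time with rate k(k-1)/2 before two of them merge; time is in coalescent units, and the function name and return format are illustrative assumptions rather than anything specified in the slides.

```python
import random

def simulate_coalescent(n, rng=random):
    """Simulate a neutral gene history for n sampled genes.

    Returns (merges, total_branch_length), where merges is a list of
    (time, child_a, child_b, parent) tuples in coalescent time units.
    """
    lineages = list(range(n))
    next_label = n
    time = 0.0
    total_length = 0.0
    merges = []
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time to the next coalescence among k lineages.
        wait = rng.expovariate(k * (k - 1) / 2)
        time += wait
        # All k lineages persist for this interval.
        total_length += k * wait
        a, b = rng.sample(lineages, 2)
        lineages.remove(a)
        lineages.remove(b)
        lineages.append(next_label)
        merges.append((time, a, b, next_label))
        next_label += 1
    return merges, total_length
```

Because the topology does not depend on the mutations, they can be added afterwards, for example by drawing their number from a Poisson distribution whose mean is proportional to `total_length`.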
15 Underlying rates of non-synonymous mutation are usually confounded with selection against inviable mutants. Thus it is convenient to model functional constraint as mutational bias (or rather, to make no attempt to disentangle the two). If we assume that the patterns of functional constraint can be modelled as a biased, but neutral, form of mutation, then we can use coalescent simulation.
16 Mutational bias in Coalescent Simulations
- The topology is simulated at random, as before.
- As in normal coalescent simulations, mutations are superimposed onto the topology according to a Poisson process (just as in the neutral model of molecular evolution).
- Those mutations, although assumed to be neutral, are biased.
- The types of mutations must therefore be classified to specify the bias.
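17 Types of single nucleotide mutation: transitions vs. transversions
- Purines: A, G
- Pyrimidines: T, C
- Transitions: purine to purine (A and G) or pyrimidine to pyrimidine (T and C)
- Transversions: purine to pyrimidine, or pyrimidine to purine
- For any base there are always 2 possible transversions and 1 possible transition.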
18 Types of codon mutation: synonymous vs. non-synonymous
Synonymous
Non-synonymous
- Leucine: pH 5.98; 6-fold degeneracy in the genetic code; (CH3)2-CH-CH2-CH(NH2)-COOH
- Methionine: pH 5.74; single unique codon ATG; CH3-S-(CH2)2-CH(NH2)-COOH
19 Relative rates of the different classes of mutation
Class of mutation              Rate of occurrence
Synonymous transversion        m
Synonymous transition          km
Non-synonymous transversion    wm
Non-synonymous transition      wkm
Interpretation
- k: transition-transversion ratio
- w: proportion of non-synonymous mutations that are viable
- M = 3m(2 + k): basic rate of mutation per codon
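The four rate classes can be sketched as a small function that compares two codons differing at one base. The standard genetic code table is built inline; the function names are hypothetical, and the basic rate is taken to be M = 3m(2 + k), i.e. three positions each with two transversions at rate m and one transition at rate km.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code; "*" marks the three stop codons.
CODE = {b1 + b2 + b3: AMINO[16 * i1 + 4 * i2 + i3]
        for i1, b1 in enumerate(BASES)
        for i2, b2 in enumerate(BASES)
        for i3, b3 in enumerate(BASES)}
PURINES = {"A", "G"}

def mutation_rate(codon_from, codon_to, m, k, w):
    """Rate of a single-nucleotide change between two non-stop codons."""
    diffs = [(x, y) for x, y in zip(codon_from, codon_to) if x != y]
    if len(diffs) != 1:
        raise ValueError("codons must differ at exactly one position")
    if "*" in (CODE[codon_from], CODE[codon_to]):
        raise ValueError("stop codons are excluded by assumption")
    x, y = diffs[0]
    # A transition stays within purines or within pyrimidines.
    transition = (x in PURINES) == (y in PURINES)
    synonymous = CODE[codon_from] == CODE[codon_to]
    rate = m
    if transition:
        rate *= k
    if not synonymous:
        rate *= w
    return rate

def basic_rate(m, k):
    """Basic mutation rate per codon, M = 3m(2 + k)."""
    return 3 * m * (2 + k)
```

For example, `mutation_rate("ATC", "ATT", m, k, w)` returns km, because both codons encode isoleucine and C to T is a transition.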
20 Example CTT
Phe Non-synonymous transition wkm
Ile Non-synonymous transversion wm
Val Non-synonymous transversion wm
Ser Non-synonymous transition wkm
Tyr Non-synonymous transversion wm
Cys Non-synonymous transversion wm
Phe Non-synonymous transition wkm
Leu Synonymous transversion m
Leu Synonymous transversion km
[Figure: codon diagram for the example, labelled Leucine; codons shown: CTT, TTT, TTA, TTG, TTC]
21 Likelihood
- Having defined the model of evolution, the probability of observing different patterns in the data can be expressed.
- The triplets in the MLST sequences are aligned, and the pattern of diversity in the sample at each triplet is analyzed.
- The number of mutations occurring in the gene history is Poisson distributed, according to the neutral theory, with rate equal to the basic mutation rate multiplied by the evolutionary time over which mutation could have occurred.
- Evolutionary time is obtained from coalescent theory.
- The basic mutation rate and the relative rates of each type of mutation are estimated from the data.
22 Interpreting the data in light of the model
            Triplet 1    Triplet 2        Triplet 3    Triplet 4
            Segregating  Non-segregating  Segregating  Segregating
            Dimorphic    Monomorphic      Dimorphic    Trimorphic
Sequence 1  ATC          GAG              TTG          GGT
Sequence 2  ATC          GAG              TTG          GGT
Sequence 3  ATC          GAG              TTG          GGC
Sequence 4  ATC          GAG              TTG          GGA
Sequence 5  ATT          GAG              TTG          GGA
Sequence 6  ATT          GAG              CTA          GGA
24 Interpreting the data in light of the model
Codons observed at one triplet in the six sequences: ATC, ATC, ATC, ATC, ATT, ATT.
Make the assumption that no more than a single
mutation occurs anywhere in the tree since the
most recent common ancestor.
25 Interpreting the data in light of the model
[Figure: gene trees relating the six sequences at the triplet segregating ATC and ATT; the inferred mutation is labelled "Synonymous transition, rate km/M" in each]
For a dimorphic segregating triplet, on the
assumption that no more than a single mutation
has occurred, ancestral type is irrelevant.
26 Interpreting the data in light of the model
From coalescent theory, the evolutionary time over which mutations can occur for a gene history of n genes is given by the Watterson constant, a.
If M is the basic rate of mutation per codon and the number of mutations in the tree is Poisson distributed, then
$\Pr(0 \text{ mutations}) = e^{-Ma}$
$\Pr(1 \text{ mutation}) = Ma\,e^{-Ma}$
$\Pr(2 \text{ mutations}) = (Ma)^2 e^{-Ma}/2$
$\Pr(3 \text{ mutations}) = (Ma)^3 e^{-Ma}/6$
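These Poisson probabilities can be computed directly, as in the sketch below. The formula for the Watterson constant did not survive in this transcript, so the code assumes its usual form, the sum of 1/i for i from 1 to n-1; any additional scaling factor used on the original slide is unknown. The function names are illustrative.

```python
import math

def watterson_constant(n):
    """Assumed standard form: sum of 1/i for i = 1 .. n-1."""
    return sum(1.0 / i for i in range(1, n))

def prob_mutations(j, M, a):
    """Poisson probability of exactly j mutations in the tree, mean M*a."""
    mean = M * a
    return mean ** j * math.exp(-mean) / math.factorial(j)
```

For small M*a the probability of two or more mutations is small compared with the probability of one, which is what makes the single-mutation assumption a usable approximation.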
27 Interpreting the data in light of the model
[Alignment of six sequences at four triplets, as on slide 22, annotated:]
One synonymous transition inferred
28 Interpreting the data in light of the model
[Alignment of six sequences at four triplets, as on slide 22, annotated:]
One synonymous transition inferred
One synonymous transition inferred
29 Interpreting the data in light of the model
Codons observed at one triplet in the six sequences: TTG, TTG, TTG, TTG, TTG, CTA.
Under the assumption of no more than a single
mutation this change cannot occur. Its frequency
is assumed negligible, and any occurrences in the
data are ignored.
30 Interpreting the data in light of the model
[Alignment of six sequences at four triplets, as on slide 22, annotated:]
One synonymous transition inferred
Inference not possible, incidence assumed negligible
One synonymous transition inferred
32 Interpreting the data in light of the model
[Alignment of six sequences at four triplets, as on slide 22, annotated:]
One synonymous transition inferred
Inference not possible, incidence assumed negligible
Inference not possible, incidence assumed negligible
One synonymous transition inferred
34 Interpreting the data in light of the model
Why might a site be monomorphic?
1. Because there has been no mutation since the most recent common ancestor: $\Pr = e^{-Ma}$
2. Because there has been an inviable non-synonymous mutation that was purged by selection: $\Pr = x(1-w)m\,Ma\,e^{-Ma}/M + y(1-w)km\,Ma\,e^{-Ma}/M$
where x and y are the numbers of possible non-synonymous transversions and transitions, respectively, from codon GAG. Therefore
$\Pr(\text{monomorphic}) = e^{-Ma}\,[1 + (x + yk)(1-w)\,m\,a]$
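The two ways of remaining monomorphic can be added up numerically, as sketched below. The counts x and y are obtained by enumerating the nine single-base neighbours of the codon. Changes to stop codons are skipped here, which is an assumption, since the slides do not say how they are counted; M is taken as 3m(2 + k), and the function names are illustrative.

```python
import math

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {b1 + b2 + b3: AMINO[16 * i1 + 4 * i2 + i3]
        for i1, b1 in enumerate(BASES)
        for i2, b2 in enumerate(BASES)
        for i3, b3 in enumerate(BASES)}
PURINES = {"A", "G"}

def count_nonsynonymous(codon):
    """Return (x, y): non-synonymous transversions and transitions.

    Neighbours that are stop codons are skipped (assumption).
    """
    x = y = 0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            neighbour = codon[:pos] + base + codon[pos + 1:]
            if CODE[neighbour] == "*" or CODE[neighbour] == CODE[codon]:
                continue
            if (base in PURINES) == (codon[pos] in PURINES):
                y += 1  # transition
            else:
                x += 1  # transversion
    return x, y

def prob_monomorphic(codon, m, k, w, a):
    """Pr(no mutation) + Pr(one inviable non-synonymous mutation)."""
    x, y = count_nonsynonymous(codon)
    M = 3 * m * (2 + k)  # assumed form of the basic rate per codon
    no_mutation = math.exp(-M * a)
    purged = (x + y * k) * (1 - w) * m * a * math.exp(-M * a)
    return no_mutation + purged
```

Under these assumptions GAG has x = 5 (to CAG, GCG, GTG, GAC, GAT) and y = 2 (to AAG, GGG); counting the change to the stop codon TAG as well would give x = 6.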
35 Interpreting the data in light of the model
[Alignment of six sequences at four triplets, as on slide 22, annotated:]
One synonymous transition inferred
Inference not possible, incidence assumed negligible
Inference not possible, incidence assumed negligible
One synonymous transition inferred
No mutation or inviable non-synonymous mutation
38 Interpreting the data in light of the model
[Alignment of six sequences at four triplets, as on slide 22, annotated:]
One synonymous transition inferred
Inference not possible, incidence assumed negligible
Inference not possible, incidence assumed negligible
One synonymous transition inferred
No mutation or inviable non-synonymous mutation
Total 1094
315  27  52  700
39 Maximum likelihood estimation of m, k and w
- It is assumed that no more than a single mutation has occurred at each triplet since the most recent common ancestor of all sequences.
- This avoids inference of ancestral types.
- It also allows dimorphic segregating sites to be directly classified into one of the four mutation types.
- However, it wastes some information.
- Some triplets that are segregating cannot be classified because they involve more than a single point mutation. Rather than attempt to infer the order of mutational events, the data are ignored.
- E.g. TTG and CTA both encode leucine, but to get from one to the other requires point mutations at both positions 1 and 3.
- If a triplet is segregating for more than two codons (e.g. it is trimorphic) in the sample, then ancestral type would need to be inferred. Rather than do that, the data are ignored.
- Maximum likelihood is then used to find the values of m, k and w under which the observed data are most probable.
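The classification rules listed on this slide can be sketched as a function applied to the codons observed at one triplet. The label strings and the function name are illustrative choices; the genetic code table is the standard one.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {b1 + b2 + b3: AMINO[16 * i1 + 4 * i2 + i3]
        for i1, b1 in enumerate(BASES)
        for i2, b2 in enumerate(BASES)
        for i3, b3 in enumerate(BASES)}
PURINES = {"A", "G"}

def classify_triplet(codons):
    """Classify the pattern of variation at one aligned triplet."""
    distinct = sorted(set(codons))
    if len(distinct) == 1:
        return "monomorphic"
    if len(distinct) > 2:
        # Trimorphic or more: ancestral type would be needed, so ignore.
        return "ignored"
    first, second = distinct
    diffs = [(x, y) for x, y in zip(first, second) if x != y]
    if len(diffs) != 1:
        # More than one point mutation separates the two codons.
        return "ignored"
    x, y = diffs[0]
    kind = "transition" if (x in PURINES) == (y in PURINES) else "transversion"
    syn = "synonymous" if CODE[first] == CODE[second] else "non-synonymous"
    return syn + " " + kind
```

Applied to the four triplets of the example alignment, this gives a synonymous transition for ATC/ATT, "monomorphic" for GAG, and "ignored" for both TTG/CTA and the trimorphic GGT/GGC/GGA.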
40 Maximum likelihood estimation of m, k and w
- In maximum likelihood estimation, a formula is found for the probability of the data given a set of values for the parameters (m, k and w). The values of the parameters are then varied until a set is found for which the data are most probable.
- In this case, as there are 3 parameters, an animation is used to represent variation in k (kappa) by a fourth dimension, time.
41 Maximum likelihood estimation of m, k and w
- The maximum likelihood estimates were
- m = 0.001662 (per 2N generations)
- k = 5.848
- w = 0.2598
- Therefore the rates, per codon per 2N generations, were
- Synonymous transversion 0.001662
- Synonymous transition 0.00972
- Non-synonymous transversion 0.0004318
- Non-synonymous transition 0.002525
- where N is the effective population size
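As a quick arithmetic check, the four rates follow from the three estimates when they are read, in order, as m, k and w and combined as m, km, wm and wkm. The dictionary keys below are just labels.

```python
# Maximum likelihood estimates quoted on this slide, read as m, k and w.
m, k, w = 0.001662, 5.848, 0.2598

rates = {
    "synonymous transversion": m,
    "synonymous transition": k * m,
    "non-synonymous transversion": w * m,
    "non-synonymous transition": w * k * m,
}

for name, rate in rates.items():
    print(f"{name}: {rate:.4g}")
```

The computed values agree with the quoted rates to the precision shown on the slide.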
42 Underlying mutation rate, M
- Under the parameters estimated, the basic mutation rate per codon is M = 0.03819 per 2N generations, where N is the effective population size.
- Biochemical estimates of the basic mutation rate in Escherichia coli have been of the order of 5 x 10^-9 per generation.
- Equating this to the true underlying mutation rate, the effective population size can be estimated as N = 1.3 million.
- Such an estimate is subject to assumptions of selective neutrality, once functional constraint has been modelled as mutational bias.
- In a human pathogen such as Neisseria meningitidis, selective neutrality is highly unlikely.
E. coli rate from Drake et al. 1998 or Drake & Holland 1999
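The slide does not show the arithmetic behind the population size estimate. One reading that reproduces a value of about 1.3 million is sketched below; it assumes that the E. coli figure is a rate per base per generation, so that a codon of three bases mutates three times as fast. That factor of three is an assumption, not something stated in the slides.

```python
M = 0.03819          # basic mutation rate per codon per 2N generations
mu_per_base = 5e-9   # per generation, order of magnitude quoted for E. coli
bases_per_codon = 3  # assumption: convert the per-base rate to a per-codon rate

# M = 2N * (per-codon rate per generation), so N = M / (2 * 3 * mu).
N = M / (2 * bases_per_codon * mu_per_base)
print(f"Estimated effective population size: {N:.3g}")
```

Without the per-codon conversion the same calculation would give a value three times larger, so the result is sensitive to how the biochemical rate is interpreted.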
43 Statistical test of hypothesis
(Requirement 3)
44 Statistical hypothesis testing
- This is the next stage.
- First the coalescent simulations need to be run.
- Then we can test the MLST data for selective neutrality.
- I expect neutrality to be overwhelmingly rejected as a null hypothesis.
- Then we can go on to test the clonal epidemic model.