Title: Detecting adaptive protein evolution
1. Detecting adaptive protein evolution
Ziheng Yang, Department of Biology, University College London
2. There are two main explanations for genetic
variation observed within a population or between
species:
- Natural selection (survival of the fittest)
- Mutation and drift (survival of the luckiest)
Gillespie, J.H. 1998. Population genetics: a
concise guide. Johns Hopkins University Press,
Baltimore.
Hartl, D.L., and A.G. Clark. 1997. Principles of
population genetics. Sinauer Associates,
Sunderland, Massachusetts.
3. Positive & negative selection
Genotype    AA     Aa         aa
Frequency   p^2    2p(1-p)    (1-p)^2
Fitness     1      1+s        1+2s
(A: wild-type allele; a: new mutant)
s is the selection coefficient:
- s ≈ 0: neutral evolution
- s < 0: negative (purifying) selection
- s > 0: positive selection (adaptive evolution)
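The fitness scheme on this slide can be turned into a one-generation recursion for the mutant allele. The sketch below is an illustration only: it reads the fitness row as 1, 1+s, 1+2s, assumes an infinite population with Hardy-Weinberg genotype frequencies (so no drift), and the function names are made up here. Note that it tracks q = 1 - p, the frequency of the mutant a.

```python
def next_mutant_frequency(q, s):
    """Frequency of the mutant allele a after one generation of selection.

    q is the current frequency of a (q = 1 - p in the slide's notation).
    Genotypes AA, Aa, aa have fitnesses 1, 1 + s, 1 + 2s.
    Assumes an infinite population, so drift is ignored.
    """
    p = 1.0 - q
    mean_fitness = p * p + 2 * p * q * (1 + s) + q * q * (1 + 2 * s)
    # Half of the alleles in Aa individuals are a; all alleles in aa are a.
    return (p * q * (1 + s) + q * q * (1 + 2 * s)) / mean_fitness


def trajectory(q0, s, generations):
    """Mutant allele frequencies over successive generations."""
    freqs = [q0]
    for _ in range(generations):
        freqs.append(next_mutant_frequency(freqs[-1], s))
    return freqs
```

With s = 0 the frequency stays where it started, with s > 0 it rises and with s < 0 it falls, which matches the three cases listed on the slide.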
4. Positive & negative selection
Whether mutation or selection dominates the fate
of the new allele depends on how |Ns| compares
with 1, where N is the effective population size.
- Ns < -3: fatal mutations
- -3 < Ns < -1: unlucky losers
- -1 < Ns < 1: nearly neutral
- 1 < Ns < 3: occasional hopefuls
- Ns > 3: rare monsters
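As a rough illustration of why Ns is the relevant scale, the sketch below labels a mutation using the ranges on this slide and also reports its fixation probability relative to a neutral mutation. The second function is an assumption brought in from outside the slides: it uses Kimura's diffusion approximation for a new mutation with small s and additive fitnesses, which gives S / (1 - exp(-S)) with S = 4Ns. The slide leaves the boundaries (Ns exactly -3, -1, 1, 3) unassigned, so their handling here is arbitrary.

```python
import math


def classify_ns(ns):
    """Label a mutation by its scaled selection coefficient Ns."""
    if ns < -3:
        return "fatal mutation"
    if ns < -1:
        return "unlucky loser"
    if ns <= 1:
        return "nearly neutral"
    if ns <= 3:
        return "occasional hopeful"
    return "rare monster"


def relative_fixation_probability(ns):
    """Fixation probability divided by that of a neutral mutation.

    Assumption (not from the slides): Kimura's diffusion approximation
    for small s, which gives S / (1 - exp(-S)) with S = 4Ns.
    """
    big_s = 4.0 * ns
    if abs(big_s) < 1e-12:
        return 1.0  # neutral limit
    if big_s < -700.0:
        return 0.0  # avoid overflow; the ratio is effectively zero
    return big_s / -math.expm1(-big_s)
```

Under this approximation a mutation with Ns = 3 fixes about 12 times as often as a neutral one, while a mutation with Ns = -3 almost never fixes.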
5. Theories of molecular evolution
Akashi, H. (1999) Gene 238: 39-51
6. Detecting the effect of natural selection is
useful for (a) advancing evolutionary theory and
(b) inferring functional significance from
genomic data.
7. Evolutionary conservation means functional
significance.
Thomas et al. 2003. Nature 424: 788-793
8. Fast-evolving genes or gene regions are also
functionally important if the variability is
driven by natural selection.
9. In protein-coding genes, we can distinguish
between synonymous (silent) and nonsynonymous
(replacement) mutations, and contrast their
substitution rates to infer selection on the
protein.
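The synonymous/nonsynonymous distinction can be made concrete with a small lookup against the genetic code. This is a minimal sketch assuming the standard (universal) code; the function name and the "identical" and "stop" labels are choices made here for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_change(codon_from, codon_to):
    """Classify a codon change as synonymous or nonsynonymous."""
    if codon_from == codon_to:
        return "identical"
    aa_from = GENETIC_CODE[codon_from]
    aa_to = GENETIC_CODE[codon_to]
    if "*" in (aa_from, aa_to):
        return "stop"  # changes to or from a stop codon are set aside
    return "synonymous" if aa_from == aa_to else "nonsynonymous"
```

For example, TTC to TTT leaves phenylalanine unchanged and is synonymous, whereas TTC to TTA replaces phenylalanine with leucine and is nonsynonymous.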
10. Synonymous & nonsynonymous substitutions
11. Definitions
- dS (KS): number of synonymous substitutions per
  synonymous site
- dN (KA): number of nonsynonymous substitutions
  per nonsynonymous site
- ω = dN/dS: nonsynonymous/synonymous rate ratio
12. The ω ratio measures selection at the protein
level
- ω = 1: neutral evolution
- ω < 1: negative (purifying) selection
- ω > 1: positive (diversifying) selection
13. Data & information
a2    GGC TCT CAC TCC ATG AGG TAT TTC TTC ACA TCC
a24   ... ..C ... ... ... ..T ... ... .A. ..C ...
a11   ... ..C ..A ... ... ... ... ... .A. ..C ...
aw24  ... ..C ... ... ... ... ... ... CA. ..C ...
aw68  ... ..C ... ... ... ..A ... ... .A. ..C ...
a3    ... ..T ..T ... ... ... ... ... C.. ..T ...
14. Early studies average synonymous and
nonsynonymous rates over sites and so have little
power to detect adaptive evolution.
15. Possible approaches
- Test each site for positive selection (Suzuki &
  Gojobori 1999 Mol. Biol. Evol. 16: 1315-1328)
- Decide which sites might be under selection
  and focus on them (Hughes & Nei 1988 Nature
  335: 167-170) (fixed-sites model)
- Use a statistical distribution to model the ω
  variation (random-sites model, fishing expedition)
16. A simple approach (Fitch et al. 1997; Suzuki &
Gojobori 1999)
[Figure: codons at one site mapped onto a tree, including TTC, TTA, ATC, TAT and TTT; some codon labels are only partly legible (T?A, C?A, C?T)]
3 nonsynonymous changes, 1 synonymous change
17. Use of codon models to detect amino acid sites
under diversifying selection
- Likelihood Ratio Test (LRT) for sites under
  positive selection
- Bayes calculation of posterior probabilities of
  sites under positive selection
18. Rates to CTG
Synonymous
  CTC (Leu) → CTG (Leu): π_CTG
  TTG (Leu) → CTG (Leu): κ π_CTG
Nonsynonymous
  GTG (Val) → CTG (Leu): ω π_CTG
  CCG (Pro) → CTG (Leu): κ ω π_CTG
19. Rate matrix Q = {q_ij}
(Goldman & Yang 1994 Mol Biol Evol 11: 725-736;
Muse & Gaut 1994 Mol Biol Evol 11: 715-724)
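The matrix itself did not survive in this transcript. The sketch below shows one way an off-diagonal entry q_ij could be computed, following the pattern of the "Rates to CTG" examples: the rate is proportional to the frequency of the target codon, multiplied by κ for a transition and by ω for a nonsynonymous change, and zero when the codons differ at more than one position. The function name, the argument layout and the treatment of stop codons are assumptions made for illustration, not the authors' code.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
TRANSITIONS = {frozenset("TC"), frozenset("AG")}


def rate(codon_i, codon_j, kappa, omega, pi_j):
    """Off-diagonal substitution rate from codon i to codon j.

    kappa: transition/transversion rate ratio
    omega: nonsynonymous/synonymous rate ratio
    pi_j:  equilibrium frequency of the target codon j
    """
    diffs = [(a, b) for a, b in zip(codon_i, codon_j) if a != b]
    if len(diffs) != 1:
        return 0.0  # identical codons, or more than one position differs
    aa_i, aa_j = GENETIC_CODE[codon_i], GENETIC_CODE[codon_j]
    if "*" in (aa_i, aa_j):
        return 0.0  # assumption: stop codons are excluded from the model
    q = pi_j
    if frozenset(diffs[0]) in TRANSITIONS:
        q *= kappa
    if aa_i != aa_j:
        q *= omega
    return q
```

With κ = 2, ω = 0.5 and π_CTG = 0.02, the four changes to CTG listed on the previous slide get rates 0.02, 0.04, 0.01 and 0.02 respectively.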
20. LRT of sites under positive selection
H0: there are no sites at which ω > 1
H1: there are such sites
Compare 2Δℓ = 2(ℓ1 - ℓ0) with a χ² distribution
(Nielsen & Yang 1998 Genetics 148: 929-936; Yang,
Nielsen, Goldman & Pedersen 2000. Genetics
155: 431-449)
21. Two pairs of useful models
- M1a (Nearly Neutral)
  Site class k    0         1
  p_k             p0        p1
  ω_k             ω0 < 1    ω1 = 1
- M2a (Positive Selection)
  Site class k    0         1         2
  p_k             p0        p1        p2
  ω_k             ω0 < 1    ω1 = 1    ω2 > 1
Modified from Nielsen & Yang (1998), where ω0 = 0
is fixed
22.
- M7 (beta, using 10 site classes)
  - ω ~ beta(p, q)
- M8 (beta&ω)
  - p0 of sites from beta(p, q)
  - p1 = 1 - p0 of sites with ωs > 1
From Yang et al. (2000)
24. Discretisation of a continuous distribution:
M7 (beta)
[Figure: distribution of the ω ratio among sites; x-axis: ω ratio (0 to 1), y-axis: Sites]
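One way to see what discretising beta(p, q) into 10 site classes means is to cut the distribution into 10 equal-probability slices and represent each slice by a single ω value. The sketch below does this by simulation: it is a Monte Carlo approximation written for illustration, not the procedure any particular program uses, and the function name, sample size and seed are arbitrary choices.

```python
import random


def discretise_beta(p, q, n_classes=10, n_samples=100_000, seed=1):
    """Approximate equal-probability site classes for omega ~ beta(p, q).

    Draws n_samples values, sorts them, splits them into n_classes
    equally sized slices and returns the mean of each slice.
    n_samples is assumed to be a multiple of n_classes.
    """
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(p, q) for _ in range(n_samples))
    size = n_samples // n_classes
    return [
        sum(draws[k * size:(k + 1) * size]) / size
        for k in range(n_classes)
    ]
```

Each returned value then carries weight 1/10 in the site likelihood. For a U-shaped distribution such as beta(0.10, 0.35), most classes sit very close to 0 and only the last few approach 1.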
25. Mixture distribution: M8 (beta&ω)
[Figure: proportion p0 of sites from beta(p, q) plus a class of proportion p1 at ωs = 1.7; x-axis: ω ratio (0 to 1 for the beta part), y-axis: Sites]
26. Likelihood function and Empirical Bayesian
inference of sites under selection (M2a)
Site class k      0         1         2
Proportion p_k    p0        p1        p2
ω ratio ω_k       ω0 < 1    ω1 = 1    ω2 > 1
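The equations on this slide were not captured in the transcript. As a hedged sketch of the empirical Bayes step, the posterior probability that a site belongs to class k is the class proportion times the site's likelihood under that class, normalised over all classes, with parameter estimates plugged in. The function name and the example numbers below are hypothetical.

```python
def posterior_class_probabilities(proportions, site_likelihoods):
    """Posterior probability of each site class for one site.

    proportions:      estimated class proportions p_k (summing to 1)
    site_likelihoods: likelihood of the site's data under each omega_k
    """
    joint = [p * f for p, f in zip(proportions, site_likelihoods)]
    total = sum(joint)  # this is the site's mixture likelihood
    return [j / total for j in joint]


# Hypothetical numbers for one site under M2a (classes 0, 1, 2).
example = posterior_class_probabilities([0.6, 0.3, 0.1],
                                        [1e-6, 2e-6, 9e-6])
```

In this made-up example the site is assigned to class 2 (ω2 > 1) with posterior probability 9/21, about 0.43, even though that class holds only 10% of sites a priori.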
27. Bayes Empirical Bayes (BEB): M2a
28. Human MHC Class I data: 192 alleles, 270 codons
Model          ℓ            Parameter estimates
M7 (beta)      -7,498.97    beta(0.10, 0.35)
M8 (beta&ω)    -7,232.68    p0 = 0.90, beta(0.17, 0.71), (p1 = 0.10), ωs = 5.12
Likelihood ratio test of positive selection:
2Δℓ = 2 × 266.29 = 532.58, P < 0.000, d.f. = 2
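The test statistic on this slide can be reproduced from the two log-likelihoods. The sketch below uses only the standard library; the chi-square tail probability uses a closed form that holds for even degrees of freedom only, which covers the d.f. = 2 case here. Function names are illustrative.

```python
import math


def lrt_statistic(lnl_null, lnl_alt):
    """Twice the log-likelihood difference between nested models."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable, even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form used here needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Log-likelihoods from the table: M7 (null) and M8 (alternative).
stat = lrt_statistic(-7498.97, -7232.68)
p_value = chi2_sf_even_df(stat, df=2)
```

The statistic comes out as 532.58, and the resulting tail probability is vanishingly small, so M7 is rejected in favour of M8.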
29. Posterior probabilities for MHC
30. 25 sites identified by M8 (beta&ω) using both
NEB & BEB
31. Comparison between NEB and BEB from real data
analysis and computer simulation suggests that
- BEB is effective in correcting high false
  positive rates of NEB in small (non-informative)
  data sets.
- BEB does not seem to cause a loss of power in
  large (informative) data sets.
- Some wrong models are more useful than the true
  model.
32. A small data set (HTLV tax gene) (Suzuki & Nei
2004 MBE 21: 914-921)
20 sequences, 181 codons. 23 singleton
differences on star tree: 2 synonymous, 21
nonsynonymous.
NEB: M0 (one-ratio), M2 (selection), M2a
(PositiveSelection), M8 (beta&ω) all give
ω = 4.87. Every site is under positive selection
with P = 1.
BEB: 21 sites have 0.91 < P < 0.93 under M2a and
0.96 < P < 0.97 under M8. Other sites have
P = 57% or 70%.
33. Performance measures in simulation
- True positive: 50/80
- False positive: 10/120
- Accuracy: 50/60
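The three fractions can be read as follows, although the slide's original figure is missing so this interpretation is an assumption: 80 sites are truly under positive selection and 120 are not, and the method flags 60 sites, of which 50 are correct and 10 are not. A small helper with illustrative names:

```python
def performance(true_pos, false_pos, n_truly_positive, n_truly_negative):
    """Performance measures for a site-identification method.

    true_pos:  flagged sites that are truly under positive selection
    false_pos: flagged sites that are not
    """
    flagged = true_pos + false_pos
    return {
        # share of truly selected sites that were found (power)
        "true_positive_rate": true_pos / n_truly_positive,
        # share of non-selected sites that were wrongly flagged
        "false_positive_rate": false_pos / n_truly_negative,
        # share of flagged sites that are truly selected
        "accuracy": true_pos / flagged,
    }


measures = performance(true_pos=50, false_pos=10,
                       n_truly_positive=80, n_truly_negative=120)
```

Under this reading the three values are 50/80, 10/120 and 50/60, matching the slide.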
34. Performance of BEB (NEB) in simulations
(cutoff P = 95%)
35. Advantages of ML
- Accounts for the genetic code
- Accounts for ts/tv rate bias and codon usage bias
- Avoids bias in ancestral reconstruction
- Uses probability theory to correct for multiple
hits
36. Assumptions & limitations
- Same selective pressure over all lineages
- No recombination within the sequence
- No variation in synonymous rate among sites
- Same rate for all amino acid changes
- No sequencing or alignment errors
- The level of sequence divergence and the number
of sequences are two major factors affecting
accuracy and power. Data of only a few closely
related sequences do not contain much information.
37. Adaptive molecular evolution
- proteins involved in immunity or defence (MHC,
  immunoglobulin VH, class 1 chitinase)
- proteins involved in evading defence systems
  (HIV env, nef, gag, pol, etc., capsid in FMD
  virus, flu virus hemagglutinin gene)
- proteins involved in male & female
  reproduction (abalone sperm lysin, sea urchin
  bindin, proteins in mammals)
- Miscellaneous
38. Acknowledgments
BBSRC
http://abacus.gene.ucl.ac.uk/
39. References
Yang, Z., and J.P. Bielawski. 2000. Statistical
methods for detecting molecular adaptation.
Trends in Ecology and Evolution 15: 496-503.
Yang, Z. 2001. Adaptive molecular evolution,
Chapter 12 (pp. 327-350) in Handbook of
statistical genetics, eds. D. Balding, M. Bishop,
and C. Cannings. Wiley, New York.
Yang, Z. 2002. Inference of selection from
multiple species alignments. Current Opinion in
Genetics and Development 12: 688-694.
Wong, W.S.W., et al. 2004. Accuracy and power of
statistical methods for detecting adaptive
evolution in protein coding sequences and for
identifying positively selected sites. Genetics
168: 1041-1051.
Yang, Z., et al. submitted. Bayes empirical Bayes
inference of amino acid sites under positive
selection. Molecular Biology and Evolution.