Title: Models of Protein Evolution
Models of Protein Evolution
- Amino acid sequences (20 amino acids)
- Protein-coding DNA sequences
Models for Amino Acid Sequences
- DNA models use a 4 x 4 rate matrix; amino acid models use a 20 x 20 rate matrix, resulting in many more parameters and thus longer computation time
- Consequently, amino acid models have concentrated on empirical approaches
- 1. EMPIRICAL (several implemented in MrBayes; model fixed)
- 2. NON-EMPIRICAL (model variable in MrBayes)
Models for Amino Acid Sequences
- 1. EMPIRICAL (several implemented in MrBayes; model fixed): 20 x 20 matrices
  - Dayhoff et al. (1978): matrix based on the observation of 1572 accepted mutations in 34 superfamilies of closely related sequences
  - JTT (Jones et al. 1992; Gonnet et al. 1992): same methodology as Dayhoff, but with modern databases (later modified for transmembrane proteins; Jones et al. 1994)
  - mtREV (Adachi and Hasegawa 1995, 1996): matrix derived from maximum likelihood-inferred replacements in mitochondrial proteins of 20 vertebrate species
  - WAG (Whelan and Goldman 2001): matrix derived from a maximum likelihood improvement of JTT
  - Poisson: assumes equal stationary state frequencies and equal substitution rates (equivalent to the JC model for DNA). Not really empirical, but it is fixed
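Because the Poisson model treats all 20 amino acids identically, its transition probabilities have a simple closed form. The sketch below assumes the rate matrix is scaled so that branch length t is measured in expected substitutions per site; the function name is illustrative only and is not taken from any package.

```python
import math

N_STATES = 20  # number of amino acids

def poisson_transition_prob(same_state, t, n=N_STATES):
    """Probability of ending in the same (or one particular different)
    state after a branch of length t (expected substitutions per site)."""
    decay = math.exp(-n * t / (n - 1))
    if same_state:
        return 1.0 / n + (n - 1) / n * decay
    return 1.0 / n - decay / n

for t in (0.0, 0.1, 1.0, 10.0):
    p_same = poisson_transition_prob(True, t)
    p_diff = poisson_transition_prob(False, t)
    # Each row of P(t) holds one "same" entry and 19 "different" entries.
    print(t, round(p_same, 4), round(p_diff, 4), round(p_same + 19 * p_diff, 4))
```

As t grows, every entry approaches 1/20, the equal stationary frequency assumed by the model.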
Dayhoff Evolutionary Mutation Matrix
Models for Amino Acid Sequences
- 2. NON-EMPIRICAL (model variable in MrBayes)
  - Equalin (in MrBayes): substitution rates equal, but adjusted for the amino acid equilibrium frequencies in the dataset (equivalent to F81 for DNA)
  - Cao et al. (2004) and Goldman and Whelan (2002): empirical matrices adjusted for the equilibrium frequencies of amino acids in the dataset (similar to base frequencies in the GTR matrix; 19 free frequency parameters)
  - GTR: allows all stationary state frequencies and substitution rates to vary (MANY parameters: 19 free stationary state frequency parameters and 189 free substitution rate parameters)
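The parameter counts quoted for GTR follow from simple counting, sketched below. It assumes a time-reversible model in which one exchangeability is fixed as the reference rate and the frequencies are constrained to sum to 1; the function name is illustrative.

```python
def gtr_free_parameters(n_states):
    """Return (free rate parameters, free frequency parameters)
    for a time-reversible GTR-type model with n_states states."""
    n_pairs = n_states * (n_states - 1) // 2  # unordered pairs of states
    n_rates = n_pairs - 1                     # one rate fixed as the reference
    n_freqs = n_states - 1                    # frequencies sum to 1
    return n_rates, n_freqs

print("DNA:", gtr_free_parameters(4))          # (5, 3)
print("Amino acids:", gtr_free_parameters(20)) # (189, 19)
```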
Equalin instantaneous rate matrix
GTR instantaneous rate matrix
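Both rate matrices named on these two slides can be written as q_ij = r_ij * pi_j for i != j, where pi_j is the frequency of the target amino acid and r_ij is a symmetric exchangeability. Equalin is the special case with all r_ij equal. The sketch below builds both matrices from made-up frequencies and exchangeabilities (random values, for illustration only) and checks time reversibility; the function names are hypothetical.

```python
import random

def rate_matrix(exchangeabilities, freqs):
    """Build Q with q_ij = r_ij * pi_j (i != j), rescaled to a mean rate of 1."""
    n = len(freqs)
    q = [[exchangeabilities[i][j] * freqs[j] if i != j else 0.0
          for j in range(n)] for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])  # each row of Q sums to zero
    mean_rate = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[x / mean_rate for x in row] for row in q]

def is_time_reversible(q, freqs, tol=1e-12):
    """Check detailed balance: pi_i * q_ij == pi_j * q_ji for all i, j."""
    n = len(freqs)
    return all(abs(freqs[i] * q[i][j] - freqs[j] * q[j][i]) <= tol
               for i in range(n) for j in range(n))

random.seed(1)
n = 20
raw = [random.random() + 0.1 for _ in range(n)]
freqs = [x / sum(raw) for x in raw]          # illustrative frequencies

equal_r = [[1.0] * n for _ in range(n)]      # Equalin: all exchangeabilities equal
gtr_r = [[0.0] * n for _ in range(n)]        # GTR: one value per unordered pair
for i in range(n):
    for j in range(i + 1, n):
        gtr_r[i][j] = gtr_r[j][i] = random.random() + 0.1

q_equalin = rate_matrix(equal_r, freqs)
q_gtr = rate_matrix(gtr_r, freqs)
print(is_time_reversible(q_equalin, freqs), is_time_reversible(q_gtr, freqs))
```

In the Equalin matrix every off-diagonal entry in column j is the same, because the rate of change depends only on the frequency of the target state.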
Phylogeny inference software for Amino Acid Sequences
- Parsimony: PAUP and others
- Maximum Likelihood
  - Molphy (only UNIX)
  - TreePuzzle: uses quartet puzzling, a different search strategy
  - PhyML (web based or downloadable binary): uses a different search algorithm
  - RAxML
  - ProtML (really old; probably slow searches)
- Bayesian: MrBayes
- Distance (MEGA)
  - www.megasoftware.net
  - implements empirical models (Dayhoff and JTT), as well as the Poisson model, for inferring pairwise distances
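As an illustration of a Poisson-based pairwise distance, the sketch below applies the simple Poisson correction d = -ln(1 - p), where p is the proportion of differing amino acid sites. The sequences and function names are invented for the example, and sites with gap or ambiguity characters are skipped as a simplifying assumption.

```python
import math

def p_distance(seq_a, seq_b, skip="-?X"):
    """Proportion of compared sites at which two aligned sequences differ."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b)
             if a not in skip and b not in skip]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance d = -ln(1 - p)."""
    p = p_distance(seq_a, seq_b)
    if p >= 1.0:
        raise ValueError("sequences differ at every site; distance undefined")
    return -math.log(1.0 - p)

# Hypothetical aligned protein fragments
seq_1 = "MKVLAAGT"
seq_2 = "MKILAAGS"
print(p_distance(seq_1, seq_2), poisson_distance(seq_1, seq_2))
```

The correction always gives d >= p, because some observed differences hide multiple replacements at the same site.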
Models of Protein Evolution
- Amino acid sequences (20 amino acids)
- Protein-coding DNA sequences
- Codon-position models (4 nucleotides)
- Codon-based models (64 codons)
1. Codon-position models (4 nucleotides)
- 1st, 2nd, and 3rd codon positions are treated differently
- Equivalent to establishing different partitions, such as different partitions for different genes (covered in the Data Partition lecture)
- Uses NO information from the Genetic Code
- Each codon position can have
  - A different rate (partition by codon position in PAUP and in MrBayes), with all codon positions sharing the same base frequencies and substitution matrix
  - A different substitution matrix (MrBayes but not PAUP)
  - Different base frequencies (MrBayes but not PAUP)
  - A different gamma distribution (MrBayes but not PAUP)
  - Or a combination of the above (MrBayes but not PAUP)
- see Shapiro et al. 2006
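Partitioning by codon position amounts to taking every third site of an in-frame alignment. The sketch below uses a made-up three-taxon alignment and assumes the sequences start at codon position 1; the function names are illustrative, and the per-partition base frequencies are printed only to show that the partitions can differ.

```python
from collections import Counter

def split_by_codon_position(seq):
    """Return the 1st, 2nd, and 3rd codon-position sites of an in-frame sequence."""
    usable = len(seq) - len(seq) % 3  # drop any trailing partial codon
    seq = seq[:usable]
    return seq[0::3], seq[1::3], seq[2::3]

def base_frequencies(sites):
    """Observed frequencies of A, C, G, T in a string of sites."""
    counts = Counter(b for b in sites.upper() if b in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}

# Hypothetical in-frame alignment
alignment = {"taxonA": "ATGGCCAAA", "taxonB": "ATGGCTAAG", "taxonC": "ATGGCGAAA"}

partitions = ["", "", ""]
for seq in alignment.values():
    for k, sites in enumerate(split_by_codon_position(seq)):
        partitions[k] += sites

for k, sites in enumerate(partitions, start=1):
    print("position", k, base_frequencies(sites))
```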
2. Codon-based models (64 codons)
2. Codon-based models
- Consider a codon triplet as the unit of evolution
- A codon can change to another only through steps of one nucleotide change at a time
- Distinguish between synonymous (silent) and nonsynonymous (replacement) substitutions
- Use a 64 x 64 matrix (or 61 x 61 = 3721 entries, excluding stop codons) of probabilities of change among codons
- The two most commonly used models employ an extension of the HKY DNA model (ts/tv ratio and base or codon frequencies) and an additional parameter
  - nonsynonymous/synonymous rate ratio (ω)
- Models
  - Goldman and Yang 1994; Yang et al. 1998
  - Muse and Gaut 1994
- Widely used for testing hypotheses about natural selection (see PAML manual), but not for phylogenetic inference because of the computational expense
- However, may still be used to select among a reduced number of possible trees
- see Ren et al. 2005
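The bookkeeping behind these points can be sketched in a few lines: build the genetic code, drop the stop codons, and classify each pair of sense codons by whether it differs at one nucleotide and whether the encoded amino acid changes. The sketch assumes the standard (universal) genetic code; the names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE_CODONS = [c for c, aa in GENETIC_CODE.items() if aa != "*"]

def classify_change(codon_i, codon_j):
    """Classify a change between two codons."""
    n_diff = sum(a != b for a, b in zip(codon_i, codon_j))
    if n_diff == 0:
        return "identical"
    if n_diff > 1:
        return "multiple"  # not allowed in a single step
    if GENETIC_CODE[codon_i] == GENETIC_CODE[codon_j]:
        return "synonymous"
    return "nonsynonymous"

counts = {}
for i in SENSE_CODONS:
    for j in SENSE_CODONS:
        kind = classify_change(i, j)
        counts[kind] = counts.get(kind, 0) + 1

print(len(SENSE_CODONS), "sense codons;", len(SENSE_CODONS) ** 2, "matrix cells")
print(counts)
```

Most of the 3721 cells fall in the "multiple" class, so the rate matrix is sparse even though it is large.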
2. Codon-based models
- Goldman and Yang 1994; Yang et al. 1998
- Equilibrium frequencies of each codon are estimated from the codon frequencies (πj)
- Parameters
  - qij: instantaneous rate of change from codon i to codon j
  - πj: frequency of codon j
  - κ: transition/transversion ratio
  - ω: nonsynonymous/synonymous rate ratio
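The parameters listed above combine into a rate for each pair of codons. The sketch below uses the commonly used simplified form of the Goldman and Yang model (the original 1994 paper additionally weighted nonsynonymous changes by amino acid distance): the rate is zero unless the codons differ at exactly one position, and otherwise equals the target codon frequency, multiplied by the transition/transversion ratio for transitions and by the nonsynonymous/synonymous ratio for nonsynonymous changes. Equal codon frequencies and the parameter values are assumptions made for the example, and the function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
PURINES = {"A", "G"}

def is_transition(a, b):
    """True for A<->G or C<->T (assumes a != b)."""
    return (a in PURINES) == (b in PURINES)

def gy94_rate(codon_i, codon_j, codon_freqs, kappa, omega):
    """Off-diagonal rate q_ij in the simplified Goldman-Yang form."""
    if "*" in (GENETIC_CODE[codon_i], GENETIC_CODE[codon_j]):
        return 0.0                      # stop codons are excluded
    diffs = [(a, b) for a, b in zip(codon_i, codon_j) if a != b]
    if len(diffs) != 1:
        return 0.0                      # only single-nucleotide steps allowed
    rate = codon_freqs[codon_j]         # proportional to target codon frequency
    if is_transition(*diffs[0]):
        rate *= kappa
    if GENETIC_CODE[codon_i] != GENETIC_CODE[codon_j]:
        rate *= omega
    return rate

sense = [c for c, aa in GENETIC_CODE.items() if aa != "*"]
equal_freqs = {c: 1.0 / len(sense) for c in sense}   # illustrative assumption

for pair in (("TTT", "TTC"), ("TTT", "TTA"), ("TTT", "CTT"), ("TTT", "TCC")):
    print(pair, gy94_rate(*pair, equal_freqs, kappa=2.0, omega=0.5))
```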
2. Codon-based models
- Muse and Gaut 1994
- The rate of change between codons is proportional to the equilibrium frequency of the target nucleotide rather than of the target codon
- Equilibrium frequencies of target nucleotides can be treated the same for all three codon positions together or separately for each codon position (parameter k below)
- Parameters
  - qij: instantaneous rate of change from codon i to codon j
  - πj: frequency of nucleotide j at codon position k (k = 1, 2, 3)
  - κ: transition/transversion ratio
  - ω: nonsynonymous/synonymous rate ratio
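The difference from the Goldman and Yang form is only in which frequency enters the rate. The sketch below follows the parameter list on this slide (including the transition/transversion ratio, which the original Muse and Gaut paper did not use) and takes the frequency of the target nucleotide at the codon position that changes. The frequencies and function name are invented for illustration; passing the same dictionary three times corresponds to treating all codon positions together.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
PURINES = {"A", "G"}

def mg94_style_rate(codon_i, codon_j, position_freqs, kappa, omega):
    """Off-diagonal rate using the target NUCLEOTIDE frequency.

    position_freqs is a list of three dicts, one per codon position,
    mapping each nucleotide to its frequency at that position.
    """
    if "*" in (GENETIC_CODE[codon_i], GENETIC_CODE[codon_j]):
        return 0.0
    diffs = [(k, a, b) for k, (a, b) in enumerate(zip(codon_i, codon_j)) if a != b]
    if len(diffs) != 1:
        return 0.0                       # only single-nucleotide steps allowed
    k, a, b = diffs[0]
    rate = position_freqs[k][b]          # target nucleotide at position k
    if (a in PURINES) == (b in PURINES):
        rate *= kappa                    # transition
    if GENETIC_CODE[codon_i] != GENETIC_CODE[codon_j]:
        rate *= omega                    # nonsynonymous
    return rate

# Hypothetical frequencies: uniform at positions 1-2, GC-rich at position 3
uniform = {"T": 0.25, "C": 0.25, "A": 0.25, "G": 0.25}
gc_rich = {"T": 0.1, "C": 0.4, "A": 0.1, "G": 0.4}
separate = [uniform, uniform, gc_rich]
together = [uniform, uniform, uniform]

print(mg94_style_rate("TTT", "TTC", separate, kappa=2.0, omega=0.5))
print(mg94_style_rate("TTT", "TTC", together, kappa=2.0, omega=0.5))
```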
2. Codon-based models
- Inagaki and Roger (2006) describe a problem of codon-based models when codon usage varies among lineages
- Similar to a long-branch attraction effect when two distantly related lineages have similar codon biases
- None of the models implemented to date incorporates codon usage heterogeneity among lineages
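One simple way to see whether codon usage differs among lineages is to tabulate codon frequencies per sequence and compare them. The sketch below is only a descriptive check on made-up sequences, not a method from Inagaki and Roger; the summary statistic (half the summed absolute difference in codon frequencies, ranging from 0 to 1) and the function names are hypothetical choices for illustration.

```python
from collections import Counter

def codon_usage(seq):
    """Relative frequency of each codon in an in-frame coding sequence."""
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3].upper() for i in range(0, usable, 3)]
    counts = Counter(codons)
    return {c: n / len(codons) for c, n in counts.items()}

def usage_difference(usage_a, usage_b):
    """Half the summed absolute difference: 0 = identical, 1 = no shared codons."""
    codons = set(usage_a) | set(usage_b)
    return 0.5 * sum(abs(usage_a.get(c, 0.0) - usage_b.get(c, 0.0)) for c in codons)

# Hypothetical sequences encoding the same peptide (Leu-Ala-Leu-Gly)
lineages = {
    "lineage_1": "CTGGCCCTGGGC",
    "lineage_2": "CTGGCCCTCGGC",
    "lineage_3": "TTAGCTTTAGGT",
}
usage = {name: codon_usage(seq) for name, seq in lineages.items()}
for a in lineages:
    for b in lineages:
        if a < b:
            print(a, b, usage_difference(usage[a], usage[b]))
```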
2. Codon-based models: Software
- PAML
  - implements the two models described (i.e., extensions of HKY)
  - Not good for searching trees
  - In practice, used to compare hypotheses of models or parameters on one or a few trees, especially tests of positive selection
  - Parameters
    - πj: frequency of codon j, or frequency of nucleotide j at codon position k (k = 1, 2, 3)
    - κ: transition/transversion ratio
    - ω: nonsynonymous/synonymous rate ratio
      - Fixed or variable among lineages (branch models)
      - Fixed or variable among sites (sites models)
      - Fixed or variable among sites and lineages (branch-site models)
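Comparing two nested models on a fixed tree is commonly done with a likelihood ratio test, which the slide does not name explicitly, so treat this as an assumption about the workflow. The sketch below takes two log-likelihoods (invented numbers standing in for values read from program output) and compares twice their difference with a chi-square distribution. Only 1 or 2 degrees of freedom are handled, because those have closed forms in the standard library.

```python
import math

def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (test statistic, p-value) for nested models.

    df is the difference in the number of free parameters (1 or 2 here).
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat <= 0.0:
        return stat, 1.0                       # alternative fits no better
    if df == 1:
        p = math.erfc(math.sqrt(stat / 2.0))   # chi-square survival, 1 df
    elif df == 2:
        p = math.exp(-stat / 2.0)              # chi-square survival, 2 df
    else:
        raise ValueError("this sketch only handles df = 1 or 2")
    return stat, p

# Hypothetical log-likelihoods for a null and an alternative sites model
stat, p = likelihood_ratio_test(lnl_null=-2345.67, lnl_alt=-2340.12, df=2)
print(round(stat, 2), p)
```

A small p-value favours the model that allows a class of sites with ω > 1, although boundary conditions in some tests make the chi-square approximation conservative.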
2. Codon-based models: Software
- MrBayes
  - does allow you to search for the tree topology (i.e., the topology does not have to be fixed)
  - the substitution matrix has roughly 3600 cells (61 x 61 = 3721 when stop codons are excluded) instead of 16
  - Runs 200 times slower
  - Requires 16 times more memory than nucleotide models
  - Parameters
    - GTR (or F81 or JC) rather than HKY
    - substitution probabilities and equilibrium frequencies (of nucleotides or of codons?)
    - ω: nonsynonymous/synonymous rate ratio
      - Fixed among lineages
      - May vary among sites
      - values 0 < ω1 < 1, ω2 = 1, and ω3 > 1 (Ny98)
      - ω1 < ω2 < ω3 (M3)
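The two among-site schemes differ only in the constraints placed on the three ω classes. The sketch below checks those constraints and computes an average ω over sites; the class proportions are an added assumption (the slide lists only the ω values), and all numbers and function names are invented for illustration.

```python
def satisfies_ny98(omegas):
    """Ny98: 0 < w1 < 1 (purifying), w2 = 1 (neutral), w3 > 1 (positive)."""
    w1, w2, w3 = omegas
    return 0.0 < w1 < 1.0 and w2 == 1.0 and w3 > 1.0

def satisfies_m3(omegas):
    """M3: three ordered classes, w1 < w2 < w3, none tied to 1."""
    w1, w2, w3 = omegas
    return w1 < w2 < w3

def mean_omega(omegas, proportions):
    """Average omega across site classes, weighted by class proportions."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("class proportions must sum to 1")
    return sum(w * p for w, p in zip(omegas, proportions))

# Hypothetical values: most sites conserved, a few under positive selection
omegas = (0.1, 1.0, 2.5)
proportions = (0.7, 0.2, 0.1)
print(satisfies_ny98(omegas), satisfies_m3(omegas), mean_omega(omegas, proportions))

# Allowed under M3 but not under Ny98: no class has omega > 1
print(satisfies_ny98((0.05, 0.3, 0.8)), satisfies_m3((0.05, 0.3, 0.8)))
```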
References
- Anisimova M, Kosiol C (2009) Investigating Protein-Coding Sequence Evolution with Probabilistic Codon Substitution Models. Molecular Biology and Evolution 26, 255-271. (deals more with hypothesis testing than phylogenetic inference)
- Huelsenbeck JP, Joyce P, Lakner C, Ronquist F (2008) Bayesian analysis of amino acid substitution models. Philosophical Transactions of the Royal Society B: Biological Sciences 363, 3941-3953.
- Le SQ, Lartillot N, Gascuel O (2008) Phylogenetic mixture models for proteins. Philosophical Transactions of the Royal Society B: Biological Sciences 363, 3965-3976. (see http://www.atgc-montpellier.fr/models/index.php?model=mixture)
- Inagaki Y, Roger AJ (2006) Phylogenetic estimation under codon models can be biased by codon usage heterogeneity. Mol. Phylogenet. Evol. 40, 428-434.
- Kosiol C, Holmes I, Goldman N (2007) An empirical codon model for protein sequence evolution. Molecular Biology and Evolution 24, 1464-1479.
- Shapiro B, Rambaut A, Drummond AJ (2006) Choosing appropriate substitution models for the phylogenetic analysis of protein-coding sequences. Mol. Biol. Evol. 23, 7-9.
- Ren FR, Tanaka H, Yang ZH (2005) An empirical examination of the utility of codon-substitution models in phylogeny reconstruction. Systematic Biology 54, 808-818.
References
- Felsenstein's textbook, Chapter 14: Models of protein evolution.
- MrBayes manual, sections 4.1.3-4.2.3.
- PAML manual.
- Goldman N, Yang Z (1994) A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11, 725-736.
- Muse SV, Gaut BS (1994) A likelihood approach for comparing synonymous and nonsynonymous substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 11, 715-724.
- Yang Z, Nielsen R, Hasegawa M (1998) Models of amino acid substitution and applications to mitochondrial protein evolution. Mol. Biol. Evol. 15, 1600-1611.