MODELS OF PROTEIN EVOLUTION: AN INTRODUCTION TO AMINO ACID EXCHANGE MATRICES presentation

Title: MODELS OF PROTEIN EVOLUTION: AN INTRODUCTION TO AMINO ACID EXCHANGE MATRICES

1
MODELS OF PROTEIN EVOLUTION AN INTRODUCTION TO
AMINO ACID EXCHANGE MATRICES
Robert Hirt Department of Zoology, The Natural
History Museum, London
2
Inferring trees is difficult!!!
1. The method problem
[Figure: Method 1 and Method 2 are applied to the same Dataset 1 and return different trees for taxa A, B and C. Which tree is correct?]
3
Inferring trees is difficult!!!
2. The dataset problem
[Figure: Method 1 is applied to Dataset 1 and to Dataset 2 and returns different trees for taxa A, B and C. Which tree is correct?]
4
From DNA/protein sequences to trees

1. Sequence data
2. Align sequences
   Phylogenetic signal? Patterns -> evolutionary processes?
3. Choose a method
   Distance methods: distance calculation (which model?)
     Optimality criterion: LS, ME
     Single tree: NJ
   Character-based methods
     MP: weighting? (sites, changes)?
     ML: model?
     MB: model?
4. Calculate or estimate best-fit tree
5. Test phylogenetic reliability
Modified from Hillis et al. (1993). Methods in Enzymology 224, 456-487
5
Agenda

Some general considerations: why protein phylogenetics?
What are we comparing? Protein sequences: some basic features
Protein structure/function and its impact on patterns of mutations
Amino acid exchange matrices: where do they come from and when do we use them?
  Database searches (Blast, FASTA)
  Sequence alignment (ClustalX)
  Phylogenetics (model based methods)

6
Why protein phylogenies?

For historical reasons - the first sequences
Most genes encode proteins
To study protein structure, function and
evolution
Comparing DNA and protein based phylogenies can
be useful
Different genes - e.g. 18S rRNA versus EF-2
protein
Protein encoding gene - codons versus amino acids

7
Proteins were the first molecular sequences to be
used for phylogenetic inference

Fitch and Margoliash (1967). Construction of
phylogenetic trees. Science 155, 279-284.

8
Phylogenies from proteins

Parsimony
Distance matrices
Maximum likelihood
Bayesian methods

9
Evolutionary models for amino acid changes

All methods have explicit or implicit evolutionary models
Can be in the form of a simple formula
  Kimura formula to estimate distances
Most models for amino acid changes typically include
  20 x 20 rate matrix
  Correction for rate heterogeneity among sites (gamma distribution Γ with shape parameter α, plus proportion of invariant sites pinv)
Assume neutrality: what if there are biases, or non-neutral changes such as selection?

10
Character states in DNA and protein alignments

DNA sequences have four states (five): A, C, G, T (and indels)
Proteins have 20 states (21): A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y (and indels)
=> More information in DNA or protein alignments?

11
DNA -> Protein: the code

3 nucleotides (a codon) code for one amino acid (61 codons! 61 x 61 rate matrices?)
Degeneracy of the code: most amino acids are coded by several codons
=> More data/information in DNA?

12
DNA -> Protein

The code is degenerate
  20 amino acids are encoded by 61 possible codons (the 3 remaining codons are stop codons)
Complex patterns of changes among codons
  Synonymous/non-synonymous changes
  Synonymous changes correspond to codon changes not affecting the coded amino acid

13
Codon degeneracy: protein alignments as a guide for DNA alignments
Glu-Gly-Ser-Ser-Trp-Leu-Leu-Leu-Gly-Ser
Glu-Gly-Ser-Ser-Tyr-Leu-Leu-Ile-Gly-Ser
Asp-Gly-Ser-Ala-Trp-Leu-Leu-Leu-Gly-Ser
Asp-Gly-Ser-Ala-Tyr-Leu-Leu-Ala-Gly-Ser

GAA-GGA-AGC-TCC-TGG-TTA-CTC-CTG-GGA-TCC
GAG-GGT-TCC-AGC-TAT-CTA-TTA-ATT-GGT-AGC
GAC-GGC-AGT-GCA-TGG-TTG-CTT-TTG-GGC-AGT
GAT-GGG-TCA-GCT-TAC-CTC-CTG-GCC-GGG-TCA

Ask James for PUTGAPS
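
As a quick check of the alignment on this slide, the four DNA rows can be translated with the standard genetic code to confirm that they encode the four protein rows. This is only a minimal Python sketch; the names `CODE`, `translate`, `dna_rows` and `n_diff` are illustrative and do not come from the presentation (nor from PUTGAPS).

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(row):
    """Translate a hyphen-separated codon string into one-letter amino acids."""
    return "".join(CODE[codon] for codon in row.split("-"))

def n_diff(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

# The four DNA rows from the slide.
dna_rows = [
    "GAA-GGA-AGC-TCC-TGG-TTA-CTC-CTG-GGA-TCC",
    "GAG-GGT-TCC-AGC-TAT-CTA-TTA-ATT-GGT-AGC",
    "GAC-GGC-AGT-GCA-TGG-TTG-CTT-TTG-GGC-AGT",
    "GAT-GGG-TCA-GCT-TAC-CTC-CTG-GCC-GGG-TCA",
]
proteins = [translate(row) for row in dna_rows]

# Compare rows 1 and 3 at the codon level and at the amino acid level.
codon_diffs = n_diff(dna_rows[0].split("-"), dna_rows[2].split("-"))
aa_diffs = n_diff(proteins[0], proteins[2])
print(proteins, codon_diffs, aa_diffs)
```

Rows 1 and 3 differ at 9 of their 10 codons but at only 2 amino acid positions, which illustrates why the protein alignment is the easier guide.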
14
DNA -> Protein: code usage

Differences in codon usage can lead to large base composition bias, in which case one often needs to remove the 3rd codon position (the most bias-prone site) and possibly the 1st
Comparing protein sequences can reduce the compositional bias problem
=> More information in DNA or protein?

15
Models for DNA and Protein evolution

DNA: 4 x 4 rate matrices
  Easy to estimate (can be combined with tree search)
Protein: 20 x 20 matrices
  More complex: time and estimation problems (rare changes?) ->
  Empirical models from large datasets are typically used
  One can correct for amino acid frequencies for a given dataset

16
Proteins and amino acids

Proteins determine shape and structure of cells and carry out most catalytic processes (3D structure)
Proteins are polymers of 20 different amino acids
Amino acid sequence composition determines the structure (secondary, tertiary) and function of the protein
Amino acids can be categorized by their side chain physicochemical properties
  Polarity (hydrophobic versus hydrophilic, +/- charges)
  Size (small versus large)

17
Amino acid physico-chemical properties

Major factor in protein folding
Key to protein functions

=> Major influence on the pattern of amino acid mutations
As for Ts versus Tv in DNA sequences, some amino acid changes are more common than others
Very important for sequence comparisons (alignment and phylogenetics!)
Small <-> small > small <-> big
18
Estimation of relative rates of residue
replacement (models of evolution)

Differences/changes in protein alignments can be pooled and patterns of changes investigated.
The result depends on the selected sequences, the alignment and the counting method: empirical models!
Patterns of changes give insights into the evolutionary processes underlying protein diversification -> estimation of evolutionary models
How general is such a model?
Choice of protein evolutionary models can be
important for the sequence analysis we perform
(database searching, sequence alignment,
phylogenetics)

19
Amino acid substitution matrices based on observed substitutions: empirical models

Summarise the substitution pattern from a large amount of existing data
Based on a selection of proteins
Globular proteins, membrane proteins?
Mitochondrial proteins?
Uses a given counting method and set of recorded
changes
tree dependent/independent
restriction on the sequence divergence

20
Amino acid physico-chemical properties

Size
Polarity
Hydrophilic (polar, +/- charges)
Hydrophobic (non polar)

21
Taylor's Venn diagram of amino acid properties
[Figure: Venn diagram placing the 20 amino acids in overlapping sets labelled Tiny, Small, Aliphatic, Polar, Charged, Hydrophobic and Aromatic. Cysteine appears twice, as C(S-S) and C(S-H).]
22
Amino acids categories 1
Doolittle (1985). Sci. Am. 253, 74-85.

Small polar: S, G, D, N
Small non-polar: T, A, P, C
Large polar: E, Q, K, R
Large non-polar: V, I, L, M, F
Intermediate polarity: W, Y, H

23
Amino acids categories 2

Sulfhydryl: C
Small hydrophilic: S, T, A, P, G
Acid, amide: D, E, N, Q
Basic: H, R, K
Small hydrophobic: M, I, L, V
Aromatic: F, Y, W

24
Amino acids categories

Changes within a category are more common than other changes
Colour coding of alignments to help visualise their quality
Differential weighting of cost matrices in
parsimony analyses
Mutational data matrices in model based methods
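
One way to picture category-based weighting is a cost function that charges less for a change within a category than for a change between categories. The sketch below uses the six groups from the "Amino acids categories 2" slide; the weights 1 and 2 and the names `CATEGORIES`, `CATEGORY_OF` and `change_cost` are illustrative assumptions, not values taken from the presentation or from any particular program.

```python
# Six physico-chemical groups, as listed on the "categories 2" slide.
CATEGORIES = {
    "sulfhydryl": "C",
    "small hydrophilic": "STAPG",
    "acid, amide": "DENQ",
    "basic": "HRK",
    "small hydrophobic": "MILV",
    "aromatic": "FYW",
}
# Reverse lookup: one-letter amino acid -> category name.
CATEGORY_OF = {aa: name for name, members in CATEGORIES.items() for aa in members}

def change_cost(a, b, within=1, between=2):
    """Cost of changing amino acid a into b (hypothetical weights)."""
    if a == b:
        return 0
    return within if CATEGORY_OF[a] == CATEGORY_OF[b] else between

print(change_cost("I", "L"), change_cost("W", "D"))
```

With these assumed weights an Ile -> Leu change (both small hydrophobic) costs less than a Trp -> Asp change (aromatic to acid/amide).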

25
=> Colour coding of different categories is useful for visual inspection of protein alignments
26
Phylogenetic trees from protein alignments

Parsimony based methods: unweighted/weighted
Distance methods: model for distance estimation
  Probability of amino acid changes, site rate heterogeneity
Maximum likelihood and Bayesian methods: model for ML calculations
  Probability of amino acid changes, site rate heterogeneity

27
Trees from protein alignments
Parsimony methods: cost matrices

All changes weighted equally
Differential weighting of changes: an attempt to correct for homoplasy!
  Based on the minimal number of amino acid substitutions: the genetic code matrix (PHYLIP-PROTPARS)
  Weights based on physico-chemical properties of amino acids
  Weights based on observed frequency of amino acid substitutions in alignments

28
Parsimony: unweighted matrix for amino acid changes

Ile -> Leu: cost 1
Trp -> Asp: cost 1
Ser -> Arg: cost 1
Lys -> Asp: cost 1

29
Parsimony: weighted matrix for amino acid changes, the genetic code matrix

Ile -> Leu: cost 1
Trp -> Asn: cost 3
Ser -> Arg: cost 2
Lys -> Asp: cost 2

30
Weighting matrix based on minimal amino acid changes: PROTPARS in PHYLIP

A C D E F G H I K L M N P Q R 1 2 T V W Y
A 0 2 1 1 2 1 2 2 2 2 2 2 1 2 2 1 2 1 1 2 2
C 2 0 2 2 1 1 2 2 2 2 2 2 2 2 1 1 1 2 2 1 1
D 1 2 0 1 2 1 1 2 2 2 2 1 2 2 2 2 2 2 1 2 1
E 1 2 1 0 2 1 2 2 1 2 2 2 2 1 2 2 2 2 1 2 2
F 2 1 2 2 0 2 2 1 2 1 2 2 2 2 2 1 2 2 1 2 1
G 1 1 1 1 2 0 2 2 2 2 2 2 2 2 1 2 1 2 1 1 2
H 2 2 1 2 2 2 0 2 2 1 2 1 1 1 1 2 2 2 2 2 1
I 2 2 2 2 1 2 2 0 1 1 1 1 2 2 1 2 1 1 1 2 2
K 2 2 2 1 2 2 2 1 0 2 1 1 2 1 1 2 2 1 2 2 2
L 2 2 2 2 1 2 1 1 2 0 1 2 1 1 1 1 2 2 1 1 2
M 2 2 2 2 2 2 2 1 1 1 0 2 2 2 1 2 2 1 1 2 3
N 2 2 1 2 2 2 1 1 1 2 2 0 2 2 2 2 1 1 2 3 1
P 1 2 2 2 2 2 1 2 2 1 2 2 0 1 1 1 2 1 2 2 2
Q 2 2 2 1 2 2 1 2 1 1 2 2 1 0 1 2 2 2 2 2 2
R 2 1 2 2 2 1 1 1 1 1 1 2 1 1 0 2 1 1 2 1 2
1 1 1 2 2 1 2 2 2 2 1 2 2 1 2 2 0 2 1 2 1 1
2 2 1 2 2 2 1 2 1 2 2 2 1 2 2 1 2 0 1 2 2 2
T 1 2 2 2 2 2 2 1 1 2 1 1 1 2 1 1 1 0 2 2 2

W: TGG; N: AAC, AAT. A minimum of 3 changes is needed at the DNA level for W <-> N
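
The W <-> N example can be generalised: for any pair of amino acids, take the smallest number of nucleotide differences over all pairs of their codons. The Python sketch below does this with the standard genetic code. It is an illustration of the idea, not the PROTPARS algorithm; in particular it pools all six serine codons, whereas the matrix above appears to list serine as two separate states (columns 1 and 2). The names `CODONS_FOR` and `min_changes` are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid (and '*' for stop) to the list of its codons.
CODONS_FOR = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    CODONS_FOR.setdefault(aa, []).append("".join(codon))

def min_changes(aa1, aa2):
    """Minimum number of nucleotide changes needed to turn a codon for aa1 into one for aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in CODONS_FOR[aa1]
        for c2 in CODONS_FOR[aa2]
    )

for a, b in [("I", "L"), ("W", "N"), ("K", "D")]:
    print(a, b, min_changes(a, b))
```

Under this simple definition Ile <-> Leu needs 1 change, Trp <-> Asn needs 3 and Lys <-> Asp needs 2, matching the weighted costs quoted on the earlier slide. Ser <-> Arg comes out as 1 when all serine codons are pooled (AGT/AGC versus AGA/AGG).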
31
Phylogenetic trees from protein alignments

Parsimony based methods: unweighted/weighted
Distance methods: model for distance estimation
  Probability of amino acid changes, site rate heterogeneity
Maximum likelihood and Bayesian methods: model for ML calculations
  Probability of amino acid changes, site rate heterogeneity

32
Distance methods

A two step approach - two choices!
1) Estimate all pairwise distances
Choose a method (100s) - has an explicit model
for sequence evolution
2) Estimate a tree from the distance matrix
Choose a method with or without an optimality
criterion?

33
Estimation of protein pairwise distances

Simple formula
More complex models
  20 x 20 matrices (evolutionary model)
    Identity matrix
    Genetic code matrix
    Mutational data matrices (MDMs)
  Correction for rate heterogeneity between sites (Γ, α, pinv)

34
The Kimura formula: correction for multiple hits

$d_{ij} = -\ln\left(1 - D_{ij} - D_{ij}^2/5\right)$
$D_{ij}$: the observed dissimilarity between i and j (0-1).
Can give a good estimate of $d_{ij}$ for $0.75 > D_{ij} > 0$
It approximates the PAM matrix well
If $D_{ij} \geq 0.8541$, $d_{ij}$ is infinite.
Does not take into account which amino acids are changing
Implemented in Clustal and PHYLIP
-> Importance of mutational matrices, MDM!
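
The formula is simple enough to evaluate directly. The Python sketch below is one possible reading of it; the function name `kimura_distance` and the choice to return infinity once the logarithm's argument reaches zero are assumptions made for illustration, not a description of how Clustal or PHYLIP implement it.

```python
import math

def kimura_distance(D):
    """Kimura's correction: d = -ln(1 - D - D^2/5), with D the observed dissimilarity (0-1)."""
    if not 0.0 <= D <= 1.0:
        raise ValueError("D must be a proportion between 0 and 1")
    x = 1.0 - D - D * D / 5.0
    # The argument of the logarithm reaches zero near D = 0.8541; treat that as saturation.
    if x <= 0.0:
        return math.inf
    return -math.log(x)

for D in (0.1, 0.5, 0.75, 0.86):
    print(D, kimura_distance(D))
```

The corrected distance is always at least as large as the observed dissimilarity, and it grows without bound as D approaches 0.8541.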

35
Amino acid substitution matrices (MDMs)

Sequence alignments based matrices
PAM, JTT, BLOSUM, WAG...
Structure alignments based matrices
STR (for highly divergent sequences)

36
Protein alignment may be guided by structural
interactions
Homo sapiens djlA protein
Escherichia coli djlA protein
37
Protein distance measurements with MDM

20 x 20 matrices
PAM, BLOSUM, WAG matrices
Maximum likelihood calculation which takes into account
  All sites in the alignment
  All pairwise rates in the matrix
  Branch length

$d_{ij} = \mathrm{ML}\,[P(n), X_{ij}, (\Gamma, p_{inv})]$ (dodgy notation!)
$d_{ij} = -\ln\left(1 - D_{ij} - D_{ij}^2/5\right) = F(D_{ij})$
38
How is an MDM inferred?

Observed raw changes are corrected for
  The amino acid relative mutability
  The amino acid normalised frequency
Differences between MDMs come from
  Choice of proteins used (membrane, globular)
  Range of sequence similarities used
  Counting methods
    On a tree: MP, ML
    Pairwise comparison from an alignment

-> Empirical models from large datasets are typically used
39
How is an MDM inferred?
The raw data: observed changes in pairwise comparisons in an alignment or on a tree
seq.1 AIDESLIIASIATATI
seq.2 AGDEALILASAATSTI
40
seq.1 AIDESLIIASIATATI
seq.2 AGEEALILASAATSTI

    A   S   T   G   I   L   E   D
A   3
S   2   1
T   0   0   1
G   0   0   0   0
I   1   0   0   1   2
L   0   0   0   0   1   1
E   0   0   0   0   0   0   1
D   0   0   0   0   0   0   1   0

Raw matrix: symmetrical!
-> The larger the dataset the better the estimates!
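
The counting step for a pairwise comparison can be sketched in a few lines of Python. Each aligned column contributes one count to an unordered pair of residues, which is what makes the raw matrix symmetrical. The function name `count_pairs` is illustrative, and the sketch assumes an ungapped pair of equal length, as on the slide.

```python
from collections import Counter

def count_pairs(seq1, seq2):
    """Count unordered residue pairs, column by column, in an ungapped pairwise alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    counts = Counter()
    for a, b in zip(seq1, seq2):
        # Sorting the pair makes (A, S) and (S, A) the same cell: a symmetrical matrix.
        counts[tuple(sorted((a, b)))] += 1
    return counts

# The pair of sequences shown on slide 40.
counts = count_pairs("AIDESLIIASIATATI", "AGEEALILASAATSTI")
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Counted this way, the slide 40 pair gives T-T = 2 (both sequences carry T at positions 13 and 15), whereas the matrix in the transcript shows 1; the other cells agree with the slide.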
41
Amino Acid exchange matrices

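$$
Q =
\begin{pmatrix}
- & s_{1,2} & s_{1,3} & \cdots & s_{1,20} \\
s_{1,2} & - & s_{2,3} & \cdots & s_{2,20} \\
s_{1,3} & s_{2,3} & - & \cdots & s_{3,20} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
s_{1,20} & s_{2,20} & s_{3,20} & \cdots & -
\end{pmatrix}
\times \mathrm{diag}(\pi_1, \ldots, \pi_{20})
$$

$Q$: rate matrix
$Q_{ij}$: instantaneous rates of change of amino acids
$s_{ij}$: exchangeabilities of amino acid pairs ij
$s_{ij} = s_{ji}$: time reversibility
$\pi_i$: stationarity of amino acid frequencies (typically the observed proportion of residues in the dataset)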
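
The construction on this slide can be tried on a toy example. The Python sketch below builds Q from a symmetric exchangeability matrix and a frequency vector for three states instead of 20, to keep the numbers readable. The values in `S` and `pi` are made up, and filling the diagonal so that each row sums to zero is the usual convention for rate matrices rather than something stated on the slide.

```python
def rate_matrix(S, pi):
    """Build Q with Q[i][j] = S[i][j] * pi[j] for i != j; diagonal set so rows sum to zero."""
    n = len(pi)
    Q = [[S[i][j] * pi[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        Q[i][i] = -sum(Q[i])
    return Q

# Toy 3-state example (hypothetical numbers): symmetric exchangeabilities and frequencies.
S = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 0.5],
     [2.0, 0.5, 0.0]]
pi = [0.5, 0.3, 0.2]

Q = rate_matrix(S, pi)
for row in Q:
    print(row)
```

Because the exchangeabilities are symmetric, the resulting Q satisfies pi[i] * Q[i][j] == pi[j] * Q[j][i] for every pair of states, which is the time reversibility property listed above.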
42
Amino Acid exchange matrices
R: Relative rate matrix (no composition, no branch length)
Q: Rate matrix (with composition, no branch length)
P: Probability matrix (composition + branch length). Can be estimated using ML on a tree
R: Relatedness odds matrix. Used for scoring alignments (Blast, Clustal)
F: Raw matrix. Observed changes (counted on an MP tree or in pairwise comparisons)
Modified from Peter Foster
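
The step from Q to P is usually written P(t) = exp(Qt), where t is the branch length. That relation is standard in model-based phylogenetics but is not spelled out on the slide, so it is an assumption here. The pure-Python sketch below evaluates it for a made-up 3-state Q using a truncated Taylor series with scaling and squaring; the function names and numbers are illustrative only.

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def transition_probabilities(Q, t, terms=20, squarings=10):
    """Approximate P(t) = exp(Q t) by a Taylor series on Q t / 2**squarings, then square back up."""
    n = len(Q)
    step = t / 2 ** squarings
    M = [[q * step for q in row] for row in Q]
    identity = [[float(i == j) for j in range(n)] for i in range(n)]
    P, term = identity, identity
    for k in range(1, terms + 1):
        # term holds M**k / k! at each iteration.
        term = [[x / k for x in row] for row in matmul(term, M)]
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        P = matmul(P, P)
    return P

# Hypothetical reversible 3-state rate matrix with stationary frequencies pi.
Q = [[-0.7, 0.3, 0.4],
     [0.5, -0.6, 0.1],
     [1.0, 0.15, -1.15]]
pi = [0.5, 0.3, 0.2]

P_short = transition_probabilities(Q, 0.5)
P_long = transition_probabilities(Q, 50.0)
print(P_short)
print(P_long)
```

Each row of P(t) sums to 1, and for long branches every row approaches the stationary frequencies, which is the sense in which P combines composition and branch length.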
43
The PAM and JTT matrices

PAM - Dayhoff et al. 1968
  Nuclear encoded genes, 100 proteins
JTT - Jones et al. 1992
  59,190 accepted point mutations for 16,300 proteins
Jones, Taylor & Thornton (1992). CABIOS 8, 275-282

44
The BLOSUM matrices
Henikoff & Henikoff (1992). Proc Natl Acad Sci USA 89, 10915-9

BLOcks SUbstitution Matrices
The matrix values are based on 2000 conserved amino acid patterns (blocks): pairwise comparisons
=> More efficient for distantly related proteins
=> More agreement with 3D structure data
BLOSUM62: 62% minimum sequence identity
BLOSUM50: 50% minimum sequence identity

45
The WAG matrix
Whelan and Goldman (2001) Mol. Biol. Evol. 18,
691-699

Globular protein sequences
3,905 sequences from 182 protein families
Produced a phylogenetic tree for every family and used maximum likelihood to estimate the relative rate values in the rate matrix (overall lnL over 182 different trees)
Better fit of the model with most data
(significant improvement of the lnL of a tree
when compared to PAM or JTT matrices)
Might not be the best option in some cases such
as for mitochondria encoded proteins

46
Comparisons of MDMs ($s_{ij}$): amino acid exchangeability
Whelan and Goldman (2001) Mol. Biol. Evol. 18,
691-699
47
Log-odds matrices

$\mathrm{MDM}_{ij} = 10 \log_{10} R_{ij}$

The $\mathrm{MDM}_{ij}$ values are rounded to the nearest integer
$\mathrm{MDM}_{ij} < 0$: freq. less than chance
$\mathrm{MDM}_{ij} = 0$: freq. expected by chance
$\mathrm{MDM}_{ij} > 0$: freq. greater than chance
The log-odds matrices can be used for scoring alignments (Blast and Clustal)
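
The log-odds transformation can be illustrated with a short Python sketch. Here the relatedness odds R is taken to be the ratio of the observed frequency of a residue pair to the frequency expected by chance, which is how the interpretation lines above read; the function name `log_odds_score` and the example frequencies are assumptions for illustration.

```python
import math

def log_odds_score(observed, expected):
    """10 * log10 of the relatedness odds R = observed / expected, rounded to the nearest integer."""
    if observed <= 0 or expected <= 0:
        raise ValueError("frequencies must be positive")
    r = observed / expected
    return round(10 * math.log10(r))

# Hypothetical pair frequencies: twice as often as chance, as chance, half as often as chance.
for observed, expected in [(0.02, 0.01), (0.01, 0.01), (0.005, 0.01)]:
    print(observed, expected, log_odds_score(observed, expected))
```

A pair seen twice as often as expected scores +3, a pair seen as often as expected scores 0, and a pair seen half as often scores -3.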
48
BLOSUM62 Amino Acid Substitution Matrix

    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   9
S  -1   4
T  -1   1   5
P  -3  -1  -1   7
A   0   1   0  -1   4
G  -3   0  -2  -2   0   6
N  -3   1   0  -2  -2   0   6
D  -3   0  -1  -1  -2  -1   1   6
E  -4   0  -1  -1  -1  -2   0   2   5
Q  -3   0  -1  -1  -1  -2   0   0   2   5
H  -3  -1  -2  -2  -2  -2   1  -1   0   0   8
R  -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5
K  -3   0  -1  -1  -1  -2   0  -1   1   1  -1   2   5
M  -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5
I  -1  -2  -1  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4
L  -1  -2  -1  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4
V  -1  -2   0  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4
F  -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6

Row groups
C: sulfhydryl
S, T, P, A, G: small hydrophilic
N, D, E, Q: acid, acid-amide and hydrophilic
H, R, K: basic
M, I, L, V: small hydrophobic

MDMij < 0: freq. less than chance
MDMij = 0: freq. expected by chance
MDMij > 0: freq. greater than chance
49
Summary

Many amino acid rate matrices exist and one needs to choose one for protein comparisons (alignment, phylogenetics...): do not hesitate to experiment!
One should make a rational choice (as much as
possible)
How was the rate matrix produced?
What are the structural features of the sequences
you are comparing? Globular/membrane protein?
What is the level of sequence identity of the
compared sequences?
Always try to correct for rate heterogeneity
between sites in phylogenetics!

50
Summary 2

In practice MDMs are obtained by averaging the observed changes and amino acid frequencies between numerous proteins (e.g. JTT, BLOSUM) and are used for your specific dataset
You can correct an MDM for the π values of your data (amino acid frequencies)
Specific matrices have been calculated to reflect particular composition biases (e.g. the mitochondrial proteins matrix mtREV24)
Future work
  What about context-dependent MDMs: alpha helices versus beta sheets, surface accessibility? (Heterogeneous models)
  Changes between grouped amino acids: estimation of data specific GTR matrices
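
The first step in correcting a matrix for the composition of your own data is simply to measure that composition. A minimal Python sketch of that step follows; the function name `amino_acid_frequencies`, the set of characters treated as gaps or unknowns, and the reuse of the two short sequences from slide 40 are assumptions for illustration.

```python
from collections import Counter

def amino_acid_frequencies(sequences, skip="-?X"):
    """Observed proportion of each residue across aligned sequences, ignoring gaps and unknowns."""
    counts = Counter(
        residue
        for seq in sequences
        for residue in seq.upper()
        if residue not in skip
    )
    total = sum(counts.values())
    return {aa: n / total for aa, n in sorted(counts.items())}

# Example with the two sequences from slide 40.
pi = amino_acid_frequencies(["AIDESLIIASIATATI", "AGEEALILASAATSTI"])
for aa, freq in pi.items():
    print(aa, round(freq, 3))
```

The resulting frequencies can then stand in for the averaged frequencies that came with the published matrix.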

51
From DNA/protein sequences to trees

1. Sequence data
2. Align sequences
   Phylogenetic signal? Patterns -> evolutionary processes?
3. Choose a method
   Distance methods: distance calculation (which model?)
     Optimality criterion: LS, ME
     Single tree: NJ
   Character-based methods
     MP: weighting? (sites, changes)?
     ML: model?
     MB: model?
4. Calculate or estimate best-fit tree
5. Test phylogenetic reliability
Modified from Hillis et al. (1993). Methods in Enzymology 224, 456-487
