MODELS OF PROTEIN EVOLUTION: AN INTRODUCTION TO AMINO ACID EXCHANGE MATRICES

1
MODELS OF PROTEIN EVOLUTION AN INTRODUCTION TO
AMINO ACID EXCHANGE MATRICES
Robert Hirt Institute for Cell and Molecular
Biosciences, Newcastle University, UK
2
Inferring trees is difficult!!!
1. The method problem
[Figure: Dataset 1 analysed with Method 1 and with Method 2 gives different trees for taxa A, B and C. Which one is right?]
3
Inferring trees is difficult!!!
2. The dataset problem
[Figure: Method 1 applied to Dataset 1 and to Dataset 2 gives different trees for taxa A, B and C. Which one is right?]
4
From DNA/protein sequences to trees
  1. Sequence data
  2. Align sequences
     Phylogenetic signal? Patterns -> evolutionary processes?
  3. Choose a method
      • Distance methods: distance calculation (which model?)
          • Single tree: NJ
          • Optimality criterion: LS, ME
      • Character-based methods
          • MP: weighting? (sites, changes)
          • ML: model?
          • MB: model?
  4. Calculate or estimate best-fit tree
  5. Test phylogenetic reliability
Modified from Hillis et al. (1993). Methods in Enzymology 224, 456-487
5
Agenda
  • Some general considerations
  • Why protein phylogenetics?
  • What are we comparing? Protein sequences - some
    basic features
  • Protein structure/function and its impact on
    patterns of mutations
  • Amino acid exchange matrices: where do they come
    from and when do we use them?
      • Database searches (e.g. Blast, FASTA)
      • Sequence alignment (e.g. ClustalX)
      • Phylogenetics (model-based methods: distance, ML,
        Bayesian)

6
Why protein phylogenies?
  • For historical reasons - the first sequences
  • Most genes encode proteins
  • To study protein structure, function and
    evolution
  • Comparing DNA and protein based phylogenies can
    be useful
  • Different genes - e.g. 18S rRNA versus EF-2
    protein
  • Protein encoding gene - codons versus amino acids

7
Proteins were the first molecular sequences to be
used for phylogenetic inference
  • Fitch and Margoliash (1967). Construction of
    phylogenetic trees. Science 155, 279-284.

8
Phylogenies from proteins
  • Parsimony
  • Distance matrices
  • Maximum likelihood
  • Bayesian methods

9
Evolutionary models for amino acid changes
  • All methods have explicit or implicit
    evolutionary models
  • Can be in the form of a simple formula
      • e.g. the Kimura formula to estimate distances
  • Most models for amino acid changes typically
    include
      • A 20x20 rate matrix (or a reduced version of it,
        e.g. a 6x6 rate matrix)
      • Correction for rate heterogeneity among sites
        (gamma distribution Γ with shape parameter α,
        proportion of invariant sites pinv)
  • Assume stationarity and neutrality - what if
    there are biases in composition, or non-neutral
    changes such as selection?

10
Character states in DNA and protein alignments
  • DNA sequences have four states (five): A, C, G,
    T (and indels)
  • Proteins have 20 states (21): A, C, D, E, F, G,
    H, I, K, L, M, N, P, Q, R, S, T, V, W, Y (and
    indels)
  • => more information in DNA or protein alignments?

11
DNA -> Protein: the code
  • 3 nucleotides (a codon) code for one amino acid
    (61 codons! 61x61 rate matrices?)
  • Degeneracy of the code: most amino acids are
    coded by several codons
  • => more data/information in DNA?

12
DNA -> Protein
  • The code is degenerate
  • 20 amino acids are encoded by 61 possible
    codons (plus 3 stop codons)
  • Complex patterns of changes among codons
      • Synonymous/non-synonymous changes
      • Synonymous changes correspond to codon changes
        not affecting the coded amino acid

13
Codon degeneracy: protein alignments as a guide
for DNA alignments
  • Glu-Gly-Ser-Ser-Trp-Leu-Leu-Leu-Gly-Ser
  • Glu-Gly-Ser-Ser-Tyr-Leu-Leu-Ile-Gly-Ser
  • Asp-Gly-Ser-Ala-Trp-Leu-Leu-Leu-Gly-Ser
  • Asp-Gly-Ser-Ala-Tyr-Leu-Leu-Ala-Gly-Ser

  • GAA-GGA-AGC-TCC-TGG-TTA-CTC-CTG-GGA-TCC
  • GAG-GGT-TCC-AGC-TAT-CTA-TTA-ATT-GGT-AGC
  • GAC-GGC-AGT-GCA-TGG-TTG-CTT-TTG-GGC-AGT
  • GAT-GGG-TCA-GCT-TAC-CTC-CTG-GCC-GGG-TCA
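
The relationship between the DNA and protein alignments on this slide can be checked with a short script. This is only an illustrative sketch: the function names `translate` and `p_distance` are made up here, and the standard genetic code is assumed. It translates the four DNA rows and compares the proportion of differing sites at the DNA and protein levels.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna):
    """Translate a coding sequence (hyphens ignored) into one-letter amino acids."""
    dna = dna.replace("-", "").upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing incomplete codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def p_distance(a, b):
    """Proportion of aligned positions that differ between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

# The four DNA rows from the slide.
dna_rows = [
    "GAA-GGA-AGC-TCC-TGG-TTA-CTC-CTG-GGA-TCC",
    "GAG-GGT-TCC-AGC-TAT-CTA-TTA-ATT-GGT-AGC",
    "GAC-GGC-AGT-GCA-TGG-TTG-CTT-TTG-GGC-AGT",
    "GAT-GGG-TCA-GCT-TAC-CTC-CTG-GCC-GGG-TCA",
]
proteins = [translate(row) for row in dna_rows]
plain_dna = [row.replace("-", "") for row in dna_rows]

for protein in proteins:
    print(protein)
print("protein p-distance, rows 1-2:", p_distance(proteins[0], proteins[1]))
print("DNA p-distance, rows 1-2:", p_distance(plain_dna[0], plain_dna[1]))
```

Because of codon degeneracy, the DNA rows differ at far more sites than the protein rows, which is why the protein alignment is the easier guide.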

14
DNA -> Protein: code usage
  • Differences in codon usage can lead to large base
    composition bias - in which case one often needs
    to remove the 3rd codon position, the more
    bias-prone site, and possibly the 1st
  • Comparing protein sequences can reduce the
    compositional bias problem
  • => more information in DNA or protein?

15
Models for DNA and Protein evolution
  • DNA: 4 x 4 rate matrices
      • Easy to estimate (can be combined with tree
        search)
  • Protein: 20 x 20 matrices
      • More complex: time and estimation problems (rare
        changes?) ->
      • Empirical models from large datasets are
        typically used
      • One can correct for amino acid frequencies for a
        given dataset

16
Proteins and their amino acids
  • Proteins determine shape and structure of cells
    and carry most catalytic processes - 3D structure
  • Proteins are polymers of 20 different amino acids
  • The amino acid sequence determines the
    structure (secondary, tertiary) and function of
    the protein
  • Amino acids can be categorized by their side
    chain physicochemical properties
      • Size (small versus large)
      • Polarity (hydrophobic versus hydrophilic, +/-
        charges)

17
(No Transcript)
18
(No Transcript)
19
Amino acid physico-chemical properties
  • Major factor in protein folding
  • Key to protein functions

=> Major influence on the pattern of amino acid
mutations. As for transitions (Ts) versus
transversions (Tv) in DNA sequences, some amino
acid changes are more common than others:
fundamental for sequence comparisons (alignments
and phylogenetics!)
Small <-> small > small <-> big
20
Estimation of relative rates of residue
replacement (models of evolution)
  • Differences/changes in protein alignments can be
    pooled and patterns of changes investigated.
  • Patterns of changes give insights into the
    evolutionary processes underlying protein
    diversification -> estimation of evolutionary
    models
  • Choice of protein evolutionary models can be
    important for the sequence analysis we perform
    (database searching, sequence alignment,
    phylogenetics)

21
Amino acid substitution matrices based on
observed substitutions: empirical models
  • Summarise the substitution pattern from large
    amounts of existing data (average models)
  • Based on a selection of proteins
  • Globular proteins, membrane proteins?
  • Mitochondrial proteins?
  • Uses a given counting method and set of recorded
    changes
  • tree dependent/independent
  • restriction on the sequence divergence

22
Amino acid physico-chemical properties
  • Size
  • Polarity
  • Charges (acidic/basic)
  • Hydrophilic (polar)
  • Hydrophobic (non polar)

23
(No Transcript)
24
Taylor's Venn diagram of amino acid properties
[Figure: Venn diagram placing the 20 amino acids in overlapping property sets: tiny, small, aliphatic, polar, charged (+/-), hydrophobic and aromatic. Cysteine appears twice, as C(S-S) and C(S-H).]
Taylor (1986). J. Theor. Biol. 119, 205-218
25
[Figure: amino acids arranged along two axes, small versus large and hydrophilic versus hydrophobic.]
Kosiol et al. (2004). J. Theor. Biol. 228, 97-106
26
Amino acid categories 1: Doolittle (1985). Sci.
Am. 253, 74-85.
  • Small polar: S, G, D, N
  • Small non-polar: T, A, P, C
  • Large polar: E, Q, K, R
  • Large non-polar: V, I, L, M, F
  • Intermediate polarity: W, Y, H

27
Amino acid categories 2 (PAM matrix)
  • Sulfhydryl: C
  • Small hydrophilic: S, T, A, P, G
  • Acid, amide: D, E, N, Q
  • Basic: H, R, K
  • Small hydrophobic: M, I, L, V
  • Aromatic: F, Y, W

28
Amino acid categories 3 (implemented in SEAVIEW
colour coding)
  • Tiny 1, non-polar: C
  • Tiny 2, non-polar: G
  • Imino acid: P
  • Non-polar: M, V, L, I, A, F, W
  • Acid: D, E
  • Basic: R, K
  • Aromatic: Y, H
  • Uncharged polar: S, T, Q, N

29
Amino acids categories
  • Changes within a category are more common than
    changes between categories
  • Colour coding of alignments to help visualise
    their quality (ClustalX, SEAVIEW)
  • Differential weighting of cost matrices in
    parsimony analyses
  • Mutational data matrices in model based methods
    (e.g. ML and Bayesian framework)
  • Recoding of the 20 amino acids into bins to focus
    on changes between bins (categories) (6x6 matrix)
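
The recoding idea in the last bullet can be sketched in a few lines. The example below is an illustration only: it assumes the six PAM-style categories listed earlier in this presentation, and the names `CATEGORIES`, `recode` and `count_differences` as well as the two toy sequences are chosen here for demonstration.

```python
# Six bins, following the PAM-style categories listed earlier (an assumption;
# other binning schemes exist and would give different recodings).
CATEGORIES = {
    "sulfhydryl": "C",
    "small hydrophilic": "STAPG",
    "acid, amide": "DENQ",
    "basic": "HRK",
    "small hydrophobic": "MILV",
    "aromatic": "FYW",
}
# Map each amino acid to the index of its bin, written as a one-character label.
BIN_OF = {aa: str(i) for i, members in enumerate(CATEGORIES.values()) for aa in members}

def recode(seq):
    """Replace each amino acid by its bin label; gaps and unknowns become '-'."""
    return "".join(BIN_OF.get(aa, "-") for aa in seq.upper())

def count_differences(a, b):
    """Number of aligned positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

seq1 = "AIDESLIIASIATATI"  # toy aligned sequences
seq2 = "AGEEALILASAATSTI"

print(recode(seq1))
print(recode(seq2))
print("amino acid differences:", count_differences(seq1, seq2))
print("between-bin differences:", count_differences(recode(seq1), recode(seq2)))
```

After recoding, only changes between bins remain visible, so a 6x6 rate matrix is enough to model them.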

30
=> Colour coding of different categories is
useful for visual inspection of protein alignments
31
Phylogenetic trees from protein alignments
  • Parsimony-based methods - unweighted/weighted
  • Distance methods - model for distance estimation
      • probability of amino acid changes, site rate
        heterogeneity
  • Maximum likelihood and Bayesian methods - model
    for ML calculations
      • probability of amino acid changes, site rate
        heterogeneity

32
Trees from protein alignment: parsimony methods -
cost matrices
  • All changes weighted equally
  • Differential weighting of changes: an attempt to
    correct for homoplasy!
      • Based on the minimal number of nucleotide
        changes needed per amino acid substitution, the
        genetic code matrix (PHYLIP-PROTPARS)
      • Weights based on physico-chemical properties of
        amino acids
      • Weights based on observed frequency of amino acid
        substitutions in alignments

33
Parsimony: unweighted matrix for amino acid
changes
  • Ile -> Leu cost 1
  • Trp -> Asp cost 1
  • Ser -> Arg cost 1
  • Lys -> Asp cost 1

34
Parsimony: weighted matrix for amino acid
changes, the genetic code matrix
  • Ile -> Leu cost 1
  • Trp -> Asn cost 3
  • Ser -> Arg cost 2
  • Lys -> Asp cost 2

35
Weighting matrix based on minimal amino acid
changes: PROTPARS in PHYLIP

   A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  1  2  T  V  W  Y
A  0  2  1  1  2  1  2  2  2  2  2  2  1  2  2  1  2  1  1  2  2
C  2  0  2  2  1  1  2  2  2  2  2  2  2  2  1  1  1  2  2  1  1
D  1  2  0  1  2  1  1  2  2  2  2  1  2  2  2  2  2  2  1  2  1
E  1  2  1  0  2  1  2  2  1  2  2  2  2  1  2  2  2  2  1  2  2
F  2  1  2  2  0  2  2  1  2  1  2  2  2  2  2  1  2  2  1  2  1
G  1  1  1  1  2  0  2  2  2  2  2  2  2  2  1  2  1  2  1  1  2
H  2  2  1  2  2  2  0  2  2  1  2  1  1  1  1  2  2  2  2  2  1
I  2  2  2  2  1  2  2  0  1  1  1  1  2  2  1  2  1  1  1  2  2
K  2  2  2  1  2  2  2  1  0  2  1  1  2  1  1  2  2  1  2  2  2
L  2  2  2  2  1  2  1  1  2  0  1  2  1  1  1  1  2  2  1  1  2
M  2  2  2  2  2  2  2  1  1  1  0  2  2  2  1  2  2  1  1  2  3
N  2  2  1  2  2  2  1  1  1  2  2  0  2  2  2  2  1  1  2  3  1
P  1  2  2  2  2  2  1  2  2  1  2  2  0  1  1  1  2  1  2  2  2
Q  2  2  2  1  2  2  1  2  1  1  2  2  1  0  1  2  2  2  2  2  2
R  2  1  2  2  2  1  1  1  1  1  1  2  1  1  0  2  1  1  2  1  2
1  1  1  2  2  1  2  2  2  2  1  2  2  1  2  2  0  2  1  2  1  1
2  2  1  2  2  2  1  2  1  2  2  2  1  2  2  1  2  0  1  2  2  2
T  1  2  2  2  2  2  2  1  1  2  1  1  1  2  1  1  1  0  2  2  2

Columns 1 and 2 are the two serine codon families (TCN and AGY), which PROTPARS treats as separate states.

W: TGG; N: AAC, AAT. A minimum of 3 changes is
needed at the DNA level for W <-> N
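
The W <-> N example can be reproduced by taking, for each pair of amino acids, the smallest number of nucleotide differences between any of their codons. The sketch below is a simplified illustration, not the PROTPARS implementation: it assumes the standard genetic code, treats serine as a single state, and uses the made-up function name `min_nucleotide_changes`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Group codons by the amino acid they encode.
CODONS_OF = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS_OF.setdefault(aa, []).append("".join(codon))

def min_nucleotide_changes(aa1, aa2):
    """Smallest number of differing bases between any codon of aa1 and any codon of aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in CODONS_OF[aa1]
        for c2 in CODONS_OF[aa2]
    )

for pair in [("I", "L"), ("W", "N"), ("K", "D")]:
    print(pair, min_nucleotide_changes(*pair))
```

Such a table can serve as a cost matrix in weighted parsimony: substitutions that need more nucleotide changes are penalised more heavily.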
36
Phylogenetic trees from protein alignments
  • Parsimony-based methods - unweighted/weighted
  • Distance methods - model for distance estimation
      • probability of amino acid changes, site rate
        heterogeneity
  • Maximum likelihood and Bayesian methods - model
    for ML calculations
      • probability of amino acid changes, site rate
        heterogeneity

37
Distance methods
  • A two step approach - two choices!
  • 1) Estimate all pairwise distances
  • Choose a method (100s) - has an explicit model
    for sequence evolution
  • 2) Estimate a tree from the distance matrix
  • Choose a method with or without an optimality
    criterion?

38
Estimation of protein pairwise distances
  • Simple formula
  • More complex models
  • 20 x 20 matrices (evolutionary model)
  • Identity matrix
  • Genetic code matrix
  • Mutational data matrices (MDMs)
  • Correction for rate heterogeneity between sites
    (gamma distribution Γ with shape parameter α,
    proportion of invariant sites pinv)

39
The Kimura formula: correction for multiple hits
  • $d_{ij} = -\ln\left(1 - D_{ij} - D_{ij}^2/5\right)$
  • $D_{ij}$: the observed dissimilarity between i and j
    (0-1).
  • Can give a good estimate of $d_{ij}$ for
    $0.75 > D_{ij} > 0$
  • It can approximate the PAM matrix well
  • If $D_{ij} \geq 0.8541$, $d_{ij}$ is infinite.
  • Implemented in ClustalX 1.83 and PHYLIP 3.62
  • Does not take into account which amino acids are
    changing
  • -> Importance of mutational data matrices (MDMs)!
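
As a worked illustration of the formula on this slide, the sketch below computes the observed dissimilarity of two aligned sequences and applies the Kimura correction. The function names `p_distance` and `kimura_distance` and the toy sequences are chosen here for illustration; gapped sites are simply skipped, which is one possible convention among several.

```python
import math

def p_distance(seq_a, seq_b):
    """Observed dissimilarity D: fraction of ungapped aligned sites that differ."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)

def kimura_distance(D):
    """Kimura correction d = -ln(1 - D - D^2/5); infinite once the log argument hits zero."""
    inner = 1.0 - D - D * D / 5.0
    return math.inf if inner <= 0 else -math.log(inner)

seq1 = "AIDESLIIASIATATI"  # toy aligned sequences
seq2 = "AGEEALILASAATSTI"

D = p_distance(seq1, seq2)
print("observed dissimilarity D:", D)
print("corrected distance d:", round(kimura_distance(D), 3))
print("d for D = 0.9:", kimura_distance(0.9))
```

The corrected distance is always larger than the observed dissimilarity, and the gap between the two widens quickly as D approaches the 0.8541 limit.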

40
Amino acid substitution matrices (MDMs)
  • Sequence alignments based matrices
  • PAM, JTT, BLOSUM, WAG...
  • Structure alignments based matrices
  • STR (for highly divergent sequences)

41
Protein distance measurements with MDM
  • 20 x 20 matrices
  • PAM, BLOSUM, WAG matrices
  • Maximum likelihood calculation which takes into
    account
      • All sites in the alignment
      • All pairwise rates in the matrix
      • Branch length

$d_{ij} = \mathrm{ML}\left[P(n), X_{ij}, (\Gamma, p_{inv})\right]$ (dodgy notation!)
$d_{ij} = -\ln\left(1 - D_{ij} - D_{ij}^2/5\right) = F(D_{ij})$
42
How is an MDM inferred?
  • Observed raw changes are corrected for
  • The amino acid relative mutability
  • The amino acid normalised frequency
  • Differences between MDM come from
  • Choice of proteins used (membrane, globular)
  • Range of sequence similarities used
  • Counting methods
  • On a tree MP, ML
  • Pairwise comparison from an alignment

-> empirical models from large datasets are
typically used
43
How is an MDM inferred?
The raw data: observed changes in pairwise
comparisons in an alignment or on a tree
seq.1 AIDESLIIASIATATI
seq.2 AGDEALILASAATSTI
44
seq.1 AIDESLIIASIATATI
seq.2 AGEEALILASAATSTI

   A  S  T  G  I  L  E  D
A  3
S  2  1
T  0  0  2
G  0  0  0  0
I  1  0  0  1  2
L  0  0  0  0  1  1
E  0  0  0  0  0  0  1
D  0  0  0  0  0  0  1  0

Raw matrix: symmetrical!
-> The larger the dataset the better the
estimates!
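
Counting the raw matrix from a pairwise comparison can be sketched as follows. This is a minimal illustration that assumes a simple column-by-column count on the two sequences above; the function name `count_pairs` is made up here, and counting changes on a tree (MP or ML) would need a different procedure.

```python
from collections import Counter

def count_pairs(seq_a, seq_b):
    """Count aligned residue pairs, ignoring order so the matrix is symmetrical."""
    counts = Counter()
    for a, b in zip(seq_a, seq_b):
        if a != "-" and b != "-":  # skip gapped columns
            counts[tuple(sorted((a, b)))] += 1
    return counts

seq1 = "AIDESLIIASIATATI"
seq2 = "AGEEALILASAATSTI"

pair_counts = count_pairs(seq1, seq2)
for (a, b), n in sorted(pair_counts.items()):
    print(a, b, n)
```

Sorting each pair before counting means A/S and S/A land in the same cell, so only the lower (or upper) triangle needs to be stored.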
45
Amino Acid exchange matrices
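$$
Q = \begin{pmatrix}
- & s_{1,2} & s_{1,3} & \cdots & s_{1,20} \\
s_{1,2} & - & s_{2,3} & \cdots & s_{2,20} \\
s_{1,3} & s_{2,3} & - & \cdots & s_{3,20} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
s_{1,20} & s_{2,20} & s_{3,20} & \cdots & -
\end{pmatrix}
\times \mathrm{diag}(\pi_1, \ldots, \pi_{20})
$$

  • $Q$: rate matrix
  • $Q_{ij}$: instantaneous rates of change of amino acids
  • $s_{ij}$: exchangeabilities of amino acid pairs $ij$
  • $s_{ij} = s_{ji}$: time reversibility
  • $\pi_i$: stationarity of amino acid frequencies
    (typically the observed proportion of residues in
    the dataset)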
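
The construction on this slide can be sketched numerically. The example below is an illustration under stated assumptions: it reads the definition as Q_ij = s_ij x pi_j for i different from j, sets each diagonal entry so that the row sums to zero (the usual convention; the slide only shows "-" there), and uses a hypothetical 3-state alphabet with invented values instead of the full 20 x 20 matrix. The function name `build_rate_matrix` is made up here.

```python
def build_rate_matrix(s, pi):
    """Build Q from symmetric exchangeabilities s and stationary frequencies pi."""
    n = len(pi)
    # Off-diagonal entries: exchangeability times the frequency of the target state.
    q = [[s[i][j] * pi[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    # Diagonal entries: minus the sum of the rest of the row, so each row sums to zero.
    for i in range(n):
        q[i][i] = -sum(q[i])
    return q

# Hypothetical toy values for three states (not taken from any published matrix).
s = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 0.5],
    [2.0, 0.5, 0.0],
]
pi = [0.5, 0.3, 0.2]

Q = build_rate_matrix(s, pi)
for row in Q:
    print([round(x, 3) for x in row])
```

Because s is symmetric, pi_i x Q_ij equals pi_j x Q_ji for every pair, which is the time-reversibility property named on the slide.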
46
Amino Acid exchange matrices
  • F: raw matrix. Observed changes (counted on an MP
    tree or in pairwise comparisons)
  • R: relative rate matrix (no composition, no branch
    length)
  • Q: rate matrix (with composition, no branch length)
  • P: probability matrix (composition + branch
    length). Can be estimated using ML on a tree
  • R: relatedness odds matrix. Used for scoring
    alignments (BlastP, ClustalX)
Modified from Peter Foster
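
The step from a rate matrix Q to a probability matrix P adds the branch length t, conventionally through the matrix exponential P(t) = exp(Qt). The sketch below is a toy illustration of that standard relationship, not the method of any particular program: Q is a hypothetical reversible 3-state matrix with stationary frequencies (0.5, 0.3, 0.2), and the helper names `mat_mul`, `mat_exp` and `transition_probabilities` are made up here. The exponential is computed in plain Python with a truncated Taylor series plus scaling and squaring.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_exp(m, terms=20, squarings=10):
    """Matrix exponential via a truncated Taylor series with scaling and squaring."""
    n = len(m)
    scaled = [[x / 2 ** squarings for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes scaled^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)  # undo the scaling
    return result

def transition_probabilities(q, t):
    """P(t) = exp(Q t): probabilities of change along a branch of length t."""
    return mat_exp([[x * t for x in row] for row in q])

# Hypothetical toy rate matrix; each row sums to zero.
Q = [
    [-0.70, 0.30, 0.40],
    [0.50, -0.60, 0.10],
    [1.00, 0.15, -1.15],
]
pi = [0.5, 0.3, 0.2]

for t in (0.1, 1.0, 100.0):
    print("t =", t)
    for row in transition_probabilities(Q, t):
        print("  ", [round(x, 4) for x in row])
```

For short branches P stays close to the identity matrix, and for very long branches every row approaches the stationary frequencies, so the starting amino acid is forgotten.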
47
The PAM and JTT matrices
  • PAM - Dayhoff et al. 1968
  • Nuclear encoded genes, 100 proteins
  • JTT - Jones et al. 1992
  • 59,190 accepted point mutations for 16,300
    proteins
  • Jones, Taylor & Thornton (1992). CABIOS 8, 275-282

48
The BLOSUM matrices
Henikoff & Henikoff (1992). Proc. Natl. Acad. Sci.
USA 89, 10915-9
  • BLOcks SUbstitution Matrices
  • The matrix values are based on 2000 conserved
    amino acid patterns (blocks) - pairwise
    comparisons
  • => more efficient for distantly related proteins
  • => more agreement with 3D structure data
  • BLOSUM62 - 62% minimum sequence identity (BlastP
    default)
  • BLOSUM50 - 50% minimum sequence identity
  • BLOSUM42 - 42% minimum sequence identity (BlastP)
49
The WAG matrix
Whelan and Goldman (2001) Mol. Biol. Evol. 18,
691-699
  • Globular protein sequences
  • 3,905 sequences from 182 protein families
  • Produced a phylogenetic tree for every family
    and used maximum likelihood to estimate the
    relative rate values in the rate matrix (overall
    lnL over 182 different trees)
  • Better fit of the model with most data
    (significant improvement of the tree lnL when
    compared to PAM or JTT matrices)
  • Might not be the best option in some cases such
    as for mitochondria encoded proteins or other
    membrane proteins

50
Comparisons of MDMs ($s_{ij}$): amino acid
exchangeability
Whelan and Goldman (2001) Mol. Biol. Evol. 18,
691-699
[Figure: panels comparing the amino acid exchangeabilities of the JTT, PAM and WAG matrices.]
51
Log-odds matrices
  • $\mathrm{MDM}_{ij} = 10 \log_{10} R_{ij}$

The $\mathrm{MDM}_{ij}$ values are rounded to the nearest
integer
  • $\mathrm{MDM}_{ij} < 0$: frequency less than chance
  • $\mathrm{MDM}_{ij} = 0$: frequency expected by chance
  • $\mathrm{MDM}_{ij} > 0$: frequency greater than chance
The log-odds matrices can be used for scoring
alignments (Blast and ClustalX)
52
PAM250 Amino Acid Substitution Matrix

    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C  12
S   0   2
T  -2   1   3
P  -3   1   0   6
A  -2   1   1   1   2
G  -3   1   0   0   1   5
N  -4   1   0   0   0   0   2
D  -5   0   0  -1   0   1   2   4
E  -5   0   0  -1   0   0   1   3   4
Q  -5  -1  -1  -1   0  -1   1   2   2   4
H  -3  -1  -1   0  -1  -2   2   1   1   3   6
R  -4   0  -2   0  -2  -3   0  -1  -1   1   2   6
K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M  -5  -1  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L  -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F  -4  -3  -3  -5  -3  -5  -3  -6  -5  -5  -2  -4  -5   0   1   2  -1   9

Groups: (1) sulfhydryl: C; (2) small hydrophilic: S, T, P, A, G; (3) acid, acid-amide and hydrophilic: N, D, E, Q; (4) basic: H, R, K; (5) small hydrophobic: M, I, L, V

  • $\mathrm{MDM}_{ij} < 0$: frequency less than chance
  • $\mathrm{MDM}_{ij} = 0$: frequency expected by chance
  • $\mathrm{MDM}_{ij} > 0$: frequency greater than chance
53
BLOSUM62 Amino Acid Substitution Matrix

    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   9
S  -1   4
T  -1   1   5
P  -3  -1  -1   7
A   0   1   0  -1   4
G  -3   0  -2  -2   0   6
N  -3   1   0  -2  -2   0   6
D  -3   0  -1  -1  -2  -1   1   6
E  -4   0  -1  -1  -1  -2   0   2   5
Q  -3   0  -1  -1  -1  -2   0   0   2   5
H  -3  -1  -2  -2  -2  -2   1  -1   0   0   8
R  -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5
K  -3   0  -1  -1  -1  -2   0  -1   1   1  -1   2   5
M  -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5
I  -1  -2  -1  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4
L  -1  -2  -1  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4
V  -1  -2   0  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4
F  -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6

Groups: (1) sulfhydryl: C; (2) small hydrophilic: S, T, P, A, G; (3) acid, acid-amide and hydrophilic: N, D, E, Q; (4) basic: H, R, K; (5) small hydrophobic: M, I, L, V

  • $\mathrm{MDM}_{ij} < 0$: frequency less than chance
  • $\mathrm{MDM}_{ij} = 0$: frequency expected by chance
  • $\mathrm{MDM}_{ij} > 0$: frequency greater than chance
54
Log-odds matrices
  • $\mathrm{MDM}_{ij} = 10 \log_{10} R_{ij}$

The $\mathrm{MDM}_{ij}$ values are rounded to the nearest
integer
  • $\mathrm{MDM}_{ij} < 0$: frequency less than chance
  • $\mathrm{MDM}_{ij} = 0$: frequency expected by chance
  • $\mathrm{MDM}_{ij} > 0$: frequency greater than chance
I <-> M: log-odds = 2 (in PAM250). A score of 2
corresponds to an actual value of
$\log_{10} R_{ij} = 0.2$ ($\log_{10} 1.6 = 0.20412$),
hence $R_{ij} = 10^{0.2} \approx 1.6$.
Meaning: I <-> M changes between two sequences
occur 1.6 times more often than expected at random
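
The conversion between a log-odds score and the underlying odds ratio on this slide is a one-line calculation in each direction. The sketch below is illustrative only; the function names `log_odds_score` and `odds_from_score` are made up here.

```python
import math

def log_odds_score(odds):
    """Score = 10 * log10(R), rounded to the nearest integer."""
    return round(10 * math.log10(odds))

def odds_from_score(score):
    """Invert the scaling: R = 10 ** (score / 10), ignoring the rounding step."""
    return 10 ** (score / 10)

print("score for R = 1.6:", log_odds_score(1.6))
print("odds for score 2:", round(odds_from_score(2), 2))
print("odds for score -5:", round(odds_from_score(-5), 2))
```

Because the scores are rounded, converting back gives only an approximate odds ratio; a negative score such as -5 corresponds to a change seen roughly a third as often as expected by chance.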
55
Summary 1
  • Many amino acid rate matrices (MDMs) exist and one
    needs to choose one for protein comparisons
    (alignment, phylogenetics...)
      • Do not hesitate to experiment!
  • One should make a rational choice (as much as
    possible)
      • How was the rate matrix produced?
      • What are the structural features of the sequences
        that you are comparing? Globular/membrane
        protein?
      • What is the level of sequence identity of the
        compared sequences?
      • Does one MDM fit my data better than the others?
        You can use ModelGenerator or ProtTest to compare
        models
  • Always try to correct for rate heterogeneity
    between sites in phylogenetics!

56
Summary 2
  • In practice MDMs are obtained by averaging the
    observed changes and amino acid frequencies
    across numerous proteins (e.g. JTT, BLOSUM) and
    are then applied to your specific dataset
  • With some software you can correct an MDM for the
    $\pi_i$ values of your data (amino acid frequencies,
    -F option)
  • Specific matrices have been calculated to reflect
    particular composition biases
      • Mitochondrial proteins: the mtREV24 matrix
      • Transmembrane domains: PHAT
  • Using recoding of amino acids one can generate
    dataset-specific models (specific GTR-type model)

57
And
  • Other developments
      • What about context-dependent MDMs: alpha helices
        versus beta sheets, surface accessibility?
      • Heterogeneous models between sites or taxa
        (branches)
      • Protein LogDet? For long alignments only
  • Modeltest-like software that allows protein
    models to be chosen analytically
      • ModelGenerator: http://bioinf.may.ie/software/
      • ProtTest: http://darwin.uvigo.es