Title: Pairwise sequence
1 Pairwise sequence alignment: homology, score matrix
2 Outline: pairwise alignment
- Overview and examples
- Definitions: homologs, paralogs, orthologs
- Assigning scores to aligned amino acids
- Dayhoff's PAM matrices
- Alignment algorithms: Needleman-Wunsch, Smith-Waterman
- Statistical significance of pairwise alignments
3 Pairwise alignments in the 1950s
β-corticotropin (sheep)   ala gly glu asp asp glu
Corticotropin A (pig)     asp gly ala glu asp glu

Oxytocin      CYIQNCPLG
Vasopressin   CYFQNCPRG
4 Pairwise sequence alignment is the most fundamental operation of bioinformatics
- It is used to decide if two proteins (or genes) are related structurally or functionally
- It is used to identify domains or motifs that are shared between proteins
- It is the basis of BLAST searching
- It is used in the analysis of genomes
6 Pairwise alignment: protein sequences can be more informative than DNA
- protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties
- codons are degenerate: changes in the third position often do not alter the amino acid that is specified
- DNA sequences can be translated into protein, and then used in pairwise alignments
7Page 54
8 Pairwise alignment: protein sequences can be more informative than DNA
DNA can be translated into six potential proteins (three reading frames on each strand)

Forward frames (top strand, 5' to 3'):     CAT CAA ...   ATC AAC ...   TCA ACT ...
5' CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3'
3' GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5'
Reverse frames (bottom strand, 5' to 3'):  GTG GGT ...   TGG GTA ...   GGG TAG ...
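The six-frame idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a production translator: it assumes the standard genetic code, and the function names (`reverse_complement`, `translate`, `six_frame_translation`) are made up for this example. Stop codons are written as `*`.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order
# at each of the three positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Three frames on the given strand, then three on its reverse complement."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


top_strand = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for number, protein in enumerate(six_frame_translation(top_strand), start=1):
    print(number, protein)
```

The first two codons of each frame match the triplets drawn on the slide (CAT CAA, ATC AAC, TCA ACT on the top strand; GTG GGT, TGG GTA, GGG TAG on the bottom strand).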
9 Pairwise alignment: protein sequences can be more informative than DNA
Many times, DNA alignments are appropriate:
- to confirm the identity of a cDNA
- to study noncoding regions of DNA
- to study DNA polymorphisms
- example: Neanderthal vs modern human DNA

Query 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
Sbjct 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247
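A DNA alignment like the Query/Sbjct pair above is often summarized by its percent identity. The sketch below shows one way to compute it; the function name `percent_identity` is hypothetical, and the choice to divide by the number of alignment columns (gap columns included) is an assumption, since other conventions divide by the shorter sequence length.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent of alignment columns holding the same residue in both rows.

    Assumption: the denominator is the total number of columns,
    including columns that contain a gap.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return 100.0 * identical / len(row_a)


# The two aligned rows from the slide, with the wrapped lines joined.
query = "catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac"
sbjct = "catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac"
print(f"{percent_identity(query, sbjct):.1f}% identity over {len(query)} columns")
```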
10 β-lactoglobulin (P02754)
retinol-binding protein 4 (NP_006735)
Page 42
11 Outline: pairwise alignment
- Overview and examples
- Definitions: homologs, paralogs, orthologs
- Assigning scores to aligned amino acids
- Dayhoff's PAM matrices
- Alignment algorithms: Needleman-Wunsch, Smith-Waterman
- Statistical significance of pairwise alignments
12 Definitions
Pairwise alignment: The process of lining up two sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.

wheat  --DPNKPKRAMTSFVFFMSEFRSEFKQKHSKLKSIVEMVKAAGER
?????  KKDSNAPKRAMTSFMFFSSDFRS----KHSDL-SIVEMSKAAGAA
13 Definitions
Homology: Similarity attributed to descent from a common ancestor.
Page 42
14 Definitions
Homology: Similarity attributed to descent from a common ancestor.
Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.

RBP        26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVA 59
                K         GTW  MA        L     A
glycodelin 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKA 55
Page 44
15 Definitions: two types of homology
Orthologs: Homologous sequences in different species that arose from a common ancestral gene during speciation; they may or may not be responsible for a similar function.
Paralogs: Homologous sequences within a single species that arose by gene duplication.
Page 43
16 Orthologs: members of a gene (protein) family in various organisms (arising during speciation). This tree shows RBP orthologs.
[Tree figure. Taxa: common carp, zebrafish, rainbow trout, teleost, African clawed frog, chicken, human, mouse, rat, horse, rabbit, cow, pig. Scale bar: 10 changes]
Page 43
17 Paralogs: members of a gene (protein) family within a species (arising during duplication).
[Tree figure. Proteins: apolipoprotein D, retinol-binding protein 4, complement component 8, alpha-1 microglobulin/bikunin, prostaglandin D2 synthase, progestagen-associated endometrial protein, neutrophil gelatinase-associated lipocalin, odorant-binding protein 2A, lipocalin 1. Scale bar: 10 changes]
Page 44
19 Pairwise alignment of retinol-binding protein 4 and β-lactoglobulin

  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50 RBP
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44 lactoglobulin

 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97 RBP
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93 lactoglobulin

 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin

(The match line between the two sequences did not survive extraction.)
Page 46
20 Definitions
Similarity: The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.
Identity: The extent to which two sequences are invariant.
Conservation: Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
21 Pairwise alignment of retinol-binding protein and β-lactoglobulin
[Alignment repeated from slide 19. Legend: identity (bar)]
Page 46
22 Pairwise alignment of retinol-binding protein and β-lactoglobulin
[Alignment repeated from slide 19. Legend: very similar (two dots); somewhat similar (one dot)]
Page 46
23 Definitions
Pairwise alignment: The process of lining up two sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.
Page 47
24 Pairwise alignment of retinol-binding protein and β-lactoglobulin
[Alignment repeated from slide 19. Labels: internal gap; terminal gap]
Page 46
25 Gaps
Positions at which a letter is paired with a null are called gaps. Gap scores are typically negative. Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap: opening a gap is penalized more heavily than extending one.
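One common way to encode "presence matters more than length" is an affine gap penalty, with a large cost to open a gap and a small cost for each additional position. The sketch below is illustrative only: the function name and the default values (11 to open, 1 to extend) are assumptions chosen for the example, and the convention used here charges the opening cost for the first gapped position.

```python
def affine_gap_score(length, gap_open=11, gap_extend=1):
    """Score (negative) for one gap of the given length.

    Assumed convention: the first gapped position costs gap_open and each
    further position costs gap_extend. Both defaults are illustrative.
    """
    if length <= 0:
        return 0
    return -(gap_open + gap_extend * (length - 1))


# One gap of four residues versus four separate single-residue gaps.
one_long_gap = affine_gap_score(4)
four_short_gaps = 4 * affine_gap_score(1)
print(one_long_gap, four_short_gaps)
```

Under this scheme a single four-residue gap is penalized far less than four separate one-residue gaps, which matches the idea that one mutational event can insert or delete several residues at once.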
26 Pairwise alignment of retinol-binding protein and β-lactoglobulin
[Alignment repeated from slide 19]
27 Pairwise alignment of retinol-binding protein from human (top) and rainbow trout (O. mykiss)

  1 .MKWVWALLLLA.AWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP  48
  1 MLRICVALCALATCWA...QDCQVSNIQVMQNFDRSRYTGRWYAVAKKDP  47

 49 EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED  98
 48 VGLFLLDNVVAQFSVDESGKMTATAHGRVIILNNWEMCANMFGTFEDTPD  97

 99 PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
 98 PAKFKMRYWGAASYLQTGNDDHWVIDTDYDNYAIHYSCREVDLDGTCLDG 147

149 YSFVFSRDPNGLPPEAQKIVRQRQEELCLARQYRLIVHNGYCDGRSERNLL 199
148 YSFIFSRHPTGLRPEDQKIVTDKKKEICFLGKYRRVGHTGFCESS...... 192

(The match line between the two sequences did not survive extraction.)
28 Pairwise sequence alignment allows us to look back billions of years ago (BYA)
[Timeline figure, 4 to 0 BYA. Labels: origin of life; earliest fossils; origin of eukaryotes; eukaryote/archaea; fungi/animal; plant/animal; insects]
Page 48
29 Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases

fly        GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human      GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant      GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium  GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast      GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon   GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly        KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human      KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant      KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium  KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast      KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon   KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly        GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human      GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant      GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium  GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast      GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon   GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA
Page 49
30 Multiple sequence alignment of human lipocalin paralogs

lipocalin 1                      EIQDVSGTWYAMTVDREFPEMNLESVTPMTLTTL.GGNLEAKVTM
odorant-binding protein 2a  LSFTLEEEDITGTWYAMVVDKDFPEDRRRKVSPVKVTALGGGNLEATFTF
progestagen-assoc. endo.    TKQDLELPKLAGTWHSMAMATNNISLMATLKAPLRVHITSEDNLEIVLHR
apolipoprotein D            VQENFDVNKYLGRWYEIEKIPTTFENGRCIQANYSLMENGNQELRADGTV
retinol-binding protein     VKENFDKARFSGTWYAMAKDPEGLFLQDNIVAEFSVDETGNWDVCADGTF
neutrophil gelatinase-ass.  LQQNFQDNQFQGKWYVVGLAGNAI.LREDKDPQKMYATIDKSYNVTSVLF
prostaglandin D2 synthase   VQPNFQQDKFLGRWFSAGLASNSSWLREKKAALSMCKSVDGGLNLTSTFL
alpha-1-microglobulin       VQENFNISRIYGKWYNLAIGSTCPWMDRMTVSTLVLGEGEAEISMTSTRW
complement component 8      PKANFDAQQFAGTWLLVAVGSACRFLQRAEATTLHVAPQGSTFRKLD...
Page 49
31 Outline: pairwise alignment
- Overview and examples
- Definitions: homologs, paralogs, orthologs
- Assigning scores to aligned amino acids
- Dayhoff's PAM matrices
- Alignment algorithms: Needleman-Wunsch, Smith-Waterman
- Statistical significance of pairwise alignments
32 General approach to pairwise alignment
- Choose two sequences
- Select an algorithm that generates a score
- Allow gaps (insertions, deletions)
- Score reflects degree of similarity
- Alignments can be global or local
- Estimate probability that the alignment occurred by chance
33 Calculation of an alignment score
Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Alignment_Scores2.html
34 lys found at 58% of arg sites
Emile Zuckerkandl and Linus Pauling (1965) considered substitution frequencies in 18 globins (myoglobins and hemoglobins from human to lamprey).
Black: identity
Gray: very conservative substitutions (>40% occurrence)
White: fairly conservative substitutions (>21% occurrence)
Red: no substitutions observed
Page 80
35Page 80
36 Dayhoff's 34 protein superfamilies

Protein                        PAMs per 100 million years
Ig kappa chain                 37
Kappa casein                   33
luteinizing hormone β          30
lactalbumin                    27
complement component 3         27
epidermal growth factor        26
proopiomelanocortin            21
pancreatic ribonuclease        21
haptoglobin alpha              20
serum albumin                  19
phospholipase A2, group IB     19
prolactin                      17
carbonic anhydrase C           16
Hemoglobin α                   12
Hemoglobin β                   12
Page 50
37Dayhoffs 34 protein superfamilies
human (NP_005203) versus mouse (NP_031812)
38 Dayhoff's 34 protein superfamilies

Protein                        PAMs per 100 million years
apolipoprotein A-II            10
lysozyme                       9.8
gastrin                        9.8
myoglobin                      8.9
nerve growth factor            8.5
myelin basic protein           7.4
thyroid stimulating hormone β  7.4
parathyroid hormone            7.3
parvalbumin                    7.0
trypsin                        5.9
insulin                        4.4
calcitonin                     4.3
arginine vasopressin           3.6
adenylate kinase 1             3.2
Page 50
39 Dayhoff's 34 protein superfamilies

Protein                              PAMs per 100 million years
triosephosphate isomerase 1          2.8
vasoactive intestinal peptide        2.6
glyceraldehyde phosph. dehydrogenase 2.2
cytochrome c                         2.2
collagen                             1.7
troponin C, skeletal muscle          1.5
alpha crystallin B chain             1.5
glucagon                             1.2
glutamate dehydrogenase              0.9
histone H2B, member Q                0.9
ubiquitin                            0
Page 50
40Pairwise alignment of human (NP_005203) versus
mouse (NP_031812) ubiquitin
41 Accepted point mutations (PAMs): inferring amino acid substitutions between a protein and its ancestor
Dayhoff et al. compared protein sequences with inferred ancestors, rather than with each other directly. Consider four globins (myoglobin, alpha, beta, delta globin). In a phylogenetic tree there are four existing sequences plus two inferred ancestral sequences (5, 6). (We will learn how to make trees later.)
[Tree figure: existing sequences at tips 1 to 4; inferred ancestors at nodes 5 and 6]
42 Accepted point mutations (PAMs): inferring amino acid substitutions between a protein and its ancestor
- The tree is made from a multiple sequence alignment of the four globins. Consider a comparison of alpha globin to myoglobin, and to their common ancestor (node 6).
- A direct comparison suggests alanine changed to glycine. But an ancestral glutamate changed to ala or gly!
- Three additional examples are boxed.
[Tree figure: existing sequences at tips 1 to 4; inferred ancestors at nodes 5 and 6]

beta       MVHLTPEEKSAVTALWGKV
delta      MVHLTPEEKTAVNALWGKV
alpha      MV.LSPADKTNVKAAWGKV
myoglobin  .MGLSDGEWQLVLNVWGKV
5          MVHLSPEEKTAVNALWGKV
6          MVHLTPEEKTAVNALWGKV
43 Dayhoff's numbers of accepted point mutations: what amino acid substitutions occur in proteins?
Dayhoff (1978) p.346.
Page 52
44Multiple sequence alignment of glyceraldehyde
3-phosphate dehydrogenases
45 The relative mutability of amino acids
Dayhoff et al. described the "relative mutability" of each amino acid as the probability that amino acid will change over a small evolutionary time period. The total number of changes is counted (on all branches of all protein trees considered), and the total number of occurrences of each amino acid is also counted. A ratio is determined.

Relative mutability ∝ changes / occurrences

Example:
sequence 1   ala his val ala
sequence 2   ala arg ser val

For ala, relative mutability = 1 / 3 = 0.33
For val, relative mutability = 2 / 2 = 1.0
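The worked example above can be reproduced with a short Python sketch. It assumes the two sequences are already aligned position by position, counts each mismatch as a change for both residues involved, and counts occurrences across both sequences; the function name `relative_mutability` is made up for this illustration, and no normalization to ala = 100 is applied.

```python
from collections import Counter


def relative_mutability(seq1, seq2):
    """Changes divided by occurrences for each amino acid in two aligned sequences."""
    occurrences = Counter(seq1) + Counter(seq2)
    changes = Counter()
    for a, b in zip(seq1, seq2):
        if a != b:
            # A mismatch counts as one change for each residue involved.
            changes[a] += 1
            changes[b] += 1
    return {aa: changes[aa] / occurrences[aa] for aa in occurrences}


sequence_1 = ["ala", "his", "val", "ala"]
sequence_2 = ["ala", "arg", "ser", "val"]
mutability = relative_mutability(sequence_1, sequence_2)
print(round(mutability["ala"], 2), mutability["val"])
```

For ala this gives 1 change over 3 occurrences, and for val 2 changes over 2 occurrences, as on the slide.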
Page 53
46 The relative mutability of amino acids

Asn 134    His 66
Ser 120    Arg 65
Asp 106    Lys 56
Glu 102    Pro 56
Ala 100    Gly 49
Thr  97    Tyr 41
Ile  96    Phe 41
Met  94    Leu 40
Gln  93    Cys 20
Val  74    Trp 18
Page 53
47The relative mutability of amino acids
Note that alanine is normalized to a value of
100. Trp and cys are least mutable. Asn and ser
are most mutable.
Page 53
48 Normalized frequencies of amino acids
- Gly 8.9%   Arg 4.1%
- Ala 8.7%   Asn 4.0%
- Leu 8.5%   Phe 4.0%
- Lys 8.1%   Gln 3.8%
- Ser 7.0%   Ile 3.7%
- Val 6.5%   His 3.4%
- Thr 5.8%   Cys 3.3%
- Pro 5.1%   Tyr 3.0%
- Glu 5.0%   Met 1.5%
- Asp 4.7%   Trp 1.0%
- Color key on the original slide: blue = 6 codons; red = 1 codon
- These frequencies f_i sum to 1
Page 53
49Page 54
50 Dayhoff's numbers of accepted point mutations: what amino acid substitutions occur in proteins?
Page 52
51 Dayhoff's mutation probability matrix for the evolutionary distance of 1 PAM
We have considered three kinds of information:
- a table of numbers of accepted point mutations (PAMs)
- relative mutabilities of the amino acids
- normalized frequencies of the amino acids in PAM data
This information can be combined into a "mutation probability matrix" in which each element Mij gives the probability that the amino acid in column j will be replaced by the amino acid in row i after a given evolutionary interval (e.g. 1 PAM).
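A toy sketch may help show how the three kinds of information fit together. It uses a three-letter alphabet with invented counts, mutabilities and frequencies, so none of the numbers are Dayhoff's. It assumes the commonly described construction in which an off-diagonal element is Mij = λ · mj · Aij / Σi Aij, the diagonal is Mjj = 1 − λ · mj, and λ is scaled so that the expected change is 1 per 100 residues. The function name and the `target` parameter are hypothetical.

```python
def mutation_probability_matrix(counts, mutability, freqs, target=0.01):
    """Build M[i][j]: probability that residue j (column) is replaced by residue i (row).

    counts[i][j] : number of accepted point mutations between i and j (symmetric)
    mutability[j]: relative mutability of residue j
    freqs[j]     : normalized frequency of residue j (sums to 1)
    target       : expected fraction of residues changed (0.01 for 1 PAM)
    """
    letters = list(freqs)
    # Scale factor so that sum_j freqs[j] * (1 - M[j][j]) equals target.
    lam = target / sum(freqs[a] * mutability[a] for a in letters)
    matrix = {i: {} for i in letters}
    for j in letters:
        column_total = sum(counts[i][j] for i in letters if i != j)
        for i in letters:
            if i == j:
                matrix[i][j] = 1.0 - lam * mutability[j]
            else:
                matrix[i][j] = lam * mutability[j] * counts[i][j] / column_total
    return matrix


# Invented toy data for a three-letter alphabet (not real amino acid data).
toy_counts = {
    "A": {"A": 0, "B": 30, "C": 10},
    "B": {"A": 30, "B": 0, "C": 20},
    "C": {"A": 10, "B": 20, "C": 0},
}
toy_mutability = {"A": 100, "B": 50, "C": 20}
toy_freqs = {"A": 0.5, "B": 0.3, "C": 0.2}

toy_pam1 = mutation_probability_matrix(toy_counts, toy_mutability, toy_freqs)
for row in toy_pam1:
    print(row, [round(toy_pam1[row][col], 5) for col in toy_pam1])
```

Each column sums to 1, because every original residue either stays the same or is replaced by something.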
Page 50
52Dayhoffs PAM1 mutation probability matrix
Original amino acid
Page 55
53 Dayhoff's PAM1 mutation probability matrix
Each element of the matrix shows the probability that an original amino acid (top) will be replaced by another amino acid (side).
54Substitution Matrix
A substitution matrix contains values
proportional to the probability that amino acid
i mutates into amino acid j for all pairs of
amino acids. Substitution matrices are
constructed by assembling a large and diverse
sample of verified pairwise alignments (or
multiple sequence alignments) of amino acids.
Substitution matrices should reflect the true
probabilities of mutations occurring through a
period of evolution. The two major types of
substitution matrices are PAM and BLOSUM.
55 PAM matrices: point-accepted mutations
PAM matrices are based on global alignments of closely related proteins. The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. At an evolutionary interval of PAM1, an average of one change has occurred over a length of 100 amino acids. Other PAM matrices are extrapolated from PAM1. For PAM250, an average of 250 changes has occurred for two proteins over a length of 100 amino acids (a single position may change more than once). All the PAM data come from closely related proteins (>85% amino acid identity).
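Extrapolating from PAM1 to a larger distance is usually described as multiplying the PAM1 mutation probability matrix by itself, so PAM250 corresponds to the 250th power. The sketch below shows this with an invented two-letter matrix rather than Dayhoff's 20 x 20 data; `TOY_PAM1` and the helper names are assumptions for illustration. Columns are original residues and rows are replacements.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_power(matrix, exponent):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(exponent):
        result = mat_mul(result, matrix)
    return result


# Invented two-letter matrix; each column sums to 1. With equilibrium
# frequencies of 2/3 and 1/3, the expected change per step is 1%.
TOY_PAM1 = [
    [0.9925, 0.0150],
    [0.0075, 0.9850],
]

toy_pam250 = mat_power(TOY_PAM1, 250)
toy_pam2000 = mat_power(TOY_PAM1, 2000)
print([round(x, 4) for x in toy_pam250[0]])
print([round(x, 4) for x in toy_pam2000[0]])
```

At very large distances every entry in a row approaches the frequency of that row's residue, regardless of the original residue; in this toy case the first row approaches 2/3.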
56Dayhoffs PAM1 mutation probability matrix
Page 55
57Dayhoffs PAM0 mutation probability matrix the
rules for extremely slowly evolving proteins
Top original amino acid Side replacement amino
acid
Page 56
58 Dayhoff's PAM2000 mutation probability matrix: the rules for very distantly related proteins
[Matrix figure: the visible row, for replacement amino acid G, shows 8.9 in every column]
Top: original amino acid. Side: replacement amino acid.
Page 56
59PAM250 mutation probability matrix
Top original amino acid Side replacement amino
acid
Page 57
60PAM250 log odds scoring matrix
Page 58
61 Why do we go from a mutation probability matrix to a log odds matrix?
- We want a scoring matrix so that when we do a pairwise alignment (or a BLAST search) we know what score to assign to two aligned amino acid residues.
- Logarithms are easier to use for a scoring system. They allow us to sum the scores of aligned residues (rather than having to multiply them).
Page 57
62 How do we go from a mutation probability matrix to a log odds matrix?
- The cells in a log odds matrix consist of an "odds ratio": the probability that an alignment is authentic divided by the probability that the alignment was random.
- The score S for an alignment of residues a,b is given by: S(a,b) = 10 log10 (Mab / pb)
- As an example, for tryptophan aligned with tryptophan: S(trp,trp) = 10 log10 (0.55 / 0.010) = 17.4
Page 57
63 What do the numbers mean in a log odds matrix?
S(trp,trp) = 10 log10 (0.55 / 0.010) = 17.4
A score of +17 for tryptophan means that this alignment is 50 times more likely than a chance alignment of two Trp residues.
S(a,b) = 17
Probability of replacement (Mab / pb) = x
Then 17 = 10 log10 x
1.7 = log10 x
10^1.7 = x = 50
Page 58
64 What do the numbers mean in a log odds matrix?
A score of +2 indicates that the amino acid replacement occurs 1.6 times as frequently as expected by chance. A score of 0 is neutral. A score of -10 indicates that the correspondence of two amino acids in an alignment that accurately represents homology (evolutionary descent) is one tenth as frequent as the chance alignment of these amino acids.
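The conversion between odds ratios and log odds scores can be checked numerically. This sketch simply applies S = 10 log10(Mab / pb) and its inverse; the function names are made up for this example, and the inputs 0.55 and 0.010 are the tryptophan values quoted on the slides.

```python
import math


def log_odds_score(m_ab, p_b):
    """Score in units of 10 * log10 of the odds ratio Mab / pb."""
    return 10 * math.log10(m_ab / p_b)


def odds_from_score(score):
    """Invert the score: how many times more likely than chance."""
    return 10 ** (score / 10)


print(round(log_odds_score(0.55, 0.010), 1))  # tryptophan example
for score in (17, 2, 0, -10):
    print(score, round(odds_from_score(score), 2))
```

Because scores are logarithms, the score of a whole alignment is the sum of the per-position scores, which corresponds to multiplying the underlying odds ratios.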
Page 58
65PAM250 log odds scoring matrix
Page 58
66PAM10 log odds scoring matrix
Page 59
67More conserved
Less conserved
Rat versus mouse RBP
Rat versus bacterial lipocalin
68 Comparing two proteins with a PAM1 matrix gives completely different results than PAM250!
Consider two distantly related proteins. A PAM40 matrix is not forgiving of mismatches, and penalizes them severely. Using this matrix you can find almost no match.

hsrbp,  136 CRLLNLDGTC
btlact,   3 CLLLALALTC

A PAM250 matrix is very tolerant of mismatches.
24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7%

rbp4   26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV
btlact 21 QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN

rbp4   86 --CADMVGTFTDTEDPAKFKM
btlact 80 GECAQKKIIAEKTKIPAVFKI
Page 60
69 BLOSUM Matrices
BLOSUM matrices are based on local alignments. BLOSUM stands for blocks substitution matrix. BLOSUM62 is a matrix calculated from comparisons of sequences sharing no more than 62% identity (sequences more similar than that are collapsed into a single group before counting).
Page 60
70 BLOSUM Matrices
[Figure: BLOSUM62. Axis: percent amino acid identity, from 100 to 30, with 62 marked. Label: collapse]
71 BLOSUM Matrices
[Figure with three panels: BLOSUM62, BLOSUM30, BLOSUM80. Axis in each panel: percent amino acid identity, from 100 to 30, with 62 marked. Label in each panel: collapse]
72 BLOSUM Matrices
All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins. The BLOCKS database contains thousands of groups of multiple sequence alignments. BLOSUM62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.
Page 60
73Blosum62 scoring matrix
Page 61
74Blosum62 scoring matrix
Page 61
75Rat versus mouse RBP
Rat versus bacterial lipocalin
Page 61
76 PAM matrices: point-accepted mutations
PAM matrices are based on global alignments of closely related proteins. The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. At an evolutionary interval of PAM1, an average of one change has occurred over a length of 100 amino acids. Other PAM matrices are extrapolated from PAM1. For PAM250, an average of 250 changes has occurred for two proteins over a length of 100 amino acids (a single position may change more than once).
77 Two randomly diverging protein sequences change in a negatively exponential fashion
[Plot: percent identity (y axis) versus evolutionary distance in PAMs (x axis); the "twilight zone" is marked]
Page 62
78
At PAM1, two proteins are 99% identical
At PAM10.7, there are 10 differences per 100 residues
At PAM80, there are 50 differences per 100 residues
At PAM250, there are 80 differences per 100 residues
[Plot: percent identity (y axis) versus differences per 100 residues (x axis); the "twilight zone" is marked]
Page 62
79PAM matrices reflect different degrees of
divergence
PAM250
80 PAM: accepted point mutation
- Two proteins with 50% identity may have 80 changes per 100 residues. (Why? Because any residue can be subject to back mutations.)
- Proteins with 20% to 25% identity are in the "twilight zone" and may be statistically significantly related.
- PAM or "accepted point mutation" refers to the "hits" or matches between two sequences (Dayhoff & Eck, 1968)
Page 62
81 Ancestral sequence: ACCCTAC

Sequence 1 lineage   Type of substitution          Sequence 2 lineage
A                    no change                     A
C                    single substitution           C -> A
C                    multiple substitutions        C -> A -> T
C -> G               coincidental substitutions    C -> A
T -> A               parallel substitutions        T -> A
A -> C -> T          convergent substitutions      A -> T
C                    back substitution             C -> T -> C

Sequence 1: ACCGATC
Sequence 2: AATAATC
Li (1997) p.70
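The point of this slide can be made concrete by counting. The sketch below encodes the substitution history at each of the seven sites as listed on the slide (ancestor first, present-day residue last) and compares the number of substitutions that actually occurred with the number of differences visible between the two present-day sequences. The variable names are hypothetical.

```python
# One entry per site: (history in the sequence 1 lineage, history in the
# sequence 2 lineage). Each history runs from the ancestral residue to the
# present-day residue.
SITE_HISTORIES = [
    (["A"], ["A"]),                 # no change
    (["C"], ["C", "A"]),            # single substitution
    (["C"], ["C", "A", "T"]),       # multiple substitutions
    (["C", "G"], ["C", "A"]),       # coincidental substitutions
    (["T", "A"], ["T", "A"]),       # parallel substitutions
    (["A", "C", "T"], ["A", "T"]),  # convergent substitutions
    (["C"], ["C", "T", "C"]),       # back substitution
]

sequence_1 = "".join(h1[-1] for h1, _ in SITE_HISTORIES)
sequence_2 = "".join(h2[-1] for _, h2 in SITE_HISTORIES)

# Every arrow in a history is one substitution event.
actual_substitutions = sum(
    (len(h1) - 1) + (len(h2) - 1) for h1, h2 in SITE_HISTORIES
)
observed_differences = sum(a != b for a, b in zip(sequence_1, sequence_2))

print(sequence_1, sequence_2)
print("substitutions that occurred:", actual_substitutions)
print("differences observed:", observed_differences)
```

Only the single, multiple and coincidental cases leave a visible difference; parallel, convergent and back substitutions are hidden, which is why observed differences underestimate evolutionary distance.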
82 Percent identity between two proteins: what percent is significant?
100%   80%   65%   30%   23%   19%
We will see in the BLAST lecture that it is appropriate to describe significance in terms of probability (or expect) values. As a rule of thumb, two proteins sharing >30% identity over a substantial region are usually homologous.