Pairwise sequence - PowerPoint PPT Presentation

Title:

Pairwise sequence

Description:

Pairwise sequence Alignment Homology, Score Matrix – PowerPoint PPT presentation

Slides: 83
Provided by: Jonatha581
1
Pairwise Sequence Alignment: Homology, Score Matrix
2
Outline: pairwise alignment
  • Overview and examples
  • Definitions: homologs, paralogs, orthologs
  • Assigning scores to aligned amino acids
  • Dayhoff's PAM matrices
  • Alignment algorithms: Needleman-Wunsch, Smith-Waterman
  • Statistical significance of pairwise alignments

3
Pairwise alignments in the 1950s
β-corticotropin (sheep)  ala gly glu asp asp glu
Corticotropin A (pig)    asp gly ala glu asp glu
Oxytocin     CYIQNCPLG
Vasopressin  CYFQNCPRG
4
Pairwise sequence alignment is the most
fundamental operation of bioinformatics
  • It is used to decide if two proteins (or genes) are related structurally or functionally
  • It is used to identify domains or motifs that are shared between proteins
  • It is the basis of BLAST searching
  • It is used in the analysis of genomes

6
Pairwise alignment: protein sequences can be more
informative than DNA
  • protein is more informative (20 vs 4 characters)
  • many amino acids share related biophysical properties
  • codons are degenerate: changes in the third position often do not alter the amino acid that is specified
  • DNA sequences can be translated into protein, and then used in pairwise alignments

7
Page 54
8
Pairwise alignment: protein sequences can be more
informative than DNA
DNA can be translated into six potential proteins
5' CAT CAA
5' ATC AAC
5' TCA ACT
5' CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3'
3' GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5'
5' GTG GGT
5' TGG GTA
5' GGG TAG
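The six reading frames on this slide can be reproduced with a short script. This is a minimal sketch rather than part of the original slides: the function names are illustrative, and it assumes the standard genetic code and an uppercase sequence containing only A, C, G, and T.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Translate three frames on each strand, keyed '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

top_strand = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for name, protein in six_frames(top_strand).items():
    print(name, protein)
```

The first two residues of each frame correspond to the codon pairs printed on the slide (CAT CAA, ATC AAC, TCA ACT on the top strand; GTG GGT, TGG GTA, GGG TAG on the bottom strand).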
9
Pairwise alignment: protein sequences can be more
informative than DNA
  • Many times, DNA alignments are appropriate
  • --to confirm the identity of a cDNA
  • --to study noncoding regions of DNA
  • --to study DNA polymorphisms
  • --example: Neanderthal vs modern human DNA

Query 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
Sbjct 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247
10
β-lactoglobulin (P02754)
retinol-binding protein 4 (NP_006735)
Page 42
11
Outline: pairwise alignment
  • Overview and examples
  • Definitions: homologs, paralogs, orthologs
  • Assigning scores to aligned amino acids
  • Dayhoff's PAM matrices
  • Alignment algorithms: Needleman-Wunsch, Smith-Waterman
  • Statistical significance of pairwise alignments

12
Definitions
Pairwise alignment: The process of lining up two
sequences to achieve maximal levels of identity
(and conservation, in the case of amino acid
sequences) for the purpose of assessing the
degree of similarity and the possibility of
homology.
wheat --DPNKPKRAMTSFVFFMSEFRSEFKQKHSKLKSIVEMVKAAGER
????? KKDSNAPKRAMTSFMFFSSDFRS----KHSDL-SIVEMSKAAGAA
13
Definitions
Homology: Similarity attributed to descent from a
common ancestor.
Page 42
14
Definitions
Homology: Similarity attributed to descent from a
common ancestor.
Identity: The extent to which two (nucleotide or
amino acid) sequences are invariant.
RBP        26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVA 59
                K         GTW  MA        L     A
glycodelin 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKA 55
Page 44
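Identity can be counted directly from an alignment. The sketch below is illustrative rather than taken from the slides: it counts identical columns in the RBP/glycodelin alignment above and reports them as a fraction of all alignment columns. Counting the gapped column in the denominator is an assumption, since other conventions (for example, dividing by the shorter sequence length) are also used.

```python
def count_identities(aligned_a, aligned_b, gap="-"):
    """Count columns where both rows hold the same residue (gaps never match)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(a == b and a != gap for a, b in zip(aligned_a, aligned_b))

def percent_identity(aligned_a, aligned_b):
    """Identities as a percentage of alignment columns (one possible convention)."""
    return 100.0 * count_identities(aligned_a, aligned_b) / len(aligned_a)

rbp = "RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVA"
glycodelin = "QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKA"

print(count_identities(rbp, glycodelin), "identical columns")
print(round(percent_identity(rbp, glycodelin), 1), "percent identity")
```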
15
Definitions: two types of homology
Orthologs: Homologous sequences in different
species that arose from a common ancestral gene
during speciation; they may or may not be responsible
for a similar function.
Paralogs: Homologous sequences within a single
species that arose by gene duplication.
Page 43
16
Orthologs: members of a gene (protein) family in
various organisms (arising during speciation). This
tree shows RBP orthologs.
[Figure: tree of RBP orthologs. Labels: common carp, zebrafish, rainbow trout, teleost, African clawed frog, chicken, human, mouse, rat, horse, rabbit, cow, pig. Scale bar: 10 changes]
Page 43
17
Paralogs: members of a gene (protein) family
within a species (arising during duplication).
[Figure: tree of paralogs. Labels: apolipoprotein D, retinol-binding protein 4, complement component 8, alpha-1 microglobulin/bikunin, prostaglandin D2 synthase, progestagen-associated endometrial protein, neutrophil gelatinase-associated lipocalin, odorant-binding protein 2A, lipocalin 1. Scale bar: 10 changes]
Page 44
19
Pairwise alignment of retinol-binding protein 4
and β-lactoglobulin
RBP             1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50
lactoglobulin   1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44

RBP            51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97
lactoglobulin  45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93

RBP            98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136
lactoglobulin  94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135

RBP           137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185
lactoglobulin 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178
Page 46
20
Definitions
Similarity: The extent to which nucleotide or
protein sequences are related. It is based upon
identity plus conservation.
Identity: The extent to which two sequences are
invariant.
Conservation: Changes at a specific position of an
amino acid (or, less commonly, DNA) sequence that
preserve the physico-chemical properties of the
original residue.
21
Pairwise alignment of retinol-binding protein
and β-lactoglobulin
[Alignment as on slide 19: RBP residues 1-185 against β-lactoglobulin residues 1-178]
Identity (bar)
Page 46
22
Pairwise alignment of retinol-binding protein
and β-lactoglobulin
[Alignment as on slide 19: RBP residues 1-185 against β-lactoglobulin residues 1-178]
Very similar (two dots)
Somewhat similar (one dot)
Page 46
23
Definitions
Pairwise alignment: The process of lining up two
sequences to achieve maximal levels of identity
(and conservation, in the case of amino acid
sequences) for the purpose of assessing the
degree of similarity and the possibility of
homology.
Page 47
24
Pairwise alignment of retinol-binding protein
and β-lactoglobulin
[Alignment as on slide 19: RBP residues 1-185 against β-lactoglobulin residues 1-178]
Internal gap
Terminal gap
Page 46
25
Gaps
Positions at which a letter is paired with a
null are called gaps. Gap scores are
typically negative. Since a single mutational
event may cause the insertion or deletion of
more than one residue, the presence of a gap
is ascribed more significance than the length
of the gap.
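One common way to express the idea that the presence of a gap matters more than its length is an affine gap penalty: a large cost to open a gap and a smaller cost for each additional position. The sketch below is illustrative only; the function name and the penalty values (-10 to open, -1 to extend) are assumptions, not values from the slides.

```python
def affine_gap_penalty(length, gap_open=-10, gap_extend=-1):
    """Score for one gap of the given length: open once, then extend."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)

# One gap of four positions versus four separate single-position gaps.
one_long_gap = affine_gap_penalty(4)
four_short_gaps = 4 * affine_gap_penalty(1)
print(one_long_gap, four_short_gaps)
```

With these assumed values a single four-residue gap is penalized far less than four separate one-residue gaps, which matches the reasoning that one mutational event can insert or delete several residues.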
26
Pairwise alignment of retinol-binding protein
and β-lactoglobulin
[Alignment as on slide 19: RBP residues 1-185 against β-lactoglobulin residues 1-178]
27
Pairwise alignment of retinol-binding protein
from human (top) and rainbow trout (O. mykiss)
human   1 .MKWVWALLLLA.AWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP  48
trout   1 MLRICVALCALATCWA...QDCQVSNIQVMQNFDRSRYTGRWYAVAKKDP  47

human  49 EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED  98
trout  48 VGLFLLDNVVAQFSVDESGKMTATAHGRVIILNNWEMCANMFGTFEDTPD  97

human  99 PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
trout  98 PAKFKMRYWGAASYLQTGNDDHWVIDTDYDNYAIHYSCREVDLDGTCLDG 147

human 149 YSFVFSRDPNGLPPEAQKIVRQRQEELCLARQYRLIVHNGYCDGRSERNLL 199
trout 148 YSFIFSRHPTGLRPEDQKIVTDKKKEICFLGKYRRVGHTGFCESS...... 192
28
Pairwise sequence alignment allows us to look
back billions of years (BYA: billion years ago)
[Figure: timeline from 4 to 0 BYA. Labels: origin of life, origin of eukaryotes, earliest fossils, eukaryote/archaea, fungi/animal, plant/animal, insects]
Page 48
29
Multiple sequence alignment of glyceraldehyde
3-phosphate dehydrogenases
fly       GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human     GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant     GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast     GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon  GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly       KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human     KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant     KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast     KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon  KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly       GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human     GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant     GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast     GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon  GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA
Page 49
30
Multiple sequence alignment of human lipocalin
paralogs
lipocalin 1                 EIQDVSGTWYAMTVDREFPEMNLESVTPMTLTTL.GGNLEAKVTM
odorant-binding protein 2a  LSFTLEEEDITGTWYAMVVDKDFPEDRRRKVSPVKVTALGGGNLEATFTF
progestagen-assoc. endo.    TKQDLELPKLAGTWHSMAMATNNISLMATLKAPLRVHITSEDNLEIVLHR
apolipoprotein D            VQENFDVNKYLGRWYEIEKIPTTFENGRCIQANYSLMENGNQELRADGTV
retinol-binding protein     VKENFDKARFSGTWYAMAKDPEGLFLQDNIVAEFSVDETGNWDVCADGTF
neutrophil gelatinase-ass.  LQQNFQDNQFQGKWYVVGLAGNAI.LREDKDPQKMYATIDKSYNVTSVLF
prostaglandin D2 synthase   VQPNFQQDKFLGRWFSAGLASNSSWLREKKAALSMCKSVDGGLNLTSTFL
alpha-1-microglobulin       VQENFNISRIYGKWYNLAIGSTCPWMDRMTVSTLVLGEGEAEISMTSTRW
complement component 8      PKANFDAQQFAGTWLLVAVGSACRFLQRAEATTLHVAPQGSTFRKLD...
Page 49
31
Outline: pairwise alignment
  • Overview and examples
  • Definitions: homologs, paralogs, orthologs
  • Assigning scores to aligned amino acids
  • Dayhoff's PAM matrices
  • Alignment algorithms: Needleman-Wunsch, Smith-Waterman
  • Statistical significance of pairwise alignments

32
General approach to pairwise alignment
  • Choose two sequences
  • Select an algorithm that generates a score
  • Allow gaps (insertions, deletions)
  • Score reflects degree of similarity
  • Alignments can be global or local
  • Estimate probability that the alignment occurred by chance
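The first steps of this approach can be sketched with a small global (Needleman-Wunsch style) scoring routine. This is a minimal illustration, not the implementation used by any particular tool: it assumes a simple match/mismatch score and a linear gap penalty (all three values are arbitrary choices) instead of a substitution matrix, and it returns only the optimal score, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score by dynamic programming, one row at a time."""
    # Row 0: aligning the empty prefix of a against prefixes of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a aligned against nothing
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]

oxytocin = "CYIQNCPLG"
vasopressin = "CYFQNCPRG"
print(global_alignment_score(oxytocin, vasopressin))
```

For the oxytocin and vasopressin peptides shown earlier, the best alignment under these assumed scores is the ungapped one: seven matches and two mismatches.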

33
Calculation of an alignment score
Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Alignment_Scores2.html
34
lys found at 58% of arg sites
Emile Zuckerkandl and Linus Pauling (1965)
considered substitution frequencies in 18
globins (myoglobins and hemoglobins from human to
lamprey).
Black: identity
Gray: very conservative substitutions (>40% occurrence)
White: fairly conservative substitutions (>21% occurrence)
Red: no substitutions observed
Page 80
35
Page 80
36
Dayhoff's 34 protein superfamilies
Protein                       PAMs per 100 million years
Ig kappa chain                37
Kappa casein                  33
luteinizing hormone b         30
lactalbumin                   27
complement component 3        27
epidermal growth factor       26
proopiomelanocortin           21
pancreatic ribonuclease       21
haptoglobin alpha             20
serum albumin                 19
phospholipase A2, group IB    19
prolactin                     17
carbonic anhydrase C          16
Hemoglobin a                  12
Hemoglobin b                  12
Page 50
37
Dayhoff's 34 protein superfamilies
[Table as on slide 36: PAMs per 100 million years, from Ig kappa chain (37) to Hemoglobin b (12)]
human (NP_005203) versus mouse (NP_031812)
38
Dayhoff's 34 protein superfamilies
Protein                         PAMs per 100 million years
apolipoprotein A-II             10
lysozyme                        9.8
gastrin                         9.8
myoglobin                       8.9
nerve growth factor             8.5
myelin basic protein            7.4
thyroid stimulating hormone b   7.4
parathyroid hormone             7.3
parvalbumin                     7.0
trypsin                         5.9
insulin                         4.4
calcitonin                      4.3
arginine vasopressin            3.6
adenylate kinase 1              3.2
Page 50
39
Dayhoff's 34 protein superfamilies
Protein                             PAMs per 100 million years
triosephosphate isomerase 1         2.8
vasoactive intestinal peptide       2.6
glyceraldehyde phosph. dehydrogenase 2.2
cytochrome c                        2.2
collagen                            1.7
troponin C, skeletal muscle         1.5
alpha crystallin B chain            1.5
glucagon                            1.2
glutamate dehydrogenase             0.9
histone H2B, member Q               0.9
ubiquitin                           0
Page 50
40
Pairwise alignment of human (NP_005203) versus
mouse (NP_031812) ubiquitin
41
Accepted point mutations (PAMs): inferring amino
acid substitutions between a protein and its
ancestor
Dayhoff et al. compared protein sequences with
inferred ancestors, rather than with each other
directly. Consider four globins (myoglobin,
alpha, beta, delta globin). In a phylogenetic
tree there are four existing sequences plus two
inferred ancestral sequences (5, 6). (We will
learn how to make trees later.)
[Figure: phylogenetic tree with existing sequences 1 to 4 and inferred ancestral nodes 5 and 6]
42
Accepted point mutations (PAMs): inferring amino
acid substitutions between a protein and its
ancestor
  • The tree is made from a multiple sequence
    alignment of the four globins. Consider a
    comparison of alpha globin to myoglobin, and to
    their common ancestor (node 6).
  • A direct comparison suggests alanine changed to
    glycine. But an ancestral glutamate changed to
    ala or gly!
  • Three additional examples are boxed.

[Figure: phylogenetic tree with existing sequences 1 to 4 and inferred ancestral nodes 5 and 6]
beta      MVHLTPEEKSAVTALWGKV
delta     MVHLTPEEKTAVNALWGKV
alpha     MV.LSPADKTNVKAAWGKV
myoglobin .MGLSDGEWQLVLNVWGKV
5         MVHLSPEEKTAVNALWGKV
6         MVHLTPEEKTAVNALWGKV
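The difference between a direct comparison and a comparison with the inferred ancestor can be checked on the sequences above. This is a minimal sketch with illustrative function names; it treats "." as a gap and skips those columns.

```python
def substitutions(source, target, gap="."):
    """List (position, source residue, target residue) where the two rows differ."""
    return [(i + 1, s, t)
            for i, (s, t) in enumerate(zip(source, target))
            if s != t and gap not in (s, t)]

alpha = "MV.LSPADKTNVKAAWGKV"
myoglobin = ".MGLSDGEWQLVLNVWGKV"
node6 = "MVHLTPEEKTAVNALWGKV"  # inferred common ancestor

print("alpha vs myoglobin:", substitutions(alpha, myoglobin))
print("node 6 -> alpha:   ", substitutions(node6, alpha))
print("node 6 -> myoglobin:", substitutions(node6, myoglobin))
```

At column 7 the direct comparison reports A versus G, while the comparisons with node 6 report E to A and E to G, which is the case described in the second bullet.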
43
Dayhoff's numbers of accepted point
mutations: what amino acid substitutions occur
in proteins?
Dayhoff (1978) p.346.
Page 52
44
Multiple sequence alignment of glyceraldehyde
3-phosphate dehydrogenases
[Alignment as on slide 29: fly, human, plant, bacterium, yeast, and archaeon sequences]
45
The relative mutability of amino acids
Dayhoff et al. described the relative
mutability of each amino acid as the probability
that amino acid will change over a small
evolutionary time period. The total number of
changes is counted (on all branches of all
protein trees considered), and the total number
of occurrences of each amino acid is also
considered. A ratio is determined.
Relative mutability ∝ changes / occurrences
Example:
sequence 1  ala his val ala
sequence 2  ala arg ser val
For ala, relative mutability = 1 / 3 = 0.33
For val, relative mutability = 2 / 2 = 1.0
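The two-sequence example can be reproduced in a few lines. This is only a sketch of the counting in the example (changes divided by occurrences for one pair of aligned sequences); Dayhoff's actual values were accumulated over all branches of many protein trees and then normalized.

```python
from collections import Counter

def relative_mutability(seq1, seq2):
    """changes / occurrences per amino acid for one pair of aligned sequences."""
    occurrences = Counter(seq1) + Counter(seq2)
    changes = Counter()
    for a, b in zip(seq1, seq2):
        if a != b:
            # A mismatch counts as one change for each of the two residues involved.
            changes[a] += 1
            changes[b] += 1
    return {aa: changes[aa] / occurrences[aa] for aa in occurrences}

sequence1 = ["ala", "his", "val", "ala"]
sequence2 = ["ala", "arg", "ser", "val"]
mutability = relative_mutability(sequence1, sequence2)
print(round(mutability["ala"], 2), mutability["val"])
```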
Page 53
46
The relative mutability of amino acids
Asn 134    His 66
Ser 120    Arg 65
Asp 106    Lys 56
Glu 102    Pro 56
Ala 100    Gly 49
Thr  97    Tyr 41
Ile  96    Phe 41
Met  94    Leu 40
Gln  93    Cys 20
Val  74    Trp 18
Page 53
47
The relative mutability of amino acids
Asn 134    His 66
Ser 120    Arg 65
Asp 106    Lys 56
Glu 102    Pro 56
Ala 100    Gly 49
Thr  97    Tyr 41
Ile  96    Phe 41
Met  94    Leu 40
Gln  93    Cys 20
Val  74    Trp 18
Note that alanine is normalized to a value of
100. Trp and Cys are least mutable. Asn and Ser
are most mutable.
Page 53
48
Normalized frequencies of amino acids
  • Gly 8.9%    Arg 4.1%
  • Ala 8.7%    Asn 4.0%
  • Leu 8.5%    Phe 4.0%
  • Lys 8.1%    Gln 3.8%
  • Ser 7.0%    Ile 3.7%
  • Val 6.5%    His 3.4%
  • Thr 5.8%    Cys 3.3%
  • Pro 5.1%    Tyr 3.0%
  • Glu 5.0%    Met 1.5%
  • Asp 4.7%    Trp 1.0%
  • blue: 6 codons; red: 1 codon
  • These frequencies fi sum to 1 (100%)

Page 53
49
Page 54
50
Dayhoff's numbers of accepted point
mutations: what amino acid substitutions occur
in proteins?
Page 52
51
Dayhoff's mutation probability matrix for the
evolutionary distance of 1 PAM
  • We have considered three kinds of information:
  • a table of number of accepted point mutations
    (PAMs)
  • relative mutabilities of the amino acids
  • normalized frequencies of the amino acids in PAM
    data
  • This information can be combined into a mutation
    probability matrix in which each element Mij
    gives the probability that the amino acid in
    column j will be replaced by the amino acid in
    row i after a given evolutionary interval (e.g. 1
    PAM).
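How the three kinds of information combine can be sketched on a toy three-letter alphabet. The construction follows the form usually attributed to Dayhoff et al. (1978): off-diagonal Mij = λ mj Aij / Σi Aij and diagonal Mjj = 1 - λ mj, with λ chosen so that 99% of residues are unchanged at 1 PAM. The counts and frequencies below are invented for illustration (the mutabilities are the Ala, Gly, and Trp values from the table, divided by 100), so the resulting numbers are not Dayhoff's.

```python
# Hypothetical toy data for a three-letter alphabet (not Dayhoff's real counts).
LETTERS = ["A", "G", "W"]
accepted = [[0, 30, 1],    # accepted[i][j]: accepted point mutations between i and j
            [30, 0, 2],
            [1, 2, 0]]
mutability = [1.00, 0.49, 0.18]  # relative mutability m_j
frequency = [0.5, 0.4, 0.1]      # normalized frequency f_j (sums to 1)

def mutation_probability_matrix(accepted, mutability, frequency, unchanged=0.99):
    """Column j holds the probabilities that residue j is replaced by residue i."""
    n = len(frequency)
    # Scale factor so that the expected unchanged fraction equals `unchanged`.
    lam = (1.0 - unchanged) / sum(f * m for f, m in zip(frequency, mutability))
    matrix = [[0.0] * n for _ in range(n)]
    for j in range(n):
        column_total = sum(accepted[i][j] for i in range(n))
        for i in range(n):
            if i == j:
                matrix[i][j] = 1.0 - lam * mutability[j]
            else:
                matrix[i][j] = lam * mutability[j] * accepted[i][j] / column_total
    return matrix

pam1_toy = mutation_probability_matrix(accepted, mutability, frequency)
for letter, row in zip(LETTERS, pam1_toy):
    print(letter, [round(value, 4) for value in row])
```

Each column sums to 1, because the original residue either stays the same or is replaced by exactly one other residue.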

Page 50
52
Dayhoff's PAM1 mutation probability matrix
Original amino acid
Page 55
53
Dayhoff's PAM1 mutation probability matrix
Each element of the matrix shows the probability
that an original amino acid (top) will be
replaced by another amino acid (side)
54
Substitution Matrix
A substitution matrix contains values
proportional to the probability that amino acid
i mutates into amino acid j for all pairs of
amino acids. Substitution matrices are
constructed by assembling a large and diverse
sample of verified pairwise alignments (or
multiple sequence alignments) of amino acids.
Substitution matrices should reflect the true
probabilities of mutations occurring through a
period of evolution. The two major types of
substitution matrices are PAM and BLOSUM.
55
PAM matrices: Point-accepted mutations
PAM matrices are based on global alignments of
closely related proteins. The PAM1 is the
matrix calculated from comparisons of sequences
with no more than 1% divergence. At an
evolutionary interval of PAM1, one change has
occurred over a length of 100 amino acids. Other
PAM matrices are extrapolated from PAM1. For
PAM250, 250 changes have occurred for two
proteins over a length of 100 amino acids. All
the PAM data come from closely related
proteins (>85% amino acid identity).
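Extrapolation from PAM1 is usually described as repeated matrix multiplication: the PAM-n mutation probability matrix is PAM1 raised to the n-th power. The sketch below illustrates this with an invented two-letter matrix (columns are the original residue, rows the replacement), so the numbers are illustrative only.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(matrix)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    base = [row[:] for row in matrix]
    while power:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result

# Hypothetical two-letter "PAM1": each column sums to 1.
pam1_toy = [[0.99, 0.02],
            [0.01, 0.98]]
pam250_toy = mat_pow(pam1_toy, 250)
pam2000_toy = mat_pow(pam1_toy, 2000)
print([round(v, 4) for v in pam250_toy[0]])
print([round(v, 4) for v in pam2000_toy[0]])
```

At very large distances every column converges to the same values (here 2/3 and 1/3), that is, the replacement probabilities no longer depend on the original residue and simply reflect background frequencies.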
56
Dayhoff's PAM1 mutation probability matrix
Page 55
57
Dayhoff's PAM0 mutation probability matrix: the
rules for extremely slowly evolving proteins
Top: original amino acid. Side: replacement amino
acid
Page 56
58
Dayhoff's PAM2000 mutation probability
matrix: the rules for very distantly related
proteins
[Figure: PAM2000 matrix; the entries shown for G (glycine) all read 8.9]
Top: original amino acid. Side: replacement amino
acid
Page 56
59
PAM250 mutation probability matrix
Top original amino acid Side replacement amino
acid
Page 57
60
PAM250 log odds scoring matrix
Page 58
61
Why do we go from a mutation probability matrix
to a log odds matrix?
  • We want a scoring matrix so that when we do a
    pairwise alignment (or a BLAST search) we know
    what score to assign to two aligned amino acid
    residues.
  • Logarithms are easier to use for a scoring
    system. They allow us to sum the scores of
    aligned residues (rather than having to multiply
    them).

Page 57
62
How do we go from a mutation probability matrix
to a log odds matrix?
  • The cells in a log odds matrix are derived from
    an odds ratio: the probability that an alignment
    is authentic divided by the probability that the
    alignment occurred at random
  • The score S for an alignment of residues a,b is
    given by
    S(a,b) = 10 log10 (Mab/pb)
  • As an example, for tryptophan,
    S(a,tryptophan) = 10 log10 (0.55/0.010) = 17.4
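The tryptophan example can be checked numerically. The function name below is illustrative; the inputs 0.55 (the mutation probability for Trp staying Trp) and 0.010 (the normalized frequency of Trp) are the values quoted on the slide.

```python
import math

def log_odds_score(m_ab, p_b):
    """S(a,b) = 10 * log10(M_ab / p_b), as defined on the slide."""
    return 10 * math.log10(m_ab / p_b)

trp_score = log_odds_score(0.55, 0.010)
print(round(trp_score, 1))
```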

Page 57
63
What do the numbers mean in a log odds matrix?
S(a,tryptophan) = 10 log10 (0.55/0.010) = 17.4
A score of 17 for tryptophan means that this
alignment is 50 times more likely than a chance
alignment of two Trp residues.
Let S(a,b) = 17 and let x be the odds ratio
(Mab/pb). Then
17 = 10 log10 x
1.7 = log10 x
x = 10^1.7 ≈ 50
Page 58
64
What do the numbers mean in a log odds matrix?
A score of +2 indicates that the amino acid
replacement occurs 1.6 times as frequently as
expected by chance. A score of 0 is neutral. A
score of -10 indicates that the correspondence of
two amino acids in an alignment that accurately
represents homology (evolutionary descent) is one
tenth as frequent as the chance alignment of
these amino acids.
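A log odds score can be converted back to an odds ratio by inverting S = 10 log10 x, which gives x = 10^(S/10). The helper below is a small illustrative sketch of that inversion.

```python
def odds_ratio(score):
    """Invert S = 10 * log10(x): how many times more likely than chance."""
    return 10 ** (score / 10)

for score in (17, 2, 0, -10):
    print(score, round(odds_ratio(score), 2))
```

Positive scores correspond to ratios above 1 (more frequent than chance), zero to exactly 1, and negative scores to ratios below 1.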
Page 58
65
PAM250 log odds scoring matrix
Page 58
66
PAM10 log odds scoring matrix
Page 59
67
More conserved
Less conserved
Rat versus mouse RBP
Rat versus bacterial lipocalin
68
Comparing two proteins with a PAM1 matrix gives
completely different results than PAM250!
Consider two distantly related proteins. A PAM40
matrix is not forgiving of mismatches, and
penalizes them severely. Using this matrix you
can find almost no match.
hsrbp,  136 CRLLNLDGTC
btlact,   3 CLLLALALTC

A PAM250 matrix is very tolerant of mismatches.
24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7%
rbp4   26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV
btlact 21 QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN

rbp4   86 --CADMVGTFTDTEDPAKFKM
btlact 80 GECAQKKIIAEKTKIPAVFKI

Page 60
69
BLOSUM Matrices
BLOSUM matrices are based on local alignments.
BLOSUM stands for blocks substitution
matrix. BLOSUM62 is a matrix calculated from
comparisons of sequences sharing less than 62%
identity; sequences that are at least 62%
identical are collapsed into a single cluster.
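The collapsing step can be illustrated with a toy clustering routine. This is a sketch under stated assumptions, not the BLOCKS procedure itself: it uses single linkage (a sequence joins a cluster if it reaches the identity threshold with any member), ungapped sequences of equal length, and four invented sequences.

```python
def fraction_identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def collapse(sequences, threshold):
    """Single-linkage clustering: merge every cluster that the new sequence links to."""
    clusters = []
    for seq in sequences:
        linked, unlinked = [], []
        for members in clusters:
            hit = any(fraction_identity(seq, m) >= threshold for m in members)
            (linked if hit else unlinked).append(members)
        merged = [seq] + [m for members in linked for m in members]
        clusters = unlinked + [merged]
    return clusters

# Hypothetical toy block of four aligned sequences.
block = ["AAAAAAAAAA", "AAAAAAAAAC", "AAAAAAACCC", "CCCCCCCCCC"]
print(len(collapse(block, 0.62)), "clusters at 62% identity")
print(len(collapse(block, 0.95)), "clusters at 95% identity")
```

A lower threshold collapses more sequences together, so a matrix such as BLOSUM62 draws its substitution counts from more divergent comparisons than BLOSUM80 does.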
Page 60
70
BLOSUM Matrices
[Figure: percent amino acid identity axis (100, 62, 30); for BLOSUM62, the range from 62 to 100 is marked "collapse"]
71
BLOSUM Matrices
[Figure: three percent amino acid identity axes (100, 62, 30) with the collapsed ranges marked, for BLOSUM62, BLOSUM30, and BLOSUM80]
72
BLOSUM Matrices
All BLOSUM matrices are based on observed
alignments; they are not extrapolated from
comparisons of closely related proteins. The
BLOCKS database contains thousands of groups
of multiple sequence alignments. BLOSUM62 is the
default matrix in BLAST 2.0. Though it is
tailored for comparisons of moderately distant
proteins, it performs well in detecting closer
relationships. A search for distant relatives
may be more sensitive with a different matrix.
Page 60
73
Blosum62 scoring matrix
Page 61
74
Blosum62 scoring matrix
Page 61
75
Rat versus mouse RBP
Rat versus bacterial lipocalin
Page 61
76
PAM matrices: Point-accepted mutations
PAM matrices are based on global alignments of
closely related proteins. The PAM1 is the
matrix calculated from comparisons of sequences
with no more than 1% divergence. At an
evolutionary interval of PAM1, one change has
occurred over a length of 100 amino acids. Other
PAM matrices are extrapolated from PAM1. For
PAM250, 250 changes have occurred for two
proteins over a length of 100 amino acids.
77
Two randomly diverging protein sequences
change in a negatively exponential fashion
[Figure: percent identity versus evolutionary distance in PAMs, with the twilight zone marked]
Page 62
78
At PAM1, two proteins are 99% identical.
At PAM10.7, there are 10 differences per 100 residues.
At PAM80, there are 50 differences per 100 residues.
At PAM250, there are 80 differences per 100 residues.
[Figure: percent identity versus differences per 100 residues, with the twilight zone marked]
Page 62
79
PAM matrices reflect different degrees of
divergence
PAM250
80
PAM Accepted point mutation
  • Two proteins with 50% identity may have 80
    changes per 100 residues. (Why? Because any
    residue can be subject to back mutations.)
  • Proteins with 20% to 25% identity are in the
    twilight zone and may be statistically
    significantly related.
  • PAM or accepted point mutation refers to the
    hits or matches between two sequences (Dayhoff &
    Eck, 1968)

Page 62
81
Ancestral sequence: ACCCTAC

Sequence 1   Type of substitution         Sequence 2
A            no change                    A
C            single substitution          C → A
C            multiple substitutions       C → A → T
C → G        coincidental substitutions   C → A
T → A        parallel substitutions       T → A
A → C → T    convergent substitutions     A → T
C            back substitution            C → T → C

Sequence 1: ACCGATC
Sequence 2: AATAATC
Li (1997) p.70
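The gap between the substitutions that actually occurred and the differences that can be observed can be counted from this example. In the sketch below each string records the successive states of one site along one lineage, so the encoding (for example "CAT" for C → A → T) is an illustrative convention, not notation from Li (1997).

```python
# Successive states at each of the seven sites, from ancestor to present.
lineage1 = ["A", "C", "C", "CG", "TA", "ACT", "C"]
lineage2 = ["A", "CA", "CAT", "CA", "TA", "AT", "CTC"]

def present_sequence(lineage):
    """The observed sequence is the last state at each site."""
    return "".join(states[-1] for states in lineage)

def substitution_count(lineage):
    """Each step between successive states is one substitution."""
    return sum(len(states) - 1 for states in lineage)

seq1 = present_sequence(lineage1)
seq2 = present_sequence(lineage2)
actual = substitution_count(lineage1) + substitution_count(lineage2)
observed = sum(a != b for a, b in zip(seq1, seq2))
print(seq1, seq2)
print("actual substitutions:", actual, "observed differences:", observed)
```

Twelve substitutions occurred along the two lineages, but only three sites differ between the present-day sequences, which is why the number of changes per 100 residues can exceed the observed percent difference.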
82
Percent identity between two proteins: what
percent is significant?
[Figure: percent identity scale: 100, 80, 65, 30, 23, 19]
We will see in the BLAST lecture that it is
appropriate to describe significance in terms of
probability (or expect) values. As a rule of
thumb, two proteins sharing >30% identity over a
substantial region are usually homologous.