Title: Where do you go from a protein sequence:
1Where do you go from a protein sequence? What
does the protein do? What structure does it
have? Determining elements that could control
specificity! Functions and activities related to
other proteins
2What do we want to know from the sequence?
- FUNCTION!
- If we want to design inhibitors for the protein,
we want to understand the relationship between
sequence, structure and function
- Sequence alignment and decomposition can help
- So, first things first. One wants to find what
proteins have a sequence similar to yours
3Screening databases for sequence homologues
- Scoring matrices (discussed by WB)
- Local vs. global alignments
- Database searching
- Analyzing data and determining the significance of matches
- Probability and expectation values
Identifying motifs
- PROSITE
- PFAM
- Hidden Markov models
- PCPMer (based on physicochemical property conservation)
Aligning sequences
- Progressive alignments
- CLUSTALW
- Phylogenetic trees
4Introduction to decomposition
Sequence  Structure
Motifs  Molegos
Activity
Functions
Elements can be sequential or independent
5- Section 1
- Finding homologous sequences in databases
- sequence matching programs for database search
- BLAST, PSIBLAST, FASTA
- Decomposing proteins into domains: Pfam, PROSITE,
BLOCKS
- Section 2
- Aligning the sequence with finds from above
- CLUSTAL-W,
- motif assisted hand alignment
- algorithms to identify similar areas in divergent
sequences
- Section 3
- using structure to align: DALI, CE
- Identifying molegos: PCPMer
- Section 4
- Determining function from sequence and structure
- PCPMer (GETAREA)
6You should know the one letter code
And you should understand the chemical properties
of amino acids
7(No Transcript)
8Section 1 Finding homologous sequences in
databases
- scoring matrices and sequence similarity indices:
PAM, BLOSUM, GONNET, Identity
- sequence matching programs for database
searching: FASTA, BLAST, PSIBLAST
9Homology vs. similarity
- Homology: the similarity between sequences that
indicates common ancestry.
- Sequence similarity is observable; homology is a
hypothesis based on observation.
- We want to know whether sequences are truly
homologous because this will enable us to make
conclusions about their probable structure and
function.
- We can talk about overall similarity and areas of
similarity (motifs) in distinguishing probable
homologues
10Amino acid distance measures
One of the most common measures used in computer
algorithms for sequence analysis is some measure
of the distance between two sequences. Distance
can be used with anything that can be measured,
even the strength of an immunological reaction.
The concept of distance allows different types of
measures to be combined, as we will see later for
obtaining vectors that combine a lot of
physicochemical properties.
11Scoring matrices
The rate of substitution is not the same for all
amino acids. Thus amino acid distances use
empirical weighting schemes, called substitution
matrices (PAM, BLOSUM, Gonnet, Identity).
The available matrices are statistical summaries
of changes observed in similar sequences. The
numbers are logarithms of the ratio of the
probability that a pair of residues is aligned
because the sequences are related to the
probability that the pair is aligned by chance.
The matrices differ primarily in how the
sequence alignments used for the calculations are
assembled.
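The log-odds idea above can be sketched in a few lines of Python. This is only an illustration: the function name `log_odds`, the scale factor of 2 (half-bit units) and the frequencies are assumptions chosen for the example, not values taken from any published matrix.

```python
import math

def log_odds(pair_freq, freq_a, freq_b, scale=2.0):
    """Return scale * log2(observed pair frequency / frequency expected by chance)."""
    expected = freq_a * freq_b          # chance of seeing the pair if residues were independent
    return scale * math.log2(pair_freq / expected)

# Hypothetical frequencies, for illustration only.
score_conserved = log_odds(pair_freq=0.02, freq_a=0.05, freq_b=0.05)     # pair seen 8x more often than chance
score_rare = log_odds(pair_freq=0.00125, freq_a=0.05, freq_b=0.05)       # pair seen half as often as chance

print(round(score_conserved), round(score_rare))
```

A positive score means the pair is seen more often in related sequences than chance predicts; a negative score means it is seen less often.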
12Alternatively, one can revert to first principles
to group the amino acids according to physical
chemical principles (PCP)
Multidimensional scaling
13Components E1 to E5: Physical-chemical Properties
The numerical descriptors for each amino acid i
are calculated by for the five
eigenvectors
14Methods to look for related sequences in databases
Global search methods require too much time.
Local programs (FASTA, BLAST and PSIBLAST) are
faster.
FASTA matches one protein sequence to
another over its whole length and looks for
stretches of similarity. It then chooses the
frame that gives the best similarity score and
seeks to extend the blocks in both
directions.
BLAST precuts the database. It
looks for all the words in the query and then
finds all the sequences in the database with a
matching score for one of these words above a
certain cutoff. It then selects sequences with
the most matching words and seeks to connect
these.
PCPMer: our method for finding distant
relatives, which relies on identifying PCP-motifs
in a protein family and then finding
proteins that match some or all of these motifs.
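The word-lookup step described for BLAST can be sketched as below. This is a simplified illustration, not BLAST itself: it counts only exact 3-letter word matches, whereas the real program also scores similar words against a cutoff and then extends the hits. The function names and the toy database are made up for the example.

```python
from collections import defaultdict

def words(seq, k=3):
    """Return the set of all overlapping k-letter words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(database, k=3):
    """Map each word to the names of the database entries that contain it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for w in words(seq, k):
            index[w].add(name)
    return index

def rank_hits(query, index, k=3):
    """Count the words each entry shares with the query, most shared first."""
    counts = defaultdict(int)
    for w in words(query, k):
        for name in index.get(w, ()):
            counts[name] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Toy database, for illustration only.
toy_db = {
    "seqA": "MKVLAAGIVGLK",
    "seqB": "PPGHHWWTRS",
    "seqC": "AAGIVTT",
}
hits = rank_hits("LAAGIVGL", build_index(toy_db))
print(hits)
```

Entries that share no word with the query never appear in the hit list, which is why a word-based search is fast and also why it can miss sequences with diffuse similarity.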
15(No Transcript)
16(No Transcript)
17Watch out for these when running BLAST searches
- Signal and pro-sequences at the N-terminus,
which can obscure the similarity of the active
region of the protein
- Transmembrane regions. For example, mammalian
adenyl cyclase enzymes have two active site
domains, separated by about 400 residues of
transmembrane helix
- Single domains matching in a large protein. For
example, APE endonuclease may match endonuclease
domains in polymerases, but they are not
polymerases!
- Trim your sequences as you get more results
18Total sequence decomposition or profiling methods
PFAM
- alignments of protein domains
- A is generated with expert knowledge, B is automatically generated
- uses Hidden Markov methods to profile proteins and identify related protein sequences
BLOCKS
- uses a seed alignment method to identify conserved blocks in proteins
- can be used to create your own blocks from a given sequence alignment and compare them to the database
19- PSIBLAST does an initial BLAST search and then
aligns the sequences with the query to develop a
profile of the related sequences. It then, in
iterations, uses this profile to find proteins
that match best in selected regions.
20- Use FASTA for sensitivity
- E.g., for comparing DNA fragments, longer aligned
regions, in regions separated by long insertions
- Use BLAST for speed
- E.g., proteins, short regions of high similarity;
use a filter to avoid repeated sequences
- However, BLAST is fast because it throws away
sequences that do not have a significant word
match without gaps to the query. So sequences
with homology throughout but no long words will
be missed.
- Use PSIBLAST for divergent family members
- E.g., use to find highly divergent family members or
related proteins with similarity only in a
defined region.
- Makes a similarity matrix from the sequences
found in the initial BLAST search and uses this
to find distantly related sequences.
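The idea of turning the first round of hits into a position-specific profile can be sketched as follows. This is a deliberately reduced illustration: it uses raw residue frequencies per column, while PSIBLAST itself builds a log-odds matrix with sequence weighting and pseudocounts. The names `column_profile` and `profile_score` and the toy alignment are assumptions for the example.

```python
from collections import Counter

def column_profile(aligned):
    """Return, for each alignment column, the fraction of each residue (gaps ignored)."""
    profile = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]
        total = len(residues)
        counts = Counter(residues)
        profile.append({r: n / total for r, n in counts.items()} if total else {})
    return profile

def profile_score(profile, segment):
    """Sum the profile frequency of each residue in an ungapped segment."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, segment))

# Toy alignment, for illustration only.
toy_alignment = ["ACDE", "ACDQ", "SCDE", "AC-E"]
toy_profile = column_profile(toy_alignment)

print(profile_score(toy_profile, "ACDE"), profile_score(toy_profile, "SCDQ"))
```

A segment that follows the family consensus scores higher than one that uses the rarer residues, even though both would be accepted by some member of the family.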
21To illustrate how the different sections of Blast
give you different information, and how you can
use them in decomposing a protein into domains,
look at this specific example
http://www.ncbi.nlm.nih.gov/BLAST/
MNIKKEFIKVISMSCLVTAITLSGPVFIPLVQGAGGHGDVGMHVKEKEKN
KDENKRKDEERNKTQEEHLKEIMKHIVKIEVKGEEAVKKEAAEKLLEKVP
SDVLEMYKAIGGKIYIVDGDITKHISLEALSEDKKKIKDIYGKDALLHEH
YVYAKEGYEPVLVIQSSEDYVENTEKALNVYYEIGKILSRDILSKINQPY
QKFLDVLNTIKNASDSDGQDLLFTNQLKEHPTDFSVEFLEQNSNEVQEVF
AKAFAYYIEPQHRDVLQLYAPEAFNYMDKFNEQEINLSLEELKDQRMLAR
YEKWEKIKQHYQHWSDSLSEEGRGLLKKLQIPIEPKKDDIIHSLSQEEKE
LLKRIQIDSSDFLSTEEKEFLKKLQIDIRDSLSEEEKELLNRIQVDSSNP
LSEKEKEFLKKLKLDIQPYDINQRLQDTGGLIDSPSINLDVRKQYKRDIQ
NIDALLHQSIGSTLYNKIYLYENMNINNLTATLGADLVDSTDNTKINRGI
FNEFKKNFKYSISSNYMIVDINERPALDNERLKWRIQLSPDTRAGYLENG
KLILQRNIGLEIKDVQIIKQSEKEYIRIDAKVVPKSKIDTKIQEAQLNIN
QEWNKALGLPKYTKLITFNVHNRYASNIVESAYLILNEWKNNIQSDLIKK
VTNYLVDGNGRFVFTDITLPNIAEQYTHQDEIYEQVHSKGLYVPESRSIL
LHGPSKGVELRNDSEGFIHEFGHAVDDYAGYLLDKNQSDLVTNSKKFIDI
FKEEGSNLTYGRTNEAEFFAEAFRLMHSTDHAERLKVQKNAPKTFQFIND
QIKFIINS
22Our protein has 4 different domains that are
distinct in sequence and structure
23So now we can parse our protein into 3 (or 4)
distinct domains
MNIKKEFIKVISMSCLVTAITLSGPVFIPLVQGAGGHGDVGMHVKEKEKN
KDENKRKDEERNKTQEEHLKEIMKHIVKIEVKGEEAVKKEAAEKLLEKVP
SDVLEMYKAIGGKIYIVDGDITKHISLEALSEDKKKIKDIYGKDALLHEH
YVYAKEGYEPVLVIQSSEDYVENTEKALNVYYEIGKILSRDILSKINQPY
QKFLDVLNTIKNASDSDGQDLLFTNQLKEHPTDFSVEFLEQNSNEVQEVF
AKAFAYYIEPQHRDVLQLYAPEAFNYMDKFNEQEINLSLEELKDQ RML
ARYEKWEKIKQHYQHWSDSLSEEGRGLLKKLQIPIEPKKDDIIHSLSQEE
KELLKRIQIDSSDFLSTEEKEFLKKLQIDIRDSLSEEEKELLNRIQVDSS
NPLSEKEKEFLKKLKLDIQPY DINQRLQDTGGLIDSPSINLDVRKQYKR
DIQNIDALLHQSIGSTLYNKIYLYENMNINNLTATLGADLVDSTDNTKIN
RGIFNEFKKNFKYSISSNYMIVDINERPALDNERLKWRIQLSPDTRAGYL
ENGKLILQRNIGLEIKDVQIIKQSEKEYIRIDAKV VPKSKIDTKIQEA
QLNINQEWNKALGLPKYTKLITFNVHNRYASNIVESAYLILNEWKNNIQS
DLIKKVTNYLVDGNGRFVFTDITLPNIAEQYTHQDEIYEQVHSKGLYVPE
SRSILLHGPSKGVELRNDSEGFIHEFGHAVDDYAGYLLDKNQSDLVTNSK
KFIDIFKEEGSNLTYGRTNEAEFFAEAFRLMHSTDHAERLKVQKNAPKTF
QFINDQIKFIINS
24Getting the most similar sequences: justifying
similarity at low identity
- Optimize
- matrix used to analyze similarity
- The search method for determining regions of
similarity
- The penalties for gapping
- Using other sequences to improve the alignment
- Identifying motifs in sequences
25Section 2 Aligning sequences
- CLUSTAL-W,
- motif assisted hand alignment
- algorithms to identify similar areas in divergent
sequences
26Aligning sequences
CLUSTAL W: progressive pairwise alignment
method. Does a 1:1 match of sequences and selects
for the frame and gapping that gives the highest
similarity score, calculated using a default
scoring matrix or one selected by the user.
http://www.ebi.ac.uk/Tools/clustalw2/index.html
DALI: structure based sequence alignment method.
When structures are available, aligns sequences
according to the match of their secondary
structures. Dali has now been superseded by SSM:
http://www.ebi.ac.uk/msd-srv/ssm/
Neither method can back up or match elements that
do not occur consecutively in the sequence. Both
have problems in matching a short sequence to a
long one, unless they have a large area of
identity.
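The pairwise step, choosing the frame and gapping with the highest similarity score, can be sketched with a basic dynamic-programming (Needleman-Wunsch) global alignment score. This is a minimal sketch under stated assumptions: match = 1, mismatch = -1 and a linear gap penalty of -2 are arbitrary illustrative values, where CLUSTAL W would use a substitution matrix and separate gap opening and extension penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap              # a prefix aligned entirely to gaps
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,   # align the two residues
                score[i - 1][j] + gap,        # gap in b
                score[i][j - 1] + gap,        # gap in a
            )
    return score[-1][-1]

print(global_alignment_score("ACDE", "ACDE"), global_alignment_score("ACDE", "ACE"))
```

Because every residue of both sequences must be placed, a short sequence aligned to a long one pays many gap penalties, which is one reason for the difficulty with short versus long sequences noted above.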
27We can generate an alignment of the middle
sections of our protein from the previous example
RMLARYEKWEKIKQHYQHWSDSLSEEGRGLLKKLQIPIEPKKDDIIHSLS
QEEKELLKRIQIDSSDFLSTEEKEFLKKLQIDIRDSLSEEEKELLNRIQV
DSSNPLSEKEKEFLKKLKLDIQPYDINQRLQDTGGLIDSPSINLDVRKQY
KRDIQNIDALLHQSIGSTLYNKIYLYENMNINNLTATLGADLVDSTDNTK
INRGIFNEFKKNFKYSISSNYMIVDINERPALDNERLKWRIQLSPDTRAG
YLENGKLILQRNIGLEIKDVQIIKQSEKEYIRIDAKV
- MKKRKVLIPLMALSTILVSSTGNLEVIQAEVKQENRLLNESESSSQGLLG
YYFSDLNFQAPMVVTSSTTGDLSIPSSELENIPSENQYFQSAIWSGFIKV
KKSDEYTFATSADNHVTMWVDDQEVINKASNSNKIRLEKGRLYQIKIQYQ
RENPTEKGLDFKLYWTDSQNKKEVISSDNLQLPELKQKSSNSRKKRSTSA
GPTVPDRDNDGIPDSLEVEGYTVDVKNKRTFLSPWISNIHEKKGLTKYKS
SPEKWSTASDPYSDFEKVTGRIDKNVSPEARHPLVAAYPIVHVDMENIIL
SKNEDQSTQNTDSQTRTISKNTSTSRTHTSEVHGNAEVHASFFDIGGSVS
AGFSNSNSSTVAIDHSLSLAGERTWAETMGLNTADTARLNANIRYVNTGT
APIYNVLPTTSLVLGKNQTLATIKAKENQLSQILAPNNYYPSKNLAPIAL
NAQDDFSSTPITMNYNQFLELEKTKQLRLDTDQVYGNIATYNFENGRVRV
DTGSNWSEVLPQIQETTARIIFNGKDLNLVERRIAAVNPSDPLETTKPDM
TLKEALKIAFGFNEPNGNLQYQGKDITEFDFNFDQQTSQNIKNQLAELNA
TNIYTVLDKIKLNAKMNILIRDKRFHYDRNNIAVGADESVVKEAHREVIN
SSTEGLLLNIDKDIRKILSGYIVEIEDTEGLKEVINDRYDMLNISSLRQD
GKTFIDFKKYNDKLPLYISNPNYKVNVYAVTKENTIINPSENGDTSTNGI
KKILIFSKKGYEIG
28Section 3 structure based alignment
- Entering the twilight zone and the programs
that extricate you
- DALI,
- CE
- Identifying molegos
- MASIA,PCPMer
29The longer the sequence, the less identity is
needed to have a significant match and probably
the same fold
T(L)
30(No Transcript)
31When one goes to very low levels of sequence
identity or similarity (below 30%), it is
difficult to distinguish proteins with similar
structures. This is where alignment scores are
important.
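For reference, percent identity between two already aligned sequences can be computed as below. The denominator used here (columns where neither sequence has a gap) is one common convention and is an assumption of this sketch; other programs divide by the alignment length or by the shorter sequence length, so reported values can differ.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue                      # skip gapped columns
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0

print(percent_identity("ACDE-G", "ACQEFG"))
```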
32DALI is a structure based site
- http://www.ebi.ac.uk/dali/
- http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html
- Go here and enter a PDB structure name to see
related structures and structure based
alignments.
- You can then put that DALI alignment into MASIA
or PCPMer to identify molegos, areas of sequence
and structural conservation
33Functional decomposition
A simple activity can be divided into many steps
or functions
34Section 4 Determining function from sequence and
structure
- PROSITE: http://ca.expasy.org/prosite/
- PFAM: http://www.sanger.ac.uk/Software/Pfam/
- BLOCKS: http://blocks.fhcrc.org/
- PCPMer: http://landau.utmb.edu:8080/WebPCPMer/
35Methods for identifying motifs in protein
sequences
PROSITE: to search for distinctive elements that
are stored in its library or defined by a user
PFAM: blocks of similar sequences and families.
BLOCKS: library of similar sequences (used to
derive the BLOSUM scoring matrix) and will also
analyze your sequence alignment for areas of
similarity
PCPMer: a tool specifically designed for user
directed sequence and structural decomposition
36MASIA
- user determines most parameters
- can search for patterns in aligned sequences, as defined in macros
- can search for alphabetic motifs or those based on physical properties of amino acids
- now being developed with subroutines to locate protein homologues based on local pattern matching
37Sequence Alignment based on motifs
Motifs can guide the alignment of sequences from
different families that share discrete functions
that correlate with shared sequence. However,
there are currently no straightforward methods to
do this. We have developed a program suite,
PCPMer to profile sequences according to physical
chemical property based motifs.
38Identification of Motifs in a Protein Family
- To find the local conserved segments that
meet the following conditions:
- The number of insignificant positions in a motif
between two significant positions should be less
than an empirical parameter (G-cutoff).
- The minimum number of significant residue
positions in the segment should be more than an
empirical parameter (L-cutoff).
- The entropy of each residue in the segment should
be in the range defined by empirical parameters.
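The conditions above can be sketched as a scan over per-position flags. This is an illustrative reading of the rules, not the PCPMer code: the input is assumed to be a list of booleans that are True where a position already passes the entropy test, and the default values for `g_cutoff` and `l_cutoff` are arbitrary.

```python
def find_motifs(significant, g_cutoff=2, l_cutoff=3):
    """Return motif segments as (start, end), 1-based and inclusive.

    A segment starts and ends on a significant position, contains no run of
    g_cutoff or more insignificant positions, and holds more than l_cutoff
    significant positions.
    """
    motifs = []
    start = last = None      # 0-based index of first and most recent significant position
    count = 0                # significant positions in the current segment
    for i, flag in enumerate(significant):
        if not flag:
            continue
        if start is not None and i - last - 1 >= g_cutoff:
            if count > l_cutoff:             # close the segment that just ended
                motifs.append((start + 1, last + 1))
            start, count = None, 0
        if start is None:
            start = i
        last = i
        count += 1
    if start is not None and count > l_cutoff:
        motifs.append((start + 1, last + 1))
    return motifs

# Toy conservation pattern: 1 = significant position, 0 = insignificant.
flags = [c == "1" for c in "1101100001111010"]
print(find_motifs(flags))
```

Raising `l_cutoff` removes short segments, and raising `g_cutoff` lets a motif bridge longer unconserved stretches.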
39(No Transcript)
40(No Transcript)
41Adjusting the relative entropy levels to define
motifs
- Finding the Local sequence Motifs
- PCPMer can find the local maximum conserved
region in the given relative entropy range.
ALYEDPPDHKTSPSGKPATLKICSWNVDGLRAWIKKKGLDWVKEEAPDILCLQETKCSENKLPAELQEL 0.50
ALYEDPPDHKTSPSGKPATLKICSWNVDGLRAWIKKKGLDWVKEEAPDILCLQETKCSENKLPAEL--- 0.60
ALYEDPPDHKTSPSGK---LKICSWNVDGLRAWIKKKGLDWVKEEAPDILCLQETKCSENKLPAE---- 0.70
--YEDPPDHKTSPSGK---LKICSWNVDGLRAWIKKKGLDWVKEEAPDILCLQETKCSENKLP------ 0.80
--YEDPPD-----------LKICSWNVDGLRAWIKKKGLDWVKEEAPDILCLQETKC------------ 0.90
--YEDPPD-----------LKICSWNVDGLRAWIKKKGLDWVKEEAPDILCLQETKC------------ 1.00
--YEDPPD-----------LKICSWNVDGLRAWIKKKGLDWVKEEAPDILCLQETK------------- 1.10
-------------------LKICSWNVDGLRA--------------PDILCLQETK------------- 1.25
-------------------LKICSWNVDGLRA--------------PDILCLQETK-------------
MOTIF 1   3 YEDPPD 8
MOTIF 2  20 LKICSWNVDGLRA 32
MOTIF 3  47 PDILCLQETK 56
42Motif List
No.  Start Position  Motif Sequence      End Position
1    47              GLLGYY              52
2    91              SAIWSGFIKVKKSDEYTF  108
3    118             MWVD                121
4    135             IRLEKGRLYQIKIQY     149
5    162             KLYW                165
6    172             KEVISSDN            179
7    206             DRDNDGIPD           214
8    228             KRTFLS              233
9    244             GLTKYKSSP           252
10   260             DPYSD               264
11   283             PLVAAYP             289
12   293             VDMEN               297
13   388             RLNANIRYVNTG        399
14   449             ALNAQDDFSS          458
15   650             NSSTEGL             656
43Search Again a Protein Database With Motifs of
the Family
- Search with a motif against a query sequence.
- To find the maximum score window as the
score of the sequence, a Lorentzian based scoring
scheme is used to measure the quality of fit for
a query sequence to a motif at position k for the
component vector i.
- Search with the motifs of the family against the
database.
- The Bayesian method is applied to decide
if a given score S for a segment in a query
sequence is a sufficient match to a motif.
is the average and is the difference
between average scores of a motif of the family
and a protein database.
44Stereochemical variability plot of DV2env. This
plot shows the position specific entropy, a
measure for variability.
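A position-specific entropy profile can be sketched as below. Note the assumption: this uses plain Shannon entropy over residue letters, which is a simpler stand-in and may differ from the property-based (stereochemical) measure used for the plot. The toy alignment is made up for the example.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy in bits of the residues in one alignment column (gaps ignored)."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = 0.0
    for n in Counter(residues).values():
        p = n / total
        entropy += p * math.log2(1 / p)
    return entropy

def entropy_profile(aligned):
    """Entropy of every column of a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*aligned)]

# Toy alignment, for illustration only.
toy = ["ACDE", "ACEE", "ACFQ", "ACGQ"]
variability = entropy_profile(toy)
print(variability)
```

Fully conserved columns give 0 bits, and a column with four different residues in four sequences gives the maximum of 2 bits for this toy case.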
45PCP-Motifs (blue) common to all flaviviruses
located on the envelope protein of Dengue virus
serotype 2. The arrow shows the fusion peptide;
note the high conservation of residues on two
loops near it that are far away in the sequence.
46Exercises (homework 5)
- For the homework, answer the short questions and
run a BLAST and PsiBLAST search using your
project sequence as query (for those who have a
project). Sample sequences will be assigned for
others in the course who are not doing projects.
Collect at least 6 related sequences and align
them with CLUSTALW, trimming sequences if
necessary so that the program can find the true
areas of similarity.
- Keep your alignments, etc. as Word files, as in a
future homework we will submit the alignment to
PCPMer to define PCP-motifs. Use the motifs to
scan the ASTRAL40 database for similar proteins
of known structure, and to help in identifying
functional areas of the sequence/structure.
- Also do a BLAST search of the PDB to find
proteins of known structure related to your
protein. Use your motif lists to determine if any
of these are possible templates.
- Submit your alignments and motif lists by next
Monday and include them with your project. Do not
submit your BLAST searches. Be sure you answer
all the questions about your assigned protein.
The 10 homework problems will be individually
scored, so missing one or two will automatically
bring your grade down.
- Note: anyone caught printing out unedited BLAST
searches will be charged .10/page.
47Sequences for homework problems(for those with
no project)
- gthwno1
- MAWSANKAAVVLCMDVGVAMGNSFPGEESSFEQAKKVMTMFVQRQVFSES
KDEIALVLFGTDNTNNALASEDQYQNITVHRHLMLPDFDLLEDIESKIQL
GSRQADILDALIVCMDLIQRETIGKKFEKKHIEVFTDLSSPFSQDQLDVI
ICNLKKSGISLQFFLPFPISKNDETGDRGDGDLGLDHCGPSFPQKGITEQ
QKEGICMVERVMVSLEGEDGLDEIYSFSESLRRLCVFKKIERRSMPWSCQ
LTIGPDLSIKIVAYKSIVQEKVKKSWIVVDARTLKKEDIRKETVYCLNDD
DETEVSKEDTIQGFRYGSDIIPFSKVDEEQMKYKSEGKCFSVLGFCRSSQ
VHRRFFMGYQVLKVFAAKDDEAAAVALSSLIHALDELNMVAIVRYAYDKR
ANPQVGVAFPYIKDSYECLVYVQLPFMEDLRQYMFSSLKNNKKCTPTEAQ
LSAIDDLIESMSLVKKSEEEDTIEDLFPTSKIPNPEFQRFFQCLLHRVLH
PQERLPPIQQHILNMLNLPTEMKAKCEIPLSKVRTLFPLTEAVKKKDQVT
AQDIFQDIHEEGPAAKKCKTEKEEGHISISSVAEGNVTKVGSVNPVESFR
VLVRQKIASFEQASLQLISHIEQFLDTNETLYFMKSMECIKAFREEAIQF
SEEQRFNSFLEALREKVEIKQLNHFWEIVVQDGVTLITKDEGSGSSVTTE
EATKFLAPKDKAKEDAAGLEEGGDVDDLLDMI
- gthwno2
- MTRNKFIPNKFSIISFSVLLFAISSSQAIEVNAMNEHYTESDIKRNHKTE
KNKTEKEKFKDSINNLVKTEFTNETLDKIQQTQGLLKKIPKDVLEIYSEL
GGEIYFTDIDLVEHKELQDLSEEEKNSMNSRGEKVPFASRFVFEKKRETP
KLIINIKDYAINSEQSKEVYYEIGKGISLDIISKDKSLDPEFLNLIKSLS
DDSDSSDLLFSQKFKEKLELNNKSIDINFIKENLTEFQHAFSLAFSYYFA
PDHRTVLELYAPDMFEYMNKLEKGGFEKISESLKKEGVEKDRIDVLKGEK
ALKASGLVPEHADAFKKIARELNTYILFRPVNKLATNLIKSGVATKGLNV
HVKSSDWGPVAGYIPFDQDLSKKHGQQLAVEKGNLENKKSITEHEGEIGK
IPLKLDHLRIEELKENGIILKGKKEIDNGKKYYLLESNNQVYEFRISDEN
NEVQYKTKEGKITVLGEKFNWRNIEVMAKNVEGVLKPLTADYDLFALAPS
LTEIKKQIPQKEWDKVVNTPNSLEKQKGVTNLLIKYGIERKPDSTKGTLS
NWQKQMLDRLNEAVKYTGYTGGDVVNHGTEQDNEEFPEKDNEIFIINPEG
EFILTKNWEMTGRFIEKNITGKDYLYYFNRSYNKIAPGNKAYIEWTDPIT
KAKINTIPTSAEFIKNLSSIRRSSNVGVYKDSGDKDEFAKKESVKKIAGY
LSDYYNSANHIFSQEKKRKISIFRGIQAYNEIENVLKSKQIAPEYKNYFQ
YLKERITNQVQLLLTHQKSNIEFKLLYKQLNFTENETDNFEVFQKIIDEK
- gthwno3
- MKIQMRNKKVLSFLTLTAIVSQALVYPVYAQTSTSNHSNKKKEIVNEDIL
PNNGLMGYYFSDEHFKDLKLMAPIKDGNLKFEEKKVDKLLDKDKSDVKSI
RWTGRIIPSKDGEYTLSTDRDDVLMQVNTESTISNTLKVNMKKGKEYKVR
IELQDKNLGSIDNLSSPNLYWELDGMKKIIPEENLFLRDYSNIEKDDPFI
PNNNFFDPKLMSDWEDEDLDTDNDNIPDSYERNGYTIKDLIAVKWEDSFA
EQGYKKYVSNYLESNTAGDPYTDYEKASGSFDKAIKTEARDPLVAAYPIV
GVGMEKLIISTNEHASTDQGKTVSRATTNSKTESNTAGVSVNVGYQNGFT
ANVTTNYSHTTDNSTAVQDSNGESWNTGLSINKGESAYINANVRYYNTGT
APMYKVTPTTNLVLDGDTLSTIKAQENQIGNNLSPGDTYPKKGLSPLALN
TMDQFSSRLIPINYDQLKKLDAGKQIKLETTQVSGNFGTKNSSGQIVTEG
NSWSDYISQIDSISASIILDTENESYERRVTAKNLQDPEDKTPELTIGEA
IEKAFGATKKDGLLYFNDIPIDESCVELIFDDNTANKIKDSLKTLSDKKI
YNVKLERGMNILIKTPTYFTNFDDYNNYPSTWSNVNTTNQDGLQGSANKL
NGETKIKIPMSELKPYKRYVFSGYSKDPLTSNSIIVKIKAKEEKTDYLVP
EQGYTKFSYEFETTEKDSSNIEITLIGSGTTYLDNLSITELNSTPEILDE
PEVKIPTDQEIMDAHKIYFADLNFNPSTGNTYINGMYFAPTQTNKEALDY
IQKYRVEATLQYSGFKDIGTKDKEMRNYLGDPNQPKTNYV
NLRSYFTGGENIMTYKKLRIYAITPDDRELLVLSVD
- gthwno4
- MNTINIAKNDFSDIELAAIPFNTLADHYGERLAREQLALEHESYEMGEAR
FRKMFERQLKAGEVADNAAAKPLITTLLPKMIARINDWFEEVKAKRGKRP
TAFQFLQEIKPEAVAYITIKTTLACLTSADNTTVQAVASAIGRAIEDEAR
FGRIRDLEAKHFKKNVEEQLNKRVGHVYKKAFMQVVEADMLSKGLLGGEA
WSSWHKEDSIHVGVRCIEMLIESTGMVSLHRQNAGVVGQDSETIELAPEY
AEAIATRAGALAGISPMFQPCVVPPKPWTGITGGGYWANGRRPLALVRTH
SKKALMRYEDVYMPEVYKAINIAQNTAWKINKKVLAVANVITKWKHCPVE
DIPAIEREELPMKPEDIDMNPEALTAWKRAAAAVYRKDKARKSRRISLEF
MLEQANKFANHKAIWFPYNMDWRGRVYAVSMFNPQGNDMTKGLLTLAKGK
PIGKEGYYWLKIHGANCAGVDKVPFPERIKFIEENHENIMACAKSPLENT
WWAEQDSPFCFLAFCFEYAGVQHHGLSYNCSLPLAFDGSCSGIQHFSAML
RDEVGGRAVNLLPSETVQDIYGIVAKKVNEILQADAINGTDNEVVTVTDE
NTGEISEKVKLGTKALAGQWLAYGVTRSVTKRSVMTLAYGSKEFGFRQQV
LEDTIQPAIDSGKGLMFTQPNQAAGYMAKLIWESVSVTVVAAVEAMNWLK
SAAKLLAAEVKDKKTGEILRKRCAVHWVTPDGFPVWQEYKKPIQTRLNLM
FLGQFRLQPTINTNKDSEIDAHKQESGIAPNFVHSQDGSHLRKTVVWAHE
KYGIESFALIHDSFGTIPADAANLFKAVRETMVDTYESCDVLADFYDQFA
DQLHESQLDKMPALPAKGNLNLRDILESDFAFA
48PROSITE: Or what makes my protein special?
- motifs that distinguish members of a protein family or superfamily
- motifs are usually hand edited by experts who work with proteins in the family and are familiar with its activities and the areas needed for activity
- searching requires a set format and specification of a motif in PROSITE script, e.g.,
PKAPTGKPQKGRAKAEAT
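PROSITE patterns use a compact syntax (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeats, - between positions). A minimal sketch of translating that syntax to a Python regular expression is shown below. The helper name `prosite_to_regex`, the example pattern and the test sequence are illustrative assumptions; the sketch covers only the basic elements listed here.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".")                              # patterns may end with a period
    regex = regex.replace("-", "")                           # position separators
    regex = regex.replace("x", ".")                          # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")       # {P} = anything except P
    regex = regex.replace("(", "{").replace(")", "}")        # x(2,4) = 2 to 4 repeats
    regex = regex.replace("<", "^").replace(">", "$")        # N- and C-terminal anchors
    return regex

# Illustrative zinc-finger-like pattern written in PROSITE syntax.
pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(pattern)

# Made-up test sequence containing one match.
match = re.search(regex, "MKCAACAAALAAAAAAAAHAAAHGG")
print(regex, match.start() if match else None)
```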
49PAM: point accepted mutation
- PAM scores are derived from alignments of
closely related sequences, i.e., proteins whose
function is known to be the same (hemoglobin,
cytochrome c, ribosomal proteins, RNase A...)
from many organisms. The original PAM scoring
matrix was derived by Margaret Dayhoff, a pioneer
in sequence analysis.
- Numbers may be expressed in terms of
time-dependent probability matrices (P(t)). One
PAM unit is the time required to achieve an
average change of 1% in the amino acid positions.
The original aim was to relate observed changes
to the evolutionary distance between organisms,
as reflected by the geological record. Thus PAM
units may be expressed in millions of years of
evolution.
- PAM250 corresponds to more divergent sequences
than PAM100; the higher-numbered matrices are
extrapolated from the 1 PAM matrix.
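The relationship between PAM matrices of different numbers can be sketched by repeated matrix multiplication: the probability matrix for n PAM units is the one-unit matrix multiplied by itself n times. The example below uses a made-up two-letter alphabet in which each letter is kept with probability 0.99 per step; the matrix values and function names are assumptions for illustration, not Dayhoff's data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_power(m, n):
    """Return m multiplied by itself n times (identity for n = 0)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# Hypothetical one-step matrix: 1% chance of change per position per step.
step = [[0.99, 0.01],
        [0.01, 0.99]]
p100 = mat_power(step, 100)
p250 = mat_power(step, 250)

print(round(p100[0][0], 3), round(p250[0][0], 3))
```

The probability of finding the original letter drops as the number of steps grows, so a matrix for 250 units describes more divergent sequences than one for 100 units.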
50BLOSUM: the BLOcks SUbstitution Matrix.
- uses the BLOCKS database to search for differences
among sequences, but only among the very conserved
regions of a protein family.
- Should give a better substitution matrix for more
distantly related sequences than the PAM
matrices. Also, as PAM is limited to proteins of
known function for its derivation, you have more
sequences contributing to the BLOSUM numbers.
- Scores are derived from alignments of distantly
related sequences, without regard to function.
- the sequence alignments are from the BLOCKS
database, with the numerical value derived from
the cutoff value for the diversity of the
sequences.
- BLOSUM62 (sequences more than 62% identical are
clustered together) will be drawn from a less
diverse sequence alignment than BLOSUM35 (where
sequences more than 35% identical are clustered)
51BLOSUM62 log odds matrix
52Other scoring matrices
- Gaston Gonnet and coworkers derived a matrix
much like PAM250 by using pairwise alignments of
all the sequences known in 1992, in an iterative
fashion starting with alignments based on PAM250.
They noted that their results were different when
they used closely related sequence alignments vs.
more distantly related ones.
- Identity matrix: sort of the original, but only
useful if it is scored according to the frequency
of occurrence of amino acids in the database.
53Log Odds Gonnet Matrix
54Abalone pheromone example
- I was sent the following sequence for a high
abundance 51 residue protein in abalone
- AFSCDPECFSYLDGPFCISNGSVVCDNCLRDRMLCEDMSLTSKDCSAPCDK
- The cloners told me they could find no matches for
this with BLAST. So what do I do with it?
- First, a repeat of BLAST with default conditions
even in my hands gave nothing.
55Using BLAST to find short exact matches
This match, from a shrimp protease inhibitor,
looked especially promising.
- However, there were lots of other matches and I
could not decide among them, so I sent it to fold
recognition servers
56Fold recognition helps to sort things out
- 7.96e00  18  1TBR  Chain R  52
- The first choice from the servers was the Kazal
protease inhibitor from insects, for which there
already is a structure (PDB file 1TBR)