Where do you go from a protein sequence: - PowerPoint PPT Presentation

Title:

Where do you go from a protein sequence:

Description:

Components E1 to E5 Physical chemical Properties ... PCPMer to profile sequences according to physical chemical property based motifs. ... – PowerPoint PPT presentation

Slides: 57
Provided by: bose9
Tags: protein | sequence


Transcript and Presenter's Notes



1
Where do you go from a protein sequence? What
does the protein do? What structure does it
have? Determining elements that could control
specificity! Functions and activities related to
other proteins
2
What do we want to know from the sequence?
  • FUNCTION!
  • If we want to design inhibitors for the protein,
    we want to understand the relationship between
    sequence and structure and function
  • Sequence alignment and decomposition can help
  • So, first things first. One wants to find what
    proteins have a sequence similar to yours

3
Screening databases for sequence homologues
  • Scoring matrices (discussed by WB)
  • Local vs. global alignments
  • Database searching
  • Analyzing data and determining the
    significance of matches
  • Probability and expectation values
Identifying motifs
  • PROSITE
  • PFAM
  • Hidden Markov models
  • PCPMer (based on physicochemical property
    conservation)
Aligning sequences
  • Progressive alignments
  • CLUSTALW
  • Phylogenetic trees
4
Introduction to decomposition
Sequence, Structure
Motifs, Molegos
Activity
Functions
Elements can be sequential or independent
5
  • Section 1
  • Finding homologous sequences in databases
  • sequence matching programs for database search
  • BLAST, PSIBLAST, FASTA
  • Decomposing proteins into domains: Pfam, PROSITE,
    BLOCKS
  • Section 2
  • Aligning the sequence with finds from above
  • CLUSTAL-W,
  • motif assisted hand alignment
  • algorithms to identify similar areas in divergent
    sequences
  • Section 3
  • using structure to align: DALI, CE
  • Identifying molegos: PCPMer
  • Section 4
  • Determining function from sequence and structure
  • PCPMer (GETAREA)

6
You should know the one letter code
And you should understand the chemical properties
of amino acids
7
(No Transcript)
8
Section 1: Finding homologous sequences in
databases
  • scoring matrices and sequence similarity indices:
    PAM, BLOSUM, GONNET, Identity
  • sequence matching programs for database
    searching: FASTA, BLAST, PSIBLAST

9
Homology vs. similarity
  • Homology: the similarity between sequences that
    indicates common ancestry.
  • Sequence similarity is observable; homology is a
    hypothesis based on observation.
  • We want to know whether sequences are truly
    homologous because this will enable us to make
    conclusions about their probable structure and
    function.
  • We can talk about overall similarity and areas of
    similarity (motifs) in distinguishing probable
    homologues

10
Amino acid distance measures
One of the most common measures used in computer
algorithms for sequence analysis is some measure
of the distance between two sequences. Distance
can be used with anything that can be measured,
even the strength of an immunological reaction.
The concept of distance allows different types of
measures to be combined, as we will see later for
obtaining vectors that combine a lot of
physicochemical properties.                       
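
One of the simplest sequence distances can be sketched as follows: the fraction of aligned positions at which two sequences differ (often called the p-distance). This is a minimal illustration, not a method from the lecture; the function name and the gap-handling rule (gapped positions are skipped) are assumptions made for the example.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of compared positions at which two aligned sequences differ.

    Both sequences must already be aligned (equal length).
    Positions where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    different = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # assumption: gaps do not count as differences
        compared += 1
        if a != b:
            different += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return different / compared
```

For example, `p_distance("ACDE", "ACDF")` compares four positions and finds one difference. Replacing the equality test with a substitution-matrix or property-based score gives the weighted distances discussed on the following slides.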
        
11
Scoring matrices
The rate of substitution is not the same for all
amino acids. Thus amino acid distances use
empirical weighting schemes, called substitution
matrices (PAM, BLOSUM, Gonnet, Identity).
The available matrices are statistical summaries
of changes observed in similar sequences. The
numbers are log-odds scores: the logarithm of the
ratio of the probability that a substitution is
meaningful (observed in related sequences) to the
probability that it occurs at random. The
matrices differ primarily in how the sequence
alignments used for the calculations are
assembled.
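
The log-odds idea can be sketched in a few lines. The frequencies below are made-up numbers chosen for easy arithmetic, not values from any published matrix, and the function name is hypothetical. The scale factor of 2 with a base-2 logarithm gives scores in half-bit units, the convention used for BLOSUM62.

```python
import math


def log_odds(q_ij: float, p_i: float, p_j: float, scale: float = 2.0) -> float:
    """Log-odds score for an ordered residue pair (i, j).

    q_ij : observed frequency of the pair in trusted alignments
    p_i, p_j : background frequencies of the two residues
    scale : 2.0 with log base 2 gives half-bit units
    """
    expected_by_chance = p_i * p_j
    return scale * math.log2(q_ij / expected_by_chance)


# Illustrative (made-up) frequencies:
# a pair seen twice as often as chance scores positive,
# a pair seen half as often as chance scores negative.
favoured = log_odds(q_ij=0.125, p_i=0.25, p_j=0.25)
disfavoured = log_odds(q_ij=0.03125, p_i=0.25, p_j=0.25)
```

A positive score means the substitution is seen more often in related sequences than chance predicts; a negative score means it is seen less often.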
12
Alternatively, one can revert to first principles
to group the amino acids according to physical
chemical principles (PCP)
Multidimensional scaling
13
Components E1 to E5: Physical-chemical Properties
The numerical descriptors for each amino acid i
are calculated by (equation not transcribed) for
the five eigenvectors
14
Methods to look for related sequences in databases
  • Global search methods require too much time.
    Local programs (FASTA, BLAST and PSIBLAST) are
    faster.
  • FASTA matches one protein sequence to another
    over its whole length and looks for stretches of
    similarity. It then chooses the frame that gives
    the best similarity score and seeks to extend
    the blocks in both directions.
  • BLAST precuts the database. It looks for all the
    words in the query and then finds all the
    sequences in the database with a matching score
    for one of these words above a certain cutoff.
    It then selects sequences with the most matching
    words and seeks to connect these.
  • PCPMer is our method for finding distant
    relatives. It relies on identifying PCP-motifs
    in a protein family and then finding proteins
    that match some or all of these motifs.
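
The word-based seeding step can be illustrated with a deliberately simplified sketch. It only counts exact shared words of length k, whereas real BLAST also accepts similar ("neighborhood") words that score above a cutoff with a substitution matrix, and then extends the hits. All names and the toy database are assumptions for illustration.

```python
from collections import defaultdict


def index_words(database: dict, k: int = 3) -> dict:
    """Map every k-letter word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def seed_hits(query: str, index: dict, k: int = 3) -> list:
    """Count exact word matches per database sequence, best first."""
    counts = defaultdict(int)
    for qpos in range(len(query) - k + 1):
        word = query[qpos:qpos + k]
        for name, _pos in index.get(word, []):
            counts[name] += 1
    return sorted(counts.items(), key=lambda item: -item[1])


# Toy database (made-up sequences)
toy_db = {"seqA": "MKVLAAGIV", "seqB": "PQRSTWYHC"}
toy_index = index_words(toy_db)
hits = seed_hits("KVLAAG", toy_index)
```

Sequences with no shared word never appear in `hits`, which is the source of both the speed and the blind spot of word-based searching: a sequence with diffuse similarity but no matching word is discarded before any alignment is attempted.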
15
(No Transcript)
16
(No Transcript)
17
Watch out for these when running BLAST searches
  • Signal and pro-sequences at the N-terminus,
    which can obscure the similarity of the active
    region of the protein
  • Transmembrane regions. For example, mammalian
    adenyl cyclase enzymes have two active site
    domains, separated by about 400 residues of
    transmembrane helix
  • Single domains matching in a large protein. For
    example, APE endonuclease may match endonuclease
    domains in polymerases, but they are not
    polymerases!
  • Trim your sequences as you get more results

18
Total sequence decomposition or profiling
methods
  • PFAM: alignments of protein domains. Pfam-A is
    generated with expert knowledge; Pfam-B is
    automatically generated. Uses hidden Markov
    models to profile proteins and identify related
    protein sequences.
  • BLOCKS: uses a seed alignment method to identify
    conserved blocks in proteins. Can be used to
    create your own blocks from a given sequence
    alignment and compare them to the database.
19
  • PSIBLAST does an initial BLAST search and then
    aligns the sequences with the query to develop a
    profile of the related sequences. It then, in
    iterations, uses this profile to find proteins
    that match best in selected regions.
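
The profile idea behind PSIBLAST can be sketched as a position-specific scoring matrix built from an ungapped alignment. This is a minimal illustration under stated assumptions (uniform background frequencies, a simple pseudocount, no sequence weighting); PSIBLAST itself is considerably more elaborate, and the function names here are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned: list, pseudocount: float = 1.0) -> list:
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_segment(segment: str, profile: list) -> float:
    """Sum the column scores for a segment of the same length as the profile."""
    return sum(column[res] for res, column in zip(segment, profile))


toy_profile = build_profile(["ACD", "ACE", "ACD"])
```

A segment resembling the family, such as `"ACD"`, scores higher against `toy_profile` than an unrelated one such as `"WWW"`. In an iterative search, the sequences found with the profile are added to the alignment and the profile is rebuilt.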

20
  • Use FASTA for sensitivity
  • E.g., for comparing DNA fragments, longer aligned
    regions, in regions separated by long insertions
  • Use BLAST for speed
  • E.g., proteins, short regions of high similarity;
    use a filter to avoid repeated sequences
  • However, BLAST is fast because it throws away
    sequences that do not have a significant word
    match without gaps to the query. So sequences
    with homology throughout but no long words will
    be missed.
  • Use PSIBLAST for divergent family members
  • E.g., use it to find highly divergent family
    members or related proteins with similarity only
    in a defined region.
  • It makes a similarity matrix from the sequences
    found in the initial BLAST search and uses this
    to find distantly related sequences.

21
To illustrate how the different sections of Blast
give you different information, and how you can
use them in decomposing a protein into domains,
look at this specific example
http://www.ncbi.nlm.nih.gov/BLAST/
MNIKKEFIKVISMSCLVTAITLSGPVFIPLVQGAGGHGDVGMHVKEKEKN
KDENKRKDEERNKTQEEHLKEIMKHIVKIEVKGEEAVKKEAAEKLLEKVP
SDVLEMYKAIGGKIYIVDGDITKHISLEALSEDKKKIKDIYGKDALLHEH
YVYAKEGYEPVLVIQSSEDYVENTEKALNVYYEIGKILSRDILSKINQPY
QKFLDVLNTIKNASDSDGQDLLFTNQLKEHPTDFSVEFLEQNSNEVQEVF
AKAFAYYIEPQHRDVLQLYAPEAFNYMDKFNEQEINLSLEELKDQRMLAR
YEKWEKIKQHYQHWSDSLSEEGRGLLKKLQIPIEPKKDDIIHSLSQEEKE
LLKRIQIDSSDFLSTEEKEFLKKLQIDIRDSLSEEEKELLNRIQVDSSNP
LSEKEKEFLKKLKLDIQPYDINQRLQDTGGLIDSPSINLDVRKQYKRDIQ
NIDALLHQSIGSTLYNKIYLYENMNINNLTATLGADLVDSTDNTKINRGI
FNEFKKNFKYSISSNYMIVDINERPALDNERLKWRIQLSPDTRAGYLENG
KLILQRNIGLEIKDVQIIKQSEKEYIRIDAKVVPKSKIDTKIQEAQLNIN
QEWNKALGLPKYTKLITFNVHNRYASNIVESAYLILNEWKNNIQSDLIKK
VTNYLVDGNGRFVFTDITLPNIAEQYTHQDEIYEQVHSKGLYVPESRSIL
LHGPSKGVELRNDSEGFIHEFGHAVDDYAGYLLDKNQSDLVTNSKKFIDI
FKEEGSNLTYGRTNEAEFFAEAFRLMHSTDHAERLKVQKNAPKTFQFIND
QIKFIINS
22
Our protein has 4 different domains that are
distinct in sequence and structure
23
So now we can parse our protein into 3 (or 4)
distinct domains
MNIKKEFIKVISMSCLVTAITLSGPVFIPLVQGAGGHGDVGMHVKEKEKN
KDENKRKDEERNKTQEEHLKEIMKHIVKIEVKGEEAVKKEAAEKLLEKVP
SDVLEMYKAIGGKIYIVDGDITKHISLEALSEDKKKIKDIYGKDALLHEH
YVYAKEGYEPVLVIQSSEDYVENTEKALNVYYEIGKILSRDILSKINQPY
QKFLDVLNTIKNASDSDGQDLLFTNQLKEHPTDFSVEFLEQNSNEVQEVF
AKAFAYYIEPQHRDVLQLYAPEAFNYMDKFNEQEINLSLEELKDQ RML
ARYEKWEKIKQHYQHWSDSLSEEGRGLLKKLQIPIEPKKDDIIHSLSQEE
KELLKRIQIDSSDFLSTEEKEFLKKLQIDIRDSLSEEEKELLNRIQVDSS
NPLSEKEKEFLKKLKLDIQPY DINQRLQDTGGLIDSPSINLDVRKQYKR
DIQNIDALLHQSIGSTLYNKIYLYENMNINNLTATLGADLVDSTDNTKIN
RGIFNEFKKNFKYSISSNYMIVDINERPALDNERLKWRIQLSPDTRAGYL
ENGKLILQRNIGLEIKDVQIIKQSEKEYIRIDAKV VPKSKIDTKIQEA
QLNINQEWNKALGLPKYTKLITFNVHNRYASNIVESAYLILNEWKNNIQS
DLIKKVTNYLVDGNGRFVFTDITLPNIAEQYTHQDEIYEQVHSKGLYVPE
SRSILLHGPSKGVELRNDSEGFIHEFGHAVDDYAGYLLDKNQSDLVTNSK
KFIDIFKEEGSNLTYGRTNEAEFFAEAFRLMHSTDHAERLKVQKNAPKTF
QFINDQIKFIINS
24
Getting the most similar sequences: justifying
similarity at low identity
  • Optimize:
  • matrix used to analyze similarity
  • The search method for determining regions of
    similarity
  • The penalties for gapping
  • Using other sequences to improve the alignment
  • Identifying motifs in sequences

25
Section 2: Aligning sequences
  • CLUSTAL-W,
  • motif assisted hand alignment
  • algorithms to identify similar areas in divergent
    sequences

26
Aligning sequences
CLUSTAL W: progressive pairwise alignment
method. Does a 1:1 match of sequences and selects
the frame and gapping that give the highest
similarity score, calculated using a default
scoring matrix or one selected by the user.
http://www.ebi.ac.uk/Tools/clustalw2/index.html
DALI: structure-based sequence alignment method.
When structures are available, aligns sequences
according to the match of their secondary
structures. DALI has now been superseded by SSM:
http://www.ebi.ac.uk/msd-srv/ssm/
Neither method can back up or match elements that
do not occur consecutively in the sequence. Both
have problems in matching a short sequence to a
long one, unless they have a large area of
identity.
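
The pairwise step that progressive aligners build on can be sketched with the classic Needleman-Wunsch dynamic programme. This is not CLUSTAL W's implementation: the scores (match +1, mismatch -1, gap -2) are placeholder assumptions standing in for a substitution matrix and affine gap penalties, and only the optimal score is returned, not the alignment itself.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Optimal global alignment score with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align the two residues
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]
```

Because the recurrence only ever moves forward through both sequences, it shows why such methods cannot match elements that occur in a different order in the two sequences, and why a short sequence aligned against a long one accumulates a large gap cost.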
27
We can generate an alignment of the middle
sections of our protein from the previous example
RMLARYEKWEKIKQHYQHWSDSLSEEGRGLLKKLQIPIEPKKDDIIHSLS
QEEKELLKRIQIDSSDFLSTEEKEFLKKLQIDIRDSLSEEEKELLNRIQV
DSSNPLSEKEKEFLKKLKLDIQPYDINQRLQDTGGLIDSPSINLDVRKQY
KRDIQNIDALLHQSIGSTLYNKIYLYENMNINNLTATLGADLVDSTDNTK
INRGIFNEFKKNFKYSISSNYMIVDINERPALDNERLKWRIQLSPDTRAG
YLENGKLILQRNIGLEIKDVQIIKQSEKEYIRIDAKV
  • MKKRKVLIPLMALSTILVSSTGNLEVIQAEVKQENRLLNESESSSQGLLG
    YYFSDLNFQAPMVVTSSTTGDLSIPSSELENIPSENQYFQSAIWSGFIKV
    KKSDEYTFATSADNHVTMWVDDQEVINKASNSNKIRLEKGRLYQIKIQYQ
    RENPTEKGLDFKLYWTDSQNKKEVISSDNLQLPELKQKSSNSRKKRSTSA
    GPTVPDRDNDGIPDSLEVEGYTVDVKNKRTFLSPWISNIHEKKGLTKYKS
    SPEKWSTASDPYSDFEKVTGRIDKNVSPEARHPLVAAYPIVHVDMENIIL
    SKNEDQSTQNTDSQTRTISKNTSTSRTHTSEVHGNAEVHASFFDIGGSVS
    AGFSNSNSSTVAIDHSLSLAGERTWAETMGLNTADTARLNANIRYVNTGT
    APIYNVLPTTSLVLGKNQTLATIKAKENQLSQILAPNNYYPSKNLAPIAL
    NAQDDFSSTPITMNYNQFLELEKTKQLRLDTDQVYGNIATYNFENGRVRV
    DTGSNWSEVLPQIQETTARIIFNGKDLNLVERRIAAVNPSDPLETTKPDM
    TLKEALKIAFGFNEPNGNLQYQGKDITEFDFNFDQQTSQNIKNQLAELNA
    TNIYTVLDKIKLNAKMNILIRDKRFHYDRNNIAVGADESVVKEAHREVIN
    SSTEGLLLNIDKDIRKILSGYIVEIEDTEGLKEVINDRYDMLNISSLRQD
    GKTFIDFKKYNDKLPLYISNPNYKVNVYAVTKENTIINPSENGDTSTNGI
    KKILIFSKKGYEIG

28
Section 3: Structure-based alignment
  • Entering the twilight zone, and the programs
    that extricate you
  • DALI,
  • CE
  • Identifying molegos
  • MASIA, PCPMer

29
The longer the sequence, the less identity is
needed to have a significant match and probably
the same fold
T(L)
30
(No Transcript)
31
When one goes to very low levels of sequence
identity or similarity (below 30%), it is
difficult to distinguish proteins with similar
structures. This is where alignment scores are
important.
32
DALI is a structure-based site
  • http://www.ebi.ac.uk/dali/
  • http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html
  • Go here and enter a PDB structure name to see
    related structures and structure based
    alignments.
  • You can then put that DALI alignment into MASIA
    or PCPMer to identify molegos, areas of sequence
    and structural conservation

33
Functional decomposition
A simple activity can be divided into many steps
or functions
34
Section 4: Determining function from sequence and
structure
  • PROSITE: http://ca.expasy.org/prosite/
  • PFAM: http://www.sanger.ac.uk/Software/Pfam/
  • BLOCKS: http://blocks.fhcrc.org/
  • PCPMer: http://landau.utmb.edu:8080/WebPCPMer/

35
Methods for identifying motifs in protein
sequences
  • PROSITE: searches for distinctive elements that
    are stored in its library or defined by a user
  • PFAM: blocks of similar sequences and families
  • BLOCKS: library of similar sequences (used to
    derive the BLOSUM scoring matrix); will also
    analyze your sequence alignment for areas of
    similarity
  • PCPMer: a tool specifically designed for
    user-directed sequence and structural
    decomposition
36
MASIA
  • user determines most parameters
  • can search for patterns in aligned sequences, as
    defined in macros
  • can search for alphabetic motifs or those based
    on physical properties of amino acids
  • now being developed with subroutines to locate
    protein homologues based on local pattern
    matching
37
Sequence Alignment based on motifs
Motifs can guide the alignment of sequences from
different families that share discrete functions
that correlate with shared sequence. However,
there are currently no straightforward methods to
do this. We have developed a program suite,
PCPMer, to profile sequences according to
physical-chemical property based motifs.
38
Identification of Motifs in a Protein Family
  • Find the locally conserved segments that meet
    the following conditions:
  • The number of insignificant positions in the
    motif between two significant positions should
    be less than an empirical parameter (G-cutoff).
  • The number of significant residue positions in
    the segment should be more than an empirical
    parameter (L-cutoff).
  • The entropy of each residue in the segment should
    be in the range defined by empirical parameters.
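
The segment conditions above can be sketched as a single pass over per-column significance flags. This is an illustrative reading of the slide, not PCPMer's actual code: the function name is hypothetical, and the flags are assumed to have been set beforehand by testing each column's entropy against the chosen range.

```python
def find_motifs(significant: list, g_cutoff: int, l_cutoff: int) -> list:
    """Return (start, end) motif positions, 1-based and inclusive.

    significant : one boolean per alignment column
    g_cutoff    : runs of insignificant columns inside a motif must be shorter than this
    l_cutoff    : a motif must contain more than this many significant columns
    """
    motifs = []
    start = last = None
    count = 0
    for pos, is_significant in enumerate(significant):
        if not is_significant:
            continue
        if start is not None and pos - last - 1 < g_cutoff:
            # close enough to the previous significant column: extend the motif
            last = pos
            count += 1
        else:
            # gap too long (or first significant column): close and restart
            if start is not None and count > l_cutoff:
                motifs.append((start + 1, last + 1))
            start = last = pos
            count = 1
    if start is not None and count > l_cutoff:
        motifs.append((start + 1, last + 1))
    return motifs


flags = [c == "1" for c in "0011101100000110"]
```

With `g_cutoff=2`, the single insignificant column at position 6 is bridged, while the run of five insignificant columns splits the sequence into two candidate segments; `l_cutoff` then decides whether the short second segment is kept.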

39
(No Transcript)
40
(No Transcript)
41
Adjusting the relative entropy levels to define
motifs
  • Finding the Local sequence Motifs
  • PCPMer can find the local maximum conserved
    region in the given relative entropy range.

ALYEDPPDHKTSPSGKPATLKICSWNVDGLRAWIKKKGLDWVKEEAPDILCLQETKCSENKLPAELQEL  0.50
ALYEDPPDHKTSPSGKPATLKICSWNVDGLRAWIKKKGLDWVKEEAPDILCLQETKCSENKLPAEL---  0.60
ALYEDPPDHKTSPSGK---LKICSWNVDGLRAWIKKKGLDWVKEEAPDILCLQETKCSENKLPAE----  0.70
--YEDPPDHKTSPSGK---LKICSWNVDGLRAWIKKKGLDWVKEEAPDILCLQETKCSENKLP------  0.80
--YEDPPD-----------LKICSWNVDGLRAWIKKKGLDWVKEEAPDILCLQETKC------------  0.90
--YEDPPD-----------LKICSWNVDGLRAWIKKKGLDWVKEEAPDILCLQETKC------------  1.00
--YEDPPD-----------LKICSWNVDGLRAWIKKKGLDWVKEEAPDILCLQETK-------------  1.10
-------------------LKICSWNVDGLRA--------------PDILCLQETK-------------  1.25
-------------------LKICSWNVDGLRA--------------PDILCLQETK-------------

Motif    Start  Sequence       End
MOTIF 1  3      YEDPPD         8
MOTIF 2  20     LKICSWNVDGLRA  32
MOTIF 3  47     PDILCLQETK     56
42
Motif List
No.  Start Position  Motif Sequence      End Position
1    47              GLLGYY              52
2    91              SAIWSGFIKVKKSDEYTF  108
3    118             MWVD                121
4    135             IRLEKGRLYQIKIQY     149
5    162             KLYW                165
6    172             KEVISSDN            179
7    206             DRDNDGIPD           214
8    228             KRTFLS              233
9    244             GLTKYKSSP           252
10   260             DPYSD               264
11   283             PLVAAYP             289
12   293             VDMEN               297
13   388             RLNANIRYVNTG        399
14   449             ALNAQDDFSS          458
15   650             NSSTEGL             656
43
Search Again a Protein Database With Motifs of
the Family
  • Search with a motif against a query sequence.
  • The score of the sequence is that of its
    maximum-scoring window. A Lorentzian-based
    scoring scheme is used to measure the quality of
    fit of a query sequence to a motif at position k
    for the component vector i.

  • Search with the motifs of the family against the
    database.
  • The Bayesian method is applied to decide
    if a given score S for a segment in a query
    sequence is a sufficient match to a motif.

is the average and is the difference
between average scores of a motif of the family
and a protein database.
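
The maximum-score-window idea can be sketched as follows. The exact PCPMer scoring formula is not given in this transcript, so everything here is an assumption: a standard Lorentzian term 1 / (1 + ((x - center) / width)^2) is used for each motif position k and property component i, the terms are averaged, and the window is slid along the query. The descriptor values are made up and use two components rather than five to keep the example short.

```python
def lorentzian(value: float, center: float, width: float) -> float:
    """Equals 1.0 when value == center and decays as the distance grows."""
    return 1.0 / (1.0 + ((value - center) / width) ** 2)


def window_score(window: list, centers: list, widths: list) -> float:
    """Average Lorentzian fit over all positions k and components i."""
    total = 0.0
    terms = 0
    for k, descriptor in enumerate(window):
        for i, value in enumerate(descriptor):
            total += lorentzian(value, centers[k][i], widths[k][i])
            terms += 1
    return total / terms


def best_window(query: list, centers: list, widths: list) -> tuple:
    """Return (best score, start index) over all windows of motif length."""
    length = len(centers)
    best = (float("-inf"), -1)
    for start in range(len(query) - length + 1):
        score = window_score(query[start:start + length], centers, widths)
        if score > best[0]:
            best = (score, start)
    return best


# Hypothetical two-position motif and four-residue query (descriptor vectors)
motif_centers = [[1.0, 0.0], [0.0, 1.0]]
motif_widths = [[0.5, 0.5], [0.5, 0.5]]
query_vectors = [[5.0, 5.0], [1.0, 0.0], [0.0, 1.0], [-3.0, 2.0]]
```

Here the window starting at index 1 matches the motif centers exactly, so it receives the maximum possible score of 1.0. Deciding whether a given score is significant would then require comparing it with the score distribution over a database, as the slide describes.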
44
Stereochemical variability plot of DV2env. This
plot shows the position specific entropy, a
measure for variability.
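
A position-specific entropy can be sketched with plain Shannon entropy over residue identities in each alignment column. This is a simpler stand-in for the property-based measure plotted on the slide, and the function names are assumptions made for the example.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (bits) of the residues in one column, ignoring gaps."""
    residues = [res for res in column if res != "-"]
    entropy = 0.0
    for count in Counter(residues).values():
        fraction = count / len(residues)
        entropy -= fraction * math.log2(fraction)
    return entropy


def alignment_entropy(aligned: list) -> list:
    """Entropy of every column of an alignment given as equal-length strings."""
    return [column_entropy("".join(column)) for column in zip(*aligned)]
```

A fully conserved column gives 0 bits, and a column split evenly between two residues gives 1 bit, so low values mark the conserved positions from which motifs are drawn.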
45
PCP-motifs (blue) common to all flaviviruses,
located on the envelope protein of Dengue virus
serotype 2. The arrow shows the fusion peptide;
note the high conservation of residues on two
loops near it that are far away in the sequence.
46
Exercises (homework 5)
  • For the homework, answer the short questions and
    run a BLAST and PsiBLAST search using your
    project sequence as query (for those who have a
    project). Sample sequences will be assigned for
    others in the course who are not doing projects.
    Collect at least 6 related sequences and align
    them with CLUSTALW, trimming sequences if
    necessary so that the program can find the true
    areas of similarity.
  • Keep your alignments, etc. as Word files, as in a
    future homework we will submit the alignment to
    PCPMer to define PCP-motifs. Use the motifs to
    scan the ASTRAL40 database for similar proteins
    of known structure, and to help in identifying
    functional areas of the sequence/structure.
  • Also do a BLAST search of the PDB to find
    proteins of known structure related to your
    protein. Use your motif lists to determine if any
    of these are possible templates.
  • Submit your alignments and motif lists by next
    Monday and include them with your project. Do not
    submit your blast searches. Be sure you answer
    all the questions about your assigned protein.
    The 10 homework problems will be individually
    scored, so missing one or two will automatically
    bring your grade down.
  • Note: anyone caught printing out unedited BLAST
    searches will be charged .10/page.

47
Sequences for homework problems (for those with
no project)
  • gthwno1
  • MAWSANKAAVVLCMDVGVAMGNSFPGEESSFEQAKKVMTMFVQRQVFSES
    KDEIALVLFGTDNTNNALASEDQYQNITVHRHLMLPDFDLLEDIESKIQL
    GSRQADILDALIVCMDLIQRETIGKKFEKKHIEVFTDLSSPFSQDQLDVI
    ICNLKKSGISLQFFLPFPISKNDETGDRGDGDLGLDHCGPSFPQKGITEQ
    QKEGICMVERVMVSLEGEDGLDEIYSFSESLRRLCVFKKIERRSMPWSCQ
    LTIGPDLSIKIVAYKSIVQEKVKKSWIVVDARTLKKEDIRKETVYCLNDD
    DETEVSKEDTIQGFRYGSDIIPFSKVDEEQMKYKSEGKCFSVLGFCRSSQ
    VHRRFFMGYQVLKVFAAKDDEAAAVALSSLIHALDELNMVAIVRYAYDKR
    ANPQVGVAFPYIKDSYECLVYVQLPFMEDLRQYMFSSLKNNKKCTPTEAQ
    LSAIDDLIESMSLVKKSEEEDTIEDLFPTSKIPNPEFQRFFQCLLHRVLH
    PQERLPPIQQHILNMLNLPTEMKAKCEIPLSKVRTLFPLTEAVKKKDQVT
    AQDIFQDIHEEGPAAKKCKTEKEEGHISISSVAEGNVTKVGSVNPVESFR
    VLVRQKIASFEQASLQLISHIEQFLDTNETLYFMKSMECIKAFREEAIQF
    SEEQRFNSFLEALREKVEIKQLNHFWEIVVQDGVTLITKDEGSGSSVTTE
    EATKFLAPKDKAKEDAAGLEEGGDVDDLLDMI
  • gthwno2
  • MTRNKFIPNKFSIISFSVLLFAISSSQAIEVNAMNEHYTESDIKRNHKTE
    KNKTEKEKFKDSINNLVKTEFTNETLDKIQQTQGLLKKIPKDVLEIYSEL
    GGEIYFTDIDLVEHKELQDLSEEEKNSMNSRGEKVPFASRFVFEKKRETP
    KLIINIKDYAINSEQSKEVYYEIGKGISLDIISKDKSLDPEFLNLIKSLS
    DDSDSSDLLFSQKFKEKLELNNKSIDINFIKENLTEFQHAFSLAFSYYFA
    PDHRTVLELYAPDMFEYMNKLEKGGFEKISESLKKEGVEKDRIDVLKGEK
    ALKASGLVPEHADAFKKIARELNTYILFRPVNKLATNLIKSGVATKGLNV
    HVKSSDWGPVAGYIPFDQDLSKKHGQQLAVEKGNLENKKSITEHEGEIGK
    IPLKLDHLRIEELKENGIILKGKKEIDNGKKYYLLESNNQVYEFRISDEN
    NEVQYKTKEGKITVLGEKFNWRNIEVMAKNVEGVLKPLTADYDLFALAPS
    LTEIKKQIPQKEWDKVVNTPNSLEKQKGVTNLLIKYGIERKPDSTKGTLS
    NWQKQMLDRLNEAVKYTGYTGGDVVNHGTEQDNEEFPEKDNEIFIINPEG
    EFILTKNWEMTGRFIEKNITGKDYLYYFNRSYNKIAPGNKAYIEWTDPIT
    KAKINTIPTSAEFIKNLSSIRRSSNVGVYKDSGDKDEFAKKESVKKIAGY
    LSDYYNSANHIFSQEKKRKISIFRGIQAYNEIENVLKSKQIAPEYKNYFQ
    YLKERITNQVQLLLTHQKSNIEFKLLYKQLNFTENETDNFEVFQKIIDEK
  • gthwno3
  • MKIQMRNKKVLSFLTLTAIVSQALVYPVYAQTSTSNHSNKKKEIVNEDIL
    PNNGLMGYYFSDEHFKDLKLMAPIKDGNLKFEEKKVDKLLDKDKSDVKSI
    RWTGRIIPSKDGEYTLSTDRDDVLMQVNTESTISNTLKVNMKKGKEYKVR
    IELQDKNLGSIDNLSSPNLYWELDGMKKIIPEENLFLRDYSNIEKDDPFI
    PNNNFFDPKLMSDWEDEDLDTDNDNIPDSYERNGYTIKDLIAVKWEDSFA
    EQGYKKYVSNYLESNTAGDPYTDYEKASGSFDKAIKTEARDPLVAAYPIV
    GVGMEKLIISTNEHASTDQGKTVSRATTNSKTESNTAGVSVNVGYQNGFT
    ANVTTNYSHTTDNSTAVQDSNGESWNTGLSINKGESAYINANVRYYNTGT
    APMYKVTPTTNLVLDGDTLSTIKAQENQIGNNLSPGDTYPKKGLSPLALN
    TMDQFSSRLIPINYDQLKKLDAGKQIKLETTQVSGNFGTKNSSGQIVTEG
    NSWSDYISQIDSISASIILDTENESYERRVTAKNLQDPEDKTPELTIGEA
    IEKAFGATKKDGLLYFNDIPIDESCVELIFDDNTANKIKDSLKTLSDKKI
    YNVKLERGMNILIKTPTYFTNFDDYNNYPSTWSNVNTTNQDGLQGSANKL
    NGETKIKIPMSELKPYKRYVFSGYSKDPLTSNSIIVKIKAKEEKTDYLVP
    EQGYTKFSYEFETTEKDSSNIEITLIGSGTTYLDNLSITELNSTPEILDE
    PEVKIPTDQEIMDAHKIYFADLNFNPSTGNTYINGMYFAPTQTNKEALDY
    IQKYRVEATLQYSGFKDIGTKDKEMRNYLGDPNQPKTNYV
    NLRSYFTGGENIMTYKKLRIYAITPDDRELLVLSVD
  • gthwno4
  • MNTINIAKNDFSDIELAAIPFNTLADHYGERLAREQLALEHESYEMGEAR
    FRKMFERQLKAGEVADNAAAKPLITTLLPKMIARINDWFEEVKAKRGKRP
    TAFQFLQEIKPEAVAYITIKTTLACLTSADNTTVQAVASAIGRAIEDEAR
    FGRIRDLEAKHFKKNVEEQLNKRVGHVYKKAFMQVVEADMLSKGLLGGEA
    WSSWHKEDSIHVGVRCIEMLIESTGMVSLHRQNAGVVGQDSETIELAPEY
    AEAIATRAGALAGISPMFQPCVVPPKPWTGITGGGYWANGRRPLALVRTH
    SKKALMRYEDVYMPEVYKAINIAQNTAWKINKKVLAVANVITKWKHCPVE
    DIPAIEREELPMKPEDIDMNPEALTAWKRAAAAVYRKDKARKSRRISLEF
    MLEQANKFANHKAIWFPYNMDWRGRVYAVSMFNPQGNDMTKGLLTLAKGK
    PIGKEGYYWLKIHGANCAGVDKVPFPERIKFIEENHENIMACAKSPLENT
    WWAEQDSPFCFLAFCFEYAGVQHHGLSYNCSLPLAFDGSCSGIQHFSAML
    RDEVGGRAVNLLPSETVQDIYGIVAKKVNEILQADAINGTDNEVVTVTDE
    NTGEISEKVKLGTKALAGQWLAYGVTRSVTKRSVMTLAYGSKEFGFRQQV
    LEDTIQPAIDSGKGLMFTQPNQAAGYMAKLIWESVSVTVVAAVEAMNWLK
    SAAKLLAAEVKDKKTGEILRKRCAVHWVTPDGFPVWQEYKKPIQTRLNLM
    FLGQFRLQPTINTNKDSEIDAHKQESGIAPNFVHSQDGSHLRKTVVWAHE
    KYGIESFALIHDSFGTIPADAANLFKAVRETMVDTYESCDVLADFYDQFA
    DQLHESQLDKMPALPAKGNLNLRDILESDFAFA

48
PROSITE: Or what makes my protein special?
  • motifs that distinguish members of a protein
    family or superfamily
  • motifs are usually hand edited by experts working
    with proteins in the family, familiar with its
    activities and the areas needed for activity
  • searching requires a set format and specification
    of a motif in PROSITE script, e.g.,
    PKAPTGKPQKGRAKAEAT
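
PROSITE pattern syntax maps closely onto regular expressions, which the following sketch exploits. It covers only the common elements (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, < and > for the sequence ends) and is a hypothetical helper, not part of any PROSITE tool. The example pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split "x(2,4)" into core "x", low "2", high "4"
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


glyco_regex = prosite_to_regex("N-{P}-[ST]-{P}")
glyco_match = re.search(glyco_regex, "AANGSA")
```

The resulting expression can be used with `re.finditer` to list every occurrence of the motif in a sequence. Because such short patterns occur frequently by chance, a match is a hint rather than proof of function.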
49
PAM: point accepted mutation
  • PAM scores are derived from alignments of
    closely related sequences, i.e., proteins whose
    function is known to be the same (hemoglobin,
    cytochrome c, ribosomal proteins, RNase A...)
    from many organisms. The original PAM scoring
    matrix was derived by Margaret Dayhoff, a pioneer
    in sequence analysis.
  • Numbers may be expressed in terms of
    time-dependent probability matrices (P(t)). One
    PAM unit is the time required to achieve an
    average change of 1% in the amino acid positions.
    The original aim was to relate observed changes
    to the evolutionary distance between organisms,
    as reflected by the geological record. Thus PAM
    units may be expressed in millions of years of
    evolution.
  • PAM250 corresponds to a more diverse sequence
    alignment than PAM100. Higher-numbered PAM
    matrices are extrapolated from the PAM1 matrix
    by repeated multiplication.
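
How a PAM1 probability matrix extends to larger distances can be sketched with a toy two-letter alphabet. The real PAM1 matrix is 20 x 20 and built from Dayhoff's data; the 2 x 2 matrix below is a made-up stand-in in which each letter has a 1% chance of changing per PAM unit, and the helper names are assumptions.

```python
def mat_mul(a: list, b: list) -> list:
    """Multiply two square matrices given as nested lists."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size))
             for j in range(size)]
            for i in range(size)]


def mat_power(matrix: list, n: int) -> list:
    """Raise a square matrix to the n-th power by repeated squaring."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    base = matrix
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Toy "PAM1": 99% chance a position is unchanged after one PAM unit
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_power(TOY_PAM1, 250)
```

In this toy model the probability of finding the original letter after 250 units is close to one half, the value expected by chance with two letters. This illustrates why high-numbered PAM matrices suit distant comparisons: repeated and reverse changes mean that 250 PAM units leave far more than zero identity.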

50
BLOSUM: the BLOcks SUbstitution Matrix.
  • uses the BLOCKS database to search for
    differences among sequences, but only among the
    very conserved regions of a protein family.
  • Should give a better substitution matrix for more
    distantly related sequences than the PAM
    matrices. Also, as PAM is limited to proteins of
    known function for its derivation, you have more
    sequences contributing to the BLOSUM numbers.
  • Scores are derived from alignments of distantly
    related sequences, without regard to function.
  • The sequence alignments are from the BLOCKS
    database, with the numerical value derived from
    the cutoff value for the diversity of the
    sequences.
  • BLOSUM62 (sequences more than 62% identical are
    clustered and counted as one) will be drawn from
    a less diverse sequence alignment than BLOSUM35
    (where the clustering threshold is 35% identity).

51
BLOSUM62 log odds matrix

52
Other scoring matrices
  • Gaston Gonnet and coworkers derived a matrix
    much like PAM250 by using pairwise alignments of
    all the sequences known in 1992, in an iterative
    fashion starting with alignments based on PAM250.
    They noted that their results were different when
    they used closely related sequence alignments vs.
    more distantly related ones.
  • Identity matrix: sort of the original, but only
    useful if it is scored according to the frequency
    of occurrence of amino acids in the database.

53
Log Odds Gonnet Matrix

54
Abalone pheromone example
  • I was sent the following sequence for a
    high-abundance, 51-residue protein in abalone:
  • AFSCDPECFSYLDGPFCISNGSVVCDNCLRDRMLCEDMSLTSKDCSAPCD
    K
  • The cloners told me they could find no matches
    for this with BLAST. So what do I do with it?
  • First, a repeat of BLAST with default conditions,
    even in my hands, gave nothing.

55
Using BLAST to find short exact matches
This match, from a shrimp protease inhibitor,
looked especially promising.
  • However, there were lots of other matches and I
    could not decide among them, so I sent the
    sequence to fold recognition servers

56
Fold recognition helps to sort things out
  •  7.96e00 18 1TBR   Chain R   52
  • The first choice from the servers was the Kazal
    protease inhibitor from insects, for which there
    already is a structure (PDB file 1TBR)