Title: Sequence analysis: Macromolecular motif recognition
1Sequence analysisMacromolecular motif
recognition
2DNA sequence
Automatic translation
Amino acid primary sequence
Physico-chemical properties (e. g., using EMBOSS
suite)
Primary db searches FASTA, BLAST
1. Search for sequence homologue(s) and construct
an alignment
2. Homologue(s) with known 3D structure?
Homology modelling
available
3. Motif recognition Search secondary databases
Secondary structure prediction
Fold assignment
3Terminology
Terminology
- Motif the biological object one attempts to
model - a functional or structural domain, active
site, phosphorylation site etc. -
- Pattern a qualitative motif description based on
a regular expression-like syntax -
- Profile a quantitative motif description -
assigns a degree of similarity to a potential
match
4Active site recognition
EXAMPLE CATHEPSIN A PEPTIDASE FAMILY S10 EC
3.4.16.5
3-D representation
3D profile (PROCAT)
5Active site motifs
Conserved seq patterns
1ac5
438LTFVSVYNASHMVPFDKS455
1ivy
419IAFLTIKGAGHMVPTDKP436
6Domain recognition
Kringle domain from plasminogen protein
EGF-like domain from coagulation factor X
7Macromolecular motif recognition
- Why search for motifs?
- to find homologous sequences
- apply existing information to new sequence
- find functionally important sites
- to find templates for homology modelling -lecture
on homology modelling
8Different analysis methods
Percent identity Method
100 90 80 70 60 50 40 30 20 10
0
Automatic pairwise Alignment BLAST, Fasta)
Macromolecular motif recognition
Twilight zone
Structure prediction
Midnight zone
9Macromolecular motif recognition
- What do we need?
- Method for defining motifs
- Algorithm for finding them
- Statistics to evaluate matches
10Macromolecular motif recognition
- Methods for defining motifs
- Regular expression (patterns)
- Profiles
- Hidden Markov Model (HMM)
11Macromolecular motif recognition
1-D representation Primary amino acid
sequence MIRAAPPPLFLLLLLLLLLVSWASRGEAAPDQDEIQRLPGL
AKQPSFRQYSGYLKSSGSKHLHYWFVESQKDPENSPVVLWLNGGPGCSSL
DGLLTEHGPFLVQPDGVTLEYNPYSWNLIANVLYLESPAGVGFSYSDDKF
YATNDTEVAQSNFEALQDFFRLFPEYKNNKL...
Query secondary databases over the Internet
Computational sequence analysis
http//www.ebi.ac.uk/interpro/
12Macromolecular motif recognition
single motif
exact regular expression (PROSITE)
full domain alignment
profile (PROSITE)
Hidden Markov Model (Pfam, PROSITE)
residue frequency matrices (PRINTS)
multiple motifs
13Active site motifs
Conserved seq patterns
1ac5
438LTFVSVYNASHMVPFDKS455
1ivy
419IAFLTIKGAGHMVPTDKP436
14Motif modelling methods
Prosite Regular
expressions CARBOXYPEPT_SER_HIS LIVF-x(2)-LIVS
TA-x-IVPST-x-GSDNQL-SAGV-SG-H-x-IVAQ-P-
x(3)-PSA Regular expressions represent
features by logical combinations of characters. A
regular expression defines a sequence pattern to
be matched.
15Regular expressions contd.
- Basic rules for regular expressions
- Each position is separated by a hyphen -
- A symbol X is a regular expression matching
itself - x means any residue
- surround ambiguities - a string XYZ
matches any of the enclosed symbols - A string R matches any number of strings
that match - surround forbidden residues
- ( ) surround repeat counts
- Model formation
- Restricted to key conserved features in order to
reduce the noise level - Built by hand in a stepwise fashion from multiple
alignments -
16Regular expressions contd.
Regular expressions, such as PROSITE patterns,
are matched to primary amino acid sequences using
finite state automata.
all-or-none
17Motif modelling methods
Prints Residue
frequency matrices Motif 1 NPESWTNFANMLW NPYSWV
NLTNVLW REYSWHQNHHMIY
NEGSWISKGDLLF NPYSWTNLTNVVY
NEYSWNKMASVVY
NDFGWDQESNLIY NENSWNNYANMIY
NEYGWDQVSNLLY
NPYAWSKVSTMIY NPYSWNGNASIIY
NEYAWNKFANVLF
NPYSWNRVSNILY NPYSWNLIANVLY
NEYRWNKVANVLF
Motif 2
LDQPFGTGYSQ
VDNPVGAGFSY
VDQPVGTGFSL
VDQPGGTGFSS
IDNPVGTGFSF
IDQPTGTGFSV
VDQPLGTGYSY
IDQPAGTGFSP
LESPIGVGFSY
LDQPVGSGFSY
LDQPVGSGFSY
LDQPINTGFSN
LDQPIGAGFSY
LDAPAGVGFSY
LDQPVGAGFSY
Motif 3 FFQHFPEYQTNDFHIAGESY
AGHYIP FFNKFPEYQNRPFYITGESYGGI
YVP WVERFPEYKGRDFYIVGESYAGNGLM
FLSKFPEYKGRDFWITGESYAGVYIP
WFQLYPEFLSNPFYIAGESYAGVYVP
FFEAFPHLRSNDFHIAGESYAGHYIP
FFRLFPEYKDNKLFLTGESYAGIYIP
FLTRFPQFIGRETYLAGESYGGVYVP
FFNEFPQYKGNDFYVTGESYGGIYVP
WMSRFPQYQYRDFYIVGESYAGHYVP
FFRLFPEYKNNKLFLTGESYAGIYIP
FFRLFPEYKNNKLFLTGESYAGIYIP
WLERFPEYKGREFYITGESYAGHYVP
WMSRFPQYRYRDFYIVGESYAGHYVP
WFEKFPEHKGNEFYIAGESYAGIYVP
Motif 4 LAFTLSNSVGHMAP
LQFWWILRAGHMVA
LMWAETFQSGHMQP
LTYVRVYNSSHMVP
LQEVLIRNAGHMVP
LTFVSVYNASHMVP
LTFARIVEASHMVP
LTFSSVYLSGHEIP
IDVVTVKGSGHFVP
MTFATIKGSGHTAE
MTFATIKGGGHTAE
FGYLRLYEAGHMVP
MTFATVKGSGHTAE
ITLISIKGGGHFPA
MTFATVKGSGHTAE
- a collection of protein fingerprints that
exploit groups of motifs to build characteristic
family signatures - motifs are encoded in ungapped raw sequence
format - different scoring methods may be superimposed
onto the data, e. .g. BLAST - improved diagnostic reliability
- mutual context provided by motif neighbours
-
18Motif modelling methods
Prosite Profiles Feature is
represented as a matrix with a score for every
possible character. Matrix is derived from a
sequence alignment, e.g. F K L L S H
C L L V F K A F
G Q T M F Q Y P I V G Q E
L L G F P V V K E A I L K F
K V L A A V I A D L E F I
S E C I I Q
19Profiles contd.
Derived matrix A -18 -10 -1 -8 8 -3
3 -10 -2 -8 C -22 -33 -18 -18 -22
-26 22 -24 -19 -7 D -35 0 -32 -33
-7 6 -17 -34 -31 0 E -27 15 -25
-26 -9 23 -9 -24 -23 -1 F 60 -30
12 14 -26 -29 -15 4 12 -29 G -30
-20 -28 -32 28 -14 -23 -33 -27 -5 H
-13 -12 -25 -25 -16 14 -22 -22 -23 -10 I
3 -27 21 25 -29 -23 -8 33 19 -23
K -26 25 -25 -27 -6 4 -15 -27 -26 0
L 14 -28 19 27 -27 -20 -9 33 26
-21 M 3 -15 10 14 -17 -10 -9 25
12 -11 N -22 -6 -24 -27 1 8 -15
-24 -24 -4 P -30 24 -26 -28 -14 -10
-22 -24 -26 -18 Q -32 5 -25 -26 -9
24 -16 -17 -23 7 R -18 9 -22 -22
-10 0 -18 -23 -22 -4 S -22 -8 -16
-21 11 2 -1 -24 -19 -4 T -10 -10
-6 -7 -5 -8 2 -10 -7 -11 V 0
-25 22 25 -19 -26 6 19 16 -16 W
9 -25 -18 -19 -25 -27 -34 -20 -17 -28 Y
34 -18 -1 1 -23 -12 -19 0 0 -18
Alignment positions
20Profiles contd.
- inclusion of all possible information to maximise
overall signal of protein/domain - i. e., a full representation of features in the
aligned sequences - can detect distant relationships with only few
well conserved residues - position-dependent weights/penalties for all 20
amino acids -- BASED ON AMINO ACID SUBSTITUTION
MATRICES -- and for gaps and insertions - dynamic programming algorithms for scoring hits
21Macromolecular motif recognition
- Pfam and Prosite Hidden Markov Models (HMMs)
- Feature is represented by a probabilistic model
of interconnecting match, delete or insert states - contains statistical information on observed and
expected positional variation - platonic
ideal of protein family -
Di
Ii
B
E
Mi
22Macromolecular motif recognition
Pfam and Prosite Hidden Markov
Models (HMMs)
P of a given amino acid to occurs in a
particular state (M, I, D) - at particular
position in sequence (for all 20, profile-like)
P of transition state
Di
Ii
B
E
Mi
23Statistical significance
- Statistical tests aim to assess the likelihood
that a match of a query sequence to a profile,
regular expression, HMM, etc, is the result of
chance. - They control for such factors as sequence (match)
length, amino acid composition and size of the
database searched.
24Statistical significance
- log-odds score this number is the log of the
ratio between two probabilities - P that the
sequence belongs to the positive set, and P that
the result was obtained by chance due to the
amino acid distribution in the positive set
(random model). - Z-score one needs to estimate an average score
and a standard deviation as a function of
sequence length. Then, one uses the number of
standard deviations each sequence is away from
the average as the score. - e-value (Expect value) given a database search
result with alignment score S, the e-value is the
expected number of sequences of score gt S that
would be found by random chance. - p-value the probability that one or more
sequences of score gt S would have been found
randomly.
25INTERPRO
- The InterPro database allows efficient searching
- An integrated annotation resource for protein
families, domains and functional sites that
amalgamates the efforts of the PROSITE, PRINTS,
Pfam, ProDom, SMART and TIGRFAMs secondary
database projects. - http//www.ebi.ac.uk/interpro
26(No Transcript)