Sequence analysis: Macromolecular motif recognition - PowerPoint PPT Presentation

1 / 26
About This Presentation
Title:

Sequence analysis: Macromolecular motif recognition

Description:

Sequence analysis: Macromolecular motif recognition Sylvia Nagl Sequence analysis: Macromolecular motif recognition Sylvia Nagl Amino acid primary sequence 2. – PowerPoint PPT presentation

Number of Views:213
Avg rating:3.0/5.0
Slides: 27
Provided by: bioche2
Category:

less

Transcript and Presenter's Notes

Title: Sequence analysis: Macromolecular motif recognition


1
Sequence analysisMacromolecular motif
recognition
  • Sylvia Nagl

2
DNA sequence
Automatic translation
Amino acid primary sequence
Physico-chemical properties (e. g., using EMBOSS
suite)
Primary db searches FASTA, BLAST
1. Search for sequence homologue(s) and construct
an alignment
2. Homologue(s) with known 3D structure?
Homology modelling
available
3. Motif recognition Search secondary databases
Secondary structure prediction
Fold assignment
3
Terminology
Terminology
  • Motif the biological object one attempts to
    model - a functional or structural domain, active
    site, phosphorylation site etc.
  • Pattern a qualitative motif description based on
    a regular expression-like syntax
  • Profile a quantitative motif description -
    assigns a degree of similarity to a potential
    match

4
Active site recognition
EXAMPLE CATHEPSIN A PEPTIDASE FAMILY S10 EC
3.4.16.5
3-D representation
3D profile (PROCAT)
5
Active site motifs
Conserved seq patterns
1ac5
438LTFVSVYNASHMVPFDKS455
1ivy
419IAFLTIKGAGHMVPTDKP436
6
Domain recognition
Kringle domain from plasminogen protein
EGF-like domain from coagulation factor X
7
Macromolecular motif recognition
  • Why search for motifs?
  • to find homologous sequences
  • apply existing information to new sequence
  • find functionally important sites
  • to find templates for homology modelling -lecture
    on homology modelling

8
Different analysis methods
Percent identity Method
100 90 80 70 60 50 40 30 20 10
0
Automatic pairwise Alignment BLAST, Fasta)
Macromolecular motif recognition
Twilight zone
Structure prediction
Midnight zone
9
Macromolecular motif recognition
  • What do we need?
  • Method for defining motifs
  • Algorithm for finding them
  • Statistics to evaluate matches

10
Macromolecular motif recognition
  • Methods for defining motifs
  • Regular expression (patterns)
  • Profiles
  • Hidden Markov Model (HMM)

11
Macromolecular motif recognition
1-D representation Primary amino acid
sequence MIRAAPPPLFLLLLLLLLLVSWASRGEAAPDQDEIQRLPGL
AKQPSFRQYSGYLKSSGSKHLHYWFVESQKDPENSPVVLWLNGGPGCSSL
DGLLTEHGPFLVQPDGVTLEYNPYSWNLIANVLYLESPAGVGFSYSDDKF
YATNDTEVAQSNFEALQDFFRLFPEYKNNKL...
Query secondary databases over the Internet
Computational sequence analysis
http//www.ebi.ac.uk/interpro/
12
Macromolecular motif recognition


single motif
exact regular expression (PROSITE)
full domain alignment
profile (PROSITE)
Hidden Markov Model (Pfam, PROSITE)
residue frequency matrices (PRINTS)
multiple motifs
13
Active site motifs
Conserved seq patterns
1ac5
438LTFVSVYNASHMVPFDKS455
1ivy
419IAFLTIKGAGHMVPTDKP436
14
Motif modelling methods
Prosite Regular
expressions CARBOXYPEPT_SER_HIS LIVF-x(2)-LIVS
TA-x-IVPST-x-GSDNQL-SAGV-SG-H-x-IVAQ-P-
x(3)-PSA Regular expressions represent
features by logical combinations of characters. A
regular expression defines a sequence pattern to
be matched.  



15
Regular expressions contd.
  • Basic rules for regular expressions
  • Each position is separated by a hyphen -
  • A symbol X is a regular expression matching
    itself
  • x means any residue
  • surround ambiguities - a string XYZ
    matches any of the enclosed symbols
  • A string R matches any number of strings
    that match
  • surround forbidden residues
  • ( ) surround repeat counts
  • Model formation
  • Restricted to key conserved features in order to
    reduce the noise level
  • Built by hand in a stepwise fashion from multiple
    alignments




16
Regular expressions contd.
Regular expressions, such as PROSITE patterns,
are matched to primary amino acid sequences using
finite state automata.
all-or-none
17
Motif modelling methods
Prints Residue
frequency matrices Motif 1 NPESWTNFANMLW NPYSWV
NLTNVLW REYSWHQNHHMIY
NEGSWISKGDLLF NPYSWTNLTNVVY
NEYSWNKMASVVY
NDFGWDQESNLIY NENSWNNYANMIY
NEYGWDQVSNLLY
NPYAWSKVSTMIY NPYSWNGNASIIY
NEYAWNKFANVLF
NPYSWNRVSNILY NPYSWNLIANVLY
NEYRWNKVANVLF
Motif 2
LDQPFGTGYSQ
VDNPVGAGFSY
VDQPVGTGFSL
VDQPGGTGFSS
IDNPVGTGFSF
IDQPTGTGFSV
VDQPLGTGYSY
IDQPAGTGFSP
LESPIGVGFSY
LDQPVGSGFSY
LDQPVGSGFSY
LDQPINTGFSN
LDQPIGAGFSY
LDAPAGVGFSY
LDQPVGAGFSY
Motif 3 FFQHFPEYQTNDFHIAGESY
AGHYIP FFNKFPEYQNRPFYITGESYGGI
YVP WVERFPEYKGRDFYIVGESYAGNGLM
FLSKFPEYKGRDFWITGESYAGVYIP
WFQLYPEFLSNPFYIAGESYAGVYVP
FFEAFPHLRSNDFHIAGESYAGHYIP
FFRLFPEYKDNKLFLTGESYAGIYIP
FLTRFPQFIGRETYLAGESYGGVYVP
FFNEFPQYKGNDFYVTGESYGGIYVP
WMSRFPQYQYRDFYIVGESYAGHYVP
FFRLFPEYKNNKLFLTGESYAGIYIP
FFRLFPEYKNNKLFLTGESYAGIYIP
WLERFPEYKGREFYITGESYAGHYVP
WMSRFPQYRYRDFYIVGESYAGHYVP
WFEKFPEHKGNEFYIAGESYAGIYVP
Motif 4 LAFTLSNSVGHMAP
LQFWWILRAGHMVA
LMWAETFQSGHMQP
LTYVRVYNSSHMVP
LQEVLIRNAGHMVP
LTFVSVYNASHMVP
LTFARIVEASHMVP
LTFSSVYLSGHEIP
IDVVTVKGSGHFVP
MTFATIKGSGHTAE
MTFATIKGGGHTAE
FGYLRLYEAGHMVP
MTFATVKGSGHTAE
ITLISIKGGGHFPA
MTFATVKGSGHTAE
  • a collection of protein fingerprints that
    exploit groups of motifs to build characteristic
    family signatures
  • motifs are encoded in ungapped raw sequence
    format
  • different scoring methods may be superimposed
    onto the data, e. .g. BLAST
  • improved diagnostic reliability
  • mutual context provided by motif neighbours

18
Motif modelling methods
Prosite Profiles Feature is
represented as a matrix with a score for every
possible character. Matrix is derived from a
sequence alignment, e.g. F K L L S H
C L L V F K A F
G Q T M F Q Y P I V G Q E
L L G F P V V K E A I L K F
K V L A A V I A D L E F I
S E C I I Q
19
Profiles contd.
Derived matrix A -18 -10 -1 -8 8 -3
3 -10 -2 -8 C -22 -33 -18 -18 -22
-26 22 -24 -19 -7 D -35 0 -32 -33
-7 6 -17 -34 -31 0 E -27 15 -25
-26 -9 23 -9 -24 -23 -1 F 60 -30
12 14 -26 -29 -15 4 12 -29 G -30
-20 -28 -32 28 -14 -23 -33 -27 -5 H
-13 -12 -25 -25 -16 14 -22 -22 -23 -10 I
3 -27 21 25 -29 -23 -8 33 19 -23
K -26 25 -25 -27 -6 4 -15 -27 -26 0
L 14 -28 19 27 -27 -20 -9 33 26
-21 M 3 -15 10 14 -17 -10 -9 25
12 -11 N -22 -6 -24 -27 1 8 -15
-24 -24 -4 P -30 24 -26 -28 -14 -10
-22 -24 -26 -18 Q -32 5 -25 -26 -9
24 -16 -17 -23 7 R -18 9 -22 -22
-10 0 -18 -23 -22 -4 S -22 -8 -16
-21 11 2 -1 -24 -19 -4 T -10 -10
-6 -7 -5 -8 2 -10 -7 -11 V 0
-25 22 25 -19 -26 6 19 16 -16 W
9 -25 -18 -19 -25 -27 -34 -20 -17 -28 Y
34 -18 -1 1 -23 -12 -19 0 0 -18
Alignment positions
20
Profiles contd.
  • inclusion of all possible information to maximise
    overall signal of protein/domain
  • i. e., a full representation of features in the
    aligned sequences
  • can detect distant relationships with only few
    well conserved residues
  • position-dependent weights/penalties for all 20
    amino acids -- BASED ON AMINO ACID SUBSTITUTION
    MATRICES -- and for gaps and insertions
  • dynamic programming algorithms for scoring hits

21
Macromolecular motif recognition
  • Pfam and Prosite Hidden Markov Models (HMMs)
  • Feature is represented by a probabilistic model
    of interconnecting match, delete or insert states
  • contains statistical information on observed and
    expected positional variation - platonic
    ideal of protein family

Di
Ii
B
E
Mi
22
Macromolecular motif recognition
Pfam and Prosite Hidden Markov
Models (HMMs)
P of a given amino acid to occurs in a
particular state (M, I, D) - at particular
position in sequence (for all 20, profile-like)
P of transition state
Di
Ii
B
E
Mi
23
Statistical significance
  • Statistical tests aim to assess the likelihood
    that a match of a query sequence to a profile,
    regular expression, HMM, etc, is the result of
    chance.
  • They control for such factors as sequence (match)
    length, amino acid composition and size of the
    database searched.

24
Statistical significance
  • log-odds score this number is the log of the
    ratio between two probabilities - P that the
    sequence belongs to the positive set, and P that
    the result was obtained by chance due to the
    amino acid distribution in the positive set
    (random model).
  • Z-score one needs to estimate an average score
    and a standard deviation as a function of
    sequence length. Then, one uses the number of
    standard deviations each sequence is away from
    the average as the score.
  • e-value (Expect value) given a database search
    result with alignment score S, the e-value is the
    expected number of sequences of score gt S that
    would be found by random chance.
  • p-value the probability that one or more
    sequences of score gt S would have been found
    randomly.

25
INTERPRO
  • The InterPro database allows efficient searching
  • An integrated annotation resource for protein
    families, domains and functional sites that
    amalgamates the efforts of the PROSITE, PRINTS,
    Pfam, ProDom, SMART and TIGRFAMs secondary
    database projects.
  • http//www.ebi.ac.uk/interpro

26
(No Transcript)
Write a Comment
User Comments (0)
About PowerShow.com