Remote Homology Detection: A Motif-Based Approach

Provided by: siddha
Transcript and Presenter's Notes

1
Remote Homology Detection: A Motif-Based Approach
  • CS 6890 Bioinformatics - Dr. Yan
  • Swati Adhau
  • 04/14/06

2
Outline of the Presentation
  • Motivation
  • Introduction
  • Description (Remote Homology Detection)
  • Methods
  • Results and Discussion
  • Q and A

3
Motivation
  • Remote homology detection is the problem of
    detecting homology when sequence similarity
    is low.
  • The paper presents a method for detecting remote
    homology based on the presence of discrete
    sequence motifs.
  • The motif content of a pair of sequences is used
    to define a similarity measure that serves as a
    kernel for a support vector machine (SVM)
    classifier.

4
  • The method is tested on two remote homology
    detection tasks:
  • 1) Prediction of a previously unseen SCOP
    (Structural Classification of Proteins) family.
  • 2) Prediction of an enzyme class given other
    enzymes that have a similar function on other
    substrates.

5
Introduction
  • Protein homology detection is one of the most
    important problems in computational biology.
  • Homology is generally established by sequence
    similarity.
  • Two established methods:
  • 1) The Smith-Waterman algorithm
  • 2) BLAST
  • Protein sequence motifs are an alternative way
    of detecting sequence similarity.

6
Introduction (continued)
  • By focusing on a limited number of highly
    conserved regions of proteins, motifs can often
    reveal important clues to a protein's role.
  • Motifs often represent functionally important
    regions such as catalytic sites, binding sites
    and structural motifs.
  • The Blocks database combines various databases
    such as Pfam, PRINTS, ProDom, DOMO and InterPro.
  • The eMOTIF database contains discrete sequence
    motifs constructed from the blocks of BLOCKS.
  • This paper uses discrete sequence motifs
    extracted from the eBLOCKS database using the
    eMOTIF method.

7
Introduction (continued)
  • A sequence similarity measure is introduced
    based on the motif content of a pair of
    sequences.
  • This paper uses an SVM method.
  • An SVM method has been shown to perform better
    than the Fisher-kernel method, SAM-T98 and
    PSI-BLAST.
  • When a sequence similarity can be shown to be a
    dot product in some space, it is called a kernel.
  • In this paper protein motifs are used to
    construct a kernel that can be computed
    efficiently and that performs better than a
    kernel based on BLAST or Smith-Waterman scores.

8
Remote Homology Detection
  • The method was tested on the following two
    tasks:
  • 1) Prediction of a SCOP family when trained on
    other families in that family's fold.
  • 2) Prediction of the function of an enzyme when
    the training set contains enzymes that have the
    same general function but different substrates.

9
Background of the first dataset
  • The first dataset is composed of sequences of
    domains from the SCOP database.
  • Objective: to detect homology at the SCOP
    superfamily level, that is, to recognize a SCOP
    family when the training set contains other
    families in the family's superfamily.

10
(No Transcript)
11
  • (continued)
  • This specifies the positive (+ve) examples in
    the test set and training set.
  • The negative (-ve) examples are taken from
    outside of the family's fold.
  • A random family is chosen to belong to the
    negative test set; the rest of the families in
    its superfamily are added to the negative
    training set.

12
The second dataset
  • We use the classification of enzymes to simulate
    remote homology.
  • The function of an enzyme is given by the EC
    number assigned to it by the Enzyme Commission.
  • An EC number has the form n1.n2.n3.n4
  • For example, 1.1.3.13 for alcohol oxidase.
  • n1 (1-6) indicates the type of chemical
    reaction catalyzed by the enzyme.
  • n2 specifies the donor molecule.
  • n3 specifies the acceptor.
  • n4 specifies the substrate.

13
  • (continued)
  • In this paper the authors concentrate on
    oxidoreductases (n1 = 1).
  • A classifier is trained to predict
    oxidoreductases with a certain function
    (n2, n3).
  • The classifier is then tested on oxidoreductases
    with a different substrate (n4) than those it was
    trained on.

14
  • For example:
  • EC class 1.14.13.8 → positive examples of the
    training set.
  • EC class 1.14.13.39 → positive examples of the
    test set.
  • So the similarity between the positive training
    and test sets may not be very high.
  • The negative test and training sets are defined
    analogously.
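
A minimal sketch of how the positive examples could be split by substrate, assuming enzymes are given as (name, EC number) pairs. The function name `split_positives` and the enzyme names are hypothetical; only the EC numbers come from the slides.

```python
def split_positives(enzymes, function, test_substrate):
    """Split enzymes sharing `function` (n1.n2.n3) into training and test names.

    Enzymes whose substrate field n4 equals `test_substrate` go to the test
    set; all other enzymes with the same function go to the training set.
    """
    train, test = [], []
    for name, ec in enzymes:
        n1, n2, n3, n4 = ec.split(".")
        if f"{n1}.{n2}.{n3}" != function:
            continue  # different general function: not a positive example
        (test if n4 == test_substrate else train).append(name)
    return train, test


# Hypothetical enzyme names; EC numbers are the ones quoted on the slides.
enzymes = [("enzA", "1.14.13.8"), ("enzB", "1.14.13.39"), ("enzC", "1.1.3.13")]
train, test = split_positives(enzymes, function="1.14.13", test_substrate="39")
print(train, test)
```

Holding out a whole substrate class in this way is what makes the task resemble remote homology: the classifier never sees a test-class sequence during training.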

15
Methods: The Motif kernel
  • When the similarity is a dot product it is called
    a kernel.
  • Each position in a motif represents the
    variability in a column of a block from a
    multiple sequence alignment.
  • For example, in the motif
  • [as].dkf[filmv]..[filmv]l[ast]
  • [filmv] is a substitution group.

16
  • (continued)
  • If the pattern of amino acids that appear in a
    column of a block does not match any substitution
    group, then the motif contains the wildcard
    symbol "." at that position.
  • A sequence matches the motif above if it has
    either an a or an s at some position, then any
    character, then d, k, f and so on, matching until
    the end of the motif.
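
The matching rule can be sketched with Python's `re` module, since the motif syntax (single residues, bracketed substitution groups, "." as wildcard) happens to be valid regular-expression syntax. This is an illustration under that assumption, not the authors' implementation; the motif below is a shortened, hypothetical example and the sequences are made up.

```python
import re


def count_occurrences(motif, sequence):
    """Count the positions of `sequence` at which `motif` matches."""
    pattern = re.compile(motif)
    # Pattern.match(string, pos) anchors the match at index `pos`,
    # so every start position is tested independently.
    return sum(1 for start in range(len(sequence))
               if pattern.match(sequence, start))


motif = "[as].dkf[filmv]"  # hypothetical, shortened motif
print(count_occurrences(motif, "mkaqdkfiw"))  # 1: 'a', any, 'd', 'k', 'f', 'i'
print(count_occurrences(motif, "mktqdkfiw"))  # 0: 't' is not in [as]
```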

17
  • A sequence x contains a motif m if x contains m
    at some position.
  • A sequence x can be represented in a vector space
    indexed by a set of motifs M as follows:
  • $\Phi(x) = (\phi_m(x))_{m \in M}$
  • where $\phi_m(x)$ is the number of occurrences of
    the motif m in x.
  • We can define the motif kernel as
  • $K(x, x') = \Phi(x) \cdot \Phi(x')$
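
As a rough illustration of this definition, the feature map and kernel can be written as follows. Motifs are matched with Python regular expressions here purely for brevity; the motif set and sequences are hypothetical toy values.

```python
import re
from collections import Counter


def motif_features(sequence, motifs):
    """Sparse feature map: motif -> number of occurrences in `sequence`."""
    counts = Counter()
    for motif in motifs:
        pattern = re.compile(motif)
        n = sum(1 for i in range(len(sequence)) if pattern.match(sequence, i))
        if n:
            counts[motif] = n
    return counts


def motif_kernel(x, y, motifs):
    """Dot product of the motif count vectors of sequences x and y."""
    fx, fy = motif_features(x, motifs), motif_features(y, motifs)
    return sum(fx[m] * fy[m] for m in fx.keys() & fy.keys())


motifs = ["[as].dk", "f[filmv]w", "gg.c"]        # hypothetical motif set
print(motif_kernel("mkaqdkfiw", "sfdkggfvw", motifs))  # 2 shared motifs
print(motif_kernel("ggacggtc", "ggacggtc", motifs))    # 'gg.c' occurs twice: 2*2
```

Only motifs present in both sequences contribute, so the sparse dictionaries avoid touching the full motif set during the dot product.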

18
  • Since in most cases a motif appears only once in
    a sequence, this kernel counts the number of
    motifs that are common to both sequences.
  • Q: Why use the eBLOCKS database rather than other
    motif databases to define a motif kernel?
  • Answer:
  • Using databases like PROSITE and eMOTIF
    presents a problem in the evaluation of the
    performance of the kernel.
  • The eBLOCKS database is generated in an
    unsupervised way.
  • Increased coverage of eBLOCKS compared with the
    BLOCKS set.

19
Computing the Motif Kernel
20
Computing the Motif kernel
  • Once the motif content of each sequence has been
    computed, the subsequent computation of the
    kernel is simply a dot product between the
    vectors.
  • To facilitate the efficient computation of the
    motif content of a sequence, the motif database
    is stored in a trie, which is defined as follows.
  • Let m be a motif over the alphabet
    $A \cup S \cup \{.\}$ (amino acids, substitution
    groups and the wildcard).
  • Every prefix of m has a node.
  • Let $m_1$ and $m_2$ be prefixes of m; there is an
    edge from $m_1$ to $m_2$ if $|m_2| = |m_1| + 1$.
  • To find all the motifs that are contained
    in x at any position, the search is started at
    each position of x.
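
The trie search might look like the sketch below, assuming motifs are written with bracketed substitution groups and "." as wildcard. The class and function names (`TrieNode`, `build_trie`, `motifs_in`) and the toy motifs are hypothetical.

```python
import re

# One token per motif position: a bracketed group, a residue, or the wildcard.
TOKEN = re.compile(r"\[[a-z]+\]|[a-z.]")


class TrieNode:
    def __init__(self):
        self.children = {}  # token -> TrieNode
        self.motif = None   # set to the motif string if a motif ends here


def build_trie(motifs):
    root = TrieNode()
    for motif in motifs:
        node = root
        for token in TOKEN.findall(motif):
            node = node.children.setdefault(token, TrieNode())
        node.motif = motif
    return root


def token_matches(token, residue):
    if token == ".":
        return True                    # wildcard matches any residue
    if token.startswith("["):
        return residue in token[1:-1]  # substitution group
    return token == residue            # single amino acid


def motifs_in(sequence, root):
    """Return {motif: occurrence count} for all motifs found in `sequence`."""
    found = {}

    def walk(node, pos):
        if node.motif is not None:
            found[node.motif] = found.get(node.motif, 0) + 1
        if pos == len(sequence):
            return
        for token, child in node.children.items():
            if token_matches(token, sequence[pos]):
                walk(child, pos + 1)

    for start in range(len(sequence)):  # search is restarted at each position
        walk(root, start)
    return found


root = build_trie(["[as].dk", "[as].dkf", "f[filmv]w"])  # hypothetical motifs
print(motifs_in("mkaqdkfiw", root))
```

Motifs that share a prefix, such as the first two above, share trie nodes, so their common positions are compared against the sequence only once per start position.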

21
The BLAST kernel
  • A query sequence is represented by its BLAST
    scores against the training set.
  • This representation, in conjunction with SVMs,
    was used to address the problem of remote
    homology detection.
  • Results were better than those of the
    Fisher-kernel method.

22
Classification Methods
  • We report results using two classification
    methods:
  • 1) SVMs
  • 2) K-nearest-neighbor (KNN).
  • SVM
  • $f(x) = w \cdot x + b$
  • w: weight vector
  • b: constant bias
  • A query is classified according to the sign of f.

23
  • As a consequence of the optimization process, the
    weight vector can be expressed as a weighted sum
    of the support vectors (SV):
  • $w = \sum_{i \in SV} \beta_i x_i$
  • The decision function is now written as
  • $f(x) = \sum_{i \in SV} \beta_i \, x_i \cdot x + b$
  • In terms of the kernel function, the decision
    function is expressed as
  • $f(x) = \sum_{i \in SV} \beta_i K(x_i, x) + b$

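The kernel form of the decision function can be evaluated directly once the support vectors, their coefficients and the bias are known. The sketch below uses a plain dot product as the kernel and hypothetical toy values; in practice the coefficients and bias would come from SVM training.

```python
def linear_kernel(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decision_function(x, support_vectors, betas, bias, kernel):
    """f(x) = sum over support vectors of beta_i * K(x_i, x), plus bias."""
    return sum(beta * kernel(xi, x)
               for xi, beta in zip(support_vectors, betas)) + bias


# Hypothetical support vectors and coefficients, for illustration only.
support_vectors = [(1.0, 0.0), (0.0, 1.0)]
betas = [1.0, -1.0]
bias = 0.0

f = decision_function((2.0, 1.0), support_vectors, betas, bias, linear_kernel)
print(f, "positive" if f > 0 else "negative")  # 1.0 positive
```

Any kernel with the same call signature, such as a motif kernel over sequences, can be passed in place of `linear_kernel`.
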
24
  • KNN classifier
  • We use a KNN classifier with a continuous-valued
    decision function.
  • A score for class j is defined as
  • $f_j(x) = \sum_{i \in kNN_j(x)} K(x_i, x)$
  • where $kNN_j(x)$ is the set of k nearest
    neighbors of x in class j.

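A small sketch of the class score follows. The slides do not say how "nearest" is measured, so this example assumes the distance induced by the kernel, $d^2(x, x_i) = K(x, x) - 2K(x, x_i) + K(x_i, x_i)$; the data points are hypothetical.

```python
def linear_kernel(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def knn_score(x, class_examples, k, kernel):
    """Sum of K(x_i, x) over the k nearest neighbors of x within one class."""
    def squared_distance(xi):
        # Assumption: neighbors are ranked by the kernel-induced distance.
        return kernel(x, x) - 2 * kernel(x, xi) + kernel(xi, xi)

    neighbors = sorted(class_examples, key=squared_distance)[:k]
    return sum(kernel(xi, x) for xi in neighbors)


class_j = [(1, 0), (2, 0), (0, 5)]  # hypothetical members of class j
print(knn_score((1, 1), class_j, k=2, kernel=linear_kernel))  # 1 + 2 = 3
```
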
25
Metrics
  • We consider two metrics for assessing the
    performance of a classifier:
  • ROC (area under the receiver operating
    characteristic curve).
  • RFP (the median rate of false positives).
  • The ROC curve describes the tradeoff between
    sensitivity and specificity.
  • More specifically we use the ROC50 curve, which
    counts true positives only up to the first 50
    false positives.
  • The RFP score of a positive test sequence x is
    the fraction of negative test sequences that have
    a value of the decision function that is at least
    as high as the value of the decision function of
    x.
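
Both metrics can be computed from the decision values of the positive and negative test sequences. The sketch below is one common way to do it, with two assumptions that the slides do not state: the ROC-n area is normalized by (number of false positives counted) × (number of positives), and ties are ranked pessimistically (negatives first). The scores are hypothetical.

```python
from statistics import median


def rfp(positive_score, negative_scores):
    """Fraction of negatives scoring at least as high as one positive."""
    hits = sum(s >= positive_score for s in negative_scores)
    return hits / len(negative_scores)


def median_rfp(positive_scores, negative_scores):
    return median(rfp(p, negative_scores) for p in positive_scores)


def roc_n(positive_scores, negative_scores, n=50):
    """Area under the ROC curve up to the first n false positives."""
    ranked = sorted([(s, 1) for s in positive_scores] +
                    [(s, 0) for s in negative_scores],
                    key=lambda t: (-t[0], t[1]))  # ties: negatives first
    limit = min(n, len(negative_scores))
    tp = fp = area = 0
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
            area += tp  # true positives ranked above this false positive
            if fp == limit:
                break
    return area / (limit * len(positive_scores))


pos = [0.9, 0.8, 0.3]          # hypothetical decision values
neg = [0.85, 0.5, 0.1, 0.05]
print(median_rfp(pos, neg))    # 0.25
print(roc_n(pos, neg, n=2))    # 0.5
print(roc_n(pos, neg, n=50))   # 0.75 (fewer than 50 negatives: full AUC)
```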

26
Results
  • The ASTRAL database was used to obtain protein
    domain sequences of the SCOP database.
  • Only superfamilies having at least two families
    with at least 10 members each were retained.
  • This yielded a dataset with 1639 domains in 23
    superfamilies and 56 families.
  • Protein sequences annotated with EC numbers were
    extracted from the SwissProt database.
  • The extracted dataset has 2187 enzymes in 65
    classes.
  • To generate the BLAST kernel, the authors ran an
    all-vs-all BLAST on the two datasets using
    default parameters and an E-value cutoff of 0.1.
  • To generate the motif kernel, the motif content
    of the datasets was computed with eBLOCKS
    sequence motifs using the trie method.

27
Results (continued)
  • A family-by-family comparison of the
    classification performance of the motif-SVM and
    BLAST-SVM methods is provided in the figure on
    the next slide.
  • On the SCOP task the motif-SVM method performs
    significantly better than the BLAST-SVM method,
    with a p-value of $3.9 \times 10^{-9}$ in a
    Wilcoxon signed-rank test for the ROC50 score.
  • In the enzyme classification task there is no
    significant difference in ROC50 scores.
  • Similar behavior is observed in the median RFP
    and RFP50.
  • The results were similar when the Smith-Waterman
    algorithm was used instead of BLAST.

28
Results (continued)
29
Results (continued)
30
Results (continued)
  • The motif kernel in figure 4 shows the similarity
    between the families in a superfamily, whereas
    none is detected by the BLAST kernel.
  • This reflects the increased sensitivity of the
    motif kernel.

31
Results (continued)
32
Results (continued)
  • Figure 5 compares the SVM-based method to the
    one that uses KNN as a classifier.
  • For both the motif and BLAST kernels, the
    SVM-based classifier performs significantly
    better than the corresponding KNN classifier.

33
Discussion
  • This paper showed that an SVM classifier that
    uses the motif kernel performs significantly
    better than an SVM that uses a
    BLAST/Smith-Waterman kernel on a remote homology
    detection problem derived from the SCOP database.
  • Both methods performed equally well on the
    enzyme classification task.
  • Both the BLAST kernel and the motif kernel worked
    significantly better when used in conjunction
    with an SVM rather than a nearest-neighbor
    classifier.
  • Despite the relative success of the motif method,
    there were many SCOP families and EC classes that
    were not detected using this method.

34
  • Questions?? Comments!!

35
  • Thank you !!!