A Sequence Clustering Approach to Motif Discovery For Proteins


Title: A Sequence Clustering Approach to Motif Discovery For Proteins


1
A Sequence Clustering Approach to Motif Discovery
For Proteins
  • Sun Kim
  • sunkim2 at indiana.edu
  • April 4, 2006
  • IU School of Informatics

2
OUTLINE
  • Introduction
  • Research Question and Motivation
  • Why clustering approach?
  • Two approaches
  • Subsequence clustering approach
  • iGibbs: a double clustering approach
  • Future work/Discussion

3
INTRODUCTION
  • Motifs
  • Short, conserved subsequences across a set of
    proteins that share similar function
  • Seq 1: KGGAKRHRKIL
  • Seq 2: KVGAKRHSKRS
  • Seq 3: KVGAKRHSRKS
  • Seq 4: KGGAKRHRKVL
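
To make the notion of a conserved subsequence concrete, the short Python sketch below marks the columns on which all four example sequences agree. The function name `consensus_mask` is ours, and strict identity in every column is only the simplest possible notion of conservation; real motif models tolerate substitutions.

```python
# Example sequences from the slide (all of length 11).
seqs = ["KGGAKRHRKIL", "KVGAKRHSKRS", "KVGAKRHSRKS", "KGGAKRHRKVL"]

def consensus_mask(sequences):
    """Return the shared residue where all sequences agree, '.' elsewhere.

    Assumes the sequences are already aligned and of equal length.
    """
    return "".join(
        column[0] if len(set(column)) == 1 else "."
        for column in zip(*sequences)
    )

print(consensus_mask(seqs))  # prints K.GAKRH....
```

The run GAKRH is identical in all four sequences, which is the kind of region a motif finder is expected to report.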

4
Why do we want to identify motifs?
  • Useful for
  • Predicting a protein's function from its primary
    sequence
  • Predicting the family to which a protein belongs
  • Predicting protein structural features

5
Motifs
6
Motif Discovery Problem
  • Input: N sequences
  • Output: set of conserved regions

7
RELATED RESEARCH
  • Existing motif discovery algorithms
  • Stochastic
  • Gibbs
  • MEME
  • Combinatorial
  • Pratt
  • Teiresias

8
RELATED RESEARCH
  • Gibbs Sampling
  • Advantages
  • Very Fast
  • Disadvantages
  • Heuristic rather than exhaustive search; no
    guarantee of reaching an optimal solution
  • Requires relatively large input set (15 or more
    sequences)
  • Need to specify the length and number of motifs

9
RELATED RESEARCH
  • MEME: Multiple EM for Motif Elicitation
  • Advantages
  • Expectation Maximization
  • Quite Accurate
  • Disadvantages
  • Time complexity: O(N^5)
  • Number of motifs needs to be known

10
RELATED RESEARCH
  • Challenges for existing motif discovery
    algorithms
  • Finding motifs in heterogeneous sequences
  • Number of motifs unknown
  • Width of motifs unknown

11
Why clustering approach?
12
RESEARCH PROBLEM OF INTEREST
  • Separate non-homologous ones using clustering
  • Look at subsequences!
[Figure: Family 1 with Motif 1 (w1) and Family 2 with Motif 2 (w2)]
13
Two Search frameworks for motif discovery
  • One by clustering subsequences
  • with Hardik Sheth
  • iGibbs: by clustering whole sequences into
    clusters and then clustering the motifs
    predicted by the Gibbs motif sampler from each
    cluster
  • with Zhiping Wang and Mehmet Dalkilic

14
OBJECTIVE
  • Algorithm should
  • Find motifs from an unknown input set, i.e., work
    without knowing the expected number of motifs
  • (Work without knowing the expected length of the
    motifs)
  • Run Fast

15
  • A Subsequence Clustering Approach

16
Overview: A Subsequence Clustering Approach
  • Generate subsequences of length k.
  • Perform all pairwise comparisons of the
    subsequences using FASTA.
  • Group subsequences into clusters with a common
    shared region of length at least l
  • Build, extend, and combine profiles using the
    clusters
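
The first step above can be sketched in a few lines of Python. The slides do not say whether the length-k subsequences overlap, so the `step` parameter (and the function name `subsequences`) is an assumption made for illustration; `step=1` gives every overlapping window.

```python
def subsequences(seq, k, step=1):
    """Yield (start, window) for each length-k window of seq.

    `step` is a hypothetical parameter: 1 yields all overlapping
    windows, k yields non-overlapping ones.
    """
    for start in range(0, len(seq) - k + 1, step):
        yield start, seq[start:start + k]

# Every overlapping window of length 5 from one of the example sequences.
for start, window in subsequences("KGGAKRHRKIL", 5):
    print(start, window)
```

Keeping the start offset with each window makes it possible to map a clustered subsequence back to its position in the original sequence when the profiles are built and extended.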

17
MOTIVATION
Using whole sequences
Using subsequences
  • Fast
  • Multiple motifs

18
PROCEDURE
  • Initial subsequences discovery
  • Find statistically significant subsequences with
    positive log-odds score
  • Reject the ones appearing in fewer than k of the
    sequences
  • Iteratively search the input sequences to sample
    these subsequences
  • Compare subsequences using FASTA

19
PROCEDURE
  • Cluster the subsequences
  • Using BAG clustering algorithm

[Figure: graph of subsequences s1 to s9 with an
articulation point marked]
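
An articulation point is a vertex whose removal disconnects a graph, so it is a natural place to split one loose cluster into tighter ones. The Python sketch below finds articulation points with the standard depth-first-search low-link method. It illustrates the graph concept only and is not the BAG implementation; the edge list is hypothetical, since the layout of the slide's graph is not recoverable from the transcript.

```python
from collections import defaultdict

def articulation_points(edges):
    """Return the set of articulation points of an undirected simple graph."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)

    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(node, parent):
        disc[node] = low[node] = counter[0]
        counter[0] += 1
        children = 0
        for nxt in graph[node]:
            if nxt == parent:
                continue
            if nxt in disc:
                # Back edge: node can reach an earlier-discovered vertex.
                low[node] = min(low[node], disc[nxt])
            else:
                children += 1
                dfs(nxt, node)
                low[node] = min(low[node], low[nxt])
                # The subtree under nxt cannot bypass node.
                if parent is not None and low[nxt] >= disc[node]:
                    points.add(node)
        # The DFS root is a cut vertex only if it has 2+ DFS children.
        if parent is None and children > 1:
            points.add(node)

    for node in list(graph):
        if node not in disc:
            dfs(node, None)
    return points

# Hypothetical similarity graph: two triangles that share only s5.
edges = [("s1", "s2"), ("s2", "s5"), ("s1", "s5"),
         ("s5", "s6"), ("s6", "s7"), ("s5", "s7")]
print(articulation_points(edges))  # prints {'s5'}
```

In this toy graph, removing s5 separates {s1, s2} from {s6, s7}, so each triangle is a biconnected component and a candidate cluster.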
20
PROCEDURE
  • The shared regions in each cluster form the
    initial motif model
  • Motif Extension
  • Calculate the relative entropy of the motif model
  • Extend the model if addition of column increases
    the relative entropy
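
For one column of a motif model with residue frequencies p_a and background frequencies q_a, the relative entropy is the sum over residues of p_a * log2(p_a / q_a). The Python sketch below computes it per column and for a whole set of aligned sites. The uniform background of 1/20, the optional pseudocount, and the function names are assumptions; the slides do not state which background or normalization the extension step uses.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_relative_entropy(column, background=None, pseudocount=0.0):
    """Relative entropy (in bits) of one alignment column.

    Assumes a uniform background unless one is supplied. Characters
    outside the 20 standard amino acids are not scored.
    """
    if background is None:
        background = {a: 1 / 20 for a in AMINO_ACIDS}
    counts = Counter(column)
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    score = 0.0
    for a in AMINO_ACIDS:
        p = (counts[a] + pseudocount) / total
        if p > 0:
            score += p * math.log2(p / background[a])
    return score

def motif_relative_entropy(sites, **kwargs):
    """Sum of column relative entropies over aligned, equal-length sites."""
    return sum(column_relative_entropy(col, **kwargs) for col in zip(*sites))

print(column_relative_entropy("KKKK"))  # fully conserved column
print(column_relative_entropy("GVVG"))  # two residues, equally frequent
```

A fully conserved column reaches the maximum of log2(20) bits under the uniform background, while more variable columns score lower, which is why a candidate flanking column can be judged by how much it contributes.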

21
PROCEDURE
  • Merge motifs with high correlation

22
PROCEDURE
  • Ranking and Filtering
  • Rank motifs based on support (quorum)
  • Calculate Support Weight for motif
  • Reduce the weight of a sequence by half if it
    has been covered by a motif
  • Reject motifs with lower support weight
  • WA = 5, WC = 2
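
One way to read the weighting scheme above is sketched below in Python. Every sequence starts with weight 1, motifs are visited in order of decreasing support, a motif's support weight is the sum of the current weights of the sequences it covers, and each covered sequence then has its weight halved. The visiting order, the `min_weight` threshold, and the choice to halve weights only for accepted motifs are assumptions, and the toy data are hypothetical.

```python
def rank_by_support_weight(motif_support, min_weight):
    """Return [(motif, support_weight)] for motifs that pass the threshold.

    motif_support maps a motif name to the set of sequence ids it covers.
    """
    weights = {}  # current weight per sequence id, default 1.0
    kept = []
    # Visit motifs by decreasing support, breaking ties by name.
    ordered = sorted(motif_support.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for name, covered in ordered:
        score = sum(weights.get(s, 1.0) for s in covered)
        if score >= min_weight:
            kept.append((name, score))
            for s in covered:
                weights[s] = weights.get(s, 1.0) / 2
    return kept

# Hypothetical data: motif A covers five sequences, motif C covers four of them.
support = {"A": {1, 2, 3, 4, 5}, "C": {1, 2, 3, 4}}
print(rank_by_support_weight(support, min_weight=3))  # prints [('A', 5.0)]
```

Here motif A scores 5, while motif C scores only 4 * 0.5 = 2 because all of its sequences were already covered by A, so C is rejected as largely redundant at this threshold.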

23
EXPERIMENTS
  • Three test scenarios using PROSITE database.
  • 20 test cases per scenario, each with sequences
    from a randomly selected
  • Single protein family
  • Two different protein families
  • Three different protein families

24
EXPERIMENTS
  • Results compared against Gibbs and MEME programs
  • Results validated using PROSITE patterns
  • Example: PS00017
  • [AG]-x(4)-G-K-[ST]
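
In full PROSITE syntax the PS00017 pattern reads [AG]-x(4)-G-K-[ST]: square brackets list the allowed residues, x stands for any residue, and a number in parentheses is a repeat count. A pattern in this syntax can be turned into a regular expression for scanning sequences, as in the Python sketch below. The converter is a simplified assumption that covers only the common elements, and the test sequence is made up.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE pattern to a Python regular expression.

    Simplified: handles x, [..], {..}, (n), (n,m), and the < and >
    anchors used outside brackets.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.strip()
        element = element.replace("<", "^").replace(">", "$")
        element = element.replace("{", "[^").replace("}", "]")  # excluded residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        parts.append(element)
    return "".join(parts)

regex = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
print(regex)  # prints [AG].{4}GK[ST]

# Hypothetical sequence containing one occurrence of the pattern.
match = re.search(regex, "MAGPSGSGKSTLL")
print(match.start(), match.group())  # prints 2 GPSGSGKS
```

A predicted motif can then be checked against the known pattern by testing whether the predicted site overlaps a regular-expression match in the same sequence.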

25
RESULTS: Motif coverage in single-family scenario
26
RESULTS: Motif coverage in two-family scenario
27
RESULTS: Motif coverage in three-family scenario
28
RESULTS
29
RESULTS: Run timings for single-family scenario
[Figure: (a) linear and (b) semi-log plots of run
timings for single-family cases]
30
RESULTS: Run timings for two-family scenario
[Figure: (a) linear and (b) semi-log plots of run
timings for two-family cases]
31
RESULTS: Run timings for three-family scenario
[Figure: (a) linear and (b) semi-log plots of run
timings for three-family cases]
32
  • iGibbs: An Improved Gibbs Motif Sampler for
    Proteins

33
Overview
34
Basic idea: again, the effect of the input
sequence set
  • The input sequences are divided in two different
    ways
  • Horizontally, by clustering whole sequences, and
  • Vertically, by selecting subsequences
  • Then each partition is input to the motif
    discovery algorithm, Gibbs in this case.

35
Subsequence selection by patterns
We look for patterns common to all input
sequences. However, it is unlikely that a single
pattern is common to all sequences, so we use
multiple patterns that are similar to each other.
36
patRefine: a procedure to select patterns


37
iGibbs: a double clustering approach
  • Group input sequences into clusters, Ci
  • Predict motifs from each Ci using Gibbs
  • Scan the input for occurrences of the predicted
    motifs
  • Group input sequences into clusters, Dj, based
    on the motif occurrences
  • Predict the final motifs from each Dj
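
The second grouping step can be sketched as follows in Python: scan every input sequence for the motifs predicted in the first round, then put sequences with the same set of motif occurrences into the same cluster. Regular expressions stand in for the motif models that a Gibbs sampler would produce, and grouping by identical occurrence sets is an assumption; the slides do not specify the actual grouping criterion. All names and data are hypothetical.

```python
import re
from collections import defaultdict

def group_by_motif_occurrence(sequences, motifs):
    """Group sequence ids by the set of motifs that occur in them.

    sequences: dict mapping sequence id to sequence string
    motifs:    dict mapping motif name to a regular expression
               (a stand-in for a real motif model)
    """
    groups = defaultdict(list)
    for seq_id, seq in sequences.items():
        signature = frozenset(
            name for name, rx in motifs.items() if re.search(rx, seq)
        )
        groups[signature].append(seq_id)
    return dict(groups)

# Hypothetical input: two motifs predicted from the first-round clusters.
motifs = {"m1": "GAKRH", "m2": "[AG].{4}GK[ST]"}
sequences = {
    "seq1": "KGGAKRHRKIL",
    "seq2": "KVGAKRHSKRS",
    "seq3": "MAGPSGSGKSTLL",
}
clusters = group_by_motif_occurrence(sequences, motifs)
print(clusters)
```

Each resulting cluster would then be passed back to the motif finder to predict the final motifs.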

38
Experiment with PROSITE sequences
  • We collected sequence sets with the same PROSITE
    patterns.
  • There are three data sets
  • Single-family-set
  • Sequence sets with a single PROSITE pattern
  • Two-family-set
  • Sequence sets with two PROSITE patterns
  • More extensive two-family data: 1025 data sets
  • We ran Gibbs, new Gibbs, MEME, iGibbs without
    patRefine, iGibbs with patRefine.

39
Single-family-set
PPV from 10 runs
Legend: G = Gibbs; NG = Gibbs (new version);
IUN = iGibbs unguided (no patRefine);
I = iGibbs with patRefine; M = MEME
40
Two-family-set
PPV from 10 runs
Legend: G = Gibbs; NG = Gibbs (new version);
IUN = iGibbs unguided (no patRefine);
I = iGibbs with patRefine; M = MEME
41
Runtime
42
More extensive two-family data: 1025 data sets
43
Availability
  • Motif Discovery for Proteins By Subsequence
    Clustering. Hardik Sheth and Sun Kim. 5th ACM
    SIGKDD Workshop on Data Mining in Bioinformatics,
    August 2005
  • iGibbs: A Motif Discovery Framework for Gibbs
    Sampling Algorithm for Proteins. Sun Kim,
    Zhiping Wang, and Mehmet Dalkilic (revision)
  • Try iGibbs online!
  • More information on our motif projects at
  • http://bio.informatics.indiana.edu/projects/MOTIF/

44
FUTURE WORK
  • Extending the algorithm to DNA sequences
  • Parameter Optimization
  • More rigorous statistical study
  • More general graphical model: in general, a
    sequence has several different motifs, not
    just one.

45
ACKNOWLEDGEMENTS
  • Mehmet Dalkilic
  • Hardik Sheth
  • Zhiping Wang
  • Presentation materials by Mehmet Dalkilic and
    Hardik Sheth
  • Bioinformatics Research Group
  • Supported by
  • NSF CAREER DBI-0237901
  • INGEN (Indiana Genomics Initiatives)
  • Collaboration in Life Sciences and Informatics
    Research (CLSIR) Jun Xie (PI)