Title: A Sequence Clustering Approach to Motif Discovery For Proteins
1A Sequence Clustering Approach to Motif Discovery
For Proteins
- Sun Kim
- sunkim2atindiana.edu
- April 4, 2006
- IU School of Informatics
2OUTLINE
- Introduction
- Research Question and Motivation
- Why clustering approach?
- Two approaches
- Subsequence clustering approach
- iGibbs: a double clustering approach
- Future work/Discussion
3INTRODUCTION
- Motifs
- Short, conserved subsequences across a set of proteins that share similar function
- Seq 1 KGGAKRHRKIL
- Seq 2 KVGAKRHSKRS
- Seq 3 KVGAKRHSRKS
- Seq 4 KGGAKRHRKVL
4Why do we want to identify motifs ?
- Useful for
- Predicting a protein's function from its primary sequence
- Predicting the family to which a protein belongs
- Predicting protein structural features
5Motifs
6Motif Discovery Problem
- Input: N sequences
- Output: set of conserved regions
7RELATED RESEARCH
- Existing motif discovery algorithms
- Stochastic
- Gibbs
- MEME
- Combinatorial
- Pratt
- Teiresias
8RELATED RESEARCH
- Gibbs Sampling
- Advantages
- Very Fast
- Disadvantages
- Heuristic instead of exhaustive search; no guarantee of reaching an optimal value
- Requires a relatively large input set (15 or more sequences)
- Need to specify the length and number of motifs
9RELATED RESEARCH
- MEME: Multiple EM for Motif Elicitation
- Advantages
- Expectation Maximization
- Quite Accurate
- Disadvantages
- Time complexity O(N^5)
- Number of motifs needs to be known
10RELATED RESEARCH
- Challenges for existing motif discovery algorithms
- Finding motifs in heterogeneous sequences
- Number of motifs unknown
- Width of motifs unknown
11Why clustering approach?
12RESEARCH PROBLEM OF INTEREST
[Figure labels: Family 1, Motif 1, w1; Family 2, Motif 2, w2]
- Separate non-homologous ones using clustering
- Look at subsequences!
13Two Search frameworks for motif discovery
- One by clustering subsequences.
- with Hardik Sheth
- iGibbs: clustering whole sequences into clusters of sequences, then clustering the motifs predicted by the Gibbs motif sampler from each cluster.
- with Zhiping Wang and Mehmet Dalkilic
14OBJECTIVE
- Algorithm should
- Find motifs from an unknown input set, i.e. work without knowing the expected number of motifs
- (Work without knowing the expected length of the motifs)
- Run fast
15- A Subsequence Clustering Approach
16Overview A Subsequence Clustering Approach
- Generate subsequences of length k.
- Perform all pairwise comparisons of the subsequences using FASTA.
- Group subsequences into clusters with a common shared region of length at least l.
- Build, extend, and combine profiles using the clusters.
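The first step of the overview can be sketched as a sliding window. This is a minimal illustration that assumes the subsequences are all overlapping windows of length k; the slides do not say how windows are chosen, and the function name `subsequences` is hypothetical.

```python
def subsequences(seq, k):
    """Return (start, window) pairs for every overlapping length-k window.

    Assumption: windows overlap and advance one residue at a time.
    """
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


# Seq 1 from the introduction slide, with an arbitrary k chosen for illustration.
for start, window in subsequences("KGGAKRHRKIL", 8):
    print(start, window)
```

A sequence shorter than k yields no windows, so short inputs drop out of the later comparison step without special handling.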
17MOTIVATION
[Figure panels: using whole sequences; using subsequences]
18PROCEDURE
- Initial subsequences discovery
- Find statistically significant subsequences with positive log-odds score
- Reject the ones appearing in fewer than k sequences
- Iteratively search the input sequences to sample these subsequences
- Compare subsequences using FASTA
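The rejection rule (drop subsequences that appear in fewer than k sequences) can be sketched as a quorum filter. The sketch below is a simplification: it counts exact word matches, whereas the slides describe log-odds scoring and FASTA comparison, and the names `quorum_filter` and `width` are hypothetical.

```python
from collections import defaultdict


def quorum_filter(sequences, width, k):
    """Keep length-`width` words that occur in at least k distinct sequences.

    Assumption: occurrences are exact matches; the slides use a
    similarity-based comparison that is not reproduced here.
    """
    seen_in = defaultdict(set)  # word -> indices of sequences containing it
    for idx, seq in enumerate(sequences):
        for i in range(len(seq) - width + 1):
            seen_in[seq[i:i + width]].add(idx)
    return {word: ids for word, ids in seen_in.items() if len(ids) >= k}


# The four example sequences from the introduction slide.
seqs = ["KGGAKRHRKIL", "KVGAKRHSKRS", "KVGAKRHSRKS", "KGGAKRHRKVL"]
kept = quorum_filter(seqs, width=5, k=4)
print(sorted(kept))
```

With a quorum of all four sequences, only the word shared by every sequence survives; lowering k admits the words shared by two-sequence subgroups.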
19PROCEDURE
- Cluster the subsequences
- Using BAG clustering algorithm
[Figure: graph of subsequences s1 to s9 with an articulation point marked]
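An articulation point is a vertex whose removal disconnects the graph, which makes it a natural place to split a cluster. The sketch below is a generic depth-first search for articulation points, not BAG's actual implementation; the edges in the toy graph are invented for illustration because the figure's edges are not recoverable.

```python
def articulation_points(graph):
    """Return the articulation points of a simple undirected graph.

    graph: dict mapping each node to an iterable of its neighbours
    (every edge must be listed in both directions).
    """
    disc, low, result = {}, {}, set()
    counter = 0

    def dfs(u, parent):
        nonlocal counter
        disc[u] = low[u] = counter
        counter += 1
        children = 0
        for v in graph[u]:
            if v == parent:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])  # back edge to an ancestor
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # v's subtree cannot reach above u, so u separates it
                if parent is not None and low[v] >= disc[u]:
                    result.add(u)
        # the DFS root is an articulation point only with 2+ DFS children
        if parent is None and children > 1:
            result.add(u)

    for node in graph:
        if node not in disc:
            dfs(node, None)
    return result


# Hypothetical similarity graph: two triangles that share the vertex s3.
toy = {
    "s1": ["s2", "s3"],
    "s2": ["s1", "s3"],
    "s3": ["s1", "s2", "s4", "s5"],
    "s4": ["s3", "s5"],
    "s5": ["s3", "s4"],
}
print(articulation_points(toy))
```

In this toy graph, removing s3 leaves two separate groups, which is the kind of split the slide's figure points at.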
20PROCEDURE
- The shared regions in each cluster form the initial motif model
- Motif Extension
- Calculate the relative entropy of the motif model
- Extend the model if adding a column increases the relative entropy
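The extension step can be sketched with per-column relative entropy. Several things here are assumptions rather than details from the slides: a uniform background over the 20 amino acids, no pseudocounts, sites already aligned at the same offsets, and an acceptance rule that compares the candidate column with the model's mean per-column score. The last assumption is needed because a summed relative entropy never decreases when a column is added, so the slides must rely on some normalization or threshold they do not state.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
UNIFORM = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}  # assumed background


def column_relative_entropy(column, background=UNIFORM):
    """Relative entropy (bits) of one alignment column against the background."""
    counts = Counter(column)
    n = len(column)
    return sum((c / n) * math.log2((c / n) / background[aa])
               for aa, c in counts.items())


def model_relative_entropy(sites, start, end, background=UNIFORM):
    """Sum of column scores over the half-open column range [start, end)."""
    return sum(column_relative_entropy([s[i] for s in sites], background)
               for i in range(start, end))


def extend_right(sites, start, end, background=UNIFORM, tol=1e-9):
    """Grow the model rightwards while the mean column score does not drop.

    Assumption: this acceptance rule is one possible reading of the slide.
    """
    width = min(len(s) for s in sites)
    while end < width:
        mean = model_relative_entropy(sites, start, end, background) / (end - start)
        candidate = column_relative_entropy([s[end] for s in sites], background)
        if candidate < mean - tol:
            break
        end += 1
    return end


# The four example sequences from the introduction slide, treated as aligned.
sites = ["KGGAKRHRKIL", "KVGAKRHSKRS", "KVGAKRHSRKS", "KGGAKRHRKVL"]
end = extend_right(sites, start=2, end=5)
print(sites[0][2:end])
```

Starting from the three fully conserved columns at offsets 2 to 4, the model grows across the next two conserved columns and stops at the first column where the residues diverge.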
21PROCEDURE
- Merge motifs with high correlation
22PROCEDURE
- Ranking and Filtering
- Rank motifs based on support (quorum)
- Calculate Support Weight for motif
- Reduce the weight of a sequence by half if it has been covered by a motif
- Reject motifs with lower support weight
- WA = 5, WC = 2
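One way to read the ranking and filtering step is as a greedy pass over motifs in order of raw support. The sketch below assumes every sequence starts with weight 1, that a motif's support weight is the sum of the current weights of the sequences it covers, and that weights are halved only after a motif is kept. The threshold `min_weight`, the function name, and the toy data are hypothetical.

```python
def rank_by_support_weight(motif_hits, min_weight):
    """Greedy support-weight filter.

    motif_hits: dict mapping motif name -> set of sequence ids it covers.
    Returns the kept motifs as (name, support_weight) pairs, in rank order.
    """
    weight = {}  # sequence id -> current weight (implicitly 1.0 at the start)
    kept = []
    # Rank by raw support (quorum), largest first; ties keep input order.
    ranked = sorted(motif_hits.items(), key=lambda kv: len(kv[1]), reverse=True)
    for motif, seq_ids in ranked:
        support = sum(weight.get(s, 1.0) for s in seq_ids)
        if support < min_weight:
            continue  # rejected: mostly covers sequences already explained
        kept.append((motif, support))
        for s in seq_ids:
            weight[s] = weight.get(s, 1.0) / 2  # halve covered sequences
    return kept


# Hypothetical data: B overlaps A's sequences, C covers two new ones.
hits = {"A": {1, 2, 3, 4, 5}, "B": {1, 2, 3}, "C": {6, 7}}
print(rank_by_support_weight(hits, min_weight=2))
```

Halving the weight of covered sequences penalizes a motif that only re-describes sequences an earlier motif already explains, while a smaller motif over fresh sequences keeps its full weight.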
23EXPERIMENTS
- Three test scenarios using the PROSITE database.
- 20 test cases, each having sequences from randomly selected:
- Single protein family
- Two different protein families
- Three different protein families
24EXPERIMENTS
- Results compared against Gibbs and MEME programs
- Results validated using PROSITE patterns
- Example: PS00017
- [AG]-x(4)-G-K-[ST]
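Validating a predicted motif against a PROSITE pattern requires scanning sequences for that pattern. In standard PROSITE syntax, PS00017 reads `[AG]-x(4)-G-K-[ST]`. The sketch below translates a basic subset of that syntax into a Python regular expression; the helper name `prosite_to_regex` and the test sequence are hypothetical, and terminal anchors (`<`, `>`) are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regular expression.

    Handles residue sets [..], exclusions {..}, the wildcard x, and
    repetition counts (n) or (n,m). Anchors are not supported.
    """
    def repetition(match):
        low, high = match.group(1), match.group(2)
        return "{%s,%s}" % (low, high) if high else "{%s}" % low

    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.strip()
        element = element.replace("{", "[^").replace("}", "]")  # {P}: anything but P
        element = element.replace("x", ".")                     # x: any residue
        element = re.sub(r"\((\d+)(?:,(\d+))?\)", repetition, element)
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
hit = re.search(regex, "MAGESGAGKSTL")  # made-up sequence for illustration
print(regex, hit.group(), hit.span())
```

Amino-acid codes are upper case, so replacing the lower-case `x` does not collide with residue letters.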
25RESULTS Motif coverage in single family scenario
26RESULTS Motif coverage in two family scenario
27RESULTS Motif coverage in three family scenario
28RESULTS
29RESULTS Run timings for single family scenario
(a) Linear and (b) Semi-log plot of run timings
for single-family cases
30RESULTS Run timings for two family scenario
(a) Linear and (b) Semi-log plot of run timings
for two-family cases
31RESULTS Run timings for three family scenario
(a) Linear and (b) Semi-log plot of run timings
for three-family cases
32- iGibbs An Improved Gibbs Motif Sampler for
Proteins
33Overview
34Basic idea again, the effect of the input
sequence set
- The input sequences are divided in two different ways
- Horizontally by clustering whole sequences, and
- Vertically by selecting subsequences
- Then, each partition is input to the motif discovery algorithm, Gibbs in this case.
35Subsequence selection by patterns
We look for patterns common to all input sequences. However, it is unlikely that a single pattern is common to all sequences. Thus we use multiple patterns that are similar to each other.
36patRefine a procedure to select patterns
37iGibbs a double clustering approach
- Group input sequences into clusters, Ci
- Predict motifs using Gibbs
- Scan the input for occurrences of the motifs predicted.
- Group input sequences into clusters, Dj, based on the motif occurrences.
- Predict the final motifs from Dj
38Experiment with PROSITE sequences
- We collected sequence sets with the same PROSITE patterns.
- There are three data sets
- Single-family-set
- Sequence sets with a single PROSITE pattern
- Two-family-set
- Sequence sets with two PROSITE patterns
- More extensive two-family data: 1025 data sets
- We ran Gibbs, new Gibbs, MEME, iGibbs without patRefine, and iGibbs with patRefine.
39Single-family-set
PPV from 10 runs. Legend: G = Gibbs; NG = Gibbs (new version); IUN = iGibbs unguided (no patRefine); I = iGibbs with patRefine; M = MEME
40Two-family-set
PPV from 10 runs. Legend: G = Gibbs; NG = Gibbs (new version); IUN = iGibbs unguided (no patRefine); I = iGibbs with patRefine; M = MEME
41Runtime
42More extensive two family data 1025 data sets
43Availability
- Motif Discovery for Proteins By Subsequence Clustering. Hardik Sheth and Sun Kim. 5th ACM SIGKDD Workshop on Data Mining in Bioinformatics, August 2005
- iGibbs: A Motif Discovery Framework for Gibbs Sampling Algorithm for Proteins. Sun Kim, Zhiping Wang, and Mehmet Dalkilic (revision). Try iGibbs online!
- More information on our motif projects at
- http://bio.informatics.indiana.edu/projects/MOTIF/
44FUTURE WORK
- Extending the algorithm to DNA sequences
- Parameter Optimization
- More rigorous statistical study
- More general graphical model: in general, a sequence has several different motifs, not just one.
45ACKNOWLEDGEMENTS
- Mehmet Dalkilic
- Hardik Sheth
- Zhiping Wang
- Presentation materials by Mehmet Dalkilic and Hardik Sheth
- Bioinformatics Research Group
- Supported by
- NSF CAREER DBI-0237901,
- INGEN (Indiana Genomics Initiatives)
- Collaboration in Life Sciences and Informatics Research (CLSIR), Jun Xie (PI)