A Sequence Clustering Approach to Motif Discovery For Proteins presentation

1
A Sequence Clustering Approach to Motif Discovery
For Proteins

Sun Kim
sunkim2 at indiana.edu
April 4, 2006
IU School of Informatics

2
OUTLINE

Introduction
Research Question and Motivation
Why clustering approach?
Two approaches
Subsequence clustering approach
iGibbs: a double clustering approach
Future work/Discussion

3
INTRODUCTION

Motifs
Short, conserved subsequences across a set of
proteins that share similar function
Seq 1 KGGAKRHRKIL
Seq 2 KVGAKRHSKRS
Seq 3 KVGAKRHSRKS
Seq 4 KGGAKRHRKVL
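
The four example sequences above can be checked column by column. The following is a minimal sketch, not part of the original slides, that marks the columns identical in every sequence; the function name `conserved_columns` is an illustrative assumption, and the sequences are assumed to be already aligned and of equal length.

```python
def conserved_columns(aligned):
    """Return the residue for columns identical in all sequences, '.' elsewhere."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be non-empty and of equal length")
    # zip(*aligned) walks the alignment one column at a time
    return "".join(col[0] if len(set(col)) == 1 else "." for col in zip(*aligned))


seqs = ["KGGAKRHRKIL", "KVGAKRHSKRS", "KVGAKRHSRKS", "KGGAKRHRKVL"]
print(conserved_columns(seqs))  # prints K.GAKRH....
```

The fully conserved run `GAKRH` is the kind of short shared region the slide calls a motif.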

4
Why do we want to identify motifs?

Useful for
Predicting a protein's function from its primary
sequence
Predicting the family to which a protein belongs
Predicting protein structural features

5
Motifs
6
Motif Discovery Problem

Input: N sequences
Output: a set of conserved regions

7
RELATED RESEARCH

Existing motif discovery algorithms
Stochastic
Gibbs
MEME
Combinatorial
Pratt
Teiresias

8
RELATED RESEARCH

Gibbs Sampling
Advantages
Very Fast
Disadvantages
Heuristic rather than exhaustive search; no
guarantee of reaching an optimal value
Requires relatively large input set (15 or more
sequences)
Need to specify the length and number of motifs

9
RELATED RESEARCH

MEME: Multiple EM for Motif Elicitation
Advantages
Expectation Maximization
Quite Accurate
Disadvantages
Time complexity: O(N^5)
Number of motifs needs to be known

10
RELATED RESEARCH

Challenges for existing motif discovery
algorithms
Finding motifs in heterogeneous sequences
Number of motifs unknown
Width of motifs unknown

11
Why clustering approach?
12
RESEARCH PROBLEM OF INTEREST
[Figure: Family 1 with Motif 1 (labeled w1) and
Family 2 with Motif 2 (labeled w2)]
Separate non-homologous sequences using clustering
Look at subsequences!
13
Two Search frameworks for motif discovery

One by clustering subsequences
(with Hardik Sheth)
iGibbs: clustering whole sequences into clusters,
then clustering the motifs that the Gibbs motif
sampler predicts from each cluster
(with Zhiping Wang and Mehmet Dalkilic)

14
OBJECTIVE

Algorithm should
Find motifs from an unknown input set, i.e., work
without knowing the expected number of motifs
(Work without knowing the expected length of the
motifs)
Run Fast

A Subsequence Clustering Approach

16
Overview: A Subsequence Clustering Approach

Generate subsequences of length k.
Perform all pairwise comparisons of the
subsequences using FASTA.
Group subsequences into clusters that share a
common region of length at least l.
Build, extend, and combine profiles using the
clusters.
17
MOTIVATION
Using whole sequences
Using subsequences

Fast
Multiple motifs

18
PROCEDURE

Initial subsequences discovery
Find statistically significant subsequences with
positive log-odds score
Reject those appearing in fewer than k of the
sequences
Iteratively search the input sequences to sample
these subsequences
Compare subsequences using FASTA
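
The rejection step can be illustrated with a simple support count. This is a minimal sketch that assumes exact-match counting of fixed-width subsequences; the log-odds significance test and the iterative sampling are not reproduced here, and `quorum_filter`, `width`, and `min_support` are hypothetical names (`min_support` plays the role of the slide's k).

```python
from collections import defaultdict


def quorum_filter(seqs, width, min_support):
    """Keep subsequences of the given width found in at least min_support sequences."""
    support = defaultdict(set)
    for sid, s in enumerate(seqs):
        for i in range(len(s) - width + 1):
            support[s[i:i + width]].add(sid)  # a set, so repeats in one sequence count once
    return {sub: sorted(ids) for sub, ids in support.items() if len(ids) >= min_support}


seqs = ["KGGAKRHRKIL", "KVGAKRHSKRS", "KVGAKRHSRKS", "KGGAKRHRKVL"]
print(quorum_filter(seqs, width=5, min_support=4))  # prints {'GAKRH': [0, 1, 2, 3]}
```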

19
PROCEDURE

Cluster the subsequences
Using BAG clustering algorithm

[Figure: graph of subsequences s1 to s9, with an
articulation point marked]
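
The figure highlights an articulation point, a vertex whose removal splits the graph. The following is a minimal sketch of finding articulation points with the standard depth-first low-link method. It is not the BAG implementation, and it assumes a simple undirected graph with vertices numbered 0 to n-1.

```python
def articulation_points(n, edges):
    """Vertices whose removal disconnects their component (low-link DFS)."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    disc = [-1] * n  # discovery time; -1 means unvisited
    low = [0] * n    # earliest discovery time reachable from the subtree
    cut = set()
    timer = 0

    def dfs(u, parent):
        nonlocal timer
        disc[u] = low[u] = timer
        timer += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if disc[v] != -1:  # back edge to an already visited vertex
                low[u] = min(low[u], disc[v])
            else:              # tree edge
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if parent is not None and low[v] >= disc[u]:
                    cut.add(u)
        if parent is None and children > 1:
            cut.add(u)  # a DFS root is a cut vertex only with 2+ subtrees

    for s in range(n):
        if disc[s] == -1:
            dfs(s, None)
    return sorted(cut)


# Two triangles that meet only at vertex 2
print(articulation_points(5, [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)]))  # prints [2]
```

In a similarity graph, such a vertex is the only link between two otherwise separate dense groups, which makes it a natural place to split a cluster.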
20
PROCEDURE

The shared regions in each cluster form the
initial motif model
Motif Extension
Calculate the relative entropy of the motif model
Extend the model if addition of column increases
the relative entropy
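
A minimal sketch of the extension idea follows. Two things here are assumptions rather than details from the slides: the per-column score is the usual relative entropy sum of p(a) log2(p(a)/q(a)) against a uniform background, and, because adding any column can never lower the summed relative entropy, the sketch extends only while the next column scores at least a hypothetical threshold `min_gain`. The pseudocount value is also illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
UNIFORM = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}  # assumed background


def column_relative_entropy(column, background=UNIFORM, pseudocount=0.05):
    """Relative entropy (in bits) of one alignment column against the background."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(background)
    score = 0.0
    for aa, q in background.items():
        p = (counts.get(aa, 0) + pseudocount) / total
        score += p * math.log2(p / q)
    return score


def extend_right(seqs, starts, width, min_gain):
    """Grow the motif to the right while the next column scores >= min_gain bits."""
    while all(st + width < len(s) for s, st in zip(seqs, starts)):
        col = [s[st + width] for s, st in zip(seqs, starts)]
        if column_relative_entropy(col) < min_gain:
            break
        width += 1
    return width


seqs = ["KGGAKRHRKIL", "KVGAKRHSKRS", "KVGAKRHSRKS", "KGGAKRHRKVL"]
# Start from the single conserved column at offset 2 and extend rightwards
print(extend_right(seqs, starts=[2, 2, 2, 2], width=1, min_gain=2.5))  # prints 5
```

Extension to the left works the same way with the column just before each start offset.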

21
PROCEDURE

Merge motifs with high correlation

22
PROCEDURE

Ranking and Filtering
Rank motifs based on support (quorum)
Calculate Support Weight for motif
Reduce the weight of a sequence by half if it
has been covered by a motif
Reject motifs with lower support weight
W_A = 5, W_C = 2
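
One possible reading of this weighting scheme is sketched below; the slides do not give the exact formula, so every detail is an assumption. Each sequence starts with weight 1, a motif's support weight is the sum of the weights of the sequences it covers, the best motif is accepted, and the weights of its sequences are halved before the next round. The names `rank_by_support` and `min_weight` are hypothetical.

```python
def rank_by_support(motif_hits, n_seqs, min_weight):
    """Greedy ranking; motif_hits maps a motif name to the set of sequences it covers."""
    weight = [1.0] * n_seqs
    remaining = dict(motif_hits)
    ranked = []
    while remaining:
        scores = {m: sum(weight[i] for i in hits) for m, hits in remaining.items()}
        best = max(sorted(scores), key=scores.get)  # ties broken by name
        if scores[best] < min_weight:
            break  # every remaining motif falls below the threshold
        ranked.append((best, scores[best]))
        for i in remaining.pop(best):
            weight[i] *= 0.5  # covered sequences count half from now on
    return ranked


hits = {"A": {0, 1, 2, 3, 4}, "B": {0, 1, 2, 3}, "C": {5, 6, 7}}
print(rank_by_support(hits, n_seqs=8, min_weight=2.5))  # prints [('A', 5.0), ('C', 3.0)]
```

In this toy input, motif B covers only sequences already explained by A, so its weight drops to 2.0 and it is rejected, while C is kept because it covers new sequences.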

23
EXPERIMENTS

Three test scenarios using the PROSITE database.
20 test cases, each with sequences from a
randomly selected:
Single protein family
Two different protein families
Three different protein families

24
EXPERIMENTS

Results compared against Gibbs and MEME programs
Results validated using PROSITE patterns
Example: PS00017
[AG]-x(4)-G-K-[ST]
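
A PROSITE pattern such as PS00017, `[AG]-x(4)-G-K-[ST]`, can be checked against a sequence by translating it to a regular expression. The following is a minimal sketch that handles only the basic syntax (residues, `[..]` alternatives, `{..}` exclusions, `x`, and repeat counts), not the `<` and `>` anchors; the function name `prosite_to_regex` is an illustrative assumption, and the test string is the start of a Ras-like sequence used only as an example.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for elem in pattern.rstrip(".").split("-"):
        elem = elem.strip()
        # Split "x(4)" or "x(2,4)" into the core element and its repeat counts
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", elem)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # excluded residues
        # "[AG]" and single residues such as "G" are already valid regex
        if lo:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
print(regex)  # prints [AG].{4}GK[ST]
print(re.search(regex, "MTEYKLVVVGAGGVGKSALT").group())  # prints GAGGVGKS
```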

25
RESULTS: Motif coverage in single-family scenario
26
RESULTS: Motif coverage in two-family scenario
27
RESULTS: Motif coverage in three-family scenario
28
RESULTS
29
RESULTS: Run timings for single-family scenario
(a) Linear and (b) semi-log plots of run timings
for single-family cases
30
RESULTS: Run timings for two-family scenario
(a) Linear and (b) semi-log plots of run timings
for two-family cases
31
RESULTS: Run timings for three-family scenario
(a) Linear and (b) semi-log plots of run timings
for three-family cases
32

iGibbs: An Improved Gibbs Motif Sampler for
Proteins

33
Overview
34
Basic idea (again): the effect of the input
sequence set

The input sequences are divided in two different
ways:
Horizontally, by clustering whole sequences, and
Vertically, by selecting subsequences.
Each partition is then input to the motif
discovery algorithm (Gibbs in this case).

35
Subsequence selection by patterns
We look for patterns common to all input
sequences. However, a single pattern is unlikely
to be common to all sequences, so we use multiple
patterns that are similar to each other.
36
patRefine: a procedure to select patterns

37
iGibbs: a double clustering approach

Group input sequences into clusters C_i.
Predict motifs from each C_i using Gibbs.
Scan the input for occurrences of the predicted
motifs.
Group input sequences into clusters D_j based
on the motif occurrences.
Predict the final motifs from each D_j.
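
The scan-and-regroup steps can be illustrated as follows. This is a sketch under strong simplifying assumptions: predicted motifs are represented as regular expressions rather than Gibbs sampler models, and sequences are regrouped simply by which motifs they contain. The function name `group_by_motif_occurrence` and the toy sequences are hypothetical.

```python
import re
from collections import defaultdict


def group_by_motif_occurrence(seqs, motifs):
    """Group sequence indices by the tuple of motif names that occur in them."""
    groups = defaultdict(list)
    for sid, s in enumerate(seqs):
        signature = tuple(name for name, rx in sorted(motifs.items()) if re.search(rx, s))
        groups[signature].append(sid)
    return dict(groups)


seqs = ["AAGAKRHAA", "CCGAKRHCC", "TTWWPLLTT", "GGWWPLLGG"]
motifs = {"m1": "GAKRH", "m2": "WWPLL"}  # placeholders for motifs predicted in the first pass
print(group_by_motif_occurrence(seqs, motifs))  # prints {('m1',): [0, 1], ('m2',): [2, 3]}
```

Each resulting group corresponds to one second-round cluster, which would then be passed back to the motif sampler for the final prediction.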

38
Experiment with PROSITE sequences

We collected sequence sets with the same PROSITE
patterns.
There are three data sets:
Single-family-set
Sequence sets with a single PROSITE pattern
Two-family-set
Sequence sets with two PROSITE patterns
More extensive two-family data: 1025 data sets
We ran Gibbs, new Gibbs, MEME, iGibbs without
patRefine, and iGibbs with patRefine.

39
Single-family-set
PPV from 10 runs
G: Gibbs
NG: Gibbs (new version)
IUN: iGibbs unguided (no patRefine)
I: iGibbs with patRefine
M: MEME
40
Two-family-set
PPV from 10 runs
G: Gibbs
NG: Gibbs (new version)
IUN: iGibbs unguided (no patRefine)
I: iGibbs with patRefine
M: MEME
41
Runtime
42
More extensive two-family data: 1025 data sets
43
Availability

Motif Discovery for Proteins by Subsequence
Clustering. Hardik Sheth and Sun Kim. 5th ACM
SIGKDD Workshop on Data Mining in Bioinformatics,
August 2005.
iGibbs: A Motif Discovery Framework for Gibbs
Sampling Algorithm for Proteins. Sun Kim, Zhiping
Wang, and Mehmet Dalkilic (revision).
Try iGibbs online!
More information on our motif projects at
http://bio.informatics.indiana.edu/projects/MOTIF/

44
FUTURE WORK

Extending the algorithm to DNA sequences
Parameter Optimization
More rigorous statistical study
A more general graphical model: in general,
a sequence has several different motifs, not
just one.

45
ACKNOWLEDGEMENTS

Mehmet Dalkilic
Hardik Sheth
Zhiping Wang
Presentation materials by Mehmet Dalkilic and
Hardik Sheth
Bioinformatics Research Group
Supported by
NSF CAREER DBI-0237901,
INGEN (Indiana Genomics Initiatives)
Collaboration in Life Sciences and Informatics
Research (CLSIR), Jun Xie (PI)
