Title: Motif Discovery in Unaligned and Aligned Protein Sequences
1. Motif Discovery in Unaligned and Aligned Protein Sequences
- Sun Kim
- sunkim2atindiana.edu
- School of Informatics, Center for Genomics and Bioinformatics
- Indiana University Bloomington
- November 2, 2006
2. OUTLINE
- Introduction
- Research Question and Motivation
- Conserved region detection in aligned sequences using aggregated related column scores
- iGibbs: improved Gibbs sampler for proteins
- Future work/Discussion
3. INTRODUCTION
- (Sequence) Motifs
- Short, conserved subsequences across a set of proteins that share similar function
- Seq 1 KGGAKRHRKIL
- Seq 2 KVGAKRHSKRS
- Seq 3 KVGAKRHSRKS
- Seq 4 KGGAKRHRKVL
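As a minimal illustration of what "conserved" means for the four example sequences above, the following Python sketch marks a column as conserved when every sequence carries the same residue there. The function name `conserved_mask` is hypothetical, and the all-identical criterion is an assumption made for illustration; real motif definitions usually tolerate some variation.

```python
def conserved_mask(seqs):
    """Return the shared residue for fully conserved columns and '.' elsewhere."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must have equal length")
    # zip(*seqs) walks the sequences column by column.
    return "".join(col[0] if len(set(col)) == 1 else "." for col in zip(*seqs))


seqs = ["KGGAKRHRKIL", "KVGAKRHSKRS", "KVGAKRHSRKS", "KGGAKRHRKVL"]
print(conserved_mask(seqs))  # K.GAKRH....
```

The conserved core GAKRH stands out, while the last four columns vary between sequences.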
4. Why do we want to identify motifs?
- Useful for
- Predicting a protein's function from its primary sequence
- Predicting the family to which a protein belongs
- Predicting protein structural features
5. Motifs
PS00577: [DENY]-x(2)-[KRI]-[STA]-x(2)-V-G-x-[DN]-x-[FW]-T-[KR]
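A PROSITE pattern such as the one above can be matched against sequences by translating it into a regular expression. The sketch below assumes the multi-letter groups in the pattern are bracketed alternatives in standard PROSITE syntax (for example `[DENY]`). The function name `prosite_to_regex` and the test sequence are hypothetical, and only the basic syntax elements are handled.

```python
import re

# One pattern element: optional '<' anchor, a core, an optional repeat, optional '>' anchor.
_ELEMENT = re.compile(
    r"(<?)"
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})"   # any residue / one residue / allowed set / forbidden set
    r"(?:\((\d+)(?:,(\d+))?\))?"         # repeat count (n) or range (n,m)
    r"(>?)"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression string."""
    parts = []
    for elem in pattern.rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(elem)
        if m is None:
            raise ValueError(f"unsupported pattern element: {elem!r}")
        start, core, lo, hi, end = m.groups()
        if core == "x":
            core = "."                        # x matches any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # {..} lists forbidden residues
        repeat = "" if lo is None else "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


regex = prosite_to_regex("[DENY]-x(2)-[KRI]-[STA]-x(2)-V-G-x-[DN]-x-[FW]-T-[KR]")
print(regex)  # [DENY].{2}[KRI][STA].{2}VG.[DN].[FW]T[KR]
print(re.search(regex, "MMDAAKSAAVGADAFTKMM").span())  # (2, 17)
```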
6. Motif Discovery
- In aligned sequences
- The input sequences are aligned using a multiple sequence alignment algorithm, such as ClustalW, T-Coffee, or WP.
- From the alignment, determine motif regions, typically the most conserved regions. How?
- In unaligned sequences
- A motif model M is typically an alignment of subsequences occurring in the input sequences S.
- Then the motif search problem is to find a model M that maximizes a measure, typically log Pr(S | M).
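To make the objective concrete, the sketch below scores a candidate motif model, given as a list of aligned occurrences, by the log-likelihood of those occurrences under a position-specific frequency model relative to a uniform background. This is one common simplification, not necessarily the exact measure meant on the slide: it scores only the motif occurrences rather than the full sequences S, and the pseudocount, the uniform background, and the name `motif_log_likelihood_ratio` are assumptions.

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def motif_log_likelihood_ratio(instances, pseudocount=1.0):
    """Sum of log2(p_motif / p_background) over all residues of the aligned instances."""
    width = len(instances[0])
    if any(len(s) != width for s in instances):
        raise ValueError("motif instances must have equal width")
    background = 1.0 / len(AMINO_ACIDS)          # assumed uniform background
    denom = len(instances) + pseudocount * len(AMINO_ACIDS)
    total = 0.0
    for col in zip(*instances):                  # one alignment column at a time
        counts = Counter(col)
        for residue in col:
            p = (counts[residue] + pseudocount) / denom
            total += log2(p / background)
    return total


conserved = ["KGAKRH"] * 4
diverse = ["KGAKRH", "ACDEFG", "LMNPQS", "TVWYIR"]
print(motif_log_likelihood_ratio(conserved) > motif_log_likelihood_ratio(diverse))  # True
```

A search procedure would then look for the set of occurrences, one model M, that makes this score as large as possible.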
7. ARCS: An Aggregated Related Column Scoring Scheme for Aligned Sequences
- A method for detecting conserved regions, hopefully motifs, in aligned sequences
8. Challenge in motif discovery in aligned sequences
- Aligning input sequences globally often works very well when the input sequences share enough similarity.
- However, determining conserved regions or motifs in aligned sequences is poorly studied.
9. Motivation for ARCS
- There are a few methods for detecting conserved regions in aligned sequences.
- AL2CO (Pei and Grishin 2001): Two different strategies (unweighted frequencies and weighted frequencies) and three conceptually different approaches (entropy-based, variance-based and matrix score-based) were used.
- ConFind (Smagala et al. 2005): Several criteria, including minimum region length and maximum informational entropy (variability) per position, were used.
- These methods do not use correlation among columns in an alignment, so they often fail to detect conserved regions.
10. How do we measure column correlation?
- To measure column correlation, we use approximate functional dependency (Dalkilic and Robertson, 2000; Giannella and Robertson, 2004).
- Correlation score at column 2 with columns 1 and 3.
11. Aggregated Related Column Score (ARCS) at column i
- Measure the LOGOS score at each column j.
- Measure the approximate functional dependency between column i and its neighbor column j.
- ARCS at column i is the sum, over the neighbor columns j, of
- LOGOS at neighbor column j TIMES
- approximate functional dependency between column i and its neighbor column j.
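12. ARCS formal definition
- $\mathrm{LOGOS}(i) = H_{\max} - H(i)$, where
- $H(i) = -\sum_{e} F_{ie} \log_2 F_{ie}$
- $H_{\max} = \log_2(\min(N_L, n))$
- $FD_{i \to j} = 1 - H_{i \to j} / \log_2 n$
- $H_{i \to j} = \sum_{p} \frac{c_{ip}}{n} \left( \sum_{q} \frac{c_{ip,jq}}{c_{ip}} \log_2 \frac{c_{ip}}{c_{ip,jq}} \right)$
- $\mathrm{ARCS}(i) = \sum_{j \in N(i)} FD_{j \to i} \, \mathrm{LOGOS}(j)$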
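13. ARCS example
- $H_{\max} = \log_2(\min(N_L, n)) = \log_2(\min(5, 4)) = 2$
- $H(1) = -(F_{1M} \log_2 F_{1M} + F_{1W} \log_2 F_{1W}) = -(\tfrac{1}{2} \cdot (-1) + \tfrac{1}{2} \cdot (-1)) = 1$
- $\mathrm{LOGOS}(1) = H_{\max} - H(1) = 1$, $\mathrm{LOGOS}(2) = 0.5$, $\mathrm{LOGOS}(3) = 1.1887$, $\mathrm{LOGOS}(4) = 1.1887$.
- $H_{1 \to 4} = \tfrac{2}{4} (1 \cdot \log_2 1) + \tfrac{2}{4} (\tfrac{1}{2} \log_2 2 + \tfrac{1}{2} \log_2 2) = 0.5$
- $FD_{1 \to 4} = 1 - 0.5 / \log_2 4 = 0.75$
- $\mathrm{ARCS}(1) = FD_{1 \to 1} \, \mathrm{LOGOS}(1) + FD_{2 \to 1} \, \mathrm{LOGOS}(2) = 1.375$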
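The definitions and the worked example can be checked with a short Python sketch. Several things here are assumptions rather than part of the slides: the example alignment itself is not preserved, so the four-sequence alignment `aln` below is made up to reproduce the slide's numbers; `n_letters` stands for the quantity written NL, taken as 5 from the example without further interpretation; and the neighborhood N(i) is read as column i plus its adjacent columns (`radius=1`), which matches the two terms in ARCS(1). Columns are 0-based in the code and 1-based on the slides.

```python
from collections import Counter
from math import log2


def column(aln, i):
    return [row[i] for row in aln]


def entropy(symbols):
    """Shannon entropy H (in bits) of the symbol frequencies."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())


def logos(aln, i, n_letters):
    """LOGOS(i) = HMax - H(i), with HMax = log2(min(n_letters, n))."""
    return log2(min(n_letters, len(aln))) - entropy(column(aln, i))


def fd(aln, i, j):
    """Approximate functional dependency FD(i -> j) = 1 - H(i -> j) / log2(n)."""
    n = len(aln)
    ci, cj = column(aln, i), column(aln, j)
    single = Counter(ci)            # c_ip: count of symbol p in column i
    pair = Counter(zip(ci, cj))     # c_ip,jq: count of the pair (p, q)
    # (c_ip / n) * (c_ip,jq / c_ip) simplifies to c_ip,jq / n.
    h = sum((c / n) * log2(single[p] / c) for (p, _q), c in pair.items())
    return 1 - h / log2(n)


def arcs(aln, i, n_letters, radius=1):
    """ARCS(i) = sum over neighbors j of FD(j -> i) * LOGOS(j)."""
    width = len(aln[0])
    neighbors = range(max(0, i - radius), min(width, i + radius + 1))
    return sum(fd(aln, j, i) * logos(aln, j, n_letters) for j in neighbors)


aln = ["MAGK", "MSGK", "WAGK", "WTVR"]    # hypothetical alignment
print(arcs(aln, 0, n_letters=5))           # 1.375
print(fd(aln, 0, 3))                       # 0.75
```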
14. Smoothed ARCS
- $\text{Smoothed ARCS}(i) = \sum_{0 \le j \le \lfloor (w-1)/2 \rfloor} \mathrm{ARCS}(i \pm j) / w$
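Reading the smoothing formula as a centered moving average over a window of w columns, a minimal Python sketch looks as follows. The function name `smooth` is hypothetical, w is assumed to be odd, and columns outside the alignment are assumed to contribute zero at the edges; the slides do not say how edges are handled.

```python
def smooth(scores, w):
    """Centered moving average: sum of scores within floor((w-1)/2) of i, divided by w."""
    half = (w - 1) // 2
    smoothed = []
    for i in range(len(scores)):
        window = scores[max(0, i - half): i + half + 1]   # truncated at the edges
        smoothed.append(sum(window) / w)
    return smoothed


print(smooth([3.0, 0.0, 3.0, 6.0, 0.0], w=3))  # [1.0, 2.0, 3.0, 3.0, 2.0]
```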
15. Experiments with PROSITE patterns
- We extracted sequence sets from Swiss-Prot, each consisting of sequences that have the same PROSITE pattern.
- Among 1320 patterns, we randomly chose 709 patterns where the number of sequences was not greater than 50.
- Each sequence set was aligned using ClustalW.
- We used 533 multiple sequence alignments to evaluate our method for the case that the alignment is correct.
- Forty-seven alignments were tested for the case that ClustalW aligned only part or none of the motifs.
16. Effect of smoothing window size
17. Smoothed LOGO, AL2CO, and ARCS for PS00702
18. Performance of ARCS on incorrectly aligned sequences
Among the 47 incorrectly aligned protein families, the first peak of ARCS corresponds to part of the motif in up to 40.4% of test cases.
19. ARCS can be used for a new multiple sequence alignment
- If ARCS can highlight true motif regions even when a multiple sequence alignment is only partially correct, ARCS can be used for generating a multiple sequence alignment.
- Indeed, we are trying two different methods:
- WP: A Weighted Position Approach for Multiple Sequence Alignment, a manuscript in preparation.
- An iterative alignment improvement method (similar to a method by Wang, Dalkilic, and Kim, ACM SAC 2004).
20. Pattern Complexity
- We measured the sensitivity of ARCS with respect to motif complexity, which is defined as 1 minus the ratio of the number of exact characters in the pattern to the length of the pattern.
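The complexity measure can be sketched in Python for PROSITE-style patterns. The reading used here is an assumption: an "exact character" is a position fixed to a single residue, and the pattern length counts every position, with `x(n)` counting as n positions and a range `x(n,m)` counted by its lower bound. The function name `pattern_complexity` is hypothetical, and the example pattern is PS00577 with its multi-letter groups read as bracketed alternatives.

```python
import re

_ELEMENT = re.compile(r"<?(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,\d+)?\))?>?")


def pattern_complexity(pattern):
    """Return 1 - (number of exact positions) / (total number of positions)."""
    exact = 0
    length = 0
    for elem in pattern.rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(elem)
        if m is None:
            raise ValueError(f"unsupported pattern element: {elem!r}")
        core, count = m.group(1), int(m.group(2) or 1)
        length += count
        if len(core) == 1 and core != "x":    # a single fixed residue
            exact += count
    return 1 - exact / length


# 15 positions, of which V, G and T are exact: 1 - 3/15 = 0.8 (up to floating-point rounding).
print(pattern_complexity("[DENY]-x(2)-[KRI]-[STA]-x(2)-V-G-x-[DN]-x-[FW]-T-[KR]"))
```

Under this reading, a pattern made only of fixed residues has complexity 0, and a pattern with no fixed residue has complexity 1.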
21. iGibbs: An Improved Gibbs Motif Sampler for Proteins
22. Motif Discovery Problem: a search problem
- Input: N sequences
- Output: set of conserved regions
23. Motif search in unaligned sequences: an optimization problem
24. RELATED RESEARCH
- Existing motif discovery algorithms
- Stochastic
- Gibbs
- MEME
- Combinatorial
- Pratt
- Teiresias
25. Motivation for iGibbs, an improved Gibbs motif sampler
- As shown in previous slides, the search space for motif detection is determined by the input sequence set.
- What if only a subset of the input sequences contains a motif, or what if more than one motif occurs in different subsets?
- We have shown that a sequence clustering approach improved the performance of MEME.
- This time, we improve the performance of Gibbs using a double clustering approach.
26. Why Gibbs?
- Why did we choose to improve the Gibbs motif sampler?
- Gibbs is one of the fast-running motif samplers, yet retains good performance.
- A Gibbs motif sampler predicts different motifs in different executions, since it is a random sampling algorithm (we discuss this further on the next slide).
- Of course, the effect of the input sequence set is not considered in the design of Gibbs.
27. Gibbs motif sampler review
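The content of this review slide is not preserved, so the following is only a minimal sketch of a textbook site sampler, assuming one motif occurrence per sequence and a fixed motif width. It is not the Gibbs program used in the experiments; the function name, the pseudocount, the uniform background, and the toy sequences are all assumptions. Because positions are drawn at random, different seeds can give different motifs, which is the run-to-run variability mentioned earlier.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def gibbs_site_sampler(seqs, width, iterations=200, pseudocount=1.0, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    # Start from a random motif position in every sequence.
    pos = [rng.randrange(len(s) - width + 1) for s in seqs]
    background = 1.0 / len(AMINO_ACIDS)      # assumed uniform background
    for _ in range(iterations):
        for k, seq in enumerate(seqs):
            # Build a position-specific profile from all sequences except sequence k.
            others = [s[p:p + width]
                      for idx, (s, p) in enumerate(zip(seqs, pos)) if idx != k]
            denom = len(others) + pseudocount * len(AMINO_ACIDS)
            profile = [Counter(col) for col in zip(*others)]
            # Weight each candidate start in sequence k by its motif/background ratio.
            weights = []
            for start in range(len(seq) - width + 1):
                ratio = 1.0
                for counts, residue in zip(profile, seq[start:start + width]):
                    ratio *= ((counts[residue] + pseudocount) / denom) / background
                weights.append(ratio)
            # Sample the new start position in proportion to the weights.
            pos[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos


toy = ["MMAKRHLLPQ", "GAKRHTTSVW", "PLQEAKRHNN", "AKRHCCDEFG"]   # hypothetical input
starts = gibbs_site_sampler(toy, width=4, seed=1)
print([s[p:p + 4] for s, p in zip(toy, starts)])
```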
28. Why a clustering approach?
29. RESEARCH PROBLEM OF INTEREST
[Figure: Family 1 with Motif 1 (w1); Family 2 with Motif 2 (w2)]
- Separate non-homologous ones using clustering
- Look at subsequences!
30. Three search frameworks for motif discovery
- One by clustering subsequences.
- with Hardik Sheth
- iGibbs: by clustering whole sequences into clusters of sequences and then clustering the motifs predicted by the Gibbs motif sampler from each cluster.
- with Zhiping Wang and Mehmet Dalkilic
- Iterative subsequence graph splitting approach using a min-cut algorithm (a genetic programming implementation)
- with Yong-Hyuk Kim, Byung-Ro Moon, Seunghee Bae.
31. DESIGN OBJECTIVE
- Algorithm should
- Find motifs from an unknown input set, i.e., work without knowing the expected number of motifs
- Run fast
32. Problem with sequence clustering algorithms
- A lot!
- BAG tends to generate clusters that are too specific or fragmented: a true family is split into multiple clusters.
- Thus we used a double clustering approach in iGibbs.
33. Overview
34. Basic idea: again, the effect of the input sequence set
- The input sequences are divided in two different ways:
- Horizontally, by clustering whole sequences, and
- Vertically, by selecting subsequences
- Then, each partition is input to the motif discovery algorithm, Gibbs in this case.
35. Subsequence selection by patterns
We look for patterns common to all input sequences. However, it is unlikely that a single pattern is common to all sequences. Thus we use multiple patterns that are similar to each other.
36. patRefine: a procedure to select patterns
37. iGibbs: a double clustering approach
- Group input sequences into clusters, Ci
- Predict motifs from each cluster using Gibbs
- Scan the input for occurrences of the predicted motifs.
- Group input sequences into clusters, Dj, based on the motif occurrences.
- Predict the final motifs from Dj
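The five steps above can be sketched as a small pipeline in Python. This is only an outline under stated assumptions, not the iGibbs implementation: the three callables (`cluster_seqs`, `predict_motifs`, `occurs`) are hypothetical stand-ins for the sequence clustering step, the Gibbs motif sampler, and the motif scan, and forming one second-stage cluster per candidate motif is one possible reading of "based on the motif occurrences". The toy stand-ins at the bottom exist only to make the sketch runnable.

```python
def igibbs_pipeline(seqs, cluster_seqs, predict_motifs, occurs):
    # Step 1: group whole sequences into clusters C_i.
    clusters_c = cluster_seqs(seqs)
    # Step 2: predict candidate motifs in each cluster (duplicates dropped, order kept).
    candidates = list(dict.fromkeys(m for c in clusters_c for m in predict_motifs(c)))
    # Steps 3-4: scan all input sequences and regroup them by motif occurrence (D_j).
    clusters_d = [[s for s in seqs if occurs(m, s)] for m in candidates]
    # Step 5: predict the final motifs from each non-empty D_j.
    return [(d, predict_motifs(d)) for d in clusters_d if d]


# Toy stand-ins (hypothetical): a deliberately poor first clustering that splits a family.
def by_last_letter(seqs):
    groups = {}
    for s in seqs:
        groups.setdefault(s[-1], []).append(s)
    return list(groups.values())


def first_shared_4mer(group):
    first = group[0]
    for i in range(len(first) - 3):
        kmer = first[i:i + 4]
        if all(kmer in s for s in group):
            return [kmer]
    return []


seqs = ["AKRHAC", "AKRHGC", "WWPTGG", "WWPTAA"]
result = igibbs_pipeline(seqs, by_last_letter, first_shared_4mer, lambda m, s: m in s)
print(result)
# [(['AKRHAC', 'AKRHGC'], ['AKRH']), (['WWPTGG', 'WWPTAA'], ['WWPT'])]
```

In the toy run the first clustering splits the two WWPT sequences into separate clusters, and the second, motif-based grouping brings them back together, which is the situation the double clustering is meant to handle.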
38. Experiment with PROSITE sequences
- We collected sequence sets with the same PROSITE patterns.
- There are three data sets
- Single-family-set
- Sequence sets with a single PROSITE pattern
- Two-family-set
- Sequence sets with two PROSITE patterns
- More extensive two-family data: 1025 data sets
- We ran Gibbs, new Gibbs, MEME, iGibbs without patRefine, and iGibbs with patRefine.
39. Single-family-set
PPV from 10 runs. G: Gibbs; NG: Gibbs (new version); IUN: iGibbs unguided (no patRefine); I: iGibbs with patRefine; M: MEME
40. Two-family-set
PPV from 10 runs. G: Gibbs; NG: Gibbs (new version); IUN: iGibbs unguided (no patRefine); I: iGibbs with patRefine; M: MEME
41. Runtime
42. More extensive two-family data: 1025 data sets
43. When does the clustering approach work?
44. Availability
- ARCS: An Aggregated Related Column Scoring Scheme for Aligned Sequences
- Bioinformatics, vol. 22, pp. 2326-2332, Oct. 2006
- iGibbs: A Motif Discovery Framework for Gibbs Sampling Algorithm. Try iGibbs online!
- Proteins: Structure, Function, and Bioinformatics, in press
- More information on our motif projects at
- http://bio.informatics.indiana.edu/projects/MOTIF/
45. FUTURE WORK
- Extending the algorithm to DNA sequences
- Parameter optimization
- More rigorous statistical study
- A more general graphical model: in general, a sequence has several different motifs, not just one.
46. ACKNOWLEDGEMENTS
- JeongHyeon Choi, Mehmet Dalkilic, Hardik Sheth, Zhiping Wang (Indiana)
- Jiong Yang, Meng Hu (Case Western)
- Presentation materials by Mehmet Dalkilic and Hardik Sheth
- Bioinformatics Research Group
- Supported by
- NSF CAREER DBI-0237901,
- INGEN (Indiana Genomics Initiatives)