Motif Discovery in Unaligned and Aligned Protein Sequences


1
Motif Discovery in Unaligned and Aligned Protein
Sequences

Sun Kim
sunkim2atindiana.edu
School of Informatics Center for Genomics and
Bioinformatics
Indiana University Bloomington
November 2, 2006

2
OUTLINE

Introduction
Research Question and Motivation
Conserved region detection in aligned sequences
using aggregated related column scores
iGibbs: an improved Gibbs sampler for proteins
Future work/Discussion

3
INTRODUCTION

(Sequence) Motifs
Short, conserved subsequences across a set of
proteins that share similar function
Seq 1 KGGAKRHRKIL
Seq 2 KVGAKRHSKRS
Seq 3 KVGAKRHSRKS
Seq 4 KGGAKRHRKVL
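
As a minimal illustration of what "conserved" means for the four sequences above, the sketch below marks each column that is identical across all sequences and replaces every other column with a dot. The function name `conserved_columns` is hypothetical, and requiring perfect identity is a simplifying assumption; the methods discussed later score conservation more gradually.

```python
def conserved_columns(sequences):
    """Return the shared residue for fully conserved columns, '.' elsewhere.

    Assumes the sequences are already aligned and have equal length.
    """
    return "".join(
        column[0] if len(set(column)) == 1 else "."
        for column in zip(*sequences)
    )


slide_sequences = [
    "KGGAKRHRKIL",
    "KVGAKRHSKRS",
    "KVGAKRHSRKS",
    "KGGAKRHRKVL",
]
consensus = conserved_columns(slide_sequences)
print(consensus)
```

Columns 1 and 3 through 7 are identical in all four sequences, so the shared core is GAKRH preceded by a conserved K.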

4
Why do we want to identify motifs?

Useful for
Predicting a protein's function from its primary
sequence
Predicting the family to which a protein belongs
Predicting protein structural features

5
Motifs
PS00577: [DENY]-x(2)-[KRI]-[STA]-x(2)-V-G-x-[DN]-x-[FW]-T-[KR]
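
A PROSITE-style pattern of this kind can be turned into a regular expression and scanned against a sequence. The sketch below assumes that the multi-letter groups on this slide are bracketed alternatives whose brackets were lost in extraction; `prosite_to_regex` is a hypothetical helper that covers only the syntax elements used here (single residues, `x`, bracketed or braced groups, and repeat counts), and the test peptide is made up for illustration.

```python
import re

# One pattern element: x, a residue, [allowed] or {forbidden}, optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


PS00577 = "[DENY]-x(2)-[KRI]-[STA]-x(2)-V-G-x-[DN]-x-[FW]-T-[KR]"
regex = prosite_to_regex(PS00577)

# Hypothetical peptide built to satisfy the pattern, flanked by extra residues.
hit = re.search(regex, "MM" + "DAAKSAAVGADAFTK" + "LL")
print(regex, hit.span() if hit else None)
```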
6
Motif Discovery

In aligned sequences
The input sequences are aligned using a multiple
sequence alignment algorithm such as ClustalW,
T-Coffee, or WP.
From the alignment, determine the motif regions,
typically the most conserved regions. How?
In unaligned sequences
A motif model M is typically an alignment of
subsequences occurring in the input sequences S.
The motif search problem is then to find a model
M that maximizes a measure, typically log
Pr(S | M).
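
To make the objective concrete, the sketch below scores one candidate model, that is, one choice of start position per sequence. It is a simplified stand-in, not the measure used by any particular tool: M is taken to be a per-position residue frequency table estimated from the chosen windows with a pseudocount, and only the motif windows are scored (residues outside the windows and any background model are ignored). The names `motif_log_likelihood`, `starts`, `width`, and `pseudocount` are all assumptions introduced for illustration.

```python
from collections import Counter
from math import log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def motif_log_likelihood(sequences, starts, width, pseudocount=1.0):
    """Log-probability of the chosen windows under their own frequency table."""
    windows = [seq[s:s + width] for seq, s in zip(sequences, starts)]
    denominator = len(windows) + pseudocount * len(AMINO_ACIDS)
    total = 0.0
    for k in range(width):
        counts = Counter(window[k] for window in windows)
        for window in windows:
            total += log((counts[window[k]] + pseudocount) / denominator)
    return total


sequences = ["KGGAKRHRKIL", "KVGAKRHSKRS", "KVGAKRHSRKS", "KGGAKRHRKVL"]
aligned = motif_log_likelihood(sequences, [2, 2, 2, 2], width=5)   # all GAKRH
shifted = motif_log_likelihood(sequences, [0, 1, 2, 3], width=5)   # out of register
print(aligned > shifted)
```

A search algorithm such as a Gibbs sampler explores the space of start positions looking for the choice that makes a score of this kind as large as possible.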

ARCS: An Aggregated Related Column Scoring Scheme
for Aligned Sequences
A method for detecting conserved, hopefully
motif, regions in aligned sequences

8
Challenge in motif discovery in aligned sequences

Aligning input sequences globally often works
very well when input sequences share enough
similarity.
However, determining conserved regions or motifs
in aligned sequences is poorly studied.

9
Motivation for ARCS

There are a few methods for detecting conserved
regions in aligned sequences.
AL2CO (Pei and Grishin, 2001): two weighting
strategies (unweighted frequencies and weighted
frequencies) and three conceptually different
approaches (entropy-based, variance-based, and
matrix score-based) are used.
ConFind (Smagala et al., 2005): several criteria,
including minimum region length and maximum
informational entropy (variability) per position,
are used.
These methods do not use correlation among
columns in an alignment, thus they often fail to
detect conserved regions.

10
How do we measure column correlation?

To measure column correlation, we use approximate
functional dependency (Dalkilic and Robertson,
2000; Giannella and Robertson, 2004).
Correlation score at column 2 with columns 1 and
3.

11
Aggregated Related Column Score (ARCS) at column i

Measure LOGOS score at each column j.
Measure approximate functional dependency between
column i and its neighbor column j.
ARCS at column i is the sum, over all neighbor
columns j, of LOGOS at column j times the
approximate functional dependency between column
i and column j.

12
ARCS formal definition

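$LOGOS(i) = H_{Max} - H(i)$, where
$H(i) = -\sum_{e} F_{ie} \log_2 F_{ie}$
$H_{Max} = \log_2(\min(N_L, n))$
$FD_{i \to j} = 1 - H_{i \to j} / \log_2 n$
$H_{i \to j} = \sum_{p} \frac{c_{ip}}{n} \left( \sum_{q} \frac{c_{ip,jq}}{c_{ip}} \log_2 \frac{c_{ip}}{c_{ip,jq}} \right)$
$ARCS(i) = \sum_{j \in N(i)} FD_{j \to i} \cdot LOGOS(j)$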

13
ARCS example

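$H_{Max} = \log_2(\min(N_L, n)) = \log_2(\min(5, 4)) = 2$
$H(1) = -(F_{1M} \log_2 F_{1M} + F_{1W} \log_2 F_{1W}) = -(\tfrac{1}{2} \cdot (-1) + \tfrac{1}{2} \cdot (-1)) = 1$
$LOGOS(1) = H_{Max} - H(1) = 1$, $LOGOS(2) = 0.5$,
$LOGOS(3) = 1.1887$, $LOGOS(4) = 1.1887$.
$H_{1 \to 4} = \tfrac{2}{4} (1 \cdot \log_2 1) + \tfrac{2}{4} (\tfrac{1}{2} \log_2 2 + \tfrac{1}{2} \log_2 2) = 0.5$
$FD_{1 \to 4} = 1 - 0.5 / \log_2 4 = 0.75$
$ARCS(1) = FD_{1 \to 1} \cdot LOGOS(1) + FD_{2 \to 1} \cdot LOGOS(2) = 1.375$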
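
The definitions and the worked example can be checked with a short sketch. The alignment figure from this slide did not survive, so `toy_alignment` below is a hypothetical four-sequence alignment chosen only because it reproduces the numbers quoted above; it is not the original figure. Treating N_L as an alphabet size passed in by the caller, and passing the neighborhood N(i) as an explicit list of 0-based column indices, are also assumptions.

```python
from collections import Counter
from math import log2


def column(alignment, i):
    return [seq[i] for seq in alignment]


def logos(alignment, i, n_letters):
    """LOGOS(i) = HMax - H(i), with HMax = log2(min(N_L, n))."""
    col = column(alignment, i)
    n = len(col)
    entropy = -sum((c / n) * log2(c / n) for c in Counter(col).values())
    return log2(min(n_letters, n)) - entropy


def fd(alignment, i, j):
    """Approximate functional dependency FD(i -> j) = 1 - H(i -> j) / log2(n)."""
    col_i, col_j = column(alignment, i), column(alignment, j)
    n = len(col_i)                      # needs n >= 2 so that log2(n) > 0
    count_i = Counter(col_i)
    count_ij = Counter(zip(col_i, col_j))
    # sum_p (c_ip/n) sum_q (c_ipjq/c_ip) log2(c_ip/c_ipjq), with c_ip cancelled
    h = sum((c / n) * log2(count_i[p] / c) for (p, _q), c in count_ij.items())
    return 1 - h / log2(n)


def arcs(alignment, i, neighbors, n_letters):
    """ARCS(i) = sum over j in N(i) of FD(j -> i) * LOGOS(j)."""
    return sum(fd(alignment, j, i) * logos(alignment, j, n_letters)
               for j in neighbors)


# Hypothetical alignment (5 distinct letters, 4 sequences), columns 0-based.
toy_alignment = ["MAAA", "MCAA", "WAAA", "WDCC"]
print(logos(toy_alignment, 0, 5), fd(toy_alignment, 0, 3),
      arcs(toy_alignment, 0, [0, 1], 5))
```

With columns numbered from 0, the three printed values correspond to LOGOS(1), FD 1→4, and ARCS(1) in the slide's 1-based numbering.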

14
Smoothed ARCS

$Smoothed\ ARCS(i) = \sum_{0 \le j \le \lfloor (w-1)/2 \rfloor} ARCS(i \pm j) / w$
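
One reading of the smoothing formula is a moving average over a window centered at column i, with the j = 0 term counted once. The sketch below follows that reading; how the slides treat columns near the ends of the alignment is not stated, so letting missing neighbors contribute zero while still dividing by w is an assumption, as is the name `smooth`.

```python
def smooth(scores, w):
    """Average each score with its floor((w - 1) / 2) neighbors on each side."""
    if w < 1:
        raise ValueError("window size w must be at least 1")
    half = (w - 1) // 2
    smoothed = []
    for i in range(len(scores)):
        window = scores[max(0, i - half):i + half + 1]
        smoothed.append(sum(window) / w)   # missing edge neighbors count as 0
    return smoothed


smoothed_scores = smooth([1.0, 2.0, 3.0, 4.0, 5.0], w=3)
print(smoothed_scores)
```

A larger w flattens isolated single-column spikes, which is the trade-off examined on the next slide.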

15
Experiments with PROSITE patterns

For each PROSITE pattern, we extracted from
Swiss-Prot the set of sequences having that
pattern.
Among 1320 patterns, we randomly chose 709
patterns where the number of sequences was not
greater than 50.
Each sequence set was aligned using ClustalW.
We used 533 multiple sequence alignments to
evaluate our method for the case that the
alignment is correct.
Forty-seven alignments were tested for the case
that ClustalW aligned only part or none of the
motifs.

16
Effect of smoothing window size
17
Smoothed LOGO, AL2CO, and ARCS for PS00702
18
Performance of ARCS on incorrectly aligned
sequences
Among the 47 incorrectly aligned protein families,
the first peak of ARCS corresponds to part of
the motif in up to 40.4% of test cases.
19
ARCS can be used for a new multiple seq alignment

If ARCS can highlight true motif regions even
when a multiple sequence alignment is only
partially correct, ARCS can be used for
generating a multiple sequence alignment.
Indeed, we are trying two different methods:
WP: A Weighted Position Approach for Multiple
Sequence Alignment (manuscript in preparation).
An iterative alignment improvement method
(similar to a method by Wang, Dalkilic, and Kim,
ACM SAC 2004).

20
Pattern Complexity

We measured the sensitivity of ARCS with respect
to pattern complexity, which is defined as
1 - (number of exact characters in the pattern) /
(length of the pattern).
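
The complexity measure can be computed directly from a PROSITE-style pattern string. The sketch below rests on two assumptions that the slide does not spell out: an "exact character" is a position that allows exactly one residue, and the pattern length is the number of positions after expanding fixed repeats such as `x(2)`. Variable-length elements such as `x(2,4)` are deliberately rejected, and `pattern_complexity` is a hypothetical name.

```python
import re

# x, a single residue, [allowed] or {forbidden}, with an optional fixed repeat (n).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)\))?")


def pattern_complexity(pattern):
    """Return 1 - (exact positions / total positions) for a PROSITE-style pattern."""
    exact = 0
    length = 0
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = match.groups()
        positions = int(repeat) if repeat else 1
        length += positions
        if len(core) == 1 and core != "x":   # a single fixed residue
            exact += positions
    return 1 - exact / length


# Pattern from the earlier Motifs slide, with bracketed groups assumed.
ps00577_complexity = pattern_complexity(
    "[DENY]-x(2)-[KRI]-[STA]-x(2)-V-G-x-[DN]-x-[FW]-T-[KR]"
)
print(ps00577_complexity)
```

Under these assumptions the pattern has 15 positions, of which only V, G, and T are exact, so its complexity is 1 - 3/15 = 0.8.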

iGibbs: An Improved Gibbs Motif Sampler for
Proteins

22
Motif Discovery Problem: a search problem

Input: N sequences
Output: set of conserved regions

23
Motif search in unaligned seqs: an optimization
problem
24
RELATED RESEARCH

Existing motif discovery algorithms
Stochastic
Gibbs
MEME
Combinatorial
Pratt
Teiresias

25
Motivation for iGibbs, Improved Gibbs motif
sampler

As shown in previous slides, the search space for
motif detection is determined by the input
sequence set.
What if only a subset of the input seqs contains
a motif, or what if more than one motif occurs,
each in a different subset?
We have shown that a sequence clustering approach
improved the performance of MEME.
This time, we will improve the performance of
Gibbs using a double clustering approach.

26
Why Gibbs?

Why did we choose to improve the Gibbs motif
sampler?
Gibbs is one of the fast-running motif samplers,
yet retains good performance.
A Gibbs motif sampler predicts different motifs
on different executions, since it is a random
sampling algorithm (we discuss more in the next
slide).
Of course, the effect of the input seq set is not
considered in the design of Gibbs.

27
Gibbs motif sampler review
28
Why a clustering approach?
29
RESEARCH PROBLEM OF INTEREST
w1
Family 1
Motif 1

Separate non-homologous ones using clustering

w2
Family 2
Motif 2
Look at subsequences!
30
Three Search frameworks for motif discovery

One by clustering subsequences.
with Hardik Sheth
iGibbs: by clustering whole sequences into
clusters of sequences and then clustering the
motifs predicted by the Gibbs motif sampler from
each cluster.
with Zhiping Wang and Mehmet Dalkilic
Iterative subsequence graph splitting approach
using a min-cut algorithm (a genetic programming
implementation)
with Yong-Hyuk Kim, Byung-Ro Moon, Seunghee
Bae.

31
DESIGN OBJECTIVE

Algorithm should
Find motifs from an unknown input set, i.e., work
without knowing the expected number of motifs
Run fast

32
Problem with sequence clustering algorithms

A lot!
BAG tends to generate clusters that are too
specific or fragmented: a true family is split
into multiple clusters.
Thus we used a double clustering approach in
iGibbs.

33
Overview
34
Basic idea: again, the effect of the input
sequence set

The input sequences are divided in two different
ways:
Horizontally, by clustering whole sequences, and
Vertically, by selecting subsequences.
Then, each partition is input to the motif
discovery algorithm, Gibbs in this case.

35
Subsequence selection by patterns
We look for patterns common to all input
sequences. However, it is unlikely that a single
pattern is common to all sequences. Thus we use
multiple patterns that are similar to each
other.
36
patRefine: a procedure to select patterns

37
iGibbs: a double clustering approach

Group input sequences in clusters, Ci
Predict motifs using Gibbs
Scan the input for occurrences of the motifs
predicted.
Group input sequences into clusters, Dj, based
on the motif occurrences.
Predict the final motifs from Dj.
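
The control flow of the five steps above can be sketched with simple stand-ins. Nothing below is the actual iGibbs code: `most_common_kmer` replaces the Gibbs motif sampler, `cluster_by_first_residue` replaces a real sequence clustering tool such as BAG, an exact substring test replaces motif scanning, and regrouping sequences by the set of motifs they contain is one assumed way to form the Dj clusters. The toy sequences are invented.

```python
from collections import Counter, defaultdict


def most_common_kmer(seqs, k=4):
    """Stand-in motif finder: the k-mer present in the most sequences."""
    counts = Counter()
    for s in seqs:
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return min(counts, key=lambda kmer: (-counts[kmer], kmer))


def cluster_by_first_residue(seqs):
    """Stand-in clustering that deliberately fragments families."""
    groups = defaultdict(list)
    for s in seqs:
        groups[s[0]].append(s)
    return list(groups.values())


def double_clustering(seqs, cluster, predict_motif):
    # Steps 1-2: cluster whole sequences (Ci) and predict a motif per cluster.
    motifs = sorted({predict_motif(c) for c in cluster(seqs)})
    # Steps 3-4: scan every input sequence and regroup by motif occurrence (Dj).
    regrouped = defaultdict(list)
    for s in seqs:
        hits = tuple(m for m in motifs if m in s)
        if hits:
            regrouped[hits].append(s)
    # Step 5: predict the final motif from each Dj.
    return {hits: predict_motif(group) for hits, group in regrouped.items()}


toy = ["KGGAKRHR", "KVGAKRQS", "TTGAKRLL", "TAGAKRDD", "MWPLFAA", "MCWPLFD"]
result = double_clustering(toy, cluster_by_first_residue, most_common_kmer)
print(result)
```

In this toy run the first clustering splits the GAKR family into two fragments, and the second grouping by motif occurrence merges them again, which is the situation the double clustering approach is meant to handle.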

38
Experiment with PROSITE sequences

We collected sequence sets with the same PROSITE
patterns.
There are three data sets:
Single-family-set
Sequence sets with a single PROSITE pattern
Two-family-set
Sequence sets with two PROSITE patterns
More extensive two-family data: 1025 data sets
We ran Gibbs, new Gibbs, MEME, iGibbs without
patRefine, iGibbs with patRefine.

39
Single-family-set
PPV from 10 runs. G: Gibbs; NG: Gibbs (new
version); IUN: iGibbs unguided (no patRefine);
I: iGibbs with patRefine; M: MEME
40
Two-family-set
PPV from 10 runs. G: Gibbs; NG: Gibbs (new
version); IUN: iGibbs unguided (no patRefine);
I: iGibbs with patRefine; M: MEME
41
Runtime
42
More extensive two-family data: 1025 data sets
43
When does the clustering approach work?
44
Availability

ARCS: An Aggregated Related Column Scoring Scheme
for Aligned Sequences
Bioinformatics, vol. 22, pp. 2326-2332, Oct. 2006
iGibbs: A Motif Discovery Framework for Gibbs
Sampling Algorithm. Try iGibbs online! Proteins:
Structure, Function, and Bioinformatics, in press
More information on our motif projects at
http://bio.informatics.indiana.edu/projects/MOTIF/

45
FUTURE WORK