Space Efficient Algorithms for Planted Motif Search

Jaime Davila, IWBRA 2006. Space Efficient Algorithms for ... Jaime Davila, Sudha Balla, Sanguthevar Rajasekaran. CSE Department at University of Connecticut ...

Slides: 23

Transcript and Presenter's Notes

1
Space Efficient Algorithms for Planted Motif
Search

Jaime Davila, Sudha Balla, Sanguthevar
Rajasekaran
CSE Department at University of Connecticut

2
Definition of (l,d) Motif Problem

Given n sequences s1, s2, ..., sn of length m each,
find a string x of length l (an l-mer) that
appears as a substring in all of them, with at
most d mismatches in every occurrence. That is, x
almost appears in si for i = 1, ..., n.
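
As a minimal sketch of this definition (the function names `hamming`, `almost_appears` and `is_motif` are illustrative and do not come from the slides), the check that x is an (l,d) motif could be written directly from the statement above, counting mismatches with the Hamming distance:

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def almost_appears(x: str, s: str, d: int) -> bool:
    """True if some l-mer of s has at most d mismatches with x."""
    l = len(x)
    return any(hamming(x, s[i:i + l]) <= d for i in range(len(s) - l + 1))


def is_motif(x: str, seqs: list[str], d: int) -> bool:
    """True if x almost appears in every sequence, i.e. x is an (l,d) motif."""
    return all(almost_appears(x, s, d) for s in seqs)
```

This is a brute-force check of a single candidate x; it does not search for motifs.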

3
(l,d) Motif Problem, Example

s1 GGCATCCGATTATTGTAGTCTGG
s2 ATTTCTATGCTAAGCTTGCTCGA
s3 CAGGCTGTAAGTAGTTTGTTAGC
l = 5, d = 1

4
(l,d) Motif Problem, Solution

s1 GGCATCCGATTATTGTAGTCTGG
s2 ATTTCTATGCTAAGCTTGCTCGA
s3 CAGGCTGTAAGTAGTTTGTTAGC
x = TTTTT is a (5,1) motif.

5
Motivation

Mining of Transcription Factor Binding Sites,
which are small sequences of DNA that mark the
beginning of coding regions in DNA. They might
appear slightly modified in different sequences.

6
PMS: Simple Planted Motif Search (Raj et al., 2005)

l = 5, d = 1
s1 GGCATCCGATTATTGTAGTCTGG
s2 ATTTCTATGCTAAGCTTGCTCGA
s3 CAGGCTGTAAGTAGTTTGTTAGC

7
PMS Description

1) Build d-vicinities for each l-mer
s1 GGCATCCGATTATTGTAGTCTGG
s2 ATTTCTATGCTAAGCTTGCTCGA
s3 CAGGCTGTAAGTAGTTTGTTAGC

B(ATCCG,1) = {CTCCG, TTCCG, ..., ATCCT}
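
One way to enumerate a d-vicinity is sketched below. It assumes the DNA alphabet ACGT and assumes that B(x,d) contains every string at Hamming distance at most d from x, including x itself; the name `vicinity` is illustrative.

```python
from itertools import combinations, product

ALPHABET = "ACGT"  # assumption: DNA alphabet


def vicinity(x: str, d: int) -> set[str]:
    """All strings over ALPHABET at Hamming distance at most d from x."""
    result = {x}
    for k in range(1, d + 1):
        # Choose which k positions are changed.
        for positions in combinations(range(len(x)), k):
            # At each chosen position, try every letter other than the original.
            choices = [[c for c in ALPHABET if c != x[p]] for p in positions]
            for letters in product(*choices):
                z = list(x)
                for p, c in zip(positions, letters):
                    z[p] = c
                result.add("".join(z))
    return result
```

Under these assumptions, `vicinity("ATCCG", 1)` has 16 elements: ATCCG itself plus 3 substitutions at each of the 5 positions.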
8
PMS Description

2) Let Li be the union of the d-vicinities of all
l-mers in sequence si. Sort each Li using radix sort.
GGCATCCGATTATTGTAGTCTGG
ATTTCTATGCTAAGCTTGCTCGA
CAGGCTGTAAGTAGTTTGTTAGC

[Figure: the vicinities from s1, s2 and s3 are united (∪) into L1, L2 and L3.]

9
PMS Description

3) Mi is the intersection of Lk for k = 1, ..., i
GGCATCCGATTATTGTAGTCTGG
ATTTCTATGCTAAGCTTGCTCGA
CAGGCTGTAAGTAGTTTGTTAGC

M3 = L1 ∩ L2 ∩ L3, where each Li is the union (∪) of the vicinities from si.
TTTTT ∈ M3
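
The three PMS steps can be sketched as follows. This is only an illustration under stated assumptions: the alphabet is ACGT, Python's built-in `sorted` stands in for the radix sort named on the slide, and all function names are hypothetical.

```python
def vicinity(x, d, alphabet="ACGT"):
    """Strings at Hamming distance at most d from x (recursive form)."""
    if d == 0 or not x:
        return {x}
    # First letter kept: up to d mismatches remain for the rest.
    kept = {x[0] + t for t in vicinity(x[1:], d, alphabet)}
    # First letter changed: up to d - 1 mismatches remain for the rest.
    changed = {c + t
               for c in alphabet if c != x[0]
               for t in vicinity(x[1:], d - 1, alphabet)}
    return kept | changed


def sorted_list(s, l, d):
    """Steps 1 and 2: union of the d-vicinities of all l-mers of s, sorted."""
    li = set()
    for i in range(len(s) - l + 1):
        li |= vicinity(s[i:i + l], d)
    return sorted(li)  # stand-in for radix sort


def intersect_sorted(a, b):
    """Intersection of two sorted lists without duplicates (linear merge)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def pms(seqs, l, d):
    """Step 3: M_i is the intersection of L_1, ..., L_i; returns M_n."""
    m = sorted_list(seqs[0], l, d)
    for s in seqs[1:]:
        m = intersect_sorted(m, sorted_list(s, l, d))
    return m
```

Each `sorted_list` call materialises a full Li in memory, which is the cost the next slide discusses.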
10
PMS Drawbacks

As d increases, the sizes of the Li increase
considerably. For n = 20, m = 600, l = 15 and d = 4, the
core memory requirement is over 1 GB.
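
To see where the growth comes from, the size of one vicinity can be counted directly: a string at distance exactly i from x is fixed by choosing i positions and one of 3 other letters at each, so |B(x,d)| is the sum of C(l,i)·3^i for i = 0, ..., d over a 4-letter alphabet. The sketch below uses this count; the function names, and the bound that ignores duplicates among vicinities, are assumptions made for illustration rather than the authors' accounting.

```python
from math import comb


def vicinity_size(l: int, d: int, sigma: int = 4) -> int:
    """Number of strings at Hamming distance at most d from an l-mer."""
    return sum(comb(l, i) * (sigma - 1) ** i for i in range(d + 1))


def list_size_bound(m: int, l: int, d: int, sigma: int = 4) -> int:
    """Upper bound on |L_i|: one vicinity per l-mer, duplicates not removed."""
    return (m - l + 1) * vicinity_size(l, d, sigma)


print(vicinity_size(15, 4))         # 123841
print(list_size_bound(600, 15, 4))  # 72570826
```

If each entry were stored as 15 one-byte characters, this bound would amount to roughly 1.09 × 10^9 bytes for a single Li, which is of the same order as the figure quoted on the slide.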

11
PMSi Key idea

L1 = ∪ B(x,d) (x is an l-mer in s1)
L2 = ∪ B(y,d) (y is an l-mer in s2)
M2 = L1 ∩ L2 = (∪ B(x,d)) ∩ (∪ B(y,d))
   = ∪ (B(x,d) ∩ B(y,d)) for all pairs (x, y)

12
PMSi Graphical Idea

Generate M2 by distributing the intersection over the unions
GGCATCCGATTATTGTAGTCTGG
ATTTCTATGCTAAGCTTGCTCGA
CAGGCTGTAAGTAGTTTGTTAGC

[Figure: union (∪) over pairs of l-mers of the intersection (∩) of their vicinities.]

13
PMSi Refinement

B(x,d) ∩ B(y,d) = ∅ if dist(x,y) > 2d
GGCATCCGATTATTGTAGTCTGG
ATTTCTATGCTAAGCTTGCTCGA
CAGGCTGTAAGTAGTTTGTTAGC
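
The refinement follows from the triangle inequality: if z lies in both vicinities, then dist(x,y) ≤ dist(x,z) + dist(z,y) ≤ 2d. A minimal sketch of the resulting filter is given below; `lmers` and `candidates` are illustrative names, not taken from the slides.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))


def lmers(s: str, l: int) -> list[str]:
    """All substrings of s of length l, in order of occurrence."""
    return [s[i:i + l] for i in range(len(s) - l + 1)]


def candidates(x: str, s: str, d: int) -> list[str]:
    """l-mers y of s with dist(x, y) <= 2d.

    Every other l-mer of s has a vicinity disjoint from B(x, d)
    and can be skipped.
    """
    return [y for y in lmers(s, len(x)) if hamming(x, y) <= 2 * d]
```

For x = TTATT and d = 1, the l-mer TTTCT of s2 is kept (distance 2), while GCTAA is discarded (distance 5).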

14
PMSi Intersections of vicinities

x is a fixed l-mer in s1, y is any l-mer in s2.
∪ (B(x,d) ∩ B(y,d)) = { z ∈ B(x,d) : ∃ y, dist(z, y) ≤ d }
We cached the calculations of dist to be
more efficient.
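
A sketch of how M2 could be produced one vicinity at a time is shown below. It assumes the ACGT alphabet; `functools.lru_cache` is used as one possible way to cache dist, which may differ from the authors' caching scheme, and the function names are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def hamming(a: str, b: str) -> int:
    """Cached Hamming distance between equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))


def lmers(s, l):
    """All substrings of s of length l."""
    return [s[i:i + l] for i in range(len(s) - l + 1)]


def vicinity(x, d, alphabet="ACGT"):
    """Strings at Hamming distance at most d from x."""
    if d == 0 or not x:
        return {x}
    kept = {x[0] + t for t in vicinity(x[1:], d, alphabet)}
    changed = {c + t
               for c in alphabet if c != x[0]
               for t in vicinity(x[1:], d - 1, alphabet)}
    return kept | changed


def pmsi_m2(s1, s2, l, d):
    """M2 built as the union over x of {z in B(x,d) : some y has dist(z,y) <= d}."""
    m2 = set()
    ys = lmers(s2, l)
    for x in lmers(s1, l):
        # Only l-mers within 2d of x can share a vicinity element with x.
        near = [y for y in ys if hamming(x, y) <= 2 * d]
        if not near:
            continue
        for z in vicinity(x, d):
            if any(hamming(z, y) <= d for y in near):
                m2.add(z)
    return m2
```

Only B(x,d) for the current x is held in memory alongside the output, instead of the full lists L1 and L2.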

15
PMSi Drawbacks

The running time grows with the number of
l-mers whose distance from a given l-mer is
at most 2d.
Memory use also grows or shrinks with this
number.

16
PMSP Key idea

We can iterate the basic principle of PMSi, i.e.,
x is a fixed l-mer in s1, y is any l-mer in s2, w
is any l-mer in s3:
∪ (B(x,d) ∩ B(y,d) ∩ B(w,d))
  = { z ∈ B(x,d) : ∃ y, dist(z, y) ≤ d and ∃ w, dist(z, w) ≤ d }
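
Extending the same idea to all n sequences could look like the sketch below. It is an illustration under assumptions (ACGT alphabet, hypothetical function names, no caching or other optimisations) and not the authors' implementation.

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))


def lmers(s, l):
    """All substrings of s of length l."""
    return [s[i:i + l] for i in range(len(s) - l + 1)]


def vicinity(x, d, alphabet="ACGT"):
    """Strings at Hamming distance at most d from x."""
    if d == 0 or not x:
        return {x}
    kept = {x[0] + t for t in vicinity(x[1:], d, alphabet)}
    changed = {c + t
               for c in alphabet if c != x[0]
               for t in vicinity(x[1:], d - 1, alphabet)}
    return kept | changed


def pmsp(seqs, l, d):
    """All (l,d) motifs, generated from vicinities of l-mers of seqs[0] only."""
    motifs = set()
    first, rest = seqs[0], seqs[1:]
    for x in lmers(first, l):
        # For every other sequence, keep only l-mers within 2d of x.
        near = [[y for y in lmers(s, l) if hamming(x, y) <= 2 * d]
                for s in rest]
        if any(not ys for ys in near):
            continue  # some sequence has no l-mer close enough to x
        for z in vicinity(x, d):
            # z is a motif if it almost appears in every remaining sequence.
            if all(any(hamming(z, y) <= d for y in ys) for ys in near):
                motifs.add(z)
    return motifs
```

Memory is dominated by a single vicinity B(x,d) and the result set; the price is the repeated distance computations against the nearby l-mers of every other sequence.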

17
PMSP Graphical Idea

B(TTATT,1) = {ATATT, ..., TTTTT, ..., TTATG}

GGCATCCGATTATTGTAGTCTGG
ATTTCTATGCTAAGCTTGCTCGA
CAGGCTGTAAGTAGTTTGTTAGC
All vicinities considered belong to l-mers at
distance at most 2d from the l-mer in the first sequence.

18
PMSP Observations

We reduce the memory usage drastically.
The running time grows with the number of
l-mers whose distance from a given l-mer is
at most 2d.

19
Experimental setting

n = 20, m = 600.
Every letter of every sequence is generated
uniformly at random and independently.
A challenging instance is one where the expected
number of (l,d) motifs arising by chance is greater
than 1, e.g. (11,3), (13,4), (15,5), (17,6).
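
A common rough estimate of the expected number of chance (l,d) motifs in such random data is sketched below. It treats the m - l + 1 positions in a sequence as independent, which is an approximation, and it is not necessarily the calculation the authors used; `expected_motifs` is an illustrative name.

```python
from math import comb


def expected_motifs(l: int, d: int, n: int = 20, m: int = 600,
                    sigma: int = 4) -> float:
    """Approximate expected number of l-mers that are (l,d) motifs by chance."""
    # Probability that a fixed l-mer is within d mismatches of a random l-mer.
    p = sum(comb(l, i) * (sigma - 1) ** i for i in range(d + 1)) / sigma ** l
    # Probability that it almost appears somewhere in one random sequence,
    # treating the m - l + 1 positions as independent.
    q = 1 - (1 - p) ** (m - l + 1)
    # It must almost appear in all n sequences; sum over all sigma**l l-mers.
    return sigma ** l * q ** n
```

Under this estimate (11,3) comes out above 1, while an easier setting such as (10,2) comes out far below 1. Values close to the threshold depend on modelling details, so the estimate should be read as a rough guide only.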

20
Results (d = 3)
21
Results in Challenging Instances
22
Questions?
