Title: Space Efficient Algorithms for Planted Motif Search
1Space Efficient Algorithms for Planted Motif
Search
- Jaime Davila, Sudha Balla, Sanguthevar Rajasekaran
- CSE Department, University of Connecticut
2Definition of (l,d) Motif Problem
- Given n sequences s1, s2, ..., sn of length m each.
- Find a string x of length l (an l-mer) that appears as a substring in all of them, with at most d mismatches in every occurrence. That is, x almost appears in si for i = 1, ..., n.
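The definition can be checked directly for a candidate string. The sketch below assumes that "almost appears" means Hamming distance at most d to some l-mer of the sequence, which is what the (5,1) example on the following slides requires; the function names are illustrative and do not come from the slides.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def almost_appears(x, s, d):
    """True if some substring of s with len(x) letters is within d mismatches of x."""
    l = len(x)
    return any(hamming(x, s[i:i + l]) <= d for i in range(len(s) - l + 1))


def is_motif(x, seqs, d):
    """True if x almost appears in every sequence, i.e. x is a (len(x), d) motif."""
    return all(almost_appears(x, s, d) for s in seqs)


if __name__ == "__main__":
    seqs = [
        "GGCATCCGATTATTGTAGTCTGG",
        "ATTTCTATGCTAAGCTTGCTCGA",
        "CAGGCTGTAAGTAGTTTGTTAGC",
    ]
    print(is_motif("TTTTT", seqs, 1))  # True
```

This only verifies a given x; finding every such x is the job of the algorithms on the later slides.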
3(l,d) Motif Problem, Example
- s1 GGCATCCGATTATTGTAGTCTGG
- s2 ATTTCTATGCTAAGCTTGCTCGA
- s3 CAGGCTGTAAGTAGTTTGTTAGC
- l = 5, d = 1
4(l,d) Motif Problem, Solution
- s1 GGCATCCGATTATTGTAGTCTGG
- s2 ATTTCTATGCTAAGCTTGCTCGA
- s3 CAGGCTGTAAGTAGTTTGTTAGC
- x = TTTTT is a (5,1) motif: it matches TTATT in s1, TTTCT in s2 and TTTGT in s3 with one mismatch each.
5Motivation
- Mining of Transcription Factor Binding Sites,
which are small sequences of DNA that mark the
beginning of coding regions in DNA. They might
appear slightly modified in different sequences.
6PMS: Simple Planted Motif Search (Rajasekaran et al., 2005)
- l = 5, d = 1
- s1 GGCATCCGATTATTGTAGTCTGG
- s2 ATTTCTATGCTAAGCTTGCTCGA
- s3 CAGGCTGTAAGTAGTTTGTTAGC
7PMS Description
- Build d-vicinities for each l-mer
- s1 GGCATCCGATTATTGTAGTCTGG
- s2 ATTTCTATGCTAAGCTTGCTCGA
- s3 CAGGCTGTAAGTAGTTTGTTAGC
B(ATCCG,1) = {CTCCG, TTCCG, ..., ATCCT}
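A d-vicinity B(x,d) can be enumerated by choosing up to d positions of x and substituting a different letter at each. The following is a minimal sketch assuming the DNA alphabet ACGT; the name `vicinity` is illustrative, not taken from the slides.

```python
from itertools import combinations, product

ALPHABET = "ACGT"  # assumption: DNA alphabet


def vicinity(x, d):
    """Set of all strings over ALPHABET within Hamming distance d of x."""
    result = {x}
    for k in range(1, d + 1):
        for positions in combinations(range(len(x)), k):
            # At each chosen position, allow every letter except the original,
            # so each generated string differs from x in exactly k positions.
            choices = [[c for c in ALPHABET if c != x[p]] for p in positions]
            for letters in product(*choices):
                z = list(x)
                for p, c in zip(positions, letters):
                    z[p] = c
                result.add("".join(z))
    return result


if __name__ == "__main__":
    b = vicinity("ATCCG", 1)
    print(len(b))  # 16: ATCCG itself plus 5 positions times 3 substitutions
```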
8PMS Description
- Let Li be the union of the d-vicinities of all l-mers in sequence si. Sort each Li using radix sort.
- GGCATCCGATTATTGTAGTCTGG → ∪ → L1
- ATTTCTATGCTAAGCTTGCTCGA → ∪ → L2
- CAGGCTGTAAGTAGTTTGTTAGC → ∪ → L3
9PMS Description
- 3) Mi is the intersection of Lk for k = 1, ..., i
- GGCATCCGATTATTGTAGTCTGG → ∪ → L1
- ATTTCTATGCTAAGCTTGCTCGA → ∪ → L2
- CAGGCTGTAAGTAGTTTGTTAGC → ∪ → L3
- M3 = L1 ∩ L2 ∩ L3, which contains TTTTT
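The three PMS steps (build vicinities, form sorted lists Li, intersect them) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: Python's built-in `sorted` stands in for radix sort, the alphabet is assumed to be ACGT, and all function names are hypothetical.

```python
from itertools import combinations, product


def vicinity(x, d, alphabet="ACGT"):
    """Set of all strings within Hamming distance d of x."""
    result = {x}
    for k in range(1, d + 1):
        for positions in combinations(range(len(x)), k):
            choices = [[c for c in alphabet if c != x[p]] for p in positions]
            for letters in product(*choices):
                z = list(x)
                for p, c in zip(positions, letters):
                    z[p] = c
                result.add("".join(z))
    return result


def sorted_list(s, l, d):
    """L_i: sorted, duplicate-free union of B(x, d) over all l-mers x of s."""
    union = set()
    for i in range(len(s) - l + 1):
        union |= vicinity(s[i:i + l], d)
    return sorted(union)  # stand-in for the radix sort named on the slide


def intersect_sorted(a, b):
    """Intersection of two sorted, duplicate-free lists by a single merge pass."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def pms(seqs, l, d):
    """M_n: intersection of L_1, ..., L_n, i.e. all (l, d) motifs of seqs."""
    m = sorted_list(seqs[0], l, d)
    for s in seqs[1:]:
        m = intersect_sorted(m, sorted_list(s, l, d))
    return m


if __name__ == "__main__":
    seqs = [
        "GGCATCCGATTATTGTAGTCTGG",
        "ATTTCTATGCTAAGCTTGCTCGA",
        "CAGGCTGTAAGTAGTTTGTTAGC",
    ]
    print("TTTTT" in pms(seqs, 5, 1))  # True
```

Each `sorted_list` call holds a full Li in memory at once, which is the cost the next slide describes.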
10PMS Drawbacks
- As d increases, the sizes of the Li increase considerably. For n = 20, m = 600, l = 15 and d = 4, the core memory requirement is over 1 GB.
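The growth can be made concrete by counting. A vicinity B(x,d) over a 4-letter alphabet contains the sum over i = 0..d of C(l,i)·3^i strings, and a sequence of length m has m − l + 1 l-mers. The sketch below computes these counts only; how many bytes each entry occupies depends on the encoding, which the slide does not state, so no byte figure is derived here. The function names are illustrative.

```python
from math import comb


def vicinity_size(l, d, alphabet_size=4):
    """Number of strings within Hamming distance d of a fixed l-mer."""
    return sum(comb(l, i) * (alphabet_size - 1) ** i for i in range(d + 1))


def max_list_entries(m, l, d):
    """Upper bound on |L_i|: every l-mer contributes a full vicinity, no overlap."""
    return (m - l + 1) * vicinity_size(l, d)


if __name__ == "__main__":
    print(vicinity_size(15, 4))          # 123841
    print(max_list_entries(600, 15, 4))  # 72570826
```

For l = 15 and d = 4 that is 123,841 strings per vicinity and up to about 72.6 million entries per list before duplicates are removed.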
11PMSi Key idea
- L1 = ∪ B(x,d) (x is an l-mer in s1)
- L2 = ∪ B(y,d) (y is an l-mer in s2)
- M2 = L1 ∩ L2 = (∪ B(x,d)) ∩ (∪ B(y,d))
- = ∪ (B(x,d) ∩ B(y,d)) for all pairs (x, y)
12PMSi Graphical Idea
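- Generate M2 by distributing the intersection over the unions (distributive law).
- GGCATCCGATTATTGTAGTCTGG
- ATTTCTATGCTAAGCTTGCTCGA
- CAGGCTGTAAGTAGTTTGTTAGC
- M2 = ∪ (B(x,d) ∩ B(y,d)) over l-mers x in the first sequence and y in the second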
13PMSi Refinement
- B(x,d) ∩ B(y,d) = ∅ if dist(x,y) > 2d
- GGCATCCGATTATTGTAGTCTGG
- ATTTCTATGCTAAGCTTGCTCGA
- CAGGCTGTAAGTAGTTTGTTAGC
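The refinement follows from the triangle inequality: if some z lies in both vicinities, then dist(x,y) ≤ dist(x,z) + dist(z,y) ≤ 2d, so any pair further apart than 2d can be skipped. A minimal sketch of that filter is shown below; the helper names are illustrative assumptions.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))


def lmers(s, l):
    """All length-l substrings of s, in order of occurrence."""
    return [s[i:i + l] for i in range(len(s) - l + 1)]


def candidates(x, s, d):
    """l-mers y of s with dist(x, y) <= 2d; only these can share a z with B(x, d)."""
    return [y for y in lmers(s, len(x)) if hamming(x, y) <= 2 * d]


if __name__ == "__main__":
    s2 = "ATTTCTATGCTAAGCTTGCTCGA"
    kept = candidates("TTATT", s2, 1)
    print(len(kept), "of", len(lmers(s2, 5)), "l-mers survive the 2d filter")
```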
14PMSi Intersections of vicinities
- x is a fixed l-mer in s1, y is any l-mer in s2.
- ∪ (B(x,d) ∩ B(y,d)), with the union taken over all y
- = { z ∈ B(x,d) : ∃ y, dist(z, y) ≤ d }
- We cache the dist calculations to be more efficient.
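Read as a procedure, the identity says: for each l-mer x of s1, walk through B(x,d) and keep every z that is within d of some l-mer y of s2, so neither L1 nor L2 is ever stored in full. The sketch below illustrates this for two sequences, assuming the ACGT alphabet and using the 2d filter from the refinement slide. The caching of dist mentioned above is not reproduced, and the function names are hypothetical.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))


def lmers(s, l):
    """All length-l substrings of s."""
    return [s[i:i + l] for i in range(len(s) - l + 1)]


def vicinity(x, d, alphabet="ACGT"):
    """Set of all strings within Hamming distance d of x."""
    result = {x}
    for k in range(1, d + 1):
        for positions in combinations(range(len(x)), k):
            choices = [[c for c in alphabet if c != x[p]] for p in positions]
            for letters in product(*choices):
                z = list(x)
                for p, c in zip(positions, letters):
                    z[p] = c
                result.add("".join(z))
    return result


def m2(s1, s2, l, d):
    """M2 = L1 ∩ L2, built one vicinity at a time instead of from full lists."""
    result = set()
    all_ys = lmers(s2, l)
    for x in set(lmers(s1, l)):
        # Only l-mers within 2d of x can share a neighbour with B(x, d).
        ys = [y for y in all_ys if hamming(x, y) <= 2 * d]
        if not ys:
            continue
        for z in vicinity(x, d):
            if any(hamming(z, y) <= d for y in ys):
                result.add(z)
    return result


if __name__ == "__main__":
    s1 = "GGCATCCGATTATTGTAGTCTGG"
    s2 = "ATTTCTATGCTAAGCTTGCTCGA"
    print("TTTTT" in m2(s1, s2, 5, 1))  # True
```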
15PMSi Drawbacks
- The running time grows with the number of l-mers at distance at most 2d from a given l-mer.
- Memory use also rises or falls with this number.
16PMSP Key idea
- We can iterate the basic principle of PMSi, i.e. x is a fixed l-mer in s1, y is any l-mer in s2, w is any l-mer in s3.
- ∪ (B(x,d) ∩ B(y,d) ∩ B(w,d))
- = { z ∈ B(x,d) : ∃ y, dist(z, y) ≤ d and ∃ w, dist(z, w) ≤ d }
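Extending the same test to every remaining sequence gives a procedure that holds only one vicinity B(x,d) in memory at a time, plus the motifs found so far. The following is a minimal sketch of that idea under the assumption of an ACGT alphabet; it is not the authors' implementation, and all names are illustrative.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))


def lmers(s, l):
    """All length-l substrings of s."""
    return [s[i:i + l] for i in range(len(s) - l + 1)]


def vicinity(x, d, alphabet="ACGT"):
    """Set of all strings within Hamming distance d of x."""
    result = {x}
    for k in range(1, d + 1):
        for positions in combinations(range(len(x)), k):
            choices = [[c for c in alphabet if c != x[p]] for p in positions]
            for letters in product(*choices):
                z = list(x)
                for p, c in zip(positions, letters):
                    z[p] = c
                result.add("".join(z))
    return result


def pmsp(seqs, l, d):
    """All (l, d) motifs of seqs, testing each z in B(x, d) against s2, ..., sn."""
    first, rest = seqs[0], seqs[1:]
    found = set()
    for x in set(lmers(first, l)):
        # For each other sequence, keep only l-mers within 2d of x.
        near = [[y for y in lmers(s, l) if hamming(x, y) <= 2 * d] for s in rest]
        if any(not ys for ys in near):
            continue  # some sequence has no l-mer close enough to x
        for z in vicinity(x, d):
            if z in found:
                continue
            if all(any(hamming(z, y) <= d for y in ys) for ys in near):
                found.add(z)
    return found


if __name__ == "__main__":
    seqs = [
        "GGCATCCGATTATTGTAGTCTGG",
        "ATTTCTATGCTAAGCTTGCTCGA",
        "CAGGCTGTAAGTAGTTTGTTAGC",
    ]
    print("TTTTT" in pmsp(seqs, 5, 1))  # True
```

The extra distance tests against every retained l-mer are where the added running time comes from.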
17PMSP Graphical Idea
B(TTATT,1) = {ATATT, ..., TTTTT, ..., TTATG}
- GGCATCCGATTATTGTAGTCTGG
- ATTTCTATGCTAAGCTTGCTCGA
- CAGGCTGTAAGTAGTTTGTTAGC
- Every l-mer whose vicinity is considered is at distance at most 2d from the l-mer in the first sequence.
18PMSP Observations
- We reduce the memory usage drastically
- The running time grows with the number of l-mers at distance at most 2d from a given l-mer.
19Experimental setting
- n = 20, m = 600.
- Every letter of every sequence is generated uniformly and independently at random.
- A challenge instance is one where the expected number of (l,d) motifs is greater than 1, e.g. (11,3), (13,4), (15,5), (17,6)
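Background sequences of this kind can be generated as sketched below. The function name and the `seed` parameter are assumptions added for reproducibility. The slide does not say how a motif is planted into the sequences, so that step is left out.

```python
import random


def random_instance(n=20, m=600, alphabet="ACGT", seed=None):
    """n sequences of length m, each letter drawn uniformly and independently."""
    rng = random.Random(seed)  # private generator, so a seed gives repeatable data
    return ["".join(rng.choice(alphabet) for _ in range(m)) for _ in range(n)]


if __name__ == "__main__":
    seqs = random_instance(seed=0)
    print(len(seqs), len(seqs[0]))  # 20 600
```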
20Results (d = 3)
21Results in Challenging Instances
22Questions?