Title: Sparse LCS Common Substring Alignment
1Sparse LCS Common Substring Alignment
- Gad M. Landau, Baruch Schieber and Michal Ziv-Ukelson
- CPM 2003
2Outline
- Introduction
- Preliminaries
- The algorithm
- Totally Monotone Rectangular Matrix
- Conclusions and Open Problems
3Common Substring Alignment
- Input: a set of strings S1, S2, ..., Sc and a target string T
- Output: the similarity of every string Si with T
- Under the LCS similarity metric
- Example:
  - S1 = ababaa
  - S2 = aabbb
  - T = abab
  - Sim(S1, T) = 4
  - Sim(S2, T) = 3
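The example values can be checked with the textbook LCS dynamic program. This is a minimal sketch, not the sparse algorithm of the talk; the function name `lcs_length` is chosen here for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)          # DP row for the previous character of a
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


T = "abab"
print(lcs_length("ababaa", T))  # 4
print(lcs_length("aabbb", T))   # 3
```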
4Application
- Molecular biology
- Search for the most similar strings in a database
5Main idea
- Y is the common substring of the Si
- Don't compute the similarity between Y and T over and over again.
- Exploit the sparsity of LCS
6DP Graph
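[Figure: DP graph G of Si against T, with Si split into Bi, Y, Fi. The Bi and Fi parts vary by Si; the Y part has the same structure for every Si, and this is where the speed-up applies.]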
7Three stages
- In Common Substring Alignment
- Preprocessing stage
- Encoding stage
- Alignment stage
8Preprocessing stage
- Parsed for the optimal common substring compromise
- Si = Bi Y Fi
- Example: T = BCBADBDCD, Y = BCBD
  - S1 = BC BCBD C: B1 = BC, F1 = C
  - S2 = E BCBD DBCBD A:
    - B2a = E, F2a = DBCBDA
    - B2b = EBCBDD, F2b = A
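A small sketch of the split Si = Bi Y Fi, one pair per occurrence of Y. The helper name `split_on_common_substring` is hypothetical, and the sketch assumes Y is already known.

```python
def split_on_common_substring(s, y):
    """Return every (B, F) pair such that s == B + y + F."""
    splits = []
    start = s.find(y)
    while start != -1:
        splits.append((s[:start], s[start + len(y):]))
        start = s.find(y, start + 1)   # also finds overlapping occurrences
    return splits


print(split_on_common_substring("BCBCBDC", "BCBD"))
# [('BC', 'C')]
print(split_on_common_substring("EBCBDDBCBDA", "BCBD"))
# [('E', 'DBCBDA'), ('EBCBDD', 'A')]
```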
9In this paper
- We assume that Y is given. We focus on the
following two stages.
10Encoding Stage
- A data structure is constructed which encodes the comparison of Y with T
- Goal: to speed up the alignment stage
11Alignment Stage
- Align Si with T
- Use the precompiled data structure to align Y and T
12Notation
- n: the size of the compared strings Si and T
- L = max{ |LCS[T, Si]| }
- Ly = |LCS[T, Y]|
- (Ly ≤ |Y|, Ly ≤ L, L ≤ n)
13Results
Sparsity of LCS: Ly << |Y|, L << n
- Previous result (SIAM 2001)
  - Encoding stage: O(n^2 + n|Y|)
  - Alignment stage: O(n)
- In this paper
  - Encoding stage: O(nLy)
  - Alignment stage: O(L)
14Our goal now
?
15DP Graph
G
- A[u..z]: the substring of A from index u to z, 1 ≤ u ≤ z ≤ n
- I[j] = |LCS[T[1..j], Bi]|
  - the weight of an optimal path from (0,0) to vertex j of input row I
- O[j] = |LCS[T[1..j], Bi Y]|
16Observation
- In a given row of the DP graph, the LCS values have two properties
  - They are non-decreasing
  - Adjacent values differ by at most 1
- ????match
17Some alternative
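- I[j] = |LCS[T[1..j], Bi]|
  - the weight of an optimal path from (0,0) to vertex j of input row I
- O[j] = |LCS[T[1..j], Bi Y]|
- For k = 0, ..., L
  - PI[k]
    - the start index of the block of weight k in row I
    - i.e. the leftmost path of weight k from (0,0) to row I in the DP graph
  - PO[k]
    - the start index of the block of weight k in row O
    - i.e. the leftmost path of weight k from (0,0) to row O in the DP graph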
18Therefore
- PI[k] and PO[k] are sufficient to represent I[j] and O[j]
[Figure: example rows with PI = 0, 1, ... and PO = 0, 1, 3, 5, 6]
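Because a DP row is non-decreasing with steps of at most 1, it can be stored as the list of indices where each weight first appears. The sketch below shows the round trip; the function names and the sample row are illustrative assumptions.

```python
def to_block_starts(row):
    """starts[k] = smallest index j with row[j] == k (the PI / PO encoding)."""
    starts = []
    for j, weight in enumerate(row):
        if weight == len(starts):      # first time this weight appears
            starts.append(j)
    return starts


def from_block_starts(starts, length):
    """Rebuild the full row of the given length from its block starts."""
    row, k = [], 0
    for j in range(length):
        if k + 1 < len(starts) and starts[k + 1] == j:
            k += 1
        row.append(k)
    return row


row_o = [0, 1, 1, 2, 2, 3, 4, 4]       # a hypothetical output row O
po = to_block_starts(row_o)
print(po)                               # [0, 1, 3, 5, 6]
print(from_block_starts(po, len(row_o)) == row_o)  # True
```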
19Claim
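- Only the positions PI[r] are sufficient for computing PO[k], for r, k = 0, ..., L
  - Among the indices of row I, only the block starts PI[r] need to be considered as entry points for paths reaching row O
- Proof
  - Let i1 = PI[k] and i3 = PI[k+1], if defined
  - For any index i2 with i1 < i2 < i3 (so I[i1] = I[i2]) and any index j in row O:
  - I[i1] + |LCS[T[i1+1..j], Y]| ≥ I[i2] + |LCS[T[i2+1..j], Y]|
  - (a path entering at i1 is at least as good as a path entering at i2)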
20Objective now!!
- Given vector PI, compute vector PO!
[Figure: row PI entering Y and the unknown row PO leaving Y, along T]
21Observation
- When computing PO[k], only the PI[r] with 0 ≤ r ≤ k are candidates
  - A path that already has weight greater than k in row I cannot become a k-path in row O
22PO
23The Algorithm
- Encoding stage
  - Build table S in O(n LCS(Y,T))
- Alignment stage
  - Construct LEFT in O(n)
  - Find the column minima of LEFT in O(n), using total monotonicity
24
[Figure: example DP graph starting at vertex 0; only the character labels of the strings survive]
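25
[Figure: DP graph of T against Bi, Y, Fi. A path enters Y at PI[r] with weight r and leaves Y at column j = PO[k] with weight k, so k - r = |LCS[T[PI[r]+1..j], Y]|.]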
26Find Optimal SubPath
[Figure: DP graph of T against Bi, Y, Fi, with row PI entering Y and row PO leaving Y]
27Encoding Stage
- Preprocessing: Si is unknown
- Table S: alignment of T and Y
[Figure: DP graph of T against Bi, Y, Fi, with rows PI and PO]
S[i, w] = min{ j : |LCS[T[i+1..j], Y]| = w }
28Algorithm
S[i, w] = min{ j : |LCS[T[i+1..j], Y]| = w }
- for i = 0 to |T|
  - S[i, 0] ← i
  - for k = 0 to ...
    - S[i, k+1] ← S[i, k] + d
  - next k
- next i
29Observation
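S[1, 0] = 1
S[1, 1] = S[1, 0] + (distance to the next match) = 1 + 1 = 2
S[1, 2] = S[1, 1] + (distance to the next match) = 2 + 2 = 4
[Figure: T along the columns (positions numbered) against Y = BDB along the rows]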
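A brute-force reference for table S, taken directly from the definition S[i, w] = min{ j : |LCS[T[i+1..j], Y]| = w }. It is only a sketch for checking small cases and is much slower than the O(n LCS(Y,T)) construction described in the talk. The string T = BBCBADDCBA is an assumption read from the slide figures; with Y = BDB it reproduces S[1, 1] = 2 and S[1, 2] = 4.

```python
def build_s_table(t, y):
    """S[i][w] = smallest j such that |LCS(t[i+1..j], y)| == w (1-based j)."""
    n = len(t)
    table = []
    for i in range(n + 1):
        row = [i]                      # S[i][0] = i: the empty substring
        prev = [0] * (len(y) + 1)      # LCS of t[i+1..j-1] with prefixes of y
        for j in range(i + 1, n + 1):
            cur = [0] * (len(y) + 1)
            for q in range(1, len(y) + 1):
                if t[j - 1] == y[q - 1]:
                    cur[q] = prev[q - 1] + 1
                else:
                    cur[q] = max(prev[q], cur[q - 1])
            if cur[-1] == len(row):    # the LCS just reached a new weight
                row.append(j)
            prev = cur
        table.append(row)
    return table


S = build_s_table("BBCBADDCBA", "BDB")
print(S[1])  # [1, 2, 4, 9]
```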
30Finding the next match
- O(|Alphabet| (|Y| + |T|)) preprocessing
- O(1) to find the next match
31Preprocessing
- Finding all matches
  - for each alphabet symbol, scan Y and T for its positions
  - matches of a symbol, e.g. B: (positions of B in Y) × (positions of B in T)
  - construct a fast-find structure
[Figure: T along the columns against Y = BDB along the rows]
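One possible fast-find structure is a next-occurrence table: O(|Alphabet| · |text|) to build and O(1) per lookup. The sketch below is an assumption about what such a structure could look like, not the one used in the paper; the name `build_next_occurrence` and the sample string are illustrative.

```python
def build_next_occurrence(text, alphabet):
    """nxt[c][i] = smallest 1-based position p > i with text[p] == c, else None."""
    n = len(text)
    nxt = {c: [None] * (n + 1) for c in alphabet}
    for c in alphabet:
        last = None
        for i in range(n, -1, -1):     # sweep right to left
            nxt[c][i] = last
            if i >= 1 and text[i - 1] == c:
                last = i
    return nxt


nxt = build_next_occurrence("BBCBADDCBA", "ABCD")
print(nxt["B"][2])  # 4
print(nxt["D"][0])  # 6
print(nxt["C"][8])  # None
```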
32Algorithm
S[i, w] = min{ j : |LCS[T[i+1..j], Y]| = w }
- for i = 0 to |T|
  - S[i, 0] ← i
  - for k = 0 to O(LCS(Y,T))
    - S[i, k+1] ← S[i, k] + d
  - next k
- next i
33The Inner Loop: O(LCS(Y,T))
[Figure: T along the columns against Y = BDB along the rows]
S[i, k+1] ← S[i, k] + d
34Complexity
- Assume |T| > |Y|
- Preprocessing: O(|Alphabet| (|Y| + |T|))
- The inner loop: O(LCS(Y,T))
- The outer loop: O(|T|)
- Overall: O(|T| LCS(Y,T))
for i = 0 to |T|
    S[i, 0] ← i
    for k = 0 to O(LCS(Y,T))
        S[i, k+1] ← S[i, k] + d
    next k
next i
36Alignment Stage
PO[k] = min over 0 ≤ r ≤ k of S[PI[r], k - r]
[Figure: DP graph of T against Bi, Y, Fi, with row PI entering Y and row PO leaving Y]
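The formula PO[k] = min over r of S[PI[r], k - r] can be tried directly. The sketch below evaluates it by brute force (quadratic in L, without the totally monotone speed-up) and compares the result with a direct DP over Bi Y. The strings T = BBCBADDCBA, Y = BDB and Bi = CA are illustrative assumptions, and table S is built naively.

```python
def lcs_row(t, a):
    """row[j] = |LCS(t[1..j], a)| for j = 0..len(t)."""
    row = [0] * (len(t) + 1)
    for ch in a:
        diag = 0                       # value of the previous row at j - 1
        for j in range(1, len(t) + 1):
            above = row[j]
            row[j] = diag + 1 if t[j - 1] == ch else max(above, row[j - 1])
            diag = above
    return row


def block_starts(row):
    """starts[k] = smallest j with row[j] == k."""
    starts = []
    for j, weight in enumerate(row):
        if weight == len(starts):
            starts.append(j)
    return starts


def compute_po(pi, s):
    """PO[k] = min over 0 <= r <= k of S[PI[r]][k - r], skipping undefined entries."""
    po = []
    while True:
        k = len(po)
        candidates = [s[pi[r]][k - r]
                      for r in range(min(k, len(pi) - 1) + 1)
                      if k - r < len(s[pi[r]])]
        if not candidates:
            return po
        po.append(min(candidates))


T, Y, B = "BBCBADDCBA", "BDB", "CA"
S = [[i + j for j in block_starts(lcs_row(T[i:], Y))] for i in range(len(T) + 1)]
PI = block_starts(lcs_row(T, B))
print(PI)                              # [0, 3, 5]
print(compute_po(PI, S))               # [0, 1, 2, 6, 9]
print(block_starts(lcs_row(T, B + Y))) # [0, 1, 2, 6, 9], the direct DP agrees
```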
37Construction of Left(1)
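PO[k] = min over 0 ≤ r ≤ k of S[PI[r], k - r]
38Construction of Left(2)
[Figure: PO[k] is the minimum of column k of LEFT]
39Construction of Left(3)
[Figure: matrix LEFT; the minima of columns 0, 1, ..., L give PO[0], PO[1], ..., PO[L]]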
40Undefined Region in LEFT
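S[i, w] = min{ j : |LCS[T[i+1..j], Y]| = w }
[Figure: matrix LEFT with its undefined regions marked; the minima of columns 0, 1, ..., L give PO[0], PO[1], ..., PO[L]]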
41Good Property of Left
- Totally monotone rectangular matrix: convex or concave
42Reduced Problem
- Find the minimum value of each column
- For an n×n totally monotone matrix: O(n)
43Find Column Minima Recursively
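- Minima(A[m×n])
  - B[n×n] ← Reduce(A[m×n]), the row-elimination step
  - If B has a single row
    - return the positions of the minima
  - Interpolate using Minima(Half(B)), where Half(B) keeps every other column of B
  - return the positions of the minima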
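For experimenting with column minima, here is a simpler stand-in for the linear-time recursion: a divide-and-conquer search that takes roughly O((m + n) log n) time. It is not the algorithm of the slides. It assumes the convex property "for a < b, M[a][c] ≥ M[b][c] implies M[a][d] ≥ M[b][d] for d > c", under which the bottommost minimum moves downward as the column index grows. The function name and the test matrix are illustrative.

```python
def column_minima(matrix):
    """Row index of the bottommost minimum of each column of a convex
    totally monotone matrix, found by divide and conquer."""
    rows, cols = len(matrix), len(matrix[0])
    best = [0] * cols

    def solve(c_lo, c_hi, r_lo, r_hi):
        if c_lo > c_hi:
            return
        mid = (c_lo + c_hi) // 2
        b = r_lo
        for r in range(r_lo, r_hi + 1):
            if matrix[r][mid] <= matrix[b][mid]:   # "<=" keeps the bottommost
                b = r
        best[mid] = b
        solve(c_lo, mid - 1, r_lo, b)   # minima to the left lie at or above b
        solve(mid + 1, c_hi, b, r_hi)   # minima to the right lie at or below b

    solve(0, cols - 1, 0, rows - 1)
    return best


# (c - 2r)^2 satisfies the convex property stated above.
M = [[(c - 2 * r) ** 2 for c in range(6)] for r in range(4)]
rows_of_minima = column_minima(M)
print([M[r][c] for c, r in enumerate(rows_of_minima)])  # [0, 1, 0, 1, 0, 1]
```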
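46Reduce: m×n → n×n
[Figure: the row-elimination step on an m×n matrix with m > n, showing step types A and B]
47Reduce at the n-th row
[Figure: an m×n matrix, showing step type C]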
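48Complexity of Reduce: O(m)
- At most m - n deletions
  - (type B steps) + (type C steps) = O(m - n)
- The column pointer is at most n
  - (type A steps) - (type B steps) = O(n)
- A + B + C ≤ (A - B) + 2(B + C) = O(n + 2(m - n))
- = O(2m - n) = O(m)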
51Complexity of Interpolate: O(n)
52Complexity
- f(m, n) = Reduce + f(n, n/2) + Interpolate
  = O(m) + f(n, n/2) + O(n)
  = O(m) + O(n) + O(n) + O(n/2) + f(n/2, n/4)
  = O(m) + 2 O(n) + O(n/2) + O(n/4) + ...
- = O(m)
- LEFT is an n×n matrix ⇒ O(n)
54The remaining Part
- Goal
- Prove that Left is a totally monotone matrix,
- so that the linear-time algorithm above can be applied to Left
55A little reminder of definition
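- S[i, j]
  - for a path that enters G at index i
  - and gains weight j inside G, the leftmost index at which it leaves G
[Figure: a path entering G at index i and gaining weight j = 2]
56A little reminder of definition
- Left[r, i]
  - for a path that enters G with weight r and leaves G with weight i, the leftmost index at which it leaves G
[Figure: a path entering G with weight r and leaving G with weight i]
By definition, Left[i, j] = S[PI[i], j - i]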
57Some observation on Left
Since Left[i, j] means the LCS path has weight j when leaving Y, for some large j the path may not exist.
- There are some undefined entries in Left
Left[i, j] = S[PI[i], j - i]
- For j < i, Left[i, j] is undefined
[Figure: matrix Left with rows i and columns j; the undefined entries are marked X]
58Lemma1
- Each row of LEFT, say LEFT[r], consists of
  - a (possibly empty) span of undefined entries, followed by
  - a span of defined entries from LEFT[r, r] to LEFT[r, k_r], followed by
  - a (possibly empty) span of undefined entries
i.e. no interleaving
[Figure: row LEFT[i] with a possibly empty undefined span on each side of the defined entries]
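The shape described by the lemma can be seen on a small instance. The sketch below builds LEFT from the definition Left[r, k] = S[PI[r], k - r], using a naive table S and `None` for undefined entries. The strings T = BBCBADDCBA, Y = BDB and Bi = CA are illustrative assumptions, as are all function names.

```python
def lcs_row(t, a):
    """row[j] = |LCS(t[1..j], a)| for j = 0..len(t)."""
    row = [0] * (len(t) + 1)
    for ch in a:
        diag = 0
        for j in range(1, len(t) + 1):
            above = row[j]
            row[j] = diag + 1 if t[j - 1] == ch else max(above, row[j - 1])
            diag = above
    return row


def block_starts(row):
    """starts[k] = smallest j with row[j] == k."""
    starts = []
    for j, weight in enumerate(row):
        if weight == len(starts):
            starts.append(j)
    return starts


def build_left(t, b, y):
    """LEFT[r][k] = S[PI[r]][k - r], or None where that entry is undefined."""
    pi = block_starts(lcs_row(t, b))
    s = [[i + j for j in block_starts(lcs_row(t[i:], y))]
         for i in range(len(t) + 1)]
    width = max(r + len(s[pi[r]]) for r in range(len(pi)))
    left = [[None] * width for _ in pi]
    for r, start in enumerate(pi):
        for w, j in enumerate(s[start]):
            left[r][r + w] = j         # row r is defined from column r onward
    return left


for row in build_left("BBCBADDCBA", "CA", "BDB"):
    print(row)
# [0, 1, 2, 9, None]
# [None, 3, 4, 6, 9]
# [None, None, 5, 6, 9]
```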
59Proof
Main idea: if Left[r, i] and Left[r, j] are defined (i < j), then Left[r, k] is defined as well (for i ≤ k ≤ j)
Left[r, i] and Left[r, j] are defined
⇒ S[PI[r], i - r] and S[PI[r], j - r] are defined
⇒ entering Y at PI[r], there are paths that gain weight (i - r) and (j - r) inside Y
⇒ entering Y at PI[r], there is a path that gains weight (k - r) inside Y, for i ≤ k ≤ j
⇒ S[PI[r], k - r] is defined
⇒ Left[r, k] is defined, for i ≤ k ≤ j
60Proof (continued)
- Since Left[r, i] = S[PI[r], i - r]
- ⇒ for i < r, S[PI[r], i - r] is not defined
- ⇒ for i < r, Left[r, i] is not defined
- ⇒ Left[r] starts its consecutive defined entries from Left[r, r]
61Lemma 2 (Convex Total Monotone)
- For defined entries,
  - if a < b and Left[a, c] ≥ Left[b, c]
  - then Left[a, c+1] ≥ Left[b, c+1]
[Figure: rows a and b of Left, columns c and c+1]
62Proof
[Figure: paths A_{c+1} and B_c through Y, along T]
- A_{c+1}: an optimal path that has weight (c+1) and passes the vertex PI[a] when entering Y
- B_c: an optimal path that has weight c and passes the vertex PI[b] when entering Y
63
- Since a < b
  - ⇒ PI[a] < PI[b]
  - and Left[b, c] ≤ Left[a, c] < Left[a, c+1]
- So A_{c+1} and B_c would cross
64Consider Two cases
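[Figure: A_{c+1} and B_c split at their crossing point. A_{c+1} has weight X before the crossing and W after it; B_c has weight Y before and Z after. Here X, Y, W, Z are segment weights, not strings.]
- Case 1: X ≤ Y
  - ⇒ Y + W ≥ X + W = c + 1
  - ⇒ for some k ≥ 1,
  - Left[a, c+1] ≥ Left[b, c+k] ≥ Left[b, c+1]
- Case 2: X > Y
  - ⇒ X + Z > Y + Z = c
  - ⇒ for some k > c,
  - Left[b, c] ≥ Left[a, k] > Left[a, c]
  - ⇒ contradiction with the assumption
We want to prove: if a < b and Left[a, c] ≥ Left[b, c], then Left[a, c+1] ≥ Left[b, c+1]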
65What do we have now?
- The defined entries of Left follow the convex total monotonicity property
What's next?
Assign values to the undefined entries while still keeping the convex property
66Recall the undefined entries in Left
The lower-left triangle is well shaped, but the upper-right part is jagged.
We're going to transform the jagged part into an upper-right triangle
67Lemma 3
Definition: k_a = the index of the last defined entry in LEFT[a]
- For two rows a, b in LEFT with a < b
- If k_a > k_b, then LEFT[a, i] ≤ LEFT[b, i]
- for all columns i defined in both row a and row b
[Figure: rows a and b of LEFT, with row a's defined span ending at k_a, to the right of k_b]
68Proof
We want to prove: if a < b and k_a > k_b, then LEFT[a, i] ≤ LEFT[b, i]
- Main idea: proof by contradiction; assume that
- for some j with b ≤ j ≤ k_b, Left[a, j] > Left[b, j]
Observation: by definition, Left[b, k_b+1] is not defined
69Proof
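[Figure: paths A_{k_b+1} and B_j through Y]
- Consider two paths
  - A_{k_b+1}: enters Y at PI[a] and leaves with weight k_b+1
  - B_j: enters Y at PI[b] and leaves with weight j
By assumption, Left[a, j] > Left[b, j]
⇒ Left[a, k_b+1] > Left[a, j] > Left[b, j]
⇒ PI[a] < PI[b] and Left[a, k_b+1] > Left[b, j]
⇒ A_{k_b+1} and B_j must cross
Why must these two paths cross? A_{k_b+1} enters Y to the left of B_j and leaves Y to its right.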
70Proof
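- Consider two cases (X, W: weights of A_{k_b+1} before and after the crossing; Y, Z: weights of B_j before and after)
- 1. X ≥ Y
  - X + Z ≥ Y + Z = j
  - ⇒ Left[a, j] ≤ Left[b, j]
  - ⇒ contradiction
- 2. X < Y
  - Y + W > X + W = k_b+1
  - ⇒ there is a path that enters Y at PI[b] and has weight at least k_b+1
  - ⇒ Left[b, k_b+1] is defined
  - ⇒ contradiction
- ⇒ Left[a, j] ≤ Left[b, j]
- for b ≤ j ≤ k_b. QED.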
71By Lemma 3, we have
If a < b and k_a > k_b, then the defined entries in row b are never needed as column minima (each is at least as large as Left[a, i])
- So just REMOVE such rows
- All remaining rows then satisfy
- k_a ≤ k_{a+1} ≤ k_{a+2} ≤ k_{a+3} ≤ ...
- ⇒ the undefined suffixes now form an upper-right triangle
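The row removal can be done in one left-to-right scan: a row is dropped exactly when some earlier row has a larger last defined index, which is the same as comparing against the running maximum. A minimal sketch follows; the function name and the sample values of k are hypothetical.

```python
def rows_to_keep(last_defined):
    """Given k[r] = index of the last defined entry of row r, return the rows
    that survive removal; their k values are non-decreasing."""
    kept = []
    running_max = -1
    for row, k in enumerate(last_defined):
        if k >= running_max:           # no earlier row reaches further right
            kept.append(row)
            running_max = k
    return kept


k_values = [3, 5, 4, 5, 2, 6]          # hypothetical k_0 .. k_5
print(rows_to_keep(k_values))          # [0, 1, 3, 5]
```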
72Next.
- Assign values to the undefined entries in LEFT, while preserving the column minima and keeping the convex property
73Lower Left Triangle
- Set all these entries Left[r, j] = (n + r + 1)
- Preserving the column minima
  - n + r + 1 ≥ n + 1, where (n + 1) is the greatest possible index in Left
  - ⇒ the column minima are preserved
74Lower Left Triangle
What's convex? If a < b and Left[a, c] ≥ Left[b, c], then Left[a, d] ≥ Left[b, d] for d > c
- Set Left[r, j] = (n + r + 1)
- Keeping the convex property
  - for any filled-in entry Left[b, c],
  - Left[a, c] < Left[b, c], for a < b
  - so the premise of the convex property does not hold
  - ⇒ the convex property is preserved
75Upper Right Triangle
- Set all these entries Left[r, j] to ∞
- Preserving the column minima
  - all scores in LEFT are finite, so the column minima are preserved
76Upper Right Triangle
What's convex? If a < b and Left[a, c] ≥ Left[b, c], then Left[a, d] ≥ Left[b, d] for d > c
Set Left[r, j] to ∞
- Keeping the convex property
  - let d be the smallest index with Left[a, d] = ∞
  - after the row removal, k_a ≤ k_b for a < b
  - ⇒ Left[a, e] = ∞ ≥ Left[b, e], for all e ≥ d
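The two padding rules can be sketched on a small LEFT matrix with `None` for undefined entries. The matrix below is an illustrative assumption (it corresponds to T = BBCBADDCBA, Y = BDB and a hypothetical Bi = CA, so n = 10), and `pad_left` is a name chosen here. The sketch assumes every row has the shape undefined prefix, defined span, undefined suffix.

```python
import math


def pad_left(left, n):
    """Fill undefined entries: n + r + 1 before the defined span of row r,
    infinity after it."""
    padded = []
    for r, row in enumerate(left):
        new_row = []
        seen_defined = False
        for value in row:
            if value is not None:
                seen_defined = True
                new_row.append(value)
            elif not seen_defined:
                new_row.append(n + r + 1)   # lower-left triangle
            else:
                new_row.append(math.inf)    # upper-right triangle
        padded.append(new_row)
    return padded


LEFT = [[0, 1, 2, 9, None],
        [None, 3, 4, 6, 9],
        [None, None, 5, 6, 9]]
PADDED = pad_left(LEFT, n=10)
for row in PADDED:
    print(row)
# [0, 1, 2, 9, inf]
# [12, 3, 4, 6, 9]
# [13, 13, 5, 6, 9]
print([min(col) for col in zip(*PADDED)])   # [0, 1, 2, 6, 9]
```

The column minima of the padded matrix equal the minima taken over the defined entries only, which is the property the padding has to preserve.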
77Whats next?.......
- Show that marking those undefined entries and removing those rows won't increase the time complexity of the Common Substring Alignment algorithm
78Encoding Stage
- Using Myers' algorithm,
- compute S[i, j] in O(nLy) time
79Alignment Stage
- Marking the undefined entries
  - no need to mark them explicitly; mark on demand.
- Removing the redundant rows
  - Only need to scan L rows, so O(L) time.
- Using the SMAWK algorithm to find the column minima: O(L) time
- So the additional work of marking and row removal doesn't increase the time complexity
80Conclusion
- The Sparse LCS Common Substring Alignment algorithm consists of
  - O(nLy) time for the encoding stage
  - O(L) time for the alignment stage
81Open problems
- This algorithm applies when there is a common substring among the Si. Is there an efficient algorithm when there are repetitions between the target and source strings?
- Extend this solution to general metrics, such as edit distance.