Sparse LCS Common Substring Alignment

Gad M. Landau, Baruch Schieber and Michal Ziv-Ukelson ... Input: a set of strings S1, S2, ..., Sc and a target string T ...

1
Sparse LCS Common Substring Alignment
  • Gad M. Landau, Baruch Schieber and Michal
    Ziv-Ukelson
  • CPM 2003

2
Outline
  • Introduction
  • Preliminaries
  • The algorithm
  • Totally Monotone Rectangular Matrix
  • Conclusions and Open Problems

3
Common Substring Alignment
  • Input: a set of strings S1, S2, ..., Sc and a
    target string T
  • Output: the similarity of every string Si with T
  • Under the LCS similarity metric
  • Example:
  • S1 = ababaa
  • S2 = aabbb
  • T = abab
  • Sim(S1, T) = 4
  • Sim(S2, T) = 3
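
The two similarity values in this example can be checked with the classical quadratic LCS dynamic program. The sketch below is only that baseline, not the sparse algorithm of the talk; the function name `lcs_length` is chosen here for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence of a and b (classical DP)."""
    prev = [0] * (len(b) + 1)          # DP row for the previous character of a
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                cur.append(prev[j - 1] + 1)            # extend a match
            else:
                cur.append(max(prev[j], cur[j - 1]))   # skip a character
        prev = cur
    return prev[-1]


T = "abab"
sims = [lcs_length(s, T) for s in ("ababaa", "aabbb")]
print(sims)
```

Running this baseline once per source string is exactly the repeated work that the common substring Y is meant to avoid.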

4
Application
  • Molecular biology
  • Search the most similar strings in database

5
Main idea
  • Y is a common substring of all the Si
  • Don't compute the similarity between Y and T
    over and over again.
  • Exploit the sparsity of the LCS

6
DP Graph
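[Figure: DP graph of T against Si = Bi Y Fi. The blocks Bi and Fi vary with Si; the block G for Y has the same structure for every Si, and this is where the speed-up is obtained.]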
7
Three stages
  • In Common Substring Alignment
  • Preprocessing stage
  • Encoding stage
  • Alignment stage

8
Preprocessing stage
  • Parse the strings for the optimal common substring
    compromise
  • Si = Bi Y Fi

T = BCBADBDCD, Y = BCBD
S1 = BC BCBD C: B1 = BC, F1 = C
S2 = E BCBD DBCBD A, where Y occurs twice:
occurrence a: B2a = E, F2a = DBCBDA
occurrence b: B2b = EBCBDD, F2b = A
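
As an illustration of the parse Si = Bi Y Fi, the following sketch lists every (Bi, Fi) pair for a given Y. It assumes Y is already known (as the talk does later); the helper name `parse_occurrences` is hypothetical and no attempt is made to choose an optimal Y.

```python
def parse_occurrences(s: str, y: str):
    """Return every (B, F) with s == B + y + F, one pair per occurrence of y."""
    pairs = []
    start = s.find(y)
    while start != -1:
        pairs.append((s[:start], s[start + len(y):]))
        start = s.find(y, start + 1)   # overlapping occurrences are allowed
    return pairs


Y = "BCBD"
print(parse_occurrences("BCBCBDC", Y))       # S1 from the slide
print(parse_occurrences("EBCBDDBCBDA", Y))   # S2 from the slide
```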
9
In this paper
  • We assume that Y is given. We focus on the
    following two stages.

10
Encoding Stage
  • A data structure is constructed which encodes the
    comparison of Y with T
  • Goal: to speed up the alignment stage

11
Alignment Stage
  • Align Si with T
  • Use the pre-compiled data structure for the part
    that aligns Y with T

12
Notation
  • n = |Si| = |T|
  • L = max |LCS(T, Si)|
  • Ly = |LCS(T, Y)|
  • (Ly ≤ |Y|, Ly ≤ L, L ≤ n)

13
Results
Sparsity of LCS: Ly << |Y|, L << n
  • Previous result (SIAM 2001)
  • Encoding stage: O(n^2 + n|Y|)
  • Alignment stage: O(n)
  • In this paper
  • Encoding stage: O(n Ly)
  • Alignment stage: O(L)

14
Our goal now
?
15
DP Graph
  • A[u..z]: the substring of A from index u to z, 1 ≤ u ≤ z ≤ n
  • I[j] = |LCS(T[1..j], Bi)|
  • i.e. the weight of an optimal path from (0,0) to
    vertex j of input row I
  • O[j] = |LCS(T[1..j], Bi Y)|

16
Observation
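  • In a given row of the DP graph, the LCS values have two
    properties
  • They are non-decreasing from left to right
  • Adjacent values differ by at most 1
  • An increase corresponds to a match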

17
Some alternative
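  • I[j] = |LCS(T[1..j], Bi)|
  • i.e. the weight of an optimal path from (0,0) to
    vertex j of input row I
  • O[j] = |LCS(T[1..j], Bi Y)|
  • For k = 0, ..., L
  • PI[k]
  • The leftmost index of the block of weight k in row I
  • i.e. the leftmost point of row I reached by a path of weight k from (0,0) in the DP graph
  • PO[k]
  • The leftmost index of the block of weight k in row O
  • i.e. the leftmost point of row O reached by a path of weight k from (0,0) in the DP graph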

18
Therefore
  • PI[k] and PO[k] are sufficient to represent I[j]
    and O[j]

Example: PI = (0, 1), PO = (0, 1, 3, 5, 6)
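
The step-position encoding can be sketched as follows: because a row of LCS values starts at 0 and rises in unit steps, the positions of the steps determine the whole row. The strings used here (T = BBCBADDCBA, a one-letter prefix Bi = "B") and all function names are assumptions made for illustration.

```python
def lcs_row(s, t):
    """Return row[j] = |LCS(s, t[:j])| for j = 0..len(t) (classical DP)."""
    prev = [0] * (len(t) + 1)
    for ch in s:
        cur = [0]
        for j, tj in enumerate(t, start=1):
            cur.append(prev[j - 1] + 1 if ch == tj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev


def step_positions(row):
    """P[k] = leftmost j with row[j] == k; relies on row[0] == 0 and unit steps."""
    return [0] + [j for j in range(1, len(row)) if row[j] > row[j - 1]]


def row_from_positions(pos, length):
    """Inverse direction: rebuild the full row from its step positions."""
    row, k = [], 0
    for j in range(length):
        while k + 1 < len(pos) and pos[k + 1] <= j:
            k += 1
        row.append(k)
    return row


T = "BBCBADDCBA"        # assumed example target
I = lcs_row("B", T)     # input row for a hypothetical prefix Bi = "B"
PI = step_positions(I)
print(PI, row_from_positions(PI, len(I)) == I)
```

The encoding has one entry per LCS value rather than one per column of T, which is where the sparsity pays off.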
19
Claim
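  • The positions PI[r] alone are sufficient for
    computing PO[k], for r, k = 0, ..., L
  • In row I, only the indices PI[r] need to be considered as entry points
    for paths to row O; every other index is dominated by one of them
  • Proof
  • Let i1 = PI[k] and i3 = PI[k+1], if defined
  • For any index i2 with i1 < i2 < i3 (so I[i1] = I[i2]) and any index j
    in row O:
  • I[i1] + |LCS(T[i1+1..j], Y)| ≥ I[i2] + |LCS(T[i2+1..j], Y)|
  • (the path through i1 is at least as good as the path through i2)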

20
Objective now!!
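  • Given vector PI, compute vector PO!

[Figure: DP graph with T along the top; PI = (0, 1) is known on the input row, and PO on the output row is to be computed.]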
21
Observation
  • When computing PO[k], only the PI[r] with 0 ≤ r ≤ k
    are candidates
  • A path whose weight at row I already exceeds k cannot end as a weight-k path at row O

22
PO
23
The Algorithm
Encoding Stage: compute S in O(n · |LCS(Y, T)|)
Alignment Stage: construct LEFT in O(n), then find the column minima of LEFT in O(n)
Totally monotone matrix: column minima in O(n)
24
[Figure: DP graph for the example strings T = BBCBADDCBA (columns, starting at index 0) and Y = BDB (rows).]
25
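[Figure: DP graph of T against Bi, Y, Fi. A path enters Y at column PI[r] of row PI with weight r and leaves Y at column j = PO[k] of row PO, where k − r = |LCS(T[PI[r]+1..j], Y)|.]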
26
Find Optimal SubPath
[Figure: DP graph of T against Bi, Y, Fi, with row PI entering Y and row PO leaving Y.]
27
Encoding Stage
  • Preprocessing: Si is still unknown
  • Table S: the alignment of T with Y
  • S[i, w] = min { j : |LCS(T[i+1..j], Y)| = w }

[Figure: blocks Bi, Y, Fi with rows PI and PO.]
28
Algorithm
S[i, w] = min { j : |LCS(T[i+1..j], Y)| = w }
  • for i = 0 to |T|
  • S[i, 0] ← i
  • for k = 0 to
  • S[i, k+1] ← S[i, k] + d
  • next k
  • next i

29
Observation
  • S[i, k+1] = S[i, k] + d

S[1, 0] = 1
S[1, 1] = S[1, 0] + (step to the next match) = 1 + 1 = 2
S[1, 2] = S[1, 1] + (step to the next match) = 2 + 2 = 4
[Figure: T = BBCBADDCBA with its column indices, and Y = BDB.]
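
For reference, table S can be filled in directly from its definition, S[i, w] = min { j : |LCS(T[i+1..j], Y)| = w }. The sketch below is a brute-force version that runs one full DP per starting index i, so it does not achieve the O(n Ly) bound of the talk; the strings T = BBCBADDCBA and Y = BDB are read off the figures and should be treated as an assumption, as should the function names.

```python
def first_reach(t_suffix, y):
    """Offsets d at which |LCS(t_suffix[:d], y)| first reaches 0, 1, 2, ..."""
    prev = [0] * (len(t_suffix) + 1)
    for ch in y:
        cur = [0]
        for j, tj in enumerate(t_suffix, start=1):
            cur.append(prev[j - 1] + 1 if ch == tj else max(prev[j], cur[j - 1]))
        prev = cur
    return [0] + [d for d in range(1, len(prev)) if prev[d] > prev[d - 1]]


def build_S(t, y):
    """S[i][w] = min j with |LCS(T[i+1..j], Y)| == w; rows may differ in length."""
    # Python slice t[i:] is the 1-based suffix T[i+1..n].
    return [[i + d for d in first_reach(t[i:], y)] for i in range(len(t) + 1)]


S = build_S("BBCBADDCBA", "BDB")
print(S[1])
```

Entries that would be undefined (weights that cannot be reached from index i) are simply absent from row S[i].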
30
Finding the next match
  • O(|Alphabet| · (|Y| + |T|)) preprocessing
  • O(1) to find the next match

31
Preprocessing
  • Finding all matches
  • For each letter of the alphabet, scan Y and T for its positions
  • matches = (positions of B in Y) × (positions of B in T)
  • Construct a fast-find structure

[Figure: T = BBCBADDCBA and Y = BDB, with the matches of the letter B marked.]
32
Algorithm
S[i, w] = min { j : |LCS(T[i+1..j], Y)| = w }
  • for i = 0 to |T|
  • S[i, 0] ← i
  • for k = 0 to O(|LCS(Y, T)|)
  • S[i, k+1] ← S[i, k] + d
  • next k
  • next i

33
The Inner Loop: O(|LCS(Y, T)|)
[Figure: T = BBCBADDCBA against Y = BDB.]
S[i, k+1] = S[i, k] + d
34
Complexity
  • Assume |T| > |Y|
  • Preprocessing: O(|Alphabet| · (|Y| + |T|))
  • The inner loop: O(|LCS(Y, T)|)
  • The outer loop: O(|T|)
  • Overall: O(|T| · |LCS(Y, T)|)

35
The Algorithm
Encoding Stage: compute S in O(n · |LCS(Y, T)|)
Alignment Stage: construct LEFT in O(n), then find the column minima of LEFT in O(n)
36
Alignment Stage
PO[k] = min_{0 ≤ r ≤ k} S[PI[r], k − r]
[Figure: DP graph of T against Bi, Y, Fi, with row PI entering Y and row PO leaving Y.]
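
The formula PO[k] = min over 0 ≤ r ≤ k of S[PI[r], k − r] can be tried out on a small instance and compared with a direct LCS computation on Bi Y. This is a quadratic-time check, not the linear-time alignment stage; the strings (T = BBCBADDCBA, Y = BDB, and a hypothetical prefix Bi = "CA") and all function names are assumptions for illustration.

```python
def lcs_row(s, t):
    """Return row[j] = |LCS(s, t[:j])| for j = 0..len(t)."""
    prev = [0] * (len(t) + 1)
    for ch in s:
        cur = [0]
        for j, tj in enumerate(t, start=1):
            cur.append(prev[j - 1] + 1 if ch == tj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev


def steps(row):
    """Leftmost index of each value 0, 1, 2, ... in a unit-step row."""
    return [0] + [j for j in range(1, len(row)) if row[j] > row[j - 1]]


def build_S(t, y):
    """Brute-force table: S[i][w] = min j with |LCS(T[i+1..j], Y)| == w."""
    return [[i + d for d in steps(lcs_row(y, t[i:]))] for i in range(len(t) + 1)]


def po_from_pi(PI, S):
    """PO[k] = min over r <= k of S[PI[r]][k - r], skipping undefined entries."""
    PO, k = [], 0
    while True:
        cands = [S[PI[r]][k - r]
                 for r in range(min(k, len(PI) - 1) + 1)
                 if k - r < len(S[PI[r]])]
        if not cands:          # no path of weight k reaches the output row
            return PO
        PO.append(min(cands))
        k += 1


T, Y, Bi = "BBCBADDCBA", "BDB", "CA"   # Bi is a made-up prefix
PI = steps(lcs_row(Bi, T))
PO = po_from_pi(PI, build_S(T, Y))
print(PI, PO)
```

The direct computation `steps(lcs_row(Bi + Y, T))` gives the same vector, which is what the formula promises.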
37
Construction of Left(1)
PO[k] = min_{0 ≤ r ≤ k} S[PI[r], k − r]
38
Construction of Left(2)
PO[k] = min (taken over column k of LEFT)

39
Construction of Left(3)
[Figure: the matrix LEFT, with PO[0], PO[1], ..., PO[L] each obtained as the minimum of one column.]
40
Undefined Region in LEFT
S[i, w] = min { j : |LCS(T[i+1..j], Y)| = w }
[Figure: the matrix LEFT with its undefined region marked; PO[0], PO[1], ..., PO[L] are the column minima.]
41
Good Property of Left
  • Totally Monotone Rectangular Matrix

Convex or concave
42
Reduced Problem
  • The minimum value of each column
  • For an n × n totally monotone matrix: O(n)

43
Find Column Minima Recursively
  • Minima(A), where A is m × n
  • B (n × n) ← Reduce(A)
  • If B has a single row
  • return the positions of minima
  • Interpolate the remaining columns using Minima(every other column of B)
  • return the positions of minima
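
The recursive scheme above is the SMAWK pattern: reduce the candidate rows, recurse on every other column, then fill in the skipped columns. Below is a compact sketch for column minima, assuming the convex form of total monotonicity (for rows a < b and columns c < d, M(a, c) > M(b, c) implies M(a, d) > M(b, d)). The function name, the callback signature `M(row, col)`, and the toy matrix are assumptions; this is not the authors' code.

```python
def column_minima(rows, cols, M):
    """Return {col: (min value, row)} for a totally monotone matrix M(row, col)."""
    if not cols:
        return {}
    # Reduce: discard rows that cannot hold any column minimum,
    # keeping at most len(cols) candidates.
    stack = []
    for r in rows:
        while stack and M(stack[-1], cols[len(stack) - 1]) > M(r, cols[len(stack) - 1]):
            stack.pop()
        if len(stack) != len(cols):
            stack.append(r)
    rows = stack
    # Recurse on the columns in odd positions.
    minima = column_minima(rows, cols[1::2], M)
    # Interpolate: the minimum of an even-position column lies between
    # the minima rows of its two neighbouring columns.
    r = 0
    for c in range(0, len(cols), 2):
        col, row = cols[c], rows[r]
        last_row = rows[-1] if c == len(cols) - 1 else minima[cols[c + 1]][1]
        best = (M(row, col), row)
        while row != last_row:
            r += 1
            row = rows[r]
            best = min(best, (M(row, col), row))
        minima[col] = best
    return minima


M = lambda i, j: (i - 2 * j) ** 2     # toy Monge matrix, hence totally monotone
result = column_minima(list(range(9)), list(range(5)), M)
print({c: v for c, (v, _) in sorted(result.items())})
```

Passing the matrix as a callback means entries are evaluated only when they are compared, which matters when the matrix is implicit, as LEFT is.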

44
45
The Algorithm
Encoding Stage: compute S in O(n · |LCS(Y, T)|)
Alignment Stage: construct LEFT in O(n), then find the column minima of LEFT in O(n)
46
Reduce: m × n → n × n (m > n)
[Figure: an m × n matrix illustrating the Type A and Type B steps of Reduce.]
47
Reduce at the n-th row
[Figure: an m × n matrix illustrating the Type C step of Reduce.]

48
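Complexity of Reduce: O(m)
  • At most m − n deletions
  • Number of Type B steps + number of Type C steps = O(m − n)
  • The current index never exceeds n
  • Number of Type A steps − number of Type B steps = O(n)
  • A + B + C ≤ (A − B) + 2(B + C) = O(n + 2(m − n))
  • = O(2m − n) = O(m)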

49
The Algorithm
Encoding Stage: compute S in O(n · |LCS(Y, T)|)
Alignment Stage: construct LEFT in O(n), then find the column minima of LEFT in O(n)
50
Find Column Minima Recursively
  • Minima(A), where A is m × n
  • B (n × n) ← Reduce(A)
  • If B has a single row
  • return the positions of minima
  • Interpolate the remaining columns using Minima(every other column of B)
  • return the positions of minima

51
Complexity of Interpolate: O(n)
52
Complexity
  • f(m, n) = Reduce + f(n, n/2) + Interpolate
    = O(m) + f(n, n/2) + O(n)
    = O(m) + O(n) + O(n) + O(n/2) + f(n/2, n/4)
  • = O(m) + 2·O(n) + O(n/2) + O(n/4) + ...
  • = O(m)
  • LEFT is an n × n matrix, so the column minima take O(n)

53
The Algorithm
Encoding Stage: compute S in O(n · |LCS(Y, T)|)
Alignment Stage: construct LEFT in O(n), then find the column minima of LEFT in O(n)
54
The remaining Part
  • Goal
  • Prove Left is a Totally Monotone Matrix,
  • so that we could apply the linear time algorithm
    above to Left

55
A little reminder of definition
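  • S[i, j]
  • For a path that enters G at index i
  • and gains weight j inside G, the leftmost index at which it can leave G

[Figure: a path entering G at index i and leaving it with weight j = 2.]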
56
A little reminder of definition
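  • Left[r, i]
  • For a path that enters G with weight r and leaves G with weight i, the leftmost index at which it can leave G

[Figure: a path with weight r when entering G and weight i when leaving G.]
By definition, Left[i, j] = S[PI[i], j − i]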
57
Some observation on Left
Since Left[i, j] means the LCS path has weight j
when leaving Y, for some large j the path may not
exist!
  • There are some undefined entries in Left

Left[i, j] = S[PI[i], j − i]
Also, for j < i, Left[i, j] is undefined.
[Figure: the matrix Left (rows i, columns j) with the undefined entries marked X in the lower left and upper right.]
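
To make the undefined regions concrete, the sketch below builds Left[r][j] = S[PI[r]][j − r] for a small instance, using None for undefined entries. Table S is computed by brute force, and the strings (T = BBCBADDCBA, Y = BDB, hypothetical prefix Bi = "CA") and the function names are assumptions for illustration.

```python
def lcs_row(s, t):
    """Return row[j] = |LCS(s, t[:j])| for j = 0..len(t)."""
    prev = [0] * (len(t) + 1)
    for ch in s:
        cur = [0]
        for j, tj in enumerate(t, start=1):
            cur.append(prev[j - 1] + 1 if ch == tj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev


def steps(row):
    """Leftmost index of each value 0, 1, 2, ... in a unit-step row."""
    return [0] + [j for j in range(1, len(row)) if row[j] > row[j - 1]]


def build_left(PI, S):
    """Left[r][j] = S[PI[r]][j - r] where defined, None elsewhere."""
    width = max(r + len(S[p]) for r, p in enumerate(PI))
    left = []
    for r, p in enumerate(PI):
        row = [None] * width
        for w, j in enumerate(S[p]):
            row[r + w] = j          # weight r before Y plus weight w inside Y
        left.append(row)
    return left


T, Y, Bi = "BBCBADDCBA", "BDB", "CA"
S = [[i + d for d in steps(lcs_row(Y, T[i:]))] for i in range(len(T) + 1)]
PI = steps(lcs_row(Bi, T))
LEFT = build_left(PI, S)
for row in LEFT:
    print(row)
col_min = [min(v for v in col if v is not None) for col in zip(*LEFT)]
print(col_min)
```

The None entries below the diagonal come from j < r; those at the right end of a row come from weights that cannot be reached from PI[r].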
58
Lemma1
  • Each row of LEFT, say LEFT[r], consists of
  • A (possibly empty) span of undefined entries,
    followed by
  • A span of defined entries from LEFT[r, r] to
    LEFT[r, k_r], followed by
  • A (possibly empty) span of undefined entries

i.e. no interleaving
[Figure: a row LEFT[i] with a possibly empty undefined span at each end.]
59
Proof
Main idea: if Left[r, i] and Left[r, j] are
defined (i < j), then Left[r, k] is defined
as well (for i ≤ k ≤ j).
Left[r, i] and Left[r, j] are defined
⇒ S[PI[r], i − r] and S[PI[r], j − r] are defined
⇒ entering Y at PI[r], there are paths through Y of weight (i − r) and (j − r)
⇒ entering Y at PI[r], there is also a path through Y of weight (k − r), for i ≤ k ≤ j
⇒ S[PI[r], k − r] is defined
⇒ Left[r, k] is defined, for i ≤ k ≤ j
60
Proof (continued)
  • Since Left[r, i] = S[PI[r], i − r]
  • ⇒ for i < r, S[PI[r], i − r] is not defined
  • ⇒ for i < r, Left[r, i] is not defined
  • ⇒ Left[r] starts its consecutive defined entries
    from Left[r, r]

61
Lemma 2 (Convex Total Monotone)
  • For defined entries,
  • if a < b and Left[a, c] ≥ Left[b, c]
  • then
  • Left[a, c+1] ≥ Left[b, c+1]
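
The property in Lemma 2 can be checked by brute force on small matrices. The sketch below tests the adjacent-column form stated above and skips any comparison that involves an undefined entry (represented as None); the function name and the sample matrices are made up for illustration.

```python
def is_convex_totally_monotone(m):
    """Check: for rows a < b, m[a][c] >= m[b][c] implies m[a][c+1] >= m[b][c+1].

    Entries equal to None are treated as undefined and skipped.
    """
    for a in range(len(m)):
        for b in range(a + 1, len(m)):
            for c in range(len(m[0]) - 1):
                quad = (m[a][c], m[b][c], m[a][c + 1], m[b][c + 1])
                if None in quad:
                    continue
                if quad[0] >= quad[1] and quad[2] < quad[3]:
                    return False      # lower row won at c but lost at c + 1
    return True


good = [[0, 4, 16], [1, 1, 9], [4, 0, 4]]     # entries (i - 2j)^2
bad = [[5, 1], [3, 2]]
partial = [[0, 1, 2, 9, None], [None, 3, 4, 6, 9], [None, None, 5, 6, 9]]
print(is_convex_totally_monotone(good),
      is_convex_totally_monotone(bad),
      is_convex_totally_monotone(partial))
```

A cubic-time check like this is only useful for testing; the point of the lemma is that the property holds without ever being verified at run time.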

62
Proof
[Figure: the two paths A_{c+1} and B_c in the DP graph of T.]
  • Consider two paths
  • A_{c+1}: an optimal path that has weight (c+1) and passes
    through the vertex PI[a] when entering Y
  • B_c: an optimal path that has weight c and passes through the
    vertex PI[b] when entering Y

63
  • Since a < b
  • ⇒ PI[a] < PI[b]
  • and Left[b, c] ≤ Left[a, c] < Left[a, c+1],
  • so A_{c+1} and B_c must cross

64
Consider Two cases
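Let X and W be the weights of A_{c+1} before and after the crossing point, and Y and Z the weights of B_c before and after it (Y here is a path weight, not the string Y).
  • Case 1. X ≤ Y
  • ⇒ Y + W ≥ X + W = c + 1
  • ⇒ for some k ≥ 1,
  • Left[a, c+1] ≥ Left[b, c+k]
  • ≥ Left[b, c+1]
  • Case 2. X > Y
  • ⇒ X + Z > Y + Z = c
  • ⇒ for some k > c,
  • Left[b, c] ≥ Left[a, k] > Left[a, c]
  • ⇒ Contradiction with the assumption

We want to prove: if a < b and Left[a, c] ≥ Left[b, c],
then Left[a, c+1] ≥ Left[b, c+1]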
65
What do we have now?
  • The defined entries of Left satisfy the convex
    total monotonicity property

What's next?
Assign values to the undefined entries while still
keeping the convex property
66
Recall the undefined entries in Left
The lower left triangle is well behaved, but
the upper right part is ragged.
We're going to transform the ragged part into an
upper right triangle
67
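Lemma 3
Definition: k_a = the index of the last defined
entry in LEFT[a]
  • For two rows a, b in LEFT with a < b
  • If k_a > k_b, then LEFT[a, i] ≤ LEFT[b, i]
  • for all columns i that are defined in both row a and row b

[Figure: rows a and b of LEFT; the defined span of row a ends at k_a, to the right of k_b where the defined span of row b ends.]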
68
Proof
We want to prove: if a < b and k_a > k_b, then
LEFT[a, i] ≤ LEFT[b, i]
  • Main idea: proof by contradiction; assume that
  • for some j with b ≤ j ≤ k_b, Left[a, j] > Left[b, j]

Observation: by definition, Left[b, k_b + 1] isn't
defined
69
Proof
  • Consider two paths
  • A_{k_b+1} and B_j

These two paths will cross. Why?
By assumption, Left[a, j] > Left[b, j]
⇒ Left[a, k_b+1] > Left[a, j] > Left[b, j]
⇒ PI[a] < PI[b] and Left[a, k_b+1] > Left[b, j]
⇒ A_{k_b+1} and B_j must cross
We want to prove: if a < b and k_a > k_b, then
LEFT[a, i] ≤ LEFT[b, i]
70
Proof
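  • Consider two cases (X, W: weights of A_{k_b+1} before and after the crossing point; Y, Z: weights of B_j before and after it)
  • 1. X ≥ Y
  • X + Z ≥ Y + Z = j
  • ⇒ Left[a, j] ≤ Left[b, j]
  • Contradiction
  • 2. X < Y
  • Y + W > X + W = k_b + 1
  • ⇒ there is a path that starts from PI[b] and has weight k_b + 1
  • ⇒ Left[b, k_b + 1] is defined
  • Contradiction
  • ⇒ Left[a, j] ≤ Left[b, j]
  • for b ≤ j ≤ k_b. QED.

We want to prove: if a < b and k_a > k_b, then
LEFT[a, i] ≤ LEFT[b, i]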
71
By Lemma 3, we have:
if a < b and k_a > k_b, then the defined
entries in row b are not needed as column
minima (since they are no smaller than Left[a, i])
  • So, just REMOVE this kind of row
  • And all remaining rows would have
  • k_a ≤ k_{a+1} ≤ k_{a+2} ≤ k_{a+3} ...
  • ⇒ the undefined suffixes form an upper right
    triangle now

72
Next.
  • Assign values to the undefined entries in LEFT,
    while holding the column minima and keeping the
    convex property

73
Lower Left Triangle
  • Set all these entries Left[r, j] = (n + r + 1)
  • Preserving the column minima
  • n + r + 1 ≥ n + 1, where (n + 1) is the greatest possible
    index in Left
  • ⇒ so the column minima are preserved.

74
What's convex? If a < b and Left[a, c] ≥ Left[b, c],
then Left[a, d] ≥ Left[b, d] for d > c
Lower Left Triangle
  • Set Left[r, j] = (n + r + 1)
  • Keeping the convex property
  • for any filled-in entry Left[b, c],
  • Left[a, c] < Left[b, c], for a < b
  • so the premise of the convex property does not hold
  • ⇒ the convex property is preserved.
75
Upper Right Triangle
  • Set all these entries Left[r, j] to ∞
  • Preserving the column minima
  • all scores in LEFT are finite, so the column
    minima are preserved.

76
Upper Right Triangle
What's convex? If a < b and Left[a, c] ≥ Left[b, c],
then Left[a, d] ≥ Left[b, d] for d > c
Set Left[r, j] to ∞
  • Keeping the convex property
  • let d be the smallest index such that Left[a, d] = ∞
  • ⇒ no remaining rows a < b have k_a > k_b
  • ⇒ Left[a, e] = ∞ ≥ Left[b, e], for all e ≥ d
77
What's next?
  • Show that marking the undefined
  • entries and removing those rows won't increase
    the time complexity of the Common Substring
    Alignment algorithm

78
Encoding Stage
  • Using Myers' algorithm,
  • compute S[i, j] in O(n Ly) time

79
Alignment Stage
  • Marking the undefined entries
  • no need to mark them explicitly; mark them on
    demand.
  • Removing the redundant rows
  • Only need to scan L rows, so O(L) time.
  • Using the SMAWK algorithm to find the column minima, O(L)
    time
  • So the additional work in the two steps above doesn't increase the time
    complexity

80
Conclusion
  • The Sparse LCS Common Alignment algorithm
    consists of
  • O(nLy) time for encoding stage
  • O(L) time for alignment stage

81
Open problems
  • This algorithm handles the case where there is a common
    substring among the Si. Is there an efficient
    algorithm when there are repetitions between the target
    and source strings?
  • Extend this solution to general metrics, such as
    Edit Distance.