Suffix arrays - PowerPoint PPT Presentation

Slides: 51
Provided by: hai2
1
Suffix arrays
2
Suffix array
  • We lose some of the functionality of the suffix tree, but we save
    space.

Let s = abab
Sort the suffixes lexicographically: ab, abab, b, bab
The suffix array gives the starting indices of the suffixes in sorted order:
2 0 3 1
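The definition above can be checked with a few lines of Python. This is only a minimal sketch: the function name `suffix_array` is chosen here for illustration, and sorting whole suffixes is far slower than the linear-time constructions discussed later in the slides.

```python
def suffix_array(s):
    """Return the start indices of the suffixes of s in lexicographic order.

    Naive approach: sort the indices using the suffix text as the key.
    This is quadratic or worse in the worst case and is meant only to
    illustrate the definition, not to be an efficient construction.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])
```

For s = abab, `suffix_array("abab")` gives `[2, 0, 3, 1]`, matching the slide.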
3
How do we build it ?
  • Build a suffix tree
  • Traverse the tree in DFS, lexicographically
    picking edges outgoing from each node and fill
    the suffix array.
  • O(n) time

4
How do we search for a pattern ?
  • If P occurs in T then all its occurrences are
    consecutive in the suffix array.
  • Do a binary search on the suffix array
  • Takes O(m log n) time, where m = |P| and n = |T|
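The plain binary search can be sketched as follows. The names `suffix_array` and `search` are illustrative, and the suffix array is built naively here just to keep the example self-contained. Two binary searches find the two ends of the block of suffixes that start with P; each step compares at most m characters.

```python
def suffix_array(s):
    # Naive construction, used only to make the example runnable.
    return sorted(range(len(s)), key=lambda i: s[i:])


def search(text, sa, pattern):
    """Return the sorted start positions of pattern in text.

    All occurrences are consecutive in the suffix array, so two binary
    searches over the length-m prefixes of the suffixes find the block.
    """
    m = len(pattern)

    # Leftmost rank whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Leftmost rank whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])
```

With `text = "mississippi"`, searching for `"issi"` reports positions 1 and 4, while `"issa"` is not found.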

5
Example
Let S = mississippi
Let P = issa
Sorted suffixes of S, with the binary search boundaries L, M and R:
i             <- L
ippi
issippi
ississippi
mississippi
pi            <- M
ppi
sippi
sissippi
ssippi
ssissippi     <- R
6
How do we accelerate the search ?
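Maintain l = LCP(P, L)
Maintain r = LCP(P, R)
Assume l ≥ r
[Figure: the pattern P aligned against the suffixes L, M and R; l and r mark the lengths of the matched prefixes]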
7
If l = r then start comparing M to P at position l + 1
8
l > r
9
Someone whispers LCP(L, M)
LCP(L, M) > l
10
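LCP(L, M) > l
Continue in the right half: M agrees with L on more than l characters, so M mismatches P at position l + 1 exactly as L does, and M is smaller than P.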
11
LCP(L, M) < l
12
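LCP(L, M) < l
Continue in the left half: M already differs from L inside the first l characters, where L still matches P, and M is larger than L, so M is larger than P.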
13
LCP(L, M) = l
Start comparing M to P at position l + 1
14
Analysis
If we do more than a single comparison in an
iteration, then max(l, r) grows by 1 for each
additional comparison ⇒ O(m + log n) time
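The sketch below shows a simplified version of the idea. It keeps only l and r and restarts each comparison after the first min(l, r) characters, which every suffix between L and R is known to share with P. It does not use the whispered LCP(L, M) values, so its worst case remains O(m log n); reaching O(m + log n) needs those precomputed values. The function names, and the convention of virtual boundaries at ranks -1 and n, are assumptions made for this example.

```python
def find_first(text, sa, pattern):
    """Return the smallest rank whose suffix is >= pattern.

    lo and hi are boundaries with suffix(lo) < pattern <= suffix(hi);
    ranks -1 and len(sa) act as virtual boundaries with match length 0.
    l and r are the match lengths of pattern against those boundaries.
    """
    n, m = len(text), len(pattern)
    lo, hi = -1, len(sa)
    l = r = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        pos = sa[mid]
        k = min(l, r)  # characters already known to match
        while k < m and pos + k < n and text[pos + k] == pattern[k]:
            k += 1
        if k == m:
            hi, r = mid, k  # pattern is a prefix of the suffix
        elif pos + k == n or text[pos + k] < pattern[k]:
            lo, l = mid, k  # suffix is smaller than pattern
        else:
            hi, r = mid, k  # suffix is larger than pattern
    return hi


def occurrences(text, sa, pattern):
    """Collect the block of suffixes that start with pattern."""
    found = []
    for rank in range(find_first(text, sa, pattern), len(sa)):
        if not text.startswith(pattern, sa[rank]):
            break
        found.append(sa[rank])
    return sorted(found)
```

For S = mississippi and P = issa, `find_first` stops at the rank of issippi, which does not start with issa, so no occurrence is reported.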
15
Construct the suffix array without the suffix tree
16
Linear time construction
Recursively?
Say we want to sort only the suffixes that start at
even positions.
17
Change the alphabet
Every pair of characters is now a single character.
In effect, you sort the suffixes of a string that is shorter by
a factor of 2!
18
Change the alphabet
Rename every pair (or single trailing character) by its rank:
a → 0, aa → 1, ab → 2, b → 3, ba → 4, bb → 5
Example: ab aa ab becomes 2 1 2
19
But we do not gain anything
20
Divide into triples
T = y a b b a d a b b a d o
Triples starting at position 1: abb ada bba do
21
Divide into triples
T = y a b b a d a b b a d o
Triples starting at position 1: abb ada bba do
Triples starting at position 2: bba dab bad o
22
Sort recursively 2/3 of the suffixes
Position:  0  1  2  3  4  5  6  7  8  9 10 11
T:         y  a  b  b  a  d  a  b  b  a  d  o
Triples:   abb ada bba do | bba dab bad o
Renamed:   1   2   4   6  | 4   5   3   7
Recursively sort the suffixes of the renamed string 1 2 4 6 4 5 3 7.
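The renaming step can be sketched as below. The function name `rename_triples` is illustrative. The sketch relies on Python's string ordering, where a shorter string that is a prefix of a longer one sorts first, in place of the sentinel padding a full implementation would normally use.

```python
def rename_triples(t):
    """Name the triples starting at positions 1 mod 3 and 2 mod 3.

    Returns (starts, triples, names): the start positions in the order
    they appear in the reduced string, the triple at each position, and
    the rank of each triple among the distinct triples (1-based).
    """
    n = len(t)
    starts = list(range(1, n, 3)) + list(range(2, n, 3))
    triples = [t[i:i + 3] for i in starts]
    # Equal triples get equal names; names follow lexicographic order.
    rank_of = {tr: k for k, tr in enumerate(sorted(set(triples)), start=1)}
    names = [rank_of[tr] for tr in triples]
    return starts, triples, names
```

For T = yabbadabbado this yields the triples abb, ada, bba, do, bba, dab, bad, o and the reduced string 1 2 4 6 4 5 3 7.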
23
Sort the remaining third
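Position:  0  1  2  3  4  5  6  7  8  9 10 11
T:         y  a  b  b  a  d  a  b  b  a  d  o
Rank:      -  1  4  -  2  6  -  5  3  -  7  8
(Rank = rank of the suffix among the suffixes sorted recursively.)
For each remaining position i (i mod 3 = 0), form the pair (T[i], rank of suffix i + 1):
0: (y, 1)   3: (b, 2)   6: (a, 5)   9: (a, 7)
Sorting the pairs gives (a, 5), (a, 7), (b, 2), (y, 1), that is, the order 6, 9, 3, 0.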
24
Merge
Sorted sample suffixes (i mod 3 ≠ 0): 1 4 8 2 7 5 10 11
Sorted remaining suffixes (i mod 3 = 0): 6 9 3 0
Output: 1
25
Merge
Sorted sample suffixes (i mod 3 ≠ 0): 4 8 2 7 5 10 11
Sorted remaining suffixes (i mod 3 = 0): 6 9 3 0
Output: 1 6
26
Merge
Sorted sample suffixes (i mod 3 ≠ 0): 4 8 2 7 5 10 11
Sorted remaining suffixes (i mod 3 = 0): 9 3 0
Output: 1 6 4
27
Merge
Sorted sample suffixes (i mod 3 ≠ 0): 8 2 7 5 10 11
Sorted remaining suffixes (i mod 3 = 0): 9 3 0
Output: 1 6 4 9
28
Merge
Sorted sample suffixes (i mod 3 ≠ 0): 8 2 7 5 10 11
Sorted remaining suffixes (i mod 3 = 0): 3 0
Output: 1 6 4 9 3
29
Merge
Sorted sample suffixes (i mod 3 ≠ 0): 8 2 7 5 10 11
Sorted remaining suffixes (i mod 3 = 0): 0
Output: 1 6 4 9 3 8
30
Merge
Sorted sample suffixes (i mod 3 ≠ 0): 2 7 5 10 11
Sorted remaining suffixes (i mod 3 = 0): 0
Output: 1 6 4 9 3 8 2
31
Merge
Sorted sample suffixes (i mod 3 ≠ 0): 7 5 10 11
Sorted remaining suffixes (i mod 3 = 0): 0
Output: 1 6 4 9 3 8 2 7
32
Merge
Sorted sample suffixes (i mod 3 ≠ 0): 5 10 11
Sorted remaining suffixes (i mod 3 = 0): 0
Output: 1 6 4 9 3 8 2 7 5
33
Merge
Sorted sample suffixes (i mod 3 ≠ 0): 10 11
Sorted remaining suffixes (i mod 3 = 0): 0
Output: 1 6 4 9 3 8 2 7 5
34
Merge
Sorted sample suffixes (i mod 3 ≠ 0): (none left)
Sorted remaining suffixes (i mod 3 = 0): 0
Output: 1 6 4 9 3 8 2 7 5 10 11
35
Summary
Suffix array of T = yabbadabbado: 1 6 4 9 3 8 2 7 5 10 11 0
When comparing a suffix with index 0 (mod 3) to a suffix with index 1 (mod 3),
we compare the first characters and break ties by the ranks
of the suffixes that follow.
When comparing to a suffix with index 2 (mod 3),
we compare the first characters, then the next characters if there is a
tie, and finally the ranks of the suffixes that follow.
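The two comparison rules can be tried out with the sketch below. It is not the full linear-time algorithm: the ranks of the sample suffixes, which the recursion would supply, are obtained here by direct sorting as a stand-in. The function name `merge_with_sample_ranks` and the convention that rank 0 means "past the end of the text" are assumptions for this example.

```python
def merge_with_sample_ranks(t):
    """Build the suffix array of t using the sample/non-sample merge."""
    n = len(t)

    # Stand-in for the recursive step: sort the sample suffixes directly.
    sample = sorted((i for i in range(n) if i % 3 != 0), key=lambda i: t[i:])
    rank = [0] * (n + 2)  # rank 0 = past the end, smaller than any real rank
    for r, i in enumerate(sample, start=1):
        rank[i] = r

    def ch(i):
        # The empty string sorts before every real character.
        return t[i] if i < n else ""

    # Sort the remaining third by (first character, rank of next suffix).
    others = sorted(range(0, n, 3), key=lambda i: (ch(i), rank[i + 1]))

    def sample_is_smaller(j, i):
        if j % 3 == 1:
            # Compare one character, then ranks of the following suffixes.
            return (ch(j), rank[j + 1]) < (ch(i), rank[i + 1])
        # Index 2 mod 3: compare two characters, then ranks.
        return (ch(j), ch(j + 1), rank[j + 2]) < (ch(i), ch(i + 1), rank[i + 2])

    out, a, b = [], 0, 0
    while a < len(sample) and b < len(others):
        if sample_is_smaller(sample[a], others[b]):
            out.append(sample[a])
            a += 1
        else:
            out.append(others[b])
            b += 1
    return out + sample[a:] + others[b:]
```

For T = yabbadabbado the merge produces 1 6 4 9 3 8 2 7 5 10 11 0, the suffix array shown in the summary.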
36
Compute LCPs
Position:  0  1  2  3  4  5  6  7  8  9 10 11
T:         y  a  b  b  a  d  a  b  b  a  d  o
Rank:     12  1  7  5  3  9  2  8  6  4 10 11
Suffix array: 1 6 4 9 3 8 2 7 5 10 11 0

Rank  Position  Suffix
 1    1         abbadabbado
 2    6         abbado
 3    4         adabbado
 4    9         ado
 5    3         badabbado
 6    8         bado
 7    2         bbadabbado
 8    7         bbado
 9    5         dabbado
10    10        do
11    11        o
12    0         yabbadabbado
37
Crucial observation
For suffixes at ranks i < j in the suffix array:
LCP(i, j) = min{ LCP(i, i+1), LCP(i+1, i+2), ..., LCP(j-1, j) }
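The observation can be checked directly on the running example. The helper names below are illustrative, and ranks are 0-based in the code; `adj[k]` holds the LCP of the suffixes at ranks k and k + 1.

```python
def common_prefix_length(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def adjacent_lcps(t, sa):
    """LCPs of suffixes that are neighbours in the suffix array."""
    return [common_prefix_length(t[sa[k]:], t[sa[k + 1]:])
            for k in range(len(sa) - 1)]


def lcp_between(adj, i, j):
    """LCP of the suffixes at ranks i < j: minimum over adj[i..j-1]."""
    return min(adj[i:j])
```

For T = yabbadabbado the adjacent values are 5 1 2 0 3 1 4 0 1 0 0, and taking minima over ranges of this list reproduces the LCP of any pair of suffixes.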
38
Find LCPs of consecutive suffixes
Process the text positions in order 0, 1, 2, ... and compare each suffix with the suffix just before it in the suffix array.
Position 0: LCP(11, 0) = 0
39
Position 1 is first in the suffix array, so it has no predecessor.
Position 2: LCP(8, 2) = 1
40
Position 3: LCP(9, 3) = 0
41
Position 4: LCP(6, 4) = 1
42
Position 5: LCP(7, 5) = 0
43
Position 6: LCP(1, 6) = 5
44
Position 7: LCP(2, 7) = 4
45
Position 8: LCP(3, 8) = 3
46
Position 9: LCP(4, 9) = 2
47
Position 10: LCP(5, 10) = 1
48
Position 11: LCP(10, 11) = 0
49
Analysis
The starting position of the comparison decreases by at most 1 in every
iteration, so it cannot increase more than O(n)
times in total. The LCPs of all consecutive suffixes are therefore found in O(n) time.
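The procedure walked through on the preceding slides can be sketched as follows. The function name `lcp_array` and the convention that `lcp[r]` stores the LCP of the suffixes at ranks r - 1 and r (with `lcp[0] = 0`) are choices made for this example.

```python
def lcp_array(t, sa):
    """LCP of each suffix with its predecessor in the suffix array.

    Positions are processed in text order. When moving from position i
    to i + 1, the known match length h drops by at most 1, so the total
    number of character comparisons is O(n).
    """
    n = len(t)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r

    lcp = [0] * n
    h = 0  # number of characters already known to match
    for i in range(n):
        if rank[i] == 0:
            h = 0  # first suffix in sorted order has no predecessor
            continue
        j = sa[rank[i] - 1]  # predecessor of suffix i in sorted order
        while i + h < n and j + h < n and t[i + h] == t[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1
    return lcp
```

For T = yabbadabbado this returns `[0, 5, 1, 2, 0, 3, 1, 4, 0, 1, 0, 0]`, the same values filled in one by one on the slides.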
50
We need more LCPs for search
LCPs of consecutive suffixes in the suffix array: 5 1 2 0 3 1 4 0 1 0 0
The accelerated search also needs LCP(L, M) and LCP(M, R) for every interval the binary search can visit.
There are linearly many such intervals; calculate them all bottom-up.
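A bottom-up computation of these extra values might look like the sketch below. It assumes `lcp[r]` is the LCP of the suffixes at ranks r - 1 and r, and that the binary search always splits an interval (lo, hi) at (lo + hi) // 2. The name `interval_lcps` and the dictionary layout are illustrative.

```python
def interval_lcps(lcp):
    """Map every binary-search interval (lo, hi) to LCP(suffix lo, suffix hi).

    An interval of width 1 reads its value from the adjacent-LCP array.
    A wider interval takes the minimum of its two halves, so each value
    is computed from values below it in the search tree.
    """
    table = {}

    def solve(lo, hi):
        if hi - lo == 1:
            value = lcp[hi]
        else:
            mid = (lo + hi) // 2
            value = min(solve(lo, mid), solve(mid, hi))
        table[(lo, hi)] = value
        return value

    if len(lcp) > 1:
        solve(0, len(lcp) - 1)
    return table
```

For the 12 suffixes of yabbadabbado the table has 21 entries (11 unit intervals plus 10 wider ones), which illustrates the linear bound.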