Title: Suffix arrays
1Suffix arrays
2Suffix array
- We lose some of the functionality but we save space.
Let s = abab
Sort the suffixes lexicographically: ab, abab, b, bab
The suffix array gives the indices of the suffixes in sorted order:
2 0 3 1
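The definition above can be sketched directly in code. This is a naive construction that sorts the suffixes themselves, not the O(n) method of the following slides; the function name `suffix_array` is an illustrative choice.

```python
def suffix_array(s):
    """Return the start indices of the suffixes of s in lexicographic order.

    Naive version: sorting compares whole suffixes, so it is far from linear
    time. It is only meant to make the definition concrete.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


print(suffix_array("abab"))
```

For s = abab the sorted suffixes are ab, abab, b, bab, which start at positions 2, 0, 3, 1.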
3How do we build it ?
- Build a suffix tree
- Traverse the tree in DFS, lexicographically picking edges outgoing from each node, and fill the suffix array.
- O(n) time
4How do we search for a pattern ?
- If P occurs in T then all its occurrences are consecutive in the suffix array.
- Do a binary search on the suffix array
- Takes O(m log n) time
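A minimal sketch of this binary search, assuming the suffix array is given as a Python list of start positions. The name `occurrences` and the helper `first_index` are illustrative. Each step compares the pattern with the first m characters of one suffix, which gives the O(m log n) bound.

```python
def occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text, using suffix array sa."""
    m = len(pattern)

    def first_index(strict):
        # Smallest rank whose length-m prefix is >= pattern (strict=False)
        # or > pattern (strict=True).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]  # O(m) comparison
            if prefix < pattern or (strict and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    start, end = first_index(False), first_index(True)
    # All occurrences are consecutive in the suffix array: ranks start..end-1.
    return sorted(sa[start:end])


text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
print(occurrences(text, sa, "issi"))
print(occurrences(text, sa, "issa"))
```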
5Example
Let S = mississippi
Let P = issa
Sorted suffixes, with the binary search boundaries L and R and the midpoint M:
i             (L)
ippi
issippi
ississippi
mississippi
pi            (M)
ppi
sippi
sissippi
ssippi
ssissippi     (R)
6How do we accelerate the search ?
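Maintain l = LCP(P, L)
Maintain r = LCP(P, R). Assume l ≥ r
[Figure: L, M and R in the suffix array, with the matched lengths l and r marked]
7If l = r then start comparing M to P at l + 1
8l > r
9Someone whispers LCP(L,M)
LCP(L,M) > l
10Continue in the right half
LCP(L,M) > l (M agrees with L beyond position l, so M mismatches P exactly where L does, and P is larger than M)
11LCP(L,M) < l
12Continue in the left half
LCP(L,M) < l (M departs from L at a position where P still agrees with L, so P is smaller than M)
13LCP(L,M) = l
Start comparing M to P at l + 1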
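The bookkeeping of l and r can be sketched as follows. This is a simplified variant under stated assumptions: it only uses the fact that every suffix between L and R shares its first min(l, r) characters with P, and it does not use the whispered LCP(L,M) values, so it does not reach the full speed-up. The name `find_first` and the virtual boundaries -1 and n are illustrative choices.

```python
def find_first(text, sa, pattern):
    """Return the start of the lexicographically smallest suffix that has
    pattern as a prefix, or -1 if pattern does not occur in text."""
    n, m = len(sa), len(pattern)
    L, R = -1, n   # virtual boundaries just outside the suffix array
    l = r = 0      # l = LCP(P, suffix L), r = LCP(P, suffix R)
    while R - L > 1:
        M = (L + R) // 2
        start = sa[M]
        k = min(l, r)  # the first k chars of suffix M are known to match P
        while k < m and start + k < len(text) and text[start + k] == pattern[k]:
            k += 1
        if k == m or (start + k < len(text) and text[start + k] > pattern[k]):
            R, r = M, k  # suffix M >= P: continue in the left half
        else:
            L, l = M, k  # suffix M < P: continue in the right half
    return sa[R] if R < n and r == m else -1


text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(find_first(text, sa, "issi"))
print(find_first(text, sa, "issa"))
```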
14Analysis
If we do more than a single comparison in an iteration, then max(l, r) grows by 1 for each extra comparison ⇒ O(m + log n) time
15Construct the suffix array without the suffix tree
16Linear time construction
Recursively ?
Say we want to sort only suffixes that start at
even positions ?
17Change the alphabet
Every pair of characters is now a character
You in fact sort suffixes of a string shorter by
a factor of 2 !
18Change the alphabet
a a b a a b
2 1 2
19But we do not gain anything
20Divide into triples
S = yabbadabbado
Triples starting at position 1: abb ada bba do
21Divide into triples
S = yabbadabbado
Triples starting at position 1: abb ada bba do
Triples starting at position 2: bba dab bad o
22Sort recursively 2/3 of the suffixes
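S = yabbadabbado (positions 0 to 11)
Triples starting at positions 1, 4, 7, 10: abb ada bba do
Triples starting at positions 2, 5, 8, 11: bba dab bad o
Rank of each triple among all triples: 1 2 4 6 and 4 5 3 7
Recursively sort the suffixes of the new string 1 2 4 6 4 5 3 7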
23Sort the remaining third
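S = yabbadabbado (positions 0 to 11)
Ranks of the sorted 2/3, by position: 1 → 1, 4 → 2, 8 → 3, 2 → 4, 7 → 5, 5 → 6, 10 → 7, 11 → 8
Positions 0, 3, 6, 9 give the pairs (char, rank of the following suffix):
(y, 1) (b, 2) (a, 5) (a, 7)
Sorted: (a, 5) (a, 7) (b, 2) (y, 1), that is positions 6, 9, 3, 0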
24Merge
Merge the sorted 2/3 (positions 1, 4, 8, 2, 7, 5, 10, 11) with the sorted remaining third (positions 6, 9, 3, 0). The output after each step:
1
1 6
1 6 4
1 6 4 9
1 6 4 9 3
1 6 4 9 3 8
1 6 4 9 3 8 2
1 6 4 9 3 8 2 7
1 6 4 9 3 8 2 7 5
1 6 4 9 3 8 2 7 5 10 11
1 6 4 9 3 8 2 7 5 10 11 0
35summary
S = yabbadabbado
Suffix array: 1 6 4 9 3 8 2 7 5 10 11 0
When comparing to a suffix with index 1 (mod 3) we compare the char and break ties by the ranks of the following suffixes.
When comparing to a suffix with index 2 (mod 3) we compare the char, then the next char if there is a tie, and finally the ranks of the following suffixes.
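The two comparison rules can be sketched as one level of the construction. This is not the full linear-time algorithm: as a loudly stated assumption, the recursive sort of the 2/3 sample is replaced by a naive sort, and Python's comparison sort stands in for radix sort. The names `one_level_merge`, `ch` and `sample_first` are hypothetical.

```python
def one_level_merge(s):
    """Build the suffix array of s with a single sample/rest/merge round."""
    n = len(s)
    # Sample: positions with i % 3 != 0. The slides sort these recursively;
    # here a naive sort stands in for the recursion (assumption).
    sample = sorted((i for i in range(n) if i % 3 != 0), key=lambda i: s[i:])
    rank = [0] * (n + 3)  # rank 0 means "past the end of the string"
    for r, i in enumerate(sample, start=1):
        rank[i] = r

    def ch(i):
        return s[i] if i < n else ""  # "" sorts before every real char

    # Remaining third: sort by (char, rank of the following suffix).
    rest = sorted((i for i in range(n) if i % 3 == 0),
                  key=lambda i: (ch(i), rank[i + 1]))

    def sample_first(i, j):
        # i is a sample position, j is a position with j % 3 == 0.
        if i % 3 == 1:
            return (ch(i), rank[i + 1]) < (ch(j), rank[j + 1])
        return (ch(i), ch(i + 1), rank[i + 2]) < (ch(j), ch(j + 1), rank[j + 2])

    out, a, b = [], 0, 0
    while a < len(sample) and b < len(rest):
        if sample_first(sample[a], rest[b]):
            out.append(sample[a])
            a += 1
        else:
            out.append(rest[b])
            b += 1
    return out + sample[a:] + rest[b:]


print(one_level_merge("yabbadabbado"))
```

Every rank that the comparisons consult belongs to a position with index 1 or 2 (mod 3), which is why ranks of the sample alone are enough.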
36Compute LCPs
S = yabbadabbado
Suffix array, with the sorted suffixes:
1   abbadabbado
6   abbado
4   adabbado
9   ado
3   badabbado
8   bado
2   bbadabbado
7   bbado
5   dabbado
10  do
11  o
0   yabbadabbado
37Crucial observation
For suffixes at ranks i < j in the suffix array:
LCP(i, j) = min{ LCP(i, i+1), LCP(i+1, i+2), ..., LCP(j-1, j) }
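A small check of this observation on the running example, assuming an array `lcp` where `lcp[k]` holds the LCP of the suffixes at ranks k-1 and k. The helper names are illustrative, and both the suffix array and the LCP values are computed naively here.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def lcp_between_ranks(lcp, i, j):
    """LCP of the suffixes at ranks i < j, as a minimum over consecutive LCPs."""
    return min(lcp[i + 1:j + 1])


s = "yabbadabbado"
sa = sorted(range(len(s)), key=lambda i: s[i:])
# lcp[k] = LCP of the suffixes at ranks k-1 and k (lcp[0] is unused and set to 0).
lcp = [0] + [common_prefix_len(s[sa[k - 1]:], s[sa[k]:]) for k in range(1, len(s))]

# Ranks 4 and 7 hold badabbado and bbado, which share only the first b.
print(lcp_between_ranks(lcp, 4, 7))
```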
38Find LCPs of consecutive suffixes
S = yabbadabbado, suffix array: 1 6 4 9 3 8 2 7 5 10 11 0
Go over the suffixes in text order (positions 0, 1, 2, ...) and compute the LCP of each suffix with its predecessor in the suffix array:
LCP(11, 0) = 0
(position 1 is first in the suffix array, so it has no predecessor)
LCP(8, 2) = 1
LCP(9, 3) = 0
LCP(6, 4) = 1
LCP(7, 5) = 0
LCP(1, 6) = 5
LCP(2, 7) = 4
LCP(3, 8) = 3
LCP(4, 9) = 2
LCP(5, 10) = 1
LCP(10, 11) = 0
49Analysis
LCPs of consecutive suffixes, in suffix array order: 5 1 2 0 3 1 4 0 1 0 0
The starting position decreases by at most 1 in every iteration, so it cannot increase more than O(n) times
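The walk through the positions in text order can be sketched as below. The source does not name the procedure; this sketch follows the steps shown on the slides, and the function name `lcp_array` and the convention `lcp[0] = 0` are assumptions made for illustration.

```python
def lcp_array(s, sa):
    """lcp[r] = LCP of the suffixes at ranks r-1 and r in sa (lcp[0] = 0)."""
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0  # number of characters already known to match
    for i in range(n):  # text order: positions 0, 1, 2, ...
        if rank[i] == 0:
            h = 0  # first suffix in sorted order has no predecessor
            continue
        j = sa[rank[i] - 1]  # predecessor of suffix i in the suffix array
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1  # successful comparison
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # the known match shrinks by one when the next iteration starts
    return lcp


s = "yabbadabbado"
sa = [1, 6, 4, 9, 3, 8, 2, 7, 5, 10, 11, 0]
print(lcp_array(s, sa))
```

Since h drops by at most one per iteration and never exceeds n, the inner loop runs O(n) times in total.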
50We need more LCPs for search
The binary search also needs LCPs of suffixes that are not adjacent in the suffix array.
There are linearly many of them; calculate them all bottom up from the LCPs of consecutive suffixes.
51Another example
S = abcabbca (positions 1 to 8)
Sorted suffixes, each with its starting position and the LCP with the previous suffix:
a          8
abbca      4   LCP 1
abcabbca   1   LCP 2
bbca       5   LCP 0
bca        6   LCP 1
bcabbca    2   LCP 3
ca         7   LCP 0
cabbca     3   LCP 2
52Analysis
Think about the LCP which we know at any point in the algorithm
A successful comparison increases it by one
It decreases by at most one when an iteration starts
So the number of successful comparisons is O(n)
53Burrows Wheeler (bzip2)
- Among the best compression algorithms for text
- High level
- Apply the Burrows-Wheeler transform
- Use move-to-front to translate the sorted characters to small integers
- Use Huffman coding
54Step I: Build a matrix M whose rows are all the cyclic shifts of S
S = abraca
55Step II: Sort the rows lexicographically
F is the first column and L is the last column of the sorted matrix
L is the Burrows-Wheeler Transform
56Claim: Every column contains all chars.
Cyclic shifts of S = abraca:
a b r a c a
b r a c a a
r a c a a b
a c a a b r
c a a b r a
a a b r a c
Sorted rows (F is the first column, L is the last):
a a b r a c
a b r a c a
a c a a b r
b r a c a a
c a a b r a
r a c a a b
You can obtain F from L by sorting
57Sorted rows of S = abraca (F is the first column, L is the last):
a a b r a c
a b r a c a
a c a a b r
b r a c a a
c a a b r a
r a c a a b
The a's are in the same order in L and in F. Similarly for every other char.
58From L you can reconstruct the string
F = a a a b c r
L = c a r a a b
What is the first char of S ?
a
ab
abr
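The reconstruction can be sketched as follows, under two assumptions that the slides do not spell out: no end marker is appended, so the row of S in the sorted matrix is kept next to L, and the rotations are sorted naively. The names `bwt` and `inverse_bwt` are illustrative.

```python
def bwt(s):
    """Return (L, row): the last column of the sorted cyclic shifts of s and
    the row at which s itself appears."""
    n = len(s)
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])  # naive sort
    L = "".join(s[i - 1] for i in order)  # char preceding each shift's start
    return L, order.index(0)


def inverse_bwt(L, row):
    """Rebuild the string from L and the row of the original string."""
    n = len(L)
    # A stable sort of the positions of L yields F, and the k-th occurrence
    # of a char in F is the k-th occurrence of that char in L.
    order = sorted(range(n), key=lambda j: L[j])  # F[k] is the char L[order[k]]
    out, k = [], row
    for _ in range(n):
        k = order[k]      # row whose last char is the current first char
        out.append(L[k])  # emit that char; row k starts with the next one
    return "".join(out)


L, row = bwt("abraca")
print(L, row)
print(inverse_bwt(L, row))
```

The stable sort is what encodes the claim that equal chars keep their relative order between L and F.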
62Compression ?
L = a c r a a b
Compress the transform to a string of integers using move to front:
0 2 3 2 0 3
Then use Huffman to code the integers
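A minimal sketch of move to front, assuming the table starts in alphabetical order a, b, c, r. The initial order is an assumption; the slide does not state it, but it reproduces the integers shown. The function name `move_to_front` is illustrative.

```python
def move_to_front(text, alphabet):
    """Encode each char as its current index in a table, then move it to the front."""
    table = list(alphabet)
    out = []
    for c in text:
        idx = table.index(c)
        out.append(idx)
        table.insert(0, table.pop(idx))  # recently seen chars get small codes
    return out


print(move_to_front("acraab", "abcr"))
```

Runs of equal characters become runs of zeros, which is what makes the later Huffman step effective.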
63Why is it good ?
Characters with the same (right) context appear
together
64Sorting is equivalent to computing the suffix
array.
Can encode and decode in linear time
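65A useful tool: L → F mapping
[Figure: rows of the sorted matrix for mississippi, with the first column F and the last column L; the middle of each row is marked unknown. Rows shown: m ississipp i, i mississip p, i ppimissis s]
To implement the LF-mapping for a char a at position j in L we need the oracle occ(a, j)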
66Substring search using the BWT (Count the pattern occurrences)
[Figure: the rows fr to lr of the sorted matrix; only the columns F and L are known]
- Find the first c in L[fr, lr]
- Find the last c in L[fr, lr]
- L-to-F mapping of these chars
Occ() oracle is enough
slide stolen from Paolo Ferragina
68Substring search using the BWT (Count the pattern occurrences)
What if someone whispers how many s's we have up to index 2 and up to index 5, that is occ(s, 2) and occ(s, 5) ?
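Counting with the occ oracle can be sketched as below. Several things here are assumptions for illustration: an end marker `$` is appended to the text, the transform is built by naively sorting rotations, occ is answered by scanning L, and the row range is kept half-open. The names `count_occurrences` and `C` are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text by backward search over the BWT."""
    t = text + "$"  # assumed end marker, smaller than every other char
    rows = sorted(t[i:] + t[:i] for i in range(len(t)))  # naive construction
    L = "".join(row[-1] for row in rows)

    # C[c] = number of chars in the text that are smaller than c,
    # i.e. the first row of F that starts with c.
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)

    def occ(c, j):
        return L[:j].count(c)  # naive oracle: number of c in L[0:j]

    fr, lr = 0, len(L)  # half-open range of rows prefixed by the current suffix
    for c in reversed(pattern):
        if c not in C:
            return 0
        fr = C[c] + occ(c, fr)  # L-to-F mapping of the first c in the range
        lr = C[c] + occ(c, lr)  # and of the position just past the last c
        if fr >= lr:
            return 0
    return lr - fr


print(count_occurrences("mississippi", "ssi"))
print(count_occurrences("mississippi", "issa"))
```

Each pattern character costs two occ queries, so the running time is determined by how fast the oracle answers.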
69occ( a , j )
L = i p s s m $ p i s s i i (where $ is the end-of-text marker)
occ(s, 4) = 2
70Make a bit vector for each character
L = i p s s m $ p i s s i i (where $ is the end-of-text marker)
Bit vector for s: 0 0 1 1 0 0 0 0 1 1 0 0
occ(s, 4) = rank(4)
rank(i) = how many ones are there up to and including position i ?
71How do you answer rank queries ?
0 0 1 1 0 0 0 0 1 1 0 0
rank(i) = how many ones are there up to and including position i ?
We can prepare a vector with all answers
0 0 1 2 2 2 2 2 3 4 4 4
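The vector of all answers is a running sum over the bits. A short sketch, with the illustrative name `all_ranks`; position i on the slide corresponds to index i - 1 in the returned list.

```python
from itertools import accumulate


def all_ranks(bits):
    """Return the list whose entry k is the number of ones in bits[0..k]."""
    return list(accumulate(bits))


bits = [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0]
ranks = all_ranks(bits)
print(ranks)
print(ranks[4 - 1])  # rank(4) with 1-indexed positions
```

Each entry needs about log n bits, so this table answers queries in constant time but takes far more space than the bit vector itself.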
72Let's do it with O(n) bits per character
73[Figure: the bit vector split into blocks of (log n)/2 bits, with a count of ones stored per block (2, 5, 7 in the slide)]
In our solution the bit vector takes Θ(n) bits and the additionals also take Θ(n) bits
74Can we do it with smaller overhead so additionals would take o(n) ?
[Figure: the bit vector split into superblocks, each split into blocks, with the stored counts (2, 4, 7, 13 in the slide)]
Superblocks of size log²(n)
Each superblock keeps the number of ones in all previous superblocks
Each block keeps the number of ones in previous blocks that are in the same superblock
75Analysis
The superblock table is of size O(n / log n) bits
The block table is of size O(n log log n / log n) bits
The lookup tables for the blocks take O(√n · log n · log log n) bits
So the additionals take o(n) space
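The two-level layout can be sketched as below. This is an illustration under stated assumptions rather than a space-exact structure: Python lists stand in for packed bit arrays, and the per-block lookup table is replaced by summing the bits inside one block. The class name `RankBitVector` and its fields are hypothetical.

```python
import math


class RankBitVector:
    """Rank with superblock and block counters."""

    def __init__(self, bits):
        self.bits = list(bits)
        log_n = max(1, math.ceil(math.log2(max(len(self.bits), 2))))
        self.block = max(1, log_n // 2)  # block size, about (log n)/2
        self.per_super = 2 * log_n       # blocks per superblock, about log^2 n bits
        self.super_counts = []  # ones before each superblock
        self.block_counts = []  # ones before each block, inside its superblock
        total = in_super = 0
        for b, start in enumerate(range(0, len(self.bits), self.block)):
            if b % self.per_super == 0:  # a new superblock starts here
                self.super_counts.append(total)
                in_super = 0
            self.block_counts.append(in_super)
            ones = sum(self.bits[start:start + self.block])
            total += ones
            in_super += ones

    def rank(self, i):
        """Number of ones among the first i bits."""
        if i <= 0 or not self.bits:
            return 0
        i = min(i, len(self.bits))
        b = min(i // self.block, len(self.block_counts) - 1)
        # Stand-in for the lookup table: count inside a single block.
        inside = sum(self.bits[b * self.block:i])
        return self.super_counts[b // self.per_super] + self.block_counts[b] + inside


v = RankBitVector([0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0])
print(v.rank(4))
```

A query adds three numbers: one superblock counter, one block counter, and the count inside a single block.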
76Next step
Do it without keeping the bit vectors themselves
Instead keep only the compressed version of the
text
Saves a lot of space for compressible strings