Suffix arrays - PowerPoint PPT Presentation

Provided by: hai2
1
Suffix arrays
2
Suffix array
  • We lose some of the functionality but we save
    space.

Let s = abab
Sort the suffixes lexicographically: ab, abab, b, bab
The suffix array gives the indices of the suffixes in sorted order: 2, 0, 3, 1
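The definition can be checked with a few lines of Python. This is only a minimal sketch that sorts the suffixes directly; the function name `suffix_array` is chosen here for illustration, and the direct sort is far slower than the O(n) constructions discussed on the following slides.

```python
def suffix_array(s: str) -> list[int]:
    """Return the start positions of the suffixes of s in lexicographic order."""
    # Direct sorting of the suffixes: simple, but up to O(n^2 log n) time.
    return sorted(range(len(s)), key=lambda i: s[i:])


print(suffix_array("abab"))  # [2, 0, 3, 1]
```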
3
How do we build it ?
  • Build a suffix tree
  • Traverse the tree in DFS, lexicographically
    picking edges outgoing from each node and fill
    the suffix array.
  • O(n) time

4
How do we search for a pattern ?
  • If P occurs in T then all its occurrences are
    consecutive in the suffix array.
  • Do a binary search on the suffix array
  • Takes O(m log n) time
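The binary search can be sketched as follows. The helper name `find_occurrences` is an assumption made for this example; it runs two binary searches to find the block of consecutive suffixes that start with the pattern, comparing at most m characters per step.

```python
def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return the sorted start positions of pattern in text, using the suffix array sa."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(find_occurrences(text, sa, "issi"))  # [1, 4]
print(find_occurrences(text, sa, "issa"))  # []
```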

5
Example
Let S = mississippi
Let P = issa
Sorted suffixes of S, with the search bounds L, M and R:
i            <- L
ippi
issippi
ississippi
mississippi
pi           <- M
ppi
sippi
sissippi
ssippi
ssissippi    <- R
6
How do we accelerate the search ?
Maintain l = LCP(P, L)
Maintain r = LCP(P, R). Assume l ≥ r
[Figure: the suffixes L, M and R, with the matched prefix lengths l and r marked]
7
If l = r then start comparing M to P at l + 1
8
l > r
9
Someone whispers LCP(L, M)
LCP(L, M) > l
10
Continue in the right half
LCP(L, M) > l
11
LCP(L, M) < l
12
Continue in the left half
LCP(L, M) < l
13
LCP(L, M) = l
Start comparing M to P at l + 1
14
Analysis
If we do more than a single comparison in an
iteration then max(l, r) grows by 1 for each
comparison ⇒ O(m + log n) time
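The idea of reusing l and r can be tried out with a simpler variant than the one on the slides. The sketch below does not use the whispered LCP(L, M) values; it only skips the first min(l, r) characters at every step, which is correct because every suffix between L and R shares that prefix with P. It therefore does not reach the O(m + log n) bound in the worst case. The names `compare_from` and `contains` are assumptions for this example.

```python
def compare_from(text: str, start: int, pattern: str, k: int) -> tuple[int, int]:
    """Extend a known match of k characters between text[start:] and pattern.

    Returns (matched, sign): sign is 0 if pattern is a prefix of the suffix,
    -1 if the suffix is smaller than pattern, +1 if it is larger.
    """
    n, m = len(text), len(pattern)
    while k < m and start + k < n and text[start + k] == pattern[k]:
        k += 1
    if k == m:
        return k, 0
    if start + k == n or text[start + k] < pattern[k]:
        return k, -1
    return k, 1


def contains(text: str, sa: list[int], pattern: str) -> bool:
    """Binary search on the suffix array that skips min(l, r) characters per step."""
    if not sa:
        return pattern == ""
    lo, hi = 0, len(sa) - 1
    l, sign = compare_from(text, sa[lo], pattern, 0)
    if sign == 0:
        return True
    if sign > 0:
        return False  # pattern is smaller than every suffix
    r, sign = compare_from(text, sa[hi], pattern, 0)
    if sign == 0:
        return True
    if sign < 0:
        return False  # pattern is larger than every suffix
    # Invariant: suffix at lo < pattern < suffix at hi.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        k, sign = compare_from(text, sa[mid], pattern, min(l, r))
        if sign == 0:
            return True
        if sign < 0:
            lo, l = mid, k
        else:
            hi, r = mid, k
    return False


text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(contains(text, sa, "issi"), contains(text, sa, "issa"))  # True False
```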
15
Construct the suffix array without the suffix tree
16
Linear time construction
Recursively ?
Say we want to sort only suffixes that start at
even positions ?





















17
Change the alphabet
Every pair of characters is now a character
You in fact sort suffixes of a string shorter by
a factor of 2!
18
Change the alphabet
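[Figure: a six-letter string over {a, b}; each pair of characters is replaced by its rank, giving a three-symbol string over {1, 2}]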
19
But we do not gain anything
20
Divide into triples
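String: y a b b a d a b b a d o (positions 0 to 11)
Triples starting at positions 1, 4, 7, 10: abb ada bba do
21
Divide into triples
String: y a b b a d a b b a d o (positions 0 to 11)
Triples starting at positions 1, 4, 7, 10: abb ada bba do
Triples starting at positions 2, 5, 8, 11: bba dab bad o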
22
Sort recursively 2/3 of the suffixes
String: y a b b a d a b b a d o (positions 0 to 11)
Triples:  abb ada bba do  bba dab bad o
Names:    1   2   4   6   4   5   3   7
23
Sort the remaining third
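String: y a b b a d a b b a d o (positions 0 to 11)
Ranks of the sorted suffixes at positions 1, 2, 4, 5, 7, 8, 10, 11: 1 4 2 6 5 3 7 8
Pairs (first char, rank of the following suffix) for positions 0, 3, 6, 9: (y, 1) (b, 2) (a, 5) (a, 7)
Sorted pairs: (a, 5) (a, 7) (b, 2) (y, 1), that is positions 6 9 3 0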
24
Merge
String: y a b b a d a b b a d o (positions 0 to 11)
Sorted suffixes from the recursion: 1 4 8 2 7 5 10 11
Sorted remaining suffixes: 6 9 3 0
Output so far: 1
25
Merge
Output so far: 1 6
26
Merge
Output so far: 1 6 4
27
Merge
Output so far: 1 6 4 9
28
Merge
Output so far: 1 6 4 9 3
29
Merge
Output so far: 1 6 4 9 3 8
30
Merge
Output so far: 1 6 4 9 3 8 2
31
Merge
Output so far: 1 6 4 9 3 8 2 7
32
Merge
Output so far: 1 6 4 9 3 8 2 7 5
33
Merge
Output so far: 1 6 4 9 3 8 2 7 5 (still to merge: 10 11 and 0)
34
Merge
Output so far: 1 6 4 9 3 8 2 7 5 10 11 (still to merge: 0)
35
Summary
String: y a b b a d a b b a d o (positions 0 to 11)
Suffix array: 1 6 4 9 3 8 2 7 5 10 11 0
When comparing to a suffix with index 1 (mod 3)
we compare the char and break ties by the ranks
of the following suffixes
When comparing to a suffix with index 2 (mod 3)
we compare the char, the next char if there is a
tie, and finally the ranks of the following
suffixes
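The two comparison rules are enough to write the merge step. The sketch below assumes the two sorted lists are already available (here they are copied from the example); the function name `merge_sorted_suffixes` and the convention that rank 0 means "past the end of the string" are choices made for this illustration, not part of the slides.

```python
def merge_sorted_suffixes(s: str, sample_sorted: list[int], rest_sorted: list[int]) -> list[int]:
    """Merge the sorted suffixes at positions 1, 2 (mod 3) with those at positions 0 (mod 3)."""
    n = len(s)
    # rank[i] = 1-based rank of suffix i among the sample suffixes; 0 for positions past the end.
    rank = [0] * (n + 3)
    for r, i in enumerate(sample_sorted, start=1):
        rank[i] = r

    def ch(i: int) -> str:
        return s[i] if i < n else ""  # "" sorts before every character

    def sample_first(i: int, j: int) -> bool:
        """True if sample suffix i sorts before the mod-0 suffix j."""
        if i % 3 == 1:
            # One char, then the ranks of the following suffixes (both are sample suffixes).
            return (ch(i), rank[i + 1]) < (ch(j), rank[j + 1])
        # Two chars, then the ranks of the suffixes two positions later.
        return (ch(i), ch(i + 1), rank[i + 2]) < (ch(j), ch(j + 1), rank[j + 2])

    out, a, b = [], 0, 0
    while a < len(sample_sorted) and b < len(rest_sorted):
        if sample_first(sample_sorted[a], rest_sorted[b]):
            out.append(sample_sorted[a])
            a += 1
        else:
            out.append(rest_sorted[b])
            b += 1
    return out + sample_sorted[a:] + rest_sorted[b:]


print(merge_sorted_suffixes("yabbadabbado", [1, 4, 8, 2, 7, 5, 10, 11], [6, 9, 3, 0]))
# [1, 6, 4, 9, 3, 8, 2, 7, 5, 10, 11, 0]
```

Every comparison looks at a constant number of characters and ranks, so the merge is linear in the length of the string.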
36
Compute LCPs
String: y a b b a d a b b a d o (positions 0 to 11)
Suffix array: 1 6 4 9 3 8 2 7 5 10 11 0
Sorted suffixes:
1   abbadabbado
6   abbado
4   adabbado
9   ado
3   badabbado
8   bado
2   bbadabbado
7   bbado
5   dabbado
10  do
11  o
0   yabbadabbado
37
Crucial observation
LCP(i, j) = min{ LCP(i, i+1), LCP(i+1, i+2), ..., LCP(j-1, j) }
where i < j are positions in the suffix array
38
Find LCPs of consecutive suffixes
String: y a b b a d a b b a d o (positions 0 to 11)
Suffix array: 1 6 4 9 3 8 2 7 5 10 11 0
Go over the suffixes in text order and compare each one with its predecessor in the suffix array
LCP(11, 0) = 0
39
LCP(8, 2) = 1
40
LCP(9, 3) = 0
41
LCP(6, 4) = 1
42
LCP(7, 5) = 0
43
LCP(1, 6) = 5
44
LCP(2, 7) = 4
45
LCP(3, 8) = 3
46
LCP(4, 9) = 2
47
LCP(5, 10) = 1
48
LCP(10, 11) = 0
49
Analysis
The starting position decreases by at most 1 in every
iteration. So it cannot increase more than O(n)
times
50
We need more LCPs for search
Suffix array: 1 6 4 9 3 8 2 7 5 10 11 0
LCPs of consecutive suffixes: 5 1 2 0 3 1 4 0 1 0 0
Linearly many, calculate them all bottom up
51
Another example
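String: a b c a b b c a (positions 1 to 8; position 9 is the empty suffix, which sorts last)
Suffix array: 4 1 8 5 2 6 3 7 9
Sorted suffixes, each with the LCP of the suffix and its successor:
4  abbca     2
1  abcabbca  1
8  a         0
5  bbca      1
2  bcabbca   3
6  bca       0
3  cabbca    2
7  ca        0
9  (empty)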
52
Analysis
Think about the LCP which we know at any point in
the algorithm
A successful comparison increases it by one
It decreases by one when iteration starts
So the number of successful comparisons is O(n)
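The procedure analysed above can be sketched in Python as follows. The names `lcp_array`, `rank` and `h` are chosen for this example; `h` is the LCP known at any point, which grows by one per successful comparison and drops by one when the next iteration starts.

```python
def lcp_array(s: str, sa: list[int]) -> list[int]:
    """lcp[r] = LCP of the suffixes sa[r-1] and sa[r]; lcp[0] is 0 by convention."""
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0  # LCP carried over from the previous text position
    for i in range(n):  # suffixes in text order
        if rank[i] == 0:
            h = 0  # the smallest suffix has no predecessor
            continue
        j = sa[rank[i] - 1]  # predecessor of suffix i in the suffix array
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1  # successful comparison
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # dropping the first character loses at most one
    return lcp


s = "yabbadabbado"
sa = sorted(range(len(s)), key=lambda i: s[i:])
print(lcp_array(s, sa))  # [0, 5, 1, 2, 0, 3, 1, 4, 0, 1, 0, 0]
```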
53
Burrows Wheeler (bzip2)
  • Currently best algorithm for text
  • High level
  • Apply the Burrows-Wheeler transform
  • Use move-to-front to translate the sorted
    characters to small integers
  • Use Huffman coding

54
Step I: build the matrix M whose rows are all the cyclic shifts of S
S = abraca$ ($ marks the end of S and sorts before every letter)
M
55
Step II: sort the rows lexicographically
F is the first column of the sorted matrix and L is the last
L is the Burrows Wheeler Transform
56
Claim: Every column contains all chars.
M:
a b r a c a $
b r a c a $ a
r a c a $ a b
a c a $ a b r
c a $ a b r a
a $ a b r a c
$ a b r a c a
Sorted rows:
F           L
$ a b r a c a
a $ a b r a c
a b r a c a $
a c a $ a b r
b r a c a $ a
c a $ a b r a
r a c a $ a b
You can obtain F from L by sorting
57
The a's are in the same order in L and in
F. Similarly for every other char.
58
From L you can reconstruct the string
F L
$ a
a c
a $
a r
b a
c a
r b
What is the first char of S?
59
From L you can reconstruct the string
What is the first char of S? a
60
From L you can reconstruct the string
ab
61
From L you can reconstruct the string
abr
62
Compression?
L = a c $ r a a b
Compress the transform to a string of integers
using move to front (the end marker is skipped): 0 2 3 2 0 3
Then use Huffman to code the integers
63
Why is it good?
Characters with the same (right) context appear
together
64
Sorting is equivalent to computing the suffix
array.
Can encode and decode in linear time
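The transform, its inverse and the move-to-front step can be tried on the example string. This is only a sketch under two assumptions: the end marker is written `$`, and the rows are sorted naively, so it does not show the linear-time encoding mentioned above. The function names are chosen for this example, and the Huffman step is left out.

```python
def bwt(s: str) -> str:
    """Last column L of the sorted cyclic shifts of s (s ends with a unique end marker)."""
    rows = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rows)


def inverse_bwt(last: str, end: str = "$") -> str:
    """Rebuild s from L, using that equal chars appear in the same order in F and in L."""
    # A stable sort of the positions of L by character gives, for every row of F,
    # the row of L that holds the same occurrence of that character.
    order = sorted(range(len(last)), key=lambda i: last[i])
    row = last.index(end)  # the row of the sorted matrix that equals s
    out = []
    for _ in range(len(last)):
        row = order[row]
        out.append(last[row])
    return "".join(out)


def move_to_front(text: str, alphabet: str) -> list[int]:
    """Replace each char by its index in a list that moves the char just seen to the front."""
    table = list(alphabet)
    codes = []
    for c in text:
        k = table.index(c)
        codes.append(k)
        table.insert(0, table.pop(k))
    return codes


last = bwt("abraca$")
print(last)                                          # ac$raab
print(inverse_bwt(last))                             # abraca$
print(move_to_front(last.replace("$", ""), "abcr"))  # [0, 2, 3, 2, 0, 3]
```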
65
A useful tool: L → F mapping
Rows of the sorted matrix for mississippi ($ is the end marker):
F  unknown     L
$  mississipp  i
i  $mississip  p
i  ppi$missis  s
To implement the LF-mapping for a char a at
position j in L we need the oracle occ(a, j)
66
Substring search using the BWT (Count the
pattern occurrences)
  • Find the first c in L[fr, lr]
  • Find the last c in L[fr, lr]
  • L-to-F mapping of these chars

Occ() oracle is enough
slide stolen from Paolo Ferragina_at_
67
Substring search using the BWT (Count the
pattern occurrences)
  • Find the first c in L[fr, lr]
  • Find the last c in L[fr, lr]
  • L-to-F mapping of these chars

slide stolen from Paolo Ferragina_at_
68
Substring search using the BWT (Count the
pattern occurrences)
What if someone whispers how many s we have up
to index 2 and up to index 5: occ(s,2), occ(s,5)?
slide stolen from Paolo Ferragina_at_
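The counting procedure can be sketched with a naive occ() in place of the oracle. The names `count_occurrences` and `first_row` are assumptions for this example, as is the end marker `$`; rows are 0-based and the range [fr, lr) is half-open, so the indices differ by one from the 1-based ones on the slides.

```python
def bwt(s: str) -> str:
    """Last column L of the sorted cyclic shifts of s."""
    rows = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rows)


def count_occurrences(last: str, pattern: str) -> int:
    """Count the occurrences of pattern in the text whose transform is last."""
    # first_row[c] = number of chars in the text smaller than c,
    # which is the first row of the sorted matrix that starts with c.
    first_row, total = {}, 0
    for c in sorted(set(last)):
        first_row[c] = total
        total += last.count(c)

    def occ(c: str, j: int) -> int:
        # Naive stand-in for the oracle: occurrences of c in the first j chars of L.
        return last[:j].count(c)

    fr, lr = 0, len(last)  # rows fr .. lr-1 start with the part of the pattern read so far
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        # L-to-F mapping of the first and last c inside the current range.
        fr = first_row[c] + occ(c, fr)
        lr = first_row[c] + occ(c, lr)
        if fr >= lr:
            return 0
    return lr - fr


last = bwt("mississippi$")
print(last)                             # ipssm$pissii
print(count_occurrences(last, "ssi"))   # 2
print(count_occurrences(last, "issa"))  # 0
```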
69
occ(a, j)
L = i p s s m $ p i s s i i ($ is the end marker)
occ(s, 4) = 2
70
Make a bit vector for each character
L = i p s s m $ p i s s i i
s:  0 0 1 1 0 0 0 0 1 1 0 0
occ(s, 4) = rank(4)
rank(i) = how many ones are there in the first i positions?
71
How do you answer rank queries?
0 0 1 1 0 0 0 0 1 1 0 0
rank(i) = how many ones are there in the first i positions?
We can prepare a vector with all answers
0 0 1 2 2 2 2 2 3 4 4 4
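The vector of all answers is just the running sum of the bits, as the short sketch below shows. The names `prefix` and `rank` are chosen for this example, and positions are counted from 1 as in the example above.

```python
from itertools import accumulate

bits = [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0]  # bit vector of the character s
prefix = list(accumulate(bits))               # one stored answer per position
print(prefix)  # [0, 0, 1, 2, 2, 2, 2, 2, 3, 4, 4, 4]


def rank(i: int) -> int:
    """Number of ones among the first i bits, answered by one table lookup."""
    return prefix[i - 1] if i > 0 else 0


print(rank(4))  # 2
```

Each stored answer needs about log n bits, so this table takes O(n log n) bits per character.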
72
Let's do it with O(n) bits per character
73
[Figure: the bit vector split into blocks of size (log n)/2, with a running count of ones kept per block]
In our solution the bit vector takes Θ(n) bits
and also the additionals take Θ(n) bits
74
Can we do it with smaller overhead so
additionals would take o(n)?
[Figure: the bit vector split into superblocks of size log²(n), each split into blocks, with counts of ones]
Superblocks of size log²(n)
Each block keeps the number of ones in previous
blocks that are in the same superblock
75
Analysis
The superblock table is of size n/log(n)
The block table is of size (loglog(n)) n/log(n)
The tables for the blocks: √n log(n) loglog(n)
So the additionals take o(n) space
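The two-level layout can be sketched as below. The class name `RankDirectory` and its parameters are assumptions for this example. For brevity the count inside a block is obtained by scanning the block; the slides use a precomputed lookup table for that step instead.

```python
import math


class RankDirectory:
    """Two-level rank directory (superblocks and blocks) over a list of 0/1 bits."""

    def __init__(self, bits, block_size=None, blocks_per_superblock=None):
        n = len(bits)
        log_n = max(2, math.ceil(math.log2(max(n, 2))))
        # Defaults follow the slides: blocks of (log n)/2 bits, superblocks of about log^2(n) bits.
        self.block = block_size or log_n // 2
        self.superblock = self.block * (blocks_per_superblock or 2 * log_n)
        self.bits = bits
        self.super_counts = []  # ones before each superblock
        self.block_counts = []  # ones before each block, counted from the start of its superblock
        total = in_super = 0
        for i, bit in enumerate(bits):
            if i % self.superblock == 0:
                self.super_counts.append(total)
                in_super = 0
            if i % self.block == 0:
                self.block_counts.append(in_super)
            total += bit
            in_super += bit

    def rank(self, i):
        """Number of ones among the first i bits."""
        if i == 0:
            return 0
        p = i - 1  # last position that is counted
        start = (p // self.block) * self.block
        return (self.super_counts[p // self.superblock]
                + self.block_counts[p // self.block]
                + sum(self.bits[start:p + 1]))  # a lookup table would replace this scan


bits = [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0]
directory = RankDirectory(bits, block_size=2, blocks_per_superblock=3)
print([directory.rank(i) for i in range(1, 13)])  # [0, 0, 1, 2, 2, 2, 2, 2, 3, 4, 4, 4]
```

The counts stored per block stay small because they restart at every superblock, which is what keeps the block table within o(n) bits.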
76
Next step
Do it without keeping the bit vectors themselves
Instead keep only the compressed version of the
text
Saves a lot of space for compressible strings