Title: Compressed suffix arrays and suffix trees with applications to text indexing and string matching
1 Compressed suffix arrays and suffix trees with applications to text indexing and string matching
3 Agenda
- A (very) short review of suffix arrays
- Introduction
- Problem definition
- Information-theoretic reasoning
- Simple solution, round 2
- Compressed suffix arrays in $\frac{1}{2} n \log\log n + O(n)$ bits and $O(\log\log n)$ access time
- Rank and select problem definitions
- Rank data structure
- Compressed suffix arrays in $\epsilon^{-1} n + O(n)$ bits and $O(\log^{\epsilon} n)$ access time
- Select data structure (if time permits)
4 Short review of suffix arrays
- A suffix array is the sorted array of the suffixes of a string S, represented by an array of pointers to the suffixes of S
- For example, the string telaviv and its corresponding suffix array:
telaviv S0
elaviv S1
laviv S2
aviv S3
viv S4
iv S5
v S6
Suffix array (read left to right): 3 1 5 2 0 6 4
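The definition above can be tried out with a minimal sketch. It sorts the suffix start positions naively (quadratic space and worse time), which is fine for a short example but is not how suffix arrays are built in practice; the function name `suffix_array` is ours.

```python
def suffix_array(s):
    """Return the start positions of the suffixes of s in lexicographic order."""
    # Naive construction: sort the start positions by the suffix they point to.
    return sorted(range(len(s)), key=lambda i: s[i:])


text = "telaviv"
sa = suffix_array(text)
for pos in sa:
    # Each line shows a pointer and the suffix it points to, in sorted order.
    print(pos, text[pos:])
```

Only the pointers are stored; the suffixes themselves are read from the text when needed.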
5 Introduction
- Part of the succinct data structures branch
- DNA genome strings (small alphabet, large strings)
- Mainly a theoretical article
6 Problem Definition
- The algorithm is composed of two phases
- compression
- lookup
- Compress
- Given a suffix array $SA$, compress it to get its succinct representation
- lookup(i)
- Given the compressed representation, return $SA[i]$
7 Some Definitions
- We will deal (at first) with a binary alphabet
- $\Sigma = \{a, b\}$
- We will add a special end-of-string symbol #
- And will set the order between the characters to be
- $a < \# < b$
- Basic RAM model
- $\log n$ word size
- Word lookup and arithmetic in constant time
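8 Information theory reasoning
All 16 binary strings of length 4 (each followed by the end-of-string symbol) and their suffix arrays:
Text  Suffix array    Text  Suffix array
aaaa  1 2 3 4 5       baaa  2 3 4 5 1
aaab  1 2 3 5 4       baab  2 3 5 1 4
aaba  1 4 2 5 3       baba  4 2 5 3 1
aabb  1 2 5 4 3       babb  2 5 1 4 3
abaa  3 4 1 5 2       bbaa  3 4 5 2 1
abab  1 3 5 2 4       bbab  3 5 2 4 1
abba  4 1 5 3 2       bbba  4 5 3 2 1
abbb  1 5 4 3 2       bbbb  5 4 3 2 1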
9 Information theory reasoning (2)
- Suffix array size: $n \log n$ bits
- One-to-one correspondence between the suffix array and the string
- Construction details
- Number of possible suffix arrays: $2^{n-1}$ (counting the end-of-string symbol in $n$)
- Perfect compression: $n$ bits (the string itself)
- The cost of a lookup is then $O(n)$, see the previous lecture
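The counting argument can be checked by brute force for the small case shown on the previous slide. The sketch below assumes the character order a < # < b and 1-based positions; the names `ORDER`, `arrays` and `distinct` are ours.

```python
from itertools import product

ORDER = {"a": 0, "#": 1, "b": 2}   # assumed character order a < # < b


def suffix_array(text):
    """1-based suffix array of text under the order a < # < b."""
    key = lambda i: [ORDER[c] for c in text[i:]]
    return tuple(i + 1 for i in sorted(range(len(text)), key=key))


n = 5   # four binary characters plus the end-of-string symbol
arrays = {"".join(p): suffix_array("".join(p) + "#")
          for p in product("ab", repeat=n - 1)}
distinct = set(arrays.values())

# Every string gets its own suffix array, so about n bits are enough to name one.
print(len(arrays), len(distinct))
```

Since the map from strings to suffix arrays is one-to-one, only $2^{n-1}$ of the $n!$ permutations can occur, which is why $n \log n$ bits are far more than an information-theoretic encoding needs.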
10 Simple solution round 2: a different approach
- Let's pack together each $\log n$ bits to create a new alphabet.
- So the text length will be $n / \log n$ and the pattern length will be $m / \log n$
- The suffix array will take $O(n)$ bits
- Searching becomes hard (alignment)
- the text is aligned but the pattern isn't
- $\log n$ cases
11 Simple solution round 2
- The occurrence isn't aligned: the pattern occurs $k$ bits to the right of a word boundary
- Need to pad the pattern with $k$ bits and check it
- So we need to check $2^k$ cases
- $k < \log n \Rightarrow$ up to $n$ different cases to check
- Assuming we know how much to pad!!
12 General framework
- Abstract data type optimization [Jacobson '89]
- $C(n)$ distinct data structures $\Rightarrow$ each data structure occupies $O(\log C(n))$ bits.
- Doesn't guarantee the time complexity of the supported operations
13 Compressed suffix arrays in $\frac{1}{2} n \log\log n + O(n)$ bits and $O(\log\log n)$ access time
- Recursive in nature
- Takes advantage of the structure of the suffixes
- Let $SA_0$ be the uncompressed suffix array
- And $N_0 = n$ be its size (assume a power of 2)
- In phase $k$ of the compression we start with
- $SA_k$ with the size $N_k = n / 2^k$
- and create $SA_{k+1}$ with the size $N_{k+1} = n / 2^{k+1}$
- $SA_{k+1}$ holds a permutation of $1..N_{k+1}$
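14 $SA_{k+1}$ Construction
- Create the $B_k$ bit vector
- $B_k[i] = 1$ iff $SA_k[i]$ is even
- Create the rank vector
- $rank_k(j)$ counts the number of one bits in the first $j$ bits of $B_k$
- Create the $\Psi_k(i)$ vector
- It stores the 0-to-1 companion relation: if $SA_k[i]$ is odd, $\Psi_k(i)$ is the position $j$ with $SA_k[j] = SA_k[i] + 1$; otherwise $\Psi_k(i) = i$
- Store the even values from $SA_k$, divided by 2, in $SA_{k+1}$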
15 An Example
- The 32-character string $T$ (31 characters followed by the end-of-string symbol #)
- abbabbabbabbabaaabababbabbbabba#
16 An Example
Position  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Text      a  b  b  a  b  b  a  b  b  a  b  b  a  b  a  a
SA_0     15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B_0       0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1
rank_0    0  1  1  1  1  1  2  3  3  4  4  4  5  6  7  8
Psi_0     2  2 14 15 18 23  7  8 28 10 30 31 13 14 15 16
17 Example
Position 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
Text      a  b  a  b  a  b  b  a  b  b  b  a  b  b  a  #
SA_0     12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B_0       1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
rank_0    9 10 10 10 11 11 12 12 12 12 13 14 14 15 16 16
Psi_0    17 18  7  8 21 10 23 13 16 17 27 28 21 30 31 27
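The construction phase and the example can be reproduced with a short sketch. It assumes the character order a < # < b, 1-based positions counted from left to right, and plain Python lists in place of the succinct representations; the helper names `suffix_array` and `phase` are ours.

```python
ORDER = {"a": 0, "#": 1, "b": 2}            # assumed character order a < # < b
T = "abbabbabbabbabaaabababbabbbabba#"


def suffix_array(text):
    """1-based suffix array under the order a < # < b (naive construction)."""
    key = lambda i: [ORDER[c] for c in text[i:]]
    return [i + 1 for i in sorted(range(len(text)), key=key)]


def phase(sa):
    """One compression phase: return (B, rank, psi, next_sa) as plain lists.

    List index 0 corresponds to position 1 of the example tables.
    """
    where = {v: i + 1 for i, v in enumerate(sa)}        # value -> 1-based position
    B = [1 if v % 2 == 0 else 0 for v in sa]
    rank, ones = [], 0
    for b in B:                                         # ones among the first j bits
        ones += b
        rank.append(ones)
    # Even entries map to themselves, odd entries to the position of value + 1.
    psi = [i + 1 if v % 2 == 0 else where[v + 1] for i, v in enumerate(sa)]
    next_sa = [v // 2 for v in sa if v % 2 == 0]        # even values, halved
    return B, rank, psi, next_sa


sa0 = suffix_array(T)
B0, rank0, psi0, sa1 = phase(sa0)
print(sa0[:16])
print(psi0[:16])
```

Calling `phase(sa1)` and then `phase` on its result yields the next two levels of the example.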
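18 How to compute $SA_k$ from $SA_{k+1}$
- Lemma 1
- Given suffix array $SA_k$, let $B_k$, $rank_k$, $\Psi_k$ and $SA_{k+1}$ be the result of the transformation performed by phase $k$. We can reconstruct $SA_k$ from $SA_{k+1}$ by the following formula:
- $SA_k[i] = 2 \cdot SA_{k+1}[rank_k(\Psi_k(i))] + (B_k[i] - 1)$
- Let's split into 2 cases
- $B_k[i] = 1$: $SA_k[i]$ is even, so $\Psi_k(i) = i$ and the formula returns $2 \cdot SA_{k+1}[rank_k(i)]$
- $B_k[i] = 0$: $SA_k[i]$ is odd, so the formula returns the even companion $SA_k[\Psi_k(i)]$ minus 1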
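Lemma 1 can be checked on one small level. The sketch below hard-codes an 8-entry array taken from the running example, builds the phase output naively (0-based list indices, a linear-time `rank`), and rebuilds every entry with the formula; it is an illustration, not the succinct implementation.

```python
sa2 = [4, 7, 1, 6, 8, 3, 5, 2]            # an 8-entry level of the running example

# Phase output, built naively here; the real structure stores it succinctly.
pos = {v: i for i, v in enumerate(sa2)}                 # value -> 0-based index
B = [1 - v % 2 for v in sa2]                            # 1 iff the value is even
psi = [i if B[i] else pos[v + 1] for i, v in enumerate(sa2)]
sa3 = [v // 2 for v in sa2 if v % 2 == 0]               # even values, halved


def rank(j):
    """Number of 1 bits in B[0..j], inclusive, for a 0-based index j."""
    return sum(B[: j + 1])


def reconstruct(i):
    """Lemma 1 for a 0-based i: 2 * SA_{k+1}[rank(psi(i))] + (B[i] - 1)."""
    r = rank(psi[i])                      # 1-based rank of the even companion
    return 2 * sa3[r - 1] + (B[i] - 1)


rebuilt = [reconstruct(i) for i in range(len(sa2))]
print(rebuilt)
```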
19 Example continued
Position  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_1      8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B_1       1  1  0  1  1  1  0  0  1  0  0  1  0  1  0  0
rank_1    1  2  2  3  4  5  5  5  6  6  6  7  7  8  8  8
Psi_1     1  2  9  4  5  6  1  6  9 12 14 12  2 14  4  5

Position  1  2  3  4  5  6  7  8
SA_2      4  7  1  6  8  3  5  2
B_2       1  0  0  1  1  0  0  1
rank_2    1  1  1  2  3  3  3  4
Psi_2     1  5  8  4  5  1  4  8

Position  1  2  3  4
SA_3      2  3  4  1
20 Compress
- We keep $l = O(\log\log n)$ levels
- All levels but the $SA_l$ level are saved implicitly
- For each of the levels $0..l-1$ we save $B_j$, $rank_j$, $\Psi_j$
- $rank_j$ and $\Psi_j$ are stored implicitly
- The size of $SA_l$ is $N_l = n / 2^l$ entries
21 lookup
- Just compute $SA_k[i]$ recursively from $SA_{k+1}$
- Recursion depth: $\log\log n$
- All the data structures used have $O(1)$ access time
- $O(\log\log n)$ lookup cost
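The recursive lookup can be sketched end to end on the 32-entry example. The assumptions here are ours: 0-based list indices, three levels, and plain lists standing in for the bit vector, rank directory and $\Psi$ representation (so the space savings are not reproduced, only the recursion).

```python
SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]


def compress(sa, levels):
    """Return ([(B_k, rank_k, psi_k), ...], SA_l); only the last level keeps values."""
    stored = []
    for _ in range(levels):
        pos = {v: i for i, v in enumerate(sa)}
        B = [1 - v % 2 for v in sa]
        psi = [i if b else pos[v + 1] for i, (v, b) in enumerate(zip(sa, B))]
        rank, ones = [], 0
        for b in B:                       # inclusive prefix counts of 1 bits
            ones += b
            rank.append(ones)
        stored.append((B, rank, psi))
        sa = [v // 2 for v in sa if v % 2 == 0]
    return stored, sa


def lookup(stored, last, i, k=0):
    """Return SA_k[i] for a 0-based i, recursing down to the explicit last level."""
    if k == len(stored):
        return last[i]
    B, rank, psi = stored[k]
    j = rank[psi[i]] - 1                  # 0-based index in the next level
    return 2 * lookup(stored, last, j, k + 1) + (B[i] - 1)


stored, sa3 = compress(SA0, levels=3)
print([lookup(stored, sa3, i) for i in range(8)])
```

Each call does a constant amount of work per level, so the cost is proportional to the number of levels kept.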
22 How The Data Is Stored
- The $B_k$ bit vector is stored explicitly
- $O(N_k)$ space
- $O(1)$ lookup
- $O(N_k)$ preprocessing time
- The $rank_k$ vector is stored implicitly using Jacobson's rank data structure
- $O(N_k (\log\log N_k) / \log N_k)$ space
- $O(1)$ lookup
- $O(N_k)$ preprocessing time
- The $\Psi_k$ vector is stored implicitly (using rank and select)
23 $\Psi_k$ vector representation
24 Let's take a look
27 So what can we do with all the lists
- Concatenate them together in lexicographical order and form the $L_k$ list
- $L_1 = 9, 1, 6, 12, 14, 2, 4, 5$
- Let's see how we can compute $\Psi_k(i)$
- If $B_k[i] = 1$ ($SA_k[i]$ is even), it's simply $i$
- Otherwise,
- because all the prefix patterns saved are in sorted order,
- the $L_k$ list holds, up to the point $i$, entries for all the odd suffixes up to $i$: $h = i - rank_k(i)$
- So we can look up the $h$-th entry in $L_k$
- And it will give us the answer
28 Simple example
- $L_2 = 5, 8, 1, 4$
- $rank_2 = 1, 1, 1, 2, 3, 3, 3, 4$
- $B_2 = 1, 0, 0, 1, 1, 0, 0, 1$
- $\Psi_2 = 1, 5, 8, 4, 5, 1, 4, 8$
- $\Psi_2(3) = ?$
- $rank_2(3) = 1$, $h = 3 - 1 = 2$, $L_2[2] = 8$
- $\Psi_2(3) = 8$
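The example can be replayed in a few lines. The sketch keeps only the bit vector and the list of companions of the odd entries, and answers every query from those two; positions are 1-based as on the slide, and the explicit `PSI2` list is used only to build `L2` and to compare results.

```python
B2 = [1, 0, 0, 1, 1, 0, 0, 1]
PSI2 = [1, 5, 8, 4, 5, 1, 4, 8]      # explicit values, only for building and checking

# L_2: the psi values of the odd entries (B = 0), in position order.
L2 = [p for p, b in zip(PSI2, B2) if b == 0]


def rank(i):
    """Number of 1 bits among the first i bits of B2 (1-based i)."""
    return sum(B2[:i])


def psi(i):
    """Psi_2(i) for a 1-based position i, using only B2, rank and L2."""
    if B2[i - 1] == 1:
        return i                      # even entry: maps to itself
    h = i - rank(i)                   # how many odd entries occur up to position i
    return L2[h - 1]


print(L2)
print([psi(i) for i in range(1, 9)])
```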
29 Rank and select
- Given a bit vector of length $n$
- $rank(i)$ is the number of 1 bits up to position $i$
- $select(i)$ returns the index of the $i$-th 1
30 $\Psi_k$ vector representation
- Lemma 2
- Given $s$ integers in sorted order,
- each containing $w$ bits, where $s < 2^w$,
- we can store them with at most
- $s(2 + w - \lfloor \log s \rfloor) + O(s / \log\log s)$ bits
- so that retrieving the $h$-th integer takes constant time
31 $\Psi_k$ vector representation
- Take the first $z = \lfloor \log s \rfloor$ bits of each integer, creating the integers $q_1..q_s$
- It's easy to see that $q_1 \le q_i \le q_{i+1} < s$ (we take the most significant bits, after all)
- The remaining $w - z$ bits of each integer will be $r_i$
[Figure: the bits of an integer $s_i$ split into the high-order part $q_i$ and the low-order part $r_i$]
32 $\Psi_k$ vector representation
- Store the $r_i$ in a simple array: $(w - z)s$ bits
- Store $q_1..q_s$ in a table $Q$ supporting select and rank in constant time.
- The table $Q$ is implemented in the following way
- Instead of saving the numbers themselves,
- we store the gaps $q_1, q_2 - q_1, q_3 - q_2, \ldots, q_s - q_{s-1}$
- in unary representation (a gap $i$ is written as $0^i 1$)
- And add a select data structure.
33 $\Psi_k$ vector representation
- In order to get $q_i$ we simply do $select(i)$,
- and count the number of zeros before the $i$-th 1
- $q_i = select(i) - rank(select(i)) = select(i) - i$
34 $\Psi_k$ vector representation
- The $Q$ table size is:
- the unary string takes $s + 2^z \le 2s$ bits, and the select overhead is $O(s / \log\log s)$
- So we can output $s_i$ easily
- $s_i = q_i \cdot 2^{w-z} + r_i$
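The scheme of Lemma 2 can be sketched as a small class. The assumptions are ours: the class name `SortedIntStore`, 1-based retrieval, and a precomputed list of 1-bit positions standing in for the constant-time select structure (so the $O(s / \log\log s)$ overhead is not modelled).

```python
import math


class SortedIntStore:
    """Sketch of Lemma 2: s sorted w-bit integers split into high and low parts."""

    def __init__(self, values, w):
        s = len(values)
        self.z = int(math.floor(math.log2(s)))          # number of high-order bits
        self.low_bits = w - self.z
        mask = (1 << self.low_bits) - 1
        self.r = [v & mask for v in values]             # low parts, stored as they are
        q = [v >> self.low_bits for v in values]        # high parts, non-decreasing
        # Unary code of the gaps q_1, q_2 - q_1, ...: gap zeros followed by a 1.
        self.unary, prev = [], 0
        for qi in q:
            self.unary += [0] * (qi - prev) + [1]
            prev = qi
        # Stand-in for the select structure: 1-based positions of the 1 bits.
        self.ones = [p for p, bit in enumerate(self.unary, start=1) if bit]

    def get(self, i):
        """Return the i-th stored integer (1-based)."""
        qi = self.ones[i - 1] - i         # zeros before the i-th 1: select(i) - i
        return (qi << self.low_bits) + self.r[i - 1]


values = [3, 9, 10, 17, 40, 41, 58, 63]   # sorted, w = 6 bits, s = 8 < 2**6
store = SortedIntStore(values, w=6)
print([store.get(i) for i in range(1, len(values) + 1)])
print(len(store.unary), "unary bits for", len(values), "integers")
```

The unary string has one 1 per integer and at most $2^z - 1$ zeros, which is where the $2s$ term of the bound comes from.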
37 $\Psi_k$ vector representation
- Lemma 3
- We can store the concatenated list $L_k$ used for $\Psi_k$ in $n\left(\frac{1}{2} + \frac{3}{2^{k+1}}\right) + O\left(\frac{n}{2^k \log\log n}\right)$ bits, so that accessing the $h$-th element takes constant time, with preprocessing time $O(n / 2^k + 2^{2^k})$
- There are $2^{2^k}$ lists; number them (even the empty ones)
- Each integer $x$ in the lists, $1 \le x \le N_k$, is transformed into a new integer $x'$ by prepending the binary representation of its list number
- The bit size of $x'$ is $2^k + \log N_k$
- After concatenating all the lists, we have $N_k / 2$ sorted numbers of $2^k + \log N_k$ bits each
- Using Lemma 2 we get
- $O(1)$ access time
- And a space bound of $n\left(\frac{1}{2} + \frac{3}{2^{k+1}}\right) + O\left(\frac{n}{2^k \log\log n}\right)$ bits
38 Sum it up (space complexity)
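A sketch of how the bounds add up, under assumptions filled in from the surrounding slides: $N_k = n / 2^k$ is a power of 2, $l = \lceil \log\log n \rceil$, and $SA_l$ is stored explicitly with $\log n$ bits per entry. Plugging $s = N_k / 2$ and $w = 2^k + \log N_k$ into Lemma 2 gives the Lemma 3 bound for one level:

$$
s\,(2 + w - \lfloor \log s \rfloor) = \frac{n}{2^{k+1}}\left(2 + 2^k + \log N_k - (\log N_k - 1)\right) = n\left(\frac{1}{2} + \frac{3}{2^{k+1}}\right)
$$

Summing over the levels $0..l-1$ gives the leading term:

$$
\sum_{k=0}^{l-1} n\left(\frac{1}{2} + \frac{3}{2^{k+1}}\right) = \frac{1}{2}\, n\, l + 3n\left(1 - 2^{-l}\right) < \frac{1}{2}\, n\, l + 3n
$$

The bit vectors $B_k$ add $\sum_k N_k < 2n$ bits, the rank directories and the $O(\cdot)$ terms are of lower order, and $SA_l$ takes at most $N_l \log n \le n$ bits, so the total comes to $\frac{1}{2} n \log\log n + O(n)$ bits.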
39 Rank data structure
- Due to Jacobson
- Given a bit vector of length $n$, $rank(i)$ is the number of 1 bits up to position $i$
- Multilevel approach
- We slice the bit string into chunks of $\log^2 n$ bits.
- Between each pair of chunks we keep a rank counter
- Each chunk is divided into sub-chunks of $\frac{1}{2} \log n$ bits,
- And a counter is kept between each pair of sub-chunks
- At the bottom level a simple lookup table is used.
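40 Rank
[Figure: a rank query over chunks of $\log^2 n$ bits (counters 7 and 14), sub-chunks of $\frac{1}{2} \log n$ bits (counter 3) and the lookup table (pattern 101). The output: 14 + 3 + 1]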
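The multilevel idea can be sketched as follows. This is an illustration in the spirit of Jacobson's structure rather than a faithful implementation: the class name `TwoLevelRank` is ours, the chunk length is rounded to a multiple of the sub-chunk length, and Python lists replace the packed counters.

```python
import math


class TwoLevelRank:
    """Sketch of a two-level rank directory over a list of 0/1 values."""

    def __init__(self, bits):
        n = max(len(bits), 4)
        log_n = math.ceil(math.log2(n))
        self.sub = max(1, log_n // 2)            # sub-chunk length, about (1/2) log n
        self.chunk = self.sub * 2 * log_n        # chunk length, about log^2 n
        self.chunk_count = []   # ones before each chunk
        self.sub_count = []     # ones between the chunk start and each sub-chunk
        self.words = []         # each sub-chunk packed into an integer
        total = in_chunk = 0
        for start in range(0, len(bits), self.sub):
            if start % self.chunk == 0:
                self.chunk_count.append(total)
                in_chunk = 0
            self.sub_count.append(in_chunk)
            piece = list(bits[start:start + self.sub])
            piece += [0] * (self.sub - len(piece))          # pad the last sub-chunk
            self.words.append(int("".join(map(str, piece)), 2))
            total += sum(piece)
            in_chunk += sum(piece)
        # table[p][j]: ones among the first j bits of the sub-chunk pattern p
        self.table = [[bin(p >> (self.sub - j)).count("1")
                       for j in range(self.sub + 1)]
                      for p in range(1 << self.sub)]

    def rank(self, i):
        """Number of 1 bits among the first i bits (0 <= i <= n)."""
        if i == 0:
            return 0
        k, offset = (i - 1) // self.sub, (i - 1) % self.sub + 1
        return (self.chunk_count[(i - 1) // self.chunk]
                + self.sub_count[k] + self.table[self.words[k]][offset])


B0 = [0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1]
ds = TwoLevelRank(B0)
ranks = [ds.rank(i) for i in range(1, len(B0) + 1)]
print(ranks)
```

A query adds three numbers: the chunk counter, the sub-chunk counter and one table entry, so it takes constant time regardless of the vector length.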
41 Rank Analysis
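A sketch of the space accounting for the rank structure, assuming chunks of $\log^2 n$ bits, sub-chunks of $\frac{1}{2} \log n$ bits, chunk counters of $\log n$ bits and sub-chunk counters that only count within their chunk (so $\log(\log^2 n) = 2 \log\log n$ bits each):

$$
\underbrace{\frac{n}{\log^2 n} \cdot \log n}_{\text{chunk counters}}
+ \underbrace{\frac{2n}{\log n} \cdot 2 \log\log n}_{\text{sub-chunk counters}}
+ \underbrace{2^{\frac{1}{2} \log n} \cdot \frac{\log n}{2} \cdot \log\log n}_{\text{lookup table}}
= O\!\left(\frac{n \log\log n}{\log n}\right)
$$

The lookup table has $\sqrt{n}$ patterns, so its $O(\sqrt{n} \log n \log\log n)$ bits are dominated by the sub-chunk counters, and the whole directory is $o(n)$ bits on top of the bit vector itself.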
42 Compressed suffix arrays in $\epsilon^{-1} n + O(n)$ bits and $O(\log^{\epsilon} n)$ access time
- In order to break the space barrier we need to save fewer levels $\Rightarrow$ longer lookups
- Let's save 3 compressed levels only: $SA_0$, $SA_{l'}$ and $SA_l$, with $l = \lceil \log\log n \rceil$ and $l' = \lceil \frac{1}{2} \log\log n \rceil$
- using a dictionary data structure, which can say if an element is a member of the dictionary, and supports a rank query, in $O(1)$ time for both queries
- The space complexity of the dictionary is
- We keep in 2 dictionaries, $D_0$ and $D_{l'}$, which items we have in the next level (from $SA_0 \to SA_{l'}$ and from $SA_{l'} \to SA_l$)
43 The $\Psi'_k$ function
- We define the $\Psi'_k$ function, which maps each 1 to its companion 0
- Let's define the $\Phi_k$ function to be
- We just need to merge the indexes in $L_k$ and $L'_k$
44 Example
45 The $\Phi_k$ function implementation
- Lemma 4: We can store the concatenated list used for $\Phi_k$
- $k = 0$: in $n + O(n / \log\log n)$ bits
- $k > 0$: in $n\left(1 + \frac{1}{2^{k-1}}\right) + O\left(\frac{n}{2^k \log\log n}\right)$ bits, with preprocessing time $O(n / 2^k + 2^{2^k})$
- If $k > 0$, simply use Lemma 3
- $k = 0$:
- Encode a as 0 and b as 1.
- Create an $n$-bit vector, named $L$
- $L[f] = 0$ iff the list for $\Phi_0$ is that of a or # at position $f$
- We add a select and a select0 data structure on top of it: $O(n / \log\log n)$ bits
- Also we keep the number of 0s in $L$ as $c_0$
- A query $\Phi_k(j)$ for $k = 0$ is done in the following way:
- If $j = c_0$, return $select_0(c_0)$
- If $j < c_0$, return $select_0(j)$
- If $j > c_0$, return $select(j - c_0)$
46 The Lookup algorithm
- To find $SA[i]$, we start walking the $\Phi_k$ function: $i, i', i'', i''', \ldots$
- $SA_0[i'] = SA_0[i] + 1$
- Until reaching an entry found in the dictionary $D_0$,
- Let $s$ be the walk length
- And $r$ the entry's rank in the dictionary (how many items have already passed to the next level?)
- Using $r$ we start walking the next level
- Let $s'$ be the walk length
- And $r'$ the entry's rank in the dictionary
- We return the following result
- The walk length is $\max(s, s') < 2^{l'} = O(\sqrt{\log n})$
- So the query time is $O(\sqrt{\log n})$
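One hop of this walk can be sketched on the 32-entry example array. The simplifications are ours: a single sampled level instead of two, a hypothetical sampling step `STEP = 4` in the role of $2^{l'}$, a Python dict as the membership-and-rank dictionary, and the successor function stored as an explicit list (the real structure stores it succinctly and does not keep `SA0`).

```python
SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]
n = len(SA0)
STEP = 4   # hypothetical sampling step, playing the role of 2**l'

index_of = {v: i for i, v in enumerate(SA0)}
# phi[i] = i' with SA0[i'] = SA0[i] + 1 (the last suffix maps to itself here).
phi = [index_of[v + 1] if v < n else i for i, v in enumerate(SA0)]

# Dictionary D_0: indices whose value is a multiple of STEP, with their ranks.
members = [i for i, v in enumerate(SA0) if v % STEP == 0]
rank_in_d0 = {i: r for r, i in enumerate(members)}
next_level = [SA0[i] // STEP for i in members]      # values kept for the next level


def lookup(i):
    """Return (SA0[i], walk length) for a 0-based i, reading only sampled values."""
    s = 0
    while i not in rank_in_d0:        # each step increases the suffix value by 1
        i = phi[i]
        s += 1
    return STEP * next_level[rank_in_d0[i]] - s, s


print([lookup(i)[0] for i in range(8)])
print(max(lookup(i)[1] for i in range(n)))
```

Every walk ends after fewer than `STEP` steps because one of any `STEP` consecutive suffix values is a multiple of `STEP`; that is the trade-off between fewer stored levels and longer lookups.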
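47 The General multilevel Build
- For every $0 < \epsilon < 1$,
- Assume $\epsilon l$ is an integer, so $2^{\epsilon l} < 2 \log^{\epsilon} n$
- Create all the levels $0, \epsilon l, 2\epsilon l, \ldots, l$
- The number of levels is $\epsilon^{-1} + 1$ $\Rightarrow$ lookup of $O(\log^{\epsilon} n)$
48 The General multilevel Build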
49 Select data structure
- $select(i)$ returns the position of the $i$-th 1 bit in the string
- Same idea as rank, a bit more complicated
- Multilevel approach
- At the first level we record the position of every $(\log n \log\log n)$-th 1 bit,
- Total space $O(n / \log\log n)$
- Between each two recorded bits, we keep the following data,
- If the distance between them is $r > (\log n \log\log n)^2$
- we keep the absolute positions of all the 1 bits between them
- $\log^2 n \log\log n$ bits
- Otherwise we keep the relative position of each $(\log r \log\log n)$-th 1 bit
- Total space: $\log^2 n \log\log n < r / \log\log n$ per range, and $\sum r < n$ !!!
- Then we keep one more level (the same notions)
- Block size comes down to $(\lg n)^4$
50 Select data structure
- After that, we keep a lookup table
- For every pattern of $(\log n)/d$ bits (with $d > 2$) we save
- the number of 1 bits,
- the location of the $i$-th 1 bit in the pattern
- Same as before, the space is $O(n^{1/d} \log n \log\log n)$
- The lookup is then very simple: just walk the levels,
- get a block and query it using the lookup table.
- Space complexity: $O(n / \log\log n)$
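The first level of the scheme can be sketched as follows. This is a deliberate simplification, not the structure described above: it samples every `t`-th 1 bit (the parameter `t` and the class name `SampledSelect` are ours) and then scans linearly from the sample, where the full structure would use the further levels and the lookup table to stay in constant time.

```python
class SampledSelect:
    """Simplified select: sample every t-th 1 bit, then scan from the sample."""

    def __init__(self, bits, t=4):
        self.bits = bits
        self.t = t
        self.samples = []     # 0-based positions of the 1st, (t+1)-th, (2t+1)-th ... 1 bit
        ones = 0
        for pos, bit in enumerate(bits):
            if bit:
                if ones % t == 0:
                    self.samples.append(pos)
                ones += 1

    def select(self, i):
        """0-based position of the i-th 1 bit (i >= 1, and at most the number of 1s)."""
        pos = self.samples[(i - 1) // self.t]
        remaining = (i - 1) % self.t          # 1 bits still to pass after the sample
        while remaining:
            pos += 1
            remaining -= self.bits[pos]
        return pos


bits = [1 if (i * 5 + i // 4) % 3 == 0 else 0 for i in range(100)]
sel = SampledSelect(bits, t=4)
print([sel.select(i) for i in range(1, 6)])
```

Larger values of `t` store fewer samples but scan further, which mirrors the space and time trade-off that the extra levels of the full structure are designed to remove.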