Compressed suffix arrays and suffix trees with applications to text indexing and string matching

Slides: 50
Provided by: itay
1
Compressed suffix arrays and suffix trees with
applications to text indexing and string matching
3
Agenda
  • A (very) short review of suffix arrays
  • Introduction
  • Problem definition
  • Information-theoretic reasoning
  • Simple solution, round 2
  • Compressed suffix arrays in ½ n log log n + O(n)
    bits and O(log log n) access time
  • Rank and select: problem definitions
  • Rank data structure
  • Compressed suffix arrays in ε^-1 n + O(n) bits and
    O(log^ε n) access time
  • Select data structure (if time permits)

4
Short review on suffix arrays
  • A suffix array is the sorted array of the suffixes of
    a string S, represented by an array of pointers to
    the suffixes of S
  • For example, the string "telaviv" and its
    corresponding suffix array

Suffix    Start
telaviv   S0
elaviv    S1
laviv     S2
aviv      S3
viv       S4
iv        S5
v         S6
Suffix array (smallest suffix first): 3 1 5 2 0 6 4
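
The definition above can be checked with a few lines of Python. This is a minimal sketch that sorts the suffixes directly (quadratic work, fine for a slide-sized example); the function name suffix_array is only illustrative.

```python
def suffix_array(text):
    """Return the 0-based start positions of the suffixes of text, in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

print(suffix_array("telaviv"))  # [3, 1, 5, 2, 0, 6, 4]
```

The output lists aviv, elaviv, iv, laviv, telaviv, v, viv by their start positions.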
5
Introduction
  • Part of the succinct data structures branch
  • DNA genome strings (small alphabet, long
    strings)
  • Mainly a theoretical article

6
Problem Definition
  • The algorithm is composed of two phases:
  • compression
  • lookup
  • Compress
  • Given a suffix array SA, compress it to obtain its
    succinct representation
  • lookup(i)
  • Given the compressed representation, return SA[i]

7
Some Definitions
  • We will deal (at first) with a binary alphabet
  • Σ = {a, b}
  • We will add a special end-of-string symbol #
  • And will set the order between the characters
    to be
  • a < # < b
  • Basic RAM model
  • Word size of log n bits
  • Word lookup and arithmetic in constant time

8
Information theory reasoning
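All 16 binary strings of length 4 (with # appended, a < # < b) and their suffix arrays:

String  Suffix array    String  Suffix array
abbb    1 5 4 3 2       bbbb    5 4 3 2 1
abba    4 1 5 3 2       bbba    4 5 3 2 1
abab    1 3 5 2 4       bbab    3 5 2 4 1
abaa    3 4 1 5 2       bbaa    3 4 5 2 1
aabb    1 2 5 4 3       babb    2 5 1 4 3
aaba    1 4 2 5 3       baba    4 2 5 3 1
aaab    1 2 3 5 4       baab    2 3 5 1 4
aaaa    1 2 3 4 5       baaa    2 3 4 5 1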
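
The enumeration on this slide can be reproduced with a short script. The sketch below assumes the ordering a < # < b from the Definitions slide and uses 1-based positions; the names ORDER and suffix_array are illustrative only.

```python
from itertools import product

ORDER = {"a": 0, "#": 1, "b": 2}  # assumed character order: a < # < b

def suffix_array(s):
    """1-based suffix array of s + '#', comparing characters with ORDER."""
    t = s + "#"
    key = lambda i: [ORDER[c] for c in t[i:]]
    return [i + 1 for i in sorted(range(len(t)), key=key)]

arrays = {"".join(p): suffix_array("".join(p)) for p in product("ab", repeat=4)}
distinct = {tuple(sa) for sa in arrays.values()}

print(arrays["baba"])  # [4, 2, 5, 3, 1]
print(len(distinct))   # 16
```

All 16 strings give different suffix arrays, which is the one-to-one correspondence the counting argument relies on.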
9
Information theory reasoning (2)
  • Suffix array size: n log n bits
  • One-to-one correspondence between suffix arrays
    and strings
  • Construction details
  • Number of possible suffix arrays: 2^(n-1)
  • Perfect compression: n bits (the string itself)
  • The cost of a lookup is then O(n), see the previous lecture

10
Simple solution round 2: a different approach
  • Let's pack together each log n bits to create a
    new alphabet.
  • So the text length will be n / log n and the pattern
    length will be m / log n
  • The suffix array will take O(n) bits
  • Searching becomes hard (alignment)
  • The text is aligned but the pattern isn't
  • log n cases

11
Simple solution round 2
  • The text isn't aligned: the pattern occurs k bits
    to the right of a word boundary
  • Need to pad the pattern with k bits and check it
  • So we need to check 2^k cases
  • k = log n ⇒ n different cases to check
  • Assuming we know how much to pad!

12
General framework
  • Abstract data type optimization (Jacobson '89)
  • C(n) distinct data structures ⇒ each data
    structure occupies O(log C(n)) bits.
  • Doesn't guarantee the time complexity of the
    supported operations

13
Compressed suffix arrays in ½ n log log n + O(n)
bits and O(log log n) access time
  • Recursive in nature
  • Takes advantage of the structure of the suffixes
  • Let SA_0 be the uncompressed suffix array
  • And N_0 = n be its size (assume a power of 2)
  • In phase k of the compression we start with
  • SA_k, of size N_k = N_0 / 2^k,
  • and create SA_{k+1}, of size N_{k+1} = N_k / 2
  • SA_{k+1} holds a permutation of 1..N_{k+1}

14
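SA_{k+1} construction
  • Create the bit vector B_k
  • B_k[i] = 1 iff SA_k[i] is even
  • Create the rank vector
  • rank_k(j) counts the number of 1 bits in the
    first j bits of B_k
  • Create the Ψ_k(i) vector
  • (stores the 0-to-1 companion relation: if SA_k[i] is odd,
    Ψ_k(i) is the position j with SA_k[j] = SA_k[i] + 1; otherwise Ψ_k(i) = i)
  • Store the even values from SA_k, each divided by 2, in SA_{k+1}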
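
The four construction steps can be sketched as one Python function. This is only an illustration with everything stored explicitly (no succinct encoding); compress_phase is a made-up name, and the input is assumed to be a 1-based permutation of even length. The check uses SA_2 from the running example, read in increasing position order.

```python
def compress_phase(sa_k):
    """One phase: return (B_k, rank_k, Psi_k, SA_{k+1}) for a permutation of 1..N_k."""
    position = {value: i for i, value in enumerate(sa_k, start=1)}
    b = [1 if value % 2 == 0 else 0 for value in sa_k]
    rank, ones = [], 0
    for bit in b:
        ones += bit
        rank.append(ones)  # rank[j-1] = number of 1 bits among the first j bits
    # An odd entry points to the position holding the next (even) value.
    psi = [i if value % 2 == 0 else position[value + 1]
           for i, value in enumerate(sa_k, start=1)]
    sa_next = [value // 2 for value in sa_k if value % 2 == 0]
    return b, rank, psi, sa_next

b2, rank2, psi2, sa3 = compress_phase([4, 7, 1, 6, 8, 3, 5, 2])
print(b2)     # [1, 0, 0, 1, 1, 0, 0, 1]
print(rank2)  # [1, 1, 1, 2, 3, 3, 3, 4]
print(psi2)   # [1, 5, 8, 4, 5, 1, 4, 8]
print(sa3)    # [2, 3, 4, 1]
```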

15
An Example
  • The 32-character string T:
  • abbabbabbabbabaaabababbabbbabba#

16
An Example
Position  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Text      a  b  b  a  b  b  a  b  b  a  b  b  a  b  a  a
SA_0     15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B_0       0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1
rank_0    0  1  1  1  1  1  2  3  3  4  4  4  5  6  7  8
Ψ_0       2  2 14 15 18 23  7  8 28 10 30 31 13 14 15 16
17
Example
Position 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
Text      a  b  a  b  a  b  b  a  b  b  b  a  b  b  a  #
SA_0     12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B_0       1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
rank_0    9 10 10 10 11 11 12 12 12 12 13 14 14 15 16 16
Ψ_0      17 18  7  8 21 10 23 13 16 17 27 28 21 30 31 27
18
How to compute SA_k from SA_{k+1}
  • Lemma 1
  • Given suffix array SA_k, let B_k, rank_k, Ψ_k and SA_{k+1}
  • be the result of the transformation performed by
    phase k. We can reconstruct SA_k from SA_{k+1} by the
    following formula:
  • SA_k[i] = 2 · SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] − 1)
  • Let's split into 2 cases:
  • B_k[i] = 1 (SA_k[i] is even)
  • B_k[i] = 0 (SA_k[i] is odd)
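
Lemma 1 can be tried out on level 2 of the running example. The sketch below hard-codes the level-2 tables and SA_3 in increasing position order (1-based on the slides, 0-based Python lists underneath); the helper name sa2 is illustrative.

```python
SA3 = [2, 3, 4, 1]
B2 = [1, 0, 0, 1, 1, 0, 0, 1]
RANK2 = [1, 1, 1, 2, 3, 3, 3, 4]
PSI2 = [1, 5, 8, 4, 5, 1, 4, 8]

def sa2(i):
    """SA_2[i] for 1-based i, using only the level-2 tables and SA_3."""
    j = PSI2[i - 1]                       # companion position (i itself if even)
    half = SA3[RANK2[j - 1] - 1]          # SA_3[rank_2(Psi_2(i))]
    return 2 * half + (B2[i - 1] - 1)     # subtract 1 when SA_2[i] is odd

print([sa2(i) for i in range(1, 9)])  # [4, 7, 1, 6, 8, 3, 5, 2]
```

In the even case the formula doubles the stored half; in the odd case it doubles the companion's half and subtracts 1.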

19
Example continue
Position  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_1      8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B_1       1  1  0  1  1  1  0  0  1  0  0  1  0  1  0  0
rank_1    1  2  2  3  4  5  5  5  6  6  6  7  7  8  8  8
Ψ_1       1  2  9  4  5  6  1  6  9 12 14 12  2 14  4  5

Position  1  2  3  4  5  6  7  8
SA_2      4  7  1  6  8  3  5  2
B_2       1  0  0  1  1  0  0  1
rank_2    1  1  1  2  3  3  3  4
Ψ_2       1  5  8  4  5  1  4  8

Position  1  2  3  4
SA_3      2  3  4  1
20
Compress
  • We keep ℓ = O(log log n) levels
  • All levels except SA_ℓ are stored implicitly
  • For each level j = 0..ℓ−1 we store B_j, rank_j and Ψ_j
  • rank_j and Ψ_j are stored implicitly
  • The size of SA_ℓ is N_ℓ log N_ℓ ≤ n bits (for ℓ = ⌈log log n⌉, N_ℓ ≤ n / log n)

21
lookup
  • Just compute SA_k[i] recursively from SA_{k+1}
  • Recursion depth: log log n
  • All the data structures used have O(1)
    access time
  • O(log log n) lookup cost
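
The whole compress-then-lookup cycle can be sketched on the 32-character example. This is a plain illustration, not the succinct structure: B_k, rank_k and Ψ_k are stored as ordinary Python lists, the order a < # < b is assumed, and the names build_levels and lookup are invented for the sketch.

```python
ORDER = {"a": 0, "#": 1, "b": 2}  # assumed character order: a < # < b
T = "abbabbabbabbabaaabababbabbbabba#"

def build_levels(sa0, depth):
    """Apply `depth` compression phases; return the per-level tables and the last array."""
    levels, sa = [], sa0
    for _ in range(depth):
        pos = {v: i for i, v in enumerate(sa, start=1)}
        b = [1 - v % 2 for v in sa]
        rank = [sum(b[:i]) for i in range(1, len(b) + 1)]
        psi = [i if v % 2 == 0 else pos[v + 1] for i, v in enumerate(sa, start=1)]
        levels.append((b, rank, psi))
        sa = [v // 2 for v in sa if v % 2 == 0]
    return levels, sa

def lookup(levels, last, i, k=0):
    """Return SA_k[i] (1-based) by recursing down to the explicitly stored last level."""
    if k == len(levels):
        return last[i - 1]
    b, rank, psi = levels[k]
    j = rank[psi[i - 1] - 1]
    return 2 * lookup(levels, last, j, k + 1) + (b[i - 1] - 1)

sa0 = [i + 1 for i in sorted(range(len(T)), key=lambda i: [ORDER[c] for c in T[i:]])]
levels, sa3 = build_levels(sa0, 3)
print(sa3)                     # [2, 3, 4, 1]
print(lookup(levels, sa3, 1))  # 15
```

Each lookup makes one recursive call per level, so with three levels the recursion depth is three.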

22
How The Data Is Stored
  • The B_k bit vector is stored explicitly
  • O(N_k) space
  • O(1) lookup
  • O(N_k) preprocessing time
  • The rank_k vector is stored implicitly using
    Jacobson's rank data structure
  • O(N_k (log log N_k) / log N_k) space
  • O(1) lookup
  • O(N_k) preprocessing time
  • The Ψ_k vector is stored implicitly (using rank
    and select)

23
Ψ_k vector representation
24
Let's take a look
25
An Example
(Repeats the table from slide 16.)
26
Example
(Repeats the table from slide 17.)
27
So what can we do with all the lists?
  • Concatenate them in lexicographical
    order to form the list L_k
  • L_1 = 9, 1, 6, 12, 14, 2, 4, 5
  • Let's see how we can compute Ψ_k(i)
  • If B_k[i] = 1 (SA_k[i] is even), it's simply i
  • Otherwise,
  • because all the stored prefix patterns are in
    sorted order,
  • up to position i the list L_k holds
    entries for all the odd suffixes up to i:
    h = i − rank_k(i) of them
  • So we can look up the h-th entry in L_k
  • And it will give us the answer

28
Simple example
  • L_2 = 5, 8, 1, 4
  • rank_2 = 1, 1, 1, 2, 3, 3, 3, 4
  • B_2 = 1, 0, 0, 1, 1, 0, 0, 1
  • Ψ_2 = 1, 5, 8, 4, 5, 1, 4, 8
  • Ψ_2(3) = ?
  • rank_2(3) = 1, h = 3 − 1 = 2, L_2[2] = 8
  • Ψ_2(3) = 8
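
The same computation as a small Python sketch. The list L_2 is derived here from the B_2 and Ψ_2 rows (keeping only the entries at 0 positions), which is an assumption about how the concatenated list is laid out; the helper name psi2 is illustrative.

```python
B2 = [1, 0, 0, 1, 1, 0, 0, 1]
RANK2 = [1, 1, 1, 2, 3, 3, 3, 4]
PSI2 = [1, 5, 8, 4, 5, 1, 4, 8]

# L_2 keeps only the companions of the 0 positions, in position order.
L2 = [p for p, bit in zip(PSI2, B2) if bit == 0]

def psi2(i):
    """Psi_2(i) for 1-based i, using B_2, rank_2 and the list L_2 only."""
    if B2[i - 1] == 1:
        return i                 # even entry: its own companion
    h = i - RANK2[i - 1]         # number of 0 bits among the first i bits
    return L2[h - 1]

print(psi2(3))  # 8
```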

29
Rank and select
  • Given a bit vector of length n
  • rank(i) is the number of 1 bits up to position i
  • select(i) returns the index of the i-th 1 bit
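
As a reference point, both queries can be written naively in linear time. The sketch below uses 1-based positions as on the slides and says nothing about the constant-time structures discussed later.

```python
def rank(bits, i):
    """Number of 1 bits among the first i bits."""
    return sum(bits[:i])

def select(bits, i):
    """1-based position of the i-th 1 bit."""
    seen = 0
    for position, bit in enumerate(bits, start=1):
        seen += bit
        if bit == 1 and seen == i:
            return position
    raise ValueError("the vector has fewer than i one bits")

bits = [1, 0, 0, 1, 1, 0, 0, 1]
print(rank(bits, 4))    # 2
print(select(bits, 3))  # 5
```

The two are inverse in the sense that rank(bits, select(bits, i)) == i for every valid i.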

30
Ψ_k vector representation
  • Lemma 2
  • Given s integers in sorted order,
  • each containing w bits, where s < 2^w,
  • we can store them with at most
  • s(2 + w − ⌊log s⌋) + O(s / log log s) bits
  • so that retrieving the h-th integer takes constant
    time

31
Ψ_k vector representation
  • Take the first z = ⌊log s⌋ bits of each integer,
    creating the integers q_1..q_s
  • It's easy to see that q_1 ≤ q_i ≤ q_{i+1} ≤ s (we take the
    most significant bits, after all)
  • The remaining w − z bits of each integer will be r_i

[Figure: the bits of S_i split into a high-order part q_i and a low-order part r_i]
32
Ψ_k vector representation
  • Store the r_i in a simple array: (w − z)s bits
  • Store q_1..q_s in a table supporting select and
    rank in constant time.
  • The table Q is implemented in the following way:
  • Instead of saving the numbers themselves,
  • we store q_1, q_2 − q_1, q_3 − q_2, ..., q_s − q_{s−1}
  • in unary representation (i is written as 0^i 1)
  • And add a select data structure.

33
Ψ_k vector representation
  • In order to get q_i we simply do select(i),
  • and count the number of zeros before the i-th 1:
  • q_i = select(i) − rank(select(i))

34
Ψ_k vector representation
  • The size of the q table:
  • the unary string has s + 2^z ≤ 2s bits, plus the
    select overhead of O(s / log log s) bits
  • So we can output S_i easily:
  • S_i = q_i · 2^(w−z) + r_i
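
The representation from the last few slides can be sketched as follows. This is an illustration only: the unary string is a Python list and select is answered from a plain list of positions rather than a succinct select structure, so it shows the encoding but not the O(s / log log s) overhead. The class name SortedIntStore is invented.

```python
class SortedIntStore:
    """Store s sorted w-bit integers as high parts (unary gaps) plus low parts."""

    def __init__(self, values, w):
        s = len(values)
        assert values == sorted(values) and all(0 <= v < 2 ** w for v in values)
        assert 0 < s < 2 ** w
        self.z = s.bit_length() - 1            # z = floor(log2 s)
        self.low_bits = w - self.z             # each r_i has w - z bits
        self.r = [v & ((1 << self.low_bits) - 1) for v in values]
        self.unary = []                        # gap q_i - q_{i-1} as 0...01
        previous = 0
        for v in values:
            q = v >> self.low_bits
            self.unary.extend([0] * (q - previous) + [1])
            previous = q
        # Assumption: select is answered from an explicit position list.
        self.one_positions = [p for p, bit in enumerate(self.unary, start=1) if bit]

    def get(self, i):
        """Return the i-th stored integer (1-based)."""
        q = self.one_positions[i - 1] - i      # zeros before the i-th 1
        return (q << self.low_bits) | self.r[i - 1]

values = [3, 9, 10, 27, 40, 41, 60, 63]
store = SortedIntStore(values, w=6)
print(store.get(4))      # 27
print(len(store.unary))  # 15
```

Here s = 8 and z = 3, so the unary string stays within the s + 2^z = 16 bit bound.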

35
Ψ_k vector representation
  • Lemma 3
  • We can store the concatenated list L_k used for Ψ_k
    in n(1/2 + 3/2^(k+1)) + O(n / (2^k log log n)) bits, so that accessing
    the h-th element takes constant time, with
    preprocessing time O(n/2^k + 2^(2^k))
  • There are 2^(2^k) lists; number them (even the empty
    ones)

36
Ψ_k vector representation
  • Lemma 3 (continued)
  • Each integer x_i in the lists, 1 ≤ x_i ≤ N_k, is
    transformed into a new integer by attaching the
    binary representation of its list number
  • The bit size of the new integer is 2^k + log N_k

37
Ψ_k vector representation
  • Lemma 3 (continued)
  • After concatenating all the lists, we have N_k/2
    sorted numbers of 2^k + log N_k bits each
  • Using Lemma 2 we get
  • O(1) access time
  • and a space bound of n(1/2 + 3/2^(k+1)) + O(n / (2^k log log n))
    bits
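
The space bound follows by substituting into Lemma 2. The short derivation below is a reconstruction that assumes N_k = n / 2^k is a power of 2, so that the floor of log s is exact.

```latex
\begin{aligned}
s &= \frac{N_k}{2} = \frac{n}{2^{k+1}}, \qquad
w = 2^k + \log N_k, \qquad
\lfloor \log s \rfloor = \log N_k - 1, \\
s\,(2 + w - \lfloor \log s \rfloor)
  &= \frac{n}{2^{k+1}}\,\bigl(2 + 2^k + \log N_k - \log N_k + 1\bigr)
   = \frac{n}{2^{k+1}}\,(2^k + 3)
   = n\left(\frac{1}{2} + \frac{3}{2^{k+1}}\right), \\
O\!\left(\frac{s}{\log\log s}\right)
  &= O\!\left(\frac{n}{2^k \log\log n}\right).
\end{aligned}
```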

38
Sum it up (space complexity)
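
The transcript keeps only the title of this slide. One way to add up the costs, assuming the per-level bounds stated on the earlier slides (Lemma 3 for the lists, N_k bits for B_k, a lower-order term for rank, at most n bits for the last level) and ℓ = ⌈log log n⌉ levels, is sketched here; the constants are not taken from the slide.

```latex
\begin{aligned}
\sum_{k=0}^{\ell-1} n\left(\frac{1}{2} + \frac{3}{2^{k+1}}\right)
  &= \frac{1}{2}\,n\,\ell + 3n\left(1 - 2^{-\ell}\right)
   < \frac{1}{2}\,n \log\log n + \frac{n}{2} + 3n, \\
\sum_{k=0}^{\ell-1} N_k &= \sum_{k=0}^{\ell-1} \frac{n}{2^k} < 2n
  \quad \text{(the bit vectors } B_k\text{)}, \\
\text{total} &= \frac{1}{2}\,n \log\log n + O(n) \ \text{bits}.
\end{aligned}
```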
39
Rank data structure
  • Due to Jacobson
  • Given a bit vector of length n, rank(i) is the
    number of 1 bits up to position i
  • Multilevel approach
  • We slice the bit string into chunks of log^2 n bits.
  • Between chunks we keep a rank counter
  • Each chunk is divided into sub-chunks of ½ log n bits,
  • and a counter is kept between sub-chunks
  • At the bottom level a simple lookup table is
    used.
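
A small sketch of the multilevel idea. The chunk and sub-chunk lengths follow the slide only approximately (they are rounded so that a chunk is a whole number of sub-chunks), counters are ordinary Python integers, and the class name RankIndex is invented, so this shows the query path rather than the exact space bound.

```python
import math

class RankIndex:
    def __init__(self, bits):
        lg = int(math.log2(max(len(bits), 4)))
        self.sub = max(1, lg // 2)             # sub-chunk length, about (1/2) log n
        self.chunk = self.sub * 2 * lg         # chunk length, about log^2 n
        self.table = [bin(p).count("1") for p in range(1 << self.sub)]  # lookup table
        self.chunk_count = []                  # 1 bits before each chunk
        self.sub_count = []                    # 1 bits before each sub-chunk, inside its chunk
        self.patterns = []                     # each sub-chunk as an integer
        total = in_chunk = 0
        for start in range(0, len(bits), self.sub):
            if start % self.chunk == 0:
                self.chunk_count.append(total)
                in_chunk = 0
            self.sub_count.append(in_chunk)
            piece = bits[start:start + self.sub]
            piece = piece + [0] * (self.sub - len(piece))   # pad the last sub-chunk
            pattern = int("".join(map(str, piece)), 2)
            self.patterns.append(pattern)
            total += self.table[pattern]
            in_chunk += self.table[pattern]

    def rank(self, i):
        """Number of 1 bits among the first i bits (0 <= i <= len(bits))."""
        if i == 0:
            return 0
        j = (i - 1) // self.sub                # sub-chunk that holds bit i
        used = i - j * self.sub                # bits taken from that sub-chunk
        prefix = self.patterns[j] >> (self.sub - used)
        chunk = (j * self.sub) // self.chunk
        return self.chunk_count[chunk] + self.sub_count[j] + self.table[prefix]

b0 = [0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1]
index = RankIndex(b0)
print(index.rank(16))  # 8
```

A query adds three numbers: the chunk counter, the sub-chunk counter and one table lookup.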

40
Rank
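[Figure: a rank query over chunks of log^2 n bits (counters 7 and 14), sub-chunks of ½ log n bits (counter 3) and a lookup table applied to the bit pattern 101. The output is 14 + 3 + 1.]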
41
Rank Analysis
42
Compressed suffix arrays in ε^-1 n + O(n) bits and
O(log^ε n) access time
  • In order to break the space barrier we need to
    store fewer levels ⇒ longer lookups
  • Let's keep only 3 levels: SA_0, SA_ℓ' and SA_ℓ, where
    ℓ = ⌈log log n⌉ and ℓ' = ⌈½ log log n⌉
  • using a dictionary data structure, which can
    tell whether an element is a member of the dictionary
    and supports a rank query, in O(1) time for both
    queries
  • The space complexity of the dictionary is
  • We keep in 2 dictionaries, D_0 and D_ℓ', which items appear in
    the next level (from SA_0 → SA_ℓ' and from
    SA_ℓ' → SA_ℓ)

43
The Ψ'_k function
  • We define the Ψ'_k function, which maps each 1 to
    its companion 0
  • Let's define the Φ_k function to be the neighbour
    function, SA_k[Φ_k(i)] = SA_k[i] + 1
  • We just need to merge the indexes in L_k and L'_k

44
Example
45
The Φ_k function implementation
  • Lemma 4: we can store the concatenated list used
    for Φ_k
  • for k = 0, in n + O(n / log log n) bits
  • for k > 0, in n(1 + 1/2^(k−1)) + O(n / (2^k log log n)) bits, with preprocessing
    time O(n/2^k + 2^(2^k))
  • If k > 0, simply use Lemma 3
  • k = 0:
  • Encode a as 0 and b as 1.
  • Create an n-bit vector named l
  • l[f] = 0 iff the list for Φ_0 is a (or #) at
    position f
  • We add select and select0 data structures on top
    of it: O(n / log log n) bits
  • We also keep the number of 0s in l as c0
  • A query Φ_0(j) is answered in the following way:
  • If j = c0, return select0(c0)
  • If j < c0, return select0(j)
  • If j > c0, return select(j − c0)

46
The Lookup algorithm
  • To find SA[i], we start walking the Φ_0 function:
    i, i', i'', i''', ...
  • where SA_0[i'] = SA_0[i] + 1
  • until reaching an entry found in the dictionary D_0
  • Let s be the walk length
  • and r the entry's rank in the dictionary (how many
    items have already passed to the next level?)
  • Using r we start walking the next level
  • Let s' be the walk length
  • and r' the entry's rank in the dictionary
  • We return the following result
  • The walk length is max(s, s') < 2^ℓ' < 2 sqrt(log n)
  • So the query time is O(sqrt(log n))
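
The walk can be illustrated with a one-level simplification. In the sketch below Φ_0 is stored explicitly as a dictionary, and a second dictionary stands in for D_0 together with the next level: it keeps SA_0[i] only where the value is a multiple of step (playing the role of 2^ℓ'). The order a < # < b, the names build and lookup, and the requirement that the text length be a multiple of step are all assumptions of the sketch.

```python
ORDER = {"a": 0, "#": 1, "b": 2}  # assumed character order: a < # < b
T = "abbabbabbabbabaaabababbabbbabba#"

def build(text, step):
    n = len(text)
    assert n % step == 0  # guarantees the last suffix is sampled, so walks terminate
    sa = [i + 1 for i in sorted(range(n), key=lambda i: [ORDER[c] for c in text[i:]])]
    where = {v: i for i, v in enumerate(sa, start=1)}
    # phi[i] is the position i' with SA[i'] = SA[i] + 1 (undefined for the last suffix).
    phi = {i: where[v + 1] for i, v in enumerate(sa, start=1) if v < n}
    # Stand-in for D_0 plus the next level: keep only multiples of step.
    sampled = {i: v for i, v in enumerate(sa, start=1) if v % step == 0}
    return sa, phi, sampled

def lookup(i, phi, sampled):
    """Return (SA[i], walk length) by walking phi until a sampled entry is reached."""
    s = 0
    while i not in sampled:
        i = phi[i]
        s += 1
    return sampled[i] - s, s

sa, phi, sampled = build(T, 4)
print(lookup(1, phi, sampled))  # (15, 1)
```

Every step adds 1 to the suffix position, so after s steps the sampled value minus s is the answer, and s is always smaller than step.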

47
The General multilevel Build
  • For every 0 < ε ≤ 1,
  • assume εℓ is an integer, so 2^(εℓ) < 2 log^ε n
  • Create all the levels 0, εℓ, 2εℓ, ..., ℓ
  • The number of levels is ε^-1 + 1 ⇒ lookup in O(log^ε n)

48
The General multilevel Build
49
Select data structure
  • select(i) returns the position of the i-th 1 bit in the string
  • Same idea as rank, a bit more complicated
  • Multilevel approach
  • At the first level we record the position of
    every (log n · log log n)-th 1 bit
  • Total space O(N / log log n)
  • Between each two recorded bits we keep the following
    data:
  • If the distance between them is r > (log n · log log n)^2,
  • we keep the absolute positions of all the 1 bits
    between them:
  • log^2 n · log log n bits
  • Otherwise we keep the relative position of
    each (log r · log log n)-th 1 bit
  • Total space per range: at most r / log log n bits
    (log r · log log n < log^2 n · log log n), and the ranges sum to Σ r < N
  • Then we keep one more level (with the same ideas)
  • The block size comes down to (lg n)^4

50
Select data structure
  • After that, we keep a lookup table
  • For every pattern of (log n)/d bits (d > 2) we save
  • the number of 1 bits,
  • the location of the i-th 1 bit in the pattern
  • Same as before, the space is O(n^(1/d) log n log log n)
  • The lookup is then very simple: just walk the
    levels,
  • get a block and answer the query on it using the
    lookup table.
  • Space complexity: O(n / log log n)
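
The first level of the select structure can be sketched in isolation. The version below records the position of every t-th 1 bit and then scans linearly inside the range, so it replaces the second level and the lookup table by a plain scan; the class name SampledSelect and the parameter t are illustrative (on the slides t would be log n · log log n).

```python
class SampledSelect:
    def __init__(self, bits, t):
        self.bits, self.t = bits, t
        self.samples = []      # 1-based positions of the t-th, 2t-th, ... 1 bit
        seen = 0
        for position, bit in enumerate(bits, start=1):
            seen += bit
            if bit and seen % t == 0:
                self.samples.append(position)

    def select(self, i):
        """1-based position of the i-th 1 bit."""
        k = i // self.t                        # last recorded 1 bit at or before the i-th
        if i < 1 or k > len(self.samples):
            raise ValueError("no such 1 bit")
        position = self.samples[k - 1] if k > 0 else 0
        seen = k * self.t
        if k > 0 and seen == i:
            return position
        for p in range(position, len(self.bits)):   # scan inside the range
            seen += self.bits[p]
            if self.bits[p] and seen == i:
                return p + 1
        raise ValueError("no such 1 bit")

bits = [0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1]
index = SampledSelect(bits, t=3)
print(index.samples)    # [8, 14]
print(index.select(5))  # 13
```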