Compressed Suffix Arrays and Suffix Trees

Slides: 55
Provided by: MAMR5


1
Compressed Suffix Arrays and Suffix Trees
  • Roberto Grossi, Jeffrey Scott Vitter

2
Outline
  • Reminders
  • Motivation
  • Compression results
  • Time and space bounds
  • Compressed Suffix Tree
  • Compressed Suffix Array
  • Proof of bounds

3
Reminder - Symbols
  • $T = t_1 t_2 \ldots t_{n-1}$
  • text of length $n-1$
  • eof symbol at the $n$-th position
  • $T[i, n]$ is suffix $i$ of text $T$
  • $i = 1, \ldots, n$

4
Reminder - Symbols
  • $P = p_1 p_2 \ldots p_m$
  • pattern of length $m$
  • $0 < \epsilon \le 1$

5
Reminder - Main Goal
  • Search string pattern P within text T
  • Support fast queries
  • Text T being fully scanned only once

6
Reminder - Suffix Trees
  • Leaf with value $i$ represents suffix $T[i, n]$
  • Build time
  • O(n)
  • Search time
  • O(m)
  • Structure space
  • O(n)

7
Reminder - Suffix Arrays
  • Suffixes are lexicographically ordered
  • $SA[i]$ is the starting position in $T$ of the $i$-th suffix in that order
  • $\Sigma = \{a, b\}$
  • $a < \# < b$, where # is the eof symbol

[Figure: suffix array example. $i$ = 1 2 3 4 5; $SA[i]$ = 4 5 3 2 1; sorted suffixes: a, (eof only), ba, bba, bbba]

8
Reminder - Suffix Arrays
  • Build time
  • O(n log n)
  • Search time
  • O(m log n)
  • Structure space
  • O(n)
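
The suffix array reminder can be made concrete with a minimal sketch. It assumes the symbol order a < # < b, with # standing in for the eof symbol, and it builds the array by naively sorting whole suffixes, which is simpler but slower than the O(n log n) build quoted above. The names `ORDER`, `suffix_array` and `find` are illustrative and do not come from the presentation.

```python
# Assumed symbol order for illustration: a < # < b ('#' plays the eof symbol).
ORDER = {"a": 0, "#": 1, "b": 2}

def suffix_array(text):
    """Return the 1-based start positions of all suffixes in sorted order."""
    key = lambda i: [ORDER[c] for c in text[i:]]
    return [i + 1 for i in sorted(range(len(text)), key=key)]

def find(text, sa, pattern):
    """Binary search over sa: O(log n) steps, O(m) symbols compared per step."""
    pat = [ORDER[c] for c in pattern]
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        start = sa[mid] - 1
        prefix = [ORDER[c] for c in text[start:start + len(pattern)]]
        if prefix < pat:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sa):
        start = sa[lo] - 1
        if text[start:start + len(pattern)] == pattern:
            return sa[lo]          # 1-based position of one occurrence
    return None

TEXT = "bbba#"
SA = suffix_array(TEXT)
```

On the text bbba# this gives the positions 4 5 3 2 1, and `find(TEXT, SA, "ba")` reports position 3.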

9
Motivation
  • So Far
  • Greedy in space
  • Fast searching
  • Need for space-efficient text indexing
  • Reduce both space and query time

10
Compressed Suffix Tree
  • Build time
  • O(n)
  • Search time
  • $O(m / \log n + \log^{\epsilon} n)$
  • Structure space
  • $(\epsilon^{-1} + O(1))\, n$ bits

11
Compressed Suffix Tree
  • Build Suffix Array
  • Build Compressed Suffix Tree
  • Patricia Tries
  • Compress Suffix Array

12
CSA Basic Operations
  • Compress(T,SA)
  • Return succinct representation of SA
  • Retain T
  • Discard SA
  • Lookup(i)
  • Return $SA[i]$
  • Use compressed SA

13
CSA Primary measures
  • Compress
  • Preprocessing compressed SA
  • Space of compressed SA
  • lookup
  • Query time

14
Compressed Suffix Array
  • Build time
  • O(n)
  • Structure space
  • $\frac{1}{2} n \log\log n + O(n)$ bits
  • lookup time
  • $O(\log\log n)$

15
Suffix Arrays Optimization
  • Main idea
  • Decomposition scheme
  • Recursive structure of permutations

16
Decomposition Scheme
  • Levels $k = 0, \ldots, l$
  • $SA_0 = SA$ (the original SA)
  • $n_0 = n$
  • $n = |T|$
  • Assumption: $n$ is a power of 2
  • $n_k = n / 2^k$
  • $SA_k$ is a permutation of $\{1, 2, \ldots, n_k\}$

17
$SA_k$ Succinct Representation
  • 4 main steps
  • Produce bit vector $B_k$
  • Map the 0s of $B_k$ to 1s
  • Compute the number of 1s in each prefix of $B_k$
  • Using function $\mathrm{rank}_k(j)$
  • Pack $SA_k$

18
Step 1: Produce bit vector $B_k$
  • $|B_k| = n_k$
  • $B_k[i] = 1$ if $SA_k[i]$ is even
  • $B_k[i] = 0$ if $SA_k[i]$ is odd

[Figure: $SA_0$ and $B_0$]
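
Step 1 amounts to a parity test per entry. The sketch below assumes a 32-entry level-0 array `SA0`, reconstructed because the figure did not survive; it agrees with the values quoted on the example slides (for instance $SA_0[2] = 16$ and $SA_0[3] = 31$) but should be treated as an assumption. The helper name `bit_vector` is illustrative.

```python
# Assumed reconstruction of SA_0 for the 32-symbol example (figure lost).
SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]

def bit_vector(sa_k):
    """B_k[i] = 1 if SA_k[i] is even, else 0 (list index i - 1 holds entry i)."""
    return [1 if v % 2 == 0 else 0 for v in sa_k]

B0 = bit_vector(SA0)
```

Exactly half of the entries are even, so `B0` contains 16 ones.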
19
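Step 2: Map $B_k$ 0s to 1s
  • New function $\Psi_k(i)$, $i = 1, \ldots, n_k$
  • $\Psi_k(i) = j$ if $SA_k[i]$ is odd and $SA_k[j] = SA_k[i] + 1$
  • $\Psi_k(i) = i$ otherwise ($SA_k[i]$ is even)

[Figure: $SA_0$, $B_0$ and $\Psi_0$]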
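
A minimal sketch of Step 2, again on an assumed reconstruction of the level-0 array (the figure is lost; the values agree with those quoted on the example slides). The function name `psi` and the dictionary `pos` are illustrative. Because $n_k$ is even, every odd value $v$ has its successor $v + 1$ somewhere in the array.

```python
# Assumed reconstruction of SA_0 for the 32-symbol example (figure lost).
SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]

def psi(sa_k):
    """Return Psi_k as a list with a dummy entry at index 0, so out[i] = Psi_k(i)."""
    pos = {v: i for i, v in enumerate(sa_k, start=1)}   # value -> 1-based index
    out = [0]
    for i, v in enumerate(sa_k, start=1):
        # Even entries map to themselves; odd entries map to the index of v + 1.
        out.append(i if v % 2 == 0 else pos[v + 1])
    return out

PSI0 = psi(SA0)
```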
20
Step 3: Compute 1s for $B_k$
  • Recall function $\mathrm{rank}_k(j)$, $j = 1, \ldots, n_k$
  • $\mathrm{rank}_k(j)$ = number of 1s in the first $j$ bits of $B_k$

[Figure: $SA_0$, $B_0$ and $\mathrm{rank}_0$]
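
The rank function can be sketched with plain prefix sums. This version keeps one counter per position, so it illustrates the definition only and not the succinct rank structure the presentation relies on. `SA0` is an assumed reconstruction of the lost figure, and `build_rank` is an illustrative name.

```python
from itertools import accumulate

# Assumed reconstruction of SA_0 for the 32-symbol example (figure lost).
SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]
B0 = [1 if v % 2 == 0 else 0 for v in SA0]

def build_rank(bits):
    """Return rank(j) = number of 1s among the first j bits (j is 1-based)."""
    prefix = [0] + list(accumulate(bits))
    def rank(j):
        return prefix[j]
    return rank

rank0 = build_rank(B0)
```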
21
Step 4: Pack $SA_k$
  • Pack even values of $SA_k$
  • Divide by 2
  • New permutation of $\{1, 2, \ldots, n_{k+1}\}$
  • $n_{k+1} = n_k / 2 = n / 2^{k+1}$
  • Store new permutation into $SA_{k+1}$
  • Remove $SA_k$
  • $|SA_{k+1}| = |SA_k| / 2$
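
Packing can be sketched in a couple of lines: keep the even values in their current order and halve them. `SA0` is an assumed reconstruction of the lost figure, and `pack` is an illustrative name.

```python
# Assumed reconstruction of SA_0 for the 32-symbol example (figure lost).
SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]

def pack(sa_k):
    """Return SA_{k+1}: the even values of SA_k, in order, divided by 2."""
    return [v // 2 for v in sa_k if v % 2 == 0]

SA1 = pack(SA0)   # 16 entries
SA2 = pack(SA1)   # 8 entries
SA3 = pack(SA2)   # 4 entries
```

Each level is again a permutation of $1, \ldots, n_k$, so the same four steps can be applied to it.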

22
Example level 0, steps 1-3
23
Example level 0, step 4
24
Lemma - Reconstruct $SA_k$
  • Results of phase $k$
  • $B_k$, $\Psi_k$, $\mathrm{rank}_k$, $SA_{k+1}$
  • Reconstruct $SA_k$
  • $SA_k[i] = 2\, SA_{k+1}[\mathrm{rank}_k(\Psi_k(i))] + (B_k[i] - 1)$
  • $i = 1, \ldots, n_k$
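
The lemma can be checked mechanically. The sketch below rebuilds every entry of a level from $B_k$, $\Psi_k$, $\mathrm{rank}_k$ and $SA_{k+1}$ and compares it with the original. It assumes only that the input is a permutation of $1, \ldots, n_k$ with $n_k$ even; `check_lemma` is an illustrative name, and the shuffled test permutation is arbitrary.

```python
import random

def check_lemma(sa_k):
    """Return True if the reconstruction formula reproduces every SA_k[i]."""
    n_k = len(sa_k)
    bits = [1 if v % 2 == 0 else 0 for v in sa_k]              # B_k
    pos = {v: i for i, v in enumerate(sa_k, start=1)}           # value -> index
    psi = [i if v % 2 == 0 else pos[v + 1]
           for i, v in enumerate(sa_k, start=1)]                # psi[i - 1] = Psi_k(i)
    prefix = [0]
    for b in bits:
        prefix.append(prefix[-1] + b)                           # prefix[j] = rank_k(j)
    sa_next = [v // 2 for v in sa_k if v % 2 == 0]              # SA_{k+1}
    return all(
        2 * sa_next[prefix[psi[i - 1]] - 1] + (bits[i - 1] - 1) == sa_k[i - 1]
        for i in range(1, n_k + 1)
    )

random.seed(0)
perm = list(range(1, 17))
random.shuffle(perm)
ok = check_lemma(perm)
```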

25
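Proof, case 1: $B_k[i] = 1$
  • $SA_k[i] = 2\, SA_{k+1}[\mathrm{rank}_k(\Psi_k(i))] + (B_k[i] - 1)$
  • Step 4: $SA_k[i]/2$ is stored in the $\mathrm{rank}_k(i)$-th entry of $SA_{k+1}$
  • $SA_k[i] = 2\, SA_{k+1}[\mathrm{rank}_k(i)]$
  • Step 2: $\Psi_k(i) = i$, and $B_k[i] - 1 = 0$, so the formula holds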

26
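Proof, case 2: $B_k[i] = 0$
  • $SA_k[i] = 2\, SA_{k+1}[\mathrm{rank}_k(\Psi_k(i))] + (B_k[i] - 1)$
  • $\Psi_k(i) = j$
  • Step 2: $SA_k[i] = SA_k[j] - 1$
  • $B_k[j] = 1$
  • Apply case 1 on $j$
  • $SA_k[j] = 2\, SA_{k+1}[\mathrm{rank}_k(j)]$
  • Hence $SA_k[i] = 2\, SA_{k+1}[\mathrm{rank}_k(j)] - 1$, and $B_k[i] - 1 = -1$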

27
Example, case 1: $B_k[i] = 1$
  • $SA_0[2] = ?$
  • $B_0[2] = 1$, $\Psi_0(2) = 2$, $\mathrm{rank}_0(2) = 1$
  • $SA_0[2]/2$ is stored in the 1st entry of $SA_1$
  • $SA_0[2] = 2 \cdot SA_1[1] = 2 \cdot 8 = 16$

28
Example, case 2: $B_k[i] = 0$
  • $SA_0[3] = ?$
  • $B_0[3] = 0$, $\Psi_0(3) = 14$, $\mathrm{rank}_0(14) = 6$
  • $SA_0[14] = 2 \cdot SA_1[6] = 2 \cdot 16 = 32$
  • $SA_0[3] = SA_0[14] - 1 = 32 - 1 = 31$

29
Example - Decomposition
30
Determining l
  • $n_0 = n = 32$
  • $n_3 = 4 \le n / \log n$
  • so $SA_3$ can be stored in at most $n$ bits
  • Conclusion
  • $l = \log\log n$ (rounded up; $l = 3$ for $n = 32$)
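
The choice of $l$ can be checked numerically. The sketch below assumes that $\log$ is base 2, that $l$ is rounded up to an integer, and that each entry of the last level is stored in $\lceil \log n \rceil$ bits; `last_level` is an illustrative name.

```python
import math

def last_level(n):
    """Return (l, n_l, bits needed to store SA_l explicitly) for a power of two n."""
    l = math.ceil(math.log2(math.log2(n)))     # assumed rounding of log log n
    n_l = n // 2 ** l                          # entries left on the last level
    bits = n_l * math.ceil(math.log2(n))       # log n bits per stored entry
    return l, n_l, bits

L, N_L, BITS = last_level(32)
```

For $n = 32$ this gives $l = 3$ and $n_3 = 4$, and the four stored entries take 20 bits, which is below $n$.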

31
CSA Structure
  • Levels $k = 0, 1, \ldots, l-1$
  • Store $B_k$, $\Psi_k$, $\mathrm{rank}_k$
  • Final level $k = l$
  • Store only $SA_l$

32
CSA Structure Build
  • $B_k$
  • $n_k$ bits per vector
  • $O(n_k)$ build
  • $\mathrm{rank}_k$
  • $O(n_k (\log\log n_k) / \log n_k)$ bits
  • As shown before
  • $O(n_k)$ build
  • $SA_l$
  • $(n / 2^l) \log n$ bits

33
CSA Structure space - $\Psi_k$
  • List method
  • $2^{2^k}$ lists
  • One list per possible $2^k$-symbol prefix of a suffix
  • Number of lists increases with $k$
  • $L_k$ = concatenation of all $2^{2^k}$ lists
  • $|L_k| = n_k / 2$
  • $|L_k|$ decreases with $k$

34
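CSA Structure space - $\Psi_k$
  • For $i = 1, \ldots, n_k/2$
  • $j$ = position of the $i$-th 1 in $B_k$
  • Pattern $T[2^k (SA_k[j] - 1), \ldots, 2^k SA_k[j] - 1]$
  • The pattern is matched to a list, and $j$ is appended to it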

35
Level 0
  • a list: 2, 14, 15, 18, 23, 28, 30, 31
  • b list: 7, 8, 10, 13, 16, 17, 21, 27

36
Levels 1,2
  • Level 1
  • aa: (empty list)
  • ab: 9
  • ba: 1, 6, 12, 14
  • bb: 2, 4, 5
  • Level 2
  • abba: 5, 8
  • baba: 1
  • babb: 4
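
The lists can be regenerated with a short sketch. It assumes the example text `T` below and the symbol order a < # < b; the text is not shown in the transcript, so it is an assumption, chosen because it reproduces the level 0 and level 1 lists above. All function names are illustrative. Each list is keyed by the $2^k$ symbols that start at text position $2^k (SA_k[j] - 1)$.

```python
from collections import defaultdict

T = "abbabbabbabbabaaabababbabbbabba#"   # assumed example text, '#' is the eof symbol
ORDER = {"a": 0, "#": 1, "b": 2}          # assumed symbol order a < # < b

def suffix_array(text):
    """Naive construction: sort 1-based suffix positions under ORDER."""
    key = lambda i: [ORDER[c] for c in text[i:]]
    return [i + 1 for i in sorted(range(len(text)), key=key)]

def pack(sa_k):
    """Even values of SA_k, in order, divided by 2."""
    return [v // 2 for v in sa_k if v % 2 == 0]

def lists_at_level(text, sa_k, k):
    """Group the targets j of Psi_k by the 2^k-symbol pattern in front of them."""
    width = 2 ** k
    pos = {v: i for i, v in enumerate(sa_k, start=1)}
    lists = defaultdict(list)
    for v in sa_k:                        # suffix order keeps every list sorted
        if v % 2 == 1:                    # odd entry, so Psi_k points to v + 1
            j = pos[v + 1]
            start = width * v             # 1-based text position 2^k (SA_k[j] - 1)
            lists[text[start - 1:start - 1 + width]].append(j)
    return dict(lists)

SA0 = suffix_array(T)
SA1 = pack(SA0)
SA2 = pack(SA1)
LISTS = [lists_at_level(T, sa, k) for k, sa in enumerate((SA0, SA1, SA2))]
```

Under this assumed text the level 2 patterns come out as abba, baba and babb, in that order.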

37
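Reconstruct $\Psi_k$
  • $B_k[i] = 1$
  • $\Psi_k(i) = i$
  • $B_k[i] = 0$
  • $h$ = number of 0s in the first $i$ bits of $B_k$, i.e. $h = i - \mathrm{rank}_k(i)$
  • $\Psi_k(i) = L_k[h]$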

38
Example - Reconstruct $\Psi_k$
  • $\Psi_0(25) = ?$
  • $B_0[25] = 0$
  • $h = 25 - 12 = 13$
  • $\Psi_0(25) = L_0[13] = 16$

39
Example - Reconstruct $\Psi_k$
  • $\mathrm{rank}_0(16) = 8$
  • $SA_1[8] = ?$
  • $\Psi_1(8) = ?$
  • $B_1[8] = 0$
  • $h = 8 - 5 = 3$
  • $\Psi_1(8) = L_1[3] = 6$
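
Both examples can be replayed with a minimal sketch. `L0` and `L1` are the list contents from the slides, concatenated in list order; the bit vectors `B0` and `B1` are assumed reconstructions, since the figures are lost. The linear-time `sum` stands in for $\mathrm{rank}_k$, and `psi_from_lists` is an illustrative name.

```python
# Concatenated lists as given on the slides (level 0: a then b; level 1: ab, ba, bb).
L0 = [2, 14, 15, 18, 23, 28, 30, 31, 7, 8, 10, 13, 16, 17, 21, 27]
L1 = [9, 1, 6, 12, 14, 2, 4, 5]
# Assumed reconstructions of B_0 and B_1 (figures lost).
B0 = [0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1,
      1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0]
B1 = [1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0]

def psi_from_lists(i, bits, concat):
    """Recover Psi_k(i) from B_k and the concatenated list L_k (i is 1-based)."""
    if bits[i - 1] == 1:
        return i
    h = i - sum(bits[:i])        # number of 0s among the first i bits
    return concat[h - 1]         # h-th entry of L_k

PSI0_25 = psi_from_lists(25, B0, L0)
PSI1_8 = psi_from_lists(8, B1, L1)
```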

40
Lemma
  • $s$ sorted integers
  • $w$ bits per number
  • $s < 2^w$
  • Store integers
  • $s(2 + w - \log s) + O(s / \log\log s)$ bits
  • Retrieve $h$-th integer
  • O(1)
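
One way to reach a bound of this shape is to split each integer into high and low bits. The sketch below is an assumption about the flavour of the encoding, not the presentation's construction: the top $\lfloor \log s \rfloor$ bits are stored as unary gaps (fewer than $2s$ bits) and the remaining low bits verbatim. Retrieval here scans the unary vector, whereas the lemma's O(1) time needs a constant-time select structure that is not shown. The names `encode` and `retrieve` are illustrative.

```python
def encode(values, w):
    """Split s sorted w-bit integers into verbatim low bits and unary-coded high parts."""
    s = len(values)
    low_bits = w - (s.bit_length() - 1)            # w - floor(log2 s)
    low = [v & ((1 << low_bits) - 1) for v in values]
    unary, prev = [], 0
    for v in values:
        q = v >> low_bits                          # high part, non-decreasing
        unary.extend([0] * (q - prev) + [1])       # one 0 per increment, one 1 per value
        prev = q
    return low_bits, low, unary

def retrieve(h, low_bits, low, unary):
    """Return the h-th integer (1-based); linear scan in place of O(1) select."""
    ones = 0
    for position, bit in enumerate(unary):
        ones += bit
        if bit and ones == h:
            q = position + 1 - h                   # zeros before the h-th one
            return (q << low_bits) | low[h - 1]
    raise IndexError(h)

VALUES = [2, 14, 15, 18, 23, 28, 30, 31]           # the level-0 "a list", w = 5
LOW_BITS, LOW, UNARY = encode(VALUES, 5)
```

For these eight 5-bit values the encoding uses 15 unary bits plus 16 low bits, within $s(2 + w - \lfloor \log s \rfloor) = 32$.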

41
Store Lk
  • Store integers
  • $n(1/2 + 3/2^{k+1}) + O(n / (2^k \log\log n))$ bits
  • Retrieve $h$-th integer
  • O(1)
  • Preprocess time
  • $O(n/2^k + 2^{2^k})$

42
CSA Structure - Summary
  • $B_k$
  • $n_k$ bits
  • $\mathrm{rank}_k$
  • $O(n_k (\log\log n_k) / \log n_k)$ bits
  • $SA_l$
  • $(n / 2^l) \log n$ bits
  • $\Psi_k$
  • $n(1/2 + 3/2^{k+1}) + O(n / (2^k \log\log n))$ bits

43
Summing it up
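  • $\frac{n \log n}{2^l} + \frac{1}{2} l n + 5n + O(n / \log\log n)$
  • With $l = \log\log n$, the first two terms are $\frac{1}{2} n \log\log n + n$
  • $\frac{1}{2} n \log\log n + O(n)$ bits of storage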
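
As a worked check of where the $5n$ term can come from, the per-level costs listed in the summary can be added over levels $k = 0, \ldots, l-1$. This is a sketch of the arithmetic under the stated per-level bounds, with the lower-order rank and $O(\cdot)$ terms left aside.

```latex
\begin{align*}
\sum_{k=0}^{l-1} n_k &= \sum_{k=0}^{l-1} \frac{n}{2^k} < 2n
  && \text{(bit vectors } B_k\text{)} \\
\sum_{k=0}^{l-1} n\left(\frac{1}{2} + \frac{3}{2^{k+1}}\right)
  &< \frac{1}{2}\, l\, n + 3n
  && \text{(lists } L_k \text{ for } \Psi_k\text{)} \\
n_l \log n &= \frac{n \log n}{2^l}
  && \text{(last level } SA_l\text{)} \\
\text{total} &< \frac{n \log n}{2^l} + \frac{1}{2}\, l\, n + 5n
  && \text{(plus lower-order terms)}
\end{align*}
```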

44
Preprocess - summary
  • $B_k$
  • $O(n_k)$
  • $\mathrm{rank}_k$
  • $O(n_k)$
  • $\Psi_k$
  • $O(n/2^k + 2^{2^k})$
  • Summing over levels $0, \ldots, l-1$
  • Preprocess time O(n)

45
lookup(i)
  • lookup(i) returns $SA_0[i]$
  • Need to reconstruct $SA_0[i]$
  • New procedure: rlookup(i, k)
  • Recursive
  • Based on lemma of reconstructing SAk

46
rlookup(i,k)
  • rlookup(i, k)
  • If $k = l$
  • Return $SA_l[i]$
  • Else
  • Return $2 \cdot \mathrm{rlookup}(\mathrm{rank}_k(\Psi_k(i)),\, k+1) + (B_k[i] - 1)$
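
The recursion can be sketched end to end. This version stores $B_k$, $\Psi_k$ and $\mathrm{rank}_k$ as plain Python lists, so it illustrates the control flow of rlookup and not the compressed space bound. `SA0` is an assumed reconstruction of the level-0 array from the lost figure, and `compress`, `rlookup` and `lookup` are illustrative implementations.

```python
# Assumed reconstruction of SA_0 for the 32-symbol example (figure lost).
SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]

def compress(sa, num_levels):
    """Return ([(B_k, Psi_k, rank_k) for k < num_levels], SA_l), all uncompressed."""
    levels = []
    for _ in range(num_levels):
        bits = [1 if v % 2 == 0 else 0 for v in sa]
        pos = {v: i for i, v in enumerate(sa, start=1)}
        psi = [i if v % 2 == 0 else pos[v + 1] for i, v in enumerate(sa, start=1)]
        rank = [0]
        for b in bits:
            rank.append(rank[-1] + b)          # rank[j] = 1s among the first j bits
        levels.append((bits, psi, rank))
        sa = [v // 2 for v in sa if v % 2 == 0]
    return levels, sa

def rlookup(i, k, levels, sa_last):
    """Recursive reconstruction of SA_k[i] (i is 1-based)."""
    if k == len(levels):
        return sa_last[i - 1]
    bits, psi, rank = levels[k]
    return 2 * rlookup(rank[psi[i - 1]], k + 1, levels, sa_last) + (bits[i - 1] - 1)

def lookup(i, levels, sa_last):
    return rlookup(i, 0, levels, sa_last)

LEVELS, SA3 = compress(SA0, 3)
```

With three levels, `lookup(5, LEVELS, SA3)` follows the chain rlookup(10, 1), rlookup(7, 2), rlookup(2, 3) and returns 17.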

47
Reconstruct $SA_k$
  • Lemma
  • $SA_k[i] = 2\, SA_{k+1}[\mathrm{rank}_k(\Psi_k(i))] + (B_k[i] - 1)$
  • lookup(i) = rlookup(i, 0)

48
Example - lookup(i)
  • lookup(5) = rlookup(5, 0), $l = 3$
  • $= 2 \cdot \mathrm{rlookup}(\mathrm{rank}_0(\Psi_0(5)),\, 1) + (B_0[5] - 1)$
  • $= 2 \cdot \mathrm{rlookup}(10, 1) + (-1)$

49
Example cont.
  • rlookup(10, 1) $= 2 \cdot \mathrm{rlookup}(\mathrm{rank}_1(\Psi_1(10)),\, 2) + (B_1[10] - 1)$
  • $= 2 \cdot \mathrm{rlookup}(7, 2) + (-1)$

50
Example - cont.
  • rlookup(7, 2) $= 2 \cdot \mathrm{rlookup}(\mathrm{rank}_2(\Psi_2(7)),\, 3) + (B_2[7] - 1)$
  • $= 2 \cdot \mathrm{rlookup}(2, 3) + (-1)$

51
Example - cont.
  • rlookup(2, 3) $= SA_3[2] = 3$
  • lookup(5) $= 2(2(2 \cdot 3 + (-1)) + (-1)) + (-1)$
  • $= 2(2 \cdot 5 + (-1)) + (-1) = 2 \cdot 9 + (-1) = 17$

52
lookup(i)
  • lookup(i) = rlookup(i, 0)
  • $l + 1$ levels
  • O(1) per level
  • $O(\log\log n)$ lookup time

53
Compressed Suffix Array
  • Build time
  • O(n)
  • Structure space
  • $\frac{1}{2} n \log\log n + O(n)$ bits
  • lookup time
  • $O(\log\log n)$

54
Compressed Suffix Tree
  • Build time
  • O(n)
  • Search time
  • $O(m / \log n + \log^{\epsilon} n)$
  • Structure space
  • $(\epsilon^{-1} + O(1))\, n$ bits