Suffix Arrays - PowerPoint PPT Presentation

1 / 57
About This Presentation
Title:

Suffix Arrays

Description:

Obtained by comparing v and w and stopping at the first unequal symbol. Use precomputed lcp information to reduce the number of comparisons to O(P logN) ... – PowerPoint PPT presentation

Number of Views:79
Avg rating:3.0/5.0
Slides: 58
Provided by: csta3
Category:

less

Transcript and Presenter's Notes

Title: Suffix Arrays


1
Suffix Arrays
  • A New Method for Online String Searches
  • U.Manber and G.Myers

2
Introduction - String matching
  • Let A a0a1...aN-1 be a large text of length N
  • Let W w0w1...wp-1 be a word of length P
  • Is W a substring of A?

3
Introduction - Suffix Trees
  • Build time
  • O(N)
  • Search time
  • O(P)
  • Structure space
  • O(N)
  • Big constant
  • Dependent of S

4
Suffix Arrays
  • An array of all the suffixes of A
  • Sorted by lexicographical order
  • A aababa

baba ba ababa aba aababa a
5
Suffix Arrays
  • Ai aiai1...aN-1
  • The suffix of A that starts at position i.
  • Position array (Pos)
  • Posk is the start position of kth smallest
    suffix
  • APosk is the suffix pointed from Posk
  • APosk is the kth smallest suffix
  • Pos

A aababa
0 1 2 3 4 5
5 0 3 1 4 2
0 1 2 3 4 5
6
Searching
  • Is W a substring of A?
  • W is a substring of A Some suffix Ai starts
    with W
  • i is Ws location
  • All the instances of W must match consecutive
    suffixes in the array
  • Find the array interval that contains those
    suffixes

7
Searching - Definitions
  • For a string u
  • up u0u1...up-1
  • For strings u,v
  • u p v up vp
  • Same for ?, , gt
  • For any p, Pos is ordered according to p

8
Searching - Definitions
  • W w0w1wP-1
  • LW min (k W p APosk or k N)
  • First suffix p from W
  • RW max (k APosk p W or k 1)
  • Last suffix p from W


9
Search Algorithm
  • k LW, RW W p APosk
  • To find Ws instances - find LW, RW
  • Number of Ws occurrences is (RW-LW1)
  • Matches are APosLW,, APosRW
  • Suffix array is sorted - use binary search

10
Binary Search
  • Search interval L,R
  • Midpoint M
  • Compare W to APosM
  • Decide where to search next
  • W p APosM - search in left half (R M)
  • W gtp APosM - search in right half (L M)
  • O(PlogN)

cbb bcd abc aab
W abc
L
M
R
11
Search Algorithm
  • Observation
  • We can use information from one comparison to
    speedup the next comparisons
  • Use additional information
  • lcp longest common prefix

12
Search Algorithm - lcp
  • lcp(v,w) the length of the longest common
    prefix of v and w
  • Obtained by comparing v and w and stopping at the
    first unequal symbol
  • Use precomputed lcp information to reduce the
    number of comparisons to O(P logN)

13
Search Algorithm
  • Consider all possible midpoints
  • M 1N-2
  • Every midpoint corresponds to a triplet LM,M,RM
  • Suppose we precomputed two arrays
  • LlcpM lcp (APosLM, APosM)
  • RlcpM lcp (APosM, APosRM)

14
Search Algorithm
  • Maintain two more variables
  • l lcp(APosL, W)
  • r lcp(W, APosR)
  • W abcd

ad acd acb aca ac abcd abc abb abaa
15
Search Algorithm
  • Assume lr
  • Compare l with Llcp
  • If l lt LlcpM
  • W gtl1 APosLM
  • APosLM l1 APosM
  • W gtl1 APosM

ad acd ac abcd abac ababa abab abaa aba
16
Search Algorithm
  • If l gt LlcpM
  • APosLM ltl APosM
  • W l APosLM
  • W ltl APosM

W abcd
adc adb ada ad aca ac abd abcd aba
17
Search Algorithm
  • If l LlcpM
  • W can be in either half
  • Start comparing A and APosM from the (l1)
    symbol
  • First unequal symbol determines whether to go
    right or left
  • r/l will be updated to lj
  • j1 comparisons

W abcd
adc adb ada abcd abcc abc abaa aba ab
18
Search Algorithm - Complexity
  • In each Iteration
  • Let hmax(l,r)
  • We start comparing from the hth symbol to the
    hj1
  • j1 symbol comparisons
  • Next time we will start from the hj symbol
  • j symbols out of the j1 will not be compared
    again

19
Search Algorithm - Complexity
  • Every symbol in W will be successfully matched at
    most once
  • O(P) successful comparisons
  • At most one symbol will be unsuccessfully matched
    in each iteration
  • O(logN) unsuccessful comaprsions
  • Total O(P logN) comparisons

20
Build Suffix Array
  • So far
  • A O(P logN) search algorithm
  • Given a sorted suffix array
  • Given lcp information (Llcp, Rlcp)
  • Next
  • Sort the suffix array in O(NlogN)
  • Compute the lcps while sorting the array

21
Sort Algorithm
  • First stage
  • Sort the suffixes into buckets, according to
    first symbol
  • Inductive stage
  • Assume array is bucket sorted according to first
    H symbols
  • Every H-bucket holds suffixes with the same H
    first symbols
  • Buckets are ordered according to the H relation
  • Sort according to 2H first symbols

22
Sort Algorithm Intuition
  • Let Ai, Aj be two suffixes in the same H-bucket
  • Ai H Aj
  • Next H symbols of Ai and Aj are the first H
    symbols of AiH and AjH
  • In order to determine the 2H order of Ai and Aj,
    look at the H order of AiH and AjH

A aababaa
baa babaa ababaa abaa aababaa aa a
H 2
Ai
Aj
AjH
AiH
23
Sort Algorithm Main Idea
  • Let Ai be a suffix in the first H-bucket
  • Ai starts with the smallest H-symbol string
  • Ai-H should be the first in its 2H-bucket

A aababa
H 1
ba baba aababa aba ababa a
24
Sort Algorithm
  • In stage H
  • Go over all the suffixes in the H order
  • For each Ai move Ai-H to the next available place
    in its H-bucket
  • The suffixes are now sorted according to the 2H
    order
  • Go on to stage 2H to produce 4H order

25
Sort Algorithm - Example
A assassin
0 1 2 3 4 5 6 7
A3 A0 A6 A7 A1 A5 A4 A2
in
n
sin
ssassin
ssin
sassin
assassin
assin
H 2
26
Sort Algorithm - Example
A0 A3 A6 A7 A2 A5 A4 A1
A0 A3 A6 A7 A2 A5 A1 A4
27
Sort Algorithm - Complexity
  • First Stage
  • Bucket sort according to first symbol
  • O(NlogN)
  • Inductive Stages
  • O(logN) stages
  • O(N) per stage
  • Total O(NlogN)
  • Space
  • Can be implemented using two N-sized integer
    arrays

28
Finding Longest Common Prefixes
  • The search algorithm uses lcp information
  • LlcpM lcp (APosLM, APosM)
  • RlcpM lcp (APosM, APosRM)
  • We want to compute this information while we are
    sorting the array

29
Finding Longest Common Prefixes
  • Show how to compute lcps for suffixes in
    adjacent H-buckets during the sort algorithm
  • Use that to compute the lcps of all the suffixes
    that are consecutive in the sorted suffix array
  • Show how to compute lcps for all the necessary
    suffixes

30
Finding LCP for adjacent buckets
  • After the first sort stage, lcps of suffixes in
    adjacent buckets is 0
  • Assume after stage H we know the lcps between
    suffixes in adjacent H-buckets
  • Suppose Ap and Aq are in the same H-bucket but
    not in the same 2H bucket
  • H lcp(Ap, Aq) lt 2H
  • lcp(Ap, Aq) H lcp(ApH, AqH)
  • lcp(ApH, AqH) lt H

31
Finding LCP for adjacent buckets
  • Let i,j be ApH, AqHs positions in the suffix
    array
  • Assume iltj
  • Array is ordered according to the ltH order
  • lcp(APosi, APosj) min(lcp(APosk-1,
    APosk))

ba baba aababa aba ababa a
32
LCP Data Structures Hgt
  • We need a data structure that will allow us
  • get the lcps of consecutive suffixes
  • get their minimum
  • Hgt an N-1 sized array
  • Hgti lcp(APosi-1, APosi)

33
LCP Data Structures Hgt
  • Hgt will be computed inductively throughout the
    sort
  • Initialized to N1
  • Hgti is updated in stage 2H APosi started a
    new 2H-bucket
  • To update Hgti
  • Let a,b be the array positions of APosi-1H
    and APosi H
  • Assume ab
  • Hgti H min(Hgtk)

34
Finding LCP - Example
sassin ssin sin ssassin n in assassin assin
H 1
ssassin ssin sin sassin n in assin assassin
H 2
1
1
lcp (sin, ssin) 1 lcp(in, sin) 1
min(lcp(in,n), lcp(n,sassin), lcp(sassin,sin) 1
0 1
lcp(sassin,sin) 1 lcp(assin, in) 1
ssin ssassin sin sassin n in assin assassin
H 4
35
LCP Data Structures - Interval Tree
  • We need the following operations for Hgt
  • Set(i, h) sets Hgti to h
  • Min_height(i,j) determines min(Hgtk)
  • We need to find a way to find the lcps for all
    the necessary suffixes not just the ones in
    consecutive positions

36
LCP Data Structures - Interval Tree
  • A full and balanced binary tree
  • N-1 leaves, correspond to Hgt
  • O(logN) height, N-2 interior vertices
  • Keep a Hgt value for each interior vertex as
    well
  • Hgtv min(Hgtleft(v), Hgtright(v))

37
LCP Data Structures - Interval Tree
  • Operations implementation
  • Set(i,h)
  • Set Hgti to h and update the Hgt values on the
    path from i to the root
  • Min-height(i,j)
  • Finds the minimal Hgt value by scanning O(logN)
    vertices in the tree
  • Operations complexity O(logN)

38
Finding LCP Interval Tree
39
Finding LCP - Complexity
  • In stage 2H we update Hgti for all the leaves
    that started new buckets
  • Each update is one set operation and one
    Min_height - O(logN)
  • Throughout the algorithm every leaf is updated
    exactly once - O(N) updates
  • Updates complexity O(NlogN)
  • In each stage we scan the array to see which
    suffixes opened new buckets
  • Scans complexity O(NlogN)
  • Total LCP complexity O(NlogN)

40
Finding LCP - Llcp and Rlcp
  • We want Llcp and Rlcp to be available
    directly from the interval tree at the end of the
    sort
  • Use an interval tree that represents a binary
    search
  • Each interior node corresponds to (LM, RM) for
    some M
  • For each interior node (LM, RM)
  • Left(LM, RM) (LM,M)
  • Right(LM, RM) (M, RM)
  • N-2 interior nodes
  • Leaves correspond to (i-1,i)
  • Leaf(i-1,i) Hgti

41
Finding LCP - Llcp and Rlcp
  • According to interval tree structure
  • Hgt(L,R) min(Hgtk)
  • Hgt(L,R) lcp (APosL, APosR)
  • LlcpM Hgt(LM,M)
  • RlcpM Hgt(M,RM)

k L1,R
42
Worst Case Complexity
  • Suffix Array
  • Build time
  • O(NlogN)
  • Search time
  • O(PlogN)
  • Structure space
  • O(N)
  • 2N - 3N integers
  • Independent of S
  • Suffix Tree
  • Build time
  • O(N)
  • Search time
  • O(P)
  • Structure space
  • O(N)
  • Big constant
  • Dependent of S

43
Expected Time Improvements
  • Improve the expected case time of
  • Search Algorithm
  • Sort Algorithm
  • LCP computation
  • Use the following assumptions
  • All N-symbol strings are equally likely
  • Under this assumption
  • Expected length of longest repeated substring of
    A is O(logSN)

44
Expected Case Improvements - Main Idea
  • Let T
  • Let IntT(u) integer encoding in base S of the
    T-symbol prefix of u
  • Example
  • T 3
  • S a,b
  • u abaa
  • IntT(u) 010 2
  • There are ST N possible T-symbol prefixes
  • IntT(u) is a number in 0,N-1
  • Map each suffix Ap to IntT(Ap)
  • Can be done in O(N) time

45
Expected Case Improvements - Search Algorithm
  • Use an additional array Buck
  • Think of the sorted array as buckets, based on
    the IntT encoding
  • Buckk min i IntT (APosi) k
  • The first position that contains a suffix thats
    mapped to k
  • Compute Buck
  • at the end of the sort algorithm
  • O(N) additional time

46
Expected Case Improvements - Search Algorithm
  • Given a word W
  • We need to find Lw and Rw
  • Let k IntT(W)
  • Lw and Rw must be in ks bucket
  • (Buckk, Buckk1)
  • We only need to search one bucket

47
Expected Case Improvements - Search Algorithm
  • Number of buckets ST N
  • Average number of elements in a bucket O(1)
  • In the binary search for W
  • Expected size of bucket to search O(1)
  • Expected number of search steps O(1)
  • Expected case time O(P)

48
Expected Case Improvements - Sort Algorithm
  • First stage of sort
  • Sort according to first symbol
  • Replace first stage with sort according to IntT
  • Equivalent to sort according to first T symbols
  • Can be done in O(N) time
  • We changed the base case of the sort from H1 to
    HT

49
Expected Case Improvements - Sort Algorithm
  • Observation
  • Let C be the length of the longest repeated
    substring of A
  • Sort is in fact complete once we have reached
    (C1)-buckets
  • Suppose some (C1)-bucket contains more than one
    suffix
  • Then we have two suffixes with lcp gt C
  • This prefix is a repeated substring longer than C
    - contradiction

50
Expected Case Improvements - Sort Algorithm
  • Expected case
  • C O(logSN) O(T)
  • Number of stages O(1)
  • Expected case time O(N)

51
Expected Case Improvements - LCP Computation
  • Replace interval tree with sort history
  • Binary tree
  • Models the refinement of buckets during the sort
  • A vertex for each H-bucket
  • Each vertex holds the stage number at which its
    bucket was split

52
Expected Case Improvements - LCP Computation
  • Leaves correspond to suffixes and are arranged in
    an N element array
  • Each vertex has at least two children
  • O(N) nodes
  • Can be built with O(N) additional time during the
    sort

53
Expected Case Improvements - LCP Computation
  • Given the sort history we can compute lcp(Ap, Aq)
  • Find the nca (nearest common ancestor) of Ap and
    Aq
  • Let H be the ncas stage number
  • lcp(Ap, Aq) H lcp(ApH, AqH)
  • Recursively compute lcp(ApH, AqH)
  • Stop when the nca is the root

54
Expected Case Improvements - LCP Computation
  • Each step is O(1)
  • At each step the stage number of the nca is at
    least halved
  • Suppose we stop the recursion when H lt T
  • Expected length of longest repeated substring is
    O(T)
  • Expected case lcp is O(T) O(logSN)

55
Expected Case Improvements - LCP Computation
  • O(1) recursive steps in the expected case
  • Expected case time for one lcp O(1)
  • Expected case time for computing Llcp, Rlcp
    O(N)

56
Expected Case Improvements - LCP Computation
  • We need a way to find lcps that are known to be
    less than T
  • Build a ST x ST array
  • LookupIntT(x), IntT(y) lcp(x,y) for all
    T-symbol strings x,y
  • Max N entries (ST vN)
  • Compute incrementally in O(N)
  • Final recursion steps are replaced by O(1) lookup

57
Expected Time Complexity
  • Search time
  • O(P)
  • Sort LCP computation time
  • O(N)
Write a Comment
User Comments (0)
About PowerShow.com