Title: Faster Suffix Tree Construction With Missing Suffix Links
1Faster Suffix Tree Construction With Missing
Suffix Links
- By Richard Cole and Ramesh Hariharan
- Presented by B89502027
- B89705013
2What is Missing Suffix Link
- The definition of a suffix link implies that str(link(x)) is str(x) with its first symbol removed
- Here link(x) must be a NODE.
- When link(x) is not a node, the suffix link is missing!
- This happens in 2 situations:
- Parameterized strings
- Suffix trees for 2-dimensional arrays
3The problem is
- Parameterized strings and 2D arrays
- The node degree may not be bounded by a constant; it may be some polynomial in n
- Farach [5] solved the polynomial case, but not missing suffix links
- Baker [1] and Kosaraju [11] solved parameterized strings, but not the polynomial case (O(n log n))
- Giancarlo [7] solved 2D arrays, but still in O(n log n)
- We can solve both in O(n)!
4Our contribution tools
- Putting additional nodes and suffix links into the suffix tree, while staying within O(n) space and O(n) time
- Providing a hashing scheme with inverse-exponential failure probability
5General Settings
- Quasi-suffix collection
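- An ordered collection of strings $s_1, s_2, \ldots, s_n$ is a quasi-suffix collection iff the following hold:
- 1. $|s_1| = n$ and $|s_i| = |s_{i-1}| - 1$; therefore $|s_n| = 1$
- 2. No $s_i$ is a prefix of another $s_j$
- 3. If $s_i$ and $s_j$ have a common prefix of length $L > 0$, then $s_{i+1}$ and $s_{j+1}$ have a common prefix of length at least $L - 1$
- Example: the suffixes of aabb
- aabb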
- abb
- bb
- b
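The three conditions can be checked mechanically. The following is a minimal Python sketch; the function names `lcp` and `is_quasi_suffix_collection` are illustrative assumptions, not names from the paper. It also assumes strings are indexed from 0 in code. Note that the example above, taken literally, fails condition 2 (b is a prefix of bb), so the sketch appends a unique end marker `$`, which is presumably what is intended.

```python
def lcp(a, b):
    """Length of the longest common prefix of two sequences."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def is_quasi_suffix_collection(strings):
    """Check conditions 1-3 for an ordered collection s_1, ..., s_n."""
    n = len(strings)
    # Condition 1: |s_1| = n and each string is one symbol shorter than the last.
    if any(len(s) != n - i for i, s in enumerate(strings)):
        return False
    # Condition 2: no string is a prefix of another.
    for i in range(n):
        for j in range(n):
            if i != j and lcp(strings[i], strings[j]) == len(strings[i]):
                return False
    # Condition 3: a common prefix of length L > 0 for (s_i, s_j) forces a
    # common prefix of length at least L - 1 for (s_{i+1}, s_{j+1}).
    for i in range(n - 1):
        for j in range(i + 1, n - 1):
            common = lcp(strings[i], strings[j])
            if common > 0 and lcp(strings[i + 1], strings[j + 1]) < common - 1:
                return False
    return True


text = "aabb$"  # "$" is an assumed unique end marker
suffixes = [text[i:] for i in range(len(text))]
print(is_quasi_suffix_collection(suffixes))
print(is_quasi_suffix_collection(["aabb", "abb", "bb", "b"]))
```

The checker is quadratic and only meant for small examples; it plays no role in the linear-time construction.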
6General Settings(contd)
- Multiple quasi-suffix collection
- Several quasi-suffix collections, with L strings in all
- Any pair of strings $s_i$, $s_j$ satisfies conditions 2 and 3 of a quasi-suffix collection
- Character oracle
- Supplies the i-th character of the j-th string of the collection on demand in O(1) time
7Suffix trees for parameterized strings
- Each suffix s' of the string s is transformed to num(s'), e.g., a string of the form ?b?b?, where ? is one parameter symbol, becomes 0b2b2
- How does condition 1 hold?
- How does condition 2 hold?
- How does condition 3 hold?
- 0bb3b2
- bb0b2
- b0b2
- 0b2
- b0
- 0
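The encoding in the list above can be reproduced with a short Python sketch. It assumes that num replaces each parameter occurrence by the distance to its previous occurrence (0 for a first occurrence) and leaves fixed symbols unchanged, which matches the listed strings. The names `num` and `ParamOracle`, the parameter symbol `x`, and the 0-based indexing are illustrative assumptions. The oracle shows one way the character-oracle requirement could be met in O(1) per query after linear preprocessing.

```python
def num(s, params):
    """Encode s: parameters become distances to their previous occurrence."""
    last = {}
    out = []
    for i, ch in enumerate(s):
        if ch in params:
            out.append(i - last[ch] if ch in last else 0)
            last[ch] = i
        else:
            out.append(ch)
    return out


class ParamOracle:
    """Character oracle for the encoded suffixes of one text (a sketch)."""

    def __init__(self, text, params):
        # Distances computed once for the whole text.
        self.prev = num(text, params)

    def char(self, i, j):
        """i-th symbol (0-based) of num(text[j:]), in O(1) time."""
        v = self.prev[j + i]
        # A distance reaching back before position j means the parameter
        # occurs for the first time inside this suffix.
        if isinstance(v, int) and v > i:
            return 0
        return v


text, params = "xbbxbx", {"x"}
for j in range(len(text)):
    print("".join(str(v) for v in num(text[j:], params)))
```

The oracle never materializes the encoded suffixes, which would take quadratic space in total.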
8Suffix trees for 2D arrays
- There are m+n-1 diagonals in an m x n array
- For each diagonal, form a square array
- Each square array is decomposed into L-shaped pieces
- Each piece is mapped to a number (Giancarlo [7]), so a square becomes a string num(s); these strings form a quasi-suffix collection (each with a different ending symbol)
- Since there are m+n-1 diagonals, the m+n-1 squares give a multiple quasi-suffix collection
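The diagonal count can be verified with a small Python sketch. The function name `diagonals` and the (row, column) convention are assumptions made for illustration; the mapping of pieces to numbers is not reproduced here.

```python
def diagonals(m, n):
    """List the diagonals of an m x n array as lists of (row, col) cells.

    Diagonal d contains the cells with col - row == d,
    for d ranging from -(m - 1) to n - 1.
    """
    result = []
    for d in range(-(m - 1), n):
        cells = [(r, r + d) for r in range(m) if 0 <= r + d < n]
        result.append(cells)
    return result


ds = diagonals(3, 4)
print(len(ds))  # number of diagonals of a 3 x 4 array
```

Every cell lies on exactly one diagonal, so the squares anchored along the diagonals together cover all starting positions.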
9First! McCreight's Algorithm
- Definition of suffix link
- Since condition 3 is always satisfied with equality, a suffix link is defined for each node x, and link(x) is defined to be a node.
- Two stages: rescanning and, possibly, scanning
- Rescan down from link(par(x)) until the position for link(x) is found
- If no node is present there, insert one, along with an edge for the leaf (no scan)
- Otherwise, just scan down (as we did in Ukkonen's algorithm)
- In either case, link(x) is well defined!
10Two problems
- link(par(x)) may not be defined
- There may be no node at link(x)!
- This is because condition 3 need not be satisfied with equality, e.g., in our parameterized string case!
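The second problem can be observed on a small example without building the tree. The Python sketch below rests on one assumption: an internal node exists exactly at those strings that are the longest common prefix of some pair in the collection. The function names and the example strings are illustrative. It reports every node whose string, with its first symbol removed, is not itself a node.

```python
def lcp(a, b):
    """Length of the longest common prefix of two sequences."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def missing_suffix_links(collection):
    """Return the node strings whose suffix-link target is not a node."""
    strings = [tuple(s) for s in collection]
    n = len(strings)
    nodes = {()}  # the root
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            common = lcp(strings[i], strings[j])
            nodes.add(strings[i][:common])
            if common > 0:
                pairs.append((i, common))
    missing = set()
    for i, common in pairs:
        # The target has length common - 1 and is a prefix of s_{i+1}.
        target = strings[i + 1][:common - 1]
        if target not in nodes:
            missing.add(strings[i][:common])
    return missing


# Ordinary suffixes of "aabb$": condition 3 holds with equality.
plain = ["aabb$", "abb$", "bb$", "b$", "$"]
# Encoded suffixes of "xbxby$", with x and y assumed to be parameters.
param = [(0, "b", 2, "b", 0, "$"), ("b", 0, "b", 0, "$"), (0, "b", 0, "$"),
         ("b", 0, "$"), (0, "$"), ("$",)]
print(missing_suffix_links(plain))
print(missing_suffix_links(param))
```

In the parameterized example the node for 0b has no target, because the only two strings starting with b agree on b0, so no node exists at b.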
11Our Algorithm
- Two modifications to McCreight's algorithm
- Traverse up to find an ancestor with a suffix link
- Copy nodes backwards from the destination found above
- Redefinition of suffix link
- link(x) is the node y such that, if str(x) is the longest common prefix of $s_i$ and $s_j$, then str(y) is the longest common prefix of $s_{i+1}$ and $s_{j+1}$, where $|\mathrm{str}(y)| = |\mathrm{str}(x)| - 1$
- link(x) need not be defined for every node x!
12Some definition
- nanc(x): the nearest ancestor of x with a suffix link
- Real/imaginary nodes
- If a new scanning stage begins within an edge (condition 3 holding with the > property), we use an imaginary node.
- An imaginary node has only 1 child, whereas a real node has at least 2!
- There are at most O(n) real and imaginary nodes (since there are at most n leaves)
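The upward traversal behind nanc(x) can be sketched in a few lines of Python. The `Node` class and its field names are hypothetical, and the sketch assumes that nanc(x) is a proper ancestor and that the root carries a suffix link to itself so the walk always terminates.

```python
class Node:
    """Hypothetical tree node; field names are illustrative assumptions."""

    def __init__(self, parent=None, symbol=None):
        self.parent = parent
        self.children = {}
        self.link = None  # suffix link, or None when missing
        if parent is not None:
            parent.children[symbol] = self


def nanc(x):
    """Nearest proper ancestor of x that has a suffix link (None if none)."""
    y = x.parent
    while y is not None and y.link is None:
        y = y.parent
    return y


root = Node()
root.link = root  # assumption: the root links to itself
a = Node(root, "a")
b = Node(a, "b")
x = Node(b, "c")
print(nanc(x) is root)  # a and b have no suffix link yet
a.link = root
print(nanc(x) is a)
```

The cost of one call is the number of link-less ancestors skipped, which is the quantity the time analysis has to bound.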
13Some facts
- The number of real and imaginary nodes is O(n)
- The total number of children of real and imaginary nodes is O(n)
- The total length of the scanned portions is O(n)
14More features
- Back-propagated nodes
- Must have a suffix link
- Only one child
- When scanning down from link(nanc(x)) to link(x), every second node (not including the first and the last) is back-propagated.
15Invariant 1
- If a node x is back-propagated in direction u, then its parent is not back-propagated in direction u', where u' is a prefix of u.
16Time Complexity
- Two costs to be analyzed
- Finding nanc(x)
- Rescanning down
- Creating a new back-propagated node
- Upgrading an imaginary node to a back-propagated node, by adding a suffix link to it!
- Adding a real/imaginary node for link(x)
- Time: O(1), plus the cost of finding nanc(x) and of rescanning down
17Bounding Back-Propagated Nodes
- Defining the BP tree
- All nodes except the root are back-propagated nodes
- BP forest
- Trees rooted at the various real/imaginary nodes that are back-propagated. (Imagine the suffix tree as a BP forest!)
- Decomposing a BP tree into paths
- Each path runs from the root down to a node y such that either
- 1. there is no valid direction for y, or
- 2. there exists a direction u in which y has not been back-propagated!
- Decompose recursively
18Bounding Back-Propagated Nodes (contd)
- Extend the paths on the suffix tree backward (in the direction not implied by a back-propagated node) until either
- 1. a node is reached, or
- 2. no valid direction is available
- Lemma 1: two distinct extended paths cannot intersect.
- Lemma 2: if an extended path terminates at a node y (not by running out of valid directions), y cannot be a back-propagated node.
- Lemma 3: the total number of paths is O(n), and hence the total number of back-propagated nodes is O(n)
19Time Complexity (contd)
- The cost of finding nanc(x) is analyzed in the same way as discussed for Ukkonen's algorithm, and is bounded by O(n)
- Combining this with Lemma 3, we have the theorem
20The Hashing Scheme
- Goal
- Hash O(n) pairs (node, following symbol)
- O(n) construction, O(1) query
- Failure probability: inverse exponential
21FKS Perfect Hashing
- Fredman, Komlós, Szemerédi
- Refer to a textbook for the algorithm
- Hashes n items from the range $0 \ldots \mathrm{poly}(n)$ into $0 \ldots \Theta(n)$
- Ensures a collision-free outcome with probability at least ½
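For readers without the textbook at hand, the following is a minimal Python sketch of the standard two-level FKS construction, not code from the paper. It assumes distinct non-negative integer keys smaller than the prime `P`, and uses the usual universal family ((a·x + b) mod P) mod m; all names are illustrative. Each level simply retries with fresh random coefficients until it succeeds, which is where the constant success probability per trial comes in.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; keys are assumed to be smaller than P


def random_hash(m, rng):
    """Draw one function from the family ((a*x + b) mod P) mod m."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m


def build_fks(keys, rng):
    """Two-level perfect hash table for distinct integer keys."""
    n = max(1, len(keys))
    # Level 1: retry until the squared bucket sizes sum to at most 4n.
    while True:
        h = random_hash(n, rng)
        buckets = [[] for _ in range(n)]
        for k in keys:
            buckets[h(k)].append(k)
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break
    # Level 2: a bucket of size s gets a collision-free table of size s*s.
    tables = []
    for bucket in buckets:
        size = len(bucket) ** 2
        while True:
            g = random_hash(size, rng) if size else None
            slots = [None] * size
            collision = False
            for k in bucket:
                pos = g(k)
                if slots[pos] is not None:
                    collision = True
                    break
                slots[pos] = k
            if not collision:
                break
        tables.append((g, slots))
    return h, tables


def contains(table, key):
    """O(1) membership query: two hash evaluations and one comparison."""
    h, tables = table
    g, slots = tables[h(key)]
    return bool(slots) and slots[g(key)] == key


rng = random.Random(0)
keys = rng.sample(range(10 ** 9), 200)
table = build_fks(keys, rng)
print(all(contains(table, k) for k in keys))
```

Total space stays linear because level 1 is only accepted when the second-level tables sum to at most 4n slots.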
22The Static Hashing Scheme
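- Choose a positive constant $\varepsilon$
- As $\varepsilon \to 0$, the failure probability decreases
- The total time and space of the data structure will be linear, with a factor of $1/\varepsilon$
- The n items come from the range $1 \ldots n^c$; they are viewed as hashed into an imaginary array A of size $n^c$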
23The Static Hashing Scheme(contd)
- Step 1 (build the partition tree)
- Number of nodes: O(n)
- Each node has $n^\varepsilon$ children
- Each child is associated with a distinct subarray of A of size $n^{c-\varepsilon}$
- Each leaf (subarray) with more than $n^\varepsilon$ items is recursively partitioned
- Total size: O(n)
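The recursive partitioning of Step 1 can be sketched in Python as follows. This is a simplified illustration rather than the paper's construction: a single parameter `fanout` (assumed to be at least 2) plays the role of both the number of children and the leaf capacity, only non-empty children are stored, and the dictionary layout is an assumption.

```python
def build_partition_tree(items, lo, hi, fanout):
    """Partition distinct integers from [lo, hi) into leaves of <= fanout items."""
    if len(items) <= fanout or hi - lo <= 1:
        return {"lo": lo, "hi": hi, "items": sorted(items)}
    width = -(-(hi - lo) // fanout)  # ceiling division: size of each subarray
    groups = {}
    for x in items:
        groups.setdefault((x - lo) // width, []).append(x)
    children = {}
    for idx, group in groups.items():
        c_lo = lo + idx * width
        c_hi = min(c_lo + width, hi)
        children[idx] = build_partition_tree(group, c_lo, c_hi, fanout)
    return {"lo": lo, "hi": hi, "children": children}


def leaves(node):
    """Yield the leaves of the partition tree from left to right."""
    if "items" in node:
        yield node
    else:
        for idx in sorted(node["children"]):
            yield from leaves(node["children"][idx])


items = list(range(0, 1000, 7))
tree = build_partition_tree(items, 0, 1000, 4)
print(max(len(leaf["items"]) for leaf in leaves(tree)))
```

Each leaf then becomes an independent small hashing subproblem.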
24The Static Hashing Scheme(contd)
- Step 2
- Using FKS Perfect Hashing
- Several trials will be required, since each trial succeeds with probability only about ½
- Total time complexity
- The total size of the subproblems is n
- Each subproblem has size at most $n^\varepsilon$
25The Static Hashing Scheme(contd)
- Size categories
- Divide the leaves into O(log n) categories
- For category i, the leaf sizes are in the range $n^\varepsilon/4^{i+1} \ldots n^\varepsilon/4^i$, for $i \ge 0$
- We will show that
- the time for this category is proportional to the sum of the sizes of the leaves in this category, plus $O(n/2^i)$
- With failure probability
- It follows that the total time is O(n), with failure probability
26The Static Hashing Scheme(contd)
- Succeed
- The items in a leaf are perfectly hashed
- Round
- One trial for each of the relevant leaves
- Group
- Organization of rounds
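The notion of a round can be illustrated with a toy Python simulation. It assumes each unsuccessful leaf succeeds independently with probability ½ per trial, the FKS guarantee taken at its boundary; the function name and parameters are illustrative, and real trials would actually attempt to hash the leaf's items.

```python
import random


def run_rounds(num_leaves, rng, p_success=0.5):
    """Simulate rounds; return the number of unsuccessful leaves after each."""
    unsuccessful = num_leaves
    history = []
    while unsuccessful > 0:
        # One round: one trial for every leaf that has not yet succeeded.
        unsuccessful = sum(1 for _ in range(unsuccessful)
                           if rng.random() >= p_success)
        history.append(unsuccessful)
    return history


history = run_rounds(1024, random.Random(0))
print(len(history), history[:5])
```

The count of unsuccessful leaves roughly halves per round, which is why rounds are organized into groups by how many leaves remain.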
27The Static Hashing Scheme(contd)
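- How to group rounds?
- 0th group: the rounds in which the number of unsuccessful leaves of the category is more than $n^{1-\varepsilon}2^i/\log n$
- j-th group: the rounds in which the number of unsuccessful leaves of the category is between $n^{1-\varepsilon}2^i/(2^j \log n)$ and $n^{1-\varepsilon}2^i/(2^{j-1} \log n)$ ($j \ge 1$)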
28The Static Hashing Scheme(contd)
- We will bound the number of rounds in each group and its failure probability
- 0th group: the number of rounds is $O(i + \log\log n)$, with failure probability
- j-th group: the number of rounds is $O(2^j)$, with failure probability
- Failure probability (over all groups)
- First of all, we show the total time taken in the j groups
29The Static Hashing Scheme(contd)
- Secondly, we show the number of rounds in the 0th group
- The leaves in the i-th category number at most $n / (n^\varepsilon/4^{i+1})$
- $\frac{n}{n^\varepsilon/4^{i+1}} \cdot (1/2)^x = \frac{n^{1-\varepsilon}2^i}{\log n}$
- $\Rightarrow x = 2 + i + \log\log n$ (the rounds in the 0th group)
- By the Chernoff bound [2]: if there are u unsuccessful leaves at some instant, then half of these leaves succeed in the next 2k rounds, with failure probability $1/2^{\Theta(uk)}$
- The failure probability at the end of the 0th group then follows with $k = 1$
30The Static Hashing Scheme(contd)
- We show the number of rounds in the j-th group
- $k = 2^j$
- It has $2 \cdot 2^j$ rounds
- $u = n^{1-\varepsilon}2^i/(2^j \log n)$
- With failure probability
- In total there are O(log n) groups, which gives the total failure probability
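As a consistency check on the quantities above, the arithmetic for the j-th group can be worked out explicitly. This is a sketch, not the paper's own derivation. It reads the slide as $k = 2^j$ and $u = n^{1-\varepsilon}2^i/(2^j \log n)$, and it assumes that one trial on a leaf costs time linear in the leaf size and that at most $n^{1-\varepsilon}2^i/(2^{j-1}\log n)$ leaves of size at most $n^\varepsilon/4^i$ are still unsuccessful when the group starts.

```latex
% Exponent in the Chernoff-type bound 1/2^{\Theta(uk)} for group j:
uk = \frac{n^{1-\varepsilon}\,2^i}{2^j \log n}\cdot 2^j
   = \frac{n^{1-\varepsilon}\,2^i}{\log n}

% Time spent in group j (rounds x unsuccessful leaves x leaf size):
2\cdot 2^j \cdot \frac{n^{1-\varepsilon}\,2^i}{2^{j-1}\log n}\cdot\frac{n^{\varepsilon}}{4^i}
   = \frac{4n}{2^i \log n}

% Summed over the O(\log n) groups of category i:
O(\log n)\cdot\frac{4n}{2^i \log n} = O\!\left(\frac{n}{2^i}\right)
```

Under these assumptions the exponent $uk$ does not depend on j, and the per-category cost of the later groups matches the $O(n/2^i)$ term quoted for each size category.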