Faster Suffix Tree Construction With Missing Suffix Links

Provided by: Jon86

1
Faster Suffix Tree Construction With Missing
Suffix Links

By Richard Cole and Ramesh Hariharan
Presented by B89502027 and B89705013

2
What Is a Missing Suffix Link?

The definition of a suffix link implies that str(link(x)) is str(x) with its first symbol removed, where link(x) must be a NODE.
When link(x) is not a node, the suffix link is missing!
Two situations where this happens:
Parameterized strings
Suffix trees for 2-dimensional arrays

3
The Problem

Parameterized strings and 2D arrays
The node degree may not be bounded by a constant, i.e., it can be some polynomial in n
Farach [5] handled polynomial-size alphabets but not missing suffix links
Baker [1] and Kosaraju [11] handled parameterized strings, but for polynomial-size alphabets only in O(n log n)
Giancarlo [7] handled 2D arrays, but still in O(n log n)
We can solve both in O(n)!!!

4
Our Contribution and Tools

Adding extra nodes and suffix links to the suffix tree while staying within O(n) space and O(n) time
Providing a hashing scheme whose failure probability is inverse exponential

5
General Settings

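Quasi-suffix collection
An ordered collection of strings $s_1, s_2, \ldots, s_n$ is a quasi-suffix collection iff the following hold:
1. $|s_1| = n$ and $|s_i| = |s_{i-1}| - 1$, therefore $|s_n| = 1$
2. No $s_i$ is a prefix of another $s_j$
3. Suppose $s_i$ and $s_j$ have a common prefix of length $L > 0$; then $s_{i+1}$ and $s_{j+1}$ have a common prefix of length at least $L - 1$
Example (the suffixes of aabb; condition 2 additionally needs a unique end symbol on each string, since b is a prefix of bb):
aabb
abb
bb
b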
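
The three conditions can be checked mechanically. The following is a minimal brute-force sketch in Python, written only for illustration: the helper names lcp and is_quasi_suffix_collection are assumptions of this example and do not come from the paper, and the end marker $ is added so that condition 2 holds for the suffixes of aabb.

```python
from itertools import combinations


def lcp(a, b):
    """Length of the longest common prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def is_quasi_suffix_collection(strings):
    """Check conditions 1-3 for an ordered collection (0-indexed here)."""
    n = len(strings)
    # Condition 1: the lengths are n, n-1, ..., 1.
    if [len(s) for s in strings] != list(range(n, 0, -1)):
        return False
    for i, j in combinations(range(n), 2):
        common = lcp(strings[i], strings[j])
        # Condition 2: no string is a prefix of another.
        if common == min(len(strings[i]), len(strings[j])):
            return False
        # Condition 3: the successors share a prefix of length >= common - 1.
        if common > 0 and j + 1 < n:
            if lcp(strings[i + 1], strings[j + 1]) < common - 1:
                return False
    return True


text = "aabb$"
suffixes = [text[i:] for i in range(len(text))]
print(is_quasi_suffix_collection(suffixes))
print(is_quasi_suffix_collection(["aabb", "abb", "bb", "b"]))
```

The check is quadratic in the number of strings, so it is a way to test small examples, not part of the linear-time construction.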

6
General Settings (contd)

Multiple quasi-suffix collection
Several quasi-suffix collections with L strings in all
Any pair of strings $s_i$, $s_j$ satisfies conditions 2 and 3 of a quasi-suffix collection
Character oracle
Supplies the ith character of the jth string of the collection on demand in O(1) time
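
As a rough illustration of the character-oracle interface, here is a small Python sketch for the simplest case, where the strings are the ordinary suffixes of several texts. The class name SuffixOracle, its method char, and the use of distinct end markers ($ and #) are assumptions of this example, not something the slides specify.

```python
class SuffixOracle:
    """Character oracle for a multiple collection made of the suffixes of several texts."""

    def __init__(self, texts):
        self.texts = texts
        # Map the global string index j to (text index, start offset) once, up front.
        self.index = [
            (t, start)
            for t, text in enumerate(texts)
            for start in range(len(text))
        ]

    def char(self, j, i):
        """Return the i-th character (0-based) of the j-th string in O(1) time."""
        t, start = self.index[j]
        return self.texts[t][start + i]


oracle = SuffixOracle(["aabb$", "ab#"])
print(oracle.char(1, 1))  # second character of the suffix "abb$"
```

The point of the interface is that the construction never needs the strings written out explicitly; it only asks for single characters.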

7
Suffix trees for parameterized strings

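Each suffix s of the string is transformed to num(s), in which every parameter symbol is replaced by the distance to its previous occurrence (0 at its first occurrence), e.g., ?b?b? → 0b2b2, where ? is a parameter symbol
Why does condition 1 hold?
Why does condition 2 hold?
Why does condition 3 hold?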
0bb3b2
bb0b2
b0b2
0b2
b0
0
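
The num(s) transformation can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names, the choice of x as the parameter symbol, and the input string xbbxbx (chosen because its encoded suffixes reproduce the list above) are assumptions of this example. The helper suffix_char shows one way to answer character queries on an encoded suffix from the encoding of the whole string, without re-encoding each suffix.

```python
def num(s, params):
    """Replace each parameter symbol by the distance to its previous occurrence (0 if none)."""
    last = {}
    out = []
    for pos, ch in enumerate(s):
        if ch in params:
            out.append(pos - last[ch] if ch in last else 0)
            last[ch] = pos
        else:
            out.append(ch)
    return out


def suffix_char(full, j, i):
    """i-th symbol of num(s[j:]), read off full = num(s) in O(1) time."""
    v = full[j + i]
    # A back-pointer that reaches before the start of the suffix becomes 0.
    if isinstance(v, int) and v > i:
        return 0
    return v


s, params = "xbbxbx", {"x"}
for start in range(len(s)):
    print("".join(str(v) for v in num(s[start:], params)))
```

Encoding every suffix separately, as the loop does, takes quadratic time; suffix_char is the kind of O(1) character oracle the general setting asks for.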

8
Suffix trees for 2D arrays

There are m+n-1 diagonals in an m x n array
For each diagonal, form a square array
Each square array is decomposed into L-shapes
Each L-shape is mapped to a number (Giancarlo [7]), so a square becomes a string num(s), forming a quasi-suffix collection (each string with a different ending symbol)
Since there are m+n-1 diagonals, the m+n-1 squares give a multiple quasi-suffix collection
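
To make the counting and the decomposition concrete, here is a small Python sketch. It is only an illustration under stated assumptions: the function names are invented for this example, and the order in which the cells of an L-shape are listed (row part first, then column part) is one possible convention that may differ from Giancarlo's. The mapping of each L-shape to a number is not shown.

```python
def diagonals(m, n):
    """Top-left anchor (row, col) of each of the m + n - 1 diagonals of an m x n array."""
    return [(r, 0) for r in range(m - 1, 0, -1)] + [(0, c) for c in range(n)]


def l_shapes(square):
    """Split a k x k square into k L-shapes; shape t covers row t and column t up to the diagonal."""
    k = len(square)
    shapes = []
    for t in range(k):
        row_part = [square[t][c] for c in range(t)]       # row t, left of the diagonal
        col_part = [square[r][t] for r in range(t + 1)]   # column t, down to the diagonal
        shapes.append(tuple(row_part + col_part))
    return shapes


print(len(diagonals(3, 4)))
print(l_shapes([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Shape t has 2t + 1 cells, so the k shapes together cover all k * k cells of the square exactly once.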

9
First! McCreight's Algorithm

Definition of suffix link
Since condition 3 is always satisfied with equality, a suffix link is defined for each node x, and link(x) is defined to be a node.
Two stages: rescanning and, possibly, scanning
Rescan down from link(par(x)) until the position for link(x) is found
If no node is present there, insert one, together with an edge for the new leaf (no scan)
Otherwise, just scan down (as we did in Ukkonen's algorithm)
In either case, link(x) is well defined!

10
Two problems

link(par(x)) may not be defined
There may be no node at link(x)!
This happens because condition 3 need not be satisfied with equality, e.g., in our parameterized string case!
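
A small brute-force experiment shows the second problem on a concrete input. The Python sketch below is illustrative only: the function names and the example string xaxcxayc$ (with x and y as parameter symbols) are assumptions of this example. It lists the branching nodes of the compacted trie and reports every node x whose would-be link target, the prefix of length |str(x)| - 1 of the next string, is not a node.

```python
def num(s, params):
    """Parameterized encoding: distance to the previous occurrence, 0 at the first one."""
    last, out = {}, []
    for pos, ch in enumerate(s):
        if ch in params:
            out.append(pos - last[ch] if ch in last else 0)
            last[ch] = pos
        else:
            out.append(ch)
    return tuple(out)


def branching_prefixes(strings):
    """Internal nodes of the compacted trie: prefixes followed by >= 2 distinct symbols."""
    followers = {}
    for s in strings:
        for d in range(len(s)):
            followers.setdefault(s[:d], set()).add(s[d])
    return {p for p, nxt in followers.items() if len(nxt) >= 2}


def missing_links(strings):
    """Non-root nodes x whose link target is not a node of the trie."""
    nodes = branching_prefixes(strings)
    missing = set()
    for x in nodes:
        if not x:
            continue  # the root needs no suffix link
        i = next(k for k, s in enumerate(strings) if s[:len(x)] == x)
        target = strings[i + 1][:len(x) - 1]
        if target not in nodes:
            missing.add(x)
    return missing


plain = "aabb$"
print(missing_links([plain[i:] for i in range(len(plain))]))
p = "xaxcxayc$"
print(missing_links([num(p[i:], {"x", "y"}) for i in range(len(p))]))
```

For the ordinary string no link is missing. For the parameterized string, the suffixes starting at positions 0 and 4 share the prefix 0a of length 2, while their successors share a0c of length 3, so condition 3 holds with strict inequality and there is no node at a for the node 0a to link to.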

11
Our Algorithm

Two modifications to McCreight's algorithm
Traverse up to find an ancestor with a suffix link
Copy nodes backwards from the destination found above
Redefinition of suffix link
link(x) is the node y such that if str(x) is the longest common prefix of $s_i$ and $s_j$, then str(y) is the longest common prefix of $s_{i+1}$ and $s_{j+1}$, where $|str(y)| = |str(x)| - 1$.
link(x) need not be defined for every node x!

12
Some definitions

nanc(x): the nearest ancestor of x with a suffix link
Real/imaginary nodes
If a new scanning stage begins within an edge (condition 3 holds with strict inequality, >), we use an imaginary node.
An imaginary node has only 1 child, whereas a real node has at least 2!
There are at most O(n) real and imaginary nodes (since there are at most n leaves)
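
The walk that finds nanc(x) can be pictured with a tiny Python sketch. The Node class and its fields (parent, link, imaginary) are assumptions made for this illustration, as is the convention that the root carries a suffix link to itself so that the walk always stops.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    parent: Optional["Node"] = None
    link: Optional["Node"] = None   # suffix link, None while it is missing
    imaginary: bool = False         # True for a one-child node created inside an edge


def nanc(x):
    """Nearest proper ancestor of a non-root node x that has a suffix link, and the steps walked."""
    steps = 1
    y = x.parent
    while y.link is None:
        y = y.parent
        steps += 1
    return y, steps


root = Node()
root.link = root                       # assumption: the root links to itself
a = Node(parent=root)
b = Node(parent=a, imaginary=True)
c = Node(parent=b)
print(nanc(c)[1])
```

The returned step count is the quantity that the time analysis has to bound over the whole construction.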

13
Some facts

The number of real and imaginary nodes is O(n)
The total number of children of real and imaginary nodes is O(n)
The total length of the scanned portions is O(n)

14
More features

Back-propagated nodes
Must have a suffix link
Have only one child
When scanning down from link(nanc(x)) to link(x), every second node (not including the first and the last) is back-propagated.

15
Invariant 1

If a node x is back-propagated in direction u, then its parent is not back-propagated in direction u', where u' is a prefix of u.

16
Time Complexity

Two parts to analyze:
1. Finding nanc(x)
2. Rescanning down
Creating a new back-propagated node
Upgrading an imaginary node to a back-propagated node, by adding a suffix link to it!
Adding a real/imaginary node for link(x)
Time: O(1) + (1) + (2)

17
Bounding Back-Propagated Nodes

Defining a BP tree
All nodes except the root are back-propagated nodes
BP forest
Trees rooted at various real/imaginary nodes that are back-propagated. (Imagine the suffix tree as a BP forest!)
Decomposing a BP tree into paths
From the root down to a node y such that either
1. there is no valid direction for y, or
2. there exists a direction u in which y has not been back-propagated!
Decompose recursively

18
Bounding Back-Propagated Nodes (contd)

Extend the paths on the suffix tree backward (in a direction not implied by a back-propagated node) until either
1. a node is reached, or
2. no valid direction is available
Lemma 1: two distinct extended paths cannot intersect.
Lemma 2: if an extended path terminates at a node y (not by running out of valid directions), y cannot be a back-propagated node.
Lemma 3: the total number of paths is O(n), and hence the total number of back-propagated nodes is O(n)

19
Time Complexity (contd)

The process of finding nanc(x) is bounded by O(n), in the same way as discussed for Ukkonen's algorithm
Combining this with Lemma 3, we have the theorem

20
The Hashing Scheme

Goal
Hash O(n) pairs (node, following symbol)
O(n) construction, O(1) query
Failure probability: inverse exponential

21
FKS Perfect Hashing

Fredman, Komlós, Szemerédi
Refer to the textbook for the algorithm
Hashes n items from the range [0, poly(n)] into [0, $\Theta(n)$]
Ensures a collision-free result with probability > ½
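
For readers without the textbook at hand, here is a compact Python sketch of two-level perfect hashing in the FKS style. It is an illustration under assumptions, not the presenters' code: the hash family ((a*x + b) mod P) mod m, the prime P, the space threshold 4n, and all function names are choices made for this example. Keys are assumed to be distinct non-negative integers smaller than P, and the key set is assumed to be non-empty.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime larger than any key used here


def random_hash(m, rng):
    """Draw h(x) = ((a*x + b) mod P) mod m from a universal family."""
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: ((a * x + b) % P) % m


def build_fks(keys, rng):
    """Two levels: n top buckets; a bucket with s keys gets a collision-free table of size s*s."""
    n = len(keys)
    while True:  # retry until the second-level tables take O(n) space in total
        top = random_hash(n, rng)
        buckets = [[] for _ in range(n)]
        for k in keys:
            buckets[top(k)].append(k)
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break
    tables = []
    for bucket in buckets:
        size = len(bucket) ** 2
        while True:  # each trial is collision-free with probability > 1/2
            h = random_hash(size, rng) if size else None
            slots = [None] * size
            ok = True
            for k in bucket:
                pos = h(k)
                if slots[pos] is not None:
                    ok = False
                    break
                slots[pos] = k
            if ok:
                break
        tables.append((h, slots))
    return top, tables


def contains(fks, key):
    """O(1) membership query: one top-level hash, one second-level hash."""
    top, tables = fks
    h, slots = tables[top(key)]
    return bool(slots) and slots[h(key)] == key


fks = build_fks([3, 17, 42, 99, 1000003], random.Random(0))
print(contains(fks, 42), contains(fks, 5))
```

Each second-level trial fails with probability below ½, which is why repeated trials are needed and why the later slides bound the number of rounds.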

22
The Static Hashing Scheme

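Choose a positive constant $\varepsilon$
As $\varepsilon \to 0$, the failure probability decreases
The total time and space of the data structure will be linear, with a factor of $1/\varepsilon$
The n items, drawn from the range 1..$n^c$, are hashed into an imaginary array A of size $n^c$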

23
The Static Hashing Scheme (contd)

Step 1 (build a partition tree)
# of nodes: O(n)
Each node has $n^{\varepsilon}$ children
Each child is associated with a distinct subarray of A of size $n^{c-\varepsilon}$
Each leaf (subarray) with more than $n^{\varepsilon}$ items is recursively partitioned
Total size O(n)
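
The recursive partitioning of Step 1 can be sketched as follows in Python. This is a simplified illustration: the function name partition and its parameters are assumptions of this example, children are kept in a dictionary so that only non-empty ones are created, and fanout and cap play the role that $n^{\varepsilon}$ plays on the slide. Items are assumed to be distinct integers inside the range, and fanout is assumed to be at least 2.

```python
def partition(items, lo, size, fanout, cap):
    """Return the leaves (lo, size, items) of the partition tree over the range [lo, lo + size)."""
    if len(items) <= cap or size <= 1:
        return [(lo, size, items)]
    child = -(-size // fanout)  # ceil(size / fanout): width of each child's subarray
    groups = {}
    for x in items:
        groups.setdefault((x - lo) // child, []).append(x)
    leaves = []
    for g in sorted(groups):  # only non-empty children are created
        leaves.extend(partition(groups[g], lo + g * child, child, fanout, cap))
    return leaves


items = [1, 2, 3, 50, 51, 52, 53, 54, 900]
for leaf in partition(items, 0, 1024, 4, 2):
    print(leaf)
```

Every leaf ends up with at most cap items, which is what makes each leaf a small enough subproblem for the perfect hashing of Step 2.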

24
The Static Hashing Scheme (contd)

Step 2
Use FKS perfect hashing
Several trials will be required, since each trial succeeds only with probability > ½
Analyze the total time complexity
The total size of the subproblems is n
Each subproblem has size at most $n^{\varepsilon}$

25
The Static Hashing Scheme (contd)

Size categories
Divide the leaves into O(log n) categories
For category i, the leaf sizes are in the range $[n^{\varepsilon}/4^{i+1},\ n^{\varepsilon}/4^{i}]$ for $i \ge 0$
We will show that
the time for this category is proportional to the sum of the sizes of the leaves in this category: $O(n/2^i)$
With failure probability
It follows that the total time is O(n) with failure
probability

26
The Static Hashing Scheme (contd)

Succeed: the items in a leaf are perfectly hashed
Round: one trial for each of the relevant leaves
Group: an organization of rounds

27
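The Static Hashing Scheme (contd)

How to group the rounds?
0th group: the rounds for a category while its number of unsuccessful leaves is at least $n^{1-\varepsilon} 2^i / \log n$
jth group: the rounds for a category while its number of unsuccessful leaves is between $n^{1-\varepsilon} 2^i / (2^j \log n)$ and $n^{1-\varepsilon} 2^i / (2^{j-1} \log n)$ ($j \ge 1$)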

28
The Static Hashing Scheme (contd)

We will show the # of rounds in each group and its failure probability
0th group: # of rounds is $O(i + \log\log n)$, with failure
probability
jth group: # of rounds is $O(2^j)$, with failure probability
Failure probability (over all groups)
First of all, we show the total time taken in the j
groups

29
The Static Hashing Scheme (contd)

Second, we bound the # of rounds in the 0th group
The leaves in the ith category number at most $n / (n^{\varepsilon}/4^{i+1})$
$\frac{n}{n^{\varepsilon}/4^{i+1}} \cdot (1/2)^x = \frac{n^{1-\varepsilon} 2^i}{\log n}$
$\Rightarrow x = 2 + i + \log\log n$ (the rounds in the 0th group)
By Chernoff bound 2: if there are u unsuccessful leaves at some instant of time, then half of these leaves succeed in the next 2k rounds, with failure probability $1/2^{\Theta(uk)}$
The failure probability at the end of the 0th group is thus
(k = 1)

30
The Static Hashing Scheme (contd)