Faster Suffix Tree Construction With Missing Suffix Links

Provided by: Jon86
1
Faster Suffix Tree Construction With Missing
Suffix Links
  • By Richard Cole and Ramesh Hariharan
  • Presented by B89502027
  • and B89705013

2
What is a Missing Suffix Link?
  • The definition of a suffix link implies that
    str(link(x)) is str(x) with its first symbol removed,
  • where link(x) must be a NODE.
  • When link(x) is not a node, the suffix link is
    missing!
  • Two situations where this happens:
  • Parameterized strings
  • Suffix trees for 2-dimensional arrays

3
The problem is
  • Parameterized strings and 2D arrays
  • The node degree may not be bounded by a constant;
    it can be some polynomial in n
  • Farach [5] handled polynomial degree but not missing
    suffix links
  • Baker [1] and Kosaraju [11] handled parameterized
    strings but not polynomial degree, in O(n log n) time
  • Giancarlo [7] handled 2D arrays, but still in
    O(n log n) time
  • We can handle both in O(n) time!!!

4
Our contribution: tools
  • Adding extra nodes and suffix links to the
    suffix tree while staying within O(n) space and O(n) time
  • Providing a hashing scheme whose failure
    probability is inverse exponential.

5
General Settings
  • Quasi-suffix collection
  • An ordered collection of strings s_1, s_2, ..., s_n is a
    quasi-suffix collection iff the following hold:
  • 1. |s_1| = n, and |s_i| = |s_{i-1}| - 1; therefore |s_n| = 1
  • 2. No s_i is a prefix of another s_j
  • 3. Suppose s_i and s_j have a common prefix of length
    L > 0; then s_{i+1} and s_{j+1} have a common prefix of
    length at least L - 1.
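  • Example: the suffixes of aabb (an end symbol is assumed on each string so that condition 2 holds)
  • aabb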
  • abb
  • bb
  • b
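
The three conditions above can be checked by brute force. The following is only a sketch for small inputs: the function names `lcp` and `is_quasi_suffix_collection` are made up for this illustration, and the end symbol `$` is an assumption added so that no string is a prefix of another.

```python
def lcp(a, b):
    """Length of the longest common prefix of two sequences."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def is_quasi_suffix_collection(strings):
    """Brute-force check of conditions 1-3 (quadratic number of comparisons)."""
    n = len(strings)
    # Condition 1: lengths are n, n-1, ..., 1.
    if any(len(s) != n - i for i, s in enumerate(strings)):
        return False
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            common = lcp(strings[i], strings[j])
            # Condition 2: s_i must not be a prefix of s_j.
            if common == len(strings[i]):
                return False
            # Condition 3: the next pair shares a prefix of length >= common - 1.
            if common > 0 and i + 1 < n and j + 1 < n:
                if lcp(strings[i + 1], strings[j + 1]) < common - 1:
                    return False
    return True


# Suffixes of "aabb" with an assumed end symbol "$".
with_end = ["aabb$", "abb$", "bb$", "b$", "$"]
# Without the end symbol, "b" is a prefix of "bb", so condition 2 fails.
without_end = ["aabb", "abb", "bb", "b"]
print(is_quasi_suffix_collection(with_end), is_quasi_suffix_collection(without_end))
```

For ordinary suffixes, condition 3 always holds with equality; the interesting cases later in the talk are those where it holds with strict inequality.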

6
General Settings (cont'd)
  • Multiple quasi-suffix collection
  • Several quasi-suffix collections with L strings
    in all
  • Any pair of strings s_i, s_j satisfies conditions 2 and 3
    of a quasi-suffix collection
  • Character Oracle
  • Supplies the ith character of the jth string of the
    collection on demand in O(1) time
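
For the suffixes of an ordinary string, a character oracle is just index arithmetic. A minimal sketch, assuming 0-based indices and a hypothetical class name `SuffixOracle` (neither comes from the slides):

```python
class SuffixOracle:
    """Character oracle for the suffixes of one text; no suffix is ever copied."""

    def __init__(self, text):
        self.text = text

    def char(self, j, i):
        """Return character i of suffix j (both 0-based) in O(1) time."""
        return self.text[j + i]


oracle = SuffixOracle("aabb$")
# Suffix 1 is "abb$"; its character at index 2 is "b".
print(oracle.char(1, 2))
```

The point of the oracle is that the construction never needs the strings written out explicitly, which would already cost quadratic space.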

7
Suffix trees for parameterized strings
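  • Each suffix s' of a string s is transformed to num(s'),
    e.g., ?b?b? => 0b2b2, where ? stands for one parameter symbol
  • How does condition 1 hold?
  • How does condition 2 hold?
  • How does condition 3 hold?
  • Example: num(s') for every suffix s' of ?bb?b?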
  • 0bb3b2
  • bb0b2
  • b0b2
  • 0b2
  • b0
  • 0
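
The encoding in the example above can be reproduced with a small sketch. It assumes that num replaces the first occurrence of a parameter symbol by 0 and every later occurrence by the distance to its previous occurrence, which is what the listed strings suggest. The parameter symbol is written `x` here, and the helper names `prev_distances`, `num_char`, and `encoded_suffix` are hypothetical. Note that `num_char` answers in O(1) after linear preprocessing, so it also acts as a character oracle for the encoded suffixes.

```python
def prev_distances(s, params):
    """prev[k] = distance from position k back to the previous occurrence of the
    same parameter symbol in s, or 0 if there is none (or s[k] is not a parameter)."""
    last, prev = {}, []
    for k, ch in enumerate(s):
        prev.append(k - last[ch] if ch in params and ch in last else 0)
        if ch in params:
            last[ch] = k
    return prev


def num_char(s, params, prev, j, i):
    """Character i of num(s[j:]), computed in O(1) from the prev array."""
    ch = s[j + i]
    if ch not in params:
        return ch
    d = prev[j + i]
    # A previous occurrence only counts if it lies inside the suffix s[j:].
    return d if 0 < d <= i else 0


def encoded_suffix(s, params, j):
    prev = prev_distances(s, params)
    return [num_char(s, params, prev, j, i) for i in range(len(s) - j)]


s, params = "xbbxbx", {"x"}
for j in range(len(s)):
    print("".join(str(c) for c in encoded_suffix(s, params, j)))
```

Returning a list rather than a string keeps distances of 10 or more unambiguous; the join is only for display.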

8
Suffix trees for 2D arrays
  • There are m + n - 1 diagonals in an m x n array
  • For each diagonal, form a square array
  • Each square array is decomposed into
    L-shaped pieces
  • Each piece is mapped to a number (Giancarlo [7]),
    and a square becomes a string num(s), forming a
    quasi-suffix collection (each string with a different
    ending symbol)
  • Since there are m + n - 1 diagonals, the m + n - 1 collections
    form a multiple quasi-suffix collection
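
The diagonal count and the shape of each collection can be illustrated as follows. This sketch assumes that the string attached to cell (i, j) is the largest square with its top-left corner at (i, j), which is one reading of the slide; the function name `diagonals` is hypothetical.

```python
def diagonals(m, n):
    """Group the cells of an m x n array by diagonal (key j - i) and record,
    for each cell, the side of the largest square starting at that cell."""
    diag = {}
    for i in range(m):
        for j in range(n):
            diag.setdefault(j - i, []).append(min(m - i, n - j))
    return diag


d = diagonals(3, 4)
print(len(d))   # number of diagonals of a 3 x 4 array
print(d[0])     # square sides along the main diagonal
```

Along every diagonal the sides shrink by exactly one per step and end at 1, mirroring condition 1 of a quasi-suffix collection.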

9
First: McCreight's Algorithm
  • Definition of suffix link
  • Since condition 3 is always satisfied with equality,
    a suffix link is defined for each node x, and
    link(x) is defined to be a node.
  • Two stages: rescanning and, possibly, scanning
  • Rescan down from link(par(x)) until the position
    for link(x) is found
  • If no node is present there, insert one and an edge for
    the leaf (no scan)
  • Otherwise, just scan down (as we did in Ukkonen's algorithm)
  • In either case, link(x) is well defined!
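
Why link(x) is always a node for ordinary strings can be checked by brute force: if a substring w is followed by two different symbols, so is w with its first symbol removed. The sketch below is not McCreight's algorithm; it only enumerates the strings of the internal nodes naively, and the function name `internal_nodes` is hypothetical.

```python
def internal_nodes(text):
    """Strings of the internal suffix-tree nodes of text (which should end in a
    unique end symbol): the substrings followed by at least two distinct symbols."""
    follows = {}
    n = len(text)
    for i in range(n):
        for j in range(i, n):
            follows.setdefault(text[i:j], set()).add(text[j])
    return {w for w, nxt in follows.items() if len(nxt) >= 2}


nodes = internal_nodes("banana$")
# Every non-root node x has a node whose string is str(x) minus its first symbol.
print(all(w[1:] in nodes for w in nodes if w))
```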

10
Two problems
  • link(par(x)) may not be defined
  • There may be no node at link(x)!
  • This happens because condition 3 need not be satisfied with
    equality, e.g., in our parameterized string case!
11
Our Algorithm
  • Two modifications to McCreight's algorithm
  • Traverse up to find an ancestor with a suffix
    link
  • Copy nodes backwards from the destination found
    above
  • Redefinition of suffix link
  • link(x) is the node y such that if str(x) is the
    longest common prefix of s_i and s_j, then str(y)
    is the longest common prefix of s_{i+1} and
    s_{j+1}, where |str(y)| = |str(x)| - 1.
  • link(x) need not be defined for every node x!

12
Some definitions
  • nanc(x): the nearest ancestor of x with a suffix link
  • Real/imaginary nodes
  • If a new scanning stage begins within an edge
    (condition 3 holding with strict inequality), we use an
    imaginary node.
  • An imaginary node has only 1 child, whereas a real
    node has at least 2!
  • There are O(n) real and imaginary nodes
    (since there are at most n leaves)

13
Some facts
  • The number of real and imaginary nodes is O(n)
  • The total number of children of real and imaginary
    nodes is O(n)
  • The total length of the scanned portions is O(n)

14
More features
  • Back-propagated nodes
  • Must have a suffix link
  • Have only one child
  • When scanning down from link(nanc(x)) to link(x),
    every second node (not including the first and the
    last) is back-propagated.

15
Invariant 1
  • If a node x is back-propagated in direction u,
    then its parent is not back-propagated in
    direction u where u is a prefix of u.

16
Time Complexity
  • Two parts to be analyzed
  • Finding nanc(x)
  • Rescanning down
  • Creating a new back-propagated node
  • Upgrade imaginary node to back-propagated node,
    by adding suffix link to it!
  • Adding a real/imaginary node for link(x)
  • Time O(1) 1 2

17
Bounding back-propagated nodes
  • Defining a BP tree
  • All nodes except the root are back-propagated nodes
  • BP forest
  • Trees rooted at the various real/imaginary nodes that
    are back-propagated. (Imagine the suffix tree as a
    BP forest!)
  • Decomposing a BP tree into paths
  • From the root down to a node y such that either
  • 1. there is no valid direction for y
  • 2. there exists a direction u in which y has
    not been back-propagated!
  • Decompose recursively

18
Bounding back-propagated nodes (cont'd)
  • Extend paths backward in the suffix tree (in a
    direction not implied by a back-propagated node)
    until either
  • 1. a node is reached
  • 2. no valid direction is available
  • Lemma 1: two distinct extended paths cannot
    intersect.
  • Lemma 2: if an extended path terminates at a node y
    (not by running out of valid directions), y cannot
    be a back-propagated node.
  • Lemma 3: the total number of paths is O(n), and hence the
    total number of back-propagated nodes is O(n)

19
Time Complexity (cont'd)
  • The cost of finding nanc(x) is analyzed in the same
    way as discussed for Ukkonen's algorithm, and is bounded by O(n)
  • Combining this with Lemma 3, we have the theorem

20
The Hashing Scheme
  • Goal
  • Hash O(n) pairs (node, following symbol)
  • O(n) to build, O(1) per query
  • Failure probability: inverse exponential

21
FKS Perfect Hashing
  • Fredman, Komlós, Szemerédi
  • Refer to the textbook for the algorithm
  • Hashes n items from the range [0, poly(n)] into [0, Θ(n)]
  • Ensures no collisions with probability > 1/2
22
The Static Hashing Scheme
  • Choose a positive constant ε
  • As ε → 0, the failure probability decreases
  • Total time and space of the data structure will be linear, with a
    factor of 1/ε
  • The items come from the range 1..n^c; the n items are hashed into an imaginary array A
    of size n^c

23
The Static Hashing Scheme (cont'd)
  • Step 1 (build a partition tree)
  • Number of nodes: O(n)
  • Each node has n^ε children
  • Each child is associated with a distinct subarray
    of A of size n^{c-ε}
  • Each leaf (subarray) with more than n^ε items is
    recursively partitioned
  • Total size O(n)
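
The partitioning step can be sketched as a recursive split of the key range. The sketch uses small integer stand-ins `branching` and `max_leaf` for n^ε, stores only non-empty children, and represents subarrays as half-open ranges; these choices and the function name `partition` are assumptions, not the construction from the slides.

```python
def partition(items, lo, hi, branching, max_leaf):
    """Split the range [lo, hi) into `branching` equal subranges, recursing on any
    subrange that still holds more than `max_leaf` of the distinct integer items.
    Returns the leaves as (lo, hi, items) triples."""
    if len(items) <= max_leaf or hi - lo <= 1:
        return [(lo, hi, items)]
    width = -(-(hi - lo) // branching)  # ceiling division
    buckets = {}
    for x in items:
        buckets.setdefault((x - lo) // width, []).append(x)
    leaves = []
    for k in sorted(buckets):  # empty children are never materialized
        child_lo = lo + k * width
        child_hi = min(child_lo + width, hi)
        leaves += partition(buckets[k], child_lo, child_hi, branching, max_leaf)
    return leaves


items = [3, 7, 8, 9, 100, 101, 102, 103, 900]
for leaf in partition(items, 0, 1000, branching=10, max_leaf=2):
    print(leaf)
```

Each leaf then holds few enough items to be handed to a perfect-hashing trial on its own.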

24
The Static Hashing Scheme (cont'd)
  • Step 2
  • Use FKS perfect hashing on each leaf
  • Several trials will be required, since each trial succeeds with probability only > 1/2
  • Total time complexity:
  • The total size of the subproblems is n
  • Each subproblem has size at most n^ε
25
The Static Hashing Scheme (cont'd)
  • Size categories
  • Divide the leaves into O(log n) categories
  • For category i, the leaf sizes are in the
    range n^ε/4^{i+1} .. n^ε/4^i, for i ≥ 0
  • We will show that
  • the time for this category is proportional to the sum
    of the sizes of the leaves in this category, plus O(n/2^i)
  • With failure probability
  • It follows that the total time is O(n), with failure
    probability
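
The step from the per-category bound to the O(n) total can be written out as follows. The notation S_i for the total size of the leaves in category i and the constants c_1, c_2 are introduced here only for this sketch; it assumes the per-category time is at most c_1 S_i + c_2 n/2^i and uses the fact that the leaves together hold n items.

```latex
\sum_{i \ge 0} \left( c_1 S_i + c_2 \frac{n}{2^i} \right)
  \;\le\; c_1 \sum_{i \ge 0} S_i \;+\; c_2\, n \sum_{i \ge 0} 2^{-i}
  \;\le\; c_1 n + 2 c_2 n
  \;=\; O(n).
```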

26
The Static Hashing Scheme (cont'd)
  • Succeed
  • The items in a leaf are perfectly hashed
  • Round
  • One trial for each of the relevant leaves
  • Group
  • An organization of rounds

27
The Static Hashing Scheme (cont'd)
  • How are rounds grouped?
  • 0th group: for a category, the rounds during which the number of unsuccessful
    leaves exceeds n^{1-ε} 2^i / (log n)
  • jth group: for a category, the rounds during which the number of unsuccessful
    leaves is between n^{1-ε} 2^i / (2^j log n) and n^{1-ε} 2^i / (2^{j-1} log
    n) (j ≥ 1)

28
The Static Hashing Scheme (cont'd)
  • We will bound the number of rounds and the failure probability in
    each group
  • 0th group: number of rounds O(i + log log n), with failure
    probability
  • jth group: number of rounds O(2^j), with failure probability
  • Failure probability (over all groups)
  • First of all, we show the total time taken in the jth
    groups
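
One way to see where the O(n/2^i) term for category i can come from is sketched below. It is a reconstruction under stated assumptions, not the slide's own formula: a trial on a leaf is assumed to cost time linear in the leaf size (at most n^ε/4^i in category i), the jth group (j ≥ 1) is assumed to have O(2^j) rounds with at most n^{1-ε} 2^i / (2^{j-1} log n) unsuccessful leaves, and there are O(log n) groups.

```latex
\underbrace{O(2^j)}_{\text{rounds}}
\cdot \underbrace{\frac{n^{1-\varepsilon}\, 2^i}{2^{j-1} \log n}}_{\text{unsuccessful leaves}}
\cdot \underbrace{\frac{n^{\varepsilon}}{4^i}}_{\text{leaf size}}
  \;=\; O\!\left( \frac{n}{2^i \log n} \right),
\qquad
\sum_{j=1}^{O(\log n)} O\!\left( \frac{n}{2^i \log n} \right)
  \;=\; O\!\left( \frac{n}{2^i} \right).
```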

29
The Static Hashing Scheme (cont'd)
  • Second, we bound the number of rounds in the 0th group
  • The leaves in the ith category number at most n / (n^ε/4^{i+1})
  • n / (n^ε/4^{i+1}) · (1/2)^x = n^{1-ε} 2^i / log n
  • => x = 2 + i + log log n (the rounds in the 0th
    group)
  • By Chernoff bound 2: if there are u unsuccessful
    leaves at some instant, then half of these
    leaves succeed in the next 2k rounds, with
    failure probability 1/2^{Θ(uk)}
  • The failure probability at the end of the 0th group is thus
    (k = 1)

30
The Static Hashing Scheme (cont'd)
  • Next, we bound the number of rounds in the jth group
  • k = 2^j
  • Has 2·2^j rounds
  • u = n^{1-ε} 2^i / (2^j log n)
  • With failure probability
  • There are O(log n) groups in total, thus the total failure
    probability