Two implementation issues - PowerPoint PPT Presentation

Provided by: erict9
Learn more at: http://www.cse.msu.edu

1
Two implementation issues
  • Alphabet size
  • Generalizing to multiple strings

3
Effects of alphabet size on suffix trees
  • We have generally been assuming that the trees
    are built in such a way that, from any node, we
    can find the edge for any specific character of
    Σ in constant time
  • For example, with an array of size |Σ| at each
    node
  • This takes Θ(m|Σ|) space.
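
A minimal sketch of the array-per-node layout, assuming a naive uncompressed suffix trie rather than a real suffix tree (the names ArrayNode, build_suffix_trie and contains are illustrative, not from the slides). It shows why lookup is constant time per character and why every node pays for the whole alphabet, used or not:

```python
class ArrayNode:
    """Node with one child slot per alphabet character."""
    __slots__ = ("children",)

    def __init__(self, sigma_size):
        # One slot per character of the alphabet, whether used or not.
        self.children = [None] * sigma_size


def build_suffix_trie(text, alphabet):
    """Naive quadratic suffix trie; for illustration only."""
    index = {ch: k for k, ch in enumerate(alphabet)}
    root = ArrayNode(len(alphabet))
    node_count = 1
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            k = index[ch]
            if node.children[k] is None:
                node.children[k] = ArrayNode(len(alphabet))
                node_count += 1
            node = node.children[k]
    return root, index, node_count


def contains(root, index, pattern):
    """Follow pattern from the root; one array access per character."""
    node = root
    for ch in pattern:
        k = index.get(ch)
        if k is None or node.children[k] is None:
            return False
        node = node.children[k]
    return True


root, index, node_count = build_suffix_trie("banana$", "abn$")
total_slots = node_count * len(index)  # space grows with the alphabet size
```

In a true suffix tree the number of nodes is at most 2m, so the same layout costs on the order of m times the alphabet size.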

4
More compact representation
  • We can try to be more compact taking only O(m)
    space.
  • At each node, have pointers to only the edges
    that are needed
  • This slows down the search time
  • How much?
  • Finding an edge at a node typically takes
    O(min(log m, log |Σ|)) time with a balanced
    binary tree representation.
  • This affects both suffix tree construction time
    and later searching time against the suffix tree.
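
A sketch of the compact representation, under one swapped-in assumption: instead of a balanced binary tree, each node keeps a sorted list of the first characters of its outgoing edges and searches it with bisect. Lookup is still logarithmic in the number of edges at the node, but insertion into a Python list is linear, so this only illustrates the search cost. CompactNode and its methods are hypothetical names.

```python
from bisect import bisect_left


class CompactNode:
    """Node that stores only the edges it actually has."""
    __slots__ = ("keys", "children")

    def __init__(self):
        self.keys = []      # sorted first characters of outgoing edges
        self.children = []  # children[i] is reached via keys[i]

    def find(self, ch):
        """Binary search: O(log k) for a node with k edges."""
        i = bisect_left(self.keys, ch)
        if i < len(self.keys) and self.keys[i] == ch:
            return self.children[i]
        return None

    def add(self, ch):
        """Return the child for ch, creating it if missing."""
        i = bisect_left(self.keys, ch)
        if i < len(self.keys) and self.keys[i] == ch:
            return self.children[i]
        child = CompactNode()
        self.keys.insert(i, ch)
        self.children.insert(i, child)
        return child


node = CompactNode()
child_n = node.add("n")
node.add("a")
node.add("$")
```

A node can never have more edges than there are characters in the alphabet, nor more than there are leaves in the tree, which is where the minimum of the two logarithms comes from.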

5
Other methods are truly alphabet independent
  • Z-computation, Knuth-Morris-Pratt (KMP), and
    Boyer-Moore (BM) all have running times and
    space requirements that are truly independent of
    the alphabet size.
  • This can make them superior to suffix tree
    approaches when |Σ| is large.
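
To make the contrast concrete, here is a sketch of Z-computation used for exact matching. It only ever compares two symbols for equality, so nothing in it depends on the alphabet size. The function names and the use of None as a separator are choices made for this example.

```python
def z_array(s):
    """z[i] = length of the longest prefix of s that starts at i."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    left = right = 0  # [left, right) is the rightmost Z-box found so far
    for i in range(1, n):
        if i < right:
            # Reuse the value computed for the mirrored position.
            z[i] = min(right - i, z[i - left])
        # Extend by explicit comparisons past what is already known.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def find_all(pattern, text):
    """Start positions (0-based) of pattern in text."""
    if not pattern:
        return []
    # None acts as a separator that equals no character,
    # so no symbol has to be reserved from the alphabet.
    combined = list(pattern) + [None] + list(text)
    z = z_array(combined)
    offset = len(pattern) + 1
    return [i - offset for i in range(offset, len(combined))
            if z[i] == len(pattern)]
```

For example, find_all("aba", "abababa") reports the overlapping occurrences at positions 0, 2 and 4.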

6
Generalized suffix trees
  • Build a suffix tree for a set of strings
    S = {S_1, ..., S_z}
  • Some issues
  • Nodes in the tree may correspond to substrings of
    potentially multiple strings S_i
  • Compact edge labels need 3 fields (start
    position, stop position, string)
  • Leaf labels are now sets of pairs, each giving a
    starting position and a string
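
A small sketch of the two bookkeeping changes, with hypothetical names (EdgeLabel, edge_text, leaf_label) and 0-based, half-open positions chosen for Python. The leaf label is computed by brute force here, purely to show what the set of pairs contains:

```python
from typing import NamedTuple


class EdgeLabel(NamedTuple):
    start: int      # 0-based, inclusive
    stop: int       # 0-based, exclusive
    string_id: int  # which input string the label is read from


def edge_text(strings, label):
    """Recover the characters a compact edge label stands for."""
    return strings[label.string_id][label.start:label.stop]


def leaf_label(strings, suffix):
    """All (starting position, string) pairs whose suffix equals `suffix`."""
    return {(len(s) - len(suffix), i)
            for i, s in enumerate(strings)
            if suffix and s.endswith(suffix)}


strings = ["xabxa", "babxa"]
label = EdgeLabel(start=1, stop=4, string_id=1)  # reads "abx" from "babxa"
shared = leaf_label(strings, "abxa")             # suffix of both strings
```

The third field is needed because a start and stop position alone no longer say which string to read from, and the leaf for "abxa" has to record one pair per string that ends with it.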

7
One way to compute
  • Use a different end character $_i for each string
    S_i
  • Concatenate all the strings together
  • Make a suffix tree of the concatenated string
  • Turn the artificial suffixes (those that run past
    an end character into the next string) into
    actual suffixes
  • For any internal node v, the path label L(v) must
    be a substring of an original string
  • Only the leaf edge labels can span two original
    strings, because each $_i is unique
  • Postprocess: shorten each such leaf edge label so
    that it ends at its first end character
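
The concatenation and shortening steps can be sketched without building the tree at all. The code below is a brute-force illustration, not a suffix tree construction: it lists every suffix of the concatenated string and cuts it off after its first end character. It assumes Unicode private-use code points are free to serve as the unique end characters; the function names are made up for this example.

```python
def concatenate(strings):
    """Join the strings, ending each one with its own end character."""
    # Assumption: private-use code points do not occur in the input.
    terminators = [chr(0xE000 + i) for i in range(len(strings))]
    text = "".join(s + t for s, t in zip(strings, terminators))
    return text, terminators


def actual_suffixes(strings):
    """(string index, offset, shortened suffix) for each position."""
    text, terminators = concatenate(strings)
    term_set = set(terminators)
    result = []
    string_id, string_start = 0, 0
    for pos in range(len(text)):
        # Scan to the first end character at or after pos.
        end = pos
        while text[end] not in term_set:
            end += 1
        # Keep the suffix only up to and including that end character.
        result.append((string_id, pos - string_start, text[pos:end + 1]))
        if text[pos] in term_set:
            string_id += 1
            string_start = pos + 1
    return result


text, (t0, t1) = concatenate(["ab", "b"])
suffixes = actual_suffixes(["ab", "b"])
```

Because the two end characters differ, the suffix "b" of the first string and the suffix "b" of the second stay distinct, and no shortened suffix crosses a string boundary.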

8
Another way to compute
  • Build the tree for S_1
  • Given the tree for strings S_1 through S_i, add
    the suffixes of S_{i+1} as follows
  • Search for S_{i+1} in the tree until a mismatch
    at position j+1 of S_{i+1}
  • The existing tree implicitly has every suffix of
    S_{i+1}[1..j]
  • Resume Ukkonen's algorithm for S_{i+1} in phase
    j+1 from the point of the last match
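
A sketch of the matching step and of why it is safe to skip the first j phases. It uses a naive dictionary-based suffix trie in place of a real suffix tree, and plain re-insertion in place of resuming Ukkonen's algorithm, so it shows the invariant only, not the linear-time behavior. All names are hypothetical, and the end characters are assumed to be private-use code points.

```python
def insert_suffixes(root, text):
    """Naively add every suffix of text to a dict-of-dicts trie."""
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})


def matched_prefix_length(root, text):
    """Length j of the longest prefix of text found from the root."""
    node, j = root, 0
    for ch in text:
        if ch not in node:
            break
        node = node[ch]
        j += 1
    return j


def has_path(root, text):
    node = root
    for ch in text:
        if ch not in node:
            return False
        node = node[ch]
    return True


s1 = "xabxa" + "\ue000"   # first string with its end character
s2 = "abxb" + "\ue001"    # next string to add

root = {}
insert_suffixes(root, s1)
j = matched_prefix_length(root, s2)
# The matched prefix is a substring of an earlier string, so each of
# its suffixes is one too and is already present in the trie.
already_there = all(has_path(root, s2[k:j]) for k in range(j))
insert_suffixes(root, s2)  # stand-in for resuming at phase j+1
```

Here the slide's S_{i+1}[1..j] corresponds to the Python slice s2[:j], which is "abx" in this example.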