Suffix Trees - PowerPoint PPT Presentation

About This Presentation
Title:

Suffix Trees

Description:

... is the string pi a substring of S? Knuth-Morris-Pratt (KMP) string matching: O(|S| + |pi|) time per query; O(n|S| + Σ_i |pi|) time for n queries.

Slides: 26
Provided by: ciseUflE5
Learn more at: https://www.cise.ufl.edu

Transcript and Presenter's Notes

Title: Suffix Trees


1
Suffix Trees
  • Suffix trees
  • Linearized suffix trees
  • Virtual suffix trees
  • Suffix arrays
  • Enhanced suffix arrays
  • Suffix cactus, suffix vectors, ...

2
Suffix Trees
  • String: any sequence of characters.
  • Substring of string S: string composed of
    characters i through j, i <= j, of S.
  • S = cater => ate is a substring.
  • car is not a substring.
  • Empty string is a substring of S.

3
Subsequence
  • Subsequence of string S: string composed of
    characters i1 < i2 < ... < ik of S.
  • S = cater => ate is a subsequence.
  • car is a subsequence.
  • The empty string is a subsequence.
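
The difference between the two definitions can be checked with a short Python sketch. The function names are illustrative and not part of the slides; both checks are brute force and are meant only to mirror the definitions.

```python
def is_substring(p: str, s: str) -> bool:
    """True if p equals a contiguous run of characters of s."""
    # Try every start position; the empty string matches at any position.
    return any(s.startswith(p, i) for i in range(len(s) + 1))


def is_subsequence(p: str, s: str) -> bool:
    """True if the characters of p occur in s in order, gaps allowed."""
    remaining = iter(s)
    # 'ch in remaining' advances the iterator past the first match,
    # so later characters of p must be found further to the right.
    return all(ch in remaining for ch in p)


print(is_substring("ate", "cater"))    # True
print(is_substring("car", "cater"))    # False
print(is_subsequence("car", "cater"))  # True
print(is_subsequence("", "cater"))     # True
```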

4
String/Pattern Matching
  • You are given a source string S.
  • Answer queries of the form: is the string pi a
    substring of S?
  • Knuth-Morris-Pratt (KMP) string matching.
  • O(|S| + |pi|) time per query.
  • O(n|S| + Σ_i |pi|) time for n queries.
  • Suffix tree solution.
  • O(|S| + Σ_i |pi|) time for n queries.

5
String/Pattern Matching
  • KMP preprocesses the query string pi, whereas the
    suffix tree method preprocesses the source string
    S.
  • An application of string matching.
  • Genome project.
  • Databank of strings (gene sequences).
  • Character set is {A, T, G, C}.
  • Determine if a new sequence is a substring of a
    databank sequence.

6
Definition Of Suffix Tree
  • Compressed trie with edge information.
  • Keys are the nonempty suffixes of a given string
    S.
  • Nonempty suffixes of S = sleeper are:
  • sleeper
  • leeper
  • eeper
  • eper
  • per, er, and r.

7
String Matching Suffixes
  • pi is a substring of S iff pi is a prefix of some
    suffix of S.
  • Nonempty suffixes of S = sleeper are:
  • sleeper
  • leeper
  • eeper
  • eper
  • per, er, and r.
  • Which of these are substrings of S?
  • leep, eepe, pe, leap, peel
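
The prefix-of-a-suffix test can be tried directly by brute force. A minimal Python sketch follows (the helper names are illustrative, not from the slides); it lists the suffixes explicitly instead of using a suffix tree.

```python
def nonempty_suffixes(s: str) -> list[str]:
    """All nonempty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def is_prefix_of_some_suffix(p: str, s: str) -> bool:
    """p is a substring of s iff p is a prefix of some suffix of s."""
    return any(suffix.startswith(p) for suffix in nonempty_suffixes(s))


for p in ["leep", "eepe", "pe", "leap", "peel"]:
    print(p, is_prefix_of_some_suffix(p, "sleeper"))
# leep True, eepe True, pe True, leap False, peel False
```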

8
Last Character Of S Repeats
  • When the last character of S appears more than
    once in S, S has at least one suffix that is a
    proper prefix of another suffix.
  • S = creeper
  • creeper, reeper, eeper, eper, per, er, r
  • When the last character of S appears more than
    once in S, use an end of string character to
    overcome this problem.
  • S = creeper#
  • creeper#, reeper#, eeper#, eper#, per#, er#, r#, #
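
The effect of the end of string character can be seen with a small Python check. This is a sketch only: the function name is illustrative, and `#` is assumed as the end of string character, chosen because it does not occur in the example string.

```python
def suffixes_prefixing_another(s: str) -> list[str]:
    """Suffixes of s that are a proper prefix of some other suffix of s."""
    sufs = [s[i:] for i in range(len(s))]
    return [a for a in sufs
            if any(b != a and b.startswith(a) for b in sufs)]


print(suffixes_prefixing_another("creeper"))   # ['r']  (r is a prefix of reeper)
print(suffixes_prefixing_another("creeper#"))  # []
```

With the end marker appended, no suffix is a proper prefix of another, so every suffix can end at its own element node.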

9
Suffix Tree For S = abbbabbbb#

10
Suffix Tree For S = abbbabbbb#
[Figure: suffix tree for S = abbbabbbb#, with the characters of S numbered 1 to 10.]

11
Suffix Tree For S = abbbabbbb#
[Figure: suffix tree for S = abbbabbbb#, with the characters of S numbered 1 to 10.]

12
Suffix Tree Construction
  • See Web write up for algorithm.
  • Time complexity
  • |S| = n, alphabet size r.
  • O(nr) using array nodes.
  • This is O(n) for r a constant (or r <= c).
  • O(n) expected time using a hash table.
  • O(n) time algorithm for large r in reference
    cited in Web write up.

13
Suffix Array
  • Array that contains the start position of
    suffixes in lexicographic order.
  • S = abbbabbbb#
  • Assume # < a < b
  • # < abbbabbbb# < abbbb# < b# < babbbb# < bb# <
    bbabbbb# < bbb# < bbbabbbb# < bbbb#
  • SA = [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]
  • LCP = length of longest common prefix between
    adjacent entries of SA.
  • LCP = [0, 4, 0, 1, 1, 2, 2, 3, 3, -]
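
The SA and LCP values above can be reproduced with a naive Python sketch. It sorts the suffixes directly, so it is far from the linear time construction mentioned elsewhere in the slides; the function names are illustrative, and `#` is assumed as the end of string character (it sorts before the letters in Python's default string order).

```python
def suffix_array(s: str) -> list[int]:
    """1-based start positions of the suffixes of s in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def lcp_array(s: str, sa: list[int]) -> list[int]:
    """LCP of each pair of adjacent suffixes in sa (one entry fewer than sa)."""
    def lcp(a: str, b: str) -> int:
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n
    return [lcp(s[sa[k] - 1:], s[sa[k + 1] - 1:]) for k in range(len(sa) - 1)]


s = "abbbabbbb#"
sa = suffix_array(s)
print(sa)                # [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]
print(lcp_array(s, sa))  # [0, 4, 0, 1, 1, 2, 2, 3, 3]
```

The last SA entry has no successor, which is why the slide shows "-" in the final LCP slot and the sketch returns one value fewer.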

14
Suffix Array
  • Less space than suffix tree
  • Linear time construction
  • Can be used to solve several of the problems
    solved by a suffix tree with same asymptotic
    complexity.
  • Substring matching: binary search for p using
    SA.
  • O(|p| log |S|).
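
A minimal Python sketch of the binary search follows. The function name is illustrative, positions are 0-based here for convenience, and the suffix array is built by naive sorting only to keep the example self-contained. Each of the O(log |S|) steps compares at most |p| characters.

```python
def sa_contains(s: str, sa: list[int], p: str) -> bool:
    """Binary search the suffix array sa of s for a suffix that starts with p."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare p with the first len(p) characters of the middle suffix.
        if s[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    # lo is the first suffix whose length-|p| prefix is >= p.
    return lo < len(sa) and s[sa[lo]:sa[lo] + len(p)] == p


s = "abbbabbbb#"
sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive construction
print(sa_contains(s, sa, "babb"))   # True
print(sa_contains(s, sa, "abbba"))  # True
print(sa_contains(s, sa, "baba"))   # False
```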

15
O(|pi|) Time Substring Matching
[Figure: suffix tree searched for the example patterns babb, abbba, and baba.]

16
Find All Occurrences Of pi
  • Search suffix tree for pi.
  • Suppose the search for pi is successful.
  • When search terminates at an element node, pi
    appears exactly once in the source string S.

17
Search Terminates At Element Node
[Figure: search for pi = abbbb terminates at an element node.]

18
Search Terminates At Branch Node
  • When the search for pi terminates at a branch
    node, each element node in the subtree rooted at
    this branch node gives a different occurrence of
    pi.

19
Search Terminates At Branch Node
[Figure: search for pi = ab terminates at a branch node.]

20
Find All Occurrences Of pi
  • To find all occurrences of pi in time linear in
    the length of pi and linear in the number of
    occurrences of pi, augment suffix tree
  • Link all element nodes into a chain in inorder.
  • Each branch node keeps a pointer to the left most
    and right most element node in its subtree.

21
Augmented Suffix Tree
[Figure: augmented suffix tree; example pattern b.]

22
Longest Repeating Substring
  • Find longest substring of S that occurs at least
    m > 1 times in S.
  • Label branch nodes with number of element nodes
    in subtree.
  • Find branch node with label >= m and max char
    field.

23
Longest Repeating Substring
[Figure: examples for m = 2 and m = 5.]

24
Longest Common Substring
  • Given two strings S and T.
  • Find the longest common substring.
  • S = carport, T = airports
  • Longest common substring = rport
  • Longest common subsequence = arport
  • Longest common subsequence may be found in
    O(|S| * |T|) time using dynamic programming.
  • Longest common substring may be found in
    O(|S| + |T|) time using a suffix tree.

25
Longest Common Substring
  • Let $ be a new symbol.
  • Construct the suffix tree for the string U =
    S$T.
  • U = carport$airports
  • No repeating substring includes $.
  • Find longest repeating substring that is both to
    left and right of $.
  • Find branch node that has max char and has at
    least one element node in its subtree that
    represents a suffix that begins in S as well as
    at least one that begins in T.
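
The idea of joining S and T with a new symbol can be tried out with sorted suffixes in place of a suffix tree. The Python sketch below is a naive stand-in, not the linear time method: the function name is illustrative, and `$` is assumed as the separator, so it must not occur in either input. It looks for adjacent suffixes in sorted order where one begins in S and the other begins in T.

```python
def longest_common_substring(s: str, t: str, sep: str = "$") -> str:
    """Longest string that is a substring of both s and t."""
    if sep in s or sep in t:
        raise ValueError("separator must not occur in either string")
    u = s + sep + t
    sa = sorted(range(len(u)), key=lambda i: u[i:])  # naive suffix array
    best = ""
    for a, b in zip(sa, sa[1:]):
        # Keep only pairs with one suffix starting in s and one in t.
        if (a < len(s)) == (b < len(s)):
            continue
        n = 0
        while a + n < len(u) and b + n < len(u) and u[a + n] == u[b + n]:
            n += 1
        if n > len(best):
            best = u[a:a + n]
    return best


print(longest_common_substring("carport", "airports"))  # rport
```

Because the separator occurs only once in U, a common prefix of two different suffixes can never contain it, so every reported match lies entirely inside S and entirely inside T.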