Suffix Trees - PowerPoint PPT Presentation

Provided by: programmi
Transcript and Presenter's Notes


1
Suffix Trees
  • String: any sequence of characters.
  • Substring of string S: string composed of
    characters i through j, i <= j, of S.
  • S = cater => ate is a substring.
  • car is not a substring.
  • The empty string is a substring of S.

2
Subsequence
  • Subsequence of string S: string composed of
    characters i1 < i2 < ... < ik of S.
  • S = cater => ate is a subsequence.
  • car is a subsequence.
  • The empty string is a subsequence.
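
The difference between the two definitions can be checked with a short Python sketch. The function names `is_substring` and `is_subsequence` are illustrative choices, not part of the slides.

```python
def is_substring(p, s):
    """True if p occurs as a contiguous block of s (the empty string always does)."""
    return any(s.startswith(p, i) for i in range(len(s) + 1))


def is_subsequence(p, s):
    """True if the characters of p appear in s in order, not necessarily adjacent."""
    remaining = iter(s)
    # `ch in remaining` advances the iterator past the first match,
    # so later characters of p must be found further to the right.
    return all(ch in remaining for ch in p)


print(is_substring("ate", "cater"), is_substring("car", "cater"))      # True False
print(is_subsequence("ate", "cater"), is_subsequence("car", "cater"))  # True True
```

Every substring is also a subsequence, but not the other way around: car skips the t in cater, so it is only a subsequence.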

3
String/Pattern Matching
  • You are given a source string S.
  • Answer queries of the form: is the string pi a
    substring of S?
  • Knuth-Morris-Pratt (KMP) string matching.
  • O(|S| + |pi|) time per query.
  • O(n|S| + Σi |pi|) time for n queries.
  • Suffix tree solution.
  • O(|S| + Σi |pi|) time for n queries.
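
For comparison with the suffix tree approach, here is a minimal sketch of a KMP query in Python. It is a textbook formulation rather than code from the slides, and the names `failure` and `kmp_contains` are assumptions made for this example. Note that the pattern is preprocessed on every query, which is where the O(|S| + |pi|) cost per query comes from.

```python
def failure(p):
    """f[i] = length of the longest proper prefix of p[:i+1] that is also its suffix."""
    f = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = f[k - 1]          # fall back to a shorter border
        if p[i] == p[k]:
            k += 1
        f[i] = k
    return f


def kmp_contains(s, p):
    """Answer one query 'is p a substring of s?' in O(len(s) + len(p)) time."""
    if not p:
        return True               # the empty string is a substring of every string
    f = failure(p)                # preprocessing of the query string
    k = 0                         # number of pattern characters currently matched
    for ch in s:
        while k and ch != p[k]:
            k = f[k - 1]
        if ch == p[k]:
            k += 1
        if k == len(p):
            return True
    return False


print(kmp_contains("sleeper", "eepe"), kmp_contains("sleeper", "leap"))
```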

4
String/Pattern Matching
  • KMP preprocesses the query string pi, whereas the
    suffix tree method preprocesses the source string
    S.
  • An application of string matching.
  • Genome project.
  • Databank of strings (gene sequences).
  • Character set is A, T, G, C.
  • Determine if a new sequence is a substring of a
    databank sequence.

5
Definition Of Suffix Tree
  • Compressed trie with edge information.
  • Keys are the nonempty suffixes of a given string
    S.
  • Nonempty suffixes of S = sleeper are:
  • sleeper
  • leeper
  • eeper
  • eper
  • per, er, and r.

6
String Matching And Suffixes
  • pi is a substring of S iff pi is a prefix of some
    suffix of S.
  • Nonempty suffixes of S = sleeper are:
  • sleeper
  • leeper
  • eeper
  • eper
  • per, er, and r.
  • Which of these are substrings of S?
  • leep, eepe, pe, leap, peel
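
The prefix-of-a-suffix property can be tried out directly on the five candidate strings. This is a brute-force Python sketch for illustration only; the helper names `suffixes` and `is_substring_via_suffixes` are made up for this example, and a suffix tree would avoid scanning every suffix.

```python
def suffixes(s):
    """All nonempty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def is_substring_via_suffixes(p, s):
    """p is a substring of s iff p is a prefix of some suffix of s."""
    return any(suffix.startswith(p) for suffix in suffixes(s))


for p in ["leep", "eepe", "pe", "leap", "peel"]:
    print(p, is_substring_via_suffixes(p, "sleeper"))
# leep True
# eepe True
# pe True
# leap False
# peel False
```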

7
Last Character Of S Repeats
  • When the last character of S appears more than
    once in S, S has at least one suffix that is a
    proper prefix of another suffix.
  • S = creeper
  • creeper, reeper, eeper, eper, per, er, r
  • When the last character of S appears more than
    once in S, use an end of string character
    (written # here) that appears nowhere else in S
    to overcome this problem.
  • S = creeper#
  • creeper#, reeper#, eeper#, eper#, per#, er#, r#, #
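
The problem and its fix can be observed with a small Python check. The helper `prefix_suffix_pairs` is a hypothetical name, and the character `#` is assumed here as the end of string character; any character that does not occur in S would do.

```python
def prefix_suffix_pairs(s):
    """Pairs (a, b) of nonempty suffixes of s where a is a proper prefix of b."""
    sufs = [s[i:] for i in range(len(s))]
    return [(a, b) for a in sufs for b in sufs if a != b and b.startswith(a)]


print(prefix_suffix_pairs("creeper"))   # [('r', 'reeper')]
print(prefix_suffix_pairs("creeper#"))  # []
```

Once every suffix ends with a character that occurs nowhere else, no suffix can be a proper prefix of another, so each suffix gets its own element node in the tree.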

8
Suffix Tree For S = abbbabbbb
9
Suffix Tree For S = abbbabbbb
[Figure: suffix tree for abbbabbbb; string positions are numbered 1 to 10.]
10
Suffix Tree For S = abbbabbbb
[Figure: suffix tree for abbbabbbb with additional node labels; string positions are numbered 1 to 10.]
11
Suffix Tree Construction
  • See Web write up for algorithm.
  • Time complexity
  • |S| = n, alphabet size r.
  • O(nr) using array nodes.
  • This is O(n) for r a constant (or r <= c).
  • O(n) expected time using a hash table.
  • O(n) time algorithm for large r in reference
    cited in Web write up.
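
The following Python sketch builds a compressed suffix trie by inserting the suffixes one at a time. It is not the algorithm from the Web write up and does not achieve the bounds above: it takes O(n^2) time in the worst case and is meant only to show the structure. The class `Node`, the functions `build_suffix_tree`, `contains`, and `leaf_positions`, the dictionary-based child lookup, and the `#` end of string character are all assumptions made for this example.

```python
class Node:
    def __init__(self, start=0, end=0, suffix=None):
        self.start, self.end = start, end  # edge label into this node is text[start:end]
        self.children = {}                 # first character of edge label -> child Node
        self.suffix = suffix               # 0-based suffix start for element nodes, else None


def build_suffix_tree(s, end_char="#"):
    assert end_char not in s, "the end of string character must not occur in s"
    text = s + end_char
    root = Node()
    for i in range(len(text)):             # insert suffix text[i:]
        node, j = root, i
        while True:
            child = node.children.get(text[j])
            if child is None:              # no edge starts with text[j]: add an element node
                node.children[text[j]] = Node(j, len(text), suffix=i)
                break
            k = child.start                # walk along the edge while it agrees with the suffix
            while k < child.end and text[k] == text[j]:
                k += 1
                j += 1
            if k == child.end:             # whole edge matched: continue below the child
                node = child
                continue
            mid = Node(child.start, k)     # mismatch inside the edge: split it at position k
            node.children[text[mid.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            mid.children[text[j]] = Node(j, len(text), suffix=i)
            break
    return root, text


def contains(root, text, p):
    """Follow p from the root; time depends on len(p), not on len(text)."""
    node, j = root, 0
    while j < len(p):
        child = node.children.get(p[j])
        if child is None:
            return False
        label = text[child.start:child.end]
        n = min(len(label), len(p) - j)
        if label[:n] != p[j:j + n]:
            return False
        j += n
        node = child
    return True


def leaf_positions(node):
    """Suffix start positions stored in the element nodes below node."""
    if not node.children:
        return [node.suffix]
    return [pos for child in node.children.values() for pos in leaf_positions(child)]


root, text = build_suffix_tree("abbbabbbb")
print(sorted(leaf_positions(root)))        # one element node per suffix, '#' included
print([contains(root, text, p) for p in ["babb", "abbba", "baba"]])
```

Edge labels are stored as index pairs into the text rather than as copied strings, which keeps the space used by labels proportional to the number of nodes.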

12
O(|pi|) Time Substring Matching
[Figure: searching the suffix tree for the patterns babb, abbba, and baba.]
13
Find All Occurrences Of pi
  • Search suffix tree for pi.
  • Suppose the search for pi is successful.
  • When search terminates at an element node, pi
    appears exactly once in the source string S.

14
Search Terminates At Element Node
[Figure: the search for pi = abbbb terminates at an element node.]
15
Search Terminates At Branch Node
  • When the search for pi terminates at a branch
    node, each element node in the subtree rooted at
    this branch node gives a different occurrence of
    pi.

16
Search Terminates At Branch Node
[Figure: the search for pi = ab terminates at a branch node.]
17
Find All Occurrences Of pi
  • To find all occurrences of pi in time linear in
    the length of pi and linear in the number of
    occurrences of pi, augment the suffix tree:
  • Link all element nodes into a chain in inorder.
  • Each branch node keeps a pointer to the leftmost
    and rightmost element node in its subtree.
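
A rough Python sketch of this augmentation follows. To keep it short it uses an uncompressed suffix trie built in O(n^2) time instead of a true suffix tree, a Python list in place of the linked chain, and list indices in place of the leftmost and rightmost pointers. The names `build`, `occurrences`, `first`, `last`, the `#` end of string character, and the 0-based positions are all assumptions of this example.

```python
def build(s, end_char="#"):
    """Uncompressed suffix trie plus the chain of element nodes in left-to-right order."""
    assert end_char not in s
    text = s + end_char
    root = {"kids": {}}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node["kids"].setdefault(ch, {"kids": {}})
        node["pos"] = i                    # element node: 0-based start of this suffix

    chain = []                             # element nodes in inorder

    def number(node):
        node["first"] = len(chain)         # index of leftmost element node in the subtree
        if "pos" in node:
            chain.append(node["pos"])
        for ch in sorted(node["kids"]):
            number(node["kids"][ch])
        node["last"] = len(chain) - 1      # index of rightmost element node in the subtree

    number(root)
    return root, chain


def occurrences(root, chain, p):
    """Start positions of p: walk down len(p) characters, then read a slice of the chain."""
    node = root
    for ch in p:
        node = node["kids"].get(ch)
        if node is None:
            return []
    return chain[node["first"]:node["last"] + 1]


root, chain = build("abbbabbbb")
print(sorted(occurrences(root, chain, "ab")))     # [0, 4]
print(sorted(occurrences(root, chain, "abbbb")))  # [4]
```

The positions come back in chain order, not in increasing order, which is why the usage lines sort them before printing.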

18
Augmented Suffix Tree
[Figure: augmented suffix tree, shown with the example pattern b.]
19
Longest Repeating Substring
  • Find longest substring of S that occurs more than
    m times in S, where m > 1.
  • Label branch nodes with number of element nodes
    in subtree.
  • Find branch node with label > m and max char
    field.
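
The labelling idea can be sketched in Python as below. This version uses an uncompressed suffix trie (O(n^2) time and space) instead of a true suffix tree, and it tracks the path string in place of a char field. The function name `longest_repeat`, the `#` end of string character, and the parameter `min_count` are assumptions of this example. `min_count` is the least number of occurrences required, so "more than m times" corresponds to `min_count = m + 1`.

```python
def longest_repeat(s, min_count=2, end_char="#"):
    """Longest substring of s that occurs at least min_count times ('' if none)."""
    assert end_char not in s
    text = s + end_char
    root = {}
    for i in range(len(text)):             # insert every suffix character by character
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})

    best = ""

    def count(node, path):
        """Return the number of element nodes below node; update best on the way up."""
        nonlocal best
        if not node:                       # no children: this is an element node
            return 1
        total = sum(count(child, path + ch) for ch, child in node.items())
        if total >= min_count and len(path) > len(best):
            best = path
        return total

    count(root, "")
    return best


print(longest_repeat("abbbabbbb"))               # abbb
print(longest_repeat("abbbabbbb", min_count=5))  # bb
```

The end of string character never shows up in the answer, because any path containing it leads to a single element node.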

20
Longest Repeating Substring
[Figure: examples for m = 2 and m = 5.]
21
Longest Common Substring
  • Given two strings S and T.
  • Find the longest common substring.
  • S = carport, T = airports
  • Longest common substring = rport
  • Longest common subsequence = arport
  • Longest common subsequence may be found in
    O(|S| * |T|) time using dynamic programming.
  • Longest common substring may be found in
    O(|S| + |T|) time using a suffix tree.

22
Longest Common Substring
  • Let # be a new symbol, one that appears in
    neither S nor T.
  • Construct the suffix tree for the string U =
    S#T.
  • U = carport#airports
  • No repeating substring includes #.
  • Find longest repeating substring that is both to
    left and right of #.
  • Find branch node that has max char and has at
    least one element node in its subtree that
    represents a suffix that begins in S as well as
    at least one that begins in T.
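
The steps on this slide can be sketched in Python as follows. The sketch uses an uncompressed suffix trie, so it runs in time quadratic in the length of U and does not reach the linear bound of a real suffix tree. The function name `longest_common_substring`, the `src` bit field, and the use of `#` as the separator are assumptions made for this example.

```python
def longest_common_substring(s, t, sep="#"):
    """Longest string that is a substring of both s and t ('' if none)."""
    assert sep not in s and sep not in t, "the separator must be a new symbol"
    u = s + sep + t
    root = {"kids": {}, "src": 0}
    best = ""
    for i in range(len(u)):
        if i == len(s):
            continue                       # skip the suffix that starts at the separator
        bit = 1 if i < len(s) else 2       # 1: suffix begins in s, 2: suffix begins in t
        node = root
        for depth, ch in enumerate(u[i:], 1):
            node = node["kids"].setdefault(ch, {"kids": {}, "src": 0})
            node["src"] |= bit
            # src == 3 means suffixes beginning in s and in t both pass through this node
            if node["src"] == 3 and depth > len(best):
                best = u[i:i + depth]
    return best


print(longest_common_substring("carport", "airports"))  # rport
```

Paths that contain the separator are reached only by suffixes that begin in S, so they can never be marked from both sides and need no special handling.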