Title: Suffix Trees
1 Suffix Trees
- Suffix trees
- Linearized suffix trees
- Virtual suffix trees
- Suffix arrays
- Enhanced suffix arrays
- Suffix cactus, suffix vectors, ...
2 Suffix Trees
- String: any sequence of characters.
- Substring of string S: string composed of characters i through j, i <= j, of S.
- S = cater => ate is a substring.
- car is not a substring.
- The empty string is a substring of S.
3 Subsequence
- Subsequence of string S: string composed of characters i1 < i2 < ... < ik of S.
- S = cater => ate is a subsequence.
- car is a subsequence.
- The empty string is a subsequence.
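The difference between the two definitions can be checked mechanically. The following is a minimal sketch; the function names is_substring and is_subsequence are illustrative and do not come from the slides.

```python
def is_substring(p, s):
    # A substring uses consecutive characters of s.
    return p in s


def is_subsequence(p, s):
    # A subsequence keeps the characters in order but may skip some.
    # "ch in it" advances the iterator, so later characters of p are
    # only searched for to the right of the previous match.
    it = iter(s)
    return all(ch in it for ch in p)


s = "cater"
for p in ["ate", "car", ""]:
    print(repr(p), is_substring(p, s), is_subsequence(p, s))
```

For S = cater this reports ate as both a substring and a subsequence, car as a subsequence only, and the empty string as both.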
4 String/Pattern Matching
- You are given a source string S.
- Answer queries of the form: is the string p_i a substring of S?
- Knuth-Morris-Pratt (KMP) string matching.
- O(|S| + |p_i|) time per query.
- O(n|S| + Σ_i |p_i|) time for n queries.
- Suffix tree solution.
- O(|S| + Σ_i |p_i|) time for n queries.
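To make the per-query cost concrete, here is a small sketch of KMP matching. It preprocesses the pattern in O(|p_i|) time and then scans S once, which is why each query costs O(|S| + |p_i|). The names failure_function and kmp_contains are illustrative, not taken from the slides.

```python
def failure_function(p):
    # fail[i] = length of the longest proper prefix of p[:i + 1]
    # that is also a suffix of p[:i + 1].
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_contains(s, p):
    # Returns True iff p is a substring of s.
    if not p:
        return True
    fail = failure_function(p)
    k = 0  # number of pattern characters matched so far
    for ch in s:
        while k > 0 and ch != p[k]:
            k = fail[k - 1]
        if ch == p[k]:
            k += 1
        if k == len(p):
            return True
    return False


print(kmp_contains("sleeper", "leep"))
print(kmp_contains("sleeper", "leap"))
```

Every call to kmp_contains rescans S, so n queries pay for S n times. The suffix tree approach pays for S only once.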
5 String/Pattern Matching
- KMP preprocesses the query string p_i, whereas the suffix tree method preprocesses the source string S.
- An application of string matching.
- Genome project.
- Databank of strings (gene sequences).
- Character set is {A, T, G, C}.
- Determine if a new sequence is a substring of a databank sequence.
6 Definition Of Suffix Tree
- Compressed trie with edge information.
- Keys are the nonempty suffixes of a given string S.
- Nonempty suffixes of S = sleeper are:
- sleeper
- leeper
- eeper
- eper
- per, er, and r.
7 String Matching And Suffixes
- p_i is a substring of S iff p_i is a prefix of some suffix of S.
- Nonempty suffixes of S = sleeper are:
- sleeper
- leeper
- eeper
- eper
- per, er, and r.
- Which of these are substrings of S?
- leep, eepe, pe, leap, peel
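The prefix-of-a-suffix characterization can be tried directly on the five candidate strings. This is a brute-force sketch that only illustrates the property; it is not how a suffix tree performs the search, and the function names are illustrative.

```python
def nonempty_suffixes(s):
    # All nonempty suffixes of s, longest first.
    return [s[i:] for i in range(len(s))]


def is_substring_via_suffixes(p, s):
    # p is a substring of s iff p is a prefix of some suffix of s.
    # The empty pattern is handled separately because only nonempty
    # suffixes are enumerated.
    return p == "" or any(suf.startswith(p) for suf in nonempty_suffixes(s))


for p in ["leep", "eepe", "pe", "leap", "peel"]:
    print(p, is_substring_via_suffixes(p, "sleeper"))
```

For S = sleeper, the first three candidates (leep, eepe, pe) are substrings, while leap and peel are not.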
8 Last Character Of S Repeats
- When the last character of S appears more than once in S, S has at least one suffix that is a proper prefix of another suffix.
- S = creeper
- creeper, reeper, eeper, eper, per, er, r
- When the last character of S appears more than once in S, use an end of string character to overcome this problem.
- S = creeper#
- creeper#, reeper#, eeper#, eper#, per#, er#, r#, #
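The effect of the end of string character can be verified with a short brute-force check. This sketch assumes # as the end marker, a character that does not occur elsewhere in S. The function name is illustrative.

```python
def suffixes_that_prefix_another(s):
    # Return every nonempty suffix of s that is a proper prefix
    # of some other suffix of s.
    sufs = [s[i:] for i in range(len(s))]
    return [a for a in sufs
            if any(b != a and b.startswith(a) for b in sufs)]


print(suffixes_that_prefix_another("creeper"))   # ['r']
print(suffixes_that_prefix_another("creeper#"))  # []
```

Without the marker, the suffix r is a proper prefix of reeper, so r would end inside the tree rather than at its own element node. With the marker, every suffix ends in a character that appears nowhere else, so no suffix can be a prefix of another.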
9 Suffix Tree For S = abbbabbbb#
10 Suffix Tree For S = abbbabbbb#
[Figure: suffix tree for S = abbbabbbb#, with the characters of S indexed 1 through 10.]
11 Suffix Tree For S = abbbabbbb#
[Figure: suffix tree for S = abbbabbbb#, with the characters of S indexed 1 through 10.]
12 Suffix Tree Construction
- See Web write up for algorithm.
- Time complexity
- |S| = n, alphabet size = r.
- O(nr) using array nodes.
- This is O(n) for r a constant (or r <= c for some constant c).
- O(n) expected time using a hash table.
- O(n) time algorithm for large r in reference cited in Web write up.
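For orientation only, the sketch below builds a compressed suffix tree by inserting the suffixes one at a time. This is a naive quadratic-time construction, not the linear-time algorithm of the Web write up. It assumes that S ends in a unique end marker, and it stores each node's children in a Python dict, which corresponds to the hash table node representation. The names Node, build_suffix_tree and leaf_positions are illustrative.

```python
class Node:
    def __init__(self, start=0, end=0):
        self.start = start    # edge label into this node is s[start:end]
        self.end = end
        self.children = {}    # hash table keyed by first character of edge
        self.suffix = None    # 1-based suffix start, set on element nodes


def build_suffix_tree(s):
    if not s or s.count(s[-1]) != 1:
        raise ValueError("s must end in a unique end marker")
    n = len(s)
    root = Node()
    for i in range(n):                 # insert suffix s[i:]
        node, j = root, i
        while True:
            child = node.children.get(s[j])
            if child is None:          # no edge starts with s[j]: new leaf
                leaf = Node(j, n)
                leaf.suffix = i + 1
                node.children[s[j]] = leaf
                break
            k = child.start            # walk down the edge while it matches
            while k < child.end and j < n and s[k] == s[j]:
                k += 1
                j += 1
            if k == child.end:         # whole edge matched: keep descending
                node = child
                continue
            mid = Node(child.start, k) # mismatch inside the edge: split it
            node.children[s[child.start]] = mid
            child.start = k
            mid.children[s[k]] = child
            leaf = Node(j, n)
            leaf.suffix = i + 1
            mid.children[s[j]] = leaf
            break
    return root


def leaf_positions(node):
    # Collect the suffix start positions stored in the element nodes.
    if not node.children:
        return [node.suffix]
    return [p for c in node.children.values() for p in leaf_positions(c)]


root = build_suffix_tree("abbbabbbb#")
print(sorted(leaf_positions(root)))
```

The tree has one element node per suffix, so the printed list contains the positions 1 through 10.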
13 Suffix Array
- Array that contains the start positions of the suffixes in lexicographic order.
- S = abbbabbbb#
- Assume # < a < b.
- # < abbbabbbb# < abbbb# < b# < babbbb# < bb# < bbabbbb# < bbb# < bbbabbbb# < bbbb#
- SA = [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]
- LCP = length of longest common prefix between adjacent entries of SA.
- LCP = [0, 4, 0, 1, 1, 2, 2, 3, 3, -]
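The SA and LCP values above can be reproduced with a naive sketch that sorts the suffixes directly. This is not a linear-time construction. It assumes the end marker # and relies on # sorting before a and b in Python's default character order. The function names are illustrative.

```python
def suffix_array(s):
    # 1-based start positions of the suffixes in lexicographic order.
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def lcp_array(s, sa):
    # lcp[k] = length of the longest common prefix of the suffixes
    # starting at sa[k] and sa[k + 1]; the last suffix has no successor,
    # so the result has one entry fewer than sa.
    def lcp(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n
    return [lcp(s[sa[k] - 1:], s[sa[k + 1] - 1:]) for k in range(len(sa) - 1)]


s = "abbbabbbb#"
sa = suffix_array(s)
print(sa)                # [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]
print(lcp_array(s, sa))  # [0, 4, 0, 1, 1, 2, 2, 3, 3]
```

The trailing "-" in the LCP list above corresponds to the missing tenth entry here.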
14 Suffix Array
- Less space than suffix tree.
- Linear time construction.
- Can be used to solve several of the problems solved by a suffix tree with the same asymptotic complexity.
- Substring matching: binary search for p using SA.
- O(|p| log |S|).
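The binary search can be sketched as follows. Each of the O(log |S|) probes compares p with at most |p| characters of one suffix, which gives the O(|p| log |S|) bound. The suffix array is built naively here and uses 0-based positions for convenience. The function names are illustrative.

```python
def suffix_array(s):
    # 0-based start positions of the suffixes in lexicographic order
    # (naive construction by sorting the suffixes).
    return sorted(range(len(s)), key=lambda i: s[i:])


def sa_contains(s, sa, p):
    # Lower-bound binary search: find the first suffix whose first
    # len(p) characters are not smaller than p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    # p occurs in s iff that suffix starts with p.
    return lo < len(sa) and s[sa[lo]:sa[lo] + len(p)] == p


s = "abbbabbbb#"
sa = suffix_array(s)
for p in ["babb", "abbba", "baba"]:
    print(p, sa_contains(s, sa, p))
```

The search is valid because the length-|p| prefixes of the suffixes appear in nondecreasing order when the suffixes themselves are sorted.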
15 O(|p_i|) Time Substring Matching
[Figure: suffix tree for S = abbbabbbb#, searched for the example patterns babb, abbba, and baba.]
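A search follows one character of p_i per step from the root, so its cost depends on |p_i| and not on |S|. The sketch below uses an uncompressed suffix trie of nested dicts, a simplification of the compressed suffix tree that is built naively in quadratic time, with # assumed as the end marker. The function names are illustrative.

```python
def build_suffix_trie(s):
    # Insert every suffix of s, one character per trie level.
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root


def occurs(trie, p):
    # Follow p from the root; p is a substring iff the walk never fails.
    node = trie
    for ch in p:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("abbbabbbb#")
for p in ["babb", "abbba", "baba"]:
    print(p, occurs(trie, p))
```

For S = abbbabbbb#, babb and abbba are found, while the walk for baba fails at its last character.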
16 Find All Occurrences Of p_i
- Search the suffix tree for p_i.
- Suppose the search for p_i is successful.
- When the search terminates at an element node, p_i appears exactly once in the source string S.
17 Search Terminates At Element Node
[Figure: the search for p_i = abbbb terminates at an element node.]
18 Search Terminates At Branch Node
- When the search for p_i terminates at a branch node, each element node in the subtree rooted at this branch node gives a different occurrence of p_i.
19 Search Terminates At Branch Node
[Figure: the search for p_i = ab terminates at a branch node.]
20 Find All Occurrences Of p_i
- To find all occurrences of p_i in time linear in the length of p_i and linear in the number of occurrences of p_i, augment the suffix tree:
- Link all element nodes into a chain in inorder.
- Each branch node keeps a pointer to the leftmost and rightmost element node in its subtree.
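The augmentation can be sketched on an uncompressed suffix trie. Here the chain of element nodes is a list of suffix start positions in lexicographic order, and each trie node stores the indexes of its leftmost and rightmost element node in that chain. The trie is built naively, # is assumed as the end marker, and the names Node, build_augmented_trie and find_all are illustrative.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.left = None    # chain index of leftmost element node below
        self.right = None   # chain index of rightmost element node below


def build_augmented_trie(s):
    # Chain of element nodes: 1-based suffix starts in lexicographic order.
    chain = sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])
    root = Node()
    root.left, root.right = 0, len(chain) - 1
    for rank, start in enumerate(chain):
        node = root
        for ch in s[start - 1:]:
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
            if node.left is None:   # ranks arrive in increasing order
                node.left = rank
            node.right = rank
    return root, chain


def find_all(root, chain, p):
    # Walk down p, then read the occurrences off the chain segment.
    node = root
    for ch in p:
        node = node.children.get(ch)
        if node is None:
            return []
    return chain[node.left:node.right + 1]


root, chain = build_augmented_trie("abbbabbbb#")
print(find_all(root, chain, "ab"))     # [1, 5]
print(find_all(root, chain, "abbbb"))  # [5]
print(find_all(root, chain, "b"))      # [9, 4, 8, 3, 7, 2, 6]
```

After the O(|p_i|) walk, the occurrences are read from a contiguous part of the chain, so the remaining work is proportional to the number of occurrences. The positions come out in lexicographic order of the suffixes, not in order of position.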
21 Augmented Suffix Tree
[Figure: augmented suffix tree, with the example pattern p_i = b.]
22 Longest Repeating Substring
- Find the longest substring of S that occurs at least m > 1 times in S.
- Label branch nodes with the number of element nodes in their subtree.
- Find the branch node with label >= m and max char field.
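The labeling idea can be sketched on an uncompressed suffix trie, where the depth of a node plays the role of the char field. Each node counts the suffixes that pass through it, which equals the number of element nodes below it. The sketch assumes the "at least m occurrences" reading of the problem and builds the trie naively. The names Node and longest_repeating are illustrative.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.count = 0   # number of suffixes (element nodes) below this node


def longest_repeating(s, m):
    # Build the trie and label every node with its suffix count.
    root = Node()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
            node.count += 1
    # Depth-first search restricted to nodes with label >= m;
    # the deepest such node spells the answer.
    best = ""
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if len(prefix) > len(best):
            best = prefix
        for ch, child in node.children.items():
            if child.count >= m:
                stack.append((child, prefix + ch))
    return best


print(longest_repeating("abbbabbbb", 2))  # abbb
print(longest_repeating("abbbabbbb", 5))  # bb
```

For S = abbbabbbb, the substring abbb occurs twice (positions 1 and 5) and bb occurs five times (positions 2, 3, 6, 7, 8).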
23 Longest Repeating Substring
[Figure: suffix tree examples for m = 2 and m = 5.]
24 Longest Common Substring
- Given two strings S and T.
- Find the longest common substring.
- S = carport, T = airports
- Longest common substring = rport
- Longest common subsequence = arport
- Longest common subsequence may be found in O(|S| * |T|) time using dynamic programming.
- Longest common substring may be found in O(|S| + |T|) time using a suffix tree.
25 Longest Common Substring
- Let $ be a new symbol.
- Construct the suffix tree for the string U = S$T#.
- U = carport$airports#
- No repeating substring includes $.
- Find the longest repeating substring that occurs both to the left and to the right of $.
- Find the branch node that has max char and has at least one element node in its subtree that represents a suffix that begins in S as well as at least one that begins in T.
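The same idea can be sketched on an uncompressed suffix trie of U. Each node records whether the suffixes passing through it begin in S, in T, or in both, and the deepest node marked with both sides spells the answer. The trie is built naively, so this sketch does not achieve the O(|S| + |T|) bound. The symbols $ and # are assumed to occur in neither S nor T, and the names Node and longest_common_substring are illustrative.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.sides = set()   # which of "S", "T" have a suffix through here


def longest_common_substring(s, t, sep="$", end="#"):
    if sep == end or sep in s + t or end in s + t:
        raise ValueError("separator and end marker must be new symbols")
    u = s + sep + t + end
    root = Node()
    best = ""
    for i in range(len(u)):
        # Suffixes starting at or before the separator count as the S side.
        side = "S" if i <= len(s) else "T"
        node = root
        for j in range(i, len(u)):
            if u[j] not in node.children:
                node.children[u[j]] = Node()
            node = node.children[u[j]]
            node.sides.add(side)
            # A node reached from both sides spells a common substring.
            if len(node.sides) == 2 and j - i + 1 > len(best):
                best = u[i:j + 1]
    return best


print(longest_common_substring("carport", "airports"))  # rport
```

A string that contains $ or # occurs only once in U, so no node marked with both sides can include either symbol, and the result is always a substring of both S and T.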