Title: Pattern Matching: Suffix Tree Applications
1. Pattern Matching: Suffix Tree Applications
2. Applications
- Exact string and substring matching
- Longest common substrings
- Finding and representing repeated substrings efficiently
- Applications that lead to alternative, space-efficient implementations
- Matching statistics
- Suffix arrays
3. Exact set matching
- Input
- A set of patterns P = {P1, P2, ..., Pk} with total length |P| = n
- Text T of length m
- Output
- Positions of all occurrences of each pattern Pi in T
- Solution method
- Preprocess to create a suffix tree for T
- O(m) time, O(m) space
- Maximally match each Pi in the suffix tree
- O(|P1|) + O(|P2|) + ... + O(|Pk|) = O(n)
- Output all leaf positions below the match point
- O(k) time, where k here is the total number of matches
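The solution method above can be sketched in a few lines. This is only an illustration: it builds a naive suffix trie in O(m^2) time and space instead of a linear-time suffix tree, and the names `Node`, `build_suffix_trie`, `leaves_below`, and `find_all` are hypothetical. It also assumes the end marker `$` does not occur in T.

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> Node
        self.leaf = None     # 1-based start of the suffix ending here, if any


def build_suffix_trie(text, end="$"):
    """Naive O(m^2) suffix trie; a stand-in for a linear-time suffix tree."""
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:] + end:
            node = node.children.setdefault(ch, Node())
        node.leaf = start + 1
    return root


def leaves_below(node):
    """Collect the suffix positions of all leaves in the subtree of node."""
    stack, found = [node], []
    while stack:
        cur = stack.pop()
        if cur.leaf is not None:
            found.append(cur.leaf)
        stack.extend(cur.children.values())
    return sorted(found)


def find_all(root, patterns):
    """Match each pattern from the root, then report the leaves below the match point."""
    result = {}
    for p in patterns:
        node = root
        for ch in p:
            node = node.children.get(ch)
            if node is None:          # pattern falls off the trie: no occurrence
                break
        result[p] = leaves_below(node) if node is not None else []
    return result


root = build_suffix_trie("abcabxabcd")
print(find_all(root, ["abc", "ab", "x", "zz"]))
```

Matching a pattern costs time proportional to its length; the reporting step visits the subtree below the match point, which in a real suffix tree is proportional to the number of occurrences.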
4. Exact set matching using Aho-Corasick
- The Aho-Corasick algorithm is a classical solution to exact set matching
- Build a keyword tree of the set of patterns P
- A keyword tree for a pattern set P is a rooted tree T such that
- Each edge e is labeled by a character
- Any two edges out of the same node have different labels
- L(v), the label of a node v, is the concatenation of the edge labels on the path from the root to v
- For each Pi in P there is a node v such that L(v) = Pi, and for each leaf v there is a Pi in P such that L(v) = Pi
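A minimal sketch of building such a keyword tree, assuming nodes are stored as integer ids with the root as node 0. The function names `build_keyword_tree` and `path_labels` are hypothetical, and the node numbering is this sketch's own; it need not match the numbering in the slide figures.

```python
def build_keyword_tree(patterns):
    """Return (children, ends): children[v] maps a character to a child id,
    ends[v] is the index of the pattern equal to L(v), or None."""
    children = [{}]     # node 0 is the root
    ends = [None]
    for idx, pattern in enumerate(patterns):
        v = 0
        for ch in pattern:
            if ch not in children[v]:          # at most one edge per character
                children.append({})
                ends.append(None)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        ends[v] = idx
    return children, ends


def path_labels(children):
    """Compute L(v) for every node v by walking down from the root."""
    labels = {0: ""}
    stack = [0]
    while stack:
        v = stack.pop()
        for ch, w in children[v].items():
            labels[w] = labels[v] + ch
            stack.append(w)
    return labels


patterns = ["abce", "abe", "dce", "ac"]
children, ends = build_keyword_tree(patterns)
labels = path_labels(children)
for v, idx in enumerate(ends):
    if idx is not None:
        print(v, labels[v])    # every pattern is the label of some node
```

Construction takes time proportional to the total pattern length n, since each pattern character creates or follows exactly one edge.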
5. Example of Aho-Corasick
- Example: P = {abce, abe, dce, ac}
[Figure: keyword tree for P; nodes are numbered and each edge is labeled with one character]
6. Example of Aho-Corasick
- Example: P = {abce, ababc, abac}
[Figure: keyword tree for P, with numbered nodes and a resume link]
- Resume link: as in the KMP algorithm, if a mismatch occurs at node v, the comparison resumes by following the resume link of its parent.
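The role of the resume link can be sketched as follows. This is a common textbook formulation, not necessarily the exact variant on the slide: the link (called `fail` here) is stored per node and followed from the current node on a mismatch. The names `build_automaton` and `search` are hypothetical, and the sketch assumes non-empty patterns.

```python
from collections import deque


def build_automaton(patterns):
    """Keyword tree plus resume (failure) links, built breadth-first."""
    children, fail, out = [{}], [0], [[]]
    for idx, p in enumerate(patterns):
        v = 0
        for ch in p:
            if ch not in children[v]:
                children.append({})
                fail.append(0)
                out.append([])
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        out[v].append(idx)
    queue = deque(children[0].values())      # depth-1 nodes link to the root
    while queue:
        v = queue.popleft()
        for ch, w in children[v].items():
            f = fail[v]
            while f and ch not in children[f]:
                f = fail[f]
            fail[w] = children[f].get(ch, 0)
            out[w] = out[w] + out[fail[w]]   # inherit patterns ending at the link target
            queue.append(w)
    return children, fail, out


def search(text, patterns):
    """Return sorted (1-based start, pattern) pairs for all occurrences."""
    children, fail, out = build_automaton(patterns)
    hits, v = [], 0
    for i, ch in enumerate(text):
        while v and ch not in children[v]:
            v = fail[v]                      # resume instead of restarting at the root
        v = children[v].get(ch, 0)
        for idx in out[v]:
            hits.append((i - len(patterns[idx]) + 2, patterns[idx]))
    return sorted(hits)


print(search("abababc", ["abce", "ababc", "abac"]))
```

In the usage line, the scan reaches the node for `abab`, fails on the next `a`, and resumes at the node for `ab` rather than at the root, which is how the occurrence of `ababc` starting at position 3 is still found in a single pass.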
7. Aho-Corasick vs. suffix tree
- Aho-Corasick approach
- O(n) preprocessing time and space
- to build the keyword tree of the set of patterns P
- O(m + k) search time
- Linear time by using the resume links
- Suffix tree approach
- O(m) preprocessing time and space
- to build the suffix tree of T
- O(n + k) search time
- Using matching statistics (defined later), this tradeoff can be made similar to that of Aho-Corasick
8. Substring problem
- Input
- Pattern P of length n
- A set of texts Ti of total length m
- Output
- Positions of all occurrences of P in each text Ti
- Solution method
- Preprocess to create a generalized suffix tree for the texts Ti
- O(m) time, O(m) space
- Maximally match P in the generalized suffix tree
- Output all leaf positions below the match point
- O(n + k) time, where k is the total number of matches
9. Generalized suffix tree
[Figure: generalized suffix tree drawn from the root, with edge labels including abc, ab, b, c, and d]
10. Longest common substring problem
- Input
- Strings S and T
- Output
- The longest common substring of S and T (and its position in S and T)
- Solution method
- Preprocess to create a generalized suffix tree for S and T
- Mark each node by whether its subtree contains a leaf of S, a leaf of T, or both
- A simple postorder tree traversal does this
- The path label of the node marked "both" with the greatest string depth is the longest common substring of S and T
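The marking idea can be sketched on a naive generalized suffix trie (quadratic size, used here only as a stand-in for a generalized suffix tree). The names `Node`, `build_generalized_trie`, and `longest_common_substring` are hypothetical; bit 1 of `mark` means "subtree has a leaf of S" and bit 2 means "subtree has a leaf of T".

```python
class Node:
    def __init__(self):
        self.children = {}
        self.mark = 0      # bit 1: leaf of S below; bit 2: leaf of T below


def build_generalized_trie(s, t):
    """Insert every suffix of S and of T; the node where a suffix ends acts as its leaf."""
    root = Node()
    for bit, text in ((1, s), (2, t)):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node.children.setdefault(ch, Node())
            node.mark |= bit
    return root


def longest_common_substring(s, t):
    root = build_generalized_trie(s, t)
    best = ""
    # Iterative postorder: children are combined before their parent is examined.
    stack = [(root, "", False)]
    while stack:
        node, label, done = stack.pop()
        if not done:
            stack.append((node, label, True))
            for ch, child in node.children.items():
                stack.append((child, label + ch, False))
        else:
            for child in node.children.values():
                node.mark |= child.mark
            if node.mark == 3 and len(label) > len(best):
                best = label       # deepest node so far that is marked "both"
    return best


print(longest_common_substring("superiorcalifornialives", "sealiver"))
```

When several common substrings share the maximum length, this sketch returns whichever one the traversal meets first.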
11. Common substrings of length k problem
- Input
- Strings S and T
- Integer k
- Output
- All common substrings of S and T (and their positions in S and T) of length at least k
- Solution method
- Same as the previous problem
- Look for all nodes marked with both leaf labels whose string depth is at least k
12. Longest common substrings of more than two strings
- Definition: for a given set of K strings, l(j) for 2 ≤ j ≤ K is the length of the longest substring common to at least j of the K strings
- Example: abcedfg, cbcedfa, dbcedg, cbceg, acea
13. Longest common substrings of more than two strings
- Input
- Strings S1, ..., SK, of total length n
- Output
- l(j) (and positions in Si) for 2 ≤ j ≤ K
- Solution method
- Build a generalized suffix tree for the K strings
- Each string has a unique end character, so each leaf shows up only once
14. Longest common substrings of more than two strings
- Build a generalized suffix tree for the K strings
- Each string has a unique end character, so each leaf shows up only once
- Define c(v) = number of distinct leaf labels in the subtree rooted at node v, and d(v) = string depth from the root to node v
- Given c(v) and d(v), do a simple traversal of the tree to find l(j) for j = 2, ..., K and pointers to locations in the substrings
- Computing c(v) efficiently
- Counting the number of leaves is not correct, as some leaves may have the same label
- Use a length-K bit vector, 1 bit per string in the set
- OR your way up the tree
- Each OR operation takes O(K) time, which gives O(Kn) running time
- Can be improved to O(n) later
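A minimal sketch of the bit-vector approach, assuming a naive generalized suffix trie in place of a suffix tree and a Python integer as the length-K bit vector. The function name `l_values` and the node layout `[children, bitmask]` are hypothetical. Here c(v) is the number of set bits at v and d(v) is the depth in the trie.

```python
def l_values(strings):
    """Return {j: l(j)} for 2 <= j <= K using bitmasks OR-ed up the trie."""
    K = len(strings)
    root = [{}, 0]                       # node = [children, bitmask of strings below]
    for idx, text in enumerate(strings):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node[0].setdefault(ch, [{}, 0])
            node[1] |= 1 << idx          # this suffix of string idx ends here

    best = [0] * (K + 1)                 # best[c] = max d(v) over nodes with c(v) == c
    stack = [(root, 0, False)]
    while stack:                         # iterative postorder traversal
        node, depth, done = stack.pop()
        if not done:
            stack.append((node, depth, True))
            for child in node[0].values():
                stack.append((child, depth + 1, False))
        else:
            for child in node[0].values():
                node[1] |= child[1]      # OR the children's bit vectors into the parent
            c = bin(node[1]).count("1")
            best[c] = max(best[c], depth)

    result, running = {}, 0
    for j in range(K, 1, -1):            # "at least j" means any c(v) >= j counts
        running = max(running, best[j])
        result[j] = running
    return result


print(l_values(["abcedfg", "cbcedfa", "dbcedg", "cbceg", "acea"]))
```

On the five example strings from the definition slide this gives l(2) = 5 (bcedf), l(3) = 4 (bced), l(4) = 3 (bce), and l(5) = 2 (ce).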
15. Repeated substrings
- Definition
- A maximal pair in S is a pair of identical substrings a and b in S such that the character to the immediate left (right) of a is different from the character to the immediate left (right) of b
- Add unique characters to the front and end of S to include prefixes and suffixes
- Representation: (p1, p2, n)
- Starting positions and length of the maximal pair
- R(S) is the set of all triples representing maximal pairs in S
16. Example of repeated substrings
- S
- (2, 7, 3) is a maximal pair
- (7, 14, 3) is a maximal pair
- (2, 14, 3) is not a maximal pair
- (2, 14, 4) is a maximal pair
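The definition can be checked by brute force. The example string itself is not reproduced in this text, so the sketch below uses a hypothetical string `S` chosen only so that xyz occurs at positions 2, 7, and 14 and xyzv at positions 2 and 14, consistent with the triples listed above. The names `is_maximal_pair` and `maximal_pairs` are also hypothetical; positions past either end of the string are treated as unique boundary characters.

```python
def is_maximal_pair(s, p1, p2, n):
    """Check the triple (p1, p2, n), with 1-based starting positions, against the definition."""
    a, b = p1 - 1, p2 - 1
    if a == b or min(a, b) < 0 or max(a, b) + n > len(s):
        return False
    if s[a:a + n] != s[b:b + n]:
        return False

    def char_at(i):
        # None plays the role of the unique characters added at the front and end.
        return s[i] if 0 <= i < len(s) else None

    left_differs = char_at(a - 1) != char_at(b - 1)
    right_differs = char_at(a + n) != char_at(b + n)
    return left_differs and right_differs


def maximal_pairs(s):
    """Enumerate R(S) by testing every triple; only suitable for short strings."""
    triples = []
    for n in range(1, len(s)):
        for p1 in range(1, len(s) - n + 2):
            for p2 in range(p1 + 1, len(s) - n + 2):
                if is_maximal_pair(s, p1, p2, n):
                    triples.append((p1, p2, n))
    return sorted(triples)


S = "axyzvbxyzcdefxyzvh"     # hypothetical string, not the one from the slide
print(is_maximal_pair(S, 2, 14, 3))   # right neighbours are both v
print(maximal_pairs(S))
```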
17. Repeated substrings
- A maximal repeat a is a substring of S that is the substring defined by a maximal pair of S
- R'(S) is the set of maximal repeats, and |R'(S)| ≤ |R(S)|
- Previous example
- xyz and xyzv are maximal repeats of S; however, xyz is represented only once in R'(S), whereas both (2, 7, 3) and (7, 14, 3) are in R(S)
- R'(S) is smaller than R(S), as xyz shows up twice in R(S) but only once in R'(S)
18. Maximal repeated substrings
- Maximal repeats
- Input
- String S (length n)
- Output
- R'(S)
- Lemma
- If a is a maximal repeat in S, then a is the path label of an internal node v in the suffix tree T
- a does not end in the middle of an edge
19. Maximal repeated substrings
- Definition: the left character of position i is S(i-1)
- The left character of a leaf of a suffix tree T is the left character of the suffix position represented by that leaf
- A node v of T is called left diverse if at least 2 leaves in v's subtree have different left characters
- Theorem
- The string a labeling the path to an internal node v of T is a maximal repeat if and only if v is left diverse
- Left diversity captures that the characters before the occurrences of a are different
20. Example of left diverse
[Figure: two suffix trees with edge labels including ab, abc, b, and c; one node is marked as left diverse]
21. Maximal repeated substrings
- Solution method
- Construct a suffix tree for S
- There are at most n maximal repeats
- The suffix tree has n leaves
- All internal nodes except the root have at least two children
- Therefore, there are at most n internal nodes, and each maximal repeat is the path label of one of them
22. Maximal repeated substrings
- Find all left diverse nodes in linear time
- All nodes will have a left character label
- Leaf node
- Label leaves with their left character
- Internal node v
- If any child is left diverse, so is v
- If two children have different left character labels, v is left diverse
- Otherwise, v takes on the left character value of its children
- Compact representation
- Node v in T is a frontier node if
- v is left diverse
- none of v's children are left diverse
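The bottom-up labeling can be sketched on a naive suffix trie (quadratic size, standing in for the suffix tree). In an uncompacted trie the internal nodes of the suffix tree correspond to the branching nodes, so only nodes with at least two children are reported. The names `maximal_repeats` and `DIVERSE` are hypothetical, and the sketch assumes the markers `^` and `$` do not occur in S.

```python
DIVERSE = object()      # sentinel label meaning "this node is left diverse"


def maximal_repeats(s, end="$"):
    """Path labels of branching, left diverse trie nodes, which are the maximal repeats."""
    text = s + end
    root = {"kids": {}, "left": None}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["kids"].setdefault(ch, {"kids": {}, "left": None})
        # Leaf: label it with the left character of its suffix position.
        node["left"] = text[start - 1] if start > 0 else "^"

    repeats = []
    stack = [(root, "", False)]
    while stack:                                  # iterative postorder traversal
        node, label, done = stack.pop()
        kids = node["kids"]
        if not done:
            stack.append((node, label, True))
            for ch, child in kids.items():
                stack.append((child, label + ch, False))
        elif kids:
            lefts = {child["left"] for child in kids.values()}
            # One shared label is passed up; anything else makes the node left diverse.
            node["left"] = lefts.pop() if len(lefts) == 1 else DIVERSE
            if node["left"] is DIVERSE and len(kids) >= 2 and label:
                repeats.append(label)
    return sorted(repeats)


print(maximal_repeats("axyzvbxyzcdefxyzvh"))     # hypothetical example string
```

The string in the usage line is a made-up example in which xyz and xyzv are the only maximal repeats; nodes such as yz branch but always have x to their left, so they are not reported.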
23. Maximal repeated substrings
- Time complexity
- Construct a suffix tree for S: O(n)
- Find all left diverse nodes in linear time: O(n)
- Compact representation: O(k), where k is the number of maximal pairs
24. Supermaximal repeated substrings
- A supermaximal repeat a is a maximal repeat of S that never occurs as a substring of another maximal repeat of S
- Previous example
- xyzv is a supermaximal repeat of S
- xyz is NOT a supermaximal repeat of S
25. Supermaximal repeated substrings
- Supermaximal repeats
- Input
- String S (length n)
- Output
- The set of supermaximal repeats of S
- Theorem
- A left diverse node v represents a supermaximal repeat if and only if
- all of v's children are leaves
- and each has a distinct left character
26. Matching statistics
- Input
- Pattern P of length n
- Text T of length m
- Output
- ms(i) for 1 ≤ i ≤ m
- Definition of ms(i)
- For 1 ≤ i ≤ m, the matching statistic ms(i) is the length of the longest substring of T starting at position i that matches a substring somewhere in P
27. Matching statistics
- With matching statistics, one can solve several problems with less space than a suffix tree
- Exact matching example
- We'll show a solution with O(n) preprocessing time and O(m) search time, matching the traditional methods
- P matches the substring starting at position i in T if and only if ms(i) = |P|
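To make the definition concrete, here is a deliberately naive sketch that computes ms(i) straight from the definition, with no suffix tree and none of the time bounds quoted above. The names `matching_statistics` and `exact_matches` are hypothetical, and the example strings are made up for illustration.

```python
def matching_statistics(t, p):
    """Return {i: ms(i)} for 1-based i, computed directly from the definition."""
    ms = {}
    for i in range(1, len(t) + 1):
        length = 0
        # Extend while the next longer substring of T starting at i still occurs in P.
        while i - 1 + length < len(t) and t[i - 1:i + length] in p:
            length += 1
        ms[i] = length
    return ms


def exact_matches(t, p):
    """P occurs at position i of T exactly when ms(i) equals |P|."""
    ms = matching_statistics(t, p)
    return [i for i in ms if ms[i] == len(p)]


T = "abcxabcdex"
print(matching_statistics(T, "wyabcwzqabcdw"))
print(exact_matches(T, "abc"))
```

In the first call, ms(1) is 3 because abc occurs in P but abcx does not, and ms(5) is 4 because abcd occurs in P but abcde does not.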
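28. Example of matching statistics
[Table: an example with rows for the position i, the text T, and the pattern P; the entries did not survive in this text]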
29. Matching statistics
- Solution method
- Compute the suffix tree of P, retaining suffix links
- Adding the location of the substring in P
- p(i) is a location in P such that the substring starting at p(i) matches the substring of T starting at position i for exactly ms(i) positions
- Before computing the ms(i) values, mark each node in the suffix tree with the leaf number of one of its leaves
- Simply output this value when outputting the ms(i) values
30. Matching statistics
- Compute ms(1): match T against the tree
- Get ms(i+1) from ms(i)
- Assume we are at some node v in the tree
- If v is internal, follow its suffix link to s(v)
- Else, if v is a leaf, go up one level to its parent w
- If w is an internal node, follow its suffix link to s(w)
- Traverse downwards using the skip/count trick until all the characters in edge label (w,v) have been matched
- Now match against T character by character until a mismatch occurs, then output ms(i+1)
31. Applying matching statistics to the LCS problem
- Input
- Strings S and T
- Output
- Longest common substring of S and T
- Solution method
- Compute the suffix tree for the shorter string, say S
- Compute the ms(i) values for T
- The maximal ms(i) value identifies the LCS
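A small sketch of this reduction. It does not build a suffix tree; instead it relies on the fact that ms(i+1) ≥ ms(i) - 1, which is the same fact that makes the suffix-link walk efficient, and tests membership with Python's substring operator. The function names are hypothetical and indices are 0-based inside the code.

```python
def matching_statistics(t, s):
    """ms[i] for 0-based i: longest prefix of t[i:] that occurs somewhere in s."""
    ms, length = [], 0
    for i in range(len(t)):
        # Dropping the first character of the previous match leaves a match of length - 1.
        length = max(length - 1, 0)
        while i + length < len(t) and t[i:i + length + 1] in s:
            length += 1
        ms.append(length)
    return ms


def lcs_via_matching_statistics(s, t):
    """Longest common substring, read off the position with the largest ms value."""
    ms = matching_statistics(t, s)
    if not ms:
        return ""
    i = max(range(len(ms)), key=lambda k: ms[k])
    return t[i:i + ms[i]]


print(lcs_via_matching_statistics("sealiver", "superiorcalifornialives"))
```

Only the shorter string needs to be indexed; the longer one is scanned once from left to right.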
32. Suffix arrays
- Input
- Text T of length m
- Output
- Pos array
- Definition of the Pos array
- A suffix array for T, called Pos, is an array of integers in the range 1 to m specifying the lexicographic order of the m suffixes of string T
- Pos(k) = i iff the suffix of T starting at position i is the kth smallest of the m suffixes
- Add a terminating character that is lexically smallest
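33. Example of suffix arrays
- T = axfcaxgx
- Suffixes
- 1. axfcaxgx
- 2. xfcaxgx
- 3. fcaxgx
- 4. caxgx
- 5. axgx
- 6. xgx
- 7. gx
- 8. x
- 9. (empty suffix, terminating character only)
- Lexicographic order
- 9. (empty suffix)
- 1. axfcaxgx
- 5. axgx
- 4. caxgx
- 3. fcaxgx
- 7. gx
- 8. x
- 2. xfcaxgx
- 6. xgx
- Resulting array
k:      1 2 3 4 5 6 7 8 9
Pos(k): 9 1 5 4 3 7 8 2 6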
34. Suffix arrays
- Solution method
- Compute the suffix tree of T
- Do a lexical depth-first traversal of the tree, filling Pos(k) with the leaves in the order they are encountered
- Edge (v,u) is lexically smaller than edge (v,w) iff the first character of (v,u) is lexically smaller than the first character of (v,w)
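For small inputs the same array can be obtained without a suffix tree by sorting the suffixes directly. This is a different and slower technique than the traversal described above, shown only to make the definition of Pos concrete; the function name `suffix_array` is hypothetical. Python's string comparison already ranks a proper prefix before any longer string, which plays the role of the lexically smallest terminating character.

```python
def suffix_array(t):
    """Pos as a list of 1-based suffix starts, including the empty suffix at m + 1."""
    m = len(t)
    # Sort the start positions by the suffix that begins there.
    return sorted(range(1, m + 2), key=lambda i: t[i - 1:])


pos = suffix_array("axfcaxgx")
for k, i in enumerate(pos, start=1):
    print(k, i, "axfcaxgx"[i - 1:])
```

Each comparison during the sort may scan a whole suffix, so this costs far more than the linear-time route through the suffix tree.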
35. Applying suffix arrays to exact pattern matching
- Input
- Pattern P of length n
- Text T of length m
- Output
- All occurrences of P in T
- Solution method
- Compute the suffix array Pos for T
- If P occurs in T, then all of its locations are grouped consecutively in Pos
36. Applying suffix arrays to exact pattern matching
- Using binary search, find the smallest index i such that P exactly matches the first n characters of suffix Pos(i)
- Similarly, find the largest index i' such that P exactly matches the first n characters of suffix Pos(i')
- Time complexity: O(n log m)
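The two binary searches can be sketched as follows. The suffix array is built here by plain sorting, purely to keep the example self-contained, and the function name `find_occurrences` is hypothetical. Each probe compares P with the first n characters of one suffix, so the search itself makes O(log m) comparisons of O(n) characters each.

```python
def find_occurrences(t, p):
    """Return the sorted 1-based positions where p occurs in t."""
    pos = sorted(range(1, len(t) + 2), key=lambda i: t[i - 1:])   # naive suffix array
    n = len(p)

    def prefix(k):
        # First n characters of the suffix at (0-based) array index k.
        start = pos[k] - 1
        return t[start:start + n]

    lo, hi = 0, len(pos)
    while lo < hi:                      # smallest index whose prefix is >= p
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    hi = len(pos)
    while lo < hi:                      # smallest index whose prefix is > p
        mid = (lo + hi) // 2
        if prefix(mid) <= p:
            lo = mid + 1
        else:
            hi = mid

    return sorted(pos[first:lo])        # the occurrences form one consecutive block


print(find_occurrences("axfcaxgx", "ax"))
print(find_occurrences("axfcaxgx", "x"))
```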
37. Longest common prefixes
- Input
- Text T of length m
- Output
- Max(Lcp(i,j)) for 1 ≤ i, j ≤ m and i ≠ j
- Definition of Lcp(i,j): Lcp(i,j) is the length of the longest common prefix of the suffixes of T beginning at Pos(i) and Pos(j)
- Example from suffix arrays
- T = axfcaxgx, Pos(2) = 1 (axfcaxgx), Pos(3) = 5 (axgx)
- Lcp(2,3) = 2
38. Longest common prefixes
- Solution method
- We want to get the Lcp values in O(m) time
- However, there are potentially O(m^2) different pairs of Lcp values
- Crucial point
- Since this is binary search, only O(m) values are ever needed, and these have a lot of structure
39. Longest common prefixes
- Lcp(i,i+1) = string depth of the lowest common ancestor encountered during the lexical depth-first traversal of the suffix tree from the Pos(i) leaf to the Pos(i+1) leaf
- Other Lcp values
- Lcp(i,j) = min over k from i to j-1 of Lcp(k,k+1)
- Take the min of the Lcp values of the children in the binary tree of needed Lcp values (not the suffix tree)
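The relation between adjacent and general Lcp values can be checked on the running example. The sketch below computes the adjacent values by direct character comparison rather than from a suffix tree traversal, and the helper names (`common_prefix_len`, `adjacent_lcp`, `lcp`) are hypothetical. Indices i and j are 1-based, as on the slides.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def adjacent_lcp(t, pos):
    """adj[k - 1] = Lcp(k, k+1) for 1-based k, by direct comparison."""
    return [common_prefix_len(t[pos[k] - 1:], t[pos[k + 1] - 1:])
            for k in range(len(pos) - 1)]


def lcp(adj, i, j):
    """Lcp(i, j) for 1-based i < j, as the minimum of the adjacent values between them."""
    return min(adj[i - 1:j - 1])


T = "axfcaxgx"
pos = sorted(range(1, len(T) + 2), key=lambda i: T[i - 1:])   # naive suffix array
adj = adjacent_lcp(T, pos)
print(adj)
print(lcp(adj, 2, 3))    # suffixes axfcaxgx and axgx share "ax"
print(lcp(adj, 7, 9))    # suffixes x and xgx share "x"
```

Taking the minimum over a slice is linear in j - i; the binary tree of needed values mentioned above avoids that cost by storing the minimum for each interval the binary search can visit.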