Pattern Matching: Suffix Tree Applications


Slides: 40

Provided by: Heal63


Transcript and Presenter's Notes

1
Pattern Matching: Suffix Tree Applications
2
Applications

Exact string and substring matching
Longest common substrings
Finding and representing repeated substrings efficiently
Applications that lead to alternative implementations with different space/efficiency tradeoffs
Matching statistics
Suffix Arrays

3
Exact Set matching

Input
A set of patterns P = {P1, P2, ..., Pk} of total length n
Text T of length m
Output
Positions of all occurrences of each pattern Pi in T
Solution method
Preprocess to create suffix tree for T
O(m) time, O(m) space
Maximally match each Pi in suffix tree
O(|P1|) + O(|P2|) + ... + O(|Pk|) = O(n)
Output all leaf positions below match point
O(k) time, where k here is the total number of matches (not the number of patterns)
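The solution method above can be sketched in a few lines, under a loud assumption: instead of a linear-time suffix tree, this sketch builds a naive uncompressed suffix trie in quadratic time and space, and every node stores the start positions of the suffixes passing through it. The function names are hypothetical. It only illustrates the idea that the leaves below the match point are the occurrences.

```python
def build_suffix_trie(text):
    """Naive O(m^2) suffix trie (a stand-in for a real suffix tree).

    Every node records the 1-based start positions of the suffixes
    that pass through it.
    """
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)
    return root


def find_occurrences(trie, pattern):
    """Walk the pattern down from the root; the suffixes below the
    match point are exactly the occurrences of the pattern."""
    node = trie
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []  # the pattern falls off the trie: no occurrence
    return sorted(node["starts"])


def exact_set_matching(text, patterns):
    """Preprocess the text once, then match every pattern against it."""
    trie = build_suffix_trie(text)
    return {p: find_occurrences(trie, p) for p in patterns}
```

Calling `exact_set_matching("ababc", ["ab", "bc", "d"])` reports the 1-based start positions of each pattern, with an empty list for a pattern that does not occur.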

4
Exact set matching using Aho-Corasick

The Aho-Corasick algorithm is a classical solution to exact set matching
Build a keyword tree of the set of patterns P
A keyword tree for a pattern set P is a rooted tree T such that
Each edge e is labeled by a character
Any two edges leaving a node have different labels
Define the label L(v) of a node v as the concatenation of edge labels on the path from the root to v
For each Pi in P there is a node v such that L(v) = Pi, and for each leaf v there is a Pi in P such that L(v) = Pi
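A minimal sketch of the keyword tree defined above, assuming nodes are numbered in creation order with the root as node 0. The names `build_keyword_tree`, `pattern_at` and `path_labels` are hypothetical and chosen for illustration only.

```python
def build_keyword_tree(patterns):
    """Return (children, pattern_at) for the keyword tree of `patterns`.

    children[v] maps an edge character to the child node id.
    pattern_at[v] is the index i with L(v) = patterns[i], or None.
    """
    children = [{}]
    pattern_at = [None]
    for i, pattern in enumerate(patterns):
        v = 0
        for ch in pattern:
            if ch not in children[v]:
                # No edge with this label yet: create a new node.
                children.append({})
                pattern_at.append(None)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        pattern_at[v] = i
    return children, pattern_at


def path_labels(children):
    """Compute L(v), the concatenated edge labels from the root to v."""
    labels = {0: ""}
    stack = [0]
    while stack:
        v = stack.pop()
        for ch, child in children[v].items():
            labels[child] = labels[v] + ch
            stack.append(child)
    return labels
```

Because edges leaving a node are stored in a dictionary keyed by character, two edges from the same node can never carry the same label, which is the second property in the definition.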

5
Example of Aho-Corasick

Example: P = {abce, abe, dce, ac}

[Figure: keyword tree for P with numbered nodes; the drawing is not preserved in this transcript]

6
Example of Aho-Corasick

Example: P = {abce, ababc, abac}

[Figure: keyword tree for P with numbered nodes and resume links; the drawing is not preserved in this transcript]

Resume Link
As in the KMP algorithm, if a mismatch occurs at node v, we resume the comparison by following the resume link of its parent.
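The resume link idea can be sketched as follows. This is a common textbook formulation of Aho-Corasick, not necessarily the exact variant drawn on the slides: the resume (failure) link of a node points to the node whose label is the longest proper suffix of its own label that is also in the keyword tree. The names `build_automaton` and `search` are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    goto = [{}]   # keyword tree edges
    fail = [0]    # resume (failure) link of each node
    out = [[]]    # indices of the patterns recognised at each node
    for i, pattern in enumerate(patterns):
        v = 0
        for ch in pattern:
            if ch not in goto[v]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[v][ch] = len(goto) - 1
            v = goto[v][ch]
        out[v].append(i)
    # Breadth-first pass: a child's link is found by following the
    # links that start at its parent.
    queue = deque(goto[0].values())
    while queue:
        v = queue.popleft()
        for ch, child in goto[v].items():
            f = fail[v]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Patterns ending at the link target also end here.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def search(text, patterns):
    """Return sorted (1-based start position, pattern) pairs."""
    goto, fail, out = build_automaton(patterns)
    matches = []
    v = 0
    for pos, ch in enumerate(text, start=1):
        while v and ch not in goto[v]:
            v = fail[v]  # mismatch: resume from the link instead of restarting
        v = goto[v].get(ch, 0)
        for i in out[v]:
            matches.append((pos - len(patterns[i]) + 1, patterns[i]))
    return sorted(matches)
```

The text pointer never moves backwards, which is what makes the search phase linear in the text length plus the number of matches.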
7
Aho-Corasick vs. Suffix tree

Aho-Corasick Approach
O(n) preprocess time and space to build keyword tree of set of patterns P
O(m + k) search time
Linear time by using the resume link
Suffix Tree Approach
O(m) preprocess time and space to build suffix tree of T
O(n + k) search time
Using matching statistics (to be defined), one can make this tradeoff similar to that of Aho-Corasick

8
Substring problem

Input
Pattern P of length n
A set of texts {Ti} of total length m
Output
Positions of all occurrences of P in each text Ti
Solution method
Preprocess to create generalized suffix tree for {Ti}
O(m) time, O(m) space
Maximally match P in generalized suffix tree
Output all leaf positions below match point
O(n + k) time, where k is the total number of matches

9
Generalized suffix tree
T1 = ababc
T2 = abd

[Figure: generalized suffix tree for T1 and T2; the drawing is not preserved in this transcript]

10
Longest Common Substring problem

Input
Strings S and T
Output
The longest common substring of S and T (and its
position in S and T)
Solution method
Preprocess to create generalized suffix tree for {S, T}
Mark each node by whether its subtree contains a leaf of S, a leaf of T, or both
A simple postorder tree traversal does this
Among the nodes marked "both", the path label of the node with greatest string depth is the longest common substring of S and T
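A minimal sketch of the marking idea, assuming a naive uncompressed generalized suffix trie (quadratic size) in place of a true generalized suffix tree. Marks are set while suffixes are inserted rather than in a separate traversal, which gives the same marks for this simplified structure. The function name is hypothetical.

```python
def longest_common_substring(s, t):
    """Deepest trie node reached by suffixes of both s and t."""
    root = {"children": {}, "mark": 0}
    # Bit 1 marks "some suffix of s passes through", bit 2 the same for t.
    for bit, text in ((1, s), (2, t)):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["children"].setdefault(ch, {"children": {}, "mark": 0})
                node["mark"] |= bit
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node["children"].items():
            if child["mark"] == 3:  # only descend through nodes marked "both"
                stack.append((child, path + ch))
    return best
```

When several common substrings share the maximum length, this sketch returns whichever one the traversal reaches first.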

11
Common substrings of length k problem

Input
Strings S and T
Integer k
Output
All common substrings of S and T (and their positions in S and T) of length at least k
Solution method
Same as previous problem
Look for all nodes marked with both leaf labels whose string depth is at least k

12
Longest Common Substrings of more than two Strings

Definition: For a given set of K strings, l(j) for 2 ≤ j ≤ K is the length of the longest common substring belonging to at least j of the K strings
Example: {abcedfg, cbcedfa, dbcedg, cbceg, acea}

13
Longest Common Substrings of more than two Strings

Input
Strings S1, ..., SK of total length n
Output
l(j) (and positions in Si) for 2 ≤ j ≤ K
Solution method
Build a generalized suffix tree for the K strings
each string has a unique end character, so each
leaf shows up only once

14
Longest Common Substrings of more than two Strings

Build a generalized suffix tree for the K strings
each string has a unique end character, so each
leaf shows up only once
Define c(v) = number of distinct leaf labels in the subtree rooted at node v, and d(v) = string depth from the root to node v
Given c(v) and d(v), do a simple traversal of the tree to find l(j) for j = 2, ..., K and pointers to locations in the strings
Computing c(v) efficiently
Counting the number of leaves is not correct, as some leaves may have the same label
Use a length-K bit vector, 1 bit per string in the set
OR your way up the tree
Each OR operation takes O(K) time, which gives O(Kn) running time
Can be improved to O(n) later
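The bit-vector approach can be sketched as below, again assuming a naive uncompressed suffix trie in place of the generalized suffix tree. A Python integer serves as the length-K bit vector, c(v) is its number of set bits, and d(v) is the trie depth. The function name `lcs_lengths` is hypothetical.

```python
def lcs_lengths(strings):
    """Return {j: l(j)} for j = 2 .. K."""
    root = {"children": {}, "mask": 0}
    for i, text in enumerate(strings):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["children"].setdefault(ch, {"children": {}, "mask": 0})
                node["mask"] |= 1 << i  # bit i: string i has a suffix through this node
    best = {j: 0 for j in range(2, len(strings) + 1)}
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        count = bin(node["mask"]).count("1")  # c(v): number of distinct strings below
        for j in range(2, count + 1):
            best[j] = max(best[j], depth)     # depth plays the role of d(v)
        for child in node["children"].values():
            stack.append((child, depth + 1))
    return best
```

A substring shared by c strings also counts for every j up to c, which is why the inner loop updates all of those entries.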

15
Repeated Substrings

Definition
A maximal pair in S is a pair of identical substrings a and b in S such that the character to the immediate left (right) of a is different from the character to the immediate left (right) of b.
Add unique characters to the front and end of S so that prefixes and suffixes are included.
Representation: (p1, p2, n)
The two starting positions and the length of the maximal pair
R(S) is the set of all triples representing maximal pairs in S

16
Example of Repeated substrings

S: [example string not preserved in this transcript]
(2, 7, 3) is a maximal pair
(7, 14, 3) is a maximal pair
(2, 14, 3) is not a maximal pair
(2, 14, 4) is a maximal pair
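A minimal sketch of the maximal pair test. The example string from this slide did not survive in the transcript, so the string used in the usage note is a hypothetical one, chosen only so that the four triples listed above behave as stated. Positions are 1-based as on the slides, and a position outside the string is treated as a unique character.

```python
def is_maximal_pair(s, p1, p2, n):
    """Check whether (p1, p2, n) is a maximal pair of s (1-based positions)."""
    a, b = p1 - 1, p2 - 1  # convert to 0-based indices
    if a == b or min(a, b) < 0 or max(a, b) + n > len(s):
        return False
    if s[a:a + n] != s[b:b + n]:
        return False  # the two substrings are not identical

    def char_at(i):
        # None stands for the unique character added before or after s.
        return s[i] if 0 <= i < len(s) else None

    left_differs = char_at(a - 1) != char_at(b - 1)
    right_differs = char_at(a + n) != char_at(b + n)
    return left_differs and right_differs
```

With the hypothetical string `"axyzvbxyzcdefxyzvq"`, the triples (2, 7, 3), (7, 14, 3) and (2, 14, 4) pass the test, while (2, 14, 3) fails because both occurrences are followed by the same character.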

17
Repeated Substrings

A maximal repeat a is a substring of S that is the substring defined by some maximal pair of S
R'(S) is the set of maximal repeats, and |R'(S)| ≤ |R(S)|
Previous example
xyz and xyzv are maximal repeats of S; however, xyz is represented only once in R'(S), whereas both (2, 7, 3) and (7, 14, 3) are in R(S)
R'(S) is smaller than R(S), as xyz shows up twice in R(S) but only once in R'(S)

18
Maximal Repeated Substrings

Maximal repeats
Input
String S (length n)
Output
R'(S), the set of maximal repeats of S
Lemma
If a is a maximal repeat in S, then a is the path-label of an internal node v in T, the suffix tree for S
a does not end in the middle of an edge

19
Maximal Repeated substrings

Definition: the left character of position i is S(i-1)
The left character of a leaf of a suffix tree T is the left character of the suffix position represented by that leaf
A node v of T is called left diverse if at least 2 leaves in v's subtree have different left characters
Theorem
String a labeling the path to an internal node v of T is a maximal repeat if and only if v is left diverse
This captures the requirement that the characters preceding the occurrences of a differ

20
Example of left diverse

S = ababc

[Figure: suffix tree for S with the left diverse nodes marked; the drawing is not preserved in this transcript]

21
Maximal Repeated substrings

Solution method
Construct suffix tree for S
There are at most n maximal repeats
The suffix tree has n leaves
All internal nodes except the root have at least two children
Therefore, there are at most n internal nodes, and each maximal repeat labels one of them

22
Maximal Repeated substrings

Find all left diverse nodes in linear time
All nodes will have a left character label
Leaf node
Label leaves with their left character
Internal node v
If any child is left diverse, so is v
If two children have different left character
labels, v is left diverse
Otherwise, v takes on the left character shared by its children
Compact representation
Node v in T is a frontier node if
v is left diverse
none of v's children are left diverse
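The bottom-up labeling rules above can be sketched on a naive uncompressed suffix trie, used here as a stand-in for the suffix tree. Under that assumption, the internal nodes of the suffix tree correspond to trie nodes with at least two children. The end marker `$` is assumed not to occur in the input, and the names are hypothetical. The recursion depth grows with the string length, so this is meant for short examples only.

```python
DIVERSE = object()  # label for a node whose leaves have different left characters


class Node:
    def __init__(self):
        self.children = {}
        self.left = None  # left character shared by every leaf below, or DIVERSE


def maximal_repeats(s, end="$"):
    text = s + end  # assumption: `end` does not occur in s
    root = Node()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, Node())
        # Leaf rule: label the leaf with its left character ("" if there is none).
        node.left = text[i - 1] if i > 0 else ""

    found = []

    def label(node, path):
        if not node.children:
            return node.left
        labels = {label(child, path + ch) for ch, child in node.children.items()}
        # One shared label is inherited; a diverse child or two different
        # left characters make this node left diverse.
        node.left = labels.pop() if len(labels) == 1 else DIVERSE
        if node.left is DIVERSE and len(node.children) > 1 and path:
            found.append(path)  # branching, left diverse, and not the root
        return node.left

    label(root, "")
    return sorted(found)
```

For the slides' string `ababc` this sketch reports `ab` as the only maximal repeat: its two occurrences have different left characters, while both occurrences of `b` are preceded by `a`.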

23
Maximal Repeated substrings

Time complexity
Construct suffix tree for S: O(n)
Find all left diverse nodes in linear time: O(n)
Compact representation: O(k), where k is the number of maximal pairs

24
Supermaximal repeated substrings

A supermaximal repeat a is a maximal repeat of S
that never occurs as a substring of another
maximal repeat of S
Previous example
xyzv is a supermaximal repeat of S
xyz is NOT a supermaximal repeat of S

25
Supermaximal repeated substrings

Supermaximal repeats
Input
String S (length n)
Output
The set of supermaximal repeats of S
Theorem
A left diverse node v represents a supermaximal repeat if and only if
all of v's children are leaves
and each has a distinct left character
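As a cross-check of the definition, the following brute-force sketch does not use the suffix tree theorem at all. It groups the occurrences of every substring, keeps those whose occurrences differ on both the left and the right, and then filters out repeats contained in another repeat. It takes cubic time or worse and is intended only for small examples. The input string in the usage note is hypothetical.

```python
from collections import defaultdict


def maximal_repeats(s):
    occurrences = defaultdict(list)
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            occurrences[s[i:j]].append(i)
    repeats = set()
    for sub, starts in occurrences.items():
        # Positions outside the string get unique stand-ins, so they
        # differ from every real character.
        lefts = {s[i - 1] if i > 0 else ("start", i) for i in starts}
        rights = {s[i + len(sub)] if i + len(sub) < len(s) else ("end", i)
                  for i in starts}
        if len(starts) > 1 and len(lefts) > 1 and len(rights) > 1:
            repeats.add(sub)
    return repeats


def supermaximal_repeats(s):
    repeats = maximal_repeats(s)
    # Keep only the maximal repeats not contained in another maximal repeat.
    return {r for r in repeats
            if not any(r != other and r in other for other in repeats)}
```

With the hypothetical string `"axyzvbxyzcdefxyzvq"`, the maximal repeats are `xyz` and `xyzv`, and only `xyzv` is supermaximal, in line with the example on the previous slide.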

26
Matching Statistics

Input
Pattern P of length n
Text T of length m
Output
Compute ms(i) for 1 ≤ i ≤ m
Definition of ms(i)
For 1 ≤ i ≤ m, matching statistic ms(i) is the length of the longest substring of T starting at position i that matches a substring somewhere in P.

27
Matching Statistics

With matching statistics, one can solve several
problems with less space than a suffix tree
Exact matching example
We'll show a solution with O(n) preprocessing time and O(m) search time, matching the traditional methods
P matches the substring starting at position i in T if and only if ms(i) = |P|
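To make the definition concrete, here is a naive sketch that computes ms(i) straight from the definition with substring tests. It does not achieve the O(n) preprocessing and O(m) search bounds stated above; it only shows what the values mean and how exact matching reads them. The function names and the example strings are hypothetical.

```python
def matching_statistics(text, pattern):
    """ms[i] (0-based) is the length of the longest substring of `text`
    starting at index i that occurs somewhere in `pattern`."""
    ms = []
    for i in range(len(text)):
        length = 0
        # Extend the match one character at a time while it still occurs in pattern.
        while i + length < len(text) and text[i:i + length + 1] in pattern:
            length += 1
        ms.append(length)
    return ms


def occurrences(text, pattern):
    """1-based positions i where ms(i) equals the pattern length."""
    ms = matching_statistics(text, pattern)
    return [i + 1 for i, value in enumerate(ms) if value == len(pattern)]
```

For example, `occurrences("axfcaxgx", "ax")` returns the positions where the matching statistic reaches the full pattern length.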

28
Example of Matching Statistics
[Table: T, P and the ms(i) value for each position i; not preserved in this transcript]

29
Matching Statistics

Solution method
Compute the suffix tree of P, retaining suffix links
Adding the location of the substring in P
p(i): a location in P such that the substring at p(i) matches the substring of T starting at position i for exactly ms(i) positions
Before computing ms(i) values, mark each node in the suffix tree with the leaf number of one of its leaves
Simply output this value when outputting ms(i) values

30
Matching Statistics

Compute ms(1): match T against the tree
Get ms(i+1) from ms(i)
Assume we are at some node v in the tree
If it is internal, follow the suffix link to s(v)
Else if it is a leaf, go up one level to its parent w
If w is an internal node, follow the suffix link to s(w)
Traverse downwards using the skip/count trick until we have matched all the characters in edge label (w,v)
Now match against T character by character until we have a mismatch and can output ms(i+1)

31
Applying matching statistics to LCS problem

Input
strings S and T
Output
longest common substring of S and T
Solution method
Compute suffix tree for shorter string, say S
Compute ms(i) values for T
Maximal ms(i) value identifies LCS
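A short sketch of this reduction, assuming the matching statistics are computed naively with substring tests instead of a suffix tree of S. The position with the largest ms(i) value gives the longest common substring. The function name is hypothetical.

```python
def lcs_via_matching_statistics(s, t):
    """Scan t, tracking the largest matching statistic against s."""
    best_start, best_len = 0, 0
    for i in range(len(t)):
        length = 0
        # Naive ms(i): longest substring of t starting at i that occurs in s.
        while i + length < len(t) and t[i:i + length + 1] in s:
            length += 1
        if length > best_len:
            best_start, best_len = i, length
    return t[best_start:best_start + best_len]
```

Only the index for one string is needed at a time, which is the space advantage over building a generalized suffix tree for both strings.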

32
Suffix Arrays

Input
Text T of length m
Output
Pos array
Definition of Pos array
A suffix array for T, called Pos, is an array of integers in the range 1 to m specifying the lexicographic order of the m suffixes of string T
Pos(k) = i iff the suffix of T starting at position i is the kth smallest of the m suffixes
Add a terminating character which is lexically smallest

33
Example of Suffix Arrays

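T = axfcaxgx

Suffixes (by starting position):
1. axfcaxgx
2. xfcaxgx
3. fcaxgx
4. caxgx
5. axgx
6. xgx
7. gx
8. x
9. (empty suffix, only the terminating character)

Lexicographic order:
k   Pos(k)   suffix
1   9        (empty)
2   1        axfcaxgx
3   5        axgx
4   4        caxgx
5   3        fcaxgx
6   7        gx
7   8        x
8   2        xfcaxgx
9   6        xgx
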
34
Suffix Arrays

Solution method
Compute suffix tree of T
Do a lexical depth-first traversal of the suffix tree, filling Pos(k) with the leaves in the order they are encountered
Edge (v,u) is lexically smaller than edge (v,w) iff the first character of (v,u) is lexically smaller than the first character of (v,w)
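For small inputs the same array can be obtained without a suffix tree. The sketch below simply sorts the suffixes, which is a swapped-in technique and not the linear-time traversal described above. It includes the empty suffix left over once the lexically smallest terminator is appended, and it reports 1-based positions as on the slides. The function name is hypothetical.

```python
def suffix_array(text):
    """Return Pos as a list of 1-based start positions in lexicographic order."""
    # Index len(text) is the empty suffix, which sorts before everything else,
    # just like a terminating character that is lexically smallest.
    order = sorted(range(len(text) + 1), key=lambda i: text[i:])
    return [i + 1 for i in order]
```

For `axfcaxgx` this yields the order 9, 1, 5, 4, 3, 7, 8, 2, 6 shown in the example slide.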

35
Applying Suffix Arrays to exact pattern matching

Input
Pattern P of length n
Text T of length m
Output
All occurrences of P in T
Solution method
Compute suffix array Pos for T
If P occurs in T, then all of its occurrences will be grouped consecutively in Pos

36
Applying Suffix Arrays to exact pattern matching

Using binary search, find the smallest index i such that P exactly matches the first n characters of suffix Pos(i)
Similarly, find the largest index i' such that P exactly matches the first n characters of suffix Pos(i')
Time complexity O(n log m)
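The two binary searches can be sketched as follows, assuming the suffix array is built by plain sorting and that each comparison looks only at the first n characters of a suffix. The function name is hypothetical.

```python
def find_occurrences(text, pattern):
    """Return the sorted 1-based positions where `pattern` occurs in `text`."""
    n = len(pattern)
    pos = sorted(range(len(text)), key=lambda i: text[i:])  # 0-based suffix array

    def prefix(k):
        # First n characters of the k-th smallest suffix.
        return text[pos[k]:pos[k] + n]

    # First binary search: smallest index whose prefix is >= pattern.
    lo, hi = 0, len(pos)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Second binary search: smallest index whose prefix is > pattern.
    hi = len(pos)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(p + 1 for p in pos[first:lo])
```

Each of the O(log m) steps compares at most n characters, which gives the O(n log m) bound stated above (excluding the time to build the array).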

37
Longest common prefixes

Input
Text T of length m
Output
Max(Lcp(i,j)), for 1 ≤ i, j ≤ m and i ≠ j
Definition of Lcp(i,j): Lcp(i,j) is the length of the longest common prefix of the suffixes of T beginning at Pos(i) and Pos(j).
Example from Suffix Arrays
T = axfcaxgx, Pos(2) = 1 (axfcaxgx), Pos(3) = 5 (axgx)
Lcp(2,3) = 2

38
Longest common prefixes

Solution method
We want to get the Lcp values in O(m) time
However, there are potentially O(m^2) different pairs (i,j) with an Lcp value
Crucial point
Since the Lcp values are only used during binary search, only O(m) of them are ever needed, and these have a lot of structure

39
Longest common prefixes

Lcp(i,i+1) = string depth of the lowest common ancestor encountered during the lexical depth-first traversal of the suffix tree from the Pos(i) leaf to the Pos(i+1) leaf
Other Lcp values
Lcp(i,j) = min over k from i to j-1 of Lcp(k,k+1)
Take the min of the Lcp values of the children in the binary tree of needed Lcp values (not the suffix tree)
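The relation between adjacent and general Lcp values can be checked with a small sketch. It assumes the suffix array is built by sorting and the adjacent values by direct character comparison, not by the suffix tree traversal described above. Ranks are 1-based as on the slides, and the helper names are hypothetical.

```python
def common_prefix_length(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def build(text):
    """Return (pos, adjacent), where adjacent[k] is Lcp(k+1, k+2) in 1-based ranks."""
    # Index len(text) is the empty suffix that the smallest terminator leaves behind.
    pos = sorted(range(len(text) + 1), key=lambda i: text[i:])
    adjacent = [common_prefix_length(text[pos[k]:], text[pos[k + 1]:])
                for k in range(len(pos) - 1)]
    return pos, adjacent


def lcp(adjacent, i, j):
    """Lcp(i, j) for 1-based ranks i < j: the minimum of Lcp(k, k+1), k = i .. j-1."""
    return min(adjacent[i - 1:j - 1])
```

For `axfcaxgx`, `lcp(adjacent, 2, 3)` reproduces the value Lcp(2,3) = 2 from the earlier example.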
