Pattern Matching: Suffix Tree Applications - PowerPoint PPT Presentation

Slides: 40
Provided by: Heal63
Transcript and Presenter's Notes


1
Pattern Matching: Suffix Tree Applications
2
Applications
  • Exact string and substring matching
  • Longest common substrings
  • Finding and representing repeated substrings
    efficiently
  • Applications that lead to alternative,
    space-efficient implementations
  • Matching statistics
  • Suffix Arrays

3
Exact Set matching
  • Input
  • A set of patterns P = {P1, P2, ..., Pk} of total
    length n
  • Text T of length m
  • Output
  • Positions of all occurrences of each pattern Pi
    in T
  • Solution method
  • Preprocess to create suffix tree for T
  • O(m) time, O(m) space
  • Maximally match each Pi in suffix tree
  • O(|P1|) + O(|P2|) + ... + O(|Pk|) = O(n)
  • Output all leaf positions below match point
  • O(k) time, where k here is the total number of
    matches
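
The solution method above can be sketched as follows. This is a minimal illustration, assuming an uncompressed suffix trie (quadratic space) stands in for the linear-size suffix tree; the names `build_suffix_trie`, `find_all`, and `exact_set_match` are hypothetical, and patterns are assumed non-empty. Each node stores the start positions of the suffixes passing through it, which are exactly the leaf positions below the match point.

```python
def build_suffix_trie(text):
    """Uncompressed suffix trie (an assumption: a real suffix tree is compressed).

    Each node keeps the 1-based start positions of the suffixes that pass
    through it, i.e. the leaf positions found below that node.
    """
    root = {"children": {}, "starts": []}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(i + 1)
    return root


def find_all(root, pattern):
    """Match the pattern from the root; return the positions below the match point."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []          # the pattern falls off the trie: no occurrence
    return node["starts"]


def exact_set_match(text, patterns):
    """Preprocess the text once, then match every pattern against it."""
    root = build_suffix_trie(text)
    return {p: find_all(root, p) for p in patterns}


print(exact_set_match("axfcaxgx", ["ax", "x", "xy"]))
```

Preprocessing depends only on T, so the same trie can serve any number of pattern sets.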

4
Exact set matching using Aho-Corasick
  • Aho-Corasick algorithm is a classical solution to
    exact set matching
  • build keyword tree of set of patterns P
  • A keyword tree for a pattern set P is a rooted
    tree T such that
  • Each edge e is labeled by a character
  • Any two edges leaving a node have different labels
  • Define L(v) for a node v as the concatenation of
    edge labels on the path from the root to v
  • For each Pi in P there is a node v such that
    L(v) = Pi, and for each leaf v there is a Pi in P
    such that L(v) = Pi

5
Example of Aho-Corasick
  • Example: P = {abce, abe, dce, ac}

[Figure: keyword tree for P with numbered nodes]

6
Example of Aho-Corasick
  • Example: P = {abce, ababc, abac}

[Figure: keyword tree for P with numbered nodes and resume links]

Resume Link
Like the KMP algorithm, if a mismatch occurs at node
v, the comparison resumes by following the resume
link of its parent.
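
A minimal sketch of the keyword tree with resume links, assuming the usual Aho-Corasick formulation in which the resume link of a node points to the node for the longest proper suffix of its label that is also in the tree. The names `build_automaton` and `search` are hypothetical; reported positions are 1-based to match the slides.

```python
from collections import deque


def build_automaton(patterns):
    """Keyword tree as parallel lists: goto[v] maps a character to a child,
    fail[v] is the resume link, out[v] lists the patterns ending at v."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(p)
    queue = deque(goto[0].values())      # depth-1 nodes resume at the root
    while queue:
        v = queue.popleft()
        for ch, child in goto[v].items():
            f = fail[v]                  # start from the parent's resume link
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def search(text, patterns):
    """Return (1-based start, pattern) for every occurrence, in text order."""
    goto, fail, out = build_automaton(patterns)
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]            # mismatch: follow resume links
        node = goto[node].get(ch, 0)
        for p in out[node]:
            hits.append((i - len(p) + 2, p))
    return hits


print(search("abcedce", ["abce", "abe", "dce", "ac"]))
```

The text pointer never moves backwards; only the tree position changes on a mismatch, which is what makes the search linear in the text length plus the number of matches.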
7
Aho-Corasick vs. Suffix tree
  • Aho-Corasick Approach
  • O(n) preprocess time and space
  • to build keyword tree of set of patterns P
  • O(m + k) search time, where k is the number of
    matches
  • Linear time by using the resume link
  • Suffix Tree Approach
  • O(m) preprocess time and space
  • to build suffix tree of T
  • O(n + k) search time
  • Using matching statistics (defined later), this
    tradeoff can be made similar to that of
    Aho-Corasick

8
Substring problem
  • Input
  • Pattern P of length n
  • A set of texts {Ti} of total length m
  • Output
  • Position of all occurrences of P in each Text Ti
  • Solution method
  • Preprocess to create generalized suffix tree for
    Ti
  • O(m) time, O(m) space
  • Maximally match P in generalized suffix tree
  • Output all leaf positions below match point
  • O(n + k) time, where k is the total number of
    matches

9
Generalized suffix tree
  • T1 = ababc
  • T2 = abd

[Figure: generalized suffix tree for T1 and T2]

10
Longest Common Substring problem
  • Input
  • Strings S and T
  • Output
  • The longest common substring of S and T (and its
    position in S and T)
  • Solution method
  • Preprocess to create generalized suffix tree for
    S,T
  • Mark each node by whether its subtree contains a
    leaf of S, of T, or of both
  • A simple postorder tree traversal does this
  • Path label of the node marked "both" with the
    greatest string depth is the longest common
    substring of S and T
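
The marking idea can be sketched as follows, assuming an uncompressed generalized suffix trie instead of a true generalized suffix tree (so the sketch is quadratic, not linear). For brevity the marks are set while inserting suffixes rather than by a separate postorder pass; the function name `longest_common_substring` is hypothetical.

```python
def longest_common_substring(s, t):
    """Deepest trie node whose subtree holds suffixes of both s and t."""
    root = {"children": {}, "mark": 0}
    for bit, text in ((1, s), (2, t)):           # bit 1 marks s, bit 2 marks t
        for i in range(len(text)):
            node = root
            for ch in text[i:]:
                node = node["children"].setdefault(ch, {"children": {}, "mark": 0})
                node["mark"] |= bit
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        if len(label) > len(best):
            best = label
        for ch, child in node["children"].items():
            if child["mark"] == 3:               # only follow nodes seen in both strings
                stack.append((child, label + ch))
    return best


print(longest_common_substring("ababc", "abd"))
```

When several common substrings share the maximum length, this sketch returns whichever one the traversal reaches first.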

11
Common substrings of length k problem
  • Input
  • Strings S and T
  • Integer k
  • Output
  • All common substrings of S and T (and their
    positions in S and T) of length at least k
  • Solution method
  • Same as previous problem
  • Look for all nodes marked with both leaf labels
    (S and T) whose string depth is at least k

12
Longest Common Substrings of more than two Strings
  • Definition: For a given set of K strings, l(j)
    for 2 ≤ j ≤ K is the length of the longest
    substring common to at least j of the K strings
  • Example: {abcedfg, cbcedfa, dbcedg, cbceg, acea}

13
Longest Common Substrings of more than two Strings
  • Input
  • Strings S1, ..., SK of total length n
  • Output
  • l(j) (and positions in Si) for 2 ≤ j ≤ K
  • Solution method
  • Build a generalized suffix tree for the K strings
  • each string has a unique end character, so each
    leaf shows up only once

14
Longest Common Substrings of more than two Strings
  • Build a generalized suffix tree for the K strings
  • each string has a unique end character, so each
    leaf shows up only once
  • Define c(v) = number of distinct leaf labels
    (string identifiers) in the subtree rooted at
    node v, and d(v) = string depth from root to
    node v
  • Given c(v) and d(v), do a simple traversal of
    the tree to find l(j) for j = 2, ..., K and
    pointers to locations in the strings
  • Computing c(v) efficiently
  • Counting the number of leaves is not correct, as
    some leaves may have the same label
  • Use a length-K bit vector, 1 bit per string in
    the set
  • OR your way up the tree
  • Each OR op takes O(K) time, which gives O(Kn)
    running time
  • Can be improved to O(n) later
15
Repeated Substrings
  • Definition
  • A maximal pair in S is a pair of identical
    substrings a and b in S such that the character
    to the immediate left (right) of a is different
    from the character to the immediate left (right)
    of b.
  • Add unique characters to front and end of S to
    include prefixes and suffixes.
  • Representation: (p1, p2, n)
  • starting positions and length of the maximal pair
  • R(S) is the set of all triples representing
    maximal pairs in S

16
Example of Repeated substrings
  • S
  • (2, 7, 3) is a maximal pair
  • (7, 14, 3) is a maximal pair
  • (2, 14, 3) is not a maximal pair
  • (2, 14, 4) is a maximal pair

17
Repeated Substrings
  • A maximal repeat a is a substring in S that is
    the substring defined by a maximal pair of S
  • R'(S) is the set of maximal repeats, and
    |R'(S)| ≤ |R(S)|
  • Previous example
  • xyz and xyzv are maximal repeats of S; however,
    xyz is represented only once in R'(S), while both
    (2, 7, 3) and (7, 14, 3) are in R(S)
  • R'(S) is smaller than R(S) because xyz shows up
    twice in R(S) but only once in R'(S)

18
Maximal Repeated Substrings
  • Maximal repeats
  • Input
  • String S (length n)
  • Output
  • R(S)
  • Lemma
  • If a is a maximal repeat in S, then a is the
    path-label of an internal node v in T
  • a does not end in the middle of an edge

19
Maximal Repeated substrings
  • Definition: the left character of position i is
    S(i-1)
  • The left character of a leaf of a suffix tree T
    is the left character of the suffix position
    represented by that leaf
  • A node v of T is called left diverse if at least
    2 leaves in v's subtree have different left
    characters
  • Theorem
  • String a labeling the path to an internal node v
    of T is a maximal repeat if and only if v is left
    diverse
  • Left diversity captures that the characters
    before the occurrences of a differ

20
Example of left diverse
  • S = ababc

[Figure: suffix tree for S with the left diverse node marked]
21
Maximal Repeated substrings
  • Solution method
  • Construct suffix tree for S
  • There are at most n maximal repeats
  • The suffix tree has n leaves
  • All internal nodes except the root have at least
    two children
  • Therefore there are at most n internal nodes,
    and each maximal repeat is the path-label of
    one of them

22
Maximal Repeated substrings
  • Find all left diverse nodes in linear time
  • All nodes will have a left character label
  • Leaf node
  • Label leaves with their left character
  • Internal node v
  • If any child is left diverse, so is v
  • If two children have different left character
    labels, v is left diverse
  • Otherwise, take on left character value of
    children
  • Compact representation
  • Node v in T is a frontier node if
  • v is left diverse
  • none of v's children are left diverse
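
The bottom-up labeling above can be sketched as follows, assuming an uncompressed suffix trie in which the branching nodes correspond to the internal nodes of the suffix tree. The sentinel characters "^" and "$" (assumed not to occur in S) play the role of the unique characters added to the front and end; `maximal_repeats`, `LEAF`, and `DIVERSE` are hypothetical names.

```python
LEAF = object()      # key under which a leaf stores its left character
DIVERSE = object()   # label of a left diverse node


def maximal_repeats(s, front="^", end="$"):
    """Path-labels of the left diverse branching nodes (root excluded)."""
    text = s + end
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node[LEAF] = text[i - 1] if i > 0 else front   # left character of suffix i
    found = []

    def visit(node, label):
        if LEAF in node:
            return node[LEAF]
        marks = [visit(child, label + ch) for ch, child in node.items()]
        # Left diverse if a child is left diverse or two children disagree.
        if any(m is DIVERSE or m != marks[0] for m in marks):
            if label and len(node) >= 2:   # a branching node, i.e. a suffix-tree node
                found.append(label)
            return DIVERSE
        return marks[0]                    # otherwise inherit the children's character

    visit(root, "")
    return sorted(found)


print(maximal_repeats("ababc"))
```

For S = ababc only ab qualifies: b also repeats, but both of its occurrences are preceded by a, so its node is not left diverse.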

23
Maximal Repeated substrings
  • Time complexity
  • Construct suffix tree for S: O(n)
  • Find all left diverse nodes in linear time: O(n)
  • Compact representation: O(k), where k is the
    number of maximal pairs

24
Supermaximal repeated substrings
  • A supermaximal repeat a is a maximal repeat of S
    that never occurs as a substring of another
    maximal repeat of S
  • Previous example
  • xyzv is a supermaximal repeat of S
  • xyz is NOT a supermaximal repeat of S

25
Supermaximal repeated substrings
  • Supermaximal repeats
  • Input
  • String S (length n)
  • Output
  • The set of supermaximal repeats of S
  • Theorem
  • A left diverse node v represents a supermaximal
    repeat if and only if
  • all of v's children are leaves
  • and each has a distinct left character

26
Matching Statistics
  • Input
  • Pattern P of length n
  • Text T of length m
  • Output
  • Compute ms(i) for 1 ≤ i ≤ m
  • Definition of ms(i)
  • For 1 ≤ i ≤ m, matching statistic ms(i) is the
    length of the longest substring of T starting at
    position i that matches a substring somewhere in
    P.

27
Matching Statistics
  • With matching statistics, one can solve several
    problems with less space than a suffix tree
  • Exact matching example
  • We'll show an O(n) preprocessing time and O(m)
    search time solution, matching the traditional
    methods
  • P matches the substring starting at i in T if and
    only if ms(i) = |P|
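
To make the definition concrete, here is a direct (deliberately slow) computation of ms(i) and the exact-matching test ms(i) = |P|. It assumes Python's substring operator `in` in place of a suffix tree of P, so it does not achieve the stated time bounds; the function names and the example strings are hypothetical.

```python
def matching_statistics(text, pattern):
    """ms[i-1] = length of the longest prefix of text[i-1:] that occurs in pattern."""
    ms = []
    for i in range(len(text)):
        length = 0
        # Extend while the next longer prefix is still a substring of the pattern.
        while i + length < len(text) and text[i:i + length + 1] in pattern:
            length += 1
        ms.append(length)
    return ms


def occurrences(text, pattern):
    """1-based positions i where ms(i) equals |P|, i.e. where P occurs in T."""
    ms = matching_statistics(text, pattern)
    return [i + 1 for i, value in enumerate(ms) if value == len(pattern)]


print(matching_statistics("abcxabcdex", "wyabcwzqabcdw"))
print(occurrences("axfcaxgx", "ax"))
```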

28
Example of Matching Statistics
[Figure: example strings T and P with positions i]
29
Matching Statistics
  • Solution method
  • Compute suffix tree of P retaining suffix links
  • Also record the location of each match in P
  • p(i) = a location in P such that the substring
    starting at p(i) matches the substring of T
    starting at position i for exactly ms(i)
    characters
  • Before computing ms(i) values, mark each node in
    the suffix tree with the leaf number of one of
    its leaves
  • Simply output this value when outputting ms(i)
    values

30
Matching Statistics
  • Compute ms(1): match T against the tree
  • Get ms(i+1) from ms(i)
  • Assume we are at some node v in the tree
  • If it is internal, follow suffix link to s(v)
  • Else if it is a leaf, go up one level to its
    parent w
  • If w is an internal node, follow suffix link to
    s(w)
  • Traverse downwards using skip/count trick until
    we have matched all the characters in edge label
    (w,v)
  • Now match against T character by character until
    we have a mismatch and can output ms(i+1)
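
The step from ms(i) to ms(i+1) works because dropping the first character of a match leaves a match, so ms(i+1) ≥ ms(i) - 1. The sketch below shows only that carry-over; it assumes Python's `in` operator as a stand-in for the suffix link and skip/count steps, so it is not linear time. The function name is hypothetical.

```python
def matching_statistics_incremental(text, pattern):
    """Compute every ms(i), starting each position from ms(i-1) - 1."""
    ms, length = [], 0
    for i in range(len(text)):
        # Analogue of following the suffix link: the previous match minus its
        # first character is already known to occur in the pattern.
        length = max(length - 1, 0)
        # Analogue of matching character by character until a mismatch.
        while i + length < len(text) and text[i:i + length + 1] in pattern:
            length += 1
        ms.append(length)
    return ms


print(matching_statistics_incremental("abcxabcdex", "wyabcwzqabcdw"))
```

Because the match end only moves forward across the whole run, the number of extension steps is bounded by the length of T, which is the same amortization argument used with suffix links.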

31
Applying matching statistics to LCS problem
  • Input
  • strings S and T
  • Output
  • longest common substring of S and T
  • Solution method
  • Compute suffix tree for shorter string, say S
  • Compute ms(i) values for T
  • Maximal ms(i) value identifies LCS

32
Suffix Arrays
  • Input
  • Text T of length m
  • Output
  • Pos array
  • Definition of Pos array
  • A suffix array for T, called Pos, is an array of
    integers in the range 1 to m specifying the
    lexicographic order of the m suffixes of string T
  • Pos(k) = i iff the suffix of T starting at
    position i is the kth smallest of the m suffixes
  • Add terminating character which is lexically
    smallest

33
Example of Suffix Arrays
  • T = axfcaxgx
  • Suffixes:
    1. axfcaxgx
    2. xfcaxgx
    3. fcaxgx
    4. caxgx
    5. axgx
    6. xgx
    7. gx
    8. x
    9. (terminating character only)
  • Lexical order:

    k   Pos(k)   suffix
    1   9        (terminating character only)
    2   1        axfcaxgx
    3   5        axgx
    4   4        caxgx
    5   3        fcaxgx
    6   7        gx
    7   8        x
    8   2        xfcaxgx
    9   6        xgx

34
Suffix Arrays
  • Solution method
  • Compute suffix tree of T
  • Do a lexical depth-first traversal of the suffix
    tree, filling Pos(k) with the leaf numbers in the
    order they are encountered
  • Edge (v,u) is lexically smaller than edge (v,w)
    iff first character of (v,u) is lexically smaller
    than first character of (v,w)
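
A minimal sketch of the lexical depth-first traversal, assuming an uncompressed suffix trie in place of the suffix tree and "$" as the terminating character (it sorts before lowercase letters in ASCII, and is assumed not to occur in T). The function name `suffix_array` is hypothetical; positions are 1-based as on the slides.

```python
def suffix_array(text, end="$"):
    """Return Pos as a list, where Pos(k) is the k-th entry (1-based values)."""
    padded = text + end
    root = {}
    for i in range(len(padded)):
        node = root
        for ch in padded[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i + 1           # leaf number = start position of the suffix
    pos = []

    def visit(node):
        if "leaf" in node:             # leaves have no children: the terminator is unique
            pos.append(node["leaf"])
            return
        for ch in sorted(node):        # lexically smallest edge first
            visit(node[ch])

    visit(root)
    return pos


print(suffix_array("axfcaxgx"))
```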

35
Applying Suffix Arrays to exact pattern matching
  • Input
  • Pattern P of length n
  • Text T of length m
  • Output
  • All occurrences of P in T
  • Solution method
  • Compute suffix array Pos for T
  • If P is in T, then all these locations will be
    grouped consecutively in Pos

36
Applying Suffix Arrays to exact pattern matching
  • Using binary search, find the smallest index i
    such that P exactly matches the first n
    characters of suffix Pos(i)
  • Similarly, find the largest index i such that P
    exactly matches the first n characters of suffix
    Pos(i)
  • Time complexity O(n log m)
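
The two binary searches can be sketched as follows. For brevity this assumes the suffix array is built by plain sorting (not via a suffix tree) and that neither T nor a non-empty P contains the terminating character "$"; the function name is hypothetical. Each comparison looks at no more than n characters, which gives the O(n log m) search bound.

```python
def find_occurrences(text, pattern, end="$"):
    """1-based start positions of pattern in text, via binary search on Pos."""
    padded = text + end
    pos = sorted(range(len(padded)), key=lambda i: padded[i:])   # 0-based suffix array
    n = len(pattern)

    def prefix(k):
        """First n characters of the k-th smallest suffix."""
        return padded[pos[k]:pos[k] + n]

    lo, hi = 0, len(pos)
    while lo < hi:                     # smallest k whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(pos)
    while lo < hi:                     # smallest k whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    # Entries first .. lo-1 are the consecutive block of matching suffixes.
    return sorted(pos[k] + 1 for k in range(first, lo))


print(find_occurrences("axfcaxgx", "x"))
```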

37
Longest common prefixes
  • Input
  • Text T of length m
  • Output
  • Max(Lcp(i,j)), for 1 ≤ i, j ≤ m and i ≠ j
  • Definition of Lcp(i,j): Lcp(i,j) is the length of
    the longest common prefix of the suffixes of T
    beginning at Pos(i) and Pos(j).
  • Example from Suffix Arrays
  • T = axfcaxgx, Pos(2) = 1 (axfcaxgx), Pos(3) = 5
    (axgx)
  • Lcp(2,3) = 2

38
Longest common prefixes
  • Solution method
  • We want to get Lcp in O(m) time
  • However, there are potentially O(m^2) different
    possible pairs of Lcp values
  • Crucial point
  • Since the Lcp values are used during binary
    search, only O(m) values are ever needed, and
    these have a lot of structure

39
Longest common prefixes
  • Lcp(i,i+1) = string depth of the lowest common
    ancestor encountered during lexical depth-first
    traversal of suffix tree from Pos(i) leaf to
    Pos(i+1) leaf
  • Other Lcp values
  • Lcp(i,j) = min over k from i to j-1 of
    Lcp(k,k+1)
  • Take min of Lcp values of children in the binary
    tree of needed Lcp values (not the suffix tree)
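
The relation between adjacent and general Lcp values can be checked with a small sketch. It assumes the adjacent values are computed by direct character comparison rather than from lowest common ancestors in the suffix tree, and that "$" is the terminating character; all function names are hypothetical. Indices i and j are 1-based positions in Pos, as on the slides.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def adjacent_lcps(text, end="$"):
    """Return the sorted suffixes and adj, where adj[k-1] = Lcp(k, k+1)."""
    padded = text + end
    order = sorted(range(len(padded)), key=lambda i: padded[i:])
    suffixes = [padded[i:] for i in order]
    adj = [lcp_len(suffixes[k], suffixes[k + 1]) for k in range(len(suffixes) - 1)]
    return suffixes, adj


def lcp(adj, i, j):
    """Lcp(i, j) for 1 <= i < j as the minimum of Lcp(k, k+1), k = i .. j-1."""
    return min(adj[i - 1:j - 1])


suffixes, adj = adjacent_lcps("axfcaxgx")
print(lcp(adj, 2, 3))   # the slide's example, Lcp(2,3)
```

A range-minimum structure over the adjacent values would answer arbitrary Lcp(i,j) queries faster; the slice-and-min above is only meant to show the identity.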