Suffix trees and suffix arrays - PowerPoint PPT Presentation

About This Presentation
Slides: 34
Provided by: hai5
Learn more at: https://s2.smu.edu

Transcript and Presenter's Notes

Title: Suffix trees and suffix arrays


1
Suffix trees and suffix arrays
2
Trie
  • A tree representing a set of strings.

Example set of strings: { aeef, ad, bbfe, bbfg, c }
[Figure: the trie for these five strings.]
3
Trie (Cont)
  • Assume no string is a prefix of another

Each edge is labeled by a letter; no two edges
outgoing from the same node are labeled the
same. Each string corresponds to a leaf.
[Figure: the trie for { aeef, ad, bbfe, bbfg, c }.]
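The definition above can be sketched in a few lines. This is only an illustration: the nested-dict representation and the names `trie_insert` and `trie_contains` are assumptions made for the example, not part of the slides.

```python
def trie_insert(root, word):
    """Insert word into a trie stored as nested dicts keyed by edge letter."""
    node = root
    for ch in word:
        # Follow the edge labeled ch, creating it if it does not exist yet.
        node = node.setdefault(ch, {})
    return node


def trie_contains(root, word):
    """Return True if word ends at a leaf (a node with no outgoing edges)."""
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return not node


root = {}
for w in ["aeef", "ad", "bbfe", "bbfg", "c"]:
    trie_insert(root, w)
```

Because no string in the set is a prefix of another, checking for a leaf is enough to decide membership; without that assumption each node would need an explicit end-of-word flag.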
4
Compressed Trie
  • Compress unary nodes, label edges by strings

[Figure: the trie before and after compression; unary paths collapse into single edges labeled by strings such as bbf and eef.]
5
Suffix tree
Given a string s, a suffix tree of s is a
compressed trie of all suffixes of s.
To make these suffixes prefix-free we add a
special character, say $, at the end of s.
6
Suffix tree (Example)
Let s = abab. A suffix tree of s is a compressed
trie of all suffixes of s = abab$:

{ $, b$, ab$, bab$, abab$ }
[Figure: the suffix tree of abab$.]


7
Trivial algorithm to build a Suffix tree
Put the longest suffix, abab$, in
[Figure: the tree after inserting abab$.]
Put the suffix bab$ in
[Figure: the tree after inserting bab$.]


8
[Figure: the tree containing abab$ and bab$.]
Put the suffix ab$ in
[Figure: the tree after inserting ab$.]

9
[Figure: the tree containing abab$, bab$ and ab$.]
Put the suffix b$ in
[Figure: the tree after inserting b$.]


10
[Figure: the tree containing abab$, bab$, ab$ and b$.]
Put the suffix $ in
[Figure: the complete suffix tree of abab$.]


11

We will also label each leaf with the starting
point of the corresponding suffix.
[Figure: the suffix tree of abab$ with leaves labeled 1 to 5.]
12
Analysis
  • Takes O(n^2) time to build.

We will see how to do it in O(n) time
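The trivial construction from the preceding slides might look like the sketch below. It assumes the end marker `$` does not occur in the string; the node layout (a dict with `children` and `leaf`) and the names `build_suffix_tree` and `leaves` are hypothetical choices for this example. It is the quadratic algorithm, not the linear-time one.

```python
def build_suffix_tree(s, end="$"):
    """Naive O(n^2) construction: insert the suffixes of s + end one by one,
    longest first, into a compressed trie. Assumes end does not occur in s."""
    text = s + end
    root = {"children": {}, "leaf": None}
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            first = rest[0]
            if first not in node["children"]:
                # No edge starts with this letter: hang a new leaf here,
                # labeled with the 1-based starting point of the suffix.
                leaf = {"children": {}, "leaf": start + 1}
                node["children"][first] = (rest, leaf)
                break
            label, child = node["children"][first]
            # Length of the common prefix of the edge label and the suffix.
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # The whole edge matched: continue below it.
                node, rest = child, rest[k:]
            else:
                # Mismatch inside the edge: split the edge at offset k.
                mid = {"children": {label[k]: (label[k:], child)}, "leaf": None}
                node["children"][first] = (label[:k], mid)
                node, rest = mid, rest[k:]
    return root


def leaves(node):
    """Collect the leaf labels in the subtree below node."""
    if node["leaf"] is not None:
        return [node["leaf"]]
    found = []
    for _label, child in node["children"].values():
        found.extend(leaves(child))
    return found


tree = build_suffix_tree("abab")
```

For abab$ the root ends up with three outgoing edges labeled `ab`, `b` and `$`. Each insertion walks at most the length of the suffix, which is where the quadratic total comes from.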
13
What can we do with it ?
  • Exact string matching
  • Given a text T, |T| = n, preprocess it such that
    when a pattern P, |P| = m, arrives you can quickly
    decide whether it occurs in T.
  • We may also want to find all occurrences of P in
    T

14
Exact string matching
In preprocessing we just build a suffix tree in
O(n) time.
[Figure: the suffix tree of abab$ with leaves labeled 1 to 5.]
Given a pattern P = ab we traverse the tree
according to the pattern.
15

[Figure: the suffix tree of abab$ with leaves labeled 1 to 5.]
If we did not get stuck traversing the pattern
then the pattern occurs in the text.
Each leaf in the subtree below the node we reach
corresponds to an occurrence.
By traversing this subtree we get all k
occurrences in O(n+k) time (more precisely
O(m+k): O(m) to trace the pattern plus O(k) for
the subtree).
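The traversal described above can be illustrated as follows. To keep the example short it uses an uncompressed trie of all suffixes instead of a real suffix tree, so it does not achieve the stated bounds; the names `build_suffix_trie` and `occurrences` are assumptions for this sketch.

```python
def build_suffix_trie(text, end="$"):
    """Uncompressed trie of all suffixes of text + end (quadratic size).
    Each leaf stores the 1-based start of its suffix under the key None."""
    t = text + end
    root = {}
    for start in range(len(t)):
        node = root
        for ch in t[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1
    return root


def occurrences(root, pattern):
    """Walk down along pattern, then report every leaf below that node."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # got stuck: the pattern does not occur
        node = node[ch]
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key is None:
                found.append(child)  # a leaf label, i.e. a starting position
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("abab")
```

With the text abab, the pattern ab reaches a node whose subtree holds the leaves 1 and 3, the two starting positions of ab.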
16
Generalized suffix tree
Given a set of strings S, a generalized suffix
tree of S is a compressed trie of all suffixes of
every s ∈ S.
To make these suffixes prefix-free we add a
special char, say $, at the end of s.
To associate each suffix with a unique string in
S, add a different special char to each s.
17
Generalized suffix tree (Example)
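Let s1 = abab and s2 = aab. Here is a generalized
suffix tree for s1 and s2, with s1 terminated by $
and s2 by #:

{ $, #, b$, b#, ab$, ab#, bab$, aab#, abab$ }
[Figure: the generalized suffix tree of s1 and s2, each leaf labeled with the starting position of its suffix.]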
18
So what can we do with it ?
Matching a pattern against a database of strings
19
Longest common substring (of two strings)
Every node with a leaf descendant from string s1
and a leaf descendant from string s2 represents
a maximal common substring and vice versa.
Find such a node with the largest string depth.
[Figure: the generalized suffix tree of s1 = abab and s2 = aab.]
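A rough illustration of this idea follows. It swaps the compressed generalized suffix tree for an uncompressed trie in which every node records which of the two strings reach it, so it needs no end markers and runs in quadratic rather than linear time. The function name and the `kids` / `owners` fields are hypothetical.

```python
def longest_common_substring(s1, s2):
    """Insert all suffixes of s1 and s2 into one uncompressed trie, recording
    at each node which strings pass through it, then return the label of the
    deepest node reached by both strings."""
    root = {"kids": {}, "owners": set()}
    for owner, s in ((1, s1), (2, s2)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "owners": set()})
                node["owners"].add(owner)
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["kids"].items():
            # Only descend into nodes that have descendants from both strings.
            if child["owners"] == {1, 2}:
                if len(path) + 1 > len(best):
                    best = path + ch
                stack.append((child, path + ch))
    return best
```

For s1 = abab and s2 = aab the common substrings are a, b and ab, and the deepest shared node spells ab.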
20
Lowest common ancestors
A lot more can be gained from the suffix tree if
we preprocess it so that we can answer LCA
queries on it








21
Why?
The LCA of two leaves represents the longest
common prefix (LCP) of these 2 suffixes.
[Figure: the generalized suffix tree of s1 = abab and s2 = aab.]
22
Finding maximal palindromes
  • A palindrome: caabaac, cbaabc
  • Want to find all maximal palindromes in a string s

Let s = cbaaba
The maximal palindrome with center between i-1
and i is the LCP of the suffix at position i of s
and the suffix at position m-i+1 of s^r (the
reverse of s)
23
Maximal palindromes algorithm
Prepare a generalized suffix tree for
s = cbaaba$ and s^r = abaabc#
For every i find the LCA of suffix i of s and
suffix m-i+1 of s^r
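The relationship between a palindrome's center and the two suffixes can be checked with a small sketch. It computes each LCP by direct character comparison instead of an LCA query, so it is quadratic in the worst case. It uses 0-based Python indices, under which the matching suffix of the reversed string starts at m - i; the exact offset depends on the indexing convention. The names `lcp_len` and `even_palindromes` are assumptions.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def even_palindromes(s):
    """For each gap i (between s[i-1] and s[i], 0-based), return the maximal
    even-length palindrome centered at that gap."""
    m = len(s)
    sr = s[::-1]
    result = {}
    for i in range(1, m):
        # s[i:] reads rightwards from the gap; sr[m - i:] reads leftwards.
        k = lcp_len(s[i:], sr[m - i:])
        result[i] = s[i - k:i + k]
    return result


pals = even_palindromes("cbaaba")
```

For s = cbaaba the only non-empty result is at the gap in the middle of aa, where the common prefix of aba and abc has length 2 and the palindrome is baab.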
24
Let s = cbaaba$ then s^r = abaabc#
[Figure: the generalized suffix tree of s and s^r, with leaves labeled by the starting positions 1 to 7 of the suffixes of each string.]
25
Analysis
O(n) time to identify all palindromes
26
Drawbacks
  • Suffix trees consume a lot of space
  • It is O(n) but the constant is quite big
  • Notice that if we indeed want to traverse an edge
    in O(1) time then we need an array of pointers of
    size |Σ| (the alphabet size) in each node

27
Suffix array
  • We lose some of the functionality but we save
    space.

Let s = abab
Sort the suffixes lexicographically: ab, abab,
b, bab
The suffix array gives the indices of the
suffixes in sorted order:
3 1 4 2
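The definition can be reproduced directly by sorting. This is a deliberately naive sketch (each comparison may cost O(n)), and the function name `suffix_array` is an assumption.

```python
def suffix_array(s):
    """1-based starting positions of the suffixes of s in lexicographic order."""
    return [i + 1 for i in sorted(range(len(s)), key=lambda i: s[i:])]


sa = suffix_array("abab")  # [3, 1, 4, 2]
```

The array stores one integer per suffix and nothing else, which is where the space saving over the tree comes from.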
28
How do we build it ?
  • Build a suffix tree
  • Traverse the tree in DFS, lexicographically
    picking edges outgoing from each node and fill
    the suffix array.
  • O(n) time
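The DFS step can be illustrated on a small scale. The sketch below builds an uncompressed trie of the suffixes in quadratic time instead of a linear-time suffix tree, so only the traversal order is faithful to the bullets above. It assumes the end marker `$` sorts before every character of the string; `suffix_array_via_trie` is a hypothetical name.

```python
def suffix_array_via_trie(s, end="$"):
    """Build a trie of the suffixes of s + end, then read the leaves in a DFS
    that always takes outgoing edges in lexicographic order."""
    t = s + end
    root = {}
    for start in range(len(t)):
        node = root
        for ch in t[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1  # leaf label: 1-based start of the suffix

    order = []

    def dfs(node):
        if None in node:
            # A leaf: the unique end marker guarantees it has no children.
            order.append(node[None])
            return
        for ch in sorted(node):
            dfs(node[ch])

    dfs(root)
    return order


order = suffix_array_via_trie("abab")  # [5, 3, 1, 4, 2]
```

The first entry is the suffix consisting of the end marker alone; dropping it leaves 3, 1, 4, 2, the suffix array of abab.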

29
How do we search for a pattern ?
  • If P occurs in T then all its occurrences are
    consecutive in the suffix array.
  • Do a binary search on the suffix array
  • Takes O(m log n) time

30
Example
Let S = mississippi and P = issa. The sorted suffixes:
  i             <- L
  ippi
  issippi
  ississippi
  mississippi
  pi            <- M
  ppi
  sippi
  sissippi
  ssippi
  ssissippi     <- R
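A plain binary search over the suffix array might look like the following sketch. It builds the array by naive sorting and compares up to m characters per step, giving the O(m log n) search bound; the name `find_occurrences` is an assumption.

```python
def find_occurrences(text, pattern):
    """Return the sorted 1-based starting positions of pattern in text,
    found by two binary searches over the suffix array."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    m = len(pattern)

    # First suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All occurrences are consecutive in the suffix array: sa[first:lo].
    return sorted(i + 1 for i in sa[first:lo])
```

For S = mississippi, the pattern issa yields an empty range, while issi yields the positions 2 and 5.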
31
How do we accelerate the search ?
Maintain l = LCP(P, L)
Maintain r = LCP(P, R)
If l = r then start comparing M to P at l+1
[Figure: the suffix array with L, M and R marked, together with the match lengths l and r.]
32
How do we accelerate the search ?
If l > r then
Suppose we know LCP(L, M).
If LCP(L, M) < l we go left.
If LCP(L, M) > l we go right.
If LCP(L, M) = l we start comparing at l+1.
[Figure: the suffix array with L, M and R marked, together with the match lengths l and r.]
33
Analysis of the acceleration
If we do more than a single comparison in an
iteration then max(l, r) grows by 1 for each
comparison ⇒ O(log n + m) time
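As a partial illustration, the sketch below implements only a simpler variant of the acceleration: it keeps l and r and starts each comparison at min(l, r), since every suffix between L and R shares that many characters with P. It does not use precomputed LCP(L, M) values, so it does not reach the O(log n + m) bound in the worst case. All names (`extend`, `accelerated_search`) and the 0-based indexing are assumptions of this example.

```python
def extend(text, start, pattern, k):
    """Given that text[start:] and pattern agree on their first k characters,
    return the length of their longest common prefix."""
    while k < len(pattern) and start + k < len(text) and text[start + k] == pattern[k]:
        k += 1
    return k


def accelerated_search(text, sa, pattern):
    """Return the 0-based start of some occurrence of pattern, or None.
    sa is the 0-based suffix array of text."""
    if not sa:
        return None
    m = len(pattern)

    def pattern_is_greater(start, k):
        # Valid when the LCP k is smaller than m: either the suffix ended,
        # or the first differing character decides the order.
        return start + k == len(text) or pattern[k] > text[start + k]

    L, R = 0, len(sa) - 1
    l = extend(text, sa[L], pattern, 0)
    r = extend(text, sa[R], pattern, 0)
    if l == m:
        return sa[L]
    if r == m:
        return sa[R]
    if not pattern_is_greater(sa[L], l) or pattern_is_greater(sa[R], r):
        return None  # pattern sorts before the first or after the last suffix
    while R - L > 1:
        M = (L + R) // 2
        # Skip the characters already known to match.
        k = extend(text, sa[M], pattern, min(l, r))
        if k == m:
            return sa[M]
        if pattern_is_greater(sa[M], k):
            L, l = M, k
        else:
            R, r = M, k
    return None


text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])
```

Searching for issa in mississippi with this sketch narrows the interval to two adjacent suffixes and reports no occurrence, while sip is found at 0-based position 6.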