Title: Efficient Computation of Substring Equivalence Classes with Suffix Arrays
1Efficient Computation of Substring Equivalence Classes with Suffix Arrays
- Kazuyuki Narisawa, Shunsuke Inenaga, Hideo Bannai and Masayuki Takeda
- Kyushu University, Japan
2Contents
- Introduction
- Problem definition
- Suffix tree based algorithm
- Simulation by suffix array
- Computational experiment
- Application
- Summary
3Main contribution
Time- and space-efficient computation of substring equivalence classes [Blumer et al. 1987] with suffix arrays
- Linear time and space
- Faster and requires less memory than suffix tree and CDAWG based methods
4Equivalence relation and classes
Substrings with essentially identical occurrences in w are equivalent.
example
Bettyboughtabitofbetterbutterandmadeabetterbatterafterbreakfast.
bet ≡ betterb
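The example can be checked by brute force. The sketch below is a naive illustration, not the algorithm of these slides. It assumes that two substrings are equivalent exactly when they have the same maximal extension, and the helper names `occurrences` and `maximal_extension` are hypothetical.

```python
def occurrences(w, x):
    """Start positions of all (possibly overlapping) occurrences of x in w."""
    return [i for i in range(len(w) - len(x) + 1) if w.startswith(x, i)]


def maximal_extension(w, x):
    """Grow x to the left and right while all occurrences agree on the next character."""
    pos = occurrences(w, x)
    if not pos:
        raise ValueError("x does not occur in w")
    length = len(x)
    # Left: every occurrence must be preceded by the same character.
    while all(p > 0 for p in pos) and len({w[p - 1] for p in pos}) == 1:
        pos = [p - 1 for p in pos]
        length += 1
    # Right: every occurrence must be followed by the same character.
    while (all(p + length < len(w) for p in pos)
           and len({w[p + length] for p in pos}) == 1):
        length += 1
    return w[pos[0]:pos[0] + length]


w = "Bettyboughtabitofbetterbutterandmadeabetterbatterafterbreakfast."
print(maximal_extension(w, "bet"))     # betterb
print(maximal_extension(w, "better"))  # betterb
```

Both occurrences of bet are preceded by different characters (f and a) and both occurrences of betterb are followed by different characters (u and a), so the extension stops at betterb.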
5Problem
- Input: string w of length n
- Output: the equivalence classes on w
- Difficulty
  - The total number of elements in the equivalence classes (ECs for short) is O(n^2).
- Solution
  - The number of ECs is O(n).
  - Each EC can be succinctly represented in O(1) space.
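To make the gap between the number of elements and the number of classes concrete, here is a brute-force sketch on the string ababbbabbc that later slides use as the running example. It groups every substring by its maximal extension, which is assumed here to identify its class. The function names are hypothetical and the method is far from linear time.

```python
from collections import defaultdict


def maximal_extension(w, i, j):
    """Maximal extension of w[i:j], computed naively from all its occurrences."""
    x = w[i:j]
    pos = [p for p in range(len(w) - len(x) + 1) if w.startswith(x, p)]
    length = len(x)
    while all(p > 0 for p in pos) and len({w[p - 1] for p in pos}) == 1:
        pos = [p - 1 for p in pos]
        length += 1
    while (all(p + length < len(w) for p in pos)
           and len({w[p + length] for p in pos}) == 1):
        length += 1
    return w[pos[0]:pos[0] + length]


def equivalence_classes(w):
    """Map each representative to the set of distinct substrings in its class."""
    classes = defaultdict(set)
    for i in range(len(w)):
        for j in range(i + 1, len(w) + 1):
            classes[maximal_extension(w, i, j)].add(w[i:j])
    return dict(classes)


classes = equivalence_classes("ababbbabbc")
num_substrings = sum(len(c) for c in classes.values())
print(num_substrings, len(classes))  # 40 5
```

The 40 distinct substrings fall into only 5 classes, one of which is the trivial class of substrings that occur exactly once.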
6Succinct representation of the ECs
- representative: the longest element (maximal extension)
- minimal strings: the elements that fall into another EC when their leftmost or rightmost character is deleted
The elements of the EC of x can be enumerated from the representative and the minimal strings.
example
representative
Betty-bought-a-bit-of-better-butter-and-made-a-better-batter-after-breakfast.
minimal strings
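A small sketch of how a class can be recovered from its succinct representation. It uses the class {babb, bab, ba, abb} of the string ababbbabbc, which was worked out by hand here and is not listed in full on the slides. It also assumes that a minimal string is an element none of whose proper substrings belong to the class. The function names are hypothetical.

```python
def minimal_strings(ec):
    """Elements that leave the class when either end character is deleted."""
    return {x for x in ec if x[1:] not in ec and x[:-1] not in ec}


def enumerate_class(representative, minimal):
    """All substrings of the representative that contain some minimal string."""
    members = set()
    for i in range(len(representative)):
        for j in range(i + 1, len(representative) + 1):
            x = representative[i:j]
            if any(m in x for m in minimal):
                members.add(x)
    return members


ec = {"babb", "bab", "ba", "abb"}
minimal = minimal_strings(ec)
print(sorted(minimal))                           # ['abb', 'ba']
print(sorted(enumerate_class("babb", minimal)))  # ['abb', 'ba', 'bab', 'babb']
```

Storing only the representative and the minimal strings is enough to list every element on demand, which is why the class itself never has to be materialized.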
7Problem
- Input: string w of length n
- Output: succinct representations of the equivalence classes on w
- Additionally, for each EC we output
  - size (the number of elements in the EC)
  - frequency (the number of occurrences of the elements in the EC)
8Possible solutions
- Suffix tree [Weiner 1973]
- Compact Directed Acyclic Word Graph (CDAWG) [Blumer et al. 1985]
- ECs can be computed with either data structure in linear time and space.
9Suffix tree (with suffix link)
ababbbabbc
Ignore leaves here because they form a trivial EC.
10Equivalence classes on suffix tree
ababbbabbc
[Figure: suffix tree of ababbbabbc with leaves numbered 1 to 11; one EC is highlighted, containing the substrings babb, bab and ba.]
EC def.: substrings with essentially the same occurrences
11Suffix tree algorithm
- foreach node v in suffix tree
  - if (node v is the representative of the EC of v)
    - follow suffix link
    - while (node is in the EC of v)
      - follow suffix link
      - compute size and minimal strings
    - output succinct representation of the EC of v
12Algorithm with suffix tree
[Figure: suffix tree of ababbbabbc with leaves numbered 1 to 11.]
Suffix tree requires large memory space.
13Suffix array [Manber and Myers 1993]
- Can simulate traversal on suffix tree
  - using lcp and rank arrays [Kasai et al. 2003]
- Can simulate traversal on suffix links
  - using an additional data structure, the suffix link table [Abouelhoda et al. 2004]
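One common way to simulate a bottom-up suffix tree traversal is to scan the lcp array with a stack. The sketch below is a generic illustration of that idea and is not taken from the slides. It assumes 0-based indices, lcp[i] as the longest common prefix of the suffixes at suffix array indices i - 1 and i (with lcp[0] = 0), and hypothetical function names.

```python
def suffix_and_lcp_arrays(w):
    """Naive construction: sort the suffixes, then compare neighbours."""
    sa = sorted(range(len(w)), key=lambda i: w[i:])
    lcp = [0] * len(w)
    for k in range(1, len(w)):
        a, b = w[sa[k - 1]:], w[sa[k]:]
        while lcp[k] < min(len(a), len(b)) and a[lcp[k]] == b[lcp[k]]:
            lcp[k] += 1
    return sa, lcp


def internal_nodes(lcp):
    """Return (L, R, depth) for each internal node, children before parents."""
    n = len(lcp)
    nodes = []
    stack = [(0, 0)]  # (left boundary, depth); the bottom entry is the root
    for i in range(1, n + 1):
        h = lcp[i] if i < n else 0
        left = i - 1
        while stack[-1][1] > h:   # nodes deeper than h end at index i - 1
            left, depth = stack.pop()
            nodes.append((left, i - 1, depth))
        if stack[-1][1] < h:      # a deeper node starts at `left`
            stack.append((left, h))
    nodes.append((0, n - 1, 0))   # the root spans the whole array
    return nodes


w = "ababbbabbc"
sa, lcp = suffix_and_lcp_arrays(w)
for L, R, depth in internal_nodes(lcp):
    print(repr(w[sa[L]:sa[L] + depth]), (L, R, depth))
```

Each node is identified by its suffix array interval [L, R] and the length of its label from the root, so no tree is ever built explicitly.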
14Suffix array
lexicographically sort suffixes
ababbbabbc
Suffix Array
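A minimal sketch of the definition for the example string, using plain sorting. This takes quadratic time or worse and only illustrates what the array contains; it uses 0-based positions, whereas the slides appear to count from 1.

```python
def suffix_array(w):
    """Start positions of the suffixes of w in lexicographic order (0-based)."""
    return sorted(range(len(w)), key=lambda i: w[i:])


w = "ababbbabbc"
sa = suffix_array(w)
for index, start in enumerate(sa):
    print(index, start, w[start:])
```

For ababbbabbc the result is [0, 2, 6, 1, 5, 4, 3, 7, 8, 9]: the three suffixes starting with a come first, then the six starting with b, then c.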
15Lcp array
ababbbabbc
lcp[i]: the length of the longest common prefix of the i-th and (i - 1)-th suffixes in the suffix array
Lcp Array
Suffix Array
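The lcp array can be filled in linear time once the suffix array is known. The sketch below follows the well-known method usually attributed to Kasai et al.; it assumes 0-based indices and sets lcp[0] = 0, a convention chosen here and not given on the slide.

```python
def lcp_array(w, sa):
    """lcp[i] = longest common prefix of the suffixes at sa[i - 1] and sa[i]."""
    n = len(w)
    rank = [0] * n
    for i, start in enumerate(sa):
        rank[start] = i
    lcp = [0] * n
    h = 0
    for start in range(n):            # visit suffixes in text order
        if rank[start] == 0:
            h = 0
            continue
        prev = sa[rank[start] - 1]    # lexicographic predecessor
        while start + h < n and prev + h < n and w[start + h] == w[prev + h]:
            h += 1
        lcp[rank[start]] = h
        if h > 0:
            h -= 1                    # the next suffix shares at least h - 1 characters
    return lcp


w = "ababbbabbc"
sa = sorted(range(len(w)), key=lambda i: w[i:])
print(lcp_array(w, sa))  # [0, 2, 3, 0, 4, 1, 2, 2, 1, 0]
```

The counter h drops by at most one per step, which bounds the total number of character comparisons by a constant multiple of n.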
16Rank array
ababbbabbc
rank[SA[i]] = i
Rank Array
Suffix Array
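Since the rank array is simply the inverse permutation of the suffix array, it can be filled in one pass. A minimal sketch with 0-based indices; the suffix array literal below is the one for ababbbabbc.

```python
def rank_array(sa):
    """Inverse of the suffix array: rank[sa[i]] == i for every index i."""
    rank = [0] * len(sa)
    for i, start in enumerate(sa):
        rank[start] = i
    return rank


sa = [0, 2, 6, 1, 5, 4, 3, 7, 8, 9]  # suffix array of "ababbbabbc", 0-based
rank = rank_array(sa)
print(rank)  # [0, 3, 1, 6, 5, 4, 2, 7, 8, 9]
```

With rank available, the suffix array index of any text position can be looked up in constant time.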
17Suffix array has less information
Information available during traversal for each data structure, when visiting node v
Suffix Tree
1. label from root to each node
2. label from parent to each node
3. number of leaves in each subtree
4. parent of each node
5. children of each node
6. suffix link of each node
Suffix Array
- length of label from root to v
- length of label from root to the parent of v
- leftmost leaf ID in subtree rooted at v
- rightmost leaf ID in subtree rooted at v
18Suffix array has less information
[Figure: suffix tree with leaves numbered 1 to 11 and one node annotated: label length from root = 4, length of parent label from root = 1.]
19Suffix array algorithm
- foreach v in suffix tree (simulated by suffix array)
  - if (node v is the representative of the EC of v) [difficulty 1]
    - follow suffix link
    - while (node is in the EC of v) [difficulty 2]
      - follow suffix link
      - compute size and minimal strings [difficulty 3]
    - output succinct representation of the EC of v
These are difficult because the suffix array has less information.
20Solving difficulty 1 (representative judge)
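[Figure: node v spans suffix array indices L = rank(l) to R = rank(r), where l and r are its leftmost and rightmost leaves; the leaves l - 1 and r - 1 are found at indices L' = rank(l - 1) and R' = rank(r - 1).]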
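A hedged sketch of one way such a test can be written. It assumes that the comparison is between the interval [rank(l), rank(r)] of v and the positions rank(l - 1) and rank(r - 1) of the suffixes that start one character earlier. The explicit character comparison is a safeguard added here, the indices are 0-based, and the function names are hypothetical.

```python
def build(w):
    """Naive suffix array and its inverse (rank array)."""
    sa = sorted(range(len(w)), key=lambda i: w[i:])
    rank = [0] * len(w)
    for i, start in enumerate(sa):
        rank[start] = i
    return sa, rank


def is_representative(w, sa, rank, L, R):
    """True if the node spanning sa[L..R] cannot be extended to the left."""
    l, r = sa[L], sa[R]
    if l == 0 or r == 0:
        return True   # an occurrence at the start of w has no left neighbour
    if w[l - 1] != w[r - 1]:
        return True   # two different preceding characters
    # Same preceding character at both ends: every occurrence is preceded by
    # it exactly when the shifted suffixes fill an interval of equal width.
    return rank[r - 1] - rank[l - 1] != R - L


w = "ababbbabbc"
sa, rank = build(w)
# Intervals of the internal nodes ab, abb, b, babb, bb (worked out by hand).
for L, R in [(0, 2), (1, 2), (3, 8), (3, 4), (5, 7)]:
    print((L, R), is_representative(w, sa, rank, L, R))
```

Only the node abb fails the test, because both of its occurrences are preceded by b, so it belongs to the class whose representative is babb.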
21Solving difficulty 2 (equivalence relation judge)
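[Figure: node v with label ax from the root spans suffix array indices L = rank(l) to R = rank(r); its suffix link leads to the node with label x from the root, whose leaves l + 1 and r + 1 are found at indices L' = rank(l + 1) and R' = rank(r + 1).]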
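A naive sketch of the suffix-link step and the equivalence test. It assumes that the target of the suffix link is located through rank(l + 1) and rank(r + 1), and that ax and x are equivalent exactly when both have the same number of occurrences. The interval is widened with plain loops, which is not linear time overall. Indices are 0-based, w is assumed to end with a unique character, and the function names are hypothetical.

```python
def arrays(w):
    """Naive suffix, rank and lcp arrays (lcp[i] compares sa[i - 1] and sa[i])."""
    sa = sorted(range(len(w)), key=lambda i: w[i:])
    rank = [0] * len(w)
    for i, start in enumerate(sa):
        rank[start] = i
    lcp = [0] * len(w)
    for k in range(1, len(w)):
        a, b = w[sa[k - 1]:], w[sa[k]:]
        while lcp[k] < min(len(a), len(b)) and a[lcp[k]] == b[lcp[k]]:
            lcp[k] += 1
    return sa, rank, lcp


def suffix_link(sa, rank, lcp, L, R, H):
    """Interval (L2, R2, H - 1) of the node reached by dropping the first character."""
    n = len(sa)
    L2, R2, depth = rank[sa[L] + 1], rank[sa[R] + 1], H - 1
    while L2 > 0 and lcp[L2] >= depth:          # widen to the left
        L2 -= 1
    while R2 + 1 < n and lcp[R2 + 1] >= depth:  # widen to the right
        R2 += 1
    return L2, R2, depth


def link_stays_in_class(sa, rank, lcp, L, R, H):
    """True if the suffix-link target has as many occurrences as the node itself."""
    L2, R2, depth = suffix_link(sa, rank, lcp, L, R, H)
    return depth > 0 and R2 - L2 == R - L


sa, rank, lcp = arrays("ababbbabbc")
print(suffix_link(sa, rank, lcp, 3, 4, 4))          # babb -> abb: (1, 2, 3)
print(link_stays_in_class(sa, rank, lcp, 3, 4, 4))  # True
print(link_stays_in_class(sa, rank, lcp, 1, 2, 3))  # abb -> bb: False
```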
22Solving difficulty 3 (size computation)
[Figure: three cases (case 1, case 2, case 3) for a node spanning suffix array indices L to R; the label length of its parent is read from lcp(L) and lcp(R + 1), and the size is the sum of the label lengths from parent to node.]
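A sketch of the size computation under stated assumptions: the label length of the parent of a node is taken as the larger of lcp[L] and lcp[R + 1], and the size of a class is the sum of (label length of node) minus (label length of parent) over the nodes reached by suffix links while the number of occurrences stays the same. Suffix links are located naively here, indices are 0-based, w is assumed to end with a unique character, and all names are hypothetical.

```python
def arrays(w):
    """Naive suffix, rank and lcp arrays (lcp[i] compares sa[i - 1] and sa[i])."""
    sa = sorted(range(len(w)), key=lambda i: w[i:])
    rank = [0] * len(w)
    for i, start in enumerate(sa):
        rank[start] = i
    lcp = [0] * len(w)
    for k in range(1, len(w)):
        a, b = w[sa[k - 1]:], w[sa[k]:]
        while lcp[k] < min(len(a), len(b)) and a[lcp[k]] == b[lcp[k]]:
            lcp[k] += 1
    return sa, rank, lcp


def parent_depth(lcp, L, R):
    """Label length of the parent of the node spanning [L, R]."""
    right = lcp[R + 1] if R + 1 < len(lcp) else 0
    return max(lcp[L], right)


def class_size(sa, rank, lcp, L, R, H):
    """Size of the class whose representative spans [L, R] with label length H >= 1."""
    n, size = len(sa), 0
    while True:
        size += H - parent_depth(lcp, L, R)   # strings ending on the incoming edge
        if H == 1:
            return size                       # the suffix link would reach the root
        L2, R2 = rank[sa[L] + 1], rank[sa[R] + 1]
        while L2 > 0 and lcp[L2] >= H - 1:
            L2 -= 1
        while R2 + 1 < n and lcp[R2 + 1] >= H - 1:
            R2 += 1
        if R2 - L2 != R - L:
            return size                       # more occurrences: another class
        L, R, H = L2, R2, H - 1


sa, rank, lcp = arrays("ababbbabbc")
print(class_size(sa, rank, lcp, 3, 4, 4))  # representative babb: 3 + 1 = 4
print(class_size(sa, rank, lcp, 0, 2, 2))  # representative ab: 2
```

The frequency of the class needs no extra work: it is the interval width R - L + 1 of the representative.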
23Computational experiment
- Comparison of algorithms
  - suffix tree
  - CDAWG
  - suffix array
- Data
  - two English and two Genome corpora
  - Canterbury corpus, Protein corpus
- Machine spec.
  - Red Hat Linux
  - CPU 2.8GHz, 1 GB memory
24Experimental result
25Application spam detection
The equivalence classes formed by spam are larger than those formed by non-spam.
- This is Japanese sushi made with Spam,
- but this Spam is not related to this study.
26Application spam detection
- Unsupervised Spam Detection based on String Alienness Measures
- by Kazuyuki Narisawa, Hideo Bannai, Kohei Hatano and Masayuki Takeda
If you are interested in our study and want to come to the conference, search not for "DS 07" but for "Discovery Science 2007".
27Summary
- Presented an algorithm for computing the equivalence classes with suffix arrays
  - simulating traversal on the suffix tree and its suffix links
  - using only lcp and rank arrays
  - running in linear time and space
- Compared with other data structures
  - less memory
  - faster computation
- Can be applied to spam detection [DS 07]
28Thank You
30Compute size of the EC
size = sum of the lengths of the labels from parent to each node in the EC
1 + 3 = 4
31Compute minimal strings of the EC
[Figure labels: x, y, z; x1, x2, ..., xk; y1, ..., yk; z1, z2, ..., zm]
32suffix tree
- each node has
  - parent
  - leftmost child
  - right sibling
  - suffix link
  - label of the incoming edge