Title: Suffix tree and suffix array techniques for pattern analysis in strings
1Suffix tree and suffix array techniques for
pattern analysis in strings
- Esko Ukkonen
- Univ Helsinki
- Erice School 30 Oct 2005
2ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtg
cagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgcc
attctcctgcctcagcctcccaagtagctgggactacaggcgcccgccac
tacgcccggctaattttttgtatttttagtagagacggggtttcaccgtt
ttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggc
ctcccaaagtgctgggattacaggcgt
3Algorithms for combinatorial string matching?
- deep beauty? -
- shallow beauty?
- applications?
- intensive algorithmic miniatures
- sources of new problems text processing, DNA,
music,
4Analysis of a string of symbols
- T h a t t i v a t t i text
- P a t t pattern
- Find the occurrences of P in T h a t t i v a
t t i - Pattern synthesis (t) 4 (atti) 2
(tt) 2
5Pattern finding synthesis problems
- T t1t2 tn, P p1 p 2 pn , strings of
symbols in finite alphabet - Indexing problem Preprocess T (build an index
structure) such that the occurrences of different
patterns P can be found fast - static text, any given pattern P
- Pattern synthesis problem Learn from T new
patterns that occur surprisingly often - What is a pattern? Exact substring, approximate
substring, with generalized symbols, with gaps,
6- Suffix tree
- Suffix array
- Some applications
- Finding motifs
7The suffix tree Tree(T) of T
- data structure suffix tree, Tree(T), is compacted
trie that represents all the suffixes of string T - linear size Tree(T) O(T)
- can be constructed in linear time O(T)
- has myriad virtues (A. Apostolico)
- is well-known 366 000 Google hits
8Suffix trie and suffix tree
abaab baab aab ab b
Trie(abaab)
Tree(abaab)
a
a
b
baab
b
a
a
ab
baab
a
b
a
a
b
b
9Trie(T) can be large
- Trie(T) O(T2)
- bad example T anbn
- Trie(T) can be seen as a DFA language accepted
the suffixes of T - minimize the DFA gt directed cyclic word graph
(DAWG)
10Tree(T) is of linear size
- only the internal branching nodes and the leaves
represented explicitly - edges labeled by substrings of T
- v node(a) if the path from root to v spells a
- one-to-one correspondence of leaves and suffixes
- T leaves, hence lt T internal nodes
- Tree(T) O(T size(edge labels))
11Tree(hattivatti)
hattivatti attivatti ttivatti tivatti ivatti vatti
atti tti ti i
vatti
i
vatti
t
i
ti
atti
i
vatti
hattivatti
ivatti
ti
vatti
vatti
tti
vatti
atti
tivatti
hattivatti
ttivatti
attivatti
12Tree(hattivatti)
hattivatti attivatti ttivatti tivatti ivatti vatti
atti tti ti i
vatti
i
vatti
t
i
ti
atti
i
vatti
hattivatti
ivatti
ti
vatti
vatti
tti
vatti
atti
tivatti
hattivatti
ttivatti
attivatti
substring labels of edges represented as pairs of
pointers
hattivatti
13Tree(hattivatti)
hattivatti attivatti ttivatti tivatti ivatti vatti
atti tti ti i
vatti
i
6
3,3
10
4,5
2,5
i
vatti
hattivatti
5
9
vatti
6,10
8
6,10
7
4
1
3
2
hattivatti
14Tree(T) is full text index
Tree(T)
P
P occurs in T at locations 8, 31,
31
8
P occurs in T ? P is a prefix of some suffix of T
? Path for P exists in Tree(T) All occurrences
of P in time O(P occ)
15Find att from Tree(hattivatti)
hattivatti attivatti ttivatti tivatti ivatti vatti
atti tti ti i
vatti
vatti
t
i
ti
atti
i
vatti
hattivatti
ivatti
ti
vatti
vatti
tti
vatti
atti
tivatti
7
hattivatti
ttivatti
attivatti
2
16Linear time construction of Tree(T)
hattivatti attivatti ttivatti tivatti ivatti vatti
atti tti ti i
McCreight (1976)
Weiner (1973), algorithm of the year
on-line algorithm (Ukkonen 1992)
17On-line construction of Trie(T)
- T t1t2 tn
- Pi t1t2 ti ith prefix of T
- on-line idea update Trie(Pi) to Trie(Pi1)
- gt very simple construction
18Trie(abaab)
Trie(a)
Trie(ab)
Trie(aba)
abaa baa aa ea e
a
a
b
a
b
b
b
a
a
chain of links connects the end
points of current suffixes
19Trie(abaab)
Trie(abaa)
a
a
b
a
b
a
b
b
b
b
a
a
a
a
a
a
a
20Trie(abaab)
Trie(abaa)
a
a
b
a
b
a
b
b
b
b
a
a
a
a
a
a
a
Add next symbol b
21Trie(abaab)
Trie(abaa)
a
a
b
a
b
a
b
b
b
b
a
a
a
a
a
a
a
Add next symbol b
From here on b-arc already exists
22Trie(abaab)
a
a
b
a
b
a
b
b
b
b
a
a
a
a
a
a
a
Trie(abaab)
a
b
b
a
a
a
b
a
a
b
b
23What happens in Trie(Pi) gt Trie(Pi1) ?
Before
From here on the ai-arc exists already gt stop
updating here
ai
After
ai
ai
ai
ai
ai
New suffix links
New nodes
24What happens in Trie(Pi) gt Trie(Pi1) ?
- time O(size of Trie(T))
- suffix links
slink(node(aa)) node(a)
25On-line procedure for suffix trie
- Create Trie(t1) nodes root and v, an arc
son(root, t1) v, and suffix links slink(v)
root and slink(root) root - for i 2 to n do begin
- vi-1 leaf of Trie(t1ti-1) for string
t1ti-1 (i.e., the deepest leaf) - v vi-1 v 0
- while node v has no outgoing arc for ti do
begin - Create a new node v and an arc
son(v,ti) v - if v ? 0 then slink(v) v
- v slink(v) v v end
- for the node v such that v
son(v,ti) do if v v then
slink(v) root else slink(v) v
26Suffix trees on-line
- compacted version of the on-line trie
construction simulate the construction on the
linear size tree instead of the trie gt time
O(T) - all trie nodes are conceptually still needed gt
implicit and real nodes
27Implicit and real nodes
- Pair (v,a) is an implicit node in Tree(T) if v is
a node of Tree and a is a (proper) prefix of the
label of some arc from v. If a is the empty
string then (v, a) is a real node ( v). - Let v node(a) in Tree(T). Then implicit node
(v, a) represents node(aa) of Trie(T)
28Implicit node
a
v
a
(v, a)
29Suffix links and open arcs
root
a
aa
slink(v)
v
label i, instead of i,j if w is a leaf
and j is the scanned position of T
w
30Big picture
suffix link path traversed total work O(n)
new arcs and nodes created total work
O(size(Tree(T))
31On-line procedure for suffix tree
Input string T t1t2 tn Output
Tree(T) Notation son(v,a) w iff there is an
arc from v to w with label a son(v,e)
v Function Canonize(v, a) while son(v, a) ?
0 where a a a, a gt 0 do v
son(v, a) a a return (v, a)
32Suffix-tree on-line main procedure
Create Tree(t1) slink(root) root
(v, a) (root, e) / (v, a) is the start
node / for i 2 to n1 do
v 0 while there is no arc from v
with label prefix ati do if
a ? e then / divide the arc w son(v, a?)
into two / son(v, a)
v son(v,ti) v son(v,?) w
else
son(v,ti) v v v
if v ? 0 then slink(v) v
v v v slink(v) (v, a)
Canonize(v, a) if v ? 0 then
slink(v) v (v, a) Canonize(v,
ati) / (v, a) start node of the next round /
33The actual time and space
- Tree(T) is about 20T in practice
- brute-force construction is O(TlogT) for
random strings as the average depth of internal
nodes is O(logT) - difference between linear and brute-force
constructions not necessarily large (Giegerich
Kurtz) - truncated suffix trees k symbols long prefix of
each suffix represented (Na et al. 2003) - alphabet independent linear time (Farach 1997)
34- Suffix tree
- Suffix array
- Some applications
- Finding motifs
35Suffix array example
e atti attivatti hattivatti i ivatti ti tivatti tt
i ttivatti vatti
11 7 2 1 10 5 9 4 8 3 6
hattivatti attivatti ttivatti tivatti ivatti vatti
atti tti ti i e
- suffix array lexicographic order of the suffixes
36Suffix array
- suffix array SA(T) an array giving the
lexicographic order of the suffixes of T - space requirement 5T
- practitioners like suffix arrays (simplicity,
space efficiency) - theoreticians like suffix trees (explicit
structure)
37Pattern search from suffix array
e atti attivatti hattivatti i ivatti ti tivatti tt
i ttivatti vatti
11 7 2 1 10 5 9 4 8 3 6
hattivatti attivatti ttivatti tivatti ivatti vatti
atti tti ti i e
att
binary search
38Recent suffix array constructions
- ManberMyers (1990) O(TlogT)
- linear time via suffix tree
- January / June 2003 direct linear time
construction of suffix array - Kim, Sim, Park,
Park (CPM03) - Kärkkäinen Sanders (ICALP03) -
Ko Aluru (CPM03)
39Kärkkäinen-Sanders algorithm
- Construct the suffix array of the suffixes
starting at positions i mod 3 ? 0. This is done
by reduction to the suffix array construction of
a string of two thirds the length, which is
solved recursively. - Construct the suffix array of the remaining
suffixes using the result of the first step. - Merge the two suffix arrays into one.
40Notation
- string T T0,n) t0t1 tn-1
- suffix Si Ti,0) titi1 tn-1
- for C \subset 0,n SC Si i in C
- suffix array SA0,n of T is a permutation of
0,n satisfying SSA0 lt SSA1 lt lt SSAn
41Running example
0 1 2 3 4 5 6 7 8 9 10 11
- T0,n) y a b b a d a b b a d o 0 0
- SA (12,1,6,4,9,3,8,2,7,5,10,11,0)
42Step 0 Construct a sample
- for k 0,1,2
Bk i ? 0,n i mod 3 k - C B1 U B2 sample positions
- SC sample suffixes
- Example B1 1,4,7,10, B2 2,5,8,11, C
1,4,7,10,2,5,8,11
43Step 1 Sort sample suffixes
- for k 1,2, construct
- Rk tktk1tk2 tk3tk4tk5
tmaxBktmaxBk1tmaxBk2 - R R1 R2 concatenation of R1 and R2
-
- Suffixes of R correspond to SC suffix
titi1ti2 corresponds to Si
correspondence is order preserving. -
- Sort the suffixes of R radix sort the
characters and rename with ranks to obtain R. If
all characters different, their order directly
gives the order of suffixes. Otherwise, sort the
suffixes of R using Kärkkäinen-Sanders. Note
R 2n/3.
44Step 1 (cont.)
- once the sample suffixes are sorted, assign a
rank to each rank(Si) the rank of Si in SC
rank(Sn1) rank(Sn2) 0 - Example R abbadabbado0bbadab
bado00 R (1,2,4,6,4,5,3,7) SAR
(8,0,1,6,4,2,5,3,7) rank(Si) - 1 4 - 2 6 - 5
3 7 8 0 0
45Step 2 Sort nonsample suffixes
- for each non-sample Si ? SB0 (note that
rank(Si1) is always defined for i ? B0) Si
Sj ? (ti,rank(Si1)) (tj,rank(Sj1)) - radix sort the pairs (ti,rank(Si1)).
- Example S12 lt S6 lt S9 lt S3 lt S0 because (0,0)
lt (a,5) lt (a,7) lt (b,2) lt (y,1)
46Step 3 Merge
- merge the two sorted sets of suffixes using a
standard comparison-based merging - to compare Si ? SC with Sj ? SB0, distinguish two
cases - i ? B1 Si Sj ? (ti,rank(Si1))
(tj,rank(Sj1)) - i ? B2 Si Sj ? (ti,ti1,rank(Si2))
(tj,tj1,rank(Sj2)) - note that the ranks are defined in all cases!
- S1 lt S6 as (a,4) lt (a,5) and S3 lt S8 as
(b,a,6) lt (b,a,7)
47Running time O(n)
- excluding the recursive call, everything can be
done in linear time - the recursion is on a string of length 2n/3
- thus the time is given by recurrence T(n)
T(2n/3) O(n) - hence T(n) O(n)
48Implementation
- about 50 lines of C
- code available e.g. via Juha Kärkkäinens home
page
49LCP table
- Longest Common Prefix of successive elements of
suffix array - LCPi length of the longest common prefix of
suffixes SSAi and SSAi1 - build inverse array SA-1 from SA in linear time
- then LCP table from SA-1 in linear time (Kasai et
al, CPM2001)
50Suffix tree vs suffix array
- suffix tree ? suffix array LCP table
51- Suffix tree
- Suffix array
- Some applications
- Finding motifs
52Substring motifs of string T
- string T t1 tn in alphabet A.
- Problem what are the frequently occurring
(ungapped) substrings of T? Longest substring
that occurs at least q times? - Thm Suffix tree Tree(T) gives complete
occurrence counts of all substring motifs of T in
O(n) time (although T may have O(n2) substrings!)
53 Counting the substring motifs
- internal nodes of Tree(T) ? repeating
substrings of T - number of leaves of the subtree of a node for
string P number of occurrences of P in T
54Substring motifs of hattivatti
vatti
i
vatti
t
4
2
i
ti
atti
i
2
vatti
2
hattivatti
ivatti
2
ti
vatti
vatti
tti
vatti
atti
tivatti
hattivatti
ttivatti
attivatti
Counts for the O(n) maximal motifs shown
55Finding repeats in DNA
- human chromosome 3
- the first 48 999 930 bases
- 31 min cpu time (8 processors, 4 GB)
- Human genome 3x109 bases
- Tree(HumanGenome) feasible
56Longest repeat?
Occurrences at 28395980, 28401554r Length
2559 ttagggtacatgtgcacaacgtgcaggtttgttacatatgtata
cacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttag
gtatatctccgaatgctatccctcccccctccccccaccccacaacagtc
cccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaa
ttcccacctatgagtgagaacatgcggtgtttggttttttgtccttgcga
aagtttgctgagaatgatggtttccagcttcatccatatccctacaaagg
acatgaactcatcatttttttatggctgcatagtattccatggtgtatat
gtgccacattttcttaacccagtctacccttgttggacatctgggttggt
tccaagtctttgctattgtgaatagtgccgcaataaacatacgtgtgcat
gtgtctttatagcagcatgatttataatcctttgggtatatacccagtaa
tgggatggctgggtcaaatggtatttctagttctagatccctgaggaatc
accacactgacttccacaatggttgaactagtttacagtcccagcaacag
ttcctatttctccacatcctctccagcacctgttgtttcctgacttttta
atgatcgccattctaactggtgtgagatggtatctcattgtggttttgat
ttgcatttctctgatggccagtgatgatgagcattttttcatgtgttttt
tggctgcataaatgtcttcttttgagaagtgtctgttcatatccttcgcc
cacttttgatggggttgtttgtttttttcttgtaaatttgttggagttca
ttgtagattctgggtattagccctttgtcagatgagtaggttgcaaaaat
tttctcccattctgtaggttgcctgttcactctgatggtggtttcttctg
ctgtgcagaagctctttagtttaattagatcccatttgtcaattttggct
tttgttgccatagcttttggtgttttagacatgaagtccttgcccatgcc
tatgtcctgaatggtattgcctaggttttcttctagggtttttatggttt
taggtctaacatgtaagtctttaatccatcttgaattaattataaggtgt
atattataaggtgtaattataaggtgtataattatatattaattataagg
tgtatattaattataaggtgtaaggaagggatccagtttcagctttctac
atatggctagccagttttccctgcaccatttattaaatagggaatccttt
ccccattgcttgtttttgtcaggtttgtcaaagatcagatagttgtagat
atgcggcattatttctgagggctctgttctgttccattggtctatatctc
tgttttggtaccagtaccatgctgttttggttactgtagccttgtagtat
agtttgaagtcaggtagcgtgatggttccagctttgttcttttggcttag
gattgacttggcaatgtgggctcttttttggttccatatgaactttaaag
tagttttttccaattctgtgaagaaattcattggtagcttgatggggatg
gcattgaatctataaattaccctgggcagtatggccattttcacaatatt
gaatcttcctacccatgagcgtgtactgttcttccatttgtttgtatcct
cttttatttcattgagcagtggtttgtagttctccttgaagaggtccttc
acatcccttgtaagttggattcctaggtattttattctctttgaagcaat
tgtgaatgggagttcactcatgatttgactctctgtttgtctgttattgg
tgtataagaatgcttgtgatttttgcacattgattttgtatcctgagact
ttgctgaagttgcttatcagcttaaggagattttgggctgagacgatggg
gttttctagatatacaatcatgtcatctgcaaacagggacaatttgactt
cctcttttcctaattgaatacccgttatttccctctcctgcctgattgcc
ctggccagaacttccaacactatgttgaataggagtggtgagagagggca
tccctgtcttgtgccagttttcaaagggaatgcttccagtttttgtccat
tcagtatgatattggctgtgggtttgtcatagatagctcttattattttg
agatacatcccatcaatacctaatttattgagagtttttagcatgaagag
ttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgt
ggtttctgtctttggttctgtttatatgctggagtacgtttattgatttt
cgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatg
gtggataagctttttgatgtgctgctggattcggtttgccagtattttat
tgaggatttctgcatcgatgttcatcaaggatattggtctaaaattctct
ttttttgttgtgtctctgtcaggctttggtatcaggatgatgctggcctc
ataaaatgagttagg
57Ten occurrences?
ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtg
cagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgcc
attctcctgcctcagcctcccaagtagctgggactacaggcgcccgccac
tacgcccggctaattttttgtatttttagtagagacggggtttcaccgtt
ttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggc
ctcccaaagtgctgggattacaggcgt Length
277 Occurrences at 10130003, 11421803,
18695837, 26652515, 42971130, 47398125In the
reversed complement at 17858493, 41463059,
42431718, 42580925
58Using suffix trees plagiarism
- find longest common substring of strings X and Y
- build Tree(XY) and find the deepest node which
has a leaf pointing to X and another pointing to Y
59Using suffix trees approximate matching
- edit distance insertions, deletions, changes
- STOCKHOLM vs TUKHOLMA
60String distance/similarity functions
STOCKHOLM vs TUKHOLMA STOCKHOLM_
_TU_ KHOLMA gt 2 deletions, 1 insertion, 1 change
61Approximate string matching
- A S T O C K H O L M
B
T U K H O L M A - minimum number of mutation steps a -gt b a
-gt ? ? -gt b - dID(A,B) 5 dLevenshtein(A,B) 4
- mutation costs gt probabilistic modeling
- evaluation by dynamic programming gt alignment
62Dynamic programming
di,j min(if aibj then di-1,j-1 else ?,
di-1,j 1, di,j-1 1)
distance between i-prefix of A and
j-prefix of B (substitution excluded)
B
mxn table d
bj
di-1,j-1
di-1,j
A
1
di,j
di,j-1
ai
1
dm,n
63di,j min(if aibj then di-1,j-1 else ?,
di-1,j 1, di,j-1 1)
optimal alignment by trace-back
dID(A,B)
64Search problem
- find approximate occurrences of pattern P in text
T substrings P of T such that d(P,P) small - dyn progr with small modification O(mn)
- lots of (practical) improvement tricks
T
P
P
65Index for approximate searching?
- dynamic programming P x Tree(T) with backtracking
Tree(T)
P
66- Suffix tree
- Suffix array
- Some applications
- Finding motifs
67Gapped motifs
abc
68Gapped motifs
abc
abc
ababbbcccbcaaabca
69Gapped motifs of T
- gapped pattern P in (A U )
- gap symbol matches any symbol in A
- aabbb
- L(P) occurrence locations of P in T
- P is called a motif of T if L(P) gt 1 and a
motif with quorum q if L(P) q. - Problem find occurrence count L(P) for all
gapped motifs P of T - anban has exponentially many motifs!
70Maximal motifs
- motif P is the maximal version of motif P if P
has the largest possible number of non- symbols
among motifs that have the same set of occurrence
locations as P - every motif has unique maximal version
- unfortunately still exponential number of
different maximal motifs
71Blocks of maximal motifs
- aaabba has blocks aaa, b, ba
- Lemma Maximal 1-block motifs ? (branching)
nodes of Tree(S) - Thm Each block of a maximal motif of T is a
maximal substring motif of T. Hence there are
O(n) different strings that can be used as a
block of a maximal motif. - gt There are O(n2k-1) different maximal motifs
with k blocks O(n2k) unrestricted motifs.
72Tree(T)
Y
X
Two-block maximal pattern XY
73Counting 2-block maximal motifs
- Thm The occurrence counts for all maximal motifs
with two blocks can be found in (optimal) time
O(n3). - c.f., Arimura et al (2000), Apostolico et al
(2004),
74Algorithm (very simple)
d
Y
X
2-block motif (X,d,Y)
for each maximal substring motif X for each
distance d 1,2, mark the leaves of
Tree(T) that correspond to locations L(X) d
for each maximal substring motif Y,
find the number h(Y) of marked leaves in its
subtree in Tree(T) the occurrence count
of motif (X,d,Y) is h(Y)
75Algorithm (very simple)
d
Y
X
2-block motif (X,d,Y)
for each maximal substring motif X for each
distance d 1,2, mark the leaves of
Tree(T) that correspond to locations L(X) d
for each maximal substring motif Y,
find the number h(Y) of marked leaves in its
subtree in Tree(T) the occurrence count
of motif (X,d,Y) is h(Y)
O(n) O(n) O(n)
76Counting 2-block maximal motifs (cont)
- Thm The occurrence counts for all maximal motifs
with two blocks can be found in (optimal) time
O(n3). - flexible gaps xy gap of
any length - Thm The occurrence counts for all maximal motifs
with two blocks and one flexible gap can be found
in (optimal) time O(n2). - k-block case?
77Conclusion
- suffix structures provide very efficient
algorithmic tools for finding and learning
potentially interesting patterns in strings
78General case
- Q1 Given q and W, has T a motif with at least W
non-gap symbols and at least q occurrences? - In k-block case, is O(n2k-1) (or even better)
time possible? - related work H.Arimuraal, A. Apostolico al,
M-F. Sagot, L. Parida, N. Pisanti,