Suffix tree and suffix array techniques for pattern analysis in strings - PowerPoint PPT Presentation

About This Presentation
Title:

Suffix tree and suffix array techniques for pattern analysis in strings

Description:

Find the occurrences of P in T: h a t t i v a t t i ... truncated suffix trees: k symbols long prefix of each suffix represented (Na et al. 2003) ... – PowerPoint PPT presentation

Number of Views:379
Avg rating:3.0/5.0
Slides: 79
Provided by: ukko9
Category:

less

Transcript and Presenter's Notes

Title: Suffix tree and suffix array techniques for pattern analysis in strings


1
Suffix tree and suffix array techniques for
pattern analysis in strings
  • Esko Ukkonen
  • Univ Helsinki
  • Erice School 30 Oct 2005

2
ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtg
cagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgcc
attctcctgcctcagcctcccaagtagctgggactacaggcgcccgccac
tacgcccggctaattttttgtatttttagtagagacggggtttcaccgtt
ttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggc
ctcccaaagtgctgggattacaggcgt
3
Algorithms for combinatorial string matching?
  • deep beauty? -
  • shallow beauty?
  • applications?
  • intensive algorithmic miniatures
  • sources of new problems text processing, DNA,
    music,

4
Analysis of a string of symbols
  • T h a t t i v a t t i text
  • P a t t pattern
  • Find the occurrences of P in T h a t t i v a
    t t i
  • Pattern synthesis (t) 4 (atti) 2
    (tt) 2

5
Pattern finding synthesis problems
  • T t1t2 tn, P p1 p 2 pn , strings of
    symbols in finite alphabet
  • Indexing problem Preprocess T (build an index
    structure) such that the occurrences of different
    patterns P can be found fast
  • static text, any given pattern P
  • Pattern synthesis problem Learn from T new
    patterns that occur surprisingly often
  • What is a pattern? Exact substring, approximate
    substring, with generalized symbols, with gaps,

6
  1. Suffix tree
  2. Suffix array
  3. Some applications
  4. Finding motifs

7
The suffix tree Tree(T) of T
  • data structure suffix tree, Tree(T), is compacted
    trie that represents all the suffixes of string T
  • linear size Tree(T) O(T)
  • can be constructed in linear time O(T)
  • has myriad virtues (A. Apostolico)
  • is well-known 366 000 Google hits

8
Suffix trie and suffix tree
abaab baab aab ab b
Trie(abaab)
Tree(abaab)
a
a
b
baab
b
a
a
ab
baab
a
b
a
a
b
b
9
Trie(T) can be large
  • Trie(T) O(T2)
  • bad example T anbn
  • Trie(T) can be seen as a DFA language accepted
    the suffixes of T
  • minimize the DFA gt directed cyclic word graph
    (DAWG)

10
Tree(T) is of linear size
  • only the internal branching nodes and the leaves
    represented explicitly
  • edges labeled by substrings of T
  • v node(a) if the path from root to v spells a
  • one-to-one correspondence of leaves and suffixes
  • T leaves, hence lt T internal nodes
  • Tree(T) O(T size(edge labels))

11
Tree(hattivatti)
hattivatti attivatti ttivatti tivatti ivatti vatti
atti tti ti i
vatti
i
vatti
t
i
ti
atti
i
vatti
hattivatti
ivatti
ti
vatti
vatti
tti
vatti
atti
tivatti
hattivatti
ttivatti
attivatti
12
Tree(hattivatti)
hattivatti attivatti ttivatti tivatti ivatti vatti
atti tti ti i
vatti
i
vatti
t
i
ti
atti
i
vatti
hattivatti
ivatti
ti
vatti
vatti
tti
vatti
atti
tivatti
hattivatti
ttivatti
attivatti
substring labels of edges represented as pairs of
pointers
hattivatti
13
Tree(hattivatti)
hattivatti attivatti ttivatti tivatti ivatti vatti
atti tti ti i
vatti
i
6
3,3
10
4,5
2,5
i
vatti
hattivatti
5
9
vatti
6,10
8
6,10
7
4
1
3
2
hattivatti
14
Tree(T) is full text index
Tree(T)
P
P occurs in T at locations 8, 31,
31
8
P occurs in T ? P is a prefix of some suffix of T
? Path for P exists in Tree(T) All occurrences
of P in time O(P occ)
15
Find att from Tree(hattivatti)
hattivatti attivatti ttivatti tivatti ivatti vatti
atti tti ti i
vatti
vatti
t
i
ti
atti
i
vatti
hattivatti
ivatti
ti
vatti
vatti
tti
vatti
atti
tivatti
7
hattivatti
ttivatti
attivatti
2
16
Linear time construction of Tree(T)
hattivatti attivatti ttivatti tivatti ivatti vatti
atti tti ti i
McCreight (1976)
Weiner (1973), algorithm of the year
on-line algorithm (Ukkonen 1992)
17
On-line construction of Trie(T)
  • T t1t2 tn
  • Pi t1t2 ti ith prefix of T
  • on-line idea update Trie(Pi) to Trie(Pi1)
  • gt very simple construction

18
Trie(abaab)
Trie(a)
Trie(ab)
Trie(aba)
abaa baa aa ea e
a
a
b
a
b
b
b
a
a
chain of links connects the end
points of current suffixes
19
Trie(abaab)
Trie(abaa)
a
a
b
a
b
a
b
b
b
b
a
a
a
a
a
a
a
20
Trie(abaab)
Trie(abaa)
a
a
b
a
b
a
b
b
b
b
a
a
a
a
a
a
a
Add next symbol b
21
Trie(abaab)
Trie(abaa)
a
a
b
a
b
a
b
b
b
b
a
a
a
a
a
a
a
Add next symbol b
From here on b-arc already exists
22
Trie(abaab)
a
a
b
a
b
a
b
b
b
b
a
a
a
a
a
a
a
Trie(abaab)
a
b
b
a
a
a
b
a
a
b
b
23
What happens in Trie(Pi) gt Trie(Pi1) ?
Before
From here on the ai-arc exists already gt stop
updating here
ai
After
ai
ai
ai
ai
ai
New suffix links
New nodes
24
What happens in Trie(Pi) gt Trie(Pi1) ?
  • time O(size of Trie(T))
  • suffix links
    slink(node(aa)) node(a)

25
On-line procedure for suffix trie
  1. Create Trie(t1) nodes root and v, an arc
    son(root, t1) v, and suffix links slink(v)
    root and slink(root) root
  2. for i 2 to n do begin
  3. vi-1 leaf of Trie(t1ti-1) for string
    t1ti-1 (i.e., the deepest leaf)
  4. v vi-1 v 0
  5. while node v has no outgoing arc for ti do
    begin
  6. Create a new node v and an arc
    son(v,ti) v
  7. if v ? 0 then slink(v) v
  8. v slink(v) v v end
  9. for the node v such that v
    son(v,ti) do if v v then
    slink(v) root else slink(v) v

26
Suffix trees on-line
  • compacted version of the on-line trie
    construction simulate the construction on the
    linear size tree instead of the trie gt time
    O(T)
  • all trie nodes are conceptually still needed gt
    implicit and real nodes

27
Implicit and real nodes
  • Pair (v,a) is an implicit node in Tree(T) if v is
    a node of Tree and a is a (proper) prefix of the
    label of some arc from v. If a is the empty
    string then (v, a) is a real node ( v).
  • Let v node(a) in Tree(T). Then implicit node
    (v, a) represents node(aa) of Trie(T)

28
Implicit node

a
v

a
(v, a)
29
Suffix links and open arcs
root
a

aa
slink(v)
v

label i, instead of i,j if w is a leaf
and j is the scanned position of T
w
30
Big picture
suffix link path traversed total work O(n)





new arcs and nodes created total work
O(size(Tree(T))
31
On-line procedure for suffix tree
Input string T t1t2 tn Output
Tree(T) Notation son(v,a) w iff there is an
arc from v to w with label a son(v,e)
v Function Canonize(v, a) while son(v, a) ?
0 where a a a, a gt 0 do v
son(v, a) a a return (v, a)
32
Suffix-tree on-line main procedure
Create Tree(t1) slink(root) root
(v, a) (root, e) / (v, a) is the start
node / for i 2 to n1 do
v 0 while there is no arc from v
with label prefix ati do if
a ? e then / divide the arc w son(v, a?)
into two / son(v, a)
v son(v,ti) v son(v,?) w
else
son(v,ti) v v v
if v ? 0 then slink(v) v
v v v slink(v) (v, a)
Canonize(v, a) if v ? 0 then
slink(v) v (v, a) Canonize(v,
ati) / (v, a) start node of the next round /
33
The actual time and space
  • Tree(T) is about 20T in practice
  • brute-force construction is O(TlogT) for
    random strings as the average depth of internal
    nodes is O(logT)
  • difference between linear and brute-force
    constructions not necessarily large (Giegerich
    Kurtz)
  • truncated suffix trees k symbols long prefix of
    each suffix represented (Na et al. 2003)
  • alphabet independent linear time (Farach 1997)

34
  1. Suffix tree
  2. Suffix array
  3. Some applications
  4. Finding motifs

35
Suffix array example
e atti attivatti hattivatti i ivatti ti tivatti tt
i ttivatti vatti
11 7 2 1 10 5 9 4 8 3 6
hattivatti attivatti ttivatti tivatti ivatti vatti
atti tti ti i e
  • suffix array lexicographic order of the suffixes

36
Suffix array
  • suffix array SA(T) an array giving the
    lexicographic order of the suffixes of T
  • space requirement 5T
  • practitioners like suffix arrays (simplicity,
    space efficiency)
  • theoreticians like suffix trees (explicit
    structure)

37
Pattern search from suffix array
e atti attivatti hattivatti i ivatti ti tivatti tt
i ttivatti vatti
11 7 2 1 10 5 9 4 8 3 6
hattivatti attivatti ttivatti tivatti ivatti vatti
atti tti ti i e
att
binary search
38
Recent suffix array constructions
  • ManberMyers (1990) O(TlogT)
  • linear time via suffix tree
  • January / June 2003 direct linear time
    construction of suffix array - Kim, Sim, Park,
    Park (CPM03) - Kärkkäinen Sanders (ICALP03) -
    Ko Aluru (CPM03)

39
Kärkkäinen-Sanders algorithm
  1. Construct the suffix array of the suffixes
    starting at positions i mod 3 ? 0. This is done
    by reduction to the suffix array construction of
    a string of two thirds the length, which is
    solved recursively.
  2. Construct the suffix array of the remaining
    suffixes using the result of the first step.
  3. Merge the two suffix arrays into one.

40
Notation
  • string T T0,n) t0t1 tn-1
  • suffix Si Ti,0) titi1 tn-1
  • for C \subset 0,n SC Si i in C
  • suffix array SA0,n of T is a permutation of
    0,n satisfying SSA0 lt SSA1 lt lt SSAn

41
Running example
0 1 2 3 4 5 6 7 8 9 10 11
  • T0,n) y a b b a d a b b a d o 0 0
  • SA (12,1,6,4,9,3,8,2,7,5,10,11,0)

42
Step 0 Construct a sample
  • for k 0,1,2
    Bk i ? 0,n i mod 3 k
  • C B1 U B2 sample positions
  • SC sample suffixes
  • Example B1 1,4,7,10, B2 2,5,8,11, C
    1,4,7,10,2,5,8,11

43
Step 1 Sort sample suffixes
  • for k 1,2, construct
  • Rk tktk1tk2 tk3tk4tk5
    tmaxBktmaxBk1tmaxBk2
  • R R1 R2 concatenation of R1 and R2
  • Suffixes of R correspond to SC suffix
    titi1ti2 corresponds to Si
    correspondence is order preserving.
  • Sort the suffixes of R radix sort the
    characters and rename with ranks to obtain R. If
    all characters different, their order directly
    gives the order of suffixes. Otherwise, sort the
    suffixes of R using Kärkkäinen-Sanders. Note
    R 2n/3.

44
Step 1 (cont.)
  • once the sample suffixes are sorted, assign a
    rank to each rank(Si) the rank of Si in SC
    rank(Sn1) rank(Sn2) 0
  • Example R abbadabbado0bbadab
    bado00 R (1,2,4,6,4,5,3,7) SAR
    (8,0,1,6,4,2,5,3,7) rank(Si) - 1 4 - 2 6 - 5
    3 7 8 0 0

45
Step 2 Sort nonsample suffixes
  • for each non-sample Si ? SB0 (note that
    rank(Si1) is always defined for i ? B0) Si
    Sj ? (ti,rank(Si1)) (tj,rank(Sj1))
  • radix sort the pairs (ti,rank(Si1)).
  • Example S12 lt S6 lt S9 lt S3 lt S0 because (0,0)
    lt (a,5) lt (a,7) lt (b,2) lt (y,1)

46
Step 3 Merge
  • merge the two sorted sets of suffixes using a
    standard comparison-based merging
  • to compare Si ? SC with Sj ? SB0, distinguish two
    cases
  • i ? B1 Si Sj ? (ti,rank(Si1))
    (tj,rank(Sj1))
  • i ? B2 Si Sj ? (ti,ti1,rank(Si2))
    (tj,tj1,rank(Sj2))
  • note that the ranks are defined in all cases!
  • S1 lt S6 as (a,4) lt (a,5) and S3 lt S8 as
    (b,a,6) lt (b,a,7)

47
Running time O(n)
  • excluding the recursive call, everything can be
    done in linear time
  • the recursion is on a string of length 2n/3
  • thus the time is given by recurrence T(n)
    T(2n/3) O(n)
  • hence T(n) O(n)

48
Implementation
  • about 50 lines of C
  • code available e.g. via Juha Kärkkäinens home
    page

49
LCP table
  • Longest Common Prefix of successive elements of
    suffix array
  • LCPi length of the longest common prefix of
    suffixes SSAi and SSAi1
  • build inverse array SA-1 from SA in linear time
  • then LCP table from SA-1 in linear time (Kasai et
    al, CPM2001)

50
Suffix tree vs suffix array
  • suffix tree ? suffix array LCP table

51
  1. Suffix tree
  2. Suffix array
  3. Some applications
  4. Finding motifs

52
Substring motifs of string T
  • string T t1 tn in alphabet A.
  • Problem what are the frequently occurring
    (ungapped) substrings of T? Longest substring
    that occurs at least q times?
  • Thm Suffix tree Tree(T) gives complete
    occurrence counts of all substring motifs of T in
    O(n) time (although T may have O(n2) substrings!)

53
Counting the substring motifs
  • internal nodes of Tree(T) ? repeating
    substrings of T
  • number of leaves of the subtree of a node for
    string P number of occurrences of P in T

54
Substring motifs of hattivatti

vatti
i
vatti
t
4
2
i
ti
atti
i
2
vatti
2
hattivatti
ivatti
2
ti
vatti
vatti
tti
vatti
atti
tivatti
hattivatti
ttivatti
attivatti
Counts for the O(n) maximal motifs shown
55
Finding repeats in DNA
  • human chromosome 3
  • the first 48 999 930 bases
  • 31 min cpu time (8 processors, 4 GB)
  • Human genome 3x109 bases
  • Tree(HumanGenome) feasible

56
Longest repeat?
Occurrences at 28395980, 28401554r Length
2559 ttagggtacatgtgcacaacgtgcaggtttgttacatatgtata
cacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttag
gtatatctccgaatgctatccctcccccctccccccaccccacaacagtc
cccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaa
ttcccacctatgagtgagaacatgcggtgtttggttttttgtccttgcga
aagtttgctgagaatgatggtttccagcttcatccatatccctacaaagg
acatgaactcatcatttttttatggctgcatagtattccatggtgtatat
gtgccacattttcttaacccagtctacccttgttggacatctgggttggt
tccaagtctttgctattgtgaatagtgccgcaataaacatacgtgtgcat
gtgtctttatagcagcatgatttataatcctttgggtatatacccagtaa
tgggatggctgggtcaaatggtatttctagttctagatccctgaggaatc
accacactgacttccacaatggttgaactagtttacagtcccagcaacag
ttcctatttctccacatcctctccagcacctgttgtttcctgacttttta
atgatcgccattctaactggtgtgagatggtatctcattgtggttttgat
ttgcatttctctgatggccagtgatgatgagcattttttcatgtgttttt
tggctgcataaatgtcttcttttgagaagtgtctgttcatatccttcgcc
cacttttgatggggttgtttgtttttttcttgtaaatttgttggagttca
ttgtagattctgggtattagccctttgtcagatgagtaggttgcaaaaat
tttctcccattctgtaggttgcctgttcactctgatggtggtttcttctg
ctgtgcagaagctctttagtttaattagatcccatttgtcaattttggct
tttgttgccatagcttttggtgttttagacatgaagtccttgcccatgcc
tatgtcctgaatggtattgcctaggttttcttctagggtttttatggttt
taggtctaacatgtaagtctttaatccatcttgaattaattataaggtgt
atattataaggtgtaattataaggtgtataattatatattaattataagg
tgtatattaattataaggtgtaaggaagggatccagtttcagctttctac
atatggctagccagttttccctgcaccatttattaaatagggaatccttt
ccccattgcttgtttttgtcaggtttgtcaaagatcagatagttgtagat
atgcggcattatttctgagggctctgttctgttccattggtctatatctc
tgttttggtaccagtaccatgctgttttggttactgtagccttgtagtat
agtttgaagtcaggtagcgtgatggttccagctttgttcttttggcttag
gattgacttggcaatgtgggctcttttttggttccatatgaactttaaag
tagttttttccaattctgtgaagaaattcattggtagcttgatggggatg
gcattgaatctataaattaccctgggcagtatggccattttcacaatatt
gaatcttcctacccatgagcgtgtactgttcttccatttgtttgtatcct
cttttatttcattgagcagtggtttgtagttctccttgaagaggtccttc
acatcccttgtaagttggattcctaggtattttattctctttgaagcaat
tgtgaatgggagttcactcatgatttgactctctgtttgtctgttattgg
tgtataagaatgcttgtgatttttgcacattgattttgtatcctgagact
ttgctgaagttgcttatcagcttaaggagattttgggctgagacgatggg
gttttctagatatacaatcatgtcatctgcaaacagggacaatttgactt
cctcttttcctaattgaatacccgttatttccctctcctgcctgattgcc
ctggccagaacttccaacactatgttgaataggagtggtgagagagggca
tccctgtcttgtgccagttttcaaagggaatgcttccagtttttgtccat
tcagtatgatattggctgtgggtttgtcatagatagctcttattattttg
agatacatcccatcaatacctaatttattgagagtttttagcatgaagag
ttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgt
ggtttctgtctttggttctgtttatatgctggagtacgtttattgatttt
cgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatg
gtggataagctttttgatgtgctgctggattcggtttgccagtattttat
tgaggatttctgcatcgatgttcatcaaggatattggtctaaaattctct
ttttttgttgtgtctctgtcaggctttggtatcaggatgatgctggcctc
ataaaatgagttagg
57
Ten occurrences?
ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtg
cagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgcc
attctcctgcctcagcctcccaagtagctgggactacaggcgcccgccac
tacgcccggctaattttttgtatttttagtagagacggggtttcaccgtt
ttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggc
ctcccaaagtgctgggattacaggcgt Length
277 Occurrences at 10130003, 11421803,
18695837, 26652515, 42971130, 47398125In the
reversed complement at 17858493, 41463059,
42431718, 42580925
58
Using suffix trees plagiarism
  • find longest common substring of strings X and Y
  • build Tree(XY) and find the deepest node which
    has a leaf pointing to X and another pointing to Y

59
Using suffix trees approximate matching
  • edit distance insertions, deletions, changes
  • STOCKHOLM vs TUKHOLMA

60
String distance/similarity functions
STOCKHOLM vs TUKHOLMA STOCKHOLM_
_TU_ KHOLMA gt 2 deletions, 1 insertion, 1 change

61
Approximate string matching
  • A S T O C K H O L M

    B
    T U K H O L M A
  • minimum number of mutation steps a -gt b a
    -gt ? ? -gt b
  • dID(A,B) 5 dLevenshtein(A,B) 4
  • mutation costs gt probabilistic modeling
  • evaluation by dynamic programming gt alignment

62
Dynamic programming
di,j min(if aibj then di-1,j-1 else ?,
di-1,j 1, di,j-1 1)
distance between i-prefix of A and
j-prefix of B (substitution excluded)

B
mxn table d
bj
di-1,j-1
di-1,j
A
1
di,j
di,j-1
ai
1
dm,n
63
di,j min(if aibj then di-1,j-1 else ?,
di-1,j 1, di,j-1 1)

optimal alignment by trace-back
dID(A,B)
64
Search problem
  • find approximate occurrences of pattern P in text
    T substrings P of T such that d(P,P) small
  • dyn progr with small modification O(mn)
  • lots of (practical) improvement tricks

T
P
P
65
Index for approximate searching?
  • dynamic programming P x Tree(T) with backtracking

Tree(T)
P
66
  1. Suffix tree
  2. Suffix array
  3. Some applications
  4. Finding motifs

67
Gapped motifs
abc
68
Gapped motifs
abc
abc
ababbbcccbcaaabca
69
Gapped motifs of T
  • gapped pattern P in (A U )
  • gap symbol matches any symbol in A
  • aabbb
  • L(P) occurrence locations of P in T
  • P is called a motif of T if L(P) gt 1 and a
    motif with quorum q if L(P) q.
  • Problem find occurrence count L(P) for all
    gapped motifs P of T
  • anban has exponentially many motifs!

70
Maximal motifs
  • motif P is the maximal version of motif P if P
    has the largest possible number of non- symbols
    among motifs that have the same set of occurrence
    locations as P
  • every motif has unique maximal version
  • unfortunately still exponential number of
    different maximal motifs

71
Blocks of maximal motifs
  • aaabba has blocks aaa, b, ba
  • Lemma Maximal 1-block motifs ? (branching)
    nodes of Tree(S)
  • Thm Each block of a maximal motif of T is a
    maximal substring motif of T. Hence there are
    O(n) different strings that can be used as a
    block of a maximal motif.
  • gt There are O(n2k-1) different maximal motifs
    with k blocks O(n2k) unrestricted motifs.

72
Tree(T)
Y
X
Two-block maximal pattern XY
73
Counting 2-block maximal motifs
  • Thm The occurrence counts for all maximal motifs
    with two blocks can be found in (optimal) time
    O(n3).
  • c.f., Arimura et al (2000), Apostolico et al
    (2004),

74
Algorithm (very simple)
d
Y
X
2-block motif (X,d,Y)
for each maximal substring motif X for each
distance d 1,2, mark the leaves of
Tree(T) that correspond to locations L(X) d
for each maximal substring motif Y,
find the number h(Y) of marked leaves in its
subtree in Tree(T) the occurrence count
of motif (X,d,Y) is h(Y)
75
Algorithm (very simple)
d
Y
X
2-block motif (X,d,Y)
for each maximal substring motif X for each
distance d 1,2, mark the leaves of
Tree(T) that correspond to locations L(X) d
for each maximal substring motif Y,
find the number h(Y) of marked leaves in its
subtree in Tree(T) the occurrence count
of motif (X,d,Y) is h(Y)
O(n) O(n) O(n)
76
Counting 2-block maximal motifs (cont)
  • Thm The occurrence counts for all maximal motifs
    with two blocks can be found in (optimal) time
    O(n3).
  • flexible gaps xy gap of
    any length
  • Thm The occurrence counts for all maximal motifs
    with two blocks and one flexible gap can be found
    in (optimal) time O(n2).
  • k-block case?

77
Conclusion
  • suffix structures provide very efficient
    algorithmic tools for finding and learning
    potentially interesting patterns in strings

78
General case
  • Q1 Given q and W, has T a motif with at least W
    non-gap symbols and at least q occurrences?
  • In k-block case, is O(n2k-1) (or even better)
    time possible?
  • related work H.Arimuraal, A. Apostolico al,
    M-F. Sagot, L. Parida, N. Pisanti,
Write a Comment
User Comments (0)
About PowerShow.com