Title: On the suffix automaton with mismatches
1 On the suffix automaton with mismatches
- Maxime Crochemore, Chiara Epifanio,
- Alessandra Gabriele, and Filippo Mignosi
2 Outline
- Motivations and basic definitions
- Nerode's congruence with mismatches
- Suffix automata with mismatches
- Conclusions and open problems
3
- In the literature, several data structures have been studied for storing the suffixes of a text. Each of them is designed to give fast access to all factors of the text itself. Among them:
  - suffix tries: representation of all the suffixes of a word by an ordinary tree; quadratic size in the length of the word
  - suffix trees: compact representations of suffix tries; linear size in the length of the word
  - suffix automata: minimization (in the sense of automata theory) of suffix tries; linear size in the length of the word
  - compact suffix automata: compact representations of suffix automata; linear size in the length of the word.
4 Why suffix automata?
- Suffix automata, compact suffix automata and
suffix trees have many applications, such as
indexing, pattern matching, and data compression.
- They all have linear size,
- but suffix trees and compact suffix automata represent strings by pointers to the text, while suffix automata work without needing to access it.
5 Why mismatches?
- Data structures recognizing languages with mismatches are used for approximate string matching and its applications, such as:
  - recovering the original signals after their transmission over noisy channels
  - finding DNA subsequences after possible mutations
  - text searching where there are typing or spelling errors
  - retrieving musical passages.
- Independent theoretical interest, such as, for instance, the modelling of some evolutionary events in molecular biology.
6
- In Blumer et al. (1985):
  - a linear-time algorithm for building the suffix automaton of a word $w$ over a fixed alphabet is given (based on Nerode's congruence)
  - it is shown that this suffix automaton has at least $|w|+1$ states and at most $2|w|$.
- Complexity: Carpi and de Luca (2001) proved that the lower bound is attained by every prefix of the Fibonacci word.
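As an illustration of the exact case, here is a minimal sketch of the classical online construction of the suffix automaton, as it is commonly presented in textbooks. It is not claimed to be the exact formulation of Blumer et al.; the class name `SuffixAutomaton` and its attribute names are assumptions made for this example. The number of states, initial state included, is `len(sa.next)`.

```python
class SuffixAutomaton:
    """Exact suffix automaton of w, built online, one letter at a time."""

    def __init__(self, w):
        self.next = [{}]    # outgoing transitions of each state
        self.link = [-1]    # suffix links
        self.length = [0]   # length of the longest word reaching each state
        last = 0
        for c in w:
            last = self._extend(last, c)
        # Accepting states: those on the suffix-link path from the last state.
        self.accepting = set()
        p = last
        while p != -1:
            self.accepting.add(p)
            p = self.link[p]

    def _new_state(self, length, transitions, link):
        self.next.append(dict(transitions))
        self.length.append(length)
        self.link.append(link)
        return len(self.next) - 1

    def _extend(self, last, c):
        cur = self._new_state(self.length[last] + 1, {}, -1)
        p = last
        while p != -1 and c not in self.next[p]:
            self.next[p][c] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
            return cur
        q = self.next[p][c]
        if self.length[p] + 1 == self.length[q]:
            self.link[cur] = q
            return cur
        # Split q: the clone takes over the shorter words that used to reach q.
        clone = self._new_state(self.length[p] + 1, self.next[q], self.link[q])
        while p != -1 and self.next[p].get(c) == q:
            self.next[p][c] = clone
            p = self.link[p]
        self.link[q] = clone
        self.link[cur] = clone
        return cur

    def accepts(self, x):
        """True if x is a suffix of w."""
        state = 0
        for c in x:
            if c not in self.next[state]:
                return False
            state = self.next[state][c]
        return state in self.accepting


sa = SuffixAutomaton("abaa")   # a prefix of the Fibonacci word
print(len(sa.next))            # number of states
```

On prefixes of the Fibonacci word such as abaa or abaab, no state is ever cloned, which is consistent with the lower bound $|w|+1$ being attained there.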
7 In this paper we focus on the minimal deterministic finite automaton, denoted by $S_k$, that recognizes the set $\mathrm{Suff}(w,k)$ of suffixes of a word $w$ up to $k$ errors.
- First main result: characterization of the Nerode's right-invariant congruence relative to $S_k$, and a conjecture on the size of $S_k$.
- Second main result: description of an algorithm that makes use of $S_k$ in order to accept, in an efficient way, the language of all suffixes of $w$ up to $k$ errors in every window of size $r$ ($r$ = repetition index).
8 Basic definitions
The distance $d(x,y)$ between two strings $x$ and $y$ is the minimal cost of a sequence of operations that transform $x$ into $y$ (and $\infty$ if no such sequence exists).
We consider the Hamming distance, which allows only substitutions, each with cost 1 (simplified definition). It is finite whenever $|x| = |y|$, and then $0 \le d(x,y) \le |x|$.
Ex. $x$ = acgtatct, $y$ = aggttact: $d(x,y) = 3$ (in the simplified definition).
A string $x$ $k$-occurs in $w$ if it occurs in $w$ at position $l$, $1 \le l \le |w|$, up to $k$ errors. A string $x$ that $k$-occurs in $w$ as a suffix of $w$ is a $k$-suffix of $w$.
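The definitions above translate directly into a few lines of code. The following is a minimal sketch; the function names `hamming`, `k_occurrences` and `is_k_suffix` are chosen for this illustration, and positions are 1-indexed as on the slides.

```python
def hamming(x, y):
    """Hamming distance: number of mismatches, infinite if the lengths differ."""
    if len(x) != len(y):
        return float("inf")
    return sum(a != b for a, b in zip(x, y))


def k_occurrences(x, w, k):
    """1-indexed starting positions l where x occurs in w with at most k mismatches."""
    m = len(x)
    return [l for l in range(1, len(w) - m + 2)
            if hamming(x, w[l - 1:l - 1 + m]) <= k]


def is_k_suffix(x, w, k):
    """True if x matches the suffix of w of length |x| with at most k mismatches."""
    return len(x) <= len(w) and hamming(x, w[len(w) - len(x):]) <= k


print(hamming("acgtatct", "aggttact"))         # the example of this slide
print(k_occurrences("baba", "abaababaab", 2))  # starting positions
```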
9 Suffixes with One Mismatch
- $w = a$: $\mathrm{Suff}(a,1) = \{\varepsilon, a, b\}$.
- The minimal automaton has 2 states.
- $w = ab$: $\mathrm{Suff}(ab,1) = \{\varepsilon, a, b, aa, ab, bb\}$.
- The minimal automaton has 4 states.
- $w = aba$: $\mathrm{Suff}(aba,1) = \{\varepsilon, a, b, aa, ba, bb, aaa, aba, abb, bba\}$.
- The minimal automaton has 6 states.
- $w = abaa$: $\mathrm{Suff}(abaa,1) = \{\varepsilon, a, b, aa, ab, ba, aaa, baa, bab, bba, aaaa, abaa, abab, abba, bbaa\}$.
- The minimal automaton has 11 states.
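The four examples above can be reproduced by brute force. The sketch below enumerates $\mathrm{Suff}(w,k)$ by substituting letters in each suffix, then counts the states of the minimal (partial) DFA as the number of distinct non-empty residual languages, which is the standard Nerode characterization. The function names are assumptions of this example, and the approach is exponential, so it is only meant for very short words.

```python
from itertools import combinations, product


def k_suffixes(w, k, alphabet):
    """All strings obtained from a suffix of w by at most k substitutions."""
    result = set()
    for start in range(len(w) + 1):
        suffix = w[start:]
        for m in range(min(k, len(suffix)) + 1):
            for positions in combinations(range(len(suffix)), m):
                for letters in product(alphabet, repeat=m):
                    letters_of_word = list(suffix)
                    for p, c in zip(positions, letters):
                        letters_of_word[p] = c
                    result.add("".join(letters_of_word))
    return result


def minimal_dfa_size(language):
    """Number of states of the minimal partial DFA of a finite language:
    one state per distinct residual {z : pz in language}, p a prefix of a word."""
    prefixes = {x[:i] for x in language for i in range(len(x) + 1)}
    residuals = {
        frozenset(x[len(p):] for x in language if x.startswith(p))
        for p in prefixes
    }
    return len(residuals)


for w in ("a", "ab", "aba", "abaa"):
    print(w, minimal_dfa_size(k_suffixes(w, 1, "ab")))
```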
10 On Nerode's congruence with mismatches
- Definition 1: Let $w \in \Sigma^*$. For every $y \in \Sigma^*$,
  $\text{end-set}_w(y,k) = \{\, i \mid y$ $k$-occurs in $w$ with final position $i \,\}$.
- Notice that $\text{end-set}_w(\varepsilon,k) = \{0, 1, \dots, |w|\}$.
- Definition 2: $x, y \in \Sigma^*$ are end$_k$-equivalent on $w$, written $x \equiv_{w,k} y$, if
  - 1. $\text{end-set}_w(x,k) = \text{end-set}_w(y,k)$;
  - 2. $\forall i \in \text{end-set}_w(x,k) = \text{end-set}_w(y,k)$, the number of errors available in the suffix of $w$ having $i+1$ as first position is the same after the reading of $x$ and of $y$, i.e.
    $\min\{|w|-i,\ k-\mathrm{err}_i(x)\} = \min\{|w|-i,\ k-\mathrm{err}_i(y)\}$,
    where $\mathrm{err}_i(u)$ = #(mismatches) of $u$ in $w$ with final position $i$.
- $[x]_{w,k}$: equivalence class of $x$ with respect to $\equiv_{w,k}$.
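11 In other words
- $x \equiv_{w,k} y$ if
  - $x$ and $y$ have the same end-set in $w$ up to $k$ mismatches, as in the exact case [Blumer et al.],
  - #(available errors) in the suffix of $w$ after the reading of $x$ and of $y$ is the same.
- The definition includes two cases depending on the considered final position $i \in \text{end-set}_w(x,k) = \text{end-set}_w(y,k)$:
  - 2.a) $|w|-i \ge \max\{k-\mathrm{err}_i(x),\ k-\mathrm{err}_i(y)\}$ $\Rightarrow$ $k-\mathrm{err}_i(x) = k-\mathrm{err}_i(y)$ $\Rightarrow$ $\mathrm{err}_i(x) = \mathrm{err}_i(y)$.
    (In this case $\min\{|w|-i,\ k-\mathrm{err}_i(x)\} = k-\mathrm{err}_i(x) = k-\mathrm{err}_i(y) = \min\{|w|-i,\ k-\mathrm{err}_i(y)\}$.)
  - 2.b) $|w|-i \le \min\{k-\mathrm{err}_i(x),\ k-\mathrm{err}_i(y)\}$ $\Rightarrow$ it is possible to have mismatches in any position of the suffix of $w$ having length $|w|-i$.
    This does not necessarily imply that $\mathrm{err}_i(x) = \mathrm{err}_i(y)$.
    (In this case $\min\{|w|-i,\ k-\mathrm{err}_i(x)\} = |w|-i = \min\{|w|-i,\ k-\mathrm{err}_i(y)\}$.)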
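12 Example
- Let $w$ = abaababaab, $k = 2$.
- $x$ = baba, $y$ = babb,
- $\text{end-set}_w(x,2) = \{5, 6, 8, 10\} = \text{end-set}_w(y,2)$,
- but $x \not\equiv_{w,k} y$:
- $i = 5$ $\Rightarrow$ $\mathrm{err}_5(x) = 2$, $\mathrm{err}_5(y) = 1$ $\Rightarrow$
- $\min\{|w|-5,\ 2-\mathrm{err}_5(x)\} = 0 \ne 1 = \min\{|w|-5,\ 2-\mathrm{err}_5(y)\}$
(Figure: the word $w$ with its positions $i = 1, \dots, 10$.)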
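13 Example (contd)
- Let $w$ = abaababaab, $k = 2$.
- $x$ = abaababa, $y$ = baababa, $x \equiv_{w,k} y$:
- $\text{end-set}_w(x,2) = \{8\} = \text{end-set}_w(y,2)$
- $i = 8$ $\Rightarrow$ $\mathrm{err}_8(x) = 0 = \mathrm{err}_8(y)$ $\Rightarrow$
- $\min\{|w|-8,\ 2-\mathrm{err}_8(x)\} = 2 = \min\{|w|-8,\ 2-\mathrm{err}_8(y)\}$
(Figure: the word $w$ with its positions $i = 1, \dots, 10$.)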
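14 Example (contd 2)
- Let $w$ = abaababaab, $k = 2$.
- $x$ = abaababaa, $y$ = baababab, $x \equiv_{w,k} y$:
- $\text{end-set}_w(x,2) = \{9\} = \text{end-set}_w(y,2)$
- $i = 9$ $\Rightarrow$ $\mathrm{err}_9(x) = 0 \ne 1 = \mathrm{err}_9(y)$, but
- $\min\{|w|-9,\ 2-\mathrm{err}_9(x)\} = 1 = \min\{|w|-9,\ 2-\mathrm{err}_9(y)\}$
(Figure: the word $w$ with its positions $i = 1, \dots, 10$.)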
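The three examples can be checked mechanically. Below is a minimal sketch of Definitions 1 and 2, with final positions 1-indexed as on the slides; the function names `err`, `end_set` and `end_k_equivalent` are assumptions of this illustration.

```python
def err(w, u, i):
    """Mismatches of u against the factor of w that ends at position i (1-indexed)."""
    return sum(a != b for a, b in zip(u, w[i - len(u):i]))


def end_set(w, u, k):
    """Final positions i at which u k-occurs in w."""
    return {i for i in range(len(u), len(w) + 1) if err(w, u, i) <= k}


def end_k_equivalent(w, x, y, k):
    """Definition 2: same end-set, and same number of available errors at each end."""
    ends = end_set(w, x, k)
    if ends != end_set(w, y, k):
        return False
    n = len(w)
    return all(
        min(n - i, k - err(w, x, i)) == min(n - i, k - err(w, y, i))
        for i in ends
    )


w = "abaababaab"
print(sorted(end_set(w, "baba", 2)))
print(end_k_equivalent(w, "baba", "babb", 2))
print(end_k_equivalent(w, "abaababaa", "baababab", 2))
```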
15 Results
- In Blumer et al. (exact case):
  - $\equiv_w$ is a right-invariant equivalence relation on $\Sigma^*$.
  - $x \equiv_w y$ $\Rightarrow$ $x$ is a suffix of $y$ (or vice versa).
  - $xy \equiv_w y$ $\Leftrightarrow$ every occurrence of $y$ is immediately preceded by an occurrence of $x$.
- Lemma 1 (approximate case):
  - $\equiv_{w,k}$ is a right-invariant equivalence relation on $\Sigma^*$.
  - $x \equiv_{w,k} y$ $\Rightarrow$ $x$ is a suffix of $y$ up to $2k$ errors (or vice versa).
  - $xy \equiv_{w,k} y$ $\Leftrightarrow$ $\forall i \in \text{end-set}_w(xy,k) = \text{end-set}_w(y,k)$, the $k$-occurrence of $y$ with final position $i$ is immediately preceded by a $t$-occurrence of $x$, where $t = \max\{(k-\mathrm{err}_i(y)) - (|w|-i),\ 0\}$.
16 Results (contd)
- Theorem 1.
  - $x \equiv_{w,k} y$ $\Leftrightarrow$ ($\forall z \in \Sigma^*$, $xz$ is a $k$-suffix of $w$ $\Leftrightarrow$ $yz$ is a $k$-suffix of $w$) (they have the same future in $w$).
- Corollary 1.
  - $\forall w \in \Sigma^*$, the (partial) DFA $S_k = (\Sigma, Q, q_0, F, \delta)$ having
    - input alphabet $\Sigma$,
    - state set $Q = \{[x]_{w,k} \mid x$ is a $k$-occurrence of $w\}$,
    - initial state $q_0 = [\varepsilon]_{w,k}$,
    - accepting states $F$: those equivalence classes that include the $k$-suffixes of $w$ (i.e., whose end-sets include the position $|w|$),
    - transition function $\delta = \{[x]_{w,k} \xrightarrow{a} [xa]_{w,k} \mid x$ and $xa$ are $k$-occurrences of $w\}$,
  - is the minimal deterministic finite automaton which recognizes the set $\mathrm{Suff}(w,k)$.
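Corollary 1 suggests a direct, if naive, way to build $S_k$: represent each class $[x]_{w,k}$ by its signature, the set of pairs $(i, \min\{|w|-i,\ k-\mathrm{err}_i(x)\})$ over the end-set of $x$, and explore the classes from $[\varepsilon]_{w,k}$. The sketch below does this by breadth-first search. It is an illustration under stated assumptions, not the construction algorithm of the authors: the names `build_sk` and `accepts` are hypothetical, no attempt is made at efficiency, and the alphabet is assumed to have at least two letters.

```python
from collections import deque


def build_sk(w, k, alphabet):
    """Return (number of states, transition dict, accepting states) of S_k.

    A state is a frozenset of pairs (i, b): i is a final position of the word
    read so far (0-indexed as 'i letters of w consumed'), b is the number of
    errors still available, capped by the remaining length |w| - i.
    """
    n = len(w)
    start = frozenset((i, min(n - i, k)) for i in range(n + 1))
    states = {start: 0}
    delta = {}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for c in alphabet:
            t = frozenset(
                (i + 1, min(n - i - 1, b - (w[i] != c)))
                for (i, b) in s
                if i < n and b - (w[i] != c) >= 0
            )
            if not t:
                continue  # partial DFA: no transition
            if t not in states:
                states[t] = len(states)
                queue.append(t)
            delta[(states[s], c)] = states[t]
    accepting = {q for s, q in states.items() if any(i == n for i, _ in s)}
    return len(states), delta, accepting


def accepts(automaton, x):
    """True if x is a k-suffix of the word the automaton was built from."""
    _, delta, accepting = automaton
    q = 0
    for c in x:
        if (q, c) not in delta:
            return False
        q = delta[(q, c)]
    return q in accepting


s1 = build_sk("abaa", 1, "ab")
print(s1[0], accepts(s1, "bba"), accepts(s1, "aab"))
```

Two words reach the same state exactly when their signatures coincide, so the state count gives the size of $S_k$ for the chosen word.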
17 What about the size of $S_k$?
- Gad Landau asked for a data structure having size close to $|w|$ that allows approximate pattern matching in time proportional to the query plus the number of occurrences.
- In the non-approximate case, suffix trees and (compact) suffix automata do the job.
- What about the approximate case?
18 Prefixes of Fibonacci word
- 2, 4, 6, 11, 15, 18, 23, 28, 33, 36, 39, 45,
- 50, 56, 61, 64, 67, 70, 73, 79, 84, 90, 96,
- 102, 107, 110, 113, 116, 119, 122, 125,
- 128, 134, 139, 145, 151, 157, 163, 169,
- 175, 180, 183, 186, 189, 192, 195, 198,
- 201, 204, 207, 210, 213, 216, 222, 227,
- 233, 239, 245, 251, 257, 263, 269, x? ...
- It is not in the database of Sloane et al.
19
- Writing $a_{n+1} - a_n$ we obtain
- 2, 2, 5, 4, 3, 5, 5, 5, 3, 3, 6, 5, 6, 5, 3, 3,
- 3, 3, 6, 5, 6, 6, 6, 5, 3, 3, 3, 3, 3, 3, 3, 6,
- 5, 6, 6, 6, 6, 6, 6, 5, 3, 3, 3, 3, 3, 3, 3, 3,
- 3, 3, 3, 3, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
- 6, 5, 3, 3, 3, 3, 3, 3, 3, 3, x? .......
- It seems easier. Let us Run-Length encode.
20 Run-Length encode
- two 2, one 5, one 4, one 3, three 5, two 3,
- one 6, one 5, one 6, one 5, four 3,
- one 6, one 5, three 6, one 5, seven 3,
- one 6, one 5, six 6, one 5, twelve 3,
- one 6, one 5, eleven 6, one 5, twenty 3,
- one 6, one 5, nineteen 6, one 5, x? 3,
- ......
- What is the rule?
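The two transformations used on the last slides, first differences and run-length encoding, are easy to reproduce. The sketch below applies them to the first 32 sizes listed on slide 18; the names `SIZES`, `differences` and `run_length_encode` are chosen for this illustration.

```python
from itertools import groupby

# First 32 terms of the sequence of slide 18 (sizes for prefixes of length 1..32).
SIZES = [2, 4, 6, 11, 15, 18, 23, 28, 33, 36, 39, 45,
         50, 56, 61, 64, 67, 70, 73, 79, 84, 90, 96,
         102, 107, 110, 113, 116, 119, 122, 125, 128]


def differences(seq):
    """Consecutive differences a_{n+1} - a_n."""
    return [b - a for a, b in zip(seq, seq[1:])]


def run_length_encode(seq):
    """List of (value, run length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(seq)]


for value, count in run_length_encode(differences(SIZES)):
    print(f"{count} x {value}")
```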
21 Conjecture on the size of $S_1$ for prefixes of Fibonacci word
- An initial part, and then for $i = 4, 5, \dots$:
- one 6, one 5, $(fib_{i-1} - 2)$ 6, one 5, $(fib_i - 1)$ 3,
- Conjecture 1: The size of the suffix automaton with one mismatch of the prefixes of the Fibonacci word grows according to
  $a_{fib_n} = a_{fib_{n-1}} + 3\,(fib_{n-3} - 1) + 10 + 6\,(fib_{n-4} - 1)$
- We did not prove the rule. The rule holds true up to prefixes of length 2000. It is a conjecture that the rule describes this sequence.
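The conjectured law can be checked against the values listed on slide 18. The sketch below reads the recurrence as $a_{fib_n} = a_{fib_{n-1}} + 3(fib_{n-3} - 1) + 10 + 6(fib_{n-4} - 1)$ and assumes the indexing $fib_1 = 1$, $fib_2 = 2$, $fib_3 = 3$, $fib_4 = 5$, which is the one that matches the run lengths of the previous slide. Both the reading and the indexing are assumptions of this example, as are the names `SIZES` and `check_conjecture`.

```python
# a_1 .. a_62 as listed on slide 18.
SIZES = [2, 4, 6, 11, 15, 18, 23, 28, 33, 36, 39, 45,
         50, 56, 61, 64, 67, 70, 73, 79, 84, 90, 96,
         102, 107, 110, 113, 116, 119, 122, 125,
         128, 134, 139, 145, 151, 157, 163, 169,
         175, 180, 183, 186, 189, 192, 195, 198,
         201, 204, 207, 210, 213, 216, 222, 227,
         233, 239, 245, 251, 257, 263, 269]


def fib_numbers(count):
    """fib_1 = 1, fib_2 = 2, then the usual recurrence (assumed indexing)."""
    fibs = [1, 2]
    while len(fibs) < count:
        fibs.append(fibs[-1] + fibs[-2])
    return fibs


def check_conjecture(sizes):
    """Map n to (predicted a_{fib_n}, listed a_{fib_n}) for every n >= 5 available."""
    a = dict(enumerate(sizes, start=1))
    fib = dict(enumerate(fib_numbers(12), start=1))
    results = {}
    n = 5
    while fib[n] in a:
        predicted = a[fib[n - 1]] + 3 * (fib[n - 3] - 1) + 10 + 6 * (fib[n - 4] - 1)
        results[n] = (predicted, a[fib[n]])
        n += 1
    return results


for n, (predicted, listed) in check_conjecture(SIZES).items():
    print(n, predicted, listed)
```

With 62 listed values, the check covers the prefixes of length 8, 13, 21, 34 and 55.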
22 Other experiments and Final Conjecture
- $w = bba^n$, $n \ge 4$ $\Rightarrow$ $a_{n+1} - a_n = 19 + 6(n-4)$,
- Prefixes of Thue-Morse word $\Rightarrow$ $|S_1| \le 2|w| \log |w|$
- Random words generated by memoryless sources $\Rightarrow$ $|S_k| = O(|w| \log^k |w|)$ [Epifanio, Gabriele, Mignosi, Restivo, Sciortino 2003, 2005; Maas, Nowak 2005].
- Conjecture
  - The suffix automaton with $k$ mismatches of any text $w$ has size $O(|w| \log^k |w|)$.
23 Allowing more mismatches
- Definition
- $w \in \Sigma^*$, $k, r \in \mathbb{Z}_{\ge 0}$, $k \le r$.
- $x$ occurs in $w$ at position $l$ up to $k$ errors in a window of size $r$, or simply $k_r$-occurs in $w$ at position $l$, if
  - $|x| < r$ $\Rightarrow$ $d(x,\ w(l,\ l+|x|-1)) \le k$
  - $|x| \ge r$ $\Rightarrow$ $\forall i$, $1 \le i \le |x|-r+1$, $d(x(i,\ i+r-1),\ w(l+i-1,\ l+i+r-2)) \le k$.
- A string $x$ satisfying the above property is a $k_r$-occurrence of $w$.
- A string $x$ that $k_r$-occurs in $w$ as a suffix of $w$ is a $k_r$-suffix of $w$.
$L(w,k,r) = \{x \mid x$ $k_r$-occurs in $w$ at position $l$, $1 \le l \le |w|-|x|+1\}$.
$\mathrm{Suff}(w,k,r) = \{x \mid x$ is a $k_r$-suffix of $w\}$.
Remark: $\mathrm{Suff}(w,k) = \mathrm{Suff}(w,k,r)$ when $r = |w|$.
24 Example
- $w = abaa$, $k = 1$, $r = 2$
- $L(w,1,2) = \{\varepsilon, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, aaaa, aaab, abaa, abab, abba, bbaa, bbab, bbba\}$
- Remark: $bbba \in L(w,1,2)$, but $bbba \notin L(w,1,4) = L(w,1)$
- $\mathrm{Suff}(w,1,2) = \{\varepsilon, a, b, aa, ab, ba, aaa, aab, baa, bab, bba, aaaa, aaab, abaa, abab, abba, bbaa, bbab, bbba\}$
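The windowed definition and the example can be reproduced by exhaustive enumeration. The sketch below is exponential in $|w|$ and meant only for tiny words; the function names `kr_occurs`, `language` and `suffixes` are assumptions of this illustration, and positions are 1-indexed.

```python
from itertools import product


def hamming(x, y):
    """Number of mismatches between two strings of equal length."""
    return sum(a != b for a, b in zip(x, y))


def kr_occurs(x, w, l, k, r):
    """True if x occurs in w at position l with at most k errors in every window of size r."""
    m = len(x)
    if l < 1 or l + m - 1 > len(w):
        return False
    segment = w[l - 1:l - 1 + m]
    if m < r:
        return hamming(x, segment) <= k
    return all(hamming(x[i:i + r], segment[i:i + r]) <= k
               for i in range(m - r + 1))


def language(w, k, r, alphabet):
    """L(w, k, r): all strings that k_r-occur somewhere in w."""
    words = set()
    for m in range(len(w) + 1):
        for letters in product(alphabet, repeat=m):
            x = "".join(letters)
            if any(kr_occurs(x, w, l, k, r) for l in range(1, len(w) - m + 2)):
                words.add(x)
    return words


def suffixes(w, k, r, alphabet):
    """Suff(w, k, r): strings that k_r-occur in w as a suffix."""
    return {x for x in language(w, k, r, alphabet)
            if kr_occurs(x, w, len(w) - len(x) + 1, k, r)}


print(len(language("abaa", 1, 2, "ab")), len(suffixes("abaa", 1, 2, "ab")))
print("bbba" in language("abaa", 1, 2, "ab"), "bbba" in language("abaa", 1, 4, "ab"))
```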
25 The Repetition Index $R(w,k,r)$ of $w$ is the smallest integer $h$ such that all strings of length $h$ $k_r$-occur at most once in $w$.
- Remarks
- $R(w,k,r)$ is well defined because the integer $h = |w|$ is an element of the set described above
- If $k/r \ge 1/2$ then $R(w,k,r) = |w|$
- The equation $r = R(w,k,r)$ admits a unique solution.
Lemma 2: Given $S_k$, there exists a linear-time algorithm that returns $r = R(w,k,r)$.
Remark: This algorithm labels each state of $S_k$ with an integer that represents a distance from this state to the end.
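For very short words the repetition index can be computed straight from its definition. The sketch below is a brute-force illustration, not the linear-time algorithm of Lemma 2. The names `repetition_index` and `fixed_points` are hypothetical, the search starts at $h = 1$, and only strings over the letters of $w$ are enumerated, which is an assumption of this example.

```python
from itertools import product


def hamming(x, y):
    """Number of mismatches between two strings of equal length."""
    return sum(a != b for a, b in zip(x, y))


def kr_occurs(x, w, l, k, r):
    """True if x occurs in w at position l (1-indexed) with at most k errors per window of size r."""
    m = len(x)
    if l < 1 or l + m - 1 > len(w):
        return False
    segment = w[l - 1:l - 1 + m]
    if m < r:
        return hamming(x, segment) <= k
    return all(hamming(x[i:i + r], segment[i:i + r]) <= k
               for i in range(m - r + 1))


def repetition_index(w, k, r, alphabet=None):
    """Smallest h >= 1 such that every string of length h k_r-occurs at most once in w."""
    alphabet = sorted(set(w)) if alphabet is None else alphabet
    n = len(w)
    for h in range(1, n + 1):
        if all(
            sum(kr_occurs("".join(t), w, l, k, r) for l in range(1, n - h + 2)) <= 1
            for t in product(alphabet, repeat=h)
        ):
            return h
    return n


def fixed_points(w, k):
    """All r with max(k, 1) <= r <= |w| such that r = R(w, k, r)."""
    return [r for r in range(max(k, 1), len(w) + 1)
            if repetition_index(w, k, r) == r]


print(repetition_index("abaa", 1, 2))  # here k/r = 1/2
print(fixed_points("abab", 1))
```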
26 Algorithm that lets $S_k$ recognize $\mathrm{Suff}(w,k,r)$
- Algorithm ($x$, $r$, $S_k$)
- $|x| \le r = R(w,k,r)$:
  - if $x$ is accepted by $S_k$ then $x \in \mathrm{Suff}(w,k,r)$
  - else $x \notin \mathrm{Suff}(w,k,r)$
- $|x| > r = R(w,k,r)$:
  - let $x'$ be the prefix of $x$ such that $|x'| = r = R(w,k,r)$
  - let $q$ be the state reached after reading $x'$ and $i$ the integer associated to $q$
  - $|w| - i - r + 1 = j$ is the unique possible initial position of $x$
  - check if $x$ $k_r$-occurs at position $j$ in $w$.
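The control flow of this algorithm can be sketched without building $S_k$. In the code below the two uses of the automaton are replaced by naive stand-ins: acceptance by $S_k$ becomes a direct Hamming check against the suffix of $w$, and the state label $i$ becomes a scan for the position where the prefix $x'$ $k$-occurs. These stand-ins, the function name `in_suff_kr`, and the requirement that the caller passes $r = R(w,k,r)$ are all assumptions of this illustration.

```python
def hamming(x, y):
    """Number of mismatches between two strings of equal length."""
    return sum(a != b for a, b in zip(x, y))


def kr_occurs(x, w, l, k, r):
    """True if x occurs in w at position l (1-indexed) with at most k errors per window of size r."""
    m = len(x)
    if l < 1 or l + m - 1 > len(w):
        return False
    segment = w[l - 1:l - 1 + m]
    if m < r:
        return hamming(x, segment) <= k
    return all(hamming(x[i:i + r], segment[i:i + r]) <= k
               for i in range(m - r + 1))


def in_suff_kr(x, w, k, r):
    """Decide whether x is in Suff(w, k, r), assuming r = R(w, k, r)."""
    n = len(w)
    if len(x) > n:
        return False
    if len(x) <= r:
        # Stand-in for "x is accepted by S_k".
        return hamming(x, w[n - len(x):]) <= k
    prefix = x[:r]
    # Stand-in for reading the prefix in S_k and using the label of the state
    # reached: since r = R(w, k, r), the prefix k-occurs at most once.
    starts = [l for l in range(1, n - r + 2)
              if hamming(prefix, w[l - 1:l - 1 + r]) <= k]
    if not starts:
        return False
    j = starts[0]
    # x must end exactly at the end of w and k_r-occur at position j.
    return j + len(x) - 1 == n and kr_occurs(x, w, j, k, r)


# For w = abab and k = 1, the value r = 3 satisfies r = R(w, k, r).
print(in_suff_kr("bbaa", "abab", 1, 3), in_suff_kr("bbbb", "abab", 1, 3))
```

The point of the second branch is that only one candidate position has to be verified, which is what makes the labelling of Lemma 2 useful.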
27 Conclusions and open problems
- $S_k$ can be useful for approximate indexing.
- If Conjecture 2 is true and the constants involved in the O-notation are small, our data structure is useful for some classical applications of approximate indexing.
- We think that it is possible to connect $S_k$ with $S_{k,r}$, and we conjecture that $|S_{k,r}| = O(|S_k|)$.
- We think that it is possible to obtain an online algorithm even when dealing with mismatches. It would probably be more complex than the classical one. How to define it remains an open problem.