On the suffix automaton with mismatches - PowerPoint PPT Presentation

1
On the suffix automaton with mismatches
  • Maxime Crochemore, Chiara Epifanio,
  • Alessandra Gabriele, and Filippo Mignosi

2
Outline
  1. Motivations and basic definitions
  2. Nerode's congruence with mismatches
  3. Suffix automata with mismatches
  4. Conclusions and open problems

3
  • In the literature, several data structures have been
    studied for storing the suffixes of a text. Each
    of them is designed to give fast access to
    all factors of the text itself. Among them:
  • suffix tries: representation of all the suffixes
    of a word by an ordinary tree; quadratic size in
    the length of the word
  • suffix trees: compact representations of suffix
    tries; linear size in the length of the word
  • suffix automata: minimization (in the sense of
    automata) of suffix tries; linear size in the
    length of the word
  • compact suffix automata: compact representations
    of suffix automata; linear size in the length of
    the word.

4
Why suffix automata?
  • Suffix automata, compact suffix automata and
    suffix trees have many applications, such as
    indexing, pattern matching, and data compression.
  • They all have linear size,
  • but
  • suffix trees and compact suffix automata
    represent strings by pointers to the text, while
    suffix automata work without needing to
    access it.

5
Why mismatches?
  1. Data structures recognizing languages with
    mismatches, for approximate string matching and
    its applications, such as:
      - recovering the original signals after their
        transmission over noisy channels
      - finding DNA subsequences after possible mutations
      - text searching where there are typing or
        spelling errors
      - retrieving musical passages.
  2. Independent theoretical interest, for
    instance the modelling of some evolutionary
    events in molecular biology.

6
  • In Blumer et al. (1985):
  • a linear-time algorithm for building the suffix
    automaton of a word $w$ over a fixed alphabet is
    given (based on Nerode's congruence)
  • it is shown that this suffix automaton must
    have at least $|w|+1$ states and at most $2|w|$.
    Carpi and de Luca (2001) proved
    that the lower bound is attained for every prefix of
    the Fibonacci word.
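
As a hedged illustration of these bounds, the Python sketch below counts the states of the (exact) suffix automaton using the widely known online construction with suffix links. It is a standard textbook variant and is not claimed to match the presentation of Blumer et al. line by line; the names `suffix_automaton_size` and `fibonacci_prefix` are illustrative.

```python
def suffix_automaton_size(w):
    """Number of states of the suffix automaton of w (online construction)."""
    length = [0]      # length[s]: longest string reaching state s
    link = [-1]       # suffix links
    nxt = [{}]        # transitions
    last = 0
    for c in w:
        cur = len(length)
        length.append(length[last] + 1)
        link.append(-1)
        nxt.append({})
        p = last
        while p != -1 and c not in nxt[p]:
            nxt[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # split q: the clone keeps the shorter strings of q
                clone = len(length)
                length.append(length[p] + 1)
                link.append(link[q])
                nxt.append(dict(nxt[q]))
                while p != -1 and nxt[p].get(c) == q:
                    nxt[p][c] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return len(length)


def fibonacci_prefix(n):
    """Prefix of length n of the Fibonacci word abaababaab..."""
    x, y = "a", "ab"
    while len(y) < n:
        x, y = y, y + x
    return y[:n]


for n in (5, 8, 13):
    w = fibonacci_prefix(n)
    print(w, suffix_automaton_size(w))  # lower bound |w| + 1 expected
```

A word such as `abb` needs one extra (cloned) state, so it does not meet the lower bound.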

7
In this paper we focus on the minimal
deterministic finite automaton, denoted by $S_k$,
that recognizes the set of suffixes $\mathrm{Suff}(w,k)$ of
a word $w$ up to $k$ errors.
  1. First main result: characterization of
    Nerode's right-invariant congruence relative to
    $S_k$, and a conjecture on the size of $S_k$.
  2. Second main result: description of an algorithm
    that makes use of $S_k$ in order to accept, in an
    efficient way, the language of all suffixes of $w$
    up to $k$ errors in every window of size $r$
    ($r$ = repetition index).

8
Basic definitions
The distance $d(x,y)$ between two strings $x$ and $y$
is the minimal cost of a sequence of operations
that transforms $x$ into $y$ (and $\infty$ if no such
sequence exists).
We consider the Hamming distance, which allows
only substitutions, with cost 1 (simplified
definition). It is finite whenever $|x| = |y|$, and
then $0 \le d(x,y) \le |x|$.
Ex.
$x = acgtatct$, $y = aggttact$
$d(x,y) = 3$ (in the simplified
definition)
A string $x$ $k$-occurs in $w$ if it occurs in $w$ at
position $l$, $1 \le l \le |w|$, up to $k$ errors. A string $x$
that $k$-occurs in $w$ as a suffix of $w$ is a
$k$-suffix of $w$.
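
The definitions above can be sketched in a few lines of Python. This is only an illustration: the helper names `hamming`, `k_occurrences` and `is_k_suffix` are hypothetical, and positions are 1-based as on the slide.

```python
def hamming(x, y):
    """Hamming distance: number of mismatching positions, infinite if lengths differ."""
    if len(x) != len(y):
        return float("inf")
    return sum(a != b for a, b in zip(x, y))


def k_occurrences(x, w, k):
    """1-based starting positions l at which x occurs in w with at most k mismatches."""
    m = len(x)
    return [l for l in range(1, len(w) - m + 2)
            if hamming(x, w[l - 1:l - 1 + m]) <= k]


def is_k_suffix(x, w, k):
    """True if x matches the suffix of w of the same length with at most k mismatches."""
    return len(x) <= len(w) and hamming(x, w[len(w) - len(x):]) <= k


print(hamming("acgtatct", "aggttact"))   # the example from the slide
print(k_occurrences("bb", "abaa", 1))
print(is_k_suffix("bab", "abaa", 1))
```
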
9
Suffixes with One Mismatch
  • $w = a$: $\mathrm{Suff}(a,1) = \{\varepsilon, a, b\}$.
  • The minimal automaton has 2 states.
  • $w = ab$: $\mathrm{Suff}(ab,1) = \{\varepsilon, a, b, aa, ab, bb\}$.
  • The minimal automaton has 4 states.
  • $w = aba$: $\mathrm{Suff}(aba,1) = \{\varepsilon, a, b, aa, ba, bb, aaa, aba, abb, bba\}$.
    The minimal automaton has 6 states.
  • $w = abaa$: $\mathrm{Suff}(abaa,1) = \{\varepsilon, a, b, aa, ab, ba, aaa, baa, bab,
    bba, aaaa, abaa, abab, abba, bbaa\}$.
  • The minimal automaton has 11 states.
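
These small cases can be re-derived by brute force. The sketch below, which assumes the binary alphabet {a, b} used in the examples, enumerates Suff(w, 1) and counts the states of the minimal (partial) DFA as the number of distinct non-empty residual languages. It is exponential in |w| and only meant for tiny words; `suff` and `minimal_dfa_size` are illustrative names.

```python
from itertools import product


def hamming(x, y):
    """Number of mismatches between two strings of equal length."""
    return sum(a != b for a, b in zip(x, y))


def suff(w, k, alphabet="ab"):
    """All strings within Hamming distance k of some suffix of w."""
    result = set()
    for m in range(len(w) + 1):
        suffix = w[len(w) - m:]
        for cand in product(alphabet, repeat=m):
            x = "".join(cand)
            if hamming(x, suffix) <= k:
                result.add(x)
    return result


def minimal_dfa_size(language):
    """States of the minimal partial DFA of a finite language:
    one state per distinct non-empty residual of a prefix."""
    prefixes = {x[:i] for x in language for i in range(len(x) + 1)}
    residuals = {frozenset(x[len(u):] for x in language if x.startswith(u))
                 for u in prefixes}
    return len(residuals)


for w in ("a", "ab", "aba", "abaa"):
    print(w, sorted(suff(w, 1), key=lambda s: (len(s), s)), minimal_dfa_size(suff(w, 1)))
```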

10
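On Nerode's congruence with mismatches
  • Definition 1: Let $w \in \Sigma^*$. For every $y \in \Sigma^*$, $y \ne \varepsilon$,
  • $\text{end-set}_w(y,k)$ is the set of all $i$ such that $y$ $k$-occurs in $w$ with final
    position $i$.
  • Notice that $\text{end-set}_w(\varepsilon,k) = \{0, 1, \ldots, |w|\}$.
  • Definition 2: $x, y \in \Sigma^*$ are $\mathrm{end}_k$-equivalent on $w$,
    written $x \equiv_{w,k} y$, if
  • 1. $\text{end-set}_w(x,k) = \text{end-set}_w(y,k)$;
  • 2. $\forall i \in \text{end-set}_w(x,k) = \text{end-set}_w(y,k)$, the number
    of errors available in the suffix of $w$ having $i+1$
    as first position is the same after the reading
    of $x$ and of $y$, i.e.
  • $\min\{|w|-i,\ k-\mathrm{err}_i(x)\} = \min\{|w|-i,\ k-\mathrm{err}_i(y)\}$,
  • where $\mathrm{err}_i(u)$ is the number of mismatches of $u$ in $w$ with final
    position $i$.
  • $[x]_{w,k}$: equivalence class of $x$ with respect to
    $\equiv_{w,k}$.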

11
In other words
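  • $x \equiv_{w,k} y$ if
  • $x$ and $y$ have the same end-set in $w$ up to $k$
    mismatches, as in the exact case [Blumer et al.],
  • the number of available errors in the suffix of $w$ after the
    reading of $x$ and of $y$ is the same.
  • The definition includes two cases depending on
    the considered final position $i \in \text{end-set}_w(x,k) =
    \text{end-set}_w(y,k)$:
  • 2.a) $|w|-i \ge \max\{k-\mathrm{err}_i(x),\ k-\mathrm{err}_i(y)\} \Rightarrow
    k-\mathrm{err}_i(x) = k-\mathrm{err}_i(y) \Rightarrow \mathrm{err}_i(x) = \mathrm{err}_i(y)$. (In this
    case $\min\{|w|-i,\ k-\mathrm{err}_i(x)\} = k-\mathrm{err}_i(x) = k-\mathrm{err}_i(y) =
    \min\{|w|-i,\ k-\mathrm{err}_i(y)\}$.)
  • 2.b) $|w|-i \le \min\{k-\mathrm{err}_i(x),\ k-\mathrm{err}_i(y)\} \Rightarrow$ it is
    possible to have mismatches in any position of
    the suffix of $w$ having length $|w|-i$.
  • This does not necessarily imply that $\mathrm{err}_i(x) =
    \mathrm{err}_i(y)$.
  • (In this case $\min\{|w|-i,\ k-\mathrm{err}_i(x)\} = |w|-i =
    \min\{|w|-i,\ k-\mathrm{err}_i(y)\}$.)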

12
Example
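  • Let $w = abaababaab$, $k = 2$.
  • $x = baba$, $y = babb$,
  • $\text{end-set}_w(x,2) = \{5, 6, 8, 10\} = \text{end-set}_w(y,2)$
  • but $x \not\equiv_{w,k} y$:
  • $i = 5 \Rightarrow \mathrm{err}_5(x) = 2$, $\mathrm{err}_5(y) = 1 \Rightarrow$
  • $\min\{|w|-5,\ 2-\mathrm{err}_5(x)\} = 0 \ne 1 =
    \min\{|w|-5,\ 2-\mathrm{err}_5(y)\}$

[Figure: positions $i = 1, \ldots, 10$ of $w$.]
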
13
Example (contd)
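  • Let $w = abaababaab$, $k = 2$.
  • $x = abaababa$, $y = baababa$, $x \equiv_{w,k} y$:
  • $\text{end-set}_w(x,2) = \{8\} = \text{end-set}_w(y,2)$
  • $i = 8 \Rightarrow \mathrm{err}_8(x) = 0 = \mathrm{err}_8(y) \Rightarrow$
  • $\min\{|w|-8,\ 2-\mathrm{err}_8(x)\} = 2 = \min\{|w|-8,\ 2-\mathrm{err}_8(y)\}$

[Figure: positions $i = 1, \ldots, 10$ of $w$.]
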
14
Example (contd2)
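  • Let $w = abaababaab$, $k = 2$.
  • $x = abaababaa$, $y = baababab$, $x \equiv_{w,k} y$:
  • $\text{end-set}_w(x,2) = \{9\} = \text{end-set}_w(y,2)$
  • $i = 9 \Rightarrow \mathrm{err}_9(x) = 0 \ne 1 = \mathrm{err}_9(y)$, but
  • $\min\{|w|-9,\ 2-\mathrm{err}_9(x)\} = 1 = \min\{|w|-9,\ 2-\mathrm{err}_9(y)\}$

[Figure: positions $i = 1, \ldots, 10$ of $w$.]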
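
The three examples can be checked mechanically. The following Python sketch computes the end-set of a string together with its error counts, then compares the pairs (i, min{|w|-i, k-err_i}) as in Definition 2. The function names are illustrative, and the brute-force scan is only intended for short words.

```python
def end_set(x, w, k):
    """Map each final position i (1-based) of a k-occurrence of x in w to err_i(x)."""
    m, n = len(x), len(w)
    out = {}
    for i in range(m, n + 1):
        err = sum(a != b for a, b in zip(x, w[i - m:i]))
        if err <= k:
            out[i] = err
    return out


def signature(x, w, k):
    """For each i in the end-set, the number of errors still available after i."""
    n = len(w)
    return {i: min(n - i, k - err) for i, err in end_set(x, w, k).items()}


def equivalent(x, y, w, k):
    """end_k-equivalence of x and y on w (Definition 2)."""
    return signature(x, w, k) == signature(y, w, k)


w, k = "abaababaab", 2
print(sorted(end_set("baba", w, k)), sorted(end_set("babb", w, k)))
print(equivalent("baba", "babb", w, k))            # same end-set, not equivalent
print(equivalent("abaababa", "baababa", w, k))
print(equivalent("abaababaa", "baababab", w, k))   # different err_9, still equivalent
```
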
15
Results
  • In Blumer et al. (exact case):
  • $\equiv_w$ is a right-invariant equivalence relation on
    $\Sigma^*$.
  • $x \equiv_w y \Rightarrow x$ is a suffix of $y$ (or vice versa).
  • $xy \equiv_w y \Leftrightarrow$
  • every occurrence of $y$ is immediately
    preceded by an occurrence of $x$.
  • Lemma 1 (approximate case):
  • $\equiv_{w,k}$ is a right-invariant equivalence relation
    on $\Sigma^*$.
  • $x \equiv_{w,k} y \Rightarrow x$ is a suffix of $y$ up to $2k$ errors (or
    vice versa).
  • $xy \equiv_{w,k} y \Leftrightarrow$
  • $\forall i \in \text{end-set}_w(xy,k) = \text{end-set}_w(y,k)$,
  • the $k$-occurrence of $y$ with final position $i$ is
    immediately preceded by a $t$-occurrence of $x$,
    where $t = \max\{(k-\mathrm{err}_i(y)) - (|w|-i),\ 0\}$.

16
Results (contd)
  • Theorem 1.
  • $x \equiv_{w,k} y \Leftrightarrow (\forall z \in \Sigma^*$, $xz$ is a $k$-suffix of $w \Leftrightarrow yz$ is a
    $k$-suffix of $w$) (they have the same future in $w$).
  • Corollary 1.
  • $\forall w \in \Sigma^*$, the (partial) DFA $S_k = (\Sigma, Q, q_0, F, \delta)$ having
  • input alphabet $\Sigma$,
  • state set $Q = \{[x]_{w,k} \mid x$ is a $k$-occurrence of $w\}$,
  • initial state $q_0 = [\varepsilon]_{w,k}$,
  • accepting states ($F$): those equivalence classes
    that include the $k$-suffixes of $w$ (i.e., whose
    end-sets include the position $|w|$),
  • transition function $\delta$: $[x]_{w,k} \xrightarrow{a} [xa]_{w,k}$, where $x$ and
    $xa$ are $k$-occurrences of $w$,
  • is the minimal deterministic finite automaton
    which recognizes the set $\mathrm{Suff}(w, k)$.
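
A minimal sketch of the construction in Corollary 1, assuming the binary alphabet {a, b}: each state is identified with the set of pairs (i, min{|w|-i, k-err_i(x)}) of any representative x of its class, and transitions are explored from one representative per class. This recomputes every signature by brute force, so it is an illustration of the definition rather than an efficient algorithm; `signature`, `build_sk` and `accepts` are hypothetical names.

```python
def signature(x, w, k):
    """Pairs (i, min(|w|-i, k-err_i(x))) over the final positions i of k-occurrences of x."""
    m, n = len(x), len(w)
    sig = set()
    for i in range(m, n + 1):
        err = sum(a != b for a, b in zip(x, w[i - m:i]))
        if err <= k:
            sig.add((i, min(n - i, k - err)))
    return frozenset(sig)


def build_sk(w, k, alphabet="ab"):
    """Return (initial state, transitions, accepting states, all states)."""
    start = signature("", w, k)
    rep = {start: ""}              # one representative string per class
    delta = {}
    todo = [start]
    while todo:
        q = todo.pop()
        for a in alphabet:
            t = signature(rep[q] + a, w, k)
            if not t:              # xa is not a k-occurrence: no transition
                continue
            if t not in rep:
                rep[t] = rep[q] + a
                todo.append(t)
            delta[(q, a)] = t
    final = {q for q in rep if any(i == len(w) for i, _ in q)}
    return start, delta, final, set(rep)


def accepts(x, start, delta, final):
    """Run the partial DFA on x."""
    q = start
    for a in x:
        q = delta.get((q, a))
        if q is None:
            return False
    return q in final


for w in ("a", "ab", "aba", "abaa"):
    print(w, len(build_sk(w, 1)[3]))
```

On the four small words of the earlier slide, the state counts agree with the sizes reported there.
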
17
What about the size of $S_k$?
  • Gad Landau asked for a data structure having size
    close to $|w|$ that allows approximate pattern
    matching in time proportional to the query plus
    the number of occurrences.
  • In the non-approximate case, suffix trees and
    (compact) suffix automata do the job.
  • What about the approximate case?

18
Prefixes of Fibonacci word
  • 2, 4, 6, 11, 15, 18, 23, 28, 33, 36, 39, 45,
  • 50, 56, 61, 64, 67, 70, 73, 79, 84, 90, 96,
  • 102, 107, 110, 113, 116, 119, 122, 125,
  • 128, 134, 139, 145, 151, 157, 163, 169,
  • 175, 180, 183, 186, 189, 192, 195, 198,
  • 201, 204, 207, 210, 213, 216, 222, 227,
  • 233, 239, 245, 251, 257, 263, 269, x?.....
  • This sequence is not in the database of Sloane et al.

19
  • Writing the differences $a_{n+1} - a_n$ we obtain
  • 2, 2, 5, 4, 3, 5, 5, 5, 3, 3, 6, 5, 6, 5, 3, 3,
  • 3, 3, 6, 5, 6, 6, 6, 5, 3, 3, 3, 3, 3, 3, 3, 6,
  • 5, 6, 6, 6, 6, 6, 6, 5, 3, 3, 3, 3, 3, 3, 3, 3,
  • 3, 3, 3, 3, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
  • 6, 5, 3, 3, 3, 3, 3, 3, 3, 3, x? .......
  • This seems easier. Let us run-length encode it.

20
Run-Length encode
  • two 2, one 5, one 4, one 3, three 5, two 3,
  • one 6, one 5, one 6, one 5, four 3,
  • one 6, one 5, three 6, one 5, seven 3,
  • one 6, one 5, six 6, one 5, twelve 3,
  • one 6, one 5, eleven 6, one 5, twenty 3,
  • one 6, one 5, nineteen 6, one 5, x? 3,
  • ......
  • What is the rule?
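
The two steps above (differencing, then run-length encoding) can be reproduced with a short Python sketch. The list `sizes` is copied from the terms printed on the earlier slide, up to 269; the variable names are illustrative.

```python
from itertools import groupby

# Terms listed for the prefixes of the Fibonacci word (lengths 1, 2, 3, ...).
sizes = [2, 4, 6, 11, 15, 18, 23, 28, 33, 36, 39, 45,
         50, 56, 61, 64, 67, 70, 73, 79, 84, 90, 96,
         102, 107, 110, 113, 116, 119, 122, 125,
         128, 134, 139, 145, 151, 157, 163, 169,
         175, 180, 183, 186, 189, 192, 195, 198,
         201, 204, 207, 210, 213, 216, 222, 227,
         233, 239, 245, 251, 257, 263, 269]

# First differences a_{n+1} - a_n.
diffs = [b - a for a, b in zip(sizes, sizes[1:])]

# Run-length encoding as (value, run length) pairs.
runs = [(value, len(list(group))) for value, group in groupby(diffs)]

print(diffs)
print(runs)   # the last run is cut off where the listed terms stop
```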

21
Conjecture on the size of $S_1$ for prefixes of
the Fibonacci word
  • An initial part, and then for $i = 4, 5, \ldots$:
  • one 6, one 5, $(fib_{i-1}-2)$ 6, one 5, $(fib_i-1)$ 3,
  • Conjecture 1: The size of the suffix automaton
    with one mismatch of the prefixes of the
    Fibonacci word grows according to
  • $a_{fib_n} = a_{fib_{n-1}} + 3(fib_{n-3}-1) + 10 +
    6(fib_{n-4}-1)$
  • We did not prove the rule. The rule holds true up
    to prefixes of length 2000. It is a conjecture
    that the rule describes this sequence.
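
As a sanity check on the listed data only (not a proof), the recurrence can be tested at the Fibonacci lengths that appear among the printed terms, reading it as a(fib_n) = a(fib_{n-1}) + 3(fib_{n-3} - 1) + 10 + 6(fib_{n-4} - 1). In the sketch below, the dictionary `a` holds the sizes read off the printed sequence at lengths 5, 8, 13, 21, 34 and 55, and `fib` indexes the Fibonacci numbers 1, 2, 3, 5, 8, ...; both names are illustrative.

```python
# Sizes of S_1 read off the printed sequence, keyed by prefix length.
a = {5: 15, 8: 28, 13: 50, 21: 84, 34: 139, 55: 227}

# Fibonacci numbers 1, 2, 3, 5, 8, ... (so fib[j] = fib[j-1] + fib[j-2]).
fib = [1, 2, 3, 5, 8, 13, 21, 34, 55]


def predicted(j):
    """Right-hand side of the conjectured recurrence for the prefix of length fib[j]."""
    return a[fib[j - 1]] + 3 * (fib[j - 3] - 1) + 10 + 6 * (fib[j - 4] - 1)


for j in range(4, len(fib)):
    print(fib[j], a[fib[j]], predicted(j))
```

For these lengths the predicted and listed values coincide, which is consistent with the block structure of the run-length encoding.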

22
Other experiments and Final Conjecture
  • $bba^n$, $n \ge 4 \Rightarrow a_{n+1} - a_n = 19 + 6(n-4)$,
  • Prefixes of the Thue-Morse word $\Rightarrow |S_1| \le 2|w|\log|w|$
  • Random words generated by memoryless sources $\Rightarrow
    |S_k| = O(|w|\log^k|w|)$ [Epifanio, Gabriele, Mignosi,
    Restivo, Sciortino 2003, 2005; Maas, Nowak 2005].
  • Conjecture 2
  • The suffix automaton with $k$ mismatches of any
    text $w$ has size $O(|w|\log^k|w|)$.

23
Allowing more mismatches
  • Definition
  • $w \in \Sigma^*$, $k, r$ non-negative integers, $k \le r$.
  • $x$ occurs in $w$ at position $l$ up to $k$ errors in a
    window of size $r$, or simply $k_r$-occurs in $w$ at
    position $l$, if
  • if $|x| < r \Rightarrow d(x, w(l,
    l+|x|-1)) \le k$
  • if $|x| \ge r \Rightarrow \forall i$, $1 \le i \le |x|-r+1$, $d(x(i,i+r-1),
    w(l+i-1, l+i+r-2)) \le k$.
  • A string $x$ satisfying the above property is a
    $k_r$-occurrence of $w$.
  • A string $x$ that $k_r$-occurs in $w$ as a suffix of $w$
    is a $k_r$-suffix of $w$.

$L(w,k,r) = \{x \mid x$ $k_r$-occurs in $w$ at position $l$, $1 \le
l \le |w|-|x|+1\}$. $\mathrm{Suff}(w,k,r) = \{x \mid x$ is a $k_r$-suffix of
$w\}$. Remark: $\mathrm{Suff}(w,k) = \mathrm{Suff}(w,k,r)$ when $r =
|w|$.
24
Example
  • $w = abaa$, $k = 1$, $r = 2$
  • $L(w,1,2) = \{\varepsilon, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa,
    bab, bba, bbb, aaaa, aaab, abaa, abab, abba, bbaa, bbab,
    bbba\}$
  • Remark: $bbba \in L(w,1,2)$, but $bbba \notin
    L(w,1,4) = L(w,1)$
  • $\mathrm{Suff}(w,1,2) = \{\varepsilon, a, b, aa, ab, ba, aaa, aab, baa, bab, bba,
    aaaa, aaab, abaa, abab, abba, bbaa, bbab, bbba\}$
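
The windowed definition can be checked against this example with a brute-force Python sketch. It assumes the binary alphabet {a, b} and 1-based positions; `kr_occurs_at`, `lang` and `suff` are illustrative names, and the enumeration is exponential in |w|.

```python
from itertools import product


def hamming(x, y):
    """Number of mismatches between two strings of equal length."""
    return sum(a != b for a, b in zip(x, y))


def kr_occurs_at(x, w, l, k, r):
    """True if x occurs in w at 1-based position l with at most k errors
    in every window of size r."""
    m = len(x)
    if l < 1 or l + m - 1 > len(w):
        return False
    seg = w[l - 1:l - 1 + m]
    if m < r:
        return hamming(x, seg) <= k
    return all(hamming(x[i:i + r], seg[i:i + r]) <= k for i in range(m - r + 1))


def words(max_len, alphabet="ab"):
    """All strings over the alphabet of length 0..max_len."""
    for m in range(max_len + 1):
        for c in product(alphabet, repeat=m):
            yield "".join(c)


def lang(w, k, r):
    """L(w, k, r): strings that k_r-occur somewhere in w."""
    return {x for x in words(len(w))
            if any(kr_occurs_at(x, w, l, k, r) for l in range(1, len(w) - len(x) + 2))}


def suff(w, k, r):
    """Suff(w, k, r): strings that k_r-occur as a suffix of w."""
    return {x for x in words(len(w))
            if kr_occurs_at(x, w, len(w) - len(x) + 1, k, r)}


print(len(lang("abaa", 1, 2)), len(suff("abaa", 1, 2)))
print("bbba" in lang("abaa", 1, 2), "bbba" in lang("abaa", 1, 4))
```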

25
The Repetition Index $R(w,k,r)$ of $w$ is the
smallest integer $h$ such that all strings of
length $h$ $k_r$-occur at most once in $w$.
  • Remarks
  • $R(w,k,r)$ is well defined because the integer
    $h = |w|$ is an element of the set described above
  • If $k/r \ge 1/2$ then $R(w,k,r) = |w|$
  • The equation $r = R(w,k,r)$ admits a unique solution.

Lemma 2: Given $S_k$ there exists a linear-time
algorithm that returns $r = R(w,k,r)$. Remark: This
algorithm labels each state of $S_k$ with an integer
that represents a distance from this state to the
end.
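
For small words the repetition index can be computed directly from its definition. The sketch below is a brute-force illustration over the binary alphabet {a, b}, not the linear-time algorithm of Lemma 2; it assumes a non-empty word and starts the search at h = 1. The names `kr_positions` and `repetition_index` are hypothetical.

```python
from itertools import product


def kr_positions(x, w, k, r):
    """1-based positions where x occurs in w with at most k errors per window of size r."""
    m, n = len(x), len(w)

    def ok(start):  # start is 0-based
        seg = w[start:start + m]
        if m < r:
            return sum(a != b for a, b in zip(x, seg)) <= k
        return all(sum(a != b for a, b in zip(x[i:i + r], seg[i:i + r])) <= k
                   for i in range(m - r + 1))

    return [start + 1 for start in range(n - m + 1) if ok(start)]


def repetition_index(w, k, r, alphabet="ab"):
    """Smallest h >= 1 such that every string of length h k_r-occurs at most once in w."""
    for h in range(1, len(w) + 1):
        if all(len(kr_positions("".join(c), w, k, r)) <= 1
               for c in product(alphabet, repeat=h)):
            return h
    return len(w)


print(repetition_index("abaa", 0, 2))   # exact case
print(repetition_index("abaa", 1, 2))   # k/r = 1/2
```
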
26
Algorithm that lets $S_k$ recognize $\mathrm{Suff}(w,k,r)$
  • Algorithm($x$, $r$, $S_k$)
  • Case $|x| \le r = R(w, k, r)$:
  • if $x$ is accepted by $S_k$ then $x \in \mathrm{Suff}(w,k,r)$
  • else $x \notin \mathrm{Suff}(w,k,r)$
  • Case $|x| > r = R(w, k, r)$:
  • let $x'$ be the prefix of $x$ such that $|x'| = r = R(w,
    k, r)$
  • let $q$ be the state reached after reading $x'$
    and $i$ the integer associated with $q$
  • $|w|-i-r+1 = j$ is the unique possible initial
    position of $x$
  • check if $x$ $k_r$-occurs at position $j$ in $w$.

27
Conclusions and open problems
  1. $S_k$ can be useful for approximate indexing.
  2. If Conjecture 2 is true and the constants involved in
    the O-notation are small, our data structure is
    useful for some classical applications of
    approximate indexing.
  3. We think that it is possible to connect $S_k$ with
    $S_{k,r}$, and we conjecture that $|S_{k,r}| = O(|S_k|)$.
  4. We think that it is possible to obtain an online
    algorithm even when dealing with mismatches. It
    would probably be more complex than the classical
    one. How to define it
    remains an open problem.