Title: Vakhitov Alexander Approximate Text Indexing.
1Vakhitov AlexanderApproximate Text Indexing.
- Using simple mathematical arguments the matching
probabilities in the suffix tree are bound and by
a clever division of the search pattern
sub-linear time is achieved - by A Hybrid Method for Approximate
- String Matching G. Navarro, R. Baeza-Yates
2The Task
- The task is to find substrings from the long
text T, approximately matching to our pattern P. - For example, we have text T'adbc' and P'abc'
- (s-starting position of a substring)adbcadbc
insertion of 'd' (s1)dbc(a-gtd) bc
replacement 'a' with 'd' (s2)bc(a)bc
deletion of 'a' (s3)
3Errors
- There are 3 kinds of transformations, which make
errors in initial string insertion, replacement
and deletion. If we transform S with such chnges
to S', then we can transform S' to S with the
same number of changes. - The minimal number of deletions, insertions and
replacements, needed to transform string A to
string B is called edit distance between A and B
(ed(A,B)). - Example ed('abc','adbc')ed('dbc','abc')ed('bc'
,'abc')1ed('survey','surgery')2
replace 'v' with 'g'
insert 'r'
surgery
surgey
survey
4The resulting algorithm
- The algorithm solves the approximate string
matching problem in O(n?log n) time (n is the
size of text T, ??(0,1)), if - the error level , where
- ? is the size of the alphabet,
- e2.718..,
- ?k/m,
- k is the number of errors,
- m is the size of the pattern P.
5Plan of the report
- Some useful ideas basic algorithms
- The main algorithm
- Analysis of the complexity of the algorithm in
different cases
6Dividing the pattern
- Lemma There are strings A and B, ed(A,B)?k, and
we divide A into j substrings (Ai). Then at least
one of the (Ai) appear in B with at most ?k/j?
errors. - We need k changes to transform A to B. Each
change transforms one of the Ai, so k changes are
distributed between j substrings gt the average
number of changes is k/j. - Example ed('he likes','they like')3kA1'he
',A2'likes' gt j2ed('he ','they ')2
ed('likes','like')2?k/j?
7Computing edit distance
- There are strings x,y. ed(x,y)?
- xx1x2...xm,yy1y2..yn, xp,yq?S
- Cijed(x1..xi,y1..yj) C0..x,0..y is a matrix,
filled with Cij. - Computing Cij
- C0,jj Ci,0i
- Ci,j if (xiyj) then Ci-1,j-1
- else 1minCi-1,j-1,Ci-1,j,Ci,j-1
- Example x'survey',y'surgery'
... u r v ey
... u r g ery
0
1
2
1
0
1
1
1
2
green means xiltgtyj red means xiyj arrow shows
the element used to sompute Cij
8Edit distance in the case of text T and pattern P
- We need to find a substring in text T which
matches P with minimal number of errors. Let x be
the pattern, and y will be a text. - The matching text substring can begin in every
text position so, we have to initialize C0,j
with 0. The rest is left from the previous task. - The algorithm can store only the last column and
analyze the text incrementally from the
beginning. It goes left and down through the
matrix, filling it with Cij.
9Examples of the matrices
10Construction of the NFA
- Nondeterministic Finite Automaton, which is
searching the text substrings, approximately
matching to the pattern with k errors. It
consists of k rows and m columns. - Transitions
- Pattern and text characters are the same
(horizontal) - Insert characters into pattern (vertical)
- Replace the pattern character with the text one
(solid diagonal) - Delete the pattern character (dashed diagonal)
11Nondeterministic Finite Automaton
This automaton is for approximate string matching
for the pattern 'survey' with 2 errors
12Depth-first search
- DEF k-neighborhood is the set of strings that
match P with at most k errors - Uk(P)x ? ? ed(x,P)?k
- Searching this strings in the text (without
errors) can solve the problem, but
Uk(P)O(mk?k) is quite large. - We can determine which strings form Uk(P) appear
in the text by traversing the text suffix tree.
Here we can use the Ukt(P) set. Ukt(P) is a set
of neighborhood elements which are not prefixes
of others.
13Algorithm for searching on the suffix tree
- Starts from the root
- Considers the string x incrementally
- Determines when ed(P,x)?k
- Determines when ed(P,xy)gtk for any y
14Algorithm for searching on the suffix tree
- Each new character of x corresponds to a new
column in the matrix (adding s?? to x ltgt
updating column in O(m) time). - A match is detected when the last element of the
column is ? k - x cannot be extended to match P when all the
values of the last column are gt k
15Algorithm for searching on the suffix tree
(illustration)
16Partitioning the pattern
- The cost of the suffix tree search is exponential
in m and k, so it's better to perform j searches
of patterns of length m/j and k/j errors that's
why we divide patterns. - So, we divide our pattern into j pieces and
search them using the above algorithm. Then, for
each match found ending at text position i we
check the text area i-m-k..imk - But the larger j, the more text positions need to
be verified, and the optimal j will be found soon.
17Searching pieces of the pattern
- Let's use NFA with depth-first search (DFS)
technique (the suffixes from the suffix tree will
be the input of the automaton) - At first, we'll transform our NFA
- Initial self-loop isn't needed (it allowed us
earlier to start matching from every position of
the text) - We remove the low-left triangle of our automaton,
because we avoid initial insertions to the
pattern - We can start matching only with k1 first pattern
characters
18The changes to NFA
19Using suffix array instead of suffix tree
- The suffix array can replace the suffix tree in
our algorithm. It has less space requirements,
but the time complexity should be multiplied by
log n. - Suffix array replaces nodes with intervals and
traversing to the node is going to the interval.
If there is a node and it's children, then the
node interval contains children intervals.
20Analysis for the algorithm the average number of
nodes at level l
- For a small l, all the text suffixes (except the
last l) are longer than l, so nearly n suffixes
reach level l - The maximum number of nodes in the level l is ?l,
where ?? - We use the model of n balls randomly thrown into
?l urns. The average number of filled urns is - ?l(1-(1-1/?l)n)?l(1-e-?(n/?l))?(minn,?l)
21Probability of processing a given node at depth l
in the suffix tree.
- If l?m', at least l-k text characters must match
the pattern (m is the pattern size), and - if l?m', at least m'-k pattern characters must
match the text. We sum all the probabilities for
different pattern prefixes
The largest term of the 1st sum is the first one
and by using Stirling's approximation we have
22Probability of processing a given node at depth l
in the suffix tree.
?(?)lO(1/l) , where
, ?k/l
The whole first summation is bounded by l-k times
the last term, so we get (l-k)?(?)lO(1/l)O(?(?)l)
.The first summation exponentially decreases if
?(?)lt1. It means that
gte2/(1- ?)2
(because e-1lt ? ? /(1- ?) if ??0,1)
23Probability of processing a given node at depth l
in the suffix tree.
The second summation can be also bounded by this
O(?(?)l). So the upper bound for the probability
of processing a givennode at depth l in the
suffix tree is O(?(?)l).
In practice, e should be replaced by c1.09 (it
was defined experimentally), because we have only
founded the upper bound of the probability.
24Analysis of the single pattern search in the
suffix tree
- Using the formulas bounding the probability of
matching, let's consider that in levels l
all the nodes are visited, while nodes at level
lgtL(k) are visited with probability O(?(k/l)l).
Remember that the average number of visited nodes
at the level l (for small l) is ?(minn,?l).
25Three cases of analysis
26The cases of analysis
- (a) L(k) ??log?n, n ???L(k) small n online
search preferable, no index needed (since the
total work is n) - (b) mk lt log?n, n gt ?mklarge n the total
cost - independent on n(?k/l)
- (c) L(k) ??log?n ??mk intermediate
n,sublinear of n time.
27Analysis of pattern partitioning
- We need to perform j searches and then verify all
the possible matches. We also determine three
cases according to previous slide - (a) ??j log?n, n ???L(k/j),
complexity O(n) - (b) mk lt j log?n, n gt ?(mk)/j , if error level
- the complexity is O(n1-log????????????) -
sublinear of n (using - j(mk) / log?n)
(c) with the same as in (b) error level, using
the same j, we also get sublinear complexity.
28Other types of algorithms
- Limited depth-first search technique determines
viable prefixes (the prefixes of the possible
pattern matches) and searches for them in the
suffix tree (it is expensive and it cannot be
implemented on the suffix array) - Filtering discard large parts of the text
checking for a necessary condition (simpler than
the matching condition). Most existing filters
are based on finding substrings of the pattern
without errors, and with big error level they
can't work.
29Summary Conclusions
- The splitting technique balances between
traversing too many nodes of the suffix tree and
verifying too many text positions - The resulting index has sublinear retrieval time
(O(n?)), 0lt?lt1) if the error level is moderate - In future there can appear more exact algorithms
to determine the correct number of pieces in
which the pattern is divided and there are (and
may appear in future) some better algorithms for
verifying after matching a piece of pattern.