Title: A new matching algorithm based on prime numbers
- N. D. Atreas and C. Karanikas
- Department of Informatics
- Aristotle University of Thessaloniki
Exact matching: find all the occurrences of a pattern within a text.
- 1. The Brute Force algorithm performs character-by-character comparison in O(NM) time, where M is the length of the pattern and N is the length of the text.
- 2. The Knuth-Morris-Pratt algorithm runs in O(N + M) time, avoiding unnecessary re-examination of previously matched characters.
- 3. The Boyer-Moore algorithm involves character-by-character comparison using backward checking (the pattern is compared from right to left). Best case O(N/M), worst case O(N) with linear-time refinements such as the Galil rule (O(NM) for the basic algorithm).
- 4. The Karp-Rabin algorithm is a randomised algorithm that seeks a pattern within a text by using hashing. Expected running time O(N + M), worst case O(NM).
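As an illustration of how re-examination of matched characters can be avoided, here is a minimal Python sketch of Knuth-Morris-Pratt. The function name `kmp_search` and the `failure` table name are illustrative choices, not taken from the slides.

```python
def kmp_search(text, pattern):
    """Return the start indices of all occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []

    # failure[i] = length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    failure = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k

    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # On a mismatch, fall back in the pattern; the text index never moves back.
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == m:
            matches.append(i - m + 1)
            k = failure[k - 1]
    return matches
```

Each text character is read once, and the total number of fallbacks is bounded by the number of advances, which gives the O(N + M) bound.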
A hash function must be
- efficiently computable
- highly discriminating for strings
- incrementally updatable: hash(x(j+1 ... j+M)) must be easily computable from hash(x(j ... j+M-1)) and x(j+M).
- A hash function is not injective, i.e. the equality of two hash values suggests, but does not guarantee, equality of the inputs.
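The requirements above can be illustrated with the classical Karp-Rabin polynomial hash. This is a sketch of that standard technique, not of the hash function proposed in this talk; the names `rabin_karp`, `base` and `modulus` and their default values are assumptions chosen for the example.

```python
def rabin_karp(text, pattern, base=256, modulus=1_000_000_007):
    """Return the start indices of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    high = pow(base, m - 1, modulus)  # weight of the character leaving the window
    hp = hw = 0
    for k in range(m):
        hp = (hp * base + ord(pattern[k])) % modulus
        hw = (hw * base + ord(text[k])) % modulus

    matches = []
    for j in range(n - m + 1):
        # Equal hash values only suggest a match, so confirm it directly.
        if hw == hp and text[j:j + m] == pattern:
            matches.append(j)
        if j + m < n:
            # Rolling update: drop text[j], append text[j + m], in O(1).
            hw = ((hw - ord(text[j]) * high) * base + ord(text[j + m])) % modulus
    return matches
```

The rolling update is what makes the hash of the next window computable from the hash of the previous one plus a single new character.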
Let x = {x(1), ..., x(N)} be a set of positive integers and p(1) < ... < p(N) be primes such that p(1) > max{x(i), i = 1, ..., N}. We define the transform T(x) = T(x(1), ..., x(N)).
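A form of the transform that is consistent with the recursive system stated among the properties of T, where c(1) = T(x) p(1) ... p(N), is given below. It is inferred by unwinding that recursion from c(N) = x(N), so it should be read as an assumption rather than as the authors' stated definition.

```latex
T\bigl(x(1),\dots,x(N)\bigr) \;=\; \sum_{i=1}^{N} \frac{x(i)}{p(i)},
\qquad p(1) < \dots < p(N) \text{ prime},
\qquad p(1) > \max_{1 \le i \le N} x(i).
```

Under this form, multiplying by p(1) ... p(N) gives the integer c(1) = \sum_i x(i) \prod_{j \ne i} p(j), which matches the first equation of the recursion.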
Properties of T(x(1), ..., x(N))
- T(x(1), ..., x(N)) is one-to-one.
- x(1), ..., x(N) can be recovered from T(x) as the unique solution of a system of N linear Diophantine equations defined recursively:
- (p(i+1) ... p(N)) x(i) + p(i) c(i+1) = c(i),
- where c(1) = T(x) p(1) ... p(N).
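The recovery procedure can be sketched in Python as follows. The sketch assumes the form T(x) = x(1)/p(1) + ... + x(N)/p(N), which is inferred from the recursion rather than quoted from the slides, and solves each Diophantine equation by reducing it modulo p(i). The names `transform` and `recover` are hypothetical; Python 3.8 or later is needed for `math.prod` and modular inverses via `pow`.

```python
from fractions import Fraction
from math import prod


def transform(x, primes):
    # Assumed form T(x) = sum of x(i) / p(i), kept exact with Fraction.
    return sum(Fraction(xi, pi) for xi, pi in zip(x, primes))


def recover(t, primes):
    """Recover x(1), ..., x(N) from T(x) using
    (p(i+1) ... p(N)) x(i) + p(i) c(i+1) = c(i)."""
    c = t * prod(primes)              # c(1) = T(x) p(1) ... p(N)
    assert c.denominator == 1         # c(1) is an integer
    c = c.numerator
    x = []
    for i, p in enumerate(primes):
        tail = prod(primes[i + 1:])   # p(i+1) ... p(N); the empty product is 1
        # Modulo p(i) the equation reads tail * x(i) = c(i), and 0 < x(i) < p(i)
        # makes the solution unique.
        xi = (c * pow(tail, -1, p)) % p
        x.append(xi)
        c = (c - tail * xi) // p      # c(i+1)
    return x


primes = [11, 13, 17]
t = transform([3, 7, 10], primes)
print(recover(t, primes))  # [3, 7, 10]
```

The condition p(1) > max x(i) is what guarantees that each residue modulo p(i) identifies x(i) exactly.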
Properties of T(x(1), ..., x(N)) (continued)
- T(x) can be used as a measure of similarity between two strings, since it can be used for counting the different elements between them.
- It provides a necessary and sufficient condition to detect whenever a binding operation on strings can be implemented.
- It is not a hash function.
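One way the counting of differing elements could work is sketched below. It is not taken from the slides: it assumes the form T(x) = x(1)/p(1) + ... + x(N)/p(N), assumes both strings have the same length and share the same primes, and uses the hypothetical names `transform` and `count_differences`.

```python
from fractions import Fraction
from math import prod


def transform(x, primes):
    # Assumed form T(x) = sum of x(i) / p(i), kept exact with Fraction.
    return sum(Fraction(xi, pi) for xi, pi in zip(x, primes))


def count_differences(tx, ty, primes):
    """Count positions i with x(i) != y(i), given only T(x), T(y) and the primes."""
    # (T(x) - T(y)) * p(1) ... p(N) is the integer
    # sum over i of (x(i) - y(i)) * product of the other primes.
    d = ((tx - ty) * prod(primes)).numerator
    # Modulo p(i) every term except the i-th vanishes, and the i-th term is
    # nonzero exactly when x(i) != y(i), because |x(i) - y(i)| < p(i).
    return sum(1 for p in primes if d % p != 0)


primes = [11, 13, 17]
tx = transform([3, 7, 10], primes)
ty = transform([3, 2, 10], primes)
print(count_differences(tx, ty, primes))  # 1
```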
Modelling a hash function approximating T.
Definition of the hash function
Final form of hash function
Software implementation
- Let X = {x(1), ..., x(N)} be the text and Y = {y(1), ..., y(M)} be the pattern.
- Compute T(y(1), ..., y(M)) and T(x(1), ..., x(M)) in O(M) time.
- Compute the hash values in O(N-M) time.
Software implementation (continued)
- If the hash value of the pattern agrees with that of the text window for some i, then x(i+1), ..., x(i+M-1) is a candidate for string matching.
- For all candidates, perform at most p character comparisons (p is the size of the alphabet) to throw out false matches.
- The algorithm runs in O(N) time.
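The overall flow can be sketched with a reference matcher that uses the exact transform instead of the hash approximation, whose definition is not reproduced here. It assumes the form T(x) = x(1)/p(1) + ... + x(M)/p(M), recomputes T for every window, and therefore costs O(NM) rather than the O(N) claimed for the hashed version. All function names are hypothetical.

```python
from fractions import Fraction
from math import isqrt


def primes_above(bound, count):
    """Return `count` consecutive primes greater than `bound` (trial division)."""
    primes, n = [], max(bound + 1, 2)
    while len(primes) < count:
        if all(n % d for d in range(2, isqrt(n) + 1)):
            primes.append(n)
        n += 1
    return primes


def transform(x, primes):
    # Assumed form T(x) = sum of x(i) / p(i), kept exact with Fraction.
    return sum(Fraction(xi, pi) for xi, pi in zip(x, primes))


def match_with_exact_transform(text, pattern):
    """Return the start indices of all occurrences of pattern in text."""
    codes_t = [ord(ch) for ch in text]
    codes_p = [ord(ch) for ch in pattern]
    n, m = len(codes_t), len(codes_p)
    if m == 0 or m > n:
        return []
    # Primes larger than every character code, so that T is one-to-one.
    primes = primes_above(max(codes_t + codes_p), m)
    target = transform(codes_p, primes)
    # Because T is one-to-one, equal transforms mean an exact match here;
    # the verification step is only needed once T is replaced by a hash.
    return [i for i in range(n - m + 1)
            if transform(codes_t[i:i + m], primes) == target]


print(match_with_exact_transform("abracadabra", "abra"))  # [0, 7]
```

Recomputing the transform for each window is the cost that a hash function approximating T is meant to remove.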
Conclusions
- We introduce the idea of a hash function approximation in order to reduce the computational complexity of an algorithm.
- Although the time bounds are the same as, or in some cases inferior to, those of the Boyer-Moore algorithm, our algorithm is superior for multiple matching problems.