Title: Efficient String Matching: An Aid to Bibliographic Search
1Efficient String Matching: An Aid to Bibliographic Search
- Alfred V. Aho and Margaret J. Corasick
- Bell Laboratories
2Virus Definition
- Each virus has its own distinctive signature
- Example in ClamAV
- _0017_0001_000=21b8004233c999cd218bd6b90300b440cd218b4c198b541bb80157cd21b43ecd2132ed
- _0017_0001_000: virus index
- Hex(21) = Dec(33) = !
- Match the signature to detect the virus
3Regular Expression
- Use RE to describe the signature
- ?: can be any one char
- W32.Hybris.C (Clam)=4000?????????????83??????75f2e9????ffff00000000
- *: can be any chars (including no char)
- Oror-fam (Clam)=4952435669727553455859330f54554b617a61536e617073686f
- {n1-n2}: there are between n1 and n2 chars between the two parts
- Worm.Bagle.AG-empty (Clam)=6e74656e742d547970653a206170706c69636174696f6e2f6f637465742d73747265616d3b{40-130}2d2d2d2d2d2d2d2d
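The wildcard notation above can be illustrated by translating a signature into an ordinary regular expression over bytes. This is a minimal sketch, not ClamAV's own matcher: the helper name `sig_to_regex` is hypothetical, it assumes each `??` pair stands for one arbitrary byte, and it does not handle a single `?` covering half a byte.

```python
import re

def sig_to_regex(sig):
    """Translate a hex signature with ??, * and {n1-n2} into a bytes regex."""
    parts = []
    i = 0
    while i < len(sig):
        if sig[i] == "*":                  # any number of bytes, including none
            parts.append(b".*?")
            i += 1
        elif sig[i] == "{":                # {n1-n2}: between n1 and n2 bytes
            close = sig.index("}", i)
            lo, hi = sig[i + 1:close].split("-")
            parts.append(b".{%d,%d}" % (int(lo), int(hi)))
            i = close + 1
        elif sig[i:i + 2] == "??":         # exactly one arbitrary byte
            parts.append(b".")
            i += 2
        else:                              # literal byte written as two hex digits
            parts.append(re.escape(bytes.fromhex(sig[i:i + 2])))
            i += 2
    # DOTALL lets "." match every byte value, including newline.
    return re.compile(b"".join(parts), re.DOTALL)

pattern = sig_to_regex("3b{2-4}2d2d")      # ";" then 2 to 4 bytes then "--"
print(pattern.search(b"x;abc--") is not None)  # True
```

A regex engine handles one signature at a time; the rest of the slides describe how to match many fixed keywords in a single pass.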
4Introduction
- Locate all occurrences of any of a finite number of keywords in a string of text.
- Consists of two parts
- constructing a finite state pattern matching machine from the keywords
- using the pattern matching machine to process the text string in a single pass.
5Pattern Matching Machine(1)
- Our problem is to locate and identify all substrings of x which are keywords in K.
- K: K = {y1, y2, ..., yk} is a finite set of strings which we shall call keywords.
- x: x is an arbitrary string which we shall call the text string.
- The behavior of the pattern matching machine is dictated by three functions: a goto function g, a failure function f, and an output function output.
6Pattern Matching Machine(2)
- g(s,a) = s' or fail: maps a pair consisting of a state and an input symbol into a state or the message fail.
- f(s) = s': maps a state into a state, and is consulted whenever the goto function reports fail.
- output(s) = keywords: associates a set of keywords (possibly empty) with every state.
7Pattern Matching Machine Example with keywords
he,she,his,hers
9- Start state is state 0.
- Let s be the current state and a the current symbol of the input string x.
- Operating cycle
- If g(s,a) = s', the machine makes a goto transition: it enters state s' and the next symbol of x becomes the current input symbol.
- If g(s,a) = fail, the machine makes a failure transition. If f(s) = s', the machine repeats the cycle with s' as the current state and a as the current input symbol.
10Example
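- Text: u s h e r s
- State: 0 0 3 4 5 (2) 8 9
- The state 2 in parentheses is the intermediate state entered by the failure transition made while reading r.
- In state 4, since g(4,e) = 5, the machine enters state 5, finds the keywords she and he ending at position four of the text string, and emits output(5).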
11Example Contd
- In state 5 on input symbol r, the machine makes two state transitions in its operating cycle.
- Since g(5,r) = fail, M enters state 2 = f(5). Then since g(2,r) = 8, M enters state 8 and advances to the next input symbol.
- No output is generated in this operating cycle.
12Algorithm 1. Pattern matching machine.
Input. A text string x = a1 a2 ... an where each ai is an input symbol and a pattern matching machine M with goto function g, failure function f, and output function output, as described above.
Output. Locations at which keywords occur in x.
Method.
    begin
        state ← 0
        for i ← 1 until n do
            begin
                while g(state, ai) = fail do state ← f(state)
                state ← g(state, ai)
                if output(state) ≠ empty then
                    begin
                        print i
                        print output(state)
                    end
            end
    end
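Algorithm 1 can be tried out on the running example with a short Python sketch. The tables below are written out by hand for the keywords he, she, his, hers, using the state numbers from the walk-through; the names `GOTO`, `FAILURE`, `OUTPUT`, `g` and `search` are illustrative choices, and `None` stands in for the message fail.

```python
FAIL = None  # stands for the message "fail"

# Goto, failure and output tables for the keywords he, she, his, hers.
GOTO = {
    0: {"h": 1, "s": 3},
    1: {"e": 2, "i": 6},
    2: {"r": 8},
    3: {"h": 4},
    4: {"e": 5},
    5: {}, 6: {"s": 7}, 7: {}, 8: {"s": 9}, 9: {},
}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OUTPUT = {2: {"he"}, 5: {"she", "he"}, 7: {"his"}, 9: {"hers"}}

def g(state, symbol):
    """Goto function: state 0 loops to itself on symbols that start no keyword."""
    nxt = GOTO[state].get(symbol, FAIL)
    return 0 if nxt is FAIL and state == 0 else nxt

def search(text):
    """Return (position, keywords) pairs, with positions counted from 1."""
    state, found = 0, []
    for i, symbol in enumerate(text, start=1):
        while g(state, symbol) is FAIL:      # failure transitions
            state = FAILURE[state]
        state = g(state, symbol)             # goto transition
        if OUTPUT.get(state):
            found.append((i, sorted(OUTPUT[state])))
    return found

print(search("ushers"))  # [(4, ['he', 'she']), (6, ['hers'])]
```

The inner `while` loop never asks for the failure of state 0, because `g(0, a)` is never fail.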
13Construction the functions
- The construction has two parts
- First: Determine the states and the goto function.
- Second: Compute the failure function.
- Computation of the output function starts in the first part and is completed in the second.
14Construction of Goto function
- Construct a goto graph like the one on the next slide.
- New vertices and edges are added to the graph, starting at the start state.
- Add new edges only when necessary.
- Add a loop from state 0 to state 0 on all input symbols other than those that begin a keyword.
15Construction of Goto function with keywords
he,she,his,hers
16Algorithm 2
Algorithm 2. Construction of the goto function.
Input. Set of keywords K = {y1, y2, ..., yk}.
Output. Goto function g and a partially computed output function output.
Method. We assume output(s) is empty when state s is first created, and g(s, a) = fail if a is undefined or if g(s, a) has not yet been defined. The procedure enter(y) inserts into the goto graph a path that spells out y.
    begin
        newstate ← 0
        for i ← 1 until k do enter(yi)
        for all a such that g(0, a) = fail do g(0, a) ← 0
    end
17 procedure enter(a1 a2 ... am):
    begin
        state ← 0; j ← 1
        while g(state, aj) ≠ fail do
            begin
                state ← g(state, aj)
                j ← j + 1
            end
        for p ← j until m do
            begin
                newstate ← newstate + 1
                g(state, ap) ← newstate
                state ← newstate
            end
        output(state) ← {a1 a2 ... am}
    end
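Algorithm 2 and the procedure enter amount to building a trie. The following Python sketch shows one possible rendering; the name `build_goto` and the list-of-dicts layout are assumptions, and the loop from state 0 to itself is left to lookup time (a missing key in `goto[0]` is read as 0) instead of being stored for every symbol.

```python
def build_goto(keywords):
    """Return (goto, output) for a list of keywords."""
    goto = [{}]        # goto[s] maps a symbol to the next state; a missing key means fail
    output = [set()]   # output[s] is empty when state s is first created

    def enter(word):
        state, j = 0, 0
        # Follow the existing path while it spells out a prefix of word.
        # The bound on j guards against a keyword that is a prefix of an earlier one.
        while j < len(word) and word[j] in goto[state]:
            state = goto[state][word[j]]
            j += 1
        # Create one new state per remaining symbol.
        for symbol in word[j:]:
            goto.append({})
            output.append(set())
            goto[state][symbol] = len(goto) - 1
            state = len(goto) - 1
        output[state].add(word)

    for word in keywords:
        enter(word)
    return goto, output

goto, output = build_goto(["he", "she", "his", "hers"])
print(len(goto), goto[2], output[5])  # 10 {'r': 8} {'she'}
```

Entering the keywords in the order he, she, his, hers gives the state numbers used in the example; at this stage output(5) holds only she, since he is added later with the failure function.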
18Construction of Failure function
- Depth of s: the length of the shortest path from the start state to state s.
- The states of depth d can be determined from the states of depth d-1.
- Make f(s) = 0 for all states s of depth 1.
19Construction of Failure function Contd
- To compute the failure function for the states of depth d, consider each state r of depth d-1:
- 1. If g(r,a) = fail for all a, do nothing.
- 2. Otherwise, for each a such that g(r,a) = s, do the following
- a. Set state = f(r).
- b. Execute state ← f(state) zero or more times, until a value for state is obtained such that g(state,a) ≠ fail.
- c. Set f(s) = g(state,a).
20Algorithm 3
Algorithm 3. Construction of the failure function.
Input. Goto function g and output function output from Algorithm 2.
Output. Failure function f and output function output.
Method.
    begin
        queue ← empty
        for each a such that g(0, a) = s ≠ 0 do
            begin
                queue ← queue ∪ {s}
                f(s) ← 0
            end
21
        while queue ≠ empty do
            begin
                let r be the next state in queue
                queue ← queue - {r}
                for each a such that g(r, a) = s ≠ fail do
                    begin
                        queue ← queue ∪ {s}
                        state ← f(r)
                        while g(state, a) = fail do state ← f(state)
                        f(s) ← g(state, a)
                        output(s) ← output(s) ∪ output(f(s))
                    end
            end
    end
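Algorithm 3 is a breadth-first traversal of the goto graph. As a sketch under stated assumptions, the Python below takes the goto and partial output tables for he, she, his, hers written out by hand (the state numbers follow the example) and fills in the failure function; `build_failure` is an illustrative name, not one from the paper.

```python
from collections import deque

def build_failure(goto, output):
    """Return the failure table and complete the output sets in place."""
    def g(state, symbol):
        # A missing entry means fail, except that state 0 loops to itself.
        return goto[state].get(symbol, 0 if state == 0 else None)

    failure = {}
    queue = deque()
    for s in goto[0].values():           # states of depth 1
        failure[s] = 0
        queue.append(s)
    while queue:
        r = queue.popleft()
        for a, s in goto[r].items():     # each a with g(r, a) = s, not fail
            queue.append(s)
            state = failure[r]
            while g(state, a) is None:
                state = failure[state]
            failure[s] = g(state, a)
            output[s] |= output[failure[s]]
    return failure

# Goto and partial output tables for he, she, his, hers, written out by hand.
goto = {0: {"h": 1, "s": 3}, 1: {"e": 2, "i": 6}, 2: {"r": 8}, 3: {"h": 4},
        4: {"e": 5}, 5: {}, 6: {"s": 7}, 7: {}, 8: {"s": 9}, 9: {}}
output = {s: set() for s in goto}
output[2], output[5], output[7], output[9] = {"he"}, {"she"}, {"his"}, {"hers"}

failure = build_failure(goto, output)
print(failure[4], failure[5], sorted(output[5]))  # 1 2 ['he', 'she']
```

Using a queue guarantees that f(r) is known for every state shallower than s before f(s) is computed.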
22About construction
- When we determine f(s) = s', we merge the outputs of state s with the outputs of state s'.
- In fact, if the keyword his were not present, then the machine could go directly from state 4 to state 0, skipping an unnecessary intermediate transition to state 1.
- To avoid such unnecessary transitions, we can use a deterministic finite automaton, which is discussed later.
23Properties of Algorithms 1,2,3
- Lemma 1: Suppose that in the goto graph state s is represented by the string u and state t is represented by the string v. Then f(s) = t iff v is the longest proper suffix of u that is also a prefix of some keyword.
- Proof
- Suppose u = a1 a2 ... aj, and a1 a2 ... aj-1 represents state r. Let r1, r2, ..., rn be the sequence of states such that 1. r1 = f(r); 2. ri+1 = f(ri); 3. g(ri, aj) = fail for 1 ≤ i < n; 4. g(rn, aj) = t.
- Suppose vi represents state ri. Then v1 is the longest proper suffix of a1 a2 ... aj-1 that is a prefix of some keyword, v2 is the longest proper suffix of v1 that is a prefix of some keyword, and so on.
- Thus vn is the longest proper suffix of a1 a2 ... aj-1 such that vn aj is a prefix of some keyword.
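Lemma 1 gives a direct, if slow, way to compute the failure function without a queue: for the string u of each state, take the longest proper suffix that is a prefix of some keyword. The sketch below identifies each state with its string, with the empty string as the start state; the helper name `failure_by_lemma` is hypothetical.

```python
def failure_by_lemma(keywords):
    """Map each state's string u to the string v of f(s), following Lemma 1."""
    # Every state is represented by a prefix of some keyword; "" is the start state.
    prefixes = {word[:i] for word in keywords for i in range(len(word) + 1)}
    table = {}
    for u in prefixes:
        if u:  # the failure function is not defined for the start state
            # Try proper suffixes from longest to shortest; "" always qualifies.
            table[u] = next(u[k:] for k in range(1, len(u) + 1) if u[k:] in prefixes)
    return table

table = failure_by_lemma(["he", "she", "his", "hers"])
print(table["she"], table["sh"], table["hers"])  # he h s
```

In the numbering of the example these three entries read f(5) = 2, f(4) = 1 and f(9) = 3.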
24Properties of Algorithms 1,2,3
- Lemma 2: The set output(s) contains y if and only if y is a keyword that is a suffix of the string representing state s.
- Proof
- Consider a string y in output(s), where u is the string representing state s.
- If y is added to output(s) by Algorithm 2, then y = u and y is a keyword.
- If y is added to output(s) by Algorithm 3, then y is in output(f(s)). Conversely, if y is a keyword that is a proper suffix of u, then from the inductive hypothesis and Lemma 1 we know output(f(s)) contains y.
25Properties of Algorithms 1,2,3
- Lemma 3: After the jth operating cycle, Algorithm 1 will be in state s iff s is represented by the longest suffix of a1 a2 ... aj that is a prefix of some keyword.
- Proof: Similar to Lemma 1.
- THEOREM 1: Algorithms 2 and 3 produce valid goto, failure, and output functions.
- Proof: By Lemmas 2 and 3.
26Time Complexity of Algorithms 1, 2, and 3
- THEOREM 2: Using the goto, failure and output functions created by Algorithms 2 and 3, Algorithm 1 makes fewer than 2n state transitions in processing a text string of length n.
- From a state s of depth d, Algorithm 1 makes at most d failure transitions in one operating cycle.
- The total number of failure transitions must be at least one less than the total number of goto transitions.
- In processing an input of length n, Algorithm 1 makes exactly n goto transitions. Therefore the total number of state transitions is less than 2n.
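The counting argument can be written out as follows. This is a sketch under the notation introduced here, which is not the slides' own: d_end is the depth of the state after the whole text (n ≥ 1 symbols) has been read, F is the total number of failure transitions, and G⁺ ≤ n is the number of goto transitions that increase the depth, that is, all goto transitions except the loop from state 0 to state 0.

```latex
\begin{align*}
0 \le d_{\text{end}} &\le G^{+} - F
  && \text{(each goto adds at most 1 to the depth, each failure removes at least 1)} \\
F &\le G^{+} - d_{\text{end}} \le n - 1
  && \text{(the last transition is a goto: either } d_{\text{end}} \ge 1 \text{ or } G^{+} \le n - 1\text{)} \\
n + F &\le 2n - 1 < 2n
\end{align*}
```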
27Time Complexity of Algorithms 1, 2, and 3
- THEOREM 3: Algorithm 2 requires time linearly proportional to the sum of the lengths of the keywords.
- Proof
- Straightforward
- THEOREM 4: Algorithm 3 can be implemented to run in time proportional to the sum of the lengths of the keywords.
- Proof
- The total number of executions of state ← f(state) is bounded by the sum of the lengths of the keywords.
- Using linked lists to represent the output set of a state, we can execute the statement output(s) ← output(s) ∪ output(f(s)) in constant time.
30Eliminating Failure Transitions
- Used in Algorithm 1 in place of the goto and failure functions
- δ(s, a): a next move function δ such that for each state s and input symbol a, δ(s, a) is a state.
- By using the next move function δ, we can dispense with all failure transitions, and make exactly one state transition per input character.
31Algorithm 4. Construction of a deterministic finite automaton.
Input. Goto function g from Algorithm 2 and failure function f from Algorithm 3.
Output. Next move function δ.
Method.
    begin
        queue ← empty
        for each symbol a do
            begin
                δ(0, a) ← g(0, a)
                if g(0, a) ≠ 0 then queue ← queue ∪ {g(0, a)}
            end
        while queue ≠ empty do
            begin
                let r be the next state in queue
                queue ← queue - {r}
                for each symbol a do
                    if g(r, a) = s ≠ fail do
                        begin
                            queue ← queue ∪ {s}
                            δ(r, a) ← s
                        end
                    else δ(r, a) ← δ(f(r), a)
            end
    end
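Algorithm 4 can be sketched in Python as below. The goto and failure tables for he, she, his, hers are written out by hand, and `build_delta` is an illustrative name. One simplifying assumption: only symbols that occur in some keyword are stored, because every other symbol leads to state 0 from any state.

```python
from collections import deque

# Goto and failure tables for he, she, his, hers, written out by hand.
GOTO = {0: {"h": 1, "s": 3}, 1: {"e": 2, "i": 6}, 2: {"r": 8}, 3: {"h": 4},
        4: {"e": 5}, 5: {}, 6: {"s": 7}, 7: {}, 8: {"s": 9}, 9: {}}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}

def build_delta(goto, failure):
    """Return delta, where delta[s][a] is the next state on symbol a.

    Symbols that occur in no keyword are not stored; they always lead to
    state 0, so callers should use delta[s].get(a, 0).
    """
    alphabet = {a for edges in goto.values() for a in edges}
    delta = {0: {a: goto[0].get(a, 0) for a in alphabet}}
    queue = deque(s for s in delta[0].values() if s != 0)
    while queue:
        r = queue.popleft()
        delta[r] = {}
        for a in alphabet:
            if a in goto[r]:
                queue.append(goto[r][a])
                delta[r][a] = goto[r][a]
            else:
                # f(r) is shallower than r, so delta[f(r)] is already filled in.
                delta[r][a] = delta[failure[r]][a]
    return delta

delta = build_delta(GOTO, FAILURE)
state = 0
for symbol in "ushers":
    state = delta[state].get(symbol, 0)   # exactly one transition per symbol
print(state)  # 9
```

On the text ushers the machine now moves from state 5 to state 8 in a single step on r, without visiting state 2.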
32 Fig. 3. Next move function. (The dot stands for any input symbol not listed for that state.)
    state        input symbol: next state
    0            h: 1, s: 3, .: 0
    1            e: 2, i: 6, h: 1, s: 3, .: 0
    9, 7, 3      h: 4, s: 3, .: 0
    5, 2         r: 8, h: 1, s: 3, .: 0
    6            s: 7, h: 1, .: 0
    4            e: 5, i: 6, h: 1, s: 3, .: 0
    8            s: 9, h: 1, .: 0
33Conclusion
- Attractive for large numbers of keywords, since all keywords can be simultaneously matched in one pass.
- Using the next move function
- can potentially reduce state transitions by 50%, but requires more memory.
- In practice, the machine spends most of its time in state 0, from which there are no failure transitions.