Efficient String Matching: An Aid to Bibliographic Search

1
Efficient String Matching: An Aid to
Bibliographic Search
  • Alfred V. Aho and Margaret J. Corasick
  • Bell Laboratories

2
Virus Definition
  • Each virus has its peculiar signature
  • Example in ClamAV
  • _0017_0001_000=21b8004233c999cd218bd6b90300b440cd218b4c198b541bb80157cd21b43ecd2132ed
  • _0017_0001_000: virus index
  • Hex(21) = Dec(33) = '!'
  • Match the signature to detect the virus

3
Regular Expression
  • Use a regular expression (RE) to describe the signature
  • ?: can be any one char
  • W32.Hybris.C (Clam)=4000?????????????83??????75f2e9????ffff00000000
  • *: can be any chars (including no char)
  • Oror-fam (Clam)=4952435669727553455859330f54554b617a61536e617073686f
  • {n1-n2}: there are between n1 and n2 chars between the
    two parts
  • Worm.Bagle.AG-empty (Clam)=6e74656e742d547970653a206170706c69636174696f6e2f6f637465742d73747265616d3b{40-130}2d2d2d2d2d2d2d2d
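
The wildcard notation on this slide can be tried out by translating a signature into a byte-level regular expression. The following is only a sketch: the function name `signature_to_regex` is hypothetical, and it assumes that a pair `??` stands for one arbitrary byte, `*` for any number of bytes, and `{n1-n2}` for between n1 and n2 bytes. Half-byte wildcards and other signature features are not handled.

```python
import re


def signature_to_regex(sig):
    """Translate a hex signature with wildcards into a compiled bytes regex.

    Assumed notation (not a complete signature parser):
      ??       one arbitrary byte
      *        any number of bytes, including none
      {n1-n2}  between n1 and n2 arbitrary bytes
    """
    parts = []
    i = 0
    while i < len(sig):
        if sig[i] == "*":
            parts.append(b".*")
            i += 1
        elif sig[i] == "{":
            close = sig.index("}", i)
            low, high = sig[i + 1:close].split("-")
            parts.append(b".{%d,%d}" % (int(low), int(high)))
            i = close + 1
        elif sig[i:i + 2] == "??":
            parts.append(b".")
            i += 2
        else:
            # Two hex digits encode one literal byte.
            parts.append(re.escape(bytes.fromhex(sig[i:i + 2])))
            i += 2
    # DOTALL lets "." match every byte value, including newline.
    return re.compile(b"".join(parts), re.DOTALL)


pattern = signature_to_regex("6162??63{1-2}64*65")
print(bool(pattern.search(b"xxabZcQde")))
```

A regular expression engine handles one signature at a time; the rest of the presentation is about matching many fixed keywords in a single pass.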

4
Introduction
  • Locate all occurrences of any of a finite number
    of keywords in a string of text.
  • The algorithm consists of two parts:
  • constructing a finite state pattern matching
    machine from the keywords
  • using the pattern matching machine to process the
    text string in a single pass.

5
Pattern Matching Machine(1)
  • Our problem is to locate and identify all
    substrings of x which are keywords in K.
  • K: K = {y1, y2, ..., yk} is a finite set of strings
    which we shall call keywords.
  • x: x is an arbitrary string which we shall call
    the text string.
  • The behavior of the pattern matching machine is
    dictated by three functions: a goto function g, a
    failure function f, and an output function output.

6
Pattern Matching Machine(2)
  • g(s, a) = s' or fail: maps a pair consisting of a
    state and an input symbol into a state or the
    message fail.
  • f(s) = s': maps a state into a state, and is
    consulted whenever the goto function reports
    fail.
  • output(s) = keywords: associates a set of
    keywords (possibly empty) with every state.

7
Pattern Matching Machine Example with keywords
{he, she, his, hers}
8
(No Transcript)
9
  • Start state is state 0.
  • Let s be the current state and a the current
    symbol of the input string x.
  • Operating cycle:
  • If g(s, a) = s', the machine makes a goto transition:
    it enters state s' and the next symbol of x becomes
    the current input symbol.
  • If g(s, a) = fail, the machine makes a failure
    transition. If f(s) = s', the machine repeats the
    cycle with s' as the current state and a as the
    current input symbol.

10
Example
  • Text:      u  s  h  e  r  s
  • State:  0  0  3  4  5  8  9
  •                         2
  • In state 4, since g(4, e) = 5, the machine
    enters state 5, finds the keywords she and he
    ending at position four of the text string, and
    emits output(5).

11
Example Contd
  • In state 5 on input symbol r, the machine makes
    two state transitions in its operating cycle.
  • Since g(5, r) = fail, M enters state 2 = f(5). Then,
    since g(2, r) = 8, M enters state 8 and advances to
    the next input symbol.
  • No output is generated in this operating cycle.

12
Algorithm 1. Pattern matching machine.
Input. A text string x = a1 a2 ... an where each ai is an
input symbol and a pattern matching machine M
with goto function g, failure function f, and
output function output, as described above.
Output. Locations at which keywords occur in x.
Method.
    begin
        state ← 0
        for i ← 1 until n do
            begin
                while g(state, ai) = fail do state ← f(state)
                state ← g(state, ai)
                if output(state) ≠ empty then
                    begin
                        print i
                        print output(state)
                    end
            end
    end
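
Algorithm 1 can be tried on the running example with a small Python sketch. The tables below are hand-coded for the keywords {he, she, his, hers} using the state numbers from the slides. The names `GOTO`, `FAILURE`, `OUTPUT`, `g`, and `search` are illustrative, and representing "fail" by `None` is an assumption of this sketch.

```python
# Hand-coded machine for the keywords {he, she, his, hers}.
# A missing (state, symbol) pair means the goto function reports fail.
GOTO = {
    (0, "h"): 1, (0, "s"): 3,
    (1, "e"): 2, (1, "i"): 6,
    (2, "r"): 8,
    (3, "h"): 4,
    (4, "e"): 5,
    (6, "s"): 7,
    (8, "s"): 9,
}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OUTPUT = {2: {"he"}, 5: {"she", "he"}, 7: {"his"}, 9: {"hers"}}


def g(state, symbol):
    """Goto function; state 0 loops to itself instead of failing."""
    if (state, symbol) in GOTO:
        return GOTO[(state, symbol)]
    return 0 if state == 0 else None  # None plays the role of fail


def search(text):
    """Return (end position, keywords) pairs; positions are 1-based."""
    state = 0
    found = []
    for i, symbol in enumerate(text, start=1):
        while g(state, symbol) is None:   # failure transitions
            state = FAILURE[state]
        state = g(state, symbol)          # goto transition
        if state in OUTPUT:
            found.append((i, OUTPUT[state]))
    return found


print(search("ushers"))
```

On the text ushers this reports she and he ending at position 4 and hers ending at position 6, in line with the example on slides 10 and 11.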
13
Construction of the functions
  • There are two parts to the construction.
  • First: determine the states and the goto function.
  • Second: compute the failure function.
  • Computation of the output function begins in the
    first part and is completed in the second.

14
Construction of Goto function
  • Construct a goto graph like the one on the next slide.
  • Add new vertices and edges to the graph, starting at
    the start state.
  • Add new edges only when necessary.
  • Add a loop from state 0 to state 0 on all input
    symbols other than those that begin a keyword.

15
Construction of Goto function with keywords
{he, she, his, hers}
16
Algorithm 2
Algorithm 2. Construction of the goto function.
Input. Set of keywords K = {y1, y2, ..., yk}.
Output. Goto function g and a partially computed output
function output.
Method. We assume output(s) is empty when state s is first
created, and g(s, a) = fail if a is undefined or if g(s, a)
has not yet been defined. The procedure enter(y) inserts
into the goto graph a path that spells out y.
    begin
        newstate ← 0
        for i ← 1 until k do enter(yi)
        for all a such that g(0, a) = fail do g(0, a) ← 0
    end
17
    procedure enter(a1 a2 ... am):
    begin
        state ← 0; j ← 1
        while g(state, aj) ≠ fail do
            begin
                state ← g(state, aj)
                j ← j + 1
            end
        for p ← j until m do
            begin
                newstate ← newstate + 1
                g(state, ap) ← newstate
                state ← newstate
            end
        output(state) ← {a1 a2 ... am}
    end
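
A Python sketch of Algorithm 2 and the procedure enter follows. The names `build_goto` and `g` are illustrative. Instead of storing a loop on state 0 for every symbol of the alphabet, this sketch lets the lookup function return 0 for any symbol that leaves state 0 undefined; that is a design choice of the sketch, not something stated on the slides.

```python
def build_goto(keywords):
    """Return (goto, output) for the given keywords.

    goto maps (state, symbol) to a state; a missing key means fail.
    output maps a state to the set of keywords entered at that state.
    """
    goto = {}
    output = {}
    newstate = 0

    def enter(word):
        nonlocal newstate
        state, j = 0, 0
        # Follow the existing path while it spells a prefix of word.
        while j < len(word) and (state, word[j]) in goto:
            state = goto[(state, word[j])]
            j += 1
        # Create new states only for the remaining symbols.
        for symbol in word[j:]:
            newstate += 1
            goto[(state, symbol)] = newstate
            state = newstate
        output.setdefault(state, set()).add(word)

    for word in keywords:
        enter(word)
    return goto, output


def g(goto, state, symbol):
    """Goto lookup with the loop on state 0; None stands for fail."""
    if (state, symbol) in goto:
        return goto[(state, symbol)]
    return 0 if state == 0 else None


goto, output = build_goto(["he", "she", "his", "hers"])
print(sorted(goto.items()))
print(output)
```

Entering the keywords in the order he, she, his, hers reproduces the state numbers 0 to 9 used in the example.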
18
Construction of Failure function
  • Depth of s: the length of the shortest path from
    the start state to state s.
  • The states of depth d can be determined from the
    states of depth d-1.
  • Make f(s) = 0 for all states s of depth 1.

19
Construction of Failure function Contd
  • To compute the failure function for the states of
    depth d, consider each state r of depth d-1:
  • 1. If g(r, a) = fail for all a, do nothing.
  • 2. Otherwise, for each a such that g(r, a) = s, do
    the following:
  • a. Set state = f(r).
  • b. Execute state ← f(state) zero or more times,
    until a value for state is obtained such that
    g(state, a) ≠ fail.
  • c. Set f(s) = g(state, a).

20
Algorithm 3
Algorithm 3. Construction of the failure function.
Input. Goto function g and output function output from
Algorithm 2.
Output. Failure function f and output function output.
Method.
    begin
        queue ← empty
        for each a such that g(0, a) = s ≠ 0 do
            begin
                queue ← queue ∪ {s}
                f(s) ← 0
            end
21
        while queue ≠ empty do
            begin
                let r be the next state in queue
                queue ← queue - {r}
                for each a such that g(r, a) = s ≠ fail do
                    begin
                        queue ← queue ∪ {s}
                        state ← f(r)
                        while g(state, a) = fail do state ← f(state)
                        f(s) ← g(state, a)
                        output(s) ← output(s) ∪ output(f(s))
                    end
            end
    end
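
The breadth-first computation in Algorithm 3 can be sketched in Python as follows. The goto and output tables are hand-coded from the example keywords {he, she, his, hers} as they stand after the goto construction; the names `GOTO`, `PARTIAL_OUTPUT`, and `build_failure` are illustrative, and `None` again stands for fail.

```python
from collections import deque

# Goto function and partial output for {he, she, his, hers}.
GOTO = {
    (0, "h"): 1, (0, "s"): 3,
    (1, "e"): 2, (1, "i"): 6,
    (2, "r"): 8,
    (3, "h"): 4,
    (4, "e"): 5,
    (6, "s"): 7,
    (8, "s"): 9,
}
PARTIAL_OUTPUT = {2: {"he"}, 5: {"she"}, 7: {"his"}, 9: {"hers"}}


def build_failure(goto, partial_output):
    """Return (failure, output) computed level by level."""
    output = {s: set(words) for s, words in partial_output.items()}
    children = {}  # state -> list of (symbol, next state)
    for (state, symbol), nxt in goto.items():
        children.setdefault(state, []).append((symbol, nxt))

    def g(state, symbol):
        if (state, symbol) in goto:
            return goto[(state, symbol)]
        return 0 if state == 0 else None  # state 0 never fails

    failure = {}
    queue = deque()
    for _, s in children.get(0, []):      # states of depth 1
        failure[s] = 0
        queue.append(s)
    while queue:
        r = queue.popleft()
        for a, s in children.get(r, []):
            queue.append(s)
            state = failure[r]
            while g(state, a) is None:
                state = failure[state]
            failure[s] = g(state, a)
            # Merge in the keywords recognized at the failure state.
            if failure[s] in output:
                output.setdefault(s, set()).update(output[failure[s]])
    return failure, output


failure, output = build_failure(GOTO, PARTIAL_OUTPUT)
print(failure)
print(output)
```

For the example, state 5 (she) gets failure state 2 (he), so output(5) becomes {she, he}.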
22
About construction
  • When we determine f(s) = s', we merge the outputs
    of state s with the outputs of state s'.
  • In fact, if the keyword his were not present, the
    machine could go directly from state 4 to state 0,
    skipping an unnecessary intermediate transition
    to state 1.
  • To avoid such unnecessary transitions, we can use a
    deterministic finite automaton, which is discussed
    later.

23
Properties of Algorithms 1,2,3
  • Lemma 1: Suppose that in the goto graph state s
    is represented by the string u and state t is
    represented by the string v. Then f(s) = t iff v is
    the longest proper suffix of u that is also a
    prefix of some keyword.
  • Proof:
  • Suppose u = a1a2...aj, and a1a2...aj-1 represents
    state r. Let r1, r2, ..., rn be the sequence of states
    such that 1. r1 = f(r); 2. ri+1 = f(ri);
    3. g(ri, aj) = fail for 1 ≤ i < n; 4. g(rn, aj) = t.
  • Suppose vi represents state ri. Then v1 is the longest
    proper suffix of a1a2...aj-1 that is a prefix of
    some keyword, v2 is the longest proper suffix of
    v1 that is a prefix of some keyword, and so on.
  • Thus vn is the longest proper suffix of a1a2...aj-1
    such that vn aj is a prefix of some keyword.

24
Properties of Algorithms 1,2,3
  • Lemma 2: The set output(s) contains y if and
    only if y is a keyword that is a suffix of the
    string representing state s.
  • Proof:
  • Consider a string y in output(s), and let u be the
    string representing state s.
  • If y is added to output(s) by Algorithm 2, then
    y = u and y is a keyword.
  • If y is added to output(s) by Algorithm 3, then y
    is in output(f(s)). If y is a proper suffix of u,
    then from the inductive hypothesis and Lemma 1 we
    know output(f(s)) contains y.

25
Properties of Algorithms 1,2,3
  • Lemma 3: After the jth operating cycle,
    Algorithm 1 will be in state s iff s is
    represented by the longest suffix of a1a2...aj that
    is a prefix of some keyword.
  • Proof: Similar to Lemma 1.
  • THEOREM 1: Algorithms 2 and 3 produce valid
    goto, failure, and output functions.
  • Proof: By Lemmas 2 and 3.

26
Time Complexity of Algorithms 1, 2, and 3
  • THEOREM 2: Using the goto, failure, and output
    functions created by Algorithms 2 and 3,
    Algorithm 1 makes fewer than 2n state transitions
    in processing a text string of length n.
  • From a state s of depth d, Algorithm 1 makes at most
    d failure transitions in one operating cycle.
  • The total number of failure transitions must be at
    least one less than the total number of goto
    transitions.
  • In processing an input of length n, Algorithm 1 makes
    exactly n goto transitions. Therefore the total
    number of state transitions is less than 2n.
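
The bound in Theorem 2 can be checked empirically on the example machine. The sketch below counts goto and failure transitions separately; the tables are hand-coded for {he, she, his, hers}, and the function name `count_transitions` is illustrative.

```python
# Hand-coded machine for the keywords {he, she, his, hers}.
GOTO = {
    (0, "h"): 1, (0, "s"): 3,
    (1, "e"): 2, (1, "i"): 6,
    (2, "r"): 8,
    (3, "h"): 4,
    (4, "e"): 5,
    (6, "s"): 7,
    (8, "s"): 9,
}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}


def count_transitions(text):
    """Return (goto transitions, failure transitions) for the text."""
    state = 0
    gotos = failures = 0
    for symbol in text:
        # Follow failure transitions until the goto function succeeds.
        while state != 0 and (state, symbol) not in GOTO:
            state = FAILURE[state]
            failures += 1
        state = GOTO.get((state, symbol), 0)  # state 0 loops on other symbols
        gotos += 1
    return gotos, failures


for text in ["ushers", "shishershe", "hhhhssss", "hishishis"]:
    gotos, failures = count_transitions(text)
    print(text, gotos, failures, gotos + failures < 2 * len(text))
```

Each failure transition lowers the depth of the current state by at least one, while each goto transition raises it by at most one, which is why the failure count can never catch up with the goto count.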

27
Time Complexity of Algorithms 1, 2, and 3
  • THEOREM 3: Algorithm 2 requires time linearly
    proportional to the sum of the lengths of the
    keywords.
  • Proof:
  • Straightforward.
  • THEOREM 4: Algorithm 3 can be implemented to
    run in time proportional to the sum of the
    lengths of the keywords.
  • Proof:
  • The total number of executions of state ← f(state)
    is bounded by the sum of the lengths of the
    keywords.
  • Using linked lists to represent the output set of
    a state, we can execute the statement output(s) ←
    output(s) ∪ output(f(s)) in constant time.
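
One way to obtain the constant-time merge is to give each state a single link to the nearest state on its failure chain that has a keyword of its own, so that output(s) is read by walking links rather than by copying sets. This is a sketch of that linked representation, not necessarily the exact structure the slides have in mind; `OWN`, `BFS_ORDER`, `build_output_links`, and `outputs` are hypothetical names, and the tables are hand-coded for {he, she, his, hers}.

```python
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
# Keyword entered at each state by the goto construction.
OWN = {2: "he", 5: "she", 7: "his", 9: "hers"}
# States listed by increasing depth, as a breadth-first queue visits them.
BFS_ORDER = [1, 3, 2, 6, 4, 8, 7, 5, 9]


def build_output_links(failure, own, order):
    """Set one link per state in constant time each."""
    link = {0: None}
    for s in order:
        t = failure[s]
        # Point at f(s) if it has its own keyword, else reuse f(s)'s link.
        link[s] = t if t in own else link[t]
    return link


def outputs(state, own, link):
    """Yield the keywords in output(state) by walking the links."""
    if state in own:
        yield own[state]
    state = link[state]
    while state is not None:
        yield own[state]
        state = link[state]


link = build_output_links(FAILURE, OWN, BFS_ORDER)
print(list(outputs(5, OWN, link)))
```

Because the failure state always has smaller depth, its link is already set when a state is processed, so each merge costs a single assignment.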

30
Eliminating Failure Transitions
  • Used in place of the goto function in Algorithm 1.
  • δ(s, a): a next move function δ such that, for
    each state s and input symbol a, δ(s, a) is a state.
  • By using the next move function δ, we can
    dispense with all failure transitions, and make
    exactly one state transition per input character.

31
Algorithm 4. Construction of a deterministic finite
automaton.
Input. Goto function g from Algorithm 2 and failure
function f from Algorithm 3.
Output. Next move function δ.
Method.
    begin
        queue ← empty
        for each symbol a do
            begin
                δ(0, a) ← g(0, a)
                if g(0, a) ≠ 0 then queue ← queue ∪ {g(0, a)}
            end
        while queue ≠ empty do
            begin
                let r be the next state in queue
                queue ← queue - {r}
                for each symbol a do
                    if g(r, a) = s ≠ fail do
                        begin
                            queue ← queue ∪ {s}
                            δ(r, a) ← s
                        end
                    else δ(r, a) ← δ(f(r), a)
            end
    end
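
Algorithm 4 can be sketched in Python as below, again with hand-coded goto and failure tables for {he, she, his, hers}. The names `build_next_move` and `run` are illustrative. Restricting the table to the symbols that occur in the keywords, and sending every other symbol to state 0, is an assumption of this sketch that keeps the table small.

```python
from collections import deque

GOTO = {
    (0, "h"): 1, (0, "s"): 3,
    (1, "e"): 2, (1, "i"): 6,
    (2, "r"): 8,
    (3, "h"): 4,
    (4, "e"): 5,
    (6, "s"): 7,
    (8, "s"): 9,
}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
ALPHABET = "ehirs"  # symbols that occur in the keywords


def build_next_move(goto, failure, alphabet):
    """Return delta, mapping (state, symbol) to the next state."""
    delta = {}
    queue = deque()
    for a in alphabet:
        delta[(0, a)] = goto.get((0, a), 0)
        if delta[(0, a)] != 0:
            queue.append(delta[(0, a)])
    while queue:
        r = queue.popleft()
        for a in alphabet:
            if (r, a) in goto:
                s = goto[(r, a)]
                queue.append(s)
                delta[(r, a)] = s
            else:
                # f(r) is shallower than r, so its row is already filled.
                delta[(r, a)] = delta[(failure[r], a)]
    return delta


def run(delta, text):
    """Return the state sequence; symbols outside the alphabet go to 0."""
    state = 0
    states = [state]
    for symbol in text:
        state = delta.get((state, symbol), 0)
        states.append(state)
    return states


delta = build_next_move(GOTO, FAILURE, ALPHABET)
print(run(delta, "ushers"))
```

On ushers the automaton moves from state 5 to state 8 in a single step on r, without the intermediate visit to state 2.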
32
Fig. 3. Next move function.
(. stands for any other input symbol)
  state                       input symbol   next state
  state 0                     h              1
                              s              3
                              .              0
  state 1                     e              2
                              i              6
                              h              1
                              s              3
                              .              0
  state 9, state 7, state 3   h              4
                              s              3
                              .              0
  state 5, state 2            r              8
                              h              1
                              s              3
                              .              0
  state 6                     s              7
                              h              1
                              .              0
  state 4                     e              5
                              i              6
                              h              1
                              s              3
                              .              0
  state 8                     s              9
                              h              1
                              .              0
33
Conclusion
  • Attractive for large numbers of keywords, since
    all keywords can be matched simultaneously in one
    pass.
  • Using the next move function
  • can potentially reduce state transitions by 50%,
    but requires more memory.
  • In practice the saving is smaller, since the machine
    spends most of its time in state 0, from which there
    are no failure transitions.

34
(No Transcript)