Title: Finite state subautomata: Application to Electronic Dictionaries
1Finite state subautomata: Application to Electronic Dictionaries
Lamia Tounsi
Polytech'Tours, Computer Science Laboratory, François Rabelais University of Tours, France
Lamia.tounsi_at_univ-tours.fr
2Motivation
- DFSA are widely used in natural language processing
- Find all substructures in a given FSA
- Search for subautomata in a DFSA
- Decompose a very large FSA into smaller ones
- Discover frequently occurring data
- Reduce memory consumption
3Plan
- Mathematical preliminaries
- Automaton
- Subautomaton
- Search for subautomata
- Smallest closed subautomaton
- Smallest subautomaton
- Application to automata representing dictionaries
- Indexation and Compression
- Conclusion
5Automaton
- A deterministic acyclic automaton $A = \langle \Sigma, Q, \delta, q_i, q_f \rangle$
- $\Sigma$ is the alphabet
- $Q$ is the finite set of states
- $\delta$ is the transition function $\delta : Q \times \Sigma \to Q$
- $q_i$ is the initial state ($q_i \in Q$)
- $q_f$ is the final state ($q_f \in Q$)
- Let $a \in \Sigma$ and $w \in \Sigma^*$
- $\delta(p, \varepsilon) = p$
- $\delta(p, wa) = \delta(\delta(p, w), a)$
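The extension of the transition function to words can be sketched in a few lines of Python. This is only an illustration: the dictionary encoding of the transitions, the name `delta_ext` and the toy automaton are assumptions, not part of the slides.

```python
from typing import Dict, Tuple

# Hypothetical encoding: delta maps (state, symbol) -> state.
Delta = Dict[Tuple[str, str], str]

def delta_ext(delta: Delta, p: str, w: str) -> str:
    """Extended transition: delta(p, eps) = p and delta(p, wa) = delta(delta(p, w), a)."""
    for a in w:              # consume w one symbol at a time
        p = delta[(p, a)]    # raises KeyError if the transition is undefined
    return p

# Illustrative acyclic automaton with q_i = "q0" and q_f = "qf"; it accepts "ab" and "cb".
toy = {("q0", "a"): "q1", ("q0", "c"): "q1", ("q1", "b"): "qf"}
print(delta_ext(toy, "q0", ""))    # q0
print(delta_ext(toy, "q0", "ab"))  # qf
```

The loop is the iterative form of the recursive definition: the empty word leaves the state unchanged, and each further symbol applies one transition.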
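6Successors and predecessors
- $\mathrm{Succ}(p) = \{q \in Q \mid \exists \sigma \in \Sigma,\ \delta(p, \sigma) = q\}$
- $\mathrm{Succ}^*(p) = \{q \in Q \mid \exists w \in \Sigma^*,\ \delta(p, w) = q\}$
- $\mathrm{Pred}(p) = \{q \in Q \mid \exists \sigma \in \Sigma,\ \delta(q, \sigma) = p\}$
- $\mathrm{Pred}^*(p) = \{q \in Q \mid \exists w \in \Sigma^*,\ \delta(q, w) = p\}$
- Height
- $H(q_f) = 0$
- $H(p) = \max_{q \in \mathrm{Succ}(p)} H(q) + 1$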
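As a minimal sketch, direct successors, direct predecessors and the height can be computed from a transition table as follows. The table `DELTA`, the state names and the memoised recursion are illustrative assumptions; the sketch also assumes that every state reaches the final state.

```python
from functools import lru_cache

# Hypothetical toy automaton: (state, symbol) -> state, with q_f = "qf".
DELTA = {("q0", "a"): "q1", ("q0", "b"): "q2", ("q2", "c"): "q1", ("q1", "d"): "qf"}
Q_F = "qf"

def succ(p):
    """Direct successors of p."""
    return {q for (s, _), q in DELTA.items() if s == p}

def pred(p):
    """Direct predecessors of p."""
    return {s for (s, _), q in DELTA.items() if q == p}

@lru_cache(maxsize=None)
def height(p):
    """H(q_f) = 0; otherwise 1 + the maximum height among the direct successors."""
    return 0 if p == Q_F else 1 + max(height(q) for q in succ(p))

print(sorted(succ("q0")), sorted(pred("q1")), height("q0"))
# ['q1', 'q2'] ['q0', 'q2'] 3
```

The height of a state is therefore the length of its longest path to the final state; memoisation keeps the recursion linear in the number of transitions visited.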
7Automaton
An automaton that recognizes the inflections of nine verbs
8Source (E) and Initial State (p)
- Let $E \neq \emptyset$
- $AP(E) = \{w \mid w \text{ is a path from } q_i \text{ to } p,\ p \in E\}$
- $AN(E) = \{p \in Q \mid \forall w \in AP(E),\ p \in w\}$
- $\mathrm{Source}(E) \in AN(E)$
- $H(\mathrm{Source}(E)) = \min_{q \in AN(E)} H(q)$
- Let $p \in Q$, $p \neq q_i$
- $IS(p) = \mathrm{Source}(\mathrm{Pred}(p))$
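The definitions of Source and IS can be illustrated with a brute-force sketch that enumerates every path from the initial state. The enumeration is exponential in the worst case and is only meant to mirror the definitions; the toy automaton and all function names are assumptions, and E is assumed to be non-empty and reachable.

```python
from functools import lru_cache

# Hypothetical toy automaton (a diamond followed by a tail), q_i = "q0", q_f = "qf".
DELTA = {("q0", "a"): "q1", ("q0", "b"): "q2", ("q1", "c"): "q3",
         ("q2", "d"): "q3", ("q3", "e"): "qf"}
Q_I, Q_F = "q0", "qf"

def succ(p):
    return {q for (s, _), q in DELTA.items() if s == p}

def pred(p):
    return {s for (s, _), q in DELTA.items() if q == p}

@lru_cache(maxsize=None)
def height(p):
    return 0 if p == Q_F else 1 + max(height(q) for q in succ(p))

def paths_to(E):
    """Yield the set of states visited by each path from q_i to a state of E."""
    stack = [(Q_I, (Q_I,))]
    while stack:
        state, path = stack.pop()
        if state in E:
            yield set(path)
        for q in succ(state):
            stack.append((q, path + (q,)))

def source(E):
    """State of AN(E) (the states shared by all those paths) with minimal height."""
    an = set.intersection(*paths_to(E))
    return min(an, key=height)

def initial_state(p):
    """IS(p) = Source(Pred(p)), defined for p != q_i."""
    return source(pred(p))

print(source({"q1", "q2"}), initial_state("q3"), initial_state("qf"))
# q0 q0 q3
```

In this toy automaton the two branches of the diamond only share q0, so IS(q3) is q0, whereas every path to q3 already passes through q3, so IS(qf) is q3.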
13Source (E) and Initial State (p)
- $\mathrm{Source}(\{q_2, q_3, q_5\}) = \mathrm{Source}(\{q_3, q_4\}) = q_2$
- $\mathrm{Source}(\{q_3, q_4, q_5\}) = \mathrm{Source}(\{q_3, q_4, q_5, q_6\}) = q_1$
- $IS(q_3) = q_2$
- $IS(q_5) = q_1$
- $IS(q_6) = q_1$
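14Sink (E) and Final State (p)
- Let $E \neq \emptyset$
- $PP(E) = \{w \mid w \text{ is a path from } p \text{ to } q_f,\ p \in E\}$
- $PN(E) = \{p \in Q \mid \forall w \in PP(E),\ p \in w\}$
- $\mathrm{Sink}(E) \in PN(E)$
- $H(\mathrm{Sink}(E)) = \max_{q \in PN(E)} H(q)$
- Let $p \in Q$, $p \neq q_f$
- $FS(p) = \mathrm{Sink}(\mathrm{Succ}(p))$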
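15Subautomaton (SA)
- $A' = \langle \Sigma', Q', \delta', s_i, s_f \rangle$ is a subautomaton of $A$ iff
- $Q' \subseteq Q$
- $s_i, s_f \in Q'$
- $\delta' : Q' \times \Sigma' \to Q'$
- $\forall (q, \sigma) \in Q' \times \Sigma',\ \delta'(q, \sigma) = \delta(q, \sigma)$
- $\forall q \in Q',\ q \in \mathrm{Succ}^*(s_i)$ and $q \in \mathrm{Pred}^*(s_f)$
- $\forall q \in Q' \setminus \{s_i, s_f\},\ \mathrm{Succ}(q) \subseteq Q'$ and $\mathrm{Pred}(q) \subseteq Q'$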
16Subautomaton (SA)
SA
An automaton that recognizes the inflections of nine verbs
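20Closed subautomaton (CSA)
- Let $Q' \subseteq Q$ and let $s_i, s_f$ be two distinct states
- A subautomaton $A' = \langle \Sigma', Q', \delta', s_i, s_f \rangle$ is a closed subautomaton iff
- $\forall q \in Q' \setminus \{s_i\},\ \mathrm{Pred}(q) \subseteq Q'$
- $\forall q \in Q' \setminus \{s_f\},\ \mathrm{Succ}(q) \subseteq Q'$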
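The two closure conditions translate directly into set inclusions. The sketch below checks only those two conditions and assumes that the given state set already forms a subautomaton; the toy automaton and the name `is_closed` are hypothetical.

```python
# Hypothetical toy automaton: (state, symbol) -> state.
DELTA = {("q0", "a"): "q1", ("q0", "b"): "q2", ("q1", "c"): "q3",
         ("q2", "d"): "q3", ("q3", "e"): "qf"}

def succ(p):
    return {q for (s, _), q in DELTA.items() if s == p}

def pred(p):
    return {s for (s, _), q in DELTA.items() if q == p}

def is_closed(states, s_i, s_f):
    """Closure test: predecessors stay inside except for s_i, successors except for s_f."""
    preds_inside = all(pred(q) <= states for q in states - {s_i})
    succs_inside = all(succ(q) <= states for q in states - {s_f})
    return preds_inside and succs_inside

print(is_closed({"q0", "q1", "q2", "q3"}, "q0", "q3"))  # True
print(is_closed({"q0", "q1", "q3"}, "q0", "q3"))        # False
```

The second call fails because q3 has a predecessor, q2, that lies outside the candidate state set.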
21Closed subautomaton (CSA)
?CSA
An automaton that recognizes the inflections of nine verbs
22Closed subautomaton (CSA)
CSA
An automaton that recognizes the inflections of nine verbs
24Smallest Closed subautomaton (SCSA)
- Let $Q' \subseteq Q$ and let $s_i, s_f$ be two distinct states
- A closed subautomaton $A' = \langle \Sigma', Q', \delta', s_i, s_f \rangle$ is a smallest closed subautomaton iff, for all $q \in Q'$:
- $(s_i, q)$ is a CSA $\Rightarrow q = s_f$
- $(q, s_f)$ is a CSA $\Rightarrow q = s_i$
25Smallest Closed subautomaton (SCSA)
?SCSA
SCSA
An automaton that recognizes the inflections of nine verbs
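26Smallest subautomaton (SSA)
- Let $p \in Q \setminus \{s_i, s_f\}$
- The subautomaton $A' = \langle \Sigma', Q', \delta', s_i, s_f \rangle$ is $SSA(p)$ iff
- $A'$ strictly contains $p$
- $\forall A'' = \langle \Sigma'', Q'', \delta'', s_i'', s_f'' \rangle$ which strictly contains $p$: $Q' \subseteq Q''$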
27Smallest subautomaton (SSA)
SSA(6)
SSA(18)
An automaton that recognizes the inflections of nine verbs
29Search for SCSA
- Property 1.
- $(s_i, s_f)$ is an SCSA iff $IS(s_f) = s_i$ and $FS(s_i) = s_f$
- Property 2. (Associativity)
- If $E = E_1 \cup E_2$ and $E_1 \neq \emptyset$, $E_2 \neq \emptyset$ then
- $\mathrm{Source}(E) = \mathrm{Source}(\{\mathrm{Source}(E_1), \mathrm{Source}(E_2)\})$
- Property 3. (Hierarchy between two SCSA)
- Either they have no common transitions,
- or one is strictly included in the other.
30Search for SCSA
- Let $p \in Q$
- p.IS: initial state associated to p.
- p.FSmin: minimal final state associated to p, assuming that p is the initial state of an SCSA.
- p.FSmax: maximal final state associated to p, assuming that p is the initial state of an SCSA.
- Property 4.
- $\forall p > q_i$, (p.IS, p) is an SCSA iff p.IS.FSmin $\le p \le$ p.IS.FSmax
- Complexity of the algorithm: $O(n^2)$
33Search for SSA
- Let $A' = \langle \Sigma', Q', \delta', s_i, s_f \rangle$ be a subautomaton
- Property 5.
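- $\forall E \subseteq Q' \setminus \{s_f\}$: $\mathrm{Succ}(s_i) \cap \mathrm{Pred}(E) \subseteq Q'$
- $\forall E \subseteq Q' \setminus \{s_i\}$: $\mathrm{Pred}(s_f) \cap \mathrm{Succ}(E) \subseteq Q'$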
34 SSA associated to grey states
39Search for SSA
- Property 6.
- Let $p, p', q, q' \in Q$
- $p, p' \in \mathrm{Pred}(q)$ and $q, q' \in \mathrm{Succ}(p)$
- $H(p) = H(p')$ and $H(q) = H(q')$
- Then p and q belong to the same SSA
40All Subautomata of an automaton
- Algorithm. Input: A. Output: subautomata.
- repeat
-   repeat
-     Detect, store and replace each parallel by one transition
-     Detect, store and replace each sequence by one transition
-   until the automaton is freed from all its parallels and sequences
-   Detect, store and replace each smallest subautomaton by one transition
- until the automaton A is reduced to one single transition
Valdes J., Tarjan R. E., Lawler E. L., The recognition of series parallel digraphs, SIAM J. Comput. 11(2):298-313, 1982.
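The inner loop of the algorithm above (parallel and sequence reductions) can be sketched on a list of labelled edges. This is a simplified illustration, not the author's implementation: it omits the smallest-subautomaton step, and the edge representation, the label syntax `(a|b)` and the function name are assumptions.

```python
from collections import defaultdict

def reduce_series_parallel(edges, q_i, q_f):
    """Repeatedly merge parallel edges and two-edge sequences; return (edges, found)."""
    edges, found, changed = list(edges), [], True
    while changed:
        changed = False
        # Parallel step: several edges sharing source and target become one edge.
        groups = defaultdict(list)
        for s, lab, d in edges:
            groups[(s, d)].append(lab)
        if any(len(labs) > 1 for labs in groups.values()):
            edges = []
            for (s, d), labs in groups.items():
                lab = labs[0]
                if len(labs) > 1:
                    lab = "(" + "|".join(sorted(labs)) + ")"
                    found.append(("parallel", lab))
                    changed = True
                edges.append((s, lab, d))
        # Sequence step: an inner state with one incoming and one outgoing edge is bypassed.
        states = {s for s, _, _ in edges} | {d for _, _, d in edges}
        for q in sorted(states - {q_i, q_f}):
            ins = [e for e in edges if e[2] == q]
            outs = [e for e in edges if e[0] == q]
            if len(ins) == 1 and len(outs) == 1:
                (s, l1, _), (_, l2, d) = ins[0], outs[0]
                edges = [e for e in edges if e not in (ins[0], outs[0])]
                edges.append((s, l1 + l2, d))
                found.append(("sequence", l1 + l2))
                changed = True
                break
    return edges, found

# Hypothetical toy automaton given as (source, label, target) edges.
toy = [("q0", "a", "q1"), ("q0", "b", "q1"), ("q1", "c", "q2"),
       ("q2", "d", "qf"), ("q1", "e", "qf")]
reduced, found = reduce_series_parallel(toy, "q0", "qf")
print(reduced)  # [('q0', '(a|b)(cd|e)', 'qf')]
```

On this toy input the reductions alone reach a single transition; an automaton that is not series-parallel would stop earlier, which is where the smallest-subautomaton step of the algorithm takes over.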
48Dictionaries and automata
- 10 dictionaries, words in lexicographic order
- 6 Delaf: French, English, Serbian, German, polylexical English, French cities.
- 4 Web: French, Hungarian, Bulgarian and Portuguese.
- Properties of automata
- Finite set of states, acyclic, deterministic, unique initial state, unique final state, minimal.
49Internal structure of automata
51Experimental Results
53Factorisation, indexation and compression
- The search for subautomata detects sequences and parallels
- Sequence subautomaton
- Parallel subautomaton
- Proposal
- Apply the directed acyclic word graph (DAWG), initially dedicated to indexing text, to index the subautomata,
- a heuristic to select the most interesting substructures to factorize.
54Storage of an automaton
55Factorization
56Factorisation
57How can we choose the subautomata to factorize?
- The best candidates to be factorized are those which increase memory storage efficiency and reduce the size of the initial automaton
- Profit = saved memory - consumed memory
- Memory is saved by eliminating all occurrences of the substructure
- Memory is consumed by the extension of the alphabet and by the index.
58Directed Acyclic word graph (DAWG)
Computation of the frequency and profit associated with each sequence, using a DAWG
59Greedy Algorithm of Compression
- Algorithm. Input: A. Output: A, Alphabet.
- Iterative process:
-   Select the best sequence s from the DAWG
-   Extend the alphabet to represent s
-   Delete s from A and from the DAWG
-   Update the DAWG
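The greedy loop can be illustrated on plain strings standing in for the label sequences of the automaton. Several simplifications are assumed here and are not taken from the slides: a brute-force substring enumeration replaces the DAWG, the profit model (one transition saved per removed symbol, one index entry of length len(s) + 1 consumed) is hypothetical, and new alphabet symbols are upper-case letters that must not occur in the input.

```python
def profit(seq, words):
    """Hypothetical profit: transitions saved by all occurrences minus the index entry cost."""
    freq = sum(w.count(seq) for w in words)   # non-overlapping occurrences
    saved = freq * (len(seq) - 1)             # each occurrence shrinks to one symbol
    consumed = len(seq) + 1                   # stored sequence plus the new symbol
    return saved - consumed

def greedy_compress(words, new_symbols="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    """Repeatedly factor out the most profitable sequence while the profit is positive."""
    words, index = list(words), {}
    for sym in new_symbols:
        # Recomputing the candidates plays the role of updating the DAWG.
        candidates = {w[i:j] for w in words
                      for i in range(len(w)) for j in range(i + 2, len(w) + 1)}
        if not candidates:
            break
        best = max(sorted(candidates), key=lambda s: profit(s, words))
        if profit(best, words) <= 0:
            break
        index[sym] = best                                  # extend the alphabet
        words = [w.replace(best, sym) for w in words]      # delete s everywhere
    return words, index

compressed, index = greedy_compress(["walking", "talking", "baking"])
print(compressed, index)  # ['walA', 'talA', 'baA'] {'A': 'king'}
```

With this cost model the loop stops after one round, because no remaining sequence saves more than its index entry costs; a different cost model would select different sequences.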
60Compression FCM
61Compression FCNM
62Compression FCDic
63Best Compressions
66Conclusion
- Search for two kinds of smallest subautomata
- Statistical analysis of the internal structure of some automata associated with dictionaries
- Compression method based on the factorization of sequence or parallel subautomata
- A minimised automaton does not always lead to the best compression.
67Future work
- Factorization of more kinds of subautomata,
- Find a way to de-minimise an automaton in order to get better compression,
- Work on alternative encodings of automata, for example a depth-first encoding